## Case study - Uber Data Analysis


#### Dataset - 
Two datasets are used here. They contain the following fields:

**Uberdrive.csv**
- Trip_Id - Id for the trip
- Start Date - the date and time of the start of the trip
- End Date - the date and time of the end of the trip
- Start Location - starting location of the trip
- End Location - location where the trip ended
- Purpose of drive - Purpose of the trip (Business, Personal, Meals, Errands, Meetings, Customer Support etc.)


**Uberdrive_Miles.csv**
- Trip_Id - Id for the trip
- Miles Driven  - Total miles driven between the start and the end of the trip

----------------------
#### Concepts to cover
----------------------
- 1. <a href = #link1>Overview of the data at hand</a>
- 2. <a href = #link3>Filtering Data</a> 
- 3. <a href = #link2>Data profiling and the functions offered by pandas for understanding the data</a>
- 4. <a href = #link4>DateTime operations</a> 




In [1]:
# Import the libraries 
import numpy as np
import pandas as pd
from datetime import datetime

### <a id = "link1"></a>Overview of the data

In [2]:
# Read the Data uberdrive into a df

In [3]:
df = pd.read_csv('uberdrive.csv')

In [2]:
# View first 3 rows of data 

In [4]:
df.head(3) 

Unnamed: 0,Trip_Id,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,PURPOSE*
0,1,01-01-2016 21:11,01-01-2016 21:17,Business,Fort Pierce,Fort Pierce,Meal/Entertain
1,2,01-02-2016 01:25,01-02-2016 01:37,Business,Fort Pierce,Fort Pierce,
2,3,01-02-2016 20:25,01-02-2016 20:38,Business,Fort Pierce,Fort Pierce,Errand/Supplies


In [5]:
# Read the Data UberDrive_Miles into another df
df_miles = pd.read_csv('UberDrive_Miles.csv')

In [3]:
# View first 3 rows of data 

In [6]:
df_miles.head(3)

Unnamed: 0,Trip_Id,MILES*
0,1,5.1
1,2,5.0
2,3,4.8


In [4]:
# understand shape and size of data from Uberdrive

In [8]:
print(df.shape)
print(df.size)

(1155, 7)
8085


In [9]:
# check info about data (includes column names, the number of non-null values in it, and data-type for each column.)

In [10]:
df.info()
print()
df_miles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1155 entries, 0 to 1154
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Trip_Id      1155 non-null   int64 
 1   START_DATE*  1155 non-null   object
 2   END_DATE*    1155 non-null   object
 3   CATEGORY*    1155 non-null   object
 4   START*       1155 non-null   object
 5   STOP*        1155 non-null   object
 6   PURPOSE*     653 non-null    object
dtypes: int64(1), object(6)
memory usage: 63.3+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1155 entries, 0 to 1154
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Trip_Id  1155 non-null   int64  
 1   MILES*   1155 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 18.2 KB


### Renaming columns

In [6]:
# Remove the * character from all the column names

In [11]:
# Approach 1
# Remove the * character from all the column names.
# regex=False treats "*" as a literal character rather than a regex quantifier.
df.columns = df.columns.str.replace("*", "", regex=False)

# Approach 2
# You can also rename specific columns by name
df_miles.rename(columns = {'MILES*':'MILES'}, inplace=True)
print(df.columns,"\n", df_miles.columns)

Index(['Trip_Id', 'START_DATE', 'END_DATE', 'CATEGORY', 'START', 'STOP',
       'PURPOSE'],
      dtype='object') 
 Index(['Trip_Id', 'MILES'], dtype='object')
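
The renaming step can be tried in isolation on a tiny frame. This is a minimal sketch, assuming a hypothetical two-row `sample` frame that only mimics the starred column names; it is not the real dataset.

```python
import pandas as pd

# Hypothetical frame that mimics the starred column names in the CSV files.
sample = pd.DataFrame(
    {"Trip_Id": [1, 2], "START*": ["Cary", "Apex"], "MILES*": [5.1, 4.8]}
)

# regex=False makes "*" a literal character instead of a regex quantifier.
sample.columns = sample.columns.str.replace("*", "", regex=False)

# A plain-Python alternative: strip a trailing "*" from each name.
cleaned = [name.rstrip("*") for name in ["Trip_Id", "START*", "MILES*"]]

print(list(sample.columns))
print(cleaned)
```

Passing `regex=False` explicitly avoids depending on the default, which has differed between pandas versions.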




### <a id = "link3"></a>Filtering dataframes
#### Using null values

In [7]:
# show the first 5 rows where PURPOSE is null

In [12]:
df[df.PURPOSE.isnull()].head(5)

Unnamed: 0,Trip_Id,START_DATE,END_DATE,CATEGORY,START,STOP,PURPOSE
1,2,01-02-2016 01:25,01-02-2016 01:37,Business,Fort Pierce,Fort Pierce,
32,33,1/19/2016 9:09,1/19/2016 9:23,Business,Whitebridge,Lake Wellingborough,
85,86,02-09-2016 10:54,02-09-2016 11:07,Personal,Whitebridge,Northwoods,
86,87,02-09-2016 11:43,02-09-2016 11:50,Personal,Northwoods,Tanglewood,
87,88,02-09-2016 13:36,02-09-2016 13:52,Personal,Tanglewood,Preston,
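
Beyond viewing the rows, it is often useful to count the missing values or to drop them. The following is a small sketch on a hypothetical four-row `sample` frame (the values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of the four trips have no recorded purpose.
sample = pd.DataFrame(
    {
        "Trip_Id": [1, 2, 3, 4],
        "PURPOSE": ["Meeting", np.nan, "Errand/Supplies", np.nan],
    }
)

missing_mask = sample["PURPOSE"].isnull()    # boolean Series, True where null
n_missing = int(missing_mask.sum())          # True counts as 1
missing_rows = sample[missing_mask]          # same filter as in the cell above
with_purpose = sample.dropna(subset=["PURPOSE"])  # keep only labelled trips

print(n_missing, len(with_purpose))
```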



#### Filtering out records based on conditions

In [8]:
# check how many trips were longer than 30 miles

In [14]:
df_miles[df_miles['MILES'] > 30]['Trip_Id'].count()

52
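
The same count can be obtained by summing the boolean mask, and several conditions can be combined with `&`. This is a sketch on a hypothetical `miles` frame, not the real data.

```python
import pandas as pd

# Hypothetical mileage data for four trips.
miles = pd.DataFrame({"Trip_Id": [1, 2, 3, 4], "MILES": [5.1, 63.7, 30.0, 41.2]})

# Summing a boolean mask counts the rows that satisfy the condition.
long_trip = miles["MILES"] > 30
n_long = int(long_trip.sum())

# Combine conditions with & (and) or | (or); each condition needs parentheses.
mid_range = miles[(miles["MILES"] > 30) & (miles["MILES"] < 50)]

print(n_long)
print(mid_range)
```

Note that a trip of exactly 30.0 miles is excluded by the strict `>` comparison.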

**Name and Number of all unique start and stop points**

In [10]:
# Get the names of the unique starting points

In [15]:
print(df['START'].unique())

['Fort Pierce' 'West Palm Beach' 'Cary' 'Jamaica' 'New York' 'Elmhurst'
 'Midtown' 'East Harlem' 'Flatiron District' 'Midtown East'
 'Hudson Square' 'Lower Manhattan' "Hell's Kitchen" 'Downtown' 'Gulfton'
 'Houston' 'Eagan Park' 'Morrisville' 'Durham' 'Farmington Woods'
 'Whitebridge' 'Lake Wellingborough' 'Fayetteville Street' 'Raleigh'
 'Hazelwood' 'Fairmont' 'Meredith Townes' 'Apex' 'Chapel Hill'
 'Northwoods' 'Edgehill Farms' 'Tanglewood' 'Preston' 'Eastgate'
 'East Elmhurst' 'Jackson Heights' 'Long Island City' 'Katunayaka'
 'Unknown Location' 'Colombo' 'Nugegoda' 'Islamabad' 'R?walpindi'
 'Noorpur Shahan' 'Heritage Pines' 'Westpark Place' 'Waverly Place'
 'Wayne Ridge' 'Weston' 'East Austin' 'West University' 'South Congress'
 'The Drag' 'Congress Ave District' 'Red River District' 'Georgian Acres'
 'North Austin' 'Coxville' 'Convention Center District' 'Austin' 'Katy'
 'Sharpstown' 'Sugar Land' 'Galveston' 'Port Bolivar' 'Washington Avenue'
 'Briar Meadow' 'Latta' 'Jacksonville'

In [11]:
# Get the number of unique starting points

In [16]:
print(df['START'].nunique())     

177


In [12]:
# Get the names of the unique stopping points

In [17]:
print(df['STOP'].unique())

['Fort Pierce' 'West Palm Beach' 'Palm Beach' 'Cary' 'Morrisville'
 'New York' 'Queens' 'East Harlem' 'NoMad' 'Midtown' 'Midtown East'
 'Hudson Square' 'Lower Manhattan' "Hell's Kitchen" 'Queens County'
 'Gulfton' 'Downtown' 'Houston' 'Jamestown Court' 'Durham' 'Whitebridge'
 'Lake Wellingborough' 'Raleigh' 'Umstead' 'Hazelwood' 'Westpark Place'
 'Meredith Townes' 'Leesville Hollow' 'Apex' 'Chapel Hill'
 'Williamsburg Manor' 'Macgregor Downs' 'Edgehill Farms' 'Northwoods'
 'Tanglewood' 'Preston' 'Walnut Terrace' 'Jackson Heights' 'East Elmhurst'
 'Midtown West' 'Long Island City' 'Jamaica' 'Unknown Location' 'Colombo'
 'Nugegoda' 'Katunayaka' 'Islamabad' 'R?walpindi' 'Noorpur Shahan'
 'Heritage Pines' 'Waverly Place' 'Wayne Ridge' 'Depot Historic District'
 'Weston' 'West University' 'South Congress' 'Arts District'
 'Congress Ave District' 'Red River District' 'The Drag'
 'Convention Center District' 'North Austin' 'Coxville' 'Katy' 'Alief'
 'Sharpstown' 'Sugar Land' 'Galveston' 'Port

In [13]:
# Get the number of unique stopping points

In [18]:
print(len(df['STOP'].unique())) 

188
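
The two counting idioms used above, `nunique()` and `len(unique())`, agree here because `START` and `STOP` have no missing values. They can differ when nulls are present, as this sketch on a hypothetical `stops` Series illustrates.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one repeated value and one missing value.
stops = pd.Series(["Cary", "Apex", "Cary", np.nan])

n_without_nan = stops.nunique()             # NaN is ignored by default
n_with_nan = stops.nunique(dropna=False)    # NaN counted as its own value
n_via_unique = len(stops.unique())          # unique() keeps NaN in its result

print(n_without_nan, n_with_nan, n_via_unique)
```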


**Identify popular start points**

In [14]:
#get top 10 most frequent Start points
#hint: use value_counts()

In [19]:
df['START'].value_counts().head(10)

Cary                201
Unknown Location    148
Morrisville          85
Whitebridge          68
Islamabad            57
Durham               37
Lahore               36
Raleigh              28
Kar?chi              27
Westpark Place       17
Name: START, dtype: int64

**Identify popular stop destinations**

In [15]:
#get top 10 most frequent Stop points
#hint: use value_counts()

In [20]:
df['STOP'].value_counts().head(10)

Cary                203
Unknown Location    149
Morrisville          84
Whitebridge          65
Islamabad            58
Lahore               36
Durham               36
Raleigh              29
Kar?chi              26
Apex                 17
Name: STOP, dtype: int64

**Are there cases where the start and the stop location are the same? If so, how many?**

In [22]:
df[df['START'] == df['STOP']]['Trip_Id'].count()

288

**Starting point from which the most miles have been driven**

**In order to use the miles feature, you need to merge the two dataframes so that all the information is in one dataframe.**
- using merge


In [23]:
df = pd.merge(df, df_miles, on = 'Trip_Id', how = 'left')
df.head(5)

Unnamed: 0,Trip_Id,START_DATE,END_DATE,CATEGORY,START,STOP,PURPOSE,MILES
0,1,01-01-2016 21:11,01-01-2016 21:17,Business,Fort Pierce,Fort Pierce,Meal/Entertain,5.1
1,2,01-02-2016 01:25,01-02-2016 01:37,Business,Fort Pierce,Fort Pierce,,5.0
2,3,01-02-2016 20:25,01-02-2016 20:38,Business,Fort Pierce,Fort Pierce,Errand/Supplies,4.8
3,4,01-05-2016 17:31,01-05-2016 17:45,Business,Fort Pierce,Fort Pierce,Meeting,4.7
4,5,01-06-2016 14:42,01-06-2016 15:49,Business,Fort Pierce,West Palm Beach,Customer Visit,63.7
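
When merging, it can help to check that the key is unique on both sides and that every trip found a match. This is a hedged sketch on two hypothetical frames, `trips` and `miles`; it uses the `validate` and `indicator` parameters of `pd.merge`.

```python
import pandas as pd

# Hypothetical inputs: trip 3 has no row in the mileage table.
trips = pd.DataFrame({"Trip_Id": [1, 2, 3], "START": ["Fort Pierce", "Cary", "Apex"]})
miles = pd.DataFrame({"Trip_Id": [1, 2], "MILES": [5.1, 5.0]})

merged = pd.merge(
    trips,
    miles,
    on="Trip_Id",
    how="left",               # keep every trip, matched or not
    validate="one_to_one",    # raise MergeError if Trip_Id repeats on either side
    indicator=True,           # add a _merge column: "both" or "left_only"
)

# Trips that found no mileage record end up with NaN in MILES.
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
```

A left join keeps the row count of the left frame when the key is unique on the right, which is why the merged frame above still has 1155 rows.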


**Use groupby function to find the starting point from which the most miles have been driven**

In [24]:
df.groupby('START')['MILES'].sum().sort_values(ascending = False ).head(10)

START
Unknown Location    1976.5
Cary                1791.3
Morrisville          671.7
Raleigh              433.0
Islamabad            401.2
Durham               384.4
Jacksonville         375.2
Latta                310.3
Asheville            287.7
Whitebridge          273.4
Name: MILES, dtype: float64

**Find the top 10 start-stop pairs with the most total miles driven between them.**

In [25]:
df2 = df[df['START'] != 'Unknown Location']             # New dataframe without "Unknown Location" as the starting point
df2 = df2[df2['STOP'] != 'Unknown Location']            # Also drop rows with "Unknown Location" as the stopping point

In [26]:
# Creating a dataframe with the top 10 most miles covered between a start stop pair
k3 = df2.groupby(['START','STOP'])['MILES'].sum().sort_values(ascending=False).head(10) 
k3= k3.reset_index() # flatten the dataframe 
k3['Start-Stop'] = k3['START'] + ' - ' + k3['STOP']
k3

Unnamed: 0,START,STOP,MILES,Start-Stop
0,Morrisville,Cary,395.7,Morrisville - Cary
1,Cary,Durham,390.0,Cary - Durham
2,Cary,Morrisville,380.0,Cary - Morrisville
3,Raleigh,Cary,365.7,Raleigh - Cary
4,Cary,Raleigh,336.5,Cary - Raleigh
5,Durham,Cary,334.4,Durham - Cary
6,Latta,Jacksonville,310.3,Latta - Jacksonville
7,Cary,Cary,255.9,Cary - Cary
8,Jacksonville,Kissimmee,201.0,Jacksonville - Kissimmee
9,Asheville,Mebane,195.9,Asheville - Mebane
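
The two filtering steps and the `reset_index()` call can also be written more compactly. This is a sketch of that alternative on a hypothetical five-row `trips` frame; the numbers are invented.

```python
import pandas as pd

# Hypothetical trips, two of which involve an unknown location.
trips = pd.DataFrame(
    {
        "START": ["Cary", "Cary", "Unknown Location", "Durham", "Cary"],
        "STOP": ["Durham", "Durham", "Cary", "Unknown Location", "Apex"],
        "MILES": [10.0, 12.5, 7.0, 3.0, 4.0],
    }
)

# Apply both conditions in a single boolean mask.
known = trips[
    (trips["START"] != "Unknown Location") & (trips["STOP"] != "Unknown Location")
]

# as_index=False returns START and STOP as ordinary columns,
# so no reset_index() is needed afterwards.
pairs = (
    known.groupby(["START", "STOP"], as_index=False)["MILES"]
    .sum()
    .sort_values("MILES", ascending=False)
)
pairs["Start-Stop"] = pairs["START"] + " - " + pairs["STOP"]

print(pairs)
```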


**Working with dates.**

In [None]:
#check column type for uberdrive df

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1155 entries, 0 to 1154
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Trip_Id     1155 non-null   int64  
 1   START_DATE  1155 non-null   object 
 2   END_DATE    1155 non-null   object 
 3   CATEGORY    1155 non-null   object 
 4   START       1155 non-null   object 
 5   STOP        1155 non-null   object 
 6   PURPOSE     653 non-null    object 
 7   MILES       1155 non-null   float64
dtypes: float64(1), int64(1), object(6)
memory usage: 81.2+ KB


In [16]:
# create columns by converting the start and end dates into a datetime format
# you can also overwrite the same columns - but for the sake of understanding the difference in formats, we create new columns

In [28]:
df['start_dt'] = pd.to_datetime(df['START_DATE'])
df['end_dt'] = pd.to_datetime(df['END_DATE'])
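
The raw strings come in two layouts (`01-02-2016 01:25` and `1/19/2016 9:09`), and recent pandas versions may refuse to infer a single format for such a column. One possible workaround, sketched on a hypothetical two-element `raw` Series and assuming every value is month-first as the parsed output in this notebook suggests, is to normalise the separator and pass an explicit format.

```python
import pandas as pd

# Hypothetical sample with both date layouts seen in the data.
raw = pd.Series(["01-02-2016 01:25", "1/19/2016 9:09"])

# Make the separator consistent, then parse with one explicit format.
# Assumption: dates are month-first (MM/DD/YYYY).
normalised = raw.str.replace("-", "/", regex=False)
parsed = pd.to_datetime(normalised, format="%m/%d/%Y %H:%M")

print(parsed)
```

An explicit format also removes any ambiguity about whether `01-02-2016` means 2 January or 1 February.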

In [17]:
#print first 5 rows of data.

In [29]:
df.head()

Unnamed: 0,Trip_Id,START_DATE,END_DATE,CATEGORY,START,STOP,PURPOSE,MILES,start_dt,end_dt
0,1,01-01-2016 21:11,01-01-2016 21:17,Business,Fort Pierce,Fort Pierce,Meal/Entertain,5.1,2016-01-01 21:11:00,2016-01-01 21:17:00
1,2,01-02-2016 01:25,01-02-2016 01:37,Business,Fort Pierce,Fort Pierce,,5.0,2016-01-02 01:25:00,2016-01-02 01:37:00
2,3,01-02-2016 20:25,01-02-2016 20:38,Business,Fort Pierce,Fort Pierce,Errand/Supplies,4.8,2016-01-02 20:25:00,2016-01-02 20:38:00
3,4,01-05-2016 17:31,01-05-2016 17:45,Business,Fort Pierce,Fort Pierce,Meeting,4.7,2016-01-05 17:31:00,2016-01-05 17:45:00
4,5,01-06-2016 14:42,01-06-2016 15:49,Business,Fort Pierce,West Palm Beach,Customer Visit,63.7,2016-01-06 14:42:00,2016-01-06 15:49:00


In [18]:
# see how the dtype is different, print column types

In [31]:
df.dtypes

Trip_Id                int64
START_DATE            object
END_DATE              object
CATEGORY              object
START                 object
STOP                  object
PURPOSE               object
MILES                float64
start_dt      datetime64[ns]
end_dt        datetime64[ns]
dtype: object

In [None]:
# Create more columns using the built-in datetime properties of the .dt accessor: day, hour, month, dayofweek

In [32]:
df['start_day'] = df['start_dt'].dt.day
df['start_hour'] = df['start_dt'].dt.hour
df['start_month'] = df['start_dt'].dt.month
df['d_of_wk'] = df['start_dt'].dt.dayofweek   # Days encoded as 0-6 (Mon = 0, Tue = 1, ..., Sun = 6)

In [None]:
#create a new column that has week day by name and month by name

In [33]:
df.head().T # use .T (transpose) to view the newly added columns more easily

Unnamed: 0,0,1,2,3,4
Trip_Id,1,2,3,4,5
START_DATE,01-01-2016 21:11,01-02-2016 01:25,01-02-2016 20:25,01-05-2016 17:31,01-06-2016 14:42
END_DATE,01-01-2016 21:17,01-02-2016 01:37,01-02-2016 20:38,01-05-2016 17:45,01-06-2016 15:49
CATEGORY,Business,Business,Business,Business,Business
START,Fort Pierce,Fort Pierce,Fort Pierce,Fort Pierce,Fort Pierce
STOP,Fort Pierce,Fort Pierce,Fort Pierce,Fort Pierce,West Palm Beach
PURPOSE,Meal/Entertain,,Errand/Supplies,Meeting,Customer Visit
MILES,5.1,5.0,4.8,4.7,63.7
start_dt,2016-01-01 21:11:00,2016-01-02 01:25:00,2016-01-02 20:25:00,2016-01-05 17:31:00,2016-01-06 14:42:00
end_dt,2016-01-01 21:17:00,2016-01-02 01:37:00,2016-01-02 20:38:00,2016-01-05 17:45:00,2016-01-06 15:49:00


In [34]:
df['weekday'] = df['start_dt'].apply(lambda x : datetime.strftime(x,'%a'))  # ( or directly convert into the short form)

In [35]:
df['cal_month'] =  df['start_dt'].apply(lambda x : datetime.strftime(x,'%b'))

In [36]:
df.head().T

Unnamed: 0,0,1,2,3,4
Trip_Id,1,2,3,4,5
START_DATE,01-01-2016 21:11,01-02-2016 01:25,01-02-2016 20:25,01-05-2016 17:31,01-06-2016 14:42
END_DATE,01-01-2016 21:17,01-02-2016 01:37,01-02-2016 20:38,01-05-2016 17:45,01-06-2016 15:49
CATEGORY,Business,Business,Business,Business,Business
START,Fort Pierce,Fort Pierce,Fort Pierce,Fort Pierce,Fort Pierce
STOP,Fort Pierce,Fort Pierce,Fort Pierce,Fort Pierce,West Palm Beach
PURPOSE,Meal/Entertain,,Errand/Supplies,Meeting,Customer Visit
MILES,5.1,5.0,4.8,4.7,63.7
start_dt,2016-01-01 21:11:00,2016-01-02 01:25:00,2016-01-02 20:25:00,2016-01-05 17:31:00,2016-01-06 14:42:00
end_dt,2016-01-01 21:17:00,2016-01-02 01:37:00,2016-01-02 20:38:00,2016-01-05 17:45:00,2016-01-06 15:49:00
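
The weekday and month names can also be produced without `apply`, using the vectorised `.dt` accessor. This is a sketch on a hypothetical two-element `ts` Series; the abbreviated names from `strftime` depend on the system locale, and English is assumed here.

```python
import pandas as pd

# Two timestamps taken from the first rows shown above.
ts = pd.Series(pd.to_datetime(["2016-01-01 21:11", "2016-01-06 14:42"]))

weekday = ts.dt.strftime("%a")    # abbreviated weekday name
cal_month = ts.dt.strftime("%b")  # abbreviated month name
full_day = ts.dt.day_name()       # full weekday name

print(list(weekday), list(cal_month), list(full_day))
```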


**Find the busiest month in terms of number of drives and miles driven**

In [19]:
#groupby calendar months and count the number of drives

In [37]:
df.groupby('cal_month').count()['Trip_Id'].sort_values(ascending = False).head(1)           

cal_month
Dec    146
Name: Trip_Id, dtype: int64

In [20]:
#groupby calendar months and sum the number of miles driven

In [38]:
df.groupby('cal_month')['MILES'].sum().sort_values(ascending = False).head(1)

cal_month
Oct    1810.0
Name: MILES, dtype: float64

**Busiest day in terms of number of rides**

In [21]:
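# On which weekday did the driver make the most trips?

In [40]:
df.groupby(['weekday']).size().sort_values(ascending = False).head(1)   # sort by count; groupby alone orders weekdays alphabetically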

weekday
Fri    206
dtype: int64
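
Because `groupby` orders its result by the group labels, the first row is the alphabetically first weekday, not necessarily the busiest one. This sketch on a hypothetical `trips` frame shows the difference and two ways to get the true maximum.

```python
import pandas as pd

# Hypothetical data in which Monday, not Friday, has the most trips.
trips = pd.DataFrame({"weekday": ["Mon", "Mon", "Fri", "Tue"]})

per_day = trips.groupby("weekday").size()

first_alphabetical = per_day.head(1).index[0]          # label order, not count order
busiest = per_day.idxmax()                             # label with the largest count
top = per_day.sort_values(ascending=False).head(1)     # same answer, with the count

print(first_alphabetical, busiest)
print(top)
```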

**Peak hours?**

In [22]:
#use groupby on start_hour to find the peak hour

In [44]:
df.groupby('start_hour').sum(numeric_only=True).sort_values(by='MILES',ascending=False).head(1)           # The start hour with the most total miles driven.

Unnamed: 0_level_0,Trip_Id,MILES,start_day,start_month,d_of_wk
start_hour,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
16,46404,1386.4,1334,567,239
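
"Peak hour" can mean the hour with the most miles or the hour with the most trips, and the two need not coincide. This is a sketch on a hypothetical `trips` frame that computes both in one pass with named aggregation; the column names `n_trips` and `total_miles` are illustrative choices.

```python
import pandas as pd

# Hypothetical data: hour 9 has more trips, hour 16 has more miles.
trips = pd.DataFrame(
    {"start_hour": [16, 16, 9, 9, 9], "MILES": [40.0, 35.0, 5.0, 4.0, 6.0]}
)

by_hour = trips.groupby("start_hour")["MILES"].agg(n_trips="size", total_miles="sum")

peak_by_trips = by_hour["n_trips"].idxmax()
peak_by_miles = by_hour["total_miles"].idxmax()

print(by_hour)
print(peak_by_trips, peak_by_miles)
```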


**Most frequent trip category**

In [24]:
#Category

In [45]:
df['CATEGORY'].value_counts().head(1)

Business    1078
Name: CATEGORY, dtype: int64

**Most frequent Purpose**

In [23]:
#Purpose

In [46]:
df['PURPOSE'].value_counts().head(1)

Meeting    187
Name: PURPOSE, dtype: int64

**Among trips with a recorded purpose, meetings are the most frequent**

In [25]:
#Average distance traveled for each activity

In [47]:
df.groupby('PURPOSE')['MILES'].mean().sort_values(ascending = False)

PURPOSE
Commute            180.200000
Customer Visit      20.688119
Meeting             15.247594
Charity ($)         15.100000
Between Offices     10.944444
Temporary Site      10.474000
Meal/Entertain       5.698125
Airport/Travel       5.500000
Moving               4.550000
Errand/Supplies      3.968750
Name: MILES, dtype: float64

In [26]:
#How many miles were driven per category and purpose ?

In [49]:
df.groupby(['PURPOSE','CATEGORY'])['MILES'].sum().sort_values(ascending = False)

PURPOSE          CATEGORY
Meeting          Business    2851.3
Customer Visit   Business    2089.5
Meal/Entertain   Business     911.7
Temporary Site   Business     523.7
Errand/Supplies  Business     508.0
Between Offices  Business     197.0
Commute          Personal     180.2
Moving           Personal      18.2
Airport/Travel   Business      16.5
Charity ($)      Personal      15.1
Name: MILES, dtype: float64

In [27]:
#How many miles were driven per category?

In [50]:
df.groupby('CATEGORY')['MILES'].sum().sort_values(ascending = False)

CATEGORY
Business    11487.0
Personal      717.7
Name: MILES, dtype: float64

In [28]:
#What is the percentage of business miles vs personal miles?
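
One possible way to answer this, sketched here with the per-category totals printed in the previous output typed in by hand rather than recomputed from `df`:

```python
import pandas as pd

# Category totals copied from the groupby output above.
miles_by_category = pd.Series({"Business": 11487.0, "Personal": 717.7}, name="MILES")

# Each category's share of the total, as a percentage rounded to one decimal.
pct = (miles_by_category / miles_by_category.sum() * 100).round(1)

print(pct)
```

On the full dataframe, the same idea would be `df.groupby('CATEGORY')['MILES'].sum()` divided by `df['MILES'].sum()`.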