<a name="top"></a>
# Capstone Project


# Problem Statement:
### Build a model that predicts the price of a ride and suggests the cheaper service between Lyft and Uber.

## Introduction: 
### This project looks at two ride-hailing services in the United States, Lyft and Uber. Many consumers are unsure which service offers the cheaper ride. Because they want to save money, they check both services before deciding which one to use, and repeating that comparison for every trip is time-consuming. The model built in this project addresses that problem: it predicts ride prices from weather and timing information, which reduces the time needed to identify the cheaper ride.

## Contents:
**1. [Loading and Cleaning Dataset](#Loading-and-Cleaning-Dataset)**

**2. [Time Conversion](#Time-Conversion)**

**3. [Joining two Datasets](#Joining-two-Datasets)**

**4. [Removing unwanted features from Merged dataframe](#Removing-unwanted-features-from-Merged-dataframe)**

**5. [Splitting data to create separate Lyft and Uber datasets](#Splitting-data-to-create-seperate-Lyft-and-Uber-datasets)**


<a name="Loading-and-Cleaning-Dataset"></a>
# Loading and Cleaning Dataset 


### Import Libraries

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pytz

import datetime 

plt.style.use('fivethirtyeight')

%matplotlib inline

### Importing the trips and weather CSV files

In [3]:
cab_df = pd.read_csv("cab_rides.csv",delimiter='\t',encoding = "utf-16")
weather_df = pd.read_csv("weather.csv",delimiter='\t',encoding = "utf-16")
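The two files are read as tab-separated text encoded in UTF-16 rather than as default comma-separated UTF-8. The sketch below shows how those two options behave on a tiny in-memory sample; `sample_text` and its values are made up for illustration and are not taken from `cab_rides.csv`.

```python
import io

import pandas as pd

# Hypothetical three-column sample standing in for cab_rides.csv (made-up values).
sample_text = "distance\tcab_type\tprice\n0.44\tLyft\t5.0\n1.00\tUber\t\n"
sample_bytes = sample_text.encode("utf-16")  # encoding adds a byte-order mark

# Same options as the notebook's read_csv calls: tab separator, UTF-16 encoding.
sample_df = pd.read_csv(io.BytesIO(sample_bytes), delimiter="\t", encoding="utf-16")

print(sample_df)
print(sample_df.dtypes)
```

The empty last field of the second row is parsed as `NaN`, which is how missing prices show up in `cab_df.info()`.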

In [4]:
cab_df.head()

Unnamed: 0,distance,cab_type,time_stamp (Epoch Timing),destination,source,price,surge_multiplier,id,product_id,name
0,0.44,Lyft,1544950000000.0,North Station,Haymarket Square,5.0,1.0,424553bb-7174-41ea-aeb4-fe06d4f4b9d7,lyft_line,Shared
1,0.44,Lyft,1544950000000.0,North Station,Haymarket Square,11.0,1.0,4bd23055-6827-41c6-b23b-3c491f24e74d,lyft_premier,Lux
2,0.44,Lyft,1544950000000.0,North Station,Haymarket Square,7.0,1.0,981a3613-77af-4620-a42a-0c0866077d1e,lyft,Lyft
3,0.44,Lyft,1544950000000.0,North Station,Haymarket Square,26.0,1.0,c2d88af2-d278-4bfd-a8d0-29ca77cc5512,lyft_luxsuv,Lux Black XL
4,0.44,Lyft,1544950000000.0,North Station,Haymarket Square,9.0,1.0,e0126e1f-8ca9-4f2e-82b3-50505a09db9a,lyft_plus,Lyft XL


In [5]:
weather_df.head()

Unnamed: 0,temp/f,location,cloud cover,pressure,rain,time_stamp,humidity,wind
0,42.42,Back Bay,1.0,1012.14,0.1228,1545003901,0.77,11.25
1,42.43,Beacon Hill,1.0,1012.15,0.1846,1545003901,0.76,11.32
2,42.5,Boston University,1.0,1012.15,0.1089,1545003901,0.76,11.07
3,42.11,Fenway,1.0,1012.13,0.0969,1545003901,0.77,11.09
4,43.13,Financial District,1.0,1012.14,0.1786,1545003901,0.75,11.49


In [6]:
cab_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 693071 entries, 0 to 693070
Data columns (total 10 columns):
distance                     693071 non-null float64
cab_type                     693071 non-null object
time_stamp (Epoch Timing)    693071 non-null float64
destination                  693071 non-null object
source                       693071 non-null object
price                        637976 non-null float64
surge_multiplier             693071 non-null float64
id                           693071 non-null object
product_id                   693071 non-null object
name                         693071 non-null object
dtypes: float64(4), object(6)
memory usage: 52.9+ MB


In [7]:
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6276 entries, 0 to 6275
Data columns (total 8 columns):
temp/f         6276 non-null float64
location       6276 non-null object
cloud cover    6276 non-null float64
pressure       6276 non-null float64
rain           894 non-null float64
time_stamp     6276 non-null int64
humidity       6276 non-null float64
wind           6276 non-null float64
dtypes: float64(6), int64(1), object(1)
memory usage: 392.3+ KB


In [8]:
#To remove the rows that do not have the price value 
cab_df.dropna(subset = ['price'], inplace = True)

In [149]:
cab_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 637976 entries, 0 to 693070
Data columns (total 10 columns):
distance                     637976 non-null float64
cab_type                     637976 non-null object
time_stamp (Epoch Timing)    637976 non-null float64
destination                  637976 non-null object
source                       637976 non-null object
price                        637976 non-null float64
surge_multiplier             637976 non-null float64
id                           637976 non-null object
product_id                   637976 non-null object
name                         637976 non-null object
dtypes: float64(4), object(6)
memory usage: 53.5+ MB
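Before dropping rows, it can help to see where the missing prices sit. The following is a minimal sketch on a toy frame (`toy_cab`, with made-up values) that counts missing prices per provider and then applies the same `dropna(subset=['price'])` rule without mutating the original frame.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cab_df; the values are invented for illustration only.
toy_cab = pd.DataFrame({
    "cab_type": ["Lyft", "Uber", "Uber", "Lyft"],
    "name": ["Lyft", "UberXL", "UberX", "Shared"],
    "price": [7.0, np.nan, 9.5, 5.0],
})

# Count missing prices per provider before removing anything.
missing_by_type = toy_cab["price"].isna().groupby(toy_cab["cab_type"]).sum()
print(missing_by_type)

# Non-mutating equivalent of dropna(subset=['price'], inplace=True).
toy_clean = toy_cab.dropna(subset=["price"])
print(len(toy_cab), "->", len(toy_clean))
```

If the missing prices turn out to be concentrated in one provider or product, that is worth noting, since dropping them changes the class balance checked later.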


In [150]:
#Fill in missing rain values with 0 (this assumes a missing reading means no rain)
weather_df['rain'] = weather_df['rain'].fillna(0)

In [151]:
#To check whether the rain column has been filled correctly
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6276 entries, 0 to 6275
Data columns (total 8 columns):
temp/f         6276 non-null float64
location       6276 non-null object
cloud cover    6276 non-null float64
pressure       6276 non-null float64
rain           6276 non-null float64
time_stamp     6276 non-null int64
humidity       6276 non-null float64
wind           6276 non-null float64
dtypes: float64(6), int64(1), object(1)
memory usage: 392.3+ KB


In [152]:
#Observing the Class balance between Uber and Lyft Queries
cab_df['cab_type'].value_counts(normalize =True)

Uber    0.518151
Lyft    0.481849
Name: cab_type, dtype: float64

In [153]:
cab_df[cab_df['cab_type'] == 'Lyft'].shape

(307408, 10)

In [154]:
cab_df[cab_df['cab_type'] == 'Uber'].shape

(330568, 10)

In [155]:
#Converting the epoch time to real time 
cab_df['timing'] = pd.to_datetime(cab_df['time_stamp (Epoch Timing)']/1000,unit = "s")
weather_df['timing'] = pd.to_datetime(weather_df['time_stamp'],unit ='s')

In [156]:
#To check the new column called timing in the cab df
cab_df.tail()

Unnamed: 0,distance,cab_type,time_stamp (Epoch Timing),destination,source,price,surge_multiplier,id,product_id,name,timing
693065,1.0,Uber,1543710000000.0,North End,West End,9.5,1.0,353e6566-b272-479e-a9c6-98bd6cb23f25,9a0e7b09-b92b-4c41-9779-2ad22b4d779d,WAV,2018-12-02 00:20:00
693066,1.0,Uber,1543710000000.0,North End,West End,13.0,1.0,616d3611-1820-450a-9845-a9ff304a4842,6f72dfc5-27f1-42e8-84db-ccc7a75f6969,UberXL,2018-12-02 00:20:00
693067,1.0,Uber,1543710000000.0,North End,West End,9.5,1.0,633a3fc3-1f86-4b9e-9d48-2b7132112341,55c66225-fbe7-4fd5-9072-eab1ece5e23e,UberX,2018-12-02 00:20:00
693069,1.0,Uber,1543710000000.0,North End,West End,27.0,1.0,727e5f07-a96b-4ad1-a2c7-9abc3ad55b4e,6d318bcc-22a3-4af6-bddd-b409bfce1546,Black SUV,2018-12-02 00:20:00
693070,1.0,Uber,1543710000000.0,North End,West End,10.0,1.0,e7fdc087-fe86-40a5-a3c3-3b2a8badcbda,997acbb5-e102-41e1-b155-9df7de0a73f2,UberPool,2018-12-02 00:20:00


In [157]:
cab_df['timing'].describe()

count                  637976
unique                    123
top       2018-11-29 00:06:40
freq                    14112
first     2018-11-26 02:40:00
last      2018-12-18 19:06:40
Name: timing, dtype: object
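The cab timestamps are in milliseconds since the Unix epoch and the weather timestamps are in seconds, which is why the cab column is divided by 1000. As a small sketch using values displayed in the tables above, `unit="ms"` gives the same result without the manual division:

```python
import pandas as pd

# Epoch values as displayed in cab_df (milliseconds) and weather_df (seconds).
cab_epoch_ms = pd.Series([1544950000000.0, 1543710000000.0])
weather_epoch_s = pd.Series([1545003901])

# unit="ms" is equivalent to dividing by 1000 and using unit="s".
cab_time = pd.to_datetime(cab_epoch_ms, unit="ms")
cab_time_divided = pd.to_datetime(cab_epoch_ms / 1000, unit="s")
weather_time = pd.to_datetime(weather_epoch_s, unit="s")

print(cab_time)
print(weather_time)
```

Note that `describe()` reports only 123 unique cab timestamps across 637,976 rows, so the stored cab times appear to be coarse; any hour-level feature derived from them inherits that coarseness.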

In [70]:
# To observe the representation of the different cab types in the dataset
cab_df.groupby('cab_type')['name'].value_counts(normalize =True)


cab_type  name        
Lyft      Lux             0.166668
          Lux Black       0.166668
          Lux Black XL    0.166668
          Lyft            0.166668
          Lyft XL         0.166668
          Shared          0.166661
Uber      Black SUV       0.166671
          UberXL          0.166671
          WAV             0.166671
          Black           0.166668
          UberX           0.166665
          UberPool        0.166656
Name: name, dtype: float64

In [72]:
#To check the new column called timing in the weather df

weather_df.head()

Unnamed: 0,temp/f,location,cloud cover,pressure,rain,time_stamp,humidity,wind,timing
0,42.42,Back Bay,1.0,1012.14,0.1228,1545003901,0.77,11.25,2018-12-16 23:45:01
1,42.43,Beacon Hill,1.0,1012.15,0.1846,1545003901,0.76,11.32,2018-12-16 23:45:01
2,42.5,Boston University,1.0,1012.15,0.1089,1545003901,0.76,11.07,2018-12-16 23:45:01
3,42.11,Fenway,1.0,1012.13,0.0969,1545003901,0.77,11.09,2018-12-16 23:45:01
4,43.13,Financial District,1.0,1012.14,0.1786,1545003901,0.75,11.49,2018-12-16 23:45:01


[Back to top](#top)

<a name="Time-Conversion"></a>
## Time Conversion

In [74]:
#Converting the time to Boston's timezone, which is UTC-5 during this period
cab_df['timing_utc_5'] = pd.to_datetime(cab_df['time_stamp (Epoch Timing)']/1000,unit = "s")
cab_df['timing_utc_5'] = cab_df['timing_utc_5'].dt.tz_localize('utc').dt.tz_convert('US/Eastern')

weather_df['timing_utc_5'] = pd.to_datetime(weather_df['time_stamp'],unit ='s')
weather_df['timing_utc_5'] = weather_df['timing_utc_5'].dt.tz_localize('utc').dt.tz_convert('US/Eastern')
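A slightly shorter way to get the same local times is to parse directly to timezone-aware UTC with `utc=True` and then convert. This is a sketch rather than the notebook's method; it uses the zone name `America/New_York`, which is assumed here to be equivalent to `US/Eastern`.

```python
import pandas as pd

# One cab timestamp and one weather timestamp, both expressed in seconds.
epoch_s = pd.Series([1544950000, 1545003901])

# utc=True yields timezone-aware UTC values, so no tz_localize step is needed.
utc_time = pd.to_datetime(epoch_s, unit="s", utc=True)
boston_time = utc_time.dt.tz_convert("America/New_York")

print(boston_time)
```

A named zone follows daylight saving time, so the fixed offset implied by the column name `timing_utc_5` only holds because the data falls between late November and mid-December 2018, after daylight saving time had ended.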

In [75]:
cab_df.head(5)

Unnamed: 0,distance,cab_type,time_stamp (Epoch Timing),destination,source,price,surge_multiplier,id,product_id,name,timing,timing_utc_5
0,0.44,Lyft,1544950000000.0,North Station,Haymarket Square,5.0,1.0,424553bb-7174-41ea-aeb4-fe06d4f4b9d7,lyft_line,Shared,2018-12-16 08:46:40,2018-12-16 03:46:40-05:00
1,0.44,Lyft,1544950000000.0,North Station,Haymarket Square,11.0,1.0,4bd23055-6827-41c6-b23b-3c491f24e74d,lyft_premier,Lux,2018-12-16 08:46:40,2018-12-16 03:46:40-05:00
2,0.44,Lyft,1544950000000.0,North Station,Haymarket Square,7.0,1.0,981a3613-77af-4620-a42a-0c0866077d1e,lyft,Lyft,2018-12-16 08:46:40,2018-12-16 03:46:40-05:00
3,0.44,Lyft,1544950000000.0,North Station,Haymarket Square,26.0,1.0,c2d88af2-d278-4bfd-a8d0-29ca77cc5512,lyft_luxsuv,Lux Black XL,2018-12-16 08:46:40,2018-12-16 03:46:40-05:00
4,0.44,Lyft,1544950000000.0,North Station,Haymarket Square,9.0,1.0,e0126e1f-8ca9-4f2e-82b3-50505a09db9a,lyft_plus,Lyft XL,2018-12-16 08:46:40,2018-12-16 03:46:40-05:00


In [76]:
weather_df.head(5)

Unnamed: 0,temp/f,location,cloud cover,pressure,rain,time_stamp,humidity,wind,timing,timing_utc_5
0,42.42,Back Bay,1.0,1012.14,0.1228,1545003901,0.77,11.25,2018-12-16 23:45:01,2018-12-16 18:45:01-05:00
1,42.43,Beacon Hill,1.0,1012.15,0.1846,1545003901,0.76,11.32,2018-12-16 23:45:01,2018-12-16 18:45:01-05:00
2,42.5,Boston University,1.0,1012.15,0.1089,1545003901,0.76,11.07,2018-12-16 23:45:01,2018-12-16 18:45:01-05:00
3,42.11,Fenway,1.0,1012.13,0.0969,1545003901,0.77,11.09,2018-12-16 23:45:01,2018-12-16 18:45:01-05:00
4,43.13,Financial District,1.0,1012.14,0.1786,1545003901,0.75,11.49,2018-12-16 23:45:01,2018-12-16 18:45:01-05:00


In [77]:
cab_df['merge_date'] = cab_df['source'].astype('str') +" - "+ cab_df['timing_utc_5'].dt.date.astype("str") +" - "+ cab_df['timing_utc_5'].dt.hour.astype("str")
weather_df['merge_date'] = weather_df['location'].astype('str') +" - "+ weather_df['timing_utc_5'].dt.date.astype("str") +" - "+ weather_df['timing_utc_5'].dt.hour.astype("str")
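The `merge_date` key concatenates location, local date, and local hour into one string so that rides and weather readings from the same place and hour share a key. The sketch below rebuilds that key on a toy frame (`rides`, made-up rows) and, as an alternative that is an assumption rather than the notebook's approach, keeps the hour bucket as a real timestamp with `dt.floor("h")`.

```python
import pandas as pd

# Toy rides with timezone-aware local times (illustrative rows only).
rides = pd.DataFrame({
    "source": ["Back Bay", "Haymarket Square"],
    "timing_utc_5": pd.to_datetime(
        ["2018-11-26 19:53:20", "2018-12-16 03:46:40"]
    ).tz_localize("America/New_York"),
})

# String key in the notebook's "location - date - hour" format.
rides["merge_date"] = (
    rides["source"].astype(str)
    + " - " + rides["timing_utc_5"].dt.date.astype(str)
    + " - " + rides["timing_utc_5"].dt.hour.astype(str)
)

# Alternative (assumption): a timestamp bucket that could be paired with
# `source` as a two-column merge key instead of a formatted string.
rides["hour_bucket"] = rides["timing_utc_5"].dt.floor("h")

print(rides[["merge_date", "hour_bucket"]])
```

A two-column key of location plus `hour_bucket` avoids string formatting and keeps the time sortable, at the cost of departing from the single-column join used in the notebook.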

In [78]:
cab_df.head(5)

Unnamed: 0,distance,cab_type,time_stamp (Epoch Timing),destination,source,price,surge_multiplier,id,product_id,name,timing,timing_utc_5,merge_date
0,0.44,Lyft,1544950000000.0,North Station,Haymarket Square,5.0,1.0,424553bb-7174-41ea-aeb4-fe06d4f4b9d7,lyft_line,Shared,2018-12-16 08:46:40,2018-12-16 03:46:40-05:00,Haymarket Square - 2018-12-16 - 3
1,0.44,Lyft,1544950000000.0,North Station,Haymarket Square,11.0,1.0,4bd23055-6827-41c6-b23b-3c491f24e74d,lyft_premier,Lux,2018-12-16 08:46:40,2018-12-16 03:46:40-05:00,Haymarket Square - 2018-12-16 - 3
2,0.44,Lyft,1544950000000.0,North Station,Haymarket Square,7.0,1.0,981a3613-77af-4620-a42a-0c0866077d1e,lyft,Lyft,2018-12-16 08:46:40,2018-12-16 03:46:40-05:00,Haymarket Square - 2018-12-16 - 3
3,0.44,Lyft,1544950000000.0,North Station,Haymarket Square,26.0,1.0,c2d88af2-d278-4bfd-a8d0-29ca77cc5512,lyft_luxsuv,Lux Black XL,2018-12-16 08:46:40,2018-12-16 03:46:40-05:00,Haymarket Square - 2018-12-16 - 3
4,0.44,Lyft,1544950000000.0,North Station,Haymarket Square,9.0,1.0,e0126e1f-8ca9-4f2e-82b3-50505a09db9a,lyft_plus,Lyft XL,2018-12-16 08:46:40,2018-12-16 03:46:40-05:00,Haymarket Square - 2018-12-16 - 3


In [79]:
weather_df.head(5)

Unnamed: 0,temp/f,location,cloud cover,pressure,rain,time_stamp,humidity,wind,timing,timing_utc_5,merge_date
0,42.42,Back Bay,1.0,1012.14,0.1228,1545003901,0.77,11.25,2018-12-16 23:45:01,2018-12-16 18:45:01-05:00,Back Bay - 2018-12-16 - 18
1,42.43,Beacon Hill,1.0,1012.15,0.1846,1545003901,0.76,11.32,2018-12-16 23:45:01,2018-12-16 18:45:01-05:00,Beacon Hill - 2018-12-16 - 18
2,42.5,Boston University,1.0,1012.15,0.1089,1545003901,0.76,11.07,2018-12-16 23:45:01,2018-12-16 18:45:01-05:00,Boston University - 2018-12-16 - 18
3,42.11,Fenway,1.0,1012.13,0.0969,1545003901,0.77,11.09,2018-12-16 23:45:01,2018-12-16 18:45:01-05:00,Fenway - 2018-12-16 - 18
4,43.13,Financial District,1.0,1012.14,0.1786,1545003901,0.75,11.49,2018-12-16 23:45:01,2018-12-16 18:45:01-05:00,Financial District - 2018-12-16 - 18


In [80]:
cab_df.drop(['id','product_id', 'time_stamp (Epoch Timing)', 'timing'], axis = 1 , inplace = True)

In [81]:
weather_df.drop(['timing', 'timing_utc_5' ], axis = 1 , inplace = True)

In [82]:
weather_df.index = weather_df['merge_date']

[Back to top](#top)

<a name="Joining-two-Datasets"></a>
## Joining two Datasets


In [83]:
merged_df = cab_df.join(weather_df, on = ['merge_date'], rsuffix = '_w')

In [84]:
merged_df.tail()

Unnamed: 0,distance,cab_type,destination,source,price,surge_multiplier,name,timing_utc_5,merge_date,temp/f,location,cloud cover,pressure,rain,time_stamp,humidity,wind,merge_date_w
693065,1.0,Uber,North End,West End,9.5,1.0,WAV,2018-12-01 19:20:00-05:00,West End - 2018-12-01 - 19,35.63,West End,0.2,1024.21,0.0,1543712000.0,0.78,1.85,West End - 2018-12-01 - 19
693066,1.0,Uber,North End,West End,13.0,1.0,UberXL,2018-12-01 19:20:00-05:00,West End - 2018-12-01 - 19,35.63,West End,0.2,1024.21,0.0,1543712000.0,0.78,1.85,West End - 2018-12-01 - 19
693067,1.0,Uber,North End,West End,9.5,1.0,UberX,2018-12-01 19:20:00-05:00,West End - 2018-12-01 - 19,35.63,West End,0.2,1024.21,0.0,1543712000.0,0.78,1.85,West End - 2018-12-01 - 19
693069,1.0,Uber,North End,West End,27.0,1.0,Black SUV,2018-12-01 19:20:00-05:00,West End - 2018-12-01 - 19,35.63,West End,0.2,1024.21,0.0,1543712000.0,0.78,1.85,West End - 2018-12-01 - 19
693070,1.0,Uber,North End,West End,10.0,1.0,UberPool,2018-12-01 19:20:00-05:00,West End - 2018-12-01 - 19,35.63,West End,0.2,1024.21,0.0,1543712000.0,0.78,1.85,West End - 2018-12-01 - 19


In [85]:
merged_df[(merged_df['surge_multiplier']> 1)].shape

(29070, 18)

In [86]:
#merged_df['day'] = merged_df['timing'].dt.dayofweek
merged_df['day'] = merged_df['timing_utc_5'].dt.day_name() #To get all the days in text format 

In [87]:
merged_df.head()

Unnamed: 0,distance,cab_type,destination,source,price,surge_multiplier,name,timing_utc_5,merge_date,temp/f,location,cloud cover,pressure,rain,time_stamp,humidity,wind,merge_date_w,day
0,0.44,Lyft,North Station,Haymarket Square,5.0,1.0,Shared,2018-12-16 03:46:40-05:00,Haymarket Square - 2018-12-16 - 3,39.36,Haymarket Square,0.39,1022.44,0.0,1544950000.0,0.74,8.14,Haymarket Square - 2018-12-16 - 3,Sunday
1,0.44,Lyft,North Station,Haymarket Square,11.0,1.0,Lux,2018-12-16 03:46:40-05:00,Haymarket Square - 2018-12-16 - 3,39.36,Haymarket Square,0.39,1022.44,0.0,1544950000.0,0.74,8.14,Haymarket Square - 2018-12-16 - 3,Sunday
2,0.44,Lyft,North Station,Haymarket Square,7.0,1.0,Lyft,2018-12-16 03:46:40-05:00,Haymarket Square - 2018-12-16 - 3,39.36,Haymarket Square,0.39,1022.44,0.0,1544950000.0,0.74,8.14,Haymarket Square - 2018-12-16 - 3,Sunday
3,0.44,Lyft,North Station,Haymarket Square,26.0,1.0,Lux Black XL,2018-12-16 03:46:40-05:00,Haymarket Square - 2018-12-16 - 3,39.36,Haymarket Square,0.39,1022.44,0.0,1544950000.0,0.74,8.14,Haymarket Square - 2018-12-16 - 3,Sunday
4,0.44,Lyft,North Station,Haymarket Square,9.0,1.0,Lyft XL,2018-12-16 03:46:40-05:00,Haymarket Square - 2018-12-16 - 3,39.36,Haymarket Square,0.39,1022.44,0.0,1544950000.0,0.74,8.14,Haymarket Square - 2018-12-16 - 3,Sunday


In [88]:
#Getting the right hour
merged_df['hour'] = merged_df['timing_utc_5'].dt.hour

In [89]:
merged_df[merged_df['day'] == 'Monday'].head()

Unnamed: 0,distance,cab_type,destination,source,price,surge_multiplier,name,timing_utc_5,merge_date,temp/f,location,cloud cover,pressure,rain,time_stamp,humidity,wind,merge_date_w,day,hour
6,1.08,Lyft,Northeastern University,Back Bay,10.5,1.0,Lyft XL,2018-11-26 19:53:20-05:00,Back Bay - 2018-11-26 - 19,43.96,Back Bay,1.0,1006.26,0.0497,1543278000.0,0.9,9.86,Back Bay - 2018-11-26 - 19,Monday,19
6,1.08,Lyft,Northeastern University,Back Bay,10.5,1.0,Lyft XL,2018-11-26 19:53:20-05:00,Back Bay - 2018-11-26 - 19,43.83,Back Bay,0.97,1005.9,0.2173,1543279000.0,0.91,10.93,Back Bay - 2018-11-26 - 19,Monday,19
6,1.08,Lyft,Northeastern University,Back Bay,10.5,1.0,Lyft XL,2018-11-26 19:53:20-05:00,Back Bay - 2018-11-26 - 19,43.82,Back Bay,0.97,1005.87,0.2039,1543279000.0,0.91,11.02,Back Bay - 2018-11-26 - 19,Monday,19
6,1.08,Lyft,Northeastern University,Back Bay,10.5,1.0,Lyft XL,2018-11-26 19:53:20-05:00,Back Bay - 2018-11-26 - 19,43.82,Back Bay,0.97,1005.89,0.2154,1543279000.0,0.91,10.94,Back Bay - 2018-11-26 - 19,Monday,19
6,1.08,Lyft,Northeastern University,Back Bay,10.5,1.0,Lyft XL,2018-11-26 19:53:20-05:00,Back Bay - 2018-11-26 - 19,43.98,Back Bay,1.0,1006.36,0.0358,1543278000.0,0.9,9.53,Back Bay - 2018-11-26 - 19,Monday,19
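The output above shows ride index 6 repeated once for each weather reading recorded for Back Bay in that hour, so the join multiplies rides whenever a key has several weather rows. One possible way to avoid this, shown as a sketch and not as the notebook's method, is to average the weather readings per key before joining. The toy frames below reuse three of the readings displayed above; averaging is an assumption, and applying it to the real data would change the row counts reported in later cells.

```python
import pandas as pd

key = "Back Bay - 2018-11-26 - 19"

# Two toy rides sharing one merge key.
toy_cab = pd.DataFrame({
    "name": ["Lyft XL", "Lyft"],
    "price": [10.5, 7.0],
    "merge_date": [key, key],
})

# Three weather readings for the same location and hour.
toy_weather = pd.DataFrame({
    "merge_date": [key] * 3,
    "temp/f": [43.96, 43.83, 43.82],
    "rain": [0.0497, 0.2173, 0.2039],
})

# Joining as-is repeats each ride once per weather reading.
duplicated = toy_cab.join(toy_weather.set_index("merge_date"), on="merge_date")

# Averaging first leaves exactly one weather row per key.
hourly_weather = toy_weather.groupby("merge_date")[["temp/f", "rain"]].mean()
joined = toy_cab.join(hourly_weather, on="merge_date")

print(len(duplicated), len(joined))
print(joined)
```

With one weather row per key, every ride appears once in the result, so the price distribution seen by a model is not weighted by how many weather readings happened to exist for that hour.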


[Back to top](#top)

<a name="Removing-unwanted-features-from-Merged-dataframe"></a>
## Removing unwanted features from Merged dataframe

In [90]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1166936 entries, 0 to 693070
Data columns (total 20 columns):
distance            1166936 non-null float64
cab_type            1166936 non-null object
destination         1166936 non-null object
source              1166936 non-null object
price               1166936 non-null float64
surge_multiplier    1166936 non-null float64
name                1166936 non-null object
timing_utc_5        1166936 non-null datetime64[ns, US/Eastern]
merge_date          1166936 non-null object
temp/f              1161392 non-null float64
location            1161392 non-null object
cloud cover         1161392 non-null float64
pressure            1161392 non-null float64
rain                1161392 non-null float64
time_stamp          1161392 non-null float64
humidity            1161392 non-null float64
wind                1161392 non-null float64
merge_date_w        1161392 non-null object
day                 1166936 non-null object
hour                11

In [91]:
merged_df.isnull().sum()

distance               0
cab_type               0
destination            0
source                 0
price                  0
surge_multiplier       0
name                   0
timing_utc_5           0
merge_date             0
temp/f              5544
location            5544
cloud cover         5544
pressure            5544
rain                5544
time_stamp          5544
humidity            5544
wind                5544
merge_date_w        5544
day                    0
hour                   0
dtype: int64

In [92]:
#Some rows have no matching weather information, so we drop them
merged_df.dropna(inplace= True)
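To see which rides lack weather data before dropping them, a left merge with `indicator=True` labels each row by where it was found. This is a minimal sketch on toy frames; the keys and values are invented, and the second key is deliberately left without a weather row.

```python
import pandas as pd

# Toy rides; the second key has no weather reading in this example.
toy_cab = pd.DataFrame({
    "price": [5.0, 9.5],
    "merge_date": ["Location A - 2018-12-16 - 3", "Location B - 2018-12-01 - 19"],
})
toy_weather = pd.DataFrame({
    "merge_date": ["Location A - 2018-12-16 - 3"],
    "temp/f": [39.36],
})

# indicator=True adds a _merge column: "both" or "left_only" for a left merge.
checked = toy_cab.merge(toy_weather, on="merge_date", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(unmatched[["merge_date", "price"]])
```

Listing the unmatched keys shows whether the gaps cluster around particular locations or hours, which a blanket `dropna` would otherwise hide.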

In [93]:
merged_df.columns

Index(['distance', 'cab_type', 'destination', 'source', 'price',
       'surge_multiplier', 'name', 'timing_utc_5', 'merge_date', 'temp/f',
       'location', 'cloud cover', 'pressure', 'rain', 'time_stamp', 'humidity',
       'wind', 'merge_date_w', 'day', 'hour'],
      dtype='object')

In [42]:
merged_df = merged_df[['distance', 'cab_type', 'source', 'price',
        'name', 'temp/f', 'cloud cover', 'pressure', 'rain',  'humidity',
       'wind', 'day', 'hour']]

In [43]:
merged_df['source'].value_counts(normalize = True)

Financial District         0.084728
Back Bay                   0.084710
Beacon Hill                0.083830
Boston University          0.083362
North End                  0.083359
South Station              0.083352
Northeastern University    0.083321
Haymarket Square           0.083308
Fenway                     0.083295
North Station              0.082561
West End                   0.082225
Theatre District           0.081949
Name: source, dtype: float64

[Back to top](#top)

<a name="Splitting-data-to-create-seperate-Lyft-and-Uber-datasets"></a>
## Splitting data to create separate Lyft and Uber datasets

In [44]:
lyft_data = merged_df[merged_df['cab_type'] == 'Lyft']

In [45]:
#Dummifying Data
lyft_data = pd.get_dummies(lyft_data,columns =['day','source','hour', 'name'],drop_first=True)
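`pd.get_dummies` with `drop_first=True` creates one indicator column per category except the first (in sorted order), which becomes the implicit baseline. Listing `hour` in `columns` forces the integer hour to be treated as a category too. The toy frame below is made up for illustration.

```python
import pandas as pd

# Toy stand-in for lyft_data before encoding (illustrative values).
toy = pd.DataFrame({
    "price": [5.0, 11.0, 7.0],
    "name": ["Shared", "Lux", "Lyft"],
    "hour": [3, 3, 19],
})

# "Lux" and hour 3 sort first, so they are dropped as the baseline levels.
encoded = pd.get_dummies(toy, columns=["name", "hour"], drop_first=True)

print(encoded.columns.tolist())
```

The same arithmetic is consistent with the 54 columns reported for the real data, assuming all 7 days and 24 hours occur: 9 untouched columns (including the now-constant `cab_type`) plus 6 day, 11 source, 23 hour, and 5 name indicators.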

In [46]:
lyft_data.shape

(546266, 54)

In [47]:
lyft_data.to_csv('lyft.csv')

In [48]:
#The CSV is written once, after dummifying, in a later cell
uber_data = merged_df[merged_df['cab_type'] != 'Lyft']

In [49]:
#Dummifying Data
uber_data = pd.get_dummies(uber_data,columns =['day','source','hour', 'name'],drop_first=True)

In [50]:
uber_data.to_csv('uber.csv')

In [140]:
lyft_data.shape

(546266, 54)

In [51]:
uber_data.shape

(615126, 54)

[Back to top](#top)