# COGS 108 - Data Checkpoint

- Hien Bui
- Samantha Lin
- Felicia Chan
- Jason Lee

<a id='research_question'></a>
# Research Question

How do distance, time of day, and climate (specifically weather and temperature) affect the prices of Uber and Lyft rides in Boston’s hotspots? Additionally, how can we use these variables to predict the prices of Uber and Lyft rides in Boston’s hotspots?

# Dataset(s)

# Cabs

- Dataset name: cab_rides.csv
- Link to the dataset: https://www.kaggle.com/datasets/ravi72munde/uber-lyft-cab-prices?select=cab_rides.csv 
- Number of Observations: 693071

The cab_rides dataset contains Uber and Lyft ride data for approximately a week in November 2018 in the Boston area. For each ride, the dataset records the distance, the type of cab (Uber/Lyft), when the ride occurred (epoch time), the destination, the starting location, the price, the surge multiplier applied to the price, the transaction id, the product id, and the product name.

# Weather

- Dataset name: weather.csv
- Link to the dataset: https://www.kaggle.com/datasets/ravi72munde/uber-lyft-cab-prices?select=cab_rides.csv 
- Number of Observations: 6276

The weather dataset contains the weather conditions of the most popular areas in Boston. For each observation, the dataset records the temperature in Fahrenheit, the location name, cloud cover, atmospheric pressure in mb, rain in inches for the last hour, the time when the observation was recorded (epoch time), humidity in %, and wind speed in mph.

Our plan is to join the weather dataset onto the cab_rides dataset by matching the destination column of cab_rides with the location column of weather. Because both datasets also have timestamp columns, we will use those as keys as well, so that each ride is paired with the weather conditions at its destination at the time of the ride.
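
The join described above could be sketched with `pd.merge_asof`, which matches each ride to the nearest weather observation for the same location. This is only a minimal sketch under stated assumptions: the toy frames stand in for the real files, the `ride_time` and `obs_time` column names and the one-hour tolerance are our own illustrative choices, and the final approach may differ.

```python
import pandas as pd

# Toy stand-ins for cab_rides.csv and weather.csv (values are illustrative only)
cabs_demo = pd.DataFrame({
    'destination': ['North Station', 'Fenway', 'North Station'],
    'time_stamp': [1543284023677, 1543366822198, 1543553582749],  # epoch milliseconds
    'price': [11.0, 7.0, 26.0],
})
weather_demo = pd.DataFrame({
    'location': ['North Station', 'Fenway', 'North Station'],
    'time_stamp': [1543283000, 1543366000, 1543553000],  # epoch seconds
    'temp': [40.1, 41.3, 38.7],
})

# Convert both timestamps to the same datetime dtype (assumed column names)
cabs_demo['ride_time'] = pd.to_datetime(
    cabs_demo['time_stamp'], unit='ms').astype('datetime64[ns]')
weather_demo['obs_time'] = pd.to_datetime(
    weather_demo['time_stamp'], unit='s').astype('datetime64[ns]')

# merge_asof needs both frames sorted by their time keys
merged = pd.merge_asof(
    cabs_demo.sort_values('ride_time'),
    weather_demo.sort_values('obs_time')[['location', 'obs_time', 'temp']],
    left_on='ride_time',
    right_on='obs_time',
    left_by='destination',         # only match weather from the same place
    right_by='location',
    direction='nearest',           # closest observation before or after the ride
    tolerance=pd.Timedelta('1h'),  # assumed cutoff; rides with no match get NaN
)
print(merged[['destination', 'ride_time', 'price', 'temp']])
```

Matching on the nearest timestamp rather than an exact one matters here because weather observations and rides are recorded at different moments.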

# Setup

In [None]:
#import modules
import pandas as pd
import numpy as np
import datetime
import time

In [None]:
#reading the datasets
cabs = pd.read_csv('cab_rides.csv')
weathers = pd.read_csv('weather.csv')

# Data Cleaning

In [None]:
#drop observations that contain null values and reset the index
cabs = cabs.dropna().reset_index(drop=True)
cabs

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name
0,0.44,Lyft,1544952607890,North Station,Haymarket Square,5.0,1.0,424553bb-7174-41ea-aeb4-fe06d4f4b9d7,lyft_line,Shared
1,0.44,Lyft,1543284023677,North Station,Haymarket Square,11.0,1.0,4bd23055-6827-41c6-b23b-3c491f24e74d,lyft_premier,Lux
2,0.44,Lyft,1543366822198,North Station,Haymarket Square,7.0,1.0,981a3613-77af-4620-a42a-0c0866077d1e,lyft,Lyft
3,0.44,Lyft,1543553582749,North Station,Haymarket Square,26.0,1.0,c2d88af2-d278-4bfd-a8d0-29ca77cc5512,lyft_luxsuv,Lux Black XL
4,0.44,Lyft,1543463360223,North Station,Haymarket Square,9.0,1.0,e0126e1f-8ca9-4f2e-82b3-50505a09db9a,lyft_plus,Lyft XL
...,...,...,...,...,...,...,...,...,...,...
637971,1.00,Uber,1543708385534,North End,West End,9.5,1.0,353e6566-b272-479e-a9c6-98bd6cb23f25,9a0e7b09-b92b-4c41-9779-2ad22b4d779d,WAV
637972,1.00,Uber,1543708385534,North End,West End,13.0,1.0,616d3611-1820-450a-9845-a9ff304a4842,6f72dfc5-27f1-42e8-84db-ccc7a75f6969,UberXL
637973,1.00,Uber,1543708385534,North End,West End,9.5,1.0,633a3fc3-1f86-4b9e-9d48-2b7132112341,55c66225-fbe7-4fd5-9072-eab1ece5e23e,UberX
637974,1.00,Uber,1543708385534,North End,West End,27.0,1.0,727e5f07-a96b-4ad1-a2c7-9abc3ad55b4e,6d318bcc-22a3-4af6-bddd-b409bfce1546,Black SUV
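
Before dropping rows, it can help to see which columns actually contain the missing values. The snippet below is a small sketch on a made-up frame (the `rides_demo` name and its values are hypothetical, not taken from the dataset) that counts nulls per column and then drops only the rows missing a price.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the rides table with one missing price
rides_demo = pd.DataFrame({
    'distance': [0.44, 1.00, 2.10],
    'price': [5.0, np.nan, 13.0],
    'name': ['Shared', 'UberX', 'UberXL'],
})

# Count missing values in each column
missing_per_column = rides_demo.isna().sum()
print(missing_per_column)

# Drop only the rows where price is missing, then renumber the index
cleaned = rides_demo.dropna(subset=['price']).reset_index(drop=True)
print(cleaned)
```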


In [None]:
#checking types for columns
cabs.dtypes

distance            float64
cab_type             object
time_stamp            int64
destination          object
source               object
price               float64
surge_multiplier    float64
id                   object
product_id           object
name                 object
dtype: object

In [None]:
#since epoch time is in milliseconds, convert to seconds
cabs['time_stamp'] = cabs['time_stamp'] / 1000.0

In [None]:
#then convert into datetime
cabs['time_stamp'] = cabs['time_stamp'].apply(lambda x: datetime.datetime.fromtimestamp(x).strftime('%c'))
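
Since time of day is one of the variables in our research question, an alternative worth considering is to keep the timestamps as pandas datetimes instead of formatted strings, so the hour can be extracted directly. The sketch below applies `pd.to_datetime` to two raw millisecond values from the table above; the `ride_time` and `hour` names are our own, and the result is in UTC rather than local time.

```python
import pandas as pd

# Two raw epoch-millisecond values copied from the cabs table
ts_ms = pd.Series([1544952607890, 1543284023677])

# unit='ms' tells pandas the integers are milliseconds since 1970-01-01 (UTC)
ride_time = pd.to_datetime(ts_ms, unit='ms')

# A datetime column exposes components such as the hour of day
hour = ride_time.dt.hour
print(ride_time)
print(hour)
```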

In [None]:
weathers = weathers.dropna().reset_index(drop=True)
weathers

Unnamed: 0,temp,location,clouds,pressure,rain,time_stamp,humidity,wind
0,42.42,Back Bay,1.0,1012.14,0.1228,1545003901,0.77,11.25
1,42.43,Beacon Hill,1.0,1012.15,0.1846,1545003901,0.76,11.32
2,42.50,Boston University,1.0,1012.15,0.1089,1545003901,0.76,11.07
3,42.11,Fenway,1.0,1012.13,0.0969,1545003901,0.77,11.09
4,43.13,Financial District,1.0,1012.14,0.1786,1545003901,0.75,11.49
...,...,...,...,...,...,...,...,...
889,39.51,North Station,1.0,1018.09,0.0280,1543751574,0.92,5.41
890,39.41,Northeastern University,1.0,1018.06,0.0331,1543751574,0.92,5.37
891,39.62,South Station,1.0,1018.08,0.0376,1543751574,0.92,5.47
892,39.47,Theatre District,1.0,1018.08,0.0333,1543751574,0.92,5.43


In [None]:
#weather time_stamp is already in epoch seconds (unlike cabs, which is in milliseconds), so convert directly into datetime
weathers['time_stamp'] = weathers['time_stamp'].apply(lambda x: datetime.datetime.fromtimestamp(x).strftime('%c'))
weathers

Unnamed: 0,temp,location,clouds,pressure,rain,time_stamp,humidity,wind
0,42.42,Back Bay,1.0,1012.14,0.1228,Sun Dec 16 23:45:01 2018,0.77,11.25
1,42.43,Beacon Hill,1.0,1012.15,0.1846,Sun Dec 16 23:45:01 2018,0.76,11.32
2,42.50,Boston University,1.0,1012.15,0.1089,Sun Dec 16 23:45:01 2018,0.76,11.07
3,42.11,Fenway,1.0,1012.13,0.0969,Sun Dec 16 23:45:01 2018,0.77,11.09
4,43.13,Financial District,1.0,1012.14,0.1786,Sun Dec 16 23:45:01 2018,0.75,11.49
...,...,...,...,...,...,...,...,...
889,39.51,North Station,1.0,1018.09,0.0280,Sun Dec  2 11:52:54 2018,0.92,5.41
890,39.41,Northeastern University,1.0,1018.06,0.0331,Sun Dec  2 11:52:54 2018,0.92,5.37
891,39.62,South Station,1.0,1018.08,0.0376,Sun Dec  2 11:52:54 2018,0.92,5.47
892,39.47,Theatre District,1.0,1018.08,0.0333,Sun Dec  2 11:52:54 2018,0.92,5.43
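
Because the two files store epoch time in different units (milliseconds for cab_rides, seconds for weather), a small guard can catch a unit mix-up early. The helper below is a hypothetical sketch, not part of our pipeline: `epoch_to_datetime` and its magnitude threshold are our own assumptions, and the threshold only holds for dates in roughly the modern era.

```python
import pandas as pd

def epoch_to_datetime(values):
    """Hypothetical helper: guess the epoch unit from magnitude, then convert."""
    values = pd.Series(values)
    # For dates near 2018, epoch seconds are about 1.5e9 and milliseconds about 1.5e12
    unit = 'ms' if values.abs().max() > 1e11 else 's'
    return pd.to_datetime(values, unit=unit)

# One raw value from each table above
cab_times = epoch_to_datetime([1544952607890])   # milliseconds
weather_times = epoch_to_datetime([1545003901])  # seconds

# Both should land in 2018; a 1970 result would signal the wrong unit
print(cab_times.dt.year.tolist(), weather_times.dt.year.tolist())
```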
