<a id='top'></a>
# Chicago Ridesharing

#### Contributors: Muoyo Okome, Anesu Masube

<a id='toc'></a>
### Table of Contents
1. [Problem Statement](#problemstatement)
2. [Data Sources](#datasources)
3. [Data Cleaning](#datacleaning)
4. [Linear Regression](#regression)  
5. [Findings](#findings)
6. [Next Steps](#nextsteps)

In [5]:
# Import necessary libraries
import warnings
import pandas as pd
warnings.filterwarnings('ignore')

<a id='problemstatement'></a>
### Problem Statement

**Can data help ridesharing drivers earn more?**

The key question we set out to answer is whether knowing where and when a ridesharing trip started can help us predict the fare for that trip.

Our goal is to eventually provide these insights to ridesharing drivers as a service to help them choose the best driving schedules and waiting positions to optimize their earnings.

[Back to Top ↑](#top)

<a id='datasources'></a>
### Data Sources

#### **[City of Chicago Data Portal](https://data.cityofchicago.org/Transportation/Transportation-Network-Providers-Trips/m6dm-c72p)**

- All trips, starting November 2018, reported by Transportation Network Providers (sometimes called rideshare companies) to the City of Chicago as part of routine reporting required by ordinance.

- Census Tracts are suppressed in some cases, and times are rounded to the nearest 15 minutes. Fares are rounded to the nearest `$`2.50 and tips are rounded to the nearest `$`1.00. 

- 101 million rows of data!

#### **[Dark Sky Weather API](https://darksky.net/dev/docs#time-machine-request)**

The Dark Sky API allows you to look up the weather anywhere on the globe, returning (where available):

- Current weather conditions
- Minute-by-minute forecasts out to one hour
- Hour-by-hour and day-by-day forecasts out to seven days
- Hour-by-hour and day-by-day observations going back decades
- Severe weather alerts in the US, Canada, European Union member nations, and Israel


<br> 

[Back to Top ↑](#top)

In [12]:
# Import utility functions
%run ../python_files/utils

In [14]:
# Read in original data: 1 million records
df = get_trip_records(limit=1000000)

<a id='datacleaning'></a>
### Data Cleaning

Before beginning our analysis, we performed a number of operations to get the data ready to work with, including, but not limited to: 

- Limiting trip data to the columns we were most interested in: **'trip_id', 'trip_start_timestamp', 'trip_end_timestamp', 'trip_seconds', 'trip_miles', 'pickup_community_area', 'fare', 'tip', 'additional_charges', 'trip_total'**

- Converting numeric & timestamp data from strings to the appropriate datatypes

- Imputing missing values with the median, 0, or "missing" depending on what made the most sense for each particular variable
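
The steps above can be sketched with pandas as follows. This is a minimal illustration on a tiny hand-made frame, not the project's actual **clean_data()** implementation: the function name `clean_trips_sketch`, the sample values, and the choice of which column receives which imputation are all assumptions.

```python
import pandas as pd

# Columns named in the list above
KEEP_COLUMNS = [
    "trip_id", "trip_start_timestamp", "trip_end_timestamp", "trip_seconds",
    "trip_miles", "pickup_community_area", "fare", "tip",
    "additional_charges", "trip_total",
]
NUMERIC_COLUMNS = ["trip_seconds", "trip_miles", "fare", "tip",
                   "additional_charges", "trip_total"]
TIMESTAMP_COLUMNS = ["trip_start_timestamp", "trip_end_timestamp"]


def clean_trips_sketch(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for clean_data(); imputation choices are assumptions."""
    df = raw[KEEP_COLUMNS].copy()
    # Strings -> numbers / timestamps; unparseable values become NaN / NaT
    for col in NUMERIC_COLUMNS:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    for col in TIMESTAMP_COLUMNS:
        df[col] = pd.to_datetime(df[col], errors="coerce")
    # Assumed imputation: median for fare, 0 for tip, "missing" for the area
    df["fare"] = df["fare"].fillna(df["fare"].median())
    df["tip"] = df["tip"].fillna(0)
    df["pickup_community_area"] = df["pickup_community_area"].fillna("missing")
    return df


# Made-up raw records, with every value stored as a string or missing
raw = pd.DataFrame({
    "trip_id": ["a", "b", "c"],
    "trip_start_timestamp": ["2019-01-01T08:15:00", "2019-01-01T09:00:00",
                             "2019-01-02T23:45:00"],
    "trip_end_timestamp": ["2019-01-01T08:30:00", "2019-01-01T09:30:00",
                           "2019-01-03T00:00:00"],
    "trip_seconds": ["900", "1800", "850"],
    "trip_miles": ["2.1", "7.4", "1.9"],
    "pickup_community_area": ["8", None, "76"],
    "fare": ["7.5", None, "12.5"],
    "tip": [None, "2", "0"],
    "additional_charges": ["2.5", "2.5", "2.5"],
    "trip_total": ["10.0", "17.0", "15.0"],
    "extra_column": [1, 2, 3],
})
cleaned = clean_trips_sketch(raw)
```

Using `errors="coerce"` means malformed values surface as missing data, so they pass through the same imputation step instead of raising an exception partway through a large load.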

We then created columns for the weekday, hour, and time block in which each trip started. (There are eight three-hour time blocks in each 24-hour day, with block 0 starting at 12AM.) We combined this information into the **start_date_plus hour** column, which we used to merge trip data with weather data at the hourly level. The weather data covers 425 days of hourly observations for Chicago, obtained via the Dark Sky API and our WeatherGetter class (which you can see at work in **weather.ipynb**) and then saved to a CSV.
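
As a rough sketch of the feature engineering and hourly merge described above, assuming pandas datetime columns: the merge-key names `start_hour_key` and `hour_key` and all sample values are hypothetical and are not taken from the project's code.

```python
import pandas as pd

trips = pd.DataFrame({
    "trip_start_timestamp": pd.to_datetime([
        "2019-03-04 00:15:00",  # Monday, just after midnight
        "2019-03-04 05:45:00",  # Monday, early morning
        "2019-03-09 23:30:00",  # Saturday, late evening
    ]),
    "trip_total": [12.5, 30.0, 17.5],
})

start = trips["trip_start_timestamp"]
trips["start_weekday"] = start.dt.weekday             # Monday = 0 ... Sunday = 6
trips["start_hour"] = start.dt.hour                   # 0 ... 23
trips["start_time_block"] = trips["start_hour"] // 3  # 8 blocks; block 0 starts at 12AM
# Hypothetical hourly merge key: the start timestamp truncated to the hour
trips["start_hour_key"] = start.dt.floor("h")

# Made-up hourly weather observations keyed by the start of each hour
weather = pd.DataFrame({
    "hour_key": pd.to_datetime([
        "2019-03-04 00:00:00", "2019-03-04 05:00:00", "2019-03-09 23:00:00",
    ]),
    "apparentTemperature": [18.2, 15.9, 41.0],
})

# Left merge keeps every trip, even if an hour has no weather record
merged = trips.merge(weather, left_on="start_hour_key", right_on="hour_key", how="left")
```

Integer division by 3 maps hours 0-2 to block 0, hours 3-5 to block 1, and so on up to block 7 for hours 21-23.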

To make the project more modular and easier to follow & build upon, we created separate .py files to handle the heavy lifting for tasks such as data extraction & cleaning, visualizations, and linear regressions. We also created a function called **get_random_samples()** (located in **utils.py**), which allows us to draw random samples of trip data from the Socrata API. Please note that random sampling takes significantly longer than pulling a single block of records, because we run many queries instead of one and each query carries a fixed amount of overhead regardless of how many records it returns.
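
The many-small-queries pattern can be sketched as below. This is not the code of **get_random_samples()**: the helper `random_sample_sketch`, its parameters, and the `fake_fetch` stand-in are hypothetical, and a real fetch function would issue one API request per page (for example, a Socrata query using `$offset` and `$limit`).

```python
import random
from typing import Callable, Dict, List


def random_sample_sketch(
    fetch_page: Callable[[int, int], List[Dict]],
    total_rows: int,
    n_queries: int,
    rows_per_query: int,
    seed: int = 0,
) -> List[Dict]:
    """Draw n_queries small pages at random offsets (hypothetical helper)."""
    rng = random.Random(seed)
    # Pick distinct pages so that no record is drawn twice
    n_pages = total_rows // rows_per_query
    page_indices = rng.sample(range(n_pages), n_queries)
    records: List[Dict] = []
    for page in page_indices:
        # One request per page: this per-query overhead is why sampling is slow
        records.extend(fetch_page(page * rows_per_query, rows_per_query))
    return records


def fake_fetch(offset: int, limit: int) -> List[Dict]:
    """Stand-in for a real API call; returns synthetic records."""
    return [{"trip_id": i} for i in range(offset, offset + limit)]


sample = random_sample_sketch(fake_fetch, total_rows=1_000, n_queries=5, rows_per_query=10)
```

Passing the fetch function in as an argument keeps the sampling logic testable without network access.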

In [15]:
# Clean data & load into final dataframe
%run ../python_files/data_cleaning

In [16]:
df = clean_data(df)



[Back to Top ↑](#top)

In [17]:
# Import python files we've created to help
%run ../python_files/regression
%run ../python_files/visualizations

<a id='regression'></a>
### Linear Regression

For our predictive analysis, we used linear regression with the following variables:

- **Dependent variable:** 
**'trip_total'** (OR in some cases we instead looked at **'fare'**)<p>

- **Independent variables:**
    - **'apparentTemperature'**
    - **'start_weekday'**
    - **'start_hour'** (OR in some cases we instead looked at **'start_time_block'**)
    - **'pickup_community_area'** (Chicago is divided into 77 community areas, each of which belongs to one of nine "sides")


Before performing the actual regression, we pre-processed the data using the following steps:

- Splitting out continuous (temperature) & categorical variables (the rest of the independent variables) to be dealt with separately.

- Splitting data into training and test sets. We reserved 25% of our data for testing purposes, meaning that we do not work with it until after our model is finalized, to avoid data leakage.

- Performing one-hot encoding on our categorical variables to make them usable for linear regression

- Combining our categorical and continuous features back into a final dataframe
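
The preprocessing steps above might look roughly like this with pandas and scikit-learn. It is a sketch on made-up rows, not the body of **get_train_test_split()**; in particular, aligning the test columns to the training columns with `reindex` is an assumption about how unseen categories could be handled.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up trips using the variable names from the list above
df = pd.DataFrame({
    "apparentTemperature": [30.0, 45.5, 60.2, 28.1, 75.3, 50.0, 33.3, 64.8],
    "start_weekday": [0, 1, 2, 3, 4, 5, 6, 0],
    "start_time_block": [0, 2, 5, 7, 3, 1, 6, 4],
    "pickup_community_area": ["8", "76", "32", "8", "28", "76", "8", "32"],
    "trip_total": [12.5, 40.0, 15.0, 10.0, 17.5, 42.5, 12.5, 20.0],
})

continuous = ["apparentTemperature"]
categorical = ["start_weekday", "start_time_block", "pickup_community_area"]

# Hold out 25% of the rows for testing
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    df[continuous + categorical], df["trip_total"], test_size=0.25, random_state=42
)

# One-hot encode the training categories, then align the test set to the same columns
train_dummies = pd.get_dummies(X_train_raw[categorical].astype(str), dtype=int)
test_dummies = pd.get_dummies(X_test_raw[categorical].astype(str), dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)

# Recombine continuous and encoded categorical features
X_train = pd.concat([X_train_raw[continuous], train_dummies], axis=1)
X_test = pd.concat([X_test_raw[continuous], test_dummies], axis=1)
```

Casting the categorical columns to strings makes `get_dummies` treat numeric codes such as the weekday as categories instead of passing them through as numbers.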


In [18]:
X_train, X_test, y_train, y_test = get_train_test_split(df, test_size=.25)

### Ordinary Least Squares via statsmodels

With our preprocessing complete (we conduct it within the **get_train_test_split()** function, as seen above), we are ready to run our series of regressions.

First, we run an OLS regression using statsmodels and learn that our model explains 26-30% of the variance of our dependent variable **trip_total**, depending on the specific set of data we are working with.
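
For readers less familiar with the metric, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes it by hand with NumPy on synthetic data; the numbers are invented and unrelated to the trip data, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=4.0, size=n)  # noisy linear signal

# Design matrix with an intercept column, solved by ordinary least squares
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
ss_res = np.sum(residuals ** 2)        # variation the model leaves unexplained
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
r_squared = 1.0 - ss_res / ss_tot
mse = ss_res / n                       # mean squared error on the same data
```

An R-squared of 0.27 therefore means the residual sum of squares is still about 73% of the total, so most of the variation in price is left unexplained by the chosen features.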

In [19]:
OLS(y_train, X)

| | | | |
|---|---|---|---|
| Dep. Variable: | trip_total | R-squared: | 0.266 |
| Model: | OLS | Adj. R-squared: | 0.266 |
| Method: | Least Squares | F-statistic: | 2991.0 |
| Date: | Thu, 23 Jan 2020 | Prob (F-statistic): | 0.0 |
| Time: | 14:53:11 | Log-Likelihood: | -2738100.0 |
| No. Observations: | 750000 | AIC: | 5476000.0 |
| Df Residuals: | 749908 | BIC: | 5477000.0 |
| Df Model: | 91 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| apparentTemperature | 0.0120 | 0.001 | 16.438 | 0.000 | 0.011 | 0.013 |
| start_weekday_0.0 | 7.1338 | 0.032 | 220.040 | 0.000 | 7.070 | 7.197 |
| start_weekday_1.0 | 6.6222 | 0.031 | 211.549 | 0.000 | 6.561 | 6.684 |
| start_weekday_2.0 | 6.6222 | 0.032 | 208.079 | 0.000 | 6.560 | 6.685 |
| start_weekday_3.0 | 7.2006 | 0.031 | 233.001 | 0.000 | 7.140 | 7.261 |
| start_weekday_4.0 | 7.1891 | 0.028 | 255.579 | 0.000 | 7.134 | 7.244 |
| start_weekday_5.0 | 6.8984 | 0.029 | 237.822 | 0.000 | 6.841 | 6.955 |
| start_weekday_6.0 | 6.5908 | 0.031 | 209.304 | 0.000 | 6.529 | 6.653 |
| start_time_block_0.0 | 4.9302 | 0.040 | 122.651 | 0.000 | 4.851 | 5.009 |

| | | | |
|---|---|---|---|
| Omnibus: | 901996.767 | Durbin-Watson: | 1.997 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1678411012.714 |
| Skew: | 5.486 | Prob(JB): | 0.0 |
| Kurtosis: | 234.493 | Cond. No. | 1e+16 |


### LinearRegression via scikit-learn

Next, we run linear regression with scikit-learn using the same data. It's comforting to see that the results agree with our findings from statsmodels.

In [20]:
LinearRegression(X_train, y_train)

Training r^2: 0.2663164490958597
Training MSE: 86.79853712187625


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

### Ridge & Lasso

We tested whether Ridge or Lasso could improve predictive power by reducing model variance. Unfortunately, neither helped, regardless of the value of lambda/alpha: Lasso shrank nearly every coefficient to zero, yielding an R-squared of approximately 0, while Ridge made little to no change to our model coefficients.
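
The same qualitative pattern can be reproduced on synthetic data with scikit-learn's estimators used directly (the notebook's own **Lasso()** and **Ridge()** helpers live in **regression.py** and are not shown here). In this sketch the data, the group effects, and `alpha=1.0` are all assumptions chosen for illustration: the features are one-hot columns with weak effects relative to the noise, loosely mimicking a low R-squared setting.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 2000
# Synthetic stand-in for one-hot features: each row belongs to one of 4 groups
groups = rng.integers(0, 4, size=n)
X = np.eye(4)[groups]
group_effect = np.array([0.0, 0.5, -0.5, 1.0])
y = 10.0 + group_effect[groups] + rng.normal(scale=3.0, size=n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can set coefficients exactly to 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients smoothly

print("OLS   r^2:", ols.score(X, y))
print("Lasso r^2:", lasso.score(X, y), "coefficients:", lasso.coef_)
print("Ridge r^2:", ridge.score(X, y))
```

When the L1 penalty outweighs each feature's weak covariance with the target, Lasso zeroes every coefficient and falls back to predicting the mean, while the L2 penalty at the same strength barely moves the fit when each category has hundreds of observations.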

In [21]:
lasso = Lasso(X_train, y_train)
lasso.coef_

Training r^2: 0.00020689758405956216
Training MSE: 118.28066556392537


array([ 0.00688409,  0.        , -0.        , -0.        ,  0.        ,
        0.        , -0.        , -0.        , -0.        ,  0.        ,
        0.        , -0.        ,  0.        ,  0.        , -0.        ,
       -0.        ,  0.        , -0.        , -0.        , -0.        ,
       -0.        , -0.        , -0.        , -0.        , -0.        ,
        0.        ,  0.        , -0.        ,  0.        , -0.        ,
       -0.        , -0.        , -0.        , -0.        , -0.        ,
       -0.        , -0.        , -0.        , -0.        , -0.        ,
       -0.        , -0.        , -0.        , -0.        , -0.        ,
       -0.        , -0.        , -0.        , -0.        , -0.        ,
       -0.        , -0.        , -0.        , -0.        , -0.        ,
       -0.        , -0.        , -0.        , -0.        , -0.        ,
       -0.        , -0.        , -0.        , -0.        ,  0.        ,
       -0.        , -0.        ,  0.        ,  0.        , -0.  

In [22]:
ridge = Ridge(X_train, y_train)
ridge.coef_

Training r^2: 0.2663162105131872
Training MSE: 86.79856534743334


array([ 1.19813400e-02,  2.40127698e-01, -2.71657422e-01, -2.71589687e-01,
        3.06759860e-01,  2.95096734e-01,  4.21314442e-03, -3.02950328e-01,
       -1.10156137e+00,  3.22091956e+00,  1.00939697e+00, -4.22991443e-01,
       -1.51614061e-01, -4.19439871e-02, -1.26921156e+00, -1.24299411e+00,
        1.25069057e+01, -4.49767942e-01, -1.35629806e+00, -9.89495900e-01,
       -3.39305232e-01, -6.59998141e-01, -1.20873458e+00, -1.68353368e+00,
       -1.05301950e+00,  1.34281281e+00,  5.14070831e-01,  1.37896101e-02,
        2.04361837e+00, -7.11012806e-01, -4.68126457e-01, -2.32887865e-01,
       -6.08206931e-01,  2.91434588e-01, -3.84007586e-01, -1.49245514e+00,
       -1.72560475e+00, -1.40693643e+00, -1.65028466e+00, -2.02599982e+00,
       -2.07321346e+00, -1.07782819e+00, -1.99312801e+00, -2.26212999e+00,
       -1.55514931e+00, -2.01960101e+00, -1.50126683e+00, -1.97794290e+00,
       -1.61315522e-01,  4.53947145e-01, -1.56253178e+00, -2.12797816e+00,
       -1.84055310e+00, -

[Back to Top ↑](#top)

<a id='findings'></a>
### Findings

1. Drivers can earn the most around 5 am: peak fares occur between 4 and 6 am every day.

2. Airport pickups lead to higher earnings.

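3. Temperature has a statistically significant relationship with trip price (the **apparentTemperature** coefficient in the OLS summary is about 0.012, with p < 0.001), though the effect per degree is small.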


[Back to Top ↑](#top)

<a id='nextsteps'></a>
### Next Steps

1. Further refine the model to recommend ideal pickup locations given time & weather (possibly delivered to drivers via an app).

2. Find additional independent variables that increase the model’s predictive power (for example, events or occasions).

3. Explore how the model could be extended to help other groups, such as riders & competing rideshare companies.


[Back to Top ↑](#top)