<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Lab 6.02: Statistical Modeling and Model Validation

> Authors: Tim Book, Matt Brems, Jeff Hale

---

## Objective
Predict bike ridership.

### Imports

In [26]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np




#### Read Data
The `citibike` dataset consists of Citi Bike ridership data for over 224,000 rides in February 2014.

In [27]:
# Read in the citibike data in the data folder in this repository.
citibike = pd.read_csv('./data/citibike_feb2014.csv')
citibike.head(2)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,382,2014-02-01 00:00:00,2014-02-01 00:06:22,294,Washington Square E,40.730494,-73.995721,265,Stanton St & Chrystie St,40.722293,-73.991475,21101,Subscriber,1991,1
1,372,2014-02-01 00:00:03,2014-02-01 00:06:15,285,Broadway & E 14 St,40.734546,-73.990741,439,E 4 St & 2 Ave,40.726281,-73.98978,15456,Subscriber,1979,2


## Explore the data
Familiarize yourself with the data.

If you find any issues, clean them here.

In [28]:
citibike.isna().sum()

tripduration               0
starttime                  0
stoptime                   0
start station id           0
start station name         0
start station latitude     0
start station longitude    0
end station id             0
end station name           0
end station latitude       0
end station longitude      0
bikeid                     0
usertype                   0
birth year                 0
gender                     0
dtype: int64

In [29]:
citibike.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224736 entries, 0 to 224735
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             224736 non-null  int64  
 1   starttime                224736 non-null  object 
 2   stoptime                 224736 non-null  object 
 3   start station id         224736 non-null  int64  
 4   start station name       224736 non-null  object 
 5   start station latitude   224736 non-null  float64
 6   start station longitude  224736 non-null  float64
 7   end station id           224736 non-null  int64  
 8   end station name         224736 non-null  object 
 9   end station latitude     224736 non-null  float64
 10  end station longitude    224736 non-null  float64
 11  bikeid                   224736 non-null  int64  
 12  usertype                 224736 non-null  object 
 13  birth year               224736 non-null  object 
 14  gend

In [30]:
citibike['gender'].value_counts()

1    176526
2     41479
0      6731
Name: gender, dtype: int64

In [31]:
citibike[citibike['gender'] == 0]['usertype'].value_counts()

Customer      6717
Subscriber      14
Name: usertype, dtype: int64

**These `gender == 0` rows stand out. Almost all of them (6,717 of 6,731) belong to `Customer` riders, so gender is probably not collected for customers and 0 is assigned by default.**

### Is average trip duration different by gender?

Conduct a hypothesis test that checks whether the average trip duration is different for `gender=1` and `gender=2`. 

Specify your null and alternative hypotheses, and be sure to state your conclusion carefully!

Null hypothesis: There is no difference in mean trip duration between genders.

Alternative hypothesis: There is a difference in mean trip duration between genders.

Let's use a significance level of $\alpha = 0.05$.

In [32]:
gender_1_trip = citibike[citibike['gender'] == 1]['tripduration']
gender_2_trip = citibike[citibike['gender'] == 2]['tripduration']

In [33]:
gender_1_trip.shape

(176526,)

In [34]:
gender_2_trip.shape

(41479,)

In [35]:
# With help from lesson 2.06
from scipy.stats import mannwhitneyu as mw

mw(gender_1_trip, gender_2_trip)

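MannwhitneyuResult(statistic=3146924216.0, pvalue=0.0)

The reported p-value is effectively 0, well below 0.05, so we reject the null hypothesis. Note that the Mann-Whitney U test is rank-based: strictly, it provides evidence that trip durations for `gender=1` and `gender=2` come from different distributions, not a direct comparison of means.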
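
Since the hypotheses are stated in terms of mean trip duration, a test that targets means directly may be a closer match. The sketch below uses Welch's t-test via `scipy.stats.ttest_ind` with `equal_var=False`. The two arrays are synthetic stand-ins (assumptions for illustration only); in the lab they would be `gender_1_trip` and `gender_2_trip`.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins for gender_1_trip and gender_2_trip (illustration only).
rng = np.random.default_rng(0)
group_1 = rng.exponential(scale=800, size=5000)
group_2 = rng.exponential(scale=950, size=2000)

# equal_var=False requests Welch's t-test, which does not assume equal variances.
result = ttest_ind(group_1, group_2, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

alpha = 0.05
reject_null = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

With samples as large as the Citi Bike data, even a small difference in means can come out statistically significant, so it is worth reporting the size of the difference alongside the p-value.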

#### What numeric columns shouldn't be treated as numeric?

`gender` is categorical, not numeric: its values 0, 1, and 2 are codes, not quantities. `start station id`, `end station id`, and `bikeid` are identifiers and should also be treated as categorical.

In [36]:
citibike['gender'] = citibike['gender'].astype(object)
citibike['start station id'] = citibike['start station id'].astype(object)

### Feature Engineering
Engineer a feature called `age` that gives how old each rider would have been in 2014 (the year the data was collected).
- Note: you will need to clean the data a bit.

In [37]:
citibike['birth year'].value_counts()

1985    9305
1984    9139
1983    8779
1981    8208
1986    8109
        ... 
1910       4
1917       3
1927       2
1921       1
1913       1
Name: birth year, Length: 78, dtype: int64

In [38]:
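# Replace the '\N' missing-value placeholder with 1985, the most common birth year.
# Assigning the result back avoids calling inplace=True on a column selection.
citibike['birth year'] = citibike['birth year'].replace(to_replace='\\N', value='1985')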

In [39]:
citibike['birth year'] = citibike['birth year'].astype(int)

In [40]:
citibike['age'] = 2014 - citibike['birth year'].astype(int)
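
An alternative cleaning approach, sketched here on a tiny made-up frame, is to let `pd.to_numeric(..., errors="coerce")` turn the `\N` placeholder into `NaN`, flag the affected rows, and then impute. The frame `demo`, the flag column `birth_year_missing`, and the choice of the median as fill value are assumptions for illustration, not part of the lab's solution.

```python
import pandas as pd

# Tiny hypothetical frame mimicking the raw 'birth year' column, including '\N'.
demo = pd.DataFrame({"birth year": ["1991", "1979", "\\N", "1985", "1985"]})

# errors="coerce" converts anything unparseable (here '\N') to NaN.
birth_year = pd.to_numeric(demo["birth year"], errors="coerce")

# Keep a record of which rows were imputed.
demo["birth_year_missing"] = birth_year.isna()

# Fill with the median of the observed years, then compute age as of 2014.
demo["birth year"] = birth_year.fillna(birth_year.median()).astype(int)
demo["age"] = 2014 - demo["birth year"]
print(demo)
```

Keeping the indicator column makes it possible to check later whether the imputed rows behave differently from the rest.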

In [41]:
citibike.head(2)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,age
0,382,2014-02-01 00:00:00,2014-02-01 00:06:22,294,Washington Square E,40.730494,-73.995721,265,Stanton St & Chrystie St,40.722293,-73.991475,21101,Subscriber,1991,1,23
1,372,2014-02-01 00:00:03,2014-02-01 00:06:15,285,Broadway & E 14 St,40.734546,-73.990741,439,E 4 St & 2 Ave,40.726281,-73.98978,15456,Subscriber,1979,2,35


### Split your data into X and y and then train/test sets


Use `tripduration` as your `y` variable.

Use `age`, `usertype`, `gender`, and `start station id` for your `X` variables.

In [42]:
citibike.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224736 entries, 0 to 224735
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             224736 non-null  int64  
 1   starttime                224736 non-null  object 
 2   stoptime                 224736 non-null  object 
 3   start station id         224736 non-null  object 
 4   start station name       224736 non-null  object 
 5   start station latitude   224736 non-null  float64
 6   start station longitude  224736 non-null  float64
 7   end station id           224736 non-null  int64  
 8   end station name         224736 non-null  object 
 9   end station latitude     224736 non-null  float64
 10  end station longitude    224736 non-null  float64
 11  bikeid                   224736 non-null  int64  
 12  usertype                 224736 non-null  object 
 13  birth year               224736 non-null  int64  
 14  gend

In [43]:
X = citibike[['age', 'usertype', 'gender', 'start station id']]
y = citibike['tripduration']


#### One-hot encode `start station id`, `gender`, and `usertype`. 

In [44]:
X = pd.get_dummies(X, columns=['usertype', 'gender', 'start station id'])

In [45]:
X.head()

Unnamed: 0,age,usertype_Customer,usertype_Subscriber,gender_0,gender_1,gender_2,start station id_72,start station id_79,start station id_82,start station id_83,...,start station id_2006,start station id_2008,start station id_2009,start station id_2010,start station id_2012,start station id_2017,start station id_2021,start station id_2022,start station id_2023,start station id_3002
0,23,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,35,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,66,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,33,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,24,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [46]:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1331)

#### Fit a Linear Regression model in scikit-learn predicting `tripduration`.

In [47]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

-0.0016860210671651998

### Evaluate your model
#### Use several different evaluation metrics on the test set. 
How did your model do? 

In [48]:
lr.score(X_train, y_train)

0.004787950581784051

In [49]:
lr.score(X_test, y_test)

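-0.0016860210671651998

With $R^2$ of about 0.005 on the training set and slightly below 0 on the test set, the model explains essentially none of the variation in `tripduration`.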
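
The prompt asks for several metrics, and `.score` only reports $R^2$. A minimal sketch of computing $R^2$, MAE, and RMSE together follows. It runs on synthetic data (the arrays `X_demo` and `y_demo` are made up); in the lab, `lr`, `X_test`, and `y_test` would take their place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustration only).
rng = np.random.default_rng(1331)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1331)
model = LinearRegression().fit(X_tr, y_tr)
preds = model.predict(X_te)

metrics = {
    "R2": r2_score(y_te, preds),
    "MAE": mean_absolute_error(y_te, preds),
    # Taking the square root of the MSE works regardless of scikit-learn version.
    "RMSE": np.sqrt(mean_squared_error(y_te, preds)),
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MAE and RMSE are in the units of the target (seconds for `tripduration`), which makes them easier to explain than $R^2$ alone.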

#### Does this model outperform a baseline? (e.g. setting $\hat{y}$ to be the mean of our training `y` values.)

In [50]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, np.full_like(y_test, y_train.mean()), squared=False)

4958.059734165046

In [51]:
mean_squared_error(y_test, lr.predict(X_test), squared=False)

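4962.103920531877

The linear model's test RMSE (about 4962.10) is slightly higher than the baseline's (about 4958.06), so this model does not outperform the baseline.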
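
Another way to build the mean baseline is scikit-learn's `DummyRegressor`, sketched below on a few made-up integer targets. One detail worth knowing: `np.full_like` copies the dtype of the array it is given, so with an integer target the training mean is cast to an integer before it is used as the prediction. The arrays `y_tr` and `y_te` and the placeholder feature matrices are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical integer targets, in seconds (illustration only).
y_tr = np.array([300, 450, 600, 1200, 382])
y_te = np.array([372, 500, 900])

# DummyRegressor ignores the features, so placeholder zeros are enough.
X_tr_placeholder = np.zeros((len(y_tr), 1))
X_te_placeholder = np.zeros((len(y_te), 1))

baseline = DummyRegressor(strategy="mean").fit(X_tr_placeholder, y_tr)
baseline_preds = baseline.predict(X_te_placeholder)
baseline_rmse = np.sqrt(mean_squared_error(y_te, baseline_preds))

# For comparison: np.full_like keeps y_te's integer dtype, so the mean is cast to int.
truncated_preds = np.full_like(y_te, y_tr.mean())
print(baseline_preds[0], truncated_preds[0], baseline_rmse)
```

On a target with values in the hundreds or thousands, the truncation changes the baseline RMSE only marginally, but a float baseline is the more faithful comparison.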

#### Interpret the age and gender coefficients

In [60]:
pd.DataFrame(lr.coef_.reshape(1, -1), columns=X_train.columns)[['age', 'gender_0', 'gender_1', 'gender_2']]

Unnamed: 0,age,gender_0,gender_1,gender_2
0,4.096329,-1879361000000.0,-1879361000000.0,-1879361000000.0


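- `age`: For each additional year of age, predicted `tripduration` increases by about 4.10 seconds, holding all else equal.
- `gender_0`, `gender_1`, `gender_2`: All three coefficients display as -1.879361e+12. These values are not interpretable on their own. Because every dummy level was kept, the three gender columns always sum to 1 and are perfectly collinear with the intercept, so the fit can shift an arbitrary constant between them. Only differences between gender coefficients are meaningful, and at the displayed precision those differences are not visible.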
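
One common way to get interpretable dummy coefficients is to drop one level per categorical feature, so that each remaining coefficient is measured relative to that reference level. A minimal sketch on synthetic data follows. The frame `demo`, the effect sizes, and the noise level are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(18, 70, size=300),
    "gender": rng.choice([0, 1, 2], size=300),
})

# Hypothetical effects: relative to gender 1, gender 2 adds 100 s and gender 0 adds 300 s.
offsets = demo["gender"].map({0: 300.0, 1: 0.0, 2: 100.0})
demo["tripduration"] = 500 + 4 * demo["age"] + offsets + rng.normal(scale=20, size=300)

# drop_first=True drops gender_0, which becomes the reference level.
X_demo = pd.get_dummies(demo[["age", "gender"]], columns=["gender"],
                        drop_first=True, dtype=int)
model = LinearRegression().fit(X_demo, demo["tripduration"])

coefs = pd.Series(model.coef_, index=X_demo.columns)
print(coefs)
```

Here the `gender_1` and `gender_2` coefficients estimate the difference in trip duration relative to `gender_0`, holding age fixed, and they come out on the scale of the target instead of around 1e12.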

---
## Statsmodels 
#### Fit the same Linear Regression model using `statsmodels`.

In [70]:
import statsmodels



<module 'statsmodels' from '/Users/andresperez/opt/anaconda3/lib/python3.9/site-packages/statsmodels/__init__.py'>

## Evaluate your model
Using the `statsmodels` summary, test whether `age` has a significant effect when predicting `tripduration`.

#### Specify your null and alternative hypotheses.
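
One conventional way to write these hypotheses, assuming $\beta_{\text{age}}$ denotes the coefficient on `age` in the fitted linear model (this symbol is introduced here for illustration), is:

$$
H_0: \beta_{\text{age}} = 0 \qquad \text{versus} \qquad H_A: \beta_{\text{age}} \neq 0
$$

Under this framing, the p-value in the `age` row of the summary table is compared against the chosen significance level, with all other features held in the model.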

#### State your conclusion carefully and correctly **in the context of your model**!

---
## A fancier model 🚀

Fit a Gradient Boosting model using GridSearchCV to search over a few hyperparameters. 
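
A minimal sketch of such a search, assuming `GradientBoostingRegressor` from scikit-learn and a deliberately small grid. The synthetic data, the grid values, `cv=3`, and the scoring choice are all illustrative assumptions; in the lab, `X_train` and `y_train` would be used, and fitting on the full one-hot-encoded data is likely to be much slower.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data with a nonlinear component (illustration only).
rng = np.random.default_rng(1331)
X_demo = rng.normal(size=(300, 4))
y_demo = 3 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1331)

# A small, hypothetical grid: 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=1331),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_tr, y_tr)

best_params = grid.best_params_
# best_estimator_.score reports R^2, matching LinearRegression.score.
test_r2 = grid.best_estimator_.score(X_te, y_te)
print(best_params, round(test_r2, 3))
```

Scoring the refit best estimator on the held-out test set gives an $R^2$ that can be compared directly with the linear regression's test score.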

#### Which parameters were best? 

#### Does your best model perform better than the linear regression model?