<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Lab 3.02: Statistical Modeling and Model Validation

> Authors: Tim Book, Matt Brems

---

## Objective
The goal of this lab is to guide you through the modeling workflow to produce the best model you can. In this lesson, you will follow all best practices when slicing your data and validating your model. 

## Imports

In [1]:
# Import everything you need here.
# You may want to return to this cell to import more things later in the lab.
# DO NOT COPY AND PASTE FROM OUR CLASS SLIDES!
# Muscle memory is important!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression#, LassoCV, RidgeCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn import metrics

## Read Data
The `citibike` dataset consists of Citi Bike ridership data for over 224,000 rides in February 2014.

In [2]:
# Read in the citibike data in the data folder in this repository.
bike_df = pd.read_csv('./data/citibike_feb2014.csv')
bike_df.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,382,2014-02-01 00:00:00,2014-02-01 00:06:22,294,Washington Square E,40.730494,-73.995721,265,Stanton St & Chrystie St,40.722293,-73.991475,21101,Subscriber,1991,1
1,372,2014-02-01 00:00:03,2014-02-01 00:06:15,285,Broadway & E 14 St,40.734546,-73.990741,439,E 4 St & 2 Ave,40.726281,-73.98978,15456,Subscriber,1979,2
2,591,2014-02-01 00:00:09,2014-02-01 00:10:00,247,Perry St & Bleecker St,40.735354,-74.004831,251,Mott St & Prince St,40.72318,-73.9948,16281,Subscriber,1948,2
3,583,2014-02-01 00:00:32,2014-02-01 00:10:15,357,E 11 St & Broadway,40.732618,-73.99158,284,Greenwich Ave & 8 Ave,40.739017,-74.002638,17400,Subscriber,1981,1
4,223,2014-02-01 00:00:41,2014-02-01 00:04:24,401,Allen St & Rivington St,40.720196,-73.989978,439,E 4 St & 2 Ave,40.726281,-73.98978,19341,Subscriber,1990,1


## Explore the data
Use this space to familiarize yourself with the data.

Convince yourself there are no issues with the data. If you find any issues, clean them here.

In [3]:
#reading out columns
bike_df.columns

Index(['tripduration', 'starttime', 'stoptime', 'start station id',
       'start station name', 'start station latitude',
       'start station longitude', 'end station id', 'end station name',
       'end station latitude', 'end station longitude', 'bikeid', 'usertype',
       'birth year', 'gender'],
      dtype='object')

In [4]:
#displaying info and getting an idea of null counts
bike_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224736 entries, 0 to 224735
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             224736 non-null  int64  
 1   starttime                224736 non-null  object 
 2   stoptime                 224736 non-null  object 
 3   start station id         224736 non-null  int64  
 4   start station name       224736 non-null  object 
 5   start station latitude   224736 non-null  float64
 6   start station longitude  224736 non-null  float64
 7   end station id           224736 non-null  int64  
 8   end station name         224736 non-null  object 
 9   end station latitude     224736 non-null  float64
 10  end station longitude    224736 non-null  float64
 11  bikeid                   224736 non-null  int64  
 12  usertype                 224736 non-null  object 
 13  birth year               224736 non-null  object 
 14  gend

In [5]:
#locating all columns in bike_df where the dtype is object
bike_df.loc[:, bike_df.dtypes == 'object']

#'birth year' should not be dtype object

Unnamed: 0,starttime,stoptime,start station name,end station name,usertype,birth year
0,2014-02-01 00:00:00,2014-02-01 00:06:22,Washington Square E,Stanton St & Chrystie St,Subscriber,1991
1,2014-02-01 00:00:03,2014-02-01 00:06:15,Broadway & E 14 St,E 4 St & 2 Ave,Subscriber,1979
2,2014-02-01 00:00:09,2014-02-01 00:10:00,Perry St & Bleecker St,Mott St & Prince St,Subscriber,1948
3,2014-02-01 00:00:32,2014-02-01 00:10:15,E 11 St & Broadway,Greenwich Ave & 8 Ave,Subscriber,1981
4,2014-02-01 00:00:41,2014-02-01 00:04:24,Allen St & Rivington St,E 4 St & 2 Ave,Subscriber,1990
...,...,...,...,...,...,...
224731,2014-02-28 23:57:13,2014-03-01 00:11:21,Broadway & W 32 St,E 7 St & Avenue A,Subscriber,1976
224732,2014-02-28 23:57:55,2014-03-01 00:20:30,W 20 St & 8 Ave,Avenue D & E 3 St,Subscriber,1985
224733,2014-02-28 23:58:17,2014-03-01 00:03:21,E 17 St & Broadway,W 20 St & 7 Ave,Subscriber,1968
224734,2014-02-28 23:59:10,2014-03-01 00:04:18,S Portland Ave & Hanson Pl,Fulton St & Grand Ave,Subscriber,1982


In [6]:
#checking 'birth year values'
bike_df['birth year'].sort_values()

220029    1899
25042     1899
125527    1899
177826    1899
124361    1899
          ... 
20792       \N
150054      \N
125890      \N
20757       \N
161082      \N
Name: birth year, Length: 224736, dtype: object

In [7]:
bike_df.groupby(['birth year']).size()
#lots of odd '\N' entries

birth year
1899       9
1900      68
1901      11
1907       5
1910       4
        ... 
1994    1215
1995     827
1996     334
1997     251
\N      6717
Length: 78, dtype: int64
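
The `\N` placeholders can also be handled at load time. The following is a minimal sketch, assuming the placeholder string is exactly `\N`; it uses a small made-up CSV string (not the real file) and the `na_values` parameter of `pd.read_csv` so that the column arrives as a numeric dtype with `NaN` where the placeholder was.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for the real file; values are illustrative only.
csv_text = "tripduration,birth year\n382,1991\n664,\\N\n372,1979\n"

# na_values tells pandas to treat the literal string \N as missing.
toy = pd.read_csv(io.StringIO(csv_text), na_values=['\\N'])

print(toy['birth year'].isna().sum())  # number of placeholder entries
print(toy.dtypes)                      # 'birth year' is numeric, not object
```

With this approach the missing entries can be counted with `.isna()` instead of by string comparison.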

In [8]:
#dropping the entries where 'birth year' is \N to calculate the median for birth year
bike_df_just_N = bike_df[bike_df['birth year'] == '\\N']
bike_df_for_median = bike_df.drop(bike_df_just_N.index, axis=0) 

In [9]:
bike_df_just_N.head()
#these rows also have gender = 0, so they are excluded when calculating the median

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
31,664,2014-02-01 00:08:47,2014-02-01 00:19:51,237,E 11 St & 2 Ave,40.730473,-73.986724,349,Rivington St & Ridge St,40.718502,-73.983299,17540,Customer,\N,0
55,836,2014-02-01 00:16:10,2014-02-01 00:30:06,488,W 39 St & 9 Ave,40.756458,-73.993722,297,E 15 St & 3 Ave,40.734232,-73.986923,16731,Customer,\N,0
222,1277,2014-02-01 01:17:50,2014-02-01 01:39:07,469,Broadway & W 53 St,40.763441,-73.982681,336,Sullivan St & Washington Sq,40.730477,-73.999061,20728,Customer,\N,0
266,29906,2014-02-01 01:44:59,2014-02-01 10:03:25,294,Washington Square E,40.730494,-73.995721,368,Carmine St & 6 Ave,40.730386,-74.00215,18944,Customer,\N,0
293,2625,2014-02-01 01:56:32,2014-02-01 02:40:17,395,Bond St & Schermerhorn St,40.68807,-73.984106,395,Bond St & Schermerhorn St,40.68807,-73.984106,19782,Customer,\N,0


In [10]:
bike_df_for_median.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,382,2014-02-01 00:00:00,2014-02-01 00:06:22,294,Washington Square E,40.730494,-73.995721,265,Stanton St & Chrystie St,40.722293,-73.991475,21101,Subscriber,1991,1
1,372,2014-02-01 00:00:03,2014-02-01 00:06:15,285,Broadway & E 14 St,40.734546,-73.990741,439,E 4 St & 2 Ave,40.726281,-73.98978,15456,Subscriber,1979,2
2,591,2014-02-01 00:00:09,2014-02-01 00:10:00,247,Perry St & Bleecker St,40.735354,-74.004831,251,Mott St & Prince St,40.72318,-73.9948,16281,Subscriber,1948,2
3,583,2014-02-01 00:00:32,2014-02-01 00:10:15,357,E 11 St & Broadway,40.732618,-73.99158,284,Greenwich Ave & 8 Ave,40.739017,-74.002638,17400,Subscriber,1981,1
4,223,2014-02-01 00:00:41,2014-02-01 00:04:24,401,Allen St & Rivington St,40.720196,-73.989978,439,E 4 St & 2 Ave,40.726281,-73.98978,19341,Subscriber,1990,1


In [11]:
#calculating the median
bike_df_for_median['birth year'].median()

1978.0

In [12]:
#replacing the 'birth year' values equal to \N with the median, 1978
bike_df[['birth year']] = bike_df[['birth year']].replace('\\N', 1978)

In [13]:
#dtype is still object after the replacement
bike_df['birth year'].dtypes

dtype('O')

In [14]:
#casting back 'birth year' dtype to int64
bike_df = bike_df.astype({'birth year': 'int64'})

In [15]:
#confirming above cell worked
bike_df['birth year'].dtypes

dtype('int64')
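
The drop, median, replace, and cast steps above can also be sketched in one pass with `pd.to_numeric`. This is a hedged alternative rather than the approach used in this notebook; the Series below is made up for illustration and assumes the placeholder string is exactly `\N`.

```python
import pandas as pd

# Toy stand-in for the 'birth year' column; values are illustrative only.
birth_year = pd.Series(['1991', '1979', '\\N', '1948', '\\N'], name='birth year')

# errors='coerce' converts anything that is not a number (here \N) to NaN.
birth_year_num = pd.to_numeric(birth_year, errors='coerce')

# Median of the valid entries; NaN values are skipped by default.
median_year = birth_year_num.median()

# Fill the gaps with the median, then cast to an integer dtype.
birth_year_clean = birth_year_num.fillna(median_year).astype('int64')
print(birth_year_clean.tolist())
```

Computing the median from the data, instead of typing `1978` by hand, keeps the imputed value in sync if the data changes.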

In [16]:
#checking 'birth year' values after cleaning
bike_df['birth year'].sort_values()

25042     1899
219784    1899
220029    1899
211832    1899
177826    1899
          ... 
26675     1997
209730    1997
83123     1997
14114     1997
186995    1997
Name: birth year, Length: 224736, dtype: int64

In [17]:
#displaying 'usertype' value counts
bike_df['usertype'].value_counts()

Subscriber    218019
Customer        6717
Name: usertype, dtype: int64

In [18]:
#displaying 'end station name' value counts
bike_df['end station name'].value_counts()

Lafayette St & E 8 St     2622
W 21 St & 6 Ave           2453
Pershing Square N         2419
E 17 St & Broadway        2320
8 Ave & W 31 St           2205
                          ... 
W 13 St & 5 Ave             54
Concord St & Bridge St      51
Bedford Ave & S 9th St      43
Railroad Ave & Kay Ave      32
Church St & Leonard St       6
Name: end station name, Length: 329, dtype: int64

In [19]:
#displaying 'start station name' value counts
bike_df['start station name'].value_counts()

Lafayette St & E 8 St         2920
Pershing Square N             2719
E 17 St & Broadway            2493
W 21 St & 6 Ave               2403
8 Ave & W 31 St               2171
                              ... 
Hanover Pl & Livingston St      54
Concord St & Bridge St          45
Bedford Ave & S 9th St          41
Railroad Ave & Kay Ave          36
Church St & Leonard St           4
Name: start station name, Length: 329, dtype: int64

## Is average trip duration different by gender?

Conduct a hypothesis test that checks whether or not the average trip duration is different for `gender=1` and `gender=2`. Be sure to specify your null and alternative hypotheses, and to state your conclusion carefully and correctly!

In [20]:
bike_df = pd.get_dummies(bike_df, columns = ['gender'], drop_first = True)

In [21]:
bike_df.head(3)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender_1,gender_2
0,382,2014-02-01 00:00:00,2014-02-01 00:06:22,294,Washington Square E,40.730494,-73.995721,265,Stanton St & Chrystie St,40.722293,-73.991475,21101,Subscriber,1991,1,0
1,372,2014-02-01 00:00:03,2014-02-01 00:06:15,285,Broadway & E 14 St,40.734546,-73.990741,439,E 4 St & 2 Ave,40.726281,-73.98978,15456,Subscriber,1979,0,1
2,591,2014-02-01 00:00:09,2014-02-01 00:10:00,247,Perry St & Bleecker St,40.735354,-74.004831,251,Mott St & Prince St,40.72318,-73.9948,16281,Subscriber,1948,0,1


Our null hypothesis is that the average trip duration is the same for `gender=1` and `gender=2` ($H_0: \mu_1 = \mu_2$). Our alternative hypothesis is that the average trip durations differ ($H_A: \mu_1 \neq \mu_2$). We test at a significance level of $\alpha = 0.05$.

In [61]:
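#https://towardsdatascience.com/hypothesis-testing-in-machine-learning-using-python-a0dc89e169ce
from scipy.stats import ttest_ind

#trip durations for each group, using the dummy columns created above
duration_gender_1 = bike_df.loc[bike_df['gender_1'] == 1, 'tripduration']
duration_gender_2 = bike_df.loc[bike_df['gender_2'] == 1, 'tripduration']

#two-sample t-test; equal_var=False (Welch's test) does not assume equal variances
tstat, pval = ttest_ind(duration_gender_1, duration_gender_2, equal_var=False)
print('t-statistic:', tstat)
print('p-value:', pval)
if pval < 0.05:    # alpha value is 0.05 or 5%
    print("we reject the null hypothesis")
else:
    print("we fail to reject the null hypothesis")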


## What numeric columns shouldn't be treated as numeric?

**Answer:** start station id, end station id, and bikeid

## Dummify the `start station id` Variable

In [22]:
bike_df = pd.get_dummies(bike_df, columns = ['start station id'], drop_first = True)

In [23]:
bike_df.head(3)

Unnamed: 0,tripduration,starttime,stoptime,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,...,start station id_2006,start station id_2008,start station id_2009,start station id_2010,start station id_2012,start station id_2017,start station id_2021,start station id_2022,start station id_2023,start station id_3002
0,382,2014-02-01 00:00:00,2014-02-01 00:06:22,Washington Square E,40.730494,-73.995721,265,Stanton St & Chrystie St,40.722293,-73.991475,...,0,0,0,0,0,0,0,0,0,0
1,372,2014-02-01 00:00:03,2014-02-01 00:06:15,Broadway & E 14 St,40.734546,-73.990741,439,E 4 St & 2 Ave,40.726281,-73.98978,...,0,0,0,0,0,0,0,0,0,0
2,591,2014-02-01 00:00:09,2014-02-01 00:10:00,Perry St & Bleecker St,40.735354,-74.004831,251,Mott St & Prince St,40.72318,-73.9948,...,0,0,0,0,0,0,0,0,0,0
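
To make the effect of `drop_first = True` concrete, here is a minimal sketch on a toy frame. The station ids and durations are made up for illustration; only the `pd.get_dummies` call mirrors the cell above.

```python
import pandas as pd

# Toy frame; station ids here are made up for illustration.
toy = pd.DataFrame({'start station id': [72, 79, 82, 79],
                    'tripduration': [382, 372, 591, 583]})

# One indicator column per station id. drop_first=True omits the first
# level (72), which becomes the reference category.
dummies = pd.get_dummies(toy, columns=['start station id'], drop_first=True)
print(dummies.columns.tolist())
```

Dropping one level avoids perfectly collinear indicator columns when the model also has an intercept; each remaining coefficient is then read relative to the omitted station.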


In [24]:
#bike_df = pd.get_dummies(bike_df, columns = ['end station id'], drop_first = True)

In [25]:
#bike_df.head(3)

In [26]:
#bike_df = pd.get_dummies(bike_df, columns = ['bikeid'], drop_first = True)

In [38]:
#also getting dummies for usertype
bike_df = pd.get_dummies(bike_df, columns = ['usertype'], drop_first = True)

## Engineer a feature called `age` that shares how old the person would have been in 2014 (at the time the data was collected).

- Note: you will need to clean the data a bit.

In [39]:
#I cleaned the data above and set '\N' entries to be the median birthyear, 1978
bike_df['age'] = 2014 - bike_df['birth year']
bike_df['age']

0         23
1         35
2         66
3         33
4         24
          ..
224731    38
224732    29
224733    46
224734    32
224735    54
Name: age, Length: 224736, dtype: int64
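
The sorted `birth year` output earlier shows values as early as 1899, which corresponds to an age of 115 in 2014. One possible extra cleaning step is to flag implausible ages before modeling. The sketch below uses a made-up frame, and the cutoff of 100 is an arbitrary assumption, not something specified by the lab.

```python
import pandas as pd

# Toy frame; birth years are illustrative only.
toy = pd.DataFrame({'birth year': [1991, 1979, 1899, 1900, 1948]})
toy['age'] = 2014 - toy['birth year']

MAX_PLAUSIBLE_AGE = 100  # assumed cutoff, chosen only for illustration

# Boolean mask marking rows whose age exceeds the cutoff.
implausible = toy['age'] > MAX_PLAUSIBLE_AGE
print(toy[implausible])
print(int(implausible.sum()), "rows flagged")
```

Whether flagged rows are dropped, imputed, or kept is a modeling decision; the mask only makes them visible.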

## Split your data into train/test data

Look at the size of your data. What is a good proportion for your split? **Justify your answer.**

Use the `tripduration` column as your `y` variable.

For your `X` variables, use `age`, `usertype`, `gender`, and the dummy variables you created from `start station id`. (Hint: You may find the Pandas `.drop()` method helpful here.)

**NOTE:** When doing your train/test split, please use random seed 123.

**Answering:** Look at the size of your data. What is a good proportion for your split? Justify your answer.
- With 224,736 rows, the exact proportion matters little: both an 80/20 and a 90/10 split leave tens of thousands of rows for testing. In general, less training data gives the coefficient estimates greater variance, whereas less testing data gives the estimate of model performance greater variance. We use an 80/20 split.
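
To see how many rows each proportion leaves on either side, the split can be run on row indices alone. This is a minimal sketch; the row count comes from the `bike_df.info()` output above, and the `sizes` dictionary is a helper introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 224736  # row count reported by bike_df.info()
indices = np.arange(n_rows)

sizes = {}
for test_size in (0.2, 0.1):
    # Splitting the index array is enough to inspect the resulting sizes.
    train_idx, test_idx = train_test_split(indices, test_size=test_size,
                                           random_state=123)
    sizes[test_size] = (len(train_idx), len(test_idx))
    print(f"test_size={test_size}: {len(train_idx)} train rows, "
          f"{len(test_idx)} test rows")
```

Either choice leaves more than 20,000 test rows.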

In [29]:
#NOTE TO SELF, SET THIS IN IMPORTS WHEN WORKING WITH PANDAS AND YOU WANT TO SEE ALL COLUMNS

#https://datascienceparichay.com/article/show-all-columns-of-pandas-dataframe-in-jupyter-notebook/
#pd.set_option("display.max_columns", None)

In [40]:
bike_df.head(3)

Unnamed: 0,tripduration,starttime,stoptime,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,...,start station id_2009,start station id_2010,start station id_2012,start station id_2017,start station id_2021,start station id_2022,start station id_2023,start station id_3002,age,usertype_Subscriber
0,382,2014-02-01 00:00:00,2014-02-01 00:06:22,Washington Square E,40.730494,-73.995721,265,Stanton St & Chrystie St,40.722293,-73.991475,...,0,0,0,0,0,0,0,0,23,1
1,372,2014-02-01 00:00:03,2014-02-01 00:06:15,Broadway & E 14 St,40.734546,-73.990741,439,E 4 St & 2 Ave,40.726281,-73.98978,...,0,0,0,0,0,0,0,0,35,1
2,591,2014-02-01 00:00:09,2014-02-01 00:10:00,Perry St & Bleecker St,40.735354,-74.004831,251,Mott St & Prince St,40.72318,-73.9948,...,0,0,0,0,0,0,0,0,66,1


In [41]:
#For your `X` variables, use `age`, `usertype`, `gender`,
#and the dummy variables you created from `start station id`. 
#(Hint: You may find the Pandas `.drop()` method helpful here.)

X = bike_df.drop(columns = ['tripduration', 'starttime', 'stoptime',
                            'start station name', 'start station latitude', 'start station longitude',
                            'end station id', 'end station name', 'end station latitude',
                            'end station longitude'])

y = bike_df[['tripduration']]

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.2, 
                                                    train_size = 0.8, 
                                                    random_state = 123)
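
Note that the `.drop()` call above removes only the listed columns, so `bikeid` and `birth year` remain in `X` alongside the requested features. An alternative is to select the wanted columns by name. The sketch below uses a toy frame that mimics the column layout at this point in the notebook; the values and the two station dummy columns are made up for illustration.

```python
import pandas as pd

# Toy frame mimicking the columns left after the dummy and age steps.
toy = pd.DataFrame({
    'tripduration': [382, 372, 591],
    'bikeid': [21101, 15456, 16281],
    'birth year': [1991, 1979, 1948],
    'gender_1': [1, 0, 0],
    'gender_2': [0, 1, 1],
    'start station id_79': [0, 1, 0],
    'start station id_82': [0, 0, 1],
    'usertype_Subscriber': [1, 1, 1],
    'age': [23, 35, 66],
})

# Collect the station dummy columns by their shared prefix.
station_cols = [c for c in toy.columns if c.startswith('start station id_')]

# Explicit feature list: age, usertype, gender, and the station dummies.
feature_cols = ['age', 'usertype_Subscriber', 'gender_1', 'gender_2'] + station_cols

X = toy[feature_cols]
y = toy['tripduration']
print(X.columns.tolist())
```

Selecting by name also keeps `age` and `birth year` from entering the model together, which matters because one is an exact linear function of the other.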

## Fit a Linear Regression model in `sklearn` predicting `tripduration`.

In [43]:
model = LinearRegression()

model.fit(X_train, y_train)

In [44]:
model.score(X_train, y_train)

0.004628580392993298

In [45]:
model.score(X_test, y_test)

-0.0016908789975988991

In [35]:
#sc = StandardScaler()

#X_train_sc = sc.fit_transform(X_train)
#X_test_sc = sc.transform(X_test)

## Evaluate your model
Look at some evaluation metrics for **both** the training and test data. 
- How did your model do? Is it overfit, underfit, or neither?
- Does this model outperform the baseline? (e.g. setting $\hat{y}$ to be the mean of our training `y` values.)

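**Answer:** This is a very poor model. The training $R^2$ is about 0.005 and the test $R^2$ is slightly negative (about -0.002), so the model is underfit (high bias) rather than overfit, and it shows no meaningful improvement over the baseline of predicting the mean of the training `y` values. There appears to be almost no linear relationship between the target `tripduration` and `age`, `usertype`, `gender`, and the `start station id` dummies. This makes sense: trip duration likely depends more on the route, for example the start and end station coordinates or the pairing of `start station id` and `end station id`.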
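
A baseline comparison of the kind the question asks for can be sketched as follows. This is a minimal example on synthetic data with a deliberately weak signal, not the Citi Bike data; `DummyRegressor(strategy='mean')` stands in for "predict the training mean", and both $R^2$ and RMSE are reported for the training and test sets.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(123)
# Synthetic data: the feature carries only a weak signal (illustrative only).
X = rng.normal(size=(1000, 1))
y = 0.1 * X[:, 0] + rng.normal(size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

# Baseline: always predict the mean of the training targets.
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

for name, est in [('baseline', baseline), ('linear', model)]:
    for split, X_part, y_part in [('train', X_train, y_train),
                                  ('test', X_test, y_test)]:
        pred = est.predict(X_part)
        rmse = np.sqrt(mean_squared_error(y_part, pred))
        print(f"{name:8s} {split:5s} R2={r2_score(y_part, pred):.4f} "
              f"RMSE={rmse:.4f}")
```

By construction the baseline has a training $R^2$ of 0 and a test $R^2$ at or just below 0, so a model whose test $R^2$ is also near or below 0 is not adding predictive value.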

## Fit a Linear Regression model in `statsmodels` predicting `tripduration`.

In [47]:
#https://www.statsmodels.org/stable/regression.html
import statsmodels.api as sm

In [52]:
#https://www.statsmodels.org/devel/examples/notebooks/generated/ols.html
#fitting OLS model
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:           tripduration   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     2.655
Date:                Mon, 18 Oct 2021   Prob (F-statistic):           2.23e-51
Time:                        18:48:08   Log-Likelihood:            -2.2534e+06
No. Observations:              224736   AIC:                         4.507e+06
Df Residuals:                  224402   BIC:                         4.511e+06
Df Model:                         333                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
bikeid                   -0.00

## Using the `statsmodels` summary, test whether or not `age` has a significant effect when predicting `tripduration`.
- Be sure to specify your null and alternative hypotheses, and to state your conclusion carefully and correctly **in the context of your model**!

Sorry, I have no idea how to do this, I can't recall when we did this in class. 

## Citi Bike is attempting to market to people who they think will ride their bike for a long time. Based on your modeling, what types of individuals should Citi Bike market toward?