# Data Science: Bridging Principle and Practice
## Part 8: Improving the Model (Bike Sharing case study)

<br/>

<div class="container">
    <div style="float:left;width:40%">
	    <img src="images/bikeshare_sun.jpg">
    </div>
    <div style="float:left;width:40%">
	    <img src="images/bikeshare_snow.PNG">
    </div>
</div>

### Table of Contents

[Case Study: Bike Sharing](#sectioncase)<br>

<ol start="8">
    <li><a href="#section8">Improving the Model</a></li>
</ol>


In [1]:
# run this cell to import some necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
import seaborn as sns

import ipywidgets as widgets
from scipy.linalg import lstsq
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import train_test_split
from IPython.display import display,clear_output


## Case Study: Capital Bikeshare <a id="sectioncase"></a>

Bike-sharing systems have become increasingly popular worldwide as environmentally-friendly solutions to traffic congestion, inadequate public transit, and the "last-mile" problem. Capital Bikeshare runs one such system in the Washington, D.C. metropolitan area.

The Capital Bikeshare system comprises docks of bikes, strategically placed across the area, that can be unlocked by *registered* users who have signed up for a monthly or yearly plan or by *casual* users who pay by the hour or day. Capital Bikeshare collects data on the number of casual and registered users per hour and per day.

Let's say that Capital Bikeshare is interested in a **prediction** problem: predicting how many riders they can expect to have on a given day. [UC Irvine's Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset) has combined the bike sharing data with information about weather conditions and holidays to try to answer this question.

In this notebook, we'll walk through the steps a data scientist would take to answer this question.

In [2]:
# run this cell to load the data
bikes = pd.read_csv("../resource/data/day_renamed_dso.csv", index_col=0)

np.random.seed(28)
bikes, bike_test = train_test_split(bikes, train_size=0.8, test_size=0.2)
bike_train, bike_val = train_test_split(bikes, train_size=0.8, test_size=0.2)

# preview the first few rows of the training set
bike_train.head()

instant,date,season,year,month,is holiday,week day,is work day,weather,temp,felt temp,humidity,windspeed,casual riders,registered riders,total riders
93,2011-04-03,summer,2011,4,no,sunday,no,1,0.378333,0.378767,0.48,0.182213,1651,1598,3249
497,2012-05-11,summer,2012,5,no,friday,yes,1,0.533333,0.520833,0.360417,0.236937,1319,5711,7030
596,2012-08-18,fall,2012,8,no,saturday,no,1,0.678333,0.618071,0.603333,0.177867,2827,5038,7865
397,2012-02-01,spring,2012,2,no,wednesday,yes,1,0.469167,0.466538,0.507917,0.189067,304,4275,4579
623,2012-09-14,fall,2012,9,no,friday,yes,1,0.633333,0.594083,0.6725,0.103863,1379,6630,8009
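
The loading cell splits the data twice: first 80/20 into working data and a held-out test set, then 80/20 again into training and validation sets. That leaves roughly 64% of the rows for training, 16% for validation, and 20% for testing. The sketch below illustrates the arithmetic with plain Python; the `split_indices` helper and the 100-row dataset are hypothetical stand-ins, not how `train_test_split` is implemented.

```python
import random

def split_indices(indices, first_frac, rng):
    """Shuffle `indices` and cut them into (first part, second part)."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = round(first_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

rng = random.Random(28)  # fixed seed so the sketch is repeatable
rows = range(100)        # hypothetical dataset of 100 days

rest, test = split_indices(rows, 0.8, rng)  # 80 rows kept, 20 held out
train, val = split_indices(rest, 0.8, rng)  # 64 for training, 16 for validation

print(len(train), len(val), len(test))  # 64 16 20
```

The test set is put aside until the very end; the validation set is what we use to compare candidate models in this notebook.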


A few of the less straightforward columns can be described as follows:
- **instant**: record index
- **year**: the calendar year of the date (2011 or 2012)
- **is holiday**: "yes" if the day is a holiday, "no" otherwise
- **is work day**: "yes" if the day is neither a weekend nor a holiday, "no" otherwise
- **weather**:
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- **temp**: Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
- **felt temp**: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
- **humidity**: Normalized humidity. The values are divided by 100 (max)
- **windspeed**: Normalized wind speed. The values are divided by 67 (max)
- **casual riders**: count of casual users
- **registered riders**: count of registered users
- **total riders**: count of total rental bikes (casual + registered)
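
The normalization formula above can be inverted to recover approximate Celsius values. This is a minimal sketch, assuming the `t_min`/`t_max` constants listed above also apply to the daily values in this table; the `denormalize` helper is hypothetical and not part of the notebook.

```python
def denormalize(value, t_min, t_max):
    """Invert (t - t_min) / (t_max - t_min) to recover t."""
    return value * (t_max - t_min) + t_min

# Constants taken from the column descriptions above.
TEMP_RANGE = (-8, 39)        # temp
FELT_TEMP_RANGE = (-16, 50)  # felt temp

# Values from the first row of the preview table (2011-04-03).
temp_c = denormalize(0.378333, *TEMP_RANGE)
felt_c = denormalize(0.378767, *FELT_TEMP_RANGE)

print(round(temp_c, 1), round(felt_c, 1))
```

Humidity and wind speed can be rescaled the same way by multiplying by 100 and 67 respectively.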

## 8. Improving the Model <a id="section8"></a>

In notebook 07, we created a linear regression model where we tried to predict the total number of riders on a given day based on the temperature, season, and whether or not the day was a work day. Ultimately, our model was not very accurate; the RMSE was over 1000 for the training data. Our challenge now is to build a better model.
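
As a reminder, RMSE is the square root of the mean squared difference between the actual and predicted values, so it is expressed in the same units as the response (riders per day). Here is a minimal sketch with made-up numbers; the `rmse` helper and the values are illustrative and are not taken from notebook 07.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical daily rider counts and model predictions.
actual = [3000, 5000, 7000, 4000]
predicted = [4000, 4000, 8000, 5000]

# Every prediction is off by exactly 1000 riders, so the RMSE is 1000.0.
print(rmse(actual, predicted))
```

An RMSE over 1000 therefore means the model's predictions were typically off by more than a thousand riders per day.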

The following cells contain almost everything you need to create a new linear regression model. Run them to display a set of controls, then use the controls to try a new model:
- select the response variable you want to predict with the *Response variable* radio buttons
- toggle on the explanatory variables you want to incorporate into the model
- toggle *intercept* on to include an intercept term

Each time you change a control, the notebook recalculates the $\beta$ vector, makes new predictions, and redraws the error plots. A helpful tip: in the "Cell" menu at the top, clicking "Run All Below" will run all code cells below the cell you currently have selected.

How accurate can you make the model?
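
To make the fitting step concrete, here is a small sketch of computing $\beta$, predictions, and residuals by least squares. It uses made-up toy data and NumPy's `np.linalg.lstsq` rather than the `scipy.linalg.lstsq` call used in this notebook; the two solve the same problem.

```python
import numpy as np

# Hypothetical toy data: 5 days, an intercept column and one feature.
X = np.array([
    [1.0, 0.2],
    [1.0, 0.4],
    [1.0, 0.5],
    [1.0, 0.7],
    [1.0, 0.9],
])
y = np.array([2100.0, 3900.0, 5100.0, 6800.0, 9100.0])

# Least-squares solution: the beta that minimizes ||X @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

predictions = X @ beta
residuals = y - predictions
rmse = np.sqrt(np.mean(residuals ** 2))

print("beta:", beta)
print("RMSE:", rmse)
```

Adding an explanatory variable adds a column to `X` and an entry to `beta`. The training RMSE can only stay the same or go down when columns are added, which is why the validation RMSE is the more honest measure of improvement.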


In [11]:
def make_X(df, var_names):
    """Given a DataFrame and a list of explanatory variables, one-hot encodes
    variables if they are categorical and returns a dataframe with 
    all the given explanatory variables."""
    categorical = ["month", "week day", "season"]
    boolean = ["is holiday", "is work day"]
    X = pd.DataFrame({"intercept":np.ones(df.shape[0], dtype='int')}, index = df.index)
    for var in var_names:
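        if var in categorical:
            # one-hot encode as integers; drop the last level so the dummy
            # columns are not perfectly collinear with the intercept
            dummies = pd.get_dummies(df[var], dtype=int)
            formatted = dummies.drop(dummies.columns[-1], axis=1)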
        elif var in boolean:
            formatted = (df[var] == "yes") * 1
        else:
            formatted = df.loc[:, var]
        X = X.join(formatted)
      
    return X

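def predict(response_var, fit_intercept, **kwargs):
    """Fit a least-squares model on the training data using the selected
    explanatory variables, plot predictions and errors, and print the RMSE
    for the training and validation sets."""
    plt.close()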
    # select and format X and y
    y_train = bike_train[response_var]
    y_val = bike_val[response_var]
    
    expl_vars = [var for var in kwargs if kwargs[var]]
    
    # exit early if there is nothing to fit
    if len(expl_vars) == 0 and not fit_intercept:
        print("Please select at least one explanatory variable to include in the model.")
        return
    
    X_train = make_X(bike_train, expl_vars)
    X_val = make_X(bike_val, expl_vars)
    
    if not fit_intercept:
        X_train.drop("intercept", axis=1, inplace=True)
        X_val.drop("intercept", axis=1, inplace=True)
    
    # calculate beta
    beta = lstsq(X_train, y_train)[0]
    
    # make predictions
    pred_train = X_train @ beta
    pred_val = X_val @ beta
    
    # generate plots 
    f, [[ax1, ax2], [ax3, ax4]] = plt.subplots(2, 2, figsize=(12, 12))

    sns.regplot(x=y_train, y=pred_train, ax=ax1, color="#003262") 
    ax1.set_xlabel(response_var)
    ax1.set_ylabel("predicted {}".format(response_var))
    ax1.set_title("Predicted vs. Actual Values (Training Data)")
    
    sns.regplot(x=y_val, y=pred_val, ax=ax2, color="#FDB515")
    ax2.set_xlabel(response_var)
    ax2.set_ylabel("predicted {}".format(response_var))
    ax2.set_title("Predicted vs. Actual Values (Validation Data)")
    
    sns.regplot(x=y_train, y=y_train - pred_train, ax=ax3, color="#003262") 
    ax3.set_xlabel(response_var)
    ax3.set_ylabel("error ({} - predicted {})".format(response_var, response_var))
    ax3.set_title("Error (Training Data)")
    
    sns.regplot(x=y_val, y=y_val - pred_val, ax=ax4, color="#FDB515")
    ax4.set_xlabel(response_var)
    ax4.set_ylabel("error ({} - predicted {})".format(response_var, response_var))
    ax4.set_title("Error (Validation Data)")
            
    
    # calculate rmse
    print("Training data RMSE = {}".format(np.sqrt(mse(pred_train, y_train))))
    print("Validation data RMSE = {}".format(np.sqrt(mse(pred_val, y_val))))
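
The one-hot encoding step in `make_X` can be seen in isolation in the sketch below. The mini DataFrame is hypothetical; the encoding calls mirror the ones in `make_X`.

```python
import pandas as pd

# Hypothetical mini-frame with a single categorical column.
df = pd.DataFrame({"season": ["spring", "summer", "fall", "winter", "summer"]})

dummies = pd.get_dummies(df["season"], dtype=int)
# Columns come out sorted alphabetically: fall, spring, summer, winter.
# Dropping the last one ("winter") makes it the baseline category.
encoded = dummies.drop(dummies.columns[-1], axis=1)

X = pd.DataFrame({"intercept": 1}, index=df.index).join(encoded)
print(X)
```

If all four season columns were kept, they would sum to the intercept column in every row, so $\beta$ would not be uniquely determined. With one level dropped, each remaining coefficient reads as the difference from the baseline season.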
        

In [5]:
# one toggle per candidate explanatory variable (season through windspeed)
expl_vars = [widgets.ToggleButton(description=var) for var in bike_train.columns[1:12]]
expl_buttons = widgets.Box(expl_vars)


In [15]:
response_radio = widgets.RadioButtons(
    options=['total riders', 'casual riders', 'registered riders'],
    description='Response variable:'
)

intercept = widgets.ToggleButton(description="intercept")

In [17]:
kwargs = {bike_train.columns[1:12][i]: expl_vars[i] for i in range(11)}
kwargs['fit_intercept'] = intercept
kwargs['response_var'] = response_radio
out = widgets.interactive_output(predict, kwargs)
out.layout.height = '800px'

display(expl_buttons,  intercept, response_radio,out)




<div class="class alert-warning">
<b>QUESTION:</b> What explanatory variables did you use in the best model you found? What metrics showed that it was the "best" model? Reference the scatter plots, fit lines, RMSE, etc.
</div>

**ANSWER:**

#### References
- Bike-Sharing data set from University of California Irvine's Machine Learning Repository https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
- Portions of text and code adapted from Professor Jonathan Marshall's Legal Studies 190 (Data, Prediction, and Law) course materials: [lab 2-22-18, Linear Regression](https://github.com/ds-modules/LEGALST-190/tree/master/labs/2-22) (Author Keeley Takimoto) and [lab 3-22-18, Exploratory Data Analysis](https://github.com/ds-modules/LEGALST-190/tree/master/labs/3-22) (Author Keeley Takimoto)
- "Capital Bikeshare, Washington, DC" header image by [Leeann Caferatta](https://www.flickr.com/photos/leeanncafferata/34309356871) licensed under [CC BY-ND 2.0](https://creativecommons.org/licenses/by-nd/2.0/)