## Objectives

- Review CRISP-DM model
- Practice running a classification model

___
# CRISP-DM Model

What are the major phases?

___
### - Business Understanding
Why am I looking at this data set?  
What am I trying to answer?  
How does this data help me answer my question?  
___

### - Data Understanding and Data Preparation
Where are the missing values?  
What do the columns mean?  
How do I decide on what new columns to make?  
___

### - Modeling and Evaluation

How well did my model do?  
Which metric am I using to evaluate my model?  
What can I change to increase my scoring metric?
___

### - Deployment

Who is this going to?  
Where do I need to document better?  
Which areas are unclear?  
What can I do better?   
___

At what stages would the following methods be used?
- `StandardScaler()`
- `train_test_split()`
- `auc()`
- `df.dropna()`

# Time to Code

In [None]:
import pandas as pd
import numpy as np
import glob
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno



This [data](https://www.kaggle.com/jolasa/bay-area-bike-sharing-trips) is taken from Kaggle. The dataset contains anonymized trip data from Lyft's bike sharing system (Bay Wheels) in the Bay Area from January 2019 to May 2019.

![bikes](pics/bikes.jpg) ![the wiggle](pics/wiggle.png)

### - Business Understanding -

We want to discover if we can predict ____ with ____
___

In [None]:
csv_list = glob.glob("data/*.csv")

list_of_dfs = []

for csv in csv_list:
    df = pd.read_csv(csv, index_col=None, header=0)
    list_of_dfs.append(df)

In [None]:
bike_df = pd.concat(list_of_dfs)
bike_df.head()
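One thing to watch when stacking several CSVs: `pd.concat` keeps each file's own row index by default, so index labels repeat across files. The sketch below uses two tiny made-up CSV files written to a temporary folder (not the Kaggle data) and shows how `ignore_index=True` gives one clean running index. The file names and values are assumptions for illustration only.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Two tiny stand-in CSV files (made-up values, not the Kaggle data)
    pd.DataFrame({"trip_duration_sec": [300, 620]}).to_csv(
        os.path.join(tmp_dir, "2019_01.csv"), index=False
    )
    pd.DataFrame({"trip_duration_sec": [410]}).to_csv(
        os.path.join(tmp_dir, "2019_02.csv"), index=False
    )

    # Same glob-then-concat pattern as above, but with a fresh 0..n-1 index
    csv_paths = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))
    demo_df = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)

print(demo_df.index.tolist())
```

Without `ignore_index=True`, the index here would be `0, 1, 0`, which can cause surprises later when selecting rows by label.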

### - Data Understanding and Data Preparation -

We need to clean the data, look at NaN values, understand what the columns represent, etc.

In [None]:
bike_df.info()

___
Are there missing values?  
___

In [None]:
bike_df.isna().sum()

### Plug for `missingno`
[missingno](https://github.com/ResidentMario/missingno) is a library for visualizing missing data in Python.

In [None]:
msno.matrix(bike_df)

## What could cause these two columns (member_birth_year and member_gender) to be missing?

In [None]:
bike_df.user_type.value_counts()

In [None]:
bike_df[pd.isnull(bike_df).any(axis=1)].head(20)

## What to do?

What percentage of our data is missing?

In [None]:
bike_df.isna().sum()

In [None]:
#1053067
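`isna().sum()` gives counts, not percentages. A minimal sketch of one way to get percentages, using a small made-up frame as a stand-in for `bike_df` (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for bike_df: 4 rows, with gaps in the two member columns
demo_df = pd.DataFrame({
    "trip_duration_sec": [300, 620, 410, 95],
    "member_birth_year": [1985.0, np.nan, 1992.0, np.nan],
    "member_gender": ["Male", None, "Female", "Other"],
})

# isna() gives booleans; the column mean of booleans is the fraction missing
pct_missing_by_col = demo_df.isna().mean() * 100

# Share of rows with at least one missing value (what dropna(axis=0) would remove)
pct_rows_dropped = demo_df.isna().any(axis=1).mean() * 100

print(pct_missing_by_col)
print(f"{pct_rows_dropped:.1f}% of rows have at least one NaN")
```

The second number is the one that matters before calling `dropna(axis=0)`, since it tells us how much data we are about to throw away.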

In [None]:
bike_df = bike_df.dropna(axis=0)

In [None]:
bike_df.isna().sum()

___
Are there redundant columns?  
Which ones should I keep?  
___

In [None]:
bike_df = bike_df.drop(['start_station_name', 'end_station_name'], axis = 1)
bike_df.head()

## Some more EDA

In [None]:
plt.figure(figsize = (12, 8))
sns.set_style('darkgrid')
sns.distplot(bike_df['member_birth_year'])
plt.title('Distribution of Member Birth Years');

In [None]:
plt.figure(figsize = (12, 8))
sns.set_style('whitegrid')
sns.distplot(bike_df['start_station_id'], label='Starting Dock', color = 'g')
sns.distplot(bike_df['end_station_id'], label='Returning Dock', color='r')
plt.title('Starting and Ending Location')
plt.xlabel('Station ID')
plt.legend();

In [None]:
plt.figure(figsize = (12, 8))
sns.set_style('whitegrid')
sns.distplot(bike_df['trip_duration_sec'])
plt.xlabel('Seconds')
plt.title('Seconds in Trip');

In [None]:
bike_df.trip_duration_sec.nsmallest(10)

In [None]:
bike_df.trip_duration_sec.nlargest(10)

___
Let's turn the time into minutes rather than seconds.

In [None]:
bike_df['trip_duration_min'] = (bike_df.trip_duration_sec / 60).round(2)
bike_df.head()

In [None]:
bike_df.trip_duration_min.nsmallest(10)

In [None]:
bike_df.trip_duration_min.nlargest(10)

In [None]:
plt.boxplot(bike_df.trip_duration_min);

Fancy function to remove outliers using IQR. Not PEP 8.

In [None]:
def remove_outlier(df_in, col_names):
    df_out = df_in
    for col in col_names:
        q1 = df_in[col].quantile(0.25)
        q3 = df_in[col].quantile(0.75)
        iqr = q3 - q1  # Interquartile range
        fence_low = q1 - 1.5 * iqr
        fence_high = q3 + 1.5 * iqr
        # Filter df_out (not df_in) so the filters for every column accumulate
        df_out = df_out.loc[(df_out[col] > fence_low) & (df_out[col] < fence_high)]
    return df_out

In [None]:
slimmed_df = remove_outlier(bike_df, ["trip_duration_min"])

In [None]:
plt.boxplot(slimmed_df.trip_duration_min);

In [None]:
print(f"We went from {len(bike_df)} rows to a smaller {len(slimmed_df)} by removing rows outside the IQR fences in the duration column.")
print(f"That was a {(len(bike_df) - len(slimmed_df)) / len(bike_df):.3%} decrease in the number of rows.")
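To make the IQR rule concrete, here is a small worked example on six made-up durations. It follows the same quantile and fence arithmetic as `remove_outlier`; the numbers are invented so the result can be checked by hand.

```python
import pandas as pd

durations = pd.Series([1, 2, 3, 4, 5, 100], name="trip_duration_min")  # made-up values

q1 = durations.quantile(0.25)   # 2.25
q3 = durations.quantile(0.75)   # 4.75
iqr = q3 - q1                   # 2.5
fence_low = q1 - 1.5 * iqr      # -1.5
fence_high = q3 + 1.5 * iqr     # 8.5

# Keep only values strictly inside the fences
kept = durations[(durations > fence_low) & (durations < fence_high)]
print(kept.tolist())  # [1, 2, 3, 4, 5]
```

The 100-minute trip falls above the upper fence of 8.5 and is dropped; everything else stays.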

At this point we have a cleaned dataset. Depending on what we found during the EDA, or anything else that came up, we can take this in many directions. 


___
1. First we will try to predict if the member is a "brogrammer".  
2. Then we will attempt to predict the `user_type`.  
3. Finally, we'll see if anything exists to predict `month`.
___


### - Modeling and Evaluation -

Starting off with a simple model to give us a baseline. 

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import LabelEncoder

from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import ConfusionMatrix

import time

Everyone who is male, lives in SF, AND rides an electric bike is a Brogrammer. (The function below only checks `member_gender`.)

In [None]:
def brogram(gender):
    # gender is a single member_gender string, e.g. "Male"
    if "Male" not in gender:
        return "Nice_person"
    else:
        return "Brogrammer"

In [None]:
df_1 = slimmed_df.copy()

In [None]:
df_1['bg'] = df_1.apply(lambda x: brogram(x['member_gender']), axis=1)

In [None]:
df_1 = df_1.drop('member_gender', axis=1)

In [None]:
df_1.bg.value_counts(normalize=True)

In [None]:
df_1 = pd.get_dummies(df_1, columns=['month', 'user_type'])
df_1.head()

In [None]:
target = pd.DataFrame(df_1['bg'])
data = df_1.drop(['bg'], axis=1)

In [None]:
target.bg.value_counts()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data, target, stratify=target)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
clf_DT = DecisionTreeClassifier() #No max depth is set, be careful

In [None]:
start_time = time.time()

clf_DT.fit(X_train, y_train)

fit_time = (time.time()) - start_time
print(f'-------{fit_time:.2f} seconds------')

In [None]:
y_hat = clf_DT.predict(X_test)

acc = accuracy_score(y_test, y_hat) * 100
print(f"Accuracy Score is {acc}")
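To put an accuracy score in context, it helps to know what "always guess the most common class" would score. A minimal sketch using scikit-learn's `DummyClassifier`, on made-up labels (the 70/30 split here is an assumption, not the real class balance):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 70% "Brogrammer", 30% "Nice_person" (not the real class split)
y_demo = np.array(["Brogrammer"] * 70 + ["Nice_person"] * 30)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this strategy

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo)) * 100
print(f"Majority-class baseline accuracy: {baseline_acc:.1f}")
```

If the decision tree does not beat this number by a clear margin, it has not learned much beyond the class balance.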

In [None]:
# Instantiate the visualizer with the classification model
visualizer = ROCAUC(clf_DT)

visualizer.score(X_test, y_test)  # Evaluate the model on the test data

visualizer.poof()    

In [None]:
cm = ConfusionMatrix(clf_DT)

# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.poof()

In [None]:
param_grid = {
    'max_depth': [2, 5, 7, 15],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200]
}

# Pass a fresh estimator; clf_RF is not defined until a later cell
grid_search = GridSearchCV(estimator = RandomForestClassifier(), param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

In [None]:
# start_time = time.time()

# grid_search.fit(X_train, y_train)

# fit_time = (time.time()) - start_time
# print(f'-------{fit_time:.2f} seconds------')

Something everyone should have seen by now, or at least encountered: **PICKLE**



![pickle](pics/pickle.jpg)


The cross-validation above took 90 minutes to run; imagine if it had taken a few hours. To store the trained model so I can use it or compare against it later, I can _pickle_ it.

In [None]:
# pd.to_pickle(grid_search, "GridSeach_RF.pkl")

In [None]:
grid_search = pd.read_pickle("GridSeach_RF.pkl")
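A minimal sketch of the full save-and-load round trip, using a small decision tree fitted on synthetic data instead of the 90-minute grid search. The file name `demo_model.pkl` and the synthetic data are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem so the example runs in well under a second
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    pd.to_pickle(model, path)          # save the fitted model
    restored = pd.read_pickle(path)    # load it back, no refitting needed

# The restored model should behave exactly like the original
same_predictions = (model.predict(X_demo) == restored.predict(X_demo)).all()
print(same_predictions)
```

Two cautions: only unpickle files from sources you trust, since loading a pickle can execute arbitrary code, and a pickled model is generally only safe to reload with the same library versions it was saved with.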

In [None]:
grid_search.best_params_

In [None]:
clf_RF = (RandomForestClassifier(max_depth = 15,
                                 max_features = 3,
                                 min_samples_leaf = 3,
                                 min_samples_split = 8,
                                 n_estimators = 200))

In [None]:
start_time = time.time()

clf_RF.fit(X_train, y_train)

fit_time = (time.time()) - start_time
print(f'-------{fit_time:.2f} seconds------')

In [None]:
y_hat = clf_RF.predict(X_test)

acc = accuracy_score(y_test, y_hat) * 100
print(f"Accuracy Score is {acc}")

In [None]:
# Instantiate the visualizer with the classification model
visualizer = ROCAUC(clf_RF)

visualizer.score(X_test, y_test)  # Evaluate the model on the test data

visualizer.poof()    

In [None]:
cm = ConfusionMatrix(clf_RF)

# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.poof()

Random search vs Grid Search

![grid](pics/grid.png) ![random](pics/rand.png)
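The `param_grid` used earlier has 4 × 2 × 3 × 3 × 2 = 144 combinations, which is 432 fits with `cv = 3`. A random search samples a fixed number of combinations instead. Below is a minimal sketch with scikit-learn's `RandomizedSearchCV` on synthetic data; the ranges loosely mirror the earlier grid, and the small `n_estimators` and `n_iter` values are assumptions chosen only to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the bike data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Distributions instead of fixed lists; randint's upper bound is exclusive
param_distributions = {
    "max_depth": randint(2, 16),          # integers 2..15
    "max_features": randint(2, 4),        # 2 or 3
    "min_samples_leaf": randint(3, 6),    # 3..5
    "min_samples_split": randint(8, 13),  # 8..12
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # only 5 sampled combinations, so 15 fits with cv=3
    cv=3,
    random_state=0,
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(len(random_search.cv_results_["params"]))
```

The trade-off: the budget (`n_iter`) is set up front, so run time is predictable, but there is no guarantee the best combination in the range gets sampled.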

___
___
___

In [None]:
df_2 = pd.get_dummies(slimmed_df, columns = ['month', 'member_gender'])
df_2.head()

In [None]:
target = pd.DataFrame(df_2['user_type'])
data = df_2.drop('user_type', axis = 1)

Let's check out how well balanced our target values are.

In [None]:
target.user_type.value_counts(normalize=True)

OOOF, that is not good. Not good at all. But our "boss" wants this done, so let's at least attempt it.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data, target, stratify = target)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

Sanity check that everything split correctly.
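Shapes only confirm the row counts. With an imbalanced target it is also worth checking that `stratify` kept the class proportions the same in both halves. A minimal sketch on a made-up 80/20 target (the data and the `random_state` are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up target: 80 Subscribers, 20 Customers
y_demo = pd.Series(["Subscriber"] * 80 + ["Customer"] * 20)
X_demo = pd.DataFrame({"feature": np.arange(100)})

# Default test_size is 0.25; stratify keeps the 80/20 ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Both printouts should show the same 0.8 / 0.2 split as the full target.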

Let's try a baseline before altering the weights.

In [None]:
clf_RF = RandomForestClassifier()

In [None]:
start_time = time.time()

clf_RF.fit(X_train, y_train)

fit_time = (time.time()) - start_time
print(f'-------{fit_time:.2f} seconds------')

In [None]:
y_hat = clf_RF.predict(X_test)

acc = accuracy_score(y_test, y_hat) * 100
print(f"Accuracy Score is {acc}")
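With classes this imbalanced, accuracy alone can look good while the model ignores the minority class entirely. The sketch below uses made-up labels (a 90/10 split and a "model" that always predicts `Subscriber`) to show how balanced accuracy and per-class recall expose that.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up 90/10 split, and a "model" that always predicts the majority class
y_true = np.array(["Subscriber"] * 90 + ["Customer"] * 10)
y_pred = np.array(["Subscriber"] * 100)

acc = accuracy_score(y_true, y_pred)                      # share of correct predictions
bal_acc = balanced_accuracy_score(y_true, y_pred)         # mean of per-class recall
customer_recall = recall_score(y_true, y_pred, pos_label="Customer")

print(acc, bal_acc, customer_recall)
```

Accuracy comes out at 0.9, yet balanced accuracy is 0.5 and recall for `Customer` is 0.0: not a single customer was identified.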

In [None]:
# Instantiate the visualizer with the classification model
visualizer = ROCAUC(clf_RF)

visualizer.score(X_test, y_test)  # Evaluate the model on the test data

visualizer.poof()    

In [None]:
cm = ConfusionMatrix(clf_RF)

# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.poof()

Now let's try adding class weights and see whether they improve our scores.

In [None]:
clf_RF_weights = RandomForestClassifier(class_weight={'Subscriber':0.91, 'Customer':0.9})
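Class weights are relative, so two nearly equal weights (0.91 and 0.9) probably change very little compared with no weights at all. One alternative is `class_weight="balanced"`, which weights each class inversely to its frequency. The sketch below shows the weights scikit-learn would compute for a made-up 90/10 target; the split is an assumption, not the real class balance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

y_demo = np.array(["Subscriber"] * 90 + ["Customer"] * 10)  # made-up 90/10 split
classes = np.array(["Customer", "Subscriber"])

# "balanced" weight = n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
weight_by_class = dict(zip(classes, weights))
print(weight_by_class)

# The same rule can be applied directly when building the model
clf_balanced = RandomForestClassifier(class_weight="balanced")
```

Here `Customer` gets 100 / (2 × 10) = 5.0 and `Subscriber` gets 100 / (2 × 90) ≈ 0.56, so a mistake on a customer costs about nine times as much as one on a subscriber.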

In [None]:
start_time = time.time()

clf_RF_weights.fit(X_train, y_train)

fit_time = (time.time()) - start_time
print(f'-------{fit_time:.2f} seconds------')

In [None]:
y_hat = clf_RF_weights.predict(X_test)

acc = accuracy_score(y_test, y_hat) * 100
print(f"Accuracy Score is {acc}")

In [None]:
# Instantiate the visualizer with the classification model
visualizer = ROCAUC(clf_RF_weights)

visualizer.score(X_test, y_test)  # Evaluate the model on the test data

visualizer.poof()    

In [None]:
cm = ConfusionMatrix(clf_RF_weights)

# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.poof()

___

Last Model - Multiclass Classification

In [None]:
month_count = bike_df.month.value_counts()

In [None]:
sns.barplot(x=month_count.index, y=month_count.values)

A visual inspection helps, but getting the percentages is wise too.

In [None]:
bike_df.month.value_counts(normalize=True)

Much more balanced classes. Will this help in predicting? Let's find out...

In [None]:
df_3 = pd.get_dummies(slimmed_df, columns = ['user_type', 'member_gender'])
df_3.head()

In [None]:
target = df_3['month']
data = df_3.drop('month', axis=1)

In [None]:
# from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder()

# target = le.fit_transform(target)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data, target, stratify=target)

In [None]:
clf_RF_multi = RandomForestClassifier()

In [None]:
start_time = time.time()

clf_RF_multi.fit(X_train, y_train)

fit_time = (time.time()) - start_time
print(f'-------{fit_time:.2f} seconds------')

In [None]:
y_hat = clf_RF_multi.predict(X_test)

acc = accuracy_score(y_test, y_hat) * 100
print(f"Accuracy Score is {acc}")

In [None]:
# Instantiate the visualizer with the classification model
visualizer = ROCAUC(clf_RF_multi)

visualizer.score(X_test, y_test)  # Evaluate the model on the test data

visualizer.poof()    

In [None]:
cm = ConfusionMatrix(clf_RF_multi)

# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.poof()

We worked hard and yet nothing seems to be great. What could we have done differently...?