## Introduction to Classification
In the ***Analysis*** phase, we will take the cleansed data from the ***Process*** phase and start building a classifier.  For classification, we will use the ***scikit-learn*** package in Anaconda.  Install it by opening your Anaconda command line and typing ***conda install scikit-learn***.

## Decision Trees
Let's start by growing a decision tree from the data.

In [None]:
# Import libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Workshop Functions
import sys
sys.path.append('..')
from Wksp722_functions import * 

In [None]:
# Read in the data
df = pd.read_csv("titanic_train_cleaned.csv")
df.head()

### Need for additional data processing
Most scikit-learn estimators expect numerical input, so **categorical** columns stored as text must be encoded or dropped first.  We will not use the free-text 'Name' and 'Ticket' columns.  

For 'Sex', we can encode 'male' as 0 and 'female' as 1.  Similarly, we could encode 'Embarked' as 0, 1, or 2 depending on its value.  

In [None]:
# convert the categorical variable 'Sex' to numerical 0 and 1 using mapping
mapping = {'male':0, 'female':1}
# assign the whole column so it takes the dtype of the mapped integers
df['Sex'] = df['Sex'].map(mapping)
df.head(2)

For 'Embarked' there are 3 possible values: S, C, and Q.  Rather than assign them the values 0, 1, and 2, which would imply an ordering that doesn't exist, let's use one-hot encoding to create one new column per value.  In the 'S' column, the value is 1 if the passenger's original 'Embarked' value is 'S', and 0 otherwise.  The C and Q columns work the same way.  
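
To make the idea concrete, here is a minimal sketch of one-hot encoding on a toy column. The four 'Embarked' values are invented for illustration, and `dtype=int` is an optional choice so the output shows 0/1 rather than True/False.

```python
import pandas as pd

# Toy stand-in for the 'Embarked' column (values invented for illustration).
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# get_dummies creates one indicator column per distinct value.
# Passing a DataFrame prefixes each new column with the source column name.
encoded = pd.get_dummies(toy[["Embarked"]], dtype=int)
print(encoded)
```

Each row has exactly one 1, in the column that matches its original value.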

Let's also do the same with the 'Salutation' column.  

Finally, let's drop 'PassengerId' as it's really just an index and will not help train the model.  

In [None]:
dfTemp = pd.get_dummies(df.loc[:,['Embarked','Salutation']])

df = pd.concat([df,dfTemp], axis=1)
df.drop(['PassengerId','Embarked','Name','Ticket','Salutation'], axis=1,inplace=True)

In [None]:
df.head()

In [None]:
df.columns

This seems like we just added a lot of extra columns, but it will make little difference to the computational speed for this dataset. 

I've saved all the changes we made in a function that we can reuse later.  It is located in the class library file.  This step is optional, but it is useful if you have test sets that must be processed in exactly the same way as your training set.  

In [None]:
#titanicNumericalConverter(df)

In [None]:
df.head(0).to_csv('titanic_train_columns.csv') # <-- we'll need this information later
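
One reason to keep the column list is that one-hot encoding a different dataset can produce a different set of columns, for example if a rare salutation never appears in it. The sketch below shows one way to line a new dataframe up with the training columns. The helper name `align_to_training_columns` and the toy column names are hypothetical; this is not the workshop's library function.

```python
import pandas as pd

def align_to_training_columns(df_new, training_columns):
    """Hypothetical helper: give df_new exactly the training columns, in order.

    Columns missing from df_new are added and filled with 0;
    columns the training set never had are dropped.
    """
    return df_new.reindex(columns=training_columns, fill_value=0)

# Toy example with invented column names and values.
training_columns = ["Age", "Sex", "Embarked_C", "Embarked_Q", "Embarked_S"]
new_data = pd.DataFrame(
    {"Age": [30, 41], "Sex": [0, 1], "Embarked_S": [1, 1], "Extra": [9, 9]}
)

aligned = align_to_training_columns(new_data, training_columns)
print(aligned)
```

Assuming the header file was written as above, the column list could be read back with `pd.read_csv('titanic_train_columns.csv', index_col=0).columns`.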

### Growing our first classifier - Decision Tree
Now that all the variables are **numerical**, we can use scikit-learn to grow our classifiers.  We'll start by designating one of the columns as the **target variable (y)** and the others will be the **input (x)**.

In [None]:
x = df.drop(columns="Survived")
y = df.loc[:,'Survived']

In [None]:
# the number of survivors is less than the number that didn't survive.  
# Though this is an imbalance, there are several hundred examples of each and should suffice for training our algorithms.
y.value_counts()

Now we'll split the dataframe into a training and test set.  We'll choose the training set to be 70% of the original size and the test set to be 30%.  

We'll use a built-in function from the scikit-learn (sklearn) package:

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, random_state=1)
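
Because the two classes are not equally represented, it can be worth checking that both subsets keep roughly the same class mix. The sketch below uses the `stratify` argument of `train_test_split` on synthetic labels; the 60/40 ratio and the `_demo` names are assumptions for illustration, and the workshop split above does not use stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (invented): 60% class 0, 40% class 1.
y_demo = np.array([0] * 600 + [1] * 400)
x_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo asks for the same class proportions in both subsets.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```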

The scikit-learn package has many different kinds of classifiers.  They share a common interface, which makes it easy to compare the performance of different classifiers.

**The usual sequence of steps of using most classifiers in scikit-learn** is:
* split data into training and testing sets using ***train_test_split***
* import and set the desired classifier
* fit the model to the training data
* predict for the test set using the ***predict*** function
* measure success using accuracy, f1, precision, recall, etc.
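
The steps above can be sketched as a single loop over several classifiers. The data here is synthetic (generated with `make_classification`) so the example runs on its own; the classifier settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch is self-contained.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

classifiers = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=1),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=1),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)           # fit on the training set
    y_hat = clf.predict(X_te)     # predict on the held-out test set
    scores[name] = accuracy_score(y_te, y_hat)

print(scores)
```

Because every classifier exposes the same `fit` and `predict` methods, adding another model to the comparison only takes one more dictionary entry.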

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree1 = DecisionTreeClassifier(max_depth=3, random_state=1)
tree1.fit(x_train, y_train)

In [None]:
from sklearn import tree
fig = plt.figure(figsize=(25,20))

featureNames = x_train.columns
targetNames = ['Did Not Survive','Survived']
tree.plot_tree(tree1, feature_names=list(featureNames),  
                   class_names=targetNames,
                   filled=True)

fig.savefig('tree1.pdf')
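
If the plot is hard to read, the same splits can be printed as text with `export_text` from `sklearn.tree`. The sketch below uses the iris dataset and a depth-2 tree purely as a stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=1)
demo_tree.fit(iris.data, iris.target)

# One line per split or leaf, indented by depth.
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

For the workshop tree, the equivalent call would be `export_text(tree1, feature_names=list(x_train.columns))`.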

***Curiosity points (5 Points)***
Another cool package for visualization is dtreeviz.  Install the package with ***conda install dtreeviz***.

See the kaggle notebook at this link (https://www.kaggle.com/code/immu123/decision-tree-visualization-with-dtreeviz?scriptVersionId=101370052) and try visualizing your tree this way.  Below is an example of what the package can do with the iris dataset:

<div>
<img src = "dtreeviz_example.PNG" width="500">
</div>

Some branches end in a pure leaf, but many terminal nodes still contain a mix of both classes.  Let's see the accuracy of this relatively short tree:

In [None]:
from sklearn import metrics
y_pred = tree1.predict(x_test)

print("Tree1 Confusion Matrix \n", metrics.confusion_matrix(y_test, y_pred))
print("\n")
print("Tree1 Classification Report \n", metrics.classification_report(y_test, y_pred))
print("\n")
print("Tree1 Accuracy:",metrics.accuracy_score(y_test, y_pred))
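
To see how the numbers in the classification report relate to the confusion matrix, here is a small worked example with invented labels. It computes precision, recall, and accuracy for the positive class directly from the four cell counts.

```python
from sklearn import metrics

# Invented labels: 1 = survived, 0 = did not survive.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = metrics.confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)                  # of predicted survivors, how many survived
recall = tp / (tp + fn)                     # of actual survivors, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # fraction of all predictions that were right

print(tn, fp, fn, tp)                # 5 1 1 3
print(precision, recall, accuracy)   # 0.75 0.75 0.8
```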

In [None]:
# grow a full length decision tree and check accuracy
tree2 = DecisionTreeClassifier(max_depth=None, random_state = 1)
tree2.fit(x_train, y_train)
y_pred2 = tree2.predict(x_test)
print("Tree2 Accuracy:",metrics.accuracy_score(y_test, y_pred2))

***Curiosity Points (5 points)*** 
Why did the accuracy drop when we grew a larger tree?  
(hint: use tree2.tree_.max_depth to see how large your tree grew)

## Random Forest classification

In this section, let's use the Random Forest algorithm.

In [None]:
from sklearn.ensemble import RandomForestClassifier

RF1 = RandomForestClassifier(n_estimators=125, max_depth=None, oob_score=True, random_state=1)
RF1.fit(x_train, y_train)

y_pred = RF1.predict(x_test)
print("RF1 Accuracy:",metrics.accuracy_score(y_test, y_pred))
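
Setting `oob_score=True` makes the fitted forest store an out-of-bag accuracy estimate in its `oob_score_` attribute. The sketch below reads it on synthetic data; the `_demo` names and dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

rf_demo = RandomForestClassifier(n_estimators=125, oob_score=True, random_state=1)
rf_demo.fit(X_tr, y_tr)

# oob_score_ is the accuracy on training rows, each scored only by the
# trees whose bootstrap sample did not contain that row.
print("OOB accuracy: ", rf_demo.oob_score_)
print("Test accuracy:", rf_demo.score(X_te, y_te))
```

For the workshop forest, the same value is available as `RF1.oob_score_`.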

### Variable Importance
Let's extract and plot the variable importance.

In [None]:
importances = RF1.feature_importances_
forest_importances = pd.Series(importances, index=x_train.columns)  # wrap the NumPy array in a pandas Series labeled by feature name

In [None]:
fig, ax = plt.subplots()
forest_importances.plot.bar(ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
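
Impurity-based importances are computed from the training data. A different technique, permutation importance, measures how much the test accuracy drops when one column is shuffled. The sketch below uses `permutation_importance` from `sklearn.inspection` on synthetic data; the feature names `f0`...`f5` are invented.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, random_state=1
)
names = [f"f{i}" for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and record the mean drop in accuracy.
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=1)
perm_importances = pd.Series(result.importances_mean, index=names)
perm_importances = perm_importances.sort_values(ascending=False)
print(perm_importances)
```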

I created an additional function that plots the test error and optional Out of Bag (OOB) error as the forest grows.  It is located in the class library file.  

In [None]:
# def EnsembleGrowthErrorPlot(clf,x_train,y_train,x_test,y_test,min_estimators=5,max_estimators=200,oob=False):

In [None]:
EnsembleGrowthErrorPlot(RF1, x_train, y_train, x_test, y_test, min_estimators =5, max_estimators =200, oob =True)
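
The library function itself is not shown here. As a rough idea of how such a curve could be computed, the sketch below grows a forest in steps using `warm_start=True` and records the OOB and test error at each size. The function name `forest_growth_errors`, its defaults, and the synthetic data are all hypothetical; this is not the implementation of `EnsembleGrowthErrorPlot`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def forest_growth_errors(x_train, y_train, x_test, y_test,
                         min_estimators=15, max_estimators=60, step=15):
    """Hypothetical sketch: return (n_trees, oob_error, test_error) tuples."""
    # warm_start=True keeps the trees already grown and only adds new ones
    # each time fit() is called with a larger n_estimators.
    clf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=1)
    history = []
    for n in range(min_estimators, max_estimators + 1, step):
        clf.set_params(n_estimators=n)
        clf.fit(x_train, y_train)
        history.append((n, 1 - clf.oob_score_, 1 - clf.score(x_test, y_test)))
    return history

X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

history = forest_growth_errors(X_tr, y_tr, X_te, y_te)
for n, oob_err, test_err in history:
    print(n, round(oob_err, 3), round(test_err, 3))
```

The returned tuples can be unpacked with `zip(*history)` and passed to `plt.plot` to draw the two curves.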

We see that there is a large drop in error around 25 trees, and a slighter drop in error around 120 trees.  Also there is good agreement between the test and out-of-bag error rates.  

Let's see how other settings affect accuracy.

In [None]:
test_error = []
for i in range(2,25,1):
    RF2 = RandomForestClassifier(n_estimators=125, max_depth=None, oob_score=True, random_state=1, min_samples_split=i)
    RF2.fit(x_train, y_train)

    y_pred = RF2.predict(x_test)
    acc = metrics.accuracy_score(y_test, y_pred)
    test_error.append((i,1-acc))

In [None]:
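# Plot the test-set error rate against min_samples_split, the minimum number of observations a node needs before it can be split
fig, axes = plt.subplots()
# Use a new name here: unpacking into `x` would overwrite the feature DataFrame defined earlier
split_values, test_error_plot = zip(*test_error)
axes.plot(split_values, test_error_plot)
axes.set_xlabel("min_samples_split")
axes.set_ylabel("test error rate")
plt.show()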
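
The sweep above scores each setting on the test set. A common alternative is to choose the setting by cross-validation on the training data only, so the test set stays untouched until the end. The sketch below uses `GridSearchCV` on synthetic data; the candidate values and forest size are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Try each candidate value with 5-fold cross-validation on the training set.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=1),
    param_grid={"min_samples_split": [2, 6, 12, 24]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("best setting:", search.best_params_)
print("test accuracy of refit model:", search.score(X_te, y_te))
```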

Based on our observations, a good Random Forest for our dataset will have:
* at least 25 decision trees. Let's choose 50 just in case
* min_samples_split between 10 and 13.  Let's choose 12.  

Let's grow a new Random Forest with these settings and save it so that it can be loaded again in a new Jupyter Notebook.

In [None]:
RF_Final = RandomForestClassifier(n_estimators=50, min_samples_split=12, max_depth=None, oob_score=True, random_state=1)
RF_Final.fit(x_train, y_train)

y_pred = RF_Final.predict(x_test)
print("RF_Final Accuracy:",metrics.accuracy_score(y_test, y_pred))

In [None]:
#Store the Random Forest to be used later in another Jupyter Notebook
import pickle

with open('RF_Final.pkl', 'wb') as f:  # the with block closes the file once the model is written
    pickle.dump(RF_Final, f)
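
Loading the model back is the mirror image of saving it. The sketch below does a full round trip on a small stand-in model, writing to a temporary directory; the file name `model_demo.pkl` and the synthetic data are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)
model = RandomForestClassifier(n_estimators=10, random_state=1).fit(X_demo, y_demo)

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_demo.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = (restored.predict(X_demo) == model.predict(X_demo)).all()
print(same_predictions)
```

In the new notebook, only the `open(..., 'rb')` and `pickle.load` lines are needed. Pickled models should be loaded with the same scikit-learn version that saved them, and only from files you trust.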

***Curiosity points (5 points)***
Experiment with the ***max_depth*** parameter of RandomForestClassifier and see if it makes a difference.  

## Boosted decision trees

In [None]:
from sklearn.ensemble import AdaBoostClassifier
AB1 = AdaBoostClassifier(n_estimators=125, random_state=1)
AB1.fit(x_train, y_train)

y_pred = AB1.predict(x_test)
print("AB1 (Adaboost) Accuracy:",metrics.accuracy_score(y_test, y_pred))

In [None]:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

In [None]:
EnsembleGrowthErrorPlot(AB1, x_train, y_train, x_test, y_test, min_estimators=5, max_estimators =200, oob =False)
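
For boosted ensembles, the error after each boosting round can also be read from a single fitted model, without refitting. The sketch below uses `staged_score` from `AdaBoostClassifier` on synthetic data; the dataset, the `flip_y` label noise, and the `_demo` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, flip_y=0.1, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

ab_demo = AdaBoostClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# staged_score yields the accuracy after 1, 2, 3, ... boosting rounds.
test_errors = [1 - acc for acc in ab_demo.staged_score(X_te, y_te)]
best_round = test_errors.index(min(test_errors)) + 1

print("rounds fitted:", len(test_errors))
print("lowest test error at round:", best_round)
```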

Generally, AdaBoost can suffer from **overfitting** when the number of trees grows beyond a certain point.  For our test set, that point looks to be around 50 trees, but the effect is slight.  Let's see if gradient boosting does better.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
GB1 = GradientBoostingClassifier(n_estimators=125,learning_rate=1.0, random_state=1, max_depth=3)

EnsembleGrowthErrorPlot(GB1, x_train, y_train, x_test, y_test, min_estimators =5, max_estimators =200, oob =False)

Gradient boosting shows less susceptibility to overfitting, but a higher error rate.  
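
The classifier above uses `learning_rate=1.0`, which is well above scikit-learn's default of 0.1. Whether that contributes to the higher error rate on our data is an untested assumption; the sketch below only shows how one could compare the two settings, using synthetic data and `staged_predict` to get the error after each stage.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, flip_y=0.1, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

final_errors = {}
for rate in (1.0, 0.1):
    gb = GradientBoostingClassifier(
        n_estimators=100, learning_rate=rate, max_depth=3, random_state=1
    )
    gb.fit(X_tr, y_tr)
    # staged_predict yields test-set predictions after each boosting stage.
    errors = [1 - accuracy_score(y_te, y_hat) for y_hat in gb.staged_predict(X_te)]
    final_errors[rate] = errors[-1]  # error of the full ensemble

print(final_errors)
```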