## Many Classification Models for Titanic Dataset

This sample notebook demonstrates how to build a model on the joined dataset using several classification models from **sklearn**.

In this notebook we use the pandas get_dummies function, which applies one-hot encoding to a categorical variable. This takes a categorical variable and converts it into multiple columns of yes/no (1/0) values. If you set the get_dummies option drop_first=True, you can avoid collinearity in your dataset.
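
As a small illustration of what get_dummies does, here is a sketch on a made-up frame (the `toy` DataFrame is an assumption for the example, not the Titanic data). It shows the columns produced with and without drop_first=True; `dtype=int` is passed so the output is 1/0 rather than True/False.

```python
import pandas as pd

# Made-up frame; the values mirror the Titanic 'embarked' codes.
toy = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

# One indicator column per category.
full = pd.get_dummies(toy, columns=["embarked"], dtype=int)

# drop_first=True drops the first category ('C'),
# so a row with all zeros now means 'C'.
reduced = pd.get_dummies(toy, columns=["embarked"], dtype=int, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With drop_first=True one category is implied by the others, which is what removes the redundant (collinear) column.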

The problem is that every categorical value must be represented in your dataset, so that when you are inferencing, the corresponding column is also present. You could add the missing columns yourself, and some folks do, but this takes a lot of data engineering.
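
One way to add the missing columns by hand is sketched below, assuming you saved the list of training columns. The frames and the names `train_columns` and `aligned` are made up for the example.

```python
import pandas as pd

# Made-up training data with three embarkation ports.
train = pd.DataFrame({"embarked": ["S", "C", "Q"], "fare": [8.05, 21.0, 24.15]})
train_encoded = pd.get_dummies(train, columns=["embarked"], dtype=int)
train_columns = list(train_encoded.columns)  # remember the training layout

# At inference time only 'S' shows up, so get_dummies creates one dummy column.
new_rows = pd.DataFrame({"embarked": ["S"], "fare": [15.5]})
new_encoded = pd.get_dummies(new_rows, columns=["embarked"], dtype=int)

# Add the missing columns (filled with 0) and enforce the training column order.
aligned = new_encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

This works, but you have to carry `train_columns` around yourself, which is part of the data engineering burden described above.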

This notebook will get you going on the code required to create the model, but you will then need a processing pipeline to handle cleaning your data.

To handle the dataset transformation, use an sklearn Pipeline to transform the data:

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Start by importing the required packages for the notebook.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
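
The introduction recommends an sklearn Pipeline for the data transformation. The following is a minimal sketch of that idea, not the notebook's method: the `toy` frame is made up, and the choice of SimpleImputer, OneHotEncoder, and ColumnTransformer is an assumption about how such a pipeline could be assembled.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up sample with Titanic-like columns (not the real data).
toy = pd.DataFrame({
    "age": [24.0, 43.0, None, 10.0, 35.0, 52.0],
    "fare": [8.05, 21.0, 24.15, 15.5, 211.34, 30.0],
    "sex": ["male", "male", "female", "male", "female", "female"],
    "embarked": ["S", "S", "S", "Q", "S", "C"],
    "survived": [0, 0, 0, 0, 1, 1],
})
numeric_cols = ["age", "fare"]
categorical_cols = ["sex", "embarked"]

# Impute numeric gaps with the median; one-hot encode the categoricals.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(solver="liblinear")),
])
clf.fit(toy[numeric_cols + categorical_cols], toy["survived"])

# A category never seen in training ('X') is encoded as all zeros instead of failing.
new_passenger = pd.DataFrame(
    {"age": [30.0], "fare": [12.0], "sex": ["female"], "embarked": ["X"]}
)
prediction = clf.predict(new_passenger)
print(prediction)
```

Because the fitted encoder lives inside the pipeline, the same transformation is applied at training and at inference without tracking columns by hand.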

### Read in the datasets
- merge them, get dummies for columns. This is super common, but sort of stupid: it makes inferencing tough, but it is a quick way to explore the best model.

In [2]:
LABEL = 'survived'

In [15]:
df1 = pd.read_csv('./Data/Train1.csv')
df2 = pd.read_csv('./Data/Train2.csv')
print(df1.shape)
print(df2.shape)
titanic_df = df1.merge(df2, on = 'passenger_id', how = 'inner')

titanic_df['survived'] = titanic_df['survived'].fillna(0)
titanic_df['loc']= titanic_df['cabin'].apply(lambda x: x[0] if pd.notnull(x) else 'X')
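# fill missing ages with the median age of the passenger class;
# transform keeps the original index so the result aligns with the frame
titanic_df['age'] = titanic_df.groupby(['pclass'])['age'].transform(lambda x: x.fillna(x.median()))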
titanic_df = titanic_df.drop(['name', 'ticket', 'home.dest', 'cabin', 'passenger_id'], axis = 1)


#get_dummies drop_first will ensure you avoid co-linearity, but makes inferencing more challenging
titanic_df = pd.get_dummies(titanic_df, columns = ['embarked', 'loc', 'sex'], drop_first=True)

print(titanic_df.shape)

FEATURES = list(titanic_df.columns[0:])
FEATURES.remove("survived")
FEATURES


titanic_df.head()

(917, 6)
(917, 8)
(917, 17)


Unnamed: 0,fare,survived,pclass,age,sibsp,parch,embarked_Q,embarked_S,loc_B,loc_C,loc_D,loc_E,loc_F,loc_G,loc_T,loc_X,sex_male
0,8.05,0.0,3.0,24.0,0.0,0.0,0,1,0,0,0,0,0,0,0,1,1
1,21.0,0.0,2.0,43.0,0.0,1.0,0,1,0,0,0,0,0,0,0,1,1
2,24.15,0.0,3.0,10.0,0.0,2.0,0,1,0,0,0,0,0,0,0,1,0
3,15.5,0.0,3.0,24.0,0.0,0.0,1,0,0,0,0,0,0,0,0,1,1
4,211.3375,1.0,1.0,43.0,0.0,1.0,0,1,1,0,0,0,0,0,0,0,0


### Read in the datasets
- merge them, use a label encoder to transform categorical values into integers. This suffers from the same problem: inferencing will be a challenge, because what goes into a model during training needs to be what goes into the model during inferencing.
- Pretty common, but sort of stupid: it makes inferencing tough, but it is a quick way to explore the best model.

In [16]:
df1 = pd.read_csv('./Data/Train1.csv')
df2 = pd.read_csv('./Data/Train2.csv')
print(df1.shape)
print(df2.shape)
df = df1.merge(df2, on = 'passenger_id', how = 'inner')

df['survived'] = df['survived'].fillna(0)
df['loc']= df['cabin'].apply(lambda x: x[0] if pd.notnull(x) else 'X')
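# fill missing ages with the median age of the passenger class;
# transform keeps the original index so the result aligns with the frame
df['age'] = df.groupby(['pclass'])['age'].transform(lambda x: x.fillna(x.median()))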
df = df.drop(['name', 'ticket', 'home.dest', 'cabin', 'passenger_id'], axis = 1)


df_features = list(df.columns)
df_features.remove("survived")


from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df['embarked'] = label_encoder.fit_transform(df['embarked'])
df['sex'] = label_encoder.fit_transform(df['sex'])
df['loc'] = label_encoder.fit_transform(df['loc'])

print(df.shape)
df.head()

(917, 6)
(917, 8)
(917, 9)


Unnamed: 0,fare,embarked,survived,pclass,sex,age,sibsp,parch,loc
0,8.05,2,0.0,3.0,1,24.0,0.0,0.0,8
1,21.0,2,0.0,2.0,1,43.0,0.0,1.0,8
2,24.15,2,0.0,3.0,0,10.0,0.0,2.0,8
3,15.5,1,0.0,3.0,1,24.0,0.0,0.0,8
4,211.3375,2,1.0,1.0,0,43.0,0.0,1.0,1
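
Note that the cell above refits a single `label_encoder` for each column, so once it has run only the mapping for `loc` is still available. If you want to reuse the mappings at inference time, one option is to keep one fitted encoder per column. The sketch below uses a made-up frame, and the `encoders` dictionary is a hypothetical helper, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up frame with two of the categorical columns.
toy = pd.DataFrame({"sex": ["male", "female", "male"], "embarked": ["S", "Q", "C"]})

# Keep one fitted encoder per column so each mapping survives.
encoders = {}
for col in ["sex", "embarked"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# At inference time, apply the stored mappings (transform, not fit_transform).
new_rows = pd.DataFrame({"sex": ["female"], "embarked": ["S"]})
for col, enc in encoders.items():
    new_rows[col] = enc.transform(new_rows[col])

print(new_rows)
print(list(encoders["embarked"].classes_))
```

A value that was never seen during fitting still raises an error in `transform`, so the inferencing challenge described above does not go away; it only becomes explicit.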


## Training on a variety of models.  

- Now we can use the dataframe df (label encoding), or the dataframe titanic_df (one-hot encoding)

## Create a dictionary for holding results

In [None]:
result_dict = {}

## Create a function that leverages sklearn metrics for model evaluation

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

In [None]:
def summarize_classification(y_test, y_pred):
    
    acc = accuracy_score(y_test, y_pred, normalize=True)
    num_acc = accuracy_score(y_test, y_pred, normalize=False)

    prec = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    
    return {'accuracy': acc, 
            'precision': prec,
            'recall':recall, 
            'accuracy_count':num_acc}
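
To make the four values returned above concrete, here is a small worked example on made-up labels (the lists `y_true` and `y_hat` are assumptions for illustration). It calls the same sklearn metrics directly.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 3 of 5 predictions are correct.
y_true = [1, 0, 1, 1, 0]
y_hat = [1, 0, 0, 1, 1]

acc = accuracy_score(y_true, y_hat)                       # fraction of correct predictions
num_acc = accuracy_score(y_true, y_hat, normalize=False)  # number of correct predictions
prec = precision_score(y_true, y_hat)                     # TP / (TP + FP)
rec = recall_score(y_true, y_hat)                         # TP / (TP + FN)

print(acc, num_acc, prec, rec)
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3 while accuracy is 3/5.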

Generic function that takes a classifier function, the name of the label column, and the names of the x columns. It splits the data, trains a model, and tests the model on the held-out data.

In [None]:
def build(classifier_fn,name_of_y_col, names_of_x_cols, dataset, test_frac=0.2):
    
    X = dataset[names_of_x_cols]
    Y = dataset[name_of_y_col]
    #split data
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_frac, random_state=42)  
    #train a model - you pass into the classifier_function so this code can run for many different models
    model = classifier_fn(x_train, y_train)   
    #predict on the model
    y_pred = model.predict(x_test)
    #leverage sklearn to get out summary information
    test_summary = summarize_classification(y_test, y_pred)   
    pred_results = pd.DataFrame({'y_test': y_test,
                                 'y_pred': y_pred})
    
    #pd cross tab will give you a basic confusion matrix of results
    model_crosstab = pd.crosstab(pred_results.y_test, pred_results.y_pred)
    #print(model_crosstab)
    return {'test': test_summary,'confusion_matrix': model_crosstab}
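
The crosstab used as a confusion matrix in `build` can be read as follows: rows are the actual labels and columns are the predicted labels. A minimal sketch on made-up labels:

```python
import pandas as pd

# Made-up actual and predicted labels.
y_test = pd.Series([1, 0, 1, 1, 0], name="y_test")
y_pred = pd.Series([1, 0, 0, 1, 1], name="y_pred")

# Rows: actual class, columns: predicted class.
table = pd.crosstab(y_test, y_pred)
print(table)
```

The diagonal cells count correct predictions; the off-diagonal cells count the false positives (row 0, column 1) and false negatives (row 1, column 0).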

In [None]:
def compare_results():
    for key in result_dict:
        print('Classification: ', key)
        print('--------------------------------------------')
        print('test data')
        for score in result_dict[key]['test']:
            print(score, result_dict[key]['test'][score])

        print()


### Logistic Regression Model

The documentation will help you select the model parameters (the hyperparameters for tuning your model).

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
    

In [None]:
def logistic_fn(x_train, y_train):
    
    model = LogisticRegression(solver='liblinear')
    model.fit(x_train, y_train)
    
    return model

### LinearDiscriminantAnalysis

https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html

- A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule.

- The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix.

- The fitted model can also be used to reduce the dimensionality of the input by projecting it to the most discriminative directions, using the transform method.
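
The dimensionality reduction mentioned in the last bullet can be sketched as below. The iris dataset is used here only because it has three classes; that choice is an assumption for the example, since LDA can project to at most n_classes - 1 directions (so a binary label such as survived would give a single component).

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris has 4 features and 3 classes, so LDA can project to at most 2 components.
X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(solver="svd", n_components=2)
X_2d = lda.fit(X, y).transform(X)

print(X.shape, "->", X_2d.shape)
```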

In [None]:
def linear_discriminant_fn(x_train, y_train, solver='svd'):
    
    model = LinearDiscriminantAnalysis(solver=solver)
    model.fit(x_train, y_train)
    
    return model


### QuadraticDiscriminantAnalysis

https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html
    
- Quadratic Discriminant Analysis.

- A classifier with a quadratic decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule.

- The model fits a Gaussian density to each class.

In [None]:
def quadratic_discriminant_fn(x_train, y_train):
    
    model = QuadraticDiscriminantAnalysis()
    model.fit(x_train, y_train)
    
    return model

### SGDClassifier

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
    
This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated each sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate). SGD allows minibatch (online/out-of-core) learning via the partial_fit method. For best results using the default learning rate schedule, the data should have zero mean and unit variance.
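
Since the description above says the data should have zero mean and unit variance, one way to arrange that is to put a StandardScaler in front of the classifier. This is a sketch on synthetic data from make_classification, not the notebook's approach; the notebook's own `sgd_fn` trains on unscaled features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (an assumption for the example).
X, y = make_classification(n_samples=200, n_features=6, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fitted on the training split only, then applied before SGD.
model = make_pipeline(
    StandardScaler(),
    SGDClassifier(max_iter=1000, tol=1e-3, random_state=42),
)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

print(y_pred[:10])
```

Because the pipeline exposes fit and predict, a function returning it could be passed to `build` in the same way as the other classifier functions.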
    

In [None]:
def sgd_fn(x_train, y_train, max_iter=1000, tol=1e-3):
    
    model = SGDClassifier(max_iter=max_iter, tol=tol)
    model.fit(x_train, y_train)
     
    return model

### LinearSVC

https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

Linear Support Vector Classification.

Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

* SVC with a linear kernel
* dual=False when number of samples > number of features

In [None]:
def linear_svc_fn(x_train, y_train, C=1.0, max_iter=1000, tol=1e-3):
    
    model = LinearSVC(C=C, max_iter=max_iter, tol=tol, dual=False)
    model.fit(x_train, y_train) 
    
    return model

### RadiusNeighborsClassifier

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighborsClassifier.html

Classifier implementing a vote among neighbors within a given radius.
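
One practical consequence of voting within a fixed radius is that a query point may have no neighbors at all. The sketch below, on made-up one-dimensional data, uses the `outlier_label` parameter so that such points receive a marker label instead of raising an error; the value -1 is an arbitrary choice for the example.

```python
from sklearn.neighbors import RadiusNeighborsClassifier

# Made-up 1-D training data: two well-separated groups.
x_train = [[0.0], [1.0], [10.0], [11.0]]
y_train = [0, 0, 1, 1]

# Points with no training neighbor inside the radius get the label -1.
model = RadiusNeighborsClassifier(radius=2.0, outlier_label=-1)
model.fit(x_train, y_train)

# The last query point is far from every training sample.
pred = model.predict([[0.5], [10.5], [100.0]])
print(pred)
```

This is also why the radius has to match the scale of the features: too small a radius leaves many points without neighbors.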

In [17]:
def radius_neighbor_fn(x_train, y_train, radius=40.0):

    model = RadiusNeighborsClassifier(radius=radius)
    model.fit(x_train, y_train) 
    
    return model

### DecisionTreeClassifier

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

**max_depth** = If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples

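**max_features** = the number of features to consider when looking for the best split:

- None: max_features=n_features
- auto: max_features=sqrt(n_features) (deprecated and later removed in newer scikit-learn releases; use sqrt instead)
- sqrt: max_features=sqrt(n_features)
- log2: max_features=log2(n_features)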
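
To see the effect of these two parameters, here is a short sketch comparing an unrestricted tree with a shallow one. The iris dataset and the `random_state` value are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth=None: grow until the leaves are pure.
full_tree = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X, y)

# max_depth=2 caps the depth; max_features="sqrt" samples features at each split.
shallow_tree = DecisionTreeClassifier(
    max_depth=2, max_features="sqrt", random_state=42
).fit(X, y)

print(full_tree.get_depth(), shallow_tree.get_depth())
```

Limiting the depth is a common way to reduce overfitting, at the cost of a coarser decision boundary.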

In [None]:
def decision_tree_fn(x_train, y_train, max_depth=None, max_features=None): 
    
    model = DecisionTreeClassifier(max_depth=max_depth, max_features=max_features)
    model.fit(x_train, y_train)
    
    return model

In [None]:
def naive_bayes_fn(x_train,y_train, priors=None):
    
    model = GaussianNB(priors=priors)
    model.fit(x_train, y_train)
    
    return model

In [None]:
result_dict['logistic']                        = build(logistic_fn,              LABEL,FEATURES,titanic_df)
result_dict['logistic2']                       = build(logistic_fn,              LABEL,df_features, df)
result_dict['linear_discriminant_analysis']    = build(linear_discriminant_fn,   LABEL, FEATURES,titanic_df)
result_dict['quadratic_discriminant_analysis'] = build(quadratic_discriminant_fn,LABEL,df_features,df)
result_dict['sgd']                             = build(sgd_fn,                   LABEL,FEATURES,titanic_df)
result_dict['linear_svc']                      = build(linear_svc_fn,            LABEL,FEATURES,titanic_df)
result_dict['radius_neighbors']                = build(radius_neighbor_fn,       LABEL,FEATURES,titanic_df)
result_dict['naive_bayes']                     = build(naive_bayes_fn,           LABEL,FEATURES,titanic_df)
result_dict['decision_tree']                   = build(decision_tree_fn,         LABEL,FEATURES,titanic_df)
compare_results()
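
If the printed output gets hard to scan, the same dictionary structure can be turned into a table for side-by-side comparison. The sketch below uses a hypothetical `example_results` dictionary with placeholder numbers; they are not results from this notebook.

```python
import pandas as pd

# Placeholder numbers for illustration only; not results from this notebook.
example_results = {
    "model_a": {"test": {"accuracy": 0.78, "precision": 0.74,
                         "recall": 0.69, "accuracy_count": 143}},
    "model_b": {"test": {"accuracy": 0.80, "precision": 0.77,
                         "recall": 0.70, "accuracy_count": 147}},
}

# One row per model, one column per metric, best accuracy first.
summary = pd.DataFrame(
    {name: res["test"] for name, res in example_results.items()}
).T
summary = summary.sort_values("accuracy", ascending=False)

print(summary)
```

The same two lines would work on `result_dict`, since each entry stores its metrics under the 'test' key.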