# Assignment 3
MSDS422 - Fulton

## Problem Question:
Use at least four binary (dichotomous) variables of your choice to build models. Predict the binary response variable of survival.  Use cross-validation on the training set prior to submitting your forecasts to be graded on the Kaggle.com withheld test set.   Employ two classification methods: (1) logistic regression as described in Chapter 4 of the Géron (2017) textbook and (2) naïve Bayes classification. Evaluate these methods within a cross-validation design as well as on the test set (minimum of two Kaggle.com submissions).  Use the area under the receiver operating characteristic (ROC) curve as an index of classification as part of cross-validation. Python scikit-learn should be your primary environment for conducting this research.

Regarding the management problem, imagine that you are providing evidence regarding characteristics associated with survival on this ill-fated voyage to a historian writing a book.  Which of the two modeling methods would you recommend and why?

# Analysis and Insights
## Data Prep
After exploring the data, I decided how best to handle each column. I divided the columns into four types: ordinal, nominal, numeric, and not needed. I dropped the unneeded fields and transformed the rest with a ColumnTransformer. The categories are listed below:

Ordinal:
<i>
* Pclass
* SibSp
* Parch
* Level (derived from the first letter of Cabin)
</i>

Nominal Text Fields:
<i>
* Sex
* Embarked
</i>

Numeric Fields:
<i>
* Age
* Fare
</i>

Fields not needed:
<i>
* PassengerId
* Name
* Ticket (very close to an ID type field)
* Cabin (very close to an ID type field)
</i>

Target:
<i>
* Survived
</i>

## Results
The logistic regression model gave better results than the Naive Bayes model on the hold-out split (ROC AUC of 0.828 vs. 0.793) and in 5-fold cross-validation (mean ROC AUC of 0.878 vs. 0.862). I expected this because I read in Géron that Naive Bayes models are typically less powerful.
Once I knew I would be using the logistic regression model, I tried a few things to improve its accuracy. I experimented with my column groupings, moving some fields to ordinal or numeric to see how that would affect the results, but it had little impact. I then ran a grid search (GridSearchCV) to tune the hyperparameters, which improved the results. The tuned model achieved an accuracy score of 0.76076 on Kaggle.

## Next Steps
Determine why my cross-validation score is so different from my train/test score. I use ROC AUC for both, yet I see very different results. Do I have data leakage somehow? I also score much lower on the test set that gets submitted to Kaggle.
I could also try changing the classification threshold. I believe predictions use 0.5 as the cutoff by default. Could tweaking that change things?
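
The threshold idea could be tried along these lines. This is only a minimal sketch under stated assumptions: it uses synthetic stand-in data from `make_classification` rather than the Titanic features, and the names `proba_val` and `threshold_scores` are hypothetical. It relies on `predict_proba`, whose second column is the estimated probability of the positive class.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the Titanic features)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Estimated probability of the positive class for each validation row
proba_val = clf.predict_proba(X_val)[:, 1]

# Score a handful of candidate cutoffs instead of relying on the default of 0.5
threshold_scores = {}
for t in (0.3, 0.4, 0.5, 0.6, 0.7):
    preds = (proba_val >= t).astype(int)
    threshold_scores[t] = accuracy_score(y_val, preds)

for t, acc in threshold_scores.items():
    print(f"threshold={t:.1f} accuracy={acc:.3f}")
```

Any cutoff chosen this way should be picked on validation data, not on the rows used for the final evaluation.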

## Analysis
If I were using this to provide evidence regarding characteristics associated with survival, I would recommend neither model. I feel that any evidence should be purely factual. The saying "all models are wrong, but some are useful" comes to mind here. If the author truly wants an evidence-based book, he should use descriptive analysis and not a model.
If he twisted my arm and agreed to state that the findings are speculative and the product of modeling techniques, I'd point him toward my logistic regression model, since it performed better. Ultimately, though, he would want to keep looking, since my results were far from perfect.
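
As an illustration of what such a descriptive analysis might look like, here is a small sketch that tabulates survival rates by group. The rows in `toy` are made up for illustration and are not taken from the Titanic data; in practice the same `groupby` would be applied to the training DataFrame.

```python
import pandas as pd

# Toy rows invented for illustration only (assumption: NOT real passenger records)
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# Count passengers and survivors, and compute the observed survival rate per group
survival_by_group = (
    toy.groupby(["Sex", "Pclass"])["Survived"]
       .agg(passengers="count", survivors="sum", survival_rate="mean")
       .reset_index()
)
print(survival_by_group)
```

A table like this reports only observed counts and proportions, so it makes no modeling assumptions.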

# Appendix - Code and Output

## Adding Libraries

In [29]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
%matplotlib inline

## Importing Training Data

In [30]:
# Data source: https://www.kaggle.com/c/titanic
df = pd.read_csv('./Data/train.csv', sep=',', engine='python')

## High Level Summary Statistics

In [31]:
# Getting a look at the first 5 rows
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [32]:
# Age, Cabin, Embarked have NULL values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [48]:
# Getting summary statistics for all variables
df.describe(include = 'all')

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Level | Relatives |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891 | 891 | 714.0 | 891.0 | 891.0 | 891 | 891.0 | 204 | 889 | 891.0 | 891.0 |
| unique | | | | 891 | 2 | | | | 681 | | 147 | 3 | | |
| top | | | | Herman, Mrs. Samuel (Jane Laver) | male | | | | CA. 2343 | | G6 | S | | |
| freq | | | | 1 | 577 | | | | 7 | | 4 | 644 | | |
| mean | 446.0 | 0.383838 | 2.308642 | | | 29.699118 | 0.523008 | 0.381594 | | 32.204208 | | | 0.776655 | 0.904602 |
| std | 257.353842 | 0.486592 | 0.836071 | | | 14.526497 | 1.102743 | 0.806057 | | 49.693429 | | | 1.590899 | 1.613459 |
| min | 1.0 | 0.0 | 1.0 | | | 0.42 | 0.0 | 0.0 | | 0.0 | | | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | | | 20.125 | 0.0 | 0.0 | | 7.9104 | | | 0.0 | 0.0 |
| 50% | 446.0 | 0.0 | 3.0 | | | 28.0 | 0.0 | 0.0 | | 14.4542 | | | 0.0 | 0.0 |
| 75% | 668.5 | 1.0 | 3.0 | | | 38.0 | 1.0 | 0.0 | | 31.0 | | | 0.0 | 1.0 |


In [34]:
# Checking correlation of features
df.corr()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.0 | -0.005007 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 |
| Survived | -0.005007 | 1.0 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
| Pclass | -0.035144 | -0.338481 | 1.0 | -0.369226 | 0.083081 | 0.018443 | -0.5495 |
| Age | 0.036847 | -0.077221 | -0.369226 | 1.0 | -0.308247 | -0.189119 | 0.096067 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.308247 | 1.0 | 0.414838 | 0.159651 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.0 | 0.216225 |
| Fare | 0.012658 | 0.257307 | -0.5495 | 0.096067 | 0.159651 | 0.216225 | 1.0 |


## Feature Creation

In [35]:
# Creating new ordinal field from first letter of cabin. I'm guessing this is something like a level of the ship?
df.loc[:,'Level'] = df['Cabin'].str[0:1]

# Mapping values for level
level_mapping = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F":6, "G":7, "T":8}
df.loc[:,'Level'] = df.loc[:,'Level'].map(level_mapping)
df.loc[:,'Level'] = df.loc[:,'Level'].fillna(0)

# Combining parch and sibsp into one field. They're basically the same thing right?
df.loc[:,'Relatives'] = df.loc[:,'Parch'] + df.loc[:,'SibSp']
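
Because the same feature steps are repeated later for the submission data, one option is to wrap them in a single function and call it on both DataFrames. The sketch below is hypothetical: the names `add_features`, `LEVEL_MAPPING`, and `toy` are assumptions for illustration, and `toy` holds made-up rows rather than real passenger records.

```python
import pandas as pd

# Same letter-to-number mapping as above; unknown or missing cabins become 0
LEVEL_MAPPING = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "T": 8}

def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame with the Level and Relatives columns added."""
    out = frame.copy()
    out["Level"] = out["Cabin"].str[0].map(LEVEL_MAPPING).fillna(0)
    out["Relatives"] = out["Parch"] + out["SibSp"]
    return out

# Toy rows invented for illustration only
toy = pd.DataFrame({"Cabin": ["C85", None, "E46"], "Parch": [0, 1, 2], "SibSp": [1, 0, 3]})
toy_features = add_features(toy)
print(toy_features)
```

Calling one function on both the training and submission frames keeps the two code paths from drifting apart.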

## Splitting into Training and Test

In [37]:
# Doing an 80/20 train/test split; the 20% hold-out is used for evaluation
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns = 'Survived'), df.loc[:,'Survived'], test_size = 0.2, random_state = 1234)

## Feature Transformation

In [39]:
# Setting up a pipeline as shown in "Hands-On Machine Learning with Scikit-Learn" 
num_attribs = ['Age', 'Fare', 'Relatives']
nom_attribs = ['Sex', 'Embarked']
#ord_attribs = ['Pclass']
pass_attribs = ['Level', 'Pclass'] #Since these are already prepped we'll just pass them through

# Numeric pipeline: imputing with the median, then applying min-max scaling
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy = "median")),
    ('minmax_scaler', MinMaxScaler())
])

# Nominal pipeline: imputing with the mode, then applying one-hot encoding
nom_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy = "most_frequent")),
    ('1h_encoder', OneHotEncoder(handle_unknown='ignore')) 
])

# Ordinal pipeline: imputing with the mode, then applying ordinal encoding
# (currently unused; Pclass is passed through below instead)
ord_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy = "most_frequent")),
    ('ordinal_encoder', OrdinalEncoder()) 
])

# Full pipeline containing the numeric and categorical pipelines
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("nom", nom_pipeline, nom_attribs),
    #("ord", ord_pipeline, ord_attribs)
    ("pass", "passthrough", pass_attribs)
    ],
    remainder ='drop'
)

# Fitting then transforming the training data
transform_train = full_pipeline.fit_transform(X_train)
# Only Transforming the test data
transform_test = full_pipeline.transform(X_test)

## Modeling
Logistic Regression:

In [40]:
# Using the best params from the grid search below
log_reg = LogisticRegression(random_state = 1776, C = 1, solver = 'saga', penalty = 'l1', max_iter = 1000)
log_reg.fit(transform_train, y_train)

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=1776, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)

In [22]:
# Checking predictions
# Note: predict() returns hard 0/1 class labels, so the ROC AUC values below
# are computed from labels rather than from predicted probabilities
predictions_train = log_reg.predict(transform_train)
predictions_test = log_reg.predict(transform_test)

# ROC AUC
auc_train_log = roc_auc_score(y_train, predictions_train)
auc_test_log = roc_auc_score(y_test, predictions_test)

# Accuracy score
acc_train_log = accuracy_score(y_train, predictions_train)
acc_test_log = accuracy_score(y_test, predictions_test)

print("-------ROC AUC---------")
print("TRAIN:", auc_train_log)
print("TEST:", auc_test_log)

print("-------Accuracy---------")
print("TRAIN:", acc_train_log)
print("TEST:", acc_test_log)

-------ROC AUC---------
TRAIN: 0.7772393048128342
TEST: 0.8275884665792923
-------Accuracy---------
TRAIN: 0.7949438202247191
TEST: 0.8491620111731844
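
The ROC AUC values above come from `predict()`, which returns hard 0/1 labels, whereas scikit-learn's `"roc_auc"` scorer (used by `cross_val_score`) ranks rows by continuous scores. That difference may account for part of the gap between the two sets of numbers. Below is a minimal sketch of both calculations side by side, assuming synthetic stand-in data rather than the Titanic features. The variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the Titanic features)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard class labels: only one operating point on the ROC curve is used
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC from positive-class probabilities: the full ranking of rows is used
auc_from_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("AUC from labels:       ", auc_from_labels)
print("AUC from probabilities:", auc_from_proba)
```

The two numbers generally differ, so comparisons are only meaningful when both sides use the same kind of input.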


In [23]:
# Calculating the 5-fold cross-validated ROC AUC on the hold-out split (transform_test, y_test)
scores = cross_val_score(log_reg, transform_test, y_test, scoring = "roc_auc", cv = 5)
print("Mean: ", scores.mean())
print("Standard Deviation: ", scores.std())

Mean:  0.8783240568954854
Standard Deviation:  0.04591737826603321


Naive Bayes:

In [24]:
gnb = GaussianNB()
gnb.fit(transform_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [25]:
# Checking predictions - GNB
# Note: predict() returns hard 0/1 class labels, so the ROC AUC values below
# are computed from labels rather than from predicted probabilities
predictions_train = gnb.predict(transform_train)
predictions_test = gnb.predict(transform_test)

# ROC AUC
auc_train_gnb = roc_auc_score(y_train, predictions_train)
auc_test_gnb = roc_auc_score(y_test, predictions_test)

# Accuracy score
acc_train_gnb = accuracy_score(y_train, predictions_train)
acc_test_gnb = accuracy_score(y_test, predictions_test)

print("-------ROC AUC---------")
print("TRAIN:", auc_train_gnb)
print("TEST:", auc_test_gnb)

print("-------Accuracy---------")
print("TRAIN:", acc_train_gnb)
print("TEST:", acc_test_gnb)

-------ROC AUC---------
TRAIN: 0.7834893048128342
TEST: 0.7934469200524247
-------Accuracy---------
TRAIN: 0.7879213483146067
TEST: 0.8044692737430168


In [26]:
# Calculating the 5-fold cross-validated ROC AUC on the hold-out split (transform_test, y_test)
scores_gnb = cross_val_score(gnb, transform_test, y_test, scoring = "roc_auc", cv = 5)
print("Mean: ", scores_gnb.mean())
print("Standard Deviation: ", scores_gnb.std())

Mean:  0.8619975262832407
Standard Deviation:  0.022335184716764354
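
One way to compare the two models inside a single cross-validation design is to run both on the training split with the same folds, keeping the scaler inside the pipeline so that it is refit on each fold. The sketch below shows that approach; it is not the procedure used for the numbers above. It assumes synthetic stand-in data, and the names `models` and `cv_auc` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: NOT the Titanic features)
X, y = make_classification(n_samples=600, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Identical, reproducible folds for both models
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# The scaler lives inside each pipeline, so it is fit only on the training folds
models = {
    "logistic_regression": make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000)),
    "gaussian_nb": make_pipeline(MinMaxScaler(), GaussianNB()),
}

# Cross-validate on the training split only; the hold-out split stays untouched
cv_auc = {
    name: cross_val_score(model, X_tr, y_tr, scoring="roc_auc", cv=cv)
    for name, model in models.items()
}
for name, scores in cv_auc.items():
    print(name, "mean:", scores.mean(), "std:", scores.std())
```

With this layout the hold-out split is scored once at the end, which keeps it independent of model selection.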


## Grid Search
Attempting to find the best parameters for the Logistic Regression Model

In [41]:
# Values to test
grid_values = {'penalty': ['l1', 'l2', 'elasticnet', 'none'],'C':[0.001,0.01,1,10,25,100], 'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
grid_clf_acc = GridSearchCV(log_reg, param_grid = grid_values,scoring = 'roc_auc')
grid_clf_acc.fit(transform_train, y_train)

#Predict values based on new parameters
y_pred_acc = grid_clf_acc.predict(transform_test)

ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver sag supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got elasticnet penalty.

ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got elasticnet penalty.

ValueError: Only 'saga' solver supports elasticnet penalty, got solver=liblinear.

ValueError: Solver sag supports only 'l2' or 'none' penalties, got elasticnet penalty.

ValueError: l1_ratio must be between 0 and 1; got (l1_ratio=None)

  "Setting penalty='none' will ignore the C and l1_ratio "

ValueError: penalty='none' is not supported for the liblinear solver

In [42]:
# Displaying the best parameters
grid_clf_acc.best_params_

{'C': 1, 'penalty': 'l1', 'solver': 'saga'}
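
The errors above arise because a single dictionary makes the grid search try every solver with every penalty, including pairs that `LogisticRegression` rejects. `GridSearchCV` also accepts a list of dictionaries, so each solver can be paired only with penalties it supports. The sketch below assumes synthetic stand-in data and a reduced, hypothetical set of `C` values; it is not the grid used for the result above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: NOT the Titanic features)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

C_values = [0.01, 1, 10, 100]

# Each dictionary is its own sub-grid, so only valid solver/penalty pairs are tried
param_grid = [
    {"solver": ["lbfgs", "newton-cg", "sag"], "penalty": ["l2"], "C": C_values},
    {"solver": ["liblinear", "saga"], "penalty": ["l1", "l2"], "C": C_values},
]

search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=5,
    error_score="raise",  # fail loudly if any combination is invalid
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Setting `error_score="raise"` makes an invalid combination stop the search immediately rather than being scored as a failed fit.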

## Submission Steps:

In [43]:
# Reading in test data
submit = pd.read_csv('./Data/test.csv', sep=',', engine='python')

In [44]:
# Creating new ordinal field from first letter of cabin
submit.loc[:,'Level'] = submit['Cabin'].str[0:1]

# Mapping values for level
level_mapping = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F":6, "G":7, "T":8}
submit.loc[:,'Level'] = submit.loc[:,'Level'].map(level_mapping)
submit.loc[:,'Level'] = submit.loc[:,'Level'].fillna(0)

# Combining Parch and SibSp into one field, using the submission data for both columns
submit.loc[:,'Relatives'] = submit.loc[:,'Parch'] + submit.loc[:,'SibSp']

In [45]:
# Applying Transformations
transform_submit = full_pipeline.transform(submit)

In [46]:
# Getting predictions for submission
final_predictions = log_reg.predict(transform_submit)

In [47]:
# Packaging submission up
final_id = np.array(submit['PassengerId']).astype(int)
my_solution = pd.DataFrame(final_predictions, final_id, columns = ['Survived'])
my_solution.to_csv("submissions/submission3.csv", index_label = ["PassengerId"])