# Titanic Competition 2018 - Julia Beitel

Julia Beitel - Big Data and Analytics - December 7th 2018

## Import Data Packages

In [377]:
#data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

#visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

## Input Datasets

In [378]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
combine = [train_df, test_df]

## Check and Analyze Data

In [379]:
# preview the data
train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [380]:
print(train_df.columns.values)

['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']


In [381]:
train_df.info()
print('_'*40)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null

From the raw data, I observed that most columns were complete (891 values in the training set and 418 in the test set), that there were no prevalent typos, and that Age and Cabin both had large amounts of missing data. I also observed that there were 11 columns, not including Survived. Because so much of Age and Cabin was missing, and I do not have the coding expertise to estimate the missing values, I decided to drop them. After dropping these columns, I wanted to look at just Sex, Pclass, the family size of passengers, and whether they were traveling alone.
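The missing-data observation can be put into numbers. The following is a minimal sketch on a tiny made-up frame (the values are illustrative and are not read from train.csv; `toy_df` and `missing_share` are hypothetical names) that reports the share of missing entries per column.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are made up, not read from train.csv
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# isnull() marks missing cells; the column mean of those flags is the missing share
missing_share = toy_df.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the `info()` output above, Age has 714 of 891 values present (about 20% missing) and Cabin has 204 of 891 (about 77% missing).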

## Clean Data

This step turns the categorical Sex data (male vs. female) into binary data made of 1s and 0s.

In [382]:
#turn categorical data for 'Sex' into binary dataset using 1s and 0s
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
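One thing to watch with `map` is that any value missing from the dictionary comes back as NaN, and `astype(int)` then raises an error. The sketch below uses made-up values (not the Titanic data) to show a quick check before the cast; `toy_sex` and the other names are hypothetical.

```python
import pandas as pd

# Made-up values; 'Female' (capitalized) is deliberately not a key in the mapping
toy_sex = pd.Series(["male", "female", "Female"])
mapped = toy_sex.map({"female": 1, "male": 0})

# Values that did not match any key come back as NaN
unmapped_count = int(mapped.isnull().sum())
print(unmapped_count)

# Lower-casing first lets every value match, so the int cast is safe
clean = toy_sex.str.lower().map({"female": 1, "male": 0}).astype(int)
print(clean.tolist())
```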

Here, I combined the SibSp and Parch variables to make a new variable, Family, that counts the total number of family members the passenger was traveling with.

In [383]:
#combine sibsp and parch
train_df["Family"] = train_df["SibSp"] + train_df["Parch"]
test_df["Family"] = test_df["SibSp"] + test_df["Parch"]

Here, I dropped the columns that I did not want to factor into my predictions. I dropped these columns because I later found that they did not have strong correlations to a passenger's survivability.

In [384]:
#drop columns that we aren't confident in (same list for train and test)
drop_cols = ['Parch', 'SibSp', 'Ticket', 'Embarked', 'Cabin', 'Name', 'Age']
for dataset in combine:
    dataset.drop(drop_cols, axis = 1, inplace = True)

Here, I created the variable IsAlone, which states whether a passenger traveled alone (1) or with family counted in Parch or SibSp (0).

In [385]:
#better analysis than the family key feature
for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['Family'] == 0, 'IsAlone'] = 1
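The same two features can also be derived in one step each from column arithmetic and a boolean comparison. This is a minimal sketch on made-up SibSp and Parch values, not the notebook's data; `toy_df` is a hypothetical name.

```python
import pandas as pd

# Made-up sibling/spouse and parent/child counts
toy_df = pd.DataFrame({"SibSp": [1, 0, 0, 4], "Parch": [0, 0, 2, 1]})

# Family size is the sum of the two counts
toy_df["Family"] = toy_df["SibSp"] + toy_df["Parch"]

# The comparison gives True/False; casting to int gives the 1/0 flag
toy_df["IsAlone"] = (toy_df["Family"] == 0).astype(int)
print(toy_df)
```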

Next, I printed the head of the train and test data to see what they look like now that I have dropped the unwanted columns.

In [386]:
#print finalized head of dataset
train_df.head()

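| | PassengerId | Survived | Pclass | Sex | Fare | Family | IsAlone |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | 7.25 | 1 | 0 |
| 1 | 2 | 1 | 1 | 1 | 71.2833 | 1 | 0 |
| 2 | 3 | 1 | 3 | 1 | 7.925 | 0 | 1 |
| 3 | 4 | 1 | 1 | 1 | 53.1 | 1 | 0 |
| 4 | 5 | 0 | 3 | 0 | 8.05 | 0 | 1 |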


In [387]:
#print finalized head of dataset
test_df.head()

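| | PassengerId | Pclass | Sex | Fare | Family | IsAlone |
|---|---|---|---|---|---|---|
| 0 | 892 | 3 | 0 | 7.8292 | 0 | 1 |
| 1 | 893 | 3 | 1 | 7.0 | 1 | 0 |
| 2 | 894 | 2 | 0 | 9.6875 | 0 | 1 |
| 3 | 895 | 3 | 0 | 8.6625 | 0 | 1 |
| 4 | 896 | 3 | 1 | 12.2875 | 2 | 0 |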


This is where I found the correlations between my finalized columns and Survived. I also computed the correlations for the columns I dropped, but I deleted that code once those columns were gone because the correlations were insignificant.

In [388]:
#find correlations to find the survivability of different features
traincorr = train_df.corr(method='spearman')

traincorr.drop('PassengerId', axis = 1, inplace = True)
traincorr.drop('Pclass', axis = 1, inplace = True)
traincorr.drop('Sex', axis = 1, inplace = True)
#traincorr.drop('Fare', axis = 1, inplace = True)
traincorr.drop('Family', axis = 1, inplace = True)
traincorr.drop('IsAlone', axis = 1, inplace = True)

traincorr

| | Survived | Fare |
|---|---|---|
| PassengerId | -0.005007 | -0.013975 |
| Survived | 1.0 | 0.323736 |
| Pclass | -0.339668 | -0.688032 |
| Sex | 0.543351 | 0.259593 |
| Fare | 0.323736 | 1.0 |
| Family | 0.165463 | 0.528907 |
| IsAlone | -0.203367 | -0.531472 |
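Spearman correlation works on ranks, so it measures how consistently one column rises or falls with another. As a sketch on made-up rows (not the real data; `toy_df` and `survived_corr` are hypothetical names), the column of interest can be selected directly from the correlation matrix instead of dropping the other columns one at a time.

```python
import pandas as pd

# Made-up rows: Sex lines up perfectly with Survived, Pclass only partly
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Sex":      [0, 1, 1, 1, 0, 0],
    "Pclass":   [3, 1, 3, 1, 3, 2],
})

# Full Spearman matrix, then keep only the Survived column without its self-correlation
survived_corr = (
    toy_df.corr(method="spearman")["Survived"]
    .drop("Survived")
    .sort_values(ascending=False)
)
print(survived_corr)
```

A positive value means the feature tends to rise with survival and a negative value means it tends to fall, which matches the signs for Sex and Pclass in the table above.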


## Analyze Key Features

Next, I wanted to analyze the columns most strongly related to survival (based on the above correlations) as key features for my predictions.

In [389]:
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

| | Pclass | Survived |
|---|---|---|
| 0 | 1 | 0.62963 |
| 1 | 2 | 0.472826 |
| 2 | 3 | 0.242363 |


In [390]:
train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

| | Sex | Survived |
|---|---|---|
| 1 | 1 | 0.742038 |
| 0 | 0 | 0.188908 |


In [391]:
train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean().sort_values(by='IsAlone', ascending=False)

| | IsAlone | Survived |
|---|---|---|
| 1 | 1 | 0.303538 |
| 0 | 0 | 0.50565 |


In [392]:
train_df[['Family', 'Survived']].groupby(['Family'], as_index=False).mean().sort_values(by='Family', ascending=False)

| | Family | Survived |
|---|---|---|
| 8 | 10 | 0.0 |
| 7 | 7 | 0.0 |
| 6 | 6 | 0.333333 |
| 5 | 5 | 0.136364 |
| 4 | 4 | 0.2 |
| 3 | 3 | 0.724138 |
| 2 | 2 | 0.578431 |
| 1 | 1 | 0.552795 |
| 0 | 0 | 0.303538 |
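Survival rates for groups with only a few passengers can be noisy, so it may help to show the group size next to each mean. This is a sketch on made-up rows (not the real data) using `agg`; the names `SurvivalRate` and `Passengers` are hypothetical labels.

```python
import pandas as pd

# Made-up rows: one large group, one small group, one single-passenger group
toy_df = pd.DataFrame({
    "Family":   [0, 0, 0, 0, 1, 1, 7],
    "Survived": [0, 1, 0, 0, 1, 0, 0],
})

# Mean of a 0/1 column is the survival rate; count is the group size
summary = (
    toy_df.groupby("Family")["Survived"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "SurvivalRate", "count": "Passengers"})
)
print(summary)
```

A rate of 0.0 based on a single passenger says much less than the same rate based on hundreds of passengers.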


## Training Machines

Then, I prepared the data for training my classifiers. I created two for loops (one over the train columns and one over the test columns) to keep only the key features Sex, Pclass, and Fare, along with PassengerId and Survived. This drops Family and IsAlone.

In [393]:
for val in train_df.columns:
    if val != 'Sex' and val != 'PassengerId' and val != 'Survived' and val != 'Pclass' and val != 'Fare': 
        train_df = train_df.drop([val], axis=1)
        test_df = test_df.drop([val], axis=1)
        
for val in test_df.columns:
    if val != 'Sex' and val != 'PassengerId' and val != 'Pclass' and val != 'Fare': 
        train_df = train_df.drop([val], axis=1)
        test_df = test_df.drop([val], axis=1)
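The same filtering can also be written without loops by selecting a list of column names. This is only a sketch on a made-up two-row frame; `toy_train` and `key_features` are hypothetical names.

```python
import pandas as pd

# Made-up frame with the same columns the training data has at this point
toy_train = pd.DataFrame({
    "PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1],
    "Sex": [0, 1], "Fare": [7.25, 71.2833], "Family": [1, 1], "IsAlone": [0, 0],
})

# Keep the identifier, the target, and the chosen features; everything else is dropped
key_features = ["Pclass", "Sex", "Fare"]
toy_train = toy_train[["PassengerId", "Survived"] + key_features]
print(toy_train.columns.tolist())
```

Listing the columns to keep makes the final feature set visible in one place, which is harder to see from a chain of `!=` checks.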

Here, I defined the train and test variables and dropped PassengerId from the features in order to set up my machines for predicting.

In [394]:
X_train = train_df.drop(["Survived", 'PassengerId'], axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1)

X_train.shape, Y_train.shape, X_test.shape

((891, 3), (891,), (418, 3))

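Next, I split my training data into a 70% training set and a 30% held-out set, and then split the training portion again to set up variables for an overfitting test. Note that this reuses the name X_test, so from here on X_test refers to the held-out 30% of the training data rather than the Kaggle test set.

In [395]:
#hold-out split (a single 70/30 split, not k-fold cross validation)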
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split #split data into train and test set 
X_train, X_test, Y_train, Y_test = train_test_split(X_train, Y_train, train_size = .7, test_size = .3)

valid_X_train, valid_X_test, valid_Y_train, valid_Y_test = train_test_split(X_train, Y_train, train_size = .7, test_size = .3)

valid_X_train.shape, valid_X_test.shape, valid_Y_train.shape, valid_Y_test.shape

((436, 3), (187, 3), (436,), (187,))
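For comparison, here is a minimal sketch of k-fold cross-validation with `cross_val_score`. It runs on synthetic data from `make_classification`, which is an assumption standing in for the three Titanic features. The names `X_fit` and `X_holdout` are hypothetical, chosen so that a separate test set would not be overwritten.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for three features (assumption, not the Titanic data)
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Hold out 30% under its own names; random_state makes the split repeatable
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 5-fold cross-validation: every fold is scored on rows the model did not train on
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                            X_fit, y_fit, cv=5)
print(cv_scores.mean())
```

Setting `random_state` is a design choice that keeps the reported numbers the same from run to run.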

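Here, I fit several classifiers and recorded the accuracy of each. Each score below comes from score(X_train, Y_train), so it is the accuracy on the training split rather than on held-out data.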

In [396]:
# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log

78.65

In [397]:
# Support Vector Machines
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc

84.27

In [398]:
#KNearest Neighbors
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn

86.36

In [399]:
# Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian

77.69

In [400]:
# Linear SVC
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc

78.97

The decision tree classifier tied with the random forest for the highest training accuracy score (91.01), so I decided to use it for my submission.

In [401]:
# Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree

91.01

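Here you can see the test I did for potential overfitting. I counted it as a success because the two scores differ by only about 0.3 percentage points. Note, however, that valid_X_train is a subset of X_train, so both scores are measured on rows the tree was trained on; scoring on the held-out X_test and Y_test would be a stronger check.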

In [402]:
#Decision Tree overfit test - SUCCESS
print((decision_tree.score(X_train, Y_train)*100), decision_tree.score(valid_X_train, valid_Y_train)*100)

91.01123595505618 91.28440366972477
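A stricter overfitting check compares accuracy on the rows used for fitting with accuracy on rows the model never saw. The following is a sketch on synthetic data (an assumption, not the Titanic data), with hypothetical names such as `fit_acc` and `holdout_acc`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y) so a perfect fit is suspicious
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=1)
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=1)

tree = DecisionTreeClassifier(random_state=1).fit(X_fit, y_fit)

# Accuracy on rows the tree has seen vs. rows it has not seen
fit_acc = tree.score(X_fit, y_fit)
holdout_acc = tree.score(X_holdout, y_holdout)
gap = fit_acc - holdout_acc
print(round(fit_acc, 3), round(holdout_acc, 3), round(gap, 3))
```

A large gap between the two numbers suggests the tree has memorized the training rows; a small gap suggests it generalizes.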


In [403]:
# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest

91.01

In [404]:
#Random Forest overfit test - SUCCESS
print((random_forest.score(X_train, Y_train)*100), random_forest.score(valid_X_train, valid_Y_train)*100)

91.01123595505618 91.05504587155964


In [405]:
# AdaBoost Classifier
ada_boost = AdaBoostClassifier(n_estimators=300)
ada_boost.fit(X_train, Y_train)
Y_pred = ada_boost.predict(X_test)
ada_boost.score(X_train, Y_train)
acc_ada_boost = round(ada_boost.score(X_train, Y_train) * 100, 2)
acc_ada_boost

85.39
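The per-model cells above can also be collected into one loop that scores each model on held-out rows and sorts the results. This is a sketch under assumptions: the data is synthetic rather than the Titanic data, only three of the models are included, and the names `models` and `results` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model once and record both training and held-out accuracy
rows = []
for name, model in models.items():
    model.fit(X_fit, y_fit)
    rows.append({
        "Model": name,
        "TrainAcc": round(model.score(X_fit, y_fit) * 100, 2),
        "HoldoutAcc": round(model.score(X_holdout, y_holdout) * 100, 2),
    })

results = pd.DataFrame(rows).sort_values(by="HoldoutAcc", ascending=False)
print(results)
```

Ranking by held-out accuracy rather than training accuracy makes it less likely that the winner is simply the model that memorized the most.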

## Pick the Winner and Create Submission

This is where I used my decision tree classifier, which had a high accuracy score, for my submission. I chose the decision tree because of its high accuracy score and because it showed only a small amount of overfitting.

In [406]:
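#predict on the Kaggle test features; X_test was overwritten by the train_test_split above
#note: predict may fail if test_df has missing values, so fill them first if there are any
final_pred = decision_tree.predict(test_df.drop("PassengerId", axis=1))
submission = pd.DataFrame({'PassengerId': test_df['PassengerId'],'Survived': final_pred})

submission
submission.to_csv('submission.csv', index=False)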
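A submission needs exactly one prediction per test row, so the values in the Survived column must come from calling `predict` on the test features. The sketch below shows that shape end to end on made-up rows (not the real data). It writes to an in-memory buffer instead of a file, and all of the names are hypothetical.

```python
import io

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up training and test rows with the same three features
toy_train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1], "Sex": [0, 1, 1, 0],
    "Fare": [7.25, 71.28, 7.92, 53.1], "Survived": [0, 1, 1, 0],
})
toy_test = pd.DataFrame({
    "PassengerId": [892, 893], "Pclass": [3, 1],
    "Sex": [0, 1], "Fare": [7.83, 60.0],
})

features = ["Pclass", "Sex", "Fare"]
model = DecisionTreeClassifier(random_state=0)
model.fit(toy_train[features], toy_train["Survived"])

# predict returns one label per test row; the model object itself is not a prediction
predictions = model.predict(toy_test[features])

submission = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"],
    "Survived": predictions,
})

# Write the CSV to memory so the sketch leaves no file behind
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Checking that `len(submission)` equals the number of test rows before writing the file is a cheap way to catch a mismatch between the predictions and the test set.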

## Future Exploration

Given more time, I would have loved to get a higher accuracy score by refining the weights of each of my variables. Additionally, I wish I had more experience in machine learning so that I could create my own neural network or classifier!

## Acknowledgements

I received most of my code from Manav Sehgal. His Titanic machine learning solutions notebook from the competition tutorials page hugely helped me get a higher accuracy score. It helped me with my data analysis, with combining types of data and variables, and with my classifier code.


My teacher, Ms. Sconyers, also helped me hugely with my code. She provided the imports, code for my analysis, and code for my training for loops and cross validation.


My classmate, Lillian Ellis, also helped me. Lillian shared some code with me for combining one of my variable types and for converting my data into a different type. This code also came from Manav Sehgal.