## Data Science Club: Titanic Survival Prediction

#### The Titanic dataset and Kaggle challenge are available at https://www.kaggle.com/c/titanic


##### Overview:
Titanic survival prediction is an introductory machine learning challenge on Kaggle. Three datasets are relevant for us: the two Kaggle files and a cleaned version prepared by the club.

###### Datasets:

1) Train.csv: This dataset has 891 Titanic passengers and 12 variables about them. One of these variables records whether the passenger survived.

2) Test.csv: This dataset has 418 Titanic passengers and 11 variables about them (there is no Survived column). Our task is to predict which of these 418 passengers survived using these 11 variables. The predictions for test.csv make up our final submission to Kaggle.

3) Titanic.csv: This is a dataset that Data Science Club put together that is cleaned and a little easier to work with! You can use this data to create a simple model if you want. 


###### Variables:

<li>PassengerId: Unique ID of the passenger</li>
<li>Survived: Survived (1) or died (0)</li>
<li>Pclass: Passenger's class (1st, 2nd, or 3rd)</li>
<li>Name: Passenger's name</li>
<li>Sex: Passenger's sex</li>
<li>Age: Passenger's age</li>
<li>SibSp: Number of siblings/spouses aboard the Titanic</li>
<li>Parch: Number of parents/children aboard the Titanic</li>
<li>Ticket: Ticket number</li>
<li>Fare: Fare paid for ticket</li>
<li>Cabin: Cabin number</li>
<li>Embarked: Where the passenger got on the ship (C = Cherbourg, S = Southampton, Q = Queenstown)</li>
 
 
###### Also on GitHub:
<li>Titanic.R: From last year. If you are familiar with R and want some ideas on the workflow, this file shows the process and submission for a logistic regression model.</li>

In [1]:
## Import libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

In [7]:
## Import training data (the test data is loaded further down)
train_df = pd.read_csv("Titanic_train.csv")

In [9]:
train_df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [10]:
## Describe variables in training data
## Note: describe() summarizes numeric columns only, so Sex does not appear in the output
# sex
# Pclass
# age

train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [14]:
## Count missing values per column to see whether any need to be removed

train_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
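
To put these counts in perspective, it can help to express them as a share of the 891 training rows. The sketch below re-enters the non-zero counts from the output above by hand (the names `missing_counts` and `n_rows` are just for illustration); on the real DataFrame, `train_df.isnull().mean()` gives the same fractions directly.

```python
import pandas as pd

# Non-zero missing-value counts copied from the output above
missing_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891  # number of passengers in the training data

# Fraction of rows missing each variable
missing_frac = (missing_counts / n_rows).round(3)
print(missing_frac)
```

Roughly 20% of Age values and 77% of Cabin values are missing, while Embarked is missing for only 2 passengers.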

In [83]:
test_df = pd.read_csv("Titanic_test.csv")

In [84]:
test_df.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [85]:
## Keep only Survived, Sex, SibSp, Parch, and Fare for a simple first model
Titanic_train = train_df.drop(columns=['Pclass','Name','Ticket','Cabin','Embarked','Age','PassengerId'])

In [86]:
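## Create dummy variables where needed
# Encode Sex as a single 0/1 column (female = 1, male = 0).
# pd.get_dummies returns one column per category, and assigning that
# two-column result to a single column may fail in recent pandas versions.
Titanic_train['Sex'] = (Titanic_train['Sex'] == 'female').astype(int)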


In [87]:
Titanic_train.head()

Unnamed: 0,Survived,Sex,SibSp,Parch,Fare
0,0,0,1,0,7.25
1,1,1,1,0,71.2833
2,1,1,0,0,7.925
3,1,1,1,0,53.1
4,0,0,0,0,8.05
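
As a small illustration of the encoding step, the sketch below uses a made-up three-row Series (`sex` is toy data, not the competition data) to show what `pd.get_dummies` returns and how a single 0/1 indicator can be taken from it.

```python
import pandas as pd

# Toy data for illustration only
sex = pd.Series(["male", "female", "female"], name="Sex")

# get_dummies creates one indicator column per category
dummies = pd.get_dummies(sex, dtype=int)
print(dummies.columns.tolist())  # one column per category

# Keeping only the 'female' column gives a single 0/1 variable
is_female = dummies["female"]
print(is_female.tolist())
```

Because the result has one column per category, selecting the column explicitly makes it clear which category is coded as 1.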


In [88]:
Titanic_train_y=Titanic_train['Survived']
Titanic_train_x=Titanic_train.drop(columns = ['Survived'])

In [89]:
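from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Fit a scikit-learn logistic regression.
# Note: the final predictions further down come from the statsmodels fit, not from this model.
logreg = LogisticRegression()
logreg.fit(Titanic_train_x, Titanic_train_y)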




LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [90]:
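import statsmodels.api as sm
# Note: sm.Logit does not add an intercept automatically,
# so this model is fit without a constant term.
logit_model = sm.Logit(Titanic_train_y, Titanic_train_x)
result = logit_model.fit()
print(result.summary2())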

Optimization terminated successfully.
         Current function value: 0.610661
         Iterations 6
                         Results: Logit
Model:              Logit            Pseudo R-squared: 0.083     
Dependent Variable: Survived         AIC:              1096.1973 
Date:               2019-10-09 21:20 BIC:              1115.3666 
No. Observations:   891              Log-Likelihood:   -544.10   
Df Model:           3                LL-Null:          -593.33   
Df Residuals:       887              LLR p-value:      3.3347e-21
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     6.0000                                       
-------------------------------------------------------------------
           Coef.    Std.Err.      z      P>|z|     [0.025    0.975]
-------------------------------------------------------------------
Sex        1.6051     0.1686    9.5223   0.0000    1.2747    1.9355
SibSp     -0.5659     0.1017   -5.5624   0.0000   -0.7654 
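
One thing worth noting about the model above: `sm.Logit` fits exactly the columns it is given and does not add an intercept on its own. Without an intercept, a passenger whose features are all zero always gets a predicted probability of exactly 0.5, whatever the coefficients are. The sketch below shows this with plain NumPy. The values in `coefs` and the intercept of -1.0 are placeholders chosen for illustration, not fitted values, and `predict_proba` is a helper written for this example. In statsmodels, `sm.add_constant(X)` is the usual way to add the intercept column before fitting.

```python
import numpy as np

def predict_proba(features, coefs, intercept=0.0):
    """Logistic model: p = 1 / (1 + exp(-(intercept + features . coefs)))."""
    z = intercept + np.dot(features, coefs)
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder coefficients for Sex, SibSp, Parch, Fare (illustrative only)
coefs = np.array([1.5, -0.5, 0.1, 0.01])

# A male passenger (Sex = 0) with no relatives aboard and a fare of 0
all_zero = np.array([0.0, 0.0, 0.0, 0.0])

p_no_intercept = predict_proba(all_zero, coefs)
p_with_intercept = predict_proba(all_zero, coefs, intercept=-1.0)  # hypothetical intercept
print(p_no_intercept)
print(round(p_with_intercept, 3))
```

With no intercept the baseline probability is pinned at 0.5, so a cutoff of 0.4999 classifies such a passenger as a survivor. An intercept lets the model move that baseline up or down.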

In [91]:
Titanic_test = test_df.drop(columns=['Pclass','Name','Ticket','Cabin','Embarked','Age','PassengerId'])
# Encode Sex the same way as in the training data (female = 1, male = 0)
Titanic_test['Sex'] = (Titanic_test['Sex'] == 'female').astype(int)



In [92]:
Titanic_train_x.head()


Unnamed: 0,Sex,SibSp,Parch,Fare
0,0,1,0,7.25
1,1,1,0,71.2833
2,1,0,0,7.925
3,1,1,0,53.1
4,0,0,0,8.05


In [93]:
Titanic_test.isnull().sum()

Sex      0
SibSp    0
Parch    0
Fare     1
dtype: int64

In [94]:
## Fill the single missing Fare with the mean fare of the test set
Titanic_test['Fare'] = Titanic_test['Fare'].fillna(Titanic_test['Fare'].mean())

In [95]:
Titanic_test.describe()

Unnamed: 0,Sex,SibSp,Parch,Fare
count,418.0,418.0,418.0,418.0
mean,0.363636,0.447368,0.392344,35.627188
std,0.481622,0.89676,0.981429,55.8405
min,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,7.8958
50%,0.0,0.0,0.0,14.4542
75%,1.0,1.0,0.0,31.5
max,1.0,8.0,9.0,512.3292


In [96]:
y_pred = result.predict(Titanic_test)

In [97]:
y_pred.isnull().sum()

0

In [98]:
# Classify as survived (1) when the predicted probability exceeds 0.4999, otherwise 0
y_preds = (y_pred > 0.4999).astype(int).tolist()

In [99]:
final_preds = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': y_preds
})

In [103]:
final_preds.to_csv("Final_preds.csv", index=False)
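
Before uploading, a quick sanity check on the submission can catch formatting problems. The sketch below assumes the layout built above: one row per test passenger with a `PassengerId` column and a 0/1 `Survived` column. `check_submission` is a hypothetical helper written for this example, and the small DataFrame with made-up IDs stands in for `final_preds`.

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Return a list of problems found in a submission DataFrame (empty if none)."""
    problems = []
    if list(df.columns) != ["PassengerId", "Survived"]:
        problems.append("columns should be PassengerId, Survived")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if "Survived" in df.columns and not df["Survived"].isin([0, 1]).all():
        problems.append("Survived should contain only 0 and 1")
    if "PassengerId" in df.columns and df["PassengerId"].duplicated().any():
        problems.append("PassengerId values should be unique")
    return problems

# Toy stand-in for final_preds (the real submission has 418 rows)
toy = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 0]})
print(check_submission(toy, expected_rows=3))    # no problems for the toy frame
print(check_submission(toy, expected_rows=418))  # reports the row-count mismatch
```

For the real submission the call would be `check_submission(final_preds, expected_rows=418)`.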

## Final result: Kaggle score of 0.48803 (11,262nd place)
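
A score like this is easier to interpret next to a simple baseline computed locally. As a rough sketch, the snippet below uses only the ten training rows printed by `train_df.head(10)` earlier in this notebook (far too few for a real evaluation) and compares two rules: predict that everyone survived, and predict that only women survived. The `accuracy` helper is written for this example.

```python
import pandas as pd

# Sex and Survived for the ten passengers shown by train_df.head(10)
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "male", "female", "female"],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return float((y_true == y_pred).mean())

everyone = pd.Series(1, index=sample.index)            # predict 1 for all rows
women_only = (sample["Sex"] == "female").astype(int)   # predict 1 for women only

print(accuracy(sample["Survived"], everyone))
print(accuracy(sample["Survived"], women_only))
```

Running the same comparison on the full training data, or on a held-out part of it, gives a reference point for a model before it is submitted.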

#### Improvement for next time: find a way to keep Age in the regression model
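
One common way to keep Age despite its missing values is to fill the gaps with the training-set median before fitting. This is only one option and a sketch, not the approach used in this notebook; the toy DataFrames and the `age_median` name are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the training and test data (values are illustrative)
train = pd.DataFrame({"Age": [22.0, 38.0, None, 35.0, 54.0]})
test = pd.DataFrame({"Age": [None, 47.0, 62.0]})

# Compute the median on the training data only, then reuse it for the test data
age_median = train["Age"].median()  # median() skips missing values

train["Age"] = train["Age"].fillna(age_median)
test["Age"] = test["Age"].fillna(age_median)

print(age_median)
print(train["Age"].tolist())
print(test["Age"].tolist())
```

Reusing the training median for the test data keeps the two sets on the same footing, and Age can then stay in the list of model columns instead of being dropped.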