# **Titanic - Machine Learning for Disaster**

* The task is to predict survival on the Titanic, a challenge hosted on Kaggle: https://www.kaggle.com/c/titanic/overview

* There are ten variables in this dataset: 

  **One Label**: *Survival*: 0=No, 1=Yes

  **Nine Features**:
  *pclass* (ticket class); *sex*; *age*; *sibsp* (# of siblings / spouses aboard); *parch* (# of parents / children aboard); *ticket* (ticket number); *fare* (passenger fare); *cabin* (cabin number); *embarked* (port of embarkation, C=Cherbourg, Q=Queenstown, S=Southampton)

* This exercise is meant to build familiarity with classification in Python, so we use only two features from the dataset (*pclass* and *parch*) for illustration purposes.

# **Logistic regression via `sklearn`**

In [30]:
import pandas as pd
import numpy as np
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [31]:
# load the Titanic passenger data
full_data = pd.read_csv("titanic.csv")
# have a look at the first few rows
full_data.head()
# full_data.info()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [36]:
# get training/test splits (70% / 30%) with the `train_test_split` function
X_train, X_test, y_train, y_test = train_test_split(
    full_data.loc[:, ['Pclass', 'Parch']],  # feature matrix: two features only
    full_data.Survived,                      # label
    test_size=0.30,
    random_state=105,
)
print(full_data.shape)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(891, 12)
(623, 2)
(268, 2)
(623,)
(268,)
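
The shapes above follow from simple arithmetic on `test_size`. The sketch below reproduces them under the assumption that the test-set count is the requested fraction rounded up and the training set takes the remaining rows; the variable names are illustrative and not part of `sklearn`.

```python
import math

n_samples = 891    # rows in full_data, from the printed shape
test_size = 0.30   # fraction passed to train_test_split

# assumption: test count is rounded up, training set gets the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption the counts agree with the 623 training rows and 268 test rows printed above.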


In [39]:
# build a logistic regression model
from sklearn.linear_model import LogisticRegression
# initialise the model; sklearn adds an L2 penalty by default (LogisticRegression()),
# so pass penalty=None to fit a plain, unpenalised logistic regression
lr = LogisticRegression(penalty=None)
# train the model on the training feature matrix and labels
lr.fit(X_train, y_train)

In [40]:
# or combine the previous two steps in one line; the result is the same
lr = LogisticRegression(penalty=None).fit(X_train, y_train)

In [41]:
# have a look at the estimated coefficients and intercept
print(lr.coef_, lr.intercept_)

[[-0.88493655  0.29887941]] [1.43199373]
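
To see what these numbers mean, the coefficients can be turned into a survival probability by hand. The following is a minimal sketch, assuming the coefficient order matches the feature order (`Pclass`, `Parch`); the function name `survival_probability` and the two example passengers are hypothetical and chosen only for illustration.

```python
import numpy as np

# coefficients and intercept copied from the printed output above
coef = np.array([-0.88493655, 0.29887941])  # assumed order: Pclass, Parch
intercept = 1.43199373

def survival_probability(pclass, parch):
    """Logistic model: P(Survived=1) = 1 / (1 + exp(-(intercept + coef . x)))."""
    z = intercept + coef @ np.array([pclass, parch], dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical passengers: third class and first class, no parents/children aboard
p_third = survival_probability(pclass=3, parch=0)
p_first = survival_probability(pclass=1, parch=0)
print(round(p_third, 4), round(p_first, 4))

# a label of 1 corresponds to a probability above 0.5
label_third = int(p_third > 0.5)
label_first = int(p_first > 0.5)
print(label_third, label_first)
```

The two probabilities should come out at roughly 0.23 and 0.63, in line with scores such as 0.227441 and 0.633452 that appear in the `statsmodels` section. The negative `Pclass` coefficient means a higher class number (a cheaper ticket) lowers the estimated chance of survival.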


In [42]:
# get prediction of survival from logistic regression
pred = lr.predict(X_test)
pred[0:9]

array([0, 0, 1, 0, 0, 0, 1, 1, 0])

# **Logistic regression via `statsmodels`**

In [43]:
# as a statistical model, however, we usually want easy access to the estimated coefficients,
# their p-values and other statistical quantities, as we would get in R
# for this, I would recommend the statsmodels library:
# https://www.statsmodels.org/dev/examples/notebooks/generated/glm.html
import statsmodels.api as sm
# we need to add a constant column manually to include an intercept in the regression
X_train_new = sm.add_constant(X_train, prepend=False)
print(X_train_new.head())

     Pclass  Parch  const
280       3      0    1.0
255       3      2    1.0
506       2      2    1.0
808       2      0    1.0
232       2      0    1.0


In [44]:
# fit a GLM with a binomial family (logit link by default), i.e. a logistic regression
lrs = sm.GLM(y_train, X_train_new, family=sm.families.Binomial()).fit()
print(lrs.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:               Survived   No. Observations:                  623
Model:                            GLM   Df Residuals:                      620
Model Family:                Binomial   Df Model:                            2
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -374.38
Date:                Mon, 04 Dec 2023   Deviance:                       748.76
Time:                        16:50:17   Pearson chi2:                     624.
No. Iterations:                     4   Pseudo R-squ. (CS):             0.1217
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Pclass        -0.8849      0.106     -8.342      0.0

In [45]:
# predict for the test set (it needs the same constant column as the training set)
X_test_new = sm.add_constant(X_test, prepend=False)
scores_new = lrs.predict(X_test_new)
# here we get scores (estimated probabilities of survival) rather than predicted labels
scores_new[0:9]

821    0.227441
348    0.284156
258    0.633452
221    0.416324
717    0.416324
314    0.490250
515    0.633452
766    0.633452
175    0.284156
dtype: float64

In [46]:
# transform scores to labels: <0.5 --> 0; >=0.5 --> 1
# a vectorised comparison builds a new Series, so scores_new itself is left unchanged
predict_new = (scores_new >= 0.5).astype(int)
print(predict_new[0:9])

821    0
348    0
258    1
221    0
717    0
314    0
515    1
766    1
175    0
dtype: int64
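
The thresholding step can also be written as a small helper with the cut-off as a parameter. This is a sketch only: the helper name `to_labels` and the alternative threshold of 0.6 are assumptions made for illustration, and the nine scores are copied from the output printed earlier in this section.

```python
import pandas as pd

# first nine scores copied from the statsmodels prediction output
scores = pd.Series(
    [0.227441, 0.284156, 0.633452, 0.416324, 0.416324,
     0.490250, 0.633452, 0.633452, 0.284156],
    index=[821, 348, 258, 221, 717, 314, 515, 766, 175],
)

def to_labels(scores, threshold=0.5):
    """Return a new Series with 1 where score >= threshold and 0 otherwise."""
    return (scores >= threshold).astype(int)

labels = to_labels(scores)                   # default cut-off of 0.5
stricter = to_labels(scores, threshold=0.6)  # hypothetical stricter cut-off

print(labels.tolist())
print(stricter.tolist())
print(scores.iloc[0])  # the original scores are still available
```

Because the helper returns a new Series, the scores remain available afterwards, for example to try a different threshold.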


# ***k*NN via `sklearn`**

In [47]:
# build a knn classifier based on the training set
n_neighbours=5
KNNClassifier = neighbors.KNeighborsClassifier(n_neighbours, weights="uniform")
KNNClassifier.fit(X_train, y_train)

In [48]:
# predict the test set
y_pred = KNNClassifier.predict(X_test)
# get the predicted probabilities for each class (columns: class 0, class 1)
print(KNNClassifier.predict_proba(X_test)[0:9])
# have a look at the predicted labels
print(y_pred[0:9])
# get the accuracy, first by hand and then with `accuracy_score(y_true, y_pred)`
print(sum(y_pred == y_test) / len(y_test))
print(accuracy_score(y_test, y_pred))

[[0.8 0.2]
 [0.4 0.6]
 [0.4 0.6]
 [1.  0. ]
 [1.  0. ]
 [0.4 0.6]
 [0.4 0.6]
 [0.4 0.6]
 [0.4 0.6]]
[0 1 1 0 0 1 1 1 1]
0.6716417910447762
0.6716417910447762
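
With `weights="uniform"` and five neighbours, each predicted probability is the share of the five nearest training points that belong to the class, which is why the output above contains only multiples of 0.2. The following NumPy sketch illustrates that voting rule; it is not `sklearn`'s implementation, the function `knn_predict_proba` is hypothetical, and the toy data are made up.

```python
import numpy as np

def knn_predict_proba(X_train, y_train, X_query, k=5):
    """Share of the k nearest training points in each class (uniform weights)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_query = np.asarray(X_query, dtype=float)
    classes = np.unique(y_train)
    # Euclidean distances between every query and training point: (n_query, n_train)
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # indices of the k closest training points; ties are broken by training order
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    neighbour_labels = y_train[nearest]  # shape (n_query, k)
    # one column per class: fraction of neighbours carrying that label
    proba = np.stack(
        [(neighbour_labels == c).mean(axis=1) for c in classes], axis=1
    )
    return classes, proba

# made-up toy data; the two columns mimic (Pclass, Parch)
X_toy = [[1, 0], [1, 1], [1, 2], [2, 0], [2, 1],
         [3, 0], [3, 0], [3, 1], [3, 2], [2, 2]]
y_toy = [1, 1, 1, 0, 1, 0, 0, 0, 0, 1]

classes, proba = knn_predict_proba(X_toy, y_toy, [[1, 0], [3, 0]], k=5)
print(classes)
print(proba)
```

With only two discrete features, many training points are tied in distance, so which five count as "nearest" depends on how ties are broken; the rule used in this sketch may differ from the one `sklearn` applies.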
