## Titanic Prediction

The Titanic dataset is one of the best-known beginner datasets for machine learning on Kaggle. By analyzing features such as socio-economic class and gender, we can get a glimpse of what happened during the disaster, which took place more than a hundred years ago.

In this report, I use machine learning to build a model, examine which factors made a difference to survival, and predict the outcome for the given test set.

We will try several algorithms and compare their accuracy and recall scores.


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as mp

from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC


In [3]:
train=pd.read_csv('./titanic/train.csv')
test=pd.read_csv('./titanic/test.csv')
train.shape,test.shape

((891, 12), (418, 11))

In [4]:
train.info()
print('---'*5)
print('percentage of NA per property sorted')
print('---'*5)
p=(train.isna().sum()/len(train)*100).sort_values(ascending=False)
print(p)


print('---'*5)
print('unique values for duplications and other info ')
print('---'*5)
u=train.nunique().sort_values()
print(u)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
---------------
percentage of NA per property sorted
---------------
Cabin          77.104377
Age            19.865320
Embarked        0.224467
PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000


In [5]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Notes:

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

embarked: Port of embarkation. C = Cherbourg, Q = Queenstown, S = Southampton




As we can see, most features are complete, but three columns have missing values.

About 77% of the Cabin values are missing, so we can drop the column.

About 20% of the Age values are missing, so we need a way to fill them in.

Only 0.22% of the Embarked values (2 rows) are missing; we can fill them in or simply drop those rows.


Now consider the categorical variables:

Sex is a binary variable; we can encode it with LabelEncoder, get_dummies, or OneHotEncoder.

Name is not used in this analysis, so we drop the column.

Cabin is dropped as well: as mentioned above, about 77% of its values are missing.

Ticket is not used either, so we drop it.

Embarked can be encoded in the same way as Sex.
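
The encoding options mentioned above can be compared on a toy frame. This is a minimal sketch, assuming a hypothetical three-row `demo` DataFrame rather than the real data:

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic columns.
demo = pd.DataFrame({"Sex": ["male", "female", "female"],
                     "Embarked": ["S", "C", "Q"]})

# Option 1: explicit integer mapping (the column stays a single column).
mapped = demo.assign(Sex=demo["Sex"].map({"male": 0, "female": 1}))

# Option 2: one-hot encoding (one indicator column per category).
onehot = pd.get_dummies(demo, columns=["Embarked"])

print(mapped)
print(onehot.columns.tolist())
```

Integer codes for a column with more than two categories, such as Embarked, imply an ordering that some models may pick up on; one-hot encoding avoids that at the cost of extra columns.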

In [6]:
train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

Since 'S' is by far the most frequent value, we use it to fill the missing values in this column.
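
Rather than reading the most frequent value off the counts by hand, it could also be computed. A small sketch, assuming a hypothetical `embarked` Series in place of `train['Embarked']`:

```python
import pandas as pd

# Hypothetical stand-in for train['Embarked'], with one missing value.
embarked = pd.Series(["S", "C", "S", None, "Q", "S"])

most_frequent = embarked.mode()[0]       # mode() skips missing values by default
filled = embarked.fillna(most_frequent)  # replace the gap with the mode

print(most_frequent, filled.isna().sum())
```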

In [7]:
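def clean(data):
    # Drop the unused columns; drop() returns a new frame, so the caller's frame is not modified.
    data=data.drop(['Cabin','Name','Ticket'],axis=1)

    # Fill missing Age and Fare with the median of the passenger's (Pclass, Sex) group.
    data['Age']=data.groupby(['Pclass','Sex'])['Age'].transform(lambda x: x.fillna(x.median()))
    data['Fare']=data.groupby(['Pclass','Sex'])['Fare'].transform(lambda x:x.fillna(x.median()))

    # 'S' is the most frequent port of embarkation.
    data['Embarked']=data['Embarked'].fillna(value='S')

    # Encode the categorical columns as integers (assigning back avoids chained in-place replace).
    data['Sex']=data['Sex'].map({'male':0,'female':1})
    data['Embarked']=data['Embarked'].map({'S':0,'C':1,'Q':2})

    return data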
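
To see what the group-wise median fill does, here is a minimal sketch on a hypothetical six-row frame (the values are made up for illustration). It uses `transform("median")`, which should be equivalent to the lambda form used in `clean`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two (Pclass, Sex) groups, one missing Age in each.
demo = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex":    ["male", "male", "male", "female", "female", "female"],
    "Age":    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median Age of each row's own (Pclass, Sex) group, aligned to the original index.
group_median = demo.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Only the missing entries are replaced; known ages are kept.
demo["Age"] = demo["Age"].fillna(group_median)
print(demo)
```

The missing age in the first group becomes 45.0 (the median of 40 and 50) and the one in the second group becomes 25.0 (the median of 20 and 30).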

In [8]:
cleantrain=clean(train)
cleantest=clean(test)

In [9]:
cleantrain.head()

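| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.25 | 0 |
| 1 | 2 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 |
| 2 | 3 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.925 | 0 |
| 3 | 4 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 0 |
| 4 | 5 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.05 | 0 |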


In [10]:
cleantrain.info()
cleantest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    int64  
 4   Age          891 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Fare         891 non-null    float64
 8   Embarked     891 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 62.8 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Sex          418 non-null    int64  
 3   Age          418 non-null    float64
 4   SibSp        418 non-n

As we can see, all missing values have been handled.

In [11]:
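import copy
# Make independent copies; a=b=copy.deepcopy(...) would bind both names to one object.
a=copy.deepcopy(cleantrain)
b=copy.deepcopy(cleantrain)
c=copy.deepcopy(cleantest)
d=copy.deepcopy(cleantest)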
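
A note on chained assignment: `a = b = copy.deepcopy(df)` creates one copy and binds both names to it, so a change made through `a` is also visible through `b`. A minimal sketch on a hypothetical `frame`:

```python
import copy
import pandas as pd

frame = pd.DataFrame({"x": [1, 2, 3]})

# Chained assignment: one copy, two names pointing at it.
a = b = copy.deepcopy(frame)
same_object = a is b

# Separate calls: two independent copies.
a2 = copy.deepcopy(frame)
b2 = copy.deepcopy(frame)
a2.loc[0, "x"] = 99          # does not affect b2 or frame

print(same_object, b2.loc[0, "x"], frame.loc[0, "x"])
```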

In [12]:
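y=a['Survived']
X=pd.get_dummies(a.drop('Survived',axis=1))

# get_dummies one-hot encodes non-numeric columns and leaves int and float columns unchanged.
# After clean() every column is already numeric, so X keeps the same columns here.
# Note that X still contains PassengerId, which is an identifier rather than a passenger attribute.
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.2,random_state=4)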
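
One optional refinement of the split, not used in this report, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. A sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 60 of class 0 and 40 of class 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 60 + [1] * 40)

# stratify=y_demo asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4, stratify=y_demo)

# Share of class 1 in each part.
print(y_tr.mean(), y_te.mean())
```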


In [14]:
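from sklearn.metrics import accuracy_score,recall_score 

def mypredict(model):
    # Fit on the training split and score on the held-out split.
    model.fit(X_train, y_train)
    prediction=model.predict(X_test)
    # With average='weighted', recall is mathematically equal to accuracy for the same predictions.
    return accuracy_score(y_test,prediction),recall_score(y_test,prediction,average='weighted')


model1=LogisticRegression(solver='liblinear',random_state=4)
model2=GradientBoostingClassifier()   # note the parentheses: we need an instance, not the class
model3=RandomForestClassifier()
model4=SGDClassifier()
model5=SVC()

models=[model1,model2,model3,model4,model5]
for m in models:
    print('model:',m)
    # mypredict is called twice, so each model is refit for each metric;
    # models without a fixed random_state can score differently on the two fits.
    print('accuracy:',mypredict(m)[0])
    print('recall:',mypredict(m)[1])
    print('---'*5)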

model: LogisticRegression(random_state=4, solver='liblinear')
accuracy: 0.8268156424581006
recall: 0.8268156424581006
---------------
model: GradientBoostingClassifier()
accuracy: 0.8491620111731844
recall: 0.8491620111731844
---------------
model: RandomForestClassifier()
accuracy: 0.8659217877094972
recall: 0.8603351955307262
---------------
model: SGDClassifier()
accuracy: 0.329608938547486
recall: 0.659217877094972
---------------
model: SVC()
accuracy: 0.6983240223463687
recall: 0.6983240223463687
---------------
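
A note on the two metrics: recall with `average='weighted'` weights each class's recall by its share of the samples, which works out to the overall fraction of correct predictions, i.e. accuracy. The sketch below checks this on made-up labels and shows the default (positive-class) recall for comparison. The helper `evaluate_once` is a hypothetical name, not part of the notebook; it fits a model a single time so that both scores come from the same predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from sklearn.tree import DecisionTreeClassifier

# Made-up labels: 5 of each class, 3 mistakes in total.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)
weighted = recall_score(y_true, y_pred, average="weighted")  # same value as acc
positive = recall_score(y_true, y_pred)                      # recall of class 1 only
print(acc, weighted, positive)


def evaluate_once(model, X_tr, y_tr, X_te, y_te):
    """Fit once and compute both metrics on the same predictions."""
    pred = model.fit(X_tr, y_tr).predict(X_te)
    return accuracy_score(y_te, pred), recall_score(y_te, pred)


# Toy call, scored on the training rows only to demonstrate the helper.
X_toy = np.arange(10).reshape(-1, 1)
tree_acc, tree_rec = evaluate_once(
    DecisionTreeClassifier(random_state=0), X_toy, y_true, X_toy, y_true)
print(tree_acc, tree_rec)
```

Here accuracy and weighted recall are both 0.7, while the recall for the positive class (survivors, in the Titanic setting) is 0.6.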


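On this single 80/20 split, RandomForest achieves the highest accuracy (about 0.86), followed by GradientBoosting (0.85) and LogisticRegression (0.83). Because recall is computed with average='weighted', it equals accuracy for any single fit; where the two printed values differ (RandomForest, SGDClassifier), they come from two separate fits of a randomized model.\
We use RandomForest as our final model.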
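
The weak SGDClassifier and SVC scores may partly reflect unscaled inputs, since both models are sensitive to feature scale and the features here have very different ranges. One possible remedy is to standardize inside a pipeline. This is a sketch on synthetic data (not the Titanic features), with one column deliberately inflated to mimic a large-range feature:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; column 0 is blown up to mimic a large-range feature.
X_syn, y_syn = make_classification(n_samples=400, n_features=6, random_state=4)
X_syn[:, 0] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=4)

# Same model with and without standardization; the scaler is fit on the training part only.
raw_svc = SVC().fit(X_tr, y_tr)
scaled_svc = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)

raw_acc = raw_svc.score(X_te, y_te)
scaled_acc = scaled_svc.score(X_te, y_te)
print(raw_acc, scaled_acc)
```

Whether scaling changes the ranking of the models on the real data would need to be checked by rerunning the comparison.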

In [16]:
model=RandomForestClassifier()
print(mypredict(model))

(0.8603351955307262, 0.8603351955307262)
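
A single train/test split gives a fairly noisy estimate, as the slightly different RandomForest scores across runs suggest. `cross_val_score`, imported at the top but not used so far, averages over several splits. A minimal sketch on synthetic data; the dictionary `candidates` and the data are assumptions for illustration, and on the real data `X` and `y` would take the place of `X_syn` and `y_syn`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned features and labels.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=4)

# Fixed random_state values make the comparison repeatable.
candidates = {
    "logreg": LogisticRegression(solver="liblinear", random_state=4),
    "forest": RandomForestClassifier(n_estimators=50, random_state=4),
}

cv_means = {}
for name, est in candidates.items():
    # 5-fold cross-validation: five fits, each scored on a different held-out fold.
    scores = cross_val_score(est, X_syn, y_syn, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```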


In [18]:
predict=model.predict(pd.get_dummies(cleantest))

result=pd.DataFrame({'PassengerId':cleantest.PassengerId,'Survived':predict})
result.to_csv('Pred0308.csv',index=False)
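
Before uploading, it can help to check the submission file's shape and contents. A sketch using a hypothetical three-row `result_demo` frame and an in-memory buffer instead of the real `Pred0308.csv`:

```python
import io
import pandas as pd

# Hypothetical predictions standing in for `result` above.
result_demo = pd.DataFrame({"PassengerId": [892, 893, 894],
                            "Survived": [0, 1, 0]})

# Round-trip through CSV in memory, as the file on disk would be read back.
buffer = io.StringIO()
result_demo.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Basic sanity checks on the format.
assert list(check.columns) == ["PassengerId", "Survived"]
assert check["Survived"].isin([0, 1]).all()
assert check["PassengerId"].is_unique
print(check.shape)
```

For the real file, the row count should match the 418 rows of the test set.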
