In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

### Dealing with Missing Values
The test data does not contain survival labels, due to the nature of the competition. Therefore, we will focus primarily on the training data provided.

In [2]:
train_df = pd.read_csv('./train.csv',index_col='PassengerId')

In [3]:
train_df.describe(include='all')

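| | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891 | 891 | 714.0 | 891.0 | 891.0 | 891.0 | 891.0 | 204 | 889 |
| unique | | | 891 | 2 | | | | 681.0 | | 147 | 3 |
| top | | | Yousseff, Mr. Gerious | male | | | | 347082.0 | | G6 | S |
| freq | | | 1 | 577 | | | | 7.0 | | 4 | 644 |
| mean | 0.383838 | 2.308642 | | | 29.699118 | 0.523008 | 0.381594 | | 32.204208 | | |
| std | 0.486592 | 0.836071 | | | 14.526497 | 1.102743 | 0.806057 | | 49.693429 | | |
| min | 0.0 | 1.0 | | | 0.42 | 0.0 | 0.0 | | 0.0 | | |
| 25% | 0.0 | 2.0 | | | 20.125 | 0.0 | 0.0 | | 7.9104 | | |
| 50% | 0.0 | 3.0 | | | 28.0 | 0.0 | 0.0 | | 14.4542 | | |
| 75% | 1.0 | 3.0 | | | 38.0 | 1.0 | 0.0 | | 31.0 | | |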


As the description shows, several columns have a count below 891, so missing values are present. However, some features legitimately contain the number 0, and these zeros should not be mistaken for missing values.

Let's find all the missing values.

In [4]:
for col in train_df.columns:
    # isnull() already returns booleans, so sum() counts the missing entries directly
    print('Number of missing values in %s : %d' % (col, train_df[col].isnull().sum()))

Number of missing values in Survived : 0
Number of missing values in Pclass : 0
Number of missing values in Name : 0
Number of missing values in Sex : 0
Number of missing values in Age : 177
Number of missing values in SibSp : 0
Number of missing values in Parch : 0
Number of missing values in Ticket : 0
Number of missing values in Fare : 0
Number of missing values in Cabin : 687
Number of missing values in Embarked : 2
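
The per-column loop can also be written as a single vectorized call. The sketch below uses a small made-up frame (`toy_df` and its values are illustrative assumptions, not rows from train.csv) and also shows that a legitimate 0 is not counted as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration (not taken from train.csv).
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 0.42, np.nan],
    'Parch': [0, 0, 1, 2],
    'Cabin': [None, 'C85', None, None],
})

# isnull() flags NaN/None only; sum() then counts the flags per column.
missing_counts = toy_df.isnull().sum()
print(missing_counts)
```

On the real data, `train_df.isnull().sum()` should give the same counts as the loop.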


Age and Cabin have the most missing values, while Embarked is missing only 2. However, we believe the Embarked feature is not useful for predicting the survivors of the Titanic, and the Cabin information is correlated with Fare and passenger class (Pclass), so we can discard both.
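
One way to examine the belief that Cabin is tied to passenger class is to compare how often a cabin is recorded in each class. This is a minimal sketch on made-up rows (`toy_df` and its values are assumptions for illustration); it says nothing about the real data until it is run on `train_df`.

```python
import pandas as pd

# Hypothetical toy data, only for illustration (not taken from train.csv).
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'Cabin': ['C85', 'E46', None, None, None, 'G6'],
})

# Fraction of passengers in each class that have a recorded cabin.
cabin_known_rate = toy_df['Cabin'].notnull().groupby(toy_df['Pclass']).mean()
print(cabin_known_rate)
```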

Now we need to impute the missing Age entries. Since the age distribution is unlikely to be normal, it makes more sense to impute with the median than with the mean. From the description, the median age is 28.

In [5]:
# Fill missing ages with the median of the observed ages (28, per the description above)
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
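
The same median imputation could also be done with scikit-learn's `SimpleImputer`, which is already imported at the top of this notebook. The following is a sketch on made-up ages (`toy_df` is an assumption for illustration), not the method used in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy ages, only for illustration (not taken from train.csv).
toy_df = pd.DataFrame({'Age': [22.0, np.nan, 38.0, 26.0, np.nan]})

# strategy='median' learns the median of the observed (non-missing) values in fit().
imputer = SimpleImputer(strategy='median')
imputed = imputer.fit_transform(toy_df[['Age']])  # 2-D array with one column
toy_df['Age'] = imputed.ravel()

print(imputer.statistics_)  # the learned median, one entry per column
print(toy_df['Age'].tolist())
```

An imputer object is mainly useful when the same learned median has to be reapplied to other data, for example a held-out split.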

Let's see the imputed data's description.

In [6]:
train_df['Age'].describe()

count    891.000000
mean      29.361582
std       13.019697
min        0.420000
25%       22.000000
50%       28.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64

Turn the Sex feature into integers (0 for male, 1 for female).

In [7]:
sex = train_df['Sex'].values
# Map 'male' to 0 and every other value ('female') to 1
sex = np.array([0 if i == 'male' else 1 for i in sex])
train_df['Sex'] = sex
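
An explicit mapping makes the encoding easy to audit, because any unexpected label becomes NaN instead of silently receiving a code. This is a sketch on made-up rows; `toy_df`, `sex_codes`, and the choice of 0 for male and 1 for female are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data, only for illustration (not taken from train.csv).
toy_df = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

# Assumed encoding: 0 for male, 1 for female.
sex_codes = {'male': 0, 'female': 1}
toy_df['Sex'] = toy_df['Sex'].map(sex_codes)

# Sanity checks: no unmapped labels, and both classes survive the encoding.
assert toy_df['Sex'].isnull().sum() == 0
assert toy_df['Sex'].nunique() == 2
print(toy_df['Sex'].tolist())
```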

Discard unwanted features and prepare the data for classification.

In [8]:
train_df = train_df.drop(['Name', 'Ticket', 'Cabin', 'Embarked'], axis=1)

In [9]:
X = train_df[train_df.columns[1:]].values
y = train_df['Survived'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
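
To make the split sizes concrete, the sketch below repeats the same call on synthetic arrays with the same shape as the prepared data (891 rows, 6 features). The arrays `X_demo` and `y_demo` are random placeholders, and the `stratify` argument is an optional addition that is not used in the cell above; it keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholders with the same shape as the prepared data (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(891, 6))
y_demo = rng.integers(0, 2, size=891)

# Same test_size and random_state as above; stratify is an optional extra.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # positive-class share in each part
```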

The process of dealing with missing data follows the ACM Code of Ethics and Professional Conduct (https://www.acm.org/code-of-ethics). If there are any ethical concerns with the use of the data, please contact me at liux16@wfu.edu.

### Deep Neural Network Model