# Titanic Dataset

This notebook is based on the Titanic Dataset Machine Learning from Kaggle. Complete this notebook to complete the assignment. 

In [2]:
import pandas as pd
import tensorflow as tf
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from pandas import DataFrame

Using TensorFlow backend.


## Data Extraction

Read from train.csv into a pandas data frame(call it df)

In [8]:
df = pd.read_csv("train.csv")
fm ={'Sex':{'male': 0, 'female': 1}}
embark = {'S': 1, 'C': 2, 'Q': 3}
df.replace(embark, inplace=True)
df.replace(fm, inplace=True)

## Data Visualization

Try viewing the first five rows of your data (Note. try the head function)

In [9]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,1.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,2.0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,1.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,1.0
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,1.0


Let's visualize our data bit and see number of peopled that died for each ticket class

In [10]:
df['Pclass'][df['Survived']==0].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1bbedee08d0>

## Data Cleaning/PreProcessing

Before we contiune let us do some preprocessing on our data. Preprocessing is the process a data scientist or ML engineer goes through to make sure the data is clean and ready for the model. One example is checking to see if there are any null values in any of the columns and replacing them. Let's see if the Age column has any.

In [39]:
df['Age'].isnull().sum()

177

It does so lets us fill those will the median value for age

In [40]:
df['Age'] = df['Age'].fillna(df['Age'].median())

Here we are normalizing the age column by subtracting the mean and dividing by the standard deviation to keep values small

In [41]:
df['Age'] = (df['Age']-df['Age'].mean())/df['Age'].std()

Let's see how that columns looks now

In [42]:
df['Age']

0     -0.565419
1      0.663488
2     -0.258192
3      0.433068
4      0.433068
         ...   
886   -0.181385
887   -0.795839
888   -0.104579
889   -0.258192
890    0.202648
Name: Age, Length: 891, dtype: float64

Try to do the same for the fare column

In [43]:
df['Fare'] = (df['Fare']-df['Fare'].mean())/df['Fare'].std()

Let's see how the far column looks now

In [44]:
df['Fare']

0     -0.502163
1      0.786404
2     -0.488580
3      0.420494
4     -0.486064
         ...   
886   -0.386454
887   -0.044356
888   -0.176164
889   -0.044356
890   -0.492101
Name: Fare, Length: 891, dtype: float64

## Feature Engineering

Now Time to do some feature engineering. Extract values from columns you can use as features(hint try to use numerical columns). Store it an variable called X.

In [45]:
X = df[['Age', 'Parch', 'SibSp', 'Fare', 'Pclass', 'Sex', 'Embarked']].values

Let's see how our input data looks

In [46]:
X

array([[-0.5654189 ,  0.        ,  1.        , -0.50216314,  3.        ,
         0.        ],
       [ 0.66348839,  0.        ,  1.        ,  0.78640362,  1.        ,
         1.        ],
       [-0.25819208,  0.        ,  0.        , -0.48857985,  3.        ,
         1.        ],
       ...,
       [-0.10457867,  2.        ,  1.        , -0.1761643 ,  3.        ,
         1.        ],
       [-0.25819208,  0.        ,  0.        , -0.04435613,  1.        ,
         0.        ],
       [ 0.20264816,  0.        ,  0.        , -0.49210144,  3.        ,
         0.        ]])

Extract the labels (Survived column) into Y

In [47]:
Y = df['Survived'].values

Let's see how our labels look

In [48]:
Y

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,

Let's spilt our dataset into train and test (X_train, X_test, y_train, y_test).Use the train_test_split function from sklearn. Use 30% of the data for test

In [49]:
seed = 5
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = seed)

Let us view the shape of the train data. The first number represents how many rows, the second represents how many columns or features.

In [50]:
X_train.shape

(623, 6)

Let us do the same for the test data

In [51]:
X_test.shape

(268, 6)

## Logistic Regression

Let us create a model and fit the model to the train dataset.Let us use the LogisticRegression model from sklearn.

In [52]:
model = LogisticRegression(C=1.0, solver='lbfgs', multi_class='ovr')
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='ovr', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Let's test the model. (Call the predict function on the model save the output in a variable)

In [53]:
Pred = model.predict(X_test)

Let us evaluate the accuracy of the model. Try using the accuracy score function from sklearn

In [54]:
accuracy_score(y_test, Pred)

0.8097014925373134

## Neural Network

Now let's try the same with a neural network. We will create a small neural network with some hidden layers and an output layer. (Note you are free to design this yourself). The network should output one value (try using sigmoid activation for last layer).

In [55]:
model = Sequential()
model.add(Dense(48, input_dim = 7, activation = 'relu'))
model.add(Dense(64, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))

model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

Now let us fit our model.Remeber to use the test data created from above as the validation data.

In [56]:
model.fit(X_test, y_test, epochs = 150, batch_size = 10)

Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 14/150
Epoch 15/150
Epoch 16/150
Epoch 17/150
Epoch 18/150
Epoch 19/150
Epoch 20/150
Epoch 21/150
Epoch 22/150
Epoch 23/150
Epoch 24/150
Epoch 25/150
Epoch 26/150
Epoch 27/150
Epoch 28/150
Epoch 29/150
Epoch 30/150
Epoch 31/150
Epoch 32/150
Epoch 33/150
Epoch 34/150
Epoch 35/150
Epoch 36/150
Epoch 37/150
Epoch 38/150
Epoch 39/150
Epoch 40/150
Epoch 41/150
Epoch 42/150
Epoch 43/150
Epoch 44/150
Epoch 45/150
Epoch 46/150
Epoch 47/150
Epoch 48/150
Epoch 49/150
Epoch 50/150
Epoch 51/150
Epoch 52/150
Epoch 53/150
Epoch 54/150
Epoch 55/150
Epoch 56/150
Epoch 57/150
Epoch 58/150
Epoch 59/150
Epoch 60/150
Epoch 61/150
Epoch 62/150
Epoch 63/150
Epoch 64/150
Epoch 65/150
Epoch 66/150
Epoch 67/150
Epoch 68/150
Epoch 69/150
Epoch 70/150
Epoch 71/150
Epoch 72/150
Epoch 73/150
Epoch 74/150
Epoch 75/150
Epoch 76/150
Epoch 77/150
Epoch 78

<keras.callbacks.callbacks.History at 0x211aa076ef0>

Run this to see how your model has done.

In [57]:
scores = model.evaluate(X_test, y_test)
for i in range(len(scores)):
 print("\n%s: %.2f%%" % (model.metrics_names[i], scores[i]*100))


loss: 24.18%

accuracy: 91.42%


## Test Time

Time to test our model on the hold out test dataset provided. Read test.csv into a data from called test_df

In [60]:
test_df = pd.read_csv("test.csv")
fm ={'Sex':{'male': 0, 'female': 1}}
test_df.replace(fm, inplace=True)

We do the same as before and pre-process our data.Remember to remove the null values as well as normalize the age and fare columns.



Remove null values for age

In [61]:
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].median())

Normalize age

In [62]:
test_df['Age'] = (test_df['Age']-test_df['Age'].mean())/test_df['Age'].std()

Normalize Fare

In [63]:
test_df['Fare'] = (test_df['Fare']-test_df['Fare'].mean())/test_df['Fare'].std()

Extract the same features as before into a variable.

In [64]:
test = test_df[['Age', 'Parch', 'SibSp', 'Fare', 'Pclass', 'Sex']].values

Use your model to make predicition on the data. Store the result in a variable called predicitions.

In [65]:
predicitions = model.predict(test)

The neural network model will produce values between 0 and 1 that represent the probability of the person surving. We convert those values to either 0 or 1 with those that have less than 50% of surving a 0 and those that have greater a 1.

In [66]:
predicitions = predicitions.squeeze(1)
predicitions = np.where(predicitions < 0.5 , 0, 1)

  


Let us look at how our predicitions looks

In [67]:
predicitions

array([0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0,

## Create a submission csv file

Create a data frame with two columns PassengerId and Survived (Try the pd.DataFrame function). The PassengerId column should have the same values as the PassengerId column from the test_df dataframe from above and Survived column should be the predicitions you just created. Create a csv file from this data frome (Try using the .to_csv funtion, however make sure to remove indexes so set to the index flag to false). This should created a csv file, this is what you submit to kaggle.

In [68]:
titanic = pd.DataFrame(test_df['PassengerId'])
titanic['Survived'] = pd.Series(predicitions)
titanic.to_csv('Submission.csv', index = None, header=True)

## Bonus

If you wish to get a better accuracy try extracting the sex column as well, note you will need to find some way to convert that column to a numeric column (e.g male=0, female=1)