# Kaggle Titanic
## Using Artificial Neural Networks


For this lecture we will be working with the [Titanic Data Set from Kaggle](https://www.kaggle.com/c/titanic), one of the most widely used introductory datasets in machine learning.


# Step - 1 : Frame The Problem

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.



# Step - 2 : Obtain the Data

## Import Libraries

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [0]:
!ls -l

Pandas provides two core data types, Series and DataFrame, each with a large set of built-in functions for handling data.

Pandas can read data from many sources through functions such as `read_csv`, `read_excel`, and `read_html`. The data is read and stored as a DataFrame.
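
As a small illustration of the two data types, the sketch below reads a CSV into a DataFrame and pulls one column out as a Series. The three-row CSV text is made up for the example and only stands in for `titanic.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for titanic.csv (illustration only).
csv_text = "PassengerId,Survived,Age\n1,0,22.0\n2,1,38.0\n3,1,\n"

# read_csv accepts a path or a file-like object and returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Selecting a single column gives a Series.
ages = df["Age"]

print(type(df).__name__, df.shape)
print(type(ages).__name__, "missing values:", ages.isnull().sum())
```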

In [0]:
!wget -q https://www.dropbox.com/s/8grgwn4b6y25frw/titanic.csv

In [0]:
!ls -l

In [0]:
data = pd.read_csv('titanic.csv')

In [0]:
data.info()

In [0]:
data.describe()

# Step - 3 : Analyse the Data
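
One simple way to start the analysis is to compare survival rates across groups. The sketch below does this on a tiny made-up frame (`toy` and its values are assumptions for illustration, not the real data); the same `groupby` calls should work on `data` if it has the standard `Survived`, `Sex`, and `Pclass` columns.

```python
import pandas as pd

# Hypothetical eight-passenger sample; values are invented for illustration.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1, 0, 1],
    "Sex": ["female", "male", "female", "male", "male", "female", "male", "male"],
    "Pclass": [1, 3, 2, 3, 1, 3, 2, 1],
})

# The mean of a 0/1 column is the fraction of survivors in each group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

Calling `.plot(kind='bar')` on either result turns the table into a bar chart.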

#### What do you observe from the above charts?

# Step - 4 : Feature Engineering

## Feature Engineering

We want to fill the missing values in the Age column with the average age of the passenger's class (Pclass). This is called data imputation.

In [0]:
data.info()

In [0]:
def impute_age(cols):
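    # Access by label; positional indexing such as cols[0] on a Series is deprecated in recent pandas
    Age = cols['Age']
    Pclass = cols['Pclass']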
    
    if pd.isnull(Age):
        # Class-1
        if Pclass == 1:
            return 37
        # Class-2 
        elif Pclass == 2:
            return 29
        # Class-3
        else:
            return 24

    else:
        return Age

Applying the function.

In [0]:
data['Age'] = data[['Age','Pclass']].apply(impute_age,axis=1)
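
The constants 37, 29, and 24 in `impute_age` are presumably rounded per-class mean ages. As an alternative sketch, the means can be computed from the data itself with `groupby` and `transform`, which avoids hard-coding them. The `toy` frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing age in class 1 and one in class 2.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [40.0, np.nan, 30.0, np.nan, 20.0, 28.0],
})

# Mean age of each row's class, aligned to the original index (NaN is skipped).
class_mean_age = toy.groupby("Pclass")["Age"].transform("mean")

# Fill only the missing ages with the mean of their class.
toy["Age"] = toy["Age"].fillna(class_mean_age)

print(toy)
```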

Now let's visualize the missing values.
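
A minimal way to inspect missing values is to count them per column. The sketch below uses a made-up three-row frame (`toy` is an assumption for illustration); on the notebook's data the same call would be `data.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in Age and Cabin.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", "S"],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```

For a picture rather than a table, `sns.heatmap(data.isnull(), cbar=False)` draws one cell per value and highlights the missing ones.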

The Age column is imputed successfully.

Let's drop the Cabin column and any rows where Embarked is NaN.

In [0]:
data.drop('Cabin', axis = 1,inplace=True)

In [0]:
data.head()

In [0]:
data.dropna(inplace = True)

In [0]:
data.info()

## Converting Categorical Features 

We'll need to convert categorical features to dummy variables using pandas. Otherwise our machine learning algorithm won't be able to take those features directly as inputs.

In [0]:
# drop_first=True drops the first category, which the remaining columns already imply
sex_dummies = pd.get_dummies(data['Sex'], drop_first=True)
embark_dummies = pd.get_dummies(data['Embarked'], drop_first=True)
sex_dummies.head()

In [0]:
embark_dummies.head()
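
To see what `drop_first` does, the sketch below encodes a small made-up `Embarked` column with and without it. The `dtype=int` argument is an addition for readable 0/1 output and is not used in the cells above.

```python
import pandas as pd

# Hypothetical port-of-embarkation values.
embarked = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# One column per category: C, Q, S.
full = pd.get_dummies(embarked, dtype=int)

# drop_first=True removes the first category (C); a row of all zeros now means C.
reduced = pd.get_dummies(embarked, drop_first=True, dtype=int)

print(full)
print(reduced)
```

Dropping one column loses no information, because the dropped category is implied whenever all remaining dummy columns are zero.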

In [0]:
data.drop(['PassengerId', 'Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
data.head()



In [0]:
data = pd.concat([data,sex_dummies,embark_dummies],axis=1)

In [0]:
data.head()

In [0]:
data.info()

In [0]:
data.describe()

# Step - 5 : Model Selection

In [0]:
Target = 'Survived'
X = data.drop(Target,axis=1)
y = data[Target]

In [0]:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score, f1_score

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)


In [0]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train = sc.transform(X_train)


In [0]:
# Reuse the scaling factors learned from X_train: transform only, do not fit again
X_test = sc.transform(X_test)
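
The reason for fitting on the training set only can be seen on a one-column toy example. The arrays below are made up for illustration; the point is that the test value is scaled with the training mean and standard deviation, so no information from the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train and test sets.
X_train_toy = np.array([[10.0], [20.0], [30.0]])
X_test_toy = np.array([[40.0]])

# Learn mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_train_toy)

train_scaled = scaler.transform(X_train_toy)
test_scaled = scaler.transform(X_test_toy)  # uses the training statistics

print("train mean:", scaler.mean_[0], "train std:", scaler.scale_[0])
print("scaled test value:", test_scaled[0, 0])
```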

# Now let's build the ANN

In [0]:
# Importing the Keras libraries and packages
import keras
from keras.models import Sequential
from keras.layers import Dense

In [0]:
# Initialising the ANN
classifier = Sequential()

In [0]:
X_train.shape

In [0]:
# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 9, kernel_initializer = 'uniform', 
                     activation = 'relu', input_dim = X_train.shape[1]))

In [0]:
# Adding the second hidden layer
classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu'))

In [0]:
# Adding the third hidden layer
classifier.add(Dense(units = 8, kernel_initializer = 'uniform', activation = 'relu'))

In [0]:
# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

In [0]:
# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
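
As a rough picture of what the sigmoid output and the `binary_crossentropy` loss compute, here is a NumPy sketch. The helper names `sigmoid` and `binary_crossentropy` and the sample values are assumptions for illustration; this is not Keras's own implementation.

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    # Clip probabilities so that log(0) never occurs.
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

# Hypothetical labels and raw output-layer values.
y_true = np.array([1.0, 0.0, 1.0])
logits = np.array([2.0, -1.0, 0.0])

probs = sigmoid(logits)
loss = binary_crossentropy(y_true, probs)
print("probabilities:", probs)
print("loss:", loss)
```

The loss shrinks as the predicted probability of the true class approaches 1, which is what the optimizer pushes toward during training.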

In [0]:
classifier.summary()
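
The parameter counts reported by `summary()` can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 8 input features, which is what the preprocessing above would give for the standard Kaggle columns; the real value is `X_train.shape[1]`. `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    # Weights (n_inputs * n_units) plus one bias per unit.
    return n_inputs * n_units + n_units

n_features = 8               # assumption; use X_train.shape[1] in the notebook
layer_units = [9, 10, 8, 1]  # the three hidden layers and the output layer

counts = []
n_in = n_features
for units in layer_units:
    counts.append(dense_params(n_in, units))
    n_in = units  # this layer's outputs feed the next layer

total = sum(counts)
print(counts, total)
```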

In [0]:
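# Fitting the ANN to the Training set
# Note: when batch_size exceeds the number of training rows (as 8000 most likely does here),
# each epoch performs a single full-batch gradient update
classifier.fit(X_train, y_train, verbose=1, batch_size = 8000, epochs = 400)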


In [0]:
# Making predictions and evaluating the model

In [0]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)

In [0]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In [0]:
cm

In [0]:
y_pred[:,0]
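
The other metrics imported earlier (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`) can be computed the same way as the confusion matrix. The sketch below uses made-up labels and probabilities (`y_true_toy`, `y_prob_toy` are assumptions for illustration); in the notebook the inputs would be `y_test` and `y_pred[:,0]`.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and predicted probabilities of shape (n, 1).
y_true_toy = np.array([1, 0, 1, 1, 0, 0])
y_prob_toy = np.array([[0.9], [0.2], [0.4], [0.7], [0.6], [0.1]])

# Threshold at 0.5 and flatten to a 1-D array of 0/1 predictions.
y_hat_toy = (y_prob_toy > 0.5).astype(int).ravel()

cm_toy = confusion_matrix(y_true_toy, y_hat_toy)  # rows: actual, columns: predicted
acc = accuracy_score(y_true_toy, y_hat_toy)
prec = precision_score(y_true_toy, y_hat_toy)
rec = recall_score(y_true_toy, y_hat_toy)
f1 = f1_score(y_true_toy, y_hat_toy)

print(cm_toy)
print(acc, prec, rec, f1)
```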