<a href="https://colab.research.google.com/github/khuzaifa5188/Analyzing-Titanic-data-set-by-using-ANN/blob/main/Titanic_Survival_Prediction_using_ANN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()


In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

titanic_path = kagglehub.competition_download('titanic')

print('Data source import complete.')


# Titanic Survival Prediction
The goal of this analysis and model is to predict whether a passenger on the Titanic survived, based on features such as age, class, sex and port of embarkation.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Read from the directory returned by kagglehub so the paths also work outside Kaggle.
train_set = pd.read_csv(os.path.join(titanic_path, 'train.csv'))
test_set = pd.read_csv(os.path.join(titanic_path, 'test.csv'))

## Variable Identification
First I will explore each variable: I want to find out the data type of each one and how many null entries the dataset has.

In [None]:
train_set.head()

I am going to set PassengerId as the index, since it carries the same information as the index we already have, which makes it redundant as a separate column.

In [None]:
train_set = train_set.set_index('PassengerId')

In [None]:
train_set.head()

The summary below shows that we have a few text columns and a fair number of null entries. Cabin in particular has a huge number of null entries, so I will have to do something about it.

In [None]:
train_set.info()

In [None]:
train_set.describe()
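
As a small illustration of the null check described above, here is a sketch on a made-up frame (`toy` and its values are assumptions for illustration, not rows from the Titanic data). It counts missing entries per column and expresses them as a share of all rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_set; the values are made up for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

# Number of missing entries per column, largest first.
null_counts = toy.isnull().sum().sort_values(ascending=False)

# Share of rows that are missing in each column (mean of a boolean mask).
null_share = toy.isnull().mean()

print(toy.dtypes)
print(null_counts)
print(null_share)
```

The same two calls can be run on `train_set` directly to rank its columns by how much data is missing.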

## Univariate Analysis
Now I will visualize some individual features to look for outliers and any interesting patterns.

In [None]:
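# Take labels and heights from the same Series so each bar matches its class.
pclass_counts = train_set['Pclass'].value_counts().sort_index()
plt.bar(pclass_counts.index, pclass_counts.values)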

In [None]:
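# Take labels and heights from the same Series so each bar matches its label.
sex_counts = train_set['Sex'].value_counts()
colors = ['blue' if s == 'male' else 'pink' for s in sex_counts.index]
plt.bar(sex_counts.index, sex_counts.values, color=colors)
plt.show()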

In [None]:
train_set['Age'].hist()
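
A histogram shows outliers only by eye. One common numeric check is the 1.5 x IQR rule, sketched here on made-up ages (the `ages` array is an assumption for illustration, and this rule is my suggestion rather than something the analysis above relies on).

```python
import numpy as np

# Toy ages, made up for illustration; 80 is the deliberate outlier.
ages = np.array([22, 24, 25, 26, 28, 29, 30, 31, 33, 80], dtype=float)

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)
```

On the real data the Age column contains nulls, so they would need to be dropped first, for example with `train_set['Age'].dropna()`.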

## Bi-variate Analysis
Now I will compare features against each other to look for relationships between them.

The plot below shows that women were more likely to survive, as they were sent off the ship first. Among the passengers who did not survive, men outnumber women by more than five to one.

In [None]:
df = pd.DataFrame({'Gender': train_set['Sex'], 'Survived': train_set['Survived']})
total_counts = df.groupby(['Survived', 'Gender']).size()
total_counts.plot.bar(rot=0)
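
Raw counts depend on how many men and women were aboard, so a survival rate per group is a useful complement. A minimal sketch on made-up rows (the `toy` frame is an assumption for illustration, not the Titanic data):

```python
import pandas as pd

# Toy passengers, made up for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male",
            "female", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of 1s, i.e. the survival rate.
survival_rate = toy.groupby("Sex")["Survived"].mean()

# Counts of each (Sex, Survived) combination as a table.
counts = pd.crosstab(toy["Sex"], toy["Survived"])

print(survival_rate)
print(counts)
```

Running the same `groupby` on `train_set` gives the rates behind the bar chart above.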

In [None]:
def correlation_heatmap(train):
    # Restrict to numeric columns; text columns such as Name cannot be correlated.
    correlations = train.corr(numeric_only=True)

    fig, ax = plt.subplots(figsize=(16,16))
    sb.heatmap(correlations, vmax=1.0, center=0, fmt='.2f', square=True, linewidths=.5, annot=True, cbar_kws={"shrink":.70})
    plt.show()
correlation_heatmap(train_set)

## Missing Values
Now I will treat the missing values by first removing the columns I will not use (Name, Ticket and Cabin) and then dropping the few remaining rows that contain nulls.

In [None]:
def values_drop(df):
    # Drop the text columns we will not use, then drop rows with any nulls.
    df = df.drop(columns=['Name', 'Ticket', 'Cabin'])
    df = df.dropna()
    return df
def values_drop_test(df):
    # Every test row is needed for the submission, so only drop columns here.
    df = df.drop(columns=['Name', 'Ticket', 'Cabin'])
    return df
train_set = values_drop(train_set)
test_set = values_drop_test(test_set)

In [None]:
test_set = test_set.replace(np.nan, 0)
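
Replacing nulls with 0 treats a passenger of unknown age as a newborn. One alternative, shown here only as a sketch and not what this notebook does, is to fill with the median learned from the training data. The two toy frames below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train/test frames, made up for illustration only.
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0],
                          "Fare": [7.0, 8.0, 9.0, np.nan]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0],
                         "Fare": [10.0, np.nan]})

# Learn the per-column medians from the training rows only.
imputer = SimpleImputer(strategy="median")
imputer.fit(toy_train)

# Fill the gaps in the test rows with those training medians.
filled_test = pd.DataFrame(imputer.transform(toy_test),
                           columns=toy_test.columns)
print(imputer.statistics_)
print(filled_test)
```

Fitting the imputer on the training rows and only transforming the test rows keeps test information out of the preprocessing step.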

In [None]:
train_set.head()

## Encode Categorical features
Now I will use One Hot Encoding to change the Sex and Embarked columns from text categories into numeric 0/1 indicator columns.

In [None]:
survived = train_set[['Survived']]
train_set = train_set.drop("Survived", axis=1)

In [None]:
sex_cat = train_set[["Sex"]]
emb_cat = train_set[["Embarked"]]

In [None]:
from sklearn.preprocessing import OneHotEncoder
OHE = OneHotEncoder()
sex_cat_encoded = OHE.fit_transform(sex_cat)
sex_cat_encoded.toarray()

In [None]:
emb_cat_encoded = OHE.fit_transform(emb_cat)
emb_cat_encoded.toarray()
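
To make the encoded arrays easier to read, this sketch on a made-up column shows which output column belongs to which category. The `toy` frame is an assumption for illustration, and `handle_unknown="ignore"` is an optional setting I have added here; it is not used in the cells above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column, made up for illustration only.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy).toarray()

# categories_ lists the categories in output-column order.
print(encoder.categories_)
print(encoded)

# A category that was not present during fit becomes a row of zeros.
unseen = encoder.transform(pd.DataFrame({"Embarked": ["X"]})).toarray()
print(unseen)
```

Each row has a single 1 in the column of its category, so the three Embarked values become three indicator columns.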

In [None]:
train_num = train_set.drop(["Sex", "Embarked"], axis=1)

Here I create a full pipeline of transformations so I can easily apply it to both existing and new entries.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_attribs = list(train_num)
cat_attribs = ["Sex", "Embarked"]

full_pipeline = ColumnTransformer([
    ("num", StandardScaler(), num_attribs),
    ("cat", OneHotEncoder(), cat_attribs)
])
train_prepared = full_pipeline.fit_transform(train_set)

In [None]:
# Reuse the scaling and encoding learned from the training data; do not refit on the test set.
test_prepared = full_pipeline.transform(test_set)
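
A pipeline like this should be fitted on the training data only and then reused on new rows. The sketch below, on made-up frames, shows what can go wrong otherwise: refitting on a test set that lacks a category changes the number of output columns. The toy frames and the names `pipeline` and `refit_out` are assumptions for illustration.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames, made up for illustration; the test rows contain no "male".
toy_train = pd.DataFrame({"Fare": [10.0, 20.0, 30.0],
                          "Sex": ["male", "female", "male"]})
toy_test = pd.DataFrame({"Fare": [20.0, 40.0],
                         "Sex": ["female", "female"]})

pipeline = ColumnTransformer([
    ("num", StandardScaler(), ["Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])

# Fit on the training rows, then only transform the test rows.
train_out = pipeline.fit_transform(toy_train)
test_out = pipeline.transform(toy_test)

# Refitting an unfitted copy on the test rows learns different statistics
# and, here, one fewer category column.
refit_out = clone(pipeline).fit_transform(toy_test)

print(train_out.shape, test_out.shape, refit_out.shape)
```

With `transform`, a Fare of 20 maps to 0 because 20 is the training mean; after refitting on the test rows it would map to -1 instead, so the model would see differently scaled inputs.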

## Train Models
Now I will train a neural network and tune it as best I can while avoiding both overfitting and underfitting.

In [None]:
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split

X = train_prepared
y = survived
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 42)

In [None]:
X_train.shape

In [None]:
def create_network():
    model = keras.models.Sequential([
        keras.layers.Dense(100, activation='relu', input_dim=10),
        keras.layers.Dense(66, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid')
    ])
    return model
def evaluate(model):
    model.summary()
    # `lr` is deprecated in recent Keras versions; `learning_rate` is the current name.
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.05), loss='binary_crossentropy', metrics=['accuracy'])
    history = model.fit(X_train, y_train, batch_size=40, epochs=30, validation_split=.1,
                       callbacks=[keras.callbacks.EarlyStopping(patience=5)])
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['training', 'validation'], loc='best')
    plt.show()

In [None]:
model = create_network()
evaluate(model)
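
The `EarlyStopping(patience=5)` callback used in `evaluate` stops training once the monitored metric (`val_loss` by default) has gone 5 epochs in a row without improving. The pure-Python function below is a simplified illustration of that rule, not Keras' actual implementation; `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: count the epoch and stop once patience runs out.
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: the best value is at epoch 2, then five worse epochs.
losses = [0.9, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.6]
stop_at = early_stopping_epoch(losses, patience=5)
never = early_stopping_epoch([0.9, 0.8, 0.7], patience=5)
print(stop_at, never)
```

In this toy run training stops at epoch 7, so the later improvement to 0.6 is never reached. A larger patience trades longer training for a lower chance of stopping too early.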

Accuracy looks similar on the training and validation sets, and the held-out split evaluated below agrees, which is great news!

In [None]:
model_acc = model.evaluate(X_test, y_test)
print(" Model Accuracy is : {0:.1f}%".format(model_acc[1]*100))

Final predictions on the test set.

In [None]:
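# predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output at 0.5 instead.
results = (model.predict(test_prepared) > 0.5).astype("int32")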
results = pd.Series(results[:,0], name="Survived")
submission = pd.concat([pd.Series(test_set.PassengerId, name="PassengerId"),results],axis = 1)
submission.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")
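
With a single sigmoid output, class labels come from thresholding the predicted probability at 0.5. A minimal sketch with made-up probabilities (the `probs` array is an assumption for illustration, shaped like the output of a `Dense(1)` layer):

```python
import numpy as np

# Made-up sigmoid outputs with shape (n_samples, 1).
probs = np.array([[0.91], [0.12], [0.50], [0.67]])

# Probabilities strictly above 0.5 become class 1, everything else class 0.
labels = (probs > 0.5).astype("int32")

# Take the single column to get a flat vector for the submission file.
flat = labels[:, 0]
print(flat)
```

A probability of exactly 0.5 maps to class 0 under this rule.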