<a href="https://colab.research.google.com/github/nabilahnran/titanic_classification/blob/main/Titanic_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Titanic Classification Project

## The Titanic Dataset

[Kaggle](https://www.kaggle.com) has a [dataset](https://www.kaggle.com/c/titanic/data) containing the passenger list of the Titanic. The data contains passenger features such as age, gender, and ticket class, as well as whether or not each passenger survived.

The purpose of the project is to create a binary classifier using TensorFlow and scikit-learn that predicts whether a passenger survived.

**Pull the dataset**



In [None]:
! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && cp kaggle.json ~/.kaggle/ && echo 'Done'
! kaggle competitions download -c titanic
! ls

In [None]:
from zipfile import ZipFile

with ZipFile('titanic.zip', 'r') as zipObj:
   zipObj.extractall('temp')

Three files are downloaded:

1. `train.csv`: training data (contains features and targets)
1. `test.csv`: feature data used to make predictions
1. `gender_submission.csv`: an example competition submission file

## Step 1: Exploratory Data Analysis

**Read Data**

In [None]:
import pandas as pd

train_data = pd.read_csv('temp/train.csv')
train_data

In [None]:
train_data.dtypes

In [None]:
print(train_data.describe())
# Describe the columns with object dtype (the string-valued columns)
print(train_data.describe(include=['O']))

In [None]:
test_data = pd.read_csv('temp/test.csv')

print("train set: " + str(train_data.shape))
print("test set: " + str(test_data.shape))

In [None]:
print(train_data.nunique())

In [None]:
for i in train_data.columns:
  print(f"{i}:", train_data[i].isna().sum())

**Drop the unhelpful variables and the variable with many null values (Cabin); keep Fare as an alternative to Cabin**

In [None]:
# Drop the unhelpful variables and the variable with many null values (Cabin)
# Keep Fare as an alternative to Cabin
train_data = train_data.drop(['Name', 'Ticket', 'PassengerId', 'Cabin'], axis=1)
train_data

In [None]:
for i in train_data.columns:
  print("{}:{}".format(i, train_data[i].isna().sum()))

In [None]:
train_data.dtypes

**Processing values for analysis**

In [None]:
# Work on a copy so the original data stays untouched
Correlation_df = train_data.copy()

# Fill NaN cells with the most frequent value of each column
from sklearn.impute import SimpleImputer
import numpy as np
Correlation_df = pd.DataFrame(SimpleImputer(missing_values=np.nan, strategy = 'most_frequent')
  .fit_transform(Correlation_df), columns = Correlation_df.columns)

# Split into numeric and categorical columns for normalization.
# Survived is the target, so it is left out of the columns to scale.
num_col = train_data.select_dtypes(include = ['float64', 'int64']).columns.drop('Survived')
category_col = train_data.select_dtypes(include = ['object']).columns

# One-hot encode Embarked with get_dummies and label-encode Sex
Correlation_df_category = pd.get_dummies(Correlation_df[category_col], columns=['Embarked'])
from sklearn.preprocessing import LabelEncoder
Correlation_df_category['Sex'] = LabelEncoder().fit_transform(Correlation_df_category['Sex'])

# Standardize the numeric feature columns
from sklearn.preprocessing import StandardScaler
Correlation_df_num = pd.DataFrame(StandardScaler().fit_transform(Correlation_df[num_col]),
  columns = num_col)

# Combine the target, the scaled numeric features, and the encoded categorical features
Correlation_df = pd.concat([Correlation_df['Survived'].astype('int'),
                            Correlation_df_num, Correlation_df_category], axis = 1)

print(Correlation_df)


In [None]:
Correlation_df[['Sex', 'Embarked_C', 'Embarked_Q', 'Embarked_S']] = Correlation_df[
    ['Sex', 'Embarked_C', 'Embarked_Q', 'Embarked_S']].astype('float')
Correlation_df

In [None]:
for i in Correlation_df.columns:
  print("{} : {}".format(i, Correlation_df[i].isna().sum()))
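
The standardization applied to the numeric columns above rescales each column to zero mean and unit variance. As a rough illustration of that transformation, here is a minimal pure-Python sketch. The `standardize` helper and the sample fares are hypothetical and are not part of the notebook; the sketch uses the population standard deviation, which is what `StandardScaler` uses.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population std (ddof=0)
    if std == 0:
        # A constant column carries no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


# Hypothetical fares, not taken from the dataset
fares = [7.25, 71.28, 8.05, 53.10]
scaled = standardize(fares)
print([round(z, 3) for z in scaled])
```

After scaling, the values have mean 0 and standard deviation 1, so features measured on very different scales (for example Age and Fare) contribute comparably to the model.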

**Measure and visualize correlation between each feature**

In [None]:
#Make a correlation table with kendall method
Correlation_df.corr(method='kendall')

In [None]:
# Visualize the Kendall correlation
import seaborn as sns
sns.heatmap(Correlation_df.corr(method='kendall'))

In [None]:
# For a clearer comparison between each column and the target column (Survived), use the point-biserial correlation
from scipy import stats

# Collect one row per feature, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0)
rows = []
for i in Correlation_df.drop(columns=['Survived']).columns:
  res = stats.pointbiserialr(Correlation_df[i], Correlation_df['Survived'])
  rows.append({'Variable': i,
               'Correlation': res.correlation,
               'P-Value': round(res.pvalue, 3)})

Biserial_df = pd.DataFrame(rows)
Biserial_df.index = Biserial_df['Variable']
Biserial_df

In [None]:
sns.heatmap(Biserial_df[['Correlation']])

Conclusions so far:

1. The strongest correlations with the Survived variable are for Fare, Parch, and Embarked_C.
2. Pclass and Fare have a correlation of about -0.5, which is expected because ticket class and fare are closely related.
3. Sex shows no correlation with survival.
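
To make the point-biserial coefficient used above more concrete, the following is a minimal pure-Python sketch of the formula, checked against a plain Pearson correlation on the same data. The helper names and the toy values are hypothetical and are not taken from the dataset.

```python
from math import sqrt
from statistics import fmean, pstdev


def point_biserial(values, labels):
    """Point-biserial correlation between numeric values and 0/1 labels."""
    group1 = [v for v, y in zip(values, labels) if y == 1]
    group0 = [v for v, y in zip(values, labels) if y == 0]
    n, n1, n0 = len(values), len(group1), len(group0)
    spread = pstdev(values)  # population standard deviation of all values
    return (fmean(group1) - fmean(group0)) / spread * sqrt(n1 * n0 / n**2)


def pearson(xs, ys):
    """Plain Pearson correlation, used here only as a cross-check."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data
fare = [7.0, 8.0, 30.0, 50.0, 9.0, 70.0]
survived = [0, 0, 1, 1, 0, 1]

r_pb = point_biserial(fare, survived)
r = pearson(fare, survived)
print(round(r_pb, 4), round(r, 4))
```

The two values agree because the point-biserial coefficient is the Pearson correlation with one binary variable. The sign only indicates which class has the higher mean, so the strength of a relationship is read from the absolute value.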





**Data Splitting**

The data used here is the cleaned data prepared for the correlation analysis above.

In [None]:
training_data = Correlation_df.copy()
training_data['Survived'] = training_data['Survived'].astype('int')
training_data

In [None]:
train_x = training_data.drop(columns = ['Survived'])
train_y = training_data['Survived']

#Train&Test Split
from sklearn.model_selection import train_test_split
train_x, validation_x, train_y, validation_y = train_test_split(train_x, train_y, test_size = 0.33, random_state=52)
print(train_x.shape, train_y.shape, validation_x.shape, validation_y.shape)

In [None]:
train_x

In [None]:
train_y

---

## Step 2: The Model

### Model with TensorFlow

In [None]:
#---------------Tensorflow----------------
import tensorflow as tf
model = tf.keras.Sequential([
                             tf.keras.layers.Dense(32, input_shape = [9], activation = 'relu'),
                             tf.keras.layers.Dense(64, activation = 'relu', 
                                                   kernel_regularizer= tf.keras.regularizers.l1(l=0.1)),
                             tf.keras.layers.Dense(1, activation = 'sigmoid')
])

# Pass metrics as a list; recent Keras versions reject a bare string here
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['binary_accuracy'])
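
As a rough illustration of what the binary cross-entropy loss and the sigmoid output mean, here is a minimal pure-Python sketch. The function names, the clipping constant `eps`, the 0.5 threshold, and the sample probabilities are assumptions made for illustration; this is not Keras's implementation.

```python
from math import log


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)


def to_labels(y_prob, threshold=0.5):
    """Turn sigmoid probabilities into hard 0/1 class labels."""
    return [1 if p > threshold else 0 for p in y_prob]


# Hypothetical model outputs and true labels
probs = [0.9, 0.2, 0.6, 0.1]
labels = [1, 0, 1, 0]

loss = binary_cross_entropy(labels, probs)
preds = to_labels(probs)
print(round(loss, 4), preds)
```

Confident correct predictions contribute little to the loss, while confident wrong ones are penalized heavily, which is why the loss can keep changing even when the accuracy stays flat.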

In [None]:
MLP = model.fit(train_x, train_y, epochs = 1500, validation_data=(validation_x, validation_y))

In [None]:
import matplotlib.pyplot as plt

acc = MLP.history['binary_accuracy']
val_acc = MLP.history['val_binary_accuracy']
loss = MLP.history['loss']
val_loss = MLP.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'r')
plt.plot(epochs, val_acc, 'b')
plt.title('Training and validation accuracy')
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend(["Accuracy", "Validation Accuracy"])

plt.figure()

plt.plot(epochs, loss, 'r')
plt.plot(epochs, val_loss, 'b')
plt.title('Training and validation loss')
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend(["Loss", "Validation Loss"])

plt.show()

### Model with Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

DSC = DecisionTreeClassifier()
DSC = DSC.fit(train_x, train_y)

In [None]:
DSC_val_pred = DSC.predict(validation_x)
print("accuracy : ", metrics.accuracy_score(validation_y, DSC_val_pred))

In [None]:
from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image  
import pydotplus

dot_data = StringIO()
# Pass feature names as a plain list; some scikit-learn versions reject a pandas Index
export_graphviz(DSC, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=list(train_x.columns), class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('titanic_survived.png')
Image(graph.create_png())

### Model with Random Forest

In [None]:
# Find candidate ccp_alpha values from the cost-complexity pruning path
from sklearn.tree import DecisionTreeClassifier
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(train_x, train_y)['ccp_alphas']

#Pools of Param
random_parameters = {'n_estimators' : [10,100,1000],
                     'criterion' : ['gini', 'entropy'],
                     'max_depth' : [10,100,1000],
                     'max_features' : ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3 (it meant 'sqrt' for classifiers)
                     'bootstrap' : [True, False],
                     'class_weight': ['balanced', 'balanced_subsample'],
                     'ccp_alpha' : alphas}

# Randomized search with cross-validation for parameter tuning
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

RFC = RandomizedSearchCV(RandomForestClassifier(),
                         param_distributions = random_parameters,
                         n_iter = 100,
                         scoring = 'accuracy',
                         n_jobs = 10,
                         cv = 3,
                         verbose = 1,
                         random_state = 0,
                         return_train_score = True)
RFC.fit(train_x, train_y)

In [None]:
Best_param = RFC.best_params_

# Refit a Random Forest with the best parameters found by the search
RFC = RandomForestClassifier(**Best_param)
RFC.fit(train_x, train_y)

In [None]:
RFC_val_pred= RFC.predict(validation_x)
print("Accuracy :", metrics.accuracy_score(validation_y, RFC_val_pred))

### Model with Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

NBC = GaussianNB()
NBC = NBC.fit(train_x, train_y)

In [None]:
NBC_val_pred = NBC.predict(validation_x)
print("Accuracy : ", metrics.accuracy_score(validation_y, NBC_val_pred))

**Of the models tried (deep learning MLP, Decision Tree, Random Forest, and Naive Bayes), the MLP performs best, with the highest accuracy: >80%.**

---

## Predict with the selected algorithm

In [None]:
test_data = pd.read_csv('temp/test.csv')
testing_data = test_data.copy()
testing_data.drop(['Name', 'Ticket', 'PassengerId', 'Cabin'], axis = 1, inplace = True)
testing_data.describe()

In [None]:
testing_data.isna().sum()

In [None]:
testing_data

In [None]:
print(test_data)
print(testing_data)

In [None]:
testing_data.dtypes

In [None]:
# Split into numeric and categorical columns for normalization
num_col = testing_data.select_dtypes(include = ['float64', 'int64']).columns
category_col = testing_data.select_dtypes(include = ['object']).columns

# Fill NaN cells with the most frequent value of each column
testing_data = pd.DataFrame(SimpleImputer(missing_values=np.nan, strategy = 'most_frequent')
  .fit_transform(testing_data), columns = testing_data.columns)

# Standardize the numeric columns.
# Note: this fits a new scaler on the test set, because the scaler fitted on the training data was not kept.
testing_data_num = pd.DataFrame(StandardScaler().fit(testing_data[num_col])
  .transform(testing_data[num_col]), columns = testing_data[num_col].columns)
print(testing_data)

# One-hot encode Embarked with get_dummies and label-encode Sex
testing_data_category = pd.get_dummies(testing_data[category_col], columns=['Embarked'])
from sklearn.preprocessing import LabelEncoder
testing_data_category['Sex'] = LabelEncoder().fit_transform(testing_data_category['Sex'])

# Combine the sub-DataFrames in the same column order as the training features
# (numeric columns first, then Sex and the Embarked dummies) and cast everything to float
testing_data = pd.concat([testing_data_num, testing_data_category], axis = 1).astype('float')

In [None]:
testing_data

In [None]:
result = pd.DataFrame()
# model.predict returns one sigmoid probability per row; round to 0/1 and flatten to a 1-D array
result['Survived'] = model.predict(testing_data).round(0).astype(int).ravel()

result
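
The predictions can then be written out in the competition's submission layout, which follows `gender_submission.csv`: one `PassengerId` column and one `Survived` column. Below is a minimal sketch using only the standard library. The `build_submission` helper and the sample IDs and predictions are hypothetical; in the notebook the IDs would come from `test_data['PassengerId']` and the predictions from `result['Survived']`.

```python
import csv
import io


def build_submission(passenger_ids, predictions):
    """Return CSV text with a PassengerId column and a 0/1 Survived column."""
    if len(passenger_ids) != len(predictions):
        raise ValueError("passenger_ids and predictions must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for pid, pred in zip(passenger_ids, predictions):
        writer.writerow([pid, int(pred)])
    return buffer.getvalue()


# Hypothetical IDs and predictions, for illustration only
ids = [892, 893, 894]
preds = [0, 1, 0]
csv_text = build_submission(ids, preds)
print(csv_text)
```

With pandas, the same file can be produced by adding a `PassengerId` column to `result` and calling `to_csv` with `index=False`.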

---