# Titanic - Machine Learning from Disaster

In [45]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Download ***only the training set*** from the following link: https://www.kaggle.com/competitions/titanic/data

Split the training set into train and test subsets later, when needed.


Data Description:

| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |


**Use neural networks (NN) to create three models that predict which passengers survived the Titanic shipwreck**

In [46]:
!pip install scikeras



### Data pre-processing



In [47]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from scikeras.wrappers import KerasClassifier

In [48]:
df = pd.read_csv("/content/drive/MyDrive/tested.csv")
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,0,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,0,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,0,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


### Feature engineering

Feature engineering, in data science, refers to manipulation — addition, deletion, combination, mutation — of your data set to improve machine learning model training, leading to better performance and greater accuracy.

From the columns that denote the number of siblings/spouses and the number of parents/children, **define a new column isAlone** that indicates whether the passenger has relatives on the boat. The column should contain only 0s and 1s.

Additionally, change the **Age column** so that the passengers are divided into five age groups: 0 for age<=16, 1 for 16<age<=32, 2 for 32<age<=48, 3 for 48<age<=64 and 4 for age>64.

Hint: Drop the columns for the number of siblings and parents once isAlone has been created.
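
A minimal sketch of how `isAlone` could be derived before the two count columns are dropped. It assumes the convention that 1 means the passenger travels alone (`SibSp + Parch == 0`) and 0 means at least one relative is aboard. The toy `demo` frame is hypothetical and only mirrors the column names of the dataset.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the Titanic data
demo = pd.DataFrame({"SibSp": [0, 1, 0, 2], "Parch": [0, 0, 1, 3]})

# Assumed convention: 1 = no relatives aboard, 0 = at least one relative
demo["isAlone"] = ((demo["SibSp"] + demo["Parch"]) == 0).astype(int)

# The count columns are no longer needed once isAlone exists
demo = demo.drop(columns=["SibSp", "Parch"])
print(demo)
```

If the opposite convention is preferred (1 = has relatives), the comparison can be flipped to `> 0`.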

In [49]:
# Categorize the 'Age' column
def categorize_age(age):
    if pd.isna(age):
        return age  # Keep NaN as is
    elif age <= 16:
        return 0
    elif 16 < age <= 32:
        return 1
    elif 32 < age <= 48:
        return 2
    elif 48 < age <= 64:
        return 3
    else:
        return 4

In [50]:
df['Age'] = df['Age'].apply(categorize_age)
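
The same binning can also be expressed with `pd.cut`, which may be easier to adjust if the group boundaries change. This is a sketch on a hypothetical `ages` series, not a replacement for the cell above. It relies on `pd.cut` using right-closed intervals by default and on `labels=False` returning integer bin positions, with missing ages staying `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including boundary values and a missing entry
ages = pd.Series([5, 16, 16.5, 32, 40, 64, 70, np.nan])

# Right-closed bins: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
bins = [-np.inf, 16, 32, 48, 64, np.inf]
age_group = pd.cut(ages, bins=bins, labels=False)
print(age_group.tolist())
```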

In [51]:
df.drop(columns=['SibSp', 'Parch'], inplace=True)

In [52]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,2.0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,2.0,363272,7.0000,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,3.0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,1.0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,1.0,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...
413,1305,0,3,"Spector, Mr. Woolf",male,,A.5. 3236,8.0500,,S
414,1306,1,1,"Oliva y Ocana, Dona. Fermina",female,2.0,PC 17758,108.9000,C105,C
415,1307,0,3,"Saether, Mr. Simon Sivertsen",male,2.0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,0,3,"Ware, Mr. Frederick",male,,359309,8.0500,,S


In [53]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [54]:
df = df.dropna()

In [55]:
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
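
`df.dropna()` removes every row that has a missing value in any column, so the 327 missing `Cabin` entries dominate what gets dropped, even though `Cabin` is excluded from the features later on. A hedged alternative, assuming only `Age` and `Fare` need to be complete, is to restrict the check with `subset`. The toy frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: Cabin is mostly missing, Age and Fare only occasionally
demo = pd.DataFrame({
    "Age": [2.0, np.nan, 1.0, 3.0],
    "Fare": [7.8, 8.05, np.nan, 108.9],
    "Cabin": [np.nan, np.nan, np.nan, "C105"],
})

all_complete = demo.dropna()                             # any NaN drops the row
features_complete = demo.dropna(subset=["Age", "Fare"])  # only Age/Fare are checked
print(len(all_complete), len(features_complete))
```

Keeping more rows matters here because the cross-validation folds and the test split are cut from whatever survives this step.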

### Neural Network 1

#### Optimize number of epochs and batch size for NN1

(Try different values for the epochs and batch size parameters and choose the optimal ones)

Hint: You can use exhaustive search over specified parameter values for an estimator.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

You will need a wrapper class for your neural network models

https://adriangb.com/scikeras/stable/generated/scikeras.wrappers.KerasClassifier.html
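
As a rough illustration of the exhaustive search described in the hint, the sketch below runs `GridSearchCV` over two training-length and batch-size values. To keep it runnable without TensorFlow, it swaps in scikit-learn's `MLPClassifier` for the Keras wrapper and uses synthetic data from `make_classification`. Here `max_iter` stands in for the number of epochs, and all names ending in `_demo` are hypothetical.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Short training runs may stop before convergence; silence that warning here
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Synthetic stand-in data (not the Titanic set)
X_demo, y_demo = make_classification(n_samples=120, n_features=5, random_state=0)

# max_iter plays the role of epochs for the default 'adam' solver
param_grid_demo = {"max_iter": [20, 50], "batch_size": [16, 32]}

search = GridSearchCV(
    estimator=MLPClassifier(hidden_layer_sizes=(16,), random_state=0),
    param_grid=param_grid_demo,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# Every combination in the grid is evaluated with 3-fold cross-validation
print(len(search.cv_results_["params"]), search.best_params_)
```

The same pattern applies to a `KerasClassifier`: the grid keys name parameters of the estimator, and `best_params_` reports the combination with the highest mean cross-validated score.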

In [56]:
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])
df['Embarked'] = le.fit_transform(df['Embarked'].astype(str))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Sex'] = le.fit_transform(df['Sex'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Embarked'] = le.fit_transform(df['Embarked'].astype(str))


In [57]:
X = df.drop(columns=['Survived', 'Name', 'Ticket', 'Cabin', 'PassengerId'])
y = df['Survived']

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
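
`StandardScaler` is imported at the top of the notebook, but the cells shown here never apply it. A minimal sketch of how scaling could be fitted on the training split only and then reused on the test split, with hypothetical arrays standing in for `X_train` and `X_test`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for X_train / X_test (columns on very different scales)
X_train_demo = np.array([[1.0, 7.25], [3.0, 108.9], [2.0, 26.0], [3.0, 8.05]])
X_test_demo = np.array([[2.0, 13.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test_demo)        # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

Fitting the scaler on the training rows alone keeps information from the test rows out of the preprocessing step.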

In [59]:
def create_model(epochs=100, batch_size=32):
    model = Sequential()
    model.add(Dense(64, activation='relu', input_shape=(X_train.shape[1],)))
    model.add(Dropout(0.5))
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

In [60]:
# Wrap the model (scikeras uses `model=`; `build_fn=` is the deprecated name)
model = KerasClassifier(model=create_model, verbose=0)

In [61]:
# Define parameter grid
param_grid = {
    'epochs': [50, 100, 150],
    'batch_size': [16, 32, 64]
}

In [62]:
# Setup GridSearchCV
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3, scoring='accuracy')

In [63]:
# Fit the model
grid_result = grid.fit(X_train, y_train)



In [64]:
# Print the best parameters and score
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")

Best: 0.5507246376811595 using {'batch_size': 32, 'epochs': 150}


In [65]:
# Evaluate on the test set
best_model = grid_result.best_estimator_
y_pred = best_model.predict(X_test)
# KerasClassifier.predict already returns class labels (0/1), so this threshold leaves them unchanged
y_pred = (y_pred > 0.5).astype(int)

print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")
print(f"F1 Score: {f1_score(y_test, y_pred)}")

Accuracy: 0.4444444444444444
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
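
Non-zero accuracy together with zero precision, recall and F1 is what these metrics produce when a classifier gets no positive case right. A minimal sketch with confusion-matrix counts chosen to be consistent with the numbers above. The counts themselves (10 actual survivors, 8 non-survivors, every prediction 0) are an assumption inferred from the reported values, not output from the notebook.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Undefined ratios (0/0) are reported as 0.0 here
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Assumed counts: the model predicts class 0 for all 18 test rows
acc, prec, rec, f1 = binary_metrics(tp=0, fp=0, fn=10, tn=8)
print(acc, prec, rec, f1)
```

Under that assumption the accuracy equals the share of non-survivors in the test split, so it says little about whether the network learned anything.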




### Neural Network 2

#### Optimize number of epochs and batch size for NN2

(Try different values for the epochs and batch size parameters and choose the optimal ones)

Hint: You can use exhaustive search over specified parameter values for an estimator.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

You will need a wrapper class for your neural network models

https://adriangb.com/scikeras/stable/generated/scikeras.wrappers.KerasClassifier.html

In [66]:
def create_model_nn2(epochs=100, batch_size=32):
    model = Sequential()
    model.add(Dense(128, activation='relu', input_shape=(X_train.shape[1],)))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

In [67]:
model_nn2 = KerasClassifier(model=create_model_nn2, verbose=0)

In [68]:
param_grid_nn2 = {
    'epochs': [50, 100, 150],
    'batch_size': [16, 32, 64]
}

In [69]:
grid_nn2 = GridSearchCV(estimator=model_nn2, param_grid=param_grid_nn2, n_jobs=-1, cv=3, scoring='accuracy')

In [70]:
grid_result_nn2 = grid_nn2.fit(X_train, y_train)



In [71]:
print(f"NN2 Best: {grid_result_nn2.best_score_} using {grid_result_nn2.best_params_}")

NN2 Best: 0.5797101449275363 using {'batch_size': 32, 'epochs': 100}


In [72]:
# Evaluate NN2 on the test set
best_model_nn2 = grid_result_nn2.best_estimator_
y_pred_nn2 = best_model_nn2.predict(X_test)
y_pred_nn2 = (y_pred_nn2 > 0.5).astype(int)

print("NN2 Evaluation:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_nn2)}")
print(f"Precision: {precision_score(y_test, y_pred_nn2)}")
print(f"Recall: {recall_score(y_test, y_pred_nn2)}")
print(f"F1 Score: {f1_score(y_test, y_pred_nn2)}")



NN2 Evaluation:
Accuracy: 0.8333333333333334
Precision: 0.7692307692307693
Recall: 1.0
F1 Score: 0.8695652173913043


### Neural Network 3

#### Optimize number of epochs and batch size for NN3

(Try different values for the epochs and batch size parameters and choose the optimal ones)

Hint: You can use exhaustive search over specified parameter values for an estimator.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

You will need a wrapper class for your neural network models

https://adriangb.com/scikeras/stable/generated/scikeras.wrappers.KerasClassifier.html


In [73]:
def create_model_nn3(epochs=100, batch_size=32):
    model = Sequential()
    model.add(Dense(256, activation='relu', input_shape=(X_train.shape[1],)))
    model.add(Dropout(0.5))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

In [74]:
model_nn3 = KerasClassifier(model=create_model_nn3, verbose=0)

In [75]:
param_grid_nn3 = {
    'epochs': [50, 100, 150],
    'batch_size': [16, 32, 64]
}

In [76]:
grid_nn3 = GridSearchCV(estimator=model_nn3, param_grid=param_grid_nn3, n_jobs=-1, cv=3, scoring='accuracy')

In [77]:
grid_result_nn3 = grid_nn3.fit(X_train, y_train)



In [78]:
print(f"NN3 Best: {grid_result_nn3.best_score_} using {grid_result_nn3.best_params_}")

NN3 Best: 0.6231884057971014 using {'batch_size': 16, 'epochs': 50}


In [79]:
# Evaluate NN3 on the test set
best_model_nn3 = grid_result_nn3.best_estimator_
y_pred_nn3 = best_model_nn3.predict(X_test)
y_pred_nn3 = (y_pred_nn3 > 0.5).astype(int)

print("NN3 Evaluation:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_nn3)}")
print(f"Precision: {precision_score(y_test, y_pred_nn3)}")
print(f"Recall: {recall_score(y_test, y_pred_nn3)}")
print(f"F1 Score: {f1_score(y_test, y_pred_nn3)}")



NN3 Evaluation:
Accuracy: 0.7777777777777778
Precision: 0.75
Recall: 0.9
F1 Score: 0.8181818181818182
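
To compare the three networks side by side, the test metrics printed in the evaluation cells can be collected in one place. The sketch below uses the reported values rounded to four decimals. The dictionary layout and the choice of F1 as the ranking metric are assumptions made for illustration.

```python
# Test-set metrics as printed by the three evaluation cells (rounded)
results = {
    "NN1": {"accuracy": 0.4444, "precision": 0.0, "recall": 0.0, "f1": 0.0},
    "NN2": {"accuracy": 0.8333, "precision": 0.7692, "recall": 1.0, "f1": 0.8696},
    "NN3": {"accuracy": 0.7778, "precision": 0.75, "recall": 0.9, "f1": 0.8182},
}

# Rank the models by F1 score, highest first
ranking = sorted(results, key=lambda name: results[name]["f1"], reverse=True)
best = ranking[0]
print(ranking, best)
```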
