The deadline for this homework is **18.10.2023 08:59** (right before the practice session). After completing the exercises, you should:

1. Download this file to your computer (`File` $\to$ `Download .ipynb`)

2. Name the file using the pattern *HWx_NameSurname* (for example, `HW2_NshanPotikyan.ipynb`)

3. Send the file to `nshan.potikyan@gmail.com` with the subject **ML2**

**Note**

* If you fail to follow any of the above instructions, your homework will not be graded.

* You do not need to send any dataset files or helper scripts that I provide with your homework (since I already have them).

* You need to write the code for the exercises yourself; you can use built-in functions, ``numpy``, ``pandas``, ``sklearn``
and ``matplotlib``.

**Problem.** During the practice session we tried to build a binary classifier on the Titanic dataset that would predict whether a passenger survives or not.

* In this homework, use the same dataset, but this time try three different algorithm families on the given problem:
  * KNN
  * Naive Bayes
  * Decision Trees

* Split the training dataset into train/val/test parts, so that you can evaluate which approach/algorithm results in better performance (use random_state=42, train=80%, val=10%, test=10% splits).

* Try leaving out unimportant features from the data.

* Make use of sklearn [pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) to construct the different approaches.

* Use hyper-parameter tuning to find the best combination of parameters for each algorithm.

* Evaluate the model performance in terms of the accuracy score.

* Use the best data processing method to train a final model on the train+val dataset and report the accuracy score on the test dataset.

Your grade will be based on how many things you have tried and how well your final model performs on the test set.

In [72]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

In [73]:
# Load the Titanic dataset
url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
titanic_df = pd.read_csv(url)
print(titanic_df.columns)

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard',
       'Parents/Children Aboard', 'Fare'],
      dtype='object')


# Preprocessing step

Dropping the "Name" column

In [74]:
titanic_df.drop(columns=['Name'], inplace=True)

In [75]:
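# Fill missing ages with the median; assigning the result back avoids the
# chained-assignment pitfall of calling fillna(..., inplace=True) on a column
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].median())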

In [76]:
categorical_cols = ['Sex']
numeric_cols = ['Pclass', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']

In [77]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(drop='first'), categorical_cols)
    ])
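
The median fill above is computed on the full dataset before the split. One possible alternative is to move imputation into the pipeline, so that the median is learned only from the data each model is fitted on. The sketch below uses `SimpleImputer` (already imported above) on a small made-up frame; `toy_df`, `numeric_branch` and `toy_preprocessor` are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the Titanic frame (values are made up for illustration).
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Fare": [7.25, 71.28, 8.05, 26.55],
    "Sex": ["male", "female", "female", "male"],
})

# Numeric branch: median imputation first, then scaling.
numeric_branch = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Same layout as the preprocessor above, with the imputer added.
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_branch, ["Age", "Fare"]),
    ("cat", OneHotEncoder(drop="first"), ["Sex"]),
])

toy_features = toy_preprocessor.fit_transform(toy_df)
print(toy_features.shape)  # 2 scaled numeric columns + 1 one-hot column
```

With this layout, the separate `fillna` step would no longer be needed, since the imputer is fitted together with the rest of the pipeline.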

# Splitting step

In [78]:
X = titanic_df.drop(columns=['Survived'])
y = titanic_df['Survived']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
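
A quick way to confirm that the two-step split really yields 80%/10%/10% is to check the part sizes. The sketch below does this on synthetic arrays (`X_toy` and `y_toy` are made up). It also passes `stratify`, which is an optional variation not used in the cell above; it keeps the class ratio the same in every part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, balanced binary labels.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

# Step 1: hold out 20% of the rows.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
# Step 2: split the held-out 20% in half, giving validation and test parts.
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

print(len(X_tr), len(X_va), len(X_te))  # 80 10 10
```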

# Creating pipelines

In [79]:
knn_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier())
])

In [80]:
nb_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GaussianNB())
])

In [81]:
dt_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier())
])

# Tuning hyperparameters

KNN model

In [82]:
knn_param_grid = {
    'classifier__n_neighbors': [3, 5, 7],
    'classifier__weights': ['uniform', 'distance']
}

In [83]:
knn_grid = GridSearchCV(knn_pipe, knn_param_grid, cv=5, scoring='accuracy')
knn_grid.fit(X_train, y_train)

NB model (fitted with default parameters, no grid search)

In [84]:
nb_pipe.fit(X_train, y_train)
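
Naive Bayes is fitted here without any search. If tuning is wanted for this family as well, `GaussianNB` exposes a `var_smoothing` parameter that can be searched in the same way as the other models. The sketch below runs on synthetic data from `make_classification`; the grid values and the names `X_demo`, `nb_demo_pipe` and `nb_demo_grid` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

nb_demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", GaussianNB()),
])

# Search var_smoothing on a log scale (the grid itself is an arbitrary choice).
nb_demo_grid = GridSearchCV(
    nb_demo_pipe,
    {"classifier__var_smoothing": np.logspace(-9, 0, 10)},
    cv=5,
    scoring="accuracy",
)
nb_demo_grid.fit(X_demo, y_demo)
print(nb_demo_grid.best_params_)
```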

Decision Tree model

In [85]:
dt_param_grid = {
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'classifier__criterion': ['gini', 'entropy']
}

In [86]:
dt_grid = GridSearchCV(dt_pipe, dt_param_grid, cv=5, scoring='accuracy')
dt_grid.fit(X_train, y_train)
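
After a grid search has been fitted, it can help to look at which combination won and how close the runners-up were. The sketch below shows one way to do that using `best_params_`, `best_score_` and `cv_results_`, on synthetic data with a smaller grid than the one above; `demo_grid` and its grid values are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the preprocessed Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

demo_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),  # fixed seed for repeatable trees
    {"max_depth": [2, 4, None], "min_samples_leaf": [1, 4]},
    cv=5,
    scoring="accuracy",
)
demo_grid.fit(X_demo, y_demo)

# Winning combination and its mean cross-validated accuracy.
print(demo_grid.best_params_)
print(round(demo_grid.best_score_, 3))

# Full table of results, best-ranked combinations first.
results = pd.DataFrame(demo_grid.cv_results_)
top = results.sort_values("rank_test_score")[["params", "mean_test_score"]].head(3)
print(top)
```

The same attributes are available on `knn_grid` and `dt_grid` in this notebook.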

# Predict values on Validation dataset

In [87]:
knn_val_predictions = knn_grid.predict(X_val)
knn_val_accuracy = accuracy_score(y_val, knn_val_predictions)
print(f"KNN Validation Accuracy: {knn_val_accuracy}")

KNN Validation Accuracy: 0.7865168539325843


In [88]:
nb_val_predictions = nb_pipe.predict(X_val)
nb_val_accuracy = accuracy_score(y_val, nb_val_predictions)
print(f"Naive Bayes Validation Accuracy: {nb_val_accuracy}")

Naive Bayes Validation Accuracy: 0.7528089887640449


In [89]:
dt_val_predictions = dt_grid.predict(X_val)
dt_val_accuracy = accuracy_score(y_val, dt_val_predictions)
print(f"Decision Trees Validation Accuracy: {dt_val_accuracy}")

Decision Trees Validation Accuracy: 0.7640449438202247


# Choosing the best model

In [90]:
best_model = None
best_accuracy = 0

if knn_val_accuracy > best_accuracy:
    best_model = knn_grid.best_estimator_
    best_accuracy = knn_val_accuracy

if nb_val_accuracy > best_accuracy:
    best_model = nb_pipe
    best_accuracy = nb_val_accuracy

if dt_val_accuracy > best_accuracy:
    best_model = dt_grid.best_estimator_
    best_accuracy = dt_val_accuracy
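
The chain of `if` statements above can also be written as a lookup over a dictionary of scores. The sketch below is a minimal version that uses the validation accuracies printed earlier; `val_scores` and `best_name` are hypothetical names, and in the notebook the selected key would then be mapped to the corresponding fitted estimator.

```python
# Validation accuracies as printed in the cells above.
val_scores = {
    "knn": 0.7865168539325843,
    "naive_bayes": 0.7528089887640449,
    "decision_tree": 0.7640449438202247,
}

# Pick the key with the highest validation accuracy.
best_name = max(val_scores, key=val_scores.get)
print(best_name, val_scores[best_name])  # knn 0.7865168539325843
```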

# Re-training the best model on train + val datasets

In [91]:
best_model.fit(pd.concat([X_train, X_val]), pd.concat([y_train, y_val]))

# Evaluating the best model on Test dataset

In [92]:
test_predictions = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Test Accuracy: {test_accuracy}")

Test Accuracy: 0.7752808988764045
