# Exercises in Classification II

## Exercise 1

In this exercise, we look at the Titanic dataset, which is on Moodle in the file "titanic_survival_data.csv".

Answer the following questions:
1. Load the dataset, replace the missing values in the Age column with the mean age of the column, and encode the Sex column as 0s and 1s.
2. Make an X set of the variables "Pclass", "Sex", "Age" and "SibSp", and take Survived as the y variable. Then make a train-test split with 20% of the dataset for testing.
3. Do MinMax scaling on the training dataset.
5. Use 10-fold cross-validation on the training set to train different KNN algorithms and choose a suitable K based on accuracy score.
6. For the chosen K, train a model on the entire training dataset.
7. Create a confusion matrix for the model trained in step 6 and calculate accuracy, precision, recall, and F1 score on the test dataset.
8. OPTIONAL: Create a ROC curve for the test dataset and compute the AUC score.
9. OPTIONAL: Can you use 10-fold cross-validation to get an estimate of the recall instead of accuracy?

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn import linear_model
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, cross_val_score

### 1. Load the dataset, replace the missing values in the Age column with the mean age of the column, and encode the Sex column as 0s and 1s.

In [19]:
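titanic_data = pd.read_csv("titanic_survival_data.csv")
# Fill missing values in the Age column only, using the mean age
# (a DataFrame-wide fillna would also write the mean age into other columns)
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].mean())
titanic_data.info()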


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  891 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB


### 2. Make an X set of the variables "Pclass", "Sex", "Age" and "SibSp", and take Survived as the y variable. Then make a train-test split with 20% of the dataset for testing.

In [20]:
# Take an explicit copy so the Sex encoding below modifies X itself
# and does not trigger pandas' SettingWithCopyWarning
X = titanic_data[["Pclass", "Sex", "Age", "SibSp"]].copy()
y = titanic_data["Survived"]

In [23]:
# Encode Sex as 0 (male) and 1 (female)
X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})


In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8532)

### 3. Do MinMax scaling on the training dataset.


In [26]:
print(X['Sex'].unique())

[0 1]


In [27]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [33]:
print(X_train_scaled)
print("\n")
print(X_test_scaled)

[[1.        0.        0.3709909 0.       ]
 [1.        0.        0.25      0.       ]
 [0.        0.        0.45      0.       ]
 ...
 [0.5       0.        0.425     0.125    ]
 [1.        0.        0.35      0.       ]
 [0.        1.        0.65      0.125    ]]


[[1.        0.        0.3709909 0.       ]
 [1.        1.        0.3875    0.       ]
 [0.        0.        0.4375    0.       ]
 [1.        1.        0.2625    0.       ]
 [1.        1.        0.2       0.       ]
 [1.        1.        0.225     0.       ]
 [0.        1.        0.275     0.125    ]
 [0.        0.        0.5       0.       ]
 [0.5       0.        0.0375    0.125    ]
 [1.        0.        0.25      0.       ]
 [1.        1.        0.3709909 0.       ]
 [0.5       0.        0.3875    0.125    ]
 [0.        1.        0.3875    0.       ]
 [1.        1.        0.375     0.       ]
 [1.        0.        0.225     0.       ]
 [1.        0.        0.375     0.       ]
 [1.        0.        0.3709909 0.       ]
 ...

### 5. Use 10-fold cross-validation on the training set to train different KNN algorithms and choose a suitable K based on accuracy score.


In [50]:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Define list to store results
kacclist = []

# Define range of K values to test (different KNN models)
for k in range(1, 21):  # Testing K from 1 to 20
    knn = KNeighborsClassifier(n_neighbors=k)  # Set the K value
    scores = cross_val_score(knn, X_train_scaled, y_train, cv=10, scoring='accuracy')  # 10-fold CV
    kacclist.append({"K": k, "Mean accuracy": scores.mean()})  # Store K and accuracy

# Convert results into a DataFrame
kaccuracyDF = pd.DataFrame(kacclist)

# Find the best K (highest mean accuracy)
best_row = kaccuracyDF.loc[kaccuracyDF["Mean accuracy"].idxmax()]
best_k = int(best_row["K"])  # the selected row is float-valued, so cast K back to int
best_accuracy = best_row["Mean accuracy"]

# Print results
print(f"Best K: {best_k} with accuracy: {best_accuracy:.4f}")




Best K: 3 with accuracy: 0.8174


In [48]:
kaccuracyDF.describe()

|       | K        | Mean accuracy |
|-------|----------|---------------|
| count | 20.0     | 20.0          |
| mean  | 10.5     | 0.80161       |
| std   | 5.91608  | 0.009038      |
| min   | 1.0      | 0.783725      |
| 25%   | 5.75     | 0.794308      |
| 50%   | 10.5     | 0.804108      |
| 75%   | 15.25    | 0.807996      |
| max   | 20.0     | 0.817449      |


### 6. For the chosen K, train a model on the entire training dataset.
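One possible way to approach this step is sketched below. It is not the notebook's own solution: the data are a synthetic stand-in generated with `make_classification` (an assumption, so that the block runs without the CSV file), and the names `X_demo`, `scaler_demo`, `chosen_k` and `final_knn` are hypothetical. In the notebook itself, the same pattern would presumably use `X_train_scaled`, `y_train` and `int(best_k)`, since `n_neighbors` has to be an integer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the four Titanic features (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits
scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# K as selected by cross-validation (3 in the run shown earlier)
chosen_k = 3

# Train one final model on the whole training split
final_knn = KNeighborsClassifier(n_neighbors=chosen_k)
final_knn.fit(X_tr_scaled, y_tr)

# Predictions on the held-out test split, for use in the evaluation step
test_predictions = final_knn.predict(X_te_scaled)
print(test_predictions[:10])
```

Cross-validation is only used to pick K; the final model is refit on all training rows so that no training data is left unused.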

### 7. Create a confusion matrix for the model trained in step 6 and calculate accuracy, precision, recall, and F1 score on the test dataset.
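As a minimal sketch of how these metrics fit together, the block below uses two small hand-written label arrays (`y_true_demo` and `y_pred_demo` are hypothetical, chosen so the counts are easy to check by hand). In the notebook, they would be replaced by `y_test` and the test-set predictions of the trained model.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions (10 samples, 5 per class)
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true_demo, y_pred_demo)    # (TP + TN) / total
precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN)
f1 = f1_score(y_true_demo, y_pred_demo)                # harmonic mean of the two

print(cm)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Here there are 4 true positives, 4 true negatives, 1 false positive and 1 false negative, so all four metrics come out at 0.80. The already imported `ConfusionMatrixDisplay` can be used to plot `cm` if a figure is preferred over the raw array.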

### 8. OPTIONAL: Create a ROC curve for the test dataset and compute the AUC score.
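A ROC curve needs a continuous score per sample rather than hard 0/1 predictions, so one option is to use the predicted probability of the positive class from `predict_proba`. The sketch below assumes synthetic stand-in data and hypothetical names (`X_demo`, `knn_demo`, `pos_scores`), and picks K = 15 only for illustration; in the notebook, the trained model and `X_test_scaled`, `y_test` would take their place.

```python
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption, not the Titanic file)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Scale with statistics from the training split, then fit KNN
scaler_demo = MinMaxScaler().fit(X_tr)
knn_demo = KNeighborsClassifier(n_neighbors=15)
knn_demo.fit(scaler_demo.transform(X_tr), y_tr)

# Probability of class 1 for each test sample
pos_scores = knn_demo.predict_proba(scaler_demo.transform(X_te))[:, 1]

# False/true positive rates over all thresholds, and the area under the curve
fpr, tpr, thresholds = roc_curve(y_te, pos_scores)
auc_score = roc_auc_score(y_te, pos_scores)

# Plot the curve against the diagonal of a random classifier
fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"KNN (AUC = {auc_score:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="Chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC curve (synthetic demo data)")
ax.legend()
```

Note that KNN probabilities are fractions of the K neighbours, so a small K gives only a few distinct thresholds and a rather coarse curve. Outside a notebook, `plt.show()` would be needed to display the figure.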


### 9. OPTIONAL: Can you use 10-fold cross-validation to get an estimate of the recall instead of accuracy?
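This should be possible by changing the `scoring` argument of `cross_val_score` from `'accuracy'` to `'recall'`. The sketch below is one way to do it, again on synthetic stand-in data with hypothetical names (`X_demo`, `knn_pipeline`, `recall_scores`). Wrapping the scaler and the classifier in a pipeline is a design choice made here, not something the exercise requires: it refits the scaler inside each fold, so the validation fold does not influence the scaling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training data (assumption, not the Titanic file)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Scaling and KNN combined, so both are refit on each training fold
knn_pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))

# Same call as for accuracy, only the scoring string differs
recall_scores = cross_val_score(
    knn_pipeline, X_demo, y_demo, cv=10, scoring="recall"
)

print(f"Mean recall: {recall_scores.mean():.4f} (std {recall_scores.std():.4f})")
```

Other built-in scoring strings such as `'precision'`, `'f1'` and `'roc_auc'` can be passed in the same way.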

## Exercise 2

In this exercise, we will predict the two income classes in the adult dataset (the file "adult.csv" is also on Moodle).

Answer the following questions:
1. Clean the `income` variable such that it has only two values.
2. Select a set of at least two feature variables you want to use to predict `income`. Do the necessary transformations of these variables.
3. Create X and y datasets and split them into training and testing sets.
4. Train a KNN classifier to predict the variable `income` based on the feature variables selected in 2. Try out some different Ks.
5. Train a logistic regression classifier to predict the variable `income` based on the feature variables selected in 2 and compare it to the KNN classifier.
6. Train a decision tree classifier to predict the variable `income` based on the feature variables selected in 2 and compare it to the previous classifiers.
7. Train a random forest classifier to predict the variable `income` based on the feature variables selected in 2 and compare it to the previous classifiers.
8. Train an AdaBoost classifier to predict the variable `income` based on the feature variables selected in 2 and compare it to the previous classifiers.
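
Questions 4 to 8 all follow the same fit-and-compare pattern, which could be organised as a loop over a dictionary of classifiers. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` instead of "adult.csv", the names `X_demo`, `classifiers` and `test_accuracy` are hypothetical, and the hyperparameters are arbitrary starting points rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned, transformed adult features (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# One entry per classifier asked for in the exercise
classifiers = {
    "KNN (K=5)": KNeighborsClassifier(n_neighbors=5),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# Fit every model on the same training split and score it on the same test split
test_accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    test_accuracy[name] = accuracy_score(y_te, clf.predict(X_te))

for name, acc in test_accuracy.items():
    print(f"{name}: {acc:.3f}")
```

Using one shared train-test split keeps the comparison fair. With the real data, KNN and logistic regression would likely benefit from feature scaling, while the tree-based models generally do not need it.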