# Luca Corsetti 0001131095

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import accuracy_score, classification_report

%matplotlib inline

random_state=777

np.random.seed(random_state)

Consider the file provided with the assignment and execute the analysis described below according to the best practices of Machine Learning. You are
allowed to use only the computers of the lab, you are not allowed to use any other device, email or any other messaging tool. You can use only the websites
accessible through the computers of the lab, as listed in the following page.
Cooperative work will be heavily sanctioned

The notebook must operate as follows:
1. Load the file data.csv, explore the data showing size and do some data
exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1pt
2. Deal with null values, imputing the mean for numeric features and the
string “unknown” for categorical features . . . . . . . . . . . . . . . . . . . . . . . . . 2pt
3. train, optimize and test two classifier models of your choice, the
optimization must be done with cross validation, optimize the f1-
score_macro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4pt
4. show the result for both models, including the optimal hyperparameter
values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1pt
5. repeat the experiment using the best model found in the previous steps
and doing feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4pt
6. show the results with the best hyperparameter values . . . . . . . . . . . . 1pt
7. comment the results of the two experiments . . . . . . . . . . . . . . . . . . . . 3pt

### 1. Load the file data.csv, explore the data showing size and do some data exploration

In [2]:
df = pd.read_csv('./data.csv', index_col=0)

df.head()

FileNotFoundError: [Errno 2] No such file or directory: './data.csv'

In [None]:
print(f"the dataset has {df.shape[0]} rows and {df.shape[1]} columns")

In [None]:
df.describe()

In [None]:
sns.pairplot(df)

In [None]:
print(f"there are {df['F00'].isna().sum()} rows with F00 having NaN values out of {df.shape[0]} records in the dataset")

- the data is composed for the majority of numeric values
- the column 'class' denotes the column that will be used to classify the data.
- the column 'F13' seems to represent some sort of category
- the column 'F00' seems to have lots of NaN values, in fact 950 out of 1000 rows in the dataset have missing value in this column. we will likely need to work on the column, either by purging it entirely or by filling it with some values

the dataset has a lot of features, we may therefore encounter not-so-great performances on our trained classifiers. maybe we will need to do some feature selection to improve our models

### 2. Deal with null values, imputing the mean for numeric features and the string “unknown” for categorical features

In [None]:
df.isna().sum()

only the column "F00" seems to have NaN values, we proceed by setting it with the mean of the values present for the same feature column

In [None]:
mean = df['F00'].mean()

df['F00'] = df['F00'].replace(np.nan, mean)

# we could also do this iteratively for each column

# for c in df.columns:
#    mean = df[c].mean()

#    df[c] = df[c].replace(np.nan, mean)

categorical features do not seem to be having missing values, but in order for the classifiers to work, we need to encode them in a numeric format

In [None]:
enc = OrdinalEncoder(dtype=int)

df['F13'] = pd.DataFrame(enc.fit_transform(pd.DataFrame(df['F13'], columns=['F13'])))

### 3. train, optimize and test two classifier models of your choice, the optimization must be done with cross validation, optimize the f1-score_macro

firstly, we need to split the data into train & test before training the classifiers

In [None]:
y = df['class']
X = df.drop(columns=['class'])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)

print(f"the training set has {X_train.shape[0]} samples")
print(f"the training set has {X_test.shape[0]} samples")

for this task we will use the *DecisionTreeClassifier* and the *KNeighborsClassifier* classifiers

In [None]:
tree_model = DecisionTreeClassifier(random_state=random_state)

tree_model.fit(X_train, y_train)

y_pred_tree = tree_model.predict(X_test)
tree_accuracy_score = accuracy_score(y_test, y_pred_tree)

print(f"decision tree trained, max_depth reached={tree_model.tree_.max_depth}, with an accuracy of {tree_accuracy_score*100:.2f}%")

In [None]:
kn_model = KNeighborsClassifier()

kn_model.fit(X_train, y_train)
y_pred_kn = kn_model.predict(X_test)
kn_accuracy_score = accuracy_score(y_test, y_pred_kn)

print(f"decision tree trained, with an accuracy of {kn_accuracy_score*100:.2f}%")

let's try to optimize them using cross validation over the "f1-score_marco"

In [None]:
scoring = 'f1_macro'

In [None]:
tree_params = [{ "max_depth": range(1, tree_model.tree_.max_depth + 1), "criterion": ["gini", "entropy"] }]

tree_cv = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=random_state),
    param_grid=tree_params,
    cv=5,
    n_jobs=2
)

tree_cv.fit(X_train, y_train)

In [None]:
kn_params = [{ "n_neighbors": range(1, 15), "weights": ["uniform", "distance"] }]

kn_cv = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid=kn_params,
    cv=5,
    n_jobs=2
)

kn_cv.fit(X_train, y_train)

### 4. show the result for both models, including the optimal hyperparameter values

let's compute some metrics

In [None]:
y_tree_tuned_pred = tree_cv.best_estimator_.predict(X_test)

tree_tuned_cr = classification_report(y_test, y_tree_tuned_pred, zero_division=np.nan, output_dict=True)

In [None]:
y_kn_tuned_pred = kn_cv.best_estimator_.predict(X_test)

kn_tuned_cr = classification_report(y_test, y_kn_tuned_pred, zero_division=np.nan, output_dict=True)

In [None]:
results = pd.DataFrame([
              ['dt', tree_cv.best_params_, tree_tuned_cr['accuracy'], tree_tuned_cr['0']['recall'], tree_tuned_cr['0']['f1-score']],
              ['kn', kn_cv.best_params_, kn_tuned_cr['accuracy'], kn_tuned_cr['0']['recall'], kn_tuned_cr['0']['f1-score']]
          ], columns=['model', 'best_params', 'accuracy', 'recall', 'f1-score'])

results

KNearestNeighbors seems to be the best model, scoring 82% of accuracy, better recall and better f1-score

### 5. repeat the experiment using the best model found in the previous steps and doing feature selection

let's try to see what features to remove using correlation

In [None]:
df.corr()

we will remove:
- "F06", "F09" because they have perfect correlation with "F03". also, supporting this choice, "F06" and "F09" seems to be highly correlated (> .82) with F07
- "F14" because it has high correlation (> .92) with "F02".

In [None]:
# NOTE: class target can remain the same, hence 'y' is not altered
X_feat = df.drop(columns=['F06', 'F09', 'F14'])

X_feat_train, X_feat_test, _, _ = train_test_split(X_feat, y, random_state=random_state)

let's now repeat the training and see the results on the feature-selected dataset

In [None]:
kn_feat_cv = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid=kn_params,
    cv=5,
    n_jobs=2
)

kn_feat_cv.fit(X_feat_train, y_train)

### 6. show the results with the best hyperparameter values

let's see the results of the trained models with the feature selected dataset

In [None]:
y_kn_feat_pred = kn_feat_cv.best_estimator_.predict(X_feat_test)

kn_feat_cr = classification_report(y_test, y_kn_feat_pred, zero_division=np.nan, output_dict=True)

In [None]:
results_feat = pd.DataFrame([
              ['kn_feat', kn_feat_cv.best_params_, kn_feat_cr['accuracy'], kn_feat_cr['0']['recall'], kn_feat_cr['0']['f1-score']]
          ], columns=['model', 'best_params', 'accuracy', 'recall', 'f1-score'])

results_feat

### 7. comment the results of the two experiments

by performing feature selection over the dataset and using the best previously found estimator (KNearestNeighborsClassifier), we were able to increment the accuracy of the model, scoring a 87.6% of accuracy (previouslu we achieved 82%)