# Lab 3 Classification Models & Model Evaluation

In [13]:
"""
Created on Thu Sep  01 15:40:00 2022

@author: konstantinoskalaitzidis
student_name =   "Konstantinos Kalaitzidis"
student_email =  "kon.kalaitzidis@gmail.com"

Notebook based on notes from Christian Kauth (UniFribourg) and Maria Bampa (DSV)
""" 


## This is the 3rd and 4th lab exercise of the Data Science for Health Informatics (DSHI) module of Stockholm University (2022). 

## Importing Packages and Libraries

In [2]:
# Import the package with an alias
# Numeric analysis
import numpy as np
import pandas as pd

# Data Visualization
import matplotlib.pyplot as plt
from matplotlib import pyplot
import seaborn as sns

from pandas import DataFrame

In [3]:
# Set the seed of the pseudo randomization to guarantee that results are reproducible between executions
RANDOM_SEED = 3456
np.random.seed(RANDOM_SEED)

## Introductory information:
We apply three classification models to our dataset, each trained on two selected features.

In [None]:
path = "/Users/konstantinoskalaitzidis/Desktop/DSHI/Data-Analysis/avocado.csv"
df = pd.read_csv(path)

In [None]:
# save dataframe to pickle file
df.to_pickle('avocado1.pkl')

# Data Preparation

In [6]:
# read pickle file as dataframe
df = pd.read_pickle('avocado1.pkl')
# display the shape of the dataframe
print(df.shape)

(18249, 14)


In [None]:
# What is the total size of the dataset?
df.shape

(18249, 14)

In [None]:
# Which are the variable types?
df.dtypes

Unnamed: 0        int64
Date             object
AveragePrice    float64
Total Volume    float64
4046            float64
4225            float64
4770            float64
Total Bags      float64
Small Bags      float64
Large Bags      float64
XLarge Bags     float64
type             object
year              int64
region           object
dtype: object

In [None]:
# Let's take a look at the first three rows of our dataset
df.head(3)

In [None]:
# Let's take a look at three random rows of our dataset
df.sample(3)

In [None]:
# Let's have a look at the last three rows of our dataset
df.tail(3)

In [None]:
# Let's read some information about our dataframe
df.info()

In [None]:
df.describe()

In [None]:
# Checking for missing data
df.isnull().sum()

In [None]:
# Another way to verify that no null data is present is to plot the null mask as a heatmap:
# with no missing values, the whole plot is a single uniform color.
sns.heatmap(df.isnull())

Great. No missing values. 

## Pre-processing

In [None]:
# Deleting irrelevant features
columns_to_delete = ["Unnamed: 0"]

# axis=1 means that the operation is executed in the columns, axis=0 is in the rows
df = df.drop(columns_to_delete, axis=1)    

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
# Convert Date from object to datetime, then extract the month and day as new features
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

In [None]:
df.head()

In [None]:
# we also need to change the data types of "type" and "region" and make them categorical. 
df["type"] = df["type"].astype("category")
df["region"] = df["region"].astype("category")
df.dtypes

In [None]:
# Describes the numerical variables
df.describe()

## Store processed data

In [None]:
import os

this_path = "."
folder_name = "data"
data_folder = os.path.join(this_path, folder_name)
os.makedirs(data_folder, exist_ok=True) # Create the folder if it does not exist yet
data_folder

In [None]:
file_name = "avocado_processed.csv"
filepath = os.path.join(data_folder, file_name)
filepath

In [None]:
df.to_csv(filepath, index=False)

## Data Visualization

In [None]:
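# What is the representative average price range?
sns.set(font_scale=1.5)
from scipy.stats import norm
fig, ax = plt.subplots(figsize=(12, 6))
# sns.distplot is deprecated: plot a density histogram and overlay a fitted normal curve instead
sns.histplot(df["AveragePrice"], stat="density", ax=ax)
mu, sigma = norm.fit(df["AveragePrice"])
x_grid = np.linspace(df["AveragePrice"].min(), df["AveragePrice"].max(), 200)
ax.plot(x_grid, norm.pdf(x_grid, mu, sigma), color="black")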

In [None]:
# Average price per month for conventional and organic avocados.
plt.figure(figsize=(12,6))
sns.lineplot(x="Month", y="AveragePrice", hue='type', data=df)
plt.show()

In [None]:
# Average price over time
# Select the numeric column before averaging; mean() fails on the categorical columns
byDate = df.groupby('Date')['AveragePrice'].mean()
plt.figure(figsize=(12,6))
byDate.plot()
plt.title('Average Price')

### Correlations

In [None]:
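# Restrict the correlation to numeric columns; "type" and "region" are categorical
correlations = df.corr(method="pearson", numeric_only=True)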
correlations

## Feature Engineering

In [None]:
df.dtypes

In [None]:
# Deleting irrelevant features
columns_to_delete = ["Date", "region"]

# axis=1 means that the operation is executed in the columns, axis=0 is in the rows
df = df.drop(columns_to_delete, axis=1)    

In [None]:
df.head(3)

### One-hot Encoding

For the categorical feature "type" we apply what is called **one-hot encoding**: each possible value of the categorical feature becomes its own column, populated with 1s and 0s.

In [None]:
# This function transforms a categorical feature into one-hot encoding
pd.get_dummies(df["type"])

In [None]:
## Replace the column in the dataframe 
# Add the new columns after one-hot encoding
oh_encoding = pd.get_dummies(df["type"])
df = pd.concat([df, oh_encoding], axis=1)
df

In [None]:
df.dtypes

In [None]:
sns.heatmap(df.isnull())

In [None]:
df["type"] = df["type"].map({
                "conventional":0, 
                "organic":1
            })
df

## Normalization & Standardization

In [None]:
# The dataset has far more than 200 rows, so we work with a random sample of 200 rows.
# Passing the seed makes the sample reproducible between executions.
df = df.sample(200, random_state=RANDOM_SEED)

In [None]:
df.head()

In [None]:
df

# Classification Models

We have chosen two numerical features that are related to the target variable we intend to predict. We train on only two features to simplify the visualization of the decision boundaries.

In [None]:
df_new = pd.DataFrame({
                        "x1": df.iloc[:, 0],
                        "x2": df.iloc[:, 1],
                        "class": df.iloc[:, 9]
                        })
df_new.sample(10)

Filtering the original dataset to create two arrays that contain only feature matrix ( 𝑋 ) and target label array ( 𝑦 ).

In [None]:
X = df_new.iloc[:,:2] # features x1 and x2
y = df_new.iloc[:, 2]

In [None]:
X

In [None]:
X.shape

In [None]:
type(X)

In [None]:
y.shape

In [None]:
type(y)

In [None]:
# Let's create a function to see the dataset easier
def visualize_dataset_with_target_class(X, y, title=""):
    """
    Input:
        X: (pd.DataFrame[N,2]) - The features from the data
        y: (array-like[N]) - The corresponding target class of each sample
    Returns:
        A plot with the dataset and the colors of the respective class
    """
    plt.scatter(x = X.iloc[:,0], y = X.iloc[:,1], c=y, s=30)
    plt.xlabel("Average Price")
    plt.ylabel("Total Volume")
    plt.title(title)
    plt.grid(True)
    return plt.show()



In [None]:
visualize_dataset_with_target_class(X, y, title="Dataset with original class labels")

## Train-test partitioning

Perform a train-test split with a proportion of 80%/20%.
The target variable must be discrete rather than continuous (average price, for example, would not work as a class label).
The features are stored in the X variable and the labels used to train the model in y: the class is the avocado type, x1 is the average price, and x2 is the total volume.

X and y must have the same number of rows.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)
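
With only 200 sampled rows, a purely random split can leave the two avocado types unevenly represented in the test set. The following is a minimal sketch, using a toy label array rather than the lab data, of how the `stratify` argument of `train_test_split` keeps the class proportions equal in both partitions. The names `X_toy` and `y_toy` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical, not the avocado dataset): 100 samples, 30% belong to class 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 70 + [1] * 30)

# stratify=y_toy asks sklearn to preserve the 70/30 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=3456, stratify=y_toy
)

print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test: ", y_te.mean())
```

Whether stratification is needed depends on how balanced the sampled classes are; checking `y.value_counts()` first is a cheap way to decide.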

In [None]:
# Plot the TRAINING set
visualize_dataset_with_target_class(X_train, y_train, title="Training set with original class labels")

In [None]:
# Plot the TEST set 
visualize_dataset_with_target_class(X_test, y_test, title="Test set with original class labels")

## Decision Tree (DT)

We will train three classifiers on our small dataset: a decision tree (DT), a random forest (RF), and a K-nearest neighbors (KNN) classifier.

In [None]:
import time

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Here, we start our counter
t_start = time.time()

# Recall from the previous lab that these are the traditional three steps for most sklearn models.

# 1) Initialize an object containing the algorithm
dt_classifier = DecisionTreeClassifier(max_depth=6)     # Criterion split by default is gini-index, which is ok

# 2) Apply the algorithm using the training data
dt_classifier.fit(X_train, y_train)   ### NOTE here that we also used the labels `y`, in clustering algorithms we only pass `X`

# 3) Generate class labels for new unseen data (predictions)
y_predicted = dt_classifier.predict(X_test)

print("According to the DT classifier, the class labels in the test set are: ", y_predicted)
# At the end we calculate the end time
t_end = time.time()
t_elapsed = t_end - t_start

print(f"The elapsed time of the function is {t_elapsed} seconds")

In [None]:
# Let's create a function to visualize the true labels and predicted labels easier
def visualize_and_compare_classifications(X, real_y, predicted_y, title=""):
    """
    Input:
        X: (pd.DataFrame[N,2]) - The dataset to visualize (only 2 features)
        real_y: (array-like[N]) - Real class labels from X
        predicted_y: (array-like[N]) - Predicted class labels from X
    Returns:
        A plot with two axes showing the real and the predicted labels
    """

    fig, axes = plt.subplots(1, 2, figsize=(12,6))
    
    # First plot contains real class labels
    ax = axes[0]
    ax.scatter(x = X.iloc[:,0], y = X.iloc[:,1], c=real_y, s=80)
    ax.set(xlabel="Average Price",ylabel="Total Volume",title="Real labels")
    ax.grid(True)

    # Second plot contains predicted class labels
    ax = axes[1]
    ax.scatter(x = X.iloc[:,0], y = X.iloc[:,1], c=predicted_y, s=80)
    ax.set(xlabel="Average Price",ylabel="Total Volume",title="Predicted labels")
    ax.grid(True)

    if title != "":
        plt.suptitle(title)

    return plt.show()



Classification results

In [None]:
visualize_and_compare_classifications(X_test, y_test, y_predicted, title="Real and predicted classes for the test")

## Plotting decision trees

In [None]:
# Sklearn also has functions to visualize the Decision Tree
from sklearn.tree import plot_tree
plot_tree(dt_classifier)   # the tree is already fitted; calling fit() again would retrain it

## Random Forest (RF)

In [None]:
from sklearn.ensemble import RandomForestClassifier
t_start = time.time()

rf_classifier = RandomForestClassifier(n_estimators=10, max_depth=3, criterion="entropy")
rf_classifier.fit(X_train, y_train)
y_predicted = rf_classifier.predict(X_test)

print("According to the RF classifier, the class labels in the test set are: ", y_predicted)

# At the end we calculate the end time
t_end = time.time()
t_elapsed = t_end - t_start

print(f"The elapsed time of the function is {t_elapsed} seconds")

Classification results

In [None]:
visualize_and_compare_classifications(X_test, y_test, y_predicted, title="Real and predicted classes for the test")

## K-nearest neighbors (KNN)

In [None]:
# Normalize the data for KNN

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
norm_X_train = scaler.fit_transform(X_train)
norm_X_test  = scaler.transform(X_test)   # reuse the training min/max; do not refit the scaler on test data
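
Min-max scaling maps each feature to the range [0, 1] using the minimum and maximum learned during `fit`. The following is a minimal sketch with toy numbers (hypothetical, not the avocado data) of what happens when a scaler fitted on training data is applied to unseen values. The names `train_col`, `test_col`, and `toy_scaler` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature column (hypothetical values): the training data spans 10..20
train_col = np.array([[10.0], [15.0], [20.0]])
test_col = np.array([[12.0], [25.0]])   # 25 lies outside the training range

toy_scaler = MinMaxScaler()
toy_scaler.fit(train_col)               # learns min=10 and max=20 from the training data only

# transform applies (x - min) / (max - min) with the training min and max
scaled_test = toy_scaler.transform(test_col)
manual_test = (test_col - train_col.min()) / (train_col.max() - train_col.min())

print(scaled_test.ravel())   # values above 1 are possible for unseen data
print(np.allclose(scaled_test, manual_test))
```

Scaling the test set with its own minimum and maximum would place training and test points on different scales, which matters for a distance-based method such as KNN.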

In [None]:
from sklearn.neighbors import KNeighborsClassifier

t_start = time.time()


knn_classifier = KNeighborsClassifier(n_neighbors=1)    # The default distance metric is Euclidean, which is usually ok.
knn_classifier.fit(norm_X_train, y_train)
y_predicted = knn_classifier.predict(norm_X_test)

# At the end we calculate the end time
t_end = time.time()
t_elapsed = t_end - t_start

print(f"The elapsed time of the function is {t_elapsed} seconds")

In [None]:
# Let's create a function to visualize the true labels and predicted labels easier
def visualize_and_compare_classifications_version2(X, real_y, predicted_y, title=""):
    """
    Input:
        X: (np.array[N,2]) - The dataset to visualize (only 2 features)
        real_y: (np.array[N,1]) - Real class labels from X
        predicted_y: (np.array[N,1]) - Predicted class labels from X
    Returns:
        A plot with two axes showing the real and the predicted labels
    """

    fig, axes = plt.subplots(1, 2, figsize=(12,6))
    
    # First plot contains real class labels
    ax = axes[0]
    ax.scatter(x = X[:,0], y = X[:,1], c=real_y, s=80)
    ax.set(xlabel="Average Price",ylabel="Total Volume",title="Real labels")
    ax.grid(True)

    # Second plot contains predicted class labels
    ax = axes[1]
    ax.scatter(x = X[:,0], y = X[:,1], c=predicted_y, s=80)
    ax.set(xlabel="Average Price",ylabel="Total Volume",title="Predicted labels")
    ax.grid(True)

    if title != "":
        plt.suptitle(title)

    return plt.show()




Classification results

In [None]:
visualize_and_compare_classifications_version2(norm_X_test, y_test, y_predicted, title="Real and predicted classes for the test")

## Analysis

We applied three classification models to our dataset, each trained on the two features we selected. For each classifier, we plotted the predicted classes next to the real ones.

There is a noticeable difference in execution time between the KNN classifier and the RF and DT classifiers: KNN is much faster, whereas RF and DT are close to each other. For the features, we chose Average Price and Total Volume and stored them in the variable X that is fed to the classifiers. For y, we used the "type" target variable (conventional or organic), which was mapped to the numerical values 0 and 1 respectively.

In the plots we can clearly see one outlier, which could be removed in future work. Other than that, the Real and Predicted label plots look very similar for the DT, RF, and KNN classifiers, with only minor differences in the labels. The plotted data points are 200 rows sampled from the more than 18,000 rows of the initial dataset.
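
Differences between real and predicted labels are easier to judge when they are counted rather than read off a scatter plot. The following is a minimal sketch, assuming two short hypothetical label arrays rather than the lab's actual predictions, of how the mismatches and the resulting accuracy can be computed. The names `y_true_toy` and `y_pred_toy` are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration only (0 = conventional, 1 = organic)
y_true_toy = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 1, 0, 1, 1])

# Positions where the prediction disagrees with the real label
mismatch_idx = np.flatnonzero(y_true_toy != y_pred_toy)
toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)

print("Misclassified positions:", mismatch_idx)
print("Accuracy:", toy_accuracy)
```

The same two calls could be applied to `y_test` and each classifier's `y_predicted` to compare DT, RF, and KNN numerically.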


# Continuation of Lab 3 - Lab 4

## Organizing data types

In [None]:
# Let's create a new dataframe with 3 or more features (+ y)

# Let's look at the previous dataframe again to select new features to include
df

In [None]:
# Let's add the "Total Bags" feature
df_lab4 = pd.DataFrame({
                        "x1": df.iloc[:, 0],
                        "x2": df.iloc[:, 1],
                        "x3": df.iloc[:, 5], # new feature
                        "y": df.iloc[:, 9] # class
                        })
df_lab4.head()

In [None]:
# Now we have a new dataframe with 3 features and a target variable to run the classification task

In [None]:
df_lab4.dtypes

In [None]:
# The target `y` has dtype category and holds numerical labels.

In [None]:
print("List of values in the target 'y':", df_lab4["y"].unique(), "and dtype =", df_lab4["y"].dtypes)

## Handling Missing Values and Filtering data

In [None]:
# Check if there are missing values
df_lab4.isnull().sum()

In [None]:
df_lab4.describe(include="all")

In [None]:
df_lab4.value_counts()

In [None]:
df_lab4.shape

In [None]:
df_lab4["y"].value_counts().plot.bar()

In [None]:
# Data to be used for the classification task
df_lab4.sample(10)

## Preparing dataframes for Classification Tasks

In [None]:
# Transform the pandas DataFrame into numerical Numpy arrays, 
# so that they can be processed by the packages in sklearn

In [None]:
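# Convert only the categorical target to int.
# Casting the float features to int would truncate AveragePrice (e.g. 1.33 -> 1).
df_lab4["y"] = df_lab4["y"].astype(int)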

In [None]:
df_lab4.dtypes

In [None]:
df_lab4.head()

In [None]:
# Separate the features X and the target variable y
df_X = df_lab4.drop(["y"], axis=1)
df_y = df_lab4["y"]

In [None]:
# Finally, we need to transform the Pandas DataFrame into numerical arrays and store the column names
df_X = df_X.values
df_y = df_y.values

df_colnames = df_lab4.columns.values
print(df_colnames)

In [None]:
# A notebook cell only displays its last expression, so print both types explicitly
print(type(df_X), type(df_y))

### Final feature matrix X

In [None]:
df_X

### Final target array  𝐲

In [None]:
df_y.size

## Evaluation metrics

### Single train-test split

In [None]:
from sklearn.model_selection import train_test_split

# 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size = 0.2, random_state=RANDOM_SEED)

In [None]:
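from sklearn.metrics import confusion_matrix

# Predictions must come from a model trained on this split, so refit a decision tree here
y_predicted = DecisionTreeClassifier(max_depth=6).fit(X_train, y_train).predict(X_test)

cm_results = confusion_matrix(y_test, y_predicted)
cm_results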

In [None]:
# Visual representation of the same Confusion Matrix
from sklearn.metrics import ConfusionMatrixDisplay

cm_display = ConfusionMatrixDisplay(cm_results).plot()
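
The precision, recall, and F1 values used later in this lab can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch, assuming hypothetical labels rather than the lab's predictions, that computes them by hand for the positive class. The macro variants average these per-class values over both classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels for illustration only (1 is treated as the positive class)
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# For binary labels sklearn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The hand-computed values agree with `precision_score`, `recall_score`, and `f1_score` from `sklearn.metrics` when those are called with their default binary averaging.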

In [None]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
CV_indices = kf.split(df_X, df_y)

In [None]:
for train_index, test_index in CV_indices:
    print("TRAIN:", train_index.shape, "TEST:", test_index.shape)

In [None]:
from sklearn.model_selection import cross_validate

## Experimental evaluation of best performance

In [None]:
# We will apply the classifiers on the normalized dataset

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_X_norm = scaler.fit_transform(df_X)

print(f"data_X min. value: {df_X_norm.min()}, max. value: {df_X_norm.max()}")
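
Scaling the full array before cross-validation lets the held-out folds influence the minimum and maximum used for scaling. One possible alternative, sketched here on synthetic data rather than the lab data, wraps the scaler and the classifier in a pipeline so that the scaler is refitted inside every fold. The names `X_toy`, `y_toy`, and `knn_pipeline` are assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the lab data (hypothetical): 200 samples, 3 features
X_toy, y_toy = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=3456
)

# The scaler is refitted on the training part of every fold, so the held-out
# fold never influences the min and max used for scaling
knn_pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=2))

scores = cross_validate(knn_pipeline, X_toy, y_toy, cv=10, scoring=["accuracy", "f1_macro"])
print("Mean accuracy:", scores["test_accuracy"].mean())
print("Mean macro-F1:", scores["test_f1_macro"].mean())
```

For tree-based models the scaling step has little effect, so the pipeline mainly matters for distance-based classifiers such as KNN.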

In [None]:
# Apply my own version

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

MODELS_TO_TEST = {
    "RF_10": RandomForestClassifier(n_estimators=5, max_depth=10),
    "DT" :  DecisionTreeClassifier(max_depth=10),
    "KNN" : KNeighborsClassifier(n_neighbors=2)
}

# Define the number of splits 
NUMBER_OF_SPLITS = 10

# Scoring metrics
SCORING_METRICS = ["accuracy", "precision_macro", "recall_macro", "f1_macro"] # Metrics of interest

# Create empty DataFrame to populate  the name of the classifier and the six values returned from `cross_validate()`
results_evaluation = pd.DataFrame({
                                    "classifier_name":[],
                                    "fit_time": [],
                                    "score_time": [],
                                    "test_accuracy": [],
                                    "test_precision_macro": [],
                                    "test_recall_macro": [],
                                    "test_f1_macro": [],
                                    })

In [None]:
#### ITERATION FOR THE EXPERIMENT

for name, classifier in MODELS_TO_TEST.items():
    
    print(f"Currently training the classifier {name}.")

    # Get the evaluation metrics per fold after cross-validation
    # Note that we are passing the normalized array `df_X_norm` to all classifiers
    scores_cv = cross_validate(classifier, df_X_norm, df_y, cv=NUMBER_OF_SPLITS, scoring=SCORING_METRICS)

    # Start the result row with the classifier name
    dict_this_result = {
                    "classifier_name":[name],
                    }
    # Average each score across folds and add it to the row
    for metric_name, score_per_fold in scores_cv.items():
        dict_this_result[metric_name] = [ score_per_fold.mean() ]

    #### Generate the results to populate the pandas.DataFrame
    this_result = pd.DataFrame(dict_this_result)

    # Append to the main dataframe with the results 
    results_evaluation = pd.concat([results_evaluation, this_result], ignore_index=True)

print("The experimental setup has finished")

In [None]:
results_evaluation

## Visualizations

In [None]:
# Store the file in the indicated path
file_name = "results_timing.csv"
results_evaluation.to_csv(file_name, index=False)

In [None]:
# training time (fit_time) and prediction time (score_time)

### Which was the fastest/slowest algorithm

In [None]:
average_time_classifier = results_evaluation.groupby(by=["classifier_name"]).mean()
average_time_classifier.drop(["test_accuracy", "test_precision_macro", "test_recall_macro", "test_f1_macro"],axis=1,inplace=True) # Delete unnecessary features
average_time_classifier["total_time"] = average_time_classifier["fit_time"] + average_time_classifier["score_time"] # Create new features
average_time_classifier

In [None]:
average_time_classifier.plot.barh()
plt.title("Average time per classifier")
plt.xlabel("Time (s)")
plt.show()

RF is the slowest classifier and DT is the fastest.

### Which classification model seems to perform better in your data? Would you deploy it in a real-life task? Why or why not?  

In terms of execution time, the model that performs best on my data is the Decision Tree (DT) classifier. It is faster than the KNN classifier and almost 70% faster than the Random Forest classifier (which is also the slowest classifier).  
In this case I could deploy the KNN classifier, because the difference compared to DT is small and KNN performs adequately on small datasets such as this one. For much larger datasets I would prefer DT over KNN because of KNN's large computational cost.  
Generally, DT is faster than KNN, but a single DT can be sensitive to outliers. If total time were irrelevant and the aim were the quality of the results, Random Forest would be my personal choice, as it is a more robust and usually more accurate extension of DT that is less prone to overfitting.  
It is also important to note whether the target y is continuous or discrete, as this changes which model is suitable. In our case y is discrete, so a classifier is appropriate, and I would choose the DT classifier for a similar real-life task. 

### Which has the best F1 score?

In [None]:
accuracy_classifier = results_evaluation.groupby(by=["classifier_name"]).mean()
accuracy_classifier.drop(["test_accuracy", "fit_time", "score_time", "test_precision_macro", "test_recall_macro"],axis=1,inplace=True) # Delete unnecessary features
#accuracy_classifier["total_time"] = average_time_classifier["fit_time"] + average_time_classifier["score_time"] # Create new features
accuracy_classifier

In [None]:
accuracy_classifier.plot.barh()
plt.title("Macro-F1 score per classifier")
plt.xlabel("Macro-F1 score")
plt.show()

The F1 score is a good measure for evaluating model performance. The KNN classifier achieved the best macro-F1 score (note that, in terms of execution time, KNN was slower than DT).

## Final Reflection

So far you chose a handful of classifiers with predefined hyperparameters. Describe briefly how do you think you can determine experimentally which hyperparameter performs better for a given classifier? 

Choosing which hyperparameters to tune can directly affect the performance of a classifier and metrics such as accuracy and the F1 score. It is important to look closely at the numerical results in the "results_evaluation" table after each change of the hyperparameters and then assess whether the score has improved. Visualizations of the results table can help us see whether the changes affected the performance of the classifiers positively or negatively.  
If the dataset has many features or dimensions, it is more appropriate to rely on the numerical analysis of the results_evaluation table to identify areas of improvement and hyperparameters worth tuning.
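
Changing hyperparameters by hand and re-reading the results table can also be automated. The following is a minimal sketch, assuming synthetic data and an illustrative grid of candidate values, of an exhaustive search with scikit-learn's `GridSearchCV`, which cross-validates every combination and reports the best one. The grid values and the names `X_toy`, `y_toy`, and `search` are assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data (hypothetical): 200 samples, 3 features
X_toy, y_toy = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=3456
)

# Candidate values are illustrative assumptions, not tuned recommendations
param_grid = {"max_depth": [2, 4, 6, 10], "criterion": ["gini", "entropy"]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=3456),
    param_grid,
    cv=5,                 # 5-fold cross-validation for every combination
    scoring="f1_macro",   # same metric family as in the experiment above
)
search.fit(X_toy, y_toy)

print("Best hyperparameters:", search.best_params_)
print("Best mean macro-F1:  ", search.best_score_)
```

The full table of per-combination scores is available in `search.cv_results_`, which can be loaded into a DataFrame and inspected in the same way as `results_evaluation`.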