# COSC 5557 - Practical Machine Learning
# Almountassir Bellah Aljazwe
# ML Algorithm Selection

For this exercise, we will have a look at the simplest version of Automated
Machine Learning and choose the best type of machine learning model for a given
task. The end result (i.e. the predictive performance) is not important; how
you get there is.

Your deliverable will be a report, written in a style that would be
suitable for inclusion in an academic paper as the "Experimental
Setup" section or similar. If unsure, check an academic paper of your choice,
for example [this one](https://www.eecs.uwyo.edu/~larsko/papers/pulatov_opening_2022-1.pdf). The
level of detail should be higher than in a typical academic paper though. Your
report should be at most five pages, including references and figures but
excluding appendices. It should have the following structure:
- Introduction: What problem are you solving, how are you going to solve it.
- Dataset Description: Describe the data you're using, e.g. how many features and observations, what are you predicting, any missing values, etc.
- Experimental Setup: What specifically are you doing to solve the problem, i.e. what programming languages and libraries, how are you processing the data, what machine learning algorithms are you considering, how are you evaluating them, etc.
- Results: Description of what you observed, including plots.
- Code: Add the code you've used as a separate file.

Your report must contain enough detail to reproduce what you did without the
code. If in doubt, include more detail.

There is no required format for the report. You could, for example, use an
IPython (Jupyter) notebook.

## Data

We will have a look at the [Wine Quality
dataset](https://archive-beta.ics.uci.edu/dataset/186/wine+quality). Choose the
one that corresponds to your preference in wine. You may also use a dataset of
your choice, for example one that's relevant to your research.

Choose a small number of different machine learning algorithms. For example, you
could use a random forest, support vector machine, linear/logistic regression, a
decision/regression tree learner, and gradient boosting. You will also have to
choose their hyperparameters, e.g. the default values. Determine the best
machine learning algorithm for your dataset, where the "best" algorithm could be
a set of algorithms. Make sure that the way you evaluate this avoids bias and
overfitting. You could use statistical tests to make this determination.
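
One way to act on the suggestion of a statistical test is to score two models on identical folds and compare the paired fold scores. The sketch below is a minimal illustration, not the method used in this notebook: it assumes SciPy is installed, uses synthetic data from `make_classification` as a stand-in for the wine data, and picks the Wilcoxon signed-rank test as one reasonable choice. Fold scores share training data and are therefore not independent, so the p-value is best read as a rough guide.

```python
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption: 11 features, 3 classes).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5, n_classes=3, random_state=0
)

# A seeded splitter guarantees that both models are scored on the same folds.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores_lr = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=folds
)
scores_dt = cross_val_score(
    DecisionTreeClassifier(max_depth=5, random_state=0), X_demo, y_demo, cv=folds
)

# Paired, non-parametric test on the per-fold accuracy differences.
statistic, p_value = wilcoxon(scores_lr, scores_dt)

print(f"Logistic regression mean accuracy: {scores_lr.mean():.3f}")
print(f"Decision tree mean accuracy:       {scores_dt.mean():.3f}")
print(f"Wilcoxon p-value:                  {p_value:.3f}")
```

A small p-value suggests the two models differ on these folds; a large one suggests they are hard to tell apart, in which case the "best" algorithm is a set rather than a single winner.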

## Submission

Add your report and code to this repository. Bonus points if you can set up a
GitHub Action to automatically run the code and generate the report!

## Resources

- Online AutoML Course - ML Evaluation Section
- https://youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&si=OKAA33VoZgunehui
- https://youtu.be/jwqSrGvX8nE?si=5OsShcGSoT1IhpDC

## Importing the "Red Wine Quality" Dataset

https://archive.ics.uci.edu/dataset/186/wine+quality

In [None]:
import pandas as pd
import numpy as np

In [None]:
red_wine_quality_file = "winequality-red.csv"

red_wine_quality_df = pd.read_csv(red_wine_quality_file, sep=';')

In [None]:
print(f'Shape of the dataset : {red_wine_quality_df.shape}')
print('---------------------------------------------------')
red_wine_quality_df.head(5)

Shape of the dataset : (1599, 12)
---------------------------------------------------


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
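
The dataset description asks about missing values, and rows 0 and 4 in the preview above are identical, so a quick check for both is worthwhile. The sketch below is illustrative only: it rebuilds the five displayed rows inline (so it runs without `winequality-red.csv`) and uses the hypothetical name `sample_df`; on the real data the same two calls would be applied to `red_wine_quality_df`.

```python
import io

import pandas as pd

# The five preview rows, reproduced inline in the file's ';'-separated layout.
sample_csv = """fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality
7.4;0.7;0.0;1.9;0.076;11.0;34.0;0.9978;3.51;0.56;9.4;5
7.8;0.88;0.0;2.6;0.098;25.0;67.0;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15.0;54.0;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17.0;60.0;0.998;3.16;0.58;9.8;6
7.4;0.7;0.0;1.9;0.076;11.0;34.0;0.9978;3.51;0.56;9.4;5
"""
sample_df = pd.read_csv(io.StringIO(sample_csv), sep=";")

# Number of missing values in each column.
missing_per_column = sample_df.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = sample_df.duplicated().sum()

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")  # rows 0 and 4 match, so this prints 1
```

Whether duplicates should be dropped is a modelling decision; if they are kept, identical rows can land on both sides of a train/test split.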


## Separating Dataset into "Features" Matrix & "Target" Vector

In [None]:
# X : features matrix
X = red_wine_quality_df.iloc[:, :-1]
X.shape

(1599, 11)

In [None]:
# y : target vector
y = red_wine_quality_df.iloc[:, -1]
y.shape

(1599,)
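
The positional slicing above relies on `quality` being the last column. A possible alternative, sketched here on a tiny hypothetical frame (`demo_df`, built from three of the preview rows and two of the feature columns), is to select the target by name so the split does not depend on column order.

```python
import pandas as pd

# Tiny illustrative frame; the values are taken from the preview rows above.
demo_df = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 9.8],
        "pH": [3.51, 3.20, 3.26],
        "quality": [5, 5, 5],
    }
)

target_column = "quality"

# Everything except the target becomes the features matrix.
X_demo = demo_df.drop(columns=target_column)

# The named column becomes the target vector.
y_demo = demo_df[target_column]

print(X_demo.shape, y_demo.shape)
```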

## Training and Test Splits

In [None]:
from sklearn.model_selection import train_test_split

random_seed = 90

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_seed)
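
The classes in this dataset are strongly imbalanced (the `value_counts()` output later in this notebook reports class sizes of 681, 638, 199, 53, 18 and 10), so a purely random split can leave the rare classes under-represented in the test set. The sketch below shows one possible remedy, the `stratify` argument of `train_test_split`. It is an illustration rather than what this notebook does: the labels `0..5` and the placeholder features are assumptions, and only the class sizes come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class sizes as reported by value_counts() in this notebook.
class_counts = [681, 638, 199, 53, 18, 10]

# Hypothetical labels 0..5 (the mapping to actual quality grades is not assumed).
y_demo = np.repeat(np.arange(len(class_counts)), class_counts)
X_demo = np.zeros((y_demo.size, 1))  # placeholder features, unused by the split

random_seed = 90

# Plain random split: rare classes may be over- or under-sampled by chance.
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=random_seed
)

# Stratified split: each class keeps roughly its overall proportion.
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=random_seed, stratify=y_demo
)

print("Test-set class counts, plain:     ", np.bincount(y_test_plain, minlength=6))
print("Test-set class counts, stratified:", np.bincount(y_test_strat, minlength=6))
```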

## Creating our Machine Learning Algorithms

In [None]:
models = []

### Logistic Regression

#### Import Model

In [None]:
from sklearn.linear_model import LogisticRegression

#### Instantiate Model

In [None]:
logistic_regression_model = LogisticRegression(solver='liblinear')

models.append(logistic_regression_model)

logistic_regression_model.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

#### Fit / Train Model

In [None]:
logistic_regression_model.fit(X_train, y_train)

### K-Nearest Neighbors

#### Import Model

In [None]:
from sklearn.neighbors import KNeighborsClassifier

#### Instantiate Model

In [None]:
k = 5
knn_model = KNeighborsClassifier(n_neighbors=k)

models.append(knn_model)

knn_model.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

#### Fit / Train the Model


In [None]:
knn_model.fit(X_train, y_train)

### Random Forest

#### Import the Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

#### Instantiate the Model

In [None]:
trees = 100
depth = 5

random_forest_model = RandomForestClassifier(n_estimators=trees, max_depth=depth)

models.append(random_forest_model)

random_forest_model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 5,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

#### Fit / Train the Model

In [None]:
random_forest_model.fit(X_train, y_train)
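
The parameter listing above shows `random_state: None`, which means the forest is built differently on every run. If reproducible numbers are wanted for the report, one option is to pass a fixed seed. The sketch below is a minimal check on synthetic data (an assumption, standing in for the wine data) that two forests built with the same seed agree exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the wine data (assumption: 11 features, 3 classes).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3, random_state=0
)


def fit_and_predict(seed):
    """Fit a forest with the notebook's settings plus a fixed seed."""
    forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=seed)
    return forest.fit(X_demo, y_demo).predict_proba(X_demo)


# Same seed twice: the predicted probabilities should be identical.
same_seed_match = np.array_equal(fit_and_predict(90), fit_and_predict(90))
print(f"Identical predictions with the same seed: {same_seed_match}")
```

The same argument exists on `DecisionTreeClassifier`, whose listing in this notebook also shows `random_state: None`.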

### Support Vector Classifier (SVM Classifier)

#### Import the Model

In [None]:
from sklearn.svm import SVC

#### Instantiate the Model

In [None]:
svc_model = SVC()

models.append(svc_model)

svc_model.get_params()

{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

#### Fit / Train the Model

In [None]:
svc_model.fit(X_train, y_train)
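
An RBF support vector machine, like k-nearest neighbors, works on distances between samples, and the features here span very different ranges (in the preview, `density` is close to 1 while `total sulfur dioxide` is in the tens). This notebook fits both models on unscaled features; whether scaling would change their scores is not tested here. The sketch below only illustrates the mechanics on synthetic data (an assumption): wrapping the scaler and the model in a pipeline means the scaler is fitted on the training folds only, which avoids leaking test-fold statistics.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine data (assumption: 11 features, 3 classes).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5, n_classes=3, random_state=0
)

# Exaggerate the range of one feature to mimic columns on different scales.
X_demo[:, 0] *= 100.0

# SVC on the raw features.
raw_scores = cross_val_score(SVC(), X_demo, y_demo, cv=5, scoring="accuracy")

# SVC preceded by standardisation; the scaler is refitted inside every fold.
scaled_model = make_pipeline(StandardScaler(), SVC())
scaled_scores = cross_val_score(scaled_model, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"Unscaled SVC mean accuracy: {raw_scores.mean():.3f}")
print(f"Scaled SVC mean accuracy:   {scaled_scores.mean():.3f}")
```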

### Decision Tree Classifier

#### Import the Model

In [None]:
from sklearn.tree import DecisionTreeClassifier

#### Instantiate the Model

In [None]:
depth = 5

decision_tree_model = DecisionTreeClassifier(max_depth=depth)

models.append(decision_tree_model)

decision_tree_model.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 5,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': None,
 'splitter': 'best'}

#### Fit / Train the Model

In [None]:
decision_tree_model.fit(X_train, y_train)

## Null Accuracy

The null accuracy is the accuracy a model would achieve if it always predicted the most frequent class. We compute it first so that it can serve as a baseline for evaluating our models.

In [None]:
target_column = 'quality'

# Count how many samples fall into each quality class (sorted by frequency).
classification_values = red_wine_quality_df[target_column].value_counts()

classification_values.values

array([681, 638, 199,  53,  18,  10])

In [None]:
most_frequent_class_count = classification_values.max()

most_frequent_class_count

681

In [None]:
number_of_samples = red_wine_quality_df['quality'].count()

number_of_samples

1599

In [None]:
null_accuracy = (most_frequent_class_count / number_of_samples) * 100

round(null_accuracy, 2)

42.59

So the baseline accuracy score for our classification models is $42.59$%: a model is only useful if it predicts better than this.
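
The same baseline can also be expressed as a model, which makes it easy to evaluate alongside the others. The sketch below uses scikit-learn's `DummyClassifier` with the `most_frequent` strategy. The class sizes come from the output above; the labels `0..5` and the placeholder features are assumptions made so the example runs without the CSV file.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class sizes taken from the value_counts() output above.
class_counts = [681, 638, 199, 53, 18, 10]

# Hypothetical labels 0..5 with the reported class sizes.
y_demo = np.repeat(np.arange(len(class_counts)), class_counts)
X_demo = np.zeros((y_demo.size, 1))  # the dummy model ignores the features

# Always predicts the most frequent class seen during fit().
baseline_model = DummyClassifier(strategy="most_frequent")
baseline_model.fit(X_demo, y_demo)

baseline_accuracy = baseline_model.score(X_demo, y_demo)
print(f"Null accuracy: {baseline_accuracy * 100:.2f}%")  # 681 / 1599 -> 42.59%
```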

## Evaluating our Machine Learning Models

Evaluation options:
 - Train-test split
 - K-fold cross-validation with stratified sampling (for a classifier and an integer `cv`, scikit-learn's `cross_val_score()` stratifies the folds by default)
 - Leave-one-out cross-validation
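
The default stratification can be checked directly. The sketch below calls `check_cv`, the helper scikit-learn uses to turn an integer `cv` into a splitter, on a small hypothetical label vector, and also shows that leave-one-out produces one split per sample. The arrays `X_demo` and `y_demo` are placeholders, not the wine data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, check_cv

# Small hypothetical classification target: 6 samples of class 0, 4 of class 1.
y_demo = np.array([0] * 6 + [1] * 4)
X_demo = np.zeros((y_demo.size, 1))

# Resolve cv=5 the way scikit-learn does for a classifier.
resolved_splitter = check_cv(5, y_demo, classifier=True)
is_stratified = isinstance(resolved_splitter, StratifiedKFold)
print(f"cv=5 resolves to StratifiedKFold: {is_stratified}")

# Leave-one-out trains on n - 1 samples and tests on the remaining one, n times.
loo_splits = LeaveOneOut().get_n_splits(X_demo)
print(f"Leave-one-out splits for {y_demo.size} samples: {loo_splits}")
```

Note that the resolved `StratifiedKFold` does not shuffle by default, so the folds follow the row order of the data.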

### Train-Test Split Evaluation

In [None]:
from sklearn.metrics import accuracy_score

evaluations = {}

for model in models:
  # Predict...
  y_predicted = model.predict(X_test)

  # Evaluate...
  score = accuracy_score(y_test, y_predicted)

  # Save accuracy score...
  evaluations[model] = score

In [None]:
for model in evaluations.keys():
  print(f'{model} | Accuracy score of : {evaluations[model] * 100 : .2f}%')
  print('------------------------------------------------------')

LogisticRegression(solver='liblinear') | Accuracy score of :  58.13%
------------------------------------------------------
KNeighborsClassifier() | Accuracy score of :  48.75%
------------------------------------------------------
RandomForestClassifier(max_depth=5) | Accuracy score of :  63.12%
------------------------------------------------------
SVC() | Accuracy score of :  49.38%
------------------------------------------------------
DecisionTreeClassifier(max_depth=5) | Accuracy score of :  60.31%
------------------------------------------------------


The drawback of this method is that the accuracy estimate has high variance: it depends on which rows happen to land in the test set, so a different split yields a different score for every model.
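
The spread can be made visible by repeating the split with different seeds. The sketch below is a minimal illustration on synthetic data (an assumption, standing in for the wine data) with a single decision tree; the number of repetitions, 20, is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption: 11 features, 3 classes).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5, n_classes=3, random_state=0
)

holdout_scores = []
for seed in range(20):
    # Only the split changes between repetitions; the model settings are fixed.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    tree = DecisionTreeClassifier(max_depth=5, random_state=0)
    holdout_scores.append(tree.fit(X_tr, y_tr).score(X_te, y_te))

holdout_scores = np.array(holdout_scores)
print(f"min  accuracy: {holdout_scores.min():.3f}")
print(f"max  accuracy: {holdout_scores.max():.3f}")
print(f"std. deviation: {holdout_scores.std():.3f}")
```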

### K-Fold Cross Validation

#### 5-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

K_FOLDS = 5

evaluations = {}
for model in models:
  # Get one accuracy score per fold as a NumPy array...
  model_accuracy_scores = cross_val_score(
      model,
      X,
      y,
      cv=K_FOLDS,
      scoring='accuracy',
  )

  # Average the scores...
  model_mean_score = model_accuracy_scores.mean()

  # Save accuracy score...
  evaluations[model] = model_mean_score

In [None]:
print('********************************************')
print(f"{K_FOLDS}-Fold Cross Validation - Evaluation Results")
print('********************************************')
print()

for model in evaluations.keys():
  print(f'{model} | Accuracy score of : {evaluations[model] * 100 : .2f}%')
  print('------------------------------------------------------')

********************************************
5-Fold Cross Validation - Evaluation Results
********************************************

LogisticRegression(solver='liblinear') | Accuracy score of :  56.91%
------------------------------------------------------
KNeighborsClassifier() | Accuracy score of :  44.21%
------------------------------------------------------
RandomForestClassifier(max_depth=5) | Accuracy score of :  59.23%
------------------------------------------------------
SVC() | Accuracy score of :  50.22%
------------------------------------------------------
DecisionTreeClassifier(max_depth=5) | Accuracy score of :  54.54%
------------------------------------------------------


#### 10-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

K_FOLDS = 10

evaluations = {}
for model in models:
  # Get one accuracy score per fold as a NumPy array...
  model_accuracy_scores = cross_val_score(
      model,
      X,
      y,
      cv=K_FOLDS,
      scoring='accuracy',
  )

  # Average the scores...
  model_mean_score = model_accuracy_scores.mean()

  # Save accuracy score...
  evaluations[model] = model_mean_score

In [None]:
print('********************************************')
print(f"{K_FOLDS}-Fold Cross Validation - Evaluation Results")
print('********************************************')
print()

for model in evaluations.keys():
  print(f'{model} | Accuracy score of : {evaluations[model] * 100 : .2f}%')
  print('------------------------------------------------------')

********************************************
10-Fold Cross Validation - Evaluation Results
********************************************

LogisticRegression(solver='liblinear') | Accuracy score of :  57.60%
------------------------------------------------------
KNeighborsClassifier() | Accuracy score of :  44.59%
------------------------------------------------------
RandomForestClassifier(max_depth=5) | Accuracy score of :  58.98%
------------------------------------------------------
SVC() | Accuracy score of :  50.03%
------------------------------------------------------
DecisionTreeClassifier(max_depth=5) | Accuracy score of :  56.04%
------------------------------------------------------


#### 15-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

K_FOLDS = 15

evaluations = {}
for model in models:
  # Get one accuracy score per fold as a NumPy array...
  model_accuracy_scores = cross_val_score(
      model,
      X,
      y,
      cv=K_FOLDS,
      scoring='accuracy',
  )

  # Average the scores...
  model_mean_score = model_accuracy_scores.mean()

  # Save accuracy score...
  evaluations[model] = model_mean_score



In [None]:
print('********************************************')
print(f"{K_FOLDS}-Fold Cross Validation - Evaluation Results")
print('********************************************')
print()

for model in evaluations.keys():
  print(f'{model} | Accuracy score of : {evaluations[model] * 100 : .2f}%')
  print('------------------------------------------------------')

********************************************
15-Fold Cross Validation - Evaluation Results
********************************************

LogisticRegression(solver='liblinear') | Accuracy score of :  57.61%
------------------------------------------------------
KNeighborsClassifier() | Accuracy score of :  44.26%
------------------------------------------------------
RandomForestClassifier(max_depth=5) | Accuracy score of :  59.73%
------------------------------------------------------
SVC() | Accuracy score of :  50.39%
------------------------------------------------------
DecisionTreeClassifier(max_depth=5) | Accuracy score of :  53.85%
------------------------------------------------------


### Leave-One-Out Cross Validation

**NOTE : The code cell below takes a long time to finish.**

In [None]:
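from sklearn.model_selection import LeaveOneOut

evaluations = {}
for model in models:
  # Start a fresh score list for each model so that scores from
  # previously evaluated models do not end up in this model's mean.
  model_scores = []

  for training_indices, testing_indices in LeaveOneOut().split(red_wine_quality_df):
    training_dataset = red_wine_quality_df.iloc[training_indices]
    testing_dataset = red_wine_quality_df.iloc[testing_indices]

    # Fold-local names, so the earlier train/test split is not overwritten.
    X_fold_train = training_dataset.iloc[:, :-1]
    y_fold_train = training_dataset.iloc[:, -1]

    X_fold_test = testing_dataset.iloc[:, :-1]
    y_fold_test = testing_dataset.iloc[:, -1]

    # Train...
    model.fit(X_fold_train, y_fold_train)

    # Predict...
    y_predicted = model.predict(X_fold_test)

    # Evaluate (each score is 0 or 1, since the test fold holds one sample)...
    score = accuracy_score(y_fold_test, y_predicted)

    model_scores.append(score)

  # Save the mean accuracy once all folds for this model are done...
  evaluations[model] = np.array(model_scores).mean()

In [None]:
print(f"Accuracy results after Leave-One-Out Cross Validation :")
print()
evaluations

Accuracy results after Leave-One-Out Cross Validation :



{LogisticRegression(solver='liblinear'): 0.5784865540963102,
 KNeighborsClassifier(): 0.5444027517198249,
 RandomForestClassifier(max_depth=5): 0.5682718365645195,
 SVC(): 0.5519074421513446,
 DecisionTreeClassifier(max_depth=5): 0.5552220137585991}

Note: the output above comes from a run in which one `model_scores` list was shared by all models. Each value after the first is therefore a running mean over every model evaluated up to that point, and only the logistic regression figure (57.85%) is a per-model leave-one-out accuracy. With the per-model list used in the loop above, a rerun will report different values for the other four models.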
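
Leave-one-out can also be run through `cross_val_score` by passing a `LeaveOneOut` splitter as `cv`, which removes the manual index handling and scores each model independently. The sketch below is a minimal illustration on a small synthetic dataset (an assumption; 60 samples keep it fast) with a single k-nearest-neighbors model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic stand-in for the wine data (assumption: 11 features, 3 classes).
X_demo, y_demo = make_classification(
    n_samples=60, n_features=11, n_informative=5, n_classes=3, random_state=0
)

# One fit per sample: train on n - 1 rows, test on the single row left out.
loo_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5),
    X_demo,
    y_demo,
    cv=LeaveOneOut(),
    scoring="accuracy",
)

# Each entry is 0.0 or 1.0; their mean is the leave-one-out accuracy.
loo_accuracy = loo_scores.mean()
print(f"Leave-one-out accuracy over {loo_scores.size} fits: {loo_accuracy * 100:.2f}%")
```

Because `cross_val_score` clones the estimator for every fit, the fitted models in `models` are left untouched by this style of evaluation.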