# Model evaluation

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

We start by creating a dataframe from the data at the URL below. The column names were taken from the file describing the dataset. We can take a look at the data by displaying the first five rows with the `head()` method.

In [2]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
df = pd.read_csv(url, delimiter=',', names=[
    'Class',
    'Alcohol',
    'Malic acid',
    'Ash',
    'Alcalanity of ash',
    'Magnesium',
    'Total phenols',
    'Flavanoids',
    'Nonflavanoid phenols',
    'Proanthocyanins',
    'Color intensity',
    'Hue',
    'OD280/OD315 of diluted wines',
    'Proline'])

df.head()

| | Class | Alcohol | Malic acid | Ash | Alcalanity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


Next, we define the predictor and target variables by selecting the appropriate columns.

In [3]:
X = df.iloc[:, 1:].values
y = df.iloc[:, 0].values

To simplify the task and modularize the code, we will define three functions. The first function fits a classifier (passed as an argument) to the training dataset and predicts the target variable from the test features.

In [4]:
def predict_clf(classifier, X_train, X_test, y_train):
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    return y_pred

The second function prints classification metrics comparing the predicted output with the test target variable. We chose accuracy, precision and recall to assess the performance of the classifier. Accuracy is the most common evaluation metric: it is simply the number of correct predictions as a fraction of all predictions made:

\begin{equation}
\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}.
\end{equation}

Precision measures the proportion of positive identifications that were actually correct:

\begin{equation}
\text{Precision} = \frac{TP}{TP + FP}.
\end{equation}

Recall, on the other hand, represents the proportion of actual positives that were identified correctly:

\begin{equation}
\text{Recall} = \frac{TP}{TP + FN},
\end{equation}

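where $TP$ is the number of true positives, $TN$ of true negatives, $FP$ of false positives and $FN$ of false negatives. Because the wine dataset has three classes, precision and recall are computed for each class in turn (treating that class as the positive one) and then averaged, which is what `average='macro'` does in the code below.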
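To make the formulas concrete for more than two classes, here is a minimal sketch that counts $TP$, $FP$ and $FN$ per class by hand and then takes the unweighted (macro) mean. The labels are made up for illustration and are not taken from the wine data; all variable names are our own.

```python
import numpy as np

# Hypothetical labels for illustration only (not from the wine dataset).
y_true = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
y_pred = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 1])

precisions, recalls = [], []
for c in np.unique(y_true):
    # Treat class c as "positive" and every other class as "negative".
    tp = np.sum((y_pred == c) & (y_true == c))
    fp = np.sum((y_pred == c) & (y_true != c))
    fn = np.sum((y_pred != c) & (y_true == c))
    # Every class is predicted at least once here, so tp + fp > 0.
    precisions.append(tp / (tp + fp))
    recalls.append(tp / (tp + fn))

accuracy = np.mean(y_true == y_pred)   # fraction of correct predictions
macro_precision = np.mean(precisions)  # unweighted mean over classes
macro_recall = np.mean(recalls)

print(f"Accuracy: {accuracy:.3f}")
print(f"Macro precision: {macro_precision:.3f}")
print(f"Macro recall: {macro_recall:.3f}")
```

In this toy example 7 of 10 predictions are correct, so the accuracy is 0.7, while the macro averages weight each class equally regardless of how many samples it has.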

In [5]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

def print_metrics(name, classifier, X_train, X_test, y_train, y_test):
    y_pred = predict_clf(classifier, X_train, X_test, y_train)
        
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='macro')
    recall = recall_score(y_test, y_pred, average='macro')
    
    print('Classifier: {}'.format(name))
    print('Accuracy: {}%'.format(str(round(accuracy*100, 2))))
    print('Precision: {}%'.format(str(round(precision*100, 2))))
    print('Recall: {}%'.format(str(round(recall*100, 2))))
    print('------------------')

Finally, the third function prints the metrics of every classifier in a list passed as an argument, so that they can be compared.

In [6]:
def compare_models(names, classifiers, X_train, X_test, y_train, y_test):
    for name, clf in zip(names, classifiers):
        print_metrics(name, clf, X_train, X_test, y_train, y_test)

The last step before evaluating the models is to create lists with the names of the classifiers and the models themselves.

In [7]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

names = ['LDA', 'QDA', 'NB']
classifiers = [LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis(), GaussianNB()]

## All the features – training dataset as test dataset

In [8]:
compare_models(names, classifiers, X, X, y, y)

Classifier: LDA
Accuracy: 100.0%
Precision: 100.0%
Recall: 100.0%
------------------
Classifier: QDA
Accuracy: 99.44%
Precision: 99.44%
Recall: 99.53%
------------------
Classifier: NB
Accuracy: 98.88%
Precision: 98.85%
Recall: 98.97%
------------------


Although the classifiers achieve almost perfect accuracy, this is not a sound method of model evaluation. Because the classifiers are tested on the same data they were trained on, the scores are optimistically biased and cannot reveal overfitting. An overfitted model does not generalize well and may have noticeably lower accuracy when tested on an independent dataset.
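One way to see the difference between a training-set score and a held-out score is to compute both for the same fitted model. The sketch below is an illustration under assumptions rather than part of the analysis above: it uses the copy of the wine data bundled with scikit-learn (`load_wine`) instead of the download, and the 25% test size, stratification and `random_state=0` are arbitrary choices of ours.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# load_wine ships with scikit-learn, so no download is needed.
X_demo, y_demo = load_wine(return_X_y=True)

# Hold out a quarter of the samples; the seed is fixed only for repeatability.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

clf = QuadraticDiscriminantAnalysis().fit(X_tr, y_tr)
train_acc = clf.score(X_tr, y_tr)  # score on data the model has seen
test_acc = clf.score(X_te, y_te)   # score on data the model has not seen

print(f"Training accuracy: {train_acc:.3f}")
print(f"Held-out accuracy: {test_acc:.3f}")
```

Only the held-out figure says anything about generalization. On an easy dataset like this one the two numbers may turn out close, but that can only be known by measuring both.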

## First 2/5/10 features – training dataset as test dataset

In [9]:
X_two_features = df.iloc[:, 1:3].values
compare_models(names, classifiers, X_two_features, X_two_features, y, y)

Classifier: LDA
Accuracy: 80.9%
Precision: 79.87%
Recall: 79.67%
------------------
Classifier: QDA
Accuracy: 81.46%
Precision: 80.43%
Recall: 80.01%
------------------
Classifier: NB
Accuracy: 80.9%
Precision: 79.76%
Recall: 79.45%
------------------


In [10]:
X_five_features = df.iloc[:, 1:6].values
compare_models(names, classifiers, X_five_features, X_five_features, y, y)

Classifier: LDA
Accuracy: 87.64%
Precision: 87.13%
Recall: 86.72%
------------------
Classifier: QDA
Accuracy: 88.76%
Precision: 88.31%
Recall: 88.24%
------------------
Classifier: NB
Accuracy: 85.39%
Precision: 84.96%
Recall: 84.88%
------------------


In [11]:
X_ten_features = df.iloc[:, 1:11].values
compare_models(names, classifiers, X_ten_features, X_ten_features, y, y)

Classifier: LDA
Accuracy: 98.88%
Precision: 98.97%
Recall: 98.84%
------------------
Classifier: QDA
Accuracy: 99.44%
Precision: 99.44%
Recall: 99.53%
------------------
Classifier: NB
Accuracy: 96.07%
Precision: 96.3%
Recall: 96.2%
------------------


We can see that accuracy drops as the number of features decreases. Simpler models with fewer features are generally less prone to overfitting, although scores computed on the training data cannot confirm this. Ideally, we would apply feature selection to keep the features that are most informative about the target variable, for example with regularization techniques or with feature importances from a tree-based algorithm such as XGBoost.
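As a rough illustration of feature selection, the sketch below keeps the five features with the highest ANOVA F-score using scikit-learn's `SelectKBest`. This is a simpler univariate filter, swapped in for the regularization and XGBoost approaches mentioned above. The bundled `load_wine` copy of the data, `k=5` and 5-fold cross validation are all assumptions made for the example.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

data = load_wine()
X_demo, y_demo = data.data, data.target

# k=5 is an arbitrary choice for illustration.
selector = SelectKBest(score_func=f_classif, k=5)

# Putting the selector inside a pipeline means the features are chosen
# on the training folds only, so the held-out fold does not leak in.
model = make_pipeline(selector, LinearDiscriminantAnalysis())
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(f"Mean accuracy with 5 selected features: {scores.mean():.3f}")

# Fit the selector on all data just to inspect which features it keeps.
selector.fit(X_demo, y_demo)
selected = [name for name, keep
            in zip(data.feature_names, selector.get_support()) if keep]
print("Selected features:", selected)
```

Unlike taking the first two, five or ten columns, this picks features by how well they separate the classes rather than by their position in the file.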

## First 2 features – splitting the data into training, validation and test datasets

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_two_features, y, test_size=0.25)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.33)

compare_models(names, classifiers, X_train, X_val, y_train, y_val)

Classifier: LDA
Accuracy: 77.27%
Precision: 79.45%
Recall: 76.49%
------------------
Classifier: QDA
Accuracy: 79.55%
Precision: 79.44%
Recall: 78.87%
------------------
Classifier: NB
Accuracy: 79.55%
Precision: 79.44%
Recall: 78.87%
------------------


Splitting the data into training, validation and test datasets gives a more honest estimate of generalization, because the validation set can be used to tune the model's hyperparameters while the test set stays untouched. However, this method has one major drawback: the measured performance varies greatly with the way the data is split. Changing which observations happen to land in the validation or test dataset can significantly change the measured accuracy. That's where cross validation comes into play.
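The dependence on the split can be checked by repeating the split with different random seeds and looking at the spread of the scores. The following sketch assumes the bundled `load_wine` copy of the data (whose first two columns are alcohol and malic acid), a single 75/25 split and 20 seeds. These choices are ours and differ from the three-way split used above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_wine(return_X_y=True)
X_demo = X_demo[:, :2]  # first two features only

accuracies = []
for seed in range(20):  # 20 different random splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=seed)
    clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
    accuracies.append(clf.score(X_te, y_te))

accuracies = np.array(accuracies)
print(f"Min accuracy:  {accuracies.min():.3f}")
print(f"Max accuracy:  {accuracies.max():.3f}")
print(f"Std deviation: {accuracies.std():.3f}")
```

The gap between the minimum and maximum shows how much a single reported score could owe to the luck of the split.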

## First 2 features – cross validation

Cross validation is a technique widely used to assess the effectiveness of a model when there is a limited amount of data. It splits the dataset into $k$ folds. One of the folds becomes the test dataset and the remaining ones are used for training. The process is repeated $k$ times, with a different fold used for testing each time. At the end we average the metrics to assess the performance of the classifier.
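The fold mechanics can be sketched with plain NumPy. The helper `k_fold_indices` below is a hypothetical function written for illustration, not a scikit-learn API. It shuffles the sample indices, cuts them into $k$ nearly equal parts, and yields each part once as the test set.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold splitting."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)  # k nearly equal parts
    for i in range(k):
        test_idx = folds[i]
        # All remaining folds together form the training set.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# 178 is the number of rows in the wine dataset.
splits = list(k_fold_indices(n_samples=178, k=10))
for i, (train_idx, test_idx) in enumerate(splits):
    print(f"Fold {i}: {len(train_idx)} training / {len(test_idx)} test samples")
```

Every observation lands in a test fold exactly once, so each sample contributes to the averaged score.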

In [13]:
from sklearn.model_selection import KFold, cross_validate

lda = classifiers[0]
kfold = KFold(n_splits=10, shuffle=True)
scoring = {'accuracy': 'accuracy',
           'recall': 'recall_macro',
           'precision': 'precision_macro'}
cross_val_scores = cross_validate(lda, X_two_features, y, cv=kfold, scoring=scoring)
cross_val_scores = {key: score.mean() for key, score in cross_val_scores.items()}

print('Accuracy: {}%'.format(str(round(cross_val_scores['test_accuracy']*100, 2))))
print('Precision: {}%'.format(str(round(cross_val_scores['test_precision']*100, 2))))
print('Recall: {}%'.format(str(round(cross_val_scores['test_recall']*100, 2))))

Accuracy: 78.07%
Precision: 80.12%
Recall: 78.89%


The metrics of LDA are slightly higher than with the single split above (78.07% versus 77.27% accuracy), but more importantly they are averaged over 10 folds and therefore depend less on one particular split. We might have increased the accuracy a little by using a nonlinear model instead of LDA. For example, QDA outperforms LDA in most of the experiments above.
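To compare LDA with the other classifiers on equal terms, the same folds can be reused for every model. The sketch below is one possible way to do this and not a result from the analysis above. It assumes the bundled `load_wine` copy of the data, the first two features, and a `StratifiedKFold` with a fixed seed so that each model sees identical splits.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X_demo, y_demo = load_wine(return_X_y=True)
X_demo = X_demo[:, :2]  # first two features only

# A fixed random_state makes the folds identical for every model.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    'LDA': LinearDiscriminantAnalysis(),
    'QDA': QuadraticDiscriminantAnalysis(),
    'NB': GaussianNB(),
}

mean_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the standard deviation next to the mean helps judge whether a difference between two models is larger than the fold-to-fold noise.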