<img style="float: right;" src="../../assets/htwlogo.svg">

# Exercise: all about classification evaluation

In the following, we analyze the performance of a predefined machine learning pipeline.

**Author**: _Erik Rodner_<br>

In [None]:
# Import necessary libraries and functions - a lot :)
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split, StratifiedKFold, \
                                    cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix, \
                            ConfusionMatrixDisplay


As data, we will use the good old Labeled Faces in the Wild (LFW) dataset.
Let's download the dataset and display some images.

In [None]:

# Load the LFW dataset (only people with at least 32 images)
data = datasets.fetch_lfw_people(min_faces_per_person=32)

def display_image_dataset(data, nrows=2, ncols=5):
    """ Display some images """
    fig, axes = plt.subplots(nrows, ncols, figsize=(10, 5))
    for ax, image, label in zip(axes.ravel(), data.images, data.target):
        ax.set_axis_off()
        ax.imshow(image, cmap=plt.cm.gray, interpolation='nearest')
        ax.set_title(f'Target: {label}')

# Show the first few images together with their class indices
display_image_dataset(data)


In [None]:
# Standard machine learning pipelines that are not specifically
# designed for images simply expect vectors as inputs, so let's
# flatten the image data for scikit-learn
n_samples = len(data.images)
X = data.images.reshape((n_samples, -1))
y = data.target
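
The flattening step above can be illustrated on a tiny array. This is a minimal sketch, assuming a hypothetical stack of four 3x2 "images" (`images_toy`) instead of the real LFW data; it only shows that `reshape((n_samples, -1))` turns each image into one row vector and that the operation is reversible.

```python
import numpy as np

# Hypothetical stack of 4 gray-scale images with 3x2 pixels each
# (a stand-in for data.images, not the real LFW data)
images_toy = np.arange(24, dtype=float).reshape(4, 3, 2)

# Same flattening as above: one row per image, one column per pixel
n_samples_toy = len(images_toy)
X_toy = images_toy.reshape((n_samples_toy, -1))
print(X_toy.shape)

# Reshaping a row back to the image shape recovers the original image
restored = X_toy[0].reshape(images_toy.shape[1:])
print(np.array_equal(restored, images_toy[0]))
```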

## Task 1: Exploration - Analyze the class distribution of the dataset

How many classes are there? Is the dataset *imbalanced*? What is the most frequent class?
Use some code to answer these questions.
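
As a starting point, here is a minimal sketch of how class counts can be computed with NumPy. It uses a hypothetical toy label vector `y_toy` and made-up names in `target_names_toy`, not the LFW labels; the ratio between the largest and smallest class is just one possible way to quantify imbalance.

```python
import numpy as np

# Hypothetical toy labels and class names (stand-ins for
# data.target and data.target_names)
y_toy = np.array([0, 2, 1, 2, 2, 0, 2, 1, 2, 2])
target_names_toy = np.array(["Person A", "Person B", "Person C"])

# Count how often each class occurs
classes, counts = np.unique(y_toy, return_counts=True)
n_classes = len(classes)

# Most frequent class and a simple imbalance indicator
most_frequent = classes[np.argmax(counts)]
imbalance_ratio = counts.max() / counts.min()

for c, n in zip(classes, counts):
    print(f"{target_names_toy[c]}: {n} samples ({n / len(y_toy):.0%})")
print(f"{n_classes} classes, most frequent: {target_names_toy[most_frequent]}")
print(f"largest / smallest class size: {imbalance_ratio:.1f}")
```

The same calls should work on `y` and `data.target_names` from the cells above; a bar plot of `counts` with `plt.bar` makes the distribution easier to read.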

## Task 2: Analyze the multi-class classifier

Let's first define a classifier pipeline. Each aspect of the definition will be discussed in later lectures:

In [None]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Standardize features
    ('pca', PCA(n_components=32)),  # Adjust n_components to your needs
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

Let's dive into some tasks and questions that need to be solved by your code:
1. Analyze the classifier's performance on 20% hold-out test data!
2. What is its cross-validation performance?
3. Is accuracy a good measure to use?
4. Draw a confusion matrix!
5. Bonus: Optimize the pipeline if you can :)
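
The evaluation steps above could be sketched as follows. This is not a reference solution: to avoid the LFW download, the sketch uses scikit-learn's bundled digits dataset as a stand-in, and the names `X_demo`, `y_demo` and `demo_pipeline` as well as the choices of stratified splitting, 5 folds and `random_state=0` are assumptions made for illustration. Balanced accuracy is included as one possible alternative to plain accuracy on imbalanced data.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix)
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): bundled digits instead of LFW
X_demo, y_demo = load_digits(return_X_y=True)

# Same structure as the pipeline defined above
demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=32)),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

# 1. Hold-out evaluation with 20% test data, stratified by class
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)
demo_pipeline.fit(X_train, y_train)
y_pred = demo_pipeline.predict(X_test)
acc = accuracy_score(y_test, y_pred)

# 2. Cross-validation: the pipeline is refit on each training fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(demo_pipeline, X_demo, y_demo,
                            cv=cv, scoring='accuracy')

# 3. Balanced accuracy averages the per-class recall, so frequent
#    classes do not dominate the score
bal_acc = balanced_accuracy_score(y_test, y_pred)

# 4. Confusion matrix: rows are true classes, columns are predictions
cm = confusion_matrix(y_test, y_pred)

print(f"hold-out accuracy:          {acc:.3f}")
print(f"hold-out balanced accuracy: {bal_acc:.3f}")
print(f"5-fold CV accuracy:         {cv_scores.mean():.3f} "
      f"+/- {cv_scores.std():.3f}")
```

To draw the matrix, `ConfusionMatrixDisplay(confusion_matrix=cm).plot()` followed by `plt.show()` should work with the imports from the first cell. For the actual exercise, replace the stand-in data with `X` and `y` and the stand-in pipeline with `pipeline`.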