## TRNU.CSN: Supervised Classification TP2 2025

Gilles Vanwormhoudt, Christelle Garnier, Vincent Itier, Juan-Manuel Miramont

### Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#TODO ADD other needed libraries

## Data Set Information

This is perhaps the best known database to be found in the pattern recognition literature. Fisher’s
paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for
example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of
iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable
from each other. The predicted attribute (the output) is the class of iris plant.

### Features Information

1. sepal length in cm
1. sepal width in cm
1. petal length in cm
1. petal width in cm
1. classes: Iris Setosa, Iris Versicolour, Iris Virginica

## Exercises

The purpose of this practical assignment is to implement the Perceptron, which is a basic supervised learning algorithm for performing binary data classification.

This practical assignment consists of four parts: __implementation__, __validation__, __evaluation__, and __comparison__. The __comparison__ part is homework to be completed.

### 1. Implementation

#### Reminders
The Perceptron algorithm is a supervised classification algorithm invented by Rosenblatt for binary classification problems (two classes). The principle of this algorithm is to find a linear classifier on a training set $S= \{(x^{j}, y^{j})\}$ where the inputs $x$ are vectors of dimension $d$ and the outputs $y$ are labels in $\{-1, +1\}$. As explained in class, the goal is to obtain as few misclassified examples as possible.<br>
Learning is performed using an iterative algorithm in which an example is chosen randomly from $S$ and the coefficients of the hyperplane (vector $w$ of dimension $d$ and real $w_0$) are updated if the classifier gives a wrong prediction for the chosen example. The update rule moves the hyperplane toward the misclassified example: the example vector, multiplied by its label $y$ and by the learning rate $\eta$, is added to the current coefficients, that is $w \leftarrow w + \eta\, y\, x$ and $w_0 \leftarrow w_0 + \eta\, y$.
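
As an illustration of a single update step, here is a minimal sketch with `numpy`. The numeric values are hypothetical, and the convention that a score of exactly 0 is predicted as $-1$ is an assumption; the course may use a different tie rule.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate one update step.
w = np.array([0.0, 0.0])   # weight vector w (dimension d = 2)
w0 = 0.0                   # bias term w_0
learning_rate = 0.1

x = np.array([1.0, 1.0])   # one training example
y = 1                      # its true label, in {-1, +1}

# Predict with the current hyperplane (assumption: a score of 0 gives -1).
prediction = 1 if np.dot(w, x) + w0 > 0 else -1

# Update the coefficients only when the example is misclassified.
if prediction != y:
    w = w + learning_rate * y * x
    w0 = w0 + learning_rate * y

print(w, w0)
```

Multiplying by the label $y$ means a misclassified positive example pulls the hyperplane toward it, while a misclassified negative example pushes it away.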

#### 1.1 Algorithm

Implementing this algorithm requires manipulating vectors and performing simple vector calculations (addition, subtraction, and multiplication of vectors). Using the `numpy` library is therefore recommended.

To implement the algorithm in Python and make it easily reusable, we can draw inspiration from the Scikit-Learn library that we learned to use in the previous practical session. In this library, supervised classifiers are implemented as classes equipped with the special method `__init__`, which allows parameters to be provided when the classifier object is created, and two main methods: the `fit` method, which performs the learning, and the `predict` method, which performs the prediction based on the learned model.

1. Write the code for a __Perceptron__ class in a file that you can call `perceptron.py`.<br>
Since the algorithm must be applied to different situations, several parameters must be taken into account, namely:
    - dimension: the size of the input vectors,
    - max_iter: the maximum number of iterations of the algorithm,
    - learning_rate: the learning rate of the algorithm.
    
As in Scikit-Learn, we recommend providing these parameters when creating the classifier object. You must therefore specify these parameters in the class's special `__init__` method.

2. You must also write the code for the following two methods:
    - The method `fit(X,y)` performs learning based on the training data $X$ and $y$. The two parameters $X$ and $y$ are lists of the same size: $X$ is a list of *numpy* vectors, $y$ is a list of integer values equal to $-1$ or $+1$. This method does not return a result; it updates the coefficients of the hyperplane (vector $w$ and real number $w_0$).
    - The `predict(x)` method predicts the class of a new input $x$. The parameter $x$ is a *numpy* vector with the same dimension as the training vectors in $X$. This method returns the predicted value ($-1$ or $+1$). It is also possible to write a version of this method that iterates over a list of inputs and returns the list of corresponding predictions.

It is of course possible to add other methods to this class for factorization or structuring purposes in the algorithm. 
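
One possible shape for such a class is sketched below. It is only a starting point under stated assumptions, not a reference solution: the `seed` parameter and the internal names `w`, `w0`, and `rng` are hypothetical additions, and a score of exactly 0 is arbitrarily predicted as $-1$.

```python
import numpy as np


class Perceptron:
    """Minimal perceptron sketch; labels are expected in {-1, +1}."""

    def __init__(self, dimension, max_iter=100, learning_rate=0.1, seed=None):
        self.dimension = dimension
        self.max_iter = max_iter
        self.learning_rate = learning_rate
        self.w = np.zeros(dimension)   # hyperplane weights
        self.w0 = 0.0                  # hyperplane bias
        # `seed` is a hypothetical extra parameter, added for reproducibility.
        self.rng = np.random.default_rng(seed)

    def predict(self, x):
        """Return -1 or +1 for a single input vector x."""
        return 1 if np.dot(self.w, x) + self.w0 > 0 else -1

    def fit(self, X, y):
        """Update w and w0 in place; returns nothing."""
        for _ in range(self.max_iter):
            j = self.rng.integers(len(X))     # pick one example at random
            if self.predict(X[j]) != y[j]:    # update only on a mistake
                self.w = self.w + self.learning_rate * y[j] * X[j]
                self.w0 = self.w0 + self.learning_rate * y[j]
```

Because examples are drawn at random, `max_iter` must be large enough for every misclassified example to be visited; stopping early once no example is misclassified is a natural refinement left out here.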

3. To fully understand how objects of this class can be used, here is an example of a Python session that creates an instance and uses the `fit` and `predict` methods for training and evaluation.

In [None]:
from perceptron import Perceptron

# create the training data
X_train = []
X_train.append(np.array([1, 1]))
X_train.append(np.array([1, 0]))
X_train.append(np.array([0, 1]))
X_train.append(np.array([0, 0]))

y_train = np.array([1, -1, -1, -1])

# create and train the classifier
perceptron = Perceptron(dimension=2, max_iter=100, learning_rate=0.1)
perceptron.fit(X_train, y_train)

# prediction
new_x = np.array([1, 1])
print(perceptron.predict(new_x))

new_x = np.array([0, 1])
print(perceptron.predict(new_x))

### 2. Algorithm Validation

To validate your algorithm, you can use the iris dataset. In this case, since the classifier is binary, we will limit ourselves to two species: Setosa and Versicolor. To visualize the data, we will limit ourselves to two characteristics: sepal length and petal length.

1. Load the dataset as a pandas Dataframe.
1. Extract the content that is useful for the validation.
1. Display the new dataset. By examining the plot, you will see that the selected data are linearly separable.
1. Train the perceptron. To do this, you must separate the data into a training set and a test set, using the `train_test_split` function from *Scikit-Learn*.
1. Test the perceptron on the test set and compare it with true labels, using `classification_report`.
1. Construct a new visualization of the data with the line separating the two classes, whose parameters are to be determined based on the perceptron coefficients.
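
The steps above can be sketched as follows. Several choices here are assumptions: the data is loaded with `load_iris` from Scikit-Learn (the assignment does not say where the dataset comes from), Scikit-Learn's own `Perceptron` stands in for your class so that the block runs on its own, and the split ratio and random seeds are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron  # stand-in for your own class
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Load iris as a DataFrame and keep Setosa (target 0) and Versicolor (target 1).
df = load_iris(as_frame=True).frame
df = df[df["target"].isin([0, 1])]

# Keep two features and recode the labels to -1 / +1.
X = df[["sepal length (cm)", "petal length (cm)"]].to_numpy()
y = np.where(df["target"].to_numpy() == 0, -1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = Perceptron(max_iter=1000, random_state=0)
clf.fit(X_train, y_train)
report = classification_report(y_test, clf.predict(X_test))
print(report)

# Decision boundary: w1 * x1 + w2 * x2 + w0 = 0, solved for x2 (requires w2 != 0).
w1, w2 = clf.coef_[0]
w0 = clf.intercept_[0]
x1 = np.array([X[:, 0].min(), X[:, 0].max()])
x2 = -(w1 * x1 + w0) / w2
print(x1, x2)
```

The two points `(x1[0], x2[0])` and `(x1[1], x2[1])` can then be passed to `plt.plot(x1, x2)` on top of a scatter plot of the data. With your own class, the coefficients come from the attributes that store $w$ and $w_0$ instead of `coef_` and `intercept_`.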

### 3. Evaluation

1. Using cross-validation, assess the performance of your model with several metrics from `sklearn.metrics` such as `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, ...
1. What performance can we expect on the whole dataset?
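
A manual cross-validation loop is one way to do this, because it only requires an object with `fit` and `predict`. The sketch below makes several assumptions: it uses Scikit-Learn's `Perceptron` as a stand-in for your class, the same two species and two features as in the validation part, and five stratified folds with an arbitrary seed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron  # stand-in for your own class
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold

# Setosa vs. Versicolor, sepal length and petal length, labels in {-1, +1}.
X_all, target = load_iris(return_X_y=True)
mask = target < 2
X = X_all[mask][:, [0, 2]]
y = np.where(target[mask] == 0, -1, 1)

metrics = {
    "accuracy": accuracy_score,
    "precision": precision_score,
    "recall": recall_score,
    "f1": f1_score,
}
scores = {name: [] for name in metrics}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    clf = Perceptron(max_iter=1000, random_state=0)  # fresh model for each fold
    clf.fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    for name, metric in metrics.items():
        scores[name].append(metric(y[test_idx], y_pred))

for name, values in scores.items():
    print(f"{name}: {np.mean(values):.3f} +/- {np.std(values):.3f}")
```

Reporting the mean and the spread over folds gives a more honest picture than a single train/test split. Note that `precision_score`, `recall_score`, and `f1_score` treat the label `1` as the positive class by default.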

### 4. Comparison with other algorithms

In general, it is not easy to find the classification algorithm that will give the best results with your dataset. It is often necessary to try several algorithms and several hyperparameter configurations and compare the accuracy of the corresponding models. The objective of this last part is to compare some algorithms and fine tune parameters.

1. Compare the performance of your implementation with the scikit-learn implementation of the perceptron, and with other models.
1. Using `GridSearchCV` from `sklearn.model_selection`, find the best parameters for some models. Display the available results with `cv_results_.keys()` (`cv_results_` is a dictionary attribute of the fitted search object). What does `GridSearchCV` do to find the best parameters?
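
As a starting point, here is a minimal sketch of a grid search. The model (`KNeighborsClassifier`), the parameter grid, and the use of the full three-class iris dataset are arbitrary assumptions chosen for illustration; any Scikit-Learn estimator and grid can be substituted.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical grid: 4 x 2 = 8 combinations, each scored with 5-fold cross-validation.
param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(sorted(search.cv_results_.keys()))        # names of the stored result arrays
print(search.best_params_, search.best_score_)  # best combination and its mean CV score
```

Wrapping `search.cv_results_` in `pd.DataFrame(...)` gives one row per parameter combination, which is often easier to read than the raw dictionary.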

### 5. Saving the trained model

Once you have found a good model for your problem, you need to think about putting it into production in an application. This involves saving the trained model and checking that it loads correctly so that it can make predictions with new data in the production system.

A model produced with the Scikit-Learn library can be saved using Python's standard object serialization module, `pickle`. This module provides the functions `dump()` and `load()` to save and load the model, respectively. Review the documentation for these functions to learn about their parameters, and apply them to your previous work.
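
A minimal round trip might look like the sketch below. The file name `perceptron_model.pkl`, the temporary directory, and the choice of Scikit-Learn's `Perceptron` trained on the full iris dataset are assumptions made only so that the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

X, y = load_iris(return_X_y=True)
model = Perceptron(random_state=0).fit(X, y)

# Hypothetical location for the saved model.
model_path = Path(tempfile.gettempdir()) / "perceptron_model.pkl"

# Save the trained model (the file must be opened in binary write mode).
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back (binary read mode) and check that predictions are unchanged.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print((restored.predict(X) == model.predict(X)).all())
```

Only unpickle files you trust, since `pickle.load` can execute arbitrary code. A pickled model should also be loaded with the same Scikit-Learn version that produced it.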