# Linear Models for Classification: Wine Dataset

## IMPORTANT: make sure to rerun all the code from the beginning to obtain the results for the final version of your notebook (this is the way we will do it for evaluating your HWs!)

### Dataset description

We will be working with a dataset on wines from the UCI machine learning repository
(http://archive.ics.uci.edu/ml/datasets/Wine). It contains data for 178 instances. 
The data are the results of a chemical analysis of wines grown in the same region
in Italy but derived from three different cultivars. The analysis determined the
quantities of 13 constituents found in each of the three types of wines. 

### The features in the dataset are:

- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline



We first import the dataset we are going to use.

In [None]:
#let's import the sklearn library


#let's print out the version of scikit-learn


#this imports the datasets module, which has useful datasets


# Load the dataset from scikit learn


Let's check out the description of the dataset in the scikit-learn documentation: https://scikit-learn.org/0.23/modules/classes.html#module-sklearn.datasets

(**Note**: we are considering the scikit-learn version that is installed in the labs Te and Ue, but more recent versions exist.)
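As a rough sketch of the import and loading steps, the snippet below uses `sklearn.datasets.load_wine`. The variable name `wine` is an assumption made for illustration and is not prescribed by the notebook.

```python
import sklearn
from sklearn import datasets

# Print the installed scikit-learn version (the notebook targets 0.23)
print("scikit-learn version:", sklearn.__version__)

# load_wine returns a Bunch object holding the data, the targets, and metadata
wine = datasets.load_wine()  # 'wine' is an illustrative name

print(wine.data.shape)     # data matrix: one row per wine, one column per feature
print(wine.feature_names)  # names of the 13 features
print(wine.target_names)   # names of the three classes
```

The full textual description of the dataset is available in `wine.DESCR`.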

Now let's get a better understanding of the data.

In [None]:
#let's print the data matrix


#let's print the dimension of the data matrix


#let's print the target (labels)


#let's print the features names


#let's print the target names


#let's print the description of the dataset


To simplify the problem (and the presentation) a bit, we are going to classify class "1" against the other two classes (0 and 2), which we relabel as "-1".

For convenience, let's save the instances (vectors of features) in matrix $\mathbf{X}$ and the targets into a vector $\mathbf{Y}$.

In [None]:
X = 
Y = 

#let's print out the matrix of instances and the vector of targets, just to make sure that everything looks ok
print("Matrix of instances")
print(X)

print("Vector of labels")
print(Y)

Let's relabel the labels for classes 0 and 2 as stated before.

In [None]:
#let's relabel classes 0 and 2 as -1


        
#let's print the new vector Y
print(Y)
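One possible way to build $\mathbf{X}$ and $\mathbf{Y}$ and to relabel the classes is sketched below. It assumes the dataset is loaded with `load_wine` into a variable named `wine` (an illustrative name) and uses `np.where` for the relabeling; an explicit loop over the labels would work just as well.

```python
import numpy as np
from sklearn import datasets

wine = datasets.load_wine()  # illustrative variable name
X = wine.data

# Class 1 keeps label 1; classes 0 and 2 become -1
Y = np.where(wine.target == 1, 1, -1)

# Show the distinct labels and how many instances carry each one
print(np.unique(Y, return_counts=True))
```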

## Data Preprocessing and Split into Training and Testing ##

Before we actually learn the model, it is important that we perform two operations:
1. split the data into a training set and a test set
2. normalize the features

**Note**: some of these operations can be done with scikit-learn functions, but we do them "manually" to get a better understanding of what is going on.

We now want to split the data into training and testing. Let's say we keep 80% of the data for training and 20% for testing. How do we split the data?

What about keeping the first 80% of the rows for training and the last 20% of the rows for testing? Is it a good idea?

Solution: ...

**Note**: ...

In [None]:
# we need to import numpy


# set the random seed to your ID number
IDnumber = 
np.random.seed(IDnumber)

#let's generate a random permutation of the row indices


#let's print Y_perm


Let's split the data and save the result into new training and test matrices/vectors.
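The permutation and the 80/20 split could look roughly like the sketch below. The seed value `12345` is only a placeholder for your ID number, and the names `permutation`, `X_perm`, `m_train`, `X_train`, `X_test`, `Y_train` and `Y_test` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn import datasets

wine = datasets.load_wine()
X = wine.data
Y = np.where(wine.target == 1, 1, -1)

IDnumber = 12345  # placeholder seed, replace with your own ID number
np.random.seed(IDnumber)

m = X.shape[0]
permutation = np.random.permutation(m)  # random ordering of the row indices
X_perm, Y_perm = X[permutation], Y[permutation]

m_train = int(0.8 * m)  # 80% of the rows go to training
X_train, Y_train = X_perm[:m_train], Y_perm[:m_train]
X_test, Y_test = X_perm[m_train:], Y_perm[m_train:]

print(X_train.shape, X_test.shape)
```

Applying the same permutation to both `X` and `Y` keeps every instance paired with its own label.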

We now center and scale the data to have unit variance. This is an important step for the numerical stability of the computation, and it ensures that features measured on very different scales contribute comparably. We are going to use the standard scaler from scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [None]:
#load the StandardScaler module


# we first "learn" the scaler function using the training data


# we then apply the scaling function to both training and test data, since we want to simulate what happens when we have data for training and we have future data


#let's print the scaled training data
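A minimal sketch of the scaling step follows, under the assumption that the split was done as above with a placeholder seed. The names `scaler`, `X_train_scaled` and `X_test_scaled` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

wine = datasets.load_wine()
X = wine.data
Y = np.where(wine.target == 1, 1, -1)

np.random.seed(12345)  # placeholder seed, replace with your own ID number
permutation = np.random.permutation(X.shape[0])
m_train = int(0.8 * X.shape[0])
X_train, X_test = X[permutation[:m_train]], X[permutation[m_train:]]

# Learn the mean and standard deviation of each feature from the training data only
scaler = StandardScaler().fit(X_train)

# Apply the same transformation to training and test data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_train_scaled.std(axis=0))   # approximately 1 for every feature
```

The test data are scaled with the training statistics, so their per-feature mean and variance will generally be close to, but not exactly, 0 and 1.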


## Learning a Model ##

We now need to decide which model/algorithm we are going to use for our classification task. There are several models available in scikit-learn: https://scikit-learn.org/0.23/index.html

We are going to start from the simplest models, that is, linear models: https://scikit-learn.org/0.23/modules/classes.html#module-sklearn.linear_model

How do we find the best hypothesis?

We need to define a loss function and then use Empirical Risk Minimization (ERM). 

What loss function does it make sense to use?

But what is the actual algorithm? We are going to consider the **Perceptron** algorithm: https://scikit-learn.org/0.23/modules/generated/sklearn.linear_model.Perceptron.html

Let's load the corresponding module in scikit-learn.

Let's use the Perceptron algorithm as implemented in scikit-learn. It proceeds in iterations.

The Perceptron has several parameters, some of which we will understand later on. An important one is $\texttt{tol}$, which specifies how much the training loss must improve in one iteration for the algorithm to continue.

In [None]:
#let's learn a model using Perceptron

#we first define the classifier, fixing the random state for reproducibility


#let's now learn the classifier (i.e., run the perceptron to fix the weights)
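Defining and fitting the classifier could be sketched as follows. The seed `12345` is a placeholder for your ID number, `tol=1e-3` mirrors the value used later in this notebook, and `perceptron` is an illustrative variable name.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler

IDnumber = 12345  # placeholder seed, replace with your own ID number
wine = datasets.load_wine()
X, Y = wine.data, np.where(wine.target == 1, 1, -1)

np.random.seed(IDnumber)
permutation = np.random.permutation(X.shape[0])
m_train = int(0.8 * X.shape[0])
X_train, Y_train = X[permutation[:m_train]], Y[permutation[:m_train]]
X_train_scaled = StandardScaler().fit(X_train).transform(X_train)

# Define the classifier, fixing the random state for reproducibility
perceptron = Perceptron(random_state=IDnumber, tol=1e-3)

# Run the Perceptron on the scaled training data to learn the weights
perceptron.fit(X_train_scaled, Y_train)

# The learned model: one weight per feature plus a bias term
print("weights:", perceptron.coef_)
print("bias:", perceptron.intercept_)
```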


Let's print out the model we learned.

How well does our method perform?

We need to compute the training error of the hypothesis $h_S$ we learned from the training set $S$. The scikit-learn classifier has no method that returns the training error $L_S(h_S)$ directly. However, its `score` method returns the mean accuracy, which for the 0-1 loss corresponds to $1 - L_S(h_S)$.

In [None]:
#let's compute the training error


#let's print the training error
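As a sketch, the training error can be obtained as one minus the value returned by `score`, and cross-checked against the fraction of misclassified training points, $\frac{1}{m}\sum_{i=1}^{m} \mathbb{1}[h_S(x_i) \neq y_i]$. The seed and all variable names below are illustrative assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler

IDnumber = 12345  # placeholder seed, replace with your own ID number
wine = datasets.load_wine()
X, Y = wine.data, np.where(wine.target == 1, 1, -1)

np.random.seed(IDnumber)
permutation = np.random.permutation(X.shape[0])
m_train = int(0.8 * X.shape[0])
X_train, Y_train = X[permutation[:m_train]], Y[permutation[:m_train]]
X_train_scaled = StandardScaler().fit(X_train).transform(X_train)

perceptron = Perceptron(random_state=IDnumber, tol=1e-3)
perceptron.fit(X_train_scaled, Y_train)

# score returns the mean accuracy, so the 0-1 training error is 1 - score
training_error = 1.0 - perceptron.score(X_train_scaled, Y_train)

# Cross-check: fraction of training points that are misclassified
manual_error = np.mean(perceptron.predict(X_train_scaled) != Y_train)

print("Training error:", training_error)
print("Manual 0-1 loss:", manual_error)
```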


But we don't care about the training error... we are interested in the generalization error! How do we estimate it? Let's use some data that we did not use for training, that is what we called test data.

In [None]:
#let's compute the test error


#let's print the test error
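The test error can be estimated in the same way on the held-out data, as in the sketch below. It assumes the same 80/20 split and placeholder seed used for illustration earlier; the key point is that the test data are scaled with the scaler fitted on the training data.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler

IDnumber = 12345  # placeholder seed, replace with your own ID number
wine = datasets.load_wine()
X, Y = wine.data, np.where(wine.target == 1, 1, -1)

np.random.seed(IDnumber)
permutation = np.random.permutation(X.shape[0])
m_train = int(0.8 * X.shape[0])
train_idx, test_idx = permutation[:m_train], permutation[m_train:]

scaler = StandardScaler().fit(X[train_idx])
X_train_scaled = scaler.transform(X[train_idx])
X_test_scaled = scaler.transform(X[test_idx])  # scaled with training statistics

perceptron = Perceptron(random_state=IDnumber, tol=1e-3)
perceptron.fit(X_train_scaled, Y[train_idx])

# Error on data that was never used for training
test_error = 1.0 - perceptron.score(X_test_scaled, Y[test_idx])
print("Test error:", test_error)
```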


## Impact of the amount of training data ##

We will now try to understand the impact of the amount of data we have for training.

To do this, we are going to train a model using a subset of the data with $10 \cdot i$ samples, for $i=1,2,\dots,9$, and then compute the training error and the test error.

In [None]:
#total number of samples, useful for later on
m_total = X.shape[0]

#two lists where to save the training error and the test error, useful for plotting
train_errors = list()
test_errors = list()

#let's define the learner we use in this part
perceptron_class = Perceptron(random_state=IDnumber, tol=1e-3)

for i in range(1,10):
    # we now repeat all the previous steps
    # split into training and test

    
    #scale the data according to the training set, for both training and testing

    
    #let's now learn the classifier (i.e., run the perceptron to fix the weights)

print(train_errors)
print(test_errors)
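One way the loop could be completed is sketched below. It assumes that, for each size, the first $10 \cdot i$ rows of a fixed random permutation are used for training and all remaining rows for testing; that choice, the placeholder seed, and the variable names are assumptions rather than the notebook's prescribed solution.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler

IDnumber = 12345  # placeholder seed, replace with your own ID number
wine = datasets.load_wine()
X, Y = wine.data, np.where(wine.target == 1, 1, -1)

np.random.seed(IDnumber)
permutation = np.random.permutation(X.shape[0])
X_perm, Y_perm = X[permutation], Y[permutation]

train_errors, test_errors = [], []
perceptron_class = Perceptron(random_state=IDnumber, tol=1e-3)

for i in range(1, 10):
    m_train = 10 * i
    # split: first m_train permuted rows for training, the rest for testing
    X_train, Y_train = X_perm[:m_train], Y_perm[:m_train]
    X_test, Y_test = X_perm[m_train:], Y_perm[m_train:]

    # scale both sets using statistics from the training set only
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # fit starts from scratch on each call, so earlier sizes do not leak in
    perceptron_class.fit(X_train_scaled, Y_train)
    train_errors.append(1.0 - perceptron_class.score(X_train_scaled, Y_train))
    test_errors.append(1.0 - perceptron_class.score(X_test_scaled, Y_test))

print(train_errors)
print(test_errors)
```

Note that with very small training sets the test set is much larger than the training set, so the two error estimates are not equally reliable.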
    


Now let's plot the training and test error as a function of the training dataset size.

In [None]:
#the following is to have the plots appearing inline
%matplotlib inline

#import the pyplot module from matplotlib for plotting (functions are similar to matlab)
import matplotlib.pyplot as plt

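#training set sizes used in the previous loop: 10, 20, ..., 90
x_axis = range(10, 100, 10)
plt.plot(x_axis, train_errors, 'x:')
plt.plot(x_axis, test_errors, 'o:')
plt.xlabel("Training set size")
plt.ylabel("Error")
plt.legend(["Training error", "Test error"])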

## Impact of initial conditions on the Perceptron

Note that the solution found by the Perceptron algorithm depends on the initial condition. Let's learn a model with a different random seed for the Perceptron and see how different the model is from the previous one.

In [None]:
#let's learn a new model using Perceptron
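The comparison could be sketched as below. In scikit-learn, `random_state` controls how the training data are shuffled between passes, so treating it as the "initial condition" is an assumption of this sketch; the seeds and variable names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler

IDnumber = 12345  # placeholder seed, replace with your own ID number
wine = datasets.load_wine()
X, Y = wine.data, np.where(wine.target == 1, 1, -1)

np.random.seed(IDnumber)
permutation = np.random.permutation(X.shape[0])
m_train = int(0.8 * X.shape[0])
X_train, Y_train = X[permutation[:m_train]], Y[permutation[:m_train]]
X_train_scaled = StandardScaler().fit(X_train).transform(X_train)

# Same data and tolerance, two different random states
model_a = Perceptron(random_state=IDnumber, tol=1e-3).fit(X_train_scaled, Y_train)
model_b = Perceptron(random_state=IDnumber + 1, tol=1e-3).fit(X_train_scaled, Y_train)

# Compare the learned weight vectors and the training errors
weight_distance = np.linalg.norm(model_a.coef_ - model_b.coef_)
error_a = 1.0 - model_a.score(X_train_scaled, Y_train)
error_b = 1.0 - model_b.score(X_train_scaled, Y_train)

print("Distance between weight vectors:", weight_distance)
print("Training errors:", error_a, error_b)
```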



## Impact of normalization

Let's try to understand what the impact of scaling the data is. Let's learn a model without normalizing the data.

In [None]:
#let's learn a new model using Perceptron
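A sketch of the experiment without normalization follows: the Perceptron is trained directly on the raw features, and its errors are computed as before. The seed and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron

IDnumber = 12345  # placeholder seed, replace with your own ID number
wine = datasets.load_wine()
X, Y = wine.data, np.where(wine.target == 1, 1, -1)

np.random.seed(IDnumber)
permutation = np.random.permutation(X.shape[0])
m_train = int(0.8 * X.shape[0])
X_train, Y_train = X[permutation[:m_train]], Y[permutation[:m_train]]
X_test, Y_test = X[permutation[m_train:]], Y[permutation[m_train:]]

# No StandardScaler here: the model sees the raw feature values
perceptron_raw = Perceptron(random_state=IDnumber, tol=1e-3)
perceptron_raw.fit(X_train, Y_train)

train_error_raw = 1.0 - perceptron_raw.score(X_train, Y_train)
test_error_raw = 1.0 - perceptron_raw.score(X_test, Y_test)

print("Training error without scaling:", train_error_raw)
print("Test error without scaling:", test_error_raw)
```

Comparing these values with the errors obtained on the scaled data gives a direct view of the effect of normalization.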


## Impact of number of iterations

Let's write the code that performs one iteration at a time, and let's compute the training error after each iteration.
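One possible approach, sketched below, uses the `partial_fit` method, where each call performs one pass (epoch) over the training data. Treating one such pass as "one iteration" is an assumption of this sketch, as are the number of iterations (`n_iterations = 10`), the placeholder seed, and the variable names.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler

IDnumber = 12345  # placeholder seed, replace with your own ID number
wine = datasets.load_wine()
X, Y = wine.data, np.where(wine.target == 1, 1, -1)

np.random.seed(IDnumber)
permutation = np.random.permutation(X.shape[0])
m_train = int(0.8 * X.shape[0])
X_train, Y_train = X[permutation[:m_train]], Y[permutation[:m_train]]
X_train_scaled = StandardScaler().fit(X_train).transform(X_train)

perceptron_iter = Perceptron(random_state=IDnumber)
classes = np.unique(Y_train)  # partial_fit needs the full set of labels
n_iterations = 10             # illustrative choice

errors_per_iteration = []
for iteration in range(n_iterations):
    # one pass over the training data, continuing from the current weights
    perceptron_iter.partial_fit(X_train_scaled, Y_train, classes=classes)
    error = 1.0 - perceptron_iter.score(X_train_scaled, Y_train)
    errors_per_iteration.append(error)
    print("Iteration", iteration + 1, "training error:", error)
```

The training error is not guaranteed to decrease at every iteration, so plotting `errors_per_iteration` can be more informative than looking at the final value alone.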