<table bgcolor=#ffffff align="center" width="100%" noborder>
    <tr>
        <td align="left" width="30%"><img src="images/IST_logo.png" width="50%"></td>
        <td width="40%"></td>
        <td align="right" width="30%"><img src="images/ds_logo.png" width="25%"></td>
    </tr>
    <tr><td align="left" width="30%"></td>
        <td width="40%"><p align="center"><img src="images/title.png"></td>
        <td align="right" width="30%"></td>
    </tr>
</table>

<h1 align="center" style="font-family:Arial;color:#6c6c6c;font-size:30px;">Lab 3: Classification</h1>

Classification is one of the major tasks in data science, and can be performed through the <code>sklearn</code> package 
and its multiple subpackages. The image below summarizes the major classification techniques and the 
corresponding implementation packages in <code>sklearn</code>.

<p align="center"><img src="images/classification.png" width="50%">

<h2 style="font-family:Arial;color:#6c6c6c;font-size:25px;">Training Models</h2>

Whenever we face a classification problem, the first thing to do is to identify the <i>target</i> or
<i>class</i>, which is the variable to predict. The type of the target variable determines the kind of task to 
perform: targets with just a few distinct values allow for a <strong>classification</strong> task, while real-valued targets 
require a <strong>prediction</strong> (regression) one.

In a classification task, it is mandatory to check how balanced the target classes are, in order to choose the most 
adequate balancing strategy (see <a href="Lab23_balancing.ipynb">Data balancing</a>) and to select the best metrics to 
evaluate the results achieved.

<h3 style="font-family:Arial;color:#6c6c6c;font-size:20px;font-style:italic;">Training strategy</h3>

After applying balancing techniques, if required, we need to choose the best strategy to train the classification
models. The training strategy concerns the way we obtain the <strong>train and test datasets</strong>, which is chosen
according to the data characteristics:
- <strong>k-fold cross validation</strong> (<code>StratifiedKFold</code>): used when only a few thousand records are available;
- <strong>hold-out</strong> (<code>train_test_split</code>): used when many thousands of records are available;
- <strong>sample hold-out</strong>: used in the presence of millions of records.
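
As a minimal sketch of the k-fold option, the snippet below iterates over the folds produced by 
<code>StratifiedKFold</code>. It assumes the copy of the iris data bundled with <code>sklearn</code> 
(<code>load_iris</code>) as a stand-in for the lab's own file, and the number of folds and the seed are illustrative 
choices only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

# Assumption: the iris data bundled with sklearn stands in for the lab's CSV file
X, y = load_iris(return_X_y=True)

# 5 folds, shuffled with a fixed seed so that the split is reproducible
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
for fold, (trn_idx, tst_idx) in enumerate(skf.split(X, y)):
    # each iteration yields the row indices of one train/test partition
    trnX, tstX = X[trn_idx], X[tst_idx]
    trnY, tstY = y[trn_idx], y[tst_idx]
    fold_sizes.append((len(trn_idx), len(tst_idx)))
    # the test fold keeps the class proportions of the full dataset
    print(fold, len(trn_idx), len(tst_idx), np.bincount(tstY))
```

Each record ends up in the test set of exactly one fold, so every record is used both for training and for testing 
across the whole procedure.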

With the data split, we proceed to create the prediction model. However, there is a plethora of techniques and 
extensions, with an infinite number of different parametrisations, and the best one to apply can only be 
chosen by comparing their results on our data. Additionally, each technique works better for data with specific 
characteristics, which demands applying some data preparation transformations.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

data: pd.DataFrame = pd.read_csv('data/iris.csv')
y: np.ndarray = data.pop('class').values
X: np.ndarray = data.values
labels: np.ndarray = pd.unique(y)

trnX, tstX, trnY, tstY = train_test_split(X, y, train_size=0.7, stratify=y)

As noted above, the training of classification models is achieved through the <code>sklearn</code> package. Since it is 
built on top of the <code>numpy</code> package, we need to pass numpy arrays (<code>ndarray</code>) as arguments 
to its different functions and methods, like <code>train_test_split</code>.
In mathematical terms, classification aims to map the data <i>X</i> to values in the domain of the target 
variable, call it <i>y</i>.
After loading the data into the <i>data</i> dataframe, we need to separate the target variable from the rest of the data, 
since it plays a different role in the training procedure. By applying the <code>pop</code> method, we
get the <i>class</i> variable and simultaneously remove it from the dataframe. So, <i>y</i> will keep the 
<code>ndarray</code> with the target value of each record and <i>X</i> the <code>ndarray</code> containing the
records themselves.
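
The effect of <code>pop</code> can be seen in a small sketch. The toy table below is hypothetical: its column names 
and values are illustrative only and are not taken from the lab's dataset.

```python
import pandas as pd

# Hypothetical toy table; column names and values are illustrative only
toy = pd.DataFrame({
    'sepal_length': [5.1, 7.0, 6.3],
    'sepal_width': [3.5, 3.2, 3.3],
    'class': ['setosa', 'versicolor', 'virginica'],
})

toy_y = toy.pop('class').values  # returns the column and removes it from the dataframe
toy_X = toy.values               # what is left are the descriptive variables

print(list(toy.columns))         # the 'class' column is gone
print(toy_X.shape, len(toy_y))
```

After the call, the dataframe only holds the descriptive variables, so <code>toy.values</code> can be used directly 
as the data matrix.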

The <code>train_test_split</code> function receives both <i>X</i> and <i>y</i> as the data to split, and returns each of them 
split in two: <i>trnX</i> will contain the fraction of <i>X</i> given by <code>train_size</code> (70% here) and <i>tstX</i> 
the remaining 30%, and the same for <i>y</i>. 

Note the <code>stratify</code> parameter: when set to <code>y</code>, it establishes that the split will keep
the original class distribution of the data, which is mandatory whenever the data is not balanced, and advisable in the
majority of situations.
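
To make the role of <code>stratify</code> concrete, here is a small sketch that splits a hypothetical unbalanced target 
(90 records of one class and 10 of the other) with and without stratification. The data and the seed are assumptions made 
for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical unbalanced target: 90 records of class 0 and 10 of class 1
toy_y = np.array([0] * 90 + [1] * 10)
toy_X = np.arange(100).reshape(-1, 1)

# stratified split: the 90/10 proportion is preserved in both parts
_, _, trn_s, tst_s = train_test_split(
    toy_X, toy_y, train_size=0.7, stratify=toy_y, random_state=0)

# plain random split: the proportions depend on the draw
_, _, trn_r, tst_r = train_test_split(
    toy_X, toy_y, train_size=0.7, random_state=0)

print('stratified:', np.bincount(trn_s), np.bincount(tst_s))
print('random:    ', np.bincount(trn_r, minlength=2), np.bincount(tst_r, minlength=2))
```

With stratification, both the train and the test parts keep 10% of records from the minority class, whatever the seed.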

<h4 style="font-family:Arial;color:#6c6c6c;font-size:20px;font-style:italic;">Estimators and Models</h4>

In <code>sklearn</code>, an <strong>estimator</strong> is an object of a class that extends <code>BaseEstimator</code> 
and implements the <code>fit</code> and <code>predict</code> methods. Besides these, classifiers also implement the 
<code>score</code> method. Estimators are parametrised by passing the different choices as arguments
to their constructors.

Note that in sklearn <strong>there is no class for representing the models learnt</strong>, but their effects are 
reachable through the estimator object. Indeed, an <i>estimator</i> is the result of parametrising a learning technique, 
trained over a particular dataset, creating a <i>classification model</i>.  
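
A minimal sketch of this idea, assuming a <code>DecisionTreeClassifier</code> and the iris data bundled with 
<code>sklearn</code> (both chosen only for illustration): the parameters are set in the constructor, and the learnt model 
only becomes reachable through the same object after <code>fit</code>.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: the iris data bundled with sklearn stands in for the lab's CSV file
X, y = load_iris(return_X_y=True)

# Parametrisation happens in the constructor
estimator = DecisionTreeClassifier(max_depth=2, criterion='entropy', random_state=0)
print(estimator.get_params()['max_depth'])

# Before fit there is no learnt model inside the estimator
had_model_before_fit = hasattr(estimator, 'tree_')
estimator.fit(X, y)

# After fit, the model is reached through the estimator's attributes and methods
print(estimator.classes_)
print(estimator.get_depth(), estimator.get_n_leaves())
```

Attributes ending with an underscore, like <code>classes_</code> and <code>tree_</code>, are the ones created by 
<code>fit</code>.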

<table noborder>
    <tr><th><p text-align:"center">Estimators Methods</th></tr>
    <tr><th><p text-align:"center"></th></tr>
    <tr><td><p text-align:"left"><code><b>fit(trnX: np.ndarray, trnY: np.ndarray)</b></code></td></tr>
    <tr><td><p text-align:"left">trains the classifier over the data <i>trnX</i> labeled according to <i>trnY</i>, 
        creating an internal model</td></tr>
    <tr><td><p text-align:"left"><code><b>predict(tstX: np.ndarray) -> np.ndarray</b></code></td></tr>
    <tr><td><p text-align:"left">applies the learnt model to the records in <i>tstX</i> and returns their 
        predicted labels</td></tr>
    <tr><td><p text-align:"left"><code><b>score(tstX: np.ndarray, tstY: np.ndarray) -> float</b></code></td></tr>
    <tr><td><p text-align:"left">applies the model to <i>tstX</i> and compares the predicted labels to the labels
        in <i>tstY</i>, computing the model's mean accuracy on the given data</td></tr>
</table>
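
The three methods above can be sketched together as follows. The example assumes <code>GaussianNB</code> with its default 
parameters and the iris data bundled with <code>sklearn</code>; the seed is an arbitrary choice made only to get a 
reproducible split.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumption: the iris data bundled with sklearn stands in for the lab's CSV file
X, y = load_iris(return_X_y=True)
trnX, tstX, trnY, tstY = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=42)

clf = GaussianNB()
clf.fit(trnX, trnY)          # learns the internal model from the train data
prdY = clf.predict(tstX)     # predicted labels for the test records
acc = clf.score(tstX, tstY)  # mean accuracy on the test data

# score is equivalent to comparing the predictions with the known labels
print(acc, accuracy_score(tstY, prdY))
```

Calling <code>score</code> is a shortcut for <code>predict</code> followed by an accuracy computation, which is why 
both printed values coincide.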
 

Among the techniques that we are going to use are: <code>GaussianNB</code>, <code>KNeighborsClassifier</code>, 
<code>DecisionTreeClassifier</code>, <code>RandomForestClassifier</code> and <code>GradientBoostingClassifier</code>.

The rest of this module is organized in the same way for each of the classification techniques: we first succinctly 
describe the technique and its main parameters, then train different models through different parametrisations of 
the technique, using a 70% train / 30% test split strategy, and finally evaluate the accuracy of each model as explained in 
<a href="Lab4_evaluation.ipynb">Lab 4 - Models evaluation</a>, comparing the different results.
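
As a rough preview of that workflow, the sketch below trains one technique under several parametrisations on the same 
70% / 30% split and compares their accuracies. The choice of <code>KNeighborsClassifier</code>, the values tried for 
<code>n_neighbors</code>, the seed and the use of the iris data bundled with <code>sklearn</code> are all assumptions made 
for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the iris data bundled with sklearn stands in for the lab's CSV file
X, y = load_iris(return_X_y=True)
trnX, tstX, trnY, tstY = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=42)

# one model per parametrisation, all trained and tested on the same split
results = {}
for k in (1, 5, 9, 15):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(trnX, trnY)
    results[k] = clf.score(tstX, tstY)

best_k = max(results, key=results.get)
for k, acc in results.items():
    print(f'n_neighbors={k}: accuracy={acc:.3f}')
print('best n_neighbors:', best_k)
```

Keeping the split fixed across parametrisations makes the accuracies directly comparable, since every model is 
evaluated on exactly the same test records.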

<table bgcolor=#ffffff align="center" width="100%" noborder>
    <tr>
        <td align="center" width="30%"><a href="Lab2_preparation.ipynb"><img src="images/prev.png"></a></td>
        <td width="40%"></td>
        <td align="center" width="30%"><a href="Lab31_naivebayes.ipynb"><img src="images/next.png"></a></td>
    </tr>
</table>