# Supervised Learning: Classification

Broadly speaking, we can identify three main operational areas in ML:

- Supervised Learning;
- Unsupervised Learning;
- Reinforcement Learning.

## Machine Learning areas

![Machine learning areas](../resources/machine-learning.png "Machine learning areas")


Supervised learning applies when every data point comes with a known target: the model learns to recognize patterns in the data and to associate each data point with its target.

Targets can be either:
- numeric (regression);
- categorical (classification).

Today we'll focus on the latter.

## Introduction to classification

We have a dataset and a target (label) for each data point.

In general terms, when performing classification, we are trying to represent all items in a multidimensional space and to find a boundary between different groups.  
We can achieve this in two ways:

#### Transforming the data so that the classes are linearly separable

<figure style="text-align:center">
  <img src="https://i2.wp.com/appliedmachinelearning.blog/wp-content/uploads/2017/03/svm_logo1.png?fit=392%2C374&ssl=1" alt="" style="width:40%;text-align:center">
  <figcaption style="text-align:center">Linear (planar) separation in space</figcaption>
</figure>
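
As a minimal sketch of this first approach, consider two concentric rings: no straight line separates them, but adding the squared distance from the origin as an extra feature does. The synthetic dataset (`make_circles`), its parameters, and the extra feature are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

# Two concentric rings: not separable by a straight line in (x1, x2).
X, y = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)

# A linear model on the raw coordinates.
raw_acc = LogisticRegression().fit(X, y).score(X, y)

# Add the squared distance from the origin as a third feature.
X_lifted = np.column_stack([X, (X ** 2).sum(axis=1)])
lifted_acc = LogisticRegression().fit(X_lifted, y).score(X_lifted, y)

print(f"accuracy on raw features:    {raw_acc:.2f}")
print(f"accuracy on lifted features: {lifted_acc:.2f}")
```

The model is the same linear classifier in both cases; only the representation of the data changes.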



#### Using non-linear algorithms


<figure style="text-align:center">
  <img src="https://cdn-images-1.medium.com/max/1600/1*5l08QfsUsrsOxcPzfDoStg.png" alt="" style="width:70%;text-align:center">
  <figcaption style="text-align:center">Linear vs nonlinear model</figcaption>
</figure>

In [None]:
# numpy for algebra
import numpy as np
# pandas for data manipulation
import pandas as pd
# sklearn for machine learning
from sklearn import datasets

# matplotlib and seaborn for plotting and graphs
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
plt.rcParams["figure.figsize"] = 16, 3

## Binary classification

We will use some datasets bundled with scikit-learn. For binary classification we will use the breast cancer diagnostic dataset.

Some details below:



### Breast Cancer Wisconsin (Diagnostic) dataset

**Data Set Characteristics:**

This is a copy of the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset.  
https://goo.gl/U2Uwz2 
  
Features are computed from a digitized image of a fine needle  
aspirate (FNA) of a breast mass.  They describe  
characteristics of the cell nuclei present in the image.  

In [None]:
bc_dataset = datasets.load_breast_cancer()

In [None]:
df = pd.DataFrame(bc_dataset["data"], columns=bc_dataset["feature_names"])

In [None]:
df.head()

In [None]:
target = pd.Series(bc_dataset["target"])

In [None]:
target.head()

In [None]:
# class balance
target.value_counts()

In [None]:
bc_dataset["target_names"]

### The algorithm

For this example we will use one of the simplest algorithms available for binary classification: Logistic Regression.

The model looks for a weighted combination of the features (x axis) such that a logistic curve applied to it can tell the two classes apart, by outputting the probability of belonging to one of them (y axis).


<figure style="text-align:center">
  <img src="https://upload.wikimedia.org/wikipedia/commons/6/6d/Exam_pass_logistic_curve.jpeg" alt="" style="width:50%;text-align:center">
  <figcaption style="text-align:center"> Example of logistic regression - probability of passing an exam vs hours of studying</figcaption>
</figure>


As simplistic as this may sound, it actually works well in a lot of real-world situations.
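
To make the idea concrete, here is a small sketch of the computation behind a single prediction. The weights, bias, and data point are hypothetical values chosen only for illustration; in practice they are learned from the training data.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights and bias, chosen only for illustration.
weights = np.array([1.5, -0.8])
bias = -0.2

x = np.array([2.0, 1.0])        # one data point with two features
z = weights @ x + bias          # linear combination of the features
probability = sigmoid(z)        # estimated probability of class 1
predicted_class = int(probability >= 0.5)

print(f"z = {z:.2f}, P(class 1) = {probability:.3f}, predicted class = {predicted_class}")
```

The 0.5 threshold is the usual default; moving it trades one kind of error for another.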

### Data exploration and preparation

Whenever we fit a model to labeled data, we must make sure that it keeps its predictive power on new data, that is, that it does not overfit.

To do so we need to split the data so that we can run the evaluation on data points the model has not seen before (testing generalization).

In [None]:
# correlations
plt.rcParams["figure.figsize"] = 15, 15
sns.heatmap(df.corr(), annot=True)

In [None]:
import tqdm
from sklearn.preprocessing import StandardScaler

scalers = {}

for column in tqdm.tqdm(df.columns):
    _scaler = StandardScaler()
    _scaler.fit(df[column].values.reshape(-1, 1))
    scalers[column] = _scaler

In [None]:
# transform the columns
import copy

data = copy.deepcopy(df)

for column in tqdm.tqdm(df.columns):
    data[column] = scalers[column].transform(df[column].values.reshape(-1, 1))

In [None]:
data.head()

In [None]:
# random split of data
from sklearn import model_selection

train, test, target_train, target_test = model_selection.train_test_split(data, target)

In [None]:
train.shape, test.shape
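
The cells above fit the scalers on the full dataset before splitting it. A common alternative, sketched here under our own assumptions (a 25% test size, stratification, and a fixed `random_state`), is to split first and fit the scaler on the training part only, so that no statistics from the test set reach the model. A single `StandardScaler` standardizes every column independently, so it can stand in for the per-column dictionary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Split first; stratify keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
# Largest absolute column mean on the training part (should be close to 0).
print(np.abs(X_train_scaled.mean(axis=0)).max())
```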

### Training and evaluation

In [None]:
# instantiate the model
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

We can check the training score to see if the model is learning something from the data...
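
A minimal sketch of fitting the model and reading its training score could look like this. It rebuilds the split so that it runs on its own; the fixed `random_state` and the variable names are assumptions made for reproducibility, not the notebook's original cell.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize using statistics from the training part only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Fit the model, then score it on the same data it was trained on.
lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)

train_score = lr.score(X_train_scaled, y_train)  # mean accuracy
print(f"training accuracy: {train_score:.3f}")
```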

The training score is an accuracy, so it always lies between 0 and 1, and here it is very high: the model is learning a lot about the data.

But is it learning patterns or is it just memorizing this specific dataset?

### Evaluation

We use the test set to run evaluation and verify how well the model is behaving.
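
One way to run this evaluation, sketched with the same assumed split as a self-contained example, is to score the model on the held-out data and print per-class precision, recall, and F1 with `classification_report`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

bc = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    bc.data, bc.target, random_state=0
)

scaler = StandardScaler().fit(X_train)
lr = LogisticRegression().fit(scaler.transform(X_train), y_train)

# Predict on data the model has never seen.
X_test_scaled = scaler.transform(X_test)
y_pred = lr.predict(X_test_scaled)

test_score = lr.score(X_test_scaled, y_test)
print(f"test accuracy: {test_score:.3f}")
print(classification_report(y_test, y_pred, target_names=bc.target_names))
```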

**What do these numbers mean?**

<figure style="text-align:center">
  <img src="https://www.digital-mr.com/media/cache/5e/b4/5eb4dbc50024c306e5f707736fd79c1e.png" alt="" style="width:90%;text-align:center">
  <figcaption style="text-align:center"> Precision and recall</figcaption>
</figure>

The confusion matrix shows in detail what errors we are committing.
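
The sketch below computes the confusion matrix for the same assumed setup and derives precision and recall from its four cells. In this dataset label 1 means benign, so "positive" here refers to the benign class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)
lr = LogisticRegression().fit(scaler.transform(X_train), y_train)
y_pred = lr.predict(scaler.transform(X_test))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(cm)
print(f"precision: {precision:.3f} (sklearn: {precision_score(y_test, y_pred):.3f})")
print(f"recall:    {recall:.3f} (sklearn: {recall_score(y_test, y_pred):.3f})")
```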

## Multiclass classification

As the name suggests, we are now dealing with more than two classes. Some methods need to change in order to cope with the new problem configuration.


<figure style="text-align:center">
  <img src="https://i.stack.imgur.com/La40O.jpg" alt="" style="width:60%;text-align:center">
  <figcaption style="text-align:center">Multiclass classification</figcaption>
</figure>



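Also, some algorithms no longer work out of the box: plain logistic regression, for instance, is a binary method and needs a multiclass extension (such as one-vs-rest or a multinomial formulation).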

We will use the wine dataset!
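
As an aside, a binary algorithm can often be adapted to several classes by training one binary model per class (one-vs-rest). The sketch below does this on the wine data with scikit-learn's `OneVsRestClassifier`; the pipeline, `max_iter`, and the split parameters are assumptions of this example rather than part of the original notebook.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# One binary logistic regression per class, each "this class vs the rest".
model = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(X_train, y_train)

n_binary_models = len(model[-1].estimators_)
ovr_score = model.score(X_test, y_test)
print(f"binary models trained: {n_binary_models}")
print(f"test accuracy: {ovr_score:.3f}")
```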

In [None]:
wine = datasets.load_wine()

In [None]:
df = pd.DataFrame(wine["data"], columns=wine["feature_names"])

In [None]:
target = pd.Series(wine["target"])

### Data exploration, preparation, and fitting

In [None]:
df.head()

In [None]:
target.value_counts()

### How to deal with multiclass targets

There are a few problems with categorical multiclass targets...

### The algorithm

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
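
A minimal sketch of fitting the tree on a training split of the wine data follows. The stratified split and the fixed `random_state` are our assumptions; decision trees do not depend on feature scale, so no scaler is used here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# An unrestricted tree keeps splitting until its leaves are pure.
dtc = DecisionTreeClassifier(random_state=0)
dtc.fit(X_train, y_train)

tree_train_score = dtc.score(X_train, y_train)
print(f"training accuracy: {tree_train_score:.3f}")
print(f"tree depth: {dtc.get_depth()}, leaves: {dtc.get_n_leaves()}")
```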

### Evaluation
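
With three classes the same tools apply, but the confusion matrix becomes 3x3 and precision and recall are reported per class. The sketch below assumes the same stratified split with a fixed `random_state`.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, stratify=wine.target, random_state=0
)

dtc = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = dtc.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
tree_test_score = dtc.score(X_test, y_test)

print(cm)
print(f"test accuracy: {tree_test_score:.3f}")
print(classification_report(y_test, y_pred, target_names=wine.target_names))
```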

### Overfitting
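
A simple way to look for overfitting is to compare training and test accuracy as the model is given more freedom. The sketch below varies `max_depth`; the depths tried and the split parameters are arbitrary choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# max_depth=None lets the tree grow until every leaf is pure.
results = {}
for depth in (1, 2, 3, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A training accuracy well above the test accuracy suggests that the model is memorizing the training set rather than learning patterns that generalize.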

Let's quickly try a different algorithm: k-nearest neighbors.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
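
A possible continuation, sketched under our own assumptions (`n_neighbors=5`, a stratified split, and standardization inside a pipeline), is shown below. k-nearest neighbors relies on distances between points, so the features are standardized first.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Scale, then classify each point by a vote among its 5 nearest neighbors.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

knn_train_score = knn.score(X_train, y_train)
knn_test_score = knn.score(X_test, y_test)
print(f"train accuracy: {knn_train_score:.3f}")
print(f"test accuracy:  {knn_test_score:.3f}")
```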