# Introduction to Machine Learning

**Supervised learning**
- We have labelled (annotated) samples, which are split into three sets:
    - Training split: used to fit the model parameters
    - Validation split: used to adjust hyperparameters (skipped in the course)
    - Test split: used only for the final performance metric; nothing is tuned after it
- Usually two types of problems are solved:
    - Classification: predict the class of a new sample using a model trained with labelled samples
    - Regression: predict the continuous value of a sample using a model trained with labelled samples
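
The three splits above can be sketched by calling `train_test_split` twice. This is only an illustration: the toy data, the 60/20/20 proportions, and the `random_state` values are assumptions, not something prescribed by the course.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset (made up for illustration): 100 samples, 2 features, binary labels
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# First hold out 20% of all samples as the test split
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Then take 25% of the remainder (20% of the total) as the validation split
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

split_sizes = (len(X_train), len(X_val), len(X_test))
print(split_sizes)  # (60, 20, 20)
```

The test split is set aside first so that no later decision (model choice, hyperparameters) can be influenced by it.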

**Unsupervised learning**
- We don't have labelled data.
- Usually these kinds of problems are solved:
    - Clustering: data points are grouped by similarity; since the data is not labelled, the resulting groups can be artificial if the method is not applied carefully
    - Anomaly detection: we detect outliers in a dataset, e.g., fraud detection; usually we don't have examples of outliers in the data, but we recognize them when they appear
    - Dimensionality reduction: the number of features of each sample can be reduced, either to compress the dataset or to understand it better
- Since we don't have labels (ground truth), we cannot evaluate the methods that easily.
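
As a rough sketch of two of these tasks, the snippet below clusters synthetic points with `KMeans` and reduces them from 3 to 2 features with `PCA`. The data, the number of clusters, and the seeds are assumptions chosen so that the result is easy to inspect.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs in 3D; no labels are given to the models
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 3))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 3))
X = np.vstack([blob_a, blob_b])

# Clustering: assign each point to one of 2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project the 3 features down to 2
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(cluster_ids[:5], cluster_ids[-5:])
print(X_2d.shape)
```

Here the grouping can be checked only because we generated the blobs ourselves; with real unlabelled data there is no such ground truth.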

**Performance evaluation for classification (categorical values)**
- Metrics
    - Accuracy
    - Recall
    - Precision
    - F1-score
- Confusion matrix; Type I errors (false positives) and Type II errors (false negatives)
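
A small worked example may help connect these metrics to the confusion matrix. The labels below are made up; the functions come from `sklearn.metrics`.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up binary labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 8 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)
```

In this example there is one Type I error (the false positive) and one Type II error (the false negative).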

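**Performance evaluation for regression (continuous values)**
- Mean absolute error (MAE): avg(abs(y - y_hat)); large errors are not penalized more than small ones
- (Root) mean squared error ((R)MSE): sqrt(avg((y - y_hat)^2)); squaring penalizes large errors more heavily; the MSE is the same quantity without the square root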
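
The two formulas can be checked on a few made-up values. The sketch below computes them with NumPy and compares against `sklearn.metrics`; the numbers are arbitrary and chosen so that one error is much larger than the others.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions; the last prediction is off by 4
y = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 3.0])

mae = np.mean(np.abs(y - y_hat))           # (1 + 0 + 2 + 4) / 4 = 1.75
rmse = np.sqrt(np.mean((y - y_hat) ** 2))  # sqrt(21 / 4), about 2.29

# The same quantities via scikit-learn
mae_sk = mean_absolute_error(y, y_hat)
rmse_sk = np.sqrt(mean_squared_error(y, y_hat))

print(mae, rmse)
```

The RMSE is larger than the MAE here because the single error of 4 dominates once it is squared.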

### Scikit-Learn: Overview of the pipeline

#### 0. Install it

```bash
# Use one of the two, depending on your environment
conda install scikit-learn
pip install scikit-learn
```

#### 1. Import model estimator and instantiate it (with or without params)

`from sklearn.family import Model`

Example (family = linear_model, Model = LinearRegression):

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
```

#### 2. Create training and test splits with the data

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 5 samples with 2 features each, and 5 target values
X, y = np.arange(10).reshape(5, 2), range(5)

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
```

#### 3. Train/Fit model

Supervised **estimators**:

`model.fit(X_train, y_train)`

Unsupervised **estimators**:

`model.fit(X_train)`

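Many models can compute a score on given data. For classifiers it is the mean accuracy, in [0, 1]; for regressors it is R^2, which is at most 1 and can be negative:

`model.score(X_test, y_test)`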

```python
model.fit(X_train, y_train)
```

```
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
```
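
To illustrate the score, here is a minimal sketch that fits a `LinearRegression` on the toy data and evaluates it on the held-out samples. The fixed `random_state` and the second, deliberately bad set of predictions are assumptions added for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Toy data: the target is an exact linear function of the features
X, y = np.arange(10).reshape(5, 2), np.arange(5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# score() needs data: for regressors it returns R^2 on (X_test, y_test)
r2_good = model.score(X_test, y_test)

# R^2 has no lower bound: predictions worse than the mean give a negative value
r2_bad = r2_score([1, 2, 3], [3, 2, 1])

print(r2_good, r2_bad)
```

Because the toy target is perfectly linear, the first score is 1 up to floating-point error; the second one is -3.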

#### 4. Inference/Prediction

Supervised, classification (categorical values) & regression (continuous values):

`predictions = model.predict(X_new)`

Supervised, classification (probabilities):

`probabilities = model.predict_proba(X_new)`

Unsupervised:

`predictions = model.predict(X_new)`

`X_trans = model.transform(X_new)`


```python
predictions = model.predict(X_test)
predictions
```

```
array([ 1.,  3.])
```
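
For a classifier, `predict` returns class labels while `predict_proba` returns one probability per class. The following sketch uses `LogisticRegression` on made-up one-dimensional data; the model choice and the data are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 1D data: small values belong to class 0, large values to class 1
X_train = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[0.5], [9.5]])

clf = LogisticRegression().fit(X_train, y_train)

# Hard class labels, one per new sample
labels = clf.predict(X_new)

# Class probabilities: one row per sample, one column per class in clf.classes_
probabilities = clf.predict_proba(X_new)

print(labels)
print(probabilities)
```

Each row of `probabilities` sums to 1, and the predicted label is the class with the highest probability in that row.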

## Which method/estimator should we use?

![Choosing the appropriate ML method](_scikit_learn_ml_map_choose_method.png "Choosing the appropriate ML method")