# Introduction to ML Concepts and Scikit-Learn

This notebook is a brief introduction to Machine Learning and Scikit-Learn.

Overview of contents:

1. ML Concepts - A Refresher: Supervised vs. Unsupervised ML, Performance Metrics of Regression, Performance Metrics of Classification (Accuracy, Precision, Recall, F1, Confusion Matrix)

*Disclaimer: I made this notebook while following the Udemy course [NLP - Natural Language Processing with Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python/) by José Marcial Portilla. The original course notebooks and materials were provided via a download link; I haven't found a repository to fork from.*

## 1. ML Concepts - A Refresher

### Types of Machine Learning

Most machine learning techniques fall into one of these groups:

- Supervised Learning: regression and classification with labelled data
- Unsupervised Learning: clustering, anomaly detection and dimensionality reduction with unlabelled data
- Reinforcement Learning: learning by taking actions in an environment

### Supervised learning

We have labelled or annotated samples, which are split into:
- Training split: used to train the model parameters.
- Validation split: used to adjust the hyperparameters (skipped in the course).
- Test split: used for the final performance metric; the model is not adjusted anymore after this evaluation.
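
The three splits can be sketched in plain Python as follows. This is only an illustration: the function name `train_val_test_split`, the 60/20/20 proportions and the toy data are my own assumptions, not something from the course. In practice, scikit-learn's `train_test_split` (in `sklearn.model_selection`) is the usual tool, typically applied twice to also obtain a validation split.

```python
import random

def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle the samples and cut them into training, validation and test splits."""
    shuffled = list(samples)                 # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # seeded shuffle for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]        # everything that is left
    return train, val, test

# Hypothetical toy dataset: (feature, label) pairs
data = [(x, x % 2) for x in range(100)]
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))  # 60 20 20
```

Shuffling before cutting matters: if the samples are sorted (e.g., by label), an unshuffled cut would put different classes into different splits.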

Usually two types of problems are solved:
- Classification: predict the class of a new sample using a model trained with labelled samples.
- Regression: predict the continuous value of a sample using a model trained with labelled samples.

### Unsupervised learning
We don't have labelled data.
Usually these kinds of problems are solved:
- Clustering: data points are grouped by similarity; since the data is not labelled, the resulting groups can turn out to be artificial if the clustering is not done carefully.
- Anomaly detection: we detect outliers in a dataset, e.g., fraud detection; usually we don't have examples of outliers in the data, but we recognize them when they appear.
- Dimensionality reduction: the number of features of each sample can be reduced either to compress or to better understand the dataset.

Since we don't have labels (ground truth), these methods cannot be evaluated as easily as supervised ones.
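
To make the anomaly detection idea concrete, here is a minimal sketch using a z-score rule: we only look at "normal" data and flag new points that lie far from its mean. The z-score rule is one simple technique I picked for illustration, not the method from the course; the function name `fit_zscore_detector`, the threshold of 3 standard deviations and the amounts are all assumptions.

```python
import statistics

def fit_zscore_detector(normal_values, threshold=3.0):
    """Return a function that flags values far away from the mean of the normal data."""
    mean = statistics.mean(normal_values)
    std = statistics.stdev(normal_values)    # sample standard deviation

    def is_anomaly(value):
        # Distance to the mean, measured in standard deviations
        return abs(value - mean) / std > threshold

    return is_anomaly

# Hypothetical transaction amounts that are considered normal
normal_amounts = [20.0, 22.5, 19.0, 21.0, 23.0, 20.5, 18.5, 22.0]
is_anomaly = fit_zscore_detector(normal_amounts)

print(is_anomaly(21.5))   # False: close to the usual amounts
print(is_anomaly(250.0))  # True: far away from anything seen before
```

Note that no labelled outliers were needed: the detector is built from normal data alone, which matches the situation described above.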

### Performance evaluation for regression (continuous values)

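- Mean absolute error (MAE): `avg(abs(y-y_hat))`; every error counts proportionally to its size, so large errors are not punished extra
- Root mean squared error (RMSE): `sqrt(avg((y-y_hat)^2))`; squaring punishes large errors more heavily, and without the square root we get the mean squared error (MSE)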
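
A small sketch of both formulas, with made-up numbers chosen so that two sets of predictions have the same MAE but a different RMSE. The function names `mae` and `rmse` are my own; scikit-learn provides `mean_absolute_error` and `mean_squared_error` in `sklearn.metrics` for real use.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: avg(abs(y - y_hat))."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(avg((y - y_hat)^2))."""
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# Hypothetical values, chosen for illustration
y_true = [10.0, 10.0, 10.0, 10.0]
pred_small = [11.0, 11.0, 11.0, 11.0]   # four small errors of size 1
pred_large = [10.0, 10.0, 10.0, 14.0]   # a single large error of size 4

print(mae(y_true, pred_small), rmse(y_true, pred_small))  # 1.0 1.0
print(mae(y_true, pred_large), rmse(y_true, pred_large))  # 1.0 2.0
```

The MAE cannot tell the two cases apart, while the RMSE doubles for the single large error.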

### Performance evaluation for classification (categorical values)

**Metrics**
- Accuracy = correct predictions / total predictions `= (TP + TN) / (TP + TN + FP + FN)`
    - Accuracy is good for balanced datasets, i.e., datasets that contain a similar number of samples for each class
    - Why? Because if we have a dataset with 99% dogs, always guessing "dog" already yields 99% accuracy!
    - The complement of the accuracy is the Error Rate = Misclassification Rate: `(FP + FN) / (TP + TN + FP + FN) = 1 - Accuracy`

- Recall = correct T predictions / all real T samples `= TP / (TP + FN)`
    - The ability of a model to find all relevant cases: from all real T cases, how many are correctly predicted as T?

- Precision = correct T predictions / all predicted T samples `= TP / (TP + FP)`
    - The ability of a model to identify only the relevant data points: from all predicted T cases, how many are really T?
    - We often have a trade-off between Precision & Recall

- F1-score `= 2 * (precision * recall) / (precision + recall)`
    - It combines both precision & recall; it is the harmonic mean of both
    - The harmonic mean is typically used when the average of rates is desired
    - It punishes extreme values: if either precision or recall is very low, the F1-score is low too
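
The four metrics can be computed directly from the TP/FP/FN/TN counts. The sketch below uses made-up counts for an imbalanced dataset (990 negatives, 10 positives); the function name `classification_metrics` is hypothetical, and returning `0.0` when a denominator is zero is a convention I chose for the sketch. In practice, `accuracy_score`, `precision_score`, `recall_score` and `f1_score` from `sklearn.metrics` do this job.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumption of this sketch: undefined ratios (0/0) are reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Model A always predicts the majority (negative) class
always_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)
# Model B finds 8 of the 10 positives, with 4 false alarms
better = classification_metrics(tp=8, fp=4, fn=2, tn=986)

for name, metrics in [("always negative", always_negative), ("better", better)]:
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Model A reaches 99% accuracy while its recall and F1-score are 0, which is exactly why accuracy alone is misleading on imbalanced data. Model B's accuracy is only slightly higher (99.4%), but its F1-score of about 0.73 shows that it actually finds the positive class.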

**Confusion matrix (Type I & II errors)**

- Real/Actual (T or F) vs Prediction (T or F)
- Quadrants:
    - TP
    - FN - type II error: often the most severe errors, the ones we'd like to avoid (e.g., a missed disease)!
    - FP - type I error
    - TN
- From this matrix, we can compute many metrics, among them the ones defined above: Accuracy, Precision, Recall, F1-score
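
The four quadrants can be counted from a list of real labels and a list of predictions. This is a minimal sketch for the binary case; the function name `confusion_counts` and the label lists are made up for illustration. scikit-learn offers `confusion_matrix` in `sklearn.metrics`, but note that for binary 0/1 labels it orders the result as `[[TN, FP], [FN, TP]]`, which differs from the quadrant order listed above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP and TN for a binary problem."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            counts["TP"] += 1      # real T, predicted T
        elif actual == positive:
            counts["FN"] += 1      # real T, predicted F (type II error)
        elif predicted == positive:
            counts["FP"] += 1      # real F, predicted T (type I error)
        else:
            counts["TN"] += 1      # real F, predicted F
    return counts

# Hypothetical labels: 1 = T, 0 = F
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'FN': 1, 'FP': 1, 'TN': 3}
```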

[Confusion Matrix from the Wikipedia](https://en.wikipedia.org/wiki/Confusion_matrix)

![Confusion Matrix (Wikipedia)](../pics/confusion_matrix.png)

For each use case, we need to pick the most appropriate metric and choose the most appropriate threshold for considering that the model performs well enough. The cost of an error is not the same when predicting
- diseases, severe or mild
- defects
- ...
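
One reading of "threshold" is the decision threshold applied to the model's predicted scores. The sketch below shows how moving it trades precision against recall; it assumes the model outputs a score between 0 and 1 for class T, and the scores, labels and the function name `precision_recall_at` are all made up for illustration.

```python
def precision_recall_at(threshold, y_true, scores):
    """Predict T when score >= threshold, then compute precision and recall."""
    y_pred = [1 if score >= threshold else 0 for score in scores]
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    # Assumption of this sketch: undefined ratios (0/0) are reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical real labels (1 = T) and predicted scores for class T
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10]

for threshold in (0.25, 0.5, 0.75):
    precision, recall = precision_recall_at(threshold, y_true, scores)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
# threshold=0.25  precision=0.67  recall=1.00
# threshold=0.50  precision=0.75  recall=0.75
# threshold=0.75  precision=1.00  recall=0.50
```

A low threshold misses no T case but raises false alarms, which might suit screening for a severe disease; a high threshold does the opposite.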