# Machine Learning by Example (2020)

"*In traditional programming, the computer follows a set of predefined rules to
process the input data and produce the outcome. In machine learning, the computer
tries to mimic human thinking.*"

## Tasks can be classified into:

1) Unsupervised Learning: The training data carries indicative signals but no description (no labels or targets). Ex: anomaly detection;

2) Supervised Learning: The goal is to find a function that maps inputs to outputs, so the data comes with descriptions, targets, or desired outputs. Ex: sales forecasting;

3) Reinforcement Learning: The system adapts to dynamic conditions using data that provides feedback. It works toward a goal, evaluates its own performance, and adjusts accordingly. Ex: self-driving cars;


## Overfitting, underfitting, and the bias-variance trade-off

Concepts recap:

**Bias**: 

-> Error from incorrect assumptions in the learning algorithm:

\begin{align}
Bias[ \hat y ] = E[\hat y - y ]
\end{align}

**Variance**: 

-> Sensitivity of the model regarding variations in dataset:

\begin{align}
Variance = E[ \hat y^2 ] - E[\hat y]^2
\end{align}

**Mean Squared Error (MSE)**:

-> A measure for the error of estimation

\begin{align}
MSE = E[(y(x) - \hat y (x))^2]
\end{align}


**Overfitting**: The model fits the training set extremely well but predicts poorly on new data; in this sense, it lacks "external validity".

- Its bias is low but its variance is high, since predictions tend to vary widely;

**Underfitting**: The model performs badly on both the training and test sets.

- Its bias is high and its variance potentially low (when the model is extremely simple; think of a straight horizontal line as the prediction);
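
Both failure modes can be seen in a minimal sketch that uses only the standard library. The data-generating function, the noise level, and the two toy models (a constant predictor and a 1-nearest-neighbour regressor) are assumptions chosen for illustration; they are not taken from the book.

```python
import math
import random

def make_data(n, rng, noise=0.3):
    """Sample n points from y = sin(2*pi*x) + Gaussian noise (assumed toy data)."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, noise) for x in xs]
    return xs, ys

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_constant(xs, ys):
    """Underfitting model: always predict the training mean (a horizontal line)."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_1nn(xs, ys):
    """Overfitting model: memorise the training set, return the nearest point's label."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

rng = random.Random(0)
x_train, y_train = make_data(50, rng)
x_test, y_test = make_data(500, rng)

results = {}
for name, fit in [("constant", fit_constant), ("1-NN", fit_1nn)]:
    model = fit(x_train, y_train)
    train_err = mse(y_train, [model(x) for x in x_train])
    test_err = mse(y_test, [model(x) for x in x_test])
    results[name] = (train_err, test_err)
    print(f"{name:>8}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```

The 1-NN model reproduces the training set perfectly yet has non-zero test error, and that gap is the signature of overfitting. The constant model shows errors of comparable size on both sets, which is what underfitting looks like.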
    
**Bias-variance trade-off**

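More data and more complex models tend to reduce bias; however, the model then shifts more in order to fit the data better, which increases variance. Treating the target $y$ as fixed, so that expectations are taken over $\hat y$ only, the MSE decomposes as follows: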

\begin{align}
MSE & = E[(y - \hat y)^2]\\
& = E \left[(y-E[\hat y] + E[\hat y] - \hat y)^2\right]\\
& = E \left[(y-E[\hat y])^2 \right] + E\left[(E[\hat y] - \hat y)^2\right] + E\left[2\left(y-E[\hat y]\right)\left(E[\hat y] - \hat y\right)\right]\\
& = E \left[(y-E[\hat y])^2 \right] + E\left[(E[\hat y] - \hat y)^2\right] + 2\left(y-E[\hat y]\right)\left(E[\hat y] - E[\hat y]\right)\\
& = \left(E[\hat y - y]\right)^2 + E[\hat y^2] - E[\hat y ]^2\\
& = \underbrace{Bias[ \hat y ]^2}_{\text{Error of estimations}} + \underbrace{Variance[ \hat y ]}_{\hat y \text{ movement around its mean}}
\end{align}
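
The decomposition can be checked numerically. The sketch below assumes a fixed true value `y` and a hypothetical biased estimator (a shrunken mean of noisy observations); both are illustrative choices, not part of the original notes. Because the same sample averages are used on both sides, the identity holds up to floating-point rounding.

```python
import random

rng = random.Random(42)
y = 2.0  # fixed true value (assumption: the target itself carries no noise)

def estimate(rng, n=10, shrink=0.8):
    """Hypothetical biased estimator: a shrunken mean of n noisy observations of y."""
    sample = [y + rng.gauss(0, 1) for _ in range(n)]
    return shrink * sum(sample) / n

def mean(values):
    return sum(values) / len(values)

# Repeat the experiment many times to approximate expectations over y_hat.
y_hats = [estimate(rng) for _ in range(20_000)]

bias = mean([y_hat - y for y_hat in y_hats])                            # E[y_hat - y]
variance = mean([y_hat ** 2 for y_hat in y_hats]) - mean(y_hats) ** 2   # E[y_hat^2] - E[y_hat]^2
mse = mean([(y - y_hat) ** 2 for y_hat in y_hats])                      # E[(y - y_hat)^2]

print(f"bias^2 + variance = {bias ** 2 + variance:.6f}")
print(f"MSE               = {mse:.6f}")
```

With `shrink=0.8` the estimator is pulled toward zero, so the bias is close to `0.8 * 2.0 - 2.0 = -0.4`; setting `shrink=1.0` would remove the bias and leave only the variance term.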
    
**Cross-validation**

Cross-validation helps to avoid overfitting: the training set is split into a training set and a validation set, so the model is evaluated on data it was not fitted on.

It can be: (1) **Exhaustive**: all possible partitions are tested (e.g. Leave-One-Out Cross-Validation, LOOCV); (2) **Non-exhaustive**: not all possible partitions are used (e.g. k-fold cross-validation, where the set is split into k equal-size folds and a different fold is held out for testing in each of the k rounds);
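
The k-fold procedure can be sketched in a few lines of plain Python. The function name `k_fold_indices` and its parameters are hypothetical, introduced only for this illustration; in practice a library routine would usually be used instead.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Strided slices give k disjoint folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        # Train on every fold except the one held out in this round.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for round_no, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"round {round_no}: validate on {sorted(val_idx)}, train on {len(train_idx)} samples")

# LOOCV is the special case where k equals the number of samples.
loocv_splits = list(k_fold_indices(n_samples=10, k=10))
```

Every sample lands in the validation fold exactly once across the k rounds, so the averaged validation score uses all of the data without ever scoring a model on points it was trained on.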



## Chapter 2: Building a Movie Recommendation Engine with Naive Bayes

- Movie recommendation is a classification problem.

- Generally speaking, classification maps observations/features/predictive variables to target categories/labels/classes.


### Binary Classification

- Classify observations into one of two possible classes (e.g. spam detection, click-through prediction for online ads, whether or not a person likes a movie).

### Multiclass Classification

- Classify observations into one of more than two possible classes (e.g. handwritten digit recognition, where an image is classified as 9, 2, etc.).

### Multi-label Classification

- An observation can belong to more than one class (e.g. a movie can be classified as both adventure and sci-fi).

- A typical approach is to divide it into a set of binary classification problems.
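
As a rough illustration of that decomposition (often called binary relevance), the sketch below turns a multi-label dataset into one binary target vector per class. The movie names and genres are made up for the example.

```python
# Hypothetical movies and genre labels, for illustration only.
movies = {
    "Movie A": {"adventure", "sci-fi"},
    "Movie B": {"comedy"},
    "Movie C": {"adventure", "comedy"},
}

# Collect every genre that appears in at least one movie.
genres = sorted(set().union(*movies.values()))

# One binary target vector per genre: 1 if the movie has the genre, else 0.
binary_targets = {
    genre: [int(genre in labels) for labels in movies.values()]
    for genre in genres
}

for genre, target in binary_targets.items():
    print(f"{genre:>9}: {target}")
```

Each vector can then be handed to its own binary classifier, and the predicted label set for a movie is the set of genres whose classifier answers 1.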


### Exploring Naive Bayes

- Probabilistic classifier

#### Recall Bayes' Theorem

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

E.g. Suppose I have an unfair coin (U) and a fair one (F), where the unfair coin lands heads with probability P(H|U) = 90%, and one of the two coins is picked at random, so P(U) = P(F) = 0.5. Given that we got heads, what is the probability that the unfair coin was picked?

Answer: $$P(U|H) = \frac{P(H|U)P(U)}{P(H)} = \frac{P(H|U)P(U)}{P(H|U)P(U) + P(H|F)P(F)} = \frac{0.9 \times 0.5}{0.9 \times 0.5 + 0.5 \times 0.5} \approx 0.64$$
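
The same calculation can be written as a small helper. The function name `posterior` is hypothetical, and the equal priors encode the assumption that each coin is equally likely to be picked.

```python
def posterior(prior, likelihood):
    """Return P(hypothesis | evidence) for each hypothesis via Bayes' theorem."""
    # Denominator P(evidence), from the law of total probability.
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

prior = {"unfair": 0.5, "fair": 0.5}       # assumed: each coin equally likely to be picked
likelihood = {"unfair": 0.9, "fair": 0.5}  # P(head | coin)

post = posterior(prior, likelihood)
print(f"P(U|H) = {post['unfair']:.4f}")  # 0.45 / 0.70, prints 0.6429
```

Changing `prior` shows how strongly the answer depends on it: if the unfair coin were rarely picked, observing a head would be much weaker evidence for it.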

#### The mechanics of Naive Bayes