# Machine Learning by Example (2020)

"*In traditional programming, the computer follows a set of predefined rules to
process the input data and produce the outcome. In machine learning, the computer
tries to mimic human thinking.*"

## Tasks can be classified into:

1) Unsupervised Learning: The data used for learning carries indicative signals but no labels (no description of the desired output). Ex: anomaly detection;

2) Supervised Learning: The goal is to find a function that maps inputs to outputs, so the data comes with a description: targets or desired outputs. Ex: sales forecasting;

3) Reinforcement Learning: The system adapts to dynamic conditions using feedback from the data. There is a goal to reach, and the system evaluates its own performance and adjusts accordingly. Ex: self-driving cars;


## Overfitting, underfitting, and the bias-variance trade-off

Concepts recap:

**Bias**: 

-> Error from incorrect assumptions in the learning algorithm:

\begin{align}
Bias[ \hat y ] = E[\hat y - y ]
\end{align}

**Variance**: 

-> Sensitivity of the model to variations in the training dataset:

\begin{align}
Variance = E[ \hat y^2 ] - E[\hat y]^2
\end{align}

**Mean Squared Error (MSE)**:

-> A measure of the estimation error:

\begin{align}
MSE = E[(y(x) - \hat y (x))^2]
\end{align}


**Overfitting**: The model fits the training set extremely well but predicts poorly on new data; in this sense, it lacks "external validity".
    
- Its bias is low but its variance is high, since predictions vary widely with the training sample;
    
**Underfitting**: The model performs badly on both the training and test sets.

- Its bias is high and its variance is potentially low (when the model is extremely simple; think of a straight horizontal line as the prediction);
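
To make under- and overfitting concrete, here is a minimal sketch that fits polynomials of increasing degree to noisy samples of a sine curve. The target function, noise level, sample sizes and degrees are all illustrative assumptions, not taken from the book. Training error can only go down as the degree grows, while test error typically stops improving and then gets worse.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Assumed ground-truth relationship, chosen only for illustration
    return np.sin(2 * np.pi * x)

# Small noisy training set and a larger test set from the same process
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = true_fn(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = true_fn(x_test) + rng.normal(0, 0.2, x_test.size)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

errors = {}
for degree in (0, 3, 9):
    # Degree 0 is the "straight horizontal line" model (underfitting)
    coefs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (
        mse(y_train, np.polyval(coefs, x_train)),  # training error
        mse(y_test, np.polyval(coefs, x_test)),    # test error
    )
    print(f"degree={degree}: train MSE={errors[degree][0]:.4f}, "
          f"test MSE={errors[degree][1]:.4f}")
```

Comparing the two columns is the point: a large gap between a small training error and a larger test error is the usual symptom of overfitting.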
    
**Bias-variance trade-off**

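More data and more complex models tend to reduce bias; however, the model then shifts more to fit the particular data it sees, which increases variance. The decomposition below treats the true value $y$ as fixed (no noise term), so the only randomness is in $\hat y$: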

\begin{align}
MSE & = E[(y - \hat y)^2]\\
& = E \left[(y-E[\hat y] + E[\hat y] - \hat y)^2\right]\\
& = E \left[(y-E[\hat y])^2 \right] + E\left[(E[\hat y] - \hat y)^2\right] + E\left[2\left(y-E[\hat y]\right)\left(E[\hat y] - \hat y\right)\right]\\
& = E \left[(y-E[\hat y])^2 \right] + E\left[(E[\hat y] - \hat y)^2\right] + 2\left(y-E[\hat y]\right)\left(E[\hat y] - E[\hat y]\right)\\
& = \left(E[\hat y - y]\right)^2 + E[\hat y^2] - E[\hat y ]^2\\
& = \underbrace{Bias[ \hat y ]^2}_{\text{Error of estimations}} + \underbrace{Variance[ \hat y ]}_{\hat y \text{ movement around its mean}}
\end{align}
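
The decomposition can be checked numerically. The sketch below assumes a fixed target and a hypothetical estimator whose predictions are shifted (biased) and noisy; the numbers are arbitrary. The identity holds exactly for sample moments, so the two printed values agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(42)

y_true = 3.0  # fixed target value (assumption: no noise in y)
# Hypothetical estimator: off by 0.5 on average, with standard deviation 2
y_hat = y_true + 0.5 + rng.normal(0.0, 2.0, size=100_000)

bias = np.mean(y_hat - y_true)                          # E[y_hat - y]
variance = np.mean(y_hat ** 2) - np.mean(y_hat) ** 2    # E[y_hat^2] - E[y_hat]^2
mse = np.mean((y_true - y_hat) ** 2)                    # E[(y - y_hat)^2]

print("bias^2 + variance:", bias ** 2 + variance)
print("MSE:              ", mse)
```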
    
**Cross-validation**

Cross-validation helps to detect and avoid overfitting: the training data is split into a training set and a validation set, so the model is evaluated on data it was not fitted on.

It can be: (1) **Exhaustive**: all possible partitions are tested (e.g. Leave-One-Out Cross-Validation, LOOCV); (2) **Non-exhaustive**: not all possible partitions are used (e.g. k-fold cross-validation, where the set is split into k equal-size folds and a different fold is held out for testing in each of the k rounds).
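
The k-fold scheme described above can be sketched with plain NumPy index handling. The helper name `k_fold_indices` and its parameters are hypothetical, introduced only for this illustration; setting `k` equal to the number of samples gives LOOCV.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs, holding out one fold per round."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # shuffle once before splitting
    folds = np.array_split(indices, k)     # k folds of (nearly) equal size
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for round_number, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"round {round_number}: train={np.sort(train_idx)}, "
          f"validation={np.sort(val_idx)}")
```

Every sample ends up in the validation fold exactly once, so the average validation score uses all of the data without ever scoring a model on points it was trained on.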



## Chapter 2: Building a Movie Recommendation Engine with Naive Bayes

- Movie recommendation is a classification problem.

- Generally speaking, classification maps observations (features, predictive variables) to target categories (labels, classes).


### Binary Classification

- Classify observations into one of two possible classes (e.g. spam detection, click-through for online ads, whether or not a person likes a movie).

### Multiclass Classification

- Classify observations into one of more than two possible classes (e.g. handwritten digit recognition, where an image is classified as 9, 2, etc.).

### Multi-label Classification

- An observation can belong to more than one class (e.g. a movie can be classified as adventure, sci-fi).

- A typical approach is to divide it into a set of binary classification problems, one per label.
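
As a rough illustration of that decomposition, the sketch below turns a multi-label target into one binary target per label. The movies and genres are made-up examples; each resulting 0/1 vector could then be handed to any binary classifier.

```python
# Hypothetical genre sets for three movies (illustrative data only)
genres = [{"adventure", "sci-fi"}, {"comedy"}, {"sci-fi"}]

# Collect every label that appears at least once
labels = sorted(set().union(*genres))

# One binary target vector per label: 1 if the movie has that genre, else 0
binary_targets = {
    label: [int(label in movie_genres) for movie_genres in genres]
    for label in labels
}

for label, target in binary_targets.items():
    print(label, target)
```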


### Exploring Naive Bayes

- Probabilistic classifier

#### Recall Bayes' Theorem

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

E.g. Suppose I have an unfair coin (U) and a fair one (F), each equally likely to be picked ($P(U)=P(F)=0.5$), and the unfair one lands heads with probability $P(H|U)=0.9$. Given that we got heads, what is the probability that the unfair coin was picked?

Answer: $$P(U|H) = \frac{P(H|U)P(U)}{P(H)} = \frac{P(H|U)P(U)}{P(H|U)P(U) + P(H|F)P(F)} = \frac{0.9 \times 0.5}{0.9 \times 0.5+0.5 \times 0.5} \approx 0.64$$
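
The same arithmetic in code, as a quick sanity check. The variable names are illustrative, and the equal chance of picking either coin is the assumption behind the 0.5 factors in the formula.

```python
p_h_given_u = 0.9   # P(H|U): heads with the unfair coin
p_h_given_f = 0.5   # P(H|F): heads with the fair coin
p_u = p_f = 0.5     # assumption: either coin is picked with equal probability

# Law of total probability gives the evidence P(H)
p_h = p_h_given_u * p_u + p_h_given_f * p_f

# Bayes' theorem
p_u_given_h = p_h_given_u * p_u / p_h
print(round(p_u_given_h, 2))  # 0.64
```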

#### The mechanics of Naive Bayes

Let $k \in \{1,2,...,K\}$ index the classes. The probability that a sample belongs to class $y_k$ given the observed features $x$ is: 

$$P(y_k|x) = \frac{P(x|y_k)P(y_k)}{P(x)}$$

The names given for the components of the equation above are:

- **Prior**: $P(y_k)$ - How classes are distributed, without any knowledge of features;

- **Posterior**: $P(y_k|x)$ - Incorporates knowledge from observation;

- **Likelihood**: $P(x|y_k)$ - The distribution of the $n$ features given that the sample belongs to class $y_k$. 
The likelihood becomes very hard to compute when there are many features, since it is a large joint distribution.
To circumvent this issue, Naive Bayes assumes the features are independent given the class, which allows us to write:

$$P(x|y_k) = P(x_1|y_k)*P(x_2|y_k)*...*P(x_n|y_k)$$

The denominator of our Bayes formula, $P(x)$ (called the **evidence**), depends only on the overall distribution of features and not on the class, meaning that it acts as a constant, 
and so our posterior is proportional to:

$$ P(y_k|x) \propto P(x|y_k)P(y_k) = P(x_1|y_k)*P(x_2|y_k)*...*P(x_n|y_k)*P(y_k) $$

Note that, for a given sample, some feature value, say $x_{n'}$, may never have been observed together with class $y_k$ in the training data, giving $P(x_{n'}|y_k) = 0$. This would force the whole likelihood product, and hence the posterior for $y_k$, to zero. 
To avoid that, **Laplace Smoothing** is used:

$$ P(x_{n'}|y_k) = \frac{N_{x_{n'}|y_k} + \alpha}{N_{y_k}+ \alpha d}$$

Where $N_{x_{n'}|y_k}$ is how many times $x_{n'}$ occurred among samples of class $y_k$, $N_{y_k}$ is how many times $y_k$ was observed, $\alpha>0$ is the smoothing parameter ($\alpha=0$ would mean no 
smoothing; it is often set to 1), and $d$ is the number of possible values of the feature (2 in the binary case).
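
A small sketch of the smoothing formula, using hypothetical counts. The function name and arguments are introduced here for illustration only; the defaults follow the common choice $\alpha=1$ and the binary case $d=2$.

```python
def smoothed_likelihood(count_feature_in_class, count_class, alpha=1.0, d=2):
    """Laplace-smoothed estimate of P(x_n' | y_k).

    count_feature_in_class: times the feature value occurred within the class
    count_class: number of training samples in the class
    alpha: smoothing parameter; d: number of possible feature values
    """
    return (count_feature_in_class + alpha) / (count_class + alpha * d)

# Hypothetical case: the feature value never occurred in a class with 1 sample.
# Without smoothing the estimate would be 0/1 = 0; with smoothing it is 1/3.
print(smoothed_likelihood(0, 1))

# Hypothetical case: the value occurred once in a class with 3 samples -> 2/5
print(smoothed_likelihood(1, 3))
```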

Knowing the likelihoods and the priors, one can compute the unnormalized posteriors. Using the fact that the posteriors for a given $x$ sum to 1 across classes, the probability of each class
$y_k$ given $x$ is found by normalizing. 
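
The normalization step can be sketched as follows for binary features. The priors and per-feature likelihoods are illustrative values assumed for this example, and the `posterior` helper is hypothetical, not the book's implementation.

```python
import math

# Assumed inputs for the sketch (illustrative values)
priors = {"Y": 0.75, "N": 0.25}
# P(x_i = 1 | class) for each of three binary features
p_feature_is_one = {"Y": [0.4, 0.6, 0.4], "N": [1 / 3, 1 / 3, 2 / 3]}
x = [1, 1, 0]  # sample to classify

def posterior(x, priors, p_feature_is_one):
    scores = {}
    for label, prior in priors.items():
        # Naive independence assumption: multiply per-feature likelihoods
        likelihood = math.prod(
            p if xi == 1 else 1 - p
            for xi, p in zip(x, p_feature_is_one[label])
        )
        scores[label] = prior * likelihood  # proportional to the posterior
    evidence = sum(scores.values())         # normalizing constant P(x)
    return {label: score / evidence for label, score in scores.items()}

post = posterior(x, priors, p_feature_is_one)
print(post)
```

With many features the product of small probabilities can underflow, so practical implementations usually sum log-probabilities instead; the plain product is kept here to mirror the formula.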


### Implementing Naive Bayes

In [10]:
import numpy as np
# Toy dataset: predict whether a user likes the target movie ('Y' or 'N') based on whether she likes three other movies (1 = like, 0 = dislike)
X_train = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0]])
Y_train = ['Y', 'N', 'Y', 'Y']
X_test = np.array([[1, 1, 0]])    


In [11]:
# Group data by label ('Y' and 'N'), recording the indices where each class appears
from collections import defaultdict

def get_label_indices(labels):
    """
    Group samples based on their labels and return indices
    @param labels: list of labels
    @return: dict, {class1: [indices], class2: [indices]}
    """
    label_indices = defaultdict(list)
    for index, label in enumerate(labels):
        label_indices[label].append(index)
    return label_indices

In [12]:
label_indices = get_label_indices(Y_train)
print('label_indices:\n', label_indices)

label_indices:
 defaultdict(<class 'list'>, {'Y': [0, 2, 3], 'N': [1]})


In [None]:
def get_prior(label_indices):
    """
    Compute prior based on training examples
    @param label_indices: grouped sample indices by class
    @return: dictionary, with class label as key, corresponding prior
             as the value
    """
    # Count the samples in each class: {label: number of indices}
    prior = {label: len(indices) for label, indices in label_indices.items()}
    total_count = sum(prior.values())
    # Normalize the counts so that the priors sum to 1
    for label in prior:
        prior[label] /= total_count
    return prior

In [17]:
label_indices.items()

dict_items([('Y', [0, 2, 3]), ('N', [1])])
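
As a cross-check on the priors for the toy labels, here is a minimal alternative sketch using `collections.Counter`. It recomputes the class frequencies directly from the label list and does not depend on the helper functions above.

```python
from collections import Counter

Y_train = ['Y', 'N', 'Y', 'Y']  # same toy labels as above

counts = Counter(Y_train)            # number of samples per class
total = sum(counts.values())
prior = {label: count / total for label, count in counts.items()}

print(prior)  # {'Y': 0.75, 'N': 0.25}
```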