In [1]:
%run ../common/import_all.py

from common.setup_notebook import set_css_style, setup_matplotlib, config_ipython

config_ipython()
setup_matplotlib()
set_css_style()

# A learning machine

Machine Learning is the field at the intersection of statistics and computer science that aims to have machines perform tasks associated with intelligence: "understanding" patterns in data, generalising what has been learned to new data points, and extracting information from the data which isn't immediately visible to the naked eye.

## Paradigms of Machine Learning (main ones)

There are several ways to separate tasks and problems in machine learning into categories, based on the logical way of proceeding and the type of outcome desired. The main, traditional ones are *supervised* and *unsupervised* learning, which differ in the approach and in the type of data you feed in. Other paradigms have been attracting interest more recently (for instance, reinforcement learning), and the distinctions between paradigms are not always clear-cut.

### Supervised learning

In supervised learning, you teach the machine to learn from data points that have a target value against them specifying the ground truth. 

In the case of a *regression*, you want to predict the value of the dependent variable given the independent variable(s); in the case of a *classification*, the data is separated into categories identified by their labels, and you try to predict the class a data point belongs to. In both cases the machine is trained on existing matches between data points and targets, so that it can generalise to new data and output the target value for those.

In fact, *regression* and *classification* are the main problems you tackle in a supervised learning setting.
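
As a rough illustration of the two problems, here is a minimal sketch in plain Python. The helper names (`fit_line`, `fit_centroids`, `predict_class`) and the toy data are made up for this example, and the nearest-centroid rule is just one simple choice of classifier, not a method prescribed by these notes.

```python
# Toy supervised learning: every training point comes with its ground-truth target.

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (simple linear regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x

def fit_centroids(xs, labels):
    """Nearest-centroid classifier: store the mean of each class."""
    centroids = {}
    for label in set(labels):
        members = [x for x, l in zip(xs, labels) if l == label]
        centroids[label] = sum(members) / len(members)
    return centroids

def predict_class(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Regression: the targets are real numbers (here the data follows y = 2x + 1).
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
predicted_y = slope * 4.0 + intercept  # prediction for the unseen point x = 4

# Classification: the targets are class labels.
centroids = fit_centroids(
    [1.0, 1.5, 2.0, 8.0, 8.5, 9.0],
    ["low", "low", "low", "high", "high", "high"],
)
predicted_label = predict_class(centroids, 7.0)  # prediction for the unseen point x = 7
```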

### Unsupervised learning

Learning is unsupervised when there are no labels in the training set that specify the ground truth. This type of Machine Learning deals with extracting patterns from the data, understanding and manipulating its structure in order to separate groups of points which are somehow similar.

*Clustering* is one instance of unsupervised learning: you try to group data points together by similarity. Another task is *anomaly detection*, where the computer flags some data points as weird with respect to the rest. Other techniques which usually get classed in this paradigm deal with shrinking the data into a smaller representation which retains most of the relevant original information; these usually go under the name of *dimensionality reduction*.
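
To make clustering concrete, the following is a minimal sketch of k-means (Lloyd's algorithm) on one-dimensional data. The algorithm is chosen here only as an illustration; the function name `k_means_1d`, the toy points and the fixed starting centroids are assumptions of this example.

```python
def k_means_1d(points, centroids, n_iter=10):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]  # note: no labels anywhere
centroids, clusters = k_means_1d(points, centroids=[0.0, 6.0])
```

No ground truth is used at any point: the two groups emerge only from the distances between the points.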

## Regression

## Classification

### Binary

### Multiclass

The methods described here are used for multiclass classification when the algorithm does not support it natively, that is, when it is suited to binary classification only.

#### One-vs.-all

Also called *one-vs.-rest*, in this technique a single classifier per class is trained: the samples in the training set which are labelled for that class are considered positive and all the other samples negative. Each of these classifiers outputs a confidence score.

For an unseen data point, each classifier gets applied and the predicted label will be the one for which the corresponding classifier has the highest score. 

The problem with this approach is that each single classifier is trained on an unbalanced set, because the negatives (all the other classes together) typically outnumber the positives.
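
The scheme can be sketched as follows. The binary scorer used here (distance to the negative centroid minus distance to the positive centroid) is a hypothetical stand-in for whatever binary classifier you actually have; the function names and the toy data are likewise assumptions of this example.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def train_binary_scorer(positives, negatives):
    """Hypothetical binary classifier returning a confidence score.

    The score is positive when x is closer to the positive centroid.
    """
    pos_c, neg_c = centroid(positives), centroid(negatives)
    return lambda x: math.dist(x, neg_c) - math.dist(x, pos_c)

def train_one_vs_rest(X, y):
    """One scorer per class: that class's samples are positive, all others negative."""
    scorers = {}
    for label in sorted(set(y)):
        positives = [x for x, t in zip(X, y) if t == label]
        negatives = [x for x, t in zip(X, y) if t != label]
        scorers[label] = train_binary_scorer(positives, negatives)
    return scorers

def predict_one_vs_rest(scorers, x):
    """Apply every scorer and return the label with the highest confidence."""
    return max(scorers, key=lambda label: scorers[label](x))

X = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
y = ["a", "a", "b", "b", "c", "c"]
scorers = train_one_vs_rest(X, y)
prediction = predict_one_vs_rest(scorers, (9, 1))
```

Note how each scorer sees two positives against four negatives, which is the imbalance mentioned above.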

#### One-vs.-one

In this technique, $\frac{k(k-1)}{2}$ classifiers get trained, where $k$ is the number of classes: each classifier receives a pair of classes from the set and learns to distinguish between them. 

For an unseen data point, a voting scheme is applied: all the classifiers are run on it and the class which receives the highest number of votes gets predicted.

The problem with this approach is that some regions of the input space may be ambiguous, with two or more classes receiving the same number of votes.
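
A minimal sketch of the pairwise scheme, under the same kind of assumptions as before: the nearest-centroid rule stands in for any binary classifier, and all names and data are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def train_pairwise(points_a, points_b, label_a, label_b):
    """Hypothetical binary classifier: predicts the label of the nearer centroid."""
    c_a, c_b = centroid(points_a), centroid(points_b)
    return lambda x: label_a if math.dist(x, c_a) <= math.dist(x, c_b) else label_b

def train_one_vs_one(X, y):
    """Train one classifier per unordered pair of classes: k(k-1)/2 in total."""
    by_class = {}
    for x, label in zip(X, y):
        by_class.setdefault(label, []).append(x)
    return [
        train_pairwise(by_class[a], by_class[b], a, b)
        for a, b in combinations(sorted(by_class), 2)
    ]

def predict_one_vs_one(classifiers, x):
    """Every pairwise classifier votes; the most voted class wins.

    Ties are resolved arbitrarily here (first class counted), which is exactly
    the ambiguity of this approach.
    """
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

X = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1), (0, 10), (1, 10)]
y = ["a", "a", "b", "b", "c", "c", "d", "d"]
classifiers = train_one_vs_one(X, y)  # k = 4 classes, so 6 classifiers
prediction = predict_one_vs_one(classifiers, (9, 1))
```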

## Garbage in, garbage out

This phrase is used to mean that if poor data is fed into an algorithm, however sophisticated it may be, poor results are obtained. 

## Lazy and eager learning

In a *lazy learning* approach, the algorithm stores the training data and defers the computation until a prediction on test data is requested. It uses a predictive function that gets approximated locally, around each query point, and is thereby adaptable to changes in the data. An example is kNN.

In an *eager learning* approach, instead, the predictive function is built during training, before any test data is seen. The function is built globally, which makes the approach less space-consuming, as the training data need not be kept once the model is built. Examples are Naive Bayes and neural networks.
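
The contrast can be sketched with two tiny classifiers. `LazyKNN` is a bare-bones kNN; `EagerCentroids` is a nearest-centroid model picked here only because it is short (it is not one of the examples named above). Class names and data are assumptions of this sketch.

```python
import math
from collections import Counter

class LazyKNN:
    """Lazy learner: fit only stores the data; all computation happens per query."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)  # the whole training set is kept
        return self

    def predict(self, x):
        # Local approximation: only the k nearest stored points matter.
        nearest = sorted(zip(self.X, self.y), key=lambda pair: math.dist(pair[0], x))
        return Counter(label for _, label in nearest[: self.k]).most_common(1)[0][0]

class EagerCentroids:
    """Eager learner: fit builds a global model; the training data is not kept."""

    def fit(self, X, y):
        groups = {}
        for x, label in zip(X, y):
            groups.setdefault(label, []).append(x)
        self.centroids = {
            label: tuple(sum(c) / len(pts) for c in zip(*pts))
            for label, pts in groups.items()
        }
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda label: math.dist(x, self.centroids[label]))

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "b", "b", "b"]
lazy = LazyKNN(k=3).fit(X, y)
eager = EagerCentroids().fit(X, y)
lazy_prediction = lazy.predict((1, 1))
eager_prediction = eager.predict((1, 1))
```

After fitting, the lazy model holds all six training points, while the eager one holds just one centroid per class.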

## Ensemble methods

Ensemble methods combine predictions from several estimators to improve the ability to generalise to new data. There are two main classes.

### Averaging methods

The estimators are trained independently and their predictions get averaged, which reduces the variance. The main way to obtain this is via *bagging* (which stands for *bootstrap aggregating*): several instances of the algorithm are run on random subsets of the training set, where the random subsets are selected with replacement. The method works well to reduce overfitting; an example is the Random Forest.
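
A minimal sketch of bagging for a regression task follows. The base estimator is a least-squares line, used here as a hypothetical stand-in to keep the code short (a Random Forest bags decision trees instead); function names, the seed and the toy data are assumptions of this example.

```python
import random

def fit_line(xs, ys):
    """Base estimator (a stand-in): least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:  # degenerate bootstrap sample: fall back to a constant prediction
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x

def bagging_fit(xs, ys, n_estimators=25, seed=0):
    """Fit one base estimator per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # same size as the training set
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Average the predictions of all the estimators."""
    return sum(slope * x + intercept for slope, intercept in models) / len(models)

# Noisy observations scattered around y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1.2, 2.7, 5.3, 6.8, 9.1, 11.2, 12.8, 15.3, 16.9, 19.1]
models = bagging_fit(xs, ys)
prediction = bagging_predict(models, 10.0)
```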

### Boosting methods

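The estimators are trained sequentially, each one concentrating on the errors made by the previous ones, so that several weak learners are combined into a stronger one; this reduces the bias. Boosting tends to be more sensitive than bagging to noisy data and outliers, because the misclassified points keep gaining weight; an example is AdaBoost.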
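
Below is a minimal sketch of AdaBoost on one-dimensional data with labels in {-1, +1}, using decision stumps (a single threshold) as weak learners. The function names and the toy data are assumptions of this example; it follows the standard discrete AdaBoost weight update.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump: predicts `polarity` on one side of the threshold, its opposite on the other."""
    return polarity if x <= threshold else -polarity

def best_stump(xs, ys, weights):
    """Weak learner: the (threshold, polarity) pair with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for w, x, y in zip(weights, xs, ys)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, n_rounds=3):
    """Train stumps sequentially, re-weighting the points after each round."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # say of this stump in the vote
        ensemble.append((alpha, threshold, polarity))
        # Misclassified points gain weight, correctly classified ones lose it.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def boosted_predict(ensemble, x):
    """Weighted vote of all the stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# No single threshold separates these labels, but a combination of stumps can.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, n_rounds=3)
predictions = [boosted_predict(ensemble, x) for x in xs]
```

Each stump alone misclassifies at least two of the six points, while the weighted combination of three stumps fits this toy set.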

> TODO has to be massively improved, has to become a generic overview of ML and how to do it

* regression/classification
* clustering
* feat eng
* statistical vs. ML approaches
* cost functions
* overfitting/underfitting

## Problems

### Regression

### Classification

## Workflow

> TODO to be improved/changed

1. Define the problem and check if the data you have is good and informative enough
2. Feature engineering
3. Choose a set of algorithms
4. Do cross-validation
5. Examine the metrics
6. Regularize
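
Steps 4 and 5 of the list above can be sketched as a plain k-fold cross-validation loop. The function names, the mean-predictor baseline and the mean squared error metric are assumptions of this example, chosen only to keep it self-contained; the sketch also assumes there are at least `k` samples.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the indices and yield (train, validation) index lists for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def cross_validate(fit, score, X, y, k=5):
    """Average validation score of a model over k folds."""
    scores = []
    for train, val in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [X[i] for i in val], [y[i] for i in val]))
    return sum(scores) / len(scores)

def fit_mean(X, y):
    """Hypothetical baseline model: always predict the mean of the training targets."""
    return sum(y) / len(y)

def mean_squared_error(model, X, y):
    """Metric: mean squared error of the constant prediction `model`."""
    return sum((model - t) ** 2 for t in y) / len(y)

X = list(range(10))
y = [2.0 * x + 1.0 for x in X]
cv_error = cross_validate(fit_mean, mean_squared_error, X, y, k=5)
```

Comparing `cv_error` across the candidate algorithms chosen in step 3 is one way to decide between them, since every model is scored on data it was not trained on.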