In [None]:
# Build a quick A-Z index; copy the printed output into a Markdown cell.
import string
links = ['[{}](#{})'.format(k, k) for k in string.ascii_uppercase]
# Joining with the separator avoids a leading space and a trailing ' ·'.
print(' · '.join(links))

# Scikit-learn algorithm cheatsheet
Taken from https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
![img](https://scikit-learn.org/stable/_static/ml_map.png)

# Glossary of terms
Here are the terms collected during the learning journey.

**Index**

[A](#A) · [B](#B) · [C](#C) · [D](#D) · [E](#E) · [F](#F) · [G](#G) · [H](#H) · [I](#I) · [J](#J) · [K](#K) · [L](#L) · [M](#M) · [N](#N) · [O](#O) · [P](#P) · [Q](#Q) · [R](#R) · [S](#S) · [T](#T) · [U](#U) · [V](#V) · [W](#W) · [X](#X) · [Y](#Y) · [Z](#Z)

***
See also the scikit-learn glossary: [sklearn](https://scikit-learn.org/stable/glossary.html)

# B
* **Black box model:** as opposed to a [white box model](#W), a model whose inner workings we can't interpret: we know that it works well but not why (e.g., a neural network).

# C
* **Classification:** the supervised learning task where the targets are categories (class labels).

* **Curse of dimensionality:** when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. [[wiki]](https://en.wikipedia.org/wiki/Curse_of_dimensionality) [[sklearn]](https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html#the-curse-of-dimensionality)
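
A rough way to see that exponential growth, as a sketch only: assume we want to cover a unit hypercube with a grid of 10 points along each axis. The spacing and the helper name `points_needed` are illustrative choices, not taken from the sources linked above.

```python
def points_needed(dimensions, points_per_axis=10):
    """Grid points needed to keep the same spacing along every axis."""
    return points_per_axis ** dimensions


for d in (1, 2, 3, 10):
    print(d, points_needed(d))
```

Keeping the same density that 10 points give in one dimension already takes 1000 points in three dimensions.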

# D
* **Discretization:** reducing the number of values of a continuous feature by grouping them into intervals (bins).
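
As a minimal sketch, assuming equal-width bins (one of several possible binning strategies), discretization can be written in plain Python. The function name `discretize` and the sample `ages` are made up for illustration.

```python
def discretize(values, n_bins):
    """Map each value to the index of its equal-width bin (0 .. n_bins - 1)."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        # All values are identical, so everything falls in the first bin.
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


ages = [3, 17, 25, 40, 62, 80]
print(discretize(ages, 4))  # [0, 0, 1, 1, 3, 3]
```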

# F
* **Feature:** variable, column in a dataset, dimension: all of these refer to the same concept.

* **Fit (method):** the `fit` method trains the algorithm on the data, that is, it estimates the parameters of the model that will be used for predictions.
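
To illustrate the idea, here is a toy estimator that follows the `fit` / `predict` convention. `MeanRegressor` is a hypothetical class written for this glossary, not part of scikit-learn; its only learned parameter is the mean of the targets.

```python
class MeanRegressor:
    """Toy model: always predicts the mean of the targets seen in fit."""

    def fit(self, X, y):
        # "Learning" here is just storing one parameter, the mean target.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        # One prediction per input row, all equal to the stored mean.
        return [self.mean_ for _ in X]


model = MeanRegressor().fit([[1], [2], [3]], [10.0, 20.0, 30.0])
print(model.predict([[4], [5]]))  # [20.0, 20.0]
```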

# K
* **Kernel function:** a symmetric function over pairs of data points. At prediction time each pair is made of:
    1. A training instance, and
    2. A testing one.

  It outputs a number measuring how similar the two points are (for common kernels, the closer the points, the larger the value), which is used in turn by a *kernel method* [wiki](https://en.wikipedia.org/wiki/Positive-definite_kernel)
* **Kernel method:** predicts the value for an unlabeled (testing) point by summing the weighted kernels between it and all the training points, so the farther a training point is, the less influence it has [wiki](https://en.wikipedia.org/wiki/Kernel_method)
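
A small sketch of both entries, under stated assumptions: the kernel is a Gaussian (RBF) kernel, and the prediction is a kernel-weighted average of the training targets (Nadaraya-Watson style), which is just one simple kernel method among many. The names `gaussian_kernel`, `kernel_predict` and the parameter `gamma` are illustrative.

```python
import math


def gaussian_kernel(a, b, gamma=1.0):
    """Symmetric similarity: 1.0 for identical points, near 0 for distant ones."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def kernel_predict(X_train, y_train, x_new, gamma=1.0):
    """Average the training targets, weighted by their similarity to x_new."""
    weights = [gaussian_kernel(x, x_new, gamma) for x in X_train]
    return sum(w * y for w, y in zip(weights, y_train)) / sum(weights)


X_train = [[0.0], [1.0], [5.0]]
y_train = [0.0, 1.0, 5.0]
prediction = kernel_predict(X_train, y_train, [0.9])
print(prediction)
```

The training point at 5.0 is far from 0.9, so its weight is almost zero and the prediction stays between the targets of the two nearby points.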

# L
* **L1 norm:** metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates (also known as Manhattan or taxicab distance). Can also refer to the L1 loss function:

* **L1-norm loss function:** also known as least absolute deviations (LAD) or least absolute errors (LAE). It minimizes the sum (S) of the absolute differences between the target values ($Y_i$) and the estimated values ($f(x_i)$), that is, $S = \sum_i |Y_i - f(x_i)|$ [[s]](http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/)


* **L2 norm:** or Euclidean norm; the distance between two points is the straight-line distance given by the Pythagorean theorem. Can also refer to the least-squares loss function:

* **L2-norm loss function:** also known as least squares error (LSE). It minimizes the sum (S) of the squared differences between the target values ($Y_i$) and the estimated values ($f(x_i)$), that is, $S = \sum_i (Y_i - f(x_i))^2$ [[s]](http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/)

**L1-L2 comparison**

![img](https://upload.wikimedia.org/wikipedia/commons/thumb/0/08/Manhattan_distance.svg/200px-Manhattan_distance.svg.png)
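
The two norms and the two loss functions from this section can be sketched in plain Python. The function names and the sample numbers are illustrative; the losses are the plain sums (S) from the definitions, without any averaging.

```python
import math


def l1_distance(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def l2_distance(p, q):
    """Straight-line (Euclidean) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def l1_loss(y_true, y_pred):
    """Sum of the absolute errors."""
    return sum(abs(y - f) for y, f in zip(y_true, y_pred))


def l2_loss(y_true, y_pred):
    """Sum of the squared errors."""
    return sum((y - f) ** 2 for y, f in zip(y_true, y_pred))


print(l1_distance((0, 0), (3, 4)))  # 7
print(l2_distance((0, 0), (3, 4)))  # 5.0

y_true = [1.0, 2.0, 10.0]
y_pred = [1.0, 2.0, 4.0]
print(l1_loss(y_true, y_pred))  # 6.0
print(l2_loss(y_true, y_pred))  # 36.0
```

The single large error counts 6 in the L1 loss and 36 in the L2 loss, which is why the L2 loss is more sensitive to outliers.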

# M
* **Multi-class classification:** classification where the target can take more than two classes (as opposed to binary classification, which has exactly two).
* **Mutual information:** or information gain. A metric that shows how much information about a variable *A* is contained in another variable *B*. The lower the number (close to 0), the more independent the variables are.
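
For two discrete variables, mutual information can be estimated directly from value counts. This is a minimal sketch assuming equally long sequences of discrete values and a result in bits; the function name `mutual_information` is illustrative.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    count_ab = Counter(zip(a, b))
    mi = 0.0
    for (va, vb), c in count_ab.items():
        p_ab = c / n  # joint probability of the pair (va, vb)
        p_a, p_b = count_a[va] / n, count_b[vb] / n  # marginal probabilities
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


a = [0, 0, 1, 1]
print(mutual_information(a, a))             # identical variables: 1.0
print(mutual_information(a, [0, 1, 0, 1]))  # independent variables: 0.0
```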

# N
* **Normalization:** the process of scaling individual samples to have unit norm, that is, each sample vector is rescaled to length 1, so every component ends up in the range [-1, 1] (or [0, 1] if all values are non-negative).
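
A minimal sketch of normalizing a single sample, assuming the L2 norm is the one used (the L1 norm is another common choice). The function name `normalize_l2` is illustrative.

```python
import math


def normalize_l2(sample):
    """Rescale one sample so that its L2 norm (length) is 1."""
    norm = math.sqrt(sum(v * v for v in sample))
    if norm == 0:
        # An all-zero sample has no direction, so leave it unchanged.
        return list(sample)
    return [v / norm for v in sample]


unit = normalize_l2([3.0, -4.0])
print(unit)  # [0.6, -0.8]
```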

# O
* **One vs one:** when we deal with multi-class classification, that is, our targets have more than two classes, this is the strategy where we evaluate the classes in pairs. Say we have three classes, A, B and C. The OVO ensemble will be composed of 3 (= 3 * (3 - 1) / 2) binary classifiers. The first will discriminate A from B, the second A from C, and the third B from C. At the prediction stage, the class that gets the highest number of votes ("+1" predictions) is the winner.
* **One vs rest:** (aka one-vs-all) when we deal with multi-class classification, this is the strategy that trains one binary classifier (estimator) per class, each separating that class from all the others, and then picks the class whose classifier gives the highest confidence. [wiki](https://en.wikipedia.org/wiki/Multiclass_classification#Transformation_to_binary)
* **Outliers:** An outlier is an observation that lies an abnormal distance from other values in a random sample from a population [[s]](https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm)
* **Overfitting:** models that do not generalise well beyond the training data, that is, they stick so closely to the dataset they learned from that they don't work well with other datasets or predict future observations reliably.  
In the following graph the green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on that data and it is likely to have a higher error rate on new unseen data, compared to the black line.

![img](https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Overfitting.svg/300px-Overfitting.svg.png)
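
Going back to the one-vs-one entry in this section, the pair count and the voting step can be sketched as follows. The outputs in `pair_winners` are hypothetical; in practice they would come from three trained binary classifiers.

```python
from collections import Counter
from itertools import combinations

classes = ["A", "B", "C"]

# One binary classifier per unordered pair of classes: n * (n - 1) / 2 pairs.
pairs = list(combinations(classes, 2))
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]

# Hypothetical winners reported by each pairwise classifier for one sample.
pair_winners = {("A", "B"): "A", ("A", "C"): "C", ("B", "C"): "C"}

# Each classifier casts one vote; the most voted class is the prediction.
votes = Counter(pair_winners[pair] for pair in pairs)
winner = votes.most_common(1)[0][0]
print(winner)  # C
```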

# P
* **Pipeline:** the stages in a machine learning project. Usually they are:
    * Preprocessing
    * Feature selection
    * ML algorithm
    * Model
    * Evaluation
* **Preprocessing:** the stage where we clean, prepare and make sense of the data we'll work with. 

# R
* **Regression:** a broad concept. For the moment, take it as the family of algorithms used in supervised learning where the targets are continuous values (rather than labels/categories).

# S
* **Supervised Learning:** the data also comes with an output (also known as the class column or target). It consists of either classification (outputs are categories) or regression (outputs are numbers).

* **Standardization:** the process of rescaling the features so that each one has mean μ=0 and standard deviation σ=1, the parameters of a standard Gaussian distribution. The shape of the distribution itself is not changed.
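
A minimal sketch of standardizing one feature column with the standard library, assuming the population standard deviation is used and that the column is not constant (a zero standard deviation would cause a division by zero). The function name `standardize` and the sample `heights` are illustrative.

```python
import statistics


def standardize(column):
    """Rescale a feature column to mean 0 and standard deviation 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation
    return [(v - mean) / std for v in column]


heights = [150.0, 160.0, 170.0, 180.0, 190.0]
z_scores = standardize(heights)
print(z_scores)
print(statistics.fmean(z_scores), statistics.pstdev(z_scores))
```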

# U
* **Unsupervised learning:** the data doesn't come with any outcome (no class column), so it is used in an exploratory way. [Mathworks](https://www.mathworks.com/discovery/unsupervised-learning.html)

# W
* **White box model:** those models in which we can figure out what's going on inside. The opposite of a [black box model](#B).

# X
* **$X$:** the whole data matrix, that is, the input vectors. $X_i$ is a single instance.

# Y
* **$y$:** the target column or output, that is, the real outcome we get when we input $X$. $y_i$ is the target of a single instance.
* **$\hat{y}$:** a prediction of the learned model.