In [None]:
import numpy as np

# Create a quick index, copy the output into a MD cell.
abc = 'A B C D E F G H I J K L M N O P Q R S T U V W X Y Z'
arr = [' [{}](#{}) ·'.format(k, k) for k in abc.split()]
print(''.join(arr))

# Scikit-learn algorithm cheatsheet
Taken from https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
![img](https://scikit-learn.org/stable/_static/ml_map.png)

# Glossary of terms
Here are the terms collected during the learning journey.

**Index**

[A](#A) · [B](#B) · [C](#C) · [D](#D) · [E](#E) · [F](#F) · [G](#G) · [H](#H) · [I](#I) · [J](#J) · [K](#K) · [L](#L) · [M](#M) · [N](#N) · [O](#O) · [P](#P) · [Q](#Q) · [R](#R) · [S](#S) · [T](#T) · [U](#U) · [V](#V) · [W](#W) · [X](#X) · [Y](#Y) · [Z](#Z)

***
See also the scikit-learn glossary: [sklearn](https://scikit-learn.org/stable/glossary.html)

# B
* **Bias:** the bias of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated.[[wiki]](https://en.wikipedia.org/wiki/Bias_of_an_estimator) [[Collective wisdom]](Collective%20wisdom/personal%20review.ipynb#A-BRIEF-PRIMER-ON-STATISTICAL-DECISION-THEORY)
* **Bias-Variance tradeoff:** [[wiki]](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff) Models with lower bias (nice) tend to have a higher variance (bad, overfitting) 
* **Black box model:** as opposed to a [white box model](#W), those models in which we can't figure out what's going on inside; we know that they work fine but not why (e.g., a neural network)

# C
* **Classification:** the kind of supervised learning where the targets are categories (labeled data)

* **Coefficient of determination:** $r^2$, a metric of how well an algorithm performs. Mathematically, it is one minus the ratio between the sum of the squared residuals and the total sum of squares. [[wiki]](https://en.wikipedia.org/wiki/Coefficient_of_determination)

* **Covariance matrix:** shows the multidimensional covariances (correlations). It generalizes the notion of variance to multiple dimensions. [[wiki]](https://en.wikipedia.org/wiki/Covariance_matrix)
    
     ![img](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/GaussianScatterPCA.svg/220px-GaussianScatterPCA.svg.png)
    
    The picture above shows sample points from a bivariate Gaussian distribution with a standard deviation of 3 in roughly the lower left-upper right direction and of 1 in the orthogonal direction. Because the $x$ and $y$ components co-vary, the variances of $x$ and $y$ do not fully describe the distribution. A $2\times 2$ covariance matrix is needed; the directions of the arrows correspond to the eigenvectors of this covariance matrix and their lengths to the square roots of the eigenvalues.  
    
    A covariance matrix is often built from a vector to show how related its components are. $cov (\mathbf{X}, \mathbf{X}) = E[(\mathbf{X}-\mu_X)(\mathbf{X}-\mu_X)^T]$. Note that the expectation is taken across samples of the same class. [[PCA visualized]](https://notsquirrel.com/pca/). Because it captures how components co-vary, it can be used to guess the values of a pixel's neighbors from the pixel's own value. Alternatively, when applied as a linear transformation to an unknown image, the outcome of that transformation can help identify the image.

* **Curse of dimensionality:** when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. [[wiki]](https://en.wikipedia.org/wiki/Curse_of_dimensionality) [[sklearn]](https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html#the-curse-of-dimensionality)
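
As a rough illustration of the coefficient of determination entry above, the sketch below computes $r^2$ by hand with NumPy. The arrays `y_true` and `y_pred` are made-up values chosen for the example, not data from these notes.

```python
import numpy as np

# Hypothetical targets and predictions, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

# Sum of squared residuals: how far the predictions are from the targets
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: how far the targets are from their own mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# One minus the ratio between both sums
r2 = 1.0 - ss_res / ss_tot
print(r2)
```

A value close to 1 means the residuals are small compared with the spread of the targets.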

In [None]:
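# Build the outer product of a centered vector: the single-sample
# building block of a covariance matrix

# Create a column vector
X = np.arange(5.).reshape(5, 1)

# Center it: subtract the mean, so each entry is its distance from the expected value
X -= X.mean()

# Finally build the 5x5 matrix by dotting the column vector with its transpose;
# averaging such products over many samples gives the covariance matrix
np.dot(X, X.T)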

# D
* **Discretization:** reduce the number of values of a continuous feature by grouping them into intervals (bins)
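
A minimal sketch of discretization, assuming NumPy's `np.digitize` is an acceptable way to do the binning. The `ages` values and the bin `edges` are hypothetical.

```python
import numpy as np

# Hypothetical continuous feature (ages), for illustration only
ages = np.array([3, 17, 25, 42, 67, 80])

# Bin edges define the intervals: below 18, [18, 40), [40, 65), 65 and above
edges = np.array([18, 40, 65])

# np.digitize returns, for each value, the index of the interval it falls into
bins = np.digitize(ages, edges)
print(bins)
```

Six distinct ages collapse into four interval labels, which is the whole point of the technique.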

# E
* **Eigenvalues:** the factor by which eigenvectors are scaled up or down.

* **Eigenvectors:** when a linear transformation is applied to the input space, these are the vectors that keep pointing in the same direction and are only scaled up or down.

    When dealing with multivariate normal distributions, the eigenvectors of the covariance matrix are the axes of the ellipses 
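
The two entries above can be checked numerically. This is only a sketch: the matrix `M` is an arbitrary symmetric example, and `np.linalg.eigh` is used because it is meant for symmetric matrices.

```python
import numpy as np

# A symmetric matrix standing in for a linear transformation (arbitrary choice)
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns the eigenvalues in ascending order and one eigenvector per column
eigenvalues, eigenvectors = np.linalg.eigh(M)

# Pick the eigenvector with the largest eigenvalue
v = eigenvectors[:, 1]
lam = eigenvalues[1]

# Applying M does not change the direction of v, it only rescales it by lam
print(np.allclose(M @ v, lam * v))
```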

# F
* **Fat tail:** some real-world events produce extreme outcomes with a probability far greater than a normal distribution would predict. These outcomes lie 3 or more standard deviations from the mean and usually show up as outliers in plots (such as box-and-whisker plots). These kinds of events may be the hardest to forecast. [[wiki]](https://en.wikipedia.org/wiki/Fat-tailed_distribution)

* **Feature:** variable, column in a dataset, dimension, all of these mean the same concept

* **Fit (method):** the `fit` method trains the algorithm on the data, that is, it creates the model that will be used for predictions.

# H
* **Hyperplane:** a subspace whose dimension is one less than that of its ambient space. [[wiki]](https://en.wikipedia.org/wiki/Hyperplane)

# I
* **Inverse problem:** starting from some observed data points, come up with the mathematical model (if any) that could have generated them.

# J
* **Jensen inequalities:** for a convex loss, the loss of the expectation is always less than or equal to the expectation of the losses. [[Collective Wisdom]](Collective%20wisdom/personal%20review.ipynb#Jensen-inequalities)

* **Joint probability:** the probability of the intersection of events, that is, the probability of several events happening together, whether they are independent or dependent. 
[[source]](https://www.statisticshowto.datasciencecentral.com/joint-probability-distribution/)
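
The Jensen inequality entry above can be checked numerically. The sketch assumes a squared loss (which is convex) and a made-up random sample; neither comes from the linked notebook.

```python
import numpy as np

# Hypothetical sample, with a fixed seed so the run is repeatable
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=1000)

def loss(v):
    # Squared loss, a convex function
    return v ** 2

loss_of_mean = loss(x.mean())   # loss of the expectation
mean_of_loss = loss(x).mean()   # expectation of the losses

print(loss_of_mean <= mean_of_loss)
```

For the squared loss the gap between both quantities is exactly the (population) variance of the sample, which is never negative.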

# K
* **Kernel function:** a symmetric function over pairs of data points. Those pairs are made of: 
    1. The training instances, and
    2. The testing ones.  
    
  It outputs a number that measures how similar (close) the two points are, which is used in turn by a *kernel method*. [[wiki]](https://en.wikipedia.org/wiki/Positive-definite_kernel)
* **Kernel method:** predicts the value of an unlabeled (testing) point by summing the kernel-weighted contributions of all the training points, so the farther a training point is, the less influence it has. [[wiki]](https://en.wikipedia.org/wiki/Kernel_method) 
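
A small sketch of both entries, under some assumptions: the kernel is a Gaussian (RBF) kernel, and the kernel method is a plain kernel-weighted average (Nadaraya-Watson style), which is just one simple member of the family. The names `rbf_kernel` and `predict`, the `gamma` parameter, and the data are all hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Gaussian (RBF) kernel: 1 for identical points, decaying towards 0 with distance
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Hypothetical 1-D training set
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])

def predict(x_new, gamma=1.0):
    # Weight every training target by its kernel similarity to the new point
    weights = np.array([rbf_kernel(x_new, x_i, gamma) for x_i in X_train])
    return np.sum(weights * y_train) / np.sum(weights)

y_hat = predict(np.array([1.5]))
print(y_hat)
```

Training points far from `x_new` get weights close to zero, so the prediction is dominated by the nearest neighbors.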

# L
* **L1 norm:** metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. Can also refer to L1 loss function:

* **L1-norm loss function**: also known as least absolute deviations (LAD), least absolute errors (LAE). It is basically minimizing the sum of the absolute differences (S) between the target value ($Y_i$) and the estimated values ($f(x_i)$) [[s]](http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/)


* **L2 norm:** or Euclidean norm, where the distance is given by Pythagoras' theorem. Can also refer to the least squares loss function:

* **L2-norm loss function:** is also known as least squares error (LSE). It is basically minimizing the sum of the square of the differences (S) between the target value ($Y_i$) and the estimated values ($f(x_i)$) [[s]](http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/)
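
The four entries above can be put side by side in a few lines. The points `p` and `q` and the target/prediction arrays are invented for the example.

```python
import numpy as np

# Two hypothetical points in the plane
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

l1_distance = np.sum(np.abs(p - q))          # 3 + 4 = 7
l2_distance = np.sqrt(np.sum((p - q) ** 2))  # sqrt(9 + 16) = 5

# The same idea applied as loss functions over targets and estimates
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])

l1_loss = np.sum(np.abs(y - y_hat))   # least absolute deviations
l2_loss = np.sum((y - y_hat) ** 2)    # least squares error

print(l1_distance, l2_distance, l1_loss, l2_loss)
```

Squaring makes the L2 loss weigh the single large error more heavily than the small one, which is why it is more sensitive to outliers.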

**L1-L2 comparison**

![img](https://upload.wikimedia.org/wikipedia/commons/thumb/0/08/Manhattan_distance.svg/200px-Manhattan_distance.svg.png)

* **Likelihood:** a measure of the extent to which a sample provides support for particular values of a parameter in a parametric model. [[source]](https://stats.stackexchange.com/questions/2641/what-is-the-difference-between-likelihood-and-probability)

    In probability we start with a model defined by some parameters (like mean and deviation), i.e. a PDF, and ask which samples to expect. In statistics, however, we start from the observations and try to come up with the model parameters ([inverse problem](#I)). 
    
    Also, in Bayes' theorem, it is the probability of the data given some hypothesis.
    
* **Likelihood function:** the function (a pdf or a pmf) that we assume generated the data we have, viewed as a function of its parameters with the data held fixed. 

# M
* **Multi-class classification:** classification where the target can take more than two classes (as opposed to binary classification).
* **Mutual information:** or info gain. A metric that shows how much information of a variable *A* is contained in another variable *B*. So the lower the number (close to 0) the more independent the variables are.
* **Maximum likelihood estimation:** try to come up with the parameters that maximize the [likelihood function](#L) for the given datapoints. That is performed by finding the joint probability of all datapoints for that likelihood function and then maximizing it.
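
A minimal sketch of maximum likelihood estimation, assuming the data come from a normal distribution (an assumption made here only so that the maximizers have a closed form). The `data` values and the `log_likelihood` helper are hypothetical.

```python
import numpy as np

# Hypothetical observations, assumed to come from a normal distribution
data = np.array([4.8, 5.1, 5.3, 4.9, 5.4, 5.0])

def log_likelihood(mu, sigma, x):
    # Log of the joint probability of independent normal observations
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (x - mu) ** 2 / (2 * sigma ** 2))

# For a normal model the maximum likelihood estimates have a closed form
mu_mle = data.mean()
sigma_mle = data.std()   # ddof=0 is the maximum likelihood estimate

best = log_likelihood(mu_mle, sigma_mle, data)

# Moving away from the estimates can only lower the log-likelihood
print(best >= log_likelihood(mu_mle + 0.1, sigma_mle, data))
print(best >= log_likelihood(mu_mle, sigma_mle * 1.2, data))
```

Working with the log of the joint probability turns the product over data points into a sum, which is easier to maximize and numerically more stable.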

# N
* **Normalization:** the process of scaling individual samples to have unit norm, so every component ends up in the range $[-1, 1]$ (or $[0, 1]$ when all values are non-negative)
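
A short sketch of normalization with the L2 norm, using plain NumPy. The matrix `X` holds hypothetical samples, one per row.

```python
import numpy as np

# Hypothetical samples, one per row
X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [-2.0, 2.0]])

# L2 norm of each row, kept as a column so it broadcasts over the features
norms = np.linalg.norm(X, axis=1, keepdims=True)

# Each sample is divided by its own norm
X_unit = X / norms

print(np.linalg.norm(X_unit, axis=1))
```

Note that the scaling is done per sample (row), not per feature (column).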

# O
* **One vs one:** when we deal with multi-class classification, that is, our target has more than two classes, this is the strategy where we evaluate the classes in pairs. Say we have three classes, A, B and C. The OVO ensemble will be composed of 3 (= 3 * (3 - 1) / 2) binary classifiers. The first will discriminate A from B, the second A from C, and the third B from C. At the prediction stage, the class that got the highest number of "+1" predictions is our winner.
* **One vs rest:** (aka one-vs-all), when we deal with multi-class classification, this is the strategy that involves training one classifier (estimator) per class and then taking the one which gives the highest confidence. [[wiki]](https://en.wikipedia.org/wiki/Multiclass_classification#Transformation_to_binary)
* **Outliers:** An outlier is an observation that lies an abnormal distance from other values in a random sample from a population [[s]](https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm). Models that assume normally distributed data do not cope well with them [[wiki]](https://en.wikipedia.org/wiki/Linear_discriminant_analysis#Assumptions)
* **Overfitting:** models that do not generalize well beyond the data, that is, they stick so closely to the dataset they learned that they don't work well with other datasets or predict future observations reliably.  
On the following graph the green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on that data and it is likely to have a higher error rate on new unseen data, compared to the black line.

![img](https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Overfitting.svg/300px-Overfitting.svg.png)
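
Going back to the one-vs-one entry above, the pairing and the voting can be sketched with the standard library alone. The `votes` list is hypothetical: it stands for the winner reported by each pairwise classifier, which are not actually trained here.

```python
from itertools import combinations

classes = ["A", "B", "C"]

# One binary classifier per unordered pair of classes: k * (k - 1) / 2 of them
pairs = list(combinations(classes, 2))
print(pairs)

# Hypothetical outcome of each pairwise classifier for one sample:
# A-vs-B says A, A-vs-C says C, B-vs-C says C
votes = ["A", "C", "C"]

# The class with the most votes wins
winner = max(set(votes), key=votes.count)
print(winner)
```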

# P
* **Pipeline:** the stages in a machine learning project. Usually they are:
    * Preprocessing
    * Feature selection
    * ML algorithm
    * Model
    * Evaluation
* **Preprocessing:** the stage where we clean, prepare and make sense of the data we'll work with.
* **Principal component analysis:** a technique that lets us get a reduced and efficient version of a given data matrix, which can then be used for predictions. It is usually computed from the eigenvectors of the data's covariance matrix. It is efficient since the eigenvectors carry non-redundant (uncorrelated) information, and reduced since only a few of these eigenvectors are enough to give a close approximation to the original matrix. 
* **Probability distribution:** the mathematical function that provides the probabilities of different possible outcomes of a random variable.
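
The principal component analysis entry above can be sketched with an eigendecomposition of the covariance matrix. This is an illustration under assumptions, not a full implementation: the data are synthetic, and only one component is kept.

```python
import numpy as np

# Hypothetical, strongly correlated 2-D data (fixed seed for repeatability)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

# Center the data and compute its covariance matrix (features in columns)
mean = X.mean(axis=0)
Xc = X - mean
cov = np.cov(Xc, rowvar=False)

# eigh returns eigenvalues in ascending order, eigenvectors as columns
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Keep only the eigenvector with the largest eigenvalue
top = eigenvectors[:, -1]

# Reduced version: one number per sample instead of two
scores = Xc @ top

# Approximation of the original matrix from the reduced version
X_approx = np.outer(scores, top) + mean

# Share of the total variance captured by the kept component
explained = eigenvalues[-1] / eigenvalues.sum()
print(explained)
```

Because the two features are almost perfectly correlated, a single eigenvector is enough to rebuild the matrix closely.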

# R
* **Regression:** a broad concept. It is the set of statistical processes for estimating the relationships between a dependent variable *--target--* and one or more independent variables *--predictors, covariates or features--*. It is widely used for prediction and forecasting.

  The basic idea of regression is the following:
  
  $$\displaystyle\hat{y}(\mathbf{w},\mathbf{x})=w_0+w_1 x_1+...+w_p x_p$$
  We'll try to predict a $\hat{y}$ by assigning a coefficient ($w_i$, weight) to each component (feature) $x_i$ of the input vector $\mathbf{x}$, plus an intercept (constant term) $w_0$.

  That is: we assume that **every target in the data can be approximated by a linear combination of its features.**
  
  The term can also refer to the family of supervised learning algorithms where the targets are continuous values (rather than labels/categories), although logistic regression, despite its name, deals with classification problems.
  
* **Regularization:** the process by which we add information to a model to prevent overfitting [[wiki]](https://en.wikipedia.org/wiki/Regularization_(mathematics)).   
    
    Also, in SVC algorithms, it is a parameter that provides a well-defined solution [[wiki]](https://en.wikipedia.org/wiki/Support-vector_machine#Regularization_and_stability).
  
* **Residual:** the distance between the estimated value and the observed (real) value. Not to be confused with statistical errors [[wiki]](https://en.wikipedia.org/wiki/Errors_and_residuals). Uses:
    * Give a feel for whether the data has a linear behavior
    * We want them to be normally distributed
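
The regression and residual entries above can be tied together with a least squares fit. This is a sketch under assumptions: the data are generated, without noise, from a known linear rule so that the recovered weights are easy to check, and `np.linalg.lstsq` stands in for whatever estimator is used in practice.

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2*x1 - 3*x2 (no noise)
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones so the first weight plays the role of w_0
A = np.column_stack([np.ones(len(X)), X])

# Least squares solution: the weights that minimize the sum of squared residuals
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predictions and residuals (observed minus estimated)
y_hat = A @ w
residuals = y - y_hat

print(w)
```

With noisy data the residuals would not vanish, and plotting them is a quick way to judge whether a linear model is reasonable.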

# S
* **Singular value decomposition (SVD):** any linear transformation can be decomposed into three component transformations:  
    * Initial rotation $\mathbf{V^T}$  
    * Axis-wise scaling $\mathbf{\Sigma}$  
    * Final rotation $\mathbf{U}$
    
    $\displaystyle\mathbf{M} = \mathbf{U}\cdot\mathbf{\Sigma}\cdot\mathbf{V^T}$
    
    [[wiki]](https://en.wikipedia.org/wiki/Singular_value_decomposition)
    
    ![SVD](https://upload.wikimedia.org/wikipedia/commons/thumb/b/bb/Singular-Value-Decomposition.svg/500px-Singular-Value-Decomposition.svg.png)

* **Supervised Learning:** the data also comes with an output (also known as the class column). It involves either classification (classes are categories) or regression methods (classes are numbers)

* **Standardization:** the process of rescaling the features so that they’ll have the properties of a Gaussian distribution with μ=0 and σ=1 where μ is the mean and σ is the standard deviation from the mean

* **Statistical errors:** the deviations of the observed value from the unobservable true value [[wiki]](https://en.wikipedia.org/wiki/Errors_and_residuals)

* **Support vector:** the data points that are closest to the hyperplane and influence its orientation and position.
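
For the singular value decomposition entry above, a quick numerical check with `np.linalg.svd`. The matrix `M` is an arbitrary example; for a non-square matrix like this one, $\mathbf{U}$ has orthonormal columns rather than being a square rotation.

```python
import numpy as np

# Arbitrary 3x2 matrix standing in for a linear transformation
M = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# Reduced SVD: U is 3x2, s holds the singular values, Vt is V transposed (2x2)
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Put the singular values on the diagonal of the scaling matrix
Sigma = np.diag(s)

# Multiplying the three factors gives the original matrix back
M_rebuilt = U @ Sigma @ Vt
print(np.allclose(M, M_rebuilt))
```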

# T
* **Transform and compare:** in these machine learning problems it is quite usual to find the following operation: $\mathbf{x}^T\cdot\mathbf{M}\cdot\mathbf{x}$, whose intuition is *take a vector, apply a linear transformation to it and compare the outcome with the original.*
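
The operation above takes two lines in NumPy. The vector `x` and the matrix `M` are arbitrary values picked so the arithmetic is easy to follow.

```python
import numpy as np

x = np.array([1.0, 2.0])
M = np.array([[2.0, 0.0],
              [0.0, 3.0]])

# Transform: apply the linear transformation to the vector
transformed = M @ x          # [2, 6]

# Compare: dot product of the original vector with its transformed version
value = x @ transformed      # 1*2 + 2*6 = 14
print(value)
```

The dot product is large when the transformed vector still points roughly where the original did, which is what makes it a comparison.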

# U
* **Unsupervised learning:** the data doesn't come with any outcome (no class column), so it is used in an exploratory way. [Mathworks](https://www.mathworks.com/discovery/unsupervised-learning.html)

# V
* **Variance:** measures the spread of the data in a sample.

# W
* **White box model:** those models on which we can figure out what's going on inside. The opposite of a [black box model](#B).

# X
* **$X$:** all the data matrix, input vectors. $X_i$, some instance.

# Y
* **$y$:** the target column or output, that is, the real outcome we get when we input $X$. $y_i$, some instance.
* **$\hat{y}$:** a prediction of the learned model.