<a href="https://colab.research.google.com/github/nyculescu/handson-ml2/blob/master/00_general_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Maths references

## Linear algebra

In ML, Linear Algebra comes up everywhere. Topics such as: 
* Principal Component Analysis (PCA), 
* Singular Value Decomposition (SVD), 
* Eigendecomposition of a matrix, 
* LU Decomposition, 
* QR Decomposition/Factorization, 
* Symmetric Matrices, 
* Orthogonalization & Orthonormalization, 
* Matrix Operations, 
* Projections, 
* Eigenvalues & Eigenvectors, 
* Vector Spaces 
* Norms 

are needed for understanding the optimization methods used for ML. 

Links:
* The Linear Algebra course offered by MIT OpenCourseWare ([Prof. Gilbert Strang](https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/)).
* https://www.khanacademy.org/math/linear-algebra
* [Coding the Matrix: Linear Algebra through Computer Science Applications](http://codingthematrix.com/) by Philip Klein, Brown University
* [Linear Algebra — Foundations to Frontiers](https://www.edx.org/course/linear-algebra-foundations-to-frontiers) by Robert van de Geijn, University of Texas
* Applications of Linear Algebra, [Part 1](https://www.edx.org/course/applications-of-linear-algebra-part-1) and [Part 2](https://www.edx.org/course/applications-of-linear-algebra-part-2). A newer course by Tim Chartier, Davidson College

### [Sparse matrix](https://en.wikipedia.org/wiki/Sparse_matrix)
In numerical analysis and scientific computing, a **sparse matrix** or sparse array is a matrix in which most of the elements are zero. <br>By contrast, if most of the elements are nonzero, then the matrix is considered **dense**. The number of zero-valued elements divided by the total number of elements (e.g., m × n for an m × n matrix) is called the sparsity of the matrix (which is equal to 1 minus the density of the matrix). Using those definitions, a matrix will be sparse when its sparsity is greater than 0.5.
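
As a minimal sketch of the definitions above, the sparsity of a small matrix can be computed by counting zeros. The `sparsity` helper and the 3 × 3 example matrix are made up for illustration; libraries such as SciPy provide dedicated sparse formats that are not shown here.

```python
def sparsity(matrix):
    """Return the fraction of zero-valued elements in a 2-D list of numbers."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


# Hypothetical 3 x 3 example: 7 of the 9 elements are zero.
example = [
    [0, 0, 3],
    [0, 0, 0],
    [1, 0, 0],
]

example_sparsity = sparsity(example)   # 7 / 9
example_density = 1 - example_sparsity  # 2 / 9
is_sparse = example_sparsity > 0.5      # sparse under the 0.5 threshold above
```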


## Probability theory & statistics

Machine Learning and Statistics aren’t very different fields. Actually, someone recently defined Machine Learning as ‘doing statistics on a Mac’. Some of the fundamental statistics and probability theory topics needed for ML are: 
* Combinatorics, 
* Probability Rules & Axioms, 
* Bayes’ Theorem, 
* Random Variables, 
* Variance and Expectation, 
* Conditional and Joint Distributions, 
* Standard Distributions (Bernoulli, Binomial, Multinomial, Uniform and Gaussian), 
* Moment Generating Functions, 
* Maximum Likelihood Estimation (MLE), 
* Prior and Posterior, 
* Maximum a Posteriori Estimation (MAP) 
* Sampling Methods.

Links:
* https://www.khanacademy.org/math/probability
* [Statistics 110: Probability](https://projects.iq.harvard.edu/stat110/youtube) by Joe Blitzstein
* [All of Statistics: A Concise Course in Statistical Inference](http://read.pudn.com/downloads158/ebook/702714/Larry%20Wasserman_ALL%20OF%20Statistics.pdf)
* [Udacity’s Introduction to Statistics](https://www.udacity.com/course/intro-to-statistics--st101)

## Multivariate calculus

Some of the necessary topics include: 
* Differential and Integral Calculus, 
* Partial Derivatives, 
* Vector-Valued Functions, 
* Directional Gradient, 
* Hessian, 
* Jacobian, 
* Laplacian 
* Lagrangian Distribution.

Links:
* https://www.khanacademy.org/math/multivariable-calculus

## Algorithms & complexity

This is important for understanding the computational efficiency and scalability of our Machine Learning Algorithm and for exploiting sparsity in our datasets. Knowledge of: 
* data structures (Binary Trees, Hashing, Heap, Stack etc), 
* Dynamic Programming, 
* Randomized & Sublinear Algorithms, 
* Graphs, 
* Gradient/Stochastic Descents 
* Primal-Dual methods 

are needed.

Links:
* https://www.khanacademy.org/math/ap-calculus-ab/ab-diff-analytical-applications-new/ab-5-11/e/optimization

Other Math topics are not covered in the four major areas described above. They include Real and Complex Analysis (Sets and Sequences, Topology, Metric Spaces, Single-Valued and Continuous Functions, Limits, Cauchy Kernel, Fourier Transforms), Information Theory (Entropy, Information Gain), Function Spaces and Manifolds.

Links:
* [Boyd and Vandenberghe’s course on Convex optimization](http://stanford.edu/~boyd/cvxbook/) from Stanford

## Misc notes

**The real prerequisite for machine learning isn’t math, it’s data analysis** [link](https://www.r-bloggers.com/the-real-prerequisite-for-machine-learning-isnt-math-its-data-analysis/) 
* “Off the shelf” tools take care of the math for you; 
* Most data scientists don’t do much math; 
* 80% of your work will be data preparation, EDA, and visualization;
* For beginning practitioners, data hacking beats math

# Ideas

## Normalizing Flows

Links:
* http://akosiorek.github.io/ml/2018/04/03/norm_flows.html

Machine learning is all about probability. To train a model, we typically tune its parameters to maximise the probability of the training dataset under the model. To do so, we have to assume some probability distribution as the output of our model. The two distributions most commonly used are [Categorical](https://en.wikipedia.org/wiki/Categorical_distribution) for classification and [Gaussian](https://en.wikipedia.org/wiki/Normal_distribution) for regression. The latter case can be problematic, as the true probability density function (pdf) of real data is often far from Gaussian. If we use the Gaussian as likelihood for image-generation models, we end up with blurry reconstructions. We can circumvent this issue by adversarial training, which is an example of likelihood-free inference, but this approach has its own issues.
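
To make the link between a Gaussian likelihood and blurry outputs more concrete, here is a short derivation sketch. It assumes a model that predicts a mean $\mu_\theta(x)$ with a fixed variance $\sigma^2$; these symbols are illustrative and do not come from the linked post.

$$-\log \mathcal{N}\left(y \mid \mu_\theta(x), \sigma^2\right) = \frac{\left(y - \mu_\theta(x)\right)^2}{2\sigma^2} + \frac{1}{2}\log\left(2\pi\sigma^2\right)$$

With $\sigma$ fixed, the second term is a constant, so maximising the likelihood amounts to minimising the squared error. The squared error is minimised by the average of all plausible outputs, which is one common explanation for the blurriness.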

# Scikit-learn


Scikit-learn is a free software machine learning *library* for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

[github link](https://github.com/scikit-learn/scikit-learn)


# Machine Learning 
notes and theoretical stuff

## Ensemble Learning and Random Forests

Suppose you ask a complex question to thousands of random people, then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert’s answer. This is called the **wisdom of the crowd**.

Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called **Ensemble Learning**, and an Ensemble Learning algorithm is called an ***Ensemble method***.

As an example of **Ensemble method**, you can train a group of Decision Tree classifiers, each on a different random subset of the training set.

To make predictions, you just **obtain the predictions of all individual trees**, then predict the class that gets **the most votes**. 

Such an ensemble of Decision Trees is called a ***Random Forest***, and despite its simplicity, this is one of the most powerful Machine Learning algorithms available today.
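
The voting step described above can be sketched in a few lines. This is only an illustration of majority (hard) voting: the `majority_vote` helper and the tree predictions are hypothetical, and training the individual trees is left out.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted most often for each instance.

    `predictions` holds one list of class labels per predictor,
    all of the same length (one label per instance).
    """
    votes_per_instance = zip(*predictions)
    # Ties are resolved in favour of the label that was seen first.
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_instance]


# Hypothetical predictions of three trees for four instances.
tree_predictions = [
    ["cat", "dog", "dog", "cat"],
    ["cat", "cat", "dog", "dog"],
    ["dog", "dog", "dog", "cat"],
]

ensemble_prediction = majority_vote(tree_predictions)
```

No single tree agrees with the ensemble on every instance here, yet each instance still gets the label that most trees voted for.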

Most popular Ensemble methods (tbc):
* bagging
* boosting
* stacking
* random forests

## ML Pipelines

A **Pipeline** is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage.

There are two basic types of pipeline stages: 
* Transformer - A Transformer takes a dataset as input and produces an augmented dataset as output. E.g., a tokenizer is a Transformer that transforms a dataset with text into a dataset with tokenized words ![](https://spark.apache.org/docs/latest/img/ml-PipelineModel.png)
* Estimator - An Estimator must be first fit on the input dataset to produce a model, which is a Transformer that transforms the input dataset. E.g., logistic regression is an Estimator that trains on a dataset with labels and features and produces a logistic regression model ![](https://spark.apache.org/docs/latest/img/ml-Pipeline.png)
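
The Transformer/Estimator contract can be sketched in plain Python. This is not the Spark API: every class and function below (`Tokenizer`, `VocabularyEstimator`, `VocabularyModel`, `fit_pipeline`, `transform_pipeline`) is a hypothetical stand-in, and a list of dicts plays the role of the DataFrame.

```python
class Tokenizer:
    """Transformer: adds a 'words' field to every row."""

    def transform(self, rows):
        return [{**row, "words": row["text"].lower().split()} for row in rows]


class VocabularyModel:
    """Transformer produced by fitting a VocabularyEstimator."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        # Words that were not seen during fitting are dropped.
        return [
            {**row, "ids": [self.vocabulary[w] for w in row["words"] if w in self.vocabulary]}
            for row in rows
        ]


class VocabularyEstimator:
    """Estimator: learns a word -> id mapping and returns a model."""

    def fit(self, rows):
        words = sorted({w for row in rows for w in row["words"]})
        return VocabularyModel({w: i for i, w in enumerate(words)})


def fit_pipeline(stages, rows):
    """Run the stages in order and return the fitted stages (all Transformers)."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):     # Estimator: fit first, keep the resulting model
            stage = stage.fit(rows)
        rows = stage.transform(rows)  # pass the augmented data to the next stage
        fitted.append(stage)
    return fitted


def transform_pipeline(fitted_stages, rows):
    """Apply every fitted stage in order to new data."""
    for stage in fitted_stages:
        rows = stage.transform(rows)
    return rows


train = [{"text": "Spark is fast"}, {"text": "Pipelines are fast"}]
model = fit_pipeline([Tokenizer(), VocabularyEstimator()], train)
output = transform_pipeline(model, [{"text": "Spark pipelines"}])
```

Fitting turns the Estimator stage into a Transformer, so the fitted pipeline contains only Transformers and can be applied to new data directly.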



## What is a hyperparameter?

[blog link](http://pasank.com/what-is-a-hyperparameter/)

A hyperparameter is a knob.

![](https://github.com/nyculescu/handson-ml2/blob/master/images/misc/knob.jpg?raw=1)


It’s a knob that you tweak in your model to control its learning process.

Hyperparameters are a bit different from what we normally call a parameter.

Parameters serve as a representation of your data. A parametric type of model represents your data as a set of parameters.

For example, the data given below can be represented with a linear model with the parameters (m = 3, c = 7)

![](https://github.com/nyculescu/handson-ml2/blob/master/images/misc/linear_eq.jpg?raw=1)

![](http://pasank.com/assets/images/hyperparameter-post/linear-model.png)

In contrast to model parameters, hyperparameters do not represent data. Hyperparameters are not part of the model that describes the data. Hyperparameters are generally not learned and not tuned during the learning process.

Hyperparameters are the control knobs to the learning process. Some examples of hyperparameters are the learning rate (in gradient descent) or the number of leaves (in a decision tree).
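
A small sketch can make the distinction concrete. It assumes noise-free data generated from the line with m = 3 and c = 7 mentioned above: `m` and `c` are parameters learned from the data, while `learning_rate` and `n_steps` are hyperparameters chosen by hand. The `fit_line` helper and the specific values are illustrative.

```python
def fit_line(xs, ys, learning_rate, n_steps):
    """Fit y = m * x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0  # parameters: learned from the data
    n = len(xs)
    for _ in range(n_steps):
        errors = [m * x + c - y for x, y in zip(xs, ys)]
        grad_m = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_c = 2 / n * sum(errors)
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c


xs = [0, 1, 2, 3, 4]
ys = [3 * x + 7 for x in xs]

# Same data and same model; only the knob (learning rate) changes.
m_good, c_good = fit_line(xs, ys, learning_rate=0.05, n_steps=2000)  # converges
m_bad, c_bad = fit_line(xs, ys, learning_rate=0.2, n_steps=50)       # diverges
```

With the smaller learning rate the parameters settle near m = 3 and c = 7; with the larger one each step overshoots and the estimates grow without bound.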

**Fine print**

A more rigorous explanation would be that a hyperparameter encodes prior belief about how your parameters are distributed.

You pick a distribution that you think will describe your data. This distribution would have parameters (for example, the normal distribution has the parameters mean and standard deviation). If you pick a distribution that describes those parameters, the parameters of this ‘higher-level’ distribution would be the hyperparameters.

In short, hyperparameters are the parameters of a distribution that is put on the model parameters.

### Regularization Hyperparameters

**Decision Trees** make very **few assumptions** about the training data (as **opposed to linear models, which obviously assume that the data is linear**, for example). 

If left unconstrained, the tree structure will **adapt itself** to the training data, fitting it very closely, and **most likely overfitting** it. Such a model is often called a **nonparametric model**, not because it does not have any parameters (it often has a lot) but **because the number of parameters is not determined prior to training**, so the model structure is free to stick closely to the data. 

In contrast, **a parametric model** such as a linear model **has a predetermined number of parameters**, so its degrees of freedom are limited, reducing the risk of overfitting (but increasing the risk of underfitting).

## Why do we need the bias term in ML algorithms such as linear regression and neural networks? 
[Quora answer](https://qr.ae/TqjfAv)

The answer is that bias values allow a neural network to output a value of zero even when the input is near one. Adding a bias permits the output of the activation function to be shifted to the left or right on the x-axis. Consider a simple neural network where a single input neuron I1 is directly connected to an output neuron O1.

![](https://qph.fs.quoracdn.net/main-qimg-fc37bf8f5caddf96222f7a33784dacec.webp)

This network’s output is calculated by multiplying the input (x) by the weight (w). The result is then passed through an activation function. In this case, we are using the sigmoid activation function. Consider the output of the sigmoid function for the following four weights: `sigmoid(0.5 * x)`, `sigmoid(1.0 * x)`, `sigmoid(1.5 * x)`, `sigmoid(2.0 * x)`.

The output is as below :

![](https://qph.fs.quoracdn.net/main-qimg-e70779fa8e2605566688312cec06c401.webp)

Modification of the weight w alters the “steepness” of the sigmoid function. This allows the neural network to learn patterns. However, what if you wanted the network to output 0 when x is a value other than 0, such as 3? Simply modifying the steepness of the sigmoid will not achieve this. You must be able to shift the entire curve to the right.

That is exactly the purpose of the **bias**.

Now, consider a network with bias neurons.

![](https://qph.fs.quoracdn.net/main-qimg-564a6626cee0e2b754df9cb6664db602.webp)

Calculating output for the below bias weights:

`sigmoid(1 * x + 1 * 1)`, `sigmoid(1 * x + 0.5 * 1)`, `sigmoid(1 * x + 1.5 * 1)`, `sigmoid(1 * x + 2 * 1)`

The output is as below :

![](https://qph.fs.quoracdn.net/main-qimg-26f1eda7cf50ede6c9904b59f237b595.webp)

As you can observe, **the entire curve shifts**.
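
The same comparison can be checked numerically. The sketch below uses the weights and bias weights listed above; the `neuron` helper is a hypothetical single-input neuron. It tracks the midpoint of the curve, i.e. the input at which the output equals 0.5.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def neuron(x, w, b=0.0):
    """Single-input neuron: sigmoid(w * x + b * 1)."""
    return sigmoid(w * x + b)


# Without a bias, changing the weight never moves the midpoint away from x = 0.
weights = [0.5, 1.0, 1.5, 2.0]
outputs_at_zero = [neuron(0.0, w) for w in weights]

# With a bias, the midpoint sits where w * x + b = 0, i.e. at x = -b / w.
biases = [1.0, 0.5, 1.5, 2.0]
midpoints = [-b / 1.0 for b in biases]
outputs_at_midpoints = [neuron(x, 1.0, b) for x, b in zip(midpoints, biases)]

# A negative bias shifts the curve to the right, e.g. midpoint at x = 3.
output_shifted_right = neuron(3.0, 1.0, b=-3.0)
```

Every value computed here is 0.5: the weight alone only changes the steepness around x = 0, while the bias decides where along the x-axis the transition happens.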

Bias is a vital concept for neural networks. Bias neurons are added to every non-output layer of the neural network. They differ from ordinary neurons in two significant ways. Firstly, the output from a bias neuron is always one. Secondly, a bias neuron has no inbound connections. The constant value of one makes the layer respond with non-zero values even when the input to the layer is zero. This can be crucial for certain data sets.

## What is a Feature?

In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon being observed.



## Hypothesis

[geeksforgeeks link](https://www.geeksforgeeks.org/ml-understanding-hypothesis/)

In most supervised machine learning algorithms, the main goal is to find a hypothesis in the hypothesis space that maps the inputs to the proper outputs.

The hypothesis space is the set of all possible legal hypotheses.

A hypothesis is a function that best describes the target in supervised machine learning. 

## Pseudoinverse

In mathematics, and in particular linear algebra, a pseudoinverse $\mathbf{A}^+$ of a matrix $\mathbf{A}$ is a generalization of the inverse matrix. The most widely known type of matrix pseudoinverse is the Moore–Penrose inverse.
* A common use of the pseudoinverse is to compute a "best fit" (least squares) solution to a system of linear equations that lacks a unique solution. 
* Another use is to find the minimum (Euclidean) norm solution to a system of linear equations with multiple solutions.

The pseudoinverse facilitates the statement and proof of results in linear algebra.
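
Both uses listed above can be sketched with NumPy's `np.linalg.pinv`, which computes the Moore–Penrose inverse. The two small systems are made-up examples.

```python
import numpy as np

# Overdetermined system (3 equations, 2 unknowns) with no exact solution:
# the pseudoinverse gives the least-squares "best fit".
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])
x_least_squares = np.linalg.pinv(A) @ b

# Underdetermined system (1 equation, 2 unknowns) with infinitely many
# solutions: the pseudoinverse picks the one with the smallest Euclidean norm.
A_under = np.array([[1.0, 1.0]])
b_under = np.array([2.0])
x_min_norm = np.linalg.pinv(A_under) @ b_under
```

For the second system, any pair with x1 + x2 = 2 is a solution; the pseudoinverse returns the one closest to the origin.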

## Entropy

The concept of entropy originated in thermodynamics as a measure of molecular disorder: entropy approaches zero when molecules are still and well ordered. It later spread to a wide variety of domains, including Shannon’s *information theory*, where it measures the average information content of a message (a reduction of entropy is often called an *information gain*): entropy is zero when all messages are identical. 

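In Machine Learning, it is frequently used as an impurity measure: a set’s entropy is zero when it contains instances of only one class. The following equation shows the definition of the entropy of the $i^{th}$ node, where $p_{i,k}$ is the ratio of class $k$ instances among the training instances in the $i^{th}$ node: $H_i = -\sum_{\substack{k=1 \\ p_{i,k} \neq 0}}^{n} p_{i,k} \log_2(p_{i,k})$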
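
As a minimal sketch, the entropy of a node can be computed from its per-class instance counts. The `entropy` helper and the example counts are illustrative; classes with no instances are skipped, following the usual convention that they contribute nothing to the sum.

```python
from math import log2


def entropy(class_counts):
    """Entropy (in bits) of a node, given the number of instances per class."""
    total = sum(class_counts)
    ratios = [count / total for count in class_counts if count > 0]
    return sum(-p * log2(p) for p in ratios)


pure_node = entropy([10, 0])       # only one class present
balanced_node = entropy([5, 5])    # two equally likely classes
mixed_node = entropy([0, 49, 5])   # mostly one class, a few of another
```

A pure node has zero entropy, two equally represented classes give exactly one bit, and the mostly pure node falls in between at roughly 0.445.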

# Utils

[Mathematics in R Markdown](https://www.calvin.edu/~rpruim/courses/s341/S17/from-class/MathinRmd.html)

[An Example R Markdown](http://www.math.mcgill.ca/yyang/regression/RMarkdown/example.html)
