# MACHINE LEARNING CONCEPTS

# Adam Algorithm
---

Adam ("Adaptive Moment Estimation") differs from classical stochastic gradient descent. Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates, and that learning rate does not change during training. Adam instead maintains a learning rate for each network weight (parameter) and adapts each one separately as learning unfolds.

The method computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients. It combines the advantages of two other extensions of stochastic gradient descent:

* _Adaptive Gradient Algorithm_ (AdaGrad), which maintains a per-parameter learning rate and so improves performance on problems with sparse gradients (e.g. natural language and computer vision problems).

* _Root Mean Square Propagation_ (RMSProp), which also maintains per-parameter learning rates, adapted based on the average of recent magnitudes of the gradients for the weight (i.e. how quickly it is changing). This means the algorithm does well on online and non-stationary problems (e.g. noisy ones).

Adam realizes the benefits of both AdaGrad and RMSProp and is being adopted for benchmarks in deep learning papers.
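
The difference between the two per-parameter schemes can be sketched for a single weight. This is a minimal illustration using only the standard library, not a reference implementation: the function names, the default `learning_rate` and `decay` values, and the constant gradient of `1.0` are all assumptions chosen for the example.

```python
import math

def adagrad_update(w, grad, cache, learning_rate=0.01, epsilon=1e-8):
    """AdaGrad: the cache sums every squared gradient seen so far."""
    cache = cache + grad * grad
    w = w - learning_rate * grad / (math.sqrt(cache) + epsilon)
    return w, cache

def rmsprop_update(w, grad, cache, learning_rate=0.01, decay=0.9, epsilon=1e-8):
    """RMSProp: the cache is a decaying average of recent squared gradients."""
    cache = decay * cache + (1.0 - decay) * grad * grad
    w = w - learning_rate * grad / (math.sqrt(cache) + epsilon)
    return w, cache

# Feed the same constant gradient (assumed value 1.0) to both rules
# and record how far the weight moves on each step.
w_ada, c_ada, ada_steps = 0.0, 0.0, []
w_rms, c_rms, rms_steps = 0.0, 0.0, []
for _ in range(100):
    new_w, c_ada = adagrad_update(w_ada, 1.0, c_ada)
    ada_steps.append(abs(new_w - w_ada))
    w_ada = new_w

    new_w, c_rms = rmsprop_update(w_rms, 1.0, c_rms)
    rms_steps.append(abs(new_w - w_rms))
    w_rms = new_w
```

With a constant gradient, the AdaGrad step keeps shrinking because its cache only grows, while the RMSProp step settles near the base learning rate because old gradients are forgotten.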

![Adam Comparison to Other Optimization Algorithms](images\optimization_algorithms_comparison.png "Adam Comparison to Other Optimization Algorithms")


## Configuration Parameters

* __alpha__<br>
Also referred to as the learning rate or step size. The proportion by which weights are updated (e.g. $0.001$). Larger values (e.g. $0.3$) result in faster initial learning before the rate is adapted. Smaller values (e.g. $10^{-5}$) slow learning right down during training.

* __beta1__<br>
The exponential decay rate for the first moment estimates (e.g. $0.9$).

* __beta2__<br>
The exponential decay rate for the second-moment estimates (e.g. $0.999$). This value should be set close to $1.0$ on problems with a sparse gradient (e.g. NLP and computer vision problems).

* __epsilon__<br>
A very small number that prevents division by zero in the implementation (e.g. $10^{-8}$).
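
The four parameters above can be seen working together in a single update step. The following is a minimal sketch under stated assumptions, not a library implementation: `adam_step`, `toy_gradient`, and the toy loss $x^2 + 10y^2$ are hypothetical names and choices made for illustration, and the bias-correction terms follow the usual description of Adam.

```python
import math

def adam_step(params, grads, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Return updated (params, m, v) after one Adam step; t counts steps from 1."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g       # first moment: decaying mean of gradients
        v_i = beta2 * v_i + (1.0 - beta2) * g * g   # second moment: decaying mean of squared gradients
        m_hat = m_i / (1.0 - beta1 ** t)            # bias correction for the zero initialisation
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - alpha * m_hat / (math.sqrt(v_hat) + epsilon))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

def toy_gradient(params):
    """Gradient of the assumed toy loss f(x, y) = x**2 + 10 * y**2."""
    x, y = params
    return [2.0 * x, 20.0 * y]

params = [3.0, -2.0]
m, v = [0.0, 0.0], [0.0, 0.0]
for t in range(1, 2001):
    params, m, v = adam_step(params, toy_gradient(params), m, v, t, alpha=0.01)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly `alpha` regardless of how large its gradient is. That is the per-parameter scaling described above.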

# Autoencoder
---

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name.

An autoencoder is essentially a neural network that learns to copy its input to its output. It has an internal (hidden) layer that describes a code used to represent the input, and it is constituted by two main parts: an encoder that maps the input into the code, and a decoder that maps the code to a reconstruction of the original input. Several variants exist to the basic model, with the aim of forcing the learned representations of the input to assume useful properties.

![Autoencoder](images\autoencoder.png "Autoencoder")

Examples are the regularized autoencoders (Sparse, Denoising and Contractive autoencoders), proven effective in learning representations for subsequent classification tasks, and Variational autoencoders, with their recent applications as generative models. Autoencoders are effectively used for solving many applied problems, from face recognition to acquiring the semantic meaning of words.

The simplest form of an autoencoder is a feedforward, non-recurrent neural network similar to the single-layer perceptrons that make up multilayer perceptrons (MLP): it has an input layer, an output layer, and one or more hidden layers connecting them. The output layer has the same number of nodes (neurons) as the input layer, and the network is trained to reconstruct its inputs (minimizing the difference between the input and the output) instead of predicting a target value $Y$ given inputs $X$. Autoencoders are therefore unsupervised learning models: they do not require labeled inputs to enable learning.
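
The encoder/decoder structure can be sketched with a deliberately tiny model. This is an illustrative toy under stated assumptions, not a practical autoencoder: it is linear, has no biases or activations, compresses 2-D inputs to a 1-D code, and is trained with plain full-batch gradient descent on a squared reconstruction error. The function names, the initial weights, and the data (points on the line $y = 2x$) are all hypothetical.

```python
def encode(x, w_enc):
    """Encoder: map a 2-D input to a 1-D code."""
    return w_enc[0] * x[0] + w_enc[1] * x[1]

def decode(z, w_dec):
    """Decoder: map the 1-D code back to a 2-D reconstruction."""
    return (w_dec[0] * z, w_dec[1] * z)

def reconstruction_loss(data, w_enc, w_dec):
    """Mean squared difference between each input and its reconstruction."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x, w_enc), w_dec)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)

def train(data, w_enc, w_dec, learning_rate=0.05, epochs=500):
    """Full-batch gradient descent on the reconstruction loss (no labels needed)."""
    w_enc, w_dec = list(w_enc), list(w_dec)
    n = len(data)
    for _ in range(epochs):
        g_enc, g_dec = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            z = encode(x, w_enc)
            x_hat = decode(z, w_dec)
            err = (x_hat[0] - x[0], x_hat[1] - x[1])
            back = err[0] * w_dec[0] + err[1] * w_dec[1]  # error propagated back to the code
            for i in range(2):
                g_dec[i] += 2.0 * err[i] * z / n
                g_enc[i] += 2.0 * back * x[i] / n
        for i in range(2):
            w_enc[i] -= learning_rate * g_enc[i]
            w_dec[i] -= learning_rate * g_dec[i]
    return w_enc, w_dec

# Assumed toy data: 2-D points that all lie on one line, so one number can describe each.
data = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.5, 1.0)]
w_enc0, w_dec0 = [0.5, 0.5], [0.5, 0.5]
loss_before = reconstruction_loss(data, w_enc0, w_dec0)
w_enc, w_dec = train(data, w_enc0, w_dec0)
loss_after = reconstruction_loss(data, w_enc, w_dec)
```

The training signal is the input itself: the loss compares `x_hat` with `x`, so no target value is ever supplied. Because the data here is one-dimensional in disguise, a 1-D code is enough to reconstruct it closely.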




# Gradient Descent (Classic) Algorithm
---

The goal of gradient descent is usually to minimize the loss function for a machine learning problem. A good algorithm finds the minimum quickly and reliably (i.e. it doesn’t get stuck in local minima, saddle points, or plateau regions, but instead reaches the global minimum).

The basic gradient descent algorithm relies on the fact that the gradient points in the direction of steepest ascent, so the opposite direction leads downhill. It therefore iteratively takes steps in the direction opposite to the gradient.

![Gradient Descent Algorithm](images\gradient_descent.gif "Gradient Descent Algorithm")

As human perception is limited to 3 dimensions, in all our visualizations, imagine we only have two parameters (or thetas) to optimize, and they are represented by the $x$ and $y$ dimensions in the graph. The surface is the loss function. We want to find the ($x$, $y$) combination that’s at the lowest point of the surface. The problem is trivial to us because we can see the whole surface. But the ball (the descent algorithm) doesn’t; it can only take one step at a time and explore its surroundings, analogous to walking in the dark with only a flashlight.
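
The two-parameter picture above can be written as a short loop. This is a minimal sketch, assuming a hypothetical bowl-shaped loss with its minimum at $(1, -0.5)$; the function names, the starting point, and the learning rate are choices made for the example.

```python
def loss(x, y):
    """Assumed toy loss surface: a bowl with its lowest point at (1, -0.5)."""
    return (x - 1.0) ** 2 + 2.0 * (y + 0.5) ** 2

def gradient(x, y):
    """Partial derivatives of the toy loss with respect to x and y."""
    return 2.0 * (x - 1.0), 4.0 * (y + 0.5)

def gradient_descent(x, y, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, using only local information."""
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y

x_min, y_min = gradient_descent(x=-3.0, y=2.0)
```

Each iteration only evaluates the gradient at the current point, which is the "flashlight" view: the loop never sees the whole surface, yet on this convex bowl it still ends up very close to the lowest point.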


# Momentum Algorithm
---

The gradient descent with momentum algorithm (or Momentum for short) borrows an idea from physics. Imagine rolling a ball down the inside of a frictionless bowl. Instead of stopping at the bottom, the momentum it has accumulated pushes it forward, and the ball keeps rolling back and forth.

![Momentum Algorithm](images\momentum.gif "Momentum Algorithm")

We can apply the concept of momentum to our vanilla gradient descent algorithm. In each step, in addition to the regular gradient step, the update also adds on a fraction of the movement from the previous step.
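
A minimal sketch of that update on a one-dimensional bowl, assuming the hypothetical loss $f(x) = x^2$. The name `momentum_descent` and the values of `learning_rate`, `momentum`, and the starting point are assumptions for illustration; other formulations of momentum scale the terms slightly differently.

```python
def gradient(x):
    """Gradient of the assumed toy loss f(x) = x**2, whose minimum is at x = 0."""
    return 2.0 * x

def momentum_descent(x, learning_rate=0.1, momentum=0.9, steps=300):
    """Gradient descent where each step also carries over part of the previous step."""
    velocity = 0.0
    path = [x]
    for _ in range(steps):
        # New movement = fraction of the previous movement + regular gradient step.
        velocity = momentum * velocity - learning_rate * gradient(x)
        x += velocity
        path.append(x)
    return path

path = momentum_descent(x=-4.0)
```

Starting on the left side of the bowl, the accumulated velocity carries the iterate past the minimum before it swings back, so `path` contains positive values even though the start is negative. The oscillation shrinks over time because only a fraction (`momentum`) of each previous step is kept.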



