# Machine Learning and Big Data

> ### ___We are drowning in information but starved for knowledge.___ $-$ John Naisbitt

Training a low-bias learning algorithm on a massive dataset is one of the best ways to get a high-performance machine learning system. However, learning with large datasets results in severe computational penalties when using standard learning methods. __Stochastic gradient descent__ (SGD) is a modification of the standard gradient descent procedure that is efficient, simple and robust. This project describes SGD with applications to the backpropagation algorithm in neural networks.

## Gradient Descent

__Gradient descent__ is an iterative optimization procedure that searches for the minimum of a function by repeatedly taking a step in the direction of the negative gradient at the current point.

For a straightforward illustration of gradient descent, let $f:\mathcal{R}^d\rightarrow\mathcal{R}$ be a differentiable, convex function. Then the gradient of $f$ at a point $\textbf{w}$ is 

$$ \nabla f(\textbf{w}) = \left ( \dfrac{\partial f(\textbf{w})}{\partial w[1]},\ldots, \dfrac{\partial f(\textbf{w})}{\partial w[d]} \right)  $$ 

i.e. the vector of partial derivatives.
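
As a small worked example (the function is chosen here purely for illustration), take $f(\textbf{w}) = \frac{1}{2}\|\textbf{w}\|^2 = \frac{1}{2}\sum_{i=1}^{d} w[i]^2$, which is differentiable and convex. Each partial derivative is $\partial f(\textbf{w})/\partial w[i] = w[i]$, so

$$ \nabla f(\textbf{w}) = \left( w[1],\ldots, w[d] \right) = \textbf{w} $$

and the gradient vanishes only at the minimizer $\textbf{w} = \textbf{0}$.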

Let $\textbf{w}^{(1)}$ denote the initial value of $\textbf{w}$ at iteration 1. Then the update step is defined as

$$ \textbf{w}^{(t+1)} = \textbf{w}^{(t)} -\eta\nabla f\left(\textbf{w}^{(t)}\right) $$

where $\eta > 0$ is the step size (also called the learning rate). After $T$ iterations, there are three potential outputs:

* the averaged vector: $\overline{\textbf{w}} = \frac{1}{T} \sum_{t=1}^{T} \textbf{w}^{(t)}$,
* the last vector: $\textbf{w}^{(T)}$, or
* the best-performing vector: $\textbf{w}^{(t^\star)}$, where $t^\star = \text{argmin}_{t\in[T]}\ f\left(\textbf{w}^{(t)}\right)$

The averaged vector generalizes more readily to nondifferentiable functions and to the stochastic case, so let the output be the averaged vector $\overline{\textbf{w}}$.
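
The update step and the three candidate outputs can be sketched in a few lines of NumPy. This is a minimal illustration rather than a definitive implementation: the function name `gradient_descent`, the test function $f(\textbf{w}) = \frac{1}{2}\|\textbf{w}\|^2$ (whose gradient is $\textbf{w}$), the starting point, and the values of $\eta$ and $T$ are all assumptions made here for the example.

```python
import numpy as np

def gradient_descent(grad_f, w_init, eta, T):
    """Run T iterations of gradient descent.

    Returns the averaged vector, the last vector w^(T), and the
    array of all iterates w^(1), ..., w^(T).
    """
    w = np.asarray(w_init, dtype=float)
    iterates = []
    for _ in range(T):
        iterates.append(w.copy())      # record w^(t)
        w = w - eta * grad_f(w)        # w^(t+1) = w^(t) - eta * grad f(w^(t))
    iterates = np.array(iterates)
    return iterates.mean(axis=0), iterates[-1], iterates

# Illustrative convex function (an assumption for this sketch).
def f(w):
    return 0.5 * np.dot(w, w)

def grad_f(w):
    return w

w_init = [4.0, -2.0]
w_bar, w_last, iterates = gradient_descent(grad_f, w_init, eta=0.1, T=100)

# Best-performing vector: the iterate with the smallest value of f.
w_best = min(iterates, key=f)

print("averaged:", w_bar)
print("last:    ", w_last)
print("best:    ", w_best)
```

On this smooth problem the last iterate lands closer to the minimizer than the average does, because the average still carries weight from the early iterates. The averaged output is kept here for the reason given above: it carries over to the nondifferentiable and stochastic settings.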

