# Week 10 - Large Scale Machine Learning

This week, we'll cover large scale machine learning. Since ML works best when we have an abundance of data to leverage for training, knowing how to handle "big data" is a highly sought-after skill.

The topics we'll cover are:
* Gradient Descent with Large Datasets
  * Learning with Large Datasets
  * Stochastic Gradient Descent
  * Mini-Batch Gradient Descent
  * Stochastic Gradient Descent Convergence
* Advanced Topics
  * Online Learning
  * Map Reduce and Data Parallelism
  
## Gradient Descent with Large Datasets
 
### Learning with Large Datasets

One of the reasons learning algorithms have improved over the last decade is simply the large increase in the volume of data we're keeping and using. But why do we want such large datasets? We saw this earlier in an example of classifying between two confusable words (e.g., two and too), where a study (Banko and Brill, 2001) indicated that it wasn't necessarily the best algorithm that performed best, but the algorithm trained on the most data.

Let's say we have a large dataset with a training set of 100,000,000 samples. So our gradient descent update term for a linear regression is

$$ \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m \Big(h_\theta (x^{(i)}) - y^{(i)} \Big) x_j^{(i)}, $$

and performing this sum over all $m$ examples, for every parameter $\theta_j$ and at every iteration, takes *a lot* of computation. We need a way to reduce this computational expense. One thing we can try is training the model on a smaller dataset, say 1,000 samples. We can then use learning curves to see whether adding more data beyond this smaller set would help. From before, we found that if
* we had high variance (larger gap between $J_\text{cv}(\theta)$ and $J_\text{train}(\theta)$) then increasing the training set size was helpful
* we had high bias (very small gap between $J_\text{cv}(\theta)$ and $J_\text{train}(\theta)$) then increasing the training set size was not helpful
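As a rough illustration of this check, the sketch below fits a linear regression on growing subsets and reports $J_\text{train}$ and $J_\text{cv}$ for each size. The synthetic data, the helper names `learning_curve` and `make_data`, and the use of a least-squares solver in place of gradient descent are all assumptions made for the example.

```python
import numpy as np

def learning_curve(X_train, y_train, X_cv, y_cv, sizes):
    """Return (m, J_train, J_cv) for a model fit on the first m training examples."""
    rows = []
    for m in sizes:
        X_m, y_m = X_train[:m], y_train[:m]
        # Least-squares fit on the subset (stands in for running gradient descent).
        theta, *_ = np.linalg.lstsq(X_m, y_m, rcond=None)
        j_train = np.mean((X_m @ theta - y_m) ** 2) / 2
        j_cv = np.mean((X_cv @ theta - y_cv) ** 2) / 2
        rows.append((m, j_train, j_cv))
    return rows

# Hypothetical synthetic data: y = 1 + 2x plus Gaussian noise.
rng = np.random.default_rng(0)

def make_data(m):
    x = rng.uniform(-1, 1, size=m)
    X = np.column_stack([np.ones(m), x])  # first column is x_0 = 1
    y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(m)
    return X, y

X_train, y_train = make_data(1000)
X_cv, y_cv = make_data(1000)

curve = learning_curve(X_train, y_train, X_cv, y_cv, sizes=[10, 100, 1000])
for m, j_train, j_cv in curve:
    print(f"m={m:5d}  J_train={j_train:.4f}  J_cv={j_cv:.4f}")
```

If the gap between the two costs has already closed at the small size, training on the full dataset is unlikely to buy much; if a large gap remains, more data is worth the computational cost.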

### Stochastic Gradient Descent

With large training sets, gradient descent as we've been performing it can be quite computationally expensive. Here, we'll take a look at a modification of gradient descent, called *stochastic* gradient descent, that will allow us to scale these algorithms to large datasets.

Suppose we're using gradient descent on a linear regression. We'll stick to linear regression to introduce stochastic gradient descent, but this is applicable to other learning algorithms. Recall

$$ h_\theta (x) = \sum_{j=0}^n \theta_j x_j \\
   J_\text{train} (\theta) = \frac{1}{2 m} \sum_{i=1}^m \Big(h_\theta (x^{(i)}) - y^{(i)} \Big)^2 $$
   
and our gradient descent update to parameters was to repeat

$$ \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m \Big(h_\theta (x^{(i)}) - y^{(i)} \Big) x_j^{(i)}, j \in \{0,\cdots,n\} $$

until convergence. Remember that gradient descent simply starts at some initial point on the cost function and then slowly moves towards the minimum. For a large dataset, the summation in the update step takes a lot of calculation, since it runs over all $m$ examples for every single step. When we do it this way, we call it *batch* gradient descent; the term *batch* refers to the fact that we're looking at *all* of the training data in each update.
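A minimal NumPy sketch of this batch update is shown here for reference. The function name, the synthetic data, and the fixed iteration count (used instead of a convergence test) are assumptions for illustration.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, num_iters=500):
    """Batch gradient descent for linear regression.

    X is an (m, n+1) matrix whose first column is all ones (x_0 = 1);
    y is a length-m vector of targets.
    """
    m, num_params = X.shape
    theta = np.zeros(num_params)
    for _ in range(num_iters):
        errors = X @ theta - y          # h_theta(x^(i)) - y^(i) for every i
        gradient = (X.T @ errors) / m   # the sum over all m examples
        theta -= alpha * gradient       # simultaneous update of every theta_j
    return theta

# Hypothetical synthetic data: y = 1 + 2x plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.01 * rng.standard_normal(200)

theta = batch_gradient_descent(X, y)
print(theta)  # should land close to [1, 2]
```

Every pass through the loop touches all $m$ rows of `X`, which is exactly the cost that becomes prohibitive when $m$ is in the hundreds of millions.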

In contrast to this expensive batch gradient descent, we'll come up with a method that doesn't need to look at every example for each update, but instead looks at only a single example. We'll write the cost function in a slightly different way:

$$ \mathrm{cost} \big( \theta,(x^{(i)},y^{(i)}) \big) = \frac{1}{2} \Big(h_\theta (x^{(i)}) - y^{(i)} \Big)^2 \\
   J_\text{train}(\theta) = \frac{1}{m} \sum_{i=1}^m \mathrm{cost} \big( \theta,(x^{(i)},y^{(i)}) \big) $$
   
The steps of our stochastic gradient descent are then
1. Randomly shuffle the dataset
2. Repeat (maybe 1-10 times): loop over the training samples one at a time and, for each sample $i$, update

   $$ \theta_j := \theta_j - \alpha \Big(h_\theta (x^{(i)}) - y^{(i)} \Big) x_j^{(i)}, j \in \{0,\cdots,n\}, i \in \{1,\cdots,m\} $$
   
So the second step loops through the data and tries to fit one training example at a time. The random shuffling simply makes sure we visit the examples in no particular order. We've now removed the need to sum over all training examples before making an update.

One consequence here is that we're not guaranteed to step closer to the minimum on every update. Instead, we may sometimes step away from the minimum, but we'll still move in its general direction. Once stochastic gradient descent has reached the vicinity of the minimum, it will wander randomly around it rather than converging. This usually isn't much of a problem, since we really only care about the cost function being small, not about being exactly at the minimum.

With large enough datasets, we may find that a single pass through the data is enough. This is because each training sample triggers one update, so a large enough dataset already gives us sufficiently many gradient descent steps in one pass.
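The two steps of stochastic gradient descent can be sketched as follows for linear regression. The function name, the default learning rate, and the synthetic data are assumptions chosen for the example; the shuffle is done once up front, as in the steps above.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, num_passes=5, seed=0):
    """Stochastic gradient descent for linear regression (X includes x_0 = 1)."""
    rng = np.random.default_rng(seed)
    m, num_params = X.shape
    order = rng.permutation(m)            # step 1: randomly shuffle the dataset
    theta = np.zeros(num_params)
    for _ in range(num_passes):           # step 2: repeat a small number of times
        for i in order:                   # visit one training example at a time
            error = X[i] @ theta - y[i]   # h_theta(x^(i)) - y^(i)
            theta -= alpha * error * X[i] # update every theta_j from this example only
    return theta

# Hypothetical synthetic data: y = 1 + 2x plus a little noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=2000)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.01 * rng.standard_normal(2000)

theta = stochastic_gradient_descent(X, y)
print(theta)  # wanders near [1, 2] rather than settling exactly on it
```

Note that there is no sum over the training set anywhere in the inner loop: each update costs the same no matter how large $m$ is.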

### Mini-Batch Gradient Descent
   
There's another variation on batch gradient descent: *mini-batch* gradient descent. Our three choices so far will then be
* batch gradient descent: use all $m$ examples in each iteration
* stochastic gradient descent: use 1 example in each iteration
* mini-batch gradient descent: use $b<m$ examples in each iteration
  * $b$ here is called the "mini-batch" size, and a typical range of mini-batch sizes is from 2-100

Basically, we'll choose some mini-batch size, say

$$ b = 10 $$

and then get 10 examples

$$ (x^{(i)},y^{(i)}), \cdots, (x^{(i+9)},y^{(i+9)}) $$

and update parameters via

$$ \theta_j := \theta_j - \frac{\alpha}{10} \sum_{k=i}^{i+9} \Big(h_\theta (x^{(k)}) - y^{(k)} \Big) x_j^{(k)}, j \in \{0,\cdots,n\} $$

and then we'd repeat this for each successive group of 10 training examples. For example, say

$$ b =10, m=1000 .$$

Then our algorithm would be

> Repeat {
> 
> &nbsp; &nbsp; for $i = 1,11,21,31,\cdots,991$ {
>
> &nbsp; &nbsp; &nbsp; &nbsp; $\theta_j := \theta_j - \frac{\alpha}{10} \sum_{k=i}^{i+9} \Big(h_\theta (x^{(k)}) - y^{(k)} \Big) x_j^{(k)}, j \in \{0,\cdots,n\}$
>
> &nbsp; &nbsp; }
>
> }

This allows us to make progress on gradient descent without having to go through *all* of the samples for each update, as in full batch gradient descent. But when does mini-batch outperform stochastic gradient descent? In general, mini-batch is only likely to win when we have a good vectorized implementation. The summation over the $b$ examples in the mini-batch update can then be handed to a linear algebra library as a single vectorized operation, which the library can partially parallelize, giving better performance than processing the examples one at a time.


One disadvantage of mini-batch gradient descent is that we now have a new parameter to consider, the mini-batch size.
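A vectorized sketch of mini-batch gradient descent for linear regression might look like the following. The function name, defaults, and synthetic data are assumptions; the slice `X[start:start + b]` plays the role of examples $i$ through $i+b-1$.

```python
import numpy as np

def minibatch_gradient_descent(X, y, b=10, alpha=0.1, num_passes=20):
    """Mini-batch gradient descent for linear regression (X includes x_0 = 1)."""
    m, num_params = X.shape
    theta = np.zeros(num_params)
    for _ in range(num_passes):
        for start in range(0, m, b):      # start = 0, b, 2b, ... (0-based indexing)
            X_b = X[start:start + b]      # the next b examples
            y_b = y[start:start + b]
            errors = X_b @ theta - y_b    # vectorized over the whole mini-batch
            # Divide by the actual batch length so a short final batch is handled.
            theta -= (alpha / len(y_b)) * (X_b.T @ errors)
    return theta

# Hypothetical synthetic data with m = 1000: y = 1 + 2x plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.01 * rng.standard_normal(1000)

theta = minibatch_gradient_descent(X, y, b=10)
print(theta)  # should land close to [1, 2]
```

The matrix products `X_b @ theta` and `X_b.T @ errors` replace the explicit summation over the mini-batch, which is where the speedup over one-example-at-a-time updates comes from.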

### Stochastic Gradient Descent Convergence

How do we make sure stochastic gradient descent is converging, and how do we choose our learning rate? Before, we were plotting our cost function

$$ J_\text{train}(\theta) $$

as a function of the number of iterations of gradient descent, and we'd make sure that the cost function was decreasing with each iteration. With stochastic gradient descent, what we can do instead is compute, while scanning through the training samples and just before updating the parameters with each one,
 
$$ \mathrm{cost} \big( \theta,(x^{(i)},y^{(i)}) \big), \text{ using } (x^{(i)},y^{(i)}) $$

So for each example, we check how well the current parameters do on that example before it is used for an update. We can then plot this cost averaged over, say, the last 1000 examples, adding one point to the plot every 1000 iterations. The plot will be noisy, but since stochastic gradient descent ought to be moving towards the minimum of the cost function on average, the curve should trend downward. A smaller learning rate will likely make the oscillations seen in the plot a bit smaller. The larger the number of examples we average over, the smoother the curve will look, but the fewer points we get to plot. If you see the plot increase with iteration, then assuming your code is bug-free, try using a smaller learning rate.

As we have it now, stochastic gradient descent will ideally get near the minimum but never quite converge. Typically, we hold the learning rate constant. But we can slowly decrease the learning rate instead in order to get convergence. For example, we could set

$$ \alpha = \frac{\text{constant}_1}{\text{iteration count} + \text{constant}_2} .$$

So here we have a learning rate that drops as the iteration count grows. One reason not to do this is, of course, that we now have two more parameters to tune. What we should find, though, is that as we iterate, gradient descent meanders towards the minimum with steps that shrink as the iteration count grows, so it settles closer to the minimum.
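The monitoring idea and the decaying learning rate can be combined in one short sketch. The function name, the values of `const1` and `const2`, the window of 1000 examples, and the synthetic data are all assumptions picked for illustration, not recommended settings.

```python
import numpy as np

def sgd_with_monitoring(X, y, const1=10.0, const2=1000.0, window=1000, seed=0):
    """One pass of SGD for linear regression, recording windowed average costs."""
    rng = np.random.default_rng(seed)
    m, num_params = X.shape
    theta = np.zeros(num_params)
    recent_costs, averaged_costs = [], []
    for iteration, i in enumerate(rng.permutation(m), start=1):
        error = X[i] @ theta - y[i]
        recent_costs.append(0.5 * error ** 2)   # cost on example i BEFORE the update
        alpha = const1 / (iteration + const2)   # learning rate decays with iteration
        theta -= alpha * error * X[i]
        if iteration % window == 0:             # one plot point per window
            averaged_costs.append(float(np.mean(recent_costs)))
            recent_costs = []
    return theta, averaged_costs

# Hypothetical synthetic data: y = 1 + 2x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=5000)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(5000)

theta, averaged_costs = sgd_with_monitoring(X, y)
for k, cost in enumerate(averaged_costs, start=1):
    print(f"after {k * 1000} iterations: average cost = {cost:.4f}")
```

Plotting `averaged_costs` against the iteration count gives the convergence curve described above; a curve that trends upward would suggest lowering `const1`.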

## Advanced Topics

### Online Learning

We sometimes need to learn continuously from newly generated data. For example, suppose we run a website for a shipping service where a user specifies an origin and destination, and we offer to ship for a certain price. Sometimes, users will choose to use our shipping service, i.e.,

$$ y = 1 ,$$

and sometimes they will not, i.e.,

$$ y = 0 .$$

We want to learn from these features in order to optimize the price that we charge for shipping. Our features will capture properties of the user, the origin/destination, and the asking price. So we want to learn the probability that a user will choose our service at a given price:

$$ p(y=1|x;\theta) .$$

We can use logistic regression here. Our online logistic regression works as follows:

> Repeat forever {
> 
> &nbsp; &nbsp; Get $(x,y)$ corresponding to user
> 
> &nbsp; &nbsp; Update $\theta$ using $(x,y)$
> 
> &nbsp; &nbsp; &nbsp; &nbsp; $\theta_j := \theta_j - \alpha \big( h_\theta (x) - y \big) x_j,  j \in \{0,\cdots,n\} $
>
> }

Here, we're going to just be using one sample at a time, so we're discarding our past examples and only updating for each user that accesses our site. This is a pretty good algorithm provided we have a continuous stream of users providing the data. 
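The loop above can be sketched as a small class that is updated once per user and never stores past examples. The class name, the single "price" feature, and the simulated user behavior (`true_theta`) are hypothetical stand-ins for a real stream of website visitors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class OnlineLogisticRegression:
    """Logistic regression updated from one (x, y) pair at a time."""

    def __init__(self, num_features, alpha=0.05):
        self.theta = np.zeros(num_features)
        self.alpha = alpha

    def predict_proba(self, x):
        return sigmoid(x @ self.theta)    # p(y = 1 | x; theta)

    def update(self, x, y):
        error = self.predict_proba(x) - y # h_theta(x) - y
        self.theta -= self.alpha * error * x

# Simulated stream (assumption): users accept less often as the price rises.
rng = np.random.default_rng(0)
true_theta = np.array([2.0, -1.5])        # hypothetical "real" user behavior
model = OnlineLogisticRegression(num_features=2)

for _ in range(20000):                    # stands in for "repeat forever"
    price = rng.uniform(0.0, 3.0)
    x = np.array([1.0, price])            # x_0 = 1, x_1 = asking price
    y = float(rng.random() < sigmoid(x @ true_theta))
    model.update(x, y)                    # learn from this user, then discard (x, y)

p_cheap = model.predict_proba(np.array([1.0, 0.5]))
p_expensive = model.predict_proba(np.array([1.0, 2.5]))
print(p_cheap, p_expensive)
```

Because the learning rate is held constant, recent users keep influencing $\theta$, which is what lets the model track shifts in user preferences over time.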

One advantage here is that our algorithm can adapt to changing user preferences. So if something external happens (e.g., the economy crashes) that changes what people are willing to pay, we can adapt to that with such an algorithm.

Let's look at another example, this time for product search. Say a user searches for "Android phone 1080p camera", and that we have 100 phones in our store, and we want to return 10 results for this search. Here's what we'll do.
* Define $x$ as features of the phone, how many words match the user's query, etc.
* Define $y=1$ if user clicks link, $y=0$ otherwise
* Learn $p(y=1|x;\theta)$
* Use $p$ to show the user the top 10 phones they're most likely to click on

This problem is known as learning the predicted click-through rate (CTR). Since we're showing 10 search results, for each search, we get 10 pairs

$$(x,y)$$

that we can learn from.
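The ranking step, using $p(y=1|x;\theta)$ to pick the 10 phones to show, can be sketched as below. The function name, the random feature matrix, and the already-learned `theta` are hypothetical placeholders; in practice $\theta$ would come from the online updates.

```python
import numpy as np

def top_k_by_predicted_ctr(phone_features, theta, k=10):
    """Return indices and predicted click probabilities of the k best phones."""
    p_click = 1.0 / (1.0 + np.exp(-(phone_features @ theta)))
    top = np.argsort(-p_click)[:k]        # sort by descending probability
    return top, p_click[top]

# Hypothetical store of 100 phones, each with x_0 = 1 and two query-match features.
rng = np.random.default_rng(0)
phone_features = np.column_stack([np.ones(100), rng.uniform(0, 1, size=(100, 2))])
theta = np.array([-2.0, 3.0, 1.0])        # assumed to have been learned already

top_indices, top_probs = top_k_by_predicted_ctr(phone_features, theta)
print(top_indices)
print(top_probs)
```

Each of the 10 phones shown then yields one $(x, y)$ pair once we observe whether the user clicked it.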

### Map Reduce and Data Parallelism

Some machine learning problems are too big to run on one computer, no matter what algorithm we choose. We can use "map-reduce" to use multiple machines to learn. Say we have, for simplicity, 400 training samples. Our batch gradient descent would be

$$ \theta_j := \theta_j - \frac{\alpha}{400} \sum_{i=1}^{400} \Big(h_\theta (x^{(i)}) - y^{(i)} \Big) x_j^{(i)}, j \in \{0,\cdots,n\} $$

Let's assume we have 4 computers to run in parallel. Machine 1 will use

$$ \{ (x^{(1)},y^{(1)}), \cdots, (x^{(100)},y^{(100)}) \} $$

and calculate

$$ \text{temp}_j^{(1)} = \sum_{i=1}^{100} \Big(h_\theta (x^{(i)}) - y^{(i)} \Big) x_j^{(i)} $$

Machine 2 will use

$$ \{ (x^{(101)},y^{(101)}), \cdots, (x^{(200)},y^{(200)}) \} $$

and calculate

$$ \text{temp}_j^{(2)} = \sum_{i=101}^{200} \Big(h_\theta (x^{(i)}) - y^{(i)} \Big) x_j^{(i)} $$

Machine 3 will use

$$ \{ (x^{(201)},y^{(201)}), \cdots, (x^{(300)},y^{(300)}) \} $$

and calculate

$$ \text{temp}_j^{(3)} = \sum_{i=201}^{300} \Big(h_\theta (x^{(i)}) - y^{(i)} \Big) x_j^{(i)} $$

And finally, Machine 4 will use

$$ \{ (x^{(301)},y^{(301)}), \cdots, (x^{(400)},y^{(400)}) \} $$

and calculate

$$ \text{temp}_j^{(4)} = \sum_{i=301}^{400} \Big(h_\theta (x^{(i)}) - y^{(i)} \Big) x_j^{(i)} $$

So now, each machine only has to do a quarter of the work, and presumably this step only takes a quarter of the time. The next step is to combine the results to update our parameters:

$$\theta_j := \theta_j - \frac{\alpha}{400}  \big( \text{temp}_j^{(1)} + \text{temp}_j^{(2)} + \text{temp}_j^{(3)} + \text{temp}_j^{(4)} \big), j \in \{0,\cdots,n\} .$$
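As a sketch of the map and reduce steps, the code below splits 400 examples into 4 shards, computes each partial sum $\text{temp}^{(k)}$ separately, and then combines them. Threads inside one process merely stand in for the four machines here, and the function names and synthetic data are assumptions; a real cluster framework would distribute the shards differently.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_gradient_sum(shard, theta):
    """Map step: the sum of (h_theta(x) - y) * x over one shard of the data."""
    X_part, y_part = shard
    return X_part.T @ (X_part @ theta - y_part)

def mapreduce_gradient_step(X, y, theta, alpha, num_workers=4):
    m = len(y)
    shards = list(zip(np.array_split(X, num_workers),
                      np.array_split(y, num_workers)))
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        temps = list(pool.map(lambda s: partial_gradient_sum(s, theta), shards))
    total = np.sum(temps, axis=0)         # reduce step: temp^(1) + ... + temp^(4)
    return theta - (alpha / m) * total

# Hypothetical synthetic data with m = 400.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=400)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.01 * rng.standard_normal(400)

theta0 = np.zeros(2)
alpha = 0.1
new_theta = mapreduce_gradient_step(X, y, theta0, alpha)

# The combined result should match an ordinary single-machine batch update.
expected = theta0 - (alpha / 400) * (X.T @ (X @ theta0 - y))
print(np.allclose(new_theta, expected))
```

The parameter update itself is unchanged; only the summation is split up, which is why the result agrees with plain batch gradient descent.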

Many learning algorithms can be expressed as computing sums of functions over the training set, so it's often possible to use map-reduce. For example, suppose we want to use an advanced optimization algorithm with logistic regression. We then need to compute the cost function and its gradient:

$$ J_\text{train}(\theta) = -\frac{1}{m} \sum_{i=1}^m \Big[ y^{(i)} \log h_\theta (x^{(i)}) + (1-y^{(i)}) \log \big(1 - h_\theta (x^{(i)})\big) \Big] \\
\frac{\partial}{\partial \theta_j} J_\text{train}(\theta) = \frac{1}{m} \sum_{i=1}^m \Big(h_\theta (x^{(i)}) - y^{(i)} \Big) x_j^{(i)} $$

We can again break these summations up across our machines, so we're still able to use map-reduce with the advanced optimization techniques.

Note that you don't necessarily need different computers, just different computing cores. So if you have a quad-core CPU, you can use map-reduce on your multi-core machine to parallelize. Some linear algebra libraries take advantage of this for you behind the scenes if you're using vectorized computations.