# Training Logistic Regression via Stochastic Gradient Ascent

The goal of this notebook is to implement a logistic regression classifier using stochastic gradient ascent. You will:

 * Extract features from Amazon product reviews.
 * Convert an SFrame into a NumPy array.
 * Write a function to compute the derivative of the log likelihood function (with L2 penalty) with respect to a single coefficient.
 * Implement stochastic gradient ascent with L2 penalty.
 * Compare convergence of stochastic gradient ascent with that of batch gradient ascent.

# Fire up graphlab create

Make sure you have the latest version of GraphLab Create. If you don't find the decision tree module, then you would need to upgrade graphlab-create using

```
   pip install graphlab-create --upgrade
```

In [None]:
from __future__ import division
import graphlab

## Load and process review dataset

For this assignment, we will use a subset of the Amazon product review dataset. The subset was chosen to contain similar numbers of positive and negative reviews, as the original dataset consisted primarily of positive reviews. (**Add link to the subset data**)

In [None]:
products = graphlab.SFrame('amazon_baby_subset.gl/')

Just like we did in an earlier assignment, we will work with a subset of hand-curated **important** words. We perform 2 simple data transformations:

1. Remove punctuation using [Python's built-in](https://docs.python.org/2/library/string.html) string manipulation functionality.
2. Compute word counts (only for the important_words)

Refer to notes in the earlier assignments for more details on how this works.

In [None]:
import json
with open('important_words.json', 'r') as f: 
    important_words = json.load(f)
important_words = [str(s) for s in important_words]

# Remove punctuation
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation) 

products['review_clean'] = products['review'].apply(remove_punctuation)

# Split out the words into individual columns
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s : s.count(word))

The SFrame *products* now contains one column for each of the 193 **important_words**. 

In [None]:
products
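
The two transformations can be tried on a single string without GraphLab. The sketch below is only an illustration: the review text and the three-word list are made up, and the punctuation filter is a portable stand-in for `text.translate(None, string.punctuation)`, which works only on Python 2 byte strings.

```python
import string

def remove_punctuation(text):
    # Portable alternative to text.translate(None, string.punctuation).
    return ''.join(ch for ch in text if ch not in string.punctuation)

# Hypothetical review and word list, for illustration only.
review = "Great stroller! My baby loves it, loves it, loves it."
toy_words = ['baby', 'love', 'great']

clean = remove_punctuation(review)
# str.count counts non-overlapping substring matches and is case-sensitive.
counts = {word: clean.count(word) for word in toy_words}

print(clean)
print(counts)
```

Note that `count` matches substrings, so `'love'` is counted inside `'loves'`, while `'great'` does not match the capitalized `'Great'`.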

### Split data into training and test sets

We will now split the data into a 90-10 split where 90% is in the training set and 10% is in the test set.

In [None]:
train_data, test_data = products.random_split(.9, seed=1)

print 'Training set : %d data points' % len(train_data)
print 'Test set     : %d data points' % len(test_data)

## Convert SFrame to NumPy array

Just like in the earlier assignments, we provide you with a function that extracts columns from an SFrame and converts them into a NumPy array. Two arrays are returned: one representing features and another representing class labels. 

**Note:** The feature matrix includes an additional column 'intercept' filled with 1's to take account of the intercept term.

In [None]:
import numpy as np

def get_numpy_data(data_sframe, features, label):
    data_sframe['intercept'] = 1
    features = ['intercept'] + features
    features_sframe = data_sframe[features]
    feature_matrix = features_sframe.to_numpy()
    label_sarray = data_sframe[label]
    label_array = label_sarray.to_numpy()
    return(feature_matrix, label_array)
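
To see the shape of what `get_numpy_data` returns without needing an SFrame, here is a minimal NumPy-only sketch. The word counts and labels are hypothetical, and `np.hstack` stands in for the `'intercept'` column that the function adds to the SFrame.

```python
import numpy as np

# Hypothetical word counts for three reviews and two words.
toy_counts = np.array([[2., 0.],
                       [0., 1.],
                       [1., 3.]])
toy_labels = np.array([+1, -1, +1])

# Prepend a column of ones, playing the role of the 'intercept' column.
intercept = np.ones((toy_counts.shape[0], 1))
toy_feature_matrix = np.hstack([intercept, toy_counts])

print(toy_feature_matrix.shape)   # one more column than toy_counts
print(toy_feature_matrix[:, 0])   # the intercept column
```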

Note that we convert both training and test sets into NumPy arrays.

**Warning**: This may take a few minutes.

In [None]:
feature_matrix_train, sentiment_train = get_numpy_data(train_data, important_words, 'sentiment')
feature_matrix_test, sentiment_test = get_numpy_data(test_data, important_words, 'sentiment') 

In [None]:
feature_matrix_train

**Quiz question**: In the earlier assignment, there were 194 features (an intercept + one feature for each of the 193 important words). In this assignment, we will use stochastic-gradient ascent to train the classifier using logistic regression. How does changing the solver to stochastic-gradient ascent affect the number of features?

In [None]:
sentiment_train

## Building on logistic regression with L2 penalty assignment

Let us now build on previous assignments. Recall from lecture that the link function for logistic regression can be defined as:

$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))},
$$

where the feature vector $h(\mathbf{x}_i)$ is given by the word counts of **important_words** in the review $\mathbf{x}_i$. 


We will use the **same code** as in the past assignment to make probability predictions, since this part is not affected by using stochastic-gradient ascent as a solver. (Only the way in which the coefficients are learned is affected by using stochastic-gradient ascent as a solver.)

In [None]:
'''
produces probabilistic estimate for P(y_i = +1 | x_i, w).
estimate ranges between 0 and 1.
'''
def predict_probability(feature_matrix, coefficients):
    # Take dot product of feature_matrix and coefficients  
    score = np.dot(feature_matrix, coefficients)
    
    # Compute P(y_i = +1 | x_i, w) using the link function
    predictions = 1. / (1.+np.exp(-score))    
    return predictions
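
As a quick sanity check of the link function, the sketch below repeats the same computation under the hypothetical name `sigmoid_probability` (so that it runs on its own) and applies it to a made-up two-row feature matrix.

```python
import numpy as np

def sigmoid_probability(feature_matrix, coefficients):
    # Same computation as predict_probability, repeated so the example is self-contained.
    score = np.dot(feature_matrix, coefficients)
    return 1. / (1. + np.exp(-score))

# Hypothetical feature matrix: intercept column plus two features.
toy_features = np.array([[1., 2., -1.],
                         [1., 0., 1.]])

# With all-zero coefficients every score is 0, so every probability is 0.5.
p_zero = sigmoid_probability(toy_features, np.zeros(3))

# A positive weight on the first word raises the probability of the first row only.
p_other = sigmoid_probability(toy_features, np.array([0., 1., 0.]))

print(p_zero)
print(p_other)
```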

## Derivative for a single feature

Let us now work on making minor changes to how the derivative computation is performed for logistic regression with an L2 penalty. 

Recall from lecture and the previous assignment that the **per-coefficient derivative for logistic regression with an L2 penalty** is as follows:

$$
\frac{\partial\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) - 2\lambda w_j
$$
and for the intercept term, we have
$$
\frac{\partial\ell}{\partial w_0} = \sum_{i=1}^N h_0(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right)
$$
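
One way to gain confidence in these two formulas is a finite-difference check on toy data. The sketch below is not part of the assignment: the helper names, the data, and the penalty value are all hypothetical, and the penalty is assumed to skip the intercept $w_0$, as in the second formula.

```python
import numpy as np

def log_likelihood_l2(X, y, w, l2_penalty):
    # Unaveraged log likelihood; the penalty skips the intercept w[0] (assumption).
    scores = np.dot(X, w)
    indicator = (y == +1)
    data_term = np.sum((indicator - 1) * scores - np.log(1. + np.exp(-scores)))
    return data_term - l2_penalty * np.sum(w[1:] ** 2)

def analytic_derivative(X, y, w, l2_penalty, j):
    # Per-coefficient derivative: sum_i h_j(x_i) * error_i, minus 2*lambda*w_j for j != 0.
    predictions = 1. / (1. + np.exp(-np.dot(X, w)))
    errors = (y == +1) - predictions
    derivative = np.dot(errors, X[:, j])
    if j != 0:
        derivative -= 2 * l2_penalty * w[j]
    return derivative

# Hypothetical toy problem.
X = np.array([[1., 2., -1.], [1., 0., 1.], [1., 1., 3.]])
y = np.array([+1, -1, +1])
w = np.array([0.1, -0.2, 0.3])
l2_penalty = 0.5

eps = 1e-6
max_gap = 0.
for j in range(len(w)):
    step = np.zeros(len(w))
    step[j] = eps
    # Central difference approximation of the derivative with respect to w[j].
    numeric = (log_likelihood_l2(X, y, w + step, l2_penalty)
               - log_likelihood_l2(X, y, w - step, l2_penalty)) / (2 * eps)
    analytic = analytic_derivative(X, y, w, l2_penalty, j)
    max_gap = max(max_gap, abs(numeric - analytic))
    print('j=%d analytic=%.6f numeric=%.6f' % (j, analytic, numeric))
```

If the formulas are right, the analytic and numeric columns should agree to several decimal places for every `j`.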

Recall from an earlier assignment that 
we can compute the derivative of log likelihood with respect to a single coefficient $w_j$ by writing a function 
which accepts the following five arguments:
 * `errors` vector containing $(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w}))$ for all $i$
 * `feature` vector containing $h_j(\mathbf{x}_i)$  for all $i$
 * `coefficient` containing the current value of coefficient $w_j$.
 * `l2_penalty` representing the L2 penalty constant $\lambda$
 * `feature_is_constant` telling whether the $j$-th feature is constant or not.
 
Complete the following code block which computes the feature derivative with an L2 penalty given the above 5 terms:

In [None]:
def feature_derivative_with_L2(errors, feature, coefficient, l2_penalty, feature_is_constant): 
    
    # Compute the dot product of errors and feature
    ## YOUR CODE HERE
    derivative = ...

    # add L2 penalty term for any feature that isn't the intercept.
    if not feature_is_constant: 
        ## YOUR CODE HERE
        ...
        
    return derivative

To verify the correctness of the gradient computation and to track the performance of stochastic-gradient ascent, we provide a function for computing the $\color{red}{\mbox{average}}$ log likelihood. (Recall from the last assignment that this form of the log likelihood was detailed in an advanced optional video; it is used here for its numerical stability.)

$$\ell\ell_A(\mathbf{w}) = \color{red}{\frac{1}{N}} \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) - \lambda\|\mathbf{w}\|_2^2 $$

**Note** that we made one tiny modification to the log-likelihood function (called **compute_log_likelihood_with_L2**) in our earlier assignments. We added a $\color{red}{\frac{1}{N}}$ term which averages the log-likelihood across all data points. The $\color{red}{\frac{1}{N}}$ term makes it easier for us to compare stochastic-gradient ascent with gradient ascent. We will use this function to generate plots that are similar to those you saw in the lecture.

In [None]:
def compute_avg_log_likelihood_with_l2(feature_matrix, sentiment, coefficients, l2_penalty):
    
    indicator = (sentiment==+1)
    scores = np.dot(feature_matrix, coefficients)
    logexp = np.log(1. + np.exp(-scores))
    
    # Simple check to prevent overflow
    mask = np.isinf(logexp)
    logexp[mask] = -scores[mask]
    
    lp = np.sum((indicator-1)*scores - logexp)/len(feature_matrix) - l2_penalty*np.sum(coefficients[1:]**2)
    
    return lp
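
The "simple check to prevent overflow" can be seen in isolation on a few hand-picked scores. This is only a sketch with made-up values: for a very negative score, `np.exp(-score)` overflows to `inf`, and the function falls back on the approximation $\ln(1 + e^{-s}) \approx -s$.

```python
import numpy as np

# Hypothetical scores, including one extreme value on each side.
scores = np.array([-1000., -1., 0., 1000.])

# np.exp(1000.) overflows to inf, so the naive formula gives inf for the first entry.
with np.errstate(over='ignore'):
    logexp = np.log(1. + np.exp(-scores))
print(logexp)

# Replace the overflowed entries with -score, as in the function above.
mask = np.isinf(logexp)
logexp[mask] = -scores[mask]
print(logexp)
```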


**Quiz question:** Recall from the lecture and the earlier assignment that the log-likelihood (without the averaging term) is given by 

$$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) -{\lambda\|\mathbf{w}\|_2^2} $$

If the L2 regularization $\lambda=0$, how are the functions $\ell\ell(\mathbf{w})$ and $\ell\ell_A(\mathbf{w})$ related?

## Modifying the derivative for stochastic-gradient ascent

Recall from the lecture that the gradient with respect to a single data point $\color{red}{x_i}$ can be computed using the following formula:

$$
\frac{\partial\ell_{\color{red}{i}}(\mathbf{w})}{\partial w_j} = h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) - 2\lambda w_j
$$


**Computing the gradient for a single data point**

Do we really need to re-write all our code to modify $\frac{\partial\ell(\mathbf{w})}{\partial w_j}$ to $\frac{\partial\ell_{\color{red}{i}}(\mathbf{w})}{{\partial w_j}}$? 


Thankfully, **no!** Using NumPy, we can access $x_i$ in the training data using `feature_matrix_train[i:i+1,:]`
and $y_i$ in the training data using `sentiment_train[i:i+1]`. We can compute $\frac{\partial\ell_{\color{red}{i}}(\mathbf{w})}{\partial w_j}$ by re-using **all the code** written in `feature_derivative_with_L2` and `predict_probability`.
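
The reason for slicing with `[i:i+1, :]` rather than indexing with `[i, :]` is that the slice keeps the row as a 2-D matrix, so code written for many rows works unchanged on one row. A small sketch with a made-up matrix:

```python
import numpy as np

X = np.arange(12.).reshape(4, 3)   # hypothetical 4 x 3 feature matrix
y = np.array([+1, -1, +1, -1])     # hypothetical labels
w = np.zeros(3)
i = 2

row_as_matrix = X[i:i+1, :]   # slicing keeps two dimensions: shape (1, 3)
row_as_vector = X[i, :]       # plain indexing drops one: shape (3,)
label_as_array = y[i:i+1]     # still an array: shape (1,)

# The dot product with the sliced row is a length-1 array, one score per row.
score = np.dot(row_as_matrix, w)

print(row_as_matrix.shape, row_as_vector.shape, label_as_array.shape, score.shape)
```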


We can compute $\frac{\partial\ell_{\color{red}{i}}(\mathbf{w})}{\partial w_j}$ easily using the following steps:
* First, compute $P(y_i = +1 | \mathbf{x}_i, \mathbf{w})$ using the `predict_probability` function with `feature_matrix_train[i:i+1,:]` as the first parameter.
* Next, compute $\mathbf{1}[y_i = +1]$ using `sentiment_train[i:i+1]`.
* Finally, call the `feature_derivative_with_L2` function with `feature_matrix_train[i:i+1, j]` as one of the parameters. 

Let us follow these steps for `j = 1` and `i = 10`:

In [None]:
j = 1                        # Feature number
i = 10                       # Data point number
coefficients = np.zeros(194) # A point w at which we are computing the gradient.
l2_penalty = 0.0             # L2 penalty

predictions = predict_probability(feature_matrix_train[i:i+1,:], coefficients)
indicator = (sentiment_train[i:i+1]==+1)

errors = indicator - predictions        
gradient_single_data_point = feature_derivative_with_L2(errors, feature_matrix_train[i:i+1, j], 
                                    coefficients[j], l2_penalty, False)
print "Gradient single data point: %s" % gradient_single_data_point

**Quiz Question:** The code block above computed $\frac{\partial\ell_{\color{red}{i}}(\mathbf{w})}{{\partial w_j}}$ for `j = 1` and `i = 10`.  Is $\frac{\partial\ell_{\color{red}{i}}(\mathbf{w})}{{\partial w_j}}$ a scalar or a 194-dimensional vector?

## Modifying the derivative for using a batch of data points

Given a mini-batch (or a set of data points) $x_{i}, x_{i+1} \ldots x_{i+B}$, the gradient function for the batch of data points is given by:
$$
\color{red}{\sum_{s = i}^{i+B}} \frac{\partial\ell_{s}}{\partial w_j} = \color{red}{\sum_{s = i}^{i + B}} h_j(\mathbf{x}_s)\left(\mathbf{1}[y_s = +1] - P(y_s = +1 | \mathbf{x}_s, \mathbf{w})\right) - 2\lambda w_j
$$


**Computing the gradient for a "mini-batch" of data points**

Using NumPy, we can access the points $x_i, x_{i+1} \ldots x_{i+B}$ in the training data using `feature_matrix_train[i:i+B,:]`
and the corresponding labels $y_i, y_{i+1} \ldots y_{i+B}$ using `sentiment_train[i:i+B]`. 

We can compute $\color{red}{\sum_{s = i}^{i+B}} \frac{\partial\ell_{s}}{\partial w_j}$ easily as follows:

In [None]:
j = 1                        # Feature number
i = 10                       # Data point start
B = 10                       # Mini-batch size
coefficients = np.zeros(194) # A point w at which we are computing the gradient.
l2_penalty = 0.0             # L2 penalty

predictions = predict_probability(feature_matrix_train[i:i+B,:], coefficients)
indicator = (sentiment_train[i:i+B]==+1)

errors = indicator - predictions        
gradient_mini_batch = feature_derivative_with_L2(errors, feature_matrix_train[i:i+B, j], 
                                    coefficients[j], l2_penalty, False)
print "Gradient mini-batch data points: %s" % gradient_mini_batch

**Quiz Question:** The code block above computed 
$\color{red}{\sum_{s = i}^{i+B}}\frac{\partial\ell_{s}(\mathbf{w})}{{\partial w_j}}$ 
for `j = 1`, `i = 10`, and `B = 10`. Is this a scalar or a 194-dimensional vector?


**Quiz Question:** For what value of `B` is the term
$\color{red}{\sum_{s = 1}^{B}}\frac{\partial\ell_{s}(\mathbf{w})}{\partial w_j}$
the same as the full gradient
$\frac{\partial\ell(\mathbf{w})}{{\partial w_j}}$?


### Averaging the gradient across batch-size

It is a common practice to **normalize** the gradient update rule by the batch size B:

$$
\frac{\partial\ell_{\color{red}{A}}(\mathbf{w})}{\partial w_j} \approx \color{red}{\frac{1}{B}} {\sum_{s = i}^{i + B}} h_j(\mathbf{x}_s)\left(\mathbf{1}[y_s = +1] - P(y_s = +1 | \mathbf{x}_s, \mathbf{w})\right) - 2\lambda w_j
$$
In other words, we update the coefficients using the **average gradient over data points** (instead of using a summation). By using the average gradient, we ensure two things:
* First, the L2 penalty $\lambda$ has consistent effect regardless of the choice of the batch size.
* Second, we can compare various batch sizes of stochastic-gradient ascent (including a batch-size of **all the data points**) and study the effect of batch size on the algorithm.
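
The effect of normalizing can be illustrated on toy data. In the sketch below (hypothetical helper and data, with the L2 term left out), a "large" batch is built by repeating a small batch four times: the summed gradient grows with the batch size, while the averaged gradient stays the same.

```python
import numpy as np

def batch_gradient_sum(X, y, w, j):
    # Summed derivative over the batch for coefficient j (no L2 term in this sketch).
    errors = (y == +1) - 1. / (1. + np.exp(-np.dot(X, w)))
    return np.dot(errors, X[:, j])

# Hypothetical small batch, then the same batch repeated four times.
X_small = np.array([[1., 2.], [1., -1.]])
y_small = np.array([+1, -1])
X_large = np.tile(X_small, (4, 1))
y_large = np.tile(y_small, 4)

w = np.zeros(2)
j = 1

sum_small = batch_gradient_sum(X_small, y_small, w, j)
sum_large = batch_gradient_sum(X_large, y_large, w, j)
avg_small = sum_small / len(X_small)
avg_large = sum_large / len(X_large)

print('summed  : %.2f vs %.2f' % (sum_small, sum_large))
print('averaged: %.2f vs %.2f' % (avg_small, avg_large))
```

Because the averaged data term does not scale with the batch size, its size relative to the $2\lambda w_j$ term is the same for every batch size.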


## Implementing stochastic-gradient ascent

Now we are ready to implement our own logistic regression with stochastic-gradient ascent. Complete the following function to solve the logistic regression model using stochastic-gradient ascent:

In [None]:
from math import sqrt
def logistic_regression_SG(feature_matrix, sentiment, initial_coefficients, step_size, batch_size, l2_penalty, max_iter):
    log_likelihood_all = []
    
    # make sure it's a numpy array
    coefficients = np.array(initial_coefficients)
    # set seed=0 to produce consistent results
    np.random.seed(seed=0)
    
    for itr in xrange(max_iter):
        # Randomly select an index
        i = np.random.randint(len(feature_matrix)-batch_size+1)
        
        # Predict P(y_i = +1|x_i,w) using your predict_probability() function
        # Make sure to slice the i-th row of feature_matrix with [i:i+batch_size,:]
        ### YOUR CODE HERE
        predictions = ...
        
        # Compute indicator value for (y_i = +1)
        # Make sure to slice the i-th entry with [i:i+batch_size]
        indicator = (sentiment[i:i+batch_size]==+1)
        
        # Compute the errors as indicator - predictions
        errors = indicator - predictions
        for j in xrange(len(coefficients)): # loop over each coefficient
            # Recall that feature_matrix[:,j] is the feature column associated with coefficients[j]
            # compute the derivative for coefficients[j].
            # Make sure to slice the i-th row of feature_matrix with [i:i+batch_size,j]
            # For the last argument, test if j is 0 or not.
            ### YOUR CODE HERE
            derivative = ...
            
            # compute the product of the step size, the derivative, and the **normalization constant** (1./batch_size)
            ### YOUR CODE HERE
            coefficients[j] += ...
        
        # Checking whether log likelihood is increasing
        # Print the log likelihood over the *current batch*
        lp = compute_avg_log_likelihood_with_l2(feature_matrix[i:i+batch_size,:], sentiment[i:i+batch_size],
                                                coefficients, l2_penalty)
        log_likelihood_all.append(lp)
        if itr <= 15 or (itr <= 1000 and itr % 100 == 0) or (itr <= 10000 and itr % 1000 == 0) \
         or itr % 10000 == 0 or itr == max_iter-1:
            data_size = len(feature_matrix)
            print 'Iteration %*d: Average log likelihood (of data points in batch [%0*d:%0*d]) = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, \
                 int(np.ceil(np.log10(data_size))), i, \
                 int(np.ceil(np.log10(data_size))), i+batch_size, lp)
                
    # We return the list of log likelihoods for plotting purposes.
    return coefficients, log_likelihood_all

### Checkpoint


The following cell tests your stochastic-gradient ascent function using a toy dataset consisting of two data points. If the test does not pass, make sure you are normalizing the gradient update rule correctly.

In [None]:
sample_feature_matrix = np.array([[1.,2.,-1.], [1.,0.,1.]])
sample_sentiment = np.array([+1, -1])

coefficients, log_likelihood = logistic_regression_SG(sample_feature_matrix, sample_sentiment, np.zeros(3),
                                                  step_size=1., batch_size=2, l2_penalty=0, max_iter=2)
print '-------------------------------------------------------------------------------------'
print 'Coefficients computed                :', coefficients
print 'Average log likelihood per-iteration :', log_likelihood
if np.allclose(coefficients, np.array([-0.09755757,  0.68242552, -0.7799831]), atol=1e-3)\
  and np.allclose(log_likelihood, np.array([-0.33774513108142956, -0.2345530939410341])):
    # pass if elements match within 1e-3
    print '-------------------------------------------------------------------------------------'
    print 'Test passed!'
else:
    print '-------------------------------------------------------------------------------------'
    print 'Test failed'

## Compare convergence behavior of stochastic-gradient ascent

For the remainder of the assignment, we will compare stochastic-gradient ascent against batch gradient ascent. For that we need a reference implementation of batch gradient ascent. But do we need to implement it from scratch?

**Quiz Question:** For what value of batch-size **B** above is the stochastic-gradient ascent function `logistic_regression_SG` the same as gradient ascent from the previous assignment?

## Running gradient ascent using the stochastic-gradient ascent implementation

Instead of implementing the two separately, we save time by re-using the stochastic-gradient ascent function we just wrote: **to perform gradient ascent**, it suffices to set **`batch_size`** to the number of data points in the training data. Yes, we just answered the quiz question above for you, but it is an important point to remember in the future.

**Small Caveat**. The batch gradient ascent implementation here is slightly different than the one in the earlier assignments, as we now normalize the gradient update rule.

We now **run stochastic gradient ascent** over the **feature_matrix_train** for 11 iterations using:
* `initial_coefficients = np.zeros(194)`
* `step_size = 5e-1`
* `batch_size = 1`
* `l2_penalty = 0.0`
* `max_iter = 11`

In [None]:
coefficients, log_likelihood = logistic_regression_SG(feature_matrix_train, sentiment_train,
                                        initial_coefficients=np.zeros(194),
                                        step_size=5e-1, batch_size=1, l2_penalty=0.,
                                        max_iter=11)

**Quiz Question**. When you set `batch_size = 1`, as each iteration passes, how does the average log-likelihood change:
* Increases
* Decreases
* Fluctuates 

Now run **batch gradient ascent** over the **feature_matrix_train** for 200 iterations using:
* `initial_coefficients = np.zeros(194)`
* `step_size = 5e-1`
* `batch_size = len(feature_matrix_train)`
* `l2_penalty = 0.0`
* `max_iter = 200`

In [None]:
coefficients_batch, log_likelihood_batch = ...

**Quiz Question**. When you set `batch_size = len(train_data)`, as each iteration passes, how does the average log-likelihood change:
* Increases
* Decreases
* Fluctuates 

## Make "passes" over the dataset

To make a fair comparison between stochastic-gradient ascent and batch-gradient ascent, we measure the average log-likelihood as a function of the number of passes (defined as follows):
$$
[\text{# of passes}] = \frac{[\text{# of data points touched so far}]}{[\text{size of dataset}]}
$$
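
The definition translates directly into code. The helper name and the numbers below are hypothetical and chosen only to illustrate the arithmetic: each gradient update touches `batch_size` data points.

```python
def passes_completed(num_iterations, batch_size, num_data_points):
    # Each iteration touches batch_size data points.
    data_points_touched = num_iterations * batch_size
    return float(data_points_touched) / num_data_points

# Hypothetical setting: 1000 data points, mini-batches of 50, 40 updates.
stochastic_passes = passes_completed(40, 50, 1000)

# Batch gradient ascent on the same data: every update is one full pass.
batch_passes = passes_completed(3, 1000, 1000)

print(stochastic_passes, batch_passes)
```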

**Quiz Question** Suppose that we run stochastic-gradient ascent with a batch size of 100. How many gradient updates are performed at the end of two passes over a dataset consisting of 50000 data points?

## Log-likelihood plots for stochastic-gradient ascent

With the terminology in mind, let us run stochastic-gradient ascent for 2 passes. We will use
* `step_size=1e-1`
* `batch_size=100`
* `l2_penalty=0`
* `initial_coefficients` set to all zeros.

In [None]:
step_size = 1e-1
batch_size = 100
num_passes = 2
num_iterations = num_passes * int(len(feature_matrix_train)/batch_size)

coefficients_sgd_first_2, log_likelihood_sgd_first_2 = ...

We provide you a utility function to plot how the average log-likelihood changes as the number of passes increases.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plot_init = True

def make_plot(log_likelihood_all, len_data, batch_size, smoothing_window=1, label=''):
    plt.rcParams.update({'figure.figsize': (9,5)})
    log_likelihood_all_ma = np.convolve(np.array(log_likelihood_all), \
                                        np.ones((smoothing_window,))/smoothing_window, mode='valid')
    plt.plot(np.array(range(smoothing_window-1, len(log_likelihood_all)))*float(batch_size)/len_data,
             log_likelihood_all_ma, linewidth=4.0, label=label)
    plt.rcParams.update({'font.size': 16})
    plt.tight_layout()
    plt.xlabel('# of passes over data')
    plt.ylabel('Average log likelihood per data point')
    plt.legend(loc='lower right', prop={'size':14})

In [None]:
make_plot(log_likelihood_sgd_first_2, len_data=len(feature_matrix_train), batch_size=100,
          label='stochastic-gradient, step_size=1e-1')

## Smoothing the stochastic-gradient ascent curve

The plotted line oscillates so much that it is hard to see whether the log likelihood is improving. In our plot, we apply a simple smoothing operation using the parameter `smoothing_window`. The smoothing is simply a [moving average](https://en.wikipedia.org/wiki/Moving_average) of log-likelihood over the last `smoothing_window` "iterations" of stochastic-gradient ascent.
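
The moving average inside `make_plot` is computed with `np.convolve` in `'valid'` mode. The sketch below applies the same call to a short made-up sequence so the effect is easy to verify by hand.

```python
import numpy as np

# Hypothetical sequence standing in for per-iteration log likelihoods.
values = np.array([1., 2., 3., 4., 5.])
smoothing_window = 3

# A uniform kernel of length smoothing_window averages each run of 3 consecutive values.
kernel = np.ones((smoothing_window,)) / smoothing_window
smoothed = np.convolve(values, kernel, mode='valid')

print(smoothed)   # [2. 3. 4.]
```

With `mode='valid'` the output has `len(values) - smoothing_window + 1` entries, which is why `make_plot` starts its x-axis at index `smoothing_window-1`.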

In [None]:
make_plot(log_likelihood_sgd_first_2, len_data=len(feature_matrix_train), batch_size=100,
          smoothing_window=30, label='stochastic-gradient, step_size=1e-1')

**Checkpoint**: The above plot should look smoother than the previous plot. Play around with `smoothing_window`. As you increase it, you should see a smoother plot.

## Stochastic-gradient ascent vs gradient ascent

To compare convergence rates for stochastic-gradient ascent with gradient ascent, we call `make_plot()` multiple times in the same cell.

We are comparing:
* **stochastic-gradient ascent**: `step_size = 0.1`, `batch_size=100`
* **batch-gradient ascent**: `step_size = 0.5`, `batch_size=len(feature_matrix_train)`

Write code to run stochastic-gradient ascent for 200 passes using:
* `step_size=1e-1`
* `batch_size=100`
* `l2_penalty=0`. 
* `initial_coefficients` to all zeros.

In [None]:
step_size = 1e-1
batch_size = 100
num_passes = 200
num_iterations = num_passes * int(len(feature_matrix_train)/batch_size)

## YOUR CODE HERE
coefficients_sgd, log_likelihood_sgd = ...

In [None]:
make_plot(log_likelihood_sgd, len_data=len(feature_matrix_train), batch_size=100,
          smoothing_window=30, label='stochastic, step_size=1e-1')
make_plot(log_likelihood_batch, len_data=len(feature_matrix_train), batch_size=len(feature_matrix_train),
          smoothing_window=1, label='batch, step_size=5e-1')

**Quiz Question**: In the figure above, how many passes does batch-gradient ascent need to achieve a similar log likelihood as stochastic-gradient ascent? 

1. It's always better
2. 10 passes
3. 20 passes
4. 200 passes or more

## Explore the effects of step sizes on stochastic-gradient ascent

In previous sections, we chose step sizes for you. In practice, it helps to know how to choose good step sizes yourself.

To start, we explore a wide range of step sizes that are equally spaced in the log space. Run stochastic-gradient ascent for `num_passes = 10` using `step_size` set to 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, and 1e2. Use 
* `initial_coefficients=np.zeros(194)`
* `batch_size=100`
* `l2_penalty=0`.
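
The seven step sizes listed above are exactly what `np.logspace(-4, 2, num=7)` produces, which is a convenient way to generate them. A minimal sketch:

```python
import numpy as np

# Seven values evenly spaced in log10 between 1e-4 and 1e2.
step_sizes = np.logspace(-4, 2, num=7)

for step_size in step_sizes:
    print('step_size=%.1e' % step_size)

# Dropping the last entry leaves every step size except 1e2.
without_largest = step_sizes[0:6]
```

When the step sizes are later used as dictionary keys, iterating over the same `np.logspace` call for both storing and looking up keeps the floating-point keys identical.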

In [None]:
batch_size = 100
num_passes = 10
num_iterations = num_passes * int(len(train_data)/batch_size)

coefficients_sgd = {}
log_likelihood_sgd = {}
for step_size in np.logspace(-4, 2, num=7):
    coefficients_sgd[step_size], log_likelihood_sgd[step_size] \
        = ...

### Plotting the log-likelihood as a function of passes for each step size

Now, we will plot the change in log-likelihood using the `make_plot` for each of the following values of `step_size`:

* `step_size = 1e-4`
* `step_size = 1e-3`
* `step_size = 1e-2`
* `step_size = 1e-1`
* `step_size = 1e0`
* `step_size = 1e1`
* `step_size = 1e2`

In [None]:
for step_size in np.logspace(-4, 2, num=7):
    make_plot(log_likelihood_sgd[step_size], len_data=len(train_data), batch_size=100,
              smoothing_window=30, label='step_size=%.1e'%step_size)

**Quiz question**: Did `step_size = 1e2` diverge or converge?

Now, let us remove the step size `step_size = 1e2` and plot the rest of the curves.

In [None]:
for step_size in np.logspace(-4, 2, num=7)[0:6]:
    make_plot(log_likelihood_sgd[step_size], len_data=len(train_data), batch_size=100,
              smoothing_window=30, label='step_size=%.1e'%step_size)

**Quiz Question**: For which of the following step sizes can stochastic-gradient ascent be said to diverge? Choose all that apply. Hint: Which of the plotted lines fail to approach the optimum?
1. 1e-2
2. 1e-1
3. 1e0
4. 1e1
5. 1e2