
# Problem Set 6
In this problem set you will implement SGD and SVRG and compare them with each other and with GD.

# Problem 1: Stochastic Variance Reduced Gradient Descent (SVRG)

As we discussed in the video lectures, decomposable functions of the form
$$
\min_{\omega} \left[ F(\omega) = \frac{1}{n} \sum_{i=1}^n f_i(\omega) \right],
$$
are very common in statistics/ML problems. Here, each $f_i$ corresponds to a loss for a particular training example. For
example, if $f_i(\omega) = (\omega^\top x_i - y_i)^2$, then $F(\omega)$ is a least
squares regression problem. The standard gradient descent (GD) update 
$$
\omega_t = \omega_{t-1} - \eta_t \nabla F(\omega_{t-1})
$$

evaluates the full gradient $\nabla F(\omega) = \frac{1}{n} \sum_{i=1}^n \nabla
f_i(\omega)$, which requires evaluating $n$ derivatives. This can be
prohibitively expensive when the number of training examples $n$ is large. SGD
instead evaluates the gradient of one training example (or a small subset),
drawn randomly from $\{1,\dots,n\}$, per iteration:
$$
\omega_t = \omega_{t-1} - \eta_t \nabla f_i(\omega_{t-1}).
$$
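
As a minimal sketch of the two updates above, the following uses the least squares example $f_i(\omega) = (\omega^\top x_i - y_i)^2$ on synthetic data. The function names (`full_gradient`, `single_gradient`, `gd_step`, `sgd_step`) and the toy data are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def full_gradient(w, X, y):
    """Gradient of F(w) = (1/n) * sum_i (w^T x_i - y_i)^2 (n evaluations)."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / X.shape[0]

def single_gradient(w, X, y, i):
    """Gradient of f_i(w) = (w^T x_i - y_i)^2 (one evaluation)."""
    return 2.0 * (X[i] @ w - y[i]) * X[i]

def gd_step(w, X, y, eta):
    # GD: step along the full gradient
    return w - eta * full_gradient(w, X, y)

def sgd_step(w, X, y, eta, rng):
    # SGD: step along the gradient of one uniformly drawn example
    i = rng.integers(X.shape[0])
    return w - eta * single_gradient(w, X, y, i)

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)

# Averaging the n per-example gradients recovers the full gradient
avg = np.mean([single_gradient(w, X, y, i) for i in range(len(y))], axis=0)
print(np.allclose(avg, full_gradient(w, X, y)))
```

The final check illustrates why a uniformly sampled $\nabla f_i$ is an unbiased estimate of $\nabla F$.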

In expectation, the two updates are equivalent: when $i$ is drawn uniformly,
$\mathbb{E}_i[\nabla f_i(\omega)] = \nabla F(\omega)$. SGD has the computational
advantage of evaluating only a single gradient $\nabla f_i(\omega)$ per
iteration. The disadvantage is that the randomness introduces variance, which
slows convergence. This motivated the SVRG algorithm introduced in class.
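
One common way to write SVRG is sketched below; the version covered in class may differ in details such as how the snapshot is chosen. The sketch assumes that `T` is the number of inner-loop steps per snapshot and that the snapshot is reset to the last inner iterate. The function `svrg`, its arguments, and the least squares test problem are hypothetical and serve only as an illustration.

```python
import numpy as np

def svrg(grad_i, w0, n, eta, T, n_outer, rng):
    """SVRG sketch. grad_i(w, i) returns the gradient of f_i at w.

    Returns the final iterate and the number of single-example
    gradient evaluations used.
    """
    w_snap = np.array(w0, dtype=float)
    n_evals = 0
    for _ in range(n_outer):
        # Full gradient at the snapshot: n evaluations
        mu = np.mean([grad_i(w_snap, i) for i in range(n)], axis=0)
        n_evals += n
        w = w_snap.copy()
        for _ in range(T):
            i = rng.integers(n)
            # Variance-reduced direction: still unbiased for grad F(w),
            # with variance that shrinks as w and w_snap approach the optimum
            g = grad_i(w, i) - grad_i(w_snap, i) + mu
            w = w - eta * g
            n_evals += 2
        w_snap = w  # assumed rule: next snapshot is the last inner iterate
    return w_snap, n_evals

# Toy least squares problem, for illustration only
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def ls_grad_i(w, i):
    return 2.0 * (X[i] @ w - y[i]) * X[i]

w_hat, n_evals = svrg(ls_grad_i, np.zeros(d), n, eta=0.01, T=100,
                      n_outer=30, rng=rng)
print(n_evals, np.linalg.norm(w_hat - w_true))
```

Each outer iteration in this sketch costs $n + 2T$ single-example gradient evaluations, which is the quantity to track when comparing against GD and SGD.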

Given the dataset in **digits.zip**, plot the performance of GD, SGD, and SVRG for logistic regression with $\ell_2$ regularization. Measure performance as the negative log likelihood on the training data, and plot it against the number of gradient evaluations for a single training example (GD performs $n$ such evaluations per iteration and SGD performs $1$). Choose the $\ell_2$ parameter to optimize performance on the test set. How does the choice of $T$ (the number of inner loops) affect the performance of SVRG? There should be one plot with a title and three lines, each with a distinct color, marker, and legend label.
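
The objective being plotted can be sketched as follows for the binary case. This assumes labels in $\{0, 1\}$ and a penalty of the form $\frac{\lambda}{2}\|\omega\|^2$; both are assumptions for illustration, and a dataset with more than two classes would need a multiclass (softmax) variant instead. The names `nll`, `nll_grad`, and `lam` are hypothetical.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def nll(w, X, y, lam):
    """Mean negative log likelihood plus (lam/2)*||w||^2, labels in {0, 1}."""
    z = X @ w
    # log(1 + exp(z)) - y*z, computed stably with logaddexp
    return np.mean(np.logaddexp(0.0, z) - y * z) + 0.5 * lam * (w @ w)

def nll_grad(w, X, y, lam):
    """Gradient of nll with respect to w."""
    p = expit(X @ w)
    return X.T @ (p - y) / X.shape[0] + lam * w

# Finite-difference check on random data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=4)
lam = 0.1
eps = 1e-6
numeric = np.array([
    (nll(w + eps * e, X, y, lam) - nll(w - eps * e, X, y, lam)) / (2 * eps)
    for e in np.eye(4)
])
print(np.allclose(numeric, nll_grad(w, X, y, lam), atol=1e-6))
```

Checking the analytic gradient against central differences before running any optimizer is a cheap way to rule out sign or scaling mistakes.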



In [None]:
import zipfile
import pandas as pd
import numpy as np
import numpy.linalg as la
import matplotlib.pyplot as plt
import time
import pdb

%matplotlib inline

# Sample code to load digits.zip
def loaddata(filename):
    """Read each space-separated CSV in the archive into a DataFrame, keyed by member name."""
    data = {}
    with zipfile.ZipFile(filename) as z:
        for member in z.namelist():
            data[member] = pd.read_csv(z.open(member), sep=' ', header=None)
    return data

digits_dict = loaddata('./digits.zip')
print(digits_dict.keys())
X_digits_train = digits_dict['X_digits_train.csv']
X_digits_test = digits_dict['X_digits_test.csv']
y_digits_train = digits_dict['y_digits_train.csv'].to_numpy(dtype=int).ravel()
y_digits_test = digits_dict['y_digits_test.csv'].to_numpy(dtype=int).ravel()

dict_keys(['X_digits_test.csv', 'X_digits_train.csv', 'y_digits_test.csv', 'y_digits_train.csv'])


# Problem 2: Newsgroup Dataset Optimization

Using any approach, optimize performance of logistic regression on the test set in **news.zip** and compare the performance of your approach to standard SGD. This dataset is the full-dimensional newsgroup dataset (as opposed to the compressed version you worked with previously). The $X$ matrices are stored in sparse matrix format and can be read using scipy.io.mmread. As the dataset is large and high-dimensional, you will have to decide on how best to allocate your computational resources. Try to utilize the sparsity of the data (i.e., don't just convert it to a dense matrix and spend all your time multiplying zeros). You may use any of the techniques covered in class or ideas from outside class (e.g., momentum, variance reduction, minibatches, adaptive learning rates, preprocessing). Describe your methodology and comment on what you found improved performance and why. Plot the performance (negative log likelihood) of your method against standard SGD in terms of the number of gradient evaluations. 
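
One possible way to exploit sparsity is sketched below: convert the matrix to CSR once, then compute minibatch gradients with sparse row slices and sparse matrix-vector products. The sketch assumes binary labels in $\{0, 1\}$ and a $\frac{\lambda}{2}\|\omega\|^2$ penalty, which may not match the actual data. The function `minibatch_grad` and the synthetic matrix are hypothetical stand-ins.

```python
import numpy as np
from scipy import sparse
from scipy.special import expit

def minibatch_grad(w, X_csr, y, idx, lam):
    """Regularized logistic gradient on rows idx of a CSR matrix."""
    Xb = X_csr[idx]        # sparse row slice, stays sparse
    p = expit(Xb @ w)      # dense vector of length len(idx)
    # Xb.T @ (...) only touches the nonzero entries of the batch
    return Xb.T @ (p - y[idx]) / len(idx) + lam * w

# Synthetic sparse data standing in for the matrix read by mmread
rng = np.random.default_rng(0)
dense = rng.random((200, 1000)) * (rng.random((200, 1000)) < 0.01)
X_coo = sparse.coo_matrix(dense)  # COO does not support row slicing
X_csr = X_coo.tocsr()             # convert once, then slice rows cheaply
y = rng.integers(0, 2, size=200).astype(float)

w = np.zeros(1000)
idx = rng.choice(200, size=32, replace=False)
g = minibatch_grad(w, X_csr, y, idx, lam=1e-3)
print(g.shape)
```

Note that the `lam * w` term is dense, so each step still touches every coordinate; handling the regularizer lazily is one of the design choices to consider.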

In [None]:
from scipy.io import mmread
import sklearn.feature_selection


# Sample code to load news.zip
def loadnewsdata(filename='./news.zip'):
    """Read CSV members as DataFrames and Matrix Market members as sparse matrices."""
    data = {}
    with zipfile.ZipFile(filename) as z:
        for member in z.namelist():
            if 'csv' in member:
                data[member] = pd.read_csv(z.open(member), sep=' ', header=None)
            elif 'mtx' in member:
                # Sparse .mtx files are read in COO format; use .tocsr() for row slicing
                data[member] = mmread(z.open(member))
            else:
                raise ValueError(f'unexpected filetype: {member}')
    return data

news_dict = loadnewsdata('./news.zip')
print(news_dict.keys())
X_news_train = news_dict['X_news_train.mtx']
X_news_test = news_dict['X_news_test.mtx']
y_news_train = news_dict['y_news_train.csv'].to_numpy(dtype=int).ravel()
y_news_test = news_dict['y_news_test.csv'].to_numpy(dtype=int).ravel()

dict_keys(['X_news_test.mtx', 'X_news_train.mtx', 'y_news_test.csv', 'y_news_train.csv'])
