**The code template below can be used as a reference for Question 1. For it to work, you need to implement the `standard_normalizer` and `PCA_sphereing` functions correctly. Feel free to change the template if you need to.**

In this and the following example we compare runs of gradient descent on various real datasets using a) the original input, b) standard normalized input, and c) PCA sphered input.  The `Python` function `compare_schemes` below loops over three gradient descent runs, using a single cost function with each version of the data loaded in.  Three steplength parameter inputs (`alpha1`, `alpha2`, `alpha3`) allow one to adjust and compare steplength choices for each run.

As we saw when comparing standard normalization to original input in Sections 8.4, 9.4, and 10.3, we will typically find that a substantially larger steplength can be used on standard normalized data than on the original data, and likewise a larger steplength still on PCA sphered data than on standard normalized data.  The intuition behind why this is possible, first detailed in Section 8.4.3, is that PCA sphereing tends to make the contours of a cost function even more 'circular' than standard normalization does.  The more circular a cost function's contours become, the larger the steplength one can use, because the gradient descent direction aligns more closely with the direction one must travel in to reach a global minimum of the cost function.
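
One way to make the 'circular contours' intuition concrete: for a quadratic cost such as Least Squares, the Hessian is built from the second moments of the input, so the ratio of the largest to the smallest eigenvalue of the input covariance indicates how elongated the contours are (a ratio of 1 corresponds to perfectly circular contours). The following is a minimal sketch on a hypothetical two-feature toy dataset using plain `numpy`; the data, the helper `covariance_condition_number`, and all variable names are illustrative assumptions and not part of the template.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical toy input: N = 2 features (rows) by P = 500 points (columns),
# with very different feature scales and strong correlation
P = 500
a = rng.normal(size=P)
x = np.vstack([100.0 * a + rng.normal(size=P),
               0.5 * a + 0.1 * rng.normal(size=P)])

def covariance_condition_number(data):
    # ratio of largest to smallest eigenvalue of the input covariance
    centered = data - data.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / data.shape[1]
    eigvals = np.linalg.eigvalsh(cov)  # returned in ascending order
    return eigvals[-1] / eigvals[0]

# standard normalization: zero mean and unit standard deviation per feature
x_centered = x - x.mean(axis=1, keepdims=True)
x_std = x_centered / x.std(axis=1, keepdims=True)

# PCA sphering: rotate centered data onto its principal axes, then rescale
# each rotated coordinate to unit variance
d, V = np.linalg.eigh(x_centered @ x_centered.T / P)
x_sphered = (V.T @ x_centered) / np.sqrt(d)[:, np.newaxis]

cond_raw = covariance_condition_number(x)
cond_std = covariance_condition_number(x_std)
cond_sphered = covariance_condition_number(x_sphered)
print(cond_raw, cond_std, cond_sphered)
```

On this toy data the ratio shrinks from the original input to the standard normalized input, and reaches 1 (up to rounding) for the sphered input, which matches the ordering of steplengths described above.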

In [5]:
# This code cell will not be shown in the HTML version of this notebook
def identity(x):
    normalizer = lambda data: data
    inverse_normalizer = lambda data: data
    return normalizer,inverse_normalizer

def compare_schemes(x,y,costname,countname,alpha1,alpha2,alpha3,max_its):     
    # number of weight columns: one for two-class problems, C otherwise
    C = len(np.unique(y))
    if C == 2:
        C -= 1

    # create an all-zeros initialization shared by every run
    w = np.zeros((x.shape[0]+1,C))
    
    # gradient descent loop
    cost_histories = []
    count_histories = []
    for transform,alpha_choice in zip([identity,standard_normalizer,PCA_sphereing],[alpha1,alpha2,alpha3]): 
        #### transform input data ####
        # transform data
        normalizer,inverse_normalizer = transform(x)

        # normalize input
        x_transformed = normalizer(x)
        
        #### make cost and misclassification counter based on transformed input ####
        # create cost and counter
        cost = cost_lib.choose_cost(x_transformed,y,costname)
        count = cost_lib.choose_cost(x_transformed,y,countname)
        
        #### run gradient descent ####
        # make run of gradient descent
        weight_history,cost_history = optimizers.gradient_descent(cost,alpha_choice,max_its,w)
        
        # compute number of misclassifications
        count_history = [count(v) for v in weight_history]
        cost_histories.append(cost_history)
        count_histories.append(count_history)
    return cost_histories,count_histories
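
The loop above assumes each transform takes the input `x` (one row per feature, one column per point) and returns a `normalizer`, `inverse_normalizer` pair of functions, just as `identity` does. Below is a minimal sketch of two functions with that interface, offered as one possible starting point rather than the reference solution. The names `standard_normalizer_sketch` and `pca_sphereing_sketch`, the threshold `eps`, and the regularizer `lam` are assumptions introduced here for illustration.

```python
import numpy as np

def standard_normalizer_sketch(x, eps=1e-8):
    # x has shape (N, P): one row per feature, one column per point
    x_means = np.mean(x, axis=1, keepdims=True)
    x_stds = np.std(x, axis=1, keepdims=True)
    # assumption: features with (near) zero spread, such as always-blank
    # pixels, are only mean-centered to avoid dividing by zero
    x_stds = np.where(x_stds > eps, x_stds, 1.0)
    normalizer = lambda data: (data - x_means) / x_stds
    inverse_normalizer = lambda data: data * x_stds + x_means
    return normalizer, inverse_normalizer

def pca_sphereing_sketch(x, lam=1e-7):
    N, P = x.shape
    x_means = np.mean(x, axis=1, keepdims=True)
    x_centered = x - x_means
    # regularized covariance; lam keeps every eigenvalue strictly positive
    cov = x_centered @ x_centered.T / P + lam * np.eye(N)
    d, V = np.linalg.eigh(cov)
    stds = np.sqrt(np.maximum(d, lam))[:, np.newaxis]
    # rotate onto principal axes, then scale each axis to unit variance
    normalizer = lambda data: (V.T @ (data - x_means)) / stds
    inverse_normalizer = lambda data: V @ (data * stds) + x_means
    return normalizer, inverse_normalizer
```

Both sketches compute their statistics once from the data passed in and then close over them, so the same `normalizer` can later be applied to new points.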

Next we illustrate a run on each type of input using $50,000$ images randomly sampled from the [MNIST dataset](http://scikit-learn.org/stable/datasets/index.html) of handwritten digits between 0 and 9.  These images have been contrast normalized, a common pre-processing technique for image based data we discuss in Chapter 16.

We pick steplength values precisely as was done for the previous dataset, and again find that we can use a much larger value on the standard normalized input than on the original input, and a larger value still on the PCA sphered input.
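
The selection procedure is not restated in this section. One simple approach, assumed here purely for illustration, is to sweep over powers of ten and keep the steplength giving the lowest final cost. The sketch below does this on a hypothetical toy Least Squares problem (not the multiclass softmax cost or MNIST data used in this example), with the helper names `least_squares_descent` and `best_steplength` introduced only for this demonstration.

```python
import numpy as np

def least_squares_descent(x, y, alpha, max_its):
    # x: (N, P) input, y: (1, P) output; fixed steplength alpha.
    # Returns the cost at the initial point and after each of max_its steps.
    N, P = x.shape
    x_aug = np.vstack([np.ones((1, P)), x])  # prepend a row of ones for the bias
    w = np.zeros((N + 1, 1))
    history = []
    for _ in range(max_its + 1):
        residual = w.T @ x_aug - y
        history.append(float(np.sum(residual ** 2) / P))
        w = w - alpha * (2.0 * x_aug @ residual.T / P)
    return history

def best_steplength(x, y, exponents=range(-8, 2), max_its=50):
    # try alpha = 10**e for each exponent, keep the lowest finite final cost
    best_alpha, best_cost = None, np.inf
    for e in exponents:
        alpha = 10.0 ** e
        with np.errstate(over="ignore", invalid="ignore"):
            final = least_squares_descent(x, y, alpha, max_its)[-1]
        if np.isfinite(final) and final < best_cost:
            best_alpha, best_cost = alpha, final
    return best_alpha, best_cost

# hypothetical toy data: two correlated features on very different scales
rng = np.random.default_rng(0)
P = 500
a = rng.normal(size=P)
x = np.vstack([100.0 * a + rng.normal(size=P),
               0.5 * a + 0.1 * rng.normal(size=P)])
y = (0.02 * x[0] - 3.0 * x[1] + 0.1 * rng.normal(size=P))[np.newaxis, :]
x_std = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

alpha_raw, cost_raw = best_steplength(x, y)
alpha_std, cost_std = best_steplength(x_std, y)
print(alpha_raw, alpha_std)
```

On this toy problem the sweep selects a steplength several orders of magnitude larger for the standard normalized input than for the original input, because larger values cause the run on the original input to diverge.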

In [7]:
from sklearn.datasets import fetch_openml
from autograd import numpy as np

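# import MNIST as NumPy arrays (as_frame=False avoids a pandas DataFrame,
# which would break the column indexing used below)
x, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)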

# re-shape input/output data
x = x.T
y = np.array([int(v) for v in y])[np.newaxis,:]

print("input shape = " , x.shape)
print("output shape = ", y.shape)

input shape =  (784, 70000)
output shape =  (1, 70000)


Randomly sample input / output pairs.

In [8]:
# sample indices
num_sample = 50000
inds = np.random.permutation(y.shape[1])[:num_sample]
x_sample = x[:,inds]
y_sample = y[:,inds]

In [None]:
# # create normalizer
# normalizer,inverse_normalizer = standard_normalizer(x_sample.T)

# # normalize input
# x_sample = normalizer(x_sample.T).T

In [None]:
# run comparison module above
alpha_orig = 10**(-5)
alpha_standard = 10**(-1)
alpha_pca_sphered = 100
costname = 'multiclass_softmax'
countname = 'multiclass_counter'
max_its = 10
cost_histories,count_histories = compare_schemes(x_sample,y_sample,costname,countname,alpha_orig,alpha_standard,alpha_pca_sphered,max_its)

Plotting the resulting cost function histories, we can see how the run on standard normalized data converges rapidly in comparison to the run on the raw data, and how the run on PCA sphered data converges more rapidly still.

In [None]:
# compare cost / count histories
static_plotter = superlearn.classification_static_plotter.Visualizer()
static_plotter.plot_cost_histories(cost_histories,count_histories,start = 1,labels = ['original','standard','sphered'])