In [None]:
# A bit of setup
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Create color maps
cmap_list = ['orange', 'cyan', 'cornflowerblue']
cmap_bold = ['darkorange', 'c', 'darkblue']
cmap_light = ListedColormap(cmap_list)

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

<!-- dom:TITLE: Homework 1, PHY 959 -->
<!-- dom:AUTHOR: [PHY 959: Machine Learning in Physics]-->


# PHY 905: Homework Set #2
Due: **February 2, 2023**



___
***


# Problem 2: Exploring Ensembles

In this problem, we will exercise what we learned about using ensembles to improve the reliability of our training.  We will continue to use sparse one-dimensional datasets, as they are easier to visualize.  However, these methods will be useful in the rest of the course material, especially with neural networks and boosted decision trees.  

Parts 2a-2c will be almost identical to what you have already seen in class or homework, so use what you can from those and copy/paste.  That will not be considered plagiarism.

There is a lot of "starter" code here that will not work out of the box.  You'll need to add your own code to fill in the gaps.  Ask Prof. Fisher if you get stuck.

## Part 2a:

This is almost exactly the same as what we did for Homework Set 2 Problem 1a, so feel free to copy that code over to the next cell!  Make the following edits:

  1. Increase the number of training data points from 10 to 40, but keep the sampling of the x range random.
  2. Change the functional form for the targets (```ytrain```) to be $y=0.75+3x-2x^2+2x^3-2x^4+0.3\sin(15x)$.  For the training data, add some Gaussian noise (i.e., add $+\,\mathrm{Gaus}(0,0.15)$ to ```ytrain```).
  3. Make another set of 20 data points in the range `[0,1]` with the same functional form, but increase the noise to 0.60.  Then concatenate (append) these to the previous training set.  This will give you a set of VERY noisy data with outliers that can impact your fits.  As you're doing the appending, recall that `np.random.uniform` gives you a `numpy` array and not a standard Python list, so `+` adds element-wise instead of concatenating.  Use `np.append()` (or `np.concatenate()`) for this.
  4. I suggest that you make a testing sample with zero noise for comparisons later in the notebook.
  5. It will help your visualizations if you do an `argsort` to permute the indices of your appended training arrays.
  
Keep the testing data as a convenient way to plot the true function.  You shouldn't add the noise to the testing data.
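As a rough illustration of the steps above, here is a minimal sketch. It assumes the first 40 points are also drawn from `[0,1]`, and the names `true_function`, `x_low`, `x_high`, `xtrain`, `xtest`, and `ytest` are placeholders chosen for this example rather than names required by the assignment.

```python
import numpy as np

def true_function(x):
    # Noise-free target: y = 0.75 + 3x - 2x^2 + 2x^3 - 2x^4 + 0.3 sin(15x)
    return 0.75 + 3*x - 2*x**2 + 2*x**3 - 2*x**4 + 0.3*np.sin(15*x)

np.random.seed(37)

# 40 mildly noisy points (x range [0, 1] is an assumption here)
x_low = np.random.uniform(0.0, 1.0, 40)
y_low = true_function(x_low) + np.random.normal(0.0, 0.15, 40)

# 20 very noisy points in [0, 1]
x_high = np.random.uniform(0.0, 1.0, 20)
y_high = true_function(x_high) + np.random.normal(0.0, 0.60, 20)

# Append the two sets; np.append concatenates NumPy arrays
xtrain = np.append(x_low, x_high)
ytrain = np.append(y_low, y_high)

# Sort by x, applying the same permutation to inputs and targets
order = np.argsort(xtrain)
xtrain, ytrain = xtrain[order], ytrain[order]

# Noise-free testing sample, handy for plotting the true function
xtest = np.linspace(0.0, 1.0, 100)
ytest = true_function(xtest)
```

Applying one `argsort` permutation to both arrays keeps each target paired with its input.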

In [None]:
# Build and visualize the data

# Make some training data
# First, 40 points that are a little noisy
nPts = 40
np.random.seed(37)


## Part 2b:

Create a design/feature matrix for your samples.  Do this in the same way that you did for Problem 1b.  Use a 7th order polynomial (nDegr=7).  Again, I strongly encourage you to try out the module ```PolynomialFeatures``` from ```scikit-learn```.  Don't forget your randomized training data!
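For orientation, here is a small NumPy-only sketch of what a polynomial design matrix looks like. The array `x` is a stand-in for your training inputs, and `np.vander` is used here only as an illustration; for a single input feature, `PolynomialFeatures(degree=nDegr).fit_transform(x.reshape(-1, 1))` should produce the same columns.

```python
import numpy as np

nDegr = 7

# Stand-in inputs; replace with your own training x values
x = np.array([0.0, 0.5, 1.0])

# Columns are x^0, x^1, ..., x^nDegr, so the shape is (n_samples, nDegr + 1)
DM = np.vander(x, N=nDegr + 1, increasing=True)
print(DM.shape)
```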

## Part 2c:

Use the Numpy ```linalg``` module to calculate the vector of weights that best fit your data.  Make a plot of your training data and your best fit line.  Also plot your testing data to illustrate how well the linear regression fit reproduces your underlying true function.

A convenient way to generate your line would be like this:  ```line = np.dot(DM,betas)``` wherein ```betas``` is the weights vector from your ```linalg``` solution.  Just like Problem #1c.
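The following sketch shows one way to do this with `np.linalg.lstsq`, under the assumption that a least-squares solver is acceptable here (solving the normal equations with other `linalg` routines is an alternative). The toy cubic data and the low degree are invented for this example so that the expected weights are known.

```python
import numpy as np

# Toy noise-free data from a known cubic: y = 1 + 2x - x^3
nDegr = 3
x = np.linspace(0.0, 1.0, 20)
y = 1 + 2*x - x**3

# Design matrix with columns x^0 ... x^nDegr
DM = np.vander(x, N=nDegr + 1, increasing=True)

# Least-squares solution of DM @ betas ~ y
betas, residuals, rank, sing_vals = np.linalg.lstsq(DM, y, rcond=None)

# Fitted curve evaluated at the training inputs
line = np.dot(DM, betas)
print(betas)
```

To draw a smooth curve, the same `betas` can be applied to a design matrix built from a dense grid of x values.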

## Part 2d:

Here we will explore two strategies for generating mini-batches for performing cross-validation training.  
1. The first strategy will use k-fold cross validation.  This is performed by breaking your data sample into k orthogonal subsamples and training on each subsample independently.  
2. The second strategy will be to generate bootstrap data samples, wherein the training data are randomly sampled (with replacement).  
    
In each case, you will use a range of mini-batch sizes, ranging from small (10%) to big (50%).  The way you will use these mini-batches is to fit your polynomial function to each mini-batch and average over all mini-batches to find an aggregate "best fit".  For example, if you have a mini-batch size of 10%, you will have 10 mini-batches and, thus, 10 regression results to average.

To avoid spending your time in Python developer mode, most of the code has been provided.  Your job will be to create functions that properly generate the mini-batches for training.  Do not use ```scikit-learn``` for this, as the goal is to build your own intuition for how the algorithm should work.  However, the use of ```Numpy``` is strongly encouraged.  You can either set up your code to perform both k-fold cross validation and bootstrap sampling at once, or allow toggling between the two. **If you get stuck, consult a classmate or Prof. Fisher!**

Some things to think about:

  1. When you make your mini-batches, you need to provide subsamples for both the design matrix and for the targets.
  2. Your mini-batches must maintain the one-to-one relationship of design matrix rows to targets.  This will be more challenging with the bootstrap random sampling.  I suggest exploring the ```np.random.choice()``` function.
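One possible way to keep rows and targets paired is to sample row indices and apply the same indices to both arrays. The sketch below is only an illustration under that assumption; the function names `bootstrap_batch` and `kfold_batch`, their arguments, and the toy `DM` and `y` are hypothetical and not part of the starter code.

```python
import numpy as np

def bootstrap_batch(DM, y, frac):
    # Draw row indices with replacement, then index both arrays identically
    n = y.size
    m = max(1, int(round(frac * n)))
    idx = np.random.choice(n, size=m, replace=True)
    return DM[idx], y[idx]

def kfold_batch(DM, y, perm, k, fold):
    # Split one fixed permutation into k disjoint chunks and pick one of them
    idx = np.array_split(perm, k)[fold]
    return DM[idx], y[idx]

# Toy data where each target is tied to its row: y = 2 * DM[:, 1]
np.random.seed(0)
DM = np.column_stack([np.ones(10), np.arange(10.0)])
y = 2.0 * np.arange(10.0)

bDM, by = bootstrap_batch(DM, y, 0.5)

# Shuffle once, outside the loop over folds, so the folds do not overlap
perm = np.random.permutation(y.size)
folds = [kfold_batch(DM, y, perm, 5, f) for f in range(5)]
```

Generating the permutation once and reusing it for every fold is what keeps the k-fold subsamples disjoint.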


You will need to make plots of your results in the next part of the problem.  So you need two arrays to keep track of your results:

  1. An array to average the fit parameters from each mini-batch.  For example, if you choose mini-batch size to be 10%, you will have 10 mini-batches to average over.  This array should be extended to hold all of your mini-batch size averages.  The size should be ```(batchRange.size,nDegr+1)```.
  2. An array to hold the weights for fits to each of the mini-batches in your smallest mini-batch size.  This will be to illustrate the power of averaging over mini-batches.  The size should be ```(nBatches,nDegr+1)```.
  

In [None]:
#Linear regression

# some hyperparameters
step_size = 0.2  # Note that if the learning rate is too big, you may not converge!

step = 0.1  # increment between the batch-size fractions to be tested
bMin = 0.1
bMax = 0.501

doRandomBatch = 1

# this array will hold the mini-batch size fractions that we'll test
batchRange = np.arange(bMin,bMax+0.001,step)

# this array will hold the results of our various fits
weights = np.zeros((batchRange.size,nDegr+1))

# this array will hold the results of our smallest batch size
smallWeights = np.zeros((int(0.01+1.0/batchRange[0]),nDegr+1))  # np.int was removed in NumPy 1.24

# gradient descent loop
Niter = 2500
for bidx, batchFrac in enumerate(batchRange):

    #calculate the number of batches we can get using this fraction
    batchIterations = int(0.01+1.0/batchFrac)
    avgWeights = np.zeros(betas.size)
    
    print("Batch Size Fraction:",batchFrac)
    for aidx in range(batchIterations):
        
        # build a mini-batch!
        if doRandomBatch: # bootstrap data samples
            iDM, iy = ?? # function to create DM and targets for bootstrap sample of the right size
        else: # k-fold cross validation data samples
            iDM, iy = ?? # function to create DM and targets for k-fold cross validation of the right size
        
    
        #Start from our analytical fit to speed convergence
        W = betas*0.99
        for i in range(Niter):
  
            # evaluate function values
            fhat = np.dot(iDM,W)

            # compute mean squared loss
            data_loss = np.sum((iy-fhat)**2)
            data_loss /= iy.size

            # Report progress
            if i % 1000 == 0:
                print("Batch %.3f, iteration %d: loss %f" % (batchFrac,i, data_loss))
        
            #compute the loss gradients
            dW = -2*np.dot(iDM.T,iy-fhat)
            dW /= iy.size
                
            # update the weights
            W -= step_size*dW

        #capture weights
        avgWeights += W*1.0/batchIterations
        if bidx == 0:
            smallWeights[aidx] = W
        
    print("Final Weights:\n",avgWeights)
    weights[bidx] = avgWeights
print(weights)

## Part 2e:

Now that you have your mini-batch results, create two figures:

  1. A plot of your training and test samples overlaid with all N of the fits from your smallest mini-batch fraction.  For example, if your smallest mini-batch fraction is 10%, you'll have 10 lines overlaid.
  2. A plot of your training and test samples overlaid with all M of your mini-batch fit averages.
  
You should be able to make these figures for both k-fold cross validation and for random bootstrap sampling.  Once you've inspected all four figures, please answer the following questions:

  1. Which ensemble method gives you the smallest variance between mini-batch sizes?  Why do you think this is?
  2. Which ensemble method reproduces the true underlying function the best?  Is that a good thing or not in this case?
  3. What are the apparent advantages of large or small mini-batch sizes?
  