# LGT1 Unit07 Day29 - In-Class Assignment:  ML for the Inverse Problem in the Pseudo-PDF Method for LQCD
### <p style="text-align: right;"> &#9989; Put your name here.</p>
#### <p style="text-align: right;"> &#9989; Put your group member names here.</p>

## Goals of this assignment

The goals of this assignment are:

* Construct and train a multi-layer perceptron (MLP) to produce PDFs from lattice (pseudo-)data
* Tune the hyperparameters of the MLP to get the best validation accuracy
* See how well the model can predict pseudo-data generated from more complicated functions

## Assignment instructions

Work with your group to complete this assignment. Upload the assignment to Gradescope at the end of class.
**Make sure everyone's name is listed in everyone's notebook before moving on**

# Part 1: Train and Tune an MLP Regressor
In this section, we will load the pseudo-data and create a multilayer perceptron (MLP) model with scikit-learn to predict the PDF.
The input and output pseudo-data will be in exactly the same format as what we generated in the pre-class assignment (except that the input RpITD is flattened instead of 2D).

In [3]:
# You will need this to display plots with latex in google colab
#!sudo apt install cm-super dvipng texlive-latex-extra texlive-latex-recommended

In [2]:
#Libraries and stuff to include
from sklearn.neural_network import MLPRegressor
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['text.usetex'] = True
plt.rcParams['text.latex.preamble'] = r'\usepackage{amsmath}'
from sklearn.model_selection import train_test_split
from sklearn import metrics

## Read In and Split the Data
We load the training input RpITD pseudo-data from "indata.npy" and the training output PDFs from "outdata.npy".
Additionally, I provide the $(a,b)$ pairs that correspond to the PDFs, so that we can see later how well we reproduce certain types of PDFs.
I decided that 2000 samples was a little better for this exercise than the originally proposed 1000 samples.
We do a 67/33 training-to-testing split.

In [None]:
X = np.load("indata.npy")
y = np.load("outdata.npy")
ab_pairs = np.load("ab_pairs.npy")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

## Train the Model and Play With Hyperparameters
For the first task, we will take the standard scikit-learn ```MLPRegressor```, train it, and read out the train and test mean squared error (MSE); squared error is what the standard MLP regressor uses as its loss function.
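As a quick check of what the MSE means for multi-output data like ours, here is a minimal sketch on a toy array (3 samples with 4 outputs each, standing in for the real 99-point PDFs; the arrays are made up for illustration). It shows that `metrics.mean_squared_error` with its default settings averages the squared residuals over both samples and outputs.

```python
import numpy as np
from sklearn import metrics

# Toy stand-in: 3 samples, 4 outputs each (the real PDFs have 99 outputs).
y_true = np.array([[1.0, 2.0, 3.0, 4.0],
                   [0.0, 1.0, 0.0, 1.0],
                   [2.0, 2.0, 2.0, 2.0]])
y_pred = y_true + 0.5  # every prediction is off by exactly 0.5

# scikit-learn's signature is mean_squared_error(y_true, y_pred).
mse_sklearn = metrics.mean_squared_error(y_true, y_pred)

# The same number by hand: average the squared residuals over
# all samples and all outputs.
mse_manual = np.mean((y_true - y_pred) ** 2)

print("sklearn:", mse_sklearn)
print("manual: ", mse_manual)
```

Because every output point is weighted equally, a small MSE can still hide a poor fit in a narrow region of $x$, which is one reason to look at plots as well.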

The most important hyperparameter in an MLP is probably ```hidden_layer_sizes```.
This determines the overall structure of the MLP.
Our MLP will automatically have input and output sizes of 25 and 99, respectively, based on our input and output data.
The number of hidden layers and their sizes is up to you!
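To see how `hidden_layer_sizes` sets the structure, the sketch below fits a small MLP on random stand-in data with 25 inputs and 99 outputs (not the real pseudo-data) and prints the shapes of the weight matrices stored in `coefs_`. The choice `(10, 10)` and the very short `max_iter` are arbitrary, since we only care about the shapes here.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

# Random stand-in data with the same input/output sizes as our problem.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 25))   # 25 input features per sample
y_toy = rng.normal(size=(50, 99))   # 99 output values per sample

# Two hidden layers of 10 nodes each; max_iter is tiny on purpose.
mlp_shape = MLPRegressor(hidden_layer_sizes=(10, 10), max_iter=5, random_state=0)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    mlp_shape.fit(X_toy, y_toy)

# One weight matrix per pair of adjacent layers: input -> hidden -> hidden -> output.
layer_shapes = [w.shape for w in mlp_shape.coefs_]
print("weight matrix shapes:", layer_shapes)
print("total number of layers (including input and output):", mlp_shape.n_layers_)
```

The first and last dimensions are fixed by the data, and everything in between comes from the tuple you pass in.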

We will also consider the ```activation``` functions. 
```MLPRegressor``` supports ```{'identity', 'logistic', 'tanh', 'relu'}```.


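The ```solver``` and the learning rate are worth exploring as well.
```MLPRegressor``` supports solvers ```{'lbfgs', 'sgd', 'adam'}```.
In my experience, ```lbfgs``` is fairly slow and doesn't converge, but that doesn't necessarily mean you should avoid it entirely.
Just keep in mind that there are other exercises to get to.

As mentioned in a previous notebook, you have to be careful with the learning rate: too large a value can cause divergence, and too small a value slows training down.
Note that in ```MLPRegressor``` the learning rate is ```learning_rate_init``` (default 0.001, used only by the ```sgd``` and ```adam``` solvers), while ```alpha``` is the strength of the L2 regularization term (default 0.0001).
The code cells in this notebook set ```alpha```; pass ```learning_rate_init``` to ```MLPRegressor``` as well if you want to change the step size.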
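Since `alpha` and the learning rate are easy to mix up, here is a short sketch that reads both defaults straight from `MLPRegressor` and then sets them separately. The values passed to `mlp_sketch` are just the defaults written out explicitly, not a recommendation.

```python
from sklearn.neural_network import MLPRegressor

# Read the default hyperparameters without training anything.
defaults = MLPRegressor().get_params()
print("alpha (L2 regularization strength):", defaults["alpha"])
print("learning_rate_init (step size for sgd/adam):", defaults["learning_rate_init"])

# The two knobs are independent and can be set separately.
mlp_sketch = MLPRegressor(solver="adam", alpha=1e-4, learning_rate_init=1e-3)
```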

Finally, if you have time, you can look into more parameters for ```MLPRegressor``` in the scikit-learn documentation [here](https://scikit-learn.org/dev/modules/generated/sklearn.neural_network.MLPRegressor.html).

### (Task 1): Play with the hyperparameters
Modify the code below to see what gets you the smallest mean squared error (MSE).
Write observations in the cell below the code block.
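If you get tired of changing one value at a time, one possible way to organize the search is a small loop over candidate settings. The sketch below is only an illustration: it runs on synthetic stand-in data (`X_toy`, `y_toy`, generated here and not related to the real pseudo-data), and the candidate layer sizes and activations are arbitrary choices.

```python
import warnings

import numpy as np
from sklearn import metrics
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data: 25 inputs, 99 outputs, a smooth nonlinear map.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-1, 1, size=(300, 25))
W_toy = rng.normal(size=(25, 99))
y_toy = np.tanh(X_toy @ W_toy / 5)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.33, random_state=0)

# Try every combination of a few layer structures and activations.
results = {}
for layers in [(10,), (50,), (50, 50)]:
    for activation in ["tanh", "relu"]:
        mlp = MLPRegressor(hidden_layer_sizes=layers, activation=activation,
                           solver="adam", random_state=0, max_iter=200)
        with warnings.catch_warnings():
            # Short runs may not converge; that is fine for a rough scan.
            warnings.simplefilter("ignore", ConvergenceWarning)
            mlp.fit(Xtr, ytr)
        results[(layers, activation)] = metrics.mean_squared_error(yte, mlp.predict(Xte))

# Print the settings from smallest to largest test MSE.
for setting, mse in sorted(results.items(), key=lambda kv: kv[1]):
    print(setting, mse)

best = min(results, key=results.get)
print("best setting in this scan:", best)
```

The ranking on the stand-in data says nothing about the real pseudo-data, so treat this as a template for the bookkeeping rather than as a hint about which settings win.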

In [None]:
#Main hyperparameters
layers = ### YOUR CODE HERE # Should be an n-tuple of integers, ex. (10,10,10) 
activate = ### YOUR CODE HERE # Choose from {'identity', 'logistic', 'tanh', 'relu'}
solve = ### YOUR CODE HERE # Choose from {'lbfgs', 'sgd', 'adam'}
alpha_reg = 0.00001 # L2 regularization strength (MLPRegressor's alpha); you may change this if you want



# Create the MLPRegressor
mlp_test = MLPRegressor(hidden_layer_sizes=layers, activation=activate, solver=solve, alpha=alpha_reg, random_state=0, max_iter=2000)

# Train on the training set
mlp_test.fit(X_train, y_train)

# Calculate and read out the train and test error; the signature is mean_squared_error(y_true, y_pred)
train_mse = metrics.mean_squared_error(y_train, mlp_test.predict(X_train))
test_mse = metrics.mean_squared_error(y_test, mlp_test.predict(X_test))

print("Train error:", train_mse)
print("Test error:", test_mse)

### Observations:
Take some notes about your findings in the cell below.

\[Your notes here\]

### (Task 2): Visualize the Results
The MSE can only tell us so much about your results, so pick some samples and plot the actual and predicted PDFs.
Make comments in the markdown cells below the code block.

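Reminder: the split puts 1340 samples in the training set and 660 in the test set. Because ```train_test_split``` shuffles the data by default, these are not the index ranges 0-1339 and 1340-1999 of ```X```; splitting an array of indices with the same ```test_size``` and ```random_state``` tells you which set a given sample landed in. Make sure to make some observations about samples from both sets.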
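A minimal sketch of how to find out which samples are in which set, assuming the same `test_size=0.33` and `random_state=0` as in the split above. The array `X_toy` is a random stand-in with the same shape as the real input data; with the real data you would split `np.arange(len(X))` instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the real input array (2000 samples, 25 features).
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(2000, 25))

# Split an index array with the same settings used for the data.
indices = np.arange(len(X_toy))
train_idx, test_idx = train_test_split(indices, test_size=0.33, random_state=0)

# Check that these indices reproduce the split of the data itself.
X_toy_train, X_toy_test = train_test_split(X_toy, test_size=0.33, random_state=0)
same_split = np.array_equal(X_toy[train_idx], X_toy_train)

print("train/test sizes:", len(train_idx), len(test_idx))
print("indices reproduce the data split:", same_split)
print("first few test-set sample numbers:", test_idx[:5])

# Look up where one particular sample ended up.
sample = 5
print("sample", sample, "is in the", "training" if sample in train_idx else "test", "set")
```

Any entry of `train_idx` or `test_idx` can then be used directly as `sample` in the plotting cell.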

In [None]:
sample = ### YOUR CODE HERE # Sample to plot

#print (a,b) of the sample
ab = ab_pairs[sample]
print("a =", ab[0], "b =", ab[1])

#Get actual and predicted gluon PDFs
xgx = y[sample] #actual
xgx_pred = mlp_test.predict([X[sample]]) #predicted

#Make an array of the x values for the PDFs
xs_out = np.arange(1, 100)/100

#Plot of the output PDF data
plt.plot(xs_out, xgx, label = "actual")
plt.plot(xs_out, xgx_pred[0], label = "predicted")
plt.xlabel(r'$x$', fontsize = 18)
plt.ylabel(r'$xg(x)/\langle x_g \rangle$', fontsize = 18)
#plt.yscale('log') ### You can uncomment this to look a little closer at the large-x region
plt.legend(fontsize = 18)

### Record your observations
\[Your notes here\]

# Part 2: Compare to a "Realistic" PDF (30 minutes?)
We have seen that our model can somewhat reasonably reproduce simple model PDFs from their pseudo-RpITD-data, but what if actual lattice data correspond to a more complicated PDF?
Can our model handle it?


I have taken the more complicated parameterization of the gluon PDF from the CT18 global fit ([arXiv:1912.10053v3](https://arxiv.org/abs/1912.10053)) and generated PDFs and corresponding pseudo-data with 5% variance around their parameters.
We would like to see if our model can reproduce the more complicated PDF forms.

## Read in The Challenge Data
We have 1000 samples of the challenge input and output data.
We are NOT training our model with this data, just testing it, so there are no test-train splits.
The ```indata_challenge.npy``` and ```outdata_challenge.npy``` sets are organized exactly like the training data.
```PDF_challenge_mean_error.npy``` and ```RpITD_challenge_mean_error.npy``` contain the mean and error for the PDFs and the RpITDs for later plotting.

In [None]:
X_challenge = np.load("indata_challenge.npy")
y_challenge = np.load("outdata_challenge.npy")

PDF_challenge = np.load("PDF_challenge_mean_error.npy")
RpITD_challenge = np.load("RpITD_challenge_mean_error.npy")

N_data = len(y_challenge)

### (Task 3): PDF Visualization
Below, we plot the random sample of PDFs that I generated from the more complicated parameterization.
In the cell below the plot, note any differences you observe compared to the behavior of our simpler model PDFs.
Do you think our model will be able to handle the features of these PDFs?

In [None]:
xs_out = np.arange(1,100)/100 

for i in range(N_data):
    plt.plot(xs_out, y_challenge[i], color = 'tab:blue', alpha = .2)

plt.xlabel(r'$x$', fontsize = 18)
plt.ylabel(r'$xg(x)/\langle x_g \rangle$', fontsize = 18)

\[Your comments here\]

### (Task 4): Use Your Best Model to Predict the PDFs
Make sure that the last time you trained ```mlp_test```, you used your best hyperparameters.
Run the cell below to predict the challenge PDFs with your model.
Comment in the markdown cell below how your model does on these PDFs compared to the train and test sets.

Do NOT go back and retrain your model. We'll build a new model in a couple more tasks. 

In [None]:
# Calculate and read out the train, test, and challenge error; the signature is mean_squared_error(y_true, y_pred)
train_mse = metrics.mean_squared_error(y_train, mlp_test.predict(X_train))
test_mse = metrics.mean_squared_error(y_test, mlp_test.predict(X_test))
challenge_mse = metrics.mean_squared_error(y_challenge, mlp_test.predict(X_challenge))

print("Train error:", train_mse)
print("Test error:", test_mse)
print("Challenge error:", challenge_mse)

\[Your comments here\]

### (Task 5): What's Going Wrong (or Right)?
Choose some challenge PDF samples and compare your actual and predicted PDFs.
Comment in the markdown cell below.

In [None]:
sample = ### YOUR CODE HERE # Sample to plot

#Get actual and predicted gluon PDFs
xgx = y_challenge[sample] #actual
xgx_pred = mlp_test.predict([X_challenge[sample]]) #predicted

#Make an array of the x values for the PDFs
xs_out = np.arange(1, 100)/100

#Plot of the output PDF data
plt.plot(xs_out, xgx, label = "actual")
plt.plot(xs_out, xgx_pred[0], label = "predicted")
plt.xlabel(r'$x$', fontsize = 18)
plt.ylabel(r'$xg(x)/\langle x_g \rangle$', fontsize = 18)
#plt.yscale('log') ### You can uncomment this to look a little closer at the large-x region
plt.legend(fontsize = 18)

\[Your comments here\]

### (Task 6): How Well Can We Predict Complicated Data from Simple Pseudo-Data?
In this task, we will train a new model on the original simple pseudo-data, but we will try to optimize the hyperparameters to predict the more complicated PDFs.
This should be similar to task 1.
Comment your observations in the markdown cell below.
How well can we do?

In [None]:
#Main hyperparameters
layers = ### YOUR CODE HERE # Should be an n-tuple of integers, ex. (10,10,10) 
activate = ### YOUR CODE HERE # Choose from {'identity', 'logistic', 'tanh', 'relu'}
solve = ### YOUR CODE HERE # Choose from {'lbfgs', 'sgd', 'adam'}
alpha_reg = 0.0001 # L2 regularization strength (MLPRegressor's alpha); you may change this if you want



# Create the MLPRegressor
mlp_new = MLPRegressor(hidden_layer_sizes=layers, activation=activate, solver=solve, alpha=alpha_reg, random_state=0, max_iter=2000)

# Train on the training set
mlp_new.fit(X_train, y_train)

# Calculate and read out the train, test, and challenge error; the signature is mean_squared_error(y_true, y_pred)
train_mse = metrics.mean_squared_error(y_train, mlp_new.predict(X_train))
test_mse = metrics.mean_squared_error(y_test, mlp_new.predict(X_test))
challenge_mse = metrics.mean_squared_error(y_challenge, mlp_new.predict(X_challenge))

print("Train error:", train_mse)
print("Test error:", test_mse)
print("Challenge error:", challenge_mse)

\[Your comments here\]

### (Task 7): Comment on the PDFs Again
Again, plot some PDF samples from your newly optimized model and compare to the actual results.
Are there distinctive features that our model can't seem to reproduce?

In [None]:
sample = ### YOUR CODE HERE # Sample to plot

#Get actual and predicted gluon PDFs
xgx = y_challenge[sample] #actual
xgx_pred = mlp_new.predict([X_challenge[sample]]) #predicted

#Make an array of the x values for the PDFs
xs_out = np.arange(1, 100)/100

#Plot of the output PDF data
plt.plot(xs_out, xgx, label = "actual")
plt.plot(xs_out, xgx_pred[0], label = "predicted")
plt.xlabel(r'$x$', fontsize = 18)
plt.ylabel(r'$xg(x)/\langle x_g \rangle$', fontsize = 18)
#plt.yscale('log') ### You can uncomment this to look a little closer at the large-x region
plt.legend(fontsize = 18)

\[Your comments here\]

### (Task 8): Mean and Error
We're dealing with statistics here, so let's take a look at the mean and error of the actual vs predicted PDFs to be able to make some more general comments.
We'll make two different plots. 
Add your observations to the cell at the end of this section.

The first plot is just the mean and the error of the two PDF sets.

In [None]:
plt.plot(PDF_challenge[:,0],PDF_challenge[:,1], color = 'tab:blue', label = "actual")
plt.fill_between(PDF_challenge[:,0], PDF_challenge[:,1] + PDF_challenge[:,2],
                PDF_challenge[:,1] - PDF_challenge[:,2], color = 'tab:blue', alpha=0.2)


#Get the mean and the error of the predicted PDFs
y_preds = mlp_new.predict(X_challenge)

y_pred_mean = np.mean(y_preds, axis = 0)
y_pred_error = np.std(y_preds, axis = 0)

plt.plot(PDF_challenge[:,0], y_pred_mean, color = 'tab:orange', label = "predicted")
plt.fill_between(PDF_challenge[:,0], y_pred_mean + y_pred_error,
                 y_pred_mean - y_pred_error, color = 'tab:orange', alpha=0.2)

plt.xlabel(r'$x$', fontsize = 18)
plt.ylabel(r'$xg(x)/\langle x_g \rangle$', fontsize = 18)
#plt.yscale('log') ### You can uncomment this to look a little closer at the large-x region
plt.legend(fontsize = 18)

Below is a common plot for PDF visualization, where the PDFs and their errors are divided by the mean PDF from one global fit (here, the mean of the actual challenge PDFs).
This makes relative errors easier to see.
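The arithmetic behind the ratio plot can be sketched with plain arrays. Everything below is synthetic: `mean_actual` and `err_actual` stand in for the mean and error columns of the challenge PDF file, and `mean_pred` and `err_pred` stand in for the mean and standard deviation of the model predictions. The 5%, 10%, and 1.1 factors are made up for illustration.

```python
import numpy as np

# Same x grid as the PDFs in this notebook: 0.01, 0.02, ..., 0.99.
x = np.arange(1, 100) / 100

# Synthetic "actual" mean and error (a 5% relative error everywhere).
mean_actual = (1 - x) ** 3
err_actual = 0.05 * mean_actual

# Hypothetical "predicted" mean (10% too high) and error.
mean_pred = 1.1 * mean_actual
err_pred = 0.10 * mean_actual

# Dividing by the actual mean turns everything into relative quantities.
ratio_actual = mean_actual / mean_actual   # exactly 1 by construction
band_actual = err_actual / mean_actual     # relative error of the actual PDFs
ratio_pred = mean_pred / mean_actual       # relative offset of the prediction
band_pred = err_pred / mean_actual         # prediction error in the same units

print("actual ratio and band at small x:", ratio_actual[0], band_actual[0])
print("predicted ratio and band at small x:", ratio_pred[0], band_pred[0])
```

In the absolute plot both curves fall toward zero at large $x$, so a 10% offset there is invisible; in the ratio it shows up as a flat line at 1.1.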

In [None]:
plt.plot(PDF_challenge[:,0],PDF_challenge[:,1]/PDF_challenge[:,1], color = 'tab:blue', label = "actual")
plt.fill_between(PDF_challenge[:,0], (PDF_challenge[:,1] + PDF_challenge[:,2])/PDF_challenge[:,1],
                (PDF_challenge[:,1] - PDF_challenge[:,2])/PDF_challenge[:,1], color = 'tab:blue', alpha=0.2)


#Get the mean and the error of the predicted PDFs
y_preds = mlp_new.predict(X_challenge)

y_pred_mean = np.mean(y_preds, axis = 0)
y_pred_error = np.std(y_preds, axis = 0)

plt.plot(PDF_challenge[:,0], y_pred_mean/PDF_challenge[:,1], color = 'tab:orange', label = "predicted")
plt.fill_between(PDF_challenge[:,0], (y_pred_mean + y_pred_error)/PDF_challenge[:,1],
                 (y_pred_mean - y_pred_error)/PDF_challenge[:,1], color = 'tab:orange', alpha=0.2)

plt.xlabel(r'$x$', fontsize = 18)
plt.ylabel(r'$g(x)/g^{\mathrm{actual}}(x)$', fontsize = 18)
plt.ylim(0,2)
#plt.yscale('log') ### You can uncomment this to look a little closer at the large-x region
plt.legend(fontsize = 18)

\[Your comments here\]

# Part 3: Discussion
Overall, we have implemented a very simple model with simple pseudo-data.
Take a few minutes to discuss what you think may be problems with the ways we have implemented everything and how we could resolve them.

\[Your comments here\]