### The Liquid Drop Model for Nuclear Binding Energy

Inspired by work by Morten Hjorth-Jensen, Department of Physics, University of Oslo, for the course "Applied Data Analysis and Machine Learning"

A basic quantity which can be measured for the ground states of nuclei is the atomic mass $M(N, Z)$ of the neutral atom with mass number $A$, proton number $Z$, and neutron number $N$, where $A = N + Z$. Several sophisticated experiments worldwide measure this quantity to high precision (even at the parts-per-million level).

Atomic masses are usually tabulated in terms of the mass excess defined by
$$ \Delta M(N, Z) = M(N, Z) - u A $$
where $u$ is the Atomic Mass Unit:
$$ u = M(^{12}\mathrm{C})/12 = 931.4940954(57)\ \mathrm{MeV}/c^2 $$

The nucleon masses are
$$ m_p = 1.00727646693(9) u $$
and
$$ m_n = 1.0086649156(6) u $$

The nuclear binding energy is defined as the energy required to break up a given nucleus into its constituent parts of $N$ neutrons and $Z$ protons. In terms of the atomic masses $M(N, Z)$ the binding energy is defined by
$$ BE(N,Z) = Z M_H c^2 + N m_n c^2 - M(N, Z) c^2 $$
where $M_H$ is the mass of the hydrogen atom and $m_n$ is the mass of the neutron. 

In terms of the mass excess the binding energy is given by
$$ BE(N,Z) = Z \Delta_H c^2 + N \Delta_n c^2 - \Delta M(N, Z) c^2 $$
where $\Delta_H c^2 = 7.2890$ MeV  is the mass excess of the hydrogen atom and $\Delta_n c^2 = 8.0713$ MeV is the mass excess of the neutron.
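
As a quick sanity check of the mass-excess formula, the sketch below evaluates it for carbon-12, whose mass excess is zero by the definition of $u$. The function name `binding_energy_mev` is only an illustrative choice; the two constants are the values quoted above.

```python
# Mass excesses quoted in the text above, in MeV.
DELTA_H = 7.2890  # hydrogen atom
DELTA_N = 8.0713  # neutron


def binding_energy_mev(n, z, mass_excess_mev):
    """Binding energy in MeV from N, Z and the atomic mass excess in MeV."""
    return z * DELTA_H + n * DELTA_N - mass_excess_mev


# Carbon-12: N = 6, Z = 6, mass excess exactly zero by definition of u.
be_c12 = binding_energy_mev(6, 6, 0.0)
print(f"BE(12C) = {be_c12:.4f} MeV, or {be_c12 / 12:.3f} MeV per nucleon")
```

This gives roughly 92.16 MeV in total, or about 7.68 MeV per nucleon.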

A popular and physically intuitive model which can be used to parametrize the experimental binding energies as a function of $A$ is the so-called liquid drop model. The ansatz is based on the following expression

$$BE(N, Z) = a_1 A - a_2 A^{2/3} - a_3 \frac{Z^2}{A^{1/3}} - a_4 \frac{(N-Z)^2}{A} $$
 
 
where $A$ is the number of nucleons and the $a_i$ 's are parameters which are determined by a fit to the experimental data.

To arrive at the above expression we have made the following assumptions:

- There is a volume term $a_1 A$ that is proportional to the number of nucleons. When an assembly of nucleons of the same size is packed together into the smallest volume, each interior nucleon has a certain number of other nucleons in contact with it. This contribution is proportional to the volume.

- There is a surface energy term $a_2 A^{2/3}$. The assumption here is that a nucleon at the surface of a nucleus interacts with fewer other nucleons than one in the interior of the nucleus and hence its binding energy is less. This surface energy term takes that into account and is therefore negative and is proportional to the surface area.

- There is a Coulomb energy term $a_3 \frac{Z^2}{A^{1/3}}$, since the electric repulsion between each pair of protons in a nucleus yields less binding.

- There is an asymmetry term $a_4 \frac{(N-Z)^2}{A}$. This term is associated with the Pauli exclusion principle and reflects the fact that the proton-neutron interaction is more attractive on the average than the neutron-neutron and proton-proton interactions.
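
The four terms listed above can be written as a small function. This is a minimal sketch: the function name `liquid_drop_be` is hypothetical, and the coefficients are rough values of the size commonly quoted for the semi-empirical mass formula, used here only to show the scale of each term. They are not the result of a fit to the data in this assignment.

```python
def liquid_drop_be(n, z, a1, a2, a3, a4):
    """Liquid drop binding energy (same units as the a_i) for N neutrons, Z protons."""
    a = n + z
    volume = a1 * a
    surface = a2 * a ** (2 / 3)
    coulomb = a3 * z ** 2 / a ** (1 / 3)
    asymmetry = a4 * (n - z) ** 2 / a
    return volume - surface - coulomb - asymmetry


# Illustrative coefficients in MeV (an assumption, not fitted values).
ILLUSTRATIVE = dict(a1=15.8, a2=18.3, a3=0.714, a4=23.2)

# Iron-56: N = 30, Z = 26.
be_fe56 = liquid_drop_be(30, 26, **ILLUSTRATIVE)
print(f"Liquid drop estimate for 56Fe: {be_fe56:.1f} MeV")
```

Note that the asymmetry term vanishes for $N = Z$, and that the Coulomb term grows fastest with $Z$, which is why heavy stable nuclei carry a neutron excess.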

The liquid drop model does not take into account the shell structure of the nucleus, so it tends to provide a good fit to heavier nuclei, and a poor fit to lighter nuclei. 

### In this homework problem, we will use linear regression to fit the parameters of the liquid drop model and predict nuclear masses. 

### Step 1: Load the data and calculate the targets
A subset of the data from the 2020 Atomic Mass Evaluation (AME2020) has been provided in the `atomic_masses.csv` file (contained in the Data folder on GitHub). The publications containing the data can be found here: Chinese Phys. C 45 030002 (2021), and Chinese Phys. C 45 030003 (2021). 

The data file contains N, Z, A, the element, the mass excess (in $keV/c^2$), and the atomic mass (in $u$) for each of 2550 nuclei.

Load the data into a pandas data frame. 

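We want to train a model for the binding energies of the nuclei, which aren't provided in the data set. Using the expression(s) above, evaluate the true value of the binding energy for each nucleus in the data set, saving the values to an array or data frame. Keep track of units: the mass excess in the file is in keV, while the constants $\Delta_H c^2$ and $\Delta_n c^2$ quoted above are in MeV.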
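
A minimal sketch of this step follows. The real file's column names are not given here, so `mass_excess_keV` and the other headers are assumptions, and a two-row inline stand-in replaces `atomic_masses.csv` so that the example runs on its own. The hydrogen row simply reuses the $\Delta_H c^2$ value quoted above.

```python
import io

import pandas as pd

# Inline stand-in for Data/atomic_masses.csv; column names are assumptions.
csv_text = """N,Z,A,element,mass_excess_keV
0,1,1,H,7289.0
6,6,12,C,0.0
"""
df = pd.read_csv(io.StringIO(csv_text))

DELTA_H = 7.2890  # MeV, hydrogen atom mass excess from the text
DELTA_N = 8.0713  # MeV, neutron mass excess from the text

# Convert the tabulated mass excess from keV to MeV before combining.
mass_excess_mev = df["mass_excess_keV"] / 1000.0
df["BE_MeV"] = df["Z"] * DELTA_H + df["N"] * DELTA_N - mass_excess_mev

print(df[["element", "A", "BE_MeV"]])
```

With the actual file, the `io.StringIO(...)` argument would be replaced by the path to the CSV, and the column names adjusted to match its header.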

### Step 2: Select features and perform feature engineering

Based on the liquid drop model above, select the appropriate features from the provided data and engineer any new features you think will be useful, saving all of the features as columns to a new data frame X. Make sure you don't include anything that shouldn't be a feature. You don't need to include a bias feature (1 for all instances), since scikit-learn's linear models fit the y-intercept by default.

Explain what features you included and your reasoning.
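
One possible way to lay out the features is sketched below: one column per term of the liquid drop expression, so that a linear model's coefficients map directly onto the $a_i$. The column names and the three example nuclei are illustrative assumptions, not part of the provided data.

```python
import pandas as pd

# Three example nuclei (12C, 56Fe, 208Pb) standing in for the real table.
nuclei = pd.DataFrame({"N": [6, 30, 126], "Z": [6, 26, 82]})
nuclei["A"] = nuclei["N"] + nuclei["Z"]

# One engineered column per term of the liquid drop expression.
X = pd.DataFrame({
    "volume": nuclei["A"].astype(float),
    "surface": nuclei["A"] ** (2 / 3),
    "coulomb": nuclei["Z"] ** 2 / nuclei["A"] ** (1 / 3),
    "asymmetry": (nuclei["N"] - nuclei["Z"]) ** 2 / nuclei["A"],
})
print(X)
```

Because the surface, Coulomb, and asymmetry features are built without the minus signs, the fitted coefficients for those columns would be expected to come out negative, equal to $-a_2$, $-a_3$, and $-a_4$.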

### Step 3: Blind yourself from some of the data
In addition to our usual cross-validation practices, this time we're going to set aside a subset of the data as a fully blinded sample, pretending that it's a set of new measurements that we're making predictions for. This will help you get used to what it's like using machine learning techniques for "real world" problems, where you have more than just a learning set. 

Using `train_test_split`, set a random 10% of the instances aside for use as a blinded sample.
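
The split itself can be sketched with scikit-learn's `train_test_split`. The arrays below are placeholders for your own features and targets, and the names `X_learn` and `X_blind` are just one way to keep the two samples apart.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 instances with 2 features each.
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20, dtype=float)

# Hold out 10% as the blinded sample; fix the seed so the split is reproducible.
X_learn, X_blind, y_learn, y_blind = train_test_split(
    X, y, test_size=0.1, random_state=0
)
print(X_learn.shape, X_blind.shape)
```

All cross-validation and hyperparameter tuning in the later steps would then use only the learning set, leaving the blinded set untouched until the final predictions.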

### Step 4: Train Model #1

Pick an appropriate scoring parameter and train a `LinearRegression` model to fit the data, using cross-validation to evaluate the uncertainty of the score.

How does the model perform? Give an appropriate quantitative measure.
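
The sketch below shows one way to get a cross-validated score with an uncertainty estimate. It runs on synthetic data generated from the liquid drop expression with illustrative coefficients plus Gaussian noise, so the numbers it prints say nothing about the real data set. RMSE is used as the scoring parameter here purely as an example choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): random N, Z and noisy liquid drop targets.
rng = np.random.default_rng(0)
Z = rng.integers(10, 90, size=200)
N = Z + rng.integers(0, 40, size=200)
A = N + Z
X = np.column_stack([A, A ** (2 / 3), Z ** 2 / A ** (1 / 3), (N - Z) ** 2 / A])
y = (15.8 * X[:, 0] - 18.3 * X[:, 1] - 0.714 * X[:, 2] - 23.2 * X[:, 3]
     + rng.normal(0.0, 2.0, size=200))

# 5-fold CV; scikit-learn reports RMSE with a negative sign, so flip it back.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_root_mean_squared_error"
)
rmse = -scores
print(f"CV RMSE = {rmse.mean():.2f} +/- {rmse.std():.2f} MeV")
```

The spread of the per-fold scores gives a rough handle on the uncertainty of the quoted performance.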

### Step 5: Train Model #2

Using the same scoring parameter, train a linear model with ElasticNet regularization. Use nested CV to optimize the $\texttt{l1\_ratio}$ hyperparameter, which adjusts the balance between the L1 and L2 regularization terms.

How does the model perform? Give an appropriate quantitative measure. Did it out-perform the non-regularized model? What was the optimal setting for the regularization hyperparameter? 
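
Nested cross-validation can be sketched by wrapping a `GridSearchCV` (inner loop, tunes `l1_ratio`) inside `cross_val_score` (outer loop, estimates performance). As before, the data are a synthetic stand-in. The grid values, the fold counts, and the decision to standardize the features before regularizing are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), as in a liquid drop fit.
rng = np.random.default_rng(0)
Z = rng.integers(10, 90, size=200)
N = Z + rng.integers(0, 40, size=200)
A = N + Z
X = np.column_stack([A, A ** (2 / 3), Z ** 2 / A ** (1 / 3), (N - Z) ** 2 / A])
y = (15.8 * X[:, 0] - 18.3 * X[:, 1] - 0.714 * X[:, 2] - 23.2 * X[:, 3]
     + rng.normal(0.0, 2.0, size=200))

# Scale features so the penalty treats them evenly, then apply ElasticNet.
model = make_pipeline(StandardScaler(), ElasticNet(max_iter=100_000))
grid = {"elasticnet__l1_ratio": [0.1, 0.5, 0.9]}

inner = KFold(n_splits=3, shuffle=True, random_state=1)
outer = KFold(n_splits=5, shuffle=True, random_state=2)
search = GridSearchCV(model, grid, cv=inner, scoring="neg_root_mean_squared_error")

# Outer loop: each fold reruns the inner search on its own training part.
nested_rmse = -cross_val_score(
    search, X, y, cv=outer, scoring="neg_root_mean_squared_error"
)

# Refit the search on all learning data to read off the preferred setting.
search.fit(X, y)
best_l1_ratio = search.best_params_["elasticnet__l1_ratio"]
print(f"nested CV RMSE = {nested_rmse.mean():.2f} +/- {nested_rmse.std():.2f} MeV")
print("best l1_ratio:", best_l1_ratio)
```

The outer-loop score is the one to compare with the non-regularized model, since the inner-loop score was used to pick the hyperparameter and is therefore optimistic.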

### Step 6: Analyze coefficients

Give the coefficients found by each model. Compare them in a bar chart, as we did in Studio 7.


Did regularization drive any of the coefficients to 0?
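
A grouped bar chart is one way to put the two sets of coefficients side by side. The sketch below again uses synthetic stand-in data and fits both models on standardized features so that the bars are directly comparable. The `l1_ratio=0.5` setting is a placeholder for whatever value your own tuning selects.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), as in a liquid drop fit.
rng = np.random.default_rng(0)
Z = rng.integers(10, 90, size=200)
N = Z + rng.integers(0, 40, size=200)
A = N + Z
X = np.column_stack([A, A ** (2 / 3), Z ** 2 / A ** (1 / 3), (N - Z) ** 2 / A])
y = (15.8 * X[:, 0] - 18.3 * X[:, 1] - 0.714 * X[:, 2] - 23.2 * X[:, 3]
     + rng.normal(0.0, 2.0, size=200))
names = ["A", "A^(2/3)", "Z^2/A^(1/3)", "(N-Z)^2/A"]

# Fit both models on the same standardized features.
Xs = StandardScaler().fit_transform(X)
ols = LinearRegression().fit(Xs, y)
enet = ElasticNet(l1_ratio=0.5, max_iter=100_000).fit(Xs, y)

# Grouped bars: one pair per feature.
pos = np.arange(len(names))
fig, ax = plt.subplots()
ax.bar(pos - 0.2, ols.coef_, width=0.4, label="LinearRegression")
ax.bar(pos + 0.2, enet.coef_, width=0.4, label="ElasticNet")
ax.set_xticks(pos)
ax.set_xticklabels(names)
ax.set_ylabel("coefficient (standardized features)")
ax.legend()

# List any features whose ElasticNet coefficient is exactly zero.
zeroed = [name for name, c in zip(names, enet.coef_) if c == 0.0]
print("coefficients driven to zero:", zeroed)
```

In a notebook the figure renders when the cell finishes; in a script, a call to `plt.show()` would be needed at the end.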

### Step 7: Make predictions and analyze the results

Apply the trained models to the blinded test data to make predictions of the binding energy for each atom, and compare the results to the true values.

Make the following plots to help you analyze the results:

- A scatter plot of binding energy vs. Z. Include the train and blinded test points, with different colors or markers to distinguish them.  
- A scatter plot of binding energy vs. A. Include the train and blinded test points, with different colors or markers to distinguish them.  
- A scatter plot of the liquid drop model residuals per nucleon (true BE - fit BE)/A vs. Z for the blinded test points, using the results from your best-performing model. 
- A scatter plot of the liquid drop model residuals per nucleon (true BE - fit BE)/A vs. A for the blinded test points, using the results from your best-performing model. 
- A color 2d scatter plot where Z is the y axis, N is the x axis, and color gives the value of the liquid drop model residuals per nucleon (true BE - fit BE)/A for all the instances. The inspiration here is the chart of the nuclides found at https://www.nndc.bnl.gov/nudat3/, where you can select "BE-LDM Fit/A" to plot these residuals for each nucleus, or this example on Wikipedia (https://en.wikipedia.org/wiki/Semi-empirical_mass_formula#/media/File:Semi-empirical_mass_formula_discrepancy.png). Their residuals won't agree with yours, since that fit includes an extra pairing interaction term that we didn't include. 
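
The last plot in the list can be sketched as follows, again on synthetic stand-in data, so the pattern it shows is only noise. A diverging colormap with symmetric limits is used so that zero residual sits at the neutral color; that is a presentation choice, not a requirement.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption), as in a liquid drop fit.
rng = np.random.default_rng(0)
Z = rng.integers(10, 90, size=200)
N = Z + rng.integers(0, 40, size=200)
A = N + Z
X = np.column_stack([A, A ** (2 / 3), Z ** 2 / A ** (1 / 3), (N - Z) ** 2 / A])
y = (15.8 * X[:, 0] - 18.3 * X[:, 1] - 0.714 * X[:, 2] - 23.2 * X[:, 3]
     + rng.normal(0.0, 2.0, size=200))

# Residual per nucleon: (true BE - fit BE) / A.
fit = LinearRegression().fit(X, y)
resid_per_nucleon = (y - fit.predict(X)) / A

# Symmetric color limits keep zero residual at the middle of the colormap.
lim = np.abs(resid_per_nucleon).max()
fig, ax = plt.subplots()
sc = ax.scatter(N, Z, c=resid_per_nucleon, cmap="coolwarm",
                vmin=-lim, vmax=lim, s=12)
ax.set_xlabel("N")
ax.set_ylabel("Z")
fig.colorbar(sc, ax=ax, label="(true BE - fit BE) / A  [MeV]")
```

The same `resid_per_nucleon` array, restricted to the blinded instances, could be plotted against Z or A for the two residual scatter plots in the list.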


Do you notice a trend in the residuals as a function of either Z or A? Any other trends/features you noticed?



