# Machine learning for computational materials science and chemistry with MALA

## What is MALA?

A short summary about MALA and how it works is given in `mala_background.pdf`.

## Setting up MALA

For this tutorial, a Google Colab environment will be provided that includes all necessary packages. More generally, MALA is an open-source framework that can be obtained [here](https://github.com/mala-project/mala). Detailed (installation) instructions can be found [here](https://mala-project.github.io/mala/).

A few examples at the end of the notebook tackle advanced applications. For ease of installation, the necessary backends are not bundled with the Google Colab environment. Interested readers may install them on their own machines; for in-person workshops, they will be demonstrated by the host, as will some aspects of data generation.

## Loading the modules

These modules will be necessary for the tutorials discussed here.

In [1]:
# MALA itself.

import mala

# We would like to visualize simple plots.
# The font size can sometimes be a bit small for Jupyter Notebooks.

import matplotlib
import matplotlib.pyplot as plt
font = {'size'   : 22}
matplotlib.rc('font', **font)

# For the data paths.
from os.path import join as pj

# Data I: Performing simulations (presentation only)

Data is the backbone of every ML application. For MALA, this includes target data (some electronic structure quantity to be learned, usually the local density of states, LDOS) and descriptor data (a vector field that encodes the atomic density at each point in space, usually via so-called bispectrum components).

MALA data generation can be performed with the Quantum ESPRESSO (QE) package. Some changes to this open-source package were necessary to enable the correct sampling of the LDOS. The current development branch of Quantum ESPRESSO includes those changes; beginning with Quantum ESPRESSO version 7.2 (to be released in ~June 2023), users can simply download the latest QE version and perform data generation.

Data generation is twofold: First, one creates a set of atomic positions via a regular DFT-MD simulation at the conditions of interest. This can be done with any suitable code, such as VASP, QE, etc. Second, one performs DFT simulations to access the LDOS.

The test system for our investigation here will be a simple beryllium system at room temperature consisting of 2 beryllium atoms. Atomic configurations have been sampled beforehand. We will start with the DFT simulation. Note: For actual data generation, the simulation output needs to be saved (usually as an ".out" file), and the simulation is of course run on HPC infrastructure.

MALA data generation thus comes down to a simple DFT + postprocessing calculation. There is one drawback though: The LDOS has to be sampled with very high fidelity in k-space (reciprocal space, i.e., how finely the Brillouin zone is sampled with k-points). The fidelity of the MALA calculation has to be higher than for standard DFT calculations, which has to be kept in mind. A good way to visualize the problem is through the density of states (the LDOS integrated on the real-space grid), which shows unphysical oscillations for low-fidelity calculations. These oscillations vanish as one moves to higher-fidelity calculations.

![kpoint_comparison](./figures/DOSwdifferentdeltasandks.png)
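
The effect can be reproduced qualitatively with a toy model. The following is a minimal sketch, not a MALA or Quantum ESPRESSO calculation: it assumes a one-dimensional tight-binding band E(k) = -2 cos(k) and a Gaussian smearing of width `sigma`, and the function name `toy_dos` is purely illustrative. With only a few k-points, the smeared DOS collapses into isolated peaks with gaps in between; with many k-points it becomes smooth.

```python
import math


def toy_dos(num_kpoints, energies, sigma=0.05):
    """Gaussian-broadened DOS of the toy band E(k) = -2 cos(k)."""
    # Uniform k-point grid over the first Brillouin zone [-pi, pi).
    kpoints = [-math.pi + 2.0 * math.pi * i / num_kpoints
               for i in range(num_kpoints)]
    band = [-2.0 * math.cos(k) for k in kpoints]
    # Each k-point contributes one normalized Gaussian with equal weight.
    norm = 1.0 / (num_kpoints * sigma * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((e - eps) / sigma) ** 2)
                       for eps in band)
            for e in energies]


energies = [-2.5 + 0.025 * i for i in range(201)]
dos_coarse = toy_dos(8, energies)      # low-fidelity k-point sampling
dos_fine = toy_dos(2000, energies)     # high-fidelity k-point sampling

# E = 0.7 lies inside the band, but between the levels of the coarse grid.
idx = 128
print(f"E = {energies[idx]:.3f}")
print(f"coarse DOS: {dos_coarse[idx]:.3e}, fine DOS: {dos_fine[idx]:.3e}")
```

At an energy inside the band, the coarse sampling yields a DOS that is essentially zero, while the fine sampling recovers a finite value. This is the same kind of unphysical structure that the figure shows for low-fidelity calculations.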

After we have performed the actual simulation, we are halfway done. We now have simulation outputs, both the DFT output and the LDOS. With these, we need to do two things:

1. We need to convert the LDOS into a format we actually want to work with. Cube files are unnecessarily large, which hurts both disk usage and speed.
2. We need to calculate the atomic density descriptors from the atomic positions.

So it's finally time to fire up MALA.

# The MALA interface (hands-on)

Within extended ML frameworks, a big problem is reproducibility. Models often depend on many so-called hyperparameters that characterize model behavior. These may include, but are in no way limited to, neural network layer sizes and number, training procedures, descriptions of data specifics (e.g., how is the local density of states sampled?), etc.
There are a number of ways to handle this problem efficiently. Many frameworks rely on command line arguments, i.e., the user provides a potentially extensive list of command line arguments at runtime. Another good way to handle all this is the use of input files. This is consistent with the way standard computational science simulation codes work.
An obvious downside to this is that one has to have a framework at hand to prepare these input files.

MALA follows a route somewhere in between the two approaches. The central object is the `Parameters()` object. It holds ALL (hyper)parameters one could use in the course of a MALA run. It is structured by the subtasks of MALA.

In [11]:
parameters = mala.Parameters()

# All parameters related to how data is handled in general.
parameters.data

# "Targets" always refers to the quantity being learned.
parameters.targets

# "Descriptors" always refers to the quantity from which we learn.
parameters.descriptors

# "Data generation" refers to useful routines for creating training data.
# These routines are mostly experimental at the moment and will not be discussed here in detail
parameters.datageneration

# "Network" refers to everything related to neural network creation and training.
# In the future, support for more models is planned, and this collection of parameters
# will be updated to reflect this.
parameters.network

# "Hyperparameters" means hyperparameter optimization. This is the process of finding the optimal
# hyperparameters for MALA model training, and we will come back to this process later.
parameters.hyperparameters

<mala.common.parameters.ParametersHyperparameterOptimization at 0x7f5c63aac760>

Individual parameter objects can be printed to see what's hidden inside. This also works on the main object.

In [12]:
parameters.data.show()

snapshot_directories_list: []
data_splitting_type: by_snapshot
input_rescaling_type: None
output_rescaling_type: None
use_lazy_loading: False
use_clustering : False
number_of_clusters: 40
train_ratio    : 0.1
sample_ratio   : 0.5
use_fast_tensor_data_set: False
shuffling_seed : None


Finally, there are some high-level parameters one needs when performing ML at scale.

In [13]:
# Whether or not to use a GPU for model training and inference.
parameters.use_gpu

# Whether or not to use MPI parallel CPU inference (no training supported).
# This option is either for pre-processing or production runs of trained models.
# More on that later.
parameters.use_mpi

# Manual seeds can be used to fix the Pseudo RNG to re-create models with the exact
# same model weights.
parameters.manual_seed

# A comment may be useful to distinguish between sets of parameters.
parameters.comment

# This is useful for adjusting the output level of MALA.
parameters.verbosity

1

None of this yet explains how MALA handles reproducibility. Writing hundreds of lines of parameter statements in each Python script is not exactly maintainable.
Therefore, MALA provides a .json interface.

In [14]:
parameters.save("mala_parameters_01.json")
new_parameters = mala.Parameters.load_from_file("mala_parameters_01.json")

Have a look at the .json file that was just created. You will see that it is structured in the same way as the Python object, allowing fast access. You will also see some parameters that are not discussed here, since they exceed the scope of this tutorial. As a first exercise, try to modify parameters both in Python and in the .json file and see whether loading recovers those changes. The comment and manual seed are good first examples for this.

In [15]:
# Set a comment
parameters.comment = "My first parameters."

# Save.
parameters.save("mala_parameters_01.json")

In [16]:
# Edit something in the file (e.g. the manual_seed) and reload.
parameters = mala.Parameters.load_from_file("mala_parameters_01.json")
parameters.show()

---     All parameter MALA needs to perform its various tasks. ---
comment        : My first parameters.
manual_seed    : None
use_gpu        : False
device         : cpu
use_horovod    : False
use_mpi        : False
verbosity      : 1
openpmd_configuration: {}
openpmd_granularity: 1
---     Parameters necessary for constructing a neural network. ---
	nn_type        : feed-forward
	layer_sizes    : [10, 10, 10]
	layer_activations: ['Sigmoid']
	loss_function_type: mse
	num_hidden_layers: 1
	no_hidden_state: False
	bidirection    : False
	dropout        : 0.1
	num_heads      : 10
---     Parameters necessary for calculating/parsing input descriptors. ---
	descriptor_type: Bispectrum
	lammps_compute_file: 
	descriptors_contain_xyz: True
	use_z_splitting: True
	number_y_planes: 0
	bispectrum_twojmax: 10
	rcutfac        : 4.67637
	atomic_density_cutoff: 4.67637
	snap_switchflag: 1
	use_atomic_density_energy_formula: False
	atomic_density_sigma: None
---     Parameters necessary for calculat

For extended experiments, it is very useful to operate with such input files and only use the in-Python parameter editing when absolutely necessary. Further, completed experiments can be saved in this way for future reference.
In the following, we will use a combination of both approaches for the sake of transparency.
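
The save/load pattern of this section can be mimicked with the standard library alone. The following is a minimal sketch under stated assumptions: `ToyParameters` and `ToyNetworkParameters` are hypothetical stand-ins that only borrow a few field names from the output above, and the code does not reflect how MALA implements its `Parameters` class internally.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ToyNetworkParameters:
    # Hypothetical subset of the network parameters shown above.
    layer_sizes: list = field(default_factory=lambda: [10, 10, 10])
    loss_function_type: str = "mse"


@dataclass
class ToyParameters:
    # Hypothetical top-level object holding nested parameter groups.
    comment: str = ""
    manual_seed: Optional[int] = None
    network: ToyNetworkParameters = field(default_factory=ToyNetworkParameters)

    def save(self, path):
        """Write all (nested) parameters to a JSON file."""
        with open(path, "w") as handle:
            json.dump(asdict(self), handle, indent=4)

    @classmethod
    def load_from_file(cls, path):
        """Rebuild the nested parameter object from a JSON file."""
        with open(path) as handle:
            raw = json.load(handle)
        network = ToyNetworkParameters(**raw.pop("network"))
        return cls(network=network, **raw)


params = ToyParameters(comment="My first parameters.", manual_seed=2023)
path = os.path.join(tempfile.mkdtemp(), "toy_parameters.json")
params.save(path)
restored = ToyParameters.load_from_file(path)
print(restored == params)
```

Because the JSON file mirrors the nesting of the Python object, an edit made in either place survives a save/load round trip, which is the property the exercise above asks you to verify.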

# Data II: Data preprocessing (presentation-only)

We can now start using MALA to prepare our data. MALA directly takes in calculation outputs and transforms them into formats that are easy to work with.
First, we have to decide which descriptors to calculate and how to correctly process the LDOS.

In [17]:
parameters = mala.Parameters()

# These values we will take for granted now. In the hyperparameter section we will find out
# how they are determined.
parameters.descriptors.descriptor_type = "Bispectrum"
parameters.descriptors.bispectrum_twojmax = 10
parameters.descriptors.bispectrum_cutoff = 4.67637
parameters.descriptors.descriptors_contain_xyz = True

# These values need to correspond to the ones used in the DFT simulation.
parameters.targets.target_type = "LDOS"
parameters.targets.ldos_gridsize = 11
parameters.targets.ldos_gridspacing_ev = 2.5
parameters.targets.ldos_gridoffset_ev = -5
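
To make the three LDOS parameters more concrete, here is a minimal sketch of the energy grid they describe. It assumes that the grid starts at `ldos_gridoffset_ev` and advances in equal steps of `ldos_gridspacing_ev`; the helper `ldos_energy_grid` is hypothetical and not part of MALA.

```python
def ldos_energy_grid(gridsize, gridspacing_ev, gridoffset_ev):
    """Energies (in eV) at which the LDOS is sampled, under the stated assumption."""
    return [gridoffset_ev + i * gridspacing_ev for i in range(gridsize)]


energy_grid = ldos_energy_grid(gridsize=11, gridspacing_ev=2.5, gridoffset_ev=-5)
print(energy_grid)
```

Under this assumption, the 11 grid points span the range from -5 eV to 20 eV, which has to match the energy window used when the LDOS was sampled in the DFT calculation.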

Now we can use the `DataConverter` class to convert simulation outputs. File format labels follow the ASE package wherever possible!

In [None]:
data_converter = mala.DataConverter(parameters)
data_converter.add_snapshot(descriptor_input_type="espresso-out",
                            descriptor_input_path="./data_generation/dft.out",
                            target_input_type=".cube",
                            target_input_path="./data_generation/tmp.pp0*Be_ldos.cube")
data_converter.convert_snapshots("./data_generation/", naming_scheme="Be_snapshot*.npy")

Disabling z-splitting for preprocessing.
Calculating descriptors from ./data_generation/dft.out


In this example we will use the [MALA test data set](https://github.com/mala-project/test-data), which contains 4 atomic configurations, including simulation output, bispectrum components and LDOS - i.e., this preprocessing has already been done for all of them.

In [4]:
data_path = "/home/fiedlerl/data/mala_data_repo/Be2"

# Visualizing and reproducing output data (hands-on)

Before we train a model, it is a good idea to think about which metric is important, i.e., how do we test if a model is good?

In essence, the advantage of MALA is the access to multiple observables. Two easily accessible metrics are the density of states (DOS) and the band energy. We will now learn how to calculate them from the LDOS (the actual DFT LDOS in this case) so we can do the same after model training to test our models.

For this, we first have to make sure the correct LDOS parameters are used.

In [None]:
parameters.targets.target_type = "LDOS"
parameters.targets.ldos_gridsize = 11
parameters.targets.ldos_gridspacing_ev = 2.5
parameters.targets.ldos_gridoffset_ev = -5

Now we can create an LDOS calculator and directly populate it with the LDOS data from the data set.

In [None]:
ldos_calculator = mala.LDOS.from_numpy_file(parameters, pj(data_path, "Be_snapshot0.out.npy"))

Afterwards, we have to read in some additional information from the simulation data (size of the real space grid, temperature, etc.).

In [None]:
ldos_calculator.read_additional_calculation_data(pj(data_path, "Be_snapshot0.out"))

Now we can access the DOS and the band energy as properties of the calculator object.

In [None]:
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(1, 1, 1)
ax.plot(ldos_calculator.energy_grid, ldos_calculator.density_of_states)

print(ldos_calculator.band_energy)
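
Conceptually, the DOS is the LDOS integrated over the real-space grid, and the band energy is the integral of E f(E) DOS(E) over the energy, with f being the Fermi-Dirac distribution. The following is a minimal sketch of these two definitions on made-up toy data. It uses a plain sum over grid points and a trapezoidal rule, which is an assumption for illustration and not necessarily the integration scheme MALA uses; all function names are hypothetical.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def dos_from_ldos(ldos, voxel_volume):
    """Integrate the LDOS over the real-space grid for each energy."""
    num_energies = len(ldos[0])
    return [voxel_volume * sum(point[j] for point in ldos)
            for j in range(num_energies)]


def fermi(energy, mu, temperature):
    """Fermi-Dirac occupation, written to avoid overflow for large arguments."""
    x = (energy - mu) / (K_B_EV * temperature)
    if x > 0:
        return math.exp(-x) / (1.0 + math.exp(-x))
    return 1.0 / (1.0 + math.exp(x))


def band_energy(energies, dos, mu, temperature):
    """Trapezoidal integral of E * f(E) * DOS(E) over the energy grid."""
    integrand = [e * fermi(e, mu, temperature) * d
                 for e, d in zip(energies, dos)]
    return sum(0.5 * (integrand[i] + integrand[i + 1])
               * (energies[i + 1] - energies[i])
               for i in range(len(energies) - 1))


# Toy LDOS: two real-space grid points, five energy values each (made-up numbers).
toy_energies = [-5.0, -2.5, 0.0, 2.5, 5.0]
toy_ldos = [[0.1, 0.2, 0.3, 0.2, 0.1],
            [0.3, 0.2, 0.1, 0.2, 0.3]]

toy_dos_values = dos_from_ldos(toy_ldos, voxel_volume=0.5)
toy_band_energy = band_energy(toy_energies, toy_dos_values,
                              mu=0.0, temperature=298.0)
print(toy_dos_values)
print(toy_band_energy)
```

Note that on such a coarse energy grid the Fermi-Dirac function is resolved very poorly, so this sketch only illustrates the definitions and should not be used for quantitative work.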

# Training a model (hands-on)

Finally, we can train a neural network based model for the electronic structure. First, let us review which parameters we need for this. In the following we will slowly adapt the parameters until we get a good model out of it.

Since we want to learn about the inner workings of MALA, we want full output. We will also fix the manual seed, so that all the models are comparable.

In [5]:
parameters = mala.Parameters()
parameters.verbosity = 2
parameters.manual_seed = 2023

Since MALA provides quite a few reasonable default values, in the simplest case the only thing we have to decide upon is the data we want to learn on and the architecture of the neural network.

For each training we have to specify training (`"tr"`) and validation (`"va"`) snapshots. The former are used to actually tune the network weights, the latter monitor model performance during training. They will become very relevant as we optimize the training process.

Deciding on the layer sizes is usually done AFTER the data is loaded, since the first and last layer need to match up with the data provided.

In [6]:
data_handler = mala.DataHandler(parameters)
data_handler.add_snapshot("Be_snapshot0.in.npy", data_path,
                          "Be_snapshot0.out.npy", data_path, "tr")
data_handler.add_snapshot("Be_snapshot1.in.npy", data_path,
                          "Be_snapshot1.out.npy", data_path, "va")

# This already loads data into RAM!
data_handler.prepare_data()
parameters.network.layer_sizes = [data_handler.input_dimension,
                                  100,
                                  data_handler.output_dimension]

No data rescaling will be performed.
No data rescaling will be performed.
Checking the snapshots and your inputs for consistency.
Checking descriptor file  Be_snapshot0.in.npy at /home/fiedlerl/data/mala_data_repo/Be2
Checking targets file  Be_snapshot0.out.npy at /home/fiedlerl/data/mala_data_repo/Be2
Checking descriptor file  Be_snapshot1.in.npy at /home/fiedlerl/data/mala_data_repo/Be2
Checking targets file  Be_snapshot1.out.npy at /home/fiedlerl/data/mala_data_repo/Be2
Consistency check successful.
Initializing the data scalers.
Input scaler parametrized.
Output scaler parametrized.
Data scalers initialized.
Build datasets.
Build dataset: Done.
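
To illustrate why the first and last entries of `layer_sizes` are taken from the data handler, here is a minimal sketch of how a list of layer sizes translates into the shapes of fully connected layers. The value of `input_dimension` is an assumed example (it depends on the descriptors that were loaded), `output_dimension` is set to the `ldos_gridsize` used in this tutorial, and both helper functions are hypothetical rather than part of MALA.

```python
def layer_shapes(layer_sizes):
    """Return the (inputs, outputs) shape of each fully connected layer."""
    return list(zip(layer_sizes[:-1], layer_sizes[1:]))


def count_parameters(layer_sizes):
    """Count weights plus biases of a plain feed-forward network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in layer_shapes(layer_sizes))


input_dimension = 91    # assumed example value; read from the data in practice
output_dimension = 11   # one output per LDOS energy grid point
layer_sizes = [input_dimension, 100, output_dimension]

print(layer_shapes(layer_sizes))
print(count_parameters(layer_sizes))
```

If the first or last entry did not match the descriptor or LDOS dimension, the first layer could not consume a descriptor vector and the last layer could not produce one LDOS value per energy, which is why these sizes are set only after the data has been loaded.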


Now we can actually train a network.

In [7]:
network = mala.Network(parameters)
trainer = mala.Trainer(parameters, network, data_handler)
trainer.train_network()

Initial Guess - validation data loss:  0.23660719517299106
Epoch 0: validation data loss: 2.0642450877598352e-05, training data loss: 0.0010316168240138463
Time for epoch[s]: 14.075532674789429
Epoch 1: validation data loss: 8.426698190825327e-06, training data loss: 9.5022799713271e-06
Time for epoch[s]: 0.5715315341949463
Epoch 2: validation data loss: 6.573324224778584e-06, training data loss: 4.276818728872708e-06
Time for epoch[s]: 0.49625110626220703
Epoch 3: validation data loss: 6.066970527172088e-06, training data loss: 3.2637411994593483e-06
Time for epoch[s]: 0.5719480514526367
Epoch 4: validation data loss: 5.8819215212549484e-06, training data loss: 2.9526403439896447e-06
Time for epoch[s]: 0.614145040512085
Epoch 5: validation data loss: 5.756421280758722e-06, training data loss: 2.780921491129058e-06
Time for epoch[s]: 0.8275718688964844
Epoch 6: validation data loss: 5.729225065026964e-06, training data loss: 2.724048016326768e-06
Time for epoch[s]: 0.9969642162322998
E

Well, how was that? Do we have a good model now? Can we predict the LDOS with it? That is not an easy question to answer from this output alone. First of all, we see loss values being printed, and those all look nice, but they are not trivially related to the physical/chemical accuracy we are actually looking for.
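
A toy example may help to see why a loss value does not translate directly into the accuracy of an observable. In the following sketch, which uses made-up numbers and hypothetical helper functions rather than MALA output, two predictions have the same mean squared error with respect to a target curve, but very different errors once the curve is integrated over the energy: errors of alternating sign cancel, while a constant shift accumulates.

```python
def mse(prediction, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def integrated_error(prediction, target, spacing):
    """Error of the energy-integrated quantity (simple Riemann sum)."""
    return spacing * (sum(prediction) - sum(target))


target = [1.0] * 10
eps = 0.01

# Same pointwise error magnitude, but different sign patterns.
alternating = [t + eps * (-1) ** i for i, t in enumerate(target)]
shifted = [t + eps for t in target]

print(mse(alternating, target), mse(shifted, target))
print(integrated_error(alternating, target, spacing=2.5))
print(integrated_error(shifted, target, spacing=2.5))
```

Both predictions would look equally good in a training log, yet only one of them gives a reliable integrated quantity. This is why observables are tested explicitly in what follows.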

We can test this by using the `Tester` class. The class works similarly to the `Trainer` class. We add data, push it through the model, and then use the results to perform calculations.

We just have to make sure that the LDOS is correctly integrated by setting the appropriate parameters. Then we can add data to test. We should always test on data different from the data we trained on. Also, we now have to specify the corresponding calculation output, since we may need it for integration.

In [10]:
parameters.targets.ldos_gridsize = 11
parameters.targets.ldos_gridspacing_ev = 2.5
parameters.targets.ldos_gridoffset_ev = -5

data_handler.clear_data()
data_handler.add_snapshot("Be_snapshot2.in.npy", data_path,
                          "Be_snapshot2.out.npy", data_path, "te",
                          calculation_output_file=pj(data_path, "Be_snapshot2.out"))
data_handler.add_snapshot("Be_snapshot3.in.npy", data_path,
                          "Be_snapshot3.out.npy", data_path, "te",
                          calculation_output_file=pj(data_path, "Be_snapshot3.out"))
data_handler.prepare_data(reparametrize_scaler=False)


Checking the snapshots and your inputs for consistency.
Checking descriptor file  Be_snapshot2.in.npy at /home/fiedlerl/data/mala_data_repo/Be2
Checking targets file  Be_snapshot2.out.npy at /home/fiedlerl/data/mala_data_repo/Be2
Checking descriptor file  Be_snapshot3.in.npy at /home/fiedlerl/data/mala_data_repo/Be2
Checking targets file  Be_snapshot3.out.npy at /home/fiedlerl/data/mala_data_repo/Be2
DataHandler prepared for inference. No training possible with this setup. If this is not what you wanted, please revise the input script. Validation snapshots you may have entered willbe ignored.
Consistency check successful.
Build datasets.
Build dataset: Done.


Now comes the actual object with which to test. We simply tell it which observables to test for and off we go.

In [11]:
tester = mala.Tester(parameters, network, data_handler, observables_to_test=["band_energy"])
results = tester.test_all_snapshots()


The results

In [12]:
results

{'band_energy': [-0.0352251206432328, -0.02902837244603873]}
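
Errors of this kind are often easier to judge when expressed per atom. The following is a minimal sketch of that conversion. It assumes that the values above are band energy errors in eV for the whole simulation cell, which is an assumption and should be checked against the documentation of the MALA version in use. The number of atoms is the 2-atom beryllium cell of this tutorial, and the helper `to_mev_per_atom` is hypothetical.

```python
# Values copied from the output above; units assumed to be eV per cell.
results = {"band_energy": [-0.0352251206432328, -0.02902837244603873]}
number_of_atoms = 2  # the beryllium cell used in this tutorial


def to_mev_per_atom(error_ev, atoms):
    """Convert a total-energy error in eV to meV per atom."""
    return 1000.0 * error_ev / atoms


errors_mev_per_atom = [to_mev_per_atom(error, number_of_atoms)
                       for error in results["band_energy"]]
mean_absolute_error = (sum(abs(error) for error in errors_mev_per_atom)
                       / len(errors_mev_per_atom))

print(errors_mev_per_atom)
print(mean_absolute_error)
```

Under the stated assumption, the mean absolute error is the number one would compare against a target accuracy when deciding whether the model is good enough or needs further tuning.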