# Machine learning for computational materials science and chemistry with MALA

## What is MALA?

A short summary of MALA and how it works is given in `mala_background.pdf`.

## Setting up MALA

For this tutorial, a Google Colab environment is provided that includes all necessary packages. More generally, MALA is an open-source framework that can be obtained [here](https://github.com/mala-project/mala). Detailed instructions, including installation, can be found [here](https://mala-project.github.io/mala/).

A few examples at the end of the notebook tackle advanced applications. To keep installation simple, the backends they require are not bundled with the Google Colab environment. Interested readers may install them on their own machines; in in-person workshops, the host will demonstrate them.

## Loading the modules

These modules will be necessary for the tutorials discussed here.

In [2]:
# MALA itself.

import mala

# We would like to visualize simple plots.
# The font size can sometimes be a bit small for Jupyter Notebooks.

import matplotlib
import matplotlib.pyplot as plt
font = {'size'   : 22}
matplotlib.rc('font', **font)

## Data

Data is the backbone of each ML application.

## The MALA interface

Within extended ML frameworks, reproducibility is a major problem. Models often depend on many so-called hyperparameters that characterize model behavior. These include, but are in no way limited to, the number and sizes of neural network layers, training procedures, and descriptions of data specifics (e.g., how is the local density of states sampled?).
There are several ways to handle this problem efficiently. Many frameworks rely on command line arguments, i.e., the user provides a potentially extensive list of command line arguments at runtime. Another good approach is to use input files, which is consistent with the way standard computational science simulation codes work.
An obvious downside of input files is that one needs a framework at hand to prepare them.
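
The two approaches can be contrasted in a minimal sketch that uses only the Python standard library. The hyperparameter names, the default values, and the file name `input.json` are assumptions chosen for illustration; none of this is MALA code.

```python
import argparse
import json
import os
import tempfile

# Hypothetical hyperparameters, chosen only for illustration.
DEFAULTS = {"layer_sizes": [10, 10, 10], "learning_rate": 0.01, "manual_seed": None}


def from_command_line(argv):
    """Approach 1: every hyperparameter is a command line argument."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--layer-sizes", type=int, nargs="+",
                        default=DEFAULTS["layer_sizes"])
    parser.add_argument("--learning-rate", type=float,
                        default=DEFAULTS["learning_rate"])
    parser.add_argument("--manual-seed", type=int,
                        default=DEFAULTS["manual_seed"])
    return vars(parser.parse_args(argv))


def from_input_file(path):
    """Approach 2: hyperparameters are read from a JSON input file."""
    with open(path) as handle:
        return {**DEFAULTS, **json.load(handle)}


# The same configuration, expressed both ways.
cli_config = from_command_line(
    ["--layer-sizes", "91", "100", "11", "--manual-seed", "1234"]
)

with tempfile.TemporaryDirectory() as tmp:
    input_file = os.path.join(tmp, "input.json")
    with open(input_file, "w") as handle:
        json.dump({"layer_sizes": [91, 100, 11], "manual_seed": 1234}, handle)
    file_config = from_input_file(input_file)

# Both routes end in the same dictionary of hyperparameters.
print(cli_config == file_config)
```

The command line route keeps everything in one invocation but grows unwieldy as the list of arguments gets longer; the input file route leaves a record on disk that can be archived with the results.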

MALA follows a route in between these two approaches. The central object is `Parameters()`. It holds all (hyper)parameters that can be used in the course of a MALA run, and it is structured by the subtasks of MALA.

In [3]:
parameters = mala.Parameters()

# All parameters related to how data is handled in general.
parameters.data

# "Targets" always refers to the quantity being learned.
parameters.targets

# "Descriptors" always refers to the quantity from which we learn.
parameters.descriptors

# "Data generation" refers to useful routines for creating training data.
# These routines are mostly experimental at the moment and will not be discussed here in detail.
parameters.datageneration

# "Network" refers to everything related to neural network creation and training.
# In the future, support for more models is planned, and this collection of parameters
# will be updated to reflect this.
parameters.network

# "Hyperparameters" means hyperparameter optimization. This is the process of finding the optimal
# hyperparameters for MALA model training, and we will come back to this process later.
parameters.hyperparameters

<mala.common.parameters.ParametersHyperparameterOptimization at 0x7fab7d654280>

Individual parameter objects can be printed to see what's hidden inside. This also works on the main object.

In [4]:
parameters.data.show()

snapshot_directories_list: []
data_splitting_type: by_snapshot
input_rescaling_type: None
output_rescaling_type: None
use_lazy_loading: False
use_clustering : False
number_of_clusters: 40
train_ratio    : 0.1
sample_ratio   : 0.5
use_fast_tensor_data_set: False
shuffling_seed : None


Finally, there are some high-level parameters one needs when performing ML at scale.

In [5]:
# Whether or not to use a GPU for model training and inference.
parameters.use_gpu

# Whether or not to use MPI parallel CPU inference (no training supported).
# This option is either for pre-processing or production runs of trained models.
# More on that later.
parameters.use_mpi

# Manual seeds can be used to fix the pseudo-random number generator and re-create
# models with the exact same model weights.
parameters.manual_seed

# A comment may be useful to distinguish between sets of parameters.
parameters.comment

''
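
The effect of a manual seed can be illustrated independently of MALA. The sketch below uses Python's built-in `random` module as a stand-in; it is an assumption made for illustration and does not show how MALA seeds its own random number generators. The helper `draw_weights` is hypothetical.

```python
import random


def draw_weights(seed, count=3):
    """Draw `count` pseudo-random numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]


first = draw_weights(1234)
second = draw_weights(1234)
other = draw_weights(4321)

# The same seed reproduces the same sequence.
print(first == second)
# A different seed gives a different sequence.
print(first == other)
```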

None of this yet explains how MALA handles reproducibility. Writing hundreds of lines of parameter statements in each Python script is not exactly maintainable.
Therefore, MALA provides a .json interface.

In [6]:
parameters.save("mala_parameters_01.json")
new_parameters = mala.Parameters.load_from_file("mala_parameters_01.json")

Have a look at the .json file that was just created. You will see that it is structured in the same way as the Python object, allowing fast access. You will also see some parameters that are not discussed here, since they exceed the scope of this tutorial. As a first exercise, try to modify parameters both in Python and in the .json file and check whether loading recovers those changes. The comment and the manual seed are good first examples.
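
Editing the file does not require a text editor; the standard `json` module works as well. The sketch below operates on a small stand-in file rather than on the real output of `parameters.save`. The stand-in contains only the `comment` and `manual_seed` entries, and its layout is an assumption, so check the actual file before applying the same steps to it.

```python
import json
import os
import tempfile

# Stand-in for a saved parameters file. The two keys are taken from the
# tutorial text; the surrounding layout is an assumption for illustration.
stand_in = {"comment": "My first parameters.", "manual_seed": None}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stand_in_parameters.json")
    with open(path, "w") as handle:
        json.dump(stand_in, handle, indent=4)

    # Read the file, modify one entry, and write it back.
    with open(path) as handle:
        content = json.load(handle)
    content["manual_seed"] = 1234
    with open(path, "w") as handle:
        json.dump(content, handle, indent=4)

    # Reloading recovers the edit, while untouched entries are preserved.
    with open(path) as handle:
        reloaded = json.load(handle)

print(reloaded["manual_seed"])
```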

In [10]:
# Set a comment
parameters.comment = "My first parameters."

# Save.
parameters.save("mala_parameters_01.json")

In [12]:
# Edit something in the file (e.g. the manual_seed) and reload.
parameters = mala.Parameters.load_from_file("mala_parameters_01.json")
parameters.show()

---     All parameter MALA needs to perform its various tasks. ---
comment        : My first parameters.
manual_seed    : 1234
use_gpu        : False
device         : cpu
use_horovod    : False
use_mpi        : False
verbosity      : 1
openpmd_configuration: {}
openpmd_granularity: 1
---     Parameters necessary for constructing a neural network. ---
	nn_type        : feed-forward
	layer_sizes    : [10, 10, 10]
	layer_activations: ['Sigmoid']
	loss_function_type: mse
	num_hidden_layers: 1
	no_hidden_state: False
	bidirection    : False
	dropout        : 0.1
	num_heads      : 10
---     Parameters necessary for calculating/parsing input descriptors. ---
	descriptor_type: Bispectrum
	lammps_compute_file: 
	descriptors_contain_xyz: True
	use_z_splitting: True
	number_y_planes: 0
	bispectrum_twojmax: 10
	rcutfac        : 4.67637
	atomic_density_cutoff: 4.67637
	snap_switchflag: 1
	use_atomic_density_energy_formula: False
	atomic_density_sigma: None
---     Parameters necessary for calculat

For extended experiments, it is very useful to work with such input files and to edit parameters in Python only when absolutely necessary. Furthermore, concluded experiments can be saved in this way for future reference.
In the following, we will use a combination of both approaches for the sake of transparency.
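
The pattern described in this section, a single object that groups parameters by subtask and can be written to and restored from JSON, can be sketched in plain Python. This is a toy under stated assumptions and not MALA's implementation: the `Toy...` classes and the `to_json`/`from_json` methods are hypothetical, and only a handful of parameter names from the printed outputs are mirrored.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ToyNetworkParameters:
    """Toy counterpart of a 'network' group of parameters."""
    layer_sizes: list = field(default_factory=lambda: [10, 10, 10])
    layer_activations: list = field(default_factory=lambda: ["Sigmoid"])


@dataclass
class ToyDataParameters:
    """Toy counterpart of a 'data' group of parameters."""
    data_splitting_type: str = "by_snapshot"
    use_lazy_loading: bool = False


@dataclass
class ToyParameters:
    """One object that holds high-level settings and the subtask groups."""
    comment: str = ""
    manual_seed: Optional[int] = None
    network: ToyNetworkParameters = field(default_factory=ToyNetworkParameters)
    data: ToyDataParameters = field(default_factory=ToyDataParameters)

    def to_json(self):
        # asdict() recurses into the nested dataclasses.
        return json.dumps(asdict(self), indent=4)

    @classmethod
    def from_json(cls, text):
        raw = json.loads(text)
        return cls(
            comment=raw["comment"],
            manual_seed=raw["manual_seed"],
            network=ToyNetworkParameters(**raw["network"]),
            data=ToyDataParameters(**raw["data"]),
        )


original = ToyParameters(comment="My first parameters.", manual_seed=1234)
original.network.layer_sizes = [91, 100, 11]

# A round trip through JSON restores an equal object.
restored = ToyParameters.from_json(original.to_json())
print(restored == original)
```

Because the JSON text mirrors the nesting of the object, an entry edited in the file ends up in the corresponding attribute after loading, which is the property the exercise in this section relies on.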