In [None]:
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import MACCSkeys, Draw, rdFingerprintGenerator
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

In [None]:
!pip3 install numpy==1.23.5
!pip install keras==2.9.0 
!pip install -U tensorflow

In [None]:
!pip install deepchem

In [None]:
import deepchem
import tensorflow
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Dropout, Conv2D, Conv1D
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.regularizers import L1

***Make sure the above cells all run properly***

# Geometry-based Descriptors for Neural Networks

All of our notebooks thus far have used descriptors that are in some way derived from a SMILES string. SMILES strings specify molecular identity and connectivity, but they cannot encode a specific molecule's geometry. For example, a SMILES string could not be used to generate an ML-based force field, where the model would need to predict the total energy as a function of changing molecular geometry.

There are many choices for geometry-based descriptors. One of the simplest is the Coulomb matrix, a 2D descriptor that encodes the molecular geometry using nuclear charges and interatomic distances. An off-diagonal element of the Coulomb matrix is defined by the nuclear charges of a pair of atoms, $q_A$ and $q_B$, and the distance between them, $r_{AB}$ (the diagonal elements, where $A = B$, are handled separately, conventionally as $0.5\,q_A^{2.4}$):

$$C_{AB} = \frac{q_Aq_B}{r_{AB}}$$

More sophisticated descriptors use symmetry functions, which encode the local environment of each atom beyond pairwise interactions.
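As a rough illustration of the formula above, the following sketch builds a Coulomb matrix for a single water molecule with NumPy. The function name `coulomb_matrix`, the `max_atoms` padding argument, and the approximate water geometry are all assumptions made for this example, and the diagonal term $0.5\,q_A^{2.4}$ is the common literature convention rather than something defined in this notebook.

```python
import numpy as np

def coulomb_matrix(charges, coords, max_atoms=None):
    """Return the (optionally zero-padded) Coulomb matrix of one geometry."""
    charges = np.asarray(charges, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = len(charges)

    # Pairwise distances r_AB from the Cartesian coordinates
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Off-diagonal elements: q_A * q_B / r_AB
    np.fill_diagonal(dist, 1.0)  # placeholder to avoid dividing by zero
    cm = np.outer(charges, charges) / dist

    # Diagonal elements: 0.5 * q_A**2.4 (assumed convention, see text)
    np.fill_diagonal(cm, 0.5 * charges**2.4)

    # Zero-pad so that every molecule gives a matrix of the same size
    if max_atoms is not None:
        padded = np.zeros((max_atoms, max_atoms))
        padded[:n, :n] = cm
        cm = padded
    return cm

# Approximate water geometry (O, H, H); illustrative values only
water_charges = [8, 1, 1]
water_coords = [[0.000, 0.000, 0.117],
                [0.000, 0.757, -0.469],
                [0.000, -0.757, -0.469]]

cm = coulomb_matrix(water_charges, water_coords, max_atoms=4)
print(cm.shape)
print(np.round(cm, 3))
```

The zero padding is what lets molecules of different sizes share one fixed input shape, which a dense neural network requires.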

## Interatomic Potentials based on Machine Learning

One of the main goals of computational chemistry is to accurately compute the total energy of molecules. However, no single theory can be applied universally to all molecules. Quantum mechanical methods tend to be limited to small molecules (<50 heavy atoms), since their cost (in terms of memory and time) scales with the number of electrons.

Larger systems, like proteins, membrane models, etc., cannot be easily modeled with quantum mechanical methods, so classical methods, often called "molecular mechanics" (MM), are typically used. Molecular mechanics uses simple, inexpensive physical models that depend only on the positions of the atoms, and not on the electronic structure. While very useful, MM methods depend heavily on their parameterization, which limits their reliability.

It becomes a tantalizing possibility, then, that we could use machine learning to bring the accuracy of a QM method to the cost of an MM method. Indeed, this is the goal of many research laboratories, with the most successful example being the ANI potential. ANI is an NN-based ML model that was trained on DFT energies of roughly 2 million small molecules. ANI has made it possible to calculate DFT-quality energies for a variety of molecules at a cost comparable to most MM methods. The success of ANI has led to the development of many ML models designed to predict quantum-mechanically derived quantities.

In this notebook, you will be building your own molecular potential based on QM energies. You will be using the QM9 dataset, which is composed of roughly 130,000 molecules containing up to 9 heavy atoms. This dataset also contains a number of molecular properties, like the dipole moment, total energy, and zero-point energy correction, all computed at the B3LYP/6-31G(2df,p) level of theory.

## Loading the Data

The QM9 dataset is so commonly used that `DeepChem` has a function to load, featurize, normalize, and split the data all in a single call.

All we need to do is define a featurizer (here we'll choose the Coulomb matrix) and then pass it to the function `load_qm9()` defined in DeepChem.

In [None]:
# define a featurizer below

# Pass it to the function
tasks, datasets, transformers = deepchem.molnet.load_qm9(featurizer= )

The `tasks` list tells us what data is available for training. Print it below to see the values, and look at the `DeepChem` documentation to get their definitions.

The `datasets` list contains the training, validation, and testing set descriptors and labels. We will use the training set to train the model, and the validation set is used at each epoch to estimate the accuracy. The testing set is only used to test the completed model, after it is trained. 

As usual, we need to do a little formatting of the data. First, we need to store each feature/label pair in a well-named list. I'll do what's needed on the training set; you should repeat the procedure for the validation and testing sets.

In [None]:
# First we unroll the datasets
train, valid, test = datasets

# Then we grab the features and labels, store them as x and y
xtrain = train.X
ytrain = train.y

# We need to make sure the features are the right dimension:
# one flat row of 32*32 values per molecule
# (this assumes the featurizer produced 32x32 matrices)
xtrain = np.reshape(xtrain, (len(xtrain), 32*32))

# Let's see how many training points we're using
print(len(xtrain))
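To see what the reshape in the cell above does, here is a small sketch on a toy array. The array sizes and the names `x_demo` and `x_flat` are made up for illustration; they are not the QM9 features.

```python
import numpy as np

# Stand-in for train.X: 5 "molecules", each a 4x4 matrix (toy sizes)
x_demo = np.arange(5 * 4 * 4, dtype=float).reshape(5, 4, 4)

# Flatten each matrix into one row; -1 lets NumPy infer 4*4 = 16
x_flat = x_demo.reshape(len(x_demo), -1)
print(x_flat.shape)  # (5, 16)

# Row-major flattening keeps the data intact: reshaping back recovers x_demo
assert np.array_equal(x_flat.reshape(x_demo.shape), x_demo)
```

Using `-1` instead of a hard-coded product avoids a shape error if the featurizer is later changed to a different matrix size.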


Now we need to select our labels. The lists contain all available data defined in the `tasks` list. We need to take our sets of y values, and create lists that only contain the label we want. In our case, this is the total energy, `u0`. Follow the comments below to create the three lists we need.
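The column-selection idea described above can be sketched on synthetic data. The task names and label values here are hypothetical stand-ins, so the index found in this toy example is not the index you will find for the real `tasks` list.

```python
import numpy as np

# Hypothetical task names and labels; the real ones come from load_qm9()
tasks_demo = ["mu", "zpve", "u0"]
y_demo = np.array([[0.1, 0.02, -40.5],
                   [1.8, 0.03, -76.4],
                   [0.0, 0.05, -79.8]])

# Look up the column that holds the label of interest
idx = tasks_demo.index("u0")

# Integer indexing selects that column and returns a flat 1D array
y_u0 = y_demo[:, idx]

# A slice keeps a trailing axis of length 1; flatten() removes it
y_u0_flat = y_demo[:, idx:idx + 1].flatten()

print(idx)
print(y_u0.shape, y_u0_flat.shape)
```

Looking the index up by name, rather than hard-coding a number, keeps the code correct if the order of the tasks ever changes.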

In [None]:
# First, determine the index of 'u0'. This will tell us what column from the y-values we need.

# Next, define a list of y-values (e.g., start with ytrain), as the slice
# of values at the index you defined


# Then, flatten the lists. Be sure they have nice names
# like ytrain, yvalid, and ytest



## Defining and training your model

Now that you have the data in the proper format, you need to define your model architecture. Although the programming is not too demanding, this is a very challenging task. There are many choices here, but we'll stick to the same sequential models from the previous notebook. You may want to consider a few things:

 1. Have your first layer have a number of nodes equal to your **total** number of features per molecule.
 2. Have the next few layers reduce the number of nodes to something more modest.
 3. Consider adding one or more **dropout layers**. These randomly set node values to zero, and are implemented to avoid overfitting.
 4. Try having a somewhat large number of layers. You may want to use a loop for that.
 5. We are ultimately learning a single number---the total energy---so your last layer should only have a single node.
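One possible way to put the suggestions above together is sketched below. The layer widths, dropout rate, and ReLU activations are arbitrary illustrative choices, and `build_model` is a hypothetical helper that is not defined elsewhere in this notebook; treat it as a starting point rather than a recommended architecture.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input

def build_model(n_features, hidden_sizes=(256, 128, 64), dropout_rate=0.2):
    """Build a feed-forward regression model (all sizes are illustrative)."""
    model = Sequential()
    model.add(Input(shape=(n_features,)))

    # First layer: as many nodes as there are features per molecule
    model.add(Dense(n_features, activation="relu"))

    # Progressively narrower hidden layers, added in a loop,
    # each followed by dropout to reduce overfitting
    for size in hidden_sizes:
        model.add(Dense(size, activation="relu"))
        model.add(Dropout(dropout_rate))

    # Single linear output node for the predicted energy
    model.add(Dense(1))
    return model

model = build_model(n_features=32 * 32)
model.summary()
```

Wrapping the architecture in a function makes it easy to try different depths and widths by changing `hidden_sizes` only.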


Compile the model below. If you notice your model converges to a high-loss value very quickly, consider changing the **learning rate** by passing a value to the optimizer when you compile.

Finally, train your model. I'd recommend using a somewhat large batch size, and be sure to remove the `verbose=0` parameter so that you can monitor the loss in real time.
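Here is a minimal sketch of compiling and training, assuming the Adam optimizer and a mean-squared-error loss (neither is prescribed by this notebook). The data are random stand-ins for the QM9 features and labels, and the tiny model, learning rate, batch size, and epoch count are placeholder values chosen only so the example runs quickly.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam

# Synthetic stand-ins for the training and validation sets (toy sizes)
rng = np.random.default_rng(0)
x_all = rng.normal(size=(320, 16))
y_all = x_all.sum(axis=1)
x_tr, y_tr = x_all[:256], y_all[:256]
x_val, y_val = x_all[256:], y_all[256:]

demo_model = Sequential([
    Input(shape=(16,)),
    Dense(32, activation="relu"),
    Dense(1),
])

# Passing an optimizer object lets us set the learning rate explicitly
demo_model.compile(optimizer=Adam(learning_rate=1e-3), loss="mse")

# verbose is left at its default so the loss is printed during training;
# the validation set is evaluated at the end of every epoch
history = demo_model.fit(x_tr, y_tr,
                         validation_data=(x_val, y_val),
                         batch_size=64,
                         epochs=5)

print(history.history["loss"])
print(history.history["val_loss"])
```

The returned `history` object stores the per-epoch training and validation losses, which is convenient for plotting a learning curve afterwards.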

## Testing and evaluation

In the cell below, test your model by following these steps:

  1. Use your model to predict the energies of the test set. You will need to flatten the resulting array.
  2. Plot your predicted test values against the actual test values. Also plot the y=x line.
  3. Calculate the RMSE and $R^2$ for the predicted values.
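The steps above can be sketched with plain NumPy on made-up numbers. The helper names `rmse`, `r_squared`, and `parity_plot` are hypothetical, the values are not real predictions, and $R^2$ is taken here to be the coefficient of determination (one of several definitions in use).

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between two 1D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def parity_plot(y_true, y_pred):
    """Scatter predicted vs. actual values together with the y = x line."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    fig, ax = plt.subplots()
    ax.scatter(y_true, y_pred, s=10)
    lims = [min(np.min(y_true), np.min(y_pred)),
            max(np.max(y_true), np.max(y_pred))]
    ax.plot(lims, lims, "k--", label="y = x")
    ax.set_xlabel("Actual")
    ax.set_ylabel("Predicted")
    ax.legend()
    return ax

# Made-up values; a Keras model's predict() returns shape (n, 1)
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_2d = np.array([[1.1], [1.9], [3.2], [3.8]])
y_pred_demo = y_pred_2d.flatten()

print(f"RMSE = {rmse(y_true_demo, y_pred_demo):.3f}")
print(f"R^2  = {r_squared(y_true_demo, y_pred_demo):.3f}")
```

Calling `parity_plot(y_true_demo, y_pred_demo)` in a notebook cell would draw the scatter plot with the y=x reference line; points close to that line indicate accurate predictions.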