# Predicting Gibbs free energy of metabolic reactions using neural networks
Motivation: A quantitative understanding of the thermodynamics of biochemical reactions is essential for accurately modeling metabolism. Energetic studies of metabolic pathways require accurate thermochemical quantities for the substrates and metabolites involved. Most importantly, the standard Gibbs free energy change of each reaction must be known accurately in order to quantify how thermodynamically favourable the reaction is in biological systems. The number of molecules involved in biochemical reactions is huge, and the currently available Gibbs free energies of formation are not accurate because they are based on empirical methods such as the group contribution method (GCM) or the component contribution method (CCM). Quantum chemistry has recently emerged as an important alternative modelling tool for the accurate prediction of biochemical thermodynamics. However, quantum chemical methods have a very high computational cost compared with GCM or other cheminformatics-based alternatives.

    "There is a need to develop an alternative, non-empirical approach that uses quantum chemical data to predict the free energy of molecules relevant to metabolic pathways."

## Metrics

We hope to train our model to reproduce the existing analysis and to predict the energetics of new metabolic reactions.

1. Our goal is to predict the free energy of MetaCyc metabolites that the model has not been trained on, within the accuracy of DFT, by training on quantum mechanical data (QM9).
2. Generate Gibbs free energies of metabolic reactions (we still need to work out how to do this).
3. If the machine-learning-predicted free energy of a metabolic reaction is within 1-2 kcal/mol of the experimental value, we are successful.
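
The success criterion in item 3 can be checked with a few lines of Python. This is only a sketch: the function names `mean_absolute_error` and `fraction_within`, the 1 kcal/mol default threshold, and the numbers in the usage example are assumptions made for illustration, not values from our data.

```python
def mean_absolute_error(predicted, experimental):
    """Mean absolute error between paired values (kcal/mol)."""
    if len(predicted) != len(experimental):
        raise ValueError("predicted and experimental must have the same length")
    return sum(abs(p - x) for p, x in zip(predicted, experimental)) / len(predicted)


def fraction_within(predicted, experimental, threshold=1.0):
    """Fraction of reactions whose absolute error is at most `threshold` kcal/mol."""
    hits = sum(abs(p - x) <= threshold for p, x in zip(predicted, experimental))
    return hits / len(predicted)


# Made-up reaction free energies for illustration only (kcal/mol).
predicted = [-3.5, 2.0, -7.25]
experimental = [-4.0, 2.5, -5.25]

print(mean_absolute_error(predicted, experimental))
print(fraction_within(predicted, experimental, threshold=1.0))
```

Reporting both numbers is a deliberate choice: the mean error can look acceptable while a few reactions are badly wrong, and the fraction within the threshold exposes that.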


## Deep Learning Approaches
Supervised learning on molecules has seen rapid improvements, with applications to chemistry, drug discovery, and materials science.
* Message-Passing Neural Networks ([Gilmer et al 2017](https://arxiv.org/pdf/1704.01212.pdf)): a framework for describing many graph neural networks (listed below) in terms of Message, Update and Readout operations on graphs, by analogy with message passing in probabilistic graphical models
  * Graph Convolutional Networks ([Kipf and Welling 2016](http://arxiv.org/abs/1609.02907))
  * Gated Graph Neural Networks ([Li et al 2016](http://arxiv.org/abs/1511.05493))
  * [Interaction Networks](https://github.com/PNNL-CompBio/graph-neural-networks) ([Battaglia et al 2016](http://arxiv.org/abs/1612.00222)) This is the network we got working first.
  * [Deep Tensor Neural Networks](https://github.com/atomistic-machine-learning/dtnn) ([Schütt et al 2017a](https://www.nature.com/articles/ncomms13890)) This was referenced in the MPNN paper.
  * [SchNet](https://github.com/djinnome/SchNet) ([Schütt et al 2017b](http://arxiv.org/abs/1712.06113)) This improved on DTNN.
  * Neural Message Passing with Edge Updates ([Jørgensen et al 2018](https://arxiv.org/pdf/1806.03146.pdf)) This improved on SchNet.
  * Graph Networks ([Battaglia et al 2018](https://arxiv.org/abs/1806.01261)) This is a generalization of MPNNs (see figure below).
* Ensemble networks to predict properties.
    


![graph neural network architectures](https://ndownloader.figshare.com/files/12245093/preview/12245093/preview.jpg?private_link=7bc3719fab09b0639bd4)
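
To make the Message, Update and Readout vocabulary concrete, here is a toy, pure-Python round of message passing on a small graph. It is only a sketch of the general pattern: real MPNNs use learned neural networks for all three functions, whereas the sum-based functions and the names `message_passing_step` and `readout` are placeholders chosen for illustration.

```python
def message_passing_step(node_states, edges):
    """One round of message passing on an undirected graph.

    node_states: dict node -> float hidden state
    edges: dict (u, v) -> float edge feature
    """
    # Message phase: each node sums neighbor_state * edge_feature over its edges.
    messages = {v: 0.0 for v in node_states}
    for (u, v), w in edges.items():
        messages[v] += node_states[u] * w
        messages[u] += node_states[v] * w
    # Update phase: combine the old state with the aggregated message.
    return {v: node_states[v] + messages[v] for v in node_states}


def readout(node_states):
    """Readout phase: an order-invariant summary of all node states."""
    return sum(node_states.values())


# Toy three-atom chain 0 - 1 - 2 with made-up states and edge features.
states = {0: 1.0, 1: 2.0, 2: 3.0}
edges = {(0, 1): 1.0, (1, 2): 0.5}

states = message_passing_step(states, edges)
print(states)
print(readout(states))
```

The readout is a plain sum so that the prediction does not depend on how the atoms are numbered, which is the property all of the architectures listed above need for molecular inputs.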

## Training Data
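* Quantum mechanics data (QM9/B3LYP) with raw distances and chemical graph
* QM9/B3LYP with binned distances and chemical graph. Bin 0 is anything less than 2 angstroms, bins 1-8 cover 2 to 6 angstroms in steps of 0.5 angstroms, and bin 9 is anything of 6 angstroms or more (these are the bin edges used in `qm9_edges`).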
* QM9/B3LYP with chemical graph only
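
The binned-distance representation can be sketched as a small standalone function. It mirrors the binning loop in `qm9_edges` (bin edges every 0.5 angstroms from 2 to 6 angstroms); the function name `distance_bin` and its keyword defaults are hypothetical and introduced only for illustration.

```python
def distance_bin(distance, start=2.0, stop=6.0, n_inner=8):
    """Map an interatomic distance (angstroms) to a bin index 0..n_inner+1.

    Bin 0 holds distances below `start`, bins 1..n_inner split
    [start, stop) into equal steps, and the last bin holds the rest.
    """
    step = (stop - start) / n_inner
    for i in range(n_inner + 1):
        if distance < start + i * step:
            return i
    return n_inner + 1


# Print the bin assigned to a few sample distances.
for d in (1.1, 2.0, 2.4, 5.9, 6.0, 7.5):
    print(d, distance_bin(d))
```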
    
    
## Dev data
* MetaCyc (Group Contribution) subset
* Equilibrator (Component Contribution) subset

## Test data
* MetaCyc (Group Contribution) different subset
* Equilibrator (Component Contribution) different subset

## Experimental (Golden set) data
* NIST free energy of reaction for quantification of true error in prediction

## Preliminary results

Last night, we submitted the Interaction Networks architecture (Battaglia et al 2016) for training on QM9 with the raw-distance representation.

After analyzing the trained network, we will use it to predict Gibbs free energies for our Dev set.


## Challenges
Representation of input molecules for property prediction: there are a number of things we need to modify in the Message Passing Neural Network (MPNN) in order to predict the free energy of reactions involved in metabolism.

1. 3D coordinates: We don't have accurate 3D coordinates of the lowest energy states for MetaCyc/Equilibrator compounds. We know that MPNN gets 11/13 properties within chemical accuracy of DFT when using both the 3D coordinates of the lowest energy state and the SMILES, whereas it gets only 5/13 properties accurately from the InChI/SMILES alone.

2. Free energy of formation to free energy of reaction:
We only have values for the Gibbs free energy of formation $\Delta G^\circ_{f}$, and from those we can get the free energy of reaction $\Delta G^\circ_{rxn}$ (by calculating $\Delta G_{rxn}^\circ = S^T\Delta G^\circ_{f}$), where $S$ is the stoichiometric matrix of reactions and metabolites. This will enable us to predict a more accurate free energy of reaction from the QM9 data. The challenge will be how to correctly propagate errors from the Gibbs free energy of formation to the Gibbs free energy of reaction. We also need to find a data set that contains experimentally measured Gibbs free energies of reaction to serve as a golden set (NIST?).

3. We need to extend to multiple NN architectures (try SchNet and SchNet + edge updates).
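
The relation $\Delta G_{rxn}^\circ = S^T\Delta G^\circ_{f}$ from item 2, together with one simple way to propagate uncertainties, can be sketched in plain Python. The sketch assumes that the errors in the formation energies are independent between metabolites, so variances add with squared stoichiometric coefficients. That assumption is ours and ignores correlated errors, which would require a full covariance matrix. The metabolite names, energies, and uncertainties are made-up placeholders.

```python
import math


def reaction_free_energy(stoichiometry, dg_f, sigma_f):
    """Return (dG_rxn, sigma_rxn) for one reaction.

    stoichiometry: dict metabolite -> coefficient (negative for substrates,
                   positive for products), i.e. one column of S
    dg_f: dict metabolite -> standard Gibbs free energy of formation
    sigma_f: dict metabolite -> standard error of that formation energy
    """
    dg_rxn = sum(coef * dg_f[m] for m, coef in stoichiometry.items())
    # Independent errors: var(sum c_i x_i) = sum c_i^2 var(x_i).
    var_rxn = sum((coef * sigma_f[m]) ** 2 for m, coef in stoichiometry.items())
    return dg_rxn, math.sqrt(var_rxn)


# Hypothetical reaction A + B -> 2 C with placeholder values (kcal/mol).
stoichiometry = {"A": -1, "B": -1, "C": 2}
dg_f = {"A": -10.0, "B": -20.0, "C": -18.0}
sigma_f = {"A": 1.0, "B": 2.0, "C": 1.0}

dg_rxn, sigma_rxn = reaction_free_energy(stoichiometry, dg_f, sigma_f)
print(dg_rxn, sigma_rxn)
```

Storing each reaction as a sparse dictionary rather than a dense matrix column keeps the example short; with many reactions, the same calculation is a single matrix-vector product.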


```python
import networkx as nx
from rdkit import Chem

# Bond types in the order used for the edge encodings below.
BOND_TYPES = [Chem.rdchem.BondType.SINGLE, Chem.rdchem.BondType.DOUBLE,
              Chem.rdchem.BondType.TRIPLE, Chem.rdchem.BondType.AROMATIC]


def qm9_edges(g, e_representation='raw_distance'):
    """Return the adjacency matrix of `g` and a dict of edge features.

    Edges without a bond type are removed from `g` in place, except in
    'distance_bin' mode, where they receive a distance-bin label instead.
    """
    remove_edges = []
    e = {}
    # networkx >= 2.0 API; networkx 1.x used g.edges_iter(data=True).
    for n1, n2, d in g.edges(data=True):
        e_t = []
        if e_representation == 'chem_graph':
            # Bonded pairs only: label 1-4 for single/double/triple/aromatic.
            if d['b_type'] is None:
                remove_edges.append((n1, n2))
            else:
                e_t += [i + 1 for i, x in enumerate(BOND_TYPES) if x == d['b_type']]
        elif e_representation == 'distance_bin':
            if d['b_type'] is None:
                # Bin 0: < 2 A; bins 1-8: 0.5 A steps up to 6 A; bin 9: >= 6 A.
                step = (6 - 2) / 8.0
                start = 2
                b = 9
                for i in range(0, 9):
                    if d['distance'] < (start + i * step):
                        b = i
                        break
                # Offset by 5 so bin labels (5-14) do not collide with bond labels (1-4).
                e_t.append(b + 5)
            else:
                e_t += [i + 1 for i, x in enumerate(BOND_TYPES) if x == d['b_type']]
        elif e_representation == 'raw_distance':
            # Bonded pairs only: raw distance followed by a one-hot bond type.
            if d['b_type'] is None:
                remove_edges.append((n1, n2))
            else:
                e_t.append(d['distance'])
                e_t += [int(d['b_type'] == x) for x in BOND_TYPES]
        else:
            raise ValueError('Incorrect edge representation: %r' % e_representation)
        if e_t:
            e[(n1, n2)] = e_t
    for edg in remove_edges:
        g.remove_edge(*edg)
    # Replaces nx.to_numpy_matrix, which was removed in networkx 3.0.
    return nx.to_numpy_array(g), e
```


## Future road map

By developing deep neural networks, we would like to expand the range of problems that can be addressed by:
1. Accurately modeling chemical properties with high-level ab initio simulations
2. Modeling much larger systems, including cofactors and transition metals
3. Training on larger datasets, and