# Generative AI for Molecule Design

This is a toy project that uses generative AI to study molecule design. I am especially interested in its applications in drug discovery.

However, since this is an educational project and I am only using my laptop with a fairly modest GPU, I will only solve a "toy problem" as a proof of concept.

## Project Plan

Idea: Train a generative model (such as a variational autoencoder or diffusion model) to design (new) bioactive molecules,
in particular kinase inhibitors[^1] for cancer therapy. 

I plan to draw on my studies in quantum chemistry and computer science by feeding features such as the HOMO-LUMO gap,
partial charges, and electronic excitations into the model, and by adding physical constraints to ensure realistic molecules.

[^1]: TODO: I will want to write more general notes on kinase inhibitors later.

### To start 

0. Set up the directory structure, with src/ (named mygenai/ in the layout below), tests/, notebooks/, and experiments/
1. Explore the QM9 dataset and make sure you can visualize molecules
   1. Make sure to visualize both molecular graphs and properties (like HOMO-LUMO gap, molecular weight, etc.) to get a sense of the data distribution
   2. It might also help to visualize the correlation between different properties to identify which targets might be more informative
2. Train a simple GraphVAE on the QM9 dataset
   1. Perhaps only use a subset of the targets. I am not sure which yet, but targeting the atomisation energy at both 0 K and room temperature probably doesn't make sense
      1. Candidate targets: HOMO-LUMO gap, dipole moment, molecular weight
   2. Verify the VAE’s reconstruction quality on both the molecular graphs and the properties to ensure the model learns meaningful representations
3. Verify that you can generate random molecules by sampling the latent space randomly
   1. Ensure that the molecules look realistic and consistent with the distribution in QM9
4. Use a simple reinforcement learning algorithm to search for the molecule that has the smallest HOMO-LUMO gap
   1. Note: this will require calculating properties (e.g. with PySCF) for every new molecule, so that they can be fed back into the model
   2. A smaller HOMO-LUMO gap generally indicates higher reactivity
   3. This is not a realistic target in drug discovery, but it is a reasonable surrogate for a simple "toy problem" such as this
5. Once that is done...
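
Step 4 can be prototyped before the GraphVAE and PySCF pieces exist. The sketch below is one possible starting point, not the planned implementation. It uses the cross-entropy method (a simple sampling-based optimiser, sometimes used as a policy-search baseline) in place of a full reinforcement learning algorithm, and a made-up `toy_gap` function in place of "decode the latent vector, then compute the HOMO-LUMO gap". The names `toy_gap`, `TOY_OPTIMUM`, and `cem_search` are hypothetical and only serve the illustration.

```python
import random

# Hypothetical stand-in for "decode z into a molecule, then compute its
# HOMO-LUMO gap (e.g. with PySCF)". A smooth toy function keeps the sketch fast.
TOY_OPTIMUM = [1.0, -2.0, 0.5, 3.0]


def toy_gap(z):
    """Toy surrogate gap with its minimum value 1.0 at z == TOY_OPTIMUM."""
    return 1.0 + sum((zi - ti) ** 2 for zi, ti in zip(z, TOY_OPTIMUM))


def cem_search(score, dim, iterations=40, population=64, n_elite=8,
               min_std=0.1, seed=0):
    """Minimise `score` over a latent vector with the cross-entropy method."""
    rng = random.Random(seed)
    mean = [0.0] * dim  # start at the centre of a standard-normal prior
    std = [1.0] * dim
    best_z, best_score = list(mean), score(mean)
    for _ in range(iterations):
        # Sample candidate latent vectors around the current search distribution.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        scored = sorted(((score(z), z) for z in samples), key=lambda p: p[0])
        if scored[0][0] < best_score:
            best_score, best_z = scored[0]
        # Refit the distribution to the lowest-scoring (elite) samples.
        elite = [z for _, z in scored[:n_elite]]
        for d in range(dim):
            values = [z[d] for z in elite]
            mean[d] = sum(values) / n_elite
            var = sum((v - mean[d]) ** 2 for v in values) / n_elite
            # A floor on the spread avoids collapsing the search too early.
            std[d] = max(var ** 0.5, min_std)
    return best_z, best_score


best_z, best_gap = cem_search(toy_gap, dim=4)
print(f"best toy gap found: {best_gap:.3f}")
```

In the real loop, `toy_gap` would be replaced by a function that decodes `z` with the trained model and evaluates the gap of the resulting molecule. Each call would then be far more expensive, so `population` and `iterations` would probably need to be much smaller than the defaults used here.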

### Next steps

6. Modify the model to enforce realistic physics, i.e. that the molecules we generate are actually stable
   1. I'm not yet sure what this will mean in practice, but I do not want to generate molecules that cannot exist
   2. Introduce constraints during the molecule generation process (e.g., valid bond lengths, bond angles)
   3. Use molecular force fields
   4. Experiment with deep GNNs instead of VAEs (I am starting with VAEs since I already have experience with them, albeit for a different type of problem)
7. Augment the dataset to include biologically relevant properties, e.g. from ZINC
   1. This will allow you to target e.g. binding free energy, binding affinities, low toxicity, logP ...
8. Use heuristics and methods from drug discovery (such as Lipinski's rule of five) to verify generated molecules
   1. Other options include QED (Quantitative Estimate of Drug-likeness) or ADMET (absorption, distribution, metabolism, excretion, toxicity) predictions
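
As an illustration of step 8, Lipinski's rule of five can be written as a small filter. This is a minimal sketch that assumes the four descriptors have already been computed elsewhere (for example with a cheminformatics toolkit). The `Descriptors` container, the function names, and the example values are all hypothetical. Allowing at most one violation is a common convention and is exposed as a parameter.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mol_weight: float  # molecular weight in Da
    logp: float        # octanol-water partition coefficient
    h_donors: int      # number of hydrogen-bond donors
    h_acceptors: int   # number of hydrogen-bond acceptors


def lipinski_violations(d):
    """Return the names of the rule-of-five criteria that `d` violates."""
    checks = {
        "mol_weight > 500": d.mol_weight > 500,
        "logP > 5": d.logp > 5,
        "H-bond donors > 5": d.h_donors > 5,
        "H-bond acceptors > 10": d.h_acceptors > 10,
    }
    return [name for name, violated in checks.items() if violated]


def passes_rule_of_five(d, max_violations=1):
    """True if `d` breaks at most `max_violations` of the four criteria."""
    return len(lipinski_violations(d)) <= max_violations


# Made-up descriptor values, for illustration only.
small = Descriptors(mol_weight=320.4, logp=2.1, h_donors=2, h_acceptors=5)
bulky = Descriptors(mol_weight=712.9, logp=6.3, h_donors=3, h_acceptors=9)

print(passes_rule_of_five(small), lipinski_violations(small))
print(passes_rule_of_five(bulky), lipinski_violations(bulky))
```

Returning the list of violated criteria, rather than only a boolean, makes it easier to see why a generated molecule was rejected.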

### Project Structure
The layout might look something like this:

```text
generative-molecular-design/
├── data/
│   ├── raw/                  # Raw datasets (e.g., QM9, ZINC, etc.)
│   ├── processed/            # Preprocessed and cleaned data (e.g., graphs, properties, etc.)
│   └── augmented/            # Augmented data (e.g., biologically-relevant data from ZINC)
├── notebooks/                # Jupyter notebooks for exploration and analysis
│   ├── 00_plan.ipynb         # This Jupyter notebook
│   ├── 01_explore_qm9.ipynb  # Explore QM9 dataset and visualize molecules
│   ├── 02_train_graphvae.ipynb  # Train GraphVAE on QM9 dataset
│   └── 03_rl_homo_gap_search.ipynb  # Reinforcement learning for searching HOMO-LUMO gap
├── mygenai/                  # Main source code for the project
│   ├── __init__.py           # Make it a Python package
│   ├── data_preprocessing.py # Preprocessing functions for datasets
│   ├── graphvae.py           # GraphVAE model definition
│   ├── reinforcement_learning.py # Reinforcement learning module
│   ├── pyscf_utils.py        # Interface with PySCF for quantum chemistry calculations
│   ├── molecule_generation.py # Functions for generating and evaluating molecules
│   └── utils.py              # General utility functions (e.g., visualization, data saving)
├── tests/                    # Unit tests for the code
│   ├── test_graphvae.py      # Tests for GraphVAE functionality
│   ├── test_rl.py            # Tests for reinforcement learning algorithm
│   ├── test_molecule_generation.py  # Tests for molecule generation and validation
│   └── test_pyscf_utils.py   # Tests for PySCF integration
├── docs/                     # Documentation folder
│   ├── index.rst             # Main entry point for the documentation
│   ├── requirements.txt      # List of dependencies
│   ├── README.md             # Project overview, setup, and usage instructions
│   └── api_reference.rst     # If you have an API or specific functions to document
├── experiments/              # Logs, model outputs, and experiment tracking
│   ├── graphvae_model/       # Folder to save trained GraphVAE models
│   ├── rl_experiment_01/     # Reinforcement learning logs and data
│   └── molecule_results/     # Folder to store generated molecules and their properties
├── .gitignore                # Files to ignore in version control (e.g., model checkpoints)
└── setup.py                  # For package setup and dependency management
```
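
The directory part of this layout (step 0 of the plan) could be created with a short script. The sketch below is only one way to do it. The `create_skeleton` helper is a hypothetical name, it creates directories plus an empty `mygenai/__init__.py` and leaves every other file for later, and the demo at the end writes into a temporary directory so that nothing is left behind.

```python
import tempfile
from pathlib import Path

# Directories taken from the tree above; individual files are left for later.
DIRECTORIES = [
    "data/raw",
    "data/processed",
    "data/augmented",
    "notebooks",
    "mygenai",
    "tests",
    "docs",
    "experiments/graphvae_model",
    "experiments/rl_experiment_01",
    "experiments/molecule_results",
]


def create_skeleton(root):
    """Create the project directories under `root` (safe to run repeatedly)."""
    root = Path(root)
    for rel in DIRECTORIES:
        (root / rel).mkdir(parents=True, exist_ok=True)
    # An empty __init__.py makes mygenai importable as a package.
    (root / "mygenai" / "__init__.py").touch()
    return root


# Demo in a throwaway location.
with tempfile.TemporaryDirectory() as tmp:
    root = create_skeleton(Path(tmp) / "generative-molecular-design")
    print(sorted(p.name for p in root.iterdir()))
```

To create the real project, `create_skeleton("generative-molecular-design")` can be called from the parent directory.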