# Dataset Processing

## QM9
Download the dataset from [figshare](https://figshare.com/collections/Quantum_chemistry_structures_and_properties_of_134_kilo_molecules/978904).

The QM9 dataset is a widely used benchmark in computational chemistry and machine learning. It contains computed geometries and properties for about 134k small molecules with up to nine heavy atoms (C, O, N, F). The properties include rotational constants, dipole moment, orbital energies, internal energy, enthalpy, free energy, and heat capacity.

Each molecule is stored in its own file in a pseudo-XYZ format. The layout of each file is as follows.

| Line       | Content                                                                               
|------------| --------------------------------------------------------------------------------------
| 1          | Number of atoms na                                                                    |
| 2          | Properties 1-17 (see below)                                                           |
| 3,...,na+2 | Element type, coordinate (x,y,z) (Angstrom), and Mulliken partial charge (e) of atom  |
| na+3       | Frequencies (3na-5 or 3na-6)                                                          |
| na+4       | SMILES from GDB9 and for relaxed geometry                                             |
| na+5       | InChI for GDB9 and for relaxed geometry                                               |
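
The line layout above can be sketched as a small splitter that slices a file's text by position. This is only an illustration: `split_qm9_sections` and `has_expected_mode_count` are hypothetical names, and the numeric fields are left as raw string tokens. The count check relies on the fact that a linear molecule has 3na-5 vibrational modes and a nonlinear one has 3na-6.

```python
def split_qm9_sections(text):
    """Split the text of one QM9 file into sections by line position.

    Hypothetical helper; fields are returned as raw string tokens.
    """
    lines = text.splitlines()
    na = int(lines[0])                        # line 1: number of atoms
    return {
        "num_atoms": na,
        "properties": lines[1].split(),       # line 2: properties 1-17
        "atoms": lines[2:na + 2],             # lines 3..na+2: one line per atom
        "frequencies": lines[na + 2].split(), # line na+3: vibrational frequencies
        "smiles": lines[na + 3].split(),      # line na+4: two SMILES strings
        "inchi": lines[na + 4].split(),       # line na+5: two InChI strings
    }


def has_expected_mode_count(num_atoms, frequencies):
    """Check that the frequency count is 3na-5 (linear) or 3na-6 (nonlinear)."""
    return len(frequencies) in (3 * num_atoms - 5, 3 * num_atoms - 6)
```

Passing the text of a file to `split_qm9_sections` and then calling `has_expected_mode_count(sections["num_atoms"], sections["frequencies"])` gives a quick sanity check that the file was sliced at the right lines.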


The second line of each file stores the following properties:

|No. |Property | Unit        | Description                                         |
|----|---------|-------------|-----------------------------------------------------|
|  1 |tag      | -           | "gdb9"; string constant to ease extraction via grep |
|  2 |index    | -           | Consecutive, 1-based integer identifier of molecule |
|  3 |A        | GHz         | Rotational constant A                               |
|  4 |B        | GHz         | Rotational constant B                               |
|  5 |C        | GHz         | Rotational constant C                               |
|  6 |mu       | Debye       | Dipole moment                                       |
|  7 |alpha    | Bohr^3      | Isotropic polarizability                            |
|  8 |homo     | Hartree     | Energy of Highest occupied molecular orbital (HOMO) |
|  9 |lumo     | Hartree     | Energy of Lowest unoccupied molecular orbital (LUMO) |
| 10 |gap      | Hartree     | Gap, difference between LUMO and HOMO               |
| 11 |r2       | Bohr^2      | Electronic spatial extent                           |
| 12 |zpve     | Hartree     | Zero point vibrational energy                       |
| 13 |U0       | Hartree     | Internal energy at 0 K                              |
| 14 |U        | Hartree     | Internal energy at 298.15 K                         |
| 15 |H        | Hartree     | Enthalpy at 298.15 K                                |
| 16 |G        | Hartree     | Free energy at 298.15 K                             |
| 17 |Cv       | cal/(mol K) | Heat capacity at 298.15 K                           |
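
As a minimal sketch of how the property table maps onto the second line of a file, the snippet below pairs each whitespace-separated token with its property name. `PROPERTY_NAMES` and `parse_property_line` are hypothetical names introduced for illustration. The sketch assumes the tag and index are separated by whitespace, as in `dsgdb9nsd_000167.xyz`, and that any `*^` exponents should be read as `e`.

```python
# Property names in the order given by the table above.
PROPERTY_NAMES = ["tag", "index", "A", "B", "C", "mu", "alpha", "homo", "lumo",
                  "gap", "r2", "zpve", "U0", "U", "H", "G", "Cv"]


def parse_property_line(line):
    """Map the second line of a QM9 file onto named properties (hypothetical helper)."""
    tokens = line.split()  # splits on tabs and spaces, ignores the trailing tab
    if len(tokens) != len(PROPERTY_NAMES):
        raise ValueError(f"expected {len(PROPERTY_NAMES)} fields, got {len(tokens)}")
    props = {"tag": tokens[0], "index": int(tokens[1])}
    for name, token in zip(PROPERTY_NAMES[2:], tokens[2:]):
        # Assumption: numeric fields may use '*^' in place of 'e' for exponents.
        props[name] = float(token.replace("*^", "e"))
    return props


# Property line of dsgdb9nsd_000167.xyz
line = ("gdb 167\t10.29348\t9.86616\t5.03764\t2.79\t35.01\t-0.2705\t0.0059\t0.2764"
        "\t264.8693\t0.059902\t-242.205627\t-242.202078\t-242.201134\t-242.231775\t11.923\t")
props = parse_property_line(line)
print(props["index"], props["mu"], props["G"])
```

Looking values up by name avoids relying on positional indices such as "second-to-last field" when a different property is needed.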

### Sample file: `dsgdb9nsd_000167.xyz`

In [11]:
import os
import json
from tqdm import tqdm

In [12]:
# Use a context manager so the file is closed after reading
with open("QM9/dsgdb9nsd.xyz/dsgdb9nsd_000167.xyz", "r") as f:
    print(f.read())

8
gdb 167	10.29348	9.86616	5.03764	2.79	35.01	-0.2705	0.0059	0.2764	264.8693	0.059902	-242.205627	-242.202078	-242.201134	-242.231775	11.923	
N	 0.0068637129	 1.3543886388	 0.0098030736	-0.173898
C	 1.2877308849	 1.7751224177	-0.0006303393	 0.035075
N	 2.1022228334	 0.7378995984	-0.014383726	-0.258364
C	 1.2386759646	-0.3171902112	-0.011553585	 0.044405
N	-0.0452811419	 0.0014633793	 0.0029412768	-0.21032
H	-0.8459890948	 1.8890884125	 0.0211917297	 0.283792
H	 1.569426308	 2.8172545796	 0.0022548317	 0.148551
H	 1.5661064428	-1.3457822952	-0.0204190516	 0.130759
557.754	684.8253	702.9204	854.5053	912.0274	960.2888	993.3504	1084.0514	1147.0893	1186.8149	1278.9991	1321.8153	1392.6012	1466.3028	1557.0241	3260.3955	3266.2964	3666.9581
N1C=NC=N1	[nH]1cncn1	
InChI=1S/C2H3N3/c1-3-2-5-4-1/h1-2H,(H,3,4,5)	InChI=1S/C2H3N3/c1-3-2-5-4-1/h1-2H,(H,3,4,5)



In [13]:
# Parse a single .xyz file
def parse_qm9_entry(file):
    folder = "QM9/dsgdb9nsd.xyz/"
    filename = os.path.join(folder, file)
    with open(filename, "r") as f:
        lines = f.readlines()

        # Number of atoms
        num_atoms = int(lines[0].strip())

        # Atoms
        atoms = []
        for i in range(2, 2+num_atoms):
            atom_info = lines[i].strip().split('\t')
            # Some values use the notation 1.0*^-5 instead of 1.0e-5; normalize before converting
            values = [float(v.replace("*^", "e")) for v in atom_info[1:5]]
            atoms.append({"elem": atom_info[0], "coord": tuple(values[:3]), "charge": values[3]})

        # Free energy G at 298.15 K (Hartree): second-to-last field of the property line
        free_E = float(lines[1].strip().split('\t')[-2])

    assert num_atoms == len(atoms)
    return {"file": file, "num_atoms": num_atoms, "atoms": atoms, "free_E": free_E}


In [14]:
print(parse_qm9_entry("dsgdb9nsd_000167.xyz"))

{'file': 'dsgdb9nsd_000167.xyz', 'num_atoms': 8, 'atoms': [{'elem': 'N', 'coord': (0.0068637129, 1.3543886388, 0.0098030736), 'charge': -0.173898}, {'elem': 'C', 'coord': (1.2877308849, 1.7751224177, -0.0006303393), 'charge': 0.035075}, {'elem': 'N', 'coord': (2.1022228334, 0.7378995984, -0.014383726), 'charge': -0.258364}, {'elem': 'C', 'coord': (1.2386759646, -0.3171902112, -0.011553585), 'charge': 0.044405}, {'elem': 'N', 'coord': (-0.0452811419, 0.0014633793, 0.0029412768), 'charge': -0.21032}, {'elem': 'H', 'coord': (-0.8459890948, 1.8890884125, 0.0211917297), 'charge': 0.283792}, {'elem': 'H', 'coord': (1.569426308, 2.8172545796, 0.0022548317), 'charge': 0.148551}, {'elem': 'H', 'coord': (1.5661064428, -1.3457822952, -0.0204190516), 'charge': 0.130759}], 'free_E': -242.231775}
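
The exponent handling in the parser can be isolated into a small helper. The following is a sketch, not the notebook's own code: `qm9_float` and `parse_atom_line` are hypothetical names, and the `1.0*^-5` input is a made-up value used only to show the conversion.

```python
def qm9_float(token):
    """Convert one QM9 numeric field to float, accepting '*^' exponents."""
    return float(token.replace("*^", "e"))


def parse_atom_line(line):
    """Parse one tab-separated atom line: element, x, y, z, Mulliken charge."""
    elem, x, y, z, charge = line.strip().split("\t")[:5]
    return {
        "elem": elem,
        "coord": (qm9_float(x), qm9_float(y), qm9_float(z)),
        "charge": qm9_float(charge),
    }


# An atom line copied from dsgdb9nsd_000167.xyz
print(parse_atom_line("H\t 1.569426308\t 2.8172545796\t 0.0022548317\t 0.148551"))
# A made-up value (not taken from the dataset) showing the exponent conversion
print(qm9_float("1.0*^-5"))
```

Normalizing every field up front means plain decimals and `*^` values go through the same code path, so no exception handling is needed for the conversion.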


In [15]:
# Parse all QM9 files
def parse_qm9():
    folder = "QM9/dsgdb9nsd.xyz/"

    dataset = []
    files = sorted(os.listdir(folder))  # os.listdir order is arbitrary; sort for reproducibility
    for file in tqdm(files):
        dataset.append(parse_qm9_entry(file))
    return dataset

In [16]:
qm9_dataset = parse_qm9()

100%|██████████| 133885/133885 [00:28<00:00, 4759.00it/s]


In [17]:
# Write the parsed data to a JSON file (coordinate tuples are stored as JSON arrays)
with open("qm9.json", "w") as outfile:
    json.dump(qm9_dataset, outfile, indent=4)
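
Reading the file back is the mirror image of the step above. The sketch below assumes the JSON has the structure produced by `parse_qm9_entry`; `load_qm9_json` is a hypothetical helper name. JSON has no tuple type, so coordinates come back as lists, and the helper converts them to tuples again.

```python
import json


def load_qm9_json(path="qm9.json"):
    """Load the parsed dataset and restore coordinate tuples (hypothetical helper)."""
    with open(path, "r") as infile:
        dataset = json.load(infile)
    for entry in dataset:
        for atom in entry["atoms"]:
            # JSON arrays are loaded as lists; convert back to (x, y, z) tuples
            atom["coord"] = tuple(atom["coord"])
    return dataset
```

Calling `load_qm9_json()` after the export step should give a list of dictionaries with the same `file`, `num_atoms`, `atoms`, and `free_E` keys as the in-memory dataset.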

## ANI-1