<div style="text-align:center;">
  <img src="custom/molssi_main_horizontal.png" style="display: block; margin: 0 auto; max-height:200px;">
</div>

# Using RDKit to Build a Dataset

<strong>Author(s):</strong> Jessica A. Nash, The Molecular Sciences Software Institute

<div class="alert alert-block alert-warning"> 
<h2>Day 1: Final Challenge</h2>

The `data` folder contains a file called `amino_acids.txt` with SMILES strings for the 20 naturally occurring amino acids. For today's final challenge, combine skills and concepts from MolSSI's Python Data and Scripting course with what you have learned today to build a dataset of molecules and write it to a file. Your goal is to create a comma-separated values (CSV) file with the columns `SMILES`, `num_heavy` (number of heavy atoms), `molecular_weight`, and one other molecular descriptor of your choice for each molecule in the file.

For this task, you will need to:

1. Read SMILES from a text file.
2. Make an RDKit molecule for each SMILES.
3. Get the number of heavy atoms, molecular weight, and one other descriptor of your choice.
4. Write a file with the data.

Save your file as `data/amino_acids_processed.csv`.

</div>

In [None]:
from rdkit import Chem
from rdkit.Chem import Descriptors

# This solution uses the topological polar surface area (TPSA) as the extra descriptor.

# Read SMILES strings from a file
with open("data/amino_acids.txt", "r") as f:
    amino_acid_smiles = f.readlines()

# Prepare lists for the molecules and descriptors, and start the CSV text with a header row
molecules = []
molecular_weight = []
number_heavy = []
output_string = "SMILES,num_heavy,molecular_weight,tpsa\n"

# Compute molecular descriptors
for smiles in amino_acid_smiles:
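    smiles = smiles.strip()
    if not smiles:
        continue  # skip blank lines
    mol = Chem.MolFromSmiles(smiles)
    # MolFromSmiles returns None when a SMILES string cannot be parsed
    if mol is not None: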
        molecules.append(mol)
        mol_weight = Descriptors.MolWt(mol)
        num_heavy = Descriptors.HeavyAtomCount(mol)
        molecular_weight.append(mol_weight)
        number_heavy.append(num_heavy)
        tpsa = Descriptors.TPSA(mol)
        output_string += f"{smiles},{num_heavy},{mol_weight},{tpsa}\n"

# Write the CSV text to the output file
with open("data/amino_acids_processed.csv", "w") as f:
    f.write(output_string)
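
After writing the file, it can help to read it back and confirm that it has the expected shape. The following is a minimal sketch using only Python's built-in `csv` module. The helper name `check_processed_csv` and the `EXPECTED_COLUMNS` list are hypothetical names introduced here for illustration, and the column names assume the TPSA-based solution above.

```python
import csv

# Assumed column names, matching the header written by the solution above.
EXPECTED_COLUMNS = ["SMILES", "num_heavy", "molecular_weight", "tpsa"]


def check_processed_csv(path, expected_columns=EXPECTED_COLUMNS):
    """Check the header and numeric fields of the CSV; return the number of data rows."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != expected_columns:
            raise ValueError(f"unexpected header: {reader.fieldnames}")
        n_rows = 0
        for row in reader:
            # Each conversion raises an exception if the field is missing or not numeric.
            int(row["num_heavy"])
            float(row["molecular_weight"])
            float(row["tpsa"])
            n_rows += 1
    return n_rows
```

Calling `check_processed_csv("data/amino_acids_processed.csv")` should return 20 if every SMILES string in the input file was parsed successfully. A smaller number suggests that some lines were skipped.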
