<div style="text-align:center;">
  <img src="custom/molssi_main_horizontal.png" style="display: block; margin: 0 auto; max-height:200px;">
</div>

# Using RDKit to Build a Dataset

<strong>Author(s):</strong> Jessica A. Nash, The Molecular Sciences Software Institute

<div class="alert alert-block alert-warning"> 
<h2>Day 1: Final Challenge</h2>

The file `amino_acids.txt` in the `data` folder contains SMILES strings for the 20 naturally occurring amino acids. For today's final challenge, combine skills and concepts from MolSSI's Python Data and Scripting course with what you have learned today to build a dataset of these molecules and write it to a file. Your goal is to create a comma-separated values (CSV) file with the columns `SMILES`, `num_heavy` (number of heavy atoms), `molecular_weight`, and one other molecular descriptor of your choice.

For this task, you will need to:

1. Read SMILES from a text file.
2. Make an RDKit molecule for each SMILES.
3. Get the number of heavy atoms, molecular weight, and one other descriptor of your choice.
4. Write a file with the data.

Save your file as `data/amino_acids_processed.csv`.

</div>

## Read SMILES from a text file.

In [None]:
import os
from rdkit import Chem
from rdkit.Chem import Descriptors
import pandas as pd

In [None]:
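# List the data directory to confirm the input file is present.
print(os.listdir('data'))
aa_file = os.path.join('data', 'amino_acids.txt')
with open(aa_file, 'r') as infile:
    # Strip trailing newlines and skip blank lines so each entry is a clean SMILES string.
    data = [line.strip() for line in infile if line.strip()]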

In [None]:
print(data)
type(data)
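
When reading a text file line by line, note that `readlines()` keeps the trailing newline character on every line, which would otherwise end up inside the `SMILES` column of the dataset. The following is a minimal sketch of the difference, assuming a temporary file with two example entries (glycine and alanine, without stereochemistry); the file name and its contents are illustrative and are not taken from `amino_acids.txt`.

```python
import os
import tempfile

# Hypothetical two-line SMILES file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmpdir:
    example_path = os.path.join(tmpdir, "example_smiles.txt")
    with open(example_path, "w") as f:
        f.write("NCC(=O)O\nCC(N)C(=O)O\n")

    # readlines() returns each line with its newline still attached.
    with open(example_path) as infile:
        raw_lines = infile.readlines()

# Remove surrounding whitespace and drop any empty lines.
clean_smiles = [line.strip() for line in raw_lines if line.strip()]

print(raw_lines)
print(clean_smiles)
```

Cleaning the strings once, right after reading, keeps every later step (molecule creation, the DataFrame, and the CSV file) free of stray whitespace.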

## Make an RDKit molecule for each SMILES.

In [None]:
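# Convert each SMILES string into an RDKit molecule object.
molecules = []
for item in data:
    amino_acid = Chem.MolFromSmiles(item)
    molecules.append(amino_acid)

# Start the dataset with the SMILES column; descriptor columns are added later.
amino_acid_df = pd.DataFrame({'SMILES': data})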
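
`Chem.MolFromSmiles` returns `None` when it cannot parse a string, and calling a method on `None` later raises an error that is hard to trace back to the bad input. One possible way to guard against this is sketched below; the example list and the `valid` and `failed` names are illustrative assumptions, not part of the challenge data.

```python
from rdkit import Chem

# Illustrative input: one valid SMILES (glycine) and one deliberately invalid string.
example_smiles = ["NCC(=O)O", "not_a_smiles"]

valid = []   # successfully parsed RDKit molecules
failed = []  # input strings RDKit could not parse

for smi in example_smiles:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        # RDKit signals a parse failure by returning None.
        failed.append(smi)
    else:
        valid.append(mol)

print(f"Parsed {len(valid)} molecule(s); failed on: {failed}")
```

For a curated file such as the amino acid list, every entry is expected to parse, so a check like this mostly serves as a safeguard when the same code is reused on other inputs.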

## Get the number of heavy atoms, molecular weight, and one other descriptor of your choice.

In [None]:
heavy_atoms = []
mw = []
num_val_e = []

for item in molecules:
    heavy_atoms.append(item.GetNumHeavyAtoms())
    molecular_weight = Descriptors.MolWt(item)
    mw.append(molecular_weight)
    elec = Descriptors.NumValenceElectrons(item)
    num_val_e.append(elec)

# Use the column names requested in the challenge description.
amino_acid_df['num_heavy'] = heavy_atoms
amino_acid_df['molecular_weight'] = mw
amino_acid_df['num_valence_electrons'] = num_val_e
amino_acid_df.head()
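
As an alternative to keeping one list per descriptor, the calculation can be wrapped in a small function that returns all values for a single molecule. This is only a sketch: `describe` is a hypothetical helper name, and the dictionary keys are chosen here to mirror the column names in the challenge description.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def describe(smiles):
    """Return a dict of descriptors for one SMILES string (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return {
        "SMILES": smiles,
        "num_heavy": mol.GetNumHeavyAtoms(),
        "molecular_weight": Descriptors.MolWt(mol),
        "num_valence_electrons": Descriptors.NumValenceElectrons(mol),
    }


# Glycine as a quick sanity check.
glycine = describe("NCC(=O)O")
print(glycine)
```

A list of such dictionaries can be passed directly to `pd.DataFrame`, which builds one row per molecule and one column per key.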

## Write a file with the data.

In [None]:
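# Save under the file name requested in the challenge description.
outputfile = os.path.join('data', 'amino_acids_processed.csv')
# index=False omits the DataFrame's row index so the file contains only the data columns.
amino_acid_df.to_csv(outputfile, index=False)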
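
By default, `DataFrame.to_csv` writes the row index as an extra, unnamed first column. The sketch below compares the header line produced with and without `index=False`, using a small illustrative DataFrame rather than the full amino acid dataset. Calling `to_csv` without a path returns the CSV text as a string, which makes the comparison easy to inspect.

```python
import pandas as pd

# Illustrative one-row dataset (glycine).
example_df = pd.DataFrame({"SMILES": ["NCC(=O)O"], "num_heavy": [5]})

# With no path argument, to_csv returns the CSV content as a string.
with_index = example_df.to_csv()
without_index = example_df.to_csv(index=False)

# Compare the header lines of the two outputs.
header_with_index = with_index.splitlines()[0]
header_without_index = without_index.splitlines()[0]

print(header_with_index)
print(header_without_index)
```

Passing `index=False` is a reasonable choice when the file should contain exactly the requested columns and nothing else.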