## Get SMILES from PubChem

Dask implementation for fetching `CanonicalSMILES` from PubChem through the `pubchempy` wrapper around the PubChem API. The end of the notebook has a second Dask-based implementation that uses `RDKit` to get InChIKeys from the SMILES. Dask is not strictly required for the InChIKey step, but that step is a much more elegant use of `dask.dataframe`.

In [1]:
import time
import pubchempy as pcp
from pubchempy import Compound, get_compounds
import pandas as pd
import numpy as np
import re
import copy

## Get SMILES from PubChem
> Update: Parallelized using Dask

In [2]:
df_100 = pd.read_csv('sample_chemical_names.csv', sep=',', header=0)

In [3]:
df_100.shape

(147, 1)

In [4]:
from dask.distributed import Client, progress
import dask.dataframe as dd
from dask import delayed, compute
from dask.multiprocessing import get
client = Client()
client

Client:  Scheduler: tcp://127.0.0.1:42158  Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 5  Cores: 20  Memory: 99.77 GB


In [5]:
def get_smile(cmpd_name):
    try:
        # delayed(f)(*args) only builds a lazy task; the PubChem request runs later, on compute()
        name = delayed(pcp.get_properties)(['CanonicalSMILES'], cmpd_name, 'name')
        time.sleep(5)  # wait 5 s per compound
        smile = name[0]['CanonicalSMILES']  # still a Delayed object at this point
    except Exception:
        smile = 'X'  # sentinel for a failed lookup
        print(cmpd_name, smile)
    return smile

def dask_smiles(df):
    df['CanonicalSMILES'] = df['CTA'].map(get_smile)
    return df  # map_partitions works here, but not with to_list() in the previous implementation

In [6]:
df_dask = dd.from_pandas(df_100, npartitions=10)

In [7]:
df_dask

                CTA
npartitions=10
0               object
15              ...
...             ...
135             ...
146             ...


In [8]:
%time ddf_out  = df_dask.map_partitions(dask_smiles)

CPU times: user 406 ms, sys: 77.6 ms, total: 484 ms
Wall time: 10 s


In [17]:
ddf_out.iloc[:,0]

Dask Series Structure:
npartitions=10
0      object
15        ...
        ...  
135       ...
146       ...
Name: CTA, dtype: object
Dask Name: getitem, 30 tasks

In [18]:
%time results = ddf_out.persist(scheduler=client).compute()

CPU times: user 5.02 s, sys: 946 ms, total: 5.96 s
Wall time: 2min 30s


In [19]:
type(results)

pandas.core.frame.DataFrame

In [21]:
results.loc[0]

CTA                                                     Cyclopropane
CanonicalSMILES    Delayed('getitem-87dcf881530403b53cd7fb8a654c8...
Name: 0, dtype: object

In [22]:
compute(results['CanonicalSMILES'].iloc[0])[0] #Compute result for one entry 

'C1CC1'
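
The `Delayed` entries seen in `results` follow from how `dask.delayed` works. Here is a minimal offline sketch, assuming a hypothetical `fake_get_properties` function that only mimics the list-of-dicts shape indexed in `get_smile`. It also shows that several `Delayed` objects can be passed to a single `compute` call, which may be an alternative to computing them one at a time.

```python
from dask import delayed, compute

def fake_get_properties(name):
    """Hypothetical offline stand-in that mimics a list-of-dicts result."""
    table = {"Cyclopropane": "C1CC1", "Ethylene": "C=C"}
    return [{"CanonicalSMILES": table[name]}]

# Calling a delayed function only records a task; nothing runs yet.
lazy_props = delayed(fake_get_properties)("Cyclopropane")

# Indexing a Delayed yields another Delayed (a 'getitem' task), not a string.
lazy_smiles = lazy_props[0]["CanonicalSMILES"]

# compute() returns a tuple with one result per argument.
(single,) = compute(lazy_smiles)

# Passing many Delayed objects in one call lets the scheduler handle them together.
names = ["Cyclopropane", "Ethylene"]
lazy_batch = [delayed(fake_get_properties)(n)[0]["CanonicalSMILES"] for n in names]
batch = compute(*lazy_batch)
print(single, batch)
```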

In [23]:
%time results['CanonicalSMILES'] = [value[0] for value in results['CanonicalSMILES'].map(compute)]

CPU times: user 5.78 s, sys: 875 ms, total: 6.66 s
Wall time: 1min 28s


In [24]:
type(results)

pandas.core.frame.DataFrame

In [25]:
results[results['CanonicalSMILES'] == 'X']

Unnamed: 0,CTA,CanonicalSMILES


In [26]:
results

Unnamed: 0,CTA,CanonicalSMILES
0,Cyclopropane,C1CC1
1,Ethylene,C=C
2,Methane,C
3,t-Butanol,CC(C)(C)O
4,ethane,CC
...,...,...
142,"Cyclohexane-1,3-dicarbaldehyde",C1CC(CC(C1)C=O)C=O
143,isobutene,CC(=C)C
144,propanal,CCC=O
145,methyl methacrylate,CC(=C)C(=O)OC


## Dask to get InChIKey

In my opinion, this implementation is a more elegant use of Dask: its `apply` is a wrapper around the conventional pandas `apply`. We also define the `meta` argument here, since Dask doesn't seem to recognise on its own the type of entries we expect in the final output.

More information about `meta` here: https://docs.dask.org/en/latest/dataframe-api.html

In [27]:
import rdkit
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import Draw

Chem.WrapLogs()
lg = rdkit.RDLogger.logger() 
lg.setLevel(rdkit.RDLogger.CRITICAL)

ModuleNotFoundError: No module named 'rdkit'

In [None]:
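def get_InChiKey(x):
    try:
        inchi_key = Chem.MolToInchiKey(Chem.MolFromSmiles(x))
    except Exception:
        inchi_key = 'X'  # sentinel for a failed parse or conversion
    return inchi_key

def dask_inchikey(df):
    # Partition-wise variant: add the InChIKey column from the SMILES column
    df['INCHI'] = df['CanonicalSMILES'].map(get_InChiKey)
    return df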

In [None]:
results_dask = dd.from_pandas(results, npartitions=10)

In [None]:
inchi = results_dask['CanonicalSMILES'].apply(lambda x: Chem.MolToInchiKey(Chem.MolFromSmiles(x)), meta=('inchi_key',str))

In [None]:
inchi

In [None]:
inchi.visualize()

`inchi` is a new Dask Series that holds the lazy task graph for computing the InChIKeys. We can compute it and store the result directly in the `results` dataframe as a new column. This is slightly different from the SMILES implementation above.

In [None]:
%time results['INCHI'] = compute(inchi, scheduler = client)[0]

In [None]:
results