## Get SMILES from PubChem

A Dask implementation that retrieves `CanonicalSMILES` from PubChem through the `pubchempy` API. At the end of the notebook there is a second Dask-based implementation that uses `RDKit` to derive the InChIKey from each SMILES string. Dask is not strictly required for the InChIKey step, but that step is a much more elegant use of `dask.dataframe` and `map_partitions`.

In [1]:
import time
import pubchempy as pcp
from pubchempy import Compound, get_compounds
import pandas as pd
import numpy as np
import re
import copy

## Get SMILES from PubChem
> Update: Parallelized using dask

In [2]:
df_100 = pd.read_csv('sample_chemical_names.csv', sep=',', header=0)

In [3]:
from dask.distributed import Client, progress
import dask.dataframe as dd
from dask import delayed, compute
from dask.multiprocessing import get
client = Client()
client

ImportError: Dask's distributed scheduler is not installed.

Please either conda or pip install dask distributed:

  conda install dask distributed          # either conda install
  python -m pip install "dask[distributed]" --upgrade  # or python -m pip install

In [None]:
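def get_name(cta_name):
    """Build a lazy PubChem lookup for one chemical name."""
    try:
        # delayed(f)(*args) only records the call; pcp.get_properties runs
        # later, when compute() is called on the returned object.
        name = delayed(pcp.get_properties)(['CanonicalSMILES'], cta_name, 'name')
        time.sleep(5)
        # Indexing a Delayed object returns another Delayed, not a string.
        smile = name[0]['CanonicalSMILES']
    except Exception:
        # Fallback marker for names that could not be processed.
        smile = 'X'
        print(cta_name, smile)
    return smile

def dask_smiles(df):
    # Runs once per partition; df is a regular pandas DataFrame.
    df['CanonicalSMILES'] = df['CTA'].map(get_name)
    return df  # map_partitions works here, but not with to_list() in the previous implementation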
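Because `delayed` defers the PubChem call, the `try`/`except` in `get_name` does not see a failed lookup when the function itself runs; any error would only surface later, at `compute()` time. As a minimal sketch of an eager alternative, the helper below calls the lookup immediately and returns the `'X'` marker on failure. The names `safe_lookup` and `fake_lookup` are hypothetical, and `fake_lookup` is an offline stand-in that only mimics the list-of-dicts shape returned by `pcp.get_properties`.

```python
import time

def safe_lookup(name, lookup, pause=0.0, sentinel="X"):
    """Call lookup(name) right away; return sentinel if anything goes wrong."""
    try:
        records = lookup(name)
        # An empty result list raises IndexError and falls through to the sentinel.
        smiles = records[0]["CanonicalSMILES"]
    except Exception:
        smiles = sentinel
    time.sleep(pause)  # optional pause between consecutive requests
    return smiles

def fake_lookup(name):
    # Hypothetical offline stand-in for pcp.get_properties(...)
    table = {"water": [{"CanonicalSMILES": "O"}]}
    return table.get(name, [])

found = safe_lookup("water", fake_lookup)
missing = safe_lookup("not-a-chemical", fake_lookup)
```

With a helper like this, filtering on `'X'` afterwards would pick up the names that failed, since the marker is assigned at the moment the lookup actually runs.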

In [None]:
df_dask = dd.from_pandas(df_100, npartitions=10)

In [None]:
df_dask

In [None]:
df_dask.visualize()

In [None]:
%time ddf_out  = df_dask.map_partitions(dask_smiles)

In [None]:
ddf_out.visualize()

In [None]:
%time results = ddf_out.persist(scheduler=client).compute()

In [None]:
compute(results['CanonicalSMILES'].iloc[0])[0]

In [None]:
%time results['CanonicalSMILES'] = [value[0] for value in results['CanonicalSMILES'].map(compute)]

In [None]:
type(results)

In [None]:
results[results['CanonicalSMILES'] == 'X']

## Dask to get InChIKey

In my opinion, this implementation is a more elegant use of Dask's `apply`, which wraps the conventional pandas `apply`. Here we also pass the `meta` argument, since Dask does not seem to recognise on its own the type of the entries we expect in the final output.

More information about `meta` here: https://docs.dask.org/en/latest/dataframe-api.html

In [None]:
import rdkit
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import Draw

Chem.WrapLogs()
lg = rdkit.RDLogger.logger() 
lg.setLevel(rdkit.RDLogger.CRITICAL)

In [None]:
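def get_InChiKey(x):
    # Return the InChIKey for a SMILES string, or 'X' if the conversion fails.
    try:
        inchi_key = Chem.MolToInchiKey(Chem.MolFromSmiles(x))
    except Exception:
        inchi_key = 'X'
    return inchi_key

def dask_inchikey(df):
    # Partition-wise variant: add an INCHI column from the CanonicalSMILES column.
    df['INCHI'] = df['CanonicalSMILES'].map(get_InChiKey)
    return df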

In [None]:
results_dask = dd.from_pandas(results, npartitions=10)

In [None]:
inchi = results_dask['CanonicalSMILES'].apply(get_InChiKey, meta=('inchi_key', str))

In [None]:
inchi

In [None]:
inchi.visualize()

`inchi` is a new Dask series that holds the lazy task graph for computing the InChIKeys. We can compute it and assign the result directly to the `results` dataframe as a new column. This differs slightly from the SMILES implementation above.

In [None]:
%time results['INCHI'] = compute(inchi, scheduler = client)[0]

In [None]:
results