### Exploring Fingerprints & RDKit

In [1]:
#conda update -n base -c defaults conda


In [2]:
from rdkit import Chem

In [3]:
trial_SMILES = 'CC1(C)C2CCC(C2)C1=C'  # (-)-camphene

I decided to explore chemical fingerprints and/or molecular descriptors, using their output as the input to an ML algorithm or a neural network regressor. See https://stats.stackexchange.com/questions/56010/predicting-chemical-property-boiling-point-from-a-smiles-string

##### A very common fingerprinting technique used seems to be ECFP (Extended-connectivity Fingerprints).

For each atom in the molecule, an initial integer identifier is assigned. Six properties are used to compute it:
1. Number of neighbouring "heavy" (non-hydrogen) atoms
2. Valence minus the number of hydrogens
3. The atomic number
4. The atomic mass
5. The atomic charge
6. The number of attached hydrogens
(Taken directly from the ECFP paper: https://pubs.acs.org/doi/pdf/10.1021/ci100050t)

Then, each atom's identifier is combined with the identifiers of its neighbours and the orders of the connecting bonds, and the result is put through a hashing function. The hashed value replaces the atom's original identifier.

This hashing step is repeated a set number of times.

Essentially, this encodes the substructures present in the molecule as an array (which is useful for machine learning algorithms). With more iterations, larger substructures and more detail of the molecule are encoded (at a higher computational cost).
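To make the iterative update concrete, here is a toy sketch in plain Python. It is not RDKit's implementation: the heavy-atom graph of propene is written by hand, `stable_hash` is a stand-in built on `hashlib`, and the removal of structurally duplicate substructures that the real algorithm performs is skipped. All names are my own.

```python
import hashlib

def stable_hash(values):
    """Deterministic 32-bit hash of a sequence of ints (stand-in for the ECFP hash)."""
    digest = hashlib.sha256(repr(tuple(values)).encode()).digest()
    return int.from_bytes(digest[:4], "big")

# Hand-written heavy-atom graph of propene, C=C-C (hypothetical toy input).
# neighbours[i] lists (neighbour index, bond order) pairs for atom i.
neighbours = {
    0: [(1, 2)],
    1: [(0, 2), (2, 1)],
    2: [(1, 1)],
}
# The six initial properties per atom: (heavy neighbours, valence minus H,
# atomic number, atomic mass, charge, attached H).
invariants = {
    0: (1, 2, 6, 12, 0, 2),
    1: (2, 3, 6, 12, 0, 1),
    2: (1, 1, 6, 12, 0, 3),
}

def toy_ecfp(neighbours, invariants, iterations):
    """Collect all atom identifiers seen over the given number of iterations."""
    ids = {atom: stable_hash(inv) for atom, inv in invariants.items()}
    features = set(ids.values())
    for layer in range(1, iterations + 1):
        new_ids = {}
        for atom, nbrs in neighbours.items():
            # Sort so the result does not depend on neighbour ordering.
            env = sorted((order, ids[n]) for n, order in nbrs)
            flat = [layer, ids[atom]] + [v for pair in env for v in pair]
            new_ids[atom] = stable_hash(flat)
        ids = new_ids
        features.update(ids.values())
    return features

toy_fp0 = toy_ecfp(neighbours, invariants, 0)
toy_fp1 = toy_ecfp(neighbours, invariants, 1)
print(len(toy_fp0), len(toy_fp1))
```

Each extra iteration only adds identifiers, so the 0-iteration feature set is always contained in the 1-iteration one.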

It is common practice to use 1-4 iterations depending on the application: 1-2 for comparisons and 3-4 for predictions based on fingerprinting. Note that the ECFP name carries the diameter, i.e. twice the number of iterations: 1 iteration gives ECFP2, 2 iterations give ECFP4, and so on.
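The naming rule is simple enough to write down as a one-liner. The helper name `ecfp_name` is my own; the iteration count here corresponds to the radius argument passed to RDKit's Morgan fingerprint functions.

```python
def ecfp_name(iterations):
    """ECFP names carry the diameter, i.e. twice the iteration count (radius)."""
    return f"ECFP{2 * iterations}"

for iterations in range(5):
    print(iterations, ecfp_name(iterations))
```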


In [4]:
from rdkit.Chem import AllChem

In [5]:
m1 = Chem.MolFromSmiles(trial_SMILES)
fp1 = AllChem.GetMorganFingerprint(m1,1) #Radius 1 (ECFP2-like), returned as sparse counts

The fingerprint method returns a UIntSparseIntVect. Checking the available methods:

In [6]:
dir(fp1)

['GetLength',
 'GetNonzeroElements',
 'GetTotalVal',
 'ToBinary',
 'ToList',
 'UpdateFromSequence',
 '__add__',
 '__and__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getinitargs__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__idiv__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__instance_size__',
 '__isub__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__or__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__safe_for_unpickling__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__weakref__']

In [7]:
fp1.GetNonzeroElements() #Get the hashed integers in a dictionary

{440332323: 1,
 517457164: 1,
 1861965050: 2,
 2117068077: 2,
 2246728737: 2,
 2246997334: 1,
 2663617800: 1,
 2968968094: 3,
 2975316496: 1,
 2976033787: 2,
 2976816164: 1,
 3217380708: 1,
 3482873808: 1,
 4273842364: 1}
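The dictionary maps each hashed identifier to the number of times that environment occurs in the molecule. As a quick sanity check, the counts can be summed in plain Python on the output copied from above (no RDKit needed). The interpretation given after the block is my own reading of the numbers.

```python
# Output of fp1.GetNonzeroElements() copied from the cell above.
nonzero = {
    440332323: 1, 517457164: 1, 1861965050: 2, 2117068077: 2,
    2246728737: 2, 2246997334: 1, 2663617800: 1, 2968968094: 3,
    2975316496: 1, 2976033787: 2, 2976816164: 1, 3217380708: 1,
    3482873808: 1, 4273842364: 1,
}

n_distinct = len(nonzero)        # distinct hashed environments
n_total = sum(nonzero.values())  # environments counted with multiplicity

print(n_distinct, n_total)
```

This gives 14 distinct identifiers and a total count of 20, which is consistent with each of camphene's 10 heavy atoms contributing one radius-0 and one radius-1 environment.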

A bit vector is more commonly used.

In [8]:
fp1_bit = AllChem.GetMorganFingerprintAsBitVect(m1,3,nBits=1024)

In [9]:
dir(fp1_bit)

['FromBase64',
 'GetBit',
 'GetNumBits',
 'GetNumOffBits',
 'GetNumOnBits',
 'GetOnBits',
 'SetBit',
 'SetBitsFromList',
 'ToBase64',
 'ToBinary',
 'ToBitString',
 'ToList',
 'UnSetBit',
 'UnSetBitsFromList',
 '__add__',
 '__and__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getinitargs__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__init__',
 '__init_subclass__',
 '__instance_size__',
 '__invert__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__or__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__safe_for_unpickling__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '__xor__']

In [10]:
fp1_bit.ToBitString()

'000000000000000000000000000000000101100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000001000000000000000100010000000000000000000000000000000000000000010000001000010100010000000000000000000000000000000010000000000000000000000000000000000000000100000000000001000000000000000000000001000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000001000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000001000000000000000000010000000000000000000000

In [11]:
len(fp1_bit)

1024

In [12]:
(list(fp1.GetNonzeroElements().keys())[0])%1024 #Fold the first hashed integer into the 1024-bit range

35

In [13]:
fp1_bit.ToBitString()[956]

'1'

How this works: the algorithm hashes every substructure of radius up to 3 around each atom, and the resulting hashed values are mapped onto a 1024-bit representation. For example, for the first carbon of trial_SMILES, the substructure with radius 3 has hashed value 199325628. 199325628 is congruent to 956 mod 1024, hence the bit at index 956 (counting from 0) of the representation is 1.
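The folding arithmetic can be checked without RDKit. This sketch assumes the mapping is a plain modulo, as the two checks above suggest; `fold` and `N_BITS` are names I made up for illustration.

```python
N_BITS = 1024

def fold(identifier, n_bits=N_BITS):
    """Map a hashed identifier onto a bit position (assumed: plain modulo)."""
    return identifier % n_bits

radius3_id = 199325628  # hashed value quoted in the text
first_key = 440332323   # first key of fp1.GetNonzeroElements()

print(fold(radius3_id))  # 956
print(fold(first_key))   # 35

# Building a bit string from a handful of identifiers works the same way.
bits = ['0'] * N_BITS
for identifier in (radius3_id, first_key):
    bits[fold(identifier)] = '1'
bit_string = ''.join(bits)
print(bit_string[956], bit_string[35])
```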

This means that fewer bits lead to more bit collisions: a 1024-bit representation can encode at most 1024 distinct substructures without a collision.
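A rough feel for the collision rate comes from the usual occupancy estimate. The sketch below assumes the hashed values spread uniformly over the bits, which is an idealisation; `expected_on_bits` is a hypothetical helper, and 14 is the number of distinct identifiers camphene produced at radius 1.

```python
def expected_on_bits(n_features, n_bits):
    """Expected number of set bits if n_features hash uniformly into n_bits."""
    return n_bits * (1 - (1 - 1 / n_bits) ** n_features)

for n_bits in (32, 1024):
    on = expected_on_bits(14, n_bits)
    # Last column: features expected to be lost to collisions.
    print(n_bits, round(on, 2), round(14 - on, 2))
```

The gap between the number of features and the expected number of set bits grows quickly as the bit vector shrinks.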

#### Now get the fingerprint bit representation for all chemicals (32 bits, 0 iterations by default)

In [14]:
import numpy as np
import pandas as pd

In [15]:
data_SMILES = pd.read_csv('boiling_data_smiles.csv')

In [16]:
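#Define a function to make implementation easier
def get_ECFP_bit(SMILES,iterations=0,nbit=32):
    #Return the ECFP bit string for a SMILES string, or 'FAILED' if RDKit cannot process it
    try:
        m = Chem.MolFromSmiles(SMILES) #Returns None when the SMILES cannot be parsed
        ecfp_bit = AllChem.GetMorganFingerprintAsBitVect(m,iterations,nBits=nbit)
        return ecfp_bit.ToBitString()
    except Exception: #A bare except would also swallow KeyboardInterrupt
        return 'FAILED'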

In [17]:
#First Remove the chemicals without a SMILES string
data_SMILES_cleaned = data_SMILES[data_SMILES['SMILES'] != '-']
data_SMILES_cleaned.reset_index(drop=True, inplace=True)

In [18]:
data_SMILES_cleaned

Unnamed: 0,name,molweight,critical temperature (K),acentric factor,boiling point (K),SMILES
0,(+)-camphene,136.23704,638.00,0.2960,432.65,CC1(C)C2CCC(C2)C1=C
1,(-)-a-pinene,136.23704,647.00,0.3410,429.35,CC1=CC[C@H]2C[C@@H]1C2(C)C
2,(-)-camphene,136.23704,638.00,0.2960,439.95,CC1(C)C2CCC(C2)C1=C
3,"(1,1-dimethylbutyl)benzene",162.27492,697.15,0.4370,478.65,CCCC(C)(C)c1ccccc1
4,(1-butylhexadecyl)benzene,358.65124,851.65,0.7590,693.15,CCCCCCCCCCCCCCCC(CCCC)c1ccccc1
...,...,...,...,...,...,...
5903,vinylacetylene,52.07576,454.00,0.1180,278.25,C/C=C/C=C
5904,vinylcyclohexene,108.18328,599.00,0.3290,401.00,CC(C)/C=C/Cl
5905,water,18.01528,647.13,0.3449,373.15,C/C=C/CF
5906,zirconium chloride,233.03480,778.00,0.2980,604.15,C\C(c1ccccc1)=C(\C)c2ccccc2


In [19]:
#Now add the bit representation of the ECFP
data_SMILES_cleaned['ECFP_Bits'] = data_SMILES_cleaned['SMILES'].apply(get_ECFP_bit)

[23:43:14] Explicit valence for atom # 1 Cl, 4, is greater than permitted
[23:43:14] Explicit valence for atom # 1 Cl, 2, is greater than permitted
[23:43:14] Explicit valence for atom # 1 Cl, 5, is greater than permitted
[23:43:14] Explicit valence for atom # 1 I, 7, is greater than permitted
[23:43:14] Explicit valence for atom # 1 C, 5, is greater than permitted
[23:43:14] Explicit valence for atom # 1 Cl, 7, is greater than permitted
[23:43:14] SMILES Parse Error: syntax error while parsing: Cl|[V](|Cl)(|Cl)=O
[23:43:14] SMILES Parse Error: Failed parsing SMILES 'Cl|[V](|Cl)(|Cl)=O' for input: 'Cl|[V](|Cl)(|Cl)=O'
[23:43:14] Explicit valence for atom # 1 Cl, 3, is greater than permitted
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_SMILES_cleaned['ECFP_Bits'] = 

In [20]:
data_SMILES_cleaned[data_SMILES_cleaned["ECFP_Bits"] == 'FAILED']

Unnamed: 0,name,molweight,critical temperature (K),acentric factor,boiling point (K),SMILES,ECFP_Bits
4137,butylcyclopentane,126.24192,621.0,0.372,429.76,O=[Cl]=O,FAILED
4140,butyric anhydride,158.19736,644.0,0.655,470.93,O=[Cl],FAILED
4141,butyronitrile,69.10632,582.35,0.371,390.75,F[Cl](F)(F)(F)F,FAILED
4859,hexafluoropropylene,150.023419,368.0,0.205,243.55,F[I](F)(F)(F)(F)(F)F,FAILED
4862,hexamethyldewarbenzene,162.27492,697.15,0.437,440.0,[Fe+5].[C--]#[O+].[C--]#[O+].[C--]#[O+].[C--]#...,FAILED
5344,p-diisopropylbenzene hydroperoxide,194.27372,810.0,0.928,616.0,F[Cl](=O)(=O)=O,FAILED
5781,"trans-2-methylcyclohexanol, (±)",114.18756,635.0,0.685,440.65,Cl|[V](|Cl)(|Cl)=O,FAILED
5827,tridecanoic acid,214.34824,754.0,0.904,585.25,F[Cl](F)F,FAILED


Some chemicals failed during the ECFP step because RDKit could not parse or sanitize their SMILES strings. As there are only 8 such cases, we can safely remove them.

In [21]:
data_SMILES_cleaned = data_SMILES_cleaned[data_SMILES_cleaned["ECFP_Bits"] != 'FAILED']

In [22]:
data_SMILES_cleaned.reset_index(drop=True, inplace=True)

In [23]:
data_SMILES_cleaned

Unnamed: 0,name,molweight,critical temperature (K),acentric factor,boiling point (K),SMILES,ECFP_Bits
0,(+)-camphene,136.23704,638.00,0.2960,432.65,CC1(C)C2CCC(C2)C1=C,01001000000000000000001000010010
1,(-)-a-pinene,136.23704,647.00,0.3410,429.35,CC1=CC[C@H]2C[C@@H]1C2(C)C,01001000000000000100000000010010
2,(-)-camphene,136.23704,638.00,0.2960,439.95,CC1(C)C2CCC(C2)C1=C,01001000000000000000001000010010
3,"(1,1-dimethylbutyl)benzene",162.27492,697.15,0.4370,478.65,CCCC(C)(C)c1ccccc1,01001000000000001110000000000000
4,(1-butylhexadecyl)benzene,358.65124,851.65,0.7590,693.15,CCCCCCCCCCCCCCCC(CCCC)c1ccccc1,01001000000000001100000000000000
...,...,...,...,...,...,...,...
5895,vinylacetylene,52.07576,454.00,0.1180,278.25,C/C=C/C=C,01000000000000000000001000000000
5896,vinylcyclohexene,108.18328,599.00,0.3290,401.00,CC(C)/C=C/Cl,01000000000000000001001000000000
5897,water,18.01528,647.13,0.3449,373.15,C/C=C/CF,01000000100000001000001000000000
5898,zirconium chloride,233.03480,778.00,0.2980,604.15,C\C(c1ccccc1)=C(\C)c2ccccc2,01001001000000000100000000000000


Now the dataset is ready!

In [24]:
data_SMILES_cleaned.to_csv('final_data_0_iter_32.csv',index=False)
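Before feeding the fingerprints to a model, each bit string has to become a numeric feature vector. A minimal sketch in plain Python, using two `ECFP_Bits` values copied from the table above (`bits_to_vector` is a hypothetical helper):

```python
def bits_to_vector(bit_string):
    """Turn a '0'/'1' string into a list of ints, one feature per bit."""
    return [int(ch) for ch in bit_string]

# Two ECFP_Bits values copied from the table above.
camphene = bits_to_vector('01001000000000000000001000010010')
dimethylbutylbenzene = bits_to_vector('01001000000000001110000000000000')

X = [camphene, dimethylbutylbenzene]  # rows = molecules, columns = bits
print(len(X), len(X[0]), sum(camphene))
```

When reloading the CSV, it is probably safest to read the column as text, e.g. `pd.read_csv(..., dtype={'ECFP_Bits': str})`, so that the leading zeros of the bit strings are not lost to numeric parsing.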