<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Importing-data" data-toc-modified-id="1.-Importing-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>1. Importing data</a></span><ul class="toc-item"><li><span><a href="#1.1.-Complete-FreeSolv-database" data-toc-modified-id="1.1.-Complete-FreeSolv-database-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>1.1. Complete FreeSolv database</a></span></li><li><span><a href="#1.2.-SAMPL4_Guthrie-entires-extracted-from-FreeSolv" data-toc-modified-id="1.2.-SAMPL4_Guthrie-entires-extracted-from-FreeSolv-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>1.2. SAMPL4_Guthrie entries extracted from FreeSolv</a></span></li><li><span><a href="#1.3.-Original-SAMPL4_Gurthrie-extracted-from-SI-in-ref.-[1]" data-toc-modified-id="1.3.-Original-SAMPL4_Gurthrie-extracted-from-SI-in-ref.-[1]-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>1.3. Original SAMPL4_Guthrie extracted from SI in ref. [1]</a></span></li></ul></li><li><span><a href="#2.-Determining-discrepancy" data-toc-modified-id="2.-Determining-discrepancy-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>2. Determining discrepancy</a></span><ul class="toc-item"><li><span><a href="#The-six-ligands-missing-from-FreeSolv-that-are-not-accounted-for." data-toc-modified-id="The-six-ligands-missing-from-FreeSolv-that-are-not-accounted-for.-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>The six ligands missing from FreeSolv that are not accounted for.</a></span></li></ul></li></ul></div>

# SAMPL4_Guthrie FreeSolv Check
Determine discrepancies between the SAMPL4_Guthrie data originally published in ref. [1] and the FreeSolv database. The SAMPL4 challenge publication can be found at [2].

[1] J. Comput. Aided Mol. Des., 2014, 28, 151–168, DOI 10.1007/s10822-014-9738-y.

[2] J. Comput. Aided Mol. Des., 2014, 28, 135–150, DOI 10.1007/s10822-014-9718-2.

## 1. Importing data

### 1.1. Complete FreeSolv database

In [1]:
import pandas as pd

data_loc = 'database.txt'
df1 = pd.read_csv(data_loc, sep='; ', engine='python')
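
A quick sanity check after loading can catch separator or header problems early. The sketch below parses a made-up three-row excerpt in a `; `-separated layout, strips whitespace from the headers, and counts entries per experimental reference. The shortened column names, the `mobley_0000000` ID, and the `other_reference` label are hypothetical and are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical excerpt in a '; '-separated layout (not the real database.txt).
sample = io.StringIO(
    "compound id; SMILES; experimental reference\n"
    "mobley_3211679; C1CCC=CC1; SAMPL4_Guthrie\n"
    "mobley_8883511; Cc1ccccc1C=O; SAMPL4_Guthrie\n"
    "mobley_0000000; CCO; other_reference\n"
)

# A regex separator tolerates any amount of whitespace after each semicolon.
toy = pd.read_csv(sample, sep=r";\s*", engine="python")

# Strip stray whitespace from headers so later column lookups match exactly.
toy.columns = toy.columns.str.strip()

# Count how many entries cite each experimental reference.
ref_counts = toy["experimental reference"].value_counts()
print(ref_counts)
```

Running the same `value_counts()` call on the real `df1` would show up front how many rows carry the `SAMPL4_Guthrie` reference.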

### 1.2. SAMPL4_Guthrie entries extracted from FreeSolv

In [2]:
# SAMPL4_Guthrie experimental reference in FreeSolv.
ref = 'SAMPL4_Guthrie'

# Experimental reference column name.
exp_ref_col = 'experimental reference (original or paper this value was taken from)'

# Boolean mask selecting all SAMPL4_Guthrie entries.
SAMPL4_Guthrie = df1[df1[exp_ref_col] == ref]

# Check the number of ligands found is correct.
print('Number of SAMPL4_Guthrie entries in FreeSolv: {}'.format(len(SAMPL4_Guthrie)))

# DataFrame containing only SAMPL4_Guthrie ligands
df2 = SAMPL4_Guthrie.copy()

# Columns to drop
dropped_columns = ['Mobley group calculated value (GAFF) (kcal/mol)',
                   'calculated uncertainty (kcal/mol)',
                   'calculated reference',
                   'experimental reference (original or paper this value was taken from)',
                   'text notes.']

# DataFrame not containing dropped columns for clarity
df2 = df2.drop(dropped_columns, axis=1)

# Rename columns to match Guthrie's SI seen below in section 1.3
df2.columns = ['ID', 'SMILES', 'name', 'Ghyd', 'uncertainty']
df2 = df2.reset_index(drop=True)

# Calculate MW
from rdkit.Chem import MolFromSmiles, Descriptors

# Number of decimal places used when rounding MW
MW_dp = 2

MW2 = []
for smiles in df2.loc[:, 'SMILES']:
    mol = MolFromSmiles(smiles)
    MW2.append(round(Descriptors.MolWt(mol), MW_dp))

df2.insert(3, 'MW', MW2)

# Order according to MW
df2 = df2.sort_values('MW')
df2 = df2.reset_index(drop=True)

df2

Number of SAMPL4_Guthrie entries in FreeSolv: 41


| | ID | SMILES | name | MW | Ghyd | uncertainty |
|---|---|---|---|---|---|---|
| 0 | mobley_3211679 | `C1CCC=CC1` | cyclohexene | 82.15 | 0.14 | 0.1 |
| 1 | mobley_8883511 | `Cc1ccccc1C=O` | 2-methylbenzaldehyde | 120.15 | -3.93 | 0.1 |
| 2 | mobley_3040612 | `CCc1ccccc1C` | 1-ethyl-2-methylbenzene | 120.19 | -0.85 | 0.1 |
| 3 | mobley_2850833 | `c1ccc(c(c1)C=O)O` | 2-hydroxybenzaldehyde | 122.12 | -4.68 | 0.1 |
| 4 | mobley_2126135 | `CCc1ccccc1O` | 2-ethylphenol | 122.17 | -5.66 | 0.1 |
| 5 | mobley_3515580 | `COc1ccccc1O` | 2-methoxyphenol | 124.14 | -5.94 | 0.1 |
| 6 | mobley_7417968 | `CCOCCOC(=O)C` | 2-ethoxyethyl acetate | 132.16 | -5.31 | 0.1 |
| 7 | mobley_7913234 | `CCCCOC[C@H](C)O` | 1-butoxy-2-propanol | 132.2 | -5.73 | 0.15 |
| 8 | mobley_2613240 | `COc1ccccc1OC` | 1,2-dimethoxybenzene | 138.17 | -5.33 | 0.1 |
| 9 | mobley_5917842 | `Cc1ccc(c(c1)OC)O` | 4-methyl-2-methoxyphenol | 138.17 | -5.8 | 0.1 |
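
`MolFromSmiles` returns `None` when it cannot parse a string, and passing `None` on to `Descriptors.MolWt` raises an error partway through the loop. A minimal sketch of a guarded helper follows; the function name `mol_weight` and the choice to return NaN for unparsable input are assumptions for illustration, not part of the original notebook.

```python
import math

from rdkit.Chem import MolFromSmiles, Descriptors


def mol_weight(smiles, ndigits=2):
    """Return the average molecular weight, or NaN if the SMILES cannot be parsed."""
    mol = MolFromSmiles(smiles)
    if mol is None:
        # RDKit signals a parse failure by returning None rather than raising.
        return math.nan
    return round(Descriptors.MolWt(mol), ndigits)


# Cyclohexene, as listed in the table above.
print(mol_weight("C1CCC=CC1"))

# An unclosed ring is not valid SMILES, so the helper falls back to NaN.
print(math.isnan(mol_weight("C1CC")))
```

With such a helper, the MW column could be built in one step, for example `df2['SMILES'].apply(mol_weight)`, and any NaN values would flag rows that need attention.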


### 1.3. Original SAMPL4_Guthrie extracted from SI in ref. [1]

In [3]:
# DataFrame containing Guthrie SI
df3 = pd.read_csv('sampl4_guthrie.csv')

# Recalculate MW with RDKit, rounded to 2 decimal places, to match df2
MW3 = []
for smiles in df3.loc[:, 'SMILES']:
    mol = MolFromSmiles(smiles)
    MW3.append(round(Descriptors.MolWt(mol), MW_dp))
df3.loc[:, 'MW'] = MW3

# Order according to MW
df3 = df3.sort_values('MW')
df3 = df3.reset_index(drop=True)

print('Number of entries in Guthrie SI: {}'.format(len(df3)))

df3

Number of entries in Guthrie SI: 47


| | ID | SMILES | name | MW | Ghyd | uncertainty |
|---|---|---|---|---|---|---|
| 0 | SAMPL4_43 | `C1CCC=CC1` | cyclohexene | 82.15 | 0.14 | 0.1 |
| 1 | SAMPL4_41 | `N1CCCCC1` | piperidine | 85.15 | -5.05 | 0.1 |
| 2 | SAMPL4_42 | `O1CCCCC1` | tetrahydropyran | 86.13 | -3.13 | 0.1 |
| 3 | SAMPL4_44 | `O1CCOCC1` | 1,4-dioxane | 88.11 | -5.08 | 0.1 |
| 4 | SAMPL4_38 | `O=Cc1ccccc1C` | 2-methylbenzaldehyde | 120.15 | -3.93 | 0.1 |
| 5 | SAMPL4_39 | `c1cccc(C)c1CC` | 1-ethyl-2-methylbenzene | 120.19 | -0.85 | 0.1 |
| 6 | SAMPL4_35 | `Oc1ccccc1C=O` | 2-hydroxybenzaldehyde | 122.12 | -4.68 | 0.1 |
| 7 | SAMPL4_36 | `Oc1ccccc1CC` | 2-ethylphenol | 122.17 | -5.66 | 0.1 |
| 8 | SAMPL4_37 | `COc1c(cccc1)O` | 2-methoxyphenol | 124.14 | -5.94 | 0.1 |
| 9 | SAMPL4_26 | `C(C)OCCOC(=O)C` | 2-ethoxyethyl acetate | 132.16 | -5.31 | 0.1 |


## 2. Determining discrepancy

In [4]:
# df2 = SAMPL4_Guthrie extracted from FreeSolv
# df3 = SAMPL4_Guthrie extracted from original SI
discrepancy_count = len(df3) - len(df2)
print('Discrepancy between SAMPL4_Guthrie in FreeSolv and SI is:', discrepancy_count)

Discrepancy between SAMPL4_Guthrie in FreeSolv and SI is: 6


In [5]:
df2.head()

| | ID | SMILES | name | MW | Ghyd | uncertainty |
|---|---|---|---|---|---|---|
| 0 | mobley_3211679 | `C1CCC=CC1` | cyclohexene | 82.15 | 0.14 | 0.1 |
| 1 | mobley_8883511 | `Cc1ccccc1C=O` | 2-methylbenzaldehyde | 120.15 | -3.93 | 0.1 |
| 2 | mobley_3040612 | `CCc1ccccc1C` | 1-ethyl-2-methylbenzene | 120.19 | -0.85 | 0.1 |
| 3 | mobley_2850833 | `c1ccc(c(c1)C=O)O` | 2-hydroxybenzaldehyde | 122.12 | -4.68 | 0.1 |
| 4 | mobley_2126135 | `CCc1ccccc1O` | 2-ethylphenol | 122.17 | -5.66 | 0.1 |


In [6]:
df3.head()

| | ID | SMILES | name | MW | Ghyd | uncertainty |
|---|---|---|---|---|---|---|
| 0 | SAMPL4_43 | `C1CCC=CC1` | cyclohexene | 82.15 | 0.14 | 0.1 |
| 1 | SAMPL4_41 | `N1CCCCC1` | piperidine | 85.15 | -5.05 | 0.1 |
| 2 | SAMPL4_42 | `O1CCCCC1` | tetrahydropyran | 86.13 | -3.13 | 0.1 |
| 3 | SAMPL4_44 | `O1CCOCC1` | 1,4-dioxane | 88.11 | -5.08 | 0.1 |
| 4 | SAMPL4_38 | `O=Cc1ccccc1C` | 2-methylbenzaldehyde | 120.15 | -3.93 | 0.1 |


In [7]:
def check_MW(df2, df3):
    """Compare the MW columns of two dataframes row by row. At the first index
    where the values differ, insert a placeholder row of '-' into df2 and shift
    the later rows of df2 down by one.

    Note: at this point, df3 is the longer dataframe of the two."""

    for i, (MW1, MW2) in enumerate(zip(df2.loc[:, 'MW'], df3.loc[:, 'MW'])):
        # Placeholder rows inserted by earlier calls are skipped.
        if MW1 != MW2 and MW1 != '-':
            # Rows above i keep their index; rows from i onwards move down by one.
            upper_half = list(range(i))
            lower_half = list(range(i + 1, df2.shape[0] + 1))
            df2.index = upper_half + lower_half
            # Fill the gap at index i with a placeholder row.
            df2.loc[i] = '-'
            df2 = df2.sort_index()

            break

    return df2

# Each call inserts at most one placeholder row, so len(df3) calls are enough
# to fill every gap.
for _ in range(len(df3)):
    df2 = check_MW(df2, df3)

In [8]:
df4 = pd.concat([df2.loc[:, 'MW'], df3.loc[:, 'MW']], axis=1)
df4.columns = ['Freesolv', 'SI']
df4

| | Freesolv | SI |
|---|---|---|
| 0 | 82.15 | 82.15 |
| 1 | - | 85.15 |
| 2 | - | 86.13 |
| 3 | - | 88.11 |
| 4 | 120.15 | 120.15 |
| 5 | 120.19 | 120.19 |
| 6 | 122.12 | 122.12 |
| 7 | 122.17 | 122.17 |
| 8 | 124.14 | 124.14 |
| 9 | 132.16 | 132.16 |
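
Aligning the two tables by position works when every mismatch is a row missing from FreeSolv, but it would treat two different molecules with the same rounded MW as a match. As a cross-check, the missing rows can also be found with a join. The following is a minimal sketch on toy data copied from the tables above, assuming that compound names are spelled identically in both sources, which may not hold for the full dataset.

```python
import pandas as pd

# Toy subsets built from rows shown in the tables above.
freesolv = pd.DataFrame({
    "name": ["cyclohexene", "2-methylbenzaldehyde"],
    "MW": [82.15, 120.15],
})
si = pd.DataFrame({
    "name": ["cyclohexene", "piperidine", "tetrahydropyran", "2-methylbenzaldehyde"],
    "MW": [82.15, 85.15, 86.13, 120.15],
})

# Left join from the SI onto FreeSolv; the '_merge' column records
# whether each SI row found a partner.
merged = si.merge(freesolv[["name"]], on="name", how="left", indicator=True)

# Rows marked 'left_only' exist in the SI but have no FreeSolv counterpart.
missing = merged.loc[merged["_merge"] == "left_only", ["name", "MW"]]
print(missing)
```

A join key that does not depend on spelling, such as a canonical SMILES string, would make this check more robust than matching on names.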


### The six ligands missing from FreeSolv that are not accounted for.

In [9]:
# Collect the SI rows whose aligned FreeSolv row is a '-' placeholder.
discrepancy = []
for i, freesolv_id in enumerate(df2.loc[:, 'ID']):
    if freesolv_id == '-':
        discrepancy.append(df3.loc[i, :])
df5 = pd.DataFrame(discrepancy)
df5

| | ID | SMILES | name | MW | Ghyd | uncertainty |
|---|---|---|---|---|---|---|
| 1 | SAMPL4_41 | `N1CCCCC1` | piperidine | 85.15 | -5.05 | 0.1 |
| 2 | SAMPL4_42 | `O1CCCCC1` | tetrahydropyran | 86.13 | -3.13 | 0.1 |
| 3 | SAMPL4_44 | `O1CCOCC1` | 1,4-dioxane | 88.11 | -5.08 | 0.1 |
| 13 | SAMPL4_30 | `O(C(=O)C)CCCCCC` | hexyl acetate | 144.21 | -2.29 | 0.12 |
| 29 | SAMPL4_50 | `c12c(cc3c(c1)cccc3)cccc2` | anthracene | 178.23 | -4.14 | 0.1 |
| 34 | SAMPL4_49 | `O1c2c(Oc3c1cccc3)cccc2` | dibenzo-p-dioxin | 184.19 | -3.16 | 0.1 |


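One possible explanation for the discrepancy is that these ligands already exist in the FreeSolv database under a different experimental reference. The check below searches the whole of FreeSolv for their SMILES strings.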

In [10]:
# Write SMILES to lists
freesolv_smi = df1['SMILES'].tolist()
discrep_smi = df5['SMILES'].tolist()

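# Pairwise comparison of raw SMILES strings between the discrepancy set and FreeSolv.
# Only identical strings match; the same molecule written with a different atom
# ordering (as in the df2 and df3 tables above) would not be detected.
check_lst = [smi1 == smi2 for smi1 in freesolv_smi for smi2 in discrep_smi]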
any(check_lst)

False
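
The `False` above only shows that no SMILES string in the discrepancy set is character-for-character identical to one in FreeSolv. The same molecule can be written as many different SMILES strings, as the FreeSolv and SI tables already show for 2-methylbenzaldehyde, so a more reliable check compares canonical forms. The sketch below uses RDKit's `MolToSmiles` for this; the helper names `canonical_smiles` and `find_shared` are hypothetical, and the result on the full dataset is not known from this notebook.

```python
from rdkit import Chem


def canonical_smiles(smiles):
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def find_shared(reference_smiles, query_smiles):
    """Return the query SMILES whose canonical form also occurs in the reference list."""
    reference = {canonical_smiles(s) for s in reference_smiles}
    # Unparsable strings must never count as a match.
    reference.discard(None)
    return [s for s in query_smiles if canonical_smiles(s) in reference]


# Two spellings of 2-methylbenzaldehyde taken from the tables above.
freesolv_form = "Cc1ccccc1C=O"
si_form = "O=Cc1ccccc1C"

# The raw strings differ, but the canonical forms agree.
print(freesolv_form == si_form)
print(canonical_smiles(freesolv_form) == canonical_smiles(si_form))
```

Applied to this notebook, `find_shared(freesolv_smi, discrep_smi)` would list any of the six ligands that FreeSolv already contains under a different SMILES spelling.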