### You must clone the opnbnch repo and append the path to access the methods in the repository.

If you have any questions, please post them in the [Issues](https://github.com/opnbnch/opnbnch/issues) for the opnbnch GitHub repo or refer to the parent newsletter post.

In [107]:
import os
import sys

OPNBNCH_HOME = '../'

sys.path.append(OPNBNCH_HOME)

import pandas as pd
import rdkit

from produce_meta import produce_meta
from standardize import standardize
from resolve_class import resolve_class
from utils.meta_utils import read_meta

#### First, start with a visual inspection of the data

In [2]:
data_dir = OPNBNCH_HOME + 'case_studies/Martins_et_al_2012/'
data_file = 'martins_et_al_2012.csv'
data_path = os.path.join(data_dir, data_file)


pd.read_csv(data_path).head()

Unnamed: 0,num,name,p_np,smiles
0,1,Propanolol,p,[Cl].CC(C)NCC(O)COc1cccc2ccccc12
1,2,Terbutylchlorambucil,p,C(=O)(OC(C)(C)C)CCCc1ccc(cc1)N(CCCl)CCCl
2,3,40730,p,c12c3c(N4CCN(C)CC4)c(F)cc1c(c(C(O)=O)cn2C(C)CO...
3,4,24,p,C1CCN(CC1)Cc1cccc(c1)OCCCNC(=O)C
4,5,cloxacillin,p,Cc1onc(c2ccccc2Cl)c1C(=O)N[C@H]3[C@H]4SC(C)(C)...


What does each of the four columns in the dataset contain?

* **num:** a redundant index column

* **name:** a column of compound names and IDs, which also appears to have limited utility

* **p_np:** a column that appears to hold a class encoding (p = penetrative vs. np = non_penetrative)

* **smiles:** a column specifying the structure for each compound
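
These readings of the columns can be sanity-checked quickly. Below is a minimal sketch using only the standard library and the five preview rows shown above; the `smiles` column is left out because some preview entries are truncated, and the variable names are ours, not part of opnbnch.

```python
import csv
import io

# First five rows of the preview above; the smiles column is omitted
# because some of its entries are truncated in the preview.
PREVIEW = """num,name,p_np
1,Propanolol,p
2,Terbutylchlorambucil,p
3,40730,p
4,24,p
5,cloxacillin,p
"""

rows = list(csv.DictReader(io.StringIO(PREVIEW)))

# `num` is redundant if it is simply the 1-based row position.
num_is_redundant = all(int(r["num"]) == i + 1 for i, r in enumerate(rows))

# Collect the distinct labels seen in the class column.
labels = sorted({r["p_np"] for r in rows})

print(num_is_redundant)  # True
print(labels)            # ['p']
```

On the full file the same two checks would be worth running before trusting the preview, since five rows only show the `p` label.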
 
The two columns holding the relevant data for our purposes are **p_np** and **smiles.** We know they are what we need to extract, but before we can start hacking away at this dataset, it's important that we set up a system to track our curation decisions, starting with the raw literature source.


### `produce_meta.py`

The `produce_meta` method from `produce_meta.py` assists in doing this. By providing a DOI (only ACS supported at the moment) and a path to the dataset we want to clean, we can produce an initial `metadata.json` file that will track progress through the data wrangling process. We believe that assiduously tracking the steps taken in data curation is crucial for assessing the provenance of benchmark data and ensuring its reproducibility. Thus, we begin: 

In [3]:
produce_meta('https://doi.org/10.1021/ci300124c', data_path)

Producing dataset metadata for: https://doi.org/10.1021/ci300124c
Writing metadata output to: ../case_studies/Martins_et_al_2012/Martins_et_al_2012_metadata.json


And so, our `metadata.json` file is produced. In it we store the title, authors, and other article metadata, as well as some basic info about the dataset being targeted for curation. We can take a look at our starting metadata using the `cat` command.

In [4]:
! cat ../case_studies/Martins_et_al_2012/Martins_et_al_2012_metadata.json

{
    "title": "A Bayesian Approach to in Silico Blood-Brain Barrier Penetration Modeling",
    "authors": [
        "Ines Filipa  Martins",
        "Ana L.  Teixeira",
        "Luis  Pinheiro",
        "Andre O.  Falcao"
    ],
    "doi": "https://doi.org/10.1021/ci300124c",
    "publisher": "American Chemical Society",
    "date": "June 6, 2012",
    "meta_version": "v1.0.0 (06-18-2020)",
    "meta_utc_fix": 1593129623,
    "meta_path": "../case_studies/Martins_et_al_2012/Martins_et_al_2012_metadata.json",
    "data_path": "../case_studies/Martins_et_al_2012/martins_et_al_2012.csv",
    "data_row_num": 2053,
    "data_columns": [
        "num",
        "name",
        "p_np",
        "smiles"
    ],
    "smiles_col": null,
    "value_col": null,
    "class_col": null
}
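
To make the dataset-level fields above concrete, here is a rough sketch of how entries like `data_row_num` and `data_columns` could be derived from a CSV. The helper `dataset_fields` is hypothetical and written for illustration; it is not the `produce_meta` implementation, and it skips the DOI lookup entirely.

```python
import csv
import io
import json


def dataset_fields(csv_text, data_path):
    """Hypothetical helper: derive dataset-level metadata from CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)                # first line holds the column names
    n_rows = sum(1 for _ in reader)      # remaining lines are data rows
    return {
        "data_path": data_path,
        "data_row_num": n_rows,
        "data_columns": header,
        # Not yet chosen; later curation steps fill these in.
        "smiles_col": None,
        "value_col": None,
        "class_col": None,
    }


# One-row sample built from the first preview row shown earlier.
sample = (
    "num,name,p_np,smiles\n"
    "1,Propanolol,p,[Cl].CC(C)NCC(O)COc1cccc2ccccc12\n"
)
fields = dataset_fields(sample, "example.csv")
print(json.dumps(fields, indent=4))
```

Python's `None` serializes to JSON `null`, which matches how the unset `smiles_col`, `value_col`, and `class_col` entries appear in the file above.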

Now we’re ready to start making changes to the raw data file from Martins et al. The first order of business is to standardize SMILES strings and class labels.

### `standardize.py` 

For this we use the `standardize` method from `standardize.py`. All we need to supply to the method is a path to the directory where our raw data resides, the name of the column that contains the SMILES data, and the name of the column containing class labels. From our visual analysis of the data above, those two columns are `smiles` and `p_np`, respectively.

Non-standard class labels are mapped to the integers 0 to n − 1 for n unique classes. The `standardize` function will ask for manual input to achieve a mapping from custom class labels {n, p} to standardized class labels {0, 1}. In this case, n → 0 and p → 1.
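
As a rough illustration of the idea, the sketch below builds such a mapping automatically by sorting the unique labels. That ordering rule and the helper name `build_class_map` are assumptions made for the example; `standardize` itself asks for each assignment interactively.

```python
def build_class_map(labels):
    """Map unique labels to 0..n-1 (assumed ordering: alphabetical)."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


raw_labels = ["p", "p", "n", "p", "n"]   # toy class column
class_map = build_class_map(raw_labels)
std_labels = [class_map[label] for label in raw_labels]

print(class_map)   # {'n': 0, 'p': 1}
print(std_labels)  # [1, 1, 0, 1, 0]
```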

In [5]:
standardize(data_dir, smiles_col='smiles', class_col='p_np')

RDKit ERROR: [17:00:26] Explicit valence for atom # 11 N, 4, is greater than permitted
RDKit ERROR: [17:00:26] Explicit valence for atom # 12 N, 4, is greater than permitted
RDKit ERROR: [17:00:26] Explicit valence for atom # 5 N, 4, is greater than permitted
RDKit ERROR: [17:00:26] Explicit valence for atom # 5 N, 4, is greater than permitted
RDKit ERROR: [17:00:26] Explicit valence for atom # 5 N, 4, is greater than permitted
RDKit ERROR: [17:00:26] Explicit valence for atom # 5 N, 4, is greater than permitted
RDKit ERROR: [17:00:26] Explicit valence for atom # 5 N, 4, is greater than permitted
RDKit ERROR: [17:00:26] Explicit valence for atom # 1 N, 4, is greater than permitted
RDKit ERROR: [17:00:26] Explicit valence for atom # 6 N, 4, is greater than permitted
RDKit ERROR: [17:00:26] Explicit valence for atom # 5 N, 4, is greater than permitted
RDKit ERROR: [17:00:27] Explicit valence for atom # 6 N, 4, is greater than permitted
RDKit ERROR: [17:00:27] Can't kekulize mol.  Unkekul

RDKit ERROR: [17:00:28] SMILES Parse Error: syntax error while parsing: invalid_smiles
RDKit ERROR: [17:00:28] SMILES Parse Error: Failed parsing SMILES 'invalid_smiles' for input: 'invalid_smiles'
RDKit ERROR: [17:00:28] SMILES Parse Error: syntax error while parsing: invalid_smiles
RDKit ERROR: [17:00:28] SMILES Parse Error: Failed parsing SMILES 'invalid_smiles' for input: 'invalid_smiles'
RDKit ERROR: [17:00:28] SMILES Parse Error: syntax error while parsing: invalid_smiles
RDKit ERROR: [17:00:28] SMILES Parse Error: Failed parsing SMILES 'invalid_smiles' for input: 'invalid_smiles'
RDKit ERROR: [17:00:28] SMILES Parse Error: syntax error while parsing: invalid_smiles
RDKit ERROR: [17:00:28] SMILES Parse Error: Failed parsing SMILES 'invalid_smiles' for input: 'invalid_smiles'
RDKit ERROR: [17:00:28] SMILES Parse Error: syntax error while parsing: invalid_smiles
RDKit ERROR: [17:00:28] SMILES Parse Error: Failed parsing SMILES 'invalid_smiles' for input: 'invalid_smiles'
RDKit ERRO

{'O=N([O-])C1=C(CN=C1NCCSCc2ncccc2)Cc3ccccc3': 59, 'c1(nc(NC(N)=[NH2])sc1)CSCCNC(=[NH]C#N)NC': 61, 'Cc1nc(sc1)\\[NH]=C(\\N)N': 392, 's1cc(CSCCN\\C(NC)=[NH]\\C#N)nc1\\[NH]=C(\\N)N': 615, 'c1c(c(ncc1)CSCCN\\C(=[NH]\\C#N)NCC)Br': 643, 'n1c(csc1\\[NH]=C(\\N)N)c1ccccc1': 646, 'n1c(csc1\\[NH]=C(\\N)N)c1cccc(c1)N': 647, 'n1c(csc1\\[NH]=C(\\N)N)c1cccc(c1)NC(C)=O': 648, 'n1c(csc1\\[NH]=C(\\N)N)c1cccc(c1)N\\C(NC)=[NH]\\C#N': 649, 's1cc(nc1\\[NH]=C(\\N)N)C': 650, 'c1(cc(N\\C(=[NH]\\c2cccc(c2)CC)C)ccc1)CC': 686}

        Your class values are non-standard for classification.
        In order for training and testing to run smoothly, let's
        convert your classes to standard form.
        

        Your class column, p_np currently contains 2 unique values.
        Those class values are {'n', 'p'}.
        Each class will be mapped into a standard value in [0, 1].
        
Assign n to one of the following values: [0, 1]:0
Assign p to one of the following values: [1]:1
Standard df will be writ

Once we enter that mapping, we’ll see that it is included in the metadata file alongside information on invalid SMILES strings. Such extensive documentation is included in the `metadata.json` to create a historical record of the decisions made during the curation process and to encourage reproducibility. Thankfully, this metadata is produced in an automated fashion and will be stored right alongside our standardized data.
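
Because the mapping is recorded, it can also be read back later to decode standardized classes into the original labels. The sketch below works on a hand-written fragment that mirrors the `class_col` and `class_map` entries of the updated metadata file; the variable names are ours.

```python
import json

# Hand-written fragment mirroring the class-related metadata entries.
META_FRAGMENT = """
{
    "class_col": "p_np",
    "class_map": {"p_np": {"n": 0, "p": 1}},
    "std_class_col": "std_class"
}
"""

meta_fragment = json.loads(META_FRAGMENT)

# Forward map: original label -> standardized integer.
forward = meta_fragment["class_map"][meta_fragment["class_col"]]

# Inverse map: standardized integer -> original label.
inverse = {std: label for label, std in forward.items()}

decoded = [inverse[c] for c in [1, 0, 1]]
print(decoded)  # ['p', 'n', 'p']
```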

In [6]:
! cat ../case_studies/Martins_et_al_2012/Martins_et_al_2012_metadata.json

{
    "title": "A Bayesian Approach to in Silico Blood-Brain Barrier Penetration Modeling",
    "authors": [
        "Ines Filipa  Martins",
        "Ana L.  Teixeira",
        "Luis  Pinheiro",
        "Andre O.  Falcao"
    ],
    "doi": "https://doi.org/10.1021/ci300124c",
    "publisher": "American Chemical Society",
    "date": "June 6, 2012",
    "meta_version": "v1.0.0 (06-18-2020)",
    "meta_utc_fix": 1593129623,
    "meta_path": "../case_studies/Martins_et_al_2012/Martins_et_al_2012_metadata.json",
    "data_path": "../case_studies/Martins_et_al_2012/martins_et_al_2012.csv",
    "data_row_num": 2053,
    "data_columns": [
        "num",
        "name",
        "p_np",
        "smiles"
    ],
    "smiles_col": "smiles",
    "value_col": null,
    "class_col": "p_np",
    "class_map": {
        "p_np": {
            "n": 0,
            "p": 1
        }
    },
    "std_class_col": "std_class",
    "std_data_path": "../case_studies/Martins_et_al_20

### Invalid SMILES

I'm always interested in taking a closer look at invalid SMILES, just to make sure we didn't unnecessarily drop any valuable data!

In [110]:
meta = read_meta(data_dir)
raw_data_path = meta.get('data_path')
std_data_path = meta.get('std_data_path')
std_smiles_col = meta.get('std_smiles_col')
smiles_col = meta.get('smiles_col')

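# Raw SMILES for rows whose standardized SMILES carry the 'invalid_smiles' sentinel
invalid_smiles = pd.read_csv(std_data_path) \
    .loc[lambda x: x[std_smiles_col] == 'invalid_smiles'] \
    .loc[:, smiles_col] \
    .to_list()

# Look those raw SMILES up in the original data to inspect their names and labels
pd.read_csv(raw_data_path) \
    .loc[lambda x: x[smiles_col].isin(invalid_smiles)]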

Unnamed: 0,num,name,p_np,smiles
59,60,15,p,O=N([O-])C1=C(CN=C1NCCSCc2ncccc2)Cc3ccccc3
61,62,22767,p,c1(nc(NC(N)=[NH2])sc1)CSCCNC(=[NH]C#N)NC
392,393,ICI17148,p,Cc1nc(sc1)\[NH]=C(\N)N
615,616,5-6,p,s1cc(CSCCN\C(NC)=[NH]\C#N)nc1\[NH]=C(\N)N
643,644,12,n,c1c(c(ncc1)CSCCN\C(=[NH]\C#N)NCC)Br
646,647,16,p,n1c(csc1\[NH]=C(\N)N)c1ccccc1
647,648,17,n,n1c(csc1\[NH]=C(\N)N)c1cccc(c1)N
648,649,18,n,n1c(csc1\[NH]=C(\N)N)c1cccc(c1)NC(C)=O
649,650,19,n,n1c(csc1\[NH]=C(\N)N)c1cccc(c1)N\C(NC)=[NH]\C#N
650,651,2,p,s1cc(nc1\[NH]=C(\N)N)C


Alas, it looks like none of our defective SMILES were given a `name` that can be resolved to a proper structure, and they do indeed appear to be invalid. One can read the RDKit error log produced by `standardize` to get a sense of why SMILES fail. In this case, many SMILES fail because they specify impossible nitrogen chemistry, e.g.:

    RDKit ERROR: Explicit valence for atom # 5 N, 4, is greater than permitted
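
Reading the log by eye works for a handful of failures; for longer logs, a small tally helps. Here is a minimal sketch that groups a few of the log lines shown earlier by failure type. The regular expression and the category names are our own choices for the example, not anything RDKit or opnbnch provides.

```python
import re
from collections import Counter

# A few lines copied from the standardize output above.
LOG = """\
RDKit ERROR: [17:00:26] Explicit valence for atom # 11 N, 4, is greater than permitted
RDKit ERROR: [17:00:26] Explicit valence for atom # 12 N, 4, is greater than permitted
RDKit ERROR: [17:00:26] Explicit valence for atom # 5 N, 4, is greater than permitted
RDKit ERROR: [17:00:28] SMILES Parse Error: syntax error while parsing: invalid_smiles
"""

# Captures the atom index, element symbol, and offending valence.
VALENCE = re.compile(
    r"Explicit valence for atom # (\d+) (\w+), (\d+), is greater than permitted"
)


def tally_errors(log_text):
    """Count log lines per (illustrative) failure category."""
    counts = Counter()
    for line in log_text.splitlines():
        match = VALENCE.search(line)
        if match:
            element, valence = match.group(2), match.group(3)
            counts[f"valence: {element} with valence {valence}"] += 1
        elif "SMILES Parse Error" in line:
            counts["parse error"] += 1
        else:
            counts["other"] += 1
    return counts


error_counts = tally_errors(LOG)
print(dict(error_counts))
# {'valence: N with valence 4': 3, 'parse error': 1}
```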
    
### Replicate Entries

With SMILES standardized, it’s easy to see that there are certain data points for which multiple class records are included in Martins et al.’s dataset. For example, below is a list of SMILES that appear multiple times in the dataset and their associated counts.

In [111]:
pd.read_csv(std_data_path) \
    .loc[lambda x: x[std_smiles_col] != 'invalid_smiles'] \
    .loc[:, std_smiles_col] \
    .value_counts().head(10)

CN(C)CCc1ccccn1                                     3
CCCN(CCC)CCc1cccc2c1CC(=O)N2                        3
Cc1ncc2n1-c1ccc(Cl)cc1C(c1ccccc1F)=NC2              3
CNCCc1ccccn1                                        3
ClCCl                                               3
NCCc1cn2ccccc2n1                                    3
O=C(NCCN1CCOCC1)c1ccc(Cl)cc1                        2
CN1Cc2c(-c3noc(C(C)(C)O)n3)ncn2-c2cccc(Cl)c2C1=O    2
CC(C)(C)OC(=O)CCCc1ccc(N(CCCl)CCCl)cc1              2
Cc1cc(NS(=O)(=O)c2ccc(N)cc2)no1                     2
Name: std_smiles, dtype: int64

We must resolve each of these replicate groupings to a single class, so that we can train our model on a functional mapping between SMILES structures and BBB penetration labels. 
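
Not every replicate group is a problem: only those whose labels disagree need a decision. The sketch below separates the two cases on toy records; the SMILES and labels here are invented for illustration and are not taken from the Martins et al. data.

```python
from collections import defaultdict

# Toy (std_smiles, std_class) records, invented for illustration.
records = [
    ("CCO", 1), ("CCO", 1),
    ("CCN", 0), ("CCN", 1), ("CCN", 1),
    ("c1ccccc1", 0),
]

# Group the class labels recorded for each structure.
groups = defaultdict(list)
for smiles, label in records:
    groups[smiles].append(label)

# Structures that appear more than once.
replicated = {s: labs for s, labs in groups.items() if len(labs) > 1}

# Replicated structures whose labels disagree.
conflicting = sorted(s for s, labs in replicated.items() if len(set(labs)) > 1)

print(replicated)   # {'CCO': [1, 1], 'CCN': [0, 1, 1]}
print(conflicting)  # ['CCN']
```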

### `resolve_class.py`

For this purpose, we’ve built the `resolve_class` method, which takes in the path to the data directory and an optional “filter function” parameter. 

In [32]:
resolve_class(data_dir, filter_fn=None)

Filter unspecified or invalid.

        How do you want to resolve class for multiple replicates?
        
Please select one option:['unanimous', 'majority']: majority
Curated df will be written to: ../case_studies/Martins_et_al_2012/resolved_martins_et_al_2012.csv
Updated metadata at: ../case_studies/Martins_et_al_2012/Martins_et_al_2012_metadata.json


Here we don’t specify a filter function, so the method prompts us for one at the command line. Currently, two filtering methods are supported. For `filter_fn="unanimous"`, all replicates must share the same class; otherwise they are filtered from the training data. For `filter_fn="majority"`, a simple majority suffices.
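
The difference between the two rules is easiest to see on a single replicate group. This is a minimal sketch, not the `resolve_class` implementation: the helper `resolve_group` is hypothetical, and dropping ties under `"majority"` is our assumption about how a split vote would be handled.

```python
from collections import Counter


def resolve_group(labels, filter_fn):
    """Return the resolved class for one replicate group, or None to drop it."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    if filter_fn == "unanimous":
        # Keep the group only if every replicate agrees.
        return top_label if len(counts) == 1 else None
    if filter_fn == "majority":
        # Strict majority: more than half of the replicates.
        # Assumption: ties are dropped rather than broken arbitrarily.
        return top_label if top_count * 2 > len(labels) else None
    raise ValueError(f"unknown filter_fn: {filter_fn!r}")


print(resolve_group([1, 1], "unanimous"))    # 1
print(resolve_group([0, 1, 1], "unanimous")) # None
print(resolve_group([0, 1, 1], "majority"))  # 1
print(resolve_group([0, 1], "majority"))     # None
```

Under `"unanimous"`, a single dissenting replicate removes the structure from the training data, so it trades dataset size for label confidence.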

`resolve_class` writes its output next to the original data file, with the prefix “resolved_” added to the file name, and appends its results to `metadata.json`.
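
The naming convention can be sketched in a couple of lines. The helper `resolved_path` is hypothetical; it only reproduces the pattern visible in the output above, where `martins_et_al_2012.csv` becomes `resolved_martins_et_al_2012.csv` in the same directory.

```python
from pathlib import PurePosixPath


def resolved_path(data_path):
    """Hypothetical helper: prefix the file name with 'resolved_'."""
    path = PurePosixPath(data_path)
    return str(path.with_name("resolved_" + path.name))


out_path = resolved_path(
    "../case_studies/Martins_et_al_2012/martins_et_al_2012.csv"
)
print(out_path)
# ../case_studies/Martins_et_al_2012/resolved_martins_et_al_2012.csv
```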

In [33]:
!cat ../case_studies/Martins_et_al_2012/Martins_et_al_2012_metadata.json

{
    "title": "A Bayesian Approach to in Silico Blood-Brain Barrier Penetration Modeling",
    "authors": [
        "Ines Filipa  Martins",
        "Ana L.  Teixeira",
        "Luis  Pinheiro",
        "Andre O.  Falcao"
    ],
    "doi": "https://doi.org/10.1021/ci300124c",
    "publisher": "American Chemical Society",
    "date": "June 6, 2012",
    "meta_version": "v1.0.0 (06-18-2020)",
    "meta_utc_fix": 1593129623,
    "meta_path": "../case_studies/Martins_et_al_2012/Martins_et_al_2012_metadata.json",
    "data_path": "../case_studies/Martins_et_al_2012/martins_et_al_2012.csv",
    "data_row_num": 2053,
    "data_columns": [
        "num",
        "name",
        "p_np",
        "smiles"
    ],
    "smiles_col": "smiles",
    "value_col": null,
    "class_col": "p_np",
    "class_map": {
        "p_np": {
            "n": 0,
            "p": 1
        }
    },
    "std_class_col": "std_class",
    "std_data_path": "../case_studies/Martins_et_al_20

In [116]:
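# Re-read the metadata so it includes the fields written by resolve_class
meta = read_meta(data_dir)
resolved_data_path = meta.get('resolved_data_path')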
train = pd.read_csv(resolved_data_path)[['std_smiles', 'std_class']]
train

Unnamed: 0,std_smiles,std_class
0,Fc1ccc(-c2ccc(CN3CCN(c4cccc5ccoc45)CC3)[nH]2)cc1,1
1,OCCN1CCN(CC/C=C2/c3ccccc3COc3ccc(Cl)cc32)CC1,1
2,NNC(=O)CP(=O)(c1ccccc1)c1ccccc1,1
3,CC(C)(C)NC(=O)[C@@H]1C[C@@H]2CCCC[C@@H]2CN1C[C...,0
4,C[C@H]1O[C@@H](O[C@H]2[C@@H](O)C[C@H](O[C@H]3[...,0
...,...,...
1946,COC1C=COC2(C)Oc3c(C)c(O)c4c(O)c(c(C=NN5CCN(C)C...,0
1947,CN1c2cc(F)ccc2C(c2ccccc2)=NCC1CNC(=O)c1ccoc1,1
1948,COc1cccc(C(=O)NC2C[C@@H]3CC[C@H](C2)N3Cc2ccccc...,1
1949,CN(C)CCCN1c2ccccc2C=Cc2sccc21,1
