### You must clone the opnbnch repo and append its path to `sys.path` to access the methods in the repository.

If you have any questions, please post them in the [Issues](https://github.com/opnbnch/opnbnch/issues) section of the opnbnch GitHub repo.

In [5]:
import os
import sys

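# Path to the root of the cloned opnbnch repo, relative to this notebook
OPNBNCH_HOME = '../'

# Make the repo's modules (produce_meta, standardize, resolve_class) importable
sys.path.append(OPNBNCH_HOME)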

import pandas as pd
import rdkit

from produce_meta import produce_meta
from standardize import standardize
from resolve_class import resolve_class

In [7]:
#### Let's begin by taking a look at our data

data_dir = OPNBNCH_HOME + 'case_studies/Martins_et_al_2012/'
data_file = 'martins_et_al_2012.csv'
data_path = os.path.join(data_dir, data_file)


pd.read_csv(data_path).head()

Unnamed: 0,num,name,p_np,smiles
0,1,Propanolol,p,[Cl].CC(C)NCC(O)COc1cccc2ccccc12
1,2,Terbutylchlorambucil,p,C(=O)(OC(C)(C)C)CCCc1ccc(cc1)N(CCCl)CCCl
2,3,40730,p,c12c3c(N4CCN(C)CC4)c(F)cc1c(c(C(O)=O)cn2C(C)CO...
3,4,24,p,C1CCN(CC1)Cc1cccc(c1)OCCCNC(=O)C
4,5,cloxacillin,p,Cc1onc(c2ccccc2Cl)c1C(=O)N[C@H]3[C@H]4SC(C)(C)...


What does each of these columns appear to hold?
 * **num:** an index column we probably don't need
 * **name:** a column of compound names and IDs. It also appears to have limited utility.
 * **p_np:** a column that appears to hold a class encoding (`p` = penetrative vs. `n` = non-penetrative)
 * **smiles:** a column specifying the structure of each compound

The two columns holding the relevant data for our purposes are **p_np** and **smiles**. Before we can start hacking away at this dataset, though, it's important that we can trace it back to its source in the literature.
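As a quick sanity check on the column roles described above, the sketch below keeps only **p_np** and **smiles** and tallies the class labels. It uses the standard library and three complete rows copied from the preview (rows whose SMILES strings are truncated in the preview are left out); the helper name `keep_columns` is hypothetical and not part of opnbnch.

```python
import csv
import io
from collections import Counter

# Three complete rows copied from the preview above; rows with truncated
# SMILES strings are intentionally omitted.
SAMPLE_CSV = """num,name,p_np,smiles
1,Propanolol,p,[Cl].CC(C)NCC(O)COc1cccc2ccccc12
2,Terbutylchlorambucil,p,C(=O)(OC(C)(C)C)CCCc1ccc(cc1)N(CCCl)CCCl
4,24,p,C1CCN(CC1)Cc1cccc(c1)OCCCNC(=O)C
"""


def keep_columns(rows, columns):
    """Return copies of the rows restricted to the given column names."""
    return [{col: row[col] for col in columns} for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
trimmed = keep_columns(rows, ["p_np", "smiles"])

# Count how often each class label occurs in the sample.
label_counts = Counter(row["p_np"] for row in trimmed)
print(label_counts)
```

Running the same tally on the full file would show how balanced the two classes are, which is worth knowing before any modeling.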


### `produce_meta.py`

The `produce_meta` function (from `produce_meta.py`) does just this for us. Given a DOI (only ACS DOIs are supported at the moment) and the path to the data set we want to clean, it produces a metadata JSON file that tracks data curation progress.
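To make the idea of a curation-tracking metadata file concrete, here is a minimal sketch of a writer that records the DOI plus basic facts about the CSV. This is not opnbnch's implementation: `sketch_meta` is a hypothetical stand-in, the field names are borrowed from the metadata file printed in this notebook, and the article lookup (title, authors, publisher) is left out because it would need network access.

```python
import csv
import json
import os
import tempfile


def sketch_meta(doi, data_path):
    """Hypothetical stand-in for a metadata writer; not opnbnch's code."""
    with open(data_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)               # first line holds the column names
        row_count = sum(1 for _ in reader)  # remaining lines are data rows

    meta = {
        "doi": doi,
        "data_path": data_path,
        "data_row_num": row_count,
        "data_columns": header,
        # Curation fields start empty and are filled in by later steps.
        "smiles_col": None,
        "value_col": None,
        "class_col": None,
    }

    meta_path = os.path.splitext(data_path)[0] + "_metadata.json"
    with open(meta_path, "w") as handle:
        json.dump(meta, handle, indent=4)
    return meta_path


# Usage on a throwaway one-row CSV in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy.csv")
    with open(csv_path, "w", newline="") as handle:
        handle.write("num,name,p_np,smiles\n")
        handle.write("1,Propanolol,p,[Cl].CC(C)NCC(O)COc1cccc2ccccc12\n")
    meta_file = sketch_meta("https://doi.org/10.1021/ci300124c", csv_path)
    with open(meta_file) as handle:
        written = json.load(handle)

print(written["data_columns"], written["data_row_num"])
```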


In [9]:
produce_meta('https://doi.org/10.1021/ci300124c', data_path)

Producing dataset metadata for: https://doi.org/10.1021/ci300124c
Writing metadata output to: ../case_studies/Martins_et_al_2012/Martins_et_al_2012_metadata.json


#### We can take a look at our starting metadata using the `cat` command 

We store the title, author, and other article metadata as well as some basic info about the data set being targeted for curation.


In [10]:
! cat ../case_studies/Martins_et_al_2012/Martins_et_al_2012_metadata.json

{
    "title": "A Bayesian Approach to in Silico Blood-Brain Barrier Penetration Modeling",
    "authors": [
        "Ines Filipa  Martins",
        "Ana L.  Teixeira",
        "Luis  Pinheiro",
        "Andre O.  Falcao"
    ],
    "doi": "https://doi.org/10.1021/ci300124c",
    "publisher": "American Chemical Society",
    "date": "June 6, 2012",
    "meta_version": "v1.0.0 (06-18-2020)",
    "meta_utc_fix": 1593125347,
    "meta_path": "../case_studies/Martins_et_al_2012/Martins_et_al_2012_metadata.json",
    "data_path": "../case_studies/Martins_et_al_2012/martins_et_al_2012.csv",
    "data_row_num": 2053,
    "data_columns": [
        "num",
        "name",
        "p_np",
        "smiles"
    ],
    "smiles_col": null,
    "value_col": null,
    "class_col": null
}
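
The same file can also be read from Python rather than with `cat`, which makes it easy to check which curation fields are still unset. The sketch below inlines an excerpt of the metadata shown above so that it runs on its own; `unset_fields` is a hypothetical helper, not an opnbnch function.

```python
import json

# Excerpt of the metadata printed above, inlined so the sketch is standalone.
META_TEXT = """
{
    "doi": "https://doi.org/10.1021/ci300124c",
    "data_row_num": 2053,
    "data_columns": ["num", "name", "p_np", "smiles"],
    "smiles_col": null,
    "value_col": null,
    "class_col": null
}
"""


def unset_fields(meta):
    """Return the keys whose value is still null (None in Python)."""
    return [key for key, value in meta.items() if value is None]


meta = json.loads(META_TEXT)
pending = unset_fields(meta)
print(pending)  # ['smiles_col', 'value_col', 'class_col']
```

With a real file, `json.load(open(path))` would replace the inlined string.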

In [11]:
standardize(data_dir, 'smiles', 'p_np')

RDKit ERROR: [15:49:35] Explicit valence for atom # 11 N, 4, is greater than permitted
RDKit ERROR: [15:49:35] Explicit valence for atom # 12 N, 4, is greater than permitted
RDKit ERROR: [15:49:35] Explicit valence for atom # 5 N, 4, is greater than permitted
RDKit ERROR: [15:49:35] Explicit valence for atom # 5 N, 4, is greater than permitted
RDKit ERROR: [15:49:35] Explicit valence for atom # 5 N, 4, is greater than permitted
RDKit ERROR: [15:49:35] Explicit valence for atom # 5 N, 4, is greater than permitted
RDKit ERROR: [15:49:35] Explicit valence for atom # 5 N, 4, is greater than permitted
RDKit ERROR: [15:49:35] Explicit valence for atom # 1 N, 4, is greater than permitted
RDKit ERROR: [15:49:35] Explicit valence for atom # 6 N, 4, is greater than permitted
RDKit ERROR: [15:49:35] Explicit valence for atom # 5 N, 4, is greater than permitted
RDKit ERROR: [15:49:36] Explicit valence for atom # 6 N, 4, is greater than permitted
RDKit ERROR: [15:49:36] Can't kekulize mol.  Unkekul

RDKit ERROR: [15:49:37] SMILES Parse Error: syntax error while parsing: invalid_smiles
RDKit ERROR: [15:49:37] SMILES Parse Error: Failed parsing SMILES 'invalid_smiles' for input: 'invalid_smiles'
RDKit ERROR: [15:49:37] SMILES Parse Error: syntax error while parsing: invalid_smiles
RDKit ERROR: [15:49:37] SMILES Parse Error: Failed parsing SMILES 'invalid_smiles' for input: 'invalid_smiles'
RDKit ERROR: [15:49:37] SMILES Parse Error: syntax error while parsing: invalid_smiles
RDKit ERROR: [15:49:37] SMILES Parse Error: Failed parsing SMILES 'invalid_smiles' for input: 'invalid_smiles'
RDKit ERROR: [15:49:37] SMILES Parse Error: syntax error while parsing: invalid_smiles
RDKit ERROR: [15:49:37] SMILES Parse Error: Failed parsing SMILES 'invalid_smiles' for input: 'invalid_smiles'
RDKit ERROR: [15:49:37] SMILES Parse Error: syntax error while parsing: invalid_smiles
RDKit ERROR: [15:49:37] SMILES Parse Error: Failed parsing SMILES 'invalid_smiles' for input: 'invalid_smiles'
RDKit ERRO


        Your class values are non-standard for classification.
        In order for training and testing to run smoothly, let's
        convert your classes to standard form.
        

        Your class column, p_np currently contains 2 unique values.
        Those class values are {'n', 'p'}.
        Each class will be mapped into a standard value in [0, 1].
        
Assign n to one of the following values: [0, 1]:0
Assign p to one of the following values: [1]:1
Standard df will be written to: ../case_studies/Martins_et_al_2012/std_martins_et_al_2012.csv
Updated metadata at: ../case_studies/Martins_et_al_2012/Martins_et_al_2012_metadata.json
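
The class conversion described in the prompt above amounts to a lookup from raw labels to integers. A minimal sketch of that step, assuming the mapping chosen at the prompts (`n` to 0, `p` to 1), could look like the following; `map_classes` is a hypothetical helper and the label list is illustrative, not taken from the data set.

```python
def map_classes(labels, class_map):
    """Translate raw class labels to integers, failing loudly on unknown labels."""
    unknown = sorted(set(labels) - set(class_map))
    if unknown:
        raise ValueError(f"No mapping given for labels: {unknown}")
    return [class_map[label] for label in labels]


# Mapping entered at the interactive prompts above.
class_map = {"n": 0, "p": 1}

# Illustrative labels only; the real values come from the p_np column.
std_class = map_classes(["p", "n", "p", "p"], class_map)
print(std_class)  # [1, 0, 1, 1]
```

Raising on unmapped labels, rather than silently skipping them, keeps a stray value in the class column from slipping through unnoticed.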


In [19]:
! cat ../case_studies/Martins_et_al_2012/Martins_et_al_2012_metadata.json

{
    "title": "A Bayesian Approach to in Silico Blood-Brain Barrier Penetration Modeling",
    "authors": [
        "Ines Filipa  Martins",
        "Ana L.  Teixeira",
        "Luis  Pinheiro",
        "Andre O.  Falcao"
    ],
    "doi": "https://doi.org/10.1021/ci300124c",
    "publisher": "American Chemical Society",
    "date": "June 6, 2012",
    "meta_version": "v1.0.0 (06-18-2020)",
    "meta_utc_fix": 1593125347,
    "meta_path": "../case_studies/Martins_et_al_2012/Martins_et_al_2012_metadata.json",
    "data_path": "../case_studies/Martins_et_al_2012/martins_et_al_2012.csv",
    "data_row_num": 2053,
    "data_columns": [
        "num",
        "name",
        "p_np",
        "smiles"
    ],
    "smiles_col": "smiles",
    "value_col": null,
    "class_col": "p_np",
    "class_map": {
        "p_np": {
            "n": 0,
            "p": 1
        }
    },
    "std_class_col": "std_class",
    "std_data_path": "../case_studies/Martins_et_al_20

In [None]:
resolve_class(data_dir, filter_fn=None)