# EPA NCC

In [1]:
import epa_ncc

In [2]:
from pathlib import Path
import openpyxl

In [3]:
TOP = Path.cwd().as_posix().replace('notebooks', '')
data_dir = Path(TOP) /'data'/'raw'

In [4]:
%%capture
#Version 1: Import the functions from the fully installed package.
from epa_ncc.ncc_categories.ncc_categories import queryAll, printTree, listCategories, singleQuery

## singleQuery()

**Function definition:** A quick method for determining whether a chemical belongs in a specific category.

- Inputs:
  - *one_chem*, an individual chemical, provided as a dictionary or DataFrame slice with the keys/columns dsstox_sid, smiles, logp, ws, mol_weight, and mol.
  - *category_title*, a string representing a category title. Possibilities listed in [README](../README.md).
- Output: *boolean*, specifying whether *one_chem* belongs to the category named by *category_title*.
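
Before calling singleQuery, it can help to confirm that a chemical dictionary carries every expected attribute. The sketch below assumes the required names are the six used in the examples in this notebook; `missing_attributes` and `incomplete_chem` are hypothetical and not part of epa_ncc.

```python
# Attribute names taken from the examples in this notebook (an assumption about
# what the package requires; the package README is the authority).
REQUIRED_ATTRIBUTES = ("dsstox_sid", "smiles", "logp", "ws", "mol_weight", "mol")


def missing_attributes(chem):
    """Return the required attribute names absent from a chemical dictionary."""
    return [name for name in REQUIRED_ATTRIBUTES if name not in chem]


# Hypothetical record that lacks the RDKit 'mol' entry.
incomplete_chem = {
    "dsstox_sid": "DTXSID7060837",
    "smiles": "ICCCI",
    "logp": 3.02,
    "ws": 0.0007413102413009177,
    "mol_weight": 295.855896192,
}

print(missing_attributes(incomplete_chem))
```

Because `name in df` tests column labels on a pandas DataFrame, the same check should also work on a DataFrame slice.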

In [5]:
from rdkit import Chem
import pandas as pd

# Single-chemical Dictionary
test_chem = {'dsstox_sid': 'DTXSID3060164',
  'smiles': 'C1=CC=CC=C1C(C1C=CC=CC=1)C1C=CC=CC=1',
  'logp': 5.76,
  'ws': 4.07380277804113e-07,
  'mol_weight': 244.125200512,
  'mol': Chem.MolFromSmiles('C1=CC=CC=C1C(C1C=CC=CC=1)C1C=CC=CC=1')}

singleQuery(one_chem = test_chem, category_title = 'Neutral Organics')


True

In [6]:
# Single-chemical selection from a DataFrame

test_chems = pd.read_csv(data_dir/"readme_examples.csv", index_col = [0])

# The 'mol' column read from the CSV does not hold RDKit Mol objects (can be checked with test_chems.head()), so we rebuild it from the SMILES first.
test_chems['mol'] = [Chem.MolFromSmiles(i) for i in test_chems['smiles']]


In [7]:
#Select a chemical from the DataFrame
my_chem = (test_chems
 .set_index('dsstox_sid')
 .loc[['DTXSID2036405']]
)

In [8]:

my_chem = my_chem.reset_index()

In [9]:
# Choose a category
my_category = 'Acid Chlorides'


In [10]:
my_chem['dsstox_sid'].values[0]

'DTXSID2036405'

In [11]:

print(f"The chemical {my_chem['dsstox_sid'].values[0]} with smiles {my_chem['smiles'].values[0]} is in the {my_category} Category:")
print(singleQuery(one_chem = my_chem, category_title = my_category))

The chemical DTXSID2036405 with smiles [OH-].[OH-].[OH-].[Al+3] is in the Acid Chlorides Category:
False


Having seen an example of a chemical that is *not* in a category, we next look at the function listCategories(), which also takes a single chemical but instead outputs all categories to which that chemical belongs.

## listCategories()

**Function Definition:** Given an individual chemical, this function outputs a list of all categories to which the chemical belongs. 

- Input: *one_chem*, A DataFrame or Dictionary representing a single chemical and its attributes, including dsstox_sid, smiles, logp, ws, mol_weight, and the RDKit Mol built with Chem.MolFromSmiles (labelled as 'mol'). There must be keys or column names to match each of these attribute titles.
- Output: *all_cats*, A list of all categories to which the chemical belongs according to the included tests. 
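
When only a handful of categories matter, a similar list can be built by repeating a single-category check over the titles of interest. The sketch below is not how listCategories is implemented; `categories_among` is a hypothetical helper, and `stub_single_query` stands in for singleQuery by looking up memberships copied from the outputs shown in this notebook, so that the example runs without epa_ncc or RDKit.

```python
# Memberships copied from outputs shown in this notebook; a stand-in only.
KNOWN_MEMBERSHIPS = {
    "DTXSID3060164": {"Neutral Organics"},
    "DTXSID2036405": {"Aluminum Compounds"},
}


def stub_single_query(one_chem, category_title):
    """Stand-in for singleQuery: looks the answer up instead of computing it."""
    return category_title in KNOWN_MEMBERSHIPS.get(one_chem["dsstox_sid"], set())


def categories_among(one_chem, titles, is_member=stub_single_query):
    """Keep the titles for which the membership test returns True."""
    return [
        title
        for title in titles
        if is_member(one_chem=one_chem, category_title=title)
    ]


titles_of_interest = ["Acid Chlorides", "Aluminum Compounds", "Neutral Organics"]
print(categories_among({"dsstox_sid": "DTXSID2036405"}, titles_of_interest))
```

In real use, `is_member=singleQuery` could be passed instead of the stub, together with a complete chemical record.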


In [12]:
# Single-chemical Dictionary
test_chem = {'dsstox_sid': 'DTXSID3060164',
  'smiles': 'C1=CC=CC=C1C(C1C=CC=CC=1)C1C=CC=CC=1',
  'logp': 5.76,
  'ws': 4.07380277804113e-07,
  'mol_weight': 244.125200512,
  'mol': Chem.MolFromSmiles('C1=CC=CC=C1C(C1C=CC=CC=1)C1C=CC=CC=1')}

listCategories(test_chem)

['Neutral Organics']

In [13]:
#Reuse the single-chemical DataFrame slice from the previous function

listCategories(my_chem)

['Aluminum Compounds']

In [14]:
#Alternatively, use one of the TSCA chemicals that has multi-category membership:
tsca_chems = pd.read_excel(data_dir/"tsca_categorisation_071124_wmappingdict.xlsx")

#Rename columns for processing by categories.py
tsca_chems = tsca_chems.rename(columns={'MolWeight':'mol_weight', 'dtxsid':'dsstox_sid', 'LogP_pred':'logp',
                                  'WS_pred_mg/L':'ws'})

#Since we are only using one chemical, we will not waste computation building a Mol for every row
#(Chem was already imported from rdkit above)
tsca_chems['mol'] = ''

# An outside study showed that this chemical belongs to multiple categories.
# These lines set the 'mol' column value for this chemical only
chem_selection = tsca_chems.loc[tsca_chems['dsstox_sid'] == 'DTXSID0072980']
tsca_chems.loc[tsca_chems['dsstox_sid'] == 'DTXSID0072980', 'mol'] = Chem.MolFromSmiles(chem_selection['smiles'].values[0])
chem_selection = tsca_chems.loc[tsca_chems['dsstox_sid'] == 'DTXSID0072980']

listCategories(chem_selection)

['Acrylates/Methacrylates (Acute toxicity)',
 'Benzotriazoles (Acute toxicity)',
 'Benzotriazole-hindered phenols',
 'Esters (Acute toxicity)',
 'Phenols (Acute toxicity)']

The previous two functions allow users to quickly find information about an individual chemical, but often one will want to categorize a full chemical inventory. The main function for this is queryAll().

## queryAll()

**Function Definition:** Given a set of one or more chemicals, returns a DataFrame containing one column of chemical DSSTox substance identifiers and one column for every category included in all_tests. The category columns hold binary (0/1) values by default, or boolean values on request, thus describing category membership for the chemical set in a fingerprint-like way.

- Inputs: 
  - *chemicals*, A DataFrame, Dictionary, or list of Dictionaries of chemicals and their attributes, including dsstox_sid, smiles, logp, ws, mol_weight, and the RDKit Mol built with Chem.MolFromSmiles (labelled as 'mol'). There must be keys or column names to match each of these attribute titles.
  - *boolean_outputs*, Default value is False. This function will, by default, output category_df with binary (0/1) values describing category membership. If desired, this matrix can instead be output with boolean (True/False) values by setting boolean_outputs to True.

- Output: *category_df*, A DataFrame of chemicals and their category memberships, with an example depicted below:

In [15]:
# List of Dictionaries Input

new_test_chems = [{'dsstox_sid': 'DTXSID3060164',
  'smiles': 'C1=CC=CC=C1C(C1C=CC=CC=1)C1C=CC=CC=1',
  'logp': 5.76,
  'ws': 4.07380277804113e-07,
  'mol_weight': 244.125200512,
  'mol': Chem.MolFromSmiles('C1=CC=CC=C1C(C1C=CC=CC=1)C1C=CC=CC=1')},
 {'dsstox_sid': 'DTXSID7060837',
  'smiles': 'ICCCI',
  'logp': 3.02,
  'ws': 0.0007413102413009177,
  'mol_weight': 295.855896192,
  'mol': Chem.MolFromSmiles('ICCCI')},
 {'dsstox_sid': 'DTXSID9025879',
  'smiles': 'OC(=O)C=CC1C=CC(C=CC(O)=O)=CC=1',
  'logp': 1.99,
  'ws': 0.009120108393559097,
  'mol_weight': 218.0579088,
  'mol': Chem.MolFromSmiles('OC(=O)C=CC1C=CC(C=CC(O)=O)=CC=1')}]

queryAll(new_test_chems)

Unnamed: 0,chemicals,Acid Chlorides,Acrylamides,Acrylates/Methacrylates (Acute toxicity),Aldehydes (Acute toxicity),Aliphatic Amines,Aluminum Compounds,Anilines (Acute toxicity),Azides (Acute toxicity),Benzotriazoles (Acute toxicity),...,Imides (Chronic toxicity),Organotins (Chronic toxicity),Phenols (Chronic toxicity),Phosphinate Esters (Chronic toxicity),Polynitroaromatics (Chronic toxicity),Substituted Triazines (Chronic toxicity),Thiols (Chronic toxicity),Vinyl Esters (Chronic toxicity),Diazoniums (Chronic toxicity),Ethylene Glycol Ethers
0,DTXSID3060164,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,DTXSID7060837,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,DTXSID9025879,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
# Dictionary with multiple entries, boolean values as words rather than numbers

test_chems_together = {'dsstox_sid': ['DTXSID3060164','DTXSID7060837'],
  'smiles': ['C1=CC=CC=C1C(C1C=CC=CC=1)C1C=CC=CC=1','ICCCI'],
  'logp': [5.76,3.02],
  'ws': [4.07380277804113e-07,0.0007413102413009177],
  'mol_weight': [244.125200512,295.855896192],
  'mol': [Chem.MolFromSmiles('C1=CC=CC=C1C(C1C=CC=CC=1)C1C=CC=CC=1'),Chem.MolFromSmiles('ICCCI')]}

queryAll(test_chems_together, boolean_outputs=True)

Unnamed: 0,chemicals,Acid Chlorides,Acrylamides,Acrylates/Methacrylates (Acute toxicity),Aldehydes (Acute toxicity),Aliphatic Amines,Aluminum Compounds,Anilines (Acute toxicity),Azides (Acute toxicity),Benzotriazoles (Acute toxicity),...,Imides (Chronic toxicity),Organotins (Chronic toxicity),Phenols (Chronic toxicity),Phosphinate Esters (Chronic toxicity),Polynitroaromatics (Chronic toxicity),Substituted Triazines (Chronic toxicity),Thiols (Chronic toxicity),Vinyl Esters (Chronic toxicity),Diazoniums (Chronic toxicity),Ethylene Glycol Ethers
0,DTXSID3060164,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,DTXSID7060837,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [17]:
# DataFrame input

test_chems_df = pd.read_csv(data_dir/"readme_examples.csv").set_index("Unnamed: 0")

test_chems_df.head()

Unnamed: 0,dsstox_sid,smiles,mol_weight,ws,mol,logp
0,DTXSID90480751,C[13C](Cl)=O,79.49,1.0,<rdkit.Chem.rdchem.Mol object at 0x0000014D3FC...,0.38
1,DTXSID50939730,CC(=C)C([O-])=O,85.083,0.704,<rdkit.Chem.rdchem.Mol object at 0x0000014D3FC...,0.857
2,DTXSID2036405,[OH-].[OH-].[OH-].[Al+3],78.003,28.8,<rdkit.Chem.rdchem.Mol object at 0x0000014D3FC...,-0.75
3,DTXSID1024835,O=CC=CC1=CC=CC=C1,132.162,0.0107,<rdkit.Chem.rdchem.Mol object at 0x0000014D3FC...,1.9
4,DTXSID30878870,[N-]=[N+]=NC1=CC=CC=C1,119.127,0.0218,<rdkit.Chem.rdchem.Mol object at 0x0000014D3FC...,2.59


In [18]:
#Those "mol" values are just strings, since a CSV cannot store an RDKit Mol object

#Rebuild them from the SMILES column
test_chems_df['mol'] = [Chem.MolFromSmiles(i) for i in test_chems_df['smiles']]

In [19]:
queryAll(test_chems_df)

Unnamed: 0,chemicals,Acid Chlorides,Acrylamides,Acrylates/Methacrylates (Acute toxicity),Aldehydes (Acute toxicity),Aliphatic Amines,Aluminum Compounds,Anilines (Acute toxicity),Azides (Acute toxicity),Benzotriazoles (Acute toxicity),...,Imides (Chronic toxicity),Organotins (Chronic toxicity),Phenols (Chronic toxicity),Phosphinate Esters (Chronic toxicity),Polynitroaromatics (Chronic toxicity),Substituted Triazines (Chronic toxicity),Thiols (Chronic toxicity),Vinyl Esters (Chronic toxicity),Diazoniums (Chronic toxicity),Ethylene Glycol Ethers
0,DTXSID90480751,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,DTXSID50939730,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,DTXSID2036405,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,DTXSID1024835,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,DTXSID30878870,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
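
The fingerprint-style table can be collapsed back into a list of categories per chemical. The sketch below works on plain dictionaries shaped like `category_df.to_dict("records")`, restricted to a few of the columns visible in the output above; `memberships` is a hypothetical helper, and the `chemicals` column name is taken from that output.

```python
# Rows shaped like category_df.to_dict("records"), limited to three of the
# category columns visible above (the full table has many more).
rows = [
    {"chemicals": "DTXSID90480751", "Acid Chlorides": 1, "Acrylamides": 0, "Aluminum Compounds": 0},
    {"chemicals": "DTXSID50939730", "Acid Chlorides": 0, "Acrylamides": 0, "Aluminum Compounds": 0},
    {"chemicals": "DTXSID2036405", "Acid Chlorides": 0, "Acrylamides": 0, "Aluminum Compounds": 1},
]


def memberships(rows, id_column="chemicals"):
    """Map each chemical ID to the category columns flagged 1 (or True)."""
    return {
        row[id_column]: [
            name for name, flag in row.items() if name != id_column and flag
        ]
        for row in rows
    }


for sid, categories in memberships(rows).items():
    print(sid, categories)
```

Because the helper only tests truthiness, it treats the 0/1 output and the `boolean_outputs=True` output the same way.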


The last function is a utility function for understanding the categories. For categories that are built directly from the legacy XML, this function can also show precisely how each chemical is ruled in or out of each category, but for categories that required some repair, the function mainly shows the rules for category membership.

## printTree()

**Function Definition:** Allows the user to view the testing process for determining whether a chemical belongs in a specific category. Can be run with or without a chemical input.

- Inputs: 
    - *one_chem*, Default value is None, but an individual chemical can also be supplied with the same
    constraints as in singleQuery.
    - *category_title*, String representing a category title. Possibilities listed in [README](../README.md).
    - *printer*, Boolean with default value True. If True, then this result will be output to the console as
    a print statement. If False, nothing will be printed and the result will instead be a string variable.
- Output: *printed logic tree*, Each line of the logic tree will contain the query type and all necessary parameters. If data is provided for *one_chem*, the last value of each line will contain the boolean value for whether the chemical fulfills that piece of the query; if no chemical is provided, each line ends with "does not process".
    - For the XML-originating queries, the first value will be the query ID identifying the query in the XML document. 
    - For hard-coded queries, the first value will instead say CustomQuery and all lines after the first will terminate with "does not process", since the functions for all subqueries are contained within the top branch of the tree only.

In [20]:
# With a chemical and a legacy category

printTree(category_title = 'Benzotriazoles (Chronic toxicity)', one_chem = my_chem)


('1159', 'LogicalQuery', 'And', False)
	('1148', 'b:StructureQuery', 'c12c(cccc1)[#7]=,:[#7][#7v3]2', False)
	('1150', 'b:ParameterQuery', 'log Kow', 5.0, 'GreaterThan', False)
	('1153', 'b:ParameterQuery', 'log Kow', 8.0, 'LessThan', True)
	('1156', 'b:ParameterQuery', 'Molecular weight', 1000.0, 'LessThan', True)


Here, we can see exactly which constraints the chemical does and does not satisfy. Below, we will see that for new categories we get just the overall query result.

In [21]:
# With a chemical and a new category

printTree(category_title = 'Epoxides', one_chem = my_chem)

('CustomQuery', 'LogicalQuery', 'And', False)
	('CustomQuery', 'b:ParameterQuery', 'Molecular Weight', 1000, 'LessThan', 'does not process')
	('CustomQuery', 'LogicalQuery', 'Or', 'does not process')
		('CustomQuery', 'b:StructureQuery', 'C1OC1', 'does not process')
		('CustomQuery', 'b:StructureQuery', 'C1CN1', 'does not process')


In [22]:
# Without a chemical

printTree(category_title = 'Benzotriazoles (Chronic toxicity)')

('1159', 'LogicalQuery', 'And', 'does not process')
	('1148', 'b:StructureQuery', 'c12c(cccc1)[#7]=,:[#7][#7v3]2', 'does not process')
	('1150', 'b:ParameterQuery', 'log Kow', 5.0, 'GreaterThan', 'does not process')
	('1153', 'b:ParameterQuery', 'log Kow', 8.0, 'LessThan', 'does not process')
	('1156', 'b:ParameterQuery', 'Molecular weight', 1000.0, 'LessThan', 'does not process')
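
The trees above can be read as nested And/Or nodes over leaf tests. The sketch below re-evaluates the 'Benzotriazoles (Chronic toxicity)' tree for DTXSID2036405 using a hypothetical nested-dictionary layout; it is not the package's internal representation. It assumes that 'log Kow' corresponds to the `logp` attribute and 'Molecular weight' to `mol_weight`, and it takes the structure-match result from the printTree output instead of computing it with RDKit.

```python
import operator

# Comparison names as they appear in the printed trees.
COMPARATORS = {"GreaterThan": operator.gt, "LessThan": operator.lt}


def evaluate(node, chem):
    """Recursively evaluate a hypothetical nested-dict version of a logic tree."""
    kind = node["type"]
    if kind == "And":
        return all(evaluate(child, chem) for child in node["children"])
    if kind == "Or":
        return any(evaluate(child, chem) for child in node["children"])
    if kind == "Parameter":
        compare = COMPARATORS[node["op"]]
        return compare(chem[node["attribute"]], node["threshold"])
    if kind == "Structure":
        # Substructure matching needs RDKit; here the result is supplied directly.
        return chem["structure_matches"][node["smarts"]]
    raise ValueError(f"unknown node type: {kind}")


SMARTS = "c12c(cccc1)[#7]=,:[#7][#7v3]2"
benzotriazoles_chronic = {
    "type": "And",
    "children": [
        {"type": "Structure", "smarts": SMARTS},
        {"type": "Parameter", "attribute": "logp", "op": "GreaterThan", "threshold": 5.0},
        {"type": "Parameter", "attribute": "logp", "op": "LessThan", "threshold": 8.0},
        {"type": "Parameter", "attribute": "mol_weight", "op": "LessThan", "threshold": 1000.0},
    ],
}

# logp and mol_weight for DTXSID2036405 as listed in readme_examples.csv.
chem = {"logp": -0.75, "mol_weight": 78.003, "structure_matches": {SMARTS: False}}

print(evaluate(benzotriazoles_chronic, chem))
```

The And node fails as soon as one child fails, which matches the top-level False in the tree printed for this chemical.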
