# Rationale
Several datasets may be useful for constructing our toxicity benchmarking dataset. These are currently compiled in ToxValDB. I will assess the following characteristics of each database to determine whether it is suitable for inclusion in the dataset:
- Model organisms used
- Treatments used
- Phenotypes assessed
- Size
- Quality / integrity

I will start by assessing the first two datasets listed in the repo:
- ECOTOX
- ECHA IUCLID

In [3]:
import pandas as pd

# ECOTOX
Contains point-of-departure (POD) data derived predominantly from the peer-reviewed literature, covering aquatic life, terrestrial plants, and terrestrial wildlife.

In [51]:
ecotox = pd.read_excel("/Users/sethhowes/Desktop/FS-Tox/data/raw/toxval_all_res_toxval_v94_ECOTOX.xlsx")

In [11]:
ecotox.head()

Unnamed: 0,dtxsid,casrn,name,source,subsource,qc_status,risk_assessment_class,human_eco,toxval_type,toxval_type_original,...,document_name,toxval_type_category,source_url,subsource_url,toxval_id,source_hash,source_table,details_text,chemical_id,priority_id
0,DTXSID901034504,78565-30-7,Swascofix P 14,ECOTOX,EPA ORD,pass,mortality,eco,LC50,LC50,...,-,LC,https://cfpub.epa.gov/ecotox/,-,931901,852f151c1e61779d8e5dd24c268199ea,direct_load,ECOTOX Details,ToxVal00034_0bd9881f481b02f8,4
1,DTXSID901034504,78565-30-7,Swascofix P 14,ECOTOX,EPA ORD,pass,mortality,eco,LC50,LC50,...,-,LC,https://cfpub.epa.gov/ecotox/,-,931902,550f670d6c1b236c1f4e99688d4e4de4,direct_load,ECOTOX Details,ToxVal00034_0bd9881f481b02f8,4
2,DTXSID8040858,9034-40-6,Gonadotrophin-releasing hormone,ECOTOX,EPA ORD,pass,reproduction,eco,NOEC,NOEC,...,-,NOEC,https://cfpub.epa.gov/ecotox/,-,931903,e3496ea6bd4567188787338cf048e1a1,direct_load,ECOTOX Details,ToxVal00034_bfc2c5b36e8e4418,4
3,DTXSID8040858,9034-40-6,Gonadotrophin-releasing hormone,ECOTOX,EPA ORD,pass,other,eco,LOEC,LOEC,...,-,LOEC,https://cfpub.epa.gov/ecotox/,-,931904,00bc590d6cc842738c49b3c48a337fbe,direct_load,ECOTOX Details,ToxVal00034_bfc2c5b36e8e4418,4
4,DTXSID8040858,9034-40-6,Gonadotrophin-releasing hormone,ECOTOX,EPA ORD,pass,other,eco,LOEC,LOEC,...,-,LOEC,https://cfpub.epa.gov/ecotox/,-,931905,1931ecfb04ce80e212115aecb82e1c6f,direct_load,ECOTOX Details,ToxVal00034_bfc2c5b36e8e4418,4


## Model

In [25]:
ecotox["common_name"].value_counts()

common_name
Water Flea                   36812
Zebra Danio                  29211
Rainbow Trout                26837
Green Algae                  21228
Fathead Minnow               17701
                             ...  
Long-Nose Skate                  1
Thornback Ray                    1
Spotted Barb                     1
Golden Agae                      1
Short-Horned Fly Suborder        1
Name: count, Length: 2352, dtype: int64

The majority of these species are not mammalian, so we may need to exclude most records, or potentially the entire dataset.

In [26]:
ecotox["lifestage"].value_counts()

lifestage
-                     189356
juvenile               32788
larva                  32470
adult                  30138
embryo                 29162
                       ...  
female gametophyte         3
F11 generation             3
cocoon                     2
lactational                2
boot                       2
Name: count, Length: 107, dtype: int64

It would make sense to exclude records from the larval / egg life stages.
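A minimal sketch of that exclusion, using a toy frame in place of `ecotox`. The set of early life stages is an assumption for illustration; the labels would need checking against the full list of 107 `lifestage` values.

```python
import pandas as pd

# Toy stand-in for the ecotox frame.
toy = pd.DataFrame({
    "common_name": ["Zebra Danio", "Zebra Danio", "Rainbow Trout", "Water Flea"],
    "lifestage": ["larva", "adult", "embryo", "-"],
})

# Assumed (hypothetical, not exhaustive) set of early life stages to exclude.
EARLY_LIFESTAGES = {"larva", "embryo", "egg"}

# Keep records whose lifestage is not in the exclusion set.
# Records with an unspecified lifestage ("-") are kept here, which is a choice.
later_stages = toy[~toy["lifestage"].isin(EARLY_LIFESTAGES)]
print(later_stages)
```

Whether to keep or drop the unspecified (`-`) life stages is a separate decision, since they make up a large share of the records.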

## Treatments

In [16]:
ecotox["name"]

0                                 Swascofix P 14
1                                 Swascofix P 14
2                Gonadotrophin-releasing hormone
3                Gonadotrophin-releasing hormone
4                Gonadotrophin-releasing hormone
                           ...                  
417277    Antimony potassium tartrate trihydrate
417278    Antimony potassium tartrate trihydrate
417279    Antimony potassium tartrate trihydrate
417280    Antimony potassium tartrate trihydrate
417281                              Lead nitrate
Name: name, Length: 417282, dtype: object

I can already see that gonadotrophin-releasing hormone (GnRH) is included in this list, which suggests that biologically relevant compounds are likely to be present in this dataset.

There are four different chemical identifiers that we can use to identify canonical SMILES:
- DTXSID
- CASRN
- DSSTox name
- Primary key in source_chemical table??

When I look at these chemicals, it is not immediately clear to me whether they may be relevant or not.

In [61]:
ecotox["exposure_route"].value_counts()

exposure_route
aqueous          391961
oral               9140
injection          6987
in vitro           6535
environmental      2307
tdermal             195
multiple            157
Name: count, dtype: int64

This reaffirms that most exposures were aqueous, i.e. aquatic organisms exposed via the surrounding water. These records are unlikely to be relevant to our dataset.

## Phenotypes

In [18]:
ecotox["toxval_type_supercategory"].value_counts()

toxval_type_supercategory
Point of Departure          190980
Lethality Effect Level      126211
Effect Concentration         51331
Effect Level                 15486
other                        13345
Effect Time                   9529
Inhibition Concentration      7056
Exposure Limit                2907
Effective Residue Level        273
Effect Dose                    164
Name: count, dtype: int64

I'm not sure how these categories differ from each other. They don't seem to add much useful information.

In [20]:
ecotox["toxval_type"].value_counts()

toxval_type
LC50                   110755
NOEC                    94903
LOEC                    74834
EC50                    40838
NOEL                    12851
                        ...  
LT50@ 4.74 AI mgL           1
LT50@ 0.0019 AI mgL         1
LT50@ 72.1 AI mgL           1
LT50@ 0.0394 AI mgL         1
LT50@ 11000 AI mgL          1
Name: count, Length: 1786, dtype: int64

It seems like these effect types might be useful, but may be worth dropping all of the non-major categories (e.g. LT50@...)
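One way to drop the non-major categories is a simple frequency cut-off, sketched here on toy data. The threshold `MIN_RECORDS` is a hypothetical value, not something defined by ToxValDB.

```python
import pandas as pd

# Toy stand-in for ecotox["toxval_type"]: two common types and two one-off LT50 variants.
toy = pd.DataFrame({
    "toxval_type": ["LC50"] * 5 + ["NOEC"] * 4
                   + ["LT50@ 4.74 AI mgL", "LT50@ 72.1 AI mgL"],
})

MIN_RECORDS = 3  # hypothetical cut-off; would need tuning on the real data

# Count records per type and keep only types at or above the cut-off.
counts = toy["toxval_type"].value_counts()
major_types = counts[counts >= MIN_RECORDS].index
major_only = toy[toy["toxval_type"].isin(major_types)]
print(sorted(major_types))
```

A frequency cut-off is only one option; an explicit whitelist of effect types would be more transparent if we already know which ones we want.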

In [21]:
ecotox["study_type"].value_counts()

study_type
mortality            174346
population            41389
biochemistry          31520
neurotoxicity         26754
genetics              26075
enzymes               21634
growth                20942
accumulation          16772
reproduction          13058
physiology            11441
developmental         10786
in vitro               5647
morphology             5584
hormone                3844
multiple               2196
histology              2142
unspecified             885
immunotoxicity          845
avoidance               502
Injury                  488
ecosystem process       366
transcriptomics          66
Name: count, dtype: int64

None of these categories seem that irrelevant, apart from maybe unspecified or ecosystem process.

In [22]:
ecotox["study_duration_class"].value_counts()

study_duration_class
chronic    417282
Name: count, dtype: int64

Only chronic studies are included in the dataset.

## Size

In [47]:
ecotox.shape

(417282, 68)

This row count (417282) differs from the source_count table, which states there should be 364976 records. This is slightly suspicious and worth following up.
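A possible first check on the discrepancy is to look for repeated records. This is only a sketch on toy data: it assumes `toxval_id` and `source_hash` (both present in the column list above) are meant to be unique per record, which I have not confirmed.

```python
import pandas as pd

# Toy stand-in with one repeated record.
toy = pd.DataFrame({
    "toxval_id": [931901, 931902, 931903, 931903],
    "source_hash": ["a1", "b2", "c3", "c3"],
})

n_rows = len(toy)
n_unique_ids = toy["toxval_id"].nunique()        # distinct record identifiers
n_unique_hashes = toy["source_hash"].nunique()   # distinct source hashes
n_exact_duplicates = int(toy.duplicated().sum()) # rows identical to an earlier row

print(n_rows, n_unique_ids, n_unique_hashes, n_exact_duplicates)
```

If the identifiers turn out to be unique in the real data, duplication would not explain the gap, and the difference may instead come from how source_count was computed.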

## Quality

Each record has a qc status. If the record fails qc, it details why.

In [30]:
ecotox["qc_status"][ecotox["qc_status"] != "pass"].value_counts()

qc_status
fail:toxval_type not specified     11181
fail:toxval_units not specified       14
Name: count, dtype: int64

It would make sense to remove records without a toxval_type (e.g. NOAEL).

In [34]:
ecotox.columns[(ecotox.isna().sum() != 0)]

Index(['common_name'], dtype='object')

This approach to detecting missingness is not robust: there are clearly columns with missing data, but the missing cells contain a single '-' and so are not identified as null values.
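A quick way to see how widespread the placeholder is would be to count cells equal to '-' per column. A minimal sketch on toy data, assuming '-' is the only placeholder used:

```python
import pandas as pd

# Toy stand-in for the ecotox frame.
toy = pd.DataFrame({
    "sex": ["-", "-", "-"],
    "lifestage": ["adult", "-", "larva"],
    "toxval_type": ["LC50", "NOEC", "LC50"],
})

# Elementwise comparison gives a boolean frame; summing counts the placeholders.
placeholder_counts = (toy == "-").sum()
print(placeholder_counts[placeholder_counts != 0])
```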

In [35]:
ecotox['sex']

0         -
1         -
2         -
3         -
4         -
         ..
417277    -
417278    -
417279    -
417280    -
417281    -
Name: sex, Length: 417282, dtype: object

In [36]:
ecotox.replace(to_replace='^-$', value=None, regex=True, inplace=True)
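An alternative to replacing after the fact is to declare '-' as a missing-value marker at load time through the `na_values` argument, which `pd.read_excel` and `pd.read_csv` both accept. The sketch below uses an in-memory CSV with made-up rows rather than the real Excel file.

```python
import io

import pandas as pd

# Made-up rows; the real data would be loaded with pd.read_excel(path, na_values=["-"]).
csv_text = "name,sex,strain\nLead nitrate,-,-\nDibromomethane,male,Wistar\n"

# Cells that are exactly "-" are parsed as NaN.
toy = pd.read_csv(io.StringIO(csv_text), na_values=["-"])
missing = toy.isna().sum()
print(missing)
```

This avoids relying on how `replace` treats `value=None`, which has differed between pandas versions.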

In [46]:
ecotox_na = ecotox.isna().sum()

ecotox_na[ecotox_na != 0]

toxval_type                11181
toxval_subtype              1503
toxval_units                  14
toxval_units_original          5
study_duration_units          12
species_original               5
common_name                    5
habitat                   417282
strain                    417282
strain_group              417282
strain_original           417282
sex                       417282
sex_original              417282
generation                417282
lifestage                 189356
exposure_method             4248
exposure_form             417282
exposure_form_original    417282
media                       2383
journal                   417282
volume                    417282
issue                     417282
url                       417282
document_name             417282
toxval_type_category       13345
subsource_url             417282
dtype: int64

There doesn't seem to be a major amount of missing data for the important variables. We should probably exclude records with a missing toxval_type.
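A minimal sketch of that exclusion with `dropna(subset=...)`. The choice of `CORE_COLUMNS` is an assumption for illustration; which variables count as core is still to be decided.

```python
import pandas as pd

# Toy stand-in after '-' has been converted to missing values.
toy = pd.DataFrame({
    "toxval_type": ["LC50", None, "NOEC"],
    "toxval_units": ["mg/L", "mg/L", None],
    "sex": [None, None, None],  # entirely missing, as in ECOTOX
})

CORE_COLUMNS = ["toxval_type", "toxval_units"]  # assumed core variables

# Drop a record only if one of the core columns is missing;
# fully empty non-core columns such as sex do not trigger a drop.
complete = toy.dropna(subset=CORE_COLUMNS)
print(len(toy), len(complete))
```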

# ECHA IUCLID

Data from in vivo studies submitted to ECHA/REACH and accessed through IUCLID dump.

In [53]:
echa = pd.read_excel("/Users/sethhowes/Desktop/FS-Tox/data/raw/toxval_all_res_toxval_v94_ECHA IUCLID.xlsx")

## Model

In [56]:
echa["common_name"].value_counts()

common_name
Rat                                        126647
Rabbit                                      15888
Mouse                                       15603
Not specified                                3315
Dog                                          1896
                                            ...  
Guinea Pig, Hamster, Mouse, Rabbit, Rat         1
Dog, Guinea Pig, Mouse, Rat                     1
Frogs                                           1
Guniea Pig, Monkey, Rabbit                      1
Hydra                                           1
Name: count, Length: 98, dtype: int64

Seems to be a much more appropriate set of animal models

## Treatment

In [57]:
echa["name"]

0                                            Dibromomethane
1         Tall-oil fatty acids oligomeric reaction produ...
2                                Tetrabutylammonium bromide
3                                Di(2-ethylhexyl) phthalate
4                                     Manganese sesquioxide
                                ...                        
167660                              Manganese sulfate (1:1)
167661                              Manganese sulfate (1:1)
167662                              Manganese sulfate (1:1)
167663                              Manganese sulfate (1:1)
167664                              Manganese sulfate (1:1)
Name: name, Length: 167665, dtype: object

Again, I'm unsure how to identify which chemicals are relevant from their names alone.

In [60]:
echa["exposure_route"].value_counts()

exposure_route
oral          85478
inhalation    32359
dermal        26457
-             23371
Name: count, dtype: int64

Much more appropriate set of exposure routes here.

## Phenotype

In [62]:
echa["toxval_type_supercategory"].value_counts()

toxval_type_supercategory
Point of Departure          83141
Lethality Effect Level      78714
other                        5486
Effect Time                   140
Inhibition Concentration       69
Effect Concentration           58
Exposure Limit                 39
Effect Dose                    16
Toxicity Value                  2
Name: count, dtype: int64

A similar set of categories to ECOTOX.

In [65]:
echa["toxval_type_category"].value_counts()

toxval_type_category
LD                                 64088
NOAEL                              45877
LC                                 14515
NOEL                               10712
NOAEC                               9917
LOAEL                               9192
-                                   4893
LOAEC                               3075
LEL                                 1151
LOEL                                1064
NOEC                                 915
LEC                                  500
discriminating dose                  474
LOEC                                 342
BMC                                  281
LT                                   140
BMD                                   90
TD                                    87
IC                                    67
RD                                    60
EC                                    58
IHT                                   40
MTD                                   28
NEL                                 

Lots of different lower level categories.

## Size

In [66]:
echa.shape

(167665, 68)

## Quality

In [67]:
echa["qc_status"][echa["qc_status"] != "pass"].value_counts()

qc_status
fail:dtxsid not specified          15525
fail:toxval_units not specified     5992
fail:toxval_type not specified      3346
fail:human_eco not specified        3316
fail:toxval_numeric<0                 23
Name: count, dtype: int64

A wider range of qc exclusion criteria, all of which seem fair.

In [69]:
echa.replace(to_replace='^-$', value=None, regex=True, inplace=True)

In [70]:
echa_na = echa.isna().sum()

echa_na[echa_na != 0]

casrn                             14843
toxval_type                        4823
toxval_type_original               3698
toxval_subtype                   167665
toxval_units                       6874
toxval_units_original              6841
study_duration_class             167665
study_duration_class_original    167665
study_duration_value_original     64135
study_duration_units              65615
study_duration_units_original     64174
species_original                   3253
ecotox_group                          1
habitat                          167665
strain                            28533
strain_group                      28944
strain_original                   28533
sex                               21662
sex_original                       9275
generation                       167665
lifestage                        167665
exposure_route                    23371
exposure_route_original           23371
exposure_method                   49929
exposure_method_original          42296


Much greater degree of missingness here for relevant variables.
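Because the two datasets differ in size, raw counts are hard to compare directly. A sketch of a side-by-side comparison of missing fractions, using toy frames and a hypothetical helper `missing_fraction`:

```python
import pandas as pd

def missing_fraction(df, columns):
    """Fraction of missing values in each of the given columns."""
    return df[columns].isna().mean()

# Toy stand-ins for the two frames after '-' has been converted to missing.
ecotox_toy = pd.DataFrame({"sex": [None, None],
                           "exposure_route": ["aqueous", "oral"]})
echa_toy = pd.DataFrame({"sex": ["male", None],
                         "exposure_route": ["oral", None]})

cols = ["sex", "exposure_route"]
comparison = pd.DataFrame({
    "ECOTOX": missing_fraction(ecotox_toy, cols),
    "ECHA IUCLID": missing_fraction(echa_toy, cols),
})
print(comparison)
```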

# Thoughts on selecting datasets

I think the exploration above makes it clear that a non-systematic approach to curation is not scientifically robust. It will likely lead to inappropriate inclusion / exclusion of data. Here are some suggestions for how we can select data systematically:

- Models
    - Create whitelist of model organisms, then require that a dataset contain above a given threshold of whitelisted species
    - Create whitelist of model organisms then only include those records from a dataset with these animals
    - Take either of the approaches above but whitelist high level species name (ecotox_group)
- Treatments
    - Convert each of the three chemical identifiers (DTXSID, CASRN, DSSTox name) to canonical SMILES / InChI
    - Search chemical databases to determine whether there is a history of these drugs being studied for medical purposes (e.g. they have an entry in the pharmacology section of PubChem)
    - Exclude records flagged as ecological rather than human studies (the human_eco column)
- Phenotypes
    - Create whitelist of administration routes and include only those records with whitelisted administration routes
    - Exclude in vitro data for 'risk_assessment_class'
    - Unsure which variables to select / exclude here
- Size
    - Unsure of relevance of this to data selection
- Quality / integrity
    - There is an associated quality score (1-5) attached to each dataset. I don't recommend using this as it is not clear how this score is defined
    - Remove records with missing values for core variables
    - Exclude those records failing ToxVal's QC criteria. They seem fair
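
Several of the suggestions above can be combined into one filter. The following is a sketch under stated assumptions, not a finished pipeline: `SPECIES_WHITELIST`, `ROUTE_WHITELIST`, `MIN_WHITELISTED_FRACTION`, and both helper functions are hypothetical, and the toy frame stands in for a real ToxValDB source.

```python
import pandas as pd

# Hypothetical whitelists and threshold; the real ones need to be agreed on.
SPECIES_WHITELIST = {"Rat", "Mouse", "Rabbit", "Dog"}
ROUTE_WHITELIST = {"oral", "inhalation", "dermal"}
MIN_WHITELISTED_FRACTION = 0.5

def whitelisted_fraction(df):
    """Dataset-level check: share of records from whitelisted species."""
    return df["common_name"].isin(SPECIES_WHITELIST).mean()

def select_records(df):
    """Record-level filter: whitelisted species and route, and QC pass."""
    keep = (
        df["common_name"].isin(SPECIES_WHITELIST)
        & df["exposure_route"].isin(ROUTE_WHITELIST)
        & (df["qc_status"] == "pass")
    )
    return df[keep]

# Toy stand-in for one source dataset.
toy = pd.DataFrame({
    "common_name": ["Rat", "Rat", "Water Flea", "Mouse"],
    "exposure_route": ["oral", "oral", "aqueous", "dermal"],
    "qc_status": ["pass", "fail:toxval_type not specified", "pass", "pass"],
})

fraction = whitelisted_fraction(toy)
dataset_ok = bool(fraction >= MIN_WHITELISTED_FRACTION)
selected = select_records(toy)
print(fraction, dataset_ok, len(selected))
```

The dataset-level threshold and the record-level filter correspond to the first two options under Models; they can be used separately or together.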
    
    

    