# Introduction

Read-across is an established data gap filling technique within analogue and category approaches that has been used for decades to address different regulatory purposes. Perhaps the earliest documented example of read-across was published by Hanway and Evans [@hanway_2000], who substantiated an analogue approach using acute rodent oral toxicity and Ames mutagenicity as supporting data. The workflow for performing read-across has remained largely unchanged over the years, though the introduction of data streams such as high throughput screening, transcriptomics and metabolomics data has offered new possibilities for how source analogues can be identified and evaluated. The most definitive technical guidance describing read-across remains that published by the Organisation for Economic Co-operation and Development (OECD) [@oecd_guidance_2017]. Although this was last revised in 2014, a third edition is forthcoming in 2025 which includes updated sections on how high throughput data can be used and how uncertainty can be characterised and documented, as well as updates for specific types of analogue/category approaches including nanomaterials, Unknown or Variable Composition, Complex Reaction Products, or Biological Materials (UVCBs) and metal containing substances. One limitation of the technical guidance, notwithstanding this anticipated update, is a lack of case studies demonstrating what information a successful and robust read-across assessment should contain. The forthcoming guidance has certainly benefited from the OECD Integrated Approaches to Testing and Assessment (IATA) Case studies project, but until now there has not been a concerted effort to collect examples that showcase what a robust read-across justification looks like, what level of supporting information is warranted and how this might differ depending on the decision context and/or the regulatory jurisdiction.

The European Union's (EU) Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH) guidance offers additional indirect guidance on how the European Chemicals Agency (ECHA) assesses read-across cases based on its own read-across assessment framework (RAAF) [@echa_read-across_2017], but there are no published examples showcasing the justification and documentation provided in dossiers which were successful. Read-across practitioners have only the RAAF for indirect guidance and access to the published ECHA decision letters that describe any shortcomings in the dossiers submitted. Identifying which dossiers have utilised read-across is not a trivial undertaking. Moreover, cross referencing a dossier to its associated decision letter will also surface shortcomings in the dossier beyond read-across, e.g. poor substance characterisation. In Patlewicz et al [@patlewicz_systematic_2024], an algorithmic approach was undertaken to help identify dossiers where read-across had been performed, at least for oral repeated dose and repro-developmental study outcomes, but the analysis fell short of identifying how many of the dossiers had ultimately been successful in the application of read-across. Moreover, even having identified the dossiers and the specific source analogue used in each case, the rationale behind the identification, evaluation and justification of those analogues remained opaque. The study record relying on a source substance was known, such that an association between that and the target substance could be established, but an assumption had to be made as to whether those source substances arose from a category or analogue approach. In addition, the process by which candidate analogues were identified and rationalised to determine that the source substance data were sufficient to address the data gap was not available.
That said, the analysis did shine a light on some of the challenges with current read-across, namely a disconnect between a regulation stipulating that structurally similar analogues should be identified and the finding that many of the source analogues ultimately used were not particularly structurally similar or that, in the majority of cases, a more similar analogue could have been selected. The most likely reason for this disconnect is a lack of access to available toxicity data. Nonetheless the study was a useful exploration of the ways in which analogues could be characterised and compared relative to their target substances. Assuming the read-across inference itself was sufficient for purpose, in many cases a structure-based algorithmic approach gave rise to similar predictions of toxicity potency.

Since this study was published, Roe et al [@roe_systematic_2025] reported on a large undertaking to retrieve all testing proposals submitted by REACH registrants over the last 15 years, specifically those where read-across had been employed as an adaptation, to explore which aspects of a read-across justification were most significant in driving acceptance. So-named assessment elements, largely aligned to those defined by ECHA in its RAAF, were used to compare the odds of acceptance with respect to these specific aspects. For example, a category approach tended to have higher odds of acceptance than an analogue approach, as did a category approach substantiated with bridging studies or with identification of the metabolites of source and target substances. High chemical similarity was also a strong predictor of acceptable read-across.

In this study, a concerted effort was made to perform a targeted search of the primary and grey literature to identify read-across cases where the approach had been accepted for the stated context of interest, or at least where there had been a more comprehensive description of the workflow: the strategy for identifying candidate analogues, the tools or expertise used, and how the analogues were subsequently evaluated for the endpoint of interest. The study was limited to repeated dose toxicity outcomes and focused primarily on the oral route of exposure to mitigate variability in the types of read-across scenarios under consideration. The main aims of this study were to: 1) conduct a targeted literature search to gather as many read-across examples as possible; 2) define a schema to capture relevant information in a harmonised format to facilitate subsequent data processing; and 3) investigate different approaches to evaluate the similarities of the analogues identified as well as ways in which these could be codified for predictive profiling. The overall workflow summarising all lines of investigation in this study is captured in @fig-wkflow.

![Workflow of the overall study](icf-workflow.drawio.png){#fig-wkflow}


# Materials and Methods

## Identification of read-across examples

Initially read-across cases were compiled from the [EPA Provisional Peer-Reviewed Toxicity Values (PPRTV) assessments](https://www.epa.gov/pprtv), [OECD IATA case studies](https://www.oecd.org/chemicalsafety/risk-assessment/iata/) and [OECD High Production Volume (HPV) categories](https://hpvchemicals.oecd.org/UI/Search.aspx). A web query was created in Python to query each of the ~400 PPRTVs (first accessed June 2021) and identify only those PPRTV assessments where an Appendix A was available in the final report, as this was a crude proxy for whether a read-across assessment might have been performed. Each of the 37 cases initially identified was then manually checked and information was extracted from the reports themselves. The HPV examples were reviewed to identify only those cases where there was a Screening Information Dataset (SIDS) Initial Assessment Report (SIAR) for categories containing members that were organic substances. Inorganics, mixtures and UVCBs were considered out of scope. All OECD IATA case studies tagged as read-across were identified and each was extracted if the endpoints of interest were repeated dose toxicity outcomes.

Initially articles on read-across were retrieved based on known authors who had published in the field. Subsequently, a more systematic literature search was conducted against PubMed using the Abstract Sifter tool version 7 [@baker_abstract_2017] with the following search query: `"Read-across [tiab] OR Read across [tiab] OR Chemical grouping [tiab] OR Category Approach [tiab] or category-approach [tiab] OR Analogue approach [tiab] OR Analog approach [tiab] OR Surrogate approach OR Grouping concept]"`. 
This netted 934 results as of 18th July 2023. Each of the abstracts was manually reviewed and tagged 'yes', 'no' or 'maybe' for subsequent extraction. Certain articles were potentially duplicative of cases already identified based on the initial authors or already captured as part of the IATA Case studies or HPV categories. These were tagged as far as possible. Approximately 100 articles were labelled as 'yes' with a further 76 as 'maybe' for subsequent extraction. The full articles were then sourced and screened to determine whether they met the minimal criterion of including a read-across assessment for a repeated dose oral toxicity endpoint. In certain cases, articles were found to contain examples of groupings, e.g. clarifying approaches for evaluating analogues, but without any specific read-across prediction. These were extracted as 'qualitative' examples to augment the dataset.


## Schema to capture the Information

A structured Excel sheet was created to capture relevant information from each case. This included the target substance(s) being assessed, the candidate source analogues, the toxicity value information being read across, as well as the rationale used to identify and evaluate the analogues (so-named analogue evidence streams). A Standard Operating Protocol (SOP) was drafted to document the approach in an effort to ensure consistency in how extractions were undertaken. Extractions were performed by one individual and then checked for completeness and consistency by a second individual. The SOP was updated as necessary to capture revisions as the breadth of case studies evolved. This was inevitable given that the examples drawn from the OECD HPVs, EPA PPRTVs and OECD IATA Case studies were structured per specific templates, whereas there was more variability in how literature studies described any read-across. Qualitative example extractions were limited to capturing chemical substance information.


## Chemical information 

Since many of the evaluations to be investigated were chemical specific, the read-across case information needed to be augmented with structural information for both target and source analogues. Target and source analogue identities were queried using the EPA CompTox Chemicals Dashboard [@williams_comptox_2017] to retrieve Chemical Abstracts Service (CAS) Registry numbers (CASRN) and structures using Simplified Molecular Input Line Entry System (SMILES) as well as Distributed Structure-Searchable Toxicity (DSSTox) Substance Identifier (DTXSID) information [@grulke_epas_2019]. In cases where this information was not readily available in the Dashboard, the source articles were used to retrieve additional chemical identification information that could be used to re-run queries in the Dashboard or PubChem. QSAR ready SMILES (desalted, stereochemistry removed) were also retrieved from the Dashboard.

## Landscape Evaluation

An initial summary overview of the read-across cases was performed to better understand the landscape of the chemicals under study, the use cases explored, whether a category or analogue approach had been employed, and what tools or approaches were used to identify analogues. The freely available web application ClassyFire [@djoumbou_feunang_classyfire_2016] (available as an Application Programming Interface (API) within the [Cheminformatics Modules](https://www.epa.gov/comptox-tools/cheminformatics-analysis-modules-resource-hub)) was used to categorise all discrete organic structures into taxonomy classes using its chemistry ontology scheme. ClassyFire assigns chemicals into a taxonomy consisting of >4800 different categories. The taxonomy comprises 11 different levels such as Kingdom, SuperClass, Class, SubClass etc.


## Similarity context evaluation {#sec-sim}

Similarity contexts evaluated included structure, structural alerts, physicochemical properties and predicted metabolism (summarised in @tbl-sim). Similarity in structure made use of Morgan chemical fingerprints (1024 bits, radius=3) generated using the open source Python library RDKit [@landrum_rdkit_nodate], from which a pairwise Jaccard similarity matrix was constructed. Structural alerts were generated by profiling substances using the SMILES Arbitrary Target Specification (SMARTS) queries contained in the OECD Protein and DNA binding alert schemes available as supplementary information in the original publications by Enoch et al [@enoch_review_2010;@enoch_review_2011]. A binary matrix of presence and absence of specific alerts was constructed from which a pairwise Jaccard similarity matrix was calculated. Physicochemical properties relied upon OPEn saR App (OPERA) predictions [@mansouri_opera_2018] of logKow (the log of the octanol-water partition coefficient), together with calculated molecular weight (MW) and the number of Hydrogen Bond Donors (HBD) and Acceptors (HBA). A pairwise distance matrix using a normalised Euclidean distance was then calculated. The resulting squareform distance matrix was subtracted from 1 to derive a similarity matrix consistent with the other 2 contexts: structural similarity and structural alert similarity. The metabolism of the substances was simulated using the OASIS TIMES *in vitro* rat liver model [@mekenyan_systematic_2004]. Using the tree report output based on default settings, metabolic similarity was then quantified by comparing the simulated metabolic graphs using the Weisfeiler Lehman (WL) graph kernel similarity metric [@shervashidze_weisfeiler-lehman_2011]. Nodes were characterised by their Morgan fingerprints whereas edges were one-hot encodings of the transformation reactions from TIMES. Kernel scores were computed using the Grakel Python package [@siglidis_grakel_2020].

|      Similarity context    |   Characterisation    |    Metric     | Tool     |   
|:-----:|:-----:|:----:|:-------:|
|    Structure   |  Morgan chemical fingerprint  (radius=3, bitSize = 1024)  | Jaccard |    RDKit    |  
| Structural Alerts | Using OECD DNA and Protein binding Alerts to create a fingerprint vector |Jaccard  |  RDKit|
| Physicochemical | Predictions of logKow, HBA, HBD, MW |normalised Euclidean  |   OPERA |
|     Metabolic  |  Metabolic graphs where nodes were Morgan fingerprints and transformation reactions were edges    | WL   |   Grakel, TIMES |

: Similarity contexts selected for evaluation and how they were characterised. {#tbl-sim}
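The Jaccard and distance-derived similarities in @tbl-sim can be illustrated with a minimal, standard-library sketch. It is not the code used in the study: fingerprints are represented simply as collections of "on" bit indices, and because the exact normalisation of the Euclidean distance is not specified above, min-max scaling of each property followed by division by the maximum attainable distance is an assumption. The function names are hypothetical.

```python
import math


def jaccard_similarity(bits_a, bits_b):
    """Jaccard (Tanimoto) similarity between two collections of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def min_max_scale(rows):
    """Scale each property column to [0, 1] (assumed normalisation)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # guard constant columns
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]


def physchem_similarity(rows):
    """Pairwise similarity matrix computed as 1 - normalised Euclidean distance."""
    scaled = min_max_scale(rows)
    max_dist = math.sqrt(len(scaled[0]))  # largest distance possible after scaling
    n = len(scaled)
    return [
        [1.0 - math.dist(scaled[i], scaled[j]) / max_dist for j in range(n)]
        for i in range(n)
    ]
```

Each row passed to `physchem_similarity` would hold one substance's property values, for example logKow, MW, HBD and HBA in a fixed order.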


## Quantifying the contribution of similarity contexts {#sec-bayes}

A Bayesian logistic regression was created to investigate the contribution of the different pairwise similarity contexts and to quantify the distributions of their coefficients. Pairs of substances that were members of the same read-across case were tagged as similar (denoted by a 1) whereas pairs of substances drawn from different cases were tagged as dissimilar (denoted by a 0). The different pairwise similarity scores (structural similarity, physicochemical similarity, structural alert similarity or WL similarity) formed the feature set. This created a highly imbalanced dataset with only 2943 analogue pairs labelled as similar and 187,710 as dissimilar. To balance the dataset, the dissimilar pairs were randomly downsampled so that an equal number of similar and dissimilar analogue pairs remained. This balanced set was then split 75:25 into a training and a test set. The training set was used to train the Bayesian logistic regression whereas the test set was reserved to evaluate overall performance in the predictions. The logistic regression was set up such that the target outcome (whether a pair of substances was similar) was assumed to follow a Bernoulli distribution with the logit as the link function. To build the Bayesian logistic regression, normally distributed priors were assigned to the coefficients of each of the different similarity metrics (structural similarity, structural alert similarity, physicochemical similarity, WL similarity). After specifying the priors, Markov Chain Monte Carlo (MCMC) simulations were used to approximate the posterior distributions. The [PyMC](https://www.pymc.io/welcome.html) library in Python was employed to draw samples from the posterior. The sampling algorithm used was the No-U-Turn Sampler (NUTS), the default sampler for continuous variables, which is based on Hamiltonian dynamics and tunes its parameters automatically.
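The data preparation and model structure described above can be sketched as follows. This is a standard-library illustration only, not the PyMC model used in the study. It shows the random downsampling with a 75:25 split and the unnormalised log posterior that an MCMC sampler would explore. The function names, the random seed, the zero prior mean and the prior standard deviation of 1 are all assumptions.

```python
import math
import random


def balance_and_split(similar, dissimilar, train_fraction=0.75, seed=0):
    """Downsample dissimilar pairs to match the similar pairs, then split.

    Assumes there are at least as many dissimilar as similar pairs.
    """
    rng = random.Random(seed)
    sampled = rng.sample(dissimilar, k=len(similar))
    labelled = [(x, 1) for x in similar] + [(x, 0) for x in sampled]
    rng.shuffle(labelled)
    cut = int(round(train_fraction * len(labelled)))
    return labelled[:cut], labelled[cut:]


def log_posterior(beta, intercept, data, prior_sd=1.0):
    """Unnormalised log posterior of a logistic regression.

    Bernoulli likelihood with a logit link; Normal(0, prior_sd) priors on the
    intercept and on every similarity coefficient (assumed prior settings).
    """
    lp = sum(-0.5 * (b / prior_sd) ** 2 for b in [intercept, *beta])
    for features, y in data:
        eta = intercept + sum(b * x for b, x in zip(beta, features))
        # log p(y | eta) = y * eta - log(1 + exp(eta)), in a stable form
        lp += y * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return lp
```

Here each element of `similar` or `dissimilar` would be the vector of four pairwise similarity scores for one pair of substances.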


## Metric learning approaches

### Deep learning approaches

The approach followed in @sec-bayes provided one means of attempting to objectively model whether 2 substances were similar to each other and how that related to an expert assignment of similarity for read-across. The characterisation of substances relied on similarity metrics capturing different types of information. As an alternative, a deep metric learning approach was attempted using a contrastive Siamese network. This network takes 2 inputs (a target and an analogue) and trains a model to compare them. The network generates an embedding for each chemical and the loss function helps the model determine whether the 2 substances are similar or not. If the substances are similar, the network places their embeddings close together in the embedding space. The aim is to learn a way of representing substances such that similar pairs are grouped close together whereas dissimilar pairs are separated from each other.

A pairwise matrix was first constructed for all read-across substances with SMILES. If a pair of substances were both members of the same read-across case, the pair was assigned the label 0, otherwise the label 1 was assigned. Since the number of dissimilar pairs was so much greater than the number of similar pairs, the dissimilar pairs were randomly downsampled to create a balanced dataset as already described. The dataset was then split 3 ways into a training, validation and test set (80:10:10). The training and validation sets were used during model training and evaluation whereas the test set was reserved to evaluate final performance.

A Contrastive Network was constructed using a linear layer, a Rectified Linear Unit (ReLU) activation function and a final linear layer. The input to the network was a pair of substances characterised by their Morgan fingerprints. The final linear layer produced an embedding of 256 dimensions. The model was trained over 50 epochs with early stopping. The loss function used to train the model was a contrastive loss as described by Chopra et al. [@chopra_learning_2005]. Originally developed for face verification, the premise is to learn a function that maps features (in this case the input Morgan fingerprints) into a target space (the derived embedding space). The learning process minimises a discriminative loss function that drives the distance metric (Euclidean) to be small for similar pairs and large for dissimilar pairs. A pair of substances represented by their Morgan fingerprints is introduced into the network and the contrastive loss function aims to learn embeddings such that 2 similar substances have a low distance and 2 dissimilar substances (per the read-across case labels) have a large distance. A margin of 1 was set in the contrastive loss function. This margin represented the minimum distance by which dissimilar (negative) pairs should be separated in the embedding space. If the distance between a dissimilar pair exceeded the margin, the loss for that pair was zero; otherwise, the loss increased as the distance fell further below the margin. The margin therefore acted as a threshold beyond which two substances were considered sufficiently dissimilar. The distribution of the distances for the test dataset was visualised to explore whether the labelled similar and dissimilar pairs of substances were actually separated.
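As an illustration of how such a loss behaves, the sketch below implements the widely used margin-based form of the contrastive loss for a single pair of embeddings, following the label convention above (0 for a similar pair, 1 for a dissimilar pair). It is a plain Python sketch rather than the training code used in the study, and the factor of 0.5 and the squared terms are assumptions taken from the common formulation. In this form the margin acts on dissimilar pairs: they contribute to the loss only while their distance is below the margin.

```python
import math


def contrastive_loss(emb_a, emb_b, label, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings.

    label = 0 marks a similar pair, label = 1 a dissimilar pair.
    """
    distance = math.dist(emb_a, emb_b)  # Euclidean distance in embedding space
    # Similar pairs are penalised for being far apart.
    similar_term = (1 - label) * 0.5 * distance ** 2
    # Dissimilar pairs are penalised only while closer than the margin.
    dissimilar_term = label * 0.5 * max(0.0, margin - distance) ** 2
    return similar_term + dissimilar_term
```

For a batch, the per-pair losses would typically be averaged before backpropagation.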

The embeddings derived from the network for each target and analogue pair were compared using a normalised cosine distance, and these distances were in turn compared with the Tanimoto scores computed originally. The embeddings were additionally used as inputs to a logistic regression, where the absolute difference and the product of the embeddings for each target-analogue pair formed the input feature set, to evaluate how well the embeddings derived from the network could discriminate between known similar and dissimilar analogue pairs.
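The pair-level features and the embedding distance can be sketched in a few lines of plain Python. The function names are hypothetical, the concatenation order (absolute difference followed by product) is an assumption, and the standard cosine distance is shown because the specific normalisation applied in the study is not stated.

```python
import math


def pair_features(emb_target, emb_analogue):
    """Concatenate the element-wise absolute difference and product."""
    abs_diff = [abs(t - a) for t, a in zip(emb_target, emb_analogue)]
    product = [t * a for t, a in zip(emb_target, emb_analogue)]
    return abs_diff + product


def cosine_distance(u, v):
    """Cosine distance, 1 - cos(u, v), for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms
```

Both features are symmetric in their arguments, so the resulting classifier does not depend on which member of a pair is treated as the target.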

A Graph Isomorphism Network (GIN) from the PyTorch Geometric library was also constructed using a Siamese network architecture, such that a pair of substances represented by their molecular graphs could be fed into the network. The contrastive loss again served as the loss function to learn embeddings in which two similar substances had a low Euclidean distance and two dissimilar substances had a large Euclidean distance. Target-analogue SMILES were used as inputs and converted into PyTorch Geometric graphs, from which embeddings were extracted that represented chemical structure information potentially predictive of whether an analogue pair was part of the same read-across case or not. The GIN network comprised 3 convolutional layers followed by a global mean pooling operation to aggregate node features into a single graph representation. The final linear layer produced graph level embeddings of a fixed size. Optuna [@optuna_2019], an automatic hyperparameter optimisation software framework, was used to optimise the number of GIN layers (2-5), the size of the hidden dimensions (32-256), the dropout rate (0.1-0.5), the learning rate (1e-4 to 1e-2) and the margin (0.5-5) over 50 trials to maximise accuracy. The dataset was split 3 ways such that 80% of the dataset was used for training, 10% was used for hyperparameter tuning and the final 10% was reserved to evaluate overall performance.

In an effort to explore other graph-based networks given the poor performance of the GIN network with the read-across cases dataset, the Molecular Contrastive Learning of Representations (MolCLR) approach [@wang_molecular_2022] was applied using the ToxCast library of chemicals. MolCLR relies on building molecular graphs and graph neural network (GNN) encoders to learn differentiable representations. Three molecule graph augmentations are used: atom masking, bond deletion, and subgraph removal. A contrastive estimator is then used to maximise the agreement of augmentations from the same substance while minimising the agreement of different substances. This self-supervised framework uses unlabelled data; that is to say, for the ToxCast set of substances, no labels denoting similar or dissimilar pairs are assigned. Instead, each substance is augmented to create 2 graph views that are then passed into the network. Each atom in the molecule graph was embedded by its atomic number and chirality type whereas each bond was embedded by its type and direction. The network architecture comprised a 5 layer graph convolution architecture with ReLU activation as the GNN backbone. An average pooling was applied to each graph as the readout operation to extract a 512 dimension molecular representation. The Adam optimiser with a weight decay of $10^{-5}$ was used to optimise the NT-Xent loss. After 10 epochs with a learning rate of $5 \times 10^{-4}$, a cosine learning rate decay was implemented. The model was trained with a batch size of 512 for a total of 50 epochs. Although the pre-training in the original paper used ~10 million unique unlabelled substances represented by their SMILES, herein, as a proof of concept, the dataset was restricted to the ToxCast library to explore the feasibility of applying the resulting embeddings, as these were expected to be more enriched than those obtained by relying on the limited set of extracted read-across case substances.
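For readers unfamiliar with the NT-Xent (normalised temperature-scaled cross entropy) loss, the sketch below computes it for one positive pair within a batch of embeddings. It is a plain Python illustration rather than the MolCLR implementation, and the temperature value of 0.1 is an assumption since it is not reported above. Indices `i` and `j` refer to the two augmented views of the same substance, and every other embedding in the batch acts as a negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nt_xent(embeddings, i, j, temperature=0.1):
    """NT-Xent loss for the positive pair (i, j) within a batch."""
    # Temperature-scaled similarities of view i to every other view in the batch.
    sims = [
        cosine(embeddings[i], z) / temperature
        for k, z in enumerate(embeddings)
        if k != i
    ]
    positive = cosine(embeddings[i], embeddings[j]) / temperature
    # Log-sum-exp of the denominator, shifted by the maximum for stability.
    top = max(sims)
    log_denominator = top + math.log(sum(math.exp(s - top) for s in sims))
    return log_denominator - positive
```

The loss is small when the two views of a substance are more similar to each other than to any other graph in the batch, which is the agreement the contrastive estimator maximises.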

Once the model was trained, it was used to compute embeddings for the read-across case substances. The absolute difference and product of the embeddings for each target-analogue pair were computed and these formed the inputs to a logistic regression classifier to discriminate between known similar and dissimilar analogue pairs. The dataset was randomly downsampled to ensure an equal balance of similar and dissimilar pairs of substances and was then randomly split 80:20 into a training and test set. Performance was assessed through a 5-fold stratified cross validation procedure using the F1 score as a metric. The performance was compared to that of a logistic regression model based on Morgan chemical fingerprints to determine whether there was any significant difference between the 2 approaches.

## Codifying metabolic similarity for read-across

One claim for read-across is the role and importance of considering metabolic similarity in analogue selection. The challenge is in objectively capturing metabolic information of analogues. In traditional read-across, consideration of metabolic similarity is typically limited to a qualitative evaluation of the commonality in transformation pathways. This is acknowledged by Yordanova et al [@yordanova_assessing_2021] and demonstrated in practice using the OECD Toolbox [@yordanova_using_2019]. In Boyce et al [@boyce_comparing_2022], metabolic consideration was characterised by similarity in the metabolites produced and their transformation pathways, which was found to be informative for a single case study. In this study and our previous work [@patlewicz_systematic_2024], metabolic similarity was captured using a WL kernel. In both studies the similarities were low within the read-across case substances, yet metabolic similarity was still found to contribute to an extent to discriminating between similar and dissimilar analogue pairs. An alternative means of capturing metabolic information was therefore explored. Rather than rely on the read-across cases in this study, an objective evaluation was performed making use of 2 datasets to explore the following questions: 1) whether metabolic graphs were informative in predicting a toxicity endpoint (acute oral lethality), given that a large dataset exists (as compiled by Mansouri et al [@mansouri_catmos_2021] and used in previous studies such as Adams et al [@adams_development_2023] and Helman et al [@helman_transitioning_2019]); 2) whether such a model could be used to fine tune a model to predict repeated dose toxicity values; and 3) whether the embeddings from such a model could be used in a GenRA approach to identify relevant analogues for reading across such toxicity values.

### Deep learning using metabolic graphs

The acute toxicity dataset and respective identifiers were taken from the supplementary information of Helman et al [@helman_transitioning_2019]. Substances were imported into the TIMES *in vitro* rat liver model for metabolites to be generated. The default tree report was exported and metabolic graphs were constructed as before, where nodes representing the parent and metabolites were characterised by Morgan fingerprints and edges represented the transformations. The Python package [NetworkX](https://networkx.org/) was used to create the metabolic graphs. A total of 6468 metabolic graphs could be created from the TIMES predictions together with their fingerprint representations. The acute toxicity data used in Helman et al [@helman_transitioning_2019] focused on the LD50 values but the supplementary data also provide 'very toxic' and 'nontoxic' scores. Herein a new 'acute_category' was defined to discriminate non-toxic from everything else. This translated to assigning a non-toxic score to anything with an LD50 of 2000 mg/kg or greater. NetworkX graphs were then converted to PyTorch Geometric objects for use in a graph convolutional network (GCN). The dataset was first randomly split 80:10:10 into training, validation and test sets, stratified by the 'acute_category'. As a baseline, a Random Forest model (with default settings) was also trained on Morgan chemical fingerprints to compare performance. Balanced accuracy was used as the metric to evaluate performance for both the GNN and Random Forest models.
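The labelling rule and the evaluation metric can be written out in a short sketch. The label strings `"nontoxic"` and `"toxic"` and the function names are hypothetical. The 2000 mg/kg threshold follows the text above, and balanced accuracy is computed as the mean of the per-class recalls.

```python
def acute_category(ld50_mg_per_kg, threshold=2000.0):
    """Binary label: non-toxic when the LD50 is at or above the threshold."""
    return "nontoxic" if ld50_mg_per_kg >= threshold else "toxic"


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, robust to class imbalance."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)
```

Because each class contributes equally regardless of its size, a model that always predicts the majority class scores only 0.5 on a two-class problem.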

For the GCN, metabolic graphs were loaded into training and validation loaders with a batch size of 300. A GCN with 4 convolutional layers using a hidden layer with an embedding size of 256 plus a final linear layer with a 512 embedding size was trained over 200 epochs using a cross entropy loss as the loss function and the Adam optimiser with a learning rate of 0.0007. A concatenation of the mean and maximum pooling formed the embedding. An early stopping check was included to halt training if the validation loss had failed to improve on the best validation loss over 10 epochs.
A hyperparameter evaluation using Optuna was then performed to establish the optimal embedding size, number of layers and learning rate. Embedding size was varied from 64-512, while learning rate and weight decay were sampled from a log-uniform distribution with 1e-5 and 1e-2 as bounds. Fifty trials were performed to maximise the balanced accuracy. Based on the best hyperparameters, the model was retrained on the training and validation datasets to evaluate balanced accuracy for the held out test set, before being applied to the entire dataset, from which embeddings could be extracted to evaluate how well the model discriminated between non-toxic and other acute categories.
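A patience-based early stopping check of the kind commonly used in such training loops is sketched below. It is an illustrative helper rather than the study's code: the class name is hypothetical, and it assumes that training stops once the validation loss has failed to improve on the best value seen for 10 consecutive epochs.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

In a training loop, `step` would be called once per epoch with the current validation loss, and the loop would break the first time it returns `True`.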

As a baseline model, a 10-fold stratified CV procedure was applied to evaluate the mean balanced accuracy of a Random Forest Classifier with 100 trees and a maximum sample fraction of 0.66.

As a next step, the optimised model for discriminating acute toxicity was used to fine tune a model for repeated dose toxicity outcomes. Here all repeated dose toxicity studies were extracted from the toxicity values database (ToxValDB) version 9.5 and filtered for those performed via the oral route. The approach described in Aurisano et al [@aurisano_probabilistic_2023] was used to harmonise study level points of departure such as No Observed Adverse Effect Levels (NOAELs) into chronic human equivalent benchmark dose values (BMDh). Consistent with Aurisano et al, the BMDh distribution for each chemical was then fit by a lognormal distribution, and the 25^th^ percentile was taken as the chemical level BMDh. The optimised acute toxicity GNN model was used in a GNN Regression model and trained to predict the log10 BMDh value. The GNN Regression model comprised the pretrained convolutional GCN layers followed by feedforward layers (2 sets of linear, ReLU and Dropout). Initially the GCN layers were frozen so that the feedforward layers were trained first over 10 epochs, after which the GCN layers were unfrozen. Training was performed over 50 epochs with an early stopping procedure. The hidden layer size of the feedforward network was 256. A ReduceLROnPlateau scheduler was used to adjust the learning rate during training. The loss function was the mean squared error (MSELoss). The TIMES *in vitro* rat liver metabolism model was used to simulate metabolism for all ToxValDB substances that could be defined by a discrete chemical structure. The final set of ToxValDB metabolic graphs derived was randomly split 80:10:10 into a training, validation and test set. Metabolic graphs were transformed into data loaders with batch sizes of 100. Training was performed on 3166 metabolic graphs whereas the 396 validation set metabolic graphs were used as part of an early stopping procedure.
Performance was evaluated on the 396 test set graphs held out during training and validation. After training/evaluation was completed, the model was re-trained on the entire set of graphs and embeddings were generated for all metabolic graphs using the model.
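The derivation of a chemical level BMDh from several study level values can be illustrated with the standard library alone. The sketch below is an assumption-laden simplification rather than the published procedure: it fits the lognormal by taking the mean and sample standard deviation of the log10 values, and it returns the single value unchanged when a chemical has only one study. The function name is hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def chemical_level_bmdh(bmdh_values, percentile=0.25):
    """Percentile of a lognormal distribution fitted to study-level BMDh values."""
    logs = [math.log10(v) for v in bmdh_values]
    mu = mean(logs)
    # With a single study there is no spread to estimate.
    sigma = stdev(logs) if len(logs) > 1 else 0.0
    z = NormalDist().inv_cdf(percentile)  # standard normal quantile
    return 10 ** (mu + z * sigma)
```

Taking `math.log10` of the returned value gives the quantity used as the regression target.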

### Application of the Embeddings for read-across purposes

The embeddings derived for all substances were next used in a GenRA analysis [@patlewicz_towards_2023] to compare log10 BMDh predictions made on the basis of the n most similar analogues identified from the embedding space (using a normalised Euclidean distance) with predictions made on the basis of the n most similar analogues identified by Jaccard similarity of Morgan chemical fingerprints. The entire dataset was randomly split 80:20 into training and test datasets. The optimal number of analogues, n, was first determined for Morgan fingerprints on the training set using a 5-fold nested CV procedure with GenRA and its genra-py python package [@genrapy].

Then for each chemical in the dataset, the n most similar analogues were identified and the similarity-weighted log10 BMDh value was computed either on the basis of the Morgan chemical fingerprints using Jaccard similarities or the embedding features with a normalised Euclidean similarity score.
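
The similarity-weighted prediction described above can be sketched as follows. This is an illustration in plain Python, not the genra-py API: the function names and the representation of a fingerprint as a set of "on" bit positions are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of 'on' fingerprint bits."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def read_across_prediction(target_bits, sources, n=6):
    """Similarity-weighted mean activity of the n most similar analogues.

    sources: list of (bits, log10_bmdh) tuples for substances with data.
    Returns None when no source shares any feature with the target.
    """
    scored = [(jaccard(target_bits, bits), value) for bits, value in sources]
    # Keep the n source analogues with the highest similarity to the target.
    top = sorted(scored, key=lambda sv: sv[0], reverse=True)[:n]
    total = sum(s for s, _ in top)
    if total == 0:
        return None
    return sum(s * v for s, v in top) / total
```

The same weighting applies to the embedding-based analysis if `jaccard` is swapped for a similarity derived from the normalised Euclidean distance between embeddings.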

# Results and Discussion

## Dataset Summary

A total of 157 read-across examples were extracted from three main sources (OECD, EPA PPRTV and literature). There were 24 unique decision contexts which were then grouped into 4 main types - New Approach Methods (NAMs) [@epa_new_2020], technical guidance, safety assessment or regulatory purposes. Of the 157 examples, 68 cases (43%) were developed primarily to meet a safety assessment purpose and 71 (45%) for a regulatory purpose; the remainder were relatively evenly split between efforts to improve existing technical guidance (5%) or to illustrate the utility of NAMs (i.e. any technology, methodology, approach, or combination that can provide information on chemical hazard and risk assessment to avoid the use of animal testing) to substantiate read-across justifications (6%). The distribution of cases across decision contexts is not surprising given the origin of the case studies: ~25 (13%) of the cases were taken from the US EPA PPRTV effort, 16% were OECD SIDS examples, 7% were OECD IATA case studies and 42% were from journal articles, with the remainder comprising a couple of examples each from ECETOC or Health Canada. Of the approaches, 38% of cases utilised a category approach and 57% were analogue approaches. All the EPA PPRTV cases relied on an analogue approach whereas over 80% of the OECD IATA examples used a category approach. Journal articles favoured a category approach (63%) over an analogue approach (37%).

![Lineplot of number of unique cases per year](figyear.png){#fig-year}

The timespan of examples extracted ranged from 2004-2023. There was a peak of examples in 2011 and again in 2020 as shown in @fig-year. The 2011 peak is dominated by one particular journal article reporting many qualitative examples whereas 2020 included a number of examples from the literature that had been developed by the Research Institute for Fragrance Materials (RIFM). The latter may have been motivated by their 2020 methodology publication [@date_clustering_2020] describing a tiered structural classification scheme based on (1) organic functional groups, (2) structural similarity and reactivity features of the hydrocarbon skeletons, (3) predicted or experimentally verified Phase I and Phase II metabolism, and (4) expert pruning.


## Chemical Landscape

Although there were 157 cases comprising 1019 records, 77 substances appeared more than once either across cases (same analogue/category member used in different published examples) or within a specific case (read-across was performed for more than one endpoint/species/exposure route e.g. substance DTXSID1026796 appears several times in read-across case 3 owing to two different effects in a developmental study, an effect in a reproductive toxicity study as well as subchronic studies in 2 different species). In terms of the chemical landscape, the read-across cases captured 695 unique substances, for 661 of which structures could be mapped. The substances without structures were all UVCBs arising from 15 different studies drawn from the OECD IATA and SIDS examples. For the 661 substances with SMILES, the ClassyFire API was used to tag each substance with a superclass and class annotation to summarise the types of functional classes captured by the read-across cases. ClassyFire superclasses Benzenes and substituted derivatives (30%), Fatty Acyls (13%), Carboxylic acids and derivatives (10%) as well as Organooxygen compounds (13%) captured ~66% of all the chemicals.

![UMAP of ClassyFire Superclasses](classyfire.png){#fig-classy}

A 2D Uniform Manifold Approximation and Projection (UMAP) [@mcinnes_umap_2020] plot based on Morgan chemical fingerprints (@fig-classy), colour coded by the chemical superclasses with the highest number of members, illustrates the read-across landscape and provides some perspective of the chemistry coverage.



## Analogue Evidence Streams

In this study, the analogue identification strategy was recorded for each read-across case. This was captured in two ways: 1) in narrative form and 2) in a more machine readable manner. The narrative provided a short description of the way in which analogues were identified and evaluated for their suitability for reading across the endpoint. The machine readable summary was a pipe-delimited string that captured all types of similarity rationales used to identify and evaluate analogues. This was referred to as an "analogue evidence stream". An example would take the form of "Structural_description|Physchem_description|Metabolism_description|Mechanistic_description". An actual example is 'Structural_CHRIP_OECD-Toolbox_common-phenolic-group-at-same-position-on-benzotriazole | Physchem_similar-logKow-volatility | Metabolic_common-metabolite|Mechanistic_transcriptomic-profiles_similar-predicted-MOA |Toxicity_common-target-organ'.
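
As an illustration of how such a string can be processed, the sketch below splits an analogue evidence stream into its similarity contexts and extracts the first component. The function names are hypothetical, and treating the text before the first underscore as the similarity context is an assumption based on the example above.

```python
def parse_evidence_stream(stream):
    """Split a pipe-delimited evidence stream into (context, detail) pairs."""
    parsed = []
    for part in stream.split("|"):
        part = part.strip()  # tolerate stray spaces around the pipes
        if not part:
            continue
        # Text before the first underscore names the similarity context.
        context, _, detail = part.partition("_")
        parsed.append((context, detail))
    return parsed


def primary_context(stream):
    """Return the first similarity context listed in the stream, if any."""
    parsed = parse_evidence_stream(stream)
    return parsed[0][0] if parsed else None


example = (
    "Structural_CHRIP_OECD-Toolbox_common-phenolic-group-at-same-position-on-benzotriazole"
    " | Physchem_similar-logKow-volatility | Metabolic_common-metabolite"
    "|Mechanistic_transcriptomic-profiles_similar-predicted-MOA |Toxicity_common-target-organ"
)
```

Applied to the example string, `primary_context` returns `Structural`, and the parsed contexts are Structural, Physchem, Metabolic, Mechanistic and Toxicity in that order.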

To summarise the apparent primary means of identifying candidate source analogues, the first component of the analogue evidence stream was extracted, which revealed that in 85 cases structure was used to identify analogues. In contrast, metabolic similarity was the primary means of identifying analogues in only 14 cases. Across the 85 structure-based cases, it was possible to summarise the types of tools and approaches used to identify the candidate analogues. The OECD Toolbox (www.qsartoolbox.org), DSSTox and [ChemIDPlus](https://pubchem.ncbi.nlm.nih.gov/source/ChemIDplus) or some combination were the main tools relied upon to identify candidate analogues. The barplot in @fig-analogue highlights the main tools and approaches.
Though structural similarity using a similarity metric is often used by these tools or their combinations, by far the most common means of identifying structural analogues within the examples was to look for common scaffolds based on functional groups. The most common tool used for this was the OECD Toolbox. The use of DSSTox and ChemIDPlus was also notable, though this reflects the historical nature of many of the case studies identified. ChemIDPlus and DSSTox are legacy tools that are no longer available; both had been able to perform structure and similarity based searches. ChemIDPlus has been superseded by [PubChem](https://pubchem.ncbi.nlm.nih.gov/) whereas DSSTox was archived for several years before the development and release of the [EPA CompTox Chemicals Dashboard](https://www.comptox.epa.gov/dashboard). Leadscope, a commercial tool for analogue searching and data mining, has also been superseded since Leadscope was acquired by a different parent company, [InStem](https://www.instem.com/solutions/discovery/). A different set of software tools will likely feature in future read-across examples as other computational tools are applied. The [OECD Toolbox](https://qsartoolbox.org) was first developed as a proof of concept tool in 2008 whereas DSSTox was launched in 2004 [@grulke_epas_2019]. The EPA CompTox Chemicals Dashboard was first released in 2016 [@williams_comptox_2017]; other tools that have become available since include EPA's Generalised Read-Across Tool (released publicly in 2019 [@helman_generalized_2019; @patlewicz_towards_2023]) and the [EPA Cheminformatics Modules](https://www.epa.gov/comptox-tools/cheminformatics) amongst others. The analogue evidence streams do not necessarily capture the order in which analogues are identified and evaluated nor the iterations that might have been undertaken to arrive at a final set of analogues/category members. 
However, they provide a useful perspective on the common considerations and tools that are relied upon to arrive at the analogue or category members used to perform the associated read-across. It is clear from the examples to date that conventional approaches were relied upon to identify and evaluate analogues, whereas the use of New Approach Methods (NAM) data is still evolving as a means to support cases. It is expected that as more read-across cases are developed, analogue identification may rely on NAM data streams which will in turn be dependent on other tools. Examples could include [VERA](https://www.vegahub.eu/portfolio-item/vera/) [@vigano_virtual_2022] or the [Modeling and Visualization (MoVIZ) Pipeline](https://github.com/NIEHS/Chemical-grouping-workflow) [@moreira-filho_democratizing_2024] amongst others.

![Main tools used in the structural analogue evidence streams](analogue_stream.png){#fig-analogue}



## Similarity contexts

The number of substances per case study varied considerably across the 157 cases, with the mean and median number of members being 5 and 4 respectively and the maximum number of members being 42. The size of these neighbourhoods has an impact on the variation expected in structural similarity, physicochemical similarity etc.

### Structural Similarity Evaluation

Pairwise Jaccard structural similarity distributions were computed for all chemicals within each read-across case, which demonstrated a large variation in similarity scores. Although high Jaccard similarities were observed, the median of the distribution of median values for each case study was determined to be only 0.34. @fig-boxplots shows the distribution of Jaccard structural similarities within each case, ordered by decreasing median value.
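
The summary statistic quoted above (the median of the per-case median pairwise similarities) can be sketched as follows. Fingerprints are represented here as sets of "on" bits and the function names are hypothetical; this is a simplified illustration rather than the code used in the study.

```python
from itertools import combinations
from statistics import median


def jaccard(a, b):
    """Jaccard similarity between two sets of 'on' fingerprint bits."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def case_median_similarity(fingerprints):
    """Median pairwise Jaccard similarity among the members of one case."""
    pairs = [jaccard(a, b) for a, b in combinations(fingerprints, 2)]
    return median(pairs) if pairs else None


def median_of_case_medians(cases):
    """cases: dict mapping a case identifier to a list of fingerprint sets."""
    medians = [case_median_similarity(fps) for fps in cases.values()]
    # Cases with a single member have no pairs and are skipped.
    return median(m for m in medians if m is not None)
```

Summarising by the median of medians keeps large categories, which contribute many more pairs than a single target-analogue pair, from dominating the overall figure.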

![Boxplot of pairwise Jaccard structural similarities within each read-across example. Only every 5th read-across case is displayed in the axis ticklabels.](Jaccard_SS.png){#fig-boxplots width="80%"}


### Alert and Physicochemical Similarity Evaluation

There were 85 DNA binding alerts and 104 Protein binding alerts captured; of those, 52 alerts were triggered across the substances included in the read-across cases. A pairwise comparison of the profile across these endpoint-toxicophore combinations was not particularly informative - substances showed either complete or no overlap (@fig-alert). 
The variation in physicochemical similarity within each read-across case was less pronounced (@fig-phys), with much higher similarities across the cases. The median of the distribution of median values for each case study was 0.87.


![Boxplot of pairwise Jaccard alert similarities within each read-across example](Alert_similarity.png){#fig-alert width="80%" fig-pos="H"}

![Boxplot of pairwise Euclidean similarities within each read-across example](Physchem_similarity.png){#fig-phys width="80%" fig-pos="H"}



### Metabolic Similarity Evaluation

The pairwise similarities were computed within each read-across example to explore how metabolically similar the target and source analogues were amongst themselves with respect to their metabolic graphs (@fig-wl). There was a large degree of variation in pairwise similarities within each case study and the overall similarities were low. This was comparable to what had been observed in the prior study [@patlewicz_systematic_2024], which may suggest that the representation and metric used to quantify similarity is less than ideal: the Weisfeiler-Lehman (WL) graph kernel places a greater emphasis on the node similarity captured by the metabolite fingerprints themselves.


![Boxplot of pairwise WL graph kernel similarities within each read-across example](WL_similarity_201225.png){#fig-wl width="80%"}

## Bayesian Logistic Regression of the balanced labelled analogue pairs

All four chains converged, as indicated by the Rhat values across the parameters (an Rhat of 1 is indicative of convergence under the Gelman-Rubin test, which compares the variance between chains with the variance within chains; convergence can also be inspected in a traceplot) [@martin_bayesian_2016]. Based on the trace summary, the similarity metric with the largest contribution (based on the magnitude of the posterior mean coefficients) to whether a pair of substances was similar was structural similarity (14.11), followed by similarity in physicochemical properties (-3.73) and then metabolic similarity based on the graph kernel (2.9). @fig-pp shows the posterior plot for the parameters estimated. The greatest uncertainty arises in the similarity of the metabolites. The balanced accuracy for the test set was determined to be 0.87. @fig-cm shows the confusion matrix for the test set predictions using the mean parameters. The profile of the parameters was similar to that in the Patlewicz et al study [@patlewicz_systematic_2024] even though the dataset was larger and more diverse. Structural similarity was by far the most dominant parameter, whereas physicochemical similarity was negatively associated with whether an analogue pair was similar, suggestive of some redundancy in the impact of physicochemical similarity. The positive contribution of the metabolic similarity prompted further study of how metabolism information might be better characterised, given the strong emphasis that the WL kernel places on node labels and how they propagate through the graph structure during the relabelling steps, whereas edge labels and attributes are not inherently modelled.
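
For readers unfamiliar with the convergence diagnostic, the following sketch computes the classic Gelman-Rubin statistic for a single parameter from the between-chain and within-chain variances. It is an illustrative stdlib implementation with a hypothetical function name; Bayesian modelling packages often report refined variants of Rhat, so values may differ slightly from those in a trace summary.

```python
from statistics import mean, variance


def gelman_rubin_rhat(chains):
    """Classic (non-split) Gelman-Rubin statistic for one parameter.

    chains: list of at least two equally long lists of posterior draws.
    """
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    # W: average of the within-chain sample variances.
    within = mean([variance(c) for c in chains])
    # B: between-chain variance, scaled by the number of draws per chain.
    between = n * variance(chain_means)
    # Pooled estimate of the posterior variance.
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5
```

Chains that have settled on the same distribution give values close to 1, whereas chains exploring different regions inflate the between-chain term and push the statistic well above 1.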

![Posterior distribution of the parameters estimated](posterior_plot.png){#fig-pp width="80%"}

![Confusion matrix for the test set not involved in the modelling process](confusion_matrix.png){#fig-cm width="50%"}

## Deep learning approach

The deep learning approach using PyTorch objects constructed from the Morgan fingerprints with the Contrastive Network stopped training after 34 epochs. The test set accuracy score was 0.976, demonstrating that the network was able to discriminate between similar and dissimilar analogue pairs.

![Distribution of Pairwise distances](siamese_mgrn.png){#fig-distance width="80%"}

@fig-distance shows the distribution of pairwise distances for the test set of pairs where similar and dissimilar pairs of substances appear to be separated. Although the held out test set accuracy is impressive, given the small dataset, it is likely that this performance is overly optimistic and not an indication of generalised performance. 

Evaluating the Jaccard scores vs normalised cosine similarities showed very little correlation in terms of their scores (see Supplementary @fig-jaccard_cosine). A 2D UMAP plot of the embeddings from each pair showed little discrimination in their labels (@fig-embed). The expectation was that similar and dissimilar read-across pairs would be clearly separated by label, but the similar and dissimilar pairs appeared to be spread throughout the space, potentially indicating an overlap in the learned representations.

![UMAP of embeddings from target and analogues](fig-embed.png){#fig-embed}

On the other hand, using a concatenation of the absolute difference and product of the embeddings from both analogues provided a richer, more discriminative set of features that did discriminate between similar and dissimilar pairs when applied in a logistic regression model (@fig-embed-two).
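
The pair-level feature construction can be sketched in a few lines. The function name `pair_features` is hypothetical and embeddings are shown as plain lists of floats for simplicity.

```python
def pair_features(emb_a, emb_b):
    """Concatenate the element-wise |a - b| and a * b of two embeddings."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    abs_diff = [abs(x - y) for x, y in zip(emb_a, emb_b)]
    product = [x * y for x, y in zip(emb_a, emb_b)]
    # The result is twice the length of a single embedding.
    return abs_diff + product
```

Both components are symmetric in their arguments, so the resulting feature vector does not depend on which member of the pair is treated as the target and which as the source analogue.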

![Embeddings on the basis of absolute difference and products](fig-embed-two.png){#fig-embed-two width="80%"}

Structural similarity, as captured by a pairwise similarity score, played the most dominant role in the original logistic regression. Paired learning techniques as part of a contrastive network using the same input fingerprints confirmed the importance of structural information in driving the analogue pairs. The performance is likely overly optimistic since the logistic regression was fitted on embeddings derived from the Siamese network trained on the same dataset.

A Graph Isomorphism Network (GIN) model failed to discriminate similar and dissimilar pairs of substances based on their molecular graphs. Over the course of 50 trials using Optuna, the best accuracy achieved was only 0.51, suggesting that the dataset was too limited to produce any meaningful embeddings from the input molecular graphs.

The mean 5-fold CV test set F1 score when using the MolCLR embeddings in a logistic regression was 0.893 (SD 0.0006). The ROC-AUC for the held out test set was 0.956. Using Jaccard similarities derived from Morgan fingerprints for each analogue pair as inputs in a logistic regression resulted in a mean 5-fold CV test set F1 score of 0.905 (SD 0.01). The ROC-AUC for the held out test set was slightly higher at 0.973. Whilst using a pre-training strategy to create embeddings shows promise, for this use case of read-across pairs the Jaccard similarities computed from Morgan fingerprints proved more than sufficient to discriminate between the analogue pairs. As much as the read-across cases were intended to provide a ground truth of valid examples where read-across was successful, the fact that structural considerations alone were sufficient to differentiate analogue pairs draws attention to the difficulty of objectively extracting additional insights into the contribution that other similarity contexts may play or how best to encode their information. Although the read-across cases are in some respects a ground truth of sorts, they are biased in that the analogues identified are not necessarily ideal in terms of capturing the relevant similarity contexts per se; rather, they represent a practical and pragmatic compromise of finding analogues with associated toxicity data.

### Deep learning using metabolic graphs

Using the initial GCN architecture, the overall balanced accuracy was found to be 0.6595 for the held out test set. Following hyperparameter tuning over 50 trials, the best performance was found using an embedding size of 256, a learning rate of 0.0066 and a weight decay of 2.52e-05. The balanced accuracy for the validation set was as high as 0.7551. The model with the best parameters was retrained on the training and validation sets, and the balanced accuracy for the held out test set was found to be 0.7416. This contrasted with a Random Forest Classifier based on Morgan fingerprints with default parameters, which gave a mean test score of 0.688 during 10-fold CV and a balanced accuracy of 0.673 for the held out test set. Although a reasonable model could be derived using default parameters with Morgan fingerprints and a Random Forest, the GCN using metabolic graphs performed better (0.7416 vs 0.673). The embeddings from the GCN were extracted and projected onto 2D using UMAP to visualise the metabolic graph landscape and the discrimination between the 2 acute toxicity categories (see @fig-umap). A similar projection on the basis of the Morgan fingerprints alone shows acutely toxic and non-toxic substances distributed evenly throughout the chemical space (figure not shown).

![UMAP plot of the optimised acute toxicity GCN](UMAP_acute.png){#fig-umap width="80%"}

The optimised GNN derived for the acute toxicity model was then used to fine-tune a GCN Regression model to predict chronic BMDh values. The model was trained on 3166 metabolic graphs derived for the ToxValDB substances using TIMES *in vitro*, with 396 graphs used for validation. Performance on the test set of 396 metabolic graphs not used during training or validation was evaluated using MSE, RMSE and the coefficient of determination R². The RMSE and MSE were 0.8453 and 0.714 respectively whereas the R² was 0.2249.
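
For reference, the three metrics are related as in the minimal sketch below; the function name is hypothetical and the inputs are plain lists of observed and predicted log10 BMDh values.

```python
def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R² for paired observed and predicted values."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": mse ** 0.5,        # same units as the log10 BMDh values
        "r2": 1 - ss_res / ss_tot  # fraction of variance explained
    }
```

Because the RMSE is the square root of the MSE, the two reported values can be cross-checked: 0.8453 squared is approximately 0.7145, in line with the reported MSE of 0.714.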

A 2D UMAP projection of the embeddings for all the metabolic graphs showed a reasonable separation on the basis of potency. @fig-bmdh shows the projection where the scatterplot is colour coded by the percentile bins of the chronic log10 BMDh values. Percentile bins were derived for the log10 BMDh values using the 0, 10, 50, 90 and 100 cutpoints.
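
The binning step can be sketched as follows. The linear-interpolation percentile and the right-closed bin convention are assumptions made for this illustration, and the function names are hypothetical.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in 0..100) of pre-sorted values."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def percentile_bins(values, cutpoints=(0, 10, 50, 90, 100)):
    """Return the bin edges and a bin index (0 = most potent) per value."""
    ordered = sorted(values)
    edges = [percentile(ordered, q) for q in cutpoints]
    # Bin index = number of interior edges strictly below the value,
    # which gives right-closed bins.
    labels = [sum(v > e for e in edges[1:-1]) for v in values]
    return edges, labels
```

With these cutpoints the lowest bin isolates the most potent 10% of substances (lowest log10 BMDh) and the highest bin the least potent 10%, with the two middle bins split at the median.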

![UMAP plot of the optimised repeated dose toxicity GCN](umap_bmdh.png){#fig-bmdh width="80%"}

A reasonable model could be derived which predicted the chronic log10 BMDh values from metabolic graphs, leveraging insights from a pre-trained acute toxicity GCN model. The performance was reasonable given the complex endpoint being modelled, though it was surpassed by shallow models based on predefined chemical descriptors such as those employed in Pradeep et al [@pradeep_structure-based_2020], which reported an RMSE of 0.71 and R² of 0.53. Despite this, the main objective was to extract the embeddings and evaluate their utility in identifying relevant candidate analogues for read-across.



Splitting the chronic log10 BMDh values into training and test sets (80:20) and then performing a grid search using Morgan chemical fingerprints gave a mean 5-fold CV test R² of 0.248, where the optimal number of neighbours was 6. The performance for the test set was found to be R² = 0.233 with an RMSE of 0.908. If similarity weighted activity predictions were made for the entire dataset on the basis of Morgan fingerprints, the R² was 0.316 with an RMSE of 0.874.

Using the embeddings in a GenRA approach to infer the predicted chronic log10 BMDh values on the basis of their 6 nearest neighbours revealed a better performance on the entire dataset - here the R² score was found to be 0.380 with an RMSE of 0.831. Given the complexity of repeated dose endpoints, there appears to be a performance gain when using analogues identified on the basis of encoded metabolism information to make read-across predictions relative to using simple chemical fingerprint representations. A bootstrap resampling of the R² found that the embedding based model had a higher performance than the Morgan chemical fingerprint baseline in most resampled datasets.
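
A paired bootstrap comparison of this kind can be sketched as below. The number of resamples, the seed and the function names are assumptions for illustration; the source does not state how many resamples were drawn.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination for paired observed/predicted values."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def bootstrap_win_fraction(y_true, pred_a, pred_b, n_boot=1000, seed=0):
    """Fraction of resamples in which model A has a higher R² than model B."""
    rng = random.Random(seed)
    n = len(y_true)
    wins = valid = 0
    for _ in range(n_boot):
        # Resample substances with replacement, keeping predictions paired.
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if len(set(yt)) < 2:
            continue  # R² is undefined when all resampled targets are equal
        valid += 1
        r2_a = r2_score(yt, [pred_a[i] for i in idx])
        r2_b = r2_score(yt, [pred_b[i] for i in idx])
        wins += r2_a > r2_b
    return wins / valid if valid else None
```

Resampling the same substances for both models keeps the comparison paired, so the fraction reflects differences between the representations rather than differences between the resampled chemical sets.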

Using (Z)-2,6-dimethoxy-4-propenylphenol (DTXSID801019842) as an example, there was no overlap in the analogues identified by chemical fingerprints and GCN embeddings. Those based on Morgan fingerprints had very similar scaffolds whereas those based on the embeddings were very diverse in structure, including aliphatic structures, yet the predictions in both cases were reasonable relative to the empirical log10 BMDh values (residual of 0.033 for Morgan fingerprints vs 0.053 for embeddings).

|Target| Morgan Analogue| Morgan Similarity|Embedding Analogue| Embedding Similarity|
|:--------:|:--------:|:---:|:---------:|:---:|
|![DTXSID801019842](DTXSID801019842.png){width=1in}|  ![DTXSID7022413](DTXSID7022413.png){width=1in}  |0.452| ![DTXSID8044961](DTXSID8044961.png){width=1in}|0.994|
|   | ![DTXSID10216599](DTXSID10216599.png){width=1in} |0.429|![DTXSID4020583](DTXSID4020583.png){width=1in}| 0.994|
|   | ![DTXSID0052621](DTXSID0052621.png){width=1in}|0.415|  ![DTXSID7038814](DTXSID7038814.png){width=1in}|0.993| 


Exploring the substances that had absolute residuals exceeding 2 log units found fewer with the embeddings (111) than with the Morgan chemical fingerprints (136), as expected given the overall performance improvement. There was nonetheless considerable overlap in the target substances that were predicted poorly by both representations: 41 of the target substances had high residual values with both. Substances that were particularly poorly predicted included salts such as sodium acetate, but also substances such as urethane, 1,3-Dioxolan-2-one and phenyl salicylate.

# Conclusions

Compiling this compendium of read-across cases provided a convenient lens through which different scientific questions could be explored as they related to how past read-across assessments have been performed and what future refinements are possible. For the time period for which cases were extracted, analogue approaches (57% of cases) were favoured over category approaches (38%). Structure was the primary and most common means by which analogues were identified, with the main software tools being DSSTox, ChemIDPlus and the OECD Toolbox. Two of these tools have since been retired and replaced by the EPA CompTox Chemicals Dashboard and PubChem. Among the 85 of the 157 cases where structure was the primary consideration, structural scaffolds were noted as the most typical means of identifying analogues.

Exploring the read-across cases on the basis of their structural similarity using a typical chemical fingerprint showed a variation in pairwise similarities across and within cases, with some cases having higher pairwise structural similarities than others. Physicochemical similarities on the basis of HBD, HBA, MW and LogKow were typically higher and more consistent across cases. Alert similarity was variable, suggesting that a bit vector summarising the presence and absence of alerts had mixed utility. Using WL graph kernels of simulated TIMES *in vitro* metabolic maps gave rise to low similarities, suggesting that the node relabelling approach had limited utility in capturing the full extent of metabolic information. Combining all these different similarities and evaluating their contribution to predicting an analogue pair, under the assumption that the read-across cases were an unbiased ground truth, reaffirmed the expectation that structural characteristics formed the main contribution in identifying an analogue. Metabolic similarity provided additional insight but to a much lesser extent, presumably since the WL kernel was limited in its ability to capture the full extent of a metabolic graph. When a paired learning approach was used to optimise similarity and learn the relation between similar and dissimilar analogue pairs, structural considerations using Morgan fingerprints appeared sufficient to differentiate between the 2 groups, but the performance characteristics were overly optimistic given the small training set. Extending this to a graph based approach rather than using a fixed set of generalised chemical fingerprints was not successful, presumably because a deep learning approach reliant on molecular graphs requires much more data to learn the chemical representations. Applying a pre-training phase on a larger set of chemicals showed promise in creating more meaningful representations that could better discriminate between pairs of chemicals as similar or not, though Morgan chemical fingerprints still outperformed the graph based approach, at least for this set of examples. The added complexity of the approach did not warrant further exploration here, though it presents an interesting line of investigation for other use cases.

Given that the WL graph kernel gave rise to such low similarity scores, an alternative approach was investigated to encode metabolic information. The premise was that metabolic similarity is important in read-across, especially for complex repeated dose toxicity outcomes. To that end, metabolic graphs comprising nodes of metabolites represented by chemical fingerprint bit vectors and edges represented by transformations were derived for a large number of substances evaluated for their acute toxicity potential. The assumption was that a model trained to learn representations that discriminate acutely toxic substances from non-toxic substances could be leveraged to fit a more robust model to predict toxicity values from repeated dose toxicity studies. The embeddings derived were used in a simpler GenRA read-across approach and found to perform better than simple chemical fingerprints alone (R² of 0.380 vs 0.316 across the entire dataset). The approach offers some promise for how structures could be represented, in that encoding some metabolic information can result in more relevant analogues being identified to perform a data driven read-across. Further work is still merited to evaluate how scalable such an approach might be.

# Disclaimer {.unnumbered}

This manuscript reflects the opinions of the authors and is not reflective of the opinions or policies of the US EPA.

# References {.unnumbered}

::: {#refs}
:::

{{< pagebreak >}}

```{=tex}
\newpage
\appendix
\renewcommand{\thefigure}{A\arabic{figure}}
\renewcommand{\thetable}{A\arabic{table}}
\setcounter{figure}{0}
\setcounter{table}{0}
```
# Supplementary information {.appendix}

## Supplementary Figures {.unnumbered}

![Scatterplot of Jaccard scores vs normalised cosine similarities](cosine_tanimoto_scatter.png){#fig-jaccard_cosine}
