# GenRA-py User Manual 

The Generalised Read Across (GenRA) Python package, genra-py, is a freely available, stand-alone Python 3 implementation of the GenRA approach developed by Shah et al., 2016 (https://doi.org/10.1016/j.yrtph.2016.05.008) for estimating physicochemical, biological, and ecotoxicological properties of chemicals by inference from 'nearest neighbor' analogues.

This package is freely available at https://github.com/i-shah/genra-py.



## Table of Contents:

1.  **genra-py Usage**

    * 1.1 Usage
    * 1.2 genra-py Features
    * 1.3 GenRA Workflow
    * 1.4 Installation
    * 1.5 Input Data and Formatting

2.  **Running genra-py**

    * 2.1 Performance tuning with genra-py
    * 2.2 Global analysis with genra-py
    * 2.3 Local prediction with genra-py
    * 2.4 Visualizing nearest neighborhoods in genra-py

3.  **Use-Case Examples**


## 1. genra-py Usage

### 1.1 Usage

genra-py is a Python 3 package for making GenRA read-across predictions (Shah et al., 2016). It estimates a binary or continuous property of a chemical from the similarity-weighted activity of its k nearest neighbors. Given vectors of descriptors, chemical property values, or experimental measures (chemical structure, physicochemical, bioactivity, etc.), genra-py performs read-across inference with k-nearest-neighbor estimators built on the scikit-learn library (Pedregosa et al., 2011): a classifier for categorical data and a regressor for continuous data. genra-py also provides structural visualizations of nearest-neighborhood chemicals using the RDKit cheminformatics library (Landrum, 2015).


Several examples of genra-py usage are provided in the **Use-Case Examples** section of this manual.
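
The similarity-weighted estimate described above can be illustrated in a few lines of plain Python. This is a minimal sketch, not genra-py's implementation: the function names `jaccard_similarity` and `genra_predict` and the toy fingerprints are hypothetical, and Jaccard similarity over sets of "on" fingerprint bits is assumed as the similarity measure.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity between two sets of 'on' fingerprint bits."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def genra_predict(target, analogues, k=2):
    """Similarity-weighted activity of the k most similar analogues.

    analogues: list of (fingerprint_set, activity) pairs.
    """
    scored = [(jaccard_similarity(target, fp), y) for fp, y in analogues]
    top_k = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    weight = sum(s for s, _ in top_k)
    if weight == 0:
        return 0.0  # no analogue shares any bit with the target
    return sum(s * y for s, y in top_k) / weight


# Hypothetical toy data: fingerprints as sets of bit indices, binary activity.
analogues = [
    ({1, 2, 3, 4}, 1),
    ({1, 2, 5}, 0),
    ({7, 8, 9}, 1),
]
target = {1, 2, 3}
prediction = genra_predict(target, analogues, k=2)
```

For binary activities the result lies between 0 and 1 and can be thresholded to obtain a class; for continuous activities it is a similarity-weighted mean of the neighbors' values.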

### 1.2 genra-py Features

* GenRA K-nearest neighbor estimators

    + GenRA classifier, GenRAPredClass

    + GenRA regressor, GenRAPredValue
    
    
* Parallel optimization of the number of nearest neighbors and the similarity metric using scikit-learn grid search cross-validation


* Structural nearest neighborhood visualizations

### 1.3 GenRA Workflow

The GenRA workflow for hazard classification involves the following main steps:


  1.  Retrieving and loading data for chemicals;

  2.  Selecting descriptor(s) for prediction;

  3.  Evaluating predictive performance using cross-validation testing;

  4.  Visualizing nearest neighborhood predictions.

### 1.4 Installation

   Source code for genra-py is available through Git:

   `git clone https://github.com/i-shah/genra-py.git`
    
    
### **Prerequisites:**

   * Python 3 
        * available for download from: https://www.python.org/downloads/
        
        
    
   * Anaconda  
        * see installation guide at: https://docs.anaconda.com/anaconda/install/
    


### Required packages are listed in the following files:

   * **requirements.txt** for reproducing the analysis environment, e.g., generated with `pip freeze > requirements.txt`


   * **condaenv.yaml** for creating a conda environment, generated using `conda env export > condaenv.yml`


   * **setup.py** for installing the project `src` with pip: `pip install -e .`

### 1.5 Data Formatting and Input

   * genra-py can accept a variety of data inputs; however, the preferred input is a tab-separated values (.tsv) file. Files that are not in .tsv format should still be tab-delimited and imported with `sep='\t'`.


   * genra-py accepts both continuous and binary data from chemical, physicochemical, biological, or ecotoxicological experimental measures of chemicals.


   * At minimum, each data input requires a common chemical identifier (chemical ID, CASRN, DSSToxID). Here is an example of a chemical structure data import:

![image.png](attachment:image.png)



   * For data inputs with multiple descriptor types, genra-py uses `Box()` to create a dictionary of the different descriptor objects.
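
The input conventions above can be sketched with the standard library alone. This is an illustration under stated assumptions: the column names (`chem_id`, `chm_*`, `bio_*`), the identifiers, the helper `load_descriptors`, and the keys `chm` and `bio` are all hypothetical, and a plain `dict` stands in for `Box()`. With pandas, a tab-delimited file would typically be read with `pandas.read_csv(path, sep='\t')`.

```python
import csv
import io

# Hypothetical tab-separated inputs; identifiers and column names are illustrative.
chm_tsv = "chem_id\tchm_1\tchm_2\tchm_3\nCHEM_A\t1\t0\t1\nCHEM_B\t0\t1\t1\n"
bio_tsv = "chem_id\tbio_1\tbio_2\nCHEM_A\t0\t1\nCHEM_B\t1\t1\n"


def load_descriptors(handle, id_column="chem_id"):
    """Read a tab-delimited table into {chemical_id: [descriptor values]}."""
    reader = csv.DictReader(handle, delimiter="\t")
    value_columns = [c for c in reader.fieldnames if c != id_column]
    return {
        row[id_column]: [int(row[c]) for c in value_columns]
        for row in reader
    }


# One dictionary keyed by descriptor type, each entry indexed by chemical ID.
descriptors = {
    "chm": load_descriptors(io.StringIO(chm_tsv)),
    "bio": load_descriptors(io.StringIO(bio_tsv)),
}

# Chemicals present in every descriptor table can be used together.
shared_ids = set.intersection(*(set(table) for table in descriptors.values()))
```

Keeping the chemical identifier as the key in every table is what allows different descriptor types to be aligned for the same set of chemicals.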



## 2. Running genra-py

### 2.1 Performance Tuning 

To determine the optimal number of nearest neighbors and similarity metric, genra-py supports performance tuning using grid search cross-validation from scikit-learn together with GenRAPredValue.

Implementation: 


`GridSearchCV()`


An example of genra-py performance tuning is available in **Use-Case Examples**.
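
To make the idea concrete, the sketch below hand-rolls a small grid search in plain Python: every combination of neighbor count and similarity metric is scored by leave-one-out cross-validation, and the best combination is kept. It is a conceptual stand-in for what `GridSearchCV()` automates, not genra-py code. All function names and the toy data are hypothetical, and mean absolute error is assumed as the score in place of ROC AUC to keep the example short.

```python
from itertools import product


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    size = len(a) + len(b)
    return 2 * len(a & b) / size if size else 0.0


METRICS = {"jaccard": jaccard, "dice": dice}


def predict(target, train, k, similarity):
    """Similarity-weighted mean activity of the k most similar training items."""
    scored = sorted(
        ((similarity(target, fp), y) for fp, y in train),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    weight = sum(s for s, _ in scored)
    return sum(s * y for s, y in scored) / weight if weight else 0.0


def loo_error(data, k, similarity):
    """Mean absolute error when each item is predicted from all the others."""
    errors = [
        abs(predict(fp, data[:i] + data[i + 1:], k, similarity) - y)
        for i, (fp, y) in enumerate(data)
    ]
    return sum(errors) / len(errors)


def grid_search(data, k_values, metric_names):
    """Score every (k, metric) pair and return the best pair with all scores."""
    results = {
        (k, name): loo_error(data, k, METRICS[name])
        for k, name in product(k_values, metric_names)
    }
    best = min(results, key=results.get)
    return best, results


# Hypothetical toy data: (fingerprint bits, continuous activity).
data = [
    ({1, 2, 3}, 1.0), ({1, 2, 4}, 0.9), ({1, 3, 4}, 0.8),
    ({7, 8, 9}, 0.1), ({7, 8, 10}, 0.2), ({7, 9, 10}, 0.0),
]
best_params, cv_results = grid_search(data, [1, 2, 3], ["jaccard", "dice"])
```

`best_params` holds the `(k, metric)` pair with the lowest cross-validated error, and `cv_results` keeps the score of every combination so the full grid can be inspected.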

### 2.2 Global Analysis

GenRA global analysis is based on all of the data and ignores the clusters. Predictions are made using a specified number of nearest neighbors, and performance is scored using ROC AUC analysis.

### 2.3 Local Analysis 

GenRA local analysis uses clusters for read-across performance evaluation. Predictions are made using a specified number of nearest neighbors and predefined chemical clusters, and performance is scored using ROC AUC analysis.
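
The difference between the global and local analyses can be sketched in plain Python. This is a conceptual illustration, not genra-py's implementation: the helper names, the cluster labels, and the toy records are hypothetical, and the local analysis is assumed here to draw neighbors only from within each chemical's own cluster. ROC AUC is computed as the fraction of active/inactive pairs in which the active chemical receives the higher score, with ties counting one half.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict(target, train, k):
    """Similarity-weighted activity of the k most similar training items."""
    scored = sorted(
        ((jaccard(target, fp), y) for fp, y in train),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    weight = sum(s for s, _ in scored)
    return sum(s * y for s, y in scored) / weight if weight else 0.0


def loo_predictions(items, k):
    """Predict each item from all the other items in the same list."""
    return [
        predict(fp, items[:i] + items[i + 1:], k)
        for i, (fp, _) in enumerate(items)
    ]


def roc_auc(labels, scores):
    """Fraction of (active, inactive) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one active and one inactive")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical records: (cluster label, fingerprint bits, binary activity).
records = [
    ("A", {1, 2, 3}, 1), ("A", {1, 2, 4}, 1),
    ("A", {5, 6, 7}, 0), ("A", {5, 6, 8}, 0),
    ("B", {11, 12, 13}, 1), ("B", {11, 12, 14}, 1),
    ("B", {15, 16, 17}, 0), ("B", {15, 16, 18}, 0),
]
k = 1

# Global analysis: neighbors are drawn from all chemicals, clusters ignored.
everything = [(fp, y) for _, fp, y in records]
global_auc = roc_auc([y for _, y in everything], loo_predictions(everything, k))

# Local analysis: neighbors and scoring are restricted to each cluster.
clusters = {}
for label, fp, y in records:
    clusters.setdefault(label, []).append((fp, y))
local_auc = {
    label: roc_auc([y for _, y in members], loo_predictions(members, k))
    for label, members in clusters.items()
}
```

The global analysis yields a single score for the whole data set, while the local analysis yields one score per cluster, which makes it possible to see where read-across performs well or poorly.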

### 2.4 Visualization of Chemical Nearest Neighborhoods

genra-py enables visualization of the local nearest neighborhood of a specific chemical using `GenRAViewNN()` from `genra.rax.viz.nn`.

An example of this is available in **Use-Case Examples**.

## 3. Use-Case Examples 

### Performance Tuning Example 
This example determines the optimal number of neighbors and similarity metric using GenRAPredValue and grid search cross-validation. We specify the range of nearest-neighbor counts to test, several similarity/distance metrics, and ROC AUC scoring for a specific descriptor type (X) and toxicity endpoint (Yb).

![image.png](attachment:image.png)

### GenRA Global Analysis 
As an example, this global analysis is based on all of the data, ignoring the clusters, and uses bioactivity fingerprints, 5 nearest neighbors, the Jaccard similarity metric, and ROC AUC scoring.

![image.png](attachment:image.png)

Get Global Performance Scores:

![image.png](attachment:image.png)

### GenRA Local Analysis 
For local predictions, this example uses chemical and biological descriptors to obtain a GenRA prediction for every chemical based on chemical clusters. It uses 5 nearest neighbors and Jaccard similarity.

![image.png](attachment:image.png)

Get performance scores:

![image.png](attachment:image.png)



### Nearest Neighbor Visualization
This is an example of visualizing the 10 nearest neighbors of the chemical 4-hexylaniline using bioactivity descriptors. This is done using the `GenRAViewNN` function from `genra.rax.viz.nn`.

![image.png](attachment:image.png)