# 🧬 Precious3GPT Multi-Species Aging Analysis

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)

This notebook demonstrates how to use Precious3GPT (P3GPT) to analyze aging signatures across multiple tissues and species. The notebook is part of the Precious3GPT project by Insilico Medicine, aimed at facilitating drug discovery and aging research through AI.

## 📋 Table of Contents
1. [Setup and Dependencies](#setup)
2. [Parameter Grid Configuration](#parameters)
3. [Analysis Pipeline Setup](#pipeline)
4. [Results Analysis](#analysis)
5. [Cross-Species Analysis](#comparison)
6. [Pathway Enrichment Analysis](#enrichment)

## 🎯 Key Features
- Multi-species aging signature analysis
- Tissue-specific comparisons
- Pathway enrichment analysis
- Cross-species gene overlap analysis

## 🚀 Getting Started

### Prerequisites
- Python 3.11+
- GPU with CUDA support
- Precious3GPT git repository downloaded and installed

## 1. Setup and Dependencies <a name="setup"></a>

Import required libraries and set up the environment. Make sure you have all dependencies installed.

In [1]:
import warnings
warnings.filterwarnings('ignore')

from handlers.p3_multimodal_handler import EndpointHandler, HandlerFactory
from p3screen.screening import TopTokenScreening, TokenAnalysis

## 2. Parameter Grid Configuration <a name="parameters"></a>

Define parameter grids for human (hsap) and mouse (mmus) aging analysis. These grids specify:
- Target tissues
- Age ranges for young vs. old comparisons
- Omics data type

💡 **Note**: You can modify these grids for your specific research needs.

In [2]:
# Human parameter grid
hsap_grid = {
    "tissue": ['skin', 'liver', 'muscle', 'lung', 'heart', 'kidney'],
    "dataset_type": ['expression'],
    "species": ['human'],
    "control": ['19.95-25.0'],
    "case": ['70.0-80.0']
}

# Mouse parameter grid
mmus_grid = {
    "tissue": ['skin', 'liver', 'muscle', 'lung', 'heart', 'kidney'],
    "dataset_type": ['expression'],
    "species": ['mouse'],
    "control": ['Mouse-19.95-30'],
    "case": ['Mouse-350-400']
}
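
As a rough illustration of how such a grid translates into individual model queries, the sketch below expands each grid into one parameter dictionary per combination of values. It assumes that `add_grid` takes the Cartesian product of the listed values, which is consistent with the 12 results reported for the two six-tissue grids later in this notebook but is not confirmed by the P3GPT source. `expand_grid` is a hypothetical helper, not part of the package.

```python
from itertools import product

# Same grids as in the cell above, repeated so the sketch is self-contained
hsap_grid = {
    "tissue": ['skin', 'liver', 'muscle', 'lung', 'heart', 'kidney'],
    "dataset_type": ['expression'],
    "species": ['human'],
    "control": ['19.95-25.0'],
    "case": ['70.0-80.0'],
}
mmus_grid = {
    "tissue": ['skin', 'liver', 'muscle', 'lung', 'heart', 'kidney'],
    "dataset_type": ['expression'],
    "species": ['mouse'],
    "control": ['Mouse-19.95-30'],
    "case": ['Mouse-350-400'],
}


def expand_grid(grid):
    """Return one dict per combination of the grid's values (hypothetical helper)."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


# Assumption: each combination corresponds to one generation request
queries = expand_grid(hsap_grid) + expand_grid(mmus_grid)
print(len(queries))  # 12
print(queries[0])
```

Adding a second value to any key (for example, another `case` age range) would double the number of combinations for that grid under this assumption.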

## 3. Analysis Pipeline Setup <a name="pipeline"></a>

Initialize the P3GPT handler and screening objects. The handler manages model interactions while the screening object coordinates the analysis workflow.

⚠️ **Important**: Select an available GPU using the `device` parameter.

In [3]:
# Initialize handler with GPU support
handler = HandlerFactory.create_handler('endpoint', device='cuda:0')
screen = TopTokenScreening(handler)

# Add parameter grids and run analysis
screen.parameter_options = []
screen.add_grid(hsap_grid)
screen.add_grid(mmus_grid)
# top_k sets the length of each generated gene list.
# By default, both up- and down-regulated genes are
# generated; pass "only_up" or "only_down" if you
# need just one direction.
screen(top_k=100, only_up=True)

print(screen)

Generation time: 1.86 seconds
Generation time: 1.12 seconds
Generation time: 1.12 seconds
Generation time: 12.63 seconds
Generation time: 1.12 seconds
Generation time: 1.15 seconds
Generation time: 1.12 seconds
Generation time: 1.09 seconds
Generation time: 1.12 seconds
Generation time: 6.52 seconds
Generation time: 6.54 seconds
Generation time: 8.46 seconds
TopTokenScreening with 12 results from 2 grids


## 4. Results Analysis <a name="analysis"></a>

### 4.1 Saving Results
Results can be exported to JSON for future reference or sharing and reloaded later. Both calls are commented out in the cell below, so uncomment the one you need.

In [9]:
# Save results
# screen.export_result("./analysis_results.json")

# Load previously saved results
# screen = TopTokenScreening.load_result('./analysis_results.json', handler)

# Convert to DataFrame for analysis
results_df = screen.result_to_df()
# P3GPT generations will be in the
# columns with the "gen_" prefix
results_df.head()

Unnamed: 0,instruction,species,tissue,cell,age,gender,efo,drug,dose,time,case,control,dataset_type,datatype,up,down,gen_up
bb7933b0eadcef0f8d455f9d7524208e,[age_group2diff2age_group],human,skin,,,,,,,,70.0-80.0,19.95-25.0,expression,,[],[],PRG4;IKZF1;CCL4L2;CXCL3;TRH;LTF;FUS;CHAC1;CXCL...
3a1b9fed7b0c3de5f200a242e8fcca66,[age_group2diff2age_group],human,liver,,,,,,,,70.0-80.0,19.95-25.0,expression,,[],[],HEPN1;PGC;CD177;FEZF2;MUC5B;GAGE12E;HLA-A;DHCR...
9f7b7a2984662c1fdfade459c231c4f8,[age_group2diff2age_group],human,muscle,,,,,,,,70.0-80.0,19.95-25.0,expression,,[],[],RPS4Y1;HSPA1A;MYH15;CRISP3;PEG10;UNC13C;JPH3;G...
e30d98f373dc4eb5b140ff49fd65d9fe,[age_group2diff2age_group],human,lung,,,,,,,,70.0-80.0,19.95-25.0,expression,,[],[],CYP1A1;DCD;UTP14C;DLK1;SFTPC;TBC1D3E;MUC16;APO...
d4fe52cdc43a1bbd1450125fda9213a9,[age_group2diff2age_group],human,heart,,,,,,,,70.0-80.0,19.95-25.0,expression,,[],[],BMP10;C4orf54;HEPN1;FMN2;TBC1D3E;TBC1D3;HSPA1B...
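
Judging by the preview above, each `gen_` cell stores gene symbols as a single semicolon-separated string. A minimal sketch of turning such a string into a Python list follows. `parse_gene_list` is a hypothetical helper, and the sample string is only the visible start of the truncated human skin row, not the full 100-gene list.

```python
def parse_gene_list(cell):
    """Split a semicolon-separated string of gene symbols into a list."""
    return [symbol.strip() for symbol in cell.split(";") if symbol.strip()]


# Visible start of the human skin `gen_up` cell (the preview truncates the rest)
sample = "PRG4;IKZF1;CCL4L2;CXCL3;TRH;LTF;FUS;CHAC1"
genes = parse_gene_list(sample)
print(len(genes), genes[:3])  # 8 ['PRG4', 'IKZF1', 'CCL4L2']
```

With pandas, the same helper could be applied column-wise, for example `results_df["gen_up"].apply(parse_gene_list)`, assuming the column contains only strings.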


## 5. Cross-Species Analysis <a name="comparison"></a>

Compare gene lists between species using the `TokenAnalysis` tool. This helps identify conserved aging signatures across species.

In [10]:
# Initialize analyzer
analyzer = TokenAnalysis(screen)

# Group results that differ only in species, separately for each tissue
siblings = analyzer.find_siblings_stratified(
    varying_params=['species'],
    stratify_by='tissue'
)

overlaps = analyzer.analyze_overlap()
counts = analyzer.overlap_size()

print("\nGene overlap counts between species:")
for tissue, data in counts.items():
    print(f"\n{tissue}:")
    for regulation, pairs in data.items():
        for species_pair, count in pairs.items():
            print(f"  {species_pair}: {count} genes")

Using species with 2 unique values
Using species with 2 unique values
Using species with 2 unique values
Using species with 2 unique values
Using species with 2 unique values
Using species with 2 unique values

Gene overlap counts between species:

skin:
  ('human', 'mouse'): 25 genes

liver:
  ('human', 'mouse'): 12 genes

muscle:
  ('human', 'mouse'): 26 genes

lung:
  ('human', 'mouse'): 23 genes

heart:
  ('human', 'mouse'): 9 genes

kidney:
  ('human', 'mouse'): 29 genes
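
Conceptually, each count above is the size of the intersection between two generated gene lists. The sketch below shows that idea with plain Python sets. It assumes that symbols are matched exactly as written (how `TokenAnalysis` reconciles human and mouse gene naming is not stated here), and the two short lists are made-up placeholders rather than model output.

```python
def shared_genes(list_a, list_b):
    """Return the sorted symbols present in both lists."""
    return sorted(set(list_a) & set(list_b))


# Hypothetical, shortened gene lists used only for illustration
human_up = ["CXCL3", "IL1B", "PRG4", "LTF"]
mouse_up = ["IL1B", "CXCL2", "CXCL3", "TRH"]

common = shared_genes(human_up, mouse_up)
print(common, len(common))  # ['CXCL3', 'IL1B'] 2
```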


## 6. Pathway Enrichment Analysis <a name="enrichment"></a>

Perform pathway enrichment analysis using [Enrichr-KG](https://maayanlab.cloud/enrichr-kg) to identify biological processes associated with aging signatures.

⚠️ **Note**: Respect API rate limits by adjusting batch size.

In [11]:
# Configure and run enrichment analysis
analyzer.batch_size = 6  # Adjust based on API limits
results = analyzer.enrich_overlaps()

# Get significant pathways
significant_pathways = analyzer.get_significant_pathways()

print("\nSignificant pathways in aging signatures:")
for tissue, reg_data in significant_pathways.items():
    for reg, pairs in reg_data.items():
        for pair, pathways in pairs.items():
            if pathways:
                print(f"\n{tissue} ({pair}):")
                for pathway, pval, genes, score in pathways:
                    print(f"  • {pathway}")
                    print(f"    p-value: {pval:.2e}")
                    print(f"    genes: {genes}")


Processing batch 1/1
Processing skin up ('human', 'mouse') (25 genes)... DONE
SUCCESS
Processing liver up ('human', 'mouse') (12 genes)... DONE
SUCCESS
Processing muscle up ('human', 'mouse') (26 genes)... DONE
SUCCESS
Processing lung up ('human', 'mouse') (23 genes)... DONE
SUCCESS
Processing heart up ('human', 'mouse') (9 genes)... DONE
SUCCESS
Processing kidney up ('human', 'mouse') (29 genes)... DONE
SUCCESS

Enrichment analysis complete:
- Successful: 6/6
- Failed: 0/6

Significant pathways in aging signatures:

skin (('human', 'mouse')):
  • cytokine cytokine receptor interaction
    p-value: 3.04e-03
    genes: CXCL6;IL1B;CXCL3;CXCL2
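
The `Processing batch 1/1` line follows from the batch size: six overlap lists with `batch_size = 6` fit into a single batch. A minimal sketch of that chunking is shown below, assuming the analyzer simply slices its queue into consecutive groups. `make_batches` is a hypothetical helper, not the package's implementation.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# One overlap list per tissue, as in the run above
tissues = ['skin', 'liver', 'muscle', 'lung', 'heart', 'kidney']

print(len(make_batches(tissues, 6)))  # 1
print(len(make_batches(tissues, 4)))  # 2
```

Under this assumption, lowering the batch size produces more, smaller batches, which spreads the API requests out at the cost of a longer total run.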


## 🔄 Alternative Screening Approaches

The same screening results support comparisons other than cross-species ones. For example, you can compare aging signatures across tissues within a single species.

### Example: Tissue-Specific Analysis
In this example, we identify aging-associated genes shared between pairs of tissues within each species:

1. First, redefine the sibling groups to compare tissues instead of species;
2. Use `analyzer.inspect_parameter_pairs('tissue')` to examine all pairs of samples that differ only in their tissue parameter;
3. Intersect the gene lists in each pair to identify shared signatures of aging.

This approach is particularly useful for:
- Understanding tissue-specific aging mechanisms
- Identifying common aging pathways across tissues
- Finding tissue-specific therapeutic targets

💡 **Tip**: You can modify the `varying_params` and `stratify_by` parameters to explore different biological comparisons based on your research interests.

In [15]:
siblings = analyzer.find_siblings_stratified(
    varying_params=['tissue'],
    stratify_by='species'
)
analyzer.overlapping_genes = analyzer.inspect_parameter_pairs('tissue')
counts = analyzer.overlap_size()

print("\nGene overlap counts between tissues:")
for species, data in counts.items():
    print(f"\n{species}:")
    for regulation, pairs in data.items():
        for tissue_pair, count in pairs.items():
            print(f"  {tissue_pair}: {count} genes")


Gene overlap counts between tissues:

human:
  ('skin', 'liver'): 12 genes
  ('skin', 'muscle'): 8 genes
  ('skin', 'lung'): 22 genes
  ('skin', 'heart'): 13 genes
  ('skin', 'kidney'): 19 genes
  ('liver', 'muscle'): 10 genes
  ('liver', 'lung'): 14 genes
  ('liver', 'heart'): 14 genes
  ('liver', 'kidney'): 16 genes
  ('muscle', 'lung'): 13 genes
  ('muscle', 'heart'): 12 genes
  ('muscle', 'kidney'): 10 genes
  ('lung', 'heart'): 19 genes
  ('lung', 'kidney'): 25 genes
  ('heart', 'kidney'): 8 genes

mouse:
  ('skin', 'liver'): 22 genes
  ('skin', 'muscle'): 22 genes
  ('skin', 'lung'): 26 genes
  ('skin', 'heart'): 18 genes
  ('skin', 'kidney'): 34 genes
  ('liver', 'muscle'): 23 genes
  ('liver', 'lung'): 17 genes
  ('liver', 'heart'): 15 genes
  ('liver', 'kidney'): 19 genes
  ('muscle', 'lung'): 18 genes
  ('muscle', 'heart'): 31 genes
  ('muscle', 'kidney'): 15 genes
  ('lung', 'heart'): 26 genes
  ('lung', 'kidney'): 38 genes
  ('heart', 'kidney'): 20 genes
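
The 15 tissue pairs listed for each species are exactly the unordered pairs of the six tissues in the grid. The sketch below reproduces that enumeration with `itertools.combinations`. Whether `inspect_parameter_pairs` builds its pairs this way internally is an assumption.

```python
from itertools import combinations

tissues = ['skin', 'liver', 'muscle', 'lung', 'heart', 'kidney']

# Every unordered pair of distinct tissues, in grid order
tissue_pairs = list(combinations(tissues, 2))

print(len(tissue_pairs))                  # 15
print(tissue_pairs[0], tissue_pairs[-1])  # ('skin', 'liver') ('heart', 'kidney')
```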


## 📚 Additional Resources

- [Precious3GPT on Hugging Face](https://doi.org/10.57967/hf/2699)
- [Precious Models Hub](https://insilico.com/precious)
- [Enrichr-KG API Documentation](https://maayanlab.cloud/Enrichr/help#api)
- [Related Publication](https://www.biorxiv.org/content/10.1101/2024.07.25.605062)

## ⚖️ License

This notebook is licensed under the MIT License. See the LICENSE file for details.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## ✍️ Citation

If you use this notebook in your research, please cite:
- Precious3GPT model
```bibtex
@misc {insilico_medicine_2024,
	author       = { {Insilico Medicine} },
	title        = { precious3-gpt-multi-modal (Revision 9e240ab) },
	year         = 2024,
	url          = { https://huggingface.co/insilicomedicine/precious3-gpt-multi-modal },
	doi          = { 10.57967/hf/2699 },
	publisher    = { Hugging Face }
}
```
- Precious3GPT preprint
```bibtex
@article {Galkin2024.07.25.605062,
	author = {Galkin, Fedor and Naumov, Vladimir and Pushkov, Stefan and Sidorenko, Denis and Urban, Anatoly and Zagirova, Diana and Alawi, Khadija M. and Aliper, Alex and Gumerov, Ruslan and Kalashnikov, Aleksandr and Mukba, Sabina and Pogorelskaya, Aleksandra and Ren, Feng and Shneyderman, Anastasia and Tang, Qiuqiong and Xiao, Deyong and Tyshkovskiy, Alexander and Ying, Kejun and Gladyshev, Vadim N. and Zhavoronkov, Alex},
	title = {Precious3GPT: Multimodal Multi-Species Multi-Omics Multi-Tissue Transformer for Aging Research and Drug Discovery},
	elocation-id = {2024.07.25.605062},
	year = {2024},
	doi = {10.1101/2024.07.25.605062},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2024/07/25/2024.07.25.605062},
	eprint = {https://www.biorxiv.org/content/early/2024/07/25/2024.07.25.605062.full.pdf},
	journal = {bioRxiv}
}
```