In [2]:
import sragent
import pandas as pd

The `sragent` tool is centered on a single function, `gather()`. \
The basic job of `gather()` is to take a BioProject ID and query the Sequence Read Archive (SRA) for metadata on every sample in that project.\
It does this using the `entrez` tools from Biopython and some clunky XML parsing to extract useful metadata.\
`gather()` accepts a single ID string or a list of strings for any number of projects.\
>[!WARNING]
>`gather()` accesses an external server that is prone to connection problems.\
> Waiting and rerunning usually fixes a connection error, but you may need to split your request into chunks.

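If a long list of projects keeps failing partway through, the advice in the warning can be sketched roughly as follows. This is a minimal sketch, not part of `sragent`: `gather_in_chunks`, `chunk_size`, `retries`, and `wait_seconds` are hypothetical names, and because the exception raised on a dropped connection is not documented here, the sketch catches any exception. The demo uses a stand-in fetch function; with the real tool you would pass `sragent.gather` and combine the per-chunk dataframes with `pd.concat`.

```python
import time


def gather_in_chunks(ids, fetch, chunk_size=5, retries=3, wait_seconds=30):
    """Call `fetch` on successive slices of `ids`, retrying a slice if it fails.

    `fetch` is any callable that accepts a list of IDs (e.g. `sragent.gather`).
    Returns a list with one result per chunk, in order.
    """
    results = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        for attempt in range(1, retries + 1):
            try:
                results.append(fetch(chunk))
                break
            except Exception as err:  # assumption: the exact exception type is unknown
                if attempt == retries:
                    raise  # give up on this chunk after the last attempt
                print(f"Chunk at offset {start} failed ({err}); retrying in {wait_seconds}s")
                time.sleep(wait_seconds)
    return results


# Demo with a stand-in fetch that fails once, then succeeds.
calls = []

def flaky_fetch(chunk):
    calls.append(list(chunk))
    if len(calls) == 1:
        raise ConnectionError("simulated dropped connection")
    return list(chunk)

chunks = gather_in_chunks(["A", "B", "C"], flaky_fetch, chunk_size=2, retries=2, wait_seconds=0)

# Hypothetical usage with the real function:
# meta = pd.concat(gather_in_chunks(projects, sragent.gather), ignore_index=True)
```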
For this demo project I'm going to pull and annotate a large number of yeast histone PTM ChIP-seq experiments. \
Below is a list of BioProject IDs I've already collected; each includes at least one experiment profiling a histone PTM. \

In [None]:
projects = ['PRJNA262623', 'PRJNA227448', 'PRJNA140547', 'PRJNA989169',
            'PRJNA954824', 'PRJNA912607', 'PRJNA831793', 'PRJNA783027',
            'PRJNA753826', 'PRJNA737490', 'PRJNA721183', 'PRJNA672715',
            'PRJNA643248', 'PRJNA588479', 'PRJNA559331', 'PRJNA492238',
            'PRJNA487157', 'PRJNA450434', 'PRJNA384583', 'PRJNA320298',
            'PRJNA278334', 'PRJNA274975', 'PRJNA254082', 'PRJNA231240',
            'PRJNA153387']

We can now run `gather()` with this list of IDs. \
No other arguments are needed if all we want is the metadata for these projects.

In [6]:
meta = sragent.gather(projects)
#meta = pd.read_csv('sragent_output/metadata.csv')
meta

Fetching PRJNA262623...
PRJNA262623 fetch complete...
Fetching PRJNA227448...
PRJNA227448 fetch complete...
Fetching PRJNA140547...
PRJNA140547 fetch complete...
Fetching PRJNA989169...
PRJNA989169 fetch complete...
Fetching PRJNA954824...
PRJNA954824 fetch complete...
Fetching PRJNA912607...
PRJNA912607 fetch complete...
Fetching PRJNA831793...
PRJNA831793 fetch complete...
Fetching PRJNA783027...
PRJNA783027 fetch complete...
Fetching PRJNA753826...
PRJNA753826 fetch complete...
Fetching PRJNA737490...
PRJNA737490 fetch complete...
Fetching PRJNA721183...
PRJNA721183 fetch complete...
Fetching PRJNA672715...
PRJNA672715 fetch complete...
Fetching PRJNA643248...
PRJNA643248 fetch complete...
Fetching PRJNA588479...
PRJNA588479 fetch complete...
Fetching PRJNA559331...
PRJNA559331 fetch complete...
Fetching PRJNA492238...
PRJNA492238 fetch complete...
Fetching PRJNA487157...
PRJNA487157 fetch complete...
Fetching PRJNA450434...
PRJNA450434 fetch complete...
Fetching PRJNA384583...
PRJN

,project_id,project_title,abstract,protocol,run_id,experiment_id,title,organism,assay_id,attributes
0,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593257,SRX717562,input2.2_15,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 1...
1,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593256,SRX717561,input1.2_15,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 1...
2,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593255,SRX717560,input2.2_8,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 8...
3,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593254,SRX717559,input3.1_8,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 8...
4,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593253,SRX717558,input1.3_60,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 6...
...,...,...,...,...,...,...,...,...,...,...
1,PRJNA153387,Co-dependency of H2B monoubiquitination and nu...,Monoubiquitination of histone H2B on lysine 12...,NO_PROTOCOL,SRR441515,SRX127920,delta_chd1_K79me3,Saccharomyces_cerevisiae,ChIP-Seq,"source_name : delta_Chd1 yeast (FM391,delta Ch..."
2,PRJNA153387,Co-dependency of H2B monoubiquitination and nu...,Monoubiquitination of histone H2B on lysine 12...,NO_PROTOCOL,SRR441512,SRX127919,delta_chd1_wce,Saccharomyces_cerevisiae,ChIP-Seq,"source_name : delta_Chd1 yeast (FM391,delta Ch..."
3,PRJNA153387,Co-dependency of H2B monoubiquitination and nu...,Monoubiquitination of histone H2B on lysine 12...,NO_PROTOCOL,SRR441509,SRX127918,wt_FM391_K4me3,Saccharomyces_cerevisiae,ChIP-Seq,source_name : Wild-type yeast (FM391) genotype...
4,PRJNA153387,Co-dependency of H2B monoubiquitination and nu...,Monoubiquitination of histone H2B on lysine 12...,NO_PROTOCOL,SRR441506,SRX127917,wt_FM391_K79me3,Saccharomyces_cerevisiae,ChIP-Seq,source_name : Wild-type yeast (FM391) genotype...

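The commented-out `pd.read_csv` line in the cell above hints that a saved copy of the metadata can be reloaded instead of hitting the SRA again. A minimal sketch of that idea, assuming only that the fetch function returns a pandas dataframe: `load_or_fetch` and `cache_path` are hypothetical names, and the demo uses a stand-in fetch function and a temporary directory rather than `sragent` itself.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_fetch(ids, fetch, cache_path):
    """Return the cached metadata if `cache_path` exists, else fetch and save it."""
    cache = Path(cache_path)
    if cache.exists():
        return pd.read_csv(cache)
    meta = fetch(ids)
    cache.parent.mkdir(parents=True, exist_ok=True)
    meta.to_csv(cache, index=False)  # index=False keeps the row index out of the file
    return meta


# Demo: the stand-in fetch should only be called once.
fetch_calls = []

def fake_fetch(ids):
    fetch_calls.append(ids)
    return pd.DataFrame({"project_id": ids})

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "cache" / "metadata.csv"
    first = load_or_fetch(["PRJNA262623"], fake_fetch, cache_file)   # fetches and saves
    second = load_or_fetch(["PRJNA262623"], fake_fetch, cache_file)  # reads the saved CSV
```

With the real tool, `fetch` would be `sragent.gather` and `ids` would be `projects`.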

`meta` is a pandas dataframe holding the project IDs, experiment IDs, abstracts, protocols, experiment titles, and experiment attributes.\
There are a total of 1,172 experiments here.\
\
`annotate()` can also take this dataframe as input, so we don't have to repeat metadata pulls from the SRA unnecessarily.\
It also lets us manipulate and subset the metadata before we annotate. \
For instance, this vignette covers 25 projects that include histone PTM ChIP-seq experiments in yeast. \
But these projects also contain non-histone ChIP-seq experiments that we don't care about for now. \
Let's filter down to just the experiments with histone targets.\

First we'll read in a CSV with all the histone PTM targets we're interested in, plus 'H3' and 'input' targets so we don't lose our ChIP-seq controls.\
We then join the desired epitopes into a single pattern string, match it against the experiment titles, and use the result as a mask on the metadata dataframe. \
That drops the number of experiments to annotate to 899. \

In [4]:
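# Load the epitopes of interest and join them into one regex alternation
epitopes = pd.read_csv('epitopes.csv')
eps = '|'.join(epitopes['epitope_id'].tolist())
# Keep experiments whose title matches any epitope (case-insensitive; missing titles are dropped)
mask = meta['title'].str.contains(eps, case=False, na=False)
ptm_meta = meta[mask]
ptm_meta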

,project_id,project_title,abstract,protocol,run_id,experiment_id,title,organism,assay_id,attributes
0,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593257,SRX717562,input2.2_15,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 1...
1,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593256,SRX717561,input1.2_15,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 1...
2,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593255,SRX717560,input2.2_8,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 8...
3,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593254,SRX717559,input3.1_8,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 8...
4,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593253,SRX717558,input1.3_60,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 6...
...,...,...,...,...,...,...,...,...,...,...
190,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593061,SRX717366,h2ak5ac_4,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 4...
191,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593060,SRX717365,h2ak5ac_8,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 8...
192,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593059,SRX717364,h2ak5ac_15,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 1...
193,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593058,SRX717363,h2ak5ac_30,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 3...

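One thing to watch with the filter above: `str.contains` treats the joined string as a regular expression by default, so any epitope name containing a character like `.` or `(` would be read as regex syntax. A minimal sketch of a safer way to build the pattern, using `re.escape`. The helper name `build_title_pattern` and the epitope names and titles in the demo are made up for illustration; the real `epitopes.csv` may not contain any such characters.

```python
import re


def build_title_pattern(epitope_ids):
    """Join epitope names into one alternation, escaping regex metacharacters."""
    return "|".join(re.escape(e) for e in epitope_ids)


# Hypothetical epitope names; the '.' in 'H2A.Z' would otherwise match any character.
epitope_ids = ["H3K4me3", "H2A.Z", "input"]
pattern = build_title_pattern(epitope_ids)

# Hypothetical experiment titles.
titles = ["wt_H3K4me3_rep1", "Input2.2_15", "H2AxZ_mock", "Rpb3_ChIP"]
kept = [t for t in titles if re.search(pattern, t, flags=re.IGNORECASE)]
```

The resulting `pattern` can be passed to `meta['title'].str.contains(pattern, case=False, na=False)` in place of `eps`.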

When we're happy with our subset of experiments, we can run `annotate()` with the subset dataframe as input.\
By default `annotate()` saves the annotation as a CSV, but we can also store the resulting dataframe to explore the results. \

In [5]:
# Optional: uncomment to try a single project first
#test = ptm_meta[ptm_meta['project_id'] == 'PRJNA262623']
ptm_meta_annotated = sragent.annotate(ptm_meta,
                                      model_summary='gpt-4o-mini',
                                      model_annotation='gpt-4o-2024-08-06',
                                      annotate=True)

Let's break down the metadata and identify the key elements that will help classify and generate structured metadata for each experiment in the project.

### Main Goal of the Project
The main goal of the project is to map 26 histone modifications genome-wide over a time course following dramatic transcriptional reprogramming in response to diamide stress in yeast. The project aims to understand the dynamics and combinatorial complexity of histone modifications during the stress response.

### Experimental Conditions
- **Stress Condition**: Diamide stress
- **Time Points**: 0, 4, 8, 15, 30, 60 minutes
- **Controls**: MNase input controls at each time point
- **Histone Modifications**: 26 different histone modifications

### Key Words and Their Indications

#### Gene Mutations
- **Key Words**: There are no specific keywords indicating gene mutations in the provided metadata. The strain used is BY4741, which is a common yeast strain and does not indicate any specific mutations relevant to

For this large example, using `gpt-4o-mini` as of 08/13/24, annotation took 33m54s and used ~1.8M tokens (1.68M context, 0.1M generated) at a total cost of ~$0.30. \
For the larger `gpt-4o` model (currently ~35x more expensive), we can ballpark that the same annotation would cost ~$10.50. \
This is a rough estimate because input and output token prices differ substantially, but it gives a decent idea.

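To see why a flat multiplier is only a ballpark, the estimate can be split into input and output tokens. A minimal sketch: the token counts come from the run above, but the per-million-token prices are placeholder assumptions picked so the total lands near the reported ~$0.30, not quoted rates. Swap in the provider's current prices before relying on the numbers.

```python
def estimate_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Cost in dollars from token counts and per-million-token prices."""
    return (input_tokens / 1e6) * price_in_per_m + (output_tokens / 1e6) * price_out_per_m


# Token counts reported for the run above.
input_tokens, output_tokens = 1_680_000, 100_000

# Assumed prices in dollars per million tokens (illustrative placeholders).
small_cost = estimate_cost(input_tokens, output_tokens,
                           price_in_per_m=0.15, price_out_per_m=0.60)

# The flat-multiplier ballpark from the text: ~35x the observed ~$0.30.
flat_estimate = 0.30 * 35

print(f"small model: ${small_cost:.2f}, flat 35x ballpark: ${flat_estimate:.2f}")
```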
So what did `annotate()` do?
Let's take a look at the output, which should be saved as `annotation_FULL.csv` in the `sragent_output` directory.

`ptm_meta_annotated` is a pandas dataframe with the following columns: \
- `project_id`
- `experiment_id`
- `exp_title` - title of the experiment as listed in the SRA metadata
(the following columns are generated from the LLM response)
- `gene_mutation` - True or False: does this experiment test a gene mutation?
- `gene_deletion` - True or False: does this experiment test a gene deletion?
- `protein_depletion` - True or False: does this experiment test a protein depletion?
- `stress_condition` - True or False: does this experiment test a stress condition?
- `time_series` - True or False: does this experiment specify a time point or growth stage?
- `chip_input` - True or False: is this an input experiment?
- `antibody_control` - True or False: is this an antibody control, i.e. a non-specific antibody?
- `chip_target` - protein targeted in the experiment
- `mutation` - specific mutation if present
- `deletion` - specific deletion if present
- `depletion` - protein depleted if present
- `stress` - stress condition if present
- `time_point`
(the following are determined based on the classifications from the LLM above)
- `warning` - True or False: flags a logical disagreement between the LLM classifications. Useful for catching mistakes and guiding human review
- `sample` - a simple sample name or tag that merges experiment attributes such as genotype, time, and target.
- `perturbation` - one of ['None','gene_mutation','gene_deletion','protein_depletion','stress_condition']
- `control` - the `sample` name of the control (input or other) experiment that matches target for that project

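This vignette does not spell out which rule sets `warning`, but one plausible kind of "logical disagreement" is a True/False flag that contradicts its detail column, for example `stress_condition` is True while `stress` is empty. A minimal sketch of such a check on a single row, purely as an illustration: the flag-to-detail pairing, the helper names, and the spelling `mutation` for the detail column are all assumptions, not `sragent`'s actual logic.

```python
# Hypothetical pairing of each True/False flag with its free-text detail column.
FLAG_TO_DETAIL = {
    "gene_mutation": "mutation",
    "gene_deletion": "deletion",
    "protein_depletion": "depletion",
    "stress_condition": "stress",
    "time_series": "time_point",
}


def is_blank(value):
    """Treat None, NaN, and empty or whitespace-only strings as missing."""
    return value is None or value != value or str(value).strip() == ""


def has_disagreement(row):
    """True if any flag is set without a detail, or a detail is given without its flag."""
    for flag, detail in FLAG_TO_DETAIL.items():
        if bool(row.get(flag)) == is_blank(row.get(detail)):
            return True
    return False


# Demo rows modelled on the diamide time course (values are illustrative).
consistent_row = {
    "gene_mutation": False, "mutation": None,
    "gene_deletion": False, "deletion": None,
    "protein_depletion": False, "depletion": None,
    "stress_condition": True, "stress": "diamide",
    "time_series": True, "time_point": "4 min",
}
conflicting_row = dict(consistent_row, stress="")  # flag says stress, detail is empty

ok_flag = has_disagreement(consistent_row)
bad_flag = has_disagreement(conflicting_row)
```

A check like this is cheap to run on your own annotated output as a second pass, independent of whatever `warning` already reports.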

Let's examine how well the `gpt-4o-mini` model did at classifying and annotating these experiments.\
How many experiments were flagged for review?

In [7]:
ptm_meta_annotated
#print(ptm_meta_annotated[ptm_meta_annotated['warning']])

project_id,,experiment_id,exp_title,gene_mutation,gene_deletion,protein_depletion,stress_condition,time_series,chip_input,antibody_control,chip_target,...,deletion,depletion,stress,time_point,project_id,model,warning,sample,perturbation,control
PRJNA262623,0,SRX717474,h4k12ac_4,False,False,False,True,True,False,False,H4K12ac,...,,,diamide,4 min,PRJNA262623,gpt-4o,False,H4K12ac-diamide-stress_condition-4_min,stress_condition,
PRJNA262623,2,SRX717446,h3k79me_60,False,False,False,True,True,False,False,H3K79me,...,,,diamide,60 min,PRJNA262623,gpt-4o,False,H3K79me-diamide-stress_condition-60_min,stress_condition,Input-diamide-stress_condition-60_min
PRJNA262623,3,SRX717506,h4r3me2s_60,False,False,False,True,True,False,False,H4R3me2s,...,,,diamide,60 min,PRJNA262623,gpt-4o,False,H4R3me2s-diamide-stress_condition-60_min,stress_condition,Input-diamide-stress_condition-60_min
PRJNA262623,4,SRX717485,h4k20me_8,False,False,False,True,True,False,False,H4K20me,...,,,diamide,8 min,PRJNA262623,gpt-4o,False,H4K20me-diamide-stress_condition-8_min,stress_condition,
PRJNA262623,1,SRX717537,input1.2_60,False,False,False,True,True,True,False,,...,,,diamide,60 min,PRJNA262623,gpt-4o,False,Input-diamide-stress_condition-60_min,stress_condition,

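To answer the question of how many experiments were flagged, the boolean `warning` column can be used directly as a mask, as in the commented-out line above. A minimal sketch on a toy stand-in for `ptm_meta_annotated`; the rows and IDs below are made up, and the toy frame uses a plain index rather than the project-level index shown in the real output.

```python
import pandas as pd

# Toy stand-in for `ptm_meta_annotated` with made-up rows.
annotated = pd.DataFrame({
    "experiment_id": ["EXP1", "EXP2", "EXP3"],
    "project_id": ["P1", "P1", "P2"],
    "warning": [False, True, False],
})

flagged = annotated[annotated["warning"]]              # rows needing human review
n_flagged = len(flagged)
per_project = flagged["project_id"].value_counts()     # where the flags cluster

print(f"{n_flagged} of {len(annotated)} experiments flagged for review")
```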