In [1]:
import sragent
import pandas as pd

The `sragent` tool is centered on a single function, `gather()`. \
At its core, `gather()` takes a BioProject ID and queries the Sequence Read Archive (SRA) to collect metadata on all of the samples within that project.\
It does this using the `entrez` tools from Biopython and some clunky XML parsing to extract useful metadata.\
`gather()` will accept a single ID string or a list of strings for any number of projects.\
>[!WARNING]
>`gather()` accesses an external server that is prone to connection problems.\
> You may encounter connection errors. Waiting and rerunning usually solves this, but you may need to split your request into chunks.\

For this demo project I'm going to try to pull and annotate a large number of yeast histone PTM ChIP-seq experiments. \
Below is a list of BioProject IDs I've already collected that include any experiment profiling a histone PTM. \

In [2]:
projects = ['PRJNA262623', 'PRJNA227448', 'PRJNA140547', 'PRJNA989169',
            'PRJNA954824', 'PRJNA912607', 'PRJNA831793', 'PRJNA783027',
            'PRJNA753826', 'PRJNA737490', 'PRJNA721183', 'PRJNA672715',
            'PRJNA643248', 'PRJNA588479', 'PRJNA559331', 'PRJNA492238',
            'PRJNA487157', 'PRJNA450434', 'PRJNA384583', 'PRJNA320298',
            'PRJNA278334', 'PRJNA274975', 'PRJNA254082', 'PRJNA231240',
            'PRJNA153387']

We can now run `gather()` with this list of IDs. \
No other arguments are needed if we only want the metadata for these projects. \
This takes ~1m.
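
If the connection errors mentioned in the warning above show up for a list this long, one option is to split the IDs into smaller chunks and retry each chunk a few times. The sketch below is not part of `sragent`: `chunked()` and `fetch_in_chunks()` are hypothetical helpers, and `fetch` stands for any callable that takes a list of IDs (for example `sragent.gather`). It catches `OSError`, which in Python also covers `ConnectionError`, `TimeoutError`, and `urllib.error.URLError`.

```python
import time


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_chunks(fetch, ids, chunk_size=5, retries=3, wait=5.0):
    """Call `fetch(chunk)` for each chunk of `ids`, retrying on connection errors.

    `fetch` is any callable that accepts a list of IDs (hypothetical here).
    Returns the per-chunk results in order.
    """
    results = []
    for chunk in chunked(ids, chunk_size):
        for attempt in range(1, retries + 1):
            try:
                results.append(fetch(chunk))
                break  # this chunk succeeded, move on to the next one
            except OSError:
                if attempt == retries:
                    raise  # give up after the last allowed attempt
                time.sleep(wait * attempt)  # wait a little longer each time
    return results
```

With the real tool this might look like `frames = fetch_in_chunks(sragent.gather, projects)` followed by `pd.concat(frames, ignore_index=True)`, assuming each `gather()` call returns a dataframe for its chunk. How files written to `sragent_output/` behave across chunked calls is not something this sketch addresses.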

In [2]:
#meta = sragent.gather(projects)
meta = pd.read_csv('sragent_output/metadata.csv')
meta

Unnamed: 0,project_id,project_title,abstract,protocol,run_id,experiment_id,title,organism,assay_id,attributes
0,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593257,SRX717562,input2.2_15,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 1...
1,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593256,SRX717561,input1.2_15,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 1...
2,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593255,SRX717560,input2.2_8,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 8...
3,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593254,SRX717559,input3.1_8,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 8...
4,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593253,SRX717558,input1.3_60,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 6...
...,...,...,...,...,...,...,...,...,...,...
894,PRJNA231240,Context Dependency of Set1/COMPASS Mediated H3...,The stimulation of trimethylation of histone H...,We followed the modified ChIP protocol describ...,SRR1048090,SRX390613,H3K4ME3_aa762-1080_Set1_Rep_2,Saccharomyces_cerevisiae,ChIP-Seq,source_name : H3K4ME3_aa762-1080_Set1_Rep_2 ch...
895,PRJNA153387,Co-dependency of H2B monoubiquitination and nu...,Monoubiquitination of histone H2B on lysine 12...,NO_PROTOCOL,SRR441518,SRX127921,delta_chd1_K4me3,Saccharomyces_cerevisiae,ChIP-Seq,"source_name : delta_Chd1 yeast (FM391,delta Ch..."
896,PRJNA153387,Co-dependency of H2B monoubiquitination and nu...,Monoubiquitination of histone H2B on lysine 12...,NO_PROTOCOL,SRR441515,SRX127920,delta_chd1_K79me3,Saccharomyces_cerevisiae,ChIP-Seq,"source_name : delta_Chd1 yeast (FM391,delta Ch..."
897,PRJNA153387,Co-dependency of H2B monoubiquitination and nu...,Monoubiquitination of histone H2B on lysine 12...,NO_PROTOCOL,SRR441509,SRX127918,wt_FM391_K4me3,Saccharomyces_cerevisiae,ChIP-Seq,source_name : Wild-type yeast (FM391) genotype...


`meta` is a pandas dataframe with all of the project ids, experiment ids, abstracts, protocols, experiment titles, and experiment attributes.\
There are a total of 1,172 experiments here.\
\
`gather()` can also take this dataframe as input, so we don't have to repeat metadata pulls from the SRA unnecessarily.\
It also lets us manipulate and subset the metadata before we annotate. \
For instance, in this vignette we have a total of 25 different projects that include histone PTM ChIP-seq experiments in yeast. \
But there are non-histone ChIP-seq experiments within these projects as well that we don't care about for now. \
Let's try to filter down to just experiments with histone targets.\

First we'll read in a CSV with all the histone PTM targets we're interested in, as well as 'H3' and 'input' targets so we don't lose our ChIP-seq controls.\
We then join those epitope names into a single regular-expression pattern and use its matches against each experiment title as a boolean mask on the metadata dataframe. \
That reduces the number of experiments to annotate to 899. \

In [3]:
epitopes = pd.read_csv('epitopes.csv')
eps = '|'.join(epitopes['epitope_id'].tolist())
mask = meta['title'].str.contains(eps, case=False, na=False)
ptm_meta = meta[mask]
ptm_meta

Unnamed: 0,project_id,project_title,abstract,protocol,run_id,experiment_id,title,organism,assay_id,attributes
0,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593257,SRX717562,input2.2_15,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 1...
1,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593256,SRX717561,input1.2_15,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 1...
2,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593255,SRX717560,input2.2_8,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 8...
3,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593254,SRX717559,input3.1_8,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 8...
4,PRJNA262623,High resolution chromatin dynamics during a ye...,Covalent histone modifications are highly cons...,Formaldehyde was quenched by 125mM glycine. Ce...,SRR1593253,SRX717558,input1.3_60,Saccharomyces_cerevisiae,ChIP-Seq,source_name : BY4741 stress : diamide time : 6...
...,...,...,...,...,...,...,...,...,...,...
894,PRJNA231240,Context Dependency of Set1/COMPASS Mediated H3...,The stimulation of trimethylation of histone H...,We followed the modified ChIP protocol describ...,SRR1048090,SRX390613,H3K4ME3_aa762-1080_Set1_Rep_2,Saccharomyces_cerevisiae,ChIP-Seq,source_name : H3K4ME3_aa762-1080_Set1_Rep_2 ch...
895,PRJNA153387,Co-dependency of H2B monoubiquitination and nu...,Monoubiquitination of histone H2B on lysine 12...,NO_PROTOCOL,SRR441518,SRX127921,delta_chd1_K4me3,Saccharomyces_cerevisiae,ChIP-Seq,"source_name : delta_Chd1 yeast (FM391,delta Ch..."
896,PRJNA153387,Co-dependency of H2B monoubiquitination and nu...,Monoubiquitination of histone H2B on lysine 12...,NO_PROTOCOL,SRR441515,SRX127920,delta_chd1_K79me3,Saccharomyces_cerevisiae,ChIP-Seq,"source_name : delta_Chd1 yeast (FM391,delta Ch..."
897,PRJNA153387,Co-dependency of H2B monoubiquitination and nu...,Monoubiquitination of histone H2B on lysine 12...,NO_PROTOCOL,SRR441509,SRX127918,wt_FM391_K4me3,Saccharomyces_cerevisiae,ChIP-Seq,source_name : Wild-type yeast (FM391) genotype...
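
The filtering step above relies on `str.contains` treating the joined string as a regular expression, which is its default behaviour. The standard-library sketch below mimics that mask on a handful of titles. The epitope names are hypothetical stand-ins for the contents of `epitopes.csv`, and the `re.escape` call is an addition not present in the cell above: it guards against names that contain regex metacharacters such as `.` or `(`.

```python
import re

# Hypothetical epitope names; the real ones live in epitopes.csv.
epitope_ids = ["H3K4me3", "H3K36me3", "H3", "input"]

# Escape each name so it is matched literally, then join with '|'
# to build a single alternation pattern.
pattern = re.compile("|".join(re.escape(e) for e in epitope_ids), re.IGNORECASE)

titles = [
    "input2.2_15",
    "delta_chd1_K79me3",
    "H3K4ME3_aa762-1080_Set1_Rep_2",
    "Rpb3_IP_rep1",
]

# One True/False per title, the same shape as the boolean mask used on `meta`.
keep = [bool(pattern.search(title)) for title in titles]
print(keep)  # [True, False, True, False]
```

Matching is by substring, so a short name such as `H3` will also match any title that merely contains those characters. It is worth inspecting the filtered table before annotating.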


When we're happy with our subset of experiments we can then rerun `gather()` with the dataframe as input and the `annotate` argument set to `True`.\
Setting `annotate` to `True` activates the second part of the tool, which summarizes and annotates the experiments of each project using an LLM (currently any OpenAI model). \
The LLM annotation is a two-step process. \
1. The model is given the full metadata for a single project. That includes the title, abstract, protocol, and all of the experiment metadata (titles, attributes). \
It is prompted to provide a summary of that project and answer specific questions about the experiments. These summaries are automatically saved as text files in the `sragent_output/` directory. 
2. The model is given the summary for a project and the metadata for a *single* experiment, and then prompted to fill out a JSON schema (using function calling on a pydantic class to enforce structure and type) to define experiment details.\
These attributes are T/F questions about mutations, deletions, protein depletions, controls, etc. that may or may not be involved in each experiment. 
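
To make step 2 more concrete, here is a rough sketch of what such a typed record could look like. It is not the schema `sragent` actually uses: the class name and helper are hypothetical, the field names are borrowed from the output columns listed later in this notebook, and a standard-library `dataclass` stands in for the pydantic class so the example runs without extra dependencies.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional

# Names of the fields that must be answered with a strict True/False.
BOOL_FIELDS = (
    "gene_mutation", "gene_deletion", "protein_depletion",
    "stress_condition", "time_series", "chip_input", "antibody_control",
)


@dataclass
class ExperimentAnnotation:
    """Hypothetical stand-in for the structured output of the annotation step."""
    gene_mutation: bool
    gene_deletion: bool
    protein_depletion: bool
    stress_condition: bool
    time_series: bool
    chip_input: bool
    antibody_control: bool
    # Free-text details, left as None when they do not apply.
    chip_target: Optional[str] = None
    mutation: Optional[str] = None
    deletion: Optional[str] = None
    depletion: Optional[str] = None
    stress: Optional[str] = None
    time_point: Optional[str] = None

    def __post_init__(self):
        # Enforce types by hand; pydantic would do this automatically.
        for f in fields(self):
            value = getattr(self, f.name)
            if f.name in BOOL_FIELDS:
                if not isinstance(value, bool):
                    raise TypeError(f"{f.name} must be true/false, got {value!r}")
            elif value is not None and not isinstance(value, str):
                raise TypeError(f"{f.name} must be a string or null, got {value!r}")


def parse_annotation(raw_json: str) -> ExperimentAnnotation:
    """Turn a JSON string returned by a model into a validated record."""
    return ExperimentAnnotation(**json.loads(raw_json))
```

Missing or unexpected keys raise a `TypeError` from the constructor, so a malformed response fails loudly instead of silently producing a half-filled row.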

To allow for greater cost control, we define which OpenAI model to use for each of these LLM steps. \
In my own testing, the full GPT-4o model was necessary for the best accuracy when generating the JSON output, \
but the less expensive `gpt-4o-mini` model was sufficient for good summaries. \


When you run `gather()`, it automatically checks whether summaries are already present for each BioProject in the `sragent_output/` directory, saving time and money if something goes wrong or we want to rerun the annotation step.\
If you have already annotated the metadata and `sragent_output/annotation_FULL.csv` exists, `gather()` will skip the LLM annotation and just repeat the generation of sample names, the validation check, and control sample matching.\
That might seem useless, but it makes human-in-the-loop validation much easier. \
If a particular project is producing consistently incorrect annotations, we can check the summary and make manual corrections that may improve the annotation output.


In [4]:
#test = ptm_meta[ptm_meta['project_id'] == 'PRJNA262623']
ptm_meta_annotated = sragent.gather(ptm_meta, 
                                    model_summary = 'gpt-4o-mini',
                                    model_annotation = 'gpt-4o-2024-08-06',
                                    annotate = True)

Annotation exists, loading annotation...
[False]
[False, False]
[False, False, False]
[False, False, False, False]
[False, False, False, False, False]
False
[False]
[False, False]
[False, False, False]
[False, False, False, False]
[False, False, False, False, False]
False
[False]
[False, False]
[False, False, False]
[False, False, False, False]
[False, False, False, False, False]
False
[False]
[False, False]
[False, False, False]
[False, False, False, False]
[False, False, False, False, False]
False
[False]
[False, False]
[False, False, False]
[False, False, False, False]
[False, False, False, False, False]
False
[False]
[False, False]
[False, False, False]
[False, False, False, False]
[False, False, False, False, False]
False
[False]
[False, False]
[False, False, False]
[False, False, False, False]
[False, False, False, False, False]
False
[False]
[False, False]
[False, False, False]
[False, False, False, False]
[False, False, False, False, False]
False
[False]
[False, False]
[False, 

For this large example, using `gpt-4o-2024-08-06` for annotation (as of 08/14/24), annotation took ~24m and used ~2M tokens at a total cost of ~$6. \
So for 899 experiments that's less than $0.01 per experiment. 
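
The per-experiment figures follow directly from the approximate totals quoted above. The short calculation below just spells out that arithmetic; since the inputs are rounded estimates, the results are only rough.

```python
# Approximate totals quoted above for the full annotation run.
total_cost_usd = 6.0
total_tokens = 2_000_000
total_minutes = 24
n_experiments = 899

cost_per_experiment = total_cost_usd / n_experiments
tokens_per_experiment = total_tokens / n_experiments
seconds_per_experiment = total_minutes * 60 / n_experiments

print(f"cost per experiment:    ${cost_per_experiment:.4f}")
print(f"tokens per experiment:  {tokens_per_experiment:.0f}")
print(f"seconds per experiment: {seconds_per_experiment:.1f}")
```

That works out to roughly two thirds of a cent, a little over two thousand tokens, and under two seconds per experiment.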

So what did `gather()` do?
Let's take a look at the output, which should be saved as `annotation_FULL.csv` in the `sragent_output` directory.

`ptm_meta_annotated` is a pandas dataframe with the following columns: \
- `project_id`
- `experiment_id`
- `exp_title` - title of the experiment as listed in the SRA metadata
(the following columns are generated from the LLM response)
- `gene_mutation` - True or False: does this experiment test a gene mutation? 
- `gene_deletion` - True or False: does this experiment test a gene deletion?
- `protein_depletion` - True or False: does this experiment test a protein depletion?
- `stress_condition` - True or False: does this experiment test a stress condition?
- `time_series` - True or False: does this experiment specify a time point or growth stage?
- `chip_input` - True or False: is this an input experiment?
- `antibody_control` - True or False: is this an antibody control, i.e. a non-specific antibody?
- `chip_target` - protein targeted in the experiment
- `mutation` - specific mutation if present
- `deletion` - specific deletion if present
- `depletion` - protein depleted if present
- `stress` - stress condition if present
- `time_point`
(the following are determined based on the classifications from the LLM above)
- `warning` - True or False: flags a logical disagreement between the LLM classifications. Useful to catch mistakes and guide human review
- `sample` - a simple sample name or tag that merges experiment attributes like genotype, time, target. 
- `perturbation` - one of ['None','gene_mutation','gene_deletion','protein_depletion','stress_condition']
- `control` - the `sample` name of the control (input or other) experiment that matches target for that project


Let's examine how well the model did at classifying and annotating these experiments.\
How many experiments were flagged for review?

In [5]:
#ptm_meta_annotated
warnings = ptm_meta_annotated[ptm_meta_annotated['warning']]
warnings

project_id,,experiment_id,exp_title,gene_mutation,gene_deletion,protein_depletion,stress_condition,time_series,chip_input,antibody_control,chip_target,...,depletion,stress,time_point,project_id,model_summary,model_annotation,warning,sample,perturbation,control
PRJNA274975,233,SRX869427,"Rph1- Old (S3O, Exp 2) H3",True,True,False,False,False,False,False,H3,...,,,,PRJNA274975,gpt-4o-mini,gpt-4o-2024-08-06,True,H3-Rph1--gene_mutation,gene_mutation,Input-Rph1--gene_mutation
PRJNA274975,234,SRX869426,"Rph1- Old (S2O, Exp 2) H3",True,True,False,False,False,False,False,H3,...,,,,PRJNA274975,gpt-4o-mini,gpt-4o-2024-08-06,True,H3-Rph1--gene_mutation,gene_mutation,Input-Rph1--gene_mutation
PRJNA274975,235,SRX869425,"Rph1- Young (S2Y, Exp 2) H3",True,True,False,False,False,False,False,H3,...,,,,PRJNA274975,gpt-4o-mini,gpt-4o-2024-08-06,True,H3-Rph1--gene_mutation,gene_mutation,Input-Rph1--gene_mutation
PRJNA274975,236,SRX869424,"Rph1- Old (S3O, Exp 2) H3K36me3",True,True,False,False,True,False,False,H3K36me3,...,,,Old_S3O_Exp2,PRJNA274975,gpt-4o-mini,gpt-4o-2024-08-06,True,H3K36me3-Rph1--gene_mutation-Old_S3O_Exp2,gene_mutation,
PRJNA274975,237,SRX869423,"Rph1- Old (S2O, Exp 2) H3K36me3",True,True,False,False,False,False,False,H3K36me3,...,,,,PRJNA274975,gpt-4o-mini,gpt-4o-2024-08-06,True,H3K36me3-Rph1--gene_mutation,gene_mutation,Input-Rph1--gene_mutation
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
PRJNA989169,639,SRX20828397,H3K9me2; TetR-SET; Suv39x; ChIP-seq (B),True,True,False,False,False,False,False,H3K9me2,...,,,,PRJNA989169,gpt-4o-mini,gpt-4o-2024-08-06,True,H3K9me2-TetR-SET-gene_mutation,gene_mutation,
PRJNA989169,642,SRX20828410,input DNA; AY4345 (B),True,True,False,False,False,True,False,,...,,,,PRJNA989169,gpt-4o-mini,gpt-4o-2024-08-06,True,Input-hflD::URA3Kl-tetO10X-ADE2-gene_mutation,gene_mutation,
PRJNA989169,643,SRX20828409,input DNA; AY4345 (A),True,True,False,False,False,True,False,,...,,,,PRJNA989169,gpt-4o-mini,gpt-4o-2024-08-06,True,Input-hflD-URA3Kl-tetO10X-ADE2-gene_mutation,gene_mutation,
PRJNA989169,646,SRX20828406,input DNA; AY2896 (B),True,True,False,False,False,True,False,,...,,,,PRJNA989169,gpt-4o-mini,gpt-4o-2024-08-06,True,Input-hmrD-tetO10X-ADE2-gene_mutation,gene_mutation,


65 of our 899 experiments have warnings, meaning there is a logical inconsistency between some of the classifications,\
e.g. a sample is marked as having a gene_mutation but no mutation is listed, or vice versa.
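
The kind of check described here can be sketched in a few lines. This is an illustration, not the validation code inside `sragent`: `has_warning()` is a hypothetical function, and the pairing of each True/False column with a detail column is an assumption based on the column names listed earlier.

```python
# Assumed pairing of each True/False flag with the column holding its detail.
FLAG_DETAIL_PAIRS = [
    ("gene_mutation", "mutation"),
    ("gene_deletion", "deletion"),
    ("protein_depletion", "depletion"),
    ("stress_condition", "stress"),
    ("time_series", "time_point"),
]


def has_warning(row):
    """Return True when any flag and its detail field disagree.

    `row` is a plain dict; a detail counts as present when it is a
    non-empty value other than None.
    """
    for flag, detail in FLAG_DETAIL_PAIRS:
        value = row.get(detail)
        detail_present = value is not None and str(value).strip() != ""
        if bool(row.get(flag)) != detail_present:
            return True  # flag set without a detail, or detail without a flag
    return False


consistent = {"gene_deletion": True, "deletion": "Rph1"}
inconsistent = {"gene_mutation": True, "mutation": "Rph1-",
                "gene_deletion": True, "deletion": None}
print(has_warning(consistent), has_warning(inconsistent))  # False True
```

When applying something like this to a pandas row, missing values arrive as `NaN` and not as `None`, so the presence test would need `pd.notna(value)` instead.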

Looking at one project (PRJNA274975) we see that these experiments are being classified as having both gene deletions and mutations, and we're getting some ugly sample names as a result.\
The time series annotation is also inconsistent, and most of the experiments aren't being correctly annotated as 'young' or 'old' which seems to be an important designator in this project. 

In [6]:
warnings[warnings['project_id'] == 'PRJNA274975']

project_id,,experiment_id,exp_title,gene_mutation,gene_deletion,protein_depletion,stress_condition,time_series,chip_input,antibody_control,chip_target,...,depletion,stress,time_point,project_id,model_summary,model_annotation,warning,sample,perturbation,control
PRJNA274975,233,SRX869427,"Rph1- Old (S3O, Exp 2) H3",True,True,False,False,False,False,False,H3,...,,,,PRJNA274975,gpt-4o-mini,gpt-4o-2024-08-06,True,H3-Rph1--gene_mutation,gene_mutation,Input-Rph1--gene_mutation
PRJNA274975,234,SRX869426,"Rph1- Old (S2O, Exp 2) H3",True,True,False,False,False,False,False,H3,...,,,,PRJNA274975,gpt-4o-mini,gpt-4o-2024-08-06,True,H3-Rph1--gene_mutation,gene_mutation,Input-Rph1--gene_mutation
PRJNA274975,235,SRX869425,"Rph1- Young (S2Y, Exp 2) H3",True,True,False,False,False,False,False,H3,...,,,,PRJNA274975,gpt-4o-mini,gpt-4o-2024-08-06,True,H3-Rph1--gene_mutation,gene_mutation,Input-Rph1--gene_mutation
PRJNA274975,236,SRX869424,"Rph1- Old (S3O, Exp 2) H3K36me3",True,True,False,False,True,False,False,H3K36me3,...,,,Old_S3O_Exp2,PRJNA274975,gpt-4o-mini,gpt-4o-2024-08-06,True,H3K36me3-Rph1--gene_mutation-Old_S3O_Exp2,gene_mutation,
PRJNA274975,237,SRX869423,"Rph1- Old (S2O, Exp 2) H3K36me3",True,True,False,False,False,False,False,H3K36me3,...,,,,PRJNA274975,gpt-4o-mini,gpt-4o-2024-08-06,True,H3K36me3-Rph1--gene_mutation,gene_mutation,Input-Rph1--gene_mutation
PRJNA274975,238,SRX869422,"Rph1- Young (S2Y, Exp 2) H3K36me3",True,True,False,False,False,False,False,H3K36me3,...,,,,PRJNA274975,gpt-4o-mini,gpt-4o-2024-08-06,True,H3K36me3-Rph1--gene_mutation,gene_mutation,Input-Rph1--gene_mutation
PRJNA274975,245,SRX869409,"Rph1- Old (S3O, Exp 1) H3",True,True,False,False,False,False,False,H3,...,,,,PRJNA274975,gpt-4o-mini,gpt-4o-2024-08-06,True,H3-Rph1--gene_mutation,gene_mutation,Input-Rph1--gene_mutation
PRJNA274975,246,SRX869408,"Rph1- Old (S2O, Exp 1) H3",True,True,False,False,False,False,False,H3,...,,,,PRJNA274975,gpt-4o-mini,gpt-4o-2024-08-06,True,H3-Rph1--gene_mutation,gene_mutation,Input-Rph1--gene_mutation
PRJNA274975,247,SRX869407,"Rph1- Young (S2Y, Exp 1) H3",True,True,False,False,False,False,False,H3,...,,,,PRJNA274975,gpt-4o-mini,gpt-4o-2024-08-06,True,H3-Rph1--gene_mutation,gene_mutation,Input-Rph1--gene_mutation
PRJNA274975,248,SRX869406,"Rph1- Old (S3O, Exp 1) H3K36me3",True,True,False,False,False,False,False,H3K36me3,...,,,,PRJNA274975,gpt-4o-mini,gpt-4o-2024-08-06,True,H3K36me3-Rph1--gene_mutation,gene_mutation,Input-Rph1--gene_mutation


Let's take a look at the summary for this project.

In [7]:
with open('sragent_output/PRJNA274975_summary.txt', 'r') as f:
    print(f.read())

### Project Summary
The main goal of the project (PRJNA274975) is to investigate the role of H3K36 methylation in promoting longevity in yeast (S. cerevisiae) by enhancing transcriptional fidelity. The project tests the hypothesis that loss of sustained H3K36 methylation leads to increased cryptic transcription in aging cells, which is associated with a shorter lifespan. The study examines the effects of the deletion of the K36me2/3 demethylase Rph1 on H3K36 methylation levels and lifespan extension.

### Experimental Conditions
The experiments are conducted under different conditions based on the age of the yeast cells (young vs. old) and the genetic background (wild-type (WT) vs. Rph1 mutant). The project includes three sorts of yeast: S2Y (young), S2O (old), and S3O (old), with two replicates (F1 and F2).

### Analysis of Metadata

1. **Key Words Indicating Gene Mutations:**
   - There are no specific keywords indicating a gene mutation in the project.

2. **Key Words Indicating Gen

It looks like the summary got a couple of things wrong.
>[!QUOTE]
>1. **Key Words Indicating Gene Mutations:**
>   - "Rph1-" indicates a mutation in the Rph1 gene. This is the only relevant mutation in the context of the project.
>2. **Key Words Indicating Gene Deletions:**
>   - "Rph1-" also indicates a deletion of the Rph1 gene, as it suggests a complete loss of function of this gene.
>5. **Key Words Indicating Experiments Separated Over Time:**
>   - The terms "Exp 1" and "Exp 2" in the experiment titles indicate different experimental replicates, but there are no specific keywords indicating time points or stages of development/growth.

Most likely we don't want experiments involving 'Rph1-' to be classified as both deletions and mutations.\
Without examining the project details further, it seems safe to assume that 'Rph1-' is supposed to indicate a deletion of Rph1.

Let's make the following changes:
>[!QUOTE]
>1. **Key Words Indicating Gene Mutations:**
>   - There are no specific keywords indicating a gene mutation in the project.
>2. **Key Words Indicating Gene Deletions:**
>   - "Rph1-" indicates a deletion of the Rph1 gene, as it suggests a complete loss of function of this gene.
>5. **Key Words Indicating Experiments Separated Over Time:**
>   - The terms "Young" and "Old" in the experiment titles indicate different time points

And then save the new summary as `PRJNA274975_summary-corrected.txt` \
This is mostly for the purposes of this demonstration, so we don't lose the incorrect summary.\
I've added an argument to `gather()` to preferentially load any summary files ending in `-corrected.txt`; just specify `use_corrected = True`.\
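
A lookup of this kind might work roughly as follows. The sketch is a guess at the behaviour and is not taken from the `sragent` source: `pick_summary()` is a hypothetical helper, and only the file naming (`<project>_summary.txt` and `<project>_summary-corrected.txt` in `sragent_output/`) comes from this notebook.

```python
from pathlib import Path


def pick_summary(project_id, out_dir="sragent_output", use_corrected=False):
    """Return the summary file to load for a project, or None if there is none.

    A corrected summary is preferred only when `use_corrected` is True
    and the corrected file actually exists.
    """
    out_dir = Path(out_dir)
    corrected = out_dir / f"{project_id}_summary-corrected.txt"
    original = out_dir / f"{project_id}_summary.txt"
    if use_corrected and corrected.exists():
        return corrected
    if original.exists():
        return original
    return None  # nothing cached; a new summary would have to be generated
```

Keeping the original file next to the corrected one means the two can be diffed later to see exactly what was changed by hand.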


Now, we don't want to reannotate all projects, just PRJNA274975, which was giving us problems.\
So we can pass in the subset of the dataframe for just that project and reannotate.

In [8]:
ptm_meta_annotated_corrected = sragent.gather(ptm_meta[ptm_meta['project_id'] == 'PRJNA274975'], 
                                    model_summary = 'gpt-4o-mini',
                                    model_annotation = 'gpt-4o-2024-08-06',
                                    use_corrected = True,
                                    annotate = True)

Corrected summary exists, loading PRJNA274975 summary...
annotating experiment 827    SRX869430
Name: experiment_id, dtype: object...
annotating experiment 828    SRX869429
Name: experiment_id, dtype: object...
annotating experiment 829    SRX869428
Name: experiment_id, dtype: object...
annotating experiment 830    SRX869427
Name: experiment_id, dtype: object...
annotating experiment 831    SRX869426
Name: experiment_id, dtype: object...
annotating experiment 832    SRX869425
Name: experiment_id, dtype: object...
annotating experiment 833    SRX869424
Name: experiment_id, dtype: object...
annotating experiment 834    SRX869423
Name: experiment_id, dtype: object...
annotating experiment 835    SRX869422
Name: experiment_id, dtype: object...
annotating experiment 836    SRX869421
Name: experiment_id, dtype: object...
annotating experiment 837    SRX869420
Name: experiment_id, dtype: object...
annotating experiment 838    SRX869419
Name: experiment_id, dtype: object...
annotating experime