# Notebook for creating multiple agents for hypothesis generation

1 agent to summarize information from the gene list (similar to the one we used for the LLM eval paper)

1 agent to fact-check and find references.

1 agent to create a pathway knowledge graph (following the INDRA terms)

1 agent to create a hypothesis statement summarizing known information

    * additional agent to check novelty (which should be context-specific):
        * Is the hypothesis already supported by a published paper?
        * If no, it is likely novel.
        * If yes, is it in the same context? If no, it is still novel (and a more plausible hypothesis, since it is supported in a similar context but not this specific one); if yes, it is not novel.
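
The novelty check above is a two-question decision rule. A minimal sketch of that rule, assuming the novelty agent can supply two boolean answers; the function name, parameter names, and return labels are illustrative and not part of the project code:

```python
def classify_novelty(supported_by_paper: bool, same_context: bool = False) -> str:
    """Apply the two-step novelty rule from the list above."""
    if not supported_by_paper:
        # No paper supports the hypothesis at all
        return "likely novel"
    if not same_context:
        # Supported elsewhere, but not in this specific biological context
        return "novel in this context"
    # Already supported in the same context
    return "not novel"


for supported, same in [(False, False), (True, False), (True, True)]:
    print(supported, same, "->", classify_novelty(supported, same))
```

The middle case is the interesting one: support in a similar context makes the hypothesis more plausible while leaving it novel for the context at hand.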

1 agent to propose experiments



Use a generalized context, but add one task-specific sentence to differentiate each agent:

For fact summarization: "In this instance, your task is to provide a factual summary of the known facts based on given information. Stick to well-established, scientifically accepted information without speculation."

For knowledge graph creation: "In this instance, your task is to create a knowledge graph representing the relationships and interactions between the given entities, focusing on functional or physical associations, regulatory relationships, and pathway involvements in the given biological context."

For hypothesis generation: "In this instance, your task is to generate creative yet scientifically plausible hypotheses about potential new functions for the given genes. While your ideas should be grounded in biological principles, you're encouraged to propose novel connections and functions."
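
One way to combine the shared context with the task sentences above is plain string concatenation. This is a sketch under that assumption; `TASK_SENTENCES`, `build_context`, and the placeholder base context are hypothetical names, not part of the project code:

```python
# Task-specific sentences, copied from the three prompts listed above
TASK_SENTENCES = {
    "fact_summarization": (
        "In this instance, your task is to provide a factual summary of the known "
        "facts based on given information. Stick to well-established, scientifically "
        "accepted information without speculation."
    ),
    "knowledge_graph": (
        "In this instance, your task is to create a knowledge graph representing the "
        "relationships and interactions between the given entities, focusing on "
        "functional or physical associations, regulatory relationships, and pathway "
        "involvements in the given biological context."
    ),
    "hypothesis_generation": (
        "In this instance, your task is to generate creative yet scientifically "
        "plausible hypotheses about potential new functions for the given genes. "
        "While your ideas should be grounded in biological principles, you're "
        "encouraged to propose novel connections and functions."
    ),
}


def build_context(base_context: str, task: str) -> str:
    """Append the sentence for `task` to the shared context."""
    return f"{base_context.strip()} {TASK_SENTENCES[task]}"


# Placeholder standing in for the shared multi-agent context file
base_context = "<shared multi-agent context>"
print(build_context(base_context, "fact_summarization"))
```

Keeping the sentences in one dictionary makes it easy to see that the agents differ only in their final instruction.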



In [1]:
import sys
import os

# Add the parent directory of the current working directory to the Python path
cwd = os.getcwd()
dirname = os.path.dirname(cwd)
print(cwd)
print(dirname)
sys.path.append(dirname)

print(sys.path)

/cellar/users/mhu/Projects/agent_evaluation/notebooks
/cellar/users/mhu/Projects/agent_evaluation
['/cellar/users/mhu/miniconda3/envs/llm_agent/lib/python311.zip', '/cellar/users/mhu/miniconda3/envs/llm_agent/lib/python3.11', '/cellar/users/mhu/miniconda3/envs/llm_agent/lib/python3.11/lib-dynload', '', '/cellar/users/mhu/miniconda3/envs/llm_agent/lib/python3.11/site-packages', '/cellar/users/mhu/Projects/agent_evaluation']


In [2]:
# Load database

# Enable autoreload before importing the project modules
%reload_ext autoreload
%autoreload 2

from models.analysis_plan import AnalysisPlan
from services.analysisrunner import AnalysisRunner
from app.sqlite_database import SqliteDatabase
from app.config import load_database_config

_, database_uri, _, _ = load_database_config()
db = SqliteDatabase(database_uri)



### check available LLMs

In [16]:
llm_specs = db.find("llm")
llm_mappings = {}
for llm_spec in llm_specs:
    llm_id = llm_spec["object_id"]
    llm_properties = llm_spec["properties"]
    # keep only LLMs configured with temperature 0 (stored as the string '0')
    if llm_properties["temperature"] != '0':
        continue
    llm_name = llm_properties["name"]
    llm_mappings[llm_name] = llm_id

llm_mappings

{'GPT_4o_t0': 'llm_e7742ff2-83ac-4dd7-b7e7-cedb0a0dd2cd',
 'Claude3.5_sonnet_t0': 'llm_ab57131f-b207-4583-86a6-e33b8644a189'}

## the gene set function summarization agent 

In [17]:
def templateIO(file):
    with open(file, "r") as f:
        return f.read()

# load the multiagent context file 
context = templateIO("../prompts/contexts/multiagent_context.txt") 
context

"You are an advanced researcher specializing in molecular biology, part of a multi-agent system designed to analyze scientific information. Your ultimate goal is to contribute to the development of plausible, actionable hypotheses for research and potential therapies. Your specific task will vary and may include summarizing facts, verifying information, creating knowledge graphs, or generating hypotheses. Always base your responses on well-established scientific knowledge and peer-reviewed research. Clearly distinguish between verified facts and theoretical concepts. If information is uncertain or speculative, explicitly state so. Maintain scientific rigor in all tasks. If presented with nonsensical, non-scientific, or non-factual questions, respond with \\'Unknown\\'."

In [22]:
# Create the analyst (agent) specs
prompt_template_file = '../prompts/analysts/gene_set_summarizer.txt'
prompt_template = templateIO(prompt_template_file)

# Shared multi-agent context (same file as loaded in the previous cell)
context = templateIO("../prompts/contexts/multiagent_context.txt")
analyst_name = "gene_set_summarizer"

# One spec per LLM. llm_mappings only holds the temperature-0 models
# (GPT-4o and Claude 3.5 Sonnet), chosen to reduce run-to-run variation.
analyst_specs = {}
for llm, llm_id in llm_mappings.items():
    name = f"{analyst_name}_{llm}"
    analyst_specs[name] = {
        "context": context,
        "prompt_template": prompt_template,
        "llm_id": llm_id,
        "description": f"Multi agent system -{analyst_name}- using {llm}"
    }

analyst_specs


{'gene_set_summarizer_GPT_4o_t0': {'context': "You are an advanced researcher specializing in molecular biology, part of a multi-agent system designed to analyze scientific information. Your ultimate goal is to contribute to the development of plausible, actionable hypotheses for research and potential therapies. Your specific task will vary and may include summarizing facts, verifying information, creating knowledge graphs, or generating hypotheses. Always base your responses on well-established scientific knowledge and peer-reviewed research. Clearly distinguish between verified facts and theoretical concepts. If information is uncertain or speculative, explicitly state so. Maintain scientific rigor in all tasks. If presented with nonsensical, non-scientific, or non-factual questions, respond with \\'Unknown\\'.",
  'prompt_template': '<data>\n{data}\n</data>\n\n<experiment_description>\nThe genes/proteins that were identified as described:\n{experiment_description}\n</experiment_des

In [23]:
from models.agent import Agent


analysts = {}

for name, spec in analyst_specs.items():
    analysts[name] = Agent.create(db, 
                                  spec["llm_id"], 
                                  spec["context"],
                                  spec["prompt_template"],
                                  name=name,
                                  description=spec.get('description'))
    
analysts

{'gene_set_summarizer_GPT_4o_t0': <models.agent.Agent at 0x1554fb57b990>,
 'gene_set_summarizer_Claude3.5_sonnet_t0': <models.agent.Agent at 0x1554fa68d390>}

In [24]:
# Collect the database IDs of the agents created above
agent_ids = [agent.object_id for agent in analysts.values()]

agent_ids

['agent_bda52848-498a-4b8d-8312-0a58c9c9b97c',
 'agent_5bbb68ad-af1e-40af-9b59-194d4124b941']

In [25]:
dataset_id = 'dataset_9cf93b1a-c7af-4084-a5d7-a6d668995247'  # dengue 24hr upregulated proteins only


a_plan_name = "gene_set_summarizer_w_upreg_proteins_24hr"
analysis_plan = AnalysisPlan.create(db, a_plan_name, 
                                    agent_ids, 
                                    dataset_id, 
                                    n_hypotheses_per_agent='1', 
                                    description=
                                    '''gene set summarizer using upregulated proteins at 24hr timepoint
                                    '''
                                    )
print(analysis_plan.object_id)
print(vars(analysis_plan)) 

analysis_plan_56e32971-7de2-4190-ba27-33d6e3653497
{'db': <app.sqlite_database.SqliteDatabase object at 0x1554fea1c9d0>, 'name': 'gene_set_summarizer_w_upreg_proteins_24hr', 'agent_ids': ['agent_bda52848-498a-4b8d-8312-0a58c9c9b97c', 'agent_5bbb68ad-af1e-40af-9b59-194d4124b941'], 'dataset_id': 'dataset_9cf93b1a-c7af-4084-a5d7-a6d668995247', 'n_hypotheses_per_agent': '1', 'biological_context': 'gene set summarizer using upregulated proteins at 24hr timepoint\n                                    ', 'description': None, 'object_id': 'analysis_plan_56e32971-7de2-4190-ba27-33d6e3653497', 'created': {'name': 'gene_set_summarizer_w_upreg_proteins_24hr', 'agent_ids': ['agent_bda52848-498a-4b8d-8312-0a58c9c9b97c', 'agent_5bbb68ad-af1e-40af-9b59-194d4124b941'], 'dataset_id': 'dataset_9cf93b1a-c7af-4084-a5d7-a6d668995247', 'n_hypotheses_per_agent': '1', 'description': 'gene set summarizer using upregulated proteins at 24hr timepoint\n                                    ', 'created': '09.06.2024 1

In [26]:

analysis_run = analysis_plan.generate_analysis_run(biological_context="Dengue Virus infection seeking for direct host-based therapeutic targets to inhibit dengue virus replication.")
print(analysis_run.object_id)
print(vars(analysis_run))

runner = AnalysisRunner(db, analysis_run.object_id)
result = runner.run()
print(result)

analysis_run_6d77aa94-9b72-4a45-8302-655b9c604619
{'db': <app.sqlite_database.SqliteDatabase object at 0x1554fea1c9d0>, 'analysis_plan_id': 'analysis_plan_56e32971-7de2-4190-ba27-33d6e3653497', 'agent_ids': ['agent_bda52848-498a-4b8d-8312-0a58c9c9b97c', 'agent_5bbb68ad-af1e-40af-9b59-194d4124b941'], 'dataset_id': 'dataset_9cf93b1a-c7af-4084-a5d7-a6d668995247', 'n_hypotheses_per_agent': '1', 'hypothesis_ids': [], 'biological_context': 'Dengue Virus infection seeking for direct host-based therapeutic targets to inhibit dengue virus replication.', 'description': None, 'run_log': '', 'attempts': {'agent_bda52848-498a-4b8d-8312-0a58c9c9b97c': [], 'agent_5bbb68ad-af1e-40af-9b59-194d4124b941': []}, 'status': 'pending', 'object_id': 'analysis_run_6d77aa94-9b72-4a45-8302-655b9c604619', 'name': 'none', 'user_ids': {'analysis_plan_id': 'analysis_plan_56e32971-7de2-4190-ba27-33d6e3653497', 'agent_ids': ['agent_bda52848-498a-4b8d-8312-0a58c9c9b97c', 'agent_5bbb68ad-af1e-40af-9b59-194d4124b941'], 'd

## fact checker

In [43]:
from models.analysis_run import AnalysisRun
# load the analysis run 
analysis_run = AnalysisRun.load(db, "analysis_run_76337bdb-2644-4868-b2a8-c58849b4b033")
hypothesis = analysis_run.hypothesis_ids
agent = analysis_run.agent_ids

print(hypothesis, agent)


['hypothesis_625939e2-8a7d-407c-8eed-4fbfa03f4d68'] ['agent_2ba950fe-d773-4c59-bbe7-2c1cc948b46f']


In [44]:
from models.hypothesis import Hypothesis

hypothesis_text = Hypothesis.load(db, hypothesis[0]).hypothesis_text
print(hypothesis_text)

{
    "Interferon-induced antiviral response": {
        "proteins": ["IFIH1", "IFIT1", "IFIT2", "IFIT3", "MX1", "MX2", "OAS1", "RIGI"],
        "analysis": "These proteins are key components of the interferon-induced antiviral response. IFIH1 (also known as MDA5) and RIGI (RIG-I) are pattern recognition receptors that detect viral RNA. The IFIT family proteins (IFIT1, IFIT2, IFIT3) inhibit viral replication by binding to viral RNA or proteins. MX1 and MX2 are GTPases that interfere with viral replication. OAS1 activates RNase L, which degrades viral RNA."
    },
    "Nucleotide metabolism": {
        "proteins": ["CMPK2", "NT5C3A"],
        "analysis": "CMPK2 and NT5C3A are involved in nucleotide metabolism. CMPK2 is a mitochondrial nucleotide kinase that phosphorylates dUMP and dCMP, while NT5C3A is a cytosolic 5'-nucleotidase. Their upregulation may be related to increased nucleotide demand during viral replication or cellular stress response."
    },
    "Cellular stress response":

In [45]:
from models.hypothesis import Hypothesis
import json
import re

def parse_hypothesis(text):
    # Use a greedy regex to extract everything from the first '{' to the last '}'
    json_match = re.search(r'\{.*\}', text, re.DOTALL)
    
    if not json_match:
        raise ValueError("No JSON structure found in the provided text.")
    
    # Extract the JSON string
    json_str = json_match.group(0)
    
    # Parse the JSON string into a Python dictionary
    try:
        parsed_data = json.loads(json_str)
    except json.JSONDecodeError as e:
        # Chain the original exception so the JSON error position is preserved
        raise ValueError(f"Error decoding JSON: {e}") from e
    
    return parsed_data

hypothesis_text = Hypothesis.load(db, hypothesis[0]).hypothesis_text

# Parse the hypothesis data
parsed_hypothesis = parse_hypothesis(hypothesis_text)

# Output the parsed JSON data
print(parsed_hypothesis)

{'Interferon-induced antiviral response': {'proteins': ['IFIH1', 'IFIT1', 'IFIT2', 'IFIT3', 'MX1', 'MX2', 'OAS1', 'RIGI'], 'analysis': 'These proteins are key components of the interferon-induced antiviral response. IFIH1 (also known as MDA5) and RIGI (RIG-I) are pattern recognition receptors that detect viral RNA. The IFIT family proteins (IFIT1, IFIT2, IFIT3) inhibit viral replication by binding to viral RNA or proteins. MX1 and MX2 are GTPases that interfere with viral replication. OAS1 activates RNase L, which degrades viral RNA.'}, 'Nucleotide metabolism': {'proteins': ['CMPK2', 'NT5C3A'], 'analysis': "CMPK2 and NT5C3A are involved in nucleotide metabolism. CMPK2 is a mitochondrial nucleotide kinase that phosphorylates dUMP and dCMP, while NT5C3A is a cytosolic 5'-nucleotidase. Their upregulation may be related to increased nucleotide demand during viral replication or cellular stress response."}, 'Cellular stress response': {'proteins': ['PRKACA', 'TXNRD2'], 'analysis': 'PRKACA i
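
The greedy pattern in `parse_hypothesis` captures everything up to the last `}` in the text, so it fails if the model adds a trailing remark that contains braces. A possible alternative, shown here only as a sketch, is to let the standard library's `json.JSONDecoder.raw_decode` stop at the end of the first valid object. The function name and the example text are made up for illustration:

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Decode the first valid JSON object embedded in free text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one
            start = text.find("{", start + 1)
    raise ValueError("No JSON structure found in the provided text.")


# Illustrative text: JSON followed by a remark that contains braces
example = (
    'Summary: {"Nucleotide metabolism": {"proteins": ["CMPK2", "NT5C3A"]}} '
    "Note: see {supplement} for details."
)
print(extract_first_json_object(example))
```

For responses that contain nothing but a single JSON object, both approaches return the same dictionary.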

In [46]:
# get proteins 
for key, value in parsed_hypothesis.items():
    proteins = value["proteins"]
    print(proteins)   
    analysis = value["analysis"]
    print(analysis)

['IFIH1', 'IFIT1', 'IFIT2', 'IFIT3', 'MX1', 'MX2', 'OAS1', 'RIGI']
These proteins are key components of the interferon-induced antiviral response. IFIH1 (also known as MDA5) and RIGI (RIG-I) are pattern recognition receptors that detect viral RNA. The IFIT family proteins (IFIT1, IFIT2, IFIT3) inhibit viral replication by binding to viral RNA or proteins. MX1 and MX2 are GTPases that interfere with viral replication. OAS1 activates RNase L, which degrades viral RNA.
['CMPK2', 'NT5C3A']
CMPK2 and NT5C3A are involved in nucleotide metabolism. CMPK2 is a mitochondrial nucleotide kinase that phosphorylates dUMP and dCMP, while NT5C3A is a cytosolic 5'-nucleotidase. Their upregulation may be related to increased nucleotide demand during viral replication or cellular stress response.
['PRKACA', 'TXNRD2']
PRKACA is the catalytic subunit of protein kinase A, involved in various cellular signaling pathways including stress response. TXNRD2 is a mitochondrial thioredoxin reductase that plays a r

In [47]:
# indexing sentences in analysis 
# get the analysis text
for key, value in parsed_hypothesis.items(): 
    analysis = value["analysis"]
    print(analysis)
    # split the analysis into sentences
    sentences = analysis.split(".")
    sentences = [sentence.strip() for sentence in sentences if len(sentence.strip())>0]
    indexed_sentences = {i: sentence for i, sentence in enumerate(sentences)} 
    print(indexed_sentences)


These proteins are key components of the interferon-induced antiviral response. IFIH1 (also known as MDA5) and RIGI (RIG-I) are pattern recognition receptors that detect viral RNA. The IFIT family proteins (IFIT1, IFIT2, IFIT3) inhibit viral replication by binding to viral RNA or proteins. MX1 and MX2 are GTPases that interfere with viral replication. OAS1 activates RNase L, which degrades viral RNA.
{0: 'These proteins are key components of the interferon-induced antiviral response', 1: 'IFIH1 (also known as MDA5) and RIGI (RIG-I) are pattern recognition receptors that detect viral RNA', 2: 'The IFIT family proteins (IFIT1, IFIT2, IFIT3) inhibit viral replication by binding to viral RNA or proteins', 3: 'MX1 and MX2 are GTPases that interfere with viral replication', 4: 'OAS1 activates RNase L, which degrades viral RNA'}
CMPK2 and NT5C3A are involved in nucleotide metabolism. CMPK2 is a mitochondrial nucleotide kinase that phosphorylates dUMP and dCMP, while NT5C3A is a cytosolic 5'-n
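
Splitting on every `"."` works for the text above, but it would also cut inside decimals such as `2.5`. A hedged alternative is to split only where sentence-ending punctuation is followed by whitespace. The helper name and the second example sentence are invented for illustration, and abbreviations such as "e.g. " would still be split:

```python
import re


def index_sentences(text: str) -> dict[int, str]:
    """Split text into sentences and index them from 0."""
    # Split only where ., ! or ? is followed by whitespace
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop the trailing period to match the output format used above
    return {i: part.rstrip(".") for i, part in enumerate(p for p in parts if p)}


# Sentences taken from the analysis text above
print(index_sentences(
    "MX1 and MX2 are GTPases that interfere with viral replication. "
    "OAS1 activates RNase L, which degrades viral RNA."
))

# Made-up sentence with a decimal, which a plain split(".") would break apart
print(index_sentences("Expression rose 2.5-fold. It then fell."))
```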

In [37]:
Agent.load(db, agent[1]).name

'gene_set_summarizer_Claude3.5_sonnet_t0'

In [59]:
from services.reference_checker import get_references_for_paragraph

%reload_ext autoreload
%autoreload 2

email = 'mhu@health.ucsd.edu'
# Check the first functional group (a dict with "proteins" and "analysis")
first_group = next(iter(parsed_hypothesis.values()))
ref_dict = get_references_for_paragraph(db, first_group, agent[0], email, n=5, papers_query=50, verbose=True)

Extracting keywords from paragraph
Paragraph:
These proteins are key components of the interferon-induced antiviral response. IFIH1 (also known as MDA5) and RIGI (RIG-I) are pattern recognition receptors that detect viral RNA. The IFIT family proteins (IFIT1, IFIT2, IFIT3) inhibit viral replication by binding to viral RNA or proteins. MX1 and MX2 are GTPases that interfere with viral replication. OAS1 activates RNase L, which degrades viral RNA.
Query:  
I would like to search PubMed to find supporting evidence for the statements in a paragraph. Give me a maximum of 3 keywords related to the functions or biological processes in the statements. 

Example paragraph:  Involvement of pattern recognition receptors: TLR1, TLR2, and TLR3 are part of the Toll-like receptor family, which recognize pathogen-associated molecular patterns and initiate innate immune responses. NOD2 and NLRP3 are intracellular sensors that also contribute to immune activation.
Example response: immune response,recep

In [60]:
ref_dict

{'paragraph': 'These proteins are key components of the interferon-induced antiviral response. IFIH1 (also known as MDA5) and RIGI (RIG-I) are pattern recognition receptors that detect viral RNA. The IFIT family proteins (IFIT1, IFIT2, IFIT3) inhibit viral replication by binding to viral RNA or proteins. MX1 and MX2 are GTPases that interfere with viral replication. OAS1 activates RNase L, which degrades viral RNA.',
 'keyword': '(IFIH1[Title/Abstract] OR IFIT1[Title/Abstract] OR IFIT2[Title/Abstract] OR IFIT3[Title/Abstract] OR MX1[Title/Abstract] OR MX2[Title/Abstract] OR OAS1[Title/Abstract] OR RIGI[Title/Abstract]) AND (antiviral[Title/Abstract] OR interferon[Title/Abstract] OR replication[Title/Abstract])',
 'references': [{'citation': 'Yang, Yiying, Song, Jie, Zhao, Hongjun, Zhang, Huali, Guo, Muyao. "Patients with dermatomyositis shared partially similar transcriptome signature with COVID-19 infection." Autoimmunity, 2023, pp. 2220984, doi: https://doi.org/10.1080/08916934.2023.
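
The `keyword` field above has a regular shape: the proteins are OR-ed together, the LLM-suggested keywords are OR-ed together, and the two groups are joined with AND, each term restricted to `[Title/Abstract]`. The sketch below reconstructs that string from the printed output only. It is an assumption about the format, not the actual code in `services.reference_checker`, and `build_pubmed_query` is a hypothetical name:

```python
def build_pubmed_query(proteins: list[str], keywords: list[str],
                       field: str = "Title/Abstract") -> str:
    """Build a PubMed query: (protein OR ...) AND (keyword OR ...)."""
    protein_clause = " OR ".join(f"{p}[{field}]" for p in proteins)
    keyword_clause = " OR ".join(f"{k}[{field}]" for k in keywords)
    return f"({protein_clause}) AND ({keyword_clause})"


# Proteins and keywords as they appear in the output above
proteins = ["IFIH1", "IFIT1", "IFIT2", "IFIT3", "MX1", "MX2", "OAS1", "RIGI"]
keywords = ["antiviral", "interferon", "replication"]
query = build_pubmed_query(proteins, keywords)
print(query)
```

A paper matches this query if it mentions at least one of the proteins and at least one of the keywords in its title or abstract.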

In [35]:
from services.reference_checker import get_references_for_paragraphs
%reload_ext autoreload
%autoreload 2


hypothesis_id = hypothesis
agent_id = agent  # only the agent's context is used here, so the choice of agent does not matter
# Named refs_by_group rather than `dict` to avoid shadowing the Python builtin
refs_by_group, stored_refs = get_references_for_paragraphs(db, hypothesis_id, agent_id, n=5, papers_query=20, verbose=True)

Extracting keywords from paragraph
Paragraph:
These proteins are key components of the interferon-induced antiviral response. IFIH1 (also known as MDA5) and RIGI (RIG-I) are pattern recognition receptors that detect viral RNA. The IFIT family proteins (IFIT1, IFIT2, IFIT3) inhibit viral replication by binding to viral RNA or proteins. MX1 and MX2 are GTPases that interfere with viral replication. OAS1 activates RNase L, which degrades viral RNA.
Query:  
I would like to search PubMed to find supporting evidence for the statements in a paragraph. Give me a maximum of 3 keywords related to the functions or biological processes in the statements. 

Example paragraph:  Involvement of pattern recognition receptors: TLR1, TLR2, and TLR3 are part of the Toll-like receptor family, which recognize pathogen-associated molecular patterns and initiate innate immune responses. NOD2 and NLRP3 are intracellular sensors that also contribute to immune activation.
Example response: immune response,recep

KeyboardInterrupt: 

In [24]:
refs_by_group

{'Interferon-induced antiviral response': {'proteins': ['IFIH1',
   'IFIT1',
   'IFIT2',
   'IFIT3',
   'MX1',
   'MX2',
   'OAS1',
   'RIGI'],
  'analysis': 'These proteins are key components of the interferon-induced antiviral response. IFIH1 (also known as MDA5) and RIGI (RIG-I) are pattern recognition receptors that detect viral RNA. The IFIT family proteins (IFIT1, IFIT2, IFIT3) inhibit viral replication by binding to viral RNA or proteins. MX1 and MX2 are GTPases that interfere with viral replication. OAS1 activates RNase L, which degrades viral RNA.',
  'citations': '[1] Assou, Said, Ahmed, Engi, Morichon, Lisa, Nasri, Amel, Foisset, Florent, Bourdais, Carine, Gros, Nathalie, Tieo, Sonia, Petit, Aurelie, Vachier, Isabelle, Muriaux, Delphine, Bourdin, Arnaud, De Vos, John. "The Transcriptome Landscape of the In Vitro Human Airway Epithelium Response to SARS-CoV-2." International journal of molecular sciences, 2023, pp.  , doi: https://doi.org/10.3390/ijms241512017\n[2] Fleith, Re