# Detailed tutorial on one article

# Parsing the article

In [128]:
from dotenv import load_dotenv
import os
from tqdm import tqdm
import time
import json

from langchain_community.document_loaders import UnstructuredMarkdownLoader


from utils.parser import ParserConfig, Parser
from rag.retriever import RetrieverConfig, Retriever
from rag.prompt_generator import PromptGeneratorConfig, PromptGenerator
from rag.llm import LLMConfig, LLM
from rag.qa import QA

load_dotenv()
LLAMA_CLOUD = os.getenv('LLAMA_CLOUD')
COHERE_TOKEN = os.getenv("COHERE_TOKEN")
OPENAI_TOKEN = os.getenv("OPENAI_TOKEN")

ARTICLE_PATH = 'data/0008-5472.CAN-07-2124.pdf'  # article to process
ROOT_PATH_RESULTS = 'pipeline_results/test'  # folder where results are saved
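
`os.getenv` returns `None` for a variable that is not set, so a missing token would only surface later, when a service rejects the request. A minimal sketch of an early check follows; `REQUIRED_VARS` and `missing_env_vars` are hypothetical helpers and not part of the project.

```python
import os

# Hypothetical list of the variables this notebook reads from the environment.
REQUIRED_VARS = ("LLAMA_CLOUD", "COHERE_TOKEN", "OPENAI_TOKEN")


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping, so the check runs without a real .env file.
example_env = {"LLAMA_CLOUD": "abc", "COHERE_TOKEN": "", "OPENAI_TOKEN": "xyz"}
print(missing_env_vars(REQUIRED_VARS, example_env))
```

Calling `missing_env_vars(REQUIRED_VARS)` without a mapping checks the real environment after `load_dotenv()` has run.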

We parse the PDF file into Markdown (.md), a format that is easier for an LLM to work with.

In [129]:
parser_config = ParserConfig(path_to_file=ARTICLE_PATH,
                                llama_cloud_token=LLAMA_CLOUD,
                                instruction=None)
parser = Parser(parser_config)
parser.create_parser()
parser.parse()

Started parsing the file under job_id 03a712f4-0bb5-4533-9c65-162c6f227efb


We can check what the parsed document looks like.

In [130]:
loader = UnstructuredMarkdownLoader("processed_data/0008-5472.CAN-07-2124.md")
loaded_documents = loader.load()

In [131]:
print(loaded_documents[0].page_content)

Research Article

Cancer Resistance in Transgenic Mice Expressing the SAC Module of Par-4

Yanming Zhao, Ravshan Burikhanov, Shirley Qiu, Subodh M. Lele, C. Darrell Jennings, Subbarao Bondada, Brett Spear, and Vivek M. Rangnekar

Departments of 1 Radiation Medicine, 2 Pathology and Laboratory Medicine, Microbiology, Immunology and Molecular Genetics; 4 Graduate Center for Toxicology; 5 Markey Cancer Center, University of Kentucky, Lexington, Kentucky and 6 Department of Pathology and Microbiology, University of Nebraska Medical Center, Omaha, Nebraska

Abstract

Prostate apoptosis response-4 (Par-4) is a tumor-suppressor protein that induces apoptosis in cancer cells, but not in normal/immortalized cells. The cancer-specific proapoptotic action of Par-4 is encoded in its centrally located SAC domain. We report here the characterization of a novel mouse model with ubiquitous expression of the SAC domain. Although SAC transgenic mice displayed normal development and life span, they were 

# Pipeline

### **Retriever**. First we create a retriever, which splits the article into chunks and finds the chunks relevant to our question. In essence, we rank the article's chunks by relevance, which is why we use the ranking models below.

In [132]:
retriever_config = RetrieverConfig(
    file_path="processed_data/0008-5472.CAN-07-2124.md",
    embeding_model="BAAI/bge-small-en",  # first-stage embedding model: small and fast, for a quick initial ranking
    reranker_model="rerank-english-v3.0",  # second-stage reranking model: much larger, for the final, more accurate ranking
    chunk_size=15000,
    chunk_overlap=2000,
    COHERE_TOKEN=COHERE_TOKEN,  # reranking runs on the Cohere platform, which is efficient when few GPUs are available
)
retriever = Retriever(retriever_config)
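
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking. It assumes both values are counted in characters and that chunks are plain sliding windows; the project's `Retriever` is not shown in this tutorial and may split differently (for example on separators or tokens). `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into character windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Under the same assumptions, `chunk_size=15000` and `chunk_overlap=2000` would start a new chunk every 13000 characters, so text near a chunk boundary appears in both neighbouring chunks.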

In [133]:
compression_retriever = retriever.run_retriever()
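
The two-stage ranking described above can be illustrated with a toy sketch: a cheap scorer shortlists chunks, then a costlier scorer reorders only the shortlist. Both scorers here are simple word-count stand-ins chosen so the example runs offline; they are assumptions for illustration and are not the embedding model or the Cohere reranker configured above.

```python
def cheap_score(query, chunk):
    """Stage 1 stand-in: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / max(len(query_words), 1)


def costly_score(query, chunk):
    """Stage 2 stand-in: total number of query-word occurrences in the chunk."""
    words = chunk.lower().split()
    return sum(words.count(word) for word in set(query.lower().split()))


def two_stage_rank(query, chunks, first_k=3, final_k=1):
    """Shortlist with the cheap scorer, then reorder the shortlist with the costly one."""
    shortlist = sorted(chunks, key=lambda c: cheap_score(query, c), reverse=True)[:first_k]
    return sorted(shortlist, key=lambda c: costly_score(query, c), reverse=True)[:final_k]


chunks = [
    "mice were housed in standard cages",
    "transgenic mice and control mice were compared",
    "the abstract describes apoptosis in cancer cells",
    "statistical analysis used the log-rank test",
]
best = two_stage_rank("which mice groups", chunks, first_k=2, final_k=1)
print(best)
```

The point of the design is cost: the expensive scorer only ever sees `first_k` candidates, however many chunks the article produces.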

### **LLM**. We create an LLM object that answers our questions using the context from the retriever.

In [134]:
llm_config = LLMConfig(model_name="gpt-4o", temperature=0.0, api_key=OPENAI_TOKEN)

In [135]:
llm = LLM(llm_config)

In [136]:
chat_llm = llm.create_llm()

### **PROMPT**. Each group of questions needs its own prompt.

In [137]:
# structure used to save the results
output = {"groups": []}  # "groups" holds the groups of experimental animals used in the study

The first prompt is the most important one, because it finds the experimental groups that we need to describe.

In [138]:
# prompts that we will use
PROMPT_CONFIG_PATH = "config/prompts_config.json"

with open(PROMPT_CONFIG_PATH, "r", encoding="utf-8") as file:
    PROMPT_CONFIG = json.load(file)

In [139]:
animal = PROMPT_CONFIG["animal"]

In [140]:
animal

{'prompt_intro': 'templates/prompt_intro/animal.txt',
 'prompt_base': 'templates/prompt_bases/animal.txt',
 'query': 'What groups of animals are used in the study?'}

We create a prompt for each group of questions.

In [141]:
query = animal["query"]
prompt_config = PromptGeneratorConfig(
    prompt_intro=animal["prompt_intro"],
    prompt_base=animal["prompt_base"],
    prompt_type="animal",
)
prompt_template = PromptGenerator(prompt_config)

In [142]:
# each prompt comes with the prompt itself and a parser object that defines the answer format we expect
prompt = prompt_template.run_prompt()
parser = prompt_template.create_parser()

In [143]:
# the prompt looks like this
print(prompt.template)

You are an assistant. Use the following information to answer the question very shortly.
Identify ONLY main experimental subject groups and describe them (for example female and male groups, groups with different treatments and e.t.c) and describe them in format bellow
Ignore Control, wild types and any other groups that are not subjects of experimentsGive an aswer in proper JSON format using double quotes around keys and values format
Return just a JSON with EXACT format listed bellow
For example: 
{{"animals":[{{"species":"animal_species1",
         "strain":"animal_strain1", # ONLY strain name of the animal group
         "group":"experiment1",# Name of the group for example Rapa, KO "ABC" gene and e.t.c
         "gender":"male"
         }},
         {{"species":"animal_species2",
         "strain":"animal_strain2",# ONLY strain name of the animal group ONLY name
         "group":"experiment2",# Name of the group for example Rapa, KO ABC gene and e.t.c
         "gender":"female"
   

### **Q&A pipeline**. Combining the retriever, the LLM and the prompt.

In [144]:
qa = QA(compression_retriever, chat_llm, parser, prompt)

In [145]:
answer = qa.run_qa(query)

In [146]:
answer

AnimalList(animals=[Animal(species='mouse', strain='TRAMP', group='SAC transgenic', gender='female'), Animal(species='mouse', strain='B6C3F1', group='SAC transgenic', gender='male'), Animal(species='mouse', strain='TRAMP', group='GFP transgenic', gender='male')])
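
The answer above arrives as typed objects rather than raw text. As a rough illustration of turning a JSON reply into typed records, here is a minimal sketch using only the standard library. `AnimalRecord` and `parse_animals` are hypothetical names; the project's own `Animal` and `AnimalList` classes and its parser are not shown in this tutorial and may work differently.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnimalRecord:
    """Hypothetical stand-in for one animal group returned by the model."""
    species: Optional[str] = None
    strain: Optional[str] = None
    group: Optional[str] = None
    gender: Optional[str] = None


def parse_animals(raw_json: str) -> List[AnimalRecord]:
    """Parse a JSON string of the form {"animals": [...]} into AnimalRecord objects."""
    payload = json.loads(raw_json)
    allowed = set(AnimalRecord.__dataclass_fields__)
    records = []
    for item in payload.get("animals", []):
        # Keep only known fields so an unexpected key does not raise TypeError.
        records.append(AnimalRecord(**{k: v for k, v in item.items() if k in allowed}))
    return records


raw = '{"animals": [{"species": "mouse", "strain": "TRAMP", "group": "SAC transgenic", "gender": "female"}]}'
animals = parse_animals(raw)
print(animals[0])
```

Fields the model omits default to `None`, which matches how missing values appear in the outputs of this notebook.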

In [147]:
output["groups"] = [i.dict() for i in answer.animals]

In [148]:
# groups of animals identified in the article
output["groups"]

[{'species': 'mouse',
  'strain': 'TRAMP',
  'group': 'SAC transgenic',
  'gender': 'female'},
 {'species': 'mouse',
  'strain': 'B6C3F1',
  'group': 'SAC transgenic',
  'gender': 'male'},
 {'species': 'mouse',
  'strain': 'TRAMP',
  'group': 'GFP transgenic',
  'gender': 'male'}]

In [149]:
# Next we do the same thing with different prompts to describe each of these groups

In [150]:
for i, animal in enumerate(output["groups"]):
    # Describe the group in one string, e.g. "mouse TRAMP SAC transgenic female"
    animal_description = " ".join([value for value in animal.values() if value is not None])
    for prompt_type, prompt_settings in PROMPT_CONFIG.items():
        if prompt_type == "animal":
            continue  # the group-finding prompt has already been run above
        query = prompt_settings["query"].format(animal=animal_description)
        print(query)
        prompt_config = PromptGeneratorConfig(
            prompt_intro=prompt_settings["prompt_intro"],
            prompt_base=prompt_settings["prompt_base"],
            all_animals_description=animal_description,
            prompt_type=prompt_type,
        )
        prompt_template = PromptGenerator(prompt_config)
        prompt = prompt_template.run_prompt()
        parser = prompt_template.create_parser()
        qa = QA(compression_retriever, chat_llm, parser, prompt)
        answer = qa.run_qa(query).dict()
        print(answer)
        # Copy every returned field into this group's record;
        # if several records come back, later ones overwrite earlier ones
        for records in answer.values():
            for record in records:
                for field, field_value in record.items():
                    output["groups"][i][field] = field_value
        time.sleep(30)  # pause between requests (presumably to respect API rate limits)

What treatment or intervention or manipulation are used for mouse TRAMP SAC transgenic female?
{'animal_details': [{'treatment': 'SAC domain', 'way_of_administration': 'Genomic', 'age_at_start': None, 'duration_unit': None, 'dosage': 'Expressing the SAC domain'}, {'treatment': 'SAC domain', 'way_of_administration': 'Genomic', 'age_at_start': None, 'duration_unit': None, 'dosage': 'Expressing the SAC domain'}, {'treatment': 'SAC domain', 'way_of_administration': 'Genomic', 'age_at_start': None, 'duration_unit': None, 'dosage': 'Expressing the SAC domain'}]}
What are Lifespan or survival curve/results for mouse TRAMP SAC transgenic female?
{'animal_results': [{'n_treatment': 28, 'n_control': None, 'median_treatment': 28, 'max_treatment': None, 'treatment_units': 'months', 'p_value': None, 'median_control': None, 'max_control': None}]}
What treatment or intervention or manipulation are used for mouse B6C3F1 SAC transgenic male?
{'animal_details': [{'treatment': 'SAC transgenic', 'way_of_a
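
The loop copies each returned record into the group's dictionary field by field. The sketch below isolates that merging step to show one consequence: when an answer contains several records, as the first `animal_details` answer above does, the last record wins for every shared field. `merge_answer` is a hypothetical helper, and the values are placeholders rather than results from the article.

```python
def merge_answer(group, answer):
    """Copy every field of every returned record into the group dict, in place."""
    for records in answer.values():
        for record in records:
            group.update(record)  # later records overwrite earlier ones
    return group


# Placeholder values, used only to show the overwrite behaviour.
group = {"species": "mouse", "strain": "TRAMP"}
answer = {
    "animal_details": [
        {"treatment": "A", "dosage": None},
        {"treatment": "B", "dosage": "10 mg"},
    ]
}
merge_answer(group, answer)
print(group)
```

If keeping every record matters, one option would be to store the list itself under the answer's key instead of flattening it, at the cost of a nested output format.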

In [151]:
output

{'groups': [{'species': 'mouse',
   'strain': 'TRAMP',
   'group': 'SAC transgenic',
   'gender': 'female',
   'treatment': 'SAC domain',
   'way_of_administration': 'Genomic',
   'age_at_start': None,
   'duration_unit': None,
   'dosage': 'Expressing the SAC domain',
   'n_treatment': 28,
   'n_control': None,
   'median_treatment': 28,
   'max_treatment': None,
   'treatment_units': 'months',
   'p_value': None,
   'median_control': None,
   'max_control': None},
  {'species': 'mouse',
   'strain': 'B6C3F1',
   'group': 'SAC transgenic',
   'gender': 'male',
   'treatment': 'SAC transgenic',
   'way_of_administration': None,
   'age_at_start': None,
   'duration_unit': None,
   'dosage': None,
   'n_treatment': 1,
   'n_control': 0,
   'median_treatment': 'Normal lifespan',
   'max_treatment': 'Normal lifespan',
   'treatment_units': 'Age',
   'p_value': None,
   'median_control': None,
   'max_control': None},
  {'species': 'mouse',
   'strain': 'TRAMP',
   'group': 'GFP transgenic

In [None]:
# now we can save the result
ROOT_PATH_RESULTS = "pipeline_results/openai"
file_name = "0008-5472.CAN-07-2124"
os.makedirs(ROOT_PATH_RESULTS, exist_ok=True)  # create the folder if it does not exist yet
with open(os.path.join(ROOT_PATH_RESULTS, f"{file_name}.json"), "w", encoding="utf-8") as f:
    json.dump(output, f, indent=4)
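
The saved records mix numbers with free text in fields that look numeric: the second group above has `'median_treatment': 'Normal lifespan'`. A minimal sketch of a check that flags such fields for manual review follows. `NUMERIC_FIELDS` and `non_numeric_fields` are hypothetical, and treating these six fields as numeric is an assumption based only on the output shown above.

```python
# Assumed to be numeric, judging by the field names in the output above.
NUMERIC_FIELDS = (
    "n_treatment", "n_control", "median_treatment",
    "max_treatment", "median_control", "max_control",
)


def non_numeric_fields(group):
    """Return the assumed-numeric fields whose value is neither None nor a number."""
    flagged = []
    for field in NUMERIC_FIELDS:
        value = group.get(field)
        if value is None:
            continue  # missing values are allowed
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            flagged.append(field)
    return flagged


# Values copied from the second group in the output above.
example = {
    "n_treatment": 1,
    "n_control": 0,
    "median_treatment": "Normal lifespan",
    "max_treatment": "Normal lifespan",
    "median_control": None,
}
print(non_numeric_fields(example))
```

Running this over `output["groups"]` before or after saving gives a quick list of records that need a manual look.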