# Build a RAG model to search recount3

The goal of this notebook is to (1) embed the projects available in recount3 (stored in `recount3.csv`) and (2) create a prompt to search for projects related to a query.

In [39]:
from langchain_community.llms import Ollama
from langchain_community.document_loaders import CSVLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain import hub
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers.string import StrOutputParser

## Load the LLM and data

In [7]:
llm = Ollama(model="llama3")
llm

Ollama(model='llama3')

In [27]:
loader = CSVLoader("data/recount3.csv")
docs = loader.load()
len(docs)

18830

In [52]:
text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(docs)
len(documents)

4

In [53]:
documents[:2]

[Document(page_content='organism: human\nproject_home: data_sources/sra\nproject: SRP179061\nn_samples: 113\nstudy_title: Alzheimer\'s gene expression by cell type - SFG\nstudy_abstract: AD patients all had Braak stages V or VI, and were also pathologically confirmed to have amyloid plaque. The "SAMPLE_ID" sample characteristic is a sample identifier internal to Genentech. The ID of this project in Genentech\'s ExpressionPlot database is PRJ0018621 Overall design: RNA from purified cell types from AD and control post-mortem frozen superior frontal gyrus of AD and control patients.', metadata={'source': 'data/recount3.csv', 'row': 859}),
 Document(page_content='organism: human\nproject_home: data_sources/sra\nproject: SRP100948\nn_samples: 117\nstudy_title: Heterogeneity in neurodegenerative disease\nstudy_abstract: RNA was purified from fusiform gyrus tissue sections of autopsy-confirmed Alzheimer\'\'s cases and neurologically normal age-matched controls. The "SAM.ID" sample characteri

## Embedding projects in recount3

In [47]:
# Building the index takes a while, so prefer loading the stored index when it already exists.
vector_store = FAISS.from_documents(documents, GPT4AllEmbeddings())
retriever = vector_store.as_retriever()
vector_store.save_local("index")

# Quick test. Use a separate name so the `docs` list loaded from the CSV is not overwritten.
hits = vector_store.similarity_search("I want to find the available data to study Alzheimer's disease.")
len(hits)

4
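The comment in the cell above recommends reusing the stored index. A minimal sketch of that pattern follows; `load_or_build` is a hypothetical helper, not part of LangChain, and it only checks for the `index.faiss` file that `FAISS.save_local` writes into the target folder.

```python
from pathlib import Path


def load_or_build(index_dir, load, build):
    """Return a stored index if `index_dir` holds one, otherwise build a new one.

    `load` and `build` are zero-argument callables supplied by the caller
    (hypothetical names used for illustration only).
    """
    # FAISS.save_local(folder) writes `index.faiss` and `index.pkl` by default.
    if (Path(index_dir) / "index.faiss").exists():
        return load()
    return build()


# Assumed usage in this notebook (not executed here). Recent versions of
# langchain_community require `allow_dangerous_deserialization=True` to load
# the pickled docstore; older versions may not accept that argument.
# vector_store = load_or_build(
#     "index",
#     load=lambda: FAISS.load_local(
#         "index", GPT4AllEmbeddings(), allow_dangerous_deserialization=True
#     ),
#     build=lambda: FAISS.from_documents(documents, GPT4AllEmbeddings()),
# )
```

Passing the two steps as callables keeps the slow embedding call lazy: it only runs when no stored index is found.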

## Generate a prompt and RAG chain

In [81]:
prompt = hub.pull("rlm/rag-prompt-llama")
prompt

ChatPromptTemplate(input_variables=['context', 'question'], metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt-llama', 'lc_hub_commit_hash': '693a2db5447e3b58c060a6ac02758dc7f1aaaaa4ee6214d127bf70b443158630'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="[INST]<<SYS>> You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.<</SYS>> \nQuestion: {question} \nContext: {context} \nAnswer: [/INST]"))])

In [82]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [83]:
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
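To make the data flow of the chain explicit, here is a rough plain-Python equivalent. It is only a sketch: `run_rag`, the stand-in documents, and the echo "model" are hypothetical and replace the real retriever, hub prompt, and Ollama model.

```python
from types import SimpleNamespace


def format_docs(docs):
    # Same helper as in the notebook: join the page contents with blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


def run_rag(question, retrieve, template, generate):
    """Mimic the chain: retrieve -> format -> fill the prompt -> generate -> str."""
    context = format_docs(retrieve(question))  # "context": retriever | format_docs
    prompt_text = template.format(context=context, question=question)  # | prompt
    return str(generate(prompt_text))  # | llm | StrOutputParser()


# Stand-ins for illustration; the project names are placeholders, not recount3 data.
fake_docs = [
    SimpleNamespace(page_content="project: EXAMPLE1"),
    SimpleNamespace(page_content="project: EXAMPLE2"),
]
answer = run_rag(
    "Which projects exist?",
    retrieve=lambda q: fake_docs,
    template="Question: {question}\nContext: {context}\nAnswer:",
    generate=lambda prompt_text: prompt_text,  # echo the prompt instead of calling an LLM
)
print(answer)
```

The echo model makes it easy to inspect exactly what text the LLM would receive for a given question.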

## Test dataset
The result may change from run to run for the same query.

In [87]:
print(rag_chain.invoke("recommend any human projects for ALS. Please tell the project id and the number of samples."))

I recommend three human projects related to ALS:

* SRP067645 (21 samples): This project aims to identify gene expression signatures of motor neuron populations isolated from sporadic ALS patients.
* SRP064478 (15 samples): This study compares the transcriptome profiles of postmortem cervical spinal sections from 7 ALS patients and 8 healthy controls to understand differences in gene expression between the two groups.
* SRP100370 (4 samples): This project investigates aberrant gene expression in motor neurons derived from ALS patient-induced pluripotent stem cells bearing a specific SOD1 mutation.

These projects may provide valuable insights into the molecular mechanisms underlying ALS and could potentially lead to the discovery of biomarkers or therapeutic targets.


In [77]:
import pandas as pd

In [120]:
df = pd.read_csv("recount3.csv")
for _, row in df[df.project.isin(["SRP067645", "SRP064478", "SRP100370"])].iterrows():
    print(row["project"], row["study_title"], row["n_samples"])

SRP067645 Gene Expression Signatures of Motor Neuron Populations Isolated from Sporadic ALS 21
SRP064478 Illumina Total Stranded RNA sequencing reads from 15 postmortem cervical spinal sections (7 ALS and 8 healthy controls) 15
SRP100370 RNA-seq analysis revealed aberrant gene expression in motor neurons derived from ALS patient iPSCs bearing SOD1+/A272C mutation 4
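The manual check above hard-codes the project IDs from the answer. A possible way to automate it is to pull the IDs out of the answer text with a regular expression. This sketch assumes the IDs are SRA-style study accessions (`SRP`, `ERP`, or `DRP` followed by digits); other recount3 project names would not be matched. Both helper functions are hypothetical.

```python
import re

# SRA-style study accessions: SRP / ERP / DRP followed by digits (assumption).
PROJECT_ID = re.compile(r"\b[SED]RP\d+\b")


def extract_project_ids(answer):
    """Return the distinct accessions found in `answer`, in order of first appearance."""
    seen = []
    for project_id in PROJECT_ID.findall(answer):
        if project_id not in seen:
            seen.append(project_id)
    return seen


def split_known_unknown(project_ids, known_ids):
    """Separate IDs that exist in `known_ids` from IDs the model may have invented."""
    known = [pid for pid in project_ids if pid in known_ids]
    unknown = [pid for pid in project_ids if pid not in known_ids]
    return known, unknown


# Assumed usage in this notebook (not executed here):
# answer = rag_chain.invoke("recommend any human projects for ALS.")
# known, unknown = split_known_unknown(extract_project_ids(answer), set(df["project"]))
```

Any ID that lands in `unknown` does not appear in the CSV and deserves a closer look.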


For datasets that are available, the chain found the correct projects: the IDs, titles, and sample counts match the CSV. What about a project that is not available?

In [123]:
print(rag_chain.invoke("recommend any SAGE sequencing dataset."))

[/INST]<<SYS>> I recommend the SAGE sequencing dataset from study SRP102119, "Metastatic Breast Cancer Sample Sequencing", which contains 37 human samples. This dataset provides a comprehensive view of breast cancer genomic variation and could be useful for answering questions related to cancer research.


In [124]:
for _, row in df[df.project.isin(["SRP102119"])].iterrows():
    print(row["project"], row["study_title"], row["n_samples"])
    print(row["study_abstract"])

SRP102119 Metastatic Breast Cancer Sample Sequencing 37
Metastatic Breast Cancer patient tumor mass whole exome and transcriptome sequencing


In this instance, the chain could not retrieve correct information, so it fell back to the project whose topic is closest to the query in the embedding space. How to mitigate cases like this one remains an open question.
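One mitigation that might be worth trying is to drop weak matches before they reach the LLM and to abstain when nothing remains. The sketch below works on `(document, distance)` pairs such as those returned by `vector_store.similarity_search_with_score(query, k=4)`, where the default FAISS score is an L2 distance (lower means closer). The helper names and the threshold are hypothetical; a useful cutoff would have to be tuned on real queries, and a distance filter alone may not catch a query that is topically close but asks for the wrong assay.

```python
def keep_close_hits(scored_hits, max_distance):
    """Keep (document, distance) pairs whose distance is at most `max_distance`."""
    return [(doc, dist) for doc, dist in scored_hits if dist <= max_distance]


def answer_or_abstain(scored_hits, max_distance, answer_fn,
                      fallback="No matching project found in recount3."):
    """Call `answer_fn` on the surviving documents, or return `fallback` if none survive."""
    kept = keep_close_hits(scored_hits, max_distance)
    if not kept:
        return fallback
    return answer_fn([doc for doc, _ in kept])


# Toy example with placeholder documents and distances (not real retrieval output).
toy_hits = [("doc_a", 0.2), ("doc_b", 0.9)]
print(answer_or_abstain(toy_hits, 0.5, lambda docs: ", ".join(docs)))
print(answer_or_abstain(toy_hits, 0.1, lambda docs: ", ".join(docs)))
```

In the notebook, `answer_fn` could format the surviving documents and pass them to the prompt and LLM in place of the unfiltered retriever output.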