# LangChain RAG, using arXiv Cosmology Data, Parts 5 - 9 - Overview

The idea is to replicate the LangChain RAG template for our RAG application.
This is the second notebook, based on: https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_5_to_9.ipynb

### Imports and API Keys

In [None]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'

# Both keys are expected to be set already (e.g. in the .env file loaded above); fail early if one is missing
for key in ('LANGCHAIN_API_KEY', 'OPENAI_API_KEY'):
    assert os.environ.get(key), f'{key} is not set'

In [36]:
import bs4
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate
from langchain.load import dumps, loads
from operator import itemgetter

import warnings
warnings.filterwarnings('ignore')

## Multi-Query

https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever

#### Load the vector index previously created (https://github.com/panchambanerjee/CosmologyAI/blob/main/arxiv_project/code/notebooks/create_cosmo_vectordb.ipynb)

In [2]:
# Get the embedding model; we need it again to load the persisted vectordb.
# It has to be the same model that was used to build the index.

model_name = "sentence-transformers/all-MiniLM-l6-v2"  # Alternatives: "BAAI/bge-small-en-v1.5", "sentence-transformers/all-mpnet-base-v2"
# bge-base-en-v1.5 and bge-small took too much time to embed all the cosmo docs (~66k)
model_kwargs = {"device": "cpu"}  # Running on a local machine, so we use the CPU

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

# Load the persisted index

vectordb = Chroma(persist_directory='./arxiv_cosmo_chroma_db', embedding_function=embeddings)
retriever = vectordb.as_retriever()

In [4]:
# Prompt

# Multi Query: Different Perspectives
template = """You are an AI language model assistant. Your task is to generate five 
different versions of the given user question to retrieve relevant documents from a vector 
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search. 
Provide these alternative questions separated by newlines. Original question: {question}"""
prompt_perspectives = ChatPromptTemplate.from_template(template)

generate_queries = (
    prompt_perspectives 
    | ChatOpenAI(temperature=0) 
    | StrOutputParser() 
    | (lambda x: x.split("\n"))
)

In [7]:

def get_unique_union(documents: list[list]):
    """ Unique union of retrieved docs """
    # Flatten list of lists, and convert each Document to string
    flattened_docs = [dumps(doc) for sublist in documents for doc in sublist]
    # Get unique documents
    unique_docs = list(set(flattened_docs))
    # Return
    return [loads(doc) for doc in unique_docs]

# Retrieve
question = "What is a Galaxy Cluster?"

retrieval_chain = generate_queries | retriever.map() | get_unique_union
docs = retrieval_chain.invoke({"question":question})

len(docs)

9
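One detail of `get_unique_union` is that it deduplicates through a `set`, so the order of the returned documents is arbitrary. The following is a minimal sketch of an order-preserving variant. It is an illustration, not part of the notebook: `unique_union_ordered` is a hypothetical helper, and plain dicts serialized with `json.dumps` stand in for LangChain `Document` objects serialized with `dumps`.

```python
import json


def unique_union_ordered(result_lists):
    """Merge several retrieval result lists, dropping duplicates but keeping first-seen order."""
    seen = set()
    unique = []
    for docs in result_lists:
        for doc in docs:
            # Serialize to a string so the document can be used as a set member
            # (plain dicts stand in for Document objects here).
            key = json.dumps(doc, sort_keys=True)
            if key not in seen:
                seen.add(key)
                unique.append(doc)
    return unique


# Two toy result lists, as if returned for two rephrasings of the same question
results = [
    [{"page_content": "A"}, {"page_content": "B"}],
    [{"page_content": "B"}, {"page_content": "C"}],
]
merged = unique_union_ordered(results)
print([d["page_content"] for d in merged])  # ['A', 'B', 'C']
```

Whether order matters depends on the downstream prompt; for a plain "stuff everything into the context" chain like the one below, the set-based version is usually sufficient.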

In [10]:
# RAG
template = """Answer the following question based on this context:

{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

llm = ChatOpenAI(temperature=0)

final_rag_chain = (
    {"context": retrieval_chain, 
     "question": itemgetter("question")} 
    | prompt
    | llm
    | StrOutputParser()
)

final_rag_chain.invoke({"question":question})

'A galaxy cluster is a large, gravitationally bound system of galaxies that are held together by dark matter and surrounded by hot gas. It is a collection of galaxies, ranging from a few to thousands, that are interconnected through gravitational forces.'

## RAG-Fusion
https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1

In [11]:
# RAG-Fusion: Related
template = """You are a helpful assistant that generates multiple search queries based on a single input query. \n
Generate multiple search queries related to: {question} \n
Output (4 queries):"""
prompt_rag_fusion = ChatPromptTemplate.from_template(template)

In [12]:
generate_queries = (
    prompt_rag_fusion 
    | ChatOpenAI(temperature=0)
    | StrOutputParser() 
    | (lambda x: x.split("\n"))
)

In [13]:

def reciprocal_rank_fusion(results: list[list], k=60):
    """ Reciprocal_rank_fusion that takes multiple lists of ranked documents 
        and an optional parameter k used in the RRF formula """
    
    # Initialize a dictionary to hold fused scores for each unique document
    fused_scores = {}

    # Iterate through each list of ranked documents
    for docs in results:
        # Iterate through each document in the list, with its rank (position in the list)
        for rank, doc in enumerate(docs):
            # Convert the document to a string format to use as a key (assumes documents can be serialized to JSON)
            doc_str = dumps(doc)
            # If the document is not yet in the fused_scores dictionary, add it with an initial score of 0
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            # Update the score of the document using the RRF formula: 1 / (rank + k)
            fused_scores[doc_str] += 1 / (rank + k)

    # Sort the documents based on their fused scores in descending order to get the final reranked results
    reranked_results = [
        (loads(doc), score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]

    # Return the reranked results as a list of tuples, each containing the document and its fused score
    return reranked_results

retrieval_chain_rag_fusion = generate_queries | retriever.map() | reciprocal_rank_fusion
docs = retrieval_chain_rag_fusion.invoke({"question": question})

len(docs)

10
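To make the scoring in `reciprocal_rank_fusion` concrete, here is a small worked sketch on toy data. It is an illustration only: `rrf_scores` is a hypothetical helper, and short string ids stand in for serialized documents. As in the notebook, `rank` comes from `enumerate` and starts at 0; RRF is often written with ranks starting at 1, which shifts every score slightly but does not change the idea.

```python
def rrf_scores(ranked_lists, k=60):
    """Fuse several ranked lists of ids; each appearance contributes 1 / (rank + k)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):  # rank is 0-based, as in the notebook
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Three toy rankings, as if retrieved for three generated queries
rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
fused = rrf_scores(rankings)
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Here `b` ends up first because it is ranked at the top twice, `a` follows because it appears in all three lists, and `d` is last because it appears only once at the bottom. With `k=60` the differences between neighbouring ranks are small, so appearing in many lists matters more than the exact position in any one of them.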

In [14]:
# RAG
template = """Answer the following question based on this context:

{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

final_rag_chain = (
    {"context": retrieval_chain_rag_fusion, 
     "question": itemgetter("question")} 
    | prompt
    | llm
    | StrOutputParser()
)

final_rag_chain.invoke({"question":question})

'A galaxy cluster is a large group of galaxies held together by gravity.'

In [15]:
## Now try RAG fusion with the same prompt as for Multi-Query

template = """You are an AI language model assistant. Your task is to generate five 
different versions of the given user question to retrieve relevant documents from a vector 
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search. 
Provide these alternative questions separated by newlines. Original question: {question}"""
prompt_rag_fusion = ChatPromptTemplate.from_template(template)

generate_queries = (
    prompt_rag_fusion 
    | ChatOpenAI(temperature=0) 
    | StrOutputParser() 
    | (lambda x: x.split("\n"))
)

retrieval_chain_rag_fusion = generate_queries | retriever.map() | reciprocal_rank_fusion

docs = retrieval_chain_rag_fusion.invoke({"question": question})

len(docs)

9

In [16]:
# RAG
template = """Answer the following question based on this context:

{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

final_rag_chain = (
    {"context": retrieval_chain_rag_fusion, 
     "question": itemgetter("question")} 
    | prompt
    | llm
    | StrOutputParser()
)

final_rag_chain.invoke({"question":question})

'A galaxy cluster is a large group of galaxies held together by gravity. It is a structure in the universe that consists of numerous galaxies, as well as dark matter and hot gas.'

## Decomposition

In [17]:
# Decomposition
template = """You are a helpful assistant that generates multiple sub-questions related to an input question. \n
The goal is to break down the input into a set of sub-problems / sub-questions that can be answered in isolation. \n
Generate multiple search queries related to: {question} \n
Output (3 queries):"""

prompt_decomposition = ChatPromptTemplate.from_template(template)

In [18]:
# LLM
llm = ChatOpenAI(temperature=0)

# Chain
generate_queries_decomposition = ( prompt_decomposition | llm | StrOutputParser() | (lambda x: x.split("\n")))

# Run
question = "What is a Galaxy Cluster?"
questions = generate_queries_decomposition.invoke({"question":question})

In [19]:
questions

['1. What are the characteristics of a galaxy cluster?',
 '2. How are galaxy clusters formed?',
 '3. What is the significance of studying galaxy clusters in cosmology?']
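The generated sub-questions keep their `1.`, `2.`, `3.` prefixes, and splitting on `"\n"` can also produce empty strings when the model leaves blank lines between queries. Both are then passed to the retriever as they are. A minimal sketch of a stricter parser that could replace the `lambda x: x.split("\n")` step is shown here; `parse_queries` is a hypothetical helper, not something the notebook or LangChain provides.

```python
import re


def parse_queries(raw: str) -> list[str]:
    """Split LLM output into one query per line, dropping blanks and list markers."""
    queries = []
    for line in raw.split("\n"):
        # Remove a leading "1.", "2)", "-" or "*" marker, then surrounding whitespace
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if cleaned:
            queries.append(cleaned)
    return queries


raw_output = "1. What are the characteristics of a galaxy cluster?\n\n2. How are galaxy clusters formed?\n"
print(parse_queries(raw_output))
```

Since a plain function is coerced to a runnable when piped, it could be used as `prompt_decomposition | llm | StrOutputParser() | parse_queries`; whether the cleaner queries improve retrieval on this index has not been checked here.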

### Answer recursively
* https://arxiv.org/pdf/2205.10625.pdf
* https://arxiv.org/abs/2212.10509

In [20]:
# Prompt
template = """Here is the question you need to answer:

\n --- \n {question} \n --- \n

Here are any available background question + answer pairs:

\n --- \n {q_a_pairs} \n --- \n

Here is additional context relevant to the question: 

\n --- \n {context} \n --- \n

Use the above context and any background question + answer pairs to answer the question: \n {question}
"""

decomposition_prompt = ChatPromptTemplate.from_template(template)

In [21]:
def format_qa_pair(question, answer):
    """Format Q and A pair"""
    
    formatted_string = ""
    formatted_string += f"Question: {question}\nAnswer: {answer}\n\n"
    return formatted_string.strip()

# llm
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

q_a_pairs = ""
for q in questions:
    
    rag_chain = (
    {"context": itemgetter("question") | retriever, 
     "question": itemgetter("question"),
     "q_a_pairs": itemgetter("q_a_pairs")} 
    | decomposition_prompt
    | llm
    | StrOutputParser())

    answer = rag_chain.invoke({"question":q,"q_a_pairs":q_a_pairs})
    q_a_pair = format_qa_pair(q,answer)
    q_a_pairs = q_a_pairs + "\n---\n"+  q_a_pair
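The loop above answers the sub-questions in order and feeds every earlier question + answer pair back in as background for the next one. The sketch below isolates that accumulation pattern without any LLM or retriever calls. It is an illustration under stated assumptions: `fake_answer` is a hypothetical stub standing in for `rag_chain.invoke`, and it only reports how many earlier pairs it was given.

```python
def format_qa_pair(question, answer):
    """Format a single Q and A pair, as in the notebook."""
    return f"Question: {question}\nAnswer: {answer}"


def fake_answer(question, background):
    """Stub for rag_chain.invoke: reports how many earlier pairs were available."""
    n_prior = background.count("Question:")
    return f"stub answer using {n_prior} earlier pair(s)"


sub_questions = [
    "What are the characteristics of a galaxy cluster?",
    "How are galaxy clusters formed?",
    "What is the significance of studying galaxy clusters in cosmology?",
]

q_a_pairs = ""
answers = []
for q in sub_questions:
    answer = fake_answer(q, q_a_pairs)  # sees all pairs accumulated so far
    answers.append(answer)
    q_a_pairs = q_a_pairs + "\n---\n" + format_qa_pair(q, answer)

print(answers[-1])  # stub answer using 2 earlier pair(s)
```

The last answer is therefore the only one that has seen all previous pairs, which is why the notebook treats the final `answer` as the overall result.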

In [22]:
answer

'Studying galaxy clusters in cosmology is significant for several reasons. Galaxy clusters are the largest gravitationally bound structures in the universe dominated by dark matter, composed of hundreds to thousands of galaxies held together by gravity. By studying galaxy clusters, researchers can gain insights into the formation history, dynamical properties, and feedback processes within these structures. \n\nAdditionally, galaxy clusters serve as important cosmological probes, providing valuable information about the evolution of the universe. The precision and accuracy of cosmological parameters inferred from galaxy clusters are influenced by the knowledge of cluster physics, such as mass-observable scaling relations and the mass function. Understanding the complex interplay between the redshift evolution of the mass function and scaling relations in galaxy clusters is crucial for accurately constraining cosmological models.\n\nFurthermore, galaxy clusters can help in testing funda

In [26]:
q_a_pairs

'\n---\nQuestion: 1. What are the characteristics of a galaxy cluster?\nAnswer: Galaxy clusters are the largest gravitationally bound structures in the universe dominated by dark matter. They are composed of hundreds to thousands of galaxies held together by gravity. Within galaxy clusters, there are various structures and components observed, such as bubbles of relativistic plasma inflated by supermassive black holes, cooling and heating of the gas, large-scale plasma shocks, cold fronts, non-thermal halos, and relics. These components reflect both the formation history and the dynamical properties of galaxy clusters. X-ray spectroscopy is used as a tool to study the metal enrichment in clusters, and fine spectroscopy of Fe X-ray lines helps in understanding the turbulent plasma motions and the energetics of non-thermal electron populations within galaxy clusters. Understanding the complex dynamical and feedback processes in galaxy clusters is essential to comprehend the energy and ma

### Answer individually

In [27]:
# Answer each sub-question individually 


# RAG prompt
prompt_rag = hub.pull("rlm/rag-prompt")

def retrieve_and_rag(question,prompt_rag,sub_question_generator_chain):
    """RAG on each sub-question"""
    
    # Generate sub-questions with the decomposition chain (generate_queries_decomposition from above)
    sub_questions = sub_question_generator_chain.invoke({"question":question})
    
    # Initialize a list to hold RAG chain results
    rag_results = []
    
    for sub_question in sub_questions:
        
        # Retrieve documents for each sub-question
        retrieved_docs = retriever.get_relevant_documents(sub_question)
        
        # Use retrieved documents and sub-question in RAG chain
        answer = (prompt_rag | llm | StrOutputParser()).invoke({"context": retrieved_docs, 
                                                                "question": sub_question})
        rag_results.append(answer)
    
    return rag_results,sub_questions

# Run retrieval and RAG on each sub-question generated for the original question
answers, questions = retrieve_and_rag(question, prompt_rag, generate_queries_decomposition)

In [29]:
def format_qa_pairs(questions, answers):
    """Format Q and A pairs"""
    
    formatted_string = ""
    for i, (question, answer) in enumerate(zip(questions, answers), start=1):
        formatted_string += f"Question {i}: {question}\nAnswer {i}: {answer}\n\n"
    return formatted_string.strip()

context = format_qa_pairs(questions, answers)



answers, questions

(['Galaxy clusters are the largest gravitationally bounded structures in the Universe dominated by dark matter. They contain structures such as bubbles of relativistic plasma, cooling and heating of gas, large-scale plasma shocks, cold fronts, non-thermal halos, and relics. Observing these constituents can provide insights into the formation history and dynamical properties of galaxy clusters.',
  'Galaxy clusters are formed from the general picture of collapse from initial density fluctuations in an expanding Universe. Detailed simulations of cluster formation include the effects of galaxy formation. Uncertain physics of galaxy formation and feedback contribute to areas where predictions are uncertain.',
  'Studying galaxy clusters in cosmology is significant because they serve as a recent cosmological probe, providing precision and accuracy in inferring cosmological parameters. Knowledge of cluster physics, mass-observable scaling relations, and mass function modeling impact the anal

In [30]:
context

'Question 1: 1. What are the characteristics of a galaxy cluster?\nAnswer 1: Galaxy clusters are the largest gravitationally bounded structures in the Universe dominated by dark matter. They contain structures such as bubbles of relativistic plasma, cooling and heating of gas, large-scale plasma shocks, cold fronts, non-thermal halos, and relics. Observing these constituents can provide insights into the formation history and dynamical properties of galaxy clusters.\n\nQuestion 2: 2. How are galaxy clusters formed?\nAnswer 2: Galaxy clusters are formed from the general picture of collapse from initial density fluctuations in an expanding Universe. Detailed simulations of cluster formation include the effects of galaxy formation. Uncertain physics of galaxy formation and feedback contribute to areas where predictions are uncertain.\n\nQuestion 3: 3. What is the significance of studying galaxy clusters in cosmology?\nAnswer 3: Studying galaxy clusters in cosmology is significant because 

In [31]:
# Prompt
template = """Here is a set of Q+A pairs:

{context}

Use these to synthesize an answer to the question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

final_rag_chain = (
    prompt
    | llm
    | StrOutputParser()
)

final_rag_chain.invoke({"context":context,"question":question})

'A galaxy cluster is a large gravitationally bounded structure in the Universe dominated by dark matter. It contains various components such as bubbles of relativistic plasma, cooling and heating of gas, large-scale plasma shocks, cold fronts, non-thermal halos, and relics. Galaxy clusters are formed from the collapse of initial density fluctuations in an expanding Universe, with the uncertain physics of galaxy formation playing a role in their formation. Studying galaxy clusters is significant in cosmology as they serve as a recent cosmological probe, providing insights into the formation history and dynamical properties of these structures, as well as improving constraints on cosmological parameters through the analysis of their mass-observable scaling relations and mass function modeling.'

## Step-back Prompting
https://arxiv.org/pdf/2310.06117.pdf

In [33]:
# Few Shot Examples

examples = [
    {
        "input": "Could the members of The Police perform lawful arrests?",
        "output": "what can the members of The Police do?",
    },
    {
        "input": "Jan Sindel’s was born in what country?",
        "output": "what is Jan Sindel’s personal history?",
    },
]
# We now transform these to example messages
example_prompt = ChatPromptTemplate.from_messages(
    [
        ("human", "{input}"),
        ("ai", "{output}"),
    ]
)
few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are an expert at world knowledge. Your task is to step back and paraphrase a question to a more generic step-back question, which is easier to answer. Here are a few examples:""",
        ),
        # Few shot examples
        few_shot_prompt,
        # New question
        ("user", "{question}"),
    ]
)

Note: The prompt above might need to be refined for our cosmology vector index.

In [35]:
generate_queries_step_back = prompt | ChatOpenAI(temperature=0) | StrOutputParser()
question = "What is a Galaxy Cluster?"
generate_queries_step_back.invoke({"question": question}) # Not a very good step-back question

'What are astronomical objects in space?'

In [37]:
# Response prompt 
response_prompt_template = """You are an expert of world knowledge. I am going to ask you a question. Your response should be comprehensive and not contradicted with the following context if they are relevant. Otherwise, ignore them if they are not relevant.

# {normal_context}
# {step_back_context}

# Original Question: {question}
# Answer:"""
response_prompt = ChatPromptTemplate.from_template(response_prompt_template)

chain = (
    {
        # Retrieve context using the normal question
        "normal_context": RunnableLambda(lambda x: x["question"]) | retriever,
        # Retrieve context using the step-back question
        "step_back_context": generate_queries_step_back | retriever,
        # Pass on the question
        "question": lambda x: x["question"],
    }
    | response_prompt
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
)

chain.invoke({"question": question})

'A galaxy cluster is a large, gravitationally bound system consisting of hundreds to thousands of galaxies, as well as hot gas, dark matter, and other astronomical objects. These clusters are the largest known gravitationally bound structures in the universe, typically spanning millions of light-years in size. Galaxy clusters are important in the study of cosmology and astrophysics as they provide insights into the formation and evolution of structures in the universe.\n\nGalaxy clusters are often characterized by their high density of galaxies compared to the surrounding cosmic environment. The galaxies within a cluster are typically moving relative to each other due to the gravitational interactions within the cluster. The mass of a galaxy cluster is dominated by dark matter, which is a form of matter that does not emit, absorb, or reflect electromagnetic radiation, making it invisible to telescopes. The presence of dark matter is inferred from its gravitational effects on visible ma

## HyDE
* https://github.com/langchain-ai/langchain/blob/master/cookbook/hypothetical_document_embeddings.ipynb
* https://arxiv.org/abs/2212.10496

In [39]:
# HyDE document generation
template = """Please write a scientific paper passage to answer the question
Question: {question}
Passage:"""
prompt_hyde = ChatPromptTemplate.from_template(template)

generate_docs_for_retrieval = (
    prompt_hyde | ChatOpenAI(temperature=0) | StrOutputParser() 
)

# Run
question = "What is a Galaxy Cluster?"
generate_docs_for_retrieval.invoke({"question":question})

"A galaxy cluster is a large, gravitationally bound grouping of galaxies. These clusters can contain anywhere from a few dozen to thousands of galaxies, as well as vast amounts of dark matter and hot intergalactic gas. The galaxies within a cluster are typically spread out over a large area, with distances between them ranging from a few hundred thousand to several million light-years.\n\nGalaxy clusters are some of the largest structures in the universe, with masses ranging from around 10^14 to 10^15 times the mass of the Sun. They are thought to have formed through the gravitational collapse of primordial matter, and continue to evolve through mergers with other clusters and the accretion of smaller galaxy groups.\n\nOne of the key features of galaxy clusters is the presence of dark matter, which makes up the majority of the cluster's mass. Dark matter is a mysterious substance that does not emit, absorb, or reflect light, and its presence can only be inferred through its gravitation

In [40]:
# Retrieve

retrieval_chain = generate_docs_for_retrieval | retriever 
retrieved_docs = retrieval_chain.invoke({"question":question})

retrieved_docs

[Document(page_content='Galaxy clusters are the most massive gravitationally bound systems consisting of dark matter, hot baryonic gas and stars. They play an important role in observational cosmology and galaxy evolution', metadata={'abstract': 'Galaxy clusters are the most massive gravitationally bound systems consisting of dark matter, hot baryonic gas and stars. They play an important role in observational cosmology and galaxy evolution studies. We have developed a deep learning model for segmentation of SZ signal on ACT+Planck intensity maps and present here a new galaxy cluster catalogue in the ACT footprint. In order to increase the purity of the cluster catalogue, we limit ourselves to publishing here only a part of the full sample with the most probable galaxy clusters lying in the directions to the candidates of the extended Planck cluster catalogue (SZcat). The ComPACT catalogue contains 2,934 galaxy clusters (with $Purity\\gtrsim88$ %), $\\gtrsim1436$ clusters are new with 
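The intuition behind HyDE is that a generated passage tends to look more like the indexed documents than a short question does, so its embedding lands closer to them. The toy sketch below illustrates this with a bag-of-words vector and cosine similarity. This is an assumption made for illustration: it is not the `all-MiniLM-l6-v2` embedding model used in this notebook, and the three strings are made up for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real sentence embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc = "galaxy clusters are massive gravitationally bound systems of dark matter hot gas and stars"
question = "what is a galaxy cluster"
hypothetical = "a galaxy cluster is a gravitationally bound system of galaxies dark matter and hot gas"

sim_question = cosine(embed(question), embed(doc))
sim_hypothetical = cosine(embed(hypothetical), embed(doc))
print(round(sim_question, 3), round(sim_hypothetical, 3))
```

In this toy setup the hypothetical passage scores much higher against the document than the bare question does. A real embedding model already captures far more than word overlap, so the gain from HyDE on the actual index may be smaller and should be checked on real queries.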

In [41]:
# RAG
template = """Answer the following question based on this context:

{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

final_rag_chain = (
    prompt
    | llm
    | StrOutputParser()
)

final_rag_chain.invoke({"context":retrieved_docs,"question":question})

'A galaxy cluster is the most massive gravitationally bound system consisting of dark matter, hot baryonic gas, and stars. It plays an important role in observational cosmology and galaxy evolution.'