# LangChain RAG, using arXiv Cosmology Data, Parts 1 - 4

The idea is to replicate the LangChain RAG template for our RAG application.
This is the first notebook, based on: https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_1_to_4.ipynb

### Imports and API Keys

In [1]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'

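# The keys are read from the environment (populated by .env above).
# These self-assignments change nothing when a key is set, but raise a KeyError early if one is missing.
os.environ['LANGCHAIN_API_KEY'] = os.environ['LANGCHAIN_API_KEY']
os.environ['OPENAI_API_KEY'] = os.environ['OPENAI_API_KEY']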

In [4]:
import bs4
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

In [10]:
import warnings
warnings.filterwarnings('ignore')

### Load the vector index previously created (https://github.com/panchambanerjee/CosmologyAI/blob/main/arxiv_project/code/notebooks/create_cosmo_vectordb.ipynb)

In [5]:
# Get the embedding model, we need this again to load in the persisted vectordb

# Alternatives considered: "BAAI/bge-small-en-v1.5", "sentence-transformers/all-mpnet-base-v2"
# bge-base-en-v1.5 and bge-small took too long to embed all the cosmology docs (~66k)
model_name = "sentence-transformers/all-MiniLM-l6-v2"
model_kwargs = {"device": "cpu"}  # Running on a local machine, so use the CPU

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

In [6]:
vectordb = Chroma(persist_directory='./arxiv_cosmo_chroma_db', embedding_function=embeddings)
retriever = vectordb.as_retriever()

### Retrieval and Generation

In [7]:
prompt = hub.pull("rlm/rag-prompt") # Loads the latest version of the prompt

In [8]:
prompt

ChatPromptTemplate(input_variables=['context', 'question'], metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))])

In [11]:
# LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Question
rag_chain.invoke("What is a Galaxy Cluster")

'A galaxy cluster is a large group of galaxies bound together by gravity. These clusters can contain hundreds to thousands of galaxies. They are the largest known gravitationally bound structures in the universe.'
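To make the data flow of the chain above easier to follow, here is a toy sketch of how `|` composition can work in plain Python. It is not LangChain's implementation: `Step`, `parallel`, and all the `toy_*` names are hypothetical stand-ins, and the "LLM" is a stub that just echoes the first line of the prompt.

```python
class Step:
    """Hypothetical stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, x):
        return self.fn(x)

    def __or__(self, other):
        # Piping builds a new step that feeds this step's output into the next one.
        return Step(lambda x: other.invoke(self.invoke(x)))


def parallel(**branches):
    # Runs every branch on the same input and collects the results in a dict,
    # mimicking the role of the {"context": ..., "question": ...} mapping.
    return Step(lambda x: {name: step.invoke(x) for name, step in branches.items()})


toy_retriever = Step(lambda q: f"[context retrieved for: {q}]")
passthrough = Step(lambda q: q)
toy_prompt = Step(lambda d: f"Question: {d['question']}\nContext: {d['context']}\nAnswer:")
toy_llm = Step(lambda p: "stub answer for -> " + p.splitlines()[0])  # not a real model

toy_chain = parallel(context=toy_retriever, question=passthrough) | toy_prompt | toy_llm
result = toy_chain.invoke("What is a Galaxy Cluster")
print(result)
```

The point of the sketch is only that the question string fans out into two branches (retrieved context and the untouched question) before the prompt is filled in and passed on.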

In [12]:
import tiktoken

question = "What is the Cosmological constant?"

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

num_tokens_from_string(question, "cl100k_base") # cl100k_base is the tokenizer encoding (not a model) used by gpt-3.5-turbo and OpenAI's embedding models

8

In [16]:
question = "What is the Cosmological constant?"
document = "In the context of cosmology the cosmological constant is a homogeneous energy density that causes the expansion of the universe to accelerate. Originally proposed early in the development of general relativity in order to allow a static universe solution it was subsequently abandoned when the universe was found to be expanding. Now the cosmological constant is invoked to explain the observed acceleration of the expansion of the universe. The cosmological constant is the simplest realization of dark energy, which is the more generic name given to the unknown cause of the acceleration of the universe."

print(document)

In the context of cosmology the cosmological constant is a homogeneous energy density that causes the expansion of the universe to accelerate. Originally proposed early in the development of general relativity in order to allow a static universe solution it was subsequently abandoned when the universe was found to be expanding. Now the cosmological constant is invoked to explain the observed acceleration of the expansion of the universe. The cosmological constant is the simplest realization of dark energy, which is the more generic name given to the unknown cause of the acceleration of the universe.


### Getting similarity between a query and the relevant document

In [17]:
from langchain_openai import OpenAIEmbeddings

embd = OpenAIEmbeddings()
query_result = embd.embed_query(question)
document_result = embd.embed_query(document)

len(query_result)

1536

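Cosine similarity is the recommended metric for OpenAI embeddings (1 indicates identical direction). The vectors are normalized to unit length, so cosine similarity gives the same ranking as the dot product.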

In [18]:
import numpy as np

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

similarity = cosine_similarity(query_result, document_result)

print("Cosine Similarity:", similarity)

Cosine Similarity: 0.8981770043772933
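As a small check of why cosine similarity is a natural fit here: OpenAI documents its embeddings as normalized to length 1, and for unit-length vectors the cosine similarity reduces to a plain dot product. The sketch below uses only the standard library and made-up 3-dimensional vectors (an assumption for illustration, not real embeddings).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))


def normalize(u):
    n = norm(u)
    return [a / n for a in u]


# Toy vectors, scaled to unit length the way normalized embeddings would be.
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

cos_ab = cosine(a, b)
dot_ab = dot(a, b)

# For unit vectors both norms are 1, so the two quantities agree.
print(math.isclose(cos_ab, dot_ab))
```

For vectors that are not normalized, the two values differ, which is why the explicit division by the norms in `cosine_similarity` is the safer general-purpose form.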


The retriever was already created above; here it is recreated with `k=1` so that only the single most similar document is returned.

In [19]:
retriever = vectordb.as_retriever(search_kwargs={"k": 1})
docs = retriever.get_relevant_documents("What is a Galaxy Cluster?")

len(docs)

1
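Conceptually, `k` controls how many of the nearest stored vectors come back for a query. The following brute-force sketch illustrates that idea with a hypothetical in-memory index of three made-up vectors. It is not how Chroma works internally (vector stores typically rely on an approximate nearest-neighbor index rather than scanning every vector), and `top_k`, `toy_index`, and `toy_query` are names invented for this example.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, doc_vecs, k=1):
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,  # highest similarity first
    )
    return [name for name, _ in ranked[:k]]


# Made-up 3-d "embeddings" keyed by topic, purely for illustration.
toy_index = {
    "galaxy clusters": [0.9, 0.1, 0.0],
    "dark energy": [0.1, 0.9, 0.1],
    "cmb anisotropies": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

top1 = top_k(toy_query, toy_index, k=1)
top2 = top_k(toy_query, toy_index, k=2)
print(top1, top2)
```

With `k=1` only the closest entry is returned, mirroring `len(docs) == 1` above; raising `k` trades a longer prompt for more context.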

### Generation

In [21]:
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Prompt
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

print(prompt)

input_variables=['context', 'question'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template='Answer the question based only on the following context:\n{context}\n\nQuestion: {question}\n'))]


In [22]:
# LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Chain
chain = prompt | llm

# Run
chain.invoke({"context":docs,"question":"What is a Galaxy Cluster?"})   

AIMessage(content='A galaxy cluster is a group of galaxies bound together by gravity.', response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 271, 'total_tokens': 284}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_b28b39ffa8', 'finish_reason': 'stop', 'logprobs': None}, id='run-042b2577-b14c-4fc9-a090-dcd550ca103d-0')

In [23]:
from langchain import hub
prompt_hub_rag = hub.pull("rlm/rag-prompt")

print(prompt_hub_rag)

input_variables=['context', 'question'] metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'} messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))]


In [24]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is a Galaxy Cluster?")

'A galaxy cluster is a group of galaxies held together by gravity.'
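Note that this last chain pipes the retriever straight into the prompt, whereas the first chain in this notebook passed the results through `format_docs`. Without that step the prompt receives the string form of the document list, metadata and all. The sketch below shows the difference using a hypothetical `StubDoc` class in place of LangChain's `Document`; the stub texts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class StubDoc:
    """Hypothetical stand-in for a retrieved document (only page_content is modeled)."""
    page_content: str


def format_docs(docs):
    # Same helper as earlier in the notebook: join the text of each document.
    return "\n\n".join(doc.page_content for doc in docs)


stub_docs = [
    StubDoc("Galaxy clusters are bound by gravity."),
    StubDoc("They host hot intracluster gas."),
]

raw_context = str(stub_docs)            # what a prompt sees if the list is inserted as-is
clean_context = format_docs(stub_docs)  # plain text only

print(raw_context)
print(clean_context)
```

Both variants can work, as the answer above shows, but formatting the documents first keeps class names and metadata out of the prompt and usually reduces the token count.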