This notebook shows how to build a simple retrieval-augmented generation (RAG) pipeline using LangChain and three models, retrieving information from research papers on arXiv and answering questions about them.

First, we search arXiv for research papers by keyword. We then combine an embedding model (NV-Embed-QA), a reranking model (nv-rerankqa-mistral-4b-v3), and a large language model (meta/llama-3.1-8b-instruct) into a simple RAG pipeline that answers our questions using content retrieved from the papers.

You don't need to download the models: the LLM, embedding, and reranking models are hosted at NVIDIA endpoints (https://build.nvidia.com/). You do need to generate an NVIDIA API key to use the model endpoints.
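Rather than pasting the key into a notebook cell, one option is to read it from the environment and prompt for it only when it is missing. The helper below is a minimal sketch: the name `ensure_nvidia_api_key` and its parameters are hypothetical, and the `nvapi-` prefix check simply mirrors the key format mentioned in this notebook.

```python
import os
from getpass import getpass
from typing import Callable, MutableMapping


def ensure_nvidia_api_key(
    env: MutableMapping[str, str] = os.environ,
    ask: Callable[[str], str] = getpass,
) -> str:
    """Return the NVIDIA API key, prompting for it only if it is not already set.

    `env` and `ask` are injectable so the helper can be tried without a
    real environment or an interactive prompt (hypothetical design choice).
    """
    key = env.get("NVIDIA_API_KEY", "")
    if not key.startswith("nvapi-"):
        # Nothing usable in the environment: ask the user without echoing input.
        key = ask("Enter your NVIDIA API key (starts with nvapi-): ").strip()
        if not key.startswith("nvapi-"):
            raise ValueError("expected a key starting with 'nvapi-'")
        env["NVIDIA_API_KEY"] = key
    return key
```

Calling `ensure_nvidia_api_key()` once, before any model client is created, would then stand in for a hard-coded `os.environ["NVIDIA_API_KEY"] = ...` assignment.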


In [1]:
!pip install --upgrade --quiet pip arxiv pymupdf pypdf langchain langchain-nvidia-ai-endpoints langchain-community faiss-gpu

First, we can use LangChain's ArxivLoader to fetch a paper together with its basic metadata, such as the title, authors, abstract, and publication date.

In [2]:
from langchain_community.document_loaders import ArxivLoader
docs = ArxivLoader(query="Retrieval Augmented Generation", load_max_docs=2).load()
len(docs)

2

In [3]:
docs[0].metadata  # meta-information of the Document

{'Published': '2024-06-19',
 'Title': 'R^2AG: Incorporating Retrieval Information into Retrieval Augmented Generation',
 'Authors': 'Fuda Ye, Shuangyin Li, Yongqi Zhang, Lei Chen',
 'Summary': "Retrieval augmented generation (RAG) has been applied in many scenarios to\naugment large language models (LLMs) with external documents provided by\nretrievers. However, a semantic gap exists between LLMs and retrievers due to\ndifferences in their training objectives and architectures. This misalignment\nforces LLMs to passively accept the documents provided by the retrievers,\nleading to incomprehension in the generation process, where the LLMs are\nburdened with the task of distinguishing these documents using their inherent\nknowledge. This paper proposes R$^2$AG, a novel enhanced RAG framework to fill\nthis gap by incorporating Retrieval information into Retrieval Augmented\nGeneration. Specifically, R$^2$AG utilizes the nuanced features from the\nretrievers and employs a R$^2$-Former to c

In [4]:
docs[0].page_content[:3000]  # the first 3000 characters of the Document's content

'R2AG: Incorporating Retrieval Information into Retrieval Augmented\nGeneration\nFuda Ye1, Shuangyin Li1,*, Yongqi Zhang2, Lei Chen2,3\n1South China Normal University\n2The Hong Kong University of Science and Technology (Guangzhou)\n3The Hong Kong University of Science and Technology\nfudayip@m.scnu.edu.cn, shuangyinli@scnu.edu.cn, yongqizhang@hkust-gz.edu.cn, leichen@cse.ust.hk\nAbstract\nRetrieval augmented generation (RAG) has\nbeen applied in many scenarios to augment\nlarge language models (LLMs) with external\ndocuments provided by retrievers. However,\na semantic gap exists between LLMs and\nretrievers due to differences in their training\nobjectives and architectures. This misalign-\nment forces LLMs to passively accept the\ndocuments provided by the retrievers, leading\nto incomprehension in the generation process,\nwhere the LLMs are burdened with the task of\ndistinguishing these documents using their in-\nherent knowledge. This paper proposes R2AG,\na novel enhanced RAG fra

We can also use LangChain's ArxivRetriever to retrieve multiple papers from arXiv by keyword:

In [5]:
from langchain_community.retrievers import ArxivRetriever
retriever = ArxivRetriever(load_max_docs=5)
docs = retriever.invoke(
    "Retrieval Augmented Generation",
)
docs

[Document(metadata={'Entry ID': 'http://arxiv.org/abs/2406.13249v1', 'Published': datetime.date(2024, 6, 19), 'Title': 'R^2AG: Incorporating Retrieval Information into Retrieval Augmented Generation', 'Authors': 'Fuda Ye, Shuangyin Li, Yongqi Zhang, Lei Chen'}, page_content="Retrieval augmented generation (RAG) has been applied in many scenarios to\naugment large language models (LLMs) with external documents provided by\nretrievers. However, a semantic gap exists between LLMs and retrievers due to\ndifferences in their training objectives and architectures. This misalignment\nforces LLMs to passively accept the documents provided by the retrievers,\nleading to incomprehension in the generation process, where the LLMs are\nburdened with the task of distinguishing these documents using their inherent\nknowledge. This paper proposes R$^2$AG, a novel enhanced RAG framework to fill\nthis gap by incorporating Retrieval information into Retrieval Augmented\nGeneration. Specifically, R$^2$AG 

Or retrieve research papers by author:

In [6]:
retriever.invoke(
    "Shuangyin Li",
)

[Document(metadata={'Entry ID': 'http://arxiv.org/abs/1507.08396v1', 'Published': datetime.date(2015, 7, 30), 'Title': 'Tag-Weighted Topic Model For Large-scale Semi-Structured Documents', 'Authors': 'Shuangyin Li, Jiefei Li, Guan Huang, Ruiyang Tan, Rong Pan'}, page_content='To date, there have been massive Semi-Structured Documents (SSDs) during the\nevolution of the Internet. These SSDs contain both unstructured features (e.g.,\nplain text) and metadata (e.g., tags). Most previous works focused on modeling\nthe unstructured text, and recently, some other methods have been proposed to\nmodel the unstructured text with specific tags. To build a general model for\nSSDs remains an important problem in terms of both model fitness and\nefficiency. We propose a novel method to model the SSDs by a so-called\nTag-Weighted Topic Model (TWTM). TWTM is a framework that leverages both the\ntags and words information, not only to learn the document-topic and topic-word\ndistributions, but also to

You can pick any paper to study in more detail. LangChain provides PyPDFLoader for loading PDF files:

In [7]:
from langchain_community.document_loaders import PyPDFLoader
# You can replace the below link with a different link to a PDF file 
loader = PyPDFLoader("https://arxiv.org/pdf/2005.11401")
pages = loader.load_and_split()
pages[0]

Document(metadata={'source': 'https://arxiv.org/pdf/2005.11401', 'page': 0}, page_content='Retrieval-Augmented Generation for\nKnowledge-Intensive NLP Tasks\nPatrick Lewis†‡, Ethan Perez⋆,\nAleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,\nMike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†\n†Facebook AI Research;‡University College London;⋆New York University;\nplewis@fb.com\nAbstract\nLarge pre-trained language models have been shown to store factual knowledge\nin their parameters, and achieve state-of-the-art results when ﬁne-tuned on down-\nstream NLP tasks. However, their ability to access and precisely manipulate knowl-\nedge is still limited, and hence on knowledge-intensive tasks, their performance\nlags behind task-speciﬁc architectures. Additionally, providing provenance for their\ndecisions and updating their world knowledge remain open research problems. Pre-\ntrained models with a differentiable access mec

Next, we set the API key and initialize the LLM, then pick any sentence from the page printed above to see what its embedding looks like:

In [8]:
# Initialize the LLM
import os
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# NVIDIA AI Foundation Endpoints
os.environ["NVIDIA_API_KEY"] = "your API key starting with nvapi-" 

llm = ChatNVIDIA(
  model="meta/llama-3.1-8b-instruct",
  temperature=0.2,
  top_p=0.7,
  max_tokens=1024,
)

In [9]:
# You can replace the example text with any text you want to try.
example_text = "Large pre-trained language models have been shown to store factual knowledge\nin their parameters"

In [10]:
# Embedding  
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

embedding_model = NVIDIAEmbeddings(model="NV-Embed-QA", truncate="END")

embedding_model.embed_query(example_text)

[-0.02069091796875,
 -0.0182342529296875,
 0.01340484619140625,
 8.600950241088867e-05,
 0.0223236083984375,
 0.050994873046875,
 0.0338134765625,
 -0.01369476318359375,
 -0.0094451904296875,
 -0.043426513671875,
 0.01171112060546875,
 0.02374267578125,
 0.0107879638671875,
 -0.025299072265625,
 -0.004131317138671875,
 0.00812530517578125,
 -0.00879669189453125,
 0.0154571533203125,
 -0.012420654296875,
 0.01025390625,
 0.005096435546875,
 0.045654296875,
 0.01207733154296875,
 -0.0262908935546875,
 0.018218994140625,
 0.01947021484375,
 0.011444091796875,
 -0.021240234375,
 0.0096282958984375,
 0.06756591796875,
 0.051849365234375,
 -0.01100921630859375,
 0.03155517578125,
 0.05328369140625,
 0.002727508544921875,
 0.034454345703125,
 0.00989532470703125,
 -0.0606689453125,
 -0.01922607421875,
 0.032135009765625,
 -0.0113372802734375,
 -0.0289306640625,
 0.0079193115234375,
 0.07177734375,
 -0.03973388671875,
 0.0247039794921875,
 -0.0963134765625,
 -0.013763427734375,
 -0.01684570312
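
The embedding is just a list of floats; it becomes useful when two such vectors are compared. A common choice is cosine similarity, sketched below in plain Python. The three-dimensional vectors are made up for illustration and are not model output.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, made up for illustration (not real model output).
query_vec = [1.0, 0.0, 1.0]
close_vec = [2.0, 0.0, 2.0]  # same direction as query_vec
far_vec = [0.0, 1.0, 0.0]    # orthogonal to query_vec

print(cosine_similarity(query_vec, close_vec))  # close to 1
print(cosine_similarity(query_vec, far_vec))    # 0 for orthogonal vectors
```

With real embeddings, the same function could be applied to the results of two `embedding_model.embed_query(...)` calls to compare two sentences.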

Let's see an example of adding a research paper to the knowledge base so that questions that require information from the paper can be answered.

Before the paper is added to the knowledge base, the large language model cannot reliably answer questions that are specific to it, such as who its authors are:

In [11]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", (
        # Adjacent string literals are concatenated, so each sentence ends with a space.
        "You are a helpful and friendly AI! "
        "Your responses should be concise and no longer than two sentences. "
        "Do not hallucinate. Say you don't know if you don't have this information."
    )),
    ("user", "{question}")
])

chain = prompt | llm | StrOutputParser()

In [12]:
# You can replace this question with any question specific to the research paper you want to learn more about
print(chain.invoke("Who are the authors of the paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks?"))

The authors of the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" are Sheng Zhang, Xing Niu, and Eric Nyberg.


The answer is not correct. The large language model generated it from its own parameters alone, without access to the paper, and those evidently do not hold reliable knowledge of it. The model still handles general questions well, for example:

In [13]:
print(chain.invoke({"question": "What is a large language model"}))

A large language model is a type of artificial intelligence (AI) that uses complex algorithms and massive amounts of data to understand and generate human-like language. It's a computer program that can process and respond to natural language inputs, such as text or speech, with a high degree of accuracy and fluency.


To answer questions specific to this paper, we can build a knowledge base from it: split the pages into chunks, embed the chunks, and store them in a vector store:

In [14]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ";", ",", " ", ""],
)

chunks = text_splitter.split_documents(pages)
len(chunks)

79
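
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch of a fixed-width splitter with overlap. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter also tries to break at the listed separators, so its chunks usually differ. The function name `split_with_overlap` is hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role `chunk_overlap=100` plays above: text near a chunk boundary is still seen with some of its surrounding context.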

In [15]:
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(chunks, embedding=embedding_model)
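
Conceptually, the vector store keeps one embedding per chunk and, at query time, returns the chunks whose embeddings lie nearest to the query embedding. The brute-force sketch below illustrates that idea, assuming Euclidean (L2) distance; FAISS performs the same kind of search far more efficiently. The two-dimensional vectors and the `top_k` helper are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(
    query_vec: Sequence[float],
    index: Sequence[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in index]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means more similar
    return scored[:k]


# Toy index: (chunk text, embedding) pairs with made-up 2-D vectors.
toy_index = [
    ("chunk about retrieval", [0.9, 0.1]),
    ("chunk about training", [0.2, 0.8]),
    ("chunk about evaluation", [0.5, 0.5]),
]

results = top_k([1.0, 0.0], toy_index, k=2)
for text, distance in results:
    print(f"{distance:.3f}  {text}")
```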

In [16]:
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_messages([
    ("system",
        # Adjacent string literals are concatenated, so each sentence ends with a space.
        "You are a helpful and friendly AI! "
        "Your responses should be concise and no longer than two sentences. "
        "Do not hallucinate. Say you don't know if you don't have this information."
        # "Answer the question using only the context"
        "\n\nQuestion: {question}\n\nContext: {context}"
    ),
    ("user", "{question}")
])

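chain = (
    {
        # The retriever receives the question and returns the most similar chunks.
        'context': vector_store.as_retriever(),
        # The question itself is passed through unchanged.
        'question': RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)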
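
In the chain above, the retriever returns a list of Document objects, and that list is substituted into `{context}` as it is, so the prompt appears to receive the objects' string representation, metadata included. A common alternative is to join only the chunk text before it reaches the prompt. The sketch below uses a small stand-in class, `Doc`, which is hypothetical and only mimics the two Document fields used here.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (only the fields used here)."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def format_docs(docs: Iterable[Doc]) -> str:
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [Doc("First chunk.", {"page": 0}), Doc("Second chunk.", {"page": 1})]
context = format_docs(retrieved)
print(context)
```

In the notebook, the same function could be piped after the retriever, for example `'context': vector_store.as_retriever() | format_docs`, since it only relies on the `page_content` attribute.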

In [17]:
# You can replace this question with any question specific to the research paper you want to learn more about
print(chain.invoke("Who are the authors of the paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks?"))

The authors of the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" are Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela.


Now the answer is correct: the retriever found the relevant chunks of the paper in the knowledge base, and the LLM used them as context. Let's ask more questions about the paper:

In [18]:
# You can replace this question with any question specific to the research paper you want to learn more about
print(chain.invoke("Why is Retrieval-Augmented Generation better compared to previous technique for NLP Tasks?"))

Retrieval-Augmented Generation (RAG) is better compared to previous techniques for NLP tasks because it combines pre-trained parametric and non-parametric memory, allowing for more accurate and diverse knowledge retrieval and generation. This approach outperforms previous methods by leveraging the strengths of both parametric and non-parametric memory.


In [19]:
# You can replace this question with any question specific to the research paper you want to learn more about
print(chain.invoke("How was the experiment done in the paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks?"))

The experiment in the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" was done by fine-tuning a pre-trained sequence-to-sequence transformer model with a non-parametric memory, which is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. The model was trained end-to-end to generate output based on the input and the retrieved documents.


In [20]:
# You can replace this question with any question specific to the research paper you want to learn more about
print(chain.invoke("What is the quantitative experiment result in the paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks?"))

The paper reports that the RAG models achieve results within 4.3% of state-of-the-art pipeline models on FEVER fact verification.


Finally, let's see how the reranking model works:

In [21]:
# Retrieve K relevant results to the query
query = "Why is Retrieval-Augmented Generation better compared to previous technique for NLP Tasks?"
retrieved_docs = vector_store.similarity_search(query, k=5, fetch_k=5)
len(retrieved_docs)
#for doc in retrieved_docs:
#    print(doc.page_content)
#    print(doc.metadata)

5

After retrieving the K results for the query, we let the reranker compute a relevance score for each passage against the query and order the passages so that the highest-scoring one comes first:

In [22]:
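from langchain_nvidia_ai_endpoints import NVIDIARerank
from langchain_core.documents import Document

# Note: this overwrites the retrieval query from the previous cell with a different
# question; remove this line to rerank against the original retrieval query.
query = "What is task decomposition for LLM agents?"

# Plain text of each retrieved chunk
passages = [doc.page_content for doc in retrieved_docs]

client = NVIDIARerank(
    model="nvidia/nv-rerankqa-mistral-4b-v3",
    # top_n=3
)

# Score every passage against the query; results carry a relevance_score in metadata
response = client.compress_documents(
    query=query,
    documents=[Document(page_content=passage) for passage in passages]
)

response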

[Document(metadata={'relevance_score': -9.515625}, page_content='Retrieval-Augmented Generation for\nKnowledge-Intensive NLP Tasks\nPatrick Lewis†‡, Ethan Perez⋆,\nAleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,\nMike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†\n†Facebook AI Research;‡University College London;⋆New York University;\nplewis@fb.com\nAbstract\nLarge pre-trained language models have been shown to store factual knowledge\nin their parameters, and achieve state-of-the-art results when ﬁne-tuned on down-\nstream NLP tasks. However, their ability to access and precisely manipulate knowl-\nedge is still limited, and hence on knowledge-intensive tasks, their performance\nlags behind task-speciﬁc architectures. Additionally, providing provenance for their\ndecisions and updating their world knowledge remain open research problems. Pre-\ntrained models with a differentiable access mechanism to explicit non-par

In [23]:
print(f"Most relevant: {response[0].page_content}\n")

Most relevant: Retrieval-Augmented Generation for
Knowledge-Intensive NLP Tasks
Patrick Lewis†‡, Ethan Perez⋆,
Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,
Mike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†
†Facebook AI Research;‡University College London;⋆New York University;
plewis@fb.com
Abstract
Large pre-trained language models have been shown to store factual knowledge
in their parameters, and achieve state-of-the-art results when ﬁne-tuned on down-
stream NLP tasks. However, their ability to access and precisely manipulate knowl-
edge is still limited, and hence on knowledge-intensive tasks, their performance
lags behind task-speciﬁc architectures. Additionally, providing provenance for their
decisions and updating their world knowledge remain open research problems. Pre-
trained models with a differentiable access mechanism to explicit non-parametric
memory have so far been only investigated for extractive

In [24]:
print(f"Least relevant: {response[-1].page_content}\n")

Least relevant: memory by editing the document index. This approach has also been used in knowledge-intensive
dialog, where generators have been conditioned on retrieved text directly, albeit obtained via TF-IDF
rather than end-to-end learnt retrieval [9].
Retrieve-and-Edit approaches Our method shares some similarities with retrieve-and-edit style
approaches, where a similar training input-output pair is retrieved for a given input, and then edited
to provide a ﬁnal output. These approaches have proved successful in a number of domains including
Machine Translation [ 18,22] and Semantic Parsing [ 21]. Our approach does have several differences,
including less of emphasis on lightly editing a retrieved item, but on aggregating content from several
pieces of retrieved content, as well as learning latent retrieval, and retrieving evidence documents
rather than related training pairs. This said, RAG techniques may work well in these settings, and
could represent promising future work.
6 D
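
The reranking step above boils down to three operations: score each (query, passage) pair, sort by score in descending order, and optionally keep only the top `top_n`. The sketch below shows that flow with a toy scorer based on word overlap. The real reranker is a trained model with a very different scoring function; `rerank` and `token_overlap` are hypothetical helpers used only for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def token_overlap(query: str, passage: str) -> float:
    """Toy scorer: fraction of query words that also occur in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_n: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Score every passage against the query and return them best first."""
    scored = [(passage, score_fn(query, passage)) for passage in passages]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored if top_n is None else scored[:top_n]


passages = [
    "retrieve-and-edit approaches for machine translation",
    "rag combines parametric and non-parametric memory",
    "dense vector index of wikipedia",
]
ranked = rerank("what memory does rag use", passages, token_overlap, top_n=2)
for passage, score in ranked:
    print(f"{score:.2f}  {passage}")
```

As in the cells above, the first element of the result is the most relevant passage, and the last element is the least relevant among those kept.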

## References

NVIDIA: https://build.nvidia.com/explore/discover

LangChain: https://python.langchain.com/