In [72]:
from langchain.document_loaders import UnstructuredPDFLoader # requires unstructured[local-inference]
from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter, SpacyTextSplitter
from IPython.display import Markdown, display

CHUNK_SIZE = 192

In [66]:
# Load the original RAG paper (arXiv 2005.11401) as a single document
loader = UnstructuredPDFLoader('2005.11401.pdf', mode='single')
loaded_doc = loader.load()

In [68]:
def display_docs(doc_set, limit=5):
    """Render the first `limit` chunks as Markdown, each under its own 'Chunk i' heading."""
    for i, doc in enumerate(doc_set):
        if i >= limit:
            break
        display(Markdown(f'#### Chunk {i}'))
        # display the chunk with a border
        display(Markdown(f'<div style="border: 1px solid orange; padding: 1px">{doc.page_content}</div>'))


In [73]:
# Naive chunking: cut every CHUNK_SIZE tokens, ignoring sentence and paragraph boundaries
simple_splitter = TokenTextSplitter.from_tiktoken_encoder("cl100k_base", chunk_size=CHUNK_SIZE, chunk_overlap=0)
docs = simple_splitter.transform_documents(loaded_doc)
display_docs(docs)

#### Chunk 0

<div style="border: 1px solid orange; padding: 1px">Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis†‡, Ethan Perez(cid:63),

1 2 0 2

Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,

Mike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†

r p A 2 1

†Facebook AI Research; ‡University College London; (cid:63)New York University; plewis@fb.com

] L C . s c [

4 v 1 0 4 1 1 . 5 0 0 2 : v i X r a

Abstract

Large pre-trained language models have been shown to store factual knowledge in their parameters,</div>

#### Chunk 1

<div style="border: 1px solid orange; padding: 1px"> and achieve state-of-the-art results when ﬁne-tuned on down- stream NLP tasks. However, their ability to access and precisely manipulate knowl- edge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-speciﬁc architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre- trained models with a differentiable access mechanism to explicit non-parametric memory have so far been only investigated for extractive downstream tasks. We explore a general-purpose ﬁne-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric mem- ory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We com- pare two R</div>

#### Chunk 2

<div style="border: 1px solid orange; padding: 1px">AG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token. We ﬁne-tune and evaluate our models on a wide range of knowledge- intensive NLP tasks and set the state of the art on three open domain QA tasks, outperforming parametric seq2seq models and task-speciﬁc retrieve-and-extract architectures. For language generation tasks, we ﬁnd that RAG models generate more speciﬁc, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.

1

Introduction

Pre-trained neural language models have been shown to learn a substantial amount of in-depth knowl- edge from data [47]. They can do so without any access to an external memory, as a parameterized implicit knowledge base [51, 52]. While this development is exciting, such models do have down- sides: They</div>

#### Chunk 3

<div style="border: 1px solid orange; padding: 1px"> cannot easily expand or revise their memory, can’t straightforwardly provide insight into their predictions, and may produce “hallucinations” [38]. Hybrid models that combine parametric memory with non-parametric (i.e., retrieval-based) memories [20, 26, 48] can address some of these issues because knowledge can be directly revised and expanded, and accessed knowledge can be inspected and interpreted. REALM [20] and ORQA [31], two recently introduced models that combine masked language models [8] with a differentiable retriever, have shown promising results,

supports	(y)Question GenerationFact Veriﬁcation:Label GenerationDocumentIndex

End-to-End Backprop through q and pθBarack	Obama	wasborn	in	Hawaii.(x)

Fact Veriﬁcation: Fact Query

Margin-alize

The	DivineComedy	(x)

pθGenerator pθ(Parametric)

This	</div>

#### Chunk 4

<div style="border: 1px solid orange; padding: 1px">14th	century	workis	divided	into	3sections:	"Inferno","Purgatorio"	&"Paradiso"									(y)

The	middle	ear	includesthe	tympanic	cavity	andthe	three	ossicles.		(y)Question Answering:Answer GenerationRetriever pη(Non-Parametric)z4z3z2z1d(z)Jeopardy QuestionGeneration:Answer Query

qQueryEncoder

q(x)

MIPS

Define	"middle	ear"(x)Question Answering:Question Query

Figure 1: Overview of our approach. We combine a pre-trained retriever (Query Encoder + Document Index) with a pre-trained seq2seq model (Generator) and ﬁne-tune end-to-end. For query x, we use Maximum Inner Product Search (MIPS) to ﬁnd the top-K documents zi. For</div>
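
The break between Chunk 1 and Chunk 2 above ("R" / "AG") falls inside a word because a fixed token window has no notion of word or sentence boundaries. A minimal sketch of that behavior follows. It is not LangChain's implementation, and the subword pieces are hypothetical stand-ins for what a real BPE tokenizer would produce.

```python
def fixed_window_chunks(tokens, chunk_size):
    """Group tokens into consecutive windows of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Hypothetical subword pieces; a real tokenizer decides these boundaries.
pieces = ["We", " compare", " two", " R", "AG", " formulations"]

# Decode each window back to text by joining its pieces.
chunks = ["".join(window) for window in fixed_window_chunks(pieces, 4)]
print(chunks)
```

With a window of 4 pieces, the boundary lands between " R" and "AG", so the acronym is split across two chunks just as in the output above.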

In [74]:
# Recursive chunking: split on paragraph, line, then word separators, keeping each chunk within CHUNK_SIZE tokens
contextual_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder("cl100k_base", chunk_size=CHUNK_SIZE, chunk_overlap=0)
docs = contextual_splitter.transform_documents(loaded_doc)
display_docs(docs)

#### Chunk 0

<div style="border: 1px solid orange; padding: 1px">Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis†‡, Ethan Perez(cid:63),

1 2 0 2

Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,

Mike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†

r p A 2 1

†Facebook AI Research; ‡University College London; (cid:63)New York University; plewis@fb.com

] L C . s c [

4 v 1 0 4 1 1 . 5 0 0 2 : v i X r a

Abstract</div>

#### Chunk 1

<div style="border: 1px solid orange; padding: 1px">Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when ﬁne-tuned on down- stream NLP tasks. However, their ability to access and precisely manipulate knowl- edge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-speciﬁc architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre- trained models with a differentiable access mechanism to explicit non-parametric memory have so far been only investigated for extractive downstream tasks. We explore a general-purpose ﬁne-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric mem- ory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of</div>

#### Chunk 2

<div style="border: 1px solid orange; padding: 1px">Wikipedia, accessed with a pre-trained neural retriever. We com- pare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token. We ﬁne-tune and evaluate our models on a wide range of knowledge- intensive NLP tasks and set the state of the art on three open domain QA tasks, outperforming parametric seq2seq models and task-speciﬁc retrieve-and-extract architectures. For language generation tasks, we ﬁnd that RAG models generate more speciﬁc, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.</div>

#### Chunk 3

<div style="border: 1px solid orange; padding: 1px">1

Introduction

Pre-trained neural language models have been shown to learn a substantial amount of in-depth knowl- edge from data [47]. They can do so without any access to an external memory, as a parameterized implicit knowledge base [51, 52]. While this development is exciting, such models do have down- sides: They cannot easily expand or revise their memory, can’t straightforwardly provide insight into their predictions, and may produce “hallucinations” [38]. Hybrid models that combine parametric memory with non-parametric (i.e., retrieval-based) memories [20, 26, 48] can address some of these issues because knowledge can be directly revised and expanded, and accessed knowledge can be inspected and interpreted. REALM [20] and ORQA [31], two recently introduced models that combine masked language models [8] with a differentiable retriever, have shown promising results,</div>

#### Chunk 4

<div style="border: 1px solid orange; padding: 1px">supports	(y)Question GenerationFact Veriﬁcation:Label GenerationDocumentIndex

End-to-End Backprop through q and pθBarack	Obama	wasborn	in	Hawaii.(x)

Fact Veriﬁcation: Fact Query

Margin-alize

The	DivineComedy	(x)

pθGenerator pθ(Parametric)

This	14th	century	workis	divided	into	3sections:	"Inferno","Purgatorio"	&"Paradiso"									(y)

The	middle	ear	includesthe	tympanic	cavity	andthe	three	ossicles.		(y)Question Answering:Answer GenerationRetriever pη(Non-Parametric)z4z3z2z1d(z)Jeopardy QuestionGeneration:Answer Query

qQueryEncoder

q(x)

MIPS</div>
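
The recursive output above breaks at paragraph and word boundaries rather than mid-word. The general idea can be sketched as follows, under loud assumptions: this is a simplified illustration rather than LangChain's code, `recursive_split` is a hypothetical name, and length is measured in characters here, whereas the notebook measures it in tokens.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most max_len characters, trying coarse separators first."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate          # still fits: keep merging
            continue
        if current:
            chunks.append(current)       # flush what has been merged so far
        if len(part) <= max_len:
            current = part
        else:
            # The part alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Abstract\n\nLarge models store knowledge. They lag behind.\n\nIntroduction"
print(recursive_split(sample, 30))
```

In this sample the middle paragraph is too long for the 30-character budget, so it alone is re-split on spaces, while the short heading lines stay intact.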

In [75]:
# SpaCy chunking: split into sentences with spaCy (needs the en_core_web_sm model), then pack sentences into chunks
spacy_splitter = SpacyTextSplitter.from_tiktoken_encoder("cl100k_base", chunk_size=CHUNK_SIZE, chunk_overlap=0)
docs = spacy_splitter.transform_documents(loaded_doc)
display_docs(docs)

Created a chunk of size 233, which is longer than the specified 192


#### Chunk 0

<div style="border: 1px solid orange; padding: 1px">Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis†‡, Ethan Perez(cid:63),

1 2 0 2

Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,

Mike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†

r p

A 2 1

†Facebook AI Research; ‡University College London; (cid:63)New York University; plewis@fb.com

] L C .

s c

[

4 v 1 0 4 1 1 .

5 0 0

2

:

v

i</div>

#### Chunk 1

<div style="border: 1px solid orange; padding: 1px">X r a

Abstract

Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when ﬁne-tuned on down- stream NLP tasks.

However, their ability to access and precisely manipulate knowl- edge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-speciﬁc architectures.

Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems.

Pre- trained models with a differentiable access mechanism to explicit non-parametric memory have so far been only investigated for extractive downstream tasks.

We explore a general-purpose ﬁne-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric mem- ory for language generation.</div>

#### Chunk 2

<div style="border: 1px solid orange; padding: 1px">We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever.

We com- pare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token.

We ﬁne-tune and evaluate our models on a wide range of knowledge- intensive NLP tasks and set the state of the art on three open domain QA tasks, outperforming parametric seq2seq models and task-speciﬁc retrieve-and-extract architectures.

For language generation tasks, we ﬁnd that RAG models generate more speciﬁc, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.</div>

#### Chunk 3

<div style="border: 1px solid orange; padding: 1px">1

Introduction

Pre-trained neural language models have been shown to learn a substantial amount of in-depth knowl- edge from data [47].

They can do so without any access to an external memory, as a parameterized implicit knowledge base [51, 52].

While this development is exciting, such models do have down- sides: They cannot easily expand or revise their memory, can’t straightforwardly provide insight into their predictions, and may produce “hallucinations” [38].

Hybrid models that combine parametric memory with non-parametric (i.e., retrieval-based) memories [20, 26, 48] can address some of these issues because knowledge can be directly revised and expanded, and accessed knowledge can be inspected and interpreted.

REALM

[20] and ORQA</div>

#### Chunk 4

<div style="border: 1px solid orange; padding: 1px">[31], two recently introduced models that combine masked language models [8] with a differentiable retriever, have shown promising results,

supports	(y)Question GenerationFact Veriﬁcation:Label GenerationDocumentIndex

End-to-End Backprop through q and pθBarack	Obama	wasborn	in	Hawaii.(x)



Fact Veriﬁcation: Fact Query

Margin-alize

The	DivineComedy	(x)

pθGenerator pθ(Parametric)



This	14th	century	workis	divided	into	3sections:	"Inferno","Purgatorio"	&"Paradiso"									(y)



The	middle	ear	includesthe	tympanic	cavity	andthe	three	ossicles.</div>
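
The warning printed above ("Created a chunk of size 233, which is longer than the specified 192") is what one would expect from sentence-level packing: a sentence is never cut, so one that exceeds the budget becomes an oversized chunk on its own. The sketch below illustrates this under stated assumptions. It uses a naive regular expression instead of spaCy for sentence detection, counts characters rather than tokens, and `pack_sentences` is a hypothetical helper, not part of LangChain.

```python
import re


def pack_sentences(text, max_len):
    """Greedily pack whole sentences into chunks of at most max_len characters.

    A single sentence longer than max_len is kept intact as an oversized chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = sentence if not current else current + " " + sentence
        if len(candidate) <= max_len or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


sample = (
    "RAG combines two memories. It retrieves passages. "
    "This sentence is deliberately much longer than the budget allows. Done."
)
for chunk in pack_sentences(sample, 60):
    print(len(chunk), repr(chunk))
```

The first two sentences fit together within 60 characters, the third exceeds the budget and is emitted whole, and the last starts a new chunk.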