LangChain's `PyPDFLoader` loads a PDF file page by page. It requires the `pypdf` package to be installed.

## Load a PDF file

In [5]:
from langchain.document_loaders import PyPDFLoader

pdf_loader = PyPDFLoader("attention_is_all_you_need.pdf")
pdf_pages = pdf_loader.load()
print(f'total pages: {len(pdf_pages)}')

total pages: 15


The PDF file is loaded into a list of `Document` objects, one per page. Each `Document` has two fields: `page_content` and `metadata`.

In [6]:
print(pdf_pages[0].page_content[0:600])

Attention Is All You Need
Ashish Vaswani
Google Brain
avaswani@google.comNoam Shazeer
Google Brain
noam@google.comNiki Parmar
Google Research
nikip@google.comJakob Uszkoreit
Google Research
usz@google.com
Llion Jones
Google Research
llion@google.comAidan N. Gomezy
University of Toronto
aidan@cs.toronto.eduŁukasz Kaiser
Google Brain
lukaszkaiser@google.com
Illia Polosukhinz
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also con


In [9]:
pdf_pages[0].metadata

{'source': 'attention_is_all_you_need.pdf', 'page': 0}
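
The `page` value is 0 for the first page, so it appears to be zero-based. A minimal sketch of turning this metadata into a human-readable citation follows; `format_citation` is a hypothetical helper written for illustration, not part of LangChain.

```python
def format_citation(metadata: dict) -> str:
    """Build a display string from a loader-style metadata dict."""
    # 'page' is zero-based in the output above, so add 1 for display.
    return f"{metadata['source']}, p. {metadata['page'] + 1}"


# Metadata dict copied from the cell output above.
first_page_meta = {'source': 'attention_is_all_you_need.pdf', 'page': 0}
print(format_citation(first_page_meta))
```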

## Split the document

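LangChain recommends `RecursiveCharacterTextSplitter` for generic text. It tries the separators in order, moving to a finer one only when a piece is still longer than `chunk_size`. Consecutive chunks share up to `chunk_overlap` characters.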

In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150,
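    # Raw string avoids an invalid-escape warning for "\.".
    # Depending on the LangChain version, the lookbehind may only be treated
    # as a regex when is_separator_regex=True is also passed.
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]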
)

docs = text_splitter.split_documents(pdf_pages)
print(f'total documents: {len(docs)}')

total documents: 37
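
To make the splitting strategy concrete, here is a simplified pure-Python sketch, not LangChain's actual implementation. It assumes literal separators (no regex) and ignores `chunk_overlap`. `recursive_split` is a hypothetical name chosen for this example.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Simplified recursive splitting: literal separators, no overlap."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep == "" or sep not in text:
        # "" means "split anywhere"; an absent separator cannot help either.
        return recursive_split(text, rest, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits: keep merging
            continue
        if current:
            chunks.append(current)  # flush the chunk built so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer and will not fit in one chunk."
for chunk in recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=30):
    print(repr(chunk))
```

The first paragraph fits within the limit and stays whole. The second is too long, so it is split again on spaces. The real splitter additionally carries overlap between neighbouring chunks.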


In [11]:
print(docs[0].page_content)

Attention Is All You Need
Ashish Vaswani
Google Brain
avaswani@google.comNoam Shazeer
Google Brain
noam@google.comNiki Parmar
Google Research
nikip@google.comJakob Uszkoreit
Google Research
usz@google.com
Llion Jones
Google Research
llion@google.comAidan N. Gomezy
University of Toronto
aidan@cs.toronto.eduŁukasz Kaiser
Google Brain
lukaszkaiser@google.com
Illia Polosukhinz
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring signiﬁcantly
less time to train. Our model achiev

## Embed the docs and use Chroma for search

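Use a Hugging Face embedding model to turn each chunk into a vector, and store the vectors in Chroma. With `persist_directory=None`, the vector store is kept in memory only.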

In [14]:
import torch
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

emb_model = 'sentence-transformers/all-mpnet-base-v2'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

embedding = HuggingFaceEmbeddings(
    model_name=emb_model,
    model_kwargs={'device': device}
)

vectordb = Chroma.from_documents(
    documents=docs,
    embedding=embedding,
    persist_directory=None)

In [15]:
vectordb._collection.count()

37
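
Conceptually, a vector store ranks stored chunks by how close their embeddings are to the query embedding. The sketch below illustrates that idea with cosine similarity and made-up 3-dimensional vectors. It is an assumption-laden toy: Chroma's own distance metric and index may differ, and `top_k` and `toy_store` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy "embeddings"; a real embedding model produces
# much longer vectors.
toy_store = [
    ("attention", [1.0, 0.0, 0.0]),
    ("recurrence", [0.0, 1.0, 0.0]),
    ("self-attention", [0.9, 0.1, 0.0]),
]

for text, score in top_k([1.0, 0.0, 0.0], toy_store, k=2):
    print(f"{score:.3f}  {text}")
```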

## Similarity search with enforced diversity

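Use `max_marginal_relevance_search` to balance relevance and diversity. It first fetches the `fetch_k` chunks most similar to the query, then selects `k` of them so that the results are relevant to the query but not redundant with each other.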

In [39]:
q = 'scaled dot-product attention'

results = vectordb.max_marginal_relevance_search(q, k=4, fetch_k=6)
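
The selection step behind the call above can be sketched as a greedy loop. This is a minimal illustration of maximal marginal relevance on toy 2-dimensional vectors, assuming cosine similarity and a trade-off weight named `lambda_mult`. `mmr_select` and the candidate data are hypothetical and stand in for the `fetch_k` chunks already retrieved.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, candidates, k, lambda_mult=0.5):
    """Greedy MMR over (text, vector) candidates; returns the chosen texts."""
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def mmr_score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            # Redundancy: similarity to the closest already-selected item.
            redundancy = max((cosine(vec, s_vec) for _, s_vec in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [text for text, _ in selected]


# Hypothetical candidates: B is a near-duplicate of A, C points elsewhere.
query = [1.0, 0.0]
candidates = [
    ("A", [1.0, 0.1]),
    ("B", [1.0, 0.12]),
    ("C", [1.0, -0.8]),
]

print(mmr_select(query, candidates, k=2))                   # balances diversity
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # relevance only
```

With `lambda_mult=1.0` the redundancy term vanishes and the loop reduces to plain similarity ranking. Lower values penalise candidates that resemble results already chosen.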

In [61]:
print(results[0].page_content)

Scaled Dot-Product Attention
 Multi-Head Attention
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several
attention layers running in parallel.
3.2.1 Scaled Dot-Product Attention
We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of
queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the
query with all keys, divide each bypdk, and apply a softmax function to obtain the weights on the
values.
In practice, we compute the attention function on a set of queries simultaneously, packed together
into a matrix Q. The keys and values are also packed together into matrices KandV. We compute
the matrix of outputs as:
Attention(Q;K;V ) = softmax(QKT
pdk)V (1)
The two most commonly used attention functions are additive attention [ 2], and dot-product (multi-
plicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor
of1pdk. Addit