# Use LangChain to embed and search a PDF file

## Load a PDF file

LangChain provides `PyPDFLoader` for loading PDF documents. It requires the `pypdf` package to be installed (`pip install pypdf`).

In [1]:
from langchain.document_loaders import PyPDFLoader

In [2]:
pdf_loader = PyPDFLoader("attention_is_all_you_need.pdf")
pdf_pages = pdf_loader.load()
print(f'total pages: {len(pdf_pages)}')

total pages: 15


Each page is loaded as a `Document` with two fields: `page_content` (the extracted text) and `metadata` (the source file and the zero-based page number).

In [3]:
print(pdf_pages[0].page_content[0:600])

Attention Is All You Need
Ashish Vaswani
Google Brain
avaswani@google.comNoam Shazeer
Google Brain
noam@google.comNiki Parmar
Google Research
nikip@google.comJakob Uszkoreit
Google Research
usz@google.com
Llion Jones
Google Research
llion@google.comAidan N. Gomezy
University of Toronto
aidan@cs.toronto.eduŁukasz Kaiser
Google Brain
lukaszkaiser@google.com
Illia Polosukhinz
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also con


In [4]:
pdf_pages[0].metadata

{'source': 'attention_is_all_you_need.pdf', 'page': 0}
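
Because `page` is zero-based in the output above, a small helper can turn a chunk's metadata into a human-readable citation. This is only a minimal sketch: `describe_source` is a hypothetical helper, not part of LangChain, and it assumes the metadata carries the `source` and `page` keys shown above.

```python
def describe_source(metadata: dict) -> str:
    """Format loader metadata as 'file, page N' with a 1-based page number.

    Hypothetical helper: assumes the 'source' and 'page' keys shown above.
    """
    return f"{metadata['source']}, page {metadata['page'] + 1}"


print(describe_source({'source': 'attention_is_all_you_need.pdf', 'page': 0}))
```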

## Split the document

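LangChain recommends `RecursiveCharacterTextSplitter` for generic text. It tries the separators in order, falling back to the next one whenever a piece is still longer than `chunk_size`.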

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [6]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150,
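    # The third separator is a regex (split after ". "); a raw string avoids an
    # invalid-escape warning. Recent LangChain releases treat separators as
    # literal text unless is_separator_regex=True is also passed.
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]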
)

docs = text_splitter.split_documents(pdf_pages)
print(f'total documents: {len(docs)}')

total documents: 37
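
To make the splitting strategy more concrete, here is a simplified pure-Python sketch of the recursive idea: split on the coarsest separator, merge neighbouring pieces while they fit, and recurse with finer separators on any piece that is still too long. This is not LangChain's implementation. `recursive_split` is a hypothetical function, and it deliberately ignores `chunk_overlap` and regex separators.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text into chunks of at most chunk_size characters (simplified sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging while the chunk still fits
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee fff ggg"
print(recursive_split(sample, chunk_size=10, separators=["\n\n", "\n", " ", ""]))
# ['aaa bbb', 'ccc', 'ddd eee', 'fff ggg']
```

Both paragraphs of the sample exceed 10 characters, so each one is split again on spaces, and words are merged back together as long as the chunk stays within the limit.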


In [7]:
print(docs[0].page_content)

Attention Is All You Need
Ashish Vaswani
Google Brain
avaswani@google.comNoam Shazeer
Google Brain
noam@google.comNiki Parmar
Google Research
nikip@google.comJakob Uszkoreit
Google Research
usz@google.com
Llion Jones
Google Research
llion@google.comAidan N. Gomezy
University of Toronto
aidan@cs.toronto.eduŁukasz Kaiser
Google Brain
lukaszkaiser@google.com
Illia Polosukhinz
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring signiﬁcantly
less time to train. Our model achiev

## Embed the docs and use Chroma for search

In [9]:
import torch

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

In [10]:
emb_model = 'sentence-transformers/all-mpnet-base-v2'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

embedding = HuggingFaceEmbeddings(
    model_name=emb_model,
    model_kwargs={'device': device}
)

In [11]:
vectordb = Chroma.from_documents(
    documents=docs,
    embedding=embedding,
    persist_directory=None,  # None keeps the index in memory only
)

In [12]:
# `_collection` is a private attribute of the Chroma wrapper and may change between versions.
vectordb._collection.count()

37

## Similarity search with enforced diversity

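Use `max_marginal_relevance_search` to get results that are both relevant and diverse. It first fetches the `fetch_k` chunks most similar to the query, then picks `k` of them that balance relevance to the query against similarity to the chunks already selected.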

In [78]:
q = 'what is multi-head attention?'

results = vectordb.max_marginal_relevance_search(q, k=4, fetch_k=6)
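
The selection step can be sketched in plain Python. The code below is an illustration of the general maximal marginal relevance idea, not Chroma's or LangChain's code: `mmr` and `cosine` are hypothetical helpers, the vectors are made-up 2-D toy embeddings, and the weight is named `lambda_mult` here by assumption (1.0 means pure relevance, lower values favour diversity).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidate_vecs, k, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidate_vecs)))

    def score(i):
        relevance = cosine(query_vec, candidate_vecs[i])
        # Penalise candidates that resemble something already selected.
        redundancy = max(
            (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.9, 0.12], [0.7, -0.7]]  # first two are near-duplicates
print(mmr(query, candidates, k=2))                   # [0, 2]
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]
```

With the default weight, the near-duplicate second vector is skipped in favour of the less similar third one. With `lambda_mult=1.0` the selection degenerates to a plain top-`k` similarity search.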

In [81]:
results[0].page_content

'Attention Visualizations\nInput-Input Layer5\nIt\nis\nin\nthis\nspirit\nthat\na\nmajority\nof\nAmerican\ngovernments\nhave\npassed\nnew\nlaws\nsince\n2009\nmaking\nthe\nregistration\nor\nvoting\nprocess\nmore\ndifficult\n.\n<EOS>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\nIt\nis\nin\nthis\nspirit\nthat\na\nmajority\nof\nAmerican\ngovernments\nhave\npassed\nnew\nlaws\nsince\n2009\nmaking\nthe\nregistration\nor\nvoting\nprocess\nmore\ndifficult\n.\n<EOS>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\nFigure 3: An example of the attention mechanism following long-distance dependencies in the\nencoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of\nthe verb ‘making’, completing the phrase ‘making...more difﬁcult’. Attentions here shown only for\nthe word ‘making’. Different colors represent different heads. Best viewed in color.\n13'