# Dense Passage Retrieval


## Import libraries

In [1]:
!pip install faiss-gpu

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


In [13]:
import re
import requests
import nltk
import numpy as np
import faiss
import torch

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer, DPRQuestionEncoder, DPRQuestionEncoderTokenizer
from torch.utils.data import DataLoader, TensorDataset

# nltk.download("punkt")
# nltk.download("stopwords")

## Data
The corpus is six Project Gutenberg ebooks, listed by URL:

In [3]:
ebook_urls = [
    "https://www.gutenberg.org/cache/epub/56640/pg56640.txt",
    "https://www.gutenberg.org/cache/epub/67813/pg67813.txt",
    "https://www.gutenberg.org/cache/epub/20772/pg20772.txt",
    "https://www.gutenberg.org/cache/epub/40190/pg40190.txt",
    "https://www.gutenberg.org/cache/epub/4924/pg4924.txt",
    "https://www.gutenberg.org/cache/epub/4525/pg4525.txt"
]

Download each ebook and join them into a single string

In [4]:
text = " ".join([requests.get(url).text for url in ebook_urls])
print(f"Raw text length: {len(text)}")

Raw text length: 3131512


Convert all text to lower case

In [5]:
text = text.lower()

Replace every character other than letters, digits, whitespace, and basic punctuation (`,` `.` `!` `?`) with a space, then collapse runs of whitespace

In [6]:
text = re.sub(r"[^a-zA-Z0-9\s,\.!?]", " ", text)
text = re.sub(r"\s+", " ", text).strip()
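
To make the effect of the two substitutions concrete, here is a small sketch on a made-up sample string. The `clean` helper and the sample are assumptions for illustration and do not come from the ebooks; the regular expressions are the ones used above.

```python
import re

def clean(raw):
    """Lower-case, replace disallowed characters with spaces, collapse whitespace."""
    lowered = raw.lower()
    # Anything that is not a letter, digit, whitespace, or , . ! ? becomes a space
    kept = re.sub(r"[^a-zA-Z0-9\s,\.!?]", " ", lowered)
    # Runs of spaces, tabs, and newlines collapse to a single space
    return re.sub(r"\s+", " ", kept).strip()

sample = "Soil-testing (1912):\r\n  \"Nitrogen\" is costly!"
cleaned = clean(sample)
print(cleaned)  # soil testing 1912 nitrogen is costly!
```

Note that hyphens and apostrophes are also replaced, so a contraction such as `don't` becomes `don t`.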

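Define a helper that removes stopwords and stems the remaining tokens. It needs the NLTK `punkt` and `stopwords` resources (see the commented-out downloads above)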

In [7]:
def preprocess_for_embeddings(input_text):
    """
    Remove English stopwords and stem the remaining tokens.
    Return the preprocessed text as a single space-joined string.
    """
    stemmer = PorterStemmer()
    sw = set(stopwords.words("english"))
    tokens = word_tokenize(input_text)
    
    preproc_tokens = [stemmer.stem(t) for t in tokens if t not in sw]
    preproc_text = " ".join(preproc_tokens)
    
    return preproc_text

## Chunking
Chunk text into passages. DPR works best with short passages (~100 words each).

In [8]:
def chunk_text(text, chunk_size=100):
    words = text.split()
    chunks = [" ".join(words[i : i + chunk_size]) for i in range(0, len(words), chunk_size)]
    return chunks

passages = chunk_text(text, chunk_size=100)
print(f"Total passages created: {len(passages)}")

Total passages created: 5289
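
As a quick check of how the chunker behaves at the end of the text, the sketch below runs the same function on a toy input of 250 placeholder words (the toy input is an assumption for illustration). The final chunk simply holds whatever words are left over, so the number of chunks is the word count divided by `chunk_size`, rounded up.

```python
import math

def chunk_text(text, chunk_size=100):
    words = text.split()
    return [" ".join(words[i : i + chunk_size]) for i in range(0, len(words), chunk_size)]

# 250 placeholder words: w0 w1 ... w249
toy_text = " ".join(f"w{i}" for i in range(250))
toy_chunks = chunk_text(toy_text, chunk_size=100)
chunk_lengths = [len(c.split()) for c in toy_chunks]

print(chunk_lengths)  # [100, 100, 50]
assert len(toy_chunks) == math.ceil(250 / 100)
```

By the same arithmetic, 5289 passages of up to 100 words means the cleaned corpus holds between 528,801 and 528,900 whitespace-separated words.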


## DPR encoding
Load DPR Context Encoder


In [9]:
device = "cuda" if torch.cuda.is_available() else "cpu"
context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base").to(device)
context_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
print(f"Model on device: {device}")

Some weights of the model checkpoint at facebook/dpr-ctx_encoder-single-nq-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizer'.


Model on device: cuda


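Tokenize the passages. As the warning below notes, `truncation=True` has no effect without an explicit `max_length`, so no passage is truncated in this call; passing `max_length=512` would enforce the input limit of the underlying BERT model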

In [10]:
context_inputs = context_tokenizer(passages, padding=True, truncation=True, return_tensors="pt").to(device)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Create dataloader

In [14]:
batch_size = 16
dataset = TensorDataset(context_inputs["input_ids"], context_inputs["attention_mask"])
dataloader = DataLoader(dataset, batch_size=batch_size)

Compute embeddings

In [17]:
context_embeddings_list = []
with torch.no_grad():
    for idx, batch in enumerate(dataloader, start=1):
        if idx % 25 == 0:
            print(f"({idx}/{len(dataloader)}) embedded")
        
        input_ids, attention_mask = batch
        input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)
        batch_embeddings = context_encoder(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        context_embeddings_list.append(batch_embeddings.cpu())

(25/331) embedded
(50/331) embedded
(75/331) embedded
(100/331) embedded
(125/331) embedded
(150/331) embedded
(175/331) embedded
(200/331) embedded
(225/331) embedded
(250/331) embedded
(275/331) embedded
(300/331) embedded
(325/331) embedded
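
The batch count in the progress output follows directly from the passage count and the batch size. A short sketch of the arithmetic, using the numbers reported above:

```python
import math

num_passages = 5289   # from the chunking step
batch_size = 16       # from the dataloader cell

# DataLoader keeps the final partial batch by default, so round up
num_batches = math.ceil(num_passages / batch_size)
last_batch_size = num_passages - (num_batches - 1) * batch_size

print(num_batches, last_batch_size)  # 331 9
```

The last progress line is therefore 325, the largest multiple of 25 that does not exceed 331.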


Convert embeddings to NumPy for FAISS

In [18]:
context_embeddings_np = torch.cat(context_embeddings_list, dim=0).numpy()

## Saving the embeddings for Fast Retrieval
Save embeddings and passages

In [19]:
np.save("embeddings.npy", context_embeddings_np)
with open("passages.txt", "w") as f:
    for passage in passages: f.write(passage + "\n")
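
The one-passage-per-line file format relies on passages containing no newline characters, which holds here because the cleaning step collapsed all whitespace into single spaces. The sketch below illustrates the round trip on toy data; `save_passages` and `load_passages` are hypothetical helper names introduced only for this example.

```python
import os
import tempfile

def save_passages(path, passages):
    """Write one passage per line."""
    with open(path, "w", encoding="utf-8") as f:
        for passage in passages:
            f.write(passage + "\n")

def load_passages(path):
    """Read the file back, one passage per line."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f]

toy_passages = ["first passage about soil.", "second passage about clover."]
broken_passages = ["line one\nline two"]  # an embedded newline splits the passage

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "passages.txt")
    save_passages(path, toy_passages)
    restored = load_passages(path)
    save_passages(path, broken_passages)
    restored_broken = load_passages(path)

print(restored == toy_passages)  # True
print(len(restored_broken))      # 2
```

If a passage did contain a newline, the reloaded list would be longer than the embedding matrix and the row indices returned by the index would no longer line up with the passages.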

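Store the embeddings in a FAISS `IndexFlatIP` index, which performs exact (brute-force) inner-product search, and write the index to disk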

In [21]:
index = faiss.IndexFlatIP(context_embeddings_np.shape[1])
index.add(context_embeddings_np)
faiss.write_index(index, "faiss_index.bin")
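
For intuition, exact inner-product search amounts to scoring the query against every stored vector and keeping the highest scores. The NumPy sketch below illustrates that idea on toy vectors; `inner_product_search` is a hypothetical helper written for this example and is not how FAISS is implemented internally.

```python
import numpy as np

def inner_product_search(vectors, queries, k):
    """Return (scores, indices) of the k largest inner products per query."""
    scores = queries @ vectors.T                      # shape: (n_queries, n_vectors)
    order = np.argsort(-scores, axis=1)[:, :k]        # best-first indices
    top_scores = np.take_along_axis(scores, order, axis=1)
    return top_scores, order

toy_vectors = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype=np.float32)
toy_query = np.array([[1.0, 1.0]], dtype=np.float32)

toy_D, toy_I = inner_product_search(toy_vectors, toy_query, k=2)
print(toy_I[0].tolist(), toy_D[0].tolist())  # [2, 1] [6.0, 2.0]
```

Because the inner product is not normalized, vectors with a larger norm tend to score higher, and the scores are not bounded to a fixed range the way cosine similarity is.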

## Load Precomputed Embeddings & Search Faster
Load the precomputed embeddings and FAISS index.

In [22]:
context_embeddings_np = np.load("embeddings.npy")
with open("passages.txt", "r") as f: passages = [line.strip() for line in f]
index = faiss.read_index("faiss_index.bin")



## Encode Query and Retrieve Relevant Passages
Load DPR Question Encoder

In [23]:
question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

Some weights of the model checkpoint at facebook/dpr-question_encoder-single-nq-base were not used when initializing DPRQuestionEncoder: ['question_encoder.bert_model.pooler.dense.bias', 'question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRQuestionEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRQuestionEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Define a function that retrieves the `top_k` best-matching passages for a query

In [24]:
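def retrieve_best_passage(query, top_k=3):
    """Return the top_k (passage, score) pairs for the query, best first."""
    # The question encoder was left on the CPU, so the inputs stay there too
    query_inputs = question_tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        query_embedding = question_encoder(**query_inputs).pooler_output.cpu().numpy()

    # Search the FAISS index: D holds inner-product scores, I holds passage indices
    D, I = index.search(query_embedding, k=top_k)
    results = [(passages[I[0][i]], D[0][i]) for i in range(top_k)]

    return results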
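
One edge case worth keeping in mind: when an index holds fewer than `k` vectors, FAISS pads the returned indices with `-1`, and in Python `passages[-1]` would silently return the last passage. With 5289 passages and `top_k=3` this cannot occur here, but a defensive variant could look like the sketch below. `pair_results` and the toy arrays are assumptions introduced for illustration.

```python
def pair_results(passages, scores, indices):
    """Pair each returned index with its passage and score, skipping -1 padding."""
    return [
        (passages[i], float(s))
        for s, i in zip(scores, indices)
        if i != -1
    ]

toy_passages = ["about nitrogen", "about clover"]
toy_scores = [7.5, 3.2, 0.0]   # third score is a placeholder for the padded slot
toy_indices = [1, 0, -1]       # -1 marks a slot with no result

pairs = pair_results(toy_passages, toy_scores, toy_indices)
print(pairs)  # [('about clover', 7.5), ('about nitrogen', 3.2)]
```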

Example query

In [26]:
query = "What fertilizer usually contains?"
top_matches = retrieve_best_passage(query)

# Use `passage` rather than `text` so the corpus variable `text` is not overwritten
for i, (passage, score) in enumerate(top_matches, start=1):
    print(f"Match {i} (Score: {score:.4f}):\n{passage}")

Match 1 (Score: 79.1917):
a thin layer or fold of animal or vegetable matter. mildew a cobwebby growth of fungi on diseased or decaying things. mold see mildew. mulch a covering of straw, leaves, or like substances over the roots of plants to protect them from heat, drought, etc., and to preserve moisture. nectar a sweetish substance in blossoms of flowers from which bees make honey. nitrate a readily usable form of nitrogen. the most common nitrate is saltpeter. nitrogen a chemical element, one of the most important and most expensive plant foods. it exists in fertilizers, in ammonia, in nitrates, and in organic
Match 2 (Score: 78.4641):
or other legumes there is seldom need of using nitrogen in the fertilizer the tubercles on the pea or clover roots will furnish that. hence, as a rule, only potash and phosphoric acid will have to be purchased as plant food. the farmer is assisted always by a study of his crop and by a knowledge of how it grows. if he find the straw inferior and short