# M1 Semantic Search Engine with Faiss and DistilBERT

## Objective

- Build a search engine using FAISS similarity search library and a pre-trained DistilBERT model from Transformers.


- On your search for an optimal document retrieval method in the CDC’s huge knowledge base you decide to implement a semantic search engine to overcome known limitations of statistical (TfIdf) full-text search. Its weaknesses stem from the fact that it relies on counting and matching words in a search query with documents in the database in the document. Even though modern full-text search engines do include, synonyms, for example, still there are many ways to express the same idea. You know that Transformers models excel at contextual learning, so you decide to apply transfer learning with pre-trained BERT models to see if you can make your search engine smarter.

## Create a new Jupyter Notebook and load all relevant Python libraries.

In [4]:
import json
from pprint import pprint
import faiss
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

## Open the provided JSON file called sentences.json. It contains a list of strings (sentences.)

In [2]:
# Load the documents
with open('data/sentences.json', 'r') as file:
    documents = json.load(file)
file.close()

## Use AutoTokenizer and AutoModel classes from Transformers library to load a pre-trained model from Transformers, along with the appropriate tokenizer.

In [3]:
# Load the a BERT model and a tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Create an empty inverted index with FAISS.

In [22]:
# Create a flat Faiss index
index = faiss.IndexIDMap(faiss.IndexFlatIP(768)) # the size of our vector space

## Write an encoder function that inputs a string and outputs a dense PyTorch tensor.

In [52]:
# Build a function that uses a BERT model to vectorize the texts
def encode(document):
    # Encode the documents and return vectors
    tokens = tokenizer(document, return_tensors='pt')
    vector = model(**tokens)[0].detach().squeeze()
    return torch.mean(vector, dim=0)

## Build a list of modeled vector representations for each document with a reusable encoder function you created in step 5.

In [27]:
# vectorize the documents
vectors = [encode(d) for d in documents]

In [28]:
[v.size() for v in vectors]

[torch.Size([768]),
 torch.Size([768]),
 torch.Size([768]),
 torch.Size([768]),
 torch.Size([768]),
 torch.Size([768]),
 torch.Size([768]),
 torch.Size([768]),
 torch.Size([768]),
 torch.Size([768]),
 torch.Size([768])]

## Populate the empty FAISS index with the output vectors.

In [31]:
# Add the document vectors into the index. They need to be transformed into numpy arrays first
index.add_with_ids(
    np.array([v.numpy() for v in vectors]),
    # the IDs will be 0 to len(documents)
    np.array(range(0, len(documents)))
)

## Build a search function that accepts a string query, encodes it, searches similar documents in the index, and returns top 5 results with their top_k scores.

In [29]:
# Build a function to search the index and return scored results
def search(query, k=5):
    # Search the index and return top scored results
    encoded_query = encode(query).unsqueeze(dim=0).numpy()
    top_k = index.search(encoded_query, k)
    scores = top_k[0][0]
    results = [documents[_id] for _id in top_k[1][0]]
    results = list(zip(results, scores))
    return results

## Test your search engine by asking some questions. Check out the attached questions.json for a few suggested questions to start with, but feel free to play around and search for anything you want!

In [30]:
pprint(search("spanish flu casualties", k=2))

[('The Spanish flu, also known as the 1918 flu pandemic, was an unusually '
  'deadly influenza pandemic caused by the H1N1 influenza A virus.',
  51.06952),
 ('As of 2018, approximately 37.9 million people are infected with HIV '
  'globally.',
  45.203133)]


In [32]:
questions = ["How many people have died during Black Death?", "Which diseases can be transmitted by animals?", "Connection between climate change and a likelihood of a pandemic", "What is an example of a latent virus", "Viruses in nanotechnology", "Giant viruses classification", "What are the notable pandemic prevention organizations?", "How many leprosy outbreaks are known to happen?", "What are the geographic areas with the highest transmission of malaria?", "How to prevent the spread of viral infections?"]

In [35]:
for question in questions:
    pprint(question)
    pprint(search(question, k=1))

'How many people have died during Black Death?'
[('As of 2018, approximately 37.9 million people are infected with HIV '
  'globally.',
  52.61343)]
'Which diseases can be transmitted by animals?'
[('A pandemic is an epidemic of an infectious disease that has spread across a '
  'large region, for instance multiple continents or worldwide, affecting a '
  'substantial number of people.',
  54.049507)]
'Connection between climate change and a likelihood of a pandemic'
[('A pandemic is an epidemic of an infectious disease that has spread across a '
  'large region, for instance multiple continents or worldwide, affecting a '
  'substantial number of people.',
  60.54062)]
'What is an example of a latent virus'
[('A pandemic is an epidemic of an infectious disease that has spread across a '
  'large region, for instance multiple continents or worldwide, affecting a '
  'substantial number of people.',
  59.449497)]
'Viruses in nanotechnology'
[('Current pandemics include COVID-19 (SARS-Co

# M2 Searching Long Documents with Sentence Transformers

## Objective

- Implement a search engine using sentence-transformers and FAISS.


- In this milestone, instead of a base BERT model like we did previously, we will use Sentence BERT (SBERT). SBERT was developed to tackle similarity search and unsupervised clustering problems, for which classic BERT is not a good candidate.


- SBERT outputs fix embeddings for entire sentences and paragraphs instead of tokens and allow to drastically reduce computation time while keeping exceptional accuracy.


- You have successfully implemented your first prototype of a semantic similarity search engine using FAISS, a library for similarity search, and DistilBERT in Milestone 1 of this project. However, you are asking yourself what to do about the fact that most documents in the CDC’s database are much longer than the limit that this model can work with. Luckily, you’ve found out that there is another library called sentence-transformers that just might work for you! The models in this library have been developed and trained by the leading NLP researchers to extract meaningful embeddings for longer texts. Let’s test it out!

## Import the following libraries into a Jupyter or Colab Notebook:

- JSON
- FAISS
- sentence-transformers
- PyTorch
- NumPy

In [7]:
!cuda

True


In [6]:
from sentence_transformers import SentenceTransformer, util

In [8]:
embedder = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

## Load and open the provided data.json file.

In [9]:
# Load the documents
with open('data/data.json', 'r') as file:
    documents = json.load(file)
file.close()

In [10]:
documents[0]['text']

'A pandemic (from Greek πᾶν, pan, "all" and δῆμος, demos, "people") is an epidemic of an infectious disease that has spread across a large region, for instance multiple continents or worldwide, affecting a substantial number of people. A widespread endemic disease with a stable number of infected people is not a pandemic. Widespread endemic diseases with a stable number of infected people such as recurrences of seasonal influenza are generally excluded as they occur simultaneously in large regions of the globe rather than being spread worldwide.\nThroughout human history, there have been a number of pandemics of diseases such as smallpox and tuberculosis. The most fatal pandemic in recorded history was the Black Death (also known as The Plague), which killed an estimated 75–200 million people in the 14th century. The term was not used yet but was for later pandemics including the 1918 influenza pandemic (Spanish flu). Current pandemics include COVID-19 (SARS-CoV-2) and HIV/AIDS.'

## Compute the sentence embeddings for documents in the data.json file.

In [11]:
corpus = [d['text'] for d in documents]
corpus_embeddings = embedder.encode(corpus)

In [12]:
len(corpus)

26

In [13]:
np.array(corpus_embeddings).shape[1]

768

## Create an empty FAISS index and with the new documents exactly the same way as you did in M1.

In [14]:
embeddings = np.array(corpus_embeddings)

In [15]:
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))

In [63]:
# vectors = [encode(d) for d in corpus]
# RuntimeError: The size of tensor a (668) must match the size of tensor b (512) at non-singleton dimension 1

In [16]:
index.add_with_ids(
    embeddings,
    # the IDs will be 0 to len(documents)
    np.array(range(0, len(documents)))
)

## Just like in Milestone 1, build a search function to retrieve the top 5 most similar documents.

In [71]:
D, I = index.search(np.array([embeddings[20]]), k=10)

In [74]:
D, I

(array([[184.574   , 100.64257 ,  99.096565,  83.06283 ,  77.55721 ,
          75.874825,  75.54649 ,  75.020515,  74.691605,  74.33931 ]],
       dtype=float32),
 array([[20, 22, 10, 17, 21, 19, 25, 15,  0,  6]]))

In [17]:
# Build a search function that finds the most relevant search results
def search(query, corpus=corpus, k=5):
    # Compute query embeddings
    vector = embedder.encode(list(query))
    # Search top results in the Faiss index
    D, I = index.search(np.array(vector).astype("float32"), k=k)
    results = [corpus[i] for i in I[0]]
    return results

## Test the search function with a few queries. Use the provided questions.json for a few ideas.

In [None]:
query = "how many people died from black death?"

In [18]:
# Print out the results
query = "How many people died during Black plague?"
results=search(query)

print('Top search results:')
for result in results:
    print(result)

Top search results:
Swine influenza is an infection caused by any one of several types of swine influenza viruses. Swine influenza virus (SIV) or swine-origin influenza virus (S-OIV) is any strain of the influenza family of viruses that is endemic in pigs. As of 2009, the known SIV strains include influenza C and the subtypes of  influenza A known as H1N1, H1N2, H2N1, H3N1, H3N2, and H2N3.
Swine influenza virus is common throughout pig populations worldwide. Transmission of the virus from pigs to humans is not common and does not always lead to human flu, often resulting only in the production of antibodies in the blood. If transmission does cause human flu, it is called zoonotic swine flu. People with regular exposure to pigs are at increased risk of swine flu infection.
Around the mid-20th century, identification of influenza subtypes became possible, allowing accurate diagnosis of transmission to humans. Since then, only 50 such transmissions have been confirmed. These strains of sw

## Write the FAISS index to file. It will be useful for the next, and final, milestone!

In [19]:
faiss.write_index(index, "data/index")