# Week 3: Embedding-Based Retrieval

### What we are building
The goal of embedding-based retrieval is to retrieve the top-k candidates for a query based on embedding similarity (or distance). A common application: given a query, sentence, or document, find the top-k most similar candidates. While this is usually solved with TF-IDF or other Information Retrieval (IR) approaches, it is becoming increasingly common in industry to use an embedding-based approach: encode the query and the documents as embeddings and use approximate nearest neighbor search to find the top-k candidates in real time.
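
To make the idea concrete, here is a minimal sketch of exact (brute-force) top-k retrieval by cosine similarity. The 2-d vectors and the names `cosine_similarity`, `top_k`, and `toy_docs` are made up for illustration; a real system would get its embeddings from a model and would typically replace the linear scan with an approximate nearest neighbor index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Exact search: score every document, then keep the k best."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 2-d embeddings, made up for illustration only.
toy_docs = {
    "invest_q": [0.9, 0.1],
    "diamond_q": [0.1, 0.9],
    "stocks_q": [0.8, 0.3],
}

toy_results = top_k([1.0, 0.0], toy_docs, k=2)
for doc_id, score in toy_results:
    print(doc_id, round(score, 4))
```

The query points along the first axis, so the two documents that lean the same way (`invest_q`, then `stocks_q`) come back first.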

We will build a system to find duplicate questions on Quora using a [dataset released by Quora](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs). A very common problem for forums/QA websites is trying to determine whether a question has already been asked before a user posts it.

We will continue to apply our learning philosophy of repetition as we build multiple models of increasing complexity in the following order:

1. Retrieval based on WordVectors
1. Using BERT
1. Using Sentence BERT
1. Using Cohere Sentence Embeddings

###  Evaluation
We will evaluate our models along the following metrics: 

1. Recall@k: the proportion of relevant items found in the top-k matches
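1. Mean Reciprocal Rank (MRR): the mean, over all queries, of the reciprocal rank (1/rank) of the first relevant item in the top-k matches. A query with no relevant item in the top-k contributes 0.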
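
As a small worked example of both metrics, the sketch below assumes each query has exactly one known relevant item (as with a duplicate pair), so Recall@k reduces to the fraction of queries whose relevant item shows up in the top-k. The function name `recall_and_mrr` and the toy rankings are hypothetical.

```python
def recall_and_mrr(ranked_ids_per_query, relevant_id_per_query):
    """Compute Recall@k and MRR when every query has one relevant item."""
    hits, reciprocal_ranks = [], []
    for ranked_ids, relevant_id in zip(ranked_ids_per_query, relevant_id_per_query):
        if relevant_id in ranked_ids:
            rank = ranked_ids.index(relevant_id) + 1  # ranks start at 1
            hits.append(1)
            reciprocal_ranks.append(1.0 / rank)
        else:
            hits.append(0)
            reciprocal_ranks.append(0.0)
    n = len(hits)
    return sum(hits) / n, sum(reciprocal_ranks) / n

# Three toy queries with their top-3 result ids (made-up data).
toy_rankings = [[7, 3, 9], [4, 8, 1], [5, 6, 2]]
toy_relevant = [7, 1, 0]  # found at rank 1, found at rank 3, not found

recall, mrr = recall_and_mrr(toy_rankings, toy_relevant)
print(f"Recall@3: {recall:.4f}, MRR: {mrr:.4f}")
```

Here two of three queries are hits, so Recall@3 is 2/3, and MRR is (1 + 1/3 + 0) / 3 = 4/9.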

### Instructions

1. We have provided scaffolding for all the boilerplate Faiss code to get to our baseline model. This covers downloading and parsing the dataset, and training code for the baseline model. **Make sure to read all the steps and internalize what is happening**.
1. In the second model, we will use BERT embeddings. **Does this improve accuracy?**
1. In the third model, we will use Sentence BERT and see whether it can boost our model's performance. **How do you think this model will perform?**
1. **Extension**: We have suggested a bunch of extensions to the project, so go crazy! Tweak any part of the pipeline and see if you can beat all the current models.

### Code Overview

- Dependencies: Install and import python dependencies
- Project
  - Dataset: Download the Quora dataset
  - Indexer: Function to manage and create a Faiss Index
  - Model 1: Word Vectors
  - Model 2: BERT
  - Model 3: Sentence BERT
  - Model 4: Cohere Sentence Embeddings
- Extensions


# Dependencies

✨ Now let's get started! To kick things off, as always, we will install some dependencies.

In [None]:
%%capture

#All of the commented out ones are for trying to run this with GPU, want to try again tomorrow.


# Install all the required dependencies for the project
!pip install --upgrade attrs
!pip install --upgrade pip setuptools
!pip install pytorch-lightning==1.6.5
!pip install spacy
!pip install -U 'spacy[cuda-autodetect]'
!apt install libopenblas-base libomp-dev
!pip install faiss-gpu
!pip install transformers==4.17.0
!pip install cohere
!pip install -U sentence-transformers

!python -m spacy download en_core_web_md


Import all the necessary libraries we need throughout the project.

In [3]:
# Import all the relevant libraries
import csv
import en_core_web_md
import faiss
import numpy as np
import pytorch_lightning as pl
import random
import spacy
import torch
import cohere

from tqdm import tqdm
from collections import defaultdict
from sentence_transformers import SentenceTransformer
from torch.nn import functional as F
from transformers import BertTokenizer, BertModel, BertTokenizerFast, DistilBertTokenizer, DistilBertModel
from cupy import get_array_module
from cupy import asnumpy
import cupy as cp

# I want to time this with the heavy models using the GPU
import time


In [4]:
spacy.require_gpu()

True

In [5]:
torch.cuda.is_available()

True

In [6]:
torch.cuda.get_device_name(0)

'Quadro RTX 4000'

Now let's load the spaCy data, which comes with pre-trained embeddings. This process is expensive, so only do it once.

In [7]:
# Really expensive operation to load the entire spaCy word-vector index in memory
# We'll only run it once.
encoder = spacy.load('en_core_web_md')

# Embedding Based Retrieval

✨ Let's Begin ✨

### Data Loading and Processing (Common to ALL Solutions)

#### Dataset

Download the duplicate questions [dataset released by Quora](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs).


In [8]:
%%capture
!wget 'http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv'
!mkdir qqp
!mv quora_duplicate_questions.tsv qqp/
!ls qqp/

Perfect. Now we see all of our files. Let's poke at one of them before we start parsing our dataset.

In [9]:
DATA_FILE = "qqp/quora_duplicate_questions.tsv"

# The file is a 6-column tab-separated file.
# The first column is the row id, the second and third columns are the ids of
# the two questions, followed by the text of each question.
# The last column captures whether the two questions are duplicates.
with open(DATA_FILE, 'r', newline='\n', encoding='utf-8') as file:
    reader = csv.reader(file, delimiter = '\t')
    # Read first 10 lines
    for i in range(10):
        print(next(reader))

['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate']
['0', '1', '2', 'What is the step by step guide to invest in share market in india?', 'What is the step by step guide to invest in share market?', '0']
['1', '3', '4', 'What is the story of Kohinoor (Koh-i-Noor) Diamond?', 'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?', '0']
['2', '5', '6', 'How can I increase the speed of my internet connection while using a VPN?', 'How can Internet speed be increased by hacking through DNS?', '0']
['3', '7', '8', 'Why am I mentally very lonely? How can I solve it?', 'Find the remainder when [math]23^{24}[/math] is divided by 24,23?', '0']
['4', '9', '10', 'Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?', 'Which fish would survive in salt water?', '0']
['5', '11', '12', 'Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?', "I'm a triple Capricorn (Sun, Moon and ascendant in Capri

The dataset has more than 500k questions! We are going to parse the full dataset and create a sample of 10k questions to experiment with in our models since BERT training & inference can be really slow.

In [10]:
"""
Util function to parse the file
"""
def parse_sample_dataset(file_path, sample_max_id):
  """
  Inputs:
    file_path: Path to the raw data file
    sample_max_id: Max question id to be considered in the sampled dataset

  Returns 4 objects:
    1. QuestionMap: Map of questionId to its question text
    2. DuplicatesMap: Map of questionId to its duplicates
    3. SampleDataset: list of questionIds in the sample
    4. SampleEvalDataset: list of pair of duplicate questions in the sample
  """
  question_map = {}
  duplicates_map = defaultdict(set)
  sample_dataset = set([])
  sample_eval_dataset = []

  with open(file_path, 'r', newline='\n', encoding='utf-8') as file:
    reader = csv.reader(file, delimiter='\t')
    next(reader)  # Skip the header line

    for row in reader:
      if len(row) != 6: # Skip incomplete rows
        continue

      # Limit the sample size of the dataset at max_id
      # Make sure all 4 objects start at index 0
      qid1, qid2, label = int(row[1]) - 1, int(row[2]) - 1, int(row[5])
      if qid1 < sample_max_id and qid2 < sample_max_id:
        
        if qid1 not in question_map:
          question_map[qid1] = str(row[3])
        if qid2 not in question_map:
          question_map[qid2] = str(row[4])

        if label == 1:
          duplicates_map[qid1].add(qid2)
          duplicates_map[qid2].add(qid1)

          sample_eval_dataset.append((qid1, qid2))

        sample_dataset.add(qid1)
        sample_dataset.add(qid2)

  # sample dataset duplicates removed via set(), so turn back into list
  return question_map, duplicates_map, list(sample_dataset), sample_eval_dataset

question_map, duplicates_map, sample_dataset, sample_eval_dataset = parse_sample_dataset(DATA_FILE, 10000)

# Complete file: 537k unique questions, 400k duplicate.
# To keep training time manageable limited to 10.000 (sample_max_id)
print("Number of unique questions:", len(question_map)) # 10.000
print("Number of question with duplicates:", len(duplicates_map)) # ~3.8k
print("Number of questions in sample:", len(sample_dataset)) # 10.000
print("Number of duplicate pairs in sample:", len(sample_eval_dataset)) # ~3.6k

Number of unique questions: 10000
Number of question with duplicates: 3810
Number of questions in sample: 10000
Number of duplicate pairs in sample: 3589


# Retrieval using Faiss -- TO BE COMPLETED

You are now going to create an Indexer class that implements multiple functions for indexing, searching, and evaluating our retrieval model. Faiss documentation can be found in the wiki here: https://github.com/facebookresearch/faiss/wiki/Getting-started

Some helpful Faiss guides are:
- https://www.pinecone.io/learn/faiss-tutorial/
- https://www.pinecone.io/learn/vector-indexes/

You need to implement the following functions:

1. **search**: Implement a function that takes a question and a top_k variable and returns either the matched strings or their ids to the user:
    1. Call the search API on the faiss_index to look up similar sentences using `faiss_index.search`
    2. Parse the output to either return [sentence_id, score] tuples or [sentence, score] tuples based on the input parameter
    3. Sort the output by the score in descending order

1. **evaluate**: Sample `eval_sample_size` pairs from the evaluation dataset and then check if qid2 is present in the top-k results
    1. For each eval sample, find the top_k matches for the qid1
    2. See if the qid2 is in one of the matches
    3. If yes, append (1) to the recall array otherwise append (0)
    4. Implement MRR (Mean reciprocal rank) addition based on the position of qid2 in matches.


From the Faiss documentation (https://github.com/facebookresearch/faiss/wiki/Getting-started):
Faiss handles collections of vectors of a fixed dimensionality d, typically a few 10s to 100s. 
These collections can be stored in matrices. We assume row-major storage, i.e. the j'th component of vector number i 
is stored in row i, column j of the matrix. Faiss uses only 32-bit floating point matrices.

We need two matrices:

- xb for the database, that contains all the vectors that must be indexed, and that we are going to search in. Its size is nb-by-d
- xq for the query vectors, for which we need to find the nearest neighbors. Its size is nq-by-d. If we have a single query vector, nq=1.

In our case, we have:

- d = sentence_vector_dim
- nb = 10000
- nq = 1

We have nq=1 since we will be querying a single sentence at a time. We are asking: "Given any sentence (typed by the user), return a list of 
top_k (sentence, sim_score) or top_k (sentence_id, sim_score) tuples."
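
To illustrate the shapes involved, the sketch below uses plain Python lists as stand-ins for the xb and xq matrices and a hypothetical `toy_search` function in place of the real index. It is not the Faiss API. It only mimics the layout of the result: one row of scores (D) and one row of database row indices (I) per query, which then get zipped into (id, score) tuples.

```python
def toy_search(xb, xq, k):
    """Brute-force inner-product search over plain lists.

    Returns (D, I) with one row per query:
    D[q][j] is the j-th best score for query q,
    I[q][j] is the row of xb that produced that score.
    """
    D, I = [], []
    for query in xq:
        scores = [sum(a * b for a, b in zip(query, row)) for row in xb]
        order = sorted(range(len(xb)), key=lambda i: scores[i], reverse=True)[:k]
        D.append([scores[i] for i in order])
        I.append(order)
    return D, I

xb = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # database: nb = 3, d = 2
xq = [[1.0, 0.0]]                          # a single query: nq = 1

D, I = toy_search(xb, xq, k=2)

# With nq = 1 there is only one row in D and I, so we read index 0.
pairs = list(zip(I[0], D[0]))
print(pairs)
```

Because there is a single query, everything of interest lives in `D[0]` and `I[0]`. With several queries you would loop over the rows instead.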


In [33]:
class FaissIndexer:
  def __init__(self, dataset,
               question_map, 
               eval_dataset, 
               batch_size, 
               sentence_vector_dim, 
               vectorizer):
    self.question_map = question_map
    self.dataset = dataset
    self.eval_dataset = eval_dataset
    self.batch_size = batch_size
    self.vectorizer = vectorizer
    self.sentence_vector_dim = sentence_vector_dim
    
    
    # We want to run Faiss on the GPU. The index itself is created in
    # index(), so start with an empty placeholder here.
    self.faiss_index = None
    self.res = faiss.StandardGpuResources()
    
    # To deal with GPU memory issues
    torch.cuda.empty_cache()
    self.res.noTempMemory()
    
    # Original approach (did not work for me, worth investigating further):
    # build a CPU index and move it to the GPU with index_cpu_to_gpu.
    # Instead, index() creates a faiss.GpuIndexFlatIP directly, although this
    # might not be optimal.
    #self.faiss_index = faiss.index_cpu_to_gpu(self.res, 0, self.flat_index)
    
    
    
  def split_list(self, lst: list, sublist_size: int):
    sublists = []
    # Split lst into even chunks/sublists/batches
    for i in range(0, len(lst), sublist_size): 
        sublists.append(lst[i:i + sublist_size])
    return sublists


  def index(self):
    """
    This function adds all the sentences in the dataset to the faiss index.
    It first splits the dataset into batches of size batch_size.
    Then it retrieves the sentences from the question_map, and vectorizes them.
    After adding to a temporary list, it adds all the batches to the faiss index.
    
    """
    sentence_vectors = []

    print("Start indexing!")
    for sentence_ids in tqdm(self.split_list(self.dataset, self.batch_size)):
      # Retrieve sentences based on qid
      sentences = [self.question_map[qid] for qid in sentence_ids]
      # Get embeddings of the sentences (Spacy, ..., Cohere)
      sentence_vectors_batch = self.vectorizer.vectorize(sentences)
      # Add batch to temporary list
      sentence_vectors.append(cp.asarray(sentence_vectors_batch)) ## Changed (sentence_vectors_batch) to cp.asarray(sentence_vectors_batch)
      

    # Add all batches from temporary list to index. faiss_index takes np arrays
    concatenated_vectors = cp.concatenate(sentence_vectors, axis=0)
    # Need to turn them into numpy vectors before adding to index
    concatenated_vectors_np = cp.asnumpy(concatenated_vectors)
    
    # FlatIP uses Inner Product distance
    # (https://github.com/facebookresearch/faiss/blob/main/tutorial/python/4-GPU.py)
    gpu_index = faiss.GpuIndexFlatIP(self.res, self.sentence_vector_dim)
    gpu_index.add(concatenated_vectors_np)
    
    self.faiss_index = gpu_index
    


  def search(self, question: str, top_k: int, return_ids=False):
    """Given any sentence (typed by the user)
    We return a list of top_k(sentence, sim_score) or top_k(sentence_ids, sim_score)
    
    NOTE: The output type is controlled by the return_ids flag

    1. Call the search API on the faiss_index to look up similar sentences 
       using `faiss_index.search`
    2. Parse the output to either return [sentence_id, score] tuples or 
       [sentence, score] tuples based on return_ids being true/false
    3. Sort the output by the score in descending order
    """

    # NOTE: We converted the question to a list here to match the signature 
    # of the vectorize function
    question_vectors = self.vectorizer.vectorize([question])
    question_vectors_np = cp.asnumpy(question_vectors)

    ### TO BE IMPLEMENTED ###
    # 1. Call the search API on the faiss_index to look up similar sentences using `faiss_index.search`
    
    # Here, D contains the distances between the query vector and the nearest neighbors and I contains their
    # corresponding indices in the lookup table.
    # self.faiss_index.search needs numpy vectors. I'll convert back after
    D, I = self.faiss_index.search(question_vectors_np, top_k)
    #D, I = self.faiss_index.search(question_vectors, top_k)
    
    # 2. Parse the output to either return [sentence_id, score] tuples or [sentence, score] tuples based on return_ids being true/false
    # Earlier, we created a dictionary called question_map that maps the qid to the sentence.
    # As we are only putting a single question in the question_vectors, there is only one element in the D and I numpy arrays.
    
    if return_ids:
      output = [(i, d) for d, i in zip(D[0], I[0])]
    else:
      output = [(self.question_map[i], d) for d, i in zip(D[0], I[0])]
    

    
    # 3. Sort the output by the score in descending order

    # output is a List[(q, score), (q, score), (q, score)] based on return_ids
    # Additionally, output is sorted in descending order based on score
    # We want to sort on the scores, so we can use the sorted() function
    # with a lambda function as the key parameter that returns the second
    # element.
    return sorted(output, key=lambda q: q[1], reverse=True)


  def evaluate(self, top_k: int, eval_sample_size: int):
    """Sample eval_sample_size pairs from the evaluation dataset and then check 
    if the qid2 is present in the top-k results

    1. For each eval sample, find the top_k matches for the qid1
    2. See if the qid2 is in one of the matches
    3. If yes, append (1) to the recall array otherwise append (0)
    4. Implement MRR (Mean reciprocal rank) addition based on the position of qid2 in matches
      - Note: MRR is equivalent to mean([1/r or 0 for each sample])
    """
    # Sample from evaluation dataset as proxy for performance metrics
    eval_samples = random.sample(self.eval_dataset, eval_sample_size)
    # Retrieval metrics which only care about if searched for
    # item is present among the results.
    recall_at_k = [] # Relevant items vs total of relevant items
    mean_reciprocal_rank = [] # Rank of the first relevant item
    
    for q1, q2 in eval_samples:
        query = self.question_map[q1]
        top_k_similar = self.search(query, top_k, return_ids=True)
        # Get the IDs of the top_k sentences closest to query
        # and put them in a list
        top_k_ids = [id for (id, d) in top_k_similar]
        if q2 in top_k_ids:
          recall_at_k.append(1)
          mean_reciprocal_rank.append(1 / (top_k_ids.index(q2) + 1))
        else:
          recall_at_k.append(0)
          mean_reciprocal_rank.append(0)
    

    print("\nRecall@{}:\t\t{:0.2f}%".format(top_k, np.mean(np.array(recall_at_k) * 100.0)))
    print("Mean Reciprocal Rank:\t{:0.2f}".format(np.mean(np.array(mean_reciprocal_rank))))


  # Helper function to train, search and evaluate similar output from all the models created.
  def train_and_evaluate(self, 
                         question_example: str, 
                         top_k: int = 10, 
                         eval_sample_size: int = 1000
                         ):
    print("---- Indexing ----")
    self.index()
    print("\n---- Search ----")
    results = self.search(question_example, top_k, return_ids=False)
    print("Questions similar to:", question_example)
    for i, (q, s) in enumerate(results):
      print(f"{i} Question: {q} with score {s:.4f}")
    print("\n---- Evaluation ----")
    self.evaluate(top_k, eval_sample_size)

## Dummy Model Test

A really small sample of 4 sentences lets us check that our implementation of the Faiss search function works correctly. We project the 4 questions into a 2-d space: a question is placed on the X-axis if the word `invest` is present and on the Y-axis if `Kohinoor` is present.

In [12]:
dummy_ids = sample_dataset[:4]
print("Questions:")
for i in dummy_ids:
  print(i, ":", question_map[i])

Questions:
0 : What is the step by step guide to invest in share market in india?
1 : What is the step by step guide to invest in share market?
2 : What is the story of Kohinoor (Koh-i-Noor) Diamond?
3 : What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?


In [13]:
class DummyVectorizer:
  def __init__(self, sentence_vector_dim):
    self.sentence_vector_dim = sentence_vector_dim

  def vectorize(self, sentences):
    """Return toy 2-d sentence vectors for the batch of sentences. 

    1. Sentences containing "invest" get a random point on the X-axis
    2. Sentences containing "Kohinoor" get a random point on the Y-axis
    3. Stack the sentence vectors into a CuPy array using cp.stack

    NOTE: Sentences containing neither word are skipped.
    """
    vectors = []
    for sentence in sentences:
      if "invest" in sentence:
        # If "invest" is present place it on the X-Axis
        # In this example block, we are feeding 4 sentences, two of which have 
        # the word 'invest', and each time we get inside this loop, we
        # create a one-dimensional CuPy array of length 2 with a randomly 
        # generated float value between 0 and 1 at index 0, and a float value 
        # of 0 at index 1, and append it to vectors. 
        vectors.append(cp.array([random.random(), 0], dtype=cp.float32))
      elif "Kohinoor" in sentence:
        # If "Kohinoor" is present place it on the Y-Axis. Same thing as above, except
        # now we are getting a random y-value coordinate each time our sentence
        # has "Kohinoor" in the sentence
        vectors.append(cp.array([0, random.random()], dtype=cp.float32))
    return cp.stack([v for v in vectors])


di = FaissIndexer(dummy_ids, 
                  question_map,
                  sample_eval_dataset,
                  batch_size=1024, 
                  sentence_vector_dim=2, 
                  vectorizer=DummyVectorizer(2)
                  )

di.index()

results = di.search("invest", 4)
print("Questions similar to:", "invest")
for i, (q, s) in enumerate(results):
  print(f"{i} Question: {q} with score {s}")

results = di.search("Kohinoor", 4)
print("\nQuestions similar to:", "Kohinoor")
for i, (q, s) in enumerate(results):
  print(f"{i} Question: {q} with score {s}")


Start indexing!


100%|██████████| 1/1 [00:00<00:00, 592.50it/s]


Questions similar to: invest
0 Question: What is the step by step guide to invest in share market? with score 0.275837242603302
1 Question: What is the step by step guide to invest in share market in india? with score 0.21480777859687805
2 Question: What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? with score 0.0
3 Question: What is the story of Kohinoor (Koh-i-Noor) Diamond? with score 0.0

Questions similar to: Kohinoor
0 Question: What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? with score 0.0271025151014328
1 Question: What is the story of Kohinoor (Koh-i-Noor) Diamond? with score 0.004366403911262751
2 Question: What is the step by step guide to invest in share market in india? with score 0.0
3 Question: What is the step by step guide to invest in share market? with score 0.0


# Models

You may be wondering, "When are we going to start building models?" And, the answer is NOW! Finally the time has come to build our baseline model, and then we'll work towards improving it. 


**NOTE**: We will be using the sample dataset since BERT is really slow and processing the full dataset will take a lot of time. 

### Model 1: Averaging Word Vectors --- TO BE COMPLETED
##### <font color='red'>Expected recall@10: ~20%, MRR: ~0.07</font>

Complete the `vectorize` function using Spacy provided word embeddings. This is something we've done twice already :) 

Implementation:

1. Tokenize each sentence and get wordVectors for each token in the sentence using Spacy 
2. Sentence vector is the mean of word vectors of each token
3. Stack the sentence vectors into a numpy array using np.stack
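
The averaging step can be sketched without any library. In the toy example below, a whitespace split stands in for the spaCy tokenizer and `toy_word_vectors` is a made-up 3-d lookup table. Skipping tokens that are missing from the table is a choice made for this sketch only.

```python
# Hypothetical 3-d word vectors, made up for illustration only.
toy_word_vectors = {
    "how": [1.0, 0.0, 0.0],
    "to": [0.0, 1.0, 0.0],
    "invest": [0.0, 0.0, 1.0],
}

def mean_vector(tokens, word_vectors):
    """Average the vectors of all tokens found in the lookup table."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        raise ValueError("no known tokens in sentence")
    dim = len(vectors[0])
    # Component-wise mean across the token vectors.
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]

sentence_vector = mean_vector("how to invest".split(), toy_word_vectors)
print(sentence_vector)
```

Each of the three tokens contributes equally, so every component of the sentence vector comes out to one third.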

In [14]:
class SpacyVectorizer:
  def __init__(self, sentence_vector_dim):
    self.sentence_vector_dim = sentence_vector_dim

  def vectorize(self, sentences):
    """Return sentence vectors for the batch of sentences. 

    1. Tokenize each sentence and create vectors for each token in the sentence
    2. Sentence vector is the mean of word vectors of each token
    3. Stack the sentence vectors into a numpy array using np.stack
    """
    vectors = []
    for sentence in sentences:
      spacy_doc = encoder(sentence)
      tokens = [token.vector for token in spacy_doc]
      sentence_vector = cp.mean(cp.asarray(tokens), axis=0)
      vectors.append(sentence_vector)
      
    return cp.stack(vectors) 


  

In [42]:
start_time=time.time()

spacyIndexer = FaissIndexer(sample_dataset,
                            question_map,
                            sample_eval_dataset,
                            batch_size=1024, 
                            sentence_vector_dim=300, 
                            vectorizer=SpacyVectorizer(300))



spacyIndexer.train_and_evaluate(question_example = "how can i invest in stock market in india?")

elapsed_time = time.time() - start_time

print(f"Elapsed time: {elapsed_time} seconds")

---- Indexing ----
Start indexing!


100%|██████████| 10/10 [03:40<00:00, 22.04s/it]



---- Search ----
Questions similar to: how can i invest in stock market in india?
0 Question: In how many ways can we create object in Java? with score 1264.3943
1 Question: How can we find happiness in life? with score 1203.2908
2 Question: I want to connect with you, how can I do that? with score 1198.1931
3 Question: How can I can concentrate well in studies? with score 1197.0286
4 Question: what can i do to become fair? with score 1191.5201
5 Question: Why do we need to study? with score 1189.1204
6 Question: How can we earn money online in india? with score 1163.9592
7 Question: How can I be happy if I don't have any reason to be? with score 1163.7311
8 Question: What can I do to help the situation in Aleppo? with score 1161.9705
9 Question: I want to study biotechnology abroad what do I have to do? with score 1161.9376

---- Evaluation ----

Recall@10:		5.40%
Mean Reciprocal Rank:	0.02
Elapsed time: 243.7632293701172 seconds
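
A note on the scores above: values in the thousands mean the index is taking inner products of unnormalized vectors, so a vector's length affects its ranking as well as its direction. This may be one reason the recall falls short of the expected value, although that is an assumption rather than something verified here. The sketch below, using made-up 2-d vectors and hypothetical helper names, shows how L2-normalizing first turns the inner product into cosine similarity and can change which candidate ranks first.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 1.0]
long_but_off = [30.0, 0.0]      # large norm, 45 degrees away from the query
short_but_aligned = [2.0, 2.0]  # small norm, same direction as the query

# Raw inner product: the long vector wins purely because of its length.
raw_scores = (inner_product(query, long_but_off),
              inner_product(query, short_but_aligned))

# After normalization the inner product is the cosine similarity,
# so the vector pointing the same way as the query wins.
unit_query = l2_normalize(query)
cos_scores = (inner_product(unit_query, l2_normalize(long_but_off)),
              inner_product(unit_query, l2_normalize(short_but_aligned)))

print("raw:", raw_scores)
print("cosine:", cos_scores)
```

Models 2 and 3 in this notebook normalize their vectors before indexing, which is why their scores stay below 1.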


### Model 2: BERT Embeddings --- TO BE COMPLETED
##### <font color='red'>Expected recall@10: ~48%, MRR: ~0.19</font>

Compute the sentence embeddings using the BERT model and complete the `vectorize` function. Feel free to reference any documentation from https://huggingface.co/. 


Implementation:

1. Tokenize batch of sentences using `self.tokenizer`
2. Pipe the inputs through the BERT model to get the output hidden states
3. Normalize the batch output

**NOTE: This model is really slow and will take about 20 mins to run**
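
One detail worth knowing when pooling token outputs: padded batches contain positions that carry no real token. A common variant of mean pooling uses the attention mask to leave those positions out. The sketch below shows the arithmetic in plain Python with made-up numbers. It is not the notebook's implementation, and the helper names are hypothetical.

```python
import math

def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    dim = len(token_vectors[0])
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# One sentence padded to 4 positions; the last two positions are padding.
toy_hidden_states = [[3.0, 0.0], [3.0, 8.0], [9.0, 9.0], [9.0, 9.0]]
toy_mask = [1, 1, 0, 0]

pooled = masked_mean_pool(toy_hidden_states, toy_mask)
sentence_embedding = l2_normalize(pooled)
print(pooled, sentence_embedding)
```

Only the first two positions are averaged, giving a pooled vector of (3, 4), which normalizes to (0.6, 0.8). Averaging all four positions instead would pull the result toward the padding values.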

In [34]:
class BertVectorizer:
  def __init__(self):
    self.tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    self.model = DistilBertModel.from_pretrained('distilbert-base-uncased')
    self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    self.model.to(self.device)

  def vectorize(self, sentences):
    """Return sentence vectors for the batch of sentences. 

    1. Tokenize batch of sentences using `self.tokenizer`
    2. Pipe the inputs through the BERT model to get the last hidden states
    3. Mean-pool over tokens and normalize the batch output
    """
    # Tokenize the sentences with DistilBert, return them as PyTorch
    # tensors, and apply padding so that all the tokenized inputs in the
    # batch come out to be the same length (as we did in previous projects).
    tokenized = self.tokenizer(sentences, return_tensors='pt', padding=True).to(self.device)
    model_output = self.model(**tokenized).last_hidden_state

    # Average over the token dimension (padding positions included), then
    # L2-normalize so that the inner product equals cosine similarity.
    return F.normalize(torch.mean(model_output, dim=1), dim=1).detach()

start_time=time.time()

bertIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=8, 
                  sentence_vector_dim=768, 
                  vectorizer=BertVectorizer())


bertIndex.train_and_evaluate(question_example = "how can i invest in stock market in india?")

elapsed_time = time.time() - start_time

print(f"Elapsed time: {elapsed_time} seconds")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


---- Indexing ----
Start indexing!


100%|██████████| 1250/1250 [00:15<00:00, 79.42it/s]



---- Search ----
Questions similar to: how can i invest in stock market in india?
0 Question: What is the step by step guide to invest in share market in india? with score 0.8995
1 Question: What are mutual funds and which is the best one in India in which to invest? with score 0.8811
2 Question: I wish to start investing in Equity and Mutual Funds. Where should I open Demat account for best rates, transaction charges and so on? I am NRI. with score 0.8783
3 Question: What should I do to make money online in India? with score 0.8765
4 Question: What is the best time to withdraw money from working ATM in present India? with score 0.8727
5 Question: Do you think India will be able to curb blank money? with score 0.8709
6 Question: What will be the effect of banning 500 and 1000 notes on stock markets in India? with score 0.8670
7 Question: How can I start up a small business? with score 0.8656
8 Question: How can I make money online in India? with score 0.8642
9 Question: How do micro A

### Model 3: Sentence Transformer --- TO BE COMPLETED
##### <font color='red'>Expected recall@10: ~93%, MRR: ~0.34</font>

Compute the sentence embeddings using the Sentence BERT model and complete the `vectorize` function. Feel free to look up documentation on https://www.sbert.net/. 

Implementation:

1. Pipe the input sentences through the Sentence BERT model to create the sentence embeddings
2. Normalize the batch output


In [35]:
class SentenceBertVectorizer:
  def __init__(self):
    self.model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

  def vectorize(self, sentences):
    """Return sentence vectors for the batch of sentences. 

    1. Pipe the input sentences through the Sentence BERT model to create the sentence embeddings
    2. Normalize the batch output
    """

    sentence_vectors = cp.asarray(self.model.encode(sentences))
    norm = cp.linalg.norm(sentence_vectors, axis=1, keepdims=True)
    normalized_vectors = sentence_vectors / norm

    return cp.asnumpy(normalized_vectors)

start_time=time.time()
SBertIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=512, 
                  sentence_vector_dim=384, 
                  vectorizer=SentenceBertVectorizer())

SBertIndex.train_and_evaluate(question_example = "how can i invest in stock market in india?")

elapsed_time = time.time() - start_time

print(f"Elapsed time: {elapsed_time} seconds")

---- Indexing ----
Start indexing!


100%|██████████| 20/20 [00:03<00:00,  5.48it/s]



---- Search ----
Questions similar to: how can i invest in stock market in india?
0 Question: What is the step by step guide to invest in share market in india? with score 0.7332
1 Question: I am 17 and I want to invest money in stock market where should I start? with score 0.6957
2 Question: What are the ways to learn about stock market? with score 0.6244
3 Question: How do I start investing in shares or stocks? What is the minimum requirement? with score 0.6240
4 Question: What is the best way to learn about stock market? with score 0.6223
5 Question: What is the step by step guide to invest in share market? with score 0.6043
6 Question: What is the best way to learn about investing in the stock market and what stocks to buy? with score 0.6033
7 Question: What is the best way to learn about stock markets? with score 0.5847
8 Question: How do I buy stocks? with score 0.5778
9 Question: What are mutual funds and which is the best one in India in which to invest? with score 0.5577

---

### Model 4: Cohere Sentence Embeddings --- TO BE COMPLETED
##### <font color='red'>Expected recall@10: ~89%, MRR: ~0.34</font>

Make sure to create a Cohere account and generate an API key.
Compute the sentence embeddings using the cohere API and complete the `vectorize` function. Feel free to look up documentation on https://docs.cohere.ai/semantic-search. 

Implementation:

1. Pipe the input sentences through the Cohere API. Make sure to select the small model.
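
The notebook throttles calls to the Cohere API with the third-party `limit` package. For readers who prefer to avoid the extra dependency, a minimal standard-library sketch of the same idea (spacing calls so that at most `max_calls` happen per `period` seconds) could look like this. `MinIntervalThrottle` is a hypothetical helper, not part of Cohere's client, and the 95-per-60-seconds figure simply mirrors the setting used in this notebook.

```python
import time

class MinIntervalThrottle:
    """Space calls at least `period / max_calls` seconds apart."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls
        self.clock = clock    # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now

throttle = MinIntervalThrottle(max_calls=95, period=60)
# Call throttle.wait() right before each API request,
# for example as the first line of a vectorize() method.
```

Even spacing is stricter than a sliding-window limit (it never allows bursts), which makes it simple and safe at the cost of some throughput.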


In [36]:
COHERE_API_KEY = ""
co = cohere.Client(COHERE_API_KEY)

In [37]:
# Limit calls to the API (tip from people in Slack who noted that Cohere
# only allows 100 calls/min)

!pip install limit

from limit import limit

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting limit
  Downloading limit-0.2.3.tar.gz (1.9 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: limit
  Building wheel for limit (setup.py) ... done
  Created wheel for limit: filename=limit-0.2.3-py3-none-any.whl size=2328 sha256=45dd2d0bb9fb0bcd532aa61a7c4918c7fe575d9d32a86b55708a106d5c11873c
  Stored in directory: /root/.cache/pip/wheels/e9/84/f3/bb4995b5e0b313a9965e4fb6ef50c502f985850d5765e94f6b
Successfully built limit
Installing collected packages: limit
Successfully installed limit-0.2.3

In [43]:
class CohereVectorizer:

  @limit(95,60)
  def vectorize(self, sentences):
    """Return sentence vectors for the batch of sentences. 

    1. Send the batch of sentences to the Cohere embed API (small model)
    2. Convert the returned embeddings into a float32 numpy array
    """

    embeddings = co.embed(texts = sentences, model = "small", truncate = "LEFT").embeddings

    # Convert from float64 to float32 to prevent bug:
    # https://github.com/facebookresearch/faiss/issues/461
    return np.array(embeddings, dtype=np.float32)

start_time=time.time()

cohereIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=32, 
                  sentence_vector_dim=1024, 
                  vectorizer=CohereVectorizer())



cohereIndex.train_and_evaluate(question_example = "how can i invest in stock market in india?")

elapsed_time = time.time() - start_time

print(f"Elapsed time: {elapsed_time} seconds")

---- Indexing ----
Start indexing!


100%|██████████| 313/313 [03:07<00:00,  1.67it/s]



---- Search ----
Questions similar to: how can i invest in stock market in india?
0 Question: What is the step by step guide to invest in share market in india? with score 2562.9946
1 Question: I am 17 and I want to invest money in stock market where should I start? with score 2064.2815
2 Question: What is the step by step guide to invest in share market? with score 2049.3308
3 Question: How do I start investing in shares or stocks? What is the minimum requirement? with score 1887.5629
4 Question: Which is the best Mutual Fund in India? with score 1856.3855
5 Question: I wish to start investing in Equity and Mutual Funds. Where should I open Demat account for best rates, transaction charges and so on? I am NRI. with score 1831.6359
6 Question: How do I buy stocks? with score 1825.7646
7 Question: What are mutual funds and which is the best one in India in which to invest? with score 1824.3256
8 Question: Which Best SIP plan in india for investement purpose? with score 1824.1649
9 Ques

🎉 CONGRATULATIONS on finishing the assignment!!! We built a real model with an actual dataset for a problem that comes up every time a new Quora question gets created!! 

As for why SentenceBERT and Cohere perform so well, we'll cover that when we look at Siamese networks in Week 4.

# Extensions

Now that you've worked through the project there is a lot more for us to try:

- See if you can use BERT to improve the model you shipped in Week 1.
- Try out `SentenceBert` and `SpacyVectors` on the entire dataset rather than the sample and see what you get.
- Try different transformer models from Hugging Face.