> DUPLICATE THIS COLAB TO START WORKING ON IT, using File > Save a copy in Drive.


# Week 3: Embedding-Based Retrieval

### What we are building
The goal of Embedding-Based Retrieval is to retrieve the top-k candidates for a query based on embedding similarity (or distance). A common application: given a query, sentence, or document, find the top-k most similar candidates. This has traditionally been solved with TF-IDF and other Information Retrieval (IR) approaches, but it is increasingly common in industry to use an embedding-based approach: encode the query and the documents as embeddings, then use approximate nearest neighbor search to find the top-k candidates in real time.
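
To make the idea concrete, here is a minimal sketch of exact (not approximate) top-k retrieval by cosine similarity. The function name `top_k_cosine` and the 2-d vectors are made up for illustration; real embeddings have hundreds of dimensions and would normally be searched with a library such as FAISS.

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k):
    """Return (index, score) pairs for the k documents most similar to the query."""
    # L2-normalize so that the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # argsort is ascending, so negate the scores to get the highest first
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy 2-d "embeddings" (hypothetical values)
docs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
query = np.array([2.0, 0.1])

top2 = top_k_cosine(query, docs, k=2)
print(top2)
```

The query points almost along the first axis, so document 0 ranks first and document 1 second.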

We will build a system to find duplicate questions on Quora using a [dataset released by Quora](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs). A very common problem for forums/QA websites is trying to determine whether a question has already been asked before a user posts it.

We will continue to apply our learning philosophy of repetition as we build multiple models of increasing complexity in the following order:

1. Retrieval based on WordVectors
1. Using BERT
1. Using Sentence BERT
1. Using OpenAI Text Embeddings

### Evaluation
We will evaluate our models along the following metrics:

1. Recall@k: the proportion of relevant items found in the top-k matches
1. Mean Reciprocal Rank (MRR): the mean over all queries of 1/rank of the first relevant item in the top-k matches (0 if it is not found).
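
As a rough illustration, both metrics can be computed on a handful of hypothetical search results. The helper names and the toy ids below are assumptions, and the sketch assumes one known duplicate per query, which matches how the evaluation pairs are used later in this notebook.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1 if the relevant item appears in the top-k results, else 0."""
    return 1 if relevant_id in ranked_ids[:k] else 0

def reciprocal_rank(ranked_ids, relevant_id, k):
    """1 / rank of the relevant item (ranks start at 1), or 0 if it is not in the top-k."""
    top = ranked_ids[:k]
    return 1.0 / (top.index(relevant_id) + 1) if relevant_id in top else 0.0

# Hypothetical results: (ranked result ids, id of the known duplicate)
queries = [
    ([7, 3, 9], 7),  # hit at rank 1, reciprocal rank 1
    ([4, 2, 8], 8),  # hit at rank 3, reciprocal rank 1/3
    ([5, 1, 6], 0),  # miss, reciprocal rank 0
]

k = 3
recall = sum(recall_at_k(r, rel, k) for r, rel in queries) / len(queries)
mrr = sum(reciprocal_rank(r, rel, k) for r, rel in queries) / len(queries)
print(f"Recall@{k}: {recall:.2f}, MRR: {mrr:.2f}")
```

Two of the three queries hit, so Recall@3 is 2/3, while MRR is (1 + 1/3 + 0) / 3 because it also rewards hits that appear earlier in the ranking.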

### Instructions

1. We have provided scaffolding for all the boilerplate FAISS code to get to our baseline model. This covers downloading and parsing the dataset, and training code for the baseline model. **Make sure to read all the steps and internalize what is happening**.
1. In the second model, we will use BERT embeddings. **Does this improve accuracy?**
1. In the third model, we will use Sentence BERT embeddings and see whether they boost performance. **How do you think this model will perform?**
1. **Extension**: We have suggested a bunch of extensions to the project so go crazy! Tweak any parts of the pipeline, and see if you can beat all the current models.

### Code Overview

- Dependencies: Install and import python dependencies
- Project
  - Dataset: Download the Quora dataset
  - Indexer: Function to manage and create a Faiss Index
  - Model 1: Word Vectors
  - Model 2: BERT
  - Model 3: Sentence BERT
  - Model 4: OpenAI Text Embeddings (optional)
- Extensions


# Dependencies

✨ Now let's get started! To kick things off, as always, we will install some dependencies.

In [1]:
# Import all the relevant libraries
import csv
import faiss
import numpy as np
import pytorch_lightning as pl
import random
import spacy
import torch

from tqdm import tqdm
from collections import defaultdict
from sentence_transformers import SentenceTransformer
from torch.nn import functional as F
from transformers import BertTokenizer, BertModel, BertTokenizerFast, DistilBertTokenizer, DistilBertModel

# Load the spacy data
nlp = spacy.load("en_core_web_lg")

# Embedding Based Retrieval

✨ Let's Begin ✨

### Data Loading and Processing (Common to ALL Solutions)

#### Dataset

Download the duplicate questions [dataset released by Quora](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs).


In [2]:
%%capture
!wget 'http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv'
!mkdir qqp
!mv quora_duplicate_questions.tsv qqp/
!ls qqp/

Perfect. The dataset file is now in `qqp/`. Let's poke at it before we start parsing our dataset.

In [3]:
DATA_FILE = "qqp/quora_duplicate_questions.tsv"

# The file is a 6-column tab-separated file.
# The first column is the row id, the second and third columns are the ids of
# the two questions, followed by the text of each question.
# The last column captures whether the two questions are duplicates
with open(DATA_FILE, 'r', newline='\n') as file:
  reader = csv.reader(file, delimiter = '\t')
  # Read first 10 lines
  for i in range(10):
    print(next(reader))

['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate']
['0', '1', '2', 'What is the step by step guide to invest in share market in india?', 'What is the step by step guide to invest in share market?', '0']
['1', '3', '4', 'What is the story of Kohinoor (Koh-i-Noor) Diamond?', 'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?', '0']
['2', '5', '6', 'How can I increase the speed of my internet connection while using a VPN?', 'How can Internet speed be increased by hacking through DNS?', '0']
['3', '7', '8', 'Why am I mentally very lonely? How can I solve it?', 'Find the remainder when [math]23^{24}[/math] is divided by 24,23?', '0']
['4', '9', '10', 'Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?', 'Which fish would survive in salt water?', '0']
['5', '11', '12', 'Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?', "I'm a triple Capricorn (Sun, Moon and ascendant in Capri

The dataset has more than 500k questions! We are going to parse the full dataset and create a sample of 10k questions to experiment with in our models since BERT training & inference can be really slow.

In [4]:
"""
Util function to parse the file
"""
def parse_sample_dataset(file_path, sample_max_id):
  """
  Inputs:
    file_path: Path to the raw data file
    sample_max_id: Max question id to be considered in the sampled dataset

  Returns 4 objects:
    1. QuestionMap: Map of questionId to the question text
    2. DuplicatesMap: Map of questionId to its duplicates
    3. SampleDataset: list of questionIds in the sample
    4. SampleEvalDataset: list of pairs of duplicate questions in the sample
  """
  question_map = {}
  duplicates_map = defaultdict(set)
  sample_dataset = set([])
  sample_eval_dataset = []

  with open(file_path, 'r', newline='\n') as file:
    reader = csv.reader(file, delimiter='\t')
    next(reader)  # Skip the header line

    for row in reader:
      if len(row) != 6: # Skip incomplete rows
        continue

      # Limit the sample size of the dataset at max_id
      # Make sure all 4 objects start at index 0
      qid1, qid2, label = int(row[1]) - 1, int(row[2]) - 1, int(row[5])
      if qid1 < sample_max_id and qid2 < sample_max_id:

        if qid1 not in question_map:
          question_map[qid1] = str(row[3])
        if qid2 not in question_map:
          question_map[qid2] = str(row[4])

        if label == 1:
          duplicates_map[qid1].add(qid2)
          duplicates_map[qid2].add(qid1)

          sample_eval_dataset.append((qid1, qid2))

        sample_dataset.add(qid1)
        sample_dataset.add(qid2)

  # sample dataset duplicates removed via set(), so turn back into list
  return question_map, duplicates_map, list(sample_dataset), sample_eval_dataset

question_map, duplicates_map, sample_dataset, sample_eval_dataset = parse_sample_dataset(DATA_FILE, 10000)

# Complete file: 537k unique questions, about 400k question pairs.
# To keep training time manageable, the sample is limited to 10,000 (sample_max_id)
print("Number of unique questions:", len(question_map)) # 10,000
print("Number of question with duplicates:", len(duplicates_map)) # ~3.8k
print("Number of questions in sample:", len(sample_dataset)) # 10,000
print("Number of duplicate pairs in sample:", len(sample_eval_dataset)) # ~3.6k

Number of unique questions: 10000
Number of question with duplicates: 3810
Number of questions in sample: 10000
Number of duplicate pairs in sample: 3589
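
To see what the duplicates map and the evaluation pairs look like, here is a small sketch that applies the same bookkeeping to an in-memory TSV. The three rows are made up for illustration; only the column layout and the shift to zero-based ids follow the parser above.

```python
import csv
import io
from collections import defaultdict

# A tiny in-memory TSV with the same six columns as the Quora file (rows are made up)
tsv = (
    "id\tqid1\tqid2\tquestion1\tquestion2\tis_duplicate\n"
    "0\t1\t2\tHow do I learn Python?\tWhat is the best way to learn Python?\t1\n"
    "1\t3\t4\tWhat is FAISS?\tHow do I bake bread?\t0\n"
    "2\t1\t5\tHow do I learn Python?\tHow can I start learning Python?\t1\n"
)

duplicates = defaultdict(set)
eval_pairs = []

reader = csv.reader(io.StringIO(tsv), delimiter="\t")
next(reader)  # skip the header line
for row in reader:
    # Shift ids by one so they start at 0, as the notebook's parser does
    qid1, qid2, label = int(row[1]) - 1, int(row[2]) - 1, int(row[5])
    if label == 1:
        # Duplicates are stored symmetrically: each question points at the other
        duplicates[qid1].add(qid2)
        duplicates[qid2].add(qid1)
        eval_pairs.append((qid1, qid2))

print(dict(duplicates))
print(eval_pairs)
```

Question 0 ends up with two duplicates (1 and 4), which is why the number of questions with duplicates and the number of duplicate pairs can differ.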


# Retrieval using Faiss -- TO BE COMPLETED

You are now going to create an Indexer class that implements multiple functions for indexing, searching, and evaluating our retrieval model. FAISS documentation can be found in the wiki here: https://github.com/facebookresearch/faiss/wiki/Getting-started

Some helpful FAISS guides are:
- https://www.pinecone.io/learn/faiss-tutorial/
- https://www.pinecone.io/learn/vector-indexes/

You need to implement the following functions:

1. **search**: Implement a function that takes a question and a `top_k` value and returns either the matched strings or their ids as a list of (item, score) tuples
    1. Call the search API on the faiss_index to look up similar sentences using `faiss_index.search`
    2. Parse the output to either return [sentence_id, score] tuples or [sentence, score] tuples based on the input parameter
    3. Sort the output by the score in descending order

1. **evaluate**: Sample `eval_sample_size` pairs from the evaluation dataset and then check if the qid2 is present in the top-k results
    1. For each eval sample, find the top_k matches for the qid1
    2. See if the qid2 is in one of the matches
    3. If yes, append (1) to the recall array otherwise append (0)
    4. Implement MRR (Mean reciprocal rank) addition based on the position of qid2 in matches.
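
The parsing step of **search** can be sketched without FAISS itself. The arrays below are hand-written stand-ins shaped like the `(scores, indices)` pair that `faiss_index.search` returns for one query (one row per query, one column per result); `parse_results` and `toy_question_map` are hypothetical names used only for this illustration.

```python
import numpy as np

# Stand-ins for the output of `faiss_index.search` with a single query and top_k=3
scores = np.array([[0.91, 0.55, 0.12]], dtype=np.float32)
indices = np.array([[2, 0, 1]], dtype=np.int64)

# Hypothetical id -> text lookup standing in for question_map
toy_question_map = {0: "How do I buy stocks?", 1: "What is FAISS?", 2: "How can I invest?"}

def parse_results(scores, indices, lookup, return_ids=False):
    """Turn row 0 of the search output into a list of (item, score) tuples."""
    ids = [int(i) for i in indices[0]]
    vals = [float(s) for s in scores[0]]
    items = ids if return_ids else [lookup[i] for i in ids]
    pairs = list(zip(items, vals))
    # Sort by score, highest first (a no-op if the input is already sorted)
    return sorted(pairs, key=lambda p: p[1], reverse=True)

by_id = parse_results(scores, indices, toy_question_map, return_ids=True)
by_text = parse_results(scores, indices, toy_question_map)
print(by_id)
print(by_text)
```

Indexing with `[0]` picks out the results for the first (and only) query in the batch.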


In [33]:
class FaissIndexer:
  def __init__(self, dataset,
               question_map,
               eval_dataset,
               batch_size,
               sentence_vector_dim,
               vectorizer):
    self.question_map = question_map
    self.dataset = dataset
    self.eval_dataset = eval_dataset
    self.batch_size = batch_size
    self.vectorizer = vectorizer
    # FlatIP uses inner product
    self.faiss_index = faiss.IndexFlatIP(sentence_vector_dim)


  def split_list(self, lst: list, sublist_size: int):
    sublists = []
    # Split list into even chunks/sublists/batches
    for i in range(0, len(lst), sublist_size):
      sublists.append(lst[i:i + sublist_size])
    return sublists


  def index(self):
    sentence_vectors = []

    print("Start indexing!")
    for sentence_ids in tqdm(self.split_list(self.dataset, self.batch_size)):
      # Retrieve sentences based on qid
      sentences = [self.question_map[qid] for qid in sentence_ids]
      # Get embeddings of the sentences (Spacy, ..., OpenAI, Cohere)
      sentence_vectors_batch = self.vectorizer.vectorize(sentences)
      # Add batch to temporary list
      sentence_vectors.append(sentence_vectors_batch)

    # Add all batches from temporary list to index
    self.faiss_index.add(np.array(np.concatenate(sentence_vectors, axis=0)))
    print("\nDone indexing!")


  def search(self, question: str, top_k: int, return_ids=False):
    """Given any sentence (typed by the user)
    We return a list of top_k(sentence, sim_score) or top_k(sentence_ids, sim_score)

    NOTE: The output type is controlled by the return_ids flag

    1. Call the search API on the faiss_index to look up similar sentences
       using `faiss_index.search`
    2. Parse the output to either return [sentence_id, score] tuples or
       [sentence, score] tuples based on return_ids being true/false
    3. Sort the output by the score in descending order
    """

    # NOTE: We converted the question to a list here to match the signature
    # of the vectorize function
    question_vectors = self.vectorizer.vectorize([question])

    ### TO BE IMPLEMENTED ###
    # faiss returns (scores, indices), one row per query. The index positions
    # match the question ids here because the sampled ids start at 0.
    scores, indices = self.faiss_index.search(question_vectors, top_k)
    if return_ids:
      output = list(zip(indices[0], scores[0]))
    else:
      output = list(zip([self.question_map[index] for index in indices[0]], scores[0]))
    ### TO BE IMPLEMENTED ###

    # Output is a List[(qid, score), (qid, score), (qid, score)] or
    # List[(q, score), (q, score), (q, score)] based on return_ids
    # Output is sorted in descending order of score
    return output


  def evaluate(self, top_k: int, eval_sample_size: int):
    """Sample eval_sample_size pairs from the evaluation dataset and then check
    if the qid2 is present in the top-k results

    1. For each eval sample, find the top_k matches for the qid1
    2. See if the qid2 is in one of the matches
    3. If yes, append (1) to the recall array otherwise append (0)
    4. Implement MRR (Mean reciprocal rank) addition based on the position of qid2 in matches
      - Note: MRR is equivalent to mean([1/r or 0 for each sample])
    """
    # Sample from evaluation dataset as proxy for performance metrics
    eval_sample = random.sample(self.eval_dataset, eval_sample_size)

    # Retrieval metrics which only care about if searched for
    # item is present among the results.
    recall_at_k = [] # Relevant items vs total of relevant items
    mean_reciprocal_rank = [] # Rank of the first relevant item

    ### TO BE IMPLEMENTED ###
    # 1. For each eval sample, find the top_k matches for the qid1
    for qid1, qid2 in tqdm(eval_sample, desc="Evaluating random samples"):
      matches = self.search(self.question_map[qid1], top_k, return_ids=True)
      matches = [qid for qid, _ in matches]
      # 2. See if the qid2 is in one of the matches
      # 3. If yes, append (1) to the recall array otherwise append (0)
      if qid2 in matches:
        recall_at_k.append(1)
        # 4. Implement MRR (Mean reciprocal rank) addition based on the position of qid2 in matches
        mean_reciprocal_rank.append(1 / (matches.index(qid2) + 1))
      else:
        recall_at_k.append(0)
        mean_reciprocal_rank.append(0)
    ### TO BE IMPLEMENTED ###

    print("\nRecall@{}:\t\t{:0.2f}%".format(top_k, np.mean(np.array(recall_at_k) * 100.0)))
    print("Mean Reciprocal Rank:\t{:0.2f}".format(np.mean(np.array(mean_reciprocal_rank))))


  # Helper function to train, search and evaluate similar output from all the models created.
  def train_and_evaluate(self,
                         question_example: str,
                         top_k: int = 10,
                         eval_sample_size: int = 1000
                         ):
    print("---- Indexing ----")
    self.index()
    print("\n---- Search ----")
    results = self.search(question_example, top_k, return_ids=False)
    print("Questions similar to:", question_example)
    for i, (q, s) in enumerate(results):
      print(f"{i} Question: {q} with score {s}")
    print("\n---- Evaluation ----")
    self.evaluate(top_k, eval_sample_size)

## Dummy Model Test

A really small sample of 4 sentences to test our implementation of the FAISS search function. We project the 4 questions into a 2-d space where they are placed on the X-axis if the word `invest` is present and on the Y-axis if `Kohinoor` is present.

In [34]:
dummy_ids = sample_dataset[:4]
print("Questions:")
for i in dummy_ids:
  print(i, ":", question_map[i])

Questions:
0 : What is the step by step guide to invest in share market in india?
1 : What is the step by step guide to invest in share market?
2 : What is the story of Kohinoor (Koh-i-Noor) Diamond?
3 : What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?


In [35]:
class DummyVectorizer:
  def __init__(self, sentence_vector_dim):
    self.sentence_vector_dim = sentence_vector_dim

  def vectorize(self, sentences):
    """Return toy 2-d sentence vectors for the batch of sentences.

    1. Sentences containing "invest" get a random point on the X-axis
    2. Sentences containing "Kohinoor" get a random point on the Y-axis
    3. Stack the sentence vectors into a numpy array using np.stack

    NOTE: Sentences containing neither word are skipped.
    """
    vectors = []
    for sentence in sentences:
      if "invest" in sentence:
        # If "invest" is present place it on the X-Axis
        vectors.append(np.array([random.random(), 0], dtype=np.float32))
      elif "Kohinoor" in sentence:
        # If "Kohinoor" is present place it on the Y-Axis
        vectors.append(np.array([0, random.random()], dtype=np.float32))
    return np.stack(vectors)


di = FaissIndexer(dummy_ids,
                  question_map,
                  sample_eval_dataset,
                  batch_size=1024,
                  sentence_vector_dim=2,
                  vectorizer=DummyVectorizer(2)
                  )

di.index()

results = di.search("invest", 4)
print("Questions similar to:", "invest")
for i, (q, s) in enumerate(results):
  print(f"{i} Question: {q} with score {s}")

results = di.search("Kohinoor", 4)
print("\nQuestions similar to:", "Kohinoor")
for i, (q, s) in enumerate(results):
  print(f"{i} Question: {q} with score {s}")

Start indexing!


100%|██████████| 1/1 [00:00<00:00, 13662.23it/s]


Done indexing!
Questions similar to: invest
0 Question: What is the step by step guide to invest in share market? with score 0.6792926788330078
1 Question: What is the step by step guide to invest in share market in india? with score 0.3259030282497406
2 Question: What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? with score 0.0
3 Question: What is the story of Kohinoor (Koh-i-Noor) Diamond? with score 0.0

Questions similar to: Kohinoor
0 Question: What is the story of Kohinoor (Koh-i-Noor) Diamond? with score 0.1971409022808075
1 Question: What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? with score 0.009744079783558846
2 Question: What is the step by step guide to invest in share market? with score 0.0
3 Question: What is the step by step guide to invest in share market in india? with score 0.0





# Models

You may be wondering, "When are we going to start building models?" And, the answer is NOW! Finally the time has come to build our baseline model, and then we'll work towards improving it.


**NOTE**: We will be using the sample dataset since BERT is really slow and processing the full dataset will take a lot of time.

### Model 1: Averaging Word Vectors --- TO BE COMPLETED
##### <font color='red'>Expected recall@10: ~5%, MRR: ~0.02</font>

Complete the `vectorize` function using the word embeddings provided by spaCy. This is something we've done already :)

Implementation:

1. Tokenize each sentence and get wordVectors for each token in the sentence using Spacy
2. Sentence vector is the mean of word vectors of each token
3. Stack the sentence vectors into a numpy array using np.stack
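
The three steps can be sketched with a toy vocabulary instead of spaCy. The 3-d vectors, the whitespace tokenizer, and the zero-vector fallback for sentences with no known tokens are all assumptions made for this illustration; the real spaCy vectors used in this notebook are 300-d.

```python
import numpy as np

# Hypothetical 3-d word vectors standing in for spaCy's word embeddings
toy_vectors = {
    "invest": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "in": np.array([0.0, 1.0, 0.0], dtype=np.float32),
    "stocks": np.array([1.0, 0.0, 1.0], dtype=np.float32),
}

def sentence_vector(sentence, vectors, dim=3):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    tokens = sentence.lower().split()
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)

# Stack one vector per sentence into a (num_sentences, dim) array
batch = np.stack([sentence_vector(s, toy_vectors) for s in ["Invest in stocks", "hello world"]])
print(batch.shape)
print(batch)
```

Averaging discards word order, so sentences made of the same words in a different order get the same vector.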

In [36]:
class SpacyVectorizer:
  def __init__(self, sentence_vector_dim):
    self.sentence_vector_dim = sentence_vector_dim

  def vectorize(self, sentences):
    """Return sentence vectors for the batch of sentences.

    1. Tokenize each sentence and create vectors for each token in the sentence
    2. Sentence vector is the mean of word vectors of each token
    3. Stack the sentence vectors into a numpy array using np.stack
    """
    vectors = []
    for sentence in sentences:

      ### TO BE COMPLETED ###
      sentence_vector = np.mean([token.vector for token in nlp.make_doc(sentence) if token.has_vector], axis=0)
      ### TO BE COMPLETED ###

      vectors.append(sentence_vector)
    return np.stack(vectors)

spacyIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=1024,
                  sentence_vector_dim=300,
                  vectorizer=SpacyVectorizer(300))

spacyIndex.train_and_evaluate(question_example = "how can i invest in stock market in india?")

---- Indexing ----
Start indexing!


100%|██████████| 10/10 [00:00<00:00, 25.03it/s]



Done indexing!

---- Search ----
Questions similar to: how can i invest in stock market in india?
0 Question: In how many ways can we create object in Java? with score 1264.394287109375
1 Question: How can we find happiness in life? with score 1203.2908935546875
2 Question: I want to connect with you, how can I do that? with score 1198.193359375
3 Question: How can I can concentrate well in studies? with score 1197.028564453125
4 Question: what can i do to become fair? with score 1191.52001953125
5 Question: Why do we need to study? with score 1189.1204833984375
6 Question: Minimum marks in NEET2017 to get admission in IISC? with score 1185.0025634765625
7 Question: How can we earn money online in india? with score 1163.95947265625
8 Question: How can I be happy if I don't have any reason to be? with score 1163.731201171875
9 Question: What to do when you don't want to do? with score 1161.207275390625

---- Evaluation ----


Evaluating random samples: 100%|██████████| 1000/1000 [00:00<00:00, 2665.79it/s]


Recall@10:		5.00%
Mean Reciprocal Rank:	0.02





### Model 2: BERT Embeddings --- TO BE COMPLETED
##### <font color='red'>Expected recall@10: ~48%, MRR: ~0.19</font>

Compute the sentence embeddings using the BERT model and complete the `vectorize` function. Feel free to reference any documentation from https://huggingface.co/.


Implementation:

1. Tokenize batch of sentences using `self.tokenizer`
2. Pipe the inputs through the BERT model to get the last hidden states (one vector per token)
3. Mean-pool over the tokens and normalize the batch output

**NOTE: This model is really slow and will take a while to run**
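
One variation worth knowing about: when a batch is padded, a plain mean over the token axis also averages the padding positions. The sketch below shows mean pooling that uses the attention mask to skip them, written in numpy on made-up tensors. This is an alternative to a plain mean, not the reference solution for this model, and it may give numbers that differ from the expected metrics quoted here; `masked_mean_pool` and `l2_normalize` are hypothetical helper names.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Mean over the token axis, counting only positions where the mask is 1."""
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)  # (batch, tokens, 1)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against division by zero
    return summed / counts

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy batch: 2 sentences, 3 token positions, hidden size 2 (values made up)
hidden = np.array([
    [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]],  # last position is padding
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],  # no padding
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
unit = l2_normalize(pooled)
print(pooled)
print(unit)
```

The padded position `[9.0, 9.0]` in the first sentence has no effect on the pooled vector, which it would with an unmasked mean.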

In [16]:
class BertVectorizer:
  def __init__(self):
    self.tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    self.model = DistilBertModel.from_pretrained('distilbert-base-uncased')

  def vectorize(self, sentences):
    """Return sentence vectors for the batch of sentences.

    1. Tokenize batch of sentences using `self.tokenizer`
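    2. Pipe the inputs through the BERT model to get the last hidden states
    3. Mean-pool over the tokens and normalize the batch output
    """

    ### TO BE COMPLETED ###
    inputs = self.tokenizer(sentences, return_tensors='pt', padding=True)
    # Inference only, so skip gradient tracking to save time and memory
    with torch.no_grad():
      outputs = self.model(**inputs)
    # Shape: (batch, tokens, hidden_dim)
    model_output = outputs.last_hidden_state
    ### TO BE COMPLETED ###

    # Average over the token axis, then L2-normalize each sentence vector
    return F.normalize(torch.mean(model_output, dim=1), dim=1).detach().numpy()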


bertIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=32,
                  sentence_vector_dim=768,
                  vectorizer=BertVectorizer())

bertIndex.train_and_evaluate(question_example = "how can i invest in stock market in india?")

---- Indexing ----
Start indexing!


100%|██████████| 313/313 [01:05<00:00,  4.80it/s]



Done indexing!

---- Search ----
Questions similar to: how can i invest in stock market in india?
0 Question: I wish to start investing in Equity and Mutual Funds. Where should I open Demat account for best rates, transaction charges and so on? I am NRI. with score 0.8770731091499329
1 Question: What is the step by step guide to invest in share market in india? with score 0.874489426612854
2 Question: What are mutual funds and which is the best one in India in which to invest? with score 0.8723899126052856
3 Question: What will be the effect of banning 500 and 1000 notes on stock markets in India? with score 0.8636163473129272
4 Question: What will be the effect of banning 500 and 1000 Rs notes on real estate sector in India? Can we expect sharp fall in prices in short/long term? with score 0.861491322517395
5 Question: What are your views on Modi governments decision to demonetize 500 and 1000 rupee notes? How will this affect economy? with score 0.8532259464263916
6 Question: What i

### Model 3: Sentence Transformer --- TO BE COMPLETED
##### <font color='red'>Expected recall@10: ~92%, MRR: ~0.34</font>

Compute the sentence embeddings using the Sentence BERT model and complete the `vectorize` function. Feel free to look up documentation on https://www.sbert.net/.

Implementation:

1. Pipe the input sentences through the Sentence BERT model to create the sentence embeddings
2. Normalize the batch output
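
Why normalize? Our index uses the inner product, and for unit-length vectors the inner product equals cosine similarity. A small numeric check with made-up 2-d vectors illustrates the difference (the `normalize` helper is defined here only for the example):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])  # same direction as a, twice as long
c = np.array([5.0, 0.0])  # different direction

def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

# Raw inner products grow with vector length
raw_ab = a @ b
raw_ac = a @ c

# After normalization the inner product is the cosine of the angle between the vectors
cos_ab = normalize(a) @ normalize(b)
cos_ac = normalize(a) @ normalize(c)

print(raw_ab, raw_ac)
print(cos_ab, cos_ac)
```

The raw scores are 50 and 15, while the normalized scores are 1.0 and 0.6, so after normalization the score reflects direction only and is bounded by 1.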


In [17]:
class SentenceBertVectorizer:
  def __init__(self):
    self.model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

  def vectorize(self, sentences):
    """Return sentence vectors for the batch of sentences.

    1. Pipe the input sentences through the Sentence BERT model to create the sentence embeddings
    2. Normalize the batch output
    """

    ### TO BE COMPLETED ###
    sentence_vectors = self.model.encode(sentences)
    ### TO BE COMPLETED ###

    return sentence_vectors / np.expand_dims(np.linalg.norm(sentence_vectors, axis=1), axis=1)


SBertIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=1024,
                  sentence_vector_dim=384,
                  vectorizer=SentenceBertVectorizer())

SBertIndex.train_and_evaluate(question_example = "how can i invest in stock market in india?")

---- Indexing ----
Start indexing!


100%|██████████| 10/10 [00:18<00:00,  1.83s/it]



Done indexing!

---- Search ----
Questions similar to: how can i invest in stock market in india?
0 Question: What is the step by step guide to invest in share market in india? with score 0.733176589012146
1 Question: I am 17 and I want to invest money in stock market where should I start? with score 0.6957337260246277
2 Question: What are the ways to learn about stock market? with score 0.6243616342544556
3 Question: How do I start investing in shares or stocks? What is the minimum requirement? with score 0.6239825487136841
4 Question: What is the best way to learn about stock market? with score 0.6222878098487854
5 Question: What is the step by step guide to invest in share market? with score 0.6042823195457458
6 Question: What is the best way to learn about investing in the stock market and what stocks to buy? with score 0.6032655835151672
7 Question: What is the best way to learn about stock markets? with score 0.5846708416938782
8 Question: How do I buy stocks? with score 0.57780

# OPTIONAL
**This section requires a paid account with OpenAI.  It is completely optional and can be skipped.**
### Model 4: OpenAI Text Embeddings
##### <font color='red'>Expected recall@10: ~92%, MRR: ~0.32</font>

Make sure to create an OpenAI account and generate an API key.
Compute the sentence embeddings using the OpenAI API and complete the `vectorize` function. Feel free to look up documentation on https://platform.openai.com/docs/api-reference/embeddings.

Implementation:

1. Pipe the input sentences through the OpenAI API.


In [41]:
import os
import openai

# Read the API key from the environment rather than hardcoding it in the notebook
client = openai.Client(api_key=os.environ["OPENAI_API_KEY"])

embeddings_model = "text-embedding-3-small"

class OpenAIVectorizer:
  def vectorize(self, sentences):
    """Return sentence vectors for the batch of sentences.

    See the OpenAI API documentation for more details: https://platform.openai.com/docs/guides/embeddings/use-cases
    """

    ### TO BE COMPLETED ####
    response = client.embeddings.create(
        input=sentences,
        model=embeddings_model
    )

    sentence_vectors = [data.embedding for data in response.data]
    ### TO BE COMPLETED ###

    # Convert from float64 to float32 to prevent bug:
    # https://github.com/facebookresearch/faiss/issues/461

    return np.float32(np.array(sentence_vectors) / np.expand_dims(np.linalg.norm(sentence_vectors, axis=1), axis=1))

openaiIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=2048,
                  sentence_vector_dim=1536, # This is the length of the OpenAI embeddings model "text-embedding-3-small"
                  vectorizer=OpenAIVectorizer())

openaiIndex.train_and_evaluate(question_example = "how can i invest in stock market in india?",
                               eval_sample_size = 100)

---- Indexing ----
Start indexing!


100%|██████████| 5/5 [00:28<00:00,  5.78s/it]



Done indexing!

---- Search ----
Questions similar to: how can i invest in stock market in india?
0 Question: What is the step by step guide to invest in share market in india? with score 0.7989835739135742
1 Question: What is the step by step guide to invest in share market? with score 0.6763179898262024
2 Question: How do I buy stocks? with score 0.6537752151489258
3 Question: I am 17 and I want to invest money in stock market where should I start? with score 0.6437757611274719
4 Question: How can I make money online in India? with score 0.6310142278671265
5 Question: How can we earn money online in india? with score 0.6182802319526672
6 Question: What should I do to make money online in India? with score 0.6099871397018433
7 Question: What are the ways to learn about stock market? with score 0.6030036211013794
8 Question: What is the best way to learn about stock market? with score 0.5974246263504028
9 Question: What are mutual funds and which is the best one in India in which to i

Evaluating random samples: 100%|██████████| 100/100 [00:49<00:00,  2.02it/s]


Recall@10:		92.00%
Mean Reciprocal Rank:	0.32





🎉 CONGRATULATIONS on finishing the assignment!!! We built a real model with an actual dataset for a problem that comes up every time a new Quora question gets created!!

As for why SentenceBERT & OpenAI embeddings performed so well, we'll cover that with Siamese networks in Week 4.

# Extensions

Now that you've worked through the project there is a lot more for us to try:

- Try out `SentenceBertVectorizer` and `SpacyVectorizer` on the entire dataset rather than the sample and see what you get.
- Try different transformer models from Hugging Face