# Lesson 1 - Semantic Search

Welcome to Lesson 1. 

To access the `requirements.txt` file, go to `File` and click on `Open`.
 
I hope you enjoy this course!

### Import the required packages

In [1]:
import warnings

warnings.filterwarnings("ignore")

In [2]:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone, ServerlessSpec

from DLAIUtils import Utils

import os
import time
import torch

In [3]:
from tqdm.auto import tqdm

### Load the Dataset

In [4]:
dataset = load_dataset("quora", split="train[240000:290000]")

Downloading builder script: 100%|██████████| 2.38k/2.38k [00:00<00:00, 11.1MB/s]
Downloading readme: 100%|██████████| 5.69k/5.69k [00:00<00:00, 10.9MB/s]
Downloading data: 100%|██████████| 58.2M/58.2M [00:04<00:00, 13.6MB/s]
Downloading data files: 100%|██████████| 1/1 [00:04<00:00,  4.98s/it]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 1250.91it/s]
Generating train split: 100%|██████████| 404290/404290 [00:16<00:00, 24683.69 examples/s]


In [5]:
dataset[:5]

{'questions': [{'id': [207550, 351729],
   'text': ['What is the truth of life?', "What's the evil truth of life?"]},
  {'id': [33183, 351730],
   'text': ['Which is the best smartphone under 20K in India?',
    'Which is the best smartphone with in 20k in India?']},
  {'id': [351731, 351732],
   'text': ['Steps taken by Canadian government to improve literacy rate?',
    'Can I send homemade herbal hair oil from India to US via postal or private courier services?']},
  {'id': [37799, 94186],
   'text': ['What is a good way to lose 30 pounds in 2 months?',
    'What can I do to lose 30 pounds in 2 months?']},
  {'id': [351733, 351734],
   'text': ['Which of the following most accurately describes the translation of the graph y = (x+3)^2 -2 to the graph of y = (x -2)^2 +2?',
    'How do you graph x + 2y = -2?']}],
 'is_duplicate': [False, True, False, True, False]}

Let's collect the questions.

In [6]:
questions = []
for record in dataset["questions"]:
    questions.extend(record["text"])

# Remove the duplicates
questions = list(set(questions))
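
Note that `list(set(...))` does not keep the questions in their original order, and for strings the resulting order can differ between Python sessions because string hashing is randomized by default. Any later slice of the list may therefore pick different items on each run. If a reproducible order matters, one option is an order-preserving de-duplication. A minimal sketch, where `sample_records` is made-up data that only mimics the shape of `dataset["questions"]`:

```python
# Hypothetical sample that mimics the shape of dataset["questions"]
sample_records = [
    {"id": [1, 2], "text": ["What is the truth of life?", "What's the evil truth of life?"]},
    {"id": [3, 1], "text": ["How do I make rice?", "What is the truth of life?"]},
]

flat = []
for record in sample_records:
    flat.extend(record["text"])

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops duplicates while preserving first-seen order
unique_questions = list(dict.fromkeys(flat))

print(unique_questions)
```

This is an alternative to the `set` approach used in the lesson, not a change to it; the rest of the notebook works the same way with either list.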

Take a peek at the collected questions.

In [7]:
print("\n".join(questions[:10]))
print("-"*50)
print(f"Number of questions: {len(questions)}")

Do people really believe you can sell your soul to the devil?
What are the most dangerous trends or practices in parenting that most parents do without noticing or realizing?
What are the home remedies for pericoronitis?
What is the Combined Gas Law? How is it used?
Can alcohol stimulate height growth?
What is this bug?
Why aren't there so many scandals in soccer/football as there are in American Football?
Why is it so difficult for girls to propose first?
I feel that Lee scratch Perry has better reggae than bob marley, how about you?
How do I convince someone to share their feelings?
--------------------------------------------------
Number of questions: 88919


### Check CUDA and Set Up the Model

**Note**: "Checking CUDA" means checking whether you have access to a GPU (faster compute). In this course, we are using CPUs, so you might notice some code cells taking a little longer to run.

In [8]:
device = "cuda" if torch.cuda.is_available() else "cpu"

if device != "cuda":
    print("Sorry no cuda")

We use the *all-MiniLM-L6-v2* sentence-transformers model, which maps sentences to a 384-dimensional dense vector space.

In [9]:
model = SentenceTransformer(model_name_or_path="all-MiniLM-L6-v2", device=device)

In [10]:
query = "which city is the most populated in the world?"
xq = model.encode(query)
xq.shape

(384,)

### Set Up Pinecone

In [11]:
# Use a lowercase instance name so the Utils class is not overwritten
utils = Utils()
PINECONE_API_KEY = utils.get_pinecone_api_key()

In [None]:
pinecone = Pinecone(api_key=PINECONE_API_KEY)
INDEX_NAME = utils.create_dlai_index_name(index_name="dl-ai")

if INDEX_NAME in [index.name for index in pinecone.list_indexes()]:
    pinecone.delete_index(name=INDEX_NAME)

print(INDEX_NAME)

pinecone.create_index(
    name=INDEX_NAME,
    dimension=model.get_sentence_embedding_dimension(),
    spec=ServerlessSpec(cloud="aws", region="us-west-2"),
    metric="cosine"
)

index = pinecone.Index(name=INDEX_NAME)
print(index)
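
The index above is created with `metric="cosine"`, so the scores returned by later queries are cosine similarities between the query embedding and the stored embeddings. As a rough illustration of what that score measures, here is a pure-Python sketch on toy 3-dimensional vectors. It is not Pinecone's implementation, and `cosine_similarity` is a helper defined here only for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: score close to 1
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors: score of 0
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
# Opposite direction: score close to -1
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(same_direction, orthogonal, opposite)
```

Scores closer to 1 mean the two vectors point in nearly the same direction; the length of the vectors does not affect the score.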

### Create Embeddings and Upsert to Pinecone

In [14]:
batch_size = 200
vector_limit = 10000

questions = questions[:vector_limit]

for i in tqdm(range(0, len(questions), batch_size)):
    # Find end of batch
    i_end = min(i+batch_size, len(questions))
    # Create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # Create metadata batch
    metadatas = [{"text": text} for text in questions[i: i_end]]
    # Create embeddings
    xc = model.encode(questions[i: i_end])
    # Create records list for upsert
    records = zip(ids, xc, metadatas)
    # upsert to Pinecone
    index.upsert(vectors=records)

100%|██████████| 50/50 [01:05<00:00,  1.31s/it]
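
The loop above walks through the list in slices of `batch_size`, and `min(i+batch_size, len(questions))` keeps the last slice from running past the end of the list. A small sketch of the same slicing logic on its own, without any embedding or upsert calls (the `batch_bounds` helper is hypothetical, introduced only for illustration):

```python
def batch_bounds(n_items, batch_size):
    """Yield (start, end) index pairs covering range(n_items) in chunks."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

# 10,000 items in batches of 200, as in the cell above
bounds = list(batch_bounds(10_000, 200))
print(len(bounds), bounds[0], bounds[-1])

# When the size is not a multiple of the batch size, the last batch is shorter
small = list(batch_bounds(5, 2))
print(small)
```

The first call yields 50 batches, which matches the `50/50` shown by the progress bar.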


In [15]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 10000}},
 'total_vector_count': 10000}

### Run your Query

In [17]:
# Small helper function so we can repeat queries later
def run_query(query):
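    # Embed the query and convert the NumPy array to a plain list
    embedding = model.encode(query).tolist()

    # Retrieve the 10 most similar stored questions, with metadata but without the vectors
    results = index.query(vector=embedding, top_k=10, include_metadata=True, include_values=False)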

    for result in results["matches"]:
        print(f'{round(result["score"], 2)}: {result["metadata"]["text"]}')


In [18]:
run_query("which city has the highest population in the world?")

0.66: Which is the best city in the world to travel?
0.64: Which city has the most museums per capita?
0.61: How's the world's population determined?
0.6: What percentage of the world's population lives in developed countries?
0.55: Where will the biggest increases in population come from the next 20 years?
0.55: About 50% of the world population is concentrated between the latitudes of?
0.54: What are the largest slums in the world?
0.53: What are the world`s deadliest tourist destinations?
0.53: Which is the top worst country in the world?
0.52: Which are the worst cities of India?
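
The best score above is 0.66, so it may be that none of the 10,000 indexed questions is a close paraphrase of the query. One way to act on that is to drop matches below a chosen score before showing them. A minimal sketch, assuming the response can be read like a dict with a `matches` list, as `run_query` does. The `filter_matches` helper and the 0.6 cutoff are illustrative choices, and `sample_results` simply copies a few scores and texts from the output above:

```python
def filter_matches(results, min_score):
    """Keep only matches whose similarity score is at least min_score."""
    return [m for m in results["matches"] if m["score"] >= min_score]

# Plain dict mimicking the response shape; values copied from the output above
sample_results = {
    "matches": [
        {"score": 0.66, "metadata": {"text": "Which is the best city in the world to travel?"}},
        {"score": 0.61, "metadata": {"text": "How's the world's population determined?"}},
        {"score": 0.52, "metadata": {"text": "Which are the worst cities of India?"}},
    ]
}

kept = filter_matches(sample_results, min_score=0.6)
for match in kept:
    print(f'{round(match["score"], 2)}: {match["metadata"]["text"]}')
```

A suitable cutoff depends on the model and the data, so it is worth choosing by inspecting real query results rather than reusing the value here.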


In [19]:
query = "how do i make chocolate cake?"
run_query(query)

0.52: How do you make shepherd's pie?
0.52: How do you bake air-dry clay?
0.51: What are some ways to make Shepherd's pie?
0.5: What does red velvet cake taste like?
0.49: How do you make whipped cream without heavy cream?
0.47: How do I make rice?
0.47: Where can I find delicious cupcakes at Gold Coast?
0.46: Where can I get custom decorated cakes for any occasion at Gold Coast?
0.46: Why is banana bread considered a bread and not a cake?
0.45: Where can I get very nice and original flavor cupcakes in Gold Coast?
