# Finding similar texts with embeddings

* Area: Data-driven prototypes.
* Learning objectives: 
  * Practice sentence embeddings
  * Understand retrieval-augmented generation
  * Understand vector databases




## Data preparation
Load hotel review data. 

In [1]:
import pandas as pd

df = pd.read_csv('~/data/Hotel_Reviews.csv')
df.head(10)

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng
0,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Russia,I am so angry that i made this post available...,397,1403,Only the park outside of the hotel was beauti...,11,7,2.9,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
1,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Ireland,No Negative,0,1403,No real complaints the hotel was great great ...,105,7,7.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
2,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,Australia,Rooms are nice but for elderly a bit difficul...,42,1403,Location was good and staff were ok It is cut...,21,9,7.1,"[' Leisure trip ', ' Family with young childre...",3 days,52.360576,4.915968
3,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,United Kingdom,My room was dirty and I was afraid to walk ba...,210,1403,Great location in nice surroundings the bar a...,26,1,3.8,"[' Leisure trip ', ' Solo traveler ', ' Duplex...",3 days,52.360576,4.915968
4,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/24/2017,7.7,Hotel Arena,New Zealand,You When I booked with your company on line y...,140,1403,Amazing location and building Romantic setting,8,3,6.7,"[' Leisure trip ', ' Couple ', ' Suite ', ' St...",10 days,52.360576,4.915968
5,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/24/2017,7.7,Hotel Arena,Poland,Backyard of the hotel is total mess shouldn t...,17,1403,Good restaurant with modern design great chil...,20,1,6.7,"[' Leisure trip ', ' Group ', ' Duplex Double ...",10 days,52.360576,4.915968
6,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/17/2017,7.7,Hotel Arena,United Kingdom,Cleaner did not change our sheet and duvet ev...,33,1403,The room is spacious and bright The hotel is ...,18,6,4.6,"[' Leisure trip ', ' Group ', ' Duplex Twin Ro...",17 days,52.360576,4.915968
7,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/17/2017,7.7,Hotel Arena,United Kingdom,Apart from the price for the brekfast Everyth...,11,1403,Good location Set in a lovely park friendly s...,19,1,10.0,"[' Leisure trip ', ' Couple ', ' Duplex Double...",17 days,52.360576,4.915968
8,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/9/2017,7.7,Hotel Arena,Belgium,Even though the pictures show very clean room...,34,1403,No Positive,0,3,6.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",25 days,52.360576,4.915968
9,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/8/2017,7.7,Hotel Arena,Norway,The aircondition makes so much noise and its ...,15,1403,The room was big enough and the bed is good T...,50,1,7.9,"[' Leisure trip ', ' Couple ', ' Large King Ro...",26 days,52.360576,4.915968


## Part 1: Working with embeddings

### First: Bag of words representation

We want to extract a representation from the positive reviews of the hotels. Let's try with the common Bag-of-Words first:

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=10, max_df = 0.8, stop_words='english')
vectors = vectorizer.fit_transform(df.Positive_Review)
vectors.shape

(515738, 8783)

We get a representation based on word counts. It is fixed-length but has two problems:

1) It needs one entry per vocabulary word, which leads to large vectors (8783 dimensions here).
2) Counting words is not the most "semantic" representation (two sentences can be similar in meaning and still share no words).
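The second problem can be illustrated with a small sketch. It uses plain Python word counts instead of `CountVectorizer` (no stop-word removal or frequency thresholds), and the example sentences are made up for illustration:

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between the raw word-count vectors of two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only words present in both texts contribute to the dot product
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Same meaning, no shared words: the count vectors are orthogonal
sim_paraphrase = bow_cosine("staff were friendly", "employees are kind")
# Opposite meaning, but three of four words shared
sim_overlap = bow_cosine("the room was clean", "the room was dirty")
print(sim_paraphrase, sim_overlap)  # 0.0 0.75
```

With word counts, the paraphrase looks completely unrelated while the contradictory pair looks very similar.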


### Introducing Sentence Transformers

Sentence Transformers are a family of models (and the name of the corresponding Python package) that compute sentence embeddings.

The Python implementation is straightforward: we import the library and initialize a `SentenceTransformer` object.

The argument to `SentenceTransformer` is the name of the model. The chosen model, `'all-MiniLM-L6-v2'`, is one of the smaller ones, but you can use other models from the Hugging Face model hub.

See [sbert](https://www.sbert.net/) for documentation and other model names.

In [3]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')



Then, `model.encode` computes the embeddings.

For those familiar with neural networks, this is a forward pass. Behind the scenes, it uses the `pytorch` package to load and evaluate the model.

In [4]:
# This will take a few minutes 
embeddings = model.encode(df.Positive_Review, show_progress_bar=True)

Batches:   0%|          | 0/16117 [00:00<?, ?it/s]

Note that the embeddings are much smaller than the Bag-of-Words vectors (384 dimensions instead of 8783):

In [5]:
embeddings.shape

(515738, 384)
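A quick back-of-the-envelope check ties these numbers together. This is only a sketch: the batch size of 32 and the `float32` output type are assumptions about the defaults of `model.encode`, not something the notebook sets explicitly.

```python
import math

n_reviews = 515738     # rows of the embeddings array
dim = 384              # embedding dimension of all-MiniLM-L6-v2
batch_size = 32        # assumption: default batch size of model.encode
bytes_per_float = 4    # assumption: embeddings are stored as float32

n_batches = math.ceil(n_reviews / batch_size)
size_mb = n_reviews * dim * bytes_per_float / 1e6

print(n_batches)       # 16117, the total shown by the progress bar
print(round(size_mb))  # 792 (megabytes held in memory)
```

Under these assumptions, the dense embedding matrix takes roughly 0.8 GB of RAM, which still fits in memory on most machines.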

### Test if similar embeddings lead to similar texts

To check this, we write a small Python function that computes the cosine distance between the embedding of a given review and all the embeddings in the data, and then prints the closest reviews.

In [6]:
from scipy.spatial.distance import cdist
import numpy as np

def print_top_N(vectors, id, N=10):
    """
    Print the top N closest reviews to the review at index id
    
    vectors: np.array
        The vectors to search
    id: int
        The index of the review to search for
    N: int
        The number of reviews to print
    """
    # Cosine distance between the selected review and every review
    distances = cdist(vectors[id,:].reshape(1,-1), vectors, 'cosine')
    # Position 0 is normally the review itself (distance 0), so skip it
    top_n = np.argsort(distances)[0][1:(N+1)]

    print(f"Orig: {df.Positive_Review.values[id]}")
    print("-"*(len(df.Positive_Review.values[id])+6))
    for i, idx in enumerate(top_n):
        print(f"Top {i}: {df.Positive_Review.values[idx]}")


In [7]:
print_top_N(embeddings, 2)

Orig:  Location was good and staff were ok It is cute hotel the breakfast range is nice Will go back 
-----------------------------------------------------------------------------------------------------
Top 0:  The staff especially at the breakfast were fantastic The hotel in general is good with very good location
Top 1:  Perfect location staff was really friendly and helpful Breakfast was delicious with loads of different choices Will definitely be back Really great hotel 
Top 2:  Location is excellent The staff were friendly and the breakfast was vey good 
Top 3:  The location is really good and staffs are really kind Breakfast was okay 
Top 4:  Location is very nice and staff is friendly the breakfast was good as well 
Top 5:  Location is very good and extremely helpful staff breakfast was okay the hotel was everything it described it would be 
Top 6:  Great location Staff was very friendly and helpful Hotel breakfast was very good 
Top 7:  Location was great breakfast very good S
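
To see what the `'cosine'` metric of `cdist` measures, here is a minimal NumPy sketch on toy 2-D vectors. The helper `cosine_distances` and the toy data are illustrative and not part of the notebook:

```python
import numpy as np

def cosine_distances(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine distance (1 - cosine similarity) from `query` to each row of `matrix`."""
    query_unit = query / np.linalg.norm(query)
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return 1.0 - matrix_unit @ query_unit

toy = np.array([
    [1.0, 0.0],   # index 0: used as the query
    [2.0, 0.0],   # same direction, different length
    [1.0, 1.0],   # 45 degrees away
    [0.0, 1.0],   # orthogonal
])
distances = cosine_distances(toy[0], toy)
# A stable sort keeps the original order among equal distances
ranking = np.argsort(distances, kind="stable")
print(distances)
print(ranking)  # [0 1 2 3]
```

Only the direction of a vector matters, not its length, and the query itself comes first with distance 0. This is why `print_top_N` skips the first position of the ranking.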

## Part 2: Simulating RAG

In this part, we will simulate RAG by designing a prompting strategy and manually computing the entries most similar to the query.

In [8]:
import cohere
COHERE_API_KEY = "your-key"
cohere_client = cohere.Client(COHERE_API_KEY)

First, let's consider a query such as the following. 

Note that Cohere also provides embedding methods! The embedding coming from an LLM is probably better than the one from the "small" SentenceTransformer model we used.

We call the `embed` function from the Cohere SDK and specify `search_query` as the input type. We're choosing an embedding suitable for similarity search (other input types exist for classification, for instance).

In [9]:
query_message = "I love this product"
response = cohere_client.embed(texts=[query_message], model="embed-english-v3.0", input_type="search_query")

This embedding is larger than the SentenceTransformer one (384 dimensions):

In [10]:
len(response.embeddings[0])

1024

For this example, we'll stick with the sentence transformer; otherwise we would have to send all 500K reviews to the Cohere API to embed them.

We now simulate RAG with two calls to the LLM. Note that the strategy is very similar to that of Tools in notebook `5_llms/chat_with_query.ipynb`.

In the first call, we determine whether we can answer directly or need to gather more information:

In [11]:
prompt = """
The user is going to ask a question. You should respond normally.
However, if the user asks for hotels with good reviews in a certain
category, respond REQUEST_HOTEL and then a description of the information. For example,
if the user asks for "hotels with good breakfast and service",
you should respond "REQUEST_HOTEL: good breakfast and service".
The message by the user is: 
"""

user_request = input("What is your request? ")

response = cohere_client.chat(
    message=prompt + user_request,
)

In [12]:
response.text

'REQUEST_HOTEL: Good breakfast and orange juice.'
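
The reply is free text, so parsing it deserves a little care: the model may add whitespace or put a colon inside the description. A minimal sketch of a defensive parser follows, where `parse_reply` is a hypothetical helper and not part of the Cohere SDK:

```python
from typing import Optional

MARKER = "REQUEST_HOTEL"

def parse_reply(text: str) -> Optional[str]:
    """Return the search description if the reply starts with the marker, else None."""
    # partition splits on the first colon only, so later colons stay in the description
    head, sep, tail = text.strip().partition(":")
    if sep and head.strip() == MARKER:
        return tail.strip()
    return None

print(parse_reply("REQUEST_HOTEL: Good breakfast and orange juice."))
print(parse_reply("Sure, I can help with that."))  # None
```

A return value of `None` means the model answered directly and no retrieval step is needed.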

In the second call, we invoke the LLM with additional information from the texts most similar to the query: 

In [13]:
# Split on the first colon only, so descriptions containing ":" stay intact
parts = response.text.split(":", 1)
if len(parts) == 2 and parts[0].strip() == "REQUEST_HOTEL":
    # Perform search
    query = parts[1].strip()
    # Compute the embedding of the query
    query_vector = model.encode([query])
    # Compute all distances and get top 10
    distances = cdist(query_vector, embeddings, 'cosine')
    top10 = np.argsort(distances)[0][0:10]

    # Build the second prompt from the top 10 reviews
    second_prompt = "Here are some reviews of specific hotels:\n"
    for idx in top10:
        # Add positive review and hotel name to response
        second_prompt += f"Hotel: {df.Hotel_Name.values[idx]}\n"
        second_prompt += f"Review: {df.Positive_Review.values[idx]}\n\n"
    print("Second prompt (for debugging):")
    print(second_prompt)

    # Finally, do the second call with all the information:
    response2 = cohere_client.chat(
        message="Now respond to the original request of the user, but take into account the following information:\n" + second_prompt,
        chat_history=response.chat_history,
    )

Second prompt (for debugging):
Here are some reviews of specific hotels:
Hotel: Best Western Premier Hotel Couture
Review:  Breakfast is good Fresh orange juice 

Hotel: The Queens Gate Hotel
Review:  Fantastic fresh orange juice at breakfast and good variety of food 

Hotel: Best Western Premier Hotel Couture
Review:  breakfast very good with huge choice really enjoyed freshly squeezed orange juice 

Hotel: Brunelleschi Hotel
Review:  real fresh orange juice at breakfast

Hotel: Bcn Urban Hotels Gran Rosellon
Review:  Real orange juice in the breakfast 

Hotel: The Queens Gate Hotel
Review:  Great breakfasts with fresh squeezed orange juice 

Hotel: Worldhotel Cristoforo Colombo
Review:  Nice and professional stuff Freshly squeezed orange juice for breakfast daily 

Hotel: DoubleTree by Hilton London Docklands Riverside
Review:  All you can eat breakfast with readily available fresh orange juice

Hotel: H tel des Champs Elys es
Review:  Quiet room Reasonable breakfast with great fresh

Inspect the response:

In [14]:
response2.text

"I can recommend several hotels that fit your criteria of offering a good breakfast with fresh orange juice, based on the reviews I've seen:\n\n- The Queens Gate Hotel: Multiple reviews highlight the fantastic fresh orange juice and a good variety of food at breakfast.\n- Best Western Premier Hotel Couture: Guests have praised the very good breakfast with a huge choice, including freshly squeezed orange juice.\n- Worldhotel Cristoforo Colombo: This hotel offers freshly squeezed orange juice at breakfast, and guests have also commended the nice and professional staff.\n- DoubleTree by Hilton London Docklands Riverside: Features an all-you-can-eat breakfast with readily available fresh orange juice.\n- H tel des Champs Elys es: Reviewers have noted a reasonable breakfast with great fresh orange juice, along with quiet rooms.\n- Atahotel Contessa Jolanda: Known for its good breakfast, including items like sausages, cheese, and magnificent orange juice.\n- Brunelleschi Hotel and Bcn Urban 

## Part 3: Introducing Vector Databases

This part briefly illustrates the concept of vector databases. So far we kept the embeddings in memory and used `cdist` to compute the similarities, which does not scale when you have longer embeddings and millions of texts.
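
To make the scaling problem concrete, here is a sketch of brute-force search written with plain NumPy. The random vectors are stand-ins for the review embeddings, and `brute_force_top_k` is a hypothetical helper:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: 10,000 random vectors instead of the real review embeddings
vectors = rng.normal(size=(10_000, 384)).astype(np.float32)
# Normalize once, so that cosine similarity becomes a plain dot product
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def brute_force_top_k(query: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of the k most similar rows, found by scoring every stored vector."""
    query_unit = query / np.linalg.norm(query)
    scores = unit_vectors @ query_unit               # one dot product per stored vector
    candidates = np.argpartition(-scores, k)[:k]     # the k best, in arbitrary order
    return candidates[np.argsort(-scores[candidates])]  # sort them, best first

top = brute_force_top_k(vectors[42], k=5)
print(top)  # index 42 comes first: the query matches itself
```

Every query touches all stored vectors, so the cost of a single search grows linearly with the size of the collection.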

Vector databases allow indexing embeddings so that they can be retrieved with similarity queries efficiently.

They can also store additional information alongside each embedding.

We will illustrate Qdrant, one of the most popular vector databases right now. To follow this example, you'll need to run Qdrant in a Docker container, following the instructions [in the Qdrant documentation](https://qdrant.tech/documentation/quick-start/).

If you don't have Docker, think of it as a way to run a server hosting the vector database on your own machine; the server exposes an API to insert and retrieve data.

Once the Docker container is running, you can use the following code. It connects to the database and creates a collection:

In [15]:
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient("http://localhost:6333")

client.create_collection(
    "hotel_reviews",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE)
)

True

This prepares the data in the format needed for insertion into the database. Note:
- We only insert the first 1000 reviews so that the example runs in a reasonable time; inserting can take a while.
- We can use the `payload` parameter to attach additional data in the form of a dictionary.

In [16]:
from qdrant_client.models import PointStruct

points = [
    PointStruct(
        id=i,
        vector=embeddings[i].tolist(),
        payload={"text": df.Positive_Review[i], "hotel_name": df.Hotel_Name[i]},
    )
    for i in range(1000)
]

Finally, this performs the insert operation:

In [17]:
opinfo = client.upsert(
    collection_name="hotel_reviews",
    wait=True,
    points=points
)
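
When inserting many more points than the 1000 used here, a common approach is to send them in smaller batches rather than in one large request. The sketch below shows a hypothetical `batched` helper; the batch size of 256 is an arbitrary choice, and the `upsert` usage is shown only as a comment because it needs a running server:

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

# With the notebook's objects, usage would look like this:
# for batch in batched(points, 256):
#     client.upsert(collection_name="hotel_reviews", wait=True, points=batch)

sizes = [len(batch) for batch in batched(list(range(1000)), 256)]
print(sizes)  # [256, 256, 256, 232]
```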

Once the vectors are inserted, we can perform a query with a new vector: 

In [18]:
query_embedding = model.encode(["I love hotels with great orange juice for breakfast"])
search_result = client.search(
    collection_name="hotel_reviews",
    query_vector=query_embedding.tolist()[0],
    limit=5
)
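
To inspect the results, we can print the score and payload of each hit. The sketch below assumes each hit exposes `score` and `payload` attributes, as the scored points returned by `qdrant_client` do. `format_hits` is a hypothetical helper, and the stand-in hits contain made-up values so that the code runs without a server:

```python
from types import SimpleNamespace

def format_hits(hits) -> list:
    """Turn search hits into printable lines: score, hotel name, review text."""
    lines = []
    for hit in hits:
        payload = hit.payload or {}  # the payload may be missing
        lines.append(
            f"{hit.score:.3f} | {payload.get('hotel_name', '?')} | {payload.get('text', '')}"
        )
    return lines

# Stand-in hits with made-up values, shaped like the search results
fake_hits = [
    SimpleNamespace(score=0.71, payload={"hotel_name": "Hotel A", "text": "Great juice"}),
    SimpleNamespace(score=0.65, payload=None),
]
for line in format_hits(fake_hits):
    print(line)
```

With the real client, `format_hits(search_result)` would list the five closest reviews among the 1000 inserted ones.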