# Introduction

This notebook uses embeddings to build a search engine. It shows how to prepare a search that understands natural-language queries and returns relevant results. In the next notebook, we will use this search to enhance the response from the large language model.

In [13]:
import pandas as pd 

## vector database search
from qdrant_client import models, QdrantClient

## vector computing framework
from sentence_transformers import SentenceTransformer

# tensor computation library
from torch import mps

## Data Processing

Load the data and remove null values.

In [14]:
df = pd.read_csv('./data/top_rated_wines.csv')

## drop rows where 'variety' is missing
df = df[df['variety'].notna()]
data = df.to_dict('records')
df

Unnamed: 0,name,region,variety,rating,notes
0,3 Rings Reserve Shiraz 2004,"Barossa Valley, Barossa, South Australia, Aust...",Red Wine,96.0,Vintage Comments : Classic Barossa vintage con...
1,Abreu Vineyards Cappella 2007,"Napa Valley, California",Red Wine,96.0,Cappella is a proprietary blend of two clones ...
2,Abreu Vineyards Cappella 2010,"Napa Valley, California",Red Wine,98.0,Cappella is one of the oldest vineyard sites i...
3,Abreu Vineyards Howell Mountain 2008,"Howell Mountain, Napa Valley, California",Red Wine,96.0,When David purchased this Howell Mountain prop...
4,Abreu Vineyards Howell Mountain 2009,"Howell Mountain, Napa Valley, California",Red Wine,98.0,"As a set of wines, it is hard to surpass the f..."
...,...,...,...,...,...
1360,Lewis Cellars Alec's Blend Red 2002,"Napa Valley, California",Red Wine,96.0,Number 12 on
1361,Lewis Cellars Cabernet Sauvignon 2002,"Napa Valley, California",Red Wine,96.0,Showcasing the unique personalities of small h...
1362,Lewis Cellars Cuvee L Cabernet Sauvignon 2015,"Napa Valley, California",Red Wine,96.0,"Straight from James Fenimore Cooper’s novel, L..."
1363,Lewis Cellars Reserve Cabernet Sauvignon 2010,"Napa Valley, California",Red Wine,96.0,


## Process Embeddings

Embeddings are representations of the text data (in our case, the `notes` column of the wine CSV file) as vectors in a high-dimensional space. We use embeddings to compare the similarity between sentences, because vectors let us represent text in mathematical terms. In this notebook, I use cosine similarity, which measures the cosine of the angle between two vectors. Because it depends only on the direction of the vectors and not on their magnitude, it quantifies how similar two sentences are regardless of their length.
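
To make the cosine measure concrete, here is a minimal, library-free sketch. The function name `cosine_similarity` and the three-dimensional vectors are made up for illustration; real sentence embeddings have hundreds of dimensions, and the vector database computes the score for us later in the notebook.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumption: chosen only for illustration)
v = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]  # v scaled by 2: same direction, larger magnitude
orthogonal = [3.0, 0.0, -1.0]     # dot product with v is zero

print(cosine_similarity(v, same_direction))  # approximately 1.0
print(cosine_similarity(v, orthogonal))      # 0.0
```

Scaling a vector does not change its score against another vector, which is why the measure ignores magnitude and only compares direction.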

In [16]:
## load the 'all-MiniLM-L6-v2' sentence-embedding model
encoder = SentenceTransformer('all-MiniLM-L6-v2') # downloads the model locally on first use

## database to store the vectors. Since the dataset is small, we can keep it in memory.
qdrant = QdrantClient(":memory:")

In [18]:
# create a collection in the database. The collection stores the vector parameters:
# size: the dimension of the vectors produced by the encoder
# distance function: cosine

qdrant.recreate_collection(
    collection_name = "top_wines",
    vectors_config = models.VectorParams(
        size = encoder.get_sentence_embedding_dimension(),
        distance = models.Distance.COSINE
    )
)

True

In [19]:
# encode each wine's notes and upload the resulting points into the in-memory database
# payload is the metadata stored alongside each vector (here, the full wine record)
qdrant.upload_points(
    collection_name = "top_wines",
    points = [
        models.PointStruct(
            id = idx,
            vector = encoder.encode(doc['notes']).tolist(),
            payload = doc
        ) 
        for idx, doc in enumerate(data)
    ]
)
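
Conceptually, each uploaded point pairs an id and a vector with its payload, and a search ranks the stored points by similarity to a query vector. The sketch below is a toy, exhaustive version of that idea in plain Python. It is not how Qdrant is implemented; the helper names `cosine` and `top_k`, the dictionary layout, and the two-dimensional vectors are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(points, query_vector, limit=5):
    """Score every point against the query and return the best `limit` (score, point) pairs."""
    scored = [(cosine(p["vector"], query_vector), p) for p in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return scored[:limit]

# Hypothetical toy points with made-up 2-dimensional vectors
toy_points = [
    {"id": 0, "vector": [1.0, 0.0], "payload": {"name": "wine A"}},
    {"id": 1, "vector": [0.0, 1.0], "payload": {"name": "wine B"}},
    {"id": 2, "vector": [1.0, 1.0], "payload": {"name": "wine C"}},
]

for score, point in top_k(toy_points, query_vector=[2.0, 0.1], limit=2):
    print(point["payload"], "score:", round(score, 3))
```

The payload plays no part in the scoring. It simply travels with the vector so that each hit can be shown as a readable record.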

## Search with given input text

Let's search! 

In [22]:
search_text = "50 points Merlot from Napa Valley"
hits = qdrant.search(
    collection_name = "top_wines",
    query_vector = encoder.encode(search_text).tolist(),
    limit = 5
)
for hit in hits:
    print(hit.payload, "score:", hit.score)

{'name': 'Chateau Petrus 2012', 'region': 'Pomerol, Bordeaux, France', 'variety': 'Red Wine', 'rating': 96.0, 'notes': 'Blend: 100% Merlot'} score: 0.45927218395903796
{'name': 'Bartolo Mascarello Barolo 2010', 'region': 'Barolo, Piedmont, Italy', 'variety': 'Red Wine', 'rating': 96.0, 'notes': '#50 '} score: 0.4232178161722193
{'name': "Kapcsandy Family Winery State Lane Vineyard Roberta's Reserve 2014", 'region': 'Yountville, Napa Valley, California', 'variety': 'Red Wine', 'rating': 96.0, 'notes': 'Full body and superb tannin tension with polish and balance. Harmony. The layers and texture to this are phenomenal. The greatest American merlot ever made. Ranks the best in the world.'} score: 0.39703137962421087
{'name': 'Domaine Saint Prefert Chateauneuf-du-Pape Reserve Auguste Favier 2015', 'region': 'Rhone, France', 'variety': 'Red Wine', 'rating': 97.0, 'notes': '#44 '} score: 0.37906333593074604
{'name': 'Chateau Calon-Segur (Futures Pre-Sale) 2019', 'region': 'St. Estephe, Bordea