# Introduction

This notebook uses embeddings to build a search engine. It shows how to prepare a search that understands natural language and returns relevant results. In the next notebook, we will use this to enhance the response from the large language model.

In [3]:
import pandas as pd 

## vector database search
from qdrant_client import models, QdrantClient

## sentence embedding framework
from sentence_transformers import SentenceTransformer

# PyTorch's Apple Metal (MPS) backend module
from torch import mps




## Data Processing

Load the data and remove null values.

In [4]:
df = pd.read_csv('./data/top_rated_wines.csv')

## remove rows where 'variety' is missing
df = df[df['variety'].notna()]
## data = df.to_dict('records')  # use this line instead to index the full dataset
data = df.sample(500).to_dict('records') # work on a sample of 500 rows to run the model faster
df

| Unnamed: 0 | name | region | variety | rating | notes |
|---|---|---|---|---|---|
| 0 | 3 Rings Reserve Shiraz 2004 | Barossa Valley, Barossa, South Australia, Aust... | Red Wine | 96.0 | Vintage Comments : Classic Barossa vintage con... |
| 1 | Abreu Vineyards Cappella 2007 | Napa Valley, California | Red Wine | 96.0 | Cappella is a proprietary blend of two clones ... |
| 2 | Abreu Vineyards Cappella 2010 | Napa Valley, California | Red Wine | 98.0 | Cappella is one of the oldest vineyard sites i... |
| 3 | Abreu Vineyards Howell Mountain 2008 | Howell Mountain, Napa Valley, California | Red Wine | 96.0 | When David purchased this Howell Mountain prop... |
| 4 | Abreu Vineyards Howell Mountain 2009 | Howell Mountain, Napa Valley, California | Red Wine | 98.0 | As a set of wines, it is hard to surpass the f... |
| ... | ... | ... | ... | ... | ... |
| 1360 | Lewis Cellars Alec's Blend Red 2002 | Napa Valley, California | Red Wine | 96.0 | Number 12 on |
| 1361 | Lewis Cellars Cabernet Sauvignon 2002 | Napa Valley, California | Red Wine | 96.0 | Showcasing the unique personalities of small h... |
| 1362 | Lewis Cellars Cuvee L Cabernet Sauvignon 2015 | Napa Valley, California | Red Wine | 96.0 | Straight from James Fenimore Cooper’s novel, L... |
| 1363 | Lewis Cellars Reserve Cabernet Sauvignon 2010 | Napa Valley, California | Red Wine | 96.0 | |
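The filter in the cell above checks only the `variety` column, while the embedding step later encodes the `notes` text. Assuming some rows may have missing or blank notes (the last row shown appears empty), the following is a minimal sketch of how such records could be screened out before encoding. The helper `has_text` and the sample records are hypothetical and not part of the original notebook.

```python
import math

def has_text(record, field="notes"):
    """Return True when record[field] is a non-empty string (not None, NaN, or blank)."""
    value = record.get(field)
    if value is None:
        return False
    # pandas represents a missing cell as float NaN after to_dict('records')
    if isinstance(value, float) and math.isnan(value):
        return False
    return isinstance(value, str) and value.strip() != ""

# Hypothetical records shaped like the output of df.to_dict('records').
records = [
    {"name": "Wine A", "notes": "Dark fruit and spice."},
    {"name": "Wine B", "notes": float("nan")},
    {"name": "Wine C", "notes": "   "},
]

clean = [r for r in records if has_text(r)]
print([r["name"] for r in clean])
```

In the notebook this could be applied as `data = [r for r in data if has_text(r)]`, at the cost of ending up with slightly fewer than 500 sampled rows.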


## Process Embeddings 
Embeddings are representations of the text data (in our case, the wine CSV file) as vectors in a high-dimensional space. We use embeddings to compare the similarity between sentences, because vectors let us represent text in mathematical terms. In this notebook, I use cosine similarity, which measures the cosine of the angle between two vectors and so quantifies how similar two sentences are regardless of their length.
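To make the cosine measure concrete, here is a small sketch in plain Python. The three-dimensional vectors are made up for illustration and merely stand in for real sentence embeddings, which have far more dimensions. The function `cosine_similarity` is a hypothetical helper, not a call from any library used in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (made up) standing in for sentence embeddings.
v = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]  # v scaled by 2: same angle, different magnitude
orthogonal = [3.0, 0.0, -1.0]     # dot product with v is 0

print(cosine_similarity(v, same_direction))  # approximately 1.0
print(cosine_similarity(v, orthogonal))      # approximately 0.0
```

Scaling a vector does not change its score, which is why the measure ignores magnitude and compares direction only.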

In [5]:
## encode using the 'all-MiniLM-L6-v2' model. 
encoder = SentenceTransformer('all-MiniLM-L6-v2') # model: download ML model locally

## database to store the vectors. The dataset is small, so we can keep it in memory.
qdrant = QdrantClient(":memory:")

In [6]:
# create a collection in the database. The collection stores the vector parameters:
# size: the embedding dimension reported by the encoder
# distance function: cosine

qdrant.recreate_collection(
    collection_name = "top_wines",
    vectors_config = models.VectorParams(
        size = encoder.get_sentence_embedding_dimension(),
        distance = models.Distance.COSINE
    )
)



True

In [7]:
# creates an index and uploads all the data into the in-memory database
# payload is the metadata 
qdrant.upload_points(
    collection_name = "top_wines",
    points = [
        models.PointStruct(
            id = idx,
            vector = encoder.encode(doc['notes']).tolist(),
            payload = doc
        ) 
        for idx, doc in enumerate(data)
    ]
)

## Search with given input text

Let's search! 

In [8]:
user_prompt = "I like Malbec wine from Spain. Which wine should I pair with my steak?"
hits = qdrant.search(
    collection_name = "top_wines",
    query_vector = encoder.encode(user_prompt).tolist(),
    limit = 5
)
for hit in hits:
    print(hit.payload, "score:", hit.score)

{'name': 'Bodega Colome Altura Maxima Malbec 2012', 'region': 'Salta, Argentina', 'variety': 'Red Wine', 'rating': 96.0, 'notes': 'Winemaker Thibaut Delmotte has crafted wines of distinction and international acclaim for Colome. He believes the Malbec from Altura Maxima Vineyard is the embodiment of two extremes - a traditional grape variety from his French origins made from the vineyard that challenges all convention in the modern viticultural world.'} score: 0.5610543624980235
{'name': 'Cavallotto Barolo Riserva Bricco Boschis (chipped wax - 3L) 2001', 'region': 'Barolo, Piedmont, Italy', 'variety': 'Red Wine', 'rating': 96.0, 'notes': 'A wine of great structure but with elegance and complexity. Initially fruity with floral and spice aromas that open up. An excellent wine for aging.'} score: 0.5499794613487093
{'name': 'Colgin Tychson Hill Vineyard Cabernet Sauvignon 2005', 'region': 'St. Helena, Napa Valley, California', 'variety': 'Red Wine', 'rating': 96.0, 'notes': 'My first impr
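Conceptually, the search above scores stored vectors against the query vector and returns the highest-scoring payloads. The sketch below imitates that idea with a brute-force scan in plain Python. It is an illustration only and makes no claim about how Qdrant is implemented internally. The names `cosine` and `brute_force_search` and the two-dimensional points are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def brute_force_search(points, query_vector, limit=5):
    """Score every stored vector against the query and return the top `limit` hits."""
    scored = [(cosine(vector, query_vector), payload) for vector, payload in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    return scored[:limit]

# Made-up (vector, payload) pairs standing in for the uploaded points.
points = [
    ([1.0, 0.0], {"name": "wine_a"}),
    ([0.0, 1.0], {"name": "wine_b"}),
    ([1.0, 1.0], {"name": "wine_c"}),
]

results = brute_force_search(points, [0.9, 0.1], limit=2)
for score, payload in results:
    print(payload, "score:", round(score, 3))
```

A full scan like this is linear in the number of points, which is acceptable for a 500-row sample but is the main reason vector databases exist for larger collections.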

In [9]:
search_result = [hit.payload for hit in hits]

In [None]:
## Connect to a locally hosted LLM server through the OpenAI-compatible client
from openai import OpenAI

client = OpenAI(
    base_url = "http://127.0.0.1:8080/v1",
    api_key = "sk_no_key_required"
)
completion = client.chat.completions.create(
    model = "LLaMA_CPP",
    messages = [
        {"role": "system", "content": "You are a chatbot wine specialist."},
        {"role": "user", "content": "I like Malbec wine from Spain. Which wine should I pair with my steak?"},
        {"role": "assistant", "content": str(search_result)}
    ]
)
# show the model's answer
print(completion.choices[0].message.content)
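The cell above passes the search results to the model as `str(search_result)`, that is, as the printed form of a Python list. As one possible alternative, not the approach used in this notebook, the payloads could be formatted into a short numbered list and placed alongside the question. The helper `build_context`, its `max_notes_chars` parameter, and the sample payloads below are all hypothetical.

```python
def build_context(search_result, max_notes_chars=300):
    """Turn a list of payload dicts into a numbered, readable text block."""
    lines = []
    for i, wine in enumerate(search_result, start=1):
        # Truncate long tasting notes so the prompt stays compact.
        notes = str(wine.get("notes", ""))[:max_notes_chars]
        lines.append(
            f"{i}. {wine.get('name')} ({wine.get('region')}), "
            f"rating {wine.get('rating')}: {notes}"
        )
    return "\n".join(lines)

# Made-up payloads with the same keys as the wine records.
sample_result = [
    {"name": "Example Malbec 2012", "region": "Example Region", "rating": 96.0,
     "notes": "Dark fruit, firm tannins."},
    {"name": "Example Cabernet 2005", "region": "Another Region", "rating": 97.0,
     "notes": "Structured and age-worthy."},
]

question = "Which wine should I pair with my steak?"
messages = [
    {"role": "system", "content": "You are a chatbot wine specialist."},
    {"role": "user",
     "content": f"Candidate wines:\n{build_context(sample_result)}\n\nQuestion: {question}"},
]
print(messages[1]["content"])
```

Whether this formatting improves the answers depends on the model, so it is worth comparing against the original `str(search_result)` version.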