<a href="https://colab.research.google.com/github/jsandino/wine-rag/blob/main/wine_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install Dependencies

In [None]:
!pip install qdrant-client==1.12.1
!pip install sentence-transformers==3.3.1
!pip install openai==1.11.1

## Import Libraries

In [6]:
import pandas as pd
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer

## Download the LLM

In [7]:
from pathlib import Path
import urllib.request

def download_llm():
  """Download the llamafile into llm/ unless it is already present."""
  llm_path = Path("llm/mxbai-embed-large-v1-f16.llamafile")
  if not llm_path.is_file():
    # Create the target directory on the first run
    llm_path.parent.mkdir(parents=True, exist_ok=True)
    url = "https://huggingface.co/Mozilla/mxbai-embed-large-v1-llamafile/resolve/main/mxbai-embed-large-v1-f16.llamafile"
    urllib.request.urlretrieve(url, llm_path)

download_llm()

## Load and Inspect the Data

In [13]:
df = pd.read_csv("top_wines.csv")
df.head()

|   | name | region | variety | rating | notes |
|---|------|--------|---------|--------|-------|
| 0 | 3 Rings Reserve Shiraz 2004 | Barossa Valley, Barossa, South Australia, Aust... | Red Wine | 96.0 | Vintage Comments : Classic Barossa vintage con... |
| 1 | Abreu Vineyards Cappella 2007 | Napa Valley, California | Red Wine | 96.0 | Cappella is a proprietary blend of two clones ... |
| 2 | Abreu Vineyards Cappella 2010 | Napa Valley, California | Red Wine | 98.0 | Cappella is one of the oldest vineyard sites i... |
| 3 | Abreu Vineyards Howell Mountain 2008 | Howell Mountain, Napa Valley, California | Red Wine | 96.0 | When David purchased this Howell Mountain prop... |
| 4 | Abreu Vineyards Howell Mountain 2009 | Howell Mountain, Napa Valley, California | Red Wine | 98.0 | As a set of wines, it is hard to surpass the f... |


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1365 entries, 0 to 1364
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   name     1365 non-null   object 
 1   region   1364 non-null   object 
 2   variety  1347 non-null   object 
 3   rating   1365 non-null   float64
 4   notes    1365 non-null   object 
dtypes: float64(1), object(4)
memory usage: 53.4+ KB


## Clean up Data

In [23]:
# Drop rows that are missing a region or a variety
df = df.dropna(subset=["region", "variety"])

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1347 entries, 0 to 1364
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   name     1347 non-null   object 
 1   region   1347 non-null   object 
 2   variety  1347 non-null   object 
 3   rating   1347 non-null   float64
 4   notes    1347 non-null   object 
dtypes: float64(1), object(4)
memory usage: 63.1+ KB


## Create Records

Transform each wine row into a data record (i.e., a dictionary):

In [29]:
data = df.to_dict("records")

# Show what the first record looks like...
for k, v in data[0].items():
  print(f'{k}: {v}')

name: 3 Rings Reserve Shiraz 2004
region: Barossa Valley, Barossa, South Australia, Australia
variety: Red Wine
rating: 96.0
notes: Vintage Comments : Classic Barossa vintage conditions. An average wet Spring followed by extreme heat in early February. Occasional rainfall events kept the vines in good balance up to harvest in late March 2004. Very good quality coupled with good average yields. More than 30 months in wood followed by six months tank maturation of the blend prior to bottling, July 2007. 
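To make the record shape concrete, here is a minimal, stdlib-only sketch that builds the same list-of-dictionaries structure from a tiny inline CSV. The two sample rows are taken from the output above with the notes column left out, and `sample_csv` and `records` are illustrative names that are not part of the notebook. Unlike `df.to_dict("records")`, `csv.DictReader` leaves every value as a string, so the rating is converted by hand.

```python
import csv
import io

# Tiny inline sample based on the rows shown above (illustrative only).
sample_csv = """name,region,variety,rating
3 Rings Reserve Shiraz 2004,"Barossa Valley, Barossa, South Australia, Australia",Red Wine,96.0
Abreu Vineyards Cappella 2007,"Napa Valley, California",Red Wine,96.0
"""

# One dictionary per row, keyed by column name: the same shape as to_dict("records").
records = [dict(row) for row in csv.DictReader(io.StringIO(sample_csv))]

# DictReader yields strings, so convert the numeric column explicitly.
for record in records:
    record["rating"] = float(record["rating"])

print(records[0]["name"], records[0]["rating"])
```

Each dictionary can later be attached to a vector unchanged, which is why this shape is convenient as a payload.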


## Vectorize Data

First, create an encoder that generates embeddings from the wine notes:

In [22]:
encoder = SentenceTransformer('all-MiniLM-L6-v2')  # Pretrained sentence-embedding model

Next, create an in-memory Qdrant database instance to store vectorized data:

In [20]:
qdrant = QdrantClient(":memory:")

In [None]:
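# Create a fresh collection to store the wines.
# recreate_collection is deprecated in recent qdrant-client releases,
# so drop any existing collection and create it explicitly.
if qdrant.collection_exists("top_wines"):
    qdrant.delete_collection("top_wines")
qdrant.create_collection(
    collection_name="top_wines",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # Vector size is set by the embedding model
        distance=models.Distance.COSINE,
    ),
)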
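The collection above is configured with cosine distance. As a rough illustration of what that metric measures, the sketch below computes cosine similarity in plain Python. The function name `cosine_similarity` is made up for this example, and Qdrant's internal implementation may well differ (for instance by normalizing vectors ahead of time); this only shows the underlying formula.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Same direction, different length: similarity is 1.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
# Perpendicular vectors: similarity is 0.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

Because the score depends only on direction, two tasting notes of very different length can still score as close matches if their embeddings point the same way.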

Vectorize the wine notes, and associate each vector with its corresponding wine record (i.e., the payload):

In [30]:
qdrant.upload_points(
    collection_name="top_wines",
    points=[
        models.PointStruct(
            id=idx,
            vector=encoder.encode(record["notes"]).tolist(),
            payload=record,
        ) for idx, record in enumerate(data)
    ]
)
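The cell above calls `encoder.encode` once per record. Since `SentenceTransformer.encode` also accepts a list of sentences, one possible variation is to encode the notes in batches. The sketch below shows only the batching logic, under stated assumptions: `fake_encode` is a stand-in for the real encoder, plain dictionaries stand in for `models.PointStruct`, and the helper names `batched` and `build_points` are hypothetical.

```python
def batched(items, batch_size):
    """Yield successive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_encode(texts):
    """Stand-in for encoder.encode(list_of_texts): one tiny vector per text."""
    return [[float(len(text))] for text in texts]

def build_points(records, encode, batch_size=64):
    """Pair each record with its vector, encoding the notes one batch at a time."""
    points = []
    for batch in batched(records, batch_size):
        vectors = encode([record["notes"] for record in batch])
        for record, vector in zip(batch, vectors):
            # Sequential ids, matching the enumerate() numbering used above.
            points.append({"id": len(points), "vector": list(vector), "payload": record})
    return points

# Made-up records, for illustration only.
sample = [
    {"name": "A", "notes": "dark fruit"},
    {"name": "B", "notes": "oak"},
    {"name": "C", "notes": "citrus zest"},
]
points = build_points(sample, fake_encode, batch_size=2)
print([point["id"] for point in points])
```

With the real encoder, each dictionary would become a `models.PointStruct` before being passed to `upload_points`. Whether batching is noticeably faster depends on the hardware and the batch size.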