In [1]:
import pandas as pd
df = pd.read_csv('../../top_rated_wines.csv')
df = df[df['variety'].notna()] # drop rows with a missing variety; NaN values break serialization
data = df.to_dict('records')
df

Unnamed: 0,name,region,variety,rating,notes
0,3 Rings Reserve Shiraz 2004,"Barossa Valley, Barossa, South Australia, Aust...",Red Wine,96.0,Vintage Comments : Classic Barossa vintage con...
1,Abreu Vineyards Cappella 2007,"Napa Valley, California",Red Wine,96.0,Cappella is a proprietary blend of two clones ...
2,Abreu Vineyards Cappella 2010,"Napa Valley, California",Red Wine,98.0,Cappella is one of the oldest vineyard sites i...
3,Abreu Vineyards Howell Mountain 2008,"Howell Mountain, Napa Valley, California",Red Wine,96.0,When David purchased this Howell Mountain prop...
4,Abreu Vineyards Howell Mountain 2009,"Howell Mountain, Napa Valley, California",Red Wine,98.0,"As a set of wines, it is hard to surpass the f..."
...,...,...,...,...,...
1360,Lewis Cellars Alec's Blend Red 2002,"Napa Valley, California",Red Wine,96.0,Number 12 on
1361,Lewis Cellars Cabernet Sauvignon 2002,"Napa Valley, California",Red Wine,96.0,Showcasing the unique personalities of small h...
1362,Lewis Cellars Cuvee L Cabernet Sauvignon 2015,"Napa Valley, California",Red Wine,96.0,"Straight from James Fenimore Cooper’s novel, L..."
1363,Lewis Cellars Reserve Cabernet Sauvignon 2010,"Napa Valley, California",Red Wine,96.0,
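
The NaN filter in the cell above matters because strict JSON has no representation for NaN, and Qdrant payloads are JSON objects, which is one likely reason the serialization fails. Below is a minimal standard-library sketch of the failure and the fix. The records are made up, and `records` and `is_missing` are illustrative names that do not appear in the notebook.

```python
import json
import math

# Hypothetical records shaped like the output of df.to_dict('records')
records = [
    {"name": "Wine A", "variety": "Red Wine"},
    {"name": "Wine B", "variety": float("nan")},  # pandas marks a missing value as NaN
]

def is_missing(value):
    """Return True for a float NaN, the marker pandas uses for missing data."""
    return isinstance(value, float) and math.isnan(value)

# Strict JSON serialization rejects NaN with a ValueError
try:
    json.dumps(records, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False

# Dropping records with a missing variety mirrors df[df['variety'].notna()]
clean = [r for r in records if not is_missing(r["variety"])]
clean_json = json.dumps(clean, allow_nan=False)
```

Note that the filter in the notebook only checks the `variety` column; other columns could still contain missing values.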


In [2]:
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer

`qdrant_client`: The Python client for Qdrant, a vector database designed for large-scale vector search and retrieval. Qdrant stores and queries vectors (such as embeddings) generated from text, images, or other data types.

`models`: A module in the `qdrant_client` library that contains the data models used to define the structure of your data and queries. For example, you use models to define how vector embeddings are structured and stored in the database.

`QdrantClient`: The main class in the `qdrant_client` library. It connects to a Qdrant instance (hosted locally or in the cloud) and performs operations such as inserting, updating, and querying vector data.

In [3]:
encoder = SentenceTransformer('all-MiniLM-L6-v2') # Model to create embeddings

In [4]:
# create the vector database client
qdrant = QdrantClient(":memory:") # Create in-memory Qdrant instance

`recreate_collection`: A `QdrantClient` method that deletes the collection if it already exists and then creates a new one with the specified configuration. This ensures the collection starts from a clean state.

`collection_name = "top_wines"`: The name of the collection being created. Collections in Qdrant are like tables in a relational database: each one stores a set of related vectors.

`models.VectorParams`: A data model that defines how vectors (embeddings) are stored in the collection. It specifies the size of the vectors and the distance metric used for similarity search.

`size = encoder.get_sentence_embedding_dimension()`: This method returns the dimensionality of the sentence embeddings the model generates, that is, the number of components in each vector. The `all-MiniLM-L6-v2` model used here generates 384-dimensional vectors, so it returns 384.

`distance = models.Distance.COSINE`: The metric used to compare vectors during search. Cosine similarity measures the cosine of the angle between two vectors, so it compares their direction and ignores their magnitude, which works well for text embeddings. Higher scores mean more similar vectors.
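
To make the metric concrete, here is a minimal pure-Python sketch of cosine similarity. It is an illustration of the formula only, not Qdrant's implementation, and `cosine_similarity` is a name chosen for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors, close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # vectors pointing opposite ways
```

The parallel pair scores close to 1 even though one vector is twice as long as the other, which shows that only direction matters.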

In [5]:
encoder.get_sentence_embedding_dimension()

384

In [6]:
# Create a collection to store wine information
qdrant.recreate_collection(
    collection_name = "top_wines",
    vectors_config = models.VectorParams(
        size = encoder.get_sentence_embedding_dimension(), # Vector size is set by the embedding model
        distance = models.Distance.COSINE
    )
)


True

In [8]:
for id, doc in enumerate(data):
    print(id)
    print(doc)
    break

0
{'name': '3 Rings Reserve Shiraz 2004', 'region': 'Barossa Valley, Barossa, South Australia, Australia', 'variety': 'Red Wine', 'rating': 96.0, 'notes': 'Vintage Comments : Classic Barossa vintage conditions. An average wet Spring followed by extreme heat in early February. Occasional rainfall events kept the vines in good balance up to harvest in late March 2004. Very good quality coupled with good average yields. More than 30 months in wood followed by six months tank maturation of the blend prior to bottling, July 2007. '}


In [7]:
# vectorize!
qdrant.upload_points(
    collection_name = "top_wines",
    points = [
        models.PointStruct(
            id = idx, # index
            vector = encoder.encode(doc["notes"]).tolist(), # Convert vector to list
            payload = doc # Attach the entire document as payload
        ) for idx, doc in enumerate(data) # 'data' is the list holding all the wine documents
    ]
)
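
Conceptually, each uploaded point bundles an id, a vector, and a payload, and a search ranks the stored points by their similarity to a query vector. The sketch below imitates that with a brute-force scan in plain Python. It is a toy illustration under simplifying assumptions, not how Qdrant stores or indexes points, and `ToyPoint`, `cosine`, and `toy_search` are hypothetical names.

```python
import math
from dataclasses import dataclass, field

@dataclass
class ToyPoint:
    id: int
    vector: list
    payload: dict = field(default_factory=dict)

def cosine(a, b):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def toy_search(points, query_vector, limit=3):
    """Score every point against the query and return the best `limit` (score, point) pairs."""
    scored = [(cosine(p.vector, query_vector), p) for p in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]

# Made-up 2-dimensional vectors stand in for real 384-dimensional embeddings
points = [
    ToyPoint(0, [1.0, 0.0], {"name": "Wine A"}),
    ToyPoint(1, [0.0, 1.0], {"name": "Wine B"}),
    ToyPoint(2, [1.0, 1.0], {"name": "Wine C"}),
]

for score, point in toy_search(points, [1.0, 0.1], limit=2):
    print(point.payload, "score:", round(score, 3))
```

The payload travels with the vector, which is why each search hit can print the full wine record alongside its score.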

In [9]:
# Search time for awesome wines!

hits = qdrant.search(
    collection_name = "top_wines",
    query_vector = encoder.encode("99 points Cabernet Sauvignon from Napa Valley").tolist(),
    limit = 3
)
for hit in hits:
    print(hit.payload, "score:", hit.score)

{'name': 'Kapcsandy Family Winery State Lane Cabernet Sauvignon Grand Vin 2017', 'region': 'Yountville, Napa Valley, California', 'variety': 'Red Wine', 'rating': 96.0, 'notes': '100% Cabernet Sauvignon'} score: 0.7492030054523204
{'name': 'Lewis Cellars Cabernet Sauvignon 2002', 'region': 'Napa Valley, California', 'variety': 'Red Wine', 'rating': 96.0, 'notes': 'Showcasing the unique personalities of small hillside vineyards from Pritchard Hill, Oakville and Rutherford, the 2002 Napa Valley Cabernet delivers compelling aromas of mocha, ripe berries, tobacco and sweet oak spice. The wine is 100% Cabernet Sauvignon, complex, rich and focused. With a deep core of black fruit and traces of briar and vanilla, it turns chocolaty and long on the palate with serious, integrated tannins.'} score: 0.7331375680355914
{'name': 'Anakota Helena Montana Vineyard Cabernet Sauvignon 2013', 'region': 'Knights Valley, Sonoma County, California', 'variety': 'Red Wine', 'rating': 96.0, 'notes': 'Blend: 1

In [12]:
hits = qdrant.search(
    collection_name = "top_wines",
    query_vector = encoder.encode("95 rated red vines from france").tolist(),
    limit = 3
)
for hit in hits:
    print(hit.payload, "score:", hit.score)

{'name': 'Felsina Maestro Raro Cabernet Sauvignon 2016', 'region': 'Tuscany, Italy', 'variety': 'Red Wine', 'rating': 97.0, 'notes': 'The grapes come from the vineyards Rancia Piccola and Poggiolo, the first also called Mastro Raro, adjacent and similar to that of Rancia. The best-known and most wydely-planted red grape in the world, Cabernet Sauvignon, was planted here, right in the locus of the Felsina terroir at its most classic. First vintage 1987. '} score: 0.552681820230273
{'name': 'Jaboulet Hermitage La Chapelle (1.5L) 1998', 'region': 'Hermitage, Rhone, France', 'variety': 'Red Wine', 'rating': 96.0, 'notes': 'One of the finest red wines of France. When young the color is deep purple, like black cherries, with aromas of blackcurrant and blackberry. It is a full wine with delicate tannins, 100% destemmed, complex. With age this rich nectar takes on scents of leather, truffles, undergrowth and leaf-mold. The Syrah vines, with an average age of 35 years, have an exceptional posit