# Workshop: Using Cloud Tools for Information Retrieval

## Objective:
Learn how to use two vector databases, ChromaDB and Pinecone, to run similarity searches over text embeddings. Vector databases are a core tool in Information Retrieval (IR) and are used in applications such as search engines, recommendation systems, and natural language processing (NLP).
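
To make "similarity search" concrete, here is a minimal sketch of what a vector database computes under a cosine metric: it scores every stored vector against the query and returns the closest ones. The three-dimensional vectors and the names `cosine_similarity` and `top_k` are illustrative assumptions, not part of either library, and production indexes typically use approximate search rather than this brute-force scan.

```python
import numpy as np

def cosine_similarity(query, matrix):
    """Cosine similarity between one query vector and each row of matrix."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return matrix @ query / norms

def top_k(query, matrix, k=2):
    """Return (row index, score) pairs for the k most similar rows, best first."""
    scores = cosine_similarity(query, matrix)
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy "embeddings" (hypothetical values, 3 dimensions instead of 300)
stored = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
])
print(top_k([1.0, 0.0, 0.0], stored, k=2))
```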

In [1]:
import chromadb
import torch
from transformers import AutoTokenizer, AutoModel

In [1]:
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="")  # Paste your Pinecone API key here

In [21]:
pc.create_index(
    name="jueves300",
    dimension=300,  # Must match the embedding size (300 for word2vec-google-news-300)
    metric="cosine",  # Similarity metric used at query time
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

In [22]:
index = pc.Index("jueves300")

In [4]:
import pandas as pd

wine_df = pd.read_csv("../week10/data/winemag-data-130k-v2.csv")
wine_df

,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129966,129966,Germany,Notes of honeysuckle and cantaloupe sweeten th...,Brauneberger Juffer-Sonnenuhr Spätlese,90,28.0,Mosel,,,Anna Lee C. Iijima,,Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...,Riesling,Dr. H. Thanisch (Erben Müller-Burggraef)
129967,129967,US,Citation is given as much as a decade of bottl...,,90,75.0,Oregon,Oregon,Oregon Other,Paul Gregutt,@paulgwine,Citation 2004 Pinot Noir (Oregon),Pinot Noir,Citation
129968,129968,France,Well-drained gravel soil gives this wine its c...,Kritt,90,30.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Gresser 2013 Kritt Gewurztraminer (Als...,Gewürztraminer,Domaine Gresser
129969,129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss


In [6]:
import gensim.downloader as api

word2vec_model = api.load('word2vec-google-news-300')

In [7]:
corpus = wine_df[['Unnamed: 0','description']][:30]
corpus

,Unnamed: 0,description
0,0,"Aromas include tropical fruit, broom, brimston..."
1,1,"This is ripe and fruity, a wine that is smooth..."
2,2,"Tart and snappy, the flavors of lime flesh and..."
3,3,"Pineapple rind, lemon pith and orange blossom ..."
4,4,"Much like the regular bottling from 2012, this..."
5,5,Blackberry and raspberry aromas show a typical...
6,6,"Here's a bright, informal red that opens with ..."
7,7,This dry and restrained wine offers spice in p...
8,8,Savory dried thyme notes accent sunnier flavor...
9,9,This has great depth of flavor with its fresh ...


In [9]:
import numpy as np

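def generate_word2vec_embeddings(texts):
    """Return one vector per text: the mean of its in-vocabulary word vectors."""
    embeddings = []
    for text in texts:
        # Lowercase and split on whitespace; punctuation stays attached to tokens
        tokens = text.lower().split()
        # Keep only tokens present in the Word2Vec vocabulary
        word_vectors = [word2vec_model[word] for word in tokens if word in word2vec_model]
        if word_vectors:
            embeddings.append(np.mean(word_vectors, axis=0))
        else:
            # No known tokens: fall back to a zero vector
            embeddings.append(np.zeros(word2vec_model.vector_size))
    return np.array(embeddings)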

word2vec_embeddings = generate_word2vec_embeddings(corpus['description'])
print("Word2Vec Embeddings:", word2vec_embeddings)
print("Word2Vec Shape:", word2vec_embeddings.shape)

Word2Vec Embeddings: [[ 0.08961201  0.02994537  0.01651001 ... -0.06150293  0.08003616
   0.0424881 ]
 [ 0.0043335   0.03103406  0.00594482 ... -0.04978027  0.06455892
  -0.00902507]
 [-0.01267483  0.02049818  0.03032443 ... -0.04003103  0.081967
   0.08878367]
 ...
 [ 0.01143392 -0.00242276  0.01770698 ... -0.02921549 -0.00348239
   0.03066678]
 [ 0.00535366  0.04958234 -0.01933507 ... -0.05190604  0.00845337
   0.05245536]
 [ 0.04411708  0.01201714 -0.00256348 ... -0.04543632  0.05410679
  -0.01553432]]
Word2Vec Shape: (30, 300)
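
Because `generate_word2vec_embeddings` tokenizes with `text.lower().split()`, a token such as `fruit,` keeps its comma and is silently skipped if that exact string is not in the vocabulary. The sketch below compares vocabulary coverage for whitespace splitting against a simple regex tokenizer. The names `tokenize_simple`, `tokenize_stripped`, `coverage`, and the tiny `toy_vocab` set are hypothetical stand-ins; with the real model the membership test would be `token in word2vec_model`.

```python
import re

def tokenize_simple(text):
    """Whitespace tokenization, as in generate_word2vec_embeddings."""
    return text.lower().split()

def tokenize_stripped(text):
    """Keep runs of letters and apostrophes, dropping other punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def coverage(tokens, vocab):
    """Fraction of tokens found in vocab (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return sum(token in vocab for token in tokens) / len(tokens)

# Hypothetical stand-in for the Word2Vec vocabulary
toy_vocab = {"aromas", "include", "tropical", "fruit", "broom"}
text = "Aromas include tropical fruit, broom."

print("whitespace split:", coverage(tokenize_simple(text), toy_vocab))
print("punctuation stripped:", coverage(tokenize_stripped(text), toy_vocab))
```

Lower coverage means fewer word vectors contribute to each mean, so checking it on a sample of descriptions is a cheap way to judge whether the tokenizer needs attention.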


In [11]:
word2vec_embeddings

array([[ 0.08961201,  0.02994537,  0.01651001, ..., -0.06150293,
         0.08003616,  0.0424881 ],
       [ 0.0043335 ,  0.03103406,  0.00594482, ..., -0.04978027,
         0.06455892, -0.00902507],
       [-0.01267483,  0.02049818,  0.03032443, ..., -0.04003103,
         0.081967  ,  0.08878367],
       ...,
       [ 0.01143392, -0.00242276,  0.01770698, ..., -0.02921549,
        -0.00348239,  0.03066678],
       [ 0.00535366,  0.04958234, -0.01933507, ..., -0.05190604,
         0.00845337,  0.05245536],
       [ 0.04411708,  0.01201714, -0.00256348, ..., -0.04543632,
         0.05410679, -0.01553432]], dtype=float32)

In [15]:
x = {'id': '0', 'values': word2vec_embeddings[0]}
x

{'id': '0',
 'values': array([ 0.08961201,  0.02994537,  0.01651001,  0.15172577, -0.04099274,
         0.00646244, -0.03237152, -0.02348328,  0.01422119,  0.1411171 ,
         0.00030422, -0.17591858, -0.06302261,  0.04878616, -0.12338638,
         0.13635635, -0.05486679,  0.14001846,  0.07175446, -0.07574081,
         0.01264381,  0.03368759,  0.03656769,  0.0002799 ,  0.07305908,
        -0.10983276, -0.05321145,  0.12943459, -0.02946377,  0.04084396,
        -0.05028152,  0.01818275, -0.03564739,  0.07135391, -0.08524704,
         0.03613758,  0.02350235, -0.14900208,  0.04727173,  0.01794624,
         0.13770485, -0.14965057,  0.0192771 ,  0.09023285, -0.0658226 ,
        -0.24008465, -0.0807209 ,  0.02245712,  0.02532959,  0.11656952,
        -0.0317688 ,  0.06780624, -0.05103111, -0.03971863, -0.00222874,
         0.13659668, -0.02814484, -0.03833008,  0.02139664, -0.09573364,
        -0.10679626,  0.08338165, -0.06730461, -0.07703114,  0.03153419,
        -0.07352448, -0.05755

In [16]:
vectors = []
for i in range(30):
    x = {'id': str(i), 'values': word2vec_embeddings[i]}
    vectors.append(x)

In [None]:
# Same result as the loop above, written as a list comprehension
vectors = [{'id': str(i), 'values': word2vec_embeddings[i]} for i in range(30)]
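
The records above store NumPy arrays as `values`. A more portable variant, sketched here, converts each row to a plain list of floats and splits the records into batches, which helps once the corpus grows beyond a few dozen rows. The helper names `to_records` and `batched`, the batch size, and the `demo` array are assumptions for illustration; the `index.upsert` call is shown only as a comment because it needs a live index.

```python
import numpy as np

def to_records(embeddings, start_id=0):
    """Turn a 2-D array into {'id', 'values'} dicts holding plain float lists."""
    return [
        {"id": str(start_id + i), "values": row.tolist()}
        for i, row in enumerate(np.asarray(embeddings))
    ]

def batched(records, batch_size=100):
    """Yield successive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Hypothetical stand-in for word2vec_embeddings (5 vectors, 4 dimensions)
demo = np.arange(20, dtype=np.float32).reshape(5, 4)
records = to_records(demo)

for batch in batched(records, batch_size=2):
    print([record["id"] for record in batch])
    # With a live index: index.upsert(vectors=batch, namespace="vectors")
```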

In [18]:
y = [1, 'a']
y

[1, 'a']

In [17]:
vectors

[{'id': '0',
  'values': array([ 0.08961201,  0.02994537,  0.01651001,  0.15172577, -0.04099274,
          0.00646244, -0.03237152, -0.02348328,  0.01422119,  0.1411171 ,
          0.00030422, -0.17591858, -0.06302261,  0.04878616, -0.12338638,
          0.13635635, -0.05486679,  0.14001846,  0.07175446, -0.07574081,
          0.01264381,  0.03368759,  0.03656769,  0.0002799 ,  0.07305908,
         -0.10983276, -0.05321145,  0.12943459, -0.02946377,  0.04084396,
         -0.05028152,  0.01818275, -0.03564739,  0.07135391, -0.08524704,
          0.03613758,  0.02350235, -0.14900208,  0.04727173,  0.01794624,
          0.13770485, -0.14965057,  0.0192771 ,  0.09023285, -0.0658226 ,
         -0.24008465, -0.0807209 ,  0.02245712,  0.02532959,  0.11656952,
         -0.0317688 ,  0.06780624, -0.05103111, -0.03971863, -0.00222874,
          0.13659668, -0.02814484, -0.03833008,  0.02139664, -0.09573364,
         -0.10679626,  0.08338165, -0.06730461, -0.07703114,  0.03153419,
         -0.073

In [25]:
index.upsert(vectors=vectors, namespace='vectors')

{'upserted_count': 30}

In [26]:
print(index.describe_index_stats())

{'dimension': 300,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 30}},
 'total_vector_count': 30}


In [3]:
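# Namespace demo: these vectors are 2-dimensional, so this cell only runs against
# an index created with dimension=2, not the 300-dimensional "jueves300" index.
index.upsert(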
    vectors=[
        {"id": "vec1", "values": [1.0, 1.5]},
        {"id": "vec2", "values": [2.0, 1.0]},
        {"id": "vec3", "values": [0.1, 3.0]},
    ],
    namespace="ns1"
)

index.upsert(
    vectors=[
        {"id": "vec1", "values": [1.0, -2.5]},
        {"id": "vec2", "values": [3.0, -2.0]},
        {"id": "vec3", "values": [0.5, -1.5]},
    ],
    namespace="ns2"
)

{'upserted_count': 3}

In [27]:
query_str = 'coffee smell'

query_vector = generate_word2vec_embeddings([query_str])

query_vector

array([[ 0.01367188, -0.10717773, -0.20788574,  0.30711365, -0.10205078,
        -0.00811768, -0.1159668 , -0.02148438, -0.00925446,  0.3671875 ,
        -0.10205078, -0.2368164 ,  0.05566406,  0.07617188, -0.01531982,
         0.19726562, -0.08215332,  0.14453125,  0.22705078, -0.18115234,
        -0.20507812,  0.07391357,  0.17333984, -0.24658203, -0.14978027,
         0.00488281, -0.02490234,  0.18237305, -0.05865479,  0.0579834 ,
        -0.03411865, -0.17160034, -0.10400391, -0.14428711, -0.17773438,
        -0.02542114,  0.01763916, -0.17547607, -0.08508301,  0.06848145,
        -0.16625977, -0.27077103,  0.0925293 ,  0.06634521, -0.16479492,
        -0.07995605, -0.32470703,  0.09127808, -0.27775574,  0.2524414 ,
        -0.13500977, -0.01123047,  0.00634766, -0.07592773, -0.00756836,
         0.04724121,  0.06738281,  0.2019043 ,  0.08813477, -0.24609375,
        -0.07141113,  0.17640495, -0.00439453,  0.17382812,  0.07470703,
        -0.18554688, -0.07110596,  0.03588867,  0.0

In [32]:
index.query(
    namespace="vectors",
    vector=query_vector[0].tolist(),  # query_vector has shape (1, 300); pass the single row as a flat list
    top_k=3,
    include_values=False
)

{'matches': [{'id': '27', 'score': 0.611195326, 'values': []},
             {'id': '24', 'score': 0.569107175, 'values': []},
             {'id': '18', 'score': 0.551815, 'values': []}],
 'namespace': 'vectors',
 'usage': {'read_units': 5}}

In [40]:
wine_df[wine_df['Unnamed: 0'] == 27]['description']

27    Aromas recall ripe dark berry, toast and a whiff of cake spice. The soft, informal palate offers sour cherry, vanilla and a hint of espresso alongside round tannins. Drink soon.
Name: description, dtype: object
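
Looking up one id at a time gets tedious, so the sketch below pairs every match with its description in one pass. It assumes each match can be read like a dict with `id` and `score` keys, as the printed query output suggests, and that descriptions are available in a plain mapping keyed by integer row id. The function name `describe_matches`, the `fake_matches` list, and the placeholder texts are hypothetical.

```python
def describe_matches(matches, descriptions):
    """Pair each match with its text; ids are strings in the index, ints in the lookup."""
    rows = []
    for match in matches:
        row_id = int(match["id"])
        text = descriptions.get(row_id, "<missing>")
        rows.append((row_id, match["score"], text))
    return rows

# Hypothetical response fragment shaped like the query output above
fake_matches = [
    {"id": "27", "score": 0.61},
    {"id": "24", "score": 0.57},
]

# With the notebook's data this mapping could be built as:
# descriptions = dict(zip(wine_df["Unnamed: 0"], wine_df["description"]))
descriptions = {27: "placeholder text for row 27", 24: "placeholder text for row 24"}

for row_id, score, text in describe_matches(fake_matches, descriptions):
    print(f"{row_id}  {score:.3f}  {text}")
```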

# ChromaDB

In [2]:
# Initialize the ChromaDB client
chroma_client = chromadb.Client()

In [3]:
model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

In [4]:
# Generate embeddings
def get_embeddings(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Use the hidden state of the first token of each sequence as its embedding
        embeddings = model(**inputs).last_hidden_state[:, 0, :]
    return embeddings
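
The `get_embeddings` function above takes the first token's hidden state. Models in the sentence-transformers family are usually paired with mean pooling over all non-padding tokens instead, so the sketch below shows that alternative. It is written in NumPy so it can be checked without downloading the model; the function name `mean_pool` and the toy arrays are assumptions, and swapping it in would change the distances reported later in this notebook.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: shape (batch, tokens, dim)
    attention_mask: shape (batch, tokens), 1 for real tokens and 0 for padding
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[:, :, None]
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against division by zero
    return summed / counts

# Toy batch: one sequence with three token slots, the last one is padding
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(hidden, mask))
```

With the notebook's objects, the inputs would presumably be `model(**inputs).last_hidden_state.numpy()` and `inputs["attention_mask"].numpy()`.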

In [5]:
import pandas as pd
csv_file_path = "../data/winemag-data_first150k.csv" 
df = pd.read_csv(csv_file_path)
df

,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
...,...,...,...,...,...,...,...,...,...,...,...
150925,150925,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio
150926,150926,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,Champagne,,Champagne Blend,H.Germain
150927,150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora
150928,150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset


In [6]:
texts = df['description'].dropna().sample(n=100, random_state=42).tolist()

In [7]:
embeddings = get_embeddings(texts).numpy().tolist()

In [8]:
# Create a collection in ChromaDB
collection_name = "wine_descriptions"
collection = chroma_client.create_collection(name=collection_name)

In [9]:
collection.add(
    embeddings=embeddings,
    documents=texts,
    ids=[f"id_{i}" for i in range(len(texts))]
)

In [10]:
query_text = "coffee smell"
query_embedding = get_embeddings([query_text])[0]

In [11]:
query_embedding_list = query_embedding.tolist()

In [12]:
results = collection.query(
    query_embeddings=[query_embedding_list],
    n_results=3
)

In [13]:
print(f"Query: '{query_text}'\n")
for i, (document, distance) in enumerate(zip(results['documents'][0], results['distances'][0])):
    print(f"Result {i+1}: Distance: {distance:.4f}")
    print(f"Description: {document}\n")

Query: 'coffee smell'

Result 1: Distance: 32.9100
Description: A hint of sweat in the nose blows away to reveal a solid core of cherry fruit, along with broad flavors of stem, coffee and earth. This wine is substantial, though that slightly funky character lingers into the finish.

Result 2: Distance: 33.7467
Description: This hearty young red offers aromas of stewed black fruit, oak and espresso. The brooding palate delivers sugary black plum, ripe blackberry, tobacco, coffee bean and chocolate alongside massive tannins. Drink after 2019.

Result 3: Distance: 34.8161
Description: This opens with attractive aromas of pressed rose, woodland berry, dried herb and a whiff of baking spice. The robust palate shows sour, almost unripe cherry, clove, mocha and anisette. The warmth of evident alcohol closes the finish.
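
The distances above are much larger than the cosine scores in the Pinecone section because the two are different measures. Assuming the collection uses Chroma's default squared Euclidean (L2) distance and the embeddings are not length-normalized, vector magnitude contributes to the distance. The sketch below uses toy two-dimensional vectors and hypothetical helper names to show that, after normalizing to unit length, squared L2 distance and cosine similarity are tied by `distance = 2 - 2 * cosine`.

```python
import numpy as np

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum((a - b) ** 2))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalize(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Same direction, different lengths: far apart in L2, identical in cosine
a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])
print("raw squared L2:", squared_l2(a, b))
print("cosine:", cosine(a, b))
print("squared L2 after normalizing:", squared_l2(normalize(a), normalize(b)))

# For unit vectors, squared L2 equals 2 - 2 * cosine
c, d = normalize([1.0, 2.0]), normalize([2.0, 1.0])
print(squared_l2(c, d), 2 - 2 * cosine(c, d))
```

Normalizing embeddings before `collection.add` and before querying would therefore make Chroma's ranking agree with a cosine ranking, under the default-distance assumption stated above.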

