# Vector databases

### Part 1: Retrieval with TF-IDF
1. Load the data

In [42]:
import pandas as pd
archivo_csv = 'wiki_movie_plots_deduped.csv'
df = pd.read_csv(archivo_csv)
plots = df['Plot']
titles = df['Title']
df.head()

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...


2. Configure TF-IDF

Using scikit-learn, we compute TF-IDF scores for the plots.

In [43]:
# Import scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Compute TF-IDF (English stop words are removed)
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(plots)
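
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It assumes raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which is the scheme scikit-learn documents as its default; the tokenizer is a plain whitespace split, and `tfidf_vectors` and `toy_docs` are illustrative names that do not appear in the notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with raw counts, smoothed IDF and L2 normalization."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: rare terms get a larger weight than common ones
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocab, vectors

# Toy corpus (made up for illustration)
toy_docs = [
    "dinosaurs roam the jungle",
    "a detective hunts the killer",
    "the jungle hides a killer",
]
vocab, vectors = tfidf_vectors(toy_docs)
```

In the first toy document, "dinosaurs" appears in one document only and therefore outweighs "the", which appears in all three.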

We create the search function:

In [44]:
# Search for movies by cosine similarity between the query and each plot
def search_movies(query, top_n=5):
    query_tfidf = vectorizer.transform([query])
    similarities = cosine_similarity(query_tfidf, tfidf_matrix).flatten()
    top_indices = similarities.argsort()[-top_n:][::-1]
    results = [(titles[i], similarities[i]) for i in top_indices]
    return results
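
The ranking step can be sketched without scikit-learn. The example below is an illustration rather than the notebook's implementation: it computes cosine similarity by hand and returns row positions instead of titles, a design choice that keeps results unambiguous when two films share a title. The names `cosine`, `top_n` and `doc_vecs` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_n(query_vec, doc_vecs, n=2):
    """Return the n best (row position, score) pairs, highest score first."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:n]]

# Toy document vectors (made up for illustration)
doc_vecs = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
ranking = top_n([1.0, 0.0, 0.0], doc_vecs, n=2)
```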

3. Test the system with a query

In [45]:
# Test the system
query = "dinosaurs"
results = search_movies(query)

Now we print the results:

In [46]:
print("\nResults:")
for title, score in results:
    # The plot is looked up by title; if several films share a title, the first one is shown
    print(f"{title} (similarity: {score:.2f})\n{plots[titles[titles == title].index[0]]}\n")


Results:
We're Back! A Dinosaur's Story (similarity: 0.47)
In present-day New York City, an Eastern bluebird named Buster runs away from his siblings and he meets an intelligent orange Tyrannosaurus named Rex, who is playing golf. He explains to Buster that he was once a ravaging dinosaur, and proceeds to tell his personal story.
In a prehistoric jungle, Rex is terrorizing other dinosaurs such as this Thescelosaurus he is pursuing when a spaceship lands on Earth with a little alien named Vorb. Vorb captures Rex and gives him "Brain Grain", a special breakfast cereal that vastly increases Rex's intelligence. Rex is given his name and introduced to other dinosaurs that are also anthropomorphized by the magic of Brain Grain: a blue Triceratops named Woog, a purple Pteranodon named Elsa and a green Parasaurolophus named Dweeb. They soon meet Vorb's employer Captain Neweyes, the inventor of Brain Grain, who reveals his goal of allowing the children of the present time to see real dinos

4. Evaluate the results

The results are consistent with the query: the top match, We're Back! A Dinosaur's Story, is a film about dinosaurs.

### Part 2: Retrieval with BM25
1. Configure Elasticsearch

In [47]:
%pip install elasticsearch

Note: you may need to restart the kernel to use updated packages.


Here we set up BM25 as our information retrieval model, using the Elasticsearch client library for Python. Elasticsearch scores `text` fields with BM25 by default, so no extra similarity configuration is needed.
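
As a rough illustration of what BM25 computes, the sketch below scores a toy corpus with the classical formula, using `k1 = 1.2` and `b = 0.75` (the Elasticsearch defaults). It is a simplified model, not Elasticsearch's actual scoring code: tokenization is a whitespace split, and `bm25_scores` and `toy_docs` are hypothetical names.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with the classical BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized through b
            length_norm = 1 - b + b * len(toks) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus (made up for illustration)
toy_docs = [
    "dinosaurs chase dinosaurs in the jungle",
    "a detective and a dinosaur",
    "the jungle at night",
]
scores = bm25_scores("dinosaurs", toy_docs)
```

Only the first document scores above zero: without stemming, the singular "dinosaur" does not match the query term "dinosaurs".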

In [48]:
from elasticsearch import Elasticsearch

# Connect to the Elasticsearch server
es = Elasticsearch("http://localhost:9200")

# Check the connection
if es.ping():
    print("Connected to Elasticsearch")
else:
    print("Could not connect to Elasticsearch")


Connected to Elasticsearch


In [49]:
# Define the index schema
index_name = "movies"
if not es.indices.exists(index=index_name):
    es.indices.create(
        index=index_name,
        body={
            "mappings": {
                "properties": {
                    "title": {"type": "text"},
                    "plot": {"type": "text"}
                }
            }
        }
    )
    print(f"Index '{index_name}' created.")
else:
    print(f"Index '{index_name}' already exists.")

Index 'movies' already exists.


Once the index is configured, we insert the CSV data into Elasticsearch.

In [50]:
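# Index the data
for i, row in df.iterrows():
    doc = {
        "title": row["Title"],
        "plot": row["Plot"]
    }
    # Use the row index as the document ID: titles can repeat in the dataset,
    # and using the title as ID would overwrite earlier films with the same title
    es.index(index=index_name, id=str(i), document=doc)

print("Movies indexed successfully.")

Movies indexed successfully.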
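
Indexing one document per request is slow for tens of thousands of rows. A possible alternative, sketched here under the assumption that the `elasticsearch.helpers.bulk` helper is used, is to generate bulk actions lazily. The generator `generate_actions` and the `sample_rows` placeholders are hypothetical and not part of the notebook.

```python
def generate_actions(rows, index_name):
    """Yield one bulk action per (row_id, title, plot) tuple."""
    for row_id, title, plot in rows:
        yield {
            "_index": index_name,
            "_id": str(row_id),  # row position as ID, so repeated titles do not collide
            "_source": {"title": title, "plot": plot},
        }

# Placeholder rows; the plot strings are stand-ins, not real data
sample_rows = [
    (0, "Kansas Saloon Smashers", "placeholder plot 0"),
    (1, "Love by the Light of the Moon", "placeholder plot 1"),
]
actions = list(generate_actions(sample_rows, "movies"))

# With a live cluster, the actions could be sent in bulk, for example:
# from elasticsearch import helpers
# helpers.bulk(es, generate_actions(zip(df.index, df["Title"], df["Plot"]), index_name))
```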


2. Define the search function

In [51]:
# Search with BM25: a match query on the "plot" field, top 5 hits
def busqueda_bm25(query, es, index_name):
    body = {
        "size": 5,
        "query": {
            "match": {
                "plot": query
            }
        }
    }
    resultados = es.search(index=index_name, body=body)
    return [
        {"Title": hit["_source"]["title"], "Plot": hit["_source"]["plot"]}
        for hit in resultados["hits"]["hits"]
    ]
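
The function above discards the relevance score. If the BM25 scores are wanted for the later comparison, the hits could be unpacked as in this sketch; `parse_hits` is a hypothetical helper, and `fake_response` is a hand-written stand-in for a search response with made-up scores.

```python
def parse_hits(response):
    """Extract ID, BM25 score and title from an Elasticsearch search response."""
    return [
        {"id": hit["_id"], "score": hit["_score"], "title": hit["_source"]["title"]}
        for hit in response["hits"]["hits"]
    ]

# Stand-in response with the usual hits structure (titles and scores are invented)
fake_response = {
    "hits": {
        "hits": [
            {"_id": "10", "_score": 7.5, "_source": {"title": "Movie A", "plot": "..."}},
            {"_id": "42", "_score": 6.1, "_source": {"title": "Movie B", "plot": "..."}},
        ]
    }
}
parsed = parse_hits(fake_response)
```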

Now we run a search to check that it works:

In [52]:
query = "dinosaurs"

resultados_bm25 = busqueda_bm25(query, es, index_name)
for resultado in resultados_bm25:
    print(f"Title: {resultado['Title']}\nPlot: {resultado['Plot']}\n")

Title: We're Back! A Dinosaur's Story
Plot: In present-day New York City, an Eastern bluebird named Buster runs away from his siblings and he meets an intelligent orange Tyrannosaurus named Rex, who is playing golf. He explains to Buster that he was once a ravaging dinosaur, and proceeds to tell his personal story.
In a prehistoric jungle, Rex is terrorizing other dinosaurs such as this Thescelosaurus he is pursuing when a spaceship lands on Earth with a little alien named Vorb. Vorb captures Rex and gives him "Brain Grain", a special breakfast cereal that vastly increases Rex's intelligence. Rex is given his name and introduced to other dinosaurs that are also anthropomorphized by the magic of Brain Grain: a blue Triceratops named Woog, a purple Pteranodon named Elsa and a green Parasaurolophus named Dweeb. They soon meet Vorb's employer Captain Neweyes, the inventor of Brain Grain, who reveals his goal of allowing the children of the present time to see real dinosaurs, fulfilling t

3. Evaluate the results

For this query, BM25 returns the same top result as TF-IDF: We're Back! A Dinosaur's Story.


### Part 3: Retrieval with FAISS
1. Configure FAISS

Install the libraries needed for this part, if they are not already installed, with the following command:

pip install faiss-cpu numpy pandas sentence-transformers

Next, we generate the embeddings.

In [53]:
from sentence_transformers import SentenceTransformer
import pandas as pd

# Generate the embeddings in batches (faster than encoding one plot at a time)
model = SentenceTransformer('paraphrase-MiniLM-L3-v2')  # Smaller model
df['Embeddings'] = list(model.encode(df['Plot'].tolist(), show_progress_bar=True))


In [54]:
import faiss
import numpy as np

# Convert the embeddings to a NumPy matrix
embeddings = np.array(df['Embeddings'].tolist()).astype('float32')

# Create a FAISS index (L2: Euclidean distance)
dimension = embeddings.shape[1]  # Embedding dimension
index = faiss.IndexFlatL2(dimension)

# Add the embeddings to the index
index.add(embeddings)
print(f"Total vectors indexed: {index.ntotal}")


Total vectors indexed: 34886
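
A flat L2 index performs an exhaustive search: it compares the query against every stored vector and keeps the closest ones. The pure-Python sketch below mimics that behavior on toy data. It returns squared Euclidean distances, which is what `IndexFlatL2` reports, but it is only an illustration and not how FAISS is implemented; `flat_l2_search` and the toy vectors are hypothetical.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def flat_l2_search(query, vectors, k):
    """Exhaustive nearest-neighbor search; returns (distances, ids), closest first."""
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy 2-D vectors (made up for illustration)
vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
dists, ids = flat_l2_search([0.0, 0.5], vectors, k=2)
```

The cost grows linearly with the number of vectors, which is acceptable for roughly 35,000 plots but is the reason approximate indexes exist for larger collections.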


2. Run queries

In [55]:
def search_movies(query, top_n=5):
    # Note: this redefines the TF-IDF search_movies from Part 1
    query_embedding = model.encode([query]).astype('float32')
    distances, indices = index.search(query_embedding, top_n)
    results = [(df.iloc[i]['Title'], distances[0][j], df.iloc[i]['Plot']) for j, i in enumerate(indices[0])]
    return results

# Try a query
query = "dinosaurs"
results = search_movies(query)

print("\nResults:")
for title, distance, plot in results:
    # Print the plot returned by the search instead of looking it up again by title
    print(f"{title} (distance: {distance:.4f})\n{plot}\n")



Results:
Theodore Rex (distance: 47.9583)
In an alternate futuristic society where humans and anthropomorphic dinosaurs co-exist, a tough police detective named Katie Coltraine (Whoopi Goldberg) is paired with a Tyrannosaurus named Theodore Rex (George Newbern) to find the killer of dinosaurs and other prehistoric animals leading them to a ruthless billionaire bent on killing off mankind by creating a new ice age.

The Land Before Time (distance: 48.6658)
During the age of the dinosaurs, a massive famine forces several herds of dinosaurs to seek an oasis known as the Great Valley. Among these, a mother in a diminished "Longneck" herd gives birth to a single baby, named Littlefoot. Years later, Littlefoot plays with Cera, a "Three-horn", until her father intervenes, whereupon Littlefoot's mother describes the different kinds of dinosaurs: "Three-horns", "Spiketails", "Swimmers", and "Flyers". That night, as Littlefoot follows a "Hopper", he encounters Cera again, and they play tog
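
The distances reported above are squared Euclidean distances, and their magnitude suggests the embeddings were presumably not normalized, which makes them hard to compare with the cosine scores from Part 1. For unit-length vectors the two measures are tied by `||a - b||² = 2 - 2·cos(a, b)`, which the sketch below checks on toy vectors. The helper names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def squared_l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Toy vectors (made up for illustration)
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

cos_ab = dot(a, b)       # cosine similarity of the unit vectors
d2_ab = squared_l2(a, b)  # squared L2 distance of the same vectors
```

With FAISS, one option would be to normalize the embeddings (for example with `faiss.normalize_L2`) and use an inner-product index such as `IndexFlatIP`, so that the scores are cosine similarities.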

3. Evaluate the results

### Part 4: Retrieval with ChromaDB
1. Configure ChromaDB

Generating the embeddings is faster on a GPU, so this part of the workshop uses CUDA when it is available.

In [56]:
import chromadb
from chromadb.utils import embedding_functions
import torch

# Check whether a GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


We will use these fields in the ChromaDB collection:

In [57]:
plots = df['Plot'].tolist()  # List of plots
titles = df['Title'].tolist()   # List of titles

In [58]:
# Set up the database (in-memory client)
client = chromadb.Client()

In [59]:
# Create the embedding function on the selected device
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name = "all-MiniLM-L6-v2",  # Larger model
    device = device  # Defined above: "cuda" if a GPU is available, otherwise "cpu"
)

In [60]:
# Create the collection with the configured embedding function
collection = client.get_or_create_collection(
    name="movies_collection",
    embedding_function=embedding_function
)

In [61]:
# Generate a unique ID for each document
ids = [f"movie_{i}" for i in range(len(df))]

In [62]:
# Split the data into batches
batch_size = 5000  # Batch size (adjust as needed)
for i in range(0, len(plots), batch_size):
    batch_plots = plots[i:i+batch_size]
    batch_titles = titles[i:i+batch_size]
    batch_ids = ids[i:i+batch_size]

    # Add the batch to the collection; Collection.add() takes the texts as `documents`
    collection.add(
        documents=batch_plots,  # Plots in this batch
        metadatas=[{"title": title} for title in batch_titles],  # Metadata for this batch
        ids=batch_ids  # Unique IDs for this batch
    )
    print(f"Batch {i // batch_size + 1} processed successfully.")

print(f"Added {len(plots)} documents and generated their embeddings.")
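
The batching logic can be separated from the database call, which makes it easy to check before any embeddings are computed. This is a minimal sketch: `iter_add_batches` is a hypothetical helper that yields keyword arguments (`ids`, `documents`, `metadatas`) suitable for `collection.add(**batch)`, and the toy data is made up.

```python
def iter_add_batches(ids, documents, titles, batch_size):
    """Yield keyword-argument dicts covering the data in consecutive batches."""
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield {
            "ids": ids[start:end],
            "documents": documents[start:end],
            "metadatas": [{"title": t} for t in titles[start:end]],
        }

# Toy data (placeholders, not real plots)
toy_ids = [f"movie_{i}" for i in range(5)]
toy_docs = [f"placeholder plot {i}" for i in range(5)]
toy_titles = [f"Title {i}" for i in range(5)]

batches = list(iter_add_batches(toy_ids, toy_docs, toy_titles, batch_size=2))

# With a real collection, each batch could then be added like this:
# for batch in iter_add_batches(ids, plots, titles, batch_size):
#     collection.add(**batch)
```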

In [None]:
# Define a query (a movie plot or any other text)
query = "A young boy discovers "

# Query the collection
results = collection.query(
    query_texts=[query],  # Pass the query as text
    n_results=5  # Number of results to return
)

# Show the results
for i, document in enumerate(results['documents'][0]):  # Documents for the first (and only) query
    print(f"Result {i + 1}:")
    print(f"Plot: {document}")

In [None]:
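print("\nResults:")
# collection.query() returns a dict of lists (one inner list per query), not tuples
for metadata, distance, plot in zip(results['metadatas'][0], results['distances'][0], results['documents'][0]):
    print(f"{metadata['title']} (distance: {distance:.4f})\n{plot}\n")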

### Part 5: Comparison of results
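
One simple way to compare the methods, offered here only as a possible starting point, is to measure how much their top-k title lists overlap for the same query. The sketch uses the Jaccard index; the result lists are placeholders and not outputs of this notebook, and `jaccard` and `results_by_method` are hypothetical names.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two lists treated as sets (1.0 means identical sets)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

# Placeholder top-3 titles per method (invented for illustration)
results_by_method = {
    "tfidf": ["Movie A", "Movie B", "Movie C"],
    "bm25": ["Movie A", "Movie C", "Movie D"],
    "faiss": ["Movie E", "Movie A", "Movie F"],
}

# Pairwise overlap between every two methods
overlaps = {
    (m1, m2): jaccard(results_by_method[m1], results_by_method[m2])
    for m1, m2 in combinations(results_by_method, 2)
}
```

A high overlap between TF-IDF and BM25 would be expected, since both are lexical methods, while the embedding-based methods may surface films that never mention the query word.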

