# Dataset Preparation


Load the CSV into a DataFrame

In [9]:
import pandas as pd
archivo_csv = 'wiki_movie_plots_deduped.csv'
df = pd.read_csv(archivo_csv)

Keep only the relevant columns

In [10]:
df_final = df[['Release Year','Title', 'Plot']]
df_final.head()

Unnamed: 0,Release Year,Title,Plot
0,1901,Kansas Saloon Smashers,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,The earliest known adaptation of the classic f...


Check for null values

In [11]:
df_final.isna().sum()

Release Year    0
Title           0
Plot            0
dtype: int64

Convert to lowercase and remove periods and commas

In [12]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_final = df_final.assign(
    textoLimpio=df_final['Plot'].str.lower().str.replace('.', '', regex=False).str.replace(',', '', regex=False)
)


In [13]:
df_final.head()

Unnamed: 0,Release Year,Title,Plot,textoLimpio
0,1901,Kansas Saloon Smashers,"A bartender is working at a saloon, serving dr...",a bartender is working at a saloon serving dri...
1,1901,Love by the Light of the Moon,"The moon, painted with a smiling face hangs ov...",the moon painted with a smiling face hangs ove...
2,1901,The Martyred Presidents,"The film, just over a minute long, is composed...",the film just over a minute long is composed o...
3,1901,"Terrible Teddy, the Grizzly King",Lasting just 61 seconds and consisting of two ...,lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,The earliest known adaptation of the classic f...,the earliest known adaptation of the classic f...
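
The cleaning step strips only periods and commas, so characters such as quotes, parentheses, and question marks stay in `textoLimpio`. A minimal sketch of a broader cleanup, assuming the ASCII set in Python's `string.punctuation` is what should be removed; the helper name `limpiar_texto` is hypothetical and not part of the original notebook:

```python
import string

# Translation table that deletes every ASCII punctuation character.
_TABLA_PUNTUACION = str.maketrans('', '', string.punctuation)

def limpiar_texto(texto):
    """Lowercase the text and delete ASCII punctuation (hypothetical helper)."""
    return texto.lower().translate(_TABLA_PUNTUACION)

print(limpiar_texto('A bartender is working at a saloon, serving drinks. "Stop!"'))
```

Applied to the DataFrame, this would be `df_final['Plot'].apply(limpiar_texto)`. Deleting apostrophes and hyphens also merges words such as "don't" into "dont", which may or may not be desirable before tokenization.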


Tokenization

In [14]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_final = df_final.assign(tokens=df_final['textoLimpio'].apply(word_tokenize))

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\dicam\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [16]:
df_final.head()

Unnamed: 0,Release Year,Title,Plot,textoLimpio,tokens
0,1901,Kansas Saloon Smashers,"A bartender is working at a saloon, serving dr...",a bartender is working at a saloon serving dri...,"[a, bartender, is, working, at, a, saloon, ser..."
1,1901,Love by the Light of the Moon,"The moon, painted with a smiling face hangs ov...",the moon painted with a smiling face hangs ove...,"[the, moon, painted, with, a, smiling, face, h..."
2,1901,The Martyred Presidents,"The film, just over a minute long, is composed...",the film just over a minute long is composed o...,"[the, film, just, over, a, minute, long, is, c..."
3,1901,"Terrible Teddy, the Grizzly King",Lasting just 61 seconds and consisting of two ...,lasting just 61 seconds and consisting of two ...,"[lasting, just, 61, seconds, and, consisting, ..."
4,1902,Jack and the Beanstalk,The earliest known adaptation of the classic f...,the earliest known adaptation of the classic f...,"[the, earliest, known, adaptation, of, the, cl..."


# TF-IDF


In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# TF-IDF configuration
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(df_final['Plot'])  # TF-IDF matrix

# TF-IDF search function
def busqueda_tfidf(consulta, tfidf_vectorizer, tfidf_matrix, df):
    # Transform the query into a TF-IDF vector
    consulta_vector = tfidf_vectorizer.transform([consulta])
    # Compute cosine similarity between the query and every plot
    similitudes = cosine_similarity(consulta_vector, tfidf_matrix).flatten()
    # Indices of the five most similar documents, best first
    top_indices = similitudes.argsort()[::-1][:5]
    # Retrieve the corresponding documents
    resultados = df.iloc[top_indices][['Title', 'Plot']].to_dict('records')
    return resultados
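
To make the scoring inside `busqueda_tfidf` concrete, the following sketch recomputes TF-IDF weights and cosine similarity by hand on a toy corpus. It assumes the smoothed IDF that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The whitespace tokenizer, the toy documents, and the helper names are illustrative assumptions, and stop-word removal is omitted.

```python
import math
from collections import Counter

def vectorizar(tokens, idf):
    """Weight raw term counts by IDF and scale the vector to unit length."""
    pesos = {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}
    norma = math.sqrt(sum(p * p for p in pesos.values()))
    return {t: p / norma for t, p in pesos.items()} if norma else {}

def construir_tfidf(documentos):
    """Return the IDF table and one unit-length TF-IDF vector per document."""
    tokens_por_doc = [doc.lower().split() for doc in documentos]
    n = len(tokens_por_doc)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for tokens in tokens_por_doc for t in set(tokens))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    return idf, [vectorizar(tokens, idf) for tokens in tokens_por_doc]

def coseno(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine."""
    return sum(p * b.get(t, 0.0) for t, p in a.items())

documentos = [
    "dinosaurs roam a prehistoric jungle",
    "a detective hunts a thief in the city",
    "scientists clone dinosaurs on an island",
]
idf, vectores = construir_tfidf(documentos)
consulta = vectorizar("dinosaurs".split(), idf)
similitudes = [coseno(consulta, v) for v in vectores]
ranking = sorted(range(len(documentos)), key=lambda i: similitudes[i], reverse=True)
print(ranking, [round(s, 3) for s in similitudes])
```

The shorter document that mentions the query term ranks first because length normalization spreads less weight over its other terms.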


In [46]:
# Search example
resultados_tfidf = busqueda_tfidf("dinosaurs", tfidf_vectorizer, tfidf_matrix, df_final)
for resultado in resultados_tfidf:
    print(f"Title: {resultado['Title']}\nPlot: {resultado['Plot']}\n")

Title: We're Back! A Dinosaur's Story
Plot: In present-day New York City, an Eastern bluebird named Buster runs away from his siblings and he meets an intelligent orange Tyrannosaurus named Rex, who is playing golf. He explains to Buster that he was once a ravaging dinosaur, and proceeds to tell his personal story.
In a prehistoric jungle, Rex is terrorizing other dinosaurs such as this Thescelosaurus he is pursuing when a spaceship lands on Earth with a little alien named Vorb. Vorb captures Rex and gives him "Brain Grain", a special breakfast cereal that vastly increases Rex's intelligence. Rex is given his name and introduced to other dinosaurs that are also anthropomorphized by the magic of Brain Grain: a blue Triceratops named Woog, a purple Pteranodon named Elsa and a green Parasaurolophus named Dweeb. They soon meet Vorb's employer Captain Neweyes, the inventor of Brain Grain, who reveals his goal of allowing the children of the present time to see real dinosaurs, fulfilling t

# BM25

Connecting to Elasticsearch with Docker

In [47]:
!pip install elasticsearch



Run the following commands in a terminal to start Elasticsearch and confirm that it responds:

- docker network create elastic

- docker pull docker.elastic.co/elasticsearch/elasticsearch:8.17.0

- docker run -d --name elasticsearch -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.17.0

- curl http://localhost:9200

Establish the connection to Elasticsearch

In [50]:
from elasticsearch import Elasticsearch

# Connect the Elasticsearch client
es = Elasticsearch("http://localhost:9200")

# Check that the server is reachable
if es.ping():
    print("Successfully connected to Elasticsearch")
else:
    print("Could not connect to Elasticsearch")


Successfully connected to Elasticsearch


Creating the index

In [51]:
index_name = "elasticsearch_index"
if not es.indices.exists(index=index_name):
    # The 8.x Python client takes the mappings as a keyword argument; body= is deprecated
    es.indices.create(
        index=index_name,
        mappings={
            "properties": {
                "title": {"type": "text"},
                "plot": {"type": "text"}
            }
        }
    )
    print(f"Index '{index_name}' created.")
else:
    print(f"Index '{index_name}' already exists.")


Index 'elasticsearch_index' created.


Indexing the movies

In [52]:
# Note: the title is used as the document ID, so films that share a title overwrite one another
for _, row in df_final.iterrows():
    doc = {
        "title": row["Title"],
        "plot": row["Plot"]
    }
    es.index(index=index_name, id=row["Title"], document=doc)

print("Movies indexed successfully.")


Movies indexed successfully.
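
Indexing row by row sends one HTTP request per film, which is slow for a dataset of this size. A possible alternative, sketched here, is the client's `helpers.bulk` function. The sketch assumes the DataFrame index is an acceptable unique document ID (unlike the title, it cannot collide); the names `generar_acciones` and `indexar_en_bloque` are hypothetical, and the sample row is shortened for illustration.

```python
def generar_acciones(filas, index_name):
    """Yield one bulk action per (id, title, plot) tuple."""
    for doc_id, titulo, trama in filas:
        yield {
            "_index": index_name,
            "_id": str(doc_id),  # assumption: unique row ID instead of the title
            "_source": {"title": titulo, "plot": trama},
        }

def indexar_en_bloque(es, df, index_name):
    """Send every row of df to Elasticsearch through the bulk helper."""
    # Imported lazily so generar_acciones can be tried without the client installed.
    from elasticsearch import helpers

    filas = zip(df.index, df["Title"], df["Plot"])
    indexados, errores = helpers.bulk(es, generar_acciones(filas, index_name))
    return indexados, errores

# Inspect the generated actions without contacting a server.
ejemplo = [(0, "Kansas Saloon Smashers", "A bartender is working at a saloon")]
print(list(generar_acciones(ejemplo, "elasticsearch_index")))
```

With a live client, `indexar_en_bloque(es, df_final, index_name)` would take the place of the loop.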


Search function for the Elasticsearch index (`match` queries on `text` fields are scored with BM25, Elasticsearch's default similarity)

In [53]:
# Run a BM25 search against the plot field
def busqueda_bm25(query, es, index_name):
    # The 8.x Python client takes query and size as keyword arguments; body= is deprecated
    resultados = es.search(
        index=index_name,
        query={"match": {"plot": query}},
        size=5
    )
    return [
        {"Title": hit["_source"]["title"], "Plot": hit["_source"]["plot"]}
        for hit in resultados["hits"]["hits"]
    ]
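
Elasticsearch does the scoring internally, so the ranking logic is not visible in the function above. The sketch below illustrates the textbook BM25 formula in plain Python, using k1 = 1.2 and b = 0.75 (the defaults Elasticsearch documents) and the IDF form ln(1 + (N - n + 0.5) / (n + 0.5)). It is an illustration only: the whitespace tokenization, the toy corpus, and the name `puntuar_bm25` are assumptions, and Elasticsearch's analyzer and internal encoding will give different absolute scores.

```python
import math
from collections import Counter

def puntuar_bm25(consulta, documentos, k1=1.2, b=0.75):
    """Return one BM25 score per document for a whitespace-tokenized query."""
    docs = [d.lower().split() for d in documentos]
    n_docs = len(docs)
    longitud_media = sum(len(d) for d in docs) / n_docs
    # Number of documents that contain each term.
    df = Counter(t for d in docs for t in set(d))
    puntuaciones = []
    for d in docs:
        frecuencias = Counter(d)
        total = 0.0
        for termino in consulta.lower().split():
            f = frecuencias.get(termino, 0)
            if f == 0:
                continue
            idf = math.log(1 + (n_docs - df[termino] + 0.5) / (df[termino] + 0.5))
            # Term-frequency saturation with document-length normalization.
            normalizacion = k1 * (1 - b + b * len(d) / longitud_media)
            total += idf * f * (k1 + 1) / (f + normalizacion)
        puntuaciones.append(total)
    return puntuaciones

documentos = [
    "a scientist builds a time machine to travel in time",
    "two friends travel across the country",
    "a love story set in a small town",
]
puntuaciones = puntuar_bm25("time travel", documentos)
print([round(p, 3) for p in puntuaciones])
```

The first document scores highest because it matches both query terms and "time" is rarer in the corpus than "travel".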



Run the search

In [55]:
# Search example
resultados_bm25 = busqueda_bm25("time travel", es, index_name)
for resultado in resultados_bm25:
    print(f"Title: {resultado['Title']}\nPlot: {resultado['Plot']}\n")


Title: Love Story 2050
Plot: Karan Malhotra (Harman Baweja) is a spirited and happy-go-lucky boy who does not follow the rules. Sana (Priyanka Chopra) is the opposite of Karan: a sweet and shy girl who lives life by the rules. Even though they are completely opposite, they fall in love, leading to a magical love story.[9]
A scientist, Dr. Yatinder Khanna (Boman Irani), has dedicated 15 years of his life to building a time machine. Sana expresses a wish to time-travel to Mumbai in the year 2050, but she is killed in an accident before her marriage to Karan. Karan wishes to travel back in time and find Sana. Dr. Yatinder, Karan, and Sana's siblings, Rahul and Thea, travel forward in time and reach Mumbai in 2050. They are fascinated by the futuristic Mumbai, with its flying cars, holograms, robots, 200-story buildings and more.
Twists and turns lead to the introduction of Ziesha (Priyanka Chopra), the reincarnation of Sana. Ziesha is a popular singer in 2050 who does not remember her p

# FAISS

In [2]:
!pip install faiss-cpu
!pip install sentence-transformers

Note: you may need to restart the kernel to use updated packages.


In [19]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Create embeddings with Sentence Transformers
modelo = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = modelo.encode(df_final['Plot'].tolist())


In [23]:
# Create the FAISS index (exact L2 search over all vectors)
dimension = embeddings.shape[1]
indice_faiss = faiss.IndexFlatL2(dimension)
indice_faiss.add(embeddings)
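
`IndexFlatL2` performs an exact, brute-force search: it compares the query with every stored vector and returns the smallest squared Euclidean distances. A pure-Python sketch of the same idea on toy 3-dimensional vectors follows; the vectors and the function name `buscar_l2` are made up for illustration, whereas real embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
def buscar_l2(consulta, vectores, k):
    """Return (squared distances, indices) of the k nearest vectors."""
    distancias = [
        sum((q - x) ** 2 for q, x in zip(consulta, v))
        for v in vectores
    ]
    # Sort indices by distance, ascending, and keep the first k.
    orden = sorted(range(len(vectores)), key=lambda i: distancias[i])[:k]
    return [distancias[i] for i in orden], orden

vectores = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
distancias, indices = buscar_l2([1.0, 0.0, 0.0], vectores, k=2)
print(indices, distancias)
```

Because the scan is linear in the number of stored vectors, a flat index trades speed for exactness.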

In [24]:
# FAISS search function: embed the query and return the five nearest plots
def busqueda_faiss(consulta, modelo, indice_faiss, df):
    consulta_vector = modelo.encode([consulta])
    _, indices = indice_faiss.search(np.array(consulta_vector).astype('float32'), 5)
    resultados = df.iloc[indices[0]][['Title', 'Plot']].to_dict('records')
    return resultados

In [25]:
# Search example
resultados_faiss = busqueda_faiss("time travel", modelo, indice_faiss, df_final)
for resultado in resultados_faiss:
    print(f"Title: {resultado['Title']}\nPlot: {resultado['Plot']}\n")


Title: Dimensions
Plot: The film follows Stephen, a brilliant young scientist who lives in Cambridge, England, in what appears to be the 1920s. His world is turned upside down upon meeting a charismatic and inspirational professor at a garden party, who demonstrates to Stephen and his friends what life would be like if they were one-, or two-dimensional beings. He then proceeds to explain that by manipulating other dimensions, time travel may actually be possible.
Soon after the professor's visit, Stephan,his cousin, Conrad, and his neighbor, Victoria, were fooling around by a well. Conrad throws Victoria's skipping rope down the well. The nanny catches the boys rolling around fighting and drags them in the house by the ears leaving Victoria alone to play outside by herself. After some time, she decides to climb down the well to get her skipping rope. She never climbs out of the well and her body is never found.
As Stephen’s life unfolds, events lead him to dedicate himself to turnin

# Chroma

In [None]:
import torch
print(torch.cuda.is_available())
print(torch.version.cuda)
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU found")


In [None]:
import pandas as pd
import chromadb
from chromadb.utils import embedding_functions
import torch

# Check whether a GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Keep only rows that have a plot
movies_df = df_final[df_final['Plot'].notna()]
documents = movies_df['Plot'].tolist()
titles = movies_df['Title'].tolist()

# Set up the Chroma client (in-memory)
client = chromadb.Client()

# Create the embedding function on the selected device (GPU when available)
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2",
    device=device
)

# Create the collection with the configured embedding function
collection = client.get_or_create_collection(
    name="movies_collection",
    embedding_function=embedding_function
)

# Generate a unique ID for each document
ids = [f"movie_{i}" for i in range(len(movies_df))]

# Split the data into batches
batch_size = 5000  # Batch size
for i in range(0, len(documents), batch_size):
    batch_documents = documents[i:i+batch_size]
    batch_titles = titles[i:i+batch_size]
    batch_ids = ids[i:i+batch_size]

    # Add the batch to the collection; Chroma computes the embeddings here
    collection.add(
        documents=batch_documents,
        metadatas=[{"title": title} for title in batch_titles],
        ids=batch_ids
    )
    print(f"Batch {i // batch_size + 1} processed successfully.")

print(f"Added {len(documents)} documents and generated their embeddings.")


Using device: cuda
Batch 1 processed successfully.
Batch 2 processed successfully.
Batch 3 processed successfully.
Batch 4 processed successfully.
Batch 5 processed successfully.
Batch 6 processed successfully.
Batch 7 processed successfully.
Added 34886 documents and generated their embeddings.


In [None]:
# Define a query
query = "young boy"

# Query the collection
results = collection.query(
    query_texts=[query],
    n_results=5
)

# Show the results
for i, document in enumerate(results['documents'][0]):
    print(f"Result {i + 1}:")
    print(f"Plot: {document}")
    print("-" * 50)


Result 1:
Plot: An investigative thriller based on the search for a missing youngster.
--------------------------------------------------
Result 2:
Plot: A young boy from the lower caste resorts to petty thefts to make both ends meet after he lost his father at the early age. Once the boy realizes the importance of an education, he begins to improve his life and never looks back. Through diligence and dedication, he climbs the social and political ladder to success.
--------------------------------------------------
Result 3:
Plot: Thirteen-year-old Jesse is assigned a school project. A photographic self-portrait intended to portray one’s self without resorting to literal representation. Jesse lives with his parents, Sabi and Tim, in the lefty, middle class Toronto neighbourhood of Riverdale. A quiet and distant only-child with budding artistic aspirations, Jesse is inspired by the assignment to look for excitement and meaning in the world around him. Wielding a newly acqui
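
The query loop prints only the plot text. Since each document was stored with a `title` in its metadata, the titles can be shown as well. A small sketch, assuming `results` has the nested-list layout returned by `collection.query` (one inner list per query text, under the `documents`, `metadatas`, and `distances` keys). The helper `emparejar_resultados` and the sample dictionary are hypothetical stand-ins so the snippet runs without a Chroma collection.

```python
def emparejar_resultados(results, posicion_consulta=0):
    """Pair each hit's title, distance, and plot for one query."""
    documentos = results["documents"][posicion_consulta]
    metadatos = results["metadatas"][posicion_consulta]
    distancias = results["distances"][posicion_consulta]
    return [
        {"title": meta["title"], "distance": dist, "plot": doc}
        for doc, meta, dist in zip(documentos, metadatos, distancias)
    ]

# Hand-written stand-in shaped like a query result (all values are invented).
results_ejemplo = {
    "documents": [["Plot text A", "Plot text B"]],
    "metadatas": [[{"title": "Film A"}, {"title": "Film B"}]],
    "distances": [[0.42, 0.57]],
}
for fila in emparejar_resultados(results_ejemplo):
    print(f"{fila['title']} ({fila['distance']:.2f}): {fila['plot']}")
```

With the real collection, `emparejar_resultados(results)` would be called on the dictionary returned by the query cell.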