# Dataset Preparation


Load the CSV into a DataFrame

In [1]:
import pandas as pd
archivo_csv = 'wiki_movie_plots_deduped.csv'
df = pd.read_csv(archivo_csv)

Keep only the relevant columns

In [2]:
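# .copy() makes df_final independent of df, so later column assignments do not trigger SettingWithCopyWarning
df_final = df[['Release Year', 'Title', 'Plot']].copy()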
df_final.head()

Unnamed: 0,Release Year,Title,Plot
0,1901,Kansas Saloon Smashers,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,The earliest known adaptation of the classic f...


Check for null values

In [3]:
df_final.isna().sum()

Release Year    0
Title           0
Plot            0
dtype: int64

Convert to lowercase and remove periods and commas

In [4]:
df_final.loc[:, 'textoLimpio'] = df_final['Plot'].str.lower().str.replace('.', '', regex=False).str.replace(',', '', regex=False)


SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_final.loc[:, 'textoLimpio'] = df_final['Plot'].str.lower().str.replace('.', '', regex=False).str.replace(',', '', regex=False)

pandas emits this warning when `df_final` is a plain slice of `df`; creating it with `.copy()` avoids it.
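The cleaning step above removes only periods and commas, so characters such as parentheses, quotes and exclamation marks stay in the text. A minimal sketch of a broader cleanup, assuming that deleting all ASCII punctuation is acceptable for this corpus (the helper name `clean_text` is hypothetical and not part of the notebook):

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Lowercase the text and delete all ASCII punctuation."""
    return text.lower().translate(PUNCT_TABLE)


sample = "A bartender is working at a saloon, serving drinks (and more)!"
print(clean_text(sample))
```

On a DataFrame column this could be applied with something like `df_final['Plot'].map(clean_text)`. Note that apostrophes are deleted too, so "Nick's" becomes "nicks".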


In [5]:
df_final.head()

Unnamed: 0,Release Year,Title,Plot,textoLimpio
0,1901,Kansas Saloon Smashers,"A bartender is working at a saloon, serving dr...",a bartender is working at a saloon serving dri...
1,1901,Love by the Light of the Moon,"The moon, painted with a smiling face hangs ov...",the moon painted with a smiling face hangs ove...
2,1901,The Martyred Presidents,"The film, just over a minute long, is composed...",the film just over a minute long is composed o...
3,1901,"Terrible Teddy, the Grizzly King",Lasting just 61 seconds and consisting of two ...,lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,The earliest known adaptation of the classic f...,the earliest known adaptation of the classic f...


## Tokenization

In [6]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

df_final.loc[:, 'tokens'] = df_final['textoLimpio'].apply(word_tokenize)

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\glenn\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


In [7]:
df_final.head()

Unnamed: 0,Release Year,Title,Plot,textoLimpio,tokens
0,1901,Kansas Saloon Smashers,"A bartender is working at a saloon, serving dr...",a bartender is working at a saloon serving dri...,"[a, bartender, is, working, at, a, saloon, ser..."
1,1901,Love by the Light of the Moon,"The moon, painted with a smiling face hangs ov...",the moon painted with a smiling face hangs ove...,"[the, moon, painted, with, a, smiling, face, h..."
2,1901,The Martyred Presidents,"The film, just over a minute long, is composed...",the film just over a minute long is composed o...,"[the, film, just, over, a, minute, long, is, c..."
3,1901,"Terrible Teddy, the Grizzly King",Lasting just 61 seconds and consisting of two ...,lasting just 61 seconds and consisting of two ...,"[lasting, just, 61, seconds, and, consisting, ..."
4,1902,Jack and the Beanstalk,The earliest known adaptation of the classic f...,the earliest known adaptation of the classic f...,"[the, earliest, known, adaptation, of, the, cl..."
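For comparison, tokenization can also be approximated without NLTK. The following is a rough sketch, assuming that runs of ASCII letters and digits are good enough tokens for English plots; `simple_tokenize` is a hypothetical helper and will not match `word_tokenize` exactly (contractions and hyphenated words, for example, are split differently):

```python
import re

# One token = one maximal run of lowercase letters or digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def simple_tokenize(text: str) -> list:
    """Lowercase the text and return its alphanumeric tokens in order."""
    return TOKEN_RE.findall(text.lower())


print(simple_tokenize("Lasting just 61 seconds and consisting of two shots"))
```

Because the pattern skips everything that is not a letter or digit, this variant also drops punctuation in the same pass.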


# Remove Stopwords

In [8]:
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
df_final.loc[:, 'tokens'] = df_final['tokens'].apply(lambda tokens: [token for token in tokens if token not in stop_words])

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\glenn\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
df_final.head()

Unnamed: 0,Release Year,Title,Plot,textoLimpio,tokens
0,1901,Kansas Saloon Smashers,"A bartender is working at a saloon, serving dr...",a bartender is working at a saloon serving dri...,"[bartender, working, saloon, serving, drinks, ..."
1,1901,Love by the Light of the Moon,"The moon, painted with a smiling face hangs ov...",the moon painted with a smiling face hangs ove...,"[moon, painted, smiling, face, hangs, park, ni..."
2,1901,The Martyred Presidents,"The film, just over a minute long, is composed...",the film just over a minute long is composed o...,"[film, minute, long, composed, two, shots, fir..."
3,1901,"Terrible Teddy, the Grizzly King",Lasting just 61 seconds and consisting of two ...,lasting just 61 seconds and consisting of two ...,"[lasting, 61, seconds, consisting, two, shots,..."
4,1902,Jack and the Beanstalk,The earliest known adaptation of the classic f...,the earliest known adaptation of the classic f...,"[earliest, known, adaptation, classic, fairyta..."
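The filtering step can be illustrated in isolation. This sketch uses a tiny hand-written stopword set (`TOY_STOP_WORDS` is a hypothetical stand-in for NLTK's English list, which is much longer) to show why the lookup structure should be a set rather than a list:

```python
# Hypothetical miniature stopword list; NLTK's English list is far larger.
TOY_STOP_WORDS = frozenset({"a", "an", "the", "is", "at", "of", "and", "with"})


def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Return the tokens that are not stopwords, preserving their order."""
    # Membership tests on a (frozen)set are O(1) on average.
    return [token for token in tokens if token not in stop_words]


tokens = ["a", "bartender", "is", "working", "at", "a", "saloon"]
print(remove_stop_words(tokens))
```

Any query text should go through the same filter as the documents; otherwise a stopword in the query can never match an indexed token.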


# Inverted Index

In [10]:
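def indiceInvertido(df, columna_tokens):
    # Map each token to the list of row labels of the documents that contain it
    indice = {}
    for index, row in df.iterrows():
        tokens = row[columna_tokens]
        for token in tokens:
            if token not in indice:
                indice[token] = []
            # A label is appended once per occurrence, so lists may contain repeats
            indice[token].append(index)
    return indice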
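Since the function above appends a row label for every occurrence of a token, a document that mentions a word several times appears several times in that word's posting list. A minimal sketch of a deduplicated variant, assuming documents are given as a plain dict of `doc_id -> tokens` (the name `build_index` and the toy data are illustrative only):

```python
from collections import defaultdict


def build_index(docs: dict) -> dict:
    """Build an inverted index mapping each token to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            # A set stores each doc id once, however often the token occurs.
            index[token].add(doc_id)
    return dict(index)


toy_docs = {
    0: ["time", "travel", "time"],
    1: ["time", "machine"],
    2: ["space", "travel"],
}
toy_index = build_index(toy_docs)
print(sorted(toy_index["time"]))    # [0, 1]
print(sorted(toy_index["travel"]))  # [0, 2]
```

Sets also make it cheap to combine posting lists later with union or intersection.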


In [11]:
indice = indiceInvertido(df_final, 'tokens')

In [12]:
def buscarTitulos(indice_invertido, df, columna_title, columna_plot, palabra):
    # Look up the posting list for the word (empty if the word is not indexed)
    documentos = indice_invertido.get(palabra, [])
    titulos_encontrados = []
    for documento in documentos:
        # columna_plot is kept in the signature, but only titles are returned
        titulos_encontrados.append(df.loc[documento, columna_title])
    return titulos_encontrados


In [13]:
def busquedaIndiceInvertido(indice, df, texto_busqueda):
    # Tokenize the search text; lowercase it to match the lowercased index tokens
    tokens_busqueda = texto_busqueda.lower().split()
    resultados = set()  # A set avoids duplicate (title, plot) pairs

    # Collect every document that contains at least one of the query tokens
    for token in tokens_busqueda:
        if token in indice:
            # The index stores DataFrame labels (from iterrows), so look them up with .loc
            for index in indice[token]:
                titulo = df.loc[index, "Title"]
                plot = df.loc[index, "Plot"]
                resultados.add((titulo, plot))

    # Sets are unordered, so these are five arbitrary matches, not the five best
    return [{"Title": titulo, "Plot": plot} for titulo, plot in list(resultados)[:5]]



In [14]:
resultados = busquedaIndiceInvertido(indice,df_final,"time travel")
for resultado in resultados:
    print(f"Title: {resultado['Title']}")
    print(f"Plot: {resultado['Plot']}\n")

Title: Brewster's Millions
Plot: As summarized in a film publication,[4] Monte Brewster's (Arbuckle) two grandfathers, one rich and the other a self-made man, squabble as to the way the infant should be raised. The mother steps in and decides to raise the child her way, which results in Monte being a clerk in a steamship office at the age of 21. At this point the grandfathers get together again, with one grandfather giving him $1 million, and the other offering $4 million provided that at the end of one year Monte spends the $1 million given by the other grandfather. Other conditions include that he be absolutely "broke" at the end of one year, that he not marry for five years, and not to tell any one of the arrangement. Young Brewster tries everything he can to get rid of the money, but everything he does and the wildest chances he takes result in more money for him. He hires three men to help him spend the money, but they take too much interest in investing it wisely. They hire Peg
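The result above has little to do with time travel. A plausible reason is that the search returns any document containing "time" or "travel" and then takes five arbitrary entries from an unordered set. A minimal sketch of a stricter lookup, assuming the index maps tokens to lists of document ids (the function name `search_all_terms` and the toy index are hypothetical):

```python
def search_all_terms(index: dict, query: str, limit: int = 5) -> list:
    """Return up to `limit` doc ids that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # One set of doc ids per term; unknown terms yield an empty set.
    postings = [set(index.get(term, ())) for term in terms]
    # Intersection keeps only documents that matched all terms (AND semantics).
    matches = set.intersection(*postings)
    # Sorting makes the output deterministic; it is not a relevance ranking.
    return sorted(matches)[:limit]


toy_index = {"time": [0, 1, 1, 4], "travel": [1, 2, 4], "machine": [1]}
print(search_all_terms(toy_index, "time travel"))  # [1, 4]
```

This only narrows the candidate set. Ordering the matches by relevance still needs a scoring function, which is what Whoosh and Elasticsearch provide.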

# Whoosh

In [15]:
%pip install whoosh

Collecting whoosh
  Downloading Whoosh-2.7.4-py2.py3-none-any.whl.metadata (3.1 kB)
Downloading Whoosh-2.7.4-py2.py3-none-any.whl (468 kB)
Installing collected packages: whoosh
Successfully installed whoosh-2.7.4
Note: you may need to restart the kernel to use updated packages.


In [16]:
from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser
import os

In [17]:
schema = Schema(
    Title=TEXT(stored=True),  # Index and store the title
    Plot=TEXT(stored=True)    # Index and store the plot
)

In [18]:
# Step 2: Create the index
index_dir = "whoosh_index"
# exist_ok=True makes the call safe when the directory already exists
os.makedirs(index_dir, exist_ok=True)
index_whoosh = index.create_in(index_dir, schema)


In [19]:
writer = index_whoosh.writer()
for _, row in df_final.iterrows():
    writer.add_document(
        Title=row["Title"],
        Plot=row["Plot"]
    )
writer.commit()


In [20]:

# Step 4: Run searches
def buscar_peliculas_whoosh(texto_busqueda):
    with index_whoosh.searcher() as searcher:
        query = QueryParser("Plot", index_whoosh.schema).parse(texto_busqueda)
        resultados = searcher.search(query, limit=5)  # Limit to the 5 most relevant results
        # Copy the stored fields into plain dicts before the searcher is closed
        return [dict(result) for result in resultados]


In [21]:
# Search example
resultados = buscar_peliculas_whoosh("time travel")
for resultado in resultados:
    print(f"Title: {resultado['Title']}")
    print(f"Plot: {resultado['Plot']}\n")

Title: Time Chasers
Plot: Physics teacher and amateur pilot Nick Miller (Matthew Bruch) has finally completed his quest of enabling time travel, via a Commodore 64 and his small airplane. After being inspired by a television commercial for GenCorp, he uses a ruse to bring out both a GenCorp executive and a reporter from a local paper. To Nick's surprise, the reporter is Lisa Hansen (Bonnie Pritchard), an old high school flame. One trip to 2041 later and Gencorp's executive, Matthew Paul (Peter Harrington), quickly arranges Nick a meeting with CEO J.K. Robertson (George Woodard). Impressed by the potential of time travel, Robertson offers Nick a licensing agreement on the technology.
The following week, Nick and Lisa meet at the supermarket and go on a date to the 1950s. However, another trip to 2041 reveals that GenCorp abused Nick's time travel technology, creating a dystopian future. In an attempt to tell J.K. about how GenCorp inadvertently ruined the future. J.K. dismisses the ev

# Elasticsearch

In [22]:
!pip install elasticsearch

Collecting elasticsearch
  Downloading elasticsearch-8.17.0-py3-none-any.whl.metadata (8.8 kB)
Collecting elastic-transport<9,>=8.15.1 (from elasticsearch)
  Downloading elastic_transport-8.17.0-py3-none-any.whl.metadata (3.6 kB)
Downloading elasticsearch-8.17.0-py3-none-any.whl (571 kB)
Downloading elastic_transport-8.17.0-py3-none-any.whl (64 kB)
Installing collected packages: elastic-transport, elasticsearch
Successfully installed elastic-transport-8.17.0 elasticsearch-8.17.0


### Running Elasticsearch in Docker

docker network create elastic

docker pull docker.elastic.co/elasticsearch/elasticsearch:8.17.0

docker run -d --name elasticsearch --net elastic -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.17.0

curl http://localhost:9200
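Elasticsearch usually needs some time after `docker run` before it accepts connections, so a check made too early fails even when the container is configured correctly. The `curl` check can be turned into a small polling loop. This is a sketch using only the standard library; the helper name `wait_for_http` and its default retry values are assumptions, not part of the notebook:

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url: str, attempts: int = 30, delay: float = 2.0, timeout: float = 2.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or the attempts run out."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: the service is not ready yet.
            pass
        time.sleep(delay)
    return False
```

Calling `wait_for_http("http://localhost:9200")` before creating the client would block until the node responds and return `False` if it never does.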

In [23]:
from elasticsearch import Elasticsearch

# Create the Elasticsearch client
es = Elasticsearch("http://localhost:9200")

# Check whether the connection works
if es.ping():
    print("Successfully connected to Elasticsearch")
else:
    print("Failed to connect to Elasticsearch")


Failed to connect to Elasticsearch


In [24]:
# Define the index mapping
index_name = "movies"
if not es.indices.exists(index=index_name):
    # The 8.x client takes `mappings` directly; the `body` parameter is deprecated
    es.indices.create(
        index=index_name,
        mappings={
            "properties": {
                "title": {"type": "text"},
                "plot": {"type": "text"}
            }
        }
    )
    print(f"Index '{index_name}' created.")
else:
    print(f"Index '{index_name}' already exists.")


ConnectionError: Connection error caused by: ConnectionError(Connection error caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x000001B2BC5BFD70>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it))

In [None]:
# Index the data
for _, row in df_final.iterrows():
    doc = {
        "title": row["Title"],
        "plot": row["Plot"]
    }
    # The title is used as the document id, so films that share a title overwrite each other
    es.index(index=index_name, id=row["Title"], document=doc)

print("Movies indexed successfully.")


Movies indexed successfully.
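Indexing one document per request means one HTTP round trip per film. The Python client ships a bulk helper, `elasticsearch.helpers.bulk(client, actions)`, which accepts an iterable of action dicts. A sketch of a generator that produces those actions from plain row dicts, assuming a running position is an acceptable document id (`make_bulk_actions` is a hypothetical name):

```python
def make_bulk_actions(rows, index_name):
    """Yield one bulk 'index' action per row, using the row position as document id."""
    for doc_id, row in enumerate(rows):
        yield {
            "_index": index_name,
            "_id": doc_id,  # unique per row, so films sharing a title do not collide
            "_source": {"title": row["Title"], "plot": row["Plot"]},
        }


toy_rows = [
    {"Title": "Brewster's Millions", "Plot": "First version."},
    {"Title": "Brewster's Millions", "Plot": "Later remake."},
]
actions = list(make_bulk_actions(toy_rows, "movies"))
print(len(actions), actions[1]["_id"])  # 2 1
```

With a live cluster, the rows could come from `df_final.to_dict("records")` and the generator could be passed straight to `bulk(es, make_bulk_actions(rows, index_name))`.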


In [65]:
# Search function
def buscar_peliculas_elasticsearch(query):
    # Full-text match on the plot field, limited to the top 5 hits.
    # The 8.x client takes `query` and `size` directly; `body` is deprecated.
    resultados = es.search(
        index=index_name,
        size=5,
        query={"match": {"plot": query}},
    )
    return [
        {
            "title": hit["_source"]["title"],
            "plot": hit["_source"]["plot"]
        }
        for hit in resultados["hits"]["hits"]
    ]

In [66]:
# Search example
resultados = buscar_peliculas_elasticsearch("time travel")
for resultado in resultados:
    print(f"Title: {resultado['title']}\nPlot: {resultado['plot']}\n")

Title: Time Chasers
Plot: Physics teacher and amateur pilot Nick Miller (Matthew Bruch) has finally completed his quest of enabling time travel, via a Commodore 64 and his small airplane. After being inspired by a television commercial for GenCorp, he uses a ruse to bring out both a GenCorp executive and a reporter from a local paper. To Nick's surprise, the reporter is Lisa Hansen (Bonnie Pritchard), an old high school flame. One trip to 2041 later and Gencorp's executive, Matthew Paul (Peter Harrington), quickly arranges Nick a meeting with CEO J.K. Robertson (George Woodard). Impressed by the potential of time travel, Robertson offers Nick a licensing agreement on the technology.
The following week, Nick and Lisa meet at the supermarket and go on a date to the 1950s. However, another trip to 2041 reveals that GenCorp abused Nick's time travel technology, creating a dystopian future. In an attempt to tell J.K. about how GenCorp inadvertently ruined the future. J.K. dismisses the ev
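Whoosh and Elasticsearch both rank by a BM25-family score by default, which likely explains why they agree on "Time Chasers" while the hand-built index does not. The sketch below implements the textbook BM25 formula on toy token lists. It is not either engine's exact implementation, and `k1=1.2`, `b=0.75` are commonly used defaults taken here as assumptions:

```python
import math
from collections import Counter


def bm25_scores(docs, query_terms, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query terms."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


toy_docs = [
    ["time", "travel", "time", "machine"],
    ["travel", "agency", "comedy"],
    ["western", "saloon", "bartender"],
]
scores = bm25_scores(toy_docs, ["time", "travel"])
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # 0
```

The first toy document wins because it contains both query terms and the rarer one twice, which mirrors a plot that mentions time travel repeatedly.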