Load the CSV into a DataFrame

In [39]:
import pandas as pd
archivo_csv = 'wiki_movie_plots_deduped.csv'
df = pd.read_csv(archivo_csv)

Keep only the relevant columns

In [40]:
df_final = df[['Release Year','Title', 'Plot']]
df_final.head()

Unnamed: 0,Release Year,Title,Plot
0,1901,Kansas Saloon Smashers,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,The earliest known adaptation of the classic f...


Check for null values

In [41]:
df_final.isna().sum()

Release Year    0
Title           0
Plot            0
dtype: int64

Convert to lowercase and remove periods and commas

In [42]:
# assign() returns a new DataFrame, which avoids the pandas SettingWithCopyWarning
# raised when writing a column into df_final while it is still a slice of df.
df_final = df_final.assign(
    textoLimpio=df_final['Plot'].str.lower().str.replace('.', '', regex=False).str.replace(',', '', regex=False)
)
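The cleaning step above removes only periods and commas, so apostrophes, colons, parentheses and quotes remain in the text. A minimal sketch of a broader cleanup, using only the standard library, is shown here. The helper name `limpiar_texto` is hypothetical and not part of the original notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def limpiar_texto(texto):
    """Lowercase the text and strip all ASCII punctuation (hypothetical helper)."""
    return texto.lower().translate(_PUNCT_TABLE)


print(limpiar_texto("Nick's time-travel plan: risky, isn't it?"))
# nicks timetravel plan risky isnt it
```

If this behavior is wanted in the notebook, it could be applied with `df_final['Plot'].map(limpiar_texto)`. Note that deleting hyphens merges compound words ("time-travel" becomes "timetravel"), which may or may not be desirable for search.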


In [43]:
df_final.head()

Unnamed: 0,Release Year,Title,Plot,textoLimpio
0,1901,Kansas Saloon Smashers,"A bartender is working at a saloon, serving dr...",a bartender is working at a saloon serving dri...
1,1901,Love by the Light of the Moon,"The moon, painted with a smiling face hangs ov...",the moon painted with a smiling face hangs ove...
2,1901,The Martyred Presidents,"The film, just over a minute long, is composed...",the film just over a minute long is composed o...
3,1901,"Terrible Teddy, the Grizzly King",Lasting just 61 seconds and consisting of two ...,lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,The earliest known adaptation of the classic f...,the earliest known adaptation of the classic f...


## Tokenization

In [44]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

df_final.loc[:, 'tokens'] = df_final['textoLimpio'].apply(word_tokenize)

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\dicam\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [9]:
df_final.head()

Unnamed: 0,Release Year,Title,Plot,textoLimpio,tokens
0,1901,Kansas Saloon Smashers,"A bartender is working at a saloon, serving dr...",a bartender is working at a saloon serving dri...,"[a, bartender, is, working, at, a, saloon, ser..."
1,1901,Love by the Light of the Moon,"The moon, painted with a smiling face hangs ov...",the moon painted with a smiling face hangs ove...,"[the, moon, painted, with, a, smiling, face, h..."
2,1901,The Martyred Presidents,"The film, just over a minute long, is composed...",the film just over a minute long is composed o...,"[the, film, just, over, a, minute, long, is, c..."
3,1901,"Terrible Teddy, the Grizzly King",Lasting just 61 seconds and consisting of two ...,lasting just 61 seconds and consisting of two ...,"[lasting, just, 61, seconds, and, consisting, ..."
4,1902,Jack and the Beanstalk,The earliest known adaptation of the classic f...,the earliest known adaptation of the classic f...,"[the, earliest, known, adaptation, of, the, cl..."


# Remove Stopwords

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
df_final.loc[:, 'tokens'] = df_final['tokens'].apply(lambda tokens: [token for token in tokens if token not in stop_words])

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dicam\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
df_final.head()

Unnamed: 0,Release Year,Title,Plot,textoLimpio,tokens
0,1901,Kansas Saloon Smashers,"A bartender is working at a saloon, serving dr...",a bartender is working at a saloon serving dri...,"[bartender, working, saloon, serving, drinks, ..."
1,1901,Love by the Light of the Moon,"The moon, painted with a smiling face hangs ov...",the moon painted with a smiling face hangs ove...,"[moon, painted, smiling, face, hangs, park, ni..."
2,1901,The Martyred Presidents,"The film, just over a minute long, is composed...",the film just over a minute long is composed o...,"[film, minute, long, composed, two, shots, fir..."
3,1901,"Terrible Teddy, the Grizzly King",Lasting just 61 seconds and consisting of two ...,lasting just 61 seconds and consisting of two ...,"[lasting, 61, seconds, consisting, two, shots,..."
4,1902,Jack and the Beanstalk,The earliest known adaptation of the classic f...,the earliest known adaptation of the classic f...,"[earliest, known, adaptation, classic, fairyta..."


# Inverted Index

In [12]:
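def indiceInvertido(df, columna_tokens):
    """Build an inverted index: token -> list of row labels of the documents containing it."""
    indice = {}
    for index, row in df.iterrows():
        # set() ensures each document appears at most once per token,
        # even when the token occurs several times in the same plot.
        for token in set(row[columna_tokens]):
            indice.setdefault(token, []).append(index)
    return indice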
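To make the resulting structure concrete, here is a minimal sketch of the same idea on a toy corpus, using plain dictionaries instead of a DataFrame. The function `construir_indice` and the three sample documents are invented for illustration.

```python
from collections import defaultdict


def construir_indice(documentos):
    """Map each token to the sorted list of document ids that contain it."""
    indice = defaultdict(set)
    for doc_id, tokens in documentos.items():
        for token in tokens:
            indice[token].add(doc_id)
    return {token: sorted(ids) for token, ids in indice.items()}


# Toy corpus (invented for illustration): document id -> token list.
docs = {
    0: ["bartender", "saloon", "drinks"],
    1: ["moon", "park", "night"],
    2: ["saloon", "night", "fight"],
}
indice_demo = construir_indice(docs)
print(indice_demo["saloon"])  # [0, 2]
print(indice_demo["night"])   # [1, 2]
```

Each entry is a posting list: looking up a word costs one dictionary access, regardless of how many documents the corpus holds.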


In [13]:
indice = indiceInvertido(df_final, 'tokens')

In [14]:
def buscarTitulos(indice_invertido, df, columna_title, columna_plot, palabra):
    """Return the titles of the documents whose token list contains `palabra`."""
    # columna_plot is kept for signature compatibility; only titles are returned.
    documentos = indice_invertido.get(palabra, [])
    return [df.loc[documento, columna_title] for documento in documentos]


In [69]:
def busquedaIndiceInvertido(indice, df, texto_busqueda):
    # Lowercase and split the query so its terms match the lowercased tokens in the index
    tokens_busqueda = texto_busqueda.lower().split()
    resultados = set()  # A set avoids duplicate (title, plot) pairs

    # For each query term, collect the documents that contain it (OR semantics)
    for token in tokens_busqueda:
        if token in indice:
            for index in indice[token]:
                # The index stores row labels, so look rows up with .loc rather than .iloc
                titulo = df.loc[index, "Title"]
                plot = df.loc[index, "Plot"]
                resultados.add((titulo, plot))

    # Sets are unordered, so these five results are arbitrary matches, not the most relevant ones
    return [{"Title": titulo, "Plot": plot} for titulo, plot in list(resultados)[:5]]



In [70]:
resultados = busquedaIndiceInvertido(indice, df_final, "time travel")
for resultado in resultados:
    print(f"Title: {resultado['Title']}")
    print(f"Plot: {resultado['Plot']}\n")

Title: Johns
Plot: It's Christmas Eve and John (David Arquette) is asleep in a Los Angeles park. He awakens as someone is stealing his shoes, in which he keeps his money. He chases the thief but can't catch him. John is angered not only because those are his "lucky" sneakers but because he's trying to accumulate enough money for an overnight stay in a fancy hotel to celebrate his birthday, which is also Christmas. Each time John puts any money together, either by turning a trick, robbing the house of one of his regular "dates" or stealing from potential clients, it's taken from him either by robbery or in payback for a drug deal where he burned the dealer.
Meanwhile, Donner (Lukas Haas), a fellow hustler who's new to the streets and has fallen for John, tries to convince John to go with him to Branson, Missouri. Donner has a relative who runs a theme park there who can get them jobs. John is initially resistant to the idea but, after some particularly bad experiences, agrees to go.
J
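The search above returns any document that contains at least one query word, which is why a plot that merely mentions "time" can show up for "time travel". One possible refinement is to require every term, by intersecting the posting lists. The sketch below assumes an index shaped like the one built earlier (token -> list of document ids). The function `buscar_and` and the sample index are hypothetical.

```python
def buscar_and(indice, texto_busqueda):
    """Return ids of documents that contain every query term (AND semantics)."""
    terminos = texto_busqueda.lower().split()
    if not terminos:
        return []
    # Start from the first term's postings and intersect with the rest.
    candidatos = set(indice.get(terminos[0], []))
    for termino in terminos[1:]:
        candidatos &= set(indice.get(termino, []))
    return sorted(candidatos)


# Hand-written toy index (invented for illustration).
indice_demo = {"time": [0, 1, 3], "travel": [1, 2, 3], "machine": [3]}
print(buscar_and(indice_demo, "time travel"))          # [1, 3]
print(buscar_and(indice_demo, "Time travel machine"))  # [3]
print(buscar_and(indice_demo, "time dinosaur"))        # []
```

Intersection narrows the result set but still does not rank it. Relevance scoring is what the Whoosh and Elasticsearch sections add.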

# Whoosh

In [None]:
%pip install whoosh

Note: you may need to restart the kernel to use updated packages.


In [18]:
from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser
import os

In [19]:
schema = Schema(
    Title=TEXT(stored=True),  # Store the title so it can be returned with each hit
    Plot=TEXT(stored=True)    # Store the plot so it can be returned with each hit
)

In [20]:
# Create the index directory (if needed) and the index
index_dir = "whoosh_index"
os.makedirs(index_dir, exist_ok=True)
index_whoosh = index.create_in(index_dir, schema)


In [21]:
writer = index_whoosh.writer()
for _, row in df_final.iterrows():
    writer.add_document(
        Title=row["Title"],
        Plot=row["Plot"]
    )
writer.commit()


In [63]:

# Search the Whoosh index
def buscar_peliculas_whoosh(texto_busqueda):
    with index_whoosh.searcher() as searcher:
        query = QueryParser("Plot", index_whoosh.schema).parse(texto_busqueda)
        resultados = searcher.search(query, limit=5)  # Keep the 5 most relevant results
        # Copy the stored fields into plain dicts before the searcher is closed
        return [dict(result) for result in resultados]


In [67]:
# Example search
resultados = buscar_peliculas_whoosh("time travel")
for resultado in resultados:
    print(f"Title: {resultado['Title']}")
    print(f"Plot: {resultado['Plot']}\n")

Title: Time Chasers
Plot: Physics teacher and amateur pilot Nick Miller (Matthew Bruch) has finally completed his quest of enabling time travel, via a Commodore 64 and his small airplane. After being inspired by a television commercial for GenCorp, he uses a ruse to bring out both a GenCorp executive and a reporter from a local paper. To Nick's surprise, the reporter is Lisa Hansen (Bonnie Pritchard), an old high school flame. One trip to 2041 later and Gencorp's executive, Matthew Paul (Peter Harrington), quickly arranges Nick a meeting with CEO J.K. Robertson (George Woodard). Impressed by the potential of time travel, Robertson offers Nick a licensing agreement on the technology.
The following week, Nick and Lisa meet at the supermarket and go on a date to the 1950s. However, another trip to 2041 reveals that GenCorp abused Nick's time travel technology, creating a dystopian future. In an attempt to tell J.K. about how GenCorp inadvertently ruined the future. J.K. dismisses the ev

# Elasticsearch

In [None]:
!pip install elasticsearch



### Running Elasticsearch in Docker

docker network create elastic

docker pull docker.elastic.co/elasticsearch/elasticsearch:8.17.0

docker run -d --name elasticsearch --net elastic -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.17.0

curl http://localhost:9200

In [None]:
from elasticsearch import Elasticsearch

# Connect the Elasticsearch client
es = Elasticsearch("http://localhost:9200")

# Check the connection
if es.ping():
    print("Successfully connected to Elasticsearch")
else:
    print("Error connecting to Elasticsearch")


Successfully connected to Elasticsearch


In [None]:
# Define the index mapping
index_name = "movies"
if not es.indices.exists(index=index_name):
    # The 8.x Python client takes mappings as a keyword argument; body= is deprecated
    es.indices.create(
        index=index_name,
        mappings={
            "properties": {
                "title": {"type": "text"},
                "plot": {"type": "text"}
            }
        }
    )
    print(f"Index '{index_name}' created.")
else:
    print(f"Index '{index_name}' already exists.")


Index 'movies' already exists.


In [None]:
# Index the data
for doc_id, row in df_final.iterrows():
    doc = {
        "title": row["Title"],
        "plot": row["Plot"]
    }
    # Titles are not guaranteed to be unique, so use the row label as the document id
    es.index(index=index_name, id=str(doc_id), document=doc)

print("Movies indexed successfully.")


Movies indexed successfully.
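Indexing one document per request is simple but slow for a large dataset. A common alternative is the bulk helper from `elasticsearch.helpers`, which consumes an iterable of action dictionaries. The sketch below only builds those actions, so it runs without a cluster. The generator name `generar_acciones` and the sample rows (with placeholder plot text) are assumptions made for illustration.

```python
def generar_acciones(filas, index_name):
    """Yield one bulk action per (doc_id, title, plot) tuple."""
    for doc_id, title, plot in filas:
        yield {
            "_index": index_name,
            "_id": str(doc_id),
            "_source": {"title": title, "plot": plot},
        }


# Sample rows with placeholder plot text (invented for illustration).
filas_demo = [
    (0, "Kansas Saloon Smashers", "placeholder plot text"),
    (1, "Love by the Light of the Moon", "placeholder plot text"),
]
acciones = list(generar_acciones(filas_demo, "movies"))
print(acciones[0]["_id"], acciones[0]["_source"]["title"])
# 0 Kansas Saloon Smashers

# With a live cluster, the generator could be handed to the bulk helper:
# from elasticsearch.helpers import bulk
# filas = df_final[["Title", "Plot"]].itertuples(name=None)  # (label, Title, Plot)
# bulk(es, generar_acciones(filas, index_name))
```

Batching many documents per request removes most of the per-request overhead of the loop above.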


In [65]:
# Search function
def buscar_peliculas_elasticsearch(query):
    # Full-text match on the plot field, limited to the top 5 hits
    resultados = es.search(
        index=index_name,
        size=5,
        query={"match": {"plot": query}}
    )
    return [
        {
            "title": hit["_source"]["title"],
            "plot": hit["_source"]["plot"]
        }
        for hit in resultados["hits"]["hits"]
    ]

In [66]:
# Example search
resultados = buscar_peliculas_elasticsearch("time travel")
for resultado in resultados:
    print(f"Title: {resultado['title']}\nPlot: {resultado['plot']}\n")

Title: Time Chasers
Plot: Physics teacher and amateur pilot Nick Miller (Matthew Bruch) has finally completed his quest of enabling time travel, via a Commodore 64 and his small airplane. After being inspired by a television commercial for GenCorp, he uses a ruse to bring out both a GenCorp executive and a reporter from a local paper. To Nick's surprise, the reporter is Lisa Hansen (Bonnie Pritchard), an old high school flame. One trip to 2041 later and Gencorp's executive, Matthew Paul (Peter Harrington), quickly arranges Nick a meeting with CEO J.K. Robertson (George Woodard). Impressed by the potential of time travel, Robertson offers Nick a licensing agreement on the technology.
The following week, Nick and Lisa meet at the supermarket and go on a date to the 1950s. However, another trip to 2041 reveals that GenCorp abused Nick's time travel technology, creating a dystopian future. In an attempt to tell J.K. about how GenCorp inadvertently ruined the future. J.K. dismisses the ev
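
The search above matches only the `plot` field. Since the index also holds `title`, one possible extension is a `multi_match` query that searches both fields and weights title matches more heavily. The sketch below only builds the request parameters as a dictionary, so it runs without a cluster. The function name `construir_consulta` and the `^2` boost value are assumptions, not part of the original notebook.

```python
def construir_consulta(texto, campos=("title^2", "plot"), size=5):
    """Build search parameters for a multi_match query over several fields."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": texto,
                # "title^2" doubles the weight of matches in the title field.
                "fields": list(campos),
            }
        },
    }


parametros = construir_consulta("time travel")
print(parametros["query"]["multi_match"]["fields"])
# ['title^2', 'plot']

# With a live cluster, the parameters could be passed as keyword arguments:
# resultados = es.search(index=index_name, **parametros)
```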