# Natural Language Processing (NLP) Lab Practice
Topic: Language models

## Client context

Imagine you work at a film analysis company called "CineAnalyzer". The company automatically categorizes films into genres such as action, comedy, science fiction, and drama. To automate this process, it has decided to use Natural Language Processing (NLP) techniques, and you have been assigned the task of preparing the texts so that various analyses and classifications can later be run on them.

## Dataset

In [1]:
argumentos_peliculas = [
    "Un detective privado lucha por desentrañar los secretos oscuros de una mansión embrujada en 'Misterios en la Mansión Húngara'.",
    "Un equipo de astronautas debe luchar por sobrevivir en un planeta desconocido después de un aterrizaje forzoso en 'Planeta Olvidado'.",
    "En un mundo post-apocalíptico, un grupo de sobrevivientes busca refugio en 'La Última Esperanza'.",
    "Un científico brillante inventa una máquina del tiempo que desencadena consecuencias inesperadas en 'El Viaje Temporal'.",
    "Dos desconocidos quedan atrapados en un ascensor durante horas y descubren que tienen más en común de lo que imaginaban en 'Atrapados en el Ascensor'.",
    "Una joven talentosa lucha por alcanzar sus sueños musicales en 'Notas de Pasión'.",
    "Un detective de homicidios persigue a un asesino en serie que deja acertijos macabros en 'El Enigma del Asesino'.",
    "En un mundo de magia y criaturas míticas, un joven campesino emprende una búsqueda épica en 'La Búsqueda del Dragón'.",
    "Un grupo de amigos se enfrenta a sus miedos más oscuros cuando deciden pasar la noche en una casa encantada en 'Pesadillas Nocturnas'.",
    "Un hombre común descubre que tiene habilidades sobrenaturales y debe aprender a controlarlas en 'El Elegido'.",
    "Una periodista intrépida investiga una conspiración gubernamental en 'La Verdad Oculta'.",
    "En un futuro distópico, la lucha por los recursos desencadena una guerra mortal en 'Desierto de Hierro'.",
    "Una joven se embarca en un viaje mágico para salvar a su familia en 'El Libro de los Encantamientos'.",
    "Un grupo de adolescentes se enfrenta a un asesino en serie que imita a famosos psicópatas en 'El Juego del Asesino'.",
    "Un científico loco crea un monstruo gigante que amenaza con destruir la ciudad en 'La Ira del Coloso'.",
    "Una joven artista se debate entre el amor y la ambición en 'Pinceladas del Corazón'.",
    "Un detective retirado es llamado de vuelta al servicio para resolver un último caso en 'El Último Caso'.",
    "Un equipo de exploradores descubre una civilización perdida en las profundidades de la selva en 'El Enigma de los Mayas'.",
    "Una inteligencia artificial cobra conciencia y desafía a la humanidad en 'El Despertar de la Máquina'.",
    "Dos amigos de la infancia se reencuentran en un viaje por carretera que cambiará sus vidas en 'Camino a la Amistad'.",
    "Un grupo de reclusos se une para escapar de una prisión de máxima seguridad en 'Fuga Imposible'.",
    "Una mujer debe enfrentar a su pasado traumático cuando regresa a su ciudad natal en 'Secretos Enterrados'.",
    "Un aventurero intrépido busca un tesoro perdido en 'La Búsqueda del Oro'.",
    "Un circo ambulante esconde oscuros secretos detrás de su apariencia encantadora en 'El Circo de las Sombras'.",
    "Una madre soltera lucha por proteger a su hijo de un peligroso criminal en 'Refugio en la Oscuridad'.",
    "Un grupo de científicos debe detener un virus mortal que amenaza con destruir la humanidad en 'Pandemia Mortal'.",
    "Un músico talentoso se enfrenta a sus demonios internos mientras busca la fama en 'Notas de Desesperación'.",
    "Un equipo de arqueólogos descubre una antigua profecía que podría cambiar el mundo en 'La Profecía Olvidada'.",
    "Un detective ciego resuelve crímenes utilizando sus otros sentidos en 'El Ojo de la Justicia'.",
    "Una joven rebelde se convierte en la líder de una revolución en 'Rebeldía en la Ciudad'.",
    "Un grupo de amigos de la infancia regresa a su pueblo natal para enfrentar un trauma del pasado en 'Secretos Oscuros'.",
    "Un explorador solitario se adentra en la selva amazónica en busca de una criatura legendaria en 'El Rostro del Amazonas'.",
    "Un científico brillante inventa una forma de viajar a dimensiones paralelas en 'El Portal Interdimensional'.",
    "Un detective atormentado investiga una serie de suicidios que podrían estar relacionados en 'Misterios Mortales'.",
    "Un grupo de astronautas queda atrapado en una estación espacial averiada en 'El Riesgo del Espacio'.",
    "Una joven hereda una mansión encantada y descubre secretos oscuros en 'La Herencia Maldita'.",
    "Un grupo de surfistas se enfrenta a un tiburón asesino en 'Olas de Terror'.",
    "Un periodista investiga una serie de desapariciones en un pequeño pueblo en 'El Misterio de la Desolación'.",
    "Un científico obsesionado busca pruebas de vida extraterrestre en 'Encuentro en el Espacio'.",
    "Un equipo de detectives de lo paranormal investiga fenómenos inexplicables en 'Cazadores de Fantasmas'.",
    "Un grupo de adolescentes descubre un portal a un mundo mágico en 'El Portal de la Fantasía'.",
    "Un viaje en crucero se convierte en una pesadilla cuando un asesino comienza a atacar a los pasajeros en 'Crucero de Terror'."
]



## Exercise 1: Text Preprocessing
The next step is to preprocess the movie plots. Follow these steps:

1. Convert to lowercase
2. Tokenize the sentences into words
3. Lemmatize each token
4. Join the tokens back into a sentence using join (Bag of Words, TF-IDF, etc. expect a collection of sentences as input, not a collection of tokens).


In [3]:
import spacy

# Load spaCy's Spanish language model
nlp = spacy.load('es_core_news_lg')

# List to store the preprocessed movie plots
argumentos_preprocesados = []

# Iterate over each movie plot
for argumento in argumentos_peliculas:
    # Step 1: Convert to lowercase
    argumento = argumento.lower()

    # Step 2: Tokenize the sentences into words
    tokens = nlp(argumento)

    # Step 3: Lemmatize each token
    lematizados = [token.lemma_ for token in tokens]

    # Step 4: Join the tokens back into a sentence
    frase_lematizada = ' '.join(lematizados)

    # Add the preprocessed sentence to the list
    argumentos_preprocesados.append(frase_lematizada)

# argumentos_preprocesados now contains the preprocessed movie plots


In [4]:
argumentos_preprocesados

["uno detective privado luchar por desentrañar el secreto oscuro de uno mansión embrujado en ' misterio en el mansión húngaro ' .",
 "uno equipo de astronauta deber luchar por sobrevivir en uno planeta desconocido después de uno aterrizaje forzoso en ' planeta olvidado ' .",
 "en uno mundo post-apocalíptico , uno grupo de sobreviviente buscar refugio en ' el último esperanza ' .",
 "uno científico brillante inventar uno máquina del tiempo que desencadenar consecuencia inesperado en ' el viaje temporal ' .",
 "dos desconocido quedar atrapado en uno ascensor durante hora y descubrir que tener más en común de él que imaginar en ' atrapado en el ascensor ' .",
 "uno joven talentós lucha por alcanzar su sueño musical en ' nota de pasión ' .",
 "uno detective de homicidio perseguir a uno asesino en serie que dejar acertijo macabro en ' el enigma del asesino ' .",
 "en uno mundo de magia y criatura mítico , uno joven campesino emprender uno búsqueda épico en ' el búsqueda del dragón ' .",
 "u

## Exercise 2: Bag of Words (BoW) Model
Now let's create a Bag of Words model to represent the movie plots. Use the CountVectorizer class from scikit-learn.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer instance
vectorizer = CountVectorizer()

# Fit the vectorizer to the preprocessed plots and transform the data
X_bow = vectorizer.fit_transform(argumentos_preprocesados)

# X_bow is now a BoW term matrix, where each row represents a movie plot
# and each column represents a unique word across all preprocessed movie plots.

# To get the unique words that correspond to the columns:
palabras_unicas = vectorizer.get_feature_names_out()

# The X_bow matrix and palabras_unicas can be used for later analysis.
import pandas as pd

# Convert the X_bow matrix to a pandas DataFrame
df_bow = pd.DataFrame(X_bow.toarray(), columns=palabras_unicas)

# df_bow now contains the term frequency table (BoW)


In [11]:
df_bow

Unnamed: 0,acertijo,adentrar,adolescente,al,alcanzar,amazona,amazónico,ambición,ambulante,amenazar,...,utilizar,verdad,viajar,viaje,vida,virus,vuelta,él,épico,último
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
38,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
39,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
40,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
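
The table above has no columns for one-letter words such as "a" or "y", nor for punctuation. That follows from CountVectorizer's default token pattern, which keeps only tokens of two or more word characters. As a rough illustration of what the vectorizer counts, here is a minimal standard-library sketch; `bag_of_words` and the two sample sentences are hypothetical names and data for this example, not part of scikit-learn or of the exercise.

```python
import re
from collections import Counter

# Assumption: mimic CountVectorizer's default token pattern, which keeps
# only tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def bag_of_words(documents):
    """Return (sorted vocabulary, count rows), with one row per document."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
    # The vocabulary is the sorted set of every token seen in any document.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in this document.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Hypothetical mini-corpus in the style of the lemmatized plots.
docs = [
    "uno detective privado luchar por desentrañar el secreto",
    "uno detective de homicidio perseguir a uno asesino",
]
vocab, matrix = bag_of_words(docs)

for word, count_doc0, count_doc1 in zip(vocab, matrix[0], matrix[1]):
    print(f"{word:12s} {count_doc0} {count_doc1}")
```

Each row is as wide as the whole vocabulary, which is why the real matrix is mostly zeros and scikit-learn returns it in sparse form.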


## Exercise 3: Bag of N-grams Model
Now let's create a Bag of N-grams model to represent the movie plots (try n=2 and n=3).

In [15]:
# Create a CountVectorizer instance with ngram_range=(2, 2) for bigrams
vectorizer_bigram = CountVectorizer(ngram_range=(2, 2))

# Fit the vectorizer to the preprocessed plots and transform the data
X_bigram = vectorizer_bigram.fit_transform(argumentos_preprocesados)

# Convert the X_bigram matrix to a pandas DataFrame
df_bigram = pd.DataFrame(X_bigram.toarray(), columns=vectorizer_bigram.get_feature_names_out())

# df_bigram now contains the Bag of Bigrams representation as a DataFrame
df_bigram

Unnamed: 0,acertijo macabro,adentrar en,adolescente descubrir,adolescente él,al servicio,alcanzar su,amazónico en,ambición en,ambulante esconder,amenazar con,...,él embarcar,él en,él enfrentar,él paranormal,él que,él reencontrar,él unir,épico en,último caso,último esperanza
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
38,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
40,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
# Create a CountVectorizer instance with ngram_range=(3, 3) for trigrams
vectorizer_trigram = CountVectorizer(ngram_range=(3, 3))

# Fit the vectorizer to the preprocessed plots and transform the data
X_trigram = vectorizer_trigram.fit_transform(argumentos_preprocesados)

# Convert the X_trigram matrix to a pandas DataFrame
df_trigram = pd.DataFrame(X_trigram.toarray(), columns=vectorizer_trigram.get_feature_names_out())

# df_trigram now contains the Bag of Trigrams representation as a DataFrame
df_trigram

Unnamed: 0,acertijo macabro en,adentrar en el,adolescente descubrir uno,adolescente él enfrentar,al servicio para,alcanzar su sueño,amazónico en busca,ambición en pincelada,ambulante esconder oscuro,amenazar con destruir,...,él embarcar en,él en el,él enfrentar su,él enfrentar uno,él paranormal investigar,él que imaginar,él reencontrar en,él unir para,épico en el,último caso en
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
38,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
40,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
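
To see where column names such as "alcanzar su sueño" come from, the n-gram extraction can be sketched with the standard library alone. This is an illustrative approximation: `word_ngrams` is a hypothetical helper, and the token pattern is assumed to match CountVectorizer's default.

```python
import re

# Assumption: same default token pattern as CountVectorizer
# (tokens of two or more word characters).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, n):
    """Return every run of n consecutive tokens in text, joined by spaces."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    # A text with k tokens yields k - n + 1 n-grams (none if k < n).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "dos desconocido quedar atrapado en uno ascensor"
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(bigrams)
print(trigrams)
```

Because longer n-grams repeat less often across documents, the trigram matrix is even sparser than the bigram one. Passing `ngram_range=(1, 2)` to CountVectorizer would keep unigrams and bigrams together in a single vocabulary.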


## Exercise 4: TF-IDF Model
Now let's create a TF-IDF model to represent the movie plots.

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer instance
tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer to the preprocessed plots and transform the data
X_tfidf = tfidf_vectorizer.fit_transform(argumentos_preprocesados)

# Convert the X_tfidf matrix to a pandas DataFrame
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# df_tfidf now contains the TF-IDF representation as a DataFrame
df_tfidf

Unnamed: 0,acertijo,adentrar,adolescente,al,alcanzar,amazona,amazónico,ambición,ambulante,amenazar,...,utilizar,verdad,viajar,viaje,vida,virus,vuelta,él,épico,último
0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000
1,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000
2,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.321504
3,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.252211,0.000000,0.0,0.0,0.000000,0.0,0.000000
4,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.117696,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000
38,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.334881,0.0,0.0,0.000000,0.0,0.000000
39,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.180184,0.0,0.000000
40,0.0,0.0,0.307028,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000
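
The weights above can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings (raw term counts, smoothed IDF computed as ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row). `tfidf_rows` and the toy documents are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    # Smoothed IDF, assumed to match scikit-learn's default (smooth_idf=True).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        # L2 normalization: each document vector ends up with length 1.
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


# Hypothetical, already tokenized mini-corpus.
docs = [
    ["detective", "mansión", "secreto"],
    ["detective", "asesino", "serie"],
    ["astronauta", "planeta"],
]
vocab, rows = tfidf_rows(docs)

for word, weight in zip(vocab, rows[0]):
    print(f"{word:12s} {weight:.4f}")
```

In the first document, "detective" gets a lower weight than "mansión" because it also appears in the second document. That is the intended effect of IDF: terms shared by many plots say less about any single one.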


## Exercise 5: Cosine Similarity
Finally, let's compute the similarity between documents using the cosine measure.

In [20]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(X_tfidf)

df_similarity_tfidf = pd.DataFrame(similarity_matrix, columns=range(len(argumentos_preprocesados)), index=range(len(argumentos_preprocesados)))

df_similarity_tfidf

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,32,33,34,35,36,37,38,39,40,41
0,1.000000,0.161561,0.077234,0.050519,0.058686,0.083656,0.109382,0.071389,0.105572,0.038878,...,0.062277,0.161517,0.073751,0.355136,0.063467,0.186850,0.067289,0.102854,0.099258,0.080120
1,0.161561,1.000000,0.074285,0.041970,0.102684,0.090203,0.063832,0.072726,0.058998,0.078094,...,0.060517,0.059966,0.144653,0.048623,0.089688,0.107331,0.059393,0.136117,0.092419,0.080559
2,0.077234,0.074285,1.000000,0.051889,0.064616,0.041602,0.071780,0.139406,0.103336,0.037719,...,0.065673,0.054867,0.124245,0.060115,0.129966,0.119772,0.150686,0.064419,0.218256,0.089389
3,0.050519,0.041970,0.051889,1.000000,0.085127,0.020860,0.112011,0.083546,0.041210,0.063652,...,0.284637,0.071027,0.087012,0.047383,0.039617,0.075013,0.103930,0.021068,0.065564,0.111575
4,0.058686,0.102684,0.064616,0.085127,1.000000,0.032149,0.103823,0.055327,0.121775,0.258474,...,0.042887,0.089524,0.255075,0.081379,0.074490,0.083204,0.062063,0.068306,0.099029,0.086756
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37,0.186850,0.107331,0.119772,0.075013,0.083204,0.059788,0.168109,0.113347,0.095123,0.053952,...,0.100736,0.307626,0.114370,0.086904,0.119028,1.000000,0.100889,0.180645,0.163115,0.123472
38,0.067289,0.059393,0.150686,0.103930,0.062063,0.035387,0.060973,0.062744,0.056356,0.031995,...,0.121154,0.041816,0.170569,0.045663,0.062542,0.100889,1.000000,0.059075,0.081874,0.074287
39,0.102854,0.136117,0.064419,0.021068,0.068306,0.048517,0.107820,0.057294,0.076953,0.043509,...,0.054675,0.185603,0.061514,0.024408,0.138511,0.180645,0.059075,1.000000,0.093611,0.087782
40,0.099258,0.092419,0.218256,0.065564,0.099029,0.050320,0.086512,0.155582,0.115287,0.098291,...,0.280281,0.069646,0.138614,0.137554,0.156553,0.163115,0.081874,0.093611,1.000000,0.101706
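
The diagonal of the matrix is 1.0 because every plot is compared with itself, and the matrix is symmetric because the measure does not depend on the order of the two vectors. For a single pair of vectors, the calculation can be sketched as follows. `cosine_similarity_pair` is a hypothetical helper, and returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math


def cosine_similarity_pair(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Convention for this sketch: a zero vector is similar to nothing.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, twice the length
c = [0.0, 0.0, 3.0]  # shares no non-zero component with a

print(cosine_similarity_pair(a, b))
print(cosine_similarity_pair(a, c))
```

Only the direction of the vectors matters, not their length, so a long plot and a short plot that use the same vocabulary in the same proportions score close to 1. With L2-normalized TF-IDF rows, the cosine reduces to a plain dot product.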


## The final challenge

Now the challenge is to build a system that receives a new movie plot, preprocesses it, vectorizes it with the model of your choice, computes its cosine similarity against the other plots, and returns the most similar movie plot. Compare it against the argumentos_preprocesados list you generated earlier. Good luck!
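
One possible starting point, offered as a sketch and not as the reference solution, is shown below. It assumes TF-IDF as the representation. The names `most_similar_plot`, `preprocess` and `sample_plots` are hypothetical, and plain lowercasing stands in for the spaCy lemmatization from Exercise 1 so that the example runs without downloading a language model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def most_similar_plot(new_plot, plots, preprocess=str.lower):
    """Return (index, plot, score) of the entry in plots closest to new_plot.

    preprocess is a hypothetical hook: plug in the lemmatization pipeline
    from Exercise 1 here. Lowercasing alone is only a stand-in.
    """
    corpus = [preprocess(p) for p in plots]
    vectorizer = TfidfVectorizer()
    corpus_matrix = vectorizer.fit_transform(corpus)
    # Use transform (not fit_transform) so the new plot is projected onto
    # the vocabulary and IDF weights learned from the existing plots.
    query_vector = vectorizer.transform([preprocess(new_plot)])
    scores = cosine_similarity(query_vector, corpus_matrix)[0]
    best = int(scores.argmax())
    return best, plots[best], float(scores[best])


# Hypothetical mini-corpus, used here in place of argumentos_peliculas.
sample_plots = [
    "Un detective privado investiga una mansión embrujada.",
    "Un grupo de astronautas queda atrapado en una estación espacial.",
    "Una joven artista se debate entre el amor y la ambición.",
]
index, plot, score = most_similar_plot(
    "Unos astronautas luchan por sobrevivir en el espacio.", sample_plots
)
print(index, plot, round(score, 3))
```

Words in the new plot that never appeared in the corpus are ignored by `transform`, so the new text should go through the same preprocessing as the corpus. Otherwise inflected forms will not match the lemmatized vocabulary.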