# Introduction to NLP: Text Preprocessing with PyTorch

In this notebook we focus on text preprocessing, a fundamental step in any Natural Language Processing (NLP) project. Before diving into more complex architectures such as RNNs or Transformers, it is crucial to know how to prepare text data so that deep learning models can use it effectively.

## Intro

### Objectives

1. **Explore key text preprocessing techniques**, including tokenization, cleaning, and vocabulary handling.
2. **Understand the importance of padding and truncation** when handling variable-length text sequences.
3. **Convert text into numerical representations** suitable as input to NLP models.
4. **Implement a preprocessing pipeline** in PyTorch that prepares the text for later modeling stages.

### Contents

1. Introduction to preprocessing in NLP and why it matters.
2. Text cleaning and tokenization using standard libraries.
3. Building a vocabulary from text data.
4. Converting text into numerical indices for use in models.
5. Implementing padding and truncation.
6. Preparing the dataset for use in deep learning models in PyTorch.

### About the IMDB Dataset

In this notebook we use the IMDB movie review dataset, which is widely used in NLP research. It contains 50,000 movie reviews in English, labeled as **positive** or **negative**. It is split evenly into a training set and a test set of 25,000 reviews each. Positive and negative reviews are balanced, which makes it a convenient resource for training and evaluating sentiment analysis models.

The dataset was created by Andrew Maas and colleagues at Stanford University and is publicly available [here](https://ai.stanford.edu/~amaas/data/sentiment/). Its main purpose is to support research on text classification tasks such as sentiment analysis, where the goal is to predict whether a review is positive or negative from its text.

In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split

from torchinfo import summary

import pandas as pd
import numpy as np

import os
import sys
import tarfile
import urllib.request
import re
from pathlib import Path
from collections import Counter
from itertools import chain

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

from utils import (
    train,
)

In [6]:
# Fix the seed so that results are reproducible
SEED = 23

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [7]:
# Select the device to use
DEVICE = "cpu"  # default to the CPU
if torch.cuda.is_available():
    DEVICE = "cuda"  # NVIDIA GPU via CUDA
elif torch.backends.mps.is_available():
    DEVICE = "mps"  # no CUDA, but Apple's Metal (MPS) backend is available
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    DEVICE = "xpu"  # no CUDA or MPS, but an XPU is available (torch.xpu is missing in older PyTorch builds)

print(f"Using {DEVICE}")

NUM_WORKERS = 0  # Windows and macOS can have problems with multiple workers
if sys.platform == "linux":
    NUM_WORKERS = 4  # number of data-loading workers (tune for your machine)

print(f"Using {NUM_WORKERS} workers")

Using cpu
Using 4 workers


In [8]:
BATCH_SIZE = 512  # batch size

## Loading the Data

First, we download the [IMDB dataset](https://ai.stanford.edu/~amaas/data/sentiment/) and load it into a Pandas DataFrame for easy manipulation.

The archive contains two folders, `train` and `test`, each with `pos` and `neg` subfolders holding the positive and negative reviews, respectively. Each review is stored in its own text file. We use the `os` library to walk the folders and load the reviews into a Pandas DataFrame.

In [9]:
DATA_PATH = "data"

url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
tar_path = os.path.join(DATA_PATH, "aclImdb_v1.tar.gz")

# Create data directory if it doesn't exist
os.makedirs(DATA_PATH, exist_ok=True)

# Download the file if not already downloaded
if not os.path.exists(tar_path):
    print("Downloading dataset...")
    urllib.request.urlretrieve(url, tar_path)
    print("Download complete.")
else:
    print("File already downloaded.")

# Extract the tar.gz file
print("Extracting files...")
with tarfile.open(tar_path, "r:gz") as tar:
    tar.extractall(path=DATA_PATH)
print("Extraction complete.")

# Optional: print extracted folder contents
extracted_path = os.path.join(DATA_PATH, "aclImdb")
print(f"Dataset extracted to: {extracted_path}")
print("Contents:", os.listdir(extracted_path))

Downloading dataset...
Download complete.
Extracting files...
Extraction complete.
Dataset extracted to: data/aclImdb
Contents: ['train', 'imdbEr.txt', 'README', 'imdb.vocab', 'test']


In [10]:
TRAIN_PATH = str(Path(extracted_path) / "train")
TEST_PATH = str(Path(extracted_path) / "test")

In [11]:
# Load the IMDB dataset from local files
def load_imdb_data(base_directory):
    data = []
    for label in ["pos", "neg"]:
        folder = os.path.join(base_directory, label)
        for file in os.listdir(folder):
            with open(os.path.join(folder, file), "r", encoding="utf-8") as f:
                data.append((f.read(), 1 if label == "pos" else 0))  # 1 for positive, 0 for negative
    return pd.DataFrame(data, columns=["review", "sentiment"])

train_df = load_imdb_data(TRAIN_PATH)
test_df = load_imdb_data(TEST_PATH)

In [12]:
# Show the full text of each column
pd.set_option('display.max_colwidth', None)
# Show all columns and let the width adapt
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 0)  # 0 = auto-detect the terminal width

In [13]:
train_df.head()

Unnamed: 0,review,sentiment
0,"YES, the plot is hardly plausible and very thin. YES, the acting does range from average to laughable. YES, it has been done so many times before. However what we are dealing with is a film that does not shy away from these facts and pretends to be nothing more than it is. There are indeed some original death scenes and the tension does increase throughout the movie. In addition you are never more than a few minutes away from a gory killing. I urge everyone to watch this film with an unprejudiced eye and see it for what it set out to be; a scary, funny slasher flick with a theme tune second to none.",1
1,"This movie was great! It was an excellent rendition of an ancient myth. The animation was somewhat odd, but nothing new from Disney. It was definitely better than expected for a Disney movie with no singing.<br /><br />The background animation was magical. It was a different level of work for the Disney people. Some of the characters were a little boxy, but it was more than made up for with the beauty and lushness of the scenery. The music was largely instrumental but that was perfect for the movie. This was definitely not a film that needed the characters to bust into song.<br /><br />Perfect. 10 out of 10.",1
2,"One of my favourite films first saw it when I was about 10, which probably tells you a lot about the type of humour. Although dated the humour definitely has a charm about it. Expect to see the usual Askey & Murdoch banter so popular in its day, with lots of interesting, quirky co-characters. The lady with the parrot, the couple due to get married and are in trouble from 'her', and my favourite, the stationmaster, ""Nobody knows where it comes from ... nobody knows where it goes.."" Interestingly the ghost train was written by Arnold Ridley of Dads Army fame (Private Godfrey the medic) Watch it on a rainy Sunday afternoon after your lunch and smile.",1
3,"I am surprised than many viewers hold more respect for the sequel to this brilliant movie... I have seen all the guinea pigs and this one is easily the best.<br /><br />Even though ive seen the ""making of"", i still have doubts when watching those 35mins of pure torture : its that powerful.<br /><br />A 10 out of 10 because this movie achieved perfectly what it set out to do : be the best fake snuff film ever made.",1
4,"Part Two picks up... not where the last film left off. As part of the quasi-conventionality of Steven Soderbergh's epic 4+ hour event, Che's two stories are told as classic ""Rise"" and ""Fall"" scenarios. In Part Two, Che Guevara, leaving his post as a bureaucrat in Cuba and after a failed attempt in the Congo (only in passing mentioned in the film), goes down to Bolivia to try and start up another through-the-jungle style revolution. Things don't go quite as well planned, at all, probably because of Che's then notorious stature as a Communist and revolutionary, and in part because of America's involvement on the side of the Bolivian Government, and, of course, that Castro wasn't really around as a back-up for Che.<br /><br />As it goes, the second part of Che is sadder, but in some ways wiser than the first part. Which makes sense, as Guevara has to endure low morale from his men, betrayals from those around him, constant mistakes by grunts and nearby peasants, and by ultimately the enclosing, larger military force. But what's sadder still is that Guevara, no matter what, won't give in. One may see this as an incredible strength or a fatal flaw- maybe both- but it's also clear how one starts to see Che, if not totally more fully rounded, then as something of a more sympathetic character. True, he did kill, and executed, and felt justified all the way. And yet it starts to work on the viewer in the sense of a primal level of pity; the sequence where Guevara's health worsens without medicine, leading up to the shocking stabbing of a horse, marks as one of the most memorable and satisfying of any film this year.<br /><br />Again, Soderbergh's command of narrative is strong, if, on occasion, slightly sluggish (understandable due to the big running time), and one or two scenes just feel totally odd (Matt Damon?), but these are minor liabilities. 
Going this time for the straight color camera approach, this is almost like a pure militia-style war picture, told with a great deal of care for the men in the group, as well as Guevara as the Lord-over this group, and how things dwindle down the final scene. And as always, Del-Toro is at the top of his game, in every scene, every beat knowing this guy so well- for better and for worse- that he comes about as close to embodiment as possible. Overall, the two parts of Che make up an impressive package: history as drama in compelling style, good for an audience even if they don't know Che or, better, if they don't think highly of him. It's that special. 8.5/10",1


In [14]:
train_df.tail()

Unnamed: 0,review,sentiment
24995,"Spoilers ahead -- proceed at your own caution.<br /><br />My main problem with this movie is that once Harry learns the identities of the three blackmailers -- with relative ease -- he continues to cave into their demands. And then the whole scene with his wife being kidnapped, he decides to wire his classic car up to explode (with the money in it), which makes us take a pretty tall leap of logic.<br /><br />Okay, so he wanted to keep his affair with Cini out of the public eye due to his wife's involvement with the DA campaign. This I can see, but why not hire someone to slap these turds around a bit, or even kill them once he'd determined there was no actual blackmail evidence (e.g, Cini's body?) This was a pretty interesting movie for the first 2/3 of it. After that, it sort of falls apart.",0
24996,"I read the book Celestine Prophecy and was looking forward to seeing the movie. Be advised that the movie is loosely based on the book. Many of the book's most interesting points do not even come out in the movie. It is a ""B"" movie at best. Many events, characters, how the character interact and meet in the book are simply changed or do not occur. The flow of events that in the book are very smooth, are choppy and fed to the view as though you a child. The character development is very poor. Personnallities of the characters differ from those in the book. The direction is similar to a ""B"" horror flick. I understand that it would take six hours in film to present all that is in the book, but they screen play base missed many points. The casting was very good.",0
24997,"This is a pretty bad movie. But not so bad as it's reputation suggests. The production values aren't too bad and there is the odd effective scene. And it does have an 80's cheezoid veneer that means that it is always kind of fun. Watch out, too, for Jimmy Nail's brief appearance - his attempt at an American accent is so astoundingly rubbish it's fantastic. Fantastic too are Sybil Danning's breasts - they make a brief appearance in the movie but the scene is repeated umpteen times in the end credits in what can only be described as the 12"" remix of Sybil Danning's boobs. Has to be seen to be believed. As a horror movie it isn't scary, the effects are silly and Christopher Lee turns up to sleepwalk through his performance. I guess he was buying a new house and needed some cash for the deposit. The two central characters - the man and the woman - were so negligible that I have forgotten almost everything about them and I just watched this movie earlier tonight. The werewolves are noticeably less impressive than in the original movie, in fact, bizarrely, they sometimes look more like badly burned apes. The eastern European setting is quite good and the music provided by the new wave band Babel, while being pretty terrible, does at least give the film some added cheese.<br /><br />Overall? Good for a laugh. Not good quality but did you seriously expect it to be? And, at the very least, you've always got Sybil's knockers.",0
24998,"Would someone tell shaq to stick to what he is good at basketball. This movie was not even entertaining on a stupid level. In this movie shaq plays a genie who lives in a boom box is that not orginal a genie in a boom box instead of a lamp. He is supposed to help a little boy played by the equally annoying francais cappra. This movie had the most flimsy storyline since water world, the acting was awful and I think that anyone who likes this flim would be afraid to admit it.",0
24999,"I was really hoping that this would be a funny show, given all the hype and the clever preview clips. And talk about hype, I even heard an interview with the show's creator on the BBC World Today - a show that is broadcast all over the world.<br /><br />Unfortunately, this show doesn't even come close to delivering. All of the jokes are obvious - the kind that sound kind of funny the first time you hear them but after that seem lame - and they are not given any new treatment or twist. All of the characters are one-dimensional. The acting is - well - mediocre (I'm being nice). It's the classic CBC recipe - one that always fails.<br /><br />If you're Muslim I think you would have to be stupid to believe any of the white characters, and if you're white you'd probably be offended a little by the fact that almost all of the white characters are portrayed as either bigoted, ignorant, or both. Not that making fun of white people is a problem - most of the better comedies are rooted in that. It's only a problem when it isn't funny - as in this show.<br /><br />Canada is bursting with funny people - so many that we export them to Hollywood on a regular basis. So how come the producers of this show couldn't find any?",0


In [15]:
train_class_distribution = train_df["sentiment"].value_counts()

print(
    "Class distribution in the training set:",
    train_class_distribution,
)

Class distribution in the training set: sentiment
1    12500
0    12500
Name: count, dtype: int64


## Text Preprocessing

Text preprocessing prepares raw text before it is fed into a deep learning model. These are the key steps we will carry out:

1. **Low-Level Text Cleaning**:
   - **HTML removal**: Remove HTML tags and other markup elements.
   - **Bracketed text removal**: Remove text enclosed in square brackets, such as [image], [audio], etc.
   - **Non-alphanumeric character removal**: Remove special characters and other symbols, and optionally punctuation as well.
   - **Extra whitespace removal**: Collapse repeated whitespace and strip leading and trailing spaces.

2. **High-Level Text Cleaning**:
   - **Lowercasing**: Convert all text to lowercase so that variants such as "The" and "the" map to the same token.
   - **Stop word removal**: Remove very common words that usually carry little information.
   - **Lemmatization**: Reduce words to their base form using lemmatization techniques.

3. **Vocabulary Construction**:
   - **Dictionary creation**: Assign a unique index to each word in the corpus.
   - **Frequency filtering**: Drop words that are too rare or too common.

### Low-Level Text Cleaning

In this stage we remove HTML tags, special characters, and unnecessary whitespace from the reviews (punctuation can optionally be removed as well). We rely on regular expressions from the [`re` library](https://docs.python.org/3/library/re.html) for these tasks.

In [16]:
def strip_html_tags(text):
    """Replace HTML tags with a space so words separated only by a tag (e.g. <br />) do not fuse."""
    pattern = r"<.*?>"
    return re.sub(pattern, " ", text)

def remove_between_square_brackets(text):
    """Remove text between square brackets (e.g. [Spoiler], [Citation needed])."""
    pattern = r"\[[^\]]*\]"
    return re.sub(pattern, "", text)

def remove_special_characters(text, keep_punctuation=True):
    """
    Remove special characters.
    
    Args:
        text: text to clean
        keep_punctuation: if True, keep basic punctuation (.,!?;:), hyphens and apostrophes
    """
    if keep_punctuation:
        # Keep letters, digits, whitespace and common punctuation
        pattern = r"[^a-zA-Z0-9\s.,!?;:\-']"
    else:
        # Keep only letters, digits and whitespace
        pattern = r"[^a-zA-Z0-9\s]"
    
    return re.sub(pattern, "", text)

def remove_additional_whitespace(text):
    """Collapse repeated whitespace and normalize spacing around punctuation."""
    # First collapse runs of whitespace into a single space
    text = re.sub(r"\s{2,}", " ", text)
    # Remove whitespace before punctuation
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    # Ensure one space after punctuation (if there is none yet)
    text = re.sub(r"([.,!?;:])([^\s])", r"\1 \2", text)
    # Strip leading and trailing whitespace
    return text.strip()

def low_level_text_cleaning(text, keep_punctuation=True):
    """
    Full low-level text cleaning.
    
    Args:
        text: text to clean
        keep_punctuation: if True, keep punctuation (recommended for NLP)
    """
    text = strip_html_tags(text)
    text = remove_between_square_brackets(text)
    text = remove_special_characters(text, keep_punctuation)
    text = remove_additional_whitespace(text)
    return text


# Test text
testing_text = "<p>Hello    World! [Spoiler] The villain is the butler! [End of spoiler] <br> See you .</p>"
print(f"Original: {testing_text}")
print(f"Cleaned: {low_level_text_cleaning(testing_text)}")

Original: <p>Hello    World! [Spoiler] The villain is the butler! [End of spoiler] <br> See you .</p>
Cleaned: Hello World! The villain is the butler! See you.


In [17]:
# Apply the low-level cleaning to the training and test sets
train_df["review_low_level_cleaned"] = train_df["review"].apply(low_level_text_cleaning)
test_df["review_low_level_cleaned"] = test_df["review"].apply(low_level_text_cleaning)

### High-Level Text Cleaning

In this stage we convert the text to lowercase and remove stop words. Stop words are common words that add little meaning to the text, such as "a", "the", "is", etc. We use the `nltk` library to download a stop word list and remove those words from the text.

> **Note**: Depending on the task and domain, you may want to customize the stop word list to fit your needs.

> **Note 2**: Downloading the stopwords package sometimes fails with `429: Too Many Requests`. If that happens, download the package manually with `python -m nltk.downloader stopwords`.
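As an illustration of the first note: in sentiment analysis, negations such as "not" can flip the polarity of a review, so it may be worth keeping them even though NLTK's English list treats them as stop words. The sketch below uses a tiny hand-written set as a stand-in for the real list; `base_stopwords`, `negations_to_keep`, and `remove_custom_stop_words` are hypothetical names introduced only for this example.

```python
# Hypothetical stand-in for NLTK's English stop word list (a tiny sample, not the real list).
base_stopwords = {"a", "an", "the", "is", "was", "this", "no", "not", "nor"}

# Negations can flip the polarity of a review, so we keep them as tokens.
negations_to_keep = {"no", "not", "nor"}
custom_stopwords = base_stopwords - negations_to_keep


def remove_custom_stop_words(text, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(token for token in text.split() if token not in stop_words)


sample = "this movie was not good"
print(remove_custom_stop_words(sample, base_stopwords))    # the negation is lost
print(remove_custom_stop_words(sample, custom_stopwords))  # the negation is kept
```

With the real list, the same idea would be `set(stopwords.words("english")) - negations_to_keep`. Whether this helps depends on the model and the task, so it is worth checking empirically.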

In [18]:
# Download the NLTK stop word list
nltk.download("stopwords")
all_stopwords = set(stopwords.words("english"))

print(f"Stopwords: {all_stopwords}")

Stopwords: {'d', "she'd", 'our', 'here', 'below', 'this', "i've", 'those', "wouldn't", 'from', 'some', "we're", "we've", 'do', "won't", 'only', "you'd", 'an', 'the', "haven't", 'during', 'themselves', 'haven', 'for', 'to', 'all', 'does', "i'd", "shouldn't", 'than', 'ours', 'having', "that'll", 'yourself', 'ain', "shan't", 'now', "mightn't", 'was', 'aren', 'be', 'out', 'are', 'has', 'his', 'theirs', 'were', 'how', 'into', 't', "he'll", 's', 'ma', 'their', 'off', "he's", 'against', 'mightn', 'mustn', 'on', 'myself', 'most', 'when', "weren't", 'between', 'again', "we'd", 'needn', 'no', 'too', "it'd", 've', 'll', 'there', 'further', 'up', 'himself', 'if', 'same', 'yours', 'both', 'itself', 'about', 'over', 'her', "you'll", 'should', 'a', 'its', 'by', "they'll", 'ourselves', 'own', 'hers', 'been', 'being', 'she', 'whom', 'shan', 'where', 'which', 'and', 'weren', "you're", 'at', 'you', "hasn't", 'why', 'while', "i'm", 'them', 'had', 'down', 'me', 're', "you've", "should've", "couldn't", 'did

[nltk_data] Downloading package stopwords to /home/rami/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [19]:
def remove_stop_words(full_text_line):
    # Remove stop words (the list is lowercase, so lowercase the text first)
    tokens = full_text_line.split()
    tokens = [token for token in tokens if token not in all_stopwords]
    return " ".join(tokens)

def to_lower_case(full_text_line):
    # Convert to lowercase
    return full_text_line.lower()

# Test text
testing_text = "The quick brown fox don't jumps over the lazy dog"
print(f"Original:                          {testing_text}")
print(f"Lower Case and Stop Words Removed: {remove_stop_words(to_lower_case(testing_text))}")

Original:                          The quick brown fox don't jumps over the lazy dog
Lower Case and Stop Words Removed: quick brown fox jumps lazy dog


#### Lemmatization / Stemming

- Lemmatization converts words to their base form, or lemma. For example, the Spanish words "corriendo", "corre", and "corrió" would all become "correr". Lemmatization reduces word variability and groups related word forms together.

- Stemming is similar to lemmatization but simpler: it strips suffixes from words to obtain a root. For example, "corriendo", "corre", and "corrió" would become "corr". Stemming is faster than lemmatization, but it often produces less precise results.
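To make the contrast concrete, here is a deliberately crude sketch in plain Python. `toy_stem` is a naive suffix stripper (it is not the Porter algorithm) and `TOY_LEMMAS` is a hypothetical four-word dictionary standing in for a full lexicon such as WordNet. Both exist only to illustrate the idea.

```python
def toy_stem(word):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma dictionary; a real lemmatizer relies on a full lexicon.
TOY_LEMMAS = {"running": "run", "ran": "run", "studies": "study", "children": "child"}


def toy_lemmatize(word):
    """Look the word up in the dictionary and fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)


for word in ["running", "ran", "studies", "children"]:
    print(f"{word:<10} stem={toy_stem(word):<10} lemma={toy_lemmatize(word)}")
```

The suffix stripper produces non-words such as "runn" and "studi" and cannot relate "ran" to "run", while the dictionary lookup handles irregular forms but only for words it knows. Real stemmers and lemmatizers are more refined, but the trade-off is the same.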

In [29]:
# Required by the tokenizer (nltk.word_tokenize()), which handles punctuation better than a plain split
nltk.download('punkt') 
nltk.download('punkt_tab')

# Required by the lemmatizer (WordNetLemmatizer)
nltk.download('wordnet')
# Required to map POS tags to the universal tagset (pos_tag(..., tagset='universal'))
nltk.download('universal_tagset') 

# Required by the POS tagger (nltk.pos_tag)
nltk.download('averaged_perceptron_tagger') 
nltk.download('averaged_perceptron_tagger_eng')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package punkt to /home/rami/nltk_data...
[nltk_data] Error downloading 'punkt' from
[nltk_data]     <https://raw.githubusercontent.com/nltk/nltk_data/gh-
[nltk_data]     pages/packages/tokenizers/punkt.zip>:   HTTP Error
[nltk_data]     429: Too Many Requests
[nltk_data] Downloading package punkt_tab to /home/rami/nltk_data...
[nltk_data] Error downloading 'punkt_tab' from
[nltk_data]     <https://raw.githubusercontent.com/nltk/nltk_data/gh-
[nltk_data]     pages/packages/tokenizers/punkt_tab.zip>:   HTTP Error
[nltk_data]     429: Too Many Requests
[nltk_data] Downloading package wordnet to /home/rami/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /home/rami/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/rami/nltk_data...
[nltk_data]   Package averaged_perceptron_ta

Splitting text into tokens is not as simple as calling `split(" ")`, because punctuation, question marks, and similar symbols must be taken into account. For this we use the NLTK tokenizer `nltk.word_tokenize()`, which handles these cases properly.
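The limitation of a plain split can be seen with a small standard-library sketch. `simple_tokenize` and `TOKEN_PATTERN` are hypothetical names for a regex-based approximation; this is not how `nltk.word_tokenize()` works internally, and its output differs (NLTK, for example, splits contractions such as "isn't" into two tokens).

```python
import re

# A word (optionally with an apostrophe suffix such as 's or 't), or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens using a regular expression."""
    return TOKEN_PATTERN.findall(text)


sentence = "It's a beautiful day, isn't it?"
print(sentence.split(" "))        # punctuation stays glued to words: 'day,' and 'it?'
print(simple_tokenize(sentence))  # punctuation becomes separate tokens
```

With a plain split, "day," and "day" would end up as two different vocabulary entries, which is exactly what a proper tokenizer avoids.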

In [30]:
test_texts = [
    "Hello, world! This is a test.",
    "It's a beautiful day, isn't it?",
    "Dr. Smith went to Washington D.C. on Jan. 5th.",
    "The quick brown fox jumps over the lazy dog."
]

for testing_text in test_texts:
    tokens = nltk.word_tokenize(testing_text)  # split into tokens
    print(f"Original:   {testing_text}")
    print(f"tokens: {tokens}\n")

LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/home/rami/nltk_data'
    - '/home/rami/.conda/envs/Taller_DL/nltk_data'
    - '/home/rami/.conda/envs/Taller_DL/share/nltk_data'
    - '/home/rami/.conda/envs/Taller_DL/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


In [31]:
# Map a POS (part-of-speech) tag to the matching WordNet constant.
# Accepts both Penn Treebank tags (JJ, VB, NN, RB) and universal tags (ADJ, VERB, NOUN, ADV);
# checking only the Treebank prefixes would send universal ADJ and ADV tags to the noun default.
def get_wordnet_pos(tag):
    if tag.startswith('J') or tag == 'ADJ':  # adjective
        return wordnet.ADJ
    elif tag.startswith('V'):  # verb
        return wordnet.VERB
    elif tag.startswith('N'):  # noun
        return wordnet.NOUN
    elif tag.startswith('R') or tag == 'ADV':  # adverb
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun

text = "The children were running faster than their friends."

print("="*80)
print("ORIGINAL TEXT:")
print("="*80)
print(text)

# Tokenize
tokens = nltk.word_tokenize(text.lower())  # lowercase so tokens match the rest of the pipeline

# POS tagging
pos_tags = nltk.pos_tag(tokens, tagset='universal')

print("\n" + "="*80)
print("DETAILED ANALYSIS:")
print("="*80)
print(f"{'Original':<15} {'POS':<8} {'Lemma':<15} {'Stem':<15}")
print("-"*80)
    
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

lemmatized_tokens = []
stemmed_tokens = []
for token, pos in pos_tags:
    if token.isalpha():  # words only, skip punctuation
        wordnet_pos = get_wordnet_pos(pos)
        lemma = lemmatizer.lemmatize(token, wordnet_pos)
        stem = stemmer.stem(token)
        
        # Print only the tokens that changed
        if token != lemma or token != stem:
            print(f"{token:<15} {pos:<8} {lemma:<15} {stem:<15}")
        
        lemmatized_tokens.append(lemma)
        stemmed_tokens.append(stem)
    else:
        lemmatized_tokens.append(token)
        stemmed_tokens.append(token)

print("\n" + "="*80)
print("LEMMATIZED TEXT:")
print("="*80)
lemmatized_text = " ".join(lemmatized_tokens).replace(" .", ".\n")
print(lemmatized_text)

print("\n" + "="*80)
print("STEMMED TEXT:")
print("="*80)
stemmed_text = " ".join(stemmed_tokens).replace(" .", ".\n")
print(stemmed_text)


ORIGINAL TEXT:
The children were running faster than their friends.


LookupError: 
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource: nltk.download('punkt_tab')


In [41]:
def lemmatize(text):
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    # Get the POS tags
    pos_tags = nltk.pos_tag(tokens, tagset='universal')
    # Lemmatize each token using its POS tag
    lemmatized_tokens = [lemmatizer.lemmatize(token, get_wordnet_pos(pos_tag)) for token, pos_tag in pos_tags]
    return " ".join(lemmatized_tokens)

def stem_words(text):
    # Apply stemming to each token
    tokens = nltk.word_tokenize(text)
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return " ".join(stemmed_tokens)

def high_level_text_cleaning(text, drop_stop_words=False):
    # The flag is named drop_stop_words so it does not shadow the remove_stop_words function
    text = to_lower_case(text)
    if drop_stop_words:
        text = remove_stop_words(text)
    text = lemmatize(text)
    return text

def high_level_text_cleaning_v2(text, drop_stop_words=False):
    # Same as above, but with stemming instead of lemmatization
    text = to_lower_case(text)
    if drop_stop_words:
        text = remove_stop_words(text)
    text = stem_words(text)
    return text

print(f"Original: {text}")
print(f"Lemma: {high_level_text_cleaning(text)}")
print(f"Stem: {high_level_text_cleaning_v2(text)}")

Original: The children were running faster than their friends.


LookupError: 
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource: nltk.download('punkt_tab')


In [None]:
testing_text = (
    "Why waste time saying a lot of words when a few words do the trick?"
)
print(f"Lemma: {high_level_text_cleaning(testing_text)}")
print(f"Stem: {high_level_text_cleaning_v2(testing_text)}")

<img src="https://media1.tenor.com/m/IsYdPRq7bjcAAAAC/why-waste-time-when-few-word-do-trick.gif"/>

[Kevin's Small Talk - The Office US](https://www.youtube.com/watch?v=_K-L9uhsBLM)

In [32]:
# Apply the high-level cleaning to the training and test sets
train_df["review_clean"] = train_df["review_low_level_cleaned"].apply(high_level_text_cleaning_v2)  # either lemmatization or stemming can be used here
test_df["review_clean"] = test_df["review_low_level_cleaned"].apply(high_level_text_cleaning_v2)

NameError: name 'high_level_text_cleaning_v2' is not defined

### Building the Vocabulary

Next, we build a vocabulary from the cleaned reviews. A vocabulary is the set of unique words in the corpus. Each word is assigned a unique index, which we use to convert text into a sequence of integer indices. Two special tokens are added: `<PAD>`, used to fill short sequences, and `<OOV>`, used for words that are not in the vocabulary.

In [33]:
OOV_TOKEN = "<OOV>"
PAD_TOKEN = "<PAD>"
MAX_VOCAB_SIZE = 20_000
SEQUENCE_LENGTH = 200
EMBEDDING_DIM = 100

In [38]:
def make_vocab(all_texts, max_vocab_size, min_freq=5):
    # Count how often each whitespace-separated word appears in the corpus
    counts = Counter(chain.from_iterable(all_texts.str.split()))
    # Drop words that appear fewer than min_freq times
    counts = {word: freq for word, freq in counts.items() if freq >= min_freq}

    # Sort the words by frequency and keep the max_vocab_size most frequent ones
    vocab = sorted(counts, key=counts.get, reverse=True)[:max_vocab_size]
    vocab.append(OOV_TOKEN)  # The OOV token takes the last index

    # Map words to indices starting at 1; index 0 is reserved for padding
    vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}
    vocab_to_int[PAD_TOKEN] = 0

    return vocab_to_int

# Sample texts
testing_text1 = "The quick brown fox jumps over the lazy dog"
testing_text2 = "Then the quick brown fox jumps over the lazy dog"
print(f"Vocab: {make_vocab(pd.Series([testing_text1, testing_text2]), 100, 1)}")

Vocab: {'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumps': 5, 'over': 6, 'lazy': 7, 'dog': 8, 'The': 9, 'Then': 10, '<OOV>': 11, '<PAD>': 0}
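
To make the index mapping concrete, the following sketch builds a toy vocabulary, encodes a sentence containing an unseen word, and decodes it back. It is an illustration, not the notebook's `make_vocab`: the names `build_toy_vocab`, `encode`, `decode`, and the toy sentences are assumptions introduced here.

```python
from collections import Counter

PAD_TOKEN, OOV_TOKEN = "<PAD>", "<OOV>"


def build_toy_vocab(texts, min_freq=1):
    """Map each word to an index: 0 is padding, the last index is OOV."""
    counts = Counter(word for text in texts for word in text.split())
    words = [w for w, c in counts.most_common() if c >= min_freq]
    mapping = {PAD_TOKEN: 0}
    for i, word in enumerate(words + [OOV_TOKEN], start=1):
        mapping[word] = i
    return mapping


def encode(text, mapping):
    """Replace each word by its index, falling back to the OOV index."""
    oov = mapping[OOV_TOKEN]
    return [mapping.get(word, oov) for word in text.split()]


def decode(indices, inverse_mapping):
    """Turn indices back into words, useful when debugging encoded reviews."""
    return " ".join(inverse_mapping[i] for i in indices)


toy_word_to_index = build_toy_vocab(["the quick brown fox", "the lazy dog"])
toy_index_to_word = {i: w for w, i in toy_word_to_index.items()}

encoded = encode("the quick purple dog", toy_word_to_index)
print(encoded)                             # [1, 2, 7, 6]
print(decode(encoded, toy_index_to_word))  # the quick <OOV> dog
```

The word `purple` never appeared in the toy corpus, so it is mapped to the OOV index and cannot be recovered when decoding.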


In [42]:
word_to_index = make_vocab(train_df["review_clean"], MAX_VOCAB_SIZE)
VOCAB_SIZE = len(word_to_index)
print(f"Vocabulary size: {VOCAB_SIZE}")

KeyError: 'review_clean'

Now we implement a function that turns a review string into a list of integers, where each integer is the index of the corresponding word in the vocabulary.

In [43]:
def get_review_features(review_text, word_to_idx):
    """
    Convert a text into a list of indices based on the vocabulary.
    """
    # Split the text on whitespace and map each word to its index (OOV if unknown)
    return [
        word_to_idx.get(word, word_to_idx[OOV_TOKEN]) for word in review_text.split()
    ]
    
def truncate_sequence(sequence, max_length, keep='last'):
    """
    Truncate the sequence, keeping either its first or its last elements.

    Args:
        sequence: list of tokens/indices
        max_length: maximum length
        keep: 'first' or 'last'
    """
    if len(sequence) <= max_length:
        return sequence

    if keep == 'last':
        return sequence[-max_length:]  # Last words
    else:
        return sequence[:max_length]   # First words

def left_pad_features(review_ints, seq_length, pad_value=0, truncate_keep='last'):
    """
    Left-pad a sequence of indices so that all sequences have the same length.
    """
    # Truncate if longer than seq_length
    review_ints = truncate_sequence(review_ints, seq_length, keep=truncate_keep)

    # Pad on the left
    if len(review_ints) < seq_length:
        padding = [pad_value] * (seq_length - len(review_ints))
        return padding + review_ints
    else:
        return review_ints

def get_review_representation(review_text, word_to_idx, max_sequence_length):
    """
    Convert the input text into a left-padded sequence of indices.
    """
    review_ints = get_review_features(review_text, word_to_idx)
    return left_pad_features(review_ints, max_sequence_length)

# Sample text
testing_text = "The quick brown fox jumps <br> over the lazy dog ! ThisWordIsOOV."
print(f"Original: {testing_text}")
testing_text = low_level_text_cleaning(testing_text)
print(f"Low-level cleaning: {testing_text}")
testing_text = high_level_text_cleaning(testing_text)
print(f"High-level cleaning: {testing_text}")
testing_text = get_review_representation(testing_text, word_to_index, 15)
print(f"Features: {testing_text}")

Original: The quick brown fox jumps <br> over the lazy dog ! ThisWordIsOOV.
Low-level cleaning: The quick brown fox jumps over the lazy dog! ThisWordIsOOV.


LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/home/rami/nltk_data'
    - '/home/rami/.conda/envs/Taller_DL/nltk_data'
    - '/home/rami/.conda/envs/Taller_DL/share/nltk_data'
    - '/home/rami/.conda/envs/Taller_DL/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
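
The padding and truncation logic does not depend on NLTK, so its behavior can be checked in isolation. The sketch below folds both steps into one hypothetical helper, `pad_or_truncate`, and compares left versus right padding and keeping the first versus the last tokens. It is a compact illustration and does not replace the functions defined above.

```python
def pad_or_truncate(ids, seq_length, pad_value=0, pad_side="left", keep="last"):
    """Return a list with exactly seq_length ids (illustrative helper)."""
    if len(ids) > seq_length:
        # Too long: keep either the last or the first seq_length ids
        ids = ids[-seq_length:] if keep == "last" else ids[:seq_length]
    padding = [pad_value] * (seq_length - len(ids))
    return padding + ids if pad_side == "left" else ids + padding


short_review = [11, 12, 13]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short_review, 5))                    # [0, 0, 11, 12, 13]
print(pad_or_truncate(short_review, 5, pad_side="right"))  # [11, 12, 13, 0, 0]
print(pad_or_truncate(long_review, 5))                     # [3, 4, 5, 6, 7]
print(pad_or_truncate(long_review, 5, keep="first"))       # [1, 2, 3, 4, 5]
```

Left padding keeps the real tokens at the end of the sequence, which matters for models that read the representation at the last position.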


## Dataset and DataLoader

Our `IMDBDataset` takes the **preprocessed reviews** and labels from the Pandas DataFrame, together with the vocabulary, and returns a sequence of integer indices for each review. We then use a DataLoader to load the data in batches and feed them to our model.

In [35]:
class IMDBDataset(Dataset):
    def __init__(self, reviews, labels, vocab, max_sequence_length):
        # Store plain lists so that __getitem__ indexes by position,
        # whatever the index of the pandas Series passed in
        self.reviews = list(reviews)
        self.labels = list(labels)
        self.vocab = vocab
        self.max_sequence_length = max_sequence_length

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, idx):
        # Get the text and its label
        text = self.reviews[idx]
        label = self.labels[idx]

        # Convert the text into a padded sequence of indices
        indices = get_review_representation(text, self.vocab, self.max_sequence_length)

        return torch.tensor(indices, dtype=torch.int32), torch.tensor(
            [label], dtype=torch.float32
        )

In [44]:
# Create the datasets
train_dataset = IMDBDataset(
    train_df["review_clean"],
    train_df["sentiment"],
    word_to_index,
    max_sequence_length=SEQUENCE_LENGTH,
)
test_dataset = IMDBDataset(
    test_df["review_clean"],
    test_df["sentiment"],
    word_to_index,
    max_sequence_length=SEQUENCE_LENGTH,
)

# Split the training set into training and validation subsets
train_len = len(train_dataset)
val_len = int(0.2 * train_len)
train_dataset, val_dataset = random_split(train_dataset, [train_len - val_len, val_len])

KeyError: 'review_clean'

In [None]:
train_dataset[0]

In [None]:
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
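
To see what a batch looks like without the IMDB data, the following sketch wraps a few hand-written, already padded sequences in a toy dataset. `ToyReviewDataset` and the sample values are assumptions made for illustration; the dtypes mirror those used by `IMDBDataset`.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyReviewDataset(Dataset):
    """Holds already padded index sequences and their binary labels."""

    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        x = torch.tensor(self.sequences[idx], dtype=torch.int32)
        y = torch.tensor([self.labels[idx]], dtype=torch.float32)
        return x, y


toy_sequences = [[0, 0, 5, 7], [1, 2, 3, 4], [0, 9, 9, 2], [0, 0, 0, 6], [3, 3, 1, 2]]
toy_labels = [1, 0, 1, 0, 1]

toy_loader = DataLoader(ToyReviewDataset(toy_sequences, toy_labels), batch_size=2, shuffle=False)
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in toy_loader]
print(batch_shapes)  # [((2, 4), (2, 1)), ((2, 4), (2, 1)), ((1, 4), (1, 1))]
```

Each batch stacks the per-review tensors along a new first dimension. The last batch is smaller because five samples do not divide evenly by the batch size.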

# Part 1: Models without RNNs

## Baseline model

To capture the semantics of the reviews, we need to take our text (already converted to indices) and turn it into a feature vector. To do this, we use an embedding layer that maps each index to a feature vector.


### nn.Embedding

The [embedding layer](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) in PyTorch is a lookup table that maps an index to a feature vector. For example, if our vocabulary has 10,000 words and we use embeddings of size 100, the layer holds a weight matrix of size 10,000 x 100. Given a word index, the layer returns the corresponding row of that matrix, which is the word's feature vector. This is equivalent to multiplying a one-hot vector by the weight matrix, without ever building the one-hot vector.
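
The lookup behavior can be verified on a tiny layer. The sketch below uses toy sizes chosen only for illustration and checks that the layer output matches plain row indexing and a one-hot matrix product.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
toy_vocab_size, toy_embedding_dim = 6, 3  # toy sizes, not the notebook's values
toy_embedding = nn.Embedding(toy_vocab_size, toy_embedding_dim, padding_idx=0)

indices = torch.tensor([2, 5, 0])
looked_up = toy_embedding(indices)

# Indexing rows of the weight matrix gives the same vectors
rows = toy_embedding.weight[indices]

# So does multiplying one-hot vectors by the weight matrix
one_hot = F.one_hot(indices, num_classes=toy_vocab_size).float()
via_matmul = one_hot @ toy_embedding.weight

print(torch.equal(looked_up, rows))           # True
print(torch.allclose(looked_up, via_matmul))  # True
print(looked_up[2])                           # the padding row is all zeros

n_params = sum(p.numel() for p in toy_embedding.parameters())
print(n_params)  # 18 = 6 * 3, the layer has no bias
```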

In [None]:
embedding_layer = nn.Embedding(
    VOCAB_SIZE, EMBEDDING_DIM, padding_idx=word_to_index[PAD_TOKEN]
)
word_indices = torch.tensor([0, 1, 2, 3, 4, 5])
embedding_layer(word_indices)

As with other layers in PyTorch, the embedding layer is initialized with random weights that are adjusted during training. The exception is the row at `padding_idx`, which is initialized to zeros and is not updated. The layer has no bias term, so its parameter count is:

$$
\text{Parameters} = \text{Vocabulary size} \times \text{Embedding size}
$$

### Architecture

To classify the IMDB reviews, we use a simple model architecture with the following layers:

1. **Embedding layer**: Maps each word index to a feature vector.
2. **Averaging layer**: Computes the mean of the feature vectors of all positions in a review.
3. **Hidden linear layer**: Transforms the averaged feature vector into a vector of hidden size.
4. **Output layer**: Produces the final output, a value between 0 and 1 that is read as the probability of the review's class.
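
The averaging step runs over every position of the padded sequence, so short reviews are averaged together with many zero vectors. One possible variation, sketched here as an assumption and not as part of the model in this notebook, is to average only over the non-padding positions. The helper name `masked_mean` is hypothetical.

```python
import torch


def masked_mean(embedded, token_ids, pad_idx=0):
    """Average embeddings over the non-padding positions of each sequence."""
    mask = (token_ids != pad_idx).unsqueeze(-1).to(embedded.dtype)  # [B, T, 1]
    summed = (embedded * mask).sum(dim=1)                           # [B, E]
    counts = mask.sum(dim=1).clamp(min=1.0)                         # [B, 1], avoids division by zero
    return summed / counts


token_ids = torch.tensor([[0, 0, 3, 4], [1, 2, 3, 4]])
embedded = torch.ones(2, 4, 5)
embedded[token_ids == 0] = 0.0  # mimic the zero vectors produced by padding_idx

plain = embedded.mean(dim=1)
masked = masked_mean(embedded, token_ids)

print(plain[0])   # 0.5 everywhere: two of the four positions are padding
print(masked[0])  # 1.0 everywhere: only the two real tokens are averaged
```

For the second sequence, which has no padding, both averages agree.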

In [47]:
class SentimentModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(SentimentModel, self).__init__()
        self.embed = nn.Embedding(
            vocab_size, embedding_dim, padding_idx=word_to_index[PAD_TOKEN]
        )
        self.fc = nn.Linear(embedding_dim, hidden_dim)
        self.relu = nn.ReLU(inplace=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x: [BATCH_SIZE, SEQUENCE_LENGTH]
        x = self.embed(x)
        # x: [BATCH_SIZE, SEQUENCE_LENGTH, EMBEDDING_DIM]
        x = x.mean(dim=1)  # average over all positions, padding included
        # x: [BATCH_SIZE, EMBEDDING_DIM]
        x = self.fc(x)
        # x: [BATCH_SIZE, HIDDEN_DIM]
        x = self.relu(x)
        x = F.sigmoid(self.out(x))
        # x: [BATCH_SIZE, 1], values in (0, 1)
        return x

summary(
    SentimentModel(VOCAB_SIZE, EMBEDDING_DIM, 512),
    input_size=(BATCH_SIZE, SEQUENCE_LENGTH),
    dtypes=[torch.int32],
)


NameError: name 'VOCAB_SIZE' is not defined
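
The number of trainable parameters that `summary` should report can also be worked out by hand. The sketch below assumes the vocabulary reaches its upper bound of `MAX_VOCAB_SIZE` words plus the two special tokens (20,002 entries); the real `VOCAB_SIZE` may be smaller if fewer words pass the frequency filter. The helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    """An embedding layer stores one vector per vocabulary entry and no bias."""
    return vocab_size * embedding_dim


def linear_params(in_features, out_features):
    """A linear layer stores a weight matrix plus one bias per output."""
    return in_features * out_features + out_features


assumed_vocab_size = 20_000 + 2  # MAX_VOCAB_SIZE + <OOV> + <PAD> (assumed upper bound)
embedding_dim, hidden_dim = 100, 512

counts = {
    "embed": embedding_params(assumed_vocab_size, embedding_dim),
    "fc": linear_params(embedding_dim, hidden_dim),
    "out": linear_params(hidden_dim, 1),
}
total = sum(counts.values())

print(counts)  # {'embed': 2000200, 'fc': 51712, 'out': 513}
print(total)   # 2052425
```

Almost all of the parameters sit in the embedding matrix, which is why the vocabulary size is the main lever on model size here.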

### Training and Evaluation

In [None]:
CRITERION = nn.BCELoss().to(DEVICE)

In [None]:
base_model = SentimentModel(
    vocab_size=VOCAB_SIZE, embedding_dim=EMBEDDING_DIM, hidden_dim=512
).to(DEVICE)
base_optimizer = optim.Adam(base_model.parameters(), lr=0.001)

In [None]:
_, _ = train(
    base_model,
    optimizer=base_optimizer,
    criterion=CRITERION,
    train_loader=train_loader,
    val_loader=val_loader,
    device=DEVICE,
    do_early_stopping=True,
    patience=3,
    epochs=20,
)

In [None]:
def model_accuracy(model, data_loader):
    model.eval()
    with torch.no_grad():
        y_true = []
        y_pred = []
        for x, y in data_loader:
            x, y = x.to(DEVICE), y.to(DEVICE)
            out = torch.where(model(x) > 0.5, 1, 0)
            y_true.extend(y.cpu().numpy())
            y_pred.extend(out.cpu().numpy())
        print(f"Accuracy: {np.mean(np.array(y_true) == np.array(y_pred)) * 100:.2f}%")

In [None]:
model_accuracy(base_model, test_loader)

## Exercises

- Explore other text preprocessing techniques, such as stemming, number removal, etc.
- Experiment with the model's hyperparameters, such as the embedding size, the hidden layer size, etc.

# Part 2: Models with RNNs

[Recurrent neural networks (RNNs)](https://d2l.ai/chapter_recurrent-neural-networks/rnn.html) are a class of neural networks designed to handle sequential data. Unlike convolutional neural networks (CNNs), which are effective for spatial data such as images, RNNs are well suited to modeling sequential data such as text, audio, and time series.

In this case we use a many-to-one architecture, where the input is a sequence of words and the output is a single class label (positive or negative).

## RNNs

### nn.RNN

The [RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) layer in PyTorch is a recurrent layer that processes an input sequence step by step, maintaining a hidden state that captures information from previous steps. Given an input tensor of shape `(batch, sequence, features)` (with `batch_first=True`), the layer returns two tensors: the hidden state of the last layer at every time step, and the final hidden state of each layer.

<!-- ![rnn](https://d2l.ai/_images/rnn.svg) -->
<img src="https://d2l.ai/_images/rnn.svg" width="500" style="background:white; display: block; margin-left: auto; margin-right: auto;"/>

In [None]:
rnn = nn.RNN(input_size=EMBEDDING_DIM, hidden_size=32, num_layers=1, batch_first=True)
tensor = torch.randn(BATCH_SIZE, SEQUENCE_LENGTH, EMBEDDING_DIM)

output, hidden = rnn(tensor)
print(f"Output shape: {output.shape}")
print(f"Hidden shape: {hidden.shape}")
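
The relationship between the two returned tensors can be checked directly. In the sketch below (toy sizes chosen for illustration), the last time step of `output` coincides with the final hidden state of the top layer for a unidirectional RNN.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_length, feature_dim, hidden_size = 4, 7, 5, 3  # toy sizes

toy_rnn = nn.RNN(input_size=feature_dim, hidden_size=hidden_size, num_layers=2, batch_first=True)
inputs = torch.randn(batch_size, seq_length, feature_dim)

output, hidden = toy_rnn(inputs)
print(output.shape)  # torch.Size([4, 7, 3]): top-layer state at every time step
print(hidden.shape)  # torch.Size([2, 4, 3]): final state of each layer

# Last time step of the output equals the final hidden state of the top layer
last_step_matches = torch.allclose(output[:, -1, :], hidden[-1])
print(last_step_matches)  # True
```

This is why a many-to-one model can read the sequence summary with `output[:, -1, :]`. It relies on left padding, so that the last position holds a real token.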

### Architecture

In [None]:
class SentimentRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, n_layers=2, dropout=0.5):
        super(SentimentRNN, self).__init__()
        self.embed = nn.Embedding(
            vocab_size, embedding_dim, padding_idx=word_to_index[PAD_TOKEN]
        )
        self.drop = nn.Dropout(dropout)
        self.rnn = nn.RNN(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            batch_first=True,
        )
        self.fc = nn.Linear(hidden_dim, hidden_dim*2)
        self.out = nn.Linear(hidden_dim*2, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: [BATCH_SIZE, SEQUENCE_LENGTH]
        x = self.embed(x)
        # x: [BATCH_SIZE, SEQUENCE_LENGTH, EMBEDDING_DIM]
        x = self.drop(x)
        x, _ = self.rnn(x)
        # x: [BATCH_SIZE, SEQUENCE_LENGTH, HIDDEN_DIM]
        x = x[:, -1, :]
        # x: [BATCH_SIZE, HIDDEN_DIM]
        x = self.fc(x)

        x = self.relu(x)
        # x: [BATCH_SIZE, HIDDEN_DIM*2]
        x = F.sigmoid(self.out(x))
        return x

summary(
    SentimentRNN(VOCAB_SIZE, EMBEDDING_DIM, 128),
    input_size=(BATCH_SIZE, SEQUENCE_LENGTH),
    dtypes=[torch.int32],
)

### Training and Evaluation

In [None]:
rnn_model = SentimentRNN(
    vocab_size=VOCAB_SIZE,
    embedding_dim=EMBEDDING_DIM,
    hidden_dim=128,
    n_layers=2,
    dropout=0.5,
).to(DEVICE)
rnn_optimizer = optim.Adam(rnn_model.parameters(), lr=0.001)

In [None]:
_, _ = train(
    rnn_model,
    optimizer=rnn_optimizer,
    criterion=CRITERION,
    train_loader=train_loader,
    val_loader=val_loader,
    device=DEVICE,
    do_early_stopping=False,
    epochs=20,
)

In [None]:
model_accuracy(rnn_model, test_loader)

## LSTMs

[LSTM (Long Short-Term Memory) networks](https://d2l.ai/chapter_recurrent-modern/lstm.html) are a variant of RNNs designed to mitigate the vanishing gradient problem. LSTMs use a more complex cell structure, with gates that control what is written to, kept in, and read from a cell state. This lets gradients flow across many time steps with far less vanishing, which makes LSTMs more effective at modeling long-range dependencies.

<img src="https://d2l.ai/_images/lstm-3.svg" width="500" style="background:white; display: block; margin-left: auto; margin-right: auto;"/>
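
For reference, the usual formulation of a single LSTM step is written out below. The symbols are introduced here and follow common convention: $x_t$ is the input at step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the sigmoid function, and $\odot$ the element-wise product. The weight names $W$, $U$, and $b$ are notation chosen for this sketch.

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The additive update of $c_t$ is what helps gradients survive across many steps: when the forget gate $f_t$ stays close to 1, the cell state is carried forward almost unchanged.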

In [None]:
lstm = nn.LSTM(input_size=EMBEDDING_DIM, hidden_size=32, num_layers=1, batch_first=True)
tensor = torch.randn(BATCH_SIZE, SEQUENCE_LENGTH, EMBEDDING_DIM)

output, (hidden, cell) = lstm(tensor)
print(f"Output shape: {output.shape} (batch_size, seq_length, hidden_size)")
print(f"Hidden shape: {hidden.shape} (num_layers, batch_size, hidden_size)")
print(f"Cell shape: {cell.shape} (num_layers, batch_size, hidden_size)")

In [None]:
class SentimentLSTM(nn.Module):
    def __init__(
        self, vocab_size, embedding_dim, hidden_dim, n_layers=2, dropout=0.5
    ):
        super(SentimentLSTM, self).__init__()
        self.embed = nn.Embedding(
            vocab_size, embedding_dim, padding_idx=word_to_index[PAD_TOKEN]
        )
        self.drop = nn.Dropout(dropout)
        
        pass

    def forward(self, text):
        # text: [BATCH_SIZE, SEQUENCE_LENGTH]
        pass # use sigmoid activation for the output

summary(
    SentimentLSTM(VOCAB_SIZE, EMBEDDING_DIM, 128),
    input_size=(BATCH_SIZE, SEQUENCE_LENGTH),
    dtypes=[torch.int32],
)

In [None]:
lstm_model = SentimentLSTM(
    vocab_size=VOCAB_SIZE,
    embedding_dim=EMBEDDING_DIM,
    hidden_dim=128,
    n_layers=4,
    dropout=0.5,
).to(DEVICE)

optimizer = optim.Adam(lstm_model.parameters(), lr=0.001)

_, _ = train(
    lstm_model,
    optimizer=optimizer,
    criterion=CRITERION,
    train_loader=train_loader,
    val_loader=val_loader,
    device=DEVICE,
    do_early_stopping=False,
    epochs=20,
)

In [None]:
model_accuracy(lstm_model, test_loader)

## Exercises

- Explore other recurrent architectures, such as [GRUs](https://d2l.ai/chapter_recurrent-modern/gru.html).
- Experiment with different hyperparameters, such as the hidden layer size, the learning rate, etc.