# Assignment 2 - Named Entity Recognition

----------------------

- **Names:** Mario Vicuña, Miguel Videla

- **Codalab user or team name:** TeamChalla



## Introduction to the assignment

### Objective


The goal of this assignment is to solve one of the most important Sequence Labelling tasks: [Named Entity Recognition (NER)](http://www.cs.columbia.edu/~mcollins/cs4705-spring2019/slides/tagging.pdf). 

In particular, as in the previous assignment, you must take part in a competition in which you will build different models that aim to solve NER in Spanish. To this end, we provide a NER dataset of labelled Spanish news articles plus this baseline, from which you can start working. 

We expect you to use (at least) Recurrent Neural Networks (RNN) to solve it. Once again, you are completely free to use any software and models you wish, as long as they do not ship the models already implemented (as is the case with spacy).


**What is Sequence Labelling?** 

In short, given a sequence of tokens (a phrase or sentence), sequence labelling aims to assign a label to each token of that sequence.

**Named Entity Recognition (NER)**

This task consists of locating and classifying the tokens of a sentence that represent named entities. That is, tokens denoting (1) **people**, (2) **organizations**, (3) **places** and (4) **adjectives, events and other entities that do not fit the previous categories** must be tagged as (1) **PER**, (2) **ORG**, (3) **LOC** and (4) **MISC** respectively. In addition, since some entities span more than one token (such as La Serena), the BIO notation is used as a prefix to the tag: Beginning, Inside, Outside. That is, when an entity is found, its first token is tagged with the prefix B and every remaining token of the entity with the prefix I. Tokens that do not represent any named entity are tagged O. For example:

```
Felipe B-PER
Bravo I-PER
es O
el O
profesor O
de O
PLN B-MISC
de O
la O
Universidad B-ORG
de I-ORG
Chile I-ORG
. O
```
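
To make the BIO scheme concrete, the following minimal sketch groups BIO-tagged tokens back into entity spans. The helper name `bio_to_spans` is hypothetical (it is not part of the baseline), and treating a stray `I-` tag as the start of a new entity is an assumption made here for robustness.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Close the entity being built, if any.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_entity = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_entity:
            # A B- tag (or an I- tag with no matching open entity) opens a new span.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continuation of the open entity.
            current_tokens.append(token)
        else:
            # "O": outside any entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


tokens = "Felipe Bravo es el profesor de PLN de la Universidad de Chile .".split()
tags = ["B-PER", "I-PER", "O", "O", "O", "O", "B-MISC",
        "O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Applied to the sentence above, the three multi-token or single-token entities are recovered as one span each.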

These links are the best starting points:

-  [Tagging, and Hidden Markov Models ](http://www.cs.columbia.edu/~mcollins/cs4705-spring2019/slides/tagging.pdf) (slides by Michael Collins), [notes](http://www.cs.columbia.edu/~mcollins/hmms-spring2013.pdf), [video 1](https://youtu.be/-ngfOZz8yK0), [video 2](https://youtu.be/PLoLKQwkONw), [video 3](https://youtu.be/aaa5Qoi8Vco), [video 4](https://youtu.be/4pKWIDkF_6Y)       
-  [Recurrent Neural Networks](slides/NLP-RNN.pdf) | [video 1](https://youtu.be/BmhjUkzz3nk), [video 2](https://youtu.be/z43YFR1iIvk), [video 3](https://youtu.be/7L5JxQdwNJk)


Remember that all the material is available in the [course GitHub repository](https://github.com/dccuchile/CC6205).

### Assignment rules

Some details about the competition:

- For your assignment to be graded, you must take part in the competition and also submit this notebook with your report.
- To participate, you must register for the competition on Codalab in groups of at most 2 students. Each group must have a team name. (And you must state it in your report!)
- The metrics used will be Precision, Recall and F1.
- Using a GPU is recommended for this assignment. You can run it on Colab (which comes with everything installed) or try to run it on your own computer. In the latter case, it must be CUDA-compatible and you will have to install everything yourselves.
- In total you may make a **maximum of 4 submissions**.
- Please ask all your questions in the assignment's U-cursos thread. Emails sent to the teaching staff will be redirected there. Remember the collaborative spirit of the course!
- Being in the top 5 on any metric is worth 1 extra point on the final grade.


**Competition link:  https://competitions.codalab.org/competitions/25302?secret_key=690406c7-b3b0-4092-8694-d08d7991ca94**

### Models

The baseline RNN attached to this notebook is implemented in [`pytorch`](https://pytorch.org/) and includes:

- Dataset loading, creation of text batches and padding. In short, it loads the data and leaves it ready for training the network.
- A basic implementation of a simple `LSTM` network with a single layer and no bidirectionality. 
- Construction of an output file so that you can test it in the competition on Codalab.




You are expected to experiment with the baseline using (but not limited to) these suggestions:

*   Try early stopping.
*   Vary the number of parameters of the embedding layer.
*   Vary the number of RNN layers.
*   Vary the number of parameters of the RNN layers.
*   Initialize the embedding layer with pre-trained models (word2vec, glove, conceptnet, etc.). [Short guide here](https://github.com/dccuchile/spanish-word-embeddings), [Spanish embeddings here](https://github.com/dccuchile/spanish-word-embeddings).
*   Vary the number of training epochs.
*   Vary the optimizer, learning rate, batch size, use a CRF loss, etc.
*   Try bidirectionality.
*   Try teacher forcing.
*   Add dropout.
*   Try GRU models.
*   Try contextual embeddings ([flair](https://github.com/flairNLP/flair) may be useful).
*   Try Spanish transformer models using [Huggingface](https://github.com/huggingface/transformers).

### Report

It must follow this structure:

1.	**Introduction**: Briefly present the problem to solve, the models used while developing the assignment and the conclusions obtained. (0.5 points)

2.	**Models**: Briefly describe the models, methods and hyperparameters used. (1.0 points)

3.	**Evaluation metrics**: Describe the metrics used in the evaluation, stating what they measure and how they are interpreted in this particular problem. (0.5 points)

4.	**Experiments**: Report all your experiments and code in this section. Compare the results obtained with different models. Running several experiments is essential for a good grade! (3.0 points)

5.	**Conclusions**: Discuss the results and propose future work. (1.0 points)

(You may delete any cell containing instructions...)

**Important**: Remember to put your names and your user or team name (if applicable) in the report. Notebooks without names will NOT be graded.


-----------------------------------------

## Introduction


...

## Models 


...

## Evaluation metrics

- **Precision:** ...
- **Recall:** ...
- **F1 score:** ...

## Experiments


The code we provide serves as a baseline from which to implement better models. 
In general, the code for loading the data, the training and evaluation functions, and the prediction on the competition data should not need to change. 
You only need to change the model architecture and its hyperparameters and report the results, which you can do in the *model* subsections.



In [None]:
!nvidia-smi

Thu Jul 30 21:47:12 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8    10W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               

###  Data loading and preprocessing

To load and preprocess the data we will use the [`torchtext`](https://github.com/pytorch/text) library.
In particular we will use its `data` module, which according to its original documentation provides: 

    - Ability to describe declaratively how to load a custom NLP dataset that's in a "normal" format
    - Ability to define a preprocessing pipeline
    - Batching, padding, and numericalizing (including building a vocabulary object)
    - Wrapper for dataset splits (train, validation, test)


The process is the following: 

1. Download the data from GitHub and inspect it.
2. Define the fields (`fields`) that we will load from the files.
3. Load the datasets.
4. Build the vocabulary.



In [None]:
# Install torchtext (on Colab) - Uncomment.
#!pip3 install --upgrade torchtext
!pip3 install torchtext==0.6
#!pip3 install --upgrade torch
!pip3 install torch==1.5.1


In [None]:
import torch
from torchtext import data, datasets


# Seed the random number generator and force deterministic cuDNN kernels for reproducibility
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# torch.cuda.device(1)
# print(torch.cuda.current_device())
print('Using', device)

Using cuda


#### Getting the data

We download the training, validation and test data into our working directory.

In [None]:
%%capture
!wget https://github.com/dccuchile/CC6205/releases/download/Data/train_NER_esp.txt -nc # Training dataset
!wget https://github.com/dccuchile/CC6205/releases/download/Data/val_NER_esp.txt -nc    # Validation dataset (for testing and tuning the model)
!wget https://github.com/dccuchile/CC6205/releases/download/Data/test_NER_esp.txt -nc  # Competition dataset. It contains only the tokens. THESE ARE THE ONES TO BE PREDICTED!

####  Fields

A `field`:

* Defines a data type together with instructions for converting the text to a Tensor.
* Holds a `Vocab` object containing the vocabulary (the possible values that the field can take).
* Holds other parameters related to how a data type should be numericalized, such as a tokenization method and the kind of Tensor that should be produced.


Let's look at the following block, which contains an arbitrary training example:


```
El O
Abogado B-PER
General I-PER
del I-PER
Estado I-PER
, O
Daryl B-PER
Williams I-PER
```

Each line contains a word and its class. For `torchtext` to load this data, we must define how it reads and splits the components of each line.
To do so, we define one field for each of those components: the words (`TEXT`) and the NER tags, that is, the class (`NER_TAGS`).


In [None]:
# First Field: TEXT. Represents the tokens of the sequence
TEXT = data.Field(lower=False) 

# Second Field: NER_TAGS. Represents the tags associated with each word.
# unk_token=None because every tag is known, so no <unk> entry is needed.
NER_TAGS = data.Field(unk_token=None)

fields = (("text", TEXT), ("nertags", NER_TAGS))

####  SequenceTaggingDataset

`SequenceTaggingDataset` is a torchtext class designed to hold sequence labelling datasets. 
The examples stored in an instance of it are arrays of words paired with their respective tags.
For example, for Part-of-speech tagging:

[I, love, PyTorch, .] is paired with [PRON, VERB, PROPN, PUNCT]


The idea is that, using the fields defined above, we tell the class how to load the training, validation and test datasets.

In [None]:
train_data, valid_data, test_data = datasets.SequenceTaggingDataset.splits(
    path="./",
    train="train_NER_esp.txt",
    validation="val_NER_esp.txt",
    test="test_NER_esp.txt",
    fields=fields,
    encoding="iso-8859-1",
    separator=" "
)
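
As a rough illustration of the file format being loaded, the sketch below parses `token tag` lines into sentences without `torchtext`. The function name `read_tagged_sentences` is hypothetical, and the assumption that a blank line separates consecutive sentences is mine; it is not stated in this notebook.

```python
def read_tagged_sentences(lines, separator=" "):
    """Parse 'token<sep>tag' lines into a list of (tokens, tags) pairs.

    Assumes that sentences are separated by blank lines.
    """
    sentences = []
    tokens, tags = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.split(separator)
        tokens.append(token)
        tags.append(tag)
    if tokens:
        # The last sentence may not be followed by a blank line.
        sentences.append((tokens, tags))
    return sentences


sample_lines = [
    "El O", "Abogado B-PER", "General I-PER", "del I-PER", "Estado I-PER",
    "",
    "Daryl B-PER", "Williams I-PER",
]
parsed = read_tagged_sentences(sample_lines)
print(len(parsed), parsed[1])
```

Each parsed pair corresponds to what the notebook later accesses as `example.text` and `example.nertags`.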

In [None]:
print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(valid_data)}")
print(f"Number of test (competition) examples: {len(test_data)}")

Number of training examples: 8323
Number of validation examples: 1915
Number of test (competition) examples: 1517


Let's look at an example:

In [None]:
import random
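# randrange excludes the upper bound; randint(0, len(train_data)) could return an out-of-range index.
random_item_idx = random.randrange(len(train_data))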
random_example = train_data.examples[random_item_idx]
list(zip(random_example.text, random_example.nertags))

[('Dentro', 'O'),
 ('de', 'O'),
 ('la', 'O'),
 ('apatía', 'O'),
 ('y', 'O'),
 ('el', 'O'),
 ('desengaño', 'O'),
 ('con', 'O'),
 ('la', 'O'),
 ('oposición', 'O'),
 ('al', 'O'),
 ('presidente', 'O'),
 ('yugoslavo', 'O'),
 (',', 'O'),
 ('Slobodan', 'B-PER'),
 ('Milosevic', 'I-PER'),
 (',', 'O'),
 ('la', 'O'),
 ('población', 'O'),
 ('ha', 'O'),
 ('empezado', 'O'),
 ('a', 'O'),
 ('jugar', 'O'),
 ('a', 'O'),
 ('las', 'O'),
 ('cartas', 'O'),
 ('con', 'O'),
 ('una', 'O'),
 ('baraja', 'O'),
 ('en', 'O'),
 ('la', 'O'),
 ('que', 'O'),
 ('sus', 'O'),
 ('caricaturizados', 'O'),
 ('líderes', 'O'),
 ('pintan', 'O'),
 ('corazones', 'O'),
 (',', 'O'),
 ('bastos', 'O'),
 ('y', 'O'),
 ('otros', 'O'),
 ('palos', 'O'),
 ('.', 'O')]

#### Building the vocabularies for the text and the tags

The vocabularies are the objects that contain all the possible (training) tokens for both fields.
The next step is to build them. To do so, we call the `Field.build_vocab` method on each of our `fields`. 

In [None]:
TEXT.build_vocab(train_data)
NER_TAGS.build_vocab(train_data)

In [None]:
print(f"Unique tokens in TEXT: {len(TEXT.vocab)}")
print(f"Unique tokens in NER_TAGS: {len(NER_TAGS.vocab)}")

Unique tokens in TEXT: 26101
Unique tokens in NER_TAGS: 10
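
The following pure-Python sketch shows the idea behind building a vocabulary and numericalizing a padded batch. It is not torchtext's implementation: the helpers `build_vocab` and `numericalize` are hypothetical, and the ordering (special tokens first, then tokens by descending frequency) is an assumption chosen to match the indices printed later in this notebook (`<unk>` = 0, `<pad>` = 1).

```python
from collections import Counter


def build_vocab(sentences, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): special tokens first, then tokens by frequency."""
    counts = Counter(token for sentence in sentences for token in sentence)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    itos = list(specials) + [token for token, _ in ordered]
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi


def numericalize(batch, stoi, unk="<unk>", pad="<pad>"):
    """Map tokens to indices and pad every sentence to the longest one."""
    max_len = max(len(sentence) for sentence in batch)
    return [
        [stoi.get(token, stoi[unk]) for token in sentence]
        + [stoi[pad]] * (max_len - len(sentence))
        for sentence in batch
    ]


train_sentences = [["la", "casa", "de", "la", "moneda"], ["la", "moneda"]]
itos, stoi = build_vocab(train_sentences)

# "azul" never appears in training, so it maps to <unk>.
ids = numericalize([["la", "moneda"], ["la", "casa", "azul"]], stoi)
print(itos)
print(ids)
```

Because the vocabulary is built from the training split only, any validation or test token that is missing from it is replaced by the `<unk>` index.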


In [None]:
# Let's look at the possible tags we have loaded:
NER_TAGS.vocab.itos

['<pad>',
 'O',
 'B-ORG',
 'I-ORG',
 'B-LOC',
 'B-PER',
 'I-PER',
 'I-MISC',
 'B-MISC',
 'I-LOC']

In [None]:
%%script false
## Let's check the NER_TAGS of the validation set:
TEXT.build_vocab(valid_data)
NER_TAGS.build_vocab(valid_data)

print(f"Unique tokens in TEXT: {len(TEXT.vocab)}")
print(f"Unique tokens in NER_TAGS: {len(NER_TAGS.vocab)}")

NER_TAGS.vocab.itos

Note that besides the NER tags we have \<pad\>, which is generated by the data loader to pad each sentence.

Now let's look at the most frequent tokens and the special ones:

In [None]:
# Most frequent tokens
TEXT.vocab.freqs.most_common(10)

[('de', 17657),
 (',', 14716),
 ('la', 9571),
 ('que', 7516),
 ('.', 7263),
 ('el', 6905),
 ('en', 6484),
 ('"', 5691),
 ('y', 5336),
 ('a', 4304)]

In [None]:
# Set some variables that will be useful later on...
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

PAD_TAG_IDX = NER_TAGS.vocab.stoi[NER_TAGS.pad_token]
O_TAG_IDX = NER_TAGS.vocab.stoi['O']

print(UNK_IDX)
print(PAD_IDX)
print(PAD_TAG_IDX)
print(O_TAG_IDX)

0
1
0
1


#### Tag frequencies

Let's quickly look at the count and relative frequency of each tag:

In [None]:
def tag_percentage(tag_counts):
    
    total_count = sum([count for tag, count in tag_counts])
    tag_counts_percentages = [(tag, count, count/total_count) for tag, count in tag_counts]
  
    return tag_counts_percentages

print("Tag Count Percentage\n")

for tag, count, percent in tag_percentage(NER_TAGS.vocab.freqs.most_common()):
    print(f"{tag}\t{count}\t{percent*100:4.1f}%")

Tag Count Percentage

O	231920	87.6%
B-ORG	7390	 2.8%
I-ORG	4992	 1.9%
B-LOC	4913	 1.9%
B-PER	4321	 1.6%
I-PER	3903	 1.5%
I-MISC	3212	 1.2%
B-MISC	2173	 0.8%
I-LOC	1891	 0.7%
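
The imbalance in this table is the reason plain accuracy is a poor measure here. As a quick check using only the counts printed above, a trivial tagger that always outputs `O` would already reach about 87.6% token accuracy while finding no entity at all:

```python
# Tag counts copied from the table above.
tag_counts = {
    "O": 231920, "B-ORG": 7390, "I-ORG": 4992, "B-LOC": 4913, "B-PER": 4321,
    "I-PER": 3903, "I-MISC": 3212, "B-MISC": 2173, "I-LOC": 1891,
}

total_tokens = sum(tag_counts.values())
entity_tokens = total_tokens - tag_counts["O"]

# Token accuracy of a tagger that predicts "O" for every token.
all_o_accuracy = tag_counts["O"] / total_tokens

print(f"Total tokens: {total_tokens}")
print(f"Entity tokens: {entity_tokens}")
print(f"Accuracy of the all-O tagger: {all_o_accuracy:.3f}")
```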


#### Configuring pytorch and batching the data

Important: if you run out of GPU memory, decrease the batch size.

In [None]:
BATCH_SIZE = 16  # decrease if you run out of memory.

# Use cuda if it is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using', device)

# Build the batch iterators for the training, validation and test sets
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    device=device,
    sort=False,
)

Using cuda


#### Evaluation metrics

In addition, we define the metrics used both in the competition and to evaluate the model: `precision`, `recall` and `f1`.
**Important**: Note that the evaluation only considers the Named Entities (tokens whose true tag is 'O' are excluded).

In [None]:
# Define the metrics

from sklearn.metrics import f1_score, precision_score, recall_score
import warnings
import sklearn.exceptions
warnings.filterwarnings("ignore",
                        category=sklearn.exceptions.UndefinedMetricWarning)


def calculate_metrics(preds, y_true, pad_idx=PAD_TAG_IDX, o_idx=O_TAG_IDX):
    """
    Computes the macro-averaged precision, recall and f1 of one batch.
    """

    # Index of the highest-scoring class for each token.
    y_pred = preds.argmax(dim=1)

    # Keep only the positions whose true tag is neither <pad> nor O.
    mask = (y_true != o_idx) & (y_true != pad_idx)
    y_pred = y_pred[mask]
    y_true = y_true[mask]

    # Move to the cpu
    y_pred = y_pred.to('cpu')
    y_true = y_true.to('cpu')

    # Compute the scores
    f1 = f1_score(y_true, y_pred, average='macro')
    precision = precision_score(y_true, y_pred, average='macro')
    recall = recall_score(y_true, y_pred, average='macro')

    return precision, recall, f1
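
To see what these numbers mean, here is a small pure-Python sketch of masked, macro-averaged metrics on a toy batch. It mirrors my understanding of the macro average (one score per label seen in either the true or the predicted tags, undefined ratios counted as 0) and is not the scikit-learn implementation; `macro_scores` is a hypothetical helper.

```python
def macro_scores(y_true, y_pred):
    """Unweighted mean of the per-label precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


y_true = ["B-PER", "I-PER", "O", "B-LOC", "O", "<pad>"]
y_pred = ["B-PER", "O", "O", "B-LOC", "B-LOC", "O"]

# Mask out the positions whose TRUE tag is O or <pad>, as calculate_metrics does.
kept = [(t, p) for t, p in zip(y_true, y_pred) if t not in ("O", "<pad>")]
masked_true = [t for t, _ in kept]
masked_pred = [p for _, p in kept]

precision, recall, f1 = macro_scores(masked_true, masked_pred)
print(precision, recall, f1)
```

Two consequences are worth keeping in mind when reading the training logs: the spurious `B-LOC` predicted on a true `O` token is never penalized, because that position is masked out, and an entity token predicted as `O` adds `O` as an extra zero-scoring label to the average.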

-------------------

### Baseline model

With the data loaded, it is time to define our model. This baseline has an embedding layer, a few LSTM layers and an output layer, and it uses dropout during training.

It consists of the following steps: 

1. Define the class that will contain the network.
2. Define the hyperparameters and initialize the network. 
3. Define the training epoch.
4. Define the loss function.



For experimenting, we recommend wrapping each model in its own variable and then assigning it to `model` to train it.

In [None]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


# Define the network
class NER_RNN(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim,
                 n_layers, 
                 bidirectional, 
                 dropout, 
                 pad_idx):

        super().__init__()

        # Embedding layer
        self.embedding = nn.Embedding(input_dim,
                                      embedding_dim,
                                      padding_idx=pad_idx)

        # LSTM layer (dropout between layers only applies when there is more than one)
        self.lstm = nn.LSTM(embedding_dim,
                           hidden_dim,
                           num_layers=n_layers,
                           bidirectional=bidirectional, 
                           dropout = dropout if n_layers > 1 else 0)

        # Output layer
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim,
                            output_dim)

        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):

        #text = [sent len, batch size]

        # Convert the token indices to embeddings
        embedded = self.dropout(self.embedding(text))
        #embedded = [sent len, batch size, emb dim]

        # Pass the embeddings through the rnn (LSTM)
        outputs, (hidden, cell) = self.lstm(embedded)
        #outputs = [sent len, batch size, hid dim * n directions]
        #hidden/cell = [n layers * n directions, batch size, hid dim]

        # Predict using the output layer.
        predictions = self.fc(self.dropout(outputs))
        #predictions = [sent len, batch size, output dim]

        return predictions

#### Network hyperparameters

We define the hyperparameters. 

In [None]:
# vocabulary size. Each input token is an index into the vocabulary (equivalent to a one-hot vector).
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100  # dimension of the embeddings.
HIDDEN_DIM = 128  # hidden dimension of the LSTM layers
OUTPUT_DIM = len(NER_TAGS.vocab)  # number of classes

N_LAYERS = 2  # number of layers.
DROPOUT = 0.25
BIDIRECTIONAL = False

# Create our model.
baseline_model = NER_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

baseline_model_name = 'baseline'  # name under which the model will be saved...

In [None]:
baseline_n_epochs = 10

#### Defining the loss function

In [None]:
# Loss: Cross Entropy (positions tagged <pad> are ignored)
TAG_PAD_IDX = NER_TAGS.vocab.stoi[NER_TAGS.pad_token]
baseline_criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)
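
The effect of `ignore_index` can be sketched in pure Python: padded positions are skipped and the loss is averaged over the remaining tokens only. This is an illustration of the idea with toy numbers, not PyTorch's implementation; the helper `cross_entropy` below is hypothetical and assumes that at least one target is not padding.

```python
import math


def cross_entropy(logits, targets, ignore_index):
    """Mean cross entropy over the targets that differ from ignore_index."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        log_sum_exp = math.log(sum(math.exp(score) for score in row))
        # -log softmax(row)[target]
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)


PAD = 0  # the <pad> tag has index 0 in the tag vocabulary of this notebook
logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [5.0, 0.0, 0.0]]
targets = [1, 2, PAD]

loss_with_padding = cross_entropy(logits, targets, ignore_index=PAD)
loss_without_padding = cross_entropy(logits[:2], targets[:2], ignore_index=PAD)
print(loss_with_padding, loss_without_padding)
```

Both values coincide, which shows that the amount of padding in a batch does not change the loss.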

--------------------
### Model 1

In these sections you can implement new networks by modifying the hyperparameters, the number of training epochs, the batch size, the loss, the optimizer, etc., and also define new network architectures (by creating new classes).


At the end of each section there are 4 variables, which you must set to the model, its name, the number of training epochs and the loss you want to try.


In [None]:
# model_1 = ...
# model_name_1 = ...
# n_epochs_1 = ...
# loss_1 = ...

---------------

### Model 2

In [None]:
# model_2 = ...
# model_name_2 = ...
# n_epochs_2 = ...
# loss_2 = ...

---------------


### Model 3

In [None]:
# model_3 = ...
# model_name_3 = ...
# n_epochs_3 = ...
# loss_3 = ...

------
### Training and evaluation


**Important**: Set the model, its name, the loss and the number of training epochs to use for training and evaluation in the following variables! The optimizer is defined a few cells below.

In [None]:
model = baseline_model
model_name = baseline_model_name
criterion = baseline_criterion
n_epochs = baseline_n_epochs



#### Initializing the network

We initialize the weights of the network randomly (using a normal distribution).


In [None]:
def init_weights(m):
    # Initialize only the parameters owned directly by this submodule, so that
    # model.apply (which visits every submodule) initializes each one exactly once.
    for name, param in m.named_parameters(recurse=False):
        nn.init.normal_(param.data, mean=0, std=0.1) 
        
    # Set the embeddings of UNK and PAD to 0.
    if isinstance(m, nn.Embedding):
        m.weight.data[UNK_IDX].zero_()
        m.weight.data[PAD_IDX].zero_()
        
model.apply(init_weights)

NER_RNN(
  (embedding): Embedding(26101, 100, padding_idx=1)
  (lstm): LSTM(100, 128, num_layers=2, dropout=0.25)
  (fc): Linear(in_features=128, out_features=10, bias=True)
  (dropout): Dropout(p=0.25, inplace=False)
)

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The current model has {count_parameters(model):,} trainable parameters.')

The current model has 2,861,246 trainable parameters.
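
The count reported above can be reproduced by hand, which helps when reasoning about how each hyperparameter affects the model size. The sketch below uses the parameter layout of a PyTorch LSTM (four gates per layer, each with input weights, recurrent weights and two bias vectors); the function names are hypothetical.

```python
def lstm_param_count(input_size, hidden_size, num_layers, bidirectional=False):
    """Number of parameters of a multi-layer LSTM with two bias vectors per gate."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first receive the (possibly concatenated) hidden states.
        layer_input = input_size if layer == 0 else hidden_size * directions
        per_direction = 4 * (layer_input * hidden_size      # input weights
                             + hidden_size * hidden_size    # recurrent weights
                             + 2 * hidden_size)             # two bias vectors
        total += per_direction * directions
    return total


def ner_rnn_param_count(vocab_size, embedding_dim, hidden_dim, output_dim,
                        n_layers, bidirectional):
    """Parameters of embedding + LSTM + linear output layer."""
    embedding = vocab_size * embedding_dim
    lstm = lstm_param_count(embedding_dim, hidden_dim, n_layers, bidirectional)
    fc_inputs = hidden_dim * (2 if bidirectional else 1)
    fc = fc_inputs * output_dim + output_dim
    return embedding + lstm + fc


# Baseline values from this notebook: 26101 tokens, 100-d embeddings,
# 128 hidden units, 10 tags, 2 unidirectional layers.
baseline_params = ner_rnn_param_count(26101, 100, 128, 10, 2, False)
embedding_params = 26101 * 100
print(f"{baseline_params:,} parameters, {embedding_params:,} of them in the embedding")
```

Most of the parameters sit in the embedding layer, so the vocabulary size and `EMBEDDING_DIM` dominate the size of the baseline.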


Note that `init_weights` also sets the embeddings that represent \<unk\> and \<pad\> to [0, 0, ..., 0].

#### Defining the optimizer

In [None]:
# Optimizer
optimizer = optim.Adam(model.parameters())

#### Sending the model to cuda


In [None]:
# Send the model and the loss to cuda (if it is available)
model = model.to(device)
criterion = criterion.to(device)

#### Defining the training of the network

Some preliminary concepts: 

- `epoch`: one complete training pass over the dataset.
- `batch`: a subset of the examples that is processed in a single step of an epoch. Batches make training faster (it is more efficient to process n examples per backpropagation run than one at a time).

This function trains the network for one epoch. To do so, for each batch of the current epoch, it predicts the tags of the text, computes the loss and then runs backpropagation to update the weights of the network.

Note: in some comments the shape of the tensors appears in square brackets.

In [None]:
def train(model, iterator, optimizer, criterion):

    epoch_loss = 0
    epoch_precision = 0
    epoch_recall = 0
    epoch_f1 = 0

    model.train()

    # For each batch of the epoch's iterator:
    for batch in iterator:

        # Extract the text and the tags of the batch being processed
        text = batch.text
        tags = batch.nertags

        # Reset the gradients computed in the previous iteration
        optimizer.zero_grad()

        #text = [sent len, batch size]

        # Predict the tags of the batch text.
        predictions = model(text)

        #predictions = [sent len, batch size, output dim]
        #tags = [sent len, batch size]

        # Reshape the data to compute the loss
        predictions = predictions.view(-1, predictions.shape[-1])
        tags = tags.view(-1)

        #predictions = [sent len * batch size, output dim]
        #tags = [sent len * batch size]

        # Compute the cross entropy of the predictions with respect to the true labels
        loss = criterion(predictions, tags)
        
        # Compute precision, recall and f1
        precision, recall, f1 = calculate_metrics(predictions, tags)

        # Compute the gradients
        loss.backward()

        # Update the network parameters
        optimizer.step()

        # Accumulate the loss and the metrics
        epoch_loss += loss.item()
        epoch_precision += precision
        epoch_recall += recall
        epoch_f1 += f1

    return epoch_loss / len(iterator), epoch_precision / len(
        iterator), epoch_recall / len(iterator), epoch_f1 / len(iterator)

#### Defining the evaluation function

It evaluates the current performance of the network on the validation data. 

For each batch of this data, it computes the loss and the metrics associated with the validation set. 
Since the metrics are computed per batch, they are returned averaged over the number of batches (see the return line).

In [None]:
def evaluate(model, iterator, criterion):

    epoch_loss = 0
    epoch_precision = 0
    epoch_recall = 0
    epoch_f1 = 0

    model.eval()

    # Disable gradient tracking during evaluation
    with torch.no_grad():
        # For each batch
        for batch in iterator:

            text = batch.text
            tags = batch.nertags

            # Predict
            predictions = model(text)

            predictions = predictions.view(-1, predictions.shape[-1])
            tags = tags.view(-1)

            # Compute the cross entropy of the predictions with respect to the true labels
            loss = criterion(predictions, tags)

            # Compute the metrics
            precision, recall, f1 = calculate_metrics(predictions, tags)

            # Accumulate the loss and the metrics
            epoch_loss += loss.item()
            epoch_precision += precision
            epoch_recall += recall
            epoch_f1 += f1

    return epoch_loss / len(iterator), epoch_precision / len(
        iterator), epoch_recall / len(iterator), epoch_f1 / len(iterator)

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs


#### Training the network

In this code cell we run the training of the network.
To do so, we first define the number of epochs and then, for each epoch, run `train` and `evaluate`.

**Important: Reset the model weights**

If you run this cell again, the same model keeps being trained from where it left off. 
To reset the model you must re-run the cell that contains the `init_weights` function.



In [None]:
%%script false
best_valid_loss = float('inf')

for epoch in range(n_epochs):

    start_time = time.time()

    # Reminder: train_iterator and valid_iterator contain the dataset split into batches.

    # Train
    train_loss, train_precision, train_recall, train_f1 = train(
        model, train_iterator, optimizer, criterion)

    # Evaluate (valid = validation)
    valid_loss, valid_precision, valid_recall, valid_f1 = evaluate(
        model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    # If we obtained better results, save this model to storage (so it can be loaded later)
    # If you stop training prematurely, you can load the model in the next code cell.
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), '{}.pt'.format(model_name))
    # Note: this loop does not stop early; it always runs for n_epochs.

    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(
        f'\tTrain Loss: {train_loss:.3f} | Train f1: {train_f1:.2f} | Train precision: {train_precision:.2f} | Train recall: {train_recall:.2f}'
    )
    print(
        f'\t Val. Loss: {valid_loss:.3f} |  Val. f1: {valid_f1:.2f} |  Val. precision: {valid_precision:.2f} | Val. recall: {valid_recall:.2f}'
    )

**Important**: Remember that the last trained model is not necessarily the best one (it is probably *overfitted*); the best one is the checkpoint saved at the lowest validation loss.
To load the best trained model, run the following cell.

*Early stopping* avoids spending further epochs once the model starts to overfit.

In [None]:
%%script false
# load the best trained model.
model.load_state_dict(torch.load('{}.pt'.format(model_name)))

In [None]:
# Free the cached cuda memory
torch.cuda.empty_cache()

#### Evaluating the validation set with the final model

These are the results of predicting the validation dataset with the *best* trained model.

In [None]:
%%script false
valid_loss, valid_precision, valid_recall, valid_f1 = evaluate(
    model, valid_iterator, criterion)

print(
    f'Val. Loss: {valid_loss:.3f} |  Val. f1: {valid_f1:.2f} | Val. precision: {valid_precision:.2f} | Val. recall: {valid_recall:.2f}'
)

### Modifications to the baseline

This section develops the different experiments on recurrent architectures, building on the code already provided.

First, we propose varying the number of epochs and implementing *Early Stopping*, to make the most of the information in the data and of the fact that the tools for recovering the `state` of the model with the best validation performance during training are already available.

The idea is to increase the number of epochs so that the models do not stop short, while using *Early Stopping* to avoid training the network once it is *overfitting*. This skips the remaining epochs, in which we do not expect to find a model better than those already found.
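
The stopping rule can be illustrated on a list of simulated validation losses, independently of the network. The sketch below is my reading of the two criteria that `optimize_model` exposes through `useBest`; the function `early_stopping_epoch` and the loss values are hypothetical and only meant to show how a patience counter behaves.

```python
def early_stopping_epoch(valid_losses, patience, use_best=True):
    """Return the 1-based epoch at which training stops, or None if it never stops early.

    use_best=True:  count epochs that fail to improve on the best loss so far.
    use_best=False: count consecutive epochs that fail to improve on the previous epoch.
    """
    best = float("inf")
    previous = float("inf")
    counter = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        reference = best if use_best else previous
        if loss < reference:
            counter = 0          # improvement: reset the patience counter
        else:
            counter += 1         # no improvement: one step closer to stopping
        best = min(best, loss)
        previous = loss
        if counter >= patience:
            return epoch
    return None


simulated_losses = [0.90, 0.70, 0.75, 0.72, 0.74, 0.73, 0.76]

stop_with_best = early_stopping_epoch(simulated_losses, patience=3, use_best=True)
stop_with_previous = early_stopping_epoch(simulated_losses, patience=3, use_best=False)
print(stop_with_best, stop_with_previous)
```

With these losses, the criterion based on the best loss stops at epoch 5, whereas the criterion based on the previous epoch never triggers because the loss keeps oscillating. The second criterion is therefore the more permissive of the two.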

#### Number of epochs and *Early Stopping*

In [None]:
def optimize_model(model, train_iterator, valid_iterator, optimizer,
                   criterion, nEpochs=100, stopTolerance=10, useBest=True):
    '''Train with Early Stopping and reload the best checkpoint found.

    stopTolerance: number of epochs without improvement before training is
    stopped by Early Stopping.
    useBest: Early Stopping criterion. If True, training stops after
    <stopTolerance> epochs without improving on the best validation loss so
    far. If False, it stops after <stopTolerance> consecutive epochs in which
    the validation loss did not decrease with respect to the previous epoch.

    Note: the checkpoint is saved to '<model_name>.pt', where model_name is a
    global variable that must be set before calling this function.
    '''

    best_valid_loss = float('inf')
    prev_valid_loss = float('inf')
    counter = 0

    for epoch in range(nEpochs):

        start_time = time.time()

        # Reminder: train_iterator and valid_iterator hold the dataset split into batches.

        # Train
        train_loss, train_precision, train_recall, train_f1 = train(
            model, train_iterator, optimizer, criterion)

        # Evaluate (valid = validation)
        valid_loss, valid_precision, valid_recall, valid_f1 = evaluate(
            model, valid_iterator, criterion)

        end_time = time.time()

        epoch_mins, epoch_secs = epoch_time(start_time, end_time)

        print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
        print(
            f'\tTrain Loss: {train_loss:.3f} | Train f1: {train_f1:.2f} | Train precision: {train_precision:.2f} | Train recall: {train_recall:.2f}'
        )
        print(
            f'\t Val. Loss: {valid_loss:.3f} |  Val. f1: {valid_f1:.2f} |  Val. precision: {valid_precision:.2f} | Val. recall: {valid_recall:.2f}'
        )

        # If we obtained a better result, save this model to disk (so it can be loaded later).
        # If training is interrupted early, the saved model can still be loaded afterwards.
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), '{}.pt'.format(model_name))
            counter = 0

        # Otherwise, count this epoch as one without improvement.
        elif useBest or valid_loss >= prev_valid_loss:
            counter += 1

        # useBest is False and the loss decreased with respect to the previous
        # epoch: the streak of consecutive epochs without improvement is broken.
        else:
            counter = 0

        prev_valid_loss = valid_loss

        # Stop training once the tolerance is exhausted.
        if counter >= stopTolerance:
            break

    # Reload the best checkpoint and report its validation performance.
    model.load_state_dict(torch.load('{}.pt'.format(model_name)))
    torch.cuda.empty_cache()

    valid_loss, valid_precision, valid_recall, valid_f1 = evaluate(
        model, valid_iterator, criterion)

    print('\nPerformance of best found Model:')
    print(
        f'Val. Loss: {valid_loss:.3f} |  Val. f1: {valid_f1:.2f} | Val. precision: {valid_precision:.2f} | Val. recall: {valid_recall:.2f}'
    )
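
The two stopping criteria described in the docstring of `optimize_model` can be compared in isolation by replaying a list of validation losses instead of training a model. The following is a minimal sketch under that assumption; the helper `early_stopping_epoch` and the example losses are hypothetical and are not part of the notebook's pipeline.

```python
def early_stopping_epoch(valid_losses, stop_tolerance=10, use_best=True):
    """Return the 1-based epoch at which training would stop, or None.

    Hypothetical helper for illustration: it replays recorded validation
    losses and applies the same counting rule as the Early Stopping loop.
    """
    best = float('inf')
    prev = float('inf')
    counter = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:
            # New best loss: reset the counter under both criteria.
            best = loss
            counter = 0
        elif use_best or loss >= prev:
            # No improvement (over the best, or over the previous epoch).
            counter += 1
        else:
            # use_best is False and the loss went down: streak broken.
            counter = 0
        prev = loss
        if counter >= stop_tolerance:
            return epoch
    return None


# Made-up losses that oscillate without beating the best value (0.25).
losses = [0.30, 0.25, 0.28, 0.27, 0.29, 0.26, 0.31]
print(early_stopping_epoch(losses, stop_tolerance=3, use_best=True))
print(early_stopping_epoch(losses, stop_tolerance=3, use_best=False))
```

With these made-up losses, the `use_best=True` criterion stops early because the best loss is never beaten again, whereas the consecutive criterion keeps training because the loss keeps decreasing every other epoch. This suggests `use_best=True` is the stricter of the two when the validation loss oscillates.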

In [None]:
%%script false
model.apply(init_weights)
optimizer = optim.Adam(model.parameters())
optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

#### Tipos de RNN

To introduce alternatives to LSTM in the experiments, we define classes equivalent to the previous one but built on the Elman RNN and GRU architectures, taking advantage of the fact that the native ```PyTorch``` implementations take (almost) the same constructor parameters.
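
For reference, these are the recurrences computed by one layer of each architecture, written as in the PyTorch documentation for `nn.RNN` and `nn.GRU`. Here $x_t$ is the input at step $t$, $h_t$ the hidden state, $\sigma$ the sigmoid function and $\odot$ the element-wise product; the notation is only a summary, not something defined elsewhere in this notebook.

Elman RNN (with the default `tanh` non-linearity, which can be replaced by ReLU):

$$h_t = \tanh(W_{ih} x_t + b_{ih} + W_{hh} h_{t-1} + b_{hh})$$

GRU:

$$
\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t &= \tanh(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
$$

The GRU therefore holds three sets of weights of the same shape as the single set of the Elman RNN, which is why the two classes can share the same constructor arguments while differing in parameter count.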

In [None]:
class NER_ELMAN(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim,
                 n_layers, 
                 bidirectional, 
                 dropout, 
                 pad_idx,
                 nonlinearity = 'tanh'):

        super().__init__()

        # Embedding layer
        self.embedding = nn.Embedding(input_dim,
                                      embedding_dim,
                                      padding_idx=pad_idx)

        # Elman RNN layer
        self.rnn = nn.RNN(embedding_dim,
                           hidden_dim,
                           num_layers=n_layers,
                           bidirectional=bidirectional, 
                           dropout = dropout if n_layers > 1 else 0,
                           nonlinearity = nonlinearity)

        # Output layer
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim,
                            output_dim)

        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):

        #text = [sent len, batch size]

        # Convert the input token indices to embeddings
        embedded = self.dropout(self.embedding(text))
        #embedded = [sent len, batch size, emb dim]

        # Pass the embeddings through the Elman RNN
        outputs, hidden = self.rnn(embedded)
        #outputs = [sent len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]

        # Predict using the output layer.
        predictions = self.fc(self.dropout(outputs))
        #predictions = [sent len, batch size, output dim]

        return predictions

class NER_GRU(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim,
                 n_layers, 
                 bidirectional, 
                 dropout, 
                 pad_idx):

        super().__init__()

        # Embedding layer
        self.embedding = nn.Embedding(input_dim,
                                      embedding_dim,
                                      padding_idx=pad_idx)

        # GRU layer
        self.gru = nn.GRU(embedding_dim,
                           hidden_dim,
                           num_layers=n_layers,
                           bidirectional=bidirectional, 
                           dropout = dropout if n_layers > 1 else 0)

        # Output layer
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim,
                            output_dim)

        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):

        #text = [sent len, batch size]

        # Convert the input token indices to embeddings
        embedded = self.dropout(self.embedding(text))
        #embedded = [sent len, batch size, emb dim]

        # Pass the embeddings through the GRU
        outputs, hidden = self.gru(embedded)
        #outputs = [sent len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]

        # Predict using the output layer.
        predictions = self.fc(self.dropout(outputs))
        #predictions = [sent len, batch size, output dim]

        return predictions

In [None]:
## Experiment Parameters ##

# Vocabulary size. Recall that each input token is a one-hot vector over the vocabulary (passed as an index).
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100  # embedding dimension.
HIDDEN_DIM = 128  # dimension of the recurrent layers
OUTPUT_DIM = len(NER_TAGS.vocab)  # number of classes

N_LAYERS = 2  # number of layers.
DROPOUT = 0.25
BIDIRECTIONAL = False

TAG_PAD_IDX = NER_TAGS.vocab.stoi[NER_TAGS.pad_token]
baseline_criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)

# Create our models.
elman_model = NER_ELMAN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

elman_model_name = 'Elman1'  # name of the saved model file

gru_model = NER_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

gru_model_name = 'GRU1'  # name of the saved model file

In [None]:
model = elman_model
model_name = elman_model_name  # save under the Elman name so the baseline checkpoint is not overwritten
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 2,673,854 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 13s
	Train Loss: 0.420 | Train f1: 0.20 | Train precision: 0.28 | Train recall: 0.19
	 Val. Loss: 0.286 |  Val. f1: 0.36 |  Val. precision: 0.46 | Val. recall: 0.35
Epoch: 02 | Epoch Time: 0m 13s
	Train Loss: 0.174 | Train f1: 0.51 | Train precision: 0.58 | Train recall: 0.50
	 Val. Loss: 0.240 |  Val. f1: 0.51 |  Val. precision: 0.62 | Val. recall: 0.49
Epoch: 03 | Epoch Time: 0m 16s
	Train Loss: 0.104 | Train f1: 0.66 | Train precision: 0.71 | Train recall: 0.66
	 Val. Loss: 0.265 |  Val. f1: 0.55 |  Val. precision: 0.65 | Val. recall: 0.52
Epoch: 04 | Epoch Time: 0m 17s
	Train Loss: 0.071 | Train f1: 0.74 | Train precision: 0.78 | Train recall: 0.74
	 Val. Loss: 0.239 |  Val. f1: 0.58 |  Val. precision: 0.65 | Val. recall: 0.56
Epoch: 05 | Epoch Time: 0m 16s
	Train Loss: 0.053 | Train f1: 0.79 | Train precision: 0.82 | Train recall: 0.79
	 Val. Loss: 0.280 |  Val. f1: 0.56 |  Val. precision: 0.64 | V

In [None]:
model = gru_model
model_name = gru_model_name  # save under the GRU name so the baseline checkpoint is not overwritten
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 2,798,782 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 18s
	Train Loss: 0.374 | Train f1: 0.25 | Train precision: 0.33 | Train recall: 0.24
	 Val. Loss: 0.259 |  Val. f1: 0.46 |  Val. precision: 0.56 | Val. recall: 0.43
Epoch: 02 | Epoch Time: 0m 16s
	Train Loss: 0.141 | Train f1: 0.60 | Train precision: 0.66 | Train recall: 0.59
	 Val. Loss: 0.216 |  Val. f1: 0.56 |  Val. precision: 0.65 | Val. recall: 0.54
Epoch: 03 | Epoch Time: 0m 15s
	Train Loss: 0.081 | Train f1: 0.72 | Train precision: 0.76 | Train recall: 0.72
	 Val. Loss: 0.199 |  Val. f1: 0.60 |  Val. precision: 0.66 | Val. recall: 0.59
Epoch: 04 | Epoch Time: 0m 15s
	Train Loss: 0.057 | Train f1: 0.78 | Train precision: 0.81 | Train recall: 0.78
	 Val. Loss: 0.214 |  Val. f1: 0.59 |  Val. precision: 0.66 | Val. recall: 0.59
Epoch: 05 | Epoch Time: 0m 15s
	Train Loss: 0.041 | Train f1: 0.83 | Train precision: 0.85 | Train recall: 0.83
	 Val. Loss: 0.244 |  Val. f1: 0.59 |  Val. precision: 0.67 | V

PyTorch's Elman RNN allows the activation function to be changed to ReLU, so we test the same model as above with this new non-linearity.

In [None]:
elman_model = NER_ELMAN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX, 'relu')

elman_model_name = 'Elman2'  # name of the saved model file

model = elman_model
model_name = elman_model_name  # save under the Elman name so the baseline checkpoint is not overwritten
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 2,673,854 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 16s
	Train Loss: 0.426 | Train f1: 0.15 | Train precision: 0.20 | Train recall: 0.14
	 Val. Loss: 0.290 |  Val. f1: 0.33 |  Val. precision: 0.38 | Val. recall: 0.34
Epoch: 02 | Epoch Time: 0m 16s
	Train Loss: 0.187 | Train f1: 0.42 | Train precision: 0.47 | Train recall: 0.43
	 Val. Loss: 0.246 |  Val. f1: 0.46 |  Val. precision: 0.56 | Val. recall: 0.46
Epoch: 03 | Epoch Time: 0m 16s
	Train Loss: 0.119 | Train f1: 0.60 | Train precision: 0.65 | Train recall: 0.60
	 Val. Loss: 0.215 |  Val. f1: 0.53 |  Val. precision: 0.61 | Val. recall: 0.52
Epoch: 04 | Epoch Time: 0m 16s
	Train Loss: 0.085 | Train f1: 0.70 | Train precision: 0.73 | Train recall: 0.70
	 Val. Loss: 0.209 |  Val. f1: 0.57 |  Val. precision: 0.64 | Val. recall: 0.56
Epoch: 05 | Epoch Time: 0m 16s
	Train Loss: 0.065 | Train f1: 0.76 | Train precision: 0.79 | Train recall: 0.76
	 Val. Loss: 0.215 |  Val. f1: 0.58 |  Val. precision: 0.65 | V

#### Bidirectionality

Next, we add bidirectionality to the previous models to study how it affects each one. For the Elman RNN we consider only the architecture with ```ReLU``` as the activation function, since it already showed better *performance* than the default in the previous experiment.

In [None]:
## Experiment Parameters ##

# Vocabulary size. Recall that each input token is a one-hot vector over the vocabulary (passed as an index).
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100  # embedding dimension.
HIDDEN_DIM = 128  # dimension of the recurrent layers
OUTPUT_DIM = len(NER_TAGS.vocab)  # number of classes

N_LAYERS = 2  # number of layers.
DROPOUT = 0.25
BIDIRECTIONAL = True

TAG_PAD_IDX = NER_TAGS.vocab.stoi[NER_TAGS.pad_token]
baseline_criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)

# Create our models.
lstm_model = NER_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

lstm_model_name = 'LSTM'  # name of the saved model file

elman_model = NER_ELMAN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX, 'relu')

elman_model_name = 'Elman'  # name of the saved model file

gru_model = NER_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

gru_model_name = 'GRU1'  # name of the saved model file

In [None]:
model = lstm_model
model_name = lstm_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 3,243,454 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 29s
	Train Loss: 0.357 | Train f1: 0.26 | Train precision: 0.33 | Train recall: 0.25
	 Val. Loss: 0.243 |  Val. f1: 0.47 |  Val. precision: 0.58 | Val. recall: 0.46
Epoch: 02 | Epoch Time: 0m 25s
	Train Loss: 0.123 | Train f1: 0.61 | Train precision: 0.66 | Train recall: 0.61
	 Val. Loss: 0.190 |  Val. f1: 0.58 |  Val. precision: 0.64 | Val. recall: 0.58
Epoch: 03 | Epoch Time: 0m 24s
	Train Loss: 0.066 | Train f1: 0.75 | Train precision: 0.78 | Train recall: 0.75
	 Val. Loss: 0.194 |  Val. f1: 0.58 |  Val. precision: 0.65 | Val. recall: 0.58
Epoch: 04 | Epoch Time: 0m 24s
	Train Loss: 0.042 | Train f1: 0.82 | Train precision: 0.84 | Train recall: 0.82
	 Val. Loss: 0.186 |  Val. f1: 0.62 |  Val. precision: 0.66 | Val. recall: 0.63
Epoch: 05 | Epoch Time: 0m 24s
	Train Loss: 0.029 | Train f1: 0.87 | Train precision: 0.89 | Train recall: 0.87
	 Val. Loss: 0.216 |  Val. f1: 0.61 |  Val. precision: 0.67 | V

In [None]:
model = elman_model
model_name = elman_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 2,770,366 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 8s
	Train Loss: 0.375 | Train f1: 0.23 | Train precision: 0.30 | Train recall: 0.21
	 Val. Loss: 0.255 |  Val. f1: 0.43 |  Val. precision: 0.50 | Val. recall: 0.43
Epoch: 02 | Epoch Time: 0m 8s
	Train Loss: 0.150 | Train f1: 0.54 | Train precision: 0.60 | Train recall: 0.53
	 Val. Loss: 0.203 |  Val. f1: 0.54 |  Val. precision: 0.61 | Val. recall: 0.55
Epoch: 03 | Epoch Time: 0m 8s
	Train Loss: 0.088 | Train f1: 0.69 | Train precision: 0.73 | Train recall: 0.69
	 Val. Loss: 0.191 |  Val. f1: 0.59 |  Val. precision: 0.65 | Val. recall: 0.60
Epoch: 04 | Epoch Time: 0m 8s
	Train Loss: 0.059 | Train f1: 0.77 | Train precision: 0.80 | Train recall: 0.77
	 Val. Loss: 0.180 |  Val. f1: 0.61 |  Val. precision: 0.65 | Val. recall: 0.62
Epoch: 05 | Epoch Time: 0m 8s
	Train Loss: 0.042 | Train f1: 0.83 | Train precision: 0.85 | Train recall: 0.83
	 Val. Loss: 0.199 |  Val. f1: 0.62 |  Val. precision: 0.67 | Val. r

  avg = a.mean(axis)
  ret = ret.dtype.type(ret / rcount)


Epoch: 09 | Epoch Time: 0m 8s
	Train Loss: 0.017 | Train f1: nan | Train precision: nan | Train recall: nan
	 Val. Loss: 0.234 |  Val. f1: 0.63 |  Val. precision: 0.67 | Val. recall: 0.64
Epoch: 10 | Epoch Time: 0m 8s
	Train Loss: 0.015 | Train f1: 0.93 | Train precision: 0.93 | Train recall: 0.93
	 Val. Loss: 0.257 |  Val. f1: 0.61 |  Val. precision: 0.67 | Val. recall: 0.62
Epoch: 11 | Epoch Time: 0m 8s
	Train Loss: 0.013 | Train f1: 0.94 | Train precision: 0.94 | Train recall: 0.94
	 Val. Loss: 0.252 |  Val. f1: 0.61 |  Val. precision: 0.66 | Val. recall: 0.62
Epoch: 12 | Epoch Time: 0m 8s
	Train Loss: 0.012 | Train f1: 0.94 | Train precision: 0.94 | Train recall: 0.94
	 Val. Loss: 0.273 |  Val. f1: 0.61 |  Val. precision: 0.66 | Val. recall: 0.61
Epoch: 13 | Epoch Time: 0m 8s
	Train Loss: 0.012 | Train f1: 0.95 | Train precision: 0.95 | Train recall: 0.95
	 Val. Loss: 0.259 |  Val. f1: 0.61 |  Val. precision: 0.65 | Val. recall: 0.63
Epoch: 14 | Epoch Time: 0m 8s
	Train Loss: 0.010

In [None]:
model = gru_model
model_name = gru_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 3,085,758 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 25s
	Train Loss: 0.325 | Train f1: 0.33 | Train precision: 0.42 | Train recall: 0.30
	 Val. Loss: 0.225 |  Val. f1: 0.51 |  Val. precision: 0.60 | Val. recall: 0.49
Epoch: 02 | Epoch Time: 0m 26s
	Train Loss: 0.111 | Train f1: 0.67 | Train precision: 0.72 | Train recall: 0.66
	 Val. Loss: 0.186 |  Val. f1: 0.58 |  Val. precision: 0.65 | Val. recall: 0.56
Epoch: 03 | Epoch Time: 0m 25s
	Train Loss: 0.058 | Train f1: 0.78 | Train precision: 0.81 | Train recall: 0.78
	 Val. Loss: 0.180 |  Val. f1: 0.62 |  Val. precision: 0.68 | Val. recall: 0.61
Epoch: 04 | Epoch Time: 0m 24s
	Train Loss: 0.036 | Train f1: 0.85 | Train precision: 0.87 | Train recall: 0.85
	 Val. Loss: 0.202 |  Val. f1: 0.62 |  Val. precision: 0.69 | Val. recall: 0.61
Epoch: 05 | Epoch Time: 0m 24s
	Train Loss: 0.025 | Train f1: 0.88 | Train precision: 0.89 | Train recall: 0.88
	 Val. Loss: 0.218 |  Val. f1: 0.62 |  Val. precision: 0.68 | V

#### Size of the Recurrent Networks

Having observed that the models studied improve consistently when bidirectionality is included, we now vary the size (```HIDDEN_DIM```) of the hidden layers of each model's recurrent network.
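
To get a sense of how much each choice of ```HIDDEN_DIM``` changes the size of a model, the number of trainable parameters can be computed by hand. The sketch below is plain Python and does not use the notebook's classes; the function names are hypothetical. It also assumes a vocabulary of 26,101 tokens and 10 tags: these two values were inferred because they reproduce the parameter counts printed in this notebook, and they were not read from `TEXT.vocab` or `NER_TAGS.vocab`.

```python
def rnn_param_count(kind, input_size, hidden_size, n_layers, bidirectional):
    """Parameters of a stacked recurrent module with two bias vectors per gate.

    kind: 'rnn' (Elman), 'gru' or 'lstm'; they use 1, 3 and 4 weight sets.
    """
    gates = {'rnn': 1, 'gru': 3, 'lstm': 4}[kind]
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # Layers after the first receive the concatenated directions.
        layer_input = input_size if layer == 0 else hidden_size * directions
        per_direction = gates * (hidden_size * layer_input      # input weights
                                 + hidden_size * hidden_size    # recurrent weights
                                 + 2 * hidden_size)             # two bias vectors
        total += directions * per_direction
    return total


def tagger_param_count(kind, vocab_size, embedding_dim, hidden_dim,
                       output_dim, n_layers, bidirectional):
    """Embedding + recurrent module + linear output layer."""
    directions = 2 if bidirectional else 1
    embedding = vocab_size * embedding_dim
    recurrent = rnn_param_count(kind, embedding_dim, hidden_dim,
                                n_layers, bidirectional)
    fc = hidden_dim * directions * output_dim + output_dim
    return embedding + recurrent + fc


# Assumed values, inferred from the printed parameter counts (see above).
VOCAB_SIZE = 26101
N_TAGS = 10

for hidden_dim in (64, 128, 256, 512):
    for kind in ('lstm', 'rnn', 'gru'):
        n = tagger_param_count(kind, VOCAB_SIZE, 100, hidden_dim, N_TAGS,
                               n_layers=2, bidirectional=True)
        print(f'{kind:>4} | HIDDEN_DIM={hidden_dim:<3} | {n:,}')
```

Under these assumptions the embedding layer alone accounts for 2,610,100 parameters, so for small hidden sizes the three architectures have similar total sizes, while the recurrent part grows roughly quadratically with ```HIDDEN_DIM``` and dominates at 512.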

In [None]:
## Experiment Parameters ##

# Vocabulary size. Recall that each input token is a one-hot vector over the vocabulary (passed as an index).
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100  # embedding dimension.
HIDDEN_DIM = 64  # dimension of the recurrent layers
OUTPUT_DIM = len(NER_TAGS.vocab)  # number of classes

N_LAYERS = 2  # number of layers.
DROPOUT = 0.25
BIDIRECTIONAL = True

TAG_PAD_IDX = NER_TAGS.vocab.stoi[NER_TAGS.pad_token]
baseline_criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)

# Create our models.
lstm_model = NER_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

lstm_model_name = 'LSTM'  # name of the saved model file

elman_model = NER_ELMAN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX, 'relu')

elman_model_name = 'Elman'  # name of the saved model file

gru_model = NER_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

gru_model_name = 'GRU'  # name of the saved model file

In [None]:
model = lstm_model
model_name = lstm_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 2,795,710 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 8s
	Train Loss: 0.420 | Train f1: 0.17 | Train precision: 0.20 | Train recall: 0.17
	 Val. Loss: 0.273 |  Val. f1: 0.35 |  Val. precision: 0.41 | Val. recall: 0.35
Epoch: 02 | Epoch Time: 0m 8s
	Train Loss: 0.155 | Train f1: 0.52 | Train precision: 0.57 | Train recall: 0.52
	 Val. Loss: 0.231 |  Val. f1: 0.52 |  Val. precision: 0.60 | Val. recall: 0.51
Epoch: 03 | Epoch Time: 0m 8s
	Train Loss: 0.085 | Train f1: 0.71 | Train precision: 0.75 | Train recall: 0.71
	 Val. Loss: 0.208 |  Val. f1: 0.57 |  Val. precision: 0.64 | Val. recall: 0.57
Epoch: 04 | Epoch Time: 0m 8s
	Train Loss: 0.055 | Train f1: 0.79 | Train precision: 0.81 | Train recall: 0.79
	 Val. Loss: 0.203 |  Val. f1: 0.60 |  Val. precision: 0.66 | Val. recall: 0.59
Epoch: 05 | Epoch Time: 0m 8s
	Train Loss: 0.038 | Train f1: 0.85 | Train precision: 0.86 | Train recall: 0.85
	 Val. Loss: 0.226 |  Val. f1: 0.60 |  Val. precision: 0.66 | Val. r

In [None]:
model = elman_model
model_name = elman_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 2,657,470 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 8s
	Train Loss: 0.375 | Train f1: 0.21 | Train precision: 0.26 | Train recall: 0.20
	 Val. Loss: 0.248 |  Val. f1: 0.37 |  Val. precision: 0.42 | Val. recall: 0.39
Epoch: 02 | Epoch Time: 0m 8s
	Train Loss: 0.150 | Train f1: 0.51 | Train precision: 0.57 | Train recall: 0.52
	 Val. Loss: 0.198 |  Val. f1: 0.53 |  Val. precision: 0.60 | Val. recall: 0.53
Epoch: 03 | Epoch Time: 0m 8s
	Train Loss: 0.090 | Train f1: 0.68 | Train precision: 0.71 | Train recall: 0.68
	 Val. Loss: 0.208 |  Val. f1: 0.58 |  Val. precision: 0.63 | Val. recall: 0.59
Epoch: 04 | Epoch Time: 0m 8s
	Train Loss: 0.063 | Train f1: 0.76 | Train precision: 0.78 | Train recall: 0.76
	 Val. Loss: 0.218 |  Val. f1: 0.58 |  Val. precision: 0.64 | Val. recall: 0.58
Epoch: 05 | Epoch Time: 0m 8s
	Train Loss: 0.046 | Train f1: 0.81 | Train precision: 0.83 | Train recall: 0.81
	 Val. Loss: 0.203 |  Val. f1: 0.60 |  Val. precision: 0.65 | Val. r

In [None]:
model = gru_model
model_name = gru_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 2,749,630 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 8s
	Train Loss: 0.359 | Train f1: 0.26 | Train precision: 0.32 | Train recall: 0.25
	 Val. Loss: 0.247 |  Val. f1: 0.44 |  Val. precision: 0.52 | Val. recall: 0.43
Epoch: 02 | Epoch Time: 0m 8s
	Train Loss: 0.125 | Train f1: 0.61 | Train precision: 0.67 | Train recall: 0.61
	 Val. Loss: 0.188 |  Val. f1: 0.57 |  Val. precision: 0.63 | Val. recall: 0.57
Epoch: 03 | Epoch Time: 0m 8s
	Train Loss: 0.068 | Train f1: 0.75 | Train precision: 0.78 | Train recall: 0.75
	 Val. Loss: 0.202 |  Val. f1: 0.59 |  Val. precision: 0.66 | Val. recall: 0.58
Epoch: 04 | Epoch Time: 0m 8s
	Train Loss: 0.042 | Train f1: 0.83 | Train precision: 0.85 | Train recall: 0.83
	 Val. Loss: 0.210 |  Val. f1: 0.60 |  Val. precision: 0.67 | Val. recall: 0.60
Epoch: 05 | Epoch Time: 0m 8s
	Train Loss: 0.030 | Train f1: 0.87 | Train precision: 0.89 | Train recall: 0.87
	 Val. Loss: 0.212 |  Val. f1: 0.61 |  Val. precision: 0.67 | Val. r

In [None]:
## Experiment Parameters ##

# Vocabulary size. Recall that each input token is a one-hot vector over the vocabulary (passed as an index).
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100  # embedding dimension.
HIDDEN_DIM = 256  # dimension of the recurrent layers
OUTPUT_DIM = len(NER_TAGS.vocab)  # number of classes

N_LAYERS = 2  # number of layers.
DROPOUT = 0.25
BIDIRECTIONAL = True

TAG_PAD_IDX = NER_TAGS.vocab.stoi[NER_TAGS.pad_token]
baseline_criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)

# Create our models.
lstm_model = NER_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

lstm_model_name = 'LSTM'  # name of the saved model file

elman_model = NER_ELMAN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX, 'relu')

elman_model_name = 'Elman'  # name of the saved model file

gru_model = NER_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

gru_model_name = 'GRU'  # name of the saved model file

In [None]:
model = lstm_model
model_name = lstm_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 4,925,374 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 11s
	Train Loss: 0.341 | Train f1: 0.29 | Train precision: 0.37 | Train recall: 0.27
	 Val. Loss: 0.246 |  Val. f1: 0.46 |  Val. precision: 0.56 | Val. recall: 0.45
Epoch: 02 | Epoch Time: 0m 11s
	Train Loss: 0.118 | Train f1: 0.63 | Train precision: 0.68 | Train recall: 0.62
	 Val. Loss: 0.205 |  Val. f1: 0.54 |  Val. precision: 0.62 | Val. recall: 0.54
Epoch: 03 | Epoch Time: 0m 11s
	Train Loss: 0.063 | Train f1: 0.76 | Train precision: 0.79 | Train recall: 0.76
	 Val. Loss: 0.194 |  Val. f1: 0.59 |  Val. precision: 0.66 | Val. recall: 0.58
Epoch: 04 | Epoch Time: 0m 11s
	Train Loss: 0.041 | Train f1: 0.83 | Train precision: 0.85 | Train recall: 0.83
	 Val. Loss: 0.182 |  Val. f1: 0.62 |  Val. precision: 0.67 | Val. recall: 0.63
Epoch: 05 | Epoch Time: 0m 11s
	Train Loss: 0.028 | Train f1: 0.87 | Train precision: 0.89 | Train recall: 0.87
	 Val. Loss: 0.210 |  Val. f1: 0.61 |  Val. precision: 0.67 | V

In [None]:
model = elman_model
model_name = elman_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 3,192,766 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 8s
	Train Loss: 294306.830 | Train f1: 0.02 | Train precision: 0.06 | Train recall: 0.01
	 Val. Loss: 0.928 |  Val. f1: 0.00 |  Val. precision: 0.00 | Val. recall: 0.00
Epoch: 02 | Epoch Time: 0m 8s
	Train Loss: 0.877 | Train f1: 0.01 | Train precision: 0.02 | Train recall: 0.00
	 Val. Loss: 0.774 |  Val. f1: 0.00 |  Val. precision: 0.00 | Val. recall: 0.00
Epoch: 03 | Epoch Time: 0m 8s
	Train Loss: 0.758 | Train f1: 0.00 | Train precision: 0.01 | Train recall: 0.00
	 Val. Loss: 0.766 |  Val. f1: 0.00 |  Val. precision: 0.00 | Val. recall: 0.00
Epoch: 04 | Epoch Time: 0m 8s
	Train Loss: 0.705 | Train f1: 0.00 | Train precision: 0.00 | Train recall: 0.00
	 Val. Loss: 0.775 |  Val. f1: 0.00 |  Val. precision: 0.00 | Val. recall: 0.00
Epoch: 05 | Epoch Time: 0m 8s
	Train Loss: 0.700 | Train f1: 0.00 | Train precision: 0.00 | Train recall: 0.00
	 Val. Loss: 0.754 |  Val. f1: 0.00 |  Val. precision: 0.00 | V

  avg = a.mean(axis)
  ret = ret.dtype.type(ret / rcount)


Epoch: 25 | Epoch Time: 0m 9s
	Train Loss: 0.199 | Train f1: nan | Train precision: nan | Train recall: nan
	 Val. Loss: 0.432 |  Val. f1: 0.36 |  Val. precision: 0.47 | Val. recall: 0.34
Epoch: 26 | Epoch Time: 0m 9s
	Train Loss: 0.180 | Train f1: 0.45 | Train precision: 0.51 | Train recall: 0.45
	 Val. Loss: 0.424 |  Val. f1: 0.38 |  Val. precision: 0.48 | Val. recall: 0.36
Epoch: 27 | Epoch Time: 0m 9s
	Train Loss: 0.162 | Train f1: 0.50 | Train precision: 0.55 | Train recall: 0.49
	 Val. Loss: 0.472 |  Val. f1: 0.39 |  Val. precision: 0.51 | Val. recall: 0.37
Epoch: 28 | Epoch Time: 0m 8s
	Train Loss: 0.144 | Train f1: 0.54 | Train precision: 0.59 | Train recall: 0.54
	 Val. Loss: 0.439 |  Val. f1: 0.44 |  Val. precision: 0.56 | Val. recall: 0.42
Epoch: 29 | Epoch Time: 0m 8s
	Train Loss: 0.131 | Train f1: 0.58 | Train precision: 0.63 | Train recall: 0.58
	 Val. Loss: 0.435 |  Val. f1: 0.46 |  Val. precision: 0.58 | Val. recall: 0.42
Epoch: 30 | Epoch Time: 0m 8s
	Train Loss: 0.116

In [None]:
model = gru_model
model_name = gru_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)  

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 4,347,838 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 10s
	Train Loss: 0.343 | Train f1: 0.32 | Train precision: 0.42 | Train recall: 0.29
	 Val. Loss: 0.229 |  Val. f1: 0.49 |  Val. precision: 0.57 | Val. recall: 0.49
Epoch: 02 | Epoch Time: 0m 10s
	Train Loss: 0.116 | Train f1: 0.64 | Train precision: 0.70 | Train recall: 0.63
	 Val. Loss: 0.218 |  Val. f1: 0.56 |  Val. precision: 0.65 | Val. recall: 0.54
Epoch: 03 | Epoch Time: 0m 10s
	Train Loss: 0.058 | Train f1: 0.78 | Train precision: 0.81 | Train recall: 0.78
	 Val. Loss: 0.195 |  Val. f1: 0.61 |  Val. precision: 0.68 | Val. recall: 0.59
Epoch: 04 | Epoch Time: 0m 10s
	Train Loss: 0.034 | Train f1: 0.85 | Train precision: 0.87 | Train recall: 0.85
	 Val. Loss: 0.210 |  Val. f1: 0.63 |  Val. precision: 0.69 | Val. recall: 0.62
Epoch: 05 | Epoch Time: 0m 10s
	Train Loss: 0.025 | Train f1: 0.89 | Train precision: 0.90 | Train recall: 0.89
	 Val. Loss: 0.218 |  Val. f1: 0.63 |  Val. precision: 0.69 | V

In [None]:
## Experiment Parameters ##

# Vocabulary size. Recall that each input token is a one-hot vector over the vocabulary (passed as an index).
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100  # embedding dimension.
HIDDEN_DIM = 512  # dimension of the recurrent layers
OUTPUT_DIM = len(NER_TAGS.vocab)  # number of classes

N_LAYERS = 2  # number of layers.
DROPOUT = 0.25
BIDIRECTIONAL = True

TAG_PAD_IDX = NER_TAGS.vocab.stoi[NER_TAGS.pad_token]
baseline_criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)

# Create our models.
lstm_model = NER_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

lstm_model_name = 'LSTM'  # name of the saved model file

elman_model = NER_ELMAN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX, 'relu')

elman_model_name = 'Elman'  # name of the saved model file

gru_model = NER_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

gru_model_name = 'GRU'  # name of the saved model file

In [None]:
model = lstm_model
model_name = lstm_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 11,434,942 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 20s
	Train Loss: 0.389 | Train f1: 0.23 | Train precision: 0.30 | Train recall: 0.22
	 Val. Loss: 0.270 |  Val. f1: 0.45 |  Val. precision: 0.55 | Val. recall: 0.45
Epoch: 02 | Epoch Time: 0m 20s
	Train Loss: 0.131 | Train f1: 0.59 | Train precision: 0.65 | Train recall: 0.59
	 Val. Loss: 0.208 |  Val. f1: 0.56 |  Val. precision: 0.64 | Val. recall: 0.56
Epoch: 03 | Epoch Time: 0m 20s
	Train Loss: 0.070 | Train f1: 0.74 | Train precision: 0.77 | Train recall: 0.74
	 Val. Loss: 0.184 |  Val. f1: 0.61 |  Val. precision: 0.68 | Val. recall: 0.61
Epoch: 04 | Epoch Time: 0m 20s
	Train Loss: 0.042 | Train f1: 0.83 | Train precision: 0.85 | Train recall: 0.83
	 Val. Loss: 0.209 |  Val. f1: 0.60 |  Val. precision: 0.67 | Val. recall: 0.59
Epoch: 05 | Epoch Time: 0m 20s
	Train Loss: 0.028 | Train f1: 0.87 | Train precision: 0.89 | Train recall: 0.87
	 Val. Loss: 0.236 |  Val. f1: 0.59 |  Val. precision: 0.65 | 

In [None]:
model = elman_model
model_name = elman_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 4,823,998 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 11s
	Train Loss: nan | Train f1: 0.00 | Train precision: 0.00 | Train recall: 0.00
	 Val. Loss: nan |  Val. f1: 0.00 |  Val. precision: 0.00 | Val. recall: 0.00
Epoch: 02 | Epoch Time: 0m 11s
	Train Loss: nan | Train f1: 0.00 | Train precision: 0.00 | Train recall: 0.00
	 Val. Loss: nan |  Val. f1: 0.00 |  Val. precision: 0.00 | Val. recall: 0.00


KeyboardInterrupt: ignored

In [None]:
model = gru_model
model_name = gru_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)  

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 9,231,294 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 18s
	Train Loss: 0.565 | Train f1: 0.19 | Train precision: 0.26 | Train recall: 0.18
	 Val. Loss: 0.284 |  Val. f1: 0.42 |  Val. precision: 0.49 | Val. recall: 0.43
Epoch: 02 | Epoch Time: 0m 18s
	Train Loss: 0.165 | Train f1: 0.55 | Train precision: 0.61 | Train recall: 0.55
	 Val. Loss: 0.197 |  Val. f1: 0.57 |  Val. precision: 0.66 | Val. recall: 0.56
Epoch: 03 | Epoch Time: 0m 18s
	Train Loss: 0.082 | Train f1: 0.72 | Train precision: 0.75 | Train recall: 0.72
	 Val. Loss: 0.217 |  Val. f1: 0.61 |  Val. precision: 0.67 | Val. recall: 0.61
Epoch: 04 | Epoch Time: 0m 18s
	Train Loss: 0.048 | Train f1: 0.80 | Train precision: 0.83 | Train recall: 0.80
	 Val. Loss: 0.228 |  Val. f1: 0.62 |  Val. precision: 0.69 | Val. recall: 0.61
Epoch: 05 | Epoch Time: 0m 18s
	Train Loss: 0.032 | Train f1: 0.86 | Train precision: 0.88 | Train recall: 0.86
	 Val. Loss: 0.227 |  Val. f1: 0.62 |  Val. precision: 0.68 | V

  avg = a.mean(axis)
  ret = ret.dtype.type(ret / rcount)


Epoch: 10 | Epoch Time: 0m 18s
	Train Loss: 0.013 | Train f1: nan | Train precision: nan | Train recall: nan
	 Val. Loss: 0.323 |  Val. f1: 0.60 |  Val. precision: 0.67 | Val. recall: 0.60
Epoch: 11 | Epoch Time: 0m 18s
	Train Loss: 0.011 | Train f1: 0.95 | Train precision: 0.95 | Train recall: 0.95
	 Val. Loss: 0.339 |  Val. f1: 0.62 |  Val. precision: 0.70 | Val. recall: 0.61
Epoch: 12 | Epoch Time: 0m 18s
	Train Loss: 0.011 | Train f1: 0.95 | Train precision: 0.95 | Train recall: 0.95
	 Val. Loss: 0.308 |  Val. f1: 0.62 |  Val. precision: 0.68 | Val. recall: 0.62

Performance of best found Model:
Val. Loss: 0.197 |  Val. f1: 0.57 | Val. precision: 0.66 | Val. recall: 0.56


Increasing the number of parameters in the hidden layers of the recurrent models does not seem to improve the network's capacity; if anything, it does the opposite. Note, however, that so far the model with the best validation loss is generally found after 2-4 epochs. This could be due to an optimizer whose learning rate is too high.

For this reason, the larger models will be used to check whether their performance improves when the optimizer's parameters are modified.

#### Effect of the Optimizer

First, we try lowering the *learning rate* of the optimizer in use (```Adam```) and adding a ```Scheduler``` to vary it during training.

Before that, the optimization and training functions defined earlier need to be modified:

In [None]:
def train(model, iterator, optimizer, criterion, scheduler = None):

    epoch_loss = 0
    epoch_precision = 0
    epoch_recall = 0
    epoch_f1 = 0

    model.train()

    # For each batch in this epoch's iterator:
    for batch in iterator:

        # Extract the text and the tags of the batch being processed
        text = batch.text
        tags = batch.nertags

        # Reset the gradients computed in the previous iteration
        optimizer.zero_grad()

        #text = [sent len, batch size]

        # Predict the tags for the text in the batch.
        predictions = model(text)

        #predictions = [sent len, batch size, output dim]
        #tags = [sent len, batch size]

        # Flatten the data to compute the loss
        predictions = predictions.view(-1, predictions.shape[-1])
        tags = tags.view(-1)

        #predictions = [sent len * batch size, output dim]
        #tags = [sent len * batch size]

        # Compute the cross-entropy of the predictions against the true labels
        loss = criterion(predictions, tags)

        # Compute precision, recall and F1
        precision, recall, f1 = calculate_metrics(predictions, tags)

        # Compute the gradients
        loss.backward()

        # Update the network parameters
        optimizer.step()

        # Accumulate the loss and the metrics
        epoch_loss += loss.item()
        epoch_precision += precision
        epoch_recall += recall
        epoch_f1 += f1

    # Step the learning-rate scheduler once per epoch
    if scheduler is not None:
        scheduler.step()

    return epoch_loss / len(iterator), epoch_precision / len(
        iterator), epoch_recall / len(iterator), epoch_f1 / len(iterator)

def optimize_model(model, train_iterator, valid_iterator, optimizer, 
                   criterion, scheduler = None, 
                   nEpochs = 100, stopTolerance = 10, useBest = True):
  '''Train with Early Stopping and reload the best checkpoint.

  stopTolerance is the number of epochs without improvement allowed before
  training stops.
  useBest selects which epochs count towards stopTolerance. If True, every
  epoch that does not improve on the best validation loss so far counts. If
  False, only epochs whose validation loss is not lower than that of the
  previous epoch count. In both cases the counter resets whenever a new best
  validation loss is reached.'''

  best_valid_loss = float('inf')
  prev_valid_loss = float('inf')
  counter = 0

  for epoch in range(nEpochs):
    
      start_time = time.time()

      # Reminder: train_iterator and valid_iterator hold the dataset split into batches.

      # Train
      train_loss, train_precision, train_recall, train_f1 = train(
          model, train_iterator, optimizer, criterion, scheduler)

      # Evaluate (valid = validation)
      valid_loss, valid_precision, valid_recall, valid_f1 = evaluate(
          model, valid_iterator, criterion)

      end_time = time.time()

      epoch_mins, epoch_secs = epoch_time(start_time, end_time)

      print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
      print(
          f'\tTrain Loss: {train_loss:.3f} | Train f1: {train_f1:.2f} | Train precision: {train_precision:.2f} | Train recall: {train_recall:.2f}'
      )
      print(
          f'\t Val. Loss: {valid_loss:.3f} |  Val. f1: {valid_f1:.2f} |  Val. precision: {valid_precision:.2f} | Val. recall: {valid_recall:.2f}'
      )

      # If the validation loss improved, save this model to disk so it can be loaded later.
      # If training is interrupted early, the saved model can still be loaded in the next code cell.
      if valid_loss < best_valid_loss:
          best_valid_loss = valid_loss
          torch.save(model.state_dict(), '{}.pt'.format(model_name))
          counter = 0

      # Otherwise, count the epoch towards Early Stopping and stop once the tolerance is reached.
      else:
          if useBest:
              counter += 1
          
          else:
              if valid_loss >= prev_valid_loss:
                  counter += 1
          
          if counter == stopTolerance:
              break

      prev_valid_loss = valid_loss

  # Reload the checkpoint with the best validation loss and report its metrics.
  model.load_state_dict(torch.load('{}.pt'.format(model_name)))
  torch.cuda.empty_cache()

  valid_loss, valid_precision, valid_recall, valid_f1 = evaluate(
    model, valid_iterator, criterion)
  
  print('\nPerformance of best found Model:')
  print(
      f'Val. Loss: {valid_loss:.3f} |  Val. f1: {valid_f1:.2f} | Val. precision: {valid_precision:.2f} | Val. recall: {valid_recall:.2f}'
  )
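
The two stopping criteria controlled by `useBest` can be looked at in isolation from the training loop. The following is a minimal sketch that replays the counter logic of `optimize_model` over a list of validation losses; the function name `early_stopping_epoch` and the loss values are hypothetical and only serve as an illustration.

```python
def early_stopping_epoch(valid_losses, stop_tolerance=10, use_best=True):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    prev = float("inf")
    counter = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:
            # A new best validation loss resets the counter.
            best = loss
            counter = 0
        else:
            # use_best=True: every non-best epoch counts.
            # use_best=False: only epochs no better than the previous one count.
            if use_best or loss >= prev:
                counter += 1
            if counter == stop_tolerance:
                return epoch
        prev = loss
    return None


# Hypothetical validation losses: best value at epoch 2, then a noisy plateau.
losses = [0.30, 0.20, 0.22, 0.21, 0.23, 0.22, 0.24]
print(early_stopping_epoch(losses, stop_tolerance=3, use_best=True))   # 5
print(early_stopping_epoch(losses, stop_tolerance=3, use_best=False))  # 7
```

With `use_best=False` the small recoveries at epochs 4 and 6 are not counted, so the same sequence of losses keeps training for two more epochs.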

In [None]:
## Experiment parameters ##

# Vocabulary size (each input token is conceptually a one-hot vector over the vocabulary).
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100  # embedding dimension
HIDDEN_DIM = 256  # hidden dimension of the RNN layers
OUTPUT_DIM = len(NER_TAGS.vocab)  # number of classes

N_LAYERS = 2  # number of layers
DROPOUT = 0.25
BIDIRECTIONAL = True

TAG_PAD_IDX = NER_TAGS.vocab.stoi[NER_TAGS.pad_token]
baseline_criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)

# Create the models.
lstm_model = NER_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

lstm_model_name = 'LSTM'  # name under which the model is saved

elman_model = NER_ELMAN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX, 'relu')

elman_model_name = 'Elman'  # name under which the model is saved

gru_model = NER_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

gru_model_name = 'GRU'  # name under which the model is saved

In [None]:
model = lstm_model
model_name = lstm_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

lRate = 0.0005
#decayRate = 3
decayRate = 2
decayFactor = 0.9
#decayFactor = 0.75

optimizer = optim.Adam(model.parameters(), lr = lRate)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size = decayRate, 
                                      gamma = decayFactor)
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, 
               criterion, scheduler = scheduler)

El modelo actual tiene 4,925,374 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 11s
	Train Loss: 0.418 | Train f1: 0.19 | Train precision: 0.27 | Train recall: 0.17
	 Val. Loss: 0.296 |  Val. f1: 0.38 |  Val. precision: 0.49 | Val. recall: 0.36
Epoch: 02 | Epoch Time: 0m 11s
	Train Loss: 0.179 | Train f1: 0.50 | Train precision: 0.57 | Train recall: 0.49
	 Val. Loss: 0.234 |  Val. f1: 0.50 |  Val. precision: 0.60 | Val. recall: 0.48
Epoch: 03 | Epoch Time: 0m 11s
	Train Loss: 0.106 | Train f1: 0.65 | Train precision: 0.70 | Train recall: 0.64
	 Val. Loss: 0.195 |  Val. f1: 0.56 |  Val. precision: 0.63 | Val. recall: 0.56
Epoch: 04 | Epoch Time: 0m 11s
	Train Loss: 0.071 | Train f1: 0.74 | Train precision: 0.77 | Train recall: 0.73
	 Val. Loss: 0.193 |  Val. f1: 0.59 |  Val. precision: 0.65 | Val. recall: 0.58
Epoch: 05 | Epoch Time: 0m 11s
	Train Loss: 0.049 | Train f1: 0.80 | Train precision: 0.82 | Train recall: 0.79
	 Val. Loss: 0.219 |  Val. f1: 0.58 |  Val. precision: 0.66 | V

In [None]:
model = elman_model
model_name = elman_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

lRate = 0.0005
#decayRate = 3
decayRate = 2
decayFactor = 0.9
#decayFactor = 0.75

optimizer = optim.Adam(model.parameters(), lr = lRate)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size = decayRate, 
                                      gamma = decayFactor)
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, 
               criterion, scheduler = scheduler)

El modelo actual tiene 3,192,766 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 8s
	Train Loss: 62189147986614.891 | Train f1: 0.04 | Train precision: 0.07 | Train recall: 0.04
	 Val. Loss: 6.421 |  Val. f1: 0.01 |  Val. precision: 0.02 | Val. recall: 0.00
Epoch: 02 | Epoch Time: 0m 8s
	Train Loss: nan | Train f1: 0.02 | Train precision: 0.06 | Train recall: 0.02
	 Val. Loss: nan |  Val. f1: 0.00 |  Val. precision: 0.00 | Val. recall: 0.00
Epoch: 03 | Epoch Time: 0m 8s
	Train Loss: nan | Train f1: 0.00 | Train precision: 0.00 | Train recall: 0.00
	 Val. Loss: nan |  Val. f1: 0.00 |  Val. precision: 0.00 | Val. recall: 0.00
Epoch: 04 | Epoch Time: 0m 9s
	Train Loss: nan | Train f1: 0.00 | Train precision: 0.00 | Train recall: 0.00
	 Val. Loss: nan |  Val. f1: 0.00 |  Val. precision: 0.00 | Val. recall: 0.00
Epoch: 05 | Epoch Time: 0m 9s
	Train Loss: nan | Train f1: 0.00 | Train precision: 0.00 | Train recall: 0.00
	 Val. Loss: nan |  Val. f1: 0.00 |  Val. precision: 0.00 | Val. reca

In [None]:
model = gru_model
model_name = gru_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

lRate = 0.0005
#decayRate = 3
decayRate = 2
decayFactor = 0.9
#decayFactor = 0.75

optimizer = optim.Adam(model.parameters(), lr = lRate)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size = decayRate, 
                                      gamma = decayFactor)
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, 
               criterion, scheduler = scheduler)

El modelo actual tiene 4,347,838 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 10s
	Train Loss: 0.411 | Train f1: 0.22 | Train precision: 0.34 | Train recall: 0.19
	 Val. Loss: 0.284 |  Val. f1: 0.41 |  Val. precision: 0.54 | Val. recall: 0.39
Epoch: 02 | Epoch Time: 0m 10s
	Train Loss: 0.176 | Train f1: 0.54 | Train precision: 0.62 | Train recall: 0.52
	 Val. Loss: 0.218 |  Val. f1: 0.53 |  Val. precision: 0.62 | Val. recall: 0.52
Epoch: 03 | Epoch Time: 0m 10s
	Train Loss: 0.102 | Train f1: 0.67 | Train precision: 0.73 | Train recall: 0.67
	 Val. Loss: 0.214 |  Val. f1: 0.58 |  Val. precision: 0.66 | Val. recall: 0.56
Epoch: 04 | Epoch Time: 0m 10s
	Train Loss: 0.067 | Train f1: 0.75 | Train precision: 0.78 | Train recall: 0.75
	 Val. Loss: 0.215 |  Val. f1: 0.59 |  Val. precision: 0.67 | Val. recall: 0.58
Epoch: 05 | Epoch Time: 0m 10s
	Train Loss: 0.045 | Train f1: 0.81 | Train precision: 0.83 | Train recall: 0.81
	 Val. Loss: 0.202 |  Val. f1: 0.62 |  Val. precision: 0.69 | V

Overall, adding a ```Scheduler``` for the *learning rate* makes no appreciable difference and, for now, the default learning rate of the ```Adam``` optimizer seems adequate.

#### Network Depth

So far, varying the number of parameters in the network layers has shown that the gated architectures perform best with 256 units per layer, while the `Elman RNN` delivers comparable performance with 128 units in the hidden layer and deteriorates as this number grows.

With this in mind, we propose varying the number of layers in the network, that is, considering `Stacked RNN` models. Since the models tested so far are two layers deep, we first test with a single layer as a *sanity check* and then increase the number of layers to observe the effect on performance.
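
One possible reading of this result is that the schedule configured above lowers the learning rate quite slowly. The sketch below reproduces the step-decay rule in plain Python, `lr = lRate * decayFactor ** (epoch // decayRate)`, assuming one `scheduler.step()` per epoch as in the `train` function; the helper name `step_lr` is hypothetical.

```python
def step_lr(initial_lr, step_size, gamma, epoch):
    """Learning rate in effect after `epoch` scheduler steps under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


# Values used in the scheduler experiments of this section.
l_rate, decay_rate, decay_factor = 0.0005, 2, 0.9

for epoch in range(6):
    lr = step_lr(l_rate, decay_rate, decay_factor, epoch)
    print(f"epoch {epoch + 1}: lr = {lr:.6f}")
```

With these values the rate is still at 81% of its initial value in the fifth epoch, so within the 2-4 epochs where the best validation loss tends to appear the schedule may have little room to act.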
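
To see how depth changes model size, the following sketch counts the parameters of the recurrent part alone. It assumes that the models wrap PyTorch's standard `nn.RNN`, `nn.GRU` and `nn.LSTM` layers (the model definitions are not shown in this section), where each direction of each layer holds `gates * (input * hidden + hidden * hidden + 2 * hidden)` parameters; the helper name `recurrent_param_count` is hypothetical.

```python
# Gate multipliers under PyTorch's standard parameterization (assumed here).
GATES = {"elman": 1, "gru": 3, "lstm": 4}


def recurrent_param_count(kind, embedding_dim, hidden_dim, n_layers,
                          bidirectional=True):
    """Count weights and biases of a stacked (bi)directional recurrent block."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # The first layer reads embeddings; deeper layers read the previous
        # layer's output, which concatenates both directions.
        input_dim = embedding_dim if layer == 0 else hidden_dim * directions
        per_direction = GATES[kind] * (
            input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
        total += per_direction * directions
    return total


for n_layers in (1, 2, 3, 4):
    count = recurrent_param_count("lstm", 100, 256, n_layers)
    print(f"LSTM, {n_layers} layer(s): {count:,} recurrent parameters")
```

Under these assumptions each additional bidirectional LSTM layer of width 256 adds 1,576,960 parameters, which matches the gap between the trainable-parameter totals printed for the LSTM runs with 1, 2, 3 and 4 layers in this section.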

In [None]:
## Experiment parameters ##

# Vocabulary size (each input token is conceptually a one-hot vector over the vocabulary).
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100  # embedding dimension
HIDDEN_DIM = 256  # hidden dimension of the RNN layers
OUTPUT_DIM = len(NER_TAGS.vocab)  # number of classes

N_LAYERS = 1  # number of layers
DROPOUT = 0.25
BIDIRECTIONAL = True

TAG_PAD_IDX = NER_TAGS.vocab.stoi[NER_TAGS.pad_token]
baseline_criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)

# Create the models.
lstm_model = NER_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

lstm_model_name = 'LSTM'  # name under which the model is saved

elman_model = NER_ELMAN(INPUT_DIM, EMBEDDING_DIM, int(HIDDEN_DIM/2), OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX, 'relu')

elman_model_name = 'Elman'  # name under which the model is saved

gru_model = NER_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

gru_model_name = 'GRU'  # name under which the model is saved

In [None]:
model = lstm_model
model_name = lstm_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 3,348,414 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 5s
	Train Loss: 0.335 | Train f1: 0.29 | Train precision: 0.37 | Train recall: 0.27
	 Val. Loss: 0.237 |  Val. f1: 0.45 |  Val. precision: 0.55 | Val. recall: 0.44
Epoch: 02 | Epoch Time: 0m 5s
	Train Loss: 0.117 | Train f1: 0.63 | Train precision: 0.68 | Train recall: 0.62
	 Val. Loss: 0.184 |  Val. f1: 0.59 |  Val. precision: 0.65 | Val. recall: 0.58
Epoch: 03 | Epoch Time: 0m 5s
	Train Loss: 0.065 | Train f1: 0.76 | Train precision: 0.79 | Train recall: 0.76
	 Val. Loss: 0.188 |  Val. f1: 0.60 |  Val. precision: 0.67 | Val. recall: 0.60
Epoch: 04 | Epoch Time: 0m 5s
	Train Loss: 0.041 | Train f1: 0.83 | Train precision: 0.85 | Train recall: 0.83
	 Val. Loss: 0.204 |  Val. f1: 0.61 |  Val. precision: 0.69 | Val. recall: 0.59
Epoch: 05 | Epoch Time: 0m 5s
	Train Loss: 0.028 | Train f1: 0.88 | Train precision: 0.89 | Train recall: 0.88
	 Val. Loss: 0.197 |  Val. f1: 0.62 |  Val. precision: 0.69 | Val. r

In [None]:
model = elman_model
model_name = elman_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 2,671,550 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 5s
	Train Loss: 0.392 | Train f1: 0.21 | Train precision: 0.28 | Train recall: 0.19
	 Val. Loss: 0.266 |  Val. f1: 0.41 |  Val. precision: 0.49 | Val. recall: 0.41
Epoch: 02 | Epoch Time: 0m 5s
	Train Loss: 0.148 | Train f1: 0.54 | Train precision: 0.60 | Train recall: 0.53
	 Val. Loss: 0.201 |  Val. f1: 0.55 |  Val. precision: 0.63 | Val. recall: 0.53
Epoch: 03 | Epoch Time: 0m 5s
	Train Loss: 0.085 | Train f1: 0.70 | Train precision: 0.74 | Train recall: 0.70
	 Val. Loss: 0.174 |  Val. f1: 0.60 |  Val. precision: 0.66 | Val. recall: 0.60
Epoch: 04 | Epoch Time: 0m 5s
	Train Loss: 0.055 | Train f1: 0.79 | Train precision: 0.81 | Train recall: 0.79
	 Val. Loss: 0.180 |  Val. f1: 0.63 |  Val. precision: 0.69 | Val. recall: 0.63
Epoch: 05 | Epoch Time: 0m 5s
	Train Loss: 0.037 | Train f1: 0.84 | Train precision: 0.85 | Train recall: 0.84
	 Val. Loss: 0.201 |  Val. f1: 0.62 |  Val. precision: 0.68 | Val. r

In [None]:
model = gru_model
model_name = gru_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 3,165,118 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 5s
	Train Loss: 0.316 | Train f1: 0.33 | Train precision: 0.43 | Train recall: 0.31
	 Val. Loss: 0.219 |  Val. f1: 0.52 |  Val. precision: 0.60 | Val. recall: 0.51
Epoch: 02 | Epoch Time: 0m 5s
	Train Loss: 0.110 | Train f1: 0.66 | Train precision: 0.71 | Train recall: 0.65
	 Val. Loss: 0.182 |  Val. f1: 0.58 |  Val. precision: 0.65 | Val. recall: 0.57
Epoch: 03 | Epoch Time: 0m 5s
	Train Loss: 0.059 | Train f1: 0.78 | Train precision: 0.81 | Train recall: 0.78
	 Val. Loss: 0.185 |  Val. f1: 0.61 |  Val. precision: 0.68 | Val. recall: 0.60
Epoch: 04 | Epoch Time: 0m 5s
	Train Loss: 0.035 | Train f1: 0.85 | Train precision: 0.87 | Train recall: 0.85
	 Val. Loss: 0.191 |  Val. f1: 0.61 |  Val. precision: 0.67 | Val. recall: 0.61
Epoch: 05 | Epoch Time: 0m 5s
	Train Loss: 0.023 | Train f1: 0.89 | Train precision: 0.90 | Train recall: 0.89
	 Val. Loss: 0.201 |  Val. f1: 0.62 |  Val. precision: 0.68 | Val. r

  avg = a.mean(axis)
  ret = ret.dtype.type(ret / rcount)


Epoch: 07 | Epoch Time: 0m 5s
	Train Loss: 0.015 | Train f1: nan | Train precision: nan | Train recall: nan
	 Val. Loss: 0.226 |  Val. f1: 0.62 |  Val. precision: 0.69 | Val. recall: 0.61
Epoch: 08 | Epoch Time: 0m 5s
	Train Loss: 0.011 | Train f1: 0.95 | Train precision: 0.95 | Train recall: 0.95
	 Val. Loss: 0.220 |  Val. f1: 0.63 |  Val. precision: 0.68 | Val. recall: 0.62
Epoch: 09 | Epoch Time: 0m 5s
	Train Loss: 0.010 | Train f1: 0.95 | Train precision: 0.95 | Train recall: 0.95
	 Val. Loss: 0.234 |  Val. f1: 0.63 |  Val. precision: 0.69 | Val. recall: 0.62
Epoch: 10 | Epoch Time: 0m 5s
	Train Loss: 0.009 | Train f1: 0.96 | Train precision: 0.96 | Train recall: 0.96
	 Val. Loss: 0.241 |  Val. f1: 0.62 |  Val. precision: 0.68 | Val. recall: 0.61
Epoch: 11 | Epoch Time: 0m 5s
	Train Loss: 0.009 | Train f1: 0.96 | Train precision: 0.96 | Train recall: 0.96
	 Val. Loss: 0.267 |  Val. f1: 0.61 |  Val. precision: 0.68 | Val. recall: 0.60
Epoch: 12 | Epoch Time: 0m 5s
	Train Loss: 0.007

As one would expect, reducing the depth of the network reduces its predictive capacity.

In [None]:
## Experiment parameters ##

# Vocabulary size (each input token is conceptually a one-hot vector over the vocabulary).
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100  # embedding dimension
HIDDEN_DIM = 256  # hidden dimension of the RNN layers
OUTPUT_DIM = len(NER_TAGS.vocab)  # number of classes

N_LAYERS = 3  # number of layers
DROPOUT = 0.25
BIDIRECTIONAL = True

TAG_PAD_IDX = NER_TAGS.vocab.stoi[NER_TAGS.pad_token]
baseline_criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)

# Create the models.
lstm_model = NER_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

lstm_model_name = 'LSTM'  # name under which the model is saved

elman_model = NER_ELMAN(INPUT_DIM, EMBEDDING_DIM, int(HIDDEN_DIM/2), OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX, 'relu')

elman_model_name = 'Elman'  # name under which the model is saved

gru_model = NER_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

gru_model_name = 'GRU'  # name under which the model is saved

In [None]:
model = lstm_model
model_name = lstm_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 6,502,334 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 18s
	Train Loss: 0.364 | Train f1: 0.25 | Train precision: 0.32 | Train recall: 0.24
	 Val. Loss: 0.243 |  Val. f1: 0.46 |  Val. precision: 0.56 | Val. recall: 0.44
Epoch: 02 | Epoch Time: 0m 18s
	Train Loss: 0.128 | Train f1: 0.60 | Train precision: 0.66 | Train recall: 0.60
	 Val. Loss: 0.188 |  Val. f1: 0.57 |  Val. precision: 0.64 | Val. recall: 0.57
Epoch: 03 | Epoch Time: 0m 18s
	Train Loss: 0.071 | Train f1: 0.73 | Train precision: 0.77 | Train recall: 0.73
	 Val. Loss: 0.206 |  Val. f1: 0.59 |  Val. precision: 0.66 | Val. recall: 0.58
Epoch: 04 | Epoch Time: 0m 18s
	Train Loss: 0.045 | Train f1: 0.82 | Train precision: 0.84 | Train recall: 0.82
	 Val. Loss: 0.197 |  Val. f1: 0.61 |  Val. precision: 0.66 | Val. recall: 0.62
Epoch: 05 | Epoch Time: 0m 18s
	Train Loss: 0.032 | Train f1: 0.86 | Train precision: 0.88 | Train recall: 0.86
	 Val. Loss: 0.210 |  Val. f1: 0.61 |  Val. precision: 0.66 | V

In [None]:
model = elman_model
model_name = elman_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 2,869,182 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 13s
	Train Loss: 0.441 | Train f1: 0.17 | Train precision: 0.23 | Train recall: 0.15
	 Val. Loss: 0.280 |  Val. f1: 0.33 |  Val. precision: 0.39 | Val. recall: 0.33
Epoch: 02 | Epoch Time: 0m 13s
	Train Loss: 0.176 | Train f1: 0.46 | Train precision: 0.52 | Train recall: 0.46
	 Val. Loss: 0.216 |  Val. f1: 0.50 |  Val. precision: 0.58 | Val. recall: 0.50
Epoch: 03 | Epoch Time: 0m 13s
	Train Loss: 0.106 | Train f1: 0.63 | Train precision: 0.68 | Train recall: 0.63
	 Val. Loss: 0.199 |  Val. f1: 0.57 |  Val. precision: 0.63 | Val. recall: 0.58
Epoch: 04 | Epoch Time: 0m 13s
	Train Loss: 0.073 | Train f1: 0.72 | Train precision: 0.75 | Train recall: 0.72
	 Val. Loss: 0.208 |  Val. f1: 0.56 |  Val. precision: 0.64 | Val. recall: 0.56
Epoch: 05 | Epoch Time: 0m 13s
	Train Loss: 0.055 | Train f1: 0.79 | Train precision: 0.81 | Train recall: 0.79
	 Val. Loss: 0.195 |  Val. f1: 0.60 |  Val. precision: 0.65 | V

In [None]:
model = gru_model
model_name = gru_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 5,530,558 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 17s
	Train Loss: 0.430 | Train f1: 0.25 | Train precision: 0.34 | Train recall: 0.23
	 Val. Loss: 0.255 |  Val. f1: 0.49 |  Val. precision: 0.59 | Val. recall: 0.48
Epoch: 02 | Epoch Time: 0m 16s
	Train Loss: 0.139 | Train f1: 0.60 | Train precision: 0.66 | Train recall: 0.60
	 Val. Loss: 0.200 |  Val. f1: 0.58 |  Val. precision: 0.65 | Val. recall: 0.57
Epoch: 03 | Epoch Time: 0m 16s
	Train Loss: 0.073 | Train f1: 0.74 | Train precision: 0.77 | Train recall: 0.74
	 Val. Loss: 0.184 |  Val. f1: 0.63 |  Val. precision: 0.69 | Val. recall: 0.63
Epoch: 04 | Epoch Time: 0m 17s
	Train Loss: 0.044 | Train f1: 0.82 | Train precision: 0.84 | Train recall: 0.82
	 Val. Loss: 0.223 |  Val. f1: 0.61 |  Val. precision: 0.66 | Val. recall: 0.62
Epoch: 05 | Epoch Time: 0m 17s
	Train Loss: 0.031 | Train f1: 0.87 | Train precision: 0.88 | Train recall: 0.87
	 Val. Loss: 0.225 |  Val. f1: 0.62 |  Val. precision: 0.67 | V

In [None]:
## Experiment parameters ##

# Vocabulary size (each input token is conceptually a one-hot vector over the vocabulary).
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100  # embedding dimension
HIDDEN_DIM = 256  # hidden dimension of the RNN layers
OUTPUT_DIM = len(NER_TAGS.vocab)  # number of classes

N_LAYERS = 4  # number of layers
DROPOUT = 0.25
BIDIRECTIONAL = True

TAG_PAD_IDX = NER_TAGS.vocab.stoi[NER_TAGS.pad_token]
baseline_criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)

# Create the models.
lstm_model = NER_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

lstm_model_name = 'LSTM'  # name under which the model is saved

elman_model = NER_ELMAN(INPUT_DIM, EMBEDDING_DIM, int(HIDDEN_DIM/2), OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX, 'relu')

elman_model_name = 'Elman'  # name under which the model is saved

gru_model = NER_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

gru_model_name = 'GRU'  # name under which the model is saved

In [None]:
model = lstm_model
model_name = lstm_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 8,079,294 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 25s
	Train Loss: 0.426 | Train f1: 0.18 | Train precision: 0.23 | Train recall: 0.17
	 Val. Loss: 0.319 |  Val. f1: 0.35 |  Val. precision: 0.47 | Val. recall: 0.35
Epoch: 02 | Epoch Time: 0m 25s
	Train Loss: 0.150 | Train f1: 0.54 | Train precision: 0.60 | Train recall: 0.54
	 Val. Loss: 0.208 |  Val. f1: 0.55 |  Val. precision: 0.61 | Val. recall: 0.55
Epoch: 03 | Epoch Time: 0m 25s
	Train Loss: 0.085 | Train f1: 0.70 | Train precision: 0.74 | Train recall: 0.71
	 Val. Loss: 0.203 |  Val. f1: 0.56 |  Val. precision: 0.63 | Val. recall: 0.56
Epoch: 04 | Epoch Time: 0m 25s
	Train Loss: 0.057 | Train f1: 0.78 | Train precision: 0.80 | Train recall: 0.78
	 Val. Loss: 0.212 |  Val. f1: 0.59 |  Val. precision: 0.64 | Val. recall: 0.60
Epoch: 05 | Epoch Time: 0m 25s
	Train Loss: 0.041 | Train f1: 0.83 | Train precision: 0.85 | Train recall: 0.83
	 Val. Loss: 0.229 |  Val. f1: 0.59 |  Val. precision: 0.65 | V

In [None]:
model = elman_model
model_name = elman_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 2,967,998 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 17s
	Train Loss: 0.472 | Train f1: 0.13 | Train precision: 0.20 | Train recall: 0.12
	 Val. Loss: 0.310 |  Val. f1: 0.33 |  Val. precision: 0.36 | Val. recall: 0.36
Epoch: 02 | Epoch Time: 0m 17s
	Train Loss: 0.200 | Train f1: 0.43 | Train precision: 0.49 | Train recall: 0.43
	 Val. Loss: 0.227 |  Val. f1: 0.50 |  Val. precision: 0.56 | Val. recall: 0.52
Epoch: 03 | Epoch Time: 0m 17s
	Train Loss: 0.125 | Train f1: 0.59 | Train precision: 0.64 | Train recall: 0.59
	 Val. Loss: 0.227 |  Val. f1: 0.55 |  Val. precision: 0.62 | Val. recall: 0.55
Epoch: 04 | Epoch Time: 0m 17s
	Train Loss: 0.089 | Train f1: 0.69 | Train precision: 0.72 | Train recall: 0.69
	 Val. Loss: 0.198 |  Val. f1: 0.61 |  Val. precision: 0.65 | Val. recall: 0.62
Epoch: 05 | Epoch Time: 0m 17s
	Train Loss: 0.067 | Train f1: 0.75 | Train precision: 0.78 | Train recall: 0.75
	 Val. Loss: 0.221 |  Val. f1: 0.60 |  Val. precision: 0.64 | V

In [None]:
model = gru_model
model_name = gru_model_name
criterion = baseline_criterion
  
model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 6,713,278 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 23s
	Train Loss: 0.510 | Train f1: 0.15 | Train precision: 0.20 | Train recall: 0.14
	 Val. Loss: 0.278 |  Val. f1: 0.39 |  Val. precision: 0.48 | Val. recall: 0.38
Epoch: 02 | Epoch Time: 0m 23s
	Train Loss: 0.167 | Train f1: 0.52 | Train precision: 0.59 | Train recall: 0.52
	 Val. Loss: 0.214 |  Val. f1: 0.55 |  Val. precision: 0.60 | Val. recall: 0.58
Epoch: 03 | Epoch Time: 0m 23s
	Train Loss: 0.091 | Train f1: 0.69 | Train precision: 0.73 | Train recall: 0.69
	 Val. Loss: 0.202 |  Val. f1: 0.59 |  Val. precision: 0.65 | Val. recall: 0.58
Epoch: 04 | Epoch Time: 0m 23s
	Train Loss: 0.061 | Train f1: 0.78 | Train precision: 0.80 | Train recall: 0.78
	 Val. Loss: 0.210 |  Val. f1: 0.61 |  Val. precision: 0.66 | Val. recall: 0.61
Epoch: 05 | Epoch Time: 0m 23s
	Train Loss: 0.043 | Train f1: 0.83 | Train precision: 0.85 | Train recall: 0.82
	 Val. Loss: 0.227 |  Val. f1: 0.60 |  Val. precision: 0.65 | V

On average, increasing or decreasing the depth of the recurrent models leads to an overall loss in *performance*; even in the cases where one metric improves, the losses in the other metrics are larger in magnitude.

#### Dropout

Before analyzing the effect of *dropout*, an observation is in order: the provided models use the same *dropout* rate at every stage of the network. That is, a *dropout* layer is placed between the *Embedding* layer and the recurrent model, with the same rate that is used inside the recurrent network, and another *dropout* layer with that same rate is placed between the recurrent output and the classification layer. This is not usual in practice, so we propose separating these three instances of *dropout* and experimenting with each one individually.
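As a reminder of what each of these three layers does, the following is a minimal pure-Python sketch of inverted dropout, the scheme used by `nn.Dropout`: during training each activation is zeroed with probability `p` and the survivors are scaled by `1 / (1 - p)`, while at evaluation time the layer is the identity. The function name `dropout` and its arguments are illustrative and do not come from the notebook.

```python
import random
from typing import List


def dropout(values: List[float], p: float, training: bool,
            rng: random.Random) -> List[float]:
    """Inverted dropout over a flat list of activations (requires 0 <= p < 1)."""
    if not training or p == 0.0:
        # Evaluation mode (or p = 0): the layer leaves its input untouched.
        return list(values)
    keep_scale = 1.0 / (1.0 - p)  # rescale survivors to preserve the expected value
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


rng = random.Random(0)
activations = [1.0] * 8

train_out = dropout(activations, p=0.75, training=True, rng=rng)
eval_out = dropout(activations, p=0.75, training=False, rng=rng)

print(train_out)  # every entry is either 0.0 or 4.0 (= 1 / (1 - 0.75))
print(eval_out)   # identical to the input
```

With `p = 0.75` only about a quarter of the activations survive each training step, which is why such a high rate acts as a strong regularizer. Note also that the `dropout` argument of `nn.LSTM`, `nn.GRU` and `nn.RNN` is applied to the outputs of every recurrent layer except the last one, so it has no effect on single-layer networks.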

Accordingly, the models are first tested without *dropout*, and each component is then varied individually in search of the best configuration.
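The search procedure just described (tune one rate at a time, keeping the best value found so far for the others) can be sketched as a small loop. This is only an illustration: in the notebook the values are switched by hand, and `evaluate` here is a hypothetical stand-in for "train the model with this configuration and return a validation score". The toy scoring function is invented purely so that the sketch runs.

```python
from typing import Callable, Dict, List

DROPOUT_GRID: List[float] = [0.0, 0.1, 0.25, 0.5, 0.75]
COMPONENTS: List[str] = ["in_dropout", "rnn_dropout", "out_dropout"]


def greedy_dropout_search(
    evaluate: Callable[[Dict[str, float]], float]
) -> Dict[str, float]:
    """Tune one dropout rate at a time, freezing the best value found so far."""
    best = {name: 0.0 for name in COMPONENTS}  # start with no dropout anywhere
    for name in COMPONENTS:
        scores = {}
        for rate in DROPOUT_GRID:
            candidate = dict(best, **{name: rate})  # change only one component
            scores[rate] = evaluate(candidate)
        best[name] = max(scores, key=scores.get)
    return best


def toy_score(config: Dict[str, float]) -> float:
    # Hypothetical stand-in for a validation metric; it peaks at
    # in_dropout = 0.75 with the other two rates at 0.
    return (-abs(config["in_dropout"] - 0.75)
            - config["rnn_dropout"]
            - config["out_dropout"])


print(greedy_dropout_search(toy_score))
```

A greedy search like this needs only 15 training runs instead of the 125 required by the full grid, at the cost of possibly missing interactions between the three rates.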

But first, the model classes studied so far need to be modified:

In [None]:
class NER_RNN(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim,
                 n_layers, 
                 bidirectional, 
                 in_dropout,
                 rnn_dropout,
                 out_dropout, 
                 pad_idx):

        super().__init__()

        # Embedding layer
        self.embedding = nn.Embedding(input_dim,
                                      embedding_dim,
                                      padding_idx=pad_idx)

        # LSTM layer; PyTorch only applies its internal dropout between
        # stacked layers, so it is disabled for a single-layer network.
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers=n_layers,
                            bidirectional=bidirectional,
                            dropout=rnn_dropout if n_layers > 1 else 0)

        # Output layer
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim,
                            output_dim)

        # Separate dropout layers for the recurrent input and output
        self.in_dropout = nn.Dropout(in_dropout)
        self.out_dropout = nn.Dropout(out_dropout)

    def forward(self, text):
        # text = [sent len, batch size]

        # Look up the embeddings and apply the input dropout.
        embedded = self.in_dropout(self.embedding(text))
        # embedded = [sent len, batch size, emb dim]

        # Run the embeddings through the LSTM.
        outputs, (hidden, cell) = self.lstm(embedded)
        # outputs = [sent len, batch size, hid dim * n directions]
        # hidden/cell = [n layers * n directions, batch size, hid dim]

        # Apply the output dropout and predict with the output layer.
        predictions = self.fc(self.out_dropout(outputs))
        # predictions = [sent len, batch size, output dim]

        return predictions

class NER_ELMAN(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim,
                 n_layers, 
                 bidirectional, 
                 in_dropout,
                 rnn_dropout,
                 out_dropout, 
                 pad_idx,
                 nonlinearity = 'tanh'):

        super().__init__()

        # Embedding layer
        self.embedding = nn.Embedding(input_dim,
                                      embedding_dim,
                                      padding_idx=pad_idx)

        # Elman RNN layer; PyTorch only applies its internal dropout between
        # stacked layers, so it is disabled for a single-layer network.
        self.rnn = nn.RNN(embedding_dim,
                          hidden_dim,
                          num_layers=n_layers,
                          bidirectional=bidirectional,
                          dropout=rnn_dropout if n_layers > 1 else 0,
                          nonlinearity=nonlinearity)

        # Output layer
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim,
                            output_dim)

        # Separate dropout layers for the recurrent input and output
        self.in_dropout = nn.Dropout(in_dropout)
        self.out_dropout = nn.Dropout(out_dropout)

    def forward(self, text):
        # text = [sent len, batch size]

        # Look up the embeddings and apply the input dropout.
        embedded = self.in_dropout(self.embedding(text))
        # embedded = [sent len, batch size, emb dim]

        # Run the embeddings through the Elman RNN.
        outputs, hidden = self.rnn(embedded)
        # outputs = [sent len, batch size, hid dim * n directions]
        # hidden = [n layers * n directions, batch size, hid dim]

        # Apply the output dropout and predict with the output layer.
        predictions = self.fc(self.out_dropout(outputs))
        # predictions = [sent len, batch size, output dim]

        return predictions

class NER_GRU(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim,
                 n_layers, 
                 bidirectional, 
                 in_dropout,
                 rnn_dropout,
                 out_dropout, 
                 pad_idx):

        super().__init__()

        # Embedding layer
        self.embedding = nn.Embedding(input_dim,
                                      embedding_dim,
                                      padding_idx=pad_idx)

        # GRU layer; PyTorch only applies its internal dropout between
        # stacked layers, so it is disabled for a single-layer network.
        self.gru = nn.GRU(embedding_dim,
                          hidden_dim,
                          num_layers=n_layers,
                          bidirectional=bidirectional,
                          dropout=rnn_dropout if n_layers > 1 else 0)

        # Output layer
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim,
                            output_dim)

        # Separate dropout layers for the recurrent input and output
        self.in_dropout = nn.Dropout(in_dropout)
        self.out_dropout = nn.Dropout(out_dropout)

    def forward(self, text):
        # text = [sent len, batch size]

        # Look up the embeddings and apply the input dropout.
        embedded = self.in_dropout(self.embedding(text))
        # embedded = [sent len, batch size, emb dim]

        # Run the embeddings through the GRU.
        outputs, hidden = self.gru(embedded)
        # outputs = [sent len, batch size, hid dim * n directions]
        # hidden = [n layers * n directions, batch size, hid dim]

        # Apply the output dropout and predict with the output layer.
        predictions = self.fc(self.out_dropout(outputs))
        # predictions = [sent len, batch size, output dim]

        return predictions

##### Input *Dropout*

In [None]:
## Experiment parameters ##

# Vocabulary size; recall that the inputs are one-hot (bag-of-words) vectors.
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100  # embedding dimension
HIDDEN_DIM = 256  # dimension of the RNN layers
OUTPUT_DIM = len(NER_TAGS.vocab)  # number of classes

N_LAYERS = 2  # number of layers
# Input dropout values tried, one per run:
#IN_DROPOUT = 0.0
#IN_DROPOUT = 0.1
#IN_DROPOUT = 0.25
#IN_DROPOUT = 0.5
IN_DROPOUT = 0.75
RNN_DROPOUT = 0.0
OUT_DROPOUT = 0.0
BIDIRECTIONAL = True

TAG_PAD_IDX = NER_TAGS.vocab.stoi[NER_TAGS.pad_token]
baseline_criterion = nn.CrossEntropyLoss(ignore_index=TAG_PAD_IDX)

# Create the models.
lstm_model = NER_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                     N_LAYERS, BIDIRECTIONAL, IN_DROPOUT, RNN_DROPOUT,
                     OUT_DROPOUT, PAD_IDX)

lstm_model_name = 'LSTM'  # name under which the model will be saved

# The Elman RNN uses half the hidden dimension and a ReLU nonlinearity.
elman_model = NER_ELMAN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM // 2, OUTPUT_DIM,
                        N_LAYERS, BIDIRECTIONAL, IN_DROPOUT, RNN_DROPOUT,
                        OUT_DROPOUT, PAD_IDX, 'relu')

elman_model_name = 'Elman'  # name under which the model will be saved

gru_model = NER_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                    N_LAYERS, BIDIRECTIONAL, IN_DROPOUT, RNN_DROPOUT,
                    OUT_DROPOUT, PAD_IDX)

gru_model_name = 'GRU'  # name under which the model will be saved
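
The trainable-parameter counts printed by the training cells in this section can be cross-checked by hand. The sketch below is not part of the original notebook. It applies the standard PyTorch parameter layout: per layer and direction, `gates * (input * hidden + hidden * hidden + 2 * hidden)`, with 4 gates for LSTM, 3 for GRU and 1 for the Elman RNN. The vocabulary size (26,101) and the number of tags (10) are not stated in this section; they are assumptions, inferred as the values that make the three reported counts mutually consistent.

```python
def recurrent_params(input_dim: int, hidden_dim: int, n_layers: int,
                     bidirectional: bool, gates: int) -> int:
    """Parameters of a stacked nn.RNN / nn.GRU / nn.LSTM (two bias vectors per gate)."""
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_dim
    for _ in range(n_layers):
        per_direction = gates * (layer_input * hidden_dim
                                 + hidden_dim * hidden_dim
                                 + 2 * hidden_dim)
        total += directions * per_direction
        layer_input = hidden_dim * directions  # next layer sees all directions
    return total


def model_params(vocab_size: int, emb_dim: int, hidden_dim: int, n_tags: int,
                 n_layers: int, bidirectional: bool, gates: int) -> int:
    """Embedding + recurrent stack + linear classifier; dropout adds no parameters."""
    embedding = vocab_size * emb_dim
    recurrent = recurrent_params(emb_dim, hidden_dim, n_layers, bidirectional, gates)
    fc_in = hidden_dim * (2 if bidirectional else 1)
    classifier = fc_in * n_tags + n_tags
    return embedding + recurrent + classifier


# ASSUMPTIONS: inferred from the reported counts, not read from the notebook.
VOCAB_SIZE = 26101
N_TAGS = 10

lstm = model_params(VOCAB_SIZE, 100, 256, N_TAGS, 2, True, gates=4)
gru = model_params(VOCAB_SIZE, 100, 256, N_TAGS, 2, True, gates=3)
elman = model_params(VOCAB_SIZE, 100, 128, N_TAGS, 2, True, gates=1)

print(f"LSTM:  {lstm:,}")   # 4,925,374
print(f"GRU:   {gru:,}")    # 4,347,838
print(f"Elman: {elman:,}")  # 2,770,366
```

Because dropout layers have no weights, these totals stay the same across all the dropout experiments; only changes to depth, hidden size or embedding size move them. Under the same assumptions, setting `n_layers=4` gives 8,079,294 for the LSTM, which matches the count printed in the depth experiment.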

In [None]:
model = lstm_model
model_name = lstm_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 4,925,374 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 9s
	Train Loss: 0.429 | Train f1: 0.18 | Train precision: 0.29 | Train recall: 0.15
	 Val. Loss: 0.305 |  Val. f1: 0.36 |  Val. precision: 0.48 | Val. recall: 0.32
Epoch: 02 | Epoch Time: 0m 8s
	Train Loss: 0.231 | Train f1: 0.43 | Train precision: 0.52 | Train recall: 0.40
	 Val. Loss: 0.227 |  Val. f1: 0.48 |  Val. precision: 0.57 | Val. recall: 0.46
Epoch: 03 | Epoch Time: 0m 8s
	Train Loss: 0.166 | Train f1: 0.54 | Train precision: 0.62 | Train recall: 0.52
	 Val. Loss: 0.207 |  Val. f1: 0.54 |  Val. precision: 0.63 | Val. recall: 0.52
Epoch: 04 | Epoch Time: 0m 8s
	Train Loss: 0.127 | Train f1: 0.61 | Train precision: 0.67 | Train recall: 0.60
	 Val. Loss: 0.183 |  Val. f1: 0.60 |  Val. precision: 0.66 | Val. recall: 0.60
Epoch: 05 | Epoch Time: 0m 8s
	Train Loss: 0.102 | Train f1: 0.66 | Train precision: 0.71 | Train recall: 0.65
	 Val. Loss: 0.177 |  Val. f1: 0.60 |  Val. precision: 0.66 | Val. r

In [None]:
model = elman_model
model_name = elman_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 2,770,366 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 7s
	Train Loss: 0.473 | Train f1: 0.12 | Train precision: 0.23 | Train recall: 0.09
	 Val. Loss: 0.340 |  Val. f1: 0.34 |  Val. precision: 0.45 | Val. recall: 0.31
Epoch: 02 | Epoch Time: 0m 7s
	Train Loss: 0.267 | Train f1: 0.37 | Train precision: 0.46 | Train recall: 0.34
	 Val. Loss: 0.271 |  Val. f1: 0.46 |  Val. precision: 0.52 | Val. recall: 0.47
Epoch: 03 | Epoch Time: 0m 7s
	Train Loss: 0.195 | Train f1: 0.48 | Train precision: 0.57 | Train recall: 0.46
	 Val. Loss: 0.214 |  Val. f1: 0.52 |  Val. precision: 0.58 | Val. recall: 0.53
Epoch: 04 | Epoch Time: 0m 7s
	Train Loss: 0.151 | Train f1: 0.56 | Train precision: 0.63 | Train recall: 0.54
	 Val. Loss: 0.198 |  Val. f1: 0.57 |  Val. precision: 0.62 | Val. recall: 0.59
Epoch: 05 | Epoch Time: 0m 7s
	Train Loss: 0.124 | Train f1: 0.62 | Train precision: 0.68 | Train recall: 0.61
	 Val. Loss: 0.186 |  Val. f1: 0.59 |  Val. precision: 0.64 | Val. r

In [None]:
model = gru_model
model_name = gru_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 4,347,838 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 8s
	Train Loss: 0.451 | Train f1: 0.17 | Train precision: 0.31 | Train recall: 0.14
	 Val. Loss: 0.309 |  Val. f1: 0.38 |  Val. precision: 0.49 | Val. recall: 0.36
Epoch: 02 | Epoch Time: 0m 8s
	Train Loss: 0.247 | Train f1: 0.43 | Train precision: 0.55 | Train recall: 0.39
	 Val. Loss: 0.226 |  Val. f1: 0.50 |  Val. precision: 0.58 | Val. recall: 0.49
Epoch: 03 | Epoch Time: 0m 8s
	Train Loss: 0.171 | Train f1: 0.55 | Train precision: 0.65 | Train recall: 0.53
	 Val. Loss: 0.217 |  Val. f1: 0.53 |  Val. precision: 0.61 | Val. recall: 0.53
Epoch: 04 | Epoch Time: 0m 8s
	Train Loss: 0.127 | Train f1: 0.63 | Train precision: 0.70 | Train recall: 0.61
	 Val. Loss: 0.179 |  Val. f1: 0.58 |  Val. precision: 0.66 | Val. recall: 0.57
Epoch: 05 | Epoch Time: 0m 8s
	Train Loss: 0.104 | Train f1: 0.67 | Train precision: 0.73 | Train recall: 0.66
	 Val. Loss: 0.172 |  Val. f1: 0.61 |  Val. precision: 0.68 | Val. r

Apparently, the best results are obtained with a *dropout* of `0.75` at the input of the recurrent model. Next, we study the effect of *dropout* inside the recurrent model itself:

##### RNN *Dropout*

In [None]:
## Experiment parameters ##

# Vocabulary size; recall that the inputs are one-hot (bag-of-words) vectors.
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100  # embedding dimension
HIDDEN_DIM = 256  # dimension of the RNN layers
OUTPUT_DIM = len(NER_TAGS.vocab)  # number of classes

N_LAYERS = 2  # number of layers
IN_DROPOUT = 0.75
# RNN dropout values tried, one per run:
#RNN_DROPOUT = 0.0
#RNN_DROPOUT = 0.1
#RNN_DROPOUT = 0.25
#RNN_DROPOUT = 0.5
RNN_DROPOUT = 0.75
OUT_DROPOUT = 0.0
BIDIRECTIONAL = True

TAG_PAD_IDX = NER_TAGS.vocab.stoi[NER_TAGS.pad_token]
baseline_criterion = nn.CrossEntropyLoss(ignore_index=TAG_PAD_IDX)

# Create the models.
lstm_model = NER_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                     N_LAYERS, BIDIRECTIONAL, IN_DROPOUT, RNN_DROPOUT,
                     OUT_DROPOUT, PAD_IDX)

lstm_model_name = 'LSTM'  # name under which the model will be saved

# The Elman RNN uses half the hidden dimension and a ReLU nonlinearity.
elman_model = NER_ELMAN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM // 2, OUTPUT_DIM,
                        N_LAYERS, BIDIRECTIONAL, IN_DROPOUT, RNN_DROPOUT,
                        OUT_DROPOUT, PAD_IDX, 'relu')

elman_model_name = 'Elman'  # name under which the model will be saved

gru_model = NER_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                    N_LAYERS, BIDIRECTIONAL, IN_DROPOUT, RNN_DROPOUT,
                    OUT_DROPOUT, PAD_IDX)

gru_model_name = 'GRU'  # name under which the model will be saved

In [None]:
model = lstm_model
model_name = lstm_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 4,925,374 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 11s
	Train Loss: 0.473 | Train f1: 0.12 | Train precision: 0.22 | Train recall: 0.10
	 Val. Loss: 0.338 |  Val. f1: 0.34 |  Val. precision: 0.48 | Val. recall: 0.30
Epoch: 02 | Epoch Time: 0m 11s
	Train Loss: 0.258 | Train f1: 0.38 | Train precision: 0.48 | Train recall: 0.35
	 Val. Loss: 0.240 |  Val. f1: 0.46 |  Val. precision: 0.56 | Val. recall: 0.45
Epoch: 03 | Epoch Time: 0m 11s
	Train Loss: 0.185 | Train f1: 0.50 | Train precision: 0.58 | Train recall: 0.48
	 Val. Loss: 0.209 |  Val. f1: 0.53 |  Val. precision: 0.60 | Val. recall: 0.54
Epoch: 04 | Epoch Time: 0m 11s
	Train Loss: 0.143 | Train f1: 0.58 | Train precision: 0.64 | Train recall: 0.56
	 Val. Loss: 0.189 |  Val. f1: 0.57 |  Val. precision: 0.65 | Val. recall: 0.57
Epoch: 05 | Epoch Time: 0m 11s
	Train Loss: 0.116 | Train f1: 0.64 | Train precision: 0.69 | Train recall: 0.63
	 Val. Loss: 0.206 |  Val. f1: 0.58 |  Val. precision: 0.64 | V

In [None]:
%%script false
import numpy as np
test_sentence1 = "La pandemia de enfermedad por coronavirus de 2019-2020 es una pandemia derivada de la enfermedad por coronavirus iniciada en 2019 (COVID-19), ocasionada por el virus coronavirus 2 del síndrome respiratorio agudo grave (SARS-CoV-2). Se identificó por primera vez en diciembre de 2019 en la ciudad de Wuhan,​ capital de la provincia de Hubei, en la República Popular China"
test_sentence2 = 'Vardoc es el seudónimo de Nicolás Ignacio Liñán de Ariza Baquerizo, su canal es Vardoc1. Nicolás es uno de los Youtubers más prestigiosos de Chile, oriundo de la ciudad de Temuco, realiza gameplays y vlogs diarios en su canal. '
test_sentences = [test_sentence1, test_sentence2]
tokenized_sentence = TEXT.process(test_sentences)
print(tokenized_sentence[0])
input_ids = torch.tensor([tokenized_sentence]).cuda()
print(input_ids)

# tokenized_sentence = TEXT.tokenize(test_sentence)
# print(tokenized_sentence.text)
# input_ids = TEXT.numericalize(tokenized_sentence[0]).cuda()
# print(input_ids)
# with torch.no_grad():
#     output = elman_model(tokenized_sentence)
#     print(output.shape)
# label_indices = np.argmax(output[0].to('cpu').numpy(), axis = -1)
# #print(label_indices)
# for i in tokenized_sentence[0]:
#     print(TEXT.vocab.itos[i])


# test_sentence = "Vardoc es el seudónimo de Nicolás Ignacio Liñán de Ariza Baquerizo, su canal es Vardoc1. Nicolás es uno de los Youtubers más prestigiosos de Chile, oriundo de la ciudad de Temuco, realiza gameplays y vlogs diarios en su canal."
# tokenized_sentence = bert_tokenizer.encode(test_sentence)
# input_ids = torch.tensor([tokenized_sentence]).cuda()
# with torch.no_grad():
#     output = model(input_ids)
# label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)
# tokens = bert_tokenizer.convert_ids_to_tokens(input_ids.to('cpu').numpy()[0])
# new_tokens, new_labels = [], []
# for token, label_idx in zip(tokens, label_indices[0]):
#     if token.startswith("##"):
#         new_tokens[-1] = new_tokens[-1] + token[2:]
#     else:
#         new_labels.append(tag_values[label_idx])
#         new_tokens.append(token)
# for token, label in zip(new_tokens, new_labels):
#     print("{}\t{}".format(label, token))

In [None]:
model = elman_model
model_name = elman_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 2,770,366 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 8s
	Train Loss: 0.563 | Train f1: 0.02 | Train precision: 0.07 | Train recall: 0.01
	 Val. Loss: 0.588 |  Val. f1: 0.22 |  Val. precision: 0.31 | Val. recall: 0.21
Epoch: 02 | Epoch Time: 0m 9s
	Train Loss: 0.365 | Train f1: 0.21 | Train precision: 0.33 | Train recall: 0.18
	 Val. Loss: 0.380 |  Val. f1: 0.34 |  Val. precision: 0.37 | Val. recall: 0.37
Epoch: 03 | Epoch Time: 0m 8s
	Train Loss: 0.269 | Train f1: 0.33 | Train precision: 0.40 | Train recall: 0.32
	 Val. Loss: 0.316 |  Val. f1: 0.40 |  Val. precision: 0.47 | Val. recall: 0.43
Epoch: 04 | Epoch Time: 0m 8s
	Train Loss: 0.218 | Train f1: 0.41 | Train precision: 0.47 | Train recall: 0.39
	 Val. Loss: 0.265 |  Val. f1: 0.47 |  Val. precision: 0.52 | Val. recall: 0.50
Epoch: 05 | Epoch Time: 0m 8s
	Train Loss: 0.183 | Train f1: 0.47 | Train precision: 0.55 | Train recall: 0.46
	 Val. Loss: 0.272 |  Val. f1: 0.50 |  Val. precision: 0.56 | Val. r

In [None]:
model = gru_model
model_name = gru_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 4,347,838 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 10s
	Train Loss: 0.530 | Train f1: 0.09 | Train precision: 0.19 | Train recall: 0.06
	 Val. Loss: 0.369 |  Val. f1: 0.30 |  Val. precision: 0.45 | Val. recall: 0.26
Epoch: 02 | Epoch Time: 0m 10s
	Train Loss: 0.303 | Train f1: 0.35 | Train precision: 0.47 | Train recall: 0.31
	 Val. Loss: 0.257 |  Val. f1: 0.47 |  Val. precision: 0.58 | Val. recall: 0.44
Epoch: 03 | Epoch Time: 0m 10s
	Train Loss: 0.210 | Train f1: 0.48 | Train precision: 0.58 | Train recall: 0.45
	 Val. Loss: 0.232 |  Val. f1: 0.50 |  Val. precision: 0.60 | Val. recall: 0.50
Epoch: 04 | Epoch Time: 0m 10s
	Train Loss: 0.164 | Train f1: 0.56 | Train precision: 0.64 | Train recall: 0.54
	 Val. Loss: 0.196 |  Val. f1: 0.58 |  Val. precision: 0.63 | Val. recall: 0.59
Epoch: 05 | Epoch Time: 0m 10s
	Train Loss: 0.133 | Train f1: 0.61 | Train precision: 0.68 | Train recall: 0.60
	 Val. Loss: 0.197 |  Val. f1: 0.58 |  Val. precision: 0.66 | V

##### Output *Dropout*

For these experiments, two GRU-based models are considered, since a GRU recurrent network without *dropout* and one with a *dropout* of `0.5` produced results that were quite competitive with each other. For the other two architectures, on average and overall, better results are obtained without *dropout* in the recurrent network.

In [None]:
## Experiment parameters ##

# Vocabulary size; recall that the inputs are one-hot (bag-of-words) vectors.
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100  # embedding dimension
HIDDEN_DIM = 256  # dimension of the RNN layers
OUTPUT_DIM = len(NER_TAGS.vocab)  # number of classes

N_LAYERS = 2  # number of layers
IN_DROPOUT = 0.75
RNN_DROPOUT = 0.0
# Output dropout values tried, one per run:
#OUT_DROPOUT = 0.0
#OUT_DROPOUT = 0.1
#OUT_DROPOUT = 0.25
#OUT_DROPOUT = 0.5
OUT_DROPOUT = 0.75
BIDIRECTIONAL = True

TAG_PAD_IDX = NER_TAGS.vocab.stoi[NER_TAGS.pad_token]
baseline_criterion = nn.CrossEntropyLoss(ignore_index=TAG_PAD_IDX)

# Create the models.
lstm_model = NER_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                     N_LAYERS, BIDIRECTIONAL, IN_DROPOUT, RNN_DROPOUT,
                     OUT_DROPOUT, PAD_IDX)

lstm_model_name = 'LSTM'  # name under which the model will be saved

# The Elman RNN uses half the hidden dimension and a ReLU nonlinearity.
elman_model = NER_ELMAN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM // 2, OUTPUT_DIM,
                        N_LAYERS, BIDIRECTIONAL, IN_DROPOUT, RNN_DROPOUT,
                        OUT_DROPOUT, PAD_IDX, 'relu')

elman_model_name = 'Elman'  # name under which the model will be saved

# First GRU variant: no dropout inside the recurrent network.
gru_model1 = NER_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                     N_LAYERS, BIDIRECTIONAL, IN_DROPOUT, RNN_DROPOUT,
                     OUT_DROPOUT, PAD_IDX)

gru_model_name1 = 'GRU'  # name under which the model will be saved

# Second GRU variant: dropout of 0.5 inside the recurrent network.
gru_model2 = NER_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                     N_LAYERS, BIDIRECTIONAL, IN_DROPOUT, 0.5,
                     OUT_DROPOUT, PAD_IDX)

# NOTE: identical to gru_model_name1; if saved models are keyed by this name,
# both GRU variants would share the same file.
gru_model_name2 = 'GRU'  # name under which the model will be saved

In [None]:
model = lstm_model
model_name = lstm_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 4,925,374 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 26s
	Train Loss: 0.487 | Train f1: 0.12 | Train precision: 0.21 | Train recall: 0.10
	 Val. Loss: 0.328 |  Val. f1: 0.32 |  Val. precision: 0.43 | Val. recall: 0.30
Epoch: 02 | Epoch Time: 0m 26s
	Train Loss: 0.265 | Train f1: 0.38 | Train precision: 0.48 | Train recall: 0.36
	 Val. Loss: 0.243 |  Val. f1: 0.46 |  Val. precision: 0.56 | Val. recall: 0.43
Epoch: 03 | Epoch Time: 0m 26s
	Train Loss: 0.190 | Train f1: 0.49 | Train precision: 0.57 | Train recall: 0.47
	 Val. Loss: 0.210 |  Val. f1: 0.49 |  Val. precision: 0.59 | Val. recall: 0.48
Epoch: 04 | Epoch Time: 0m 26s
	Train Loss: 0.148 | Train f1: 0.57 | Train precision: 0.64 | Train recall: 0.56
	 Val. Loss: 0.199 |  Val. f1: 0.54 |  Val. precision: 0.62 | Val. recall: 0.55
Epoch: 05 | Epoch Time: 0m 26s
	Train Loss: 0.120 | Train f1: 0.63 | Train precision: 0.68 | Train recall: 0.62
	 Val. Loss: 0.168 |  Val. f1: 0.61 |  Val. precision: 0.66 | V

In [None]:
model = elman_model
model_name = elman_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 2,770,366 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 8s
	Train Loss: 0.587 | Train f1: 0.03 | Train precision: 0.08 | Train recall: 0.02
	 Val. Loss: 0.487 |  Val. f1: 0.24 |  Val. precision: 0.35 | Val. recall: 0.22
Epoch: 02 | Epoch Time: 0m 8s
	Train Loss: 0.356 | Train f1: 0.23 | Train precision: 0.34 | Train recall: 0.20
	 Val. Loss: 0.305 |  Val. f1: 0.34 |  Val. precision: 0.41 | Val. recall: 0.35
Epoch: 03 | Epoch Time: 0m 8s
	Train Loss: 0.263 | Train f1: 0.34 | Train precision: 0.40 | Train recall: 0.32
	 Val. Loss: 0.250 |  Val. f1: 0.42 |  Val. precision: 0.48 | Val. recall: 0.42
Epoch: 04 | Epoch Time: 0m 8s
	Train Loss: 0.212 | Train f1: 0.41 | Train precision: 0.48 | Train recall: 0.40
	 Val. Loss: 0.222 |  Val. f1: 0.47 |  Val. precision: 0.54 | Val. recall: 0.48
Epoch: 05 | Epoch Time: 0m 8s
	Train Loss: 0.178 | Train f1: 0.47 | Train precision: 0.54 | Train recall: 0.47
	 Val. Loss: 0.208 |  Val. f1: 0.51 |  Val. precision: 0.57 | Val. r

  avg = a.mean(axis)
  ret = ret.dtype.type(ret / rcount)


Epoch: 17 | Epoch Time: 0m 8s
	Train Loss: 0.060 | Train f1: nan | Train precision: nan | Train recall: nan
	 Val. Loss: 0.187 |  Val. f1: 0.63 |  Val. precision: 0.68 | Val. recall: 0.64
Epoch: 18 | Epoch Time: 0m 8s
	Train Loss: 0.060 | Train f1: 0.78 | Train precision: 0.81 | Train recall: 0.78
	 Val. Loss: 0.195 |  Val. f1: 0.63 |  Val. precision: 0.67 | Val. recall: 0.64
Epoch: 19 | Epoch Time: 0m 8s
	Train Loss: 0.055 | Train f1: 0.79 | Train precision: 0.81 | Train recall: 0.79
	 Val. Loss: 0.191 |  Val. f1: 0.64 |  Val. precision: 0.68 | Val. recall: 0.65
Epoch: 20 | Epoch Time: 0m 8s
	Train Loss: 0.053 | Train f1: 0.80 | Train precision: 0.82 | Train recall: 0.80
	 Val. Loss: 0.194 |  Val. f1: 0.64 |  Val. precision: 0.67 | Val. recall: 0.65
Epoch: 21 | Epoch Time: 0m 8s
	Train Loss: 0.049 | Train f1: 0.80 | Train precision: 0.82 | Train recall: 0.80
	 Val. Loss: 0.207 |  Val. f1: 0.65 |  Val. precision: 0.69 | Val. recall: 0.66
Epoch: 22 | Epoch Time: 0m 8s
	Train Loss: 0.049

In [None]:
model = gru_model1
model_name = gru_model_name1
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 4,347,838 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 8s
	Train Loss: 0.563 | Train f1: 0.10 | Train precision: 0.22 | Train recall: 0.08
	 Val. Loss: 0.378 |  Val. f1: 0.27 |  Val. precision: 0.42 | Val. recall: 0.22
Epoch: 02 | Epoch Time: 0m 8s
	Train Loss: 0.308 | Train f1: 0.34 | Train precision: 0.45 | Train recall: 0.30
	 Val. Loss: 0.283 |  Val. f1: 0.41 |  Val. precision: 0.51 | Val. recall: 0.40
Epoch: 03 | Epoch Time: 0m 8s
	Train Loss: 0.215 | Train f1: 0.48 | Train precision: 0.57 | Train recall: 0.45
	 Val. Loss: 0.223 |  Val. f1: 0.50 |  Val. precision: 0.60 | Val. recall: 0.48
Epoch: 04 | Epoch Time: 0m 8s
	Train Loss: 0.164 | Train f1: 0.56 | Train precision: 0.63 | Train recall: 0.54
	 Val. Loss: 0.192 |  Val. f1: 0.56 |  Val. precision: 0.65 | Val. recall: 0.54
Epoch: 05 | Epoch Time: 0m 8s
	Train Loss: 0.133 | Train f1: 0.62 | Train precision: 0.68 | Train recall: 0.60
	 Val. Loss: 0.191 |  Val. f1: 0.58 |  Val. precision: 0.67 | Val. r

  avg = a.mean(axis)
  ret = ret.dtype.type(ret / rcount)


Epoch: 14 | Epoch Time: 0m 8s
	Train Loss: 0.046 | Train f1: nan | Train precision: nan | Train recall: nan
	 Val. Loss: 0.191 |  Val. f1: 0.64 |  Val. precision: 0.70 | Val. recall: 0.63
Epoch: 15 | Epoch Time: 0m 8s
	Train Loss: 0.042 | Train f1: 0.83 | Train precision: 0.85 | Train recall: 0.83
	 Val. Loss: 0.201 |  Val. f1: 0.64 |  Val. precision: 0.70 | Val. recall: 0.63
Epoch: 16 | Epoch Time: 0m 8s
	Train Loss: 0.038 | Train f1: 0.84 | Train precision: 0.86 | Train recall: 0.84
	 Val. Loss: 0.196 |  Val. f1: 0.64 |  Val. precision: 0.70 | Val. recall: 0.63
Epoch: 17 | Epoch Time: 0m 8s
	Train Loss: 0.037 | Train f1: 0.85 | Train precision: 0.86 | Train recall: 0.84
	 Val. Loss: 0.222 |  Val. f1: 0.63 |  Val. precision: 0.70 | Val. recall: 0.63
Epoch: 18 | Epoch Time: 0m 8s
	Train Loss: 0.036 | Train f1: 0.85 | Train precision: 0.86 | Train recall: 0.84
	 Val. Loss: 0.203 |  Val. f1: 0.66 |  Val. precision: 0.71 | Val. recall: 0.65
Epoch: 19 | Epoch Time: 0m 8s
	Train Loss: 0.033

In [None]:
model = gru_model2
model_name = gru_model_name2
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 4,347,838 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 10s
	Train Loss: 0.615 | Train f1: 0.05 | Train precision: 0.14 | Train recall: 0.04
	 Val. Loss: 0.439 |  Val. f1: 0.19 |  Val. precision: 0.36 | Val. recall: 0.14
Epoch: 02 | Epoch Time: 0m 10s
	Train Loss: 0.342 | Train f1: 0.29 | Train precision: 0.40 | Train recall: 0.25
	 Val. Loss: 0.282 |  Val. f1: 0.42 |  Val. precision: 0.54 | Val. recall: 0.39
Epoch: 03 | Epoch Time: 0m 10s
	Train Loss: 0.233 | Train f1: 0.44 | Train precision: 0.53 | Train recall: 0.41
	 Val. Loss: 0.223 |  Val. f1: 0.51 |  Val. precision: 0.60 | Val. recall: 0.49
Epoch: 04 | Epoch Time: 0m 10s
	Train Loss: 0.179 | Train f1: 0.54 | Train precision: 0.62 | Train recall: 0.52
	 Val. Loss: 0.216 |  Val. f1: 0.55 |  Val. precision: 0.64 | Val. recall: 0.52
Epoch: 05 | Epoch Time: 0m 10s
	Train Loss: 0.144 | Train f1: 0.59 | Train precision: 0.66 | Train recall: 0.58
	 Val. Loss: 0.200 |  Val. f1: 0.57 |  Val. precision: 0.65 | V

  avg = a.mean(axis)
  ret = ret.dtype.type(ret / rcount)


Epoch: 15 | Epoch Time: 0m 10s
	Train Loss: 0.050 | Train f1: nan | Train precision: nan | Train recall: nan
	 Val. Loss: 0.206 |  Val. f1: 0.64 |  Val. precision: 0.70 | Val. recall: 0.64
Epoch: 16 | Epoch Time: 0m 10s
	Train Loss: 0.047 | Train f1: 0.82 | Train precision: 0.84 | Train recall: 0.82
	 Val. Loss: 0.205 |  Val. f1: 0.64 |  Val. precision: 0.70 | Val. recall: 0.63
Epoch: 17 | Epoch Time: 0m 10s
	Train Loss: 0.044 | Train f1: 0.82 | Train precision: 0.84 | Train recall: 0.82
	 Val. Loss: 0.227 |  Val. f1: 0.64 |  Val. precision: 0.70 | Val. recall: 0.62
Epoch: 18 | Epoch Time: 0m 10s
	Train Loss: 0.042 | Train f1: 0.83 | Train precision: 0.85 | Train recall: 0.83
	 Val. Loss: 0.228 |  Val. f1: 0.63 |  Val. precision: 0.70 | Val. recall: 0.62
Epoch: 19 | Epoch Time: 0m 10s
	Train Loss: 0.039 | Train f1: 0.84 | Train precision: 0.86 | Train recall: 0.83
	 Val. Loss: 0.218 |  Val. f1: 0.64 |  Val. precision: 0.70 | Val. recall: 0.63
Epoch: 20 | Epoch Time: 0m 10s
	Train Loss:

Although adding this third *dropout* component improves performance in some cases, the gains usually come with a deterioration in other respects, typically the *loss*. We therefore conclude that it is preferable to leave this component out.

#### Embedding Layer

This section explores how to improve model performance through the embeddings. Two approaches are considered: training the Embedding layer of the models used so far while varying the dimensionality of the representations, or using pre-trained embeddings (obtained from https://github.com/BotCenter/spanishWordEmbeddings). To compare the effect of the pre-trained embeddings fairly, we use architectures whose Embedding layer has the same dimensionality as the vectors obtained from that link.
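To give a sense of what varying the dimensionality means for model size, the following minimal sketch computes how many weights the Embedding layer holds for the embedding sizes used by the download cell further down (10, 30, 100 and 300). It assumes the 26,101-token vocabulary that this notebook prints after `TEXT.build_vocab`; the helper name `embedding_parameters` is ours and not part of the notebook.

```python
# Hypothetical helper (not part of the notebook): an embedding layer stores
# one vector of size `embedding_dim` per vocabulary entry.
def embedding_parameters(vocab_size: int, embedding_dim: int) -> int:
    return vocab_size * embedding_dim

VOCAB_SIZE = 26101  # "Tokens únicos en TEXT" as printed in this notebook

# Embedding sizes taken from the W2V_SIZE values in the download cell.
embedding_sizes = {"XS": 10, "S": 30, "M": 100, "L": 300}
table = {name: embedding_parameters(VOCAB_SIZE, dim)
         for name, dim in embedding_sizes.items()}

for name, n_params in table.items():
    print(f"{name:>2}: {n_params:,} embedding weights")
```

With 300-dimensional vectors the embedding table alone holds 7,830,300 weights, which is why freezing it changes the trainable-parameter count so sharply.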

Moreover, since the pre-trained vectors can be loaded directly into the Embedding layer of our models, they can continue to be trained, so this option is evaluated as well.

To load the pre-trained embeddings into our models, a small modification is made to the model classes:

In [None]:
torch.cuda.empty_cache()

In [None]:
class NER_RNN(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim,
                 n_layers, 
                 bidirectional, 
                 in_dropout,
                 rnn_dropout,
                 out_dropout, 
                 pad_idx):

        super().__init__()

        # Embedding layer
        self.embedding = nn.Embedding(input_dim,
                                      embedding_dim,
                                      padding_idx=pad_idx)

        # LSTM layer (PyTorch applies this dropout only between stacked layers)
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers=n_layers,
                            bidirectional=bidirectional,
                            dropout=rnn_dropout if n_layers > 1 else 0)

        # Output layer
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim,
                            output_dim)

        # Dropout on the embeddings and on the recurrent outputs
        self.in_dropout = nn.Dropout(in_dropout)
        self.out_dropout = nn.Dropout(out_dropout)

    def forward(self, text):

        # text = [sent len, batch size]

        # Map the token indices to embeddings
        embedded = self.in_dropout(self.embedding(text))
        # embedded = [sent len, batch size, emb dim]

        # Run the embeddings through the LSTM
        outputs, (hidden, cell) = self.lstm(embedded)
        # outputs = [sent len, batch size, hid dim * n directions]
        # hidden/cell = [n layers * n directions, batch size, hid dim]

        # Predict with the output layer
        predictions = self.fc(self.out_dropout(outputs))
        # predictions = [sent len, batch size, output dim]

        return predictions

    def load_pretrained_embeddings(self, pre_trained_emb, requires_grad):
        # Replace the embedding layer with the pre-trained vectors (relies on the
        # global `device`); requires_grad=True allows fine-tuning them.
        self.embedding = nn.Embedding.from_pretrained(pre_trained_emb).to(device)
        self.embedding.weight.requires_grad = requires_grad

class NER_ELMAN(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim,
                 n_layers, 
                 bidirectional, 
                 in_dropout,
                 rnn_dropout,
                 out_dropout, 
                 pad_idx,
                 nonlinearity = 'tanh'):

        super().__init__()

        # Embedding layer
        self.embedding = nn.Embedding(input_dim,
                                      embedding_dim,
                                      padding_idx=pad_idx)

        # Elman RNN layer (PyTorch applies this dropout only between stacked layers)
        self.rnn = nn.RNN(embedding_dim,
                          hidden_dim,
                          num_layers=n_layers,
                          bidirectional=bidirectional,
                          dropout=rnn_dropout if n_layers > 1 else 0,
                          nonlinearity=nonlinearity)

        # Output layer
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim,
                            output_dim)

        # Dropout on the embeddings and on the recurrent outputs
        self.in_dropout = nn.Dropout(in_dropout)
        self.out_dropout = nn.Dropout(out_dropout)

    def forward(self, text):

        # text = [sent len, batch size]

        # Map the token indices to embeddings
        embedded = self.in_dropout(self.embedding(text))
        # embedded = [sent len, batch size, emb dim]

        # Run the embeddings through the Elman RNN
        outputs, hidden = self.rnn(embedded)
        # outputs = [sent len, batch size, hid dim * n directions]
        # hidden = [n layers * n directions, batch size, hid dim]

        # Predict with the output layer
        predictions = self.fc(self.out_dropout(outputs))
        # predictions = [sent len, batch size, output dim]

        return predictions

    def load_pretrained_embeddings(self, pre_trained_emb, requires_grad):
        # Replace the embedding layer with the pre-trained vectors (relies on the
        # global `device`); requires_grad=True allows fine-tuning them.
        self.embedding = nn.Embedding.from_pretrained(pre_trained_emb).to(device)
        self.embedding.weight.requires_grad = requires_grad

class NER_GRU(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim,
                 n_layers, 
                 bidirectional, 
                 in_dropout,
                 rnn_dropout,
                 out_dropout, 
                 pad_idx):

        super().__init__()

        # Embedding layer
        self.embedding = nn.Embedding(input_dim,
                                      embedding_dim,
                                      padding_idx=pad_idx)

        # GRU layer (PyTorch applies this dropout only between stacked layers)
        self.gru = nn.GRU(embedding_dim,
                          hidden_dim,
                          num_layers=n_layers,
                          bidirectional=bidirectional,
                          dropout=rnn_dropout if n_layers > 1 else 0)

        # Output layer
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim,
                            output_dim)

        # Dropout on the embeddings and on the recurrent outputs
        self.in_dropout = nn.Dropout(in_dropout)
        self.out_dropout = nn.Dropout(out_dropout)

    def forward(self, text):

        # text = [sent len, batch size]

        # Map the token indices to embeddings
        embedded = self.in_dropout(self.embedding(text))
        # embedded = [sent len, batch size, emb dim]

        # Run the embeddings through the GRU
        outputs, hidden = self.gru(embedded)
        # outputs = [sent len, batch size, hid dim * n directions]
        # hidden = [n layers * n directions, batch size, hid dim]

        # Predict with the output layer
        predictions = self.fc(self.out_dropout(outputs))
        # predictions = [sent len, batch size, output dim]

        return predictions

    def load_pretrained_embeddings(self, pre_trained_emb, requires_grad):
        # Replace the embedding layer with the pre-trained vectors (relies on the
        # global `device`); requires_grad=True allows fine-tuning them.
        self.embedding = nn.Embedding.from_pretrained(pre_trained_emb).to(device)
        self.embedding.weight.requires_grad = requires_grad
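
As a quick sanity check on the shape comments in `forward`, the sketch below traces tensor shapes through these models using plain tuples. It assumes the `[sent len, batch size]` input layout used above; `trace_shapes` is a hypothetical helper that is not part of the notebook, and the sentence length of 20 is an arbitrary illustrative value.

```python
def trace_shapes(sent_len, batch_size, embedding_dim, hidden_dim,
                 output_dim, n_layers, bidirectional):
    """Return the shape of each intermediate tensor in `forward`."""
    n_directions = 2 if bidirectional else 1
    return {
        "text": (sent_len, batch_size),
        "embedded": (sent_len, batch_size, embedding_dim),
        # The recurrent layer concatenates both directions on the last axis,
        # which is why `fc` takes hidden_dim * 2 inputs when bidirectional.
        "outputs": (sent_len, batch_size, hidden_dim * n_directions),
        "hidden": (n_layers * n_directions, batch_size, hidden_dim),
        "predictions": (sent_len, batch_size, output_dim),
    }

shapes = trace_shapes(sent_len=20, batch_size=16, embedding_dim=300,
                      hidden_dim=256, output_dim=10, n_layers=2,
                      bidirectional=True)
for name, shape in shapes.items():
    print(f"{name:<12} {shape}")
```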

In [None]:
## These lines are repeated here for convenience

# First Field: TEXT. Holds the tokens of each sequence.
TEXT = data.Field(lower=False) 

# Second Field: NER_TAGS. Holds the tag associated with each token.
NER_TAGS = data.Field(unk_token=None)

fields = (("text", TEXT), ("nertags", NER_TAGS))

train_data, valid_data, test_data = datasets.SequenceTaggingDataset.splits(
    path="./",
    train="train_NER_esp.txt",
    validation="val_NER_esp.txt",
    test="test_NER_esp.txt",
    fields=fields,
    encoding="iso-8859-1",
    separator=" "
)

print(f"Numero de ejemplos de entrenamiento: {len(train_data)}")
print(f"Número de ejemplos de validación: {len(valid_data)}")
print(f"Número de ejemplos de test (competencia): {len(test_data)}")

TEXT.build_vocab(train_data)
NER_TAGS.build_vocab(train_data)

print(f"Tokens únicos en TEXT: {len(TEXT.vocab)}")
print(f"Tokens únicos en NER_TAGS: {len(NER_TAGS.vocab)}")

BATCH_SIZE = 16  # reduce this if you run out of memory.

# Use CUDA if it is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using', device)

# Build the training, validation and test iterators
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    device=device,
    sort=False,
)

Numero de ejemplos de entrenamiento: 8323
Número de ejemplos de validación: 1915
Número de ejemplos de test (competencia): 1517
Tokens únicos en TEXT: 26101
Tokens únicos en NER_TAGS: 10
Using cuda


In [None]:
%%script false
for batch in train_iterator:
    print(batch.text)
    print(batch.text.shape)
    print(model.embedding(batch.text).shape)
    break

In [None]:
import torchtext.vocab as vocab
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated
from gensim.models.wrappers import FastText
from gensim.models.keyedvectors import KeyedVectors

filename = 'pretrained_model.bin'
!rm $filename
filename = 'pretrained_model.vec'
!rm $filename

pretrained_dict = {0: 'XS',
                   1: 'S',
                   2: 'M',
                   3: 'L',
                   4: 'new L'}

loadKey = 4

if pretrained_dict[loadKey] == 'XS':
    filename = 'pretrained_model.bin'
    !wget -O $filename https://zenodo.org/record/3234051/files/embeddings-xs-model.bin -nc
    W2V_SIZE = 10

elif pretrained_dict[loadKey] == 'S':
    filename = 'pretrained_model.bin'
    !wget -O $filename https://zenodo.org/record/3234051/files/embeddings-s-model.bin  -nc
    W2V_SIZE = 30

elif pretrained_dict[loadKey] == 'M':
    filename = 'pretrained_model.vec'
    #!wget -O $filename https://zenodo.org/record/3234051/files/embeddings-m-model.bin  -nc
    !wget -O $filename https://zenodo.org/record/3234051/files/embeddings-m-model.vec  -nc
    W2V_SIZE = 100

elif pretrained_dict[loadKey] == 'L':
    filename = 'pretrained_model.vec'
    #!wget -O $filename https://zenodo.org/record/3234051/files/embeddings-l-model.bin  -nc
    !wget -O $filename https://zenodo.org/record/3234051/files/embeddings-l-model.vec  -nc
    W2V_SIZE = 300

elif pretrained_dict[loadKey] == 'new L':
    filename = 'pretrained_model.vec'
    #!wget -O $filename https://zenodo.org/record/3255001/files/embeddings-new_large-general_3B_fasttext.bin  -nc
    !wget -O $filename https://zenodo.org/record/3255001/files/embeddings-new_large-general_3B_fasttext.vec  -nc
    W2V_SIZE = 300

if W2V_SIZE <= 30:
    wordvectors_file = 'pretrained_model'
    wordvectors = FastText.load_fasttext_format(wordvectors_file)

else:
    wordvectors_file = 'pretrained_model.vec'
    wordvectors = KeyedVectors.load_word2vec_format(wordvectors_file)

rm: cannot remove 'pretrained_model.bin': No such file or directory
--2020-07-31 03:12:02--  https://zenodo.org/record/3255001/files/embeddings-new_large-general_3B_fasttext.vec
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3801508151 (3.5G) [application/octet-stream]
Saving to: ‘pretrained_model.vec’


2020-07-31 03:13:07 (56.1 MB/s) - ‘pretrained_model.vec’ saved [3801508151/3801508151]
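
For reference, the `.vec` file downloaded above is in the plain-text word2vec format that `KeyedVectors.load_word2vec_format` reads: a header line with the number of vectors and their dimension, followed by one line per token with its components. The sketch below parses a tiny in-memory sample of that format. The sample tokens and values are made up for illustration, and `parse_vec` is a hypothetical helper, not part of gensim.

```python
import io

def parse_vec(handle):
    """Parse word2vec text format: '<count> <dim>' then 'token v1 ... vdim'."""
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dim, vectors

# Made-up sample with 2 tokens of dimension 3 (not real embedding values).
sample = io.StringIO("2 3\nchile 0.1 0.2 0.3\nsantiago -0.4 0.5 0.6\n")
count, dim, vectors = parse_vec(sample)
print(count, dim, sorted(vectors))
```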





##### Embedding Dimension

In [None]:
## Experiment parameters ##

# Vocabulary size. Recall that each input token is conceptually a one-hot vector over the vocabulary.
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = W2V_SIZE  # embedding dimension.
print(EMBEDDING_DIM)
HIDDEN_DIM = 256  # hidden dimension of the RNN layers
OUTPUT_DIM = len(NER_TAGS.vocab)  # number of classes

N_LAYERS = 2  # number of recurrent layers.
IN_DROPOUT = 0.75
RNN_DROPOUT = 0.0
OUT_DROPOUT = 0.0
BIDIRECTIONAL = True

TAG_PAD_IDX = NER_TAGS.vocab.stoi[NER_TAGS.pad_token]
baseline_criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)

# Create the models.
lstm_model = NER_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                     N_LAYERS, BIDIRECTIONAL, IN_DROPOUT, RNN_DROPOUT, 
                     OUT_DROPOUT, PAD_IDX)

lstm_model_name = 'LSTM'  # name under which the model is saved

# The Elman RNN uses half the hidden dimension and a ReLU nonlinearity.
elman_model = NER_ELMAN(INPUT_DIM, EMBEDDING_DIM, int(HIDDEN_DIM/2), OUTPUT_DIM,
                        N_LAYERS, BIDIRECTIONAL, IN_DROPOUT, RNN_DROPOUT, 
                        OUT_DROPOUT, PAD_IDX, 'relu')

elman_model_name = 'Elman'  # name under which the model is saved

gru_model1 = NER_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                    N_LAYERS, BIDIRECTIONAL, IN_DROPOUT, RNN_DROPOUT, 
                    OUT_DROPOUT, PAD_IDX)

gru_model_name1 = 'GRU'  # name under which the model is saved

# Same as gru_model1, but with an RNN dropout of 0.5.
gru_model2 = NER_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                    N_LAYERS, BIDIRECTIONAL, IN_DROPOUT, 0.5, 
                    OUT_DROPOUT, PAD_IDX)

gru_model_name2 = 'GRU'  # name under which the model is saved

300
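
The trainable-parameter counts printed by the training cells in this section can be reproduced by hand from the hyperparameters above. The sketch below does so assuming PyTorch's standard layout for recurrent layers (one input-to-hidden and one hidden-to-hidden weight matrix plus two bias vectors per gate block, per direction and per layer), a vocabulary of 26,101 tokens and 10 output tags, as printed earlier in this notebook. The function names are hypothetical helpers, not part of the notebook.

```python
def recurrent_parameters(input_dim, hidden_dim, n_layers, bidirectional, n_gates):
    """Weights and biases of a stacked RNN (n_gates: 1 Elman, 3 GRU, 4 LSTM)."""
    n_directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_dim
    for _ in range(n_layers):
        per_direction = n_gates * hidden_dim * (layer_input + hidden_dim)  # W_ih and W_hh
        per_direction += 2 * n_gates * hidden_dim                          # b_ih and b_hh
        total += n_directions * per_direction
        layer_input = hidden_dim * n_directions  # next layer sees both directions
    return total

def model_parameters(vocab_size, embedding_dim, hidden_dim, output_dim,
                     n_layers, bidirectional, n_gates, train_embedding=True):
    """Trainable parameters of embedding + recurrent stack + linear output layer."""
    n_directions = 2 if bidirectional else 1
    embedding = vocab_size * embedding_dim if train_embedding else 0
    rnn = recurrent_parameters(embedding_dim, hidden_dim, n_layers,
                               bidirectional, n_gates)
    fc = hidden_dim * n_directions * output_dim + output_dim
    return embedding + rnn + fc

VOCAB, EMB, OUT = 26101, 300, 10
lstm = model_parameters(VOCAB, EMB, 256, OUT, 2, True, n_gates=4)
gru = model_parameters(VOCAB, EMB, 256, OUT, 2, True, n_gates=3)
elman = model_parameters(VOCAB, EMB, 128, OUT, 2, True, n_gates=1)
lstm_frozen = model_parameters(VOCAB, EMB, 256, OUT, 2, True, n_gates=4,
                               train_embedding=False)
print(f"{lstm:,} {gru:,} {elman:,} {lstm_frozen:,}")
```

These values agree with the 10,555,174 (LSTM), 9,875,238 (GRU) and 8,041,766 (Elman) parameters reported by the training cells, and with the 2,724,874 reported for the LSTM once the embeddings are frozen.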


In [None]:
torch.cuda.empty_cache()
model = lstm_model
model_name = lstm_model_name
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 10,555,174 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 10s
	Train Loss: 0.377 | Train f1: 0.24 | Train precision: 0.35 | Train recall: 0.21
	 Val. Loss: 0.287 |  Val. f1: 0.44 |  Val. precision: 0.52 | Val. recall: 0.44
Epoch: 02 | Epoch Time: 0m 10s
	Train Loss: 0.176 | Train f1: 0.54 | Train precision: 0.63 | Train recall: 0.51
	 Val. Loss: 0.198 |  Val. f1: 0.55 |  Val. precision: 0.63 | Val. recall: 0.55
Epoch: 03 | Epoch Time: 0m 10s
	Train Loss: 0.113 | Train f1: 0.65 | Train precision: 0.71 | Train recall: 0.64
	 Val. Loss: 0.171 |  Val. f1: 0.61 |  Val. precision: 0.67 | Val. recall: 0.60
Epoch: 04 | Epoch Time: 0m 10s
	Train Loss: 0.081 | Train f1: 0.72 | Train precision: 0.76 | Train recall: 0.71
	 Val. Loss: 0.161 |  Val. f1: 0.65 |  Val. precision: 0.69 | Val. recall: 0.66
Epoch: 05 | Epoch Time: 0m 10s
	Train Loss: 0.061 | Train f1: 0.76 | Train precision: 0.79 | Train recall: 0.76
	 Val. Loss: 0.176 |  Val. f1: 0.63 |  Val. precision: 0.69 | 

In [None]:
torch.cuda.empty_cache()
model = elman_model
model_name = elman_model_name
criterion = baseline_criterion
  
model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 8,041,766 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 9s
	Train Loss: 0.452 | Train f1: 0.15 | Train precision: 0.24 | Train recall: 0.13
	 Val. Loss: 0.318 |  Val. f1: 0.38 |  Val. precision: 0.45 | Val. recall: 0.38
Epoch: 02 | Epoch Time: 0m 9s
	Train Loss: 0.220 | Train f1: 0.43 | Train precision: 0.53 | Train recall: 0.41
	 Val. Loss: 0.266 |  Val. f1: 0.48 |  Val. precision: 0.55 | Val. recall: 0.51
Epoch: 03 | Epoch Time: 0m 9s
	Train Loss: 0.148 | Train f1: 0.57 | Train precision: 0.64 | Train recall: 0.56
	 Val. Loss: 0.195 |  Val. f1: 0.57 |  Val. precision: 0.62 | Val. recall: 0.59
Epoch: 04 | Epoch Time: 0m 9s
	Train Loss: 0.109 | Train f1: 0.65 | Train precision: 0.70 | Train recall: 0.65
	 Val. Loss: 0.182 |  Val. f1: 0.60 |  Val. precision: 0.65 | Val. recall: 0.62
Epoch: 05 | Epoch Time: 0m 9s
	Train Loss: 0.085 | Train f1: 0.71 | Train precision: 0.75 | Train recall: 0.70
	 Val. Loss: 0.192 |  Val. f1: 0.61 |  Val. precision: 0.65 | Val. r

In [None]:
torch.cuda.empty_cache()
model = gru_model1
model_name = gru_model_name1
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 9,875,238 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 9s
	Train Loss: 0.404 | Train f1: 0.23 | Train precision: 0.37 | Train recall: 0.19
	 Val. Loss: 0.283 |  Val. f1: 0.44 |  Val. precision: 0.58 | Val. recall: 0.42
Epoch: 02 | Epoch Time: 0m 9s
	Train Loss: 0.196 | Train f1: 0.52 | Train precision: 0.62 | Train recall: 0.50
	 Val. Loss: 0.205 |  Val. f1: 0.57 |  Val. precision: 0.65 | Val. recall: 0.56
Epoch: 03 | Epoch Time: 0m 10s
	Train Loss: 0.122 | Train f1: 0.64 | Train precision: 0.70 | Train recall: 0.62
	 Val. Loss: 0.213 |  Val. f1: 0.57 |  Val. precision: 0.64 | Val. recall: 0.58
Epoch: 04 | Epoch Time: 0m 9s
	Train Loss: 0.088 | Train f1: 0.72 | Train precision: 0.76 | Train recall: 0.71
	 Val. Loss: 0.192 |  Val. f1: 0.60 |  Val. precision: 0.68 | Val. recall: 0.59
Epoch: 05 | Epoch Time: 0m 9s
	Train Loss: 0.066 | Train f1: 0.75 | Train precision: 0.79 | Train recall: 0.74
	 Val. Loss: 0.179 |  Val. f1: 0.63 |  Val. precision: 0.68 | Val. 

In [None]:
torch.cuda.empty_cache()
model = gru_model2
model_name = gru_model_name2
criterion = baseline_criterion

model.apply(init_weights)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 9,875,238 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 11s
	Train Loss: 0.438 | Train f1: 0.19 | Train precision: 0.32 | Train recall: 0.16
	 Val. Loss: 0.298 |  Val. f1: 0.41 |  Val. precision: 0.54 | Val. recall: 0.37
Epoch: 02 | Epoch Time: 0m 11s
	Train Loss: 0.219 | Train f1: 0.47 | Train precision: 0.58 | Train recall: 0.45
	 Val. Loss: 0.214 |  Val. f1: 0.54 |  Val. precision: 0.62 | Val. recall: 0.53
Epoch: 03 | Epoch Time: 0m 11s
	Train Loss: 0.142 | Train f1: 0.60 | Train precision: 0.67 | Train recall: 0.58
	 Val. Loss: 0.198 |  Val. f1: 0.57 |  Val. precision: 0.65 | Val. recall: 0.57
Epoch: 04 | Epoch Time: 0m 11s
	Train Loss: 0.101 | Train f1: 0.68 | Train precision: 0.73 | Train recall: 0.67
	 Val. Loss: 0.184 |  Val. f1: 0.62 |  Val. precision: 0.68 | Val. recall: 0.62




Epoch: 05 | Epoch Time: 0m 11s
	Train Loss: 0.076 | Train f1: nan | Train precision: nan | Train recall: nan
	 Val. Loss: 0.180 |  Val. f1: 0.63 |  Val. precision: 0.69 | Val. recall: 0.63
Epoch: 06 | Epoch Time: 0m 11s
	Train Loss: 0.062 | Train f1: 0.77 | Train precision: 0.80 | Train recall: 0.76
	 Val. Loss: 0.185 |  Val. f1: 0.64 |  Val. precision: 0.69 | Val. recall: 0.65
Epoch: 07 | Epoch Time: 0m 11s
	Train Loss: 0.051 | Train f1: 0.79 | Train precision: 0.82 | Train recall: 0.78
	 Val. Loss: 0.195 |  Val. f1: 0.64 |  Val. precision: 0.69 | Val. recall: 0.64
Epoch: 08 | Epoch Time: 0m 11s
	Train Loss: 0.043 | Train f1: 0.82 | Train precision: 0.84 | Train recall: 0.81
	 Val. Loss: 0.199 |  Val. f1: 0.65 |  Val. precision: 0.69 | Val. recall: 0.66
Epoch: 09 | Epoch Time: 0m 11s
	Train Loss: 0.036 | Train f1: 0.84 | Train precision: 0.86 | Train recall: 0.84
	 Val. Loss: 0.206 |  Val. f1: 0.66 |  Val. precision: 0.70 | Val. recall: 0.66
Epoch: 10 | Epoch Time: 0m 11s
	Train Loss:

##### Pre-trained Embeddings

In [None]:
# Loading the pre-trained model follows the approach described in:
# https://medium.com/@rohit_agrawal/using-fine-tuned-gensim-word2vec-embeddings-with-torchtext-and-pytorch-17eea2883cd

# Build one vector per vocabulary entry, in index order (itos), so that
# row i of the embedding matrix corresponds to token index i.
# Tokens missing from the pre-trained model get a zero vector.
word2vec_vectors = []
for token in tqdm_notebook(TEXT.vocab.itos):
    if token in wordvectors.wv.vocab:
        word2vec_vectors.append(torch.FloatTensor(wordvectors[token]))
    else:
        word2vec_vectors.append(torch.zeros(W2V_SIZE))
TEXT.vocab.set_vectors(TEXT.vocab.stoi, word2vec_vectors, W2V_SIZE)
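
Since tokens missing from the pre-trained model receive a zero vector, it can be useful to check how much of the vocabulary is actually covered. The sketch below illustrates the same alignment logic on toy data with plain Python lists; the tokens, the vectors and the helper name `align_vectors` are made up for illustration and are not part of the notebook.

```python
def align_vectors(itos, pretrained, dim):
    """Return one vector per vocabulary entry (zeros if missing) and the coverage."""
    aligned, found = [], 0
    for token in itos:
        if token in pretrained:
            aligned.append(list(pretrained[token]))
            found += 1
        else:
            aligned.append([0.0] * dim)  # out-of-vocabulary token
    coverage = found / len(itos) if itos else 0.0
    return aligned, coverage

# Toy vocabulary and toy "pre-trained" vectors of dimension 2.
toy_itos = ["<unk>", "<pad>", "Santiago", "de", "Chile"]
toy_pretrained = {"Santiago": [0.1, 0.2], "de": [0.3, 0.4], "Chile": [0.5, 0.6]}

aligned, coverage = align_vectors(toy_itos, toy_pretrained, dim=2)
print(f"coverage: {coverage:.0%}")
```

A low coverage on the real vocabulary would mean that many rows of the embedding matrix start as zeros, which matters most when the embeddings are kept frozen.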






In [None]:
%%script false
print(TEXT.vocab.vectors.shape)
pre_trained_emb = torch.FloatTensor(TEXT.vocab.vectors)
model.embedding = nn.Embedding.from_pretrained(pre_trained_emb).cuda()

## Embeddings loaded with from_pretrained start with requires_grad = False
print(model.embedding.weight.requires_grad)
model.embedding.weight.requires_grad = True
print(model.embedding.weight.requires_grad)

for batch in train_iterator:
    print(batch.text)
    print(batch.text.shape)
    print(model.embedding(batch.text.cuda()).shape)
    break

In [None]:
torch.cuda.empty_cache()
model = lstm_model
model_name = lstm_model_name
criterion = baseline_criterion

model.apply(init_weights)

pre_trained_emb = torch.FloatTensor(TEXT.vocab.vectors)
model.load_pretrained_embeddings(pre_trained_emb, False)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 2,724,874 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 9s
	Train Loss: 0.252 | Train f1: 0.32 | Train precision: 0.37 | Train recall: 0.33
	 Val. Loss: 0.382 |  Val. f1: 0.44 |  Val. precision: 0.50 | Val. recall: 0.48
Epoch: 02 | Epoch Time: 0m 9s
	Train Loss: 0.169 | Train f1: 0.47 | Train precision: 0.52 | Train recall: 0.48
	 Val. Loss: 0.324 |  Val. f1: 0.46 |  Val. precision: 0.52 | Val. recall: 0.48
Epoch: 03 | Epoch Time: 0m 9s
	Train Loss: 0.152 | Train f1: 0.51 | Train precision: 0.57 | Train recall: 0.52
	 Val. Loss: 0.278 |  Val. f1: 0.49 |  Val. precision: 0.54 | Val. recall: 0.50
Epoch: 04 | Epoch Time: 0m 9s
	Train Loss: 0.138 | Train f1: 0.54 | Train precision: 0.60 | Train recall: 0.55
	 Val. Loss: 0.305 |  Val. f1: 0.52 |  Val. precision: 0.57 | Val. recall: 0.53
Epoch: 05 | Epoch Time: 0m 9s
	Train Loss: 0.132 | Train f1: 0.56 | Train precision: 0.61 | Train recall: 0.56
	 Val. Loss: 0.279 |  Val. f1: 0.52 |  Val. precision: 0.57 | Val. r



Epoch: 09 | Epoch Time: 0m 9s
	Train Loss: 0.104 | Train f1: nan | Train precision: nan | Train recall: nan
	 Val. Loss: 0.271 |  Val. f1: 0.56 |  Val. precision: 0.61 | Val. recall: 0.57
Epoch: 10 | Epoch Time: 0m 9s
	Train Loss: 0.099 | Train f1: 0.63 | Train precision: 0.68 | Train recall: 0.64
	 Val. Loss: 0.286 |  Val. f1: 0.57 |  Val. precision: 0.61 | Val. recall: 0.58
Epoch: 11 | Epoch Time: 0m 9s
	Train Loss: 0.095 | Train f1: 0.65 | Train precision: 0.69 | Train recall: 0.65
	 Val. Loss: 0.285 |  Val. f1: 0.56 |  Val. precision: 0.61 | Val. recall: 0.58
Epoch: 12 | Epoch Time: 0m 9s
	Train Loss: 0.087 | Train f1: 0.66 | Train precision: 0.70 | Train recall: 0.66
	 Val. Loss: 0.304 |  Val. f1: 0.56 |  Val. precision: 0.61 | Val. recall: 0.58
Epoch: 13 | Epoch Time: 0m 9s
	Train Loss: 0.084 | Train f1: 0.68 | Train precision: 0.72 | Train recall: 0.68
	 Val. Loss: 0.301 |  Val. f1: 0.58 |  Val. precision: 0.62 | Val. recall: 0.59
Epoch: 14 | Epoch Time: 0m 9s
	Train Loss: 0.076

In [None]:
torch.cuda.empty_cache()
model = elman_model
model_name = elman_model_name
criterion = baseline_criterion

model.apply(init_weights)

pre_trained_emb = torch.FloatTensor(TEXT.vocab.vectors)
model.load_pretrained_embeddings(pre_trained_emb, False)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 211,466 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 8s
	Train Loss: 0.309 | Train f1: 0.24 | Train precision: 0.28 | Train recall: 0.26
	 Val. Loss: 0.380 |  Val. f1: 0.36 |  Val. precision: 0.38 | Val. recall: 0.40
Epoch: 02 | Epoch Time: 0m 8s
	Train Loss: 0.199 | Train f1: 0.39 | Train precision: 0.43 | Train recall: 0.40
	 Val. Loss: 0.378 |  Val. f1: 0.42 |  Val. precision: 0.46 | Val. recall: 0.46
Epoch: 03 | Epoch Time: 0m 8s
	Train Loss: 0.178 | Train f1: 0.44 | Train precision: 0.50 | Train recall: 0.45
	 Val. Loss: 0.342 |  Val. f1: 0.44 |  Val. precision: 0.50 | Val. recall: 0.46
Epoch: 04 | Epoch Time: 0m 8s
	Train Loss: 0.164 | Train f1: 0.48 | Train precision: 0.54 | Train recall: 0.48
	 Val. Loss: 0.340 |  Val. f1: 0.44 |  Val. precision: 0.49 | Val. recall: 0.47
Epoch: 05 | Epoch Time: 0m 8s
	Train Loss: 0.156 | Train f1: 0.50 | Train precision: 0.55 | Train recall: 0.50
	 Val. Loss: 0.313 |  Val. f1: 0.43 |  Val. precision: 0.50 | Val. rec

In [None]:
torch.cuda.empty_cache()
model = gru_model1
model_name = gru_model_name1
criterion = baseline_criterion

model.apply(init_weights)

pre_trained_emb = torch.FloatTensor(TEXT.vocab.vectors)
model.load_pretrained_embeddings(pre_trained_emb, False)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 2,044,938 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 8s
	Train Loss: 0.251 | Train f1: 0.34 | Train precision: 0.40 | Train recall: 0.35
	 Val. Loss: 0.376 |  Val. f1: 0.47 |  Val. precision: 0.52 | Val. recall: 0.49
Epoch: 02 | Epoch Time: 0m 8s
	Train Loss: 0.173 | Train f1: 0.47 | Train precision: 0.52 | Train recall: 0.47
	 Val. Loss: 0.363 |  Val. f1: 0.47 |  Val. precision: 0.55 | Val. recall: 0.48
Epoch: 03 | Epoch Time: 0m 8s
	Train Loss: 0.154 | Train f1: 0.50 | Train precision: 0.56 | Train recall: 0.51
	 Val. Loss: 0.356 |  Val. f1: 0.46 |  Val. precision: 0.53 | Val. recall: 0.48
Epoch: 04 | Epoch Time: 0m 8s
	Train Loss: 0.143 | Train f1: 0.53 | Train precision: 0.58 | Train recall: 0.54
	 Val. Loss: 0.293 |  Val. f1: 0.53 |  Val. precision: 0.58 | Val. recall: 0.55
Epoch: 05 | Epoch Time: 0m 8s
	Train Loss: 0.133 | Train f1: 0.56 | Train precision: 0.61 | Train recall: 0.56
	 Val. Loss: 0.293 |  Val. f1: 0.54 |  Val. precision: 0.58 | Val. r

In [None]:
torch.cuda.empty_cache()
model = gru_model2
model_name = gru_model_name2
criterion = baseline_criterion

model.apply(init_weights)

pre_trained_emb = torch.FloatTensor(TEXT.vocab.vectors)
model.load_pretrained_embeddings(pre_trained_emb, False)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 2,044,938 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 9s
	Train Loss: 0.279 | Train f1: 0.31 | Train precision: 0.36 | Train recall: 0.31
	 Val. Loss: 0.359 |  Val. f1: 0.44 |  Val. precision: 0.49 | Val. recall: 0.46
Epoch: 02 | Epoch Time: 0m 9s
	Train Loss: 0.188 | Train f1: 0.43 | Train precision: 0.49 | Train recall: 0.44
	 Val. Loss: 0.425 |  Val. f1: 0.42 |  Val. precision: 0.47 | Val. recall: 0.46
Epoch: 03 | Epoch Time: 0m 9s
	Train Loss: 0.170 | Train f1: 0.47 | Train precision: 0.53 | Train recall: 0.47
	 Val. Loss: 0.372 |  Val. f1: 0.48 |  Val. precision: 0.53 | Val. recall: 0.51
Epoch: 04 | Epoch Time: 0m 9s
	Train Loss: 0.157 | Train f1: 0.50 | Train precision: 0.55 | Train recall: 0.50
	 Val. Loss: 0.372 |  Val. f1: 0.46 |  Val. precision: 0.50 | Val. recall: 0.49
Epoch: 05 | Epoch Time: 0m 9s
	Train Loss: 0.149 | Train f1: 0.52 | Train precision: 0.58 | Train recall: 0.53
	 Val. Loss: 0.324 |  Val. f1: 0.52 |  Val. precision: 0.56 | Val. r

##### *Fine-tuning* the Pre-trained Embeddings

In [None]:
torch.cuda.empty_cache()
model = lstm_model
model_name = lstm_model_name
criterion = baseline_criterion

model.apply(init_weights)

pre_trained_emb = torch.FloatTensor(TEXT.vocab.vectors)
model.load_pretrained_embeddings(pre_trained_emb, True)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 10,555,174 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 11s
	Train Loss: 0.224 | Train f1: 0.39 | Train precision: 0.44 | Train recall: 0.40
	 Val. Loss: 0.307 |  Val. f1: 0.50 |  Val. precision: 0.54 | Val. recall: 0.53
Epoch: 02 | Epoch Time: 0m 11s
	Train Loss: 0.102 | Train f1: 0.66 | Train precision: 0.71 | Train recall: 0.67
	 Val. Loss: 0.272 |  Val. f1: 0.61 |  Val. precision: 0.65 | Val. recall: 0.64
Epoch: 03 | Epoch Time: 0m 10s
	Train Loss: 0.063 | Train f1: 0.77 | Train precision: 0.80 | Train recall: 0.77
	 Val. Loss: 0.276 |  Val. f1: 0.66 |  Val. precision: 0.69 | Val. recall: 0.68
Epoch: 04 | Epoch Time: 0m 10s
	Train Loss: 0.046 | Train f1: 0.81 | Train precision: 0.83 | Train recall: 0.81
	 Val. Loss: 0.263 |  Val. f1: 0.63 |  Val. precision: 0.67 | Val. recall: 0.65
Epoch: 05 | Epoch Time: 0m 10s
	Train Loss: 0.037 | Train f1: 0.84 | Train precision: 0.86 | Train recall: 0.84
	 Val. Loss: 0.278 |  Val. f1: 0.63 |  Val. precision: 0.66 | 

In [None]:
torch.cuda.empty_cache()
model = elman_model
model_name = elman_model_name
criterion = baseline_criterion

model.apply(init_weights)

pre_trained_emb = torch.FloatTensor(TEXT.vocab.vectors)
model.load_pretrained_embeddings(pre_trained_emb, True)
print(f'El modelo actual tiene {count_parameters(model):,} parámetros entrenables.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

El modelo actual tiene 8,041,766 parámetros entrenables.
Epoch: 01 | Epoch Time: 0m 9s
	Train Loss: 0.284 | Train f1: 0.29 | Train precision: 0.33 | Train recall: 0.30
	 Val. Loss: 0.303 |  Val. f1: 0.46 |  Val. precision: 0.51 | Val. recall: 0.49
Epoch: 02 | Epoch Time: 0m 9s
	Train Loss: 0.136 | Train f1: 0.57 | Train precision: 0.62 | Train recall: 0.58
	 Val. Loss: 0.275 |  Val. f1: 0.59 |  Val. precision: 0.63 | Val. recall: 0.61
Epoch: 03 | Epoch Time: 0m 9s
	Train Loss: 0.088 | Train f1: 0.71 | Train precision: 0.75 | Train recall: 0.71
	 Val. Loss: 0.281 |  Val. f1: 0.61 |  Val. precision: 0.66 | Val. recall: 0.64
Epoch: 04 | Epoch Time: 0m 9s
	Train Loss: 0.064 | Train f1: 0.77 | Train precision: 0.80 | Train recall: 0.77
	 Val. Loss: 0.249 |  Val. f1: 0.63 |  Val. precision: 0.66 | Val. recall: 0.65
Epoch: 05 | Epoch Time: 0m 9s
	Train Loss: 0.051 | Train f1: 0.81 | Train precision: 0.83 | Train recall: 0.81
	 Val. Loss: 0.287 |  Val. f1: 0.64 |  Val. precision: 0.67 | Val. r



Epoch: 09 | Epoch Time: 0m 9s
	Train Loss: 0.031 | Train f1: nan | Train precision: nan | Train recall: nan
	 Val. Loss: 0.350 |  Val. f1: 0.65 |  Val. precision: 0.68 | Val. recall: 0.67
Epoch: 10 | Epoch Time: 0m 9s
	Train Loss: 0.026 | Train f1: nan | Train precision: nan | Train recall: nan
	 Val. Loss: 0.336 |  Val. f1: 0.65 |  Val. precision: 0.68 | Val. recall: 0.68
Epoch: 11 | Epoch Time: 0m 9s
	Train Loss: 0.025 | Train f1: 0.89 | Train precision: 0.90 | Train recall: 0.89
	 Val. Loss: 0.343 |  Val. f1: 0.66 |  Val. precision: 0.68 | Val. recall: 0.68
Epoch: 12 | Epoch Time: 0m 9s
	Train Loss: 0.022 | Train f1: 0.89 | Train precision: 0.90 | Train recall: 0.89
	 Val. Loss: 0.353 |  Val. f1: 0.64 |  Val. precision: 0.67 | Val. recall: 0.67
Epoch: 13 | Epoch Time: 0m 9s
	Train Loss: 0.019 | Train f1: 0.91 | Train precision: 0.91 | Train recall: 0.91
	 Val. Loss: 0.331 |  Val. f1: 0.66 |  Val. precision: 0.69 | Val. recall: 0.69
Epoch: 14 | Epoch Time: 0m 9s
	Train Loss: 0.020 | 

In [None]:
torch.cuda.empty_cache()
model = gru_model1
model_name = gru_model_name1
criterion = baseline_criterion

model.apply(init_weights)

pre_trained_emb = torch.FloatTensor(TEXT.vocab.vectors)
model.load_pretrained_embeddings(pre_trained_emb, True)
print(f'The current model has {count_parameters(model):,} trainable parameters.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

The current model has 9,875,238 trainable parameters.
Epoch: 01 | Epoch Time: 0m 9s
	Train Loss: 0.223 | Train f1: 0.42 | Train precision: 0.47 | Train recall: 0.43
	 Val. Loss: 0.309 |  Val. f1: 0.54 |  Val. precision: 0.58 | Val. recall: 0.58
Epoch: 02 | Epoch Time: 0m 10s
	Train Loss: 0.095 | Train f1: 0.69 | Train precision: 0.74 | Train recall: 0.69
	 Val. Loss: 0.253 |  Val. f1: 0.62 |  Val. precision: 0.66 | Val. recall: 0.64
Epoch: 03 | Epoch Time: 0m 9s
	Train Loss: 0.059 | Train f1: 0.79 | Train precision: 0.81 | Train recall: 0.79
	 Val. Loss: 0.274 |  Val. f1: 0.62 |  Val. precision: 0.65 | Val. recall: 0.65
Epoch: 04 | Epoch Time: 0m 9s
	Train Loss: 0.042 | Train f1: 0.83 | Train precision: 0.85 | Train recall: 0.83
	 Val. Loss: 0.327 |  Val. f1: 0.69 |  Val. precision: 0.71 | Val. recall: 0.71
Epoch: 05 | Epoch Time: 0m 10s
	Train Loss: 0.035 | Train f1: 0.86 | Train precision: 0.87 | Train recall: 0.85
	 Val. Loss: 0.299 |  Val. f1: 0.67 |  Val. precision: 0.70 | Val.

In [None]:
torch.cuda.empty_cache()
model = gru_model2
model_name = gru_model_name2
criterion = baseline_criterion

model.apply(init_weights)

pre_trained_emb = torch.FloatTensor(TEXT.vocab.vectors)
model.load_pretrained_embeddings(pre_trained_emb, True)
print(f'The current model has {count_parameters(model):,} trainable parameters.')

optimizer = optim.Adam(model.parameters())
model = model.to(device)
criterion = criterion.to(device)

optimize_model(model, train_iterator, valid_iterator, optimizer, criterion)

The current model has 9,875,238 trainable parameters.
Epoch: 01 | Epoch Time: 0m 11s
	Train Loss: 0.253 | Train f1: 0.37 | Train precision: 0.43 | Train recall: 0.37
	 Val. Loss: 0.393 |  Val. f1: 0.50 |  Val. precision: 0.53 | Val. recall: 0.54
Epoch: 02 | Epoch Time: 0m 11s
	Train Loss: 0.113 | Train f1: 0.65 | Train precision: 0.69 | Train recall: 0.65
	 Val. Loss: 0.281 |  Val. f1: 0.62 |  Val. precision: 0.66 | Val. recall: 0.64
Epoch: 03 | Epoch Time: 0m 11s
	Train Loss: 0.072 | Train f1: 0.75 | Train precision: 0.78 | Train recall: 0.75
	 Val. Loss: 0.267 |  Val. f1: 0.63 |  Val. precision: 0.67 | Val. recall: 0.65
Epoch: 04 | Epoch Time: 0m 11s
	Train Loss: 0.053 | Train f1: 0.80 | Train precision: 0.83 | Train recall: 0.80
	 Val. Loss: 0.258 |  Val. f1: 0.65 |  Val. precision: 0.68 | Val. recall: 0.66
Epoch: 05 | Epoch Time: 0m 11s
	Train Loss: 0.043 | Train f1: 0.83 | Train precision: 0.85 | Train recall: 0.83
	 Val. Loss: 0.324 |  Val. f1: 0.65 |  Val. precision: 0.68 | V


### Predicting data for the competition

Now, using the **test** data and our trained model, we predict the labels that will be evaluated in the competition.

In [None]:
def predict_labels(model, iterator, criterion, fields=fields):

    # Extract the word and tag vocabularies.
    text_field = fields[0][1]
    nertags_field = fields[1][1]
    tags_vocab = nertags_field.vocab.itos
    words_vocab = text_field.vocab.itos

    model.eval()

    predictions = []

    with torch.no_grad():

        for batch in iterator:

            text_batch = batch.text
            text_batch = torch.transpose(text_batch, 0, 1).tolist()

            # Predict the tags for the sentences in the batch.
            predictions_batch = model(batch.text)
            predictions_batch = torch.transpose(predictions_batch, 0, 1)

            # For each predicted sentence:
            for sentence, sentence_prediction in zip(text_batch,
                                                     predictions_batch):
                for word_idx, word_predictions in zip(sentence,
                                                      sentence_prediction):
                    # Get the index of the tag with the highest score.
                    argmax_index = word_predictions.argmax().item()

                    current_tag = tags_vocab[argmax_index]
                    # Look up the word.
                    current_word = words_vocab[word_idx]

                    if current_word != '<pad>':
                        predictions.append([current_word, current_tag])


    return predictions


predictions = predict_labels(model, test_iterator, criterion)

In [None]:
len(predictions)

51533
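
The decoding step inside `predict_labels` can be looked at in isolation. The following is a minimal sketch, not the notebook's actual pipeline: the vocabularies and scores are made up for illustration, plain Python lists stand in for the tensors, and `decode_sentence` is a hypothetical helper. For each token, the index of the highest score is looked up in the tag vocabulary, and padding positions are dropped.

```python
# Hypothetical vocabularies, for illustration only (not the notebook's real ones).
toy_tags_vocab = ['<pad>', 'O', 'B-LOC', 'I-LOC']
toy_words_vocab = ['<unk>', '<pad>', 'La', 'Serena', 'es']

# One padded sentence, given as word indices into toy_words_vocab.
toy_sentence = [2, 3, 4, 1]

# Hand-written scores, one row per token and one column per tag.
toy_scores = [
    [0.1, 0.2, 2.0, 0.3],  # La
    [0.1, 0.2, 0.3, 2.0],  # Serena
    [0.1, 2.0, 0.2, 0.3],  # es
    [2.0, 0.1, 0.2, 0.3],  # <pad>
]


def decode_sentence(word_indices, scores, words_vocab, tags_vocab,
                    pad_token='<pad>'):
    """Pair each non-padding word with its highest-scoring tag."""
    pairs = []
    for word_idx, token_scores in zip(word_indices, scores):
        # Argmax over the tag dimension for this token.
        tag_idx = max(range(len(token_scores)), key=token_scores.__getitem__)
        word = words_vocab[word_idx]
        if word != pad_token:
            pairs.append([word, tags_vocab[tag_idx]])
    return pairs


toy_pairs = decode_sentence(toy_sentence, toy_scores,
                            toy_words_vocab, toy_tags_vocab)
print(toy_pairs)
```

Because padding is filtered out by word rather than by tag, the number of pairs equals the number of real tokens in the input, which is what `len(predictions)` reports for the test set.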

### Generating the submission file

It is not a problem if `<unk>` tokens appear in the output. They are not relevant to the evaluation, since only the tags are used.

In [None]:
import os, shutil

if (os.path.isfile('./predictions.zip')):
    os.remove('./predictions.zip')

if (not os.path.isdir('./predictions')):
    os.mkdir('./predictions')

else:
    # Remove previous predictions:
    shutil.rmtree('./predictions')
    os.mkdir('./predictions')

# Write one "word tag" pair per line, followed by a final blank line.
with open('predictions/predictions.txt', 'w', encoding='utf-8') as f:
    for word, tag in predictions:
        f.write(word + ' ' + tag + '\n')
    f.write('\n')

a = shutil.make_archive('predictions', 'zip', './predictions')
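
Before uploading, it may be worth reading the file back to confirm its shape. The sketch below is only an assumption about what a useful check looks like and is not part of the competition tooling: `check_prediction_file` is a hypothetical helper, and the demo writes a small temporary file rather than touching `predictions/predictions.txt`.

```python
import os
import tempfile


def check_prediction_file(path):
    """Count the 'word tag' lines in a file; raise ValueError on a malformed line."""
    n_tokens = 0
    with open(path, encoding='utf-8') as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip('\n')
            if not line:
                # Blank lines (such as the final one) carry no token.
                continue
            parts = line.split(' ')
            if len(parts) != 2 or not parts[0] or not parts[1]:
                raise ValueError(
                    f'line {line_number}: expected "word tag", got {line!r}')
            n_tokens += 1
    return n_tokens


# Demo data in the same [word, tag] layout as the predictions list.
demo_pairs = [['La', 'B-LOC'], ['Serena', 'I-LOC'], ['<unk>', 'O']]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'predictions.txt')
    with open(demo_path, 'w', encoding='utf-8') as handle:
        for word, tag in demo_pairs:
            handle.write(word + ' ' + tag + '\n')
        handle.write('\n')
    demo_count = check_prediction_file(demo_path)

print(demo_count)
```

Run against the real file, the returned count would be expected to match `len(predictions)`.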

In [None]:
%%script false
# Sometimes this does not work on the first try. Run it more than once to get the file...
from google.colab import files
files.download('predictions.zip')  

## Conclusions



...