<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center">


# Procesamiento de lenguaje natural
## LSTM Traductor
Ejemplo basado en [LINK](https://stackabuse.com/python-for-nlp-neural-machine-translation-with-seq2seq-in-keras/)

### Consignas
Replicar y extender el traductor:
- Replicar el modelo en PyTorch.
- Extender el entrenamiento a más datos y tamaños de
secuencias mayores.
- Explorar el impacto de la cantidad de neuronas en
las capas recurrentes.
- Mostrar 5 ejemplos de traducciones generadas.
- Extras que se pueden probar: Embeddings
pre-entrenados para los dos idiomas; cambiar la
estrategia de generación (por ejemplo muestreo
aleatorio);

### Datos
El objetivo es utilizar datos disponibles del Tatoeba Project de traducciones de texto en diferentes idiomas.  
Se construirá un modelo traductor de inglés a español seq2seq utilizando encoder-decoder.  
[LINK](https://www.manythings.org/anki/)

In [1]:
import requests
import os
from zipfile import ZipFile
import numpy as np

In [2]:
# Funcion auxiliar para descargar el dataset
def download_dataset(dataset_url: str, target_dir: str, check_dir: str | None = None, force: bool = False, tmp_file: str = "tmp.zip", unzip: bool = True):

    if check_dir and os.path.isdir(check_dir) and not force:
        print("Check folder already exists, nothing downloaded.")
        return

    try:
        with requests.get(dataset_url, stream=True, allow_redirects=True) as response:
            response.raise_for_status()  # Raise an exception for bad status codes

            with open(tmp_file, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
        print(f"File '{tmp_file}' downloaded successfully.")
    except requests.exceptions.RequestException as e:
        if (os.path.isfile(tmp_file)):
            os.remove(tmp_file)
        raise(Exception(f"Error downloading file: {e}"))

    if not unzip:
        return
    try:
        with ZipFile(tmp_file, 'r') as zip_object:
            zip_object.extractall(target_dir)
        print(f"Successfully extracted '{tmp_file}' to '{target_dir}'.")

    except FileNotFoundError:
        raise(Exception(f"Error: The file '{tmp_file}' was not found."))
    except Exception as e:
        raise(Exception(f"An error occurred: {e}"))
    finally:
        if (os.path.isfile(tmp_file)):
            os.remove(tmp_file)


In [3]:
dataset_url = "http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip"
base_dir = "./"
folder_dir = "spa-eng"
download_dataset(dataset_url, base_dir, check_dir=folder_dir, tmp_file="spa-eng.zip")

Check folder already exists, nothing downloaded.


In [4]:
# dataset_file

text_file = os.path.join(base_dir, folder_dir, "spa.txt")
with open(text_file) as f:
    lines = f.read().split("\n")[:-1]

# Por limitaciones de RAM no se leen todas las filas
MAX_NUM_SENTENCES = 6000

# Mezclar el dataset, forzar semilla siempre igual
np.random.seed([40])
np.random.shuffle(lines)

input_sentences = []
output_sentences = []
output_sentences_inputs = []
count = 0

for line in lines:
    count += 1
    if count > MAX_NUM_SENTENCES:
        break

    # el tabulador señaliza la separación entre las oraciones 
    # en ambos idiomas
    if '\t' not in line: 
        continue

    # Input sentence --> eng
    # output --> spa
    input_sentence, output = line.rstrip().split('\t')

    # output sentence (decoder_output) tiene <eos>
    output_sentence = output + ' <eos>'
    # output sentence input (decoder_input) tiene <sos>
    output_sentence_input = '<sos> ' + output

    input_sentences.append(input_sentence)
    output_sentences.append(output_sentence)
    output_sentences_inputs.append(output_sentence_input)

print("Cantidad de rows disponibles:", len(lines))
print("Cantidad de rows utilizadas:", len(input_sentences))

Cantidad de rows disponibles: 118964
Cantidad de rows utilizadas: 6000


In [5]:
input_sentences[0], output_sentences[0], output_sentences_inputs[0]

('A deal is a deal.',
 'Un trato es un trato. <eos>',
 '<sos> Un trato es un trato.')

### 2 - Preprocesamiento

In [6]:
# Definir el tamaño máximo del vocabulario
MAX_VOCAB_SIZE = 8000
# Vamos a necesitar un tokenizador para cada idioma

In [32]:
from collections import Counter
# Tokenizar las palabras, similar a Tokenizer de Keras
class Tokenizer():
    def __init__(self, num_words: int | None = None, filters: str = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'):
        self.num_words = num_words
        self.filters = filters
        self.word_index = {}

    def __preprocess_text(self, text: str) -> list[str]:
        text = ''.join(' ' if char in self.filters else char for char in text.lower())
        return [word for word in text.split(' ') if word]

    def fit_on_texts(self, input_sentences: list[str]) -> None:
        if (len(self.word_index) != 0):
            raise(Exception("Tokenizer has already been fit."))
        counter = Counter()
        for text in input_sentences:
            sentence = self.__preprocess_text(text)
            if not sentence: continue
            counter.update(sentence)

        most_common_n_words = counter.most_common(None if self.num_words is None else self.num_words - 1)

        self.word_index = { word: i+1 for i, (word, _) in enumerate(most_common_n_words)}

    def texts_to_sequences(self, input_sentences: list[str]):
        if (len(self.word_index) == 0):
            raise(Exception("Tokenizer has not been fit yet."))
        input_integer_seq = []
        for tokens in input_sentences:
            tokens = self.__preprocess_text(tokens)
            if not tokens: continue
            sequence = [self.word_index[word] for word in tokens]
            input_integer_seq.append(sequence)
        return input_integer_seq

In [33]:
# Defino una máxima cantidad de palabras a utilizar:
# - num_words --> the maximum number of words to keep, based on word frequency.
# - Only the most common num_words-1 words will be kept.

# tokenizador de inglés
input_tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
input_tokenizer.fit_on_texts(input_sentences)
input_integer_seq = input_tokenizer.texts_to_sequences(input_sentences)

word2idx_inputs = input_tokenizer.word_index
print("Palabras en el vocabulario:", len(word2idx_inputs))

max_input_len = max(len(sen) for sen in input_integer_seq)
print("Sentencia de entrada más larga:", max_input_len)

Palabras en el vocabulario: 3851
Sentencia de entrada más larga: 32


In [34]:
word2idx_inputs

{'the': 1,
 'to': 2,
 'i': 3,
 'you': 4,
 'tom': 5,
 'a': 6,
 'is': 7,
 'he': 8,
 'in': 9,
 'of': 10,
 'that': 11,
 'do': 12,
 'was': 13,
 'it': 14,
 'my': 15,
 'me': 16,
 'this': 17,
 'have': 18,
 'she': 19,
 'for': 20,
 'what': 21,
 'are': 22,
 "don't": 23,
 'his': 24,
 'mary': 25,
 'on': 26,
 'be': 27,
 'we': 28,
 'with': 29,
 'your': 30,
 'want': 31,
 'and': 32,
 'not': 33,
 "i'm": 34,
 'know': 35,
 'at': 36,
 'like': 37,
 'him': 38,
 'go': 39,
 'time': 40,
 'her': 41,
 'can': 42,
 'has': 43,
 'will': 44,
 'all': 45,
 'how': 46,
 'about': 47,
 'did': 48,
 'very': 49,
 'here': 50,
 'there': 51,
 "it's": 52,
 'as': 53,
 'up': 54,
 "didn't": 55,
 'think': 56,
 'they': 57,
 'had': 58,
 'when': 59,
 "can't": 60,
 'were': 61,
 'no': 62,
 'from': 63,
 'if': 64,
 'come': 65,
 'see': 66,
 'get': 67,
 'good': 68,
 'why': 69,
 "doesn't": 70,
 'been': 71,
 'an': 72,
 'out': 73,
 'by': 74,
 'tell': 75,
 'just': 76,
 'please': 77,
 'would': 78,
 'home': 79,
 'going': 80,
 'much': 81,
 'some': 82

In [35]:
# tokenizador de español
# A los filtros de símbolos del Tokenizer agregamos el "¿",
# sacamos los "<>" para que no afectar nuestros tokens
output_tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE, filters='!"#$%&()*+,-./:;=¿?@[\\]^_`{|}~\t\n')
output_tokenizer.fit_on_texts(["<sos>", "<eos>"] + output_sentences)
output_integer_seq = output_tokenizer.texts_to_sequences(output_sentences)
output_input_integer_seq = output_tokenizer.texts_to_sequences(output_sentences_inputs)

word2idx_outputs = output_tokenizer.word_index
print("Palabras en el vocabulario:", len(word2idx_outputs))

num_words_output = min(len(word2idx_outputs) + 1, MAX_VOCAB_SIZE) 
# Se suma 1 para incluir el token de palabra desconocida

max_out_len = max(len(sen) for sen in output_integer_seq)
print("Sentencia de salida más larga:", max_out_len)

Palabras en el vocabulario: 5721
Sentencia de salida más larga: 36


Como era de esperarse, las sentencias en castellano son más largas que en inglés, y lo mismo sucede con su vocabulario.

In [36]:
# Por una cuestion de que no explote la RAM se limitará el tamaño de las sentencias de entrada
# a la mitad:
max_input_len = 16
max_out_len = 18

A la hora de realizar padding es importante tener en cuenta que en el encoder los ceros se agregan al comienzo y en el decoder al final.  
Esto es porque la salida del encoder está basado en las últimas palabras de la sentencia (son las más importantes), mientras que en el decoder está basado en el comienzo de la secuencia de salida ya que es la realimentación del sistema y termina con fin de sentencia.

In [37]:
def pad_sequences(sequences, maxlen=None, dtype='int32', padding_pre=True, truncating_pre=True, value=0.0):
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)

    out_secuences = []
    for secuence in sequences:
        if len(secuence) < maxlen: # Hago padding hasta max_len
            if padding_pre:
                sec_out = [value] * (maxlen - len(secuence)) + secuence
            else:
                sec_out = secuence + [value] * (maxlen - len(secuence))
        else: # O me quedo con los max_len caracteres
            if truncating_pre:
                sec_out = secuence[-maxlen:]
            else:
                sec_out = secuence[:maxlen]
        out_secuences.append(sec_out)

    return np.array(out_secuences, dtype=dtype)

In [38]:
print("Cantidad de rows del dataset:", len(input_integer_seq))

encoder_input_sequences = pad_sequences(input_integer_seq, maxlen=max_input_len)
print("encoder_input_sequences shape:", encoder_input_sequences.shape)

decoder_input_sequences = pad_sequences(output_input_integer_seq, maxlen=max_out_len, padding_pre=False)
print("decoder_input_sequences shape:", decoder_input_sequences.shape)

Cantidad de rows del dataset: 6000
encoder_input_sequences shape: (6000, 16)
decoder_input_sequences shape: (6000, 18)


La última capa del modelo (softmax) necesita que los valores de salida
del decoder (decoder_sequences) estén en formato oneHotEncoder.\
Se utiliza "decoder_output_sequences" con la misma estrategia con que se transformó la entrada del decoder.

In [39]:
def to_categorical(x, num_classes):
    """ 1-hot encodes a tensor """
    return np.eye(num_classes, dtype='uint8')[x]

In [40]:
decoder_output_sequences = pad_sequences(output_integer_seq, maxlen=max_out_len, padding_pre=False)
decoder_targets = to_categorical(decoder_output_sequences, num_classes=num_words_output)
decoder_targets.shape

(6000, 18, 5722)

### 3 - Preparar los embeddings

In [41]:
url = 'https://drive.usercontent.google.com/download?id=1KY6avD5I1eI2dxQzMkR3WExwKwRq2g94&export=download&confirm=t&uuid=07c897f9-d9a1-4bdd-8cce-70c9dca2368a&at=AKSUxGMQs76z20Q73h7ULNM9qfje%3A1759553933720'
output = 'gloveembedding.pkl'
if os.access(os.path.join(base_dir, output), os.F_OK) is False:
    download_dataset(url, base_dir, tmp_file=output, unzip=False)
else:
    print("Los embeddings gloveembedding.pkl ya están descargados")

Los embeddings gloveembedding.pkl ya están descargados


In [42]:
import logging
import os
from pathlib import Path
import pickle

class WordsEmbeddings(object):
    logger = logging.getLogger(__name__)

    def __init__(self):
        # load the embeddings
        words_embedding_pkl = Path(self.PKL_PATH)
        if not words_embedding_pkl.is_file():
            words_embedding_txt = Path(self.WORD_TO_VEC_MODEL_TXT_PATH)
            assert words_embedding_txt.is_file(), 'Words embedding not available'
            embeddings = self.convert_model_to_pickle()
        else:
            embeddings = self.load_model_from_pickle()
        self.embeddings = embeddings
        # build the vocabulary hashmap
        index = np.arange(self.embeddings.shape[0])
        # Dicctionarios para traducir de embedding a IDX de la palabra
        self.word2idx = dict(zip(self.embeddings['word'], index))
        self.idx2word = dict(zip(index, self.embeddings['word']))

    def get_words_embeddings(self, words):
        words_idxs = self.words2idxs(words)
        return self.embeddings[words_idxs]['embedding']

    def words2idxs(self, words):
        return np.array([self.word2idx.get(word, -1) for word in words])

    def idxs2words(self, idxs):
        return np.array([self.idx2word.get(idx, '-1') for idx in idxs])

    def load_model_from_pickle(self):
        self.logger.debug(
            'loading words embeddings from pickle {}'.format(
                self.PKL_PATH
            )
        )
        max_bytes = 2**28 - 1 # 256MB
        bytes_in = bytearray(0)
        input_size = os.path.getsize(self.PKL_PATH)
        with open(self.PKL_PATH, 'rb') as f_in:
            for _ in range(0, input_size, max_bytes):
                bytes_in += f_in.read(max_bytes)
        embeddings = pickle.loads(bytes_in)
        self.logger.debug('words embeddings loaded')
        return embeddings

    def convert_model_to_pickle(self):
        # create a numpy strctured array:
        # word     embedding
        # U50      np.float32[]
        # word_1   a, b, c
        # word_2   d, e, f
        # ...
        # word_n   g, h, i
        self.logger.debug(
            'converting and loading words embeddings from text file {}'.format(
                self.WORD_TO_VEC_MODEL_TXT_PATH
            )
        )
        structure = [('word', np.dtype('U' + str(self.WORD_MAX_SIZE))),
                     ('embedding', np.float32, (self.N_FEATURES,))]
        structure = np.dtype(structure)
        # load numpy array from disk using a generator
        with open(self.WORD_TO_VEC_MODEL_TXT_PATH, encoding="utf8") as words_embeddings_txt:
            embeddings_gen = (
                (line.split()[0], line.split()[1:]) for line in words_embeddings_txt
                if len(line.split()[1:]) == self.N_FEATURES
            )
            embeddings = np.fromiter(embeddings_gen, structure)
        # add a null embedding
        null_embedding = np.array(
            [('null_embedding', np.zeros((self.N_FEATURES,), dtype=np.float32))],
            dtype=structure
        )
        embeddings = np.concatenate([embeddings, null_embedding])
        # dump numpy array to disk using pickle
        max_bytes = 2**28 - 1 # # 256MB
        bytes_out = pickle.dumps(embeddings, protocol=pickle.HIGHEST_PROTOCOL)
        with open(self.PKL_PATH, 'wb') as f_out:
            for idx in range(0, len(bytes_out), max_bytes):
                f_out.write(bytes_out[idx:idx+max_bytes])
        self.logger.debug('words embeddings loaded')
        return embeddings


class GloveEmbeddings(WordsEmbeddings):
    WORD_TO_VEC_MODEL_TXT_PATH = 'glove.twitter.27B.50d.txt'
    PKL_PATH = 'gloveembedding.pkl'
    N_FEATURES = 50
    WORD_MAX_SIZE = 60

class FasttextEmbeddings(WordsEmbeddings):
    WORD_TO_VEC_MODEL_TXT_PATH = 'cc.en.300.vec'
    PKL_PATH = 'fasttext.pkl'
    N_FEATURES = 300
    WORD_MAX_SIZE = 60

In [43]:
# Por una cuestion de RAM se utilizarán los embeddings de Glove de dimension 50
model_embeddings = GloveEmbeddings()

In [119]:
# Crear la Embedding matrix de las secuencias
# en inglés

print('preparing embedding matrix...')
embed_dim = model_embeddings.N_FEATURES
words_not_found = []

# word_index proviene del tokenizer

nb_words = min(MAX_VOCAB_SIZE, len(word2idx_inputs))+1 # vocab_size
embedding_matrix = np.zeros((nb_words, embed_dim))
for word, i in word2idx_inputs.items():
    if i >= nb_words:
        continue
    embedding_vector = model_embeddings.get_words_embeddings(word)[0]
    if (embedding_vector is not None) and len(embedding_vector) > 0:
        
        embedding_matrix[i] = embedding_vector
    else:
        # words not found in embedding index will be all-zeros.
        words_not_found.append(word)

print('number of null word embeddings:', np.sum(np.sum(embedding_matrix**2, axis=1) == 0))

preparing embedding matrix...
number of null word embeddings: 30


In [120]:
# Dimensión de los embeddings de la secuencia en inglés
embedding_matrix.shape

(3852, 50)

### 4 - Entrenar el modelo

In [121]:
max_input_len

16

In [126]:
import torch
from torch.nn import Embedding, Module, LSTM, Linear

class Translator(Module):
    def __init__(self, num_words_input, embed_dim, embedding_matrix, num_words_output) -> None:
        super().__init__()
        n_units = 128

        # training encoder
        self.encoder_embedding_layer = Embedding(
                num_embeddings=num_words_input,  # definido en el Tokenizador
                embedding_dim=embed_dim,  # dimensión de los embeddings utilizados
                #   input_length=max_input_len, # tamaño máximo de la secuencia de entrada
                _weight=torch.nn.Parameter(torch.from_numpy(embedding_matrix.astype(np.float32))))  # matrix de embeddings
        self.encoder_embedding_layer.weight.requires_grad = False # marcar como layer no entrenable
        self.encoder = LSTM(input_size=embed_dim, hidden_size=n_units, batch_first=True)

        # training decoder
        self.decoder_embedding_layer = Embedding(num_embeddings=num_words_output, embedding_dim=n_units) #, input_length=max_out_len
        self.decoder_lstm = LSTM(input_size=n_units, hidden_size=n_units, batch_first=True)

        # Dense
        self.decoder_dense = Linear(in_features=n_units, out_features=num_words_output)
                
    def forward(self, encoder_inputs, decoder_inputs):
        encoder_inputs_x = self.encoder_embedding_layer(encoder_inputs)
        _, encoder_states = self.encoder(encoder_inputs_x)

        decoder_inputs_x = self.decoder_embedding_layer(decoder_inputs)
        decoder_outputs, _ = self.decoder_lstm(decoder_inputs_x, encoder_states)
        decoder_outputs = self.decoder_dense(decoder_outputs)
        return decoder_outputs

        
model=Translator(num_words_input=nb_words, embed_dim=embed_dim, embedding_matrix=embedding_matrix, num_words_output=num_words_output)
print(model)

Translator(
  (encoder_embedding_layer): Embedding(3852, 50)
  (encoder): LSTM(50, 128, batch_first=True)
  (decoder_embedding_layer): Embedding(5722, 128)
  (decoder_lstm): LSTM(128, 128, batch_first=True)
  (decoder_dense): Linear(in_features=128, out_features=5722, bias=True)
)


In [127]:
import torchinfo as torchinfo
torchinfo.summary(model, input_size=[(1, max_input_len), (1, max_out_len)], dtypes=[torch.int, torch.int])

Layer (type:depth-idx)                   Output Shape              Param #
Translator                               [1, 18, 5722]             --
├─Embedding: 1-1                         [1, 16, 50]               (192,600)
├─LSTM: 1-2                              [1, 16, 128]              92,160
├─Embedding: 1-3                         [1, 18, 128]              732,416
├─LSTM: 1-4                              [1, 18, 128]              132,096
├─Linear: 1-5                            [1, 18, 5722]             738,138
Total params: 1,887,410
Trainable params: 1,694,810
Non-trainable params: 192,600
Total mult-adds (Units.MEGABYTES): 5.52
Input size (MB): 0.00
Forward/backward pass size (MB): 0.88
Params size (MB): 7.55
Estimated Total Size (MB): 8.43

In [172]:
from torch.utils.data import Dataset

class TranslatorDataset(Dataset):
    def __init__(self, encoder_input_sequences, decoder_input_sequences, decoder_targets):
        self.encoder_input_sequences = torch.tensor(encoder_input_sequences, dtype=torch.int)
        self.decoder_input_sequences = torch.tensor(decoder_input_sequences, dtype=torch.int)
        self.decoder_targets = torch.tensor(decoder_targets, dtype=torch.float)

    def __len__(self):
        return len(self.encoder_input_sequences)

    def __getitem__(self, idx):
        return (self.encoder_input_sequences[idx],
                self.decoder_input_sequences[idx],
                self.decoder_targets[idx])

In [173]:
from sklearn.model_selection import train_test_split
train_dataset, val_dataset = train_test_split(
    TranslatorDataset(encoder_input_sequences, decoder_input_sequences, decoder_targets), test_size=0.2, random_state=42)
print(f"Train size: {len(train_dataset)}. Validation size: {len(val_dataset)}")

Train size: 4800. Validation size: 1200


In [176]:
model=Translator(num_words_input=nb_words, embed_dim=embed_dim, embedding_matrix=embedding_matrix, num_words_output=num_words_output)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
criterion = torch.nn.CrossEntropyLoss() 
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) # Tasa de aprendizaje


In [None]:
def train(model, train_dataset, val_dataset, optimizer, criterion, device, epocs=15):
    # input_batch: Secuencia de entrada (ej. frase en inglés)
    # target_batch: Secuencia de salida DESEADA (ej. frase en español, incluyendo <EOS>)
    
    for epoch in range(epocs):
        epoch_loss = 0
        for (input, output_input, output_target) in train_dataset:
            optimizer.zero_grad()
            input =input.to(device)
            output_input = output_input.to(device)
            output_target = output_target.to(device)

            output_pred = model(input, output_input)
            loss = criterion(output_pred, output_target)
            
            epoch_loss += loss.item()

            loss.backward()
            optimizer.step()

        print(f'Epoch {epoch+1}/{epocs}, Loss: {epoch_loss/len(encoder_input_sequences)}')

        epoch_loss = 0
        with torch.no_grad():
            for (input, output_input, output_target) in val_dataset:
                input =input.to(device)
                output_input = output_input.to(device)
                output_target = output_target.to(device)

                output_pred = model(input, output_input)
                loss = criterion(output_pred, output_target)
                epoch_loss += loss.item()
        print(f'  Val Loss: {epoch_loss/len(encoder_input_sequences)}')

train(model, train_dataset, val_dataset, optimizer, criterion, device, epocs=15)

Epoch 1/15, Loss: 1.8292619239886603
  Val Loss: 0.4145087894399961
Epoch 2/15, Loss: 1.4387086504151423
  Val Loss: 0.4113206118941307
Epoch 3/15, Loss: 1.2000642161294819
  Val Loss: 0.4295482021321853
Epoch 4/15, Loss: 1.0001885218322277
  Val Loss: 0.45125775333245594
Epoch 5/15, Loss: 0.8343781223644813
  Val Loss: 0.46851331464449564
Epoch 6/15, Loss: 0.7003751460065444
  Val Loss: 0.48192031465967494
Epoch 7/15, Loss: 0.5908925608322024
  Val Loss: 0.49682176353037355
Epoch 8/15, Loss: 0.5003578964285552
  Val Loss: 0.5093520157833894


KeyboardInterrupt: 

In [None]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense

n_units = 128

# define training encoder
encoder_inputs = Input(shape=(max_input_len))

#encoder_embedding_layer = Embedding(num_words, EMBEDDING_SIZE, weights=[embedding_matrix], input_length=max_input_len)

encoder_embedding_layer = Embedding(
          input_dim=nb_words,  # definido en el Tokenizador
          output_dim=embed_dim,  # dimensión de los embeddings utilizados
          input_length=max_input_len, # tamaño máximo de la secuencia de entrada
          weights=[embedding_matrix],  # matrix de embeddings
          trainable=False)      # marcar como layer no entrenable

encoder_inputs_x = encoder_embedding_layer(encoder_inputs)

encoder = LSTM(n_units, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs_x)
encoder_states = [state_h, state_c]

# define training decoder
decoder_inputs = Input(shape=(max_out_len))
decoder_embedding_layer = Embedding(input_dim=num_words_output, output_dim=n_units, input_length=max_out_len)
decoder_inputs_x = decoder_embedding_layer(decoder_inputs)

decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs_x, initial_state=encoder_states)

# Dense
decoder_dense = Dense(num_words_output, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.compile(loss='categorical_crossentropy', optimizer="Adam", metrics=['accuracy'])
model.summary()

In [None]:
# Modelo completo (encoder+decoder) para poder entrenar
plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)

In [None]:
# Modelo solo encoder

# define inference encoder
encoder_model = Model(encoder_inputs, encoder_states)

plot_model(encoder_model, to_file='encoder_plot.png', show_shapes=True, show_layer_names=True)

In [None]:
# Modelo solo decoder (para realizar inferencia)

# define inference decoder
decoder_state_input_h = Input(shape=(n_units,))
decoder_state_input_c = Input(shape=(n_units,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

# En cada predicción habrá una sola palabra de entrada al decoder,
# que es la realimentación de la palabra anterior
# por lo que hay que modificar el input shape de la layer de Embedding
decoder_inputs_single = Input(shape=(1,))
decoder_inputs_single_x = decoder_embedding_layer(decoder_inputs_single)

decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs_single_x, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs_single] + decoder_states_inputs, [decoder_outputs] + decoder_states)

plot_model(decoder_model, to_file='decoder_plot.png', show_shapes=True, show_layer_names=True)

In [None]:
hist = model.fit(
    [encoder_input_sequences, decoder_input_sequences],
    decoder_targets,
    epochs=15, 
    validation_split=0.2)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Entrenamiento
epoch_count = range(1, len(hist.history['accuracy']) + 1)
sns.lineplot(x=epoch_count,  y=hist.history['accuracy'], label='train')
sns.lineplot(x=epoch_count,  y=hist.history['val_accuracy'], label='valid')
plt.show()

### 5 - Inferencia

In [None]:
'''
Step 1:
A deal is a deal -> Encoder -> enc(h1,c1)

enc(h1,c1) + <sos> -> Decoder -> Un + dec(h1,c1)

step 2:
dec(h1,c1) + Un -> Decoder -> trato + dec(h2,c2)

step 3:
dec(h2,c2) + trato -> Decoder -> es + dec(h3,c3)

step 4:
dec(h3,c3) + es -> Decoder -> un + dec(h4,c4)

step 5:
dec(h4,c4) + un -> Decoder -> trato + dec(h5,c5)

step 6:
dec(h5,c5) + trato. -> Decoder -> <eos> + dec(h6,c6)
'''

In [None]:
# Armar los conversores de índice a palabra:
idx2word_input = {v:k for k, v in word2idx_inputs.items()}
idx2word_target = {v:k for k, v in word2idx_outputs.items()}

In [None]:
def translate_sentence(input_seq):
    # Se transforma la sequencia de entrada a los estados "h" y "c" de la LSTM
    # para enviar la primera vez al decoder
    states_value = encoder_model.predict(input_seq)

    # Se inicializa la secuencia de entrada al decoder como "<sos>"
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = word2idx_outputs['<sos>']

    # Se obtiene el índice que finaliza la inferencia
    eos = word2idx_outputs['<eos>']
    
    output_sentence = []
    for _ in range(max_out_len):
        # Predicción del próximo elemento
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        idx = np.argmax(output_tokens[0, 0, :])

        # Si es "end of sentece <eos>" se acaba
        if eos == idx:
            break

        # Transformar idx a palabra
        word = ''        
        if idx > 0:
            word = idx2word_target[idx]
            output_sentence.append(word)

        # Actualizar los estados dada la última predicción
        states_value = [h, c]

        # Actualizar secuencia de entrada con la salida (re-alimentación)
        target_seq[0, 0] = idx

    return ' '.join(output_sentence)

In [None]:
i = np.random.choice(len(input_sentences))
input_seq = encoder_input_sequences[i:i+1]
translation = translate_sentence(input_seq)
print('-')
print('Input:', input_sentences[i])
print('Response:', translation)

In [None]:
input_test = "My mother say hi."
print('Input:', input_test)
integer_seq_test = input_tokenizer.texts_to_sequences([input_test])[0]
print("Representacion en vector de tokens de ids", integer_seq_test)
encoder_sequence_test = pad_sequences([integer_seq_test], maxlen=max_input_len)
print("Padding del vector:", encoder_sequence_test)

print('Input:', input_test)
translation = translate_sentence(encoder_sequence_test)
print('Response:', translation)

### 6 - Conclusión
A primera vista parece que el modelo tendría que funcionar muy bien por el accuracy alcanzado. La realidad es que las respuestas no tienen que ver demasiado con la pregunta/traducción pero la respuesta en si tiene bastante coherencia.\
Para poder mejorar el modelo haría falta poder consumir todo el dataset y todo el vocabulario, pero la cantidad de RAM no es suficiente.\
Este problema se resuelve con:
- Utilizando un DataGenerator para no levantar todo el dataset junto en el entrenamiento.
- Transfer learning evitando tener que entrenar todo el modelo  