<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center">


# Procesamiento de lenguaje natural
## LSTM Bot QA

### Datos
El objecto es utilizar datos disponibles del challenge ConvAI2 (Conversational Intelligence Challenge 2) de conversaciones en inglés. Se construirá un BOT para responder a preguntas del usuario (QA).\
[LINK](http://convai.io/data/)

In [1]:
!pip install --upgrade --no-cache-dir gdown --quiet

In [3]:
import re
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dropout, Dense, Flatten, LSTM, SimpleRNN, Embedding, Input
from tensorflow.keras.models import Model
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer


In [4]:
# Descargar la carpeta de dataset
import os
import gdown
if os.access('data_volunteers.json', os.F_OK) is False:
    url = 'https://drive.google.com/uc?id=1awUxYwImF84MIT5-jCaYAPe2QwSgS1hN&export=download'
    output = 'data_volunteers.json'
    gdown.download(url, output, quiet=False)
else:
    print("El dataset ya se encuentra descargado")

Downloading...
From: https://drive.google.com/uc?id=1awUxYwImF84MIT5-jCaYAPe2QwSgS1hN&export=download
To: /content/data_volunteers.json
100%|██████████| 2.58M/2.58M [00:00<00:00, 23.1MB/s]


In [5]:
# dataset_file
import json

text_file = "data_volunteers.json"
with open(text_file) as f:
    data = json.load(f) # la variable data será un diccionario



In [6]:
# Observar los campos disponibles en cada linea del dataset
data[0].keys()

dict_keys(['dialog', 'start_time', 'end_time', 'bot_profile', 'user_profile', 'eval_score', 'profile_match', 'participant1_id', 'participant2_id'])

In [7]:
chat_in = []
chat_out = []

input_sentences = []
output_sentences = []
output_sentences_inputs = []
max_len = 30

def clean_text(txt):
    txt = txt.lower()
    txt.replace("\'d", " had")
    txt.replace("\'s", " is")
    txt.replace("\'m", " am")
    txt.replace("don't", "do not")
    txt = re.sub(r'\W+', ' ', txt)

    return txt

for line in data:
    for i in range(len(line['dialog'])-1):
        # vamos separando el texto en "preguntas" (chat_in)
        # y "respuestas" (chat_out)
        chat_in = clean_text(line['dialog'][i]['text'])
        chat_out = clean_text(line['dialog'][i+1]['text'])

        if len(chat_in) >= max_len or len(chat_out) >= max_len:
            continue

        input_sentence, output = chat_in, chat_out

        # output sentence (decoder_output) tiene <eos>
        output_sentence = output + ' <eos>'
        # output sentence input (decoder_input) tiene <sos>
        output_sentence_input = '<sos> ' + output

        input_sentences.append(input_sentence)
        output_sentences.append(output_sentence)
        output_sentences_inputs.append(output_sentence_input)

print("Cantidad de rows utilizadas:", len(input_sentences))

Cantidad de rows utilizadas: 6033


In [8]:
input_sentences[1], output_sentences[1], output_sentences_inputs[1]

('hi how are you ', 'not bad and you  <eos>', '<sos> not bad and you ')

### 2 - Preprocesamiento
Realizar el preprocesamiento necesario para obtener:
- word2idx_inputs, max_input_len
- word2idx_outputs, max_out_len, num_words_output
- encoder_input_sequences, decoder_output_sequences, decoder_targets

## Tokenización de las frases de entrada (encoder)

In [10]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Tokenización inputs (encoder)
tokenizer_inputs = Tokenizer()
tokenizer_inputs.fit_on_texts(input_sentences)
encoder_input_sequences = tokenizer_inputs.texts_to_sequences(input_sentences)

word2idx_inputs = tokenizer_inputs.word_index
max_input_len = max(len(seq) for seq in encoder_input_sequences)
encoder_input_sequences = pad_sequences(encoder_input_sequences, maxlen=max_input_len, padding='post')


## Tokenización de las frases de salida (decoder)

In [11]:
# Tokenización outputs (decoder)
tokenizer_outputs = Tokenizer(filters='')  # ¡IMPORTANTE! No filtramos símbolos (<sos>, <eos>)
tokenizer_outputs.fit_on_texts(output_sentences + output_sentences_inputs)

decoder_input_sequences = tokenizer_outputs.texts_to_sequences(output_sentences_inputs)
decoder_output_sequences = tokenizer_outputs.texts_to_sequences(output_sentences)

word2idx_outputs = tokenizer_outputs.word_index
num_words_output = len(word2idx_outputs) + 1  # +1 por el padding token
max_out_len = max(len(seq) for seq in decoder_output_sequences)

# Padding
decoder_input_sequences = pad_sequences(decoder_input_sequences, maxlen=max_out_len, padding='post')
decoder_output_sequences = pad_sequences(decoder_output_sequences, maxlen=max_out_len, padding='post')


In [12]:
decoder_targets = np.zeros_like(decoder_output_sequences)
decoder_targets[:, :-1] = decoder_output_sequences[:, 1:]
decoder_targets[:, -1] = 0  # último token se pone a 0 (padding)

In [13]:
# Entradas del encoder
print("word2idx_inputs:", len(word2idx_inputs))
print("max_input_len:", max_input_len)

# Diccionario y tamaños del decoder
print("word2idx_outputs:", len(word2idx_outputs))
print("max_out_len:", max_out_len)
print("num_words_output:", num_words_output)

# Formas de las secuencias
print("encoder_input_sequences shape:", encoder_input_sequences.shape)
print("decoder_input_sequences shape:", decoder_input_sequences.shape)
print("decoder_targets shape:", decoder_targets.shape)


word2idx_inputs: 1799
max_input_len: 9
word2idx_outputs: 1806
max_out_len: 10
num_words_output: 1807
encoder_input_sequences shape: (6033, 9)
decoder_input_sequences shape: (6033, 10)
decoder_targets shape: (6033, 10)


### 3 - Preparar los embeddings
Utilizar los embeddings de Glove o FastText para transformar los tokens de entrada en vectores

### 4 - Entrenar el modelo
Entrenar un modelo basado en el esquema encoder-decoder utilizando los datos generados en los puntos anteriores. Utilce como referencias los ejemplos vistos en clase.

### 5 - Inferencia
Experimentar el funcionamiento de su modelo. Recuerde que debe realizar la inferencia de los modelos por separado de encoder y decoder.