# Master's Thesis: Web Application for a Spanish Question Answering Model

    
Description of the project's objectives.
- Image of BERT
- How will BERT learn?
- How will we make the final app available?

---
## Chapter 0: Initial Setup

### Importing libraries

Description

In [1]:
import numpy as np
import os
import sys
ABS_DIR = os.path.join(os.getcwd(), "..")
sys.path.append(ABS_DIR)


import utils.read_and_write as rw
import utils.preprocesado as pp
import train.train_utils as tu
import predict.predict_utils as pu

### Project configuration

From the following URL: https://huggingface.co/mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es

* We download the SQuAD training and evaluation files in Spanish
* We download the vocab and config files used to train BETO (Spanish BERT)

In [2]:
# A trailing empty component keeps the final separator without hard-coding "\\",
# so the same paths also work outside Windows.
CONFIG_DIR = os.path.join(ABS_DIR, "config", "")

DATA_DIR = os.path.join(ABS_DIR, "data", "")
LOG_DIR = os.path.join(DATA_DIR, "logs", "")
MODELS_DIR = os.path.join(DATA_DIR, "models", "")
DATASETS_DIR = os.path.join(DATA_DIR, "datasets", "")
CHECKPOINTS_CALLBACK_DIR = os.path.join(DATA_DIR, "checkpoints", "")
CACHE_DIR = os.path.join(DATA_DIR, "cache", "")

logger = rw.crear_logger("tfm-app.log")

train_path = rw.comprobar_fichero_existe(os.path.join(DATASETS_DIR, "train-v2.0-es.json"), logger)
eval_path = rw.comprobar_fichero_existe(os.path.join(DATASETS_DIR, "dev-v2.0-es.json"), logger)

vocab_BERT_path = rw.comprobar_fichero_existe(os.path.join(MODELS_DIR, "vocab.txt"), logger)
config_BERT_path = rw.comprobar_fichero_existe(os.path.join(MODELS_DIR, "config.json"), logger)

config_file = rw.cargar_config()

2021-10-12 20:18:33,828 tfm-app.log INFO: El fichero log tfm-app.log ha sido creado
2021-10-12 20:18:33,832 tfm-app.log INFO: El fichero train-v2.0-es.json existe en D:\tfm_app\src\notebooks\..\data\datasets
2021-10-12 20:18:33,833 tfm-app.log INFO: El fichero dev-v2.0-es.json existe en D:\tfm_app\src\notebooks\..\data\datasets
2021-10-12 20:18:33,833 tfm-app.log INFO: El fichero vocab.txt existe en D:\tfm_app\src\notebooks\..\data\models
2021-10-12 20:18:33,834 tfm-app.log INFO: El fichero config.json existe en D:\tfm_app\src\notebooks\..\data\models


### Data visualization

Explanation of what the data looks like

In [3]:
train_df = rw.json_to_dataframe(train_path)
print(train_df.shape)
eval_df = rw.json_to_dataframe(eval_path)
print(eval_df.shape)

(86818, 6)
(9905, 6)
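
The implementation of `rw.json_to_dataframe` is not shown in this notebook. As a rough sketch of how a SQuAD v2.0 JSON file can be flattened into one row per question with six fields, consider the following. The function name `squad_to_rows` and the column names are assumptions (only "Context", "Question" and "Text" appear elsewhere in this notebook), and the sketch returns plain dictionaries rather than a DataFrame.

```python
import json

def squad_to_rows(squad):
    """Flatten SQuAD v2.0 JSON into one row (dict) per question.

    Column names are assumed for illustration only.
    """
    rows = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # Unanswerable questions (SQuAD v2.0) have an empty answer list.
                if qa["answers"]:
                    answer = qa["answers"][0]
                else:
                    answer = {"answer_start": -1, "text": ""}
                rows.append({
                    "Id": qa["id"],
                    "Title": article["title"],
                    "Context": paragraph["context"],
                    "Question": qa["question"],
                    "Answer_start": answer["answer_start"],
                    "Text": answer["text"],
                })
    return rows

# Tiny hand-written sample in the SQuAD layout (not taken from the dataset).
sample = json.loads("""
{"data": [{"title": "Madrid", "paragraphs": [{
    "context": "Madrid es la capital de España.",
    "qas": [
        {"id": "q1", "question": "¿Cuál es la capital de España?",
         "answers": [{"text": "Madrid", "answer_start": 0}]},
        {"id": "q2", "question": "¿Cuántos habitantes tiene?", "answers": []}
    ]}]}]}
""")
rows = squad_to_rows(sample)
print(len(rows), sorted(rows[0].keys()))
```

Each row carries its own copy of the paragraph, which is convenient for inspection with `iloc` at the cost of repeating the context for every question.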


* TRAINING

In [None]:
id_row = 5000
print(f"Title:\n {train_df.iloc[id_row, 1]}")
print("---"*30)
print(f"Paragraph:\n {train_df.iloc[id_row, 2]}")
print("---"*30)
print(f"Question:\n {train_df.iloc[id_row, 3]}")
print("---"*30)
print(f"Answer position:\n {train_df.iloc[id_row, 4]}")
print("---"*30)
print(f"Answer:\n {train_df.iloc[id_row, 5]}")

* VALIDATION

In [None]:
id_row = 5000
print(f"Title:\n {eval_df.iloc[id_row, 1]}")
print("---"*30)
print(f"Paragraph:\n {eval_df.iloc[id_row, 2]}")
print("---"*30)
print(f"Question:\n {eval_df.iloc[id_row, 3]}")
print("---"*30)
print(f"Answer position:\n {eval_df.iloc[id_row, 4]}")
print("---"*30)
print(f"Answer:\n {eval_df.iloc[id_row, 5]}")

---
## Chapter 1: Data preprocessing

### Configuring the tokenizer

In [4]:
tokenizador = pp.obtener_tokenizador(vocab=vocab_BERT_path, lowercase=False)

In [None]:
tokens = tokenizador.encode("Esto es una prueba para ver cómo se tokeniza")
print(f'ids: {tokens.ids}')
print(f'tokens: {tokens.tokens}')
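
BERT vocabularies such as the `vocab.txt` loaded above are WordPiece vocabularies: each word is split greedily into the longest pieces found in the vocabulary, and continuation pieces carry a `##` prefix. The sketch below illustrates that idea on a toy vocabulary. It is not the code behind `pp.obtener_tokenizador`, and both `wordpiece` and `toy_vocab` are hypothetical names.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"token", "##iza", "se", "prueba"}  # toy vocabulary, not BETO's
tokens = [p for w in "se tokeniza".split() for p in wordpiece(w, toy_vocab)]
print(tokens)
```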

### Modifying the configuration

In [5]:
import json
with open(config_BERT_path, 'r') as json_file:
    config_BERT = json.load(json_file)
config_BERT.update(config_file["train"]["preprocess"])

### Preprocessing

* Training

In [6]:
x_train, y_train, train_dataset, train_squad_objects, train_errors = pp.transformar_datos_squad(train_df, tokenizador, config_BERT, logger=logger, name_data="entrenamiento")

2021-10-12 20:19:56,390 tfm-app.log INFO: Transformado el conjunto de entrenamiento al formato SquadExample
2021-10-12 20:20:03,268 tfm-app.log INFO: Se ha conseguido obtener los inputs y los targets para el conjunto de entrenamiento. Se han creado  85629 puntos


* Validation

In [7]:
x_eval, y_eval, eval_dataset, eval_squad_objects, eval_errors = pp.transformar_datos_squad(eval_df, tokenizador, config_BERT, logger=logger, name_data="validación")

2021-10-12 20:20:11,580 tfm-app.log INFO: Transformado el conjunto de validación al formato SquadExample
2021-10-12 20:20:12,167 tfm-app.log INFO: Se ha conseguido obtener los inputs y los targets para el conjunto de validación. Se han creado  9649 puntos
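
The logs report fewer points (85629 and 9649) than there are rows in the DataFrames (86818 and 9905), which suggests that some examples are discarded during the transformation. The criteria used by `pp.transformar_datos_squad` are not shown here. One common reason in SQuAD-style preprocessing is that the character-level answer cannot be aligned with any token, for instance because it falls beyond the truncation limit. A minimal sketch of that alignment, with the hypothetical helper `answer_token_span` and toy whitespace offsets:

```python
def answer_token_span(offsets, answer_start, answer_text):
    """Map a character-level answer to (first_token, last_token) indices.

    `offsets` holds one (char_start, char_end) pair per context token.
    Returns None when no token overlaps the answer, in which case the
    example would be skipped.
    """
    answer_end = answer_start + len(answer_text)
    hit = [i for i, (s, e) in enumerate(offsets)
           if s < answer_end and e > answer_start]
    if not hit:
        return None
    return hit[0], hit[-1]

context = "Madrid es la capital de España."
# Toy whitespace offsets; a real tokenizer would provide subword offsets.
offsets, pos = [], 0
for word in context.split():
    start = context.index(word, pos)
    offsets.append((start, start + len(word)))
    pos = start + len(word)

span = answer_token_span(offsets, context.index("la capital"), "la capital")
print(span)
```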


---
## Chapter 2: Building the model

### Hyperparameters

In [8]:
config_BERT

{'attention_probs_dropout_prob': 0.1,
 'hidden_act': 'gelu',
 'hidden_dropout_prob': 0.1,
 'hidden_size': 768,
 'initializer_range': 0.02,
 'intermediate_size': 3072,
 'max_position_embeddings': 512,
 'num_attention_heads': 12,
 'num_hidden_layers': 12,
 'type_vocab_size': 2,
 'vocab_size': 31002,
 'max_seq_len': 384,
 'lr': 0.005,
 'batch_size': 16,
 'nb_epoch': 2}

### Callbacks

In [9]:
callbacks = tu.generar_callbacks(x_eval, y_eval, eval_squad_objects, logger=logger)
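
The training logs further down report an "exact match score" at the end of every epoch, so one of these callbacks presumably computes it from the evaluation data. The body of `tu.generar_callbacks` is not shown. As a hedged sketch of the metric itself: a prediction counts as a hit when it equals the reference after light normalization. The normalization below (lowercasing, stripping punctuation, collapsing whitespace) is an assumption. The official SQuAD script also removes English articles, which is omitted here.

```python
import string

# Punctuation to strip, including the Spanish opening marks.
PUNCT = set(string.punctuation + "¿¡")

def normalize(text):
    """Lowercase, drop punctuation and collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCT)
    return " ".join(text.split())

def exact_match(predictions, references):
    """Fraction of predictions identical to their reference after normalization."""
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

score = exact_match(["Madrid.", "en 1492"], ["madrid", "1492"])
print(score)
```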

### Keras format

In [10]:
train_dataset_keras = pp.input_formato_keras(train_dataset, config_BERT)

### Model

In [11]:
modelo = tu.obtener_modelo(logger=logger, config=config_BERT)

Some layers from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased were not used when initializing TFBertForQuestionAnswering: ['mlm___cls']
- This IS expected if you are initializing TFBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertForQuestionAnswering were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['qa_outputs']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2021-10-12 20:20:14,712 tfm-app.log INFO: Se ha realizado la carga del modelo

In [12]:
modelo.summary()

Model: "tf_bert_for_question_answering"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bert (TFBertMainLayer)       multiple                  109850880 
_________________________________________________________________
qa_outputs (Dense)           multiple                  1538      
=================================================================
Total params: 109,852,418
Trainable params: 1,538
Non-trainable params: 109,850,880
_________________________________________________________________
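
The parameter counts in the summary can be checked by hand. The 1,538 trainable parameters match a dense head that maps each 768-dimensional hidden state to two logits (answer start and answer end), and the non-trainable count equals the whole `bert` layer, which suggests that the encoder is frozen in this run. A short arithmetic check, using only numbers from the summary and from `config_BERT`:

```python
hidden_size = 768            # 'hidden_size' in config_BERT
num_outputs = 2              # start logit and end logit per token
head_params = hidden_size * num_outputs + num_outputs  # weights + biases
encoder_params = 109_850_880                           # 'bert' row of the summary
total = encoder_params + head_params
trainable_share = head_params / total
print(head_params, total, f"{trainable_share:.6%}")
```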


---
## Chapter 3: Training and validation

### Monitoring with TensorBoard

In [13]:
# Load the TensorBoard extension
%load_ext tensorboard
# %reload_ext tensorboard

In [14]:
%tensorboard --logdir 'D:\\tfm_app\\src\\notebooks\\..\\data\\logs\\tensorboard'

Reusing TensorBoard on port 6006 (pid 728), started 0:09:40 ago. (Use '!kill 728' to kill it.)

### Training

In [15]:
modeloHistory = tu.entrenar_modelo(modelo, train_dataset_keras, config_BERT, callbacks, logger=logger)

2021-10-12 20:20:22,793 tfm-app.log INFO: Se realizará el entrenamiento con 2 epochs


Epoch 1/2
Instructions for updating:
use `tf.profiler.experimental.stop` instead.
Epoch 00001: loss improved from inf to 7.10881, saving model to ..\data\checkpoints\model.01-7.11.h5


2021-10-12 22:32:32,920 tfm-app.log INFO: 
epoch=1, exact match score=0.01



epoch=1, exact match score=0.01
Epoch 2/2
Epoch 00002: loss improved from 7.10881 to 7.08091, saving model to ..\data\checkpoints\model.02-7.08.h5


2021-10-13 00:44:42,084 tfm-app.log INFO: 
epoch=2, exact match score=0.01



epoch=2, exact match score=0.01


2021-10-13 00:44:42,263 tfm-app.log INFO: Entrenamiento finalizado!
2021-10-13 00:44:42,328 tfm-app.log INFO: Guardando pesos...
2021-10-13 00:44:44,299 tfm-app.log INFO: Se han guardado los pesos en: ..\data\models\qa_model_squad_v2_esp\qa_model_squad_v2_esp.h5
2021-10-13 00:44:44,300 tfm-app.log INFO: Guardando Modelo Json...
2021-10-13 00:44:44,302 tfm-app.log ERROR: [Errno 2] No such file or directory: 'qa_model_squad_v2_esp\\qa_model_squad_v2_esp.json'
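
The last log line reports `[Errno 2]` for the relative path `qa_model_squad_v2_esp\\qa_model_squad_v2_esp.json`, while the weights were written under `..\data\models\`. One plausible cause, which cannot be confirmed without the code of `tu.entrenar_modelo`, is that the JSON path is not joined with the models directory, so the target folder does not exist relative to the working directory. A minimal sketch of a safer write, where `save_json` is a hypothetical helper:

```python
import json
import os
import tempfile

def save_json(payload, models_dir, model_name):
    """Write `payload` to <models_dir>/<model_name>/<model_name>.json."""
    target_dir = os.path.join(models_dir, model_name)
    os.makedirs(target_dir, exist_ok=True)  # avoids Errno 2 on a missing folder
    target = os.path.join(target_dir, model_name + ".json")
    with open(target, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    return target

# Demonstration in a temporary directory with a placeholder payload.
with tempfile.TemporaryDirectory() as tmp:
    path = save_json({"class_name": "demo"}, tmp, "qa_model_squad_v2_esp")
    with open(path, encoding="utf-8") as fh:
        restored = json.load(fh)
```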


---
## Chapter 4: Prediction with the trained model

### Loading the model

In [None]:
modelo = rw.cargar_modelo(logger=logger)

In [None]:
modelo.summary()

### Prediction

In [None]:
ejemplo_id = 0
context = eval_df.loc[ejemplo_id, "Context"]
print("Context:")
print(context, "\n")

question = eval_df.loc[ejemplo_id, "Question"]
print("Question:")
print(question, "\n")

respuesta = eval_df.loc[ejemplo_id, "Text"]
print("Answer:")
print(respuesta)

In [None]:
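os.path.join(MODELS_DIR, "tfm_bert_finetuned_squadV2esp", "v1")

In [None]:
# load_weights expects an .h5 file, a checkpoint prefix or a SavedModel
# directory, not the saved_model.pb graph file inside that directory.
modelo.load_weights(os.path.join(MODELS_DIR, "tfm_bert_finetuned_squadV2esp", "v1"))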

In [None]:
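# NOTE: get_raw_results is not defined in this notebook; it must be
# imported or defined before this cell can run.
all_results = []
for count, inputs in enumerate(eval_dataset.batch(16)):
    x, _ = inputs
    start_logits, end_logits = modelo(x, training=False)
    output_dict = dict(
        start_logits=start_logits,
        end_logits=end_logits)
    for result in get_raw_results(output_dict):
        all_results.append(result)
    if count % 100 == 0:
        # The original hard-coded total (2709) does not match the number of
        # batches for 9649 points, so only the running count is reported.
        print(f"Processed {count} batches")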

In [None]:
pred_start, pred_end = modelo.predict(eval_dataset.batch(16))
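
`pred_start` and `pred_end` hold one score per token for the start and the end of the answer. To turn them into an answer, a common approach is to pick the pair with the highest combined score such that the end does not precede the start. The sketch below shows this for a single example using plain lists. `best_span` and the limit `max_answer_len=30` are assumptions, not part of this project's code.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximising start_logit + end_logit."""
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Made-up scores for a four-token context.
start_logits = [0.1, 2.5, 0.3, 0.2]
end_logits = [0.0, 0.4, 3.1, 0.1]
start, end, score = best_span(start_logits, end_logits)
print(start, end)
```

The selected token indices would then be mapped back to character offsets in the context to recover the answer text.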

In [None]:
texto = "Esto es una texto de prueba.\n Vamos a ver si funciona"

In [None]:
print(texto)

In [None]:
pu.whitespace_split(texto)

In [None]:
doc_tokens = ["E"]

In [None]:
doc_tokens[-1] += "s"

In [None]:
doc_tokens

In [None]:
for w in texto:
    print(w)
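
The scratch cells above (appending a character to `doc_tokens[-1]` and iterating over `texto` character by character) resemble the whitespace tokenization used in BERT's SQuAD preprocessing, where each character is also mapped to the index of the word it belongs to. The code of `pu.whitespace_split` is not shown, so the following is only a sketch of that technique under the hypothetical name `whitespace_tokens`:

```python
def whitespace_tokens(text):
    """Split on whitespace and record, per character, its word index."""
    doc_tokens, char_to_word = [], []
    prev_is_space = True
    for ch in text:
        if ch.isspace():
            prev_is_space = True
        else:
            if prev_is_space:
                doc_tokens.append(ch)   # first character of a new word
            else:
                doc_tokens[-1] += ch    # extend the current word
            prev_is_space = False
        # Whitespace is assigned to the word that precedes it.
        char_to_word.append(len(doc_tokens) - 1)
    return doc_tokens, char_to_word

texto = "Esto es una texto de prueba.\n Vamos a ver si funciona"
doc_tokens, char_to_word = whitespace_tokens(texto)
print(doc_tokens)
```

The `char_to_word` list is what allows a character-level answer position to be converted into a word index.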