<a href="https://colab.research.google.com/github/joSanchez28/BERT_on_tweets/blob/master/Libreta2_BERT_FineTuning_con_tweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning the BERT transformer for sentiment analysis on Twitter

In this notebook we apply transfer learning to the pretrained BERT model made available by HuggingFace, with the goal of solving sentiment analysis (sentiment classification) on tweets.

As shown at the end of this notebook, the best accuracy obtained during training is 0.862 on the validation set. A more thorough analysis of the model's performance is carried out in a separate notebook.





The code can be run directly on Google Colab. Keep in mind, however, that the notebook was originally run on Google Colab Pro, which provides more resources than the standard tier. If you are using Google Colab, you can check which GPU is available to you by running the following cell.




In [0]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Thu Jun 11 08:59:09 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru
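
Outside a notebook the `!nvidia-smi` shell escape is not available. The following is a minimal sketch of the same check in plain Python, using only the standard library. `gpu_summary` is a hypothetical helper name introduced here for illustration; it is not part of the original notebook.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the text printed by nvidia-smi, or None if it cannot be run."""
    # nvidia-smi is only on the PATH when NVIDIA drivers are installed.
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    # A non-zero exit code means the tool exists but could not reach a GPU.
    return result.stdout if result.returncode == 0 else None


summary = gpu_summary()
if summary is None:
    print("No usable GPU found; in Colab, enable one under Runtime > Change runtime type.")
else:
    print(summary)
```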

## Loading the required packages


In [0]:
import os
import pandas as pd
import re
import time
import numpy as np
import tensorflow as tf
import tensorflow_datasets

In [0]:
!pip install pyyaml h5py  # Needed to save models in HDF5 format



In particular, we import from the HuggingFace package the functions needed to load BERT with its pretrained weights and with the right architecture for two-class sentiment classification.


In [0]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/48/35/ad2c5b1b8f99feaaf9d7cdadaeef261f098c6e1a6a2935d4d07662a6b780/transformers-2.11.0-py3-none-any.whl (674kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.7.0
  Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)

In [0]:
from transformers import (
    BertConfig,
    BertTokenizer,
    TFBertForSequenceClassification,
    glue_convert_examples_to_features,
    glue_processors,
)

In [0]:
# Parameters from the HuggingFace script that performs sentiment analysis on a different dataset
USE_XLA = False
USE_AMP = False
TASK = "sst-2"
TFDS_TASK = "sst2"
num_labels = 2
tf.config.optimizer.set_jit(USE_XLA)
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": USE_AMP})

In [0]:
# Load tokenizer and model from pretrained model/vocabulary. Specify the number of labels to classify (2+: classification, 1: regression)
config = BertConfig.from_pretrained("bert-base-cased", num_labels=num_labels)
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = TFBertForSequenceClassification.from_pretrained("bert-base-cased", config=config)


## Our dataset
In this section we load and preprocess our set of tweets.


We load the dataset into a pandas dataframe.


In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
# Load the three datasets
data_path = "/content/drive/My Drive/Datos/"
df_train = pd.read_csv(data_path + "train_set.csv")
df_val = pd.read_csv(data_path + "val_set.csv")
df_test = pd.read_csv(data_path + "test_set.csv")

### Preprocessing the dataset

We replace user names with the word 'USER' and URLs with the word 'URL'. We do not remove punctuation because BERT works with it. We also keep capitalization, since we chose a BERT variant ("bert-base-cased") that was pretrained on cased text.




In [0]:
# Pattern used to detect URLs and replace them with URL (raw string so the backslashes reach the regex engine unchanged)
TEXT_URL = r"https?:\S+|http?:\S|www\.\S+|\S+\.(com|org|co|us|uk|net|gov|edu)"
# Pattern used to detect user names and replace them with USER
TEXT_USER = r"@\S+"
#TEXT_CLEANING_RE = "@\S+|https?:\S+|http?:\S|[^A-Za-z0-9,;:.']+" # The last alternative lists the characters we keep; BERT does work with punctuation.

In [0]:
def preprocess(text):
    text = re.sub(TEXT_URL,  'URL',    text)           # Replace URLs
    text = re.sub(TEXT_USER,  'USER', text)           # Replace user names
    text = re.sub(r'\s+', ' ',   text).strip()        # Collapse repeated whitespace and trim leading and trailing whitespace
    return text

In [0]:
df_train["text"] = df_train["text"].apply(preprocess)
df_val["text"] = df_val["text"].apply(preprocess)
df_test["text"] = df_test["text"].apply(preprocess)

In [0]:
print(df_train["text"][0])

USER yay !! have you told him now?


In [0]:
df_train.text.iloc[919]

'USER URL - ThatÂ´s pretty!!!!'
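
To make the effect of the three substitutions concrete, here is a small self-contained sketch that applies the same patterns to two made-up tweets. The sample strings are invented for illustration and do not come from the dataset.

```python
import re

# Same patterns as in the preprocessing cells of this notebook.
TEXT_URL = r"https?:\S+|http?:\S|www\.\S+|\S+\.(com|org|co|us|uk|net|gov|edu)"
TEXT_USER = r"@\S+"


def preprocess(text):
    text = re.sub(TEXT_URL, "URL", text)       # URLs become the token URL
    text = re.sub(TEXT_USER, "USER", text)     # mentions become the token USER
    return re.sub(r"\s+", " ", text).strip()   # collapse and trim whitespace


cleaned_link = preprocess("@alice   check https://example.org/page now!")
cleaned_domain = preprocess("see example.com")
print(cleaned_link)
print(cleaned_domain)
```

Punctuation and capitalization are left untouched, which matches the choice of the cased BERT variant.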

We change the positive labels from 4 to 1 so that we can reuse the HuggingFace functions.

In [0]:
df_train.target.value_counts()

4    640000
0    640000
Name: target, dtype: int64

In [0]:
decode_map = {0: 0, 4: 1}
def decode_sentiment(label):
    return decode_map[int(label)]

df_train.target = df_train.target.apply(lambda x: decode_sentiment(x))
df_val.target = df_val.target.apply(lambda x: decode_sentiment(x))
df_test.target = df_test.target.apply(lambda x: decode_sentiment(x))

In [0]:
df_train.target.value_counts()

1    640000
0    640000
Name: target, dtype: int64

In [0]:
df_val.target.value_counts()

1    80000
0    80000
Name: target, dtype: int64

In [0]:
df_test.target.value_counts()

1    80000
0    80000
Name: target, dtype: int64

We keep only the relevant columns of each dataset.


In [0]:
df_train = df_train[["target","text"]]
df_val = df_val[["target","text"]]
df_test = df_test[["target","text"]]
df_train.columns = ["label", "sentence"]
df_train.index.name = "idx"
df_train = df_train.reset_index()
df_val.columns = ["label", "sentence"]
df_val.index.name = "idx"
df_val = df_val.reset_index()
df_test.columns = ["label", "sentence"]
df_test.index.name = "idx"
df_test = df_test.reset_index()

In [0]:
df_train.head()

|   | idx | label | sentence |
|---|-----|-------|----------|
| 0 | 0 | 1 | USER yay !! have you told him now? |
| 1 | 1 | 1 | USER and me kicking your ass in rock band |
| 2 | 2 | 1 | USER I've lived under Pegasus' flight path for... |
| 3 | 3 | 0 | USER Link doesn't work |
| 4 | 4 | 0 | USER USER awwww poor metria!! |


#### Converting the datasets to a format BERT can work with

First, we turn the datasets into (TensorFlow) tensors.

In [0]:
train_examples = df_train.shape[0]
valid_examples = df_val.shape[0]
print(train_examples)
print(valid_examples)

1280000
160000


In [0]:
data_train = tf.data.Dataset.from_tensor_slices(df_train.to_dict('list'))

In [0]:
data_val = tf.data.Dataset.from_tensor_slices(df_val.to_dict('list'))

In [0]:
data_test = tf.data.Dataset.from_tensor_slices(df_test.to_dict('list'))

Next, we put the tensors into the format required by the HuggingFace BERT model, reusing the functions from one of their scripts, which also performs sentiment classification. Notebook 1 already showed a histogram of the distribution of the number of words per tweet.



In [0]:
train_dataset = glue_convert_examples_to_features(data_train, tokenizer, max_length=40, task=TASK) #O:128 #2:36

In [0]:
valid_dataset = glue_convert_examples_to_features(data_val, tokenizer, max_length=40, task=TASK) #O:128

In [0]:
test_dataset = glue_convert_examples_to_features(data_test, tokenizer, max_length=36, task=TASK) #O:128  # Note: max_length is 36 here but 40 for the training and validation sets
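
As a rough illustration of what a fixed `max_length` implies, the sketch below builds BERT-style inputs from a list of token ids in plain Python. It is not the HuggingFace implementation. `to_bert_features` is a hypothetical helper, and the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are assumptions about the vocabulary rather than values read from the tokenizer in this notebook.

```python
def to_bert_features(token_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Truncate or pad one tokenized sentence to exactly max_length positions."""
    # Reserve two positions for the [CLS] and [SEP] markers.
    body = list(token_ids)[: max_length - 2]
    input_ids = [cls_id] + body + [sep_id]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    input_ids += [pad_id] * padding
    attention_mask += [0] * padding
    # Single-sentence tasks use segment id 0 everywhere.
    token_type_ids = [0] * max_length
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


short = to_bert_features([7, 8, 9], max_length=8)
long = to_bert_features(range(1000, 1020), max_length=8)
print(short["input_ids"], short["attention_mask"])
print(long["input_ids"])
```

Sentences longer than `max_length - 2` tokens lose their tail, so a value that is too small discards information while a value that is too large spends computation on padding.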

## Training the model


In [0]:
EPOCHS = 6
BATCH_SIZE = 32
EVAL_BATCH_SIZE = BATCH_SIZE * 2

In [0]:
# Shuffle with a 128-element buffer, batch, and repeat indefinitely (steps_per_epoch bounds the length of each epoch)
train_dataset = train_dataset.shuffle(128).batch(BATCH_SIZE).repeat(-1) #Original:128 <- 1000
valid_dataset = valid_dataset.batch(EVAL_BATCH_SIZE)
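
A shuffle buffer of 128 elements only mixes examples locally when the dataset holds more than a million rows. The sketch below simulates a fixed-size shuffle buffer in plain Python to show this. It is a simplified model of the behaviour described in the `tf.data` documentation, not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical name.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in the order a fixed-size shuffle buffer would emit them."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


order = list(buffered_shuffle(range(1000), buffer_size=128))
earliest_jump = max(value - position for position, value in enumerate(order))
print(order[:10])
print(earliest_jump)
```

In this model an example can be emitted at most `buffer_size - 1` positions earlier than its original position, so if the input file were sorted by label or by date, most of that ordering would survive. Whether that matters here depends on how `train_set.csv` was built, which this notebook does not show.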

In [0]:
checkpoint_path = "/content/drive/My Drive/" 
checkpoint_dir = os.path.dirname(checkpoint_path)

class TimeHistory(tf.keras.callbacks.Callback):
    """Records the wall-clock duration of each epoch, in seconds."""

    def on_train_begin(self, logs=None):
        self.times = []

    def on_epoch_begin(self, epoch, logs=None):
        self.epoch_time_start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        self.times.append(time.time() - self.epoch_time_start)

time_callback = TimeHistory()

my_callbacks = [
    tf.keras.callbacks.ModelCheckpoint(filepath = checkpoint_path + 'my_best_model_BERT.{epoch:02d}-{val_accuracy:.2f}.h5', 
    verbose=1, save_best_only=True, save_weights_only=False, monitor = 'val_accuracy', mode = 'max'), 
    time_callback
  ]
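
The placeholders in the checkpoint file name are ordinary Python format fields. The sketch below shows how such a template expands, assuming the callback fills it with a 1-based epoch number and the logged metrics; the metric values are the second row of the training history reported later in this notebook.

```python
# Same template as the one passed to ModelCheckpoint above.
template = "my_best_model_BERT.{epoch:02d}-{val_accuracy:.2f}.h5"

# Metrics as they would appear in the logs at the end of the second epoch.
logs = {"loss": 0.290168, "val_accuracy": 0.862075}

# Extra keys in logs are simply ignored by str.format.
filename = template.format(epoch=2, **logs)
print(filename)
```

Because the accuracy is rounded to two decimals, checkpoints from different epochs are told apart mainly by the epoch number in the name.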

In [0]:
# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule
opt = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08)
if USE_AMP:
    # loss scaling is currently required when using mixed precision
    opt = tf.keras.mixed_precision.experimental.LossScaleOptimizer(opt, "dynamic")


if num_labels == 1:
    loss = tf.keras.losses.MeanSquaredError()
else:
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")
model.compile(optimizer=opt, loss=loss, metrics=[metric])

# Train and evaluate using tf.keras.Model.fit()
train_steps = train_examples // BATCH_SIZE
valid_steps = valid_examples // EVAL_BATCH_SIZE

history = model.fit(
    train_dataset,
    epochs=EPOCHS,
    steps_per_epoch=train_steps,
    validation_data=valid_dataset,
    validation_steps=valid_steps,
    verbose = 1,
    callbacks=my_callbacks
)

# Save TF2 model
#os.makedirs("./save/", exist_ok=True)
#model.save_pretrained(checkpoint_path) #save_pretrained(save_directory) 
#Save a model and its configuration file to a directory, so that it can be re-loaded using the 
#:func:`~transformers.PreTrainedModel.from_pretrained` class method.


Epoch 00003: val_accuracy did not improve from 0.86207
Epoch 4/6
Epoch 00004: val_accuracy did not improve from 0.86207
Epoch 5/6
Epoch 00005: val_accuracy did not improve from 0.86207
Epoch 6/6
Epoch 00006: val_accuracy did not improve from 0.86207
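
The model outputs two raw scores (logits) per tweet, which is why the loss is built with `from_logits=True`. As a worked example of what that loss computes for a single example, here is a plain-Python sketch; the function names and the logit values are made up for illustration.

```python
import math


def sparse_crossentropy_from_logits(logits, label):
    """Cross-entropy of one example: -log(softmax(logits)[label])."""
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_sum_exp - logits[label]


def predicted_class(logits):
    """Index of the largest logit, which is what the accuracy metric compares to the label."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [2.0, 0.0]  # the model favours class 0 (negative)
loss_if_negative = sparse_crossentropy_from_logits(logits, 0)
loss_if_positive = sparse_crossentropy_from_logits(logits, 1)
print(predicted_class(logits), loss_if_negative, loss_if_positive)
```

The loss is small when the favoured class matches the label and grows with the logit margin when it does not, so a model can keep the same accuracy while its validation loss rises if it becomes more confident in its mistakes.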


We save the model.


In [0]:
model.save(checkpoint_path + 'final_model_BERT_aaa.tf', save_format="tf")

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: /content/drive/My Drive/final_model_BERT_aaa.tf/assets


We also save the metrics recorded during training.


In [0]:
# convert the history.history dict to a pandas DataFrame:     
hist_df = pd.DataFrame(history.history) 

# save to json:  
hist_json_file = checkpoint_path + 'history.json' 
with open(hist_json_file, mode='w') as f:
    hist_df.to_json(f)

# or save to csv: 
hist_csv_file = checkpoint_path + 'history.csv'  # write next to the JSON file on Drive so it survives the Colab session
with open(hist_csv_file, mode='w') as f:
    hist_df.to_csv(f)

In [0]:
time_callback.times

[7729.534623146057,
 7713.080624580383,
 7665.665296316147,
 7668.061312198639,
 7684.038738012314,
 7699.367078065872]
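
To put these numbers in context, the following sketch redoes the step arithmetic from the training cell and converts the recorded epoch times into hours. All values are copied from this notebook; only the variable names are new.

```python
# Durations of the six epochs in seconds, as recorded by the TimeHistory callback.
epoch_seconds = [
    7729.534623146057,
    7713.080624580383,
    7665.665296316147,
    7668.061312198639,
    7684.038738012314,
    7699.367078065872,
]

# Same integer divisions as in the training cell.
train_steps = 1280000 // 32   # training examples // BATCH_SIZE
valid_steps = 160000 // 64    # validation examples // EVAL_BATCH_SIZE

total_hours = sum(epoch_seconds) / 3600
mean_epoch_minutes = sum(epoch_seconds) / len(epoch_seconds) / 60

print(train_steps, valid_steps)
print(round(total_hours, 2), round(mean_epoch_minutes, 1))
```

Each epoch therefore runs 40000 training steps, and the six epochs add up to a little under 13 hours on the GPU reported at the top of the notebook.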

In [0]:
hist_df["times"] = time_callback.times
hist_df

|   | loss | accuracy | val_loss | val_accuracy | times |
|---|------|----------|----------|--------------|-------|
| 0 | 0.345548 | 0.848331 | 0.323767 | 0.861344 | 7729.534623 |
| 1 | 0.290168 | 0.876339 | 0.332546 | 0.862075 | 7713.080625 |
| 2 | 0.249771 | 0.896404 | 0.356749 | 0.859175 | 7665.665296 |
| 3 | 0.216663 | 0.911868 | 0.388973 | 0.857944 | 7668.061312 |
| 4 | 0.189929 | 0.923989 | 0.422314 | 0.855456 | 7684.038738 |
| 5 | 0.169763 | 0.932598 | 0.466934 | 0.850569 | 7699.367078 |
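
The history can be read off programmatically. The sketch below, which uses the values from the table above and plain Python lists instead of the dataframe, finds the epoch with the best validation accuracy and the gap between training and validation accuracy. Row indices are 0-based, so index 1 is the second epoch.

```python
# Columns copied from the training history above.
train_accuracy = [0.848331, 0.876339, 0.896404, 0.911868, 0.923989, 0.932598]
val_accuracy = [0.861344, 0.862075, 0.859175, 0.857944, 0.855456, 0.850569]
val_loss = [0.323767, 0.332546, 0.356749, 0.388973, 0.422314, 0.466934]

epochs = range(len(val_accuracy))
best_accuracy_epoch = max(epochs, key=lambda i: val_accuracy[i])
best_loss_epoch = min(epochs, key=lambda i: val_loss[i])

# Difference between training and validation accuracy, per epoch.
gap = [round(t - v, 3) for t, v in zip(train_accuracy, val_accuracy)]

print(best_accuracy_epoch, val_accuracy[best_accuracy_epoch])
print(best_loss_epoch, gap)
```

Validation accuracy peaks at the second epoch while validation loss is already lowest at the first, and the gap widens every epoch afterwards. This is the usual signature of overfitting and suggests that fewer epochs would have been enough for this setup.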


In [0]:
# save to json:  
hist_json_file = checkpoint_path + 'history_with_times.json' 
with open(hist_json_file, mode='w') as f:
    hist_df.to_json(f)