# Experiment 16
Sentiment analysis experiment (negative or positive text) on data from the IMDB platform.

The core idea is to measure performance and accuracy when classifying a review as positive or negative with larger or smaller amounts of data. Fine-tuning techniques start from a model pre-trained on a massive amount of data, but how much data is needed to reach good results on a specific task? Is it worth fine-tuning with a lot of data? Could roughly the same results be obtained with less data?

**The texts are cleaned of HTML tags and other tokens that carry no semantic information. The full dataset is used: the first 48,000 rows for training and always the same rows (the last 2,000) for validation.**

In [None]:
# Install the required packages

In [None]:
pip install transformers numpy torch scikit-learn wandb

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting wandb
  Downloading https://files.pythonhosted.org/packages/5c/ee/d755f9e5466df64c8416a2c6a860fb3aaa43ed6ea8e8e8e81460fda5788b/wandb-0.10.28-py2.py3-none-any.whl (2.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)

In [None]:
# Import the required packages
import torch
import tensorflow as tf
from transformers.file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
import random
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
import pandas as pd
from bs4 import BeautifulSoup
import re,string,unicodedata
import nltk
from nltk.corpus import stopwords


In [None]:
# Set a seed so that runs are reproducible
def set_seed(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # ^^ safe to call this function even if cuda is not available
    if is_tf_available():
        print('USING TENSORFLOW')
        import tensorflow as tf

        tf.random.set_seed(seed)

set_seed(1)

USING TENSORFLOW
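
As a quick illustration of why the seed matters, the sketch below uses only Python's standard `random` module rather than the notebook's full `set_seed`. Re-seeding with the same value reproduces the same draws. The helper name `draw_sample` is hypothetical.

```python
import random

def draw_sample(seed, n=5):
    """Return n pseudo-random integers after seeding the generator."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]

first_run = draw_sample(1)
second_run = draw_sample(1)

# Same seed, same sequence of draws.
print(first_run == second_run)
```

Seeding the Python, NumPy and framework generators narrows the sources of variation, although some GPU operations may still be non-deterministic.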


In [None]:
"""
  Define the pre-trained model to use. This experiment starts from
  "distilbert", a distilled version of "BERT" that consumes fewer
  resources, in its "uncased" variant, i.e. one that does not distinguish
  between upper and lower case during training.
"""
model_name = "distilbert-base-uncased"
# Maximum length of each example, in tokens (not characters)
max_length = 512

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load the dataset from Drive.
df = pd.read_csv('/content/drive/MyDrive/imdb_dataset.csv')

In [None]:
# Dataset summary
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,negative
freq,5,25000
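
The `unique` count above (49,582 of 50,000) implies that 418 rows repeat a text that already appears elsewhere. A minimal sketch of how such repeats could be counted, using a toy list instead of the real dataset (all names are illustrative):

```python
from collections import Counter

# Toy stand-in for the "review" column; the real dataset has 50,000 rows.
reviews = ["great film", "awful plot", "great film", "loved it", "great film"]

counts = Counter(reviews)
n_unique = len(counts)
n_repeated_rows = len(reviews) - n_unique
most_common_text, most_common_freq = counts.most_common(1)[0]

print(n_unique, n_repeated_rows, most_common_text, most_common_freq)
```

With the figures reported by `describe()`, the same subtraction gives 50000 - 49582 = 418. Whether any of those repeats fall on both sides of the training/validation split is not checked in this notebook.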


In [None]:
# Sample rows. The "review" column holds the comment and "sentiment" whether it is positive or negative
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
# To work with the labels, the "sentiment" column is converted to integers: "positive" becomes 0 and "negative" becomes 1
df['sentiment01'] = df['sentiment'].replace(['positive','negative'],[0,1])
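
Here positive is 0 and negative is 1, which is the reverse of what many tutorials use, so it can help to keep the mapping in one place. A minimal sketch, where `LABEL_TO_ID`, `ID_TO_LABEL` and `encode_labels` are hypothetical names not used elsewhere in the notebook:

```python
LABEL_TO_ID = {"positive": 0, "negative": 1}
ID_TO_LABEL = {v: k for k, v in LABEL_TO_ID.items()}

def encode_labels(labels):
    """Map text labels to integers, failing loudly on unexpected values."""
    unknown = set(labels) - set(LABEL_TO_ID)
    if unknown:
        raise ValueError(f"Unexpected labels: {sorted(unknown)}")
    return [LABEL_TO_ID[label] for label in labels]

encoded = encode_labels(["positive", "negative", "positive"])
print(encoded)
```

With pandas, the same dictionary could be applied through `df['sentiment'].map(LABEL_TO_ID)`.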

In [None]:
# The dataframe now looks like this
df.head()

Unnamed: 0,review,sentiment,sentiment01
0,One of the other reviewers has mentioned that ...,positive,0
1,A wonderful little production. <br /><br />The...,positive,0
2,I thought this was a wonderful way to spend ti...,positive,0
3,Basically there's a family where a little boy ...,negative,1
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,0


In [None]:
# In this experiment the texts are preprocessed

# Load the stop words
nltk.download('stopwords')
stopword_list=nltk.corpus.stopwords.words('english')



# Clean a review: strip HTML, lowercase, drop punctuation and stop words
def preprocesado(texto):
  # Remove HTML tags with the BeautifulSoup library
  soup = BeautifulSoup(texto, "html.parser")
  # Convert to lowercase
  texto = soup.get_text().lower()
  # Isolate punctuation marks with spaces, then remove them all except "?"
  # ("!" is removed too, because the second pattern only preserves "?")
  texto = re.sub(r'([\'\"\.\(\)\!\?\\\/\,])', r' \1 ', texto)
  texto = re.sub(r'[^\w\s\?]', ' ', texto)
  # Replace remaining special characters and newlines with a space
  texto = re.sub(r'([\;\:\|•«\n])', ' ', texto)  
  # Remove stop words (they carry little meaning), except 'not' and 'can'
  texto = " ".join([word for word in texto.split()
                  if word not in stopword_list
                  or word in ['not', 'can']])
  # Collapse consecutive spaces (some were introduced above)
  texto = re.sub(r'\s+', ' ', texto).strip()
  return texto

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
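
To see the effect of these steps on a short string without downloading anything, here is a simplified sketch that uses only the standard library. It is not the notebook's `preprocesado`: tag removal relies on a crude regular expression instead of BeautifulSoup, and `STOP_WORDS` is a tiny hypothetical list standing in for NLTK's.

```python
import re

# Assumption: a handful of stop words for illustration only.
STOP_WORDS = {"a", "the", "is", "it", "this", "not", "can"}
KEEP_WORDS = {"not", "can"}

def preprocess_sketch(text):
    # Crude tag removal (stand-in for BeautifulSoup), then lowercase.
    text = re.sub(r"<[^>]+>", " ", text).lower()
    # Isolate common punctuation marks so they become separate tokens.
    text = re.sub(r"([.,!?])", r" \1 ", text)
    # Drop everything that is not a word character, whitespace or "?".
    text = re.sub(r"[^\w\s?]", " ", text)
    # Remove stop words, keeping the ones that carry negation or modality.
    words = [w for w in text.split() if w not in STOP_WORDS or w in KEEP_WORDS]
    return " ".join(words)

result = preprocess_sketch("This is NOT a good film.<br /><br />Can it get worse?!")
print(result)  # not good film can get worse ?
```

The question mark survives as its own token while the exclamation mark is dropped, which matches the behaviour of the two regular expressions in the cell above.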


In [None]:
df['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [None]:
# Overwrite the column with the cleaned text
df['review']=df['review'].apply(preprocesado)

In [None]:
df['review'][1]

'wonderful little production filming technique unassuming old time bbc fashion gives comforting sometimes discomforting sense realism entire piece actors extremely well chosen michael sheen not got polari voices pat can truly see seamless editing guided references williams diary entries not well worth watching terrificly written performed piece masterful production one great master comedy life realism really comes home little things fantasy guard rather use traditional dream techniques remains solid disappears plays knowledge senses particularly scenes concerning orton halliwell sets particularly flat halliwell murals decorating every surface terribly well done'

In [None]:
# Split the data into training and evaluation sets
# All experiments now use the same 2,000 rows for evaluation
# In this case, 48,000 rows are used for training
train_texts = df['review'][:48000]
train_labels = df['sentiment01'][:48000]

valid_texts = df['review'][48000:50000]
valid_labels = df['sentiment01'][48000:50000]

In [None]:
# Check the size of each split
print('Training size: ' + str(len(train_texts)) if len(train_texts) == len(train_labels) else "The split is not correct" )
print('Validation size: ' + str(len(valid_texts)) if len(valid_labels) == len(valid_texts) else "The split is not correct" )

Training size: 48000
Validation size: 2000
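
Since the goal is to compare training set sizes against a fixed validation set, the split can be expressed as a small helper. This is a sketch on plain lists with the hypothetical name `split_fixed_validation`. With the notebook's DataFrame, the same slices would apply to `df['review']` and `df['sentiment01']`.

```python
def split_fixed_validation(rows, n_train, n_valid):
    """Take the first n_train rows for training and the last n_valid for validation."""
    if n_train + n_valid > len(rows):
        raise ValueError("Training and validation slices would overlap")
    return rows[:n_train], rows[len(rows) - n_valid:]

rows = list(range(50))  # toy stand-in for the 50,000 reviews
train_small, valid_a = split_fixed_validation(rows, n_train=10, n_valid=2)
train_large, valid_b = split_fixed_validation(rows, n_train=48, n_valid=2)

# The validation rows are identical regardless of the training size.
print(len(train_small), len(train_large), valid_a == valid_b)
```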


In [None]:
# Load the tokenizer.
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

In [None]:
# Tokenize the training and validation texts. Sequences longer than "max_length"
# tokens are truncated to that length; shorter ones are padded with 0s up to the longest sequence in the set.
train_encodings = tokenizer(train_texts.tolist(), truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts.tolist(), truncation=True, padding=True, max_length=max_length)
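
The effect of `truncation=True` and `padding=True` can be illustrated without the tokenizer. The sketch below is a conceptual stand-in, not the Hugging Face implementation: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and a real tokenizer also takes care of special tokens such as `[CLS]` and `[SEP]` when truncating.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate to max_length, then pad every sequence to the longest one left."""
    truncated = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in truncated]
    return input_ids, attention_mask

ids, mask = pad_and_truncate([[101, 7, 8, 9, 10, 102], [101, 5, 102]], max_length=4)
print(ids)   # [[101, 7, 8, 9], [101, 5, 102, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```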

In [None]:
# Create the TensorFlow datasets that will later feed the "fit" function.
# Each text is paired with its label (0 or 1) here, so no separate
# label argument is needed when calling "fit".

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(valid_encodings),
    valid_labels
))
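
Conceptually, `from_tensor_slices` cuts each array along its first axis and yields one `(features, label)` pair per row. The pure-Python sketch below mimics that pairing and the later batching. `slice_pairs` and `batch` are hypothetical helpers, not TensorFlow functions, and the token ids are made up.

```python
def slice_pairs(encodings, labels):
    """Yield one (features, label) pair per example, like slicing along axis 0."""
    for i, label in enumerate(labels):
        yield {key: values[i] for key, values in encodings.items()}, label

def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    items = list(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

encodings = {"input_ids": [[101, 5, 102], [101, 7, 102], [101, 9, 102]],
             "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]]}
labels = [0, 1, 0]

pairs = list(slice_pairs(encodings, labels))
batches = batch(pairs, batch_size=2)
print(pairs[1])
print([len(b) for b in batches])  # [2, 1]
```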

In [None]:
# Check the available hardware. If no GPU is available, connecting the
# runtime to one is recommended.
# (Runtime -> Change runtime type -> GPU (hardware accelerator))

if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'{torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU.')
    device = torch.device("cpu")

1 GPU(s) available.
Device name: Tesla T4
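
The check above queries PyTorch, while the training below runs in TensorFlow. A hedged sketch of the equivalent TensorFlow query using `tf.config.list_physical_devices`. The `try/except` fallback is only there so the snippet also runs where TensorFlow is not installed.

```python
try:
    import tensorflow as tf
    gpus = tf.config.list_physical_devices("GPU")
except ImportError:
    # Assumption for illustration: treat a missing TensorFlow install as "no GPU".
    gpus = []

if gpus:
    print(f"TensorFlow sees {len(gpus)} GPU(s).")
else:
    print("TensorFlow sees no GPU; training would run on the CPU.")
```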


In [None]:
# Weights & Biases is used to log the training metrics.
# The config dictionary below is a placeholder, not the real hyperparameters.
import wandb
from wandb.keras import WandbCallback

wandb.init(config={"hyper": "parameter"}, project="IMDB_TF")

In [None]:
# Load DistilBERT with a newly initialised two-class classification head
model = TFDistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss, metrics=['accuracy'])
# The datasets are already batched, so batch_size is not passed to fit()
model.fit(train_dataset.shuffle(100).batch(16),
          epochs=3,
          validation_data=val_dataset.shuffle(100).batch(16),
          callbacks=[WandbCallback()],
          )

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_transform', 'vocab_layer_norm', 'activation_13', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier', 'dropout_19', 'pre_classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

Epoch 1/3
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Cause: while/else statement not yet supported

wandb: ERROR Can't save model, h5py returned error: Saving the model to HDF5 format requires the model to be a Functional model or a Sequential model. It does not work for subclassed models, because such models are defined via the body of a Python method, which isn't safely serializable. Consider saving to the Tensorflow SavedModel format (by setting save_format="tf") or using `save_weights`.


Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7fe8202e4e10>
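
As a quick sanity check on the training cost, the number of optimizer steps follows from the dataset size, batch size and epoch count used above. A worked example (the variable names are illustrative):

```python
import math

n_train = 48_000
n_valid = 2_000
batch_size = 16
epochs = 3

steps_per_epoch = math.ceil(n_train / batch_size)
validation_steps = math.ceil(n_valid / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, validation_steps, total_steps)  # 3000 125 9000
```

Halving the training rows would halve `steps_per_epoch`, which is the kind of saving the experiment weighs against any loss in accuracy.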

In [None]:
# Save the model so that it can be loaded later.
# This avoids having to repeat the training every time.
model.save_pretrained("./drive/MyDrive/Modelos/IMDB_TF_limpio_8020_48k_E16")

In [None]:
# Load the fine-tuned model that was saved previously.
from transformers import TFDistilBertForSequenceClassification
loaded_model = TFDistilBertForSequenceClassification.from_pretrained('./drive/MyDrive/Modelos/IMDB_TF_limpio_8020_48k_E16')

Some layers from the model checkpoint at ./drive/MyDrive/Modelos/IMDB_TF_limpio_8020_48k_E16 were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at ./drive/MyDrive/Modelos/IMDB_TF_limpio_8020_48k_E16 and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Predict whether a text is positive or negative
text = """
The film is terrible. I totally hate it. If I was the director, it would be much better. It's completely awfull.
"""

#text = valid_texts.values[17]

predict_input = tokenizer.encode(text,
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")
tf_output = loaded_model.predict(predict_input)[0]



In [None]:
# Softmax converts the two output values into probabilities:
# the probability that the label is 0 (positive review) and that it is 1 (negative review)
tf_prediction = tf.nn.softmax(tf_output, axis=1).numpy()[0]
print("The probability that the review is positive is " + str(round(tf_prediction[0],6)) + ", and that it is negative is " + str(round(tf_prediction[1],6)) + ".")
print("Therefore, the review is " + ("positive" if tf_prediction[0] > tf_prediction[1] else "negative"))



The probability that the review is positive is 0.000332, and that it is negative is 0.999668.
Therefore, the review is negative
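
For reference, softmax turns the two logits into probabilities that sum to 1. A minimal pure-Python sketch with made-up logits (not the model's actual output). `softmax` here is a hand-written helper rather than `tf.nn.softmax`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = {0: "positive", 1: "negative"}  # same mapping as sentiment01

logits = [-4.0, 4.0]  # hypothetical values for illustration
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, LABELS[predicted])
```

The values are probabilities in the range 0 to 1, so they would need multiplying by 100 before being reported as percentages.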
