<a href="https://colab.research.google.com/github/joSanchez28/BERT_on_tweets/blob/master/Libreta4_LSTM_con_Word_Embeddings_preentrenados.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LSTMs with Word Embeddings for Sentiment Classification

In this notebook we build a simple Keras model with bidirectional LSTM recurrent units and train it on our tweet dataset.

First, we import the required packages.

In [0]:
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
# For the LSTM-CNN model
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Conv1D, MaxPooling1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.initializers import Constant

import numpy as np
import pandas as pd
import re
import time

Parameters for the model and the training:

In [0]:
# Embedding
# max_features = 20000  # original value
max_features = 48000  # chosen because 47193 words appear at least 5 times in the training set
# maxlen = 100  # original value
# maxlen = 150  # the one I used
maxlen = 40
# embedding_size = 128
embedding_dim = 100  # could also be 25, 50 or 200 (the sizes available for download)

# Convolution
kernel_size = 5
filters = 64
pool_size = 4

# LSTM
lstm_output_size = 70

# Training
batch_size = 30
epochs = 20

# Note: training is highly sensitive to batch_size.

## Loading the word embeddings

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
word_emb_path = "/content/drive/My Drive/WordEmbeddings/"
#word_emb_path = '../WordEmbeddings/'
# first, build index mapping words in the embeddings set
# to their embedding vector

print('Indexing word vectors.')

embeddings_index = {}
with open(word_emb_path + "glove.twitter.27B.100d.txt", encoding="utf8") as f: #'glove.6B.100d.txt'
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, 'f', sep=' ')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Indexing word vectors.
Found 1193514 word vectors.
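
Each line of a GloVe text file holds a token followed by its space-separated coefficients. The following is a minimal sketch of the same parsing logic on an in-memory sample, using only the standard library. The two sample lines and their 4-dimensional vectors are made up for illustration and are not taken from the real file.

```python
import io

# Two made-up lines in GloVe text format: "<token> <v1> <v2> ... <vn>".
sample = io.StringIO(
    "you 0.1 -0.2 0.3 0.4\n"
    "hello 0.5 0.6 -0.7 0.8\n"
)

toy_embeddings_index = {}
for line in sample:
    # Split only once so the token is separated from the rest of the line.
    word, coefs = line.split(maxsplit=1)
    toy_embeddings_index[word] = [float(x) for x in coefs.split()]

print(len(toy_embeddings_index))    # number of tokens parsed
print(toy_embeddings_index["you"])  # coefficients stored for "you"
```

Splitting with `maxsplit=1` keeps the whole coefficient string intact, so the numeric part can be converted in a single step.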


In [0]:
embeddings_index['you']

array([ 7.3793e-02,  2.2958e-01,  1.6190e-01,  5.1383e-01, -1.3568e-01,
        5.9524e-02,  5.7240e-01, -3.3930e-01,  1.0477e-01,  2.4796e-01,
       -1.3659e-01, -3.7421e-01, -6.1651e+00, -3.6166e-01, -3.6804e-01,
       -8.1314e-02, -3.3600e-02, -3.0373e-01, -4.0536e-01,  9.4863e-02,
       -1.4260e-01, -2.3630e-01, -1.0712e-01,  2.4055e-01,  2.2325e-01,
       -6.2564e-01,  1.9939e-01,  5.1398e-01,  4.9040e-01, -4.6308e-01,
       -1.4342e-01,  1.9332e-02, -9.5564e-02,  2.5391e-01,  7.0189e-02,
        1.9461e-01,  3.5724e-01,  2.4704e-01,  3.8155e-01, -2.3231e-01,
       -9.9356e-01,  3.2767e-01,  3.0328e-01,  5.5577e-01,  5.8440e-01,
       -2.2246e-01, -2.4206e-01, -7.4880e-01,  2.3144e-01, -5.3725e-03,
       -3.1667e-01, -1.2560e-01,  4.0173e-01, -3.3374e-01,  9.1548e-01,
        2.6268e-01, -6.8389e-01,  3.3916e-01,  1.7124e-01,  4.7471e-01,
        3.8165e-01,  9.8252e-02, -4.3935e-01,  2.7527e-01,  2.3848e-01,
       -3.7455e-02, -9.7668e-01, -8.1719e-03, -3.7798e-01,  2.16

## Loading the datasets

In [0]:
# Load the three datasets
data_path = "/content/drive/My Drive/Datos/"
#data_path = "../Datos/"
df_train = pd.read_csv(data_path + "train_set.csv")
df_val = pd.read_csv(data_path + "val_set.csv")
df_test = pd.read_csv(data_path + "test_set.csv")

## Preprocessing the dataset

As we did in notebook 2 with the BERT model, we replace URLs with the word URL and usernames with the word USER. In addition, this time we remove all punctuation and lowercase the whole text (so the placeholders end up as `url` and `user`).

In [0]:
# Detects URLs so they can be replaced with URL (raw string avoids invalid-escape warnings)
TEXT_URL = r"https?:\S+|http?:\S|www\.\S+|\S+\.(com|org|co|us|uk|net|gov|edu)"
# Detects usernames so they can be replaced with USER
TEXT_USER = r"@\S+"
# Matches punctuation and any other non-alphanumeric characters
TEXT_CLEANING = r"[^A-Za-z0-9]+"

In [0]:
def preprocess(text, stem=False):  # `stem` is accepted but not used
    text = re.sub(TEXT_URL, 'URL', text)                  # replace URLs with the word 'URL'
    text = re.sub(TEXT_USER, 'USER', text)                # replace usernames with the word 'USER'
    text = re.sub(r'\s+', ' ', text).strip()              # collapse whitespace runs and trim both ends
    text = re.sub(TEXT_CLEANING, ' ', str(text).lower())  # lowercase, then replace non-alphanumeric runs with a space
    return text
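
To see what these substitutions do in practice, here is a small self-contained sketch that applies the same four steps to one tweet. The sample tweet and the name `preprocess_demo` are made up for illustration; the patterns are copied from the notebook and written as raw strings.

```python
import re

# Same patterns as in the notebook, written as raw strings.
TEXT_URL = r"https?:\S+|http?:\S|www\.\S+|\S+\.(com|org|co|us|uk|net|gov|edu)"
TEXT_USER = r"@\S+"
TEXT_CLEANING = r"[^A-Za-z0-9]+"

def preprocess_demo(text):
    text = re.sub(TEXT_URL, "URL", text)       # URLs -> 'URL'
    text = re.sub(TEXT_USER, "USER", text)     # @mentions -> 'USER'
    text = re.sub(r"\s+", " ", text).strip()   # normalise whitespace
    return re.sub(TEXT_CLEANING, " ", text.lower())  # lowercase, drop punctuation

sample_tweet = "@bob I LOVE this!! http://t.co/xyz :)"  # made-up tweet
tokens = preprocess_demo(sample_tweet).split()
print(tokens)
```

Because lowercasing happens after the substitutions, the placeholders reach the tokenizer as `user` and `url`. The cleaning step can also leave a trailing space (here from the `:)`), which is harmless because the text is later split on whitespace.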

In [0]:
df_train.text = df_train.text.apply(preprocess)
df_val.text = df_val.text.apply(preprocess)
df_test.text = df_test.text.apply(preprocess)

In [0]:
decode_map = {0: 0, 4: 1}
def decode_sentiment(label):
    return decode_map[int(label)]

df_train.target = df_train.target.apply(lambda x: decode_sentiment(x))
df_val.target = df_val.target.apply(lambda x: decode_sentiment(x))
df_test.target = df_test.target.apply(lambda x: decode_sentiment(x))

We keep only the relevant part of the dataset.

In [0]:
df_train = df_train[["target","text"]]
df_val = df_val[["target","text"]]
df_test = df_test[["target","text"]]
df_train.columns = ["label", "sentence"]
df_train.index.name = "idx"
df_train = df_train.reset_index()
df_val.columns = ["label", "sentence"]
df_val.index.name = "idx"
df_val = df_val.reset_index()
df_test.columns = ["label", "sentence"]
df_test.index.name = "idx"
df_test = df_test.reset_index()

In [0]:
df_train.label.value_counts()

1    640000
0    640000
Name: label, dtype: int64

In [0]:
df_val.label.value_counts()

1    80000
0    80000
Name: label, dtype: int64

## Some statistics after preprocessing the data

Average number of words per tweet.

In [0]:
np.mean([len(sentence.split()) for sentence in df_train.sentence.values])

13.65478515625

Number of distinct words that appear in our training set:

In [0]:
word_occurences_df = df_train.sentence.str.split(expand=True).stack().value_counts()

In [0]:
word_occurences_df = word_occurences_df.to_frame()
word_occurences_df = word_occurences_df.reset_index()
word_occurences_df.columns = ["Word", "Occurences"]

In [0]:
word_occurences_df.shape[0]

243871

Number of words that appear more than 2, 3, 4, 5 and 6 times:

In [0]:
word_occurences_df[word_occurences_df["Occurences"] > 2].shape[0]

68290

In [0]:
word_occurences_df[word_occurences_df["Occurences"] > 3].shape[0]

55151

In [0]:
word_occurences_df[word_occurences_df["Occurences"] > 4].shape[0]

47193

In [0]:
word_occurences_df[word_occurences_df["Occurences"] > 5].shape[0]

41715

In [0]:
word_occurences_df[word_occurences_df["Occurences"] > 6].shape[0]

37693
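
The same kind of frequency threshold can be computed without pandas. This is a minimal sketch with `collections.Counter` on a made-up three-sentence corpus; `toy_corpus` and `vocab_size_above` are illustrative names, not part of the notebook.

```python
from collections import Counter

# Made-up miniature corpus standing in for df_train.sentence.
toy_corpus = [
    "good morning user",
    "good night user",
    "good game",
]

# Count every whitespace-separated token across the corpus.
counts = Counter(word for sentence in toy_corpus for word in sentence.split())

def vocab_size_above(counts, threshold):
    """Number of distinct words occurring more than `threshold` times."""
    return sum(1 for c in counts.values() if c > threshold)

print(len(counts))                  # distinct words in the toy corpus
print(vocab_size_above(counts, 1))  # words seen more than once
```

Sweeping the threshold, as done above with 2 to 6, shows how quickly the vocabulary shrinks once rare words are dropped.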

## Tokenization
Finally, we convert the text into vocabulary indices using a Keras tokenizer. With ``num_words = max_features = 48000``, the tokenizer keeps the 47999 most frequent words of the training set (indices 1 to 47999; index 0 is reserved and never assigned to a word). All other words are ignored.

We set ``max_features = 48000`` because 47193 words appear more than 4 times (as seen in the previous section). In this way we ignore almost all words that appear fewer than 5 times, about which we have very little information (so it would be hard to learn anything useful about them).

In [0]:
# Build the vocabulary indices
tokenizer = Tokenizer(num_words=max_features)  # oov_token=None; only word indices below max_features are used
tokenizer.fit_on_texts(list(df_train.sentence.values))
#vocab_size = len(tokenizer.word_index) + 1
#tokenizer.texts_to_sequences(...)
#print(vocab_size)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 243871 unique tokens.


We tokenize a few sentences to get an idea of what the tokenizer does. The word 'himsik' is not in our vocabulary (because it appears fewer than 5 times in the training set), so the tokenizer ignores it.

In [0]:
tokenizer.texts_to_sequences(["my father is so fat", "you are awesome himsik"])

[[6, 1185, 10, 19, 1116], [9, 40, 164]]

We check that, after tokenizing, at most max_features - 1 distinct words are used.

In [0]:
train_tweets_tokenized = tokenizer.texts_to_sequences(df_train.sentence)
len(set([word_id for tweet in train_tweets_tokenized for word_id in tweet]))

47999
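
The count of 47999 rather than 48000 follows from how the vocabulary limit is applied: words are ranked by frequency starting at index 1, and only indices strictly below `num_words` are kept. The sketch below mimics that behaviour in plain Python on a made-up corpus. It is a simplified illustration, not the Keras implementation (the real `Tokenizer` also lowercases and filters punctuation), and `fit_word_index` and `to_sequence` are hypothetical helper names.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(w for t in texts for w in t.split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only known words whose index is strictly below num_words."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

toy_texts = ["a a a b b c", "a b d"]  # made-up corpus
toy_index = fit_word_index(toy_texts)

# With num_words=3 only indices 1 and 2 survive, i.e. num_words - 1 words.
print(to_sequence("a b c d e", toy_index, num_words=3))
```

Unknown words and words ranked at or beyond `num_words` are silently dropped, which is what happened to 'himsik' in the example above.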

In [0]:
x_train = sequence.pad_sequences(tokenizer.texts_to_sequences(df_train.sentence), maxlen=maxlen)
y_train = df_train.label.values
x_val = sequence.pad_sequences(tokenizer.texts_to_sequences(df_val.sentence), maxlen=maxlen)
y_val = df_val.label.values
x_test = sequence.pad_sequences(tokenizer.texts_to_sequences(df_test.sentence), maxlen=maxlen)
y_test = df_test.label.values
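
With its default arguments, `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, drops items from the front. A minimal pure-Python sketch of that default behaviour for a single sequence follows; `pad_pre` is a hypothetical helper written for illustration.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)
    if len(seq) > maxlen:
        seq = seq[len(seq) - maxlen:]             # truncate from the front
    return [value] * (maxlen - len(seq)) + seq    # pad at the front

print(pad_pre([6, 1185, 10], maxlen=5))           # short sequence gets padded
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))   # long sequence gets truncated
```

With `maxlen = 40` and an average of about 13.7 words per tweet, most rows of `x_train` consist largely of leading zeros.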

## Building and training the bidirectional LSTM model

We prepare the embedding matrix and create the embedding layer.

In [0]:
print('Preparing embedding matrix.')

# prepare embedding matrix
num_words = min(max_features, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(num_words,
                            embedding_dim,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=maxlen,
                            trainable=False)  # keep the embeddings frozen

Preparing embedding matrix.
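
Words without a pre-trained vector keep an all-zero row, so it can be useful to know how much of the vocabulary is actually covered. The following is a minimal sketch of such a coverage check; `embedding_coverage` is a hypothetical helper and the two dictionaries are made-up stand-ins for `tokenizer.word_index` and `embeddings_index`.

```python
def embedding_coverage(word_index, embeddings_index, max_features):
    """Fraction of in-vocabulary words that have a pre-trained vector."""
    kept = [w for w, i in word_index.items() if i < max_features]
    found = sum(1 for w in kept if w in embeddings_index)
    return found / len(kept) if kept else 0.0

# Made-up stand-ins for tokenizer.word_index and embeddings_index.
toy_word_index = {"good": 1, "user": 2, "himsik": 3, "rare": 4}
toy_embeddings = {"good": [0.1, 0.2], "user": [0.3, 0.4], "rare": [0.5, 0.6]}

# "rare" has index 4 and is outside the vocabulary; "himsik" has no vector.
print(embedding_coverage(toy_word_index, toy_embeddings, max_features=4))
```

A low coverage would suggest that many tokens are fed to the LSTM as zero vectors, which a frozen embedding layer cannot correct during training.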


We build the model.

In [0]:
print('Build model...')
model_LSTM_Bi = Sequential()
#model_LSTM_Bi.add(Embedding(max_features, 128, input_length=maxlen))

#sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
#embedded_sequences = embedding_layer(sequence_input)

model_LSTM_Bi.add(embedding_layer)

model_LSTM_Bi.add(Bidirectional(LSTM(64)))
model_LSTM_Bi.add(Dropout(0.5))
model_LSTM_Bi.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
# From BERT
#loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
#metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")
#model.compile(optimizer=opt, loss=loss, metrics=[metric])

# From BERT, adapted
#loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
#metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")
#model_LSTM_Bi.compile(optimizer = 'adam', loss=loss, metrics=[metric])

# Original setup for this LSTM
model_LSTM_Bi.compile('adam', 'binary_crossentropy', metrics=['accuracy'])

Build model...
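
As a sanity check on the model size, the parameter counts can be worked out by hand. This sketch assumes the default Keras `LSTM` configuration (four gates, each with input weights, recurrent weights and a bias) and a `Bidirectional` wrapper that concatenates both directions; `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM layer: 4 gates with input, recurrent and bias terms."""
    return 4 * (units * (input_dim + units) + units)

embedding_dim, units, vocab = 100, 64, 48000

frozen_embedding = vocab * embedding_dim         # 48000 x 100, not trained
bi_lstm = 2 * lstm_params(embedding_dim, units)  # forward + backward LSTM
dense = 2 * units * 1 + 1                        # 128 inputs -> 1 output, plus bias

print(frozen_embedding, bi_lstm, dense)
```

Under these assumptions almost all weights sit in the frozen embedding matrix, and only the recurrent and dense layers are updated during training.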


We define the callbacks and train the model:

In [0]:
checkpoint_path = "/content/drive/My Drive/"
#checkpoint_path = "./"

class TimeHistory(tf.keras.callbacks.Callback):
    """Records the wall-clock duration of each epoch, in seconds."""

    def on_train_begin(self, logs=None):
        self.times = []

    def on_epoch_begin(self, epoch, logs=None):
        self.epoch_time_start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        self.times.append(time.time() - self.epoch_time_start)

time_callback = TimeHistory()

my_callbacks = [
    tf.keras.callbacks.ModelCheckpoint(filepath = checkpoint_path + 'my_best_model_LSTM_Bi_WE.{epoch:02d}-{val_accuracy:.2f}.h5', 
    verbose=1, save_best_only=True, save_weights_only=False, monitor = 'val_accuracy', mode = 'max'), 
    time_callback
  ]

In [0]:
print('Train...')
history = model_LSTM_Bi.fit(x_train, y_train,
                            batch_size=batch_size,
                            epochs=epochs,
                            validation_data=(x_val, y_val),
                            verbose = 1,
                            callbacks=my_callbacks)

Train...
Epoch 1/20
Epoch 00001: val_accuracy improved from -inf to 0.82350, saving model to /content/drive/My Drive/my_best_model_LSTM_Bi_WE.01-0.82.h5
Epoch 2/20
Epoch 00002: val_accuracy improved from 0.82350 to 0.82936, saving model to /content/drive/My Drive/my_best_model_LSTM_Bi_WE.02-0.83.h5
Epoch 3/20
Epoch 00003: val_accuracy improved from 0.82936 to 0.83202, saving model to /content/drive/My Drive/my_best_model_LSTM_Bi_WE.03-0.83.h5
Epoch 4/20
Epoch 00004: val_accuracy did not improve from 0.83202
Epoch 5/20
Epoch 00005: val_accuracy improved from 0.83202 to 0.83205, saving model to /content/drive/My Drive/my_best_model_LSTM_Bi_WE.05-0.83.h5
Epoch 6/20
Epoch 00006: val_accuracy improved from 0.83205 to 0.83297, saving model to /content/drive/My Drive/my_best_model_LSTM_Bi_WE.06-0.83.h5
Epoch 7/20
Epoch 00007: val_accuracy did not improve from 0.83297
Epoch 8/20
Epoch 00008: val_accuracy did not improve from 0.83297
Epoch 9/20
Epoch 00009: val_accuracy improved from 0.83297 to

We evaluate it on the test set:

In [0]:
loss, acc = model_LSTM_Bi.evaluate(x_test, y_test, batch_size=batch_size)
print('Test loss:', loss)
print('Test accuracy:', acc)

Test loss: 0.3805888295173645
Test accuracy: 0.8296375274658203


We save the model and the data collected during training.

In [0]:
model_LSTM_Bi.save(checkpoint_path + 'final_model_LSTM_Bi_WE.h5')

In [0]:
# convert the history.history dict to a pandas DataFrame:     
hist_df = pd.DataFrame(history.history) 


# save to csv: 
hist_csv_file = checkpoint_path + 'history_LSTM_Bi_WE.csv'
with open(hist_csv_file, mode='w') as f:
    hist_df.to_csv(f)

In [0]:
time_callback.times

[381.38500022888184,
 373.1717915534973,
 373.3835184574127,
 364.3669447898865,
 370.7132284641266,
 369.80929255485535,
 366.3093695640564,
 366.94583201408386,
 371.34428429603577,
 368.5071392059326,
 365.5348379611969,
 363.39785146713257,
 360.41564297676086,
 359.44333004951477,
 359.2808494567871,
 356.675998210907,
 356.39134097099304,
 357.43105125427246,
 363.16758966445923,
 366.8707072734833]

In [0]:
hist_df["times"] = time_callback.times
hist_df

| | loss | accuracy | val_loss | val_accuracy | times |
|---|---|---|---|---|---|
| 0 | 0.419442 | 0.806927 | 0.390838 | 0.8235 | 381.385 |
| 1 | 0.38447 | 0.826697 | 0.378699 | 0.829356 | 373.171792 |
| 2 | 0.37234 | 0.833673 | 0.375309 | 0.832025 | 373.383518 |
| 3 | 0.364484 | 0.837668 | 0.377391 | 0.830387 | 364.366945 |
| 4 | 0.359072 | 0.840523 | 0.375273 | 0.83205 | 370.713228 |
| 5 | 0.354643 | 0.842912 | 0.373484 | 0.832969 | 369.809293 |
| 6 | 0.351082 | 0.844605 | 0.374886 | 0.831569 | 366.30937 |
| 7 | 0.348174 | 0.846277 | 0.375204 | 0.831519 | 366.945832 |
| 8 | 0.345734 | 0.847591 | 0.374238 | 0.833569 | 371.344284 |
| 9 | 0.343779 | 0.848465 | 0.37538 | 0.831438 | 368.507139 |
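
The history makes it easy to locate the epoch with the best validation accuracy. A minimal sketch using only the `val_accuracy` values of the rows displayed above (the remaining epochs are not shown here, so the result only covers these rows):

```python
# val_accuracy for the rows displayed above (index 0 is the first epoch).
val_accuracy = [0.8235, 0.829356, 0.832025, 0.830387, 0.83205,
                0.832969, 0.831569, 0.831519, 0.833569, 0.831438]

# Index of the highest validation accuracy among the displayed rows.
best_idx = max(range(len(val_accuracy)), key=val_accuracy.__getitem__)
print(best_idx + 1, val_accuracy[best_idx])  # 1-based epoch number and its score
```

Among these rows, validation accuracy moves by only about a point after the first epoch while training accuracy keeps rising, which is why the checkpoint callback monitors `val_accuracy` rather than the training metric.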


In [0]:
# save to csv: 
hist_csv_file = checkpoint_path + 'history_LSTM_Bi_WE_with_times.csv'
with open(hist_csv_file, mode='w') as f:
    hist_df.to_csv(f)

## Building and training the LSTM-CNN model

We set up the callbacks and build the model.

In [0]:
my_callbacks = [
    tf.keras.callbacks.ModelCheckpoint(filepath = checkpoint_path + 'my_best_model_LSTM-CNN_WE.{epoch:02d}-{val_accuracy:.2f}.h5', 
    verbose=1, save_best_only=True, save_weights_only=False, monitor = 'val_accuracy', mode = 'max'), 
    time_callback
  ]

In [0]:
# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(num_words,
                            embedding_dim,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=maxlen,
                            trainable=False)  # keep the embeddings frozen

In [0]:
print('Build model...')

model = Sequential()
model.add(embedding_layer) 
model.add(Dropout(0.25))
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(MaxPooling1D(pool_size=pool_size))
model.add(LSTM(lstm_output_size))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Build model...
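
The convolution and pooling layers shorten the sequence before it reaches the LSTM. The sketch below works out the resulting lengths, assuming 'valid' padding for both layers and a pooling stride equal to `pool_size` (the Keras default when no stride is given); both helper functions are hypothetical names.

```python
def conv1d_valid_len(length, kernel_size, stride=1):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

def maxpool1d_len(length, pool_size):
    """Output length of max pooling with stride == pool_size and 'valid' padding."""
    return (length - pool_size) // pool_size + 1

maxlen, kernel_size, pool_size = 40, 5, 4

after_conv = conv1d_valid_len(maxlen, kernel_size)  # length after Conv1D
after_pool = maxpool1d_len(after_conv, pool_size)   # length after MaxPooling1D
print(after_conv, after_pool)
```

Under these assumptions the LSTM processes far fewer time steps than the 40 seen by the bidirectional model, which is consistent with the shorter epoch times reported for this architecture.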


We train the model:

In [0]:
print('Train...')
history = model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(x_val, y_val),
              verbose = 1,
              callbacks=my_callbacks)

Train...
Epoch 1/20
Epoch 00001: val_accuracy improved from -inf to 0.80712, saving model to /content/drive/My Drive/my_best_model_LSTM-CNN_WE.01-0.81.h5
Epoch 2/20
Epoch 00002: val_accuracy improved from 0.80712 to 0.81144, saving model to /content/drive/My Drive/my_best_model_LSTM-CNN_WE.02-0.81.h5
Epoch 3/20
Epoch 00003: val_accuracy improved from 0.81144 to 0.81275, saving model to /content/drive/My Drive/my_best_model_LSTM-CNN_WE.03-0.81.h5
Epoch 4/20
Epoch 00004: val_accuracy improved from 0.81275 to 0.81556, saving model to /content/drive/My Drive/my_best_model_LSTM-CNN_WE.04-0.82.h5
Epoch 5/20
Epoch 00005: val_accuracy improved from 0.81556 to 0.81606, saving model to /content/drive/My Drive/my_best_model_LSTM-CNN_WE.05-0.82.h5
Epoch 6/20
Epoch 00006: val_accuracy improved from 0.81606 to 0.81679, saving model to /content/drive/My Drive/my_best_model_LSTM-CNN_WE.06-0.82.h5
Epoch 7/20
Epoch 00007: val_accuracy improved from 0.81679 to 0.81729, saving model to /content/drive/My D

We evaluate it on the test set:

In [0]:
loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print('Test loss:', loss)
print('Test accuracy:', acc)

Test loss: 0.40108928084373474
Test accuracy: 0.8174750208854675


We save the model and the data collected during training.

In [0]:
model.save(checkpoint_path + 'final_model_LSTM-CNN_WE.h5')

In [0]:
# convert the history.history dict to a pandas DataFrame:     
hist_df = pd.DataFrame(history.history) 


# save to csv:
hist_csv_file = checkpoint_path + 'history_LSTM-CNN_WE.csv'
with open(hist_csv_file, mode='w') as f:
    hist_df.to_csv(f)

In [0]:
time_callback.times

[249.00644278526306,
 246.42788124084473,
 245.67459964752197,
 245.157888174057,
 245.5422441959381,
 243.9165802001953,
 243.54245257377625,
 238.52641987800598,
 242.12065958976746,
 238.6482434272766,
 239.6029350757599,
 241.6167447566986,
 244.5101420879364,
 238.62593126296997,
 238.50634360313416,
 241.39828538894653,
 238.1961395740509,
 237.45863366127014,
 238.69373393058777,
 236.3691189289093]

In [0]:
hist_df["times"] = time_callback.times
hist_df

| | loss | accuracy | val_loss | val_accuracy | times |
|---|---|---|---|---|---|
| 0 | 0.454774 | 0.783092 | 0.416801 | 0.807125 | 249.006443 |
| 1 | 0.431754 | 0.797836 | 0.410984 | 0.811437 | 246.427881 |
| 2 | 0.425398 | 0.801681 | 0.408278 | 0.81275 | 245.6746 |
| 3 | 0.422348 | 0.803635 | 0.403609 | 0.815562 | 245.157888 |
| 4 | 0.420724 | 0.80453 | 0.403732 | 0.816063 | 245.542244 |
| 5 | 0.418654 | 0.806048 | 0.401459 | 0.816794 | 243.91658 |
| 6 | 0.417566 | 0.80665 | 0.401121 | 0.817288 | 243.542453 |
| 7 | 0.416775 | 0.807144 | 0.400908 | 0.816744 | 238.52642 |
| 8 | 0.416085 | 0.807677 | 0.400974 | 0.817463 | 242.12066 |
| 9 | 0.416013 | 0.807557 | 0.401044 | 0.817256 | 238.648243 |


In [0]:
# save to csv: 
hist_csv_file = checkpoint_path + 'history_LSTM-CNN_WE_with_times.csv'
with open(hist_csv_file, mode='w') as f:
    hist_df.to_csv(f)