<a href="https://colab.research.google.com/github/joSanchez28/BERT_on_tweets/blob/master/Libreta3_LSTM_b%C3%A1sica.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LSTMs for sentiment classification

In this notebook we build a simple Keras model based on bidirectional LSTM recurrent units and train it on our tweet dataset.

First, we import the required packages.

In [0]:
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
# For the LSTM-CNN model
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Conv1D, MaxPooling1D
from tensorflow.keras.preprocessing.text import Tokenizer

import numpy as np
import pandas as pd
import re
import time

Model and training parameters:

In [0]:
# Embedding
#max_features = 20000  # original value
max_features = 48000  # chosen because 47193 words appear at least 5 times in the training set
#maxlen = 100  # original value
#maxlen = 150  # value I used previously
maxlen = 40
embedding_size = 128

# Convolution
kernel_size = 5
filters = 64
pool_size = 4

# LSTM
lstm_output_size = 70

# Training
batch_size = 30
epochs = 8

# Note: training is highly sensitive to batch_size.

## Loading the datasets

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
# Load the three datasets
data_path = "/content/drive/My Drive/Datos/"
#data_path = "../Datos/"
df_train = pd.read_csv(data_path + "train_set.csv")
df_val = pd.read_csv(data_path + "val_set.csv")
df_test = pd.read_csv(data_path + "test_set.csv")

## Preprocessing the dataset

As in notebook 2 with the BERT model, we replace URLs with the word URL and usernames with the word USER. In addition, this time we remove all punctuation and lowercase the whole text, so the placeholders end up as ``url`` and ``user``.

In [0]:
# Detects URLs so they can be replaced with URL (raw strings avoid invalid-escape warnings)
TEXT_URL = r"https?:\S+|http?:\S|www\.\S+|\S+\.(com|org|co|us|uk|net|gov|edu)"
# Detects usernames so they can be replaced with USER
TEXT_USER = r"@\S+"
# Matches punctuation and any other non-alphanumeric characters
TEXT_CLEANING = r"[^A-Za-z0-9]+"

In [0]:
def preprocess(text, stem=False):  # stem is currently unused
    text = re.sub(TEXT_URL, 'URL', text)                  # Replace URLs with the word 'URL'
    text = re.sub(TEXT_USER, 'USER', text)                # Replace usernames with the word 'USER'
    text = re.sub(r'\s+', ' ', text).strip()              # Collapse repeated whitespace and trim both ends
    text = re.sub(TEXT_CLEANING, ' ', str(text).lower())  # Lowercase, then replace punctuation and non-alphanumeric runs with a space
    return text
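
To make the effect of these steps concrete, here is a small self-contained sketch that repeats the patterns and the function from the cells above and applies them to a made-up tweet. The sample text is hypothetical and only serves as an illustration.

```python
import re

# Same patterns as in the cells above, written as raw strings.
TEXT_URL = r"https?:\S+|http?:\S|www\.\S+|\S+\.(com|org|co|us|uk|net|gov|edu)"
TEXT_USER = r"@\S+"
TEXT_CLEANING = r"[^A-Za-z0-9]+"

def preprocess(text):
    text = re.sub(TEXT_URL, "URL", text)                  # URLs -> 'URL'
    text = re.sub(TEXT_USER, "USER", text)                # usernames -> 'USER'
    text = re.sub(r"\s+", " ", text).strip()              # collapse whitespace
    text = re.sub(TEXT_CLEANING, " ", str(text).lower())  # lowercase, drop punctuation
    return text

sample = "@bob I LOVE this!! http://t.co/abc"  # hypothetical tweet
cleaned = preprocess(sample)
print(repr(cleaned))  # 'user i love this url'
```

Note that the placeholders come out in lowercase because lowercasing happens after the substitutions.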

In [0]:
df_train.text = df_train.text.apply(preprocess)
df_val.text = df_val.text.apply(preprocess)
df_test.text = df_test.text.apply(preprocess)

In [0]:
decode_map = {0: 0, 4: 1}
def decode_sentiment(label):
    return decode_map[int(label)]

df_train.target = df_train.target.apply(decode_sentiment)
df_val.target = df_val.target.apply(decode_sentiment)
df_test.target = df_test.target.apply(decode_sentiment)

We keep only the relevant columns of each dataset.

In [0]:
df_train = df_train[["target","text"]]
df_val = df_val[["target","text"]]
df_test = df_test[["target","text"]]
df_train.columns = ["label", "sentence"]
df_train.index.name = "idx"
df_train = df_train.reset_index()
df_val.columns = ["label", "sentence"]
df_val.index.name = "idx"
df_val = df_val.reset_index()
df_test.columns = ["label", "sentence"]
df_test.index.name = "idx"
df_test = df_test.reset_index()

In [0]:
df_train.label.value_counts()

1    640000
0    640000
Name: label, dtype: int64

In [0]:
df_val.label.value_counts()

1    80000
0    80000
Name: label, dtype: int64

## Some statistics after preprocessing the data

Average number of words per tweet.

In [0]:
np.mean([len(sentence.split()) for sentence in df_train.sentence.values])

13.65478515625

Number of distinct words that appear in our training set:

In [0]:
word_occurences_df = df_train.sentence.str.split(expand=True).stack().value_counts()

In [0]:
word_occurences_df = word_occurences_df.to_frame()
word_occurences_df = word_occurences_df.reset_index()
word_occurences_df.columns = ["Word", "Occurences"]

In [0]:
word_occurences_df.shape[0]

243871

Number of words that appear more than 2, 3, 4, 5 and 6 times:

In [0]:
word_occurences_df[word_occurences_df["Occurences"] > 2].shape[0]

68290

In [0]:
word_occurences_df[word_occurences_df["Occurences"] > 3].shape[0]

55151

In [0]:
word_occurences_df[word_occurences_df["Occurences"] > 4].shape[0]

47193

In [0]:
word_occurences_df[word_occurences_df["Occurences"] > 5].shape[0]

41715

In [0]:
word_occurences_df[word_occurences_df["Occurences"] > 6].shape[0]

37693

## Tokenization
Finally, we map the text to vocabulary indices using a Keras tokenizer. With ``num_words = max_features = 48000``, the tokenizer only emits indices below ``max_features``, so it keeps the 47999 most frequent words in the training set (indices start at 1) and ignores the rest.

We set ``max_features = 48000`` because 47193 words appear more than 4 times in the training set, as computed in the previous section. This way we ignore nearly all words that appear fewer than 5 times, about which we have very little information (so it would be hard for the model to learn anything about them).

In [0]:
# New vocabulary indices
tokenizer = Tokenizer(num_words = max_features)  # oov_token=None; only a vocabulary capped by max_features is used
tokenizer.fit_on_texts(list(df_train.sentence.values))
#vocab_size = len(tokenizer.word_index) + 1
#tokenizer.texts_to_sequences(...)
#print(vocab_size)

We tokenize a few sentences to get a feel for what the tokenizer does. The word 'himsik' is not in our vocabulary (it appears fewer than 5 times in the training set), so the tokenizer ignores it.

In [0]:
tokenizer.texts_to_sequences(["my father is so fat", "you are awesome himsik"])

[[6, 1185, 10, 19, 1116], [9, 40, 164]]

We check that, after tokenization, the training set uses at most ``max_features - 1`` distinct word indices.

In [0]:
train_tweets_tokenized = tokenizer.texts_to_sequences(df_train.sentence)
len(set([word_id for tweet in train_tweets_tokenized for word_id in tweet]))

47999
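
The count of 47999 rather than 48000 follows from how the cap is applied: words are ranked by frequency starting at index 1, and only indices strictly below ``num_words`` are emitted. The following pure-Python sketch imitates that behaviour on a toy corpus. The helper names ``fit_word_index`` and ``texts_to_sequences`` are illustrative and this is a simplified imitation, not the Keras implementation.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by descending frequency, starting at index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Keep only words whose index is strictly below num_words."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

corpus = ["good good good day", "good day bad", "bad day rare"]  # toy data
word_index = fit_word_index(corpus)
seqs = texts_to_sequences(corpus, word_index, num_words=3)
distinct = {i for seq in seqs for i in seq}

print(word_index)     # {'good': 1, 'day': 2, 'bad': 3, 'rare': 4}
print(seqs)           # [[1, 1, 1, 2], [1, 2], [2]]
print(len(distinct))  # 2, i.e. num_words - 1
```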

In [0]:
x_train = sequence.pad_sequences(tokenizer.texts_to_sequences(df_train.sentence), maxlen=maxlen)
y_train = df_train.label.values
x_val = sequence.pad_sequences(tokenizer.texts_to_sequences(df_val.sentence), maxlen=maxlen)
y_val = df_val.label.values
x_test = sequence.pad_sequences(tokenizer.texts_to_sequences(df_test.sentence), maxlen=maxlen)
y_test = df_test.label.values
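
With its default arguments, ``pad_sequences`` pads with zeros at the front and, for sequences longer than ``maxlen``, keeps the last ``maxlen`` indices. The sketch below reproduces that default behaviour in plain Python for a single sequence. The helper ``pad_pre`` is a hypothetical name used only for illustration and assumes ``maxlen`` is positive.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]  # truncate from the front if too long
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([9, 40, 164], maxlen=5)
long_ = pad_pre([6, 1185, 10, 19, 1116, 7], maxlen=5)

print(short)  # [0, 0, 9, 40, 164]
print(long_)  # [1185, 10, 19, 1116, 7]
```

Since the average tweet has about 13.65 words after preprocessing, most rows of ``x_train`` start with a run of zeros when ``maxlen = 40``.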

## Building and training the bidirectional LSTM model

We build the model.

In [0]:
print('Build model...')
model_LSTM_Bi = Sequential()
model_LSTM_Bi.add(Embedding(max_features, embedding_size, input_length=maxlen))
model_LSTM_Bi.add(Bidirectional(LSTM(64)))
model_LSTM_Bi.add(Dropout(0.5))
model_LSTM_Bi.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
# From the BERT notebook
#loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
#metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")
#model.compile(optimizer=opt, loss=loss, metrics=[metric])

# Adapted from the BERT notebook
#loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
#metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")
#model_LSTM_Bi.compile(optimizer = 'adam', loss=loss, metrics=[metric])

# Original configuration for this LSTM
model_LSTM_Bi.compile('adam', 'binary_crossentropy', metrics=['accuracy'])

Build model...
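
As a rough sanity check on the size of this model, the number of trainable parameters can be estimated by hand. The sketch below assumes the standard Keras layer defaults (biases enabled, a single bias vector per LSTM gate) and the values used in this notebook. The helper functions are illustrative, and ``model_LSTM_Bi.summary()`` remains the authoritative source.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units

emb = embedding_params(48000, 128)  # max_features x embedding size
bilstm = 2 * lstm_params(128, 64)   # forward and backward LSTM(64)
dense = dense_params(2 * 64, 1)     # concatenated states -> 1 output
total = emb + bilstm + dense

print(emb, bilstm, dense, total)    # 6144000 98816 129 6242945
```

Under these assumptions the embedding table accounts for almost all of the parameters, which is one reason the vocabulary cap matters.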


We define the callbacks and train the model:

In [0]:
checkpoint_path = "/content/drive/My Drive/"
#checkpoint_path = "./"

class TimeHistory(tf.keras.callbacks.Callback):
    """Records the wall-clock duration of each epoch, in seconds."""
    def on_train_begin(self, logs=None):
        self.times = []

    def on_epoch_begin(self, epoch, logs=None):
        self.epoch_time_start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        self.times.append(time.time() - self.epoch_time_start)

time_callback = TimeHistory()

my_callbacks = [
    tf.keras.callbacks.ModelCheckpoint(filepath = checkpoint_path + 'my_best_model_LSTM_Bi.{epoch:02d}-{val_accuracy:.2f}.h5', 
    verbose=1, save_best_only=True, save_weights_only=False, monitor = 'val_accuracy', mode = 'max'), 
    time_callback
  ]

In [0]:
print('Train...')
history = model_LSTM_Bi.fit(x_train, y_train,
                            batch_size=batch_size,
                            epochs=epochs,
                            validation_data=(x_val, y_val),
                            verbose = 1,
                            callbacks=my_callbacks)

Train...
Epoch 1/8
Epoch 00001: val_accuracy improved from -inf to 0.82980, saving model to /content/drive/My Drive/my_best_model_LSTM_Bi.01-0.83.h5
Epoch 2/8
Epoch 00002: val_accuracy improved from 0.82980 to 0.83464, saving model to /content/drive/My Drive/my_best_model_LSTM_Bi.02-0.83.h5
Epoch 3/8
Epoch 00003: val_accuracy did not improve from 0.83464
Epoch 4/8
Epoch 00004: val_accuracy did not improve from 0.83464
Epoch 5/8
Epoch 00005: val_accuracy did not improve from 0.83464
Epoch 6/8
Epoch 00006: val_accuracy did not improve from 0.83464
Epoch 7/8
Epoch 00007: val_accuracy did not improve from 0.83464
Epoch 8/8
Epoch 00008: val_accuracy did not improve from 0.83464


We evaluate it on the test set:

In [0]:
loss, acc = model_LSTM_Bi.evaluate(x_test, y_test, batch_size=batch_size)
print('Test loss:', loss)
print('Test accuracy:', acc)

Test loss: 0.4176775813102722
Test accuracy: 0.8237937688827515


We save the model and the data collected during training.

In [0]:
model_LSTM_Bi.save(checkpoint_path + 'final_model_LSTM-Bi.h5')

In [0]:
# convert the history.history dict to a pandas DataFrame:     
hist_df = pd.DataFrame(history.history) 


# save to csv: 
hist_csv_file = checkpoint_path + 'history_LSTM-Bi.csv'
with open(hist_csv_file, mode='w') as f:
    hist_df.to_csv(f)

In [0]:
time_callback.times

[1577.6120159626007,
 1560.1105046272278,
 1555.884030342102,
 1543.2897021770477,
 1542.7007710933685,
 1546.0045628547668,
 1557.3562870025635,
 1564.6166729927063]

In [0]:
hist_df["times"] = time_callback.times
hist_df

| | loss | accuracy | val_loss | val_accuracy | times |
|---|---|---|---|---|---|
| 0 | 0.406861 | 0.814927 | 0.378229 | 0.8298 | 1577.612016 |
| 1 | 0.361261 | 0.840847 | 0.371101 | 0.834638 | 1560.110505 |
| 2 | 0.337109 | 0.85374 | 0.376348 | 0.8311 | 1555.88403 |
| 3 | 0.316792 | 0.863922 | 0.382415 | 0.831975 | 1543.289702 |
| 4 | 0.299333 | 0.872939 | 0.388498 | 0.82955 | 1542.700771 |
| 5 | 0.28378 | 0.880177 | 0.399679 | 0.82815 | 1546.004563 |
| 6 | 0.270821 | 0.886291 | 0.415312 | 0.825262 | 1557.356287 |
| 7 | 0.260113 | 0.89123 | 0.418778 | 0.823706 | 1564.616673 |
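
The table above shows the training loss falling every epoch while the validation loss rises after the second one, which suggests overfitting. As a small illustration, the snippet below copies the ``val_accuracy`` and ``times`` columns from that table and finds the best epoch and the total training time. The variable names are chosen for this example only.

```python
# Values copied from the history table above (times are in seconds).
val_accuracy = [0.8298, 0.834638, 0.8311, 0.831975,
                0.82955, 0.82815, 0.825262, 0.823706]
times = [1577.612016, 1560.110505, 1555.88403, 1543.289702,
         1542.700771, 1546.004563, 1557.356287, 1564.616673]

# Zero-based index of the epoch with the highest validation accuracy.
best_epoch = max(range(len(val_accuracy)), key=val_accuracy.__getitem__)
total_hours = sum(times) / 3600

print(f"best epoch: {best_epoch + 1}, "
      f"val_accuracy: {val_accuracy[best_epoch]:.4f}")
print(f"total training time: {total_hours:.2f} h")
```

This agrees with the checkpoint log, where the last saved model comes from epoch 2. The six later epochs add about 26 minutes each without improving validation accuracy.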


In [0]:
# save to csv: 
hist_csv_file = checkpoint_path + 'history_LSTM_Bi_with_times.csv'
with open(hist_csv_file, mode='w') as f:
    hist_df.to_csv(f)

We use the model to predict the label of the sentence "i hate you". The output is close to 0, which makes sense because this comment is clearly negative.

In [0]:
model_LSTM_Bi.predict(sequence.pad_sequences(tokenizer.texts_to_sequences(["i hate you"]), maxlen=maxlen))

array([[0.05828063]], dtype=float32)

## Building and training the LSTM-CNN model

We build the model.

In [0]:
my_callbacks = [
    tf.keras.callbacks.ModelCheckpoint(filepath = checkpoint_path + 'my_best_model_LSTM-CNN.{epoch:02d}-{val_accuracy:.2f}.h5', 
    verbose=1, save_best_only=True, save_weights_only=False, monitor = 'val_accuracy', mode = 'max'), 
    time_callback
  ]

In [0]:
print('Build model...')

model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen))  # max_features instead of vocab_size
model.add(Dropout(0.25))
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(MaxPooling1D(pool_size=pool_size))
model.add(LSTM(lstm_output_size))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Build model...
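
To see what the LSTM actually receives in this architecture, it helps to trace the sequence length through the convolution and pooling layers. The sketch below does that arithmetic with the notebook's parameters and also estimates the parameter count. It assumes Keras defaults (``MaxPooling1D`` stride equal to ``pool_size``, biases enabled), and the helper names are illustrative.

```python
maxlen, embedding_size = 40, 128
kernel_size, filters, pool_size = 5, 64, 4
lstm_output_size, max_features = 70, 48000

# 'valid' convolution with stride 1 shortens the sequence.
conv_len = maxlen - kernel_size + 1        # 36
# Non-overlapping pooling windows, remainder dropped.
pooled_len = conv_len // pool_size         # 9

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

emb = max_features * embedding_size                   # 6144000
conv = kernel_size * embedding_size * filters + filters  # 41024
lstm = lstm_params(filters, lstm_output_size)         # 37800
dense = lstm_output_size + 1                          # 71
total = emb + conv + lstm + dense

print(conv_len, pooled_len)  # 36 9
print(total)                 # 6222895
```

Under these assumptions the LSTM processes 9 time steps of 64 features instead of 40 steps of 128, which is consistent with the shorter epoch times reported for this model.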


We train the model:

In [0]:
print('Train...')
history = model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(x_val, y_val),
              verbose = 1,
              callbacks=my_callbacks)

Train...
Epoch 1/8
Epoch 00001: val_accuracy improved from -inf to 0.82799, saving model to /content/drive/My Drive/my_best_model_LSTM-CNN.01-0.83.h5
Epoch 2/8
Epoch 00002: val_accuracy improved from 0.82799 to 0.83097, saving model to /content/drive/My Drive/my_best_model_LSTM-CNN.02-0.83.h5
Epoch 3/8
Epoch 00003: val_accuracy improved from 0.83097 to 0.83266, saving model to /content/drive/My Drive/my_best_model_LSTM-CNN.03-0.83.h5
Epoch 4/8
Epoch 00004: val_accuracy did not improve from 0.83266
Epoch 5/8
Epoch 00005: val_accuracy did not improve from 0.83266
Epoch 6/8
Epoch 00006: val_accuracy did not improve from 0.83266
Epoch 7/8
Epoch 00007: val_accuracy did not improve from 0.83266
Epoch 8/8
Epoch 00008: val_accuracy did not improve from 0.83266


We evaluate it on the test set:

In [0]:
loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print('Test loss:', loss)
print('Test accuracy:', acc)

Test loss: 0.394981324672699
Test accuracy: 0.828781247138977


We save the model and the data collected during training.

In [0]:
model.save(checkpoint_path + 'final_model_LSTM-CNN.h5')

In [0]:
# convert the history.history dict to a pandas DataFrame:     
hist_df = pd.DataFrame(history.history) 


# or save to csv: 
hist_csv_file = checkpoint_path + 'history_LSTM-CNN.csv'
with open(hist_csv_file, mode='w') as f:
    hist_df.to_csv(f)

In [0]:
time_callback.times

[1303.345401763916,
 1413.941558599472,
 1439.3113844394684,
 1443.7997148036957,
 1437.3164551258087,
 1440.3495645523071,
 1438.215805053711,
 1442.5869722366333]

In [0]:
hist_df["times"] = time_callback.times
hist_df

| | loss | accuracy | val_loss | val_accuracy | times |
|---|---|---|---|---|---|
| 0 | 0.406653 | 0.813486 | 0.382395 | 0.827987 | 1303.345402 |
| 1 | 0.365472 | 0.83731 | 0.378867 | 0.830975 | 1413.941559 |
| 2 | 0.346212 | 0.847987 | 0.378166 | 0.832656 | 1439.311384 |
| 3 | 0.332062 | 0.855298 | 0.381204 | 0.831975 | 1443.799715 |
| 4 | 0.320304 | 0.861888 | 0.390198 | 0.830712 | 1437.316455 |
| 5 | 0.31039 | 0.867178 | 0.394126 | 0.830863 | 1440.349565 |
| 6 | 0.302358 | 0.871159 | 0.403873 | 0.828013 | 1438.215805 |
| 7 | 0.294555 | 0.875039 | 0.396077 | 0.829206 | 1442.586972 |


In [0]:
# save to csv: 
hist_csv_file = checkpoint_path + 'history_LSTM-CNN_with_times.csv'
with open(hist_csv_file, mode='w') as f:
    hist_df.to_csv(f)