# LSTM-GloVe Model for Sentiment Analysis in text data

This notebook describes how the text dataset is loaded, preprocessed and embedded, and how a model that uses LSTM and GloVe
is built and trained on it.

## Data Loading

The following cell loads the unified dataset built in earlier stages (see `data_join.ipynb` and `NLP_tasks.ipynb`), which contains almost 66,000 text records.

In [None]:
import pandas as pd
data = pd.read_csv('../../data/cleaned/out.csv')
data.head()

In [None]:
data.shape

## Preprocessing

### Emotions

One-hot encoding is applied to the model's output variable: the 7 emotions.

In [None]:
# Emotions
emotions = data['label'].unique()
print(emotions)

In [None]:
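# Request float columns explicitly; recent pandas versions return booleans by default
y = pd.get_dummies(data.label, dtype='float32')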
y
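
To make the encoding concrete, here is a minimal pure-Python sketch of one-hot encoding and of mapping a row of scores back to a label. The label names are hypothetical placeholders, not necessarily the emotions in the dataset. Columns are kept in sorted order, which should match how `pd.get_dummies` orders string labels.

```python
# Hypothetical labels; the real ones come from data['label'].
labels = ["joy", "anger", "joy", "fear"]

# One column per distinct label, in sorted order.
classes = sorted(set(labels))
index_of = {name: i for i, name in enumerate(classes)}

def one_hot(label):
    """Return a list with a single 1 at the position of `label`."""
    row = [0] * len(classes)
    row[index_of[label]] = 1
    return row

def decode(scores):
    """Map a row of scores (e.g. softmax outputs) back to a label name."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return classes[best]

encoded = [one_hot(label) for label in labels]

print(classes)
print(encoded)
print(decode([0.1, 0.2, 0.7]))
```

The same idea applies to the model's predictions later on: the position of the largest score in an output row identifies the column, and therefore the emotion.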

### Text

Irrelevant information such as tags (@mentions), punctuation, numbers and special characters must be removed from the data.

In [None]:
import re
data['text'][0]

In [None]:
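# Matches an '@' and every character after it up to the next space or '>'
TAG_RE = re.compile(r'@[^> ]+')

def remove_at_sign(sentence: str) -> str:
    '''
    Removes @-tags (an '@' and the characters that follow it) from an input string
    :param sentence: String that may contain @-tags
    :return: sentence without @-tags
    '''

    return TAG_RE.sub('', sentence)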

In [None]:
remove_at_sign(data['text'][0])
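
The pattern removes the whole tag, not just the `@` character, and it also matches the part of an e-mail address after the `@`. A small sketch with hypothetical inputs (not taken from the dataset) shows both effects:

```python
import re

TAG_RE = re.compile(r'@[^> ]+')  # same pattern as in the cell above

# Hypothetical inputs used only for illustration.
mention = TAG_RE.sub('', '@user hello @other!')
email = TAG_RE.sub('', 'mail me at a@b.com')

print(repr(mention))  # ' hello '
print(repr(email))    # 'mail me at a'
```

Punctuation attached to a tag (the `!` above) is removed with it, because the match only stops at a space or `>`.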

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords

# Compiled once: matches an English stopword followed by one whitespace character
STOPWORDS_RE = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s')

def preprocess_text(sentence: str) -> str:
    '''
    Cleans up a sentence, leaving only lowercase non-stopwords of 2 or more letters
    :param sentence: String to be cleaned
    :return: sentence without @-tags, numbers, special chars, single letters and stopwords
    '''

    cleaned_sentence = sentence.lower()
    cleaned_sentence = remove_at_sign(cleaned_sentence)
    # Replace everything that is not a letter with a space
    cleaned_sentence = re.sub(r'[^a-zA-Z]', ' ', cleaned_sentence)
    # Drop single letters surrounded by whitespace
    cleaned_sentence = re.sub(r'\s+[a-zA-Z]\s', ' ', cleaned_sentence)
    # Collapse runs of whitespace
    cleaned_sentence = re.sub(r'\s+', ' ', cleaned_sentence)

    # Removal of stopwords; a stopword at the very end is kept because the pattern requires trailing whitespace
    cleaned_sentence = STOPWORDS_RE.sub('', cleaned_sentence)

    return cleaned_sentence

In [None]:
preprocess_text(data['text'][0])
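
One detail of the stopword pattern is worth seeing in isolation: it requires whitespace after the stopword, so a stopword at the very end of a sentence is not removed. The sketch below uses a tiny hypothetical stopword list instead of NLTK's, and the second pattern is only a possible variant, not the one used in this notebook.

```python
import re

# Tiny hypothetical stopword list; the notebook uses NLTK's English list.
stop = ['the', 'is', 'a']

# Same shape as the notebook's pattern: stopword followed by one whitespace character.
trailing_space = re.compile(r'\b(' + '|'.join(stop) + r')\b\s')

# Variant (an assumption): also accept the end of the string after the stopword.
space_or_end = re.compile(r'\b(?:' + '|'.join(stop) + r')\b(?:\s+|$)')

text = 'the cat is the'
kept_last = trailing_space.sub('', text)
removed_all = space_or_end.sub('', text).strip()

print(repr(kept_last))    # 'cat the'
print(repr(removed_all))  # 'cat'
```

Switching patterns would change the cleaned text, and with it values derived further down such as the maximum sentence length, so it is shown here only as an option.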

In [None]:
from copy import deepcopy

cleaned_data = deepcopy(data)
cleaned_data['text'] = cleaned_data['text'].apply(preprocess_text)
cleaned_data['text']

## Embedding

The embedding step uses the pretrained vectors of the GloVe embedding model.
These vectors are loaded into the `words` dictionary. Each GloVe vector
used here has 50 dimensions.

In [None]:
import numpy as np

words = {}

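def add_to_dict(dictionary, filename):
    # Each line of the file holds a token followed by its vector components
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split(' ')

            try:
                dictionary[parts[0]] = np.array(parts[1:], dtype=float)
            except ValueError:
                # Skip lines whose components cannot be parsed as floats
                continue

add_to_dict(words, './GloVe/glove.6B/glove.6B.50d.txt')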
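
For readers unfamiliar with the file layout, the sketch below parses two made-up lines in the same text format (a token followed by space-separated numbers) and measures how many tokens of a sentence have a vector. The vectors have 3 components instead of 50, and all names and values are hypothetical.

```python
import io

# Two hypothetical lines in GloVe's text format, with 3 components instead of 50.
fake_file = io.StringIO("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")

toy_words = {}
for line in fake_file:
    token, *components = line.rstrip('\n').split(' ')
    toy_words[token] = [float(c) for c in components]

# Share of tokens in a sentence that have a vector; the rest would be dropped.
sentence = ['the', 'cat', 'purrs']
covered = sum(token in toy_words for token in sentence) / len(sentence)

print(toy_words)
print(round(covered, 2))
```

Checking this coverage on the real data can be a quick way to see how much text is lost to out-of-vocabulary words.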

In [None]:
# words

### Tokenization and Lemmatization

Once the GloVe vectors are loaded, each sentence in the dataset is
tokenized and lemmatized.

In [None]:
nltk.download('wordnet')
nltk.download('omw-1.4')

Example of how a sentence should be tokenized and lemmatized:

In [None]:
from nltk import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

tokenizer = RegexpTokenizer(r'\w+')
lemmatizer = WordNetLemmatizer()

sample = preprocess_text(cleaned_data['text'][0])
token_sample = tokenizer.tokenize(sample)
lemma_sample = [lemmatizer.lemmatize(token) for token in token_sample]
lemma_sample

Next, a function is defined for tokenization and lemmatization. The token list it returns contains only words defined in `words`.

In [None]:
def sentence_to_token_list(sentence: str):
    tokens = tokenizer.tokenize(sentence)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    useful_tokens = [token for token in lemmatized_tokens if token in words]

    return useful_tokens

In [None]:
sentence_to_token_list(sample)

Every token in the list above is known to have an entry in `words`, so each one can now be replaced by its vector:

In [None]:
def sentence_to_words_vectors(sentence: str, word_dict=words):
    processed_tokens = sentence_to_token_list(sentence)

    vectors = []
    for token in processed_tokens:
        if token in word_dict:
            token_vector = word_dict[token]
            vectors.append(token_vector)

    return np.array(vectors, dtype=float)

In [None]:
sentence_to_words_vectors(sample).shape

In [None]:
sentence_to_words_vectors(sample)

In [None]:
# Build the feature set X
X = cleaned_data['text'].apply(sentence_to_words_vectors)
X

Because each sentence has a different number of words, the vector matrices have different numbers of rows. The maximum text length must therefore be identified so that all matrices can be brought to a common size:

In [None]:
# Number of rows (known tokens) in the longest sentence matrix
MAX_LEN = int(X.apply(len).max())
MAX_LEN

Since the maximum length is 35, every matrix is brought to shape `(35, 50)`. The missing rows of each matrix are filled with zeros at its start.

In [None]:
import tensorflow as tf

X_ = tf.keras.utils.pad_sequences(X, maxlen=MAX_LEN, dtype='float32')
# X_

In [None]:
X_.shape

In [None]:
X_[0].shape

In [None]:
X_[0]
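
`pad_sequences` defaults to `padding='pre'` and `truncating='pre'`, which is why the zeros appear at the start of each matrix. The following pure-Python sketch mimics that behaviour on a toy example; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(rows, maxlen, dim):
    """Left-pad (or left-truncate) a list of vectors to exactly `maxlen` rows."""
    rows = rows[-maxlen:] if maxlen > 0 else []
    padding = [[0.0] * dim for _ in range(maxlen - len(rows))]
    return padding + rows

# Hypothetical sentence with 2 known tokens and 3-dimensional vectors.
short = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
padded = pad_pre(short, maxlen=4, dim=3)

# A sequence longer than maxlen keeps its last rows.
truncated = pad_pre([[1.0], [2.0], [3.0]], maxlen=2, dim=1)

print(padded)
print(truncated)
```

With pre-padding the real tokens sit at the end of the sequence, so the last LSTM steps see actual words rather than padding.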

## Train-Validation-Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=0.2, random_state=3)
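
Only the train/test split is made explicitly here; the validation set comes from the `validation_split=0.1` argument used in the training cells. The sketch below works out the approximate sizes of the three subsets, assuming a round figure of 66,000 rows (the real dataset has slightly fewer) and ignoring the exact rounding rules of each library.

```python
n_total = 66_000  # hypothetical round figure, not the exact dataset size

# 20% held out for testing by train_test_split.
n_test = n_total * 20 // 100
n_train = n_total - n_test

# 10% of the training portion held out for validation during fit().
n_val = n_train * 10 // 100
n_fit = n_train - n_val

for name, n in [("fit", n_fit), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {n} rows ({100 * n / n_total:.0f}%)")
```

Under these assumptions the data ends up split roughly 72% / 8% / 20%.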

## Model

In [None]:
from keras.models import Sequential
from keras import layers
from keras.layers import Embedding, Lambda, LSTM, Flatten, Dense, Input, Dropout, Bidirectional, GlobalMaxPooling1D
from keras.optimizers import Adam, RMSprop, SGD
from keras_tuner import RandomSearch, HyperParameters

def build_model(hp):

    model = Sequential()
    model.add(Input(shape=(MAX_LEN, 50)))

    # Hyperparameters for LSTM 1
    lstm_units = hp.Int("lstm_units_1", min_value=128, max_value=256, step=32)
    lstm_dropout = hp.Float("lstm_dropout", min_value=0.1, max_value=0.5, step=0.1)

    model.add(Bidirectional(LSTM(lstm_units, return_sequences=True, dropout=lstm_dropout, recurrent_dropout=lstm_dropout)))

    # Hyperparameters for LSTM 2
    lstm_units = hp.Int("lstm_units_2", min_value=64, max_value=128, step=32)
    lstm_dropout = hp.Float("lstm_dropout_2", min_value=0.1, max_value=0.5, step=0.1)

    model.add(Bidirectional(LSTM(lstm_units, return_sequences=True, dropout=lstm_dropout, recurrent_dropout=lstm_dropout)))
    model.add(GlobalMaxPooling1D())

    # Hyperparameters for dense layer 1
    dense_units = hp.Int("dense_units_1", min_value=32, max_value=128, step=32)
    dense_dropout = hp.Float("dense_dropout_1", min_value=0.1, max_value=0.5, step=0.1)

    model.add(Dense(dense_units, activation='relu'))
    model.add(Dropout(dense_dropout))

    # Hyperparameters for dense layer 2
    dense_units = hp.Int("dense_units_2", min_value=32, max_value=128, step=32)
    model.add(Dense(dense_units, activation='relu'))

    # Model output: one probability per emotion
    model.add(Dense(7, activation='softmax'))

    # Optimizer hyperparameters (Adam performed best in other tests)
    learning_rate = hp.Float("learning_rate", min_value=1e-5, max_value=1e-3, sampling="LOG")
    optimizer = Adam(learning_rate=learning_rate)

    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Define the random search object
tuner = RandomSearch(
    build_model,
    objective="val_accuracy",
    max_trials=20,  # Number of models to try
    executions_per_trial=1,
    directory='./saved/fine_tuned/',
    project_name='HP_LSTM_Glove_text'
)

# Summary of the search space
tuner.search_space_summary()
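
To put `max_trials=20` in perspective, the sketch below counts the combinations of the stepped hyperparameters defined in `build_model`. It assumes that both endpoints of each range are included, and it leaves out `learning_rate`, which is sampled from a continuous range.

```python
import math

def n_steps(min_value, max_value, step):
    """Number of grid values from min_value to max_value, both included."""
    return round((max_value - min_value) / step) + 1

# Ranges copied from build_model; the endpoint handling is an assumption.
grid = {
    "lstm_units_1": n_steps(128, 256, 32),
    "lstm_dropout": n_steps(0.1, 0.5, 0.1),
    "lstm_units_2": n_steps(64, 128, 32),
    "lstm_dropout_2": n_steps(0.1, 0.5, 0.1),
    "dense_units_1": n_steps(32, 128, 32),
    "dense_dropout_1": n_steps(0.1, 0.5, 0.1),
    "dense_units_2": n_steps(32, 128, 32),
}

total = math.prod(grid.values())
print(grid)
print(total)  # 30000
```

Twenty random trials therefore cover only a very small part of the space, which fits the coarse-then-fine approach used in the following cells.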

In [None]:
# from keras.models import Model

# def build_gpt_model(hp):
#     input_layer = layers.Input(shape=(MAX_LEN, 50))
#     x = layers.Bidirectional(layers.LSTM(units=hp.Int('units_1', min_value=64, max_value=256, step=32),
#                                          return_sequences=True, dropout=0.25, recurrent_dropout=0.25))(input_layer)
#     x = layers.Bidirectional(layers.LSTM(units=hp.Int('units_2', min_value=64, max_value=128, step=32),
#                                          return_sequences=True, dropout=0.25, recurrent_dropout=0.25))(x)

#     # Apply the attention layer using the output of the last LSTM layer as both query and value
#     attention = layers.Attention()([x, x])
#     x = layers.GlobalMaxPooling1D()(attention)
#     x = layers.Dense(units=hp.Int('dense_units_1', min_value=32, max_value=128, step=32), activation='relu')(x)
#     x = layers.Dropout(rate=hp.Float('dropout_1', min_value=0.3, max_value=0.7, step=0.1))(x)
#     x = layers.Dense(units=hp.Int('dense_units_2', min_value=32, max_value=128, step=32), activation='relu')(x)
#     x = layers.Dropout(rate=hp.Float('dropout_2', min_value=0.3, max_value=0.7, step=0.1))(x)
#     output_layer = layers.Dense(7, activation='softmax')(x)

#     model = Model(inputs=input_layer, outputs=output_layer)

In [None]:
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau

early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, min_lr=1e-5)
cp = ModelCheckpoint('saved/', save_best_only=True)

callbacks = [cp, early_stopping, reduce_lr]
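
The effect of `ReduceLROnPlateau` with `factor=0.1` and `min_lr=1e-5` can be sketched without Keras: each time the callback fires, the learning rate is multiplied by the factor and floored at the minimum. The starting rate of `1e-3` is an assumption (the upper end of the search range), and `reduce_lr_steps` is a hypothetical helper.

```python
def reduce_lr_steps(lr, factor=0.1, min_lr=1e-5, n_reductions=4):
    """Learning rates after each successive reduction, floored at min_lr."""
    schedule = [lr]
    for _ in range(n_reductions):
        lr = max(lr * factor, min_lr)
        schedule.append(lr)
    return schedule

schedule = reduce_lr_steps(1e-3)
for step, lr in enumerate(schedule):
    print(f"after {step} reductions: {lr:.0e}")
```

Starting from `1e-3`, the floor is reached after two reductions, so later plateaus no longer change the learning rate. Since the reduction patience (3) is shorter than the early-stopping patience (5), at least one reduction can happen before training stops.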

In [None]:
BATCH_SIZE = 1024
tuner.search(X_train, y_train,
             epochs=10,
             validation_split=0.1,
             batch_size=BATCH_SIZE,
             callbacks=callbacks)

best_hp_random = tuner.get_best_hyperparameters(num_trials=1)[0]

print("Best hyperparameters found:")
print(best_hp_random.values)

In [None]:
def build_model_fine(hp):
    model = Sequential()
    model.add(layers.Input(shape=(MAX_LEN, 50)))
    model.add(layers.Bidirectional(layers.LSTM(units=hp.Int('units', min_value=128, max_value=192, step=16), 
                                               return_sequences=True, dropout=0.25, recurrent_dropout=0.25)))
    model.add(layers.GlobalMaxPooling1D())
    model.add(layers.Dense(units=hp.Int('dense_units', min_value=64, max_value=96, step=16), activation='relu'))
    model.add(layers.Dropout(rate=hp.Float('dropout', min_value=0.4, max_value=0.6, step=0.05)))
    model.add(layers.Dense(7, activation='softmax'))

    # Add the choice of optimizer as a hyperparameter
    optimizer_choice = hp.Choice('optimizer', values=['adam', 'rmsprop'])

    # Add the choice of learning rate as a hyperparameter
    learning_rate = hp.Float('learning_rate', min_value=1e-4, max_value=1e-3, sampling='LOG')

    if optimizer_choice == 'adam':
        optimizer = Adam(learning_rate=learning_rate)
    else:
        optimizer = RMSprop(learning_rate=learning_rate)

    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Build the model from the best hyperparameters of the random search.
# Names that the search did not define (e.g. 'units') are expected to take their default values.
best_model_fine = build_model_fine(best_hp_random)

In [None]:
BATCH_SIZE=2048
history = best_model_fine.fit(X_train, y_train,
                    validation_split=0.1, epochs=30,
                    batch_size=BATCH_SIZE,
                    callbacks=callbacks)

In [None]:
score = best_model_fine.evaluate(X_test, y_test, verbose=1)
print(score)

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Flatten, Dense, Input, Dropout, Bidirectional, GlobalMaxPooling1D

model = Sequential()
model.add(Input(shape=(MAX_LEN, 50)))
model.add(Bidirectional(LSTM(128, return_sequences=True, dropout=0.25, recurrent_dropout=0.25)))
model.add(Bidirectional(LSTM(64, return_sequences=True, dropout=0.25, recurrent_dropout=0.25)))
model.add(GlobalMaxPooling1D())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(7, activation='softmax'))

In [None]:
optimizer = Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

In [None]:
BATCH_SIZE=2048
history = model.fit(X_train, y_train,
                    validation_split=0.1, epochs=30,
                    batch_size=BATCH_SIZE,
                    callbacks=callbacks)

## Test

In [None]:
from keras.models import load_model

lstm_basic_model = load_model('saved/')

In [None]:
score = lstm_basic_model.evaluate(X_test, y_test, verbose=1)

In [None]:
print('Test Accuracy:', score[1])

## Model Variation

In [None]:
lstm_dropout_dense = Sequential(name='Lstm-dout-dense')
lstm_dropout_dense.add(Input(shape=(MAX_LEN, 50)))
lstm_dropout_dense.add(LSTM(64, return_sequences=True))
lstm_dropout_dense.add(Dropout(0.2))
lstm_dropout_dense.add(LSTM(32))
lstm_dropout_dense.add(Flatten())
lstm_dropout_dense.add(Dense(128, activation='relu'))
lstm_dropout_dense.add(Dense(7, activation='softmax'))

In [None]:
lstm_dropout_dense.summary()
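
The parameter counts reported by `summary()` can be cross-checked by hand. The sketch below applies the usual formulas for a standard LSTM layer (four gates, each with input weights, recurrent weights and a bias) and for a dense layer, using the layer sizes of this variation and an input dimension of 50. It assumes default layer settings with biases enabled.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per unit."""
    return input_dim * units + units

counts = {
    "LSTM(64)": lstm_params(50, 64),
    "LSTM(32)": lstm_params(64, 32),
    "Dense(128)": dense_params(32, 128),
    "Dense(7)": dense_params(128, 7),
}
total = sum(counts.values())

print(counts)
print(total)  # 46983
```

`Dropout` and `Flatten` add no parameters, so the total should agree with the summary if the assumptions hold.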

In [None]:
lstm_dropout_dense.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['acc'])

In [None]:
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

In [None]:
history_2 = lstm_dropout_dense.fit(X_train, y_train,
                    validation_split=0.2, epochs=10,
                    batch_size=BATCH_SIZE,
                    callbacks=[early_stop])

In [None]:
score = lstm_dropout_dense.evaluate(X_test, y_test, verbose=1)
print('Test Accuracy:', score[1])