# Introduction

This project uses the Twitter Sentiment Analysis dataset to train an LSTM (Long Short-Term Memory) deep learning model for sentiment analysis of tweets.

Sentiment analysis on Twitter has become increasingly important because of the large volume of information shared on the platform. The goal of this project is to build a model that automatically classifies tweets as positive, negative, or neutral.

The model is an artificial neural network (ANN) trained on an NLP dataset containing a large number of tweets, each labeled with its sentiment. The network takes a sequence of words as input and outputs the corresponding sentiment label.


### References



* https://keras.io/examples/nlp/text_classification_from_scratch/
* https://www.kaggle.com/code/ngyptr/lstm-sentiment-analysis-keras
* https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis



## Setup

In [2]:
!pip install Keras-Preprocessing

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Keras-Preprocessing
  Downloading Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.6/42.6 kB 2.4 MB/s eta 0:00:00
Installing collected packages: Keras-Preprocessing
Successfully installed Keras-Preprocessing-1.1.2


In [3]:
!pip install keras

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
!pip install tensorflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
import numpy as np
import pandas as pd
import re
import keras
import datetime

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences

from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.models import Sequential

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

## 1. Import the dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [7]:
# csv_train = pd.read_csv('/content/drive/MyDrive/Semestres UdeA/Semestre X/Simulacion de sistemas/Proyectos del grupo/Proyecto del curso/Twitter Bd/twitter_training.csv', header=None)
csv_train = pd.read_csv('/content/twitter_training.csv', header=None)

columnas = csv_train.columns.tolist()

# Column 2 holds the sentiment label and column 3 the tweet text
columna1 = csv_train.iloc[:, 2].astype(str).values
columna2 = csv_train.iloc[:, 3].astype(str).values

# Build a list of [label, text] pairs
data_train = [[label, text] for label, text in zip(columna1, columna2)]


# dataset_validation = pd.read_csv('/content/drive/MyDrive/Semestres UdeA/Semestre X/Simulacion de sistemas/Proyectos del grupo/Proyecto del curso/Twitter Bd/twitter_validation.csv')
# data_validation = dataset_validation[['text','sentiment']]
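The column selection above can also be illustrated without pandas. This is a minimal sketch, assuming a header-less CSV whose third and fourth fields hold the label and the text; the two sample rows (IDs, entity names, and labels included) are hypothetical and only show the shape of the data.

```python
import csv
import io

# Hypothetical header-less rows: id, entity, sentiment label, tweet text
raw = io.StringIO(
    '1,Borderlands,Positive,"im getting on borderlands and i will murder you all ,"\n'
    '2,Nvidia,Irrelevant,"some unrelated text"\n'
)

# Keep only [label, text], mirroring csv_train.iloc[:, 2] and csv_train.iloc[:, 3]
pairs = [[row[2], row[3]] for row in csv.reader(raw)]
print(pairs)
```

The quoting matters here: the tweet text may contain commas, which `csv.reader` (like `pd.read_csv`) keeps inside one field as long as the field is quoted.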

## 2. Clean the data

In [8]:
# Keep a row only if all of the following hold:
#   - its label is not "Irrelevant"
#   - its text has more than one word
#   - its text does not start with a non-alphanumeric character
#     (patron.match only anchors at the start of the string)
nueva_matriz = []
patron = re.compile('[^a-zA-Z0-9]+')

for fila in data_train:
    if fila[0].lower() != "irrelevant" and len(fila[1].split()) > 1 and not patron.match(fila[1]):
        nueva_matriz.append(fila)

# Convert the list to a NumPy array
dataset_train = np.array(nueva_matriz)

# Lowercase all texts, then strip '@' and '<unk>'.
# Note: np.char.replace does literal substring replacement, not regex matching,
# so the first replace call below leaves punctuation in place (see the printed
# texts in the next section).
dataset_train[:, 1] = np.char.lower(dataset_train[:, 1])
dataset_train[:, 1] = np.char.replace(dataset_train[:, 1], r'[^a-zA-z0-9\s]', '')
dataset_train[:, 1] = np.char.replace(dataset_train[:, 1], '@', '')
dataset_train[:, 1] = np.char.replace(dataset_train[:, 1], '<unk>', '')

print('Positive = ',np.sum(dataset_train[:, 0] == 'Positive'))
print('Negative = ',np.sum(dataset_train[:, 0] == 'Negative'))
print('Neutral = ',np.sum(dataset_train[:, 0] == 'Neutral'))

# for i in range(len(dataset_train)):
#   print(i," | ",dataset_train[i])
#   if(i==10) :
#     break

Positive =  18129
Negative =  16081
Neutral =  15327
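`np.char.replace` performs literal substring replacement, so a regular-expression pattern passed to it is never interpreted. The following is a minimal sketch of regex-based cleaning with the standard `re` module; the sample tweet is made up, and the character class uses `A-Z` because the `A-z` range in the cell above would also match a few punctuation characters. Applying this to the dataset would change the tokenizer input, so the counts and scores reported in this notebook would not necessarily carry over.

```python
import re

text = "I can't wait for <unk> @Borderlands 3!!!"

# str.replace (like np.char.replace) searches for the literal pattern text, so nothing changes
literal = text.replace(r"[^a-zA-Z0-9\s]", "")

# re.sub interprets the pattern: drop '<unk>' first, then every character
# that is not a letter, a digit, or whitespace
cleaned = re.sub(r"[^a-zA-Z0-9\s]", "", text.lower().replace("<unk>", ""))
cleaned = " ".join(cleaned.split())  # collapse repeated spaces

print(literal)
print(cleaned)
```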


## 3. Vectorize the data

In [None]:
print(dataset_train[:, 1])

['im getting on borderlands and i will murder you all ,'
 'i am coming to the borders and i will kill you all,'
 'im getting on borderlands and i will kill you all,' ...
 'just realized the windows partition of my mac is now 6 years behind on nvidia drivers and i have no idea how he didn’t notice'
 'just realized between the windows partition of my mac is like being 6 years behind on nvidia drivers and cars i have no fucking idea how i ever didn ’ t notice'
 'just like the windows partition of my mac is like 6 years behind on its drivers so you have no idea how i didn’t notice']


In [9]:
max_fatures = 2000
tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(dataset_train[:, 1])
X = tokenizer.texts_to_sequences(dataset_train[:, 1])
X = pad_sequences(X)

print(X)

[[   0    0    0 ... 1659   16   29]
 [   0    0    0 ...  443   16   29]
 [   0    0    0 ...  443   16   29]
 ...
 [   0    0    0 ...   65  160 1020]
 [   0    0    0 ... 1872   84   39]
 [   0    0    0 ...   65    2 1020]]
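The steps in the cell above can be sketched in plain Python to show where the leading zeros and the small integer ids come from. This is only an approximation of the default behavior of `Tokenizer(num_words=...)` and `pad_sequences` (the real tokenizer also lowercases and strips punctuation); the three sample texts and `num_words = 4` are hypothetical.

```python
from collections import Counter

texts = ["i will kill you all", "i am coming", "i will be there"]
num_words = 4  # hypothetical; the notebook uses 2000

# Rank words by frequency; index 0 is reserved for padding
counts = Counter(word for t in texts for word in t.split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

# Keep only words whose index is below num_words, dropping the rest
sequences = [
    [word_index[w] for w in t.split() if word_index[w] < num_words]
    for t in texts
]

# Left-pad with zeros to the length of the longest sequence
maxlen = max(len(s) for s in sequences)
padded = [[0] * (maxlen - len(s)) + s for s in sequences]
print(word_index)
print(padded)
```

Words that fall outside the most frequent ones are silently dropped, which is why a tweet can end up with fewer token ids than words.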


## 4. Build the model

In [10]:
embed_dim = 128
lstm_out = 196

model = Sequential()
model.add(Embedding(max_fatures, embed_dim,input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 166, 128)          256000    
                                                                 
 spatial_dropout1d (SpatialD  (None, 166, 128)         0         
 ropout1D)                                                       
                                                                 
 lstm (LSTM)                 (None, 196)               254800    
                                                                 
 dense (Dense)               (None, 3)                 591       
                                                                 
Total params: 511,391
Trainable params: 511,391
Non-trainable params: 0
_________________________________________________________________
None
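The parameter counts in the summary can be reproduced by hand, which is a useful check that the layer sizes are what was intended. The sketch below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights, and a bias); the variable names are illustrative.

```python
vocab_size = 2000   # max_fatures
embed_dim = 128
lstm_units = 196    # lstm_out
num_classes = 3

embedding_params = vocab_size * embed_dim
# 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * ((embed_dim + lstm_units + 1) * lstm_units)
dense_params = lstm_units * num_classes + num_classes

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
```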


## 5. Train the model

In [11]:
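# One-hot encode the labels; pd.get_dummies orders the columns alphabetically:
# index 0 = Negative, 1 = Neutral, 2 = Positive
Y = dataset_train[:, 0]
Y1 = pd.get_dummies(dataset_train[:, 0]).values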

X_train, X_test, Y_train, Y_test = train_test_split(X,Y1, test_size = 0.33, random_state = 42)
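Because `pd.get_dummies` orders its output columns by sorted category, the position of each class in the one-hot vector is fixed by the label names. A minimal sketch of that mapping and of decoding a prediction back to a label; the `decode` helper and the sample labels are hypothetical.

```python
labels = ["Positive", "Neutral", "Negative", "Positive"]

# Classes in sorted order, which is the column order of the one-hot encoding
classes = sorted(set(labels))
one_hot = [[int(lbl == c) for c in classes] for lbl in labels]

def decode(probabilities):
    """Return the class name with the highest probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best]

print(classes)
print(one_hot[0])
print(decode([0.01, 0.24, 0.75]))
```

Keeping the class list next to the model avoids hard-coding index-to-name mappings in several places.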


In [12]:
inicio = datetime.datetime.now()
print('Start: ', inicio)

batch_size = 35
model.fit(X_train, Y_train, epochs = 2, batch_size=batch_size, verbose = 2)

fin  = datetime.datetime.now()
print('End: ', fin)
print('Duration: ', fin - inicio)

Start:  2023-05-20 03:05:40.042848
Epoch 1/2
949/949 - 546s - loss: 0.8278 - accuracy: 0.6312 - 546s/epoch - 575ms/step
Epoch 2/2
949/949 - 525s - loss: 0.6676 - accuracy: 0.7264 - 525s/epoch - 554ms/step
End:  2023-05-20 03:24:04.439431
Duration:  0:18:24.396583
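The `loss` values in the training log are categorical cross-entropy averaged over the samples. As a small illustration of the per-sample quantity, the sketch below evaluates it for two label and prediction pairs that appear in the test output later in this notebook; the helper function is written here for illustration and is not the Keras implementation.

```python
import math

def categorical_crossentropy(y_true, y_prob):
    """Cross-entropy of one sample: -sum(true_k * log(prob_k))."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_prob) if t)

# A correctly classified sample and a misclassified one
correct = categorical_crossentropy([0, 1, 0], [0.00719337, 0.60165256, 0.39115405])
wrong = categorical_crossentropy([1, 0, 0], [0.13903324, 0.09564874, 0.76531804])
print(round(correct, 3), round(wrong, 3))
```

A confident wrong prediction is penalized far more than a hesitant correct one, which is why the loss can keep improving even when accuracy changes little.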


In [13]:
validation_size = 1500

X_validate = X_test[-validation_size:]
Y_validate = Y_test[-validation_size:]
X_test = X_test[:-validation_size]
Y_test = Y_test[:-validation_size]
score,acc = model.evaluate(X_test, Y_test, verbose = 2, batch_size = batch_size)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

425/425 - 34s - loss: 0.6470 - accuracy: 0.7328 - 34s/epoch - 80ms/step
score: 0.65
acc: 0.73
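The step counts in the logs (949 for training, 425 for evaluation) follow from the dataset size, the split, and the batch size. A quick arithmetic check, assuming that `train_test_split` rounds the test share up and assigns the remaining samples to training:

```python
import math

n_samples = 18129 + 16081 + 15327     # class counts printed after cleaning
n_test = math.ceil(0.33 * n_samples)  # assumption: the test share is rounded up
n_train = n_samples - n_test

batch_size = 35
validation_size = 1500

train_steps = math.ceil(n_train / batch_size)
test_steps = math.ceil((n_test - validation_size) / batch_size)
print(n_samples, n_train, n_test, train_steps, test_steps)
```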


## 6. Export the model as h5

In [None]:
print(X_validate.shape)
print(Y_validate.shape)

(1500, 166)
(1500, 3)


In [None]:
ruta_guardado = '/content/drive/MyDrive/Semestres UdeA/Semestre X/Simulacion de sistemas/Proyectos del grupo/Proyecto del curso/exported_model.h5'

# Save the model
model.save(ruta_guardado,save_format='h5')
print("model saved!!!")

model saved!!!
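The saved `.h5` file holds the network only, so predictions in a fresh session also need the vocabulary that the tokenizer was fitted on. One possible approach, sketched here with the standard library, is to store the word-to-index mapping as JSON next to the model. The three-entry dictionary and the file name are hypothetical; in this notebook the mapping would come from `tokenizer.word_index`.

```python
import json
import os
import tempfile

# Hypothetical word index; in the notebook this would be tokenizer.word_index
word_index = {"the": 1, "i": 2, "to": 3}

# Write the mapping to disk (a temporary directory is used for this sketch)
path = os.path.join(tempfile.mkdtemp(), "word_index.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(word_index, fh)

# Read it back, as a later session would
with open(path, encoding="utf-8") as fh:
    restored = json.load(fh)

print(restored == word_index)
```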


## 7. Metrics

### Accuracy metric

In [15]:
# Predict class probabilities for the test set
y_pred_prob = model.predict(X_test)

# Maximum a posteriori: take the most probable class per sample (axis=1 searches across the class columns)
y_pred = np.argmax(y_pred_prob, axis=1)

# Compute the accuracy by comparing the true classes (Y_test) with the predicted classes (y_pred)
accuracy = accuracy_score(np.argmax(Y_test, axis=1), y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100))

Accuracy: 73.28%
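The value returned by `accuracy_score` is the share of correct predictions over all samples, which is not the same as per-class precision. A minimal sketch of the difference, using a confusion matrix built from hypothetical class indices (the two lists below are made-up data, not model output):

```python
# Hypothetical class indices: 0 = Negative, 1 = Neutral, 2 = Positive
y_true = [0, 0, 1, 2, 2, 2, 1, 0]
y_hat = [0, 2, 1, 2, 2, 1, 1, 0]

n_classes = 3
confusion = [[0] * n_classes for _ in range(n_classes)]
for t, p in zip(y_true, y_hat):
    confusion[t][p] += 1  # rows: true class, columns: predicted class

# Accuracy: correct predictions (the diagonal) over all samples
accuracy = sum(confusion[k][k] for k in range(n_classes)) / len(y_true)

# Precision of class k: correct predictions of k over all predictions of k
precision = [
    confusion[k][k] / max(1, sum(row[k] for row in confusion))
    for k in range(n_classes)
]
print(confusion)
print(accuracy, precision)
```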


## Model test

In [None]:
pos_cnt, neg_cnt, neu_cnt, pos_correct, neg_correct, neu_correct = 0, 0, 0, 0, 0, 0
for x in range(len(X_validate)):
    
    y_predicho = model.predict(X_validate[x].reshape(1,X_test.shape[1]),batch_size=1,verbose = 2)[0]
    print(Y_validate[x])
    print(y_predicho)

    print(np.argmax(y_predicho) == np.argmax(Y_validate[x]))

    if np.argmax(y_predicho) == np.argmax(Y_validate[x]):
        if np.argmax(Y_validate[x]) == 0:
            neg_correct += 1
        elif np.argmax(Y_validate[x]) == 1:
            neu_correct += 1
        else:
            pos_correct += 1
       
    if np.argmax(Y_validate[x]) == 0:
        neg_cnt += 1
    elif np.argmax(Y_validate[x]) == 1:
        neu_cnt += 1
    else:
        pos_cnt += 1

print("pos_acc", pos_correct/pos_cnt*100, "%")
print("neg_acc", neg_correct/neg_cnt*100, "%")
print("neu_acc", neu_correct/neu_cnt*100, "%")


Streaming output truncated to the last 5000 lines.
True
1/1 - 0s - 66ms/epoch - 66ms/step
[0 1 0]
[0.00719337 0.60165256 0.39115405]
True
1/1 - 0s - 81ms/epoch - 81ms/step
[1 0 0]
[0.13903324 0.09564874 0.76531804]
False
1/1 - 0s - 66ms/epoch - 66ms/step
[1 0 0]
[0.9158452  0.03012987 0.05402496]
True
1/1 - 0s - 65ms/epoch - 65ms/step
[0 0 1]
[0.0017529  0.01742987 0.9808172 ]
True
1/1 - 0s - 70ms/epoch - 70ms/step
[0 0 1]
[0.05293953 0.0637793  0.8832812 ]
True
1/1 - 0s - 64ms/epoch - 64ms/step
[0 0 1]
[0.04386446 0.06982349 0.88631207]
True
1/1 - 0s - 79ms/epoch - 79ms/step
[0 0 1]
[0.17533211 0.081056   0.7436119 ]
True
1/1 - 0s - 69ms/epoch - 69ms/step
[1 0 0]
[0.99289674 0.00345746 0.00364582]
True
1/1 - 0s - 69ms/epoch - 69ms/step
[1 0 0]
[0.8044193  0.09458061 0.10100017]
True
1/1 - 0s - 66ms/epoch - 66ms/step
[0 0 1]
[0.02267797 0.12364711 0.8536749 ]
True
1/1 - 0s - 66ms/epoch - 66ms/step
[0 1 0]
[0.00239739 0.9961319  0.00147074]
True
1/1 - 0s - 66ms/epoch - 66m
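The per-class tally performed by the loop above can be separated from the prediction step. The sketch below recomputes it for five of the label and prediction pairs shown in the output, with a guard for classes that never occur. In the notebook, a single `model.predict(X_validate)` call could presumably supply all the probabilities at once instead of 1500 separate calls; that part is an assumption and is not run here.

```python
# (one-hot label, predicted probabilities) pairs copied from the output above
pairs = [
    ([0, 1, 0], [0.00719337, 0.60165256, 0.39115405]),
    ([1, 0, 0], [0.13903324, 0.09564874, 0.76531804]),
    ([1, 0, 0], [0.9158452, 0.03012987, 0.05402496]),
    ([0, 0, 1], [0.0017529, 0.01742987, 0.9808172]),
    ([0, 1, 0], [0.00239739, 0.9961319, 0.00147074]),
]
names = ["neg", "neu", "pos"]  # index order used by the loop above

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

total = [0, 0, 0]
correct = [0, 0, 0]
for label, probs in pairs:
    k = argmax(label)
    total[k] += 1
    correct[k] += int(argmax(probs) == k)

for name, c, n in zip(names, correct, total):
    # Guard against classes that never occur in the slice
    print(name + "_acc", (c / n * 100 if n else float("nan")), "%")
```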

## Test: prediction example

In [None]:
twt = [""" Come meet one of the beautiful gods of gambling. """]
# Vectorize the tweet with the already-fitted tokenizer instance
twt = tokenizer.texts_to_sequences(twt)
# Pad the tweet to a fixed length for the `embedding` layer input
# (note: the model was built with sequences of length 166, not 28)
twt = pad_sequences(twt, maxlen=28, dtype='int32', value=0)
print(twt)
sentiment = model.predict(twt,batch_size=1,verbose = 2)[0]
print("sentiment", sentiment)
print(np.argmax(sentiment))
# pd.get_dummies orders the classes alphabetically: 0 = Negative, 1 = Neutral, 2 = Positive
if(np.argmax(sentiment) == 0):
    print("negative")
elif (np.argmax(sentiment) == 1):
    print("neutral")
else:
    print("positive")

[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0  228 1243   45    6    1  349    6]]
1/1 - 0s - 46ms/epoch - 46ms/step
sentiment [0.00998857 0.24354352 0.74646795]
2
positive
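The padded array printed above has 21 leading zeros followed by the 7 token ids of the tweet. A plain-Python sketch of that left-padding, mirroring what `pad_sequences` does with its default `padding='pre'` and `truncating='pre'` settings; the `pad_pre` helper is hypothetical and written only for this illustration.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad (or left-truncate) a list of token ids to exactly maxlen items."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

tokens = [228, 1243, 45, 6, 1, 349, 6]  # token ids printed above for the sample tweet

short = pad_pre(tokens, 28)
full = pad_pre(tokens, 166)  # 166 is the sequence length the model was built with
print(len(short), short[-7:] == tokens, len(full), full.count(0))
```

The sample tweet has nine words but only seven token ids, which is consistent with words outside the 2000-word vocabulary being dropped by the tokenizer.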


In [None]:
# Load the model
loaded_model = keras.models.load_model(ruta_guardado)

In [None]:
twt = [""" Come meet one of the beautiful gods of gambling. """]
# Vectorize the tweet with the already-fitted tokenizer instance
twt = tokenizer.texts_to_sequences(twt)
# Pad the tweet to the sequence length the model was built with (166)
twt = pad_sequences(twt, maxlen=166, dtype='int32', value=0)
print(twt)
sentiment = loaded_model.predict(twt,batch_size=1,verbose = 2)[0]
print("sentiment", sentiment)
print(np.argmax(sentiment))
# pd.get_dummies orders the classes alphabetically: 0 = Negative, 1 = Neutral, 2 = Positive
if(np.argmax(sentiment) == 0):
    print("negative")
elif (np.argmax(sentiment) == 1):
    print("neutral")
else:
    print("positive")

[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0  228 1243   45    6    1  349    6]]
1/1 - 1s - 719ms/epoch - 719ms/step
sentiment [0.00999477 0.24404229 0.7459629 ]
2
positive
