<div style="background-color: #1DA1F2; padding: 20px;"><b><h1> Deciphering the Emotional Language of Twitter: A Predictive Analysis Based on Machine Learning. </h1></b></div>

**Author**: Neivys Luz González Gómez

Emotion identification is a fundamental task in natural language processing that classifies texts by their emotional tone. Although the goal is to identify a wide range of human emotions, most available datasets are limited to positive, negative and, occasionally, neutral polarity.

Detecting emotions in text is a hard natural language processing problem: it is a multiclass classification task, and labeled data are often scarce. This labeled dataset, however, makes it possible to apply exploratory analysis and modeling techniques to better understand the emotional dynamics of social networks and to improve real-time detection.

The emotion dataset is built from English-language Twitter messages and contains twelve emotion labels: neutral, worry, happiness, sadness, love, surprise, fun, relief, hate, empty, enthusiasm and boredom. It covers a broader range of human emotions than polarity-only datasets, which allows sentiment analysis models to be trained and evaluated more accurately and thoroughly.

<div class="alert alert-info alert-info"><b><h3>General Objective</h3></b>
    
**Develop a model that detects emotions in tweets and analyzes language patterns on Twitter to support the early detection of emotional disorders such as depression and anxiety, among others.**
</div>

---

# Notebook No. 8: Model Verification

In this notebook, we predict the target variable with the best model obtained previously, packaged as a Pipeline. First, we preprocess the dataset without the target variable. Then we apply the model with its optimal parameters and predict the target variable, which here is the polarity class: negative: 0, neutral: 1, positive: 2.

---

In [1]:
#import libraries
import pandas as pd
import numpy as np
import math
import spacy
nlp = spacy.load("en_core_web_sm")

import re
import string

#import NLTK
import nltk
nltk.download('punkt') # Punkt: pretrained models for sentence and word tokenization
nltk.download('stopwords') # lists of stop words
nltk.download('wordnet') # WordNet lexical database, used by WordNetLemmatizer
nltk.download('omw-1.4') # Open Multilingual Wordnet data
nltk.download('averaged_perceptron_tagger') # part-of-speech tagger model

from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

#import preprocessing and normalization libraries
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet
from nltk.tag import pos_tag

from sklearn.preprocessing import LabelEncoder

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
# Import basic modules from sklearn
from scipy import stats ## to check normality and variance of columns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler, MinMaxScaler, LabelBinarizer, OrdinalEncoder

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

## Import classification models
from sklearn.linear_model import SGDClassifier, Perceptron, PassiveAggressiveClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


## Save pipeline
import joblib

---

<div class="alert alert-block alert-info">
<b><h2> Load the Dataset.</h2></b> 
</div>

In [3]:
emotion_data= pd.read_csv('text_emotion.csv')

In [4]:
emotion_data

Unnamed: 0,tweet_id,sentiment,author,content
0,1956967341,empty,xoshayzers,@tiffanylue i know i was listenin to bad habi...
1,1956967666,sadness,wannamama,Layin n bed with a headache ughhhh...waitin o...
2,1956967696,sadness,coolfunky,Funeral ceremony...gloomy friday...
3,1956967789,enthusiasm,czareaquino,wants to hang out with friends SOON!
4,1956968416,neutral,xkilljoyx,@dannycastillo We want to trade with someone w...
...,...,...,...,...
39995,1753918954,neutral,showMe_Heaven,@JohnLloydTaylor
39996,1753919001,love,drapeaux,Happy Mothers Day All my love
39997,1753919005,love,JenniRox,Happy Mother's Day to all the mommies out ther...
39998,1753919043,happiness,ipdaman1,@niariley WASSUP BEAUTIFUL!!! FOLLOW ME!! PEE...


## 1. Dataset Preparation

In [5]:
emotion_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   tweet_id   40000 non-null  int64 
 1   sentiment  40000 non-null  object
 2   author     40000 non-null  object
 3   content    40000 non-null  object
dtypes: int64(1), object(3)
memory usage: 1.2+ MB


## 2. Dataset Preprocessing.

### 2.1 Text Cleaning.

In [6]:
def remove_short_tweets(df, column_name):
    # Print tweets with less than 5 characters
    print(f"Tweets with less than 5 characters in column '{column_name}':")
    short_tweets = df[df[column_name].str.len() < 5][column_name]
    print(short_tweets)
    
    # Drop rows with less than 5 characters in the tweet
    df.drop(df[df[column_name].str.len() < 5].index, inplace=True)
    
    print(f"Total number of rows removed: {len(short_tweets)}")

In [7]:
def remove_mention_only_tweets(df, column_name):
    # Print tweets that contain only mentions (plus at most one extra character)
    print(f"Tweets with only mentions in column '{column_name}':")
    # regex=True is required: since pandas 2.0, str.replace treats the pattern as a literal by default
    mention_only = df[column_name].str.replace(r"@[^\s]+", "", regex=True).str.len() < 2
    mention_only_tweets = df[mention_only][column_name]
    print(mention_only_tweets)
    
    # Drop the tweets that contain only mentions
    df.drop(df[mention_only].index, inplace=True)
    
    print(f"Total number of rows removed: {len(mention_only_tweets)}")
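The two filters above modify the DataFrame in place. As a minimal sketch of the same rules written as boolean masks, the example below uses a hypothetical four-row frame (the rows are invented for illustration; only the column name `content` comes from the notebook) and returns a new frame instead of mutating the input.

```python
import pandas as pd

# Hypothetical toy data; only the column name mirrors the notebook.
toy = pd.DataFrame(
    {"content": ["0", "@someone", "@someone hello there", "gloomy friday"]}
)

# Rule 1: keep tweets with at least 5 characters.
long_enough = toy["content"].str.len() >= 5

# Rule 2: keep tweets that still have 2+ characters once mentions are removed.
# regex=True is passed explicitly because pandas 2.0+ defaults to literal matching.
without_mentions = toy["content"].str.replace(r"@[^\s]+", "", regex=True)
has_text = without_mentions.str.len() >= 2

# Combine both rules and rebuild a clean 0..n-1 index.
cleaned = toy[long_enough & has_text].reset_index(drop=True)
print(cleaned["content"].tolist())
# ['@someone hello there', 'gloomy friday']
```

Building the masks first makes it easy to inspect which rows each rule would drop before anything is removed.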

In [8]:
def reset_dataframe_indexes(df):
    df.reset_index(drop=True, inplace=True)
    print("DataFrame indexes reset successfully!")

In [9]:
def preproces_tweet(tweet):
    # Remove mentions (@username) and URLs
    tweet = re.sub(r'@[A-Za-z0-9]+|https?://[A-Za-z0-9./]+', '', tweet)
    
    # Convert the text to lowercase
    tweet = tweet.lower()
    
    # Remove punctuation
    tweet = re.sub('[%s]' % re.escape(string.punctuation), '', tweet)
    
    # Remove digits
    tweet = re.sub(r'\d+', '', tweet)
    
    # Remove stop words (NLTK's English list plus a few Twitter abbreviations)
    stop_words = stopwords.words('english') + ['u', 'im', 'c', 'n']
    words = tweet.split()
    words = [word for word in words if word not in stop_words]
    
    # Lemmatization (reduce each word to its dictionary form)
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    tweet = ' '.join(words)
    
    return tweet
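To make the effect of each regex step concrete, here is a reduced sketch of the cleaning function that needs no NLTK downloads. It is an approximation under stated assumptions: `clean_tweet_sketch` and `STOP_WORDS` are hypothetical names, the stop-word set is a tiny stand-in for NLTK's English list, the input strings are invented, and lemmatization is left out.

```python
import re
import string

# Hypothetical mini stop-word set; the notebook uses NLTK's full English list.
STOP_WORDS = {"n", "with", "a", "on", "your", "i", "to", "was"}


def clean_tweet_sketch(tweet: str) -> str:
    """Regex-only subset of the cleaning steps (no lemmatization)."""
    # Remove mentions and URLs, using the same patterns as the notebook.
    tweet = re.sub(r"@[A-Za-z0-9]+|https?://[A-Za-z0-9./]+", "", tweet)
    tweet = tweet.lower()
    # Delete punctuation outright; no space is inserted in its place.
    tweet = re.sub("[%s]" % re.escape(string.punctuation), "", tweet)
    # Delete digits.
    tweet = re.sub(r"\d+", "", tweet)
    return " ".join(w for w in tweet.split() if w not in STOP_WORDS)


print(clean_tweet_sketch("Layin n bed with a headache ughhhh...waitin on a call"))
# layin bed headache ughhhhwaitin call
print(clean_tweet_sketch("@some_user thanks!"))
# user thanks
```

Two side effects are visible here. Because punctuation is deleted rather than replaced by a space, words joined by an ellipsis are fused (`ughhhhwaitin`). And because the mention pattern does not include the underscore, the part of a handle after an underscore survives as an ordinary token (`user`).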

In [10]:
def remove_empty_space_records(df, column_name):
    if '' in df[column_name].values:
        print(df[column_name].value_counts()[''])
    else:
        print(f"No hay valores vacíos en '{column_name}'.")
        
    # Drop the records whose text is an empty string
    df.drop(df[df[column_name] == ''].index, inplace=True)
    
    # Reset the indexes
    df.reset_index(drop=True, inplace=True)
    
    print("Registros con espacios vacíos eliminados y los índices reseteados.")

In [11]:
def convert_to_string(df, column_name):
    df[column_name] = df[column_name].astype(str)
    print(f"La columna '{column_name}' ha sido convertida a tipo 'string'.")

### 2.2 Polarity with VADER (Valence Aware Dictionary and sEntiment Reasoner)

In [12]:
def add_vader_sentiment_column(df, column_name):
    analyzer = SentimentIntensityAnalyzer()
    sentiment_data_vader = df[column_name].apply(lambda tweet: analyzer.polarity_scores(tweet))
    sentiment_data_vader = pd.json_normalize(sentiment_data_vader)
    sentiment_data_vader["polarity_vader"] = sentiment_data_vader["compound"].apply(lambda c: "positive" if c > 0.3 else "negative" if c < -0.3 else "neutral")
    df = pd.concat([df, sentiment_data_vader], axis=1)
    print("La columna de análisis de polaridad de Vader ha sido agregada al DataFrame.")
    return df
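The labeling rule inside `add_vader_sentiment_column` can be isolated from VADER itself. The sketch below applies the same ±0.3 cut-offs to the five compound scores that appear in the `emotion_data.head()` output later in this notebook; `polarity_from_compound` is a hypothetical helper name introduced only for this illustration.

```python
def polarity_from_compound(compound: float, threshold: float = 0.3) -> str:
    """Map a VADER compound score to a polarity label with symmetric cut-offs."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the first five rows of emotion_data.head()
scores = [-0.5423, 0.0, -0.3612, 0.5423, 0.0772]
labels = [polarity_from_compound(c) for c in scores]
print(labels)
# ['negative', 'neutral', 'negative', 'positive', 'neutral']
```

With this rule, any score between -0.3 and 0.3 inclusive is treated as neutral, so a mildly positive tweet such as the fifth row (0.0772) is not counted as positive.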

### 2.3 Label encoder 

In [13]:
def encode_polarity_vader_column(df):
    le = LabelEncoder()
    df['polarity_vader'] = le.fit_transform(df['polarity_vader'])
    print("La columna de polaridad de Vader ha sido codificada con éxito.")
    return df
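`LabelEncoder` assigns integer codes in sorted order of the class names, which is where the mapping negative: 0, neutral: 1, positive: 2 comes from. A small sketch on an invented list of labels shows the mapping and how to translate codes back to names:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Invented sample of polarity labels, deliberately out of alphabetical order.
encoded = le.fit_transform(["positive", "negative", "neutral", "negative"])

# classes_ holds the labels in sorted order; the position is the integer code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
# {'negative': 0, 'neutral': 1, 'positive': 2}

# inverse_transform recovers the label names from integer codes.
print(le.inverse_transform([0, 1, 2]).tolist())
```

Note that the function above creates the encoder locally and discards it, so keeping a dictionary like `mapping` is one way to decode predictions later.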

### 2.4 Column (Feature) Selection

Columns that are not needed for the study are dropped.

In [14]:
def drop_and_reset(df):
    # Drop unused columns
    df.drop(['tweet_id', 'author', 'sentiment', 'neg', 'neu', 'pos', 'compound'], axis=1, inplace=True)
    print("Las columnas no utilizadas han sido eliminadas con éxito.")
    
    # Reset the indexes
    df.reset_index(drop=True, inplace=True)
    print("Los índices han sido restablecidos con éxito.")
    
    return df

### 2.5 Preprocessed Dataset

In [15]:
#Remove short tweets
remove_short_tweets(emotion_data, "content")

#remove mention only tweets
remove_mention_only_tweets(emotion_data, "content")

#reset dataframe indexes
reset_dataframe_indexes(emotion_data)

#preprocess the tweets
emotion_data['content'] = emotion_data['content'].apply(preproces_tweet)

#remove empty space records
remove_empty_space_records(emotion_data, "content")

#convert to string
convert_to_string(emotion_data, "content")

Tweets with less than 5 characters in column 'content':
340      0
15028    0
29869    0
39415    0
Name: content, dtype: object
Total number of rows removed: 4
Tweets with only mentions in column 'content':
659      @Joshuah_Pearson
664             @emlevins
3181          @Clumsyflic
4865        @philleasfogg
4933           @WillKnott
               ...       
38438        @MariahCarey
38650      @Britt_Uh_Knee
39206        @hmigroupllc
39518        @Remy_Foster
39995    @JohnLloydTaylor
Name: content, Length: 77, dtype: object
Total number of rows removed: 77
DataFrame indexes reset successfully!
127
Registros con espacios vacíos eliminados y los índices reseteados.
La columna 'content' ha sido convertida a tipo 'string'.


In [16]:
#add vader sentiment column
emotion_data = add_vader_sentiment_column(emotion_data, "content")

#encode polarity vader column
emotion_data = encode_polarity_vader_column(emotion_data)

La columna de análisis de polaridad de Vader ha sido agregada al DataFrame.
La columna de polaridad de Vader ha sido codificada con éxito.


In [17]:
emotion_data.head()

Unnamed: 0,tweet_id,sentiment,author,content,neg,neu,pos,compound,polarity_vader
0,1956967341,empty,xoshayzers,know listenin bad habit earlier started freaki...,0.333,0.667,0.0,-0.5423,0
1,1956967666,sadness,wannamama,layin bed headache ughhhhwaitin call,0.0,1.0,0.0,0.0,1
2,1956967696,sadness,coolfunky,funeral ceremonygloomy friday,0.556,0.444,0.0,-0.3612,0
3,1956967789,enthusiasm,czareaquino,want hang friend soon,0.0,0.308,0.692,0.5423,2
4,1956968416,neutral,xkilljoyx,want trade someone houston ticket one,0.0,0.794,0.206,0.0772,1


In [18]:
#drop and reset
emotion_data = drop_and_reset(emotion_data)

Las columnas no utilizadas han sido eliminadas con éxito.
Los índices han sido restablecidos con éxito.


In [19]:
emotion_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39792 entries, 0 to 39791
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   content         39792 non-null  object
 1   polarity_vader  39792 non-null  int32 
dtypes: int32(1), object(1)
memory usage: 466.4+ KB


## 3. Create the Pipeline

In [20]:
# model with best parameters

pipeline = Pipeline(
    [
        ('vector', CountVectorizer(lowercase=False)), 
        ('tfidf', TfidfTransformer()),  
        ('model', SGDClassifier(random_state=42, alpha= 0.0001, class_weight='balanced', max_iter=1000, 
                                penalty='l1'))
    ]
)
pipeline

Pipeline(steps=[('vector', CountVectorizer(lowercase=False)),
                ('tfidf', TfidfTransformer()),
                ('model',
                 SGDClassifier(class_weight='balanced', penalty='l1',
                               random_state=42))])
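As a rough illustration of what the three steps do together, the sketch below fits the same kind of pipeline on six invented, already-cleaned tweets. The texts and labels are hypothetical, and the classifier keeps scikit-learn defaults except for the options also set in the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy tweets with polarity codes (0=negative, 1=neutral, 2=positive).
texts = ["hate monday", "hate rain", "love friday",
         "love sunshine", "meeting monday", "train friday"]
labels = [0, 0, 2, 2, 1, 1]

toy_pipeline = Pipeline([
    ("vector", CountVectorizer(lowercase=False)),   # token counts per tweet
    ("tfidf", TfidfTransformer()),                  # reweight counts by TF-IDF
    ("model", SGDClassifier(random_state=42, class_weight="balanced", penalty="l1")),
])
toy_pipeline.fit(texts, labels)

# The fitted vectorizer maps any text onto the vocabulary learned during fit.
counts = toy_pipeline.named_steps["vector"].transform(["love love friday"])
print(counts.shape)  # (1, 8): one row, eight vocabulary terms

# predict() takes raw strings; vectorization and TF-IDF run automatically.
predictions = toy_pipeline.predict(["love friday", "hate rain"])
print(predictions)
```

Bundling the vectorizer with the classifier means new tweets are always transformed with the vocabulary and IDF weights learned from the training data.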

## 4. Train the Pipeline

First, the training data are split into the feature and the target.

In [21]:
# Create features df
X_train = emotion_data['content']

## Create target df
y_train = emotion_data['polarity_vader']

In [22]:
# Display X
X_train

0        know listenin bad habit earlier started freaki...
1                     layin bed headache ughhhhwaitin call
2                            funeral ceremonygloomy friday
3                                    want hang friend soon
4                    want trade someone houston ticket one
                               ...                        
39787                          succesfully following tayla
39788                                happy mother day love
39789    happy mother day mommy woman man long youre mo...
39790    wassup beautiful follow peep new hit single ww...
39791    bullet train tokyo gf visiting japan since thu...
Name: content, Length: 39792, dtype: object

In [23]:
# Display y_train
y_train

0        0
1        1
2        0
3        2
4        1
        ..
39787    1
39788    2
39789    2
39790    2
39791    1
Name: polarity_vader, Length: 39792, dtype: int32

In [24]:
# Train the pipeline with training data
pipeline.fit(X_train,y_train)

Pipeline(steps=[('vector', CountVectorizer(lowercase=False)),
                ('tfidf', TfidfTransformer()),
                ('model',
                 SGDClassifier(class_weight='balanced', penalty='l1',
                               random_state=42))])

In [25]:
import joblib
joblib.dump(pipeline, 'best_model.pkl')

['best_model.pkl']
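A quick way to check that a saved pipeline survives the round trip is to reload it and compare predictions. The sketch below does this with a small hypothetical pipeline and a temporary directory, so it does not touch `best_model.pkl`; the texts, labels and file name are invented for the example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data (0=negative, 2=positive).
texts = ["hate rain", "love sunshine", "hate monday", "love friday"]
labels = [0, 2, 0, 2]

small_pipeline = Pipeline([
    ("vector", CountVectorizer()),
    ("model", SGDClassifier(random_state=42)),
]).fit(texts, labels)

# Save to a throwaway location and load the pipeline back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(small_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions exactly.
same = bool((restored.predict(texts) == small_pipeline.predict(texts)).all())
print(same)  # True
```

Files written by joblib are pickle-based, so they should only be loaded from trusted sources, and they are generally expected to be loaded with the same scikit-learn version that produced them.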

## 5. Predict the Target with the Pipeline

In [26]:
import joblib
import pandas as pd

pipeline_loaded = joblib.load('best_model.pkl')

In [27]:
## Loading test data
X_test = pd.read_csv('test_data.csv')

In [32]:
# Pre-processing the tweets

#Remove short tweets
remove_short_tweets(X_test, "tweet")

#remove mention only tweets
remove_mention_only_tweets(X_test, "tweet")

#reset dataframe indexes
reset_dataframe_indexes(X_test)

#preprocess the tweets
X_test['tweet'] = X_test['tweet'].apply(preproces_tweet)

#remove empty space records
remove_empty_space_records(X_test, "tweet")

#convert to string
convert_to_string(X_test, "tweet")

#reset dataframe indexes
reset_dataframe_indexes(X_test)

Tweets with less than 5 characters in column 'tweet':
189     love
259     love
365     love
369     love
445     love
500     love
525     love
586     love
626     hate
664     love
721     love
807     love
847     love
1135    love
1209    love
1211    love
1270    love
1284    love
1437    love
1440    love
1452    love
1602      ☝🏻
1610    love
1677    love
1739    love
1743    love
1786    love
1800    love
1832    love
1872    love
1880    love
1955    love
1974     yes
2019    love
2176    yeah
2261    love
2287    love
2330    love
2381    love
2412    love
2416    love
2475    love
2491    hate
2503    love
2516    hate
2584    love
Name: tweet, dtype: object
Total number of rows removed: 46
Tweets with only mentions in column 'tweet':
Series([], Name: tweet, dtype: object)
Total number of rows removed: 0
DataFrame indexes reset successfully!
No hay valores vacíos en 'tweet'.
Registros con espacios vacíos eliminados y los índices reseteados.
La columna 'tweet' ha sido convertida a tipo 'string'.
DataFrame indexes reset successfully!

In [33]:
# Display X
X_test

Unnamed: 0,tweet
0,please offer cc atv color blue
1,always love always ill go always love always l...
2,careful date alot people aint looking love loo...
3,love xx
4,tik tok ceo experience available child singapo...
...,...
2567,ready love trust always business help eachothe...
2568,it’s ramadan need quiet
2569,haaaaappy birthday love
2570,word hate subjective pending lens youre lookin...


In [35]:
X_test['emotion label'] = pipeline_loaded.predict(X_test['tweet'])

In [37]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2572 entries, 0 to 2571
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   tweet          2572 non-null   object
 1   emotion label  2572 non-null   int32 
dtypes: int32(1), object(1)
memory usage: 30.3+ KB


In [39]:
X_test.head(20)

Unnamed: 0,tweet,emotion label
0,please offer cc atv color blue,2
1,always love always ill go always love always l...,2
2,careful date alot people aint looking love loo...,2
3,love xx,2
4,tik tok ceo experience available child singapo...,1
5,beyondfast love nvidia,2
6,james kjv blessed man endureth temptation trie...,2
7,thx drop love arb ❤️,2
8,love sing top lungsand range master bathroom p...,2
9,sudden stream tonight fun thanks pummel party ...,2


In [41]:
X_test.tail(20)

Unnamed: 0,tweet,emotion label
2552,nani bestie fr love it🫶🫶🫶,2
2553,love love said yo man defending rape grooming ...,2
2554,ghanaians always criticise black star love alw...,2
2555,he’s playing good lmao,2
2556,sad god happiness unconditional love 😢,2
2557,celebrate nationalpuppyday 🐶 lot puppy love sn...,2
2558,🤣 alcahol force happiness moes,1
2559,back root daenerys stormborn egyptianish fashi...,1
2560,rt struggling ignite love reading student 🔥 ch...,2
2561,literal sacrifice life since old,1


In [42]:
X_test['emotion label'].value_counts()

2    1836
0     453
1     283
Name: emotion label, dtype: int64
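The integer codes are easier to read once they are mapped back to polarity names and expressed as shares. The sketch below does the arithmetic on the counts printed above, using the code-to-name mapping stated at the start of this notebook.

```python
# Counts copied from the value_counts() output above.
counts = {2: 1836, 0: 453, 1: 283}
names = {0: "negative", 1: "neutral", 2: "positive"}

total = sum(counts.values())
# Percentage of predictions per class, rounded to one decimal place.
shares = {names[code]: round(100 * n / total, 1) for code, n in counts.items()}

print(total)   # 2572
print(shares)  # {'positive': 71.4, 'negative': 17.6, 'neutral': 11.0}
```

Roughly seven out of ten test tweets are predicted as positive, so the predicted distribution is strongly skewed toward class 2.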

In [43]:
X_test[X_test['emotion label'] == 0]

Unnamed: 0,tweet,emotion label
23,literally changed ecosystem fun……i hate ppl,0
25,“i piece shit man cheated wife instead helping...,0
26,ive watching entire life hate air breathes,0
27,god made depressed broke airpods see would happen,0
34,harmstrong don’t hate couple share everything 😂,0
...,...,...
2539,everything make anxious hate,0
2541,bill hate drafting offense unless name josh al...,0
2545,mom like yolanda hadid lmao hate,0
2566,post depressed gender give break female litera...,0


In [44]:
X_test[X_test['emotion label'] == 1]

Unnamed: 0,tweet,emotion label
4,tik tok ceo experience available child singapo...,1
10,aesthetic demolished empty lot,1
32,empty space neta,1
38,dont worry future okay let live present,1
45,reader little folio reason anger long appear w...,1
...,...,...
2551,“take responsibility happiness never put peopl...,1
2558,🤣 alcahol force happiness moes,1
2559,back root daenerys stormborn egyptianish fashi...,1
2561,literal sacrifice life since old,1


In [45]:
X_test[X_test['emotion label'] == 2]

Unnamed: 0,tweet,emotion label
0,please offer cc atv color blue,2
1,always love always ill go always love always l...,2
2,careful date alot people aint looking love loo...,2
3,love xx,2
5,beyondfast love nvidia,2
...,...,...
2564,everyone love talking bipolar egirls shit one ...,2
2565,🔊 nowplaying bbcradios futuresounds claraamfo ...,2
2567,ready love trust always business help eachothe...,2
2569,haaaaappy birthday love,2


In [38]:
X_test.to_csv('out.csv')