# Sentiment Analysis of Tweets about Covid-19 Vaccines

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NickEsColR/icesi-nlp/blob/main/Sesion1/7-sentiment-analysis.ipynb)

Now let's put some of these concepts into practice on a more realistic case. In this exercise we run sentiment analysis on tweets about Covid-19 vaccines. Once neutral tweets are excluded, this is a simple binary classification, and any model could be used for it; the extra ingredient here is pre-processing the text inputs.

We will use the Kaggle dataset [Covid-19 Vaccine Tweets with Sentiment Annotation](https://www.kaggle.com/datasets/datasciencetool/covid19-vaccine-tweets-with-sentiment-annotation?resource=download).

In [1]:
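from importlib.metadata import distributions
import warnings

warnings.filterwarnings('ignore')

# Detect Colab by looking for the google-colab distribution among installed packages
installed_packages = {(dist.metadata['Name'] or '').lower() for dist in distributions()}
IN_COLAB = 'google-colab' in installed_packages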


In [2]:
!test '{IN_COLAB}' = 'True' && wget  https://github.com/Ohtar10/icesi-nlp/raw/refs/heads/main/requirements.txt && pip install -r requirements.txt

In [3]:
!test '{IN_COLAB}' = 'True' && wget  https://github.com/NickEsColR/icesi-nlp/raw/refs/heads/main/Sesion1/covid-19_vaccine_tweets_with_sentiment.csv

Let's start by loading the dataset:

In [4]:
import pandas as pd

tweets = pd.read_csv('./covid-19_vaccine_tweets_with_sentiment.csv', sep=',', encoding='latin1') # type: ignore
tweets.head()

Unnamed: 0,tweet_id,label,tweet_text
0,1.360342e+18,1,"4,000 a day dying from the so called Covid-19 ..."
1,1.382896e+18,2,Pranam message for today manifested in Dhyan b...
2,1.375673e+18,2,Hyderabad-based ?@BharatBiotech? has sought fu...
3,1.381311e+18,1,"Confirmation that Chinese #vaccines ""dont hav..."
4,1.362166e+18,3,"Lab studies suggest #Pfizer, #Moderna vaccines..."


Next, some cleanup: we remove nulls and empty values:

In [5]:
# Drop rows with missing values in any column
tweets.dropna(inplace=True)
# Trim surrounding whitespace, then drop tweets that end up empty
tweets.tweet_text = tweets.tweet_text.str.strip()
blanks = tweets[tweets.tweet_text == ''].index
tweets.drop(blanks, inplace=True)

In [6]:
tweets[tweets.tweet_text == ''].index

Index([], dtype='int64')

We can see that no empty tweets remain.
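
Since the introduction points to text pre-processing as the extra ingredient, here is a minimal sketch of one possible tweet cleanup step. It is not part of the original notebook: `clean_tweet` and the three regular expressions are hypothetical names, and the choice of what to strip (URLs and @mentions, but not hashtag words) is an assumption.

```python
import re

# Hypothetical helper, not part of the original notebook.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove URLs and @mentions, keep hashtag words, collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")  # keep the word, drop the hash sign
    return SPACE_RE.sub(" ", text).strip()


example = "Still want to take the #jab?\n@someone https://example.com"
print(clean_tweet(example))
```

If it proved useful, it could be applied with `tweets.tweet_text.apply(clean_tweet)`. Casing and punctuation are deliberately left alone here, because lexicon-based scorers such as VADER treat capitalization and exclamation marks as intensity cues.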

In [7]:
tweets.label.value_counts()

label
2    3680
3    1900
1     420
Name: count, dtype: int64

The dataset is imbalanced: 1900 positive examples (label 3) and 420 negative examples (label 1); label 2 is neutral. Since we want a binary classifier, we drop the neutral tweets:
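
To make the imbalance concrete, the sketch below works out the class shares from the counts printed above, along with the accuracy that a trivial "always predict the majority class" rule would reach once neutral tweets are set aside. The variable names are illustrative only.

```python
# Label counts copied from the value_counts() output above.
# 1 = negative, 3 = positive; label 2 (neutral) is left out.
counts = {1: 420, 3: 1900}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always answers with the most frequent label
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

print(f"total: {total}")
for label, share in shares.items():
    print(f"label {label}: {share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model evaluated on this subset is worth comparing against that majority-class figure, not only against a coin flip.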

In [8]:
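# Keep only negative (1) and positive (3) tweets; reset_index returns a new
# DataFrame, so later column assignments do not act on a slice of the original
tweets = tweets[tweets.label != 2].reset_index(drop=True)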
tweets.head()

Unnamed: 0,tweet_id,label,tweet_text
0,1.360342e+18,1,"4,000 a day dying from the so called Covid-19 ..."
1,1.381311e+18,1,"Confirmation that Chinese #vaccines ""dont hav..."
2,1.362166e+18,3,"Lab studies suggest #Pfizer, #Moderna vaccines..."
3,1.351285e+18,1,Still want to take the #jab?\n#PfizerBioNTech\...
4,1.363344e+18,3,#Covaxin effective against mutant virus strain...


To keep things simple, we will use VADER, a lexicon- and rule-based sentiment model, to compute the positive or negative score. This model already ships with NLTK.

In [9]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/codespace/nltk_data...


True

In [10]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
tweets['scores'] = tweets.tweet_text.apply(lambda r: sid.polarity_scores(r))
tweets.head()

Unnamed: 0,tweet_id,label,tweet_text,scores
0,1.360342e+18,1,"4,000 a day dying from the so called Covid-19 ...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
1,1.381311e+18,1,"Confirmation that Chinese #vaccines ""dont hav...","{'neg': 0.148, 'neu': 0.852, 'pos': 0.0, 'comp..."
2,1.362166e+18,3,"Lab studies suggest #Pfizer, #Moderna vaccines...","{'neg': 0.0, 'neu': 0.794, 'pos': 0.206, 'comp..."
3,1.351285e+18,1,Still want to take the #jab?\n#PfizerBioNTech\...,"{'neg': 0.198, 'neu': 0.759, 'pos': 0.043, 'co..."
4,1.363344e+18,3,#Covaxin effective against mutant virus strain...,"{'neg': 0.116, 'neu': 0.657, 'pos': 0.228, 'co..."
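
The `compound` entry in each score dictionary is a single value between -1 and 1. As a rough sketch of where that range comes from: VADER's reference implementation sums the valence of the matched words and squashes the sum with `x / sqrt(x^2 + alpha)`, using `alpha = 15`. The function below reproduces only that normalization step; the formula and the constant are stated from the reference implementation and are worth checking against the installed NLTK version.

```python
import math


def normalize(raw_score: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into [-1, 1] (sketch of VADER's normalization)."""
    norm = raw_score / math.sqrt(raw_score * raw_score + alpha)
    return max(-1.0, min(1.0, norm))


# A few illustrative raw sums (made-up inputs, not taken from the dataset)
for raw in (-8.0, -2.0, 0.0, 2.0, 8.0):
    print(f"raw={raw:+.1f} -> compound={normalize(raw):+.4f}")
```

A tweet with no lexicon matches has a raw sum of zero and therefore a compound of exactly 0.0, as in the first row of the table above.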


With these scores we can now turn the result into a predicted label:

In [11]:
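tweets['compound'] = tweets.scores.apply(lambda s: s['compound'])
# NOTE: label 1 = negative, 3 = positive, so this sends compound > 0 to the
# negative label; kept as run so the recorded outputs below still match.
tweets['prediction'] = tweets['compound'].apply(lambda c: 1 if c > 0 else 3)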
tweets.head()

Unnamed: 0,tweet_id,label,tweet_text,scores,compound,prediction
0,1.360342e+18,1,"4,000 a day dying from the so called Covid-19 ...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,3
1,1.381311e+18,1,"Confirmation that Chinese #vaccines ""dont hav...","{'neg': 0.148, 'neu': 0.852, 'pos': 0.0, 'comp...",-0.7783,3
2,1.362166e+18,3,"Lab studies suggest #Pfizer, #Moderna vaccines...","{'neg': 0.0, 'neu': 0.794, 'pos': 0.206, 'comp...",0.3818,1
3,1.351285e+18,1,Still want to take the #jab?\n#PfizerBioNTech\...,"{'neg': 0.198, 'neu': 0.759, 'pos': 0.043, 'co...",-0.6908,3
4,1.363344e+18,3,#Covaxin effective against mutant virus strain...,"{'neg': 0.116, 'neu': 0.657, 'pos': 0.228, 'co...",0.7135,1
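
According to the label counts earlier, label 3 is positive and label 1 is negative, whereas the cell above assigns 1 when the compound score is positive. The sketch below shows a mapping that follows the dataset's coding, applied to the five rows displayed above. `compound_to_label` and its `threshold` parameter are hypothetical, and sending a compound of exactly 0.0 to the negative label is a choice, not a rule.

```python
NEGATIVE, POSITIVE = 1, 3  # label codes used by this dataset


def compound_to_label(compound: float, threshold: float = 0.0) -> int:
    """Hypothetical helper: positive label when the compound score exceeds the threshold."""
    return POSITIVE if compound > threshold else NEGATIVE


# (true label, compound) pairs copied from the five rows displayed above
rows = [(1, 0.0), (1, -0.7783), (3, 0.3818), (1, -0.6908), (3, 0.7135)]

for label, compound in rows:
    print(f"true={label} compound={compound:+.4f} predicted={compound_to_label(compound)}")
```

On these five rows the predictions agree with the true labels; that says nothing yet about the full dataset.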


Finally, we compute a few quality metrics for the model:

In [12]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_true = tweets.label.astype(int).to_numpy()
y_pred = tweets.prediction.astype(int).to_numpy()

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
cr = classification_report(y_true, y_pred)


print(f"Accuracy:\n{acc}\n")
print(f"Classification Report:\n{cr}")
print(f"Confusion Matrix:\n{cm}")

Accuracy:
0.378448275862069

Classification Report:
              precision    recall  f1-score   support

           1       0.10      0.29      0.14       420
           3       0.72      0.40      0.51      1900

    accuracy                           0.38      2320
   macro avg       0.41      0.34      0.33      2320
weighted avg       0.60      0.38      0.45      2320

Confusion Matrix:
[[ 122  298]
 [1144  756]]
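
To see how the report relates to the confusion matrix, the sketch below recomputes accuracy, precision and recall directly from the four counts printed above (rows are true labels 1 and 3, columns are predicted labels 1 and 3). It also shows what the same counts would give if the two prediction columns were swapped, which is what reversing the label mapping in cell 11 amounts to. `summarize` is an illustrative helper, not a scikit-learn function.

```python
# Confusion matrix as printed above: rows = true labels (1, 3), columns = predicted (1, 3)
cm = [[122, 298],
      [1144, 756]]


def summarize(matrix):
    """Return accuracy and a (precision, recall) pair per class."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(size)) / total
    per_class = []
    for i in range(size):
        predicted_i = sum(row[i] for row in matrix)  # column sum
        actual_i = sum(matrix[i])                    # row sum
        precision = matrix[i][i] / predicted_i if predicted_i else 0.0
        recall = matrix[i][i] / actual_i if actual_i else 0.0
        per_class.append((precision, recall))
    return accuracy, per_class


accuracy, per_class = summarize(cm)
print(f"as reported: accuracy={accuracy:.3f}, per class={per_class}")

# Swapping the prediction columns corresponds to reversing the 1/3 mapping
swapped = [row[::-1] for row in cm]
swapped_accuracy, swapped_per_class = summarize(swapped)
print(f"columns swapped: accuracy={swapped_accuracy:.3f}, per class={swapped_per_class}")
```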


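At 38% accuracy this result is below even a 50% coin-flip baseline, so there is a lot of room to improve. Part of the gap comes from the mapping in cell 11, which assigns label 1 (negative in this dataset) to positive compound scores; reversing it would swap the columns of the confusion matrix and give (298 + 1144) / 2320 ≈ 62% accuracy. The negative labels are also problematic, partly because of the class imbalance. It is worth trying more complex algorithms and balancing strategies.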
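
As one example of a balancing strategy, the sketch below derives "balanced" class weights from the label counts, using the formula `n_samples / (n_classes * count)` that scikit-learn documents for `class_weight='balanced'`. This is an illustration of the idea, not something the original notebook does, and the variable names are assumptions.

```python
# Label counts for the binary subset: 1 = negative, 3 = positive
counts = {1: 420, 3: 1900}

n_samples = sum(counts.values())
n_classes = len(counts)

# Rare classes get a weight above 1, frequent classes a weight below 1
class_weight = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label, weight in class_weight.items():
    print(f"label {label}: weight {weight:.4f}")
```

A dictionary like this can be passed as the `class_weight` argument of scikit-learn classifiers such as `LogisticRegression`, so that errors on the 420 negative tweets count for more during training.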