# Emotion Detection System in Tweets

We are going to study the [Brazilian Stock Market Tweets with Emotions](https://www.kaggle.com/datasets/fernandojvdasilva/stock-tweets-ptbr-emotions) dataset to identify the emotions expressed in Brazilian stock market tweets.

<a id="data"></a>

---
# Data Exploration


In this section, we load the files into `pandas.DataFrame` objects and then build the cleaned train and test sets.


In [None]:
%%capture
!pip install scikit-learn==1.0.2

# download the Portuguese spaCy model
!pip install spacy==2.3.7
!python -m spacy download pt_core_news_sm-2.3.0 --direct

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Input data files are available in the read-only "../input/" directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train_set = pd.read_csv('/kaggle/input/stock-tweets-ptbr-emotions/tweets_stocks.csv')
test_set  = pd.read_csv('/kaggle/input/stock-tweets-ptbr-emotions/tweets_stocks-full_agreement.csv')

train_set

## Clean the train set

Let's build the train set by removing the tweets that also appear in the test set and by handling the rows without a class.

In [None]:
# remove test set from train set
mask_remove_test = ~train_set['tweet_id'].isin(test_set['tweet_id'])
# copy so that later assignments modify this frame and not a view of the original
train_set = train_set[mask_remove_test].copy()

train_set

In [None]:
# are there any test data in train set?
train_set[train_set['tweet_id'].isin(test_set['tweet_id'])]

Let's relabel the rows without a class (`NEUTRAL == -1`) as `NEUTRAL`.

In [None]:
# columns of interest
X_column = 'text'
Y_columns = ['TRU','DIS','JOY','SAD','ANT','SUR','ANG','FEA','NEUTRAL']

In [None]:
# relabel rows without a class (NEUTRAL == -1): clear every label, then set NEUTRAL
mask_no_class = train_set['NEUTRAL'] == -1
train_set.loc[mask_no_class, Y_columns] = 0
train_set.loc[mask_no_class, 'NEUTRAL'] = 1

train_set

In [None]:
# are there any train data without class?
train_set.query('NEUTRAL == -1')

<a id="class"></a>

---
# Multilabel Classification

Multilabel classification is a classification task where each sample is labeled with `m` labels from `n_classes` possible classes, where `m` can be `0` to `n_classes` inclusive. [sklearn - Multilabel classification](https://scikit-learn.org/stable/modules/multiclass.html#multilabel-classification)
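
To make the definition concrete, here is a minimal sketch of how label sets map to the binary indicator rows used as targets in this notebook. The three label sets are made up for illustration, and `to_indicator` is a hypothetical helper, not part of the dataset or of scikit-learn.

```python
# Label order used for the target columns in this notebook.
labels = ['TRU', 'DIS', 'JOY', 'SAD', 'ANT', 'SUR', 'ANG', 'FEA', 'NEUTRAL']


def to_indicator(active, labels=labels):
    """Return one 0/1 entry per label (hypothetical helper for illustration)."""
    return [1 if label in active else 0 for label in labels]


# Made-up label sets: m can be anything from 0 to n_classes labels per sample.
rows = [
    to_indicator({'JOY', 'ANT'}),   # two emotions at once
    to_indicator({'NEUTRAL'}),      # a single label
    to_indicator(set()),            # no label at all (m = 0)
]

for row in rows:
    print(row)
```

Each sample is one row and each column answers a yes/no question for one label, which is the shape of `train_set[Y_columns]` passed to the classifier.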


<a id="strategy1"></a>

## Strategy 1 - TF-IDF

- First, we create a simple text preprocessor
- Next, we use TF-IDF to extract features from the preprocessed texts
- Finally, we use a Decision Tree classifier to predict the classes
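
As a rough intuition for the feature extraction step, the sketch below computes TF-IDF weights by hand on three made-up sentences. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, sublinear term frequency, `1 + ln(tf)`, and L2 normalization, which is how scikit-learn documents `TfidfVectorizer` with `sublinear_tf=True`. It uses plain whitespace tokenization and ignores `max_features` and `max_df`, so its numbers will not match the pipeline exactly.

```python
import math
from collections import Counter

# Made-up, already normalized sentences (not taken from the dataset).
docs = ["petr4 subiu hoje", "petr4 caiu hoje", "vale3 subiu muito muito"]


def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max."""
    return [' '.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


tokenized = [ngrams(doc.split()) for doc in docs]
n_docs = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(term for terms in tokenized for term in set(terms))


def tfidf(terms):
    """Sublinear tf times smoothed idf, followed by L2 normalization."""
    counts = Counter(terms)
    weights = {
        term: (1 + math.log(count)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vectors = [tfidf(terms) for terms in tokenized]
for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f'{term:12s} {weight:.3f}')
```

Terms that appear in fewer documents, such as the bigram `petr4 subiu`, receive a higher weight than terms shared across documents, such as `petr4`.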


In [None]:
import re
import string
import pt_core_news_sm
from unidecode import unidecode
from sklearn.base import TransformerMixin, BaseEstimator

nlp = pt_core_news_sm.load()

In [None]:
class SimpleTextPreprocessor(BaseEstimator, TransformerMixin):
    """
    Text preprocessing includes steps:
        - Lower case
        - Remove accents
    """
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, *_):
        data = pd.Series(X) if not isinstance(X, pd.Series) else X
        data = data.apply(self._preprocess_text)
        return data

    def _preprocess_text(self, text):
        # apply the basic normalization steps in order
        pre_text = text.lower()
        pre_text = unidecode(pre_text)
        return pre_text

In [None]:
# just a test to see the preprocessing
tp = SimpleTextPreprocessor()
tp.transform(train_set['text'].iloc[0:4])

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

pipe1 = Pipeline(steps=[
    ('normalize', SimpleTextPreprocessor()), 
    ('features', TfidfVectorizer(
        ngram_range=(1, 2), analyzer='word',
        sublinear_tf=True, max_features=3_000,
        max_df=0.9, preprocessor=None
    )),
    ('classifier', DecisionTreeClassifier(random_state=1))
])

In [None]:
%%time
pipe1.fit(train_set[X_column], train_set[Y_columns])

### Evaluation

Let's evaluate the per-class performance using `classification_report`. The reported averages include macro average (averaging the unweighted mean per label), weighted average (averaging the support-weighted mean per label), and sample average (only for multilabel classification). Micro average (averaging the total true positives, false negatives and false positives) is only shown for multi-label or multi-class with a subset of classes, because it corresponds to accuracy otherwise and would be the same for all metrics. [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)
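
The averages described above can be reproduced by hand on a toy example. The sketch below uses made-up predictions for 4 samples and 3 labels and computes the macro, weighted, and micro F1 scores from per-label counts. It also computes the exact-match ratio, which is what `accuracy_score` reports for multilabel targets (subset accuracy): a sample only counts as correct when every label matches.

```python
# Made-up multilabel ground truth and predictions: 4 samples, 3 labels.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 0]]


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_label_counts(y_true, y_pred):
    """Return one (tp, fp, fn) tuple per label column."""
    counts = []
    for j in range(len(y_true[0])):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        counts.append((tp, fp, fn))
    return counts


counts = per_label_counts(y_true, y_pred)
label_f1 = [f1(*c) for c in counts]
support = [tp + fn for tp, _, fn in counts]  # true occurrences per label

macro_f1 = sum(label_f1) / len(label_f1)
weighted_f1 = sum(f * s for f, s in zip(label_f1, support)) / sum(support)
micro_f1 = f1(*[sum(col) for col in zip(*counts)])  # pool counts over labels
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f'macro={macro_f1:.4f} weighted={weighted_f1:.4f} '
      f'micro={micro_f1:.4f} subset accuracy={subset_accuracy:.4f}')
```

Because subset accuracy requires all nine labels of a tweet to be right at once, it is usually much lower than the per-label scores and is best read alongside them.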

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import multilabel_confusion_matrix

def print_confusion_matrix(confusion_matrix, axes, class_label, class_names, fontsize=12):
    df_cm = pd.DataFrame(confusion_matrix, index=class_names, columns=class_names,)
    heatmap = sns.heatmap(df_cm, annot=True, fmt="d", cbar=False, ax=axes, cmap="crest")
    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    axes.set_ylabel('True label')
    axes.set_xlabel('Predicted')
    axes.set_title(class_label)

def run_confusion_matrix(Y_test, Y_pred, labels, size=(3,3)):
    vis_arr = multilabel_confusion_matrix(Y_test, Y_pred)
    fig, ax = plt.subplots(size[0], size[1], figsize=(12, 7))
    for axes, cfs_matrix, label in zip(ax.flatten(), vis_arr, labels):
        print_confusion_matrix(cfs_matrix, axes, label, ["N", "Y"])
    
    fig.tight_layout()
    plt.show()

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [None]:
# predict the classes
Y_test = test_set[Y_columns]
Y_pred = pipe1.predict(test_set[X_column])

print(classification_report(Y_test, Y_pred, target_names=Y_columns))
print('accuracy', f'{accuracy_score(Y_test, Y_pred):.4f}')

In [None]:
run_confusion_matrix(Y_test, Y_pred, labels=Y_columns)

<a id="strategy2"></a>

## Strategy 2 - TF-IDF + Robust Preprocessing

Now, we are going to apply a series of preprocessing techniques to clean our data more thoroughly.


In [None]:
class RobustTextPreprocessor(SimpleTextPreprocessor):
    """
    Text preprocessing includes steps:
        - Lower case
        - Remove accents
        - Replace @citations (it didn't perform well)
        - Replace http://websites.com (it didn't perform well)
        - Remove numbers
        - Remove special characters and symbols
        - Remove line breaks
        - Lemmatization
    """
    def __init__(self, nlp=nlp):
        self.nlp = nlp

    def _preprocess_text(self, text):
        # apply each cleaning step in order
        pre_text = super()._preprocess_text(text)
        pre_text = self._replace_citation(pre_text)
        pre_text = self._replace_website(pre_text)
        pre_text = self._remove_number(pre_text)
        pre_text = self._remove_punct(pre_text)
        pre_text = self._remove_breakline(pre_text)
        pre_text = self._remove_extra_spaces(pre_text)
        pre_text = self._lemmatize(pre_text)
        return pre_text

    def _remove_number(self, text):
        # Remove numbers
        return re.sub(r'\d', ' ', text)

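    def _replace_citation(self, text):
        # Replace @username mentions by CITATION (\w already covers digits)
        return re.sub(r'@\w+', 'CITATION', text)

    def _replace_website(self, text):
        # Replace URLs by SITE; \S+ stops at the first whitespace,
        # so the text that follows the URL on the same line is kept
        return re.sub(r'https?://\S+', 'SITE', text)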

    def _remove_punct(self, text):
        # Replace special characters symbols
        spaces = ' '*len(string.punctuation)
        return text.translate(str.maketrans(string.punctuation, spaces))

    def _remove_breakline(self, text):
        # Replace line breaks with spaces
        return re.sub(r'\n', ' ',  text)
        
    def _remove_extra_spaces(self, text):
        # Remove extra spaces
        return re.sub(' +', ' ', text)

    def _lemmatize(self, text):
        # Replace each token by its lemma
        doc = self.nlp(text)
        return ' '.join(t.lemma_ for t in doc)

In [None]:
# just a test to see the preprocessing
tp = RobustTextPreprocessor()
tp.transform(train_set['text'].iloc[0:4])
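
To see what the regex-based steps do without loading spaCy, here is a standalone sketch on a made-up tweet. It skips lemmatization, uses `unicodedata` as a stand-in for `unidecode`, and matches URLs with `\S+` so that only the URL itself is replaced. The function name `clean` and the sample text are hypothetical.

```python
import re
import string
import unicodedata


def strip_accents(text):
    """Drop combining marks after decomposition (stand-in for unidecode)."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean(text):
    """Apply the cleaning steps in order, without lemmatization."""
    text = strip_accents(text.lower())
    text = re.sub(r'@\w+', 'CITATION', text)          # @mentions
    text = re.sub(r'https?://\S+', 'SITE', text)      # URLs
    text = re.sub(r'\d', ' ', text)                   # digits
    # map every punctuation character to a space
    text = text.translate(
        str.maketrans(string.punctuation, ' ' * len(string.punctuation)))
    text = re.sub(r'\s+', ' ', text).strip()          # line breaks, extra spaces
    return text


# Made-up tweet, not taken from the dataset.
sample = "@fulano PETR4 subiu 3% hoje!\nAção em alta: https://example.com/noticia"
print(clean(sample))
```

Note that removing digits also turns a ticker such as `PETR4` into `petr`, so different share classes of the same company collapse into one token.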

In [None]:
pipe2 = Pipeline(steps=[
    ('normalize', RobustTextPreprocessor()), 
    ('features', TfidfVectorizer(
        ngram_range=(1, 2), analyzer='word',
        sublinear_tf=True, max_features=3_000,
        max_df=0.9, preprocessor=None
    )),
    ('classifier', DecisionTreeClassifier(random_state=1))
])

In [None]:
%%time
pipe2.fit(train_set[X_column], train_set[Y_columns])

### Evaluation

In [None]:
# predict the classes
Y_test = test_set[Y_columns].to_numpy()
Y_pred = pipe2.predict(test_set[X_column])

print(classification_report(Y_test, Y_pred, target_names=Y_columns))
print('accuracy', f'{accuracy_score(Y_test, Y_pred):.4f}')

In [None]:
run_confusion_matrix(Y_test, Y_pred, labels=Y_columns)

<a id="conclusion"></a>

---
# Conclusion

In general, both strategies produced similar results: in this case, the robust preprocessing did not help the classifier. 😔   
Still, we were able to build a model capable of predicting nine different classes (eight emotions plus `NEUTRAL`) for a stock market tweet.
