---

# Georgios_Ioannou


## Copyright © 2023 by Georgios Ioannou


---

<h1 align="center"> Text Emotion System Sentiment Analysis </h1>
<h2 align="center"> TESSA </h2>

In this notebook, we classify the emotion expressed in short text documents. The dataset we use is:

<p style="text-align: center;"><a href="https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp">Emotions Dataset for NLP</a></p>


---

<h2 align="center"> Remember our main steps motto "ISBE" </h2>

<h3 align="center"> Main Steps when building a Machine Learning Model </h3>

1. **I** - `Inspect and explore data`
2. **S** - `Select and engineer features`
3. **B** - `Build and train model`
4. **E** - `Evaluate model`


---

<h2 align='center'> GPU Information </h2>


In [1]:
!nvidia-smi


Wed Dec 20 08:15:28 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

---

<h2 align='center'> Libraries </h2>


In [2]:
# Import libraries.

# Use inline so our visualizations display in notebook.


%matplotlib inline


import matplotlib.pyplot as plt   # Data visualization.
import nltk                       # Natural Language Processing.
import numpy as np                # Data wrangling.
import os                         # Manipulate operating system interfaces.
import pandas as pd               # Data handling.
pd.set_option('display.max_colwidth', None)
import pickle                     # Python object serialization.
import plotly.express as px       # Data visualization
import plotly.graph_objects as go # Data visualization
import re                         # Regular expression operations.
import seaborn as sns             # Data visualization.
import subprocess                 # To download nltk wordnet in Kaggle.
sns.set()
import warnings                   # Ignore all warnings.
warnings.filterwarnings('ignore')


from nltk.stem import WordNetLemmatizer # Lemmatize using WordNet's built-in morphy function.
from nltk.stem import PorterStemmer     # Remove morphological affixes from words, leaving only the word stem.
from nltk.corpus import stopwords       # Remove stopwords.
from nltk import word_tokenize          # Tokenize.
from sklearn.feature_extraction.text import CountVectorizer # Convert a collection of text documents to a matrix of token counts.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score, multilabel_confusion_matrix, precision_score, recall_score # Evaluation metrics.
from sklearn.model_selection import train_test_split     # Split data into training/validation/testing sets.
from sklearn.naive_bayes import MultinomialNB            # Multinomial Naive Bayes classifier.
from sklearn.preprocessing import LabelEncoder           # Encode target labels with value between 0 and n_classes-1.
from tensorflow.keras.callbacks import EarlyStopping     # Stop training when a monitored metric has stopped improving.
from tensorflow.keras.callbacks import ReduceLROnPlateau # Reduce learning rate when a metric has stopped improving.
from tensorflow.keras.layers import Activation, BatchNormalization, Bidirectional, Concatenate, Conv1D, Dense, Dropout, Embedding, GlobalMaxPooling1D, LSTM, MaxPooling1D, ReLU # Keras layers API.
from tensorflow.keras.models import Model, Sequential # Model architecture.
from tensorflow.keras.optimizers import Adam         # Adam optimizer.
from tensorflow.keras.preprocessing.sequence import pad_sequences # Transforms a list of sequences into a 2D Numpy array.
from tensorflow.keras.preprocessing.text import Tokenizer         # Vectorize a text corpus.
from tensorflow.keras.utils import plot_model                     # Visualize the model and save it.
from tensorflow.keras.utils import to_categorical                 # Converts a class vector (integers) to binary class matrix.


try:
    nltk.data.find('wordnet.zip')
except LookupError:  # Raised by nltk.data.find when the resource is missing.
    nltk.download('wordnet', download_dir='/kaggle/working/')
    command = 'unzip /kaggle/working/corpora/wordnet.zip -d /kaggle/working/corpora'
    subprocess.run(command.split())
    nltk.data.path.append('/kaggle/working/')
    

from nltk.corpus import wordnet

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')




[nltk_data] Downloading package wordnet to /kaggle/working/...
Archive:  /kaggle/working/corpora/wordnet.zip
   creating: /kaggle/working/corpora/wordnet/
  inflating: /kaggle/working/corpora/wordnet/lexnames  
  inflating: /kaggle/working/corpora/wordnet/data.verb  
  inflating: /kaggle/working/corpora/wordnet/index.adv  
  inflating: /kaggle/working/corpora/wordnet/adv.exc  
  inflating: /kaggle/working/corpora/wordnet/index.verb  
  inflating: /kaggle/working/corpora/wordnet/cntlist.rev  
  inflating: /kaggle/working/corpora/wordnet/data.adj  
  inflating: /kaggle/working/corpora/wordnet/index.adj  
  inflating: /kaggle/working/corpora/wordnet/LICENSE  
  inflating: /kaggle/working/corpora/wordnet/citation.bib  
  inflating: /kaggle/working/corpora/wordnet/noun.exc  
  inflating: /kaggle/working/corpora/wordnet/verb.exc  
  inflating: /kaggle/working/corpora/wordnet/README  
  inflating: /kaggle/working/corpora/wordnet/index.sense  
  inflating: /kaggle/working/corpora/wordnet/data.

True

---

## #2 Select And Engineer Features


### 2.1 Preprocess The Data Using NLTK


In [20]:
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

In [21]:
# Cleaner class is responsible for cleaning the documents using its pipeline function.


class Cleaner:
    def __init__(self):
        pass

    # 1. Make a function that makes all text lowercase.

    def make_lowercase(self, input_string):
        input_string = input_string.split()
        input_string = [y.lower() for y in input_string]
        return " ".join(input_string)

    # 2. Make a function that removes all stopwords.

    def remove_stopwords(self, input_string):
        input_string = [i for i in str(input_string).split() if i not in stop_words]
        return " ".join(input_string)

    # 3. Make a function that removes all numbers.

    def remove_numbers(self, input_string):
        input_string = "".join([i for i in input_string if not i.isdigit()])
        return input_string

    # 4. Make a function that removes all punctuation.

    def remove_punctuation(self, input_string):
        input_string = re.sub(
            "[%s]" % re.escape(r"""!"#$%&'()*+,،-./:;<=>؟?@[\]^_`{|}~"""),
            " ",
            input_string,
        )
        input_string = input_string.replace(
            "؛",
            "",
        )
        input_string = re.sub(r"\s+", " ", input_string)
        input_string = " ".join(input_string.split())
        return input_string.strip()

    # 5. Make a function that removes all urls.

    def remove_urls(self, input_string):
        url_pattern = re.compile(r"https?://\S+|www\.\S+")
        return url_pattern.sub(r"", input_string)

    # 6. Make a function for lemmatization.

    def lemmatization(self, input_string):
        lemmatizer = WordNetLemmatizer()
        input_string = input_string.split()
        input_string = [lemmatizer.lemmatize(y) for y in input_string]
        return " ".join(input_string)

    # 7. Make a function that breaks words into their stem words.

    def stem_words(self, input_string):
        porter = PorterStemmer()
        words = word_tokenize(input_string)
        valid_words = []

        for word in words:
            stemmed_word = porter.stem(word)
            valid_words.append(stemmed_word)

        input_string = " ".join(valid_words)

        return input_string

    # 8. Make a pipeline function that applies all the text processing functions you just built.

    def pipeline(self, input_string):
        input_string = self.make_lowercase(input_string)  # 1.
        input_string = self.remove_stopwords(input_string)  # 2.
        input_string = self.remove_numbers(input_string)  # 3.
        # Remove URLs (5.) before punctuation (4.): stripping ":", "/" and "."
        # first would break URLs apart so the URL pattern could no longer match.
        input_string = self.remove_urls(input_string)  # 5.
        input_string = self.remove_punctuation(input_string)  # 4.
        input_string = self.lemmatization(input_string)  # 6.
        # input_string = self.stem_words(input_string)  # 7. (Stemming is disabled.)
        return input_string
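
The ordering of the cleaning steps matters. The following is a minimal, standard-library-only sketch (not the `Cleaner` class itself) that illustrates why URL removal has to run while the URL is still intact. `DEMO_STOP_WORDS` and `demo_pipeline` are hypothetical names introduced for this illustration, and the tiny stopword set stands in for NLTK's English list. The step order in the sketch was chosen for the demonstration and is not claimed to match the notebook's.

```python
import re
import string

# Hypothetical, tiny stopword list used only for this illustration;
# the notebook uses NLTK's English stopword list instead.
DEMO_STOP_WORDS = {"i", "a", "the", "and", "to", "of", "at", "my"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_PATTERN = re.compile("[%s]" % re.escape(string.punctuation))


def demo_pipeline(text):
    """Lowercase, strip URLs, punctuation, digits, and stopwords (in that order)."""
    text = text.lower()
    text = URL_PATTERN.sub("", text)      # URLs first, while "://" and "." are intact.
    text = PUNCT_PATTERN.sub(" ", text)   # Punctuation becomes a space.
    text = "".join(ch for ch in text if not ch.isdigit())
    words = [w for w in text.split() if w not in DEMO_STOP_WORDS]
    return " ".join(words)


cleaned = demo_pipeline("I visited https://example.com at 9am, and felt GREAT!")

# If punctuation is stripped first, the URL pattern no longer matches
# and fragments of the URL survive in the text.
punct_first = URL_PATTERN.sub("", PUNCT_PATTERN.sub(" ", "see https://example.com"))

print(cleaned)
print(punct_first)
```

In the second call the fragments `https`, `example`, and `com` remain in the output, because the `://` that the pattern relies on has already been replaced by spaces.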

In [22]:
# Clean/Normalize the documents.

cleaner = Cleaner()

combined_df["document_clean"] = combined_df["document"].apply(cleaner.pipeline)

In [23]:
# Print the original documents stored under index label 0.
# The label 0 is not unique in combined_df, so several rows are returned.

print("ORIGINAL DOCUMENT:\n\n")
combined_df["document"][0]

ORIGINAL DOCUMENT:




0                                                  i didnt feel humiliated
0    im feeling quite sad and sorry for myself but ill snap out of it soon
0              im feeling rather rotten so im not very ambitious right now
Name: document, dtype: object
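
The output above contains three rows because index label `0` occurs more than once. A plausible cause, stated here as an assumption since the construction of `combined_df` is not shown in this section, is that several DataFrames were concatenated without resetting the index. The small sketch below uses made-up frames (`part_a`, `part_b`) to show the difference between label-based and position-based access.

```python
import pandas as pd

# Two hypothetical frames, each with its own 0..n-1 index.
part_a = pd.DataFrame({"document": ["a0", "a1"]})
part_b = pd.DataFrame({"document": ["b0", "b1"]})

# Concatenating without ignore_index keeps the original labels: 0, 1, 0, 1.
stacked = pd.concat([part_a, part_b])

by_label = stacked["document"][0]          # Label lookup: every row labelled 0.
by_position = stacked["document"].iloc[0]  # Position lookup: the first row only.

# ignore_index=True produces a fresh, unique 0..n-1 index.
reindexed = pd.concat([part_a, part_b], ignore_index=True)

print(by_label)
print(by_position)
print(list(reindexed.index))
```

Using `.iloc[0]` (or a unique index) is one way to get exactly one document back.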

In [24]:
# Print the cleaned documents stored under index label 0.
# The label 0 is not unique in combined_df, so several rows are returned.

print("\nCLEANED DOCUMENT:\n\n")
combined_df["document_clean"][0]


CLEANED DOCUMENT:




0                          didnt feel humiliated
0       im feeling quite sad sorry ill snap soon
0    im feeling rather rotten im ambitious right
Name: document_clean, dtype: object

In [25]:
# Print the combined_df Pandas dataframe with the clean documents.

combined_df

| | document | emotion | document_length | document_clean |
|---|---|---|---|---|
| 0 | i didnt feel humiliated | sadness | 23 | didnt feel humiliated |
| 1 | i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake | sadness | 108 | go feeling hopeless damned hopeful around someone care awake |
| 2 | im grabbing a minute to post i feel greedy wrong | anger | 48 | im grabbing minute post feel greedy wrong |
| 3 | i am ever feeling nostalgic about the fireplace i will know that it is still on the property | love | 92 | ever feeling nostalgic fireplace know still property |
| 4 | i am feeling grouchy | anger | 20 | feeling grouchy |
| ... | ... | ... | ... | ... |
| 1995 | i just keep feeling like someone is being unkind to me and doing me wrong and then all i can think of doing is to get back at them and the people they are close to | anger | 163 | keep feeling like someone unkind wrong think get back people close |
| 1996 | im feeling a little cranky negative after this doctors appointment | anger | 66 | im feeling little cranky negative doctor appointment |
| 1997 | i feel that i am useful to my people and that gives me a great feeling of achievement | joy | 85 | feel useful people give great feeling achievement |
| 1998 | im feeling more comfortable with derby i feel as though i can start to step out my shell | joy | 88 | im feeling comfortable derby feel though start step shell |


### 2.2 Define X And y


In [26]:
# X is our feature (document).

X = combined_df["document_clean"]

X

0                                                    didnt feel humiliated
1             go feeling hopeless damned hopeful around someone care awake
2                                im grabbing minute post feel greedy wrong
3                     ever feeling nostalgic fireplace know still property
4                                                          feeling grouchy
                                       ...                                
1995    keep feeling like someone unkind wrong think get back people close
1996                  im feeling little cranky negative doctor appointment
1997                     feel useful people give great feeling achievement
1998             im feeling comfortable derby feel though start step shell
1999              feel weird meet w people text like dont talk face face w
Name: document_clean, Length: 19948, dtype: object

In [27]:
# y is our label: the emotion we want to predict.

y = combined_df["emotion"]

print("y.value_counts() =\n")
print(y.value_counts())

y.value_counts() =

emotion
joy         6739
sadness     5793
anger       2703
fear        2369
love        1630
surprise     714
Name: count, dtype: int64
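
The counts above show that the classes are far from balanced. As a quick worked calculation, the sketch below turns the printed counts into class shares; the numbers are copied from the output above, and the variable names are introduced only for this illustration.

```python
# Class counts copied from the y.value_counts() output above.
counts = {
    "joy": 6739,
    "sadness": 5793,
    "anger": 2703,
    "fear": 2369,
    "love": 1630,
    "surprise": 714,
}

total = sum(counts.values())
shares = {emotion: n / total for emotion, n in counts.items()}

# Ratio between the most and least frequent class.
imbalance_ratio = counts["joy"] / counts["surprise"]

for emotion, share in shares.items():
    print(f"{emotion:<9}{share:6.1%}")
print(f"joy / surprise = {imbalance_ratio:.1f}")
```

With `joy` roughly nine times as frequent as `surprise`, plain accuracy can look good even when the rare classes are predicted poorly, so per-class metrics are worth checking during evaluation.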


In [28]:
y

0       sadness
1       sadness
2         anger
3          love
4         anger
         ...   
1995      anger
1996      anger
1997        joy
1998        joy
1999       fear
Name: emotion, Length: 19948, dtype: object

### 2.3 Split Data (train_test_split)

- Train = 80%
- Validation = 10%
- Test = 10%


In [29]:
# Train test split twice to get the train, validation, and test data.

X_train, X_remain, y_train, y_remain = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_valid, X_test, y_valid, y_test = train_test_split(
    X_remain, y_remain, test_size=0.5, random_state=42
)


print("X_train.shape  = ", X_train.shape)  # 1 Dimension.
print("X_valid.shape  = ", X_valid.shape)  # 1 Dimension.
print("X_test.shape   = ", X_test.shape)  # 1 Dimension.

print()

print("y_train.shape  = ", y_train.shape)  # 1 Dimension.
print("y_valid.shape  = ", y_valid.shape)  # 1 Dimension.
print("y_test.shape   = ", y_test.shape)  # 1 Dimension.

X_train.shape  =  (15958,)
X_valid.shape  =  (1995,)
X_test.shape   =  (1995,)

y_train.shape  =  (15958,)
y_valid.shape  =  (1995,)
y_test.shape   =  (1995,)
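
The 80/10/10 split comes from two calls: the first holds out 20% of the data, the second halves that remainder. Since the labels are imbalanced, an optional variant is to pass `stratify=` so that every split keeps roughly the same class proportions. The sketch below demonstrates this on synthetic labels (`X_demo` and `y_demo` are made-up stand-ins, not the notebook's data), and it is an alternative to consider rather than what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the emotion column (hypothetical data).
rng = np.random.default_rng(42)
y_demo = rng.choice(["joy", "sadness", "surprise"], size=1000, p=[0.6, 0.35, 0.05])
X_demo = np.arange(len(y_demo))

# First split: 80% train, 20% held out. stratify keeps class proportions.
X_tr, X_rem, y_tr, y_rem = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Second split: halve the held-out part into validation and test (10% each).
X_va, X_te, y_va, y_te = train_test_split(
    X_rem, y_rem, test_size=0.5, random_state=42, stratify=y_rem
)


def share(labels, name):
    """Fraction of entries in labels equal to name."""
    return float(np.mean(labels == name))


print(len(X_tr), len(X_va), len(X_te))
print(share(y_demo, "surprise"), share(y_tr, "surprise"))
```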


### 2.4 Label Encoder


In [30]:
# Create an instance of the LabelEncoder class.

label = LabelEncoder()

# Print the original y_train values.

print(y_train)
print("*" * 100)

# Fit the LabelEncoder on the y_train data and transform it.

y_train = label.fit_transform(y_train)

# Print the transformed y_train values.

print(y_train)
print("*" * 100)

# Transform the y_test and y_valid data using the fitted LabelEncoder.

y_test = label.transform(y_test)
y_valid = label.transform(y_valid)

# Print the classes that the LabelEncoder has learned.

print("label.classes_ =", label.classes_)

14962     sadness
76            joy
1707      sadness
254         anger
940           joy
           ...   
11297        fear
11980       anger
5391          joy
860       sadness
15825    surprise
Name: emotion, Length: 15958, dtype: object
****************************************************************************************************
[4 2 4 ... 2 4 5]
****************************************************************************************************
label.classes_ = ['anger' 'fear' 'joy' 'love' 'sadness' 'surprise']
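
`LabelEncoder` sorts the class names and assigns each one its position in that sorted list, which is why `sadness` becomes 4 and `joy` becomes 2 above. The short sketch below reproduces the mapping on a hand-written list (`demo_labels` is made up for the example) and shows how `inverse_transform` maps integers back to names, which is useful when reading model predictions later.

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written labels for illustration; the class names match the dataset's six emotions.
demo_labels = ["sadness", "joy", "anger", "love", "fear", "surprise", "joy"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(demo_labels)   # Classes are sorted alphabetically.
decoded = encoder.inverse_transform(encoded)   # Integers back to class names.

print(list(encoder.classes_))
print(encoded.tolist())
print(list(decoded))
```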


In [31]:
# Print the label-encoded y_train values.

print(y_train)
print("*" * 100)

# Convert y_train into a binary matrix representation.

y_train = to_categorical(y_train)

# Print the transformed y_train values.

print(y_train)
print("*" * 100)

# Convert y_test and y_valid into a binary matrix representation.

y_test = to_categorical(y_test)
y_valid = to_categorical(y_valid)

# Print the classes that the LabelEncoder has learned.

print("label.classes_ =", label.classes_)

[4 2 4 ... 2 4 5]
****************************************************************************************************
[[0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 ...
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1.]]
****************************************************************************************************
label.classes_ = ['anger' 'fear' 'joy' 'love' 'sadness' 'surprise']
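
Each row of the binary matrix has a single 1 in the column given by the encoded label. The sketch below builds the same kind of matrix with plain NumPy so the mapping is easy to inspect; it is an equivalent illustration, not the Keras implementation, and `encoded` and `num_classes` are example values.

```python
import numpy as np

# Example encoded labels (4 = sadness, 2 = joy, 5 = surprise in the mapping above).
encoded = np.array([4, 2, 4, 5])
num_classes = 6

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[encoded]

# argmax along each row recovers the integer label.
recovered = one_hot.argmax(axis=1)

print(one_hot)
print(recovered)
```

When `to_categorical` is called without `num_classes`, the width is inferred from the largest label present, so passing `num_classes` explicitly is a safe habit if a split might lack the highest class.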


### 2.5 Tokenize


In [32]:
# Create an instance of the Tokenizer class.

tokenizer = Tokenizer()

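# Fit the Tokenizer on the combined X_train and X_test data.
# Note: this adds test-set words to the vocabulary and leaves X_valid out;
# fitting on X_train alone would keep the held-out splits unseen.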

tokenizer.fit_on_texts(pd.concat([X_train, X_test], axis=0))

In [33]:
# Convert X_train, X_test, and X_valid into sequences of integers
# so that they can be input in the model.

sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_valid = tokenizer.texts_to_sequences(X_valid)
sequences_test = tokenizer.texts_to_sequences(X_test)
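
To make the fit-then-convert steps concrete, here is a simplified pure-Python sketch of the idea: words are numbered from 1 by descending frequency, and texts are then mapped to lists of those numbers. `fit_word_index`, `to_sequences`, and `oov_index` are hypothetical helpers written for this illustration; Keras' `Tokenizer` additionally filters punctuation and lowercases, and handles unknown words through its own `oov_token` argument rather than as shown here.

```python
from collections import Counter


def fit_word_index(texts):
    """Number words from 1 by descending frequency; 0 stays free for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, oov_index=None):
    """Map each text to word ids; unknown words are dropped unless oov_index is set."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.split():
            if word in word_index:
                seq.append(word_index[word])
            elif oov_index is not None:
                seq.append(oov_index)
        sequences.append(seq)
    return sequences


train_texts = ["feel sad", "feel happy today"]
word_index = fit_word_index(train_texts)

# "grumpy" was never seen during fitting.
seq_dropped = to_sequences(["feel grumpy today"], word_index)
seq_oov = to_sequences(["feel grumpy today"], word_index, oov_index=len(word_index) + 1)

print(word_index)
print(seq_dropped)
print(seq_oov)
```

The example shows what happens to a split that was not part of the fit: its unseen words either vanish from the sequence or collapse into one shared placeholder id.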

In [34]:
# Pad the sequences (or truncate them from the front) so that they all have length 256.

X_train = pad_sequences(sequences_train, maxlen=256, truncating="pre")
X_valid = pad_sequences(sequences_valid, maxlen=256, truncating="pre")
X_test = pad_sequences(sequences_test, maxlen=256, truncating="pre")
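
With `truncating="pre"` and the default `padding="pre"`, short sequences get zeros on the left and long sequences lose tokens from the front. The NumPy sketch below mimics that behavior on two toy sequences; `pad_pre` is a hypothetical helper for illustration and not the Keras function.

```python
import numpy as np


def pad_pre(sequences, maxlen):
    """Left-pad with zeros and drop leading tokens from sequences longer than maxlen."""
    padded = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]                  # Keep only the last maxlen tokens.
        if kept:                              # Empty sequences stay all zeros.
            padded[row, -len(kept):] = kept   # Right-align, zeros fill the left.
    return padded


demo = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(demo)
```

The first row is padded to `[0, 0, 5, 8]`, and the second keeps only its last four tokens, `[3, 4, 5, 6]`.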

In [35]:
# Get the size of the vocabulary.

vocabulary_size = len(tokenizer.index_word) + 1
print("vocabulary_size =", vocabulary_size)

vocabulary_size = 14326
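
The `+ 1` accounts for index 0, which the tokenizer never assigns to a word and which the padding step uses as filler. Assuming `vocabulary_size` is later used as the number of rows of an embedding table, the table must cover ids 0 through the largest word id. The sketch below checks this on a toy vocabulary; `word_index`, `embedding_dim`, and `embedding_matrix` are example values, not the notebook's.

```python
import numpy as np

# Toy vocabulary: word ids start at 1, id 0 is reserved for padding.
word_index = {"feel": 1, "sad": 2, "happy": 3}
vocabulary_size = len(word_index) + 1

# One row per possible id, including the padding id 0.
embedding_dim = 2
embedding_matrix = np.zeros((vocabulary_size, embedding_dim))

# A padded sequence can contain 0 as well as the largest word id.
padded_sequence = np.array([0, 0, 1, 3])
vectors = embedding_matrix[padded_sequence]

print(vocabulary_size, vectors.shape)
```

Without the extra row, looking up the largest word id would fall outside the table.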
