## Fine-tuning a transformer model for sentiment classification

In [17]:
%reload_ext watermark
%watermark -a "postcristiano.pt"

Author: postcristiano.pt



In [3]:
!pip install -q spacy

In [4]:
!pip install -q transformers

In [16]:
# Project dependencies
import os
import math
import nltk
import spacy
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import transformers
from tokenizers import BertWordPieceTokenizer
from tqdm import tqdm
from nltk.corpus import stopwords
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from tensorflow import keras
from keras.utils import to_categorical
from tensorflow.keras import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer # OLD | from keras.preprocessing.text import Tokenizer
from keras.metrics import Precision, Recall, AUC
from tensorflow.keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from keras.callbacks import EarlyStopping, LearningRateScheduler, CallbackList, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam # OLD | from tensorflow.keras.optimizers.experimental import Adam


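# Reduce TensorFlow log output (3 = show only FATAL messages).
# Note: TF_CPP_MIN_LOG_LEVEL generally needs to be set before TensorFlow is imported to affect the C++ logs.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # FATAL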
tf.get_logger().setLevel('ERROR')

# Additional configuration to avoid general logging warnings
import logging
logging.getLogger('tensorflow').disabled = True

# Ignore all Python warnings
import warnings
warnings.filterwarnings('ignore')

## Load Text Data

In [29]:
# Load train data
train_data_pt = pd.read_csv('samples/train_data.txt', header = None, delimiter = ';')

In [30]:
# Load test data
test_data_pt = pd.read_csv('samples/test_data.txt', header = None, delimiter = ';')

In [31]:
# Adjust columns
train_data_pt = train_data_pt.rename(columns = {0: 'raw_text', 1:'sentiment'})
test_data_pt = test_data_pt.rename(columns = {0: 'raw_text', 1:'sentiment'})

In [32]:
# Shape
train_data_pt.shape

(16000, 2)

In [28]:
# Shape
test_data_pt.shape

(2000, 2)

In [33]:
# Training data sample
train_data_pt.head()

| | raw_text | sentiment |
|---|---|---|
| 0 | i am feeling completely overwhelmed i have two... | fear |
| 1 | i have the feeling she was amused and delighted | joy |
| 2 | i was able to help chai lifeline with your sup... | joy |
| 3 | i already feel like i fucked up though because... | anger |
| 4 | i still love my so and wish the best for him i... | sadness |


In [35]:
# Count sentiment labels in the training data
train_data_pt['sentiment'].value_counts()

sentiment
joy         5362
sadness     4666
anger       2159
fear        1937
love        1304
surprise     572
Name: count, dtype: int64

In [36]:
# Count sentiment labels in the test data
test_data_pt['sentiment'].value_counts()

sentiment
joy         695
sadness     581
anger       275
fear        224
love        159
surprise     66
Name: count, dtype: int64

## Pre-processing of Text Data with SpaCy

In [37]:
!python -m spacy download en_core_web_md -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [38]:
# Load the spaCy English pipeline (en_core_web_md)
pt_nlp = spacy.load('en_core_web_md')

In [39]:
# Define 'pt_preprocessing_text', which takes a text string as its parameter
def pt_preprocessing_text(text):

    # Process the text with the spaCy pipeline
    doc = pt_nlp(text)

    # Build a list of lemmas from the tokens, lowercased and stripped of surrounding whitespace,
    # skipping tokens that are stop words
    tokens = [token.lemma_.lower().strip() for token in doc if not token.is_stop]

    # Return the processed tokens as a single space-separated string
    return ' '.join(tokens)

In [42]:
# Apply the function to the training data
train_data_pt['processed_text'] = train_data_pt['raw_text'].apply(pt_preprocessing_text)

In [43]:
# Apply the function to the test data
test_data_pt['processed_text'] = test_data_pt['raw_text'].apply(pt_preprocessing_text)

In [44]:
# Test data sample
test_data_pt.head()

| | raw_text | sentiment | processed_text |
|---|---|---|---|
| 0 | i feel like my only role now would be to tear ... | sadness | feel like role tear sail pessimism discontent |
| 1 | i feel just bcoz a fight we get mad to each ot... | anger | feel bcoz fight mad n u wanna publicity n let ... |
| 2 | i feel like reds and purples are just so rich ... | joy | feel like red purple rich kind perfect |
| 3 | im not sure the feeling of loss will ever go a... | sadness | m sure feeling loss away dull sweet feeling no... |
| 4 | i feel like ive gotten to know many of you thr... | joy | feel like ve get know comment email m apprecia... |


## Model 1 - Fully Connected Neural Network Architecture

### Step 1: Vectorization with TF-IDF

In [45]:
# Create vectorizer
pt_tfidf = TfidfVectorizer(max_df = 0.95, min_df = 2, stop_words = 'english')

The cell above creates an instance of TfidfVectorizer from the scikit-learn library, a tool that converts a collection of raw documents into a TF-IDF (Term Frequency-Inverse Document Frequency) feature matrix. TF-IDF is a statistical measure of how important a word is to a document within a collection of documents; it is widely used in natural language processing and information retrieval.
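For reference, the weighting can be written out explicitly. The formulas below assume scikit-learn's documented defaults for TfidfVectorizer (smooth_idf=True, norm='l2', raw term counts as TF), which is what the call above uses since it does not override them. Here $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $v_d$ the vector of TF-IDF values for document $d$:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t), \qquad
v_d \leftarrow \frac{v_d}{\lVert v_d \rVert_2}
$$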

**Parameter max_df=0.95:** This parameter defines the maximum document frequency limit for the terms that will be considered. Here, it is set to 0.95, which means that words that appear in more than 95% of documents will be ignored. This helps eliminate common words that don't contribute much to the meaning of the text.

**Parameter min_df=2:** This parameter sets the minimum document frequency for a term. Here, terms that appear in fewer than 2 documents are ignored. This filters out rare terms that occur in only a single sample and are therefore of little use for the overall analysis.

**Parameter stop_words='english':** This parameter instructs the vectorizer to remove the words on scikit-learn's built-in English stop-word list. Stop words are common words (such as "and", "the", "in") that are often filtered out in natural language processing because they are too frequent to carry meaningful information. The spaCy preprocessing step has already removed spaCy's own stop words, so here this acts as a second filter based on a different word list.
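To make the effect of max_df and min_df concrete, here is a minimal standard-library sketch of the document-frequency filter. It is not scikit-learn's implementation; the function name filter_vocabulary and the toy documents are made up for illustration, and only the thresholds mirror the notebook.

```python
from collections import Counter

def filter_vocabulary(docs, max_df=0.95, min_df=2):
    """Keep terms whose document frequency is within [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    # Count in how many documents each term appears (not how often overall)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    max_count = max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

# Hypothetical toy corpus
toy_docs = [
    "feel happy today",
    "feel sad today",
    "feel angry",
    "feel calm today",
]
vocab = filter_vocabulary(toy_docs, max_df=0.95, min_df=2)
print(vocab)  # ['today']
```

In this toy corpus "feel" occurs in all 4 documents (more than 95%) and is dropped as too common, the words that occur once are dropped as too rare, and only "today" survives.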

In [46]:
# Apply the vectorizer
# Fit on the training data only; the test data is transformed with the vocabulary and IDF weights learned from training.
train_data_tfidf = pt_tfidf.fit_transform(train_data_pt['processed_text'])

test_data_tfidf = pt_tfidf.transform(test_data_pt['processed_text'])

In [47]:
train_data_tfidf.shape

(16000, 5587)

In [48]:
type(train_data_tfidf)

scipy.sparse._csr.csr_matrix

In [50]:
# Convert the sparse TF-IDF matrices to dense arrays
X_train_array = train_data_tfidf.toarray()
X_test_array = test_data_tfidf.toarray()

### Step 2: Data Preparation

We now need to convert the target variable to a numeric representation. We will use label encoding.

In [51]:
# Create Label Encoder
pt_le = LabelEncoder()

In [52]:
# Fit and transform the target variable in train
y_train_le = pt_le.fit_transform(train_data_pt['sentiment'])

In [53]:
# Transform the target variable in test
y_test_le = pt_le.transform(test_data_pt['sentiment'])
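As a rough illustration of what label encoding produces, the sketch below mimics the behaviour in plain Python: unique labels are sorted and numbered in that order. The variable names are illustrative and the short label list is a made-up sample that uses the six sentiments of this dataset.

```python
# Made-up sample using the six sentiment labels of the dataset
labels = ["fear", "joy", "joy", "anger", "sadness", "love", "surprise"]

# Sort the unique labels, then number them in that order
classes = sorted(set(labels))
label_to_id = {label: idx for idx, label in enumerate(classes)}

encoded = [label_to_id[label] for label in labels]
print(label_to_id)
# {'anger': 0, 'fear': 1, 'joy': 2, 'love': 3, 'sadness': 4, 'surprise': 5}
print(encoded)
# [1, 2, 2, 0, 4, 3, 5]
```

With the fitted encoder itself, pt_le.classes_ holds the sorted labels and pt_le.inverse_transform maps integer codes back to sentiment names.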

**We will automatically handle class imbalance.**

**Techniques for balancing the data sample**
- Oversampling: creates synthetic data, increasing the number of records in the minority classes
- Undersampling: removes records from the majority classes, which results in data loss
- Class weighting: gives more weight during training to classes with smaller sample sizes

In [54]:
# Class weights
classes_weights = compute_class_weight('balanced', classes = np.unique(y_train_le), y = y_train_le)

In [55]:
type(classes_weights)

numpy.ndarray

**compute_class_weight**: This is a scikit-learn function that calculates weights for classes. These weights can be used in classification models to give more importance to classes that are underrepresented in the dataset.

**'balanced'**: This parameter indicates that the class weights must be calculated so that they balance the dataset. Each weight is inversely proportional to the frequency of its class: more frequent classes receive a lower weight, while less frequent classes receive a higher weight.

**classes = np.unique(y_train_le)**: Here, np.unique(y_train_le) finds all the unique classes in the training dataset. The classes parameter tells the compute_class_weight function what these unique classes are.

**y = y_train_le**: This is the label vector of the training dataset. The function will use these labels to calculate the frequency of each class.

The result, stored in classes_weights, is an array with one weight per class, in the same order as np.unique(y_train_le). These weights can be used in classification models (such as a decision tree, a logistic regression model, SVM, etc.) to compensate for imbalance between classes.
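The 'balanced' heuristic documented by scikit-learn is n_samples / (n_classes * count_of_class). The sketch below applies that formula by hand to the class counts printed earlier for the training data; it uses a plain dictionary instead of the NumPy array returned by compute_class_weight, and the variable names are illustrative.

```python
# Class counts taken from the value_counts() output of the training data
counts = {"joy": 5362, "sadness": 4666, "anger": 2159,
          "fear": 1937, "love": 1304, "surprise": 572}

n_samples = sum(counts.values())   # 16000
n_classes = len(counts)            # 6

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

# Print from the most frequent class (lowest weight) to the rarest (highest weight)
for label, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{label:<9}{w:.3f}")
```

The rarest class, surprise, ends up with a weight of roughly 4.66, while joy gets roughly 0.50. One common way to use such weights with a Keras model is to pass a dictionary such as dict(enumerate(classes_weights)) as the class_weight argument of fit; that step is an assumption here, since it is not shown in this part of the notebook.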

In [57]:
# Split the training data into training and validation sets (the test set stays held out)
X_train, X_val, y_train, y_val = train_test_split(X_train_array,
                                                  y_train_le,
                                                  test_size = 0.2,
                                                  random_state = 42,
                                                  stratify = y_train_le)
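The stratify argument keeps the class proportions roughly equal in the training and validation parts. The sketch below shows the idea with the standard library only; it is not scikit-learn's algorithm, and the function stratified_split and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with roughly the same class mix in both."""
    rng = random.Random(seed)
    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Hypothetical imbalanced label list
labels = ["joy"] * 50 + ["sadness"] * 30 + ["surprise"] * 20
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 80 20
```

Because 20% is drawn from each class separately, even the smallest class is represented in the validation set in proportion to its size.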

In [58]:
# One-hot encode the target variable
y_train_encoded = to_categorical(y_train)
y_test_encoded = to_categorical(y_test_le)
y_val_encoded = to_categorical(y_val)

In [61]:
# Shape 
y_train_encoded.shape, y_test_encoded.shape, y_val_encoded.shape

((12800, 6), (2000, 6), (3200, 6))
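The shapes above follow from the 80/20 split of the 16,000 training rows (12,800 and 3,200) and from the six sentiment classes, which become six one-hot columns. As a minimal sketch of the encoding itself, written in plain Python rather than with to_categorical (the helper one_hot is hypothetical):

```python
def one_hot(class_ids, num_classes):
    """Return one row per label with a single 1.0 at the label's index."""
    rows = []
    for class_id in class_ids:
        row = [0.0] * num_classes
        row[class_id] = 1.0
        rows.append(row)
    return rows

# Integer codes as produced by the label encoder (e.g. 1 = fear, 2 = joy)
encoded = one_hot([1, 2, 0, 5], num_classes=6)
for row in encoded:
    print(row)
```

Each row contains exactly one 1.0, at the position of the integer class code, which is the form a softmax output layer trained with categorical cross-entropy expects.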

