<a href="https://www.kaggle.com/code/malindaratnaduhita/sentiment-analysis-using-transformers-bert?scriptVersionId=190808043" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Sentiment Analysis in Python
This notebook compares two techniques:
1. Naive Bayes
2. BERT

# Read Data

In [None]:
#General purpose package
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras

In [None]:
anies_data = pd.read_csv('/kaggle/input/indonesia-presidential-candidates-dataset-2024/Indonesia Presidential Candidates Dataset, 2024/labeled data/Anies Baswedan.csv')
prabowo_data = pd.read_csv('/kaggle/input/indonesia-presidential-candidates-dataset-2024/Indonesia Presidential Candidates Dataset, 2024/labeled data/Prabowo Subianto.csv')
ganjar_data = pd.read_csv('/kaggle/input/indonesia-presidential-candidates-dataset-2024/Indonesia Presidential Candidates Dataset, 2024/labeled data/Ganjar Pranowo.csv')

In [None]:
anies_data.info()

In [None]:
prabowo_data.info()

In [None]:
ganjar_data.info()

We combine the three candidate datasets into a single DataFrame, tagging each row with its candidate.

In [None]:
df = pd.concat([
    anies_data.assign(Candidate='Anies Baswedan'),
    prabowo_data.assign(Candidate='Prabowo Subianto'),
    ganjar_data.assign(Candidate='Ganjar Pranowo')
])

In [None]:
df.head()

We will use only three columns: Tweet Count, Text, and label. We will also drop rows that contain missing values.

In [None]:
df = df.loc[:, [' Tweet Count', 'Text', 'label']]
df = df.dropna()
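
The selection above uses the column name `' Tweet Count'`, with a leading space. If that space comes from the CSV header, one option is to normalize the headers before selecting. The following is a minimal sketch on a made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical two-row frame that mimics a header with a stray leading space.
toy = pd.DataFrame({
    ' Tweet Count': [1, 2],
    'Text': ['halo', None],
    'label': ['Positive', 'Negative'],
})

# Strip surrounding whitespace from every column name.
toy.columns = toy.columns.str.strip()

# Keep the three columns of interest and drop rows with missing values.
toy = toy.loc[:, ['Tweet Count', 'Text', 'label']].dropna()
print(toy)
```

The row with a missing `Text` value is dropped, so one row remains.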

In [None]:
df.info()

# Data processing

In [None]:
#Data processing
import re, string
import nltk

#Remove punctuations, links, mentions and \r\n new line characters
def strip_all_entities(text): 
    text = text.replace('\r', '').replace('\n', ' ').lower() #remove \r, turn \n into a space, and lowercase
    text = re.sub(r"(?:\@|https?\://)\S+", "", text) #remove links and mentions
    text = re.sub(r'[^\x00-\x7f]',r'', text) #remove non-ASCII characters such as '\x9a\x91\x97\x9a\x97'
    banned_list= string.punctuation + 'Ã'+'±'+'ã'+'¼'+'â'+'»'+'§'
    table = str.maketrans('', '', banned_list)
    text = text.translate(table)
    return text

#clean hashtags at the end of the sentence, and keep those in the middle of the sentence by removing just the # symbol
def clean_hashtags(tweet):
    # Raw string: in a normal string literal '\b' is a backspace character, not a word boundary
    new_tweet = " ".join(word.strip() for word in re.split(r'#(?!(?:hashtag)\b)[\w-]+(?=(?:\s+#[\w-]+)*\s*$)', tweet)) #remove last hashtags
    new_tweet2 = " ".join(word.strip() for word in re.split('#|_', new_tweet)) #remove hashtags symbol from words in the middle of the sentence
    return new_tweet2

#Filter special characters such as & and $ present in some words
def filter_chars(a):
    sent = []
    for word in a.split(' '):
        if '$' in word or '&' in word:
            sent.append('')
        else:
            sent.append(word)
    return ' '.join(sent)

#Remove multiple spaces
def remove_mult_spaces(text): 
    return re.sub(r"\s\s+", " ", text)
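
To see what these helpers do in combination, here is a condensed, self-contained sketch applied to a made-up tweet. `clean_tweet` is a hypothetical name, and it covers only the main steps (lowercasing, removing links and mentions, dropping non-ASCII characters and punctuation, collapsing whitespace), not the notebook's exact functions.

```python
import re
import string

def clean_tweet(text):
    """Condensed illustration of the cleaning steps (hypothetical helper)."""
    text = text.replace('\r', '').replace('\n', ' ').lower()
    text = re.sub(r"(?:@|https?://)\S+", "", text)   # drop mentions and links
    text = re.sub(r"[^\x00-\x7f]", "", text)         # drop non-ASCII characters
    text = text.translate(str.maketrans('', '', string.punctuation))  # drop punctuation
    return re.sub(r"\s+", " ", text).strip()         # collapse whitespace

# Made-up example tweet.
sample = "Debat capres seru!\n@someuser lihat https://example.com #Pemilu2024"
cleaned = clean_tweet(sample)
print(cleaned)
```

Note that `string.punctuation` already contains `#`, `_`, `$`, and `&`, so once punctuation has been stripped, later steps that look for those symbols have little left to match.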

In [None]:
text_new = []
for t in df.Text:
    text_new.append(remove_mult_spaces(filter_chars(clean_hashtags(strip_all_entities(t)))))
    
df['clean_text'] = text_new

In [None]:
df.head()

Now we will look at the target column 'label'.

In [None]:
df['label'].value_counts()

Using the map function, we convert the labels Positive and Negative to 1 and 0.

In [None]:
df['label'] = df['label'].map({'Positive':1, 'Negative':0})
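
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a quick check can confirm that every row was mapped. A minimal sketch on made-up labels (the `'Neutral'` value is an assumption for illustration and may not occur in the dataset):

```python
import pandas as pd

# Made-up labels; 'Neutral' is a hypothetical value not covered by the mapping.
labels = pd.Series(['Positive', 'Negative', 'Neutral'])
mapped = labels.map({'Positive': 1, 'Negative': 0})

# Values that did not match any dictionary key become NaN.
unmapped = labels[mapped.isna()].unique()
print(unmapped)
```

If `unmapped` is non-empty on the real data, those rows would need to be dropped or given their own class before training.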

In [None]:
df['label'].value_counts()

We can see that the two classes are imbalanced. To address this, we apply random oversampling, which duplicates minority-class tweets until both classes are the same size and should reduce the model's bias towards the majority class.

In [None]:
from sklearn import preprocessing
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

In [None]:
ros = RandomOverSampler()
# X must be 2-D for the sampler, so the text column is reshaped to a single feature; y stays 1-D
train_x, train_y = ros.fit_resample(np.array(df['clean_text']).reshape(-1, 1), np.array(df['label']))
train_os = pd.DataFrame({'clean_text': train_x.ravel(), 'label': train_y})

In [None]:
train_os['label'].value_counts()

In [None]:
X = train_os['clean_text'].values
y = train_os['label'].values

In [None]:
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
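
One thing to be aware of: because oversampling runs before the split here, duplicated minority tweets can end up in both the training and test sets, which may inflate test scores. A common alternative is to split first and oversample only the training portion. The sketch below illustrates that order with a hand-rolled NumPy stand-in for `RandomOverSampler` (`oversample_minority` is a hypothetical helper, and the data is made up).

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Hand-rolled random oversampling: resample every class up to the majority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw extra rows with replacement until this class reaches the target size.
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        idx_parts.append(np.concatenate([cls_idx, extra]))
    idx = rng.permutation(np.concatenate(idx_parts))
    return X[idx], y[idx]

# Made-up imbalanced data: 8 negatives, 2 positives.
X_toy = np.array([f"tweet {i}" for i in range(10)])
y_toy = np.array([0] * 8 + [1] * 2)

# Step 1: hold out test rows first (fixed indices here, for a deterministic example).
test_idx = np.array([0, 9])
train_idx = np.setdiff1d(np.arange(len(X_toy)), test_idx)

# Step 2: oversample the training portion only.
X_bal, y_bal = oversample_minority(X_toy[train_idx], y_toy[train_idx])
print(np.bincount(y_bal))
```

With this order, no held-out tweet can appear in the balanced training set, and the test set keeps the original class distribution.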

We one-hot encode the target so that it matches the two-unit softmax output and categorical cross-entropy loss used by the BERT model later on. We also keep a copy of the label-encoded targets, which the Naive Bayes baseline uses.

In [None]:
y_train_le = y_train.copy()
y_valid_le = y_valid.copy()
y_test_le = y_test.copy()

In [None]:
ohe = preprocessing.OneHotEncoder()
# Fit the encoder on the training labels only, then reuse it for the other splits
y_train = ohe.fit_transform(np.array(y_train).reshape(-1, 1)).toarray()
y_valid = ohe.transform(np.array(y_valid).reshape(-1, 1)).toarray()
y_test = ohe.transform(np.array(y_test).reshape(-1, 1)).toarray()
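
For integer labels 0 and 1, one-hot encoding amounts to picking rows of an identity matrix, and `argmax` reverses it. A small NumPy sketch with made-up labels, shown only to illustrate the shape of the encoded target:

```python
import numpy as np

# Made-up integer labels: 0 = Negative, 1 = Positive.
labels = np.array([0, 1, 1, 0])

# Row i of the 2x2 identity matrix is the one-hot vector for class i.
one_hot = np.eye(2)[labels]

# argmax along each row recovers the original integer label.
recovered = one_hot.argmax(axis=1)
print(one_hot)
print(recovered)
```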

In [None]:
print(f"TRAINING DATA: {X_train.shape[0]}\nVALIDATION DATA: {X_valid.shape[0]}\nTESTING DATA: {X_test.shape[0]}" )

# Baseline model: Naive Bayes

Before implementing BERT, we will define a simple Naive Bayes baseline model to classify the tweets.

First we need to tokenize the tweets using CountVectorizer.

In [None]:
#Naive Bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

#Metrics
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
clf = CountVectorizer()
X_train_cv =  clf.fit_transform(X_train)
X_test_cv = clf.transform(X_test)

Then we create the TF-IDF versions of the tokenized tweets.

In [None]:
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_cv)
X_train_tf = tf_transformer.transform(X_train_cv)
X_test_tf = tf_transformer.transform(X_test_cv)
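
With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfTransformer` weights each term by `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term, and then scales each row to unit length. The NumPy sketch below reproduces that calculation on a made-up count matrix; it is an illustration of the formula, not a replacement for the transformer.

```python
import numpy as np

# Made-up term-count matrix: 3 documents x 3 terms.
counts = np.array([[2, 1, 0],
                   [0, 1, 0],
                   [1, 1, 3]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                  # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1      # smoothed inverse document frequency

tfidf = counts * idf                                 # weight raw counts by IDF
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row
print(np.round(tfidf, 3))
```

The term that appears in every document gets the lowest weight, while the term found in only one document gets the highest.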

Now we can define the Naive Bayes Classifier model.

In [None]:
nb_clf = MultinomialNB()
nb_clf.fit(X_train_tf, y_train_le)
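
As a rough picture of what the classifier computes, multinomial Naive Bayes scores each class as the log prior plus the sum of feature values times the log of the smoothed per-class term probabilities, and predicts the highest-scoring class. The sketch below is a simplified NumPy version with Laplace smoothing on made-up counts; the function names are hypothetical and it omits most of what `MultinomialNB` handles.

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log term probabilities per class."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    term_totals = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_likelihood = np.log(term_totals / term_totals.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(X, classes, log_prior, log_likelihood):
    """Pick the class with the highest log posterior score."""
    scores = X @ log_likelihood.T + log_prior
    return classes[scores.argmax(axis=1)]

# Made-up counts: term 0 is typical of class 0, term 1 of class 1.
X_toy = np.array([[3, 0, 1],
                  [2, 0, 0],
                  [0, 2, 1],
                  [0, 3, 0]], dtype=float)
y_toy = np.array([0, 0, 1, 1])

params = fit_multinomial_nb(X_toy, y_toy)
pred = predict_multinomial_nb(np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]), *params)
print(pred)
```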

In [None]:
nb_pred = nb_clf.predict(X_test_tf)

In [None]:
print('\tClassification Report for Naive Bayes:\n\n',classification_report(y_test_le,nb_pred, target_names=['Negative', 'Positive']))

The baseline reaches 84% accuracy on the test set, meaning it correctly classifies 84% of the test tweets. This gives us a solid reference point for the BERT model.

In the next section we will perform the sentiment analysis using BERT.

# BERT Sentiment Analysis

We now define a custom tokenize function that calls the encode_plus method of the BERT tokenizer on each tweet and collects the resulting input IDs and attention masks.

In [None]:
#Transformers
from transformers import BertTokenizerFast
from transformers import TFBertModel

In [None]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

In [None]:
MAX_LEN=128

In [None]:
def tokenize(data, max_len=MAX_LEN):
    input_ids = []
    attention_masks = []
    for text in data:
        encoded = tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=max_len,     # use the argument rather than the global MAX_LEN
            truncation=True,        # cut longer tweets so every row has the same length
            padding='max_length',
            return_attention_mask=True
        )
        input_ids.append(encoded['input_ids'])
        attention_masks.append(encoded['attention_mask'])
    return np.array(input_ids), np.array(attention_masks)
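
The two arrays returned by `tokenize` have one fixed-length row per tweet: the token IDs, padded up to `max_len`, and an attention mask that is 1 for real tokens and 0 for padding. The toy function below mimics only that padding and masking step in plain Python. `pad_and_mask`, the token IDs, and `pad_id=0` are assumptions for illustration; unlike the real tokenizer, it simply cuts long inputs without preserving the closing special token.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad a list of token IDs and build the matching attention mask."""
    ids = token_ids[:max_len]
    mask = [1] * len(ids)              # 1 marks a real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

# Short input: padded up to max_len.
ids, mask = pad_and_mask([101, 7592, 2088, 102], max_len=6)
print(ids, mask)

# Long input: cut down to max_len, mask is all ones.
long_ids, long_mask = pad_and_mask(list(range(1, 10)), max_len=6)
print(long_ids, long_mask)
```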

In [None]:
train_input_ids, train_attention_masks = tokenize(X_train, MAX_LEN)
val_input_ids, val_attention_masks = tokenize(X_valid, MAX_LEN)
test_input_ids, test_attention_masks = tokenize(X_test, MAX_LEN)

In [None]:
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

In [None]:
def create_model(bert_model, max_len=MAX_LEN):
    # Training configuration
    opt = tf.keras.optimizers.Adam(learning_rate=1e-5)
    loss = tf.keras.losses.CategoricalCrossentropy()
    accuracy = tf.keras.metrics.CategoricalAccuracy()

    # Two inputs: token IDs and the attention mask
    input_ids = tf.keras.Input(shape=(max_len,), dtype='int32')
    attention_masks = tf.keras.Input(shape=(max_len,), dtype='int32')

    # Index 1 of the BERT output is the pooled representation of the sequence
    embeddings = bert_model([input_ids, attention_masks])[1]

    # Two-unit softmax head, one unit per sentiment class
    output = tf.keras.layers.Dense(2, activation="softmax")(embeddings)

    model = tf.keras.models.Model(inputs=[input_ids, attention_masks], outputs=output)
    model.compile(opt, loss=loss, metrics=[accuracy])
    return model

In [None]:
model = create_model(bert_model, MAX_LEN)
model.summary()

In [None]:
history_bert = model.fit([train_input_ids,train_attention_masks], 
                         y_train, 
                         validation_data=([val_input_ids,val_attention_masks], y_valid), 
                         epochs=2, batch_size=64)

In [None]:
result_bert = model.predict([test_input_ids,test_attention_masks])

In [None]:
y_pred_bert =  np.zeros_like(result_bert)
y_pred_bert[np.arange(len(y_pred_bert)), result_bert.argmax(1)] = 1
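
The cell above turns each row of softmax probabilities into a one-hot prediction by setting the position of the largest probability to 1. A small NumPy sketch on made-up probabilities shows the same indexing trick, alongside the plain integer labels that `argmax` gives directly:

```python
import numpy as np

# Made-up softmax outputs for three tweets (columns: Negative, Positive).
probs = np.array([[0.9, 0.1],
                  [0.3, 0.7],
                  [0.5, 0.5]])

# One-hot predictions: put a 1 at the column with the highest probability in each row.
one_hot_pred = np.zeros_like(probs)
one_hot_pred[np.arange(len(probs)), probs.argmax(axis=1)] = 1

# Integer class labels, if one-hot form is not needed.
label_pred = probs.argmax(axis=1)
print(one_hot_pred)
print(label_pred)
```

On an exact tie, `argmax` returns the first maximum, so the third row is assigned to class 0.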

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

#Display confusion matrix
def conf_matrix(y, y_pred, title):
    fig, ax = plt.subplots(figsize=(5, 5))
    labels = ['Negative', 'Positive']
    ax = sns.heatmap(confusion_matrix(y, y_pred), annot=True, cmap="Blues", fmt='g', cbar=False, annot_kws={"size": 25})
    plt.title(title, fontsize=20)
    ax.xaxis.set_ticklabels(labels, fontsize=17) 
    ax.yaxis.set_ticklabels(labels, fontsize=17)
    ax.set_ylabel('Test', fontsize=20)
    ax.set_xlabel('Predicted', fontsize=20)
    plt.show()

In [None]:
conf_matrix(y_test.argmax(1), y_pred_bert.argmax(1),'BERT Sentiment Analysis\nConfusion Matrix')

In [None]:
print('\tClassification Report for BERT:\n\n',classification_report(y_test,y_pred_bert, target_names=['Negative', 'Positive']))

The BERT-based model achieved strong performance on both the 'Negative' and 'Positive' sentiment classes.
* Precision was 92% for 'Negative' and 94% for 'Positive', meaning most tweets predicted as a given class actually belonged to it.
* Recall was 94% for 'Negative' and 92% for 'Positive', meaning the model recovered most of the true instances of each class.
* F1-scores, the harmonic mean of precision and recall, were 93% for both classes.
* The test set contained 4281 'Negative' and 4381 'Positive' instances, and the micro, macro, weighted, and samples averages were all 93%.
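
For reference, the per-class figures in a classification report follow directly from confusion-matrix counts. The sketch below computes them for one class from hypothetical counts (these numbers are made up and are not the notebook's results); `precision_recall_f1` is an illustrative helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for the 'Positive' class.
tp, fp, fn = 95, 10, 5
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```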

These results indicate that the fine-tuned BERT model predicts sentiment on this dataset accurately, improving on the 84% accuracy of the Naive Bayes baseline.