### **r/place 2023 Sentiment Analysis Model Creation**

In this Jupyter notebook, I use transfer learning to fine-tune a pretrained BERT-based sentiment model (BERTweet) on a dataset of Reddit comments labeled with sentiment scores.

**Credits**<br>
* Thank you to Professor Alvin Chen of the National Taiwan Normal University for his guide to performing transfer learning with BERT using IMDB movie reviews. The tutorial can be found at: https://alvinntnu.github.io/NTNU_ENC2045_LECTURES/temp/sentiment-analysis-using-bert-keras-movie-reviews.html

* Thank you to Chaithanya Kumar A for collecting Reddit comments with associated sentiment values for this project. His dataset can be found at: https://www.kaggle.com/datasets/cosmos98/twitter-and-reddit-sentimental-analysis-dataset

**Importing Libraries**<br>
To build the NLP model, I use TensorFlow and Hugging Face Transformers, along with supporting libraries such as scikit-learn, NumPy, and pandas.

In [1]:
import csv
import numpy as np
import pandas as pd
import nltk
import random
import re
import sklearn
from sklearn.model_selection import train_test_split
import tensorflow as tf
import tensorflow.keras as keras
import torch
import transformers
import unicodedata

from transformers import BertTokenizer, TFBertForSequenceClassification



**Loading the Data**<br>
The target CSV file has Reddit comments in Column 0 and a score in Column 1. The scores correspond to the following sentiments: -1 = negative, 0 = neutral, 1 = positive.

In [2]:
# define the data path and store the comments in a list
data_path = "data/Reddit_Data.csv"
comments = []

# read the csv and store each comment with its respective score
with open(data_path, "r", encoding="utf8") as f:
    csv_reader = csv.reader(f)
    next(csv_reader)
    for row in csv_reader:
        comment, score = row
        comments.append((comment, int(score)))

# shuffle the data for variety
random.shuffle(comments)
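
Before training, it can help to check how the three classes are distributed, since a skewed dataset makes raw accuracy harder to interpret. The sketch below is an illustration only and not part of the original notebook: `load_comments` and the inline sample rows are hypothetical, and the helper additionally skips rows that do not have exactly two fields or whose score is not an integer.

```python
import csv
import io
from collections import Counter

def load_comments(f):
    """Read (comment, score) pairs from an open CSV file, skipping malformed rows."""
    reader = csv.reader(f)
    next(reader, None)  # skip the header row
    rows = []
    for row in reader:
        if len(row) != 2:
            continue  # unexpected number of columns
        comment, score = row
        try:
            rows.append((comment, int(score)))
        except ValueError:
            continue  # score is empty or not an integer
    return rows

# hypothetical sample standing in for data/Reddit_Data.csv
sample = io.StringIO(
    "comment,score\n"
    "great canvas this year,1\n"
    "the bots ruined it,-1\n"
    "it ran for a few days,0\n"
    "missing score,\n"
)
comments_sample = load_comments(sample)
label_counts = Counter(score for _, score in comments_sample)
print(label_counts)
```

With the real file, the same helper could be called as `load_comments(open(data_path, encoding="utf8"))`.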

**Splitting the Data**<br>
I'll use a split ratio of 80:20 for this model. Feel free to tweak the ratio if you'd like!

In [3]:
train_set, test_set = train_test_split(comments,
                                       test_size=0.2,
                                       random_state=24)
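
A plain random split does not guarantee that each sentiment class appears in the same proportion in both sets. The following is a minimal standard-library sketch of a stratified split, written only to illustrate the idea; `stratified_split` and the toy data are hypothetical and this is not scikit-learn's implementation. With scikit-learn, a similar effect is available through the `stratify` argument of `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(pairs, test_size=0.2, seed=24):
    """Split (text, label) pairs so each label keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for pair in pairs:
        by_label[pair[1]].append(pair)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# hypothetical toy data: 10 comments per class
toy = [(f"comment {i}", label) for label in (-1, 0, 1) for i in range(10)]
toy_train, toy_test = stratified_split(toy)
print(len(toy_train), len(toy_test))  # 24 6
```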

**Extracting X and Y**<br>
In this model, X = Reddit Comments, and Y = Score.

In [4]:
# maximum comment length, measured in characters (not tokens)
TOKEN_SIZE = 128

# fill in the comments and scores of the training set;
# slicing leaves comments shorter than TOKEN_SIZE unchanged
x_train = []; y_train = []
for comment, score in train_set:
    x_train.append(comment[:TOKEN_SIZE])
    # shift scores from {-1, 0, 1} to class indices {0, 1, 2}
    y_train.append(int(score)+1)

# fill in the comments and scores of the testing set
x_test = []; y_test = []
for comment, score in test_set:
    x_test.append(comment[:TOKEN_SIZE])
    y_test.append(int(score)+1)
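
The `+1` shift matters because sparse categorical cross-entropy expects integer class indices starting at 0, while the dataset uses -1, 0, and 1. As a small sketch of that mapping and its inverse (the helper names `score_to_class`, `class_to_score`, and `LABEL_NAMES` are hypothetical, introduced here for illustration):

```python
# class index -> sentiment name, following the score + 1 mapping
LABEL_NAMES = ["negative", "neutral", "positive"]

def score_to_class(score):
    """Map a dataset score in {-1, 0, 1} to a class index in {0, 1, 2}."""
    return int(score) + 1

def class_to_score(class_index):
    """Map a class index in {0, 1, 2} back to a dataset score in {-1, 0, 1}."""
    return class_index - 1

for score in (-1, 0, 1):
    idx = score_to_class(score)
    print(score, "->", idx, LABEL_NAMES[idx])
```

The inverse mapping is useful later, when a predicted class index has to be reported as a score on the original scale.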

**Tokenizing the Data**<br>
The model cannot consume raw strings, so we tokenize each comment into integer token IDs, padded to a common length.

In [5]:
# load the bertweet tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("finiteautomata/bertweet-base-sentiment-analysis")

# tokenize the reddit comments
tokenized_train = tokenizer(x_train, return_tensors="np", padding=True)
tokenized_test = tokenizer(x_test, return_tensors="np", padding=True)

# convert the label lists to NumPy arrays
labels_train = np.array(y_train)
labels_test = np.array(y_test)

emoji is not installed, thus not converting emoticons or emojis into text. Install emoji: pip3 install emoji==0.6.0


In [6]:
print(tokenized_train["input_ids"][0])

[    0   101    33    94  1498  8225  6010   116  2153   235 11108   136
    33 12738   130   175  1029   153     6  9778 27674   520 22282     2
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1]
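
In the printed array, the sequence starts with ID 0, the real content ends with ID 2, and the remainder is filled with ID 1. In RoBERTa-style vocabularies such as the one BERTweet uses, these typically correspond to the start, end, and padding tokens. The sketch below shows in plain Python what `padding=True` does conceptually; `pad_batch` is a hypothetical helper, the pad ID of 1 is an assumption inferred from the output above, and the short ID sequences are made up for illustration.

```python
PAD_ID = 1  # assumed pad token ID, inferred from the printed output

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad each ID list to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# hypothetical, shortened ID sequences (0 = start, 2 = end)
batch = [[0, 101, 33, 94, 2], [0, 8225, 2]]
ids, mask = pad_batch(batch)
print(ids)   # [[0, 101, 33, 94, 2], [0, 8225, 2, 1, 1]]
print(mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The real tokenizer returns both arrays as `input_ids` and `attention_mask`, which is why the whole tokenizer output is passed to the model rather than the IDs alone.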


**Preparing the Model (BERTweet)**<br>
With the training and testing data tokenized, we can load the pretrained BERTweet model, compile it, and fine-tune it on the Reddit comments.

In [7]:
# load the bertweet model
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("finiteautomata/bertweet-base-sentiment-analysis", num_labels=3)

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at finiteautomata/bertweet-base-sentiment-analysis.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


In [8]:
learning_rate = 2e-5

optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(loss=loss,
              optimizer=optimizer,
              metrics=[metric])
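
To make the loss concrete: with `from_logits=True`, the model outputs one unnormalized score per class, and the loss for an example is the negative log of the softmax probability assigned to its true class. The following is a plain-Python sketch of that computation for a single example; the function names and logit values are hypothetical, and this is an illustration of the formula rather than Keras's implementation.

```python
import math

def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def predict_class(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [0.2, 1.5, -0.3]  # hypothetical scores for classes 0, 1, 2
print(predict_class(logits))  # 1
print(sparse_categorical_crossentropy(logits, 1))

# a model that is completely unsure over 3 classes scores ln(3)
uniform_loss = sparse_categorical_crossentropy([0.0, 0.0, 0.0], 0)
print(round(uniform_loss, 4))  # 1.0986
```

The value ln(3) ≈ 1.0986 is a useful reference point: a three-class loss above it means the model is doing worse than uniform guessing.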

In [9]:
# log the per-epoch training metrics to a csv file
filename = 'data/callbacks_log.csv'
history_logger = tf.keras.callbacks.CSVLogger(filename, separator=",", append=True)
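
Because the logger appends to the same file across runs, the CSV can be read back later to compare epochs without retraining. The sketch below assumes the log has an `epoch` column followed by one column per metric; that layout, the `read_log` helper, and every number in the sample are assumptions made for illustration, not results from this notebook.

```python
import csv
import io

def read_log(f):
    """Parse a CSV training log into a list of dicts with float metric values."""
    rows = []
    for row in csv.DictReader(f):
        epoch = int(row.pop("epoch"))
        rows.append({"epoch": epoch, **{k: float(v) for k, v in row.items()}})
    return rows

# made-up placeholder values, NOT results from this notebook
fake_log = io.StringIO(
    "epoch,accuracy,loss,val_accuracy,val_loss\n"
    "0,0.50,1.00,0.55,0.95\n"
    "1,0.60,0.90,0.58,0.93\n"
)
log_rows = read_log(fake_log)
best = max(log_rows, key=lambda r: r["val_accuracy"])
print(best["epoch"])  # 1
```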

In [10]:
epochs = 1

history = model.fit(
  dict(tokenized_train), 
  labels_train,
  validation_data=(dict(tokenized_test), labels_test),
  epochs=epochs,
  callbacks=[history_logger]
)

132/932 [===>..........................] - ETA: 8:45:24 - loss: 0.8283 - accuracy: 0.6738

In [None]:
!mkdir -p saved_model
model.save('saved_model/')

In [None]:
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd

matplotlib.rcParams['figure.dpi'] = 150


# Plotting results
def plot1(history):
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']

    epochs = range(1, len(acc) + 1)
    ## Accuracy plot
    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'b', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    ## Loss plot
    plt.figure()

    plt.plot(epochs, loss, 'bo', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
    plt.show()


def plot2(history):
    pd.DataFrame(history.history).plot(figsize=(8, 5))
    plt.grid(True)
    #plt.gca().set_ylim(0,1)
    plt.show()

In [None]:
plot2(history)