## **Classifying the sentiments of Straits Times Facebook comments**

Given a user's comment on articles posted by the Straits Times on Facebook, how can we identify what the user feels? This project assumes that there are 3 possible sentiments: 'positive', 'negative', and 'neutral'.

For this model, we will fine-tune BERT to perform the sentiment classification task. 

Most of the work here is based on the repo: https://github.com/sebsk/CS224N-Project

### **Classifier model**

Since we are using `BertForSequenceClassification`, the model will take in the input and mask tensors and produce an single tensor of size 1 x 768. This tensor is the BERT output of the `[CLS]` token.

The `BertForSequenceClassification` model will then output this tensor to a softmax layer of size `n_class` (which in our case is 3 since we have 3 possible classes).

In [0]:
import sys
!{sys.executable} -m pip install torch transformers pandas scikit-learn



In [0]:
# Define utils functions

def pad_sents(sents, pad_token):
    """ Pad list of sentences according to the longest sentence in the batch.
    @param sents (list[list[int]]): list of sentences, where each sentence
                                    is represented as a list of words
    @param pad_token (int): padding token
    @returns sents_padded (list[list[int]]): list of sentences where sentences shorter
        than the max length sentence are padded out with the pad_token, such that
        each sentences in the batch now has equal length.
        Output shape: (batch_size, max_sentence_length)
    """
    sents_padded = []

    max_len = max(len(s) for s in sents)
    batch_size = len(sents)

    for s in sents:
        padded = [pad_token] * max_len
        padded[:len(s)] = s
        sents_padded.append(padded)

    return sents_padded

def sents_to_tensor(tokenizer, sents, device):
    """
    :param tokenizer: BertTokenizer
    :param sents: list[str], list of sentences (NOTE: untokenized, continuous sentences), reversely sorted
    :param device: torch.device
    :return: sents_tensor: torch.Tensor, shape(batch_size, max_sent_length), reversely sorted
    :return: masks_tensor: torch.Tensor, shape(batch_size, max_sent_length), reversely sorted
    :return: sents_lengths: torch.Tensor, shape(batch_size), reversely sorted
    """
    tokens_list = [tokenizer.tokenize(sent) for sent in sents]
    sents_lengths = [len(tokens) for tokens in tokens_list]
    # tokens_sents_zip = zip(tokens_list, sents_lengths)
    # tokens_sents_zip = sorted(tokens_sents_zip, key=lambda x: x[1], reverse=True)
    # tokens_list, sents_lengths = zip(*tokens_sents_zip)
    tokens_list_padded = pad_sents(tokens_list, '[PAD]')
    sents_lengths = torch.tensor(sents_lengths, device=device)

    masks = []
    for tokens in tokens_list_padded:
        mask = [0 if token=='[PAD]' else 1 for token in tokens]
        masks.append(mask)
    masks_tensor = torch.tensor(masks, dtype=torch.long, device=device)
    tokens_id_list = [tokenizer.convert_tokens_to_ids(tokens) for tokens in tokens_list_padded]
    sents_tensor = torch.tensor(tokens_id_list, dtype=torch.long, device=device)

    return sents_tensor, masks_tensor, sents_lengths

## **Defining the Model**

In [5]:
from transformers import BertForSequenceClassification, BertTokenizer, AdamW
import torch
from torch import nn
import torch.nn.functional as F

In [0]:
# Define the sentiment classification model

class SentimentClassifierModel(nn.Module):

    def __init__(self, bert_config, device, n_class):
        """
        :param bert_config: str, BERT configuration description
        :param device: torch.device
        :param n_class: int
        """

        super(SentimentClassifierModel, self).__init__()

        self.n_class = n_class
        self.bert_config = bert_config
        self.bert = BertForSequenceClassification.from_pretrained(self.bert_config, num_labels=self.n_class)
        self.tokenizer = BertTokenizer.from_pretrained(self.bert_config)
        self.device = device

    def forward(self, sents):
        """
        :param sents: list[str], list of sentences (NOTE: untokenized, continuous sentences)
        :return: pre_softmax, torch.tensor of shape (batch_size, n_class)
        """

        sents_tensor, masks_tensor, sents_lengths = sents_to_tensor(self.tokenizer, sents, self.device)
        pre_softmax = self.bert(input_ids=sents_tensor, attention_mask=masks_tensor)

        return pre_softmax

    @staticmethod
    def load(model_path: str, device):
        """ Load the model from a file.
        @param model_path (str): path to model
        @return model (nn.Module): model with saved parameters
        """
        params = torch.load(model_path, map_location=lambda storage, loc: storage)
        args = params['args']
        model = SentimentClassifierModel(device=device, **args)
        model.load_state_dict(params['state_dict'])

        return model

    def save(self, path: str):
        """ Save the model to a file.
        @param path (str): path to the model
        """
        print('save model parameters to [%s]' % path, file=sys.stderr)

        params = {
            'args': dict(bert_config=self.bert_config, n_class=self.n_class),
            'state_dict': self.state_dict()
        }

        torch.save(params, path)

## **Dataset**

We use the US Twitter Airline Sentiment dataset for training: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

The CSV file has the following rows: 
- `tweet_id`
- `airline_sentiment`
- `airline_sentiment_confidence`
- `negativereason`
- `negativereason_confidence`
- `airline`
- `airline_sentiment_gold`
- `name`
- `negativereason_gold`
- `retweet_count`
- `text`
- `tweet_coord`
- `tweet_created`
- `tweet_location`
- `user_timezone`

For our purposes, we only care about the following rows:
- `tweet_id`
- `airline_sentiment`
- `text`


In [0]:
# Load dataset
import pandas

pwd = '/content/gdrive'
# from google.colab import drive
# drive.mount(pwd)

df= pandas.read_csv("/content/gdrive/My Drive/Colab Notebooks/Tweets.csv", index_col=0, usecols=['tweet_id','airline_sentiment', 'text'])
df.head()

Unnamed: 0_level_0,airline_sentiment,text
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1
0,neutral,@VirginAmerica What @dhepburn said.
1,positive,@VirginAmerica plus you've added commercials t...
2,neutral,@VirginAmerica I didn't today... Must mean I n...
3,negative,@VirginAmerica it's really aggressive to blast...
4,negative,@VirginAmerica and it's a really big bad thing...


## **General text preprocessing**

This process references the following repo: https://github.com/sebsk/CS224N-Project/blob/df0050357d40e7f46b9c421ade52cdb9358c831c/Text_preprocessing.ipynb

In [0]:
# Remove URL, RT, mention(@)

df.text = df.text.str.replace(r'http(\S)+', r'')
df.text = df.text.str.replace(r'http ...', r'')
df.text = df.text.str.replace(r'(RT|rt)[ ]*@[ ]*[\S]+',r'')
df.text = df.text.str.replace(r'@[\S]+',r'')

# Remove non-ascii words or characters
df.text = [''.join([i if ord(i) < 128 else '' for i in text]) for text in df.text]
df.text = df.text.str.replace(r'_[\S]?',r'')

# Remove extra space
df.text = df.text.str.replace(r'[ ]{2, }',r' ')

# Remove &, < and >
df.text = df.text.str.replace(r'&amp;?',r'and')
df.text = df.text.str.replace(r'&lt;',r'<')
df.text = df.text.str.replace(r'&gt;',r'>')

# Insert space between words and punctuation marks
df.text = df.text.str.replace(r'([\w\d]+)([^\w\d ]+)', r'\1 \2')
df.text = df.text.str.replace(r'([^\w\d ]+)([\w\d]+)', r'\1 \2')

# Lowercased and strip
df.text = df.text.str.lower()
df.text = df.text.str.strip()



In [0]:
df['text_length'] = [len(text.split(' ')) for text in df.text]
print(df.shape)


(14640, 3)


In [0]:
# Drop texts with length <=3 and drop duplicates
df = df[df['text_length']>3]
df = df.drop_duplicates(subset=['text'])

print(df.shape)

(13977, 3)


In [0]:
# Summary of sample size and labels
df.shape[0]

13977

In [0]:
df.airline_sentiment.value_counts()

negative    8998
neutral     2834
positive    2145
Name: airline_sentiment, dtype: int64

## **Preprocess text into the BERT format**

In [0]:
df['BERT_processed_text'] = '[CLS] '+df.text
df.BERT_processed_text

tweet_id
0                                       [CLS] what  said .
1        [CLS] plus you ' ve added commercials to the e...
2        [CLS] i didn ' t today ... must mean i need to...
3        [CLS] it ' s really aggressive to blast obnoxi...
4         [CLS] and it ' s a really big bad thing about it
                               ...                        
14635    [CLS] thank you we got on a different flight t...
14636    [CLS] leaving over 20 minutes late flight . no...
14637    [CLS] please bring american airlines to # blac...
14638    [CLS] you have my money , you change my flight...
14639    [CLS] we have 8 ppl so we need 2 know how many...
Name: BERT_processed_text, Length: 13977, dtype: object

In [0]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
df['BERT_processed_text_length'] = [len(tokenizer.tokenize(sent)) for sent in df.text]

In [0]:
df.BERT_processed_text_length

tweet_id
0         3
1        15
2        17
3        25
4        11
         ..
14635    11
14636    27
14637     8
14638    29
14639    34
Name: BERT_processed_text_length, Length: 13977, dtype: int64

In [0]:
label_dict = dict()
for i, l in enumerate(list(df.airline_sentiment.value_counts().keys())):
    label_dict.update({l: i})

df['airline_sentiment_label'] = [label_dict[label] for label in df.airline_sentiment]

In [0]:
df.airline_sentiment_label

tweet_id
0        1
1        2
2        1
3        0
4        0
        ..
14635    2
14636    0
14637    1
14638    0
14639    1
Name: airline_sentiment_label, Length: 13977, dtype: int64

## **Save data**

In [0]:
# !touch /content/gdrive/My\ Drive/Colab\ Notebooks/bert_processed_twitter_airline_sentiment.csv
!ls /content/gdrive/My\ Drive/Colab\ Notebooks
df.to_csv(pwd + '/My Drive/Colab Notebooks/bert_processed_twitter_airline_sentiment.csv')

 bert_processed_twitter_airline_sentiment.csv
 BERT-sentiment-analysis.ipynb
'Copy of project2_github.ipynb'
'Copy of project2_world_bank.ipynb'
'Copy of VideoColorizerColab.ipynb'
 DeOldify_colab.ipynb
 Tweets.csv


## **Training**

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
# Define training params
label_names = ['positive', 'negative', 'neutral']
model_name = 'st-sentiment'
device = torch.device("cuda:0")
bert_size = 'bert-base-uncased'

train_batch_size = 32 # batch size
clip_grad = 1.0 # gradient clipping
log_every = 10 # number of mini-batches before logging
max_epoch = 100 # max number of epochs
max_patience = 3 # number of iterations to wait before decaying learning rate
max_num_trial = 3 # number of trials before terminating training
lr_decay = 0.5 # learning rate decay
lr_bert = 0.00002 # BERT learning rate
lr = 0.001 # learning rate
valid_niter = 500 # perform validation after n iterations
dropout = 0.3 # dropout rate
verbose = True

prefix = model_name + '_' + bert_size
model_save_path = pwd + '/My Drive/Colab Notebooks/' + prefix+'_model.bin'

In [0]:
# Split up data into train and validation, where validation is 20% of the dataset
training_data,validation_data = train_test_split(df,test_size=0.2,random_state=42)
print(len(df), len(training_data), len(validation_data))

13977 11181 2796


In [0]:
print(training_data)

         airline_sentiment  ... airline_sentiment_label
tweet_id                    ...                        
14597             negative  ...                       0
3167              positive  ...                       2
13635             negative  ...                       0
14048             negative  ...                       0
8073              negative  ...                       0
...                    ...  ...                     ...
5371              negative  ...                       0
14068             negative  ...                       0
5575               neutral  ...                       1
897                neutral  ...                       1
7563              positive  ...                       2

[11181 rows x 6 columns]


In [0]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

train_label = dict(training_data.airline_sentiment_label.value_counts())
label_max = float(max(train_label.values()))
train_label_weight = torch.tensor([label_max/train_label[i] for i in range(len(train_label))], device=device)

pp.pprint(train_label_weight)

tensor([1.0000, 3.2735, 4.2780], device='cuda:0')


In [0]:
# Set up model and optimizer
import time
start_time = time.time()

model = SentimentClassifierModel(bert_size, device, len(label_names))
optimizer = AdamW([
        {'params': model.bert.bert.parameters()},
        {'params': model.bert.classifier.parameters(), 'lr': float(lr)}
    ], lr=float(lr_bert))

model = model.to(device)
print('Use device: %s' % device, file=sys.stderr)
print('Done! time elapsed %.2f sec' % (time.time() - start_time), file=sys.stderr)
print('-' * 80, file=sys.stderr)

Use device: cuda:0
Done! time elapsed 17.86 sec
--------------------------------------------------------------------------------


In [0]:
# Util functions for training
import math
import logging
import pickle
import numpy as np
import torch
import pandas as pd
import sys
from docopt import docopt
from sklearn.metrics import accuracy_score, matthews_corrcoef, confusion_matrix, \
    f1_score, precision_score, recall_score, roc_auc_score

import matplotlib
matplotlib.use('agg')
from matplotlib import pyplot as plt

def batch_iter(data, batch_size, shuffle=False, bert=None):
    """ Yield batches of sentences and labels reverse sorted by length (largest to smallest).
    @param data (dataframe): dataframe with ProcessedText (str) and label (int) columns
    @param batch_size (int): batch size
    @param shuffle (boolean): whether to randomly shuffle the dataset
    @param bert (str): whether for BERT training. Values: "large", "base", None
    """
    batch_num = math.ceil(data.shape[0] / batch_size)
    index_array = list(range(data.shape[0]))

    if shuffle:
        data = data.sample(frac=1)

    for i in range(batch_num):
        indices = index_array[i * batch_size: (i + 1) * batch_size]
        examples = data.iloc[indices].sort_values(by='BERT_processed_text_length', ascending=False)
        sents = list(examples.BERT_processed_text)

        targets = list(examples.airline_sentiment_label.values)
        yield sents, targets  # list[list[str]] if not bert else list[str], list[int]
        
def validation(model, df_val, bert_size, loss_func, device):
    """ validation of model during training.
    @param model (nn.Module): the model being trained
    @param df_val (dataframe): validation dataset
    @param bert_size (str): large or base
    @param loss_func(nn.Module): loss function
    @param device (torch.device)
    @return avg loss value across validation dataset
    """
    was_training = model.training
    model.eval()

    df_val = df_val.sort_values(by='BERT_processed_text_length', ascending=False)

    ProcessedText_BERT = list(df_val.BERT_processed_text)
    InformationType_label = list(df_val.airline_sentiment_label)

    val_batch_size = 32

    n_batch = int(np.ceil(df_val.shape[0]/val_batch_size))

    total_loss = 0.

    with torch.no_grad():
        for i in range(n_batch):
            sents = ProcessedText_BERT[i*val_batch_size: (i+1)*val_batch_size]
            targets = torch.tensor(InformationType_label[i*val_batch_size: (i+1)*val_batch_size],
                                   dtype=torch.long, device=device)
            batch_size = len(sents)
            pre_softmax = model(sents)[0]
            batch_loss = loss_func(pre_softmax, targets)
            total_loss += batch_loss.item()*batch_size

    if was_training:
        model.train()

    return total_loss/df_val.shape[0]

def plot_confusion_matrix(y_true, y_pred, classes, normalize=False, title=None, path='cm', cmap=plt.cm.Reds):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    pickle.dump(cm, open(path, 'wb'))

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax


In [0]:
# Train

model.train()
cn_loss = torch.nn.CrossEntropyLoss(weight=train_label_weight, reduction='mean')
torch.save(cn_loss, 'loss_func')  # for later testing

# Initialize training variables
num_trial = 0
train_iter = 0
patience = 0
cum_loss = 0
report_loss = 0
cum_examples = report_examples = epoch = 0
hist_valid_scores = []

  "type " + obj.__name__ + ". It won't be checked "


In [0]:
! ls 

gdrive	loss_func  sample_data


In [0]:
import time

train_time = begin_time = time.time()
print('Begin Maximum Likelihood training...')

# Training loop
while True:
    epoch += 1
    for sents, targets in batch_iter(training_data, batch_size=train_batch_size, shuffle=True, bert='base'):  # for each epoch
        train_iter += 1
        optimizer.zero_grad()
        batch_size = len(sents)
        pre_softmax = model(sents)[0]

        # Calculate loss and gradient function
        loss = cn_loss(pre_softmax, torch.tensor(targets, dtype=torch.long, device=device))
        loss.backward()

        # Next step
        optimizer.step()

        batch_losses_val = loss.item() * batch_size
        report_loss += batch_losses_val
        cum_loss += batch_losses_val

        report_examples += batch_size
        cum_examples += batch_size

        if train_iter % log_every == 0:
            print('epoch %d, iter %d, avg. loss %.2f, '
                  'cum. examples %d, speed %.2f examples/sec, '
                  'time elapsed %.2f sec' % (epoch, train_iter,
                     report_loss / report_examples,
                     cum_examples,
                     report_examples / (time.time() - train_time),
                     time.time() - begin_time), file=sys.stderr)

            train_time = time.time()
            report_loss = report_examples = 0.

        # perform validation
        if train_iter % valid_niter == 0:
            print('epoch %d, iter %d, cum. loss %.2f, cum. examples %d' % (epoch, train_iter,
                 cum_loss / cum_examples,
                 cum_examples), file=sys.stderr)

            cum_loss = cum_examples = 0.

            print('begin validation ...', file=sys.stderr)

            validation_loss = validation(model, validation_data, bert_size, cn_loss, device)   # dev batch size can be a bit larger

            print('validation: iter %d, loss %f' % (train_iter, validation_loss), file=sys.stderr)

            is_better = len(hist_valid_scores) == 0 or validation_loss < min(hist_valid_scores)
            hist_valid_scores.append(validation_loss)

            if is_better:
                patience = 0
                print('save currently the best model to [%s]' % model_save_path, file=sys.stderr)

                model.save(model_save_path)

                # also save the optimizers' state
                torch.save(optimizer.state_dict(), model_save_path + '.optim')
            elif patience < int(max_patience):
                patience += 1
                print('hit patience %d' % patience, file=sys.stderr)

                if patience == int(max_patience):
                    num_trial += 1
                    print('hit #%d trial' % num_trial, file=sys.stderr)
                    if num_trial == max_num_trial:
                        print('early stop!', file=sys.stderr)
                        exit(0)

                    # decay lr, and restore from previously best checkpoint
                    print('load previously best model and decay learning rate to %f%%' %
                          (float(lr_decay)*100), file=sys.stderr)

                    # load model
                    params = torch.load(model_save_path, map_location=lambda storage, loc: storage)
                    model.load_state_dict(params['state_dict'])
                    model = model.to(device)

                    print('restore parameters of the optimizers', file=sys.stderr)
                    optimizer.load_state_dict(torch.load(model_save_path + '.optim'))

                    # set new lr
                    for param_group in optimizer.param_groups:
                        param_group['lr'] *= float(lr_decay)

                    # reset patience
                    patience = 0

            if epoch == int(max_epoch):
                print('reached maximum number of epochs!', file=sys.stderr)
                exit(0)

Begin Maximum Likelihood training...


epoch 1, iter 10, avg. loss 1.11, cum. examples 320, speed 171.93 examples/sec, time elapsed 1.86 sec
epoch 1, iter 20, avg. loss 1.07, cum. examples 640, speed 182.67 examples/sec, time elapsed 3.61 sec
epoch 1, iter 30, avg. loss 1.00, cum. examples 960, speed 175.84 examples/sec, time elapsed 5.43 sec
epoch 1, iter 40, avg. loss 0.83, cum. examples 1280, speed 180.03 examples/sec, time elapsed 7.21 sec
epoch 1, iter 50, avg. loss 0.80, cum. examples 1600, speed 175.23 examples/sec, time elapsed 9.04 sec
epoch 1, iter 60, avg. loss 0.65, cum. examples 1920, speed 176.00 examples/sec, time elapsed 10.86 sec
epoch 1, iter 70, avg. loss 0.86, cum. examples 2240, speed 172.25 examples/sec, time elapsed 12.71 sec
epoch 1, iter 80, avg. loss 0.74, cum. examples 2560, speed 181.19 examples/sec, time elapsed 14.48 sec
epoch 1, iter 90, avg. loss 0.74, cum. examples 2880, speed 178.32 examples/sec, time elapsed 16.28 sec
epoch 1, iter 100, avg. loss 0.63, cum. examples 3200, speed 177.10 exam

KeyboardInterrupt: ignored

## **Validation**

In [0]:
    import numpy as np
    import pickle
    from sklearn.metrics import accuracy_score, matthews_corrcoef, confusion_matrix, \
    f1_score, precision_score, recall_score, roc_auc_score
    import matplotlib
    matplotlib.use('agg')
    from matplotlib import pyplot as plt

In [0]:
def plot_confusion_matrix(y_true, y_pred, classes, normalize=False, title=None, path='cm', cmap=plt.cm.Reds):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    pickle.dump(cm, open(path, 'wb'))

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

In [0]:
    print('load best model...')

    model = SentimentClassifierModel.load('/content/gdrive/My Drive/Colab Notebooks/' + prefix + '_model.bin', device)

    model.to(device)

    model.eval()

    df_test = validation_data

    df_test = df_test.sort_values(by='BERT_processed_text_length', ascending=False)

    test_batch_size = 32

    n_batch = int(np.ceil(df_test.shape[0]/test_batch_size))

    cn_loss = torch.load('loss_func', map_location=lambda storage, loc: storage).to(device)

    ProcessedText_BERT = list(df_test.BERT_processed_text)
    InformationType_label = list(df_test.airline_sentiment_label)

    test_loss = 0.
    prediction = []
    prob = []

    softmax = torch.nn.Softmax(dim=1)

    with torch.no_grad():
        for i in range(n_batch):
            sents = ProcessedText_BERT[i*test_batch_size: (i+1)*test_batch_size]
            targets = torch.tensor(InformationType_label[i * test_batch_size: (i + 1) * test_batch_size],
                                   dtype=torch.long, device=device)
            batch_size = len(sents)

            pre_softmax = model(sents)[0]
            batch_loss = cn_loss(pre_softmax, targets)
            test_loss += batch_loss.item()*batch_size
            prob_batch = softmax(pre_softmax)
            prob.append(prob_batch)

            prediction.extend([t.item() for t in list(torch.argmax(prob_batch, dim=1))])

    prob = torch.cat(tuple(prob), dim=0)
    loss = test_loss/df_test.shape[0]

    pickle.dump([label_names[i] for i in prediction], open(prefix+'_test_prediction', 'wb'))
    pickle.dump(prob.data.cpu().numpy(), open(prefix + '_test_prediction_prob', 'wb'))

    accuracy = accuracy_score(df_test.airline_sentiment_label.values, prediction)
    matthews = matthews_corrcoef(df_test.airline_sentiment_label.values, prediction)

    precisions = {}
    recalls = {}
    f1s = {}
    aucrocs = {}

    for i in range(len(label_names)):
        prediction_ = [1 if pred == i else 0 for pred in prediction]
        true_ = [1 if label == i else 0 for label in df_test.airline_sentiment_label.values]
        f1s.update({label_names[i]: f1_score(true_, prediction_)})
        precisions.update({label_names[i]: precision_score(true_, prediction_)})
        recalls.update({label_names[i]: recall_score(true_, prediction_)})
        aucrocs.update({label_names[i]: roc_auc_score(true_, list(t.item() for t in prob[:, i]))})

    metrics_dict = {'loss': loss, 'accuracy': accuracy, 'matthews coef': matthews, 'precision': precisions,
                         'recall': recalls, 'f1': f1s, 'aucroc': aucrocs}

    pickle.dump(metrics_dict, open(prefix+'_evaluation_metrics', 'wb'))

    cm = plot_confusion_matrix(list(df_test.airline_sentiment_label.values), prediction, label_names, normalize=False,
                          path=prefix+'_test_confusion_matrix', title='confusion matrix for test dataset')
    plt.savefig(prefix+'_test_confusion_matrix', format='png')
    cm_norm = plot_confusion_matrix(list(df_test.airline_sentiment_label.values), prediction, label_names, normalize=True,
                          path=prefix+'_test normalized_confusion_matrix', title='normalized confusion matrix for test dataset')
    plt.savefig(prefix+'_test_normalized_confusion_matrix', format='png')

    print('loss: %.2f' % loss)
    print('accuracy: %.2f' % accuracy)
    print('matthews coef: %.2f' % matthews)
    print('-' * 80)
    for i in range(len(label_names)):
        print('precision score for %s: %.2f' % (label_names[i], precisions[label_names[i]]))
        print('recall score for %s: %.2f' % (label_names[i], recalls[label_names[i]]))
        print('f1 score for %s: %.2f' % (label_names[i], f1s[label_names[i]]))
        print('auc roc score for %s: %.2f' % (label_names[i], aucrocs[label_names[i]]))
        print('-' * 80)

load best model...
loss: 0.55
accuracy: 0.83
matthews coef: 0.69
--------------------------------------------------------------------------------
precision score for positive: 0.89
recall score for positive: 0.90
f1 score for positive: 0.89
auc roc score for positive: 0.94
--------------------------------------------------------------------------------
precision score for negative: 0.73
recall score for negative: 0.62
f1 score for negative: 0.67
auc roc score for negative: 0.90
--------------------------------------------------------------------------------
precision score for neutral: 0.74
recall score for neutral: 0.87
f1 score for neutral: 0.80
auc roc score for neutral: 0.97
--------------------------------------------------------------------------------


In [0]:
precisions

{'negative': 0.7326923076923076,
 'neutral': 0.7437858508604207,
 'positive': 0.8887621220764403}

In [0]:
recalls

{'negative': 0.6195121951219512,
 'neutral': 0.8702460850111857,
 'positive': 0.8985005767012687}

In [0]:
f1s

{'negative': 0.6713656387665198,
 'neutral': 0.8020618556701031,
 'positive': 0.8936048178950387}

In [0]:
aucrocs

{'negative': 0.9045302557564778,
 'neutral': 0.9672781887289846,
 'positive': 0.9354930850151072}

## **Preprocess Straits Times comments**

In [1]:
import pandas
st_df = pandas.read_csv("/content/gdrive/My Drive/Colab Notebooks/st-comments.csv", index_col=0, encoding='latin-1', usecols=['comment_id','post_title', 'comment_text'])
st_df.head()

Unnamed: 0_level_0,post_title,comment_text
comment_id,Unnamed: 1_level_1,Unnamed: 2_level_1
0,Husband runs off when wife is pregnant and def...,Better to be happy bringing up your child than...
1,Husband runs off when wife is pregnant and def...,"..kinda in this boat at the moment , feeling v..."
2,Husband runs off when wife is pregnant and def...,So sad that the marriage institution is taken ...
3,Husband runs off when wife is pregnant and def...,"When a women found a men younger than her, ppl..."
4,Husband runs off when wife is pregnant and def...,Both gained what they want.guy want short happ...


In [2]:
# Remove URL, RT, mention(@)

st_df['text'] = st_df.comment_text

st_df.text = st_df.text.str.replace(r'http(\S)+', r'')
st_df.text = st_df.text.str.replace(r'http ...', r'')
st_df.text = st_df.text.str.replace(r'(RT|rt)[ ]*@[ ]*[\S]+',r'')
st_df.text = st_df.text.str.replace(r'@[\S]+',r'')

# Remove non-ascii words or characters
st_df.text = [''.join([i if ord(i) < 128 else '' for i in text]) for text in st_df.text]
st_df.text = st_df.text.str.replace(r'_[\S]?',r'')

# Remove extra space
st_df.text = st_df.text.str.replace(r'[ ]{2, }',r' ')

# Remove &, < and >
st_df.text = st_df.text.str.replace(r'&amp;?',r'and')
st_df.text = st_df.text.str.replace(r'&lt;',r'<')
st_df.text = st_df.text.str.replace(r'&gt;',r'>')

# Insert space between words and punctuation marks
st_df.text = st_df.text.str.replace(r'([\w\d]+)([^\w\d ]+)', r'\1 \2')
st_df.text = st_df.text.str.replace(r'([^\w\d ]+)([\w\d]+)', r'\1 \2')

# Lowercased and strip
st_df.text = st_df.text.str.lower()
st_df.text = st_df.text.str.strip()

st_df['text_length'] = [len(text.split(' ')) for text in st_df.text]
print(st_df.shape)


(257, 4)


### **Preprocess comments into BERT format**

In [3]:
st_df['BERT_processed_text'] = '[CLS] '+ st_df.text
st_df.BERT_processed_text

comment_id
0      [CLS] better to be happy bringing up your chil...
1      [CLS] .. kinda in this boat at the moment , fe...
2      [CLS] so sad that the marriage institution is ...
3      [CLS] when a women found a men younger than he...
4      [CLS] both gained what they want . guy want sh...
                             ...                        
252    [CLS] how come this photographer skill like da...
253    [CLS] this tension between perspective of citi...
254                          [CLS] send in the eagles !!
255    [CLS] sling shot lah . just encourage to shoot...
256    [CLS] deploy the air rifle club people to trai...
Name: BERT_processed_text, Length: 257, dtype: object

In [6]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
st_df['BERT_processed_text_length'] = [len(tokenizer.tokenize(sent)) for sent in st_df.text]
st_df.BERT_processed_text_length

comment_id
0      34
1      53
2      95
3      88
4      17
       ..
252    11
253    23
254     6
255    21
256    14
Name: BERT_processed_text_length, Length: 257, dtype: int64

In [7]:
st_df

Unnamed: 0_level_0,post_title,comment_text,text,text_length,BERT_processed_text,BERT_processed_text_length
comment_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,Husband runs off when wife is pregnant and def...,Better to be happy bringing up your child than...,better to be happy bringing up your child than...,30,[CLS] better to be happy bringing up your chil...,34
1,Husband runs off when wife is pregnant and def...,"..kinda in this boat at the moment , feeling v...",".. kinda in this boat at the moment , feeling ...",49,"[CLS] .. kinda in this boat at the moment , fe...",53
2,Husband runs off when wife is pregnant and def...,So sad that the marriage institution is taken ...,so sad that the marriage institution is taken ...,95,[CLS] so sad that the marriage institution is ...,95
3,Husband runs off when wife is pregnant and def...,"When a women found a men younger than her, ppl...","when a women found a men younger than her , pp...",78,[CLS] when a women found a men younger than he...,88
4,Husband runs off when wife is pregnant and def...,Both gained what they want.guy want short happ...,both gained what they want . guy want short ha...,17,[CLS] both gained what they want . guy want sh...,17
...,...,...,...,...,...,...
252,"Menacing mynas, pigeons, and crows: complaints...",How come this photographer skill like dat one ...,how come this photographer skill like dat one ...,9,[CLS] how come this photographer skill like da...,11
253,"Menacing mynas, pigeons, and crows: complaints...",This tension between perspective of citizens l...,this tension between perspective of citizens l...,22,[CLS] this tension between perspective of citi...,23
254,"Menacing mynas, pigeons, and crows: complaints...",Send in the eagles!!,send in the eagles !!,5,[CLS] send in the eagles !!,6
255,"Menacing mynas, pigeons, and crows: complaints...",Sling shot lah. Just encourage to shoot them. ...,sling shot lah . just encourage to shoot them ...,20,[CLS] sling shot lah . just encourage to shoot...,21


### **Save processed file**

In [0]:
st_df.to_csv(pwd + '/My Drive/Colab Notebooks/bert_processed_st_comments.csv')

In [16]:
# Load model
model = SentimentClassifierModel.load('/content/gdrive/My Drive/Colab Notebooks/' + prefix + '_model.bin', device)

model.to(device)

SentimentClassifierModel(
  (bert): BertForSequenceClassification(
    (bert): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0): BertLayer(
            (attention): BertAttention(
              (self): BertSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768

In [0]:
st_df = st_df.sort_values(by='BERT_processed_text_length', ascending=False)

In [18]:
st_df

Unnamed: 0_level_0,post_title,comment_text,text,text_length,BERT_processed_text,BERT_processed_text_length
comment_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
148,Tsai Ing-wen re-elected Taiwan President; KMT'...,"This, morning, I said , We cannot give a One P...","this , morning , i said , we cannot give a one...",221,"[CLS] this , morning , i said , we cannot give...",233
232,"Menacing mynas, pigeons, and crows: complaints...",Seeing so many comments on different birds inv...,seeing so many comments on different birds inv...,167,[CLS] seeing so many comments on different bir...,188
142,Tsai Ing-wen re-elected Taiwan President; KMT'...,"Tsai lng wen, Re-Elected,as Taiwan President, ...","tsai lng wen , re - elected , as taiwan presid...",132,"[CLS] tsai lng wen , re - elected , as taiwan ...",149
239,"Menacing mynas, pigeons, and crows: complaints...",They are annoying a nuisance only if they star...,they are annoying a nuisance only if they star...,128,[CLS] they are annoying a nuisance only if the...,135
249,"Menacing mynas, pigeons, and crows: complaints...",I think many people feed the pigeons out of ki...,i think many people feed the pigeons out of ki...,124,[CLS] i think many people feed the pigeons out...,131
...,...,...,...,...,...,...
127,Tsai Ing-wen re-elected Taiwan President; KMT'...,Taiwan wins,taiwan wins,2,[CLS] taiwan wins,2
221,Forum: Promote plant-based diet to cut Singapo...,Yes,yes,1,[CLS] yes,1
224,Forum: Promote plant-based diet to cut Singapo...,YES,yes,1,[CLS] yes,1
69,China launches gigantic telescope in hunt for ...,Great,great,1,[CLS] great,1


In [0]:
cn_loss = torch.load('loss_func', map_location=lambda storage, loc: storage).to(device)

In [0]:
ProcessedText_BERT = list(st_df.BERT_processed_text)

In [21]:
ProcessedText_BERT

['[CLS] this , morning , i said , we cannot give a one party , government , too much power , they will find ways to make things legally , to corrupt and abusing its power . many countries , are having a two party system . never , trusted this government , they had lost their way to govern . are singaporean , better off after so many years under this regime , the answer is definitely no . congratulations , to tsai ing - wen , successfully , re - elected as taiwan president . she , had increased , the minimum wage , in taiwan . while those past president are reluctant to do . since , our government like to changed the constitution , not my elected president and re - write the history of singapore . whereby , our beloved , ong teng chong is our first elected president . without ong teng chong , singaporean will not have the mrt system . so , this coming ge , singaporean must stand united , to re - write history once again , to vote for a majority of oppositions in parliament , to make all

In [0]:
softmax = torch.nn.Softmax(dim=1)

In [0]:
labels = ['negative', 'neutral', 'positive']

In [24]:
sents = ProcessedText_BERT[:2]
sents

['[CLS] this , morning , i said , we cannot give a one party , government , too much power , they will find ways to make things legally , to corrupt and abusing its power . many countries , are having a two party system . never , trusted this government , they had lost their way to govern . are singaporean , better off after so many years under this regime , the answer is definitely no . congratulations , to tsai ing - wen , successfully , re - elected as taiwan president . she , had increased , the minimum wage , in taiwan . while those past president are reluctant to do . since , our government like to changed the constitution , not my elected president and re - write the history of singapore . whereby , our beloved , ong teng chong is our first elected president . without ong teng chong , singaporean will not have the mrt system . so , this coming ge , singaporean must stand united , to re - write history once again , to vote for a majority of oppositions in parliament , to make all

In [25]:
len(sents)

2

In [28]:
pre_softmax = model(sents)[0]
pre_softmax

tensor([[ 2.6207, -1.1077, -1.3669],
        [ 2.4422, -0.3667, -1.6395]], device='cuda:0', grad_fn=<AddmmBackward>)

In [29]:
pre_softmax.shape

torch.Size([2, 3])

In [30]:
prob = softmax(pre_softmax)
prob

tensor([[0.9592, 0.0230, 0.0178],
        [0.9284, 0.0560, 0.0157]], device='cuda:0', grad_fn=<SoftmaxBackward>)

In [31]:
prob.shape

torch.Size([2, 3])

In [32]:
prob[0]

tensor([0.9592, 0.0230, 0.0178], device='cuda:0', grad_fn=<SelectBackward>)

In [0]:
# Find the highest value of the tensor
label_indexes = [t.item() for t in list(torch.argmax(prob, dim=1))]

In [35]:
prediction = labels[label_indexes[1]]
prediction

'positive'

In [37]:
predictions = []
with torch.no_grad():
  sents = ProcessedText_BERT
  pre_softmax = model(sents)[0]
  prob = softmax(pre_softmax)
  predictions.extend([t.item() for t in list(torch.argmax(prob, dim=1))])
print(predictions)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 1, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 2, 1, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 2, 1, 1, 1, 0, 0, 0, 1, 2, 2, 0, 0, 1, 0, 0, 0, 0, 2, 1, 1, 0, 0, 2, 1, 2, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 2, 0, 1, 0, 2, 0, 1, 0, 0, 0, 1, 1, 2, 0, 2, 2, 2, 1, 0, 0, 2, 0, 2, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2, 1, 0, 1, 0, 0, 1, 2, 1, 0, 1, 0, 2, 1, 1, 2, 2, 1, 0, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 0, 2, 1, 2, 1, 1, 1, 2, 0, 2, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1]


In [51]:
[labels[pred_val] for pred_val in predictions]

['negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 'negative',
 'negative',
 'negative',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'neutral',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'neutral',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 'negative',
 'negative',
 'negative',
 'positive',
 'negative',
 'negative',
 'negative',
 'positive',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 

In [0]:
st_df['predictions'] = [labels[pred_val] for pred_val in predictions]

In [56]:
print(st_df.comment_text[90])
print(st_df.predictions[90])

Well done,Â congratsÂ to President Tsai and to the Taiwan people.
positive


In [0]:
st_df.to_csv(pwd + '/My Drive/Colab Notebooks/bert_predicted_st_comments.csv')

In [58]:
st_df.predictions.value_counts()

negative    136
neutral      67
positive     54
Name: predictions, dtype: int64