# Fine-tuning a BERT model to determine the valence of Glassdoor/Indeed reviews
# CS72 Final, 22S
## Written by Leah Ryu and Michelle Chen
### leah.ryu.22@dartmouth.edu and michelle.chen.22@dartmouth.edu

We start from a set of review sentences, already classified by topic, that carry gold labels of positive or negative. We fine-tune a BERT model on these sentences so that it learns to label reviews as negative or positive. Once the accuracy is acceptable, we can use the model to label reviews that lack gold labels. These reviews come from the 'content' field of the Indeed reviews, which is a general body of text without a specified valence.

We owe great thanks to the HW6 Jupyter notebooks and the many BERT tutorials available online, including:

https://www.geeksforgeeks.org/fine-tuning-bert-model-for-sentiment-analysis/#:~:text=Google%20created%20a%20transformer%2Dbased,dataset%20would%20lead%20to%20overfitting

https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb#scrollTo=sd1LiXGjZ420


In [None]:
!pip install -q transformers datasets

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import transformers as ppb
from transformers import AutoModel, BertTokenizerFast

#for pytorch
import torch
import torch.nn as nn

# Parsing the text files
We need our text files parsed into one large dataframe with a content column (`features`) and a valence column (`labels`) so that we can fine-tune our BERT with it. Let's take all the already-labeled data from each company, that is, everything excluding the neutral data from Indeed. We can first use this data without worrying about topic categories or dates to fine-tune the BERT.


In [None]:
# Libraries needed to import files from drive
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
# Remove trailing whitespace (unneeded newlines, single spaces, etc.) from each sentence.
def remove_whitespace_from(review_sentences):
  return [sentence.rstrip() for sentence in review_sentences]

In [None]:
# Open all the files we need: pos and neg classification data
# for the four companies, from Glassdoor and Indeed.
DATA_DIR = "/content/drive/MyDrive/compling_final"

# Read one classified file and strip trailing whitespace from every line.
# The with-block closes the file as soon as its lines have been read.
def read_review_file(relative_path):
  with open(DATA_DIR + "/" + relative_path, 'r') as f:
    return remove_whitespace_from(f.readlines())

# Riot reviews
riotPos = read_review_file("Glassdoor/riotProsClassified.txt")
riotNeg = read_review_file("Glassdoor/riotConsClassified.txt")
riotIndeedPos = read_review_file("Indeed/riotProsIndeedClassified.txt")
riotIndeedNeg = read_review_file("Indeed/riotConsIndeedClassified.txt")

# Sony reviews
sonyPos = read_review_file("Glassdoor/sonyProsClassified.txt")
sonyNeg = read_review_file("Glassdoor/sonyConsClassified.txt")
sonyIndeedPos = read_review_file("Indeed/sonyProsIndeedClassified.txt")
sonyIndeedNeg = read_review_file("Indeed/sonyConsIndeedClassified.txt")

# Ubisoft reviews
ubisoftPos = read_review_file("Glassdoor/ubisoftProsClassified.txt")
ubisoftNeg = read_review_file("Glassdoor/ubisoftConsClassified.txt")
ubisoftIndeedPos = read_review_file("Indeed/ubisoftProsIndeedClassified.txt")
ubisoftIndeedNeg = read_review_file("Indeed/ubisoftConsIndeedClassified.txt")

# Activision reviews
activisionPos = read_review_file("Glassdoor/activisionProsClassified.txt")
activisionNeg = read_review_file("Glassdoor/activisionConsClassified.txt")
activisionIndeedPos = read_review_file("Indeed/activisionProsIndeedClassified.txt")
activisionIndeedNeg = read_review_file("Indeed/activisionConsIndeedClassified.txt")

In [None]:
# We need to store all the data in one big dataframe with the correct labels.
# https://cmdlinetips.com/2018/01/how-to-create-pandas-dataframe-from-multiple-lists/

# As per the tutorial above, we'll make two long lists, then put them into a 
# dictionary and use that to make the dataframe
features = []
labels = []

# valence: True = positive (label 1), False = negative (label 0).
# Lines that consist only of the "[LISTSEP]" separator are skipped.
def appendFilesToLabelsAndFeaturesList(valence, featuresList):
  for rawFeature in featuresList:
    feature = rawFeature.strip("\n")
    if feature != "[LISTSEP]":
      features.append(feature)
      labels.append(1 if valence else 0)

# Appending Riot reviews
appendFilesToLabelsAndFeaturesList(True, riotPos)
appendFilesToLabelsAndFeaturesList(False, riotNeg)
appendFilesToLabelsAndFeaturesList(True, riotIndeedPos)
appendFilesToLabelsAndFeaturesList(False, riotIndeedNeg)

# Appending Sony reviews
appendFilesToLabelsAndFeaturesList(True, sonyPos)
appendFilesToLabelsAndFeaturesList(False, sonyNeg)
appendFilesToLabelsAndFeaturesList(True, sonyIndeedPos)
appendFilesToLabelsAndFeaturesList(False, sonyIndeedNeg)

# Appending Ubisoft reviews
appendFilesToLabelsAndFeaturesList(True, ubisoftPos)
appendFilesToLabelsAndFeaturesList(False, ubisoftNeg)
appendFilesToLabelsAndFeaturesList(True, ubisoftIndeedPos)
appendFilesToLabelsAndFeaturesList(False, ubisoftIndeedNeg)

# Appending Activision reviews
appendFilesToLabelsAndFeaturesList(True, activisionPos)
appendFilesToLabelsAndFeaturesList(False, activisionNeg)
appendFilesToLabelsAndFeaturesList(True, activisionIndeedPos)
appendFilesToLabelsAndFeaturesList(False, activisionIndeedNeg)

In [None]:
# https://www.geeksforgeeks.org/python-shuffle-two-lists-with-same-order/
# Python3 code to demonstrate working of shuffle two lists with same order
# using zip() + * operator + shuffle()
import random

# Shuffle two lists with same order
temp = list(zip(features, labels))
random.shuffle(temp)

features, labels = zip(*temp)
# These come out as tuples and must be cast to lists
features, labels = list(features), list(labels)

In [None]:
dictionary = {'features': features, 'labels': labels}
df = pd.DataFrame(dictionary)

In [None]:
df

| | features | labels |
|---|---|---|
| 0 | The Oceania branch of Riot is also a little di... | 1 |
| 1 | Everybody is really open and ready to listen t... | 1 |
| 2 | Here in HK office , I know everybody , that ’ ... | 1 |
| 3 | ——— Multi Cultural office ——— We have people f... | 1 |
| 4 | It 's a great place for someone who is a gamer... | 1 |
| ... | ... | ... |
| 12295 | Compensation , hour , Nda | 0 |
| 12296 | Overworked , under paid contract work , very l... | 0 |
| 12297 | Not being paid enough | 0 |
| 12298 | salary are not very high | 0 |


## BERT

In [None]:
# Import BERT-base pretrained model
# https://huggingface.co/bert-base-uncased
bert = AutoModel.from_pretrained('bert-base-uncased')

# Load the fast BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Now that we have our dataframe sorted as two columns, one with features (in this case, our review sentences) and the second with labels (0 or 1 indicating negative or positive), we can go ahead and split up our data into training, validation, and testing sets.

In [None]:
# We'll use the ratio .70: .15: .15, first splitting up into 0.7 and 0.3, then 
# splitting the 0.3 in half.
train_text, temp_text, train_labels, temp_labels = train_test_split(df['features'], df['labels'], 
                                                                    random_state=2021, 
                                                                    test_size=0.3, 
                                                                    stratify=df['labels'])


val_text, test_text, val_labels, test_labels = train_test_split(temp_text, temp_labels, 
                                                                random_state=2021, 
                                                                test_size=0.5, 
                                                                stratify=temp_labels)
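
As a quick check on the 70/15/15 arithmetic, the split sizes can be reproduced in plain Python. This is a minimal sketch under two assumptions: that the dataframe holds 12,300 rows (the figure implied by the shapes printed later in this notebook), and that `train_test_split` rounds the test portion up and gives the remainder to the training portion. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test portion is rounded up and the training portion
    receives whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_rows = 12300  # assumed row count of df
n_train, n_temp = split_sizes(n_rows, 0.3)   # first split: 0.7 / 0.3
n_val, n_test = split_sizes(n_temp, 0.5)     # second split: halve the 0.3
print(n_train, n_val, n_test)  # 8610 1845 1845
```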

Now we're going to tokenize the data and encode it into a format that BERT can read. Under the hood, tokenization is the separation of sentences into their tokens (which look a lot like words but are often more granular) and the addition of the `[CLS]` and `[SEP]` tokens at the beginning and end of the sequence. Then, encoding means transforming tokens into their `input_ids`, which are integers.

In [None]:
tokenizedTrain = train_text.apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
tokenizedVal = val_text.apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
tokenizedTest = test_text.apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
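
To make the tokenize-then-encode step concrete, here is a toy version in plain Python. It is only a sketch of the idea: the real tokenizer uses a vocabulary of roughly 30,000 entries and more careful text normalization. The ids for `[UNK]`, `[CLS]` and `[SEP]` match `bert-base-uncased`, but the word and subword ids, the vocabulary itself, and the helper names `wordpiece` and `toy_encode` are made up for illustration.

```python
# Toy vocabulary. Special-token ids follow bert-base-uncased;
# the word and subword ids are invented for this example.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "great": 2000, "pay": 2001, "over": 2002, "##work": 2003, "##ed": 2004,
}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def toy_encode(sentence, vocab=TOY_VOCAB):
    """Tokenize, add [CLS]/[SEP], and map every token to its integer id."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[token] for token in tokens]

tokens, input_ids = toy_encode("Great pay overworked")
print(tokens)     # ['[CLS]', 'great', 'pay', 'over', '##work', '##ed', '[SEP]']
print(input_ids)  # [101, 2000, 2001, 2002, 2003, 2004, 102]
```

Note how "overworked" becomes three tokens, which is why encoded sequences are often longer than the number of words in the sentence.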

Now we have a number of encoded token vectors of varying lengths. We need to pad them all to the same length so that we can represent all the vectors as a single 2D array and have them processed as a batch.

In [None]:
# Given a list of token sequences, this returns the length of the longest sequence.
def determineMaxLength(tokenized):
  max_len = 0
  for i in tokenized.values:
      if len(i) > max_len:
          max_len = len(i)
  return max_len

maxLenTrain = determineMaxLength(tokenizedTrain)
maxLenVal = determineMaxLength(tokenizedVal)
maxLenTest = determineMaxLength(tokenizedTest)

# We'll take the longest sequence across all three data sets and use its length
# to determine how much we should pad each sequence.
max_len = max(maxLenTrain, maxLenVal, maxLenTest)

paddedTrain = np.array([i + [0]*(max_len-len(i)) for i in tokenizedTrain.values])
paddedVal = np.array([i + [0]*(max_len-len(i)) for i in tokenizedVal.values])
paddedTest = np.array([i + [0]*(max_len-len(i)) for i in tokenizedTest.values])
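
The padding step can be seen on a small scale in the sketch below. The encoded sentences are hypothetical, and `pad_to_length` is an illustrative helper rather than a library function; the only fact taken from BERT is that id 0 is the `[PAD]` token in `bert-base-uncased`.

```python
PAD_ID = 0  # id of the [PAD] token in bert-base-uncased

def pad_to_length(sequences, length, pad_id=PAD_ID):
    """Right-pad every id sequence with pad_id up to the given length."""
    return [seq + [pad_id] * (length - len(seq)) for seq in sequences]

# Hypothetical encoded sentences of different lengths.
encoded = [
    [101, 2000, 2001, 102],
    [101, 2002, 102],
    [101, 2000, 2001, 2002, 2003, 102],
]
longest = max(len(seq) for seq in encoded)
padded = pad_to_length(encoded, longest)
for row in padded:
    print(row)
```

Every row now has the same length, so the list can be turned into a rectangular array. As an alternative, calling the Hugging Face tokenizer directly on a list of texts with `padding=True` pads the batch in the same call.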

In [None]:
# As a sanity check, we can look at the shape of our training data array
np.array(paddedTrain).shape

(8610, 222)

In [None]:
features = df['features']
labels = df['labels']

In [None]:
def labelsObjectToList(labels):
  labelsList = []
  for label in labels:
    labelsList.append(int(label))
  return labelsList

In [None]:
# We convert all this tokenized data into a form that PyTorch can use.
train_seq = torch.tensor(paddedTrain)
train_mask = torch.tensor(np.where(paddedTrain != 0, 1, 0))
train_y = torch.tensor(labelsObjectToList(train_labels))

val_seq = torch.tensor(paddedVal)
val_mask = torch.tensor(np.where(paddedVal != 0, 1, 0))
val_y = torch.tensor(labelsObjectToList(val_labels))

# Only the first half of the test split is kept for evaluation.
halfPaddedTest = paddedTest[:len(paddedTest)//2]
test_seq = torch.tensor(halfPaddedTest)
test_mask = torch.tensor(np.where(halfPaddedTest != 0, 1, 0))
test_y = torch.tensor(labelsObjectToList(test_labels[:len(test_labels)//2]))
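
The `np.where(... != 0, 1, 0)` calls above build the attention masks: a 1 wherever the sequence holds a real token and a 0 wherever it holds padding, so that BERT ignores the padded positions. A plain-Python sketch of the same rule, using hypothetical padded rows and an illustrative helper name:

```python
PAD_ID = 0  # id of the [PAD] token in bert-base-uncased

def attention_mask(padded_rows, pad_id=PAD_ID):
    """Return 1 for every real token and 0 for every padding position."""
    return [[int(token_id != pad_id) for token_id in row] for row in padded_rows]

# Hypothetical padded id sequences.
padded_rows = [
    [101, 2000, 2001, 102, 0, 0],
    [101, 2002, 102, 0, 0, 0],
]
mask = attention_mask(padded_rows)
print(mask)  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0]]
```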

In [None]:
print(len(test_labels[:len(test_labels)//2]))

922


# IMPORTANT NOTE 
After this point, the code is DIRECTLY taken from 
https://github.com/Himabindugssn/Sentiment-classification-using-transformers with little to no modification.

In [None]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# define a batch size
batch_size = 64

# wrap tensors
train_data = TensorDataset(train_seq, train_mask, train_y)

# sampler for sampling the data during training
train_sampler = RandomSampler(train_data)

# dataLoader for train set
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# wrap tensors
val_data = TensorDataset(val_seq, val_mask, val_y)

# sampler for iterating over the validation data in order
val_sampler = SequentialSampler(val_data)

# dataLoader for validation set
val_dataloader = DataLoader(val_data, sampler = val_sampler, batch_size=batch_size)

In [None]:
# freeze all pre-trained BERT weights; only the layers added on top will be trained

for param in bert.parameters():
    param.requires_grad = False

In [None]:
class BERT_architecture(nn.Module):

    def __init__(self, bert):
      
      super(BERT_architecture, self).__init__()

      self.bert = bert 
      
      # dropout layer
      self.dropout = nn.Dropout(0.2)
      
      # relu activation function
      self.relu =  nn.ReLU()

      # dense layer 1
      self.fc1 = nn.Linear(768,512)
      
      # dense layer 2 (Output layer)
      self.fc2 = nn.Linear(512,2)

      # log-softmax activation function (its output is what nn.NLLLoss expects)
      self.softmax = nn.LogSoftmax(dim=1)

    #define the forward pass
    def forward(self, sent_id, mask):

      #pass the inputs to the model  
      _, cls_hs = self.bert(sent_id, attention_mask=mask, return_dict=False)
      
      x = self.fc1(cls_hs)

      x = self.relu(x)

      x = self.dropout(x)

      # output layer
      x = self.fc2(x)
      
      # apply log-softmax activation
      x = self.softmax(x)

      return x
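
Because the BERT weights were frozen in the cell above, the only parameters that receive gradient updates are those of the two dense layers. A short worked count, using the layer sizes from the class definition (`linear_param_count` is an illustrative helper):

```python
def linear_param_count(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

fc1_params = linear_param_count(768, 512)  # dense layer 1
fc2_params = linear_param_count(512, 2)    # output layer
print(fc1_params, fc2_params, fc1_params + fc2_params)  # 393728 1026 394754
```

So roughly 395 thousand parameters are trained, while the pre-trained encoder underneath stays fixed.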

In [None]:
# pass the pre-trained BERT to our defined architecture
model = BERT_architecture(bert)

In [None]:
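# optimizer: torch.optim.AdamW, since the AdamW class in transformers is deprecated
from torch.optim import AdamW

# define the optimizer
# eps and weight_decay are set explicitly to the defaults of the transformers
# implementation, because the torch defaults differ
optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-6, weight_decay=0.0)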



In [None]:
from sklearn.utils.class_weight import compute_class_weight

#compute the class weights
class_weights = compute_class_weight(class_weight = "balanced",
                                        classes = np.unique(train_labels),
                                        y = train_labels 
                                     )
print("class weights are {} for {}".format(class_weights,np.unique(train_labels)))

class weights are [1.08849558 0.92481203] for [0 1]


In [None]:
# count of both categories of training labels
train_labels.value_counts()

1    4655
0    3955
Name: labels, dtype: int64
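
The class weights printed earlier follow from these counts. With `class_weight="balanced"`, scikit-learn uses the heuristic `n_samples / (n_classes * count_of_class)`. The sketch below recomputes the weights by hand; `balanced_class_weights` is an illustrative helper, not the scikit-learn function.

```python
def balanced_class_weights(class_counts):
    """weight_c = n_samples / (n_classes * count_c)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

train_counts = {0: 3955, 1: 4655}  # counts printed above
weights_by_class = balanced_class_weights(train_counts)
print({label: round(w, 8) for label, w in weights_by_class.items()})
# {0: 1.08849558, 1: 0.92481203}
```

The rarer negative class gets a weight slightly above 1 and the more frequent positive class a weight slightly below 1.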

In [None]:
#wrap class weights in tensor
weights= torch.tensor(class_weights,dtype=torch.float)

# define loss function
# add weights to handle the "imbalance" in the dataset
cross_entropy  = nn.NLLLoss(weight=weights) 

# number of training epochs
epochs = 2
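
The model ends in a log-softmax and the loss is `nn.NLLLoss`; together they amount to a class-weighted cross-entropy. The following plain-Python sketch works through that computation on hypothetical logits. The helper names are illustrative, and the averaging rule (divide by the sum of the weights of the true classes) is how PyTorch's `NLLLoss` computes its mean when a weight tensor is supplied.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax of one row of logits."""
    shift = max(logits)
    log_sum = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return [z - log_sum for z in logits]

def weighted_nll(log_probs, targets, class_weights):
    """Weighted mean negative log-likelihood over a batch.

    Each example's loss is scaled by the weight of its true class, and the
    sum is divided by the total weight of the true classes.
    """
    total = sum(-class_weights[t] * row[t] for row, t in zip(log_probs, targets))
    return total / sum(class_weights[t] for t in targets)

# Hypothetical logits for three sentences and their gold labels.
logits = [[2.0, 0.0], [0.0, 0.0], [-1.0, 1.0]]
targets = [0, 1, 1]
class_weights = [1.08849558, 0.92481203]  # weights computed earlier in the notebook

log_probs = [log_softmax(row) for row in logits]
loss = weighted_nll(log_probs, targets, class_weights)
print(round(loss, 4))
```

The second sentence, with equal logits for both classes, contributes a per-example loss of ln 2, while the two confidently correct sentences contribute much less.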

In [None]:
# function to train the model
def train():
  
  model.train()

  total_loss, total_accuracy = 0, 0
  
  # empty list to save model predictions
  total_preds=[]
  
  # iterate over batches
  for step,batch in enumerate(train_dataloader):
    
    # progress update after every 50 batches.
    if step % 50 == 0 and not step == 0:
      print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(train_dataloader)))

    # # push the batch to gpu
    # batch = [r.to(device) for r in batch]
 
    sent_id, mask, labels = batch

    # clear previously calculated gradients 
    model.zero_grad()        

    # get model predictions for the current batch
    preds = model(sent_id, mask)

    # compute the loss between actual and predicted values
    loss = cross_entropy(preds, labels)

    # add on to the total loss
    total_loss = total_loss + loss.item()

    # backward pass to calculate the gradients
    loss.backward()

    # clip the gradient norm to 1.0, which helps prevent the exploding gradient problem
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    # update parameters
    optimizer.step()

    preds = preds.detach().numpy()

    # append the model predictions
    total_preds.append(preds)

  # compute the training loss of the epoch
  avg_loss = total_loss / len(train_dataloader)
  
  # predictions are in the form of (no. of batches, size of batch, no. of classes).
  # reshape the predictions in form of (number of samples, no. of classes)
  total_preds  = np.concatenate(total_preds, axis=0)

  #returns the loss and predictions
  return avg_loss, total_preds
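
The gradient clipping step inside `train()` rescales all gradients by one common factor whenever their joint L2 norm exceeds the limit. A minimal sketch of that rule on a flat list of hypothetical gradient values follows; `clip_by_global_norm` is an illustrative helper, and the small `eps` in the denominator mirrors the guard against division by zero that PyTorch uses.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0]  # hypothetical gradient values with L2 norm 5
clipped, norm_before = clip_by_global_norm(grads)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, round(norm_after, 4))  # 5.0 1.0
```

Gradients whose norm is already below the limit pass through unchanged, and the direction of the update is preserved in both cases.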

In [None]:
# function for evaluating the model
def evaluate():
  
  print("\nEvaluating...")
  
  # deactivate dropout layers
  model.eval()

  total_loss, total_accuracy = 0, 0
  
  # empty list to save the model predictions
  total_preds = []

  # iterate over batches
  for step,batch in enumerate(val_dataloader):
    
    # Progress update every 50 batches.
    if step % 50 == 0 and not step == 0:
            
      # Report progress.
      print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(val_dataloader)))

    # # push the batch to gpu
    # batch = [t.to(device) for t in batch]

    sent_id, mask, labels = batch

    # deactivate autograd
    with torch.no_grad():
      
      # model predictions
      preds = model(sent_id, mask)

      # compute the validation loss between actual and predicted values
      loss = cross_entropy(preds,labels)

      total_loss = total_loss + loss.item()

      preds = preds.detach().numpy()

      total_preds.append(preds)

  # compute the validation loss of the epoch
  avg_loss = total_loss / len(val_dataloader) 

  # reshape the predictions in form of (number of samples, no. of classes)
  total_preds  = np.concatenate(total_preds, axis=0)

  return avg_loss, total_preds

In [None]:
# set initial loss to infinite
best_valid_loss = float('inf')

# empty lists to store training and validation loss of each epoch
train_losses=[]
valid_losses=[]

#for each epoch
for epoch in range(epochs):
     
    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))
    
    #train model
    train_loss, _ = train()
    
    #evaluate model
    valid_loss, _  = evaluate()
    
    #save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
    
    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    
    print('\nTraining Loss: {}'.format(train_loss))
    print('Validation Loss: {}'.format(valid_loss))


 Epoch 1 / 2
  Batch    50  of    135.
  Batch   100  of    135.

Evaluating...

Training Loss: 0.6847317448368779
Validation Loss: 0.6730107747275254

 Epoch 2 / 2
  Batch    50  of    135.
  Batch   100  of    135.

Evaluating...

Training Loss: 0.6693249066670736
Validation Loss: 0.6617287130191408
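
A useful reference point for reading these numbers: a two-class model that always predicts 50/50 has a loss of exactly ln 2, about 0.693, and the class weights do not change that because every example contributes the same value. The sketch below simply compares the losses from the log above with that chance level.

```python
import math

chance_loss = math.log(2)  # -log(0.5), the loss of a 50/50 guess on two classes

# Losses printed in the training log above.
reported = {
    "epoch 1 train": 0.6847317448368779,
    "epoch 1 valid": 0.6730107747275254,
    "epoch 2 train": 0.6693249066670736,
    "epoch 2 valid": 0.6617287130191408,
}

print(f"chance level: {chance_loss:.4f}")
for name, value in reported.items():
    print(f"{name}: {value:.4f} ({chance_loss - value:.4f} below chance)")
```

After two epochs the losses sit only about 0.01 to 0.03 below chance level, which is consistent with the modest test accuracy reported further down.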


# IMPORTANT NOTE
After this point, the user should take care to save the trained model by going to the file icon on the left sidebar of the screen and double-clicking on the file titled `saved_weights.pt`. Then, store it in a safe place in Drive.

In [None]:
#load weights of best model
path = 'drive/MyDrive/compling_final/saved_weights.pt'
model.load_state_dict(torch.load(path))

<All keys matched successfully>

In [None]:
# get predictions for test data
with torch.no_grad():
  preds = model(test_seq, test_mask)
  # preds = preds.detach().cpu().numpy()
  preds = preds.detach().numpy()

In [None]:
from sklearn.metrics import classification_report

In [None]:
pred = np.argmax(preds, axis = 1)
print(classification_report(test_y, pred))

              precision    recall  f1-score   support

           0       0.61      0.40      0.48       425
           1       0.60      0.78      0.68       497

    accuracy                           0.61       922
   macro avg       0.61      0.59      0.58       922
weighted avg       0.61      0.61      0.59       922
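
For readers unfamiliar with the columns of this report, the sketch below computes precision, recall and F1 for each class from raw counts. The labels and predictions are a small hypothetical example, not the notebook's data, and `per_class_metrics` is an illustrative helper.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical gold labels and predictions for ten sentences.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

for label in (0, 1):
    p, r, f = per_class_metrics(y_true, y_pred, label)
    print(label, round(p, 2), round(r, 2), round(f, 2))
# 0 0.67 0.5 0.57
# 1 0.71 0.83 0.77
```

In the report above, the low recall for class 0 means that many negative sentences were labeled positive, the same pattern as in this toy example.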



In [None]:
# Count how many test sentences were predicted positive (1) and negative (0)
def countPredictedLabels(pred):
  pos = 0
  neg = 0
  for integer in pred:
    if integer == 0:
      neg += 1
    else:
      pos += 1
  return pos, neg

posCount, negCount = countPredictedLabels(pred)
print(posCount, negCount)