In [43]:
!pip install shap torch datasets



In [44]:
!pip install accelerate -U



In [45]:
# AdamW and get_linear_schedule_with_warmup were imported but never used; Trainer sets up its own optimizer and schedule
from transformers import BertTokenizerFast, DataCollatorWithPadding, AutoModelForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import DataLoader, TensorDataset, RandomSampler, SequentialSampler
from sklearn.model_selection import train_test_split
import numpy as np
import torch
import pandas as pd
import matplotlib.pyplot as plt
import shap
import scipy as sp
from datasets import Dataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Using device: cpu


In [46]:
torch_rng = torch.manual_seed(42)  # fixed seed for reproducibility (Generator.seed() would draw a new random seed on every run)
# paths to the review files
amazon_pos = "amazon-pos.txt"
amazon_neg = "amazon-neg.txt"
google_pos = "google-pos.txt"
google_neg = "google-neg.txt"

In [47]:
# Initializing lists for storing reviews and their labels
sentences = []
labels = []

# Helper function to read reviews from a file (one per line) and assign each the given label
def data_array(filename, label):
  with open(filename) as file:  # the file is closed automatically when the block ends
    for line in file:
      sentences.append(line.strip())
      labels.append(label)

# Positive = 0 and Negative = 1
data_array(amazon_pos, 0)
data_array(google_pos, 0)
data_array(amazon_neg, 1)
data_array(google_neg, 1)

I initially tried training and testing on the entire dataset, but the code took over an hour to run. The cell below therefore downsamples the data by keeping every N-th review, which cuts the dataset size by a factor of N and speeds up computation. With N = 1, the full dataset is kept.

In [48]:
# downsampling
N = 1
sentences = sentences[::N]
labels = labels[::N]
print(f"Downsampled, {len(sentences)}")

Downsampled, 2000


Longer sequences take longer to evaluate and the reviews vary in length, but I still set padding to True, which pads every review to the length of the longest one. Downsampling was enough to reduce the runtime. Alternatively, I could have grouped reviews of similar length together and padded each group only to its own longest review.
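To make the cost of padding concrete, here is a minimal sketch that compares the number of token positions processed when everything is padded to the global maximum against padding per length-sorted batch. The token counts and the helper names `padded_tokens_global` and `padded_tokens_grouped` are hypothetical and not part of the notebook. In practice, the usual route with transformers is to tokenize without padding and let `DataCollatorWithPadding` pad each batch; depending on the installed version, `TrainingArguments` may also accept `group_by_length=True` to batch similar lengths together.

```python
# Illustrative only: these token counts stand in for tokenized review lengths.
def padded_tokens_global(lengths):
    """Total token positions when every sequence is padded to the overall longest."""
    return max(lengths) * len(lengths)


def padded_tokens_grouped(lengths, batch_size):
    """Total token positions when sequences are sorted by length and padded per batch."""
    ordered = sorted(lengths)
    total = 0
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


lengths = [5, 7, 6, 48, 9, 52, 8, 50]  # hypothetical token counts
print(padded_tokens_global(lengths))      # 8 sequences, each padded to 52 tokens
print(padded_tokens_grouped(lengths, 4))  # two batches, padded to 8 and 52 tokens
```

The gap between the two totals grows with the spread of review lengths, which is why grouping helps most on datasets that mix very short and very long reviews.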

In [49]:
# splitting the data into training and testing sets (80% training 20% testing)
X_train, X_test, y_train, y_test = train_test_split(sentences, labels, test_size=0.2, random_state=42)
# loading the tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Tokenize the training and testing data and add labels to the encodings
train_encodings = dict(tokenizer(X_train, truncation=True, padding=True))
test_encodings = dict(tokenizer(X_test, truncation=True, padding=True))
train_encodings["label"] = y_train
test_encodings["label"] = y_test

# Creating datasets from the encodings
train_data = Dataset.from_dict(train_encodings)
test_data = Dataset.from_dict(test_encodings)

epochs = 1
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=epochs,
)



In [50]:
# Using a pre-trained BERT model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
    data_collator=DataCollatorWithPadding(tokenizer, padding=True, return_tensors="pt"),
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("test-trainer")
predictions = trainer.predict(test_dataset=test_data)
confusion_matrix = np.zeros((2,2))
for i, row in enumerate(predictions.predictions):
    confusion_matrix[test_data["label"][i]][np.argmax(row)] += 1

print(confusion_matrix)

# I needed a function for scoring input sentences and found a custom function
# that takes in a list of strings and outputs scores.
# https://shap.readthedocs.io/en/latest/example_notebooks/text_examples/sentiment_analysis/Using%20custom%20functions%20and%20tokenizers.html
model.eval()
model.to(device)
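def predict(x):
    # Encode every string to a fixed length of 128 tokens and move the batch to the model's device
    tv = torch.tensor(
        [
            tokenizer.encode(v, padding="max_length", max_length=128, truncation=True)
            for v in x
        ]
    ).to(device)
    # BERT's [PAD] token id is 0, so non-zero ids mark real tokens
    attention_mask = (tv != 0).type(torch.int64)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(tv, attention_mask=attention_mask)[0].cpu().numpy()
    # Softmax over the two classes, then logit, so the scores are in log-odds units
    scores = (np.exp(outputs).T / np.exp(outputs).sum(-1)).T
    val = sp.special.logit(scores)
    return val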

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[[182.  17.]
 [ 13. 188.]]
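The `predict` function above applies a softmax and then a logit to the model's raw outputs. For a two-class model, that combination reduces to the difference between the two raw logits, which the sketch below checks on hypothetical values. The helper names and the input `raw` are assumptions for illustration, and subtracting the maximum inside the softmax is a numerical-stability step that the notebook's version does not include.

```python
import math


def softmax(logits):
    """Softmax with the maximum subtracted first for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))


raw = [2.0, -1.0]  # hypothetical [POS, NEG] logits, not model output
probs = softmax(raw)
scores = [logit(p) for p in probs]

# With two classes: logit(p_POS) = z_POS - z_NEG and logit(p_NEG) = z_NEG - z_POS
print(scores)
```

Working in log-odds keeps the per-word SHAP contributions additive, which tends to be easier to read than contributions on the probability scale.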


In [66]:
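# Rows are true labels and columns are predictions.
# Label 1 (negative review) is treated as the "positive" class for these metrics.
TP = confusion_matrix[1, 1]
TN = confusion_matrix[0, 0]
FP = confusion_matrix[0, 1]
FN = confusion_matrix[1, 0]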

# Calculate Accuracy
accuracy = (TP + TN) / (TP + TN + FP + FN)

# Calculate Precision
precision = TP / (TP + FP)

# Calculate Recall
recall = TP / (TP + FN)

# Calculate Specificity
specificity = TN / (TN + FP)

# Calculate F1 Score
f1_score = 2 * (precision * recall) / (precision + recall)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("Specificity:", specificity)
print("F1 Score:", f1_score)

Accuracy: 0.925
Precision: 0.9170731707317074
Recall: 0.9353233830845771
Specificity: 0.914572864321608
F1 Score: 0.9261083743842363


Based on the above metrics, the model performs well on the test data: accuracy is 92.5%, and the F1 score of about 0.93 indicates a good balance between precision and recall (few false positives and few false negatives).
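Because the labels are encoded as Positive = 0 and Negative = 1, the precision and recall reported above describe how well negative reviews are detected. As a rough cross-check, the sketch below computes the same quantities from the reported confusion matrix for each class in turn. The helper `per_class_metrics` is a hypothetical name introduced here, not part of the notebook.

```python
def per_class_metrics(matrix, positive):
    """Precision, recall and F1 with `positive` as the class of interest.

    matrix[true][pred] holds counts, matching the confusion matrix built above.
    """
    n = len(matrix)
    tp = matrix[positive][positive]
    fp = sum(matrix[t][positive] for t in range(n) if t != positive)
    fn = sum(matrix[positive][p] for p in range(n) if p != positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


matrix = [[182, 17], [13, 188]]  # counts printed by the training cell
for label, name in [(0, "POS reviews"), (1, "NEG reviews")]:
    p, r, f = per_class_metrics(matrix, label)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

The recall for label 0 equals the specificity reported above, since both measure the share of truly positive reviews that the model labels as positive.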

In [40]:
trainer.save_model("./model")

Next, I create a SHAP explainer for the model. I select 5 short sample sentences from the test set (to save computation time), then generate and visualize the SHAP values for each of them. Analyzing them helps me understand how the model makes its decisions.

Analysis:
1. With the company being so large, sometimes it's difficult to get significant recognition of novel work.


The words "difficult" and "novel" appear to have the largest impact: "difficult" pushes the prediction towards negative sentiment, while "novel" pushes it slightly towards positive sentiment.

In [63]:
# indices of the test reviews sorted by POS logit (column 0): the first 5 are the most confidently negative, the last 5 the most confidently positive
indices = np.arange(np.shape(predictions.predictions)[0])
sort_idx = sorted(indices, key=lambda x: predictions.predictions[x, 0])
neg = sort_idx[:5]
pos = sort_idx[-5:]

In [64]:
# the 5 reviews the model rates most confidently positive and the 5 it rates most confidently negative
best_5 = [X_test[i] for i in pos]
worst_5 = [X_test[i] for i in neg]
print(best_5)
print(worst_5)

['Great Benefits. Diverse Workforce. Self Guided', 'Flexible, lots of opportunities, good environment.', 'Google has very good benefit including health insurance, 401k and free breafast, lunch, dinner.', 'Good Salary and benefits and emerging technology that gives good work experience. Work environment is good, and there are some employee discount perks.', 'Benefits and incentives are amazing at Amazon.']
['working conditions, not much leniency with variable shift hours, lack of team leads and managers on the warehouse floor during shifts', 'Frustrating management at times, hard to feel meanginful with so many people, tiring work sometimes.', 'Repetitive, easy to get overlooked, high turnover rate', 'Can be hard on family life throughout the year.', 'Physically demanding.Daily Rates/goals. Timeoff takes a while to accrue your first year. Long hours on your feet...shifts are ten hours plus.Management/staff have high turnover.']
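The selection above ranks reviews by the POS logit alone. One possible alternative, shown here only as a sketch, is to rank by the margin between the POS and NEG logits, which may track the model's confidence more closely when both logits are large. The helper `rank_by_margin` and the logits below are hypothetical, not model output.

```python
def rank_by_margin(logits, k):
    """Return (most_negative, most_positive) index lists, each of length k.

    Each row of `logits` is [POS, NEG], matching the label order used above.
    """
    margins = [row[0] - row[1] for row in logits]
    order = sorted(range(len(logits)), key=lambda i: margins[i])
    return order[:k], order[-k:]


# Hypothetical logits for five reviews
logits = [[2.1, -1.9], [-0.3, 0.4], [3.0, 2.5], [-2.2, 2.0], [0.1, 0.0]]
neg_idx, pos_idx = rank_by_margin(logits, 2)
print(neg_idx, pos_idx)
```

In this toy data the third review has the highest POS logit but a small margin, so sorting by the POS logit alone would rank it as the most confident positive, while the margin ranking puts the first review on top.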


In [65]:
explainer = shap.Explainer(model=predict, masker=shap.maskers.Text(r"\s"), output_names=["POS", "NEG"])
print("Explainer constructed")
shap_values = explainer(best_5)
shap.plots.text(shap_values)

Explainer constructed


PartitionExplainer explainer: 6it [07:37, 91.50s/it]


The visualizations are consistent with the model classifying these sentences correctly, since every sentence above has a net positive SHAP value. Interestingly, 4 of the 5 sentences contain the word 'benefit' or 'benefits'. Words like 'amazing', 'good', and 'great' also add to the net positive score.

In [67]:
explainer = shap.Explainer(model=predict, masker=shap.maskers.Text(r"\s"), output_names=["POS", "NEG"])
print("Explainer constructed")
shap_values = explainer(worst_5)
shap.plots.text(shap_values)

Explainer constructed


PartitionExplainer explainer: 6it [09:33, 114.65s/it]


The visualizations are consistent with the model classifying these sentences correctly, since every sentence above has a net negative SHAP value. Interestingly, 'repetitive' and 'hard' are the words with the largest negative contributions.
Words such as 'conditions', 'feet', and 'have' also receive negative contributions.

Since SHAP value calculations take a long time, I train the model on all of the data but compute SHAP values only on a downsampled subset of the test set, to understand how the model behaves in general.


In [68]:
# downsampling
M = 80
sample_sentences = X_test[::M]
print(f"Downsampled, {len(sample_sentences)}")

Downsampled, 5


In [69]:
explainer = shap.Explainer(model=predict, masker=shap.maskers.Text(r"\s"), output_names=["POS", "NEG"])
print("Explainer constructed")
shap_values = explainer(sample_sentences)
shap.plots.text(shap_values)

Explainer constructed


PartitionExplainer explainer: 6it [02:37, 39.30s/it]


The above plots help us understand how the model behaves on typical sentences (rather than only the most confidently classified ones).
It appears to classify them well. It picks up on words like 'great', 'good', and 'wonderful', which contribute to the overall positive score.

Odd reviews such as "-Velocity" were classified as negative, with "-" contributing positively and "velocity" contributing negatively.