This notebook explores using BERT for text classification.  Before starting, change the runtime to GPU: Runtime > Change runtime type > Hardware accelerator: GPU.

First, create a folder named ANLP23_data in your Google Drive account, and copy this notebook to that folder, along with `data/convote`, `data/loc` and `data/lmrd` from the GitHub repo.  Double-click this notebook in your Drive account to open it in the Google Colab environment, then execute the cells from there.

Let's give this notebook access to the data in your ANLP23_data folder so we can train and evaluate BERT on the `convote` data.  (Note that you are only providing this access to yourself as you execute this notebook.)  You can give Colab notebooks access to other data in the same way: upload it to your Drive account first, then provide access here.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install transformers
!pip install sentence-transformers
!pip install pysbd

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86.0/86.0 kB 2.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 29.3 MB/s eta 0:00:00
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125923 sha256=6cdff6ba0c44b2a8d97c8fd75b895320337fd005e83a7cd40238c956d9dd0f43
  Stored in directory: /root/.cache/pip/wheels/62/f2/10/1e606fd5f02395388f74e7462910fe851042f97238cbbd902f
Successfully built sentence-tr

In [None]:
import pysbd
text = "My name is Jonas E. Smith. Please turn to p. 55."
seg = pysbd.Segmenter(language="en", clean=False)
print(seg.segment(text))

['My name is Jonas E. Smith. ', 'Please turn to p. 55.']


In [None]:
!pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 126.0/126.0 kB 2.6 MB/s eta 0:00:00
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [None]:
from transformers import BertModel, BertTokenizer
import torch
from tqdm import tqdm
import torch.nn as nn
import numpy as np
import random
import time
from scipy import sparse
from sklearn import linear_model
from sklearn.metrics import classification_report, f1_score, accuracy_score
#from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from collections import Counter

Double-check that this notebook is running on the GPU (the cell below should print "Running on cuda").

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on {}".format(device))

Running on cuda


In [None]:
def read_labels(filename):
    labels={}
    with open(filename, "r", encoding="utf-8-sig") as file:
        for line in file:
            cols = line.split("\t")
            label = cols[0]
            if label not in labels:
                labels[label]=len(labels)
    return labels
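
To see the mapping that `read_labels` builds without touching the file system, here is a minimal sketch that applies the same logic to an in-memory list of lines. The helper name `build_label_map` and the sample rows are hypothetical and exist only for illustration.

```python
def build_label_map(lines):
    """Assign each distinct label (first tab-separated column) the next free integer id."""
    labels = {}
    for line in lines:
        label = line.split("\t")[0]
        if label not in labels:
            labels[label] = len(labels)
    return labels


# Hypothetical rows in the same "label<TAB>text" layout as train.tsv.
sample_lines = ["NK\tfirst example\n", "K\tsecond example\n", "NK\tthird example\n"]
label_map = build_label_map(sample_lines)
print(label_map)
```

Ids are handed out in order of first appearance, so the mapping depends on the order of rows in the file it is built from.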

In [None]:
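def read_data(filename, labels, max_data_points=None):
    """
    Read a tab-separated file of (label, text) rows.

    :param filename: path to the TSV file
    :param labels: dict mapping label strings to integer ids
    :param max_data_points: optional cap on the number of examples returned
    :return: (texts, label_ids), two parallel tuples in shuffled order
    """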
    data = []
    data_labels = []
    with open(filename, "r", encoding="utf-8-sig") as file:
        for line in file:
            cols = line.split("\t")
            label = cols[0]
            text = cols[1]

            data.append(text)
            data_labels.append(labels[label])


    # shuffle the data - regression training needs shuffles (see class of Hypothesis testing 2)
    tmp = list(zip(data, data_labels))
    random.shuffle(tmp)
    data, data_labels = zip(*tmp)

    if max_data_points is None:
        return data, data_labels

    return data[:max_data_points], data_labels[:max_data_points]
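
The zip, shuffle, unzip pattern in `read_data` keeps each text paired with its label. The sketch below isolates that pattern and adds an optional `seed` so a run can be repeated; the `seed` parameter and the name `shuffle_together` are assumptions for illustration and are not part of the notebook, which shuffles with the global random state.

```python
import random


def shuffle_together(texts, label_ids, seed=None):
    """Shuffle two parallel sequences with one permutation so pairs stay aligned."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    pairs = list(zip(texts, label_ids))
    rng.shuffle(pairs)
    if not pairs:
        return (), ()
    shuffled_texts, shuffled_labels = zip(*pairs)
    return shuffled_texts, shuffled_labels


texts = ["a", "b", "c", "d"]
label_ids = [0, 1, 0, 1]
shuffled_texts, shuffled_labels = shuffle_together(texts, label_ids, seed=13)

# Every text still carries its original label after shuffling.
original = dict(zip(texts, label_ids))
print(all(original[t] == y for t, y in zip(shuffled_texts, shuffled_labels)))
```

Because `read_data` truncates to `max_data_points` after shuffling, an unseeded run draws a different subset each time, which is worth keeping in mind when comparing results across runs.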

In [None]:
labels=read_labels("/content/train.tsv")
print(labels)

{'NK': 0, 'K': 1}


We'll limit the training and dev data to 1,000 data points for this exercise.

In [None]:
train_x, train_y=read_data("/content/train.tsv", labels, max_data_points=1000)

In [None]:
print(train_x)

("Everything (socially) gets SO much easier after high school. It's all up from here!\n", "Happy New Year to you, you lovely person! I like to think we're cooler than the cool kids...\n", "The thing is, we're not social enough to feel normal in a group of people. Maybe you might feel normal, but I definitely feel like I don't belong, so in that sense, I am a social outcast. There's nothing wrong with being a social outcast to be honest, it's just that social outcasts have to tone their expectations differently. They can't expect the same thing that normal people expect.\n", "I don't know what to say other than try talking to her. You might be able to fix it still. I think she just has a lot shit going on in her own life that made her snap when you made a harmless joke.\n", "Diagnosed with GAD, started Zoloft about 3 weeks ago, slowly tapering up to 50 mg. I say slowly tapering because I had some side effects that made me feel more anxious but discussed them with my psych and continued 

In [None]:
dev_x, dev_y=read_data("/content/dev.tsv", labels, max_data_points=1000)

In [None]:
def evaluate(model, all_x, all_y):
    model.eval()
    corr = 0.
    total = 0.
    all_labels = []
    all_predictions = []
    misclassified = []

    with torch.no_grad():
        for x, y in zip(all_x, all_y):
            y_preds = model.forward(x)
            for idx, y_pred in enumerate(y_preds):
                prediction = torch.argmax(y_pred)
                if prediction == y[idx]:
                    corr += 1.

                total += 1

                # Collect labels and predictions for F1 score
                all_labels.append(y[idx].item())
                all_predictions.append(prediction.item())

    # Calculate accuracy
    accuracy = corr / total
    print(f"Accuracy: {accuracy}")

    # Calculate F1 score
    f1 = f1_score(all_labels, all_predictions, average='weighted')
    print(f"F1 Score: {f1}")

    return accuracy, f1
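
`evaluate` reports a weighted F1 next to accuracy. As a rough illustration of what `average='weighted'` computes, the sketch below takes per-class F1 scores and averages them with weights proportional to each class's share of the gold labels. The function name `weighted_f1` and the toy labels are hypothetical; it is meant to mirror the idea, not to replace scikit-learn.

```python
from collections import Counter


def weighted_f1(gold, predicted):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(gold)
    total = len(gold)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # treat an undefined F1 as 0
        score += (count / total) * f1
    return score


# Toy case: a classifier that always predicts the majority class 0.
gold = [0, 0, 0, 1]
predicted = [0, 0, 0, 0]
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(accuracy, round(weighted_f1(gold, predicted), 3))
```

In this toy case accuracy is 0.75 while weighted F1 is about 0.643, because the minority class contributes an F1 of zero. This is one way the two metrics can diverge on imbalanced data.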

In [None]:
class BERTClassifier(nn.Module):

    def __init__(self, params):
        super().__init__()

        self.model_name=params["model_name"]
        self.tokenizer = BertTokenizer.from_pretrained(self.model_name, do_lower_case=params["doLowerCase"], do_basic_tokenize=False)
        self.bert = BertModel.from_pretrained(self.model_name)

        self.num_labels = params["label_length"]

        self.fc = nn.Linear(params["embedding_size"], self.num_labels)

    def get_batches(self, all_x, all_y, batch_size=32, max_toks=256):

        """ Get batches for input x, y data, with data tokenized according to the BERT tokenizer
        (and limited to a maximum number of WordPiece tokens) """

        batches_x=[]
        batches_y=[]

        for i in range(0, len(all_x), batch_size):

            current_batch=[]

            x=all_x[i:i+batch_size]

            batch_x = self.tokenizer(x, padding=True, truncation=True, return_tensors="pt", max_length=max_toks)
            batch_y=all_y[i:i+batch_size]

            batches_x.append(batch_x.to(device))
            batches_y.append(torch.LongTensor(batch_y).to(device))

        return batches_x, batches_y

    def forward(self, batch_x):

        bert_output = self.bert(input_ids=batch_x["input_ids"],
                         attention_mask=batch_x["attention_mask"],
                         token_type_ids=batch_x["token_type_ids"],
                         output_hidden_states=True)

        bert_hidden_states = bert_output['hidden_states']

        # We're going to represent an entire document just by its [CLS] embedding (at position 0)
        out = bert_hidden_states[-1][:,0,:]

        out = self.fc(out)

        return out#.squeeze()

Now let's train BERT on this data.  A few practicalities of this environment: if you encounter an out of memory error:

* Reset the notebook (Runtime > Factory reset runtime) and execute all cells from the beginning.
* If `max_toks` (passed to the tokenizer as `max_length`) is high, try reducing the `batch_size` in `get_batches` above.

Even on a GPU, BERT can take a long time to train, so you might experiment first with a smaller `max_data_points` above before running it on the full training data.
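
To make the batch-size trade-off concrete, here is a small sketch of the slicing that `get_batches` performs, without the tokenizer. The helper name `chunk` and the stand-in data are hypothetical. A smaller `batch_size` generally lowers the memory needed per step but increases the number of steps per epoch.

```python
import math


def chunk(items, batch_size):
    """Split a sequence into consecutive batches; the last one may be shorter."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


examples = list(range(1000))  # stand-in for 1,000 training texts
for batch_size in (32, 16, 8):
    batches = chunk(examples, batch_size)
    # The batch count is always the ceiling of n / batch_size.
    assert len(batches) == math.ceil(len(examples) / batch_size)
    print(batch_size, len(batches), len(batches[-1]))
```

With 1,000 examples, a batch size of 32 gives 32 batches (the last holding 8 examples), 16 gives 63 batches, and 8 gives 125.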

In [None]:
def train_and_evaluate(bert_model_name, model_filename, train_x, train_y, dev_x, dev_y, labels, embedding_size=768, doLowerCase=None):

  start_time=time.time()
  bert_model = BERTClassifier(params={"doLowerCase": doLowerCase, "model_name": bert_model_name, "embedding_size":embedding_size, "label_length": len(labels)})
  bert_model.to(device)

  batch_x, batch_y = bert_model.get_batches(train_x, train_y)
  dev_batch_x, dev_batch_y = bert_model.get_batches(dev_x, dev_y)

  optimizer = torch.optim.Adam(bert_model.parameters(), lr=1e-5)
  cross_entropy=nn.CrossEntropyLoss()

  num_epochs=15
  best_dev_acc = 0.

  for epoch in range(num_epochs):
      bert_model.train()

      # Train
      for x, y in tqdm(list(zip(batch_x, batch_y))):
          y_pred = bert_model.forward(x)
          loss = cross_entropy(y_pred.view(-1, bert_model.num_labels), y.view(-1))
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()

      # Evaluate
      dev_accuracy, dev_f1 = evaluate(bert_model, dev_batch_x, dev_batch_y)
      print(type(dev_accuracy))
      print(type(dev_f1))
      if epoch % 1 == 0:
          print("Epoch %s, dev accuracy: %.3f" % (epoch, dev_accuracy))
          if dev_accuracy > best_dev_acc:
              torch.save(bert_model.state_dict(), model_filename)
              best_dev_acc = dev_accuracy

  bert_model.load_state_dict(torch.load(model_filename))
  print("\nBest Performing Model achieves dev accuracy of : %.3f" % (best_dev_acc))
  print("Time: %.3f seconds ---" % (time.time() - start_time))


In [None]:
train_and_evaluate("bert-base-cased", "convote-bert-base-cased", train_x, train_y, dev_x, dev_y, labels, embedding_size=768, doLowerCase=False)

100%|██████████| 19/19 [00:24<00:00,  1.30s/it]


Accuracy: 0.77
F1 Score: 0.6699435028248587
<class 'float'>
<class 'numpy.float64'>
Epoch 0, dev accuracy: 0.770


100%|██████████| 19/19 [00:22<00:00,  1.19s/it]


Accuracy: 0.775
F1 Score: 0.6816285938159242
<class 'float'>
<class 'numpy.float64'>
Epoch 1, dev accuracy: 0.775


100%|██████████| 19/19 [00:23<00:00,  1.26s/it]


Accuracy: 0.81
F1 Score: 0.7588704318936875
<class 'float'>
<class 'numpy.float64'>
Epoch 2, dev accuracy: 0.810


100%|██████████| 19/19 [00:23<00:00,  1.26s/it]


Accuracy: 0.86
F1 Score: 0.8487199480181935
<class 'float'>
<class 'numpy.float64'>
Epoch 3, dev accuracy: 0.860


100%|██████████| 19/19 [00:23<00:00,  1.24s/it]


Accuracy: 0.845
F1 Score: 0.8335016686100277
<class 'float'>
<class 'numpy.float64'>
Epoch 4, dev accuracy: 0.845


100%|██████████| 19/19 [00:23<00:00,  1.25s/it]


Accuracy: 0.845
F1 Score: 0.8270242732993708
<class 'float'>
<class 'numpy.float64'>
Epoch 5, dev accuracy: 0.845


100%|██████████| 19/19 [00:23<00:00,  1.25s/it]


Accuracy: 0.855
F1 Score: 0.844243496441639
<class 'float'>
<class 'numpy.float64'>
Epoch 6, dev accuracy: 0.855


100%|██████████| 19/19 [00:23<00:00,  1.25s/it]


Accuracy: 0.845
F1 Score: 0.8455808903365909
<class 'float'>
<class 'numpy.float64'>
Epoch 7, dev accuracy: 0.845


100%|██████████| 19/19 [00:23<00:00,  1.25s/it]


Accuracy: 0.815
F1 Score: 0.8270994946021113
<class 'float'>
<class 'numpy.float64'>
Epoch 8, dev accuracy: 0.815


100%|██████████| 19/19 [00:23<00:00,  1.25s/it]


Accuracy: 0.89
F1 Score: 0.8915679824561402
<class 'float'>
<class 'numpy.float64'>
Epoch 9, dev accuracy: 0.890


100%|██████████| 19/19 [00:23<00:00,  1.25s/it]


Accuracy: 0.865
F1 Score: 0.8549853242732499
<class 'float'>
<class 'numpy.float64'>
Epoch 10, dev accuracy: 0.865


100%|██████████| 19/19 [00:23<00:00,  1.26s/it]


Accuracy: 0.825
F1 Score: 0.801913393756294
<class 'float'>
<class 'numpy.float64'>
Epoch 11, dev accuracy: 0.825


100%|██████████| 19/19 [00:23<00:00,  1.26s/it]


Accuracy: 0.83
F1 Score: 0.8163027940220923
<class 'float'>
<class 'numpy.float64'>
Epoch 12, dev accuracy: 0.830


100%|██████████| 19/19 [00:23<00:00,  1.25s/it]


Accuracy: 0.835
F1 Score: 0.8183209752419254
<class 'float'>
<class 'numpy.float64'>
Epoch 13, dev accuracy: 0.835


100%|██████████| 19/19 [00:23<00:00,  1.25s/it]


Accuracy: 0.815
F1 Score: 0.7874295190713102
<class 'float'>
<class 'numpy.float64'>
Epoch 14, dev accuracy: 0.815

Best Performing Model achieves dev accuracy of : 0.890
Time: 418.197 seconds ---
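
The training loop keeps the checkpoint with the highest dev accuracy across all 15 epochs. The sketch below replays that selection on the dev accuracies from the log above and, for comparison, shows where a simple patience-based early-stopping rule would have halted. Early stopping is not part of this notebook; `stop_epoch` and its `patience` parameter are assumptions added for illustration.

```python
def best_epoch(dev_accuracies):
    """Index and value of the first strict maximum, matching the `>` check used when saving."""
    best_idx, best_acc = -1, 0.0
    for epoch, acc in enumerate(dev_accuracies):
        if acc > best_acc:
            best_idx, best_acc = epoch, acc
    return best_idx, best_acc


def stop_epoch(dev_accuracies, patience):
    """Epoch at which training would halt after `patience` epochs without improvement."""
    best_acc, since_best = 0.0, 0
    for epoch, acc in enumerate(dev_accuracies):
        if acc > best_acc:
            best_acc, since_best = acc, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(dev_accuracies) - 1


# Dev accuracies copied from the training log above (epochs 0-14).
dev_log = [0.77, 0.775, 0.81, 0.86, 0.845, 0.845, 0.855, 0.845,
           0.815, 0.89, 0.865, 0.825, 0.83, 0.835, 0.815]

print(best_epoch(dev_log))
print(stop_epoch(dev_log, patience=3))
```

On this log the best checkpoint comes from epoch 9 (0.890), whereas a patience of 3 would have stopped at epoch 6 and kept the 0.860 checkpoint from epoch 3. With a dev set this noisy, a fixed epoch budget with best-checkpoint saving can find a later peak, at the cost of extra training time.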


In [None]:
test_x, test_y=read_data("/content/test.tsv", labels, max_data_points=1000)

def test_model(model_filename, bert_model_name, test_x, test_y, labels, embedding_size=768, doLowerCase=None):
    bert_model = BERTClassifier(params={"doLowerCase": doLowerCase, "model_name": bert_model_name, "embedding_size": embedding_size, "label_length": len(labels)})
    bert_model.to(device)
    bert_model.load_state_dict(torch.load(model_filename))

    test_batch_x, test_batch_y = bert_model.get_batches(test_x, test_y)

    test_accuracy, test_F1 = evaluate(bert_model, test_batch_x, test_batch_y)
    print("Test accuracy: %.3f" % test_accuracy)
    print("Test F1 score: %.3f" % test_F1)


test_model("convote-bert-base-cased", "bert-base-cased", test_x, test_y, labels, doLowerCase=False)

Accuracy: 0.81
F1 Score: 0.8191695546142986
Test accuracy: 0.810
Test F1 score: 0.819


In [None]:
import scipy.stats as stats

def calculate_confidence_interval(correct_predictions, total_predictions, confidence_level=0.95):
    # Calculate the proportion (accuracy)
    p_hat = correct_predictions / total_predictions

    # Calculate the z-score
    z = stats.norm.ppf(1 - (1 - confidence_level) / 2)

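    # Calculate the margin of error: the normal-approximation term p_hat * (1 - p_hat) / n
    # plus the z^2 / (4 n^2) term from the Wilson score interval (the interval stays centered on p_hat)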
    margin_of_error = z * ((p_hat * (1 - p_hat) / total_predictions) + (z**2 / (4 * total_predictions**2)))**0.5

    # Calculate the confidence interval
    ci_lower = p_hat - margin_of_error
    ci_upper = p_hat + margin_of_error

    return ci_lower, ci_upper

# Example usage
ci_lower, ci_upper = calculate_confidence_interval(160, 200)
print(f"The 95% confidence interval for the accuracy is: {ci_lower:.2f} to {ci_upper:.2f}")


The 95% confidence interval for the accuracy is: 0.74 to 0.86
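
For comparison, here is a sketch of the standard Wilson score interval, which uses the same variance term as the function above but also shifts the center toward 0.5 and rescales by `1 + z**2 / n`. The function name `wilson_interval` is hypothetical, and the standard library's `NormalDist` is used in place of `scipy.stats` only to keep the example dependency-free.

```python
from statistics import NormalDist


def wilson_interval(correct, total, confidence_level=0.95):
    """Wilson score interval for a binomial proportion such as accuracy."""
    p_hat = correct / total
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * ((p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) ** 0.5) / denom
    return center - half_width, center + half_width


# Same example as above: 160 correct out of 200.
lower, upper = wilson_interval(160, 200)
print(f"{lower:.3f} to {upper:.3f}")
```

For 160 correct out of 200, the two intervals differ only slightly (the Wilson upper bound is a little lower), but the gap grows for small test sets or accuracies near 0 or 1.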


In [None]:
#Finally the majority class function and evaluation
def majority_class(trainY, devY):
    labelCounts=Counter()
    for label in trainY:
        labelCounts[label]+=1
    majority=labelCounts.most_common(1)[0][0]

    correct=0.
    for label in devY:
        if label == majority:
            correct+=1

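    print("%s\t%.3f" % (majority, correct/len(devY)))
    # Return the result as well, so callers that assign it do not get None
    return majority, correct/len(devY)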

In [None]:
baseline=majority_class(train_y,test_y)

0	0.775


In [None]:
import csv  # needed by csv.DictWriter below

def test_model_and_save_results(model_filename, bert_model_name, test_x, test_y, labels, output_csv, embedding_size=768, doLowerCase=None):
    bert_model = BERTClassifier(params={"doLowerCase": doLowerCase, "model_name": bert_model_name, "embedding_size": embedding_size, "label_length": len(labels)})
    bert_model.to(device)
    bert_model.load_state_dict(torch.load(model_filename))

    test_batch_x, test_batch_y = bert_model.get_batches(test_x, test_y)

    test_accuracy, test_F1 = evaluate(bert_model, test_batch_x, test_batch_y)
    print("Test accuracy: %.3f" % test_accuracy)
    print("Test F1 score: %.3f" % test_F1)

    # Get predictions and original labels
    predictions_list = []
    original_labels_list = []

    with torch.no_grad():
        for x, y in zip(test_batch_x, test_batch_y):
            y_pred = bert_model.forward(x)
            predictions = torch.argmax(y_pred, dim=-1)

            # Save predictions and original labels
            predictions_list.extend(predictions.cpu().numpy())
            original_labels_list.extend(y.cpu().numpy())

    # Write results to CSV
    with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Input', 'Predicted Label', 'Original Label']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()

        for input_text, predicted_label, original_label in zip(test_x, predictions_list, original_labels_list):
            try:
                if predicted_label == 0:
                  pred_label = 'NK'

                else:
                  pred_label = 'K'

                if original_label == 0:
                  org_label = 'NK'

                else:
                  org_label = 'K'

                writer.writerow({'Input': input_text, 'Predicted Label': labels[pred_label], 'Original Label': labels[org_label]})
            except KeyError as e:
                print(f"Error: {e} (predicted_label={predicted_label}, original_label={original_label})")
                print(f"Labels list: {labels}")

    print(f"Results saved to {output_csv}")
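
The function above hard-codes the 0 to 'NK' and 1 to 'K' correspondence. A minimal sketch of deriving it from the `labels` dict instead, so the code keeps working if the label set changes, is shown here. The name `id_to_label` and the sample predictions are hypothetical.

```python
labels = {"NK": 0, "K": 1}  # same mapping that read_labels printed earlier

# Invert once so integer predictions can be turned back into label names.
id_to_label = {idx: name for name, idx in labels.items()}

predicted_ids = [0, 1, 1, 0]  # hypothetical model outputs
# int() also accepts NumPy integer scalars such as those from predictions.cpu().numpy()
predicted_names = [id_to_label[int(i)] for i in predicted_ids]
print(predicted_names)
```

Writing `id_to_label[...]` to the CSV would store the label names, whereas the existing `labels[pred_label]` lookup writes the integer ids.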


In [None]:
import pandas as pd
import csv

def final_analysis(model_filename, bert_model_name, input_tsv, output_csv, embedding_size=768, doLowerCase=None):
    bert_model = BERTClassifier(params={"doLowerCase": doLowerCase, "model_name": bert_model_name, "embedding_size": embedding_size, "label_length": len(labels)})
    bert_model.to(device)
    bert_model.load_state_dict(torch.load(model_filename))
    bert_model.eval()  # disable dropout so predictions are deterministic

    # Load the TSV file into a DataFrame
    file_path = input_tsv
    df = pd.read_csv(file_path, sep='\t')
    df['dummy'] = 0

    response_list = tuple(df['response'].to_list())
    dummy_list = tuple(df['dummy'].to_list())
    year_list = df['year'].to_list()
    subreddit_list = df['subreddit'].to_list()

    test_batch_x, test_batch_y = bert_model.get_batches(response_list, dummy_list)

    # Get predictions
    predictions_list = []

    with torch.no_grad():
        for x in test_batch_x:
            y_pred = bert_model.forward(x)
            predictions = torch.argmax(y_pred, dim=-1)

            # Save predictions
            predictions_list.extend(predictions.cpu().numpy())

    # Write results to CSV
    with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Input', 'Year', 'Subreddit','Predicted Label']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()

        for input_text, year, subreddit, predicted_label in zip(response_list, year_list, subreddit_list ,predictions_list):
            try:
                if predicted_label == 0:
                  pred_label = 'NK'

                else:
                  pred_label = 'K'


                writer.writerow({'Input': input_text, 'Year': year, 'Subreddit': subreddit, 'Predicted Label': labels[pred_label]})
            except KeyError as e:
                print(f"Error: {e} (predicted_label={predicted_label}, input={input_text})")
                print(f"Labels list: {labels}")

    print(f"Results saved to {output_csv}")


final_analysis("convote-bert-base-cased", "bert-base-cased", "final_dataset.tsv", "test_results_final.csv")





FileNotFoundError: ignored