# Fine-Tuning BERT for Tweet Classification

This notebook walks through an example of fine-tuning BERT for text classification using the `transformers` library.

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/50/10/aeefced99c8a59d828a92cc11d213e2743212d3641c87c82d61b035a7d5c/transformers-2.3.0-py3-none-any.whl (447kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/74/f4/2d5214cbf13d06e7cb2c20d84115ca25b53ea76fa1f0ade0e3c9749de214/sentencepiece-0.1.85-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/a6/b4/7a41d630547a4afd58143597d5a49e07bfd4c42914d8335b2a5657efc14b/sacremoses-0.0.38.tar.gz (860kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.38-cp36-none-any.whl size=884629 sha256=153c5be25500aaa672

In [2]:
import numpy as np
import pandas as pd
import torch

from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from torch.utils.data import (
    TensorDataset,
    DataLoader,
    RandomSampler,
    SequentialSampler
)
from tqdm import tqdm
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    AdamW,
    BertConfig,
    get_linear_schedule_with_warmup
)

Using TensorFlow backend.


## Fetch and Process Data

In this example, I'll use the "Disasters on Social Media" dataset from [Figure Eight](https://www.figure-eight.com/data-for-everyone/).

> Contributors looked at over 10,000 tweets culled with a variety of searches like “ablaze”, “quarantine”, and “pandemonium”, then noted whether the tweet referred to a disaster event (as opposed to a joke with the word or a movie review or something non-disastrous).

The classification task is to predict whether a tweet refers to an actual disaster event or not.



In [3]:
! curl -o disaster-tweets.csv https://d1p17r2m4rzlbo.cloudfront.net/wp-content/uploads/2016/03/socialmedia-disaster-tweets-DFE.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2156k  100 2156k    0     0  6781k      0 --:--:-- --:--:-- --:--:-- 6781k


In [0]:
data = pd.read_csv('disaster-tweets.csv', encoding='ISO-8859-1')

To use the pretrained BERT model, we need the same tokenizer that was used for pretraining. `tokenizer.encode` below converts each input text to a list of token IDs, including the special start (`[CLS]`) and end (`[SEP]`) tokens. The sequences are then padded or truncated to a fixed length of 128. `Z` is the attention mask: it marks which positions hold real tokens (1) and which hold padding (0). BERT takes this mask as an input so that it can ignore the padding.

In [0]:
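tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
# Encode each tweet as token IDs, then pad/truncate to a fixed length of 128.
X = [tokenizer.encode(text) for text in data.text]
X = pad_sequences(X, padding='post', maxlen=128)
# Binary label: 1 if the tweet was marked 'Relevant', else 0.
y = np.array(data.choose_one == 'Relevant').astype(int)
# Attention mask: 1 for real tokens, 0 for padding (the pad ID is 0).
Z = (X != 0).astype(int)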

X_train, X_test, y_train, y_test, Z_train, Z_test = \
    train_test_split(X, y, Z, test_size=0.1, random_state=2018)
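
To make the role of `Z` concrete, here is a minimal pure-Python sketch of padding a single sequence and deriving its mask. The helper `pad_and_mask` and the two middle token IDs are made up for illustration; 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` IDs in the `bert-base-uncased` vocabulary. Unlike the `pad_sequences` call above, which keeps the Keras default of truncating over-long sequences from the front, this sketch truncates from the end for simplicity.

```python
PAD_ID = 0  # [PAD] token ID in the bert-base-uncased vocabulary


def pad_and_mask(token_ids, maxlen):
    """Right-pad (or truncate) a list of token IDs and build its attention mask."""
    trimmed = token_ids[:maxlen]
    padded = trimmed + [PAD_ID] * (maxlen - len(trimmed))
    # 1 marks a real token, 0 marks padding.
    mask = [int(tok != PAD_ID) for tok in padded]
    return padded, mask


# Illustrative sequence: [CLS], two arbitrary word-piece IDs, [SEP].
ids = [101, 2543, 999, 102]
padded, mask = pad_and_mask(ids, maxlen=8)
print(padded)  # [101, 2543, 999, 102, 0, 0, 0, 0]
print(mask)    # [1, 1, 1, 1, 0, 0, 0, 0]
```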

## Model Building and Training

In [0]:
def fine_tune_bert(
    X_train, X_test, y_train, y_test, Z_train, Z_test, lr=2e-5, batch_size=32,
    epochs=3, freeze_bert_layers=False
):
  train_data = TensorDataset(*map(torch.tensor, (X_train, Z_train, y_train)))
  train_sampler = RandomSampler(train_data)
  train_dataloader = DataLoader(
      train_data, sampler=train_sampler, batch_size=batch_size)

  validation_data = TensorDataset(*map(torch.tensor, (X_test, Z_test, y_test)))
  validation_sampler = SequentialSampler(validation_data)
  validation_dataloader = DataLoader(
      validation_data, sampler=validation_sampler, batch_size=batch_size)

  model = BertForSequenceClassification.from_pretrained(
      'bert-base-uncased',
      num_labels=2,
      output_attentions=False,
      output_hidden_states=False
  )

  if torch.cuda.is_available():
      print("Using GPU")
      model.cuda()
      device = torch.device("cuda")
  else:
      device = torch.device("cpu")

  if freeze_bert_layers:
    for param in model.bert.parameters():
        param.requires_grad = False

  optimizer = AdamW(
      model.parameters(),
      lr=lr,
      eps=1e-8
  )

  total_steps = len(train_dataloader) * epochs
  scheduler = get_linear_schedule_with_warmup(
      optimizer,
      num_warmup_steps=0,
      num_training_steps=total_steps
  )

  loss_values = []
  for epoch_i in range(0, epochs):
      total_loss = 0
      model.train()
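      for step, batch in enumerate(train_dataloader):
          # Move the token IDs, attention mask, and labels to the target device.
          seq, mask, labels = (x.to(device) for x in batch)
          # Clear gradients left over from the previous step.
          model.zero_grad()
          # Passing `labels` makes the model return the loss as its first output.
          outputs = model(
              seq.to(torch.int64),
              token_type_ids=None,
              attention_mask=mask.to(torch.int64),
              labels=labels.to(torch.int64)
          )
          loss = outputs[0]
          total_loss += loss.item()
          loss.backward()
          # Clip the gradient norm to 1.0 to guard against exploding gradients.
          torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
          optimizer.step()
          # Advance the learning-rate schedule once per batch.
          scheduler.step()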

      # Calculate the average loss over the training data.
      avg_train_loss = total_loss / len(train_dataloader)
      print(f"Epoch {epoch_i}")
      print(f"--Training Loss: {avg_train_loss}")

      # Switch to evaluation mode (disables dropout).
      model.eval()

      total_correct = 0
      val_total = 0
      for batch in validation_dataloader:
          seq, mask, labels = (t.to(device) for t in batch)
          with torch.no_grad():
              outputs = model(
                  seq.to(torch.int64),
                  token_type_ids=None,
                  attention_mask=mask.to(torch.int64)
              )
          # Without `labels`, the first output is the logits.
          logits = outputs[0]
          logits = logits.detach().cpu().numpy()
          labels = labels.to('cpu').numpy()
          val_total += logits.shape[0]
          preds = np.argmax(logits, axis=1)
          # Count all correct predictions in this batch.
          batch_correct = (preds == labels.flatten()).sum()
          total_correct += batch_correct

      # Report the validation accuracy for this epoch.
      print("--Validation Accuracy: {0:.2f}".format(total_correct / val_total))
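
The training loop steps a linear learning-rate schedule with zero warmup steps once per batch. As a rough sketch of what that does to the learning rate, assuming the scheduler ramps linearly up to the base rate during warmup and then decays linearly to zero, the multiplier can be written in plain Python. The function name `linear_schedule_factor` and the 10-step run are hypothetical, chosen only for illustration.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step (sketch)."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5      # same default as fine_tune_bert
total_steps = 10    # hypothetical; the notebook uses len(train_dataloader) * epochs
lrs = [base_lr * linear_schedule_factor(s, 0, total_steps)
       for s in range(total_steps + 1)]
print(lrs[0], lrs[5], lrs[10])
```

With no warmup, the learning rate starts at its full value on the first batch and reaches zero on the last one, so later epochs take progressively smaller steps.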

In [7]:
fine_tune_bert(X_train, X_test, y_train, y_test, Z_train, Z_test, epochs=10)

Using GPU
Epoch 0
--Training Loss: 0.4368228014388116
--Validation Accuracy: 0.85
Epoch 1
--Training Loss: 0.32688747726234735
--Validation Accuracy: 0.85
Epoch 2
--Training Loss: 0.2518084531564728
--Validation Accuracy: 0.84
Epoch 3
--Training Loss: 0.18333846961872446
--Validation Accuracy: 0.82
Epoch 4
--Training Loss: 0.13055380151357526
--Validation Accuracy: 0.83
Epoch 5
--Training Loss: 0.1028647772858248
--Validation Accuracy: 0.84
Epoch 6
--Training Loss: 0.08213668294689234
--Validation Accuracy: 0.83
Epoch 7
--Training Loss: 0.0686279149068629
--Validation Accuracy: 0.83
Epoch 8
--Training Loss: 0.05718729714406472
--Validation Accuracy: 0.82
Epoch 9
--Training Loss: 0.04954418827006533
--Validation Accuracy: 0.82
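
The accuracy reported after each epoch is the fraction of validation tweets whose highest-scoring logit matches the label. A minimal pure-Python sketch of that calculation follows; the logits and labels are made up, and `accuracy_from_logits` is a hypothetical helper rather than part of the notebook.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose argmax over the logits equals the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Made-up two-class logits for four examples.
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
labels = [0, 1, 0, 0]
print(accuracy_from_logits(logits, labels))  # 0.75
```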


In [8]:
fine_tune_bert(
    X_train, X_test, y_train, y_test, Z_train, Z_test, freeze_bert_layers=True,
    epochs=10)

Using GPU
Epoch 0
--Training Loss: 0.6846879298001333
--Validation Accuracy: 0.60
Epoch 1
--Training Loss: 0.6483962861151477
--Validation Accuracy: 0.67
Epoch 2
--Training Loss: 0.6384691331121657
--Validation Accuracy: 0.67
Epoch 3
--Training Loss: 0.6335444050092324
--Validation Accuracy: 0.68
Epoch 4
--Training Loss: 0.6295321819439433
--Validation Accuracy: 0.68
Epoch 5
--Training Loss: 0.6276212239187527
--Validation Accuracy: 0.68
Epoch 6
--Training Loss: 0.6220013259672651
--Validation Accuracy: 0.69
Epoch 7
--Training Loss: 0.6208711312292448
--Validation Accuracy: 0.69
Epoch 8
--Training Loss: 0.6209131724694196
--Validation Accuracy: 0.69
Epoch 9
--Training Loss: 0.6206710254834369
--Validation Accuracy: 0.69
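
With `freeze_bert_layers=True`, every parameter under `model.bert` stops receiving gradient updates, so only the classification head is trained. Assuming that head is a single linear layer from the 768-dimensional pooled output of `bert-base-uncased` to the 2 labels, a back-of-the-envelope count of the trainable parameters looks like this (the variable names are illustrative):

```python
hidden_size = 768   # hidden size of bert-base-uncased
num_labels = 2      # relevant vs. not relevant

# A linear layer has one weight per (input, output) pair plus one bias per output.
trainable_params = hidden_size * num_labels + num_labels
print(trainable_params)  # 1538
```

Under that assumption, the frozen run is effectively fitting a small linear classifier on top of fixed BERT features, which may help explain why its accuracy plateaus well below the fully fine-tuned model.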


## Comparison with Logistic Regression

Let's see how BERT compares with plain logistic regression on this task, using TF-IDF features computed over the same BERT token IDs.

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
lr_model = LogisticRegression()

In [0]:
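# X_train already holds token IDs, so the "tokenizer" just returns each row as-is.
# max_df=.75 drops any token that appears in more than 75% of the tweets.
vec = TfidfVectorizer(tokenizer=lambda x: x, lowercase=False, max_df=.75)
X_train_tfidf = vec.fit_transform(X_train)
X_test_tfidf = vec.transform(X_test)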
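
Because the rows fed to the vectorizer are padded token-ID sequences, IDs such as the padding ID 0 and the `[CLS]`/`[SEP]` IDs 101 and 102 show up in nearly every row. One likely effect of `max_df=.75` is to filter those out. The sketch below illustrates the document-frequency rule in plain Python on made-up rows; `document_frequency_filter` is a hypothetical helper, not scikit-learn's implementation.

```python
from collections import Counter


def document_frequency_filter(docs, max_df):
    """Keep tokens that appear in at most a max_df fraction of the documents."""
    n_docs = len(docs)
    # Count each token once per document.
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok for tok, count in df.items() if count <= max_df * n_docs}


# Made-up padded rows: 101 = [CLS], 102 = [SEP], 0 = [PAD].
docs = [
    [101, 7, 8, 102, 0, 0],
    [101, 9, 102, 0, 0, 0],
    [101, 7, 10, 11, 102, 0],
    [101, 12, 102, 0, 0, 0],
]
kept = document_frequency_filter(docs, max_df=0.75)
print(sorted(kept))  # [7, 8, 9, 10, 11, 12]
```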

In [12]:
lr_model.fit(X_train_tfidf, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
preds = lr_model.predict(X_test_tfidf)

In [14]:
print(f"Logistic Regression Validation Accuracy: {(preds == y_test).mean()}")

Logistic Regression Validation Accuracy: 0.8005514705882353
