# Transfer Learning Using BERT 

## Project Summary: 
This project uses Google's pretrained BERT model for transfer learning. The idea is similar to transfer learning with image recognition models (e.g. VGG, ResNet): a classifier head is added on top of the outputs of the base model.
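
To make the idea concrete, here is a minimal sketch of a frozen base model with a trainable classifier head. `ToyEncoder` is a hypothetical stand-in for BERT (a real run would use the pretrained model's sentence-level output instead), and all sizes are arbitrary assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained encoder such as BERT."""

    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size, padding_idx=0)

    def forward(self, token_ids):
        # Average the token embeddings into one vector per sequence
        return self.embed(token_ids).mean(dim=1)


class FrozenEncoderClassifier(nn.Module):
    """Frozen base model followed by a trainable classifier head."""

    def __init__(self, encoder, hidden_size, num_classes):
        super().__init__()
        self.encoder = encoder
        # Freeze the base model so that only the head is trained
        for param in self.encoder.parameters():
            param.requires_grad = False
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):
        with torch.no_grad():
            features = self.encoder(token_ids)
        return self.head(features)


torch.manual_seed(0)
model = FrozenEncoderClassifier(ToyEncoder(), hidden_size=16, num_classes=2)

batch = torch.randint(1, 100, (4, 178))  # 4 sequences of 178 token IDs
logits = model(batch)

# Only the head's parameters remain trainable
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(tuple(logits.shape), trainable)
```

The key point is that the head consumes the encoder's output features, not the raw token IDs, and that the optimizer only needs the head's parameters.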

### Problem Statement:
How can we use transfer learning for an NLP problem to improve classification results? 

In particular, I am going to tackle the 'Quora Insincere Question Classification' problem on Kaggle.

Link: https://www.kaggle.com/c/quora-insincere-questions-classification

#### Dataset used:
Kaggle competition dataset - Quora Insincere Question Classification

#### Resources used:
Colab

#### Code implemented in:
PyTorch

#### Credit: Lim Si Jie

In [1]:
!pip install pytorch-pretrained-bert
!pip install --upgrade kaggle

Collecting pytorch-pretrained-bert
  Downloading https://files.pythonhosted.org/packages/5d/3c/d5fa084dd3a82ffc645aba78c417e6072ff48552e3301b1fa3bd711e03d4/pytorch_pretrained_bert-0.6.1-py3-none-any.whl (114kB)
    100% |████████████████████████████████| 122kB 3.8MB/s 
Installing collected packages: pytorch-pretrained-bert
Successfully installed pytorch-pretrained-bert-0.6.1


## Mount Google Drive to Google's Linux VM (Colab)

In [2]:
from google.colab import drive
drive.mount('/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /gdrive


In [3]:
#Check whether Google Drive is connected

with open('/gdrive/My Drive/test.txt', 'w') as f:
  f.write('Hello Google Drive!')
!cat '/gdrive/My Drive/test.txt'

Hello Google Drive!

In [0]:
#Connect to the Kaggle API using an API token

#Note: You can see how to download a Kaggle API token here - https://github.com/Kaggle/kaggle-api

!pip install -U -q kaggle
!mkdir -p ~/.kaggle

!cp "/gdrive/My Drive/Deep Learning Workshop/kaggle.json" ~/.kaggle/
#Restrict the token's permissions so that other users cannot read it
!chmod 600 ~/.kaggle/kaggle.json

In [5]:
#Test that the kaggle command is working
!kaggle datasets list

ref                                                          title                                                size  lastUpdated          downloadCount  
-----------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  
ronitf/heart-disease-uci                                     Heart Disease UCI                                     3KB  2018-06-25 11:33:56          12228  
russellyates88/suicide-rates-overview-1985-to-2016           Suicide Rates Overview 1985 to 2016                 396KB  2018-12-01 19:18:25           8638  
karangadiya/fifa19                                           FIFA 19 complete player dataset                       2MB  2018-12-21 03:52:59          11505  
iarunava/cell-images-for-detecting-malaria                   Malaria Cell Images Dataset                         337MB  2018-12-05 05:40:21           2156  
mohansacharya/graduate-admissions                         

In [6]:
#Download the Kaggle NLP dataset that you are interested in
#In my case, I am downloading the quora insincere question classification dataset

!kaggle competitions download -c quora-insincere-questions-classification

Downloading train.csv.zip to /content
 75% 41.0M/54.4M [00:02<00:01, 9.05MB/s]
100% 54.4M/54.4M [00:02<00:00, 19.9MB/s]
Downloading embeddings.zip to /content
100% 5.96G/5.96G [02:27<00:00, 59.9MB/s]

Downloading sample_submission.csv.zip to /content
100% 4.08M/4.08M [00:00<00:00, 10.0MB/s]

Downloading test.csv.zip to /content
 83% 13.0M/15.7M [00:00<00:00, 13.2MB/s]
100% 15.7M/15.7M [00:00<00:00, 24.7MB/s]


In [7]:
!ls -al

#!unzip embeddings.zip (I'm not unzipping the embeddings since I won't be using it in my approach)
!unzip train.csv.zip
!unzip test.csv.zip

total 6321976
drwxr-xr-x 1 root root       4096 Mar  5 00:54 .
drwxr-xr-x 1 root root       4096 Mar  5 00:51 ..
drwxr-xr-x 1 root root       4096 Feb 26 17:33 .config
-rw-r--r-- 1 root root 6395920052 Mar  5 00:54 embeddings.zip
drwxr-xr-x 1 root root       4096 Feb 26 17:33 sample_data
-rw-r--r-- 1 root root    4282631 Mar  5 00:54 sample_submission.csv.zip
-rw-r--r-- 1 root root   16426497 Mar  5 00:54 test.csv.zip
-rw-r--r-- 1 root root   57047694 Mar  5 00:51 train.csv.zip
Archive:  train.csv.zip
  inflating: train.csv               
Archive:  test.csv.zip
  inflating: test.csv                


In [8]:
!ls -al

total 6477464
drwxr-xr-x 1 root root       4096 Mar  5 00:54 .
drwxr-xr-x 1 root root       4096 Mar  5 00:51 ..
drwxr-xr-x 1 root root       4096 Feb 26 17:33 .config
-rw-r--r-- 1 root root 6395920052 Mar  5 00:54 embeddings.zip
drwxr-xr-x 1 root root       4096 Feb 26 17:33 sample_data
-rw-r--r-- 1 root root    4282631 Mar  5 00:54 sample_submission.csv.zip
---------- 1 root root   35011536 Feb  6 00:46 test.csv
-rw-r--r-- 1 root root   16426497 Mar  5 00:54 test.csv.zip
---------- 1 root root  124206772 Oct 30 16:56 train.csv
-rw-r--r-- 1 root root   57047694 Mar  5 00:51 train.csv.zip


In [9]:
#Import relevant libraries

import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM, BertForNextSentencePrediction, BertAdam
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader, random_split
from collections import Counter
from torch.nn.utils.rnn import pad_sequence
from sklearn.model_selection import train_test_split
from tqdm._tqdm_notebook import tqdm_notebook
from tqdm import tqdm

tqdm_notebook.pandas(desc='Progress')

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


## Testing for CUDA

In [10]:
# check if CUDA is available
train_on_gpu = torch.cuda.is_available()

if not train_on_gpu:
    print('CUDA is not available.  Training on CPU ...')
else:
    print('CUDA is available!  Training on GPU ...')

CUDA is available!  Training on GPU ...


In [11]:
#The dataset is large, so I first take a small subset for debugging/testing the model. Otherwise, training takes a long time.
#REMOVE: this cell can be removed once testing is done

raw_df = pd.read_csv('train.csv')
print(len(raw_df[raw_df['target'] == 1]))
print(len(raw_df[raw_df['target'] == 0]))

pos_df = raw_df[raw_df['target'] == 1].iloc[:100, :]
neg_df = raw_df[raw_df['target'] == 0].iloc[:100, :]

#Combine the two subsets (DataFrame.append is not available in pandas 2.0+)
short_df = pd.concat([pos_df, neg_df])

short_df.to_csv('train_short.csv')

80810
1225312


In [12]:
#Activate the logger for more information on what's happening
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

INFO:pytorch_pretrained_bert.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt not found in cache, downloading to /tmp/tmp2g6j2157
100%|██████████| 231508/231508 [00:00<00:00, 408034.90B/s]
INFO:pytorch_pretrained_bert.file_utils:copying /tmp/tmp2g6j2157 to cache at /root/.pytorch_pretrained_bert/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
INFO:pytorch_pretrained_bert.file_utils:creating metadata file for /root/.pytorch_pretrained_bert/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
INFO:pytorch_pretrained_bert.file_utils:removing temp file /tmp/tmp2g6j2157
INFO:pytorch_pretrained_bert.tokenization:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.pytorch_pretrained_bert/26bc1ad6c0ac742e9b52263248f6d0f00068
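
BERT tokenizers split each word into WordPiece sub-tokens using a greedy longest-match-first search against the vocabulary. The sketch below illustrates that idea on a made-up vocabulary. `toy_wordpiece` and `toy_vocab` are hypothetical names, and the real tokenizer also lowercases, splits punctuation and uses the full `bert-base-uncased` vocabulary, none of which is reproduced here.

```python
def toy_wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No sub-token matched, so the whole word is unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Made-up vocabulary for illustration only
toy_vocab = {"play", "##ing", "un", "##play", "##able"}

print(toy_wordpiece("playing", toy_vocab))
print(toy_wordpiece("unplayable", toy_vocab))
print(toy_wordpiece("xyz", toy_vocab))
```

Because rare words are broken into several pieces, the token count of a question can be noticeably larger than its word count, which matters when choosing `maxlen`.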

In [0]:
#Defining the Dataset class to load the NLP dataset

class Dataset(Dataset):
  'Characterizes a dataset for PyTorch'
  def __init__(self, df_path, maxlen):
    
    'Initialization'
    
    #This determines the maximum length of each tensor. Tensors shorter than maxlen are padded with 0.
    #The rationale is to pass tensors of the same length into the model for more efficient computation.
    
    self.maxlen = maxlen
    
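    #For simplicity, we drop the rows where the question is longer than BERT's limit of 512 tokens
    #errors = 'ignore' lets the same code run on the small debugging file, where these row labels do not exist
    self.df = pd.read_csv(df_path).drop([59428, 205748, 163583, 443216], axis = 0, errors = 'ignore').reset_index()
    
    
    #Use bracket assignment so that pandas creates real columns (plain attribute assignment does not)
    self.df['labels'] = self.df['target']
    self.df['text'] = self.df['question_text']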
    
    #Tokenize the questions
    
    print('Start Tokenizing')
    self.df.text = self.df.text.apply(tokenizer.tokenize) #.progress_apply(tokenizer.tokenize)
    
    #Index the tokens 
    
    print('Start Indexing Tokens')
    self.df.text = self.df.text.apply(tokenizer.convert_tokens_to_ids) #progress_apply(tokenizer.convert_tokens_to_ids)
    
    #Pad the text_index with 0 so that it hits the max_len
    
    print('Start Padding Process')
    self.df.text = self.df.text.apply(self.pad_data) #progress_apply(self.pad_data) 
    
    #Converting all numpy array (for text) to tensor
    
    print('Converting numpy array to tensor')
    self.df.text = self.df.text.apply(torch.from_numpy) #progress_apply(torch.from_numpy)
    
    #Note: I am overwriting the column to reduce memory usage. If you prefer, you can create new columns for each step (tokenizing, indexing, padding)
    
    '''print('Start Tokenizing')
    self.df.text_token = self.df.text.apply(tokenizer.tokenize) #.progress_apply(tokenizer.tokenize)
    
    #Index the tokens 
    
    print('Start Indexing Tokens')
    self.df.text_idx = self.df.text_token.apply(tokenizer.convert_tokens_to_ids) #progress_apply(tokenizer.convert_tokens_to_ids)
    
    #Pad the text_index with 0 so that it hits the max_len
    
    print('Start Padding Process')
    self.df.text_idx_padded = self.df.text_idx.apply(self.pad_data) #progress_apply(self.pad_data) 
    
    #Converting all numpy array (for text) to tensor
    
    print('Converting numpy array to tensor')
    self.df.text_idx_padded = self.df.text_idx_padded.apply(torch.from_numpy) #progress_apply(torch.from_numpy)
    
    #drop the text_token and token_indexing to reduce memory usage
    self.df = self.df.drop('text', axis = 1)'''

  def __len__(self):
    'Denotes the total number of samples'
    return len(self.df.text)

  def __getitem__(self, index):
    'Generates one sample of data'
    # Select sample
    text_idx = self.df.text[index]
    labels = self.df.labels[index]

    return text_idx, labels
   
  def pad_data(self, s):
    #Pad the tensor with zeros so that all tensors have the same length.
    padded = np.zeros((self.maxlen,), dtype=np.int64)
    if len(s) > self.maxlen: 
      padded[:] = s[:self.maxlen]
    else: padded[:len(s)] = s
    return padded
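
The padding step can be checked in isolation. The sketch below mirrors the pad-or-truncate logic on made-up token IDs and additionally builds a 0/1 mask marking the real tokens. The mask is an addition that the notebook does not use, and `pad_ids` is a hypothetical helper name. BERT implementations typically accept such a mask so that padding positions can be ignored.

```python
import numpy as np


def pad_ids(ids, maxlen, pad_id=0):
    """Pad or truncate a list of token IDs to exactly maxlen entries."""
    padded = np.full((maxlen,), pad_id, dtype=np.int64)
    trimmed = ids[:maxlen]  # truncate sequences that are too long
    padded[:len(trimmed)] = trimmed

    # 1 for real tokens, 0 for padding positions
    mask = np.zeros((maxlen,), dtype=np.int64)
    mask[:len(trimmed)] = 1
    return padded, mask


# Made-up token IDs for illustration
short_ids, short_mask = pad_ids([7, 8, 9, 10], maxlen=6)
long_ids, long_mask = pad_ids(list(range(1, 10)), maxlen=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```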

In [0]:
#Build the dataset from the full training file (tokenizing every question takes a while)
quora_df = Dataset('train.csv', maxlen = 178)

#For debugging: 
#quora_df = Dataset('train_short.csv', maxlen = 178)



Start Tokenizing


In [0]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [0]:
#Load the pretrained BERT model and freeze its weights

model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased').to(device) #BertModel

for param in model.parameters():
  param.requires_grad = False
  
'''model.fc = nn.Sequential(
    nn.Linear(178, 2048),
    nn.Sigmoid(),
    #nn.Dropout(0.1),
    nn.Linear(2048, 1024),
    nn.Sigmoid(),
    #nn.Dropout(0.1),
    nn.Linear(1024, 512),
    nn.Sigmoid(),
    #nn.Dropout(0.1),
    nn.Linear(512, 2)).to(device)'''

'''model.fc = nn.Sequential(
    nn.Embedding(178, 178),
    nn.LayerNorm(178, 512),
    nn.Hardshrink(),
    nn.Dropout(0.2),
    nn.Linear(512, 256),
    nn.Hardshrink(),
    nn.Dropout(0.2),
    nn.Linear(256, 128),
    nn.Hardshrink(),
    nn.Dropout(0.2),
    nn.Linear(128, 2)).to(device)'''

'''model.fc = nn.Sequential(
    nn.BatchNorm1d(178, 178),
    nn.Linear(178, 512),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(256, 2)).to(device)'''

'''model.fc = nn.Sequential(
    nn.BatchNorm1d(178, 178),
    nn.Linear(178, 512),
    nn.Sigmoid(),
    nn.Dropout(0.2),
    nn.Linear(512, 256),
    nn.Sigmoid(),
    nn.Dropout(0.2),
    nn.Linear(256, 2)).to(device)'''

'''model.fc = nn.Sequential(
    nn.Linear(178, 512),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 2)).to(device)'''

model.fc = nn.Sequential(
    nn.Linear(178, 178),
    nn.Hardshrink(),
    #nn.Dropout(0.2),
    nn.Linear(178, 2)).to(device)

#Weight the rare positive class (label 1) more heavily to counter the class imbalance
class_weight = torch.FloatTensor([1, 17]).to(device)

criterion = nn.CrossEntropyLoss(weight = class_weight)
#criterion = nn.CrossEntropyLoss()

optimizer = BertAdam(model.fc.parameters(), lr = 0.02)

#optimizer = BertAdam(model.parameters(), lr = 0.01)
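
The class counts printed earlier (80,810 positive versus 1,225,312 negative questions) suggest one common heuristic for choosing loss weights: weight the minority class by the ratio of majority to minority examples. The sketch below is an assumption about how such a weight could be derived, not necessarily how the value 17 used above was chosen.

```python
# Class counts as printed earlier in the notebook
counts = {0: 1225312, 1: 80810}

# Ratio of majority to minority examples
ratio = counts[0] / counts[1]

# Majority class keeps weight 1, minority class is scaled up by the ratio
weights = [1.0, ratio]

print("negative-to-positive ratio: {:.2f}".format(ratio))
print("class weights:", weights)
```

The ratio comes out a little above 15, so a weight of 17 puts slightly more emphasis on the positive class than plain inverse-frequency weighting would.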


In [0]:
def split_num(dataset, train_split = 0.7):  
  dataset_len = len(dataset)  # Number of elements in the dataset
  
  #Split sizes based on the number of elements in the dataset.
  #The test size is the remainder, so the two sizes always sum to dataset_len, as random_split requires.
  train_ = round(dataset_len * train_split)
  test_ = dataset_len - train_
  
  return (train_, test_)
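
`random_split` expects the requested sizes to add up to the dataset length. Rounding the two fractions independently can miss that total for some lengths, whereas taking the second size as the remainder cannot. A small sketch of the difference, with hypothetical helper names:

```python
def sizes_rounded_independently(n, train_fraction=0.7):
    """Round both parts separately; the total can drift away from n."""
    return round(n * train_fraction), round(n * (1 - train_fraction))


def sizes_with_remainder(n, train_fraction=0.7):
    """Round the training part and give the rest to the test part."""
    train_n = round(n * train_fraction)
    return train_n, n - train_n


for n in (5, 200, 1306118):
    independent = sizes_rounded_independently(n)
    remainder = sizes_with_remainder(n)
    print(n, independent, sum(independent), remainder, sum(remainder))
```

For a length of 5 the independently rounded sizes add up to 6, while the remainder version always matches the dataset length.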

In [0]:
def train_model(model, criterion, optimizer, num_epochs, file_name):
  
  max_epochs = num_epochs 

  min_validation_loss = np.Inf
  min_validation_acc = 0

  for epoch in range(max_epochs):

      print('Epoch', epoch)
      print('-' * 20)
      print('')

      # Training

      model.train() 

      records = 0
      train_running_loss = 0.0
      train_running_corrects = 0

      for inputs, labels in train_dataloaders:
          
          if records % 100000 == 0:
            print('Training in progress:', '-------->' , records, 'out of', len(train_dataloaders.dataset))
          
          # Transfer to GPU
          inputs, labels = inputs.to(device), labels.to(device)

          # zero the parameter gradients
          optimizer.zero_grad()
          
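          # forward + backward + optimize
          # Note: the padded token IDs are cast to float and passed straight to the classifier head;
          # the frozen BERT encoder itself is not called in this loop.
          bert_output = model.fc(inputs.float())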
          #bert_output = model(inputs)
          #prob = torch.sigmoid(bert_output)
          
          loss = criterion(bert_output, labels)
          loss.backward()
          optimizer.step()

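          # loss.item() is a batch average, so scale it by the batch size before accumulating
          train_running_loss += loss.item() * labels.size(0)
          #_, preds = torch.max(prob, 1)
          _, preds = torch.max(bert_output, 1)
          
          train_running_corrects += torch.sum(preds == labels.long())

          # Count the actual number of samples; the last batch can be smaller than batch_size
          records += labels.size(0)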

      epoch_loss = train_running_loss / records
      epoch_acc = train_running_corrects.item() / records

      print('Training loss: {:.4f}, Training accuracy: {:.4f}'.format(epoch_loss, epoch_acc))
      print('')

      test_correct = 0
      test_total = 0
      test_running_loss = 0

      with torch.no_grad():

          model.eval()

          for inputs, labels in test_dataloaders:
              
              if test_total % 100000 == 0:
                print('Validation in progress:', '------>', test_total, 'out of', len(test_dataloaders.dataset))
              
              inputs, labels = inputs.to(device), labels.to(device)
              
              outputs = model.fc(inputs.float())
              #outputs = model.fc(inputs)

              loss = criterion(outputs, labels)

              _, predicted = torch.max(outputs.data, 1)

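              # loss.item() is a batch average, so scale it by the batch size before accumulating
              test_running_loss += loss.item() * labels.size(0)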

              test_total += labels.size(0)
              test_correct += (predicted == labels).sum().item()

          val_loss = test_running_loss / test_total
          val_acc = test_correct / test_total
              
      print('Validation loss: {:.4f}'.format(val_loss))
      print('Validation accuracy: {:.4f}'.format(val_acc))
      print('')

      #if (val_loss < min_validation_loss) & (val_acc > min_validation_acc):
      if (val_acc > min_validation_acc):

        #Track the best validation accuracy seen so far (despite its name, min_validation_acc holds a maximum)
        #min_validation_loss = val_loss
        min_validation_acc = val_acc

        #Save the model weights whenever the validation accuracy improves
        torch.save(model.state_dict(), file_name)
        print('Validation accuracy improved. Model saved')
        print('')
        
      print('-' * 20)
      print('')
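
Roughly 94% of the questions are labeled 0 (1,225,312 of 1,306,122 in the counts printed earlier), so accuracy alone can look good for a model that never predicts the positive class. A minimal sketch of computing F1 alongside accuracy, using made-up labels and a hypothetical helper `binary_f1`:

```python
def binary_f1(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up labels: 94 negatives followed by 6 positives
y_true = [0] * 94 + [1] * 6

# A model that always predicts the majority class
always_zero = [0] * 100

# A model with 1 false positive, 3 true positives and 3 false negatives
mixed = [0] * 93 + [1] + [1] * 3 + [0] * 3

print(accuracy(y_true, always_zero), binary_f1(y_true, always_zero))
print(accuracy(y_true, mixed), binary_f1(y_true, mixed))
```

Both toy models score above 90% accuracy, but only the second one has a non-zero F1, which makes F1 a more informative quantity to track when deciding which checkpoint to keep.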

In [0]:
train_set, test_set = random_split(quora_df, split_num(quora_df))

In [0]:
#Define the dataloaders for the training and validation splits
train_dataloaders = DataLoader(
            train_set,
            batch_size=30000,
            shuffle=True,
            num_workers=4)

test_dataloaders = DataLoader(
            test_set,
            batch_size=30000,
            shuffle=True,
            num_workers=4)
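
The batch size determines how many batches an epoch contains, and the final batch is usually smaller than the rest. A small sketch on dummy tensors (the sizes are arbitrary assumptions) shows why per-sample statistics should be counted from the labels actually received rather than from `batch_size`:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data: 10 sequences of 178 token IDs with binary labels
features = torch.zeros(10, 178, dtype=torch.long)
labels = torch.zeros(10, dtype=torch.long)

loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=False)

# Size of each batch as it is actually delivered
batch_sizes = [batch_labels.size(0) for _, batch_labels in loader]

# Two ways of counting the samples seen in one epoch
counted_by_batch_size = len(loader) * loader.batch_size
counted_by_labels = sum(batch_sizes)

print(batch_sizes, counted_by_batch_size, counted_by_labels)
```

Counting by `batch_size` overstates the number of samples whenever the dataset length is not a multiple of the batch size.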

In [0]:
file_name = '/gdrive/My Drive/Deep Learning Workshop/Advanced NLP Sequencing/Project/Bert.h5'

In [0]:
#Load model checkpoint so that we don't have to re-run the training
#model.load_state_dict(torch.load(file_name))
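
A checkpoint round trip can be tried on the classifier head alone. The sketch below uses a temporary file rather than the Google Drive path above, and `map_location='cpu'` is an assumption that makes a checkpoint saved on a GPU loadable on a machine without one.

```python
import os
import tempfile

import torch
import torch.nn as nn


def make_head():
    # Same layer layout as the classifier head defined above
    return nn.Sequential(
        nn.Linear(178, 178),
        nn.Hardshrink(),
        nn.Linear(178, 2))


head = make_head()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'head.pt')

    # Save only the parameters, not the whole Python object
    torch.save(head.state_dict(), path)

    # Rebuild the same architecture, then load the saved parameters into it
    restored = make_head()
    restored.load_state_dict(torch.load(path, map_location='cpu'))

# Check that every parameter tensor survived the round trip
same = all(
    torch.equal(a, b)
    for a, b in zip(head.state_dict().values(), restored.state_dict().values()))
print(same)
```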

In [0]:
train_model(model, criterion, optimizer, 2000, file_name)

#Note: the DataLoader batch size affects how much CUDA memory is used