
# BERT Fine-Tuning Tutorial with PyTorch 

Ref: Chris McCormick and Nick Ryan's [notebook](https://colab.research.google.com/drive/1uCg_ePi2UwqjveYsmGUm67tFAse-viUm#scrollTo=D4-EDAws2HrM&line=3&uniqifier=1) 


# 1. Setup 

## 1.1. Using Colab GPU for Training

Google Colab offers free GPUs and TPUs. Since we'll be training a large neural network, it's best to take advantage of this (here we'll attach a GPU); otherwise, training will take a very long time.

A GPU can be added by going to the menu and selecting:

Edit 🡒 Notebook Settings 🡒 Hardware accelerator 🡒 (GPU)

Then run the following cell to confirm that the GPU is detected.



In [3]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


For PyTorch to use the GPU, we need to identify the GPU and specify it as the device. Later, in our training loop, we will load data onto the device.

In [4]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4
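
As a minimal sketch of what "loading data onto the device" looks like, the snippet below moves a small tensor to whichever device is available. The tensor values and the name `batch` are made up for illustration; they are not part of the training code that follows.

```python
import torch

# Pick the GPU when one is visible, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A made-up batch of token IDs (hypothetical values, for illustration only).
batch = torch.tensor([[101, 2023, 102], [101, 2003, 102]])

# .to(device) returns a tensor on the target device; it does not modify
# the original in place, so the result has to be assigned.
batch = batch.to(device)

print(batch.device)
```

The same `.to(device)` call applies to the model itself, so the model and its inputs end up on the same device.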


## 1.2. Installing the Hugging Face Library
Next, let's install the transformers package from Hugging Face, which will give us a PyTorch interface for working with BERT. (This library also contains interfaces for other pretrained language models, like OpenAI's GPT and GPT-2.) We've selected the PyTorch interface because it strikes a nice balance between the high-level APIs (which are easy to use but don't provide insight into how things work) and TensorFlow code (which contains lots of details but often sidetracks us into lessons about TensorFlow, when the purpose here is BERT!).

At the moment, the Hugging Face library seems to be the most widely accepted and powerful PyTorch interface for working with BERT. In addition to supporting a variety of pre-trained transformer models, the library also includes pre-built modifications of these models suited to specific tasks. For example, in this tutorial we will use BertForSequenceClassification.

The library also includes task-specific classes for token classification, question answering, next sentence prediction, etc. Using these pre-built classes simplifies the process of modifying BERT for your purposes.

In [5]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/48/35/ad2c5b1b8f99feaaf9d7cdadaeef261f098c6e1a6a2935d4d07662a6b780/transformers-2.11.0-py3-none-any.whl (674kB)

The code in this notebook is actually a simplified version of the run_glue.py example script from Hugging Face.

run_glue.py is a helpful utility which allows you to pick which GLUE benchmark task you want to run on, and which pre-trained model you want to use (you can see the list of possible models here). It also supports using either the CPU, a single GPU, or multiple GPUs. It even supports 16-bit precision if you want a further speed-up.

Unfortunately, all of this configurability comes at the cost of readability. In this notebook, we've simplified the code greatly and added plenty of comments to make it clear what's going on.

# 2. Loading CoLA Dataset

We'll use [The Corpus of Linguistic Acceptability (CoLA)](https://nyu-mll.github.io/CoLA/) dataset for single-sentence classification. It's a set of sentences labeled as grammatically correct or incorrect. It was first published in May of 2018, and is one of the tests included in the "GLUE Benchmark" on which models like BERT are competing.

## 2.1. Download & Extract

We'll use the wget package to download the dataset to the Colab instance's file system.

In [6]:
!pip install wget

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp36-none-any.whl size=9682 sha256=82b9da160a80aee875461fbfa84d7fd2803c1ac3f1e3f66ad7af25b64b46067d
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


The dataset is hosted on GitHub Pages at: https://nyu-mll.github.io/CoLA/

In [7]:
import wget
import os

print('Downloading dataset...')

# The URL for the dataset zip file.
url = 'https://nyu-mll.github.io/CoLA/cola_public_1.1.zip'

# Download the file (if we haven't already)
if not os.path.exists('./cola_public_1.1.zip'):
    wget.download(url, './cola_public_1.1.zip')

Downloading dataset...


Unzip the dataset to the file system. You can browse the file system of the Colab instance in the sidebar on the left.

In [8]:
# Unzip the dataset (if we haven't already)
if not os.path.exists('./cola_public/'):
  !unzip cola_public_1.1.zip

Archive:  cola_public_1.1.zip
   creating: cola_public/
  inflating: cola_public/README      
   creating: cola_public/tokenized/
  inflating: cola_public/tokenized/in_domain_dev.tsv  
  inflating: cola_public/tokenized/in_domain_train.tsv  
  inflating: cola_public/tokenized/out_of_domain_dev.tsv  
   creating: cola_public/raw/
  inflating: cola_public/raw/in_domain_dev.tsv  
  inflating: cola_public/raw/in_domain_train.tsv  
  inflating: cola_public/raw/out_of_domain_dev.tsv  
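
The `!unzip` line above only works inside a notebook. As a sketch of the same "extract only if we haven't already" step in plain Python, the helper below uses the standard `zipfile` module. The function name `extract_if_missing` and its parameters are our own invention for this example.

```python
import os
import zipfile

def extract_if_missing(zip_path, target_dir, extract_to="."):
    """Extract zip_path into extract_to unless target_dir already exists.

    Returns True if the archive was extracted, False if it was skipped.
    """
    if os.path.exists(target_dir):
        return False  # already extracted, nothing to do
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_to)
    return True

# Usage, assuming the archive was downloaded by the earlier cell:
# extract_if_missing("./cola_public_1.1.zip", "./cola_public/")
```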


## 2.2. Parse
We can see from the file names that both tokenized and raw versions of the data are available.

We can't use the pre-tokenized version because, in order to apply the pre-trained BERT, we must use the tokenizer provided by the model. This is because (1) the model has a specific, fixed vocabulary and (2) the BERT tokenizer has a particular way of handling out-of-vocabulary words.
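
To make point (2) concrete, here is a rough sketch of the greedy longest-match-first strategy that WordPiece tokenizers use to break an unfamiliar word into known subword pieces. The vocabulary is a toy one invented for this example, and `wordpiece_split` is a simplified illustration, not the library's implementation.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes the unknown token.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, invented for this example.
toy_vocab = {"em", "##bed", "##ding", "##s", "cat"}

print(wordpiece_split("embeddings", toy_vocab))  # ['em', '##bed', '##ding', '##s']
print(wordpiece_split("cat", toy_vocab))         # ['cat']
print(wordpiece_split("dog", toy_vocab))         # ['[UNK]']
```

A different tokenizer would split the same sentences differently, and the resulting pieces would not line up with the vocabulary the pre-trained model was trained on.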

We'll use pandas to parse the "in-domain" training set and look at a few of its properties and data points.

In [9]:
import pandas as pd

# Load the dataset into a pandas dataframe.
df = pd.read_csv("./cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))

# Display 10 random rows from the data.
df.sample(10)

Number of training sentences: 8,551



|      | sentence_source | label | label_notes | sentence |
|------|-----------------|-------|-------------|----------|
| 4813 | ks08 | 1 | | I am not certain about whether he will go or not. |
| 91 | gj04 | 1 | | Ron yawned himself awake. |
| 4444 | ks08 | 0 | * | Americans have paying income tax ever since 1913. |
| 2078 | rhl07 | 1 | | No one can forgive you that comment. |
| 3242 | l-93 | 1 | | Tony broke her arm. |
| 2702 | l-93 | 1 | | Amanda drove the package. |
| 2116 | rhl07 | 1 | | Interviewing Richard Nixon gave Norman Mailer ... |
| 7714 | ad03 | 1 | | Medea tended to appear to be evil. |
| 4362 | ks08 | 1 | | Pat promised Leslie to be aggressive. |
| 2618 | l-93 | 1 | | Doug removed the scratches from around the sink. |


The two properties we actually care about are the sentence and its label, which is referred to as the "acceptability judgment" (0=unacceptable, 1=acceptable).
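
To make the file layout concrete, here is a small sketch that parses a few rows in the same header-less, four-column, tab-separated layout using only the standard library, and keeps just the sentence and label. The rows are copied from the sample above; the in-memory string is a stand-in for the real file.

```python
import csv
import io

# Three rows in the raw file's layout: source, label, label_notes, sentence.
raw_tsv = (
    "gj04\t1\t\tRon yawned himself awake.\n"
    "ks08\t0\t*\tAmericans have paying income tax ever since 1913.\n"
    "l-93\t1\t\tTony broke her arm.\n"
)

columns = ["sentence_source", "label", "label_notes", "sentence"]
reader = csv.DictReader(
    io.StringIO(raw_tsv),
    fieldnames=columns,
    delimiter="\t",
    quoting=csv.QUOTE_NONE,  # treat quote characters in sentences literally
)

# Keep only the two fields the classifier needs.
pairs = [(row["sentence"], int(row["label"])) for row in reader]

for sentence, label in pairs:
    print(label, sentence)
```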

Here are five sentences which are labeled as not grammatically acceptable. Note how much more difficult this task is than something like sentiment analysis!

In [10]:
df.loc[df.label == 0].sample(5)[['sentence', 'label']]

|      | sentence | label |
|------|----------|-------|
| 2464 | On the table jumped a cat. | 0 |
| 2832 | The old and new carts banged. | 0 |
| 8299 | Jason whispered the phoenix had escaped | 0 |
| 3880 | The teacher handed the student. | 0 |
| 8265 | Jason arrived by Medea. | 0 |


Let's extract the sentences and labels of our training set as numpy ndarrays.

In [0]:
# Get the lists of sentences and their labels.
sentences = df.sentence.values
labels = df.label.values
# for my own implementation -- I should demote a person's name (love-p, symes-k) as a dataframe, too

# 3. Tokenization & Input Formatting 
In this section, we'll transform our dataset into a format that BERT can be trained on.

## 3.1. BERT Tokenizer

To feed our text to BERT, it must be split into tokens, and then these tokens must be mapped to their index in the tokenizer vocabulary.
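
The second step, mapping tokens to vocabulary indices, can be sketched as a simple dictionary lookup. The six-entry vocabulary and the helper `tokens_to_ids` are invented for this illustration; BERT's real vocabulary is far larger and its IDs will differ.

```python
# Toy vocabulary: each token's position in the list is its ID
# (invented for this example).
toy_vocab_list = ["[PAD]", "[UNK]", "the", "cat", "sat", "##s"]
token_to_id = {token: idx for idx, token in enumerate(toy_vocab_list)}

def tokens_to_ids(tokens, token_to_id, unk_token="[UNK]"):
    """Look up each token's index, falling back to the unknown token's ID."""
    unk_id = token_to_id[unk_token]
    return [token_to_id.get(tok, unk_id) for tok in tokens]

print(tokens_to_ids(["the", "cat", "##s", "sat"], token_to_id))  # [2, 3, 5, 4]
print(tokens_to_ids(["the", "dog"], token_to_id))                # [2, 1]
```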

The tokenization must be performed by the tokenizer included with BERT; the cell below will download it for us. We'll be using the "uncased" version here.