<a href="https://colab.research.google.com/github/margeumkim/BRIDGEMAIL/blob/master/BERT_FineTune_SelectWords.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT Fine-Tuning Tutorial with PyTorch
Ref: Chris McCormick and Nick Ryan's notebook

## 1.1. Using Colab GPU for Training

Google Colab offers free GPUs and TPUs! Because we'll be training a large neural network, it's best to take advantage of this (in this case we'll attach a GPU); otherwise, training will take a very long time.

A GPU can be added by going to the menu and selecting:

Edit 🡒 Notebook Settings 🡒 Hardware accelerator 🡒 (GPU)

Then run the following cell to confirm that the GPU is detected.

In [1]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


For PyTorch to use the GPU, we need to identify the GPU and specify it as the device. Later, in our training loop, we will load data onto the device.

In [2]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla P100-PCIE-16GB
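
As a minimal sketch of what "loading data onto the device" looks like, the snippet below moves a small tensor to whichever device was selected. The tensor values are made-up placeholders, not token ids from this notebook's data.

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder batch (2 sequences x 3 ids); the values are illustrative only.
batch = torch.tensor([[1, 2, 3], [4, 5, 6]])

# .to(device) returns the tensor on the target device
# (the same tensor is returned if it already lives there).
batch = batch.to(device)

print(batch.device)
```

The same `.to(device)` call is what a training loop would apply to each batch of inputs and labels before the forward pass.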


## 1.2. Installing the Hugging Face Library

Next, let's install the transformers package from Hugging Face, which will give us a PyTorch interface for working with BERT. (This library also contains interfaces for other pretrained language models, such as OpenAI's GPT and GPT-2.) We've selected the PyTorch interface because it strikes a nice balance between the high-level APIs (which are easy to use but don't provide insight into how things work) and TensorFlow code (which contains lots of details but often sidetracks us into lessons about TensorFlow, when the purpose here is BERT!).

At the moment, the Hugging Face library seems to be the most widely accepted and powerful PyTorch interface for working with BERT. In addition to supporting a variety of pre-trained transformer models, the library also includes pre-built modifications of these models suited to specific tasks. For example, in this tutorial we will use BertForSequenceClassification.

The library also includes task-specific classes for token classification, question answering, next sentence prediction, etc. Using these pre-built classes simplifies the process of modifying BERT for your purposes.

In [3]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/48/35/ad2c5b1b8f99feaaf9d7cdadaeef261f098c6e1a6a2935d4d07662a6b780/transformers-2.11.0-py3-none-any.whl (674kB)

The code in this notebook is a simplified version of the run_glue.py example script from Hugging Face.

run_glue.py is a helpful utility that lets you pick which GLUE benchmark task to run and which pre-trained model to use (the Hugging Face documentation lists the available models). It supports the CPU, a single GPU, or multiple GPUs, and it can even use 16-bit precision for a further speed-up.

Unfortunately, all of this configurability comes at the cost of readability. In this notebook, we've simplified the code greatly and added plenty of comments to make it clear what's going on.

# 2. Loading CoLA Dataset

We'll use the [Corpus of Linguistic Acceptability (CoLA)](https://nyu-mll.github.io/CoLA/) dataset for single-sentence classification. It's a set of sentences labeled as grammatically correct or incorrect. It was first published in May 2018 and is one of the tests included in the "GLUE Benchmark" on which models like BERT compete.

## 2.1. Download & Extract
Mount Google Drive, then import the datasets from it.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
import pandas as pd

# Import the training data from the mounted Drive.
path = "/content/drive/My Drive/data/train_set_6733.csv"
train_df = pd.read_csv(path)

# Report the number of messages.
print('Number of training messages: {:,}\n'.format(train_df.shape[0]))

# The dataset is now stored in a pandas DataFrame.

Number of training messages: 586



In [6]:
# Import the term dictionary used later to filter sentences
path_dict = "/content/drive/My Drive/data/dict_topic_for_bert.csv"
dict_for_bert = pd.read_csv(path_dict)

In [7]:
train_df.iloc[10]["content"]
# Plan: split each message on '.' or '>' into sentences, then pair every sentence with
#   - the message's primary category
#   - its 3.1 label
#   - its 3.6 label
#   - its 3.2 label

'Just a "heads up" ... Ken may get a call from Gov Gilmore regarding the Republican Governors\' Association. Below are Sue Landwehr\'s recommendations (with which I concur). Let me know if he calls and let us know if you need any additional information. ----- Forwarded by Steven J Kean/NA/Enron on 10/04/2000 09:20 AM ----- Richard Shapiro@ENRON 10/04/2000 07:17 AM To: Susan M Landwehr/HOU/EES@EES cc: Elizabeth Linnell/NA/Enron@Enron@EES, Steven J Kean/NA/Enron@Enron@EES Subject: Re: RGA request I agree w/ your recommendations. Susan M Landwehr@EES 10/03/2000 09:50 PM To: Richard Shapiro/NA/Enron@Enron cc: Elizabeth Linnell/NA/Enron@Enron, Steven J Kean/NA/Enron@Enron Subject: RGA request Rick--you may have seen a recent letter from Gov Jim Gilmore and the RGA requesting that we make an additional contribution in the next few weeks to the RGA for their efforts on the upcoming November elections. They list a fundraising goal of $1,660,000 (just a bit aggressive!) If we are not able to ma

In [51]:
train_df['any_3_6'].describe()

count    586.000000
mean       0.293515
std        0.455762
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: any_3_6, dtype: float64
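
For a 0/1 label, the `mean` reported by `describe()` is the fraction of positive rows, so 0.293515 × 586 ≈ 172 messages carry the 3.6 label. The sketch below rebuilds the summary from that count using only the standard library. The count of 172 is inferred from the output above, not read from the data.

```python
import statistics

n_total = 586      # 'count' from the describe() output
n_positive = 172   # inferred: round(0.293515 * 586)

# Rebuild an equivalent 0/1 label column.
labels = [1] * n_positive + [0] * (n_total - n_positive)

mean = statistics.mean(labels)   # fraction of positive messages
std = statistics.stdev(labels)   # sample standard deviation (divides by n - 1), as pandas does

print(f"mean={mean:.6f} std={std:.6f}")
```

Reading the summary this way shows the message-level data is imbalanced: roughly 29% positive and 71% negative for this label.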

In [9]:
train_randselect = []
train_t31 = []
train_t36 = []
train_t32 = []

for index, row in train_df.iterrows():
    if isinstance(row['content'], str):
        # Split the message on '.' once and pair each piece with every label.
        for sentence in row['content'].split('.'):
            train_randselect.append([row['primary_cat'], sentence])
            train_t31.append([row['any_3_1'], sentence])
            train_t36.append([row['any_3_6'], sentence])
            train_t32.append([row['any_3_2'], sentence])
    else:
        # Report rows whose content is not a string (e.g. missing) and skip them.
        print(index)


1
54
83
288
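
The loop above calls `split('.')`, so an ellipsis such as `...` yields empty "sentences", and the `>` quote marker is left inside the text. A minimal sketch of a splitter that handles both is shown below. The helper name `split_sentences` and the choice to drop empty pieces are assumptions for illustration, not part of the original notebook, and the approach still breaks on abbreviations and decimal numbers.

```python
import re

def split_sentences(text):
    """Split on '.' or '>' and drop pieces that are empty after trimming."""
    pieces = re.split(r"[.>]", text)
    return [piece.strip() for piece in pieces if piece.strip()]

sample = 'Just a "heads up" ... Ken may get a call. > Let me know if he calls.'

naive = sample.split('.')          # what the loop above does
cleaned = split_sentences(sample)  # hypothetical alternative

print(len(naive), naive)
print(len(cleaned), cleaned)
```

Dropping the empty pieces would reduce the sentence count, so any class-balance figures computed on the split data would change accordingly.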


In [10]:
train_t36_df = pd.DataFrame(train_t36, columns = ['Label', 'Sentence'])   # 34% yes  /// 32866 sentences
train_t31_df = pd.DataFrame(train_t31, columns = ['Label', 'Sentence'])   # 19% yes
train_t32_df = pd.DataFrame(train_t32, columns = ['Label', 'Sentence'])   # 14% yes

In [11]:
term_list = dict_for_bert['Term']
term_list

0              lower
1             member
2              taken
3        spokeswoman
4             summer
            ...     
1494           light
1495        research
1496            love
1497    circumstance
1498           phone
Name: Term, Length: 1499, dtype: object

In [43]:
# Build the term set once so each membership test is O(1).
term_set = set(dict_for_bert['Term'])

# Flag sentences that contain at least one dictionary term
# (exact, case-sensitive match on space-separated tokens).
train_t36_bool_list = [
    isinstance(sentence, str) and any(word in term_set for word in sentence.split(' '))
    for sentence in train_t36_df['Sentence']
]

train_t36_df['term_bool'] = train_t36_bool_list
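
The check above matches dictionary terms against space-separated tokens exactly, so capitalization and attached punctuation prevent a match. The sketch below contrasts that behavior with a normalized variant. The three terms come from the dictionary preview shown earlier, while `has_term_normalized` and its tokenization pattern are assumptions for illustration only.

```python
import re

# A few terms from the dictionary preview; the real list has 1,499 entries.
terms = {"lower", "member", "phone"}

def has_term_exact(sentence, terms):
    """Mirror the notebook's check: exact match on space-separated tokens."""
    return any(token in terms for token in sentence.split(" "))

def has_term_normalized(sentence, terms):
    """Hypothetical variant: lowercase the text and ignore punctuation."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    return any(token in terms for token in tokens)

sentence = "Call my Phone, please"
print(has_term_exact(sentence, terms))       # "Phone," is not the token "phone"
print(has_term_normalized(sentence, terms))  # matches after normalization
```

Whether normalization is desirable depends on how the dictionary itself was built. If its terms were already lowercased and stripped of punctuation, the normalized variant would likely retain more sentences.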

In [35]:
i = 1
# Assign through a single .loc call: the chained form .iloc[i]["term_bool"] = ... writes to a temporary copy and triggers pandas' SettingWithCopyWarning.
train_t36_df.loc[train_t36_df.index[i], "term_bool"] = any(item in train_t36_df.iloc[i]['Sentence'].split(' ') for item in dict_for_bert['Term'])


In [48]:
train_t36_df['term_bool'].describe()

count     32866
unique        2
top        True
freq      25048
Name: term_bool, dtype: object

In [49]:
train_t36_df_red = train_t36_df[train_t36_df['term_bool'] == True]
len(train_t36_df_red)

25048

In [56]:
train_t36_df_red.to_csv('train_t36_ready_complete.csv', index=True)
!cp train_t36_ready_complete.csv "drive/My Drive/"

In [53]:
train_t36_df_1000 = train_t36_df_red.sample(1000)

In [54]:
train_t36_df_1000['Label'].describe()  # 33%

count    1000.000000
mean        0.333000
std         0.471522
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: Label, dtype: float64
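
`sample(1000)` draws a different subset on every run, so the positive rate reported above will vary slightly between sessions. A minimal sketch, assuming a fixed seed is acceptable for the experiment: passing `random_state` makes the draw repeatable, and `value_counts(normalize=True)` reports the class balance directly. The toy frame and the seed value 42 are illustrative, not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the filtered sentence table.
toy_df = pd.DataFrame({
    "Label": [0, 0, 0, 0, 1, 1, 0, 1, 0, 0],
    "Sentence": [f"sentence {i}" for i in range(10)],
})

# The same seed yields the same rows on every call.
sample_a = toy_df.sample(5, random_state=42)
sample_b = toy_df.sample(5, random_state=42)

print(sample_a.index.equals(sample_b.index))
print(sample_a["Label"].value_counts(normalize=True))
```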