#### BERT Modeling 
For BERT fine-tuning and modeling, I use the code from the notebook linked below: 
https://colab.research.google.com/drive/1eJ77vkDWbMdZIuDKWqdHoT8Tc7j90zlP#scrollTo=GLs72DuMODJO

In [1]:
!pip install pytorch-pretrained-bert pytorch-nlp

Collecting pytorch-pretrained-bert
[?25l  Downloading https://files.pythonhosted.org/packages/d7/e0/c08d5553b89973d9a240605b9c12404bcf8227590de62bae27acbcfe076b/pytorch_pretrained_bert-0.6.2-py3-none-any.whl (123kB)
[K     |████████████████████████████████| 133kB 8.4MB/s 
[?25hCollecting pytorch-nlp
[?25l  Downloading https://files.pythonhosted.org/packages/4f/51/f0ee1efb75f7cc2e3065c5da1363d6be2eec79691b2821594f3f2329528c/pytorch_nlp-0.5.0-py3-none-any.whl (90kB)
[K     |████████████████████████████████| 92kB 7.8MB/s 
Installing collected packages: pytorch-pretrained-bert, pytorch-nlp
Successfully installed pytorch-nlp-0.5.0 pytorch-pretrained-bert-0.6.2


In [2]:
#Imported Libraries
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from pytorch_pretrained_bert import BertTokenizer, BertConfig
from pytorch_pretrained_bert import BertAdam, BertForSequenceClassification
from tqdm import tqdm, trange
import pandas as pd
import numpy as np
import tensorflow as tf

#Double checking if we are in a GPU server
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

Found GPU at: /device:GPU:0


'Tesla P100-PCIE-16GB'

In [3]:
#Import cleaned data and store it in a dataframe called train
from google.colab import files
uploaded = files.upload()

train = pd.read_csv('train_clean.csv', index_col=0)
train.head()

Saving train_clean.csv to train_clean.csv


Unnamed: 0,user,tweet,date,label,website_link
1978.0,CSchwier90,not a big deal yet only states dont have a ca...,3/13/2020,0.0,1
1281.0,C1231Will,a kid at the school was tested positive fo...,3/9/2020,0.0,0
2040.0,seklemec,tested positive for being too cute,3/15/2020,0.0,0
1864.0,vxIcanic,ugh just found out ive been tested positive...,3/13/2020,0.0,0
1163.0,DigitalTrends,an employee at s hq tested positive for,3/9/2020,0.0,0


#### BERT Pre-processing: 
BERT requires input data to follow a certain format, to comply with requirements the following functions & steps will be applied:
1.   `bert_preprocessing`: Adds special tokens ([CLS] & [SEP]) at the beginning and end of each tweet. It would then tokenize the tweet and convert tokens into ids. Since BERT requires input array to be of the same size, a maximum length of 280 will be selected (Twitter's max character count.) Then the tweets will be padded(adding zero's (0) for shorter tweets) and truncated(cutting longer tweet to max length) so that input sequence will be the same size. 
2.   `create_mask`: create a mask of 1 for tokens and 0 for padding.
3. Splitting training data and mask 
4. Converting data to tensor data
5. Create an iterator using torch DataLoader to save on memory. 




In [4]:
def bert_preprocessing(df):
  ''' Adds special tokens (CLS & SEP) to tweets then tokenize it. It then tokenize tweets and convert it to ids followed by padding and trunicating'''
  #Adding [CLS] and [SEP] to each tweet
  tweet = [f'[CLS] {i} [SEP]' for i in df.tweet.astype('str').values]
  
  #Load pre-trained Bert tokenizer & tokenize the tweets
  tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
  token = [tokenizer.tokenize(token) for token in tweet] 
  
  #Convert tokenize tweets to ids
  token_ids = [tokenizer.convert_tokens_to_ids(x) for x in token]
    
  #The token ids will be padded and truncated token_ids so that it is within BERT required format (requires to be the same length)
  #Twitter's max character length is 280 characters, which will be used as max length. 
  #Padding - adds zero's (0) to the tweets that are shorter than 280 characters. 
  #Trunicating - cuts tweets that are past the 280 character limit. 
  token_ids = pad_sequences(token_ids, maxlen=280, dtype='long', truncating='post', padding='post')
  
  return token_ids

def create_masks(token):
  '''Creates a lisk of token and padding mask for data'''
  #Create attention mask 
  attention_masks = [] 

  #Create a mask of 1 for tokens and 0 for padding 
  for x in token:
      mask = [float(i<0) for i in x]
      attention_masks.append(mask)
  
  return attention_masks

#Applying function 
token_id = bert_preprocessing(train)
attention_masks = create_masks(token_id)

100%|██████████| 231508/231508 [00:00<00:00, 904198.29B/s]


In [6]:
#Setting y-variable
y = train.label.values

#Splitting tokenize ids and mask
X_train, X_test, y_train, y_test = train_test_split(token_id, y, random_state=42, stratify=y)

X_train_mask, X_test_mask, _, _ = train_test_split(attention_masks,token_id, random_state=42, stratify=y)

#Convert to tensor data
var_list = [X_train, X_test, y_train, y_test, X_train_mask, X_test_mask]
X_train, X_test, y_train, y_test, X_train_mask, X_test_mask = [torch.tensor(var) for var in var_list]

X_train = X_train.long()
y_train = y_train.long()

In [7]:
# Select a batch size for training.
batch_size = 2

# Create an iterator of our data with torch DataLoader to helps save on memory during training
train_data = TensorDataset(X_train, X_train_mask, y_train)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

test_data = TensorDataset(X_test, X_test_mask, y_test)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)

#### Fine-tuning BertForSequeceClassification

In [8]:
#Setting up model 
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.cuda() #Tell pytorch to run model on the GPU 

100%|██████████| 407873900/407873900 [00:14<00:00, 27830050.52B/s]


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
   

In [9]:
# Setting Decay Rate
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0}] 

#Setting Optimizer (contains all the hyperparamter for training loop)
optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=2e-5,
                     warmup=.1)

t_total value of -1 results in schedule not being applied


In [10]:
# Calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

In [11]:
#Code below is taken from the notebook stated above
t = [] 

# Store our loss and accuracy for plotting
train_loss_set = []

# Number of training epochs (authors recommend between 2 and 4)
epochs = 4

# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):
  
  
  # Training
  
  # Set our model to training mode (as opposed to evaluation mode)
  model.train()
  
  # Tracking variables
  tr_loss = 0
  nb_tr_examples, nb_tr_steps = 0, 0
  
  # Train the data for one epoch
  for step, batch in enumerate(train_dataloader):
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    # Clear out the gradients (by default they accumulate)
    optimizer.zero_grad()
    # Forward pass
    loss = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
    train_loss_set.append(loss.item())    
    # Backward pass
    loss.backward()
    # Update parameters and take a step using the computed gradient
    optimizer.step()
    
    
    # Update tracking variables
    tr_loss += loss.item()
    nb_tr_examples += b_input_ids.size(0)
    nb_tr_steps += 1

  print("Train loss: {}".format(tr_loss/nb_tr_steps))
    
    
  # Validation

  # Put model in evaluation mode to evaluate loss on the validation set
  model.eval()

  # Tracking variables 
  eval_loss, eval_accuracy = 0, 0
  nb_eval_steps, nb_eval_examples = 0, 0

  # Evaluate data for one epoch
  for batch in test_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    # Telling the model not to compute or store gradients, saving memory and speeding up validation
    with torch.no_grad():
      # Forward pass, calculate logit predictions
      logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    
    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1

  print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))

	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  next_m.mul_(beta1).add_(1 - beta1, grad)


Train loss: 0.7048370620434724


Epoch:  25%|██▌       | 1/4 [00:34<01:44, 34.88s/it]

Validation Accuracy: 0.45652173913043476
Train loss: 0.6935821107579666


Epoch:  50%|█████     | 2/4 [01:09<01:09, 34.87s/it]

Validation Accuracy: 0.45652173913043476
Train loss: 0.6976006800688587


Epoch:  75%|███████▌  | 3/4 [01:44<00:34, 34.84s/it]

Validation Accuracy: 0.5434782608695652
Train loss: 0.697975520080733


Epoch: 100%|██████████| 4/4 [02:19<00:00, 34.78s/it]

Validation Accuracy: 0.5434782608695652





In [None]:
import pickle
pd.to_pickle(model, 'bert_model.pkl')