<br/>

# Part 3: Harness the Beast -- Fine-Tuning Google Bert & OpenAI GPT2  

<br/>

Since the ULMFiT paper and Google Bert came out, transfer learning has become one of the most important methods for training a state-of-the-art NLP model. Fine-tuning Bert & GPT2 requires a lot of compute, but the conclusion is that it is totally worth it.

Several things are important when fine-tuning Bert:

## **1. Translate the pretrained weights<br/>**
Google Bert was trained with the TensorFlow framework, but we want to fine-tune the model in PyTorch because it is more user friendly. So we have to translate the TensorFlow pretrained weights into PyTorch weights. The **pytorch_pretrained_bert** package helps us do that.


## **2. Warm-up <br/>**
A triangular learning rate schedule works well for fine-tuning (according to the paper). It has a warm-up phase (about 5% of the total steps; this is a hyper-parameter) during which the learning rate grows from zero to the full learning rate. After the warm-up it decays slowly and uniformly, reaching 0 exactly at the last training step. So we need to calculate the total number of steps before we start training.
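
The shape described above can be sketched in a few lines of plain Python. This is only an illustration of the triangle, not the code that `BertAdam` runs internally; the function name `triangular_lr` and its arguments are made up for this example, and the peak learning rate, warm-up fraction and step count are the values used later in this notebook.

```python
def triangular_lr(step, total_steps, peak_lr=2e-5, warmup=0.05):
    """Learning rate at `step` for a linear warm-up followed by a linear decay."""
    progress = step / total_steps
    if progress < warmup:
        # Warm-up: grow linearly from 0 to peak_lr.
        return peak_lr * progress / warmup
    # Decay: shrink linearly so that the rate hits 0 at the last step.
    return peak_lr * max(0.0, (1.0 - progress) / (1.0 - warmup))


# Total number of optimizer steps, computed before training starts.
epochs, train_size, batch_size, accumulation_steps = 2, 1704874, 128, 1
total_steps = int(epochs * train_size / batch_size / accumulation_steps)

for step in (0, total_steps // 20, total_steps // 2, total_steps):
    print(step, triangular_lr(step, total_steps))
```

The step count has to be known up front because the decay slope depends on it: if it is too small the learning rate reaches 0 before training ends, and if it is too large training stops with a learning rate that never came down.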

## **3. Gradient Accumulation & Automatic Mixed Precision <br/>**
Batch size is an important hyper-parameter in deep learning. A batch size that is too small leads to unstable gradients (and is hard to parallelize). The problem is that the GPU in a Kaggle kernel (Tesla P100) has only 16 GB of memory, which is not enough even for batch size 32. There are two approaches to deal with this:

**Automatic Mixed Precision**<br/>
**Nvidia** recently released a very useful package for this: **AMP**<br/>

There are 3 main benefits:<br/>
**Speeds up math-intensive operations**, such as linear and convolution layers, by using Tensor Cores.<br/>
**Speeds up memory-limited operations** by accessing half the bytes compared to single-precision.<br/>
**Reduces memory requirements for training models**, enabling larger models or larger minibatches.<br/>

It's so useful and so easy to plug & play, so we have no reason to reject it! :)<br/>

For a detailed explanation, please check [here](https://developer.nvidia.com/automatic-mixed-precision)
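
One detail is worth knowing before reading the training loop: half precision cannot represent very small numbers, so tiny gradients would silently turn into zero. Mixed-precision training works around this by multiplying the loss by a scale factor before the backward pass and dividing the gradients by it afterwards, which is what the `amp.scale_loss` context manager in the training cell takes care of. The sketch below only imitates the idea with the standard library; `to_half`, the gradient value and the scale factor are made up for illustration.

```python
import struct


def to_half(value):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack('e', struct.pack('e', value))[0]


tiny_grad = 1e-8          # hypothetical gradient value
loss_scale = 1024.0       # hypothetical scale factor

# Stored directly in half precision, the gradient underflows to zero.
unscaled = to_half(tiny_grad)

# Scaled first, it survives the trip through half precision, and dividing
# by the scale in full precision recovers it approximately.
recovered = to_half(tiny_grad * loss_scale) / loss_scale

print(unscaled, recovered)
```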


**Gradient Accumulation**<br/>
With AMP the batch size can reach 32 in a Kaggle kernel, but that is still not enough.
One more approach is gradient accumulation. At each step we just accumulate the gradient (without updating the parameters), and once batch_size*steps is large enough (64, 96 or 128) we update the parameters, which mimics a larger batch size.<br/>
It's easy to understand, but we need to be careful when implementing it, or the whole model may become garbage.
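
One thing to watch, among others, is the scale of the accumulated gradient. With a mean loss, gradients summed over several micro-batches are that many times larger than the gradient of one big batch, unless each micro-batch loss is divided by the number of accumulation steps. The framework-free sketch below checks this on a toy one-parameter regression; `grad_mse` and the data are made up for the illustration and are not part of the pipeline.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Reference: the gradient of one big batch.
full_grad = grad_mse(w, xs, ys)

accumulation_steps = 2
micro = len(xs) // accumulation_steps
summed = 0.0      # naive accumulation, no rescaling
averaged = 0.0    # each micro-batch gradient divided by accumulation_steps
for k in range(accumulation_steps):
    xb = xs[k * micro:(k + 1) * micro]
    yb = ys[k * micro:(k + 1) * micro]
    g = grad_mse(w, xb, yb)
    summed += g
    averaged += g / accumulation_steps

print(full_grad, summed, averaged)
```

The other classic mistake is clearing the gradients at the wrong moment: they must be zeroed only after an optimizer step, never between the micro-batches that belong to the same update.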


# Bert Fine-Tuning Pipeline

In [2]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np 
import pandas as pd
from joblib import Parallel, delayed
import os
import datetime
import pkg_resources
import seaborn as sns
import time
import scipy.stats as stats
import gc
import re
import operator 
import sys
from sklearn import metrics
from sklearn import model_selection
import torch
import torch.nn as nn
import torch.utils.data
import torch.nn.functional as F
from nltk.stem import PorterStemmer
from sklearn.metrics import roc_auc_score
%load_ext autoreload
%autoreload 2
%matplotlib inline
from tqdm import tqdm, tqdm_notebook
import os
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import warnings
warnings.filterwarnings(action='once')
import pickle
from apex import amp
import shutil

In [3]:
torch.cuda.device_count() # AWS p3.8xlarge --- 4 Tesla V100 

4

In [4]:
pd.read_csv('train.csv').__len__()

1804874

In [12]:
MAX_SEQUENCE_LENGTH = 230
SEED = 2025
EPOCHS = 2
Data_dir="./"
Input_dir = "./"
WORK_DIR = "./working/" # Create this directory yourself; the converted PyTorch Bert weights are saved here
num_to_load=1704874   # Train size
valid_size= 100000   # Validation size
TOXICITY_COLUMN = 'target'

In [5]:
from pytorch_pretrained_bert import convert_tf_checkpoint_to_pytorch
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification,BertAdam



In [6]:
# Translate model from tensorflow to pytorch
BERT_MODEL_PATH = './uncased_L-12_H-768_A-12/'
convert_tf_checkpoint_to_pytorch.convert_tf_checkpoint_to_pytorch(
    BERT_MODEL_PATH + 'bert_model.ckpt',
    BERT_MODEL_PATH + 'bert_config.json',
    WORK_DIR + 'pytorch_model.bin')

shutil.copyfile(BERT_MODEL_PATH + 'bert_config.json', WORK_DIR + 'bert_config.json')

Building PyTorch model from configuration: {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

Converting TensorFlow checkpoint from /home/ubuntu/.jupyter/uncased_L-12_H-768_A-12/bert_model.ckpt
Loading TF weight bert/embeddings/LayerNorm/beta with shape [768]
Loading TF weight bert/embeddings/LayerNorm/gamma with shape [768]
Loading TF weight bert/embeddings/position_embeddings with shape [512, 768]
Loading TF weight bert/embeddings/token_type_embeddings with shape [2, 768]
Loading TF weight bert/embeddings/word_embeddings with shape [30522, 768]
Loading TF weight bert/encoder/layer_0/attention/output/LayerNorm/beta with shape [768]
Loading TF weight bert/encoder/layer_0/attention/output/LayerNorm/gamma with shape [768]
Loading

'./working/bert_config.json'

In [7]:
# This is the Bert configuration file
from pytorch_pretrained_bert import BertConfig
bert_config = BertConfig('./uncased_L-12_H-768_A-12/'+'bert_config.json')


In [14]:
# Convert the lines to BERT input format: [CLS] tokens [SEP], zero-padded to max_seq_length
def convert_lines(example, max_seq_length,tokenizer):
    max_seq_length -= 2  # reserve two positions for [CLS] and [SEP]
    all_tokens = []
    longer = 0  # number of texts that had to be truncated
    for text in tqdm(example):
        tokens_a = tokenizer.tokenize(text)
        if len(tokens_a)>max_seq_length:
            tokens_a = tokens_a[:max_seq_length]
            longer += 1
        # ids for [CLS] + tokens + [SEP], then pad with 0 up to the full length
        one_token = tokenizer.convert_tokens_to_ids(["[CLS]"]+tokens_a+["[SEP]"])+[0] * (max_seq_length - len(tokens_a))
        all_tokens.append(one_token)
    print(longer)
    return np.array(all_tokens)
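
To make the layout concrete, here is a small sketch of what happens to one already-tokenized comment. It works on plain lists of ids, so no tokenizer is needed; `pad_or_truncate` is a made-up helper, and the ids 101, 102 and 0 for `[CLS]`, `[SEP]` and padding are assumed to match the uncased Bert vocabulary.

```python
def pad_or_truncate(token_ids, max_seq_length, cls_id=101, sep_id=102, pad_id=0):
    """Return [CLS] + ids + [SEP], truncated and zero-padded to max_seq_length."""
    budget = max_seq_length - 2              # two slots go to [CLS] and [SEP]
    body = list(token_ids[:budget])          # truncate long inputs
    padding = [pad_id] * (budget - len(body))
    return [cls_id] + body + [sep_id] + padding


short = pad_or_truncate([7, 8, 9], max_seq_length=6)
truncated = pad_or_truncate(range(5, 15), max_seq_length=6)
print(short)      # [101, 7, 8, 9, 102, 0]
print(truncated)  # [101, 5, 6, 7, 8, 102]

# Padding positions are 0, so an attention mask is simply "id > 0",
# the same test the training loop applies to each batch.
mask = [int(i > 0) for i in short]
print(mask)       # [1, 1, 1, 1, 1, 0]
```

Every row ends up with exactly `max_seq_length` entries, which is what allows the rows to be stacked into one rectangular array.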

In [15]:
identity_columns = [
    'male', 'female', 'homosexual_gay_or_lesbian', 'christian', 'jewish',
    'muslim', 'black', 'white', 'psychiatric_or_mental_illness']# for calculating validation score

# for custom loss
y_columns=['target']
y_aux_columns = ['target_prob','target_prob','target_prob','severe_toxicity', 'obscene', 'identity_attack', 'insult', 'threat']

In [16]:
tokenizer = BertTokenizer.from_pretrained(BERT_MODEL_PATH, cache_dir=None,do_lower_case=True)

In [17]:
train_df = pd.read_csv(os.path.join(Data_dir,"train.csv")).sample(num_to_load+valid_size,random_state=SEED)
print('loaded %d records' % len(train_df))

loaded 1804874 records


In [18]:
%%time
# Make sure all comment_text values are strings; missing comments become "DUMMY_VALUE"
train_df['comment_text'] = train_df['comment_text'].fillna("DUMMY_VALUE").astype(str)
sequences = convert_lines(train_df["comment_text"],MAX_SEQUENCE_LENGTH,tokenizer)
train_df=train_df.fillna(0)

train_df = train_df.drop(['comment_text'],axis=1)


100%|██████████| 1804874/1804874 [30:15<00:00, 994.10it/s]


20250
CPU times: user 30min 47s, sys: 4.94 s, total: 30min 52s
Wall time: 30min 45s


In [None]:
train_df['target_prob']=train_df['target']

In [21]:
train_df['target']=(train_df['target']>=0.5).astype(float)

In [20]:
# fw = open('sequences180_with_length','wb')  
# pickle.dump(sequences, fw, -1)  
# pickle.dump(seq_length,fw)
# pickle.dump(train_df, fw)  
# fw.close()  

In [11]:
# fw = open('sequences','rb')  
# sequences = pickle.load(fw)
# train_df = pickle.load(fw)  
# fw.close()  

In [22]:
# np.savez("180w_seed_2098_len_230.npz", sequences)
# train_df.to_csv('180w_seed_2098_len_230.csv',index=False)

In [14]:
# r = np.load("180w_seed_2098_len_230.npz")
# sequences = r['arr_0']
# train_df = pd.read_csv('180w_seed_2098_len_230.csv')

In [24]:
sequences.shape

(1804874, 230)

In [25]:
train_df.shape

(1804874, 45)

In [27]:
# Build up the per-sample weights for the custom loss

# Overall
weights = np.ones((len(train_df),)) / 4

# Subgroup
weights += (train_df[identity_columns].fillna(0).values>=0.5).sum(axis=1).astype(bool).astype(int) / 4

# Background Positive, Subgroup Negative
weights += (( (train_df['target'].values>=0.5).astype(bool).astype(int) +
   (train_df[identity_columns].fillna(0).values<0.5).sum(axis=1).astype(bool).astype(int) ) > 1 ).astype(bool).astype(int) / 4

# Background Negative, Subgroup Positive
weights += (( (train_df['target'].values<0.5).astype(bool).astype(int) +
   (train_df[identity_columns].fillna(0).values>=0.5).sum(axis=1).astype(bool).astype(int) ) > 1 ).astype(bool).astype(int) / 4
loss_weight = 1.0 / weights.mean()
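
Written out for a single row, the four terms above look like this. The sketch mirrors the array code term by term, including the fact that the "subgroup negative" test is true as soon as any identity column is below 0.5; `sample_weight` is a made-up name and the example rows are invented.

```python
def sample_weight(target, identities, threshold=0.5):
    """Weight of one row: 1/4 for each of the four conditions that holds."""
    toxic = target >= threshold
    any_identity = any(v >= threshold for v in identities)
    any_non_identity = any(v < threshold for v in identities)

    weight = 0.25                                     # overall
    weight += 0.25 * any_identity                     # subgroup
    weight += 0.25 * (toxic and any_non_identity)     # background positive, subgroup negative
    weight += 0.25 * ((not toxic) and any_identity)   # background negative, subgroup positive
    return weight


no_identity = [0.0] * 9
one_identity = [1.0] + [0.0] * 8

print(sample_weight(0.0, no_identity))    # 0.25
print(sample_weight(1.0, no_identity))    # 0.5
print(sample_weight(0.0, one_identity))   # 0.75
print(sample_weight(1.0, one_identity))   # 0.75
```

Rows that mention an identity therefore count up to three times as much as plain non-toxic rows, which pushes the model toward the subgroup metrics used for validation.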

In [29]:
X = sequences[:num_to_load]                
y = train_df[y_columns+y_aux_columns].values[:num_to_load]
X_val = sequences[num_to_load:]                
y_val = train_df[y_columns+y_aux_columns].values[num_to_load:]
weights_train = weights[:num_to_load] 

In [30]:
test_df=train_df.tail(valid_size).copy()
train_df=train_df.head(num_to_load)

In [31]:
output_model_file = "2epoch_no_dense_170w_seqbuck.bin"
lr=2e-5
batch_size = 128
accumulation_steps=1
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

<torch._C.Generator at 0x7f69d70d65f0>

In [32]:
weights_torch = torch.tensor(weights_train,dtype=torch.float)
train_dataset = torch.utils.data.TensorDataset(torch.tensor(X,dtype=torch.long),torch.tensor(y,dtype=torch.float),weights_torch)
train = train_dataset

## Build the Network

In [33]:
BERT_OUT = 128
DENSE_HIDDEN_UNITS = 128

In [34]:
class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        self.bert_layer = BertForSequenceClassification.from_pretrained("./working",cache_dir=None,num_labels = BERT_OUT)
        self.linear_out = nn.Linear(BERT_OUT, 1)
        self.linear_aux_out = nn.Linear(BERT_OUT, 8)
        #self.identity_out = nn.Linear(BERT_OUT, 9)
        self.drop_out_layer = nn.Dropout(p=0.1)        
        
    def forward(self, x, attention_mask=None, labels=None):
        bert_out = self.bert_layer(x, attention_mask=attention_mask, labels=labels)
        drop_out_layer_out = self.drop_out_layer(bert_out)
        result = self.linear_out(drop_out_layer_out)
        aux_result = self.linear_aux_out(bert_out)
        #identity_result = self.identity_out(bert_out)
        out = torch.cat([result, aux_result], 1)
        
        return out

In [36]:
model = NeuralNet()
model.zero_grad()
model = model.cuda()
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']

In [38]:
# Calculate the total number of optimizer steps, i.e. the step at which the learning rate reaches 0
num_train_optimization_steps = int(EPOCHS*len(train)/batch_size/accumulation_steps)
num_train_optimization_steps

26638

In [40]:
# original Bert setting
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]

# warmup is important
optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=lr,
                     warmup=0.05,
                     t_total=num_train_optimization_steps)

model, optimizer = amp.initialize(model, optimizer, opt_level="O1",verbosity=0)

model=model.train()
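
The two parameter groups are selected purely by substring matching on the parameter names: anything whose name contains `bias`, or that belongs to a `LayerNorm`, is excluded from weight decay. A small sketch with a handful of illustrative names (not the full list from the model):

```python
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']

# Illustrative parameter names in the style PyTorch reports for a Bert model.
names = [
    'bert.embeddings.word_embeddings.weight',
    'bert.embeddings.LayerNorm.weight',
    'bert.embeddings.LayerNorm.bias',
    'bert.encoder.layer.0.attention.self.query.weight',
    'bert.encoder.layer.0.attention.self.query.bias',
    'linear_out.weight',
    'linear_out.bias',
]

# Same filters as in the optimizer setup, applied to names only.
with_decay = [n for n in names if not any(nd in n for nd in no_decay)]
without_decay = [n for n in names if any(nd in n for nd in no_decay)]

print(with_decay)
print(without_decay)
```

Only the weight matrices of the embeddings and linear layers end up in the first group; biases and layer-norm parameters go to the second.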

In [41]:
model = torch.nn.DataParallel(model) 

## Number of Trainable Parameters per Group

In [43]:
optimizer_grouped_parameters[0]['params'].__len__()

79

In [44]:
optimizer_grouped_parameters[1]['params'].__len__()

126

## Loss Function

In [46]:
def custom_loss(preds,targets,weights):
    ''' Define custom loss function for weighted BCE on 'target' column '''
    bce_loss_1 = nn.BCEWithLogitsLoss(weight=weights)(preds[:,0],targets[:,0])
    bce_loss_2 = nn.BCEWithLogitsLoss()(preds[:,1:],targets[:,1:])
    return ((bce_loss_1 * loss_weight)*0.60 + bce_loss_2*0.40)*2 
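
One detail of this loss is easy to miss. `nn.BCEWithLogitsLoss(weight=...)` with its default mean reduction averages the weighted per-sample losses over the number of samples, not over the sum of the weights, so multiplying by `loss_weight = 1 / weights.mean()` brings the weighted term back to roughly the scale of an unweighted loss. The sketch below reproduces that arithmetic in plain Python for a tiny batch; the helper names and numbers are made up.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def weighted_mean_bce(logits, targets, weights):
    """Mean over the batch of weight * loss, as with a mean reduction."""
    losses = [w * bce_with_logits(x, t) for x, t, w in zip(logits, targets, weights)]
    return sum(losses) / len(losses)


logits = [2.0, -1.0, 0.5, -3.0]
targets = [1.0, 0.0, 0.0, 1.0]
weights = [0.5, 0.5, 0.5, 0.5]   # hypothetical: every row has the same weight

loss_weight = 1.0 / (sum(weights) / len(weights))
plain = weighted_mean_bce(logits, targets, [1.0] * len(logits))
rescaled = weighted_mean_bce(logits, targets, weights) * loss_weight

print(plain, rescaled)  # identical when all weights are equal
```

With unequal weights the two numbers differ, which is the point of the weighting, but the rescaling keeps the main term comparable in size to the auxiliary term before the 0.60 / 0.40 mix is applied.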

## Training

In [50]:
tq = range(EPOCHS)
for epoch in tq:
    start_time = time.time()
    train_loader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True)
    #train_loader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True)
    avg_loss = 0.
    avg_accuracy = 0.
    lossf=None
    count = 0
    optimizer.zero_grad()   # Bug fix - thanks to @chinhuic
    for i,(x_batch, y_batch,weight_batch) in enumerate(train_loader):
#        optimizer.zero_grad()
        y_pred = model(x_batch.cuda(), attention_mask=(x_batch>0).cuda(), labels=None)
        loss =  custom_loss(y_pred,y_batch.cuda(),weight_batch.cuda())
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        if (i+1) % accumulation_steps == 0:     # Wait for several backward steps
            optimizer.step()                    # Now we can do an optimizer step
            optimizer.zero_grad()    
        if lossf:
            lossf = 0.98*lossf+0.02*loss.item()
        else:
            lossf = loss.item()
        count += 1
        if count%100 == 0:
            print('step:',count)
            print('time_elapsed:',(time.time()-start_time)/60,'min',lossf)
    torch.save(model.state_dict(), 'batch128_probtarget2_'+str(epoch)+'.bin')

step: 100
time_elapsed: 1.4143638451894125 min 1.1232507607062325
step: 200
time_elapsed: 2.6277382373809814 min 0.8408596892799032
step: 300
time_elapsed: 3.8333160916964215 min 0.6025500591032656
step: 400
time_elapsed: 5.036450576782227 min 0.4747738666715351
step: 500
time_elapsed: 6.239763224124909 min 0.4009440955322863
step: 600
time_elapsed: 7.442309661706289 min 0.3805666459231075
step: 700
time_elapsed: 8.650884234905243 min 0.3692055115898399
step: 800
time_elapsed: 9.861127376556396 min 0.35607749229518393
step: 900
time_elapsed: 11.067019267876942 min 0.3506028733813989
step: 1000
time_elapsed: 12.269352165857951 min 0.35626946913882585
step: 1100
time_elapsed: 13.472266602516175 min 0.3469740949427739
step: 1200
time_elapsed: 14.675750207901 min 0.3444224798299555
step: 1300
time_elapsed: 15.879150104522704 min 0.3323961508100495
step: 1400
time_elapsed: 17.088626070817313 min 0.33060169265859163
step: 1500
time_elapsed: 18.293953319390614 min 0.3346307585501615
step: 160
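
The last number on each `time_elapsed` line is not the loss of a single batch but a running average: each new batch loss contributes 2% and the previous average keeps 98%, which smooths out batch-to-batch noise. A minimal sketch of that update on made-up loss values (`smooth` is a hypothetical helper):

```python
def smooth(losses, keep=0.98, new=0.02):
    """Exponential moving average, seeded with the first value."""
    average = None
    history = []
    for loss in losses:
        if average is None:
            average = loss                         # first batch: no history yet
        else:
            average = keep * average + new * loss  # blend in the new batch
        history.append(average)
    return history


noisy = [1.2, 0.4, 1.0, 0.3, 0.9, 0.2]   # made-up batch losses
smoothed = smooth(noisy)
print(smoothed)
```

Because the average moves slowly, the printed value lags behind the true loss during the first few hundred steps, when the loss is falling quickly.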

## Run validation

In [51]:
# Run validation
model =NeuralNet()
model = torch.nn.DataParallel(model)
model.load_state_dict(torch.load('batch128_probtarget2_1.bin'))
model.cuda()
for param in model.parameters():
    param.requires_grad=False
model.eval()
valid_preds = np.zeros((len(X_val)))
valid = torch.utils.data.TensorDataset(torch.tensor(X_val,dtype=torch.long))
valid_loader = torch.utils.data.DataLoader(valid, batch_size=128, shuffle=False)

tk0 = tqdm(valid_loader)
for i,(x_batch,)  in enumerate(tk0):
    pred = model(x_batch.cuda(), attention_mask=(x_batch>0).cuda(), labels=None)
    valid_preds[i*128:(i+1)*128]=pred[:,0].detach().cpu().squeeze().numpy()




IncompatibleKeys(missing_keys=[], unexpected_keys=[])

DataParallel(
  (module): NeuralNet(
    (bert_layer): BertForSequenceClassification(
      (bert): BertModel(
        (embeddings): BertEmbeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (token_type_embeddings): Embedding(2, 768)
          (LayerNorm): FusedLayerNorm(torch.Size([768]), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1)
        )
        (encoder): BertEncoder(
          (layer): ModuleList(
            (0): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1)
                )
                (output): BertSelfOutput(
                  (den


100%|██████████| 782/782 [03:28<00:00,  4.22it/s]


In [52]:
# Calculate final score

def calculate_overall_auc(df, model_name):
    true_labels = df[TOXICITY_COLUMN]>0.5
    predicted_labels = df[model_name]
    return metrics.roc_auc_score(true_labels, predicted_labels)

def power_mean(series, p):
    total = sum(np.power(series, p))
    return np.power(total / len(series), 1 / p)

def get_final_metric(bias_df, overall_auc, POWER=-5, OVERALL_MODEL_WEIGHT=0.25):
    bias_score = np.average([
        power_mean(bias_df[SUBGROUP_AUC], POWER),
        power_mean(bias_df[BPSN_AUC], POWER),
        power_mean(bias_df[BNSP_AUC], POWER)
    ])
    return (OVERALL_MODEL_WEIGHT * overall_auc) + ((1 - OVERALL_MODEL_WEIGHT) * bias_score)



SUBGROUP_AUC = 'subgroup_auc'
BPSN_AUC = 'bpsn_auc'  # stands for background positive, subgroup negative
BNSP_AUC = 'bnsp_auc'  # stands for background negative, subgroup positive

def compute_auc(y_true, y_pred):
    try:
        return metrics.roc_auc_score(y_true, y_pred)
    except ValueError:
        return np.nan

def compute_subgroup_auc(df, subgroup, label, model_name):
    subgroup_examples = df[df[subgroup]>0.5]
    return compute_auc((subgroup_examples[label]>0.5), subgroup_examples[model_name])

def compute_bpsn_auc(df, subgroup, label, model_name):
    """Computes the AUC of the within-subgroup negative examples and the background positive examples."""
    subgroup_negative_examples = df[(df[subgroup]>0.5) & (df[label]<=0.5)]
    non_subgroup_positive_examples = df[(df[subgroup]<=0.5) & (df[label]>0.5)]
    examples = pd.concat([subgroup_negative_examples, non_subgroup_positive_examples])
    return compute_auc(examples[label]>0.5, examples[model_name])

def compute_bnsp_auc(df, subgroup, label, model_name):
    """Computes the AUC of the within-subgroup positive examples and the background negative examples."""
    subgroup_positive_examples = df[(df[subgroup]>0.5) & (df[label]>0.5)]
    non_subgroup_negative_examples = df[(df[subgroup]<=0.5) & (df[label]<=0.5)]
    examples = pd.concat([subgroup_positive_examples, non_subgroup_negative_examples])
    return compute_auc(examples[label]>0.5, examples[model_name])

def compute_bias_metrics_for_model(dataset,
                                   subgroups,
                                   model,
                                   label_col,
                                   include_asegs=False):
    """Computes per-subgroup metrics for all subgroups and one model."""
    records = []
    for subgroup in subgroups:
        record = {
            'subgroup': subgroup,
            'subgroup_size': len(dataset[dataset[subgroup]>0.5])
        }
        record[SUBGROUP_AUC] = compute_subgroup_auc(dataset, subgroup, label_col, model)
        record[BPSN_AUC] = compute_bpsn_auc(dataset, subgroup, label_col, model)
        record[BNSP_AUC] = compute_bnsp_auc(dataset, subgroup, label_col, model)
        records.append(record)
    return pd.DataFrame(records).sort_values('subgroup_auc', ascending=True)


In [53]:

MODEL_NAME = 'model1'
test_df[MODEL_NAME]=torch.sigmoid(torch.tensor(valid_preds)).numpy()
TOXICITY_COLUMN = 'target'
bias_metrics_df = compute_bias_metrics_for_model(test_df, identity_columns, MODEL_NAME, 'target')
bias_metrics_df
get_final_metric(bias_metrics_df, calculate_overall_auc(test_df, MODEL_NAME))

| | bnsp_auc | bpsn_auc | subgroup | subgroup_auc | subgroup_size |
|---|---|---|---|---|---|
| 2 | 0.961452 | 0.906807 | homosexual_gay_or_lesbian | 0.863688 | 556 |
| 7 | 0.967091 | 0.908245 | white | 0.880614 | 1269 |
| 5 | 0.96018 | 0.923157 | muslim | 0.882408 | 1049 |
| 6 | 0.973 | 0.896655 | black | 0.893712 | 747 |
| 4 | 0.955299 | 0.944347 | jewish | 0.908925 | 379 |
| 8 | 0.960492 | 0.942113 | psychiatric_or_mental_illness | 0.915371 | 235 |
| 3 | 0.942202 | 0.968407 | christian | 0.927764 | 2006 |
| 1 | 0.958788 | 0.960312 | female | 0.936172 | 2818 |
| 0 | 0.963767 | 0.95418 | male | 0.937014 | 2223 |


0.9418549268955726
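
A note on why the score uses a power mean with p = -5 rather than a plain average: a negative exponent pulls the result toward the smallest value, so a single weak subgroup hurts more than it would in an arithmetic mean. The sketch below recomputes both for the `subgroup_auc` column of the table above in plain Python; `power_mean_plain` is a made-up name.

```python
def power_mean_plain(values, p):
    """Generalized (power) mean of a list of positive numbers."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


# subgroup_auc column from the validation table
subgroup_auc = [0.863688, 0.880614, 0.882408, 0.893712, 0.908925,
                0.915371, 0.927764, 0.936172, 0.937014]

arithmetic = sum(subgroup_auc) / len(subgroup_auc)
generalized = power_mean_plain(subgroup_auc, -5)

# The power mean sits between the worst subgroup and the arithmetic mean.
print(min(subgroup_auc), generalized, arithmetic)
```

The final metric then averages the three power means (subgroup, BPSN, BNSP) and mixes the result with the overall AUC using weights 0.75 and 0.25.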