<h1> Second Round of BERT Training </h1>

In this round, I follow the RoBERTa implementation, which uses dynamic masking and trains on whole texts with masked-token prediction only (rather than NSP). Named-entity tokens in the dataset have now been replaced with their BIOES tags. I also add regularization in an attempt to balance the distribution of samples across targets in the corpus.
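
To make the dynamic masking mentioned above concrete, here is a minimal, library-free sketch. It is not the RoBERTa or fairseq implementation: the names `dynamic_mask`, `MASK`, and the toy `vocab` are assumptions for illustration, and the 15% rate with an 80/10/10 replacement scheme follows the original BERT recipe. The point is only that the mask is re-sampled every time a text is seen, rather than fixed once during preprocessing.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol for this sketch


def dynamic_mask(tokens, vocab, mask_prob=0.15, rng=random):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)  # 80%: replace with the mask symbol
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)  # 10%: keep the token unchanged
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


tokens = "I am writing to <S-PER> about the new cards".split()
vocab = sorted(set(tokens))  # toy vocabulary, for illustration only
rng = random.Random(0)

# Calling the function once per epoch gives a fresh mask pattern each time.
for epoch in range(3):
    masked, labels = dynamic_mask(tokens, vocab, rng=rng)
    print(epoch, " ".join(masked))
```

In practice, `DataCollatorForLanguageModeling` in transformers applies the mask when each batch is built, which should have the same effect without any custom code.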


In [1]:
import os
import re
import json
import pandas as pd
import stanza
import spacy 
import importlib
import pickle

import torch
from torch.optim import Adagrad
from transformers import BertTokenizer, BertForSequenceClassification, PretrainedConfig
from common import ClassificationDataset


In [2]:
project_dir = "/Users/paulp/Library/CloudStorage/OneDrive-UniversityofEasternFinland/UEF/Thesis"
data_dir = os.path.join(project_dir,"Data")
model_dir = os.path.join(project_dir, "Models")

os.chdir(data_dir)

old_dataset = pd.read_csv('compiled_data_set.csv', index_col = 0)

#L1 to integer map for loading categories into BERT
with open('target_idx.json') as f:
    data = f.read()
target_idx = json.loads(data)

n_classes = len(target_idx.keys())

<h2> Create a New NE-masked Dataset </h2>

In [3]:
spec_tokens = ['<?>', '<*>', '<R>',  # one of the corpora uses these
               '<B-MISC>',
               '<I-MISC>',
               '<E-MISC>',
               '<S-MISC>',
               '<B-LOC>',
               '<I-LOC>',
               '<E-LOC>',
               '<S-LOC>',
               '<B-PER>',
               '<I-PER>',
               '<E-PER>',
               '<S-PER>',
               '<B-ORG>',
               '<I-ORG>',
               '<E-ORG>',
               '<S-ORG>']  # these will mask named entities later if needed

tokenizer = BertTokenizer.from_pretrained('bert-base-cased',
                                          additional_special_tokens=spec_tokens,
                                          unk_token='[UNK]')


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [4]:
with open('spec_tokens_ne.txt', 'wb') as file:
    pickle.dump(spec_tokens, file)

In [5]:
with open('spec_tokens_ne.txt', 'rb') as file:
    spec_tokens = pickle.load(file)

In [23]:
processors = {'tokenize':'spacy','ner':'conll03'}
tok_ner = stanza.Pipeline('en', processors=processors, package='ewt')


2022-09-22 11:19:21 INFO: Loading these models for language: en (English):
| Processor | Package |
-----------------------
| tokenize  | spacy   |
| mwt       | ewt     |
| pos       | ewt     |
| lemma     | ewt     |
| depparse  | ewt     |
| ner       | conll03 |

2022-09-22 11:19:21 INFO: Use device: cpu
2022-09-22 11:19:21 INFO: Loading: tokenize
2022-09-22 11:19:22 INFO: Loading: mwt
2022-09-22 11:19:22 INFO: Loading: pos
2022-09-22 11:19:22 INFO: Loading: lemma
2022-09-22 11:19:22 INFO: Loading: depparse
2022-09-22 11:19:22 INFO: Loading: ner
2022-09-22 11:19:22 INFO: Done loading processors!


In [120]:
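def ne_replace(text):
    """Replace every named-entity token in `text` with its BIOES tag, e.g. <S-PER>."""
    doc = tok_ner.process(text).to_dict()
    new_tokens = []
    for sent in doc:
        for tok in sent:
            if tok['ner'] == 'O':
                new_tok = tok['text']  # not an entity: keep the surface form
            else:
                new_tok = '<' + tok['ner'] + '>'  # e.g. 'B-PER' -> '<B-PER>'
            new_tokens.append(new_tok)
    t = tokenizer.convert_tokens_to_string(new_tokens)
    # Rejoin corpus markers that were split into '< ? >', '< * >' or '< R >'.
    # Raw strings avoid invalid-escape warnings in the pattern and replacement.
    t = re.sub(r'< ([*R?]) >', r'<\g<1>>', t)
    return t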

In [143]:
# test the function
q = old_dataset.sample(1)['Text'].item()
s = ne_replace(q)
print(q, '\n\n', s)


      Hi Mr Smith, I am writing here to discuss with you if we should add a new group of animals to the new cards. You know that, under your leading, I've got some experience from the African set and Animals of the Americas, so I hope my suggestion might help. Firstly, I think we could reorganize the animals from different levels of concern. We could make four levels, such as extinction, critical, endangerment and vulnerable. Secondly, we could add some rare animals to our cards. I reckon animals such as flying fox, blue whale, snow leopard and Siberian tiger will make our cards more attractive. I also want to have the famous saying Dead as a dodo! on one of the cards. It's very interesting. Thirdly, I suppose the copyright cost for the photos won't be too high, because we already have copyright for many of the animals. Although we don't have a dodo, but I think the cost could be similar or less than our last product: Animals of the African Continent. Finally, I think our cards would 

Detokenizing back into a string may cause problems: so far, the detokenized samples contain some added spaces, for example around apostrophes in contractions and possessives, but these mostly retokenize into a recognizable BERT-like form.
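
If the extra spaces turn out to matter, one option is a small regex clean-up after detokenizing. The sketch below is an assumption about what such a step could look like, not something used in this notebook. `tidy_spacing` is a hypothetical helper, and it should run only after the corpus markers have been rejoined, since it would otherwise turn `< ? >` into `<? >`.

```python
import re


def tidy_spacing(text):
    """Remove spaces that word-level tokenization puts before clitics and punctuation."""
    # Re-attach contraction and possessive clitics: "I 've" -> "I've", "do n't" -> "don't"
    text = re.sub(r"\s+(n't|'(?:s|ve|re|ll|d|m))\b", r"\1", text)
    # Drop the space before closing punctuation: "years ." -> "years."
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    return text


sample = "I 've got some experience , but we do n't have a dodo ."
print(tidy_spacing(sample))
```

This does not cover every case (plural possessives such as `dogs '` are left alone), so it is best treated as a starting point.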

In [148]:
# Create the new dataset.
# This takes a long time to run; load the saved dataset from the Data directory instead.
new_dataset = old_dataset.copy()  # copy so old_dataset keeps the unmasked text
new_dataset['Text'] = new_dataset['Text'].apply(ne_replace)
new_dataset.to_csv('ne_masked_dataset.csv')

In [60]:
# Load from data directory
new_dataset = pd.read_csv('ne_masked_dataset.csv', index_col=0)

<h1> RoBERTa </h1>

<h2> Training </h2>

Visualizing in exBERT lite is more convenient than using BertViz.

In [47]:
from transformers import RobertaTokenizer, RobertaModel, RobertaForSequenceClassification, RobertaConfig, RobertaForMaskedLM
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [53]:
config = RobertaConfig(max_position_embeddings=256,
                       hidden_dropout_prob=0.05,
                       classifier_dropout=0.05)  # try dropout as a strategy for reducing bias

In [54]:
roberta_pretrain = RobertaForMaskedLM(config)

In [61]:
new_dataset.head()

Unnamed: 0.1,Unnamed: 0,Corpus,Target,Text,Length
0,0,ICLE,GE,I 've been making music now for 20 years . You...,390
1,1,ICLE,GE,A quick inspection of the waste - paper basket...,557
2,2,ICLE,CN,Recycling of waste has long been a controversi...,587
3,3,ICLE,CN,"Few years age , government in some cities such...",785
4,4,ICLE,JP,"Gender discrimination . These Days , we often ...",829


In [72]:
# Build one string with one text per line.
data_text = ''
for text in new_dataset['Text']:
    no_newline = re.sub('\n', ' ', text)
    data_text = data_text + no_newline + '\n'

In [76]:
data_list = data_text.split('\n')

In [79]:
from torch.utils.data import RandomSampler

In [112]:
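def sample_dataset(df, sampler, n):
    """Draw the next n indices from `sampler` and join those texts, separated by blank lines."""
    ds = ''
    for _ in range(n):
        text = df['Text'].iloc[next(sampler)]
        text = re.sub('\n', ' ', text)  # keep each document on a single line
        ds = ds + '\n\n' + text
    return ds

def tr_ts_vl_split(df, tr_size=0.85, vl_size=0.075):
    # RandomSampler yields each row index exactly once, in random order,
    # so consecutive draws give disjoint train/validation/test splits.
    sampler = RandomSampler(df)
    iterator = iter(sampler)

    n_samples = len(sampler)
    train_size = round(tr_size*n_samples)
    val_size = round(vl_size*n_samples)
    test_size = n_samples - train_size - val_size  # whatever remains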
    
    train_ds = sample_dataset(df, iterator, train_size)
    val_ds = sample_dataset(df, iterator, val_size)
    test_ds = sample_dataset(df, iterator, test_size)
    
    return train_ds, val_ds, test_ds
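
The split arithmetic above can be checked in isolation. The sketch below is a simplified stand-in that shuffles integer indices with the standard library instead of drawing from `RandomSampler`; `split_indices` and `seed` are hypothetical names. Because the three slices come from a single permutation, no sample should land in more than one split.

```python
import random


def split_indices(n_samples, tr_size=0.85, vl_size=0.075, seed=0):
    """Shuffle range(n_samples) once and cut it into train/val/test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_train = round(tr_size * n_samples)
    n_val = round(vl_size * n_samples)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test split
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))

# Sanity check: the splits are disjoint and together cover every index.
assert sorted(train + val + test) == list(range(1000))
```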
    

In [113]:
train_ds, val_ds, test_ds = tr_ts_vl_split(new_dataset)

In [116]:
with open('roberta_pretrain_train_ds.txt', 'w') as file:
    file.write(train_ds)
with open('roberta_pretrain_val_ds.txt', 'w') as file:
    file.write(val_ds)
with open('roberta_pretrain_test_ds.txt', 'w') as file:
    file.write(test_ds)

In [125]:
os.chdir(model_dir)

In [129]:
%%bash
# An earlier version of this cell failed with "syntax error near unexpected token `do'":
# trailing backslashes joined the wget lines with the for loop, and -O was not
# followed directly by its output path.
mkdir -p gpt2_bpe
wget --no-check-certificate -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget --no-check-certificate -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
# Split names follow the roberta_pretrain_{train,val,test}_ds.txt files
for SPLIT in train val test; do
    python -m multiprocessing_bpe_encoder \
        --encoder-json gpt2_bpe/encoder.json \
        --vocab-bpe gpt2_bpe/vocab.bpe \
        --inputs ../Data/roberta_pretrain_${SPLIT}_ds.txt \
        --outputs ../Data/wikitext-103-raw/wiki.${SPLIT}.bpe
done

In [44]:
roberta_model = RobertaForSequenceClassification(config)

In [45]:
roberta_model.config

RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 256,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.18.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [37]:
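# Rebuild the config with a shorter maximum sequence length.
# Passing the dict positionally, as in RobertaConfig(config_revision), led to
# "TypeError: '>' not supported between instances of 'dict' and 'int'";
# from_dict unpacks the dict into keyword arguments instead.
config_revision = roberta_model.config.to_dict()
config_revision['max_position_embeddings'] = 256
config_revision = RobertaConfig.from_dict(config_revision)

roberta_cls_model = RobertaForSequenceClassification(config_revision)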

In [9]:
sample = 'this is a sample string. How well can or can\'t you tokenize me?'
encoded = tokenizer(sample, return_tensors = 'pt')
output = model(**encoded)

In [10]:
output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.0370,  0.0704, -0.0398,  ..., -0.0669, -0.0483, -0.0442],
         [ 0.0292, -0.1744,  0.0657,  ..., -0.0586,  0.1099, -0.1207],
         [ 0.2565,  0.1220,  0.1763,  ..., -0.3393,  0.1181,  0.1021],
         ...,
         [ 0.0163, -0.0051, -0.0751,  ...,  0.1124,  0.0505, -0.0916],
         [ 0.0338, -0.0959,  0.2445,  ..., -0.4090,  0.0130, -0.0446],
         [-0.0284,  0.0666, -0.0780,  ..., -0.1201, -0.0458, -0.0832]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-6.6461e-03, -2.0431e-01, -2.1699e-01, -8.1624e-02,  1.1961e-01,
          1.9411e-01,  2.5973e-01, -9.3657e-02, -7.2522e-02, -1.4712e-01,
          2.1082e-01, -3.0681e-02, -8.7889e-02,  9.2792e-02, -1.3684e-01,
          4.7925e-01,  2.1791e-01, -4.5997e-01,  3.6649e-02, -1.1003e-02,
         -2.5565e-01,  6.1179e-02,  4.6769e-01,  3.2334e-01,  1.1820e-01,
          6.3845e-02, -1.0979e-01, -3.3367e-02,  1.7878e-01,  2.003

In [19]:
model.parameters

<bound method Module.parameters of RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0): RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affin