<a href="https://colab.research.google.com/github/limshaocong/SysBERT/blob/main/MLM_TAPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preliminaries

Install the HuggingFace `transformers` and `datasets` libraries.

In [None]:
! pip install datasets transformers

In [2]:
! nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-4b377839-100b-2abc-a4fd-c8bc56e5df1e)


# Import and Pre-process Data


In [22]:
from datasets import load_dataset

ds = load_dataset('limsc/mlm-tapt-requirements')

ds

Using custom data configuration limsc--reqbert-mlm-tapt-a68ae7ead0db5cc7
Reusing dataset parquet (/root/.cache/huggingface/datasets/limsc___parquet/limsc--reqbert-mlm-tapt-a68ae7ead0db5cc7/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901)


DatasetDict({
    val: Dataset({
        features: ['source', 'reqs'],
        num_rows: 1990
    })
    train: Dataset({
        features: ['source', 'reqs'],
        num_rows: 37797
    })
})

In [24]:
from transformers import AutoTokenizer

# model_checkpoint = 'bert-base-cased'
# model_checkpoint = 'roberta-base'
model_checkpoint = 'allenai/scibert_scivocab_cased'

# Load pre-trained tokenizer from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained(
    model_checkpoint,
    use_fast = True
)

# Wrapper function to batch encode dataset
def encode(requirements):
    return tokenizer(requirements['reqs'], truncation = False)

ds_tokenized = ds.map(
    encode,
    batched = True,
    remove_columns = ['source', 'reqs']
)

ds_tokenized

DatasetDict({
    val: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1990
    })
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 37797
    })
})

In [31]:
chunk_size = 128

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size tokens
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result['labels'] = result['input_ids'].copy()
    
    return result
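
To make the concatenate-then-chunk step concrete, here is a minimal sketch on toy token ids. It assumes a `chunk_size` of 4 instead of 128; `group_texts_demo` and `toy_batch` are illustrative names that do not appear elsewhere in this notebook, and the ids are made up.

```python
from itertools import chain

def group_texts_demo(examples, chunk_size):
    """Concatenate each column, drop the ragged tail, and split into fixed-size chunks."""
    # Flatten every column (input_ids, attention_mask, ...) into one long list
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Round down to a multiple of chunk_size, discarding the leftover tokens
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # Labels start as a copy of the inputs; masking happens later in the collator
    result['labels'] = [chunk.copy() for chunk in result['input_ids']]
    return result

# Three toy "requirements" of 4, 3 and 6 tokens (13 tokens in total)
toy_batch = {
    'input_ids': [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 3, 102]],
    'attention_mask': [[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1, 1, 1]],
}
chunks = group_texts_demo(toy_batch, chunk_size=4)
print(chunks['input_ids'])
```

The 13 toy tokens yield three chunks of 4, and the final token is dropped. Chunks can also straddle the boundary between two requirements, which is why the row counts change after this step.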

In [32]:
ds_chunked = ds_tokenized.map(group_texts, batched = True)
ds_chunked

Loading cached processed dataset at /root/.cache/huggingface/datasets/limsc___parquet/limsc--reqbert-mlm-tapt-a68ae7ead0db5cc7/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901/cache-7703d2879101e29d.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/limsc___parquet/limsc--reqbert-mlm-tapt-a68ae7ead0db5cc7/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901/cache-216f083ffc6af87a.arrow


DatasetDict({
    val: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 612
    })
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 11510
    })
})

In [33]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer = tokenizer,
    mlm_probability = 0.15
)
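
The collator applies BERT-style dynamic masking each time a batch is built: roughly `mlm_probability` of the tokens are selected for prediction, and of those about 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. Unselected positions get the label `-100` so that the loss ignores them. The following is a rough pure-Python sketch of that idea, not the library's implementation; `mask_tokens_demo`, `mask_id = 4` and `vocab_size = 30000` are hypothetical values chosen for illustration, and the real collator additionally avoids masking special tokens.

```python
import random

def mask_tokens_demo(input_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels) following the 80/10/10 masking rule."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 marks positions ignored by the loss
    for i, token in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected for prediction
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id  # replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token in place
    return masked, labels

# One toy chunk of 128 made-up token ids (hypothetical values)
toy_ids = list(range(1000, 1128))
masked_ids, labels = mask_tokens_demo(toy_ids, mask_id=4, vocab_size=30000)
num_selected = sum(label != -100 for label in labels)
print(f'{num_selected} of {len(toy_ids)} positions selected for prediction')
```

Because the masking is random, each epoch sees a different masked version of the same chunk.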

In [34]:
batch_size = 16

batched_train_ds = ds_chunked['train'].to_tf_dataset(
    columns = ['input_ids', 'attention_mask', 'labels'],
    collate_fn = data_collator,
    shuffle = True,
    batch_size = batch_size,
)

batched_val_ds = ds_chunked['val'].to_tf_dataset(
    columns = ['input_ids', 'attention_mask', 'labels'],
    collate_fn = data_collator,
    shuffle = False,
    batch_size = batch_size,
)

# Task-Adaptive Pre-Training (TAPT)

In [37]:
import tensorflow as tf
from transformers import TFAutoModelForMaskedLM, create_optimizer

tf.keras.utils.set_random_seed(1)
num_epochs = 60

def create_model():
  
  if model_checkpoint == 'allenai/scibert_scivocab_cased':
    model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint, from_pt = True)
  else:
    model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)
  
  num_train_steps = len(batched_train_ds) * num_epochs

  optimizer, schedule = create_optimizer(
      init_lr = 2e-5,
      num_warmup_steps = 1000,
      num_train_steps = num_train_steps,
      weight_decay_rate = 0.01,
  )

  model.compile(optimizer = optimizer)

  return model

model = create_model()

model.summary()

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForMaskedLM: ['cls.predictions.decoder.bias']
- This IS expected if you are initializing TFBertForMaskedLM from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForMaskedLM from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transfo

Model: "tf_bert_for_masked_lm_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109347840 
                                                                 
 mlm___cls (TFBertMLMHead)   multiple                  24916620  
                                                                 
Total params: 109,971,084
Trainable params: 109,971,084
Non-trainable params: 0
_________________________________________________________________


In [38]:
import math
import os
from tensorflow.keras.callbacks import Callback, CSVLogger, ModelCheckpoint
from transformers import PushToHubCallback

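class Perplexity(Callback):
    """Log validation perplexity, exp(validation loss), at the end of each epoch."""

    def __init__(self, validation):
        super().__init__()
        self.validation = validation

    def on_epoch_end(self, epoch, logs = None):
        # Keras passes a logs dict; avoid a mutable default argument
        logs = logs if logs is not None else {}
        # evaluate() returns the scalar loss because no extra metrics are compiled
        val_score = self.model.evaluate(self.validation)
        val_perplexity = math.exp(val_score)
        logs['val_perplexity'] = val_perplexity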

perplexity_cb = Perplexity(batched_val_ds)

csvlogger_f = f'{model_checkpoint}-tapt-results.csv'
csvlogger_cb = CSVLogger(csvlogger_f)

checkpoint_path = f'{model_checkpoint}' + '-cp/tapt-epoch{epoch}.ckpt'
checkpoint_dir = os.path.dirname(checkpoint_path)

modelcheckpoint_cb = ModelCheckpoint(
    filepath = checkpoint_path,
    save_weights_only = True,
    verbose = 1
)
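
The checkpoint path mixes an f-string with a plain string on purpose. The short sketch below shows what each part evaluates to, on the assumption that Keras fills the `{epoch}` placeholder with a 1-based epoch number when it saves; the value `30` is only an example.

```python
import os

model_checkpoint = 'allenai/scibert_scivocab_cased'

# The f-string is evaluated immediately; the plain string keeps '{epoch}' as a placeholder
checkpoint_path = f'{model_checkpoint}' + '-cp/tapt-epoch{epoch}.ckpt'
print(checkpoint_path)

# The placeholder is filled in later, once per saved epoch
print(checkpoint_path.format(epoch=30))

# Directory that will hold all the per-epoch checkpoint files
print(os.path.dirname(checkpoint_path))
```

Since the model name contains a slash, the checkpoints for SciBERT end up in a nested `allenai/` directory.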

In [39]:
callbacks = [perplexity_cb, csvlogger_cb, modelcheckpoint_cb]

In [40]:
model.fit(
    batched_train_ds,
    validation_data = batched_val_ds,
    epochs = num_epochs,
    callbacks = callbacks
)

Epoch 1/60
 15/719 [..............................] - ETA: 3:43 - loss: 2.6980

KeyboardInterrupt: ignored
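
As a quick worked example of the perplexity computation used in the callback, perplexity is the exponential of the mean cross-entropy loss. The sketch below applies it to the running training loss shown in the progress bar above; `perplexity` is an illustrative helper name, and the callback itself applies the same formula to the validation loss.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

early_train_loss = 2.6980  # running loss from the interrupted first epoch
print(round(perplexity(early_train_loss), 2))
```

A loss of 2.6980 corresponds to a perplexity of about 14.85, and a loss of zero would give the minimum perplexity of 1.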

# Save Model to HuggingFace Hub

In [15]:
! huggingface-cli login


        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

        To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .
        (Deprecated, will be removed in v0.3.0) To login with username and password instead, interrupt with Ctrl+C.
        
Token: 
Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on y

In [19]:
os.listdir(checkpoint_dir)

chosen_epoch = 30
chosen_epoch_path = f'{model_checkpoint}-cp/tapt-epoch{chosen_epoch}.ckpt'

# Reinstate checkpoint weights
model = create_model()
model.load_weights(chosen_epoch_path)

All model checkpoint layers were used when initializing TFBertForMaskedLM.

All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour, please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fec61016e10>

In [20]:
repo_dict = {
  'bert-base-cased' : 'limsc/reqbert-tapt',
  'roberta-base' : 'limsc/reqroberta-tapt',
  'allenai/scibert_scivocab_cased' : 'limsc/reqscibert-tapt'
}

repo_path = repo_dict[model_checkpoint] + f'-epoch{chosen_epoch}'

model.push_to_hub(repo_path)

Cloning https://huggingface.co/limsc/reqbert-tapt-epoch30 into local empty directory.


remote: Enforcing permissions...        
remote: Allowed refs: all        
To https://huggingface.co/limsc/reqbert-tapt-epoch30
   c4eba2c..ebbce68  main -> main



'https://huggingface.co/limsc/reqbert-tapt-epoch30/commit/ebbce685f1c04758a3e7fa84a80b1927be64fd66'