<a href="https://colab.research.google.com/github/limshaocong/ReqBERT/blob/main/t0_tapt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preliminaries

Install the Hugging Face `transformers` and `datasets` libraries.

In [None]:
! pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.2.2-py3-none-any.whl (346 kB)
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Collecting dill<0.3.5
  Downloading dill-0.3.4-py2.py3-none-any.whl (86 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)

In [None]:
! nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-2949c49b-3330-25c1-c52b-7f70815b0ae2)


# Import and Pre-process Data


In [None]:
from datasets import load_dataset

ds = load_dataset('limsc/mlm-tapt-requirements')

ds

Downloading:   0%|          | 0.00/805 [00:00<?, ?B/s]

Using custom data configuration limsc--reqbert-mlm-tapt-a68ae7ead0db5cc7


Downloading and preparing dataset csv/default (download: 4.00 MiB, generated: 7.86 MiB, post-processed: Unknown size, total: 11.86 MiB) to /root/.cache/huggingface/datasets/limsc___parquet/limsc--reqbert-mlm-tapt-a68ae7ead0db5cc7/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/214k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/3.98M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/limsc___parquet/limsc--reqbert-mlm-tapt-a68ae7ead0db5cc7/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    val: Dataset({
        features: ['source', 'reqs'],
        num_rows: 1990
    })
    train: Dataset({
        features: ['source', 'reqs'],
        num_rows: 37797
    })
})

In [None]:
from transformers import AutoTokenizer

# model_checkpoint = 'bert-base-cased'
model_checkpoint = 'roberta-base'
# model_checkpoint = 'allenai/scibert_scivocab_cased'

# Load pre-trained tokenizer from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained(
    model_checkpoint,
    use_fast = True
)

# Wrapper function to batch encode dataset
def encode(requirements):
    return tokenizer(requirements['reqs'], truncation = False)

ds_tokenized = ds.map(
    encode,
    batched = True,
    remove_columns = ['source', 'reqs']
)

ds_tokenized

Downloading:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (592 > 512). Running this sequence through the model will result in indexing errors


  0%|          | 0/38 [00:00<?, ?ba/s]

DatasetDict({
    val: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1990
    })
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 37797
    })
})

In [None]:
chunk_size = 128

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size tokens
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result['labels'] = result['input_ids'].copy()
    
    return result
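
To make the concatenate-then-chunk behaviour of `group_texts` concrete, here is a minimal sketch on toy data. The function name `toy_group_texts`, the chunk size of 4, and the token IDs are all made up for illustration; the notebook itself uses `chunk_size = 128` on real tokenizer output.

```python
# Toy version of the chunking step (hypothetical names and token IDs).
toy_chunk_size = 4

def toy_group_texts(examples, chunk_size=toy_chunk_size):
    # Flatten each column's list of token lists into one long list
    concatenated = {k: [tok for seq in v for tok in seq] for k, v in examples.items()}
    total_length = len(concatenated['input_ids'])
    # Drop the tail that does not fill a whole chunk
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # Labels start as a copy of the inputs; masking happens later in the collator
    result['labels'] = [chunk[:] for chunk in result['input_ids']]
    return result

# Three "requirements" of 4, 6 and 3 tokens: 13 tokens in total
toy_batch = {
    'input_ids': [[0, 11, 12, 2], [0, 21, 22, 23, 24, 2], [0, 31, 2]],
    'attention_mask': [[1] * 4, [1] * 6, [1] * 3],
}
toy_chunks = toy_group_texts(toy_batch)
print(toy_chunks['input_ids'])
# → [[0, 11, 12, 2], [0, 21, 22, 23], [24, 2, 0, 31]]
```

Note that chunks can straddle requirement boundaries and that the final leftover token is discarded, which is why the row counts change after chunking.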

In [None]:
ds_chunked = ds_tokenized.map(group_texts, batched = True)
ds_chunked

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/38 [00:00<?, ?ba/s]

DatasetDict({
    val: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 650
    })
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 12204
    })
})

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer = tokenizer,
    mlm_probability = 0.15
)
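
As a rough illustration of what the collator does with `mlm_probability = 0.15`, the sketch below re-implements the usual BERT-style masking rule in plain Python: each non-special token is selected with the given probability, and a selected token is replaced by the mask token about 80% of the time, by a random token about 10% of the time, and left unchanged otherwise. Unselected positions get the label `-100` so the loss ignores them. The function `toy_mlm_mask` and the IDs used here are hypothetical and are not part of the `transformers` API.

```python
import random

def toy_mlm_mask(input_ids, mask_id, vocab_size, special_ids,
                 mlm_probability=0.15, seed=0):
    # Hypothetical helper: mimics the masking rule, not the library internals
    rng = random.Random(seed)
    masked, labels = list(input_ids), []
    for i, tok in enumerate(input_ids):
        # Special tokens and unselected tokens are ignored by the loss
        if tok in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)
            continue
        labels.append(tok)  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # ~80%: replace with mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # ~10%: replace with random token
        # remaining ~10%: keep the original token
    return masked, labels

# Made-up IDs: 0 and 2 act as special tokens, 4 as the mask token
toy_ids = [0, 10, 11, 12, 13, 14, 15, 16, 2]
toy_masked, toy_labels = toy_mlm_mask(
    toy_ids, mask_id=4, vocab_size=100, special_ids={0, 2}, mlm_probability=0.5
)
print(toy_masked)
print(toy_labels)
```

Because masking is applied at collation time, each epoch sees a different random mask over the same chunks.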

In [None]:
batch_size = 16

batched_train_ds = ds_chunked['train'].to_tf_dataset(
    columns = ['input_ids', 'attention_mask', 'labels'],
    collate_fn = data_collator,
    shuffle = True,
    batch_size = batch_size,
)

batched_val_ds = ds_chunked['val'].to_tf_dataset(
    columns = ['input_ids', 'attention_mask', 'labels'],
    collate_fn = data_collator,
    shuffle = False,
    batch_size = batch_size,
)

# Task-Adaptive Pre-Training (TAPT)

In [None]:
import tensorflow as tf
from transformers import TFAutoModelForMaskedLM, create_optimizer

tf.keras.utils.set_random_seed(1)
num_epochs = 50

def create_model():
  
  if model_checkpoint == 'allenai/scibert_scivocab_cased':
    model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint, from_pt = True)
  else:
    model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)
  
  num_train_steps = len(batched_train_ds) * num_epochs

  optimizer, schedule = create_optimizer(
      init_lr = 2e-5,
      num_warmup_steps = 1000,
      num_train_steps = num_train_steps,
      weight_decay_rate = 0.01,
  )

  model.compile(optimizer = optimizer)

  return model

model = create_model()

model.summary()

Downloading:   0%|          | 0.00/627M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFRobertaForMaskedLM.

All the layers of TFRobertaForMaskedLM were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour, please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


Model: "tf_roberta_for_masked_lm"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 roberta (TFRobertaMainLayer  multiple                 124055040 
 )                                                               
                                                                 
 lm_head (TFRobertaLMHead)   multiple                  39642969  
                                                                 
Total params: 124,697,433
Trainable params: 124,697,433
Non-trainable params: 0
_________________________________________________________________
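
The learning-rate schedule returned by `create_optimizer` can be pictured with the sketch below. It assumes the helper's default behaviour of a linear warm-up from zero followed by a linear decay to zero; the function `linear_warmup_decay` and the value `num_train_steps = 10_000` are illustrative only, since the real step count depends on `len(batched_train_ds) * num_epochs`.

```python
def linear_warmup_decay(step, init_lr=2e-5, num_warmup_steps=1000,
                        num_train_steps=10_000):
    # Hypothetical helper approximating the schedule; not the library code
    if step < num_warmup_steps:
        # Warm-up: ramp linearly from 0 to init_lr
        return init_lr * step / num_warmup_steps
    # Decay: fall linearly from init_lr to 0 over the remaining steps
    decay_steps = num_train_steps - num_warmup_steps
    progress = min(step - num_warmup_steps, decay_steps) / decay_steps
    return init_lr * (1.0 - progress)

for s in (0, 500, 1000, 5500, 10_000):
    print(s, linear_warmup_decay(s))
```

With 1,000 warm-up steps, the peak rate of `2e-5` is only reached after roughly the first epoch or two at this batch size, which is worth keeping in mind when comparing early-epoch losses.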


In [None]:
import math
import os
from tensorflow.keras.callbacks import Callback, CSVLogger, ModelCheckpoint
from transformers import PushToHubCallback

class Perplexity(Callback):

    def __init__(self, validation):
        super(Perplexity, self).__init__()
        self.validation = validation

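    def on_epoch_end(self, epoch, logs = None):
        # Avoid a shared mutable default argument
        logs = logs if logs is not None else {}

        # With no extra metrics compiled, evaluate() returns the scalar validation loss
        val_score = self.model.evaluate(self.validation)
        val_perplexity = math.exp(val_score)
        logs['val_perplexity'] = val_perplexity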

perplexity_cb = Perplexity(batched_val_ds)

csvlogger_f = f'{model_checkpoint}-tapt-results.csv'
csvlogger_cb = CSVLogger(csvlogger_f)

checkpoint_path = f'{model_checkpoint}' + '-cp/tapt-epoch{epoch}.ckpt'
checkpoint_dir = os.path.dirname(checkpoint_path)

# csvlogger_f = f'scibert-scivocab-tapt-results.csv'
# csvlogger_cb = CSVLogger(csvlogger_f)

# checkpoint_path = f'scibert-scivocab' + '-cp/tapt-epoch{epoch}.ckpt'
# checkpoint_dir = os.path.dirname(checkpoint_path)

modelcheckpoint_cb = ModelCheckpoint(
    filepath = checkpoint_path,
    save_weights_only = True,
    verbose = 1
)
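
The `Perplexity` callback above relies on the relation perplexity = exp(mean cross-entropy loss), with the loss measured in nats. The sketch below shows that relation in isolation; `perplexity_from_loss` is a hypothetical helper name and the vocabulary size of 50 is made up.

```python
import math

def perplexity_from_loss(mean_cross_entropy):
    # Perplexity is the exponential of the mean per-token cross-entropy (in nats)
    return math.exp(mean_cross_entropy)

# A model that guesses uniformly over a 50-token vocabulary has loss ln(50),
# so its perplexity equals the vocabulary size.
uniform_loss = math.log(50)
print(perplexity_from_loss(uniform_loss))

# A perfect model has zero loss and therefore a perplexity of 1.
print(perplexity_from_loss(0.0))
```

Read this way, a validation perplexity of `p` means the model is, on average, about as uncertain as if it were choosing uniformly among `p` candidate tokens at each masked position.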

In [None]:
callbacks = [perplexity_cb, csvlogger_cb, modelcheckpoint_cb]

In [None]:
model.fit(
    batched_train_ds,
    validation_data = batched_val_ds,
    epochs = num_epochs,
    callbacks = callbacks
)

Epoch 1/50

Epoch 1: saving model to roberta-base-cp/tapt-epoch1.ckpt
Epoch 2/50

Epoch 2: saving model to roberta-base-cp/tapt-epoch2.ckpt
Epoch 3/50

Epoch 3: saving model to roberta-base-cp/tapt-epoch3.ckpt
Epoch 4/50

Epoch 4: saving model to roberta-base-cp/tapt-epoch4.ckpt
Epoch 5/50

Epoch 5: saving model to roberta-base-cp/tapt-epoch5.ckpt
Epoch 6/50

Epoch 6: saving model to roberta-base-cp/tapt-epoch6.ckpt
Epoch 7/50

Epoch 7: saving model to roberta-base-cp/tapt-epoch7.ckpt
Epoch 8/50

Epoch 8: saving model to roberta-base-cp/tapt-epoch8.ckpt
Epoch 9/50

Epoch 9: saving model to roberta-base-cp/tapt-epoch9.ckpt
Epoch 10/50

Epoch 10: saving model to roberta-base-cp/tapt-epoch10.ckpt
Epoch 11/50

Epoch 11: saving model to roberta-base-cp/tapt-epoch11.ckpt
Epoch 12/50

Epoch 12: saving model to roberta-base-cp/tapt-epoch12.ckpt
Epoch 13/50

Epoch 13: saving model to roberta-base-cp/tapt-epoch13.ckpt
Epoch 14/50

Epoch 14: saving model to roberta-base-cp/tapt-epoch14.ckpt
Epoch

<keras.callbacks.History at 0x7f8f0e3bd450>

# Save Model to HuggingFace Hub

In [None]:
! huggingface-cli login
# Paste your access token at the prompt; do not store tokens in the notebook


        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

        To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .
        (Deprecated, will be removed in v0.3.0) To login with username and password instead, interrupt with Ctrl+C.
        
Token: 
Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-credential store but this isn't the helper defined on y

In [None]:
os.listdir(checkpoint_dir)

chosen_epoch = 50
chosen_epoch_path = f'{model_checkpoint}-cp/tapt-epoch{chosen_epoch}.ckpt'

# Reinstate checkpoint weights
model = create_model()
model.load_weights(chosen_epoch_path)

All model checkpoint layers were used when initializing TFRobertaForMaskedLM.

All the layers of TFRobertaForMaskedLM were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour, please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f8fc0078ed0>
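
Rather than hard-coding `chosen_epoch`, one option is to read the `CSVLogger` file and pick the epoch with the lowest `val_perplexity`. The sketch below assumes the log has `epoch` and `val_perplexity` columns, and that `CSVLogger` numbers epochs from 0 while the `{epoch}` placeholder in the `ModelCheckpoint` path counts from 1. The helper `best_checkpoint_epoch` is hypothetical, and the log rows are invented toy values, not results from this run.

```python
import csv
import io

def best_checkpoint_epoch(csv_file, metric='val_perplexity'):
    # Hypothetical helper: return the 1-based epoch with the lowest metric value
    rows = list(csv.DictReader(csv_file))
    best = min(rows, key=lambda r: float(r[metric]))
    # Shift from the logger's 0-based epoch to the checkpoint's 1-based epoch
    return int(best['epoch']) + 1

# Invented log contents for illustration only
toy_log = io.StringIO(
    'epoch,loss,val_loss,val_perplexity\n'
    '0,2.00,1.90,6.69\n'
    '1,1.70,1.60,4.95\n'
    '2,1.50,1.65,5.21\n'
)
toy_best_epoch = best_checkpoint_epoch(toy_log)
print(toy_best_epoch)
# → 2
```

With the real log, the same function could be called as `best_checkpoint_epoch(open(csvlogger_f))` and its result assigned to `chosen_epoch` before building `chosen_epoch_path`.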

In [None]:
repo_dict = {
  'bert-base-cased' : 'limsc/reqbert-tapt',
  'roberta-base' : 'limsc/reqroberta-tapt',
  'allenai/scibert_scivocab_cased' : 'limsc/reqscibert-tapt'
}

repo_path = repo_dict[model_checkpoint] + f'-epoch{chosen_epoch}'

model.push_to_hub(repo_path)

Cloning https://huggingface.co/limsc/reqroberta-tapt-epoch50 into local empty directory.


Upload file tf_model.h5:   0%|          | 3.34k/625M [00:00<?, ?B/s]

remote: Enforcing permissions...        
remote: Allowed refs: all        
To https://huggingface.co/limsc/reqroberta-tapt-epoch50
   05e9b0f..7614741  main -> main



'https://huggingface.co/limsc/reqroberta-tapt-epoch50/commit/7614741d0cedc6d8170e62d9925604140a0de8ff'