<a href="https://colab.research.google.com/github/limshaocong/SysBERT/blob/main/MLM_TAPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [44]:
! pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [45]:
! nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-ecb364eb-1a9d-0838-492f-c90e2d3e53e5)


In [46]:
! git clone https://github.com/limshaocong/SysBERT/

fatal: destination path 'SysBERT' already exists and is not an empty directory.


In [47]:
from datasets import load_dataset

train_path = '/content/SysBERT/Data/mlm_train.csv'
test_path = '/content/SysBERT/Data/mlm_test.csv'

ds = load_dataset(
    'csv',
    data_files = {
        'train' : train_path,
        'test' : test_path
    }
)

ds

Using custom data configuration default-6edfc9cdaae88501
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-6edfc9cdaae88501/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)


  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['source', 'reqs'],
        num_rows: 11493
    })
    test: Dataset({
        features: ['source', 'reqs'],
        num_rows: 1278
    })
})

In [48]:
from transformers import AutoTokenizer

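model_checkpoint = 'bert-base-cased'
# model_checkpoint = 'roberta-base'
# Note: the model outputs logged further down (initialization message, ~333M parameters)
# come from a run with 'bert-large-cased', not from the checkpoint set above.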

tokenizer = AutoTokenizer.from_pretrained(
    model_checkpoint,
    use_fast = True
)

def encode(requirements):
    return tokenizer(requirements['reqs'], truncation = False)

ds_tokenized = ds.map(
    encode,
    batched = True,
    remove_columns = ['source', 'reqs']
)

ds_tokenized

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-6edfc9cdaae88501/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-8e83f2ef3a7d9504.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-6edfc9cdaae88501/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-443bd3c7950c4935.arrow


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 11493
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1278
    })
})

In [49]:
chunk_size = 128

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size tokens
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result['labels'] = result['input_ids'].copy()
    
    return result
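
To make the chunking step concrete, here is a minimal sketch that applies the same concatenate-then-split logic to a made-up batch. The function name `group_texts_demo`, the toy token ids, and the small `chunk_size` are illustrative assumptions, not part of the notebook's pipeline.

```python
from itertools import chain

def group_texts_demo(examples, chunk_size):
    # Flatten each column (a list of token lists) into one long list
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    # Round the total length down to a multiple of chunk_size (the tail is dropped)
    total_length = len(next(iter(concatenated.values())))
    total_length = (total_length // chunk_size) * chunk_size
    # Slice every column into fixed-size chunks
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # Labels start as a copy of the inputs; masking happens later in the collator
    result['labels'] = result['input_ids'].copy()
    return result

# Three made-up tokenized requirements with 4 + 3 + 5 = 12 tokens in total
toy_batch = {
    'input_ids': [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 102]],
    'attention_mask': [[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1, 1]],
}

chunks = group_texts_demo(toy_batch, chunk_size = 5)
print(chunks['input_ids'])
```

With 12 tokens and a chunk size of 5, two full chunks are kept and the last two tokens are discarded. Chunks can span the boundary between two requirements, which is why the row count falls from 11493 requirements to 3484 chunks in the training split.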

In [50]:
ds_chunked = ds_tokenized.map(group_texts, batched = True)
ds_chunked

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-6edfc9cdaae88501/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-436d5690a47a1fc0.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-6edfc9cdaae88501/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-c86b0ffadfb922ea.arrow


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3484
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 407
    })
})

In [51]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer = tokenizer,
    mlm_probability = 0.15
)
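
The collator applies masking on the fly each time a batch is built. The sketch below illustrates the standard BERT masking recipe in plain Python: about 15% of non-special tokens are selected, of which roughly 80% become the mask token, 10% become a random token, and 10% stay unchanged, while unselected positions get the label -100 so the loss ignores them. The function `mask_tokens_sketch` and the token ids used here are assumptions for illustration; this is not the library's implementation.

```python
import random

def mask_tokens_sketch(input_ids, mask_token_id, vocab_size, special_ids,
                       mlm_probability = 0.15, seed = 0):
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 marks positions excluded from the loss
    for i, tok in enumerate(input_ids):
        # Never mask special tokens; select other tokens with probability mlm_probability
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id          # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

# Illustrative ids only (assumed values, not read from the tokenizer)
toy_ids = [101] + list(range(1000, 1030)) + [102]
masked_ids, mlm_labels = mask_tokens_sketch(
    toy_ids, mask_token_id = 103, vocab_size = 28996, special_ids = {101, 102}
)
print(masked_ids)
print(mlm_labels)
```

Because masking is redrawn for every batch, the model sees different masked positions for the same chunk across the 10 epochs.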

In [52]:
batch_size = 16

tf_train_dataset = ds_chunked['train'].to_tf_dataset(
    columns = ['input_ids', 'attention_mask', 'labels'],
    collate_fn = data_collator,
    shuffle = True,
    batch_size = batch_size,
)

tf_eval_dataset = ds_chunked['test'].to_tf_dataset(
    columns = ['input_ids', 'attention_mask', 'labels'],
    collate_fn = data_collator,
    shuffle = False,
    batch_size = batch_size,
)

# Task-Adaptive Pre-Training (TAPT)

In [53]:
from transformers import TFAutoModelForMaskedLM

model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)

All model checkpoint layers were used when initializing TFBertForMaskedLM.

All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-large-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


In [54]:
model.summary()

Model: "tf_bert_for_masked_lm_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  332529664 
                                                                 
 mlm___cls (TFBertMLMHead)   multiple                  31300932  
                                                                 
Total params: 333,610,308
Trainable params: 333,610,308
Non-trainable params: 0
_________________________________________________________________


In [55]:
from transformers import create_optimizer
import tensorflow as tf

num_epochs = 10
num_train_steps = len(tf_train_dataset) * num_epochs

optimizer, schedule = create_optimizer(
    init_lr = 1e-5,
    num_warmup_steps = 1000,
    num_train_steps = num_train_steps,
    weight_decay_rate = 0.01,
)

model.compile(optimizer = optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour, please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
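
To see what learning rates this configuration produces, the sketch below approximates the schedule as a linear warmup followed by a linear decay to zero. The function `lr_at_step` is a hypothetical helper, not the library's schedule object, and the step count assumes the final partial batch is kept (3484 chunks at batch size 16).

```python
import math

def lr_at_step(step, init_lr, num_warmup_steps, num_train_steps):
    # Warmup phase: ramp linearly from 0 to init_lr
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    # Decay phase: fall linearly from init_lr to 0 over the remaining steps
    decay_steps = max(1, num_train_steps - num_warmup_steps)
    progress = min(step - num_warmup_steps, decay_steps) / decay_steps
    return init_lr * (1.0 - progress)

# Assumed step count: ceil(3484 / 16) batches per epoch over 10 epochs
steps_per_epoch = math.ceil(3484 / 16)
total_steps = steps_per_epoch * 10

for step in (0, 500, 1000, 1500, total_steps):
    print(step, lr_at_step(step, 1e-5, 1000, total_steps))
```

Under these assumptions the warmup covers 1000 of about 2180 steps, so the model trains at the peak rate of 1e-5 only briefly. A shorter warmup may be worth trying if convergence is slow.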


In [56]:
model.fit(
    tf_train_dataset,
    validation_data = tf_eval_dataset,
    epochs = num_epochs
)

Epoch 1/10


ResourceExhaustedError: ignored
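
A `ResourceExhaustedError` at the start of the first epoch typically means the GPU ran out of memory. As a rough sanity check, the sketch below estimates the memory needed for the model state alone, using the parameter count from `model.summary()` above. It assumes 32-bit floats and an Adam-style optimizer that keeps two extra buffers per parameter; activations, which grow with batch size and sequence length, are not included.

```python
# Parameter count taken from the model.summary() output above
num_params = 333_610_308
bytes_per_float32 = 4

# Assumed copies held during training: weights, gradients, and two Adam moment buffers
copies = 4

state_bytes = num_params * bytes_per_float32 * copies
state_gib = state_bytes / 2**30
print(f"Model state alone: about {state_gib:.2f} GiB")
```

This estimate covers only the static state. The activations stored for backpropagation at batch size 16 and 128 tokens per chunk come on top, so a smaller batch size or a smaller checkpoint are the usual first things to try.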