<a href="https://colab.research.google.com/github/limshaocong/SysBERT/blob/main/t2_finetuning_seqclass_sc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preliminaries

In [1]:
! pip install --user datasets transformers torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
! nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-405a4d3d-53fb-8e6b-5097-f7936a4e096e)


In [3]:
# ! huggingface-cli login
# # <your Hugging Face access token>

# Import & Pre-process Data

In [11]:
model_type_dict = {
    'bert-base-cased' : 'bert-base-cased',
    'roberta-base' : 'roberta-base',
    'allenai/scibert_scivocab_cased' : 'allenai/scibert_scivocab_cased',
    'limsc/reqbert-tapt-epoch29' : 'bert-base-cased', # preferred
    'limsc/reqbert-tapt-epoch30' : 'bert-base-cased',
    'limsc/reqroberta-tapt-epoch20' : 'roberta-base',
    'limsc/reqroberta-tapt-epoch33' : 'roberta-base',
    'limsc/reqroberta-tapt-epoch43' : 'roberta-base', # preferred
    'limsc/reqroberta-tapt-epoch50' : 'roberta-base',
    'limsc/reqscibert-tapt-epoch10' : 'allenai/scibert_scivocab_cased', # preferred
    'limsc/reqscibert-tapt-epoch20' : 'allenai/scibert_scivocab_cased', # preferred
    'limsc/reqscibert-tapt-epoch31' : 'allenai/scibert_scivocab_cased',
    'limsc/reqscibert-tapt-epoch49' : 'allenai/scibert_scivocab_cased',
}

model_name_dict = {
    'bert-base-cased' : 'bert',
    'roberta-base' : 'roberta',
    'allenai/scibert_scivocab_cased' : 'scibert',
    'limsc/reqbert-tapt-epoch29' : 'reqbert-e29',
    'limsc/reqbert-tapt-epoch30' : 'reqbert-e30',
    'limsc/reqroberta-tapt-epoch20' : 'reqroberta-e20',
    'limsc/reqroberta-tapt-epoch33' : 'reqroberta-e33',
    'limsc/reqroberta-tapt-epoch43' : 'reqroberta-e43',
    'limsc/reqroberta-tapt-epoch50' : 'reqroberta-e50',
    'limsc/reqscibert-tapt-epoch10' : 'reqscibert-e10',
    'limsc/reqscibert-tapt-epoch20' : 'reqscibert-e20',
    'limsc/reqscibert-tapt-epoch31' : 'reqscibert-e31',
    'limsc/reqscibert-tapt-epoch49' : 'reqscibert-e49',
}

task_name_dict = {
    'limsc/fr-nfr-classification' : 'frnfr',
    'limsc/req-subclass-classification' : 'subclass',
    'limsc/concept-recognition' : 'cr',
    'limsc/sysmlv2-entity-extraction' : 'ee'
}

In [12]:
from datasets import load_dataset

ds_name = 'limsc/req-subclass-classification'
ds = load_dataset(ds_name)
ds

Using custom data configuration limsc--req-subclass-classification-0635892898f55fc9
Reusing dataset parquet (/root/.cache/huggingface/datasets/limsc___parquet/limsc--req-subclass-classification-0635892898f55fc9/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901)


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['source', 'reqs', 'class'],
        num_rows: 500
    })
    val: Dataset({
        features: ['source', 'reqs', 'class'],
        num_rows: 62
    })
    test: Dataset({
        features: ['source', 'reqs', 'class'],
        num_rows: 63
    })
})

In [13]:
num_labels = ds['train'].features['class'].num_classes

Before a BERT-style model can consume natural-language requirements, the text must be tokenized into subword IDs. This is done with the pre-trained tokenizer of the checkpoint's base model, and inputs longer than 128 tokens are truncated.
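As a rough illustration of what a subword tokenizer does, the sketch below splits a single word by greedy longest-match-first lookup, which is the idea behind WordPiece. The `wordpiece` helper and the toy vocabulary are hypothetical and exist only for this example. The real tokenizer has a far larger vocabulary and also handles punctuation, casing, and special tokens such as `[CLS]` and `[SEP]`.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (hypothetical), not the vocabulary of any real checkpoint.
toy_vocab = {"require", "##ment", "##s", "shall", "the", "system"}
tokens = wordpiece("requirements", toy_vocab)
print(tokens)
```

Each resulting piece is then mapped to an integer ID, which is what ends up in the `input_ids` column.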

In [14]:
# bert-base-cased
# roberta-base
# limsc/reqbert-tapt-epoch29
# limsc/reqroberta-tapt-epoch43
# limsc/reqscibert-tapt-epoch20

model_checkpoint = 'bert-base-cased'

In [15]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    model_type_dict[model_checkpoint],
    use_fast = True
)

def encode(requirements):
    return tokenizer(requirements['reqs'], truncation = True, max_length = 128)

tokenized_ds = ds.map(encode, batched = True)

Loading cached processed dataset at /root/.cache/huggingface/datasets/limsc___parquet/limsc--req-subclass-classification-0635892898f55fc9/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901/cache-feeefa486c4b3cd1.arrow


  0%|          | 0/1 [00:00<?, ?ba/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/limsc___parquet/limsc--req-subclass-classification-0635892898f55fc9/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901/cache-13a1af4074d8ba8f.arrow


In [16]:
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['source', 'reqs', 'class', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 500
    })
    val: Dataset({
        features: ['source', 'reqs', 'class', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 62
    })
    test: Dataset({
        features: ['source', 'reqs', 'class', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 63
    })
})

In [17]:
from transformers import DataCollatorWithPadding

batch_size = 32
output_col = 'class'

data_collator = DataCollatorWithPadding(
    tokenizer = tokenizer,
    return_tensors = 'tf'
)

def batching(tokenized_ds, batch_size):

  batched_train_ds = tokenized_ds['train'].to_tf_dataset(
      columns = ['attention_mask', 'input_ids', 'token_type_ids'],
      label_cols = [output_col],
      shuffle = False,
      drop_remainder = False,
      collate_fn = data_collator,
      batch_size = batch_size
  )

  batched_val_ds = tokenized_ds['val'].to_tf_dataset(
      columns = ['attention_mask', 'input_ids', 'token_type_ids'],
      label_cols = [output_col],
      shuffle = False,
      drop_remainder = False,
      collate_fn = data_collator,
      batch_size = batch_size
  )

  batched_test_ds = tokenized_ds['test'].to_tf_dataset(
      columns = ['attention_mask', 'input_ids', 'token_type_ids'],
      label_cols = [output_col],
      shuffle = False,
      drop_remainder = False,
      collate_fn = data_collator,
      batch_size = batch_size
  )

  return batched_train_ds, batched_val_ds, batched_test_ds

batched_train_ds, batched_val_ds, batched_test_ds = batching(tokenized_ds, batch_size)
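`DataCollatorWithPadding` pads each batch to the length of its longest sequence rather than to a fixed global length. The plain-Python sketch below mimics that behaviour. The `pad_batch` helper and the token IDs are made up for illustration, and a padding ID of 0 is assumed (RoBERTa-style tokenizers use a different one).

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy sequences of different lengths (IDs are illustrative only).
toy_batch = [[101, 1109, 102], [101, 1109, 1449, 4958, 102]]
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch keeps short batches short, which saves computation compared with padding everything to `max_length`.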

# Model Fine-tuning (Single Loop)

In [25]:
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification, create_optimizer

# For TensorFlow 2.6, the weights of the classification head are only affected
# by seeds set using tf.random.set_seed.
# https://stackoverflow.com/questions/32419510/how-to-get-reproducible-results-in-keras

seed = 67897
tf.random.set_seed(seed)
num_epochs = 2
initial_lr = 2e-5

def create_model(num_epochs, initial_lr):

  model = TFAutoModelForSequenceClassification.from_pretrained(
      model_checkpoint,
      num_labels = num_labels,
      # from_pt = True
  )

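  # to_tf_dataset keeps the final partial batch (drop_remainder = False),
  # so round up when counting batches per epoch.
  batches_per_epoch = -(-len(tokenized_ds['train']) // batch_size)
  total_train_steps = int(batches_per_epoch * num_epochs)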

  optimizer, schedule = create_optimizer(
      init_lr = initial_lr,
      num_warmup_steps = 0,
      num_train_steps = total_train_steps,
      weight_decay_rate = 0.01
  )

  loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True)

  model.compile(
      optimizer = optimizer,
      loss = loss,
      metrics = tf.metrics.SparseCategoricalAccuracy()
  )

  return model

model = create_model(num_epochs, initial_lr)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
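With `num_warmup_steps = 0`, the schedule returned by `create_optimizer` is expected to decay the learning rate linearly from `init_lr` towards zero over `num_train_steps`, assuming the library defaults for the decay power and the minimum learning-rate ratio. The sketch below works through that arithmetic for 500 training examples, a batch size of 32, and 2 epochs. `linear_decay` is a hypothetical helper written for this example, not part of `transformers`.

```python
import math

def linear_decay(step, init_lr, total_steps, end_lr=0.0):
    """Learning rate at `step` for a linear (power-1 polynomial) decay."""
    step = min(step, total_steps)  # the rate stays at end_lr once the schedule ends
    return (init_lr - end_lr) * (1 - step / total_steps) + end_lr

n_examples, bs, epochs = 500, 32, 2
steps_per_epoch = math.ceil(n_examples / bs)  # the partial last batch is kept
total_steps = steps_per_epoch * epochs

for step in (0, total_steps // 2, total_steps):
    print(step, linear_decay(step, 2e-5, total_steps))
```

If `num_train_steps` is smaller than the number of optimizer steps actually taken, the final steps run with a learning rate of zero, so the step count should match the batching settings.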


In [26]:
from transformers import models

for layer in model.layers[:]:
    print(layer, layer.trainable)

print('=========================================================================')

encoder_layer_name = {
    'bert-base-cased' : models.bert.modeling_tf_bert.TFBertMainLayer,
    'roberta-base' : models.roberta.modeling_tf_roberta.TFRobertaMainLayer,
    'allenai/scibert_scivocab_cased' : models.bert.modeling_tf_bert.TFBertMainLayer
}

frozen_layers = []

for layer in model.layers[:]:
  
  # Look up the encoder's MainLayer class for the chosen checkpoint
  # (see encoder_layer_name above)
  if isinstance(layer, encoder_layer_name[model_type_dict[model_checkpoint]]):
    
    # Use a separate name so the outer loop variable is not shadowed
    for idx, encoder_block in enumerate(layer.encoder.layer):
      
      if idx in frozen_layers:
        encoder_block.trainable = False
      
      # Confirm the chosen layers are frozen
      print(encoder_block, encoder_block.trainable)

<transformers.models.bert.modeling_tf_bert.TFBertMainLayer object at 0x7f702c1de650> True
<keras.layers.core.dropout.Dropout object at 0x7f702cb1ffd0> True
<keras.layers.core.dense.Dense object at 0x7f702cb24490> True
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f6fee232390> True
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f6ff1a57c90> True
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f70521b7e10> True
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f702cbbb750> True
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f702cbd5790> True
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f702cb6c950> True
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f702cb829d0> True
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f702cb9ba50> True
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f702cb32c50> True
<transfo

In [27]:
model.summary()

Model: "tf_bert_for_sequence_classification_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108310272 
                                                                 
 dropout_75 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  9228      
                                                                 
Total params: 108,319,500
Trainable params: 108,319,500
Non-trainable params: 0
_________________________________________________________________
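The parameter counts in the summary can be checked by hand. A dense layer has (inputs + 1 bias) × outputs parameters, and BERT-base has a hidden size of 768, so the 9,228 classifier parameters imply the number of output classes. The variable names below are chosen for this example only.

```python
hidden_size = 768            # BERT-base hidden dimension
classifier_params = 9228     # from model.summary() above
encoder_params = 108310272   # from model.summary() above

# A Dense layer has (inputs + 1 bias) * outputs parameters.
implied_num_labels = classifier_params // (hidden_size + 1)
total_params = encoder_params + classifier_params

print(implied_num_labels)
print(total_params)
```

The implied label count should agree with `num_labels` computed from the dataset features.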


In [28]:
import math
import os
import numpy as np
from tensorflow.keras.callbacks import Callback, CSVLogger, ModelCheckpoint
from transformers.keras_callbacks import PushToHubCallback
from sklearn.metrics import f1_score

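class gridsearch(Callback):
    # Adds the current hyperparameters to each epoch's logs so that CSVLogger,
    # which must come later in the callback list, writes them as extra columns.

    def on_epoch_end(self, epoch, logs):

        logs['seed'] = seed
        logs['batch_size'] = batch_size
        logs['learning_rate'] = initial_lr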

gridsearch_cb = gridsearch()

csvlogger_file = f'{model_name_dict[model_checkpoint]}-{task_name_dict[ds_name]}.csv'
csvlogger_cb = CSVLogger(csvlogger_file, append = True)

In [29]:
callbacks = [gridsearch_cb, csvlogger_cb]

In [30]:
model.fit(
    batched_train_ds,
    validation_data = batched_val_ds,
    epochs = num_epochs,
    callbacks = callbacks
)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f702ca28290>

In [31]:
# y_true = tokenized_ds['test'][output_col]
# y_pred = np.argmax(model.predict(batched_test_ds)['logits'], axis = 1)
# macro_f1 = f1_score(y_true, y_pred, average = 'macro')

# print(f'Test macro F1: {macro_f1:.2f}')

# Hyperparameter Tuning

In [None]:
batch_sizes = [16, 32]
initial_lrs = [5e-5, 3e-5, 2e-5]
seeds = [21916, 25412, 56281, 61712, 30488,
         28215, 78867, 87843, 67918, 93327,
         95420, 11905, 86349, 12082, 81996]

num_epochs = 10

for batch_size in batch_sizes:

  batched_train_ds, batched_val_ds, batched_test_ds = batching(tokenized_ds, batch_size)

  for initial_lr in initial_lrs:
    
    for seed in seeds:
    
      tf.random.set_seed(seed)
      model = create_model(num_epochs, initial_lr)

      frozen_layers = []

      for layer in model.layers[:]:
        
        if isinstance(layer, encoder_layer_name[model_type_dict[model_checkpoint]]):
          
          # Use a separate name so the outer loop variable is not shadowed
          for idx, encoder_block in enumerate(layer.encoder.layer):
            
            if idx in frozen_layers:
              encoder_block.trainable = False

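      # Make sure the log directory exists before CSVLogger opens the file
      os.makedirs('subclass', exist_ok = True)
      csvlogger_file = f'subclass/{task_name_dict[ds_name]}-{model_name_dict[model_checkpoint]}.csv'
      csvlogger_cb = CSVLogger(csvlogger_file, append = True)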

      callbacks = [gridsearch_cb, csvlogger_cb]
      
      model.fit(
          batched_train_ds,
          validation_data = batched_val_ds,
          epochs = num_epochs,
          callbacks = callbacks
      )

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
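The nested loops visit every combination of batch size, learning rate, and seed, so the cost of the search is easy to estimate up front. The sketch below counts the runs with `itertools.product`. The variable names are chosen for this example, and the values are copied from the cell above.

```python
from itertools import product

grid_batch_sizes = [16, 32]
grid_lrs = [5e-5, 3e-5, 2e-5]
n_seeds = 15  # length of the `seeds` list above

# One fine-tuning run per (batch size, learning rate, seed) combination.
runs = list(product(grid_batch_sizes, grid_lrs, range(n_seeds)))
epochs_per_run = 10
total_epochs = len(runs) * epochs_per_run
print(len(runs), total_epochs)
```

Every run re-initializes the model from the checkpoint, so the total training time scales with the number of runs times the number of epochs.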

In [None]:
%cp -av '/content/subclass' '/content/drive/MyDrive/Thesis/logs/'
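Once the logs are saved, they can be summarized per hyperparameter configuration. The sketch below takes the best validation accuracy of each run and averages it over seeds. It assumes the CSV contains the `seed`, `batch_size`, and `learning_rate` columns added by the `gridsearch` callback, and that the validation metric column is named `val_sparse_categorical_accuracy` (the expected Keras default, not confirmed by the output above). The `summarize` helper and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

def summarize(csv_file, metric="val_sparse_categorical_accuracy"):
    """Mean over seeds of each run's best-epoch metric, per (batch_size, learning_rate)."""
    best = defaultdict(float)  # (batch_size, lr, seed) -> best metric seen so far
    for row in csv.DictReader(csv_file):
        key = (row["batch_size"], row["learning_rate"], row["seed"])
        best[key] = max(best[key], float(row[metric]))
    per_config = defaultdict(list)
    for (bs, lr, _seed), value in best.items():
        per_config[(bs, lr)].append(value)
    return {config: mean(values) for config, values in per_config.items()}

# Made-up rows, only to show the assumed layout of the log file.
sample = io.StringIO(
    "epoch,batch_size,learning_rate,seed,val_sparse_categorical_accuracy\n"
    "0,16,5e-05,1,0.50\n"
    "1,16,5e-05,1,0.70\n"
    "0,16,5e-05,2,0.60\n"
    "1,16,5e-05,2,0.50\n"
)
summary = summarize(sample)
print(summary)
```

For a real log, pass an open file handle instead of the in-memory sample, for example `summarize(open(csvlogger_file, newline = ''))`.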