<a href="https://colab.research.google.com/github/limshaocong/SysBERT/blob/main/Sequence_classification_PROMISE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preliminaries

In [1]:
! pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.2.2-py3-none-any.whl (346 kB)
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting dill<0.3.5
  Downloading dill-0.3.4-py2.py3-none-any.whl (86 kB)
Collecting huggingfa

In [2]:
! nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-cd9a3d34-15ab-9a2c-1cce-ef3e194cc1d0)


# Import & Pre-process Data

Clone the repository containing the labeled requirements data.

In [3]:
! git clone https://github.com/limshaocong/SysBERT/

Cloning into 'SysBERT'...
remote: Enumerating objects: 526, done.
remote: Counting objects: 100% (526/526), done.
remote: Compressing objects: 100% (492/492), done.
remote: Total 526 (delta 71), reused 434 (delta 22), pack-reused 0
Receiving objects: 100% (526/526), 6.68 MiB | 8.81 MiB/s, done.
Resolving deltas: 100% (71/71), done.


This sequence classification task uses the labeled PROMISE dataset. The target variable is the 'is_functional' column: 1 = functional requirement, 0 = non-functional requirement. The train, validation and test sets are created by stratified sampling in a 70/15/15 ratio. The data is imported as a HuggingFace [Dataset](https://huggingface.co/docs/datasets/access) object for ease of downstream manipulation.
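
The split files are already provided in the repository, so the following is only a minimal sketch of what a stratified 70/15/15 split does: each class is shuffled and divided separately, so every split keeps the same class proportions. The helper `stratified_split` and the toy labels are hypothetical and are not taken from the SysBERT repository.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists that preserve label proportions."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for indices in by_label.values():
        # Shuffle within the class, then cut it into three consecutive slices
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test

# Hypothetical toy labels: 40 functional (1) and 60 non-functional (0)
labels = [1] * 40 + [0] * 60
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

In practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument, applied twice, achieves the same effect.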

In [4]:
from datasets import load_dataset

train_path = '/content/SysBERT/Requirements/Labeled/Sequence_Classification/Pre-processed/Full/train.csv'
val_path = '/content/SysBERT/Requirements/Labeled/Sequence_Classification/Pre-processed/Full/val.csv'
test_path = '/content/SysBERT/Requirements/Labeled/Sequence_Classification/Pre-processed/Full/test.csv'

ds = load_dataset(
    'csv',
    data_files = {
        'train': train_path,
        'val' : val_path,
        'test': test_path
        }
)

ds

Using custom data configuration default-034be2e3c24945ac


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-034be2e3c24945ac/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-034be2e3c24945ac/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['reqs', 'is_functional'],
        num_rows: 669
    })
    val: Dataset({
        features: ['reqs', 'is_functional'],
        num_rows: 143
    })
    test: Dataset({
        features: ['reqs', 'is_functional'],
        num_rows: 144
    })
})

To transform natural language requirements into a BERT-compatible format, the text must first be tokenized. This is performed using a pre-trained tokenizer.

In [5]:
from transformers import AutoTokenizer

model_checkpoint = 'bert-base-cased'

tokenizer = AutoTokenizer.from_pretrained(
    model_checkpoint,
    use_fast = True
)

def encode(requirements):
    return tokenizer(requirements['reqs'], truncation = True, max_length = 128)

tokenized_ds = ds.map(encode, batched = True)

In [7]:
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['reqs', 'is_functional', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 669
    })
    val: Dataset({
        features: ['reqs', 'is_functional', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 143
    })
    test: Dataset({
        features: ['reqs', 'is_functional', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 144
    })
})

In [6]:
sample = ds['train'][0]

print(sample)
encode(sample)

{'reqs': 'The RFS system should be available 24/7 especially during the budgeting period. The RFS system shall be available 90% of the time all year and 98% during the budgeting period. 2% of the time the system will become available within 1 hour of the time that the situation is reported.', 'is_functional': 0}


{'input_ids': [101, 1109, 24695, 1708, 1449, 1431, 1129, 1907, 1572, 120, 128, 2108, 1219, 1103, 4788, 1158, 1669, 119, 1109, 24695, 1708, 1449, 4103, 1129, 1907, 3078, 110, 1104, 1103, 1159, 1155, 1214, 1105, 5103, 110, 1219, 1103, 4788, 1158, 1669, 119, 123, 110, 1104, 1103, 1159, 1103, 1449, 1209, 1561, 1907, 1439, 122, 2396, 1104, 1103, 1159, 1115, 1103, 2820, 1110, 2103, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
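
In the output above, `input_ids` starts with 101 and ends with 102, the ids of the `[CLS]` and `[SEP]` special tokens in the BERT vocabulary. To illustrate how words become sub-word ids, here is a simplified sketch of greedy longest-match-first WordPiece splitting. It is not the tokenizer's actual implementation (the real one also normalizes text and splits punctuation), and the vocabulary `toy_vocab` and the helpers `wordpiece` and `encode_toy` are hypothetical.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into sub-word pieces by greedy longest match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

def encode_toy(text, vocab):
    """Whitespace-split the text, apply WordPiece, and add special tokens."""
    tokens = ["[CLS]"]
    for word in text.split():
        tokens += wordpiece(word, vocab)
    tokens.append("[SEP]")
    input_ids = [vocab[t] for t in tokens]
    return {
        "tokens": tokens,
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
    }

# Hypothetical toy vocabulary; the real ids come from the bert-base-cased vocab file
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "The": 4, "system": 5, "shall": 6, "budget": 7, "##ing": 8}

encoded = encode_toy("The system shall budgeting", toy_vocab)
print(encoded["tokens"])
print(encoded["input_ids"])
```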

In [8]:
from transformers import DataCollatorWithPadding

batch_size = 16
output_col = 'is_functional'

data_collator = DataCollatorWithPadding(
    tokenizer = tokenizer,
    return_tensors = 'tf'
)

batched_train_ds = tokenized_ds['train'].to_tf_dataset(
    columns = ['attention_mask', 'input_ids', 'token_type_ids'],
    label_cols = [output_col],
    shuffle = False,
    drop_remainder = False,
    collate_fn = data_collator,
    batch_size = batch_size
)

batched_val_ds = tokenized_ds['val'].to_tf_dataset(
    columns = ['attention_mask', 'input_ids', 'token_type_ids'],
    label_cols = [output_col],
    shuffle = False,
    drop_remainder = False,
    collate_fn = data_collator,
    batch_size = batch_size
)

batched_test_ds = tokenized_ds['test'].to_tf_dataset(
    columns = ['attention_mask', 'input_ids', 'token_type_ids'],
    label_cols = [output_col],
    shuffle = False,
    drop_remainder = False,
    collate_fn = data_collator,
    batch_size = batch_size
)
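
`DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of padding every example to a fixed length. The sketch below reproduces that idea on plain Python lists; `pad_batch` is a hypothetical helper, not part of the `transformers` API, and the two short examples are made up. The pad id of 0 matches the `[PAD]` token of the BERT vocabulary.

```python
def pad_batch(examples, pad_id=0):
    """Pad every example in a batch to the length of the longest one."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": []}
    for ex in examples:
        n_pad = max_len - len(ex["input_ids"])
        # Padding positions get the pad id and an attention mask of 0
        batch["input_ids"].append(ex["input_ids"] + [pad_id] * n_pad)
        batch["attention_mask"].append(ex["attention_mask"] + [0] * n_pad)
    return batch

# Two made-up tokenized examples of different lengths
examples = [
    {"input_ids": [101, 1109, 1449, 102], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [101, 1109, 102], "attention_mask": [1, 1, 1]},
]

padded = pad_batch(examples)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the padding length is decided per batch, batches of short requirements stay short, which saves computation compared with padding everything to `max_length`.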

# Model Fine-tuning

In [15]:
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification, create_optimizer

tf.keras.utils.set_random_seed(1)
num_epochs = 10

def create_model():

  model = TFAutoModelForSequenceClassification.from_pretrained(
      model_checkpoint,
      num_labels = 2
  )

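  # Steps per epoch, rounded down (669 // 16 = 41), giving 410 total steps
  batches_per_epoch = len(tokenized_ds['train']) // batch_size
  total_train_steps = int(batches_per_epoch * num_epochs)

  # Adam with weight decay, learning-rate warmup, then decay.
  # NOTE: num_warmup_steps (1000) exceeds total_train_steps (410), so training
  # ends during warmup and the learning rate never reaches init_lr.
  optimizer, schedule = create_optimizer(
      init_lr = 2e-5,
      num_warmup_steps = 1000,
      num_train_steps = total_train_steps,
      weight_decay_rate = 0.01
  )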

  loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True)

  model.compile(
      optimizer = optimizer,
      loss = loss,
      metrics = tf.metrics.SparseCategoricalAccuracy()
  )

  return model

model = create_model()

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
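
As a rough check on the schedule configured in `create_model`, the sketch below models linear warmup followed by linear decay to zero. It is a simplified stand-in for the schedule that `create_optimizer` builds, not its actual code, and `lr_at` is a hypothetical helper. The step count follows the notebook's own formula (669 training rows, batch size 16, 10 epochs).

```python
def lr_at(step, init_lr=2e-5, num_warmup_steps=1000, num_train_steps=410):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    decay_steps = max(1, num_train_steps - num_warmup_steps)
    progress = min(1.0, (step - num_warmup_steps) / decay_steps)
    return init_lr * (1.0 - progress)

# Same arithmetic as in create_model()
total_train_steps = (669 // 16) * 10

# Highest learning rate actually reached with the notebook's settings
peak = max(lr_at(s) for s in range(total_train_steps))

# For comparison only: a hypothetical warmup of one epoch (41 steps)
peak_short_warmup = max(
    lr_at(s, num_warmup_steps=41) for s in range(total_train_steps)
)

print(total_train_steps, peak, peak_short_warmup)
```

Under these assumptions the peak learning rate is about 8.2e-6 rather than 2e-5, because the 410 training steps end before the 1000-step warmup does.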


In [16]:
from transformers import models

# Show the top-level layers and whether each is trainable
for layer in model.layers:
    print(layer, layer.trainable)

print('=========================================================================')

# Freeze the first four encoder blocks (indices 0 to 3)
frozen_layers = range(0, 4)

for layer in model.layers:

  # Replace transformers.models.bert.modeling_tf_bert.TFBertMainLayer
  # with the corresponding MainLayer name from the previous code output
  if isinstance(layer, models.bert.modeling_tf_bert.TFBertMainLayer):

    for idx, encoder_block in enumerate(layer.encoder.layer):

      if idx in frozen_layers:
        encoder_block.trainable = False

      # Confirm the chosen layers are frozen
      print(encoder_block, encoder_block.trainable)

<transformers.models.bert.modeling_tf_bert.TFBertMainLayer object at 0x7f7dca6dec90> True
<keras.layers.core.dropout.Dropout object at 0x7f7e4a7ed3d0> True
<keras.layers.core.dense.Dense object at 0x7f7e4a7ed890> True
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f7dca6c4bd0> False
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f7e4a96f990> False
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f7e4a901090> False
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f7e4a913e90> False
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f7e4a8abc50> True
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f7e4a8c3b90> True
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f7e4a8d9a90> True
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f7e4a8718d0> True
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7f7e4a888850> True
<tra

In [17]:
model.summary()

Model: "tf_bert_for_sequence_classification_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108310272 
                                                                 
 dropout_75 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 108,311,810
Trainable params: 79,960,322
Non-trainable params: 28,351,488
_________________________________________________________________
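
The parameter counts in this summary can be checked by hand. The sketch below assumes the standard BERT-base dimensions (hidden size 768, feed-forward size 3072) and counts the weights of one encoder block: four attention projections, two feed-forward dense layers, and two layer normalizations. The helper `dense` is introduced here for illustration.

```python
hidden, intermediate, num_labels = 768, 3072, 2

def dense(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

# Query, key, value and output projections, followed by a layer norm
attention = 4 * dense(hidden, hidden) + layer_norm

# Expansion and projection dense layers, followed by a layer norm
feed_forward = dense(hidden, intermediate) + dense(intermediate, hidden) + layer_norm

per_block = attention + feed_forward
frozen = 4 * per_block                  # four frozen encoder blocks
classifier = dense(hidden, num_labels)  # classification head
total = 108_311_810                     # "Total params" from the summary

print(per_block, frozen, classifier, total - frozen)
```

The results agree with the summary: four frozen blocks account for the 28,351,488 non-trainable parameters, the classifier head has 1,538 parameters, and the remainder is the 79,960,322 trainable parameters.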


In [18]:
import os
import numpy as np
from tensorflow.keras.callbacks import Callback, CSVLogger, ModelCheckpoint
from sklearn.metrics import f1_score

class macro_F1(Callback):
    """Add macro-averaged F1 on the train and validation sets to the epoch logs."""

    def on_epoch_end(self, epoch, logs = None):

        logs = logs if logs is not None else {}

        # The datasets are batched with shuffle = False, so the predictions
        # come back in the same order as the label columns

        # average = 'macro' must be set explicitly; f1_score defaults to 'binary'
        y_train_true = tokenized_ds['train']['is_functional']
        y_train_pred = np.argmax(self.model.predict(batched_train_ds)['logits'], axis = 1)
        logs['train_macro_f1'] = f1_score(y_train_true, y_train_pred, average = 'macro')

        y_val_true = tokenized_ds['val']['is_functional']
        y_val_pred = np.argmax(self.model.predict(batched_val_ds)['logits'], axis = 1)
        logs['val_macro_f1'] = f1_score(y_val_true, y_val_pred, average = 'macro')

macro_F1_cb = macro_F1()

# CSVLogger must run after macro_F1_cb so the F1 values are in logs when the row is written
csvlogger_file = 'reqbert-frnfr-results.csv'
csvlogger_cb = CSVLogger(csvlogger_file)

checkpoint_path = 'model_checkpoints/reqbert-epoch{epoch}.ckpt'
checkpoint_dir = os.path.dirname(checkpoint_path)

modelcheckpoint_cb = ModelCheckpoint(
    filepath = checkpoint_path,
    save_weights_only = True,
    verbose = 1
)

In [19]:
callbacks = [macro_F1_cb, csvlogger_cb, modelcheckpoint_cb]

In [20]:
model.fit(
    batched_train_ds,
    validation_data = batched_val_ds,
    epochs = num_epochs,
    callbacks = callbacks
)

Epoch 1/10
Epoch 1: saving model to model_checkpoints/reqbert-epoch1.ckpt
Epoch 2/10
Epoch 2: saving model to model_checkpoints/reqbert-epoch2.ckpt
Epoch 3/10
Epoch 3: saving model to model_checkpoints/reqbert-epoch3.ckpt
Epoch 4/10
Epoch 4: saving model to model_checkpoints/reqbert-epoch4.ckpt
Epoch 5/10
Epoch 5: saving model to model_checkpoints/reqbert-epoch5.ckpt
Epoch 6/10
Epoch 6: saving model to model_checkpoints/reqbert-epoch6.ckpt
Epoch 7/10
Epoch 7: saving model to model_checkpoints/reqbert-epoch7.ckpt
Epoch 8/10
Epoch 8: saving model to model_checkpoints/reqbert-epoch8.ckpt
Epoch 9/10
Epoch 9: saving model to model_checkpoints/reqbert-epoch9.ckpt
Epoch 10/10
Epoch 10: saving model to model_checkpoints/reqbert-epoch10.ckpt


<keras.callbacks.History at 0x7f7e4a6eb850>

In [24]:
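y_true = tokenized_ds['test']['is_functional']
y_pred = np.argmax(model.predict(batched_test_ds)['logits'], axis = 1)

# f1_score defaults to average = 'binary', so the value recorded below is the
# F1 of the functional class (label 1); pass average = 'macro' to average both classes
test_f1 = f1_score(y_true, y_pred)

print(f'Test macro F1: {test_f1:.6f}')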

Test macro F1: 0.896970
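
scikit-learn's `f1_score` defaults to `average='binary'`, which scores only the positive class, whereas macro F1 is the unweighted mean of the per-class F1 scores. The sketch below computes both on made-up labels to show that the two can differ; `f1_for_class` and `macro_f1` are hypothetical helpers written in plain Python, not the scikit-learn implementation.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Made-up labels: 1 = functional, 0 = non-functional
y_true_toy = [1, 1, 1, 1, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0]

binary_f1 = f1_for_class(y_true_toy, y_pred_toy, positive=1)
macro = macro_f1(y_true_toy, y_pred_toy)
print(binary_f1, macro)
```

On these toy labels the binary F1 is 0.75 while the macro F1 is 0.625, because the smaller class is predicted less accurately and macro averaging weights both classes equally.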
