<a href="https://colab.research.google.com/github/limshaocong/SysBERT/blob/main/Sequence_classification_PROMISE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preliminaries

In [1]:
! pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.2.2-py3-none-any.whl (346 kB)
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)


In [2]:
! nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-d46cb3fe-bd7e-e427-f99a-cf2239b86195)


# Import & Pre-process Data

Clone the repository containing the labeled requirements data.

In [3]:
! git clone https://github.com/limshaocong/SysBERT/

Cloning into 'SysBERT'...
remote: Enumerating objects: 433, done.
remote: Counting objects: 100% (433/433), done.
remote: Compressing objects: 100% (418/418), done.
remote: Total 433 (delta 32), reused 380 (delta 8), pack-reused 0
Receiving objects: 100% (433/433), 3.60 MiB | 19.74 MiB/s, done.
Resolving deltas: 100% (32/32), done.


This sequence classification task uses the labeled PROMISE dataset. The target variable is the 'is_functional' column: 1 = functional requirement, 0 = non-functional requirement. The train, validation and test sets are created by stratified sampling in a 70/15/15 ratio. The data is imported as a HuggingFace [Dataset](https://huggingface.co/docs/datasets/access) object for ease of downstream manipulation.
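The pre-processed CSV files in the repository already contain the split, so the notebook does not perform it. As a rough illustration of what a stratified 70/15/15 split does, the sketch below shuffles the indices of each class separately and slices each class in the same proportions. The function name `stratified_split`, the seed and the toy labels are assumptions made for this example, not the procedure actually used to build the files.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists that preserve label proportions."""
    rng = random.Random(seed)

    # Group example indices by class label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # whatever is left goes to test
    return train, val, test

# Hypothetical toy labels: 60 functional (1) and 40 non-functional (0).
toy_labels = [1] * 60 + [0] * 40
train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because each class is sliced on its own, the share of functional requirements is the same in all three subsets, which matters for a dataset of only a few hundred rows.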

In [4]:
from datasets import load_dataset

train_path = '/content/SysBERT/Requirements/Labeled/Sequence_Classification/Pre-processed/train.csv'
val_path = '/content/SysBERT/Requirements/Labeled/Sequence_Classification/Pre-processed/val.csv'
test_path = '/content/SysBERT/Requirements/Labeled/Sequence_Classification/Pre-processed/test.csv'

promise = load_dataset(
    'csv',
    data_files = {
        'train': train_path,
        'val' : val_path,
        'test': test_path
        }
)

promise

Using custom data configuration default-b604b0b533148ac9


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-b604b0b533148ac9/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-b604b0b533148ac9/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['reqs', 'is_functional'],
        num_rows: 437
    })
    val: Dataset({
        features: ['reqs', 'is_functional'],
        num_rows: 94
    })
    test: Dataset({
        features: ['reqs', 'is_functional'],
        num_rows: 94
    })
})

To transform natural language requirements into a BERT-compatible format, the text must first be tokenized. This is performed using a pre-trained tokenizer.

In [5]:
from transformers import AutoTokenizer

model_checkpoint = 'bert-base-cased'

tokenizer = AutoTokenizer.from_pretrained(
    model_checkpoint,
    use_fast = True
)

def encode(requirements):
    return tokenizer(requirements['reqs'], truncation = True, max_length = 128)

promise_tokenized = promise.map(encode, batched = True)

In [6]:
sample = promise['train'][0]

print(sample)
encode(sample)

{'reqs': 'All movies shall be streamed on demand at any time of the day.', 'is_functional': 1}


{'input_ids': [101, 1398, 5558, 4103, 1129, 20273, 1113, 4555, 1120, 1251, 1159, 1104, 1103, 1285, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
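The encoding above can be read as follows: the tokenizer wraps the sentence in the special tokens `[CLS]` (id 101) and `[SEP]` (id 102), assigns every token to segment 0 because there is only one sentence, and sets the attention mask to 1 everywhere because nothing has been padded yet. The short sketch below checks this layout on the ids copied from the output; the helper `describe` is a hypothetical name introduced for this illustration.

```python
# Token ids copied from the sample output above.
input_ids = [101, 1398, 5558, 4103, 1129, 20273, 1113, 4555,
             1120, 1251, 1159, 1104, 1103, 1285, 119, 102]

CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids in the bert-base-cased vocabulary

def describe(ids):
    """Summarise how a single-sentence BERT encoding is laid out."""
    return {
        'starts_with_cls': ids[0] == CLS_ID,
        'ends_with_sep': ids[-1] == SEP_ID,
        'num_content_tokens': len(ids) - 2,
        # One segment only, so every token belongs to segment 0.
        'token_type_ids': [0] * len(ids),
        # No padding yet, so every position is attended to.
        'attention_mask': [1] * len(ids),
    }

summary = describe(input_ids)
print(summary['num_content_tokens'])  # 14
```

The 14 content tokens match the 13 words plus the final period of the sample requirement, so no word in this sentence was split into sub-word pieces.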

In [7]:
promise_tokenized

DatasetDict({
    train: Dataset({
        features: ['reqs', 'is_functional', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 437
    })
    val: Dataset({
        features: ['reqs', 'is_functional', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 94
    })
    test: Dataset({
        features: ['reqs', 'is_functional', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 94
    })
})

In [15]:
from transformers import DataCollatorWithPadding

batch_size = 16
output_col = 'is_functional'

data_collator = DataCollatorWithPadding(
    tokenizer = tokenizer,
    return_tensors = 'tf'
)

batched_train_ds = promise_tokenized['train'].to_tf_dataset(
    columns = ['attention_mask', 'input_ids', 'token_type_ids'],
    label_cols = [output_col],
    shuffle = False,
    drop_remainder = False,
    collate_fn = data_collator,
    batch_size = batch_size
)

batched_val_ds = promise_tokenized['val'].to_tf_dataset(
    columns = ['attention_mask', 'input_ids', 'token_type_ids'],
    label_cols = [output_col],
    shuffle = False,
    drop_remainder = False,
    collate_fn = data_collator,
    batch_size = batch_size
)

batched_test_ds = promise_tokenized['test'].to_tf_dataset(
    columns = ['attention_mask', 'input_ids', 'token_type_ids'],
    label_cols = [output_col],
    shuffle = False,
    drop_remainder = False,
    collate_fn = data_collator,
    batch_size = batch_size
)
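The cell above batches the tokenized data and leaves padding to the collator, so each batch is padded only to the length of its own longest sequence rather than to the global maximum of 128 tokens. The sketch below imitates that behaviour in plain Python. It is a simplified stand-in for `DataCollatorWithPadding`, not its implementation, and `pad_batch` and the toy batch are names and values assumed for this example.

```python
import math

PAD_ID = 0  # [PAD] id in the bert-base-cased vocabulary

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one in the batch and build the mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Hypothetical toy batch of three already-tokenized requirements.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 102]])
print(batch['input_ids'])
print(batch['attention_mask'])

# With drop_remainder = False the last, smaller batch is kept,
# so 437 training rows in batches of 16 give 28 batches per epoch.
num_train_batches = math.ceil(437 / 16)
print(num_train_batches)  # 28
```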

# Model Fine-tuning

In [39]:
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

tf.keras.utils.set_random_seed(1234)

model = TFAutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels = 2
)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [40]:
for layer in model.layers[:]:
    print(layer)

<transformers.models.bert.modeling_tf_bert.TFBertMainLayer object at 0x7fcd8a092c90>
<keras.layers.core.dropout.Dropout object at 0x7fce909fd410>
<keras.layers.core.dense.Dense object at 0x7fce909fd890>


In [41]:
from transformers import models

frozen_layers = range(0, 8)  # freeze encoder layers 0-7, fine-tune layers 8-11

for layer in model.layers:

  # Replace transformers.models.bert.modeling_tf_bert.TFBertMainLayer
  # with the corresponding MainLayer name from the previous code output
  if isinstance(layer, models.bert.modeling_tf_bert.TFBertMainLayer):

    # Use a separate name so the outer loop variable is not overwritten
    for idx, encoder_layer in enumerate(layer.encoder.layer):

      if idx in frozen_layers:
        encoder_layer.trainable = False

      # Confirm the chosen layers are frozen
      print(encoder_layer, encoder_layer.trainable)

<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7fcd8d79f6d0> False
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7fcd8d73fb90> False
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7fcd8d75c4d0> False
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7fcd8d772490> False
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7fce90abd610> False
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7fce90ad4650> False
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7fce90aeb610> False
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7fce90a83650> False
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7fce90a9a6d0> True
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7fce90ad4590> True
<transformers.models.bert.modeling_tf_bert.TFBertLayer object at 0x7fce90a4a790> True
<transformers.models.bert.modeling_tf_bert.TFB

In [42]:
model.summary()

Model: "tf_bert_for_sequence_classification_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108310272 
                                                                 
 dropout_113 (Dropout)       multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 108,311,810
Trainable params: 51,608,834
Non-trainable params: 56,702,976
_________________________________________________________________
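The parameter counts in the summary can be reproduced by hand, which is a useful check that exactly the first eight encoder layers were frozen. The sketch below rebuilds the totals from the published `bert-base-cased` configuration (vocabulary of 28,996 tokens, hidden size 768, intermediate size 3,072, 12 layers); the helper `dense` and the variable names are assumptions made for this illustration.

```python
# Published bert-base-cased configuration values.
vocab_size, hidden, max_positions, type_vocab = 28996, 768, 512, 2
intermediate, num_layers, num_labels = 3072, 12, 2
num_frozen = 8  # encoder layers 0-7, as set in the freezing cell

def dense(n_in, n_out):
    """Parameters of a fully connected layer: weight matrix plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # one scale and one shift vector

# Word, position and segment embeddings followed by a LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

encoder_layer = (
    4 * dense(hidden, hidden)        # query, key, value, attention output
    + layer_norm
    + dense(hidden, intermediate)    # feed-forward expansion
    + dense(intermediate, hidden)    # feed-forward projection
    + layer_norm
)
pooler = dense(hidden, hidden)
classifier = dense(hidden, num_labels)

bert_total = embeddings + num_layers * encoder_layer + pooler
frozen = num_frozen * encoder_layer
trainable = bert_total + classifier - frozen

print(bert_total)   # 108310272
print(classifier)   # 1538
print(frozen)       # 56702976
print(trainable)    # 51608834
```

The non-trainable total equals eight encoder layers exactly, which confirms that the embeddings, the pooler and the last four encoder layers all remain trainable alongside the new classifier head.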


In [45]:
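import math

from transformers import create_optimizer

num_epochs = 5

# The datasets are batched with drop_remainder = False, so the final partial
# batch also counts as a training step: round up rather than down.
batches_per_epoch = math.ceil(len(promise_tokenized['train']) / batch_size)
total_train_steps = int(batches_per_epoch * num_epochs)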

optimizer, schedule = create_optimizer(
    init_lr = 1e-5,
    num_warmup_steps = 0,
    num_train_steps = total_train_steps
)

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True)

model.compile(
    optimizer = optimizer,
    loss = loss,
    metrics = [tf.metrics.SparseCategoricalAccuracy()]
)
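With no warm-up steps, the schedule returned by `create_optimizer` is understood here to decay the learning rate linearly from `init_lr` to zero over `num_train_steps` (its default settings). The sketch below shows that decay in plain Python. It is an approximation written for illustration, assuming those defaults, and `linear_decay_lr` and the step count of 100 are hypothetical.

```python
def linear_decay_lr(step, init_lr=1e-5, total_steps=100, end_lr=0.0):
    """Learning rate after `step` updates under linear decay without warm-up."""
    step = min(step, total_steps)  # the rate stays at end_lr once decay is done
    return (init_lr - end_lr) * (1 - step / total_steps) + end_lr

# Hypothetical run of 100 steps: start, midpoint, end, and past the end.
for step in (0, 50, 100, 120):
    print(step, linear_decay_lr(step))
```

This is why the step count passed to `create_optimizer` matters: if it is smaller than the number of updates actually performed, the final updates run with a learning rate of zero and change nothing.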

In [46]:
model.fit(
    batched_train_ds,
    validation_data = batched_val_ds,
    epochs = num_epochs
)



KeyboardInterrupt: ignored