<a href="https://colab.research.google.com/github/mariaberardi/NLP_examples/blob/main/NLP_model_fine_tuning_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, I will walk you through the procedure we used to fine-tune an NLP model for a past project. My team used BERT through the Transformers library, which makes the training process quite accessible.

Training an NLP model from scratch is very time-consuming. For our purposes, the preferred strategy was to take a publicly available pre-trained model and fine-tune it on our subject area.

In [1]:
!pip install transformers
!pip install datasets

from transformers import BertModel, BertConfig
from datasets import load_dataset

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.2-py3-none-any.whl (4.9 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.10.0-py3-none-any.whl (163 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.0 tokenizers-0.12.1 transformers-4.22.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.5.2-py3-none-any.whl (432 kB)

In [2]:
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer

In [3]:
from datasets import load_metric

In [4]:
import numpy as np

For the sake of this demonstration, we will use a publicly available text dataset to fine-tune a pretrained model: MRPC from the GLUE benchmark, a set of sentence pairs labelled as paraphrases of each other or not. To obtain results specific to a given project, a dataset from that area or subject is of course more suitable. In our case, we used data collected from a survey we distributed.

In [5]:
raw_dataset = load_dataset('glue', 'mrpc')

Downloading and preparing dataset glue/mrpc (download: 1.43 MiB, generated: 1.43 MiB, post-processed: Unknown size, total: 2.85 MiB) to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.



In [6]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [7]:
raw_dataset['train'][:5]

{'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
  "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
  'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .',
  'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .',
  'The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .'],
 'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
  "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .",
  "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .",
  'Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at 

In [8]:
raw_dataset['train'].features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

At this point, we need a tokenizer. This is the tool that splits the text into tokens and maps each token to an integer ID, which is the form the model takes as input.
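
To make the idea concrete, here is a minimal sketch of sub-word tokenization in the WordPiece style that BERT uses. The vocabulary `TOY_VOCAB`, its IDs, and the helper names are made up for illustration; the real `bert-base-cased` vocabulary has 28996 entries and does not lower-case the text, so treat this as a toy model of the mechanism rather than the library's implementation.

```python
# Hypothetical toy vocabulary; the IDs are arbitrary and chosen for this example only.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "the": 4, "stock": 5, "rose": 6, "token": 7, "##izer": 8, "##s": 9}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Return (tokens, ids) with the [CLS]/[SEP] markers BERT expects."""
    tokens = ["[CLS]"]
    for word in text.lower().split():  # lower-casing is a simplification of this sketch
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]

tokens, ids = toy_tokenize("The tokenizers rose")
print(tokens)
print(ids)
```

The word "tokenizers" is not in the toy vocabulary, so it is split into the known pieces `token`, `##izer`, and `##s`. This is how a fixed vocabulary can still cover words it has never seen whole.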

In [9]:
from transformers import AutoTokenizer

In [10]:
checkpoint = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [11]:
def tokenize_function(example):
  # Encode each sentence pair together, padding or truncating to a fixed 128 tokens
  return tokenizer(example['sentence1'], example['sentence2'],
                   padding='max_length', truncation=True, max_length=128)
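
As a rough illustration of what the tokenizer returns for a sentence pair, the sketch below builds the three fields `input_ids`, `token_type_ids`, and `attention_mask` by hand from already-tokenized IDs. The function `encode_pair` and the small ID values are hypothetical, and truncation here simply cuts the tail, whereas the real tokenizer trims the longer sentence first.

```python
def encode_pair(ids_a, ids_b, max_length=8, cls_id=101, sep_id=102, pad_id=0):
    """Build BERT-style inputs for a sentence pair (simplified sketch)."""
    # Layout: [CLS] sentence A [SEP] sentence B [SEP]
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS] + A + [SEP]; segment 1 covers B + [SEP]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # Simplified truncation: keep only the first max_length positions
    input_ids = input_ids[:max_length]
    token_type_ids = token_type_ids[:max_length]
    # Real tokens get mask 1; padding added below gets mask 0
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }

enc = encode_pair([7, 8], [9], max_length=8)
for name, values in enc.items():
    print(name, values)
```

With `padding='max_length'`, every example ends up exactly `max_length` long, which is simple but wastes computation on padding when most sentences are short.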

In [12]:
tokenized_dataset = raw_dataset.map(tokenize_function, batched = True)

In [13]:
tokenized_dataset.column_names

{'train': ['sentence1',
  'sentence2',
  'label',
  'idx',
  'input_ids',
  'token_type_ids',
  'attention_mask'],
 'validation': ['sentence1',
  'sentence2',
  'label',
  'idx',
  'input_ids',
  'token_type_ids',
  'attention_mask'],
 'test': ['sentence1',
  'sentence2',
  'label',
  'idx',
  'input_ids',
  'token_type_ids',
  'attention_mask']}

In [14]:
tokenized_dataset = tokenized_dataset.remove_columns(['idx', 'sentence1', 'sentence2'])

In [15]:
tokenized_dataset = tokenized_dataset.rename_column('label', 'labels')

In [16]:
tokenized_dataset = tokenized_dataset.with_format('torch')

In [17]:
tokenized_dataset['train']

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3668
})

In [18]:
#to speed up the training for now, I will select a very small subset of the data
train_dataset = tokenized_dataset["train"].shuffle(seed=42).select(range(100))
eval_dataset = tokenized_dataset["test"].shuffle(seed=42).select(range(100))

In [19]:
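#batches examples into tensors; inputs were already padded to max_length=128 above, so no extra padding is added here
data_collator = DataCollatorWithPadding(tokenizer)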
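
For context, a padding collator is most useful when the examples in a batch have different lengths: it pads each batch only up to its own longest member. The sketch below shows that idea in plain Python. The function `pad_batch` is a hypothetical stand-in, not the library's code, and it handles only two fields.

```python
def pad_batch(batch, pad_id=0):
    """Pad a list of token-ID lists to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    return {
        # Append pad_id until every row has the same length
        "input_ids": [ids + [pad_id] * (longest - len(ids)) for ids in batch],
        # 1 marks real tokens, 0 marks padding
        "attention_mask": [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch],
    }

padded = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

To benefit from this kind of dynamic padding, the tokenization step would need to skip `padding='max_length'` and leave padding to the collator.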

In [20]:
#we are using a pretrained model for fine-tuning 
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 2)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [21]:
training_args = TrainingArguments('test-trainer')

In [22]:
trainer = Trainer(model, training_args, train_dataset = train_dataset, eval_dataset = eval_dataset, data_collator = data_collator, tokenizer = tokenizer)

In [23]:
trainer.train()

***** Running training *****
  Num examples = 100
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 39
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=39, training_loss=0.4686579092954978, metrics={'train_runtime': 428.53, 'train_samples_per_second': 0.7, 'train_steps_per_second': 0.091, 'total_flos': 19733329152000.0, 'train_loss': 0.4686579092954978, 'epoch': 3.0})
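
The log above reports 39 total optimization steps. As a quick check, the sketch below reproduces that number from the values in the log (100 examples, batch size 8, 3 epochs, gradient accumulation 1). The helper `total_optimization_steps` is a hypothetical name, and the formula assumes a single device and that the last partial batch of each epoch is kept.

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs, grad_accum_steps=1):
    """Number of optimizer updates over a whole training run."""
    # The last batch of an epoch may be smaller, hence the ceiling
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer update per grad_accum_steps batches, at least one per epoch
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs

steps = total_optimization_steps(num_examples=100, batch_size=8, num_epochs=3)
print(steps)  # 39
```

Each epoch has 13 batches (12 full batches of 8 and one of 4), and 13 × 3 = 39.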

In [33]:
#an option to save the model in Google Drive, since we were using a Colab notebook
#import pickle as pkl
#from google.colab import drive
#drive.mount('/content/gdrive')
#path = #"your path"
#pkl.dump(model, open( path+"save.p", "wb" ))

In [24]:
#model evaluation metric
metric = load_metric("accuracy")

  


In [25]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
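
To show what this function computes, here is a small self-contained sketch using only NumPy. The logits and labels are invented for the example, and `accuracy_from_logits` is a hypothetical helper that mirrors the argmax-then-compare logic without the `datasets` metric object.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = np.argmax(logits, axis=-1)  # predicted class per example
    return float((predictions == labels).mean())

# Made-up scores for four examples and two classes
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]])
labels = np.array([0, 1, 0, 0])

acc = accuracy_from_logits(logits, labels)
print(acc)  # 0.75
```

The predicted classes are 0, 1, 1, 0, so three of the four examples match their labels.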

In [26]:
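#compute_metrics was never passed to the Trainer, so only the loss is reported below; pass compute_metrics=compute_metrics to Trainer(...) to get accuracy too
trainer.evaluate()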

***** Running Evaluation *****
  Num examples = 100
  Batch size = 8


{'eval_loss': 0.6054608821868896,
 'eval_runtime': 46.5375,
 'eval_samples_per_second': 2.149,
 'eval_steps_per_second': 0.279,
 'epoch': 3.0}

An alternative is to fine-tune the TensorFlow version of the model with Keras. We include it here.

In [27]:
#to use Keras we need these libraries
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/a8d257ba9925ef39f3036bfc338acf5283c512d9/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.22.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}



loading weights file tf_model.h5 from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/a8d257ba9925ef39f3036bfc338acf5283c512d9/tf_model.h5
All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
# Convert our datasets to standard tf.data.Dataset objects.
# First, set them to TensorFlow format.

tf_train_dataset = train_dataset.with_format("tensorflow")
tf_eval_dataset = eval_dataset.with_format("tensorflow")

In [29]:
# convert everything into big tensors

train_features = {x: tf_train_dataset[x] for x in tokenizer.model_input_names}
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_train_dataset["labels"]))
train_tf_dataset = train_tf_dataset.shuffle(len(tf_train_dataset)).batch(8)

eval_features = {x: tf_eval_dataset[x] for x in tokenizer.model_input_names}
eval_tf_dataset = tf.data.Dataset.from_tensor_slices((eval_features, tf_eval_dataset["labels"]))
eval_tf_dataset = eval_tf_dataset.batch(8)

In [30]:
# compile the model to train as any Keras model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)
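
The loss chosen above takes raw logits and integer labels. As a sketch of the underlying arithmetic, the NumPy function below applies a log-softmax to the logits and averages the negative log-probability of the correct class. The function name and the sample values are made up for illustration, and this is not the Keras implementation itself.

```python
import numpy as np

def sparse_categorical_crossentropy_from_logits(logits, labels):
    """Mean cross-entropy for integer labels, computed from unnormalized logits."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the true class in each row
    per_example = -log_probs[np.arange(len(labels)), labels]
    return per_example.mean()

# Made-up logits: the first row is undecided, the second favours class 0
logits = [[0.0, 0.0], [2.0, 0.0]]
labels = [0, 0]

loss = sparse_categorical_crossentropy_from_logits(logits, labels)
print(round(loss, 4))
```

The undecided first row contributes log 2 to the loss, while the second row, where the correct class has the larger logit, contributes much less.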

In [31]:
# train
model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7ff92b2fda10>

In [32]:
# save the model

#from transformers import AutoModelForSequenceClassification

#model.save_pretrained("my_imdb_model")
#pytorch_model = AutoModelForSequenceClassification.from_pretrained("my_imdb_model", from_tf=True)