Ken Perry attribution
- The following code is adapted from the Course (as of late May 2022) example notebook `Fine_tune_HuggingFace_model_in_Keras_with_plain_datasets.ipynb`
- Change dataset to Financial Phrasebank
  - The "official" version of the data is hidden behind a download link
  - To use it, you need to
    - go to the link and manually download it to your local machine
    - upload it to the `/content` directory on Colab
  - I give an alternate source, with more examples

Fine-tune a HF DistilBERT model on Financial Phrasebank
- create a **custom model** consisting of HF DistilBERT **plus our own classification head**
- evaluate accuracy before and after fine-tuning
- create an HF DistilBERT model complete with its own head from HF
- experiment with datasets along the way
- Financial Phrasebank dataset can be sourced in the code in two ways
  - as a HF dataset
  - directly from author's website, but we must perform preprocessing

In creating our own head
- had to observe the shape of the DistilBERT output `x`
  - `x['last_hidden_state']` is a sequence (one element per token of the input)
  - each element of the sequence is a vector of size $d_\text{model} = 768$
  - position 0 corresponds to the `[CLS]` token
  - so my head accesses the sequence position of `[CLS]`: `x['last_hidden_state'][:, 0, :]`
    - first dim. is batch, `0` is the first position, `:` selects the whole vector of length 768
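
The indexing `[:, 0, :]` can be checked on a stand-in array. This is a minimal sketch that uses NumPy with random values in place of real DistilBERT output; the batch size of 4 and sequence length of 512 are assumptions chosen for illustration.

```python
import numpy as np

# Stand-in for x['last_hidden_state']: (batch, sequence length, d_model).
# Random values; only the shapes matter here.
batch_size, seq_len, d_model = 4, 512, 768
last_hidden_state = np.random.rand(batch_size, seq_len, d_model)

# Keep every example in the batch, select position 0 ([CLS]),
# and keep all 768 features of that position.
cls_encoding = last_hidden_state[:, 0, :]

print(last_hidden_state.shape)  # (4, 512, 768)
print(cls_encoding.shape)       # (4, 768)
```

The sequence dimension disappears: each example is now summarized by a single vector of length 768, which is what the classification head consumes.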

In using the HF model with head
- had to observe that the outputs are **logits**, not *probabilities*
  - so had to convert them to probabilities, or simply take the index of the largest logit
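
Both options lead to the same predicted class, because softmax preserves the ordering of its inputs. The sketch below illustrates this in plain Python; the logit values are hypothetical and the `softmax` helper is written here for illustration rather than taken from any library.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one example over (negative, neutral, positive)
logits = [-1.2, 2.3, 0.4]
probs = softmax(logits)

# Index of the largest entry, computed from logits and from probabilities
pred_from_logits = max(range(len(logits)), key=lambda i: logits[i])
pred_from_probs = max(range(len(probs)), key=lambda i: probs[i])

print(pred_from_logits, pred_from_probs)
```

Converting to probabilities is only needed when the probability values themselves matter; for accuracy, the index of the largest logit is enough.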


In [1]:
try:
  from google.colab import drive
  IN_COLAB=True
except ImportError:
  IN_COLAB=False

if IN_COLAB:
  print("We're running Colab")

We're running Colab


In [2]:
import tensorflow as tf

print("Running TensorFlow version ",tf.__version__)

# Parse tensorflow version
import re

version_match = re.match(r"([0-9]+)\.([0-9]+)", tf.__version__)
tf_major, tf_minor = int(version_match.group(1)) , int(version_match.group(2))
print("Version {v:d}, minor {m:d}".format(v=tf_major, m=tf_minor) )

Running TensorFlow version  2.11.0
Version 2, minor 11


In [3]:
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
if gpu_devices:
    print('Using GPU')
    tf.config.experimental.set_memory_growth(gpu_devices[0], True)
else:
    print('Using CPU')

Using GPU


In [4]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import numpy as np

# Make sure required packages are installed
- vast.ai does not have some packages that are commonly installed in Colab, etc.

In [5]:
# Derived from: https://stackoverflow.com/a/44210735

import subprocess
import sys

import pkg_resources

required = {'scikit-learn'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)



In [6]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.0-py3-none-any.whl (6.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.2-py3-none-any.whl (199 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.2 tokenizers-0.13.2 transformers-4.27.0


In [7]:
!pip install  datasets
from datasets import load_dataset



Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py39-none-any.whl (132 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<

# We can source the Financial Phrasebank dataset from several places
- HuggingFace
- the authors
  - requires downloading and preprocessing (implemented below)
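
The preprocessing for the authors' file amounts to splitting each line into a sentence and a label, then mapping the label string to an integer. The sketch below shows that idea in plain Python on hypothetical lines (they are not taken from the real file); the notebook itself uses pandas with `delimiter='@'`, so `parse_line` is an illustrative stand-in, not the code used below.

```python
label_to_int = {"negative": 0, "neutral": 1, "positive": 2}

# Hypothetical lines in the "sentence@label" layout; not taken from the real file
raw_lines = [
    "Operating profit rose compared with the previous year .@positive",
    "The company has no plans to change its guidance .@neutral",
    "Net sales fell short of the earlier estimate .@negative",
]

def parse_line(line):
    # Split on the LAST '@' so an '@' inside the sentence would not break parsing
    sentence, label = line.rstrip("\n").rsplit("@", 1)
    return {"sentence": sentence, "label": label_to_int[label]}

examples = [parse_line(line) for line in raw_lines]
print(examples[0])
```

When reading the real file, it would need to be opened with `encoding="latin1"`, matching the encoding passed to pandas in the notebook.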

In [8]:
financial_phrasebank_src =  "HuggingFace" # "author"

dataset_name = "financial_phrasebank"
subset_name = 'sentences_allagree'

# Mapping of string labels to integers
label_to_int = { 'negative': 0, 'neutral': 1, 'positive': 2}

# Dataset keys
(text_hdr, label_hdr) = ("sentence", "label")

from datasets import load_dataset

In [9]:
if financial_phrasebank_src == "HuggingFace":
  print(f"Obtaining dataset '{dataset_name}', subset '{subset_name}' from {financial_phrasebank_src}")
  dataset = load_dataset(dataset_name, subset_name)


Obtaining dataset 'financial_phrasebank', subset 'sentences_allagree' from HuggingFace


Downloading builder script:   0%|          | 0.00/6.04k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/13.7k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/8.86k [00:00<?, ?B/s]

Downloading and preparing dataset financial_phrasebank/sentences_allagree to /root/.cache/huggingface/datasets/financial_phrasebank/sentences_allagree/1.0.0/550bde12e6c30e2674da973a55f57edde5181d53f5a5a34c1531c53f93b7e141...


Downloading data:   0%|          | 0.00/682k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2264 [00:00<?, ? examples/s]

Dataset financial_phrasebank downloaded and prepared to /root/.cache/huggingface/datasets/financial_phrasebank/sentences_allagree/1.0.0/550bde12e6c30e2674da973a55f57edde5181d53f5a5a34c1531c53f93b7e141. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [10]:
if financial_phrasebank_src == "author":
  print(f"Obtaining dataset '{dataset_name}', subset '{subset_name}' from {financial_phrasebank_src}")
 
  dataset_download_url = "https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10/link/0c96051eee4fb1d56e000000/download"
  dataset_zipfile_url = "https://www.researchgate.net/profile/Pekka_Malo/publication/251231364_FinancialPhraseBank-v10/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v10.zip"

  download_path = "/content/FinancialPhraseBank-v1.0.zip"

  # The dataset seems to be hidden behind a `download` link (`dataset_download_url`).
  # Open the link manually to get the zip file instead
  # - this will download it to your local machine
  # - then upload it to the `/content` directory on Colab

  import os

  if not os.path.exists(download_path):
    print("You must manually go to the URL: ", dataset_zipfile_url, "\n\tdownload the file and upload it to Colab")
    # !wget $dataset_zipfile_url

  # Defined outside the `if` above so they exist whether or not the zip file was already present
  unzipped_dir = download_path.replace(".zip", "")
  unzipped_file = os.path.join(unzipped_dir, "Sentences_AllAgree.txt")

  if not os.path.exists(unzipped_file):
    ! unzip $download_path

  print("Loading: ", unzipped_file)

  # Unfortunately, the unzipped file is not encoded as utf-8, so `load_dataset` fails when it encounters bytes that are not valid UTF-8.

  # Can read it as a CSV file by passing the proper encoding argument and separator.
  # Then write it back out as a CSV file in "standard" encoding.

  import pandas as pd
  df = pd.read_csv(unzipped_file, encoding='latin1', delimiter='@', header=None)

  unzipped_file_mod = unzipped_file.replace(".txt", "_mod.csv")

  df.to_csv(unzipped_file_mod, sep="\t", header=[text_hdr, label_hdr], index=None)

  # Finally: can load the dataset from the modified CSV file
  raw_datasets = load_dataset("csv", data_files=unzipped_file_mod, delimiter="\t")

  def process_example(example):
    text, label = example[text_hdr], example[label_hdr]

    # Replace label with integer:
    label_int = label_to_int[label]

    return { text_hdr: text, label_hdr: label_int }

  dataset = raw_datasets.map(process_example)


In [11]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 2264
    })
})

In [12]:
dataset["train"][:2]

{'sentence': ['According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .',
  "For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m ."],
 'label': [1, 2]}

# Re-using a `DistilBert` model with a task specific Classifier head

`BERT` is a *very large* Language Model.

`DistilBert` is a *much smaller* model obtained from `BERT` via a process known as distillation


Let's take a look at the model configuration of each model

In [13]:
from transformers import DistilBertConfig, BertConfig

DistilBertConfig()

DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.27.0",
  "vocab_size": 30522
}

In [14]:
BertConfig()

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.27.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

A couple of comparisons between the two models
- both models produce a sequence of latent vectors (sequence length equal to the length of the input sequence)
- the latent dimension of BERT (`hidden_size`) and of `DistilBert` (`dim`) are both 768
- the number of layers of BERT (`num_hidden_layers`) is 12; for `DistilBert` (`n_layers`) it is 6
- both have 12 attention heads per layer
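
The configuration values are enough to estimate the size of the headless `DistilBert` encoder. The sketch below is a back-of-the-envelope count in plain Python; the layer layout it assumes (token and position embeddings with one LayerNorm, then per block four attention projections, a two-layer feed-forward network, and two LayerNorms) is an assumption about the architecture, not something printed by the config. The result agrees with the 66,362,880 parameters that `model.summary()` reports for the encoder later in this notebook.

```python
# Values taken from the DistilBertConfig printed above
vocab_size, max_positions = 30522, 512
dim, hidden_dim, n_layers = 768, 3072, 6

def dense(n_in, n_out):
    """Parameters of a fully connected layer: kernel plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * dim  # one scale and one shift per feature

# Embeddings (assumed layout): token table + position table + one LayerNorm
embeddings = vocab_size * dim + max_positions * dim + layer_norm

# One Transformer block (assumed layout): Q, K, V and output projections,
# a two-layer feed-forward network, and two LayerNorms
block = 4 * dense(dim, dim) + dense(dim, hidden_dim) + dense(hidden_dim, dim) + 2 * layer_norm

encoder_params = embeddings + n_layers * block
print(f"{encoder_params:,}")  # 66,362,880
```

Under these assumptions each block holds 7,087,872 parameters, so the six blocks account for about 42.5 million of the total and the embeddings for most of the rest.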


# Instantiating the pre-trained model

We are going to adapt `DistilBert` to a new "Target" task
- `DistilBert` was trained on the Masked Language Modelling task
- So the complete model includes a Classification head for that task
- Our task is different: Text Sequence Classification
- We will therefore invoke a "headless" version of the model and graft on our own head
  - which will need to be trained



By invoking the model with the `*Model` architecture, we get a model that returns (the sequence of) hidden states.  That is: a model without a head.

Had we invoked it with the `*AutoModelForSequenceClassification` architecture, we would get a model with a classification head that is *binary* by default (`num_labels=2`).
- But this dataset has *three* classes, so we need a head with 3 outputs


In [15]:
from transformers import DistilBertTokenizerFast, TFDistilBertModel
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
bert = TFDistilBertModel.from_pretrained("distilbert-base-uncased")

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading tf_model.h5:   0%|          | 0.00/363M [00:00<?, ?B/s]

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['vocab_layer_norm', 'vocab_transform', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


We get warning messages because
- `DistilBert` was trained for Masked Language Modelling and we are invoking a "headless" model
  - because we need to use it for a different task: Text Sequence Classification

We will accomplish this by deriving a sub-class of `keras.Model`
- that *contains* a `DistilBert` model
  - referred to as the "encoder"
- that overrides the `call` method
  - to invoke the encoder
  - to post-process the output (obtain the encoding of the special `[CLS]` input token)
  - to use the encoding of `[CLS]` as input to a task-specific Classifier head

In [16]:
class TextClassificationModel(keras.Model):
  def __init__(self, encoder, train_encoder=True):
    super(TextClassificationModel, self).__init__()
    # Pre-trained "headless" model; frozen when train_encoder is False
    self.encoder = encoder
    self.encoder.trainable = train_encoder
    # Classification head: Dropout -> Dense(20) -> Dropout -> Dense(3, softmax)
    self.dropout1 = layers.Dropout(0.1)
    self.dropout2 = layers.Dropout(0.1)
    self.dense1 = layers.Dense(20, activation="relu")
    self.dense2 = layers.Dense(3, activation='softmax')
  
  def call(self, input):
    x = self.encoder(input)
    # Keep only the encoding of the first token ([CLS]): shape (batch, 768)
    x = x['last_hidden_state'][:, 0, :]
    x = self.dropout1(x)
    x = self.dense1(x)
    x = self.dropout2(x)
    # Softmax output: one probability per class
    x = self.dense2(x)
    return x

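## Alternate way to create a model consisting of Distilbert + a new head
- instead of sub-classing and overriding `call`
- create a new `Functional` model

        # Sketch: assumes inputs padded to 512 tokens and the `bert` encoder loaded above
        input_ids = layers.Input(shape=(512,), dtype=tf.int32, name="input_ids")
        attention_mask = layers.Input(shape=(512,), dtype=tf.int32, name="attention_mask")
        inputs = {"input_ids": input_ids, "attention_mask": attention_mask}

        x = bert(inputs)
        x = x['last_hidden_state'][:, 0, :]
        x = layers.Dropout(0.1)(x)
        x = layers.Dense(20, activation="relu")(x)
        x = layers.Dropout(0.1)(x)
        outputs = layers.Dense(3, activation="softmax")(x)

        model = tf.keras.Model(inputs=inputs, outputs=outputs)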

# Prepare the data
- split into train and test datasets
- tokenize train and test datasets
- create TensorFlow `tf.data.Dataset`
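
Tokenizing with `padding="max_length"` and `truncation=True` yields, for every example, a fixed-length list of token ids plus an attention mask marking which positions hold real tokens. The sketch below mimics that bookkeeping in plain Python. `pad_and_mask` is a hypothetical helper written for illustration, not the tokenizer's implementation, and the token ids are made up apart from 101 and 102, which stand for `[CLS]` and `[SEP]`.

```python
def pad_and_mask(token_ids, max_length, pad_token_id=0):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    ids = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_token_id] * n_pad        # padding with the pad token id
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for a short sentence
encoded = pad_and_mask([101, 1996, 2373, 102], max_length=8)
print(encoded["input_ids"])       # [101, 1996, 2373, 102, 0, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
```

This simplification cuts off the end of an over-long sequence; a real tokenizer also keeps the closing special token in place when it truncates.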

In [17]:
len( dataset["train"]["label"] )

2264

In [18]:
set( dataset["train"]["label"] )

{0, 1, 2}

In [19]:
target_labels = [ str(_) for _ in list( label_to_int.values() ) ]

print(f"Target task labels: {', '.join(target_labels)}")


Target task labels: 0, 1, 2


First try:
- place all examples/labels in memory, rather than HF dataset
- then tokenize them and place them in a TF Dataset and free memory

In [20]:
train_texts, train_labels = dataset["train"][text_hdr], dataset["train"][label_hdr]

In [21]:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

In [22]:
len(val_texts)

453

In [23]:
train_encodings = tokenizer(train_texts, truncation=True, padding="max_length", max_length=512)
val_encodings = tokenizer(val_texts, truncation=True, padding="max_length", max_length=512)

In [24]:
type(train_encodings)

transformers.tokenization_utils_base.BatchEncoding

In [25]:
train_encodings.keys()

dict_keys(['input_ids', 'attention_mask'])

In [26]:
type( dict(train_encodings))

dict

In [27]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))

Free up memory
- the data is now in the `tf.data.Datasets`, don't need to keep the original in memory

In [28]:
num_train = len(train_texts)

del(train_texts)
del(val_texts)

In [29]:
del(train_encodings)
del(val_encodings)

# Transfer Learning
- Just train the Classification head
- **do not** modify the weights of the "Encoder" (`DistilBert`) model contained within `text_classification_model`

Create `text_classification_model` by adding a trainable Classification head to a frozen `DistilBert` model

In [30]:
text_classification_model = TextClassificationModel(bert, train_encoder=False)

We would like to do `text_classification_model.summary()` right now
- but it will fail because "the model hasn't been built"
  - this means that the shapes of the inputs are not yet known
  - either we invoke `build` on the model and specify the input shape
  - or we call the model with some data, thus indicating the input shape
    - we do the latter
    - create a batched dataset
    - process the first batch through the model

In [31]:
first_batch_outputs = text_classification_model(next(iter(train_dataset.batch(4))))

print(f"First batch outputs -- number of examples in batch: {first_batch_outputs.shape[0]}")

print(f"First batch outputs -- number of classes: {first_batch_outputs.shape[1]}")

print(f"First batch outputs -- sum of outputs of each row: {tf.reduce_sum(first_batch_outputs, axis=1)}.")

print()
print(f"First batch outputs:")
first_batch_outputs

First batch outputs -- number of examples in batch: 4
First batch outputs -- number of classes: 3
First batch outputs -- sum of outputs of each row: [1.0000001  0.99999994 1.         1.0000001 ].

First batch outputs:


<tf.Tensor: shape=(4, 3), dtype=float32, numpy=
array([[0.38322738, 0.48554093, 0.13123174],
       [0.40428188, 0.40895686, 0.18676122],
       [0.4088328 , 0.47693524, 0.11423194],
       [0.4152962 , 0.45829642, 0.12640747]], dtype=float32)>

We can see from the above output shape
- 4 rows = batch size 4
- 3 columns: corresponds to the 3 classes
- column values appear to be probabilities (sum to 1), not logits

OK, time to get the model summary

In [32]:
def count_weights(weights_per_layer, prefix=None):
  # NOTE: a model's `.weights` list holds one entry per variable, so a Dense layer
  # contributes two entries: its kernel and its bias. "layer" below means one such entry.
  total = 0

  for layer, weights in enumerate(weights_per_layer):
    num_weights = np.prod(weights.shape)

    if prefix is not None:
      print(f"Trainable layer {layer} has {num_weights} weights")

    total += num_weights

  return total

def count_model_weights(model):
  all_weights = model.weights
  trainable_weights = model.trainable_weights

  # Control detailed output: suppress if the number of trainable variables is too big
  out_prefix = "trainable" if len(trainable_weights) < 10 else None

  num_weights, num_trainable_weights = count_weights(all_weights, None), count_weights(trainable_weights, out_prefix)

  return num_weights, num_trainable_weights
  


In [33]:
num_weights, num_trainable_weights = count_model_weights(text_classification_model)

print()
print(f"Total number of weights {num_weights:,}, number of trainable weights {num_trainable_weights:,}")

Trainable layer 0 has 15360 weights
Trainable layer 1 has 20 weights
Trainable layer 2 has 60 weights
Trainable layer 3 has 3 weights

Total number of weights 66,378,323, number of trainable weights 15,443


In [34]:
text_classification_model.summary()

Model: "text_classification_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 tf_distil_bert_model (TFDis  multiple                 66362880  
 tilBertModel)                                                   
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
 dropout_20 (Dropout)        multiple                  0         
                                                                 
 dense (Dense)               multiple                  15380     
                                                                 
 dense_1 (Dense)             multiple                  63        
                                                                 
Total params: 66,378,323
Trainable params: 15,443
Non-trainable params: 66,362,880
________________________

Let's examine the last 2 layers

In [35]:
for i, layer in enumerate( text_classification_model.layers[-2:] ):
  print(f"Layer {-2 + i}: {type(layer)} weights {layer.weights[0].shape}, biases {layer.weights[1].shape}")


Layer -2: <class 'keras.layers.core.dense.Dense'> weights (768, 20), biases (20,)
Layer -1: <class 'keras.layers.core.dense.Dense'> weights (20, 3), biases (3,)


You can see from the above
- the latent representation size (of the single `[CLS]` token) is 768
- the next to last layer is `Dense`; it converts this to 20 features
- the last layer (Classifier) converts the 20 features to 3 classes
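
These shapes account for the trainable parameter count reported in the summary. The following is a small arithmetic check in plain Python; `dense_params` is a helper defined here for illustration.

```python
def dense_params(n_in, n_out):
    """A Dense layer stores an (n_in, n_out) kernel and an (n_out,) bias."""
    return n_in * n_out + n_out

d_model, hidden_units, num_classes = 768, 20, 3

dense1 = dense_params(d_model, hidden_units)      # 768*20 + 20
dense2 = dense_params(hidden_units, num_classes)  # 20*3 + 3
head_total = dense1 + dense2

print(dense1, dense2, head_total)  # 15380 63 15443
```

The two `Dropout` layers have no parameters, so the head contributes exactly the 15,443 trainable parameters shown while the encoder is frozen.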



## Train: head only

In [36]:
text_classification_model.compile(
    tf.keras.optimizers.Adam(learning_rate=5e-5), 
    "sparse_categorical_crossentropy", 
    metrics=["accuracy"])


In [37]:

def train_model(model, train_dataset, val_dataset, num_epochs=4):
    # Shuffle and batch the training data; validate on the batched validation set after each epoch
    history = model.fit(
      train_dataset.shuffle(1000).batch(16), 
      epochs=num_epochs, 
      validation_data=val_dataset.batch(16)
      #callbacks=[tensorboard_callback]
    )
    
    return history
    
def train_model_in_chunks(model, train_dataset, val_dataset, num_chunks=4, num_epochs=1):
  # Divide training set into chunks (relies on the global `num_train`).
  # Integer division: any remainder examples at the end are not used.
  chunk_size = num_train // num_chunks

  print(f"training on {num_train} examples in chunks of size {chunk_size}")

  for epoch_num in range(num_epochs):
    for chunk_num in range(num_chunks):
      print(f"Epoch {epoch_num}, chunk {chunk_num}:")
      history = model.fit(
        train_dataset.skip(chunk_num * chunk_size).take(chunk_size).shuffle(1000).batch(16),
        epochs=1,
        validation_data=val_dataset.batch(16)
        # validation_data=val_dataset.take(500).batch(16),
        #callbacks=[tensorboard_callback]
      )

  # Only the history of the last chunk is returned
  return history
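
The chunk boundaries produced by `skip(...).take(...)` in `train_model_in_chunks` can be checked with plain integer arithmetic. This is a sketch in plain Python; `chunk_ranges` is a hypothetical helper, and the training-set size is derived from the 2264 examples and 453 validation examples reported earlier.

```python
def chunk_ranges(num_examples, num_chunks):
    """Return (start, stop) index pairs matching skip(start).take(chunk_size)."""
    chunk_size = num_examples // num_chunks
    return [(i * chunk_size, (i + 1) * chunk_size) for i in range(num_chunks)]

# 2264 examples minus the 453 held out for validation
num_train = 2264 - 453
ranges = chunk_ranges(num_train, num_chunks=4)
covered = ranges[-1][1]

print(ranges)               # [(0, 452), (452, 904), (904, 1356), (1356, 1808)]
print(num_train - covered)  # 3
```

With these sizes the last 3 training examples fall outside every chunk, which is a consequence of the integer division.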

In [38]:
train_model(text_classification_model, train_dataset, val_dataset)

Epoch 1/4


Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089


Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7ff00c5e3ca0>

## Accuracy: Evaluate accuracy on validation data

In [39]:
from sklearn.metrics import accuracy_score 


In [40]:
def eval_model(model, val_dataset, val_labels, batch_size=16):
  val_logits = model.predict( val_dataset.batch(batch_size) )

  # Depending on the model, the return type of val_logits can vary
  # - ndarray
  # - Hugging Face Sequence Model output type

  # If it is a Hugging Face return type, the logits are in the result attribute 'logits';
  # otherwise keep the array as is
  val_logits = getattr(val_logits, "logits", val_logits)

  # argmax picks the same class whether the values are logits or probabilities
  val_preds = np.argmax( val_logits, axis=1)

  acc = accuracy_score( val_labels, val_preds)
  return acc

In [41]:
print(f"Transfer learning (head-only) accuracy: {eval_model(text_classification_model, val_dataset, val_labels):3.2f}")

Transfer learning (head-only) accuracy: 0.64


## Fine tuning: train **all** layers

Now that the head has been trained, it's safe to update the weights of the "Encoder"
- had we not trained the head first
- the gradients in the initial batches would have been large
- and updating the Encoder weights with these large gradients would have been harmful


Unfreeze the weights in the embedded "Encoder" `Distilbert`

In [42]:
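text_classification_model.encoder.trainable = True
# NOTE: Keras takes a changed `trainable` flag into account for training only after `compile()` is called again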

In [43]:

num_weights, num_trainable_weights = count_model_weights(text_classification_model)

print()
print(f"Total number of weights {num_weights:,}, number of trainable weights {num_trainable_weights:,}")


Total number of weights 66,378,323, number of trainable weights 66,378,323


In [44]:
train_model(text_classification_model, train_dataset, val_dataset)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fefee54bd30>

## Accuracy after fine-tuning: Evaluate accuracy on validation data

In [45]:
print(f"Transfer learning (fine-tuning -- all weights) accuracy: {eval_model(text_classification_model, val_dataset, val_labels):3.2f}")

Transfer learning (fine-tuning -- all weights) accuracy: 0.71


# Simpler approach: auto-generated Text Sequence Classification Head

Hugging Face has a generic `TFAutoModelForSequenceClassification` class
- that invokes the `*ForSequenceClassification` variant of a given model
- result is a model that *includes* 
  - the post-processing steps needed to feed a Classification head
  - an (uninitialized) Classification Head
    - we need to tell the head how many classes are possible: `num_labels` argument

Similarly: we can obtain the tokenizer used by a variant of a given model

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
```
This is not necessary for us as we have already tokenized the data
- and converted it to a `tf.data.Dataset`

In [46]:
from transformers import TFAutoModelForSequenceClassification
text_classification_model_hf = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=len(target_labels) )

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'vocab_transform', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_40', 'classifier', 'pre_classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

In [47]:
num_weights, num_trainable_weights = count_model_weights(text_classification_model_hf)

print()
print(f"AutoModel: Total number of weights {num_weights:,}, number of trainable weights {num_trainable_weights:,}")


AutoModel: Total number of weights 66,955,779, number of trainable weights 66,955,779


From the above:
- looks like **all** weights are trainable

Probably not a good idea to Fine-Tune before training the Classification Head!

Let's address that:

In [48]:
text_classification_model_hf.layers

[<transformers.models.distilbert.modeling_tf_distilbert.TFDistilBertMainLayer at 0x7fefedd250a0>,
 <keras.layers.core.dense.Dense at 0x7fefedc7fd90>,
 <keras.layers.core.dense.Dense at 0x7fefedc811f0>,
 <keras.layers.regularization.dropout.Dropout at 0x7fefedc81160>]

The model architecture created by `TFAutoModelForSequenceClassification` is similar to the one we created by hand: an encoder followed by two `Dense` layers and a `Dropout` layer.

Let's set the `TFDistilBert` model contained within to non-trainable

In [49]:
text_classification_model_hf.layers[0].trainable = False

num_weights, num_trainable_weights = count_model_weights(text_classification_model_hf)

print()
print(f"AutoModel -- head only: Total number of weights {num_weights:,}, number of trainable weights {num_trainable_weights:,}")

Trainable layer 0 has 589824 weights
Trainable layer 1 has 768 weights
Trainable layer 2 has 2304 weights
Trainable layer 3 has 3 weights

AutoModel -- head only: Total number of weights 66,955,779, number of trainable weights 592,899
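
The head-only counts above are consistent with a 768-wide `pre_classifier` layer followed by a 3-way `classifier`. The check below is plain Python arithmetic; the pairing of each printed count with a specific layer is inferred from the shapes and layer names, and `dense_params` is a helper defined here for illustration.

```python
def dense_params(n_in, n_out):
    """A Dense layer stores an (n_in, n_out) kernel and an (n_out,) bias."""
    return n_in * n_out + n_out

dim, num_labels = 768, 3

pre_classifier = dense_params(dim, dim)     # 589,824 kernel + 768 bias
classifier = dense_params(dim, num_labels)  # 2,304 kernel + 3 bias
head_total = pre_classifier + classifier

# Total reported for the whole model, minus the head, leaves the encoder
encoder_params = 66_955_779 - head_total

print(f"{head_total:,}")      # 592,899
print(f"{encoder_params:,}")  # 66,362,880
```

The remainder matches the size of the headless encoder used in the hand-built model, so the two models differ only in their heads: roughly 593 thousand trainable parameters here versus 15,443 for the hand-built head.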


In [50]:
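# NOTE: this model outputs logits, while the loss named by the string below
# expects probabilities (it uses from_logits=False). A loss built with
# tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) matches logits.
text_classification_model_hf.compile(
    tf.keras.optimizers.Adam(learning_rate=5e-5),
    "sparse_categorical_crossentropy",
    metrics=["accuracy"])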


In [51]:
train_model(text_classification_model_hf, train_dataset, val_dataset)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fefedc65d90>

In [52]:
print(f"Transfer learning (head-only) accuracy: {eval_model(text_classification_model_hf, val_dataset, val_labels):3.2f}")

Transfer learning (head-only) accuracy: 0.60


# Extra material: Understand datasets

The model takes a `dict` as argument, **not** an array of examples

The `dict` has keys for
- `input_ids`, `attention_mask`
- the value associated with each key is an array (of length equal to number of examples)

A batch of "examples" is thus a `dict` of arrays, **not** an array of `dict`s!

`val_dataset` batch is:
- a tuple of length 2
  - features
    - a dict of key/value pairs
      - the values associated with a key is an array of size `batch_size`
  - labels
    - one label per example, hence an array of size `batch_size`
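
The difference between the two layouts can be made concrete in plain Python. This sketch uses short hypothetical token lists instead of real tokenizer output, and `to_batch` is an illustrative helper rather than part of any library.

```python
# Hypothetical tokenized examples, one dict per example ("array of dicts")
examples = [
    {"input_ids": [101, 6983, 102, 0], "attention_mask": [1, 1, 1, 0]},
    {"input_ids": [101, 1996, 2373, 102], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [101, 102, 0, 0], "attention_mask": [1, 1, 0, 0]},
]

def to_batch(rows):
    """Transpose a list of per-example dicts into one dict of per-key lists."""
    return {key: [row[key] for row in rows] for key in rows[0]}

# "dict of arrays": one entry per key, each holding one row per example
batch = to_batch(examples)

print(list(batch.keys()))       # ['input_ids', 'attention_mask']
print(len(batch["input_ids"]))  # 3
```

The model expects the second layout: each key maps to an array whose first dimension is the batch size.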

In [53]:
batch_size = 16

e = next( iter(val_dataset.batch(batch_size)) )
e_features, e_labels = e
e_features

{'input_ids': <tf.Tensor: shape=(16, 512), dtype=int32, numpy=
 array([[  101,  6983, 20248, ...,     0,     0,     0],
        [  101,  1996,  2373, ...,     0,     0,     0],
        [  101,  1996,  2449, ...,     0,     0,     0],
        ...,
        [  101, 22563, 21766, ...,     0,     0,     0],
        [  101,  1996,  3136, ...,     0,     0,     0],
        [  101,  6983,  7929, ...,     0,     0,     0]], dtype=int32)>,
 'attention_mask': <tf.Tensor: shape=(16, 512), dtype=int32, numpy=
 array([[1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        ...,
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>}

In [54]:
e_labels

<tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 2], dtype=int32)>

To **manually** create a batch of examples (features only)
- Need to create a `dict`
- with the same keys
- whose arrays are sub-arrays (of length `batch_size`) of the entire set of examples

In [55]:
b = { k: e_features[k][:batch_size] for k in e_features.keys() }

# Try to predict using our manually created batch
text_classification_model.predict( b )



array([[0.20199126, 0.41778433, 0.38022447],
       [0.10219001, 0.73698646, 0.1608236 ],
       [0.12517482, 0.76487195, 0.10995324],
       [0.1779874 , 0.5524665 , 0.26954615],
       [0.10431206, 0.8107503 , 0.08493762],
       [0.11627509, 0.7560843 , 0.12764052],
       [0.12988111, 0.58455247, 0.28556642],
       [0.13154511, 0.75028074, 0.11817414],
       [0.11067282, 0.7880505 , 0.1012767 ],
       [0.14301555, 0.6590541 , 0.19793038],
       [0.09331883, 0.7941863 , 0.11249483],
       [0.22484739, 0.35589364, 0.41925892],
       [0.11820243, 0.80760723, 0.07419029],
       [0.10871381, 0.7720098 , 0.1192764 ],
       [0.08616417, 0.8449084 , 0.06892749],
       [0.13000588, 0.7438134 , 0.12618077]], dtype=float32)

In [56]:
# Compare to predict using the batch created by Dataset operations
# The dataset returns a pair: (features, labels).  Don't need labels to predict so the "[0]" is selecting the features from the pair
text_classification_model.predict( next( iter(val_dataset.batch(batch_size)) )[0] )




array([[0.20199126, 0.41778433, 0.38022447],
       [0.10219001, 0.73698646, 0.1608236 ],
       [0.12517482, 0.76487195, 0.10995324],
       [0.1779874 , 0.5524665 , 0.26954615],
       [0.10431206, 0.8107503 , 0.08493762],
       [0.11627509, 0.7560843 , 0.12764052],
       [0.12988111, 0.58455247, 0.28556642],
       [0.13154511, 0.75028074, 0.11817414],
       [0.11067282, 0.7880505 , 0.1012767 ],
       [0.14301555, 0.6590541 , 0.19793038],
       [0.09331883, 0.7941863 , 0.11249483],
       [0.22484739, 0.35589364, 0.41925892],
       [0.11820243, 0.80760723, 0.07419029],
       [0.10871381, 0.7720098 , 0.1192764 ],
       [0.08616417, 0.8449084 , 0.06892749],
       [0.13000588, 0.7438134 , 0.12618077]], dtype=float32)

In [57]:
num_val = 10

