In [1]:
!conda install -y graphviz pydot

Collecting package metadata (current_repodata.json): done
Solving environment: done



# All requested packages already installed.



In [2]:
import numpy as np
import tensorflow as tf
import transformers

from transformers import (
    TFDistilBertForSequenceClassification,
    DistilBertTokenizerFast
)

2021-07-03 15:35:00.266888: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-07-03 15:35:00.266939: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


# Model Configuration

# Terminologies

## Pre-tokenized

* [Pre-tokenized](https://huggingface.co/transformers/preprocessing.html#pre-tokenized-inputs)

A pre-tokenized input is a ***list of words as strings***, e.g. ```["Hello", "I'm", "a", "single", "sentence"]```. The word **tokenized** in this term is misleading in the documentation.

> Pre-tokenized does not mean your inputs are already tokenized (you wouldn’t need to pass them through the tokenizer if that was the case) but just split into words.
> If you want to use pre-tokenized inputs, you **MUST** set **```is_split_into_words=True```** when passing your inputs to the tokenizer.

When you pass ```["Hello", "I'm", "a", "single", "sentence"]``` instead of the sentence ```"Hello I'm a single sentence"```, the list is called a **pre-tokenized input**.
```
encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
```
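
To make the distinction concrete, here is a toy sketch of what still happens to words that are already split: each word is broken further into subword pieces. This is not the library's implementation. The vocabulary `TOY_VOCAB` and the function `toy_wordpiece` are hypothetical and exist only for illustration.

```python
# Toy illustration only: a greedy longest-match subword splitter.
# TOY_VOCAB and toy_wordpiece are hypothetical, not part of Transformers.
TOY_VOCAB = {"hello", "i'm", "a", "single", "sent", "##ence", "[UNK]"}

def toy_wordpiece(word, vocab=TOY_VOCAB):
    """Split one lower-cased word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no vocabulary piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# "Pre-tokenized" input: the sentence is only split into words.
words = "Hello I'm a single sentence".split()

# The tokenizer still has to turn each word into subword tokens.
subwords = [toy_wordpiece(w.lower()) for w in words]
print(words)
print(subwords)
```

The word list matches the pre-tokenized example above, while `subwords` shows that a single word such as `sentence` can still map to more than one token.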

# Config file

When instantiating a model, you need to define the model initialization parameters, which are defined in the Transformers configuration classes. The base class is PretrainedConfig.

* [PretrainedConfig](https://huggingface.co/transformers/main_classes/configuration.html#pretrainedconfig)

> Base class for all configuration classes. Handles a few parameters common to all models’ configurations as well as methods for loading/downloading/saving configurations.

Each subclass has its own parameters. For instance, BERT pretrained models use BertConfig.

* [BertConfig](https://huggingface.co/transformers/model_doc/bert.html#transformers.BertConfig)

> This is the configuration class to store the configuration of a BertModel or a TFBertModel. It is used to instantiate a BERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the BERT bert-base-uncased architecture.



For instance, the ```num_labels``` parameter comes from [PretrainedConfig](https://huggingface.co/transformers/main_classes/configuration.html#transformers.PretrainedConfig):

> num_labels (int, optional) – Number of labels to use in the last layer added to the model, typically for a classification task.

```
TFBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
``` 

* [Training TFBertForSequenceClassification with custom X and Y data](https://stackoverflow.com/a/63295240/4281353)


The configuration file for the model ```bert-base-uncased``` is published at [Huggingface model - bert-base-uncased - config.json](https://huggingface.co/bert-base-uncased/blob/main/config.json).

```
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.6.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
```
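
A few quantities follow directly from these values. The sketch below parses a subset of the JSON above with the standard library and derives the per-head dimension, the feed-forward expansion ratio, and the size of the token embedding table. The variable names are illustrative only.

```python
import json

# Subset of the bert-base-uncased config.json shown above.
config_text = """
{
  "hidden_size": 768,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "vocab_size": 30522
}
"""
config = json.loads(config_text)

# Each attention head works on an equal slice of the hidden vector.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# The feed-forward layer expands the hidden vector by this factor.
ffn_ratio = config["intermediate_size"] // config["hidden_size"]

# Number of weights in the token embedding table (vocab_size x hidden_size).
token_embedding_params = config["vocab_size"] * config["hidden_size"]

print(head_dim, ffn_ratio, token_embedding_params)
```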


# Example

In [3]:
model_name = 'distilbert-base-uncased'
max_sequence_length = 256
num_labels = 2

In [21]:
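from transformers import DistilBertTokenizerFast

# truncation, padding, max_length and return_tensors are arguments of the
# tokenizer call, e.g. tokenizer(text, truncation=True, padding=True, ...),
# not of from_pretrained(), so they are not passed here.
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)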

In [16]:
base = TFDistilBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

# Input for the token IDs (the words from the dataset after encoding):
input_ids = tf.keras.layers.Input(shape=(max_sequence_length,), dtype=tf.int32, name='input_ids')

# attention_mask is a binary mask that tells the model which tokens to attend to.
# The tokenizer pads sequences shorter than max_sequence_length with 0 tokens;
# the mask is 1 for tokens from the original data and 0 for padding tokens:
attention_mask = tf.keras.layers.Input((max_sequence_length,), dtype=tf.int32, name='attention_mask')

# Use the previous inputs as the model inputs.
# Element [0] of the output holds the classification logits, shape (batch_size, num_labels):
output = base([input_ids, attention_mask])[0]

# We can also add dropout as a regularization technique:
#output = tf.keras.layers.Dropout(rate=0.15)(output)

output = tf.keras.layers.BatchNormalization()(output)

# Provide the number of classes to the final layer:
output = tf.keras.layers.Dense(num_labels, activation='softmax')(output)

# Final model:
model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output)

# The final layer already applies softmax, so the loss receives probabilities
# and from_logits must be False:
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer=tf.keras.optimizers.Adam()
)


Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_transform', 'activation_13', 'vocab_projector', 'vocab_layer_norm']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_39', 'pre_classifier', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use i
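
One detail in the cell above deserves care: the `from_logits` flag of the loss has to agree with the activation of the last layer. The pure-Python sketch below computes the sparse categorical cross-entropy for a single example and shows that applying softmax twice changes the loss value. That is, in effect, what happens when a softmax layer feeds a loss created with `from_logits=True`. The helper names are illustrative and are not Keras APIs.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(scores, label, from_logits):
    """Cross-entropy for one example; applies softmax first if from_logits is True."""
    probs = softmax(scores) if from_logits else scores
    return -math.log(probs[label])

logits = [2.0, 0.0]          # raw scores for two classes
label = 0                    # index of the true class
probs = softmax(logits)      # what a softmax output layer would emit

# Consistent pairings: logits with from_logits=True, probabilities with from_logits=False.
loss_from_logits = sparse_cross_entropy(logits, label, from_logits=True)
loss_from_probs = sparse_cross_entropy(probs, label, from_logits=False)

# Inconsistent pairing: probabilities are passed through softmax a second time.
loss_double_softmax = sparse_cross_entropy(probs, label, from_logits=True)

print(loss_from_logits, loss_from_probs, loss_double_softmax)
```

The two consistent pairings give the same loss, while the double-softmax variant reports a larger loss for the same prediction because the second softmax flattens the distribution.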



In [6]:
tf.keras.utils.plot_model(
    model, 
    show_shapes=True, 
    expand_nested=True, 
    show_dtype=True
)

('You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) ', 'for plot_model/model_to_dot to work.')


In [41]:
tokenized = tokenizer("a test sentence")
print(tokenized)

{'input_ids': [101, 1037, 3231, 6251, 102], 'attention_mask': [1, 1, 1, 1, 1]}
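
The output above is not padded. As a rough sketch of what padding to a fixed length does to these two fields, the hypothetical helper `pad_encoding` below appends the pad token ID (assumed to be 0, matching the `pad_token_id` in the config shown earlier) and extends the attention mask with zeros. It only mimics the idea and is not the tokenizer's code.

```python
def pad_encoding(encoding, max_length, pad_token_id=0):
    """Right-pad input_ids and attention_mask to max_length.

    Hypothetical helper for illustration. Longer inputs are cut naively;
    a real tokenizer would keep the final [SEP] token when truncating.
    """
    ids = encoding["input_ids"][:max_length]
    mask = encoding["attention_mask"][:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_token_id] * n_pad,   # pad tokens carry no content
        "attention_mask": mask + [0] * n_pad,        # 0 tells the model to ignore them
    }

example = {"input_ids": [101, 1037, 3231, 6251, 102], "attention_mask": [1, 1, 1, 1, 1]}
padded = pad_encoding(example, max_length=8)
print(padded)
```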


In [45]:
base(tokenized)

ValueError: You cannot specify both input_ids and inputs_embeds at the same time

In [18]:

for layer in base.layers:
    if layer.name == 'classifier':
        print(layer.input_shape)
        print(layer.output_shape)
        

AttributeError: The layer has never been called and thus has no defined input shape.

In [7]:
# Max length of the encoded string (including special tokens such as [CLS] and [SEP]):
MAX_SEQUENCE_LENGTH = 64

# Standard BERT model with lowercase characters only:
PRETRAINED_MODEL_NAME = 'bert-base-uncased'

# Batch size for fitting:
BATCH_SIZE = 16

# Number of epochs:
EPOCHS = 5

In [8]:
df = pd.read_csv('data.csv')
model = create_model(MAX_SEQUENCE_LENGTH, PRETRAINED_MODEL_NAME, df.target.nunique())

NameError: name 'pd' is not defined


# Fine-Tuning (Transfer Learning)

* [Training TFBertForSequenceClassification with custom X and Y data](https://stackoverflow.com/a/68171171/4281353)

For instance, use the [Sequence Classification](https://huggingface.co/transformers/task_summary.html#sequence-classification) capability of BERT for text classification by fine-tuning the pre-trained BERT model on the data provided.

* [Fine-tuning a pretrained model](https://huggingface.co/transformers/training.html)
> How to fine-tune a pretrained model from the Transformers library. In TensorFlow, models can be directly trained using Keras and the fit method. 

* [Fine-tuning with custom datasets](https://huggingface.co/transformers/custom_datasets.html)
> This tutorial will take you through several examples of using 🤗 Transformers models with your own datasets.<br>
> [Fine-tuning with native PyTorch/TensorFlow](https://huggingface.co/transformers/custom_datasets.html#fine-tuning-with-native-pytorch-tensorflow)
> ```
> from transformers import TFDistilBertForSequenceClassification
> 
> model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
> 
> optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
> model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
> model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)
> ```

* [HuggingFace Text classification examples](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/text-classification)
> This folder contains some scripts showing examples of text classification with the 🤗 Transformers library.

[run_text_classification.py](https://github.com/huggingface/transformers/blob/master/examples/tensorflow/text-classification/run_text_classification.py) is the example script for fine-tuning text classification with TensorFlow (https://huggingface.co/transformers/custom_datasets.html).