### Installs

We need **HuggingFace Transformers** and **Weights & Biases**.

In [None]:
# Install HuggingFace
!pip install transformers -q

In [None]:
# Install Weights and Biases
!pip install wandb -q

In [None]:
# # Tried this because .train wouldn't work
# pip install --upgrade tensorflow

Import wandb and log in

In [None]:
# Import wandb
import wandb

# Login with your authentication key - you'll need to have a wandb account to generate this
wandb.login()

In [None]:
# setup wandb environment variables
%env WANDB_ENTITY=mkoven
%env WANDB_PROJECT=testproject

### Pre-Processing using a Tokenizer with Hugging Face

In NLP, tokenizing a text block involves splitting it into words or subwords, which then are converted to IDs through a look-up table.

Each model has its own tokenizer to handle punctuation, etc. **That's why we need to import the correct tokenizer for the model of our choice.** Check out this well-written summary of tokenizers: https://huggingface.co/docs/transformers/tokenizer_summary

The **HuggingFace tokenizer** automatically downloads the vocabulary that was used when the given model was pretrained or fine-tuned, so we don't need to build our own vocabulary from the dataset for fine-tuning. The **AutoTokenizer.from_pretrained** method takes the name of the model and builds the appropriate tokenizer.
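
To make the look-up step concrete, here is a minimal sketch that uses a plain whitespace split and a made-up six-entry vocabulary. The vocabulary, the IDs, and the `toy_tokenize` helper are all hypothetical; the real DistilBERT tokenizer uses a subword algorithm and a much larger vocabulary.

```python
# Toy illustration only: a hypothetical vocabulary, not the real DistilBERT one.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "the": 2, "movie": 3, "was": 4, "great": 5}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, and map each word to an ID.

    Words that are missing from the vocabulary map to the [UNK] ID.
    """
    tokens = text.lower().split()
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return tokens, ids


tokens, ids = toy_tokenize("The movie was great fun")
print(tokens)  # ['the', 'movie', 'was', 'great', 'fun']
print(ids)     # [2, 3, 4, 5, 1]
```

The word "fun" is not in the toy vocabulary, so it falls back to the unknown-token ID. Subword tokenizers reduce how often this happens by splitting rare words into smaller known pieces.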




### Download and Prepare the Dataset

In this tutorial, we're using the IMDB dataset.

In [None]:
# !wget downloads it, !tar extracts it

!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

We might need some minor pre-processing, such as train-test splitting, separating text and labels, or merging text. In our case, the **read_imdb_split function** reads every review file and returns the texts and their labels (1 for positive, 0 for negative) as two lists.

In [None]:
from pathlib import Path
from sklearn.model_selection import train_test_split

In [None]:
# found this in another article https://huggingface.co/transformers/v4.11.3/custom_datasets.html

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

In [None]:
train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

# print(train_texts[:2])
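
As a quick sanity check of the directory layout the function expects, the sketch below builds a tiny fake dataset in a temporary directory and reads it back. The review texts and file names are made up, and `read_split` is a hypothetical copy of the function above with two assumptions added: files are read in sorted order and decoded as UTF-8.

```python
import tempfile
from pathlib import Path


def read_split(split_dir):
    """Same logic as read_imdb_split: label 1 for pos/, label 0 for neg/."""
    split_dir = Path(split_dir)
    texts, labels = [], []
    for label_dir in ["pos", "neg"]:
        # sorted() makes the order deterministic; iterdir() alone does not.
        for text_file in sorted((split_dir / label_dir).iterdir()):
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels


# Build a tiny fake dataset with the same pos/ and neg/ layout (made-up reviews).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "train"
    for label_dir, review in [("pos", "Loved it."), ("neg", "Dull and slow.")]:
        (root / label_dir).mkdir(parents=True)
        (root / label_dir / "0_1.txt").write_text(review, encoding="utf-8")
    demo_texts, demo_labels = read_split(root)

print(demo_texts)   # ['Loved it.', 'Dull and slow.']
print(demo_labels)  # [1, 0]
```

Because the function only visits `pos` and `neg`, any other sub-directory in the split folder (such as the unlabeled `unsup` folder in `aclImdb/train`) is ignored.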


We will also create a train-validation split.

In [None]:
# using sklearn's tt split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts,
                                                                  train_labels,
                                                                  test_size=.2)
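
For intuition, a shuffled hold-out split can be sketched with the standard library alone. This is not how scikit-learn implements `train_test_split`; the `holdout_split` helper, the seed, and the demo data are assumptions for illustration. With `test_size=.2` on the 25,000 IMDB training reviews, the split leaves 20,000 reviews for training and 5,000 for validation.

```python
import random


def holdout_split(texts, labels, test_size=0.2, seed=0):
    """Shuffle indices with a fixed seed, then hold out a test_size share."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_val = round(len(indices) * test_size)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return (
        [texts[i] for i in train_idx],
        [texts[i] for i in val_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in val_idx],
    )


# Made-up data: ten "reviews" whose label is the parity of their index.
demo_texts = [f"review {i}" for i in range(10)]
demo_labels = [i % 2 for i in range(10)]
tr_x, va_x, tr_y, va_y = holdout_split(demo_texts, demo_labels)
print(len(tr_x), len(va_x))  # 8 2
```

Fixing the seed keeps the split identical between runs, which makes validation numbers comparable across experiments. The scikit-learn call above has no `random_state`, so its split changes every time the cell is executed.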


### Tokenizing

The HuggingFace tokenizer will do the heavy lifting. We can either use AutoTokenizer, which under the hood calls the correct tokenization class for the given model name, or directly import the tokenizer class associated with the model (DistilBertTokenizerFast for DistilBERT in our case). Also note that tokenizers are available in two flavors: a full Python implementation and a “fast” implementation backed by the Rust-based tokenizers library.

In [None]:
from transformers import AutoTokenizer, DistilBertTokenizerFast

In [None]:
from transformers import AutoTokenizer, TFDistilBertModel
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = TFDistilBertModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
outputs = model(inputs)

last_hidden_states = outputs.last_hidden_state

In [None]:
MODEL_NAME = 'distilbert-base-uncased'

In [None]:
## Pick one to use, either auto or specific

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)

We will feed the sentences (text) to the tokenizer, which returns the encoded text (tokens converted to IDs).

Learn about truncation and padding arguments here: https://huggingface.co/docs/transformers/preprocessing#everything-you-always-wanted-to-know-about-padding-and-truncation
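
The effect of the two arguments can be sketched on plain lists of IDs. This is a simplified toy, not the library's implementation: the `pad_and_truncate` helper and the ID values are hypothetical, and unlike a real tokenizer it does not preserve special tokens such as [SEP] when it cuts a sequence.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad to the longest one left.

    Returns input IDs plus an attention mask (1 = real token, 0 = padding).
    """
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[5, 6, 7, 8, 9, 10], [5, 6, 7]]  # made-up token IDs
encoded = pad_and_truncate(batch, max_length=4)
print(encoded["input_ids"])       # [[5, 6, 7, 8], [5, 6, 7, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

With `truncation=True` and no explicit `max_length`, the tokenizer cuts at the model's maximum input length (512 tokens for distilbert-base-uncased), and `padding=True` pads every sequence to the longest one in the batch.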

In [None]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

Next, create a structured Dataset from a set of input encodings (train_encodings) and corresponding labels (train_labels).

Create a TF Dataset if you are using the TensorFlow backend to fine-tune the HuggingFace transformer. With PyTorch, create a PyTorch DataLoader instead.

In [None]:
import tensorflow as tf

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))

In [None]:
# added this later since a function needed it at the end
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))

## HuggingFace Transformer Models

The HuggingFace Transformer models are compatible with native PyTorch and TensorFlow 2.x. Models are standard torch.nn.Module or tf.keras.Model depending on the prefix of the model class name. If it begins with TF then it's a tf.keras.Model. Note that tokenizers are framework agnostic.

The easiest way to download a pre-trained Transformer model is to use the appropriate AutoModel class (TFAutoModelForSequenceClassification in our case).

You can find the list of pre-trained models here: https://huggingface.co/transformers/pretrained_models.html

We will import TFDistilBertForSequenceClassification since we are fine-tuning a DistilBERT transformer. This downloads the pre-trained encoder weights and adds a classification head on top.


In [None]:
# Import required model class
from transformers import TFDistilBertForSequenceClassification

# Download pre-trained model
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)

Suppose we need the output layer (head) to have 3 neurons. We can get this by passing num_labels=3 to from_pretrained. This creates a DistilBERT model instance (in our case) with encoder weights copied from the distilbert-base-uncased checkpoint and a randomly initialized sequence classification head of output size 3 on top of the encoder.

In [None]:
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME,
                                                              num_labels=3)
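
With num_labels=3 the head emits three raw scores (logits) per input. The sketch below shows how such logits are commonly turned into probabilities and a predicted class. The logit values are made up, and the `softmax` helper is written by hand here rather than taken from the library.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]  # made-up output of a 3-class head for one input
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=lambda i: probs[i])

print([round(p, 3) for p in probs])
print(predicted_class)  # 0
```

Since the head starts from random weights, these probabilities are meaningless until the model has been fine-tuned.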

We can also ask the model to return all hidden states and all attention weights if we need them.

In [None]:

model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME,
                                                              output_hidden_states=True,
                                                              output_attentions=True)

We can change how the model itself is built by defining a custom configuration class.

Each architecture comes with its own relevant configuration (in the case of DistilBERT, DistilBertConfig) which allows us to specify any of the hidden dimensions, dropout rate, etc.

**However, by doing so we will have to train the model from scratch. (Not covered in this tutorial.)**

In [None]:
from transformers import DistilBertConfig
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
model = TFDistilBertForSequenceClassification(config)
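
When picking custom sizes, the model dimension has to be divisible by the number of attention heads, because each head works on an equal slice of it. A small sketch of that arithmetic for the values used above; the `head_dim` helper is hypothetical and only mirrors the constraint, it is not part of the library.

```python
def head_dim(dim, n_heads):
    """Return the per-head width, or raise if dim cannot be split evenly."""
    if dim % n_heads != 0:
        raise ValueError(f"dim={dim} is not divisible by n_heads={n_heads}")
    return dim // n_heads


# Values from the custom config above.
print(head_dim(dim=512, n_heads=8))    # 64
# Defaults of distilbert-base-uncased for comparison.
print(head_dim(dim=768, n_heads=12))   # 64

# The feed-forward width in both cases is four times the model dimension.
print(4 * 512, 4 * 768)                # 2048 3072
```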

In [None]:
# Import required model class
from transformers import TFDistilBertForSequenceClassification


# Download pre-trained model
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)


# Fine-tuning
## Feature Complete Trainer / TFTrainer

You can fine-tune a HuggingFace Transformer using both native PyTorch and TensorFlow 2.

HuggingFace provides a simple but feature-complete training and evaluation interface through **Trainer()/TFTrainer().**

We can train, fine-tune, and evaluate any HuggingFace Transformers model with a wide range of training options and with built-in features like metric logging, gradient accumulation, and mixed precision.


### But first, training arguments.

Before instantiating Trainer/TFTrainer, we need to create a TrainingArguments/TFTrainingArguments object **to access all the points of customization during training**.

Some notable arguments are:
* per_device_train_batch_size: The batch size per GPU/TPU core/CPU for training.

* gradient_accumulation_steps: Number of update steps to accumulate gradients for before performing a backward/update pass.

* learning_rate: The initial learning rate for Adam.

* weight_decay: The weight decay to apply (if not zero).

* num_train_epochs: Total number of training epochs to perform.

* run_name:  A descriptor for the run used for Weights and Biases logging.

Learn more about these args here: https://huggingface.co/transformers/main_classes/trainer.html?highlight=tftrainingarguments#tftrainingarguments
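
To see how these arguments interact, the sketch below estimates the effective batch size and the number of optimizer updates. It is approximate arithmetic under stated assumptions (one device, the last partial batch is kept), not the trainer's own bookkeeping, and the `training_steps` helper is hypothetical.

```python
import math


def training_steps(num_examples, per_device_batch_size, num_devices=1,
                   gradient_accumulation_steps=1, num_train_epochs=3):
    """Return (effective batch size, updates per epoch, total updates)."""
    effective_batch = (
        per_device_batch_size * num_devices * gradient_accumulation_steps
    )
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * num_train_epochs


# 20,000 training reviews (25,000 minus the 20% validation split), batch size 16.
print(training_steps(20_000, 16))                                 # (16, 1250, 3750)
# Accumulating over 4 steps quadruples the effective batch size.
print(training_steps(20_000, 16, gradient_accumulation_steps=4))  # (64, 313, 939)
```

Gradient accumulation is a way to reach a larger effective batch size when a bigger per-device batch does not fit in memory.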

If you are using a PyTorch DataLoader, use TrainingArguments instead. You can learn more about its arguments here: https://huggingface.co/docs/transformers/main_classes/trainer?highlight=tftrainingarguments#trainingarguments. Note that the PyTorch Trainer offers some additional features, such as early stopping (through a callback) and label smoothing.



In [None]:
from transformers import TFTrainer, TFTrainingArguments

In [None]:
training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=2,                  # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)
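
The role of warmup_steps can be illustrated with a simple schedule: the learning rate ramps up linearly during warmup and then decays linearly to zero. This is a sketch of the general idea, and the exact schedule the trainer builds may differ. The `lr_at_step` helper and the total of 3,750 steps (20,000 examples, batch size 16, 3 epochs) are assumptions, and 5e-5 is the default value of learning_rate.

```python
def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_steps=2):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(total_steps - warmup_steps, 1)
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps


TOTAL_STEPS = 3750  # assumed: 1250 updates per epoch for 3 epochs
for step in [0, 1, 2, 1876, TOTAL_STEPS]:
    print(step, lr_at_step(step, TOTAL_STEPS))
```

With warmup_steps=2 the peak rate is reached after only two updates, so warmup has almost no effect here. Larger values are a common choice when the first updates are unstable.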

In [None]:
trainer = TFTrainer(
    model=model,                     # the instantiated HF Transformers model to be trained
    args=training_args,              # training arguments, defined above
    train_dataset=train_dataset,     # training dataset
    eval_dataset=val_dataset         # evaluation dataset
)

In [None]:
trainer.train()

In [None]:
trainer.evaluate()
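
The evaluation output typically centers on the loss, while accuracy is often easier to interpret for sentiment classification. A minimal sketch of computing accuracy from logits and labels is shown below. The logits and labels are made up, the `accuracy_from_logits` helper is hypothetical, and wiring it into the trainer (for example as a metrics function) is left as an assumption.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the labels."""
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four reviews: column 0 = negative, column 1 = positive.
demo_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
demo_labels = [1, 0, 0, 0]
accuracy = accuracy_from_logits(demo_logits, demo_labels)
print(accuracy)  # 0.75
```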