<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_11_5_hf_train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# T81-558: Applications of Deep Neural Networks
**Module 11: Natural Language Processing with Hugging Face**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 11 Material

* Part 11.1: Introduction to Hugging Face [[Video]](https://www.youtube.com/watch?v=PzuL84ksRuE&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_11_1_hf.ipynb)
* Part 11.2: Hugging Face in Python [[Video]](https://www.youtube.com/watch?v=tkGIF4CFoV4&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_11_2_py_huggingface.ipynb)
* Part 11.3: Hugging Face Tokenizers [[Video]](https://www.youtube.com/watch?v=Cz2nvfK28eI&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_11_3_tokenizers.ipynb)
* Part 11.4: Hugging Face Datasets [[Video]](https://www.youtube.com/watch?v=yLlCZLzE2XU&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_11_4_hf_datasets.ipynb)
* **Part 11.5: Training Hugging Face Models** [[Video]](https://www.youtube.com/watch?v=7YZOik5S3vs&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_11_5_hf_train.ipynb)


# Google CoLab Instructions

The following code checks whether this notebook is running in Google CoLab.

In [None]:
try:
    from google.colab import drive
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# Part 11.5: Training Hugging Face Models

Up to this point, we've used data and models from the Hugging Face hub unmodified. In this section, we will fine-tune a pretrained Hugging Face model, which is a form of transfer learning. We will use Hugging Face data sets, tokenizers, and pretrained models to achieve this training.

We begin by installing the Hugging Face transformers library if needed. It is also essential to install the Hugging Face datasets library.


In [None]:
# HIDE OUTPUT
!pip install transformers
!pip install transformers[torch]
!pip install transformers[sentencepiece]
!pip install datasets
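
Because the install step is only required when the libraries are missing, it can help to check what is already present. The following is a minimal sketch that uses only the Python standard library. The helper name `missing_packages` is hypothetical and is not part of Hugging Face.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be located as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages used in this part; anything listed in the output still needs pip install.
required = ["transformers", "datasets", "torch"]
print("Missing packages:", missing_packages(required))
```

An empty list means all three libraries can already be imported in the current environment.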

Next, we load the emotion data set from the Hugging Face hub. Emotion is a dataset of English Twitter messages, each labeled with one of six basic emotions: anger, fear, joy, love, sadness, and surprise.

In [None]:
# HIDE OUTPUT
from datasets import load_dataset

emotions = load_dataset("emotion")


You can see a single observation from the training data set here. This observation includes both the text sample and the assigned emotion label. The label is a numeric index representing the assigned emotion.

In [None]:
emotions['train'][2]

We can display the features of the data set, which list the emotion label names in the order of their numeric indexes.

In [None]:
emotions['train'].features
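
Since each label is stored as a numeric index, it is convenient to translate between indexes and names. The `ClassLabel` feature in the datasets library provides `int2str` and `str2int` for this purpose. The sketch below shows the same idea in plain Python. The label order used here is an assumption and should be confirmed against the features output above; `decode_label` is a hypothetical helper.

```python
# Assumed label order; confirm it against emotions['train'].features.
LABEL_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Lookup tables in both directions.
id2label = dict(enumerate(LABEL_NAMES))
label2id = {name: idx for idx, name in id2label.items()}


def decode_label(label_id):
    """Translate a numeric label index into its emotion name."""
    return id2label[label_id]


print(decode_label(3))
print(label2id["joy"])
```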

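Next, we use Hugging Face tokenizers and data sets together. The following code tokenizes the entire emotion data set. Because the code requests `padding="max_length"` and `truncation=True`, every row is padded or cut to the maximum sequence length of the tokenizer (512 tokens for DistilBERT). You can see below that the code has transformed the training set into subword tokens that are now ready to be used in conjunction with a transformer for either inference or training.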

In [None]:
# HIDE OUTPUT
from transformers import AutoTokenizer

def tokenize(rows):
    return tokenizer(rows['text'], padding="max_length", truncation=True)

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# Reset any custom output format so that map() receives plain Python objects
emotions.set_format(type=None)

tokenized_datasets = emotions.map(tokenize, batched=True)
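
To make the effect of `padding="max_length"` and `truncation=True` concrete, the following sketch applies both operations to a list of token ids. It is a simplification under stated assumptions: the helper `pad_or_truncate` is hypothetical, the padding id of 0 is assumed, the token ids are invented, and a real tokenizer also preserves the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Force a list of token ids to exactly max_length entries.

    Returns the adjusted ids and an attention mask that is 1 for real
    tokens and 0 for padding positions.
    """
    kept = token_ids[:max_length]        # effect of truncation=True
    n_pad = max_length - len(kept)       # effect of padding="max_length"
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Invented token ids for illustration, not real tokenizer output.
ids, mask = pad_or_truncate([101, 7592, 2088, 102], max_length=8)
print(ids)
print(mask)
```

The attention mask is what lets the model ignore the padding positions, which is why the tokenized data set contains both `input_ids` and `attention_mask` columns.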


Next, we import the needed classes from PyTorch and Hugging Face.

In [None]:
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    default_data_collator,
    TrainingArguments,
    Trainer
)

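Now we generate shuffled training and evaluation data sets. Despite the `small_` prefix in the variable names, this code keeps every row of the train and test splits.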

In [None]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42)
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42)
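
Passing a fixed seed makes the shuffle reproducible from run to run. The sketch below illustrates that idea with the Python standard library. It is only an analogy: the datasets library uses its own random generator, so the actual row order will differ, and `shuffled_indices` is a hypothetical helper.

```python
import random


def shuffled_indices(n_rows, seed):
    """Return a reproducible permutation of the row indexes 0..n_rows-1."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # private generator with a fixed seed
    return indices


first = shuffled_indices(10, seed=42)
second = shuffled_indices(10, seed=42)
print(first == second)                    # same seed gives the same order
print(sorted(first) == list(range(10)))   # every row still appears exactly once
```

To train on a genuinely smaller subset, `Dataset.select` could be chained after the shuffle, for example `.select(range(1000))`; the size of 1000 is an arbitrary illustration.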


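We can now set up PyTorch data loaders for the training and evaluation data sets. The Trainer used later in this section builds its own data loaders from the data sets it receives, so these loaders are only needed if you write a custom training loop.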

In [None]:
# Create PyTorch DataLoaders
train_dataloader = DataLoader(
    small_train_dataset,
    sampler=RandomSampler(small_train_dataset),
    batch_size=8,
    collate_fn=default_data_collator,
)

eval_dataloader = DataLoader(
    small_eval_dataset,
    sampler=SequentialSampler(small_eval_dataset),
    batch_size=8,
    collate_fn=default_data_collator,
)
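
The batch size determines how many batches each data loader produces in one pass over the data. The sketch below works through that arithmetic. The split sizes of 16,000 training rows and 2,000 test rows are assumptions that you can verify with `len(small_train_dataset)` and `len(small_eval_dataset)`; `batches_per_epoch` is a hypothetical helper.

```python
import math


def batches_per_epoch(n_rows, batch_size, drop_last=False):
    """Number of batches a data loader yields in one pass over n_rows."""
    if drop_last:
        return n_rows // batch_size       # an incomplete final batch is discarded
    return math.ceil(n_rows / batch_size)  # an incomplete final batch is kept


# Assumed split sizes; check them with len() on the actual data sets.
print(batches_per_epoch(16000, 8))
print(batches_per_epoch(2000, 8))
```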

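We will now load the DistilBERT model for sequence classification. The `num_labels=6` argument gives the model a classification head with one output per emotion. We will fine-tune the pretrained weights to predict the emotion of each text sample.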

In [None]:
# Instantiate model
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=6
)
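
With `num_labels=6`, the model produces six raw scores (logits) for each text sample, one per emotion. The sketch below shows how such scores can be turned into probabilities and a predicted label. The logits are invented for illustration, the label order is an assumption to be checked against the data set features, and `softmax` and `predict_label` are hypothetical helpers written in plain Python.

```python
import math

# Assumed label order; confirm it against emotions['train'].features.
LABEL_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return (label name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_NAMES[best], probs[best]


# Invented logits for illustration only, not real model output.
name, prob = predict_label([0.1, 2.5, 0.3, -1.0, -0.5, 0.0])
print(name, round(prob, 3))
```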

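We now train the neural network. Because the network is already pretrained, we use a small learning rate. The code below leaves `learning_rate` at the `TrainingArguments` default of 5e-5, which is small enough for fine-tuning.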

In [None]:
# Set up Trainer and TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    data_collator=default_data_collator,
)

# Train the model
trainer.train()
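
The `warmup_steps=500` setting means the learning rate does not start at its full value. The sketch below approximates the resulting schedule, assuming the Trainer defaults of a 5e-5 base learning rate and a linear scheduler. The total of 10,000 steps is also an assumption (16,000 training rows, batch size 8, 5 epochs, one device), and `linear_schedule_lr` is a hypothetical helper rather than a library function.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=10000):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr during the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 5250, 10000):
    print(step, linear_schedule_lr(step))
```

The rate climbs to its peak at step 500 and then falls steadily, so the largest weight updates happen early in training, after the newly added classification head has had a few hundred steps to settle.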