
<a href="https://colab.research.google.com/github/wandb/examples/blob/master/colabs/huggingface/Huggingface_wandb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{huggingface_wandb} -->

<img src="https://i.imgur.com/vnejHGh.png" width="800">

<!--- @wandbcode{huggingface_tables} -->

# 🏃‍♀️ Introduction
[Hugging Face](https://huggingface.co/) provides tools to quickly train neural networks for NLP (Natural Language Processing) on any task (classification, translation, question answering, etc.) and any dataset with PyTorch and TensorFlow 2.0.

## 🤔 Why should I use W&B?

<img src="https://wandb.me/mini-diagram" width="650">

- **Unified dashboard**: Central repository for all your model metrics and predictions
- **Lightweight**: No code changes required to integrate with Hugging Face
- **Accessible**: Free for individuals and academic teams
- **Secure**: All projects are private by default
- **Trusted**: Used by machine learning teams at OpenAI, Toyota, Lyft and more

Think of W&B like GitHub for machine learning models: save machine learning experiments to your private, hosted dashboard. Experiment quickly with the confidence that all the versions of your models are saved for you, no matter where you're running your scripts.

W&B's lightweight integrations work with any Python script, and all you need to do is sign up for a free W&B account to start tracking and visualizing your models.

In the Hugging Face Transformers repo, we've instrumented the Trainer to automatically log training and evaluation metrics to W&B at each logging step.

Here's an in-depth look at how the integration works: [Hugging Face + W&B Report](https://app.wandb.ai/jxmorris12/huggingface-demo/reports/Train-a-model-with-Hugging-Face-and-Weights-%26-Biases--VmlldzoxMDE2MTU).

# 🌴 Installation and Setup

First, let us install the latest version of Weights & Biases. We will then set up a few environment variables to enable Weights & Biases logging and finally authenticate this Colab instance to use W&B.

**Note**: To enable logging to W&B, you will also need to set the `report_to` argument in your `TrainingArguments` or script to `wandb`.

In [None]:
# Install required transformer libraries along with wandb
! pip install -qqq evaluate datasets wandb git+https://github.com/huggingface/transformers

In [1]:
# Set up environment variables to enable logging to Weights & Biases

import os
os.environ['WANDB_LOG_MODEL'] = "checkpoint" # can be "end", "checkpoint", ""
os.environ['WANDB_WATCH'] = "all"
os.environ['WANDB_PROJECT'] = "hf_transformers"
os.environ["WANDB_DISABLED"] = "false"
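The integration reads these settings from the environment, so a misspelled variable name is simply ignored rather than reported. As a minimal sketch, you could sanity-check the configuration before training. The `check_wandb_env` helper and its set of allowed values are hypothetical: they are drawn only from the values listed in the comment in the cell above, not from the wandb or transformers API.

```python
import os

# Allowed values taken from the comment in the cell above; this helper is a
# hypothetical convenience, not part of wandb or transformers.
ALLOWED_LOG_MODEL = {"end", "checkpoint", ""}


def check_wandb_env(env):
    """Return a list of human-readable problems found in the W&B settings."""
    problems = []
    if env.get("WANDB_LOG_MODEL", "") not in ALLOWED_LOG_MODEL:
        problems.append("WANDB_LOG_MODEL should be 'end', 'checkpoint' or ''")
    if not env.get("WANDB_PROJECT"):
        problems.append("WANDB_PROJECT is not set")
    if env.get("WANDB_DISABLED", "false").lower() == "true":
        problems.append("WANDB_DISABLED is 'true', so nothing will be logged")
    return problems


settings = {
    "WANDB_LOG_MODEL": "checkpoint",
    "WANDB_WATCH": "all",
    "WANDB_PROJECT": "hf_transformers",
    "WANDB_DISABLED": "false",
}
os.environ.update(settings)
print(check_wandb_env(os.environ))  # []
```

An empty list means the settings match what this notebook expects; a typo such as `"chekpoint"` would show up as a listed problem.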


## 🖊️ Sign-up/login
If this is your first time using Weights & Biases or you are not logged in, the link that appears after running `wandb.login()` in the following code cell will take you to the sign-up/login page. Signing up for a [free account](https://wandb.ai/signup) takes only a few clicks.

## 🔑 Authentication
Once you've signed up, run the next cell and click on the link to get your API key and authenticate this notebook.

In [3]:
# Login and authenticate Weights & Biases
import wandb
wandb.login()

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: parambharat. Use `wandb login --relogin` to force relogin


True

# Task

Text classification is a common NLP task that assigns a label or class to text. Some of the largest companies run text classification in production for a wide range of practical applications. In this example, we will use the emotion subset of the [TweetEval](https://arxiv.org/abs/2010.12421) dataset, a benchmark for tweet classification tasks, to identify the emotion evoked by a tweet. We will fine-tune a distilled version of the RoBERTa model, [distilroberta-base](https://huggingface.co/distilroberta-base), to recognize these emotions.

# Data

## Loading the data
Start by loading the tweet_eval dataset from the 🤗 Datasets library:

In [4]:
from datasets import load_dataset

dataset = load_dataset("tweet_eval", "emotion")




## Understanding the dataset

In [5]:
# What does the dataset look like?
print(dataset)

# look at an example record
print("\nSample Record:", end="\t")
print(dataset["validation"][0])

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 3257
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1421
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 374
    })
})

Sample Record:	{'text': '@user @user Oh, hidden revenge and anger...I rememberthe time,she rebutted you.', 'label': 0}


There are two fields in this dataset: 

- `text`: The text of the tweet.
- `label`: The integer label of the emotion corresponding to the tweet

In [6]:
# What do the labels mean?
idx2label = dict(enumerate(dataset["train"].features["label"].names))
label2idx = {v:k for k,v in idx2label.items()}

print(idx2label)

{0: 'anger', 1: 'joy', 2: 'optimism', 3: 'sadness'}
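These two dictionaries are what later let the model config translate between class indices and label names. A small sketch of how they are used, with the label names copied from the output above and made-up class indices (the `predicted_ids` list is illustrative, not model output):

```python
# Label names copied from the output above.
names = ["anger", "joy", "optimism", "sadness"]

idx2label = dict(enumerate(names))
label2idx = {label: idx for idx, label in idx2label.items()}

# Hypothetical class indices, standing in for model predictions.
predicted_ids = [1, 3, 0]
predicted_labels = [idx2label[i] for i in predicted_ids]
print(predicted_labels)  # ['joy', 'sadness', 'anger']

# The two mappings are inverses of each other.
assert all(label2idx[idx2label[i]] == i for i in idx2label)
```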


## Preprocessing

We need to convert the `text` to integer tokens so that they can be passed into the model as inputs. To do this, we will use the `distilroberta` tokenizer to preprocess the `text` field in the dataset.

In [7]:
from transformers import AutoTokenizer
MODEL_NAME = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Create a preprocessing function to tokenize `text` and truncate sequences to be no longer than distilroberta's maximum input length:

In [8]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:

In [9]:
tokenized_ds = dataset.map(preprocess_function, batched=True)
tokenized_ds




DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 3257
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1421
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 374
    })
})

The above step added two new columns to our dataset: `input_ids` and `attention_mask`. These are the inputs we will pass to our model.

Since our examples have different lengths and the model expects a batch of token sequences of the same length, we need to pad our inputs. We can use the `DataCollatorWithPadding` utility to do this. To further speed up training, we pre-compute the length of each tokenized text and sort the dataset by this column. Examples of similar length then end up in the same batch, so each batch needs as little padding as possible.

In [10]:
def length_function(examples):
    return {"length": [len(example) for example in examples["input_ids"]]}

tokenized_ds = tokenized_ds.map(length_function, batched=True)
tokenized_ds = tokenized_ds.sort("length")






Now create a batch of examples using [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [11]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
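To make the effect of sorting and dynamic padding concrete, here is a plain-Python sketch. It does not use `DataCollatorWithPadding`; the `pad_batch` and `count_padding` helpers and the token sequences are illustrative assumptions. It pads each batch to its own longest sequence and counts how many padding tokens are needed with and without sorting by length:

```python
PAD_ID = 1  # pad_token_id from the model config shown later in this notebook


def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence to the longest one in this batch and build attention masks."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def count_padding(sequences, batch_size):
    """Total number of pad positions when batching `sequences` in the given order."""
    total = 0
    for start in range(0, len(sequences), batch_size):
        padded = pad_batch(sequences[start:start + batch_size])
        total += sum(row.count(0) for row in padded["attention_mask"])
    return total


# Made-up token sequences with lengths 2, 9, 3, 8, 4, 10.
sequences = [[7] * n for n in (2, 9, 3, 8, 4, 10)]

unsorted_pad = count_padding(sequences, batch_size=2)
sorted_pad = count_padding(sorted(sequences, key=len), batch_size=2)
print(unsorted_pad, sorted_pad)  # 18 6
```

In this toy case, sorting by length cuts the padding from 18 positions to 6, which is the saving the `length` column is meant to provide.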

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [f1-score](https://huggingface.co/spaces/evaluate-metric/f1) metric. The TweetEval benchmark reports F1 for this task.
You will notice that this metric gets logged automatically to your Weights & Biases run while training.

In [12]:
import evaluate

f1_score = evaluate.load("f1")

In [13]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return f1_score.compute(predictions=predictions, references=labels, average="weighted")

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
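For intuition about what `average="weighted"` computes, the sketch below reimplements it in plain Python: per-class F1 scores, averaged with weights proportional to each class's support in the references. The `argmax` and `weighted_f1` helpers and the toy logits are illustrative assumptions; in the notebook itself the score comes from the 🤗 Evaluate metric.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def weighted_f1(predictions, references):
    """Per-class F1, averaged with weights equal to each class's share of the references."""
    total = 0.0
    for cls in set(references):
        tp = sum(p == cls and r == cls for p, r in zip(predictions, references))
        fp = sum(p == cls and r != cls for p, r in zip(predictions, references))
        fn = sum(p != cls and r == cls for p, r in zip(predictions, references))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = tp + fn  # number of reference examples with this class
        total += f1 * support / len(references)
    return total


# Toy logits for four examples and two classes (not real model output).
logits = [[2.0, 0.1], [0.3, 1.5], [1.2, 0.4], [0.2, 0.9]]
labels = [0, 1, 1, 1]

predictions = [argmax(row) for row in logits]
print(predictions)                                 # [0, 1, 0, 1]
print(round(weighted_f1(predictions, labels), 4))  # 0.7667
```

Class 0 has F1 = 2/3 with support 1, and class 1 has F1 = 0.8 with support 3, so the weighted score is 0.25 × 2/3 + 0.75 × 0.8 ≈ 0.7667.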

## Train

In [14]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(idx2label), id2label=idx2label, label2id=label2idx,
    attention_probs_dropout_prob=0.2, hidden_dropout_prob=0.3
)

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 

At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir`, which specifies where to save your model. You'll also add the `report_to="wandb"` argument here. At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the F1 score on the validation set and save the training checkpoint. These metrics and checkpoints are automatically pushed to your wandb project.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [15]:
training_args = TrainingArguments(
    output_dir="my_emotion_model",
    learning_rate=2e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=25,
    load_best_model_at_end=True,
    warmup_steps=50,
    save_total_limit=2,
    report_to="wandb",  # log metrics and model checkpoints to Weights & Biases
)
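The step counts that appear in the training log and in the checkpoint directory names follow from simple arithmetic on the dataset size and batch size. A rough sketch, assuming a single device and no gradient accumulation (the `steps_per_epoch` helper is hypothetical; the numbers are the ones reported in this notebook's log output):

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps in one pass over the data (the last partial batch still counts)."""
    return math.ceil(num_examples / batch_size)


per_epoch = steps_per_epoch(num_examples=3257, batch_size=128)
print(per_epoch)       # 26  -> the first checkpoint is saved as checkpoint-26
print(per_epoch * 10)  # 260 -> total optimization steps for the 10 epochs in the log
print(per_epoch * 4)   # 104 -> global step reached after 4 epochs
```

With `save_strategy="epoch"`, checkpoints therefore land on multiples of 26, such as `checkpoint-104` and `checkpoint-130`.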

In [16]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: length, text. If length, text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3257
  Num Epochs = 10
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 260
  Number of trainable parameters = 82121476
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | 1.4588 | 1.4452 | 0.010429 |
| 2 | 1.4307 | 1.412658 | 0.010429 |
| 3 | 1.3786 | 1.35793 | 0.256364 |
| 4 | 1.2854 | 1.258707 | 0.256364 |


The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: length, text. If length, text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 374
  Batch size = 128
Saving model checkpoint to my_emotion_model/checkpoint-26
Configuration saved in my_emotion_model/checkpoint-26/config.json
Model weights saved in my_emotion_model/checkpoint-26/pytorch_model.bin
tokenizer config file saved in my_emotion_model/checkpoint-26/tokenizer_config.json
Special tokens file saved in my_emotion_model/checkpoint-26/special_tokens_map.json
Deleting older checkpoint [my_emotion_model/checkpoint-104] due to args.save_total_limit
Logging checkpoint artifacts in checkpoint-26. ...
wandb: Adding directory to artifact (./my_emotion_model/checkpoint-26)... Done. 4.2s
The following columns in the evaluation set don

KeyboardInterrupt: ignored

We can visualize the training logs by looking at the `wandb.run` object.

In [17]:
wandb.run

Finally, we can optionally call the `wandb.finish()` method to indicate that the experiment is complete.

In [18]:
wandb.finish()

| Metric | History |
|---|---|
| eval/f1 | ▁▁██ |
| eval/loss | █▇▅▁ |
| eval/runtime | ▁▃█▁ |
| eval/samples_per_second | █▅▁█ |
| eval/steps_per_second | █▅▁█ |
| train/epoch | ▁▂▂▂▃▄▄▅▅▆▆▇▇██ |
| train/global_step | ▁▂▂▂▃▄▄▅▅▆▆▇▇██ |
| train/learning_rate | ▁▂▂▃▄▅▅▆▇▇█ |
| train/loss | ▇██▇▇▆▅▄▃▂▁ |

| Metric | Value |
|---|---|
| eval/f1 | 0.25636 |
| eval/loss | 1.25871 |
| eval/runtime | 0.4833 |
| eval/samples_per_second | 773.875 |
| eval/steps_per_second | 6.208 |
| train/epoch | 4.23 |
| train/global_step | 110.0 |
| train/learning_rate | 0.0 |
| train/loss | 1.2641 |


## Resuming Training

Since we are training the model on Colab, it's entirely possible that the preemptible instance was shut down midway and the model was not fully trained. Don't worry: the Weights & Biases integration has us covered. We can resume training from the last checkpoint as follows.

In [19]:
run = wandb.init(
    project=os.environ["WANDB_PROJECT"],
    id="woawjszn", # fetch the run_id from the wandb workspace
    resume="must",
    )


In [21]:
# fetch the checkpoint artifact from the run
artifact = run.use_artifact("parambharat/hf_transformers/checkpoint-woawjszn:v3", type="model")
artifact_dir = artifact.download()
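The artifact references used in this notebook follow a visible pattern: `<entity>/<project>/checkpoint-<run_id>:<version>` for checkpoints, and `model-<run_id>` for the final model used in the inference section. As a sketch based only on the names that appear here, you could build the reference from its parts instead of hard-coding it. The `artifact_ref` helper is hypothetical, and other versions of wandb or transformers may name artifacts differently.

```python
def artifact_ref(entity, project, run_id, kind="checkpoint", version="latest"):
    """Build an artifact reference string like the ones used in this notebook."""
    if kind not in {"checkpoint", "model"}:
        raise ValueError("kind must be 'checkpoint' or 'model'")
    return f"{entity}/{project}/{kind}-{run_id}:{version}"


print(artifact_ref("parambharat", "hf_transformers", "woawjszn", version="v3"))
# parambharat/hf_transformers/checkpoint-woawjszn:v3
```

The resulting string can be passed to `run.use_artifact(...)` in place of the literal above.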

wandb: Downloading large artifact checkpoint-woawjszn:v3, 943.13MB. 12 files...
wandb:   12 of 12 files downloaded.
Done. 0:0:3.3


In [22]:
# reinitialize the trainer object

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [23]:
trainer.train(resume_from_checkpoint=artifact_dir)

Loading model from ./artifacts/checkpoint-woawjszn:v3.
The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: length, text. If length, text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3257
  Num Epochs = 10
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 260
  Number of trainable parameters = 82121476
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 4
  Continuing training from global step 104
  Will skip the first 4 epochs then the first 0 batches in the first epoch. If this takes a lot of time, you can add the `--ignore_data_skip` flag to your launch command, but you will resume the training on data already se


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 5 | 1.2412 | 1.160764 | 0.256364 |
| 6 | 1.1236 | 0.986982 | 0.546248 |
| 7 | 1.013 | 0.901135 | 0.611501 |
| 8 | 0.9721 | 0.848417 | 0.648415 |
| 9 | 0.905 | 0.823327 | 0.65957 |
| 10 | 0.7969 | 0.778148 | 0.710744 |


The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: length, text. If length, text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 374
  Batch size = 128
Saving model checkpoint to my_emotion_model/checkpoint-130
Configuration saved in my_emotion_model/checkpoint-130/config.json
Model weights saved in my_emotion_model/checkpoint-130/pytorch_model.bin
tokenizer config file saved in my_emotion_model/checkpoint-130/tokenizer_config.json
Special tokens file saved in my_emotion_model/checkpoint-130/special_tokens_map.json
Deleting older checkpoint [my_emotion_model/checkpoint-78] due to args.save_total_limit
Logging checkpoint artifacts in checkpoint-130. ...
wandb: Adding directory to artifact (./my_emotion_model/checkpoint-130)... Done. 4.2s
The following columns in the evaluation s

TrainOutput(global_step=260, training_loss=0.614474973311791, metrics={'train_runtime': 252.8882, 'train_samples_per_second': 128.792, 'train_steps_per_second': 1.028, 'total_flos': 481208351771448.0, 'train_loss': 0.614474973311791, 'epoch': 10.0})

In [24]:
wandb.run

In [25]:
wandb.finish()


| Metric | History |
|---|---|
| eval/f1 | ▁▅▆▇▇█ |
| eval/loss | █▅▃▂▂▁ |
| eval/runtime | ▄▂█▁▇▇ |
| eval/samples_per_second | ▅▇▁█▂▂ |
| eval/steps_per_second | ▅▇▁█▂▂ |
| train/epoch | ▁▁▂▂▂▃▃▃▄▄▄▅▅▆▆▆▇▇▇████ |
| train/global_step | ▁▁▂▂▂▃▃▃▄▄▄▅▅▆▆▆▇▇▇████ |
| train/learning_rate | ▁▁▂▂▃▃▄▄▅▅▆▆▇▇██ |
| train/loss | ███▇▆▅▅▄▄▄▃▃▃▂▂▁ |
| train/total_flos | ▁ |

| Metric | Value |
|---|---|
| eval/f1 | 0.71074 |
| eval/loss | 0.77815 |
| eval/runtime | 0.5161 |
| eval/samples_per_second | 724.665 |
| eval/steps_per_second | 5.813 |
| train/epoch | 10.0 |
| train/global_step | 260.0 |
| train/learning_rate | 1e-05 |
| train/loss | 0.7969 |
| train/total_flos | 481208351771448.0 |


## Inference

Great, now that you've finetuned a model, you can use it for inference!

Grab some text you'd like to run inference on:

In [26]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for sentiment analysis with your model, and pass your text to it:

In [29]:
# Create a new wandb run and download the model artifact.
run = wandb.init(project=os.environ["WANDB_PROJECT"], job_type="inference")
# fetch your model artifact from the wandb run.
artifact = run.use_artifact('parambharat/hf_transformers/model-woawjszn:latest', type='model')
artifact_dir = artifact.download()

wandb: Downloading large artifact model-woawjszn:latest, 316.52MB. 8 files...
wandb:   8 of 8 files downloaded.
Done. 0:0:0.0


In [30]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model=artifact_dir)
classifier(text)

loading configuration file ./artifacts/model-woawjszn:v0/config.json
Model config RobertaConfig {
  "_name_or_path": "./artifacts/model-woawjszn:v0",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.2,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.3,
  "hidden_size": 768,
  "id2label": {
    "0": "anger",
    "1": "joy",
    "2": "optimism",
    "3": "sadness"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "anger": 0,
    "joy": 1,
    "optimism": 2,
    "sadness": 3
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.26.0.dev0",
  "type_vocab_size": 1,
  "use_cach

[{'label': 'joy', 'score': 0.6936514973640442}]
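The `score` in this output is the probability the classifier assigns to its top label. For a single-label classification model like this one, the pipeline applies a softmax to the model's logits and reports the highest-scoring class. A minimal sketch of that post-processing, using made-up logits (not the actual outputs of this model) and the `id2label` mapping from the config above; the `softmax` and `top_label` helpers are illustrative:

```python
import math

id2label = {0: "anger", 1: "joy", 2: "optimism", 3: "sadness"}


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits):
    """Return the best class in the same shape as one pipeline result."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits for one input, in the order anger, joy, optimism, sadness.
result = top_label([-0.5, 1.8, 0.6, -1.1])
print(result["label"])  # joy
```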