This notebook demonstrates how to use Hugging Face Optimum to fine-tune a Vision Transformer (ViT) on the ChestX-ray14 dataset. The same workflow applies to any image classification dataset that is organised as one folder per class.

To streamline your experience, we have created some simple scripts
that aim to minimise the demo setup time and make the demo accessible
to users with minimal experience of using IPUs.



# Getting the dataset

Begin by downloading the ChestX-ray14 dataset. It contains 112,120 frontal-view X-ray images of 30,805 patients, recorded between 1992 and 2015, with 14 common disease labels mined from the radiology report text using NLP techniques.

![xray-sample.jpeg](static/xray-sample.jpeg)

Download the dataset from https://nihcc.app.box.com/v/ChestXray-NIHCC

Extract all files:
```
for f in images*.tar.gz; do tar xfz "$f"; done
```
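
If you prefer to stay inside the notebook, the same extraction step can be sketched in Python with the standard library. The helper name `extract_archives` and its default pattern are illustrative assumptions, not part of the original scripts.

```python
import tarfile
from pathlib import Path


def extract_archives(src_dir, dest_dir, pattern="images*.tar.gz"):
    """Extract every archive in src_dir that matches pattern into dest_dir.

    Returns the sorted list of archive paths that were extracted.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archives = sorted(Path(src_dir).glob(pattern))
    for archive in archives:
        # "r:gz" opens a gzip-compressed tar file for reading.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)
    return archives
```

Calling `extract_archives(".", ".")` mirrors the shell loop above: it unpacks each downloaded archive into the current directory.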

In [None]:
import os, shutil
import pandas as pd

In [None]:
data = pd.read_csv('Data_Entry_2017_v2020.csv')

In [None]:
data[:10]

In [None]:
data['Finding Labels'].unique()

In [None]:
# Avoid spaces in folder names: rename the "No Finding" label

data['Finding Labels'] = data['Finding Labels'].str.replace('No Finding', 'No_Finding', regex=False)

In [None]:
# Some images have multiple labels, separated by '|'.
# Split them into one column per label.

findings = data['Finding Labels'].str.split('|', expand=True)

In [None]:
# For images with multiple labels, keep only the first one

data['Finding Labels'] = findings[0]
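
Keeping only the first label turns a multi-label problem into a single-label one, so any secondary findings are discarded. The plain-Python sketch below illustrates the effect on a few made-up label strings. `primary_label` and the sample rows are hypothetical and only mirror the pandas steps above.

```python
from collections import Counter


def primary_label(raw):
    """Return the first finding of a '|'-separated label string.

    'No Finding' is renamed to 'No_Finding' so it can be used as a folder name.
    """
    return raw.replace("No Finding", "No_Finding").split("|")[0]


# Made-up rows in the same format as the 'Finding Labels' column.
rows = ["No Finding", "Cardiomegaly|Effusion", "Effusion", "Infiltration|Mass|Nodule"]

counts = Counter(primary_label(r) for r in rows)
# Each '|' separates the kept label from one that is dropped.
dropped = sum(r.count("|") for r in rows)
print(counts)
print("secondary labels dropped:", dropped)
```

Counting the dropped labels on the real data is a quick way to gauge how much information this simplification removes.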

In [None]:
data['Finding Labels'].unique()

In [None]:
data['Finding Labels'].value_counts()

In [None]:
os.makedirs('processed_images', exist_ok=True)

In [None]:
labels = data['Finding Labels'].unique()
for label in labels:
    os.makedirs('processed_images/{}'.format(label), exist_ok=True)

In [None]:
# Copy each image to its label subfolder

for image, label in zip(data['Image Index'], data['Finding Labels']):
    shutil.copy('images/{}'.format(image), 'processed_images/{}/'.format(label))
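
The copy loop above stops at the first image that is missing from `images/`. A slightly more defensive variant is sketched below: it creates label folders on demand and collects missing files instead of raising. The function name `copy_into_label_dirs` and its return value are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def copy_into_label_dirs(pairs, src_dir="images", dst_dir="processed_images"):
    """Copy each (filename, label) pair from src_dir into dst_dir/<label>/.

    Returns the filenames that were not found in src_dir, so missing images
    are reported rather than stopping the loop part-way through.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    missing = []
    for filename, label in pairs:
        source = src / filename
        if not source.is_file():
            missing.append(filename)
            continue
        target_dir = dst / label
        target_dir.mkdir(parents=True, exist_ok=True)  # safe to re-run
        shutil.copy(source, target_dir / filename)
    return missing
```

With the DataFrame from this notebook it could be called as `copy_into_label_dirs(zip(data['Image Index'], data['Finding Labels']))`.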

# Training and evaluation

Alternatively, you can train from the command line using the `run_image_classification.py` example script from
https://github.com/huggingface/optimum-graphcore/tree/main/examples/image-classification:
    
```
python run_image_classification.py \
    --model_name_or_path google/vit-base-patch16-224-in21k \
    --ipu_config_name Graphcore/vit-base-ipu \
    --train_dir processed_images/ \
    --train_val_split 0.1 \
    --output_dir ./outputs/ \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --dataloader_num_workers 8 \
    --dataloader_drop_last \
    --seed 1337
```

In [1]:
# Hyperparameters
data_dir = "processed_images"
validation_dir = None
train_val_split = 0.1
model_name_or_path = "google/vit-base-patch16-224-in21k"
ipu_config_name = "Graphcore/vit-base-ipu"

In [2]:
import os
from datasets import load_dataset

dataset = load_dataset(
    "imagefolder",
    data_dir=data_dir,
    task="image-classification",
)

split = dataset["train"].train_test_split(train_val_split)
dataset["train"] = split["train"]
dataset["validation"] = split["test"]

Using custom data configuration default-9e135e5054dbcf0c
Reusing dataset image_folder (/home/jincheng/.cache/huggingface/datasets/image_folder/default-9e135e5054dbcf0c/0.0.0/48efdc62d40223daee675ca093d163bcb6cb0b7d7f93eb25aebf5edca72dc597)


Loading cached processed dataset at /home/jincheng/.cache/huggingface/datasets/image_folder/default-9e135e5054dbcf0c/0.0.0/48efdc62d40223daee675ca093d163bcb6cb0b7d7f93eb25aebf5edca72dc597/cache-7019811ef7abccdb.arrow
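
As a sanity check on the split, the expected subset sizes can be worked out by hand. The sketch below assumes a simple rounding rule; the library's exact rounding may differ when the product is not a whole number. For this dataset the result matches the `Num examples` values reported in the training and evaluation logs.

```python
def split_sizes(n_total, val_fraction):
    """Return (n_train, n_val) for a simple fractional hold-out split."""
    n_val = round(n_total * val_fraction)
    return n_total - n_val, n_val


n_train, n_val = split_sizes(112_120, 0.1)
print(n_train, n_val)  # 100908 11212
```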


In [3]:
# Prepare label mappings.
# We'll include these in the model's config to get human readable labels in the Inference API.
labels = dataset["train"].features["labels"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label
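
To see what the two mappings look like, here is a compact, self-contained version of the same loop applied to three example class names. The helper `build_label_maps` is hypothetical. As in the cell above, IDs are stored as strings.

```python
def build_label_maps(labels):
    """Build label -> id and id -> label dictionaries, with ids as strings."""
    label2id = {label: str(i) for i, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


example_label2id, example_id2label = build_label_maps(
    ["Atelectasis", "Effusion", "No_Finding"]
)
print(example_label2id)  # {'Atelectasis': '0', 'Effusion': '1', 'No_Finding': '2'}
print(example_id2label)  # {'0': 'Atelectasis', '1': 'Effusion', '2': 'No_Finding'}
```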

In [4]:
# Load the accuracy and ROC AUC metrics from the datasets package
import datasets
import numpy as np
from scipy.special import softmax

metric_acc = datasets.load_metric("accuracy")
metric_auc = datasets.load_metric("roc_auc", "multiclass")

# Define our compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with
# `predictions` and `label_ids` fields) and must return a dictionary mapping strings to floats.
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    acc = metric_acc.compute(predictions=preds, references=p.label_ids)['accuracy']
    
    pred_scores = softmax(p.predictions.astype('float32'), axis=1)
    auc = metric_auc.compute(prediction_scores=pred_scores, references=p.label_ids, multi_class='ovo')['roc_auc']
    return {"accuracy": acc, "roc_auc": auc}

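# Disabled alternative: a constant predictor that always picks class index 10 (usable as a baseline).
# def compute_metrics(p):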
#     my_predictions = np.zeros_like(p.predictions)
#     my_predictions[:, 10] = 1
    
#     preds = np.argmax(my_predictions, axis=1)
#     acc = metric_acc.compute(predictions=preds, references=p.label_ids)['accuracy']
    
#     pred_scores = softmax(my_predictions.astype('float32'), axis=1)
#     auc = metric_auc.compute(prediction_scores=pred_scores, references=p.label_ids, multi_class='ovo')['roc_auc']
#     return {"accuracy": acc, "roc_auc": auc}
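
The two quantities that `compute_metrics` derives from the raw logits can be illustrated without any dependencies. The sketch below uses made-up logits for three samples and three classes. It shows the argmax step used for accuracy and the softmax step that turns logits into per-class scores summing to 1, which is the form a multiclass ROC AUC computation generally expects. The helper names are illustrative.

```python
import math


def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def accuracy(batch_logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in batch_logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Made-up logits: the first two rows are classified correctly, the third is not.
toy_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [1.0, 1.5, 0.0]]
toy_labels = [0, 2, 0]

toy_probs = [softmax(row) for row in toy_logits]
toy_acc = accuracy(toy_logits, toy_labels)
print(round(toy_acc, 4))  # 0.6667
```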

In [5]:
from transformers import (
    AutoFeatureExtractor,
    AutoModelForImageClassification,
)
from optimum.graphcore import IPUConfig


ipu_config = IPUConfig.from_pretrained(
    ipu_config_name
)
model = AutoModelForImageClassification.from_pretrained(
    model_name_or_path,
    num_labels=len(labels),
    label2id=label2id,
    id2label=id2label,
)
feature_extractor = AutoFeatureExtractor.from_pretrained(
    model_name_or_path
)

Some weights of the model checkpoint at google/vit-base-patch16-224-in21k were not used when initializing ViTForImageClassification: ['pooler.dense.weight', 'pooler.dense.bias']
- This IS expected if you are initializing ViTForImageClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ViTForImageClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
# Define torchvision transforms to be applied to each image.
from torchvision.transforms import (
    CenterCrop,
    Compose,
    Normalize,
    RandomHorizontalFlip,
    RandomResizedCrop,
    Resize,
    ToTensor,
)

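normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
# No augmentation is applied, so the training and validation pipelines are identical:
# resize, convert to a tensor, then normalize.
_train_transforms = Compose(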
    [
        Resize(feature_extractor.size),
        ToTensor(),
        normalize,
    ]
)
_val_transforms = Compose(
    [
        Resize(feature_extractor.size),
        ToTensor(),
        normalize,
    ]
)
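
For intuition about the last two transforms, the sketch below reproduces the arithmetic on single pixel values: `ToTensor` scales 8-bit values to the range [0, 1], and `Normalize` then applies `(x - mean) / std` per channel. The mean and standard deviation of 0.5 used here are assumed values for illustration; the notebook reads the real ones from `feature_extractor.image_mean` and `feature_extractor.image_std`.

```python
def to_unit_range(pixel):
    """Scale an 8-bit pixel value (0-255) to the range [0, 1]."""
    return pixel / 255


def normalize_value(value, mean=0.5, std=0.5):
    """Apply (value - mean) / std; mean and std here are assumed values."""
    return (value - mean) / std


for pixel in (0, 255):
    print(pixel, "->", normalize_value(to_unit_range(pixel)))
# 0 -> -1.0
# 255 -> 1.0
```

Under these assumed statistics, the full 8-bit range maps to [-1, 1].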

In [7]:
# Implement the transforms as a functor (a callable class instance) rather than a function:
# the async dataloader uses pickle underneath, and functions with closures cannot be pickled.
class ApplyTransforms:
    """
    Functor that applies image transforms across a batch.
    """

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, example_batch):
        example_batch["pixel_values"] = [self.transforms(pil_img.convert("RGB")) for pil_img in example_batch["image"]]
        return example_batch

# Set the training transforms
dataset["train"].set_transform(ApplyTransforms(_train_transforms))
# Set the validation transforms
dataset["validation"].set_transform(ApplyTransforms(_val_transforms))
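
The pickling constraint mentioned in the comment can be reproduced with the standard library alone. The sketch below defines the same behaviour twice, once as a callable class and once as a closure, and checks whether each can be serialised. All names (`Scale`, `make_scale_closure`, `can_pickle`) are hypothetical and unrelated to the dataloader's internals.

```python
import pickle


class Scale:
    """Callable object: its state lives in ordinary instance attributes."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, x):
        return x * self.factor


def make_scale_closure(factor):
    """Same behaviour, but as a nested function that captures `factor`."""

    def scale(x):
        return x * factor

    return scale


def can_pickle(obj):
    """Return True if pickle can serialise obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


print("functor picklable:", can_pickle(Scale(2)))
print("closure picklable:", can_pickle(make_scale_closure(2)))
```

In a regular script or notebook session, the functor is expected to pickle and the closure is not, which is why `ApplyTransforms` is written as a class.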


In [8]:
import torch

def collate_fn(examples):
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    labels = torch.tensor([example["labels"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}

In [9]:
from optimum.graphcore import IPUTrainer
from optimum.graphcore import IPUTrainingArguments as TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    dataloader_num_workers=8,
    dataloader_drop_last=True,
    num_train_epochs=3,
    # max_steps=10,
    seed=1337,
    logging_steps=50,
    save_steps=5000,
    remove_unused_columns=False,
)

trainer = IPUTrainer(
    model=model,
    ipu_config=ipu_config,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
    tokenizer=feature_extractor,
    data_collator=collate_fn,
)

trainer.train()

Setting replicated_tensor_sharding to False when replication_factor=1
---------- Device Allocation -----------
Embedding  --> IPU 0
Encoder 0  --> IPU 0
Encoder 1  --> IPU 0
Encoder 2  --> IPU 0
Encoder 3  --> IPU 1
Encoder 4  --> IPU 1
Encoder 5  --> IPU 1
Encoder 6  --> IPU 2
Encoder 7  --> IPU 2
Encoder 8  --> IPU 2
Encoder 9  --> IPU 3
Encoder 10 --> IPU 3
Encoder 11 --> IPU 3
Head       --> IPU 3
---------------------------------------
Compiling Model...
Graph compilation: 100%|██████████| 100/100 [00:13<00:00]
Compiled/Loaded model in 58.4291013404727 secs
***** Running training *****
  Num examples = 100908
  Num Epochs = 3
  Instantaneous batch size per device = 1
  Device Iterations = 1
  Replication Factor = 1
  Gradient Accumulation steps = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Total optimization steps = 2364
wandb: Currently logged in as: sw-apps

{'loss': 1.9474, 'learning_rate': 4.89424703891709e-05, 'epoch': 0.06}
{'loss': 1.366, 'learning_rate': 4.7884940778341796e-05, 'epoch': 0.13}
{'loss': 1.6601, 'learning_rate': 4.682741116751269e-05, 'epoch': 0.19}
{'loss': 1.5776, 'learning_rate': 4.576988155668359e-05, 'epoch': 0.25}
{'loss': 1.4641, 'learning_rate': 4.471235194585449e-05, 'epoch': 0.32}
{'loss': 1.6877, 'learning_rate': 4.365482233502538e-05, 'epoch': 0.38}
{'loss': 1.8744, 'learning_rate': 4.259729272419628e-05, 'epoch': 0.44}
{'loss': 1.4432, 'learning_rate': 4.153976311336718e-05, 'epoch': 0.51}
{'loss': 1.341, 'learning_rate': 4.0482233502538075e-05, 'epoch': 0.57}
{'loss': 1.8149, 'learning_rate': 3.942470389170897e-05, 'epoch': 0.63}
{'loss': 1.4903, 'learning_rate': 3.836717428087986e-05, 'epoch': 0.7}
{'loss': 1.7573, 'learning_rate': 3.7309644670050766e-05, 'epoch': 0.76}
{'loss': 1.1495, 'learning_rate': 3.6252115059221656e-05, 'epoch': 0.82}
{'loss': 1.4042, 'learning_rate': 3.519458544839256e-05, 'epoch'



Training completed. Do not forget to share your model on huggingface.co/models =)




{'train_runtime': 875.4389, 'train_samples_per_second': 345.797, 'train_steps_per_second': 2.7, 'train_loss': 1.4113887522224646, 'epoch': 3.0}


TrainOutput(global_step=2364, training_loss=1.4113887522224646, metrics={'train_runtime': 875.4389, 'train_samples_per_second': 345.797, 'train_steps_per_second': 2.7, 'train_loss': 1.4113887522224646, 'epoch': 3.0})
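
The figures in the training log can be cross-checked with a few lines of arithmetic. The sketch below recomputes the global batch size, the number of optimization steps, the learning rate at the first logged step, and the throughput. Two assumptions are made: the initial learning rate is 5e-5 (not set in this notebook, but consistent with the logged values), and the schedule is a linear decay to zero without warm-up.

```python
def global_batch_size(micro_batch, grad_accum, replication, device_iterations):
    """Samples consumed per optimizer step."""
    return micro_batch * grad_accum * replication * device_iterations


def total_steps(n_examples, batch_size, epochs):
    """Optimizer steps when the last partial batch of each epoch is dropped."""
    return (n_examples // batch_size) * epochs


def linear_lr(step, n_steps, initial_lr):
    """Linear decay from initial_lr to zero over n_steps (assumed schedule)."""
    return initial_lr * (1 - step / n_steps)


batch = global_batch_size(1, 128, 1, 1)
steps = total_steps(100_908, batch, 3)
lr_at_50 = linear_lr(50, steps, 5e-5)       # assumed initial learning rate
throughput = 3 * 100_908 / 875.4389         # samples seen / train_runtime

print(batch)                 # 128
print(steps)                 # 2364
print(lr_at_50)              # close to the logged 4.894e-05
print(round(throughput, 1))  # 345.8
```

The results agree with the logged total train batch size (128), total optimization steps (2364), first logged learning rate and `train_samples_per_second`.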

In [10]:
metrics = trainer.evaluate()
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

Compiling Model...
Graph compilation: 100%|██████████| 100/100 [00:05<00:00]
Compiled/Loaded model in 43.46780707128346 secs
***** Running Evaluation *****
  Num examples = 11212
  Batch size = 4


***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.5718
  eval_loss               =      1.376
  eval_roc_auc            =     0.7208
  eval_runtime            = 0:00:40.28
  eval_samples_per_second =    278.319
  eval_steps_per_second   =      69.58
