# Transformers for Computer Vision (working with text data)

- Ensure you are using a GPU as your Runtime.

In [None]:
!pip install transformers==4.24.0 datasets==2.7.1 evaluate==0.3.0 gradio==3.12.0

# Tokenizers for Text

## Working with the Hugging Face library

**We want to use the same weights for our model and tokenizer. How can we use the BERT uncased checkpoint ('bert-base-uncased') for our tokenizer?**

**How can we determine how large the vocabulary is?**

**Convert the following sentence into**
1. Tokens
2. Numerical IDs

In [None]:
sentence = 'I like NLP'



**What is the relationship between the CLS/SEP tokens and their token_ids?**

**What happens when a token is not in the vocabulary?**

In [None]:
'😀' 

In [None]:
sentence = 'I like NLP😀'

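To make the tokenization questions above concrete, here is a minimal sketch of greedy longest-match WordPiece splitting, the scheme BERT tokenizers are built on. The vocabulary `TOY_VOCAB`, its IDs, and the helper names `wordpiece` and `encode` are invented for illustration; the real `bert-base-uncased` vocabulary is far larger and its IDs differ. The sketch only shows the mechanics: every token maps to an integer ID, `[CLS]` and `[SEP]` are ordinary vocabulary entries with their own IDs, and a word that cannot be assembled from known pieces falls back to `[UNK]`.

```python
# Toy vocabulary: the entries and IDs are invented for illustration only.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "like": 5, "nl": 6, "##p": 7,
}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece fits, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(sentence, vocab):
    """Lowercase, split on whitespace, apply WordPiece, and add special tokens."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    ids = [vocab[token] for token in tokens]
    return tokens, ids


print(encode("I like NLP", TOY_VOCAB))
print(encode("I like NLP😀", TOY_VOCAB))
```

In this toy setup the emoji is glued to `NLP`, so the whole word `nlp😀` becomes a single `[UNK]`. Compare this with what the real tokenizer returns in the cells above.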

**How would you tokenize first_sentence and second_sentence?**

In [None]:
first_sentence = 'I like NLP.'
second_sentence = 'What about you?'

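For BERT-style checkpoints, a sentence pair is packed into one sequence as `[CLS] A [SEP] B [SEP]`, with segment IDs (`token_type_ids`) marking which sentence each token belongs to. The sketch below reproduces only that layout. `pack_pair` is a hypothetical helper, and the whitespace split stands in for a real tokenizer. With a Hugging Face tokenizer, passing both sentences in one call, as in `tokenizer(first_sentence, second_sentence)`, should give the same structure.

```python
def pack_pair(first_tokens, second_tokens):
    """Lay out two token lists as [CLS] A [SEP] B [SEP] with segment IDs."""
    tokens = ["[CLS]", *first_tokens, "[SEP]", *second_tokens, "[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(first_tokens) + 2) + [1] * (len(second_tokens) + 1)
    return tokens, token_type_ids


# Whitespace splitting is a stand-in for a real tokenizer here.
first_tokens = "i like nlp .".split()
second_tokens = "what about you ?".split()

tokens, token_type_ids = pack_pair(first_tokens, second_tokens)
for token, segment in zip(tokens, token_type_ids):
    print(segment, token)
```

The segment IDs are what let the model tell the two sentences apart, which matters for tasks such as sentence-pair classification.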

# Text classification - IMDB Dataset

## Datasets library

**How can you load the imdb dataset, using the datasets package?**

**Split the dataset as follows:**
- train - 1600
- validation - 400
- test - 400

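One way to think about the split sizes above: shuffle the row indices once, then carve out consecutive, non-overlapping slices. The sketch below does this on plain indices. `split_indices` is a hypothetical helper, and drawing all three splits from one pool of 25,000 rows (the size of the IMDB training split) is an assumption. With the `datasets` library the same idea is usually expressed with `shuffle(seed=...)` followed by `select(...)`.

```python
import random


def split_indices(n_rows, sizes, seed=0):
    """Shuffle row indices once, then cut disjoint consecutive slices."""
    if sum(sizes.values()) > n_rows:
        raise ValueError("requested more rows than are available")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    splits, start = {}, 0
    for name, size in sizes.items():
        splits[name] = indices[start:start + size]
        start += size
    return splits


sizes = {"train": 1600, "validation": 400, "test": 400}
splits = split_indices(25_000, sizes, seed=42)
print({name: len(rows) for name, rows in splits.items()})
```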
## Overview of IMDB Dataset

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('max_colwidth', 250)

**Convert the imdb dataset (train split) to pandas and display 10 random samples**

In [None]:
df.loc[0, 'text']

Replace `<br />` with an empty string (`''`).

In [None]:
#code snippet to display a boxplot
df["Words per review"] = df["text"].str.split().apply(len)
df.boxplot("Words per review", by="label", grid=False, showfliers=False,
           color="black")
plt.suptitle("")
plt.xlabel("")
plt.show()

**Show all reviews that are shorter than 200 characters.**

In [None]:
# 0 is negative
# 1 is positive

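The pandas steps in this section (sampling rows, stripping `<br />`, filtering by length) can be sketched on a tiny stand-in frame. The rows in `toy_df` are invented, and the variable is named `toy_df` so it does not overwrite the notebook's `df`.

```python
import pandas as pd

# Tiny stand-in for the IMDB training frame; the rows are invented.
toy_df = pd.DataFrame(
    {
        "text": [
            "Loved it.<br /><br />Would watch again.",
            "Dull. " * 50,  # 300 characters, so it fails the length filter
            "Not for me.",
        ],
        "label": [1, 0, 0],  # 0 is negative, 1 is positive
    }
)

# Remove the literal HTML line breaks; regex=False treats the pattern as plain text.
toy_df["text"] = toy_df["text"].str.replace("<br />", "", regex=False)

# Draw random rows (the exercise asks for 10; this toy frame only has 3).
print(toy_df.sample(n=2, random_state=0))

# Keep only reviews shorter than 200 characters.
short_reviews = toy_df[toy_df["text"].str.len() < 200]
print(short_reviews)
```

On the real data, the same three calls apply to `df` once the train split has been converted to pandas.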

**Set the format of the imdb dataset from pandas back to Datasets**

## Tokenizer

**Tokenize the entire dataset**

In [None]:
from transformers import AutoTokenizer

checkpoint = "bert-base-cased"


## Tiny IMDB

**Which type of model should we be using?**

**Given the following checkpoint, use a GPU if it is available and define the model**

**Create a dataset that has the following and tokenize them**
- 50 elements in the training set 
- 10 in the validation set and
- 10 in the test set.

**Using `model_name`, define the training arguments for model training.**

In [None]:
model_name = f"{checkpoint}-finetuned-imdb"

**Kick off a training run. What is a good metric for classification?**

**Create a helper function, `get_accuracy` to use accuracy as the metric**

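A minimal sketch of such a helper, assuming it will be passed to the `Trainer` as `compute_metrics` and therefore receives a pair of arrays (raw logits and integer labels). It uses only NumPy; the example logits below are made up.

```python
import numpy as np


def get_accuracy(eval_pred):
    """Return accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely class per example
    return {"accuracy": float((predictions == labels).mean())}


# Made-up logits for three examples and two classes.
example_logits = np.array([[2.0, 1.0], [0.0, 3.0], [1.0, 0.0]])
example_labels = np.array([0, 1, 1])
print(get_accuracy((example_logits, example_labels)))
```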
**Kick off another training run and confirm you get the metric you expect**

**Now kick off a full training run**

**What results do you get for the following?**

In [None]:
text = 'This is not my idea of fun'

In [None]:
text = 'This was beyond incredible'

# Vision Transformers

## Getting the data

In [None]:
!wget https://github.com/jonfernandes/flowers-dataset/raw/main/flower_photos.tgz
!tar -xvf flower_photos.tgz


## Using datasets

In [None]:
!pip install transformers==4.24.0 datasets==2.7.1 evaluate==0.3.0 gradio==3.12.0

**Load the flowers dataset into Hugging Face `datasets`**

**Display the first 5 images**

**What are the labels?**

In [None]:
labels = ds['train'].features['label'].names
labels

Split the dataset, so that you have the following:
- Train set - 80%
- Validation set - 10%
- Test set - 10%

## Using a pre-trained model without fine-tuning

Determine which model you should be using and remember to use a GPU if one is available

**Define the feature extractor**

**Using `train_image_id = 3`, what flower do you have for this image? (This will vary for everyone.)**

**What happens when you pass this image to the feature extractor?**

**What flower does the model predict?**

**Which flowers are in the ImageNet dataset?**

## Defining a model

**Define your own model using the following:**
- model_id
- num_labels
- id2label
- label2id
- ignore_mismatched_sizes

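The label mappings in the list above can be derived directly from the class names. A minimal sketch, assuming the five flower classes used later in this notebook:

```python
labels = ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']

# Integer class index -> human-readable name, and the reverse lookup.
id2label = {index: label for index, label in enumerate(labels)}
label2id = {label: index for index, label in id2label.items()}
num_labels = len(labels)

print(num_labels)
print(id2label)
print(label2id)
```

These values are the kind of thing passed to `from_pretrained` alongside `ignore_mismatched_sizes=True`, which is needed when the checkpoint's classification head was trained with a different number of classes than `num_labels`.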
## Pre-processing images

In [None]:
import torchvision

from torchvision.transforms import (
    Compose,
    Normalize,
    RandomHorizontalFlip,
    RandomResizedCrop,
    ToTensor,
    Resize,
    CenterCrop
)

**Define normalize. What is its purpose?**

In [None]:
normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)

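As a rough illustration of what `Normalize` computes, the sketch below applies `(pixel - mean) / std` per channel with NumPy. The statistics of 0.5 per channel are an assumption for the example; the real values come from `feature_extractor.image_mean` and `feature_extractor.image_std`. `normalize_image` is a hypothetical helper, not part of torchvision.

```python
import numpy as np


def normalize_image(image, mean, std):
    """Per-channel (pixel - mean) / std for a (channels, height, width) array."""
    mean = np.asarray(mean, dtype=np.float32).reshape(-1, 1, 1)
    std = np.asarray(std, dtype=np.float32).reshape(-1, 1, 1)
    return (image - mean) / std


# Random stand-in for an image already scaled to [0, 1], as ToTensor() produces.
image = np.random.default_rng(0).random((3, 4, 4), dtype=np.float32)

# Assumed statistics: 0.5 per channel maps [0, 1] onto [-1, 1].
normalized = normalize_image(image, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
print(image.min(), image.max())
print(normalized.min(), normalized.max())
```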
In [None]:
feature_extractor.size

**Answer the following:**
- Why are `train_transform` and `validation_transform` different in the code snippet below?
- What is `pixel_values`?

In [None]:
train_transform = Compose(
    [
     RandomResizedCrop(feature_extractor.size),
     RandomHorizontalFlip(),
     ToTensor(),
     normalize
    ]
)

validation_transform = Compose(
        [
            Resize(feature_extractor.size),
            CenterCrop(feature_extractor.size),
            ToTensor(),
            normalize,
        ]
    )

def train_transform_images(images):
  images["pixel_values"] = [train_transform(image.convert("RGB")) for image in images["image"]]
  return images

def validation_transform_images(images):
  images["pixel_values"] = [validation_transform(image.convert("RGB")) for image in images["image"]]
  return images

**What is the difference between `map` and `with_transform`?**

In [None]:
transformed_ds = ds.with_transform(train_transform_images)
transformed_ds['train'] = ds['train'].with_transform(train_transform_images)
transformed_ds['validation'] = ds['validation'].with_transform(validation_transform_images)
transformed_ds['test'] = ds['test'].with_transform(validation_transform_images)

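The practical difference is when the transform runs: `map` applies it once and stores the result, while `with_transform` applies it on the fly each time a row is accessed. The pure-Python sketch below mimics that behaviour with a toy random augmentation. `random_flip` and `LazyRows` are invented stand-ins and are not part of the `datasets` API.

```python
import random


def random_flip(row):
    """Toy augmentation: reverse the values half of the time."""
    values = row["values"]
    return {"values": values[::-1] if random.random() < 0.5 else values}


rows = [{"values": [1, 2, 3]}, {"values": [4, 5, 6]}]

# Eager, in the spirit of `map`: the transform runs once and the result is stored.
eager_rows = [random_flip(row) for row in rows]


class LazyRows:
    """In the spirit of `with_transform`: the transform runs on every access."""

    def __init__(self, rows, transform):
        self.rows = rows
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        return self.transform(self.rows[index])


lazy_rows = LazyRows(rows, random_flip)

print([eager_rows[0]["values"] for _ in range(5)])  # the same stored result each time
print([lazy_rows[0]["values"] for _ in range(5)])   # can differ between accesses
```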
## A transformed image

**Using the following sample image, show how the image changes when using:**
- train_transform
- validation_transform

**Run this a couple of times**

In [None]:
sample_image = ds['train'][train_image_id]['image']
sample_image

## Getting images in the correct format

**4-images**

Working with 4 images, determine the following (use the Hugging Face documentation here):
- labels that are tensors
- pixel_values that are stacked

In [None]:
four_images = [transformed_ds['train'][i] for i in range(4)]
four_images

In [None]:
print(four_images[0]['pixel_values'].shape, four_images[1]['pixel_values'].shape, four_images[2]['pixel_values'].shape, four_images[3]['pixel_values'].shape)

**What is the purpose of the collate function? Create a collate function for the images.** 

In [None]:
from torch.utils.data import DataLoader

def collate_fn(images):
  pass

train_dataloader = DataLoader(transformed_ds['train'], batch_size=4, collate_fn=collate_fn, shuffle=True)
validation_dataloader = DataLoader(transformed_ds['validation'], batch_size=4, collate_fn=collate_fn, shuffle=False)
test_dataloader = DataLoader(transformed_ds['test'], batch_size=4, collate_fn=collate_fn, shuffle=False)

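One possible shape for the collate function, offered as a sketch and not as the reference solution: stack the per-image tensors into a single batch tensor and turn the integer labels into a tensor. It assumes each example is a dict with a `pixel_values` tensor and an integer `label`, as produced by the transforms above; the fake examples are placeholders.

```python
import torch


def collate_fn(examples):
    """Merge a list of per-image dicts into one batch dict."""
    # (channels, height, width) tensors -> (batch, channels, height, width)
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    # Python ints -> 1-D integer tensor
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}


# Placeholder examples standing in for four transformed images.
fake_examples = [
    {"pixel_values": torch.zeros(3, 224, 224), "label": index} for index in range(4)
]
fake_batch = collate_fn(fake_examples)
for key, value in fake_batch.items():
    print(key, value.shape)
```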
In [None]:
batch = next(iter(train_dataloader))

for key, value in batch.items():
  print(key, value.shape)

## Training arguments

**Determine the training arguments**

In [None]:
from transformers import TrainingArguments, Trainer

batch_size = 32
metric_name = "accuracy"
model_name = 'vit-base-patch16-224-finetuned-flower'


In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
!git config --global credential.helper store

## Model Training

**Kick off a training run**

From the [evaluate documentation](https://huggingface.co/docs/evaluate/a_quick_tour#compute):

```
metric.compute(
          references=..., 
          predictions=...)
```

**This time use the evaluate package to define the accuracy and kick-off a training run**

**What are the evaluation results for the train, validation and test splits?**

## Inference in notebook

In [None]:
test_image = ds['test'][-1]['image']
test_image

**Define the function `classify_image` using the model you have pushed to the Hugging Face Hub. Use argmax as the final layer**

In [None]:
import torch
from transformers import AutoModelForImageClassification, AutoFeatureExtractor

model_id = 'jonathanfernandes/vit-base-patch16-224-finetuned-flower'

def classify_image(image):
  pass

classify_image(test_image)

**This time use softmax as the final layer instead of argmax**

In [None]:
import torch

model_id = 'jonathanfernandes/vit-base-patch16-224-finetuned-flower'

def classify_image(image):
  pass

classify_image(test_image)

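The two variants differ only in what is done with the model's logits: argmax returns the single most likely class, while softmax turns the logits into a probability for every class. A NumPy sketch on made-up logits (the numbers are invented and `top_label` and `label_probabilities` are hypothetical helpers; in the notebook the logits would come from the model's output):

```python
import numpy as np

labels = ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)  # subtracting the max avoids overflow
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum()


def top_label(logits):
    """Argmax variant: return only the most likely class name."""
    return labels[int(np.argmax(logits))]


def label_probabilities(logits):
    """Softmax variant: return a probability for every class name."""
    return {label: float(p) for label, p in zip(labels, softmax(logits))}


example_logits = np.array([0.1, 2.0, -1.0, 0.3, 0.5])  # invented values
print(top_label(example_logits))
print(label_probabilities(example_logits))
```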
**Use Hugging Face's `pipeline` to classify the image**

In [None]:
from transformers import pipeline

model_id = 'jonathanfernandes/vit-base-patch16-224-finetuned-flower'


## Inference on your phone using Gradio

In [None]:
!wget https://github.com/jonfernandes/Advanced_AI_Transformers_for_Computer_Vision/raw/main/flower-1.jpg
!wget https://github.com/jonfernandes/Advanced_AI_Transformers_for_Computer_Vision/raw/main/flower-2.jpeg

In [None]:
!ls -l

In [None]:
import torch
from transformers import AutoModelForImageClassification, AutoFeatureExtractor
import gradio as gr

model_id = 'jonathanfernandes/vit-base-patch16-224-finetuned-flower'
labels = ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']

def classify_image(image):
  pass

# Use Gradio to make your model available to everyone.