<a href="https://colab.research.google.com/github/nathan-barry/ml-studies/blob/main/quick-tour.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HF Transformers

- Provides APIs and tools to easily download and train state-of-the-art pretrained models
- Models support common tasks such as:
  - NLP
  - Computer Vision
  - Audio
  - Multimodal
- Framework supports interoperability between PyTorch, TensorFlow, and JAX

# Quick Tour

In [None]:
# Install the necessary libraries on colab machine
!pip install transformers datasets

- The pipeline() function is the easiest way to use a pretrained model for inference.
- Can use it out-of-the-box for:
  - Text classification
  - Text generation
  - Named entity recognition
  - Question Answering
  - Fill-mask
  - Summarization
  - Image classification
  - Image segmentation
  - Object detection
  - Audio classification
  - Automatic speech recognition
  - Visual question answering
- Check the [docs](https://huggingface.co/docs/transformers/quicktour) to see how to use each task

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

- The pipeline() downloads and caches a default pretrained model and tokenizer

In [None]:
classifier("NATHAN BARRY IS EPIC!!!!!!")

In [None]:
# Can pass inputs as a list if you have multiple
classifier(["NATHAN BARRY IS EPIC!!!!!!", "Nathan Barry is lame."])

- The pipeline() can also iterate over an entire dataset for any task you like

In [None]:
import torch
from datasets import load_dataset, Audio

In [None]:
# Initialize the audio pipeline object
speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

# Load the dataset
dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")

# Make sure the sampling rate of the data and model are the same
dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate))

In [None]:
# The audio files are automatically loaded and resampled when accessing the "audio" column
# Extract the raw waveform arrays from the first 4 samples and pass them as a list to the pipeline
results = speech_recognizer(dataset[:4]["audio"])
print([res["text"] for res in results])

- For larger datasets where the inputs are large (like in speech or vision), you'll want to pass a generator instead of a list
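
A rough sketch of the generator idea, using only the standard library. `fake_pipeline` is a hypothetical stand-in for a real pipeline() object and `make_samples` stands in for reading audio files from disk; neither is part of Transformers. The point is only that a generator yields one input at a time instead of building the whole list in memory.

```python
from typing import Iterable, Iterator


def make_samples(n: int) -> Iterator[list]:
    """Hypothetical data source: yields one 'waveform' at a time."""
    for i in range(n):
        # In practice this would read and decode one file from disk
        yield [float(i)] * 4


def fake_pipeline(inputs: Iterable[list]) -> Iterator[dict]:
    """Stand-in for a pipeline: consumes inputs lazily, one by one."""
    for waveform in inputs:
        yield {"length": len(waveform), "first_value": waveform[0]}


# Nothing is produced until the results are iterated over
results = list(fake_pipeline(make_samples(3)))
print(results)
```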

## Use another model and tokenizer in the pipeline

- The pipeline() can accommodate any model from the HF Hub

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [None]:
# Grab model from hub
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

# Use AutoModelForSequenceClassification and AutoTokenizer to
# load the pretrained model and its associated tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# Specify the model and tokenizer in the pipeline()
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers")

- If you can't find a model for your use-case, you'll need to finetune a pretrained model on your data

## AutoClass

- Under the hood, the AutoModelForSequenceClassification and AutoTokenizer classes work together to power the pipeline()
- An AutoClass is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path.
  - Only need to select the appropriate AutoClass for your task and its associated preprocessing class
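
To make the idea concrete, here is a toy sketch of name-based dispatch in plain Python. It is not the library's actual implementation; `ToyBert`, `ToyDistilBert`, `MODEL_REGISTRY`, and `toy_auto_model` are all made up for illustration. The sketch assumes the architecture can be looked up from a `model_type` string stored in a config dictionary.

```python
class ToyBert:
    def __init__(self, config: dict):
        self.config = config


class ToyDistilBert:
    def __init__(self, config: dict):
        self.config = config


# Hypothetical registry: architecture name -> class to instantiate
MODEL_REGISTRY = {"bert": ToyBert, "distilbert": ToyDistilBert}


def toy_auto_model(config: dict):
    """Pick the class from the config's model_type, then build it."""
    model_type = config["model_type"]
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model type: {model_type}")
    return MODEL_REGISTRY[model_type](config)


model_a = toy_auto_model({"model_type": "bert", "num_labels": 5})
model_b = toy_auto_model({"model_type": "distilbert", "num_labels": 2})
print(type(model_a).__name__, type(model_b).__name__)
```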

### AutoTokenizer

- A tokenizer is responsible for preprocessing text into an array of numbers as the inputs to a model
- Several rules govern the tokenization process, including how a word is split and at what level (word, subword, or character) the split happens
- Need to use the same tokenization rules a model was pretrained with
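
A toy illustration of why the tokenizer must match the model, using a made-up whitespace tokenizer and two made-up vocabularies (real tokenizers use subword rules and far larger vocabularies). The same text maps to different ids under each vocabulary, so a model trained with one would misread ids produced by the other.

```python
def toy_tokenize(text: str, vocab: dict, unk_id: int = 0) -> list:
    """Lowercase, split on whitespace, and map each word to an id."""
    return [vocab.get(word, unk_id) for word in text.lower().split()]


# Two hypothetical vocabularies that disagree on the ids
vocab_a = {"[UNK]": 0, "nathan": 1, "is": 2, "epic": 3}
vocab_b = {"[UNK]": 0, "epic": 1, "nathan": 2, "is": 3}

toy_text = "Nathan is epic"
ids_a = toy_tokenize(toy_text, vocab_a)
ids_b = toy_tokenize(toy_text, vocab_b)
print(ids_a)  # [1, 2, 3]
print(ids_b)  # [2, 3, 1]
```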

In [None]:
encoding = tokenizer("NATHAN BARRY IS EPIC!!!!!")
print(encoding)

- The tokenizer returns a dictionary containing
  - input_ids: numerical representation of your tokens
  - attention_mask: indicates which tokens should be attended to
- A tokenizer can accept a list of inputs, and pad and truncate the text to return a batch with uniform length
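
What padding and truncation do to a batch can be sketched in plain Python. `pad_and_truncate` is a hypothetical helper, not a Transformers function. It assumes a padding id of 0, right-side padding, and a plain cut at `max_length` that ignores special tokens, which a real tokenizer would handle more carefully.

```python
def pad_and_truncate(batch: list, max_length: int, pad_id: int = 0) -> dict:
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = pad_and_truncate([[101, 7, 8, 9, 102], [101, 7, 102]], max_length=4)
print(toy_batch["input_ids"])       # [[101, 7, 8, 9], [101, 7, 102, 0]]
print(toy_batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```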

In [None]:
pt_batch = tokenizer(
    ["NATHAN BARRY IS EPIC!!!!!", "nathan barry is lame."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

### AutoModel

- Transformers provides a simple and unified way to load pretrained instances.
- You should select the correct AutoModel for the task
  - Should load `AutoModelForSequenceClassification` for text (or sequence) classification

In [None]:
# Already called this above
# model = AutoModelForSequenceClassification.from_pretrained(model_name)

pt_outputs = model(**pt_batch) # Unpack the batch dictionary into the model

In [None]:
from torch import nn

In [None]:
# The model returns its final activations (pre-softmax scores) in the .logits attribute
pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)

# Probabilities over the model's classes (1 to 5 stars for this model), one row per input
print(pt_predictions)
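
For reference, softmax turns a row of logits into probabilities that sum to 1, one probability per class. A standard-library sketch with made-up logits (the numbers are illustrative, not real model output):

```python
import math


def softmax(logits: list) -> list:
    """Numerically stable softmax over a single row of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical scores for 5 classes
probs = softmax(toy_logits)
predicted_class = probs.index(max(probs))
print([round(p, 4) for p in probs])
print(predicted_class)  # 4
```

The index of the largest probability is the predicted class, which is the same index as the largest logit because softmax preserves ordering.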

### Save a Model

- Once your model is fine-tuned, you can save it with its tokenizer using `<PreTrainedModel>.save_pretrained()` and load it with `<PreTrainedModel>.from_pretrained()`

In [None]:
pt_save_directory = "./pt_save_pretrained"

# Save tokenizer
tokenizer.save_pretrained(pt_save_directory)

# Save model
model.save_pretrained(pt_save_directory)

In [None]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained(pt_save_directory)

## Custom Model Builds

- You can modify the model's configuration class to change how a model is built
- Config specifies a model's attributes such as number of hidden layers or attention heads
- `AutoConfig.from_pretrained("<model>", <attribute>=<value>, ...)` loads the config of a pretrained model and overrides the attributes you pass as keyword arguments
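
The load-then-override pattern can be mimicked with a dataclass. `ToyConfig`, `SAVED_CONFIGS`, and `toy_from_pretrained` are hypothetical, and the default values are placeholders rather than the real settings of any checkpoint; the sketch only shows keyword arguments replacing stored attributes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyConfig:
    n_layers: int = 6
    n_heads: int = 12
    dim: int = 768


# Hypothetical store of saved configs, keyed by a made-up model name
SAVED_CONFIGS = {"toy-model": ToyConfig()}


def toy_from_pretrained(name: str, **overrides) -> ToyConfig:
    """Load the stored config, then apply keyword overrides on top."""
    return replace(SAVED_CONFIGS[name], **overrides)


toy_config = toy_from_pretrained("toy-model", n_heads=8)
print(toy_config)
```

Attributes that are not overridden keep their stored values, and the stored config itself is left unchanged.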

In [None]:
from transformers import AutoConfig

my_config = AutoConfig.from_pretrained("distilbert-base-uncased", n_heads=12) # changes n_heads attribute

In [None]:
from transformers import AutoModel

# Create a model from your custom config
my_model = AutoModel.from_config(my_config)

## Trainer: a PyTorch-optimized training loop

- All models are a standard torch.nn.Module
- Can use your own PyTorch training loop or use HF's `Trainer` class
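
As a rough picture of what a training loop does, and what `Trainer` takes off your hands, here is a mini-batch gradient descent loop in plain Python on a made-up one-parameter model. Nothing here uses PyTorch or Transformers; the data and the `learning_rate`, `batch_size`, and `num_epochs` values are arbitrary choices for illustration.

```python
# Toy data following y = 3 * x, so the ideal weight is 3.0
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]

weight = 0.0
learning_rate = 0.01
batch_size = 2
num_epochs = 200

for epoch in range(num_epochs):
    for start in range(0, len(xs), batch_size):
        batch_x = xs[start:start + batch_size]
        batch_y = ys[start:start + batch_size]
        # Gradient of the mean squared error with respect to the weight
        grad = sum(2 * (weight * x - y) * x for x, y in zip(batch_x, batch_y)) / len(batch_x)
        # Gradient descent update
        weight -= learning_rate * grad

print(round(weight, 3))
```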

In [None]:
# Trainer takes in a PreTrainedModel or a torch.nn.Module
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

In [None]:
from transformers import TrainingArguments

# Training arguments
training_args = TrainingArguments(
    output_dir="path/to/save/folder/",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2
)

In [None]:
# Trainer takes in a preprocessing class like a tokenizer or feature extractor
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
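# Load a dataset and tokenize it
dataset = load_dataset("dataset")  # "dataset" is a placeholder for a dataset name on the Hub

def tokenize_dataset(batch):
    # Assumes the dataset has a "text" column
    return tokenizer(batch["text"])

dataset = dataset.map(tokenize_dataset, batched=True)

# Create batches
from transformers import DataCollatorWithPadding

# The collator takes the tokenizer itself and pads each batch to a uniform length
# (will probably just use a PyTorch DataLoader instead)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)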

In [None]:
# Create trainer
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
# To Train the model
trainer.train() # Yep that's it