You can download the `requirements.txt` for this course from the workspace of this lab: click `File --> Open...`.

# Lesson 2: Data Types and Sizes

In this lab, you will learn about the common data types used to store the parameters of machine learning models.


The libraries are already installed in the classroom.  If you're running this notebook on your own machine, you can install the following:

```Python
!pip install torch==2.1.1
```

In [None]:
import torch

### Integers

In [None]:
# Information of `8-bit unsigned integer`
torch.iinfo(torch.uint8)

In [None]:
# Information of `8-bit (signed) integer`
torch.iinfo(torch.int8)
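
The limits that `torch.iinfo` reports follow from the bit width alone. As a minimal sketch in plain Python (the helper name `int_range` is hypothetical, not part of PyTorch), an unsigned n-bit integer covers 0 to 2^n - 1, and a signed integer, assuming two's-complement encoding, covers -2^(n-1) to 2^(n-1) - 1:

```python
def int_range(bits, signed):
    """Return (min, max) for an integer type of the given bit width.

    Hypothetical helper for illustration; assumes two's-complement
    encoding for signed types.
    """
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


print("uint8:", int_range(8, signed=False))
for bits in (8, 16, 32, 64):
    print(f"int{bits}:", int_range(bits, signed=True))
```

You can compare these tuples with the `min` and `max` fields printed by `torch.iinfo` for the matching data type.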

In [None]:
### Information of `64-bit (signed) integer`


In [None]:
### Information of `32-bit (signed) integer`


In [None]:
### Information of `16-bit (signed) integer`


### Floating Points 

In [None]:
# by default, Python stores floats as 64-bit floating point (fp64)
value = 1/3

In [None]:
format(value, '.60f')

In [None]:
# 64-bit floating point
tensor_fp64 = torch.tensor(value, dtype = torch.float64)

In [None]:
print(f"fp64 tensor: {format(tensor_fp64.item(), '.60f')}")

In [None]:
tensor_fp32 = torch.tensor(value, dtype = torch.float32)
tensor_fp16 = torch.tensor(value, dtype = torch.float16)
tensor_bf16 = torch.tensor(value, dtype = torch.bfloat16)

In [None]:
print(f"fp64 tensor: {format(tensor_fp64.item(), '.60f')}")
print(f"fp32 tensor: {format(tensor_fp32.item(), '.60f')}")
print(f"fp16 tensor: {format(tensor_fp16.item(), '.60f')}")
print(f"bf16 tensor: {format(tensor_bf16.item(), '.60f')}")
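
If you want to reproduce the fp32 and fp16 rounding without PyTorch, the standard library's `struct` module can pack a Python float (fp64) into a narrower format and unpack it again. This is only a sketch: the helper names `as_fp32` and `as_fp16` are hypothetical, and `struct` has no format code for bfloat16, so that type is left out here.

```python
import struct


def as_fp32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def as_fp16(x):
    """Round a Python float (fp64) to the nearest fp16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


value = 1 / 3
for name, stored in [("fp64", value),
                     ("fp32", as_fp32(value)),
                     ("fp16", as_fp16(value))]:
    error = abs(stored - value)
    print(f"{name}: {stored:.60f}  (abs error vs fp64: {error:.3e})")
```

The fewer bits a format has, the earlier the printed digits start to drift away from the fp64 value.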

In [None]:
# Information of `16-bit brain floating point`
torch.finfo(torch.bfloat16)

In [None]:
# Information of `32-bit floating point`
torch.finfo(torch.float32)
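
The `max`, `eps`, and `smallest_normal` fields that `torch.finfo` prints can be derived from how many exponent and mantissa bits a format has. The sketch below assumes the standard layouts (fp16: 5 exponent and 10 mantissa bits; bf16: 8 and 7; fp32: 8 and 23; fp64: 11 and 52, each with 1 sign bit) and uses a hypothetical helper, `float_limits`, written in plain Python:

```python
# (exponent bits, mantissa bits); every format also has 1 sign bit.
FORMATS = {
    "float16": (5, 10),
    "bfloat16": (8, 7),
    "float32": (8, 23),
    "float64": (11, 52),
}


def float_limits(exponent_bits, mantissa_bits):
    """Derive max, eps and smallest normal value from the bit layout."""
    bias = 2 ** (exponent_bits - 1) - 1
    eps = 2.0 ** -mantissa_bits            # gap between 1.0 and the next value
    largest = (2.0 - eps) * 2.0 ** bias    # all mantissa bits set, top exponent
    smallest_normal = 2.0 ** (1 - bias)    # smallest positive normal number
    return {"max": largest, "eps": eps, "smallest_normal": smallest_normal}


for name, (exponent_bits, mantissa_bits) in FORMATS.items():
    limits = float_limits(exponent_bits, mantissa_bits)
    print(f"{name:>8}: max={limits['max']:.4e}  eps={limits['eps']:.4e}  "
          f"smallest_normal={limits['smallest_normal']:.4e}")
```

Note how bfloat16 shares float32's exponent width, and therefore roughly its range, while its `eps` is much larger: it trades precision for range.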

In [None]:
### Information of `16-bit floating point`


In [None]:
### Information of `64-bit floating point`


### Downcasting

In [None]:
# random pytorch tensor: float32, size=1000
tensor_fp32 = torch.rand(1000, dtype = torch.float32)

**Note:** As it is random, the values you get will be different from the video.

In [None]:
# first 5 elements of the random tensor
tensor_fp32[:5]

In [None]:
# downcast the tensor to bfloat16 using the "to" method
tensor_fp32_to_bf16 = tensor_fp32.to(dtype = torch.bfloat16)

In [None]:
tensor_fp32_to_bf16[:5]

In [None]:
# tensor_fp32 x tensor_fp32
m_float32 = torch.dot(tensor_fp32, tensor_fp32)

In [None]:
m_float32

In [None]:
# tensor_fp32_to_bf16 x tensor_fp32_to_bf16
m_bfloat16 = torch.dot(tensor_fp32_to_bf16, tensor_fp32_to_bf16)

In [None]:
m_bfloat16
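
To see where the gap between the two dot products comes from, the sketch below emulates bfloat16 rounding in plain Python, with no PyTorch required. The helper `to_bf16` is hypothetical: it keeps the top 16 bits of the fp32 encoding with round-to-nearest-even, and it ignores NaN, infinity, and overflow. It rounds only the inputs and accumulates in Python's fp64, so it will not reproduce `torch.dot` on bfloat16 tensors exactly (that result is itself stored in bfloat16).

```python
import random
import struct


def to_bf16(x):
    """Round a float to the nearest bfloat16 value (returned as a float).

    Illustrative only: no special handling of NaN, infinity or overflow.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


random.seed(0)
values = [random.random() for _ in range(1000)]
values_bf16 = [to_bf16(v) for v in values]

dot_full = sum(v * v for v in values)
dot_bf16_inputs = sum(v * v for v in values_bf16)

print("first 5 values:        ", values[:5])
print("first 5 values as bf16:", values_bf16[:5])
print("dot product (full precision inputs):", dot_full)
print("dot product (bf16-rounded inputs):  ", dot_bf16_inputs)
```

Each rounded input is off by a small relative amount, and those per-element errors carry through to the final sum.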

#### Note
- You'll use "downcasting" as a simple form of quantization in the next lesson.

# Lesson 3: Loading ML Models with Different Data Types

In this lab, you will load ML models in different data types.

- Load the Dummy Model from the helper file.
- To access the `helper.py` file, you can click `File --> Open...`, on the top left.

In [None]:
import os

# Ensure the directory exists and get the absolute path
cache_dir = "cache"
os.makedirs(cache_dir, exist_ok=True)
cache_dir_path = os.path.abspath(cache_dir)

# Set the environment variable
os.environ['HF_HOME'] = cache_dir_path

In [None]:
from helper import DummyModel

In [None]:
model = DummyModel()

In [None]:
model

- Create a function to inspect the data types of the parameters in a model.

In [None]:
def print_param_dtype(model):
    for name, param in model.named_parameters():
        print(f"{name} is loaded in {param.dtype}")

In [None]:
print_param_dtype(model)

## Model Casting: `float16`

- Cast the model into a different precision.

In [None]:
# float 16
model_fp16 = DummyModel().half()

- Inspect the data types of the parameters.

In [None]:
print_param_dtype(model_fp16)

In [None]:
model_fp16

- Run simple inference using the model.

In [None]:
import torch

In [None]:
dummy_input = torch.LongTensor([[1, 0], [0, 1]])

In [None]:
# inference using float32 model
logits_fp32 = model(dummy_input)

In [None]:
logits_fp32

In [None]:
# inference using float16 model
try:
    logits_fp16 = model_fp16(dummy_input)
except Exception as error:

    print("\033[91m", type(error).__name__, ": ", error, "\033[0m")

## Model Casting: `bfloat16`

#### Note about deepcopy
- `copy.deepcopy` makes a copy of the model that is independent of the original. Modifications you make to the copy will not affect the original, because you're making a "deep copy". For more details, see the Python docs on the [copy](https://docs.python.org/3/library/copy.html) library.

In [None]:
from copy import deepcopy

In [None]:
model_bf16 = deepcopy(model)

In [None]:
model_bf16 = model_bf16.to(torch.bfloat16)

In [None]:
print_param_dtype(model_bf16)

In [None]:
logits_bf16 = model_bf16(dummy_input)

- Now, compare the difference between `logits_fp32` and `logits_bf16`.

In [None]:
mean_diff = torch.abs(logits_bf16 - logits_fp32).mean().item()
max_diff = torch.abs(logits_bf16 - logits_fp32).max().item()

print(f"Mean diff: {mean_diff} | Max diff: {max_diff}")

## Using Popular Generative Models in Different Data Types

- Load [Salesforce/blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base) to perform image captioning.

#### To get the sample code that Younes showed:
- Click on the "Model Card" tab.
- On the right, click the "<> Use in Transformers" button. A popup appears with sample code for loading this model.

```Python
# Load model directly
from transformers import AutoProcessor, AutoModelForSeq2SeqLM

processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/blip-image-captioning-base")
```

- To see the sample code with an example, click on "Read model documentation" at the bottom of the popup.  It opens a new tab.
  https://huggingface.co/docs/transformers/main/en/model_doc/blip#transformers.BlipForConditionalGeneration
- On this page, scroll down a bit, past the "Parameters" section, and you'll see "Examples:"

```Python
from PIL import Image
import requests
from transformers import AutoProcessor, BlipForConditionalGeneration

processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "A picture of"

inputs = processor(images=image, text=text, return_tensors="pt")

outputs = model(**inputs)
```

In [None]:
from transformers import BlipForConditionalGeneration

In [None]:
model_name = "Salesforce/blip-image-captioning-base"

In [None]:
model = BlipForConditionalGeneration.from_pretrained(model_name)

In [None]:
# inspect the default data types of the model

print_param_dtype(model)

- Check the memory footprint of the model. 

In [None]:
fp32_mem_footprint = model.get_memory_footprint()

In [None]:
print("Footprint of the fp32 model in bytes: ",
      fp32_mem_footprint)
print("Footprint of the fp32 model in MBs: ", 
      fp32_mem_footprint/1e+6)

- Load the same model in `bfloat16`.

In [None]:
model_bf16 = BlipForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
)

In [None]:
bf16_mem_footprint = model_bf16.get_memory_footprint()

In [None]:
# Get the relative size: ratio of the bf16 footprint to the fp32 footprint
relative_diff = bf16_mem_footprint / fp32_mem_footprint

print("Footprint of the bf16 model in MBs: ", 
      bf16_mem_footprint/1e+6)
print(f"Relative diff: {relative_diff}")
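
A rough way to sanity-check these numbers: the footprint is approximately the number of parameters times the bytes per element of the data type. The sketch below uses a hypothetical parameter count (not BLIP's real one) and a hypothetical helper, `estimate_footprint_bytes`. The value measured by `get_memory_footprint()` can differ slightly from this estimate, for example if the model holds buffers in other data types.

```python
# Bytes used to store one element of each data type.
BYTES_PER_ELEMENT = {
    "float64": 8,
    "float32": 4,
    "float16": 2,
    "bfloat16": 2,
    "int8": 1,
}


def estimate_footprint_bytes(num_parameters, dtype_name):
    """Estimate parameter memory, ignoring buffers and other overhead."""
    return num_parameters * BYTES_PER_ELEMENT[dtype_name]


# Hypothetical parameter count, chosen only for illustration.
num_parameters = 250_000_000

fp32_bytes = estimate_footprint_bytes(num_parameters, "float32")
bf16_bytes = estimate_footprint_bytes(num_parameters, "bfloat16")

print("Estimated fp32 footprint in MBs:", fp32_bytes / 1e+6)
print("Estimated bf16 footprint in MBs:", bf16_bytes / 1e+6)
print("Estimated ratio (bf16 / fp32):  ", bf16_bytes / fp32_bytes)
```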

### Model Performance: `float32` vs `bfloat16`

- Now, compare the generation results of the two models.

In [None]:
from transformers import BlipProcessor

In [None]:
processor = BlipProcessor.from_pretrained(model_name)

- Load the image.

In [None]:
from helper import load_image, get_generation
from IPython.display import display

img_url = 'https://storage.googleapis.com/\
sfr-vision-language-research/BLIP/demo.jpg'

image = load_image(img_url)
display(image.resize((500, 350)))

In [None]:
results_fp32 = get_generation(model, 
                              processor, 
                              image, 
                              torch.float32)

In [None]:
print("fp32 Model Results:\n", results_fp32)

In [None]:
results_bf16 = get_generation(model_bf16, 
                              processor, 
                              image, 
                              torch.bfloat16)

In [None]:
print("bf16 Model Results:\n", results_bf16)

### Default Data Type

- In the Hugging Face Transformers library, models are loaded in `float32` by default.
- You can change PyTorch's default data type to the one you want.

In [None]:
# Remember, you likely want to reset the default dtype if you are loading other data after loading the model
desired_dtype = torch.bfloat16
torch.set_default_dtype(desired_dtype)

In [None]:
dummy_model_bf16 = DummyModel()

In [None]:
print_param_dtype(dummy_model_bf16)

- Similarly, you can reset the default data type to float32.

In [None]:
torch.set_default_dtype(torch.float32)

In [None]:
print_param_dtype(dummy_model_bf16)

### Note
- You just used a simple form of quantization, in which the model's parameters are saved in a more compact data type (bfloat16).  During inference, the model performs its calculations in this data type, and its activations are in this data type.
- In the next lesson, you will use another quantization method, "linear quantization", which enables the quantized model to maintain performance much closer to the original model by converting from the compressed data type back to the original FP32 data type during inference.

# Lesson 4: Quantization Theory

In this lab, you will perform Linear Quantization.
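
Before using a library, it can help to see the arithmetic on a handful of numbers. The sketch below is a simplified, plain-Python illustration of linear quantization with a scale and a zero point, mapping floats to the int8 range and back. The helper names `linear_quantize` and `linear_dequantize` are hypothetical, and this is not how the `quanto` library implements it internally.

```python
def linear_quantize(values, q_min=-128, q_max=127):
    """Map floats to integers in [q_min, q_max] using a scale and zero point."""
    r_min, r_max = min(values), max(values)
    # Guard against a constant input, where the range would be zero.
    scale = (r_max - r_min) / (q_max - q_min) if r_max > r_min else 1.0
    zero_point = round(q_min - r_min / scale)
    zero_point = max(q_min, min(q_max, zero_point))
    quantized = [max(q_min, min(q_max, round(v / scale) + zero_point))
                 for v in values]
    return quantized, scale, zero_point


def linear_dequantize(quantized, scale, zero_point):
    """Recover approximate floats: r = scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in quantized]


weights = [-1.0, 0.0, 0.4, 2.0]
quantized, scale, zero_point = linear_quantize(weights)
restored = linear_dequantize(quantized, scale, zero_point)

print("quantized: ", quantized)
print("scale:     ", scale)
print("zero point:", zero_point)
print("restored:  ", restored)
```

The restored values are close to the originals but not identical; the gap is the quantization error, which is bounded by about half of `scale` for values that are not clipped.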

#### Libraries to install
- If you are running this notebook on your local machine, you can install the following:

```Python
!pip install transformers==4.35.0
!pip install quanto==0.0.11
!pip install torch==2.1.1
```

## T5-FLAN
- Please note that due to hardware memory constraints, and in order to offer this course for free to everyone, the code you'll run here is for the T5-FLAN model instead of the EleutherAI Pythia model.
- Thank you for your understanding! 🤗

For the T5-FLAN model, here is one more library to install if you are running locally:
```Python
!pip install sentencepiece==0.2.0
```


### Without Quantization

In [None]:
model_name = "google/flan-t5-small"

In [None]:
import sentencepiece as spm

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")

In [None]:
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")

In [None]:
input_text = "Hello, my name is "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

In [None]:
from helper import compute_module_sizes
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")

## Quantize the model (8-bit precision)

In [None]:
from quanto import quantize, freeze
import torch

In [None]:
quantize(model, weights=torch.int8, activations=None)

In [None]:
print(model)

### Freeze the model
- This step takes a bit of memory, and so for the Pythia model that is shown in the lecture video, it will not run in the classroom.
- This will work fine with the smaller T5-Flan model.

In [None]:
freeze(model)

In [None]:
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")

### Try running inference on the quantized model

In [None]:
input_text = "Hello, my name is "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

## Note: Quantizing the model used in the lecture video will not work due to classroom hardware limitations.
- Here is the code that Marc, the instructor, walks through in the video.
- It will likely run on your local computer if you have 8GB of memory, which is usually the minimum for personal computers.
  - To run locally, you can download the notebook and the helper.py file by clicking on the "Jupyter icon" at the top of the notebook and navigating the file directory of this classroom.  Also download the requirements.txt to install all the required libraries.

### Without Quantization



- Load [EleutherAI/pythia-410m](https://huggingface.co/EleutherAI/pythia-410m) model and tokenizer.

```Python
from transformers import AutoModelForCausalLM
model_name = "EleutherAI/pythia-410m"

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             low_cpu_mem_usage=True)
print(model.gpt_neox)


from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

- Write the start of a sentence (`text`) that you'd like the model to complete.
```Python
text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
outputs
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

- Compute the model's size using the helper function, `compute_module_sizes`.
```Python
from helper import compute_module_sizes
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")
print(model.gpt_neox.layers[0].attention.dense.weight)
```
**Note:** The weights are in `fp32`.

### 8-bit Quantization

```Python
from quanto import quantize, freeze
import torch

quantize(model, weights=torch.int8, activations=None)
# after performing quantization
print(model.gpt_neox)
print(model.gpt_neox.layers[0].attention.dense.weight)
```

- The "freeze" function requires more memory than is available in this classroom.
- This code will run on a machine that has 8GB of memory, and so it will likely work if you run this code on your local machine.

```Python
# freeze the model
freeze(model)
print(model.gpt_neox.layers[0].attention.dense.weight)

# get model size after quantization
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")

# run inference after quantizing the model
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

#### Comparing "linear quantization" to "downcasting"

To recap the difference between the "linear quantization" method in this lesson with the "downcasting" method in the previous lesson:

- When downcasting a model, you convert the model's parameters to a more compact data type (bfloat16).  During inference, the model performs its calculations in this data type, and its activations are in this data type.  Downcasting may work with the bfloat16 data type, but the model performance will likely degrade with any smaller data type, and won't work if you convert to an integer data type (like the int8 in this lesson).


- In this lesson, you used another quantization method, "linear quantization", which enables the quantized model to maintain performance much closer to the original model by converting from the compressed data type back to the original FP32 data type during inference. So when the model makes a prediction, it is performing the matrix multiplications in FP32, and the activations are in FP32.  This enables you to quantize the model in data types smaller than bfloat16, such as int8, in this example.
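
The second point can be sketched for a single linear layer: the weights are stored as int8 plus one scale, then converted back to floats right before the matrix multiplication. This is a simplified illustration in plain Python, not the `quanto` implementation; the helper names are hypothetical, it uses symmetric per-tensor quantization, and Python's fp64 floats stand in for FP32.

```python
def quantize_symmetric_int8(weights):
    """Store a weight matrix as int8 values plus a single float scale."""
    max_abs = max(abs(w) for row in weights for w in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q_weights = [[max(-127, min(127, round(w / scale))) for w in row]
                 for row in weights]
    return q_weights, scale


def linear_forward_float(weights, x):
    """Reference forward pass with the original float weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def linear_forward_quantized(q_weights, scale, x):
    """Dequantize the int8 weights, then multiply in floating point."""
    return [sum((q * scale) * xi for q, xi in zip(row, x))
            for row in q_weights]


weights = [[0.12, -0.53, 0.91],
           [-0.27, 0.44, 0.08]]
x = [1.0, 2.0, -1.5]

q_weights, scale = quantize_symmetric_int8(weights)
print("int8 weights:    ", q_weights)
print("float output:    ", linear_forward_float(weights, x))
print("quantized output:", linear_forward_quantized(q_weights, scale, x))
```

Only the stored weights are in int8; the activations and the arithmetic stay in floating point, which is why the two outputs remain close.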

#### This is just the beginning...
- This course is intended to be a beginner-friendly introduction to the field of quantization. 🐣
- If you'd like to learn more about quantization, please stay tuned for another Hugging Face short course that goes into more depth on this topic (launching in a few weeks!) 🤗

## Did you like this course?

- If you liked this course, could you consider giving a rating and sharing what you liked? 💕
- If you did not like this course, could you also please share what you think could have made it better? 🙏

#### A note about the "Course Review" page.
The rating options are from 0 to 10.
- A score of 9 or 10 means you like the course.🤗
- A score of 7 or 8 means you feel neutral about the course (neither like nor dislike).🙄
- A score of 0, 1, 2, 3, 4, 5, or 6 means that you do not like the course. 😭
  - Whether you give a 0 or a 6, these are all defined as "detractors" according to the standard measurement called "Net Promoter Score". 🧐