<a href="https://colab.research.google.com/github/mzohaibnasir/finetuningLLM/blob/main/01_QuantizationBasics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Table of Contents

1. Fine-tuning LLMs
2. Quantization
  1. Full precision / half precision -> data -> weights and parameters
  2. Calibration -> model optimization -> problems
  3. Modes of Quantization
    1. Post-training quantization
    2. Quantization-aware training

# Quantization
Quantization is the conversion of data from a higher-memory format (more bits) to a lower-memory format (fewer bits). It compresses information by representing a wide range of values with a smaller set of discrete values.

Here's a breakdown of what happens during quantization:

1. **Input with high precision:**  This could be an analog signal (continuous values) or a digital signal with a large number of bits (e.g., 32-bit floating-point numbers).
2. **Mapping to discrete levels:** The original values are mapped to a finite set of pre-defined levels. This often involves rounding or truncation, which introduces some loss of information.
3. **Reduced memory footprint:** The resulting data uses fewer bits to represent each value, leading to a smaller overall size.

This conversion is often a trade-off between:

* **Accuracy:**  Higher precision formats provide more accurate representation, but require more memory.
* **Efficiency:** Lower precision formats use less memory but may introduce errors due to quantization.
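
As a rough illustration of this trade-off, the sketch below quantizes a few made-up values in [0, 1] to 8 bits and compares memory use and rounding error. The sample values and the simple `round(v * 255)` mapping are assumptions chosen for illustration, not a method taken from any particular library.

```python
from array import array

# Hypothetical sample values in [0.0, 1.0]; any float data would do.
values = [0.0, 0.1234, 0.5, 0.9876, 1.0]

full = array("f", values)          # 32-bit floats, 4 bytes each
levels = 255                       # uint8 offers the discrete levels 0..255

# Map each value to the nearest of the 256 levels (this is the lossy step).
quantized = array("B", (round(v * levels) for v in values))

# Map the levels back to floats to see how much information was lost.
restored = [q / levels for q in quantized]

bytes_full = full.itemsize * len(full)
bytes_quant = quantized.itemsize * len(quantized)
max_error = max(abs(v - r) for v, r in zip(values, restored))

print(bytes_full, bytes_quant)     # 4x fewer bytes after quantization
print(max_error)                   # never more than half a level (0.5 / 255)
```

The memory drops by a factor of four, and the price is a rounding error bounded by half the spacing between two levels.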

Quantization is widely used in various applications:

* **Digital Signal Processing (DSP):** Converting analog signals to digital form for processing often involves quantization.
* **Image and Audio Compression:** JPEG images and MP3 audio use quantization techniques to reduce file size.
* **Machine Learning:** Quantizing machine learning models can enable deployment on devices with limited memory resources.

By understanding quantization, you can make informed decisions about balancing memory usage and accuracy in various applications.

---

## Full precision
In ML jargon, FP32 is often called "full precision" or "single precision". Each FP32 number is stored in 32 bits (4 bytes) of memory, which gives high accuracy across a wide range of values.

The largest Llama 2 model has 70 billion parameters (weights and biases). At 4 bytes per parameter, that is roughly 280 GB in FP32 for the weights alone.

* fp32: full precision / single precision (32 bits, 4 bytes)
* fp16: half precision (16 bits, 2 bytes)
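
To make the storage cost concrete, here is a small back-of-the-envelope calculation for a 70-billion-parameter model. The helper name `model_size_gb` is hypothetical, and the estimate covers only the parameters themselves (no activations, optimizer state, or file-format overhead).

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def model_size_gb(num_params, dtype):
    """Approximate parameter storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

num_params = 70_000_000_000  # 70 billion parameters

for dtype in ("fp32", "fp16", "int8"):
    print(dtype, model_size_gb(num_params, dtype), "GB")
```

Going from fp32 to fp16 halves the footprint, and going to int8 cuts it to a quarter.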


---
## So what if we convert a model from 32-bit floating point to int8?

Inference becomes much cheaper. This conversion is called **QUANTIZATION**.

### Quantization makes inference easier.

Converting a model from FP32 (32-bit floating point) to INT8 (8-bit integer) through quantization makes inference easier in several ways:

* **Reduced Model Size:** Quantization significantly reduces the size of the model by representing weights and activations with fewer bits. An INT8 value takes a quarter of the space of an FP32 value (1 byte vs 4 bytes). A smaller model translates to:
    * **Faster Loading:** Smaller models load faster onto devices, reducing the time it takes to get the model up and running for inference.
    * **Lower Storage Requirements:**  Smaller models require less storage space on the device, which is crucial for resource-constrained devices like mobile phones.

* **Faster Inference Speed:** Many devices have hardware that executes low-precision integer arithmetic faster than floating-point arithmetic. On such hardware, INT8 computations during inference can be faster and more efficient, leading to quicker predictions.

* **Potential for Wider Deployment:**  With a smaller size and faster inference speed, quantized models become suitable for deployment on a wider range of devices, including those with limited memory and processing power. This opens doors for applications on mobile devices, embedded systems, and even internet-of-things (IoT) devices.

While quantization offers these advantages, it's important to remember:

* **Potential Accuracy Loss:**  Quantization can introduce a slight decrease in accuracy due to the loss of precision during conversion from FP32 to INT8.
* **Finding the Right Balance:** The key is to find the optimal precision level (e.g., INT8) that offers a good balance between efficiency and accuracy for your specific application.

Overall, quantization is a powerful technique for making inference easier by enabling faster, more efficient model execution on various devices. It's a valuable tool for deploying machine learning models in real-world scenarios.



---
**But there is a trade-off between efficiency and accuracy.**

---
**After quantization, you can further fine-tune the model to adapt to the reduced precision and improve its accuracy**

## Calibration
Calibration is how we work out the mapping used to convert a 32-bit model to 8 bits.

## How to perform Quantization

* Types of Quantization
  1. Symmetric Quantization
      * Batch normalization is not itself a quantization method, but it tends to center values around zero, which suits symmetric quantization
  2. Asymmetric Quantization

### Symmetric Quantization (data is evenly distributed)



#### 1. Symmetric uint8 Quantization

Assume the weight values range from 0 to 1000: [0.0, 1000.0]. These are the weights of the language model, stored in 32 bits. (In practice, weights usually span a much smaller range.)

**[0.0, ... , 1000.0]**  ->  **weights**  ->  **32 bits** (**2^32** possible values), converted to **uint8** (**2^8** values, 0-255)

**So we'll be converting from [0, 1000] -> [0, 255]**

---

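### How is single-precision floating point (FP32) stored?

1. 1 bit stores the sign
2. the next 8 bits store the exponent
3. the remaining 23 bits store the mantissa (the fractional part of the significand)


For 7.32, which is 1.83 × 2^2 in normalized binary form:
1. sign bit: 0 (the number is positive)
2. exponent: 2, stored with a bias of 127 as 129
3. mantissa: the fractional part .83 of the significand 1.83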
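
The layout can be checked directly with Python's standard `struct` module. This is a minimal sketch: the helper name `fp32_fields` is hypothetical, and it relies on the IEEE 754 rule that a normal number equals (1 + mantissa / 2^23) × 2^(exponent − 127).

```python
import struct

def fp32_fields(x):
    """Split a number, stored as IEEE 754 single precision, into its bit fields."""
    # Pack as a 32-bit float, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # 23 bits, fraction of the significand
    return sign, exponent, mantissa

sign, exponent, mantissa = fp32_fields(7.32)

# Rebuild the value from its fields to confirm the decoding.
rebuilt = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)

print(sign, exponent, exponent - 127)  # sign bit, stored exponent, true exponent
print(rebuilt)                         # close to 7.32, not exactly equal
```

The rebuilt value differs slightly from 7.32 because 23 mantissa bits cannot represent .83 exactly.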


---

### How is half-precision floating point (FP16) stored?
1. 1 bit stores the sign
2. the next 5 bits store the exponent
3. the remaining 10 bits store the mantissa
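
The same inspection works for half precision, since `struct` supports the 16-bit `e` format. The helper names `fp16_fields` and `to_fp16` are hypothetical. The sketch assumes the IEEE 754 half-precision exponent bias of 15.

```python
import struct

def fp16_fields(x):
    """Split a number, stored as IEEE 754 half precision, into its bit fields."""
    (bits,) = struct.unpack(">H", struct.pack(">e", x))
    sign = bits >> 15                # 1 bit
    exponent = (bits >> 10) & 0x1F   # 5 bits, stored with a bias of 15
    mantissa = bits & 0x3FF          # 10 bits
    return sign, exponent, mantissa

def to_fp16(x):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack(">e", struct.pack(">e", x))[0]

print(fp16_fields(7.32))   # (sign, stored exponent, mantissa)
print(to_fp16(7.32))       # the value FP16 actually stores for 7.32
```

With only 10 mantissa bits, 7.32 is stored as 7.3203125, so the rounding error is visibly larger than in FP32.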



### Min-Max scaler

The conversion is supposed to look like this:

* 0.0 -> 0
* 1000 -> 255

So the x range is 0 to 1000 and the q range is 0 to 255.


    scale = (xmax - xmin) / (qmax - qmin)

          = (1000 - 0) / (255 - 0) = 3.92

Here, `3.92` is the scale factor.

So if I divide `250` by the scale factor:

    round(250 / 3.92) = 64

This is the new quantized value.
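
The arithmetic above can be reproduced in a few lines. This is a minimal sketch of min-max scaling for the [0, 1000] example. The function names are hypothetical, and the clamping step is an added safeguard for inputs outside the calibrated range.

```python
def min_max_scale(xmin, xmax, qmin=0, qmax=255):
    """Scale factor: how many real-valued units one quantized step covers."""
    return (xmax - xmin) / (qmax - qmin)

def quantize(x, scale, qmin=0, qmax=255):
    """Divide by the scale, round, and clamp into the uint8 range."""
    q = round(x / scale)
    return max(qmin, min(qmax, q))

scale = min_max_scale(0.0, 1000.0)

print(round(scale, 2))            # 3.92
print(quantize(0.0, scale))       # 0
print(quantize(250.0, scale))     # 64
print(quantize(1000.0, scale))    # 255
```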


#### 2. Asymmetric uint8 Quantization (skewed data distribution)

The values are [-20.0, ... , 1000.0].

Now to convert them into uint8: [0, 255]

    scale = (xmax - xmin) / (qmax - qmin)

          = (1000 - (-20)) / (255 - 0) = 4


So, if we divide -20 by the scale factor and add the zero point:

    (-20) / 4 = -5,  -5 + 5 = 0  # 5 is the zero point

And if we divide 1000 by the scale factor and add the zero point:

    (1000) / 4 = 250,  250 + 5 = 255
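
The worked example for the range [-20, 1000] can be written as a short sketch. The function names are hypothetical. The zero point is computed as `qmin - round(xmin / scale)`, which is one common convention and gives the value 5 used in this example.

```python
def asymmetric_params(xmin, xmax, qmin=0, qmax=255):
    """Return (scale, zero_point) so that xmin maps to qmin and xmax to qmax."""
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = qmin - round(xmin / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Scale, round, shift by the zero point, then clamp into the uint8 range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Approximate inverse: undo the shift, then undo the scaling."""
    return (q - zero_point) * scale

scale, zero_point = asymmetric_params(-20.0, 1000.0)

print(scale, zero_point)                      # 4.0 5
print(quantize(-20.0, scale, zero_point))     # 0
print(quantize(1000.0, scale, zero_point))    # 255
print(dequantize(0, scale, zero_point))       # -20.0
```

Note that the real value 0.0 maps to the quantized value 5, which is where the name "zero point" comes from.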
    


The key points of asymmetric uint8 quantization are:

* **Asymmetric Quantization:** The real range is not centered on zero, so a single scale is combined with a non-zero zero point to shift the range onto [0, 255]. It's useful when the data distribution is skewed towards one direction, as in the example above (-20.0 to 1000.0).
* **Scaling Factor:** The scaling factor is the size of one quantization step in the original units. It is calculated as (xmax - xmin) / (qmax - qmin).
* **Zero-Point:** The zero-point is an additive constant that maps the minimum value (xmin) to the bottom of the uint8 range (usually 0). In the example, adding 5 to the scaled value ensures -20.0 maps to 0.

### So there are two important parameters required to perform quantization
1. **zero point**
  1. for symmetric quantization it is 0, and for the asymmetric example it was +5
2. **scale**
  1. the scaling factor sets how many units of the original range each step of the uint8 range covers
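
Written as formulas, the two parameters fit together as follows. The symbols $s$ (scale), $z$ (zero point), $q$ (quantized value) and $\hat{x}$ (reconstructed value) are notation introduced here for illustration. The rounding and clipping convention is one common choice, assumed rather than taken from these notes.

$$
s = \frac{x_{\max} - x_{\min}}{q_{\max} - q_{\min}}, \qquad
z = q_{\min} - \mathrm{round}\!\left(\frac{x_{\min}}{s}\right)
$$

$$
q = \mathrm{clip}\!\left(\mathrm{round}\!\left(\frac{x}{s}\right) + z,\; q_{\min},\; q_{\max}\right), \qquad
\hat{x} = s\,(q - z)
$$

For the range $[-20, 1000]$ with $q \in [0, 255]$ this gives $s = 4$ and $z = 5$. For the range $[0, 1000]$ it gives $s \approx 3.92$ and $z = 0$.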



  ---
## Calibration
Quantization and calibration play different roles:

* **Quantization** focuses on compressing the model by reducing the precision of its weights and activations (e.g., from float32 to int8). This improves efficiency but can lead to a slight drop in accuracy.
* **Calibration** is the step that limits this accuracy loss. It runs the model on a separate calibration dataset and records the distribution of activation values.
* Based on this analysis, calibration techniques determine optimal scaling factors and potentially zero-point adjustments for the quantized weights and activations. These adjustments ensure the quantized values closely represent the original high-precision values, minimizing accuracy degradation.
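
One simple way to picture this is min-max calibration: observe the smallest and largest values across a few calibration batches, then derive the scale and zero point from them. The sketch below assumes that strategy. The function name `calibrate_min_max` and the batch values are made up, and real toolkits often use more robust statistics (for example percentiles) instead of the raw minimum and maximum.

```python
def calibrate_min_max(batches, qmin=0, qmax=255):
    """Derive (scale, zero_point) from the value range seen in calibration data."""
    xmin = min(min(batch) for batch in batches)
    xmax = max(max(batch) for batch in batches)
    if xmax == xmin:
        raise ValueError("calibration data has zero range")
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = qmin - round(xmin / scale)
    return scale, zero_point

# Hypothetical activation values collected from two calibration batches.
calibration_batches = [
    [-20.0, 3.5, 410.0],
    [12.0, 1000.0, -7.25],
]

scale, zero_point = calibrate_min_max(calibration_batches)
print(scale, zero_point)   # 4.0 5
```

The observed range here is [-20, 1000], so the result matches the asymmetric example: a scale of 4 and a zero point of 5.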

Here's an analogy: Imagine you're shrinking high-resolution images (quantization). Calibration would be like adjusting brightness and contrast (scaling factors) on the compressed images to make them look closer to the originals.

By effectively calibrating the model, you can achieve a good balance between model size/efficiency and accuracy when using quantized LLMs.