# What is 4-bit Quantization?

In deep learning, quantization is the process of reducing the number of bits used to represent a model's weights and biases.

Weights and biases are the model's learned parameters: the numbers that backpropagation updates during training.

In 4-bit quantization, each weight or bias is represented using only 4 bits as opposed to the typical 32 bits used in single-precision floating-point format (float32).
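As a quick illustration of what 4 bits can hold, the sketch below enumerates every 4-bit pattern. The variable names are illustrative only.

```python
# Enumerate every value a 4-bit code can take.
num_bits = 4
num_levels = 2 ** num_bits  # number of distinct bit patterns

# Each code is shown as its 4-character binary string.
bit_patterns = [format(code, "04b") for code in range(num_levels)]

print(f"{num_bits} bits give {num_levels} distinct codes")
print(bit_patterns)
```

Every weight in a 4-bit quantized model has to be mapped onto one of these codes.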

# Why does it use less GPU Memory?

The primary advantage of using 4-bit quantization is the reduction in model size and memory usage. Here's a simple explanation:

- A float32 number takes up 32 bits of memory.
- A 4-bit quantized number takes up only 4 bits of memory.

So, in theory, you can fit 8 times as many 4-bit quantized numbers into the same memory as float32 numbers. In practice the ratio is slightly lower, because quantization schemes also store scaling constants. This lets you load larger models into GPU memory, or run a model on a smaller GPU that could not otherwise hold it.
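To put the 8x ratio in more familiar units, here is a rough sketch that estimates the memory needed for the weights alone. The parameter count of 7 billion is an assumption chosen for illustration, and the helper `weight_memory_gb` is hypothetical. The estimate ignores activations, quantization constants, and any other runtime overhead.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7_000_000_000  # assumed model size, for illustration only

float32_gb = weight_memory_gb(num_params, 32)
bit4_gb = weight_memory_gb(num_params, 4)

print(f"float32 weights: {float32_gb} GB")
print(f"4-bit weights:   {bit4_gb} GB")
print(f"ratio:           {float32_gb / bit4_gb}x")
```

Under these assumptions the float32 weights need 28 GB, while the 4-bit weights need 3.5 GB.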


The amount of memory used by an integer in a computer system is directly related to the number of bits used to represent that integer.

### Memory Usage for 4-bit Integer

A 4-bit integer uses 4 bits of memory. 

### Memory Usage for 32-bit Integer

A 32-bit integer uses 32 bits of memory.

### Conversion to Bytes

To convert these to bytes (since memory is often measured in bytes):

- 1 byte = 8 bits
- A 4-bit integer would use \( 4/8 = 0.5 \) bytes, so two 4-bit values fit in one byte.
- A 32-bit integer would use \( 32/8 = 4 \) bytes.
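Memory is addressed in whole bytes, so the figure of 0.5 bytes per value is only reached by packing two 4-bit codes into each byte. The sketch below shows one simple way to do that with NumPy. The helpers `pack_4bit` and `unpack_4bit` are hypothetical names, and real libraries may use a different layout.

```python
import numpy as np

def pack_4bit(codes):
    """Pack 4-bit codes (values 0..15) two per byte, first code in the high nibble."""
    codes = np.asarray(codes, dtype=np.uint8)
    if codes.size % 2:
        codes = np.append(codes, np.uint8(0))  # pad to an even length
    return (codes[0::2] << 4) | codes[1::2]

def unpack_4bit(packed, count):
    """Recover the first `count` 4-bit codes from a packed byte array."""
    packed = np.asarray(packed, dtype=np.uint8)
    high = packed >> 4    # upper 4 bits of each byte
    low = packed & 0x0F   # lower 4 bits of each byte
    return np.stack([high, low], axis=1).reshape(-1)[:count]

codes = np.array([14, 4, 6, 8, 10, 11], dtype=np.uint8)
packed = pack_4bit(codes)
restored = unpack_4bit(packed, codes.size)

print(f"unpacked storage: {codes.nbytes} bytes")
print(f"packed storage:   {packed.nbytes} bytes")
print(f"round trip ok:    {np.array_equal(codes, restored)}")
```

Storing each code in its own `uint8` uses one byte per value. Packing halves that to the 0.5 bytes per value computed above.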



### Llama 2 example

For example, you may come across a config like this when loading a Llama 2 model:

    bnb_config = transformers.BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=bfloat16
    )


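* load_in_4bit=True: Loads the model weights in 4-bit precision.
* bnb_4bit_quant_type='nf4': Selects the 4-bit data type. 'nf4' is 4-bit NormalFloat, a format designed for weights that are roughly normally distributed.
* bnb_4bit_use_double_quant=True: Enables double quantization, which also quantizes the quantization constants to save a little more memory.
* bnb_4bit_compute_dtype=bfloat16: Sets the data type used for computation. The 4-bit weights are dequantized to bfloat16 before the matrix multiplications.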

By using 4-bit quantization, you can load the Llama 2 model with significantly less GPU memory, making it more accessible for devices with limited resources.
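To make the effect of double quantization concrete, here is a rough sketch of the per-weight overhead of storing quantization constants. The block sizes (64 weights per block, 256 constants per second-level block) and the 32-bit and 8-bit constant widths are assumptions based on the figures usually quoted for QLoRA-style double quantization. They are not read from the config above and may differ between library versions.

```python
# All sizes below are assumptions for illustration, not values read from a library.
block_size = 64                # weights that share one quantization constant
second_level_block_size = 256  # constants that share one second-level constant

# Without double quantization: one 32-bit constant per block of weights.
plain_overhead_bits = 32 / block_size

# With double quantization: one 8-bit constant per block, plus one 32-bit
# constant per group of second-level constants.
double_overhead_bits = 8 / block_size + 32 / (block_size * second_level_block_size)

saved_bits = plain_overhead_bits - double_overhead_bits

print(f"overhead without double quantization: {plain_overhead_bits} bits per weight")
print(f"overhead with double quantization:    {double_overhead_bits} bits per weight")
print(f"saving:                               {saved_bits} bits per weight")
```

Under these assumptions the overhead falls from 0.5 bits to roughly 0.127 bits per weight. That is small next to the 4 bits stored per weight, but it adds up over billions of parameters.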

# How much memory is saved?

In [1]:
# Memory required for float32 weights

float32_memory = 32  # in bits
num_weights = 1000  # hypothetical number of weights

float32_total_memory = float32_memory * num_weights  # in bits

# Memory required for 4-bit quantized weights
bit4_memory = 4  # in bits

bit4_total_memory = bit4_memory * num_weights  # in bits

# Memory saved
memory_saved = float32_total_memory - bit4_total_memory  # in bits
memory_saved_in_bytes = memory_saved / 8  # convert bits to bytes

print(f"Memory saved by using 4-bit quantization: {memory_saved_in_bytes} bytes")


Memory saved by using 4-bit quantization: 3500.0 bytes


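# Does reducing the bit width from 32 bits to 4 bits cost accuracy?

Yes. Four bits can encode only 16 distinct values, so every weight is rounded to the nearest of those 16 levels, and the rounding error is a loss of precision. The simulation below measures that error for random weights between 0 and 1.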

In [2]:
import numpy as np

# Simulate original float32 weights
original_weights = np.random.rand(1000).astype(np.float32)

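# Simulate 4-bit quantized weights
# The weights already lie in [0, 1), so scale them to 0..15 (4 bits can represent 16 values) and round.
# Note: uint8 is the smallest NumPy integer type, so each 4-bit code still occupies a full byte here.
quantized_weights = np.round(original_weights * 15).astype(np.uint8)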

# De-normalize to get the approximated original weights
approximated_weights = quantized_weights / 15.0

# Calculate the error
error = np.abs(original_weights - approximated_weights).mean()

print(f"Average Quantization Error: {error}")


Average Quantization Error: 0.016702707617295285
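To see how the error depends on the number of bits, the sketch below repeats the same idea for 2, 4, and 8 bits. It uses simple uniform min/max quantization on normally distributed weights. The helper `quantize_dequantize`, the seed, and the weight distribution are all assumptions made for illustration. This is not the NF4 scheme used by the Llama 2 config above.

```python
import numpy as np

def quantize_dequantize(weights, num_bits):
    """Round weights onto 2**num_bits evenly spaced levels, then map back to floats."""
    max_code = 2 ** num_bits - 1
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / max_code  # distance between adjacent levels
    codes = np.round((weights - w_min) / scale).astype(np.uint8)
    return codes * scale + w_min

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
weights = rng.normal(0.0, 0.02, size=1000).astype(np.float32)

errors = {}
for num_bits in (2, 4, 8):
    approximated = quantize_dequantize(weights, num_bits)
    errors[num_bits] = float(np.abs(weights - approximated).mean())
    print(f"{num_bits}-bit average quantization error: {errors[num_bits]}")
```

Each extra bit doubles the number of levels and roughly halves the spacing between them, so the average error shrinks as the bit width grows.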


In [3]:
original_weights

array([0.9247905 , 0.2848272 , 0.37645584, 0.5048465 , 0.6614357 ,
       0.70491624, 0.1977602 , 0.6689422 , 0.66872555, 0.5624521 ,
       0.6305628 , 0.3726736 , 0.35116363, 0.07541367, 0.45414782,
       0.76398283, 0.05985678, 0.5913115 , 0.41074526, 0.13541609,
       0.5910739 , 0.86276126, 0.06655507, 0.9818092 , 0.3161168 ,
       0.3306428 , 0.39539385, 0.20723748, 0.9959016 , 0.30415216,
       0.1939258 , 0.74822813, 0.27812615, 0.7735159 , 0.529855  ,
       0.4883202 , 0.4443134 , 0.3167042 , 0.7455749 , 0.5094897 ,
       0.8444225 , 0.29333502, 0.2650708 , 0.6700357 , 0.32049704,
       0.41324535, 0.2554307 , 0.46257398, 0.5577883 , 0.07777528,
       0.5722218 , 0.86186   , 0.4652247 , 0.39145812, 0.114711  ,
       0.76122326, 0.5548202 , 0.44338274, 0.20434345, 0.68144834,
       0.80105406, 0.21737652, 0.57614595, 0.79485035, 0.6377741 ,
       0.01437411, 0.08599136, 0.903764  , 0.84312135, 0.31127113,
       0.9599578 , 0.51527417, 0.6231499 , 0.5020487 , 0.63205

In [4]:
quantized_weights

array([14,  4,  6,  8, 10, 11,  3, 10, 10,  8,  9,  6,  5,  1,  7, 11,  1,
        9,  6,  2,  9, 13,  1, 15,  5,  5,  6,  3, 15,  5,  3, 11,  4, 12,
        8,  7,  7,  5, 11,  8, 13,  4,  4, 10,  5,  6,  4,  7,  8,  1,  9,
       13,  7,  6,  2, 11,  8,  7,  3, 10, 12,  3,  9, 12, 10,  0,  1, 14,
       13,  5, 14,  8,  9,  8,  9, 12, 15,  9,  5,  2, 15, 13,  6, 14,  9,
       11, 12, 14,  7,  5,  1,  6, 13, 12,  5, 13, 13,  6,  2,  5, 13,  1,
       14,  2, 10,  0,  9, 14,  1,  3, 12,  8,  2, 14,  9, 14,  5,  2,  0,
        0,  2,  4,  5,  7,  5,  3,  8, 11,  2,  5, 11, 14,  8, 12,  6, 12,
       12, 12, 14,  6,  0,  6, 10,  2,  0,  7, 13, 12,  8,  2,  7,  2,  9,
        2, 11,  4, 14,  9,  0,  7, 13,  9,  1,  9,  4,  4, 11, 14,  4,  0,
       11, 11, 14,  3,  8,  2,  4, 15,  3,  6, 15,  6,  5,  6,  2,  7,  0,
        1,  3,  7, 14,  4,  7, 15,  9, 11, 12,  1, 10,  3,  1, 12, 13,  5,
       14,  9,  5, 13,  1,  0, 14, 14,  8,  6,  5,  2,  2, 14,  8,  1,  9,
        1,  6,  9,  9, 14