## Understanding Post-Training Quantization

Post-Training Quantization is a powerful technique designed to reduce the size and computational complexity of deep learning models, making it particularly advantageous for deployment in resource-constrained environments such as mobile devices and embedded systems.

This process involves taking a pre-trained floating-point model and converting its weights and activations to lower-precision formats—typically 8-bit integers—while striving to retain as much accuracy as possible.

### Workflow of Post-Training Quantization:

**Pre-trained Model** → **Observer** → **Calibration (using unlabeled data)** → **Quantized Model**

### Detailed Explanation:

1. **Pre-trained Model:** This is a model that has been trained using standard floating-point precision, usually FP32 (32-bit floating point).

2. **Observers:** Before conversion, observers are inserted into the model. They record the value ranges (and, for some observer types, the distributions) of weights and activations as data passes through the network. These statistics determine the scale and zero-point used to map each tensor to lower precision.

3. **Calibration:** In this step, unlabeled data is fed through the model. Labels are not needed because only the forward pass matters: the observers record the activation ranges at each layer, and the scale and zero-point for the conversion are computed from those ranges.

4. **Quantized Model:** Ultimately, the floating-point model is transformed into a quantized model, typically using int8 (8-bit integer). This conversion leads to a significant reduction in the model's memory footprint and computational demands.

By following these steps, the model is optimized for deployment in environments where memory and processing power are limited, all while aiming to minimize accuracy loss.
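To make the roles of scale and zero-point concrete, here is a minimal sketch of an affine quantize/dequantize round trip in plain Python. The range `[-1.0, 3.0]` and the helper names `quantize` and `dequantize` are illustrative assumptions, not part of the notebook or of PyTorch's API.

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a float to an integer code: clamp(round(x / scale) + zero_point)."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate float."""
    return (q - zero_point) * scale


# Hypothetical observed range, mapped onto the 8-bit codes 0..255.
x_min, x_max = -1.0, 3.0
scale = (x_max - x_min) / 255
zero_point = round(-x_min / scale)  # the integer code that represents 0.0

samples = (-1.0, 0.0, 0.5, 3.0)
for x in samples:
    q = quantize(x, scale, zero_point)
    x_hat = dequantize(q, scale, zero_point)
    print(f"{x:+.3f} -> q={q:3d} -> {x_hat:+.4f}")
```

Values inside the observed range come back with an error of at most half a quantization step (`scale / 2`), and `0.0` is reproduced exactly because it maps onto the zero-point.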


In [1]:
import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms
import torch.nn as nn
import matplotlib.pyplot as plt
from tqdm import tqdm
from pathlib import Path
import os

### Loading the Dataset

In this section, we prepare the dataset used to train our neural network. We use the **MNIST dataset**, a widely used collection of handwritten digits and a standard benchmark for image classification.

---

🔍 **Reproducibility**: By setting a manual seed, we ensure that our results remain consistent across different runs.

📊 **Data Transformations**: The dataset undergoes a series of transformations to:
- Convert the images into tensor format
- Normalize them, which is vital for enhancing the training process and improving model performance.

🚀 **Data Loaders**: We create data loaders for both the training and testing datasets, facilitating efficient mini-batch processing:
- **Mini-batch processing** allows our model to learn from smaller subsets of data, speeding up training and improving generalization.

💻 **GPU Configuration**: Lastly, we configure our environment to leverage GPU resources if available, optimizing computational efficiency.

---

This foundational step sets us up for the subsequent stages of model training and quantization, enabling us to effectively implement Post-Training Quantization on our trained model.
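As a quick sanity check on the normalization step, the sketch below applies the same mean and standard deviation used in this notebook to the two extreme pixel values. The helper `normalize` is a hypothetical stand-in for what `transforms.Normalize` computes per pixel.

```python
MNIST_MEAN, MNIST_STD = 0.1307, 0.3081  # the constants passed to transforms.Normalize


def normalize(pixel):
    """Apply (pixel - mean) / std to a pixel already scaled to [0, 1] by ToTensor."""
    return (pixel - MNIST_MEAN) / MNIST_STD


darkest = normalize(0.0)    # a fully black pixel
brightest = normalize(1.0)  # a fully white pixel
print(f"{darkest:.4f} {brightest:.4f}")
```

Every input the network sees therefore lies between roughly -0.42 and 2.82, which is the range the input observer has to cover later during calibration.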


In [4]:
_ = torch.manual_seed(433)

In [5]:
transform = transforms.Compose([
    transforms.ToTensor(), # converting to tensors
    transforms.Normalize((0.1307,), (0.3081,)) # normalize with the MNIST mean (0.1307) and std (0.3081)
])

# we would be using the MNIST dataset
mnist_trainset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
mnist_testset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# create data loaders that yield mini-batches of 10 images
train_loader = torch.utils.data.DataLoader(mnist_trainset, batch_size=10, shuffle=True)
test_loader = torch.utils.data.DataLoader(mnist_testset, batch_size=10, shuffle=True)

# trying to leverage my baby GPU hahahaha ;)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")



### Simple Neural Network

In this section, we define a **Simple Neural Network** architecture tailored for classifying handwritten digits from the MNIST dataset. This neural network will serve as the foundation for our subsequent quantization process.

---

🧠 **Neural Network Architecture**:
- **Input Layer**: Accepts images reshaped to a flat vector of size 28x28 (i.e., 784).
- **Hidden Layers**: 
  - **First Hidden Layer**: 50 neurons
  - **Second Hidden Layer**: 80 neurons
  - **Third Hidden Layer**: 30 neurons
- **Output Layer**: Outputs 10 classes, corresponding to the digits 0-9.

🔗 **Activation Function**: 
- **ReLU (Rectified Linear Unit)** is employed between layers to introduce non-linearity, enhancing the model's ability to learn complex patterns.

📦 **Forward Pass**: 
- The input image is flattened, and data flows through the layers, applying the activation function at each hidden layer, culminating in the output layer which predicts the digit class.

---

By constructing this neural network, we prepare to train it on the MNIST dataset, laying the groundwork for effective quantization and deployment in resource-constrained environments.
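The architecture above fixes the number of trainable parameters, which in turn gives a rough lower bound on the FP32 model size. The following is a back-of-the-envelope sketch; `linear_params` is a hypothetical helper, and the estimate ignores any serialization overhead.

```python
layer_sizes = [28 * 28, 50, 80, 30, 10]  # input, three hidden layers, output


def linear_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output neuron."""
    return n_in * n_out + n_out


per_layer = [linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
fp32_kb = total_params * 4 / 1e3  # 4 bytes per FP32 parameter

print(per_layer)     # parameters in linear1..linear4
print(total_params)  # total trainable parameters
print(fp32_kb)       # raw parameter size in KB
```

This gives 46,070 parameters, or about 184 KB of raw FP32 data. A saved state dictionary will be somewhat larger because it also stores tensor names and metadata.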


In [6]:
class NeuralNetwork(nn.Module):
    def __init__(self, hidden_layer_1 = 50,hidden_layer_2 = 80, hidden_layer_3 = 30):
        super(NeuralNetwork,self).__init__()
        self.linear1 = nn.Linear(28*28, hidden_layer_1)
        self.linear2 = nn.Linear(hidden_layer_1, hidden_layer_2)
        self.linear3 = nn.Linear(hidden_layer_2, hidden_layer_3)
        self.linear4 = nn.Linear(hidden_layer_3, 10)
        self.relu = nn.ReLU()
        
    def forward(self,img):
        x = img.view(-1, 28*28)
        x = self.relu(self.linear1(x))
        x = self.relu(self.linear2(x))
        x = self.relu(self.linear3(x))
        x = self.linear4(x)
        return x

model = NeuralNetwork().to(device)

### Model Training

In this section, we implement the training routine for our neural network model. This crucial step allows the model to learn from the MNIST dataset and adjust its weights to improve accuracy.

---

🔧 **Training Parameters**:
- **Optimizer**: We utilize the **Adam optimizer** for efficient weight updates.
- **Loss Function**: We use **CrossEntropyLoss**, the standard choice for multi-class classification tasks such as digit recognition.

📈 **Training Process**:
- The training occurs over a specified number of **epochs**, where each epoch consists of several iterations over batches of data.
- **Loss Calculation**: During each iteration, the model computes the loss, which quantifies how well it performs. The average loss for the epoch is tracked and displayed dynamically.

🔄 **Iterative Updates**:
- The model's parameters are updated using backpropagation after each batch, allowing it to learn from its mistakes and improve over time.

🔍 **Progress Monitoring**:
- A progress bar (using **tqdm**) provides real-time feedback on the training status, displaying the current epoch and average loss.

By effectively training the model, we lay the groundwork for accurate digit classification, setting the stage for the subsequent quantization process.
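To show what the loss value actually measures, here is a small sketch of cross-entropy for a single sample, written in plain Python. The logits are made-up numbers; `nn.CrossEntropyLoss` computes the same quantity and, by default, averages it over the batch.

```python
import math


def cross_entropy(logits, target):
    """-log(softmax(logits)[target]), computed with the log-sum-exp trick for stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


logits = [0.1, 0.2, 5.0] + [0.0] * 7          # hypothetical scores for digits 0-9
confident = cross_entropy(logits, target=2)   # true class has the highest score
wrong = cross_entropy(logits, target=0)       # true class has a low score
uniform = cross_entropy([0.0] * 10, target=3) # model has no preference at all

print(f"confident={confident:.3f} wrong={wrong:.3f} uniform={uniform:.3f}")
```

A model that knows nothing scores `log(10)`, about 2.30, on ten classes, so an average training loss well below that indicates the network has learned something useful.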


In [7]:
def train(train_loader, model, epochs=1, total_iterations_limit=None):
    # optimizer and loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    loss_function = nn.CrossEntropyLoss() # since this is a classification problem.

    total_iterations = 0  # Keep track of how many total iterations we've done

    for epoch in range(epochs):
        model.train()

        loss_sum = 0  # Sum of all the losses to calculate the average loss
        num_iterations = 0  # Keep track of the iterations in this epoch
        data_iterator = tqdm(train_loader, desc=f'Epoch {epoch+1}')

        if total_iterations_limit is not None:
            data_iterator.total = total_iterations_limit
        for data in data_iterator:
            num_iterations += 1
            total_iterations += 1
            x, y = data # 'data' is a batch (x, y), where x is the input (image), and y is the label (digit)
            x = x.to(device)
            y = y.to(device)
            optimizer.zero_grad()
            output = model(x.view(-1, 28*28))
            loss = loss_function(output, y)
            loss_sum += loss.item()
            avg_loss = loss_sum / num_iterations
            data_iterator.set_postfix(loss=avg_loss)
            loss.backward()
            optimizer.step()

            # If a total iteration limit is set, stop training once the limit is reached
            if total_iterations_limit is not None and total_iterations >= total_iterations_limit:
                return

### Model Size and Loading

In this section, we focus on understanding the memory footprint of our trained model and managing its persistence.

---

🗂️ **Model Size Estimation**:
- We define a function to calculate the size of the model in kilobytes (KB). This gives us insights into the model's complexity and how it might perform in constrained environments.
- The size is determined by saving the model's state dictionary temporarily and checking the file size.

💾 **Model Persistence**:
- The model is saved to disk under the filename **`simpleNN_ptq.pt`**.
- **Loading Mechanism**: Before training the model, we check if a saved version already exists. If so, we load it from disk to avoid redundant training, ensuring efficient resource utilization.

🔄 **Training and Saving**:
- If no pre-existing model is found, the training process commences, followed by saving the trained model for future use. This workflow allows for easy model reuse without the need for retraining, which is particularly beneficial in production settings.

By efficiently managing the model's size and implementing a robust loading mechanism, we prepare for the next steps in the post-training quantization process.


In [9]:
def print_size_of_model(model):
    torch.save(model.state_dict(), "temp_delme.p")
    print('Size (KB):', os.path.getsize("temp_delme.p")/1e3)
    os.remove('temp_delme.p')

MODEL_FILENAME = 'simpleNN_ptq.pt'

if Path(MODEL_FILENAME).exists():
    model.load_state_dict(torch.load(MODEL_FILENAME))
    print('Loaded model from disk')
else:
    train(train_loader, model, epochs=1)
    # Save the model to disk
    torch.save(model.state_dict(), MODEL_FILENAME)

Epoch 1: 100%|████████████████████████████████████████████████████████| 6000/6000 [00:58<00:00, 101.70it/s, loss=0.135]


### Time to Test Our Neural Network

In this segment, we evaluate the performance of our trained neural network by measuring its accuracy on the test dataset.

---

🔍 **Testing Mechanism**:
- We define a function to conduct the testing phase, where the model's predictions are compared against the true labels from the test dataset.
- The model is set to **evaluation mode** with `model.eval()`, and the loop runs inside `torch.no_grad()`, which disables gradient tracking to save memory and time during inference.

📊 **Accuracy Calculation**:
- As we iterate through the test data, we keep track of the number of correct predictions versus the total predictions made.
- For each prediction, if the model's output matches the actual label, we increment our **correct** counter.

🔄 **Iterative Process**:
- We also provide a mechanism to limit the number of iterations during testing, allowing for quick checks without the need to evaluate the entire dataset when desired.

✅ **Results Presentation**:
- Finally, the accuracy of the model is computed and printed, offering insights into its performance and readiness for deployment.

This testing phase is crucial in understanding how well our model generalizes to unseen data, guiding us in subsequent optimization and quantization steps.
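The accuracy bookkeeping can be checked by hand on a toy batch. The sketch below uses plain lists instead of tensors; the logits and labels are invented for illustration, and `accuracy` is a hypothetical helper rather than the notebook's own test routine.

```python
def accuracy(batch_logits, labels):
    """Fraction of samples whose highest-scoring class equals the true label."""
    correct = 0
    for logits, label in zip(batch_logits, labels):
        predicted = max(range(len(logits)), key=logits.__getitem__)  # argmax
        correct += int(predicted == label)
    return correct / len(labels)


toy_logits = [
    [0.1, 2.0, -1.0],  # argmax 1
    [1.5, 0.2, 0.3],   # argmax 0
    [0.0, 0.1, 0.9],   # argmax 2
    [0.4, 0.3, 0.2],   # argmax 0, but the label is 1
]
toy_labels = [1, 0, 2, 1]
toy_accuracy = accuracy(toy_logits, toy_labels)
print(toy_accuracy)
```

Three of the four predictions match their labels, so the function returns 0.75.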


In [10]:
def test(model, total_iterations):
    correct,total, iterations = 0,0,0

    model.eval()
    with torch.no_grad():
        for data in tqdm(test_loader, desc='Testing'):
            x, y = data
            x = x.to(device)
            y = y.to(device)
            output = model(x.view(-1, 784))
            for idx, i in enumerate(output):
                if torch.argmax(i) == y[idx]:
                    correct +=1
                total +=1
            iterations += 1
            if total_iterations is not None and iterations >= total_iterations:
                break
    print(f'Accuracy: {round(correct/total, 8)}')

### Evaluating Model Size and Accuracy Before Quantization

Before we proceed with the quantization process, it’s essential to assess our model's current performance and resource footprint.

---

📏 **Model Size**:
- We begin by examining the **weights** of the model's first layer to understand its structure.
- The size of the model is reported, helping us gauge the potential benefits of quantization in reducing memory requirements.

💡 **Weights Overview**:
- The weights matrix is printed to provide a clear view of the parameters the model has learned during training. Additionally, we check the data type of the weights to confirm that they are in floating-point format.

🧪 **Model Accuracy**:
- Following the size evaluation, we check the model's accuracy on the test dataset. This metric is crucial as it indicates how well the model generalizes to unseen data.
- By evaluating the model before quantization, we establish a baseline for comparing post-quantization performance.

📈 **Results Summary**:
- The accuracy result highlights the model's effectiveness, ensuring that any changes made during quantization will be closely monitored against this benchmark.

This step is vital in understanding the initial state of our model, setting the stage for the enhancements that quantization can bring.


In [11]:
# Print the weights matrix of the model before quantization
print('Weights before quantization')
print(model.linear1.weight) # for the 1st layer. 
print(model.linear1.weight.dtype)

Weights before quantization
Parameter containing:
tensor([[ 2.2669e-02,  2.1956e-02, -2.6446e-03,  ...,  3.9206e-03,
          3.7860e-02,  4.6699e-03],
        [ 9.6753e-05,  3.8221e-02,  1.9658e-02,  ..., -6.3200e-03,
          2.2407e-02, -1.6425e-02],
        [ 5.9448e-02,  4.3217e-03,  1.6101e-02,  ..., -7.5925e-03,
          7.5968e-03,  2.9488e-02],
        ...,
        [-1.2917e-02,  1.4802e-02,  6.9440e-03,  ...,  3.6022e-04,
         -9.9758e-03,  9.7989e-03],
        [-3.5000e-03,  4.5107e-02,  1.5952e-02,  ...,  3.1508e-02,
          2.7726e-02,  3.5913e-02],
        [-3.3904e-02,  1.7777e-02, -3.4840e-03,  ...,  2.9134e-03,
         -1.7864e-03,  1.2128e-02]], requires_grad=True)
torch.float32


In [12]:
print('Size of the model before quantization')
print_size_of_model(model)

Size of the model before quantization
Size (KB): 187.364


In [14]:
## we also want to check the accuracy of our model 
print(f'Accuracy of the model before quantization: ')
test(model,None)

Accuracy of the model before quantization: 


Testing: 100%|████████████████████████████████████████████████████████████████████| 1000/1000 [00:03<00:00, 291.71it/s]

Accuracy: 0.9646





### Time to Quantize 😉

As we step into the quantization phase, we begin by creating a **copy of our existing model** to facilitate a smooth transition into lower-precision computations without altering the original architecture.

---

🔄 **Introducing Quantization**:
- We define a new class, `QuantizeNeuralNetwork`, which inherits from `nn.Module`. This class is designed to incorporate quantization directly into the model structure.
- By leveraging **QuantStub** and **DeQuantStub**, we seamlessly integrate quantization and dequantization processes into our model's forward pass.

🛠️ **Model Architecture**:
- The architecture remains largely consistent with our previous neural network, featuring:
  - **Input Layer:** 28x28 input flattened to a single vector.
  - **Hidden Layers:** Three fully connected layers with ReLU activations.
  - **Output Layer:** Final layer projecting to 10 classes for digit classification.

📉 **Precision Optimization**:
- The use of quantization allows us to transform the floating-point operations into lower-precision computations, which can significantly reduce both the memory footprint and computational load.
- This step prepares the model for efficient deployment, especially in resource-constrained environments like mobile devices.

With our quantization model defined, we are now poised to further explore the benefits it brings to our neural network's performance and efficiency!


In [16]:
### We make a copy of that same model: 
class QuantizeNeuralNetwork(nn.Module):
    def __init__(self, hidden_layer_1 = 50,hidden_layer_2 = 80, hidden_layer_3 = 30):
        super(QuantizeNeuralNetwork,self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.linear1 = nn.Linear(28*28, hidden_layer_1)
        self.linear2 = nn.Linear(hidden_layer_1, hidden_layer_2)
        self.linear3 = nn.Linear(hidden_layer_2, hidden_layer_3)
        self.linear4 = nn.Linear(hidden_layer_3, 10)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()
        
    def forward(self,img):
        x = img.view(-1, 28*28)
        x = self.quant(x)
        x = self.relu(self.linear1(x))
        x = self.relu(self.linear2(x))
        x = self.relu(self.linear3(x))
        x = self.linear4(x)
        x = self.dequant(x)
        return x

quant_model = QuantizeNeuralNetwork().to(device)

#### Quantized Model Inference 🌟

In this phase, we take a pivotal step by **transferring the weights** from our original model to the quantized version, ensuring that we leverage the knowledge gained from previous training without the need for retraining. 

---

🔄 **Weight Transfer**:
- The quantized model, `quant_model`, is initialized by loading the state dictionary from the original model. This allows us to maintain the learned parameters and avoid redundant training.

💡 **Inference Mode**:
- Calling `quant_model.eval()` switches the model to inference mode, which changes the behavior of layers such as dropout and batch normalization. This network has neither, but calling it is standard practice before post-training quantization.

🔍 **Quantization Configuration**:
- We assign the default quantization configuration using `quant_model.qconfig = torch.ao.quantization.default_qconfig` to prepare the model for quantization.
- The `prepare` method inserts observers into the model, allowing us to collect statistics about the activations during inference.

📊 **Layer Statistics**:
- After running the test set through the prepared model (this pass serves as calibration), we can examine the statistics recorded for each layer, which show the range of activations flowing through the network.
- This information is invaluable for understanding the behavior of the model and can guide adjustments in quantization strategies.

With the observers in place and their statistics reviewed, we are ready to convert the model and assess what quantization brings in terms of size and accuracy.
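Conceptually, a min/max observer is a pass-through that remembers the extremes of everything it has seen. The class below is a toy stand-in written for illustration only; `MinMaxTracker` is a hypothetical name and not PyTorch's `MinMaxObserver` implementation.

```python
class MinMaxTracker:
    """Toy observer: remembers the smallest and largest value seen so far."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))
        return values  # data passes through unchanged


tracker = MinMaxTracker()
calibration_batches = ([0.2, -1.5, 3.0], [4.1, 0.0, -0.3], [1.1, 2.2, -2.7])
for batch in calibration_batches:
    tracker.observe(batch)

print(tracker.min_val, tracker.max_val)
```

Because the observer does not modify the data, the prepared model produces the same predictions as the original floating-point model; only the recorded ranges are new.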


In [1]:
quant_model.load_state_dict(model.state_dict())
quant_model.eval() ## we are not training, only doing inference

quant_model.qconfig = torch.ao.quantization.default_qconfig
quant_model = torch.ao.quantization.prepare(quant_model) # Insert observers
quant_model


In [19]:
test(quant_model,None)
print(f'Check statistics of the various layers')
quant_model

Testing: 100%|████████████████████████████████████████████████████████████████████| 1000/1000 [00:02<00:00, 385.78it/s]

Accuracy: 0.9646
Check statistics of the various layers





QuantizeNeuralNetwork(
  (quant): QuantStub(
    (activation_post_process): MinMaxObserver(min_val=-0.4242129623889923, max_val=2.821486711502075)
  )
  (linear1): Linear(
    in_features=784, out_features=50, bias=True
    (activation_post_process): MinMaxObserver(min_val=-58.53604507446289, max_val=43.8294563293457)
  )
  (linear2): Linear(
    in_features=50, out_features=80, bias=True
    (activation_post_process): MinMaxObserver(min_val=-40.140647888183594, max_val=35.24177551269531)
  )
  (linear3): Linear(
    in_features=80, out_features=30, bias=True
    (activation_post_process): MinMaxObserver(min_val=-36.355369567871094, max_val=43.074493408203125)
  )
  (linear4): Linear(
    in_features=30, out_features=10, bias=True
    (activation_post_process): MinMaxObserver(min_val=-36.072540283203125, max_val=22.88129234313965)
  )
  (relu): ReLU()
  (dequant): DeQuantStub()
)

I think this is a nice view: we can see the minimum and maximum values each observer recorded, which tells us the range that each layer's quantization has to cover.

### Quantization of the Neural Network 🎉

As we move forward with quantization, we utilize the statistics gathered from the observers to optimize our model. This transformation is crucial for reducing the model's memory footprint and enhancing inference speed.

---

🔍 **Layer Statistics Post-Quantization**:
- The quantized model reveals each layer's unique **scale** and **zero point**, essential parameters for transforming floating-point values into quantized integers. These statistics give us insights into how each layer processes data, ensuring optimal performance post-quantization.

### Weights Representation
- After quantization, we examine the weight matrix of the first layer:
  - **Quantized Weights**: The weights are represented in integer format, showcasing the quantization's effect on the values.
  
- **Original vs. Dequantized Weights**:
  - Comparing the original floating-point weights with their quantized counterparts gives us a deeper understanding of how quantization affects weight precision.
  - The dequantized weights demonstrate how effectively the quantized representation retains the original values, providing a glimpse into potential performance during inference.

This process not only optimizes our model but also helps us ensure that we maintain a balance between performance and efficiency. Let’s continue to explore how quantization can impact the overall performance of our neural network!
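The relationship between original, quantized, and dequantized weights can be reproduced by hand. The sketch below takes the six visible first-row values of `linear1` printed before quantization and applies symmetric rounding. The scale of 0.00613 is an assumption: the notebook never prints the weight scale, so it is inferred from the dequantized output (for example 0.0245 / 4). `quantize_symmetric` is a hypothetical helper.

```python
WEIGHT_SCALE = 0.00613  # assumption: inferred from the dequantized weights, not printed by PyTorch

# First-row values of linear1.weight shown in the "Weights before quantization" output.
first_row_sample = [2.2669e-02, 2.1956e-02, -2.6446e-03, 3.9206e-03, 3.7860e-02, 4.6699e-03]


def quantize_symmetric(w, scale, qmin=-128, qmax=127):
    """Symmetric int8 quantization: zero-point is 0, so q = clamp(round(w / scale))."""
    return max(qmin, min(qmax, round(w / scale)))


codes = [quantize_symmetric(w, WEIGHT_SCALE) for w in first_row_sample]
restored = [q * WEIGHT_SCALE for q in codes]
max_error = max(abs(w - r) for w, r in zip(first_row_sample, restored))

print(codes)
print([round(r, 4) for r in restored])
print(f"largest round-trip error: {max_error:.5f}")
```

Under this assumed scale the integer codes come out as 4, 4, 0, 1, 6, 1, and each restored value is within half a step of its original, which is the precision int8 weights give up.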


In [20]:
quant_model = torch.ao.quantization.convert(quant_model)
print(f'Check statistics of the various layers')
quant_model

Check statistics of the various layers


QuantizeNeuralNetwork(
  (quant): Quantize(scale=tensor([0.0256]), zero_point=tensor([17]), dtype=torch.quint8)
  (linear1): QuantizedLinear(in_features=784, out_features=50, scale=0.8060275912284851, zero_point=73, qscheme=torch.per_tensor_affine)
  (linear2): QuantizedLinear(in_features=50, out_features=80, scale=0.5935623645782471, zero_point=68, qscheme=torch.per_tensor_affine)
  (linear3): QuantizedLinear(in_features=80, out_features=30, scale=0.625432014465332, zero_point=58, qscheme=torch.per_tensor_affine)
  (linear4): QuantizedLinear(in_features=30, out_features=10, scale=0.464203417301178, zero_point=78, qscheme=torch.per_tensor_affine)
  (relu): ReLU()
  (dequant): DeQuantize()
)

So each layer has its own scale and zero point.
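These scales and zero points follow directly from the min/max values the observers recorded. The sketch below recomputes them in plain Python. The integer range 0..127 is an assumption about the default activation observer; it is used here because it reproduces the printed values. `affine_qparams` is a hypothetical helper, not a PyTorch function.

```python
def affine_qparams(min_val, max_val, qmin=0, qmax=127):
    """Derive (scale, zero_point) from an observed range that is widened to include 0.0."""
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = qmin - round(min_val / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


# Ranges copied from the MinMaxObserver printout of the prepared model.
observed = {
    "quant": (-0.4242129623889923, 2.821486711502075),
    "linear1": (-58.53604507446289, 43.8294563293457),
    "linear2": (-40.140647888183594, 35.24177551269531),
    "linear3": (-36.355369567871094, 43.074493408203125),
    "linear4": (-36.072540283203125, 22.88129234313965),
}

qparams = {name: affine_qparams(lo, hi) for name, (lo, hi) in observed.items()}
for name, (scale, zero_point) in qparams.items():
    print(f"{name}: scale={scale:.4f} zero_point={zero_point}")
```

Layers with a wider activation range get a larger scale, meaning each integer step covers more of the float range and the layer is quantized more coarsely.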

In [21]:
# Print the weights matrix of the model after quantization
print('Weights after quantization')
print(torch.int_repr(quant_model.linear1.weight()))

Weights after quantization
tensor([[ 4,  4,  0,  ...,  1,  6,  1],
        [ 0,  6,  3,  ..., -1,  4, -3],
        [10,  1,  3,  ..., -1,  1,  5],
        ...,
        [-2,  2,  1,  ...,  0, -2,  2],
        [-1,  7,  3,  ...,  5,  5,  6],
        [-6,  3, -1,  ...,  0,  0,  2]], dtype=torch.int8)


In [22]:
print('Original weights: ')
print(model.linear1.weight)
print('')
print(f'Dequantized weights: ')
print(torch.dequantize(quant_model.linear1.weight()))
print('')

Original weights: 
Parameter containing:
tensor([[ 2.2669e-02,  2.1956e-02, -2.6446e-03,  ...,  3.9206e-03,
          3.7860e-02,  4.6699e-03],
        [ 9.6753e-05,  3.8221e-02,  1.9658e-02,  ..., -6.3200e-03,
          2.2407e-02, -1.6425e-02],
        [ 5.9448e-02,  4.3217e-03,  1.6101e-02,  ..., -7.5925e-03,
          7.5968e-03,  2.9488e-02],
        ...,
        [-1.2917e-02,  1.4802e-02,  6.9440e-03,  ...,  3.6022e-04,
         -9.9758e-03,  9.7989e-03],
        [-3.5000e-03,  4.5107e-02,  1.5952e-02,  ...,  3.1508e-02,
          2.7726e-02,  3.5913e-02],
        [-3.3904e-02,  1.7777e-02, -3.4840e-03,  ...,  2.9134e-03,
         -1.7864e-03,  1.2128e-02]], requires_grad=True)

Dequantized weights: 
tensor([[ 0.0245,  0.0245,  0.0000,  ...,  0.0061,  0.0368,  0.0061],
        [ 0.0000,  0.0368,  0.0184,  ..., -0.0061,  0.0245, -0.0184],
        [ 0.0613,  0.0061,  0.0184,  ..., -0.0061,  0.0061,  0.0306],
        ...,
        [-0.0123,  0.0123,  0.0061,  ...,  0.0000, -0.0123,  

#### Let's compare Unquantized and Quantized models

In [24]:
print('Size of the model after quantization')
print_size_of_model(quant_model)
print('Testing the model after quantization')
test(quant_model,None)

Size of the model after quantization
Size (KB): 52.77
Testing the model after quantization


Testing: 100%|████████████████████████████████████████████████████████████████████| 1000/1000 [00:02<00:00, 403.01it/s]

Accuracy: 0.9626





The size has gone down from 187.364 KB to 52.77 KB, and the accuracy has gone from 0.9646 to 0.9626.

## 📊 Comparing Unquantized and Quantized Models

### Size Reduction
- **Unquantized Model Size**: 
  - **Size (KB)**: 187.364
- **Quantized Model Size**: 
  - **Size (KB)**: 52.77

### Accuracy Evaluation
- **Unquantized Model Accuracy**: 
  - **Accuracy**: 0.9646
- **Quantized Model Accuracy**: 
  - **Accuracy**: 0.9626

---

### Summary of Findings
Quantization reduced the model size from **187.364 KB** to **52.77 KB**, roughly 3.5x smaller (about a 72% reduction). Accuracy fell from **0.9646** to **0.9626**, a drop of 0.2 percentage points, so in this experiment a large size saving cost very little accuracy.
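The trade-off can be put into numbers with a few lines of arithmetic on the results reported above. The test-set size of 10,000 images follows from the 1000 test batches of 10 images each; the variable names are illustrative.

```python
size_fp32_kb, size_int8_kb = 187.364, 52.77  # sizes reported by print_size_of_model
acc_fp32, acc_int8 = 0.9646, 0.9626          # accuracies reported by test()
test_images = 1000 * 10                      # 1000 batches of 10 images

compression = size_fp32_kb / size_int8_kb
size_saved_pct = 100 * (1 - size_int8_kb / size_fp32_kb)
acc_drop_points = 100 * (acc_fp32 - acc_int8)
extra_errors = round((acc_fp32 - acc_int8) * test_images)

print(f"{compression:.2f}x smaller ({size_saved_pct:.1f}% saved)")
print(f"{acc_drop_points:.2f} percentage points lower accuracy")
print(f"about {extra_errors} additional misclassified test images")
```

The ratio is below the ideal 4x for an FP32-to-int8 conversion. A plausible explanation, stated here as an assumption, is that the saved file also contains parts that do not shrink, such as quantization parameters and serialization metadata.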

### Conclusion
Quantization is a practical way to prepare neural networks for deployment, producing smaller models that keep most of their accuracy. Int8 arithmetic can also speed up inference on hardware that supports it, although this notebook measures only size and accuracy, not latency.

Let’s keep pushing the boundaries of what our models can achieve!
