### Importing PyTorch Libraries  
- `torch`: The core PyTorch library for deep learning.  
- `torch.nn`: Provides neural network building blocks (layers, activations, etc.).  
- `torch.optim`: Contains optimization algorithms like SGD and Adam.  
- `torchvision.transforms`: Applies transformations (e.g., resizing, normalization) to images.  
- `torchvision.datasets`: Provides standard datasets such as MNIST and CIFAR-10 for training models.  


In [28]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
import torchvision.datasets as datasets

### Loading and Preprocessing the Data  
- Loads the dataset **`selected_features_training.csv`** using `pandas`.  
- **Separates features (`X`)** and the **target variable (`y`)**:  
  - `X`: Contains the selected features used for training.  
  - `y`: Contains the labels (class categories) for classification.  


In [29]:
import pandas as pd

df = pd.read_csv('selected_features_training.csv')

# Preprocess the data

# Separate features and target variable
X = df.drop('label', axis=1)
y = df['label']

### Splitting the Dataset into Training and Testing Sets  
- Imports necessary libraries for **data splitting and evaluation**.  
- `train_test_split()` is used to **split the dataset** into:  
  - **80% training data (`X_train`, `y_train`)**  
  - **20% testing data (`X_test`, `y_test`)**  
- `random_state=42` ensures **consistent results** across runs.  


In [30]:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Converting Data to PyTorch Tensors  
- Converts `X_train` and `X_test` from a **pandas DataFrame** to **NumPy arrays** and then to **PyTorch tensors**.  
- `dtype=torch.float32`: Ensures feature values are stored as 32-bit floating points.  
- `dtype=torch.long`: Converts labels (`y_train`, `y_test`) to 64-bit integers (`int64`), the class-index type that `nn.CrossEntropyLoss` expects.  


In [31]:
# Convert DataFrame to NumPy and then to PyTorch tensor
X_train = torch.tensor(X_train.values, dtype=torch.float32)
y_train = torch.tensor(y_train.values, dtype=torch.long)
X_test = torch.tensor(X_test.values, dtype=torch.float32)
y_test = torch.tensor(y_test.values, dtype=torch.long)
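
The conversion above can be checked on a small scale. The following is a minimal sketch on a toy DataFrame (the column names `f1`, `f2` and all values are made up for illustration) that performs the same conversion and verifies two assumptions the later cells rely on: the labels are `int64`, and they run contiguously from 0 to `num_classes - 1`.

```python
import pandas as pd
import torch

# Toy stand-in for the real data; column names and values are illustrative only.
toy_df = pd.DataFrame({
    "f1": [0.1, 0.5, 0.9, 0.3],
    "f2": [1.0, 0.0, 1.0, 0.0],
    "label": [0, 2, 1, 2],
})

# Same conversion pattern as the notebook: DataFrame -> NumPy -> tensor.
X_toy = torch.tensor(toy_df.drop("label", axis=1).values, dtype=torch.float32)
y_toy = torch.tensor(toy_df["label"].values, dtype=torch.long)

# nn.CrossEntropyLoss expects int64 class indices in [0, num_classes - 1].
num_classes = len(torch.unique(y_toy))
labels_are_contiguous = bool(y_toy.min() == 0 and y_toy.max() == num_classes - 1)

print(X_toy.dtype, y_toy.dtype, labels_are_contiguous)
```

If the labels in the real CSV were strings or skipped some integers, they would need to be re-encoded before this step.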


In [33]:
print(X_train.shape)  # Expected: (num_samples, num_features)

torch.Size([100778, 20])


### Loading the Pre-Trained Teacher Model  
- Uses `load_model()` from **TensorFlow Keras** to load a pre-trained model.  
- Loads the **`pnn_model.h5`** file, which contains the trained **Probabilistic Neural Network (PNN)** model.  
- This model will act as the **teacher model** for knowledge distillation.  


In [34]:
from tensorflow.keras.models import load_model
teacher_model = load_model("pnn_model.h5")




### Generating Teacher Model Predictions (Logits)  
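- Uses the **pre-trained teacher model** to make predictions on the training data (`X_train`).  
- Converts the predictions into a **PyTorch tensor** with `dtype=torch.float32`.  
- These outputs serve as the **soft targets** for **knowledge distillation**. The values printed below all lie in [0, 1], which suggests that the teacher's final layer already applies softmax, so they are probably class probabilities rather than raw logits, despite the variable name `teacher_logits`.  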


In [35]:
teacher_logits = torch.tensor(teacher_model.predict(X_train), dtype=torch.float32)
teacher_logits

[1m3150/3150[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 1ms/step


tensor([[2.7812e-09, 3.4620e-20, 1.2752e-25,  ..., 1.3662e-35, 1.1516e-13,
         0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        [9.3335e-25, 2.1447e-31, 0.0000e+00,  ..., 5.4336e-24, 0.0000e+00,
         0.0000e+00],
        ...,
        [7.8920e-25, 6.2439e-31, 0.0000e+00,  ..., 2.9868e-23, 0.0000e+00,
         0.0000e+00],
        [1.4533e-06, 1.7698e-14, 2.0384e-19,  ..., 4.0989e-29, 2.1206e-11,
         2.5264e-30],
        [5.3072e-28, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00]])
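
Whether a teacher returns raw logits or softmax probabilities can be checked directly. The sketch below uses a hypothetical helper, `looks_like_probabilities`, and made-up tensors rather than the real teacher outputs; it tests whether every row is non-negative and sums to roughly 1.

```python
import torch

def looks_like_probabilities(outputs: torch.Tensor, tol: float = 1e-4) -> bool:
    """Return True if every row is non-negative and sums to approximately 1."""
    non_negative = bool((outputs >= 0).all())
    row_sums = outputs.sum(dim=1)
    sums_to_one = bool(torch.allclose(row_sums, torch.ones_like(row_sums), atol=tol))
    return non_negative and sums_to_one

# Illustrative tensors, not real teacher outputs.
raw_logits = torch.tensor([[2.0, -1.0, 0.5], [0.1, 0.2, -3.0]])
probs = torch.softmax(raw_logits, dim=1)

is_prob_raw = looks_like_probabilities(raw_logits)   # rows contain negatives
is_prob_soft = looks_like_probabilities(probs)       # rows are softmax outputs
print(is_prob_raw, is_prob_soft)
```

Running the same check on the real teacher tensor would show which case applies, and that matters for how temperature scaling should be applied later.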

### Defining the Student Model  
- A **lightweight neural network** designed to learn from the **teacher model** using **knowledge distillation**.  
- **Uses PyTorch’s `nn.Module`** to define the architecture.  

#### **Layers in the Student Model**  
1. **Input Layer → Hidden Layer**  
   - `nn.Linear(X_train.shape[1], 64)`:  
     - Connects the input features to a hidden layer with **64 neurons**.  
2. **Hidden Layer → Output Layer**  
   - `nn.Linear(64, len(torch.unique(y_train)))`:  
     - Connects the hidden layer to the output layer with **the number of unique classes** in `y_train`.  
3. **Activation Function**  
   - Uses **ReLU (`torch.relu()`)** to introduce non-linearity and improve learning.  

#### **Forward Pass**  
- Takes input (`x`), applies **ReLU activation** to the first layer, and then **outputs logits** from the second layer.  
- The logits will later be used for **soft target matching** during knowledge distillation.  


In [36]:
import torch.nn as nn

class StudentModel(nn.Module):
    def __init__(self):
        super(StudentModel, self).__init__()
        self.fc1 = nn.Linear(X_train.shape[1], 64)  # Input layer to a smaller hidden layer
        self.fc2 = nn.Linear(64, len(torch.unique(y_train)))  # Hidden layer to output layer

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x
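
The class above reads `X_train` and `y_train` from the notebook's global scope. As an alternative, here is a sketch of the same two-layer architecture with the sizes passed in explicitly. The class name `StudentModelSketch` is hypothetical; 20 input features matches the printed shape of `X_train`, while 23 classes is an assumption inferred from the parameter count reported at the end of this notebook.

```python
import torch
import torch.nn as nn

class StudentModelSketch(nn.Module):
    """Same layout as StudentModel, but with explicit sizes (hypothetical variant)."""

    def __init__(self, in_features: int, num_classes: int, hidden: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)   # input -> hidden
        self.fc2 = nn.Linear(hidden, num_classes)   # hidden -> class logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))

# 20 features (from the printed X_train shape); 23 classes is an assumption.
model = StudentModelSketch(in_features=20, num_classes=23)

# Parameters: (20*64 + 64) + (64*23 + 23)
n_params = sum(p.numel() for p in model.parameters())
out = model(torch.zeros(5, 20))
print(n_params, tuple(out.shape))
```

Passing the sizes in makes the model usable without the training tensors in memory, for example when reloading saved weights on another device.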


### Initializing the Student Model  
- Creates an instance of the **StudentModel** class.  
- This model will learn from the **teacher model’s predictions (logits)** during **knowledge distillation**.  
- The student model is **smaller and more efficient** than the teacher model, making it suitable for deployment on **resource-limited IoT devices**.  


In [37]:
student_model = StudentModel()

### Defining Hyperparameters for Knowledge Distillation  
- **`alpha = 0.5`** → Weights the **soft loss (from the teacher outputs)** against the **hard loss (from true labels)**; at `0.5` both contribute equally.  
- **`temperature = 3.0`** → Divides the teacher outputs and the student logits before softmax; values above 1 **flatten the resulting distributions**.  
- **`learning_rate = 0.001`** → Sets the step size for the optimizer to update model weights.  
- **`num_epochs = 100`** → Defines the number of full passes the student model makes over the training set.  

💡 **A higher `temperature` shifts probability mass toward the non-top classes, so the student can see how the teacher ranks the wrong classes, information that one-hot labels do not carry.**  


In [39]:
alpha = 0.5         # Weight for soft vs hard loss
temperature = 3.0   # Temperature for softening the teacher's outputs
learning_rate = 0.001  # Learning rate for the optimizer
num_epochs = 100     # Number of training epochs
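
To make the effect of `temperature` concrete, the following sketch applies softmax to one made-up logit vector at several temperatures. The logit values are illustrative only and not taken from either model.

```python
import torch

logits = torch.tensor([4.0, 2.0, 0.5])  # illustrative values

# Higher T divides the logits by a larger number, flattening the softmax output.
for T in (1.0, 3.0, 10.0):
    probs = torch.softmax(logits / T, dim=0)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")

p_sharp = torch.softmax(logits / 1.0, dim=0)
p_soft = torch.softmax(logits / 3.0, dim=0)
```

The class ranking is the same at every temperature; only the gaps between the probabilities shrink as `T` grows.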

### Initializing the Optimizer  
- Uses **Adam optimizer (`torch.optim.Adam`)**, which is an adaptive learning rate optimization algorithm.  
- **Updates the parameters** of the `student_model` based on gradient descent.  
- **Learning rate (`lr=learning_rate`)** is set to `0.001`, controlling how much the model adjusts during training.  


In [40]:
optimizer = torch.optim.Adam(student_model.parameters(), lr=learning_rate)


### Knowledge Distillation Loss Function  
- Helps the **student model learn** from both the **teacher model** and **true labels**.  
- Uses a **balance** of two losses:  
  1. **Soft Loss (KL Divergence)** → Matches student predictions with **teacher's softened outputs**.  
  2. **Hard Loss (Cross-Entropy)** → Matches student predictions with **true labels**.  
- `alpha` controls the weight:  
  - **Higher `alpha`** → Student relies more on the teacher.  
  - **Lower `alpha`** → Student relies more on true labels.  


In [41]:
def distillation_loss(student_logits, teacher_logits, true_labels, alpha=0.5, temperature=3.0):
    # Soften the teacher outputs into a target distribution
    teacher_probs = torch.softmax(teacher_logits / temperature, dim=1)
    # Student log-probabilities (nn.KLDivLoss expects log-probabilities as its input)
    student_log_probs = torch.log_softmax(student_logits / temperature, dim=1)

    # Soft loss: KL(teacher || student) on the softened distributions
    soft_loss = nn.KLDivLoss(reduction="batchmean")(student_log_probs, teacher_probs)

    # Hard loss: cross-entropy between the unscaled student logits and the true labels
    hard_loss = nn.CrossEntropyLoss()(student_logits, true_labels)

    # Weighted combination of soft and hard losses
    return alpha * soft_loss + (1 - alpha) * hard_loss
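
If the teacher returns probabilities rather than raw logits, applying `softmax` to them a second time yields nearly uniform targets. The sketch below is one possible variant, not the notebook's method: the function name `distillation_loss_from_probs` and the `eps` parameter are hypothetical. It takes the logarithm of the teacher probabilities before temperature scaling, and it multiplies the soft loss by `temperature ** 2`, a common convention that keeps the soft-loss gradients on a comparable scale across temperatures.

```python
import torch
import torch.nn.functional as F

def distillation_loss_from_probs(student_logits, teacher_probs, true_labels,
                                 alpha=0.5, temperature=3.0, eps=1e-8):
    # Assumption: teacher_probs are softmax outputs. Because softmax ignores
    # constant shifts, softmax(log(p) / T) equals softmax(z / T) for the
    # teacher's underlying logits z (up to the eps clamp on exact zeros).
    teacher_log_probs = torch.log(teacher_probs.clamp_min(eps))
    soft_targets = torch.softmax(teacher_log_probs / temperature, dim=1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=1)

    # KL(teacher || student), scaled by T^2
    soft_loss = F.kl_div(student_log_probs, soft_targets,
                         reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, true_labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Sanity check on made-up data: a student whose logits match the teacher's
# has zero soft loss, so only the weighted hard loss remains.
torch.manual_seed(0)
z = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 1])
loss = distillation_loss_from_probs(z, torch.softmax(z, dim=1), labels)
expected = 0.5 * F.cross_entropy(z, labels)
print(loss.item(), expected.item())
```

Switching to this variant would change the training signal, so the accuracy reported in this notebook would not necessarily carry over.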


### Training the Student Model with Knowledge Distillation  

1. **Set the model to training mode:**  
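   - `student_model.train()` switches layers such as dropout and batch normalization to their training behavior. This student has neither, so the call has no effect here, but it is a safe habit.  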

2. **Reset gradients before each update:**  
   - `optimizer.zero_grad()` clears previous gradients to avoid accumulation.  

3. **Get student model predictions:**  
   - `student_logits = student_model(X_train)` generates logits for the entire training set in a single forward pass (full-batch training).  

4. **Compute distillation loss:**  
   - Uses `distillation_loss()` to combine knowledge from the **teacher model** and **true labels**.  

5. **Backpropagation:**  
   - `loss.backward()` calculates gradients for each parameter.  
   - `optimizer.step()` updates the student model’s weights.  

6. **Track progress:**  
   - Prints the loss **every 10 epochs** to monitor training.  


In [None]:
for epoch in range(num_epochs):
    student_model.train()
    optimizer.zero_grad()

    # Get student's predictions
    student_logits = student_model(X_train)

    # Calculate the distillation loss
    loss = distillation_loss(student_logits, teacher_logits, y_train, alpha, temperature)

    # Backpropagation
    loss.backward()
    optimizer.step()

    # Print progress every 10 epochs
    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}")
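
The loop above performs one full-batch update per epoch. For larger datasets, or to take more optimizer steps per epoch, a mini-batch loop is a common alternative. The sketch below is self-contained and runs on synthetic data: the sizes, the `kd_loss` helper, the batch size of 64, and the random "teacher" outputs are all placeholders and not the notebook's values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-ins for X_train, y_train and the precomputed teacher outputs.
n, d, c = 256, 20, 3
X = torch.randn(n, d)
y = torch.randint(0, c, (n,))
teacher_out = torch.randn(n, c)

student = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, c))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

# Bundling the teacher outputs into the dataset keeps them aligned with each batch.
loader = DataLoader(TensorDataset(X, y, teacher_out), batch_size=64, shuffle=True)

def kd_loss(s_logits, t_logits, labels, alpha=0.5, T=3.0):
    soft = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                    F.softmax(t_logits / T, dim=1), reduction="batchmean")
    hard = F.cross_entropy(s_logits, labels)
    return alpha * soft + (1 - alpha) * hard

epoch_losses = []
for epoch in range(5):
    student.train()
    running = 0.0
    for xb, yb, tb in loader:
        opt.zero_grad()
        loss = kd_loss(student(xb), tb, yb)
        loss.backward()
        opt.step()
        running += loss.item() * xb.size(0)  # sum of per-sample losses
    epoch_losses.append(running / n)         # mean loss over the epoch
    print(f"Epoch {epoch + 1}: loss {epoch_losses[-1]:.4f}")
```

With shuffling enabled, the teacher outputs must travel with the samples as shown; indexing a separate teacher tensor by batch position would misalign them.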


### Evaluating the Student Model  

1. **Set model to evaluation mode:**  
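   - `student_model.eval()` switches layers such as dropout and batch normalization to their inference behavior. It does not freeze the weights; they stay unchanged here only because no optimizer step is run.  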

2. **Disable gradient calculation:**  
   - `with torch.no_grad():` speeds up inference and reduces memory usage.  

3. **Make predictions on test data:**  
   - `test_logits = student_model(X_test)` generates logits for test samples.  
   - `predictions = torch.argmax(test_logits, dim=1)` converts logits into class labels.  

4. **Calculate accuracy:**  
   - Uses `accuracy_score()` to compare predictions with true labels (`y_test`).  

5. **Print the final accuracy of the student model.**

In [None]:
student_model.eval()
with torch.no_grad():
    test_logits = student_model(X_test)
    predictions = torch.argmax(test_logits, dim=1)

# Calculate accuracy
accuracy = accuracy_score(y_test.numpy(), predictions.numpy())
print(f"Student Model Accuracy: {accuracy:.4f}")
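
Besides accuracy against the true labels, it can be informative to measure how often the student agrees with the teacher. The sketch below computes both with plain tensor operations on made-up values; the variable names and numbers are illustrative and are not results from this notebook.

```python
import torch

# Illustrative values only.
true_labels = torch.tensor([0, 1, 2, 1, 0])
student_logits = torch.tensor([[2.0, 0.1, 0.0],
                               [0.2, 1.5, 0.1],
                               [0.3, 0.2, 0.1],
                               [0.0, 2.0, 1.0],
                               [1.0, 0.5, 0.2]])
teacher_outputs = torch.tensor([[0.90, 0.05, 0.05],
                                [0.10, 0.80, 0.10],
                                [0.20, 0.10, 0.70],
                                [0.10, 0.70, 0.20],
                                [0.60, 0.30, 0.10]])

student_pred = student_logits.argmax(dim=1)
teacher_pred = teacher_outputs.argmax(dim=1)

# Fraction of samples where the student matches the labels / the teacher.
accuracy = (student_pred == true_labels).float().mean().item()
agreement = (student_pred == teacher_pred).float().mean().item()
print(f"accuracy={accuracy:.2f}, teacher agreement={agreement:.2f}")
```

A student can be accurate without agreeing with the teacher, and the reverse, so the two numbers answer different questions.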


Student Model Accuracy: 0.9140


### Saving the Trained Student Model  
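- Saves the **student model’s trained weights** using `torch.save()`.  
- `student_model.state_dict()` contains only the **model parameters** (not the full model structure), so the `StudentModel` class is needed again when loading.  
- The file is named **`student_model.h5`**, but `torch.save()` writes PyTorch's own serialization format, not HDF5; `.pt` or `.pth` is the more conventional extension. The weights can be **loaded later for inference or further training** with `load_state_dict()`.  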


In [20]:
torch.save(student_model.state_dict(), "student_model.h5")
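
Loading the weights back requires rebuilding the architecture first. Here is a minimal round-trip sketch, assuming a stand-in class `StudentSketch` with the same layer names (`fc1`, `fc2`) and assumed sizes of 20 inputs and 23 classes; it writes to a temporary file rather than to the notebook's `student_model.h5`.

```python
import os
import tempfile

import torch
import torch.nn as nn

class StudentSketch(nn.Module):
    """Stand-in for StudentModel; the sizes are assumptions for illustration."""

    def __init__(self, in_features: int = 20, num_classes: int = 23):
        super().__init__()
        self.fc1 = nn.Linear(in_features, 64)
        self.fc2 = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

trained = StudentSketch()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "student_sketch.pt")
    torch.save(trained.state_dict(), path)       # weights only

    restored = StudentSketch()                   # rebuild the architecture
    restored.load_state_dict(torch.load(path))   # then load the weights into it

restored.eval()
x = torch.randn(3, 20)
with torch.no_grad():
    same = torch.allclose(trained(x), restored(x))
print(same)
```

The layer names in the rebuilt class must match the keys in the saved `state_dict`, otherwise `load_state_dict()` raises an error.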


### Comparing Teacher and Student Model Sizes  
1. **Calculate Teacher Model Size:**  
   - `teacher_model.count_params()` returns the **total number of parameters** in the teacher model.  

2. **Calculate Student Model Size:**  
   - `sum(p.numel() for p in student_model.parameters())` computes the **total parameters** in the student model.  

3. **Print Model Sizes:**  
   - Displays the **number of parameters** in both models to compare their complexity. 

In [25]:
teacher_size = teacher_model.count_params()
student_size = sum(p.numel() for p in student_model.parameters())

print(f"Teacher Model Size: {teacher_size} parameters")
print(f"Student Model Size: {student_size} parameters")


Teacher Model Size: 88599 parameters
Student Model Size: 2839 parameters


- Computes the **percentage reduction** in model size after knowledge distillation.  
- Prints how much **smaller** the student model is compared to the teacher model.  

In [44]:
print(f'Reduction by {(1 - (student_size / teacher_size)) * 100} %')

Reduction by 96.7956748947505 %
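
The parameter counts reported above can also be expressed as a compression ratio and an approximate memory footprint. The sketch below assumes that every parameter is stored as a 32-bit float (4 bytes), which is an assumption for the Keras teacher, and it ignores file-format overhead; `size_kib` is a hypothetical helper.

```python
# Parameter counts as reported in the cell output above.
teacher_params = 88599
student_params = 2839

BYTES_PER_FLOAT32 = 4  # assumption: all weights are float32

def size_kib(n_params: int, bytes_per_param: int = BYTES_PER_FLOAT32) -> float:
    """Approximate in-memory size of the weights in KiB."""
    return n_params * bytes_per_param / 1024

reduction_pct = (1 - student_params / teacher_params) * 100
compression_ratio = teacher_params / student_params

print(f"Reduction: {reduction_pct:.2f} %")
print(f"Compression ratio: {compression_ratio:.1f}x")
print(f"Teacher: {size_kib(teacher_params):.1f} KiB, "
      f"student: {size_kib(student_params):.1f} KiB")
```

Parameter count is only a proxy for deployment cost; inference latency and the runtime's own footprint on the target device would need to be measured separately.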
