<a href="https://colab.research.google.com/github/pbajpai21/MLOps-Jan2025/blob/main/MLOps3_G24AIT008.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Roll No : G24AIT008

# Q1. (Dataset and Model Preparation)

Load the Dataset

In [1]:
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, random_split

# Define the transformation: Convert images to tensors and normalize
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

# Load Fashion-MNIST dataset
train_dataset = torchvision.datasets.FashionMNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = torchvision.datasets.FashionMNIST(root='./data', train=False, transform=transform, download=True)


Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 26.4M/26.4M [00:01<00:00, 15.6MB/s]


Extracting ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 29.5k/29.5k [00:00<00:00, 333kB/s]


Extracting ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz


100%|██████████| 4.42M/4.42M [00:00<00:00, 5.55MB/s]


Extracting ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 5.15k/5.15k [00:00<00:00, 7.30MB/s]

Extracting ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw






Train-Validation Split (80:20)

In [2]:
# Define train-validation split (80:20)
train_size = int(0.8 * len(train_dataset))
val_size = len(train_dataset) - train_size

train_data, val_data = random_split(train_dataset, [train_size, val_size])

# Create DataLoader for training, validation, and test sets
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
val_loader = DataLoader(val_data, batch_size=64, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)


Defining Model Architecture

In [3]:
import torch.nn as nn

class FashionMNISTModel(nn.Module):
    def __init__(self, hidden_units):
        super(FashionMNISTModel, self).__init__()
        self.fc1 = nn.Linear(28 * 28, hidden_units)  # First layer
        self.relu = nn.ReLU()  # Activation function
        self.fc2 = nn.Linear(hidden_units, 10)  # Output layer

    def forward(self, x):
        x = x.view(x.shape[0], -1)  # Flatten input
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Set default hidden layer neurons
hidden_units = 256
model = FashionMNISTModel(hidden_units)


# Q2. (Setting Up the Project & Logging Hyperparameters)

Initialize wandb project

In [4]:
!pip install wandb
import wandb

wandb.login(relogin=True)
wandb.init(project="MLOps2025_g24ait008", config={
    "learning_rate": 0.001,
    "batch_size": 64,
    "epochs": 5,
    "hidden_units": hidden_units
})




<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mg24ait008[0m ([33mg24ait008-ml-ops[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


Defining Loss function and Optimizer

In [5]:
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=wandb.config.learning_rate)


# Q3. (Training and Validation)

Training the model

In [6]:
import time

def train(model, train_loader, val_loader, criterion, optimizer, epochs):
    start_time = time.time()
    model.train()

    for epoch in range(epochs):
        train_loss, train_correct = 0, 0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()
            train_correct += (outputs.argmax(dim=1) == labels).sum().item()

        train_loss /= len(train_loader)
        train_acc = train_correct / len(train_loader.dataset)

        # Validation
        val_loss, val_correct = 0, 0
        model.eval()
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                loss = criterion(outputs, labels)
                val_loss += loss.item()
                val_correct += (outputs.argmax(dim=1) == labels).sum().item()

        val_loss /= len(val_loader)
        val_acc = val_correct / len(val_loader.dataset)

        # Log metrics in wandb
        wandb.log({"Train Loss": train_loss, "Train Accuracy": train_acc,
                   "Validation Loss": val_loss, "Validation Accuracy": val_acc})

        print(f"Epoch [{epoch+1}/{epochs}], Train Loss: {train_loss:.4f}, Train Accuracy: {train_acc:.4f}, Validation Loss: {val_loss:.4f}, Validation Accuracy: {val_acc:.4f}")
    end_time = time.time()
    execution_time = round(end_time - start_time, 2)
    wandb.log({"Time spent (seconds)" : execution_time})
    print(f"Time spent: {execution_time} seconds")

try:
  train(model, train_loader, val_loader, criterion, optimizer, wandb.config.epochs)
finally:
  # Close wandb session
  wandb.finish()


Epoch [1/5], Train Loss: 0.5118, Train Accuracy: 0.8149, Validation Loss: 0.4388, Validation Accuracy: 0.8414
Epoch [2/5], Train Loss: 0.3889, Train Accuracy: 0.8587, Validation Loss: 0.3574, Validation Accuracy: 0.8710
Epoch [3/5], Train Loss: 0.3459, Train Accuracy: 0.8721, Validation Loss: 0.3331, Validation Accuracy: 0.8776
Epoch [4/5], Train Loss: 0.3183, Train Accuracy: 0.8843, Validation Loss: 0.3391, Validation Accuracy: 0.8721
Epoch [5/5], Train Loss: 0.2985, Train Accuracy: 0.8910, Validation Loss: 0.3166, Validation Accuracy: 0.8823
Time spent: 115.16 seconds


0,1
Time spent (seconds),▁
Train Accuracy,▁▅▆▇█
Train Loss,█▄▃▂▁
Validation Accuracy,▁▆▇▆█
Validation Loss,█▃▂▂▁

0,1
Time spent (seconds),115.16
Train Accuracy,0.891
Train Loss,0.29854
Validation Accuracy,0.88233
Validation Loss,0.31659


# Q4. Hyperparameter Exploration

Defining Sweep configuration

In [7]:
sweep_config = {
    "method": "random",
    "parameters": {
        "hidden_units": {
            "values": [128, 256, 512]  # Different values for the hidden layer
        },
        "batch_size": {"values": [128, 256]},
        "epochs": {"value": 5},
        "learning_rate": {
            "values": [0.002, 0.01, 0.1]
        }
    }
}

sweep_id = wandb.sweep(sweep_config, project="MLOps2025_g24ait008")


Create sweep with ID: l5dm6euw
Sweep URL: https://wandb.ai/g24ait008-ml-ops/MLOps2025_g24ait008/sweeps/l5dm6euw


Running the sweep

In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def sweep_train(config=None):
    wandb.init(config=config)
    model = FashionMNISTModel(wandb.config.hidden_units).to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    train(model, train_loader, val_loader, criterion, optimizer, epochs=5)


wandb.agent(sweep_id, sweep_train, count=10)


[34m[1mwandb[0m: Agent Starting Run: 10xsqrhq with config:
[34m[1mwandb[0m: 	batch_size: 256
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	hidden_units: 256
[34m[1mwandb[0m: 	learning_rate: 0.002


Epoch [1/5], Train Loss: 0.5094, Train Accuracy: 0.8150, Validation Loss: 0.4252, Validation Accuracy: 0.8398
Epoch [2/5], Train Loss: 0.3862, Train Accuracy: 0.8582, Validation Loss: 0.3816, Validation Accuracy: 0.8580
Epoch [3/5], Train Loss: 0.3449, Train Accuracy: 0.8748, Validation Loss: 0.3484, Validation Accuracy: 0.8698
Epoch [4/5], Train Loss: 0.3208, Train Accuracy: 0.8817, Validation Loss: 0.3379, Validation Accuracy: 0.8734
Epoch [5/5], Train Loss: 0.3029, Train Accuracy: 0.8882, Validation Loss: 0.3136, Validation Accuracy: 0.8867
Time spent: 106.1 seconds


0,1
Time spent (seconds),▁
Train Accuracy,▁▅▇▇█
Train Loss,█▄▂▂▁
Validation Accuracy,▁▄▅▆█
Validation Loss,█▅▃▃▁

0,1
Time spent (seconds),106.1
Train Accuracy,0.88825
Train Loss,0.30292
Validation Accuracy,0.88667
Validation Loss,0.31359


[34m[1mwandb[0m: Agent Starting Run: e52bjqce with config:
[34m[1mwandb[0m: 	batch_size: 256
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	hidden_units: 128
[34m[1mwandb[0m: 	learning_rate: 0.1


Epoch [1/5], Train Loss: 0.5257, Train Accuracy: 0.8136, Validation Loss: 0.4631, Validation Accuracy: 0.8290
Epoch [2/5], Train Loss: 0.3978, Train Accuracy: 0.8551, Validation Loss: 0.3575, Validation Accuracy: 0.8703
Epoch [3/5], Train Loss: 0.3580, Train Accuracy: 0.8700, Validation Loss: 0.3619, Validation Accuracy: 0.8662
Epoch [4/5], Train Loss: 0.3305, Train Accuracy: 0.8797, Validation Loss: 0.3515, Validation Accuracy: 0.8715
Epoch [5/5], Train Loss: 0.3122, Train Accuracy: 0.8849, Validation Loss: 0.3234, Validation Accuracy: 0.8780
Time spent: 87.9 seconds


0,1
Time spent (seconds),▁
Train Accuracy,▁▅▇▇█
Train Loss,█▄▃▂▁
Validation Accuracy,▁▇▆▇█
Validation Loss,█▃▃▂▁

0,1
Time spent (seconds),87.9
Train Accuracy,0.88485
Train Loss,0.31224
Validation Accuracy,0.878
Validation Loss,0.3234


[34m[1mwandb[0m: Agent Starting Run: 2aqse6gh with config:
[34m[1mwandb[0m: 	batch_size: 128
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	hidden_units: 128
[34m[1mwandb[0m: 	learning_rate: 0.1


Epoch [1/5], Train Loss: 0.5214, Train Accuracy: 0.8134, Validation Loss: 0.4326, Validation Accuracy: 0.8409
Epoch [2/5], Train Loss: 0.3954, Train Accuracy: 0.8568, Validation Loss: 0.3712, Validation Accuracy: 0.8674
Epoch [3/5], Train Loss: 0.3575, Train Accuracy: 0.8697, Validation Loss: 0.3561, Validation Accuracy: 0.8651
Epoch [4/5], Train Loss: 0.3306, Train Accuracy: 0.8791, Validation Loss: 0.3317, Validation Accuracy: 0.8778
Epoch [5/5], Train Loss: 0.3110, Train Accuracy: 0.8859, Validation Loss: 0.3257, Validation Accuracy: 0.8812
Time spent: 86.24 seconds


0,1
Time spent (seconds),▁
Train Accuracy,▁▅▆▇█
Train Loss,█▄▃▂▁
Validation Accuracy,▁▆▅▇█
Validation Loss,█▄▃▁▁

0,1
Time spent (seconds),86.24
Train Accuracy,0.88587
Train Loss,0.31102
Validation Accuracy,0.88117
Validation Loss,0.32565


[34m[1mwandb[0m: Agent Starting Run: 1zus6r6w with config:
[34m[1mwandb[0m: 	batch_size: 128
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	hidden_units: 128
[34m[1mwandb[0m: 	learning_rate: 0.1


Epoch [1/5], Train Loss: 0.5242, Train Accuracy: 0.8112, Validation Loss: 0.3965, Validation Accuracy: 0.8593
Epoch [2/5], Train Loss: 0.3962, Train Accuracy: 0.8556, Validation Loss: 0.3706, Validation Accuracy: 0.8666
Epoch [3/5], Train Loss: 0.3567, Train Accuracy: 0.8699, Validation Loss: 0.3405, Validation Accuracy: 0.8767
Epoch [4/5], Train Loss: 0.3289, Train Accuracy: 0.8770, Validation Loss: 0.3306, Validation Accuracy: 0.8791
Epoch [5/5], Train Loss: 0.3116, Train Accuracy: 0.8855, Validation Loss: 0.3283, Validation Accuracy: 0.8770
Time spent: 87.55 seconds


0,1
Time spent (seconds),▁
Train Accuracy,▁▅▇▇█
Train Loss,█▄▂▂▁
Validation Accuracy,▁▄▇█▇
Validation Loss,█▅▂▁▁

0,1
Time spent (seconds),87.55
Train Accuracy,0.88546
Train Loss,0.31157
Validation Accuracy,0.877
Validation Loss,0.32835


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: q3hfwedh with config:
[34m[1mwandb[0m: 	batch_size: 256
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	hidden_units: 256
[34m[1mwandb[0m: 	learning_rate: 0.01


Epoch [1/5], Train Loss: 0.5098, Train Accuracy: 0.8170, Validation Loss: 0.4137, Validation Accuracy: 0.8478
Epoch [2/5], Train Loss: 0.3855, Train Accuracy: 0.8583, Validation Loss: 0.3576, Validation Accuracy: 0.8703
Epoch [3/5], Train Loss: 0.3457, Train Accuracy: 0.8728, Validation Loss: 0.3479, Validation Accuracy: 0.8717
Epoch [4/5], Train Loss: 0.3205, Train Accuracy: 0.8813, Validation Loss: 0.3295, Validation Accuracy: 0.8781
Epoch [5/5], Train Loss: 0.3005, Train Accuracy: 0.8907, Validation Loss: 0.3130, Validation Accuracy: 0.8838
Time spent: 93.27 seconds


0,1
Time spent (seconds),▁
Train Accuracy,▁▅▆▇█
Train Loss,█▄▃▂▁
Validation Accuracy,▁▅▆▇█
Validation Loss,█▄▃▂▁

0,1
Time spent (seconds),93.27
Train Accuracy,0.89067
Train Loss,0.30054
Validation Accuracy,0.88375
Validation Loss,0.313


[34m[1mwandb[0m: Agent Starting Run: u82104zr with config:
[34m[1mwandb[0m: 	batch_size: 256
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	hidden_units: 512
[34m[1mwandb[0m: 	learning_rate: 0.002


Epoch [1/5], Train Loss: 0.5012, Train Accuracy: 0.8191, Validation Loss: 0.3896, Validation Accuracy: 0.8562
Epoch [2/5], Train Loss: 0.3816, Train Accuracy: 0.8593, Validation Loss: 0.3857, Validation Accuracy: 0.8558
Epoch [3/5], Train Loss: 0.3428, Train Accuracy: 0.8750, Validation Loss: 0.3473, Validation Accuracy: 0.8717
Epoch [4/5], Train Loss: 0.3160, Train Accuracy: 0.8832, Validation Loss: 0.3327, Validation Accuracy: 0.8789
Epoch [5/5], Train Loss: 0.2937, Train Accuracy: 0.8907, Validation Loss: 0.3239, Validation Accuracy: 0.8804
Time spent: 103.88 seconds


0,1
Time spent (seconds),▁
Train Accuracy,▁▅▆▇█
Train Loss,█▄▃▂▁
Validation Accuracy,▁▁▆██
Validation Loss,██▃▂▁

0,1
Time spent (seconds),103.88
Train Accuracy,0.89067
Train Loss,0.29368
Validation Accuracy,0.88042
Validation Loss,0.3239


[34m[1mwandb[0m: Agent Starting Run: ms53482q with config:
[34m[1mwandb[0m: 	batch_size: 256
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	hidden_units: 512
[34m[1mwandb[0m: 	learning_rate: 0.01


Epoch [1/5], Train Loss: 0.4979, Train Accuracy: 0.8201, Validation Loss: 0.3844, Validation Accuracy: 0.8582
Epoch [2/5], Train Loss: 0.3774, Train Accuracy: 0.8615, Validation Loss: 0.3724, Validation Accuracy: 0.8618
Epoch [3/5], Train Loss: 0.3404, Train Accuracy: 0.8742, Validation Loss: 0.3378, Validation Accuracy: 0.8721
Epoch [4/5], Train Loss: 0.3195, Train Accuracy: 0.8830, Validation Loss: 0.3289, Validation Accuracy: 0.8801
Epoch [5/5], Train Loss: 0.2943, Train Accuracy: 0.8917, Validation Loss: 0.3315, Validation Accuracy: 0.8776
Time spent: 105.15 seconds


0,1
Time spent (seconds),▁
Train Accuracy,▁▅▆▇█
Train Loss,█▄▃▂▁
Validation Accuracy,▁▂▅█▇
Validation Loss,█▆▂▁▁

0,1
Time spent (seconds),105.15
Train Accuracy,0.89167
Train Loss,0.29433
Validation Accuracy,0.87758
Validation Loss,0.33147


[34m[1mwandb[0m: Agent Starting Run: du22dkla with config:
[34m[1mwandb[0m: 	batch_size: 128
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	hidden_units: 256
[34m[1mwandb[0m: 	learning_rate: 0.002


Epoch [1/5], Train Loss: 0.5107, Train Accuracy: 0.8155, Validation Loss: 0.3933, Validation Accuracy: 0.8588
Epoch [2/5], Train Loss: 0.3911, Train Accuracy: 0.8568, Validation Loss: 0.3951, Validation Accuracy: 0.8524
Epoch [3/5], Train Loss: 0.3491, Train Accuracy: 0.8720, Validation Loss: 0.3291, Validation Accuracy: 0.8788
Epoch [4/5], Train Loss: 0.3233, Train Accuracy: 0.8815, Validation Loss: 0.3725, Validation Accuracy: 0.8585
Epoch [5/5], Train Loss: 0.3047, Train Accuracy: 0.8884, Validation Loss: 0.3321, Validation Accuracy: 0.8748
Time spent: 93.68 seconds


0,1
Time spent (seconds),▁
Train Accuracy,▁▅▆▇█
Train Loss,█▄▃▂▁
Validation Accuracy,▃▁█▃▇
Validation Loss,██▁▆▁

0,1
Time spent (seconds),93.68
Train Accuracy,0.88835
Train Loss,0.30472
Validation Accuracy,0.87483
Validation Loss,0.33206


[34m[1mwandb[0m: Agent Starting Run: oz7vfbbc with config:
[34m[1mwandb[0m: 	batch_size: 128
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	hidden_units: 256
[34m[1mwandb[0m: 	learning_rate: 0.002


Epoch [1/5], Train Loss: 0.5071, Train Accuracy: 0.8160, Validation Loss: 0.4554, Validation Accuracy: 0.8267
Epoch [2/5], Train Loss: 0.3854, Train Accuracy: 0.8587, Validation Loss: 0.3640, Validation Accuracy: 0.8678
Epoch [3/5], Train Loss: 0.3457, Train Accuracy: 0.8719, Validation Loss: 0.3485, Validation Accuracy: 0.8666
Epoch [4/5], Train Loss: 0.3180, Train Accuracy: 0.8827, Validation Loss: 0.3216, Validation Accuracy: 0.8801
Epoch [5/5], Train Loss: 0.2954, Train Accuracy: 0.8908, Validation Loss: 0.3315, Validation Accuracy: 0.8832
Time spent: 92.83 seconds


0,1
Time spent (seconds),▁
Train Accuracy,▁▅▆▇█
Train Loss,█▄▃▂▁
Validation Accuracy,▁▆▆██
Validation Loss,█▃▂▁▂

0,1
Time spent (seconds),92.83
Train Accuracy,0.89079
Train Loss,0.29539
Validation Accuracy,0.88325
Validation Loss,0.33154


[34m[1mwandb[0m: Agent Starting Run: 5w9z4jkc with config:
[34m[1mwandb[0m: 	batch_size: 128
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	hidden_units: 256
[34m[1mwandb[0m: 	learning_rate: 0.1


Epoch [1/5], Train Loss: 0.5039, Train Accuracy: 0.8175, Validation Loss: 0.4254, Validation Accuracy: 0.8429
Epoch [2/5], Train Loss: 0.3826, Train Accuracy: 0.8599, Validation Loss: 0.3524, Validation Accuracy: 0.8697
Epoch [3/5], Train Loss: 0.3448, Train Accuracy: 0.8731, Validation Loss: 0.3495, Validation Accuracy: 0.8707
Epoch [4/5], Train Loss: 0.3187, Train Accuracy: 0.8824, Validation Loss: 0.3201, Validation Accuracy: 0.8810
Epoch [5/5], Train Loss: 0.2987, Train Accuracy: 0.8900, Validation Loss: 0.3170, Validation Accuracy: 0.8855
Time spent: 93.42 seconds


0,1
Time spent (seconds),▁
Train Accuracy,▁▅▆▇█
Train Loss,█▄▃▂▁
Validation Accuracy,▁▅▆▇█
Validation Loss,█▃▃▁▁

0,1
Time spent (seconds),93.42
Train Accuracy,0.89002
Train Loss,0.29872
Validation Accuracy,0.8855
Validation Loss,0.31696


# Q5. (Artifact Management and Model Saving)

Saving the model

In [9]:
torch.save(model.state_dict(), "MLOps2025.pth")
wandb.init()
artifact = wandb.Artifact("MLOps2025", type="model")
artifact.add_file("MLOps2025.pth")
wandb.log_artifact(artifact)

wandb.finish()