<a href="https://colab.research.google.com/github/kamal-ark/hpc-project/blob/main/hpc_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Import all necessary libraries

In [None]:
from random import random
import numpy as np
import torch
import torch.nn as nn
from torchvision import datasets
from torchvision import transforms
from torch.utils.data.sampler import SubsetRandomSampler
import time, os


## Experiment set up
Define the Network and datasets for use in all steps below


### AlexNet network definition (Standard)

In [None]:
# Implementation from a reference (https://github.com/gradient-ai/alexnet)
class AlexNet(nn.Module):
    def __init__(self, num_classes=10):
        super(AlexNet, self).__init__()
        # Uses convolutions in several layers, as in the classic AlexNet
        self.layer1 = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=0),
            nn.BatchNorm2d(96),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 3, stride = 2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 3, stride = 2))
        self.layer3 = nn.Sequential(
            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(384),
            nn.ReLU())
        self.layer4 = nn.Sequential(
            nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(384),
            nn.ReLU())
        self.layer5 = nn.Sequential(
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 3, stride = 2))
        self.fc = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(9216, 4096),
            nn.ReLU())
        self.fc1 = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU())
        self.fc2= nn.Sequential(
            nn.Linear(4096, num_classes))

    # Forward method of the model, applying all necessary layers defined above
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = self.layer5(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        out = self.fc1(out)
        out = self.fc2(out)
        return out
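
As a quick sanity check on the `nn.Linear(9216, 4096)` input size, the spatial size after each convolution or pooling layer can be worked out by hand. The sketch below is plain Python and assumes a 227×227 input; `out_size` is a hypothetical helper, not part of the notebook.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output width/height of a conv or pooling layer for a square input of size n."""
    return (n + 2 * padding - kernel) // stride + 1

n = 227                                   # assumed input resolution
n = out_size(n, kernel=11, stride=4)      # layer1 conv
n = out_size(n, kernel=3, stride=2)       # layer1 max-pool
n = out_size(n, kernel=5, padding=2)      # layer2 conv
n = out_size(n, kernel=3, stride=2)       # layer2 max-pool
for _ in range(3):                        # layer3 to layer5 convs (3x3, padding 1) keep the size
    n = out_size(n, kernel=3, padding=1)
n = out_size(n, kernel=3, stride=2)       # layer5 max-pool
flat_features = 256 * n * n               # 256 channels leave layer5
print(n, flat_features)
```

A different input resolution changes `flat_features`, so the first fully connected layer would need to be resized accordingly.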

### Imagenette2-160 dataset

Download the dataset from the hosted server. We use Imagenette because it is commonly used for faster training, and the 160px version because that resolution would be typical of processing on edge devices.

In [None]:
!wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz

--2023-12-20 08:40:42--  https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.170.120, 52.217.228.200, 52.217.236.176, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.170.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 99003388 (94M) [application/x-tar]
Saving to: ‘imagenette2-160.tgz’


2023-12-20 08:40:50 (13.0 MB/s) - ‘imagenette2-160.tgz’ saved [99003388/99003388]



In [None]:
!tar -xvf imagenette2-160.tgz

Streaming output truncated to the last 5000 lines.
imagenette2-160/train/n03888257/n03888257_16077.JPEG
imagenette2-160/train/n03888257/n03888257_23339.JPEG
imagenette2-160/train/n03888257/n03888257_44204.JPEG
imagenette2-160/train/n03888257/n03888257_61633.JPEG
imagenette2-160/train/n03888257/n03888257_15067.JPEG
imagenette2-160/train/n03888257/n03888257_75365.JPEG
imagenette2-160/train/n03888257/n03888257_63966.JPEG
imagenette2-160/train/n03888257/n03888257_3927.JPEG
imagenette2-160/train/n03888257/n03888257_20684.JPEG
imagenette2-160/train/n03888257/ILSVRC2012_val_00047778.JPEG
imagenette2-160/train/n03888257/n03888257_14016.JPEG
imagenette2-160/train/n03888257/n03888257_37776.JPEG
imagenette2-160/train/n03888257/ILSVRC2012_val_00041706.JPEG
imagenette2-160/train/n03888257/n03888257_17513.JPEG
imagenette2-160/train/n03888257/n03888257_17143.JPEG
imagenette2-160/train/n03888257/n03888257_6738.JPEG
imagenette2-160/train/n03888257/n03888257_4355.JPEG
imagenette2-160/train

# Experiments
### Define Dataloaders

This step matters in particular for the reduced-precision (float16, bfloat16) variations of the experiments. The loaders work with most input shapes and sizes.

Also note that the dtype conversion is done in the dataloader so that the training loop does not need to cast data to reduced precision manually; the work is better distributed in the loaders than in the training method.
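
To make the two reduced-precision formats concrete: float16 has a 5-bit exponent and 10 fraction bits, while bfloat16 keeps the 8-bit exponent of float32 but only 7 fraction bits. The sketch below emulates both roundings with the standard library only. The helper names `to_float16` and `to_bfloat16` are hypothetical, and the bfloat16 emulation (round-to-nearest-even on the float32 bit pattern, no NaN handling) is an illustration rather than what PyTorch does internally.

```python
import struct

def to_float16(x):
    """Round x to the nearest IEEE half-precision value (struct format 'e')."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_bfloat16(x):
    """Round x to bfloat16 by keeping the top 16 bits of its float32 pattern."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    # Round to nearest even on the 16 bits that are about to be dropped
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack('<f', struct.pack('<I', bits))[0]

for value in (0.1, 0.4914, 1.0):
    print(value, to_float16(value), to_bfloat16(value))
```

float16 resolves small differences more finely, whereas bfloat16 covers the same range of magnitudes as float32 at a coarser resolution.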

In [None]:

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Main methods for the train dataloader, also optionally can return validation loader
def get_train_valid_loader(data_dir,
                           batch_size,
                           augment,
                           random_seed,
                           valid_size=0.1,
                           shuffle=True, num_workers=1, rp_b=False,rp=False):
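    # NOTE: these mean/std values are the commonly used CIFAR-10 channel statistics;
    # get_test_loader below normalizes with the ImageNet statistics instead.
    normalize = transforms.Normalize(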
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.2010],
    )

    # The reduced-precision (RP) branches ignore `augment`, which we do not use in our experiments
    # rp_b selects bfloat16; rp selects float16
    if rp_b:
        train_transform = transforms.Compose([
            transforms.Resize((227,227)),
            transforms.ToTensor(),
            normalize,
            transforms.ConvertImageDtype(torch.bfloat16),
        ])
        valid_transform = transforms.Compose([
              transforms.Resize((227,227)),
              transforms.ToTensor(),
              normalize,
              transforms.ConvertImageDtype(torch.bfloat16),
        ])
    elif rp:
        train_transform = transforms.Compose([
            transforms.Resize((227,227)),
            transforms.ToTensor(),
            normalize,
            transforms.ConvertImageDtype(torch.float16),
        ])
        valid_transform = transforms.Compose([
              transforms.Resize((227,227)),
              transforms.ToTensor(),
              normalize,
              transforms.ConvertImageDtype(torch.float16),
        ])
    else: # Normal full precision, no half precision needed and hence data loader does not apply this transform
        valid_transform = transforms.Compose([
              transforms.Resize((227,227)),
              transforms.ToTensor(),
              normalize,
        ])
        if augment:
            train_transform = transforms.Compose([
              transforms.RandomCrop(32, padding=4),
              transforms.RandomHorizontalFlip(),
              transforms.ToTensor(),
              normalize,
            ])
        else:
            train_transform = transforms.Compose([
              transforms.Resize((227,227)),
              transforms.ToTensor(),
              normalize,
            ])


    # load the dataset
    #train_dataset = datasets.CIFAR10(
   #     root=data_dir, train=True,
   #     download=True, transform=train_transform,
   # )

    #valid_dataset = datasets.CIFAR10(
    #    root=data_dir, train=True,
    #    download=True, transform=valid_transform,
    #)

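    # Use the imagenette2-160 data that was downloaded in the previous steps.
    # Both datasets read from train/ (as the CIFAR-10 version did with train=True) because
    # the validation indices below are drawn from range(len(train_dataset)).
    train_dataset = datasets.ImageFolder(root='imagenette2-160/train/', transform=train_transform)
    valid_dataset = datasets.ImageFolder(root='imagenette2-160/train/', transform=valid_transform)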

    # find the indices for train and validation


    num_train = len(train_dataset)
    indices = list(range(num_train))
    split = int(np.floor(valid_size * num_train))

    if shuffle:
        np.random.seed(random_seed)
        np.random.shuffle(indices)

    train_idx, valid_idx = indices[split:], indices[:split]
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, sampler=train_sampler, num_workers=num_workers)

    valid_loader = torch.utils.data.DataLoader(
        valid_dataset, batch_size=batch_size, sampler=valid_sampler, num_workers=num_workers)

    return (train_loader, valid_loader)

# Method that returns the test data loader, so that we can evaluate accuracy
def get_test_loader(data_dir,
                    batch_size,
                    shuffle=True,rp=False,rp_b=False):
    normalize = transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    )

    # define transform
    if rp_b:
      transform = transforms.Compose([
          transforms.Resize((227,227)),
          transforms.ToTensor(),
          normalize,
          transforms.ConvertImageDtype(torch.bfloat16),
        ])
    elif rp:
      transform = transforms.Compose([
              transforms.Resize((227,227)),
              transforms.ToTensor(),
              normalize,
              transforms.ConvertImageDtype(torch.float16),
      ])
    else:
      transform = transforms.Compose([
          transforms.Resize((227,227)),
          transforms.ToTensor(),
          normalize,
      ])

    #dataset = datasets.CIFAR10(
    #    root=data_dir, train=False,
    #    download=True, transform=transform,
    #)

    # Use the validation data from ImageNette2
    dataset = datasets.ImageFolder(root='imagenette2-160/val/', transform=transform)

    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle
    )

    return data_loader


# CIFAR10 dataset
#train_loader, valid_loader = get_train_valid_loader(data_dir = './data', batch_size = 64, augment = False, random_seed = 1)

#test_loader = get_test_loader(data_dir = './data',
#                              batch_size = 64)
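
The hold-out split inside `get_train_valid_loader` boils down to shuffling the index list once with a fixed seed and slicing it. A minimal sketch of that logic follows. It uses the standard-library `random` module in place of NumPy (an assumption made here to keep the example dependency-free), and `split_indices` is a hypothetical helper name.

```python
import math
import random

def split_indices(num_samples, valid_size=0.1, seed=1, shuffle=True):
    """Return (train_idx, valid_idx) as two disjoint index lists."""
    indices = list(range(num_samples))
    split = int(math.floor(valid_size * num_samples))
    if shuffle:
        # A seeded generator makes the split reproducible across runs
        random.Random(seed).shuffle(indices)
    return indices[split:], indices[:split]

train_idx, valid_idx = split_indices(1000)
print(len(train_idx), len(valid_idx))
```

Both index lists refer to positions in the same underlying dataset, so the two samplers built from them must be applied to datasets of the same length.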

In [None]:
num_classes = 10
num_epochs = 11
#batch_size = 64
#learning_rate = 0.005




# WANDB setup in below cell

In [None]:
!pip install wandb -Uq

import wandb, pprint
wandb.login()

sweep_config = {
    'method': 'random',
  }

metric = {
    'name': 'loss',
    'goal': 'minimize'
  }

sweep_config['metric'] = metric

parameters_dict = {
    'num_workers': {
        'values': [1, 2, 4, 8]#, 16]
        },
    'rp_b': {  # bfloat16 toggle; in many previous sweeps this key was 'rp' (float16)
        'values': [True, False]
        },
    'channels_last': {
        'values': [True, False]
        },
    'batch_size': {
        'values': [64, 128, 256, 512]
        },
    'learning_rate': {
        'values': [0.005, 0.001, 0.0005]
        },
}

sweep_config['parameters'] = parameters_dict
pprint.pprint(sweep_config)

# Get new sweep ID
sweep_id = wandb.sweep(sweep_config, project="hpc-proj") #"<wandb_entity?>/hpc-proj/swt2n8jz"

{'method': 'random',
 'metric': {'goal': 'minimize', 'name': 'loss'},
 'parameters': {'batch_size': {'values': [64, 128, 256, 512]},
                'channels_last': {'values': [True, False]},
                'learning_rate': {'values': [0.005, 0.001, 0.0005]},
                'num_workers': {'values': [1, 2, 4, 8]},
                'rp_b': {'values': [True, False]}}}
Create sweep with ID: x5cp0llc
Sweep URL: https://wandb.ai/impossibile/hpc-proj/sweeps/x5cp0llc
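
For scale, the random sweep samples from a fairly large grid. The sketch below counts the distinct configurations implied by the `parameters_dict` values above; the dictionary is re-typed here so that the block runs on its own.

```python
from itertools import product

grid = {
    'num_workers': [1, 2, 4, 8],
    'rp_b': [True, False],
    'channels_last': [True, False],
    'batch_size': [64, 128, 256, 512],
    'learning_rate': [0.005, 0.001, 0.0005],
}

# Every combination of one value per parameter
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))
```

A random sweep with a small run count therefore visits only a fraction of the grid, which is worth keeping in mind when comparing individual runs.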


## Training

Main function to be called from wandb. This defines all variations of the experiments and the entire setup of the network

In [None]:
#
# Let us study the model performance below for our experiments
#
# Function to be called from wandb for setting up config parameters and running the training of modules
def train_wandb(config=None):
  #
  # Get wandb config
  #
  with wandb.init(config=config):
    config = wandb.config

    # Store each epoch's timings in arrays so that in the end we can average and return values
    # (ignoring the first epoch for warmup reasons)
    dataLoaderTimeArr = []
    trainingTimeArr = []
    epochDataLoaderTimeArr = []
    epochTrainingTimeArr = []
    epochTimeArr = []

    # RP = float16, RP_B = bfloat16

    model = AlexNet(num_classes).to(device)
    NUM_WORKERS = wandb.config.num_workers#4

    # NOTE: For RP variations, two different sweeps are important, as both cannot be true at the same time
    # Toggle the following few lines so only one is active anytime
    #is_RP = wandb.config.rp #False
    is_RP_B = wandb.config.rp_b #False
    #if is_RP:
    #  model = model.to(dtype=torch.float16)
      #criterion = nn.MultiMarginLoss()#(log_target=False)
    if is_RP_B:
      model = model.to(dtype=torch.bfloat16)
    #if is_RP and is_RP_B: #NOTE: Choose only one RP type bfloat.16 or float.16
    #  if random() < 0.5:
    #    is_RP = False
    #  else:
    #    is_RP_B = False
    train_loader, valid_loader = get_train_valid_loader(data_dir='./data', batch_size=wandb.config.batch_size, augment=False, random_seed=1,
                                                        num_workers=NUM_WORKERS, rp_b=is_RP_B)  # pass rp=is_RP instead for the float16 sweep
    test_loader = get_test_loader(data_dir='./data', batch_size=64, rp_b=is_RP_B)  # pass rp=is_RP instead for the float16 sweep

    # Set channels last format if needed
    is_CHANNELS_LAST = wandb.config.channels_last #True
    if is_CHANNELS_LAST:
      model = model.to(memory_format=torch.channels_last)
    # Loss and optimizer definitions
    #criterion = nn.CrossEntropyLoss()
    criterion = nn.MultiMarginLoss()#(log_target=False)
    optimizer = torch.optim.SGD(model.parameters(), lr=wandb.config.learning_rate, weight_decay = 0.005, momentum = 0.9)

    final_loss = 0.0

    #
    # Start training for num_epochs
    for epoch in range(num_epochs):
        begin_e = time.monotonic_ns()
        begin_i = begin_e
        for i, (images, labels) in enumerate(train_loader):
            finish_i = time.monotonic_ns()
            dataLoaderTimeArr.append((finish_i-begin_i)/1000000000.0)

            #if is_RP:
            #  images = images.to(dtype=torch.float16)
            #  labels = labels.to(dtype=torch.int16)
            if is_CHANNELS_LAST:
              images = images.to(memory_format=torch.channels_last)

            begin_t = time.monotonic_ns()
            # Move tensors to the configured device
            images = images.to(device)
            labels = labels.to(device)

            # Forward pass
            outputs = model(images)
            #if is_RP:
            #  labels = labels.to(dtype=torch.long)
            # TODO some optimization possible with a criterion that accepts  half precision int labels

            # Calculate the loss
            loss = criterion(outputs, labels)
            # Keep the latest loss as a Python float (avoids holding on to the autograd graph)
            final_loss = loss.item()

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            finish_t = time.monotonic_ns()
            trainingTimeArr.append((finish_t-begin_t)/1000000000.0)
            begin_i = time.monotonic_ns()
        finish_e = time.monotonic_ns()

        #print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
        #              .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

        epochTimeArr.append((finish_e-begin_e)/1000000000.0)
        epochDataLoaderTimeArr.append(np.sum(dataLoaderTimeArr))
        epochTrainingTimeArr.append(np.sum(trainingTimeArr))
        dataLoaderTimeArr = []
        trainingTimeArr = []

    wandb_run_id = wandb.run.id

    # Save model's final checkpoint
    directory = os.path.join("alexnet", 'wandb_{}'.format(wandb_run_id))
    if not os.path.exists(directory):
        os.makedirs(directory)
    torch.save({
        'iteration': num_epochs,
        'model': model.state_dict(),
        'opt': optimizer.state_dict(),
        'loss': loss,
    }, os.path.join(directory, 'wandb_{}_{}_{}.tar'.format(wandb_run_id, epoch, 'checkpoint')))

    # Log in wandb the loss for this training instance run
    wandb.log({"loss": final_loss})


    # Log in wandb
    wandb.log({"dataloader_time": np.mean(epochDataLoaderTimeArr[1:])})
    wandb.log({"training_time": np.mean(epochTrainingTimeArr[1:])})
    wandb.log({"epoch_time": np.mean(epochTimeArr[1:])})
    #wandb.log({"final_accuracy": final_accuracy})
    # Print time taken in epoch 2
    #if epochTimeArr:
        #print("Epoch timings: ", epochTimeArr[1])
        #print("DataLoader time: ", epochDataLoaderTimeArr[1])
        #print("Training time: ", epochTrainingTimeArr[1])
        #print("Epoch training loss: ", epoch_train_loss)
        #print("Epoch top-1 training accuracy: ", epoch_accuracy)

    # Perform testing accuracy evaluation
    # Put Dropout and BatchNorm layers into inference mode before evaluating
    model.eval()
    with torch.no_grad():
        correct = 0
        total = 0
        for images, labels in test_loader:
            images = images.to(device)
            labels = labels.to(device)

            #if is_RP:
            #  images = images.to(dtype=torch.float16)
            #  labels = labels.to(dtype=torch.int16)
            if is_CHANNELS_LAST:
              images = images.to(memory_format=torch.channels_last)

            outputs = model(images)
            #if is_RP:
            #  labels = labels.to(dtype=torch.long)

            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            del images, labels, outputs
        wandb.log({"test_accuracy": 100*correct/total})

        #print('Accuracy of the network on the {} test images: {} %'.format(10000, 100 * correct / total))
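
The timing bookkeeping in `train_wandb` separates the time spent waiting for the next batch from the time spent on the training step. A stripped-down sketch of that pattern follows, with a hypothetical `time_epoch` helper, a plain generator standing in for the DataLoader, and a callable standing in for the forward/backward pass. CUDA kernels are launched asynchronously, so on a GPU a call such as `torch.cuda.synchronize()` before reading the clock would likely be needed for the per-step numbers to reflect actual compute time; the notebook does not do this, and the sketch leaves it out as well.

```python
import time

def time_epoch(loader, step_fn):
    """Return (loader_ns, step_ns, epoch_ns) for one pass over `loader`."""
    loader_ns = 0
    step_ns = 0
    begin_epoch = time.monotonic_ns()
    begin_wait = begin_epoch
    for batch in loader:
        loader_ns += time.monotonic_ns() - begin_wait   # time spent waiting for the batch
        begin_step = time.monotonic_ns()
        step_fn(batch)                                   # stand-in for forward/backward/update
        step_ns += time.monotonic_ns() - begin_step
        begin_wait = time.monotonic_ns()
    epoch_ns = time.monotonic_ns() - begin_epoch
    return loader_ns, step_ns, epoch_ns

def slow_batches(n, delay=0.002):
    """Hypothetical loader that takes `delay` seconds to produce each batch."""
    for i in range(n):
        time.sleep(delay)
        yield i

loader_ns, step_ns, epoch_ns = time_epoch(slow_batches(5), lambda batch: sum(range(1000)))
print(loader_ns / 1e9, step_ns / 1e9, epoch_ns / 1e9)
```

The two partial times never add up to more than the epoch time, since whatever happens between the measured intervals (loop overhead, the final exhausted fetch) is only counted in the epoch total.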



## Wandb sweep message outputs below

In [None]:
wandb.agent(sweep_id, train_wandb, count=12)

wandb: Agent Starting Run: j9bxn98g with config:
wandb:   batch_size: 128
wandb:   channels_last: False
wandb:   learning_rate: 0.005
wandb:   num_workers: 8
wandb:   rp_b: True

Run summary:
dataloader_time: 4.3426
epoch_time: 40.63862
loss: 0.1582
test_accuracy: 69.40127
training_time: 36.07139


wandb: Agent Starting Run: e1chmzs3 with config:
wandb:   batch_size: 64
wandb:   channels_last: False
wandb:   learning_rate: 0.001
wandb:   num_workers: 8
wandb:   rp_b: True

Run summary:
dataloader_time: 3.16641
epoch_time: 40.00448
loss: 0.36914
test_accuracy: 58.77707
training_time: 36.61837


wandb: Agent Starting Run: a50ehbgg with config:
wandb:   batch_size: 128
wandb:   channels_last: False
wandb:   learning_rate: 0.005
wandb:   num_workers: 8
wandb:   rp_b: False

Run summary:
dataloader_time: 12.43836
epoch_time: 23.77476
loss: 0.10003
test_accuracy: 71.00637
training_time: 11.05258


wandb: Agent Starting Run: iddn8lt0 with config:
wandb:   batch_size: 128
wandb:   channels_last: False
wandb:   learning_rate: 0.0005
wandb:   num_workers: 2
wandb:   rp_b: True

Run summary:
dataloader_time: 2.1427
epoch_time: 38.2204
loss: 0.55469
test_accuracy: 35.33758
training_time: 36.00619


wandb: Agent Starting Run: puo9qmlp with config:
wandb:   batch_size: 512
wandb:   channels_last: True
wandb:   learning_rate: 0.001
wandb:   num_workers: 8
wandb:   rp_b: True

Run summary:
dataloader_time: 10.78738
epoch_time: 52.8952
loss: 0.60547
test_accuracy: 30.44586
training_time: 34.63438


wandb: Agent Starting Run: e0u86jhk with config:
wandb:   batch_size: 256
wandb:   channels_last: True
wandb:   learning_rate: 0.001
wandb:   num_workers: 4
wandb:   rp_b: True

Run summary:
dataloader_time: 4.07097
epoch_time: 42.05632
loss: 0.41992
test_accuracy: 39.89809
training_time: 31.76689


wandb: Agent Starting Run: v9i5dyz4 with config:
wandb:   batch_size: 256
wandb:   channels_last: False
wandb:   learning_rate: 0.0005
wandb:   num_workers: 8
wandb:   rp_b: True

Run summary:
dataloader_time: 6.38713
epoch_time: 43.10155
loss: 0.66406
test_accuracy: 25.52866
training_time: 36.39449


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: opppjyf5 with config:
wandb:   batch_size: 512
wandb:   channels_last: True
wandb:   learning_rate: 0.001
wandb:   num_workers: 8
wandb:   rp_b: False

Run summary:
dataloader_time: 13.92012
epoch_time: 32.22203
loss: 0.25531
test_accuracy: 53.93631
training_time: 5.94391


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: p6wke1wp with config:
wandb:   batch_size: 256
wandb:   channels_last: False
wandb:   learning_rate: 0.0005
wandb:   num_workers: 8
wandb:   rp_b: True

Run summary:
dataloader_time: 6.119
epoch_time: 42.83769
loss: 0.76172
test_accuracy: 25.93631
training_time: 36.47749


wandb: Agent Starting Run: 76pwki36 with config:
wandb:   batch_size: 256
wandb:   channels_last: True
wandb:   learning_rate: 0.001
wandb:   num_workers: 4
wandb:   rp_b: True

Run summary:
dataloader_time: 3.8854
epoch_time: 42.11505
loss: 0.52344
test_accuracy: 40.28025
training_time: 31.62783


wandb: Agent Starting Run: etawr5n6 with config:
wandb:   batch_size: 256
wandb:   channels_last: True
wandb:   learning_rate: 0.0005
wandb:   num_workers: 1
wandb:   rp_b: False

Run summary:
dataloader_time: 17.06331
epoch_time: 28.27115
loss: 0.22226
test_accuracy: 54.29299
training_time: 2.87396


wandb: Agent Starting Run: zsfgql86 with config:
wandb:   batch_size: 64
wandb:   channels_last: True
wandb:   learning_rate: 0.001
wandb:   num_workers: 4
wandb:   rp_b: True

Run summary:
dataloader_time: 2.44975
epoch_time: 38.97622
loss: 0.58203
test_accuracy: 57.88535
training_time: 32.02926


# Preliminary analysis of the study

Before the sweeps, we ran an initial study of two-epoch runtimes for the simple AlexNet above.

After warming up the GPU (cache) in the first epoch, the following details were observed. We consider the timings of the **second** epoch alone.

That is, the timings printed below are for the second epoch only, and the initial observations made from them guided the sweeps.

NOTE: The reduced-precision runs show a lower loss. All runs without reduced precision originally used Cross Entropy; as that criterion did not support reduced precision, we switched to the multi-class hinge loss for all further experiments, including the wandb sweep study above. MultiMarginLoss is the criterion used for all methods.
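
For reference, the multi-class hinge loss computed by `nn.MultiMarginLoss` with its default settings (`p=1`, `margin=1`, no class weights) sums, for one sample, the margin violations of every wrong class and divides by the number of classes; the batch loss is then the mean over samples. A plain-Python sketch for a single sample, with a hypothetical helper name `multi_margin_loss`:

```python
def multi_margin_loss(scores, target, margin=1.0):
    """Multi-class hinge loss for one sample: mean over classes of margin violations."""
    total = 0.0
    for i, score in enumerate(scores):
        if i != target:
            # A wrong class contributes when it comes within `margin` of the target score
            total += max(0.0, margin - scores[target] + score)
    return total / len(scores)

print(multi_margin_loss([2.0, 0.5, 1.5], target=0))
print(multi_margin_loss([5.0, 0.0, 0.0], target=0))
```

Because the terms are averaged over the classes, the values are on a different scale from Cross Entropy, so losses obtained with the two criteria are not directly comparable.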

----------------------------------------------------------------


#### Using both reduced precision and channels last (no extra workers)

Epoch [1/2], Step [134/134], Loss: 0.0761

Epoch [2/2], Step [134/134], Loss: 0.0623

Epoch timings:  22.783853241

DataLoader time:  18.820258812

Training time:  **1.3892146479999998**

#### No reduced precision, no channels last, no extra dataloader workers (baseline for comparison)

Epoch [1/2], Step [134/134], Loss: 0.9561

Epoch [2/2], Step [134/134], Loss: 0.9339

Epoch timings:  22.525280558

DataLoader time:  20.670148408

Training time:  *1.8536308670000001*





#### Only channels last
Epoch [1/2], Step [134/134], Loss: 0.8896

Epoch [2/2], Step [134/134], Loss: 0.9199

Epoch timings:  23.722282422

DataLoader time:  21.622362816

Training time:  *1.373590375*


#### Only reduced precision
Epoch [1/2], Step [134/134], Loss: 0.8828

Epoch [2/2], Step [134/134], *Loss: 0.7316*

Epoch timings:  29.867037131

DataLoader time:  22.407764295

Training time:  2.3477204069999997



#### Using four dataloader workers

Epoch [1/2], Step [134/134], Loss: 0.9497

Epoch [2/2], Step [134/134], Loss: 0.8493

Epoch timings:  26.6090443

DataLoader time:  **4.016212597999999**

Training time:  **6.156122407**
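
The second-epoch numbers above can be turned into ratios against the baseline. In the small sketch below, the dictionaries simply re-type the timings printed above, and a ratio above 1 means the variant was faster than the baseline on that measure.

```python
baseline = {'dataloader': 20.670148408, 'training': 1.8536308670000001}
variants = {
    'rp + channels last': {'dataloader': 18.820258812, 'training': 1.3892146479999998},
    'channels last only': {'dataloader': 21.622362816, 'training': 1.373590375},
    'rp only': {'dataloader': 22.407764295, 'training': 2.3477204069999997},
    'four workers': {'dataloader': 4.016212597999999, 'training': 6.156122407},
}

# Ratio of baseline time to variant time, per measure
speedups = {
    name: {key: baseline[key] / timing[key] for key in baseline}
    for name, timing in variants.items()
}
for name, ratio in speedups.items():
    print(f"{name}: dataloader x{ratio['dataloader']:.2f}, training x{ratio['training']:.2f}")
```

These are single-run measurements, so small differences between variants should be read with caution.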

The following code is for getting all models in a tarball

In [None]:
!tar -czvf alexnet.tar.gz alexnet

alexnet/
alexnet/wandb_iddn8lt0/
alexnet/wandb_iddn8lt0/wandb_iddn8lt0_10_checkpoint.tar
alexnet/wandb_gd18e5z0/
alexnet/wandb_gd18e5z0/wandb_gd18e5z0_10_checkpoint.tar
alexnet/wandb_v9i5dyz4/
alexnet/wandb_v9i5dyz4/wandb_v9i5dyz4_10_checkpoint.tar
alexnet/wandb_etawr5n6/
alexnet/wandb_etawr5n6/wandb_etawr5n6_10_checkpoint.tar
alexnet/wandb_m7vn00mr/
alexnet/wandb_m7vn00mr/wandb_m7vn00mr_10_checkpoint.tar
alexnet/wandb_n417ts97/
alexnet/wandb_n417ts97/wandb_n417ts97_10_checkpoint.tar
alexnet/wandb_j9bxn98g/
alexnet/wandb_j9bxn98g/wandb_j9bxn98g_10_checkpoint.tar
alexnet/wandb_p6wke1wp/
alexnet/wandb_p6wke1wp/wandb_p6wke1wp_10_checkpoint.tar
alexnet/wandb_e1chmzs3/
alexnet/wandb_e1chmzs3/wandb_e1chmzs3_10_checkpoint.tar
alexnet/wandb_skb1k8ql/
alexnet/wandb_skb1k8ql/wandb_skb1k8ql_10_checkpoint.tar
alexnet/wandb_puo9qmlp/
alexnet/wandb_puo9qmlp/wandb_puo9qmlp_10_checkpoint.tar
alexnet/wandb_7bl3p9wg/
alexnet/wandb_7bl3p9wg/wandb_7bl3p9wg_10_checkpoint.tar
alexnet/wandb_opppjyf5/
alexnet

In [None]:
!ls -l alexnet.tar.gz

-rw-r--r-- 1 root root 8509458218 Dec 20 12:22 alexnet.tar.gz


As downloading large files from Google Drive failed, and other methods such as the gdrive tool by Rasmussen are no longer functional, I move the files to Drive and download them with the code below.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive',force_remount=True)


Mounted at /content/gdrive


As the tarball of all models was too large to download, I had to manually download each saved model in the following way and upload it to Google Drive for sharing again from the Columbia account.

In [None]:
!cp '/content/alexnet/wandb_opppjyf5/wandb_opppjyf5_10_checkpoint.tar' '/content/gdrive/MyDrive/wandb_opppjyf5_10_checkpoint.tar'


In [None]:
!ls -lt /content/gdrive/MyDrive/wandb/

total 455699
-rw------- 1 root root 466635058 Dec 20 14:22 wandb_opppjyf5_10_checkpoint.tar
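
The size of this checkpoint is roughly what one would expect for a full-precision run (`opppjyf5` used `rp_b: False`). The file stores the model's `state_dict` plus the optimizer state, and SGD with momentum keeps one buffer per parameter, so it holds about two float32 copies of the weights. A back-of-the-envelope sketch follows, re-typing the layer shapes from the `AlexNet` class above and ignoring the BatchNorm running statistics and file-format overhead.

```python
# (in_channels, out_channels, kernel_size) for layer1 to layer5
convs = [(3, 96, 11), (96, 256, 5), (256, 384, 3), (384, 384, 3), (384, 256, 3)]
# (in_features, out_features) for fc, fc1, fc2
linears = [(9216, 4096), (4096, 4096), (4096, 10)]

params = 0
for c_in, c_out, k in convs:
    params += c_in * c_out * k * k + c_out   # conv weights + bias
    params += 2 * c_out                      # BatchNorm scale + shift
for f_in, f_out in linears:
    params += f_in * f_out + f_out           # linear weights + bias

estimate_bytes = 2 * 4 * params              # weights + momentum buffers, 4 bytes each
print(params, estimate_bytes)
```

The estimate lands within a fraction of a percent of the 466635058 bytes listed above, which also explains why a tarball of a dozen such checkpoints runs to several gigabytes.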


In [None]:
!cp '/content/gdrive/MyDrive/wandb/wandb_opppjyf5_10_checkpoint.tar' '/content/alexnet/wandb_opppjyf5/wandb_opppjyf5_10_checkpoint.tar'

In [None]:
!rm '/content/gdrive/MyDrive/wandb/wandb_r3fg86cg_10_checkpoint.tar'

In [None]:
!rm '/content/gdrive/MyDrive/wandb/wandb_opppjyf5_10_checkpoint.tar'