This notebook is for perf-testing PyTorch vs TensorFlow with and without GPU on a simple training set so I can figure out the best environment for training models. Here's the setup I used
* Windows 11, i7-10 16GB RAM, RTX 2060 GPU w 6 GB RAM, VS Code
* Fashion mnista data set to train a 9-layer CNN
* Tensorflow w CUDA via WSL set up per https://www.tensorflow.org/install/pip
* Tensorflow w CUDA via direct ML set up per https://learn.microsoft.com/en-us/windows/ai/directml/gpu-tensorflow-plugin

Overall conclusion: Tensorflow w CUDA via direct ML was the winner by a wide margin

Specific results when running the code below on various configurations above:
1) CUDA was substantially faster than CPU on the 9 layer network. Experimentation not in the notebook shows the speedup depends on the network size; cpu was faster than cuda in a small 3-layer network, about 4x benefit on this 7 layer network, and 9x faster on a larger u-net. 
2) Tensorflow w CUDA via WSL (the recommended configuration) had all sorts of problems (see below), so I won't use this again until the tech matures.
3) Tensorflow w CUDA via direct ML is the clear winner; good speed up, easy to code, just worked well
4) PyTorch trained a bit faster, but with less accuracy. It required 3x more code for the mixed results. So this will not be my environment of choice.

My experience on tensorflow CUDA via WSL (tensorflow.org's recommended configuration for Windows) was not good:
* Setting it up was a PITA
* After you are done, there are spurrious warnings about tensorRT and NUMA
* WSL eats up a ton of disk space, and worse it eats up a ton or RAM when running, and worst of all it eats 1GB of RAM even when it is not running(!) due to virualization of the operation systems
* It ran slower than CPU on my small models and crashed on larger models

In [10]:
import tensorflow as tf

device = "cpu"
if tf.config.list_physical_devices("GPU"):
    device = "cuda"
print(f"Using device: {device}")

BATCH_SIZE = 64
EPOCHS = 10

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train/255.0
x_test  = x_test/255.0

def get_model():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1), padding='same'),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
    return model

model = get_model()
print(f"Training on {device}. Number of model parameters: {model.count_params():,d}")
current_time = tf.timestamp()
model.fit(x_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE)
elapsed_time = tf.timestamp() - current_time
print (f"{device} Training time: {elapsed_time:.2f} seconds")
if device == "cuda":
    print(f"GPU physical memory: {tf.config.experimental.get_memory_info('GPU:0')}")
print()

# Print accuracy and loss on the test set
total_loss, test_acc = model.evaluate(x_test,  y_test, verbose=0)
print(f"Test accuracy: {test_acc:.4f}, loss: {total_loss:.4f}")

Using device: cuda
Training on cuda. Number of model parameters: 620,682
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
cuda Training time: 58.49 seconds
GPU physical memory: {'current': 653773056, 'peak': 676817920}

Test accuracy: 0.9222, loss: 0.2786


Now let's try it with PyTorch. Note that the code is *much* more complex as it requires:
1) Manually computing the number of input parameters at each layer (since no built-in model.compile)
2) Creating a training loop (since no built-in model.fit)
3) Creating a custom dataset class, since the nn.mnist dataset loads images from disk each epoch, creating a significant performance bottleneck

Given that PyTorch required about 3x as much coding to do the same thing, and has no clear benefits, it will not be my platform of choice for now.

In [30]:
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from torchvision import datasets, transforms
from pandas import read_csv

device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
print(f"Using device: {device}")

DIR = "data/fashionmnist"
BATCH_SIZE = 64
EPOCHS = 10

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    total_loss, correct = 0, 0

    for X, y in dataloader:
        X, y = X.to(device), y.to(device)

        # Predict and compute loss and accuracy
        pred = model(X)
        loss = loss_fn(pred, y)
        total_loss += loss.item()
        correct += (pred.argmax(1) == y).type(torch.float).sum().item()

        # Backprop
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return total_loss / size, correct / size

def train(dataloader, model, loss_fn, optimizer, epochs):
    current_time = time.time()
    for t in range(EPOCHS):
        print(f"Epoch {t+1} - ", end="")
        loss, accuracy = train_loop(dataloader, model, loss_fn, optimizer)
        print(f"loss: {loss:.4f}, accuracy: {accuracy:.4f}")
    elapsed_time = time.time() - current_time
    print (f"{device} Training time: {elapsed_time:.2f} seconds")

def test_loop(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    total_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            total_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    total_loss /= size
    correct /= size
    print(f"Accuracy: {(100*correct):>0.1f}%, Avg loss: {total_loss:>8f} \n")

# A datasets.MNIST wrapper that holds normalized images in memory rather than loading from disc each epoch
# Annoying that Pytorch doesn't have this built in (like tensorflow does)
import PIL.Image as Image
import numpy as np
class FashionDataset(Dataset):
    """User defined class to build a datset using Pytorch class Dataset."""
    
    def __init__(self, dir, train):
        """Method to initilaize variables.""" 
        dataset = datasets.MNIST(dir, download=True, train=train)

        labels = []
        images = []

        # Iterate over each image and lable in the dataset
        for item in list(dataset):
            label = item[1]
            labels.append(label)
            image = np.asarray(item[0])
            image = image/255.0
            image = torch.FloatTensor(image).view(1, 28, 28)
            images.append(image)
        
        self.labels = labels
        self.images = images

    def __getitem__(self, index):
        label = self.labels[index]
        image = self.images[index]
        return image, label

    def __len__(self):
        return len(self.images)

def evaluate():
    model = nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, padding="same"),    # Output: 32x28x28
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2),        # Output: 32x14x14; 14=28/2
        nn.Conv2d(32, 64, kernel_size=3),   # Output: 64x12x12; 12=14-3+1
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2),        # Output: 64x6x6; 6=12/2
        nn.Flatten(),
        nn.BatchNorm1d(64*6*6),
        nn.Linear(64*6*6, 256),
        nn.ReLU(),
        nn.Dropout1d(0.3),
        nn.Linear(256, 10),
    ).to(device)
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters())
    print(f"Number of parameters: {sum(p.numel() for p in model.parameters())}")

    train_set = FashionDataset(DIR, train=True)
    train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
    model = model.to(device)
    train(train_loader, model, loss_fn, optimizer, EPOCHS)

    test_set = FashionDataset(DIR, train=False)
    test_loader = DataLoader(test_set, batch_size=BATCH_SIZE)
    test_loop(test_loader, model, loss_fn)

print(f"Training on {device}")
evaluate()


Using device: cuda
Training on cuda
Number of parameters: 616074
Epoch 1 - loss: 0.0122, accuracy: 0.7054
Epoch 2 - loss: 0.0115, accuracy: 0.7196
Epoch 3 - loss: 0.0112, accuracy: 0.7255
Epoch 4 - loss: 0.0111, accuracy: 0.7278
Epoch 5 - loss: 0.0111, accuracy: 0.7288
Epoch 6 - loss: 0.0111, accuracy: 0.7268
Epoch 7 - loss: 0.0110, accuracy: 0.7299
Epoch 8 - loss: 0.0109, accuracy: 0.7312
Epoch 9 - loss: 0.0110, accuracy: 0.7310
Epoch 10 - loss: 0.0110, accuracy: 0.7304
cuda Training time: 31.63 seconds
Accuracy: 72.5%, Avg loss: 0.011253 

