This notebook is for perf-testing PyTorch vs TensorFlow with and without GPU on a simple training set so I can figure out the best environment for training models. Here's the setup I used
* Windows 11, i7-10 16GB RAM, RTX 2060 GPU w 6 GB RAM, VS Code
* Fashion mnista data set to train a 9-layer CNN
* Tensorflow w CUDA via WSL set up per https://www.tensorflow.org/install/pip
* Tensorflow w CUDA via direct ML set up per https://learn.microsoft.com/en-us/windows/ai/directml/gpu-tensorflow-plugin

Overall conclusion: Tensorflow w CUDA via direct ML was the winner by a wide margin

Specific results when running the code below on various configurations above:
1) CUDA was substantially faster than CPU on the 9 layer network. Experimentation not in the notebook shows the speedup depends on the network size; cpu was faster than cuda in a small 3-layer network, about 4x benefit on this 7 layer network, and 9x faster on a larger u-net. 
2) Tensorflow w CUDA via WSL (the recommended configuration) had all sorts of problems (see below), so I won't use this again until the tech matures.
3) Tensorflow w CUDA via direct ML is the clear winner; good speed up, easy to code, just worked well
4) PyTorch had similar training performance, but required 3x more code and had no clear upside, so this will not be my environment of choice.

My experience on tensorflow CUDA via WSL (tensorflow.org's recommended configuration for Windows) was not good:
* Setting it up was a PITA
* After you are done, there are spurrious warnings about tensorRT and NUMA
* WSL eats up a ton of disk space, and worse it eats up a ton or RAM when running, and worst of all it eats 1GB of RAM even when it is not running(!) due to virualization of the operation systems
* It ran slower than CPU on my small models and crashed on larger models

In [1]:
import tensorflow as tf

HAS_GPU = len (tf.config.list_physical_devices("GPU")) > 0
if HAS_GPU:
    print("Available GPU devices:", tf.config.list_physical_devices("GPU"))

BATCH_SIZE = 64
EPOCHS = 10

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train/255.0
x_test  = x_test/255.0

def get_model():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1), padding='same'),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
    return model

with tf.device('/cpu:0'):
    model = get_model()
    print (f"Training on cpu. Number of model parameters: {model.count_params():,d}")
    current_time = tf.timestamp()
    model.fit(x_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE)
    elapsed_time = tf.timestamp() - current_time
    print (f"CPU Training time: {elapsed_time:.2f} seconds")

if HAS_GPU:
    model = get_model()
    print (f"Training on GPU. Number of model parameters: {model.count_params():,d}")
    current_time = tf.timestamp()
    model.fit(x_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE)
    elapsed_time = tf.timestamp() - current_time
    print (f"GPU Training time: {elapsed_time:.2f} seconds")
    print(f"GPU:0 physical memory: {tf.config.experimental.get_memory_info('GPU:0')}")

# Print accuracy and loss on the test set
test_loss, test_acc = model.evaluate(x_test,  y_test, verbose=0)
print(f"Test accuracy: {test_acc:.4f}, loss: {test_loss:.4f}")

Available GPU devices: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Training on cpu. Number of model parameters: 620,682
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU Training time: 165.61 seconds
Training on GPU. Number of model parameters: 620,682
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
GPU Training time: 54.91 seconds
GPU:0 physical memory: {'current': 653702400, 'peak': 676082176}
Test accuracy: 0.9152, loss: 0.3024


Now let's try it with PyTorch. Note that the code is *much* more complex as it requires:
1) Manually computing the number of input parameters at each layer (since no built-in model.compile)
2) Creating a training loop (since no built-in model.fit)
3) Creating a custom dataset class, since the nn.mnist dataset loads images from disk each epoch, creating a significant performance bottleneck

Given that PyTorch required about 3x as much coding to do the same thing, and has no clear benefits, it will not be my platform of choice for now.

In [2]:
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from torchvision import datasets, transforms
from pandas import read_csv

DIR = "data/fashionmnist"
BATCH_SIZE = 64
EPOCHS = 10

device = "cuda" if torch.cuda.is_available() else "cpu"
train_data = read_csv(DIR + "/fashion-mnist_train.csv")
test_data = read_csv(DIR + "/fashion-mnist_train.csv")

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    avg_loss = 0
    for X, y in dataloader:
        X, y = X.to(device), y.to(device)
        # Compute prediction and loss
        #print (X.shape)
        pred = model(X)
        loss = loss_fn(pred, y)
        avg_loss += loss.item()
        # Backprop
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return avg_loss / size

def train(dataloader, model, loss_fn, optimizer, epochs):
    current_time = time.time()
    for t in range(EPOCHS):
        print(f"Epoch {t+1}...", end="")
        avg_loss = train_loop(dataloader, model, loss_fn, optimizer)
        print(f"Avg loss: {avg_loss:.4f}")
    elapsed_time = time.time() - current_time
    print (f"{device} Training time: {elapsed_time:.2f} seconds")

def test_loop(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= size
    correct /= size
    print(f"Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

# Custom dataset that holds things in memory rather than loading from disc each time
class FashionDataset(Dataset):
    """User defined class to build a datset using Pytorch class Dataset."""
    
    def __init__(self, data):
        """Method to initilaize variables.""" 
        self.fashion_MNIST = list(data.values)
        
        labels = []
        images = []
        
        for i in self.fashion_MNIST:
            label = i[0]
            labels.append(label)

            image = i[1:]
            image = image/255.0
            image = torch.FloatTensor(image).view(1, 28, 28)
            images.append(image)
        
        self.labels = labels
        self.images = images

    def __getitem__(self, index):
        label = self.labels[index]
        image = self.images[index]

        return image, label

    def __len__(self):
        return len(self.images)
    
def evaluate():
    model = nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, padding="same"),    # Output: 32x28x28
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2),        # Output: 32x14x14; 14=28/2
        nn.Conv2d(32, 64, kernel_size=3),   # Output: 64x12x12; 12=14-3+1
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2),        # Output: 64x6x6; 6=12/2
        nn.Flatten(),
        nn.BatchNorm1d(64*6*6),
        nn.Linear(64*6*6, 256),
        nn.ReLU(),
        nn.Dropout1d(0.3),
        nn.Linear(256, 10),
    ).to(device)
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters())
    print(f"Number of parameters: {sum(p.numel() for p in model.parameters())}")

    train_set = FashionDataset(train_data)
    train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
    model = model.to(device)
    train(train_loader, model, loss_fn, optimizer, EPOCHS)

    test_set = FashionDataset(test_data)
    test_loader = DataLoader(test_set, batch_size=BATCH_SIZE)
    test_loop(test_loader, model, loss_fn)

devices = ["cpu"]
if torch.cuda.is_available():
    print("cuda GPU is available")
    devices.append("cuda")

for device in devices:
    print(f"Training on {device}")
    evaluate()


cuda GPU is available
Training on cpu
Number of parameters: 616074
Epoch 1...Avg loss: 0.0149
Epoch 2...Avg loss: 0.0137
Epoch 3...Avg loss: 0.0133
Epoch 4...Avg loss: 0.0131
Epoch 5...Avg loss: 0.0129
Epoch 6...Avg loss: 0.0128
Epoch 7...Avg loss: 0.0125
Epoch 8...Avg loss: 0.0123
Epoch 9...Avg loss: 0.0122
Epoch 10...Avg loss: 0.0120
cpu Training time: 232.72 seconds
Accuracy: 70.6%, Avg loss: 0.011809 

Training on cuda
Number of parameters: 616074
Epoch 1...Avg loss: 0.0151
Epoch 2...Avg loss: 0.0138
Epoch 3...Avg loss: 0.0134
Epoch 4...Avg loss: 0.0130
Epoch 5...Avg loss: 0.0129
Epoch 6...Avg loss: 0.0126
Epoch 7...Avg loss: 0.0125
Epoch 8...Avg loss: 0.0123
Epoch 9...Avg loss: 0.0121
Epoch 10...Avg loss: 0.0120
cuda Training time: 29.95 seconds
Accuracy: 70.2%, Avg loss: 0.011927 

