# Hosted MLFlow Alternatives

The main problem with using MLFlow is deciding where to host the MLFlow server. You can start an instance inside Colab, but once the virtual environment is killed, your data is unrecoverable.

## Solutions

Two hosted platforms solve this problem:

1. **Weights & Biases** (wandb)
2. **Neptune.ai** (neptune)

Both platforms follow the same workflow as MLFlow (start a run, log parameters and metrics, finish the run), although the method names differ. We cover basic usage of each platform, but we recommend starting with **Weights & Biases**: the platform is more intuitive and offers more features.

## Creating an Account

Both platforms are **free** for students and educators. This notebook uses wandb, but the same steps carry over to Neptune.

Create your accounts here:
1. [https://wandb.ai/](https://wandb.ai/)
2. [https://neptune.ai/](https://neptune.ai/)

## Getting Started

To use either platform, you must first start a `run`. A run is a single execution of your model training process, together with the hyperparameters and metrics recorded for it.

### Weights & Biases (wandb)

First, install wandb if you haven't already:

> ```
> pip install wandb
> ```

#### Initialize a new run
> ```python
> import wandb
> run = wandb.init(project="your_project_name")
> ```

#### Log hyperparameters

> ```python
> config = {"epochs": 10}
> run = wandb.init(project="your_project_name", config=config)
> # - or -
> run.config.update({"epochs": 10})
> # - or -
> run.config["epochs"] = 10
> ```


#### Log metrics
> ```python
> wandb.log({"epoch": epoch, "loss": loss, "accuracy": accuracy})
> ```

#### Finish Run

> ```python
> run.finish()
> ```


### Neptune.ai (neptune)

First, install neptune if you haven't already:

> `pip install neptune`

#### Initialize a new run

> ```python
> import neptune
> run = neptune.init_run(project="your_workspace/your_project")
> ```

#### Log hyperparameters

> ```python
> run["parameters"] = { "epochs": 10 }
> # - or -
> run["parameters/epochs"] = 10
> ```

#### Log metrics

> ```python
> run[f"metrics/epoch_{epoch}/loss"] = loss
> run[f"metrics/epoch_{epoch}/accuracy"] = accuracy
> ```

#### Finish Run

> ```python
> run.stop()
> ```



## Some Additional Information

### **Use the config as a source of truth**

Example:

**Do NOT do the following**

```python
for epoch in range(NUM_EPOCHS):
    ...
```

**Instead, do:**
```python
for epoch in range(config.epochs):
    ...
```
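One way to read this advice: define each hyperparameter once in the config and have the rest of the code read it from there. The sketch below uses `types.SimpleNamespace` as a stand-in for `run.config` (an assumption, so that it runs without an account), and the `train` function is hypothetical. With wandb you would read `run.config.epochs` in the same way.

```python
from types import SimpleNamespace

# Stand-in for `run.config`; the names and values are illustrative only.
config = SimpleNamespace(epochs=3, lr=0.001)


def train(config):
    """Hypothetical loop that reads every hyperparameter from the config."""
    history = []
    for epoch in range(config.epochs):
        # A real loop would train here; we only record what was used.
        history.append({"epoch": epoch + 1, "lr": config.lr})
    return history


history = train(config)
print(len(history))  # one entry per configured epoch
```

Because the loop never touches a separate constant, changing `config.epochs` changes both what is logged and what actually runs.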

### Advantages of these platforms

1. You can save models throughout the training process (we will be doing this)
2. You can fork a run by loading a saved model and continuing with different settings (e.g. adjusting `learning_rate` around epoch 40).
3. Statistics about CPU and GPU usage are automatically recorded.

## Key Differences
**between MLFlow, WandB and Neptune.ai**

1. WandB and MLFlow share a very similar API design (e.g. `wandb.log` and `mlflow.log_metrics` both take a dictionary of metrics), whereas Neptune assigns values to paths on the run object (e.g. `run["metrics/loss"]`).
2. WandB has an artifact registry
3. WandB and Neptune are hosted for you (in the cloud)
4. Only MLFlow is open-source

In [None]:
%pip install -q wandb

# - or
# %pip install -q neptune

# Hotdog, or Not hotdog

<img src="https://www.oreilly.com/content/wp-content/uploads/sites/2/2020/01/Figure_1-71076f8ac360d6a065cf19c6923310d2.jpg" width="300"/>

Many of you may have seen or heard of the show Silicon Valley. In one famous clip, a character builds an ML app that predicts whether something is, or is not, a hotdog.

As an example, let's implement this.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os
import torch
import torchvision
import torch.nn as nn
import torch.optim as optim

from PIL import Image
from torchvision import transforms
from torch.utils.data import DataLoader, Dataset

### Dataset

The dataset we will be using is already split into train and test sets. (How convenient!)

https://www.kaggle.com/datasets/dansbecker/hot-dog-not-hot-dog/data

Let's get started by defining our dataloader and dataset.

In [None]:
# we download this dataset from GitHub (it is also on Kaggle,
# but downloading from Kaggle takes longer and needs an API token)
!git clone https://github.com/youngsoul/hotdog-not-hotdog-dataset

Cloning into 'hotdog-not-hotdog-dataset'...
remote: Enumerating objects: 4586, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 4586 (delta 0), reused 1 (delta 0), pack-reused 4583 (from 1)
Receiving objects: 100% (4586/4586), 223.14 MiB | 14.32 MiB/s, done.
Resolving deltas: 100% (3/3), done.
Updating files: 100% (4905/4905), done.


In [None]:
class HotdogDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        """
        Args:
            root_dir (str): Directory with all the images.
            transform (callable, optional): Optional transform to be applied on a sample.
        """
        self.root_dir = root_dir
        self.transform = transform
        self.image_paths = []
        self.labels = []

        # Load image paths and labels
        for label, sub_dir in enumerate(['hot_dog', 'not_hot_dog']):
            sub_dir_path = os.path.join(root_dir, sub_dir)
            for filename in os.listdir(sub_dir_path):
                if filename.endswith(".jpg"):
                    self.image_paths.append(os.path.join(sub_dir_path, filename))
                    self.labels.append(label)

    def __len__(self):
        """Returns the number of samples in the dataset."""
        return len(self.image_paths)

    def __getitem__(self, idx):
        """Loads and returns a sample (image and label)."""
        img_path = self.image_paths[idx]
        image = Image.open(img_path).convert("RGB")
        label = self.labels[idx]

        if self.transform:
            image = self.transform(image)

        return image, label
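Before training, it can help to check how many images each class folder contains. The following is a minimal sketch that mirrors the directory scan in `HotdogDataset`. The helper name `count_images_per_class` is hypothetical, and the demo builds a throwaway directory (an assumption, so that it runs without the dataset).

```python
import os
import tempfile


def count_images_per_class(root_dir, class_dirs=("hot_dog", "not_hot_dog")):
    """Count the .jpg files in each class sub-directory of root_dir."""
    counts = {}
    for sub_dir in class_dirs:
        sub_dir_path = os.path.join(root_dir, sub_dir)
        counts[sub_dir] = sum(
            1 for filename in os.listdir(sub_dir_path) if filename.endswith(".jpg")
        )
    return counts


# Demo on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for sub_dir, n_files in (("hot_dog", 2), ("not_hot_dog", 3)):
        os.makedirs(os.path.join(tmp, sub_dir))
        for i in range(n_files):
            open(os.path.join(tmp, sub_dir, f"{i}.jpg"), "w").close()
    demo_counts = count_images_per_class(tmp)

print(demo_counts)
```

On the real data you would call it with the dataset's `train` directory. A large imbalance between the two counts makes plain accuracy harder to interpret.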

### Define our Transforms & Load Dataset

In [None]:
# the transforms that we are using
transform = transforms.Compose([
    transforms.Resize((128, 128)),  # Resize to a fixed size
    transforms.ToTensor(),  # Convert image to a tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize with ImageNet stats
])

train_dataset = HotdogDataset("./hotdog-not-hotdog-dataset/train", transform=transform)
test_dataset = HotdogDataset("./hotdog-not-hotdog-dataset/test", transform=transform)
val_dataset = HotdogDataset("./hotdog-not-hotdog-dataset/holdout", transform=transform)
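To make the `Normalize` step concrete: `ToTensor` scales pixel values to the range [0, 1], and `Normalize` then computes `(value - mean) / std` for each channel. The sketch below reproduces that arithmetic for a single RGB pixel in plain Python. The helper `normalize_pixel` is hypothetical and is not part of torchvision.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Apply (value - mean) / std to each channel of one RGB pixel in [0, 1]."""
    return [(value - m) / s for value, m, s in zip(rgb, mean, std)]


# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(IMAGENET_MEAN))  # [0.0, 0.0, 0.0]

# A pure white pixel (1.0 in every channel) maps to positive values.
white = normalize_pixel([1.0, 1.0, 1.0])
print(white)
```

These statistics match the ones the pretrained ImageNet weights were trained with, which is why the same values are reused here.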

### Create our Run

For now, you can simply follow the login prompt. If you are interested in secrets management, see the following:

```python
if 'COLAB_GPU' in os.environ:
    from google.colab import userdata
    key = userdata.get('wandb_api_key')
elif 'KAGGLE_CONTAINER_NAME' in os.environ:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    key = user_secrets.get_secret("wandb_api_key")
else:
    key = None

wandb.login(key=key)
```
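Outside Colab and Kaggle, a common option is an environment variable: wandb reads `WANDB_API_KEY` on its own when no key is passed. The sketch below only shows how you might look the variable up yourself, for example to fail early when it is missing. The helper name `resolve_api_key` is hypothetical.

```python
import os


def resolve_api_key(environ=os.environ):
    """Return the key stored in WANDB_API_KEY, or None if it is unset or empty."""
    key = environ.get("WANDB_API_KEY", "").strip()
    return key or None


key = resolve_api_key()
print("key found" if key else "no key set; wandb will prompt for login")
```

Keeping the key in the environment (or a secrets store) avoids committing it to the notebook.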

In [None]:
import wandb

# define our run
run = wandb.init(
  project="hotdog-not-hotdog",
)

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: anony-moose-183990734713950594 to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


### Define our Hyperparameters


In [None]:
# define our configuration
# feel free to change this
run.config.update({
    "n_epochs": 10,
    "lr": 0.001,
    "batch_size": 64,
})

### Create Dataloaders

Create loaders for our three different datasets: train, test and holdout (validation).

In [None]:
train_loader = DataLoader(train_dataset, batch_size=run.config["batch_size"], shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=run.config["batch_size"], shuffle=False)
val_loader = DataLoader(val_dataset, batch_size=run.config["batch_size"], shuffle=False)
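The batch size also determines how many batches each loader yields per epoch, which is what `len(train_loader)` returns. With the default `drop_last=False`, the last batch may be smaller than the rest. A minimal sketch of the arithmetic follows; the helper `batches_per_epoch` and the sample count of 498 are illustrative assumptions.

```python
import math


def batches_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of batches a loader yields for n_samples items."""
    if drop_last:
        # Incomplete final batch is discarded.
        return n_samples // batch_size
    # Incomplete final batch is kept.
    return math.ceil(n_samples / batch_size)


print(batches_per_epoch(498, 64))                  # 8
print(batches_per_epoch(498, 64, drop_last=True))  # 7
```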

### Create our Model

We aren't really creating a model, just fine-tuning a pretrained one. This will save us some time.

In [None]:
device = torch.device(
    "cuda" if torch.cuda.is_available() else          # for GPUs
    "mps" if torch.backends.mps.is_available() else   # for Apple Silicon chips
    "cpu"                                             # else
)

In [None]:
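from torchvision.models import resnet50

# define our model, optimizer, and criterion
# start from ImageNet weights, then swap the 1000-class head for a 2-class one
model = resnet50(weights="IMAGENET1K_V2")
model.fc = nn.Linear(model.fc.in_features, 2)  # hot dog / not hot dog
model = model.to(device)
optimizer = optim.Adam(model.parameters(), lr=run.config["lr"])
criterion = nn.CrossEntropyLoss()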

Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 124MB/s]


In [None]:
# todo: log this information into the run configuration

### Prepare for Stats

Open the dashboard (either in another tab, or here!)

In [None]:
# you can only see this if you are logged in.
%wandb

### Train our Model

Let's create our training loop...

In [None]:
for epoch in range(run.config["n_epochs"]):
    total = 0
    correct = 0
    running_loss = 0

    model.train()  # set training mode
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        # zero gradients
        optimizer.zero_grad()

        # fwd pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # backwards pass
        loss.backward()
        optimizer.step()

        # tracking
        running_loss += loss.item()
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    # calculate more stats
    avg_loss = running_loss / len(train_loader)  # mean of per-batch losses, matching the val loop
    accuracy = correct / total

    model.eval()
    total_val = 0
    correct_val = 0
    running_loss_val = 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            # forward pass
            outputs = model(inputs)
            loss = criterion(outputs, labels)

            # tracking
            running_loss_val += loss.item()
            _, predicted = torch.max(outputs, 1)
            total_val += labels.size(0)
            correct_val += (predicted == labels).sum().item()

    avg_loss_val = running_loss_val / len(val_loader)
    accuracy_val = correct_val / total_val

    # todo: add more statistics here
    run.log({
        "epoch": epoch + 1,                                       # general tracking
        "loss": avg_loss, "accuracy": accuracy,                   # training loop
        "val_loss": avg_loss_val, "val_accuracy": accuracy_val    # val loop
    })
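The loop above accumulates `loss.item()` once per batch. With its default `reduction="mean"`, `nn.CrossEntropyLoss` already returns the mean loss over the batch, so there are two reasonable ways to turn these values into an epoch-level figure. The sketch below compares them using made-up numbers; both helper names are hypothetical.

```python
def mean_of_batch_means(batch_losses):
    """Average the per-batch mean losses, treating every batch equally."""
    return sum(batch_losses) / len(batch_losses)


def sample_weighted_mean(batch_losses, batch_sizes):
    """Average the per-batch mean losses, weighting each batch by its size."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


# Illustrative values: the last batch is smaller than the others.
batch_losses = [0.9, 0.7, 0.5]
batch_sizes = [64, 64, 32]

print(f"{mean_of_batch_means(batch_losses):.3f}")                # 0.700
print(f"{sample_weighted_mean(batch_losses, batch_sizes):.3f}")  # 0.740
```

The two figures agree only when every batch has the same size. Either is fine for tracking, as long as the training and validation metrics use the same convention.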