# Ray Tune - A Deeper Dive Using MNIST with PyTorch


Adapted from Anyscale under Apache 2.0



In [None]:
print('NOTE: Intentionally crashing session to use the newly installed library.\n')

#!pip uninstall -y pyarrow
!pip install pyarrow==10.0.1
!pip install ray


# A hack to force the runtime to restart, needed to include the above dependencies.
import os
os._exit(0)

NOTE: Intentionally crashing session to use the newly installed library.

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [1]:
import os
from torchvision import datasets, transforms
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from filelock import FileLock

## PyTorch Hyperparameter Tuning

Our example will closely follow the code in the [PyTorch MNIST example](https://github.com/pytorch/examples/blob/master/mnist/main.py). However, we will create an even simpler model than the one in the example, although you could try that model and compare its predictions.

Let's start by defining a few global variables for epoch and test sizes. Also define a data location.

In [2]:
EPOCH_SIZE = 512
TEST_SIZE = 256

DATA_ROOT = 'data/mnist'

The following class defines a convolutional neural network.



In [3]:
class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 3, kernel_size=3)
        self.fc = nn.Linear(192, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 3))
        x = x.view(-1, 192)
        x = self.fc(x)
        return F.log_softmax(x, dim=1)
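
The `192` in `nn.Linear(192, 10)` follows from the layer shapes. The sketch below checks that arithmetic, assuming the standard 28×28 single-channel MNIST input (the input size is not stated above, so treat it as an assumption). The helpers `conv_out` and `pool_out` are hypothetical names for the usual output-size formulas; the pooling stride defaults to the kernel size, as it does for `F.max_pool2d`.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride=None):
    """Output side length of max pooling; stride defaults to the kernel size."""
    stride = kernel if stride is None else stride
    return (size - kernel) // stride + 1


side = 28                  # assumed MNIST image side length
side = conv_out(side, 3)   # after nn.Conv2d(1, 3, kernel_size=3)
side = pool_out(side, 3)   # after F.max_pool2d(..., 3)
channels = 3               # output channels of conv1
flat_features = channels * side * side
print(side, flat_features)
```

If you change the kernel sizes or the number of channels, the same calculation gives the new value to use in both `nn.Linear` and `x.view`.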

After creating that network, we can now create our data loaders for training and test data. These are just plain [PyTorch `DataLoaders`](https://pytorch.org/docs/1.1.0/data.html?highlight=dataloader#torch.utils.data.DataLoader) with two additions:

1. A `FileLock` is added to ensure that only one process downloads the data on each machine, just in case we have multiple workers per machine in our Ray cluster.
2. The root directory for the data can be specified and it will be created if it doesn't exist.

Otherwise, this code is identical to the [PyTorch example version](https://github.com/pytorch/examples/blob/master/mnist/main.py#L101).

In [4]:
def get_data_loaders():
    mnist_transforms = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.1307, ), (0.3081, ))])

    # We add FileLock here because multiple workers on the same machine could try to
    # download the data at the same time. That could cause overwrites, since the
    # download is not safe across concurrent processes.
    # You wouldn't need this for single-process training.
    lock_file = f'{DATA_ROOT}/data.lock'
    import os
    if not os.path.exists(DATA_ROOT):
        os.makedirs(DATA_ROOT)

    with FileLock(os.path.expanduser(lock_file)):
        train_loader = torch.utils.data.DataLoader(
            datasets.MNIST(DATA_ROOT, train=True, download=True, transform=mnist_transforms),
            batch_size=64,
            shuffle=True)

        test_loader = torch.utils.data.DataLoader(
            datasets.MNIST(DATA_ROOT, train=False, transform=mnist_transforms),
            batch_size=64,
            shuffle=True)
    return train_loader, test_loader

Now we define our training and test functions. The argument order differs slightly from the original PyTorch example, but the difference is inconsequential. `train` takes a model, an optimizer, the training data loader, and a device (the CPU by default). It trains on batches until `batch_idx * len(data)` exceeds `EPOCH_SIZE`, so each call is a short, partial pass over the data.

In [5]:
def train(model, optimizer, train_loader, device=torch.device("cpu")):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        if batch_idx * len(data) > EPOCH_SIZE:
            return
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

Similarly, for our test function, we track a single basic metric: accuracy, the fraction of correct predictions over the evaluated samples. We could add more metrics, but we'll keep it simple.

In [6]:
def test(model, data_loader, device=torch.device("cpu")):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(data_loader):
            if batch_idx * len(data) > TEST_SIZE:
                break
            data, target = data.to(device), target.to(device)
            outputs = model(data)
            _, predicted = torch.max(outputs.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()

    return correct / total
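
Both loops stop only once `batch_idx * len(data)` exceeds the cap, so they process slightly more samples than `EPOCH_SIZE` or `TEST_SIZE` suggests. The sketch below counts them with a hypothetical helper, `samples_seen`, assuming full batches of 64 and a dataset larger than the cap.

```python
def samples_seen(cap, batch_size):
    """Count samples processed by a loop that stops when batch_idx * batch_size > cap."""
    seen = 0
    batch_idx = 0
    while batch_idx * batch_size <= cap:
        seen += batch_size
        batch_idx += 1
    return seen


train_samples = samples_seen(512, 64)  # EPOCH_SIZE with batch_size=64
test_samples = samples_seen(256, 64)   # TEST_SIZE with batch_size=64
print(train_samples, test_samples)
```

Under these assumptions each test pass evaluates 320 samples, so every reported accuracy is a multiple of 1/320. The accuracies printed later in this notebook fit that pattern, for example 0.090625 = 29/320.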

Finally, we create a wrapper function for this particular model. All we need to do is specify the configuration (learning rate and momentum) for the model that we would like to train, and the function will do the rest:

1. Retrieve the data with the loaders returned by `get_data_loaders()`.
2. Create the `ConvNet` model.
3. Create a _stochastic gradient descent_ (SGD) optimizer from the configuration.
4. Run 10 rounds of training and testing, printing the accuracy after each round.

In [7]:
def train_mnist(config):
    train_loader, test_loader = get_data_loaders()
    model = ConvNet()
    optimizer = optim.SGD(model.parameters(), lr=config["lr"], momentum=config['momentum'])
    for i in range(10):
        train(model, optimizer, train_loader)
        acc = test(model, test_loader)
        print(f"accuracy: {acc}")

### Single-Node Hyperparameter Tuning

Let's show what we might do if we performed hyperparameter tuning on a single machine. We would have to enumerate all the possibilities and either train them serially or use something like multiprocessing to train them in parallel. That setup takes a little bit of work so people often decide to train them serially, which is easiest, but requires the most time.

This is what we might do.

In [8]:
import itertools
conf = {
    "lr": [0.001, 0.01, 0.1],
    "momentum": [0.001, 0.01, 0.1, 0.9]
}

combinations = list(itertools.product(*conf.values()))
print(len(combinations))
combinations

12


[(0.001, 0.001),
 (0.001, 0.01),
 (0.001, 0.1),
 (0.001, 0.9),
 (0.01, 0.001),
 (0.01, 0.01),
 (0.01, 0.1),
 (0.01, 0.9),
 (0.1, 0.001),
 (0.1, 0.01),
 (0.1, 0.1),
 (0.1, 0.9)]

In [9]:
for lr, momentum in combinations:
    train_mnist({"lr":lr, "momentum":momentum})
    break # we'll stop this after one run and just use it for illustrative purposes

accuracy: 0.090625
accuracy: 0.125
accuracy: 0.0875
accuracy: 0.096875
accuracy: 0.075
accuracy: 0.1
accuracy: 0.109375
accuracy: 0.115625
accuracy: 0.134375
accuracy: 0.1625
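
For comparison, here is a minimal sketch of the "train them in parallel on one machine" option mentioned above. It is not part of the original notebook: `fake_train` is a hypothetical stand-in for `train_mnist` that returns a made-up score, and a thread pool is used only to keep the sketch short. For CPU-bound training, a process pool would be the more usual choice.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

conf = {
    "lr": [0.001, 0.01, 0.1],
    "momentum": [0.001, 0.01, 0.1, 0.9],
}


def fake_train(config):
    """Hypothetical stand-in for train_mnist; the score is made up."""
    return 1.0 - abs(config["lr"] - 0.1) - 0.01 * config["momentum"]


# Turn each (lr, momentum) tuple into a config dict keyed by hyperparameter name.
configs = [dict(zip(conf, values)) for values in itertools.product(*conf.values())]

# Run up to two configurations at a time; map() returns scores in input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    scores = list(pool.map(fake_train, configs))

best_score, best_config = max(zip(scores, configs), key=lambda pair: pair[0])
print(len(configs), best_config)
```

Even in this toy form you have to manage the worker pool, collect the results, and pick the best one yourself, which is the bookkeeping that Ray Tune takes over in the next section.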


### Distributed Hyperparameter Tuning with Ray Tune

Ray Tune makes it trivial to move this code from a single node to multiple nodes. Let's see how to use the code we've written with Ray Tune.

First, we set up Ray as before.

In [10]:
import ray
from ray import tune

In [11]:
ray.init(ignore_reinit_error=True)

2023-06-15 11:38:19,914	INFO worker.py:1636 -- Started a local Ray instance.


Python version: 3.10.12
Ray version: 2.5.0


The first change is we'll perform a strict `grid_search` on our hyperparameters, like we used in the previous lesson. Our hyperparameters are the learning rate, `lr`, and the `momentum`.

In [12]:
config = {
    "lr": tune.grid_search([0.001, 0.01, 0.1]),
    "momentum": tune.grid_search([0.001, 0.01, 0.1, 0.9])
}

Next we modify our trainable, `train_mnist`, to use Tune's "reporting" logger:

In [13]:
def train_mnist(config):
    from ray.tune import report
    train_loader, test_loader = get_data_loaders()
    model = ConvNet()
    optimizer = optim.SGD(model.parameters(), lr=config["lr"], momentum=config['momentum'])
    for i in range(10):
        train(model, optimizer, train_loader)
        acc = test(model, test_loader)
        # Send the score to Tune; each call counts as one training iteration.
        report(mean_accuracy=acc)
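
To make the link between reporting and iterations concrete, here is a toy driver. It is a sketch, not Tune's implementation: `run_function_trial`, `StopTrial`, and the `report` callback passed into the trainable are all hypothetical. It numbers each reported result and ends the trial when either the function returns or an iteration limit is reached.

```python
class StopTrial(Exception):
    """Raised by the toy driver to end a trial early."""


def run_function_trial(trainable, config, max_iterations):
    """Toy driver: number each reported result and stop at max_iterations."""
    results = []

    def report(**metrics):
        results.append({"training_iteration": len(results) + 1, **metrics})
        if len(results) >= max_iterations:
            raise StopTrial

    try:
        trainable(config, report)
    except StopTrial:
        pass
    return results


def toy_trainable(config, report):
    # Ten rounds, like the loop in train_mnist; the score itself is made up.
    for i in range(10):
        report(mean_accuracy=min(1.0, config["lr"] * (i + 1)))


results = run_function_trial(toy_trainable, {"lr": 0.01}, max_iterations=20)
print(len(results), results[-1]["training_iteration"])
```

Because the toy trainable reports only ten times, the trial ends at iteration 10 even though the limit is 20. The same pattern appears later in this notebook, where the functional run shows a `training_iteration` of 10.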

That's all that we need to change in order for Ray Tune to be able to parallelize our different hyperparameter combinations.

When we execute a hyperparameter sweep, we perform an **experiment**. Each distinct combination of our different hyperparameters constitutes a single **trial**.

## Tune's Functional vs. Class API

As in the previous lesson, the code above uses the **functional API**. This API is the most convenient for quickly setting up experiments, but it provides less overall flexibility than the **class API**, [`tune.Trainable`](https://docs.ray.io/en/latest/tune/api_docs/trainable.html#tune-trainable).

We'll try both, starting with the functional API.

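We add a stopping criterion, `stop={"training_iteration": 20}`, so this will go reasonably quickly while still producing good results. Because `train_mnist` loops only 10 times, its trials finish after 10 reported iterations, before this criterion is reached. Consider raising the limit (for the class API) or the loop count in `train_mnist` if you don't mind waiting longer and you want better results.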

**Note**: Unlike the functional API, in which the trainable calls `tune.report()` to send metrics, the class API method `step()` returns its metrics as a dictionary.

In [14]:
%%time
analysis_func = tune.run(train_mnist, config=config, stop={"training_iteration": 20},
                         verbose=1)

2023-06-15 11:38:21,208	INFO tensorboardx.py:178 -- pip install "ray[tune]" to see TensorBoard files.


== Status ==
Current time: 2023-06-15 11:38:21 (running for 00:00:00.21)
Using FIFO scheduling algorithm.
Logical resource usage: 0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/train_mnist_2023-06-15_11-38-21
Number of trials: 12/12 (12 PENDING)


== Status ==
Current time: 2023-06-15 11:38:26 (running for 00:00:05.26)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/train_mnist_2023-06-15_11-38-21
Number of trials: 12/12 (12 PENDING)


== Status ==
Current time: 2023-06-15 11:38:31 (running for 00:00:10.28)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/train_mnist_2023-06-15_11-38-21
Number of trials: 12/12 (10 PENDING, 2 RUNNING)


== Status ==
Current time: 2023-06-15 11:38:36 (running for 00:00:15.29)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/train_mnist_2023-06-15_11-38-21
Number of trials: 12/12 (8 PENDING, 2 RUNNING, 2 TERMINATED)


== Status ==
Current time: 2023-06-15 11:38:41 (running for 00:00:20.38)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/train_mnist_2023-06-15_11-38-21
Number of trials: 12/12 (6 PENDING, 2 RUNNING, 4 TERMINATED)


== Status ==
Current time: 2023-06-15 11:38:46 (running for 00:00:25.46)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/train_mnist_2023-06-15_11-38-21
Number of trials: 12/12 (4 PENDING, 2 RUNNING, 6 TERMINATED)


== Status ==
Current time: 2023-06-15 11:38:51 (running for 00:00:30.46)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/train_mnist_2023-06-15_11-38-21
Number of trials: 12/12 (2 PENDING, 2 RUNNING, 8 TERMINATED)


== Status ==
Current time: 2023-06-15 11:38:56 (running for 00:00:35.47)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/train_mnist_2023-06-15_11-38-21
Number of trials: 12/12 (2 RUNNING, 10 TERMINATED)




2023-06-15 11:38:59,419	INFO tune.py:1111 -- Total run time: 38.22 seconds (38.19 seconds for the tuning loop).


== Status ==
Current time: 2023-06-15 11:38:59 (running for 00:00:38.20)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/train_mnist_2023-06-15_11-38-21
Number of trials: 12/12 (12 TERMINATED)


CPU times: user 1.07 s, sys: 182 ms, total: 1.26 s
Wall time: 38.3 s


In [15]:
print("Best config: ", analysis_func.get_best_config(metric="mean_accuracy", mode="max"))

Best config:  {'lr': 0.1, 'momentum': 0.01}


In [16]:
analysis_func.dataframe().sort_values('mean_accuracy', ascending=False).head()

Unnamed: 0,mean_accuracy,time_this_iter_s,done,training_iteration,trial_id,date,timestamp,time_total_s,pid,hostname,node_ip,time_since_restore,iterations_since_restore,config/lr,config/momentum,logdir
5,0.90625,0.345789,False,10,213f5_00005,2023-06-15_11-38-43,1686829123,5.980468,6273,b739b75c6450,172.28.0.12,5.980468,10,0.1,0.01,/root/ray_results/train_mnist_2023-06-15_11-38...
11,0.903125,0.223297,False,10,213f5_00011,2023-06-15_11-38-59,1686829139,4.379885,6273,b739b75c6450,172.28.0.12,4.379885,10,0.1,0.9,/root/ray_results/train_mnist_2023-06-15_11-38...
2,0.9,0.545079,False,10,213f5_00002,2023-06-15_11-38-37,1686829117,5.02385,6272,b739b75c6450,172.28.0.12,5.02385,10,0.1,0.001,/root/ray_results/train_mnist_2023-06-15_11-38...
10,0.89375,0.371153,False,10,213f5_00010,2023-06-15_11-38-58,1686829138,4.960583,6272,b739b75c6450,172.28.0.12,4.960583,10,0.01,0.9,/root/ray_results/train_mnist_2023-06-15_11-38...
8,0.890625,0.644053,False,10,213f5_00008,2023-06-15_11-38-53,1686829133,5.528092,6272,b739b75c6450,172.28.0.12,5.528092,10,0.1,0.1,/root/ray_results/train_mnist_2023-06-15_11-38...


In [17]:
analysis_func.dataframe()[['mean_accuracy', 'config/lr', 'config/momentum']].sort_values('mean_accuracy', ascending=False)

| | mean_accuracy | config/lr | config/momentum |
|---|---|---|---|
| 5 | 0.90625 | 0.1 | 0.01 |
| 11 | 0.903125 | 0.1 | 0.9 |
| 2 | 0.9 | 0.1 | 0.001 |
| 10 | 0.89375 | 0.01 | 0.9 |
| 8 | 0.890625 | 0.1 | 0.1 |
| 1 | 0.7875 | 0.01 | 0.001 |
| 4 | 0.715625 | 0.01 | 0.01 |
| 7 | 0.6375 | 0.01 | 0.1 |
| 9 | 0.559375 | 0.001 | 0.9 |
| 3 | 0.29375 | 0.001 | 0.01 |
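
The selection that `get_best_config(metric="mean_accuracy", mode="max")` performs can be pictured as a search over the final result of each trial. The sketch below is a toy equivalent, not Tune's implementation: `pick_best_config` is a hypothetical helper, and the rows are copied from the table above.

```python
def pick_best_config(results, metric, mode="max"):
    """Return the config of the trial whose metric is best under the given mode."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    pick = max if mode == "max" else min
    return pick(results, key=lambda r: r[metric])["config"]


# A few final results copied from the functional-API run shown above.
results = [
    {"mean_accuracy": 0.90625, "config": {"lr": 0.1, "momentum": 0.01}},
    {"mean_accuracy": 0.903125, "config": {"lr": 0.1, "momentum": 0.9}},
    {"mean_accuracy": 0.9, "config": {"lr": 0.1, "momentum": 0.001}},
    {"mean_accuracy": 0.29375, "config": {"lr": 0.001, "momentum": 0.01}},
]

best = pick_best_config(results, metric="mean_accuracy", mode="max")
print("Best config: ", best)
```

The top three rows differ by less than one percentage point on a 320-sample test pass, so the ranking among them may well change from run to run.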


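How long did it take? The `%%time` output above reports a wall time of 38.3 seconds. (The `stats()` calculation below printed `-inf` in this run, so it is not a usable measurement here.) We'll compare this value with a different training run in the next lesson.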

In [18]:
stats = analysis_func.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

   -inf seconds,    -inf minutes


### Use Tune's Trainable Class API

When the trainable is a subclass of `tune.Trainable`, Tune creates a `Trainable` object in a separate process (using the [Ray Actor API](https://docs.ray.io/en/latest/actors.html#actor-guide)).

 * `setup` is invoked once, when training starts.
 * `step` is invoked multiple times. Each time, the `Trainable` object executes one logical iteration of training in the tuning process, which may include one or more iterations of actual training.


In [19]:
class TrainMNIST(tune.Trainable):
    def setup(self, config):
        self.config = config
        self.train_loader, self.test_loader = get_data_loaders()
        self.model = ConvNet()
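        # Note: config["momentum"] is not passed to SGD here, so the momentum
        # values in the grid do not affect these class-API trials.
        self.optimizer = optim.SGD(self.model.parameters(), lr=self.config["lr"])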

    def step(self):
        train(self.model, self.optimizer, self.train_loader)
        acc = test(self.model, self.test_loader)
        return {"mean_accuracy": acc}
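
The `setup`/`step` lifecycle described above can be sketched with a toy driver. This is only an illustration under stated assumptions, not how Tune is implemented: `ToyTrainable` and `run_class_trial` are hypothetical, and the "accuracy" update is made up.

```python
class ToyTrainable:
    """Hypothetical stand-in for a tune.Trainable subclass."""

    def setup(self, config):
        self.lr = config["lr"]
        self.accuracy = 0.0

    def step(self):
        # One logical iteration: move accuracy a fraction of the way toward 1.0.
        self.accuracy += self.lr * (1.0 - self.accuracy)
        return {"mean_accuracy": self.accuracy}


def run_class_trial(trainable_cls, config, stop):
    """Toy driver: call setup() once, then step() until the stop criterion is met."""
    trainable = trainable_cls()
    trainable.setup(config)
    iteration = 0
    result = {}
    while iteration < stop["training_iteration"]:
        result = trainable.step()
        iteration += 1
        result["training_iteration"] = iteration
    return result


final = run_class_trial(ToyTrainable, {"lr": 0.1}, stop={"training_iteration": 20})
print(final["training_iteration"], round(final["mean_accuracy"], 4))
```

Unlike a function trainable, `step` has no loop of its own, so the driver decides how many iterations run. This is why a stopping criterion matters for the class API.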

In [20]:
%%time
analysis = tune.run(
    TrainMNIST,
    config=config,
    stop={"training_iteration": 20},
    verbose=1
)



== Status ==
Current time: 2023-06-15 11:40:17 (running for 00:00:00.20)
Using FIFO scheduling algorithm.
Logical resource usage: 0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (12 PENDING)


== Status ==
Current time: 2023-06-15 11:40:22 (running for 00:00:05.25)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (12 PENDING)


== Status ==
Current time: 2023-06-15 11:40:27 (running for 00:00:10.32)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (10 PENDING, 2 RUNNING)


== Status ==
Current time: 2023-06-15 11:40:32 (running for 00:00:15.40)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (10 PENDING, 2 RUNNING)


== Status ==
Current time: 2023-06-15 11:40:38 (running for 00:00:20.48)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (10 PENDING, 2 TERMINATED)


== Status ==
Current time: 2023-06-15 11:40:43 (running for 00:00:25.54)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (8 PENDING, 2 RUNNING, 2 TERMINATED)


== Status ==
Current time: 2023-06-15 11:40:48 (running for 00:00:30.63)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (8 PENDING, 2 RUNNING, 2 TERMINATED)


== Status ==
Current time: 2023-06-15 11:40:53 (running for 00:00:35.70)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (8 PENDING, 4 TERMINATED)


== Status ==
Current time: 2023-06-15 11:40:58 (running for 00:00:40.75)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (6 PENDING, 2 RUNNING, 4 TERMINATED)


== Status ==
Current time: 2023-06-15 11:41:03 (running for 00:00:45.84)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (6 PENDING, 2 RUNNING, 4 TERMINATED)


== Status ==
Current time: 2023-06-15 11:41:08 (running for 00:00:50.85)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (6 PENDING, 6 TERMINATED)


== Status ==
Current time: 2023-06-15 11:41:13 (running for 00:00:55.89)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (4 PENDING, 2 RUNNING, 6 TERMINATED)


== Status ==
Current time: 2023-06-15 11:41:18 (running for 00:01:00.98)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (4 PENDING, 2 RUNNING, 6 TERMINATED)


== Status ==
Current time: 2023-06-15 11:41:23 (running for 00:01:06.07)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (4 PENDING, 8 TERMINATED)


== Status ==
Current time: 2023-06-15 11:41:28 (running for 00:01:11.13)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (2 PENDING, 2 RUNNING, 8 TERMINATED)


== Status ==
Current time: 2023-06-15 11:41:33 (running for 00:01:16.19)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (2 PENDING, 2 RUNNING, 8 TERMINATED)


== Status ==
Current time: 2023-06-15 11:41:38 (running for 00:01:21.21)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (2 PENDING, 10 TERMINATED)


== Status ==
Current time: 2023-06-15 11:41:43 (running for 00:01:26.22)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (2 RUNNING, 10 TERMINATED)


== Status ==
Current time: 2023-06-15 11:41:48 (running for 00:01:31.31)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (2 RUNNING, 10 TERMINATED)




2023-06-15 11:41:51,622	INFO tune.py:1111 -- Total run time: 94.10 seconds (94.07 seconds for the tuning loop).


== Status ==
Current time: 2023-06-15 11:41:51 (running for 00:01:34.07)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/2 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/TrainMNIST_2023-06-15_11-40-17
Number of trials: 12/12 (12 TERMINATED)


CPU times: user 2.03 s, sys: 285 ms, total: 2.32 s
Wall time: 1min 34s


In [21]:
print("Best config: ", analysis.get_best_config(metric="mean_accuracy", mode="max"))

Best config:  {'lr': 0.1, 'momentum': 0.01}


In [22]:
# Get a dataframe for analyzing trial results.
df = analysis.dataframe()
df.head()

Unnamed: 0,mean_accuracy,done,training_iteration,trial_id,date,timestamp,time_this_iter_s,time_total_s,pid,hostname,node_ip,time_since_restore,iterations_since_restore,config/lr,config/momentum,logdir
0,0.125,True,20,66963_00000,2023-06-15_11-40-33,1686829233,0.402833,7.609252,7040,b739b75c6450,172.28.0.12,7.609252,20,0.001,0.001,/root/ray_results/TrainMNIST_2023-06-15_11-40-...
1,0.840625,True,20,66963_00001,2023-06-15_11-40-33,1686829233,0.328962,7.510359,7075,b739b75c6450,172.28.0.12,7.510359,20,0.01,0.001,/root/ray_results/TrainMNIST_2023-06-15_11-40-...
2,0.86875,True,20,66963_00002,2023-06-15_11-40-49,1686829249,0.527731,7.756325,7286,b739b75c6450,172.28.0.12,7.756325,20,0.1,0.001,/root/ray_results/TrainMNIST_2023-06-15_11-40-...
3,0.1625,True,20,66963_00003,2023-06-15_11-40-49,1686829249,0.677231,7.842635,7288,b739b75c6450,172.28.0.12,7.842635,20,0.001,0.01,/root/ray_results/TrainMNIST_2023-06-15_11-40-...
4,0.83125,True,20,66963_00004,2023-06-15_11-41-05,1686829265,0.631667,8.981776,7503,b739b75c6450,172.28.0.12,8.981776,20,0.01,0.01,/root/ray_results/TrainMNIST_2023-06-15_11-40-...


In [23]:
analysis.dataframe().sort_values('mean_accuracy', ascending=False).head()

Unnamed: 0,mean_accuracy,done,training_iteration,trial_id,date,timestamp,time_this_iter_s,time_total_s,pid,hostname,node_ip,time_since_restore,iterations_since_restore,config/lr,config/momentum,logdir
5,0.925,True,20,66963_00005,2023-06-15_11-41-05,1686829265,0.710634,8.789523,7505,b739b75c6450,172.28.0.12,8.789523,20,0.1,0.01,/root/ray_results/TrainMNIST_2023-06-15_11-40-...
11,0.925,True,20,66963_00011,2023-06-15_11-41-51,1686829311,0.23213,8.63845,8079,b739b75c6450,172.28.0.12,8.63845,20,0.1,0.9,/root/ray_results/TrainMNIST_2023-06-15_11-40-...
2,0.86875,True,20,66963_00002,2023-06-15_11-40-49,1686829249,0.527731,7.756325,7286,b739b75c6450,172.28.0.12,7.756325,20,0.1,0.001,/root/ray_results/TrainMNIST_2023-06-15_11-40-...
10,0.846875,True,20,66963_00010,2023-06-15_11-41-50,1686829310,0.388859,9.285601,8043,b739b75c6450,172.28.0.12,9.285601,20,0.01,0.9,/root/ray_results/TrainMNIST_2023-06-15_11-40-...
1,0.840625,True,20,66963_00001,2023-06-15_11-40-33,1686829233,0.328962,7.510359,7075,b739b75c6450,172.28.0.12,7.510359,20,0.01,0.001,/root/ray_results/TrainMNIST_2023-06-15_11-40-...


It's easier to see what we want if we project out the interesting columns:

In [24]:
analysis.dataframe()[['mean_accuracy', 'config/lr', 'config/momentum']].sort_values('mean_accuracy', ascending=False)

| | mean_accuracy | config/lr | config/momentum |
|---|---|---|---|
| 5 | 0.925 | 0.1 | 0.01 |
| 11 | 0.925 | 0.1 | 0.9 |
| 2 | 0.86875 | 0.1 | 0.001 |
| 10 | 0.846875 | 0.01 | 0.9 |
| 1 | 0.840625 | 0.01 | 0.001 |
| 8 | 0.834375 | 0.1 | 0.1 |
| 4 | 0.83125 | 0.01 | 0.01 |
| 7 | 0.828125 | 0.01 | 0.1 |
| 9 | 0.25 | 0.001 | 0.9 |
| 6 | 0.21875 | 0.001 | 0.1 |


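How long did it take? The `%%time` output above reports a wall time of 1 minute 34 seconds. (The `stats()` calculation below printed `-inf` in this run, so it is not a usable measurement here.) We'll compare this value with a different training run in the next lesson.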

In [25]:
stats = analysis.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

   -inf seconds,    -inf minutes


The next lesson will explore optimization algorithms that speed up hyperparameter optimization (HPO).

In [26]:
ray.shutdown()  # "Undo ray.init()".