# Autoscaling Ray on Databricks and Apache Spark

With the release of **Ray 2.8.0**, Ray autoscaling is available for Ray on Databricks and Apache Spark. Below, we showcase the functionality by walking through hyperparameter tuning of a deep learning model on the CIFAR-10 dataset.

Ray autoscaling works with **Databricks Runtime (DBR) 14+**, and the code has been tested with the following cluster configuration:

**Azure**: NC6s_v3 driver, autoscaling up to 4 NC6s_v3 worker nodes.

## Install the Ray library and any other Python dependencies
Once the libraries are installed here, you do not need to specify them again during Ray initialization.

In [0]:
%pip install "ray[default,tune]>=2.8.0"

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
databricks-feature-store 0.14.1 requires pyspark<4,>=3.1.2, which is not installed.
tensorflow 2.11.1 requires protobuf<3.20,>=3.9.2, but you have protobuf 4.25.0 which is incompatible.
tensorboard 2.11.0 requires protobuf<4,>=3.9.2, but you have protobuf 4.25.0 which is incompatible.


In [0]:
dbutils.library.restartPython()
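
After the restart, it can be worth confirming that the installed Ray version meets the 2.8.0 minimum mentioned above. The following is a minimal sketch using only the standard library. The helper names `version_tuple` and `meets_minimum` are illustrative (they are not part of Ray), and the parser assumes a plain dotted version string, stopping at the first non-numeric component.

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '2.8.0' into (2, 8, 0); stop at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "2.8.0") -> bool:
    """Compare two dotted version strings numerically."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    ray_version = metadata.version("ray")
    print(f"ray {ray_version}; meets 2.8.0 minimum: {meets_minimum(ray_version)}")
except metadata.PackageNotFoundError:
    print("ray is not installed in this environment")
```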

## Start the Ray cluster
Use the Ray on Spark APIs to start the cluster. Refer to the [Ray on Spark documentation](https://docs.ray.io/en/latest/cluster/vms/user-guides/community/spark.html?highlight=ray.util.spark#ray-on-spark-apis) for more details on the parameters.

In [0]:
from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster


num_cpu_cores_per_worker = 4  # CPU cores each worker node contributes to Ray
num_cpus_head_node = 4        # CPU cores on the driver node available to Ray
num_gpu_per_worker = 1
num_gpus_head_node = 1

ray_conf = setup_ray_cluster(
    num_worker_nodes=4,  # maximum number of worker nodes the cluster may autoscale to
    num_cpus_head_node=num_cpus_head_node,  # CPU cores on the driver node used for Ray jobs
    num_gpus_head_node=num_gpus_head_node,  # set this only on a GPU-enabled cluster
    num_cpus_per_node=num_cpu_cores_per_worker,  # CPU cores added by each worker node
    num_gpus_per_node=num_gpu_per_worker,  # GPUs added by each worker node
    autoscale=True,
)


2023-11-06 21:18:42,105	INFO cluster_init.py:528 -- Ray head hostname 10.139.64.118, port 9124


2023-11-06 21:18:44,307	INFO usage_lib.py:416 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2023-11-06 21:18:44,307	INFO scripts.py:744 -- Local node IP: 10.139.64.118
2023-11-06 21:18:46,380	SUCC scripts.py:781 -- --------------------
2023-11-06 21:18:46,380	SUCC scripts.py:782 -- Ray runtime started.
2023-11-06 21:18:46,380	SUCC scripts.py:783 -- --------------------
2023-11-06 21:18:46,380	INFO scripts.py:785 -- Next steps
2023-11-06 21:18:46,380	INFO scripts.py:788 -- To add another node to this Ray cluster, run
2023-11-06 21:18:46,380	INFO scripts.py:791 --   ray start --address='10.139.64.118

2023-11-06 21:19:02,138	INFO cluster_init.py:640 -- Ray head node started.
2023-11-06 21:19:02,143	INFO worker.py:1489 -- Connecting to existing Ray cluster at address: 10.139.64.118:9124...
2023-11-06 21:19:02,153	INFO worker.py:1664 -- Connected to Ray cluster. View the dashboard at 10.139.64.118:9137


To monitor and debug Ray from Databricks, view the dashboard at 
 https://dbc-dp-984752964297111.cloud.databricks.com/driver-proxy/o/984752964297111/1023-112611-gamx0lyy/9137/
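
To make the sizing parameters above concrete, here is a small sketch of the arithmetic behind them. It assumes that the head node's resources count toward the cluster total and that every worker contributes the same amount; `max_cluster_resources` is a hypothetical helper, not a Ray API.

```python
def max_cluster_resources(num_worker_nodes, cpus_head, gpus_head,
                          cpus_per_worker, gpus_per_worker):
    """Logical CPUs and GPUs visible to Ray once the given workers have joined."""
    total_cpus = cpus_head + num_worker_nodes * cpus_per_worker
    total_gpus = gpus_head + num_worker_nodes * gpus_per_worker
    return total_cpus, total_gpus


# Head node only, one worker, and the fully scaled-out cluster (4 workers).
for workers in (0, 1, 4):
    cpus, gpus = max_cluster_resources(workers, cpus_head=4, gpus_head=1,
                                       cpus_per_worker=4, gpus_per_worker=1)
    print(f"{workers} worker(s): {cpus} CPUs, {gpus} GPUs")
```

With one worker this gives 8 CPUs and 2 GPUs, which is the size the autoscaler reports after its first resize in the run output later in this notebook.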


In [0]:
# To restart the cluster, first call `shutdown_ray_cluster()`; this does not restart the Python interpreter or REPL.
# shutdown_ray_cluster()

## Import all the libraries

In [0]:
import numpy as np
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from filelock import FileLock
from torch.utils.data import random_split
import torchvision
import torchvision.transforms as transforms
import ray
from ray import train, tune
from ray.train import Checkpoint
from ray.tune.schedulers import ASHAScheduler
import time

In [0]:
def load_data(data_dir="./data"):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    # We add FileLock here because multiple workers will want to
    # download data, and this may cause overwrites since
    # DataLoader is not threadsafe.
    with FileLock(os.path.expanduser("~/.data.lock")):
        trainset = torchvision.datasets.CIFAR10(
            root=data_dir, train=True, download=True, transform=transform)

        testset = torchvision.datasets.CIFAR10(
            root=data_dir, train=False, download=True, transform=transform)

    return trainset, testset
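
The `Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))` transform above subtracts the mean and divides by the standard deviation per channel. Since `ToTensor()` scales pixel values to the range [0, 1], the result lies in [-1, 1]. A minimal sketch of that arithmetic on single values, with `normalize` as an illustrative stand-in for the torchvision transform:

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Per-channel normalization: (value - mean) / std."""
    return (pixel - mean) / std


for value in (0.0, 0.5, 1.0):
    print(value, "->", normalize(value))
```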

In [0]:
class Net(nn.Module):
    def __init__(self, l1=120, l2=84):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
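
The `16 * 5 * 5` input size of `fc1` follows from the shapes of the layers before it. The sketch below traces a 32x32 CIFAR-10 image through the two convolution and pooling stages, using the standard output-size formula for layers without padding or dilation; `conv_out` and `pool_out` are illustrative helpers.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Spatial output size of a max-pooling layer along one dimension."""
    return (size - kernel) // stride + 1


size = 32                                  # CIFAR-10 images are 32x32
size = pool_out(conv_out(size, 5), 2, 2)   # conv1: 32 -> 28, pool: 28 -> 14
size = pool_out(conv_out(size, 5), 2, 2)   # conv2: 14 -> 10, pool: 10 -> 5
flattened = 16 * size * size               # 16 output channels from conv2
print(size, flattened)
```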

## The train function
Now it gets interesting, because we introduce some changes to the example from the [PyTorch documentation](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html).

The full code example looks like this:

In [0]:
def train_cifar(config, loc):
    # Match PyTorch's thread count to the CPUs Ray allocated to this trial.
    num_cpus = int(train.get_context().get_trial_resources().head_cpus)
    print("num_cpus:", num_cpus)
    torch.set_num_threads(num_cpus)
    net = Net(config["l1"], config["l2"])

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

    # To restore a checkpoint, use `train.get_checkpoint()`.
    loaded_checkpoint = train.get_checkpoint()
    if loaded_checkpoint:
        with loaded_checkpoint.as_directory() as loaded_checkpoint_dir:
            model_state, optimizer_state = torch.load(os.path.join(loaded_checkpoint_dir, "checkpoint.pt"))
        net.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)

    data_dir = os.path.abspath("./data")
    trainset, testset = load_data(data_dir)

    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs])

    trainloader = torch.utils.data.DataLoader(
        train_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)
    valloader = torch.utils.data.DataLoader(
        val_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)

    for epoch in range(config['max_epoch']):  # loop over the dataset multiple times
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1,
                                                running_loss / epoch_steps))
                running_loss = 0.0

        # Validation loss
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)

                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                loss = criterion(outputs, labels)
                val_loss += loss.cpu().numpy()
                val_steps += 1

        # Here we save a checkpoint. It is automatically registered with
        # Ray Tune and can be accessed through `train.get_checkpoint()`
        # API in future iterations.
        os.makedirs(f"{loc}/mymodel", exist_ok=True)
        torch.save(
            (net.state_dict(), optimizer.state_dict()), f"{loc}/mymodel/checkpoint.pt")
        checkpoint = Checkpoint.from_directory(f"{loc}/mymodel/")
        train.report(
            {"loss": val_loss / val_steps, "try_gpu": False, "accuracy": correct / total},
            checkpoint=checkpoint,
        )
    print("Finished Training")
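
The training loop prints a loss line every 2000 mini-batches, so how often a trial logs depends on its batch size. The sketch below works out the number of steps per epoch for each batch size in the search space, assuming the standard 50,000-image CIFAR-10 training set and the 80/20 split used in `train_cifar`. The helper name `steps_per_epoch` is illustrative.

```python
import math

CIFAR10_TRAIN_IMAGES = 50_000                    # size of the CIFAR-10 training set
train_size = int(CIFAR10_TRAIN_IMAGES * 0.8)     # 80% is used for training


def steps_per_epoch(num_samples, batch_size):
    """Mini-batches per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


for batch_size in (2, 4, 8, 16):
    steps = steps_per_epoch(train_size, batch_size)
    print(f"batch_size={batch_size}: {steps} steps, {steps // 2000} loss line(s) per epoch")
```

A trial with batch size 2 therefore takes 20,000 steps per epoch, while a trial with batch size 16 takes 2,500 and logs once per epoch.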

In [0]:
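def main(num_samples=10, max_num_epochs=10,
         grace_period=5, cpus_per_trial=1,
         gpus_per_trial=0, loc='/dbfs/pj/ray/'):
    # `max_num_epochs` and `grace_period` are accepted but not used below:
    # the epoch budget ("max_epoch") is fixed at 20 and the scheduler grace period at 5.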
    config = {
        "l1": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "l2": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([2, 4, 8, 16]),
        "max_epoch":20
    }
    scheduler = ASHAScheduler(
        max_t=config['max_epoch'],
        grace_period=5,
        reduction_factor=2)
    
    tuner = tune.Tuner(
        tune.with_resources(
            tune.with_parameters(train_cifar,loc = loc),
            resources={"cpu": cpus_per_trial, "gpu":gpus_per_trial }
        ),
        tune_config=tune.TuneConfig(
            metric="loss",
            mode="min",
            scheduler=scheduler,
            num_samples=num_samples,
        ),
        run_config=train.RunConfig(
            storage_path=os.path.expanduser(loc),
            name="tune_checkpointing_location",
        ),
        param_space=config,
    )
    results = tuner.fit()
    
    best_result = results.get_best_result("loss", "min")

    print("Best trial config: {}".format(best_result.config))
    print("Best trial final validation loss: {}".format(
        best_result.metrics["loss"]))
    print("Best trial final validation accuracy: {}".format(
        best_result.metrics["accuracy"]))

    test_best_model(best_result)

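
The `ASHAScheduler` configured in `main` stops poorly performing trials early at a series of checkpoints ("rungs"). As a rough sketch of where those rungs fall, the code below assumes the usual geometric spacing: the first rung at `grace_period`, each later rung `reduction_factor` times further, up to `max_t`. `asha_rungs` is an illustrative helper and not part of Ray Tune.

```python
def asha_rungs(grace_period, max_t, reduction_factor):
    """Iterations at which trials are compared and possibly stopped."""
    rungs = []
    t = grace_period
    while t <= max_t:
        rungs.append(t)
        t *= reduction_factor
    return rungs


# Values used in this notebook: grace_period=5, max_t=20, reduction_factor=2.
print(asha_rungs(5, 20, 2))
```

Under these assumptions, every trial runs for at least 5 epochs, and with a reduction factor of 2 roughly the better half of the trials reaching a rung continues past it.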

In [0]:
def test_best_model(best_result):
    best_trained_model = Net(best_result.config["l1"], best_result.config["l2"])
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    best_trained_model.to(device)

    checkpoint_path = os.path.join(best_result.checkpoint.to_directory(), "checkpoint.pt")

    model_state, optimizer_state = torch.load(checkpoint_path)
    best_trained_model.load_state_dict(model_state)

    trainset, testset = load_data()

    testloader = torch.utils.data.DataLoader(
        testset, batch_size=4, shuffle=False, num_workers=2)

    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = best_trained_model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()


    print("Best trial test set accuracy: {}".format(correct / total))

In [0]:
# Run CPU-only trials (3 CPUs per trial, no GPU)
main(num_samples=8, max_num_epochs=10, grace_period=5, cpus_per_trial=3, gpus_per_trial=0, loc='/dbfs/pj/ray/')
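
The per-trial resource requests determine how many trials can run at once and, with autoscaling, how many nodes Ray asks for. The sketch below estimates the concurrency for the resource requests used in this notebook. It assumes identical nodes (4 CPUs and 1 GPU each, as configured earlier) and that each trial's resources must fit on a single node; `trials_per_node` is a hypothetical helper.

```python
import math


def trials_per_node(cpus_per_node, gpus_per_node, cpus_per_trial, gpus_per_trial):
    """Number of trials whose resource request fits on one node."""
    limits = []
    if cpus_per_trial > 0:
        limits.append(math.floor(cpus_per_node / cpus_per_trial))
    if gpus_per_trial > 0:
        limits.append(math.floor(gpus_per_node / gpus_per_trial))
    if not limits:
        raise ValueError("a trial must request at least some CPU or GPU")
    return min(limits)


cpu_run = trials_per_node(4, 1, cpus_per_trial=3, gpus_per_trial=0)
gpu_run = trials_per_node(4, 1, cpus_per_trial=1, gpus_per_trial=0.5)
print("CPU run (3 CPUs per trial):", cpu_run, "trial(s) per node")
print("GPU run (1 CPU, 0.5 GPU per trial):", gpu_run, "trial(s) per node")
```

With the head node plus one worker, the GPU configuration fits 4 concurrent trials, which is consistent with the `4 RUNNING | 4 PENDING` status lines in the run output.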

In [0]:
# Run GPU trials (1 CPU and half a GPU per trial)
main(num_samples=8, max_num_epochs=10, grace_period=5, cpus_per_trial=1, gpus_per_trial=0.5, loc='/dbfs/pj/ray/')

2023-11-06 21:19:06,551	INFO worker.py:1354 -- Using address 10.139.64.118:9124 set in the environment variable RAY_ADDRESS
2023-11-06 21:19:06,551	INFO worker.py:1489 -- Connecting to existing Ray cluster at address: 10.139.64.118:9124...
2023-11-06 21:19:06,558	INFO worker.py:1664 -- Connected to Ray cluster. View the dashboard at 10.139.64.118:9137
2023-11-06 21:19:06,816	INFO tune.py:220 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.
2023-11-06 21:19:06,819	INFO tune.py:595 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949


+----------------------------------------------------------------+
| Configuration for experiment     tune_checkpointing_location   |
+----------------------------------------------------------------+
| Search algorithm                 BasicVariantGenerator         |
| Scheduler                        AsyncHyperBandScheduler       |
| Number of trials                 8                             |
+----------------------------------------------------------------+

View detailed results here: /dbfs/pj/ray/tune_checkpointing_location
To visualize your results with TensorBoard, run: `tensorboard --logdir /root/ray_results/tune_checkpointing_location`

Trial status: 8 PENDING
Current time: 2023-11-06 21:19:07. Total running time: 0s
Logical resource usage: 0/4 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:V100)
+----------------------------------------------------------------+
| Trial name                status             lr     batch_size |
+--------------------------------------------------

(train_cifar pid=3953) Extracting /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/data/cifar-10-python.tar.gz to /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/data
(autoscaler +39s) Resized to 8 CPUs, 2 GPUs.
(train_cifar pid=3953) Files already downloaded and verified
(train_cifar pid=3953) num_cpus: 1
(autoscaler +41s) Adding 1 node(s) of type ray.worker.

(train_cifar pid=3952) Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00000_0_batch_size=2,lr=0.0012_2023-11-06_21-19-07/data/cifar-10-python.tar.gz
(train_cifar pid=3952) Extracting /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00000_0_batch_size=2,lr=0.0012_2023-11-06_21-19-07/data/cifar-10-python.tar.gz to /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00000_0_batch_size=2,lr=0.0012_2023-11-06_21-19-07/data

Trial train_cifar_1e5c3_00002 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_1e5c3_00002 config             |
+--------------------------------------------------+
| batch_size                                     4 |
| l1                                             4 |
| l2                                           128 |
| lr                                       0.01267 |
| max_epoch                                     20 |
+--------------------------------------------------+

Trial train_cifar_1e5c3_00003 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_1e5c3_00003 config

(train_cifar pid=2848, ip=10.139.64.113) Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00002_2_batch_size=4,lr=0.0127_2023-11-06_21-19-07/data/cifar-10-python.tar.gz
(train_cifar pid=2848, ip=10.139.64.113) Extracting /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00002_2_batch_size=4,lr=0.0127_2023-11-06_21-19-07/data/cifar-10-python.tar.gz to /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00002_2_batch_size=4,lr=0.0127_2023-11-06_21-19-07/data
(train_cifar pid=2848, ip=10.139.64.113) Files already downloaded and verified
(train_cifar pid=2848, ip=10.139.64.113) num_cpus: 1
(train_cifar pid=3953) [1,  2000] loss: 2.164

(train_cifar pid=2850, ip=10.139.64.113) Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00003_3_batch_size=8,lr=0.0061_2023-11-06_21-19-07/data/cifar-10-python.tar.gz
(train_cifar pid=2850, ip=10.139.64.113) Extracting /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00003_3_batch_size=8,lr=0.0061_2023-11-06_21-19-07/data/cifar-10-python.tar.gz to /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00003_3_batch_size=8,lr=0.0061_2023-11-06_21-19-07/data

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000000)



Trial status: 4 RUNNING | 4 PENDING
Current time: 2023-11-06 21:19:37. Total running time: 30s
Logical resource usage: 4.0/8 CPUs, 2.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=2.004182459259033 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+-------------------------------------------------------------------------------------------------------------------+
| Trial name                status             lr     batch_size     iter     total time (s)      loss     accuracy |
+-------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING    0.00119966              2                                                    |
| train_cifar_1e5c3_00001   RUNNING    0.00170084             16        1             23.004   2.00418       0.2544 |
| train_cifar_1e5c3_00002   RUNNING    0.0126673

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000001)

(train_cifar pid=3952) [1,  6000] loss: 0.616 [repeated 3x across cluster]
(train_cifar pid=3953) [3,  2000] loss: 1.582 [repeated 3x across cluster]

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00003_3_batch_size=8,lr=0.0061_2023-11-06_21-19-07/checkpoint_000000)


Trial status: 4 RUNNING | 4 PENDING
Current time: 2023-11-06 21:20:07. Total running time: 1min 0s
Logical resource usage: 4.0/8 CPUs, 2.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.5679220623016357 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+-------------------------------------------------------------------------------------------------------------------+
| Trial name                status             lr     batch_size     iter     total time (s)      loss     accuracy |
+-------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING    0.00119966              2                                                    |
| train_cifar_1e5c3_00001   RUNNING    0.00170084             16        3            54.5407   1.56792       0.4043 |
| train_cifar_1e5c3_00002   RUNNING    0.0126673

(train_cifar pid=2848, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00002_2_batch_size=4,lr=0.0127_2023-11-06_21-19-07/checkpoint_000000) [repeated 2x across cluster]

(train_cifar pid=3952) [1, 12000] loss: 0.280 [repeated 2x across cluster]

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00003_3_batch_size=8,lr=0.0061_2023-11-06_21-19-07/checkpoint_000001) [repeated 2x across cluster]

(train_cifar pid=2848, ip=10.139.64.113) [2,  2000] loss: 2.219
(train_cifar pid=3952) [1, 14000] loss: 0.232

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000004)


Trial status: 4 RUNNING | 4 PENDING
Current time: 2023-11-06 21:20:37. Total running time: 1min 30s
Logical resource usage: 4.0/8 CPUs, 2.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.450542694759369 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+-------------------------------------------------------------------------------------------------------------------+
| Trial name                status             lr     batch_size     iter     total time (s)      loss     accuracy |
+-------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING    0.00119966              2                                                    |
| train_cifar_1e5c3_00001   RUNNING    0.00170084             16        5            85.975    1.45054       0.4546 |
| train_cifar_1e5c3_00002   RUNNING    0.0126673

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000005)

(train_cifar pid=2848, ip=10.139.64.113) [2,  8000] loss: 0.553 [repeated 4x across cluster]
(train_cifar pid=3953) [7,  2000] loss: 1.370 [repeated 2x across cluster]
Trial status: 4 RUNNING | 4 PENDING
Current time: 2023-11-06 21:21:07. Total running time: 2min 0s
Logical resource usage: 4.0/8 CPUs, 2.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.3976775547981262 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+-------------------------------------------------------------------------------------------------------------------+
| Trial name                status             lr     batch_size     iter     total time (s)      loss     accuracy |
+-------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING    0.00119966              2

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000006) [repeated 2x across cluster]

(train_cifar pid=2850, ip=10.139.64.113) [4,  4000] loss: 0.724 [repeated 3x across cluster]

(train_cifar pid=3952) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00000_0_batch_size=2,lr=0.0012_2023-11-06_21-19-07/checkpoint_000000) [repeated 2x across cluster]

(train_cifar pid=3953) [8,  2000] loss: 1.325
(train_cifar pid=2848, ip=10.139.64.113) [3,  2000] loss: 2.309

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00003_3_batch_size=8,lr=0.0061_2023-11-06_21-19-07/checkpoint_000003)
(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000007)

(train_cifar pid=2848, ip=10.139.64.113) [3,  4000] loss: 1.155 [repeated 2x across cluster]
(train_cifar pid=3953) [9,  2000] loss: 1.283 [repeated 3x across cluster]
Trial status: 4 RUNNING | 4 PENDING
Current time: 2023-11-06 21:21:37. Total running time: 2min 30s
Logical resource usage: 4.0/8 CPUs, 2.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.3287887833595275 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+-------------------------------------------------------------------------------------------------------------------+
| Trial name                status             lr     batch_size     iter     total time (s)      loss     accuracy |
+-------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING    0.00119966              2        1

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000008)

(train_cifar pid=2850, ip=10.139.64.113) [5,  4000] loss: 0.721 [repeated 2x across cluster]
(train_cifar pid=2848, ip=10.139.64.113) [3,  8000] loss: 0.577 [repeated 2x across cluster]

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00003_3_batch_size=8,lr=0.0061_2023-11-06_21-19-07/checkpoint_000004)
(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000009)

(train_cifar pid=2848, ip=10.139.64.113) [3, 10000] loss: 0.462 [repeated 3x across cluster]
(train_cifar pid=3952) [2, 10000] loss: 0.307 [repeated 2x across cluster]

(train_cifar pid=2848, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00002_2_batch_size=4,lr=0.0127_2023-11-06_21-19-07/checkpoint_000002)


Trial status: 4 RUNNING | 4 PENDING
Current time: 2023-11-06 21:22:07. Total running time: 3min 0s
Logical resource usage: 4.0/8 CPUs, 2.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2945023712158203 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+-------------------------------------------------------------------------------------------------------------------+
| Trial name                status             lr     batch_size     iter     total time (s)      loss     accuracy |
+-------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING    0.00119966              2        1            123.353   1.54597       0.4237 |
| train_cifar_1e5c3_00001   RUNNING    0.00170084             16       10            164.148   1.2945        0.5205 |
| train_cifar_1e5c3_00002   RUNNING    0.0126673

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000010)
(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00003_3_batch_size=8,lr=0.0061_2023-11-06_21-19-07/checkpoint_000005)

(train_cifar pid=3952) [2, 14000] loss: 0.217 [repeated 3x across cluster]

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000011)

(train_cifar pid=2850, ip=10.139.64.113) [7,  2000] loss: 1.420 [repeated 3x across cluster]
(train_cifar pid=2850, ip=10.139.64.113) [7,  4000] loss: 0.719 [repeated 3x across cluster]
Trial status: 4 RUNNING | 4 PENDING
Current time: 2023-11-06 21:22:37. Total running time: 3min 30s
Logical resource usage: 4.0/8 CPUs, 2.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2525162158966066 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+-------------------------------------------------------------------------------------------------------------------+
| Trial name                status             lr     batch_size     iter     total time (s)      loss     accuracy |
+-------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING    0.00119966            

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000012)

(train_cifar pid=2848, ip=10.139.64.113) [4, 10000] loss: 0.462 [repeated 4x across cluster]
(train_cifar pid=3953) [14,  2000] loss: 1.215 [repeated 2x across cluster]

(train_cifar pid=2848, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00002_2_batch_size=4,lr=0.0127_2023-11-06_21-19-07/checkpoint_000003) [repeated 2x across cluster]

(train_cifar pid=2850, ip=10.139.64.113) [8,  4000] loss: 0.709 [repeated 2x across cluster]

(train_cifar pid=3952) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00000_0_batch_size=2,lr=0.0012_2023-11-06_21-19-07/checkpoint_000001) [repeated 2x across cluster]


Trial status: 4 RUNNING | 4 PENDING
Current time: 2023-11-06 21:23:07. Total running time: 4min 0s
Logical resource usage: 4.0/8 CPUs, 2.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.3108190949440002 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+-------------------------------------------------------------------------------------------------------------------+
| Trial name                status             lr     batch_size     iter     total time (s)      loss     accuracy |
+-------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING    0.00119966              2        2            234.037   1.48939       0.4562 |
| train_cifar_1e5c3_00001   RUNNING    0.00170084             16       14            227.265   1.31082       0.5243 |
| train_cifar_1e5c3_00002   RUNNING    0.0126673

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00003_3_batch_size=8,lr=0.0061_2023-11-06_21-19-07/checkpoint_000007)
(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000014)

(train_cifar pid=3952) [3,  2000] loss: 1.473 [repeated 2x across cluster]
(train_cifar pid=2850, ip=10.139.64.113) [9,  2000] loss: 1.406
(train_cifar pid=2848, ip=10.139.64.113) [5,  6000] loss: 0.770

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000015)

(train_cifar pid=2850, ip=10.139.64.113) [9,  4000] loss: 0.712 [repeated 3x across cluster]
Trial status: 4 RUNNING | 4 PENDING
Current time: 2023-11-06 21:23:37. Total running time: 4min 30s
Logical resource usage: 4.0/8 CPUs, 2.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2756760026931762 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+-------------------------------------------------------------------------------------------------------------------+
| Trial name                status             lr     batch_size     iter     total time (s)      loss     accuracy |
+-------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING    0.00119966              2        2            234.037   1.48939       0.4562 |
| train_cifar_1e5c3_00001   RUNNING    0.00170084     

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00003_3_batch_size=8,lr=0.0061_2023-11-06_21-19-07/checkpoint_000008)

(train_cifar pid=2848, ip=10.139.64.113) [5, 10000] loss: 0.462 [repeated 3x across cluster]

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000016)



Trial train_cifar_1e5c3_00002 completed after 5 iterations at 2023-11-06 21:23:47. Total running time: 4min 40s
+------------------------------------------------------------+
| Trial train_cifar_1e5c3_00002 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000004 |
| time_this_iter_s                                  52.12899 |
| time_total_s                                     265.44513 |
| training_iteration                                       5 |
| accuracy                                             0.095 |
| loss                                               2.30968 |
| try_gpu                                              False |
+------------------------------------------------------------+

Trial train_cifar_1e5c3_00004 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_1e5c3_00004 config             |
+----------------------------

(train_cifar pid=2848, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00002_2_batch_size=4,lr=0.0127_2023-11-06_21-19-07/checkpoint_000004)

(train_cifar pid=2848, ip=10.139.64.113) Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00004_4_batch_size=4,lr=0.0312_2023-11-06_21-19-07/data/cifar-10-python.tar.gz



(train_cifar pid=2848, ip=10.139.64.113) Extracting /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00004_4_batch_size=4,lr=0.0312_2023-11-06_21-19-07/data/cifar-10-python.tar.gz to /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00004_4_batch_size=4,lr=0.0312_2023-11-06_21-19-07/data
(train_cifar pid=2848, ip=10.139.64.113) Files already downloaded and verified
(train_cifar pid=3952) [3, 10000] loss: 0.295 [repeated 2x across cluster]

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000017)

(train_cifar pid=3952) [3, 12000] loss: 0.240 [repeated 3x across cluster]

Trial train_cifar_1e5c3_00003 completed after 10 iterations at 2023-11-06 21:24:05. Total running time: 4min 57s
+------------------------------------------------------------+
| Trial train_cifar_1e5c3_00003 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000009 |
| time_this_iter_s                                   27.0152 |
| time_total_s                                      283.1032 |
| training_iteration                                      10 |
| accuracy                                            0.4804 |
| loss                                               1.55756 |
| try_gpu                                              False |
+------------------------------------------------------------+

Trial train_cifar_1e5c3_00005 started with configuration:
+----------------------------------------

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00003_3_batch_size=8,lr=0.0061_2023-11-06_21-19-07/checkpoint_000009)


(train_cifar pid=2850, ip=10.139.64.113) Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/data/cifar-10-python.tar.gz



(train_cifar pid=2850, ip=10.139.64.113) Extracting /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/data/cifar-10-python.tar.gz to /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/data
(train_cifar pid=2848, ip=10.139.64.113) [1,  2000] loss: 2.288

Trial status: 4 RUNNING | 2 TERMINATED | 2 PENDING
Current time: 2023-11-06 21:24:07. Total running time: 5min 0s
Logical resource usage: 4.0/8 CPUs, 2.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2368483323097228 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      lo

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000018)

(train_cifar pid=2850, ip=10.139.64.113) [1,  2000] loss: 2.211 [repeated 3x across cluster]
(train_cifar pid=3953) [20,  2000] loss: 1.171 [repeated 3x across cluster]

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00001_1_batch_size=16,lr=0.0017_2023-11-06_21-19-07/checkpoint_000019)



Trial train_cifar_1e5c3_00001 completed after 20 iterations at 2023-11-06 21:24:32. Total running time: 5min 25s
+------------------------------------------------------------+
| Trial train_cifar_1e5c3_00001 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000019 |
| time_this_iter_s                                  15.99379 |
| time_total_s                                     321.87896 |
| training_iteration                                      20 |
| accuracy                                            0.5481 |
| loss                                                1.2401 |
| try_gpu                                              False |
+------------------------------------------------------------+

Trial train_cifar_1e5c3_00006 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_1e5c3_00006 config             |
+---------------------------


(train_cifar pid=3953) Extracting /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00006_6_batch_size=16,lr=0.0628_2023-11-06_21-19-07/data/cifar-10-python.tar.gz to /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00006_6_batch_size=16,lr=0.0628_2023-11-06_21-19-07/data
(train_cifar pid=3953) Files already downloaded and verified

Trial status: 4 RUNNING | 3 TERMINATED | 1 PENDING
Current time: 2023-11-06 21:24:38. Total running time: 5min 30s
Logical resource usage: 4.0/8 CPUs, 2.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |


(train_cifar pid=2848, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00004_4_batch_size=4,lr=0.0312_2023-11-06_21-19-07/checkpoint_000000)

(train_cifar pid=3953) [1,  2000] loss: 2.248 [repeated 2x across cluster]

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00006_6_batch_size=16,lr=0.0628_2023-11-06_21-19-07/checkpoint_000000)
(train_cifar pid=3952) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00000_0_batch_size=2,lr=0.0012_2023-11-06_21-19-07/checkpoint_000002)

(train_cifar pid=2848, ip=10.139.64.113) [2,  4000] loss: 1.160 [repeated 3x across cluster]
Trial status: 4 RUNNING | 3 TERMINATED | 1 PENDING
Current time: 2023-11-06 21:25:08. Total running time: 6min 0s
Logical resource usage: 4.0/8 CPUs, 2.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        3           345.506    1.46644       0.4744 |
| train_cifar_1e5c3_00004   RUNN

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00006_6_batch_size=16,lr=0.0628_2023-11-06_21-19-07/checkpoint_000001) [repeated 2x across cluster]

(train_cifar pid=2850, ip=10.139.64.113) [2,  2000] loss: 1.752 [repeated 3x across cluster]
(train_cifar pid=3952) [4,  4000] loss: 0.708 [repeated 2x across cluster]

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00006_6_batch_size=16,lr=0.0628_2023-11-06_21-19-07/checkpoint_000002)

(train_cifar pid=2848, ip=10.139.64.113) [2, 10000] loss: 0.464 [repeated 4x across cluster]

(train_cifar pid=2848, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00004_4_batch_size=4,lr=0.0312_2023-11-06_21-19-07/checkpoint_000001)

(train_cifar pid=2850, ip=10.139.64.113) [2,  8000] loss: 0.427 [repeated 3x across cluster]
Trial status: 4 RUNNING | 3 TERMINATED | 1 PENDING
Current time: 2023-11-06 21:25:38. Total running time: 6min 30s
Logical resource usage: 4.0/8 CPUs, 2.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        3           345.506    1.46644       0.4744 |
| train_cifar_1e5c3_00004   RUN

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00006_6_batch_size=16,lr=0.0628_2023-11-06_21-19-07/checkpoint_000003)

(train_cifar pid=2848, ip=10.139.64.113) [3,  2000] loss: 2.321 [repeated 3x across cluster]
(train_cifar pid=2848, ip=10.139.64.113) [3,  4000] loss: 1.159 [repeated 3x across cluster]

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000001)

(train_cifar pid=3952) [4, 12000] loss: 0.238 [repeated 2x across cluster]

Trial train_cifar_1e5c3_00006 completed after 5 iterations at 2023-11-06 21:25:59. Total running time: 6min 51s
+------------------------------------------------------------+
| Trial train_cifar_1e5c3_00006 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000004 |
| time_this_iter_s                                  17.02978 |
| time_total_s                                      86.46692 |
| training_iteration                                       5 |
| accuracy                                            0.1038 |
| loss                                               2.31322 |
| try_gpu                                              False |
+------------------------------------------------------------+

Trial train_cifar_1e5c3_00007 started with configuration:
+-----------------------------------------

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00006_6_batch_size=16,lr=0.0628_2023-11-06_21-19-07/checkpoint_000004)

(autoscaler +7m24s) Resized to 12 CPUs, 3 GPUs.
(train_cifar pid=3953) Extracting /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00007_7_batch_size=8,lr=0.0131_2023-11-06_21-19-07/data/cifar-10-python.tar.gz to /root/ray_results/tune_checkpointing_location/train_cifar_1e5c3_00007_7_batch_size=8,lr=0.0131_2023-11-06_21-19-07/data
(autoscaler +7m25s) Resized to 16 CPUs, 4 GPUs.
(train_cifar pid=3953) Files already downloaded and verified
(train_cifar pid=2850, ip=10.139.64.113) [3,  2000] loss: 1.647 [repeated 2x across cluster]

Trial status: 4 RUNNING | 4 TERMINATED
Current time: 2023-11-06 21:26:08. Total running time: 7min 0s
Logical resource usage: 4.0/8 CPUs, 2.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+-----------------------------

(train_cifar pid=2848, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00004_4_batch_size=4,lr=0.0312_2023-11-06_21-19-07/checkpoint_000002)

(train_cifar pid=3953) [1,  4000] loss: 1.051 [repeated 4x across cluster]
(train_cifar pid=2848, ip=10.139.64.113) [4,  2000] loss: 2.322 [repeated 3x across cluster]

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00007_7_batch_size=8,lr=0.0131_2023-11-06_21-19-07/checkpoint_000000)


Trial status: 4 RUNNING | 4 TERMINATED
Current time: 2023-11-06 21:26:38. Total running time: 7min 30s
Logical resource usage: 4.0/16 CPUs, 2.0/4 GPUs (0.0/6.0 NODE_ID_AS_RESOURCE, 0.0/4.0 accelerator_type:V100)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        3           345.506    1.46644       0.4744 |
| train_cifar_1e5c3_00004   RUNNING      0.0312312               4        3           155.782    2.32546       0.0973 |
| train_cifar_1e5c3_00005   RUNNI

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000002)

(train_cifar pid=2848, ip=10.139.64.113) [4,  6000] loss: 0.773 [repeated 2x across cluster]

(train_cifar pid=3952) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00000_0_batch_size=2,lr=0.0012_2023-11-06_21-19-07/checkpoint_000003)

(train_cifar pid=3953) [2,  4000] loss: 1.145 [repeated 2x across cluster]

(autoscaler +8m25s) Removing 1 nodes of type ray.worker (idle).
(train_cifar pid=3952) [5,  2000] loss: 1.402 [repeated 3x across cluster]

(autoscaler +8m26s) Removing 1 nodes of type ray.worker (idle).
(autoscaler +8m27s) Resized to 12 CPUs, 3 GPUs.

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00007_7_batch_size=8,lr=0.0131_2023-11-06_21-19-07/checkpoint_000001)

(autoscaler +8m28s) Resized to 8 CPUs, 2 GPUs.
(train_cifar pid=2850, ip=10.139.64.113) [4,  6000] loss: 0.529 [repeated 2x across cluster]
Trial status: 4 RUNNING | 4 TERMINATED
Current time: 2023-11-06 21:27:08. Total running time: 8min 0s
Logical resource usage: 4.0/16 CPUs, 2.0/4 GPUs (0.0/6.0 NODE_ID_AS_RESOURCE, 0.0/4.0 accelerator_type:V100)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        4           461.557    1.51442   

(train_cifar pid=2848, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00004_4_batch_size=4,lr=0.0312_2023-11-06_21-19-07/checkpoint_000003)

(train_cifar pid=3953) [3,  2000] loss: 2.117 [repeated 2x across cluster]
(train_cifar pid=3952) [5,  6000] loss: 0.463 [repeated 2x across cluster]
(train_cifar pid=3952) [5,  8000] loss: 0.354 [repeated 4x across cluster]

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000003)


Trial status: 4 RUNNING | 4 TERMINATED
Current time: 2023-11-06 21:27:38. Total running time: 8min 30s
Logical resource usage: 4.0/8 CPUs, 2.0/2 GPUs (0.0/1.0 NODE_ID_AS_RESOURCE, 0.0/2.0 accelerator_type:V100)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        4           461.557    1.51442       0.4371 |
| train_cifar_1e5c3_00004   RUNNING      0.0312312               4        4           205.749    2.31443       0.0985 |
| train_cifar_1e5c3_00005   RUNNIN

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00007_7_batch_size=8,lr=0.0131_2023-11-06_21-19-07/checkpoint_000003) [repeated 2x across cluster]



Trial train_cifar_1e5c3_00004 completed after 5 iterations at 2023-11-06 21:28:03. Total running time: 8min 56s
+------------------------------------------------------------+
| Trial train_cifar_1e5c3_00004 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000004 |
| time_this_iter_s                                  50.35731 |
| time_total_s                                     256.10656 |
| training_iteration                                       5 |
| accuracy                                            0.0964 |
| loss                                               2.31252 |
| try_gpu                                              False |
+------------------------------------------------------------+
(train_cifar pid=3952) [5, 16000] loss: 0.178 [repeated 3x across cluster]

Trial status: 3 RUNNING | 5 TERMINATED
Current time: 2023-11-06 21:28:08. Total running time: 9min 0

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000004) [repeated 2x across cluster]


(train_cifar pid=3953) [5,  4000] loss: 1.094 [repeated 2x across cluster]
(train_cifar pid=2850, ip=10.139.64.113) [6,  2000] loss: 1.489 [repeated 2x across cluster]

Trial train_cifar_1e5c3_00007 completed after 5 iterations at 2023-11-06 21:28:30. Total running time: 9min 23s
+------------------------------------------------------------+
| Trial train_cifar_1e5c3_00007 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000004 |
| time_this_iter_s                                  29.10241 |
| time_total_s                                     151.15273 |
| training_iteration                                       5 |
| accuracy                                            0.1636 |
| loss                                               2.16655 |
| try_gpu                                              False |
+------------------------------------------------------------+

(train_cifar pid=3953) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00007_7_batch_size=8,lr=0.0131_2023-11-06_21-19-07/checkpoint_000004)


(train_cifar pid=2850, ip=10.139.64.113) [6,  4000] loss: 0.731


(train_cifar pid=3952) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00000_0_batch_size=2,lr=0.0012_2023-11-06_21-19-07/checkpoint_000004)



Trial status: 2 RUNNING | 6 TERMINATED
Current time: 2023-11-06 21:28:38. Total running time: 9min 30s
Logical resource usage: 2.0/8 CPUs, 1.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        5           566.968    1.45453       0.4871 |
| train_cifar_1e5c3_00005   RUNNING      0.00117903              4        5           254.897    1.5355        0.4307 |
| train_cifar_1e5c3_00001   TERMI

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000005)


Trial status: 2 RUNNING | 6 TERMINATED
Current time: 2023-11-06 21:29:08. Total running time: 10min 0s
Logical resource usage: 2.0/8 CPUs, 1.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        5           566.968    1.45453       0.4871 |
| train_cifar_1e5c3_00005   RUNNING      0.00117903              4        6           298.101    1.49083       0.4642 |
| train_cifar_1e5c3_00001   TERMIN

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000006)


(train_cifar pid=3952) [6, 18000] loss: 0.156 [repeated 2x across cluster]
(train_cifar pid=3952) [6, 20000] loss: 0.143 [repeated 2x across cluster]
Trial status: 2 RUNNING | 6 TERMINATED
Current time: 2023-11-06 21:30:08. Total running time: 11min 1s
Logical resource usage: 2.0/8 CPUs, 1.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        5           56

(train_cifar pid=3952) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00000_0_batch_size=2,lr=0.0012_2023-11-06_21-19-07/checkpoint_000005)


(train_cifar pid=2850, ip=10.139.64.113) [8,  6000] loss: 0.455 [repeated 2x across cluster]
(train_cifar pid=2850, ip=10.139.64.113) [8,  8000] loss: 0.337
(train_cifar pid=3952) [7,  2000] loss: 1.356
(train_cifar pid=2850, ip=10.139.64.113) [8, 10000] loss: 0.271
(train_cifar pid=3952) [7,  4000] loss: 0.687


(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000007)


(train_cifar pid=3952) [7,  6000] loss: 0.457
(train_cifar pid=2850, ip=10.139.64.113) [9,  2000] loss: 1.315
Trial status: 2 RUNNING | 6 TERMINATED
Current time: 2023-11-06 21:30:38. Total running time: 11min 31s
Logical resource usage: 2.0/8 CPUs, 1.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        6           658.586    1.49881       0.4675 |
| train_cifar_1e5c3_00005

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000008)


(train_cifar pid=3952) [7, 18000] loss: 0.158
(train_cifar pid=3952) [7, 20000] loss: 0.137 [repeated 2x across cluster]


(train_cifar pid=3952) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00000_0_batch_size=2,lr=0.0012_2023-11-06_21-19-07/checkpoint_000006)


(train_cifar pid=2850, ip=10.139.64.113) [10,  6000] loss: 0.443 [repeated 2x across cluster]
Trial status: 2 RUNNING | 6 TERMINATED
Current time: 2023-11-06 21:31:38. Total running time: 12min 31s
Logical resource usage: 2.0/8 CPUs, 1.0/2 GPUs (0.0/1.0 NODE_ID_AS_RESOURCE, 0.0/2.0 accelerator_type:V100)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        7           745.31     1.42207       0.491  |
| train_cifar_1e5c3_00005   RUNNING      

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000009)


(train_cifar pid=3952) [8,  6000] loss: 0.455
(train_cifar pid=2850, ip=10.139.64.113) [11,  2000] loss: 1.274
Trial status: 2 RUNNING | 6 TERMINATED
Current time: 2023-11-06 21:32:08. Total running time: 13min 1s
Logical resource usage: 2.0/8 CPUs, 1.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        7           745.31     1.42207       0.491  |
| train_cifar_1e5c3_00005

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000010)


(train_cifar pid=3952) [8, 18000] loss: 0.151 [repeated 2x across cluster]
(train_cifar pid=2850, ip=10.139.64.113) [12,  2000] loss: 1.224
(train_cifar pid=3952) [8, 20000] loss: 0.142
(train_cifar pid=2850, ip=10.139.64.113) [12,  4000] loss: 0.640


(train_cifar pid=3952) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00000_0_batch_size=2,lr=0.0012_2023-11-06_21-19-07/checkpoint_000007)


(train_cifar pid=2850, ip=10.139.64.113) [12,  6000] loss: 0.426
Trial status: 2 RUNNING | 6 TERMINATED
Current time: 2023-11-06 21:33:08. Total running time: 14min 1s
Logical resource usage: 2.0/8 CPUs, 1.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        8           830.615    1.44869       0.4724 |
| train_cifar_1e5c3_00005   RUNNING      0.00117903              4       11     

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000011)


(train_cifar pid=3952) [9,  8000] loss: 0.337
(train_cifar pid=3952) [9, 10000] loss: 0.278 [repeated 2x across cluster]
Trial status: 2 RUNNING | 6 TERMINATED
Current time: 2023-11-06 21:33:38. Total running time: 14min 31s
Logical resource usage: 2.0/8 CPUs, 1.0/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        8           830.615    1.44869       0.4724 |
| tra

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000012)


Trial status: 2 RUNNING | 6 TERMINATED
Current time: 2023-11-06 21:34:08. Total running time: 15min 1s
Logical resource usage: 2.0/8 CPUs, 1.0/2 GPUs (0.0/1.0 NODE_ID_AS_RESOURCE, 0.0/2.0 accelerator_type:V100)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        8           830.615    1.44869       0.4724 |
| train_cifar_1e5c3_00005   RUNNING      0.00117903              4       13           602.575    1.415         0.5135 |
| train_cifar_1e5c3_00001   TERMIN

(train_cifar pid=3952) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00000_0_batch_size=2,lr=0.0012_2023-11-06_21-19-07/checkpoint_000008)


(train_cifar pid=2850, ip=10.139.64.113) [14,  6000] loss: 0.424
(train_cifar pid=2850, ip=10.139.64.113) [14,  8000] loss: 0.314 [repeated 2x across cluster]
Trial status: 2 RUNNING | 6 TERMINATED
Current time: 2023-11-06 21:34:38. Total running time: 15min 31s
Logical resource usage: 2.0/8 CPUs, 1.0/2 GPUs (0.0/1.0 NODE_ID_AS_RESOURCE, 0.0/2.0 accelerator_type:V100)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        9           9

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000013)


(train_cifar pid=3952) [10,  8000] loss: 0.337 [repeated 2x across cluster]
(train_cifar pid=3952) [10, 10000] loss: 0.266 [repeated 2x across cluster]
Trial status: 2 RUNNING | 6 TERMINATED
Current time: 2023-11-06 21:35:08. Total running time: 16min 1s
Logical resource usage: 2.0/8 CPUs, 1.0/2 GPUs (0.0/1.0 NODE_ID_AS_RESOURCE, 0.0/2.0 accelerator_type:V100)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        9           

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000014)


Trial status: 2 RUNNING | 6 TERMINATED
Current time: 2023-11-06 21:35:39. Total running time: 16min 31s
Logical resource usage: 2.0/8 CPUs, 1.0/2 GPUs (0.0/1.0 NODE_ID_AS_RESOURCE, 0.0/2.0 accelerator_type:V100)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   RUNNING      0.00119966              2        9           916.434    1.37025       0.5169 |
| train_cifar_1e5c3_00005   RUNNING      0.00117903              4       15           689.639    1.46727       0.5215 |
| train_cifar_1e5c3_00001   TERMI

(train_cifar pid=3952) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00000_0_batch_size=2,lr=0.0012_2023-11-06_21-19-07/checkpoint_000009)


(train_cifar pid=2850, ip=10.139.64.113) [16,  6000] loss: 0.409
(train_cifar pid=2850, ip=10.139.64.113) [16,  8000] loss: 0.314

Trial status: 7 TERMINATED | 1 RUNNING
Current time: 2023-11-06 21:36:09. Total running time: 17min 1s
Logical resource usage: 1.0/8 CPUs, 0.5/2 GPUs (0.0/1.0 NODE_ID_AS_RESOURCE, 0.0/2.0 accelerator_type:V100)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00005   RUNNING      0.00117903              4       15           689.639    1.46727       0.5215 |
| tra

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000015)


(train_cifar pid=2850, ip=10.139.64.113) [17,  2000] loss: 1.187
(train_cifar pid=2850, ip=10.139.64.113) [17,  4000] loss: 0.598
Trial status: 7 TERMINATED | 1 RUNNING
Current time: 2023-11-06 21:36:39. Total running time: 17min 31s
Logical resource usage: 1.0/8 CPUs, 0.5/2 GPUs (0.0/1.0 NODE_ID_AS_RESOURCE, 0.0/2.0 accelerator_type:V100)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00005   RUNNING      0.00117903              4       16           733.409    1.33357       0.5445 |
| tra

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000016)


Trial status: 7 TERMINATED | 1 RUNNING
Current time: 2023-11-06 21:37:09. Total running time: 18min 1s
Logical resource usage: 1.0/8 CPUs, 0.5/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00005   RUNNING      0.00117903              4       17           777.155    1.40678       0.525  |
| train_cifar_1e5c3_00000   TERMINATED   0.00119966              2       10          1001.14     1.45024       0.4933 |
| train_cifar_1e5c3_00001   TERMIN

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000017)


(train_cifar pid=2850, ip=10.139.64.113) [19,  2000] loss: 1.192
(train_cifar pid=2850, ip=10.139.64.113) [19,  4000] loss: 0.597
(train_cifar pid=2850, ip=10.139.64.113) [19,  6000] loss: 0.400
Trial status: 7 TERMINATED | 1 RUNNING
Current time: 2023-11-06 21:38:09. Total running time: 19min 2s
Logical resource usage: 1.0/8 CPUs, 0.5/2 GPUs (0.0/2.0 accelerator_type:V100, 0.0/1.0 NODE_ID_AS_RESOURCE)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00005   RUNNING      0.00117903 

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000018)


(train_cifar pid=2850, ip=10.139.64.113) [20,  2000] loss: 1.194
Trial status: 7 TERMINATED | 1 RUNNING
Current time: 2023-11-06 21:38:39. Total running time: 19min 32s
Logical resource usage: 1.0/8 CPUs, 0.5/2 GPUs (0.0/1.0 NODE_ID_AS_RESOURCE, 0.0/2.0 accelerator_type:V100)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00005   RUNNING      0.00117903              4       19           864.621    1.38908       0.5269 |
| train_cifar_1e5c3_00000   TERMINATED   0.00119966              2       10    

(train_cifar pid=2850, ip=10.139.64.113) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/dbfs/pj/ray/tune_checkpointing_location/train_cifar_1e5c3_00005_5_batch_size=4,lr=0.0012_2023-11-06_21-19-07/checkpoint_000019)



Trial status: 8 TERMINATED
Current time: 2023-11-06 21:39:16. Total running time: 20min 8s
Logical resource usage: 1.0/8 CPUs, 0.5/2 GPUs (0.0/1.0 NODE_ID_AS_RESOURCE, 0.0/2.0 accelerator_type:V100)
Current best trial: 1e5c3_00001 with loss=1.2401041355133056 and params={'l1': 4, 'l2': 8, 'lr': 0.0017008417716143818, 'batch_size': 16, 'max_epoch': 20}
+---------------------------------------------------------------------------------------------------------------------+
| Trial name                status               lr     batch_size     iter     total time (s)      loss     accuracy |
+---------------------------------------------------------------------------------------------------------------------+
| train_cifar_1e5c3_00000   TERMINATED   0.00119966              2       10          1001.14     1.45024       0.4933 |
| train_cifar_1e5c3_00001   TERMINATED   0.00170084             16       20           321.879    1.2401        0.5481 |
| train_cifar_1e5c3_00002   TERMINATED   0.01

Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
Best trial test set accuracy: 0.5479
