# How to re-run failed training

## Before you start

### Install dependencies

In [2]:
! pip install -U neptune-client numpy==1.19.5 torch==1.9.0 torchvision==0.10.0


# Basic example

**Import libraries**

In [11]:
import neptune.new as neptune
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import os
from torchvision import datasets, transforms

## Step 1: Get Run ID
You will get the Run ID of the failed run **programmatically**.

In [4]:
# Fetch project
project = neptune.get_project(name='common/showroom', api_token='ANONYMOUS')

# Fetch only inactive runs
runs_table_df = project.fetch_runs_table(state="idle", tag=['showcase-run']).to_pandas()

# Sort runs by creation time, newest first
runs_table_df = runs_table_df.sort_values(by='sys/creation_time', ascending=False)

# Extract the last failed run's id
failed_run_id = runs_table_df[runs_table_df['sys/failed']==True]['sys/id'].values[0]

In [5]:
print('Failed_run_id = ', failed_run_id)

'SHOW-3295'
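
The selection logic in this step can be tried out without a Neptune connection. The following is a minimal pure-Python sketch: the rows and the helper `latest_failed_run_id` are hypothetical stand-ins for the runs table, and sorting on a `sys/creation_time` value is an assumption about which timestamp you want to use.

```python
from datetime import datetime

# Hypothetical rows standing in for the runs table (not real project data).
rows = [
    {"sys/id": "RUN-1", "sys/failed": True, "sys/creation_time": datetime(2021, 10, 1)},
    {"sys/id": "RUN-2", "sys/failed": False, "sys/creation_time": datetime(2021, 10, 5)},
    {"sys/id": "RUN-3", "sys/failed": True, "sys/creation_time": datetime(2021, 10, 9)},
]


def latest_failed_run_id(rows):
    """Return the ID of the most recently created failed run, or None."""
    failed = [r for r in rows if r["sys/failed"]]
    if not failed:
        return None
    return max(failed, key=lambda r: r["sys/creation_time"])["sys/id"]


print(latest_failed_run_id(rows))
```

Returning `None` when nothing has failed avoids the `IndexError` that indexing an empty result with `[0]` would raise.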

## Step 2: Resume run
Use the `neptune.init()` method to:
* Re-open a run using the ID you got in the previous step
* Re-open it in `read-only` mode

You use `read-only` mode so that the metadata previously logged to the run is not changed accidentally. You can re-open a run as many times as needed.


In [6]:
failed_run = neptune.init(
    project="common/showroom",
    api_token="ANONYMOUS",
    mode="read-only",
    run=failed_run_id
)

https://app.neptune.ai/common/showroom/e/SHOW-3295
Remember to stop your run once you’ve finished logging your metadata (https://docs.neptune.ai/api-reference/run#stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


## Step 3: Fetch and download data from Neptune

Fetch the metadata (dataset and hyperparameters) needed to re-run the training. Specifically, you will download the dataset artifact so that you train on the same data, and fetch the hyperparameters used in the failed run so that you can instantiate a model with the same configuration.

To do that:

### Use the [.download()](https://docs.neptune.ai/api-reference/field-types#.download-3) method to retrieve the [dataset artifact](https://docs.neptune.ai/api-reference/field-types#artifact) to your local disk:

In [None]:
data_dir = 'data'

In [21]:
failed_run['artifacts/dataset'].download(destination=data_dir)
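
Because the dataset is later loaded with `download=False`, a missing or empty download only surfaces when the dataset object is constructed. Below is a minimal sketch of an earlier check, assuming the artifact unpacks into a `CIFAR10` subdirectory of `data_dir` (as the loading code in this guide expects). `dataset_ready` is a hypothetical helper, not part of Neptune or torchvision, and the demonstration uses a throwaway directory rather than a real download.

```python
import os
import tempfile


def dataset_ready(data_dir, subdir="CIFAR10"):
    """Return True if data_dir/subdir exists and contains at least one entry."""
    path = os.path.join(data_dir, subdir)
    return os.path.isdir(path) and len(os.listdir(path)) > 0


# Demonstration on a temporary directory instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    before = dataset_ready(tmp)
    os.makedirs(os.path.join(tmp, "CIFAR10", "cifar-10-batches-py"))
    after = dataset_ready(tmp)

print(before, after)
```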

### Use the [.fetch()](https://docs.neptune.ai/api-reference/field-types#.fetch-1) method to retrieve hyperparameters:

In [14]:
# Fetch non-file values
failed_run_params = failed_run['config/hyperparameters'].fetch()
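
The rest of this guide reads the keys `lr`, `bs`, `input_sz`, `n_classes`, and `device` from the fetched dictionary. A minimal sketch of a check that they are all present before training starts; `missing_keys` is a hypothetical helper and the example values are made up for illustration.

```python
# Keys this guide reads from the fetched hyperparameters.
REQUIRED_KEYS = ("lr", "bs", "input_sz", "n_classes", "device")


def missing_keys(params, required=REQUIRED_KEYS):
    """Return the required keys that are absent from params, in order."""
    return [key for key in required if key not in params]


# Hypothetical values for illustration; the real ones come from the failed run.
example_params = {"lr": 0.01, "bs": 128, "input_sz": 3072, "n_classes": 10}

print(missing_keys(example_params))
```

An empty list means every key the training code needs is available.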

## Step 4: Create a new run
Create a new Neptune run that will be used to log metadata in the re-run session.

In [17]:
new_run = neptune.init(
    project="common/showroom",
    tags=['re-run', 'successful training'],
    api_token="ANONYMOUS"
)

https://app.neptune.ai/common/showroom/e/SHOW-3296


## Step 5: Log new training metadata
Now you can continue working and logging metadata to a brand new Run.
You can log metadata using the [Neptune API Client](https://docs.neptune.ai/you-should-know/what-can-you-log-and-display).

### Log a copy of the dataset artifact from the failed run to the new run

In [18]:
new_run["artifacts/dataset"].assign(failed_run["artifacts/dataset"].fetch())

### Log hyperparameters from the failed run to the new run

In [19]:
new_run["config/hyperparameters"] = failed_run_params
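
Assigning a dictionary to a namespace should store each entry as its own field under that path. The sketch below only illustrates that path layout in plain Python; `flatten` is a hypothetical helper that does not call Neptune, and the values are made up.

```python
def flatten(namespace, values):
    """Map a (possibly nested) dict to slash-separated field paths."""
    fields = {}
    for key, value in values.items():
        path = f"{namespace}/{key}"
        if isinstance(value, dict):
            fields.update(flatten(path, value))
        else:
            fields[path] = value
    return fields


# Hypothetical hyperparameter values for illustration.
fields = flatten("config/hyperparameters", {"lr": 0.01, "bs": 128})
for path, value in fields.items():
    print(path, "=", value)
```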

### Load Dataset and Model

Dataset

In [20]:
data_tfms = {
    "train": transforms.Compose(
        [
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    ),
}

In [26]:
trainset = datasets.CIFAR10(os.path.join(data_dir, 'CIFAR10'), transform=data_tfms["train"], download=False)
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=failed_run_params["bs"], shuffle=True, num_workers=2
)
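
The batch size determines how many batches, and therefore how many logged loss and accuracy points, one pass over the data produces. A small arithmetic sketch, assuming the default behaviour where the last partial batch is kept; the batch size of 128 is a hypothetical value, since the real one comes from `failed_run_params["bs"]`.

```python
import math

CIFAR10_TRAIN_SIZE = 50_000  # number of images in the CIFAR-10 training split


def batches_per_epoch(dataset_size, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(dataset_size / batch_size)


# 128 is a hypothetical batch size used only for illustration.
print(batches_per_epoch(CIFAR10_TRAIN_SIZE, 128))  # 391
```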

Model

In [27]:
class BaseModel(nn.Module):
    def __init__(self, input_sz, hidden_dim, n_classes):
        super(BaseModel, self).__init__()
        self.main = nn.Sequential(
            nn.Linear(input_sz, hidden_dim * 2),
            nn.ReLU(),
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, n_classes),
        )

    def forward(self, input):
        x = input.view(-1, 32 * 32 * 3)
        return self.main(x)

In [28]:
model = BaseModel(
    failed_run_params["input_sz"], failed_run_params["input_sz"], failed_run_params["n_classes"]
).to(failed_run_params["device"])
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=failed_run_params["lr"])
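
Note that `failed_run_params["input_sz"]` is passed for both `input_sz` and `hidden_dim`, so the hidden layers are sized from the input dimension. The sketch below works out the resulting layer shapes and parameter count in plain Python. It assumes `input_sz` equals `32 * 32 * 3` (matching the reshape in `forward()`) and 10 classes for CIFAR-10; `layer_shapes` and `parameter_count` are hypothetical helpers.

```python
def layer_shapes(input_sz, hidden_dim, n_classes):
    """(in_features, out_features) of each nn.Linear in BaseModel, in order."""
    return [
        (input_sz, hidden_dim * 2),
        (hidden_dim * 2, hidden_dim),
        (hidden_dim, hidden_dim // 2),
        (hidden_dim // 2, n_classes),
    ]


def parameter_count(shapes):
    """Weights plus biases summed over all linear layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


# Assumed values: 32 * 32 * 3 inputs and 10 CIFAR-10 classes.
input_sz = 32 * 32 * 3
shapes = layer_shapes(input_sz, input_sz, 10)
print(shapes)
print(parameter_count(shapes))
```

Under these assumptions the model has roughly 42 million parameters, which is worth keeping in mind when choosing the device.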

### Log losses and metrics

In [29]:
for i, (x, y) in enumerate(trainloader, 0):
    x, y = x.to(failed_run_params["device"]), y.to(failed_run_params["device"])
    optimizer.zero_grad()
    outputs = model(x)
    _, preds = torch.max(outputs, 1)
    loss = criterion(outputs, y)
    acc = (torch.sum(preds == y.data)) / len(x)

    # Log plain Python floats rather than tensors
    new_run["training/batch/loss"].log(loss.item())
    new_run["training/batch/acc"].log(acc.item())

    loss.backward()
    optimizer.step()
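
The batch accuracy logged in the loop is the fraction of samples whose highest-scoring class matches the label. A plain-Python sketch of the same arithmetic, without PyTorch; `batch_accuracy` is a hypothetical helper and the scores are made up.

```python
def batch_accuracy(outputs, targets):
    """Fraction of rows whose highest-scoring class index equals the target."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in outputs]
    correct = sum(p == t for p, t in zip(preds, targets))
    return correct / len(targets)


# Hypothetical scores for a batch of 4 samples and 3 classes.
outputs = [
    [0.1, 2.0, 0.3],  # predicted class 1
    [1.5, 0.2, 0.1],  # predicted class 0
    [0.0, 0.1, 0.9],  # predicted class 2
    [0.4, 0.3, 0.2],  # predicted class 0
]
targets = [1, 0, 1, 0]

print(batch_accuracy(outputs, targets))  # 0.75
```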


# Stop run

<font color=red>**Warning:**</font><br>
Once you are done logging, you should stop tracking the run using the `stop()` method.
This is needed only when logging from a notebook environment. When logging from a script, Neptune automatically stops tracking once the script has finished executing.

In [31]:
failed_run.stop()
new_run.stop()
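
If training raises an exception, the `stop()` calls above are never reached. One possible pattern, sketched here with a hypothetical `FakeRun` stand-in instead of real Neptune runs, is to register the `stop()` calls up front with `contextlib.ExitStack` so that they run even when the block exits with an error.

```python
from contextlib import ExitStack


class FakeRun:
    """Hypothetical stand-in for a Neptune run; only tracks whether stop() ran."""

    def __init__(self, name):
        self.name = name
        self.stopped = False

    def stop(self):
        self.stopped = True


old_run = FakeRun("failed")
fresh_run = FakeRun("new")

try:
    with ExitStack() as stack:
        # Callbacks run in reverse order when the block exits, even on error.
        stack.callback(old_run.stop)
        stack.callback(fresh_run.stop)
        raise RuntimeError("simulated training failure")
except RuntimeError:
    pass

print(old_run.stopped, fresh_run.stopped)
```

A plain `try`/`finally` block achieves the same thing; `ExitStack` is only a convenience when several runs are open at once.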