# How to re-run failed training

## Introduction
When a model training script tracked in Neptune fails partway through, you can re-run it with the same metadata, such as hyperparameters, data, and code version.

By the end of this guide, you will know how to re-open a failed Neptune run, fetch the metadata needed to re-run it, and log the metadata from model training, validation, or testing to a new run, so that you capture the results the failed run never produced.

[See this example in Neptune](https://app.neptune.ai/common/showroom/e/SHOW-3695)

[![image](https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MT0sYKbymfLAAtTq4-t%2Fuploads%2Fcgj0RmMm89ejwZKSZfXK%2Fimage.png?alt=media&token=c0314a39-5e8d-465c-9752-a688bd207b91)](https://app.neptune.ai/common/showroom/e/SHOW-3695)
<center><small>Training charts from the run created with the metadata taken from the failed run.</small></center>



## Before you start

Make sure that you have:
* [Python 3.7+ installed](https://www.python.org/downloads/),
* [Basic familiarity with Neptune (creating a run and logging metadata to it)](https://docs.neptune.ai/you-should-know/what-can-you-log-and-display).

In [None]:
! pip install -U neptune-client torch==1.10.2 torchvision==0.11.3 #add other dependencies

## Step 1: Get Run ID
You will get the Run ID of the failed run **programmatically**.

**Note**: To log or retrieve metadata from Neptune, you need the `project name` and the `api_token`.

To make this example easy to follow, we have created a public project **'common/showroom'** and a shared user **'neptuner'** with the API token **'ANONYMOUS'**, as you will see in the code cell below.

**(Optional)** To log to your own project, you need a [Neptune account](https://app.neptune.ai/register/) and a [project](https://docs.neptune.ai/getting-started/installation#setting-the-project-name).
You can then pass the [project](https://docs.neptune.ai/getting-started/installation#setting-the-project-name) and [api_token](https://docs.neptune.ai/getting-started/installation#authentication-neptune-api-token) arguments to the `init()` method:

`run = neptune.init(api_token='<YOUR_API_TOKEN>', project='<YOUR_WORKSPACE/YOUR_PROJECT>')` 
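Instead of hard-coding credentials, you can read them from the `NEPTUNE_PROJECT` and `NEPTUNE_API_TOKEN` environment variables, which the Neptune client also recognizes. The following is a minimal sketch; the helper name `neptune_credentials` and the fallback to the public demo project are assumptions made for this guide, not part of the Neptune API.

```python
import os

# Public demo credentials used throughout this guide.
DEMO_PROJECT = "common/showroom"
DEMO_API_TOKEN = "ANONYMOUS"


def neptune_credentials(env=None):
    """Return keyword arguments suitable for neptune.init().

    Hypothetical helper: falls back to the public demo project
    when the environment variables are not set.
    """
    env = os.environ if env is None else env
    return {
        "project": env.get("NEPTUNE_PROJECT", DEMO_PROJECT),
        "api_token": env.get("NEPTUNE_API_TOKEN", DEMO_API_TOKEN),
    }


credentials = neptune_credentials()
```

You could then call `neptune.init(**credentials)` so the same notebook works with both the demo project and your own.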


In [None]:
import neptune.new as neptune

# Fetch project
project = neptune.get_project(name="common/showroom", api_token="ANONYMOUS")

# Fetch only inactive runs with tag "showcase-run"
runs_table_df = project.fetch_runs_table(state="idle", tag=["showcase-run"]).to_pandas()

# Extract the last failed run's id
failed_run_id = runs_table_df[runs_table_df["sys/failed"] == True]["sys/id"].values[0]

In [None]:
print("Failed_run_id = ", failed_run_id)
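Indexing with `.values[0]` raises an `IndexError` when the filtered table contains no failed run. The sketch below shows one way to make that case explicit. It works on plain dictionaries (for example, the result of `runs_table_df.to_dict("records")`); the function name `pick_failed_run_id` and the sample rows are made up for illustration.

```python
def pick_failed_run_id(records):
    """Return the "sys/id" of the first record whose "sys/failed" flag is True.

    `records` is a list of dicts, one per run, in the order of the runs table.
    """
    for record in records:
        # `== True` also matches NumPy booleans and rejects missing values (None, NaN).
        if record.get("sys/failed") == True:  # noqa: E712
            return record["sys/id"]
    raise LookupError("No failed run found; check the state and tag filters.")


# Made-up rows that mimic the two columns used in this step.
example_records = [
    {"sys/id": "SHOW-2", "sys/failed": False},
    {"sys/id": "SHOW-1", "sys/failed": True},
]
print(pick_failed_run_id(example_records))
```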

## Step 2: Resume failed run
Use the `neptune.init()` method to:
* Re-open a run using the ID you got from the previous step 
* Re-open it in the `read-only` mode

Use the `read-only` mode so that metadata previously logged to the run is not accidentally changed. You can re-open a run as many times as needed.

**(Optional)** If you already have a [Neptune account](https://app.neptune.ai/register/), you can pass your credentials to the **[project](https://docs.neptune.ai/getting-started/installation#setting-the-project-name)** and **[api_token](https://docs.neptune.ai/getting-started/installation#authentication-neptune-api-token)** arguments of `neptune.init()`:

`run = neptune.init(api_token='<YOUR_API_TOKEN>', project='<YOUR_WORKSPACE/YOUR_PROJECT>')` 

In [None]:
failed_run = neptune.init(
    project="common/showroom",
    api_token="ANONYMOUS",
    run=failed_run_id,
    mode="read-only"
)

## Step 3: Fetch relevant metadata from Neptune

Fetch the metadata (dataset and hyperparameters) needed to re-run the training. Specifically, you will download the hyperparameters used in the failed run to instantiate a model with the same configuration. Then you will use the [dataset artifact](https://docs.neptune.ai/api-reference/field-types#artifact) to download the dataset files from your cloud storage (e.g., an S3 bucket), so that you train your model on the same dataset version used in the failed run.

To do that:

Use the [download()](https://docs.neptune.ai/api-reference/field-types#.download-3) method to retrieve the [dataset artifact](https://docs.neptune.ai/api-reference/field-types#artifact) to your local disk:

In [None]:
data_dir = "data"

In [None]:
# Download tracked dataset files from S3 bucket
failed_run["artifacts/dataset"].download(destination=data_dir)

Use the [fetch()](https://docs.neptune.ai/api-reference/field-types#.fetch-1) method to retrieve hyperparameters:

In [None]:
# Fetch hyperparameters
failed_run_params = failed_run["config/hyperparameters"].fetch()
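The later cells read the keys `bs`, `lr`, `input_sz`, `n_classes`, and `device` from the fetched dictionary. If the failed run crashed before logging all of them, the re-run would fail with a `KeyError` midway. A minimal sketch of an early check follows; the helper name `missing_hyperparameters` and the sample values are hypothetical.

```python
# Keys that the later cells of this guide read from the fetched dictionary.
REQUIRED_KEYS = ("bs", "lr", "input_sz", "n_classes", "device")


def missing_hyperparameters(params, required=REQUIRED_KEYS):
    """Return the required keys that are absent from `params`, in order."""
    return [key for key in required if key not in params]


# Made-up values for illustration; the real ones come from failed_run_params.
example_params = {"bs": 128, "lr": 0.01, "input_sz": 3072, "n_classes": 10}
missing = missing_hyperparameters(example_params)
if missing:
    print("Missing hyperparameters:", ", ".join(missing))
```

In the notebook, you would pass `failed_run_params` instead of `example_params` and stop before training if the returned list is not empty.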

## Step 4: Create a new run
Create a new Neptune run that will be used to log metadata in the re-run session.

In [None]:
new_run = neptune.init(
    project="common/showroom",
    api_token="ANONYMOUS",
    tags=["re-run", "successful training"],
)

Running this cell creates a Run in Neptune, and you can log model building metadata to it.

**Click the link in the cell output to open the Run in the Neptune UI.**

For now, it is empty, but you should keep the tab open to see what happens next.

## Step 5: Log new training metadata
Now you can continue working and logging metadata to a brand new Run.
You can log metadata using the [Neptune API Client](https://docs.neptune.ai/you-should-know/what-can-you-log-and-display).

### Log a copy of the dataset artifact from the failed run to the new run

In [None]:
new_run["artifacts/dataset"].assign(failed_run["artifacts/dataset"].fetch())

### Log hyperparameters from the failed run to the new run

In [None]:
new_run["config/hyperparameters"] = failed_run_params

### Load Dataset and Model

Dataset

In [None]:
import torch
from torchvision import datasets, transforms

data_tfms = {
    "train": transforms.Compose(
        [
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    ),
}

In [None]:
trainset = datasets.CIFAR10(
    data_dir + "/CIFAR10", transform=data_tfms["train"], download=False
)
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=failed_run_params["bs"], shuffle=True, num_workers=2
)

Model

In [None]:
import torch.nn as nn

class BaseModel(nn.Module):
    def __init__(self, input_sz, hidden_dim, n_classes):
        super(BaseModel, self).__init__()
        self.main = nn.Sequential(
            nn.Linear(input_sz, hidden_dim * 2),
            nn.ReLU(),
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, n_classes),
        )

    def forward(self, input):
        x = input.view(-1, 32 * 32 * 3)
        return self.main(x)

In [None]:
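# BaseModel takes (input_sz, hidden_dim, n_classes); here input_sz is passed for both
# input_sz and hidden_dim.
model = BaseModel(
    failed_run_params["input_sz"],
    failed_run_params["input_sz"],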
    failed_run_params["n_classes"],
).to(failed_run_params["device"])
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=failed_run_params["lr"])
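The model is moved to the device recorded in the failed run. If that run used a GPU and the machine you re-run on has none, `.to("cuda")` fails. The sketch below shows one possible fallback; `resolve_device` is a hypothetical helper, and it assumes the stored device is a string such as `"cuda"`, `"cuda:0"`, or `"cpu"`.

```python
def resolve_device(requested, cuda_available):
    """Return `requested`, or "cpu" if a CUDA device was requested but is unavailable."""
    requested = str(requested)
    if requested.startswith("cuda") and not cuda_available:
        return "cpu"
    return requested


# In the notebook you would pass torch.cuda.is_available() as the second argument:
# device = resolve_device(failed_run_params["device"], torch.cuda.is_available())
print(resolve_device("cuda", cuda_available=False))
print(resolve_device("cpu", cuda_available=False))
```

If you fall back to a different device, consider logging the device actually used to the new run so the two runs remain comparable.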

### Log losses and metrics

In [None]:
for i, (x, y) in enumerate(trainloader, 0):
    # Move the batch to the device used in the failed run
    x, y = x.to(failed_run_params["device"]), y.to(failed_run_params["device"])
    optimizer.zero_grad()
    outputs = model(x)
    _, preds = torch.max(outputs, 1)
    loss = criterion(outputs, y)
    acc = torch.sum(preds == y) / len(x)

    # Log batch-level loss and accuracy to the new run as Python floats
    new_run["training/batch/loss"].log(loss.item())
    new_run["training/batch/acc"].log(acc.item())

    loss.backward()
    optimizer.step()

## Stop run

<font color=red>**Warning:**</font><br>

Once you are done logging, stop tracking the run using the [stop()](https://docs.neptune.ai/api-reference/project#.stop) method.
This is needed only when logging from a notebook environment. When logging from a script, Neptune automatically stops tracking once the script finishes executing.

In [None]:
failed_run.stop()
new_run.stop()
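If the re-run itself raises an exception, the `stop()` calls above are never reached. One way to guard against that is a small context manager that stops every run when the block exits, whether or not it succeeded. This is a sketch under stated assumptions: `stop_on_exit` is a hypothetical helper, and `DummyRun` stands in for a Neptune run so that the example works without a connection.

```python
from contextlib import contextmanager


@contextmanager
def stop_on_exit(*runs):
    """Yield the given runs and call stop() on each one when the block exits."""
    try:
        yield runs
    finally:
        for run in runs:
            run.stop()


class DummyRun:
    """Stand-in for a Neptune run; it only records whether stop() was called."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


dummy = DummyRun()
try:
    with stop_on_exit(dummy):
        raise RuntimeError("simulated training failure")
except RuntimeError:
    pass
print(dummy.stopped)
```

In the notebook, you would wrap the training loop in `with stop_on_exit(failed_run, new_run):` instead of calling `stop()` in a separate cell.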

## Explore the run in the Neptune UI

After running the code cell in **Step 4**, you will get a link in the cell output similar to https://app.neptune.ai/common/showroom/e/SHOW-3695 with:
* **common/showroom** replaced by **your_workspace/your_project**,
* **SHOW-3695** replaced by your *Run ID*. 

**Click on the link to open the Run in Neptune UI.**

![image](https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MT0sYKbymfLAAtTq4-t%2Fuploads%2FnZUUkjpnlr7VhB24Trzv%2Fimage.png?alt=media&token=7944ed6d-cbc5-4edf-9013-04d5e9e6ef9f)
<center><small>New Neptune run created from failed run metadata</small></center>

## Conclusion
You learned how to:
* Re-open a failed run in order to fetch the metadata needed to re-run it.
* Use fetched metadata to parametrize a new run with the same training loop.

**The same approach works for any run whose logged metadata you want to reuse in a new run.**

Visit our docs for more tutorials and guides on how to use Neptune: https://docs.neptune.ai
