# How to re-run failed training

<a target="_blank" href="https://colab.research.google.com/github/neptune-ai/examples/blob/main/how-to-guides/re-run-failed-training/notebooks/re_run_failed_training.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/>
</a>
<a target="_blank" href="https://github.com/neptune-ai/examples/blob/main/how-to-guides/re-run-failed-training/notebooks/re_run_failed_training.ipynb">
  <img alt="Open in GitHub" src="https://img.shields.io/badge/Open_in_GitHub-blue?logo=github&labelColor=black">
</a>
<a target="_blank" href="https://app.neptune.ai/o/common/org/showroom/e/SHOW-28179/all"> 
  <img alt="Explore in Neptune" src="https://neptune.ai/wp-content/uploads/2024/01/neptune-badge.svg">
</a>
<a target="_blank" href="https://docs.neptune.ai/tutorials/re-running_failed_training/">
  <img alt="View tutorial in docs" src="https://neptune.ai/wp-content/uploads/2024/01/docs-badge-2.svg">
</a>

## Introduction
If a model training script tracked in Neptune fails partway through, you can re-run it with the same metadata, such as hyperparameters, data, and code version.

By the end of this guide, you will know how to re-open a failed Neptune run, fetch the metadata needed to re-run it, and log the metadata from model training, validation, or testing to a new run. This lets you capture the results that the failed run never produced.


## Before you start

Make sure that you have:
* [Python 3.7+ installed](https://www.python.org/downloads/)
* [Basic familiarity with Neptune (creating a run and logging metadata to it)](https://docs.neptune.ai/usage/#getting-started)

In [None]:
! pip install -U neptune torch torchvision

## Step 1: Get run ID
You will get the run ID of the failed run **programmatically**.

**Note**: To log or retrieve metadata from Neptune, you need the project name and an API token.

To make this example easy to follow, we'll log the metadata to the public project **'common/showroom'** using a shared token for anonymous logging.

**(Optional)** If you want to log to your own project, you need a [Neptune account](https://app.neptune.ai/register/) and a [project](https://docs.neptune.ai/setup/creating_project).
Then you can pass [project](https://docs.neptune.ai/setup/creating_project/#next-steps) and [api_token](https://docs.neptune.ai/setup/setting_api_token/#setting-your-api-token) arguments to the `init_run()` method.

`run = neptune.init_run(api_token='YOUR_API_TOKEN', project='YOUR_WORKSPACE/YOUR_PROJECT')` 


In [None]:
import neptune

# Fetch project
project = neptune.init_project(
    project="common/showroom", api_token=neptune.ANONYMOUS_API_TOKEN, mode="read-only"
)

# Fetch only inactive runs with tag "showcase-run"
runs_table_df = project.fetch_runs_table(
    state="inactive", tag=["showcase-run"], columns=["sys/failed"]
).to_pandas()

# Extract the last failed run's id
failed_run_id = runs_table_df[runs_table_df["sys/failed"] == True]["sys/id"].values[0]
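
Indexing with `.values[0]` raises an `IndexError` when the table contains no failed runs. The sketch below shows one way to guard against that. It is written in plain Python so it runs without a Neptune connection, and it assumes rows shaped like the output of `runs_table_df.to_dict("records")`. The helper name `first_failed_run_id` and the example rows are hypothetical.

```python
from typing import Any, Dict, List, Optional


def first_failed_run_id(records: List[Dict[str, Any]]) -> Optional[str]:
    """Return the ID of the first row flagged as failed, or None if there is none."""
    for record in records:
        # Rows are scanned in the order the table returned them.
        if record.get("sys/failed"):
            return record["sys/id"]
    return None


# Hypothetical rows for illustration only
example_records = [
    {"sys/id": "SHOW-3", "sys/failed": False},
    {"sys/id": "SHOW-2", "sys/failed": True},
    {"sys/id": "SHOW-1", "sys/failed": True},
]

print(first_failed_run_id(example_records))
print(first_failed_run_id([]))
```

A `None` result is a signal to stop early, before trying to re-open a run that does not exist.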

## Step 2: Resume failed run
Use the `neptune.init_run()` method to:
* Re-open a run using the ID you got from the previous step 
* Re-open it in the `read-only` mode

The `read-only` mode ensures that metadata already logged to the run is not changed by accident. You can re-open a run as many times as needed.

**(Optional)** If you already have a [Neptune account](https://app.neptune.ai/register/), you can pass your own credentials to the **[project](https://docs.neptune.ai/setup/setting_project_name/)** and **[api_token](https://docs.neptune.ai/setup/setting_api_token/)** arguments of `neptune.init_run()`:

```python
from getpass import getpass

run = neptune.init_run(
    api_token=getpass("Enter your Neptune API token: "),
    project="workspace-name/project-name",  # replace with your own
) 
```

In [None]:
failed_run = neptune.init_run(
    project="common/showroom",
    api_token=neptune.ANONYMOUS_API_TOKEN,
    with_id=failed_run_id,
    mode="read-only",
)

## Step 3: Fetch relevant metadata from Neptune

Fetch the metadata needed to re-run the training. Specifically, you will download the hyperparameters and the dataset path used in the failed run, so that you can instantiate the model and dataset with the same configuration.

To do that, use the [fetch()](https://docs.neptune.ai/api/universal/#fetch) method to retrieve the relevant metadata.

In [None]:
# Fetch hyperparameters
failed_run_params = failed_run["config/hyperparameters"].fetch()

In [None]:
# Fetch dataset path
dataset_path = failed_run["dataset/path"].fetch()
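
A failed run may have stopped before all of its metadata was logged, so it can help to check the fetched dictionary before building anything from it. The following is a minimal sketch, assuming the fetched hyperparameters behave like a plain dictionary. The key list mirrors the keys that the rest of this guide reads. The helper name `missing_params` and the example values are hypothetical.

```python
from typing import Any, Dict, Iterable, List

# Keys that the training code in this guide reads from the fetched hyperparameters
REQUIRED_KEYS = ("bs", "lr", "input_sz", "n_classes", "device")


def missing_params(
    params: Dict[str, Any], required: Iterable[str] = REQUIRED_KEYS
) -> List[str]:
    """Return the required keys that are absent from params, in order."""
    return [key for key in required if key not in params]


# Hypothetical values for illustration only
example_params = {"bs": 128, "lr": 0.01, "input_sz": 3072, "n_classes": 10}

missing = missing_params(example_params)
if missing:
    print(f"Missing hyperparameters: {missing}")
```

In the notebook, the same check could be applied to `failed_run_params` right after fetching it.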

## Step 4: Create a new run
Create a new Neptune run that will be used to log metadata in the re-run session.

In [None]:
new_run = neptune.init_run(
    project="common/showroom",
    api_token=neptune.ANONYMOUS_API_TOKEN,
    tags=["re-run", "successful training"],
)

Running this cell creates a run in Neptune, and you can log model building metadata to it.

**Click on the link above to open the run in the Neptune app.** 

For now, it is empty, but you should keep the tab open to see what happens next.

## Step 5: Log hyperparameters and dataset details from the failed run to the new run
Now you can continue working and logging metadata to a brand new run.
You can log metadata using the Neptune API Client. For details, see [What you can log and display](https://docs.neptune.ai/logging/what_you_can_log).

In [None]:
new_run["config/hyperparameters"] = failed_run_params
new_run["dataset/path"] = dataset_path

### Load dataset and model

Dataset

In [None]:
import torch
from torchvision import datasets, transforms

data_tfms = {
    "train": transforms.Compose(
        [
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    ),
}

In [None]:
trainset = datasets.CIFAR10(dataset_path, transform=data_tfms["train"], download=True)

trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=failed_run_params["bs"], shuffle=True, num_workers=0
)

Model

In [None]:
import torch.nn as nn


class BaseModel(nn.Module):
    def __init__(self, input_sz, hidden_dim, n_classes):
        super(BaseModel, self).__init__()
        self.main = nn.Sequential(
            nn.Linear(input_sz, hidden_dim * 2),
            nn.ReLU(),
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, n_classes),
        )

    def forward(self, input):
        x = input.view(-1, 32 * 32 * 3)
        return self.main(x)
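
To see how the layer widths follow from the constructor arguments, the sketch below recomputes them in plain Python, without PyTorch. It assumes a flattened 32 x 32 x 3 image as input, 10 output classes (CIFAR-10), and a hidden width equal to the input size, which is how this guide instantiates the model. The helper names are hypothetical.

```python
from typing import List, Tuple


def layer_shapes(
    input_sz: int, hidden_dim: int, n_classes: int
) -> List[Tuple[int, int]]:
    """(in_features, out_features) of each linear layer in BaseModel, in order."""
    return [
        (input_sz, hidden_dim * 2),
        (hidden_dim * 2, hidden_dim),
        (hidden_dim, hidden_dim // 2),
        (hidden_dim // 2, n_classes),
    ]


def parameter_count(shapes: List[Tuple[int, int]]) -> int:
    """Each linear layer holds a weight matrix (in * out) plus a bias vector (out)."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


input_sz = 32 * 32 * 3  # flattened CIFAR-10 image
shapes = layer_shapes(input_sz, hidden_dim=input_sz, n_classes=10)

for n_in, n_out in shapes:
    print(f"{n_in} -> {n_out}")
print(f"Trainable parameters: {parameter_count(shapes):,}")
```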

In [None]:
model = BaseModel(
    input_sz=failed_run_params["input_sz"],
    # This example reuses the input size as the hidden width
    hidden_dim=failed_run_params["input_sz"],
    n_classes=failed_run_params["n_classes"],
).to(failed_run_params["device"])
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=failed_run_params["lr"])
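
The device name stored with the failed run may not be available on the machine where you re-run the training, for example when the original run used a GPU. One possible safeguard is sketched below. The helper `resolve_device` is hypothetical and not part of Neptune or PyTorch. It takes the availability flag as an argument so that it runs without PyTorch installed.

```python
def resolve_device(requested: str, cuda_available: bool) -> str:
    """Fall back to the CPU when a CUDA device is requested but not available."""
    if requested.startswith("cuda") and not cuda_available:
        return "cpu"
    return requested


print(resolve_device("cuda", cuda_available=False))
print(resolve_device("cpu", cuda_available=False))
```

In the notebook, you could pass `torch.cuda.is_available()` as the second argument and use the result in place of `failed_run_params["device"]`.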

### Log losses and metrics

In [None]:
device = failed_run_params["device"]

for x, y in trainloader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()

    # Forward pass
    outputs = model(x)
    loss = criterion(outputs, y)

    # Batch accuracy: share of samples whose top-scoring class matches the label
    preds = outputs.argmax(dim=1)
    acc = (preds == y).float().mean()

    # Log plain Python floats rather than tensors
    new_run["training/batch/loss"].append(loss.item())
    new_run["training/batch/acc"].append(acc.item())

    # Backward pass and parameter update
    loss.backward()
    optimizer.step()
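
The accuracy logged in the loop is the fraction of samples in a batch whose highest-scoring class matches the label. The sketch below reproduces that calculation in plain Python on a small hand-made batch, to make the arithmetic explicit. The scores and labels are hypothetical.

```python
from typing import List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in a row of class scores."""
    return max(range(len(row)), key=lambda i: row[i])


def batch_accuracy(outputs: List[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of rows whose top-scoring class equals the label."""
    preds = [argmax(row) for row in outputs]
    correct = sum(pred == label for pred, label in zip(preds, labels))
    return correct / len(labels)


# Four samples, three classes (hypothetical scores)
outputs = [
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 0.9],
    [0.4, 0.3, 0.2],
]
labels = [1, 0, 1, 0]

print(batch_accuracy(outputs, labels))  # 3 of 4 predictions match -> 0.75
```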

## Stop logging

Once you are done logging, stop both the failed run and the new run.

In [None]:
failed_run.stop()
new_run.stop()
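
If the re-run itself raises an exception, the `stop()` calls above are never reached. One way to make cleanup unconditional is to register each `stop()` with `contextlib.ExitStack` from the standard library. The sketch below shows the pattern with a stand-in class, so it runs without a Neptune connection. `FakeRun` is hypothetical and only mimics the `stop()` method.

```python
from contextlib import ExitStack


class FakeRun:
    """Stand-in for a Neptune run; it only records whether stop() was called."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.stopped = False

    def stop(self) -> None:
        self.stopped = True


fake_failed_run = FakeRun("failed")
fake_new_run = FakeRun("new")

try:
    with ExitStack() as stack:
        # Registered callbacks run when the block exits, even after an exception
        stack.callback(fake_failed_run.stop)
        stack.callback(fake_new_run.stop)
        raise RuntimeError("simulated training failure")
except RuntimeError as error:
    print(f"Training stopped early: {error}")

print(fake_failed_run.stopped, fake_new_run.stopped)
```

With real runs, the training loop would take the place of the `raise` statement.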

## Explore the run in the Neptune app

After you run the code cell in **Step 4**, the cell output contains a link similar to https://app.neptune.ai/o/common/org/showroom/e/SHOW-28180 with:
* **common/showroom** replaced by **your_workspace/your_project**,
* **SHOW-28180** replaced by your run ID.

**Click on the link to open the run in the Neptune app.**

## Conclusion
You learned how to:
* Re-open a failed run in order to fetch the metadata needed to re-run it.
* Use fetched metadata to parametrize a new run with the same training loop.

**You can apply the same approach whenever you need to reproduce a run from its logged metadata.**

Visit our docs for more tutorials and guides on how to use Neptune: https://docs.neptune.ai
