
NPT-14525: Neptune reports "step must be strictly increasing" error if lightning logs in training and validation step #1702

Open
NiklasKappel opened this issue Mar 20, 2024 · 5 comments

@NiklasKappel

Describe the bug

Using the Neptune logger in Lightning, I get multiple errors like the following:

[neptune] [error  ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 299.0

The number at the end differs for every error. The errors appear only when there is a log call in BOTH of Lightning's training_step and validation_step methods.

Reproduction

Run the following MWE after setting project and api_key in main. Comment out one of the two self.log calls to see the errors go away.

import lightning as L
import torch
from pytorch_lightning.loggers import NeptuneLogger
from torch.utils.data import DataLoader, Dataset


class MyDataset(Dataset):
    def __init__(self):
        super().__init__()
        self.data = torch.rand(100, 3)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class MyDataModule(L.LightningDataModule):
    def __init__(self, num_workers=7):
        super().__init__()
        self.num_workers = num_workers

    def setup(self, stage=None):
        del stage  # unused
        self.train_dataset = MyDataset()
        self.val_dataset = MyDataset()

    def train_dataloader(self):
        return DataLoader(self.train_dataset, num_workers=self.num_workers)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, num_workers=self.num_workers)


class MyModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(3, 1)

    def forward(self, x):  # type: ignore
        return self.layer(x)

    def training_step(self, batch):  # type: ignore
        loss = self(batch).mean()
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch):  # type: ignore
        loss = self(batch).mean()
        self.log("val_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


def main():
    model = MyModel()
    data_module = MyDataModule()
    neptune_logger = NeptuneLogger(
        project="",
        api_key="",
    )
    trainer = L.Trainer(logger=neptune_logger, log_every_n_steps=1, max_epochs=5)  # type: ignore
    trainer.fit(model, data_module)


if __name__ == "__main__":
    main()

Expected behavior

No errors when logging both train and val metrics.

Environment

The output of pip list:

Package                   Version
------------------------- --------------
arrow                     1.3.0
attrs                     23.2.0
blinker                   1.7.0
boto3                     1.34.65
botocore                  1.34.66
bravado                   11.0.3
bravado-core              5.17.1
Brotli                    1.1.0
cached-property           1.5.2
certifi                   2024.2.2
cffi                      1.16.0
charset-normalizer        3.3.2
click                     8.1.7
colorama                  0.4.6
cryptography              42.0.5
filelock                  3.13.1
fqdn                      1.5.1
fsspec                    2024.3.1
future                    1.0.0
gitdb                     4.0.11
GitPython                 3.1.42
gmpy2                     2.1.2
idna                      3.6
importlib_resources       6.3.1
isoduration               20.11.0
Jinja2                    3.1.3
jmespath                  1.0.1
jsonpointer               2.4
jsonref                   1.1.0
jsonschema                4.21.1
jsonschema-specifications 2023.12.1
lightning                 2.2.1
lightning-utilities       0.11.0
MarkupSafe                2.1.5
monotonic                 1.5
mpmath                    1.3.0
msgpack                   1.0.7
neptune                   1.9.1
networkx                  3.2.1
numpy                     1.26.4
oauthlib                  3.2.2
packaging                 24.0
pandas                    2.2.1
Pillow                    9.4.0
pip                       24.0
pkgutil_resolve_name      1.3.10
psutil                    5.9.8
pycparser                 2.21
PyJWT                     2.8.0
PySocks                   1.7.1
python-dateutil           2.9.0
pytorch-lightning         2.2.1
pytz                      2024.1
PyYAML                    6.0.1
referencing               0.34.0
requests                  2.31.0
requests-oauthlib         1.4.0
rfc3339-validator         0.1.4
rfc3987                   1.3.8
rpds-py                   0.18.0
s3transfer                0.10.1
setuptools                69.2.0
simplejson                3.19.2
six                       1.16.0
smmap                     5.0.0
swagger-spec-validator    3.0.3
sympy                     1.12
torch                     2.2.1
torchaudio                2.2.1
torchmetrics              1.3.2
torchvision               0.17.1
tqdm                      4.66.2
types-python-dateutil     2.9.0.20240316
typing_extensions         4.10.0
tzdata                    2024.1
uri-template              1.3.0
urllib3                   2.2.1
webcolors                 1.13
websocket-client          1.7.0
wheel                     0.42.0
zipp                      3.17.0

The operating system you're using: Linux (Fedora 39)
The output of python --version: Python 3.11.8

@SiddhantSadangi SiddhantSadangi self-assigned this Mar 20, 2024
@SiddhantSadangi SiddhantSadangi changed the title BUG: Neptune reports "step must be strictly increasing" error if lightning logs in training and validation step NPT-14525: Neptune reports "step must be strictly increasing" error if lightning logs in training and validation step Mar 20, 2024
@SiddhantSadangi
Member

Hey @NiklasKappel 👋
This is a known issue and is already in the engineering backlog to fix ✅

Usage-wise, this should not have any impact: the error occurs because Neptune tries to log the epoch number for the same step in both the training and validation loops. Since the value has already been logged once, ignoring the second logging call does not affect the charts.

However, I agree that the error logs you see are a nuisance, especially on longer training jobs where they can quickly flood the console output.
I will keep this thread updated on the status of a permanent fix.
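
For context, the rejection described above can be reproduced with the plain Neptune client alone. A minimal sketch (project and api_token below are placeholders, not real credentials):

import neptune

run = neptune.init_run(project="workspace/project", api_token="...")

# The first point at step 299 is accepted.
run["training/epoch"].append(0, step=299)

# A second point at the same step is rejected during asynchronous processing,
# which produces the "must be strictly increasing" error in the console.
run["training/epoch"].append(0, step=299)

run.stop()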

@rkinas

rkinas commented Jul 11, 2024

Any fix?
I got the same error: [neptune] [error ] Error occurred during asynchronous operation processing

@SiddhantSadangi SiddhantSadangi pinned this issue Jul 11, 2024
@SiddhantSadangi
Member

Hey @rkinas,

This issue is currently in the engineering backlog. However, as mentioned earlier, this error typically does not have any negative impact on logging, so it is not considered a priority at the moment.

Please let me know if this issue is more serious for you.
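
If the duplicate epoch points are a real problem rather than just console noise, one possible workaround (a sketch only, not an official fix; DedupedEpochNeptuneLogger is a hypothetical helper) is to subclass the logger and drop the auto-logged "epoch" entry whenever the global step has not advanced:

from pytorch_lightning.loggers import NeptuneLogger


class DedupedEpochNeptuneLogger(NeptuneLogger):
    # Hypothetical helper: skips re-logging the auto-added "epoch" value
    # when the global step has not advanced since the last logging call.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._last_epoch_step = None

    def log_metrics(self, metrics, step=None):
        if "epoch" in metrics and step is not None and step == self._last_epoch_step:
            metrics = {k: v for k, v in metrics.items() if k != "epoch"}
        elif "epoch" in metrics:
            self._last_epoch_step = step
        super().log_metrics(metrics, step=step)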

@normandy7 normandy7 unpinned this issue Jul 11, 2024
@normandy7 normandy7 pinned this issue Jul 11, 2024
@rkinas

rkinas commented Jul 11, 2024

Yes, I see... but it's a little annoying how such messages appear in the logging console :)

@SiddhantSadangi
Member

If this is crowding your console output, you can silence Neptune messages using Python's logging module:

import logging
logging.getLogger("neptune").setLevel(logging.CRITICAL)

Note that this will silence ALL Neptune messages, including Run initialization and sync status messages.
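
To hide only these specific errors rather than all Neptune output, a logging filter is another option (a sketch, assuming the messages are emitted through the "neptune" logger, as the [neptune] prefix suggests):

import logging


class DropStepErrors(logging.Filter):
    # Hide only the "strictly increasing" series errors; keep other Neptune messages.
    def filter(self, record):
        return "must be strictly increasing" not in record.getMessage()


neptune_log = logging.getLogger("neptune")
neptune_log.addFilter(DropStepErrors())
# Logger-level filters do not apply to child loggers, so also attach the filter
# to any handlers registered on the "neptune" logger.
for handler in neptune_log.handlers:
    handler.addFilter(DropStepErrors())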
