ViT model training notebook data prep fails on Free-IPU-POD4 #54

@nmb-paperspace

If I do

  • Start a Gradient Notebook with the PyTorch on IPU runtime
  • Wait for its log to say Finished running setup.sh
  • Open the notebook vit-model-training/walkthrough.ipynb and do Run All
  • Watch the execution of cell 3 under "Preparing the NIH Chest X-ray Dataset", which unpacks the data by running /tmp/dataset_cache/chest-xray-nihcc-3/unpack-images.sh, a script that runs tar -xf on some .tar.gz files

Then on IPU-POD4, IPU-POD16 and Bow-POD16 it works fine, with that cell finishing in about 10 minutes.

But on Free-IPU-POD4 the cell never finishes executing. If I restart the notebook kernel and do Run All again, the unpacking cell then completes immediately, but a later cell fails with OSError: image file is truncated.
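To see which files the interrupted unpack left incomplete, something like the following could be run against each archive. This is only a sketch: the function name `find_incomplete_files` is made up for illustration, and it assumes the archives are extracted with their member paths unchanged under a single directory, which I have not checked against unpack-images.sh.

```python
import os
import tarfile


def find_incomplete_files(archive_path, extract_dir):
    """List regular files from the archive that are missing or the wrong size.

    Returns (member_name, expected_size, actual_size) tuples, where
    actual_size is None if the file does not exist under extract_dir.
    Assumption: members were extracted with their paths unchanged.
    """
    problems = []
    # "r:*" lets tarfile detect the compression (gzip here) automatically.
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, etc.
            target = os.path.join(extract_dir, member.name)
            actual = os.path.getsize(target) if os.path.isfile(target) else None
            if actual != member.size:
                problems.append((member.name, member.size, actual))
    return problems
```

Calling it with the path of one of the .tar.gz files and the directory the images were unpacked into should return an empty list for a complete extraction. Any entries with a smaller size on disk would be candidates for the truncated images reported by the later cell.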

The image unpacking is done on the CPU, not the IPUs (checked with gc-monitor and top), so something is causing the data handling to depend on whether the machine is free or paid here.
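One way to start narrowing this down might be to record a few basic host facts in a notebook cell on each machine type and compare them. The sketch below is an assumption about what could be worth comparing (CPU count and disk space), not a diagnosis; `host_snapshot` is a name made up for illustration.

```python
import os
import platform
import shutil


def host_snapshot(path="/tmp"):
    """Collect a few host facts that could plausibly affect unpacking.

    Assumption: CPU count and free disk space on the filesystem holding
    `path` are relevant; nothing in the report confirms that they are.
    """
    usage = shutil.disk_usage(path)
    return {
        "machine": platform.node(),
        "cpu_count": os.cpu_count(),
        "disk_total_gb": round(usage.total / 1e9, 1),
        "disk_free_gb": round(usage.free / 1e9, 1),
    }
```

Running `host_snapshot("/tmp/dataset_cache")` before the unpacking cell on Free-IPU-POD4 and on a paid machine would show whether the two environments differ in either respect.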
