Warning while training model with DDP #177

Open
AmmaraRazzaq opened this issue Feb 28, 2022 · 13 comments

@AmmaraRazzaq
Hi
I am getting the following warning when training the model with the FFCV dataloader + DDP.

[W reducer.cpp:362] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.

The same code works fine with the PyTorch dataloader + DDP.

@AmmaraRazzaq
Author

I think this warning was occurring because I was not putting the tensors on the GPU in the image and label pipelines; instead, I was moving them to the GPU in the train and val loops.
However, now only the image tensors are going to the GPU; the label tensors are not.

# Imports assumed for this snippet (FFCV + torch):
from typing import List

import torch as ch
from ffcv.fields.decoders import NDArrayDecoder, SimpleRGBImageDecoder
from ffcv.loader import Loader, OrderOption
from ffcv.pipeline.operation import Operation
from ffcv.transforms import Convert, ToDevice, ToTensor, ToTorchImage

loaders = {}
for name in ['train', 'val']:
    # Decode the label array and move it to the GPU.
    label_pipeline: List[Operation] = [NDArrayDecoder(), ToDevice(ch.device('cuda:0'))]
    # Decode, normalize (Normalize() is a custom transform, not shown),
    # convert to a float tensor, move to the GPU, then rearrange into a torch image.
    image_pipeline: List[Operation] = [SimpleRGBImageDecoder(), Normalize(), ToTensor(), Convert(ch.float32), ToDevice(ch.device('cuda:0')), ToTorchImage()]
    # Create loaders
    loaders[name] = Loader(
        paths[f'{name}_beton_path'],
        batch_size=14,
        num_workers=6,
        order=OrderOption.RANDOM if name == 'train' else OrderOption.SEQUENTIAL,
        # distributed=(name == 'train'),
        # seed=0,
        drop_last=(name == 'train'),
        pipelines={
            'image': image_pipeline,
            'label': label_pipeline
        }
    )

@AmmaraRazzaq
Author

AmmaraRazzaq commented Mar 2, 2022

Resolved: the PyTorch dataset class should be given an array as input, and I was giving a list for the labels.
Even though NDArrayField and NDArrayDecoder() were working fine, no further changes could be made to the labels after decoding.
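For illustration, a minimal sketch of the fix, with hypothetical names and shapes (MyDataset is not from the thread): keep the labels as a fixed-shape NumPy array rather than a Python list before writing the beton file.

import numpy as np

class MyDataset:
    def __init__(self, images, labels):
        self.images = images
        # Store labels as a fixed-shape float32 array, not a Python list,
        # so NDArrayField/NDArrayDecoder round-trip them unchanged.
        self.labels = np.asarray(labels, dtype=np.float32)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # Each label comes back as an np.ndarray, which is what
        # NDArrayField expects when writing the beton file.
        return self.images[idx], self.labels[idx]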

@sachitkuhar

Hi @AmmaraRazzaq
I am facing the same error. I could not understand your last comment. Do you mind sharing it in a bit more detail?
Thanks!

@AmmaraRazzaq AmmaraRazzaq reopened this Mar 3, 2022
@AmmaraRazzaq
Author

AmmaraRazzaq commented Mar 3, 2022

Even after successfully moving the tensors to the GPU, the warning still persists:

[W reducer.cpp:362] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [2304, 576, 1, 1], strides() = [576, 1, 576, 576]
bucket_view.sizes() = [2304, 576, 1, 1], strides() = [576, 1, 1, 1] (function operator())

@AmmaraRazzaq
Author

AmmaraRazzaq commented Mar 3, 2022

Finally figured it out. This warning occurs because ToTorchImage() returns tensors in channels_last memory format. If the input tensor to a model is in channels_last memory format, then the model should support this format as well; otherwise it will give the warning about grad strides not matching. The model can be converted to channels_last as follows:
model = model.to(memory_format=torch.channels_last), as explained here in detail.
OR
the channels_last parameter can be set to False in ToTorchImage(channels_last=False); it will then return the tensor in contiguous memory format, and there is no need to convert the model to channels_last memory format.
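A minimal sketch of the two options (resnet101 stands in for whatever model is being trained):

import torch
import torchvision

from ffcv.transforms import ToTorchImage

model = torchvision.models.resnet101()

# Option 1: convert the model so it matches the channels_last tensors
# that ToTorchImage() produces by default.
model = model.to(memory_format=torch.channels_last)

# Option 2: keep tensors in contiguous memory format instead;
# the model then needs no conversion.
to_image = ToTorchImage(channels_last=False)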

I have found that the contiguous memory format is much faster; with the channels_last memory format, training is slower than with the PyTorch dataloader.
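One way to check which format is faster on a given setup is a rough per-epoch timing like the sketch below (the placeholder loss exists only to drive the backward pass; model and loader are whatever is being benchmarked):

import time
import torch

def time_one_epoch(model, loader):
    # Rough wall-clock time for one pass over the loader.
    torch.cuda.synchronize()
    start = time.time()
    for images, labels in loader:
        out = model(images)
        out.float().mean().backward()  # placeholder loss, not a real objective
        model.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    return time.time() - start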

@GuillaumeLeclerc
Collaborator

@AmmaraRazzaq What GPU are you using? Newer GPUs should be at least 10% faster with channels_last.

@AmmaraRazzaq
Author

Hi @GuillaumeLeclerc, I am using a Tesla V100-SXM2-32GB.

@GuillaumeLeclerc
Collaborator

I have a V100 handy. Do you mind sharing a sample of your code that is faster with channels_last=False so I can investigate?

@AmmaraRazzaq
Author

Hi @GuillaumeLeclerc Thank you for offering to help.
Here is the link to the code https://github.com/AmmaraRazzaq/image_classification/blob/main/sample_code.py

@GuillaumeLeclerc
Collaborator

Sorry for the delay. Can you give me exactly the parameters you are using (and which dataset)? Thank you!

@AmmaraRazzaq
Author

Hi @GuillaumeLeclerc I can't share much detail with you, as this is a research project that is still in the development phase and has not been made open source yet.
Can the parameters, the nature of the dataset, or the model architecture affect the speed of model training?

@GuillaumeLeclerc
Collaborator

There are many very important factors, including:

  • Distribution of image resolutions
  • The amount of raw/JPEG used in the file
  • Amount of compression of the images
  • Shape of your labels
  • etc.

Can you provide a dataset where the images and labels have been replaced by noise?

@AmmaraRazzaq
Author

AmmaraRazzaq commented Mar 25, 2022

Hi @GuillaumeLeclerc Apologies for the late reply.

I am sharing the dataset files and sample code. I am working with the CheXpert dataset; the beton file for all the images is 165 GB, so I have created a smaller beton file with 1000 images (~1.5 GB). Images are resized to 512x512, normalized to the range [-1, 1], and written to the beton file in 'raw' format. It's a multilabel classification problem with 5 labels per image.

Dataset files: https://github.com/AmmaraRazzaq/image_classification/tree/master/betonfiles
code: https://github.com/AmmaraRazzaq/image_classification/blob/master/pyfiles/sample_code.py

I am using the resnet101 architecture with lr=2e-3, bs=24, gpus=4 (DDP training), an SGD optimizer with weight_decay=0 and momentum=0.9, and num_workers=6 in the dataloader.
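For reference, a minimal sketch of that setup; only the hyperparameters above come from the thread, while the loss choice and device placement are assumptions:

import torch
from torchvision.models import resnet101

# ResNet-101 with a 5-way output head for the multilabel problem.
model = resnet101(num_classes=5).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=2e-3,
                            momentum=0.9, weight_decay=0)
criterion = torch.nn.BCEWithLogitsLoss()  # a common choice for multilabel targets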
