Warning while training model with DDP #177

Open
AmmaraRazzaq opened this issue Feb 28, 2022 · 13 comments

@AmmaraRazzaq
Hi
I am getting the following warning when training the model with the FFCV dataloader + DDP.

[W reducer.cpp:362] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.

The same code works fine with the PyTorch dataloader + DDP.

@AmmaraRazzaq
Author

I think this warning was occurring because I was not putting the tensors on the GPU in the image and label pipelines; instead, I was moving them to the GPU in the train and val loops.
However, now only the image tensors are going to the GPU; the label tensors are not.

# Imports assumed for this snippet (FFCV + torch):
from typing import List

import torch as ch
from ffcv.fields.decoders import NDArrayDecoder, SimpleRGBImageDecoder
from ffcv.loader import Loader, OrderOption
from ffcv.pipeline.operation import Operation
from ffcv.transforms import Convert, ToDevice, ToTensor, ToTorchImage

loaders = {}
for name in ['train', 'val']:
    # Decode the label array and move it to the GPU.
    label_pipeline: List[Operation] = [NDArrayDecoder(), ToDevice(ch.device('cuda:0'))]
    # Decode, normalize (Normalize() is a custom transform, not shown),
    # convert to a float tensor, move to the GPU, then rearrange into a torch image.
    image_pipeline: List[Operation] = [SimpleRGBImageDecoder(), Normalize(), ToTensor(), Convert(ch.float32), ToDevice(ch.device('cuda:0')), ToTorchImage()]
    # Create loaders
    loaders[name] = Loader(
        paths[f'{name}_beton_path'],
        batch_size=14,
        num_workers=6,
        order=OrderOption.RANDOM if name == 'train' else OrderOption.SEQUENTIAL,
        # distributed=(name == 'train'),
        # seed=0,
        drop_last=(name == 'train'),
        pipelines={
            'image': image_pipeline,
            'label': label_pipeline
        }
    )

@AmmaraRazzaq
Author

AmmaraRazzaq commented Mar 2, 2022

Resolved: the PyTorch dataset class should be given an array as input, and I was giving a list for the labels.
Even though NDArrayField and NDArrayDecoder() were working fine, no further changes could be made to the labels after decoding.
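For illustration, a minimal sketch of the fix, with hypothetical names and shapes (MyDataset is not from the thread): keep the labels as a fixed-shape NumPy array rather than a Python list before writing the beton file.

import numpy as np

class MyDataset:
    def __init__(self, images, labels):
        self.images = images
        # Store labels as a fixed-shape float32 array, not a Python list,
        # so NDArrayField/NDArrayDecoder round-trip them unchanged.
        self.labels = np.asarray(labels, dtype=np.float32)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # Each label comes back as an np.ndarray, which is what
        # NDArrayField expects when writing the beton file.
        return self.images[idx], self.labels[idx]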

@sachitkuhar

Hi @AmmaraRazzaq
I am facing the same error. I could not understand your last comment. Do you mind sharing it in a bit more detail?
Thanks!

@AmmaraRazzaq AmmaraRazzaq reopened this Mar 3, 2022
@AmmaraRazzaq
Author

AmmaraRazzaq commented Mar 3, 2022

Even after successfully moving the tensors to the GPU, the warning still persists:

[W reducer.cpp:362] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [2304, 576, 1, 1], strides() = [576, 1, 576, 576]
bucket_view.sizes() = [2304, 576, 1, 1], strides() = [576, 1, 1, 1] (function operator())

@AmmaraRazzaq
Author

AmmaraRazzaq commented Mar 3, 2022

Finally figured it out. This warning occurs because ToTorchImage() returns tensors in channels_last memory format. If the input tensor to a model is in channels_last memory format, then the model should support this format as well; otherwise it will give the warning about grad strides not matching. The model can be converted to channels_last as follows:
model = model.to(memory_format=torch.channels_last), as explained here in detail.
OR
the channels_last parameter can be set to False in ToTorchImage(channels_last=False); it will then return the tensor in contiguous memory format, and there is no need to convert the model to channels_last memory format.
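A minimal sketch of the two options (resnet101 stands in for whatever model is being trained):

import torch
import torchvision

from ffcv.transforms import ToTorchImage

model = torchvision.models.resnet101()

# Option 1: convert the model so it matches the channels_last tensors
# that ToTorchImage() produces by default.
model = model.to(memory_format=torch.channels_last)

# Option 2: keep tensors in contiguous memory format instead;
# the model then needs no conversion.
to_image = ToTorchImage(channels_last=False)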

I have found that the contiguous memory format is much faster; with the channels_last memory format, training is slower than with the PyTorch dataloader.
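One way to check which format is faster on a given setup is a rough per-epoch timing like the sketch below (the placeholder loss exists only to drive the backward pass; model and loader are whatever is being benchmarked):

import time
import torch

def time_one_epoch(model, loader):
    # Rough wall-clock time for one pass over the loader.
    torch.cuda.synchronize()
    start = time.time()
    for images, labels in loader:
        out = model(images)
        out.float().mean().backward()  # placeholder loss, not a real objective
        model.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    return time.time() - start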

@GuillaumeLeclerc
Collaborator

@AmmaraRazzaq What GPU are you using? Newer GPUs should be at least 10% faster with channels_last.

@AmmaraRazzaq
Author

Hi @GuillaumeLeclerc, I am using a Tesla V100-SXM2-32GB.

@GuillaumeLeclerc
Collaborator

I have a V100 handy. Do you mind sharing a sample of your code that is faster with channels_last=False so I can investigate?

@AmmaraRazzaq
Author

Hi @GuillaumeLeclerc Thank you for offering to help.
Here is the link to the code https://github.com/AmmaraRazzaq/image_classification/blob/main/sample_code.py

@GuillaumeLeclerc
Collaborator

Sorry for the delay. Can you give me exactly the parameters you are using (and which dataset)? Thank you!

@AmmaraRazzaq
Author

Hi @GuillaumeLeclerc I can't share much detail with you, as this is a research project that is still in the development phase and has not been made open source yet.
Can the parameters, the nature of the dataset, or the model architecture affect the speed of model training?

@GuillaumeLeclerc
Collaborator

There are many very important factors, including:

  • Distribution of image resolutions
  • The amount of raw/JPEG used in the file
  • Amount of compression of the images
  • Shape of your labels
  • etc.

Can you provide a dataset where the images and labels have been replaced by noise?

@AmmaraRazzaq
Author

AmmaraRazzaq commented Mar 25, 2022

Hi @GuillaumeLeclerc Apologies for the late reply.

I am sharing the dataset files and sample code. I am working with the CheXpert dataset; the beton file for all the images is 165 GB, so I have created a smaller beton file with 1000 images (~1.5 GB). Images are resized to 512x512, normalized to the range [-1, 1], and written to the beton file in 'raw' format. It's a multilabel classification problem with 5 labels per image.

Dataset files: https://github.com/AmmaraRazzaq/image_classification/tree/master/betonfiles
code: https://github.com/AmmaraRazzaq/image_classification/blob/master/pyfiles/sample_code.py

I am using the resnet101 architecture with lr=2e-3, bs=24, gpus=4 (DDP training), an SGD optimizer with weight_decay=0 and momentum=0.9, and num_workers=6 in the dataloader.
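For reference, a minimal sketch of that setup; only the hyperparameters above come from the thread, while the loss choice and device placement are assumptions:

import torch
from torchvision.models import resnet101

# ResNet-101 with a 5-way output head for the multilabel problem.
model = resnet101(num_classes=5).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=2e-3,
                            momentum=0.9, weight_decay=0)
criterion = torch.nn.BCEWithLogitsLoss()  # a common choice for multilabel targets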
