Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataModules: pass kwargs directly to datasets #730

Merged
merged 9 commits into from
Oct 1, 2022
Merged

Conversation

adamjstewart
Copy link
Collaborator

@adamjstewart adamjstewart commented Aug 19, 2022

Closes #666

Pros

  • Consistency: every datamodule now has the same kwargs as its corresponding dataset
  • New features: this means that every datamodule now supports automatic downloads if its corresponding dataset does too
  • No unused kwargs: previously, a typo in an argument or non-existent argument would go unnoticed, now it will raise an error
  • Reduced code duplication: -140 lines of code

Cons

  • Backwards incompatibility, root_dir has been renamed to root for consistency
  • Can no longer pass root_dir as the first positional arg, must be a keyword arg

To elaborate, the following are no longer valid:

BigEarthNetDataModule('data/bigearthnet')
BigEarthNetDataModule(root_dir='data/bigearthnet')

Instead, users will need to use:

BigEarthNetDataModule(root='data/bigearthnet')

@adamjstewart adamjstewart added the backwards-incompatible Changes that are not backwards compatible label Aug 19, 2022
@adamjstewart adamjstewart added this to the 0.4.0 milestone Aug 19, 2022
@github-actions github-actions bot added datamodules PyTorch Lightning datamodules testing Continuous integration testing documentation Improvements or additions to documentation labels Aug 19, 2022
@@ -18,8 +18,8 @@ experiment:
num_filters: 256
ignore_index: 0
datamodule:
root_dir: "data/oscd"
batch_size: 32
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OSCD data module has never had a batch_size parameter. The correct name is train_batch_size. This parameter was being ignored since we never actually used kwargs before.

batch_size: int = 64,
num_workers: int = 0,
bands: str = "rgb",
band_set: str = "rgb",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed from bands to band_set to match SEN12MSDataModule and to remove ambiguity between So2SatDataModule and So2Sat where bands is a list of band names.

@adamjstewart adamjstewart changed the title Datamodules: pass kwargs directly to datasets DataModules: pass kwargs directly to datasets Aug 22, 2022
isaaccorley
isaaccorley previously approved these changes Aug 31, 2022
Copy link
Collaborator

@isaaccorley isaaccorley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find any issues. lgtm this one was much needed.

@adamjstewart
Copy link
Collaborator Author

Will give @calebrob6 a chance to review, no rush on this PR since it's going in 0.4.0.

@calebrob6
Copy link
Member

Agree that this was much needed -- I just don't like how it isn't obvious from the DataModule constructors that you need to pass a root parameter. Since every dataset will need a root parameter can't we just keep it in?

@adamjstewart
Copy link
Collaborator Author

Since every dataset will need a root parameter

In theory, we could someday have datasets without a root parameter (streaming data from Azure/AWS for example). But yes, all of our current datasets have one.

I just don't like how it isn't obvious from the DataModule constructors that you need to pass a root parameter... can't we just keep it in?

We could keep it in and forward it like we used to. Or we could properly document the available kwargs, or the fact that kwargs will be forwarded. Or both.

@calebrob6
Copy link
Member

Properly document the available kwargs seems best to me

@adamjstewart
Copy link
Collaborator Author

@calebrob6 check out the docs for the latest version and see if that's better. The alternative would be to make a formal doc section like:

Initialize a LightningDataModule for BigEarthNet based DataLoaders.           
                                                                                         
Args:                                                                            
    batch_size: The batch size to use in all created DataLoaders                 
    num_workers: The number of workers to use in all created DataLoaders         

Keyword Args:
    root: root directory where dataset can be found                              
    split: train/val/test split to load                                          
    bands: load Sentinel-1 bands, Sentinel-2, or both. one of {s1, s2, all}      
    num_classes: number of classes to load in target. one of {19, 43}            
    transforms: a function/transform that takes input sample and its target as   
        entry and returns a transformed version                                  
    download: if True, download dataset and store it in the root directory       
    checksum: if True, check the MD5 of the downloaded files (may be slow)

but that ends up with a lot of duplication and becomes harder to maintain.

@adamjstewart
Copy link
Collaborator Author

Closing/reopening to try to kick off CLA bot.

@adamjstewart
Copy link
Collaborator Author

@microsoft-github-policy-service agree

@calebrob6
Copy link
Member

This looks like a good compromise on documentation. What if we specially check for "root" in kwargs and raise a helpful error message if it isn't included? My worry is that if a user doesn't add root they'll get an error message from the Dataset object which might be confusing.

@calebrob6
Copy link
Member

(or warning)

@adamjstewart
Copy link
Collaborator Author

The stack trace will tell them the full path the code took to raise that error (not necessarily clear to a non-programmer though). root isn't required, all datasets have a default root. If anyone ever implements #660 the error message will be a bit more helpful for datasets that don't already explain what root is set to.

@calebrob6
Copy link
Member

calebrob6 commented Sep 30, 2022 via email

@adamjstewart
Copy link
Collaborator Author

At the moment this is every dataset.

@adamjstewart adamjstewart merged commit 6f9589d into main Oct 1, 2022
@adamjstewart adamjstewart deleted the datamodules/kwargs branch October 1, 2022 21:33
yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023
* Datamodules: pass kwargs directly to datasets

* Rename root_dir -> root in config files

* Fix datamodule tests

* Fix mypy

* Fix tutorial

* Specify all kwarg keys

* Fix bands vs. band_set

* root_dir -> root

* Document **kwargs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backwards-incompatible Changes that are not backwards compatible datamodules PyTorch Lightning datamodules documentation Improvements or additions to documentation testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pass datamodule kwargs to datasets
3 participants