Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDL/NLCD/SSL4EO: allow selection of classes #1392

Merged
merged 12 commits into from
Jun 4, 2023

Conversation

adamjstewart
Copy link
Collaborator

@adamjstewart adamjstewart commented Jun 3, 2023

There are too many CDL classes. #1389 revealed that the top 3 classes cover more area than the bottom 130, and only 17 classes occupy more than 1% of land. We need a way to specify a smaller set of classes in order for benchmarking to be computationally tractable.

This PR introduces a classes parameter to all 3 datasets that lets users specify which classes they actually care about. In combination with the class_weights and ignore_index parameters of SemanticSegmentationTask, this will allow us to weight these remaining classes to fight class imbalance.

@nilsleh apologies, but this undoes a lot of your hard work and changes the cmaps to their previous values (albeit without unused classes). I only did NLCD to start, but see what you think of this implementation and we can update CDL and SSL4EO-L Benchmark to match. I also haven't tested any of this so it's very likely there are bugs. We should test both the getitem and plot methods. I also don't know what will happen if the user doesn't include 0 in classes.

All of the following plots have been validated in QGIS.

NLCD

All classes:

nlcd_all_classes
ssl4eo_nlcd_all

Few classes:

nlcd_fewer_classes
ssl4eo_nlcd_few

CDL

All classes:

cdl_all_classes
ssl4eo_cdl_all

Few classes:

cdl_few_classes
ssl4eo_cdl_few

@adamjstewart adamjstewart requested a review from nilsleh June 3, 2023 04:28
@adamjstewart adamjstewart added this to In progress in SSL4EO-L via automation Jun 3, 2023
@adamjstewart adamjstewart marked this pull request as draft June 3, 2023 04:28
@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Jun 3, 2023
@github-actions github-actions bot added the testing Continuous integration testing label Jun 3, 2023
@adamjstewart adamjstewart marked this pull request as ready for review June 3, 2023 22:53
@calebrob6
Copy link
Member

calebrob6 commented Jun 3, 2023

There are too many CDL classes. #1389 revealed that the top 3 classes cover more area than the bottom 130, and only 17 classes occupy more than 1% of land. We need a way to specify a smaller set of classes in order for benchmarking to be computationally tractable.

I don't understand this motivation:

  • the number of CDL classes is fixed, there aren't "too many". Long-tailed class distributions are something that needs to be dealt with in modeling.
  • <256 classes is definitely not computationally intractable (e.g. imagenet has 1,000)
  • Regardless, if a user doesn't want some classes, then they can remap them to 0 in a transform (see below). I don't see this as a feature that torchgeo needs for one particular dataset.
def preprocess(sample):
  mask = sample["mask"]
  mask[mask == 42] = 0
  sample["mask"] = mask
  return sample

Do you mind expanding on your thinking here?

@adamjstewart
Copy link
Collaborator Author

adamjstewart commented Jun 4, 2023

<256 classes is definitely not computationally intractable (e.g. imagenet has 1,000)

It isn't for classification, but it is for semantic segmentation. Our CDL benchmarks take 60x longer than our NLCD benchmarks. At the moment, it's basically impossible for us to finish all CDL benchmarks before the deadline. This is what we decided in the last meeting.

@calebrob6
Copy link
Member

I can't reproduce this behavior

image

@calebrob6
Copy link
Member

It is identical with 20 and 130 output classes, the last classification layer should not be a large fraction of the computation done in a UNet.

@nilsleh
Copy link
Collaborator

nilsleh commented Jun 4, 2023

I think the story changes when you also include the backward pass, that is also what we found when using the pytorch Profiler.
Screenshot from 2023-06-04 14-58-01

@calebrob6
Copy link
Member

There's some difference but not double

image

@adamjstewart
Copy link
Collaborator Author

@yichiac tried training on our cluster and each epoch took 60x longer for CDL than NLCD. @nilsleh tried on his cluster and it was 10x. @calebrob6 if you want to train on your GPUs that's fine, but it's not feasible on our systems. This is what we spent a large portion of the last meeting discussing and this is the decision that we made. We've been trying to figure this out for days and no one has solved it and we're rapidly running out of time to start benchmarking. We have 65 CDL benchmarks to fill in. If each CDL benchmark takes > 24 hrs to complete, we're going to need quite a lot of GPUs...

@calebrob6
Copy link
Member

That's fine, but I don't think it needs to be baked into the library is what I'm saying.

@calebrob6 calebrob6 merged commit 9e57f27 into microsoft:main Jun 4, 2023
21 checks passed
SSL4EO-L automation moved this from In progress to Done Jun 4, 2023
@adamjstewart adamjstewart deleted the datasets/cdl-nlcd-classes branch June 4, 2023 15:21
@adamjstewart
Copy link
Collaborator Author

Writing transforms is a little painful at the moment but we can think about moving this into a datamodule transform before the next release.

@calebrob6
Copy link
Member

Yes, it will take more memory, but the computation cost is linear in the number of classes.

See how the linear function fit on "time for inference + backprop" from 20 to 500 classes predicts the "time for inference + backprop" for 1000 classes:

image

@adamjstewart adamjstewart added this to the 0.5.0 milestone Sep 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets testing Continuous integration testing
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

3 participants