Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes VectorDataset rounding bug causing sample mask size mismatch #675

Merged

Conversation

TCherici
Copy link
Contributor

@TCherici TCherici commented Jul 14, 2022

I've had to adjust the test to also account for the extra pixel generated, which seems right to me (1.9-1.1 = 0.8 which would be 8 pixels with a res=0.1).

Fixes #674

@ghost
Copy link

ghost commented Jul 14, 2022

CLA assistant check
All CLA requirements met.

@github-actions github-actions bot added datasets Geospatial or benchmark datasets testing Continuous integration testing labels Jul 14, 2022
@adamjstewart adamjstewart added this to the 0.3.1 milestone Jul 14, 2022
adamjstewart
adamjstewart previously approved these changes Jul 14, 2022
Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can confirm that this fixes the issue, thanks for the PR!

@@ -611,12 +611,12 @@ def __getitem__(self, query: BoundingBox) -> Dict[str, Any]:
)
if shapes:
masks = rasterio.features.rasterize(
shapes, out_shape=(int(height), int(width)), transform=transform
shapes, out_shape=(int(round(height)), int(round(width))), transform=transform
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is too long now. Can you run black torchgeo/datasets/geo.py to fix this?

Copy link
Collaborator

@ashnair1 ashnair1 Jul 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to cast it to int again if you're not specifying the number of decimals to round to.

@adamjstewart
Copy link
Collaborator

There are several other places where we aren't rounding before casting to an int. I'm not worried about the str -> int conversions, but some of these float -> int conversions could be problematic. Not sure if integer division includes rounding or not. I wonder if any of these should also be fixed:

$ grep -PR '\bint\((?!round)'
tests/models/test_changestar.py:        assert y[0].shape == (3, 1, int(32 * sf), int(32 * sf))
tests/samplers/test_single.py:        rows = int((100 - sampler.size[0]) // sampler.stride[0]) + 1
tests/samplers/test_single.py:        cols = int((100 - sampler.size[1]) // sampler.stride[1]) + 1
tests/data/forestdamage/data.py:        ET.SubElement(bbox, "xmin").text = str(0 + int(SIZE / 4))
tests/data/forestdamage/data.py:        ET.SubElement(bbox, "ymin").text = str(0 + int(SIZE / 4))
tests/data/forestdamage/data.py:        ET.SubElement(bbox, "xmax").text = str(SIZE - int(SIZE / 4))
tests/data/forestdamage/data.py:        ET.SubElement(bbox, "ymax").text = str(SIZE - int(SIZE / 4))
torchgeo/datasets/forestdamage.py:            int(bndbox.find("xmin").text),  # type: ignore[union-attr, arg-type]
torchgeo/datasets/forestdamage.py:            int(bndbox.find("ymin").text),  # type: ignore[union-attr, arg-type]
torchgeo/datasets/forestdamage.py:            int(bndbox.find("xmax").text),  # type: ignore[union-attr, arg-type]
torchgeo/datasets/forestdamage.py:            int(bndbox.find("ymax").text),  # type: ignore[union-attr, arg-type]
torchgeo/datasets/so2sat.py:            self.size = int(f["label"].shape[0])
torchgeo/datasets/spacenet.py:            [int(math.ceil(s / bin_size_mph)) for s in speed_arr_bin]
torchgeo/datasets/spacenet.py:                            int(feature["properties"]["inferred_speed_mph"]) - 1
torchgeo/datasets/gbif.py:        mint = datetime(int(year), 1, 1)
torchgeo/datasets/gbif.py:        maxt = datetime(int(year) + 1, 1, 1)
torchgeo/datasets/gbif.py:        mint = datetime(int(year), int(month), 1)
torchgeo/datasets/gbif.py:            maxt = datetime(int(year) + 1, 1, 1)
torchgeo/datasets/gbif.py:            maxt = datetime(int(year), int(month) + 1, 1)
torchgeo/datasets/gbif.py:        mint = datetime(int(year), int(month), int(day))
torchgeo/datasets/cowc.py:        target = int(self.targets[index])
torchgeo/datasets/idtrees.py:                [self.idx2class[int(i)] for i in sample["label"]]
torchgeo/datasets/idtrees.py:                [self.idx2class[int(i)] for i in sample["prediction_label"]]
torchgeo/datasets/openbuildings.py:                shapes, out_shape=(int(height), int(width)), transform=transform
torchgeo/datasets/openbuildings.py:            masks = torch.zeros(size=(1, int(height), int(width)))
torchgeo/datasets/cv4a_kenya_crop_type.py:                train_field_ids.append(int(row[0]))
torchgeo/datasets/cv4a_kenya_crop_type.py:                    test_field_ids.append(int(row[1]))
torchgeo/datasets/cyclone.py:        features["relative_time"] = int(features["relative_time"])
torchgeo/datasets/cyclone.py:        features["ocean"] = int(features["ocean"])
torchgeo/datasets/cyclone.py:        features["label"] = int(features["wind_speed"])
torchgeo/datamodules/etci2021.py:        size_train = int(0.8 * size_train_val)
torchgeo/datamodules/chesapeake.py:        self.original_patch_size = int(patch_size * 2.0)
torchgeo/datamodules/sen12ms.py:            scene_id = int(parts[3])
torchgeo/datamodules/utils.py:        val_length = int(len(dataset) * val_pct)
torchgeo/datamodules/utils.py:        val_length = int(len(dataset) * val_pct)
torchgeo/datamodules/utils.py:        test_length = int(len(dataset) * test_pct)
torchgeo/models/farseg.py:            num_upsample = int(math.log2(int(in_feat_os))) - int(
torchgeo/models/farseg.py:                math.log2(int(out_feature_output_stride))
torchgeo/samplers/single.py:            rows = int((bounds.maxy - bounds.miny - self.size[0]) // self.stride[0]) + 1
torchgeo/samplers/single.py:            cols = int((bounds.maxx - bounds.minx - self.size[1]) // self.stride[1]) + 1
torchgeo/samplers/single.py:            rows = int((bounds.maxy - bounds.miny - self.size[0]) // self.stride[0]) + 1
torchgeo/samplers/single.py:            cols = int((bounds.maxx - bounds.minx - self.size[1]) // self.stride[1]) + 1

)
else:
# If no features are found in this query, return an empty mask
# with the default fill value and dtype used by rasterize
masks = np.zeros((int(height), int(width)), dtype=np.uint8)
masks = np.zeros((int(round(height)), int(round(width))), dtype=np.uint8)
Copy link
Collaborator

@ashnair1 ashnair1 Jul 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above. Casting to int not required.

@ashnair1
Copy link
Collaborator

ashnair1 commented Jul 16, 2022

There are several other places where we aren't rounding before casting to an int. I'm not worried about the str -> int conversions, but some of these float -> int conversions could be problematic. Not sure if integer division includes rounding or not. I wonder if any of these should also be fixed:

I had encountered a related issue with dataset_split in torchgeo/datamodules/utils.py. Direct int casting basically strips out the decimal part so it's a problem if len(dataset) * val_pct < 1. I had encountered this issue while testing coverage for a datamodule. The test dataset in question had len(dataset) = 3 and val_pct = 0.3.

Fix using round:

    if test_pct is None:
-        val_length = int(len(dataset) * val_pct)
+        val_length = round(len(dataset) * val_pct)
        train_length = len(dataset) - val_length
        return random_split(dataset, [train_length, val_length])
    else:
-       val_length = int(len(dataset) * val_pct)
-       test_length = int(len(dataset) * test_pct)
+       val_length = round(len(dataset) * val_pct)
+       test_length = round(len(dataset) * test_pct)

@TCherici
Copy link
Contributor Author

TCherici commented Jul 18, 2022

I had encountered a related issue with dataset_split in torchgeo/datamodules/utils.py. Direct int casting basically strips out the decimal part so it's a problem if len(dataset) * val_pct < 0. I had encountered this issue while testing coverage for a datamodule. The test dataset in question had len(dataset) = 3 and val_pct = 0.3.

I don't understand how len(dataset) * val_pct could be less than 0 for the values you gave, and I'm not sure if this is related to the bugfix in this PR. @ashnair1 If you think it is I will add the fix you gave as a commit to this PR.

@ashnair1
Copy link
Collaborator

ashnair1 commented Jul 18, 2022

I had encountered a related issue with dataset_split in torchgeo/datamodules/utils.py. Direct int casting basically strips out the decimal part so it's a problem if len(dataset) * val_pct < 0. I had encountered this issue while testing coverage for a datamodule. The test dataset in question had len(dataset) = 3 and val_pct = 0.3.

I don't understand how len(dataset) * val_pct could be less than 0 for the values you gave, and I'm not sure if this is related to the bugfix in this PR. @ashnair1 If you think it is I will add the fix you gave as a commit to this PR.

My bad. Meant less than 1. The comment is unrelated to this PR. I was just replying to @adamjstewart's comment on other places where round vs int errors can occur.

Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can fix other instances of this bug in a separate PR.

@adamjstewart adamjstewart merged commit 67b8b30 into microsoft:main Jul 18, 2022
@TCherici TCherici deleted the bugfix/vector_dataset_sample_rounding branch July 19, 2022 09:15
adamjstewart pushed a commit that referenced this pull request Sep 3, 2022
)

* Fixes VectorDataset rounding bug causing sample mask size mismatch

* removes unnecessary casting to int
yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023
…icrosoft#675)

* Fixes VectorDataset rounding bug causing sample mask size mismatch

* removes unnecessary casting to int
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

VectorDataset incorrect sample shape
3 participants