
Sample from a specific label #522

Open
brunovollmer opened this issue Oct 20, 2021 · 13 comments
Labels
ENHANCE Enhancement of existing features

Comments

@brunovollmer

Hey everybody,

I was wondering if there is a way to sample from a specific label. My situation is that I have a dataset where one class is heavily over-represented, and I would like an option to sample only from this label while keeping everything else the same.

Thanks in advance,

Bruno

@zhiltsov-max
Contributor

Hi, could you please describe more precisely the operations you're trying to do and the expected results? What do you mean by "sample" exactly? I can suggest looking into filtering with a custom filter expression (label == class, label != class), the NDR transform, or splitting with the task-specific splitters.
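
For illustration, a minimal sketch of the filter-expression route, assuming datumaro 0.2's XPath-like filter API (the path and format name are placeholders):

from datumaro.components.dataset import Dataset

dataset = Dataset.import_from('path/', 'format_name')  # placeholder path/format

# Keep only annotations whose label is not 'person';
# items left without any annotations are dropped entirely.
dataset.filter('/item/annotation[label!="person"]',
    filter_annotations=True, remove_empty=True)

dataset.export('filtered/', 'format_name', save_images=True)  # placeholder output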

@brunovollmer
Author

Hey @zhiltsov-max,

So the distribution of my dataset is the following:

Label distribution:
* bench: 418366 -- 2.5%
* bicycle: 215504 -- 1.3%
* bus: 144856 -- 0.9%
* car: 1478262 -- 8.9%
* chair: 2030188 -- 12.2%
* dog: 74188 -- 0.4%
* laptop: 92986 -- 0.6%
* person: 12008280 -- 72.4%
* phone: 121542 -- 0.7%

As you can see, the person class is heavily over-represented. What I would like is an operation where datumaro randomly picks a certain percentage of annotations/images from a class (in my case, person) and removes the rest. The result should then contain the same number of bboxes for every other class and the reduced amount for the picked class.

@zhiltsov-max
Contributor

zhiltsov-max commented Oct 20, 2021

Probably, there is no ready-to-use solution for this right now. AFAIK, datasets are typically formed to have a nearly equal class distribution, or they need to be split into subsets with the initial distribution preserved. The latter case is already covered by the task-oriented splitters. You could probably use a simple script to get what you want:

from datumaro.components.dataset import Dataset
from datumaro.components.extractor import Transform, AnnotationType

class Sampler(Transform):
    def __iter__(self):
        person_anns = 0
        required_quantity = 10000  # keep at most this many 'person' annotations
        person_label_idx = self._extractor.categories()[AnnotationType.label].find('person')[0]

        for item in self._extractor:
            new_anns = []
            for ann in item.annotations:
                # Count 'person' annotations and drop any beyond the quota
                if hasattr(ann, 'label') and ann.label == person_label_idx:
                    if person_anns >= required_quantity:
                        continue
                    else:
                        person_anns += 1

                new_anns.append(ann)
            # Skip items that end up with no annotations
            if new_anns:
                yield item.wrap(annotations=new_anns)

dataset = Dataset.import_from('path/', 'format_name')
dataset.transform(Sampler)
dataset.export('new_path/', 'format', save_images=True)

@brunovollmer
Author

Thanks for the extensive reply. I'm trying to integrate your code, but I'm having some problems with the Python version of datumaro. Due to some previous operations I have a datumaro project that contains the data (annotations + images) I want. Unfortunately I can't load it as a dataset (error: no data of format coco at this path), and while I can load it as a datumaro Project with Project.load(path), I then can't convert the project to a dataset because the project does not have an attribute called working_tree.

I guess I'm doing something wrong. Any suggestions?

@brunovollmer
Author

And my second question: is there an easy way to pass variables to the Sampler class from your example?

@zhiltsov-max
Contributor

zhiltsov-max commented Oct 21, 2021

Thanks for the extensive reply. I'm trying to integrate your code, but I'm having some problems with the Python version of datumaro. Due to some previous operations I have a datumaro project that contains the data (annotations + images) I want. Unfortunately I can't load it as a dataset (error: no data of format coco at this path), and while I can load it as a datumaro Project with Project.load(path), I then can't convert the project to a dataset because the project does not have an attribute called working_tree.

I suppose you're using an outdated version of the library. The latest is 0.2; which version are you on? You can check with datum --version. In previous versions it was:

from datumaro.components.project import Project
from datumaro.components.dataset import Dataset

project = Project.load(path)
dataset = project.make_dataset()
...
dataset = Dataset.from_extractors(dataset.transform(...))

And my second question: is there an easy way to pass variables to the Sampler class from your example?

class Sampler(Transform):
    def __init__(self, extractor, option1='foo', option2=42):
        super().__init__(extractor)
        self._option1 = option1
        self._option2 = option2

...

dataset.transform(Sampler, option1='bar', option2=36)

@brunovollmer
Author

brunovollmer commented Oct 27, 2021

import random

from datumaro.components.dataset import Dataset
from datumaro.components.extractor import Transform, AnnotationType
from datumaro.components.project import Project

class Sampler(Transform):
    def __init__(self, extractor, label=None, number=None):
        super().__init__(extractor)
        self._label = label
        self._number = number

    def __iter__(self):
        anns = 0
        label_idx = self._extractor.categories()[AnnotationType.label].find(self._label)[0]

        # Shuffle the items so the kept annotations are drawn randomly
        items = list(self._extractor)
        items = random.sample(items, len(items))

        for item in items:
            new_anns = []
            for ann in item.annotations:
                if hasattr(ann, 'label') and ann.label == label_idx:
                    if anns >= self._number:
                        continue
                    else:
                        anns += 1

                new_anns.append(ann)
            if new_anns:
                yield item.wrap(annotations=new_anns)

def main(args):
    project = Project.load(args.input)
    dataset = project.make_dataset()
    dataset = Dataset.from_extractors(dataset.transform(Sampler, label=args.label, number=args.number))
    dataset.export(args.output, 'yolo', save_images=False)

if __name__ == '__main__':
    main(parse_args())  # parse_args() is defined elsewhere in the script

This is the current version of my Sampler. When I run it I receive this error:

Traceback (most recent call last):
  File "datumaro_sampler.py", line 59, in <module>
    main(parse_args())
  File "datumaro_sampler.py", line 56, in main
    dataset.export(args.output, 'yolo', save_images=False)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/datumaro/util/__init__.py", line 203, in wrapped_func
    func(*args, **kwargs)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/datumaro/components/dataset.py", line 774, in export
    converter.convert(self, save_dir=save_dir, **kwargs)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/datumaro/components/converter.py", line 33, in convert
    return converter.apply()
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/datumaro/plugins/yolo_format/converter.py", line 92, in apply
    with open(annotation_path, 'w', encoding='utf-8') as f:
OSError: [Errno 9] Bad file descriptor: '/home/azureuser/cloudfiles/code/Users/mael/datasets/Objects365/objects_365_soundmap/objects_365_soundmap-map_subsets/sampled/obj_valid_data/images/train2017/objects365_v2_01826801.txt'

@zhiltsov-max
Contributor

Hi, the script looks correct. From the error message I can see that you're probably using a mounted directory to work with the Azure cloud, is that correct? I think the error can be related to this: maybe there were too many I/O requests or something similar. Can you share some details about the drive mounting options (without personal data, of course)? We haven't tested such a scenario yet, so I can suggest trying to export to a local filesystem and then copying the result to the cloud manually.

@zhiltsov-max
Contributor

Hi, please check if #640 is useful for you.

@tdhooghe

Hi, just wanted to shed some light on the code above. I tried it and noticed that it stops keeping labels of a given class once a certain threshold is reached. I don't think this leads to the desired behavior, as it removes labels from images that contain other labels as well. Hence, we are left with images that are unlabeled w.r.t. a given class even though the label should be there. Therefore, the model will be trained on images with missing labels and might be unfairly penalized by the loss function.

Do you agree, or am I missing something here?

@zhiltsov-max
Contributor

Hi, yes, it is true. The solution can produce under-annotated images when there is more than one annotation per image. In the referenced PR (#640), we took a different approach, which works on the image level and doesn't have this problem. It may produce a different distribution of annotations than requested, depending on the data available, but the resulting images will contain all of their annotations.

@tdhooghe

tdhooghe commented Jul 16, 2022

Thank you very much for letting me know! Could you maybe provide an example of how I can use this class with my dataset via the API approach?

I am using the following line, and it does not seem to work, as the number of 'person' labels stays the same:
seed = 1234
sampled_coco = coco_dataset.transform(LabelRandomSampler, label_counts={'person': 100000}, seed=seed)

Also, could you explain what the count argument is for?

@zhiltsov-max
Contributor

@tdhooghe, you can find the parameter descriptions here and API usage examples here.

Basically, the count parameter is applied to all classes, while counts for specific classes can be set explicitly with label_counts.
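
For illustration, a minimal sketch of the API usage, assuming the LabelRandomSampler transform from #640 lives in datumaro.plugins.sampler.random_sampler (the module path, input path, and format are placeholders to check against your version):

from datumaro.components.dataset import Dataset
from datumaro.plugins.sampler.random_sampler import LabelRandomSampler  # module path may differ by version

dataset = Dataset.import_from('path/to/coco', 'coco')  # placeholder path/format

# 'count' is the default target per label; 'label_counts' overrides it for
# specific labels. The sampler works on the image level, so the resulting
# counts are approximate (see the note above).
dataset.transform(LabelRandomSampler, count=1000,
    label_counts={'person': 100000}, seed=1234)

dataset.export('sampled/', 'coco', save_images=True)  # placeholder output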
