
Sample from a specific label #522

Open
brunovollmer opened this issue Oct 20, 2021 · 13 comments
Labels
ENHANCE Enhancement of existing features

Comments

@brunovollmer

Hey everybody,

I was wondering if there is a way to sample from a specific label. My situation is that I have a dataset where one class is heavily over-represented, and I would like an option to sample only from this label while keeping everything else the same.

Thanks in advance,

Bruno

@zhiltsov-max
Contributor

Hi, could you please describe more precisely the operations you're trying to do and the expected results? What do you mean by "sample" exactly? I can suggest looking into filtering with a custom filter expression (label == class, label != class), the NDR transform, or splitting with the task-specific splitters.
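
For illustration, a minimal sketch of the filter-expression route, assuming datumaro 0.2's XPath-like filter API (the path and format name are placeholders):

from datumaro.components.dataset import Dataset

dataset = Dataset.import_from('path/', 'format_name')  # placeholder path/format

# Keep only annotations whose label is not 'person';
# items left without any annotations are dropped entirely.
dataset.filter('/item/annotation[label!="person"]',
    filter_annotations=True, remove_empty=True)

dataset.export('filtered/', 'format_name', save_images=True)  # placeholder output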

@brunovollmer
Author

Hey @zhiltsov-max,

So the distribution of my dataset is the following:

Label distribution:
* bench: 418366 -- 2.5%
* bicycle: 215504 -- 1.3%
* bus: 144856 -- 0.9%
* car: 1478262 -- 8.9%
* chair: 2030188 -- 12.2%
* dog: 74188 -- 0.4%
* laptop: 92986 -- 0.6%
* person: 12008280 -- 72.4%
* phone: 121542 -- 0.7%

As you can see, the person class is heavily over-represented. What I would like is an operation where datumaro randomly picks a certain percentage of annotations/images from a class (in my case, person) and removes the rest. The result should then contain the same number of bboxes for every other class and the reduced amount for the picked class.

@zhiltsov-max
Contributor

zhiltsov-max commented Oct 20, 2021

Probably, there is no ready-to-use solution for this right now. AFAIK, datasets are typically formed to have a nearly equal class distribution, or they need to be split into subsets with the initial distribution preserved. The latter case is already covered by the task-oriented splitters. You could probably use a simple script to get what you want:

from datumaro.components.dataset import Dataset
from datumaro.components.extractor import Transform, AnnotationType

class Sampler(Transform):
    def __iter__(self):
        person_anns = 0
        required_quantity = 10000  # keep at most this many 'person' annotations
        person_label_idx = self._extractor.categories()[AnnotationType.label].find('person')[0]

        for item in self._extractor:
            new_anns = []
            for ann in item.annotations:
                # Count 'person' annotations and drop any beyond the quota
                if hasattr(ann, 'label') and ann.label == person_label_idx:
                    if person_anns >= required_quantity:
                        continue
                    else:
                        person_anns += 1

                new_anns.append(ann)
            # Skip items that end up with no annotations
            if new_anns:
                yield item.wrap(annotations=new_anns)

dataset = Dataset.import_from('path/', 'format_name')
dataset.transform(Sampler)
dataset.export('new_path/', 'format', save_images=True)

@brunovollmer
Author

Thanks for the extensive reply. I'm trying to integrate your code, but I'm having some problems with the Python version of datumaro. Due to some previous operations I have a datumaro project that contains the data (annotations + images) I want. Unfortunately I can't load it as a dataset (error: no data of format coco at this path), and while I can load it as a datumaro Project with Project.load(path), I then can't convert the project to a dataset because the project does not have an attribute called working_tree.

I guess I'm doing something wrong. Any suggestions?

@brunovollmer
Author

And my second question: is there an easy way to pass variables to the Sampler class from your example?

@zhiltsov-max
Contributor

zhiltsov-max commented Oct 21, 2021

Thanks for the extensive reply. I'm trying to integrate your code, but I'm having some problems with the Python version of datumaro. Due to some previous operations I have a datumaro project that contains the data (annotations + images) I want. Unfortunately I can't load it as a dataset (error: no data of format coco at this path), and while I can load it as a datumaro Project with Project.load(path), I then can't convert the project to a dataset because the project does not have an attribute called working_tree.

I suppose you're using an outdated version of the library. The latest is 0.2; which version are you on? You can check with datum --version. In previous versions it was:

from datumaro.components.project import Project
from datumaro.components.dataset import Dataset

project = Project.load(path)
dataset = project.make_dataset()
...
dataset = Dataset.from_extractors(dataset.transform(...))

And my second question: is there an easy way to pass variables to the Sampler class from your example?

class Sampler(Transform):
    def __init__(self, extractor, option1='foo', option2=42):
        super().__init__(extractor)
        self._option1 = option1
        self._option2 = option2

...

dataset.transform(Sampler, option1='bar', option2=36)

@brunovollmer
Author

brunovollmer commented Oct 27, 2021

import random

from datumaro.components.dataset import Dataset
from datumaro.components.extractor import Transform, AnnotationType
from datumaro.components.project import Project

class Sampler(Transform):
    def __init__(self, extractor, label=None, number=None):
        super().__init__(extractor)
        self._label = label
        self._number = number

    def __iter__(self):
        anns = 0
        label_idx = self._extractor.categories()[AnnotationType.label].find(self._label)[0]

        # Shuffle the items so the kept annotations are drawn randomly
        items = list(self._extractor)
        items = random.sample(items, len(items))

        for item in items:
            new_anns = []
            for ann in item.annotations:
                if hasattr(ann, 'label') and ann.label == label_idx:
                    if anns >= self._number:
                        continue
                    else:
                        anns += 1

                new_anns.append(ann)
            if new_anns:
                yield item.wrap(annotations=new_anns)

def main(args):
    project = Project.load(args.input)
    dataset = project.make_dataset()
    dataset = Dataset.from_extractors(dataset.transform(Sampler, label=args.label, number=args.number))
    dataset.export(args.output, 'yolo', save_images=False)

if __name__ == '__main__':
    main(parse_args())  # parse_args() is defined elsewhere in the script

This is the current version of my Sampler. When I run it I receive this error:

Traceback (most recent call last):
  File "datumaro_sampler.py", line 59, in <module>
    main(parse_args())
  File "datumaro_sampler.py", line 56, in main
    dataset.export(args.output, 'yolo', save_images=False)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/datumaro/util/__init__.py", line 203, in wrapped_func
    func(*args, **kwargs)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/datumaro/components/dataset.py", line 774, in export
    converter.convert(self, save_dir=save_dir, **kwargs)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/datumaro/components/converter.py", line 33, in convert
    return converter.apply()
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/datumaro/plugins/yolo_format/converter.py", line 92, in apply
    with open(annotation_path, 'w', encoding='utf-8') as f:
OSError: [Errno 9] Bad file descriptor: '/home/azureuser/cloudfiles/code/Users/mael/datasets/Objects365/objects_365_soundmap/objects_365_soundmap-map_subsets/sampled/obj_valid_data/images/train2017/objects365_v2_01826801.txt'

@zhiltsov-max
Contributor

Hi, the script looks correct. From the error message I can see that you're probably using a mounted directory to work with the Azure cloud, is that correct? I think the error can be related to this: maybe there were too many I/O requests or something similar. Can you share some details about the drive mounting options (without personal data, of course)? We haven't tested such a scenario yet, so I can suggest trying to export to a local filesystem and then copying the result to the cloud manually.

@zhiltsov-max
Contributor

Hi, please check if #640 is useful for you.

@tdhooghe

Hi, just wanted to shed some light on the code above. I tried it and noticed that it stops keeping labels of a given class once a certain threshold is reached. I don't think this leads to the desired behavior, as it removes labels from images that contain other labels as well. Hence, we are left with images that are unlabeled w.r.t. a given class even though the label should be there. Therefore, the model will be trained on images with missing labels and might be unfairly penalized by the loss function.

Do you agree, or am I missing something here?

@zhiltsov-max
Contributor

Hi, yes, it is true. The solution can produce under-annotated images when there is more than one annotation per image. In the referenced PR (#640), we took a different approach, which works on the image level and doesn't have this problem. It may produce a different distribution of annotations than requested, depending on the data available, but the resulting images will contain all of their annotations.

@tdhooghe

tdhooghe commented Jul 16, 2022

Thank you very much for letting me know! Could you maybe provide an example of how I can use this class with my dataset via the API approach?

I am using the following line, and it does not seem to work, as the number of 'person' labels stays the same:
seed = 1234
sampled_coco = coco_dataset.transform(LabelRandomSampler, label_counts={'person': 100000}, seed=seed)

Also, could you explain what the count argument is for?

@zhiltsov-max
Contributor

@tdhooghe, you can find the parameter descriptions here and API usage examples here.

Basically, the count parameter is applied to all classes, while counts for specific classes can be set explicitly with label_counts.
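
For illustration, a minimal sketch of the API usage, assuming the LabelRandomSampler transform from #640 lives in datumaro.plugins.sampler.random_sampler (the module path, input path, and format are placeholders to check against your version):

from datumaro.components.dataset import Dataset
from datumaro.plugins.sampler.random_sampler import LabelRandomSampler  # module path may differ by version

dataset = Dataset.import_from('path/to/coco', 'coco')  # placeholder path/format

# 'count' is the default target per label; 'label_counts' overrides it for
# specific labels. The sampler works on the image level, so the resulting
# counts are approximate (see the note above).
dataset.transform(LabelRandomSampler, count=1000,
    label_counts={'person': 100000}, seed=1234)

dataset.export('sampled/', 'coco', save_images=True)  # placeholder output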
