Sample from a specific label #522
Hi, could you please describe more precisely the operations you're trying to perform and the expected results? What do you mean by "sample" exactly? I can suggest looking into filtering with a custom filter expression (label == class, label != class), the NDR transform, or splitting with the task-specific splitters.
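The filter-expression idea above can be sketched in plain Python, independent of Datumaro. The item layout and label names here are made-up assumptions purely for illustration:

```python
# Sketch of label-based filtering (not the Datumaro API): keep only
# annotations matching one label, and drop items left with none.
def filter_annotations(items, keep_label):
    """items is a list of (image_id, [label, ...]) pairs."""
    result = []
    for image_id, annotations in items:
        kept = [a for a in annotations if a == keep_label]
        if kept:
            result.append((image_id, kept))
    return result

items = [
    ("img1", ["person", "car"]),
    ("img2", ["car"]),
    ("img3", ["person"]),
]
print(filter_annotations(items, "person"))
```

In Datumaro itself the same effect is achieved with a filter expression such as `label == class`, applied at the annotation level.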
Hey @zhiltsov-max, so the distribution of my dataset is the following:
As you can see, the 'person' class is heavily overrepresented.
Probably, there is no ready-to-use solution for this now. AFAIK, datasets are typically formed to have a nearly equal class distribution, or they need to be split into subsets with the initial distribution preserved. The latter case is already covered by the task-oriented splitters. You could probably use a simple script to get what you want:

```python
from datumaro.components.dataset import Dataset
from datumaro.components.extractor import Transform, AnnotationType

class Sampler(Transform):
    def __iter__(self):
        person_anns = 0
        required_quantity = 10000
        person_label_idx = self._extractor.categories()[
            AnnotationType.label].find('person')[0]
        for item in self._extractor:
            new_anns = []
            for ann in item.annotations:
                if hasattr(ann, 'label') and ann.label == person_label_idx:
                    if person_anns >= required_quantity:
                        continue  # quota reached, drop this annotation
                    person_anns += 1
                new_anns.append(ann)
            if new_anns:
                yield item.wrap(annotations=new_anns)

dataset = Dataset.import_from('path/', 'format_name')
dataset.transform(Sampler)
dataset.export('new_path/', 'format', save_images=True)
```
Thanks for the extensive reply. I'm trying to integrate your code, but I'm having some problems with the Python version of Datumaro. Due to some previous operations, I have a Datumaro project that has the data (annotations + images) I want. Unfortunately, I can't load it as a dataset (Error: no data of format coco at this path), and when I use a Datumaro project I guess I'm doing something wrong. Any suggestions?
And my second question was whether there is an easy way to pass variables to the transform class?
I suppose you're using an outdated version of the library. The latest is 0.2; which version are you on?

```python
from datumaro.components.project import Project

project = Project.load(path)
dataset = project.make_dataset()
...
dataset = Dataset.from_extractors(dataset.transform(...))
```
```python
class Sampler(Transform):
    def __init__(self, extractor, option1='foo', option2=42):
        super().__init__(extractor)
        self._option1 = option1
        self._option2 = option2

...

dataset.transform(Sampler, option1='bar', option2=36)
```
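The pattern above can be demonstrated without Datumaro: `transform(cls, **kwargs)` simply forwards the keyword arguments to the transform's `__init__`. The `Extractor` stand-in and option names below are illustrative assumptions:

```python
# Minimal sketch of how transform(cls, **kwargs) forwards user options
# to the transform's constructor alongside the extractor.
class Extractor:
    def __init__(self, items):
        self.items = items

class Sampler:
    def __init__(self, extractor, option1='foo', option2=42):
        self.extractor = extractor
        self.option1 = option1
        self.option2 = option2

def transform(extractor, cls, **kwargs):
    # Instantiate the transform with the extractor plus user options.
    return cls(extractor, **kwargs)

s = transform(Extractor([1, 2, 3]), Sampler, option1='bar', option2=36)
print(s.option1, s.option2)
```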
```python
import random

from datumaro.components.dataset import Dataset
from datumaro.components.extractor import Transform, AnnotationType
from datumaro.components.project import Project

class Sampler(Transform):
    def __init__(self, extractor, label=None, number=None):
        super().__init__(extractor)
        self._label = label
        self._number = number

    def __iter__(self):
        anns = 0
        label_idx = self._extractor.categories()[
            AnnotationType.label].find(self._label)[0]
        all_items = list(self._extractor)
        items = random.sample(all_items, len(all_items))  # shuffled copy
        for item in items:
            new_anns = []
            for ann in item.annotations:
                if hasattr(ann, 'label') and ann.label == label_idx:
                    if anns >= self._number:
                        continue  # quota reached, drop this annotation
                    anns += 1
                new_anns.append(ann)
            if new_anns:
                yield item.wrap(annotations=new_anns)

def main(args):
    project = Project.load(args.input)
    dataset = project.make_dataset()
    dataset = Dataset.from_extractors(
        dataset.transform(Sampler, label=args.label, number=args.number))
    dataset.export(args.output, 'yolo', save_images=False)

if __name__ == '__main__':
    main(parse_args())
```

This is the current version of my script.
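The script above calls a `parse_args()` that isn't shown. A plausible definition, inferred from how `main()` uses the arguments (the flag names and defaults are assumptions, not part of the original script), might look like this:

```python
# Hypothetical parse_args() matching the attributes main() reads:
# args.input, args.output, args.label, args.number.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description='Sample annotations of a single label')
    parser.add_argument('-i', '--input', required=True,
                        help='Path to the Datumaro project')
    parser.add_argument('-o', '--output', required=True,
                        help='Output directory for the exported dataset')
    parser.add_argument('-l', '--label', required=True,
                        help='Label name to sample')
    parser.add_argument('-n', '--number', type=int, required=True,
                        help='Maximum annotations of the label to keep')
    return parser.parse_args(argv)

args = parse_args(['-i', 'proj/', '-o', 'out/', '-l', 'person', '-n', '10000'])
print(args.label, args.number)
```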
Hi, the script looks correct. From the error message, I can see that you're probably using a mounted directory to work with Azure cloud storage, is that correct? I think the error can be related to this: maybe there were too many I/O requests or something similar. Can you share some details about the drive mounting options (without personal data, of course)? We haven't tested such a scenario yet, so I can suggest trying to export to a local filesystem and then copying to the cloud manually.
Hi, please check if #640 is useful for you.
Hi, just wanted to shed some light on the code above. I tried this code and noticed that it stops drawing labels of a given class once a certain threshold is reached. I don't think this leads to the wanted behavior, as it removes labels from images that contain other labels as well. Hence, we are left with images that are unlabeled w.r.t. a given class even though the label should be there. Therefore, the model will be trained on images with missing labels and might be unfairly penalized by the loss function. Do you agree, or am I missing something here?
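The under-annotation effect described here can be shown with made-up data. This sketch reproduces the annotation-level capping logic outside Datumaro; the item layout is an illustrative assumption:

```python
# Capping at the annotation level strips the over-quota label from
# images that still contain other labels, leaving them under-annotated.
def cap_annotations(items, label, quota):
    count, result = 0, []
    for image_id, annotations in items:
        kept = []
        for a in annotations:
            if a == label:
                if count >= quota:
                    continue  # drop this one annotation only
                count += 1
            kept.append(a)
        if kept:
            result.append((image_id, kept))
    return result

items = [("img1", ["person"]), ("img2", ["person", "car"])]
# With quota=1, img2 keeps "car" but silently loses its "person" label.
print(cap_annotations(items, "person", quota=1))
```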
Hi, yes, that is true. The solution can produce under-annotated images when there is more than one annotation per image. In the referenced PR (#640), we took a different approach, which works on the image level and doesn't have this problem. It may produce a different distribution of annotations than requested, depending on the data available, but the resulting images will contain all of their annotations.
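The image-level idea can be sketched as follows: instead of trimming annotations, whole images are kept or skipped, so every kept image retains all of its annotations. This is only an illustration of the principle, not the code from #640:

```python
# Image-level sampling sketch: skip a whole image once keeping it would
# exceed the label quota, rather than removing individual annotations.
import random

def sample_images(items, label, quota, seed=0):
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    kept, count = [], 0
    for image_id, annotations in shuffled:
        n = sum(1 for a in annotations if a == label)
        if n and count + n > quota:
            continue  # skip the whole image instead of trimming it
        count += n
        kept.append((image_id, annotations))
    return kept

items = [("img1", ["person", "car"]), ("img2", ["person"]), ("img3", ["car"])]
print(sample_images(items, "person", quota=1))
```

The trade-off mentioned above is visible here: the resulting count of the target label may fall short of the quota, but no image is left missing labels it should have.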
Thank you very much for letting me know! Could you maybe provide an example of how I can use this class with my dataset via the API approach? I am using the following line, and it does not seem to work, as the number of 'person' labels stays the same. Also, could you explain what the
Hey everybody,
I was wondering if there is a way to sample from a specific label. My situation is that I have a dataset where one class is heavily overrepresented, and I was wondering if there is an option to sample just from this label while keeping everything else the same.
Thanks in advance,
Bruno