Class Balances #34

Open
wawamanhunt opened this issue Jul 13, 2017 · 8 comments

Comments

@wawamanhunt

Hi,

Big fan of Google Research datasets; I have been hoping to use this dataset to train a model.

For my model, I am looking for 2000 to 3000 balanced classes with 1000 or more observations each. To test for class balance, I examined the aggregate training and validation annotations, taking up to 3000 observations from each class. The original annotations had 83 million labels, which this operation trimmed to 1.7 million rows. I expected my distribution to be a bit flatter at the upper bound but otherwise match the graphs provided in the repository.

[Attached image: download]

The graphs provided in the repository show around 2500 classes with 1000 occurrences each. I re-downloaded the annotations and checked and rewrote my scripts a bunch of times, but still keep getting this result. I was wondering if anyone else is having this issue, or if I am missing something.
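For anyone trying to reproduce the check described above, here is a minimal sketch of the capping-and-counting step. It is not the original poster's script: the function names, the toy label IDs, and the small cap are hypothetical, and it assumes the annotations have already been reduced to a flat sequence of class labels, one per annotation row.

```python
from collections import Counter


def capped_class_counts(labels, cap=3000):
    """Count rows per class, keeping at most `cap` rows for each class."""
    counts = Counter()
    for label in labels:
        if counts[label] < cap:
            counts[label] += 1
    return counts


def classes_with_at_least(counts, minimum=1000):
    """Number of classes whose (capped) count reaches `minimum`."""
    return sum(1 for n in counts.values() if n >= minimum)


# Toy data standing in for the real annotation rows (hypothetical label IDs).
labels = ["/m/a"] * 5 + ["/m/b"] * 2 + ["/m/c"]
counts = capped_class_counts(labels, cap=3)
kept_rows = sum(counts.values())          # rows that survive the cap
balanced = classes_with_at_least(counts, minimum=2)
```

With the real data, `cap=3000` and `minimum=1000` would correspond to the numbers in the post, and `kept_rows` is the figure to compare against the 1.7 million rows reported.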

@rkrasin
Contributor

rkrasin commented Jul 13, 2017

@andreasveit Andreas, can you please take a quick look?

@atqamar

atqamar commented Jul 19, 2017

@rkrasin @andreasveit I'm getting similar results too. The histogram of labels does not match what is presented in the README.

@tiborr

tiborr commented Jul 21, 2017

@rkrasin @andreasveit Same here.

@theostrauss

@rkrasin @andreasveit

@atqamar @tiborr I am getting the same results. This is a little weird, don't you think? It is definitely slowing down my workflow, and I would love some confirmation ASAP.

@rkrasin
Contributor

rkrasin commented Jul 21, 2017

According to this query in BigQuery, there are 2532 categories with at least 1000 images.

@rkrasin
Contributor

rkrasin commented Jul 21, 2017

@theostrauss @tiborr @atqamar @wawamanhunt May I see the code that builds your histogram? I am pretty sure the histogram in the README is close to the truth, based on the confirmation from BigQuery, but I also have no doubt that your results are based on something. So I want to reproduce your results to understand the difference.

@rkrasin
Contributor

rkrasin commented Jul 22, 2017

Actually, given the newly posted update (https://research.googleblog.com/2017/07/an-update-to-open-images-now-with.html), all histograms need to be reevaluated. :)
TL;DR: more ground truth, now also with bounding boxes.

@nalldrin
Member

Sorry for not seeing this thread sooner. I wonder if the distribution from Andreas was for human annotations whereas you are looking at the machine annotations? 83M is from the machine annotations.
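One way to test this hypothesis is to build the per-class counts separately for each annotation source and compare the two histograms. The sketch below is only an illustration: the column names (`ImageID`, `Source`, `LabelName`, `Confidence`) and the source values `machine` and `human` are assumptions about the annotation CSV layout, and `counts_by_source` is a hypothetical helper, so check them against the files you actually downloaded.

```python
import csv
import io
from collections import Counter, defaultdict


def counts_by_source(csv_text):
    """Return {source: Counter({label: row_count})} for an annotation CSV.

    Assumes (not confirmed in this thread) a header row containing
    'Source' and 'LabelName' columns.
    """
    per_source = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        per_source[row["Source"]][row["LabelName"]] += 1
    return per_source


# Tiny made-up sample; the real files are far larger.
sample = """ImageID,Source,LabelName,Confidence
img1,machine,/m/a,0.9
img2,machine,/m/a,0.8
img3,machine,/m/b,0.7
img1,human,/m/a,1.0
"""
per_source = counts_by_source(sample)
machine_total = sum(per_source["machine"].values())
human_total = sum(per_source["human"].values())
```

If the machine total is close to 83M while the human total is much smaller, the two histograms are probably describing different annotation sets.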
