Class Balances #34

wawamanhunt · 2017-07-13T17:59:29Z

Hi,

Big fan of the Google Research datasets. I have been hoping to use this dataset to train a model.

For my model, I am looking for 2000 to 3000 balanced classes with 1000 or more observations each. To test for class balance, I examined the aggregate training and validation annotations, taking up to 3000 observations from each class. The original annotations had 83 million labels, which this operation trimmed to 1.7 million rows. I expected my distribution to be a bit flatter at the upper bound but otherwise the same as the graphs provided in the repository.

The graphs provided in the repository show around 2500 classes with 1000 occurrences each. I re-downloaded the annotations and checked and rewrote my scripts a bunch of times, but still keep getting this result. I was wondering if anyone else is having this issue, or if I am missing something.
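
For anyone trying to reproduce this, the check described above (cap each class at 3000 observations, then count the classes that still have 1000 or more) can be sketched roughly as follows. This is only an illustration, not the script used for the numbers above: the function names are hypothetical, the label IDs are toy values, and it assumes the annotations have already been reduced to a flat list with one label per row.

```python
from collections import Counter


def capped_class_counts(labels, cap=3000):
    """Count rows per class, keeping at most `cap` rows for any class."""
    counts = Counter(labels)
    return {label: min(n, cap) for label, n in counts.items()}


def classes_with_at_least(counts, minimum=1000):
    """Return the sorted labels whose (capped) count is at least `minimum`."""
    return sorted(label for label, n in counts.items() if n >= minimum)


# Toy data standing in for the real annotation rows (hypothetical label IDs).
labels = ["/m/a"] * 5000 + ["/m/b"] * 1200 + ["/m/c"] * 40

counts = capped_class_counts(labels)
usable = classes_with_at_least(counts)
rows_kept = sum(counts.values())
```

On the real data, `rows_kept` corresponds to the trimmed row count and `len(usable)` to the number of classes that meet the 1000-observation threshold.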

rkrasin · 2017-07-13T18:05:27Z

@andreasveit Andreas, can you please take a quick look?

atqamar · 2017-07-19T19:36:48Z

@rkrasin @andreasveit I'm getting similar results too. The histogram of labels does not match what is presented in the README.

tiborr · 2017-07-21T16:23:34Z

@rkrasin @andreasveit Same here.

theostrauss · 2017-07-21T16:24:55Z

@rkrasin @andreasveit

@atqamar @tiborr I am getting the same results. This is a little weird, don't you think? It is definitely slowing down my workflow, and I would love some confirmation ASAP.

rkrasin · 2017-07-21T16:44:56Z

According to this query in BigQuery, there are 2532 categories with at least 1000 images.

rkrasin · 2017-07-21T19:01:20Z

@theostrauss @tiborr @atqamar @wawamanhunt may I see your code that builds the histogram? I am pretty sure the histogram in the README is close to the truth, based on the confirmation from BigQuery, but I also have no doubt that your results are based on something. So I want to reproduce your results to understand the difference.

rkrasin · 2017-07-22T05:26:39Z

Actually, given the new update posted (https://research.googleblog.com/2017/07/an-update-to-open-images-now-with.html), all histograms need to be reevaluated. :)
TL;DR: more ground truth, and also with bounding boxes.

nalldrin · 2017-11-28T01:07:49Z

Sorry for not seeing this thread sooner. I wonder if the distribution from Andreas was for human annotations whereas you are looking at the machine annotations? 83M is from the machine annotations.
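
One way to check this would be to build the per-class counts separately for each annotation source and compare the two histograms. Below is a minimal sketch, assuming each annotation row can be reduced to an (image ID, source, label) triple; the source names "human" and "machine", the function name, and the toy rows are placeholders for illustration, not the actual file layout.

```python
from collections import defaultdict


def images_per_class_by_source(rows):
    """Count distinct images per label, separately for each annotation source.

    `rows` is an iterable of (image_id, source, label) triples.
    Returns {source: {label: number_of_distinct_images}}.
    """
    seen = defaultdict(set)
    for image_id, source, label in rows:
        seen[(source, label)].add(image_id)

    result = defaultdict(dict)
    for (source, label), images in seen.items():
        result[source][label] = len(images)
    return dict(result)


# Toy rows (hypothetical IDs and source names).
rows = [
    ("img1", "machine", "/m/a"),
    ("img2", "machine", "/m/a"),
    ("img2", "machine", "/m/a"),  # duplicate row, counted once
    ("img1", "human", "/m/a"),
    ("img3", "human", "/m/b"),
]

per_source = images_per_class_by_source(rows)
```

If the README histogram was built from one source and the 83M rows come from the other, the two dictionaries in `per_source` should show visibly different distributions.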
