Wrong "adult" dataset in OpenML100 / CC18 #813

amueller · 2018-10-08T17:42:46Z

This one is tagged:
https://www.openml.org/d/1590

It should be this:
https://www.openml.org/d/1119

It should exclude the "fnlwgt" column (I'm not sure whether it's marked as ignored; I can't see that in the web interface).
Also, the scikit-learn fetcher fails on this dataset.
CC @janvanrijn

amueller · 2018-10-08T17:45:23Z

It doesn't look like fnlwgt is dropped when loading from the Python interface.
(That interface also doesn't crash like the scikit-learn one does.)
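
Until the column is marked as ignored on the server side, it can be dropped by hand after loading. The following is only a minimal sketch: the helper name `drop_columns` and the sample rows are hypothetical, and the values are made up rather than taken from the real data. With a pandas DataFrame the equivalent would be `df.drop(columns=["fnlwgt"])`.

```python
def drop_columns(rows, ignored):
    """Return copies of the rows without the ignored column names."""
    ignored = set(ignored)
    return [{k: v for k, v in row.items() if k not in ignored} for row in rows]


# Made-up rows; only the "fnlwgt" column name comes from the thread.
rows = [
    {"age": 39, "fnlwgt": 100000, "education": "Bachelors", "class": "<=50K"},
    {"age": 50, "fnlwgt": 250000, "education": "Masters", "class": ">50K"},
]

cleaned = drop_columns(rows, ["fnlwgt"])
print(sorted(cleaned[0]))  # column names that remain after dropping fnlwgt
```

The original rows are left untouched, so the weights stay available if they are needed later, for example as sample weights.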

janvanrijn · 2018-10-08T18:27:45Z

Regarding scikit-learn, there was some flawed if logic in a for loop.

I created a PR to improve the code quality and fix the problem:
scikit-learn/scikit-learn#12330

amueller · 2018-10-09T15:26:46Z

Cool, I merged this one. But what about the wrong dataset being in the collections, and this dataset not ignoring one of the columns it should be ignoring? cc @joaquinvanschoren @berndbischl

janvanrijn · 2018-10-09T15:43:46Z

We should deactivate the current dataset(s) and upload a new version. Based on what information do you think that the fnlwgt column should be ignored?

amueller · 2018-10-09T15:46:46Z

Based on the description of the dataset, or maybe the original paper? It's a reweighting of each individual so that the overall population is better represented.

amueller · 2018-10-18T16:45:07Z

OK, it looks like the original paper included the fnlwgt feature.
The description says:

| Description of fnlwgt (final weight)
|
| The weights on the CPS files are controlled to independent estimates of the
| civilian noninstitutional population of the US.  These are prepared monthly
| for us by Population Division here at the Census Bureau.  We use 3 sets of
| controls.
|  These are:
|          1.  A single cell estimate of the population 16+ for each state.
|          2.  Controls for Hispanic Origin by age and sex.
|          3.  Controls by Race, age and sex.
|
| We use all three sets of controls in our weighting program and "rake" through
| them 6 times so that by the end we come back to all the controls we used.
|
| The term estimate refers to population totals derived from CPS by creating
| "weighted tallies" of any specified socio-economic characteristics of the
| population.
|
| People with similar demographic characteristics should have
| similar weights.  There is one important caveat to remember
| about this statement.  That is that since the CPS sample is
| actually a collection of 51 state samples, each with its own
| probability of selection, the statement only applies within
| state.

I don't entirely understand this feature, but I guess it should be included?
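
For what it's worth, the "weighted tallies" in the description can be sketched as follows. This assumes the usual survey-weight reading, where each row's fnlwgt approximates the number of people that row stands for; the function name `weighted_tally` and the records are hypothetical, with made-up values.

```python
def weighted_tally(records, predicate):
    """Sum the weights of matching records (an estimated population count)."""
    return sum(r["fnlwgt"] for r in records if predicate(r))


# Made-up records; each fnlwgt is read as "people represented by this row".
records = [
    {"sex": "Female", "fnlwgt": 1000},
    {"sex": "Male", "fnlwgt": 2500},
    {"sex": "Female", "fnlwgt": 1500},
]

females = weighted_tally(records, lambda r: r["sex"] == "Female")
total = weighted_tally(records, lambda r: True)

print(females, total)   # 2500 5000
print(females / total)  # 0.5
```

Under that reading the column is a property of the sampling design rather than of the person, which is why it is unclear whether it belongs among the predictive features.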

amueller · 2018-10-18T16:45:34Z

Also, shouldn't adult and adult-census be different versions of the same dataset?

janvanrijn · 2018-11-14T16:15:58Z

Moved this to openml/benchmark-suites#37

janvanrijn mentioned this issue Oct 8, 2018

fetch openml fails when ignore_attribute is not categorical scikit-learn/scikit-learn#12329

Closed

janvanrijn mentioned this issue Nov 14, 2018

adult dataset openml/benchmark-suites#37

Open

janvanrijn closed this as completed Nov 14, 2018
