Wrong "adult" dataset in OpenML100 / CC18 #813

Closed
amueller opened this issue Oct 8, 2018 · 8 comments

@amueller

amueller commented Oct 8, 2018

This one is tagged:
https://www.openml.org/d/1590

It should be this:
https://www.openml.org/d/1119

It should exclude the "fnlwgt" column (I'm not sure whether it's marked as ignored; I can't see that in the web interface).
Also, the scikit-learn fetcher fails on this dataset.
CC @janvanrijn

@amueller
Author

amueller commented Oct 8, 2018

It doesn't look like fnlwgt is dropped when loading from the Python interface.
(Also, that interface doesn't crash like the scikit-learn one does.)

@janvanrijn
Member

Regarding scikit-learn, there was some flawed if-logic inside a for loop.

I created a PR to improve the code quality and fix the problem:
scikit-learn/scikit-learn#12330

@amueller
Author

amueller commented Oct 9, 2018

Cool, I merged this one. But what about the wrong dataset being in the collections (and this dataset not ignoring one of the columns it should be ignoring)? cc @joaquinvanschoren @berndbischl

@janvanrijn
Member

We should deactivate the current dataset(s) and upload a new version. Based on what information do you think that the fnlwgt column should be ignored?

@amueller
Author

amueller commented Oct 9, 2018 via email

@amueller
Author

OK, it looks like the original paper included the fnlwgt feature.
The description says:

| Description of fnlwgt (final weight)
|
| The weights on the CPS files are controlled to independent estimates of the
| civilian noninstitutional population of the US.  These are prepared monthly
| for us by Population Division here at the Census Bureau.  We use 3 sets of
| controls.
|  These are:
|          1.  A single cell estimate of the population 16+ for each state.
|          2.  Controls for Hispanic Origin by age and sex.
|          3.  Controls by Race, age and sex.
|
| We use all three sets of controls in our weighting program and "rake" through
| them 6 times so that by the end we come back to all the controls we used.
|
| The term estimate refers to population totals derived from CPS by creating
| "weighted tallies" of any specified socio-economic characteristics of the
| population.
|
| People with similar demographic characteristics should have
| similar weights.  There is one important caveat to remember
| about this statement.  That is that since the CPS sample is
| actually a collection of 51 state samples, each with its own
| probability of selection, the statement only applies within
| state.

I don't entirely understand this feature, but I guess it should be included?
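
For readers unfamiliar with the term, the "rake" step in the quoted description reads like iterative proportional fitting: the weights are rescaled against each set of controls in turn, and the cycle is repeated until every set is matched again. Below is a minimal sketch of that idea, assuming made-up toy data. The function name `rake`, its parameters, and the numbers are hypothetical, and this is not the Census Bureau's actual weighting program. Summing the resulting weights over a group is the "weighted tally" the description mentions.

```python
def rake(weights, groups_per_control, targets_per_control, passes=6):
    """Iterative proportional fitting over several sets of controls.

    weights: initial weight per respondent
    groups_per_control: per control set, the cell each respondent falls in
    targets_per_control: per control set, a dict mapping cell -> control total
    passes: how many times to cycle through all control sets
            (6 mirrors the "6 times" in the quoted description)
    """
    w = list(weights)
    for _ in range(passes):
        for groups, targets in zip(groups_per_control, targets_per_control):
            # Current weighted tally per cell for this control set.
            totals = {}
            for wi, g in zip(w, groups):
                totals[g] = totals.get(g, 0.0) + wi
            # Rescale each weight so its cell sums to the control total.
            w = [wi * targets[g] / totals[g] for wi, g in zip(w, groups)]
    return w


# Toy data (invented for illustration): four respondents, two control sets.
state = ["A", "A", "B", "B"]
sex = ["f", "m", "f", "m"]
weights = rake(
    [1.0, 1.0, 1.0, 1.0],
    [state, sex],
    [{"A": 300.0, "B": 100.0}, {"f": 220.0, "m": 180.0}],
)

# Weighted tallies: estimated population per state and per sex.
tally_state_a = weights[0] + weights[1]
tally_female = weights[0] + weights[2]
```

Under this reading, fnlwgt says how many people in the population a row stands for, rather than describing the person in that row.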

@amueller
Author

Also, shouldn't adult and adult-census be different versions of the dataset?

@janvanrijn
Member

Moved this to openml/benchmark-suites#37
