Skip to content

Fixing fetching of categorical sparse data#823

Merged
mfeurer merged 3 commits intodevelopfrom
fix_758
Oct 15, 2019
Merged

Fixing fetching of categorical sparse data#823
mfeurer merged 3 commits intodevelopfrom
fix_758

Conversation

@Neeratyoy
Copy link
Copy Markdown
Contributor

Reference Issue

Fixes #758.

What does this PR implement/fix? Explain your changes.

There was a try block checking for a numpy based conversion. Converted that to pandas categorical encoding.

How should this PR be tested?

import openml
openml.datasets.get_dataset(395)

@Neeratyoy Neeratyoy requested a review from mfeurer October 14, 2019 16:04
self.assertEqual(dataset.name, 're1.wc')
self.assertEqual(feature.name, 'CLASS_LABEL')
self.assertEqual(feature.data_type, 'nominal')
self.assertEqual(len(feature.nominal_values), 25)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a check for the type of the output value?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not being clear enough here. Could you please load X and y and check their type, dtype and shape?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, I'm refraining from changing this test now. Have created an issue to take care of such checks independently.

Comment thread tests/test_datasets/test_dataset.py
np.array(type_, dtype=np.float32)
# checks if the strings which should be the class labels
# can be encoded into integers
pd.factorize(type_)[0]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't you mention in person that you need to assign the value of this function call?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but with further testing realized that assignment doesn't make sense here since this loop iterates over the attributes. Whereas, if anything needs to be checked, we should check the data. Which is not seemingly throwing any issue.

I checked this chunk of code with both a Sparse_Arff and Arff data formats, the type_ receives exactly the same type and structure of the output. I don't know why the attribute list is being checked for type whereas the arff.ArffDecoder.decode() seems to return the target feature as a list of the classes. Don't know why a sparse format requires numeric encoding of that attribute list.

Hence, I replaced the numpy check with the pandas categorical encoding.

@codecov-io
Copy link
Copy Markdown

codecov-io commented Oct 14, 2019

Codecov Report

Merging #823 into develop will increase coverage by 1.23%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #823      +/-   ##
===========================================
+ Coverage    88.05%   89.28%   +1.23%     
===========================================
  Files           36       36              
  Lines         4243     4768     +525     
===========================================
+ Hits          3736     4257     +521     
- Misses         507      511       +4
Impacted Files Coverage Δ
openml/datasets/dataset.py 90.15% <100%> (+2.68%) ⬆️
openml/flows/flow.py 94.24% <0%> (+0.37%) ⬆️
openml/extensions/sklearn/extension.py 94.01% <0%> (+2.73%) ⬆️
openml/exceptions.py 97.67% <0%> (+4.12%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4853d7c...6b0b036. Read the comment docs.

@Neeratyoy Neeratyoy requested a review from mfeurer October 14, 2019 22:24
@mfeurer mfeurer merged commit 23d4e6f into develop Oct 15, 2019
@mfeurer mfeurer deleted the fix_758 branch October 15, 2019 12:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cannot load sparse datasets with categorical variables any more

3 participants