Fixing fetching of categorical sparse data by Neeratyoy · Pull Request #823 · openml/openml-python

Neeratyoy · 2019-10-14T16:04:23Z

Reference Issue

Fixes #758.

What does this PR implement/fix? Explain your changes.

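A try block previously checked whether the attribute's nominal values could be converted with numpy (np.array(type_, dtype=np.float32)). This PR replaces that check with pandas categorical encoding (pd.factorize).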

How should this PR be tested?

import openml
openml.datasets.get_dataset(395)

mfeurer · 2019-10-14T18:52:04Z

+        self.assertEqual(dataset.name, 're1.wc')
+        self.assertEqual(feature.name, 'CLASS_LABEL')
+        self.assertEqual(feature.data_type, 'nominal')
+        self.assertEqual(len(feature.nominal_values), 25)


Could you please add a check for the type of the output value?

Sorry for not being clear enough here. Could you please load X and y and check their type, dtype and shape?

As discussed, I'm refraining from changing this test for now. I have created an issue to handle such checks independently.
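A check of the kind requested above (type, dtype and shape of X and y) could look like the following minimal sketch. It is only an illustration: X and y here are small stand-in objects built by hand, not the result of loading the dataset, and the helper name describe is hypothetical. It also assumes that sparse data arrives as a scipy sparse matrix and the labels as a numpy array.

```python
import numpy as np
import scipy.sparse

# Stand-in data (hypothetical): a tiny sparse feature matrix and
# integer-coded class labels, in place of the real dataset.
X = scipy.sparse.csr_matrix(
    np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 0.0]], dtype=np.float32)
)
y = np.array([0, 1], dtype=np.int64)


def describe(obj):
    """Return (type name, dtype as string, shape) for an array-like object."""
    return type(obj).__name__, str(obj.dtype), obj.shape


x_info = describe(X)
y_info = describe(y)

# The three properties the review asks about, checked explicitly.
assert scipy.sparse.issparse(X)
assert x_info[1] == "float32" and x_info[2] == (2, 3)
assert y_info == ("ndarray", "int64", (2,))
```

In a real unit test the expected dtype and shape would come from the dataset itself; they are deliberately not guessed here.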

mfeurer · 2019-10-14T18:55:54Z

-                    np.array(type_, dtype=np.float32)
+                    # checks if the strings which should be the class labels
+                    # can be encoded into integers
+                    pd.factorize(type_)[0]


Didn't you mention in person that you need to assign the value of this function call?

Yes, but after further testing I realized that the assignment doesn't make sense here, since this loop iterates over the attributes. If anything needs to be checked, it should be the data, which does not seem to raise any issue.

I checked this chunk of code with both the Sparse_Arff and Arff data formats, and type_ receives exactly the same type and structure in both cases. I don't know why the attribute list is checked for type when arff.ArffDecoder.decode() seems to return the target feature as a list of the classes, nor why a sparse format requires numeric encoding of that attribute list.

Hence, I replaced the numpy check with the pandas categorical encoding.
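The difference between the two checks in the diff can be sketched on a small example. The label values below are hypothetical stand-ins for an attribute's nominal values, and the array wrapper is a choice made for this sketch, not something taken from the PR. The sketch only illustrates the behaviour of the two calls, not the surrounding openml code.

```python
import numpy as np
import pandas as pd

# Hypothetical nominal values standing in for type_ in the diff.
type_ = ["acq", "earn", "grain", "acq"]

# Old check: casting non-numeric strings to float32 raises ValueError.
try:
    np.array(type_, dtype=np.float32)
    numpy_ok = True
except ValueError:
    numpy_ok = False

# New check: factorize maps each distinct label to an integer code.
# The list is wrapped in an object array here (a choice for this sketch).
codes, uniques = pd.factorize(np.asarray(type_, dtype=object))

print(numpy_ok)        # False
print(codes.tolist())  # [0, 1, 2, 0]
print(list(uniques))   # ['acq', 'earn', 'grain']
```

As in the diff, only whether the call succeeds matters for the check; the codes themselves are not assigned to anything.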

codecov-io · 2019-10-14T22:23:07Z

Codecov Report

Merging #823 into develop will increase coverage by 1.23%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           develop     #823      +/-   ##
===========================================
+ Coverage    88.05%   89.28%   +1.23%     
===========================================
  Files           36       36              
  Lines         4243     4768     +525     
===========================================
+ Hits          3736     4257     +521     
- Misses         507      511       +4

Impacted Files	Coverage Δ
openml/datasets/dataset.py	`90.15% <100%> (+2.68%)`	⬆️
openml/flows/flow.py	`94.24% <0%> (+0.37%)`	⬆️
openml/extensions/sklearn/extension.py	`94.01% <0%> (+2.73%)`	⬆️
openml/exceptions.py	`97.67% <0%> (+4.12%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4853d7c...6b0b036.

Replacing numpy conversion with pandas categorical encoding

d4d764e

Neeratyoy requested a review from mfeurer October 14, 2019 16:04

mfeurer requested changes Oct 14, 2019

Adding more unit tests check

b06d348

Neeratyoy requested a review from mfeurer October 14, 2019 22:24

Changing unit test data fetch parameter

6b0b036

mfeurer approved these changes Oct 15, 2019

mfeurer merged commit 23d4e6f into develop Oct 15, 2019

mfeurer deleted the fix_758 branch October 15, 2019 12:11
