Merged
7 changes: 5 additions & 2 deletions openml/datasets/dataset.py
@@ -245,6 +245,7 @@ def _get_arff(self, format: str) -> Dict:
when converted to lower case.



Returns
-------
dict
@@ -319,13 +320,15 @@ def _parse_data_from_arff(
attribute_names = []
categories_names = {}
categorical = []
- for name, type_ in data['attributes']:
+ for i, (name, type_) in enumerate(data['attributes']):
  # if the feature is nominal and the a sparse matrix is
  # requested, the categories need to be numeric
  if (isinstance(type_, list)
  and self.format.lower() == 'sparse_arff'):
  try:
- np.array(type_, dtype=np.float32)
+ # checks if the strings which should be the class labels
+ # can be encoded into integers
+ pd.factorize(type_)[0]
Collaborator

Didn't you mention in person that you need to assign the value of this function call?

Contributor Author

Yes, but after further testing I realized that assignment doesn't make sense here, since this loop iterates over the attributes. If anything needs to be checked, it should be the data, which does not appear to raise any issue.

I checked this chunk of code with both the Sparse_Arff and Arff data formats, and type_ receives exactly the same type and structure in both cases. I don't know why the attribute list is being type-checked when arff.ArffDecoder.decode() seems to return the target feature as a list of its classes, nor why a sparse format requires numeric encoding of that attribute list.

Hence, I replaced the numpy check with the pandas categorical encoding.
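
To make the difference between the two checks concrete, here is a minimal standalone sketch. It is an illustration only, not the code from this PR: the helper names and the label lists are hypothetical, and the list is wrapped in an object array before calling pd.factorize (the PR passes the list directly) because newer pandas versions warn on plain-list input.

```python
import numpy as np
import pandas as pd


def numeric_check(categories):
    """Old-style check: True only if every category string parses as a float."""
    try:
        np.array(categories, dtype=np.float32)
        return True
    except ValueError:
        return False


def factorize_codes(categories):
    """New-style check: integer codes from pandas, in order of first appearance."""
    # Wrapping in an object array is an assumption of this sketch, not of the PR.
    codes, _uniques = pd.factorize(np.asarray(categories, dtype=object))
    return codes.tolist()


# Hypothetical nominal value lists, standing in for `type_` in the loop.
numeric_labels = ["0", "1", "2"]
string_labels = ["acq", "crude", "earn"]

print(numeric_check(numeric_labels))   # numeric strings convert to float32
print(numeric_check(string_labels))    # non-numeric strings raise ValueError
print(factorize_codes(string_labels))  # strings still map to integer codes
```

In this sketch the float32 conversion rejects non-numeric class labels, while factorization assigns an integer code to any label, which matches the motivation given in the comment above.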

except ValueError:
raise ValueError(
"Categorical data needs to be numeric when "
11 changes: 11 additions & 0 deletions tests/test_datasets/test_dataset.py
@@ -9,6 +9,7 @@
import openml
from openml.testing import TestBase
from openml.exceptions import PyOpenMLError
from openml.datasets import OpenMLDataset, OpenMLDataFeature


class OpenMLDatasetTest(TestBase):
@@ -320,6 +321,16 @@ def test_get_sparse_dataset_rowid_and_ignore_and_target(self):
self.assertListEqual(categorical, [False] * 19998)
self.assertEqual(y.shape, (600, ))

def test_get_sparse_categorical_data_id_395(self):
mfeurer marked this conversation as resolved.
dataset = openml.datasets.get_dataset(395, download_data=True)
feature = dataset.features[3758]
self.assertTrue(isinstance(dataset, OpenMLDataset))
self.assertTrue(isinstance(feature, OpenMLDataFeature))
self.assertEqual(dataset.name, 're1.wc')
self.assertEqual(feature.name, 'CLASS_LABEL')
self.assertEqual(feature.data_type, 'nominal')
self.assertEqual(len(feature.nominal_values), 25)
Collaborator

Could you please add a check for the type of the output value?

Contributor Author

Added.

Collaborator

Sorry for not being clear enough here. Could you please load X and y and check their type, dtype and shape?
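
A rough sketch of what such a check could look like follows. It uses synthetic stand-ins rather than dataset 395, since the actual types, dtypes and shapes returned for that dataset are not shown in this thread; the helper name describe_xy is hypothetical.

```python
import numpy as np
import scipy.sparse


def describe_xy(X, y):
    """Collect the properties named in the review: type, dtype and shape."""
    return {
        "X_sparse": scipy.sparse.issparse(X),
        "X_dtype": X.dtype,
        "X_shape": X.shape,
        "y_type": type(y),
        "y_dtype": y.dtype,
        "y_shape": y.shape,
    }


# Synthetic stand-ins (assumption): a small sparse feature matrix and a label vector.
X = scipy.sparse.csr_matrix(np.eye(4, 3, dtype=np.float32))
y = np.array([0, 1, 2, 0], dtype=np.int64)

info = describe_xy(X, y)
assert info["X_sparse"]
assert info["X_shape"] == (4, 3)
assert info["y_type"] is np.ndarray
assert info["y_shape"] == (4,)
```

In a real test the same assertions would be made against the values loaded from the dataset, with the expected dtype and shape filled in from the data itself.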

Contributor Author

As discussed, I'm refraining from changing this test for now. I have created an issue to take care of such checks independently.



class OpenMLDatasetQualityTest(TestBase):
def test__check_qualities(self):