
Fix612 lazy download dataset#644

Merged
mfeurer merged 19 commits into develop from fix612
Mar 18, 2019

Conversation

@PGijsbers
Collaborator

@PGijsbers PGijsbers commented Mar 17, 2019

Datasets can now be retrieved without downloading the arff file.
Function signature of get_dataset changed from
def get_dataset(dataset_id: Union[int, str]) -> OpenMLDataset:
to
def get_dataset(dataset_id: Union[int, str], download_data: bool = True) -> OpenMLDataset:
I chose to default to True so there are no breaking changes for existing code.
If download_data=False, only metadata is downloaded (i.e. everything except the arff file).

Whenever a user invokes retrieve_class_labels or get_data, both of which require the arff file, the arff file is retrieved and processed as if it had been downloaded from the server on initialization (e.g. pickled). This happens transparently, without any warning, error, or extra argument.
As long as only functionality is used which does not require the arff file, no additional data is downloaded.
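The lazy-retrieval behavior described above can be sketched as follows. This is a minimal illustration of the pattern, not the actual openml-python code; the `LazyDataset` class, the `fetch_fn` downloader, and the cache layout are all hypothetical:

```python
import pickle
from pathlib import Path


class LazyDataset:
    """Sketch of a dataset object whose data file is fetched only on first use."""

    def __init__(self, dataset_id, fetch_fn, cache_dir="cache"):
        self.dataset_id = dataset_id
        self._fetch_fn = fetch_fn                       # hypothetical downloader
        self._cache = Path(cache_dir) / f"{dataset_id}.pkl"
        self._data = None                               # metadata-only until accessed

    def get_data(self):
        """Download (or load from the pickle cache) on first access only."""
        if self._data is None:
            if self._cache.exists():
                # Already fetched once before: reuse the pickled copy.
                self._data = pickle.loads(self._cache.read_bytes())
            else:
                # First access: fetch, then pickle as if downloaded on init.
                self._data = self._fetch_fn(self.dataset_id)
                self._cache.parent.mkdir(parents=True, exist_ok=True)
                self._cache.write_bytes(pickle.dumps(self._data))
        return self._data
```

Constructing the object downloads nothing; only the first call to `get_data` triggers the fetch, and subsequent calls (even from a new object) are served from the pickle cache.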

I took the liberty of refactoring retrieve_class_labels so that it uses the already downloaded feature metadata instead of reading the arff file. This means retrieve_class_labels can be used without downloading the underlying data, and it should really speed up the method in cases where a huge arff file was loaded just to check the header 👍
As part of this I also addressed open issue #507, which was being worked on in #508 (I only realized this afterwards). It looks like that code predates @janvanrijn's change to the xml that made the nominal values a list; my solution uses this update and addresses the issue with only minor changes.
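The idea behind the retrieve_class_labels refactor can be sketched like this: instead of parsing the arff header, look up the target feature's nominal values in the already-downloaded feature metadata (which, after the xml change, carries them as a list). The function name, dictionary keys, and example metadata below are illustrative, not the actual openml-python structures:

```python
def class_labels_from_metadata(features, target_name):
    """Return the nominal values of the target feature, or None if the
    target is not nominal (e.g. a regression target)."""
    for feature in features:
        if feature["name"] == target_name:
            if feature["data_type"] == "nominal":
                return feature["nominal_values"]
            return None
    raise ValueError(f"Target {target_name!r} not found in metadata")


# Feature metadata as it might look with nominal values already a list
# (structure is illustrative only).
features = [
    {"name": "sepal_length", "data_type": "numeric", "nominal_values": None},
    {"name": "class", "data_type": "nominal",
     "nominal_values": ["setosa", "versicolor", "virginica"]},
]
```

Because this only inspects metadata, no arff file is ever touched, which is what makes the method usable on a lazily loaded dataset.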

Fixes:
#643
#612
#507
#446
#346 in part

@PGijsbers PGijsbers requested review from mfeurer and removed request for mfeurer March 17, 2019 21:01
@codecov-io

codecov-io commented Mar 18, 2019

Codecov Report

Merging #644 into develop will increase coverage by 0.54%.
The diff coverage is 92.62%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #644      +/-   ##
===========================================
+ Coverage    90.13%   90.67%   +0.54%     
===========================================
  Files           32       32              
  Lines         3366     3390      +24     
===========================================
+ Hits          3034     3074      +40     
+ Misses         332      316      -16
Impacted Files Coverage Δ
openml/datasets/dataset.py 88.41% <90.76%> (+6%) ⬆️
openml/datasets/functions.py 92.04% <94.59%> (-0.7%) ⬇️
openml/utils.py 92.06% <95%> (+0.55%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 94102f3...b812904.

This was referenced Mar 18, 2019
@PGijsbers PGijsbers changed the title [WIP] Fix612 lazy download dataset Fix612 lazy download dataset Mar 18, 2019
@PGijsbers
Collaborator Author

It looks like continuous-integration/appveyor/pr failed because the conda update stalled.
I can't figure out how to rerun it without pushing a new commit to this PR. I tried to re-build the commit, but that only seems to trigger continuous-integration/appveyor/branch (which worked, and is now being rerun).

@PGijsbers PGijsbers self-assigned this Mar 18, 2019
@PGijsbers
Collaborator Author

I can't seem to restart the continuous-integration/appveyor/pr build without submitting a new commit. If requested, I will add one, but I'm confident the rerun isn't needed: all three other builds passed, so the code quality should be fine.

@joaquinvanschoren
Contributor

Awesome! 🎉
