
Fix612 lazy download dataset#644

Merged
mfeurer merged 19 commits into develop from fix612
Mar 18, 2019

Conversation

@PGijsbers
Collaborator

@PGijsbers PGijsbers commented Mar 17, 2019

Datasets can now be retrieved without downloading the arff file.
Function signature of get_dataset changed from
def get_dataset(dataset_id: Union[int, str]) -> OpenMLDataset:
to
def get_dataset(dataset_id: Union[int, str], download_data: bool = True) -> OpenMLDataset:
I chose to default to True so there are no breaking changes for existing code.
If download_data=False, only metadata is downloaded (i.e. everything except the arff file).

Whenever a user invokes retrieve_class_labels or get_data, both of which require the arff file, the arff file is retrieved and processed as if it had been downloaded from the server on initialization (e.g. pickled). This happens transparently, without any warning, error, or extra argument.
As long as only functionality is used which does not require the arff file, no additional data is downloaded.
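The lazy-retrieval behavior described above can be sketched as follows. This is a minimal illustration of the pattern, not the actual openml-python code; the `LazyDataset` class, the `fetch_fn` downloader, and the cache layout are all hypothetical:

```python
import pickle
from pathlib import Path


class LazyDataset:
    """Sketch of a dataset object whose data file is fetched only on first use."""

    def __init__(self, dataset_id, fetch_fn, cache_dir="cache"):
        self.dataset_id = dataset_id
        self._fetch_fn = fetch_fn                       # hypothetical downloader
        self._cache = Path(cache_dir) / f"{dataset_id}.pkl"
        self._data = None                               # metadata-only until accessed

    def get_data(self):
        """Download (or load from the pickle cache) on first access only."""
        if self._data is None:
            if self._cache.exists():
                # Already fetched once before: reuse the pickled copy.
                self._data = pickle.loads(self._cache.read_bytes())
            else:
                # First access: fetch, then pickle as if downloaded on init.
                self._data = self._fetch_fn(self.dataset_id)
                self._cache.parent.mkdir(parents=True, exist_ok=True)
                self._cache.write_bytes(pickle.dumps(self._data))
        return self._data
```

Constructing the object downloads nothing; only the first call to `get_data` triggers the fetch, and subsequent calls (even from a new object) are served from the pickle cache.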

I took the liberty of refactoring retrieve_class_labels so that it uses the already downloaded feature metadata instead of reading the arff file. This means retrieve_class_labels can be used without downloading the underlying data, and it should really speed up the method in cases where a huge arff file was loaded just to check the header 👍
As part of this I also addressed open issue #507, which was being worked on in #508 (I only realized this afterwards). It looks like that code predates @janvanrijn's change to the xml that made the nominal values a list; my solution uses this update and addresses the issue with only minor changes.
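The idea behind the retrieve_class_labels refactor can be sketched like this: instead of parsing the arff header, look up the target feature's nominal values in the already-downloaded feature metadata (which, after the xml change, carries them as a list). The function name, dictionary keys, and example metadata below are illustrative, not the actual openml-python structures:

```python
def class_labels_from_metadata(features, target_name):
    """Return the nominal values of the target feature, or None if the
    target is not nominal (e.g. a regression target)."""
    for feature in features:
        if feature["name"] == target_name:
            if feature["data_type"] == "nominal":
                return feature["nominal_values"]
            return None
    raise ValueError(f"Target {target_name!r} not found in metadata")


# Feature metadata as it might look with nominal values already a list
# (structure is illustrative only).
features = [
    {"name": "sepal_length", "data_type": "numeric", "nominal_values": None},
    {"name": "class", "data_type": "nominal",
     "nominal_values": ["setosa", "versicolor", "virginica"]},
]
```

Because this only inspects metadata, no arff file is ever touched, which is what makes the method usable on a lazily loaded dataset.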

Fixes:
#643
#612
#507
#446
#346 in part

@PGijsbers PGijsbers requested review from mfeurer and removed request for mfeurer March 17, 2019 21:01
@codecov-io

codecov-io commented Mar 18, 2019

Codecov Report

Merging #644 into develop will increase coverage by 0.54%.
The diff coverage is 92.62%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #644      +/-   ##
===========================================
+ Coverage    90.13%   90.67%   +0.54%     
===========================================
  Files           32       32              
  Lines         3366     3390      +24     
===========================================
+ Hits          3034     3074      +40     
+ Misses         332      316      -16
Impacted Files Coverage Δ
openml/datasets/dataset.py 88.41% <90.76%> (+6%) ⬆️
openml/datasets/functions.py 92.04% <94.59%> (-0.7%) ⬇️
openml/utils.py 92.06% <95%> (+0.55%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 94102f3...b812904.

This was referenced Mar 18, 2019
@PGijsbers PGijsbers changed the title [WIP] Fix612 lazy download dataset Fix612 lazy download dataset Mar 18, 2019
@PGijsbers
Collaborator Author

It looks like continuous-integration/appveyor/pr failed because the conda update stalled.
I can't figure out how to rerun it without pushing a new commit to this PR. I tried to re-build the commit, but that only seems to trigger continuous-integration/appveyor/branch (which worked, and is now being rerun).

@PGijsbers PGijsbers self-assigned this Mar 18, 2019
@PGijsbers
Collaborator Author

I can't seem to restart the continuous-integration/appveyor/pr build without submitting a new commit. If requested, I will add one, but I'm confident the rerun isn't needed: all three other builds passed, so the code quality should be fine.

@joaquinvanschoren
Contributor

Awesome! 🎉
