Skip to content

Lazy download of data splits #659

Merged
mfeurer merged 11 commits intodevelopfrom
fix346
Apr 9, 2019
Merged

Lazy download of data splits #659
mfeurer merged 11 commits intodevelopfrom
fix346

Conversation

@Neeratyoy
Copy link
Copy Markdown
Contributor

What does this PR implement/fix? Explain your changes.

Addresses the pending part from #346 . Wherein, the splits and the data are not downloaded when get_task is called. Only when a run is executed, the data and the respective splits are downloaded.
A unit test has been added for the lazy fetch of a task.

How should this PR be tested?

Calling get_task or get_tasks with the additional parameter of download_data=False downloads only the task.xml file. Which can be verified by monitoring the cache directory. When a task is run, the corresponding datasplits.arff file is downloaded and available in the cache directory.

Any other comments?

  • Earlier: task.class_labels assignment was happening in a nested manner after task.download_split
  • Now: task.class_labelsis being assigned even if splits are not being downloaded, by fetching it from dataset's description

Copy link
Copy Markdown
Collaborator

@mfeurer mfeurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot! This looks really good, I only have a few minor comments.

Comment thread tests/test_tasks/test_task_functions.py Outdated
Comment thread tests/test_tasks/test_task_functions.py
Comment thread openml/tasks/functions.py Outdated
def get_task(task_id):
"""Download the OpenML task for a given task ID.
def get_task(task_id, download_data=True):
"""Download the OpenML task representation for a given task ID, optionally
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please try to keep the first line of the docstring to be a single line and add information about optional arguments afterwards?

@codecov-io
Copy link
Copy Markdown

codecov-io commented Apr 4, 2019

Codecov Report

Merging #659 into develop will decrease coverage by 0.11%.
The diff coverage is 76.92%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #659      +/-   ##
===========================================
- Coverage       91%   90.89%   -0.12%     
===========================================
  Files           36       32       -4     
  Lines         3526     3395     -131     
===========================================
- Hits          3209     3086     -123     
+ Misses         317      309       -8
Impacted Files Coverage Δ
openml/tasks/functions.py 85.53% <76.92%> (-0.92%) ⬇️
openml/testing.py 93.82% <0%> (-1.5%) ⬇️
openml/flows/functions.py 87.4% <0%> (-0.79%) ⬇️
openml/config.py 89.28% <0%> (-0.19%) ⬇️
openml/runs/trace.py 90.84% <0%> (-0.18%) ⬇️
openml/flows/flow.py 94.08% <0%> (-0.1%) ⬇️
openml/__init__.py 100% <0%> (ø) ⬆️
openml/datasets/functions.py 95.47% <0%> (ø) ⬆️
openml/flows/__init__.py 100% <0%> (ø) ⬆️
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7e8e904...23209af. Read the comment docs.

@Neeratyoy Neeratyoy requested a review from mfeurer April 4, 2019 17:14
Comment thread openml/tasks/functions.py Outdated
Parameters
----------
task_id : int
The task representation is downloaded while the download of data
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, that's not how I meant it. Could you please check for example the doc strings of the scikit-learn random forest? There's a single line, then some additional information (where I think the sentence about the optional download of the split and data should go) and then the Parameters.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry about that, should have looked up an example, will make the changes

@Neeratyoy Neeratyoy requested a review from mfeurer April 5, 2019 10:42
@mfeurer mfeurer merged commit 6b5dfe6 into develop Apr 9, 2019
@mfeurer mfeurer deleted the fix346 branch April 9, 2019 15:27
PGijsbers pushed a commit that referenced this pull request Apr 10, 2019
* Added comments in examples for dataset 68 belonging to only test server

* Added comment in flow and run example for dataset 68 belonging to only test server

* Making download of datasplits optional and adding a relevant unit test

* Adding error handling for task ID type

* Changes suggested by Matthias on PR #659

* Removing inappropriate dataset check from test case

* Fixing docstring

* Fixing whitespace issue for PEP8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants