Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MAINT, MRG] Enable pooch to perform fetching of datasets #9742

Merged
merged 52 commits into from Sep 19, 2021

Conversation

adam2392
Copy link
Member

@adam2392 adam2392 commented Sep 15, 2021

Reference issue

First part of #8679 #9736

closes #9756

What does this implement/fix?

Builds on top of @drammock work in #8679 to:

  • move all definition of a dataset to a config.py + dataset_checksums.txt file. These define urls, archive names, configuration keys, folder names, and md5 hashes
  • use pooch to do the downloading, unzip/untar and QA checks
  • replace all of utils/fetching.py with pooch usage in all datasets that used _fetch_file().

I'm not sure how to handle bst, eegbci, hf_sef, limo and other complicated files yet. I think maybe I can copy @drammock's work? I'll have to investigate.

  • handle bst files
  • handle eegbci files
  • handle hf_sef files
  • handle limo files
  • handle fsaverage files
  • delete _fetch_file and related files

Note that the checksums file for eegbci is 3058 lines long, so in reality this code diff is:

  • ~800 green lines
  • 784 red lines
    as of 09/15/21

Additional information

The next PR should handle:

  1. passing in an optional authorization token to _data_path which would then address Adding an optional token to the dataset fetcher code to allow optional fetching from private repositories #9736
  2. generalize _data_path to a public function, perhaps... mne_dataset_path(), which then can be used by 3rd parties who don't want to store data in the MNE datasets section.

Copy link
Member

@drammock drammock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few prelim comments

environment.yml Outdated Show resolved Hide resolved
mne/datasets/config.py Outdated Show resolved Hide resolved
mne/datasets/config.py Show resolved Hide resolved
@adam2392
Copy link
Member Author

adam2392 commented Sep 15, 2021

So majority of the work was done by @drammock originally in his PR. Really great push! I've added some additional changes to basically remove as much redundant code as possible.

Here, I have basically removed almost the entire utils/fetching.py file because we frankly don't need it if we allow pooch to be an optional requirement for fetching datasets.

This PR does not change the user interface, and so far works pretty great for me to pull datasets. If this PR is approved, then I can easily add in a public interface (turning _data_path to a public function) that assumes some structure on top of pooch to basically enable any downstream MNE package to define their own data_paths, leveraging the _data_path() function.

@adam2392
Copy link
Member Author

adam2392 commented Sep 16, 2021

Fixing the docs error.

Everything else actually LGTM. There's many LOC reduced (~800). The only other error is:

________________________________ test_downloads ________________________________
3404
mne/datasets/tests/test_datasets.py:79: in test_downloads
3405
    new_path = datasets._fake.data_path(update_path=True, **kwargs)
3406
E   Failed: DID NOT WARN. No warnings of type (<class 'RuntimeWarning'>,) was emitted. The list of emitted warnings is: [].

However, I'm not entirely sure why that line is producing an error... Any ideas?

@adam2392
Copy link
Member Author

The error has to do w/ capsys, or calling data_path twice? I think?

This test succeeds:

def test_download(tmp_path, monkeypatch, capsys):
    kwargs = dict(path=str(tmp_path), verbose=True)
    # XXX we shouldn't need to disable capsys here, but there's a pytest bug
    # that we're hitting (https://github.com/pytest-dev/pytest/issues/5997)
    # now that we use pooch
    # with capsys.disabled(): 
    # path = datasets._fake.data_path(update_path=False, **kwargs)

    monkeypatch.setenv('_MNE_FAKE_HOME_DIR', str(tmp_path))
    with pytest.warns(RuntimeWarning, match='non-standard config'):
        new_path = datasets._fake.data_path(update_path=True, **kwargs)

    # assert path == new_path
    out, _ = capsys.readouterr()
    assert 'Downloading' not in out

But this one fails

def test_download(tmp_path, monkeypatch, capsys):
    kwargs = dict(path=str(tmp_path), verbose=True)
    # XXX we shouldn't need to disable capsys here, but there's a pytest bug
    # that we're hitting (https://github.com/pytest-dev/pytest/issues/5997)
    # now that we use pooch
    with capsys.disabled(): 
        path = datasets._fake.data_path(update_path=False, **kwargs)

    monkeypatch.setenv('_MNE_FAKE_HOME_DIR', str(tmp_path))
    with pytest.warns(RuntimeWarning, match='non-standard config'):
        new_path = datasets._fake.data_path(update_path=True, **kwargs)

    assert path == new_path
    out, _ = capsys.readouterr()
    assert 'Downloading' not in out

@adam2392 adam2392 changed the title [WIP] Enable pooch to perform fetching of datasets [MAINT, WIP] Enable pooch to perform fetching of datasets Sep 16, 2021
mne/datasets/utils.py Outdated Show resolved Hide resolved
setup.cfg Outdated Show resolved Hide resolved
Copy link
Member

@drammock drammock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just one last bit of cruft in requirements_testing.txt. I think this is good to go after that.

requirements_testing.txt Outdated Show resolved Hide resolved
@adam2392
Copy link
Member Author

Now there is an error with GH actions though:

    pooch = _soft_import('pooch', 'dataset downloading', strict=True)
61
  File "/usr/share/miniconda/envs/test/lib/python3.7/site-packages/mne-0.24.dev0-py3.7.egg/mne/utils/check.py", line 273, in _soft_import
62
    raise RuntimeError(f'For {purpose} to work, the {name} module is '
63
RuntimeError: For dataset downloading to work, the pooch module is needed, but it could not be imported.

Do we need pooch somewhere besides where we already added?

@drammock
Copy link
Member

Now there is an error with GH actions though:

dang, OK, I'll fix it

@drammock
Copy link
Member

ci failure is unrelated (Windows 3.7 pip timeout)

@@ -0,0 +1,46 @@
# include here the name of the zipped file and the md5 hash
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

later PR: Put into python code and create temp file for internal

@drammock
Copy link
Member

drammock commented Sep 17, 2021

TODO list (from live chat with @agramfort and @adam2392):

  1. make sure pip install mne and then import mne will still work without pooch (i.e., make sure pooch imports are properly nested)
  2. add pooch to test_import_nesting
  3. restructure dataset info as nested dict of dicts; incorporate MD5s into python and change pooch to temporarily write the registry file (separate PR) and whatever else is necessary to make it easier to add new built-in datasets
  4. add generic dataset downloader function, that allows e.g., API keys (separate PR). This should probably look like a data_path() function that takes in URL, archive name, hash, etc, and does everything that one of our built-in data_path() functions does (except for checking/setting config keys)

@adam2392
Copy link
Member Author

Unrelated CI failure.

@drammock
Copy link
Member

Unrelated CI failure.

indeed they look unrelated to this PR, though did you see the ResourceWarning: unclosed file related to EDF export?
https://dev.azure.com/mne-tools/mne-python/_build/results?buildId=15442&view=logs&j=fa98c7b3-575e-55b6-6946-22dea6f6daf8&t=c0866e65-eed5-5461-eadc-f0ff150eefd8&l=3120

@adam2392
Copy link
Member Author

Ah fixed the resource warning. It was an error with not closing the files when a runtime error is raised.

@drammock

@drammock
Copy link
Member

all green! pooch PR is ready for final review

Copy link
Member

@agramfort agramfort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larsoner merge if happy

@@ -261,6 +266,7 @@ def _export_raw(fname, raw, physical_range, add_ch_type):
onset = onset * 10000
duration = duration * 10000
if hdl.writeAnnotation(onset, duration, desc) != 0:
hdl.close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This solution is less clean / DRY than a solution that uses a context manager like with hdl: or, if that's not an option:

@contextmanager
def _close_file(fid):
    try:
        yield
    finally:
        fid.close()

But we can clean this up in a follow-up PR

@@ -17,6 +17,7 @@ elif [ "${TEST_MODE}" == "pip-pre" ]; then
python -m pip install --progress-bar off --upgrade --only-binary ":all" "vtk<=9.0.1"
python -m pip install --progress-bar off https://github.com/pyvista/pyvista/zipball/main
python -m pip install --progress-bar off https://github.com/pyvista/pyvistaqt/zipball/main
python -m pip install --progress-bar off pooch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed because the last line includes pip install -r requirements_testing.txt and it's in there

Comment on lines +34 to +35
echo "pooch"
pip install --progress-bar off pooch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

@larsoner larsoner merged commit c42a7b7 into mne-tools:main Sep 19, 2021
@larsoner
Copy link
Member

Thanks @adam2392 @drammock ! I'll open a quick PR for my tiny comments

Comment on lines +297 to +302
hf_sef_raw=pooch.Untar(
extract_dir=path, members=[f'hf_sef/{subdir}' for subdir in
('MEG', 'SSS', 'subjects')]),
hf_sef_evoked=pooch.Untar(
extract_dir=path, members=[f'hf_sef/{subdir}' for subdir in
('MEG', 'SSS', 'subjects')]),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drammock do you know why we needed hf_Sef to be like this? These LOC I copied seems to break circle CI in : #9774

It seems to work perfectly fine w/ processor = 'untar' untaring to the path. I confirm that the error was not raised when I changed the processor to just regular pooch.Untar(extract_dir=path

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, I copied these from your original PR aimed to tackle #8674

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI, MAINT: Unclosed file ResourceWarning in EDF export
4 participants