Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Support extracting DICOMs from ZIP files (and possibly other archives) by switching to use shutil.unpack_archive instead of tarfile module functionality #471

Merged
merged 14 commits into from Apr 6, 2023

Conversation

HippocampusGirl
Copy link
Contributor

Dear heudiconv team,

I was recently working on converting a DICOM dataset to BIDS with heudiconv. Our imaging facility sends us the DICOM files as ZIP archives.

While I could extract the ZIP files manually before running heudiconv, I thought it may be useful to have on-the-fly extraction just like for tarballs. I implemented this via the built-in Python module "zipfile".

@codecov
Copy link

codecov bot commented Oct 27, 2020

Codecov Report

Patch coverage: 93.46% and project coverage change: +5.39 🎉

Comparison is base (d855f64) 76.13% compared to head (99b0bf6) 81.52%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #471      +/-   ##
==========================================
+ Coverage   76.13%   81.52%   +5.39%     
==========================================
  Files          37       43       +6     
  Lines        2962     3930     +968     
==========================================
+ Hits         2255     3204     +949     
- Misses        707      726      +19     
Impacted Files Coverage Δ
heudiconv/cli/monitor.py 31.52% <ø> (ø)
heudiconv/cli/run.py 91.48% <ø> (ø)
heudiconv/external/dlad.py 97.67% <ø> (ø)
heudiconv/heuristics/multires_7Tbold.py 15.90% <ø> (ø)
heudiconv/heuristics/test_reproin.py 100.00% <ø> (ø)
heudiconv/tests/test_queue.py 100.00% <ø> (ø)
heudiconv/tests/anonymize_script.py 54.54% <33.33%> (ø)
heudiconv/due.py 48.00% <48.00%> (ø)
heudiconv/tests/test_dicoms.py 85.88% <75.00%> (-14.12%) ⬇️
heudiconv/heuristics/reproin.py 84.29% <77.50%> (+3.18%) ⬆️
... and 21 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@HippocampusGirl
Copy link
Contributor Author

I just discovered that this works even without the change I uploaded. Sorry, my bad :-)

@yarikoptic
Copy link
Member

interesting -- stock heudiconv doesn't work for me:

$> heudiconv -f reproin --bids -o /tmp/h --files bids_anat-T1w.zip 
INFO: Running heudiconv version 0.9.0 latest 0.9.0
INFO: Analyzing 1 dicoms
INFO: Filtering out 0 dicoms based on their filename
WARNING: Ignoring bids_anat-T1w.zip since not quite a "normal" DICOM: 'FileDataset' object has no attribute 'SeriesNumber'
INFO: Generated sequence info for 0 studies with 0 entries total
INFO: Need to process 0 study sessions

so may be there is still a need in this/such PR. Do you remember with which version it has worked?

Haven't reviewed changes but tried with this PR and above worked... so we still need it seems to me.

@yarikoptic yarikoptic reopened this Feb 6, 2021
@yarikoptic
Copy link
Member

From cursory review looks ok but needs a unittest to ensure that it works. Test could reuse existing files, zip them, and do conversion.

@HippocampusGirl
Copy link
Contributor Author

Interesting, maybe some zip files work with tarfile and some do not. Or maybe I made a mistake and got things mixed up, who knows ;-)

I agree that a unit test would be important. I am currently quite busy, but I will have time to write one next week, so please stay tuned.

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure yet how it could have worked on zip, most likely some fluke (e.g. .tar.gz was named with .zip extension or alike). So I think this enhancement has value so I will keep it live.

It will also need a unittest (produce .zip in the test code, give to function, see that does the right thing).

Meanwhile I will convert to draft. please advise if you are going to work on it further @HippocampusGirl . Cheers

heudiconv/parser.py Outdated Show resolved Hide resolved
@yarikoptic yarikoptic marked this pull request as draft February 17, 2023 15:31
@HippocampusGirl
Copy link
Contributor Author

Hi @yarikoptic, I think I got confused when I closed the pull request. I also agree that this is a valuable feature and have been using my changes for multiple projects.

I want to work on this some more, but cannot prioritize this right now. If you or anyone else is willing to step in, then things may be faster.

@psadil
Copy link
Contributor

psadil commented Mar 23, 2023

I want to work on this some more, but cannot prioritize this right now. If you or anyone else is willing to step in, then things may be faster.

Hello @HippocampusGirl, is it still the case that this feature is on hold? I would be interested in the ability to work with zip files and would be happy to bring this pull request over the finish line.

@HippocampusGirl
Copy link
Contributor Author

Hi @psadil, That would be wonderful!

@psadil
Copy link
Contributor

psadil commented Mar 23, 2023

Hi @psadil, That would be wonderful!

Great!

I wonder what's going to be a good way to contribute code. I have this potential solution that builds on the pull request's initial commit, d076a07. Do I submit a pull request into HippocampusGirl:add-zipfile-support so that it shows up in this #471 request? Can I directly add on to #471? Should I open a new pull request for the main nipy/heudiconv repo?

@yarikoptic perhaps you have guidance?

Thank you

@HippocampusGirl
Copy link
Contributor Author

Hi @psadil, I have given you write access to my repository. You should be able to push into my branch now, so that it shows up as part of the pull request :-)

@psadil
Copy link
Contributor

psadil commented Mar 29, 2023

@yarikoptic (tagging you as the original reviewer) - I think this may be ready to go. As you requested in the original review, the unpacking is now done with the shutil module, and several new tests of get_extracted_dicoms have been added.

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • major concern -- seems to be a change in behavior in case of a mixed list of files/archives. I would prefer to keep original behavior unless there is really a strong need
  • some minor comments on styling etc.

heudiconv/parser.py Outdated Show resolved Hide resolved
import tarfile
from tempfile import mkdtemp
import shutil
import typing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think frequent typing. makes code too cluttered. I am ok with imports of the constructs we use as we already do elsewhere

heudiconv/dicoms.py:from typing import List, Optional

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note -- ATM we do not check typing information at all. Filed #653

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Looking forward to newer versions that allow the use of builtins like list or | :)

heudiconv/parser.py Outdated Show resolved Hide resolved
heudiconv/tests/test_archives.py Outdated Show resolved Hide resolved
heudiconv/tests/test_archives.py Outdated Show resolved Hide resolved

Returns
-------
ItemsView[int | None, list[str]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately on this aspect there even were not discussion so I initiated one numpy/numpydoc#196 (comment) . I guess we are doomed to keep it as this for now here.

heudiconv/parser.py Outdated Show resolved Hide resolved
input_list_of_unpacked or
all(t.endswith(_UNPACK_FORMATS) for t in fl)
):
raise ValueError("Some but not all input files are archives.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think prior code was handling that just fine -- and add non-archive to None session. Is there a reason to introduce this breaking behavior? I think it might have really bad manifestations (didn't check) e.g. by pointing to a folder with some tarballs and some files to just be ignored which we ignore if not dicom later (?) in the code, etc.

typing.List[str]
] = defaultdict(list)

if not isinstance(fl, list):
fl = list(fl)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we even need this "casting"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check. There was a similar cast before

if not isinstance(fl, (list, tuple)):
fl = list(fl)

but perhaps that can be dropped altogether

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah -- that is what I wondered on why it was needed before. I wish smarter earlier us had left a comment ;)

# contained within each archive. So, files are temporarily
# extracted into unique tempdirs
# cannot use TempDirs since will trigger cleanup with __del__
tmpdir = tempdirs(prefix="heudiconvDCM")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make that tmpdir 0o700 right away here so no files extracted would have any moment in time when they are accessible to non-authorized user (i.e. until chmoding below)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! This seems especially relevant considering that the files are unpacked into a generic tmp directory, which could be a surprising risk to privacy.

@psadil
Copy link
Contributor

psadil commented Mar 29, 2023

  • major concern -- seems to be a change in behavior in case of a mixed list of files/archives. I would prefer to keep original behavior unless there is really a strong need

Ah, sorry. I thought that I was making something explicit that was previously only implicit. I didn't have a strong need. I'll update it to match the current behavior

  • some minor comments on styling etc.

Great! I'm looking through these and will revise accordingly

psadil and others added 7 commits March 29, 2023 18:16
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
@psadil
Copy link
Contributor

psadil commented Mar 30, 2023

Okay, now revised

@psadil
Copy link
Contributor

psadil commented Apr 6, 2023

Checking in on this -- the code is ready for another round of review

@yarikoptic yarikoptic changed the title ENH: Support extracting DICOMs from ZIP files ENH: Support extracting DICOMs from ZIP files (and possibly other archives) by switching to use shutil.unpack_archive instead of tarfile module functionality Apr 6, 2023
Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code styling is a bit odd to me, but without black I cannot insist really ;)

Otherwise I think it looks great and let's give it a try! Thank you @HippocampusGirl and @psadil !!!

@yarikoptic yarikoptic added the minor Increment the minor version when merged label Apr 6, 2023
@yarikoptic yarikoptic marked this pull request as ready for review April 6, 2023 02:39
@yarikoptic yarikoptic merged commit 3940813 into nipy:master Apr 6, 2023
8 checks passed
@github-actions
Copy link

github-actions bot commented May 8, 2023

🚀 PR was released in v0.13.0 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
minor Increment the minor version when merged released
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants