Improve file type detection of NetCDF and HDF5 #9117

Closed
pdurbin opened this issue Nov 2, 2022 · 3 comments · Fixed by #9152
Labels
pm.netcdf-hdf5.d All 3 aims are currently under this deliverable
Comments

Member

pdurbin commented Nov 2, 2022

According to Wikipedia, NetCDF and HDF5 have magic numbers that should let us detect these file types more easily and reliably than guessing based on file extensions.

NetCDF magic number

CDF\001
\211HDF\r\n\032\n

HDF5 magic number

\211HDF\r\n\032\n
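To make the signatures above concrete, here is a minimal sketch of content-based detection. The function names and the returned labels are hypothetical and chosen for illustration only; this is not the code that ended up in Dataverse. It assumes the second NetCDF signature listed above covers NetCDF-4 files, which are stored as HDF5, so the signature alone cannot separate the two.

```python
from typing import Optional

# Signatures as quoted in the issue, with octal escapes rewritten as hex bytes.
NETCDF_CLASSIC_MAGIC = b"CDF\x01"       # CDF\001
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"       # \211HDF\r\n\032\n


def sniff_type(header: bytes) -> Optional[str]:
    """Return a (hypothetical) label for the signature that starts `header`."""
    if header.startswith(HDF5_MAGIC):
        # NetCDF-4 files are HDF5 containers, so both share this signature.
        return "hdf5-or-netcdf4"
    if header.startswith(NETCDF_CLASSIC_MAGIC):
        return "netcdf-classic"
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read only the first few bytes of a local file and classify them."""
    with open(path, "rb") as f:
        # Eight bytes is enough for the longest signature checked here.
        # Assumption: the signature sits at offset 0; HDF5 files that start
        # with a user block are not handled by this sketch.
        return sniff_type(f.read(len(HDF5_MAGIC)))
```

Because only the leading bytes are inspected, a file named without ".nc" or ".h5" is classified the same way as one with the expected extension.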

I brought this up at standup today and here are some notes from the discussion:

  • We should see if JHOVE can detect them.
  • Normally, seeking into a file to detect its type is something we do specifically as part of detecting tabular files.
  • Given that NetCDF can be big, this might be a case where switching to a ranged request to find the signature might be important.
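The ranged-request idea from the notes above could look roughly like the following sketch, assuming the file is reachable over HTTP from a server that honors the Range header. The function names, the `opener` parameter, and the example URL are hypothetical; nothing here reflects how Dataverse actually reads from its storage.

```python
import urllib.request

SIGNATURE_LENGTH = 8  # length of the longest signature quoted in this issue


def build_signature_request(url: str, length: int = SIGNATURE_LENGTH) -> urllib.request.Request:
    """Build a GET request that asks for only the first `length` bytes."""
    request = urllib.request.Request(url)
    # HTTP byte ranges are inclusive, so bytes=0-7 requests eight bytes.
    request.add_header("Range", f"bytes=0-{length - 1}")
    return request


def fetch_signature(url: str, length: int = SIGNATURE_LENGTH,
                    opener=urllib.request.urlopen) -> bytes:
    """Return at most `length` bytes from the start of the remote file."""
    with opener(build_signature_request(url, length)) as response:
        # A server that ignores Range may reply with the whole body,
        # so cap the read rather than trusting the response size.
        return response.read(length)
```

The point of the design is that a multi-gigabyte NetCDF file costs the same to classify as a tiny one: only the first eight bytes are transferred when the server cooperates, and only eight bytes are read even when it does not.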

We should add some NetCDF and HDF5 files to https://github.com/IQSS/dataverse-sample-data to test with, at some point.

Related:

@mreekie mreekie added this to This Sprint 🏃‍♀️ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) via automation Nov 3, 2022
@mreekie mreekie added the pm.netcdf-hdf5.d All 3 aims are currently under this deliverable label Nov 3, 2022
@pdurbin pdurbin moved this from This Sprint 🏃‍♀️ to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Nov 8, 2022
@pdurbin pdurbin self-assigned this Nov 8, 2022

mreekie commented Nov 8, 2022

Updated project statuses.


mreekie commented Nov 22, 2022

@pdurbin dumb question.

  • This issue was linked to a PR 2 weeks ago.
    • Per the process, it was then removed from the dataverse sprint board.
  • So when the PR makes it through QA, this issue will be automatically closed, right?

Member Author

pdurbin commented Nov 22, 2022

Yes, the issue will be closed when it is merged.

In our PRs we use the magic "closes" syntax.

We'll write something like this:

"Closes #12345"

This makes an association between the PR and the issue (or multiple issues). When you merge the PR, the issue (or multiple issues) is closed.

Please see https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/using-keywords-in-issues-and-pull-requests#linking-a-pull-request-to-an-issue
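As a rough illustration of the keyword syntax described above, the sketch below pulls issue numbers out of a PR description. It is a simplified approximation written for this thread, not GitHub's own parser: it only handles same-repository references of the form "keyword #number", and the function name is hypothetical.

```python
import re
from typing import List

# Closing keywords listed in the GitHub documentation linked above.
CLOSING_KEYWORDS = (
    "close", "closes", "closed",
    "fix", "fixes", "fixed",
    "resolve", "resolves", "resolved",
)

# Match a whole keyword, whitespace, then "#<digits>"; case-insensitive.
_CLOSING_PATTERN = re.compile(
    r"\b(?:%s)\s+#(\d+)" % "|".join(CLOSING_KEYWORDS),
    re.IGNORECASE,
)


def issues_closed_by(description: str) -> List[int]:
    """Return the issue numbers a PR description would close, in order."""
    return [int(number) for number in _CLOSING_PATTERN.findall(description)]
```

A plain mention such as "see #9117" is not picked up, which matches the behavior described above: only the keyword form creates the association that closes the issue on merge.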

pdurbin added a commit that referenced this issue Nov 22, 2022
Also fix test so it doesn't rely on the file extension ".nc".
kcondon added a commit that referenced this issue Nov 22, 2022
detect NetCDF and HDF5 files based on content #9117
@pdurbin pdurbin added this to the 5.13 milestone Nov 22, 2022