Support video and frames files associated with sample data #171

niksirbi · 2024-05-02T14:19:38Z

Description

What is this PR

Bug fix
Addition of a new feature
Other

Why is this PR needed?
Having video files, and/or frames extracted from those videos, associated with existing sample pose files will greatly facilitate the development and debugging of GUIs, because it would allow us to plot trajectories over a meaningful background, define ROIs etc. See all the issues linked in References.

What does this PR do?

It overhauls the sample_data.py module to allow for the fetching of videos and/or frames alongside the fetching of pose files.
All the changes were done in conjunction with changes to the data repository on GIN and should be interpreted together.

Changes to the data repository:

added folders for "videos" and "frames", in addition to the existing "poses" folder. Populated these folders with files for which the researcher(s) have given permission to share. Some video/frame files are shared across multiple pose files (because e.g. the same video was analysed with DeepLabCut and SLEAP).
the metadata file is now named metadata.yaml. I added new metadata fields, to express the association between pose datasets and videos/frames. Here's an example entry:

- file_name: "SLEAP_three-mice_Aeon_proofread.analysis.h5"
  sha256sum: "82ebd281c406a61536092863bc51d1a5c7c10316275119f7daf01c1ff33eac2a"
  source_software: "SLEAP"
  fps: 50
  species: "mouse"
  number_of_individuals: 3
  shared_by:
    name: "Chang Huan Lo"
    affiliation: "Sainsbury Wellcome Centre, UCL"
  frame:
    file_name: "three-mice_Aeon_frame-5sec.png"
    sha256sum: "889e1bbee6cb23eb6d52820748123579acbd0b2a7265cf72a903dabb7fcc3d1a"
  video:
    file_name: "three-mice_Aeon_video.avi"
    sha256sum: "bc7406442c90467f11a982fd6efd85258ec5ec7748228b245caf0358934f0e7d"
  note: "All labels were proofread (user-defined) and can be considered ground truth. It was exported from the .slp file with the same prefix."

added a new convenience script get_sha256_hashes.py which will iterate over all files in poses, videos, and frames and write the results to txt files (poses_hashes.txt, videos_hashes.txt, frames_hashes.txt). It doesn't go all the way to fully automate the generation of metadata.yaml entries, but it is an improvement on previous practices.
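
The actual get_sha256_hashes.py lives in the GIN data repository and is not shown in this PR, so the following is only a minimal sketch of what such a script might look like. The function names (`sha256_of_file`, `write_hashes`) and the "name hash" line format are assumptions made for illustration; only the folder names and output file names come from the description above.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_hashes(data_dir: Path, folders=("poses", "videos", "frames")) -> dict:
    """Write <folder>_hashes.txt per folder and return {folder: {name: hash}}."""
    results = {}
    for folder in folders:
        folder_path = data_dir / folder
        if not folder_path.is_dir():
            continue  # skip folders that do not exist
        hashes = {
            p.name: sha256_of_file(p)
            for p in sorted(folder_path.iterdir())
            if p.is_file()
        }
        # Assumed output format: one "file_name hash" pair per line
        out_file = data_dir / f"{folder}_hashes.txt"
        out_file.write_text("".join(f"{name} {h}\n" for name, h in hashes.items()))
        results[folder] = hashes
    return results


# Demonstration on a throwaway directory with one dummy file per folder
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for folder in ("poses", "videos", "frames"):
        (root / folder).mkdir()
        (root / folder / "dummy.bin").write_bytes(folder.encode())
    demo_hashes = write_hashes(root)
    demo_listing = (root / "poses_hashes.txt").read_text()
```

The resulting hashes would still need to be copied by hand into the corresponding metadata.yaml entries.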

Changes to the code repository (this PR):

The sample_data.py module now exposes 3 public functions:
- list_datasets(): returns the filenames of the sample pose files (the ones in the poses folder)
- fetch_dataset_paths(filename): given the filename of a valid pose dataset (one that the above function returns), returns a dict of 3 local paths, with keys "poses", "video", "frame". If a video or frame is missing, its value is None.
```
 from movement.sample_data import fetch_dataset_paths
 paths = fetch_dataset_paths("DLC_single-mouse_EPM.predictions.h5")
 poses_path = paths["poses"]
 frame_path = paths["frame"]
 video_path = paths["video"]
```
- fetch_dataset(filename): given the filename of a valid pose dataset (one that list_datasets() returns), calls fetch_dataset_paths(filename) and proceeds to load the "poses" into a movement dataset. The "video" and "frame" paths do not get loaded (for now); they are simply stored as dataset attributes.
```
 from movement.sample_data import fetch_dataset
 ds = fetch_dataset("DLC_single-mouse_EPM.predictions.h5")
 frame_path = ds.frame_path
 video_path = ds.video_path
```
This is the function we expect to be most used, and the updated docs reflect that.
Tests, docs, and contributing guide have been updated accordingly. The availability of video frames means that we can also add images to the plots in our examples, but I haven't done this here, as it will be part of the big docs re-organisation (see "Restructure docs informed by the diataxis framework", #70).
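
Because a "video" or "frame" value can be None, calling code may want to guard against missing files before using them. The following is a small sketch of one way to do that. The helper `available_assets` and the example paths are hypothetical and are not part of movement; the dict merely mimics the shape that fetch_dataset_paths is described as returning.

```python
from pathlib import Path


def available_assets(paths: dict) -> dict:
    """Keep only the entries whose path is not None, as Path objects."""
    return {key: Path(value) for key, value in paths.items() if value is not None}


# Hypothetical return value for a dataset that has a video but no frame
example_paths = {
    "poses": "/tmp/sample_data/poses/example_poses.h5",
    "frame": None,
    "video": "/tmp/sample_data/videos/example_video.mp4",
}

assets = available_assets(example_paths)
# Only attempt to draw a background image if a frame is actually available
has_background = "frame" in assets
```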

References

Closes #38.
Closes #121 because the syntax is much less awkward now (with fewer redundancies), and I think there is no longer a clear need for rewriting the sample_data.py module into a class.

Facilitates #105, #49, #50, #48, #164.

How has this PR been tested?

Updated existing tests in test_sample_data.py.

Is this a breaking change?

Yes, the API for fetching sample datasets has changed. This PR needs to be merged ahead of any others, because the changes to the GIN data repository have broken CI, and it will remain broken until this is merged.

Does this PR require an update to the documentation?

Yes, I've updated the relevant sections of the docs.

Checklist:

The code has been tested locally
Tests have been added to cover all new functionality
The documentation has been updated to reflect any changes
The code has been formatted with pre-commit

EDIT 2024-05-07

Following @lochhh 's suggestion, the metadata.yaml has been reformatted as a dict of dicts, using the pose file names as top-level dict keys:

"SLEAP_three-mice_Aeon_proofread.analysis.h5":
  sha256sum: "82ebd281c406a61536092863bc51d1a5c7c10316275119f7daf01c1ff33eac2a"
  source_software: "SLEAP"
  fps: 50
  species: "mouse"
  number_of_individuals: 3
  shared_by:
    name: "Chang Huan Lo"
    affiliation: "Sainsbury Wellcome Centre, UCL"
  frame:
    file_name: "three-mice_Aeon_frame-5sec.png"
    sha256sum: "889e1bbee6cb23eb6d52820748123579acbd0b2a7265cf72a903dabb7fcc3d1a"
  video:
    file_name: "three-mice_Aeon_video.avi"
    sha256sum: "bc7406442c90467f11a982fd6efd85258ec5ec7748228b245caf0358934f0e7d"
  note: "All labels were proofread (user-defined) and can be considered ground truth. It was exported from the .slp file with the same prefix."

This simplifies the logic inside sample_data.py quite a bit!
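
To illustrate why, here is a rough sketch comparing lookups in the two layouts. This is not the actual code in sample_data.py; the function names and the abridged metadata entries are made up for the example.

```python
# Abridged, hypothetical metadata in the original list-of-dicts layout
metadata_list = [
    {
        "file_name": "poses_A.h5",
        "sha256sum": "hash-of-A",
        "frame": {"file_name": "frame_A.png", "sha256sum": "hash-of-frame-A"},
    },
    {"file_name": "poses_B.h5", "sha256sum": "hash-of-B"},
]


def find_in_list(metadata: list, file_name: str) -> dict:
    """List layout: every lookup has to scan the entries for a match."""
    for entry in metadata:
        if entry["file_name"] == file_name:
            return entry
    raise ValueError(f"No metadata entry for {file_name}")


# The same content in the dict-of-dicts layout, keyed by pose file name
metadata_dict = {
    entry["file_name"]: {k: v for k, v in entry.items() if k != "file_name"}
    for entry in metadata_list
}


def frame_file_name(metadata: dict, file_name: str):
    """Dict layout: direct key access; returns None if there is no frame."""
    frame = metadata[file_name].get("frame")
    return frame["file_name"] if frame else None


# Listing the available datasets becomes a matter of reading the keys
dataset_names = list(metadata_dict)
```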

codecov · 2024-05-02T14:20:35Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.67%. Comparing base (a30e796) to head (2551b30).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #171   +/-   ##
=======================================
  Coverage   99.66%   99.67%           
=======================================
  Files          10       10           
  Lines         605      619   +14     
=======================================
+ Hits          603      617   +14     
  Misses          2        2


lochhh

Thanks @niksirbi! The new sample_data functions are great and make a lot of sense. Most of my comments are related to restructuring the metadata.yaml to have the "pose" file as the key, which will simplify a few functions. I'll let you decide on this.

CONTRIBUTING.md

movement/sample_data.py

lochhh · 2024-05-07T15:24:29Z

tests/test_unit/test_filtering.py

@@ -74,7 +74,7 @@ def test_filter_by_confidence(sample_dataset, caplog):
            ).values[:, 0]
        )
    )
-    assert n_nans == 3213
+    assert n_nans == 2555


Asking out of curiosity, how has this changed?

Because I actually changed the underlying dataset! Before we had "DLC_single-mouse_EPM" and "SLEAP_single-mouse_EPM", but they did not actually correspond to the same video. I updated the DLC one so it matches the SLEAP one (enabling us to use the same video and frame).

niksirbi · 2024-05-07T17:45:02Z

Thanks a lot @lochhh! I like your suggestion to use the pose file names as keys, it indeed simplifies things a lot. I will give it a try.


sonarcloud · 2024-05-07T18:58:46Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code


niksirbi · 2024-05-07T19:04:13Z

Thanks a lot @lochhh! I like your suggestion to use the pose file names as keys, it indeed simplifies things a lot. I will give it a try.

I've implemented this and it works! I've added this as an "EDIT" to the PR's description.

niksirbi added 8 commits May 1, 2024 19:30

updated sample data tests to reflect GIN repo changes

5edab8e

renamed poses_files_metadata.yaml to metadata.yaml

d753ae7

generate joint registry for poses, videos, frames

a0c1096

renamed "file" metadata field to "file_name" for consistency

9b34519

re-organised sample_data module and its tests

83f9b83

updated contributing guide

b75260f

updated getting started guide and the loading example

f4c8acb

log ValueError before raising

8a4909d

niksirbi marked this pull request as ready for review May 2, 2024 14:56

niksirbi requested a review from lochhh May 2, 2024 14:56

lochhh approved these changes May 7, 2024


niksirbi and others added 4 commits May 7, 2024 18:46

fix typo in CONTRIBUTING.md

824c6cd

Co-authored-by: Chang Huan Lo <changhuan.lo@ucl.ac.uk>

adapted code to use dataset names as keys for the metadata dict

36260b6

updated metadata entry example in contributing guide

7a47584

define METADATA_FILE at the top

2551b30

niksirbi added this pull request to the merge queue May 7, 2024

Merged via the queue into main with commit 065755d May 7, 2024
27 checks passed

niksirbi deleted the sample-videos-frames branch May 7, 2024 19:24

niksirbi mentioned this pull request May 13, 2024

Define the "movement dataset" and reorganise the Getting Started section #177

Merged

7 tasks
