
GOAL: Large-data-handling #32

Closed
4 of 40 tasks
droumis opened this issue Jun 6, 2023 · 1 comment
Labels
ca-imaging calcium imaging eeg eeg-viewer eeg-viewer workflow ephys ephys-viewer ephys-viewer workflow video-viewer

Comments


droumis commented Jun 6, 2023

UPDATE: This initiative has been superseded by other, more targeted efforts

Summary and Links

  • large-data-handling (lead: ): Develop a first pass solution for the various workflow types
  • Large data handling meeting notes
  • Below, the phases are prioritized into domain sections, going ephys > imaging > eeg. This is because:
  • Ephys data has a very high sampling rate (30 kHz) and (in the last few years) many channels (>100), so large datasets are ubiquitous.
    • Imaging data also gets large pretty quickly and dealing with larger datasets has been communicated as the primary pain point for miniscope users in the Minian pipeline.
    • There are already feature-rich browser-based EEG viewers, but they cannot handle large datasets very well, so although EEG datasets are typically not as large as in ephys or imaging, it's still relevant.

Note: each domain section below starts with some important 'Context'.

Task Planning:

Electrophysiology (Ephys)

**Context:**

While the continuous, raw data is streamed and viewed during data acquisition, it's not critical to look at the full-band 30 kHz version during processing/analysis. Instead, the raw-ish displays are the low-pass filtered (<1000 Hz) continuous data (like a filtered version of the ephys viewer workflow), a stacked 'spike raster' of action potential events (see spike raster workflow), and a view of the spike waveforms (see waveform workflow). These three workflows represent different challenges for large-data handling and may require specific approaches.
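To make the "low-pass filtered (<1000 Hz)" step above concrete, here is a minimal sketch of that kind of preprocessing, assuming SciPy and synthetic data in place of a real recording (the array shapes and cutoff are illustrative, not the project's actual pipeline):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, decimate

FS = 30_000      # 30 kHz acquisition rate, as described above
CUTOFF = 1_000   # low-pass cutoff in Hz

rng = np.random.default_rng(0)
raw = rng.normal(size=(4, FS * 2))  # 4 channels, 2 s of synthetic data

# Zero-phase low-pass filter below 1 kHz
sos = butter(4, CUTOFF, btype="low", fs=FS, output="sos")
lfp = sosfiltfilt(sos, raw, axis=-1)

# Downsample to 2.5 kHz now that content above 1 kHz is removed
lfp_ds = decimate(lfp, q=12, axis=-1, zero_phase=True)
```

Filtering before decimation avoids aliasing, and the downsampled copy is already much cheaper to plot than the full-band signal.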

Additionally, although there is a lot of heterogeneity in technique and equipment in electrophysiology, below we focus on Allen Institute data. This is advantageous because they have a well-funded group maintaining their SDK, they use Neuropixels probes, which have a relatively high channel count (and therefore represent a more difficult use case), and their data are available via the NWB 2.0 file format (fancy HDF5), which is becoming increasingly common in neuroscience. Demetris has some contacts at the Allen Institute, but we haven't yet engaged with them for feedback/collaboration; this will happen once we have something to show them that is demonstrably better than their current approach. Additionally, we are collaborating with one of Jim's former colleagues, who works primarily with relatively small spike-time datasets (some real, some synthetic) and is primarily interested in spike-raster-type workflows, so the work below will benefit his group as well even though we will focus on Allen Institute data.

Ephys Phase 1: Understanding the ecosystem, problems, and foundations for the solution

  • Ecosystem Review: Read about the existing ecosystem on ephys data/viz, challenges, and commonly used tools. (Living notes here)
  • Research Common Workflows: Identify and establish the most common workflow(s) for accessing and visualizing Neuropixels data. This will be crucial for later benchmarking. (Relevant materials here)
  • Allen Institute Data Familiarization: Download and understand the structure of an NWB formatted Neuropixels dataset from the Allen Institute.
  • Review Pangeo Community Tools: Familiarize with the Pangeo community approach, emphasizing scalable and modular workflows.
  • Review Pandata SOSA Whitepaper: Read Jim and Martin's whitepaper to potentially align the project with their principles and suggestions when applicable.

Ephys Phase 2: Building an MVP

  • Data Conversion to Zarr: Convert a subset of the Allen Institute data from NWB (HDF5) format to Zarr for efficient chunking.
  • Integration with Dask and Xarray: Create a notebook and/or script demonstrating an optimized workflow for data access, emphasizing Dask for scalable computations and Xarray for labeled data structures, aligning with the Pangeo/Pandata stack principles when applicable.
  • Basic Visualization: Incorporate basic visualization workflows into this pipeline using HoloViz with Bokeh backend (see spike raster and ephys viewer workflows - Demetris can continue to lead on this task).
  • User Feedback on MVP: Present the MVPs (spike raster, continuous traces) to a subset of potential users and gather feedback for refinement.
  • Refinement based on Feedback: Make necessary adjustments to the MVPs based on user feedback.
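The Dask/Xarray task above can be sketched roughly as follows, using a synthetic stand-in for a Neuropixels recording (a real workflow would read from the NWB/HDF5 file instead; the dimension names, chunk sizes, and file name are illustrative assumptions):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in: 384 channels x 1 s @ 30 kHz (real data would come from NWB)
data = np.random.default_rng(0).normal(size=(384, 30_000)).astype("float32")

da = xr.DataArray(
    data,
    dims=("channel", "time"),
    coords={
        "channel": np.arange(384),
        "time": np.arange(30_000) / 30_000,  # seconds
    },
    name="ephys",
).chunk({"channel": 96, "time": 10_000})  # dask-backed and lazy from here on

# Computations stay lazy until explicitly evaluated
channel_means = da.mean(dim="time")
result = channel_means.compute()

# Persisting the chunked array for later sessions could look like:
# da.to_dataset().to_zarr("allen_subset.zarr")
```

The labeled dimensions let downstream viewers select by channel and time in physical units, while the chunking means only the visible slice needs to be loaded.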

Ephys Phase 3: Benchmarking the MVP

  • Define Benchmark Metrics: Establish concrete metrics to evaluate performance, usability, and user experience. (See benchmarking notes)
  • Benchmark Against Common Workflows: Compare the MVPs against the previously identified common workflows for accessing and visualizing Neuropixels data.
  • Document Benchmark Results: Document the benchmark results, emphasizing areas where the MVPs clearly differ from existing solutions.
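For the compute side of these metrics, a stdlib-only timing harness is probably enough to start; the loader names in the commented usage are hypothetical placeholders, and usability/UX metrics would need separate instrumentation:

```python
import time
import statistics

def benchmark(fn, *args, repeats=5):
    """Time fn(*args) `repeats` times and return summary stats in seconds."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return {"median": statistics.median(times), "min": min(times), "max": max(times)}

# Hypothetical usage: compare two data-access strategies on the same slice
# stats_hdf5 = benchmark(load_slice_hdf5, "session.nwb")
# stats_zarr = benchmark(load_slice_zarr, "session.zarr")
stats = benchmark(sum, range(100_000))
```

Reporting the median rather than the mean makes the numbers robust to one-off warm-up or caching effects.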

Ephys Phase 4: Advanced Visualization Techniques

  • Integrate Datashader: Add variant using Datashader for efficient rendering of large datasets.
  • Explore Decimation: Add variant using data decimation techniques like HoloViews' LTTB.
  • Bokeh with WebGL: Add variant using Bokeh's WebGL backend.
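To show what the LTTB (Largest-Triangle-Three-Buckets) option does under the hood, here is a minimal standalone NumPy sketch of the algorithm; this is an illustration of the technique itself, not the HoloViews API:

```python
import numpy as np

def lttb(x, y, n_out):
    """Largest-Triangle-Three-Buckets downsampling (minimal NumPy version).

    Keeps the first and last points, splits the interior into n_out - 2
    buckets, and from each bucket keeps the point forming the largest
    triangle with the previously kept point and the next bucket's average.
    """
    n = len(x)
    if n_out >= n or n_out < 3:
        return np.asarray(x), np.asarray(y)
    edges = np.linspace(1, n - 1, n_out - 1).astype(int)  # interior bucket bounds
    xi, yi = [x[0]], [y[0]]
    a = 0  # index of the previously selected point
    for i in range(n_out - 2):
        lo, hi = edges[i], edges[i + 1]
        # Average of the *next* bucket (the last raw point for the final bucket)
        if i < n_out - 3:
            nx = x[edges[i + 1]:edges[i + 2]].mean()
            ny = y[edges[i + 1]:edges[i + 2]].mean()
        else:
            nx, ny = x[-1], y[-1]
        # Triangle areas (x2) for every candidate point in the current bucket
        areas = np.abs(
            (x[a] - nx) * (y[lo:hi] - y[a]) - (x[a] - x[lo:hi]) * (ny - y[a])
        )
        a = lo + int(np.argmax(areas))
        xi.append(x[a]); yi.append(y[a])
    xi.append(x[-1]); yi.append(y[-1])
    return np.array(xi), np.array(yi)

# E.g. squeeze a 30k-sample trace down to 1k points for plotting
x = np.arange(30_000, dtype=float)
y = np.sin(x / 500.0)
xd, yd = lttb(x, y, 1_000)
```

Unlike naive striding, LTTB keeps visually salient extremes, which matters for spotting spikes in decimated ephys traces.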

Ephys Phase 5: Minimap/multi-scale

  • Evaluate xarray DataTree: Explore the xarray datatree for multi-scale rendering. (some exploratory notes here)
  • Leverage Minimap/RangeTool: Leverage our work on the minimap feature. (Maybe the minimap could use a precomputed lower-res rendering stored in the datatree to speed up initial display, or the minimap could help ensure that only an optimized subset of channels/time is displayed at first.)
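The multi-scale idea above amounts to precomputing a pyramid of downsampled levels and picking a level per view, which is the role an xarray DataTree could play by storing each level as a group. A NumPy-only sketch (function names and factors are illustrative assumptions):

```python
import numpy as np

def build_pyramid(data, factors=(1, 4, 16, 64)):
    """Precompute progressively downsampled copies of (channels, time) data.

    A coarse level can back the minimap / initial display, with finer
    levels fetched as the user zooms in.
    """
    levels = {}
    for f in factors:
        n = (data.shape[-1] // f) * f
        # Block-mean downsampling along time; min/max envelopes are another
        # option when peaks must be preserved
        levels[f] = data[..., :n].reshape(*data.shape[:-1], -1, f).mean(axis=-1)
    return levels

def pick_level(levels, visible_samples, max_points=4_000):
    """Choose the finest level that still draws at most max_points per trace."""
    for f in sorted(levels):
        if visible_samples // f <= max_points:
            return f
    return max(levels)

data = np.random.default_rng(0).normal(size=(4, 30_000))
levels = build_pyramid(data)
```

With this layout, the initial display never touches the full-resolution array, which is the main cost in the current viewers.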

PROBABLY SKIP THIS: Ephys Phase 6: Exploring Direct HDF5 Access with Kerchunk

  • Kerchunk Integration: Integrate Kerchunk to provide direct, chunked access to original HDF5 datasets.
  • Benchmark Kerchunk Performance: Compare the performance of accessing data via Kerchunk versus the Zarr data copy approach.

Ephys Phase 7: Adapt Progress to Waveform Workflow

  • Waveform Workflow: Although the initial focus should be on optimizing the ephys viewer and spike raster, the waveform workflow is also an important component and should be updated accordingly. As it displays thousands of overlapping and grouped lines, it may require an adapted approach compared to the other workflows.

1-Photon Calcium Imaging (1P-Imaging)

Primarily regarding the Miniscope device and associated Minian software

**Context:** The Minian work so far uses many SOSA tools (Zarr, Dask, Xarray, HoloViews, Panel, Bokeh, etc.), which is great, and we want to help improve their pipeline, especially since there are parts (like the CNMF app) that are reportedly unusable with large data. If we could streamline their pipeline, that would be a massive win for everyone. However, Demetris is trying to engage with the primary developer of Minian to see if they would consider accepting PRs (the project hasn't been updated since June 2022, old versions of most packages are pinned, and it doesn't have a build for osx_arm64); otherwise we'd need to find a solution that has visibility in the community, which gets more complicated. The developer is now working with a company called Metacell, which facilitates imaging analysis platforms, so this could either be an opportunity for accelerated adoption or something less good if we can't improve things and show that a Bokeh-based workflow is the best approach. There are also potentially competing/complementary solutions in the works from the fastplotlib folks, who already have a collaboration going with the popular 2-photon analysis suite 'CaImAn', which could potentially absorb 1-photon workflows in the future (unless our solution and community support are demonstrably better).

1P-Imaging Phase 1: Understanding the ecosystem, problems, and foundations for the solution

1P-Imaging Phase 2: Building from the existing MVP

  • Explanation: Minian, as it currently exists, is the MVP that we want to benchmark and build from. Demetris has been building a more generalized version of a video viewer reminiscent of Minian's use case, for easier benchmarking and development, but eventually improvements should feed back into the Minian ecosystem.
  • User Feedback on MVP: We have already met with Cai lab members about pain points in the existing Minian pipeline and have identified the initial targets for improvement: the "CNMF" app and the "Temporal update" app pipeline steps. While planning to address these steps, experiment (where useful) with the generalized video viewer workflow, which should be independent of Minian-specific machinery.
    • Here is a note about CNMF struggles from the meeting notes: "[user] doesn't use the CNMF viewer because it takes a long time to load, although it would be extremely helpful if it did work. It's the least used but could potentially be the most helpful. Right now they are setting parameters based on the first video chunk, then apply those parameters to the rest of the videos and inspect the max projection to see if those parameters worked. If they need to adjust the parameters, they would need to run the whole pipeline again."
    • Here is a note about the Temporal update step: "Temporal update plot in the Minian pipeline would also be really useful to improve. It's just a timeseries plot, but after some duration (~1 hr) it becomes unusable."
  • Refinement: Fork the Minian repo and make preliminary targeted adjustments to address the initial targets for improvement.
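For the "Temporal update" pain point quoted above (a timeseries plot that becomes unusable after ~1 hr of data), one common fix is to draw a per-pixel min/max envelope instead of every sample. A hedged NumPy sketch of that idea, with an illustrative trace rather than real Minian output:

```python
import numpy as np

def minmax_envelope(y, n_pixels):
    """Reduce a long trace to per-pixel (min, max) pairs for plotting.

    Drawing 2 * n_pixels points preserves the visual extent of the trace
    while keeping the renderer's workload independent of recording length.
    """
    n = (len(y) // n_pixels) * n_pixels
    blocks = y[:n].reshape(n_pixels, -1)
    return blocks.min(axis=1), blocks.max(axis=1)

# E.g. one hour of a 30 Hz fluorescence trace squeezed onto a 1000-px-wide plot
trace = np.sin(np.linspace(0, 200, 3600 * 30))
lo, hi = minmax_envelope(trace, 1000)
```

This is essentially what Datashader automates; a hand-rolled envelope is just the cheapest way to confirm the rendering cost is the bottleneck before committing to a tool.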

1P-Imaging Phase 3: Benchmarking the improvements

  • Define Benchmark Metrics: Establish concrete metrics to evaluate performance, usability, and user experience. (See benchmarking notes)
  • Benchmark original Minian Approach vs Adjusted Approach: Develop benchmark tests and compare the adjustments with the original.
  • Document Benchmark Results: Document the benchmark results, emphasizing areas where the adjustments clearly differ from the original solution.
  • Benchmark and/or Document Adjusted Approach vs Competitors: Ideally, find a way to quantitatively compare the Adjusted Approach with something comparable being done with fastplotlib and napari. These are all very different tools and it will be difficult to directly compare apples to apples, but we want some indicator (while documenting caveats) of user experience between the tools/approaches.

EEG

Primarily regarding the MNE software

**Context:** The MNE software is well-maintained, documented, and widespread. We have established a friendly collaboration with one of their developers, and a successful end result would be a HoloViz/Bokeh approach to EEG visualization that they advertise to their users. The extent of actual integration into the MNE software is yet to be determined, but one ('best') possible outcome is that the HoloViz/Bokeh backend ships with their package so users can easily switch to it with an argument. The next best outcome is that they advertise the HoloViz/Bokeh approach in some way, but it remains outside of their package. Either way, we want to fashion our solution such that it would be possible to integrate with and complement their tooling. This has implications for the data-access approach, as we want to utilize their existing data readers and formats as much as possible. In the future, a possible grant extension could work with MNE developers to adopt a data-access approach that uses Zarr, Dask, Xarray, etc., if there were hints that this approach would be more promising.

EEG Phase 1: Understanding the ecosystem, problems, and foundations for the solution

  • Ecosystem Review: Read about the existing ecosystem on EEG data/viz, challenges, and commonly used tools. (Living notes here)
  • Research Common Workflows: Identify and establish the most common workflow(s) for accessing and visualizing EEG data. This will be crucial for later benchmarking. (Relevant materials here)

EEG Phase 2: Benchmark the eeg viewer workflow version that uses MNE I/O

  • Define Benchmark Metrics: Establish concrete metrics to evaluate performance, usability, and user experience. (See benchmarking notes)
  • Benchmark Against Common Workflows: Compare the MVP against the previously identified common workflows for accessing and visualizing EEG data. This would involve benchmarking one or both of MNE's current backends (although I doubt we'll be able to benchmark the Qt-based backend).
  • Document Benchmark Results: Document the benchmark results, emphasizing areas where the MVP clearly outperforms or underperforms existing solutions.

EEG Phase 3: Advanced Visualization Techniques (common to Ephys)

  • Integrate Datashader: Add variant using Datashader for efficient rendering of large datasets
  • Explore Decimation: Add variant using data decimation techniques like HoloViews' LTTB.
  • Bokeh with WebGL: Add variant using Bokeh's WebGL backend.

EEG Phase 4: Minimap/multi-scale (common to Ephys)

  • Evaluate xarray DataTree: Explore the xarray datatree for multi-scale rendering. (some exploratory notes here)
  • Leverage Minimap/RangeTool: Leverage our work on the minimap feature. (Maybe the minimap could use a precomputed lower-res rendering stored in the datatree to speed up initial display, or the minimap could help ensure that only an optimized subset of channels/time is displayed at first.)

droumis commented Jul 26, 2023

I'm creating a separate benchmarking goal issue.

@droumis droumis changed the title GOAL: Benchmark and solve large-data-handling GOAL: Solve large-data-handling Jul 26, 2023
@droumis droumis changed the title GOAL: Solve large-data-handling GOAL: Large-data-handling Aug 15, 2023
@droumis droumis closed this as completed Apr 5, 2024
Projects
Status: dropped