---
title: high-bandwidth 3D image compression to boost predictive life sciences
author: Peter Steinbach, Jeffrey Kelling (presenter)
origin: Scionics Computer Innovation GmbH, Helmholtz-Zentrum Dresden-Rossendorf
email: steinbach@scionics.de
date: May 11, 2017
---

Before I start

Jeffrey != Peter

[columns,class="row vertical-align"]

[column,class="col-xs-6"]

presenter

Jeffrey Kelling (HZDR){ width=50% }

[/column]

[column,class="col-xs-6"]

author

Peter Steinbach (Scionics){ width=50% }

[/column]

[/columns]

Scionics Who?

[columns,class="row vertical-align"]

[column,class="col-xs-6"]


Scionics Computer Innovation GmbH

[/column]

[column,class="col-xs-6"]

[/column]

[/columns]

[notes]

  • presentation of our institute

[/notes]

This Talk is

[columns,class="row vertical-align"]

[column,class="col-xs-8"]

![](img/opensource-550x475.png)

github.com/psteinb/gtc2017

[/column]

. . .

[column,class="col-xs-4"]

[/column]

[/columns]

Outline

1. Scientific Motivation

2. Sqeazy library

3. Results

Big Data Deluge in Systems Biology

Selective Plane Illumination Microscopy

Biologists love this!

[columns,class="row vertical-align"]

[column,class="col-xs-8"]

[/column]

[column,class="col-xs-4"]

3D rendering of Drosophila embryogenesis time-lapse data, reconstructed from a 5-angle SPIM recording

credits to Pavel Tomancak (MPI CBG)

[/column]

[/columns]

But ...

[columns,class="row vertical-align"]

[column,class="col-xs-6"]

Design Draft of a modern SPIM microscope, credits Nicola Maghelli (MPI CBG, Myers lab){width=100%}

[/column]

[column,class="col-xs-6"]

  • today:

    • each CMOS camera can record 850 MB/s of 16-bit grayscale pixels
    • 2 cameras per scope, i.e. 1.7 GB/s
  • scientists would like to capture long time-lapses of 1-2 days (or more)

  • total data volume per 1-2 day capture (back-of-the-envelope estimate below):

150-300 TiB raw volume

= 57-114 kEUR in SSDs

[/column]

[/columns]
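A quick sanity check of these numbers (assuming both cameras stream continuously at the quoted rate):

$2 \times 0.85\,\mathrm{GB/s} \times 86\,400\,\mathrm{s/day} \approx 147\,\mathrm{TB/day} \approx 134\,\mathrm{TiB/day}$

i.e. roughly 134-267 TiB for a 1-2 day capture, in line with the 150-300 TiB quoted above (the exact figure depends on duty cycle and frame rates).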

IT to the rescue

{ width=85% }

Does that scale? {data-background="img/ieee_data_deluge.jpg"}

![](img/sqeazy-on-github.png){ width=90% }

Yet another compression library?

[columns,class="row vertical-align"]

[column,class="col-xs-6"]

wikimedia commons

[/column]

[column,class="col-xs-6"]

  • heart of sqeazy: its pipeline mechanism (conceptual sketch below)

    • transform the data so that it compresses best
    • use very good and fast encoders at the end of the pipeline, e.g. zstd, lz4, blosc, ...
      use them, don't reinvent them!
  • do it fast! (multi-core, SIMD)

  • written in C++11 (soon C++14)

[/column]

[/columns]
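To make the pipeline idea concrete, here is a minimal conceptual sketch (not sqeazy's actual API): one transform step that regroups the 16-bit pixels into byte planes so that similar bytes sit next to each other, followed by zstd as the fast general-purpose encoder at the end of the pipeline. The real library chains several such transform steps and parallelises them across cores and SIMD lanes.

```cpp
// Conceptual pipeline sketch (NOT sqeazy's real API): transform, then encode.
#include <cstddef>
#include <cstdint>
#include <vector>

#include <zstd.h>

// transform step: split 16-bit pixels into a low-byte plane followed by a
// high-byte plane, which usually helps the entropy coder downstream
std::vector<uint8_t> split_byte_planes(const std::vector<uint16_t>& pixels) {
    std::vector<uint8_t> out(pixels.size() * 2);
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        out[i]                 = static_cast<uint8_t>(pixels[i] & 0xff); // low bytes
        out[i + pixels.size()] = static_cast<uint8_t>(pixels[i] >> 8);   // high bytes
    }
    return out;
}

// end of the pipeline: hand the transformed buffer to zstd
std::vector<uint8_t> compress_volume(const std::vector<uint16_t>& pixels, int level = 3) {
    const std::vector<uint8_t> planes = split_byte_planes(pixels);
    std::vector<uint8_t> dst(ZSTD_compressBound(planes.size()));
    const std::size_t written = ZSTD_compress(dst.data(), dst.size(),
                                              planes.data(), planes.size(), level);
    dst.resize(written); // error handling (ZSTD_isError) omitted for brevity
    return dst;
}
```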

Can we do better?

3D in space = 2D in space + time!

. . .

[columns,class="row vertical-align"]

[column,class="col-xs-8"]

wikimedia commons{ width=80% }

[/column]

[column,class="col-xs-4"]

  • the multimedia industry and video codec research have worked in the high-bandwidth/low-latency regime for years
  • reuse their expertise through freely available codec libraries
  • currently looking into H.264/MPEG-4 AVC and H.265/HEVC, others are possible

[/column]

[/columns]

Challenge: SPIM data

{width=90%}

[columns,class="row"]

[column,class="col-xs-6"]

  • raw data is encoded as grey16

[/column]

[column,class="col-xs-6"]

  • pixel intensities occupy more than 8 bits:
    mean ± std = 11 ± 3 bits (see the sketch below)

[/column]

[/columns]
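One way to quantify "occupied bits", shown as a small illustrative helper (not code from sqeazy): count the position of the highest set bit per pixel and average that over the stack.

```cpp
// Count how many bits a 16-bit intensity value actually occupies,
// i.e. the position of its highest set bit (0 for a black pixel).
#include <cstdint>

inline int occupied_bits(uint16_t value) {
    int bits = 0;
    while (value) {
        ++bits;
        value = static_cast<uint16_t>(value >> 1);
    }
    return bits;
}
// Averaged over the SPIM stacks shown here this yields roughly 11 +/- 3 bits,
// i.e. more dynamic range than a plain 8-bit container provides.
```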

Solution: Quantize data

[columns,class="row vertical-align"]

[column,class="col-xs-8"]

{width=100%}

[/column]

[column,class="col-xs-4"]

  • lossy bucket-based quantisation
    (16 -> 8 bits per pixel transformation, sketched below)
  • minimal quality loss
  • 8-bit per channel encoding is the standard input for video codecs
  • enough bandwidth to handle 8 cameras

[/column]

[/columns]
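A hedged sketch of what a bucket-based 16 -> 8 bit quantisation can look like (sqeazy's actual scheme may differ in detail): choose 256 buckets so that each covers roughly the same number of pixels of the intensity histogram, then remap every pixel through a lookup table.

```cpp
// Sketch of a bucket-based 16 -> 8 bit quantisation (illustrative only).
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint8_t> quantise_16_to_8(const std::vector<uint16_t>& pixels) {
    // 1. histogram of the 16-bit intensities
    std::vector<std::size_t> hist(1u << 16, 0);
    for (uint16_t p : pixels)
        ++hist[p];

    // 2. build a lookup table that assigns intensities to 256 roughly
    //    equally populated buckets
    std::array<uint8_t, 1u << 16> lut{};
    const std::size_t per_bucket = pixels.size() / 256 + 1;
    std::size_t filled = 0;
    uint8_t bucket = 0;
    for (std::size_t v = 0; v < hist.size(); ++v) {
        lut[v] = bucket;
        filled += hist[v];
        if (filled >= per_bucket && bucket < 255) {
            ++bucket;
            filled = 0;
        }
    }

    // 3. remap all pixels; for decoding, the bucket centres would be
    //    stored alongside the 8-bit payload
    std::vector<uint8_t> out(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i)
        out[i] = lut[pixels[i]];
    return out;
}
```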

ffmpeg

[columns,class="row vertical-align"]

[column,class="col-xs-8"]

  • using the ffmpeg framework to interface sqeazy to video codecs

    • support CPU and GPU based encoding/decoding

    • enable future directions to non-x86 platforms

    • Linux, macOS, Windows supported

  • steep learning curve for the libavcodec API (sketch below)

  • for this talk: ffmpeg 3.0.7

[/column]

[column,class="col-xs-4"]

{ width=80% }

[/column]

[/columns]
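To give a feel for the libavcodec side of this, a trimmed-down sketch of opening an encoder and feeding it frames (error handling omitted; the send/receive calls shown exist from ffmpeg 3.1 on, while the ffmpeg 3.0.7 used for this talk still relies on avcodec_encode_video2):

```cpp
// Minimal libavcodec encoder setup sketch (error handling omitted).
extern "C" {
#include <libavcodec/avcodec.h>
#include <libavutil/opt.h>
}

AVCodecContext* open_encoder(const char* name, int width, int height) {
    avcodec_register_all();                                // needed in the ffmpeg 3.x era
    AVCodec* codec = avcodec_find_encoder_by_name(name);   // e.g. "libx264", "nvenc_h264"
    AVCodecContext* ctx = avcodec_alloc_context3(codec);
    ctx->width     = width;
    ctx->height    = height;
    ctx->pix_fmt   = AV_PIX_FMT_YUV420P;                   // 8 bit per channel, cf. quantisation slide
    ctx->time_base = AVRational{1, 25};
    av_opt_set(ctx->priv_data, "preset", "llhp", 0);       // codec-private option, as in the CLI runs
    avcodec_open2(ctx, codec, nullptr);
    return ctx;
}

// per frame: hand a filled AVFrame to the encoder and drain the packets
void encode_frame(AVCodecContext* ctx, AVFrame* frame, AVPacket* pkt) {
    avcodec_send_frame(ctx, frame);                        // frame == nullptr flushes the encoder
    while (avcodec_receive_packet(ctx, pkt) == 0) {
        // write pkt->data / pkt->size to the output container here
        av_packet_unref(pkt);
    }
}
```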

hardware accelerated codecs

  • our production environment: Windows (microscope) and Linux (HPC) based

[columns,class="row vertical-align"]

[column,class="col-xs-8"]

from ffmpeg wiki{width=100%}

[/column]

[column,class="col-xs-4"]

  • hardly any single library supports hardware-accelerated video encoding uniformly across platforms

  • ffmpeg+nvenc meets our production requirements

  • encapsulates external dependencies (easier comparison)

[/column]

[/columns]

Results

benchmark platform

[columns,class="row"]

[column,class="col-xs-6"]

*hardware*

[/column]

[column,class="col-xs-6"]

*software*

[/column]

[/columns]

what I measured

  • simple workflow based on ffmpeg, performed on all datasets:

    1. quantize .tif images to YUV 4:2:0 with sqeazy (produce input.y4m)
    2. encode input.y4m video with ffmpeg (take time, input/output files in ramdisk)
    3. decode encoded.raw to obtain roundtrip.y4m
    4. compare quality of input.y4m and roundtrip.y4m

  • all timings based on /usr/bin/time if not stated otherwise
  • orchestration on our HPC infrastructure with snakemake

CPU only

[columns,class="row vertical-align"]

[column,class="col-xs-8"]

{ width=90% }

[/column]

[column,class="col-xs-4"]

  • x264 is fast, but doesn't provide high compression

  • x265 is slow, but does provide high compression

  • codec preset study ongoing with downstream analysis/processing

GPUs to the rescue?

[/column]

[/columns]

compare timings

$ time ffmpeg -i input.y4m -c:v nvenc_h264 -preset llhp -2pass 0 ...
$ nvprof --print-api-trace ffmpeg -i input.y4m -c:v nvenc_h264   ...

[columns,class="row vertical-align"]

[column,class="col-xs-8"]

{ width=90% }

[/column]

[column,class="col-xs-4"]

  • nvprof API trace: time delta between cuCtxCreate and cuCtxDestroy

  • the nvenc codec consumes only 30-50% of the ffmpeg process time

  • ffmpeg induces quite some overhead on top of nvenc!

[/column]

[/columns]

GPU enhanced encoding

[columns,class="row vertical-align"]

[column,class="col-xs-8"]

{ width=90% }

[/column]

[column,class="col-xs-4"]

  • here:
    cuCtxCreate/Destroy based timing
  • nvenc offers improved compression ratios in comparison to libx26{4,5} (preset definitions differ)
  • nvenc bandwidths are surprisingly low

[/column]

[/columns]

Profiling details

$ nvprof ffmpeg -i input.y4m -c:v nvenc_h264 -preset llhp -2pass 0 -gpu 1 -y output.h264

[columns,class="row vertical-align"]

[column,class="col-xs-8"]

{ width=100% }

[/column]

[column,class="col-xs-4"]

  • to no surprise: nvenc encoding is bound by host-device transfers (90%)

    **Can it still be that slow?**

[/column]

[/columns]

GPU enhanced encoding (cont.)

[columns,class="row vertical-align"]

[column,class="col-xs-8"]

{ width=90% }

[/column]

[column,class="col-xs-4"]

  • here:
    timing from Nvidia Video SDK NvEncodeLowLatency

  • nvenc superior to libx26{4,5}

  • NvEncodeLowLatency timings:

    • exclude driver initialisation
    • exclude memory initialisation

[/column]

[/columns]

Summary

high-bandwidth 3D image compression

  • tough business given modern CMOS cameras (around 1 GB/s of 16-bit greyscale)

  • multi-core implementations very competitive
    (either in compression ratio or speed)

    • many codecs available

    • many configuration parameters

    • many bit depths coming up (8, 10, 12 bits)

GPUs for 3D image compression?

  • nvenc through ffmpeg is difficult to use/measure
    (memory traffic, poor implementation quality?)

  • the raw nvenc API is suitable for high-bandwidth compression

    • NvEncodeLowLatency timings ignore driver and memory initialisation
      (represents a scenario of constant streaming/encoding)

    • the nvenc API is useful on the microscope only, i.e. in streaming mode
      (best if the compression pipeline runs on the device as well)

    • the PCIe bus is apparently a bottleneck

Thank you!

[columns,class="row vertical-align"]

[column,class="col-xs-4"]

For questions, concerns or suggestions:

Open an issue, please!

[/column]

[column,class="col-xs-8"]

![](img/opensource-550x475.png)

github.com/psteinb/gtc2017

[/column]

[/columns]