title | author | origin | contact | date
---|---|---|---|---
high-bandwidth 3D image compression to boost predictive life sciences | Peter Steinbach, Jeffrey Kelling (presenter) | Scionics Computer Innovation GmbH, Helmholtz-Zentrum Dresden-Rossendorf | steinbach@scionics.de | May 11, 2017
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
presenter
[/column]
[column,class="col-xs-6"]
author
[/column]
[/columns]
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
Scionics Computer Innovation GmbH
[/column]
[column,class="col-xs-6"]
- founded in 2000 in Dresden, Germany
- service provider to the Max Planck Institute of Molecular Cell Biology and Genetics
    - scientific computing facility
    - IT infrastructure
    - public relations
- member of the GPU Center of Excellence (a community of industrial and academic developers and scientists using GPUs)
[/column]
[/columns]
[notes]
- presentation of our institute
[/notes]
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
![](img/opensource-550x475.png)[/column]
. . .
[column,class="col-xs-4"]
- code snippets
- presentation links
- open an issue for questions
[/column]
[/columns]
- Sqeazy library
- Results
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
3D rendering of Drosophila embryogenesis time-lapse data, reconstructed from a five-angle SPIM recording
credit: Pavel Tomancak (MPI-CBG)
[/column]
[/columns]
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
[/column]
[column,class="col-xs-6"]
- today:
    - each CMOS camera can record 850 MB/s of 16-bit grayscale pixels
    - 2 cameras per microscope, i.e. 1.7 GB/s
- scientists would like to capture long time-lapses of 1-2 days (or more)
- total data volume per 1-2 day capture:
  150-300 TiB of raw data
  = 57-114 kEUR in SSDs
[/column]
[/columns]
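The data volume on this slide follows from simple arithmetic; a minimal Python sanity check (the constant names are mine, only the camera rates come from the slide):

```python
# Back-of-the-envelope check of the raw data volume per capture.
CAMERA_RATE_BYTES = 850e6   # ~850 MB/s of 16-bit grayscale per CMOS camera
CAMERAS = 2                 # two cameras per microscope -> 1.7 GB/s
SECONDS_PER_DAY = 86_400

def capture_volume_tib(days):
    """Raw data volume of a continuous capture, in TiB."""
    total_bytes = CAMERA_RATE_BYTES * CAMERAS * SECONDS_PER_DAY * days
    return total_bytes / 2**40

one_day = capture_volume_tib(1)   # ~134 TiB
two_days = capture_volume_tib(2)  # ~267 TiB, in the ballpark of the slide's 150-300 TiB
```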
Open-source Compression Library
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
[/column]
[column,class="col-xs-6"]
- heart of sqeazy: the pipeline mechanism
- do it fast! (multi-core, SIMD)
- written in C++11 (soon C++14)
[/column]
[/columns]
3D in space = 2D in space + time!
. . .
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
- the multimedia industry and video-codec research have worked in the high-bandwidth/low-latency regime for years
- reuse their expertise through freely available codec libraries
- currently looking into h264/MPEG-4 AVC and h265/HEVC; others are possible
[/column]
[/columns]
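The "3D in space = 2D in space + time" idea can be sketched in a few lines: each z-slice of a volume becomes one frame of a video stream (a pure-Python illustration; this is not sqeazy's actual data layout):

```python
# A z-stack stored as a flat list, shape (depth, height, width).
depth, height, width = 4, 3, 5
stack = list(range(depth * height * width))

def frames(stack, depth, height, width):
    """Yield each z-slice as one 'video frame' (a flat list of pixels)."""
    frame_size = height * width
    for z in range(depth):
        yield stack[z * frame_size:(z + 1) * frame_size]

video = list(frames(stack, depth, height, width))
# 4 frames of 15 pixels each: exactly the shape a 2D video codec consumes.
```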
[columns,class="row"]
[column,class="col-xs-6"]
- raw data is encoded as grey16
[/column]
[column,class="col-xs-6"]
- pixel intensities occupy more than 8 bits:
  mean +/- std = 11 +/- 3 bits
[/column]
[/columns]
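The "11 +/- 3" figure refers to the number of bits a pixel intensity actually occupies; this can be measured directly as the bit length of each value (the sample intensities below are made up for illustration, not taken from the talk's data):

```python
from statistics import mean, stdev

# Hypothetical 16-bit intensities; real SPIM data drives the 11 +/- 3 figure.
pixels = [700, 1500, 2100, 4095, 300, 1023, 2048, 9000]

bits = [p.bit_length() for p in pixels]  # bits actually occupied per pixel
print(mean(bits), stdev(bits))           # mean above 8 bits, as on the slide
```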
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
- lossy bucket-based quantisation
  (16 -> 8 bits per pixel); quality loss is minimal
- 8 bits per channel is the standard input for video codecs
- throughput high enough to serve 8 cameras
[/column]
[/columns]
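A minimal sketch of such a 16-to-8-bit quantisation, using uniform buckets for simplicity (sqeazy's actual bucketing strategy may well differ):

```python
def quantise(pixels, in_bits=16, out_bits=8):
    """Map in_bits intensities into 2**out_bits uniform buckets (lossy)."""
    shift = in_bits - out_bits
    # Decoder lookup table: each bucket decodes to its centre intensity.
    lut = [(b << shift) + (1 << (shift - 1)) for b in range(1 << out_bits)]
    buckets = [p >> shift for p in pixels]       # 16 -> 8 bit transformation
    restored = [lut[b] for b in buckets]         # approximate reconstruction
    return buckets, restored

pixels = [0, 255, 256, 40_000, 65_535]
buckets, restored = quantise(pixels)
# With uniform buckets the error is bounded by half a bucket width (128 here).
```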
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
- using the ffmpeg framework to interface sqeazy to video codecs:
    - supports CPU- and GPU-based encoding/decoding
    - enables future moves to non-x86 platforms
    - Linux, macOS, and Windows supported
- steep learning curve for the libavcodec API
- for this talk: ffmpeg 3.0.7
[/column]
[column,class="col-xs-4"]
[/column]
[/columns]
- our production environment is Windows-based (microscope) and Linux-based (HPC)
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
- hardly any single library supports hardware-accelerated video encoding uniformly across platforms
- ffmpeg + nvenc meets our production requirements
- encapsulates external dependencies (easier comparison)
[/column]
[/columns]
[columns,class="row"]
[column,class="col-xs-6"]
- dual socket Intel Xeon E5-2680v3 (2x12c)
- 128GB DDR4 RAM
- 2x Nvidia GeForce GTX1080
- CentOS 7.1
- host CPU limited to 8 threads (production environment)
[/column]
[column,class="col-xs-6"]
- ffmpeg 3.0.7 (build instructions)
- x264 (commit 90a61ec764)
- x265 2.4
- GNU gcc 6.3 (5.4 when CUDA is required)
- Nvidia Media SDK v7.1.9
- Nvidia driver 375.26
- CUDA 8.0.61
- snakemake 3.11.2 to orchestrate benchmarks
[/column]
[/columns]
- simple ffmpeg-based workflow applied to all codecs:
    - quantize .tif images to YUV 4:2:0 with sqeazy (produces input.y4m)
    - encode input.y4m with ffmpeg (timed; input/output files in a ramdisk)
    - decode encoded.raw to obtain roundtrip.y4m
    - compare the quality of input.y4m and roundtrip.y4m
- all timings based on /usr/bin/time unless stated otherwise
- orchestration on our HPC infrastructure with snakemake
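The quality comparison in the last step boils down to a per-frame metric such as PSNR; a pure-Python sketch of that metric on toy 8-bit frames (the benchmarks themselves rely on ffmpeg tooling, not on this code):

```python
import math

def psnr(reference, distorted, peak=255):
    """Peak signal-to-noise ratio between two equally sized 8-bit frames."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: no distortion at all
    return 10 * math.log10(peak ** 2 / mse)

frame_in = [10, 20, 30, 40]
frame_out = [11, 19, 30, 42]  # small roundtrip error
print(round(psnr(frame_in, frame_out), 1))  # high PSNR = small error
```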
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
- x264 is fast, but does not provide high compression
- x265 is slow, but does provide high compression
- codec preset study ongoing with downstream analysis/processing

GPUs to the rescue?
[/column]
[/columns]
$ time ffmpeg -i input.y4m -c:v nvenc_h264 -preset llhp -2pass 0 ...
$ nvprof --print-api-trace ffmpeg -i input.y4m -c:v nvenc_h264 ...
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
- nvprof API trace: time delta between cuCtxCreate and cuCtxDestroy
- the nvenc codec consumes only 30-50% of the ffmpeg process time
- ffmpeg induces quite some overhead on top of nvenc!
[/column]
[/columns]
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
- here: cuCtxCreate/cuCtxDestroy-based timing
- nvenc offers improved compression ratios compared to libx26{4,5} (preset definitions differ)
- nvenc bandwidths are surprisingly low
[/column]
[/columns]
$ nvprof ffmpeg -i input.y4m -c:v nvenc_h264 -preset llhp -2pass 0 -gpu 1 -y output.h264
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
- to no surprise: nvenc encoding is bound by host-device transfers (90%)
[/column]
[/columns]
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
- here: timing from the Nvidia Video SDK NvEncodeLowLatency sample
- nvenc superior to libx26{4,5}
- NvEncodeLowLatency timings:
    - exclude driver initialisation
    - exclude memory initialisation
[/column]
[/columns]
- a tough business given modern CMOS cameras (around 1 GB/s at 16-bit grayscale)
- multi-core implementations are very competitive
  (either in compression ratio or speed)
- many codecs available
- many configuration parameters
- many bit depths on the horizon (8, 10, 12 bits)

- nvenc through ffmpeg is difficult to use and measure
  (memory traffic; implementation quality poor?)
- the raw nvenc API is suitable for high-bandwidth compression
- NvEncodeLowLatency timings ignore driver and memory initialisation
  (this represents a scenario of constant streaming/encoding)
- the nvenc API is useful on the microscope only, i.e. in streaming mode
  (at best if the compression pipeline is on the device as well)
- the PCIe bus is apparently a bottleneck
[columns,class="row vertical-align"]
[column,class="col-xs-4"]
For questions, concerns or suggestions:[/column]
[column,class="col-xs-8"]
![](img/opensource-550x475.png)[/column]
[/columns]