title | author | origin | contact | date
---|---|---|---|---
high-bandwidth 3D image compression to boost predictive life sciences | Peter Steinbach, Jeffrey Kelling (presenter) | Scionics Computer Innovation GmbH, Helmholtz-Zentrum Dresden-Rossendorf | steinbach@scionics.de | May 11, 2017
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
presenter
[/column]
[column,class="col-xs-6"]
author
[/column]
[/columns]
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
Scionics Computer Innovation GmbH
[/column]
[column,class="col-xs-6"]
- founded in 2000 in Dresden, Germany
- service provider to the Max Planck Institute of Molecular Cell Biology and Genetics
    - scientific computing facility
    - IT infrastructure
    - public relations
- member of the GPU Center of Excellence (a community of industrial and academic developers and scientists using GPUs)
[/column]
[/columns]
[notes]
- presentation of our institute
[/notes]
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
![](img/opensource-550x475.png)[/column]
. . .
[column,class="col-xs-4"]
- code snippets
- presentation links
- open an issue for questions
[/column]
[/columns]
- Sqeazy library
- Results
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
3D rendering of Drosophila embryogenesis time-lapse data, reconstructed from a five-angle SPIM recording
credit: Pavel Tomancak (MPI-CBG)
[/column]
[/columns]
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
[/column]
[column,class="col-xs-6"]
- today:
    - each CMOS camera can record 850 MB/s of 16-bit grayscale pixels
    - 2 cameras per microscope, i.e. 1.7 GB/s
- scientists would like to capture long time-lapses of 1-2 days (or more)
- total data volume per 1-2 day capture:
  150-300 TiB of raw data
  = 57-114 kEUR in SSDs
[/column]
[/columns]
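The data volume on this slide follows from simple arithmetic; a minimal Python sanity check (the constant names are mine, only the camera rates come from the slide):

```python
# Back-of-the-envelope check of the raw data volume per capture.
CAMERA_RATE_BYTES = 850e6   # ~850 MB/s of 16-bit grayscale per CMOS camera
CAMERAS = 2                 # two cameras per microscope -> 1.7 GB/s
SECONDS_PER_DAY = 86_400

def capture_volume_tib(days):
    """Raw data volume of a continuous capture, in TiB."""
    total_bytes = CAMERA_RATE_BYTES * CAMERAS * SECONDS_PER_DAY * days
    return total_bytes / 2**40

one_day = capture_volume_tib(1)   # ~134 TiB
two_days = capture_volume_tib(2)  # ~267 TiB, in the ballpark of the slide's 150-300 TiB
```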
Open-source Compression Library
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
[/column]
[column,class="col-xs-6"]
- heart of sqeazy: the pipeline mechanism
- do it fast! (multi-core, SIMD)
- written in C++11 (soon C++14)
[/column]
[/columns]
3D in space = 2D in space + time!
. . .
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
- the multimedia industry and video-codec research have worked in the high-bandwidth/low-latency regime for years
- reuse their expertise through freely available codec libraries
- currently looking into h264/MPEG-4 AVC and h265/HEVC; others are possible
[/column]
[/columns]
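The "3D in space = 2D in space + time" idea can be sketched in a few lines: each z-slice of a volume becomes one frame of a video stream (a pure-Python illustration; this is not sqeazy's actual data layout):

```python
# A z-stack stored as a flat list, shape (depth, height, width).
depth, height, width = 4, 3, 5
stack = list(range(depth * height * width))

def frames(stack, depth, height, width):
    """Yield each z-slice as one 'video frame' (a flat list of pixels)."""
    frame_size = height * width
    for z in range(depth):
        yield stack[z * frame_size:(z + 1) * frame_size]

video = list(frames(stack, depth, height, width))
# 4 frames of 15 pixels each: exactly the shape a 2D video codec consumes.
```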
[columns,class="row"]
[column,class="col-xs-6"]
- raw data is encoded as grey16
[/column]
[column,class="col-xs-6"]
- pixel intensities occupy more than 8 bits:
  mean +/- std = 11 +/- 3 bits
[/column]
[/columns]
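The "11 +/- 3" figure refers to the number of bits a pixel intensity actually occupies; this can be measured directly as the bit length of each value (the sample intensities below are made up for illustration, not taken from the talk's data):

```python
from statistics import mean, stdev

# Hypothetical 16-bit intensities; real SPIM data drives the 11 +/- 3 figure.
pixels = [700, 1500, 2100, 4095, 300, 1023, 2048, 9000]

bits = [p.bit_length() for p in pixels]  # bits actually occupied per pixel
print(mean(bits), stdev(bits))           # mean above 8 bits, as on the slide
```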
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
- lossy bucket-based quantisation
  (16 -> 8 bits per pixel); quality loss is minimal
- 8 bits per channel is the standard input for video codecs
- throughput high enough to serve 8 cameras
[/column]
[/columns]
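A minimal sketch of such a 16-to-8-bit quantisation, using uniform buckets for simplicity (sqeazy's actual bucketing strategy may well differ):

```python
def quantise(pixels, in_bits=16, out_bits=8):
    """Map in_bits intensities into 2**out_bits uniform buckets (lossy)."""
    shift = in_bits - out_bits
    # Decoder lookup table: each bucket decodes to its centre intensity.
    lut = [(b << shift) + (1 << (shift - 1)) for b in range(1 << out_bits)]
    buckets = [p >> shift for p in pixels]       # 16 -> 8 bit transformation
    restored = [lut[b] for b in buckets]         # approximate reconstruction
    return buckets, restored

pixels = [0, 255, 256, 40_000, 65_535]
buckets, restored = quantise(pixels)
# With uniform buckets the error is bounded by half a bucket width (128 here).
```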
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
- using the ffmpeg framework to interface sqeazy to video codecs:
    - supports CPU- and GPU-based encoding/decoding
    - enables future moves to non-x86 platforms
    - Linux, macOS, and Windows supported
- steep learning curve for the libavcodec API
- for this talk: ffmpeg 3.0.7
[/column]
[column,class="col-xs-4"]
[/column]
[/columns]
- our production environment is Windows-based (microscope) and Linux-based (HPC)
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
- hardly any single library supports hardware-accelerated video encoding uniformly across platforms
- ffmpeg + nvenc meets our production requirements
- encapsulates external dependencies (easier comparison)
[/column]
[/columns]
[columns,class="row"]
[column,class="col-xs-6"]
- dual socket Intel Xeon E5-2680v3 (2x12c)
- 128GB DDR4 RAM
- 2x Nvidia GeForce GTX1080
- CentOS 7.1
- host CPU limited to 8 threads (production environment)
[/column]
[column,class="col-xs-6"]
- ffmpeg 3.0.7 (build instructions)
- x264 (commit 90a61ec764)
- x265 2.4
- GNU gcc 6.3 (5.4 when CUDA is required)
- Nvidia Media SDK v7.1.9
- Nvidia driver 375.26
- CUDA 8.0.61
- snakemake 3.11.2 to orchestrate benchmarks
[/column]
[/columns]
- simple ffmpeg-based workflow applied to all codecs:
    - quantize .tif images to YUV 4:2:0 with sqeazy (produces input.y4m)
    - encode input.y4m with ffmpeg (timed; input/output files in a ramdisk)
    - decode encoded.raw to obtain roundtrip.y4m
    - compare the quality of input.y4m and roundtrip.y4m
- all timings based on /usr/bin/time unless stated otherwise
- orchestration on our HPC infrastructure with snakemake
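The quality comparison in the last step boils down to a per-frame metric such as PSNR; a pure-Python sketch of that metric on toy 8-bit frames (the benchmarks themselves rely on ffmpeg tooling, not on this code):

```python
import math

def psnr(reference, distorted, peak=255):
    """Peak signal-to-noise ratio between two equally sized 8-bit frames."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: no distortion at all
    return 10 * math.log10(peak ** 2 / mse)

frame_in = [10, 20, 30, 40]
frame_out = [11, 19, 30, 42]  # small roundtrip error
print(round(psnr(frame_in, frame_out), 1))  # high PSNR = small error
```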
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
- x264 is fast, but does not provide high compression
- x265 is slow, but does provide high compression
- codec preset study ongoing with downstream analysis/processing

GPUs to the rescue?
[/column]
[/columns]
$ time ffmpeg -i input.y4m -c:v nvenc_h264 -preset llhp -2pass 0 ...
$ nvprof --print-api-trace ffmpeg -i input.y4m -c:v nvenc_h264 ...
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
- nvprof API trace: time delta between cuCtxCreate and cuCtxDestroy
- the nvenc codec consumes only 30-50% of the ffmpeg process time
- ffmpeg induces quite some overhead on top of nvenc!
[/column]
[/columns]
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
- here: cuCtxCreate/cuCtxDestroy-based timing
- nvenc offers improved compression ratios compared to libx26{4,5} (preset definitions differ)
- nvenc bandwidths are surprisingly low
[/column]
[/columns]
$ nvprof ffmpeg -i input.y4m -c:v nvenc_h264 -preset llhp -2pass 0 -gpu 1 -y output.h264
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
- to no surprise: nvenc encoding is bound by host-device transfers (90%)
[/column]
[/columns]
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
[/column]
[column,class="col-xs-4"]
- here: timing from the Nvidia Video SDK NvEncodeLowLatency sample
- nvenc superior to libx26{4,5}
- NvEncodeLowLatency timings:
    - exclude driver initialisation
    - exclude memory initialisation
[/column]
[/columns]
- a tough business given modern CMOS cameras (around 1 GB/s at 16-bit grayscale)
- multi-core implementations are very competitive
  (either in compression ratio or speed)
- many codecs available
- many configuration parameters
- many bit depths on the horizon (8, 10, 12 bits)

- nvenc through ffmpeg is difficult to use and measure
  (memory traffic; implementation quality poor?)
- the raw nvenc API is suitable for high-bandwidth compression
- NvEncodeLowLatency timings ignore driver and memory initialisation
  (this represents a scenario of constant streaming/encoding)
- the nvenc API is useful on the microscope only, i.e. in streaming mode
  (at best if the compression pipeline is on the device as well)
- the PCIe bus is apparently a bottleneck
[columns,class="row vertical-align"]
[column,class="col-xs-4"]
For questions, concerns or suggestions:[/column]
[column,class="col-xs-8"]
![](img/opensource-550x475.png)[/column]
[/columns]