Speed optimisations in downloader #448

Closed
hugbug opened this issue Sep 22, 2017 · 21 comments

@hugbug
Member

commented Sep 22, 2017

In this release we did a bunch of speed optimisations in different program areas:

  • web-interface;
  • download queue management;
  • loading of very large queue.

The next area to optimise is the downloader code:

  • provide 64 bit build for Windows (#435);
  • enable more aggressive optimisation options in Visual Studio for Windows builds (#447);
  • provide a way to disable article writing and article decoding for benchmarking purposes;
  • faster checksum computations (#446);
  • decrease speed meter update rate (#449);
  • optimisations in yEnc-Decoder (#450);
  • explore SSE/SIMD versions of CRC32 and yEnc-decoder routines (#454);
  • use glibc instead of uClibc in universal installer builds for Linux (#459).

UPDATE: The following benchmark table represents the results after implementing all optimisations.


Benchmarks

Test devices

  1. MacBook 2010 with Intel Core i5-520 (2 cores @ 2.4 GHz, 4 threads), macOS 64 bit;
  2. Dell 2015 Windows with Intel Core i7-5600U (2 cores @ 2.6 GHz, 4 threads), Windows 64 bit;
  3. Dell 2015 Linux with Intel Core i7-5600U (2 cores @ 2.6 GHz, 4 threads), Linux 64 bit. The same machine as above booted with Linux;
  4. PVR with ARMv7 Cortex A-15 (2 cores @ 1.7GHz), Broadcom BCM 7251s, Linux 32 bit;
  5. NanoPi NEO2 with ARMv8 Cortex A-53 (4 cores @ 1.5GHz), Allwinner H5, Linux 64 bit. NZBGet runs in 32 bit mode.

Test conditions

  • all tests were performed at least 4 times; the worst result was discarded and the average of the remaining results was taken (a small helper sketch of this averaging rule follows this list);
  • downloading from NServ running on the same machine via an unencrypted connection, with the memory cache activated in NServ;
  • on Mac and Dell a 30 GB nzb file was downloaded, on the ARM devices a 10 GB file. The test nzb files consisted of many rar-files (100 MB each), where the same rar-file content was referenced over and over (300 or 100 times) in the nzb. This allowed NServ to keep all served articles in memory (only 100 MB of RAM needed);
  • writing of downloaded data was disabled in NZBGet (via the option SkipWrite) to reduce the influence of the disk subsystem.
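
For reference, a tiny sketch (hypothetical helper, not NZBGet code) of the averaging rule described above:

#include <algorithm>
#include <numeric>
#include <vector>

// Run the benchmark at least four times, drop the single worst (slowest)
// result and average the rest; names are illustrative only.
double AverageWithoutWorst(std::vector<double> speedsMBs)
{
    speedsMBs.erase(std::min_element(speedsMBs.begin(), speedsMBs.end()));
    return std::accumulate(speedsMBs.begin(), speedsMBs.end(), 0.0) / speedsMBs.size();
}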

Results

All numbers are in MB/s (megabytes per second).

| Improvement | MacBook macOS i5-520 | Dell Windows i7-5600U | Dell Linux i7-5600U | PVR Linux ARMv7 | NEO2 Linux ARMv8 |
| --- | --- | --- | --- | --- | --- |
| Before all optimisations | 162 | 207 | 266 | 54 | 62 |
| 64 Bit Build (#435) | 162 [1] | 279 | 266 [1] | 54 [2] | 62 [3] |
| (+) improved CRC32 (#446) | 172 | 301 | 267 | 60 | 70 |
| (+) option RateBuffer (#449) | 217 | 354 | 293 | 80 | 84 |
| (+) improved decoder (#450) | 305 | 389 | 480 | 89 | 102 |
| (+) raw decoder (#448) | 369 | 414 | 636 | 93 | 107 |
| (+) SIMD decoder (#454) | 467 | 493 | 836 | 99 | 121 |
| (+) SIMD CRC32 (#454) | 520 | 541 | 1011 | 99 [4] | 136 |
| (+) one-pass decoding (#448) | 520 | 570 | 1140 | 106 | 157 |
| (+) glibc (#459) | 520 [5] | 570 [5] | 1623 | 155 | 222 |
| Speed bump after all optimisations | +221% | +175% [6] | +510% | +187% | +258% |

Footnotes:

  1. Mac and Linux builds for x86 were already 64 bit before the optimisations.
  2. ARMv7 is 32 bit only.
  3. ARMv8 is 64 bit capable, but all tests were performed with the 32 bit version of NZBGet because we don't provide 64 bit builds for ARM yet.
  4. ARMv7 doesn't have hardware support for CRC32.
  5. Mac and Windows have their own system C libraries; the switch from "uClibc" to "glibc" isn't applicable there.
  6. On the Windows PC, during the late optimisation steps, NServ was using more than 30% of the CPU time, leaving less CPU time for NZBGet and therefore significantly reducing the download speed.

TLS/SSL

The benchmark results above were obtained using an unencrypted connection between NZBGet and NServ. To determine how TLS/SSL affects download speed, additional tests were performed with two ciphers:

  • AES: the default cipher;
  • RC4-MD5: the cipher we recommended for faster download speeds, especially for devices with weak CPUs.

For the TLS/SSL tests on the PVR ARMv7 device, NServ was running on a separate machine connected via Ethernet cable. For the tests on the other devices, NServ was running on the tested device itself.

| Cipher | Test | Dell Linux i7-5600U | PVR Linux ARMv7 | NEO2 Linux ARMv8 |
| --- | --- | --- | --- | --- |
| AES | Before all optimisations | 246 | 31 | 54 |
| AES | After all optimisations | 967 | 49 | 151 |
| AES | Speed bump after all optimisations | +293% | +58% | +180% |
| RC4-MD5 | Before all optimisations | 174 | 36 | 33 |
| RC4-MD5 | After all optimisations | 375 | 65 | 55 |
| RC4-MD5 | Speed bump after all optimisations | +115% | +81% | +67% |

Observations and conclusion:

  1. TLS/SSL drastically increases the CPU load on the NServ side. As a result NZBGet (which is running on the same machine) gets much less CPU time than it does with an unencrypted connection. Therefore these speed numbers are not directly comparable to the numbers from the first table above.
  2. Modern Intel and ARM CPUs have hardware instructions for AES encryption, which OpenSSL can use. As a result the AES cipher shows much better speeds than RC4 on Intel and ARMv8. It is worth noting that not every ARMv8 CPU supports the AES instructions, as they are an optional feature; in particular the RPi3 has an ARMv8 CPU without AES support (a small detection sketch follows this list).
  3. If we take the high CPU usage of NServ into account and adjust the speed numbers accordingly, the resulting numbers for AES on CPUs with hardware AES support are almost on par with the numbers for unencrypted connections. That allows us to say that encrypted connections do not decrease performance on systems with modern Intel and ARMv8 CPUs.
  4. The recommendation to use "RC4-MD5" for faster speeds should be revised, as it doesn't apply to modern CPUs.
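
For illustration, a minimal sketch (not NZBGet code) of detecting hardware AES support at runtime on x86; the CPUID flag used is the documented AES-NI bit, and the ARM alternative mentioned in the comment is an assumption about a typical Linux setup:

#include <cstdio>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

// Returns true if the CPU advertises AES-NI (CPUID leaf 1, ECX bit 25).
// On ARMv8 Linux one would instead check getauxval(AT_HWCAP) & HWCAP_AES.
static bool HasHardwareAes()
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return (ecx & (1u << 25)) != 0;
#endif
    return false;
}

int main()
{
    std::printf("Hardware AES: %s\n", HasHardwareAes() ? "yes" : "no");
    return 0;
}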

hugbug added the feature label Sep 22, 2017

hugbug added this to the v20 milestone Sep 22, 2017

hugbug added a commit that referenced this issue Sep 22, 2017

#448: disable article writing and decoding
Disabling is now possible for test purposes via defines
SKIP_ARTICLE_WRITING and SKIP_ARTICLE_DECODING (nzbget.h)

hugbug added a commit that referenced this issue Sep 22, 2017

#448, 7150534: allow CRC calculation even if
decoding is disabled via SKIP_ARTICLE_DECODING
@hugbug

Member Author

commented Sep 22, 2017

Benchmarks for CRC, Decoding and RateBuffer

Test systems

Tests were performed on two machines:

  • Apple MacBook 2010 with Intel Core i5-520 (2 cores @ 2.4 GHz, 4 threads), macOS 64 bit;
  • Dell Notebook 2015 with Intel Core i7-5600U (2 cores @ 2.6 GHz, 4 threads), Windows 64 bit.

Test conditions

  • all tests were performed 4 times; the worst result was discarded and the average of the remaining three results was taken;
  • downloading from NServ running on the same machine;
  • writing of downloaded data was disabled in NZBGet using the compiler define SKIP_ARTICLE_WRITING to reduce the influence of the disk subsystem.

Results

| Improvements | MacBook 2010 | Dell 2015 Windows |
| --- | --- | --- |
| Before all optimizations | 147 MB/s | 199 MB/s |
| 64 Bit Build (#435) | 147 MB/s | 245 MB/s |
| (+) new CRC32 routine (#446) | 153 MB/s | 265 MB/s |
| (+) option RateBuffer=10000 (#449) | 183 MB/s | 306 MB/s |
| (+) new decoder (#450) | 247 MB/s | 353 MB/s |
| Speed bump after all optimizations | +68% | +77% |
@hugbug

Member Author

commented Sep 26, 2017

Detailed results

| Config | MacBook 2010 | Dell 2015 Windows |
| --- | --- | --- |
| before-all-opt: crc off, ratebuf 0 | 165 MB/s | 234 MB/s |
| before-all-opt: crc on, ratebuf 0 | 147 MB/s | 199 MB/s |
| before-all-opt: crc off, ratebuf 10k | 206 MB/s | 277 MB/s |
| before-all-opt: crc on, ratebuf 10k | 179 MB/s | 234 MB/s |
| 64-bit: crc off, ratebuf 0 | 165 MB/s | 286 MB/s |
| 64-bit: crc on, ratebuf 0 | 147 MB/s | 245 MB/s |
| 64-bit: crc off, ratebuf 10k | 206 MB/s | 349 MB/s |
| 64-bit: crc on, ratebuf 10k | 179 MB/s | 284 MB/s |
| new-crc: crc off, ratebuf 0 | 163 MB/s | 286 MB/s |
| new-crc: crc on, ratebuf 0 | 153 MB/s | 265 MB/s |
| new-crc: crc off, ratebuf 10k | 206 MB/s | 338 MB/s |
| new-crc: crc on, ratebuf 10k | 183 MB/s | 306 MB/s |
| new-decoder: crc off, ratebuf 0 | 208 MB/s | 324 MB/s |
| new-decoder: crc on, ratebuf 0 | 191 MB/s | 292 MB/s |
| new-decoder: crc off, ratebuf 10k | 278 MB/s | 365 MB/s |
| new-decoder: crc on, ratebuf 10k | 247 MB/s | 353 MB/s |
| sse-decoder: crc off, ratebuf 0 | 221 MB/s | 345 MB/s |
| sse-decoder: crc on, ratebuf 0 | 202 MB/s | 308 MB/s |
| sse-decoder: crc off, ratebuf 10k | 300 MB/s | 365 MB/s |
| sse-decoder: crc on, ratebuf 10k | 273 MB/s | 357 MB/s |
| off-decoder: crc off, ratebuf 0 | 229 MB/s | 349 MB/s |
| off-decoder: crc on, ratebuf 0 | 216 MB/s | 334 MB/s |
| off-decoder: crc off, ratebuf 10k | 339 MB/s | 391 MB/s |
| off-decoder: crc on, ratebuf 10k | 291 MB/s | 369 MB/s |

Notes:

  • sse-decoder is an experimental decoder not committed to the develop branch.
  • off-decoder means the decoder was disabled using the compiler define SKIP_ARTICLE_DECODING. This is for comparison only, as a disabled decoder has no practical use.

hugbug added a commit that referenced this issue Sep 28, 2017

#448: new option "SkipWrite"
replaces compiler define “SKIP_ARTICLE_WRITING”. 2) renamed option
“Decode” to “RawArticle”. 3) option “CrcCheck” moved from section
“Download Queue “ into section “Check and Repair”

hugbug added a commit that referenced this issue Sep 28, 2017

#448: memory cache in NServ
: new command line switch “-m”

hugbug added a commit that referenced this issue Sep 29, 2017

hugbug added a commit that referenced this issue Oct 4, 2017

#448: don't try deleting files that don't exist
- a small optimisation to reduce disk activity

hugbug added a commit that referenced this issue Oct 8, 2017

@hugbug

Member Author

commented Oct 9, 2017

Raw processing

The data processing technique was reworked to pass more of the work to the decoder, a so-called raw mode. The data received from the socket is pushed to the decoder directly instead of being processed line by line, which also required line buffering. This allows the decoder to process data in larger blocks (currently 4 KB) compared to the roughly 128 byte per-line processing.
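
A rough sketch of that idea (illustrative names, not the actual NZBGet code; plain POSIX sockets assumed):

#include <functional>
#include <sys/types.h>
#include <sys/socket.h>

// Read the socket in large chunks and hand every chunk to the decoder as-is;
// the per-line buffering step of the classic path disappears.
void FeedDecoderRaw(int socketFd, const std::function<void(const char*, int)>& decodeChunk)
{
    const int kChunkSize = 4096;   // the 4 KB block size mentioned above
    char buffer[kChunkSize];
    ssize_t received;
    while ((received = recv(socketFd, buffer, sizeof(buffer), 0)) > 0)
    {
        decodeChunk(buffer, static_cast<int>(received));
    }
}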

The raw decoder was developed by @animetosho for the node-yencode project. This very advanced decoder can use SIMD CPU instructions for high efficiency. It exists in multiple variants for the following CPUs/instruction sets: Intel SSE2, Intel SSSE3 and ARM NEON. On other CPUs a cross-platform version (the scalar decoder) is used.
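
A hypothetical dispatch sketch for choosing among those variants at runtime (function names are illustrative; the real node-yencode API differs):

#include <cstddef>

size_t DecodeScalar(const char* in, size_t len, char* out);
size_t DecodeSse2(const char* in, size_t len, char* out);
size_t DecodeSsse3(const char* in, size_t len, char* out);
size_t DecodeNeon(const char* in, size_t len, char* out);

using DecodeFunc = size_t (*)(const char*, size_t, char*);

// Pick the best available implementation once at startup.
DecodeFunc SelectDecoder()
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("ssse3")) return DecodeSsse3;
    if (__builtin_cpu_supports("sse2")) return DecodeSse2;
#elif defined(__aarch64__) || defined(__ARM_NEON)
    return DecodeNeon;   // NEON is mandatory on AArch64
#endif
    return DecodeScalar; // cross-platform fallback
}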

For comparison, the previous (per-line) decoder versions were tested again under the same conditions. All results are presented below.

| Decoder | CrcCheck Off | CrcCheck On |
| --- | --- | --- |
| new-decoder (per-line) | 282 MB/s | 245 MB/s |
| sse-decoder (per-line) | 304 MB/s | 268 MB/s |
| sse-decoder (raw mode) | 404 MB/s | 350 MB/s |
| scalar-decoder (raw mode) | 329 MB/s | 292 MB/s |
| no-decoder (raw mode) | 449 MB/s | 399 MB/s |

  • new-decoder (per-line) refers to the improved classic decoder tested in the previous tests;
  • sse-decoder (per-line) is the SIMD decoder running in clean mode (as opposed to raw mode); the line processing is performed by other program parts, just like in "new-decoder (per-line)";
  • sse-decoder (raw mode) is the SIMD decoder running in raw mode, with the other program parts reworked to support it;
  • scalar-decoder (raw mode) is similar to the previous one but without using SIMD CPU instructions;
  • no-decoder means decoding was disabled using the compiler define SKIP_ARTICLE_DECODING.

Test conditions are similar to previous tests:

  • Apple MacBook 2010 with Intel Core i5-520 (2 cores @ 2.4 GHz, 4 threads), macOS 64 bit;
  • all tests were performed 4 times; the worst result was discarded and the average of the remaining three results was taken;
  • downloading from NServ running on the same machine with the memory cache option;
  • writing of downloaded data was disabled in NZBGet using the option SkipWrite to reduce the influence of the disk subsystem.

Note: The current version of the raw decoder doesn't support end-of-stream detection. Therefore an extra scan of the incoming data is needed before feeding the decoder in order to process the data properly. The author of the decoder is working on an improved version with built-in end-of-stream detection, which should improve overall performance further.
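
One plausible form of that extra pre-scan (illustrative only, not the actual NZBGet code), looking for the NNTP article terminator:

#include <cstddef>
#include <string>

// An NNTP article body ends with a line containing only ".", i.e. the byte
// sequence "\r\n.\r\n". Without end-of-stream detection inside the decoder,
// each received chunk has to be scanned for this terminator first, which is
// effectively a second pass over the data. A real implementation also has to
// handle a terminator that straddles two chunks.
bool FindArticleEnd(const char* chunk, size_t len, size_t& endOffset)
{
    static const char kTerminator[] = "\r\n.\r\n";
    std::string data(chunk, len);
    size_t pos = data.find(kTerminator);
    if (pos == std::string::npos)
        return false;
    endOffset = pos + sizeof(kTerminator) - 1;  // bytes up to and including the terminator
    return true;
}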

hugbug added a commit that referenced this issue Oct 10, 2017

#448: speed optimisation in NServ
when using unlimited memory cache (command line switch “-m 0”)
@hugbug

Member Author

commented Oct 10, 2017

While making tests on the two ARM devices, a performance bottleneck in NServ was discovered and fixed. The CPU usage of NServ has been greatly reduced, which in turn gives more CPU time to NZBGet and increases the speed. All tests were rerun with the improved NServ (including the tests on Mac).

Test devices

  1. MacBook 2010 with Intel Core i5-520 (2 cores @ 2.4 GHz, 4 threads), macOS 64 bit;
  2. PVR with ARMv7 Cortex A-15 (2 cores @ 1.7GHz), Broadcom BCM 7251s, Linux 32 bit;
  3. NanoPi NEO2 with ARMv8 Cortex A-53 (4 cores @ 1.5GHz), Allwinner H5, Linux 64 bit. NZBGet runs in 32 bit mode.

Test results

All numbers are in MB/s. For each decoder two test cases were measured - with and without CRC calculation; the latter is shown in parentheses. The overhead of CRC calculation shows how much improvement potential is still there - the CRC routine isn't optimised for SIMD yet.

Once again a reminder that the speeds below represent overall download speed in NZBGet, not just decoding speed.

| Decoder | MacBook 2010 | PVR ARMv7 | NEO2 ARMv8 |
| --- | --- | --- | --- |
| new-decoder per-line mode | 307 (354) | 89 (99) | 102 (118) |
| simd-decoder per-line mode | 336 (404) | 89 (100) | 101 (119) |
| simd-decoder raw mode | 467 (570) | 99 (109) | 121 (141) |
| scalar-decoder raw mode | 369 (421) | 93 (105) | 107 (124) |
| no-decoder raw mode | 544 (693) | 111 (128) | 145 (168) |

Observations

  • significant improvement when going from scalar to SIMD on Intel: 369 -> 467 MB/s: +25%;
  • good speed increase on ARMv8: 107 -> 121 MB/s: +13%;
  • small difference on ARMv7: +6%;
  • raw mode increases speed (compared to line mode) even without the SIMD decoder, and the SIMD decoder can show its full potential in raw mode (thanks to larger blocks);
  • CRC calculation eats a significant amount of speed and is worth improving. We could get a potential speed increase of up to 22% on Intel and 17% on ARMv8 (theoretical limit).

@hugbug

Member Author

commented Oct 12, 2017

Tests for the PVR ARMv7 box when downloading from another PC running NServ (as opposed to NServ running on the same device, as in the previous tests):

| Decoder | PVR ARMv7 via Ethernet |
| --- | --- |
| new-decoder per-line mode | 67 (74) |
| simd-decoder per-line mode | 68 (74) |
| simd-decoder raw mode | 73 (81) |
| scalar-decoder raw mode | 71 (78) |
| no-decoder raw mode | 82 (91) |

Discovered anomaly: the process ksoftirqd takes about 40% (out of 200%) of the CPU time. That's more than NServ uses on that device (about 20%). As a result nzbget gets less CPU time and the overall performance is reduced.

My guess is that the ksoftirqd activity has something to do with the Ethernet chip. I don't know whether that can be fixed via some system setting.

@hugbug

Member Author

commented Oct 12, 2017

SIMD CRC

Integrated SIMD CRC routines for Intel and ARMv8 into NZBGet.

  • scalar-crc: the CRC routine used in the previous test (slice by 4);
  • simd-crc: the SIMD routine. On Intel this is crc_fold, a sophisticated routine that uses the PCLMUL CPU instruction; on ARMv8 it's a simple routine whose job is to execute the dedicated crc32 CPU instruction (see the sketch after this list).
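
A minimal sketch of the ARMv8 case (assuming a compiler and CPU with the CRC32 extension, e.g. built with -march=armv8-a+crc); the Intel crc_fold routine is far more involved and is not reproduced here:

#include <arm_acle.h>
#include <cstdint>
#include <cstring>
#include <cstddef>

// Standard CRC-32 (the yEnc/zlib polynomial) using the dedicated ARMv8
// crc32 instructions, 8 bytes at a time with a byte-wise tail.
uint32_t Crc32Armv8(uint32_t crc, const unsigned char* data, size_t len)
{
    crc = ~crc;
    while (len >= 8)
    {
        uint64_t chunk;
        std::memcpy(&chunk, data, 8);
        crc = __crc32d(crc, chunk);
        data += 8;
        len -= 8;
    }
    while (len--)
    {
        crc = __crc32b(crc, *data++);
    }
    return ~crc;
}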

All numbers are in MB/s. For each decoder two test cases were measured - with and without CRC calculation; the latter is shown in parentheses. For convenience the table also includes all previous measurements.

| Decoder | MacBook 2010 | PVR ARMv7 | NEO2 ARMv8 |
| --- | --- | --- | --- |
| new-decoder per-line mode, scalar-crc | 307 (354) | 89 (99) | 102 (118) |
| simd-decoder per-line mode, scalar-crc | 336 (404) | 89 (100) | 101 (119) |
| simd-decoder raw mode, scalar-crc | 467 (570) | 99 (109) | 121 (141) |
| scalar-decoder raw mode, scalar-crc | 369 (421) | 93 (105) | 107 (124) |
| no-decoder raw mode, scalar-crc | 544 (693) | 111 (128) | 145 (168) |
| simd-decoder raw mode, simd-crc | 520 (563) | n/a | 136 (141) |
| scalar-decoder raw mode, simd-crc | 397 (420) | n/a | 118 (125) |
| no-decoder raw mode, simd-crc | 621 (684) | n/a | 165 (169) |

Conclusion

  • Intel: with the SIMD CRC routine the speed improved from 467 MB/s to 520 MB/s, significantly reducing the gap between CRC-ON (520 MB/s) and CRC-OFF (563 MB/s);
  • ARMv8: with the dedicated crc32 instruction we get the CRC calculation almost for free: 136 MB/s CRC-ON vs. 141 MB/s CRC-OFF. Speed improvement: 121 MB/s -> 136 MB/s.
@hugbug

Member Author

commented Oct 13, 2017

UPDATE: The benchmark table has been moved to the very first post.

@sanderjo

Contributor

commented Oct 14, 2017

Impressive improvements.

Strange that the MacBook (2010, i5) with all optimizations performs better than the Dell (2015, i7). The MacBook seems to take over the lead with SIMD ... I would expect no OS or compiler influence there, as SIMD is pure CPU stuff.

BTW: For the Mac and Dell, I would mention the CPU in the table header, just like you did for the ARM devices ... easier for the reader.

@hugbug

Member Author

commented Oct 14, 2017

See footnote 6: NServ taking a lot of resources on the Windows machine prevents the download speed in NZBGet from going further up.

@hugbug

Member Author

commented Oct 15, 2017

I've done two more test sets on the Dell notebook:

  • installed a fresh Windows 10 on an external drive and reran all tests there;
  • booted from a Linux DVD and ran all the tests.

The table above has been updated with the new numbers. The previous numbers for Windows were replaced with the new ones, and a new column for "Dell Linux" was added.

I've also added the row "raw decoder", which serves as a baseline for measuring SIMD performance.

@animetosho


commented Oct 15, 2017

Thanks for the benchmarks!

Are you perhaps using different compilers on Linux and Windows? The Linux numbers make more sense, as the i7 5600U is a significantly more powerful CPU than the i5 520M. Or perhaps do you have any clue as to why NServ is using 30% CPU in Windows?

@hugbug

Member Author

commented Oct 16, 2017

The compilers are different (VS 2015 on Windows, GCC 5.x on Linux, Clang on macOS), but I don't think the difference is caused by the compilers. The per-process CPU usage on Windows is quite different from Linux and Mac.

| Process | Linux/Mac CPU usage | Windows CPU usage [1] |
| --- | --- | --- |
| nzbget | 330% | 180% |
| nserv | 30% | 120% |
| System process | 0% | 60% |
| idle | 0-40% | 0% |

[1] Numbers reported by Task Manager on Windows are scaled to 400% to make them comparable with the numbers from top.

The numbers are from memory and therefore not exact; they are for nzbget binaries with the highest optimisations (SIMD decoder and CRC).

On Windows the nzbget process gets only about half of the CPU time compared to Linux/Mac, and the speed is accordingly lower.

Profiling of NServ reports that it spends 96% of its time in the Windows recv call, which seems to be OK since it uses blocking sockets and most of the time waits for new requests from nzbget. Could recv on Windows be so bad that it occupies much of the CPU time even in the waiting state?

@animetosho

commented Oct 16, 2017

I wouldn't really know what the cause is. Is this a sampling profiler? One would think that a blocking call wouldn't use any CPU, but yeah, a sampling profiler will be stuck with a blocking function.
The system process consuming 60% is interesting too. Perhaps something related to TCP/networking settings on the system or similar. Maybe try setting up some other server and doing a large transfer, and watch CPU usage (something like a simple Python HTTP server that sends endless data, and just try downloading from it on the same machine).

@Safihre

commented Oct 18, 2017

I see the same when testing SABnzbd: the System process takes up to 40-50% of CPU time at 100 MB/s+ speeds.
Google seems to suggest faulty drivers as the cause, but it requires lots of debugging, so I never tried to figure out what causes it on my system. But since you also see it with NServ locally, that's strange... What is this mysterious process doing?

@animetosho

commented Oct 19, 2017

Network activity all goes through the kernel, which does stuff like packet routing, TCP stream handling etc. The System process is a pseudo-process which represents the Windows NT kernel. A high CPU usage for this process just means that the kernel is doing a lot of work, most likely with regards to networking in your case.

I'm not sure what may cause high kernel usage, but a network driver problem or misconfigured NIC wouldn't be out of the realms of feasibility, I'd think.

One suspects that the Windows TCP implementation isn't as optimal as Linux's, but perhaps there's tuning that can be done.

I did a quick benchmark using a much simpler example: an HTTP server which just sends null data. Here's my node.js script (should be easy to port to some other scripting language, but if you want to try and run it, just download node.exe, save this to a .js file, and run node.exe saved_file.js):

const http = require('http');
const url = require('url');
const srv = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  const bs = url.parse(req.url).pathname.replace(/^\//, '') | 0;
  const buf = new Buffer(bs);
  buf.fill(0);
  
  console.log('Got connection for bs='+bs);
  
  var sent = bs;
  var start = Date.now();
  res.on('drain', () => {
    sent += bs;
    while(res.write(buf))
      sent += bs;
  });
  req.on('close', () => {
    res.end();
    srv.close();
    var t = Date.now() - start;
    console.log('Request ended, sent ' + sent + ' bytes @ ' + (sent / t /1048576*1000) + ' MB/s');
  });
  while(res.write(buf))
    sent += bs;
});

srv.listen(8888, '127.0.0.1');

Testing using 2 instances of wget -O/dev/null 127.0.0.1:8888/1048576 (via the MSYS bash shell), the System process consumes a whole CPU core. The actual total transfer rate is around 4.3 Gbps on my machine. Testing on a slightly more powerful Linux server, I get around 36 Gbps, and, run via the time command, about 80% of the CPU time is spent in the kernel ("sys").
So this basic test seems to show Linux being more efficient, although it's not really a fair test (Windows TCP settings not changed, but I've tuned the TCP parameters in this Linux server a bit).

Doesn't really answer everything, but just demonstrating the effect that networking has on the kernel.

Edit: sorry if I went too far off topic.

@hugbug

Member Author

commented Oct 19, 2017

One-pass decoding

A new version of the SIMD decoder with support for stream-end detection (which @animetosho released a few days ago) makes it possible to process the data in one pass instead of two. The new decoder has been integrated into NZBGet and improves performance further.

The benchmark table in the first post has been updated with the new row "one-pass decoding".

@hugbug

Member Author

commented Oct 26, 2017

More tests were run after migrating the Linux installer builds to "glibc". The table in the first post has been extended with the new row "glibc".

Tests with active TLS/SSL were also made, and the new section "TLS/SSL" was added to the first post.

@animetosho

commented Oct 27, 2017

Interesting that the libc has such an impact - do you use a lot of libc functions when handling the data?

Also, with the TLS benchmarks, what does 'AES' correspond to? AES-256-GCM perhaps?
AES-GCM should beat pretty much anything else with AES-NI acceleration (particularly on your Broadwell chip). I'd imagine that ARMv8's crypto is similar. As such, AES-128-GCM should be the fastest cipher on chips with AES acceleration, otherwise RC4-MD5 is hard to beat*.

* If memory serves me right, RC4 and MD5 are removed in TLS 1.3 (as they're both rather weak), but that also brings in ChaCha20-Poly1305 cipher which should be comparable in speed. However, at the rate news providers change, it'll probably be a while until TLS 1.3 becomes common place.

@hugbug

Member Author

commented Oct 27, 2017

do you use a lot of libc functions when handling the data?

With an article size of 500 KB that's 2000 articles per second on Intel (1000 MB/s) or 200-400 articles per second on ARM. That's quite a lot of management overhead not related to the actual downloading and decoding.

Plus NZBGet creates a separate thread for each article download. That's 2000 (Intel) or 200-400 (ARM) threads per second created and deleted. Nice, hah? ;)
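
To put a rough number on that thread churn, here is a quick illustrative micro-benchmark (not NZBGet code) that only creates and joins short-lived threads:

#include <chrono>
#include <cstdio>
#include <thread>

int main()
{
    const int kThreads = 2000;  // roughly one thread per article at ~1000 MB/s with 500 KB articles
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < kThreads; i++)
    {
        std::thread worker([] { /* article download would happen here */ });
        worker.join();
    }
    double elapsed = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    std::printf("%d create+join cycles in %.3f s (%.1f us each)\n",
        kThreads, elapsed, elapsed / kThreads * 1e6);
    return 0;
}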

I don't know which parts of libc have an impact here. Thread management shouldn't, as it should translate transparently into kernel calls. The most likely candidates are the heap manager or the string functions (those parts are implemented in libc).

It was a surprise for me that uClibc and glibc perform so differently in NZBGet.

One possible way to measure the management overhead would be to run tests with articles of, say, 10 MB.

what does 'AES' correspond to?

It's what is passed to SSL_CTX_set_cipher_list. The passed string is a filter: it can be the full name of a cipher or only a part of it. In the latter case OpenSSL chooses a suitable cipher. The command openssl ciphers AES prints all matching ciphers (the output depends on the OpenSSL options used during compilation).

I also tried AES256-SHA256 and it was slower than just AES.
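
For reference, a hedged sketch of how such a cipher filter string is applied via the OpenSSL API (OpenSSL 1.0.x-style, error handling trimmed; not the actual NZBGet code):

#include <openssl/ssl.h>

// The string is a filter, not necessarily a full cipher name; OpenSSL picks a
// matching cipher during the handshake. "openssl ciphers AES" on the command
// line shows what the filter expands to.
SSL_CTX* CreateTlsContext(const char* cipherFilter /* e.g. "AES", "AES128", "RC4-MD5" */)
{
    SSL_CTX* ctx = SSL_CTX_new(SSLv23_client_method());
    if (!ctx)
        return nullptr;
    if (!SSL_CTX_set_cipher_list(ctx, cipherFilter))
    {
        SSL_CTX_free(ctx);
        return nullptr;
    }
    return ctx;
}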

@animetosho


commented Oct 27, 2017

Indeed, I'd expect other overhead, it's just that I'm surprised (as you seem to be) at the effect a libc library has.

I think OpenSSL tends to prefer ECDHE-RSA-AES256-GCM-SHA384 if you just give it AES. I'm guessing AES256-SHA256 runs AES in CBC mode (plus needs a SHA256 HMAC), so AES in GCM mode (default option) should definitely be faster where GCM is accelerated (and GCM doesn't need a HMAC). Perhaps you can try using AES128 instead of AES (AES128 should be slightly faster than AES256), which should also select a GCM mode by default.

@hugbug

Member Author

commented Oct 28, 2017

Speeds on Neo2 ARMv8 device:

| Cipher string | Speed (MB/s) |
| --- | --- |
| (empty) | 151 |
| AES | 151 |
| AES128 | 156 |
| AES256 | 150 |
| AES256-SHA256 | 113 |
| ECDHE-RSA-AES256-GCM-SHA384 | 150 |

Speeds fluctuate between test runs; 150 and 151 can be considered the same speed.

hugbug closed this Oct 29, 2017

hugbug added a commit that referenced this issue Oct 29, 2017

#448, 186da63: NServ memory cache switch
no longer has memory limit parameter. The parameter wasn’t respected
anyway.