
SSE2 patches for encoding and decoding functions #302

Merged
merged 4 commits into haskell:master on Apr 25, 2021

Conversation

@ethercrow (Contributor) commented Oct 16, 2020

This PR is just a collection of SSE2 patches from those PRs:

@phadej please have a look.

@ethercrow changed the title from "Sse2 combined" to "SSE2 patches for encoding and decoding functions" on Oct 16, 2020
@ethercrow (Author)

@phadej If you're too busy, are there other maintainers we could ping and ask for review?

@phadej (Contributor) commented Oct 22, 2020

I'm only merging simple patches; anything non-trivial would need input from @hvr.

@ethercrow (Author)

@hvr ping

@Lysxia linked an issue on Mar 8, 2021 that may be closed by this pull request
@ethercrow (Author) commented Mar 28, 2021

quarterly bump

@Bodigrim (Contributor)

@ethercrow could you please rebase and rerun relevant benchmarks? First on master with --csv master.csv, then on your branch with --baseline master.csv --csv branch.csv.

@ethercrow (Author) commented Mar 28, 2021

> cabal bench --benchmark-options="--baseline master.csv --csv sse2.csv -p code" 2>&1 | tee bench.log

Build profile: -w ghc-9.0.1 -O1
In order, the following will be built (use -v for more details):
 - text-1.2.4.2 (bench:text-benchmarks) (first run)
Preprocessing benchmark 'text-benchmarks' for text-1.2.4.2..
Building benchmark 'text-benchmarks' for text-1.2.4.2..
Running 1 benchmarks...
Benchmark text-benchmarks: RUNNING...
All
  DecodeUtf8
    Strict+html:               OK (1.98s)
      235 μs ±  14 μs, 11% faster than baseline
    Stream+html:               OK (1.83s)
      219 μs ±  13 μs, 25% faster than baseline
    IConv+html:                OK (1.36s)
      1.3 ms ± 112 μs
    StrictLength+html:         OK (1.80s)
      865 μs ±  59 μs,  9% faster than baseline
    StrictInitLength+html:     OK (2.28s)
      2.1 ms ± 139 μs
    Lazy+html:                 OK (1.82s)
      219 μs ±  14 μs, 25% faster than baseline
    LazyLength+html:           OK (1.87s)
      450 μs ±  33 μs, 14% faster than baseline
    LazyInitLength+html:       OK (2.33s)
      2.2 ms ± 122 μs
  DecodeUtf8
    Strict+xml:                OK (1.67s)
      105 ms ± 8.8 ms
    Stream+xml:                OK (1.63s)
      107 ms ± 8.1 ms
    IConv+xml:                 OK (1.42s)
      235 ms ±  14 ms
    StrictLength+xml:          OK (1.16s)
      164 ms ±  16 ms,  9% faster than baseline
    StrictInitLength+xml:      OK (2.46s)
      350 ms ±  25 ms
    Lazy+xml:                  OK (1.61s)
      107 ms ± 7.2 ms
    LazyLength+xml:            OK (1.92s)
      127 ms ± 7.0 ms,  7% faster than baseline
    LazyInitLength+xml:        OK (1.22s)
      406 ms ±  32 ms
  DecodeUtf8
    Strict+ascii:              OK (1.51s)
       11 ms ± 1.0 ms, 42% faster than baseline
    Stream+ascii:              OK (1.53s)
       24 ms ± 1.7 ms, 28% faster than baseline
    IConv+ascii:               OK (2.19s)
      304 ms ±  13 ms
    StrictLength+ascii:        OK (1.15s)
      163 ms ±  14 ms, 14% faster than baseline
    StrictInitLength+ascii:    OK (3.33s)
      473 ms ±  16 ms
    Lazy+ascii:                OK (1.52s)
       24 ms ± 1.7 ms, 27% faster than baseline
    LazyLength+ascii:          OK (2.05s)
       65 ms ± 3.6 ms, 11% faster than baseline
    LazyInitLength+ascii:      OK (3.53s)
      499 ms ±  20 ms, 10% faster than baseline
  DecodeUtf8
    Strict+russian:            OK (1.86s)
       14 ms ± 1.2 ms
    Stream+russian:            OK (1.61s)
       12 ms ± 858 μs, 16% faster than baseline
    IConv+russian:             OK (1.26s)
       20 ms ± 1.8 ms
    StrictLength+russian:      OK (2.05s)
       16 ms ± 1.1 ms, 10% faster than baseline
    StrictInitLength+russian:  OK (1.88s)
       30 ms ± 2.2 ms
    Lazy+russian:              OK (1.61s)
       12 ms ± 863 μs, 16% faster than baseline
    LazyLength+russian:        OK (1.93s)
       15 ms ± 916 μs, 14% faster than baseline
    LazyInitLength+russian:    OK (1.12s)
       36 ms ± 3.5 ms
  DecodeUtf8
    Strict+japanese:           OK (1.53s)
       23 μs ± 1.8 μs
    Stream+japanese:           OK (1.38s)
       21 μs ± 1.7 μs, 27% faster than baseline
    IConv+japanese:            OK (2.11s)
       32 μs ± 1.8 μs
    StrictLength+japanese:     OK (1.15s)
       35 μs ± 3.3 μs, 12% faster than baseline
    StrictInitLength+japanese: OK (3.63s)
       55 μs ± 2.4 μs,  8% faster than baseline
    Lazy+japanese:             OK (1.39s)
       21 μs ± 1.7 μs, 26% faster than baseline
    LazyLength+japanese:       OK (1.64s)
       25 μs ± 1.7 μs, 22% faster than baseline
    LazyInitLength+japanese:   OK (1.94s)
       58 μs ± 5.6 μs, 10% faster than baseline
  DecodeASCII
    strict decodeUtf8:         OK (2.84s)
       11 ms ± 574 μs
    strict decodeLatin1:       OK (2.83s)
       11 ms ± 610 μs
    strict decodeASCII:        OK (2.86s)
       11 ms ± 476 μs
    lazy decodeUtf8:           OK (1.60s)
       25 ms ± 1.9 ms
    lazy decodeLatin1:         OK (1.58s)
       25 ms ± 1.9 ms
    lazy decodeASCII:          OK (1.57s)
       25 ms ± 1.8 ms
  EncodeUtf8
    Text (non-ASCII):          OK (2.04s)
      2.0 ms ± 162 μs
    LazyText (non-ASCII):      OK (1.51s)
      2.8 ms ± 239 μs
  EncodeUtf8
    Text (ASCII):              OK (1.94s)
      116 μs ± 9.1 μs
    LazyText (ASCII):          OK (3.70s)
      1.8 ms ± 171 μs
  Pure
    decode
      Text+tiny:               OK (2.11s)
         31 ns ± 1.7 ns, 36% slower than baseline
      LazyText+tiny:           OK (2.24s)
         66 ns ± 3.9 ns,  9% slower than baseline
    decode'
      Text+tiny:               OK (1.28s)
         38 ns ± 3.7 ns, 25% slower than baseline
      LazyText+tiny:           OK (1.34s)
         78 ns ± 7.2 ns
    encode
      Text+tiny:               OK (1.78s)
         26 ns ± 1.9 ns, 10% slower than baseline
      LazyText+tiny:           OK (1.90s)
         56 ns ± 3.4 ns
    length
      decode
        Text+tiny:             OK (1.38s)
           20 ns ± 2.0 ns
        LazyText+tiny:         OK (1.21s)
           71 ns ± 6.6 ns
  Pure
    decode
      Text+ascii-small:        OK (9.82s)
        9.3 μs ± 199 ns, 53% faster than baseline
      LazyText+ascii-small:    OK (2.54s)
        9.4 μs ± 419 ns, 56% faster than baseline
    decode'
      Text+ascii-small:        OK (2.44s)
        9.1 μs ± 705 ns, 54% faster than baseline
      LazyText+ascii-small:    OK (2.57s)
        9.6 μs ± 796 ns, 55% faster than baseline
    encode
      Text+ascii-small:        OK (1.96s)
        7.3 μs ± 642 ns, 61% faster than baseline
      LazyText+ascii-small:    OK (1.27s)
         38 μs ± 3.4 μs, 45% faster than baseline
    length
      decode
        Text+ascii-small:      OK (1.86s)
          223 μs ±  21 μs
        LazyText+ascii-small:  OK (1.40s)
           42 μs ± 3.7 μs, 51% faster than baseline
  Pure
    decode
      Text+ascii:              OK (3.40s)
         11 ms ± 433 μs, 42% faster than baseline
      LazyText+ascii:          OK (5.73s)
         87 ms ± 1.9 ms
    decode'
      Text+ascii:              OK (3.48s)
         11 ms ± 488 μs, 42% faster than baseline
      LazyText+ascii:          OK (5.73s)
         88 ms ± 1.8 ms
    encode
      Text+ascii:              OK (1.97s)
         11 ms ± 974 μs, 44% faster than baseline
      LazyText+ascii:          OK (4.24s)
         63 ms ± 1.8 ms, 20% faster than baseline
    length
      decode
        Text+ascii:            OK (3.20s)
          193 ms ±  11 ms
        LazyText+ascii:        OK (1.57s)
           38 ms ± 3.4 ms, 48% faster than baseline
  Pure
    decode
      Text+english:            OK (20.14s)
        593 μs ±  42 μs, 51% faster than baseline
      LazyText+english:        OK (1.36s)
        613 μs ±  57 μs, 50% faster than baseline
    decode'
      Text+english:            OK (1.30s)
        578 μs ±  53 μs, 51% faster than baseline
      LazyText+english:        OK (10.65s)
        637 μs ±  22 μs, 50% faster than baseline
    encode
      Text+english:            OK (2.55s)
        582 μs ±  41 μs, 52% faster than baseline
      LazyText+english:        OK (2.13s)
        2.0 ms ± 133 μs, 47% faster than baseline
    length
      decode
        Text+english:          OK (1.69s)
           13 ms ± 1.0 ms
        LazyText+english:      OK (2.58s)
          2.4 ms ± 117 μs, 50% faster than baseline
  Pure
    decode
      Text+russian:            OK (2.24s)
         34 μs ± 1.8 μs
      LazyText+russian:        OK (1.95s)
         29 μs ± 1.7 μs, 16% faster than baseline
    decode'
      Text+russian:            OK (2.22s)
         34 μs ± 1.7 μs
      LazyText+russian:        OK (1.96s)
         30 μs ± 1.9 μs, 16% faster than baseline
    encode
      Text+russian:            OK (1.50s)
         11 μs ± 840 ns, 14% faster than baseline
      LazyText+russian:        OK (1.50s)
         11 μs ± 1.0 μs, 21% faster than baseline
    length
      decode
        Text+russian:          OK (2.50s)
           38 μs ± 1.8 μs,  8% faster than baseline
        LazyText+russian:      OK (2.17s)
           32 μs ± 1.7 μs, 21% faster than baseline
  Pure
    decode
      Text+japanese:           OK (1.55s)
         23 μs ± 1.8 μs
      LazyText+japanese:       OK (1.36s)
         21 μs ± 1.8 μs, 26% faster than baseline
    decode'
      Text+japanese:           OK (1.53s)
         23 μs ± 1.7 μs
      LazyText+japanese:       OK (1.36s)
         21 μs ± 1.8 μs, 26% faster than baseline
    encode
      Text+japanese:           OK (2.17s)
        8.2 μs ± 468 ns, 19% faster than baseline
      LazyText+japanese:       OK (1.48s)
         11 μs ± 973 ns
    length
      decode
        Text+japanese:         OK (2.31s)
           35 μs ± 2.0 μs
        LazyText+japanese:     OK (1.52s)
           23 μs ± 1.8 μs, 29% faster than baseline

All 98 tests passed (239.41s)
Benchmark text-benchmarks: FINISH

@Bodigrim (Contributor)

Speedups look great! I think what we need here is multiplatform CI to ensure that there are no gcc/clang issues on various OSes. Something like a separate workflow running

jobs:
  build:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-16.04, ubuntu-18.04, ubuntu-20.04, windows-2019, macos-10.15, macos-11.0]

and FreeBSD, all with the latest available (or default) GHC.

@phadej (Contributor) commented Mar 29, 2021

@Bodigrim I played with the relevant part on godbolt.org; clang and GCC are both happy with that code. MSVC doesn't define the __i386__ and __x86_64__ macros, but you cannot use MSVC with GHC anyway.

The problem with testing is rather that non-i386 (and non-x86_64) architectures are not tested. One option is to have a way to toggle the ifdefs so that no architecture-specific code is used, e.g. something like

package text
  cpp-options: -DNO_PLATFORM_SPECIFICS
  -- I don't remember whether cpp-options works here; it should?
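
For illustration, a minimal sketch of what such a toggle could guard on the C side (NO_PLATFORM_SPECIFICS is just the name assumed in the snippet above, it is otherwise hypothetical):

/* Hypothetical: let CI force the portable code path even on x86. */
#if (defined(__i386__) || defined(__x86_64__)) && !defined(NO_PLATFORM_SPECIFICS)
    /* architecture-specific fast path (unaligned loads, SSE2) */
#else
    /* portable scalar path only */
#endif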

I'd avoid adding new Cabal flags. Ideally users won't need to know about development or testing concerns (i.e. I'd try to figure out whether a developer flag is really required).

What will be relevant sooner or later is ARM64. But there isn't any ARM-specific code, nor is there an easy (and free, or even cheap) way to get ARM GitHub Actions runners.


I played with the following snippet:

We need to remove static inline for the assembly to be generated on godbolt.

#include <stdint.h>
#include <stddef.h>

#if defined(__x86_64__)
#include <emmintrin.h>
#include <xmmintrin.h>
#endif

#define UTF8_ACCEPT 0
#define UTF8_REJECT 12


static const uint8_t utf8d[] = {
  /*
   * The first part of the table maps bytes to character classes that
   * reduce the size of the transition table and create bitmasks.
   */
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,  9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
   7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,  7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
   8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
  10,3,3,3,3,3,3,3,3,3,3,3,3,4,3,3, 11,6,6,6,5,8,8,8,8,8,8,8,8,8,8,8,

  /*
   * The second part is a transition table that maps a combination of
   * a state of the automaton and a character class to a state.
   */
   0,12,24,36,60,96,84,12,12,12,48,72, 12,12,12,12,12,12,12,12,12,12,12,12,
  12, 0,12,12,12,12,12, 0,12, 0,12,12, 12,24,12,12,12,12,12,24,12,24,12,12,
  12,12,12,12,12,12,12,24,12,12,12,12, 12,24,12,12,12,12,12,12,12,24,12,12,
  12,12,12,12,12,12,12,36,12,36,12,12, 12,36,12,12,12,12,12,36,12,36,12,12,
  12,36,12,12,12,12,12,12,12,12,12,12,
};

uint32_t
decode(uint32_t *state, uint32_t* codep, uint32_t byte) {
  uint32_t type = utf8d[byte];

  *codep = (*state != UTF8_ACCEPT) ?
    (byte & 0x3fu) | (*codep << 6) :
    (0xff >> type) & (byte);

  return *state = utf8d[256 + *state + type];
}

uint8_t const *
_hs_text_decode_utf8_int(uint16_t *const dest, size_t *destoff,
			 const uint8_t **src, const uint8_t *srcend,
			 uint32_t *codepoint0, uint32_t *state0)
{
  uint16_t *d = dest + *destoff;
  const uint8_t *s = *src, *last = *src;
  uint32_t state = *state0;
  uint32_t codepoint = *codepoint0;

  while (s < srcend) {
#if defined(__i386__) || defined(__x86_64__)
    /*
     * This code will only work on a little-endian system that
     * supports unaligned loads.
     *
     * It gives a substantial speed win on data that is purely or
     * partly ASCII (e.g. HTML), at only a slight cost on purely
     * non-ASCII text.
     */

    if (state == UTF8_ACCEPT) {
#if defined(__x86_64__)
      const __m128i zeros = _mm_set1_epi32(0);
      while (s < srcend - 8) {
        const uint64_t hopefully_eight_ascii_chars = *((uint64_t *) s);
        if ((hopefully_eight_ascii_chars & 0x8080808080808080LL) != 0LL)
          break;
        s += 8;

        /* Load 8 bytes of ASCII data */
        const __m128i eight_ascii_chars = _mm_cvtsi64_si128(hopefully_eight_ascii_chars);
        /* Interleave with zeros */
        const __m128i eight_utf16_chars = _mm_unpacklo_epi8(eight_ascii_chars, zeros);
        /* Store the resulting 8 bytes into destination */
        _mm_storeu_si128((__m128i *)d, eight_utf16_chars);
        d += 8;
      }
#else  
      while (s < srcend - 4) {
        codepoint = *((uint32_t *) s);
        if ((codepoint & 0x80808080) != 0)
          break;
        s += 4;
        /*
         * Tried 32-bit stores here, but the extra bit-twiddling
         * slowed the code down.
         */
        *d++ = (uint16_t) (codepoint & 0xff);
        *d++ = (uint16_t) ((codepoint >> 8) & 0xff);
        *d++ = (uint16_t) ((codepoint >> 16) & 0xff);
        *d++ = (uint16_t) ((codepoint >> 24) & 0xff);
      }
#endif
      last = s;
    } /* end if (state == UTF8_ACCEPT) */
#endif

    if (decode(&state, &codepoint, *s++) != UTF8_ACCEPT) {
      if (state != UTF8_REJECT)
	continue;
      break;
    }

    if (codepoint <= 0xffff)
      *d++ = (uint16_t) codepoint;
    else {
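      /* Code point above the BMP: emit a UTF-16 surrogate pair
         (0xD7C0 == 0xD800 - (0x10000 >> 10)). */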
      *d++ = (uint16_t) (0xD7C0 + (codepoint >> 10));
      *d++ = (uint16_t) (0xDC00 + (codepoint & 0x3FF));
    }
    last = s;
  }

  *destoff = d - dest;
  *codepoint0 = codepoint;
  *state0 = state;
  *src = last;

  return s;
}

For the lines around /* Load 8 bytes of ASCII data */, clang 11.0.1 generates (-O1):

        addq    $8, %r12                        # s += 8
        movq    %rax, %xmm0                     # load 8 bytes ...
        punpcklbw       %xmm1, %xmm0            # interleave with zeros
        movdqu  %xmm0, (%r13)                   # store the resulting bytes
        addq    $16, %r13                       # d += 8

and GCC 10.2 generates virtually the same:

        addq    $8, %rdx
        movq    %rax, %xmm0
        pxor    %xmm1, %xmm1
        punpcklbw       %xmm1, %xmm0
        movups  %xmm0, (%rbx)
        addq    $16, %rbx

so they generate the same code. (I don't know whether GHC calls the C compiler with -O1, -O2, or the same -O it was given; -O2 for either C compiler doesn't change the code much, it just obfuscates it a bit by reordering instructions.)


TL;DR: I don't think there are GCC & Clang concerns. It's not C++17 code :)

@phadej (Contributor) commented Mar 29, 2021

I do wonder whether making the loop do aligned loads would make it even faster, but eh. Benchmarking that would be devastating.

In general I wonder whether the code for this exists in some C library. Decoding UTF-8 isn't something we should reimplement. (It would especially help if text's internal encoding is changed, because then this code will just go away; decoding UTF-8 to UTF-16 won't matter, it won't be a hot path.)

@ethercrow (Author)

> I do wonder whether making the loop do aligned loads would make it even faster, but eh. Benchmarking that would be devastating.

This came up when discussing the SIMD implementation of intersperse for bytestring, which is very similar to the code in this PR. On my machine (some i7 from 2017) there was no noticeable difference.

> In general I wonder whether the code for this exists in some C library. Decoding UTF-8 isn't something we should reimplement. (It would especially help if text's internal encoding is changed, because then this code will just go away; decoding UTF-8 to UTF-16 won't matter, it won't be a hot path.)

UTF-8-based text is a path z-haskell is pursuing: https://hackage.haskell.org/package/Z-Data

@ethercrow (Author)

Regarding CI on other OSes, I see that text uses haskell-ci for generating the GitHub Actions manifest, but it looks like that tool is Linux-only. I'm not sure; I don't see relevant documentation or any mention of Windows or macOS in its source code.

@Bodigrim (Contributor)

haskell-ci-based workflows do not support other OSes. My suggestion was to create two additional workflows: one for cross-platform testing with the latest GHC, and one for a FreeBSD run. Since haskell-ci already tests the project structure, it would be enough just to run cabal test / cabal bench --benchmark-option=-l. One can borrow some boilerplate from https://github.com/haskell/bytestring/blob/master/.github/workflows/ci.yml

Separate workflows are easier to manage, because GitHub allows rerunning a whole workflow, but not an individual job.

@ethercrow (Author)

Ah, that makes sense, I'll probably get to that over the weekend.

@ethercrow (Author)

Rebased onto master, tests passed on all newly introduced OSes.

@phadej (Contributor) commented Apr 11, 2021

2021-04-11T16:20:05.3951363Z           t_toTitle_1stNotLower:   FAIL
2021-04-11T16:20:05.3952004Z             *** Failed! Falsified (after 62 tests and 6 shrinks):
2021-04-11T16:20:05.3952528Z             "\4317"

on GHC-8.8.4, https://www.fileformat.info/info/unicode/char/10dd/index.htm

On the GHCs I tested, incl. GHC-9.0.1:

ghci> toUpper '\4317'
'\7325'
ghci> toTitle '\4317'
'\4317'
ghci> isUpper '\4317'
False
ghci> isLower '\4317'
True

It looks like Georgian text doesn't have a concept of title case (https://en.wikipedia.org/wiki/Georgian_Extended), and thus that test is broken.

@ethercrow (Author)

Could we merge this? I don't think that fixing t_toTitle_1stNotLower belongs in this PR.

@chessai (Member) commented Apr 15, 2021 via email

@Bodigrim (Contributor) left a comment

Arrgh, I'm deeply sorry, @ethercrow.
I had written a comment below, but forgot to press "Submit", and was awaiting your response.

FWIW it's not a blocker, I'm just being curious.

cbits/cbits.c Outdated
*/
const __m128i zeros = _mm_set1_epi32(0);
while (p < srcend - 3) {
/* Load 4 bytes of ASCII data */
Contributor

Is it possible to load 8 bytes here?

Contributor Author

Good catch; yes, this loop can be the same as in decodeUtf8, but without the check.

I think I implemented this one first with a 4-wide load, then realized it could be 8-wide while doing decodeUtf8, but forgot to port that back to Latin1.
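
A minimal sketch of that 8-wide Latin-1 loop, with hypothetical pointer names (it mirrors the decodeUtf8 ASCII fast path above, minus the check, since every Latin-1 byte maps directly to one UTF-16 code unit):

#include <stdint.h>
#include <emmintrin.h>

static void widen_latin1_sse2(uint16_t *d, const uint8_t *p, const uint8_t *srcend) {
  const __m128i zeros = _mm_setzero_si128();
  while (p < srcend - 8) {
    /* Load 8 Latin-1 bytes and interleave them with zeros,
       producing 8 UTF-16 code units. */
    const __m128i eight_bytes = _mm_loadl_epi64((const __m128i *) p);
    _mm_storeu_si128((__m128i *) d, _mm_unpacklo_epi8(eight_bytes, zeros));
    p += 8;
    d += 8;
  }
  /* The remaining (< 8) bytes are left to the existing scalar loop. */
}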


const uint64_t w = eight_chars.halves[0];
if (w & 0xFF80FF80FF80FF80ULL) {
Contributor

Could we actually just break here? Does it affect performance? I imagine we could compare a whole 128-bit register with a broadcasted 0xFF80, and either break or _mm_packus_epi16 + _mm_storel_epi64.
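
A minimal SSE2-only sketch of what that whole-register check might look like (hypothetical names and framing, not the code in this PR):

#include <stdint.h>
#include <emmintrin.h>

static void ascii_fast_path(uint8_t **dstp, const uint16_t **srcp, const uint16_t *srcend) {
  uint8_t *dst = *dstp;
  const uint16_t *src = *srcp;
  const __m128i non_ascii = _mm_set1_epi16((int16_t) 0xFF80);
  while (srcend - src >= 8) {
    /* Load 8 UTF-16 code units at once. */
    const __m128i chars = _mm_loadu_si128((const __m128i *) src);
    /* Bail out to the scalar loop if any code unit has bits outside 0x007F. */
    const __m128i is_ascii =
      _mm_cmpeq_epi16(_mm_and_si128(chars, non_ascii), _mm_setzero_si128());
    if (_mm_movemask_epi8(is_ascii) != 0xFFFF)
      break;
    /* All ASCII: narrow 8 x uint16 to 8 bytes and store them. */
    _mm_storel_epi64((__m128i *) dst, _mm_packus_epi16(chars, chars));
    src += 8;
    dst += 8;
  }
  *dstp = dst;
  *srcp = src;
}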

Contributor Author

Yeah, I experimented with where exactly, and after how many checks, the break should go, and chose one of the best-performing combinations. Unfortunately I didn't record those combinations, and the results might be different on different CPUs.

Contributor

I mean, it looks asymmetric to exercise each pair of bytes in the first uint64, but not in the second one. I understand that doing the same for the second uint64 makes the code more hairy, so maybe we should stop doing it for the first one as well? It would also allow us to avoid the union stuff.

Contributor Author

I tried both symmetric variants and found both to be slower than this one.

My understanding is that

  1. If you do too many checks by looking at individual bytes, the time cost of each SIMD loop iteration becomes higher.
  2. If you do too few checks (by not even looking at the bytes in the first half), then the probability of a SIMD loop iteration doing useful work decreases, which also hurts overall performance.

This probability differs across input data; the extreme cases would be pure ASCII (the SIMD routine always works) and pure Chinese text (the SIMD routine never works). On our test data I observed that inspecting the bytes in the first half was a sweet spot.

Bodigrim previously approved these changes Apr 19, 2021
cbits/cbits.c Outdated
src += 4;

if (eight_chars.halves[1] & 0xFF80FF80FF80FF80ULL) {
break;
Contributor

Here you can still pack and store the whole 0th half using the pext instruction:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_pext_u64&expand=4330
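
For reference, a hypothetical BMI2 sketch of that suggestion (not what the PR does; see the reply below). Given a uint64 holding 4 UTF-16 code units that are all known to be ASCII, pext compacts their low bytes; it requires compiling with -mbmi2:

#include <stdint.h>
#include <string.h>
#include <immintrin.h>

static uint8_t *pack4_ascii_pext(uint8_t *dst, uint64_t four_utf16_units) {
  /* Keep only the low byte of each 16-bit unit and compact them into 4 bytes. */
  const uint32_t packed = (uint32_t) _pext_u64(four_utf16_units, 0x00FF00FF00FF00FFULL);
  memcpy(dst, &packed, 4);
  return dst + 4;
}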

@ethercrow (Author) commented Apr 20, 2021

This PR intentionally uses only SSE2 to be compatible with any 64-bit x86 CPU. Quite a few models don't support pext, or their implementation of it is not fast. From Wikipedia:

> AMD processors before Zen 3[11] that implement PDEP and PEXT do so in microcode, with a latency of 18 cycles[12] rather than a single cycle. As a result, if the mask is known, it is often faster to use other instructions on AMD.

Co-authored-by: Kubo Kováč <733205+kuk0@users.noreply.github.com>
@Bodigrim (Contributor)

Unless there are more comments/suggestions, I'll merge this by the end of the week.

@Bodigrim merged commit e08b793 into haskell:master on Apr 25, 2021
@Bodigrim (Contributor)

Thanks @ethercrow, performance improvements are much appreciated. And sorry that it took so long.

@ethercrow deleted the sse2-combined branch on April 25, 2021 13:50
sjshuck added a commit to sjshuck/text that referenced this pull request Apr 15, 2022