
SSE2 patches for encoding and decoding functions #302

Merged
merged 4 commits into haskell:master on Apr 25, 2021

Conversation

@ethercrow (Contributor) commented Oct 16, 2020

This PR is just a collection of SSE2 patches from those PRs:

@phadej please have a look.

@ethercrow changed the title from "Sse2 combined" to "SSE2 patches for encoding and decoding functions" on Oct 16, 2020
@ethercrow (Author)

@phadej If you're too busy, are there other maintainers we could ping and ask for review?

@phadej (Contributor) commented Oct 22, 2020

I'm only merging simple patches; anything non-trivial would need input from @hvr.

@ethercrow (Author)

@hvr ping

@Lysxia linked an issue on Mar 8, 2021 that may be closed by this pull request
@ethercrow (Author) commented Mar 28, 2021

quarterly bump

@Bodigrim (Contributor)

@ethercrow could you please rebase and rerun relevant benchmarks? First on master with --csv master.csv, then on your branch with --baseline master.csv --csv branch.csv.

@ethercrow (Author) commented Mar 28, 2021

> cabal bench --benchmark-options="--baseline master.csv --csv sse2.csv -p code" 2>&1 | tee bench.log

Build profile: -w ghc-9.0.1 -O1
In order, the following will be built (use -v for more details):
 - text-1.2.4.2 (bench:text-benchmarks) (first run)
Preprocessing benchmark 'text-benchmarks' for text-1.2.4.2..
Building benchmark 'text-benchmarks' for text-1.2.4.2..
Running 1 benchmarks...
Benchmark text-benchmarks: RUNNING...
All
  DecodeUtf8
    Strict+html:               OK (1.98s)
      235 μs ±  14 μs, 11% faster than baseline
    Stream+html:               OK (1.83s)
      219 μs ±  13 μs, 25% faster than baseline
    IConv+html:                OK (1.36s)
      1.3 ms ± 112 μs
    StrictLength+html:         OK (1.80s)
      865 μs ±  59 μs,  9% faster than baseline
    StrictInitLength+html:     OK (2.28s)
      2.1 ms ± 139 μs
    Lazy+html:                 OK (1.82s)
      219 μs ±  14 μs, 25% faster than baseline
    LazyLength+html:           OK (1.87s)
      450 μs ±  33 μs, 14% faster than baseline
    LazyInitLength+html:       OK (2.33s)
      2.2 ms ± 122 μs
  DecodeUtf8
    Strict+xml:                OK (1.67s)
      105 ms ± 8.8 ms
    Stream+xml:                OK (1.63s)
      107 ms ± 8.1 ms
    IConv+xml:                 OK (1.42s)
      235 ms ±  14 ms
    StrictLength+xml:          OK (1.16s)
      164 ms ±  16 ms,  9% faster than baseline
    StrictInitLength+xml:      OK (2.46s)
      350 ms ±  25 ms
    Lazy+xml:                  OK (1.61s)
      107 ms ± 7.2 ms
    LazyLength+xml:            OK (1.92s)
      127 ms ± 7.0 ms,  7% faster than baseline
    LazyInitLength+xml:        OK (1.22s)
      406 ms ±  32 ms
  DecodeUtf8
    Strict+ascii:              OK (1.51s)
       11 ms ± 1.0 ms, 42% faster than baseline
    Stream+ascii:              OK (1.53s)
       24 ms ± 1.7 ms, 28% faster than baseline
    IConv+ascii:               OK (2.19s)
      304 ms ±  13 ms
    StrictLength+ascii:        OK (1.15s)
      163 ms ±  14 ms, 14% faster than baseline
    StrictInitLength+ascii:    OK (3.33s)
      473 ms ±  16 ms
    Lazy+ascii:                OK (1.52s)
       24 ms ± 1.7 ms, 27% faster than baseline
    LazyLength+ascii:          OK (2.05s)
       65 ms ± 3.6 ms, 11% faster than baseline
    LazyInitLength+ascii:      OK (3.53s)
      499 ms ±  20 ms, 10% faster than baseline
  DecodeUtf8
    Strict+russian:            OK (1.86s)
       14 ms ± 1.2 ms
    Stream+russian:            OK (1.61s)
       12 ms ± 858 μs, 16% faster than baseline
    IConv+russian:             OK (1.26s)
       20 ms ± 1.8 ms
    StrictLength+russian:      OK (2.05s)
       16 ms ± 1.1 ms, 10% faster than baseline
    StrictInitLength+russian:  OK (1.88s)
       30 ms ± 2.2 ms
    Lazy+russian:              OK (1.61s)
       12 ms ± 863 μs, 16% faster than baseline
    LazyLength+russian:        OK (1.93s)
       15 ms ± 916 μs, 14% faster than baseline
    LazyInitLength+russian:    OK (1.12s)
       36 ms ± 3.5 ms
  DecodeUtf8
    Strict+japanese:           OK (1.53s)
       23 μs ± 1.8 μs
    Stream+japanese:           OK (1.38s)
       21 μs ± 1.7 μs, 27% faster than baseline
    IConv+japanese:            OK (2.11s)
       32 μs ± 1.8 μs
    StrictLength+japanese:     OK (1.15s)
       35 μs ± 3.3 μs, 12% faster than baseline
    StrictInitLength+japanese: OK (3.63s)
       55 μs ± 2.4 μs,  8% faster than baseline
    Lazy+japanese:             OK (1.39s)
       21 μs ± 1.7 μs, 26% faster than baseline
    LazyLength+japanese:       OK (1.64s)
       25 μs ± 1.7 μs, 22% faster than baseline
    LazyInitLength+japanese:   OK (1.94s)
       58 μs ± 5.6 μs, 10% faster than baseline
  DecodeASCII
    strict decodeUtf8:         OK (2.84s)
       11 ms ± 574 μs
    strict decodeLatin1:       OK (2.83s)
       11 ms ± 610 μs
    strict decodeASCII:        OK (2.86s)
       11 ms ± 476 μs
    lazy decodeUtf8:           OK (1.60s)
       25 ms ± 1.9 ms
    lazy decodeLatin1:         OK (1.58s)
       25 ms ± 1.9 ms
    lazy decodeASCII:          OK (1.57s)
       25 ms ± 1.8 ms
  EncodeUtf8
    Text (non-ASCII):          OK (2.04s)
      2.0 ms ± 162 μs
    LazyText (non-ASCII):      OK (1.51s)
      2.8 ms ± 239 μs
  EncodeUtf8
    Text (ASCII):              OK (1.94s)
      116 μs ± 9.1 μs
    LazyText (ASCII):          OK (3.70s)
      1.8 ms ± 171 μs
  Pure
    decode
      Text+tiny:               OK (2.11s)
         31 ns ± 1.7 ns, 36% slower than baseline
      LazyText+tiny:           OK (2.24s)
         66 ns ± 3.9 ns,  9% slower than baseline
    decode'
      Text+tiny:               OK (1.28s)
         38 ns ± 3.7 ns, 25% slower than baseline
      LazyText+tiny:           OK (1.34s)
         78 ns ± 7.2 ns
    encode
      Text+tiny:               OK (1.78s)
         26 ns ± 1.9 ns, 10% slower than baseline
      LazyText+tiny:           OK (1.90s)
         56 ns ± 3.4 ns
    length
      decode
        Text+tiny:             OK (1.38s)
           20 ns ± 2.0 ns
        LazyText+tiny:         OK (1.21s)
           71 ns ± 6.6 ns
  Pure
    decode
      Text+ascii-small:        OK (9.82s)
        9.3 μs ± 199 ns, 53% faster than baseline
      LazyText+ascii-small:    OK (2.54s)
        9.4 μs ± 419 ns, 56% faster than baseline
    decode'
      Text+ascii-small:        OK (2.44s)
        9.1 μs ± 705 ns, 54% faster than baseline
      LazyText+ascii-small:    OK (2.57s)
        9.6 μs ± 796 ns, 55% faster than baseline
    encode
      Text+ascii-small:        OK (1.96s)
        7.3 μs ± 642 ns, 61% faster than baseline
      LazyText+ascii-small:    OK (1.27s)
         38 μs ± 3.4 μs, 45% faster than baseline
    length
      decode
        Text+ascii-small:      OK (1.86s)
          223 μs ±  21 μs
        LazyText+ascii-small:  OK (1.40s)
           42 μs ± 3.7 μs, 51% faster than baseline
  Pure
    decode
      Text+ascii:              OK (3.40s)
         11 ms ± 433 μs, 42% faster than baseline
      LazyText+ascii:          OK (5.73s)
         87 ms ± 1.9 ms
    decode'
      Text+ascii:              OK (3.48s)
         11 ms ± 488 μs, 42% faster than baseline
      LazyText+ascii:          OK (5.73s)
         88 ms ± 1.8 ms
    encode
      Text+ascii:              OK (1.97s)
         11 ms ± 974 μs, 44% faster than baseline
      LazyText+ascii:          OK (4.24s)
         63 ms ± 1.8 ms, 20% faster than baseline
    length
      decode
        Text+ascii:            OK (3.20s)
          193 ms ±  11 ms
        LazyText+ascii:        OK (1.57s)
           38 ms ± 3.4 ms, 48% faster than baseline
  Pure
    decode
      Text+english:            OK (20.14s)
        593 μs ±  42 μs, 51% faster than baseline
      LazyText+english:        OK (1.36s)
        613 μs ±  57 μs, 50% faster than baseline
    decode'
      Text+english:            OK (1.30s)
        578 μs ±  53 μs, 51% faster than baseline
      LazyText+english:        OK (10.65s)
        637 μs ±  22 μs, 50% faster than baseline
    encode
      Text+english:            OK (2.55s)
        582 μs ±  41 μs, 52% faster than baseline
      LazyText+english:        OK (2.13s)
        2.0 ms ± 133 μs, 47% faster than baseline
    length
      decode
        Text+english:          OK (1.69s)
           13 ms ± 1.0 ms
        LazyText+english:      OK (2.58s)
          2.4 ms ± 117 μs, 50% faster than baseline
  Pure
    decode
      Text+russian:            OK (2.24s)
         34 μs ± 1.8 μs
      LazyText+russian:        OK (1.95s)
         29 μs ± 1.7 μs, 16% faster than baseline
    decode'
      Text+russian:            OK (2.22s)
         34 μs ± 1.7 μs
      LazyText+russian:        OK (1.96s)
         30 μs ± 1.9 μs, 16% faster than baseline
    encode
      Text+russian:            OK (1.50s)
         11 μs ± 840 ns, 14% faster than baseline
      LazyText+russian:        OK (1.50s)
         11 μs ± 1.0 μs, 21% faster than baseline
    length
      decode
        Text+russian:          OK (2.50s)
           38 μs ± 1.8 μs,  8% faster than baseline
        LazyText+russian:      OK (2.17s)
           32 μs ± 1.7 μs, 21% faster than baseline
  Pure
    decode
      Text+japanese:           OK (1.55s)
         23 μs ± 1.8 μs
      LazyText+japanese:       OK (1.36s)
         21 μs ± 1.8 μs, 26% faster than baseline
    decode'
      Text+japanese:           OK (1.53s)
         23 μs ± 1.7 μs
      LazyText+japanese:       OK (1.36s)
         21 μs ± 1.8 μs, 26% faster than baseline
    encode
      Text+japanese:           OK (2.17s)
        8.2 μs ± 468 ns, 19% faster than baseline
      LazyText+japanese:       OK (1.48s)
         11 μs ± 973 ns
    length
      decode
        Text+japanese:         OK (2.31s)
           35 μs ± 2.0 μs
        LazyText+japanese:     OK (1.52s)
           23 μs ± 1.8 μs, 29% faster than baseline

All 98 tests passed (239.41s)
Benchmark text-benchmarks: FINISH

@Bodigrim (Contributor)

Speedups look great! I think what we need here is multiplatform CI to ensure that there are no gcc/clang issues on various OSes. Something like a separate workflow running

jobs:
  build:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-16.04, ubuntu-18.04, ubuntu-20.04, windows-2019, macos-10.15, macos-11.0]

and FreeBSD, all with the latest available (or default) GHC.

@phadej (Contributor) commented Mar 29, 2021

@Bodigrim I played with the relevant part on godbolt.org; clang and GCC are both happy with that code. MSVC doesn't define the __i386__ and __x86_64__ macros, but you cannot use MSVC with GHC anyway.

The problem with testing is rather that non-i386 (and non-x86_64) architectures are not tested. One option is to have a way to toggle the ifdefs so that no architecture-specific code is used, e.g. something like

package text
  cpp-options: -DNO_PLATFORM_SPECIFICS
  -- I don't remember whether cpp-options works here; it should?
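
For illustration, a minimal sketch of what such a toggle could guard on the C side (NO_PLATFORM_SPECIFICS is just the name assumed in the snippet above, it is otherwise hypothetical):

/* Hypothetical: let CI force the portable code path even on x86. */
#if (defined(__i386__) || defined(__x86_64__)) && !defined(NO_PLATFORM_SPECIFICS)
    /* architecture-specific fast path (unaligned loads, SSE2) */
#else
    /* portable scalar path only */
#endif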

I'd avoid adding new Cabal flags. Ideally users won't need to know about development or testing concerns (i.e. I'd try to figure out whether a developer flag is really required).

What will be relevant sooner or later is ARM64. But there isn't any ARM-specific code, nor is there an easy (and free, or even cheap) way to get ARM GitHub Actions runners.


I played with the following snippet:

We need to remove static inline for the assembly to be generated on godbolt.

#include <stdint.h>
#include <stddef.h>

#if defined(__x86_64__)
#include <emmintrin.h>
#include <xmmintrin.h>
#endif

#define UTF8_ACCEPT 0
#define UTF8_REJECT 12


static const uint8_t utf8d[] = {
  /*
   * The first part of the table maps bytes to character classes that
   * reduce the size of the transition table and create bitmasks.
   */
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,  9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
   7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,  7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
   8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
  10,3,3,3,3,3,3,3,3,3,3,3,3,4,3,3, 11,6,6,6,5,8,8,8,8,8,8,8,8,8,8,8,

  /*
   * The second part is a transition table that maps a combination of
   * a state of the automaton and a character class to a state.
   */
   0,12,24,36,60,96,84,12,12,12,48,72, 12,12,12,12,12,12,12,12,12,12,12,12,
  12, 0,12,12,12,12,12, 0,12, 0,12,12, 12,24,12,12,12,12,12,24,12,24,12,12,
  12,12,12,12,12,12,12,24,12,12,12,12, 12,24,12,12,12,12,12,12,12,24,12,12,
  12,12,12,12,12,12,12,36,12,36,12,12, 12,36,12,12,12,12,12,36,12,36,12,12,
  12,36,12,12,12,12,12,12,12,12,12,12,
};

uint32_t
decode(uint32_t *state, uint32_t* codep, uint32_t byte) {
  uint32_t type = utf8d[byte];

  *codep = (*state != UTF8_ACCEPT) ?
    (byte & 0x3fu) | (*codep << 6) :
    (0xff >> type) & (byte);

  return *state = utf8d[256 + *state + type];
}

uint8_t const *
_hs_text_decode_utf8_int(uint16_t *const dest, size_t *destoff,
			 const uint8_t **src, const uint8_t *srcend,
			 uint32_t *codepoint0, uint32_t *state0)
{
  uint16_t *d = dest + *destoff;
  const uint8_t *s = *src, *last = *src;
  uint32_t state = *state0;
  uint32_t codepoint = *codepoint0;

  while (s < srcend) {
#if defined(__i386__) || defined(__x86_64__)
    /*
     * This code will only work on a little-endian system that
     * supports unaligned loads.
     *
     * It gives a substantial speed win on data that is purely or
     * partly ASCII (e.g. HTML), at only a slight cost on purely
     * non-ASCII text.
     */

    if (state == UTF8_ACCEPT) {
#if defined(__x86_64__)
      const __m128i zeros = _mm_set1_epi32(0);
      while (s < srcend - 8) {
        const uint64_t hopefully_eight_ascii_chars = *((uint64_t *) s);
        if ((hopefully_eight_ascii_chars & 0x8080808080808080LL) != 0LL)
          break;
        s += 8;

        /* Load 8 bytes of ASCII data */
        const __m128i eight_ascii_chars = _mm_cvtsi64_si128(hopefully_eight_ascii_chars);
        /* Interleave with zeros */
        const __m128i eight_utf16_chars = _mm_unpacklo_epi8(eight_ascii_chars, zeros);
        /* Store the resulting 8 bytes into destination */
        _mm_storeu_si128((__m128i *)d, eight_utf16_chars);
        d += 8;
      }
#else  
      while (s < srcend - 4) {
        codepoint = *((uint32_t *) s);
        if ((codepoint & 0x80808080) != 0)
          break;
        s += 4;
        /*
         * Tried 32-bit stores here, but the extra bit-twiddling
         * slowed the code down.
         */
        *d++ = (uint16_t) (codepoint & 0xff);
        *d++ = (uint16_t) ((codepoint >> 8) & 0xff);
        *d++ = (uint16_t) ((codepoint >> 16) & 0xff);
        *d++ = (uint16_t) ((codepoint >> 24) & 0xff);
      }
#endif
      last = s;
    } /* end if (state == UTF8_ACCEPT) */
#endif

    if (decode(&state, &codepoint, *s++) != UTF8_ACCEPT) {
      if (state != UTF8_REJECT)
	continue;
      break;
    }

    if (codepoint <= 0xffff)
      *d++ = (uint16_t) codepoint;
    else {
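      /* Code point above the BMP: emit a UTF-16 surrogate pair
         (0xD7C0 == 0xD800 - (0x10000 >> 10)). */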
      *d++ = (uint16_t) (0xD7C0 + (codepoint >> 10));
      *d++ = (uint16_t) (0xDC00 + (codepoint & 0x3FF));
    }
    last = s;
  }

  *destoff = d - dest;
  *codepoint0 = codepoint;
  *state0 = state;
  *src = last;

  return s;
}

For the lines around /* Load 8 bytes of ASCII data */, clang 11.0.1 generates (-O1):

        addq    $8, %r12                        # s += 8
        movq    %rax, %xmm0                     # load 8 bytes ...
        punpcklbw       %xmm1, %xmm0            # interleave with zeros
        movdqu  %xmm0, (%r13)                   # store the resulting bytes
        addq    $16, %r13                       # d += 8

and GCC 10.2 generates virtually the same:

        addq    $8, %rdx
        movq    %rax, %xmm0
        pxor    %xmm1, %xmm1
        punpcklbw       %xmm1, %xmm0
        movups  %xmm0, (%rbx)
        addq    $16, %rbx

so they generate the same code. (I don't know whether GHC calls the C compiler with -O1, -O2, or the same -O it was given; -O2 for either C compiler doesn't change the code much, it just obfuscates it a bit by reordering instructions.)


TL;DR: I don't think there are GCC & Clang concerns. It's not C++17 code :)

@phadej (Contributor) commented Mar 29, 2021

I do wonder whether making the loop do aligned loads would make it even faster, but eh. Benchmarking that would be devastating.

In general I wonder whether the code for this exists in some C library. Decoding UTF-8 isn't something we should reimplement. (It would especially help if text's internal encoding is changed, because then this code will just go away; decoding UTF-8 to UTF-16 won't matter, it won't be a hot path.)

@ethercrow (Author)

> I do wonder whether making the loop do aligned loads would make it even faster, but eh. Benchmarking that would be devastating.

This came up when discussing the SIMD implementation of intersperse for bytestring, which is very similar to the code in this PR. On my machine (some i7 from 2017) there was no noticeable difference.

> In general I wonder whether the code for this exists in some C library. Decoding UTF-8 isn't something we should reimplement. (It would especially help if text's internal encoding is changed, because then this code will just go away; decoding UTF-8 to UTF-16 won't matter, it won't be a hot path.)

UTF-8-based text is a path z-haskell is pursuing: https://hackage.haskell.org/package/Z-Data

@ethercrow (Author)

Regarding CI on other OSes, I see that text uses haskell-ci for generating the GitHub Actions manifest, but it looks like that tool is Linux-only. I'm not sure; I don't see relevant documentation or any mention of Windows or macOS in its source code.

@Bodigrim (Contributor)

haskell-ci-based workflows do not support other OSes. My suggestion was to create two additional workflows: one for cross-platform testing with the latest GHC, and one for a FreeBSD run. Since haskell-ci already tests the project structure, it would be enough just to run cabal test / cabal bench --benchmark-option=-l. One can borrow some boilerplate from https://github.com/haskell/bytestring/blob/master/.github/workflows/ci.yml

Separate workflows are easier to manage, because GitHub allows rerunning a whole workflow, but not an individual job.

@ethercrow (Author)

Ah, that makes sense, I'll probably get to that over the weekend.

@ethercrow (Author)

Rebased onto master, tests passed on all newly introduced OSes.

@phadej (Contributor) commented Apr 11, 2021

2021-04-11T16:20:05.3951363Z           t_toTitle_1stNotLower:   FAIL
2021-04-11T16:20:05.3952004Z             *** Failed! Falsified (after 62 tests and 6 shrinks):
2021-04-11T16:20:05.3952528Z             "\4317"

on GHC-8.8.4, https://www.fileformat.info/info/unicode/char/10dd/index.htm

On the GHCs I tested, incl. GHC-9.0.1:

ghci> toUpper '\4317'
'\7325'
ghci> toTitle '\4317'
'\4317'
ghci> isUpper '\4317'
False
ghci> isLower '\4317'
True

It looks like Georgian text doesn't have a concept of title case (https://en.wikipedia.org/wiki/Georgian_Extended), and thus that test is broken.

@ethercrow (Author)

Could we merge this? I don't think that fixing t_toTitle_1stNotLower belongs in this PR.

@chessai (Member) commented Apr 15, 2021 via email

@Bodigrim (Contributor) left a comment

Arrgh, I'm deeply sorry, @ethercrow.
I had written a comment below, but forgot to press "Submit", and was awaiting your response.

FWIW it's not a blocker, I'm just being curious.

cbits/cbits.c Outdated
*/
const __m128i zeros = _mm_set1_epi32(0);
while (p < srcend - 3) {
/* Load 4 bytes of ASCII data */
Contributor

Is it possible to load 8 bytes here?

Contributor Author

Good catch; yes, this loop can be the same as in decodeUtf8, but without the check.

I think I implemented this one first with a 4-wide load, then realized it could be 8-wide while doing decodeUtf8, but forgot to port that back to Latin1.
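
A minimal sketch of that 8-wide Latin-1 loop, with hypothetical pointer names (it mirrors the decodeUtf8 ASCII fast path above, minus the check, since every Latin-1 byte maps directly to one UTF-16 code unit):

#include <stdint.h>
#include <emmintrin.h>

static void widen_latin1_sse2(uint16_t *d, const uint8_t *p, const uint8_t *srcend) {
  const __m128i zeros = _mm_setzero_si128();
  while (p < srcend - 8) {
    /* Load 8 Latin-1 bytes and interleave them with zeros,
       producing 8 UTF-16 code units. */
    const __m128i eight_bytes = _mm_loadl_epi64((const __m128i *) p);
    _mm_storeu_si128((__m128i *) d, _mm_unpacklo_epi8(eight_bytes, zeros));
    p += 8;
    d += 8;
  }
  /* The remaining (< 8) bytes are left to the existing scalar loop. */
}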


const uint64_t w = eight_chars.halves[0];
if (w & 0xFF80FF80FF80FF80ULL) {
Contributor

Could we actually just break here? Does it affect performance? I imagine we could compare a whole 128-bit register with a broadcasted 0xFF80, and either break or _mm_packus_epi16 + _mm_storel_epi64.
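
A minimal SSE2-only sketch of what that whole-register check might look like (hypothetical names and framing, not the code in this PR):

#include <stdint.h>
#include <emmintrin.h>

static void ascii_fast_path(uint8_t **dstp, const uint16_t **srcp, const uint16_t *srcend) {
  uint8_t *dst = *dstp;
  const uint16_t *src = *srcp;
  const __m128i non_ascii = _mm_set1_epi16((int16_t) 0xFF80);
  while (srcend - src >= 8) {
    /* Load 8 UTF-16 code units at once. */
    const __m128i chars = _mm_loadu_si128((const __m128i *) src);
    /* Bail out to the scalar loop if any code unit has bits outside 0x007F. */
    const __m128i is_ascii =
      _mm_cmpeq_epi16(_mm_and_si128(chars, non_ascii), _mm_setzero_si128());
    if (_mm_movemask_epi8(is_ascii) != 0xFFFF)
      break;
    /* All ASCII: narrow 8 x uint16 to 8 bytes and store them. */
    _mm_storel_epi64((__m128i *) dst, _mm_packus_epi16(chars, chars));
    src += 8;
    dst += 8;
  }
  *dstp = dst;
  *srcp = src;
}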

Contributor Author

Yeah, I experimented with where exactly, and after how many checks, the break should go, and chose one of the best-performing combinations. Unfortunately I didn't record those combinations, and the results might be different on different CPUs.

Contributor

I mean, it looks asymmetric to exercise each pair of bytes in the first uint64, but not in the second one. I understand that doing the same for the second uint64 makes the code more hairy, so maybe we should stop doing it for the first one as well? It would also allow us to avoid the union stuff.

Contributor Author

I tried both symmetric variants and found both to be slower than this one.

My understanding is that

  1. If you do too many checks by looking at individual bytes, the time cost of each SIMD loop iteration becomes higher.
  2. If you do too few checks (by not even looking at the bytes in the first half), then the probability of a SIMD loop iteration doing useful work decreases, which also hurts overall performance.

This probability differs across input data; the extreme cases would be pure ASCII (the SIMD routine always works) and pure Chinese text (the SIMD routine never works). On our test data I observed that inspecting the bytes in the first half was a sweet spot.

Bodigrim previously approved these changes Apr 19, 2021
cbits/cbits.c Outdated
src += 4;

if (eight_chars.halves[1] & 0xFF80FF80FF80FF80ULL) {
break;
Contributor

Here you can still pack and store the whole 0th half using the pext instruction:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_pext_u64&expand=4330
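
For reference, a hypothetical BMI2 sketch of that suggestion (not what the PR does; see the reply below). Given a uint64 holding 4 UTF-16 code units that are all known to be ASCII, pext compacts their low bytes; it requires compiling with -mbmi2:

#include <stdint.h>
#include <string.h>
#include <immintrin.h>

static uint8_t *pack4_ascii_pext(uint8_t *dst, uint64_t four_utf16_units) {
  /* Keep only the low byte of each 16-bit unit and compact them into 4 bytes. */
  const uint32_t packed = (uint32_t) _pext_u64(four_utf16_units, 0x00FF00FF00FF00FFULL);
  memcpy(dst, &packed, 4);
  return dst + 4;
}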

@ethercrow (Author) commented Apr 20, 2021

This PR intentionally uses only SSE2 to be compatible with any 64-bit x86 CPU. Quite a few models don't support pext, or their implementation of it is not fast. From Wikipedia:

> AMD processors before Zen 3[11] that implement PDEP and PEXT do so in microcode, with a latency of 18 cycles[12] rather than a single cycle. As a result, if the mask is known, it is often faster to use other instructions on AMD.

Co-authored-by: Kubo Kováč <733205+kuk0@users.noreply.github.com>
@Bodigrim (Contributor)

Unless there are more comments/suggestions, I'll merge this by the end of the week.

@Bodigrim merged commit e08b793 into haskell:master on Apr 25, 2021
@Bodigrim (Contributor)

Thanks @ethercrow, performance improvements are much appreciated. And sorry that it took so long.

@ethercrow deleted the sse2-combined branch on April 25, 2021 13:50
sjshuck added a commit to sjshuck/text that referenced this pull request Apr 15, 2022