Skip to content

4.1 Speed up UTF8 string processing with Intel's "intrinsics" instructions (en)

Claude Roux edited this page Oct 19, 2022 · 1 revision

Speed up UTF8 string processing with Intel's "intrinsics" instructions

Version française

I discovered the power of "intrinsics" instructions when I read Daniel Lemire's blog (https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/).

The basic idea is to use a 16-byte sliding window across a character string and project these 16-character blocks onto a 128-bit (16×8) integer. You can detect in one instruction whether one of the bytes has a value greater than 127. Remember that any byte containing a value greater than 127 has its leftmost bit at 1, so that we can detect whether or not a string contains potentially UTF8 characters.

First character UTF8

After experimenting with these instructions, I finally integrated a variation of this method into Tamgu.

static const __m128i checkutf8byte = _mm_set1_epi8(0x80);

bool checkAscii(unsigned char* src, long len, long& firstbloc) {
    __m128i current_bytes = _mm_setzero_si128();
    
    firstbloc = 0;
    for (; (firstbloc + 15) < len; firstbloc += 16) {
       current_bytes = _mm_loadu_si128((const __m128i *)(src + firstbloc));
        if (!_mm_testz_si128(checkutf8byte, current_bytes))
            return false;
    }

...

The string is cut into 16 character blocks and each block is checked for the presence of at least one bit on the far left. However, and this is a difference with Daniel Lemire's example, we keep each time the position of the first character of the last block analyzed. This allows the UTF8 section to be quickly and accurately isolated. In most cases, for languages such as English or French, this detection is often successful in terms of conversion time or access to substrings. For other languages, the checkAscii method fails on the first test, which may slightly slow down the normal functioning of the system.

Translated with www.DeepL.com/Translator

Search for a substring

We have also extended this principle to the search for a substring. Here, the idea is to detect in the main string the location where the first character of the subchain to be searched is located.

static const __m128i checkifzero = _mm_set1_epi8(0xFF);

long find_intel_byte(unsigned char* src, unsigned char* search, long lensrc, long lensearch, long i) {
   uchar c = search[0];
    __m128i firstchar = _mm_set1_epi8(c);
    __m128i current_bytes = _mm_setzero_si128();
    
    uchar* s=src;
    long j;
    
   for (; (i + 15) < lensrc; i += 16) {
        s=src+i;
        current_bytes = _mm_loadu_si128((const __m128i *)s);

       current_bytes = _mm_cmpeq_epi8(firstchar, current_bytes);
       q = _mm256_movemask_epi8(current_bytes);
       if (q) {
	  q = _bit_scan_forward(q);
          return (i+q);
       }
...

The principle is similar to that of "checkAscii". We construct a "firstchar" mask comprising 16 times the first character of the sub-chain "search" and we check if we have a non-zero intersection with a sliding window of 16 characters on the main string. It is therefore sufficient to find the position in the block of this first character and compare with the substring. It should be noted that this character may be present more than once in the block, hence the loop.

Empirical tests show that this method is often two to three times faster than a "std::string::find" which, it must be admitted, is not an efficiency model. In particular, the detection that a string is not present is remarkably effective.

Clone this wiki locally