New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

epee: speed up parse_hexstr_to_binbuff a little #4850

Merged
merged 3 commits into from Dec 4, 2018

Conversation

7 participants
@moneromooo-monero
Contributor

moneromooo-monero commented Nov 14, 2018

No description provided.

return c;
}
}

This comment has been minimized.

@ndorf

ndorf Nov 14, 2018

Contributor

Just curious, why not just use std::tolower()?

This comment has been minimized.

@moneromooo-monero

moneromooo-monero Nov 14, 2018

Contributor

It was slower when I tried IIRC. Memory's hazy. If you time it to be faster, maybe it can change.

This comment has been minimized.

@ndorf

ndorf Nov 15, 2018

Contributor

OK, this piqued my curiosity a bit so I ended up testing it. The result was somewhat interesting, at least to me: on Ubuntu 16.04 with g++ 5.4.0 and glibc 2.23, std::tolower() (which simply checks bounds and then indexes a static lookup array) is slower than your version by around 36% when compiling with -O2. (Without any -O flag, it's ~30% faster, but that doesn't really matter.)

Thanks for inspiring me to try it out!

This comment has been minimized.

@vtnerd

vtnerd Nov 16, 2018

Contributor

std::tolower has to lookup the current locale setting, which involves the usage of a thread-safe global variable. This will mess with optimizations in a tight loop quite a bit - the code cannot be inlined AND involves an indirect jump every time.

@hyc

This comment has been minimized.

Contributor

hyc commented Nov 14, 2018

Is this actually perf-sensitive?

I'd use something more like

for (size_t i=0; i<s.size(); i+=2) {
  int tmp = s[i] - '0';
  if (tmp < 0) return false;
  if (tmp > 9) tmp = (tmp & 0x17) - 7;
  if (tmp > 15) return false;
  tmp <<= 4;
  if (s[i+1]) {
    int t2 = s[i+1] - '0';
    if (t2 < 0) return false;
    if (t2 > 9) t2 = (t2 & 0x17) - 7;
    if (t2 > 15) return false;
    tmp |= t2;
  }
  res[i/2] = tmp;
}
@moneromooo-monero

This comment has been minimized.

Contributor

moneromooo-monero commented Nov 14, 2018

It's at the top of the profile when reading large amounts of json (in this case, get_bulk_payments with lots of payment ids).

@moneromooo-monero

This comment has been minimized.

Contributor

moneromooo-monero commented Nov 14, 2018

Your version seems to have the same rough performance. Fair amount of variance though,

@hyc

This comment has been minimized.

Contributor

hyc commented Nov 14, 2018

In that case I recommend using my version, it's about 2x faster than the one in this PR.

@hyc

This comment has been minimized.

Contributor

hyc commented Nov 14, 2018

#include "ver1.h"
#include "ver2.h"

#include <stdlib.h>
#include <sys/time.h>

#define NUM 1048576
long numbers[NUM];
char out[NUM*sizeof(long)*2];

main() {
	int i;
	char *ptr = out;
	for (i=0; i<NUM; i++) {
		numbers[i] = random();
		ptr += sprintf(ptr, "%016lx", numbers[i]);
	}

	struct timeval beg, end;
	gettimeofday(&beg, NULL);

	ptr = out;
	for (i=0; i<NUM; i++) {
		std::string out;
		std::string in(ptr, 16);
		ptr += 16;
		parse_hexstr_to_binbuff1(in, out);
	}
	gettimeofday(&end, NULL);
	end.tv_usec -= beg.tv_usec;
	if (end.tv_usec < 0) {
		end.tv_usec += 1000000;
		end.tv_sec--;
	}
	end.tv_sec -= beg.tv_sec;
	printf("%d numbers in %d.%06d sec\n", NUM, (int)end.tv_sec, (int)end.tv_usec);

	gettimeofday(&beg, NULL);

	ptr = out;
	for (i=0; i<NUM; i++) {
		std::string out;
		std::string in(ptr, 16);
		ptr += 16;
		parse_hexstr_to_binbuff2(in, out);
	}
	gettimeofday(&end, NULL);
	end.tv_usec -= beg.tv_usec;
	if (end.tv_usec < 0) {
		end.tv_usec += 1000000;
		end.tv_sec--;
	}
	end.tv_sec -= beg.tv_sec;
	printf("%d numbers in %d.%06d sec\n", NUM, (int)end.tv_sec, (int)end.tv_usec);
}
violino:/tmp/4850> ./test1
1048576 numbers in 0.195079 sec
1048576 numbers in 0.117733 sec
@hyc

This comment has been minimized.

Contributor

hyc commented Nov 15, 2018

If I also include the original code and increase the test:

./test1
10485760 numbers in 3.677967 sec
10485760 numbers in 1.947384 sec
10485760 numbers in 1.169062 sec
@hyc

This comment has been minimized.

Contributor

hyc commented Nov 15, 2018

If you want to try this yourself
test1.cpp.txt

ver0.h.txt
ver1.h.txt
ver2.h.txt

@moneromooo-monero moneromooo-monero force-pushed the moneromooo-monero:phex branch from 754a959 to 4443c4a Nov 15, 2018

@moneromooo-monero

This comment has been minimized.

Contributor

moneromooo-monero commented Nov 15, 2018

I get less of a difference but that one's faster, I've switched to it, thanks.

if (tmp > 9) tmp = (tmp & 0x17) - 7;
if (tmp > 15) return false;
tmp <<= 4;
if (s[i+1])

This comment has been minimized.

@vtnerd

vtnerd Nov 15, 2018

Contributor

This could be an invalid memory access here (and below) if allow_partial_byte = true. Might be able to drop that parameter though.

This comment has been minimized.

@hyc

hyc Nov 15, 2018

Contributor

git grep doesn't turn up any invocations where that param is set. We should probably just remove it and check for an odd size at the top, and immediately fail.

This comment has been minimized.

@hyc

hyc Nov 15, 2018

Contributor
 template<class CharT>
  bool parse_hexstr_to_binbuff(const std::basic_string<CharT>& s, std::basic_string<CharT>& res)
  {
    res.clear();
    if (s.size() & 1)
      return false;
    try
    {
      res.resize(s.size() / 2);
      unsigned char *dst = (unsigned char *)res.data();
      unsigned char *src = (unsigned char *)s.data();
      for(size_t i = 0; i < s.size(); i += 2)
      {
  int tmp = *src++ - '0';
  if (tmp < 0) return false;
  if (tmp > 9) tmp = (tmp & 0x17) - 7;
  if (tmp > 15) return false;
  tmp <<= 4;
    int t2 = *src++ - '0';
    if (t2 < 0) return false;
    if (t2 > 9) t2 = (t2 & 0x17) - 7;
    if (t2 > 15) return false;
    *dst++ = tmp | t2;
      }

      return true;
    }catch(...)
    {
      return false;
    }
}

A couple percent faster here.

@moneromooo-monero moneromooo-monero force-pushed the moneromooo-monero:phex branch from 4443c4a to 189fcf8 Nov 15, 2018

@xiphon

This comment has been minimized.

Contributor

xiphon commented Nov 15, 2018

@hyc
this version will return true parsing non-hex characters as well, but it shoudln't.

We have to add more test to cover the following cases:

  1. Tests for all the possible input characters
  2. Invalid input length test
TEST(StringTools, ParseHex)
{
  std::string res;
  static const char data[] = "a10b68c2";
  for (size_t i = 0; i < sizeof(data); i += 2)
  {
    ASSERT_TRUE(epee::string_tools::parse_hexstr_to_binbuff(std::string(data, i), res));
    std::string hex = epee::string_tools::buff_to_hex_nodelimer(res);
    ASSERT_EQ(hex.size(), i);
    ASSERT_EQ(memcmp(data, hex.data(), i), 0);
  }

  for (size_t i = 0; i < 256; ++i) {
    std::string inputHexString = std::string(2, static_cast<char>(i));
    if ((i >= '0' && i <= '9') || (i >= 'A' && i <= 'F') || (i >= 'a' && i <= 'f')) {
      ASSERT_TRUE(epee::string_tools::parse_hexstr_to_binbuff(inputHexString, res)) << "Failed to parse: '" << inputHexString << "'";
    } else {
      ASSERT_FALSE(epee::string_tools::parse_hexstr_to_binbuff(inputHexString, res)) << "Should fail to parse: '" << inputHexString << "'";
    }
  }

  ASSERT_FALSE(epee::string_tools::parse_hexstr_to_binbuff(std::string("a"), res));
}
@moneromooo-monero

This comment has been minimized.

Contributor

moneromooo-monero commented Nov 15, 2018

Hah, I found the 0x17 thing odd, but assumed it was good :D I'll fix.

@moneromooo-monero moneromooo-monero force-pushed the moneromooo-monero:phex branch from 189fcf8 to 56b55c5 Nov 15, 2018

@moneromooo-monero

This comment has been minimized.

Contributor

moneromooo-monero commented Nov 15, 2018

Still faster, though less so. Do you want to make a PR for your unit test, or should I commit it as you ?

@hyc

This comment has been minimized.

Contributor

hyc commented Nov 15, 2018

@xiphon OK, then we just brute force it. This does what isxdigit() used to do in older versions of <ctype.h>

#include <string.h>
#include <cstdlib>
#include <string>

static const char isx[256] = {
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0, 0, 0, 0, 0,
 0, 11, 12, 13, 14, 15, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 11, 12, 13, 14, 15, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0 };

 template<class CharT>
  bool parse_hexstr_to_binbuff4(const std::basic_string<CharT>& s, std::basic_string<CharT>& res)
  {
    res.clear();
    if (s.size() & 1)
      return false;
    try
    {
      res.resize(s.size() / 2);
          unsigned char *dst = (unsigned char *)res.data();
          unsigned char *src = (unsigned char *)s.data();
      for(size_t i = 0; i < s.size(); i += 2)
      {
  int tmp = *src++;
  tmp = isx[tmp];
  if (!tmp) return false;
  tmp--;
    int t2 = *src++;
    t2 = isx[t2];
    if (!t2) return false;
    t2--;
        *dst++ = (tmp << 4) | t2;
      }

      return true;
    }catch(...)
    {
      return false;
    }
}

Could make it a little faster by filling with -1 instead of zero, which would eliminate the extra subtraction.

@moneromooo-monero moneromooo-monero force-pushed the moneromooo-monero:phex branch from 56b55c5 to 581d125 Nov 15, 2018

@@ -59,6 +59,26 @@
#pragma comment (lib, "Rpcrt4.lib")
#endif
static const constexpr char isx[256] =

This comment has been minimized.

@hyc

hyc Nov 15, 2018

Contributor

Do we need to be more explicit and use signed char here?

This comment has been minimized.

@hyc

hyc Nov 17, 2018

Contributor

Confirmed yes - char defaults to unsigned on ARM, we need to explicitly use signed char here.

res.push_back(static_cast<unsigned char>(v));
CharT tmp = *src++;
tmp = isx[(unsigned int)tmp];
if (tmp < 0) return false;

This comment has been minimized.

@hyc

hyc Nov 15, 2018

Contributor

Do we know that CharT is signed? If not we can use if (tmp & 0x80) here.

@xiphon

This comment has been minimized.

Contributor

xiphon commented Nov 15, 2018

@moneromooo-monero

Do you want to make a PR for your unit test, or should I commit it as you ?

Ah, feel free to include it into this PR

@hyc

Could make it a little faster by filling with -1 instead of zero, which would eliminate the extra subtraction.

Yeah, i like this XLAT approach. ~5% gain with -1 version (parse_hexstr_to_binbuff5) compared to 0 - biased one.

parse_hexstr_to_binbuff0: 41943040 numbers in 6.302370 sec
parse_hexstr_to_binbuff1: 41943040 numbers in 2.954025 sec
parse_hexstr_to_binbuff2: 41943040 numbers in 2.358419 sec
parse_hexstr_to_binbuff3: 41943040 numbers in 2.273179 sec
parse_hexstr_to_binbuff4: 41943040 numbers in 1.150841 sec
parse_hexstr_to_binbuff5: 41943040 numbers in 1.083029 sec

hyc and others added some commits Nov 14, 2018

@moneromooo-monero moneromooo-monero force-pushed the moneromooo-monero:phex branch 2 times, most recently from b36353e to 5c4f836 Nov 15, 2018

@moneromooo-monero

This comment has been minimized.

Contributor

moneromooo-monero commented Nov 19, 2018

10485760 numbers in 3.850427 sec
10485760 numbers in 1.657483 sec
10485760 numbers in 1.272807 sec
10485760 numbers in 1.285490 sec
10485760 numbers in 0.892490 sec
10485760 numbers in 0.826842 sec
10485760 numbers in 0.803382 sec

:P

@hyc

This comment has been minimized.

Contributor

hyc commented Nov 19, 2018

Saving a test/branch is good ;)

Not much impact here.
Intel(R) Core(TM)2 Extreme CPU Q9300 @ 2.53GHz
10485760 numbers in 3.538083 sec
10485760 numbers in 1.932067 sec
10485760 numbers in 1.163310 sec
10485760 numbers in 1.200943 sec
10485760 numbers in 0.813748 sec
10485760 numbers in 0.816778 sec
10485760 numbers in 0.818206 sec

Small win here.
Rockchip rk3368 AArch64
10485760 numbers in 28.091529 sec
10485760 numbers in 14.327599 sec
10485760 numbers in 8.334919 sec
10485760 numbers in 8.269132 sec
10485760 numbers in 7.990712 sec
10485760 numbers in 8.040629 sec
10485760 numbers in 7.842864 sec

I think these Cortex-A53 SoCs have little to no on-chip caches.

Slight loss here
AMD A10-4600M APU with Radeon(tm) HD Graphics
10485760 numbers in 4.456159 sec
10485760 numbers in 2.426657 sec
10485760 numbers in 1.442234 sec
10485760 numbers in 1.587544 sec
10485760 numbers in 0.918178 sec
10485760 numbers in 0.895542 sec
10485760 numbers in 0.932247 sec

@moneromooo-monero

This comment has been minimized.

Contributor

moneromooo-monero commented Nov 20, 2018

Maybe because the table went from 256 bytes to... possibly 2048. Try that one, it's slower for me but might be better on the ARM (I've not tested it with unit tests yet).

static const constexpr unsigned char isx7[256] =
{
 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0,    1,    2,    3,    4,    5,    6,    7,    8,    9, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 0xff,   10,   11,   12,   13,   14,   15, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 0xff,   10,   11,   12,   13,   14,   15, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
};


  template
  bool parse_hexstr_to_binbuff7(const std::basic_string& s, std::basic_string& res)
  {
    res.clear();
    if (s.size() & 1)
      return false;
    try
    {
      res.resize(s.size() / 2);
      unsigned char *dst = (unsigned char *)res.data();
      const unsigned char *src = (const unsigned char *)s.data();
      for(size_t i = 0; i < s.size(); i += 2)
      {
        uint_fast16_t tmp = ((uint_fast16_t)isx7[*src++]) << 5;
        tmp |= ((uint_fast16_t)isx7[*src++]) << 1;
        if (tmp & 0xff00) return false;
        *dst++ = tmp >> 1;
      }
      return true;
    }
    catch(...)
    {
      return false;
    }
}

@moneromooo-monero

This comment has been minimized.

Contributor

moneromooo-monero commented Nov 20, 2018

Nevermind, that last one is actually broken :P Guess we'll use yours then.

@moneromooo-monero moneromooo-monero force-pushed the moneromooo-monero:phex branch from 5c4f836 to b36353e Nov 20, 2018

@dginovker

This comment has been minimized.

dginovker commented Nov 20, 2018

Why are you optimizing a parser? Is this not incredibly dangerous?

@moneromooo-monero

This comment has been minimized.

Contributor

moneromooo-monero commented Nov 20, 2018

Because it was the bottleneck when calling get_bulk_transfers with a lot of payment ids, and because it's fun :)

@hyc

This comment has been minimized.

Contributor

hyc commented Nov 21, 2018

About as dangerous as every other time we type git commit... This code is perfectly safe. The try/catch is only there in case the res.resize() fails.

@fluffypony

Reviewed

@fluffypony fluffypony merged commit b36353e into monero-project:master Dec 4, 2018

7 of 10 checks passed

buildbot/monero-static-osx-10.11 Build done.
Details
buildbot/monero-static-osx-10.12 Build done.
Details
buildbot/monero-static-ubuntu-i686 Build done.
Details
buildbot/monero-linux-armv7 Build done.
Details
buildbot/monero-linux-armv8 Build done.
Details
buildbot/monero-static-osx-10.13 Build done.
Details
buildbot/monero-static-ubuntu-amd64 Build done.
Details
buildbot/monero-static-win32 Build done.
Details
buildbot/monero-static-win64 Build done.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

fluffypony added a commit that referenced this pull request Dec 4, 2018

Merge pull request #4850
b36353e unit_tests: add some hex parsing test for non hex input (xiphon)
6671110 unit_tests: add a test for parse_hexstr_to_binbuff (moneromooo-monero)
f6187cd epee: speed up parse_hexstr_to_binbuff a little (Howard Chu)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment