
Thread pool for checking verifications #1614

Closed · wants to merge 12 commits

Conversation

termoose
Contributor

Added a thread_pool class for submitting work to one of hardware_concurrency threads (the default thread count). The class is fairly general and can be reused for other threading use cases.
Added a test case doing 100k verifications multi-threaded.
Verifications are batched 256 to a thread, with a hard cut-off of 1000 below which the single-threaded version runs.
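The batching scheme described above can be sketched roughly like this (a sketch only, not the PR's code: verify () is a hypothetical stand-in for the real signature check, and std::async stands in for the custom thread_pool):

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

bool verify (int item) // placeholder for a real signature verification
{
	return item >= 0;
}

bool verify_batched (std::vector<int> const & items)
{
	size_t const batch_size = 256;
	size_t const single_thread_cutoff = 1000;
	if (items.size () < single_thread_cutoff)
	{
		// Small workloads: threading overhead outweighs the gain
		return std::all_of (items.begin (), items.end (), verify);
	}
	std::vector<std::future<bool>> results;
	for (size_t start = 0; start < items.size (); start += batch_size)
	{
		size_t end = std::min (start + batch_size, items.size ());
		// One asynchronous task per batch of up to 256 verifications
		results.push_back (std::async (std::launch::async, [&items, start, end] {
			return std::all_of (items.begin () + start, items.begin () + end, verify);
		}));
	}
	// Collect the per-batch true/false results
	return std::all_of (results.begin (), results.end (), [] (std::future<bool> & f) { return f.get (); });
}
```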

@wezrule wezrule added the functionality quality improvements This item indicates the need for or supplies a better way to implement existing functionality label Jan 22, 2019
@cryptocode
Contributor

Could boost::asio::thread_pool work here? Since we're already using boost, it's good to use that one if possible.

@termoose
Contributor Author

I actually didn't know there was a thread pool in Boost; I'd always heard they wouldn't add one. Guess they did! It looks pretty basic, though: there's no support for futures, so I'd have to roll my own way of getting results from the async operations in the thread pool (the true/false from all those async std::all_of calls).

I'll see if I can get away with a smaller code footprint by using it.
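For reference, one common workaround for a pool that exposes no futures (a sketch, not code from this PR) is to wrap the work in a std::packaged_task and keep its future; the task can then be handed to any executor, e.g. boost::asio::post. A plain std::thread stands in for the pool here so the snippet stays self-contained:

```cpp
#include <cassert>
#include <future>
#include <thread>
#include <utility>

// Run a callable on a worker thread and recover its result through a future.
// With boost::asio one would instead do: boost::asio::post (pool, std::move (task));
bool run_verification (bool (*check) ())
{
	std::packaged_task<bool ()> task (check);
	std::future<bool> result = task.get_future ();
	std::thread worker (std::move (task)); // stand-in for posting into a pool
	worker.join ();
	return result.get ();
}
```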

@wezrule
Contributor

wezrule commented Jan 25, 2019

I will take a look at it today @termoose, as it is high priority and needed ASAP. Thanks!

@termoose
Contributor Author

Thanks, it seems to work as well as the custom thread pool. I changed the cut-off to 10k based on the feedback from @guilhermelawless. I considered adding this to node_config as a setting, but decided not to since it was a bit awkward getting hold of that object from nano::signature_checker.

@guilhermelawless
Contributor

Getting the same results with the new thread pool as well. Also a good thing is that results don't change much with higher batch sizes, so that's one less thing to parametrize.

Not sure I agree with the cut-off change. Setting to 512 (2x batch_size) already gets me a 1.7x speedup, so why not enjoy that?

@wezrule
Contributor

wezrule commented Jan 25, 2019

Have you checked on live whether it goes into your verify_threaded function? For me the maximum number of signatures I got in one chunk was 2048; 256 was common, but so were very low numbers (many are 1). I think a different approach may be needed to get good results on the live network.

I think our benchmark tests should call nano::signature_checker::add () many times with different sizes, up to a total of something like 100k, and then compare how long that takes to complete vs the master branch. I'm going to start working on this, feel free to do the same.

@guilhermelawless
Contributor

@wezrule @termoose
Did not catch this in boost documentation (didn't go through the .cpp, only headers), but this sanity check is failing:

TEST (sanity, thread_pool)
{
	boost::asio::thread_pool thread_pool (4);
	std::vector<bool> v (4, false);
	for (int i = 0; i <= 1; ++i)
	{
		boost::asio::post (thread_pool, [i, &v] { v[i] = true; });
	}
	thread_pool.join ();

	for (int i = 2; i <= 3; ++i)
	{
		boost::asio::post (thread_pool, [i, &v] { v[i] = true; });
	}
	thread_pool.join ();

	ASSERT_EQ (4, v.size ());
	ASSERT_EQ (true, v[0]);
	ASSERT_EQ (true, v[1]);
	ASSERT_EQ (true, v[2]);
	ASSERT_EQ (true, v[3]);
}

This test fails:

Value of: v[2]
  Actual: 0
Expected: true

It seems that once a thread_pool has been joined, it cannot be reused. To solve this we only need to instantiate it inside verify_threaded, not class-wide.

@guilhermelawless
Contributor

guilhermelawless commented Jan 25, 2019

See guilhermelawless#1 for a more realistic case with the following adds (repeated until 100k): 1, 256, 257, 1, 1, 512, 1, 2048, 1024, 1, 64, 1, 1, 1, 1, 1, 1, 1, 1. The relevant commit is guilhermelawless@9676296

It gets me a 2x speedup in total. Also fixed the thread_pool re-use, and the number of batches sometimes being 1 extra due to overflow = 0. Cut-off lowered to 512.

@termoose feel free to take code from there and modify this PR; I won't be PRing it myself.

@termoose
Contributor Author

Good catch on not being able to reuse the thread pool! Based on this part of the documentation I assumed it could be reused as long as stop() was not called prior to join(), but I guess I misread it. It's not very clear without looking at the code.

If stop() is not called prior to join(), the join() call will wait
until the pool has no more outstanding work.

Won't there be overhead in initialising the thread pool every time? I could try to implement something with a std::promise being set_value'd inside each work function, and then get()ing all the corresponding std::futures later. I experimented a bit with this but couldn't make it work correctly, and it might not be needed if initialising the thread pool is actually cheap. I cherry-picked your changes though, thanks!
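The promise/future idea sketched in words above would look roughly like this (illustrative only: plain std::threads stand in for posting into a long-lived pool, and value > 0 stands in for a real batch verification):

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <thread>
#include <vector>

// Each work item fulfills a promise; the caller get()s all the futures
// afterwards, so the pool never needs a join()/restart cycle between calls.
std::vector<bool> run_batches (std::vector<int> const & batches)
{
	std::vector<std::future<bool>> futures;
	std::vector<std::thread> workers; // stand-in for posting into a long-lived pool
	for (int value : batches)
	{
		auto promise = std::make_shared<std::promise<bool>> ();
		futures.push_back (promise->get_future ());
		workers.emplace_back ([promise, value] {
			promise->set_value (value > 0); // placeholder verification result
		});
	}
	std::vector<bool> results;
	for (auto & f : futures)
	{
		results.push_back (f.get ()); // blocks only until that batch is done
	}
	for (auto & w : workers)
	{
		w.join ();
	}
	return results;
}
```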

@wezrule I agree it would be cool to accumulate verifications in add(), but testing and implementing that sounds like a new issue as it's a bit out of scope here. If this threading change is also needed ASAP you could get it out and we can work on improving add() later. A week back I started something that I might be able to extend, termoose#1, where each verify() can run in its own signature_checker_thread and doesn't have to run sequentially.

@termoose
Contributor Author

Made it so that the thread pool does not have to be instantiated inside verify_threaded every time; instead we're get()ing the futures as explained above. Not sure why the syntax checker is failing though, that line looks correct to me.

@guilhermelawless
Contributor

guilhermelawless commented Jan 26, 2019

About 3% improvement over instantiating every time 👍 Ran test 50 times and no failures.

@wezrule
Contributor

wezrule commented Jan 26, 2019

Hi @termoose @guilhermelawless thanks for your contributions so far! Now I have a few comments:

  • There was mention of using std::vector for concurrent modification of different elements from different threads. Normally this would be OK, but std::vector<bool> is an anomaly: it has a specialization in which some implementations squeeze the bool values into bits, so distinct elements can share a byte.

Notwithstanding (17.6.5.9), implementations are required to avoid data races when the contents of the contained object in different elements in the same container, excepting vector<bool>, are modified concurrently.

  • validate_message_batch () had a bug where it didn't check the correct range after the first batch.
  • I wasn't entirely sure why we had another thread for signature checking, since the calling thread just blocked until the future finished, so I've removed it.
  • If threads is set to 1 I just use the calling thread to do the work.
  • I have added the config option so that the user can specify the number of cores (we do similar things with work, io and network threads)
  • The newly added test is now run over multiple threads and promises are checked after all signature checks are added to the thread pool rather than waiting for each one to finish. Using this test and changing the cores I got the following results:
    1 core - 5.9s
    2 cores - 5.4s
    4 cores - 3.4s
    8 cores - 2.5s

When more messages of a greater size are used it greatly favours the higher cores, however we want tests to run fairly quickly irrespective of the number of cores so this test is not meant to be representative of a benchmark, it just checks functionality is working correctly for a wide range of sizes.

  • There were some other things, but I'm far too sleepy to remember :)

Can you guys have a look here and give any comments: https://github.com/wezrule/raiblocks/tree/multithread_signature_checker @termoose feel free to modify your PR with any of these changes, otherwise I will mine as a PR some point tomorrow.
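On the std::vector<bool> point above: the quoted wording is why writing distinct elements of std::vector<bool> from different threads is a data race, while a std::vector<char> is fine. A minimal illustration (not code from the branch):

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Distinct elements of a std::vector<char> may be written concurrently without
// synchronisation; std::vector<bool> gives no such guarantee, because several
// of its elements can share one byte in the packed representation.
std::vector<char> fill_concurrently (size_t count)
{
	std::vector<char> flags (count, 0);
	std::vector<std::thread> threads;
	for (size_t i = 0; i < count; ++i)
	{
		// Each thread touches exactly one element; safe for char, UB for bool
		threads.emplace_back ([&flags, i] { flags[i] = 1; });
	}
	for (auto & t : threads)
	{
		t.join ();
	}
	return flags;
}
```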

@termoose
Contributor Author

  • I'm aware of this specialisation; we don't use std::vector<bool> anywhere, but instead have futures and promises in our vectors.
  • Thanks for finding that off-by-one in the batch sizes!
  • Modified my PR to include the ability to set the number of threads from the config.

There are a number of changes here that I find confusing, but I don't have time to figure out all the details before you merge it in tomorrow.

  • You use a shared_ptr to the vector as well as on your Task, which causes a heap allocation every time verify_threaded is called. Is this overhead needed?
  • The tasks_remaining and task->pending counters look like roll-your-own std::futures for checking when the threads are finished. I myself would prefer a set of std::promise<bool>s that communicate these results as return values from future.get ().
  • Is the casting to and from char there to keep the std::vector thread safe? That overhead would also not be needed with our original std::vector<std::future<bool>>. It also makes things look strange, for example initialising a vector of chars to false.
  • If pending_mutex is only used for the counter then it could be made a std::atomic<int> and you could do if (--task->pending == 0), which decrements and loads atomically. But again, I don't see why these counters are needed if threading primitives are being used.

Waiting for each signature check to finish before posting new ones should not incur any great overhead as the cores are all fully exhausted anyway. Would be interesting to see some benchmarks on the different implementations if someone has time to run them.
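The atomic decrement suggested above looks like this in isolation (a sketch; the task struct and its members here are illustrative, not the branch's actual types):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// --pending on a std::atomic<int> decrements and returns the new value as one
// atomic operation, so exactly one worker observes 0 and runs the completion
// step; no mutex is needed around the counter.
struct task
{
	std::atomic<int> pending{ 0 };
	std::atomic<bool> completed{ false };
};

void worker_finished (task & t)
{
	if (--t.pending == 0)
	{
		t.completed = true; // completion logic runs exactly once
	}
}
```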

@wezrule
Contributor

wezrule commented Jan 27, 2019

std::vector<bool> was mentioned in an earlier version (std::vector<bool> results (batches + 1, false);), so I was just informing you guys in case you were unaware.

I've changed how it works quite a bit which is why you may be finding it confusing. I got rid of the queue (there's no add function anymore), the tasks are posted directly to the thread pool. I don't want to block the calling thread so the task is shared so that it remains in scope for the thread pool after the function finishes. Although the calling thread currently just waits on a future, it could be proactive and respond after each thread in the thread pool is finished, but I am leaving this for a later task.

Yes, the std::vector<char> was for thread safety; there is no overhead in converting char -> bool. I don't know about benchmarks between this and using a vector of promises, but I no longer use it anyway, and just assert the result after each thread.

Good point about the if (--task->pending == 0) thanks! I am now using std::atomic instead of a mutex.

I am going to run TSAN on these changes on the live network for a while to see if it picks up anything.

Ok we appreciate all the time and effort you have put into this and look forward to any future contributions. I will create a PR after some more testing. Thanks again.

@guilhermelawless
Contributor

guilhermelawless commented Jan 27, 2019

It looks good. add was confusing me as well, since there weren't any background tasks happening; every verify was blocking.

All I would add is a release_assert (pending == 0); in the Task destructor.
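That suggestion amounts to something like the following (a sketch; release_assert is nano's own macro, so a plain assert stands in here, and the task struct is illustrative):

```cpp
#include <atomic>
#include <cassert>

// A task must not be destroyed while work is still outstanding; the destructor
// check catches lifetime bugs where workers could touch a dead task.
struct task
{
	std::atomic<int> pending{ 0 };
	~task ()
	{
		assert (pending == 0); // release_assert (pending == 0); in the PR
	}
};
```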

@wezrule
Contributor

wezrule commented Jan 27, 2019

@guilhermelawless done. I will close this off and we can use: #1651
