Question about possible race condition among many threads #198
I tried again with the initial two lines (emplace, then incrementing the value if the key exists). This time, I captured the output to a file for comparison. The input files are randomly generated key-value pairs delimited by a tab character. For testing, I set the sixth argument for the parallel hash map to 8 (2**8 submaps), as 12 makes the issue more difficult to reproduce. The output file occasionally differs between runs.
I'm able to reproduce the race condition, but it takes many tries.
Thank you. I will release my code tomorrow and post a link here. It involves populating a hash map in parallel from one or many input files, sorting by value descending, then key ascending, and finally output. Every aspect of the demonstration (excluding output) is parallel, including processing the file(s) via chunking.
For clarity, the key names mentioned above are duplicate keys in an input file. A race condition may occur between emplace and the subsequent line:

```cpp
const auto [it, success] = map.emplace(key, count);
if ( !success ) it->second += count;
```
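To make the hazard concrete, here is one interleaving consistent with the undercounts described (an illustrative trace, not code from the demonstration):

```cpp
// Illustrative interleaving when two threads hit the same existing key:
//
//   A: map.emplace(key, count);  // returns {it, false}; submap lock released on return
//   B: map.emplace(key, count);  // returns {it, false}; lock released again
//   A: it->second += count;      // reads n ...
//   B: it->second += count;      // ... B also reads n; both write n + count
//
// The submap lock protects only the emplace call itself; the increment through
// the returned iterator runs outside the lock, so one update can be lost.
```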
Hi Mario, sorry for the late response. It is correct that your example using the iterator returned by `emplace` is not thread-safe.
When using the parallel version of the hash map, the internal submap lock is held only for the duration of each call, so updating the map through the iterator after `emplace` returns is unprotected. The change you made, using a lambda-based API that performs the update while the lock is held, addresses this. Congrats on finding the right solution; I'm well aware that the doc is not as nice as it should be.
Hi Greg, thank you for the clarity. That all makes perfect sense. I spoke too soon about the fix, though. Increasing the number of input files is one way to lift regression needle(s) from the haystack, if any. I'm running more tests now.
Like the prior regression, the failed keys below are unique within a file but exist in other input files. Each file is processed one at a time. The regression is difficult to reproduce, so I run multiple times. For example, out of 12 runs (same code, same input files), runs 9 (3 keys), 11 (1 key), and 12 (1 key) failed. It just happened that 3 of the 12 runs failed; normally I have to run 50+ times before seeing a regression. The failure is random.
It's mind-boggling to see a few keys twice in the output file. The output is simply a dump of a vector populated from the hash map (200 million+ key-value pairs):

```cpp
// Store the properties into a vector.
vec_str_int_type propvec;
propvec.reserve( 8 + map.size() );
for ( auto const& x : map )
propvec.emplace_back(x.first, x.second);
map.clear(); // Thank you, for clear being fast.
// Sort the vector in parallel by (count) in reverse order, (key) in lexical order.
boost::sort::block_indirect_sort(
propvec.begin(), propvec.end(),
[](const str_int_type& left, const str_int_type& right) {
return left.second != right.second
? left.second > right.second
: left.first < right.first;
},
nthds_sort
);
// Output the sorted vector.
for ( auto const& x : propvec )
fast_io::io::println(x.first, "\t", x.second);
```

I factored out the chunking logic by running another variant where threads process a list of input files in parallel (non-chunking; merging is handled serially and is thread-safe). This too succeeds 99% of the time. The non-chunking variant constructs the parallel hash map without a mutex, calling emplace and incrementing the count if the key exists. Basically, all threads populate local hash maps. I'm not sure if the rare regression is coming from std::hash behind the scenes. I will tidy the two variants and post links to them.
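A rough sketch of that serial merge, with assumed names (map_str_int_type, merge_local) rather than the actual demonstration code:

```cpp
#include <mutex>

// Sketch only: each worker fills its own local map, then folds it into the
// shared result while holding a single mutex, so the merge itself is serial
// and the emplace-then-increment pair is safe here.
std::mutex merge_mtx;

void merge_local(map_str_int_type& result, const map_str_int_type& local) {
    std::lock_guard<std::mutex> lock(merge_mtx);
    for (auto const& kv : local) {
        auto [it, ok] = result.emplace(kv.first, kv.second);
        if (!ok) it->second += kv.second;   // safe: only one merger at a time
    }
}
```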
I have been at this for some time, on and off. Previously, processing 20 million unique keys went well (non-chunking variant). Next, I will try an alternative std::hash solution. But first, I have an older cloned parallel hash map repository somewhere and will try that. I will also try another Linux distribution / compiler.
That's very odd. Can you share more of your code, ideally a full working program? I'd be curious to have a look.
Also, what is your `str_type`?
To rule out the compiler, I tried clang++ on Clear Linux and Fedora 28, all OS updates applied. Also, I tested using an older cloned parallel hash map repo (55725db, Mar 12). Same thing.
I created a gist containing llil4map.cc.
It depends on whether `MAX_STR_LEN_L` is defined:

```cpp
#ifdef MAX_STR_LEN_L
struct str_type : std::array<char, MAX_STR_LEN_L> {
bool operator==( const str_type& o ) const {
return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) == 0;
}
bool operator<( const str_type& o ) const {
return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) < 0;
}
};
// inject specialization of std::hash for str_type into namespace std
namespace std {
template<> struct hash<str_type> {
std::size_t operator()( str_type const& v ) const noexcept {
std::basic_string_view<char> bv {
reinterpret_cast<const char*>(v.data()), v.size() * sizeof(char) };
return std::hash<std::basic_string_view<char>>()(bv);
}
};
}
#else
using str_type = std::basic_string<char>;
#endif
```
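For illustration, assuming `MAX_STR_LEN_L` is defined, a key built this way compares with memcmp and hashes through the specialization above (the word and helper function are illustrative, not from the demonstration):

```cpp
#include <cstring>     // std::memcpy
#include <functional>  // std::hash

std::size_t demo_hash() {
    str_type key {};                            // zero-initialized, fixed-length array
    std::memcpy(key.data(), "flower", 6);       // assumes 6 < MAX_STR_LEN_L; rest stays NUL
    return std::hash<str_type>{}(key);          // picks up the injected specialization
}
```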
I created 92 input files using the following loop:

```bash
for e in $(perl -le 'print for "aa".."dn"'); do
echo "big$e"
perl gen-llil.pl "big$e" 200 3 1
perl shuffle.pl "big$e" >tmp && mv tmp "big$e"
done
```

To run, pass one or more files as input. I ran with up to 2,208 files, passing big* 24 times. Start with a smaller list:

```bash
NUM_THREADS=8 ./llil4map /data/biga* | cksum              # 26 files
NUM_THREADS=16 ./llil4map /data/big* /data/big* | cksum   # 184 files
```

The chunking variant consumes less memory and is okay to run on a 16 GB box with 2,208 files. The non-chunking variant was my first map demonstration; it consumes a lot more, but is manageable when running with fewer workers. The more workers, the more memory. Well, that's the gist of it. Basically, this is my contribution to solving the Chuma challenge at PerlMonks. A monk, eyepopslikeamosquito, introduced me to C++, and so I tried. Eventually, I reached the point of chunking in C++. The standard C++ map runs slow, so I searched the web for an alternative map implementation. And the joy of finding your repository! Just 2 days ago, I attempted a chunking variant with one map shared among threads. It works well for the most part. Sadly, a regression pops up randomly -- this issue.

Off-topic: My C++ chunking demonstration was first attempted here, an Easter Egg that lives on PerlMonks.
Thanks, I'll have a look over the weekend, maybe even this evening (sounds like a fun project :-).
Blessings and grace. This is largely due to eyepopslikeamosquito for getting me started. He started a PerlMonks thread, "Rosetta Code: Long List is Long", late last year. I'm able to reproduce the regression using 26 input files versus 92. I pass /path/to/biga* (note the letter a) on the command line 24 times to process 624 files in total. The output file is 755 MB. This is more manageable.
Out of 22 runs, 1 run failed similarly: a key is seen twice in the output, which is not expected.
Here is another. Out of 16 runs, 1 failed. While testing over the last couple of days, I have twice seen a blank key with a very high count. The regression is mostly like the prior one, but this one is weird.
The following is my loop script for running many times. The first time (when /tmp/out1 does not exist, or for a new input set), run llil4map once and move or copy /tmp/out2 to /tmp/out1.

run_loop.sh:

```bash
#!/bin/bash
if [ ! -f "/tmp/out1" ]; then
echo "Oops! '/tmp/out1' does not exists."
echo "Run llil4map manually the first time."
echo "Copy or move '/tmp/out2' to '/tmp/out1'."
exit 1
fi
for i in $(seq 1 80); do
echo "## $i"
NUM_THREADS=22 ./llil4map \
/data1/input/biga* /data1/input/biga* \
/data1/input/biga* /data1/input/biga* \
/data1/input/biga* /data1/input/biga* \
/data1/input/biga* /data1/input/biga* \
/data1/input/biga* /data1/input/biga* \
/data1/input/biga* /data1/input/biga* \
/data1/input/biga* /data1/input/biga* \
/data1/input/biga* /data1/input/biga* \
/data1/input/biga* /data1/input/biga* \
/data1/input/biga* /data1/input/biga* \
/data1/input/biga* /data1/input/biga* \
/data1/input/biga* /data1/input/biga* \
> /tmp/out2
cksum /tmp/out2
echo && wc -l /tmp/out?
echo && diff /tmp/out1 /tmp/out2 || exit 2
echo
done | tee /tmp/report
```
Maybe the issue is the hardware I'm running on :( Well, I'm hoping that you're unable to reproduce the regression. That would indicate that all is well with the demonstration populating a hash map in parallel. I ran again using fewer CPU cores and completed 80 runs in their entirety, no regressions.
Hum, I think there may be an off-by-one at line 211 of llil4map.cc, in the find_char logic.
Also, because the source files are sorted alphabetically, you could probably implement this faster and with much less memory.
Thanks, Greg. I see what you mean. In addition to the line 211 change, the while loop on line 214 needs a matching update.
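As a generic illustration only (a hypothetical find_char, not the one in llil4map.cc), this kind of off-by-one usually hinges on whether the end bound is exclusive or inclusive, and the companion while loop must use the same convention:

```cpp
// Hypothetical helper, for illustration only.
// If 'last' is one-past-the-end, the guard must be 'first < last';
// mixing conventions reads one byte too many or too few, and any caller's
// loop over the result must agree on what "not found" means.
static inline const char* find_char(const char* first, const char* last, char c) {
    while (first < last && *first != c)
        ++first;
    return first;   // == last when c was not found
}
```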
Mine are shuffled, via the subsequent `shuffle.pl` step.
**Resolved**

The regression turned out to be hardware related. Several years ago, I made BIOS adjustments for no other reason than to minimize power consumption. Well, one notch too much for the "Load-Line Calibration" (LLC) setting; I have since dialed it back. I ran again on all physical and logical CPU cores, completing 160 runs without issues (the loop script twice in a row). No code changes were made while isolating whether this was hardware related, due to the randomness of the regression.

**Summary**

Greg, you are right about find_char. There is possibly an off-by-one error on line 211, which requires a matching change on line 214. Thank you dearly for the parallel hash map C++ library. What I like about it is the use of SSE2 CPU instructions for checking 16 slots in parallel. Being able to consume all physical and logical CPU cores is a testament to your parallel hash map C++ library. Typically, I run on physical CPU cores only.
Great news, glad it worked out and you solved the issue. You would run faster if you were able to reserve capacity ahead of time. A data structure which would reduce the memory requirements for storing all these strings is a radix tree (trie).
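A minimal sketch of the reserve idea (the estimate and names are assumptions; phmap hash containers support reserve, and the parallel variants spread it across submaps):

```cpp
// Hypothetical: reserve before the parallel fill so submaps rarely rehash.
// The estimate would come from prior runs or input-size heuristics.
std::size_t estimated_unique_keys = 200'000'000;
map.reserve(estimated_unique_keys);
```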
You could clean up the code a little bit like this:
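One plausible shape for that cleanup, assuming the intent is to replace the reserve() + emplace_back loop (this reconstruction is an assumption, not necessarily the snippet originally posted):

```cpp
// Assumed cleanup: construct the vector straight from the map's iterator
// range, then release the map's memory.
vec_str_int_type propvec(map.begin(), map.end());
map.clear();
```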
I will check per file, reserving extra capacity for roughly 200 MB of input if below a threshold. Chuma mentioned that an input file may be up to 200 MB in size. Eventually, after processing many input files, the hash map size may settle since processing mostly increments values.
Ah... I now wish that a parallel radix tree existed, similar to the parallel hash map :)
Yes, will do. Thank you! I will clean up the code.
Closing the issue, good luck with this fun project, and thank you for using phmap!
Thank you for your help. I removed the original gist. The new llil4map.cc is final. It turns out that resize does not happen often, since processing eventually mostly updates values. Populating a shared hash map in parallel runs so fast, now that it is possible and thread-safe via the lock-holding lambda API.
Results for computing Chuma's "Long list is long" challenge reside in a gist.
Cool, thanks for posting @marioroy. I've been very busy and have not been able to look at your code, but I have time off next week and should be able to have a look. Would you mind if I added some of your code (possibly with changes from me) as an example in my phmap and gtl repos? Maybe even as a benchmark?
Sure thing, I do not mind. See also issue 199 for the C++ parallel demonstrations (mt_word_counter and chunking variants), including a C++ parallel demonstration involving orderly output by chunk_id.
Thank you for parallel-hashmap. I'm processing an input file in parallel that contains 1% to 2% duplicate keys, so I experienced a race condition using `emplace` and subsequently incrementing the count. It is difficult to reproduce, but the output had incorrect counts twice among 30+ runs. Is the emplace-then-increment form quoted earlier in the thread thread-safe, given that it involves two separate statements? I replaced the two lines with the following snippet. Is this the correct way of dealing with input having a small percentage of duplicate keys? Thus far, it succeeds 100% of the time; I have yet to see it fail after 30+ runs.
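A minimal sketch of such a snippet, assuming phmap's documented `lazy_emplace_l` API (map_str_int_type is an illustrative name):

```cpp
// Both lambdas execute while the submap lock is held, so the increment and
// the insert are each atomic with respect to other threads.
map.lazy_emplace_l(
    key,
    [&](map_str_int_type::value_type& p) { p.second += count; },          // key exists
    [&](const map_str_int_type::constructor& ctor) { ctor(key, count); }  // key absent
);
```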
The parallel map is constructed as follows:
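A sketch consistent with the thread, where N = 8 (the sixth template argument) gives 2**8 submaps; the remaining template arguments shown are phmap defaults plus std::mutex, and int_type is an assumed alias:

```cpp
#include <parallel_hashmap/phmap.h>
#include <mutex>

// str_type as defined earlier in the thread; int_type is an assumed alias.
using map_str_int_type = phmap::parallel_flat_hash_map<
    str_type, int_type,
    phmap::priv::hash_default_hash<str_type>,
    phmap::priv::hash_default_eq<str_type>,
    phmap::priv::Allocator<phmap::priv::Pair<const str_type, int_type>>,
    8,            // 2**8 = 256 submaps
    std::mutex    // internal locking for concurrent writers
>;
```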