Question about possible race condition among many threads #198

Closed
marioroy opened this issue Jun 15, 2023 · 24 comments

@marioroy

marioroy commented Jun 15, 2023

Thank you for parallel-hashmap. I'm processing an input file in parallel that contains 1% to 2% duplicate keys. As a result, I experienced a race condition when using emplace and subsequently incrementing the count. It is difficult to reproduce, but the output had incorrect counts twice among 30+ runs. Is the following thread-safe, given that it involves two separate statements?

const auto [it, success] = map.emplace(key, count);
if ( !success ) it->second += count;

I replaced the two lines with the following snippet. Is this the correct way to deal with input having a small percentage of duplicate keys? Thus far it has succeeded 100% of the time; I have yet to see it fail after 30+ runs.

map.lazy_emplace_l(
   key,
   [&](map_str_int_type::value_type& p) {
      // called only when key was already present
      p.second += count;
   },
   [&](const map_str_int_type::constructor& ctor) {
      // construct value_type in place when key not present
      ctor(key, count);
   }
);

The parallel map is constructed as follows:

// create the parallel_flat_hash_map with internal mutexes
using map_str_int_type = phmap::parallel_flat_hash_map<
   str_type, int_type,
   phmap::priv::hash_default_hash<str_type>,
   phmap::priv::hash_default_eq<str_type>,
   phmap::priv::Allocator<phmap::priv::Pair<const str_type, int_type>>,
   12, std::mutex
>;

map_str_int_type map;
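
For context, here is a minimal, self-contained sketch of the same pattern (str_type, int_type, the key list, and the thread count are simplified stand-ins; the real program reads its keys from input files):

#include <array>
#include <mutex>
#include <string>
#include <thread>
#include <vector>
#include <parallel_hashmap/phmap.h>

using str_type = std::string;
using int_type = int;

using map_str_int_type = phmap::parallel_flat_hash_map<
   str_type, int_type,
   phmap::priv::hash_default_hash<str_type>,
   phmap::priv::hash_default_eq<str_type>,
   phmap::priv::Allocator<phmap::priv::Pair<const str_type, int_type>>,
   12, std::mutex
>;

int main() {
   map_str_int_type map;
   const std::array<const char*, 4> keys { "apple", "banana", "apple", "cherry" };

   std::vector<std::thread> thds;
   for (int t = 0; t < 8; ++t) {
      thds.emplace_back([&] {
         for (const char* k : keys) {
            str_type key(k);
            int_type count = 1;
            // insert or increment while the submap mutex is held
            map.lazy_emplace_l(
               key,
               [&](map_str_int_type::value_type& p) { p.second += count; },
               [&](const map_str_int_type::constructor& ctor) { ctor(key, count); }
            );
         }
      });
   }
   for (auto& t : thds) t.join();

   // "apple" ends with a count of 16 (2 occurrences x 8 threads)
   return 0;
}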
@marioroy
Author

marioroy commented Jun 16, 2023

I tried the initial two lines again (emplace, then incrementing the value if the key exists). This time, I captured the output to a file for comparison. The input files are randomly generated key-value pairs delimited by a tab character. For testing, I set the sixth template argument of the parallel hash map to 8 (2**8 = 256 submaps), since the regression is more difficult to reproduce with 12.

The output file out1.txt is correct.

$ diff out1.txt out2.txt
425764d425763
< pxmbpi        10
687626a687626
> pxmbpi        9

I'm able to reproduce the race condition, but it takes many tries.

$ diff out1.txt out2.txt
113625d113624
< duyjcn        10
687626a687626
> duyjcn        9
$ diff out1.txt out2.txt
634971d634970
< xzbwic        10
687626a687626
> xzbwic        9

Thank you for lazy_emplace_l. I cannot reproduce the race condition after running 100 times.

I will release my code tomorrow and post a link here. It involves populating a hash map in parallel from one or many input files, sorting by value descending and key ascending, and finally producing output. Every aspect of the demonstration (excluding output) is parallel, including processing the file(s) via chunking.

@marioroy
Author

marioroy commented Jun 16, 2023

For clarity, the key names mentioned above are duplicate keys in an input file. A race condition may occur between emplace and the subsequent line.

const auto [it, success] = map.emplace(key, count);
if ( !success ) it->second += count;

Calling lazy_emplace_l appears to be thread-safe. I'm unable to reproduce the race condition.

@greg7mdp
Owner

greg7mdp commented Jun 16, 2023

Hi Mario, sorry for the late response.

It is correct that your example using the iterator returned by emplace is not thread safe. There is this sentence in the README file:

However, please be aware that iterators or references returned by standard APIs are not protected by the mutex, so they cannot be used reliably on a hash map which can be changed by another thread.

When using emplace, the mutex specified as the last template parameter of your map (std::mutex) is locked while the key is inserted into the map (when not already present), but it is unlocked before emplace() returns. So if the key was already present and you increment the value using the returned iterator, the increment is done without the mutex being locked.

The change you made, using lazy_emplace_l, is exactly correct, and will allow both operations (the insertion if not present, or the increment if already present) to occur under the mutex protection. This code will be fully thread-safe, and very efficient.

Congrats on finding the right solution, I'm well aware that the doc is not as nice as it should be.
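
As an aside, the extended interface also has try_emplace_l, which should cover the same insert-or-increment pattern in a single call (a sketch, untested here; double-check the README for the exact signature):

// if the key is present, the lambda runs with the submap mutex held and increments the value;
// otherwise (key, count) is inserted, also under the mutex
map.try_emplace_l(
   key,
   [&](map_str_int_type::value_type& p) { p.second += count; },
   count
);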

@marioroy
Author

marioroy commented Jun 16, 2023

Hi Greg, thank you for the clarity. That all makes perfect sense.

I spoke too soon about lazy_emplace_l, or it could be std::hash. I'm not sure.

Increasing the number of input files is one way to surface any regression needle(s) in the haystack. I'm running tests on lazy_emplace_l, and it works roughly 99% of the time. I settled on 10 (2**10 = 1,024 submaps) and processed 2,208 input files many times. Each file is processed individually in parallel via chunking. Validation fails randomly (not often) when calling lazy_emplace_l. Below, the randomly generated key nryjun, with a value of 1, is unique within a file but exists in 48 separate files. The tally 48 is correct in out1.txt. Interestingly, the key nryjun appears to be added twice, as seen in out2.txt.

$ ls -lh /tmp/out?.txt
-rw-r--r-- 1 mario mario 1.9G Jun 16 03:01 /tmp/out1.txt
-rw-r--r-- 1 mario mario 1.9G Jun 16 03:05 /tmp/out2.txt

$ wc -l /tmp/out?.txt
 200483043 /tmp/out1.txt
 200483044 /tmp/out2.txt
 400966087 total

$ diff out1.txt out2.txt
58813955d58813954
< nryjun        48
86951724a86951724
> nryjun        47
200483043a200483044
> nryjun        1

Like the prior regression, the failed keys below are unique within a file but exist in other input files. Each file is processed one at a time. The regression is difficult to reproduce, so I run multiple times. For example, out of 12 runs (same code, same input files), runs 9 (3 keys), 11 (1 key), and 12 (1 key) failed. It just happened that 3 of the 12 runs failed. Normally, I have to run 50+ times before seeing a regression. The failure is random.

$ wc -l /tmp/out?.txt
 200483043 /tmp/o1
 200483046 /tmp/o2
 400966089 total

$ diff out1.txt out2.txt
49203927d49203926
< jmoohh        48
86951724a86951724
> jmoohh        47
122419594d122419593
< idfpxa        24
187959857d187959855
< xdkxfm        24
200483043a200483042,200483046
> idfpxa        23
> xdkxfm        23
> idfpxa        1
> jmoohh        1
> xdkxfm        1

It's mind-boggling to see a few keys appear twice in the output file. The output is simply a dump of a vector populated from the hash map (200 million+ key-value pairs).

// Store the properties into a vector.
vec_str_int_type propvec;
propvec.reserve( 8 + map.size() );
for ( auto const& x : map )
   propvec.emplace_back(x.first, x.second);

map.clear();  // Thank you, for clear being fast.

// Sort the vector in parallel by (count) in reverse order, (key) in lexical order.
boost::sort::block_indirect_sort(
   propvec.begin(), propvec.end(),
   [](const str_int_type& left, const str_int_type& right) {
      return left.second != right.second
         ? left.second > right.second
         : left.first  < right.first;
   },
   nthds_sort
);

// Output the sorted vector.
for ( auto const& x : propvec )
   fast_io::io::println(x.first, "\t", x.second);

To factor out the chunking logic, I ran another variant where threads process a list of input files in parallel (non-chunking; the merge is handled serially and is thread-safe). This too succeeds only ~99% of the time. The non-chunking variant constructs the parallel hash map without a mutex, calls emplace, and increments the count if the key exists. Basically, each thread populates its own local hash map.
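
Roughly, the non-chunking variant looks like this (a simplified sketch, not the actual code; files, process_file, and nthds are placeholders, and a plain flat_hash_map stands in for the per-thread map):

// each thread owns its local map (no mutex needed), so emplace + increment cannot race
using local_map_type = phmap::flat_hash_map<str_type, int_type>;

std::vector<local_map_type> local_maps(nthds);
std::vector<std::thread> thds;

for (int t = 0; t < nthds; ++t) {
   thds.emplace_back([&, t] {
      for (std::size_t i = t; i < files.size(); i += nthds)
         process_file(files[i], local_maps[t]);   // emplace or increment locally
   });
}
for (auto& t : thds) t.join();

// serial merge into the final map; only one thread runs here, so no locking is needed
local_map_type merged;
for (const auto& lm : local_maps)
   for (const auto& kv : lm)
      merged[kv.first] += kv.second;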

I'm not sure if the rare regression is coming from std::hash behind the scenes. I will tidy the two variants and post links to them.

@marioroy
Author

marioroy commented Jun 16, 2023

I have been at this for some time, on and off. Previously, processing 20 million unique keys went well (non-chunking variant). Next, I will try an alternative to std::hash. But first, I have an older clone of the parallel hash map repository somewhere and will try that. I will also try another Linux distribution / compiler.

@greg7mdp
Owner

That's very odd. Can you share more of your code, ideally a full working program? I'd be curious to have a look.

@greg7mdp
Owner

Also, what is your str_type?

@marioroy
Author

marioroy commented Jun 16, 2023

I ruled out the compiler by trying clang++ on Clear Linux and Fedora 28, with all OS updates applied. I also tested with an older clone of the parallel hash map repo (55725db, Mar 12). Same thing.

Can you share more of your code, ideally a full working program, ...

I created a gist containing llil4map.cc.

Also, what is your str_type?

It depends on whether MAX_STR_LEN_L is defined around line 99. I tested both paths (defined and the MAX_STR_LEN_L define commented out). Same regression, although difficult to reproduce.

#ifdef MAX_STR_LEN_L
struct str_type : std::array<char, MAX_STR_LEN_L> {
   bool operator==( const str_type& o ) const {
      return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) == 0;
   }
   bool operator<( const str_type& o ) const {
      return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) < 0;
   }
};
// inject specialization of std::hash for str_type into namespace std
namespace std {
   template<> struct hash<str_type> {
      std::size_t operator()( str_type const& v ) const noexcept {
         std::basic_string_view<char> bv {
            reinterpret_cast<const char*>(v.data()), v.size() * sizeof(char) };
         return std::hash<std::basic_string_view<char>>()(bv);
      }
   };
}
#else
using str_type         = std::basic_string<char>;
#endif

@marioroy
Author

marioroy commented Jun 16, 2023

I created 92 input files, bigaa ... bigdn, in a folder; just under 3 GB altogether. The shuffle script puts the content in random order.

Both scripts can be found at https://perlmonks.org/?node_id=11148681: the gen-llil.pl script is a bit down the page, and the mini shuffle.pl script is at the top of the page.

for e in $(perl -le 'print for "aa".."dn"'); do
    echo "big$e"
    perl gen-llil.pl "big$e" 200 3 1
    perl shuffle.pl  "big$e" >tmp && mv tmp "big$e"
done

To run, one can pass one or more files as input. I ran with up to 2,208 files passing big* 24 times. Start with a smaller list.

NUM_THREADS=8 ./llil4map /data/biga* | cksum  # 26 files
NUM_THREADS=16 ./llil4map /data/big* /data/big* | cksum  # 184 files

The chunking variant consumes less memory and is okay to run on a 16 GB box with 2,208 files. The non-chunking variant was my first map demonstration. It consumes a lot more, but is manageable when running with fewer workers. The more workers, the more memory.

Well, that's the gist of it. Basically, this is my contribution to solving the Chuma challenge at PerlMonks. The monk eyepopslikeamosquito introduced me to C++, so I gave it a try. Eventually, I got to the point of doing the chunking in C++.

The standard C++ map runs slowly, so I searched the web for an alternative map implementation. What a joy to find your repository. Just 2 days ago, I attempted a chunking variant with one map shared among threads. It works well for the most part. Sadly, a regression pops up randomly -- hence this issue.

Off-topic

My C++ chunking demonstration was first attempted in grep-count-chunk.cc, an Easter egg that lives on PerlMonks. It includes logic for orderly output, if needed. A subsequent chunking attempt was posted for Risque Romantic Rosetta Roman Race; that one does orderly output.

@greg7mdp
Owner

Thanks, I'll have a look over the weekend, maybe even this evening (sounds like a fun project :-).
Impressive what you did for someone new to C++!

@marioroy
Author

marioroy commented Jun 17, 2023

Impressive what you did for someone new to C++!

Blessings and grace. This is largely thanks to eyepopslikeamosquito for getting me started. He started the PerlMonks thread "Rosetta Code: Long List is Long" late last year.

I'm able to reproduce the regression using 26 input files instead of 92. I pass /path/to/biga* (note the letter a) on the command line 24 times to process a total of 624 files. The output file is 755 MB. This is more manageable.

Input  91,395,200 total keys
Output 79,120,065 unique keys

Out of 22 runs, 1 run failed similarly. A key is seen twice in the output, which is not expected.

$ diff /tmp/out1 /tmp/out2
3143688d3143687
< fjjfvx        48
11122871a11122871
> fjjfvx        47
79120065a79120066
> fjjfvx        1

Here is another. Out of 16 runs, 1 failed. While testing over the last couple of days, I have seen a blank key with a very high count twice. The regression is mostly like the prior one, but this one is weird.

$ diff /tmp/out1 /tmp/out2
0a1
>       122528746797432
72597798d72597798
< xndybc        24
77549573d77549572
< zkkbyj        24
79120065a79120065,79120067
> xndybc        23
> zkkbyj        23
> zkkbyj        1

The following is my loop script for running many times. The first time (when /tmp/out1 does not exist, or with a new input set), run llil4map once and move or copy /tmp/out2 to /tmp/out1. Afterwards, the loop script can be used; the requirement is that /tmp/out1 exists. Set NUM_THREADS and the path to the input files accordingly.

run_loop.sh

#!/bin/bash

if [ ! -f "/tmp/out1" ]; then
  echo "Oops! '/tmp/out1' does not exists."
  echo "Run llil4map manually the first time."
  echo "Copy or move '/tmp/out2' to '/tmp/out1'."
  exit 1
fi

for i in $(seq 1 80); do
  echo "## $i"

  NUM_THREADS=22 ./llil4map \
    /data1/input/biga* /data1/input/biga* \
    /data1/input/biga* /data1/input/biga* \
    /data1/input/biga* /data1/input/biga* \
    /data1/input/biga* /data1/input/biga* \
    /data1/input/biga* /data1/input/biga* \
    /data1/input/biga* /data1/input/biga* \
    /data1/input/biga* /data1/input/biga* \
    /data1/input/biga* /data1/input/biga* \
    /data1/input/biga* /data1/input/biga* \
    /data1/input/biga* /data1/input/biga* \
    /data1/input/biga* /data1/input/biga* \
    /data1/input/biga* /data1/input/biga* \
  > /tmp/out2
  cksum /tmp/out2

  echo && wc -l /tmp/out?
  echo && diff /tmp/out1 /tmp/out2 || exit 2

  echo
done | tee /tmp/report

@marioroy
Author

marioroy commented Jun 17, 2023

Maybe the issue is the hardware I'm running on :(

Well, I'm hoping that you're unable to reproduce the regression. That would indicate that all is well with the demonstration, populating a hash map in parallel.

I ran again using a smaller number of CPU cores and completed all 80 runs with no regressions.

@greg7mdp
Owner

Hum, I think at line 211 of llilmacpc.cc, it should be: last = &buf[len]; instead of last = &buf[len - 1];, because last in find_char is one past the last character looked at.

@greg7mdp
Owner

Also, because the source files are sorted alphabetically, you could probably implement this faster and with much less memory.

@marioroy
Author

marioroy commented Jun 17, 2023

Hum, I think at line 211 of llilmacpc.cc, it should be: last = &buf[len]; instead of last = &buf[len - 1];, because last in find_char is one past the last character looked at.

Thanks, Greg. I see what you mean. In addition to the line 211 change, changing the while loop on line 214 from first <= last to first < last completes the suggested fix. I will validate the change against an input file containing all blank lines, a file containing 1 char per line, and so forth.
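
To make the intent concrete, here is a simplified sketch of the corrected scan (assuming find_char returns a pointer to the first match in [first, last), or last when there is none; process_line is a placeholder):

char* first = &buf[0];
char* last  = &buf[len];                 // one past the last valid character (was &buf[len - 1])

while (first < last) {                   // was first <= last
   char* eol = find_char(first, last, '\n');
   process_line(first, eol);             // handle the line [first, eol)
   first = (eol == last) ? last : eol + 1;
}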

Also, because the source files are sorted alphabetically, ...

Mine are shuffled via the subsequent shuffle.pl step mentioned above. The reason is that in Chuma's use case the files are sorted by the count field in reverse order, then by key, so the keys are likely unordered.

@marioroy
Author

Resolved

The regression turned out to be hardware related. Several years ago, I made BIOS adjustments for no other reason than to minimize power consumption. Well, that was one notch too much for the "Load-Line Calibration" (LLC) setting. I changed it from -1 to 1.

I ran again on all physical and logical CPU cores, completing 160 runs without issues (the loop script twice in a row). No code changes were made, so as to isolate whether the problem was hardware related, given the randomness of the regression.

Summary

Greg, you are right about find_char. There is possibly an off-by-one error on line 211; fixing it also requires changing line 214 to (first < last). I will re-test the chunking logic to ensure there is no infinite loop, e.g. when chunking a file containing just blank lines.

Thank you dearly for the parallel hash map C++ library. What I like about it is the use of SSE2 CPU instructions for checking 16 slots in parallel, map.clear() being nearly instant, and not to forget lazy_emplace_l, which works well. A parallel delight... chunking and populating a shared hash map in parallel, with SSE2 parallelism on top of that; no wonder my system faltered with LLC set too low.

Being able to consume all physical and logical CPU cores is a testament to your parallel hash map C++ library. Typically, I run on physical CPU cores only, but the llilmapc variant kept running faster as NUM_THREADS increased, up to NCPUs-1.

@greg7mdp
Owner

Great news, glad it worked out and you solved the issue. You would run faster if you were able to reserve (map.reserve()) a size close to the final size of the hash map. Otherwise it keeps resizing as you insert values into it.
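
For example (a sketch; the estimate would come from whatever you know about the input, such as total line counts):

// reserve capacity up front so the submaps do not rehash repeatedly while inserting
std::size_t estimated_unique_keys = 80'000'000;   // illustrative number only
map.reserve(estimated_unique_keys);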

A data structure which would reduce the memory requirements for storing all these strings is a prefix trie, but I don't know of an implementation which would allow multithreaded access with low contention.

@greg7mdp
Owner

You could clean up the code a little bit, like this:

#ifdef MAX_STR_LEN_L
            str_type s {};  // {} initializes all elements of s to '\0'
            ::memcpy( s.data(), start_ptr, std::min(MAX_STR_LEN_L, klen));
#else
            str_type s(start_ptr, klen);
#endif                       
            map_ret.lazy_emplace_l(
               s,
               [&](map_str_int_type::value_type& p) {
                  // called only when key was already present
                  p.second += count;
               },
               [&](const map_str_int_type::constructor& ctor) {
                  // construct value_type in place when key not present
                   ctor(std::move(s), count);
               }
            );

@marioroy
Author

marioroy commented Jun 17, 2023

You would run faster if you were able to reserve (map.reserve()) a size close to the final size of the hash map.

I will check per file, reserving an extra 200 MB if below a threshold. Chuma mentioned that an input file may be up to 200 MB in size. Eventually, after processing many input files, the hash map size may settle due to mostly incrementing the value.

A data structure which would reduce the memory requirements for storing all these strings is a prefix trie.

Ah... I now wish that a parallel radix tree existed, similar to the parallel hash map :)

you could clean up the code a little bit like this

Yes, will do. Thank you! I will clean up llil4map.

@greg7mdp
Owner

Closing the issue, good luck with this fun project, and thank you for using phmap!

@marioroy
Author

marioroy commented Jun 17, 2023

Thank you for your help. I removed the original gist. The new llil4map.cc is final.

It turns out that resizing does not happen often, since eventually the work is mostly updating values. Populating a shared hash map in parallel runs fast, and it is possible and thread-safe when calling lazy_emplace_l.

@marioroy
Author

marioroy commented Jun 22, 2023

Results for computing Chuma's "Long list is long" reside in a gist. The llil4map solution is the fastest, faster even than mcesort. Scroll down past the summary section for the actual preparation and run steps.

llil_results.txt

@greg7mdp
Owner

Cool, thanks for posting, @marioroy. I've been very busy and have not been able to look at your code, but I have time off next week and should be able to have a look. Would you mind if I added some of your code (possibly with changes from me) as an example in my phmap and gtl repos? Maybe even as a benchmark?

@marioroy
Author

marioroy commented Jun 22, 2023

Would you mind if I added some of your code (possibly with changes from me) as an example in my phmap and gtl repos? Maybe even as a benchmark?

Sure thing. I do not mind.

See also issue 199. For the mt_word_counter example, I opted to have the individual threads update a local hash map. This approach works well when the hash map is small, allowing better scaling on a big machine.

C++ parallel demonstrations (mt_word_counter and chunking variants)
https://gist.github.com/marioroy/643cddb782c3363d9d828b5eba292524

C++ parallel demonstration involving orderly output by chunk_id
https://www.perlmonks.org/?node_id=11152467
