
Ability to reset the inner sub map #238

Closed
marioroy opened this issue Mar 31, 2024 · 64 comments

marioroy commented Mar 31, 2024

Hi, @greg7mdp

I tested variable-length keys. It turns out the llil4map.cc demonstration consumes the most memory. The reason is that there is no way to clear or reset an inner sub map. Without this ability, memory consumption peaks at 18.4 GB; with a reset_inner capability, it peaks at 15.1 GB.

Here, we get ready to insert the key-value pairs into a vector.

         propvec.resize(map.size());
         std::vector<vec_str_int_type::iterator> I; I.resize(map.subcnt());
         I[0] = propvec.begin();
         size_t prev_num_keys;

         for (size_t i = 0; i < map.subcnt(); ++i) {
            map.with_submap(i, [&](const map_str_int_type::EmbeddedSet& set) {
               if (i == 0) {
                  prev_num_keys = set.size();
               } else {
                  I[i] = I[i-1] + prev_num_keys;
                  prev_num_keys = set.size();
               }
            });
         }  

         #pragma omp parallel for schedule(static, 1) num_threads(5)
         for (size_t i = 0; i < map.subcnt(); ++i) {
            map.with_submap(i, [&](const map_str_int_type::EmbeddedSet& set) {
               auto it = I[i];
               for (auto& x : set)
                  *it++ = std::make_pair(std::move(x.first), x.second);
            });
            map.reset_inner(i);             // <--- need ability to reset the inner sub map here
         }

I added the reset_inner method to phmap.h, locally on my machine.

diff --git a/parallel_hashmap/phmap.h b/parallel_hashmap/phmap.h
index d9a5b7b..c8a3d62 100644
--- a/parallel_hashmap/phmap.h
+++ b/parallel_hashmap/phmap.h
@@ -3444,6 +3444,10 @@ public:
         return  sets_[idx];
     }
 
+    void reset_inner(size_t idx) {
+        sets_[idx].set_ = EmbeddedSet();
+    }
+
     // Extension API: support for heterogeneous keys.
     //
     //   std::unordered_set<std::string> s;

Great news!

This change enables llil4map to reach llil4hmap performance when processing variable-length keys. Previously, the llil4map demonstration ran slower. Being able to clear/reset a sub map early is beneficial once the sub map is no longer needed. In the C++ snippet above, parallel threads process each sub map individually ("map to vector" below). I prefer the map's memory consumption to decrease as the vector's memory increases, rather than waiting until the end to clear the map. There are no issues for fixed-length keys.

llil4map.cc

The sub maps are managed by the phmap library.

$ ./llil4map in/big* in/big* in/big* | cksum
llil4map start
use OpenMP
use boost sort       Before 18.4GB     After 15.1GB
get properties         9.032 secs        8.969 secs
map to vector          3.054 secs        2.019 secs    <--- improvement
vector stable sort     3.099 secs        2.944 secs
write stdout           0.587 secs        0.553 secs
total time            15.774 secs       14.486 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

llil4hmap.cc

The sub maps are managed at the application level. Notice the similar times for "get properties" and "map to vector". The key's hash_value is computed only once and stored with the key. I was curious at the time and left it this way.

$ ./llil4hmap in/big* in/big* in/big* | cksum
llil4hmap start
use OpenMP
use boost sort
get properties         8.896 secs
map to vector          2.075 secs
vector stable sort     2.957 secs
write stdout           0.586 secs
total time            14.516 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

Now I understand why the two phmap demonstrations did not perform similarly before. You may like to know how far behind emhash this is: llil4map is 0.9 seconds apart with the reset_inner capability, versus 2.2 seconds apart in total time previously. Please, no worries about phmap vs. emhash. Lacking the ability to clear/reset the sub map was the reason they were more than 2 seconds apart (for variable-length keys).

llil4emh.cc

Here too, the sub maps are managed at the application level, including the key's hash_value stored with the key.

$ ./llil4emh in/big* in/big* in/big* | cksum
llil4emh start
use OpenMP
use boost sort
get properties         7.767 secs
map to vector          2.192 secs
vector stable sort     3.075 secs
write stdout           0.543 secs
total time            13.578 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

greg7mdp commented Mar 31, 2024

Thanks for the great explanation @marioroy .

Can you do the reset inside the with_submap callback, as in:

         for (size_t i = 0; i < map.subcnt(); ++i) {
            map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
               auto it = I[i];
               for (auto& x : set)
                  *it++ = std::make_pair(std::move(x.first), x.second);
               set = map_str_int_type::EmbeddedSet();    // <------------------------------
            });
         }


marioroy commented Mar 31, 2024

Thanks. I tried your suggestion (removing const and inserting the reset line).

         for (size_t i = 0; i < map.subcnt(); ++i) {
            map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
               auto it = I[i];
               for (auto& x : set)
                  *it++ = std::make_pair(std::move(x.first), x.second);
               set = map_str_int_type::EmbeddedSet();
            });
         }

The clang++ compile fails:

In file included from llil4map.cc:51:
./parallel-hashmap/parallel_hashmap/phmap.h:3431:9: error: no matching function for call to object of type '(lambda at llil4map.cc:415:32)'
 3431 |         fCallback(set);
      |         ^~~~~~~~~
llil4map.cc:415:17: note: in instantiation of function template specialization 'phmap::priv::parallel_hash_set<12, phmap::priv::raw_hash_set, spinlock_mutex, phmap::priv::FlatHashMapPolicy<std::basic_string<char>, unsigned int>, phmap::priv::StringHashEqT<char>::Hash, phmap::priv::StringHashEqT<char>::Eq, std::allocator<std::pair<const std::basic_string<char>, unsigned int>>>::with_submap<(lambda at llil4map.cc:415:32)>' requested here
  415 |             map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
      |                 ^
llil4map.cc:415:32: note: candidate function not viable: 1st argument ('const EmbeddedSet' (aka 'const raw_hash_set<phmap::priv::FlatHashMapPolicy<std::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned int>, phmap::priv::StringHashEqT<char>::Hash, phmap::priv::StringHashEqT<char>::Eq, std::allocator<std::pair<const std::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned int>>>')) would lose const qualifier
  415 |             map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
      |                                ^   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.

@greg7mdp

Sorry, my bad.

After the loop, and instead of map.reset_inner(i);, use the following, which is equivalent to your reset_inner:

map.with_submap_m(i, [&](map_str_int_type::EmbeddedSet& set) {
   set = map_str_int_type::EmbeddedSet();
});


marioroy commented Mar 31, 2024

Thank you, @greg7mdp. Your suggestion works.

         for (size_t i = 0; i < map.subcnt(); ++i) {
            map.with_submap(i, [&](const map_str_int_type::EmbeddedSet& set) {
               auto it = I[i];
               for (auto& x : set)
                  *it++ = std::make_pair(std::move(x.first), x.second);
            });
            // The sub map is no longer needed. Reset the set to reclaim memory.
            map.with_submap_m(i, [&](map_str_int_type::EmbeddedSet& set) {
               set = map_str_int_type::EmbeddedSet();
            });
         }

Notice "map to vector" completing in 2 seconds. It took 3 seconds previously.

$ ./llil4map in/big* in/big* in/big* | cksum
llil4map start
use OpenMP
use boost sort
get properties         8.946 secs
map to vector          2.018 secs    <-- Yeah! Thank you!
vector stable sort     2.960 secs
write stdout           0.589 secs
total time            14.514 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

@greg7mdp

Hey @marioroy , I just tried your llil4map.cc myself. Would you mind if I added it to phmap as an example (with your attribution comment at the top of llil4map.cc, of course)?

Which parameters did you use for gen-llil.pl? I used the ones from the comment and I got:

❯ NUM_THREADS=3 ../ex_lill4map big1.txt big2.txt big3.txt >out.txt
llil4map (fixed string length=12) start
use OpenMP
don't use boost sort
get properties         0.695 secs
map to vector          0.093 secs
vector stable sort     1.316 secs
write stdout           0.061 secs
total time             2.166 secs
    count lines     10545600
    count unique    10368458


marioroy commented Mar 31, 2024

Wow! I do not mind at all. llil4map.cc is final.

Thank you for your help, @greg7mdp: (1) some time ago, you found an off-by-one error in the chunking; (2) at the time, you helped make map to vector parallel; (3) and now, you helped ensure minimum memory consumption. Blessings and grace.

That's it, just as written in the comments for generating the input files.

perl gen-llil.pl big1.txt 200 3 1
perl gen-llil.pl big2.txt 200 3 1
perl gen-llil.pl big3.txt 200 3 1

I found my script for making 92 shuffled input files (requires 2.8 GB). On my machine (where llil4map.cc resides), in is a symbolic link to /data/input. Change the path accordingly.

#!/bin/bash
# Script for generating input files for llil4map.cc.

mkdir -p /data/input
cp gen-llil.pl shuffle.pl /data/input
cd /data/input

# create 26 random files
for n in $(perl -le "print for 'aa'..'az'"); do
   perl gen-llil.pl big$n 200 3 1
   perl shuffle.pl big$n >1; mv 1 big$n
done &

# create 26 random files
for n in $(perl -le "print for 'ba'..'bz'"); do
   perl gen-llil.pl big$n 200 3 1
   perl shuffle.pl big$n >2; mv 2 big$n
done &

# create 26 random files
for n in $(perl -le "print for 'ca'..'cz'"); do
   perl gen-llil.pl big$n 200 3 1
   perl shuffle.pl big$n >3; mv 3 big$n
done &

# create 14 random files (total 92 files)
for n in $(perl -le "print for 'da'..'dn'"); do
   perl gen-llil.pl big$n 200 3 1
   perl shuffle.pl big$n >4; mv 4 big$n
done &

wait

The default mode in llil4map.cc is fixed string length = 12. Commenting out line 119 switches to variable-length keys (using std::basic_string). Processing the 92 input files requires 16 GB when the line is commented out, and much less for fixed-length keys.

// #define MAX_STR_LEN_L (size_t) 12

My PerlMonks friend eyepopslikeamosquito helped me learn C++.

I enjoy parallel processing. So naturally, I tried: (1) chunked IO in get properties using spinlock_mutex; (2) map to vector, which is also parallel; (3) parallel sort; (4) finally, parallel but orderly output.

@greg7mdp

Wow! I do not mind at all. llil4map.cc is final.

Thanks, I'll add it as an example then, it is a great example!


greg7mdp commented Mar 31, 2024

Added! Looks like you did an amazing job learning C++!

I saw a small perf issue in llil4map.cc: using the non-mutable with_submap does not allow std::move-ing the string out of the map. That could be an issue if long strings are used (and the fixed string is not).

Here is my updated version.


marioroy commented Apr 1, 2024

Thanks. I also learned by reading phmap_fwd_decl.h and your examples.

I tried your updated version and added a line to reset the set, which is especially beneficial for long strings. Without it, "map to vector" takes 3 seconds for long strings and memory peaks near 18.4 GB.

         for (size_t i = 0; i < map.subcnt(); ++i) {
            map.with_submap_m(i, [&](map_str_int_type::EmbeddedSet& set) {
               auto it = I[i];
               for (auto& x : set)
                  *it++ = std::make_pair(std::move(const_cast<str_type&>(x.first)), x.second);
               // reset the set (no longer needed) to reclaim memory early
               set = map_str_int_type::EmbeddedSet();
            });
         }

I respect your decision if you prefer leaving the line commented out for long strings mode (line 119).

// #define MAX_STR_LEN_L (size_t) 12

Or default to fixed-length by uncommenting the line.

#define MAX_STR_LEN_L (size_t) 12

Here too, I respect your decision whether disabled or enabled (line 46). I enabled it locally to validate the parallel sort. Disabled, "vector stable sort" takes 58.0 seconds in long strings mode, or 21.5 seconds in fixed-length mode.

#define USE_BOOST_PARALLEL_SORT 1

Testing was done on a 32-core (64 logical threads) AMD Ryzen Threadripper 3970X machine with DDR4-3600 memory.

$ ./llil4map in/big* in/big* in/big* | cksum
llil4map start
use OpenMP
use boost sort       without reset      with reset
get properties         9.035 secs        9.004 secs
map to vector          3.068 secs        2.010 secs  <--- reset, minimum peak memory consumption
vector stable sort     2.972 secs        2.988 secs
write stdout           0.639 secs        0.563 secs
total time            15.716 secs       14.567 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689


greg7mdp commented Apr 1, 2024

Thanks @marioroy , I reverted the defaults to the way you had them, and also added the set reset.
On my system, an AMD Ryzen 9 7950X 16-core processor (32 threads) with DDR5 memory, I get the following, which is not the same as yours:

~/gi/g/parallel-hashmap/build_gcc12_rel/llil_utils master *1 +1 !2 ?4 ❯ ./run_llil4map                                                                    19s
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties        12.296 secs
map to vector          1.300 secs
vector stable sort     2.133 secs
write stdout           3.247 secs
total time            18.978 secs
    count lines     970195200
    count unique    200485780
3341100189 1811165988
~/gi/g/parallel-hashmap/build_gcc12_rel/llil_utils master *1 +1 !2 ?4 ❯ cat ./run_llil4map                                                                19s
#!/bin/bash

cd tmp
../../ex_llil4map big* big* big* | cksum


marioroy commented Apr 1, 2024

That is great performance for the AMD 7950X CPU. The "write stdout" taking 3.2 seconds is likely due to the L3 cache size difference between the two CPUs. I am providing the following output because the 7950X CPU is amazing; in other words, my system is unable to reach a 2x faster total time despite having 64 logical threads. Building with clang++ runs faster for me.

It is impressive that "get properties" for the 7950x CPU completes in 12.3 seconds. Wow!

g++

$ g++ -o llil4map -std=c++20 -fopenmp -Wall -O3 llil4map.cc -I./parallel-hashmap
$ ./llil4map in/big* in/big* in/big* | cksum
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties         7.496 secs
map to vector          0.965 secs
vector stable sort     1.740 secs
write stdout           0.657 secs
total time            10.859 secs
    count lines     970195200
    count unique    200483043

clang++

$ clang++ -o llil4map -std=c++20 -fopenmp -Wall -O3 llil4map.cc -I./parallel-hashmap
$ ./llil4map in/big* in/big* in/big* | cksum
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties         7.496 secs
map to vector          0.791 secs
vector stable sort     1.137 secs
write stdout           0.591 secs
total time            10.017 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

Last year, I tried custom memory allocation to reduce peak memory consumption, to no avail, because it greatly impacted "write stdout". Thanks for the reset suggestion. Memory consumption no longer peaks high.


greg7mdp commented Apr 1, 2024

It is weird that I don't get the same result for count unique as you do. Maybe a difference in gen_llil.pl?


marioroy commented Apr 1, 2024

Not weird at all; it is because of the random generator script. It is normal for the unique count to differ per machine or whenever the input files are refreshed. If you run gen_files again, the unique count will likely differ.

Do you want to add to .gitignore, the tmp dir?

/examples/llil_utils/tmp


greg7mdp commented Apr 1, 2024

Sure, will do, even though in my case I generate the files in my build directory.

I tried with clang-16 and indeed it is also faster on my system, esp the write stdout.

llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties        12.873 secs
map to vector          1.370 secs
vector stable sort     1.862 secs
write stdout           1.177 secs
total time            17.284 secs
    count lines     970195200
    count unique    200478308


marioroy commented Apr 1, 2024

Sure, will do, even though in my case I generate the files in my build directory.

Oh, I assumed wrongly the location.

I tried with clang-16 and indeed it is also faster on my system, esp the write stdout.

That is amazing.

I checked and you have the same defaults. Why is defaulting to fixed-length important? Mindfulness for users: the application can run with lower memory consumption, with no surprises on machines with 8 to 16 GB. The reset was the missing piece.

#define USE_BOOST_PARALLEL_SORT 1
#define MAX_STR_LEN_L (size_t) 12

Folks may run with a partial list, not all 92 files. For example, just the a files (biga* biga* biga*) or the [ab] files (big[ab]* big[ab]* big[ab]*).

@marioroy marioroy closed this as completed Apr 1, 2024

greg7mdp commented Apr 1, 2024

I have a question about the data files. They contain words followed by a number. Am I right to assume this number is always positive?
I'll see if I can make get_properties a little bit faster, this is fun :-)

Also I added a couple minor simplifications in the code, see this


marioroy commented Apr 1, 2024

Am I right to assume this number is always positive?

Yes, always positive is certain.

this is fun :-)

It's mind-boggling. We saw that clang++ is faster than g++. Also, -std=c++20 is faster than -std=c++17. Why did I stop at 92 input files? Running big* big* big* handles approximately 1 billion lines and 200 million unique keys. Testing big* once is mostly insertions into the map; testing big* big* big* is a combination of insertions and updates. Smaller input can be tested simply with a suffix, e.g. biga* biga* biga*. Then there are fixed-length mode and long strings. I'm grateful for the reset.

The application runs in parallel at every stage: (1) input, (2) map to vector, (3) sorting, and (4) output. Crazy fast. The coolness factor is the phmap library managing the many sub maps internally (1 << 12 = 4096). Thank you for allowing a custom mutex class. The spinlock_mutex class improves parallel insertions/updates.

Randomness is beautiful. With the many sub maps, it is less likely for threads to contend on the same map/mutex in a given time window or fraction of a second. If they do, no problem: the spinlock_mutex class has low overhead.

Also I added a couple minor simplifications in the code

I had no idea about set.size(). I appreciate learning from a master. Ah, the clear/swap lines are not needed. Because, the map construction is inside a block + the reset works so well. Oh, the sorting simplifications. That's amazing.

Thank you, @greg7mdp.


greg7mdp commented Apr 1, 2024

I'll play with it some more later. It is an interesting piece of code for sure!


greg7mdp commented Apr 3, 2024

Hey @marioroy , just wondering, is your solution the fastest one that exists for this problem?


marioroy commented Apr 3, 2024

is your solution the fastest one that exists for this problem?

Hi, @greg7mdp I am not aware of anything faster.

The llil4emh emhash demonstration is the fastest for fixed-length keys. For long strings or variable-length keys, llil4map and llil4emh perform similarly. The map demonstrations scale to many CPU cores with ease; they are the fastest beyond 12 CPU threads.

There is the llil4vec implementation. This appends to a vector, with no maps, so it may consume a lot of memory, particularly with repeated keys. It may run faster up to some number of cores, but it does not scale well beyond that. If you dare to run llil4vec, the most you can do with 32 GB is big* big* big*, and only in fixed-length=12 mode. It peaks near 28 GB, so be careful; it may cause swapping.

llil4map peaks less than 8 GB for 200 million keys, fixed-length=12.

I created a backup plan in case the number of unique keys were to exceed memory capacity :-) Run with NUM_MAPS=128 maximum, default 32.

  1. llil4tkh tkrzw::ShardDBM demonstration. The hash dbms are managed by the C++ library.
  2. llil4tkh2 tkrzw::HashDBM demonstration. The hash dbms are managed by the application.


greg7mdp commented Apr 3, 2024

Wow, thanks for all the information! You seem to have spent a lot of time on this. Is this something you are doing for work, or just for fun?


marioroy commented Apr 3, 2024

Is this something you are doing for work, or just for fun?

For fun, actually. I wondered if it was possible to solve the challenge. A Perl Monk, "eyepopslikeamosquito", tried the vector approach, stating memory is cheap. My solutions took some time (I was new to C++). I chose the path that preferred low memory consumption.

Your phmap C++ library made this possible (emhash as well). OpenMP made parallelism possible. Chunked IO and the spinlock mutex enable scaling to many CPU cores. The reset capability ensures low memory consumption during map to vector; I had no idea there was a way to do so, thank you for the tip. Boost's block_indirect_sort allows many threads and is fast. OpenMP's ordered directive works, allowing parallel yet orderly output.


marioroy commented Apr 4, 2024

There is one more problem to resolve. That is for long strings or variable length keys.

The std::string (std::basic_string) container is not memory efficient, unfortunately. In long strings mode, llil4map peaks at 17.8 GB for the same input files (previously 18.4 GB without the reset). That's horrendous and more than double fixed-length mode. It is much worse for the llil4vec vector demonstration, which peaks at 58 GB, requiring a VM or machine with 64 GB.

For comparison with llil4map, the llil4emh emhash implementation peaks at 20.0 GB in long strings mode. It computes the hash_value only once and stores it with the key.


marioroy commented Apr 4, 2024

I made a memory-efficient version, llil4map2, using a vector of pointers to the phmap key-value pairs. That meant having to construct the phmap outside the block, and no reset. This results in sorting taking longer, possibly due to CPU cache misses. However, memory consumption is significantly less.

200mil unique keys   llil4map   llil4map2
------------------   --------   --------
fixed-length=12      ~ 8.0 GB     6.0 GB
long-strings          17.8 GB    12.3 GB

I removed the limit on the number of threads for better sorting and output performance. This is helpful when the data in RAM is scattered and not cache friendly.


greg7mdp commented Apr 4, 2024

There is one more problem to resolve. That is for long strings or variable length keys.

Give me a couple of days and I'll provide a solution for that (a fixed-size string that also supports long strings and is as memory efficient as possible).


greg7mdp commented Apr 5, 2024

Hi @marioroy ,

I added a new type, string_cnt, which stores a short string inline if its length is < 11 bytes and otherwise uses a pointer. The program therefore works for any string length, but uses little memory for short strings, and even for long strings it uses less space than std::string.

I also used a set instead of a map, storing the string_cnt directly, so we know it will use 16 bytes per entry for short strings.

Some other small changes and cleanup, I hope you don't mind.

see this


marioroy commented Apr 5, 2024

Hi @greg7mdp,

Wow! I gave it a spin on this machine. It peaks at 6.9 GB, the lowest memory consumption thus far while supporting long strings.

$ ./llil4mapg big* big* big* | cksum
llil4map start
use OpenMP
use boost sort
get properties         8.152 secs
map to vector          0.885 secs
vector stable sort     2.764 secs
write stdout           0.696 secs
total time            12.499 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

At your convenience, please make one tiny change to allow the parallel sort to run on all CPU threads, i.e. no limit: replace std::min(nthds, 32) with nthds.

$ ./llil4mapg big* big* big* | cksum
llil4map start
use OpenMP
use boost sort
get properties         8.104 secs
map to vector          0.894 secs
vector stable sort     1.859 secs   <--- improvement, okay to utilize all CPU threads
write stdout           0.723 secs
total time            11.581 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

I will study string_cnt. The awesome part is not having to specify fixed-length or long strings mode; it's automatic.


greg7mdp commented Apr 5, 2024

Hi @marioroy , I just made that update to remove the 32 thread limit.


marioroy commented Apr 5, 2024

I reached out to eyepopslikeamosquito to let him know. He triggered my curiosity, in another thread, to make an attempt at the original Long List is Long challenge. This became a great opportunity for learning OpenMP with C++.

Thank you @greg7mdp, for the string_cnt solution.


marioroy commented Apr 5, 2024

Blessings and grace, @greg7mdp. Thank you for the extra clarity. The string_cnt struct is clever.

Pointers have to be aligned on an 8-byte boundary.

Just earlier, I tried (for fun) putting cnt first and wondered why the assertions were complaining. Now I understand.

What if cnt contained both a flag (1 bit, indicating a pointer is used) and the count (31 bits) using masking? Would that help remove the unused gap for long strings? Doing so may improve sorting performance.

Note: by long strings, I really mean variable-length strings. Thanks; boost::pool_allocator will not work.


greg7mdp commented Apr 6, 2024

If you want the same struct to work for both long and short strings, as we do I think, it has to contain at least a pointer (8 bytes) + a count.

If the count was never greater than 8, we could store it in the 3 lower bits of the pointer (which are always 0), so the pointer + count would fit in 8 bytes, so we could have either a 7 byte immediate string + count, or a pointer + count.

If the count can be greater than 8, then we need more than 8 bytes, so 16 bytes is the minimum our struct will occupy (for alignment), unless we are even more tricky by moving the pointers to an aligned address before dereferencing them.

But honestly, I don't see how you can save more than a few bytes, and then also you reduce the size of immediate strings that are supported.


marioroy commented Apr 6, 2024

Please disregard. So "instead of extra[3], store the flag (1 bit, indicating whether a pointer is used) in the cnt variable; 31 bits is plenty big for the count" would result in no savings?


greg7mdp commented Apr 6, 2024

No problem, it is not obvious to understand if C++ is not your main language for sure.


marioroy commented Apr 6, 2024

string_cnt is clever. My apologies for my ramblings.

I see it clearly, right here :)

   string_cnt(string_cnt&& o) noexcept {
      if (o.extra[3]) {
         str = o.str;
         o.str = nullptr;
         extra[3] = 1;
      } else {
         std::strcpy((char *)(&str), o.get());
         extra[3] = 0;
      }
      cnt = o.cnt;
   }


greg7mdp commented Apr 6, 2024

This is a move constructor, so this is stealing the data from o.
If o has a pointer to a long string, we can just copy the pointer, note in extra[3] that we have a pointer, and set o.str to null so o's destructor will not free the memory.

If o contains a small string, we just copy it and mark that we have a small string as well.

I could add some comments.


greg7mdp commented Apr 6, 2024

I think I have an idea to make get_properties faster. I'll try to check something tomorrow. This is a fun problem!


marioroy commented Apr 6, 2024

I understand your string_cnt fully now. Thank you. The magic lies in the set method: the strcpy may span the two variables str and extra, using up to 3 chars of extra (the 4th char is a flag).

         auto len = std::strlen(s);
         if (len >= buffsz) {
            str = new char[len+1];
            std::strcpy(str, s);
            extra[3] = 1;
         } else {
            std::strcpy((char *)(&str), s);
            extra[3] = 0;
         }

My favorite line is std::strcpy((char *)(&str), s): union-like behavior across str and extra (3 chars max), thus 11 chars max. There's more cleverness :-) The extra[3] byte is dual-purpose: either a flag (indicating a pointer) or the null terminator for an 11-char small string.

And the importance of o.str = nullptr in the move constructor, after stealing the data.


marioroy commented Apr 7, 2024

Your beautification of the include directives put me to work :-) For consistency, I beautified the include directives as well. I created a thread at Perl Monks to let folks know: https://perlmonks.org/?node_id=11158711

I brought back llil4umap (the unordered_map variant). Your tip for releasing memory immediately resolved the long cleanup delay in that one. So the umap version is nice to have for comparing the standard container vs. the powerful phmap library.


marioroy commented Apr 7, 2024

@greg7mdp Your solution is superb. Thanks so much, I learned a lot. The string_cnt struct boggles my mind with its clever design. You helped bring back sanity to my life. :-) Now I no longer think about the Long List is Long challenge.

Blessings and grace,


marioroy commented Apr 7, 2024

@greg7mdp There is extra global cleanup time when exiting the llil4map + string_cnt program.

It dawned on me to do a test involving mixed lengths. I created long?? input files containing 6-char or 15-char keys, basically alternating the arguments 200 12 1 and 200 3 1 to gen-llil.pl.

# create 26 random files
for n in $(perl -le "print for 'aa'..'az'"); do
   perl gen-llil.pl long$n 200 12 1
   perl shuffle.pl long$n >1; mv 1 long$n
done &

# create 26 random files
for n in $(perl -le "print for 'ba'..'bz'"); do
   perl gen-llil.pl long$n 200 3 1
   perl shuffle.pl long$n >2; mv 2 long$n
done &

# create 26 random files
for n in $(perl -le "print for 'ca'..'cz'"); do
   perl gen-llil.pl long$n 200 12 1
   perl shuffle.pl long$n >3; mv 3 long$n
done &

# create 14 random files (total 92 files)
for n in $(perl -le "print for 'da'..'dn'"); do
   perl gen-llil.pl long$n 200 3 1
   perl shuffle.pl long$n >4; mv 4 long$n
done &

wait

For this round of testing, the UNIX time is captured to detect any extra global/GC cleanup time. That is the case for string_cnt.

1. llil4map_greg   - Greg's llil4map version with string_cnt
2. llil4map_long   - llil4map with the MAX_STR_LEN_L line commented out
3. llil4map_fix16  - llil4map with MAX_STR_LEN_L 16

llil4map_greg

$ time ./llil4map_greg in/long* in/long* in/long* | cksum
llil4map start
use OpenMP
use boost sort
get properties        12.370 secs
map to vector          1.670 secs
vector stable sort     9.336 secs
write stdout           3.929 secs
total time            27.307 secs
    count lines     970195200
    count unique    295748977
3333632426 4307239210

real    0m43.433s     <--- 16.126 seconds extra
user    23m22.291s
sys     0m42.527s

llil4map_long

WARNING: Do not run llil4map_long on the full set (peaks 32 GB); process longa* longa* longa* (the 'a' series only) instead.

$ time ./llil4map_long in/long* in/long* in/long* | cksum
llil4map start
use OpenMP
use boost sort
get properties        10.232 secs
map to vector          3.249 secs
vector stable sort     4.737 secs
write stdout           1.340 secs
total time            19.560 secs
    count lines     970195200
    count unique    295748977
3333632426 4307239210

real    0m20.269s
user    13m21.889s
sys     0m24.138s

llil4map_fix16

$ time ./llil4map_fix16 in/long* in/long* in/long* | cksum
llil4map (fixed string length=16) start
use OpenMP
use boost sort
get properties         8.672 secs
map to vector          1.553 secs
vector stable sort     1.931 secs
write stdout           1.476 secs
total time            13.633 secs
    count lines     970195200
    count unique    295748977
3333632426 4307239210

real    0m13.785s
user    10m23.834s
sys     0m13.176s

@greg7mdp
Owner

greg7mdp commented Apr 7, 2024

Interesting!

What happens is that when you use std::string, the GNU standard library uses the same trick I did and does not allocate memory for strings up to 15 characters long (it is called the short string optimization, or sso).

So in your test, with the mix of 6-character strings and 15-character strings, you didn't allocate memory for strings whether you commented out MAX_STR_LEN_L or not.

However, using string_cnt, that same optimization is limited to 11 bytes.

Probably if you tried with a mix of 6/11 characters, or 6/17 characters, the difference would be less.

The aim with string_cnt was to reduce memory usage. Indeed, for string sizes from 12 to 15 characters it will be slower.

@marioroy
Author

marioroy commented Apr 7, 2024

@greg7mdp Thank you for the clarity about std::string sso. I wanted the testing to be an apples-to-apples comparison.

What happens is when you use std::string ... (it is called the short string optimization or sso).

I will re-create input files with 19-char keys (to exceed the sso limit) by passing alternating 200 16 1 and 200 3 1 arguments to gen-llil.pl. Well, to see if std::string does the same thing, i.e. a long GC or global cleanup when the program exits. That was the idea of the testing: check the overall behavior of dynamic memory allocation.

std::string does the same thing; long exit time due to global cleanup.

llil4map_greg

$ time ./llil4map_greg in/long* in/long* in/long* | cksum
llil4map start
use OpenMP
use boost sort
get properties        12.440 secs
map to vector          1.656 secs
vector stable sort     9.348 secs
write stdout           4.056 secs
total time            27.502 secs
    count lines     970195200
    count unique    295755152
29612263 5038456270

real    0m43.777s   <--- 16.275s of this time is from GC/global cleanup
user    23m27.969s
sys     0m43.282s

llil4map_long

$ time ./llil4map_long in/long* in/long* in/long* | cksum
llil4map start
use OpenMP
use boost sort
get properties        13.111 secs
map to vector          3.153 secs
vector stable sort    13.529 secs
write stdout           3.055 secs
total time            32.849 secs
    count lines     970195200
    count unique    295755152
29612263 5038456270

real    0m49.815s   <--- 16.966s of this time is from GC/global cleanup
user    20m59.538s
sys     0m59.482s

string_cnt shines, consuming less memory.

llil4map_greg   peak 17.2 GB
llil4map_long   peak > 36 GB

@marioroy
Author

marioroy commented Apr 7, 2024

std::string and string_cnt, once past the sso limit, result in significantly longer exit times due to global cleanup. If only string_cnt allowed one to specify the fixed-length limit, for MAX_STR_LEN_L-like capability. A key size of 11 chars is quite small. Chuma, the OP in the LLiL PerlMonks thread, did not provide us the max key length.

@greg7mdp Akin to the { sha1sum, sha224sum, sha256sum, sha384sum, sha512sum } naming, one could build llil4map supporting fixed-length modes { llil4map_12, llil4map_16, llil4map_20, llil4map_24, llil4map_30 }, plus llil4map_dyn (using string_cnt).

The Chuma LLiL challenge is solved. :-)

@greg7mdp
Owner

greg7mdp commented Apr 8, 2024

If only string_cnt allowed one to specify the fix-length limit, for MAX_STR_LEN_L-like capability.

It is indeed possible. I had previously modified string_cnt to use std::string_view; here is a version which is templatized to support various sso sizes:

// ---------------------------------------------------------------------------------------------
// Stores a string + a count
// For strings up to N-1 bytes, total space used is N + 4 bytes
// For larger strings, uses N+4 bytes + strlen(string) + 1
//
// invariants
//    if  extra[mark_idx], str is a valid string pointer
//    if !extra[mark_idx], the N bytes starting at (const char *)(&str) store a null_terminated string
// ---------------------------------------------------------------------------------------------
template<size_t N>
struct string_cnt_t {
   using uint_t = uint32_t;

   static_assert(N >= 12);
   static constexpr size_t extra_sz = N - sizeof(char*);
   static constexpr size_t mark_idx = extra_sz - 1;

   char *   str;
   char     extra[extra_sz];
   uint_t   cnt;

   static constexpr size_t buffsz = sizeof(str) + sizeof(extra);

   string_cnt_t() : str{nullptr}, extra{0}, cnt{0} {}

   string_cnt_t(std::string_view s, uint_t c = 0) : str(nullptr), cnt(c) { set(s); }

   ~string_cnt_t() { free(); }

   string_cnt_t(const string_cnt_t& o) : str{nullptr}, extra{0}, cnt{o.cnt} {
      set(o.get());
   }

   string_cnt_t(string_cnt_t&& o) noexcept {
      if (o.extra[mark_idx]) {
         str = o.str;
         o.str = nullptr;
         extra[mark_idx] = 1;
      } else {
         std::strcpy((char *)(&str), o.get());
         extra[mark_idx] = 0;
      }
      cnt = o.cnt;
   }

   string_cnt_t& operator=(const string_cnt_t& o) {
      free();
      set(o.get());
      cnt = o.cnt;
      return *this;
   }

   string_cnt_t& operator=(string_cnt_t&& o) noexcept {
      free();
      new (this) string_cnt_t(std::move(o));
      return *this;
   }

   std::strong_ordering operator<=>(const string_cnt_t& o) const { return std::strcmp(get(), o.get()) <=> 0; }
   bool operator==(const string_cnt_t& o) const { return std::strcmp(get(), o.get()) == 0; }

   std::size_t hash() const {
      auto s = get();
      std::string_view sv {s};
      return std::hash<std::string_view>()(sv);
   }

   const char *get() const { return extra[mark_idx] ? str : (const char *)(&str); }

private:
   void free() { if (extra[mark_idx]) { delete [] str; str = nullptr; } }

   void set(std::string_view s) {
      static_assert(buffsz == 12);
      static_assert(offsetof(string_cnt_t, cnt) == (intptr_t)buffsz);
      static_assert(sizeof(string_cnt_t) == 16);

      assert(!extra[mark_idx] || !str);
      if (s.empty())
         std::memset(&str, 0, buffsz);
      else {
         auto len = s.size();
         if (len >= buffsz) {
            str = new char[len+1];
            std::memcpy(str, s.data(), len);
            str[len] = 0;
            extra[mark_idx] = 1;
         } else {
            std::memcpy((char *)(&str), s.data(), len);
            ((char *)&str)[len] = 0;
            extra[mark_idx] = 0;
         }
      }
   }

   void set(const char *s) {
      set(std::string_view{s});
   }
};

using string_cnt = string_cnt_t<12>;

namespace std {
   template<> struct hash<string_cnt> {
      std::size_t operator()(const string_cnt& v) const noexcept { return v.hash(); };
   };
}

@greg7mdp
Owner

greg7mdp commented Apr 8, 2024

PS: I got get_properties to be faster... takes 8.8s on my machine with the 6 character strings.

@marioroy
Author

marioroy commented Apr 8, 2024

The new update is wonderful.

I like the fact that the private method set no longer calls strlen, because the length of the key is already known in get_properties. Previously (version 1), lazy_emplace_l accepted beg_str; now, I pass a string_view variable instead. It compiled without errors or warnings.

            int_type count = fast_atoll64(found + 1);
            std::string_view sv(beg_ptr, found - beg_ptr);

            // Use lazy_emplace to modify the set while the mutex is locked.
            set_ret.lazy_emplace_l(
               sv,
               [&](string_cnt_set_t::value_type& p) {
                  p.cnt += count; // called only when key was already present
               },
               [&](const string_cnt_set_t::constructor& ctor) {
                  ctor(std::move(sv), count); // construct value_type in place when key not present
               }
            );

I got get_properties to be faster...

Nice work, get_properties does run faster (set previously used strlen(s) + strcpy; now s.size() + memcpy). Before and after results.

$ ./llil4mapg big* big* big* | cksum
llil4map start
use OpenMP
use boost sort
get properties         8.104 secs     7.763 secs
map to vector          0.894 secs     0.937 secs
vector stable sort     1.859 secs     1.867 secs
write stdout           0.723 secs     0.674 secs
total time            11.581 secs    11.243 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

I re-compiled llil4map_greg.cc using string_cnt_t<20>. However, two assertions failed, so I changed the first and third assert lines.

   void set(std::string_view s) {
      static_assert(buffsz == N);
      static_assert(offsetof(string_cnt_t, cnt) == (intptr_t)buffsz);
   // static_assert(sizeof(string_cnt_t) == N + extra_sz);  // complained for string_cnt_t<20>

The moment has come :-) Let's try fixed-length 20, i.e. string_cnt_t<20>. Before and after results.

$ time ./llil4map_greg long* long* long* | cksum
llil4map start
use OpenMP
use boost sort
get properties        12.440 secs     9.270 secs   improved
map to vector          1.656 secs     2.118 secs   humm, we can bump thds_move
vector stable sort     9.348 secs     3.294 secs   improved
write stdout           4.056 secs     1.668 secs   improved
total time            27.502 secs    16.352 secs
    count lines     970195200
    count unique    295755152
29612263 5038456270

            real    0m43.777s      0m16.823s   improved, no GC side effects
            user   23m27.969s     13m35.370s
            sys     0m43.282s      0m15.187s

I bumped nthds_move to 12. No need to go higher. Running again :-).

   int nthds_move = std::min(nthds, 12);
$ time ./llil4map_greg long* long* long* | cksum
llil4map start
use OpenMP
use boost sort
get properties        12.440 secs     9.277 secs   improved
map to vector          1.656 secs     1.853 secs   ah yes, :-)
vector stable sort     9.348 secs     3.293 secs   improved
write stdout           4.056 secs     1.626 secs   improved
total time            27.502 secs    16.051 secs
    count lines     970195200
    count unique    295755152
29612263 5038456270

            real    0m43.777s      0m16.517s   improved, no GC side effects
            user   23m27.969s     13m35.582s
            sys     0m43.282s      0m15.290s

Wow! The improved string_cnt provides flexibility, allowing the developer to specify the sso fixed-length size. Also, it is no problem if the input sparingly has long strings exceeding the sso; it will still work, with minimal impact on cleanup time at exit.

At last, the LLiL challenge is solved. Thank you, @greg7mdp. The string_cnt update is amazing.

P.S. Not fair :-) I too want to use string_cnt in my phmap demonstrations #include <parallel_hashmap/string_cnt.h> or #include <examples/string_cnt.h>.

@greg7mdp
Owner

greg7mdp commented Apr 8, 2024

That's great, thanks for all your work. And for fixing the static_assert in string_cnt. I still have some work to do on my new version but I'm not sure when I'll get it done.

I too want to use string_cnt in my phmap demonstrations

You are welcome to use string_cnt for whatever you want.

@marioroy
Author

marioroy commented Apr 8, 2024

So much learning. I sped up get_properties by storing the length of the key alongside the count, i.e. (count << 8) + (klen & 0xff).

  1. I replaced two set lines from set(o.get()); to set(o.get(), o.cnt & 0xff); and changed the private set method.
   void set(const char *s, uint8_t len) {
      set(std::string_view{s, len});
   }
  2. The hash method is called hundreds of millions of times. The 1-second improvement is from here.
   std::size_t hash() const {
      auto s = get();
      std::string_view sv {s, cnt & 0xff};
      return std::hash<std::string_view>()(sv);
   }

I ran with 16 CPU cores. Before and after.

$ NUM_THREADS=16 taskset -c 0-15 ./llil4map_greg in/big* in/big* in/big* | cksum
llil4map start
use OpenMP
use boost sort
get properties        23.146 secs     22.106 secs  <-- improved
map to vector          0.818 secs      0.846 secs
vector stable sort     4.547 secs      4.496 secs
write stdout           0.538 secs      0.535 secs
total time            29.049 secs     27.985 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

get_properties is now as fast as my llil4map version (without string_cnt).

@greg7mdp
Owner

greg7mdp commented Apr 8, 2024

Very nice speedup! congrats.

@marioroy
Author

marioroy commented Apr 8, 2024

I created a gist (two files) containing llil4map_next.cc and string_cnt.h. Modification: cnt contains the count and key length, i.e. cnt = (cnt << 8) + (len & 0xff);. That improves the hash method, benefiting get_properties.

   std::size_t hash() const {
      auto s = getstr();
      std::string_view sv {s, cnt & 0xff};
      return std::hash<std::string_view>()(sv);
   }

https://gist.github.com/marioroy/693d952b578792bf090fe20c2aaccad5

Edit: The gist contains three files, added string_cnt_u.h.

@marioroy
Author

marioroy commented Apr 9, 2024

Sorting an array runs faster for fixed-length keys; that is the case for block_indirect_sort. So, I tried a union with std::array in the struct (see string_cnt_u.h).

   union {
      struct {
         char *   str;
         char     extra[extra_sz];
         uint_t   cnt;
      } s_;
      std::array<char, sizeof(s_) - sizeof(uint_t)> d_;
   };

   bool operator<(const string_cnt_t& o) const {
      if (s_.extra[mark_idx] || o.s_.extra[mark_idx])
         return std::strcmp(getstr(), o.getstr()) < 0;
      else
         return d_ < o.d_;
   }

Before and after results.

$ ./llil4map big* big* big* | cksum
llil4map start
use OpenMP
use boost sort        string_cnt.h  string_cnt_u.h
get properties         7.763 secs     7.689 secs
map to vector          0.937 secs     0.836 secs
vector stable sort     1.867 secs     1.589 secs  <-- improved (64 threads)
write stdout           0.674 secs     0.719 secs
total time            11.243 secs    10.835 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

Temporarily, I decreased the number of threads to 8 for sorting. Look at sorting go. :-)

$ ./llil4map big* big* big* | cksum
llil4map start
use OpenMP
use boost sort        string_cnt.h  string_cnt_u.h
get properties         7.703 secs     7.682 secs
map to vector          0.900 secs     0.898 secs
vector stable sort     9.264 secs     7.283 secs  <-- improved (8 threads)
write stdout           0.763 secs     0.640 secs
total time            18.631 secs    16.506 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

Let's try string_cnt_t<20>, again 8 threads for sorting, to get an idea.

$ ./llil4map long* long* long* | cksum
llil4map start
use OpenMP
use boost sort        string_cnt.h  string_cnt_u.h
get properties         9.161 secs     9.208 secs
map to vector          1.729 secs     1.860 secs
vector stable sort    15.586 secs    12.861 secs  <-- improved (8 threads)
write stdout           1.625 secs     1.630 secs
total time            28.104 secs    25.560 secs
    count lines     970195200
    count unique    295755152
29612263 5038456270

Gees! All the union changes just to improve sorting for fixed-length keys.

@greg7mdp
Owner

greg7mdp commented Apr 9, 2024

return d_ < o.d_;

Hmm, this comparison is not necessarily correct: if you have garbage after the terminating null character of the contained string, the std::array operator< will not stop at the null character.

@greg7mdp
Owner

greg7mdp commented Apr 9, 2024

If you set() a long string, the first 8 bytes contain a pointer. Now if you set a 2 byte string "ab", the first 8 bytes will contain "ab\0?????". I imagine this doesn't happen in llil.

@marioroy
Author

marioroy commented Apr 12, 2024

Hi @greg7mdp,

You are correct. There could be garbage; it is fixed by clearing first in set. Thank you!

            std::memset(f_.data(), 0, N);
            std::memcpy(f_.data(), s.data(), std::min(len, N - 1));

Fixing the garbage issue allowed me to improve the == operator in string_cnt_u.h. Now, get_properties runs as fast as my version without string_cnt.

   bool operator==(const string_cnt_t& o) const {
      if (s_.extra[mark_idx] || o.s_.extra[mark_idx])
         return std::strcmp(getstr(), o.getstr()) == 0;
      else
         return std::memcmp(getdata(), o.getdata(), N) == 0;
   }

Freeing memory is now handled with multiple threads, in llil4map_next.cc.

   // Free dynamically allocated memory in parallel.
   #pragma omp parallel for schedule(dynamic, 300000) num_threads(nthds)
   for (size_t i = 0; i < propvec.size(); i++)
      propvec[i].free();

I added a fixed-length-only version, string_cnt_f.h, nice to have for comparing with my version.

// #include "string_cnt.h"     // original + key length stored with cnt
// #include "string_cnt_f.h"   // fixed-length-only version
   #include "string_cnt_u.h"   // union std::array

Thank you for string_cnt.h. The program runs fast using string_cnt_u.h, and slightly faster using string_cnt_f.h.

@greg7mdp
Owner

Hi @marioroy , congrats on all the great work. I've been very busy at work, so I haven't had time to work on this or look at your latest changes, but hopefully I will soon. All the best!

@marioroy
Author

Greetings, @greg7mdp

I made a new gist containing my final phmap and vector demonstrations. The gist contains a README and LLiL utilities. There are three string_cnt_t variants; with extra comments.

C++ phmap and vector demonstrations using string_cnt_t
https://gist.github.com/marioroy/693d952b578792bf090fe20c2aaccad5

  1. llil4map_next.cc - phmap::parallel_flat_hash_map demonstration
  2. llil4vec_next.cc - similarly, a std::vector demonstration
  3. string_cnt.h - original + key length stored with count
  4. string_cnt_f.h - fixed-length-only version
  5. string_cnt_u.h - union std::array
  6. gen-files-big.sh - generate 92 big files (total 2.8 GB)
  7. gen-files-long.sh - generate 92 long files (total 5.0 GB)
  8. gen-llil.pl - generate LLiL test file
  9. shuffle.pl - shuffle LLiL test file

I learned more C++ from your string_cnt.h example, so much appreciated. The LLiL challenge is solved. llil4map_next.cc runs with low memory consumption and pleasantly fast using string_cnt_u.h.

@marioroy
Author

marioroy commented Apr 21, 2024

Hi @greg7mdp, I am running get_properties on Linux with the ECHO CPU scheduler and the Realtime patch set. I made a repo for building such a kernel on Clear Linux.

Processing long* long* long* previously took 16.9s. Now it takes 14.1s on a realtime Linux versus 11.5s non-RT. That took me by surprise. A realtime kernel has greater responsibility for sustaining low latency; this is normal RT behavior for many threads involving locking or pipes/sockets. The other stages perform the same.

   int_type count = fast_atoll64(found + 1);
   string_cnt s{(const char *)beg_ptr, (size_t)(found - beg_ptr), count};

   // Use lazy_emplace to modify the set while the mutex is locked.
   set_ret.lazy_emplace_l(
      s,
      [&](string_cnt_set_t::value_type& p) {
         p.incr(count);      // called only when key was already present
      },
      [&](const string_cnt_set_t::constructor& ctor) {
         ctor(std::move(s)); // construct value_type in place when key not present
      }
   );

@greg7mdp
Owner

Hi @marioroy , thanks for the updates. On my side I've been busy and have not worked on llil at all, I still have interesting ideas I'd like to try out when I get a chance.

@marioroy
Author

marioroy commented Apr 23, 2024

I made an attempt at dynamic buffer allocation. Get properties decreased to 10.6s on a realtime Linux, now similar to running on a non-realtime kernel.

https://gist.github.com/marioroy/d02881b96b20fa1adde4388b3e216163

The dynamic allocation logic is thread-safe, i.e. the first argument is an optional thread id for thread-safety. Plus, the dynamic memory position is rolled back if the long string was already present.

   int_type count = fast_atoll64(found + 1);
   string_cnt s{tid, beg_ptr, (size_t)(found - beg_ptr), count};

   // Use lazy_emplace to modify the set while the mutex is locked.
   set_ret.lazy_emplace_l(
      s,
      [&](string_cnt_set_t::value_type& p) {
         p.incr(count); // called only when key was already present
         s.mem_rollback(tid);
      },
      [&](const string_cnt_set_t::constructor& ctor) {
         // construct value_type in place when key not present
         // long strings will be stored in dynamic memory storage
         ctor(std::move(s));
      }
   );

Notice the lower memory consumption. And there too, get_properties completes faster than before.

$ time ./llil4map long* long* long* | cksum
llil4map start
use OpenMP             ( old )    ( next )     ( buf )
memory consumption     36.6 GB     17.5 GB     16.0 GB
use boost sort
get properties         12.967s     11.527s     10.402s
map to vector           3.244s      1.252s      1.211s
vector stable sort     10.265s      8.708s      8.161s
write stdout            1.410s      1.388s      1.386s
total time             27.887s     22.877s     21.163s
    count lines      970195200
    count unique     295755152
29612263 5038456270

       real time     0m43.973s   0m24.872s   0m21.909s
       user         25m08.532s  22m58.120s  20m56.492s
       sys           0m55.602s   0m48.783s   0m11.551s

Thank you, @greg7mdp. Your string_cnt_t struct is how I learned to do this.
