
Hines/thread padding #2951

Open

wants to merge 10 commits into master

Conversation

nrnhines
Member
An experiment to see if performance improves when destructive (write) cache-line sharing is avoided (cache lines are assumed to be 64 bytes).

Using the model of #2787, the improvement is small or nonexistent on an Apple M1. Timing results for 8 threads are:

nthread    8.2.4    master    padding (this PR)
8 run      16.69    17.57     17.38
8 solve    15.41    15.56     15.49

Perhaps there is a data handle overhead for plotting that could be solved by caching.


@mgeplf
Collaborator

mgeplf commented Jul 5, 2024

An experiment to see if performance improves if destructive (write) cacheline sharing is avoided (cache lines are assumed to be 64bytes).

Do you have a sense of how often there is false sharing? I.e., how often are they written to?
I'm not familiar enough with the code, but rather than padding, most of the false-sharing issues I've seen can be fixed by giving each thread its own data area and doing only a single write to potentially shared cache lines once the work is done.

@nrnhines
Member Author

nrnhines commented Jul 5, 2024

Do you have a sense of how often there is false sharing? I.e., how often are they written to?

Not really. We've been circling around the 8.2 vs 9.0 performance issue #2787 for quite a while and have made a lot of progress (mostly caching pointers to doubles such as diam and area). With perf I got close to measuring false sharing but didn't get to the point where I could focus on a specific section of code. As you can see from this PR, the code changes for the padding experiment were quite simple and I'm delighted that the SoA permutation support allowed it to be carried out.

I'm not familiar enough with the code, but rather than padding, most of the false-sharing issues I've seen can be fixed by giving each thread its own data area and doing only a single write to potentially shared cache lines once the work is done.

You are exactly correct. Ironically, that was how it was done in the long-ago precursors to 8.2. In 9.0, threads are merely a permutation of the data into thread partitions, where (with this PR) each partition of each SoA variable begins on a cache-line boundary. It is actually quite beautiful to me to see the tremendous simplification of threads in 9.0 into a fairly trivial low-level permutation. The implementation of threads in 8.2 and earlier was a far-reaching and complex change involving a great deal of memory management and copying between representations.

At the moment, even with this PR, 8.2 has a slight performance edge over 9.0 on the #2787 model issue with gcc. That performance edge disappears for the Apple M1 and is much smaller with the intel compiler on x86_64.

The bottom line so far with this experiment is that destructive cache-line sharing does not appear to be a significant performance issue for #2787. But pinning that down with confidence requires a bit more testing.

@pramodk
Member

pramodk commented Jul 8, 2024

With perf I got close to measuring false sharing but didn't get to the point where I could focus on a specific section of code

With the perf c2c we discussed, did you see significant false sharing reported?

Just for reference, with h.tstop = 1000, the timings for this PR against master:

Compiler   Build type   Threads   Time (seconds)
gcc        master       8         11.45
gcc        master       8         11.48
gcc        padding      8         11.11
gcc        padding      8         11.08
intel      master       8          8.37
intel      master       8          8.39
intel      padding      8          8.34
intel      padding      8          8.41

@mgeplf
Collaborator

mgeplf commented Jul 8, 2024

With perf I got close to measuring false sharing but didn't get to the point where I could focus on a specific section of code.

Interesting, it's too bad more concrete proof didn't surface.

You are exactly correct. Ironically, that was how it was done in long ago precursors to 8.2. [...]

Also very interesting. It's certainly nice that it's relatively simple to implement with the updated system.

The bottom line so far with this experiment is that it appears that destructive cache line sharing is not a significant performance issue for #2787 .

and

[...] the timings for this PR against the master:

Whenever I see small perf changes like the above, I always remember the paper Producing Wrong Data Without Doing Anything Obviously Wrong!. It's old, but the takeaway is pretty stunning: there can be measurement bias simply due to link order or environment size. Their findings aren't necessarily applicable here; it's more that small perf improvements may have different root causes, and teasing out causal relationships is important.


4 participants