Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[compiler-rt][ctx_profile] Add the instrumented contextual profiling APIs #89838

Merged
merged 10 commits into from
May 7, 2024

Conversation

mtrofin
Copy link
Member

@mtrofin mtrofin commented Apr 23, 2024

APIs for contextual profiling. ContextNode is the call context-specific counter buffer. ContextRoot is associated to those functions that constitute roots into interesting call graphs, and is the object on which we hang off Arenas for allocating ContextNodes, as well as the ContextNode corresponding to such functions. Graphs of ContextNodes are accessible by one thread at a time.

(Tracking Issue: #89287, more details in the RFC referenced there)

…APIs

APIs for contextual profiling.

(Tracking Issue: llvm#89287, RFC referenced there)
Copy link

github-actions bot commented Apr 30, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

compiler-rt/lib/ctx_profile/CtxInstrProfiling.h Outdated Show resolved Hide resolved
}

uint64_t *counters() {
ContextNode *addr_after = &(this[1]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment here and in the subContexts method and maybe in the class header about the expected layout of the counters and subContexts, since they aren't data members?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

};

/// ContextRoots are allocated by LLVM for entrypoints. LLVM is only concerned
/// with allocating and zero-initializing the global value for it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "the global value for it"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as in GlobalValue. fixed.

//
/// TLS where LLVM stores the pointer of the called value, as part of lowering a
/// llvm.instrprof.callsite
extern __thread void *volatile __llvm_ctx_profile_expected_callee[2];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are there 2 values for the expected callee? Ditto for the __llvm_ctx_profile_callsite below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It says in the first comment under extern "C" - position 1 is for scratch. It's so we don't pollute the values for normal collection when signal handlers happen. Normally we'd implement a stack - but we don't need to because we don't care about anything that happens when we go to sctach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only see index [0] being accessed for both __llvm_ctx_profile_expected_callee and __llvm_ctx_profile_callsite, other than in the tests. When does index [1] get used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the llvm side. See PR #90821. The runtime never looks at index 1, but having the 2 indices simplifies the lowering logic: LLVM determines the index upon getting the value from here (so in the function entry basic block) and then just writes at that index for each callsite without needing to test every time if it's scratch.

/// Used to obtain the profile. The Writer is called for each root ContextNode,
/// with the ContextRoot::Taken taken. The Writer is responsible for traversing
/// the structure underneath.
/// The Writer's first parameter is an object it knows about, which is what the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean by "an object it knows about"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it's the closure... rephrased.

TEST_F(ContextTest, ScratchNoCollection) {
EXPECT_EQ(__llvm_ctx_profile_current_context_root, nullptr);
int OpaqueValue = 0;
// this would be the very first function executing this. the TLS is empty,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify - I assume it is scratch because start context hasn't been called?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. clarified.

__llvm_ctx_profile_expected_callee[0] = &OpaqueValue;
__llvm_ctx_profile_callsite[0] = &Ctx->subContexts()[2];
auto *Subctx = __llvm_ctx_profile_get_context(&OtherOpaqueValue, 2, 3, 1);
EXPECT_TRUE(isScratch(Subctx));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment about why

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess because you passed a different pointer to get_context, but it would be good to add a comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. ptal.

/// called for any other function than entry points, in the entry BB of such
/// function. Same consideration about LSB of returned value as .._start_context
__ctx_profile::ContextNode *
__llvm_ctx_profile_get_context(void *Callee, __ctx_profile::GUID Guid,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What type is Callee in practice?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a function address. So void * is appropriate.

int OpaqueValue = 0;
const bool IsScratch = isScratch(Ctx);
EXPECT_FALSE(IsScratch);
__llvm_ctx_profile_expected_callee[0] = &OpaqueValue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OpaqueValues in these tests are stand ins for callee values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. should I rename - maybe "FakeCallee"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah FakeCallee would be more self explanatory

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

__llvm_ctx_profile_callsite[1] = &Subctx->subContexts()[0];

auto *Subctx2 = __llvm_ctx_profile_get_context(&ThirdOpaqueValue, 3, 0, 0);
EXPECT_TRUE(isScratch(Subctx2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto about adding comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

///
/// The layout is as follows:
///
/// [[statically-declared fields][counters vector][subcontexts vector]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think interpreting a subcontext ContextNode as part of the same block of memory immediately following counters of width uint64_t is undefined behaviour. In this case I think we get lucky because the first element of the ContextNode struct is a GUID (uint64_t) and thus the alignment of the ContextNode struct and the alignment of each individual counter matches.

I am far from an expert on undefined behaviour and my understanding is based on reading (5) in https://en.cppreference.com/w/cpp/language/reinterpret_cast and https://en.cppreference.com/w/c/language/object. Please feel free to correct me if you think otherwise.

I think the motivation is that the single chunk based layout simplifies the instrumentation GEP computations. A few questions:

  1. Can we test the ContextNode runtime implementation with sanitizers to expose any alignment issues?
  2. Have you considered an alternative design which does not make it easy to run into such issues?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow. The subcontext vector is of pointers to subcontexts. Maybe I should clarify that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That being said - the rest isn't by accident, but we can statically check that the alignment checks out (added). Basically:

  • the Arenas are 8-aligned because that's how __sanitizer::InternalAlloc returns by default, but added the explicit flag for clarity
  • the memory we start bump-allocating in the Arena is 8-aligned by construction - added a static_assert for maintainability
  • the choice of types for ContextNode's header is explicit (to maintain the invariant) - and added a static assert.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, then this does not apply since we are reinterpreting 64-bit values as pointers. Yes, enhancing the documentation to clarify would help. I still think it's a good idea to run (some of) the unit tests through ubsan.

// The counters vector starts right after the static header.
uint64_t *counters() {
ContextNode *addr_after = &(this[1]);
return reinterpret_cast<uint64_t *>(reinterpret_cast<char *>(addr_after));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the additional reinterpret_cast to char* required?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not. Fixed.


// utility to taint a pointer by setting the LSB. There is an assumption
// throughout that the addresses of contexts are even (really, they should be
// align(8), but "even"-ness is the minimum assumption)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the align(8) nature is load bearing, see my other comment about reinterpreting the address immediately following the counters array.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and __sanitizer::InternalAlloc align-8s if you pass 0 alignment (the default). We can force that though, for clarity.

///
/// The layout is as follows:
///
/// [[statically-declared fields][counters vector][subcontexts vector]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, then this does not apply since we are reinterpreting 64-bit values as pointers. Yes, enhancing the documentation to clarify would help. I still think it's a good idea to run (some of) the unit tests through ubsan.

compiler-rt/lib/ctx_profile/CtxInstrProfiling.h Outdated Show resolved Hide resolved
Comment on lines +144 to +145
for (uint32_t I = 0; I < NrCounters; ++I)
counters()[I] = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just use memset since you know the begin and end?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__sanitizer::internal_memset is the best we can do. I'm not super-sure we really want that though (would be happy with std::memset) - because see FIXME here about removing more of the deps to sanitizer_common - not that this would be too hard to fix; but also don't we have a pass in LLVM that would effectivelly recognize this as memset and replace with the intrinsic? I.e. may not achieve much.

Left a FIXME to do std::memset though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I don't think the compiler will be able to reason through the counters() implementation to be able to do this automatically though.

Copy link
Contributor

@teresajohnson teresajohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm but wait for snehasish to approve as well

Copy link
Contributor

@snehasish snehasish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm with some minor comments.

Comment on lines +144 to +145
for (uint32_t I = 0; I < NrCounters; ++I)
counters()[I] = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I don't think the compiler will be able to reason through the counters() implementation to be able to do this automatically though.


// Verify maintenance to ContextNode doesn't change this invariant, which makes
// sure the inlined vectors are appropriately aligned.
static_assert(sizeof(ContextNode) % Alignment == 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be alignof(ContextNode). See the example in https://en.cppreference.com/w/c/language/object for structures.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, fixed here and in a few more places.

// Note that this is node-by-node aggregation, i.e. summing counters of nodes
// at the same position in the graph, not flattening.
// Threads that cannot lock Taken (fail TryLock) are given a "scratch context"
// - a buffer they can clobber, safely from a memory access perspective.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So then the data written to the scratch context is just dropped?

Copy link
Member Author

@mtrofin mtrofin May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - and expanded in the comment.

The design allows for that not be the case - because "scratch"-ness is first and foremost about not trying to build subcontexts, and is captured by tainting the pointer value (pointer to the memory treated as context), but right now, we drop that info.

// regardless.
::__sanitizer::StaticSpinMutex Taken;

// Avoid surprises due to (unlikely) StaticSpinMutex changes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more descriptive message of why we care about the size here is preferred.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

EXPECT_NE(Ctx, nullptr);
EXPECT_NE(Root.CurrentMem, nullptr);
EXPECT_EQ(Root.FirstMemBlock, Root.CurrentMem);
EXPECT_EQ(Ctx->size(), sizeof(ContextNode) + 10 * sizeof(uint64_t) +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ctx->size() will segfault even if the expectation on L35 fails since that does not fail the test. L35 should be an ASSERT.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

auto *Root = AllContextRoots[I];
__sanitizer::GenericScopedLock<__sanitizer::StaticSpinMutex> TakenLock(
&Root->Taken);
if (!validate(Root)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to validate on all builds? It might be problematic for server apps which expect profile dumping to be quick.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's quick right now, we can relax later, but this wasn't a problem so far.

@mtrofin mtrofin merged commit ccf765c into llvm:main May 7, 2024
4 checks passed
mtrofin added a commit that referenced this pull request May 8, 2024
Follow-up from PR ##89838. Some build bots warn-as-error
about signed/unsigned comparison in CtxInstrProfilingTest.

Example: https://lab.llvm.org/buildbot/#/builders/37/builds/34610
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants