Skip to content

Conversation

kotharironak
Copy link
Contributor

Description

Java's UUID.randomUUID uses SecureRandom, which internally relies on the system's /dev/random. This method requires a sufficient amount of entropy to generate random numbers and may lead to blocking.

In the present context, we are focused on generating request IDs. Therefore, a faster UUID generator based on ThreadLocalRandom should be adequate.

For a similar use case, the OpenTelemetry system also utilizes a trace ID based on ThreadRandomLocal, serving as a reference.

As part of this PR, using fastRadom for request-context.

@kotharironak kotharironak requested a review from a team as a code owner August 31, 2023 16:26
@github-actions
Copy link

github-actions bot commented Aug 31, 2023

Test Results

70 tests  ±0   70 ✔️ ±0   16s ⏱️ -7s
12 suites ±0     0 💤 ±0 
12 files   ±0     0 ±0 

Results for commit ed3a0c5. ± Comparison against base commit cb3600a.

♻️ This comment has been updated with latest results.

return new RequestContext()
.put(RequestContextConstants.TENANT_ID_HEADER_KEY, tenantId)
.put(RequestContextConstants.REQUEST_ID_HEADER_KEY, UUID.randomUUID().toString());
.put(RequestContextConstants.REQUEST_ID_HEADER_KEY, UuidCreator.getRandomBasedFast().toString());
Copy link
Contributor Author

@kotharironak kotharironak Aug 31, 2023

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concerns here aren't the new impl but

  1. Adding a new dependency to a base lib like this impacts the dependency tree of every single service
  2. This is just covering up an upstream problem - that we're generating request contexts inappropriately - many times for a single "request".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a new dependency to a base lib like this impacts the dependency tree of every single service

I can extract out the function from this library based on ThreadLocalRandom.

This is just covering up an upstream problem - that we're generating request contexts inappropriately - many times for a single "request".

I didn't fully get this. Can you elaborate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't fully get this. Can you elaborate?

The reason this came up - at least the reason I'm aware of - as an issue is because ingester is generating many request contexts for a single trace ingestion. Each time it needs a context, it generates a new one. Because this now occurs across many enrichers and many threads in parallel, we're running into these performance problems. But not only is that unnecessary, it's actually wrong - we want the same ID to be shared for ingesting the same trace in the first place.

A couple different approaches there would be to either

  1. update the enricher interface to accept the context, and generate it up front before we delegate the enrichment off to an async thread.
  2. Create a request context from the trace details - use the trace id as the "request id". This would probably be defined in the ingestion pipeline repo.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The request-id attribute is attached to the grpc-context while making a downstream call. This also seems to be getting added only for RXJava calls, if someone is not using RXJava, this request-id is not attached to the call.
How is this request-id used by the downstream services? Given that its attached only for RXJava based calls and not for all GRPC calls, do we even need to attach this request-id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason this came up - at least the reason I'm aware of - as an issue is because ingester is generating many request contexts for a single trace ingestion. Each time it needs a context, it generates a new one. Because this now occurs across many enrichers and many threads in parallel, we're running into these performance problems. But not only is that unnecessary, it's actually wrong - we want the same ID to be shared for ingesting the same trace in the first place.

A couple different approaches there would be to either

update the enricher interface to accept the context, and generate it up front before we delegate the enrichment off to an async thread.
Create a request context from the trace details - use the trace id as the "request id". This would probably be defined in the ingestion pipeline repo.

I agree that for tracing the processing of StructureTrace, it's important to establish a context based on unique keys or attributes that can be used to generate a trace ID. And if request_id is primarily used internally to link outgoing requests to the same trace, we can consider option (2) having new request context based trace.

However, for any other consumers of this interface forTenantId, do we really need a request_id generated from a secure random source? Since this is primarily for tracing purposes and not for resource identification, I think it's more practical to use a fast UUID-based request_id generated with ThreadLocalRandom.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can extract out the function from this library based on ThreadLocalRandom.

If we don't plan to use the above library here. I am thinking of having our own function like below:

public static UUID generateFastRandomUUID() {
    long mostSigBits = ThreadLocalRandom.current().nextLong();
    long leastSigBits = ThreadLocalRandom.current().nextLong();
    
    // Set the version (4) For random UUID 
    mostSigBits &= 0xFFFFFFFFFFFF0FFFL;
    mostSigBits |= 0x0000000000004000L;
    // Set variant to RFC 4122
    leastSigBits &= 0x3FFFFFFFFFFFFFFFL;  
    leastSigBits |= 0x8000000000000000L; 

    return new UUID(mostSigBits, leastSigBits);
}

Copy link

@laxmanchekka laxmanchekka Sep 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few stacks showing up the contention at jdk level

@codecov
Copy link

codecov bot commented Aug 31, 2023

Codecov Report

Merging #49 (ed3a0c5) into main (cb3600a) will increase coverage by 0.30%.
The diff coverage is 90.00%.

@@             Coverage Diff              @@
##               main      #49      +/-   ##
============================================
+ Coverage     74.57%   74.88%   +0.30%     
- Complexity      145      146       +1     
============================================
  Files            20       21       +1     
  Lines           409      418       +9     
  Branches         22       22              
============================================
+ Hits            305      313       +8     
- Misses           85       86       +1     
  Partials         19       19              
Flag Coverage Δ
unit 74.88% <90.00%> (+0.30%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
...race/core/grpcutils/context/FastUUIDGenerator.java 87.50% <87.50%> (ø)
...ertrace/core/grpcutils/context/RequestContext.java 75.00% <100.00%> (+0.28%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@aaron-steinfeld
Copy link
Contributor

The request-id attribute is attached to the grpc-context while making a downstream call. This also seems to be getting added only for RXJava calls, if someone is not using RXJava, this request-id is not attached to the call.
How is this request-id used by the downstream services? Given that its attached only for RXJava based calls and not for all GRPC calls, do we even need to attach this request-id?

This is unrelated to rxjava. RequestContext and everything it contains (tenant id, request id, auth context where applicable) is used by all of our grpc calls (a handful of which are themselves managed by rxjava). This request ID is used to correlate errors and logs across services.

I agree that for tracing the processing of StructureTrace, it's important to establish a context based on unique keys or attributes that can be used to generate a trace ID. And if request_id is primarily used internally to link outgoing requests to the same trace, we can consider option (2) having new request context based trace.

The unique key is just the trace id itself though, right? Either way I think it would make sense to update the enricher interface so however we generate the request context is immaterial to the enrichers, it's just a question of level of effort. If that's painful, then having a locally consistent version that does something like buildRequestContextForTrace(structuredTrace) seems reasonable, since that's what we should be doing at the beginning of the enricher regardless.

However, for any other consumers of this interface forTenantId, do we really need a request_id generated from a secure random source? Since this is primarily for tracing purposes and not for resource identification, I think it's more practical to use a fast UUID-based request_id generated with ThreadLocalRandom.

So here is where I think we may be getting too far in the weeds. I agree, it doesn't need to be secure and am certainly glad to have a thread local implementation (yours looks fine) - but I think it becomes an unnecessary optimization if we fix the request context generation itself. We're fixing a symptom of an upstream bug rather than the bug itself.

Few stacks showing up the contention at jdk level

Right, this is the one coming from the context generation being inside the async threads rather than up front. Edited to remove internal details.

@laxmanchekka
Copy link

laxmanchekka commented Sep 1, 2023

if we fix the request context generation itself

GRPC request context generation and implementation is the responsibility of the grpc framework itself. Why do we want to push this responsibility down to the grpc client applications and leak/duplicate this code everywhere.

While I agree that trace enricher refactoring is needed for several other good reasons. Definitely not worth for this one.

@laxmanchekka
Copy link

laxmanchekka commented Sep 1, 2023

@kotharironak : I vote for a simple util based implementation without a new third party dependency for this alone.

@aaron-steinfeld
Copy link
Contributor

GRPC request context generation and implementation is the responsibility of the grpc framework itself. Why do we want to push this responsibility down to the grpc client applications and leak/duplicate this code everywhere.

I would still leave it there, it's just a new constructor allowing callers to bring their own ID when that makes sense.

  public static RequestContext forTenantId(String tenantId) {
    return RequestContext.forTenantAndRequestIds(tenantId, UUID.randomUUID().toString()); // This would be the added api
  }

There's no real duplicate code, as this only impacts clients in that they're adding the extra arg from data they already have where applicable. And that's unknowable data to the framework.

While I agree that trace enricher refactoring is needed for several other good reasons. Definitely not worth for this one.

We can choose how major of a refactor we want this to be. The interface can be left as is or updated, but we can't fix the actual bug here without changing the duplicated RequestContext.forTenantId calls. We could fix the thread contention with the secure random change, but not the broken correlation. IMO, the minimum/maximum effort here isn't really that significant - on the scale of hours.

@mihirgt
Copy link

mihirgt commented Sep 1, 2023

Given that UUID.randomUUID takes a lock and causes thread contention, we should replace it with a a ThreadLocalRandom based implementation. This will help in general.
In follow up PRs we can fix the callers to fix calling context in trace enricher.

@aaron-steinfeld
Copy link
Contributor

Given that UUID.randomUUID takes a lock and causes thread contention, we should replace it with a a ThreadLocalRandom based implementation. This will help in general. In follow up PRs we can fix the callers to fix calling context in trace enricher.

I just don't want us to lose sight of the root cause here, but otherwise no issues with that.

@kotharironak
Copy link
Contributor Author

Given that UUID.randomUUID takes a lock and causes thread contention, we should replace it with a a ThreadLocalRandom based implementation. This will help in general. In follow up PRs we can fix the callers to fix calling context in trace enricher.

I just don't want us to lose sight of the root cause here, but otherwise no issues with that.

@laxmanchekka @mihirgt @aaron-steinfeld
I have added the static function, and removes the usage of third-party lib.
I will work on separate PR for context change for enricher.

* the default randomUUID method that relies on /dev/random. It's suitable for most random UUID
* needs.
*/
public static UUID generateFastRandomUUID() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static UUID generateFastRandomUUID() {
public static UUID randomUUID() {

import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

public class UuidGenerator {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public class UuidGenerator {
public class FastUUIDGenerator {

laxmanchekka
laxmanchekka previously approved these changes Sep 4, 2023
@kotharironak kotharironak merged commit c303dc0 into main Sep 4, 2023
@kotharironak kotharironak deleted the fast-random-uuid branch September 4, 2023 10:38
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

public class FastUUIDGenerator {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be public? Doesn't need a new release, but next time you're in here can you move this to package private since it's an internal impl detail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially thought of it to our own lib so it can be used. But, I think, we can make it private, anywhere else we can use the third-party library that is already there in most of the projects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants