Provide a way to debug a context leak #4100

minwoox · 2022-02-18T15:10:24Z

TL;DR
We can store stacktrace of a RequestContext when it's pushed via RequestContext.push() and use the information to debug a context leak.

We use the RequestContext to store and convey the information of a request.
If a RequestContext is pushed into thread-local and a thread executes lines of code with the context in its thread-local,
we call that the lines of code are request-scoped.
Let's see the following example:

ServiceRequestContext fooCtx = ...; // fooCtx has the information of a request from a client.
...
try (SafeCloseable ignored = fooCtx.push()) {
    // The lines in this try-with-resources
    // block are request scoped
    // until the `fooCtx` is popped by the try-with-resources.
}

Because it's request-scoped, we can use the information of a request in the block.
Let's say that we setup to log message with the request ID using RequestContextExportingAppender:

try (SafeCloseable ignored = fooCtx.push()) {
    logger.trace("I'm tracing the flow") // This will print `RequestID=XXXX - I'm tracing the flow` to the console.
}

Because it prints the request ID, which is unique for each request, with the message, a user easily tracks the call flow of a request in an async server.

The RequestContext should be used with try-with-resources to avoid the context leak. Let's say it's not popped by mistake:

Executor executor = ...
executor.execute(() -> {
    SafeCloseable ignored = fooCtx.push();
    // Do some logic with fooCtx.
    ...
    // fooCtx should be popped here but it isn't
});

executor.execute(() -> {
    // This will also print `RequestID=XXXX - I am not a context of fooCtx.` because the fooCtx is not popped.
    logger.debug("I am not a context of fooCtx.")
});

In the above example, users get confused because the debug log also prints the ID of the fooCtx even though it's irrelevant.
We call that "Context leak".

In order to prevent it, we raise an IllegalStateException when another context is pushed while the thread has a context:

Executor executor = ...
executor.execute(() -> {
    SafeCloseable ignored = fooCtx.push();
    // Do some logic with fooCtx.
    ...
    // fooCtx should be popped here but it isn't
});

executor.execute(() -> {
    ServiceRequestContext barCtx = ...
    try (SafeCloseable ignored = barCtx.push()) { // This raises an IllegalStateException.
        ...
    }
});

So we can notice that the fooCtx is not popped after its use.
However, because the exception is raised when barCtx is pushed, we don't have any stacktrace of the fooCtx.
If we know when the fooCtx is pushed by looking at the stacktrace, we can easily fix the context leak.

Of course, we can scour through all lines of code to find the leak, but it's especially hard to find it when Armeria is used with third-party or in Kotlin.

So if we can make a RequestContext has the stacktrace when it's pushed, we can easily fix the context leak.

RequestContext stores the stacktrace when RequestContext.push() is called.
RequestContext removes the stored stacktrace when RequestContext.pop() is called.
A RequestContext can be pushed multiple times before popped, we need a stack to store the stacktrace.
When IllegalStateException is raised, it also prints the stacktrace of the current RequestContext in thread-local

~~Because generating stacktrace costs a lot, we can introduce different detection levels:~~
~~- DISABLED: Do not record stacktrace~~
~~- ADVANCED: Record only samples of RequestContexts~~
~~- PARANOID: Record all RequestContexts~~
~~as Netty does: https://netty.io/wiki/reference-counted-objects.html#leak-detection-levels~~

We can add a flag to enable this feature or setters to ServerBuilder and ClientBuilder.

Server.builder().recordContextTrace();
// or
Server.builder().recordContextTrace(ctx -> true); // To enable recording for a specific service.

Clients.builder().recordContextTrace();

The text was updated successfully, but these errors were encountered:

klurpicolo · 2022-03-30T15:11:33Z

Hi @minwoox, I'm interesting in this issue. Can I work on this issue?

After read through code, Here is my initial idea about implementation.

Create 2 new classes TraceAble(Service/Client)RequestContext that extend Default(Service/Client)RequestContext to hold the stack of stacktrace.
Override push() to do push stacktrace to stack before pass to super.push()
Add new configuration option recordContextTrace to ServerConfig and ClientOptions. Also add new method to service/client builder.
For server side, Add condition on HttpServerHandler to use TraceAbleServiceRequestContext instead of DefaultServiceRequestContext if recordContextTrace (Predicate<RequestContext>) returns true.
For client side, Add condition on UserClient to use TraceAbleClientRequestContext instead of DefaultClientRequestContext if recordContextTrace (Predicate<RequestContext>) returns true.

On point 4 and 5, I'm quite not sure if there is also another place that have add this logic. I check by tracing on where Default(Service/Client)RequestContext are used. If you have suggestion or preferred implementation, please let me know, Thanks. 🙏

minwoox · 2022-04-01T06:59:34Z

@klurpicolo Thanks a lot for your interest in this issue. 😄
Yeah, I believe you got the point and we can work on it as you explained.

On point 4 and 5, I'm quite not sure if there is also another place

We also need to add this logic to AbstractRequestContextBuilder so that users can also trace when the context is created via a builder. 😉

anuraaga · 2022-04-01T10:33:05Z

Ah I was looking into this a few months ago but forgot to continue after a preparation PR. I think you should be able to implement as an implementation of context storage without any other API changes, similar to this code that probably provides a lot of copy pasta to use

https://github.com/open-telemetry/opentelemetry-java/blob/main/context/src/main/java/io/opentelemetry/context/StrictContextStorage.java

klurpicolo · 2022-04-03T06:51:36Z

@minwoox Thanks. I'll start implementation and then pull request.

@anuraaga That looks pretty useful. Many Thanks! 🙏

minwoox · 2022-04-04T03:28:27Z

Thanks, @anuraaga for the link. I didn't know that otel had a similar implementation.

I think you should be able to implement as an implementation of context storage without any other API changes,

If we use the context storage via a JVM flag, we cannot enable the tracking only for a specific service context.
Isn't it better to add an API to track for a specific context only so that we can minimize the burden of creating stracktrace when a service has a problem?

anuraaga · 2022-04-04T03:48:06Z

Isn't it better to add an API to track for a specific context only so that we can minimize the burden of creating stracktrace when a service has a problem?

Stack traces are so expensive that even for a given service it's probably too much for production. Sampling is one approach, but I think in practice enabling the flag on a staging server tends to catch 99+% of leaks anyways, any it's easy for sampling to not help much if the problem is in a path with low testing coverage anyways (if tests hit it, staging server is enough). Instead of picking out possibly problematic services, often after an issue already made it to production, I believe it is a more robust approach to have a flag that users always enable on staging servers and generally keep disabled on production servers.

And in general, strongly recommending not requiring code change for detecting context leaks - any code change can have an even bigger problem than a context leak. A follow up to the above could be a flag to allow enabling stack traces based on some simple HTTP properties like path (perhaps the matcher can be code that is loaded by SPI, this is to avoid hard coupling like .recordContextTrace within server building itself). Armeria can add a flag to its contexts based on this to cause them to be tracked when pushed / popped - this I think alleviates the need for any TraceAbleServiceRequestContext too. The same storage implementation could probably used for this too - either check JVM flag for global recording, or see if something has been added to context for context-specific logging.

minwoox · 2022-04-04T07:27:20Z

That makes sense. 😄

@klurpicolo Let's use the storage as @anuraaga suggested.
We can create a new RequestContextStorage that tracks the stracktraces when users enable the flag here
https://github.com/line/armeria/blob/master/core/src/main/java/com/linecorp/armeria/internal/common/RequestContextUtil.java#L67-L115

klurpicolo · 2022-04-05T18:10:15Z

Ok, Noted.

@anuraaga

…4232) Motivation: - Context leaks are hard to find because an exception does not tell where/which context is pushed without poping. By using TraceAbleRequestContextStorage, it helps to report the source of context leaks. - Details as mentioned in #4100 By the way, Thanks to @anuraaga for giving a reference to read on [opentelemetry](https://github.com/open-telemetry/opentelemetry-java). Modifications: - Add `TraceAbleRequestContextStorage` that stores `RequestContext` stack trace and reports to the user where it happens. - Add `requestContextLeakDetectionSampler` flag that users can use for enable leak detection. Users can enable it by either system property or SPI flag provider. Result: - Closes #4100 - `TraceAbleRequestContextStorage` is added, so users can use it to report where context leaks happen. How to enable: 1) By system property `-Dcom.linecorp.armeria.requestContextLeakDetectionSampler=<sampler-spec>` 2) By providing FlagsProvider SPI ```java public final class EnableLeakDetectionFlagsProvider implements FlagsProvider { @OverRide public Sampler<? super RequestContext> requestContextLeakDetectionSampler() { return Sampler.always(); } ... } ``` 3) By providing RequestContextStorageProvider SPI (not recommend since RequestContextStorageProvider SPI'll be remove as mentioned in ##4211 ) ```java public final class CustomRequestContextStorageProvider implements RequestContextStorageProvider { @OverRide public RequestContextStorage newStorage() { return new TraceAbleRequestContextStorage(delegate); } } ``` Use case: Users problematic code ```java executor.execute(() -> { SafeCloseable leaked = fooCtx.push(); //This causes Request context leaks! ... }); executor.execute(() -> { try (SafeCloseable ignored = barCtx.push()) { //Exception happen here ... } }); ``` The above code will produce an error as below. Therefore, users can check the stack trace that which line causes context leaks. ``` java.lang.IllegalStateException: Trying to call object wrapped with context [%New RequestContext%], but context is currently set to TraceableServiceRequestContext[%Previous RequestContext%] com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage$PendingRequestContextStackTrace: At thread [armeria-testing-eventloop-nio-1-1] previous RequestContext is pushed at the following stacktrace at com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage$TraceableServiceRequestContext.<init>(LeakTracingRequestContextStorage.java:111) at com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage$TraceableServiceRequestContext.<init>(LeakTracingRequestContextStorage.java:105) at com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage.warpRequestContext(LeakTracingRequestContextStorage.java:82) at com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage.push(LeakTracingRequestContextStorage.java:62) at com.linecorp.armeria.internal.common.RequestContextUtil.getAndSet(RequestContextUtil.java:149) at com.linecorp.armeria.server.ServiceRequestContext.push(ServiceRequestContext.java:221) at com.linecorp.armeria.internal.common.TraceRequestContextLeakTest.lambda$singleThreadContextLeak$2(TraceRequestContextLeakTest.java:101) <- This is the line where leaked RequestContext is push ... . This means the callback was called from unexpected thread or forgetting to close previous context. at com.linecorp.armeria.internal.common.RequestContextUtil.newIllegalContextPushingException(RequestContextUtil.java:100) at com.linecorp.armeria.server.ServiceRequestContext.push(ServiceRequestContext.java:237) at com.linecorp.armeria.internal.common.TraceRequestContextLeakTest.lambda$singleThreadContextLeak$3(TraceRequestContextLeakTest.java:107) ... ```

@anuraaga

…ine#4232) Motivation: - Context leaks are hard to find because an exception does not tell where/which context is pushed without poping. By using TraceAbleRequestContextStorage, it helps to report the source of context leaks. - Details as mentioned in line#4100 By the way, Thanks to @anuraaga for giving a reference to read on [opentelemetry](https://github.com/open-telemetry/opentelemetry-java). Modifications: - Add `TraceAbleRequestContextStorage` that stores `RequestContext` stack trace and reports to the user where it happens. - Add `requestContextLeakDetectionSampler` flag that users can use for enable leak detection. Users can enable it by either system property or SPI flag provider. Result: - Closes line#4100 - `TraceAbleRequestContextStorage` is added, so users can use it to report where context leaks happen. How to enable: 1) By system property `-Dcom.linecorp.armeria.requestContextLeakDetectionSampler=<sampler-spec>` 2) By providing FlagsProvider SPI ```java public final class EnableLeakDetectionFlagsProvider implements FlagsProvider { @OverRide public Sampler<? super RequestContext> requestContextLeakDetectionSampler() { return Sampler.always(); } ... } ``` 3) By providing RequestContextStorageProvider SPI (not recommend since RequestContextStorageProvider SPI'll be remove as mentioned in #line#4211 ) ```java public final class CustomRequestContextStorageProvider implements RequestContextStorageProvider { @OverRide public RequestContextStorage newStorage() { return new TraceAbleRequestContextStorage(delegate); } } ``` Use case: Users problematic code ```java executor.execute(() -> { SafeCloseable leaked = fooCtx.push(); //This causes Request context leaks! ... }); executor.execute(() -> { try (SafeCloseable ignored = barCtx.push()) { //Exception happen here ... } }); ``` The above code will produce an error as below. Therefore, users can check the stack trace that which line causes context leaks. ``` java.lang.IllegalStateException: Trying to call object wrapped with context [%New RequestContext%], but context is currently set to TraceableServiceRequestContext[%Previous RequestContext%] com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage$PendingRequestContextStackTrace: At thread [armeria-testing-eventloop-nio-1-1] previous RequestContext is pushed at the following stacktrace at com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage$TraceableServiceRequestContext.<init>(LeakTracingRequestContextStorage.java:111) at com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage$TraceableServiceRequestContext.<init>(LeakTracingRequestContextStorage.java:105) at com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage.warpRequestContext(LeakTracingRequestContextStorage.java:82) at com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage.push(LeakTracingRequestContextStorage.java:62) at com.linecorp.armeria.internal.common.RequestContextUtil.getAndSet(RequestContextUtil.java:149) at com.linecorp.armeria.server.ServiceRequestContext.push(ServiceRequestContext.java:221) at com.linecorp.armeria.internal.common.TraceRequestContextLeakTest.lambda$singleThreadContextLeak$2(TraceRequestContextLeakTest.java:101) <- This is the line where leaked RequestContext is push ... . This means the callback was called from unexpected thread or forgetting to close previous context. at com.linecorp.armeria.internal.common.RequestContextUtil.newIllegalContextPushingException(RequestContextUtil.java:100) at com.linecorp.armeria.server.ServiceRequestContext.push(ServiceRequestContext.java:237) at com.linecorp.armeria.internal.common.TraceRequestContextLeakTest.lambda$singleThreadContextLeak$3(TraceRequestContextLeakTest.java:107) ... ```

minwoox added the new feature label Feb 18, 2022

klurpicolo mentioned this issue Apr 26, 2022

Add TraceAbleRequestContextStorage for ease of detect context leaks #4232

Merged

minwoox closed this as completed in #4232 Sep 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Provide a way to debug a context leak #4100

Provide a way to debug a context leak #4100

minwoox commented Feb 18, 2022 •

edited

Loading

klurpicolo commented Mar 30, 2022 •

edited

Loading

minwoox commented Apr 1, 2022

anuraaga commented Apr 1, 2022

klurpicolo commented Apr 3, 2022

minwoox commented Apr 4, 2022

anuraaga commented Apr 4, 2022

minwoox commented Apr 4, 2022

klurpicolo commented Apr 5, 2022

Provide a way to debug a context leak #4100

Provide a way to debug a context leak #4100

Comments

minwoox commented Feb 18, 2022 • edited Loading

klurpicolo commented Mar 30, 2022 • edited Loading

minwoox commented Apr 1, 2022

anuraaga commented Apr 1, 2022

klurpicolo commented Apr 3, 2022

minwoox commented Apr 4, 2022

anuraaga commented Apr 4, 2022

minwoox commented Apr 4, 2022

klurpicolo commented Apr 5, 2022

minwoox commented Feb 18, 2022 •

edited

Loading

klurpicolo commented Mar 30, 2022 •

edited

Loading