Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide a way to debug a context leak #4100

Closed
minwoox opened this issue Feb 18, 2022 · 8 comments · Fixed by #4232
Closed

Provide a way to debug a context leak #4100

minwoox opened this issue Feb 18, 2022 · 8 comments · Fixed by #4232

Comments

@minwoox
Copy link
Member

minwoox commented Feb 18, 2022

TL;DR
We can store stacktrace of a RequestContext when it's pushed via RequestContext.push() and use the information to debug a context leak.

We use the RequestContext to store and convey the information of a request.
If a RequestContext is pushed into thread-local and a thread executes lines of code with the context in its thread-local,
we call that the lines of code are request-scoped.
Let's see the following example:

ServiceRequestContext fooCtx = ...; // fooCtx has the information of a request from a client.
...
try (SafeCloseable ignored = fooCtx.push()) {
    // The lines in this try-with-resources
    // block are request scoped
    // until the `fooCtx` is popped by the try-with-resources.
}

Because it's request-scoped, we can use the information of a request in the block.
Let's say that we setup to log message with the request ID using RequestContextExportingAppender:

try (SafeCloseable ignored = fooCtx.push()) {
    logger.trace("I'm tracing the flow") // This will print `RequestID=XXXX - I'm tracing the flow` to the console.
}

Because it prints the request ID, which is unique for each request, with the message, a user easily tracks the call flow of a request in an async server.

The RequestContext should be used with try-with-resources to avoid the context leak. Let's say it's not popped by mistake:

Executor executor = ...
executor.execute(() -> {
    SafeCloseable ignored = fooCtx.push();
    // Do some logic with fooCtx.
    ...
    // fooCtx should be popped here but it isn't
});

executor.execute(() -> {
    // This will also print `RequestID=XXXX - I am not a context of fooCtx.` because the fooCtx is not popped.
    logger.debug("I am not a context of fooCtx.")
});

In the above example, users get confused because the debug log also prints the ID of the fooCtx even though it's irrelevant.
We call that "Context leak".

In order to prevent it, we raise an IllegalStateException when another context is pushed while the thread has a context:

Executor executor = ...
executor.execute(() -> {
    SafeCloseable ignored = fooCtx.push();
    // Do some logic with fooCtx.
    ...
    // fooCtx should be popped here but it isn't
});

executor.execute(() -> {
    ServiceRequestContext barCtx = ...
    try (SafeCloseable ignored = barCtx.push()) { // This raises an IllegalStateException.
        ...
    }
});

So we can notice that the fooCtx is not popped after its use.
However, because the exception is raised when barCtx is pushed, we don't have any stacktrace of the fooCtx.
If we know when the fooCtx is pushed by looking at the stacktrace, we can easily fix the context leak.

Of course, we can scour through all lines of code to find the leak, but it's especially hard to find it when Armeria is used with third-party or in Kotlin.

So if we can make a RequestContext has the stacktrace when it's pushed, we can easily fix the context leak.

  • RequestContext stores the stacktrace when RequestContext.push() is called.
  • RequestContext removes the stored stacktrace when RequestContext.pop() is called.
  • A RequestContext can be pushed multiple times before popped, we need a stack to store the stacktrace.
  • When IllegalStateException is raised, it also prints the stacktrace of the current RequestContext in thread-local

Because generating stacktrace costs a lot, we can introduce different detection levels:
- DISABLED: Do not record stacktrace
- ADVANCED: Record only samples of RequestContexts
- PARANOID: Record all RequestContexts
as Netty does: https://netty.io/wiki/reference-counted-objects.html#leak-detection-levels

We can add a flag to enable this feature or setters to ServerBuilder and ClientBuilder.

Server.builder().recordContextTrace();
// or
Server.builder().recordContextTrace(ctx -> true); // To enable recording for a specific service.

Clients.builder().recordContextTrace();
@klurpicolo
Copy link
Contributor

klurpicolo commented Mar 30, 2022

Hi @minwoox, I'm interesting in this issue. Can I work on this issue?

After read through code, Here is my initial idea about implementation.

  1. Create 2 new classes TraceAble(Service/Client)RequestContext that extend Default(Service/Client)RequestContext to hold the stack of stacktrace.
  2. Override push() to do push stacktrace to stack before pass to super.push()
  3. Add new configuration option recordContextTrace to ServerConfig and ClientOptions. Also add new method to service/client builder.
  4. For server side, Add condition on HttpServerHandler to use TraceAbleServiceRequestContext instead of DefaultServiceRequestContext if recordContextTrace (Predicate<RequestContext>) returns true.
  5. For client side, Add condition on UserClient to use TraceAbleClientRequestContext instead of DefaultClientRequestContext if recordContextTrace (Predicate<RequestContext>) returns true.

On point 4 and 5, I'm quite not sure if there is also another place that have add this logic. I check by tracing on where Default(Service/Client)RequestContext are used. If you have suggestion or preferred implementation, please let me know, Thanks. 🙏

@minwoox
Copy link
Member Author

minwoox commented Apr 1, 2022

@klurpicolo Thanks a lot for your interest in this issue. 😄
Yeah, I believe you got the point and we can work on it as you explained.

On point 4 and 5, I'm quite not sure if there is also another place

We also need to add this logic to AbstractRequestContextBuilder so that users can also trace when the context is created via a builder. 😉

@anuraaga
Copy link
Collaborator

anuraaga commented Apr 1, 2022

Ah I was looking into this a few months ago but forgot to continue after a preparation PR. I think you should be able to implement as an implementation of context storage without any other API changes, similar to this code that probably provides a lot of copy pasta to use

https://github.com/open-telemetry/opentelemetry-java/blob/main/context/src/main/java/io/opentelemetry/context/StrictContextStorage.java

@klurpicolo
Copy link
Contributor

@minwoox Thanks. I'll start implementation and then pull request.

@anuraaga That looks pretty useful. Many Thanks! 🙏

@minwoox
Copy link
Member Author

minwoox commented Apr 4, 2022

Thanks, @anuraaga for the link. I didn't know that otel had a similar implementation.

I think you should be able to implement as an implementation of context storage without any other API changes,

If we use the context storage via a JVM flag, we cannot enable the tracking only for a specific service context.
Isn't it better to add an API to track for a specific context only so that we can minimize the burden of creating stracktrace when a service has a problem?

@anuraaga
Copy link
Collaborator

anuraaga commented Apr 4, 2022

Isn't it better to add an API to track for a specific context only so that we can minimize the burden of creating stracktrace when a service has a problem?

Stack traces are so expensive that even for a given service it's probably too much for production. Sampling is one approach, but I think in practice enabling the flag on a staging server tends to catch 99+% of leaks anyways, any it's easy for sampling to not help much if the problem is in a path with low testing coverage anyways (if tests hit it, staging server is enough). Instead of picking out possibly problematic services, often after an issue already made it to production, I believe it is a more robust approach to have a flag that users always enable on staging servers and generally keep disabled on production servers.

And in general, strongly recommending not requiring code change for detecting context leaks - any code change can have an even bigger problem than a context leak. A follow up to the above could be a flag to allow enabling stack traces based on some simple HTTP properties like path (perhaps the matcher can be code that is loaded by SPI, this is to avoid hard coupling like .recordContextTrace within server building itself). Armeria can add a flag to its contexts based on this to cause them to be tracked when pushed / popped - this I think alleviates the need for any TraceAbleServiceRequestContext too. The same storage implementation could probably used for this too - either check JVM flag for global recording, or see if something has been added to context for context-specific logging.

@minwoox
Copy link
Member Author

minwoox commented Apr 4, 2022

That makes sense. 😄

@klurpicolo Let's use the storage as @anuraaga suggested.
We can create a new RequestContextStorage that tracks the stracktraces when users enable the flag here
https://github.com/line/armeria/blob/master/core/src/main/java/com/linecorp/armeria/internal/common/RequestContextUtil.java#L67-L115

@klurpicolo
Copy link
Contributor

Ok, Noted.

minwoox pushed a commit that referenced this issue Sep 5, 2022
…4232)

Motivation:

- Context leaks are hard to find because an exception does not tell where/which context is pushed without poping. By using TraceAbleRequestContextStorage, it helps to report the source of context leaks.
- Details as mentioned in #4100 

By the way, Thanks to @anuraaga for giving a reference to read on [opentelemetry](https://github.com/open-telemetry/opentelemetry-java).

Modifications:

- Add `TraceAbleRequestContextStorage` that stores `RequestContext` stack trace and reports to the user where it happens.
- Add `requestContextLeakDetectionSampler` flag that users can use for enable leak detection. Users can enable it by either system property or SPI flag provider.

Result:

- Closes #4100 
- `TraceAbleRequestContextStorage` is added, so users can use it to report where context leaks happen.

How to enable:
1) By system property
`-Dcom.linecorp.armeria.requestContextLeakDetectionSampler=<sampler-spec>`

2) By providing FlagsProvider SPI
```java
public final class EnableLeakDetectionFlagsProvider implements FlagsProvider {

    @OverRide
    public Sampler<? super RequestContext> requestContextLeakDetectionSampler() {
        return Sampler.always();
    }
    ...
}
```

3) By providing RequestContextStorageProvider SPI (not recommend since RequestContextStorageProvider SPI'll be remove as mentioned in ##4211 )
```java
public final class CustomRequestContextStorageProvider implements RequestContextStorageProvider {

    @OverRide
    public RequestContextStorage newStorage() {
        return new TraceAbleRequestContextStorage(delegate);
    }
}
```
Use case:
Users problematic code
```java
executor.execute(() -> {
    SafeCloseable leaked = fooCtx.push(); //This causes Request context leaks!
    ...
});

executor.execute(() -> {
    try (SafeCloseable ignored = barCtx.push()) { //Exception happen here
        ...
    }
});

```
The above code will produce an error as below. Therefore, users can check the stack trace that which line causes context leaks.
```
java.lang.IllegalStateException: Trying to call object wrapped with context [%New RequestContext%], but context is currently set to TraceableServiceRequestContext[%Previous RequestContext%]
com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage$PendingRequestContextStackTrace: At thread [armeria-testing-eventloop-nio-1-1] previous RequestContext is pushed at the following stacktrace
	at com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage$TraceableServiceRequestContext.<init>(LeakTracingRequestContextStorage.java:111)
	at com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage$TraceableServiceRequestContext.<init>(LeakTracingRequestContextStorage.java:105)
	at com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage.warpRequestContext(LeakTracingRequestContextStorage.java:82)
	at com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage.push(LeakTracingRequestContextStorage.java:62)
	at com.linecorp.armeria.internal.common.RequestContextUtil.getAndSet(RequestContextUtil.java:149)
	at com.linecorp.armeria.server.ServiceRequestContext.push(ServiceRequestContext.java:221)
	at com.linecorp.armeria.internal.common.TraceRequestContextLeakTest.lambda$singleThreadContextLeak$2(TraceRequestContextLeakTest.java:101)  <- This is the line where leaked RequestContext is push 
	...
. This means the callback was called from unexpected thread or forgetting to close previous context.
	at com.linecorp.armeria.internal.common.RequestContextUtil.newIllegalContextPushingException(RequestContextUtil.java:100)
	at com.linecorp.armeria.server.ServiceRequestContext.push(ServiceRequestContext.java:237)
	at com.linecorp.armeria.internal.common.TraceRequestContextLeakTest.lambda$singleThreadContextLeak$3(TraceRequestContextLeakTest.java:107)
	...
```
heowc pushed a commit to heowc/armeria that referenced this issue Sep 24, 2022
…ine#4232)

Motivation:

- Context leaks are hard to find because an exception does not tell where/which context is pushed without poping. By using TraceAbleRequestContextStorage, it helps to report the source of context leaks.
- Details as mentioned in line#4100 

By the way, Thanks to @anuraaga for giving a reference to read on [opentelemetry](https://github.com/open-telemetry/opentelemetry-java).

Modifications:

- Add `TraceAbleRequestContextStorage` that stores `RequestContext` stack trace and reports to the user where it happens.
- Add `requestContextLeakDetectionSampler` flag that users can use for enable leak detection. Users can enable it by either system property or SPI flag provider.

Result:

- Closes line#4100 
- `TraceAbleRequestContextStorage` is added, so users can use it to report where context leaks happen.

How to enable:
1) By system property
`-Dcom.linecorp.armeria.requestContextLeakDetectionSampler=<sampler-spec>`

2) By providing FlagsProvider SPI
```java
public final class EnableLeakDetectionFlagsProvider implements FlagsProvider {

    @OverRide
    public Sampler<? super RequestContext> requestContextLeakDetectionSampler() {
        return Sampler.always();
    }
    ...
}
```

3) By providing RequestContextStorageProvider SPI (not recommend since RequestContextStorageProvider SPI'll be remove as mentioned in #line#4211 )
```java
public final class CustomRequestContextStorageProvider implements RequestContextStorageProvider {

    @OverRide
    public RequestContextStorage newStorage() {
        return new TraceAbleRequestContextStorage(delegate);
    }
}
```
Use case:
Users problematic code
```java
executor.execute(() -> {
    SafeCloseable leaked = fooCtx.push(); //This causes Request context leaks!
    ...
});

executor.execute(() -> {
    try (SafeCloseable ignored = barCtx.push()) { //Exception happen here
        ...
    }
});

```
The above code will produce an error as below. Therefore, users can check the stack trace that which line causes context leaks.
```
java.lang.IllegalStateException: Trying to call object wrapped with context [%New RequestContext%], but context is currently set to TraceableServiceRequestContext[%Previous RequestContext%]
com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage$PendingRequestContextStackTrace: At thread [armeria-testing-eventloop-nio-1-1] previous RequestContext is pushed at the following stacktrace
	at com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage$TraceableServiceRequestContext.<init>(LeakTracingRequestContextStorage.java:111)
	at com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage$TraceableServiceRequestContext.<init>(LeakTracingRequestContextStorage.java:105)
	at com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage.warpRequestContext(LeakTracingRequestContextStorage.java:82)
	at com.linecorp.armeria.internal.common.LeakTracingRequestContextStorage.push(LeakTracingRequestContextStorage.java:62)
	at com.linecorp.armeria.internal.common.RequestContextUtil.getAndSet(RequestContextUtil.java:149)
	at com.linecorp.armeria.server.ServiceRequestContext.push(ServiceRequestContext.java:221)
	at com.linecorp.armeria.internal.common.TraceRequestContextLeakTest.lambda$singleThreadContextLeak$2(TraceRequestContextLeakTest.java:101)  <- This is the line where leaked RequestContext is push 
	...
. This means the callback was called from unexpected thread or forgetting to close previous context.
	at com.linecorp.armeria.internal.common.RequestContextUtil.newIllegalContextPushingException(RequestContextUtil.java:100)
	at com.linecorp.armeria.server.ServiceRequestContext.push(ServiceRequestContext.java:237)
	at com.linecorp.armeria.internal.common.TraceRequestContextLeakTest.lambda$singleThreadContextLeak$3(TraceRequestContextLeakTest.java:107)
	...
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants