Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Context mixup when using wreck library to make downstream request #1479

Closed
maxmil7 opened this issue Aug 30, 2020 · 30 comments · Fixed by #1546
Closed

Context mixup when using wreck library to make downstream request #1479

maxmil7 opened this issue Aug 30, 2020 · 30 comments · Fixed by #1546
Assignees
Labels
Discussion Issue or PR that needs/is extended discussion. feature-request

Comments

@maxmil7
Copy link

maxmil7 commented Aug 30, 2020

What version of OpenTelemetry are you using?

    "@opentelemetry/node": "^0.10.0",
    "@opentelemetry/plugin-http": "^0.10.0",
    "@opentelemetry/plugin-https": "^0.10.0",
    "@opentelemetry/tracing": "^0.10.1"

What version of Node are you using?

v12.17.0

What did you do?

We have a legacy service invocation library which uses wreck@v12 to invoke downstream services.

This library is used by express middlewares to make service calls to downstream services when an application receives a request. The application is based on ExpressJS

I tried to integrate the application with Opentelemetry so we can deprecate our legacy service instrumentation with Opentelemetry auto instrumentation

What did you expect to see?

I was hoping that all the outgoing http requests would be auto instrumented and the parent-child relationship between the spans created for the service requests would be correct

What did you see instead?

I noticed that for some service calls the parent-child relationship wasn't correct. Specifically, for the following flow:

App receives request -> Middleware calls service A -> Request handler calls service B

The span for call to service B was showing up as child of span created for service A request
Ideally, both service A and service B calls should be children of the original request span

Additional context

I have created a very small express to demonstrate the issue here:
https://github.com/maxmil7/async-demo

The steps to reproduce are in the readme of the app (just requires cloning the app and hitting the index page)

Add any other context about the problem here

FWIW, I tried to use AsyncLocalStorage context manager instead of AsyncHooks and the same issue is visible there as well

Currently, we are using cls-hooked for context-management and this issue is not reproducible with cls-hooked.

@maxmil7 maxmil7 added the bug Something isn't working label Aug 30, 2020
@dyladan
Copy link
Member

dyladan commented Aug 31, 2020

I doubt this is an issue with the context manager, but with the plugin itself most likely. My best guess is that the span created in the middleware is being marked as the active span during that middleware execution, but not being removed as the active span when the middleware completes.

@blumamir
Copy link
Member

blumamir commented Sep 6, 2020

The express middleware gets its context from the previous middleware's next() call which invoked it. In your case it happens in the response callback of the outgoing HTTP call (via Wreck) which is run with the context of the HTTP span to https://github.com.
In my opinion, it's not obvious what the expected behaviour should be in this case.
Don't know what will happen if you add the express plugin to the party, but this is the cause for the spans tree structure you are seeing.
A possible fix is to write an express plugin which will run each middleware in the context of the incoming HTTP span.

@vmarchaud
Copy link
Member

I believe @blumamir is right on why you are seeing this behavior, however i'm curious on how you are using cls-hooked @maxmil7, could you explain your current setup ?

@maxmil7
Copy link
Author

maxmil7 commented Sep 7, 2020

@blumamir thanks for sharing your insight. I did turn on express instrumentation but still saw the same issue.
I agree its not very obvious what the correct behavior should be, however the instrumentation results (especially when seen in Zipkin UI) can be confusing since it looks like the middleware service call initiated the request handler service call

@maxmil7
Copy link
Author

maxmil7 commented Sep 7, 2020

@vmarchaud currently our logging middleware creates a new context object for every request using the the cls-hooked run function.
We store the request specific info (logId, which connection thread to write etc) in the context object and access it as needed when generating logs.

Now that I think of it, since we don't store any parent<->child mapping in the context, its difficult to say whether the async hooks implementation within cls-hooked would have a solution for this issue.

I did some further debugging with different scenarios and the issue seems to be reproducible using plain http.request function as well

app.use((req, res, next) => {
    const r = http.request('http://www.github.com/')
    r.on('response', () => {
        next();
    })
    r.end();
})

Interestingly enough, the issue doesn't surface when using a promise based library such as fetch in the middleware

app.use((req, res, next) => {
    fetch('http://www.github.com/')
        .then(response => {
            next()
        })
    });

I added some breakpoints and it seems like the promise executed in the .then block is actually created with the same context as the top level fetch promise
Hence, when we call next, the correct context gets loaded via the _before hook and the service call in the request handler gets the top level parent.

@blumamir
Copy link
Member

blumamir commented Sep 8, 2020

@maxmil7 I agree that the trace tree can look confusing and might not be what you expect to see. Your second http call (or middleware invocation) depends on the result of the first invocation, so there is a sense in setting it as the child at least in the purely technical perspective.
As another example, if you take the data from the http response and store it in DB, you would probably expect the DB span to be the child of the outgoing http span, and think it would be confusing it they were siblings.
Unfortunately, I can't think of an easy way for OpenTelemetry to tell which strategy to take for general cases, but would be happy to discuss ideas if you have some. Our setup also produces some weird traces sometimes which I'll be super happy to improve.
For your specific case though, you can fix it quite easily by setting the context for each express middleware invocation in an express plugin (or add this functionality to the current plugin)

@vmarchaud
Copy link
Member

I added some breakpoints and it seems like the promise executed in the .then block is actually created with the same context as the top level fetch promise

Yeah thats the behavior of promises based on the ECMA specs, downstreams contexts are created before executing the promises.

@blumamir is right on the fact that technically the second http call to example.com is the same as the one for github.com though. I'm not sure it's possible even with modification to the express plugin to change that behavior.

@dyladan
Copy link
Member

dyladan commented Sep 8, 2020

A possible fix is to write an express plugin which will run each middleware in the context of the incoming HTTP span.

I think this idea deserves exploration. In express, middlewares can can be thought of in parent-child relations, but more often I think of them as more of a processing pipeline. If you have A->B->C, I don't think of middleware B as the parent of C, but just the processing step that happened before it. Just like how in a procedural program, I don't think of line 3 as being the parent of line 4.

@blumamir
Copy link
Member

blumamir commented Sep 9, 2020

I agree about the desired behavior in this case.
The fix in express plugin should be as easy as capturing the active span as the request enters the pipeline, storing it as a symbol property on the req object, and then executing each middleware with this context using withSpan().
My point is that this is only a patch for express, but the problem is broader than this specific issue. When you want to wait for an async operation to complete (as is done here with the wreck call), you will probably put the awaiting code in the callback, causing ALL following operations to show up under the callback context (can be desired in some cases and undesired in others like this one). If you use promises instead of callback, you will not get the span context at all for the then() or await which mean that promise and callback semantics will create different trace trees where you might expect the same behavior.
I see it pop up in various setups, where the child-parent relations between spans in a trace end up weird, surprising or broken.

@vmarchaud
Copy link
Member

I think this idea deserves exploration. In express, middlewares can can be thought of in parent-child relations, but more often I think of them as more of a processing pipeline. If you have A->B->C, I don't think of middleware B as the parent of C, but just the processing step that happened before it. Just like how in a procedural program, I don't think of line 3 as being the parent of line 4.

I'm not against making a "fix" to express but i think this a broader issue, technically step C is a child of B (and the whole context management API assumes this). There will be other cases in the future where some users might find the spans relationships weird even though they are technically correct and we will need to draw a line at which we stop trying to "fix" them.
That's why i would hesitate to "fix" it for express, WDYT ?

@blumamir I would be interested in other examples where the relationships between spans were weird, either here or by private message on gitter ?

@blumamir
Copy link
Member

blumamir commented Sep 9, 2020

Sure, I'll post here some examples:

setTimeout
I am currently trying to fix an issue with setTimeout in node10, which made me dive into context propagation across timers. Currently, context is propagated across timers which makes sense in my opinion:

const mainSpan = tracer.startSpan('main');
tracer.withSpan(mainSpan, () => {
    setTimeout( () => tracer.startSpan('timer span').end(), 10);
})
mainSpan.end();

timer span is the child of main span. But what about this common pattern?

const pool = () => {
    http.get('http://dummy.restapiexample.com/api/v1/employee/1', () => {
        // process the response and call pool again only after process completes
        setTimeout(pool, 1000);
    })
}
pool();

It will create one infinite trace of HTTP calls one after the other.

promise vs callback
aws-sdk package API can operate both in callback and promise modes.
So you can do something like this:

  sqs.receiveMessage(params, ((err: AWS.AWSError, data: AWS.SQS.Types.ReceiveMessageResult) => {
    // here we expect the context of the receive operation
  })

But we can also use promise function and then():

  sqs.receiveMessage(params).promise().then(data => {
    // here the context is NOT set to the receive operation
  })

In the aws-sdk plugin, I actually patched the returned promise to solve this issue, but it only works for the first then and fails for then chain. The issue is even stranger and more confusing when using await on promises. But the bottom line is that it's the same semantic meaning but different behavior.
This is maybe my biggest problem at the moment, as I have no control over the user code, and await syntax is very common and does not play well with OpenTelemetry context management.

bind context on event emitter
Consider the following case:

const server = http.createServer((req, res) => {
    const span1 = tracer.startSpan('span1');
    tracer.withSpan(span1, () => {
        req.on('data', () => {});
        req.on('end', () => {
            tracer.startSpan('span2').end();
            res.writeHead(200);
            res.end('Hello, World!');        
        });
    });
    span1.end();
});
server.listen(8080);

You might expect span2 to be the child of span1 as it is started in withSpan(span1, ... context, but it's surprisingly actually the child of the HTTP span, as the http plugin bind it's context to the event emitter of req.

The examples above are short so they fit in this github issue, but in real life cases, the problem is usually buried under a stack of libraries, plugins and code which makes it very hard to understand and solve. I usually just get some weird structure traces and need to start digging in.

@habmic
Copy link

habmic commented Sep 9, 2020

Hi All, looking at this issue I wonder if it is something that OpenTelemetry as the infrastructure layer should resolve? I think this question of whether those spans should have child <> parent relation or sibling is a question of perspective rather than real technical questions.

In the technical aspect, one middleware is invoking the next meaning they have child <> parent relation. But each one of us will “visualize” it in a different way. It also could vary from different use cases and different plugins.
There will be an endless amount of discussion about it and every plugin will need to address it differently.

For example, with express having three different middlewares, A & B are regular ones where C is an error middleware. Middleware C is reporting using HTTP the error it captured thrown by middleware A, how would you expect to see the HTTP call made in Middleware C?
A - > C - > HTTP

Or

A
C - > HTTP

I think you can argue for both...

Because it is a perspective issue I think OpenTelemetry should define a default but to allow the end-user to modify it, that way as an OpenTelemetry user I can decide for each use case how I want my code to be instrumented.
In general, there are three layers to configure something like that, in OpenTelemetry, in a specific plug-in, or in the user/application code. The kind of solution I’m referring to is in the application code.

@Instrument(mode=root)
function myMiddleware(req, res, next){}

I can work on a more technical suggestion on how to implement it but wanted to raise the discussion first.
Would love to hear your thoughts.

@vmarchaud
Copy link
Member

vmarchaud commented Sep 9, 2020

promise vs callback

You shouldn't have issues if you are using native Promise though. Context propagation should works across multiple .then. If not, i believe this is a bug.

Because it is a perspective issue I think OpenTelemetry should define a default but to allow the end-user to modify it, that way as an OpenTelemetry user I can decide for each use case how I want my code to be instrumented.

That would require to access to the context stack (which is internal currently). I think it's worth checking how other languages handle those cases and if it's a common pain, see how the spec could be changed to allow this.

@dyladan
Copy link
Member

dyladan commented Sep 9, 2020

promise vs callback

You shouldn't have issues if you are using native Promise though. Context propagation should works across multiple .then. If not, i believe this is a bug.

Make sure you are calling enable on the context manager. I had a similar issue when I forgot that before.

@Flarna
Copy link
Member

Flarna commented Sep 12, 2020

As another example, if you take the data from the http response and store it in DB, you would probably expect the DB span to be the child of the outgoing http span, and think it would be confusing it they were siblings.

I don't think so. Just think about how the code would look like with await or in a language where such calls are sync:

function onRequest() {
  await ClientRequest();
  await DbRequest();
  sendResult();
}

For me the root is the incoming request and it has two childs. The timing tells if they run in parallel or sequential.
But the DbRequest is not a subpart of the ClientRequest, its a followup - a sibling.

@dyladan
Copy link
Member

dyladan commented Sep 14, 2020

As another example, if you take the data from the http response and store it in DB, you would probably expect the DB span to be the child of the outgoing http span, and think it would be confusing it they were siblings.

I don't think so. Just think about how the code would look like with await or in a language where such calls are sync:

function onRequest() {
  await ClientRequest();
  await DbRequest();
  sendResult();
}

For me the root is the incoming request and it has two childs. The timing tells if they run in parallel or sequential.
But the DbRequest is not a subpart of the ClientRequest, its a followup - a sibling.

I also would think of it this way.

@blumamir
Copy link
Member

blumamir commented Sep 15, 2020

I don't think so. Just think about how the code would look like with await or in a language where such calls are sync:

function onRequest() {
await ClientRequest();
await DbRequest();
sendResult();
}
For me the root is the incoming request and it has two childs. The timing tells if they run in parallel or sequential.
But the DbRequest is not a subpart of the ClientRequest, its a followup - a sibling.

This code is a syntactic sugar for the following code:

function onRequest() {
  ClientRequest().then( serverResponse => {
    DbRequest(serverResponse).then( dbResponse => {
      sendResult(dbResponse);
    });
  });
}

Each operation relies on the previous operation result (even if the result data is not used, it still relies on it to succeed and not throw). I don't think there is one right answer thought. Maybe the solution is to allow users to configure if they want the callback / promise to be run in the context of the plugin span or not.

Something like this:

const provider = new NodeTracerProvider({
  plugins: {
    http: {
      enabled: true,
      path: '@opentelemetry/plugin-http',
      setOutgoingCallbackContext: false
    }
  }
});

That will also solve the original issue when configured in this way.

@Flarna
Copy link
Member

Flarna commented Sep 15, 2020

Yes, it relies on the result of the previous but it is not a part of the operation. Consider a outgoing http span and the corresponding incoming - they have a parent-child relationship as the one is a part of the other.

I don't think a TracerProvide global setup would be a good choice. If we really allow users to configure how spans are linked I would expect we need this on plugin level as they have to support quite different usecases.
Sorry, misread, proposal is on plugin.

@blumamir
Copy link
Member

You shouldn't have issues if you are using native Promise though. Context propagation should works across multiple .then. If not, i believe this is a bug.

I am using native promises, but patch the promise so the then and catch will be run with the span context:

  private _bindPromise(target: Promise<any>, span: Span) {
    const thisPlugin = this;

    const origThen = target.then;
    target.then = function (onFulfilled, onRejected) {
      const newOnFulfilled = thisPlugin._tracer.bind(onFulfilled, span);
      const newOnRejected = thisPlugin._tracer.bind(onRejected, span);
      return origThen.call(this, newOnFulfilled, newOnRejected);
    };

    return target;
  }

Would love to discuss better alternatives to do the binding to the promise which supports chining

@blumamir
Copy link
Member

Yes, it relies on the result of the previous but it is not a part of the operation. Consider a outgoing http span and the corresponding incoming - they have a parent-child relationship as the one is a part of the other.

I tend to agree in this case, although I believe it should be configurable. Maybe the default should be not to set context on outgoing callback, and user can choose to keep the default behavior in plugin configuration?

@maxmil7
Copy link
Author

maxmil7 commented Sep 17, 2020

Maybe the solution is to allow users to configure if they want the callback / promise to be run in the context of the plugin span or not.

Something like this:

const provider = new NodeTracerProvider({
  plugins: {
    http: {
      enabled: true,
      path: '@opentelemetry/plugin-http',
      setOutgoingCallbackContext: false
    }
  }
});

+1 I think this would be a great option for plugins which create and enter with a new span context. At least for http plugin, we might have to account for the following scenarios:

const r = http.request('http://www.example.com')
  r.on('response', function(response) {
    //process response
}).end();;

const r = http.request('http://www.example.com', function(response) {
   //process response
}).end();;

and promised based request libraries

Also, like the idea of setting outGoingCallbackContext to false at least for the http plugin

@Flarna
Copy link
Member

Flarna commented Sep 17, 2020

Please note that even if we add this option it will not work together with await.

It is possible to monkey patch catch/then on native promises but await uses the unpatched version.
So depending on how user writes the code the span linking may differ even if the difference in the usercode is just "boilerplate" code as mentioned above.

Besides that such configurations make it harder for tools receiving the spans to interpret/analyse the structure and detect issues automatically. If receivers just display spans but have no logic besides that it's not a problem.

If it is really required to have specific span linking in your app I recommend to disable automatism and pass context manually.

@vmarchaud

This comment has been minimized.

@blumamir
Copy link
Member

Besides that such configurations make it harder for tools receiving the spans to interpret/analyse the structure and detect issues automatically.

You mean those tools won't be able to work consistently on traces and will need to assume both cases for the two possible configuration options?
In that case, I think the best solution is to change current behavior and not set span context on outgoing http callback.

@blumamir
Copy link
Member

At least for http plugin, we might have to account for the following scenarios:

I think we need to remove the outgoing http span context from the event emitter callbacks as well. What do you think?

@Flarna
Copy link
Member

Flarna commented Sep 17, 2020

You mean those tools won't be able to work consistently on traces and will need to assume both cases for the two possible configuration options?

I think so, or at least it's much harder for them. if child and sibling spans are mixed based on a configuration not known/supported within the tool they can not do any analysis based on this.
Syncing plugin configs with tools could improve this but in general these two items should work independent. And usually they are created by different people.

@maxmil7
Copy link
Author

maxmil7 commented Sep 17, 2020

If it is really required to have specific span linking in your app I recommend to disable automatism and pass context manually.

Hmm, so that would mean instrumenting the http calls manually instead of using the plugins correct?

I believe a fix is possible within the express plugin by storing the http span, however i don't see how it would be possible to implement a generic option inside the http plugin.

Agreed, it might be possible. However, I think the issue will surface outside of express middlewares as well. Wherever we start chaining the http calls

In that case, I think the best solution is to change current behavior and not set span context on outgoing http callback.

If everyone agrees, I think this might be a good approach
@Flarna if the default behavior is not to set context for outgoing http callback, then it could work consistently across callbacks, event emitter callbacks, promises and async/await as well right?

@Flarna
Copy link
Member

Flarna commented Sep 17, 2020

I think so yes. I may be needed to set the parent span of the HttpClient span as context in callbacks/emitter of the client req/res.

@vmarchaud

This comment has been minimized.

@vmarchaud
Copy link
Member

vmarchaud commented Sep 19, 2020

I feel dumb for not seeing that the express plugin wasn't enabled in the code you given and that the problem was simply the context set by the http plugin. Anyway i've opened a PR to change the behavior.

EDIT: I'm removing the bug label since the current behavior is not bugous in itself

@vmarchaud vmarchaud added Discussion Issue or PR that needs/is extended discussion. feature-request and removed bug Something isn't working labels Sep 19, 2020
vmarchaud added a commit to vmarchaud/opentelemetry-js that referenced this issue Sep 19, 2020
vmarchaud added a commit to vmarchaud/opentelemetry-js that referenced this issue Sep 19, 2020
vmarchaud added a commit to vmarchaud/opentelemetry-js that referenced this issue Oct 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Discussion Issue or PR that needs/is extended discussion. feature-request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants