Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distributed tracing: time inversion for duration calculation #164

Merged
merged 2 commits into from
Jun 24, 2021

Conversation

b-viguier
Copy link
Contributor

@b-viguier b-viguier commented May 18, 2021

Context: I'm facing a strange issue in my NewRelic data, all parent.transportDuration values are 0 for transactions between 2 Php applications instrumented by NewRelic. I opened a ticket and they suggest that servers are not synchronized (but I know they are 😉 ). So I dig into the code…

Explanations

The nr_time_duration function compute the time elapsed between a start timestamp and a stop timestamp. If start > stop, it returns 0.

The nr_distributed_trace_inbound_get_timestamp_delta function intends to return the transport duration, i.e. the time elapsed during the timestamp stored in distributed trace header and the start time of the current transaction, by calling nr_time_duration.

Hypothesis: parameters are passed in the wrong order, then all transport duration are reported to 0 😨

Example: When parent.transportDuration is set.

⚠️ We can find the test if(nr_distributed_trace_inbound_get_timestamp_delta(dt, 0)) (to check if distributed trace has a non-zero timestamp as far as I understand), it seems to work because it relies on this wrong behavior. But this function cannot work in both case, in pseudo-code we cannot have duration(dt, 0) > 0 AND duration(dt, now) > 0 if 0 < dt < now.

ℹ️ Tests use the same function to compute the expected value, then a wrong value always equals the same wrong value.

What I did

  • I reversed parameters order in nr_time_duration call inside nr_distributed_trace_inbound_get_timestamp_delta
  • I also reverted order in tests to compute expected value, and added an assertion to ensure that the duration is not 0 (a wrong value in this case). We could also hard-code the expected value, to be sure to not rely on internal implementation of the tested function.
  • I modified a nr_distributed_trace_inbound_get_timestamp_delta(dt, 0) calls to a new explicit function nr_distributed_trace_inbound_has_timestamp (I think it's the original intent). I created some tests, but I don't understand the default value case of nr_distributed_trace_inbound_get_timestamp_delta function… IMHO there is some trouble here, since it returns 0 because the transaction have a 0 start time 🤔
  • I fixed some tests relying on a wrong expected value, I had to correctly initialize the transaction start time to make sense.
  • Previous fix revealed an other issue in nr_distributed_trace_convert_w3c_headers_tracestate, the timestamp string value was converted to an integer, causing a capacity overflow on x86 arch, I changed to convert to long long unsigned

And now…

I made this PR to run tests on CI, and to share my thoughts with you. What do you think? Am I totally wrong or is there something to investigate? Did I misunderstood something?

Thanks for your attention, I'm looking forward for your feedback 🙇‍♂️

@nr-opensource-bot
Copy link

Can one of the admins verify this patch?

@b-viguier b-viguier force-pushed the fix/dt-transport-duration branch 2 times, most recently from f253977 to d313a20 Compare May 19, 2021 19:07
@zsistla
Copy link
Contributor

zsistla commented May 28, 2021

Hi @b-viguier , thank you for the PR! We always appreciate you taking the time to dig in to the code!!

Looks like it's generating a few errors in the time tests.

@b-viguier
Copy link
Contributor Author

Hi @zsistla 👋
Yes, I saw this last failing test on x86, but I wanted to wait your global feedback on this PR, to know if it was worth investigating further 😅
Anyway, now it's fixed, it was actually related to an other bug with w3c header on 32bits architecture. ✅

Copy link

@Oliboy50 Oliboy50 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤞

axiom/nr_distributed_trace.h Outdated Show resolved Hide resolved
@Fahmy-Mohammed
Copy link
Contributor

Apologies for the delay on this @b-viguier. Review in progress and should be done shortly.

@Fahmy-Mohammed
Copy link
Contributor

ok jenkins

Copy link
Contributor

@Fahmy-Mohammed Fahmy-Mohammed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great and the internal Jenkins tests passed as well. Thanks for catching this error @b-viguier and for the excellent fix! This should be included in our next release (Agent version 9.18).

@b-viguier
Copy link
Contributor Author

No worries for the delay @Fahmy-Mohammed 😄
Thanks for the update, it's nice to have your feedback 👍

@@ -378,7 +378,16 @@ nrtime_t nr_distributed_trace_inbound_get_timestamp_delta(
return 0;
}

return nr_time_duration(txn_start, dt->inbound.timestamp);
return nr_time_duration(dt->inbound.timestamp, txn_start);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @b-viguier :)
Overall looks good, and thanks again for your engagement (and patience).

What was the rationale for switching the parameter signature?

There are quite a few places expecting with this order that don't account for the proposed new signature.

axiom/nr_exclusive_time.c:4
axiom/nr_txn.h:2
axiom/nr_distributed_trace.c:1
axiom/util_time.h:1
axiom/nr_segment_external.c:2
axiom/nr_app.c:1
axiom/nr_segment.c:4
axiom/nr_segment_datastore.c:2
axiom/nr_txn.c:3
axiom/nr_agent.c:1
axiom/tests/test_segment.c:1
axiom/tests/test_distributed_trace.c:3
axiom/tests/test_segment_tree.c:2
axiom/tests/test_time.c:3
axiom/tests/test_txn.c:3

agent/php_explain.c:2
agent/php_execute.c:2
agent/php_explain_mysqli.c:2
agent/lib_predis.c:1
agent/php_api_internal.c:1

Copy link
Contributor Author

@b-viguier b-viguier Jun 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nr_time_duration is fine and well used in most cases. Parameters still have the same meaning: nr_time_duration(start, stop) will return stop - start if start < stop (else 0).

But I'm pretty sure that for this call, parameters are in the wrong order. This is certainly because it seems right to use txn_start as the start parameter of nr_time_duration, but it cannot! Here, we are computing the transport duration. Transport finishes when transaction starts. Chronologically, dt->inbound.timestamp is before txn_start, because it is the time where the transport started (timestamp included in HTTP headers).

So, to compute the right duration, we need to compute txn_start - dt->inbound.timestamp, i.e. nr_time_duration(dt->inbound.timestamp, txn_start). I didn't noticed any similar misuse in other calls.

@zsistla is this clearer now? 😄

@Fahmy-Mohammed Fahmy-Mohammed merged commit a46a438 into newrelic:dev Jun 24, 2021
Fahmy-Mohammed pushed a commit that referenced this pull request Aug 3, 2021
* fix(distributed tracing): time inversion for duration calculation

* fix(distributed tracing): incorrect w3c timestamp parsing
zsistla added a commit to zsistla/docs-website that referenced this pull request Aug 23, 2021
---
subject: PHP agent
releaseDate: '2021-08-23'
version: 9.18.1.303
downloadLink: 'https://download.newrelic.com/php_agent/archive/9.18.1.303'
---

## New Relic PHP Agent v9.18.1 ##

### New Features ###
* [Added](newrelic/newrelic-php-agent#162) a docker development environment. It's now possible for contributors to both develop and test (unit tests and integration tests) without setting up a specific environment on their own system. Please see our [documentation](https://github.com/newrelic/newrelic-php-agent/blob/main/docs/dev_environment.md) for more information.
* [Route caching in `Laravel 7.x` is now supported!](newrelic/newrelic-php-agent#174). Transaction naming now works with routes cached via `php artisan route:cache`. @stockalexander, thanks for your contribution!
* `Redis::mget` and `Redis::mset` [functions are now supported](newrelic/newrelic-php-agent#156). @b-viguier, thanks for your contribution!

### Bug Fixes ###
* [Fixed](newrelic/newrelic-php-agent#161) instances where a memory leak was occurring with our `curl multi` instrumentation.
* [Fixed](newrelic/newrelic-php-agent#176) an issue where a supportability metric used to track an edge case was causing a segfault.
* [Fixed](newrelic/newrelic-php-agent#87) an issue where PHP versions with an unknown API version were incorrectly handled during Debian package install.
* [Fixed](newrelic/newrelic-php-agent#164) instances where `parent.transportDuration` values are `0` for transactions between two PHP applications instrumented through distributed tracing. @b-viguier, thanks for your contribution!
* [Fixed](newrelic/newrelic-php-agent#158) an issue where the `newrelic.ini` configuration file was incorrectly installed. @b-viguier, thanks for your contribution!

### Support Statement ###
* New Relic recommends that you upgrade the agent regularly and at a minimum every 3 months. As of this release, the oldest supported version is [8.6.0](/docs/release-notes/agent-release-notes/php-release-notes/php-agent-860238/).
zuluecho9 pushed a commit to zsistla/docs-website that referenced this pull request Aug 23, 2021
---
subject: PHP agent
releaseDate: '2021-08-23'
version: 9.18.1.303
downloadLink: 'https://download.newrelic.com/php_agent/archive/9.18.1.303'
---

## New Relic PHP Agent v9.18.1 ##

### New Features ###
* [Added](newrelic/newrelic-php-agent#162) a docker development environment. It's now possible for contributors to both develop and test (unit tests and integration tests) without setting up a specific environment on their own system. Please see our [documentation](https://github.com/newrelic/newrelic-php-agent/blob/main/docs/dev_environment.md) for more information.
* [Route caching in `Laravel 7.x` is now supported!](newrelic/newrelic-php-agent#174). Transaction naming now works with routes cached via `php artisan route:cache`. @stockalexander, thanks for your contribution!
* `Redis::mget` and `Redis::mset` [functions are now supported](newrelic/newrelic-php-agent#156). @b-viguier, thanks for your contribution!

### Bug Fixes ###
* [Fixed](newrelic/newrelic-php-agent#161) instances where a memory leak was occurring with our `curl multi` instrumentation.
* [Fixed](newrelic/newrelic-php-agent#176) an issue where a supportability metric used to track an edge case was causing a segfault.
* [Fixed](newrelic/newrelic-php-agent#87) an issue where PHP versions with an unknown API version were incorrectly handled during Debian package install.
* [Fixed](newrelic/newrelic-php-agent#164) instances where `parent.transportDuration` values are `0` for transactions between two PHP applications instrumented through distributed tracing. @b-viguier, thanks for your contribution!
* [Fixed](newrelic/newrelic-php-agent#158) an issue where the `newrelic.ini` configuration file was incorrectly installed. @b-viguier, thanks for your contribution!

### Support Statement ###
* New Relic recommends that you upgrade the agent regularly and at a minimum every 3 months. As of this release, the oldest supported version is [8.6.0](/docs/release-notes/agent-release-notes/php-release-notes/php-agent-860238/).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants