Skip to content

Conversation

@kotharironak
Copy link
Contributor

@kotharironak kotharironak commented Jan 7, 2021

Issue:
On one of the deployments, it has been observed that the hypertrace-trace-enricher component was legging heavy in the ingestion pipeline and it was not able to come down. We have also observed continuous re-balancing for its consumed topic in kafka.

Observations:
While investigating, we observed that during the enrichment process, couple of enricher uses an intermediate data structure - StructureTraceGraph, and it has been constructed (or processed) proportional to trace size. In the current set of enrichers, the following enrichers,

While measuring, our analysis suggested, that it was the main cause of an issue. Our sampled trace was having 447 spans in a trace, and it used to spend ~12-13 secs totally in the above two enricher's loop for constructing the graph.

Solution:
As the above data structure - StructureTraceGraph - changes during the processing of trace, we are currently solving it by maintaining it in ThreadLocal and re-constructing it whenever the state change instead of constructing it linearly. As a long-term solution, we are evaluating the option of splitting the above structure into mutable/non-mutable components along with adding enrichment context in the pipeline to pass on such states.

Testing:
We have verified the fix by comparing enriched structured trace with this patch and without this patch using the same sample trace which we used earlier for investigation.

@codecov
Copy link

codecov bot commented Jan 7, 2021

Codecov Report

Merging #100 (bd93b39) into main (c529c39) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##               main     #100      +/-   ##
============================================
- Coverage     68.62%   68.61%   -0.02%     
  Complexity      821      821              
============================================
  Files            84       84              
  Lines          3481     3486       +5     
  Branches        367      367              
============================================
+ Hits           2389     2392       +3     
- Misses          948      950       +2     
  Partials        144      144              
Flag Coverage Δ Complexity Δ
unit 68.61% <100.00%> (-0.02%) 0.00 <3.00> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ Complexity Δ
...raceenricher/enrichment/AbstractTraceEnricher.java 100.00% <100.00%> (ø) 21.00 <1.00> (ø)
...chment/enrichers/DefaultServiceEntityEnricher.java 92.68% <100.00%> (ø) 14.00 <0.00> (ø)
...trace/enricher/StructuredTraceEnrichProcessor.java 89.28% <100.00%> (+2.32%) 7.00 <2.00> (ø)
...er/enrichment/enrichers/BackendEntityEnricher.java 65.90% <0.00%> (-3.41%) 17.00% <0.00%> (-1.00%)
...s/resolver/backend/ClientSpanEndpointResolver.java 100.00% <0.00%> (+8.33%) 6.00% <0.00%> (+1.00%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c529c39...bd93b39. Read the comment docs.

private static ThreadLocal<StructuredTraceGraph> cachedGraph = new ThreadLocal<>();
private static ThreadLocal<StructuredTrace> cachedTrace = new ThreadLocal<>();

public static StructuredTraceGraph buildGraph(StructuredTrace trace) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kotharironak Basic question, Why buildGraph is called for every span. Can we call only once for Trace instead of every span ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pavan-traceable Where does the buildGraph get called for every span?

Copy link
Contributor

@pavan-traceable pavan-traceable Jan 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pavan-traceable buildGraph is called at 3 places,
ApiBoundaryTypeAttributeEnricher
DefaultServiceEntityEnricher
BackendEntityEnricher
Also we work with complete trace in this service, so buildGraph is invoked at Trace level

}

// trace internally changed
if(cachedTrace.get().getEntityList().size() != trace.getEntityList().size() ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this scenario will occur ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pull this out in a separate boolean function? say hasTraceChanged?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the set of attributes and enriched attributes? If the attributes and enriched attributes change, the graph structure will stay the same, but will we get the latest set of attributes from the graph nodes?

Basically, Does it create a copy of the Event, or just keeps the reference to it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is keeping reference. ApiTraceGraph is using getRootEvents, but it is not in the enrichment process. @skjindal93 Can you also re-check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For hasTraceChanged currently, remaining condition of functions - isTraceSame and doeTraceExists works together for this function. So, if we move out, we have to move those conditions too. So, thinking of keeping this as-is for now.

public class StructuredTraceGraphBuilder {
private static final Logger LOG = LoggerFactory.getLogger(StructuredTraceGraphBuilder.class);

private static ThreadLocal<StructuredTraceGraph> cachedGraph = new ThreadLocal<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want it specific to a Thread? Can we use Supplier instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, StructureTraceGraph's state is changing while processing a trace, and we want to re-construct whenever the state affecting its changes. So, we choose thread-local to share the state in a given kstream thread.
@s what do you mean by supplier here? Are you referring to supplying StructureTraceGraph at the start of processing of topology?
cc: @laxmanchekka

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kotharironak We use Suppliers in a similar way to cache the results https://mkyong.com/java8/java-8-supplier-examples/

Though I am not sure, if you can change the entry in supplier based on a certain condition, if the graph structure can potentially change

private static ThreadLocal<StructuredTraceGraph> cachedGraph = new ThreadLocal<>();
private static ThreadLocal<StructuredTrace> cachedTrace = new ThreadLocal<>();

public static StructuredTraceGraph buildGraph(StructuredTrace trace) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pavan-traceable buildGraph is called at 3 places,
ApiBoundaryTypeAttributeEnricher
DefaultServiceEntityEnricher
BackendEntityEnricher
Also we work with complete trace in this service, so buildGraph is invoked at Trace level

public class StructuredTraceGraphBuilder {
private static final Logger LOG = LoggerFactory.getLogger(StructuredTraceGraphBuilder.class);

private static ThreadLocal<StructuredTraceGraph> cachedGraph = new ThreadLocal<>();
Copy link
Contributor

@findingrish findingrish Jan 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that it is a quick optimisation. My question is eventually should we instead of caching the StructuredTraceGraph at thread level, pass around the instance of this graph between enrichers? Anyways the graph is going to be changed for each Trace. We can just create this graph once for every trace at the entry point and other enrichers can use this graph. Any enrichment task can then update the StructuredTraceGraph if there is a need.
That way we also get rid of the need to compare the graphs and building graph multiple times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, it is another way around, the trace is changing during the enrichment process while processing by going over every span and so we have to re-construct StructureTraceGraph every time if the trace state has been modified. Yes, we have to rethink the current processing, and if it's the right data structure, or can we do it differently.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@kotharironak kotharironak marked this pull request as ready for review January 11, 2021 04:19
@kotharironak kotharironak requested a review from a team January 11, 2021 04:19
@kotharironak kotharironak changed the title fix for performance fix for performance observed during enrichment process Jan 11, 2021
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class StructuredTraceGraphBuilder {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add unit test for this class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a utility wrapper method with a set of ordered condition. I think, you are referring here to test around the structure trace graph itself. Can handle it as part of #105. Can you add more detail to that ticket too?

@buchi-busireddy
Copy link
Contributor

@kotharironak I think this code needs to be microbenchmarked at some point because it's so critical.

@github-actions

This comment has been minimized.

@kotharironak kotharironak merged commit 792f8b2 into main Jan 11, 2021
@kotharironak kotharironak deleted the fix-for-perf-issue branch January 11, 2021 07:59
@github-actions
Copy link

Unit Test Results

  47 files  ±0    47 suites  ±0   48s ⏱️ -5s
241 tests ±0  241 ✔️ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 792f8b2. ± Comparison against base commit c529c39.

kotharironak added a commit that referenced this pull request Jan 11, 2021
* fix for performance

* functional validation

* changes logs

* cleaning PR by removing measuing logging codes

* Removes un-used logger

* Adds comments for redability
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants