-
Notifications
You must be signed in to change notification settings - Fork 16
fix for performance observed during enrichment process #100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## main #100 +/- ##
============================================
- Coverage 68.62% 68.61% -0.02%
Complexity 821 821
============================================
Files 84 84
Lines 3481 3486 +5
Branches 367 367
============================================
+ Hits 2389 2392 +3
- Misses 948 950 +2
Partials 144 144
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
| private static ThreadLocal<StructuredTraceGraph> cachedGraph = new ThreadLocal<>(); | ||
| private static ThreadLocal<StructuredTrace> cachedTrace = new ThreadLocal<>(); | ||
|
|
||
| public static StructuredTraceGraph buildGraph(StructuredTrace trace) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kotharironak Basic question, Why buildGraph is called for every span. Can we call only once for Trace instead of every span ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pavan-traceable Where does the buildGraph get called for every span?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats my understanding from this slack chat https://hypertrace.slack.com/archives/G01BLU6M52N/p1609913769019700?thread_ts=1609841154.015200&cid=G01BLU6M52N.. I may be wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pavan-traceable buildGraph is called at 3 places,
ApiBoundaryTypeAttributeEnricher
DefaultServiceEntityEnricher
BackendEntityEnricher
Also we work with complete trace in this service, so buildGraph is invoked at Trace level
| } | ||
|
|
||
| // trace internally changed | ||
| if(cachedTrace.get().getEntityList().size() != trace.getEntityList().size() || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When this scenario will occur ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we pull this out in a separate boolean function? say hasTraceChanged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the set of attributes and enriched attributes? If the attributes and enriched attributes change, the graph structure will stay the same, but will we get the latest set of attributes from the graph nodes?
Basically, Does it create a copy of the Event, or just keeps the reference to it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is keeping reference. ApiTraceGraph is using getRootEvents, but it is not in the enrichment process. @skjindal93 Can you also re-check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For hasTraceChanged currently, remaining condition of functions - isTraceSame and doeTraceExists works together for this function. So, if we move out, we have to move those conditions too. So, thinking of keeping this as-is for now.
| public class StructuredTraceGraphBuilder { | ||
| private static final Logger LOG = LoggerFactory.getLogger(StructuredTraceGraphBuilder.class); | ||
|
|
||
| private static ThreadLocal<StructuredTraceGraph> cachedGraph = new ThreadLocal<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want it specific to a Thread? Can we use Supplier instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, StructureTraceGraph's state is changing while processing a trace, and we want to re-construct whenever the state affecting its changes. So, we choose thread-local to share the state in a given kstream thread.
@s what do you mean by supplier here? Are you referring to supplying StructureTraceGraph at the start of processing of topology?
cc: @laxmanchekka
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kotharironak We use Suppliers in a similar way to cache the results https://mkyong.com/java8/java-8-supplier-examples/
Though I am not sure, if you can change the entry in supplier based on a certain condition, if the graph structure can potentially change
| private static ThreadLocal<StructuredTraceGraph> cachedGraph = new ThreadLocal<>(); | ||
| private static ThreadLocal<StructuredTrace> cachedTrace = new ThreadLocal<>(); | ||
|
|
||
| public static StructuredTraceGraph buildGraph(StructuredTrace trace) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pavan-traceable buildGraph is called at 3 places,
ApiBoundaryTypeAttributeEnricher
DefaultServiceEntityEnricher
BackendEntityEnricher
Also we work with complete trace in this service, so buildGraph is invoked at Trace level
| public class StructuredTraceGraphBuilder { | ||
| private static final Logger LOG = LoggerFactory.getLogger(StructuredTraceGraphBuilder.class); | ||
|
|
||
| private static ThreadLocal<StructuredTraceGraph> cachedGraph = new ThreadLocal<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that it is a quick optimisation. My question is eventually should we instead of caching the StructuredTraceGraph at thread level, pass around the instance of this graph between enrichers? Anyways the graph is going to be changed for each Trace. We can just create this graph once for every trace at the entry point and other enrichers can use this graph. Any enrichment task can then update the StructuredTraceGraph if there is a need.
That way we also get rid of the need to compare the graphs and building graph multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, it is another way around, the trace is changing during the enrichment process while processing by going over every span and so we have to re-construct StructureTraceGraph every time if the trace state has been modified. Yes, we have to rethink the current processing, and if it's the right data structure, or can we do it differently.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
| import org.slf4j.Logger; | ||
| import org.slf4j.LoggerFactory; | ||
|
|
||
| public class StructuredTraceGraphBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add unit test for this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a utility wrapper method with a set of ordered condition. I think, you are referring here to test around the structure trace graph itself. Can handle it as part of #105. Can you add more detail to that ticket too?
|
@kotharironak I think this code needs to be microbenchmarked at some point because it's so critical. |
This comment has been minimized.
This comment has been minimized.
* fix for performance * functional validation * changes logs * cleaning PR by removing measuing logging codes * Removes un-used logger * Adds comments for redability
Issue:
On one of the deployments, it has been observed that the hypertrace-trace-enricher component was legging heavy in the ingestion pipeline and it was not able to come down. We have also observed continuous re-balancing for its consumed topic in kafka.
Observations:
While investigating, we observed that during the enrichment process, couple of enricher uses an intermediate data structure -
StructureTraceGraph, and it has been constructed (or processed) proportional to trace size. In the current set of enrichers, the following enrichers,While measuring, our analysis suggested, that it was the main cause of an issue. Our sampled trace was having 447 spans in a trace, and it used to spend ~12-13 secs totally in the above two enricher's loop for constructing the graph.
Solution:
As the above data structure - StructureTraceGraph - changes during the processing of trace, we are currently solving it by maintaining it in ThreadLocal and re-constructing it whenever the state change instead of constructing it linearly. As a long-term solution, we are evaluating the option of splitting the above structure into mutable/non-mutable components along with adding enrichment context in the pipeline to pass on such states.
Testing:
We have verified the fix by comparing enriched structured trace with this patch and without this patch using the same sample trace which we used earlier for investigation.