Trace upload fails with 'inference error (CPU)' #13
Hi, David,
The tracing behavior you're after is what we call FDR
(flight-data-recorder) mode. It is enabled by setting the 'overwrite'
option, described in
https://www.kernel.org/doc/Documentation/trace/ftrace.txt, to 'true'. In
this mode, the per-CPU trace buffers are treated as rings: once a buffer
fills up, the newest events overwrite the oldest. In this way, you can take
a 'snapshot' of the most recent buffer-full of activity on demand.
We've set up the collection scripts for SchedViz to do one-shot tracing,
but it's possible to add new scripts for FDR-mode tracing. This would
probably entail implementing three trace commands: Start, which configures
and launches a trace, Stop, which stops an ongoing trace, and Trigger,
which takes a snapshot.
There is one caveat with this, related to the inference error you're
seeing. CPU buffers tend to overrun at different rates, leading to a
situation in which low-activity CPU buffers will include events from a time
that is no longer represented in higher-activity CPU buffers, that time
having been overwritten. The solution to this is to clip traces: to
consider valid only the interval common to all CPU buffers, which runs
from the latest first event among all CPU buffers to the earliest last
event among all CPU buffers.
We have the ability to mark events as clipped in SchedViz, but we haven't
yet added the logic to actually mark events as clipped or unclipped in the
external version. This is probably why you got inference errors on
overwriting your buffers: some events which would've provided important
context had been overwritten and were missing. I'll look into adding that
logic soon.
On Tue, Nov 19, 2019 at 2:50 AM david-laight wrote:
If I set the -buffer_size far too small for the -capture seconds timeout
then the trace upload fails because the call to resolveConflict() in
checkCPUs() (in sched_thread_inferrer.go) returns 'Fail'.
My suspicion is the problem happens because the trace files for the
different cpus start at different times.
The reason I'm doing this is because I want to leave the trace running and
stop it when the application sees an unexpected scheduling delay. This
might take hours, but I only want the last 20ms of trace.
Thanks. I was reading trace.sh and realised it would be relatively easy to start/stop the trace from other software. Building the tar file is then a 'simple' matter of running a suitable script. One question: do I need to flush out the old trace data before the 'echo 1 >tracing_on', or can I just repeatedly turn the trace on and off and then collect the last buffer-full? Also, trace.sh probably ought to output the 'started' trace before actually starting the trace.