This repository has been archived by the owner on Jan 2, 2021. It is now read-only.

Memory info and more details in opentelemetry #826

Closed
wants to merge 20 commits

Conversation

mpardalos
Contributor

This patch adds the following information to opentelemetry traces:

  • Number of entries in the "state" hashmap in ShakeExtras (metric "values map count")
  • Memory occupancy of that map (metric "values map size_bytes"). Measuring this is currently very slow, so sizes are only reported about once per minute.
  • Trace of all LSP requests and notifications (as opposed to just those defined in HoverDefinition.hs, as before)

Other changes:

  • Put opentelemetry tracing behind an --ot-profiling flag
  • Add an --ot-profiling flag to the benchmark suite, which runs ghcide with --ot-profiling
  • Improve recording of heap_live_bytes in the opentelemetry trace by running performGC every 0.5s (only with --ot-profiling); a sketch of the idea follows
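
For the last bullet, a minimal sketch of the periodic-GC idea (the --ot-profiling gating is not shown, and startGCTicker is an illustrative name, not the PR's actual function):

import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever, void)
import System.Mem (performGC)

-- Force a major GC every 0.5s so heap_live_bytes reflects the live set
-- rather than whatever has accumulated since the last collection.
startGCTicker :: IO ()
startGCTicker = void $ forkIO $ forever $ do
  performGC
  threadDelay 500000  -- microseconds, i.e. 0.5s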

mpardalos and others added 11 commits September 22, 2020 19:58
Instead of doing it in `HoverDefinition`, do it in
with{Response,Notification,...}. These wrap all handlers, so this should
cover everything. It also means that the span covers the entire
processing time for the request, whereas before we missed the setup
happening in the with* functions.
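
A hedged sketch of what such a wrapper might look like, built on the withSpan combinator from the opentelemetry package that this patch already uses elsewhere (the module name and the real otTracedHandler signature in the diff may differ):

import qualified Data.ByteString.Char8 as BS
import OpenTelemetry.Eventlog (withSpan)  -- module name assumed

-- Run a handler body inside a named span, so the span covers the whole
-- processing time, e.g. otTracedHandler "Request" "TextDocumentHover" body.
otTracedHandler :: String -> String -> IO a -> IO a
otTracedHandler requestType label act =
  withSpan (BS.pack (requestType <> ":" <> label)) (const act)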
Run GC regularly with --ot-profiling
I renamed the fork to distinguish it from the original.
It is still being pulled from git using stack. This will be addressed
once I can push the fork to hackage.
@wz1000
Collaborator

wz1000 commented Sep 23, 2020

Did you try this out on a ghcide session? Any memorable findings? Which actions/requests take the most time, which ones leak memory, which ones cause memory consumption to spike, etc?

@wz1000
Collaborator

wz1000 commented Sep 23, 2020

It would be good to have some documentation on how to see/use/interpret these traces.

Collaborator

@pepeiborra pepeiborra left a comment

The changes look reasonable to me, if a bit unfinished.

For instance, the benchmark suite has been extended with an option to collect telemetry details, like the length and size of the state HashMap. But nothing is being done with this data!

  • It should be added to the CSV outputs of an experiment
  • It could be graphed by the bench/hist build system so that we can easily spot leaks

startTelemetry _ _ (IdeOTProfiling False) = return ()
startTelemetry name valuesRef (IdeOTProfiling True) = do
mapBytesInstrument <- mkValueObserver (BS.pack name <> " size_bytes")
mapCountInstrument <- mkValueObserver (BS.pack name <> " count")
Collaborator

In addition to these global counters, it would be very useful to have per-rule counters, in order to show which rules are expensive (in items and in memory)
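
One possible shape for such per-rule counters, caching one observer per key. mkInstrumentFor, the ValueObserver type name, and the constraints are assumptions here (a later diff in this PR does introduce an instrumentFor function):

import qualified Data.ByteString.Char8 as BS
import qualified Data.HashMap.Strict as HMap
import Data.Hashable (Hashable)
import Data.IORef

-- Create one ValueObserver per rule key on demand, so each rule gets
-- its own size_bytes metric in the trace. (A benign race may create a
-- duplicate observer; fine for a sketch.)
mkInstrumentFor :: (Show k, Eq k, Hashable k) => IO (k -> IO ValueObserver)
mkInstrumentFor = do
  ref <- newIORef HMap.empty
  pure $ \k -> do
    cache <- readIORef ref
    case HMap.lookup k cache of
      Just obs -> pure obs
      Nothing  -> do
        obs <- mkValueObserver (BS.pack (show k) <> " size_bytes")
        modifyIORef' ref (HMap.insert k obs)
        pure obs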

Contributor Author

I agree with all the suggested changes, with one caveat: I don't think it's feasible to measure per-rule allocations with the current method, as it would be too slow, even for benchmarking use.

Collaborator

Would a measure of per-rule allocations and deriving the total from that be fast enough?

Contributor Author

I'm going to try implementing this and see the results. My guess is it would actually be quite a bit faster, because now you are counting items only when they are inserted into the map. However, the metric then becomes one of total allocations instead of live memory, since you miss deletions. Would there be some way of "intercepting" those so we can count the size of deleted elements?
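
A rough sketch of the insert-time counting (recordInsert is a hypothetical hook on the insert path; recursiveSizeNoGC is the heapsize function this PR already uses, assumed here to report a size in machine words):

import Data.IORef
import Foreign.Storable (sizeOf)
import Heapsize (recursiveSizeNoGC)  -- module name assumed

-- Accumulate the size of each value as it is inserted. Deletions are
-- never observed, so the counter tracks total allocations, not live memory.
recordInsert :: IORef Int -> v -> IO ()
recordInsert total v = do
  sizeInWords <- recursiveSizeNoGC v
  modifyIORef' total (+ (sizeOf (undefined :: Word) * sizeInWords))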

Collaborator

Hmm, I don't think it's going to be easy (or possible) to observe the deletions.

Maybe what we want is to keep the current behaviour, but instead of getting the recursive size of the whole values map, break it down by key type and get the recursive size of each group
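
A sketch of that grouping, generic over a hypothetical keyType projection (the eventual diff in this PR uses HMap.fromListWith (++) in just this way):

import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HMap

-- Regroup from one entry per (key, file) to one bucket per key type,
-- so each bucket can be measured recursively on its own.
groupByKeyType :: (Eq t, Hashable t)
               => (k -> t) -> HMap.HashMap k v -> HMap.HashMap t [v]
groupByKeyType keyType m =
  HMap.fromListWith (++) [ (keyType k, [v]) | (k, v) <- HMap.toList m ]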

Contributor Author

So, as a test, I tried measuring allocations when objects get added to the state map (mpardalos/ghcide@opentelemetry-memory...detailed-memory-test), and it turns out I was wrong. Running a small "hover" benchmark (see code) took about 3 hours, which I think qualifies as unusable for now.

Maybe what we want is to keep the current behaviour, but instead of getting the recursive size of the whole values map, break it down by key type and get the recursive size of each group

That is probably much more feasible. I will give that a try as well and report back

Collaborator

Why is it so slow?

Contributor Author

I tried analyzing which entries in the map take the longest, and it seems like the worst offenders are the GhcSessionIO and GhcSession entries, each of which can take about 10s to measure; I think this is because their representation is very nested (each can be >1,000,000 closures). If there were a way to estimate the size of these entries instead of having to traverse them, it could improve performance a lot.

If there isn't a way to handle this, then the only option left would be to optimise heapsize further.

Collaborator

Have you tried the group approach below? I think that's probably what we want to measure, and not the individual insertions, since I suspect that when inserting a new item in the map, part of its cost is shared with other entries in the map.

Maybe what we want is to keep the current behaviour, but instead of getting the recursive size of the whole values map, break it down by key type and get the recursive size of each group

@mpardalos
Contributor Author

@wz1000

It would be good to have some documentation on how to see/use/interpret these traces.

Should I add that in the README?

@wz1000
Collaborator

wz1000 commented Sep 25, 2020

Should I add that in the README?

The docs/ folder seems like the right place for this. You can also link there from the benchmark README.

@mpardalos
Contributor Author

@wz1000

Did you try this out on a ghcide session? Any memorable findings? Which actions/requests take the most time, which ones leak memory, which ones cause memory consumption to spike, etc?

I can't say I've found anything directly interesting as this stands. Most requests seem basically instant in my (admittedly limited) testing on the ghcide codebase. As for actions, TypeCheck usually takes a while (as expected), and I have also seen ReportImportCycles stand out at ~1s.

Memory consumption is, as expected due to #421, constantly rising. I am hoping this work will help in addressing that; however, I need to find a way of getting more detailed readings first. Hopefully the change @pepeiborra suggested (#826 (comment)) helps.

@pepeiborra
Collaborator

@mpardalos any updates on this?

@mpardalos
Contributor Author

@mpardalos any updates on this?

Sorry for the radio silence, starting my final year of uni has been a bit busy. I haven't worked on this since we last spoke, but I'll try to get it ready to merge this week.

See my comment on the code review for what I think is left to do.

atomically $ modifyTVar pendingRequests (Set.insert _id)
writeChan clientMsgChan $ Response r wrap f
let withNotification old f = Just $ \r -> writeChan clientMsgChan $ Notification r (\lsp ide x -> f lsp ide x >> whenJust old ($ r))
let withResponseAndRequest wrap wrapNewReq f = Just $ \r@RequestMessage{_id} -> do
let withNotification old f = Just $ \r@NotificationMessage{_method} -> otTracedHandler "Notification" (show _method) $
Collaborator

Are these traces correct? The only thing they will be measuring is the time required to forward the request/notification on the channel, not the time we actually spend computing.

Contributor Author

You are completely right. Should I be adding that further below in handleInit?

@mpardalos
Contributor Author

I made it report the size for each key separately (lumping together all the entries for a given key which refer to different files). This could be pretty useful although it is still slow.

Here's a screenshot of what the output might look like in tracy.

[screenshot: Tracy view showing the "values map size_bytes" graph with one graph per key below it]

"values map size_bytes" is the total size of the values map, and each key gets its own graph below that.

As I mentioned before, the worst offenders for the slowdown are GhcSession and GhcSessionIO, which are also the biggest entries. I will try skipping them first (to see how much of a difference that makes) and then try putting their measurement on a separate thread so that every other entry can at least have some more detailed data.
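
A sketch of that split, assuming some measureGroup action exists for a single key group (both names here are illustrative): the expensive session keys get their own, slower loop so every other key can keep a faster cadence.

import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever, void)

-- Measure only the known-expensive keys here, off the main telemetry loop.
startSlowKeyLoop :: (g -> IO ()) -> [g] -> IO ()
startSlowKeyLoop measureGroup expensive = void $ forkIO $ forever $ do
  mapM_ measureGroup expensive
  threadDelay 60000000  -- microseconds: about once a minute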

@mpardalos mpardalos marked this pull request as draft October 19, 2020 15:45
@wz1000
Collaborator

wz1000 commented Oct 19, 2020

@mpardalos great work, that looks awesome!

Here are some sets of rules among which a lot of sharing might be going on: [GhcSession, GhcSessionDeps, GhcSessionIO], [GetModSummaryWithoutTimestamps, GetModSummary], [GetModIfaceFromDisk,GetModIface, GetModIfaceWithoutLinkables], [GetParsedModule,TypeCheck]

It would be awesome if we could group together the entries for such lists too.

Also, can you upload the full trace that you computed somewhere? I'd be very interested in looking at it.

@wz1000
Collaborator

wz1000 commented Oct 19, 2020

What does the gc graph show?

@wz1000
Collaborator

wz1000 commented Oct 19, 2020

As a way to work around any slowness, would it be possible to take periodic heap snapshots and dump them to disk (which should be fast), then load them later to do all the number crunching?

/cc @mpickering

@mpardalos
Contributor Author

What does the gc graph show?

This is part of OpenTelemetry itself. It counts total GC time.
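
For reference, the same cumulative figure is available straight from the RTS; a sketch (requires running with +RTS -T, otherwise getRTSStats throws):

import GHC.Stats (getRTSStats, gc_elapsed_ns)

-- Total elapsed time spent in GC since process start, in seconds.
totalGCSeconds :: IO Double
totalGCSeconds = do
  stats <- getRTSStats
  pure (fromIntegral (gc_elapsed_ns stats) / 1e9)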

As a way to workaround any slowness, could it be possible to take periodic heap snapshots and dump them to disk (which should be fast), and load them later to all the number crunching?

If that's possible it could be very very interesting

Also, can you upload the full trace that you computed somewhere? I'd be very interested in looking at it.

Here it is. It's basically just me jumping around in ghcide's source, opening/closing files, adding syntax errors, etc. It is still slow but I think that skipping those GhcSession(IO)? entries brings it into the realm of usable.

ghcide.eventlog.gz

@wz1000
Collaborator

wz1000 commented Oct 19, 2020

Some more things to measure:

  • number of entries/files for each key
  • Number of stale vs current entries

@pepeiborra
Collaborator

This is starting to look great. Should we focus on getting a minimum useful kernel checked in so that we can iterate over it?

@wz1000
Collaborator

wz1000 commented Oct 20, 2020

Here is one hack that might possibly work to emulate the 'snapshot' idea if you have enough swap:

import Data.Time.Clock (UTCTime)
import System.Posix.Process (ProcessID, forkProcess)
import System.Posix.Signals (raiseSignal, sigSTOP)

sample :: UTCTime -> IO ProcessID
sample time = forkProcess $ do
  -- Stop the child immediately; it keeps a copy-on-write snapshot of the heap
  raiseSignal sigSTOP
  -- Do profiling
  profileResults <- ...
  dumpResultsToFile ("sample-" ++ show time) profileResults

Then we need to collect all of these ProcessIDs (one for each sample) and, before exiting ghcide, raise SIGCONT on each of them one by one, waiting for each process to exit.

However, you do have to be careful not to use any Handles/MVars/locks from the original process when you are computing results or dumping them out. I'm pretty sure it would crash if you tried to write them to the eventlog, so you need to dump them to a newly opened file.

@pepeiborra
Collaborator

It's starting to look like this PR is abandoned.
@mpardalos can we get a minimum useful change merged?

@mpardalos
Contributor Author

As it stands, would you consider this ready to merge? Functionality is there, everything is just quite slow.

@pepeiborra
Collaborator

As long as it is behind a flag, it's mergeable

@pepeiborra
Collaborator

@mpardalos I want to have a play with this, but I don't know how to hook the event trace up to Tracy. Could you add some docs or links?

@wz1000
Collaborator

wz1000 commented Nov 21, 2020

@mpardalos
Contributor Author

I added a doc for how to use this functionality.
I also included how to actually build the Tracy viewer, because the instructions are buried in Tracy's PDF manual.

@pepeiborra
Collaborator

pepeiborra commented Nov 21, 2020

That leads to an error; what's next?

[nix-shell:~/scratch/ghcide]$ eventlog-to-tracy ghcide.eventlog
Converting ghcide.eventlog to ghcide.eventlog.trace.json...
dyld: Library not loaded: libcapstone.4.dylib
  Referenced from: /nix/store/ks7805136p3f308p6i2jwygv5nd0jbmb-tracy-0.7.1/bin/import-chrome
  Reason: image not found
eventlog-to-tracy: callProcess: import-chrome "ghcide.eventlog.trace.json" "ghcide.eventlog.tracy" (exit -6): failed

I am using the following branch: http://github.com/pepeiborra/ghcide/tree/opentelemetry

EDIT: figured it out

@mpardalos
Contributor Author

Was it maybe something unclear in the instructions? It might be a good idea to specify.

& HMap.fromListWith (++)
valuesSize <- sequence $ HMap.mapWithKey (\k v -> withSpan ("Measure " <> (BS.pack $ show k)) $ \sp -> do
{ instrument <- instrumentFor k
; byteSize <- (sizeOf (undefined :: Word) *) <$> recursiveSizeNoGC v
Collaborator

@pepeiborra pepeiborra Nov 23, 2020

This call to recursiveSizeNoGC leads to a segfault if a GC arises while recursiveSizeNoGC is counting, which is quite easy to reproduce.

Given the way recursiveSizeNoGC works, this is impossible to avoid (the act of measuring the size itself allocates memory), but it could be minimised by stopping the world while measuring.

Did you consider this and/or other workarounds @mpardalos ?

Collaborator

Another, much cheaper way to minimise the segfaults in recursiveSizeNoGC is to use System.Mem.Weak to observe GC runs, and abort the count when this happens.
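
A minimal sketch of that trick, assuming a freshly allocated key that is dropped right away; once any GC runs, the key is collected and the weak pointer is emptied (newGCDetector/gcHasRun are illustrative names, not heapsize's API):

import Data.IORef (IORef, newIORef)
import System.Mem.Weak (Weak, deRefWeak, mkWeakPtr)

-- The key is unreachable as soon as this returns, so the next GC
-- tombstones the weak pointer.
newGCDetector :: IO (Weak (IORef ()))
newGCDetector = do
  key <- newIORef ()
  mkWeakPtr key Nothing

-- Nothing means the key was collected, i.e. a GC has run since creation;
-- any size count in progress should then be aborted.
gcHasRun :: Weak (IORef ()) -> IO Bool
gcHasRun w = maybe True (const False) <$> deRefWeak w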

@mpardalos are you accepting PRs for heapsize?

Contributor Author

Yeah absolutely

Collaborator

I have tested the Weak workaround and bingo, no more segfaults. So I'm going to:

  • Send you a heapsize PR
  • Ask you to upload a new heapsize version to Hackage
  • Commandeer this diff (I have made plenty of changes)
  • Get this merged

Collaborator

I spoke too soon, segfaults can still arise if GC strikes in between the Weak check and the address dereferencing. It would be much safer if this code was written in C and accessed through an unsafe foreign import, because that guarantees that a GC cannot strike.

I also investigated a solution for the sharing problems, providing an Applicative interface so that it's possible to do multiple heapsize measurements with the same set of seen pointers. Again, the GC gets in the way, sadly making this approach not viable. One would need to rewrite the whole thing in C.
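
For illustration, the shape such a binding could take (c_recursive_size is a hypothetical C function, not an existing heapsize symbol): an unsafe foreign import is never preempted by the RTS, so no GC can strike mid-call.

import Foreign.C.Types (CSize(..))
import Foreign.Ptr (Ptr)

-- 'unsafe' means the calling capability does not reach a GC safe point
-- during the call, so closures cannot move or be freed while the C code
-- walks the heap.
foreign import ccall unsafe "c_recursive_size"
  c_recursiveSize :: Ptr () -> IO CSize

The flip side is that a long-running unsafe call also blocks every other thread from garbage collecting, so the walk would need to be quick or chunked.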

@pepeiborra
Collaborator

I managed to get full measurements with sharing for a small project by setting a very large nursery (-A4G) and grouping key types in chunks instead of trying to measure the whole values dictionary in one go, as well as optimising the measurement code a bit to allocate less.

However, I still observe crashes due to SIGILL (Illegal Instruction). My attempts to get a backtrace have been unsuccessful; Mac OS got in the way. I'll try to replicate this on Linux tomorrow.

Screenshots below where you can observe the effects of sharing: TypeCheck, GetModIface, GetModSummary, and others have a cost of 1 because everything(?) is reachable from GhcSession, apparently. Sadly not very helpful. The only exception is GetParsedModule, but that's accidental: it wasn't in the sharing group when I took the measurement. Once added to the sharing group, it also has cost 1.

[screenshots: per-key size graphs; most keys show a cost of 1 due to sharing, GetParsedModule being the exception]

@wz1000
Collaborator

wz1000 commented Nov 24, 2020

Why does GetHieAst have a size of 1? I'm pretty sure it's not being shared by anything else.

@pepeiborra
Collaborator

Superseded by #922

@pepeiborra pepeiborra closed this Nov 25, 2020
@pepeiborra
Collaborator

Why does GetHieAst have a size of 1? I'm pretty sure its not being shared by anything else.

This had a simple explanation: Shake extras is reachable from the GhcSessionIO closure.

@wz1000
Collaborator

wz1000 commented Nov 25, 2020

This had a simple explanation: Shake extras is reachable from the GhcSessionIO closure.

Does that mean everything in the Shake store is reported as part of GhcSessionIO?

@pepeiborra
Collaborator

This had a simple explanation: Shake extras is reachable from the GhcSessionIO closure.

Does that mean everything in the Shake store is reported as part of GhcSessionIO?

Yes, I would expect so
