This repository has been archived by the owner on Jan 2, 2021. It is now read-only.

Memory info and more details in opentelemetry #826

Closed
wants to merge 20 commits

Conversation

mpardalos
Contributor

This patch adds the following information to opentelemetry traces:

  • Number of entries in the "state" hashmap in ShakeExtras (metric "values map count")
  • Memory occupancy of that map (metric "values map size_bytes"). Measuring this is currently very slow, so sizes are only reported about once per minute.
  • Trace of all LSP requests and notifications (as opposed to just those defined in HoverDefinition.hs, as before)

Other changes:

  • Put opentelemetry tracing behind an --ot-profiling flag
  • Add an --ot-profiling flag to the benchmark suite, which runs ghcide with --ot-profiling
  • Improve recording of heap_live_bytes in the opentelemetry trace by running performGC every 0.5s (only with --ot-profiling); a sketch of the idea follows
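
For the last bullet, a minimal sketch of the periodic-GC idea (the --ot-profiling gating is not shown, and startGCTicker is an illustrative name, not the PR's actual function):

import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever, void)
import System.Mem (performGC)

-- Force a major GC every 0.5s so heap_live_bytes reflects the live set
-- rather than whatever has accumulated since the last collection.
startGCTicker :: IO ()
startGCTicker = void $ forkIO $ forever $ do
  performGC
  threadDelay 500000  -- microseconds, i.e. 0.5s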

mpardalos and others added 11 commits September 22, 2020 19:58
Instead of doing it in `HoverDefinition`, do it in
with{Response,Notification,...}. These wrap all handlers, so this should
cover everything. It also means that the span covers the entire
processing time for the request, whereas before we missed the setup
happening in the with* functions.
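
A hedged sketch of what such a wrapper might look like, built on the withSpan combinator from the opentelemetry package that this patch already uses elsewhere (the module name and the real otTracedHandler signature in the diff may differ):

import qualified Data.ByteString.Char8 as BS
import OpenTelemetry.Eventlog (withSpan)  -- module name assumed

-- Run a handler body inside a named span, so the span covers the whole
-- processing time, e.g. otTracedHandler "Request" "TextDocumentHover" body.
otTracedHandler :: String -> String -> IO a -> IO a
otTracedHandler requestType label act =
  withSpan (BS.pack (requestType <> ":" <> label)) (const act)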
Run GC regularly with --ot-profiling
I renamed the fork to distinguish it from the original.
It is still being pulled from git using stack. This will be addressed
once I can push the fork to hackage.
@wz1000
Collaborator

wz1000 commented Sep 23, 2020

Did you try this out on a ghcide session? Any memorable findings? Which actions/requests take the most time, which ones leak memory, which ones cause memory consumption to spike, etc?

@wz1000
Collaborator

wz1000 commented Sep 23, 2020

It would be good to have some documentation on how to see/use/interpret these traces.

Collaborator

@pepeiborra pepeiborra left a comment

The changes look reasonable to me, if a bit unfinished.

For instance, the benchmark suite has been extended with an option to collect telemetry details, like the length and size of the state HashMap. But nothing is being done with this data!

  • It should be added to the CSV outputs of an experiment
  • It could be graphed by the bench/hist build system so that we can easily spot leaks

startTelemetry _ _ (IdeOTProfiling False) = return ()
startTelemetry name valuesRef (IdeOTProfiling True) = do
mapBytesInstrument <- mkValueObserver (BS.pack name <> " size_bytes")
mapCountInstrument <- mkValueObserver (BS.pack name <> " count")
Collaborator

In addition to these global counters, it would be very useful to have per-rule counters, in order to show which rules are expensive (in items and in memory)
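
One possible shape for such per-rule counters, caching one observer per key. mkInstrumentFor, the ValueObserver type name, and the constraints are assumptions here (a later diff in this PR does introduce an instrumentFor function):

import qualified Data.ByteString.Char8 as BS
import qualified Data.HashMap.Strict as HMap
import Data.Hashable (Hashable)
import Data.IORef

-- Create one ValueObserver per rule key on demand, so each rule gets
-- its own size_bytes metric in the trace. (A benign race may create a
-- duplicate observer; fine for a sketch.)
mkInstrumentFor :: (Show k, Eq k, Hashable k) => IO (k -> IO ValueObserver)
mkInstrumentFor = do
  ref <- newIORef HMap.empty
  pure $ \k -> do
    cache <- readIORef ref
    case HMap.lookup k cache of
      Just obs -> pure obs
      Nothing  -> do
        obs <- mkValueObserver (BS.pack (show k) <> " size_bytes")
        modifyIORef' ref (HMap.insert k obs)
        pure obs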

Contributor Author

I agree with all the suggested changes, with one caveat: I don't think it's feasible to measure per-rule allocations with the current method, as it would be too slow, even for benchmarking use.

Collaborator

Would a measure of per-rule allocations and deriving the total from that be fast enough?

Contributor Author

I'm going to try implementing this and see the results. My guess is it would actually be quite a bit faster, because now you are counting items only when they are inserted into the map. However, the metric then becomes one of total allocations instead of live memory, since you miss deletions. Would there be some way of "intercepting" those so we can count the size of deleted elements?
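
A rough sketch of the insert-time counting (recordInsert is a hypothetical hook on the insert path; recursiveSizeNoGC is the heapsize function this PR already uses, assumed here to report a size in machine words):

import Data.IORef
import Foreign.Storable (sizeOf)
import Heapsize (recursiveSizeNoGC)  -- module name assumed

-- Accumulate the size of each value as it is inserted. Deletions are
-- never observed, so the counter tracks total allocations, not live memory.
recordInsert :: IORef Int -> v -> IO ()
recordInsert total v = do
  sizeInWords <- recursiveSizeNoGC v
  modifyIORef' total (+ (sizeOf (undefined :: Word) * sizeInWords))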

Collaborator

Hmm, I don't think it's going to be easy (or possible) to observe the deletions.

Maybe what we want is to keep the current behaviour, but instead of getting the recursive size of the whole values map, break it down by key type and get the recursive size of each group
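
A sketch of that grouping, generic over a hypothetical keyType projection (the eventual diff in this PR uses HMap.fromListWith (++) in just this way):

import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HMap

-- Regroup from one entry per (key, file) to one bucket per key type,
-- so each bucket can be measured recursively on its own.
groupByKeyType :: (Eq t, Hashable t)
               => (k -> t) -> HMap.HashMap k v -> HMap.HashMap t [v]
groupByKeyType keyType m =
  HMap.fromListWith (++) [ (keyType k, [v]) | (k, v) <- HMap.toList m ]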

Contributor Author

So, as a test, I tried measuring allocations when objects get added to the state map (mpardalos/ghcide@opentelemetry-memory...detailed-memory-test), and it turns out I was wrong. Running a small "hover" benchmark (see code) took about 3 hours, which I think qualifies as unusable for now.

Maybe what we want is to keep the current behaviour, but instead of getting the recursive size of the whole values map, break it down by key type and get the recursive size of each group

That is probably much more feasible. I will give that a try as well and report back

Collaborator

Why is it so slow?

Contributor Author

I tried analyzing which entries in the map take the longest, and it seems like the worst offenders are the GhcSessionIO and GhcSession entries, each of which can take about 10s to measure; I think this is because their representation is very nested (each can be >1,000,000 closures). If there were a way to estimate the size of these entries instead of having to traverse them, it could improve performance a lot.

If there isn't a way to handle this, then the only option left would be to optimise heapsize further.

Collaborator

Have you tried the group approach below? I think that's probably what we want to measure, and not the individual insertions, since I suspect that when inserting a new item in the map, part of its cost is shared with other entries in the map.

Maybe what we want is to keep the current behaviour, but instead of getting the recursive size of the whole values map, break it down by key type and get the recursive size of each group

@mpardalos
Contributor Author

@wz1000

It would be good to have some documentation on how to see/use/interpret these traces.

Should I add that in the README?

@wz1000
Collaborator

wz1000 commented Sep 25, 2020

Should I add that in the README?

The docs/ folder seems like the right place for this. You can also link there from the benchmark README.

@mpardalos
Contributor Author

@wz1000

Did you try this out on a ghcide session? Any memorable findings? Which actions/requests take the most time, which ones leak memory, which ones cause memory consumption to spike, etc?

I can't say I've found anything directly interesting as this stands. Most requests seem basically instant in my (admittedly limited) testing on the ghcide codebase. As for actions, TypeCheck usually takes a while (as expected), and I have also seen ReportImportCycles stand out at ~1s.

Memory consumption is, as expected due to #421, constantly rising. I am hoping this work will help in addressing that; however, I need to find a way of getting more detailed readings first. Hopefully the change @pepeiborra suggested (#826 (comment)) helps.

@pepeiborra
Collaborator

@mpardalos any updates on this?

@mpardalos
Contributor Author

@mpardalos any updates on this?

Sorry for the radio silence, starting my final year of uni has been a bit busy. I haven't worked on this since we last spoke, but I'll try to get it ready to merge this week.

See my comment on the code review for what I think is left to do.

atomically $ modifyTVar pendingRequests (Set.insert _id)
writeChan clientMsgChan $ Response r wrap f
let withNotification old f = Just $ \r -> writeChan clientMsgChan $ Notification r (\lsp ide x -> f lsp ide x >> whenJust old ($ r))
let withResponseAndRequest wrap wrapNewReq f = Just $ \r@RequestMessage{_id} -> do
let withNotification old f = Just $ \r@NotificationMessage{_method} -> otTracedHandler "Notification" (show _method) $
Collaborator

Are these traces correct? The only thing they will be measuring is the time required to forward the request/notification on the channel, not the time we actually spend computing.

Contributor Author

You are completely right. Should I be adding that further below in handleInit?

@mpardalos
Contributor Author

I made it report the size for each key separately (lumping together all the entries for a given key which refer to different files). This could be pretty useful although it is still slow.

Here's a screenshot of what the output might look like in tracy.

[screenshot: Tracy view showing the "values map size_bytes" graph with one graph per key below it]

"values map size_bytes" is the total size of the values map, and each key gets its own graph below that.

As I mentioned before, the worst offenders for the slowdown are GhcSession and GhcSessionIO, which are also the biggest entries. I will try skipping them first (to see how much of a difference that makes) and then try putting their measurement on a separate thread so that every other entry can at least have some more detailed data.
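
A sketch of that split, assuming some measureGroup action exists for a single key group (both names here are illustrative): the expensive session keys get their own, slower loop so every other key can keep a faster cadence.

import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever, void)

-- Measure only the known-expensive keys here, off the main telemetry loop.
startSlowKeyLoop :: (g -> IO ()) -> [g] -> IO ()
startSlowKeyLoop measureGroup expensive = void $ forkIO $ forever $ do
  mapM_ measureGroup expensive
  threadDelay 60000000  -- microseconds: about once a minute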

@mpardalos mpardalos marked this pull request as draft October 19, 2020 15:45
@wz1000
Collaborator

wz1000 commented Oct 19, 2020

@mpardalos great work, that looks awesome!

Here are some sets of rules among which a lot of sharing might be going on: [GhcSession, GhcSessionDeps, GhcSessionIO], [GetModSummaryWithoutTimestamps, GetModSummary], [GetModIfaceFromDisk,GetModIface, GetModIfaceWithoutLinkables], [GetParsedModule,TypeCheck]

It would be awesome if we could group together the entries for such lists too.

Also, can you upload the full trace that you computed somewhere? I'd be very interested in looking at it.

@wz1000
Collaborator

wz1000 commented Oct 19, 2020

What does the gc graph show?

@wz1000
Collaborator

wz1000 commented Oct 19, 2020

As a way to work around any slowness, would it be possible to take periodic heap snapshots and dump them to disk (which should be fast), then load them later to do all the number crunching?

/cc @mpickering

@mpardalos
Contributor Author

What does the gc graph show?

This is part of OpenTelemetry itself. It counts total GC time.
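
For reference, the same cumulative figure is available straight from the RTS; a sketch (requires running with +RTS -T, otherwise getRTSStats throws):

import GHC.Stats (getRTSStats, gc_elapsed_ns)

-- Total elapsed time spent in GC since process start, in seconds.
totalGCSeconds :: IO Double
totalGCSeconds = do
  stats <- getRTSStats
  pure (fromIntegral (gc_elapsed_ns stats) / 1e9)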

As a way to workaround any slowness, could it be possible to take periodic heap snapshots and dump them to disk (which should be fast), and load them later to all the number crunching?

If that's possible it could be very very interesting

Also, can you upload the full trace that you computed somewhere? I'd be very interested in looking at it.

Here it is. It's basically just me jumping around in ghcide's source, opening/closing files, adding syntax errors, etc. It is still slow but I think that skipping those GhcSession(IO)? entries brings it into the realm of usable.

ghcide.eventlog.gz

@wz1000
Collaborator

wz1000 commented Oct 19, 2020

Some more things to measure:

  • number of entries/files for each key
  • Number of stale vs current entries

@pepeiborra
Collaborator

This is starting to look great. Should we focus on getting a minimum useful kernel checked in so that we can iterate over it?

@wz1000
Collaborator

wz1000 commented Oct 20, 2020

Here is one hack that might possibly work to emulate the 'snapshot' idea if you have enough swap:

import Data.Time.Clock (UTCTime)
import System.Posix.Process (ProcessID, forkProcess)
import System.Posix.Signals (raiseSignal, sigSTOP)

sample :: UTCTime -> IO ProcessID
sample time = forkProcess $ do
  -- Stop the child immediately; it keeps a copy-on-write snapshot of the heap
  raiseSignal sigSTOP
  -- Do profiling
  profileResults <- ...
  dumpResultsToFile ("sample-" ++ show time) profileResults

Then we need to collect all of these ProcessIDs (one for each sample) and, before exiting ghcide, raise SIGCONT on each of them one by one, waiting for each process to exit.

However, you do have to be careful not to use any Handles/MVars/locks from the original process when you are computing results or dumping them out. I'm pretty sure it would crash if you tried to write them to the eventlog, so you need to dump them to a newly opened file.

@pepeiborra
Collaborator

It's starting to look like this PR is abandoned.
@mpardalos can we get a minimum useful change merged?

@mpardalos
Contributor Author

As it stands, would you consider this ready to merge? Functionality is there, everything is just quite slow.

@pepeiborra
Collaborator

As long as it is behind a flag, it's mergeable

@pepeiborra
Collaborator

@mpardalos I want to have a play with this, but I don't know how to hook the event trace up to Tracy. Could you add some docs or links?

@wz1000
Collaborator

wz1000 commented Nov 21, 2020

@mpardalos
Contributor Author

I added a doc for how to use this functionality.
I also included how to actually build the Tracy viewer, because the instructions are buried in Tracy's PDF manual.

@pepeiborra
Collaborator

pepeiborra commented Nov 21, 2020

That leads to an error; what's next?

[nix-shell:~/scratch/ghcide]$ eventlog-to-tracy ghcide.eventlog
Converting ghcide.eventlog to ghcide.eventlog.trace.json...
dyld: Library not loaded: libcapstone.4.dylib
  Referenced from: /nix/store/ks7805136p3f308p6i2jwygv5nd0jbmb-tracy-0.7.1/bin/import-chrome
  Reason: image not found
eventlog-to-tracy: callProcess: import-chrome "ghcide.eventlog.trace.json" "ghcide.eventlog.tracy" (exit -6): failed

I am using the following branch: http://github.com/pepeiborra/ghcide/tree/opentelemetry

EDIT: figured it out

@mpardalos
Contributor Author

Was it maybe something unclear in the instructions? It might be a good idea to specify.

& HMap.fromListWith (++)
valuesSize <- sequence $ HMap.mapWithKey (\k v -> withSpan ("Measure " <> (BS.pack $ show k)) $ \sp -> do
{ instrument <- instrumentFor k
; byteSize <- (sizeOf (undefined :: Word) *) <$> recursiveSizeNoGC v
Collaborator

@pepeiborra pepeiborra Nov 23, 2020

This call to recursiveSizeNoGC leads to a segfault if a GC arises while recursiveSizeNoGC is counting, which is quite easy to reproduce.

Given the way recursiveSizeNoGC works, this is impossible to avoid (the act of measuring the size itself allocates memory), but it could be minimised by stopping the world while measuring.

Did you consider this and/or other workarounds @mpardalos ?

Collaborator

Another, much cheaper way to minimise the segfaults in recursiveSizeNoGC is to use System.Mem.Weak to observe GC runs, and abort the count when this happens.
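
A minimal sketch of that trick, assuming a freshly allocated key that is dropped right away; once any GC runs, the key is collected and the weak pointer is emptied (newGCDetector/gcHasRun are illustrative names, not heapsize's API):

import Data.IORef (IORef, newIORef)
import System.Mem.Weak (Weak, deRefWeak, mkWeakPtr)

-- The key is unreachable as soon as this returns, so the next GC
-- tombstones the weak pointer.
newGCDetector :: IO (Weak (IORef ()))
newGCDetector = do
  key <- newIORef ()
  mkWeakPtr key Nothing

-- Nothing means the key was collected, i.e. a GC has run since creation;
-- any size count in progress should then be aborted.
gcHasRun :: Weak (IORef ()) -> IO Bool
gcHasRun w = maybe True (const False) <$> deRefWeak w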

@mpardalos are you accepting PRs for heapsize?

Contributor Author

Yeah absolutely

Collaborator

I have tested the Weak workaround and bingo, no more segfaults. So I'm going to:

  • Send you a heapsize PR
  • Ask you to upload a new heapsize version to Hackage
  • Commandeer this diff (I have made plenty of changes)
  • Get this merged

Collaborator

I spoke too soon, segfaults can still arise if GC strikes in between the Weak check and the address dereferencing. It would be much safer if this code was written in C and accessed through an unsafe foreign import, because that guarantees that a GC cannot strike.

I also investigated a solution for the sharing problems, providing an Applicative interface so that it's possible to do multiple heapsize measurements with the same set of seen pointers. Again, the GC gets in the way, sadly making this approach not viable. One would need to rewrite the whole thing in C.
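
For illustration, the shape such a binding could take (c_recursive_size is a hypothetical C function, not an existing heapsize symbol): an unsafe foreign import is never preempted by the RTS, so no GC can strike mid-call.

import Foreign.C.Types (CSize(..))
import Foreign.Ptr (Ptr)

-- 'unsafe' means the calling capability does not reach a GC safe point
-- during the call, so closures cannot move or be freed while the C code
-- walks the heap.
foreign import ccall unsafe "c_recursive_size"
  c_recursiveSize :: Ptr () -> IO CSize

The flip side is that a long-running unsafe call also blocks every other thread from garbage collecting, so the walk would need to be quick or chunked.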

@pepeiborra
Collaborator

I managed to get full measurements with sharing for a small project by setting a very large nursery (-A4G) and grouping key types in chunks instead of trying to measure the whole values dictionary in one go, as well as optimising the measurement code a bit to allocate less.

However, I still observe crashes due to SIGILL (Illegal Instruction). My attempts to get a backtrace have been unsuccessful; Mac OS got in the way. I'll try to replicate this on Linux tomorrow.

Screenshots below where you can observe the effects of sharing: TypeCheck, GetModIface, GetModSummary, and others have a cost of 1 because everything(?) is reachable from GhcSession, apparently. Sadly not very helpful. The only exception is GetParsedModule, but that's accidental: it wasn't in the sharing group when I took the measurement. Once added to the sharing group, it also has cost 1.

[screenshots: per-key size graphs; most keys show a cost of 1 due to sharing, GetParsedModule being the exception]

@wz1000
Collaborator

wz1000 commented Nov 24, 2020

Why does GetHieAst have a size of 1? I'm pretty sure it's not being shared by anything else.

@pepeiborra
Collaborator

Superseded by #922

@pepeiborra pepeiborra closed this Nov 25, 2020
@pepeiborra
Collaborator

Why does GetHieAst have a size of 1? I'm pretty sure its not being shared by anything else.

This had a simple explanation: Shake extras is reachable from the GhcSessionIO closure.

@wz1000
Collaborator

wz1000 commented Nov 25, 2020

This had a simple explanation: Shake extras is reachable from the GhcSessionIO closure.

Does that mean everything in the Shake store is reported as part of GhcSessionIO?

@pepeiborra
Collaborator

This had a simple explanation: Shake extras is reachable from the GhcSessionIO closure.

Does that mean everything in the Shake store is reported as part of GhcSessionIO?

Yes, I would expect so
