New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Testsuite write-merge failing #14

Open
jberthold opened this Issue Jun 28, 2016 · 14 comments

Comments

Projects
None yet
5 participants
@jberthold
Contributor

jberthold commented Jun 28, 2016

The new test suite write-merge fails on master (f5ac7a0) with an index-out-of-bounds error:

Test suite write-merge: RUNNING...
write-merge: Ix{Int}.index: Index (39) out of range ((0,19))
Test suite write-merge: FAIL

(This is with GHC-7.8.4) but does not seem to relate to the GHC version used to compile the library.
A quick first analysis indicates that serialiseEventLog is causing the failure for older trace files. It seems as though some assumptions about the admissible events are baked into the code?

GHCI session:

*WriteMerge> testFile "test/queens-ghc-7.0.2.eventlog"
Loading package array-0.5.0.0 ... linking ... done.
Loading package deepseq-1.3.0.2 ... linking ... done.
Loading package bytestring-0.10.4.0 ... linking ... done.
Loading package containers-0.5.5.1 ... linking ... done.
Loading package binary-0.7.1.0 ... linking ... done.
*** Exception: Ix{Int}.index: Index (39) out of range ((0,22))
*WriteMerge> testFile "test/queens-ghc-6.12.1.eventlog"
*** Exception: Ix{Int}.index: Index (39) out of range ((0,19))
*WriteMerge> 

GHC-6.12.3 knew 19 event types, GHC-7.0.2 knew 22, test files produced by newer GHCs contain many more event types, and those files work as expected:

*WriteMerge> testFile "test/mdlLogMPI1.eventlog"
test/mdlLogMPI1.eventlog: success
True
*WriteMerge> testFile "test/782stop.eventlog"
Different output at line 95:
Original:  10678973: cap 0: stopping thread 1 (blocked on black hole owned by thread 2)
New:       10678973: cap 0: stopping thread 1 (blocked reading an MVar)
test/782stop.eventlog: failure re-written file does not match the original
False
*WriteMerge> testFile "test/pre77stop.eventlog"
test/pre77stop.eventlog: success
True
*WriteMerge> testFile "test/783stop.eventlog"
test/783stop.eventlog: success
True
*WriteMerge> testFile "test/mandelbrot-mmc-2011-06-14.eventlog"
*** Exception: eventTypeNum UnknownEvent
*WriteMerge> :q

The failure with GHC-7.8.2 is expected (see code https://github.com/haskell/ghc-events/blob/master/GHC/RTS/Events.hs#L1184 ) . One could probably modify the file header of the written log file (adding the event that identifies GHC-7.8.3) to fix it - the format which is written is the one of 7.8.3.

Unclear what causes the failure with mandelbrot-mmc-2011-06-14.eventlog. This file has many event types, whose numbers are neither consecutive nor ordered in the file, which might be the reason.

@kvelicka

This comment has been minimized.

Show comment
Hide comment
@kvelicka

kvelicka Jun 28, 2016

Collaborator

Hi,

First and foremost, thank you very much for a detailed and extensive description of the issue!

I'm aware of this bug and I initially thought that it was introduced along with the incremental parser that recently got merged into master. However, at the beginning of this year I ran a regression test of sorts - I retrofitted [1] ghc-events-0.4.4.0 with this suite and found the same failures, if I remember correctly. While this says very little about the cause of the issues, it makes me wonder just how long has this functionality been broken for.

I think that your analysis may very well be correct, especially given the currently known issues with varying event types (i.e. headers) being emitted by different GHC versions. However, I haven't gotten around to investigating this more in recent days. I'd like to do so in the future but that is unlikely to happen in the coming couple of months at least. However, if you work is affected by this issue, I'll assign it a higher priority than I would otherwise. In addition, if you're interested in poking around this further, I can either try to provide assistance or would at least make use of your findings when I get around to working on this further.

Last but not least - and completely out of curiosity - you are one of the few people that are showing active interest in ghc-events lately. Are you using it in some significant capacity at your work/projects? If so, I'd be interested in hearing experiences/feature requests/information that may aid the development and the feature roadmap for future releases of the library.

[1] Note that the write-merge test suite was a part of my recent PR, there were no tests other than [a smaller] test-versions previously.

Collaborator

kvelicka commented Jun 28, 2016

Hi,

First and foremost, thank you very much for a detailed and extensive description of the issue!

I'm aware of this bug and I initially thought that it was introduced along with the incremental parser that recently got merged into master. However, at the beginning of this year I ran a regression test of sorts - I retrofitted [1] ghc-events-0.4.4.0 with this suite and found the same failures, if I remember correctly. While this says very little about the cause of the issues, it makes me wonder just how long has this functionality been broken for.

I think that your analysis may very well be correct, especially given the currently known issues with varying event types (i.e. headers) being emitted by different GHC versions. However, I haven't gotten around to investigating this more in recent days. I'd like to do so in the future but that is unlikely to happen in the coming couple of months at least. However, if you work is affected by this issue, I'll assign it a higher priority than I would otherwise. In addition, if you're interested in poking around this further, I can either try to provide assistance or would at least make use of your findings when I get around to working on this further.

Last but not least - and completely out of curiosity - you are one of the few people that are showing active interest in ghc-events lately. Are you using it in some significant capacity at your work/projects? If so, I'd be interested in hearing experiences/feature requests/information that may aid the development and the feature roadmap for future releases of the library.

[1] Note that the write-merge test suite was a part of my recent PR, there were no tests other than [a smaller] test-versions previously.

@jberthold

This comment has been minimized.

Show comment
Hide comment
@jberthold

jberthold Jun 29, 2016

Contributor

Hi Karl,

retrofitted [1] ghc-events-0.4.4.0 with this suite and found the same failures, if I remember correctly. While this says very little about the cause of the issues, it makes me wonder just how long has this functionality been broken for.

The put functionality was probably never used extensively, therefore its bug went undetected.
Also, events defined once should never change (design principle of ghc-events), but they did in fact change in the past, so there will be failing cases.
That said, the root cause is probably an accidental detail in the internals of the putEventLog.

Maybe I can drill in a bit more for this bug, but I have limited time to spend.
I am doing this in the evenings - and also need to spend some time updating clients with this latest version. (Also to answer your last question:) I know that you found ghc-events-parallel my earlier drop-in replacement for ghc-events with support for the Parallel Haskell runtime system's events. Now that this is merged in, the two clients for parallel Haskell, Eden-TraceLab and EdenTV (which btw predates ThreadScope by ~4 years) should use the standard ghc-events.

Contributor

jberthold commented Jun 29, 2016

Hi Karl,

retrofitted [1] ghc-events-0.4.4.0 with this suite and found the same failures, if I remember correctly. While this says very little about the cause of the issues, it makes me wonder just how long has this functionality been broken for.

The put functionality was probably never used extensively, therefore its bug went undetected.
Also, events defined once should never change (design principle of ghc-events), but they did in fact change in the past, so there will be failing cases.
That said, the root cause is probably an accidental detail in the internals of the putEventLog.

Maybe I can drill in a bit more for this bug, but I have limited time to spend.
I am doing this in the evenings - and also need to spend some time updating clients with this latest version. (Also to answer your last question:) I know that you found ghc-events-parallel my earlier drop-in replacement for ghc-events with support for the Parallel Haskell runtime system's events. Now that this is merged in, the two clients for parallel Haskell, Eden-TraceLab and EdenTV (which btw predates ThreadScope by ~4 years) should use the standard ghc-events.

@kvelicka

This comment has been minimized.

Show comment
Hide comment
@kvelicka

kvelicka Jul 1, 2016

Collaborator

I see, consolidating the two libraries would definitely be helpful to the
ecosystem.

As for the bug, sounds like one of us will have to look more into it as we
get some free time. I hopefully should, in about a moth or two from now.

Collaborator

kvelicka commented Jul 1, 2016

I see, consolidating the two libraries would definitely be helpful to the
ecosystem.

As for the bug, sounds like one of us will have to look more into it as we
get some free time. I hopefully should, in about a moth or two from now.

@maoe maoe referenced this issue Feb 10, 2017

Merged

Enable Travis CI #16

@maoe maoe referenced this issue Apr 8, 2017

Merged

Refactor #22

@maoe

This comment has been minimized.

Show comment
Hide comment
@maoe

maoe Apr 23, 2017

Member

I just briefly looked at this issue.

At least for the exception thrown for queens-ghc-6.12.1.eventlog the problem happens like this:

  1. In the eventlog, there is the deprecated EVENT_STEAL_SPARK event type whose event ID is 5 in the header.
  2. We parse events whose event type is EVENT_STEAL_SPARK as SparkSteal, which is also used for the newer EVENT_SPARK_STEAL event type.
  3. When serializing the events, we serialize them as EVENT_SPARK_STEAL, whose event ID is 39 which does not exist in the header.
  4. When re-deserializing the events, the deserializer tries to run the parser for event ID 39 and ends up with an index out of range error.

I'm still not sure what the best way to address this. Would we like to replace these deprecated events with the alternatives when deserializing, or restore original events when serializing?

Member

maoe commented Apr 23, 2017

I just briefly looked at this issue.

At least for the exception thrown for queens-ghc-6.12.1.eventlog the problem happens like this:

  1. In the eventlog, there is the deprecated EVENT_STEAL_SPARK event type whose event ID is 5 in the header.
  2. We parse events whose event type is EVENT_STEAL_SPARK as SparkSteal, which is also used for the newer EVENT_SPARK_STEAL event type.
  3. When serializing the events, we serialize them as EVENT_SPARK_STEAL, whose event ID is 39 which does not exist in the header.
  4. When re-deserializing the events, the deserializer tries to run the parser for event ID 39 and ends up with an index out of range error.

I'm still not sure what the best way to address this. Would we like to replace these deprecated events with the alternatives when deserializing, or restore original events when serializing?

@kvelicka

This comment has been minimized.

Show comment
Hide comment
@kvelicka

kvelicka Apr 23, 2017

Collaborator

I think it's fine to emit the newer alternatives when serialising -- I'm not aware of any other notable consumers of eventlogs besides ghc-events.

In addition, this has been broken literally for years, at this point a solution would be a good solution.

Collaborator

kvelicka commented Apr 23, 2017

I think it's fine to emit the newer alternatives when serialising -- I'm not aware of any other notable consumers of eventlogs besides ghc-events.

In addition, this has been broken literally for years, at this point a solution would be a good solution.

@jberthold

This comment has been minimized.

Show comment
Hide comment
@jberthold

jberthold Apr 23, 2017

Contributor

Thanks @maoe for bumping this.
I think the problem is inherent to the property tested by write-merge: Can a round-trip of writing back and re-parsing an eventlog yield the same result?
This is problematic in the general case because particular properties of the different generations of GHC writing the event log are encoded in the parser, and this information is not available to the serialiseLog. See

-- The standard encoding is used here, which is wrong for eventlogs
for the example of the stopThread treatment: I had to drop the special case of GHC-7.8.2 on the floor. The example analysed above is a different case, one where two distinct events in the log generate the same information in the HS data structure. In both cases the log can only be written back one way, so there is no fix for the property...

Just though: If the serialseLog function wrote the header after all events used in the log are known, the resulting log file could probably be parsed without problems, yielding the exact same HS data structure. I am guessing this would work because the array out of bounds originates from the shorter header, and the 7.8.2 special parse for stopThread is activated by a certain header (not tested, just a conjecture for now). Or maybe one could even write a full header of all known events when the log is written?

Contributor

jberthold commented Apr 23, 2017

Thanks @maoe for bumping this.
I think the problem is inherent to the property tested by write-merge: Can a round-trip of writing back and re-parsing an eventlog yield the same result?
This is problematic in the general case because particular properties of the different generations of GHC writing the event log are encoded in the parser, and this information is not available to the serialiseLog. See

-- The standard encoding is used here, which is wrong for eventlogs
for the example of the stopThread treatment: I had to drop the special case of GHC-7.8.2 on the floor. The example analysed above is a different case, one where two distinct events in the log generate the same information in the HS data structure. In both cases the log can only be written back one way, so there is no fix for the property...

Just though: If the serialseLog function wrote the header after all events used in the log are known, the resulting log file could probably be parsed without problems, yielding the exact same HS data structure. I am guessing this would work because the array out of bounds originates from the shorter header, and the 7.8.2 special parse for stopThread is activated by a certain header (not tested, just a conjecture for now). Or maybe one could even write a full header of all known events when the log is written?

@maoe

This comment has been minimized.

Show comment
Hide comment
@maoe

maoe Apr 24, 2017

Member

because particular properties of the different generations of GHC writing the event log are encoded in the parser, and this information is not available to the serialiseLog

I'm not totally sure if this is the case. The set of parsers varies depending on the GHC version as you said but it's constructed from the header content, which is also available in the serializer. I'll take a deeper look into this issue when I have time.

Member

maoe commented Apr 24, 2017

because particular properties of the different generations of GHC writing the event log are encoded in the parser, and this information is not available to the serialiseLog

I'm not totally sure if this is the case. The set of parsers varies depending on the GHC version as you said but it's constructed from the header content, which is also available in the serializer. I'll take a deeper look into this issue when I have time.

@mboes

This comment has been minimized.

Show comment
Hide comment
@mboes

mboes May 25, 2017

Does anyone even need the write-back-to-disk functionality? It seems like an odd feature to me to be able to write our own event logs when we're not the compiler. If we got rid of it, a lot of code would go away...

mboes commented May 25, 2017

Does anyone even need the write-back-to-disk functionality? It seems like an odd feature to me to be able to write our own event logs when we're not the compiler. If we got rid of it, a lot of code would go away...

@maoe

This comment has been minimized.

Show comment
Hide comment
@maoe

maoe May 25, 2017

Member

@mboes I think the write back functionality can be useful. For example, we can write a tool to slice and/or filter specific events from a huge eventlog file to a smaller chunk. It's like the tcpslice for pcap files.

Member

maoe commented May 25, 2017

@mboes I think the write back functionality can be useful. For example, we can write a tool to slice and/or filter specific events from a huge eventlog file to a smaller chunk. It's like the tcpslice for pcap files.

@maoe

This comment has been minimized.

Show comment
Hide comment
@maoe

maoe May 25, 2017

Member

Also it's used publicly at least in this package: https://hackage.haskell.org/package/linux-perf

Member

maoe commented May 25, 2017

Also it's used publicly at least in this package: https://hackage.haskell.org/package/linux-perf

@kvelicka

This comment has been minimized.

Show comment
Hide comment
@kvelicka

kvelicka May 26, 2017

Collaborator

I've been told that it's also useful for assembling the full eventlog for a run of a distributed program where all the workers would be writing their own eventlogs.

Collaborator

kvelicka commented May 26, 2017

I've been told that it's also useful for assembling the full eventlog for a run of a distributed program where all the workers would be writing their own eventlogs.

@jberthold

This comment has been minimized.

Show comment
Hide comment
@jberthold

jberthold May 26, 2017

Contributor

...assembling the full eventlog for a run of a distributed program where all the workers would be writing their own eventlogs.

You might think... (we did for some time when we built EdenTV), but actually the first thing any log-consuming software will do is take it apart into the separate eventlogs :)
That said, I would not remove the functionality, rather think a bit about what the expected behaviour should be and what compromises are acceptable in this.

Contributor

jberthold commented May 26, 2017

...assembling the full eventlog for a run of a distributed program where all the workers would be writing their own eventlogs.

You might think... (we did for some time when we built EdenTV), but actually the first thing any log-consuming software will do is take it apart into the separate eventlogs :)
That said, I would not remove the functionality, rather think a bit about what the expected behaviour should be and what compromises are acceptable in this.

@maoe

This comment has been minimized.

Show comment
Hide comment
@maoe

maoe Jun 2, 2017

Member

Sorry, I didn't mean to close this.

Member

maoe commented Jun 2, 2017

Sorry, I didn't mean to close this.

@osa1

This comment has been minimized.

Show comment
Hide comment
@osa1

osa1 Sep 2, 2018

I haven't debugged this myself yet, but my understanding is that this problem occurs because there are multiple ways to parse an event (in other words we map different events to the same Event) but we only have one serializer for any given event. Looking at the original report and @maoe's comment, it seems to me that the problem occurs like this:

  • GHC 6 and GHC 7 have different formats for EVENT_STEAL_SPARK (GHC 6 uses one more word for the same event). Furthermore, we map two different events (EVENT_SPARK_STEAL and EVENT_STEAL_SPARK) to SparkSteal.

  • The parser decides how to parse an EVENT_STEAL_SPARK (or whether to parse EVENT_SPARK_STEAL or EVENT_STEAL_SPARK) by looking into the file header.

  • Because the serializer has only one way to serialize any given event, all these three variants of the same (or similar) events are serialized in one way.

So basically when parsing we map more stuff to less stuff and lose information.

Two ways to deal with this:

  • Implement something like EventParsers for serialization (maybe call it EventSerializers) that knows which events are supported and what format to use for any given event.

  • Update EventInfo so that for any format of any event that ghc-events can parse there's a distinct constructor in EventInfo. This is probably not a good idea as it means having three events for the example above:

    • EventStealSparkGHC6 for GHC 6 version of EVENT_STEAL_SPARK
    • EventStealSparkGHC7 for GHC 7 version of EVENT_STEAL_SPARK
    • EventSparkSteal for GHC 7's EVENT_SPARK_STEAL
      (I don't know why GHC 7 has both variants of this event)

But before putting more effort into this I'd suggest dropping support for some ancient GHC versions. I don't know what the policy for dropping support and what's the support window, but I can see in the .cabal file that the earliest GHC we support is currently 7.8.4. So perhaps just drop GHC 6 tests, add GHC 7 (or 8) tests, drop GHC 6 parsers etc. Also, this problem will disappear if we drop GHC 6 support and none of the GHC 7 and 8 events have different formats in different GHC versions (I don't know if this is really the case, haven't checked yet).

Any suggestions?

osa1 commented Sep 2, 2018

I haven't debugged this myself yet, but my understanding is that this problem occurs because there are multiple ways to parse an event (in other words we map different events to the same Event) but we only have one serializer for any given event. Looking at the original report and @maoe's comment, it seems to me that the problem occurs like this:

  • GHC 6 and GHC 7 have different formats for EVENT_STEAL_SPARK (GHC 6 uses one more word for the same event). Furthermore, we map two different events (EVENT_SPARK_STEAL and EVENT_STEAL_SPARK) to SparkSteal.

  • The parser decides how to parse an EVENT_STEAL_SPARK (or whether to parse EVENT_SPARK_STEAL or EVENT_STEAL_SPARK) by looking into the file header.

  • Because the serializer has only one way to serialize any given event, all these three variants of the same (or similar) events are serialized in one way.

So basically when parsing we map more stuff to less stuff and lose information.

Two ways to deal with this:

  • Implement something like EventParsers for serialization (maybe call it EventSerializers) that knows which events are supported and what format to use for any given event.

  • Update EventInfo so that for any format of any event that ghc-events can parse there's a distinct constructor in EventInfo. This is probably not a good idea as it means having three events for the example above:

    • EventStealSparkGHC6 for GHC 6 version of EVENT_STEAL_SPARK
    • EventStealSparkGHC7 for GHC 7 version of EVENT_STEAL_SPARK
    • EventSparkSteal for GHC 7's EVENT_SPARK_STEAL
      (I don't know why GHC 7 has both variants of this event)

But before putting more effort into this I'd suggest dropping support for some ancient GHC versions. I don't know what the policy for dropping support and what's the support window, but I can see in the .cabal file that the earliest GHC we support is currently 7.8.4. So perhaps just drop GHC 6 tests, add GHC 7 (or 8) tests, drop GHC 6 parsers etc. Also, this problem will disappear if we drop GHC 6 support and none of the GHC 7 and 8 events have different formats in different GHC versions (I don't know if this is really the case, haven't checked yet).

Any suggestions?

@osa1 osa1 referenced a pull request that will close this issue Sep 8, 2018

Open

Record skipped event data in events #43

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment