Support ARM #1373

Open
khuey opened this Issue Nov 18, 2014 · 22 comments

@khuey
Member

khuey commented Nov 18, 2014

Should file this since I'm working on it.

https://github.com/khuey/rr/compare/arm

Requires a couple kernel patches at the moment too.

@ignoramous

ignoramous Jan 31, 2015

Would this mean that rr could be used to debug android applications?

@rocallahan

rocallahan Feb 1, 2015

Member

Eventually yes, but that's a lot of extra work beyond the ARM support Kyle is working on.

@MagaTailor

MagaTailor Oct 27, 2015

Any news on ARM support?

@rocallahan

rocallahan Oct 27, 2015

Member

ARM support is not happening in the foreseeable future. We discovered a critical technical issue: ARM processors implement atomic operations using a load-linked/store-conditional pair of instructions, and those operations can fail nondeterministically (from rr's point of view; failures depend on cache state and whether a hardware interrupt occurs between the instructions). So we don't have a performance counter that is deterministic enough for rr to use under those conditions.

To fix this, we'd have to modify rr's design philosophy and instrument all ARM code, perhaps using DynamoRIO or something like that. That's a lot of work and the cost/benefit for Mozilla doesn't seem to be there right now.

If you really need rr-like functionality for ARM and Android, I recommend buying UndoDB from Undo Software.

@matt2909

matt2909 Oct 29, 2015

That is a strange comment, can you explain what is lacking with an approach such as the linux kernel takes for atomic operations:

http://lxr.free-electrons.com/source/arch/arm/include/asm/atomic.h#L41

@rocallahan

rocallahan Oct 29, 2015

Member

There is no difficulty implementing atomic operations. The problem is that they can disturb the performance counters.

For example, suppose we're using the number of retired instructions, measured via HW performance counters, as our progress counter. Suppose we record a simple program that just does an atomic increment using the code sequence you referenced. Suppose that the LL/SC pair succeeds the first time and we record N instructions executed. Now suppose we replay the execution but this time, a hardware interrupt occurs between the ldrex and the strex instructions, forcing the strex to fail and the code to execute another iteration of the loop. The program completes but performance counters report that we have executed N+4 instructions.

This effect means that performance counters are not 100% reliable for our purposes, which makes rr's zero-instrumentation approach infeasible.
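
To make the arithmetic above concrete, here is a minimal sketch (a toy model, not rr code). The function name and the numbers are hypothetical: `loop_body=4` is chosen only to match the N+4 figure in the comment, and the real per-retry cost depends on the exact instruction sequence.

```python
def retired_instructions(prologue, loop_body, strex_failures):
    """Toy count of retired instructions for code containing one LL/SC retry loop.

    Each failed strex re-runs the loop body once, so the total grows with the
    number of failures even though the program's visible result is unchanged.
    All parameter values used below are illustrative assumptions.
    """
    return prologue + loop_body * (strex_failures + 1)

# Recording: the strex succeeds on the first attempt.
recorded = retired_instructions(prologue=10, loop_body=4, strex_failures=0)

# Replay: an interrupt between ldrex and strex forces one retry.
replayed = retired_instructions(prologue=10, loop_body=4, strex_failures=1)

print(recorded, replayed, replayed - recorded)  # 14 18 4
```

The same program state is reached in both runs, but the counter differs by one loop body, which is the mismatch described above.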

@Keno

Keno Sep 18, 2016

Contributor

I've been reading more about ARM performance counters. It seems that at least the newer ARM chips can count failed strex instructions. I wonder whether setting ticks to branches taken - failed strex would be consistent enough for our purposes (It's of course possible for there to be branches in the ll/sc pair, but I don't know how common that is in the real world).
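
A rough sketch of the idea, as a toy model only. It assumes that each failed strex contributes exactly one extra taken branch (the backward branch that retries the loop), that there are no branches inside the LL/SC pair, and that both counters are exact. The function names and counter values are hypothetical.

```python
def read_counters(other_branches, strex_failures):
    """Hypothetical counter readings for a run with some strex failures.

    Assumption: every failed strex adds exactly one taken branch
    (the retry branch) and affects no other branch.
    """
    return {
        "branches_taken": other_branches + strex_failures,
        "strex_failed": strex_failures,
    }

def ticks(counters):
    """Proposed progress measure: taken branches minus failed strex."""
    return counters["branches_taken"] - counters["strex_failed"]

record_run = read_counters(other_branches=100, strex_failures=0)
replay_run = read_counters(other_branches=100, strex_failures=3)

print(ticks(record_run), ticks(replay_run))  # 100 100
```

Under those assumptions the retries cancel out. If the failure counter also includes speculative events, or the retry path contains more than one taken branch, the subtraction no longer cancels.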

@rocallahan

rocallahan Sep 18, 2016

Member

Mmm. Reference?

@khuey

khuey Sep 18, 2016

Member

From my notes (and you should double check this, because they're from over a year ago), there were two issues.

  1. Cortex A17 counter value 0x63 claims to count "Exclusive instruction speculatively executed - STREX fail." That speculative part is a killer.
  2. The Cortex A17 removed the architecturally executed branch counter. The only counter of architectural executions remaining is the instructions retired counter.
@Keno

Keno Sep 18, 2016

Contributor

I was looking at the Cortex-A9 docs, http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388g/BEHDIGBF.html, which has

STREX failed
Counts the number of STREX instructions architecturally executed and failed.

but I do see (as @khuey points out) that some other chips have a similar event described as "speculatively executed", and I'm not sure exactly what that means.

@khuey

khuey Sep 18, 2016

Member

Ah, that's interesting. Unfortunately the A9 is so old that it doesn't support separating user space counts from kernel space counts, so any hopes of using its performance counters for rr died a long time ago.

Speculatively executed means the processor may have tried to perform some work (based off a branch prediction or whatever) that it ends up throwing away later. Having instructions that didn't architecturally execute show up in the counts makes them unsuitable, unfortunately.

@Keno

Keno Oct 3, 2016

Contributor

How fast do these interrupts fire? Could we set an interrupt on STREX failing, even speculatively, then single step past it?

@matt2909

matt2909 Oct 3, 2016

| How fast do these interrupts fire?

That entirely depends on the micro-architecture, but most aggressive
implementations will not offer guarantees about the delay from event firing
to the interrupt being taken. This "skew" can be many tens of cycles in the
extreme case.

@Keno

Keno Oct 3, 2016

Contributor

Do you happen to know of any other way to make strex instructions trap?

@vielmetti

vielmetti Jun 30, 2017

Reopening this issue; any active work on ARM going on now?

@Keno

Keno Jun 30, 2017

Contributor

> any active work on ARM going on now?

Unfortunately, no. I think the current consensus is that we'd have to get changes into the silicon (some mechanism to get an interrupt on strex failing) in order to make rr feasible on ARM.

@andersjel

andersjel Aug 22, 2017

How about running rr in an ARM guest in QEMU? The code for failing strex instructions is generated here.

@rocallahan

rocallahan Aug 22, 2017

Member

I guess we could hack extra features into QEMU but that would not provide the "low overhead" or convenience that we're looking for with rr.

@rpw

rpw Aug 23, 2017

CoreSight tracing might be an alternative to using performance counters on ARM. With Linux kernels >= 4.9 and suitable hardware that exposes the ETM macrocell in the device tree, it is available through the perf interface:

http://events.linuxfoundation.org/sites/events/files/slides/ELC-E16.pdf

@khuey

khuey Aug 23, 2017

Member
@rocallahan

rocallahan Aug 23, 2017

Member

How would Coresight help with #1373 (comment) ?

@vielmetti

vielmetti Nov 29, 2017

@khuey If you need access to OEM hardware information I should have a pretty good way to get that on Arm server-class equipment from Cavium, Huawei, Hisilicon through my work at @packethost .
