Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT 6502: Reset issues on PiZero/Model B #141

Closed
hoglet67 opened this issue Nov 25, 2021 · 22 comments
Closed

JIT 6502: Reset issues on PiZero/Model B #141

hoglet67 opened this issue Nov 25, 2021 · 22 comments

Comments

@hoglet67
Copy link
Owner

hoglet67 commented Nov 25, 2021

Pressing BREAK (and esp. Ctrl-BREAK) on a Pi Zero/Model B is unreliable - about 50% of the time the tube detection fails.

(Originally reported by KenLowe on stardot, but reproduced by myself)

This happens very frequently on beta1, and very very rarely (if ever) on alpha4.

I've bisected all the commits between alpha4 and beta1 and this is the one where the behaviour changes significantly: 4b6710d.

However, looking at the that commit, the changes don't relate in any way to the JIT-6502, and with the default value of vdu=0 in the cmdline.txt, this code is never even executed.

As far as I can tell, this new issue:

  • affects the nornal build, not the debug build
  • affects the BBC B, and not the Master
  • affects the Pi Zero, not that faster Pi Zero 2 and 3A+
  • seems worse if the Ctrl key is held down
  • causes the Tube Detect code to fail, even though the JIT 6502 appears to start normally

The Model B Tube Detect code is pretty simple:

LDA #&81
STA &FEE0
LDA &FEE0
ROR A
BCC &DB4D

I've instrumented this using the ICE-6502.

What seems to be happening is the read of &FEE0 is returning &CE where as it should be reading &CF.

If I set a breakpoint and manually read &FEE0 a second time, then the correct value &CE is returned. So I think what's happening is that when the JIT-6502 is "cold" after a reset, effect of the write of &81 to &FEE0 (which is meant to set bit 0) is being delayed, and is happening too late.

What I'm not clear on is why the Master seems to be immune. I can think of two possibilities:

  • it's related to the reset bounce - on my Model B the characterisc of RST bounce on release of the break key is a burst of short (2us) high pulses approx 2ms before a clean rising RST edge. The Master has a clean reset.
  • it's related to nTUBE glitches. The Master doesn't seem to have these.
@hoglet67
Copy link
Owner Author

There is a chance there could be some improvement. This little setup loop ends up touch 512Kbytes of memory.

loopdejitsetup:
   subs  r3,r3,#1
   str   temp2,[r0],#NEXTJITLET
   sub   temp2,temp2,#NEXTJITLET>>2
   str   temp,[r4],#4   // JITTEDTABLE16
   bne   loopdejitsetup

Breaking it up into two loops and writing 4 four words at once would improve its performance and memory utilisation on Pizero where the cache is very small.

Only a guess, the timing is just on the edge. I think breaking up the JIT init setup loop into two loops will give a significant performance improvement. which if we are too close timing wise should help.

You mean something like this:

// setup table JITLET ( 64K x bl JITLET)
// setup JITTEDTABLE16 with 64K x mov pc,r14
   ldr   temp2,=dojit-JITLET-8
   mov   r0,#JITLET

   mov   r3,#0x4000 // 64K/4
   mov   temp2,temp2,LSR#2
   add   temp2,temp2,#BLINSTRUCTION

loopdejitsetup1:
   subs  r3,r3,#1
   str   temp2,[r0],#NEXTJITLET
   sub   temp2,temp2,#NEXTJITLET>>2
   str   temp2,[r0],#NEXTJITLET
   sub   temp2,temp2,#NEXTJITLET>>2
   str   temp2,[r0],#NEXTJITLET
   sub   temp2,temp2,#NEXTJITLET>>2
   str   temp2,[r0],#NEXTJITLET
   sub   temp2,temp2,#NEXTJITLET>>2
   bne   loopdejitsetup1

   mov   r3,#0x4000 // 64K/4
   ldr   temp,=MOVPCR14INSTRUCTION
   mov   r4,#JITTEDTABLE16

loopdejitsetup2:
   subs  r3,r3,#1
   str   temp,[r4],#4   // JITTEDTABLE16
   str   temp,[r4],#4   // JITTEDTABLE16
   str   temp,[r4],#4   // JITTEDTABLE16
   str   temp,[r4],#4   // JITTEDTABLE16
   bne   loopdejitsetup2

// flush caches

I just tried this and it made no difference....

@dp111
Copy link
Collaborator

dp111 commented Nov 25, 2021 via email

@hoglet67
Copy link
Owner Author

hoglet67 commented Nov 25, 2021

I tried another experiment with the ICE...

I set a breakpoint at D9CD (the start of the Beeb's reset handler). This allows me to wait there for a few seconds before allowing the Beeb to boot and probe the tube. Even if I wait for a few seconds, the bug still happens. This suggests to me it's not loopdejitsetup code being slow, as this still runs as soon a BREAK is released.

What has to run for the STA &FEE0/LDA &FEE0 code to work?

As far as I remember, you'll get:

  • GPU sees the tube write caused by STA &FEE0 and sends a message via the doorbell which causes an ARM FIQ
  • arm_fiq_handler_jit (assembler), which calls
  • tube_io_handler (C), which calls
  • tube_host_write (C), which updates tube_regs[0] in the block of 8 IO words which is seen by the GPU

All that has to happen before LDA &FEE0 gets to the tube read cycle, so within about 3x 2MHz cycles, or 1.5us!

But none of this involved the JIT Co Pro.

So is this all down to that code path being slowed down because the JIT in reset has evicted everything from the cache?

But then why doesn't this show up on the Master?

Maybe it's just the Master code is less time critical, so lets look:

.LE375
LDX #&01
STX LFEE0
LDA LFEE0
EOR #&01
LDX #&81
STX LFEE0
AND LFEE0
ROR A
RTS

That's probably less time critical because it doesn't matter if the first LDA &FEE0 reads stale data, because bit 0 in the stale data will be zero (because it's cleared on reset). But it serves to warm up the cache. Does that seem plausible to you Dominic?

@hoglet67
Copy link
Owner Author

Adding this after the caches have been flushed seems to help significantly:

// Warm up the cache...
   mov  r0,#0x000F
   BL   tube_io_handler

// Continue with existing code...
   mov   r0, #0x10000
   ldrh  temp0, [r0, #-4]        // Fetch the vector address

@dp111
Copy link
Collaborator

dp111 commented Nov 25, 2021 via email

@hoglet67
Copy link
Owner Author

With or without the above fix?

@dp111
Copy link
Collaborator

dp111 commented Nov 25, 2021 via email

@hoglet67
Copy link
Owner Author

Can you possibly test presenting the rube_reg with &CF ?

Not sure exactly what this means....

@dp111
Copy link
Collaborator

dp111 commented Nov 25, 2021 via email

@hoglet67
Copy link
Owner Author

Why 0xCF in particular?

@hoglet67
Copy link
Owner Author

Oh, I get it... let me give that a try....

@hoglet67
Copy link
Owner Author

That's also an effective workaround... (it works independenty of the code to warm up the cache)

@hoglet67
Copy link
Owner Author

static void tube_reset()
-   HSTAT1 |= HBIT_3 | HBIT_2 | HBIT_1;
+   HSTAT1 |= HBIT_3 | HBIT_2 | HBIT_1 | HBIT_0;

@hoglet67
Copy link
Owner Author

I suspect a side effect of initializing the tube_reg[0] to CF rather than CE is that now the Master code will fail, so lets just test that....

... and yes, that moved the problem from the Model B to the Master

Which rules out any RST switch bounce nonsense....

I need to finish now, but I'll come back to this tomorrow....

@hoglet67
Copy link
Owner Author

Warming up the cache is interesting and I'm slightly surprised we haven't seen this sort of problem before.

I've done a bit more testing, and it looks like the Native ARM Co Pro suffers the same issue on the Model B

@hoglet67
Copy link
Owner Author

hoglet67 commented Nov 28, 2021

Dominic,

Rather than pre-load the cache, I wondered if it might be better to try to make tube_io_handler a bit more efficient at dealing the the tube detection case.

This is what I've ended up with:

int tube_io_handler(uint32_t mail)
{
   // 23..16 -> D7..D0
   // 12     -> RST (active high)
   // 11     -> RnW
   // 10..8  -> A2..A0
   //
   // Shortcut writes of 0x81 or 0x01 to the tube control register bit 0
   // This is used for tube detection by the MOS
   // Writes need to be visible within three 6502 bus cycles (1.5us)
   //
   //         D D D D  D D D D  - - - R  R A A A  - - - -  - - - -
   //  mask:  0 1 1 1  1 1 1 1  0 0 0 1  1 1 1 1  0 0 0 0  0 0 0 0 = 0x7F1F00
   // value:  0 0 0 0  0 0 0 1  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0 = 0x010000
   if ((mail & 0x7F1F00) == 0x010000) {
      if (tube_irq & TUBE_ENABLE_BIT) {
         HSTAT1 = (HSTAT1 & ~HBIT_0) ^ ((mail & (1 << 23)) ? HBIT_0 : 0);
         // then process the request as normal
      }
   }

   // Check for Reset
   if (mail & (1 << 12)) {
      tube_irq |= RESET_BIT;
   } else {
      unsigned int addr = (mail >> 8) & 7;
      // Check read write flag
      if (mail & (1 << 11)) {
         tube_host_read(addr);
      } else {
         tube_host_write(addr, (mail >> 16) & 0xFF);
      }
   }
   return tube_irq ;
}

I also got rid of all the non-GPU code that was #ifdeffed out.

The code that is generated for the fast path is:

0e40f02c <tube_io_handler>:
 e40f02c:	e59f3750 	ldr	r3, [pc, #1872] ; 0x7f1f00
 e40f030:	e92d4070 	push	{r4, r5, r6, lr}
 e40f034:	e0033000 	and	r3, r3, r0
 e40f038:	e3530801 	cmp	r3, #65536	; 0x10000
 e40f03c:	e59f4744 	ldr	r4, [pc, #1860]	; address of tube_irq
 e40f040:	0a000015 	beq	e40f09c <tube_io_handler+0x70>

<non-fast-path-code ommitted>

 e40f09c:	e5943000 	ldr	r3, [r4]
 e40f0a0:	e3130008 	tst	r3, #8
 e40f0a4:	0affffe6 	beq	e40f044 <tube_io_handler+0x18>
 e40f0a8:	e3a01202 	mov	r1, #536870912	; 0x20000000
 e40f0ac:	e59120a0 	ldr	r2, [r1, #160]	; 0xa0 == read tube_regs[0]
 e40f0b0:	e1a037a0 	lsr	r3, r0, #15
 e40f0b4:	e2033c01 	and	r3, r3, #256	; 0x100
 e40f0b8:	e3c22c01 	bic	r2, r2, #256	; 0x100
 e40f0bc:	e1833002 	orr	r3, r3, r2
 e40f0c0:	e58130a0 	str	r3, [r1, #160]	; 0xa0 == write tube_regs[0]
 e40f0c4:	eaffffde 	b	e40f044 <tube_io_handler+0x18>

I'm in two minds about whether this is the right fix...

On the one hand, it does make the tube detection case considerable faster.

On the other hand, it makes all the other cases a few instructions longer.

Dave

@dp111
Copy link
Collaborator

dp111 commented Nov 28, 2021 via email

@hoglet67
Copy link
Owner Author

I wonder if it should actually be in the fiq handler so there is less cache lines to load before returning the result.

Possibly, but we have three seperate FIQ Handlers (the normal one, the JIT 6502 one and the Native ARM one)

Tube_io_handler is actually called from quite a few places:

copro-armnativeasm.S:        bl      tube_io_handler        // Update the Tube ULA emulation
copro-armnativeasm.S:        bl      tube_io_handler        // Update the Tube ULA emulation
jit13k.S:   BL   tube_io_handler
jit.S:   BL   tube_io_handler
tube-isr.c:      intr = tube_io_handler(mailbox);
tube-isr.c:      intr = tube_io_handler(mailbox);
tube-isr.c:      intr = tube_io_handler(mailbox);
tube-isr.c:      intr = tube_io_handler(mailbox);
tube-isr.c:      intr = tube_io_handler(mailbox);
tube-isr.c:      intr = tube_io_handler(mailbox);
tube.S:      bl       tube_io_handler

@BigEd
Copy link
Collaborator

BigEd commented Nov 28, 2021

Why not do both the IRQ improvement and the cache-warming? It feels reasonable to me and should add some margin. Unless warming the cache is too inelegant or unmaintainable?

@hoglet67
Copy link
Owner Author

I would rather keep the cache warming in reserve, as in needs to be added seperately to each Co Pro that blats the cache in some way on reset.

@dp111
Copy link
Collaborator

dp111 commented Dec 1, 2021

Just a little update all the uses of tube_io_handler in tube-isr.c are #ifdefined out at the moment. jit13k.S was a commit that shouldn't have been made. copro-armnativeasm.S actuall again only has one use and the other is a #def build option. So there are infact only three uses. The being of each of the FIQ handlers are very similar and so should perhaps actually be a macro.

@hoglet67
Copy link
Owner Author

hoglet67 commented Jan 7, 2022

Closing this for now, until we have some evidence the current approch needs improving.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants