PiTubeDirect is a simple, cheap, and compact way to add a second processor to your BBC or Master micro. It's the lowest cost way to run a Tube-enhanced Elite on a fast 6502, or CP/M, DOS+, or Panos on the relevant CPU type, or a gigahertz of Native ARM. It's the fastest 6502 you're likely to see, and the highest performance coprocessor.
Second processors used to cost hundreds or thousands, and some are very rare and still expensive. With PiTubeDirect you just add a simple interface to a Raspberry Pi and connect directly to the Tube socket. It can be fitted under a Beeb or inside a Master, or just cable-connected.
The minimal configuration would use:
- a 40 pin IDC cable
- a pair of 74LVC245A level shifters which change the tube interface from 5V to 3.3V levels
- a Raspberry Pi Zero or any Raspberry Pi you already have
- a micro SD card with the PiTubeDirect package on it
The bill-of-materials for this is approximately ten pounds. But for a little more money you can buy one ready made: see Level Shifter Options.
With this project you will have a configurable coprocessor which can be powered by the Beeb and fitted inside it, with a choice of:
- 274MHz 65C102
- 3MHz 65C102 (*FX 151,230,1) (for games compatibility)
- 112MHz Z80
- 63MHz 80286
- 27MHz 6809
- 59MHz ARM2
- 35MHz 32016
- null co-pro (*FX 151,230,14) (to save having to power off the Pi)
- 1000MHz ARMnative
Equivalent speeds are approximate. Generally a Pi 3 will run faster than a Pi 1. For ideas to exercise each of the CPU models, see Examples for each CoPro core.
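The *FX 151,X,Y command used for switching is the standard OSBYTE 151 call, which writes the byte Y to address &FE00+X in the host's I/O page (SHEILA). A minimal sketch of that address arithmetic follows; the helper names are hypothetical, and how the firmware interprets the written value is not shown here:

```c
#include <stdint.h>

/* SHEILA is the BBC Micro's memory-mapped I/O page, starting at &FE00. */
#define SHEILA_BASE 0xFE00u

/* Hypothetical helper: host address written by *FX 151,x,y. */
static uint16_t fx151_target_address(uint8_t x)
{
    return (uint16_t)(SHEILA_BASE + x);
}

/* Hypothetical helper: true if addr lies in the host-side Tube
 * register window &FEE0-&FEE7 (eight registers). */
static int is_tube_register(uint16_t addr)
{
    return addr >= 0xFEE0u && addr <= 0xFEE7u;
}
```

With X = 230 (&E6) the write lands on &FEE6, one of the eight host-side Tube registers, so the selection travels over the Tube interface itself.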
Here's the minimal configuration, using a Pi Zero and a DIY two-chip level shifter:
A closer view of the level shifter (a PCB design for this is in progress):
A closer view of the Pi Zero:
The 65C02 Co Processor running the CLOCKSP benchmark:
The 65C02 Co Processor running the Tube Elite:
The ARM Co Processor running the CLOCKSP benchmark:
The most complex / desirable / expensive Beeb Co Processor was the 32016:
This runs an operating system called Panos:
And was aimed at the scientific market:
It's possible to dynamically switch between Co Processors using *FX 151,230,N (the same mechanism the Matchbox Co Processor uses, if you are familiar with that):
After hitting BREAK, the Pi has reconfigured itself as an 80x86:
This runs Digital Research DOSPlus 2.1:
Which in turn runs an early graphical windowing system called GEM:
Here's the iconic Paint program from circa 1986 (30 years ago!):
Finally, the whole system, including an old Atari trackball converted to look like an AMX Mouse:
You can see the Pi is now a Pi 3, which helps with the larger/more complex emulators (more below).
How it Works
The Tube chip in an original BBC Co Processor is a custom ULA (Uncommitted Logic Array) that provides four bidirectional FIFOs, allowing the BBC Micro (host) and the Co Processor (parasite) to reliably exchange messages with full flow control.
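As a rough illustration of what "FIFO with flow control" means, here is a minimal sketch of one direction of such a channel: a small byte queue with a "data available" flag for the reader and a "not full" flag for the writer. The type, function names and capacity are hypothetical and are not taken from the Tube ULA or from the PiTubeDirect source:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_CAPACITY 4  /* illustrative only; not a real Tube FIFO depth */

typedef struct {
    uint8_t data[FIFO_CAPACITY];
    size_t head;   /* index of the next byte to read */
    size_t count;  /* number of bytes currently queued */
} byte_fifo;

/* Status flag seen by the reader: is there a byte to collect? */
static bool fifo_data_available(const byte_fifo *f) { return f->count > 0; }

/* Status flag seen by the writer: is there room for another byte? */
static bool fifo_not_full(const byte_fifo *f) { return f->count < FIFO_CAPACITY; }

/* Queue one byte; returns false (and drops nothing) if the FIFO is full. */
static bool fifo_write(byte_fifo *f, uint8_t b)
{
    if (!fifo_not_full(f)) return false;
    f->data[(f->head + f->count) % FIFO_CAPACITY] = b;
    f->count++;
    return true;
}

/* Dequeue the oldest byte; returns false if the FIFO is empty. */
static bool fifo_read(byte_fifo *f, uint8_t *out)
{
    if (!fifo_data_available(f)) return false;
    *out = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return true;
}
```

Each side polls (or is interrupted on) its own flag before touching the data register, which is what lets host and parasite run at different speeds without losing bytes.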
In PiTubeDirect, the functionality of the Tube chip is emulated in software on the Raspberry Pi, and the Tube host interface on the BBC micro is connected to the Raspberry Pi's GPIO header via a pair of 74LVC245A level shifter chips.
The level shifters are necessary because the Tube interface uses 5V levels, whereas the Raspberry Pi's GPIO signals use 3.3V levels. Omitting the level shifters would likely damage the Raspberry Pi, so please don't try!
The Tube host interface is simply an extension of the 6502 bus and operates at 2MHz. The nTUBE signal (indicating an access to one of eight host-side tube registers) becomes active ~100ns into the 6502 bus cycle. This generates an interrupt on the Pi, which then has at most ~400ns to service the access in real time.
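The ~400ns figure follows from simple arithmetic: a 2MHz bus has a 500ns cycle, and nTUBE arrives about 100ns into it. The sketch below spells this out and converts the window into ARM clock cycles; the function names are hypothetical, and the 100ns delay is the approximate figure quoted above:

```c
#include <stdint.h>

/* Length of one bus cycle in nanoseconds for a given bus frequency. */
static uint32_t bus_cycle_ns(uint32_t bus_hz)
{
    return 1000000000u / bus_hz;
}

/* Time left in the cycle once the select signal has become active. */
static uint32_t service_window_ns(uint32_t bus_hz, uint32_t select_delay_ns)
{
    return bus_cycle_ns(bus_hz) - select_delay_ns;
}

/* The same window expressed in CPU clock cycles at cpu_mhz. */
static uint32_t window_in_cpu_cycles(uint32_t window_ns, uint32_t cpu_mhz)
{
    return (uint32_t)(((uint64_t)window_ns * cpu_mhz) / 1000u);
}
```

For example, service_window_ns(2000000, 100) gives 400, and at 1000MHz that is a budget of 400 ARM cycles covering interrupt entry, register decode and the response.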
Clearly, minimizing interrupt latency is crucial to reliable operation, and we use several techniques here:
- dispense with an operating system - PiTubeDirect is a bare metal system where we control everything
- use a FIQ interrupt (so registers don't have to be stacked)
- carefully hand optimize the FIQ handler
- avoid cache misses within the FIQ handler by locking critical code and data into the cache
- if multiple cores are available, dedicate an entire core to the FIQ handler
Doing all this, it's just possible to achieve the required performance.
For more information, see the FIQ interrupt handler walkthrough.
The PiTubeDirect firmware currently includes emulations of the following Beeb Co Processors:
- 65C102 (using 65tube - the fastest known native ARM 65C02 emulation)
- 65C102 (using lib6502 - written in C)
- 80x86 (using Fake86 - written in C)
- ARM2 (using MAME's ARM 2/3/6 emulation - written in C)
- 32016 (using a 32016 emulation that started life in B-Em, and was resurrected earlier this year)
See Credits and Acknowledgements for who we have to thank for each of these emulations.
Several Pi Models are supported, but within the team we are concentrating on the two extremes:
- the £4.00 Pi Zero (BCM2835/ARM1176) which has a single ARM core that runs at up to 1.0GHz
- the £30.00 Pi 3 (BCM2837/ARM Cortex A53) which has four ARM cores running at up to 1.2GHz
On the Pi Zero, the challenge is reducing interrupt latency, regardless of what the main emulator is doing, as the two share the same ARM core. The typical interrupt latency we observe is 80ns. However, if the main emulator has a cache miss at exactly the same time as the host attempts to read a tube register, this can increase to 300ns, which means the read data arrives marginally late.
We have focused on the 6502 emulation using 65tube, which has been reduced in size to ~9KB. In theory this should fit inside the 16KB L1 cache, but in practice we still observe occasional late reads (on a scope). That said, Tube Elite does run reasonably reliably. But we are close to the edge here, and this is best viewed as an experiment that's still in progress.
We now use the GPU to handle the time-critical requests from the host, which means we no longer miss a request.
On the Pi 3, we dedicate one of the cores to interrupt handling, and doing this results in an interrupt latency that is very tightly controlled, and varies between 100ns and 120ns. This provides ample time to reliably service 6502 reads and writes, regardless of how large the main emulator is, and what it is doing.
The above has now been moved over to the GPU for increased performance.
Relationship to earlier projects
PiTubeDirect is closely related to, but distinct from, two earlier Beeb Co Processor projects:
- the Matchbox Co Processor (see github and stardot) implements multiple Co Processors using a Xilinx XC6SLX9 FPGA. More than 50 of these have been built and distributed through the stardot forums. The cost is about £50.
- the PiTubeClient project (see github and stardot) is an extension to the Matchbox Co Processor that allows a range of Co Processors to be emulated in software on a Raspberry Pi.
One of the designs in the Matchbox Co Processor is an "SPI Co Processor" containing a VHDL implementation of the Acorn Tube chip together with an SPI slave interface. A software emulation of a Co Processor, running on the Raspberry Pi, can use SPI to read/write the tube registers. The Raspberry Pi firmware to do all this is PiTubeClient.
PiTubeDirect is an evolution of PiTubeClient that avoids the need to use a Matchbox Co Processor. It does this by emulating the Acorn Tube chip itself in software on the Raspberry Pi. This introduces some very hard real time constraints on the Raspberry Pi, and the fun of this project was/is overcoming these.
Under the hood
If you connect a serial cable to the Pi, you will get some diagnostic logging:
FIRMWARE_VERSION : 572ca1d3
BOARD_MODEL : 00000000
BOARD_REVISION : 00a02082
BOARD_MAC_ADDRESS : 5ceb27b8 17d73569
BOARD_SERIAL : ce5c6935 00000000
EMMC_FREQ : 250.000 MHz 250.000 MHz 250.000 MHz
UART_FREQ : 48.000 MHz 1000.000 MHz 1000.000 MHz
ARM_FREQ : 1000.000 MHz 1000.000 MHz 1000.000 MHz
CORE_FREQ : 400.000 MHz 400.000 MHz 400.000 MHz
V3D_FREQ : 300.000 MHz 300.000 MHz 300.000 MHz
H264_FREQ : 300.000 MHz 300.000 MHz 300.000 MHz
ISP_FREQ : 300.000 MHz 300.000 MHz 300.000 MHz
SDRAM_FREQ : 450.000 MHz 450.000 MHz 450.000 MHz
PIXEL_FREQ : 0.000 MHz -1894.967 MHz -1894.967 MHz
PWM_FREQ : 0.000 MHz 500.000 MHz 500.000 MHz
CORE TEMP : 52.08 °C
CORE VOLTAGE : 1.32 V
SDRAM_C VOLTAGE : 1.20 V
SDRAM_P VOLTAGE : 1.20 V
SDRAM_I VOLTAGE : 1.20 V
CMD_LINE : dma.dmachans=0x7f35 bcm2708_fb.fbwidth=656 bcm2708_fb.fbheight=416 bcm2709.boardrev=0xa02082 bcm2709.serial=0xce5c6935 smsc95xx.macaddr=B8:27:EB:5C:69:35 bcm2708_fb.fbswap=1 bcm2709.uart_clock=48000000 vc_mem.mem_base=0x3dc00000 vc_mem.mem_size=0x3f000000 dwc_otg.lpm_enable=0 console=ttyS0,115200 console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 elevator=deadline copro=0 fsck.repair=no rootwait
COPRO : 0
0 0000000000 1100000000
1 0000220000 0000220011
2 0000000010 0000111110
A0 = GPIO27 = mask 08000000
A1 = GPIO02 = mask 00000004
A2 = GPIO03 = mask 00000008
enable_MMU_and_IDCaches
cpsr = 600001d3
extctrl = 00000000 00000040
ttbcr = 00000000
ttbr0 = 01fac04a
sctrl = 00c5183d
ctype = 84448004
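The A0, A1 and A2 lines in this log show which GPIO pins carry the three Tube address lines that select one of the eight host-side registers. As a rough sketch of how a register index could be recovered from a 32-bit GPIO level word using those masks (the function name is hypothetical, and the real FIQ handler is hand-optimized and may do this quite differently):

```c
#include <stdint.h>

/* Masks copied from the diagnostic log for this particular board. */
#define A0_MASK 0x08000000u  /* GPIO27 */
#define A1_MASK 0x00000004u  /* GPIO02 */
#define A2_MASK 0x00000008u  /* GPIO03 */

/* Hypothetical helper: combine the three address bits into an index 0..7,
 * ignoring every other GPIO bit in the level word. */
static unsigned tube_register_index(uint32_t gpio_levels)
{
    unsigned a0 = (gpio_levels & A0_MASK) ? 1u : 0u;
    unsigned a1 = (gpio_levels & A1_MASK) ? 1u : 0u;
    unsigned a2 = (gpio_levels & A2_MASK) ? 1u : 0u;
    return (a2 << 2) | (a1 << 1) | a0;
}
```

Because the address bits are not adjacent in the GPIO word, some bit shuffling of this kind is needed before the value can be used as a table index.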
On power up, after the MMU, I and D caches are enabled, a short benchmark is run on Core 0:
benchmarking core....
cycle counter = 4000192
L1I_CACHE = 4000013
L1I_CACHE_REFILL = 2
L1D_CACHE = 2
L1D_CACHE_REFILL = 0
L2D_CACHE_REFILL = 2
INST_RETIRED = 6000026
benchmarking io toggling....
cycle counter = 63203584
L1I_CACHE = 3000029
L1I_CACHE_REFILL = 4
L1D_CACHE = 2000002
L1D_CACHE_REFILL = 1
L2D_CACHE_REFILL = 4
INST_RETIRED = 6000028
benchmarking 1KB memory copy....
cycle counter = 3904
L1I_CACHE = 446
L1I_CACHE_REFILL = 5
L1D_CACHE = 520
L1D_CACHE_REFILL = 10
L2D_CACHE_REFILL = 31
INST_RETIRED = 824
benchmarking 2KB memory copy....
cycle counter = 1920
L1I_CACHE = 840
L1I_CACHE_REFILL = 0
L1D_CACHE = 1032
L1D_CACHE_REFILL = 11
L2D_CACHE_REFILL = 19
INST_RETIRED = 1593
benchmarking 4KB memory copy....
cycle counter = 4160
L1I_CACHE = 1597
L1I_CACHE_REFILL = 0
L1D_CACHE = 2056
L1D_CACHE_REFILL = 11
L2D_CACHE_REFILL = 35
INST_RETIRED = 3128
benchmarking 8KB memory copy....
cycle counter = 8960
L1I_CACHE = 3131
L1I_CACHE_REFILL = 0
L1D_CACHE = 4104
L1D_CACHE_REFILL = 26
L2D_CACHE_REFILL = 69
INST_RETIRED = 6200
benchmarking 16KB memory copy....
cycle counter = 15104
L1I_CACHE = 6182
L1I_CACHE_REFILL = 0
L1D_CACHE = 8200
L1D_CACHE_REFILL = 14
L2D_CACHE_REFILL = 132
INST_RETIRED = 12342
benchmarking 32KB memory copy....
cycle counter = 37376
L1I_CACHE = 12325
L1I_CACHE_REFILL = 0
L1D_CACHE = 16392
L1D_CACHE_REFILL = 119
L2D_CACHE_REFILL = 260
INST_RETIRED = 24630
benchmarking 64KB memory copy....
cycle counter = 99200
L1I_CACHE = 24633
L1I_CACHE_REFILL = 0
L1D_CACHE = 32776
L1D_CACHE_REFILL = 189
L2D_CACHE_REFILL = 512
INST_RETIRED = 49208
benchmarking 128KB memory copy....
cycle counter = 224832
L1I_CACHE = 49190
L1I_CACHE_REFILL = 0
L1D_CACHE = 65544
L1D_CACHE_REFILL = 175
L2D_CACHE_REFILL = 1024
INST_RETIRED = 98358
benchmarking 256KB memory copy....
cycle counter = 422272
L1I_CACHE = 98343
L1I_CACHE_REFILL = 0
L1D_CACHE = 131080
L1D_CACHE_REFILL = 264
L2D_CACHE_REFILL = 2048
INST_RETIRED = 196662
benchmarking 512KB memory copy....
cycle counter = 875136
L1I_CACHE = 196647
L1I_CACHE_REFILL = 0
L1D_CACHE = 262152
L1D_CACHE_REFILL = 557
L2D_CACHE_REFILL = 4099
INST_RETIRED = 393270
benchmarking 1024KB memory copy....
cycle counter = 1901376
L1I_CACHE = 393256
L1I_CACHE_REFILL = 0
L1D_CACHE = 524296
L1D_CACHE_REFILL = 268
L2D_CACHE_REFILL = 9069
INST_RETIRED = 786486
The cycle counter is in 1GHz ARM clock cycles.
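Since the counter ticks at the ARM clock rate, the raw figures convert directly into more familiar units. A minimal sketch of that conversion follows; the helper names are hypothetical, and treating an "N KB memory copy" as exactly N x 1024 bytes moved is an assumption, not something the log states:

```c
#include <stdint.h>

/* Average instructions retired per clock cycle. */
static double instructions_per_cycle(uint64_t inst_retired, uint64_t cycles)
{
    return (double)inst_retired / (double)cycles;
}

/* Elapsed time in microseconds for a cycle count at clock_mhz. */
static double cycles_to_microseconds(uint64_t cycles, double clock_mhz)
{
    return (double)cycles / clock_mhz;
}

/* Copy throughput in MB/s (10^6 bytes per second).
 * Bytes per microsecond is numerically the same as MB/s. */
static double copy_megabytes_per_second(uint64_t bytes, uint64_t cycles,
                                        double clock_mhz)
{
    return (double)bytes / cycles_to_microseconds(cycles, clock_mhz);
}
```

Applied to the log above, the core benchmark (6000026 instructions in 4000192 cycles) works out to roughly 1.5 instructions per cycle, and the 1024KB copy (1901376 cycles at 1000MHz) takes about 1.9ms, or on the order of 550 MB/s under the byte-count assumption.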
Then, if there are multiple cores, these are started, and finally the emulator is started:
Raspberry Pi Direct 65C02 (65tube) Client
main running on core 0
starting core 1
SPIN1
starting core 2
SPIN2
starting core 3
CORE3
enable_MMU_and_IDCaches
cpsr = 600001d3
extctrl = 00000000 00000040
ttbcr = 00000000
ttbr0 = 01fac04a
sctrl = 00c5183d
ctype = 84448004
emulator running on core 3
Each time the Co Processor is reset (by hitting BREAK on the Beeb), ARM performance stats can be logged:
cycle counter = 244349525184
L1I_CACHE = 3928583582
L1I_CACHE_REFILL = 79
L1D_CACHE = 123315172
L1D_CACHE_REFILL = 26
L2D_CACHE_REFILL = 113
INST_RETIRED = 26060255
tube reset - copro 0