
Installation failing on Raspberry Pi CM4 for PCI-E driver #280

Open
timonsku opened this issue Dec 13, 2020 · 93 comments
Labels
comp:model (Model related issues) · Hardware:M.2 Accelerator A+E (Coral M.2 Accelerator A+E key issues) · type:support (Support question or issue)

Comments

@timonsku

Following the installation guide for the M.2 module, I get several compilation errors when it's trying to install gasket.
Here is the log of the make process:
gasket-make.log

It seems it's mostly the same few errors repeating:
invalid use of undefined type ‘struct msix_entry’
implicit declaration of function ‘writeq_relaxed’; did you mean ‘writel_relaxed’
implicit declaration of function ‘readq_relaxed’; did you mean ‘readw_relaxed’
implicit declaration of function ‘pci_disable_msix’; did you mean ‘pci_disable_sriov’

This is using gcc version 8.3.0 on the latest Raspbian with kernel 5.4.51-v7l+.
I'm unsure whether this is a compiler, kernel header, or code issue.

@Namburger added the PCIe (Issue relating to our PCIe modules) label on Dec 15, 2020
@Namburger

Namburger commented Dec 15, 2020

Hello @timonsku, we have investigated the CM4 previously and, unfortunately, determined that it won't work with our PCIe modules, as the CPU doesn't have the MSI-X support our modules require.

@timonsku
Author

Hey Namburger,
the Pi engineers have worked on this and have added support for MSI-X in the latest kernel.
See this forum discussion: https://www.raspberrypi.org/forums/viewtopic.php?p=1772216&sid=fa34ae6597591c1f80cb68c8138c6a67#p1772216

@Namburger

As I mentioned, we have explored this path and there are still some ongoing efforts, but I don't believe it's something we can promise. @mbrooksx might be able to give you more info on this.

@timonsku
Author

Oh I see. If it doesn't turn out to be a true hardware limitation, I would be very interested in seeing this get supported.
I currently have hardware in development that would make good use of the M.2 modules.

@usbguru

usbguru commented Dec 15, 2020

@timonsku
Unfortunately this ARM hardware does not support MSI-X. The Raspberry Pi discussion you referenced raised my hopes that limited performance with emulated interrupts might work. Although it still does not work, the ongoing work is encouraging and might lead to performance nearly as good as if MSI-X hardware interrupts were on the ARM silicon. Stay tuned!

@usbguru usbguru reopened this Dec 15, 2020
@usbguru usbguru closed this as completed Dec 15, 2020
@mbrooksx
Member

@timonsku: Yes, I'm actively working with the people in the Pi forum discussion. While MSI-X isn't technically supported by the BCM2711, as you saw from that patch, if software indicates it works, then the PCIe hardware is actually able to map some MSI-X interrupts correctly.

We've validated further than you have (including MSI-X); your errors occur because you're building for the 32-bit kernel while the driver expects 64-bit reads/writes (which is why writeq/readq don't exist). My plan is to customize the driver for the Pi (including 32-bit workarounds) and likely submit it to the Pi kernel rather than trying to update our DKMS package. I will keep you informed of the status.

@mbrooksx mbrooksx reopened this Dec 15, 2020
@timonsku
Author

Awesome that is great to hear :)

@Valdiolus

Great to hear that somebody is working on this issue! I've already received my RPi CM4 + IO Board + PCIe Coral accelerator.
Any news? Maybe I can help?

@markus-k

markus-k commented Jan 15, 2021

Has anyone had a go at this? I've done a bit of debugging and hacking myself and got the kernel module to load and libedgetpu to start an inference (although it never finishes; some event is missing, and there is a HIB error).

There are some changes needed in both the kernel module and the user-space driver, so far primarily replacing 64-bit memory accesses with two 32-bit ones. My progress is here for the module, which I have updated to the latest version from the DKMS package, and here for libedgetpu, but these changes are of course nowhere near merge quality.

This is what libedgetpu logs:

I :273] Starting in normal mode
I :83] Opening /dev/apex_0. read_only=0
I :97] mmap_offset=0x0000000000040000, mmap_size=4096
I :108] Got map addr at 0x0xb6fde000
I :97] mmap_offset=0x0000000000044000, mmap_size=4096
I :108] Got map addr at 0x0xb6fdd000
I :97] mmap_offset=0x0000000000048000, mmap_size=4096
I :108] Got map addr at 0x0xb6fdc000
I :229] Read: offset = 0x00000000000486f0, value: = 0x0000000000000000, w0=0x00000000, w1=0x00000000
I :191] Write: offset = 0x00000000000487a8, value = 0x0000000000000000
I :229] Read: offset = 0x0000000000048578, value: = 0x0000000000000010, w0=0x00000010, w1=0x00000000
I :136] MmuMapper#Map() : 00000000b6627000 -> 0000000001000000 (1 pages) flags=00000000.
I :55] MapMemory() page-aligned : device_address = 0x0000000001000000
I :169] Queue base : 0xb6627000 -> 0x0000000001000000 [4096 bytes]
I :136] MmuMapper#Map() : 00000000b6628000 -> 0000000001001000 (1 pages) flags=00000000.
I :55] MapMemory() page-aligned : device_address = 0x0000000001001000
I :179] Queue status block : 0xb6628000 -> 0x0000000001001000 [16 bytes]
I :191] Write: offset = 0x0000000000048590, value = 0x0000000001000000
I :191] Write: offset = 0x0000000000048598, value = 0x0000000001001000
I :191] Write: offset = 0x00000000000485a0, value = 0x0000000000000100
I :191] Write: offset = 0x0000000000048568, value = 0x0000000000000005
I :229] Read: offset = 0x0000000000048570, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
I :229] Read: offset = 0x00000000000486d0, value: = 0x0000000000000000, w0=0x00000000, w1=0x00000000
I :191] Write: offset = 0x0000000000044018, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000044158, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000044198, value = 0x0000000000000001
I :191] Write: offset = 0x00000000000441d8, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000044218, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000048788, value = 0x000000000000007f
I :229] Read: offset = 0x0000000000048788, value: = 0x000000000000007f, w0=0x0000007f, w1=0x00000000
I :191] Write: offset = 0x00000000000400c0, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040150, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040110, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040250, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040298, value = 0x0000000000000001
I :191] Write: offset = 0x00000000000402e0, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040328, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040190, value = 0x0000000000000001
I :191] Write: offset = 0x00000000000401d0, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040210, value = 0x0000000000000001
I :191] Write: offset = 0x00000000000486e8, value = 0x0000000000000000
I :45] Set event fd : event_id:0 -> event_fd:7,
I :45] Set event fd : event_id:4 -> event_fd:11,
I :62] event_fd=7. Monitor thread begin.
I :45] Set event fd : event_id:5 -> event_fd:12,
I :45] Set event fd : event_id:6 -> event_fd:13,
I :62] event_fd=12. Monitor thread begin.
I :62] event_fd=11. Monitor thread begin.
I :45] Set event fd : event_id:7 -> event_fd:14,
I :62] event_fd=13. Monitor thread begin.
I :45] Set event fd : event_id:8 -> event_fd:15,
I :62] event_fd=14. Monitor thread begin.
I :45] Set event fd : event_id:9 -> event_fd:16,
I :45] Set event fd : event_id:10 -> event_fd:17,
I :62] event_fd=15. Monitor thread begin.
I :45] Set event fd : event_id:11 -> event_fd:18,
I :62] event_fd=16. Monitor thread begin.
I :62] event_fd=17. Monitor thread begin.
I :45] Set event fd : event_id:12 -> event_fd:19,
I :62] event_fd=18. Monitor thread begin.
I :191] Write: offset = 0x00000000000486a0, value = 0x000000000000000f
I :191] Write: offset = 0x00000000000485c0, value = 0x0000000000000001
I :191] Write: offset = 0x00000000000486c0, value = 0x0000000000000001
I :172] Opening device at /dev/apex_0
I :62] event_fd=19. Monitor thread begin.
I :75] event_fd=19. Monitor thread got num_events=1.
I :191] Write: offset = 0x00000000000486c0, value = 0x0000000000000000
I :191] Write: offset = 0x00000000000486c8, value = 0x0000000000000000
I :229] Read: offset = 0x00000000000486f0, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
I :229] Read: offset = 0x0000000000048700, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
E :254] HIB Error. hib_error_status = 0000000000000001, hib_first_error_status = 0000000000000001
I :75] event_fd=19. Monitor thread got num_events=1.
I :191] Write: offset = 0x00000000000486c0, value = 0x0000000000000000
I :191] Write: offset = 0x00000000000486c8, value = 0x0000000000000000
I :229] Read: offset = 0x00000000000486f0, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
I :229] Read: offset = 0x0000000000048700, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
E :254] HIB Error. hib_error_status = 0000000000000001, hib_first_error_status = 0000000000000001
----INFERENCE TIME----
Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.
I :47] Adding input "map/TensorArrayStack/TensorArrayGatherV3" with 150528 bytes.
I :58] Adding output "prediction" with 965 bytes.
I :167] Request prepared, total batch size: 1, total TPU requests required: 1.
I :310] Request [0]: Submitting P0 request immediately.
I :373] Request [0]: Need to map parameters.
I :136] MmuMapper#Map() : 00000000ad93d000 -> 8000000000000000 (953 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x8000000000000000
I :252] Mapped params : Buffer(ptr=0xad93d000) -> 0x8000000000000000, 3900864 bytes.
I :252] Mapped params : Buffer(ptr=(nil)) -> 0x0000000000000000, 0 bytes.
I :387] Request [0]: Need to do parameter-caching.
I :80] [0] Request constructed.
I :46] InstructionBuffers created.
I :653] Created new instruction buffers.
I :75] Mapped scratch : Buffer(ptr=(nil)) -> 0x0000000000000000, 0 bytes.
I :368] MapDataBuffers() done.
I :187] Linking Parameter: 0x8000000000000000
I :136] MmuMapper#Map() : 0000000001266000 -> 8000000000400000 (3 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x8000000000400000
I :223] Mapped "instructions" : Buffer(ptr=0x1266000) -> 0x8000000000400000, 9680 bytes. Direction=1
I :384] MapInstructionBuffers() done.
I :481] [0] SetState old=0, new=1.
I :393] [0] NotifyRequestSubmitted()
I :481] [0] SetState old=1, new=2.
I :83] Request[0]: Submitted
I :401] [0] NotifyRequestActive()
I :481] [0] SetState old=2, new=3.
I :133] Request[0]: Scheduling DMA[0]
I :394] Adding an element to the host queue.
I :191] Write: offset = 0x00000000000485a8, value = 0x0000000000000001
I :80] [1] Request constructed.
I :113] Adding input "map/TensorArrayStack/TensorArrayGatherV3" with 150528 bytes.
I :188] Adding output "prediction" with 965 bytes.
I :46] InstructionBuffers created.
I :653] Created new instruction buffers.
I :75] Mapped scratch : Buffer(ptr=(nil)) -> 0x0000000000000000, 0 bytes.
I :136] MmuMapper#Map() : 0000000001226000 -> 8000000000440000 (38 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x8000000000440000
I :223] Mapped "map/TensorArrayStack/TensorArrayGatherV3" : Buffer(ptr=0x1226440) -> 0x8000000000440440, 150528 bytes. Direction=1
I :136] MmuMapper#Map() : 0000000001276000 -> 8000000000404000 (1 pages) flags=00000004.
I :55] MapMemory() page-aligned : device_address = 0x8000000000404000
I :223] Mapped "prediction" : Buffer(ptr=0x1276000) -> 0x8000000000404000, 968 bytes. Direction=2
I :368] MapDataBuffers() done.
I :93] Linking map/TensorArrayStack/TensorArrayGatherV3[0]: 0x8000000000440440
I :93] Linking prediction[0]: 0x8000000000404000
I :136] MmuMapper#Map() : 00000000012b9000 -> 8000000000420000 (32 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x8000000000420000
I :223] Mapped "instructions" : Buffer(ptr=0x12b9000) -> 0x8000000000420000, 129536 bytes. Direction=1
I :384] MapInstructionBuffers() done.
I :481] [1] SetState old=0, new=1.
I :393] [1] NotifyRequestSubmitted()
I :481] [1] SetState old=1, new=2.
I :83] Request[1]: Submitted
I :401] [1] NotifyRequestActive()
I :481] [1] SetState old=2, new=3.
I :133] Request[1]: Scheduling DMA[0]
I :394] Adding an element to the host queue.
I :191] Write: offset = 0x00000000000485a8, value = 0x0000000000000002

Also, the only interrupt firing seems to be the fatal-error one:

cat /sys/class/apex/apex_0/interrupt_counts
0x00: 0
0x01: 0
0x02: 0
0x03: 0
0x04: 0
0x05: 0
0x06: 0
0x07: 0
0x08: 0
0x09: 0
0x0a: 0
0x0b: 0
0x0c: 2

@Namburger

@markus-k whoa, thanks for sharing that.
@mbrooksx for awareness

@hiwudery

@markus-k thank you for sharing.
I added othbootargs=gasket.dma_bit_mask=32 to avoid the HIB error.
But after running the sample program, I still get the following errors.
Do you have any ideas? (Raspbian OS is 32-bit; all the code was downloaded from markus-k's repo.)
Thank you
-Jack

(screenshots attached: messageImage_1612070152087, messageImage_1612070000068)
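For anyone else trying this: gasket's dma_bit_mask is a module parameter, so besides the boot arguments it can be set via a modprobe config. A sketch (paths are the standard Raspberry Pi OS locations; the parameter name is taken from the comments above):

```shell
# Option 1: kernel command line - append to /boot/cmdline.txt (one line):
#   gasket.dma_bit_mask=32
# Option 2: modprobe configuration:
echo 'options gasket dma_bit_mask=32' | sudo tee /etc/modprobe.d/gasket.conf
# Reload the modules so the parameter takes effect
# (apex depends on gasket, so remove it first):
sudo rmmod apex gasket 2>/dev/null; sudo modprobe apex
# Verify the live value:
cat /sys/module/gasket/parameters/dma_bit_mask
```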

@markus-k

@hiwudery That's weird. Your upper and lower 32 bits are cloned when reading from the device (see the lines with I :229), which my patch should fix. Maybe the compiler optimized the two reads into one ldrd? But since that still performs two 32-bit accesses, I don't really understand why that happens.

I just tried setting dma_bit_mask but still get HIB errors, in addition to out-of-memory errors when mapping buffers. Also from dmesg:

[  971.201472] apex 0000:01:00.0: gasket_perform_mapping i 0
[  971.201480] apex 0000:01:00.0: gasket_page_table_map done: ha b657c000 daddr 1000000 num 1, flags 0 ret 0
[  971.201552] apex 0000:01:00.0: gasket_perform_mapping i 0
[  971.201558] apex 0000:01:00.0: gasket_page_table_map done: ha b657d000 daddr 1001000 num 1, flags 0 ret 0
[  971.271839] apex 0000:01:00.0: gasket_alloc_extended_subtable -> fail to map page ffffffffffffffff [pfn 6d9fed66 phys 732d8923]
[  971.271854] apex 0000:01:00.0: no memory for extended addr subtable
[  971.271861] apex 0000:01:00.0: page table slots (0,0) (@ 0x8000000000000000) to (8191,511) are not available
[  971.271868] apex 0000:01:00.0: gasket_page_table_map done: ha ad63c000 daddr 8000000000000000 num 953, flags 2 ret -12
[  971.271907] apex 0000:01:00.0: gasket_alloc_extended_subtable -> fail to map page ffffffffffffffff [pfn 6d9fed66 phys 732d8923]
[  971.271915] apex 0000:01:00.0: no memory for extended addr subtable
[  971.271921] apex 0000:01:00.0: page table slots (0,0) (@ 0x8000000000000000) to (8191,511) are not available
[  971.271928] apex 0000:01:00.0: gasket_page_table_map done: ha ad63c000 daddr 8000000000000000 num 953, flags 0 ret -12

I'm also not sure if dma_bit_mask is right here. The comment says it's for PCIe controllers which can't do 64-bit addressing, but the Raspberry Pi's PCIe controller can do 64-bit addressing; it just only supports 32-bit wide accesses (as noted by PhilE here).

@mbrooksx
Member

mbrooksx commented Feb 3, 2021

Yes, what you've done is essentially everything I've done for debugging. The only additional change you alluded to is correct: the compiler is too smart for libedgetpu and expects a competent system that would be able to do 64-bit-wide accesses. I fixed this by using volatile variables to skip caching. My repos with the progress are:
https://github.com/mbrooksx/libedgetpu (Userspace)
https://github.com/mbrooksx/pi-cm4-gasket-hacks (Kernel)

Note that I added an additional print: the host-side page address for the failed DMA transaction (it reports 0x100004000000000, which is outside of the Pi's RAM). The hope was that dma_bit_mask and the command line swiotlb=65536 would create shadow registers in the 32-bit space, but the Pi's PCIe restrictions are very challenging. It is likely the coherent memory (set up in libedgetpu) is corrupted, and thus the shared memory between the two is passing invalid information.

The other option that may be easier is the 32-bit kernel. It has issues with allocating enough BAR memory, but with some device tree tweaks this could likely be fixed. Paired with the 32-bit-"aware" user space, this may be an easier path. I've asked the Pi team to investigate this as well.

@geerlingguy

@mbrooksx - And for the benefit of anyone who hasn't touched BAR space allocations, here's a guide I wrote a few months back while testing graphics cards on the CM4: https://gist.github.com/geerlingguy/9d78ea34cab8e18d71ee5954417429df

The latest 5.10.y kernels for Pi OS have already increased the default allocation to 1 GB, I think (maybe even 4 or 8 GB? I don't remember if I followed up and checked those commits).
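For anyone checking their own board, a quick sketch for inspecting what the kernel actually assigned (assuming the Coral enumerates at 01:00.0, as in the dmesg output elsewhere in this thread):

```shell
# BAR regions assigned to the device
sudo lspci -vv -s 01:00.0 | grep -i region
# Overall PCIe memory windows reported at boot
dmesg | grep -i -E 'pci.*(mem|bridge window)'
```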

@markus-k

markus-k commented Feb 4, 2021

> Yes, what you've done is essentially everything I've done for debug. The only additional change you alluded to is correct - the compiler is too smart for libedgetpu and expects a competent system that would be able to have 64-bit wide accesses. I fixed this by using volatile variables to skip caching. My repos of progress are:
> https://github.com/mbrooksx/libedgetpu (Userspace)
> https://github.com/mbrooksx/pi-cm4-gasket-hacks (Kernel)
>
> Note that I added an additional print - the host-side page address for the failed DMA transaction (it reports 0x100004000000000 - which is outside of the Pi RAM). The hope is that dma_bit_mask and command line swiotlb=65536 would create shadow registers in the 32-bit space but the Pi PCIe restrictions are very challenging. It is likely the coherent memory (setup in libedgetpu) is corrupted and thus the shared memory between the two is passing invalid information.
>
> The other option that may be easier is the 32-bit kernel. It has issues with allocating enough BAR memory, but with some device tree tweaks this could likely be fixed. This paired with the 32-bit "aware" user-space may be an easier path. I've asked the Pi team to investigate this as well.

Alright, at least I haven't been looking in the completely wrong place. I've done most of my debugging on a 32-bit kernel so far. The default BAR space seems to be 1 GB; I'm not sure if that's enough, but I'm not seeing any BAR allocation errors.

In case this helps anyone, here are some more debug logs. I've added your additional debug print, on a 32-bit kernel without any additional parameters:

[   77.630936] apex 0000:01:00.0: Fault VA: 0x0
[   77.630952] apex 0000:01:00.0: Fault VA: 0x0
[   77.635926] apex 0000:01:00.0: Fault VA: 0x0
[   77.635940] apex 0000:01:00.0: Fault VA: 0x0
[   77.635953] apex 0000:01:00.0: Fault VA: 0x0
[   77.635966] apex 0000:01:00.0: Fault VA: 0x0
[   77.635978] apex 0000:01:00.0: Fault VA: 0x0
[   77.635990] apex 0000:01:00.0: Fault VA: 0x0
[   77.636002] apex 0000:01:00.0: Fault VA: 0x0
[   77.636014] apex 0000:01:00.0: Fault VA: 0x0
[   83.141193] apex 0000:01:00.0: Fault VA: 0x1001000
[   83.141216] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x8, Simple: 0x1001
[   83.141237] apex 0000:01:00.0: Computed Failing Bus Addr: 0x40c800000
[   83.141259] apex 0000:01:00.0: Fault VA: 0x1001000
[   83.141277] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x8, Simple: 0x1001
[   83.141296] apex 0000:01:00.0: Computed Failing Bus Addr: 0x40c800000
[   83.141320] apex 0000:01:00.0: Fault VA: 0xffffffffffffffff
[   83.141345] apex 0000:01:00.0: Fault VA: 0xffffffff
[   83.141362] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x7ff, Simple: 0x1fff
[   83.141381] apex 0000:01:00.0: Computed Failing Bus Addr: 0x0
[   83.141402] apex 0000:01:00.0: Fault VA: 0x0
[   83.150222] apex 0000:01:00.0: Fault VA: 0x0
[   83.150243] apex 0000:01:00.0: Fault VA: 0x0
[   83.150263] apex 0000:01:00.0: Fault VA: 0x0
[   83.150284] apex 0000:01:00.0: Fault VA: 0x0
[   83.150309] apex 0000:01:00.0: Fault VA: 0xffffffffffffffff

I've also tried using gasket.dma_bit_mask=32 swiotlb=65536 on a 32-bit kernel:

[   41.372303] apex 0000:01:00.0: Fault VA: 0x0
[   41.372321] apex 0000:01:00.0: Fault VA: 0x0
[   41.378062] apex 0000:01:00.0: Fault VA: 0x0
[   41.378079] apex 0000:01:00.0: Fault VA: 0x0
[   41.378094] apex 0000:01:00.0: Fault VA: 0x0
[   41.378109] apex 0000:01:00.0: Fault VA: 0x0
[   41.378124] apex 0000:01:00.0: Fault VA: 0x0
[   41.378139] apex 0000:01:00.0: Fault VA: 0x0
[   41.378153] apex 0000:01:00.0: Fault VA: 0x0
[   41.378168] apex 0000:01:00.0: Fault VA: 0x0
[   41.628343] ------------[ cut here ]------------
[   41.628367] WARNING: CPU: 3 PID: 707 at kernel/dma/swiotlb.c:683 swiotlb_map+0x38c/0x43c
[   41.628374] apex 0000:01:00.0: swiotlb addr 0x0000000415400000+4096 overflow (mask ffffffff, bus limit 47fffffff).
[   41.628379] Modules linked in: sha256_generic cfg80211 rfkill 8021q garp stp llc binfmt_misc v3d raspberrypi_hwmon vc4 gpu_sched dwc2 cec roles drm_kms_helper drm bcm2835_isp(C) i2c_bcm2835 bcm2835_codec(C) bcm2835_v4l2(C) drm_panel_orientation_quirks v4l2_mem2mem bcm2835_mmal_vchiq(C) videobuf2_dma_contig videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc apex(C) snd_soc_core vc_sm_cma(C) gasket(C) snd_compress snd_pcm_dmaengine snd_pcm snd_timer snd syscopyarea sysfillrect sysimgblt fb_sys_fops backlight rpivid_mem uio_pdrv_genirq uio i2c_dev ip_tables x_tables ipv6
[   41.628599] CPU: 3 PID: 707 Comm: python3 Tainted: G         C        5.10.6-v7l+ #6
[   41.628602] Hardware name: BCM2711
[   41.628605] Backtrace:
[   41.628617] [<c0b84b94>] (dump_backtrace) from [<c0b84f24>] (show_stack+0x20/0x24)
[   41.628621]  r7:ffffffff r6:00000000 r5:60000013 r4:c12e6c98
[   41.628626] [<c0b84f04>] (show_stack) from [<c0b892bc>] (dump_stack+0xcc/0xf8)
[   41.628632] [<c0b891f0>] (dump_stack) from [<c02216d4>] (__warn+0xfc/0x114)
[   41.628637]  r10:00001000 r9:00000009 r8:c02a5a50 r7:000002ab r6:00000009 r5:c02a5a50
[   41.628640]  r4:c0e3cd00 r3:c1205094
[   41.628645] [<c02215d8>] (__warn) from [<c0b856c8>] (warn_slowpath_fmt+0xa4/0xd8)
[   41.628648]  r7:000002ab r6:c0e3cd00 r5:c1205048 r4:c0e3ccbc
[   41.628654] [<c0b85628>] (warn_slowpath_fmt) from [<c02a5a50>] (swiotlb_map+0x38c/0x43c)
[   41.628658]  r9:c1b8b070 r8:c1205048 r7:00000000 r6:ffffffff r5:00000000 r4:ffffffff
[   41.628664] [<c02a56c4>] (swiotlb_map) from [<c02a0668>] (dma_map_page_attrs+0x254/0x394)
[   41.628668]  r10:00000001 r9:00001000 r8:c1b8b1e0 r7:00000000 r6:ffffffff r5:c1205048
[   41.628671]  r4:c1b8b070
[   41.628690] [<c02a0414>] (dma_map_page_attrs) from [<bf115184>] (gasket_map_extended_pages+0x100/0x45c [gasket])
[   41.628694]  r10:00000000 r9:c4112000 r8:c32ab700 r7:f09dc000 r6:00000200 r5:000003b9
[   41.628697]  r4:f085d018
[   41.628717] [<bf115084>] (gasket_map_extended_pages [gasket]) from [<bf115900>] (gasket_page_table_map+0xa8/0x100 [gasket])
[   41.628721]  r10:c32ab740 r9:ad63c000 r8:00000000 r7:80000000 r6:c2f97c00 r5:c32ab700
[   41.628724]  r4:000003b9
[   41.628741] [<bf115858>] (gasket_page_table_map [gasket]) from [<bf112a9c>] (gasket_map_buffers_common+0x90/0xa8 [gasket])
[   41.628745]  r10:00000005 r9:00000001 r8:c30e1180 r7:4028dc0c r6:c2f97c00 r5:c2f97c00
[   41.628748]  r4:c32a5d90
[   41.628767] [<bf112a0c>] (gasket_map_buffers_common [gasket]) from [<bf112cac>] (gasket_handle_ioctl+0x1f8/0x8e0 [gasket])
[   41.628770]  r5:beb40fa0 r4:c1205048
[   41.628788] [<bf112ab4>] (gasket_handle_ioctl [gasket]) from [<bf1106f8>] (gasket_ioctl+0x9c/0x118 [gasket])
[   41.628792]  r9:beb40fa0 r8:c2f97c00 r7:bf09a1b0 r6:4028dc0c r5:c30e1180 r4:c1205048
[   41.628805] [<bf11065c>] (gasket_ioctl [gasket]) from [<c0451180>] (sys_ioctl+0x1d4/0x8ec)
[   41.628809]  r9:c32a4000 r8:00000000 r7:c30e1180 r6:c30e1181 r5:c1205048 r4:4028dc0c
[   41.628815] [<c0450fac>] (sys_ioctl) from [<c0200040>] (ret_fast_syscall+0x0/0x28)
[   41.628818] Exception stack(0xc32a5fa8 to 0xc32a5ff0)
[   41.628822] 5fa0:                   beb40f9c 00000000 00000005 4028dc0c beb40fa0 00000005
[   41.628826] 5fc0: beb40f9c 00000000 b454da7c 00000036 00000001 01f0349c 00000000 b48a4bbc
[   41.628829] 5fe0: b454db58 beb40f74 b443ba3f b6cd551c
[   41.628833]  r10:00000036 r9:c32a4000 r8:c0200204 r7:00000036 r6:b454da7c r5:00000000
[   41.628836]  r4:beb40f9c
[   41.628840] ---[ end trace a2d67e6b70f87dd2 ]---
[   41.628855] apex 0000:01:00.0: no memory for extended addr subtable
[   41.628861] apex 0000:01:00.0: page table slots (0,0) (@ 0x8000000000000000) to (8191,511) are not available
[   41.628911] apex 0000:01:00.0: no memory for extended addr subtable
[   41.628917] apex 0000:01:00.0: page table slots (0,0) (@ 0x8000000000000000) to (8191,511) are not available
[   41.646322] apex 0000:01:00.0: Fault VA: 0x1001000
[   41.646330] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x8, Simple: 0x1001
[   41.646338] apex 0000:01:00.0: Computed Failing Bus Addr: 0xc800000
[   41.646347] apex 0000:01:00.0: Fault VA: 0x1001000
[   41.646352] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x8, Simple: 0x1001
[   41.646359] apex 0000:01:00.0: Computed Failing Bus Addr: 0xc800000
[   41.646372] apex 0000:01:00.0: Fault VA: 0xffffffffffffffff
[   41.646384] apex 0000:01:00.0: Fault VA: 0xffffffff
[   41.646389] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x7ff, Simple: 0x1fff
[   41.646396] apex 0000:01:00.0: Computed Failing Bus Addr: 0xdeadbeef
[   41.646405] apex 0000:01:00.0: Fault VA: 0x0
[   41.648266] apex 0000:01:00.0: Fault VA: 0x0
[   41.648275] apex 0000:01:00.0: Fault VA: 0x0
[   41.648283] apex 0000:01:00.0: Fault VA: 0x0
[   41.648292] apex 0000:01:00.0: Fault VA: 0x0
[   41.648305] apex 0000:01:00.0: Fault VA: 0xffffffffffffffff

In this case mapping the buffer fails in libedgetpu:

I :192] Write: offset = 0x00000000000486a0, value = 0x000000000000000f
I :62] event_fd=19. Monitor thread begin.
I :192] Write: offset = 0x00000000000485c0, value = 0x0000000000000001
I :192] Write: offset = 0x00000000000486c0, value = 0x0000000000000001
I :75] event_fd=19. Monitor thread got num_events=1.
I :192] Write: offset = 0x00000000000486c0, value = 0x0000000000000000
I :192] Write: offset = 0x00000000000486c8, value = 0x0000000000000000
I :231] Read: offset = 0x00000000000486f0, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
I :172] Opening device at /dev/apex_0
I :231] Read: offset = 0x0000000000048700, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
E :254] HIB Error. hib_error_status = 0000000000000001, hib_first_error_status = 0000000000000001
I :75] event_fd=19. Monitor thread got num_events=1.
I :192] Write: offset = 0x00000000000486c0, value = 0x0000000000000000
I :192] Write: offset = 0x00000000000486c8, value = 0x0000000000000000
I :231] Read: offset = 0x00000000000486f0, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
I :231] Read: offset = 0x0000000000048700, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
E :254] HIB Error. hib_error_status = 0000000000000001, hib_first_error_status = 0000000000000001
----INFERENCE TIME----
Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.
I :47] Adding input "map/TensorArrayStack/TensorArrayGatherV3" with 150528 bytes.
I :58] Adding output "prediction" with 965 bytes.
I :167] Request prepared, total batch size: 1, total TPU requests required: 1.
I :310] Request [0]: Submitting P0 request immediately.
I :373] Request [0]: Need to map parameters.
I :118] Failed to map buffer with flags, error -1
Traceback (most recent call last):
  File "classify_image.py", line 126, in <module>
    main()
  File "classify_image.py", line 115, in main
    interpreter.invoke()
  File "/home/pi/venv/lib/python3.7/site-packages/tflite_runtime/interpreter.py", line 540, in invoke
    self._interpreter.Invoke()
RuntimeError: Failed to execute request. Could not map pages : 5 (Cannot allocate memory)Node number 1 (EdgeTpuDelegateForCustomOp) failed to invoke.

I :226] Releasing Edge TPU device at /dev/apex_0
I :178] Closing Edge TPU device at /dev/apex_0

@hiwudery

hiwudery commented Feb 4, 2021

@markus-k In gasket_page_table.c, the page table uses a 64-bit format, not a 32-bit one. I think gasket_page_table also needs to be modified for the 32-bit kernel.

 * Address format:
 * Simple addresses - those whose containing pages are directly placed in the
 *   device's address translation registers - are laid out as:
 *     [ 63 - 25: 0 | 24 - 12: page index | 11 - 0: page offset ]

@geerlingguy

geerlingguy commented Feb 17, 2021

I also wanted to note something here that may be of interest. I noticed earlier that someone mentioned writeq being present on 64-bit OSes. I'll soon be testing the Coral TPU (M.2 A+E key version) on a Pi, so I haven't yet had first-hand experience, but from a different driver I was taking a look at, it seems one problem may be that writeq is not supported on Pi OS / the Pi's PCIe bus the way it may be on some other 64-bit systems.

Edit: New bug reported relating to that driver issue is here: raspberrypi/linux#4158

@geerlingguy

On 64-bit Pi OS (with the latest kernel, compiled at 5.10.14-v8+), I get the following kernel panic after running through the default steps in the setup guide:

(photo attached: IMG_3633, showing the kernel panic)

(Cross-linking to geerlingguy/raspberry-pi-pcie-devices#44 (comment))

@markus-k

You should probably read the rest of this issue; there hasn't been any development since my last comment, to my knowledge. The default gasket module won't work at all. My fixed one at least loads and can read the temperature, but something is still wrong with the DMA, so it won't work either. There are probably still a few other things broken in the user-space driver as well.

I don't have the time to dig into this right now, and my knowledge of kernel development is limited anyway. So the best we can do is hope that someone with a deep understanding of how the DMA and the TPU work can find some time to look into it.

@timonsku
Author

@mbrooksx It sounded like Google was working on it? Maybe he could update us. I still have a very big interest in this for my product but don't have the resources or know-how to dig into this.

@markus-k

If someone at Google is working on it, or is going to, it would be nice to get a very rough ETA (weeks, months) on when we can expect to know whether or not the TPU will ever work over PCIe on a CM4. I'll be creating a new revision of my product's PCB in a few weeks, and if there's very little chance the PCIe TPU will work anytime soon, I'll have to switch both to USB.

@manishbuttan

manishbuttan commented Apr 13, 2022 via email

@manoj7410

@manishbuttan Please contact Coral sales via the link given at https://coral.ai/products/accelerator-module#tech-specs

@manishbuttan

manishbuttan commented Apr 13, 2022 via email

@manishbuttan

Hello, is anyone here from the Google Coral team? I have filled out the online sales contact form twice, but there has been no response. Please share the USB 3 implementation details for the Coral Accelerator Module. I have already received over 200 Coral Accelerator Modules and can't proceed with the PCB design without this information. Thanks.

@hjonnala
Contributor

Hello @manishbuttan, our sales team has responded to your inquiry on our website. Please check your email and follow up there. Thanks!

@manishbuttan
Copy link

Thanks Hemanth. Yes, I just received an email from Bill from Google. I'm working with him to get this completed.

@magic-blue-smoke
Copy link

Designed an M.2 card with the Coral Accelerator Module that seems to work fine with the Piunora CM4 baseboard.
Test suggestions are welcome.

@TimPearson
Copy link

Designed an M.2 card with the Coral Accelerator Module that seems to work fine with the Piunora CM4 baseboard.
Test suggestions are welcome.

Brilliant idea - why not try the cheap and widely available Waveshare boards, which are only around $20 and have an M-key M.2 interface? I had all but given up on Coral since they don't work over PCIe on the RPi 4, but this now becomes a new possibility.

@vukitoso
Copy link

Waveshare board that are only around $20 and have an M key M.2 interface

Can you give a link to the "Waveshare board"?

@timonsku
Copy link
Author

Designed an M.2 card with the Coral Accelerator Module that seems to work fine with the Piunora CM4 baseboard. Test suggestions are welcome.

Glad to see Piunora coming full circle. It was the original reason I opened this thread :)
Hope it worked out for you so far!

@langestefan
Copy link

Designed an M.2 card with the Coral Accelerator Module that seems to work fine with the Piunora CM4 baseboard. Test suggestions are welcome.

Cool stuff. Is this available somewhere?

@JakobTewes
Copy link

Designed an M.2 card with the Coral Accelerator Module that seems to work fine with the Piunora CM4 baseboard. Test suggestions are welcome.

Cool stuff. Is this available somewhere?

Also quite some interest here 😜

@CipherLab
Copy link

Let's say I wanted to use the Coral USB and inserted an M.2-to-USB riser adapter... on a PC you'd go into the BIOS and switch the slot from M.2 to PCIe, but would I have to do something similar on the Yellow? If so, can someone guide me on how?

@hjonnala
Copy link
Contributor

Designed an M.2 card with the Coral Accelerator Module that seems to work fine with the Piunora CM4 baseboard. Test suggestions are welcome.

Hello @magic-blue-smoke feel free to run the CTS test to evaluate the hardware design. Thanks!
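Before a full CTS run, a quick enumeration check can save time. A sketch; the `1ac1` vendor ID for the Edge TPU is an assumption here, so cross-check it against your own `lspci -nn` output:

```shell
# Does the Coral module enumerate on the PCIe bus at all?
# 1ac1 is the vendor ID the Edge TPU usually reports (assumption; verify
# against your own `lspci -nn` output).
lspci -nn 2>/dev/null | grep -i '1ac1' || echo "no Coral device found on the PCIe bus"
```

If the device doesn't show up here, the problem is electrical or link training rather than anything the CTS would catch.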

@fhloston
Copy link

fhloston commented Feb 9, 2023

This looks very promising - Coral and NVMe

https://twitter.com/Merocle/status/1622644808626970624/photo/1

@ghollingworth
Copy link

No it doesn't... It's no more promising than any of the other products that will not work, due to hardware limitations of the BCM2711 and the Google Coral device.

@will127534
Copy link

will127534 commented Feb 9, 2023

I'm just going to do a shameless plug here: https://github.com/will127534/Coral-USB3-M2-Module
A fully open-sourced design that has passed the CTS test.

@kklem0
Copy link

kklem0 commented Feb 9, 2023

I'm just going to do a shameless plug here: https://github.com/will127534/Coral-USB3-M2-Module

A fully open-sourced design that has passed the CTS test.

Nice! With MIT license 👍👍👍!

@EnziinSystem
Copy link

EnziinSystem commented Apr 15, 2023

I'm just going to do a shameless plug here: https://github.com/will127534/Coral-USB3-M2-Module. A fully open-sourced design that has passed the CTS test.

Your board design is very professional.

Does it work with the Pi CM4 at full performance?

Thanks.

@jambamamba
Copy link

I'm just going to do a shameless plug here: https://github.com/will127534/Coral-USB3-M2-Module. A fully open-sourced design that has passed the CTS test.

Where can we get this from? I want to try it out

@will127534
Copy link

will127534 commented May 21, 2023

Does it work with the Pi CM4 at full performance?

I think that's what the CTS was testing?

Where can we get this from? I want to try it out

I'm not going to sell this. I've been using this board to evaluate the Coral module, but its performance, its availability (the one you saw in the image took 6 months of waiting), and having a USB3 controller sitting between a device and a host that both support PCIe just don't make sense in terms of power, cost, and complexity. The git repo is more about documenting the USB3 capability of the Coral module that Google hides from its datasheet.

@jambamamba
Copy link

I'm not going to sell this. I've been using this board to evaluate the Coral module, but its performance, its availability (the one you saw in the image took 6 months of waiting), and having a USB3 controller sitting between a device and a host that both support PCIe just don't make sense in terms of power, cost, and complexity. The git repo is more about documenting the USB3 capability of the Coral module that Google hides from its datasheet.

Really appreciate you spending so much time and energy on this.

@gtxaspec
Copy link

gtxaspec commented May 26, 2023

offtopic a bit:

@will127534 have you thought of a Coral IC > USB 3 / 3.1 / 3.2 design (since the availability of the USB Coral is limited at the moment)? Would this technically work with a new design?

Couldn't you add multiple Corals to a system using this method, if the Coral IC supports USB?

@will127534
Copy link

will127534 commented May 26, 2023

offtopic a bit:

@will127534 have you thought of a Coral IC > USB 3 / 3.1 / 3.2 design (since the availability of the USB Coral is limited at the moment)? Would this technically work with a new design?

There must be a reason why Google hid the USB3 function in the datasheet in the first place; my guess is that it needs more care (signal boosting) if the USB3 traces get longer. So yes, you probably could, but I'm not sure it would work with a longer USB3 cable. Also, the Coral module itself is limited too...

Couldn't you add multiple Corals to a system using this method, if the Coral IC supports USB?

Adding more Coral modules is indeed possible, but at that point I'd move on to Nvidia's solution, probably saving some money and some optimization effort for that setup.

@n1mda
Copy link

n1mda commented Mar 28, 2024

Is there any progress on making the Coral work with the CM4?

@geerlingguy
Copy link

@n1mda - Coral and CM4 are a no-go. The Coral seems to work on the Pi 5 (and hopefully the CM5 when it is released), as it has a more compliant PCIe bus.
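For anyone verifying this on a Pi 5: MSI-X support was the original blocker at the top of this issue, so checking whether the capability is reported is a quick first test. A sketch; treating `1ac1` as the Edge TPU vendor ID is an assumption, so verify it with `lspci -nn`:

```shell
# Check whether the Edge TPU reports an MSI-X capability (root is needed
# to read the capability list). 1ac1: is assumed to be the Edge TPU vendor ID.
sudo lspci -vv -d 1ac1: 2>/dev/null | grep -i 'msi-x' \
    || echo "no MSI-X capability reported (or no device / insufficient privileges)"
```

Seeing `MSI-X: Enable+` after the gasket driver binds is a good sign that the interrupt setup that failed on the CM4 is working.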
