Initial setup of M.2 Accelerator with Dual Edge TPU fails #491

Open
mogorman opened this issue Oct 15, 2021 · 38 comments
Labels: comp:model, Hardware:M.2 Accelerator with dual Edge TPU, type:support

Comments

@mogorman

mogorman commented Oct 15, 2021

I tried to get the Dual Edge TPU card working in my PC and can't seem to get it to do any work. I am running a fresh Ubuntu 21.04 install and followed the instructions at https://coral.ai/docs/m2/get-started/. I tried the one troubleshooting suggestion, pcie_aspm=off, and it seemed to have no effect. Also, shouldn't I see two boards? I am only seeing the one apex_0 device. I see my M.2 slot is only single-lane, my bad.

Any advice or things to try would be very appreciated.

Relevant dmesg lines

[    1.606874] gasket: loading out-of-tree module taints kernel.
[    1.643067] gasket: module verification failed: signature and/or required key missing - tainting kernel
[   14.713273] apex 0000:01:00.0: RAM did not enable within timeout (12000 ms)
[   14.713358] apex 0000:01:00.0: Couldn't initialize interrupts: -12
[   19.877233] apex 0000:01:00.0: Apex performance not throttled due to temperature
[   24.996940] apex 0000:01:00.0: Apex performance not throttled due to temperature
[   30.117372] apex 0000:01:00.0: Apex performance not throttled due to temperature
[   35.181605] apex 0000:01:00.0: Apex performance not throttled due to temperature
[   40.223074] apex 0000:01:00.0: Apex performance not throttled due to temperature

Not all boots have the Couldn't initialize interrupts error.
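The trailing numbers in these driver messages look like negated Linux errno values: -12 would be ENOMEM, and the -110 that shows up after the demo run further down would be ETIMEDOUT ("Connection timed out"). A minimal sketch for pulling those codes out of a dmesg dump, assuming the line format shown above; `apex_error_codes` is a made-up helper name, not part of any Coral tooling.

```python
import errno
import re

# Matches lines such as:
#   [   14.713358] apex 0000:01:00.0: Couldn't initialize interrupts: -12
APEX_ERROR = re.compile(
    r"apex (?P<dev>[0-9a-f:.]+): (?P<msg>.+): (?P<code>-\d+)\s*$"
)


def apex_error_codes(dmesg_text):
    """Return (pci_address, message, code, errno_name) for apex lines ending in a negative code."""
    found = []
    for line in dmesg_text.splitlines():
        match = APEX_ERROR.search(line)
        if match:
            code = int(match.group("code"))
            # errno.errorcode maps numbers to names for the current platform;
            # on typical Linux (x86/ARM), 12 is ENOMEM and 110 is ETIMEDOUT.
            name = errno.errorcode.get(-code, "unknown")
            found.append((match.group("dev"), match.group("msg"), code, name))
    return found


sample = """\
[   14.713273] apex 0000:01:00.0: RAM did not enable within timeout (12000 ms)
[   14.713358] apex 0000:01:00.0: Couldn't initialize interrupts: -12
[  261.356838] apex 0000:01:00.0: Error in device open cb: -110
"""

for entry in apex_error_codes(sample):
    print(entry)
```

Lines without a trailing code, such as the "RAM did not enable" message itself, are skipped on purpose.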

lspci -vvv|grep -i MSI-X

pcilib: sysfs_read_vpd: read failed: Input/output error
	Capabilities: [d0] MSI-X: Enable+ Count=128 Masked-
	Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
	Capabilities: [b0] MSI-X: Enable+ Count=13 Masked-

lspci -vv

01:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: apex
	Kernel modules: apex
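
`rev ff` together with `Unknown header type 7f` usually means that every configuration-space read came back as 0xff, in other words the device stopped answering on the bus (0x7f is 0xff with the multi-function bit masked off). A rough sketch of checking for that state from Python, assuming the standard sysfs `config` file and the 0000:01:00.0 address from the output above; both helper names are hypothetical.

```python
from pathlib import Path


def looks_unresponsive(config_bytes):
    """True if the PCI config header reads back as all 0xff bytes."""
    header = config_bytes[:64]  # the standard PCI header is the first 64 bytes
    return len(header) > 0 and all(b == 0xFF for b in header)


def read_config(pci_address="0000:01:00.0"):
    """Read raw config space from sysfs (Linux only; address taken from the lspci output)."""
    return Path(f"/sys/bus/pci/devices/{pci_address}/config").read_bytes()


if __name__ == "__main__":
    try:
        data = read_config()
    except OSError as exc:
        print(f"could not read config space: {exc}")
    else:
        print("unresponsive" if looks_unresponsive(data) else "responding")
```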

python3 examples/classify_image.py --model test_data/mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite --labels test_data/inat_bird_labels.txt --input test_data/parrot.jpg

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 160, in load_delegate
    delegate = Delegate(library, options)
  File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 119, in __init__
    raise ValueError(capture.message)
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mog/pycoral/examples/classify_image.py", line 121, in <module>
    main()
  File "/home/mog/pycoral/examples/classify_image.py", line 71, in main
    interpreter = make_interpreter(*args.model.split('@'))
  File "/usr/lib/python3/dist-packages/pycoral/utils/edgetpu.py", line 87, in make_interpreter
    delegates = [load_edgetpu_delegate({'device': device} if device else {})]
  File "/usr/lib/python3/dist-packages/pycoral/utils/edgetpu.py", line 52, in load_edgetpu_delegate
    return tflite.load_delegate(_EDGETPU_SHARED_LIB, options or {})
  File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 162, in load_delegate
    raise ValueError('Failed to load delegate from {}\n{}'.format(
ValueError: Failed to load delegate from libedgetpu.so.1

dmesg after run

[  261.356821] apex 0000:01:00.0: RAM did not enable within timeout (12000 ms)
[  261.356838] apex 0000:01:00.0: Error in device open cb: -110
[  261.357124] apex 0000:01:00.0: Apex performance not throttled due to temperature
[  266.436691] apex 0000:01:00.0: Apex performance not throttled due to temperature
[  271.556885] apex 0000:01:00.0: Apex performance not throttled due to temperature
[  276.677227] apex 0000:01:00.0: Apex performance not throttled due to temperature

good lspci -vv

01:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU (prog-if ff)
	Subsystem: Global Unichip Corp. Coral Edge TPU
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 19
	Region 0: Memory at 8f900000 (64-bit, prefetchable) [virtual] [size=16K]
	Region 2: Memory at 8f800000 (64-bit, prefetchable) [virtual] [size=1M]
	Capabilities: [80] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 10.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr+ FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s (ok), Width x1 (ok)
			TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-5GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [d0] MSI-X: Enable+ Count=128 Masked-
		Vector table: BAR=2 offset=00046800
		PBA: BAR=2 offset=00046068
	Capabilities: [e0] MSI: Enable- Count=1/32 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [f8] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
	Capabilities: [108 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [110 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us
	Capabilities: [200 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr+ BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 40000001 0000000f 8f81a318 ffffffcf
	Kernel driver in use: apex
	Kernel modules: apex
google-coral-bot added the comp:model and type:support labels on Oct 15, 2021
hjonnala added the Hardware:M.2 Accelerator with dual Edge TPU label on Oct 15, 2021
@hjonnala
Contributor

Can you please share the output of the snippet below?

root@root ~# python3
Python 3.7.3 (default, Jan 22 2021, 20:04:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pycoral.pybind._pywrap_coral import ListEdgeTpus as list_edge_tpus
>>> list_edge_tpus()
[{'type': 'pci', 'path': '/dev/apex_0'}]

@mogorman
Author

mogorman commented Oct 15, 2021

When I run that command, my output looks the same, except I am on Python 3.9.7 and GCC 11.2.0.

mog@random:~$ python3
Python 3.9.7 (default, Sep 10 2021, 14:59:43) 
[GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pycoral.pybind._pywrap_coral import ListEdgeTpus as list_edge_tpus
>>> list_edge_tpus()
[{'type': 'pci', 'path': '/dev/apex_0'}]
>>> 
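
Both outputs list a single /dev/apex_0, which is consistent with the single-lane M.2 slot mentioned in the first post. A quick cross-check that does not depend on pycoral is to count the device nodes directly. This is only a sketch: `EXPECTED_TPUS = 2` is an assumption for a Dual Edge TPU card in a slot that wires both PCIe lanes, and `apex_nodes` is a hypothetical helper name.

```python
import glob

EXPECTED_TPUS = 2  # assumption: Dual Edge TPU card with both PCIe lanes wired


def apex_nodes(pattern="/dev/apex_*"):
    """Return the sorted list of device nodes matching the apex naming scheme."""
    return sorted(glob.glob(pattern))


nodes = apex_nodes()
print(f"found {len(nodes)} of {EXPECTED_TPUS} expected Edge TPU nodes: {nodes}")
if len(nodes) < EXPECTED_TPUS:
    print("fewer nodes than TPUs on the card; the slot may expose only one PCIe lane")
```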

@manoj7410

@mogorman Which machine/hardware are you working with?

@mogorman
Author

@manoj7410 Currently trying this on a Librem Mini v1 with a Coral M.2 Accelerator with Dual Edge TPU.

@manoj7410

@mogorman Please disable Secure Boot on your machine and then try running the demo again.

@mogorman
Author

It doesn't have Secure Boot enabled. It's using stock SeaBIOS.

@hjonnala
Contributor

Can you please paste the output of the command below?

python3 -c 'from pycoral.utils.edgetpu import get_runtime_version; print(get_runtime_version())'

@mogorman
Author

mog@random:~$ python3 -c 'from pycoral.utils.edgetpu import get_runtime_version; print(get_runtime_version())'
BuildLabel(COMPILER=6.3.0 20170516,DATE=redacted,TIME=redacted), RuntimeVersion(14)

@hjonnala
Contributor

hjonnala commented Oct 19, 2021

Can you please check the permissions of /dev/apex_0 and check if this works for you?

@mogorman
Author

It's already 660.

mog@random:~$ ls -lah /dev/apex_0 
crw-rw---- 1 root apex 120, 0 Oct 19 12:32 /dev/apex_0

mog@random:~$ cat /etc/group |grep apex
apex:x:1001:mog
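
The listing shows mode 660 with group `apex`, and `mog` is a member of that group, so permissions look fine. The same check can be scripted. A minimal sketch, assuming the `/dev/apex_0` path from the listing above; the function names are hypothetical.

```python
import grp
import os
import stat


def group_can_read_write(mode):
    """True if the permission bits grant read and write to the owning group."""
    wanted = stat.S_IRGRP | stat.S_IWGRP
    return (stat.S_IMODE(mode) & wanted) == wanted


def check_device(path="/dev/apex_0"):
    """Return (group name, group has rw, current process is in that group)."""
    info = os.stat(path)
    group_name = grp.getgrgid(info.st_gid).gr_name
    in_group = info.st_gid in os.getgroups() or info.st_gid == os.getgid()
    return group_name, group_can_read_write(info.st_mode), in_group


if __name__ == "__main__":
    try:
        print(check_device())
    except (OSError, KeyError) as exc:
        print(f"cannot inspect device node: {exc}")
```

Group membership added with `usermod` only takes effect in new login sessions, so the third value can be False in a shell that was opened before the user joined the group.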

@hjonnala
Contributor

Can you please try the demo with the lines below added here and share the logs?

from pycoral.pybind._pywrap_coral import SetVerbosity as set_verbosity
set_verbosity(10)

@mogorman
Author

Python 3.9.7 (default, Sep 10 2021, 14:59:43) 
[GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pycoral.pybind._pywrap_coral import SetVerbosity as set_verbosity
>>> set_verbosity(10)
True
>>> import argparse
>>> import time
>>> 
>>> import numpy as np
>>> from PIL import Image
>>> from pycoral.adapters import classify
>>> from pycoral.adapters import common
>>> from pycoral.utils.dataset import read_label_file
>>> from pycoral.utils.edgetpu import make_interpreter
>>> 
>>> labels = read_label_file("pycoral/test_data/inat_bird_labels.txt")
>>> interpreter = make_interpreter(["pycoral/test_data/mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite"])
I tflite/edgetpu_manager_direct.cc:453] No matching device is already opened for shared ownership.
I driver/usb/local_usb_device.cc:944] EnumerateDevices: vendor:0x1a6e, product:0x89a
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[2] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[0]
I driver/usb/local_usb_device.cc:944] EnumerateDevices: vendor:0x18d1, product:0x9302
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[2] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[0]
I tflite/edgetpu_context_direct.cc:106] USB always DFU: False (default)
I tflite/edgetpu_context_direct.cc:128] USB bulk-in queue capacity: default
I tflite/edgetpu_context_direct.cc:67] Performance expectation: Max (default)
I ./driver/mmio/host_queue.h:266] Starting in normal mode
I driver/kernel/kernel_registers.cc:83] Opening /dev/apex_0. read_only=0

I tflite/edgetpu_context_direct.cc:401] Failed to open device [Apex (PCIe)] at [/dev/apex_0]: Failed precondition: Device open failed : -1 (Connection timed out)
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 160, in load_delegate
    delegate = Delegate(library, options)
  File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 119, in __init__
    raise ValueError(capture.message)
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/pycoral/utils/edgetpu.py", line 87, in make_interpreter
    delegates = [load_edgetpu_delegate({'device': device} if device else {})]
  File "/usr/lib/python3/dist-packages/pycoral/utils/edgetpu.py", line 52, in load_edgetpu_delegate
    return tflite.load_delegate(_EDGETPU_SHARED_LIB, options or {})
  File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 162, in load_delegate
    raise ValueError('Failed to load delegate from {}\n{}'.format(
ValueError: Failed to load delegate from libedgetpu.so.1

I didn't go further, as I couldn't talk to it.

@hjonnala
Contributor

I am not sure how to fix the device open error. It seems to be an issue with MSI-X support (lspci -vvv | grep -i MSI-X). It might be that your host machine does not support the M.2 Dual Edge TPU.

I tflite/edgetpu_context_direct.cc:401] Failed to open device [Apex (PCIe)] at [/dev/apex_0]: Failed precondition: Device open failed : -1 (Connection timed out)

@mogorman
Author

mogorman commented Oct 21, 2021

See above, where I posted the lspci output for reference. It seems it should support it?

@mogorman
Author

Tried on 2 other machines; it only worked in one of them. Pretty frustrating. Happy to test anything on my other machine.

@manoj7410

@mogorman Does the machine on which the PCIe device works have the same configuration as the machine on which it does not?

@mogorman
Author

They are different types of machines. It is currently working on a Seeed Odyssey board via an NVMe to mini PCIe adapter. The others were just plugged straight into the mini PCIe slot.

@manoj7410

Do you see any useful difference in the output of lspci -vvv between the two machines?
Additionally, do you see this error on the other machine too? [ 1.643067] gasket: module verification failed: signature and/or required key missing - tainting kernel

@tedzhouhk

Any updates on this? I hit basically the same issue when running the M.2 B+M key TPU in either M.2 slot of a STRIX Z270i motherboard (Intel 7700K) with Ubuntu 20.

I'd like to add that sometimes when I reboot the machine, the Coral Edge TPU is entirely gone and not visible until the next reboot.

@hjonnala Please let me know if anything I can share will be useful.

@hjonnala
Contributor

hjonnala commented Dec 3, 2021

@tedzhouhk can you please share the following details:

Python 3.9.7 (default, Sep  3 2021, 06:18:44) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tflite_runtime as tflite
>>> import pycoral
>>> from pycoral.utils.edgetpu import get_runtime_version
>>> get_runtime_version()
'BuildLabel(COMPILER=6.3.0 20170516,DATE=redacted,TIME=redacted), RuntimeVersion(14)'
>>> tflite.__version__
'2.5.0.post1'
>>> pycoral.__version__
'2.0.0'
>>> from pycoral.pybind._pywrap_coral import ListEdgeTpus as list_edge_tpus
>>> list_edge_tpus()
[{'type': 'pci', 'path': '/dev/apex_0'}]

@tedzhouhk

Sure, here's the output (it's the same as yours except for the Python version).

Python 3.8.10 (default, Sep 28 2021, 16:10:42) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tflite_runtime as tflite
>>> import pycoral
>>> from pycoral.utils.edgetpu import get_runtime_version
>>> get_runtime_version()
'BuildLabel(COMPILER=6.3.0 20170516,DATE=redacted,TIME=redacted), RuntimeVersion(14)'
>>> tflite.__version__
'2.5.0.post1'
>>> pycoral.__version__
'2.0.0'
>>> from pycoral.pybind._pywrap_coral import ListEdgeTpus as list_edge_tpus
>>> list_edge_tpus()
[{'type': 'pci', 'path': '/dev/apex_0'}]
>>> 

@hjonnala
Contributor

hjonnala commented Dec 3, 2021

@tedzhouhk please add these two lines to the demo and share the output in a txt file.

from pycoral.pybind._pywrap_coral import SetVerbosity as set_verbosity
set_verbosity(10)

@tedzhouhk

tedzhouhk commented Dec 3, 2021

Here's the result. There are around 10 seconds between the "Opening /dev/apex_0. read_only=0" message and the "Failed to open device" error.

Python 3.8.10 (default, Sep 28 2021, 16:10:42) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pycoral.pybind._pywrap_coral import SetVerbosity as set_verbosity
>>> set_verbosity(10)
True
>>> from pycoral.utils.edgetpu import make_interpreter
>>> interpreter = make_interpreter(["pycoral/test_data/mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite"])
I tflite/edgetpu_manager_direct.cc:453] No matching device is already opened for shared ownership.
I driver/usb/local_usb_device.cc:944] EnumerateDevices: vendor:0x1a6e, product:0x89a
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[4] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[3] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[2] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[9]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[7]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[4]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[11]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[10]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[0]
I driver/usb/local_usb_device.cc:944] EnumerateDevices: vendor:0x18d1, product:0x9302
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[4] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[3] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[2] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[9]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[7]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[4]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[11]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[10]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[0]
I tflite/edgetpu_context_direct.cc:106] USB always DFU: False (default)
I tflite/edgetpu_context_direct.cc:128] USB bulk-in queue capacity: default
I tflite/edgetpu_context_direct.cc:67] Performance expectation: Max (default)
I ./driver/mmio/host_queue.h:266] Starting in normal mode
I driver/kernel/kernel_registers.cc:83] Opening /dev/apex_0. read_only=0
I tflite/edgetpu_context_direct.cc:401] Failed to open device [Apex (PCIe)] at [/dev/apex_0]: Failed precondition: Device open failed : -1 (Connection timed out)
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 160, in load_delegate
    delegate = Delegate(library, options)
  File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 119, in __init__
    raise ValueError(capture.message)
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/pycoral/utils/edgetpu.py", line 87, in make_interpreter
    delegates = [load_edgetpu_delegate({'device': device} if device else {})]
  File "/usr/lib/python3/dist-packages/pycoral/utils/edgetpu.py", line 52, in load_edgetpu_delegate
    return tflite.load_delegate(_EDGETPU_SHARED_LIB, options or {})
  File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 162, in load_delegate
    raise ValueError('Failed to load delegate from {}\n{}'.format(
ValueError: Failed to load delegate from libedgetpu.so.1

"dmesg | grep apex" after the execution:

[755674.366783] apex 0000:06:00.0: RAM did not enable within timeout (12000 ms)
[755674.366799] apex 0000:06:00.0: Error in device open cb: -110

@hjonnala
Contributor

hjonnala commented Dec 3, 2021

Can you try the "Workaround to disable Apex and Gasket" section from this page: https://coral.ai/docs/m2/get-started/#troubleshooting-on-linux

@tedzhouhk

tedzhouhk commented Dec 3, 2021

Do you mean disable apex and gasket when installing the driver? There is no apex or gasket before I install the driver.

@hjonnala
Contributor

hjonnala commented Dec 3, 2021

@tedzhouhk Yes, please try the "Workaround to disable Apex and Gasket" section. If it still does not work, please try a Linux container or Ubuntu 18.04 if possible.

@tedzhouhk

@hjonnala Happy new year and sorry for the late reply, I finally have time to try this out on Ubuntu 18.04. Unfortunately, I got the same error. Any chance that it's a hardware problem?

@Calimerorulez

Same error here still, even though I disabled power saving via the kernel parameter pcie_aspm=off.
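
Whether `pcie_aspm=off` actually reached the running kernel can be confirmed from `/proc/cmdline`. A short sketch, assuming a Linux host; `has_kernel_param` is a made-up helper name.

```python
def has_kernel_param(cmdline, param="pcie_aspm=off"):
    """True if the exact parameter appears as a whitespace-separated token."""
    return param in cmdline.split()


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            current = handle.read()
    except OSError as exc:
        print(f"could not read /proc/cmdline: {exc}")
    else:
        print("pcie_aspm=off active:", has_kernel_param(current))
```

The `LnkCtl: ASPM Disabled` line in the `lspci -vv` output in the first post is the per-device view of the same setting.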

@pedymaster

What helped me solve the issue was this reddit thread.
I put options vfio-pci ids=1ac1:089a disable_idle_d3=1 into /etc/modprobe.d/tpu.conf, rebooted, and it worked like a charm.
My system is Ubuntu 20.04.4 LTS and I run the M.2 Coral through an adapter in a PCI slot.
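
For anyone copying this, the file is a single `options` line; `disable_idle_d3` is a vfio-pci module parameter that keeps idle devices out of the D3 low-power state. A sketch that renders the line and checks whether it is already present, assuming the `/etc/modprobe.d/tpu.conf` file name and the `1ac1:089a` ID quoted above. The helper names are illustrative, and because writing under /etc needs root this only reads.

```python
from pathlib import Path

CONF_PATH = Path("/etc/modprobe.d/tpu.conf")  # file name used in the comment above


def options_line(module="vfio-pci", pci_id="1ac1:089a", disable_idle_d3=True):
    """Build the modprobe options line quoted in the comment."""
    flag = 1 if disable_idle_d3 else 0
    return f"options {module} ids={pci_id} disable_idle_d3={flag}"


def is_configured(conf_text, line=None):
    """True if the options line is present, ignoring surrounding whitespace."""
    line = line or options_line()
    return any(entry.strip() == line for entry in conf_text.splitlines())


if __name__ == "__main__":
    text = CONF_PATH.read_text() if CONF_PATH.exists() else ""
    print(options_line())
    print("already configured:", is_configured(text))
```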

@CoMPaTech

FWIW, I got the "RAM did not enable within timeout" error after accidentally switching the reserved memory in a (desktop) BIOS from 32 MB to 512 MB (I was experimenting with an additional GPU, which didn't pan out). After resetting it to 32 MB the issue was gone, so it might be worth a shot.

@didi767

didi767 commented Aug 18, 2022

> What helped me solve the issue was this reddit thread. I put options vfio-pci ids=1ac1:089a disable_idle_d3=1 into /etc/modprobe.d/tpu.conf, rebooted, and it worked like a charm. My system is Ubuntu 20.04.4 LTS and I run the M.2 Coral through an adapter in a PCI slot.

Do you mean to add it to the Proxmox machine itself?

@pedymaster

> > What helped me solve the issue was this reddit thread. I put options vfio-pci ids=1ac1:089a disable_idle_d3=1 into /etc/modprobe.d/tpu.conf, rebooted, and it worked like a charm. My system is Ubuntu 20.04.4 LTS and I run the M.2 Coral through an adapter in a PCI slot.
>
> Do you mean to add it to the Proxmox machine itself?

I don't do Proxmox / virtualization. I have it on bare metal.

@tedzhouhk

I actually got it working by using this recommended PCIe to M.2 adapter. My motherboard is an Asus Z270i with an i7-7700K. It has two M.2 slots. My OS is installed on one M.2 drive, so previously I swapped the Edge TPU and the SSD between the two slots, but neither position worked. Then I tried the PCIe to M.2 adapter in the only PCIe 3.0 x16 slot and set it to x4 mode. Everything seems to work well.

@blacklizard

> What helped me solve the issue was this reddit thread. I put options vfio-pci ids=1ac1:089a disable_idle_d3=1 into /etc/modprobe.d/tpu.conf, rebooted, and it worked like a charm. My system is Ubuntu 20.04.4 LTS and I run the M.2 Coral through an adapter in a PCI slot.

This worked for me on Ubuntu 22.04.3 LTS.

@blacklizard

> What helped me solve the issue was this reddit thread. I put options vfio-pci ids=1ac1:089a disable_idle_d3=1 into /etc/modprobe.d/tpu.conf, rebooted, and it worked like a charm. My system is Ubuntu 20.04.4 LTS and I run the M.2 Coral through an adapter in a PCI slot.

As per my previous comment, it worked initially for about 24 hours, and after that the issue appeared again. I have to remove power from the machine and start it again for the TPU to be detected.

@blacklizard

Since my last comment, I've downgraded from Ubuntu 22.04 to Debian 10 (kernel 4.19.0-25-amd64). It has been working without any issue so far: no crashes and no instances of the TPU going missing from the machine.

@AnnoyingTechnology

Extremely odd. We have two servers with identical specs except for the CPU (3900X on one, 5900X on the other).
Same PCI device order, same disks, same RAM.

Running the latest Proxmox with a VM containing Coral's drivers.
It works on the 3900X server but not on the 5900X, where we get:

[  192.751583] apex 0000:00:10.0: RAM did not enable within timeout (12000 ms)
[  192.751607] apex 0000:00:10.0: Error in device open cb: -110
[  192.751634] apex 0000:00:10.0: Apex performance not throttled due to temperature

@gonzalezcalleja

Same issue here:

  • CPU AMD Ryzen 7 5700U
  • Coral mini PCIe with adapter in M.2 port

Every 24 h the TPU fails like this:

[70557.636145] apex 0000:03:00.0: Apex performance not throttled due to temperature
[70562.865720] apex 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x38010003cf6bb000 flags=0x0030]
[70562.865761] apex 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x38010003cf6bb100 flags=0x0030]
[70562.865793] apex 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x38010003cf6bb200 flags=0x0030]
[70562.865827] apex 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x38010003cf6bb300 flags=0x0030]
[70562.865861] apex 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x38010003cf6bb400 flags=0x0030]
[70562.865892] apex 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x38010003cf6bb500 flags=0x0030]
[70562.865921] apex 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x38010003cf6bb600 flags=0x0030]
[70562.865949] apex 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x38010003cf6bb700 flags=0x0030]
[70562.865977] apex 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x38010003cf6bb800 flags=0x0030]
[70562.866006] apex 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x38010003cf6bb900 flags=0x0030]
[70563.067383] workqueue: check_temperature_work_handler [apex] hogged CPU for >13333us 4 times, consider switching to WQ_UNBOUND
[70563.269147] apex 0000:03:00.0: Apex performance not throttled due to temperature
[70612.832647] apex 0000:03:00.0: RAM did not enable within timeout (12000 ms)
[70626.952663] apex 0000:03:00.0: RAM did not enable within timeout (12000 ms)
[70626.952678] apex 0000:03:00.0: Error in device open cb: -110
[70870.825650] apex 0000:03:00.0: RAM did not enable within timeout (12000 ms)
[70870.825672] apex 0000:03:00.0: Error in device open cb: -110
[71107.841695] apex 0000:03:00.0: RAM did not enable within timeout (12000 ms)
[71107.841726] apex 0000:03:00.0: Error in device open cb: -110
[71344.872647] apex 0000:03:00.0: RAM did not enable within timeout (12000 ms)
[71344.872661] apex 0000:03:00.0: Error in device open cb: -110
[71587.448691] apex 0000:03:00.0: RAM did not enable within timeout (12000 ms)
[71587.448716] apex 0000:03:00.0: Error in device open cb: -110
[71824.537672] apex 0000:03:00.0: RAM did not enable within timeout (12000 ms)
