DHT measure() locks up (core0) when _thread is running (core1) if WIZNET5K networking is enabled #10448

Open
GM-Script-Writer-62850 opened this issue Jan 8, 2023 · 20 comments
@GM-Script-Writer-62850

GM-Script-Writer-62850 commented Jan 8, 2023

firmware file name: v1.19.1-782-g699477d12 (2022-12-20) .uf2
Board: https://micropython.org/download/W5500_EVB_PICO/

I have 4 DHT22 sensors wired up using 3 pins, and they all report numbers just fine when _thread is not in use. At first I figured I could not access the GPIO from both cores at the same instant, so I made a lock to stop core1 from accessing the GPIO during a measure. This did nothing. If I comment out my newThread call, it seems to run till the cows come home.

  • DHT22 pins: 28, 14, 10, 6

this seems to be all it takes to break it (usually happens within 4 loops w/ 4 sensors):

EDIT: see next comment

no errors show up in Thonny, it just hangs

@GM-Script-Writer-62850
Author

GM-Script-Writer-62850 commented Jan 10, 2023

Did some more testing.
Replication requires both the network being enabled and a running thread (getting rid of either prevents this issue from appearing).

from machine import Pin
from time import sleep_ms
import dht
from _thread import start_new_thread as newThread

import network
nic=network.WIZNET5K()
nic.active(True)
#nic.ifconfig("dhcp")

def core1_job():
	while True:
		sleep_ms(1000)

newThread(core1_job,()) # this will make sensor.measure() lock up; expect this to happen in under 60 seconds
#28, 14, 10, 6
sensors=[
	dht.DHT22(Pin(28)),
	dht.DHT22(Pin(14)),
	dht.DHT22(Pin(10)),
	dht.DHT22(Pin(6))
]

while True:
	for sensor in sensors:
		sensor.measure()
		temp = sensor.temperature()
		hum = sensor.humidity()
		print("Temperature: {}°C   Humidity: {:.0f}% ".format(temp, hum))
	print("-----")
	sleep_ms(2500)

@GM-Script-Writer-62850 GM-Script-Writer-62850 changed the title DHT measure() locks up (core0) when _thread is running (core1) DHT measure() locks up (core0) when _thread is running (core1) if networking is enabled Jan 13, 2023
@GM-Script-Writer-62850
Author

GM-Script-Writer-62850 commented Jan 13, 2023

I did some more testing

My current theory: if you do not use core 1, the wiznet driver uses core 1; if you do, it uses core 0.
Then, when the DHT driver locks out interrupts on the core it is running on, the wiznet driver crashes the PICO. This may just be something that needs to be documented for the wiznet driver.

I have not been able to get a PICO W to crash so far

from machine import Pin
from time import sleep_ms
import dht
from _thread import start_new_thread as newThread
import network

# Configuration
board = 1        # 1 = Wiznet W5500-EVB-Pico; 0 = PICO W
sensors_core = 0 # 0 will crash, 1 will not
run_core_1 = 1   # enable core 1 work load
enable_nic = 1   # enable network controller
# This only crashes with configuration:
# [1, 0, 1, 1]
# which crashes almost immediately

sensors=[# Will crash with only 1 sensor
	dht.DHT22(Pin(28)),
	dht.DHT22(Pin(14)),
	dht.DHT22(Pin(10)),
	dht.DHT22(Pin(6))
]

loops=0 # This is just a counter

if board and enable_nic:
	# Wiznet W5500-EVB-Pico - https://micropython.org/download/W5500_EVB_PICO/
	lan=network.WIZNET5K()
	lan.active(True)
elif enable_nic:
	# PICO W - https://micropython.org/download/rp2-pico-w/
	#from wifi_auth import ssid, password

	wlan = network.WLAN(network.STA_IF)
	wlan.active(True)
	#wlan.config(pm = 0xa11140)

def readSensors():
	global loops
	print("----- Pass:",loops)
	loops+=1
	try:
		for sensor in sensors:
			sensor.measure()
			temp = sensor.temperature()
			hum = sensor.humidity()
			print("Temperature: {}°C   Humidity: {:.0f}% ".format(temp, hum))
	except Exception as e:
		print("DHT22 ERROR:",e)# This catches nothing...

def core1_job():
	sleep_ms(100)# Get the cores out of sync so the prints stay readable
	if sensors_core:
		print(" * Read sensor(s) on core 1")
		sleep_ms(2000)
		while True:
			readSensors()
			sleep_ms(2500)
	else:
		print(" * Infinite sleep on core 1")
		sleep_ms(2000)
		while True:
			sleep_ms(1)

print("--- Test Configuration ---")
if run_core_1:
	print(" * Core 1 enabled")
	newThread(core1_job,())
else:
	print(" * Core 1 disabled")

if sensors_core:
	print(" * Infinite sleep on core 0")
	sleep_ms(2000)
	while True:
		sleep_ms(1)
else:
	print(" * Read sensor(s) on core 0")
	sleep_ms(2000)
	while True:
		readSensors()
		sleep_ms(2500)

@GM-Script-Writer-62850 GM-Script-Writer-62850 changed the title DHT measure() locks up (core0) when _thread is running (core1) if networking is enabled DHT measure() locks up (core0) when _thread is running (core1) if WIZNET5K networking is enabled Jan 13, 2023
@MilhouseVH

MilhouseVH commented May 22, 2023

I agree that the current 1.20 WIZNET5K network module is NOT _thread/multicore compatible. Flagging this in the documentation would have saved me a lot of time and grief! 😄

Running literally anything on the second core - even something incredibly trivial - will usually result in abnormal behaviour either during or shortly after the network has been initialised. Sometimes it may work normally for a while, which is what makes narrowing this down to the network module so much more time consuming.

So long as you stick to a single core, the W5500-EVB-PICO is a very nice board - just don't try to use the second RP2040 core!

@GM-Script-Writer-62850
Author

GM-Script-Writer-62850 commented May 23, 2023

my success varies with what is being done on core 0, using core 1 works fine as long as you do not use something that blocks interrupts on core 0, due to this i have the complex stuff on core 1 and the simple stuff (polling) on core 0

i have had it crash or appear to crash, but it was more like it froze than crashed and where it stopped processing made logical sense as to what happened, i suspect some EMI flipped a bit in memory and weird stuff happened, as this is the most plausible explanation i can think of

i have had a request return a BadStatusLine error even though my server sent a 200 status code to the pico, my solution was to try and ignore it and it just yolo yeets the data at the server and hopes it gets it

still annoying that the way i have it set up using the network hogs both cores as i call the request on core 1 and then core 0 does the work of getting data for core 1 to process the data received

the PICO W is definitely easier to work with in this regard and you do not give up a bunch of GPIO pins (i have one w5500 pico that ran out of gpio pins, so now if i want to add stuff i need to use a multiplex IC)

@MilhouseVH

MilhouseVH commented May 23, 2023

I have a main script running on core0 which handles a GPIO digital input (ie. button, actually it's a relay output but to all intents and purposes it's a button!) via IRQ and processes that button "press" (outside of the IRQ) in an infinite while loop which makes a network request whenever the button has been pressed. This is working perfectly (I don't disable any IRQs, ever).

I then added a second thread using _thread which under certain circumstances would flip the output of another GPIO for a few seconds (a simple piezo buzzer wired to a GPIO - high() would make a noise, then wait a few milliseconds before sending it low() for silence, then repeat a few times for a suitably annoying noise). This is when it all went to hell in a handbasket. So long as this second thread was started even if it was doing nothing (waiting in a utime.sleep(), or waiting on a lock.acquire()) then any subsequent network.ifconfig() would hang (most of the time). This is actually even before I had wired up the IRQ to handle the button, so to be honest in my case you can ignore the whole IRQ side of things as that's not relevant.

For me, starting a trivial second thread (that does almost nothing) before initialising the WIZNET5K network is sufficient for the RP2040 to hang completely.

I now avoid _thread on the W5500 and run everything I have to run on the same RP2040 core. No more problems, but of course it would have been nice to interleave some of the noise making stuff by making use of the second core to avoid those time delays on the main core. Not a big deal in the scheme of things, and the increased reliability and lack of crashes is so much more preferable!

At the very least though, a warning about this incompatibility with the WIZNET5K network library and _thread would save people a lot of time (until it's fixed, assuming it's possible to fix it).
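
For anyone staying on a single core, a buzzer pattern like the one described above can be interleaved with the main loop instead of blocking in sleep calls. The following is only a sketch: the `Beeper` class and its parameters are hypothetical names, and it assumes the output is driven through a callable that takes 0 or 1 (a `machine.Pin` object can be called that way).

```python
class Beeper:
    """Toggle an output a fixed number of times without blocking the caller."""

    def __init__(self, set_level, beeps=3, half_period_ms=50):
        self._set_level = set_level        # callable taking 0 or 1, e.g. a Pin object
        self._half_period_ms = half_period_ms
        self._edges_total = beeps * 2      # each beep is one high and one low phase
        self._edges_left = 0
        self._next_ms = 0
        self._level = 0

    def start(self, now_ms):
        """Arm the pattern; the first edge fires on the next update()."""
        self._edges_left = self._edges_total
        self._level = 0
        self._next_ms = now_ms

    def update(self, now_ms):
        """Call from the main loop; returns True while the pattern is running."""
        if self._edges_left == 0:
            return False
        # Plain subtraction ignores tick wraparound; on the board use
        # time.ticks_diff() and time.ticks_add() instead.
        if now_ms - self._next_ms >= 0:
            self._level ^= 1
            self._set_level(self._level)
            self._edges_left -= 1
            self._next_ms = now_ms + self._half_period_ms
        return self._edges_left > 0
```

In a main loop this would be driven with something like `beeper.update(utime.ticks_ms())` on every pass, next to the button and network handling, so nothing waits on the buzzer.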

@991jo

991jo commented May 23, 2023

I am also affected by this on a W5500-EVB-Pico board running micropython v1.20.0.
It looks like simply using the WIZNET driver is not enough. Activating the driver, creating a socket and receiving packets and printing them is not sufficient to freeze the pico. It looks like some sort of IO has to be done as well.

Here is some code that gets my board to freeze reliably within a minute when sending packets to the NIC from a second machine:

On core0 it receives data from a UDP socket and prints it, then writes some data to a NeoPixel LED strip.
On core1 it simply is running a blink function for the onboard led.

from machine import Pin
import network
import socket
from time import sleep
import _thread
from neopixel import NeoPixel

LED_PIN = Pin(25, Pin.OUT)
PIXEL_PIN = Pin(3, Pin.OUT)


def initialize_nic():
    print("initializing NIC")
    nic = network.WIZNET5K()
    nic.active(True)

    print("waiting for connection to come up")
    for i in range(0, 10):
        if nic.isconnected():
            break
        print(f"waited {i} seconds for nic to come up")
        print(nic.isconnected())
        sleep(1)
    else:
        return None

    print("network is up, address is:")
    print(nic.ifconfig())
    return nic


def second_thread():
    while True:
        LED_PIN.on()
        sleep(1)
        LED_PIN.off()
        sleep(1)

def handle_packet(message, pixels):
    print(message)

    for i in range(10):
        pixels[i] = (0, 255, 0)
    pixels.write()

def main():
    pixels = NeoPixel(PIXEL_PIN, 10)
    for i in range(10):
        pixels[i] = (255, 0, 0)
    pixels.write()

    nic = initialize_nic()

    if nic is None:
        print("failed to initialize network")
        return

    core_1 = _thread.start_new_thread(second_thread, ())

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(1.0)
    sock.bind(("", 6454))

    while True:
        try:
            message, address = sock.recvfrom(1024)
            print(f"received a message from {address}: { message }")
            handle_packet(message, pixels)

        except OSError:
            print("Timeout receiving from the socket")
            print(nic.ifconfig())
            print(f"connected: { nic.isconnected() }")
            sleep(1)


if __name__ == "__main__":
    main()

Replacing the neopixel code in the handle_packet function code with a simple sleep() call does not lead to the board locking up (within the 5 minutes I waited).

The neopixel code itself also works just fine, when I don't start the second thread, everything works as intended.

(Note that my code does not depend on the input of other devices (except for sending UDP datagrams to the Pico). You can run it without the LEDs connected, so this is the minimal example code that I could come up with, that reproduces the issue and does not depend on other hardware.)
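
The reproduction needs a second machine sending UDP datagrams to port 6454. A possible sender for a desktop Python is sketched below; the function name, payload, rate, and the address in the usage note are assumptions and not part of the original report.

```python
import socket
import time


def send_datagrams(host, port=6454, count=10, interval_s=0.1, payload=b"ping"):
    """Send `count` numbered UDP datagrams to (host, port); return how many were sent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for i in range(count):
            # Append a counter so dropped or reordered packets are easy to spot.
            sock.sendto(payload + b" %d" % i, (host, port))
            time.sleep(interval_s)
    finally:
        sock.close()
    return count
```

For example, `send_datagrams("192.0.2.10", count=600)` would send for about a minute; 192.0.2.10 is a placeholder for whatever address `nic.ifconfig()` prints on the board.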

@MilhouseVH

MilhouseVH commented May 23, 2023

Here's another example.

Perhaps I misremembered, but it seems that it DOES require an active IRQ callback in order to trigger the crash, at least for me (pretty sure it was crashing during network.ifconfig('dhcp') as well, but maybe that was something else...)

In the test below, it configures the network with dhcp, then rapidly loops around blinking the LED and checking the irq_flag to detect when the button has been pressed.

The second thread is just spinning around, doing practically nothing.

It can take anywhere from between 1 and a couple of dozen button presses, but eventually the RP2040 WILL lock up hard.

No additional network IO required beyond the initial network configuration.

Disable the core_1 thread and it will run forever without a problem.

Note that by default CORE0 output is written to the UART while CORE1 output goes to print() (ie. Thonny console) - just to avoid both cores trampling on each other (which could be avoided with a lock.acquire() but I didn't want to add more complication to the test).
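
For reference, the lock-based alternative mentioned above could look roughly like this. It is a sketch only and is not used in the test that follows; `print_lock` and `locked_print` are hypothetical names.

```python
import _thread

# One lock shared by both cores; whoever holds it owns the console.
print_lock = _thread.allocate_lock()


def locked_print(*args):
    """Print one complete line while holding the lock so lines never interleave."""
    with print_lock:
        print(*args)
```

Keeping it out of the test also keeps a second shared resource out of the reproduction, which matters when the goal is to narrow down what triggers the hang.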

from machine import UART, Pin, SPI
import utime
import network
import _thread

# DEBUG output
uart0 = UART(0, 115_200, parity=None, stop=1, bits=8, tx=Pin(0), rx=Pin(1), timeout=10)

# Pins...
button = Pin(2, Pin.IN, Pin.PULL_UP)

led = Pin(25, Pin.OUT, value=0)

irq_flag = False

# IRQ callback
def button_callback(pin):
    global irq_flag
    irq_flag = True

# W5x00 chip init
def w5x00_init():
    spi = SPI(0, 2_000_000, mosi=Pin(19), miso=Pin(16), sck=Pin(18))
    nic = network.WIZNET5K(spi, Pin(17), Pin(20))

    nic.active(True)

    nic.ifconfig('dhcp')

    while not nic.isconnected():
        utime.sleep_ms(50)

    return nic

def connect_network():
    nic = w5x00_init()

    ipconfig = nic.ifconfig()
    mac = ':'.join([f'{x:02x}' for x in nic.config('mac')])

    config_print('ETHERNET', 'connected!')
    config_print('IPCONFIG', f"ip '{ipconfig[0]}', mask '{ipconfig[1]}', gateway '{ipconfig[2]}', dns '{ipconfig[3]}'")
    config_print('MAC ADDRESS', mac)
    config_print('HOSTNAME', network.hostname())

def ymdhms():
    now = utime.localtime()
    now_ms = utime.ticks_ms() % 1000
    return f'{now[0]:04d}-{now[1]:02d}-{now[2]:02d} {now[3]:02d}:{now[4]:02d}:{now[5]:02d}.{now_ms:03d}'

def config_print(key, value):
    log_print(f'{key:14}: {value}')

def log_print(msg, use_uart=True):
    if uart0 and use_uart:
        uart0.write(f'[{ymdhms()}] {msg}\r\n')
    else:
        print(f'[{ymdhms()}] {msg}')

def second_thread():
    while True:
        log_print('CORE1: Busy doing nothing...', False)
        utime.sleep(1)

def main():
    global irq_flag

    log_print('CORE0: Starting up...')

    core_1 = _thread.start_new_thread(second_thread, ())

    # Establish a network connection
    log_print('CORE0: Establishing network...')
    connect_network()

    button.irq(trigger=Pin.IRQ_FALLING, handler=button_callback)

    led_time = 0
    log_print('CORE0: Main loop is now running!')
    while True:
        now_ms = utime.ticks_ms()

        if utime.ticks_diff(now_ms, led_time) >= 1000:
            led.toggle()
            led_time = now_ms

        if irq_flag:
            irq_flag = False
            log_print('CORE0: BUTTON PRESSED')

        utime.sleep_ms(10)

if __name__ == '__main__':
    main()

@991jo

991jo commented Jun 19, 2023

Quick follow-up: I have code that is using the WIZNET5K driver, PIO and DMA on a RP2040 but without multithreading and that also locks up randomly.

@GM-Script-Writer-62850
Author

> Quick follow-up: I have code that is using the WIZNET5K driver, PIO and DMA on a RP2040 but without multithreading and that also locks up randomly.

Using the micropython build? thus far i have had much better luck using the v2.0 build from wiznet's repo: https://github.com/Wiznet/RP2040-HAT-MicroPython/releases

this happens with the code i posted here as well, but random lockups are far less frequent. no idea why, and i have no idea why i am getting them in the 1st place, but with the micropython build it locks up about every day as compared to every month or 2

@991jo

991jo commented Jun 20, 2023

> Using the micropython build?

Yes.

@MilhouseVH

I wonder if the Waveshare RP2040-ETH will be more reliable and capable of using both cores - for my use case it would be a more-or-less drop in replacement for the Wiznet W5500-EVB-Pico (I'd just need to switch from USB-Micro power to USB-C, and the smaller size Waveshare might even be an advantage).

I'm starting to get the feeling the Wiznet5K module isn't going to get any better, so moving on to something else might be the best long-term option than flogging this dead horse.

@GM-Script-Writer-62850
Author

i have a project using a pico W, it has not had a single issue (that was not my fault) the only thing core 1 does is directly control the 7 segment displays in software, if i end up replacing my W5500s board i am just gonna use a PICO W, i like the idea of wired more than wifi, but the pico not crashing takes priority, at this time i do not know if the crashes i have been having are from EMI causing memory corruption, takes a long time to debug a unknown point of failure when you only get 1 single boolean value every month or so (i ran out of I/O pins)

@MilhouseVH

The W5500 has been totally fine for me aside from these second core/multithreading issues that were a pain to debug and are totally undocumented (this issue is possibly the only "documentation" that exists!)

To be honest I don't really need the second core for my current project - which is just as well! - but it's nice to have different hardware options so I may pick up the Waveshare ETH board next time I'm ordering and see if it works any better than the W5500-EVB-Pico. I'm not expecting it to be a packet shifting monster - and the W5500 certainly isn't that! - but I only require occasional and very limited network IO so it should be fine.

For reference, a single HTTPS GET request over the LAN takes between 5 and 6 seconds with the W5500-EVB-Pico, while the same query implemented as a raw TCP socket request completes in 110ms. TLS on the Pico (or maybe just the W5500?) seems to be a bit of an issue, so my project will use only the socket requests - far too much latency with the HTTPS requests!
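
A rough way to time a raw TCP round trip like the one described above is sketched here. It is written to run on desktop Python as well as MicroPython (hence the import fallback); `timed_tcp_request`, the payload, and the timeout are hypothetical, and the original post does not show how its measurements were taken.

```python
import socket

try:
    from time import ticks_ms, ticks_diff          # MicroPython
except ImportError:                                 # desktop Python fallback
    from time import monotonic

    def ticks_ms():
        return int(monotonic() * 1000)

    def ticks_diff(end, start):
        return end - start


def timed_tcp_request(host, port, payload, timeout_s=2.0):
    """Send payload over a fresh TCP connection; return (reply, elapsed_ms)."""
    start = ticks_ms()
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.settimeout(timeout_s)
        sock.connect((host, port))
        sock.sendall(payload)
        reply = sock.recv(1024)        # a single read is enough for short replies
    finally:
        sock.close()
    return reply, ticks_diff(ticks_ms(), start)
```

The timer starts before the connection is opened, so the figure includes connection setup, which is the part that TLS makes expensive in the HTTPS case.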

Assuming there are no dual core issues with the Waveshare RP2040-ETH I would certainly use it in any future projects in favour of the W5500, or any future Wiznet products assuming this issue with Wiznet5K is never fixed, just in case it became necessary to use the second core.

@GM-Script-Writer-62850
Author

please let me know if you run into any issues with it

i have not noticed network times like 5-6 seconds with HTTP GET request (no need for HTTPS on my local network)

i finally have both of my 5500 units deployed, we will see if this unit acts up with anything over the next 3 months, i managed to allocate every pin on both for use, i need to add the 2 fan plugs and a led circuit to one, if nothing goes wrong on this second deployment i am going to blame my old network switch (i did replace a capacitor on the power board, so maybe another is faulty causing intermittent issues)

again, if you have a w5500 board and are getting fed up with it crashing or locking up, try the v2.0 build (linked earlier in this thread); this is what i am now running on both of my w5500s. maybe i will try the daily micropython build in a few months, i want to see if this thing will crash (hopefully i do not lose power over the next couple weeks with t-storms every day that my 3 UPS units can't deal with)

i have 2 PICO W and 2 W5500-EVB-PICO controllers deployed, i have had one of each act up in what may be the same way, this is why i suspect my switch, the other PICO W has been running for probably 6 months or more without issue (aside from needing an update for connecting to wifi, because the PICO boots faster than the router)

@MilhouseVH

> please let me know if you run into any issues with it

Of course, will do - although not sure when I'll be ordering next, it could be some time.

> i have not noticed network times like 5-6 seconds with HTTP GET request (no need for HTTPS on my local network)

Yeah HTTP is fine, but HTTPS is really quite bad (way too much latency for my needs, which is 1.5 seconds maximum response time).

I'm seeing 5-6 seconds for a trivially simple HTTPS GET on my internal-only development LAN, but in "production" the HTTPS-only web server will be internet-facing, so likely even worse performance.

Fortunately the production web server will be on the same LAN as the W5500(s) - potentially 9 of them all on the same LAN - and so in the end it was easier to spin up a LAN-only facing socket server that listens for the W5500 requests and proxies those requests through to the HTTPS web server and back to the W5500 in less than 110ms. A bit of a hack, but actually works fine.

And for various reasons the socket server also turned out to be much easier to implement than creating and configuring an HTTP server... 😀
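
The post does not show the proxy itself, so the following is only a guess at its general shape for a desktop Python host: a LAN-facing TCP server that reads one line, fetches the matching resource from the HTTPS server, and writes the body back. The one-line wire format, port 9000, the `UPSTREAM` URL, and all names are assumptions.

```python
import socketserver
import urllib.request

UPSTREAM = "https://example.invalid"    # placeholder for the real HTTPS web server


def fetch_upstream(path):
    """Default fetcher: GET UPSTREAM + path over HTTPS and return the body bytes."""
    with urllib.request.urlopen(UPSTREAM + path, timeout=5) as resp:
        return resp.read()


class ProxyHandler(socketserver.StreamRequestHandler):
    fetch = staticmethod(fetch_upstream)    # replaceable, e.g. for testing

    def handle(self):
        # Assumed protocol: one request per connection, a single line with the path.
        path = self.rfile.readline().strip().decode()
        try:
            body = self.fetch(path)
        except Exception as exc:            # answer with an error instead of hanging
            body = b"ERROR " + str(exc).encode()
        self.wfile.write(body)


def make_server(host="0.0.0.0", port=9000, handler=ProxyHandler):
    """Build a threaded TCP server; call serve_forever() on the result."""
    return socketserver.ThreadingTCPServer((host, port), handler)
```

The TLS handshake then happens on the host that runs the proxy, and the board only ever speaks plain TCP on the LAN.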

By the way I'm using the official "v1.20.0 (2023-04-26) .uf2" release for W5500.

@GM-Script-Writer-62850
Author

GM-Script-Writer-62850 commented Jul 7, 2023

in the interest of testing, try running a second thread that sleeps for some time and then runs gc.collect(); it should not be hard to work into your single threaded application

i was trying to debug some code to see how i was able to get a lock up on my code using the 1.2 build (i get no crash errors in my code) but when i tried to make test code i managed to get a memory error in a very short time period (fewer than 10 loops)

from machine import Pin
from time import sleep
from dht import DHT22
from threadsafe import Message # https://github.com/peterhinch/micropython-async/tree/master/v3/threadsafe
import _thread
import urequest # https://github.com/micropython/micropython-lib/issues/546
import network
import uasyncio
#from gc import collect

print("Hello World")

#hard reset sensor
Pin(7, Pin.OUT, value=0)
Pin(8, Pin.OUT, value=0)
sleep(1)

#setup sensor
Pin(7, Pin.OUT, value=1)#This pin powers the sensor
sleep(2)
sensor=DHT22(Pin(8))

LED=Pin(25, Pin.OUT)#debug LED (onboard)

nic=network.WIZNET5K()
nic.active(True)
err=1
while err:
	try:
		nic.ifconfig("dhcp")
		err=0
	except Exception as e:
		print("Network error:",e)
print(nic.ifconfig())

unblock_thread_lock=uasyncio.Lock()
async def unblock(func, *args, **kwargs):#https://github.com/peterhinch/micropython-async/blob/master/v3/docs/THREADING.md#4-taming-blocking-functions
	def wrap(func, message, args, kwargs):
		message.set(func(*args, **kwargs))  # Run the blocking function.
	msg = Message()
	await unblock_thread_lock.acquire()
	LED.on()
	_thread.start_new_thread(wrap, (func, msg, args, kwargs))
	msg=await msg
	LED.off()
	unblock_thread_lock.release()
	return msg

async def main():
	while True:
		await uasyncio.sleep(10)
		uasyncio.create_task(unblock(urequest.get,"http://10.0.0.69:8080/ok.txt")) # just needed a test page
		await uasyncio.sleep(10)
		await unblock(sensor.measure)
		print(sensor.temperature(),"C;",sensor.humidity(),"%")
		#uasyncio.create_task(unblock(collect))
		#collect()
uasyncio.run(main())

i'm gonna try running collect() before i call urequest (edit: make that after, that looks to work better) maybe there is a mem leak in the driver?

@GM-Script-Writer-62850
Author

GM-Script-Writer-62850 commented Jul 9, 2023

not a memory issue, be sure you close() urequest
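
One way to make sure the response is always closed is a small wrapper like the sketch below. It assumes the response object exposes `.text` and `.close()`, as the `Response` class in micropython-lib's requests module does; `get_text` is a hypothetical helper name.

```python
def get_text(get, url):
    """Fetch url with `get` and close the response even if reading it fails."""
    response = get(url)
    try:
        return response.text
    finally:
        response.close()    # releases the underlying socket
```

With the test code above, this would be called as `get_text(urequest.get, "http://10.0.0.69:8080/ok.txt")`, or passed to `unblock()` in place of the bare `urequest.get`.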

i think the only way you are going to safely read a sensor while having a W5[1/5]00 NIC is to do this dirty hack, which adds ~414ms of overhead to each sensor reading

def readDHT(pin):
	cfg=nic.ifconfig()#~1.1ms
	nic.active(False)#~0.1ms
	pin.measure()#~272.5ms
	nic.active(True)#~0.1ms
	nic.ifconfig(cfg)#~412.5ms

if you are reading multiple sensors you could do something like this and save some time overall

cfg=None
def readDHT(pin):
	global cfg # without this, the assignment below creates a local and request() sees None
	if nic.isconnected():
		cfg=nic.ifconfig()#~1.1ms
		nic.active(False)#~0.1ms
	pin.measure()#~272.5ms
def request(url):
	if not nic.isconnected():
		nic.active(True)#~0.1ms
		nic.ifconfig(cfg)#~412.5ms
	return urequest.get(url)
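
To put numbers on the saving, the per-call timings quoted in the comments above can be added up for both variants. This is plain arithmetic on those figures, not a new measurement, and the function names are made up for the illustration.

```python
# Timings in milliseconds, copied from the code comments above.
READ_CFG = 1.1
NIC_OFF = 0.1
MEASURE = 272.5
NIC_ON = 0.1
RESTORE_CFG = 412.5

# Cost of one full NIC off/on cycle, excluding the sensor read (about 413.8 ms).
TOGGLE_OVERHEAD = READ_CFG + NIC_OFF + NIC_ON + RESTORE_CFG


def per_sensor_toggle_ms(n):
    """First variant: the NIC is cycled around every single sensor read."""
    return n * (MEASURE + TOGGLE_OVERHEAD)


def batched_toggle_ms(n):
    """Second variant: NIC goes down once, n sensors are read, NIC comes back once."""
    return n * MEASURE + TOGGLE_OVERHEAD


for n in (1, 4):
    print(n, round(per_sensor_toggle_ms(n), 1), round(batched_toggle_ms(n), 1))
```

With a single sensor the two variants cost the same; the batched version only pays off once several sensors are read between network requests.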

@GM-Script-Writer-62850
Author

GM-Script-Writer-62850 commented Jul 9, 2023

so much for that it still locks up with my actual code that reads a few sensors, sends the post data to the server then i press a button a couple times (1 to 5 times) each time it makes a get request

at this point i suspect if you so much as have a pin configured for DHT you are gonna crash with the network loaded

@GM-Script-Writer-62850
Author

MilhouseVH i think i figured something out try running gc.collect() at the end of your interrupt, it looks like when i do that after pin.measure() as the exact next line (not 2 lines later) it does not crash for some reason
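
As a sketch of what that looks like in code (the helper name is hypothetical, and why it would help is not established in this thread):

```python
import gc


def measure_and_collect(sensor):
    """Read a DHT sensor and run the garbage collector straight afterwards."""
    sensor.measure()
    gc.collect()    # deliberately the very next statement after measure()
    return sensor.temperature(), sensor.humidity()
```

It would be called with one of the `dht.DHT22(...)` objects from the earlier examples, for instance `measure_and_collect(sensors[0])`.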

@GM-Script-Writer-62850
Author

GM-Script-Writer-62850 commented Aug 24, 2023

i have done even more testing

even if you disable the NIC before using the second core then enable it after you are done with the second core it will crash

so my guess (not even sure this is a thing) would be a clock de-sync is to blame, at this time i have the W5500_EVB_PICO-20230426-v1.20.0.uf2 firmware loaded and everything running on the 1st core and i have not had it crash once
