rp2/thread: Fix memory corruption when thread is created in core1. #8310

Closed

Conversation

@yjchun
Contributor

yjchun commented Feb 16, 2022

This PR tries to fix #7124 and #7981, among others.

Many multicore issues have been reported in the forums, and some of them involve memory corruption. The problem I narrowed down is that the GC was not working correctly on core1. I tried to fix it, and so far the fix holds up in my own tests, but I am not sure whether I am handling something sensitive correctly.
I hope someone with knowledge of the GC can take a look.

Test code 1:

from machine import UART, Pin
import time
import _thread
import gc

def _test():
    count = 0
    uart = UART(0, 115200, parity=None, stop=1, bits=8, tx=Pin(0), rx=Pin(1), txbuf=32, timeout=10)
    while True:
        m = f'{count} ' * 10 + '\r\n'
        uart.write(m)
        count += 1
        if count % 200 == 0:
            print(f'count: {count}, mem_free({gc.mem_free()}).')
            #gc.collect()

def count_count():
    n = 1
    while True:
        print(gc.mem_free())
        #gc.collect()
        print("Count Count counts", n)
        n += 1
        time.sleep(1)

_thread.start_new_thread(_test, ())
count_count()

Test code 2:

import gc
import _thread

def test():
    for i in range(0x1800):
        print(gc.mem_free())
        bytes(0)
    print("finished")

_thread.start_new_thread(test, ())

@yjchun
Contributor Author

yjchun commented Feb 16, 2022

Updated the PR, as this test fails when both threads call gc.collect() aggressively: garbage is printed in the main thread.

from machine import UART, Pin
import time
import _thread
import gc

def _test():
    count = 0
    uart = UART(0, 115200, parity=None, stop=1, bits=8, tx=Pin(0), rx=Pin(1), txbuf=32, timeout=10)
    while True:
        m = f'{count} ' * 10 + '\r\n'
        uart.write(m)
        count += 1
        gc.collect()
        if count % 200 == 0:
            pass

def count_count():
    n = 1
    m = None
    while True:
        m = f'{n} ' * 10
        print(m)
        gc.collect()
        n += 1
        #time.sleep(1)
    return m

_thread.start_new_thread(_test, ())
count_count()

I put a mutex around GC to fix this, but I am still not really sure whether it is OK to scan core1's stack from core0. And if I guard mp_thread_gc_others() with if (get_core_num() == 1), it does not work.
Any opinions?
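
To illustrate the mutex idea, here is a rough sketch (not the actual patch) of serializing collections using MicroPython's portable mutex API from py/mpthread.h; the gc_collect() body follows the usual port pattern rather than the exact rp2 source, and gc_mutex is a name invented for this example:

#include "py/gc.h"
#include "py/mpthread.h"
#include "shared/runtime/gchelper.h"

// Sketch only: serialize collections so the two cores never mark/sweep
// the heap concurrently. gc_mutex would be initialised once at startup
// with mp_thread_mutex_init().
static mp_thread_mutex_t gc_mutex;

void gc_collect(void) {
    mp_thread_mutex_lock(&gc_mutex, 1);  // block until the other core finishes
    gc_collect_start();
    gc_helper_collect_regs_and_stack();  // scan this core's registers and stack
    mp_thread_gc_others();               // scan the other thread's roots
    gc_collect_end();
    mp_thread_mutex_unlock(&gc_mutex);
}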

@yjchun
Contributor Author

yjchun commented Feb 16, 2022

Sorry, yet another fix. The following test was failing:

from math import sin, cos, radians
import time

def calc():
    product = 1.0
    for counter in range(1, 1000, 1):
        for dex in list(range(1, 360, 1)):
            angle = radians(dex)
            product *= sin(angle)**2 + cos(angle)**2
    return product

def bench(number=10, repeat=5):
    results = []
    for i in range(repeat):
        result = []
        for n in range(number):
            t = time.ticks_ms()
            calc()
            elapsed = time.ticks_diff(time.ticks_ms(), t)
            result.append(elapsed)
        results.append(sum(result)/number/1000)

    results = sorted(results)
    return results


def test1(name):
    for n in range(3):
        r = bench(1, 1)
        print(f'{name}: {r[0]:.3f}')

import _thread
_thread.start_new_thread(test1, ('core1',))
test1('core0')

@jrullan

jrullan commented Feb 18, 2022

@yjchun thank you so much for helping to resolve this issue. Over in the RPi forums we were starting to lose faith that this issue would ever be taken up.

@yjchun
Contributor Author

yjchun commented Feb 19, 2022

I think I found the root cause of the problem: gc_collect_root() was simply not being called on core1's stack and arg.
I hope this is really the last update to this PR, and sorry for the frequent updates. I should have studied the GC properly before opening this PR.

@peterhinch
Contributor

This patch has fixed #7977.

@fivdi
Contributor

fivdi commented Feb 19, 2022

This patch has fixed #7977.

Excellent, that's good news.

@peterhinch
Contributor

After further testing with this patch I suspect there is still something wrong with the memory management.

I have simple uasyncio code running on core 0 and a minimal loop running on core 1. The two threads share a UART, but it is protected with a _thread.lock instance. If I disable the core 1 routine, everything works. Enabling it leads to non-deterministic behaviour on both cores. Putting a gc.collect() in the loop on core 1, and in a looping coroutine on core 0, radically alters the behaviour but it is still wrong.

Can anyone else report success or failure here?

@yjchun
Contributor Author

yjchun commented Feb 20, 2022

I think there are uncertainties about sharing HW resources between the cores. In the UART case, the UART uses an interrupt, and the UART buffer is not protected against the ISR: core1 writing to the UART buffer while core0's UART ISR modifies the same buffer can happen at the same time.
I suspect most other HW resources are not multicore-safe either.
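
For illustration only: a cross-core critical section on the RP2040 generally needs a hardware spin lock, because disabling interrupts only affects the current core. A minimal sketch using the pico-sdk API, with a hypothetical buffer standing in for the UART driver's internal one:

#include "hardware/sync.h"

// Hypothetical shared TX buffer standing in for the UART driver's own.
static volatile uint8_t tx_buf[64];
static volatile uint32_t tx_head;
static spin_lock_t *uart_lock;

void uart_buf_init(void) {
    // Claim one of the RP2040's hardware spin locks.
    uart_lock = spin_lock_init(spin_lock_claim_unused(true));
}

void uart_buf_put(uint8_t c) {
    // spin_lock_blocking() disables IRQs on this core and then spins for
    // the lock, so the section is safe against both the other core and
    // the UART ISR, provided the ISR takes the same lock.
    uint32_t save = spin_lock_blocking(uart_lock);
    tx_buf[tx_head % sizeof(tx_buf)] = c;
    tx_head++;
    spin_unlock(uart_lock, save);
}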

@peterhinch
Contributor

Ah, thanks for that. My code protects the UART with _thread.lock so hopefully writing to the buffer is safe (I only write, and then only a single character). The interrupt may be unsafe, because it may occur after the lock is released.

If the maintainers can identify which hardware devices cannot be safely shared, it really needs documenting as it will trip up numerous users.

I wonder if the PIO can safely be shared, with both threads writing to the FIFO.

@yjchun
Contributor Author

yjchun commented Feb 21, 2022

A little off-topic:
I haven't done any experiments, so this is only speculation, but not many Python objects are multicore-safe. As long as an object operation spans multiple cycles there can be a race condition unless an expensive thread lock is used. In the pthread world I used to take a lock even for a++, and I guess the same applies to rp2 threads (unlike CPython threads, which have the GIL and run on a single core only).
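
As a concrete example of that pthread habit (the names here are invented for illustration): even a bare increment is a load/modify/store sequence, so two threads can interleave and lose an update without the lock.

#include <pthread.h>

static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;
static int a = 0;

// a++ is a read-modify-write: both threads can read the same old value
// and one increment is lost. The mutex serializes the sequence.
void increment(void) {
    pthread_mutex_lock(&counter_mutex);
    a++;
    pthread_mutex_unlock(&counter_mutex);
}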

@peterhinch, for HW resources, I agree we need to identify which devices can safely be shared. For your test case, it is still not clear whether it is the scenario I mentioned or a different problem. I would like to do more serious multicore testing, but it may take some time.

@peterhinch
Contributor

peterhinch commented Feb 21, 2022

I've found the problem: a concurrency bug in my own code. I now have shared write access to the UART working reliably. For anyone wanting to emulate this, a lock is essential; further, each thread only writes a single character to the UART. There may well be a concurrency issue with multi-byte writes, given the presumed behaviour of the interrupt.

Returning to the topic of this thread, my dual-core test runs indefinitely with no shared-memory issues. Without the patch my script rapidly fails.

Apologies for hijacking this thread with a fault which proved to be irrelevant.

@EliAaron

EliAaron commented Feb 24, 2022

This patch fixed my problem with playing WAV files on a second thread!

Now with this patch I can play WAV files on core1 and on core0 I can print output or wait for a user input and start/stop the wave-player accordingly.

Hope these changes get accepted and merged into the base branch quickly.

Thanks!

Details:
I tried playing WAV files on my Raspberry Pi Pico on a second thread using this project https://github.com/danjperron/PicoAudioPWM, but I could not get it to work properly. The playback got stuck after a second or two and the device became unresponsive. I tried commenting out different parts of the wavePlayer code to find the source of the problem. I found that starting the DMA (dma1.start()) caused a crash after a few seconds. I also found that commenting out most of the code except the WAV file reading (f.readframes(nbFrame)) caused a crash after about 30 seconds. Now with this patch it works smoothly!

@dpgeorge
Member

@yjchun the patch now looks very good, and I understand why it's needed: the stack (and arg) of core1 is itself a root pointer, not just the entries in it. So the bug was that the entire stack was being reclaimed by the GC.

With this patch the line gc_collect_root((void **)core1_stack, core1_stack_num_words); can now be removed because the scanning that it does has already been done by gc_collect_root((void **)&core1_stack, 1). If you agree, can you please remove that part of the code?
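
For readers following along, this is the shape of the fix as described here (a sketch reconstructed from this discussion, not copied from the merged commit):

void mp_thread_gc_others(void) {
    // core1_stack and core1_arg are pointers in the BSS, so the GC does
    // not see them as roots by itself. Passing their addresses marks the
    // heap blocks they point to (the thread's stack and its argument),
    // and marking a heap block also scans its contents recursively.
    gc_collect_root((void **)&core1_stack, 1);
    gc_collect_root((void **)&core1_arg, 1);
}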

@dpgeorge
Member

@peterhinch it looks like you don't have any additional issues that came out of the discussion here? If you do, please open other issues to record them.

@yjchun
Contributor Author

yjchun commented Feb 25, 2022

@dpgeorge OK, I will test it. But if I understand correctly, gc_helper_collect_regs_and_stack() was supposed to check core1_stack when run on core1, but it does not. That's why gc_collect_root((void **)&core1_stack, 1) is outside the if clause. Is there no problem with that?

@dpgeorge
Member

gc_helper_collect_regs_and_stack() was supposed to check core1_stack when run on core1, but it does not.

You are right, it is not checking the core1_stack pointer itself (only the elements it points to). This is because the pointer is stored in the BSS, which is not traced for root pointers by the GC.

@yjchun
Contributor Author

yjchun commented Feb 25, 2022

@dpgeorge Do we still need to call gc_helper_collect_regs_and_stack() on core1 then?
In a quick test, it works fine without calling gc_helper_collect_regs_and_stack() on core1.

@dpgeorge
Member

Do we still need to call gc_helper_collect_regs_and_stack() on core1 then?

We don't need to collect the stack in that function, because it's already being collected by the call to gc_collect_root((void **)&core1_stack, 1). But we do need to collect the registers.

If you leave that call there, it'll collect the stack twice, which is slightly inefficient. If you can remove it, that would be good, but it might be tricky.

@yjchun
Contributor Author

yjchun commented Feb 25, 2022

I tried a replacement for gc_helper_collect_regs_and_stack() on core1.
Will this gather the registers correctly? I tried it on core1, but my test program ran fine both with and without it; I couldn't construct a test case where an object is stored only in registers at the moment the GC runs.
Another problem is that both core0's and core1's registers need to be collected, not just those of the core running the GC.

#if MICROPY_PY_THREAD
// provided by gchelper_m0.s
uintptr_t gc_helper_get_regs_and_sp(uintptr_t *regs);

// Scan the current core's CPU registers for root pointers
// (gc_helper_regs_t comes from shared/runtime/gchelper.h).
MP_NOINLINE void gc_collect_regs(void) {
    gc_helper_regs_t regs;
    gc_helper_get_regs_and_sp(regs);
    gc_collect_root((void **)regs, sizeof(regs) / sizeof(uintptr_t));
}
#endif

@peterhinch
Contributor

@dpgeorge

it looks like you don't have any additional issues that came out of the discussion here?

That's correct. With the patch my application is rock-solid.

@dpgeorge
Member

Will this gather the registers correctly? I tried it on core1, but my test program ran fine both with and without it; I couldn't construct a test case where an object is stored only in registers at the moment the GC runs.

It's really hard to find such test cases. But, yes, that code looks like it'll collect/scan the registers correctly.

Another problem is that both core0's and core1's registers need to be collected, not just those of the core running the GC.

Yes, I also realised this. That's hard to do, and so far it doesn't seem to be needed... so let's just fix the issue of scanning &core1_stack and &core1_arg.

@yjchun
Contributor Author

yjchun commented Feb 25, 2022

So this may be my final version, then. The fix is already pushed; thanks for the support. For GC scanning of the registers, let me try that in another PR.

@dpgeorge
Member

dpgeorge commented Mar 1, 2022

Thanks for updating, it looks very good now. Squashed and merged in 2d47020. I also tested the snippets of code that previously broke, and they now work.

Great work @yjchun !

@dpgeorge closed this Mar 1, 2022
Successfully merging this pull request may close these issues.

rp2: crash and memory corruption with threading