Fixed very slow implementation of _ff_mod #768
…ake more and more time, then become fast again as the index (uint16_t) wraps around
hathach
left a comment
Thanks for the PR. Could you provide before and after timing measurements based on your test setup? It is not clear to me how much this boosts performance.
|
I have implemented a JTAG probe that receives over 6500 USB packets to program an FPGA. During my tests, the wall-clock programming time climbs from 2.5 s to 4.5 s as I repeat the programming, then drops back to 2.5 s at the 10th programming, and the cycle repeats. My FIFO depth is 64, so the mod will loop about 1000 times in the worst case. After the fix, the wall-clock time is stable at 2.4 s.
This is on a Raspberry Pi Pico board.
Thanks,
Patrick
|
|
By the way, I pulled my hair out for a while before discovering the root cause of the slowdown. I don't know how to profile code on a Cortex-M0+, so I had to convince myself that it was a communication problem first (using a logic analyzer, I saw that the JTAG output was normal, except that the programming bursts were more and more spaced out). I then looked at the state of my USB interface on the 1st programming and on the 9th. I noticed that the read and write pointers were very large on the 9th. That surprised me, so I looked at the implementation and discovered the mod problem.
Patrick
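To illustrate why large read and write indices are costly, here is a minimal sketch. It assumes the modulo reduces the index by repeated subtraction, which is what the "loop" description above suggests; the original code is not shown in this thread. The name `slow_mod` and the iteration counter are hypothetical and exist only to count the loop passes.

```c
#include <stdint.h>

/* Hypothetical stand-in for a subtraction-based modulo.
 * The iterations counter is added purely for illustration. */
static uint16_t slow_mod(uint16_t idx, uint16_t depth, uint32_t *iterations)
{
    *iterations = 0;
    while (idx >= depth) {
        idx = (uint16_t)(idx - depth);  /* one pass per multiple of depth */
        (*iterations)++;
    }
    return idx;
}
```

With a depth of 64, an index just below the uint16_t wrap-around (65535) needs 1023 passes, while an index below 64 needs none. That matches the pattern of transfers slowing down gradually and then becoming fast again.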
|
hathach
left a comment
Could you also print out the idx value at the 1st and 9th programming? I am curious to see what its value is at those points.
|
Make sure you use this repo's master branch for testing instead of the fork in the raspberrypi repo. If you are not, switch branches and try again.
|
I can confirm that everything is good when I switch to the master branch. Thank you very much for your help!!
|
Thanks, please test against this repo for your next issue/PR. There have been quite a few fixes for rp2040 since it was released in the SDK.
|
See hathach#768 for discussion / inspiration
Describe the PR
Fixed a very slow implementation. The symptom is that USB transfers take more and more time, then become fast again when the index (uint16_t) wraps around.
I appreciate that some MPUs don't have a hardware divider, so I provided a fast path for the case where the FIFO depth is a power of 2.
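As a rough sketch of the idea described above (not the code from this PR), a constant-time modulo with a power-of-2 fast path could look like the following. The names `is_pow2` and `fast_mod` are hypothetical, and the sketch assumes `depth` is non-zero.

```c
#include <stdint.h>

/* Hypothetical helper: non-zero when depth is a non-zero power of two. */
static inline int is_pow2(uint16_t depth)
{
    return depth != 0 && (depth & (uint16_t)(depth - 1)) == 0;
}

/* Sketch of a modulo that never loops. Assumes depth > 0. */
static inline uint16_t fast_mod(uint16_t idx, uint16_t depth)
{
    if (is_pow2(depth)) {
        /* Power-of-two depth: a bit mask replaces the division. */
        return (uint16_t)(idx & (uint16_t)(depth - 1));
    }
    /* General case: a single remainder operation. */
    return (uint16_t)(idx % depth);
}
```

The mask path avoids any division, which is useful on cores without a hardware divider. The general path still costs one division, but its cost no longer grows with the index value.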