glscopeclient hangs on Rigol MSO5000 #358
Comments
It appears that this is not directly a scopehal problem at all. With the sample depth set to 1k, it does a little over 1 waveform per second. At 10k and 100k it becomes only marginally slower, but at 1M it becomes significantly slower and at 10M it hangs completely. At that point Wireshark shows no more data being sent, so it could potentially be a problem on the scope. Maybe it could transfer data in smaller sections? So as discussed on IRC, the issue can be pivoted to "Rigol driver is slow, can we make it faster?". Maybe the answer is just "no". I installed Rigol's own scope software, and it seems to be similarly slow and limited to a low sample depth. A possibly unrelated problem is that commands do not appear to be sent immediately. When loading the configuration from the scope, no commands are sent until a few seconds have passed AND the mouse is hovered over the start button. Strange but true. Maybe there is also an unnecessary delay in sending other commands, making it slow. |
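On the "transfer data in smaller sections" idea, here is a minimal sketch of how a large capture could be split into windows. It assumes the scope accepts a 1-based, inclusive start/stop window per read, which is how Rigol's programming guides describe `:WAV:STAR` and `:WAV:STOP`. Whether chunked reads actually avoid the hang on the MSO5000 is untested here. The helper names and the chunk size are made up for illustration, and the code only builds command strings; it does no I/O.

```python
def chunk_windows(total_points, chunk_size):
    """Yield (start, stop) windows, 1-based and inclusive, covering total_points."""
    if total_points < 0 or chunk_size <= 0:
        raise ValueError("total_points must be >= 0 and chunk_size must be > 0")
    start = 1
    while start <= total_points:
        stop = min(start + chunk_size - 1, total_points)
        yield start, stop
        start = stop + 1


def chunked_read_commands(total_points, chunk_size):
    """Build the command strings for a chunked read (hypothetical sequence, no I/O)."""
    commands = []
    for start, stop in chunk_windows(total_points, chunk_size):
        # Assumed command spellings; check them against the scope's programming guide.
        commands += [f":WAV:STAR {start}", f":WAV:STOP {stop}", ":WAV:DATA?"]
    return commands
```

For example, `list(chunk_windows(10, 4))` gives the windows `(1, 4)`, `(5, 8)` and `(9, 10)`, so the last chunk is simply shorter than the others.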
After the latest caching changes I now get the following traces with 1k sample depth. It's definitely some racy mutex thing, as the main thread gets stuck on a different function call every time, but always with the scope thread in acquire.
|
This is the trace from Wireshark, which includes my scope model and version. The scope simply does not send any data.
|
After fixing the timeout behaviour, I got the following long trace. I can't tell any obvious reason why it'd just not return data. |
I have discovered that the bug can be reproduced by zooming in and out on the timescale. It is interesting to look at the timing of operations. Going by the time from SING to the first reply to TRIG?, the scope takes several hundred ms to become responsive again after the SING command. It is hard to tell from the trace, but my impression is that when the time between the TRIG:STAT? reply and the WAV:DATA? command is short, it bugs out and sends nothing. Does anyone know the Wireshark-fu to measure the latency between commands and correlate it with whether they work or not? Concretely speaking, assuming we're indeed reading too fast, the question is... now what? Do we just insert a delay? Can we measure when it's actually ready? Hrmmmm |
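One way to frame the "insert a delay or measure readiness" question is a polling loop with a timeout plus an optional settle delay. This is only a sketch: `query` is a hypothetical callable that sends a command and returns the reply as a string, and treating a `STOP` reply to `:TRIG:STAT?` as "capture finished" is an assumption. Whether `STOP` really means the waveform data is ready is exactly what is in doubt here, which is why the settle delay is kept as a knob.

```python
import time


def wait_until_stopped(query, timeout_s=2.0, poll_s=0.01, settle_s=0.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll the trigger status until it reads STOP.

    Returns the seconds elapsed until STOP was first seen, or None on timeout.
    settle_s is an extra wait after STOP, in case the scope needs more time
    before a waveform read succeeds (an assumption, not a documented need).
    """
    start = clock()
    while True:
        if query(":TRIG:STAT?").strip() == "STOP":
            elapsed = clock() - start
            if settle_s > 0:
                sleep(settle_s)
            return elapsed
        if clock() - start >= timeout_s:
            return None
        sleep(poll_s)
```

Logging the returned latency next to whether the following read succeeded would give the timing correlation without needing Wireshark.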
I got the following reply from Rigol, suggesting that the MODE and FORM commands have to be sent every time. I asked for clarification on this point and sent them my Wireshark trace.
|
So I wanted to look at how their own software works, which appears to be running continuously. How does it synchronize? It does not. The two slow sines are in phase on the scope and almost in antiphase on the PC, so I can see why glscopeclient uses single runs. But of course the main thing I came for is the Wireshark trace of their software. The only thing worth noting is that they don't send MODE and FORM every time, as Rigol told me to; they only send them once. So what have we learned? Not much. Their software is slow but works. |
It gets weirder! I tried the LXI transport, and the good news is that it doesn't hang or time out, but I assure you that channels 1-3 have no signal on them. So I guess while it's busy triggering and updating, it may just return stale data or no data at all? The Wireshark trace shows that the RPC encapsulation stops it from hanging, but does not stop it from sending empty DATA? replies. Telling from the trace when it sends wrong data is less easy, so I can't rule out that it's a buffer overflow error in glscopeclient when it gets less data than expected. |
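To tell an empty or short DATA? reply apart from a full one, the reply's block header can be checked against the payload. This is a sketch under the assumption that the waveform reply uses the IEEE 488.2 definite-length block format (`#`, one digit N, then N digits giving the byte count, then the data). The function names are hypothetical, and it is not confirmed here whether the "empty" replies seen in the trace are zero-length blocks or something else.

```python
def parse_block(reply: bytes):
    """Split a definite-length block into (declared_length, payload)."""
    if len(reply) < 2 or reply[0:1] != b"#" or not reply[1:2].isdigit():
        raise ValueError("not a definite-length block")
    ndigits = int(reply[1:2])
    if ndigits == 0:
        raise ValueError("indefinite-length blocks are not handled in this sketch")
    header_end = 2 + ndigits
    length_field = reply[2:header_end]
    if len(length_field) < ndigits or not length_field.isdigit():
        raise ValueError("truncated or malformed length field")
    declared = int(length_field)
    # Slice only what was declared; a trailing newline is left out.
    payload = reply[header_end:header_end + declared]
    return declared, payload


def is_complete(reply: bytes, expected_bytes: int) -> bool:
    """True only if the header and the payload both match the expected size."""
    declared, payload = parse_block(reply)
    return declared == expected_bytes and len(payload) == declared
```

A caller that gets `False` from `is_complete` can discard the buffer instead of displaying whatever was left in it from the previous read.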
Right, so the LXI transport will just copy fewer bytes when it gets a shorter reply. A hacky solution would be to zero the buffer before reading and YOLO. I think the proper solution is to change the API to return the number of bytes read, and handle it correctly in all drivers. Or turn len into a pointer type so it's updated with the actual length, much in the same way
Rigol:
Uuuuuhhhh?! So I guess you'd either use screen mode or insert a delay? Yuk. |
I tested |
@pepijndevos @kench I've been told by several people that the hang is fixed with latest firmware from Rigol, please test and confirm. |
I confirm with my Rigol MSO5074 with HW 01.01.00 and latest Firmware Upgrade 01.03.00.01 (Released on 12 Aug 2020) there is no freeze or crash on Rigol side |
Seems like the general consensus is that it's fixed. If anybody still has problems it's likely a new bug, so open a new ticket. |
New root cause identified. Reopened and migrated to scopehal repo. |
Fix issue ngscopeclient#358 (glscopeclient hangs on Rigol MSO5000). This fixes oscilloscope single-trigger synchronization with waveform data by adding the "*WAI" command, which "Waits for all the pending operations to complete before executing any additional commands." (It is an IEEE 488.2 common command.)
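As a rough illustration of the command ordering that fix describes, here is a sketch with `*WAI` placed between arming the single trigger and the first waveform read. Where exactly the referenced commit puts `*WAI` is not shown in this thread, so the placement is an assumption. `send` and `query` are hypothetical callables standing in for the transport, and `:WAV:SOUR CHAN<n>` is the assumed way to select the channel.

```python
def acquire_single(send, query, channels):
    """Arm a single acquisition, wait for it to complete, then read each channel."""
    send(":SING")
    # IEEE 488.2 common command: hold off later commands until pending
    # operations have completed (placement here is an assumption).
    send("*WAI")
    data = {}
    for ch in channels:
        send(f":WAV:SOUR CHAN{ch}")
        data[ch] = query(":WAV:DATA?")
    return data
```

The point of the ordering is that the read is queued behind the acquisition on the scope side, rather than relying on a host-side delay.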
Hopefully fixed for good in 6bfbfb4. |
When I configure glscopeclient for my Rigol and press OK, a window pops up asking whether I want to kill the app or wait for it to respond, and then after a good few seconds the UI comes up.
It is extremely slow to load any data, showing about 3 WFMs at 0.05 WFM/s, and while the sample rate label matches what I see on the scope (2GSa/s, 4Mpts), the view is zoomed in to 20.000 ns, while the scope shows 200us/div. One time I managed to zoom out, but generally the UI just hangs and the time scale vanishes completely when zooming out too far.
I tried to trace the commands it sends:
When I clicked the cross, it again showed the "this app is not responding" dialog and I had to kill it.
As suggested on IRC, I also ran it in gdb, to see where it deadlocks the UI.
All the threads:
All the stacktraces: