Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to download after certain previous downloads #194

Open
PropGit opened this issue Jul 17, 2015 · 36 comments
Open

Unable to download after certain previous downloads #194

PropGit opened this issue Jul 17, 2015 · 36 comments
Assignees
Milestone

Comments

@PropGit
Copy link
Contributor

PropGit commented Jul 17, 2015

Parallax IDE may not be flushing it's receive buffer after a reset condition is delivered to the BASIC Stamp, or not waiting long enough before flushing. See BASIC Stamp Programming Protocol v1.1 page 5.

After the following program is downloaded to the BS2 module, Parallax IDE is often unable to identify (or download) to the BS2 again whereas on the same computer the BSE software is able to do both with no problem. See this demo video for an example. Sometimes Parallax IDE is able to identify again, but unable to download and it displays a toast saying "Serial port not open."

UPDATE: Here's a more in-depth video using a different PBASIC program to cause the problem.

' BRANCH.BS2
' This program shows how the value of idx controls the destination of the
' BRANCH instruction.

' {$STAMP BS2}
' {$PBASIC 2.5}

idx             VAR     Nib


Main:
  DEBUG "idx: ", DEC1 idx, " "
  BRANCH idx, [Task_0, Task_1, Task_2]          ' branch to task
  DEBUG "BRANCH target error...", CR, CR        ' ... unless out of range

Next_Task:
  idx = idx + 1 // 4                            ' force idx to be 0..3
  PAUSE 250
  GOTO Main

Task_0:
  DEBUG "BRANCHed to Task_0", CR
  GOTO Next_Task

Task_1:
  DEBUG "BRANCHed to Task_1", CR
  GOTO Next_Task

Task_2:
  DEBUG "BRANCHed to Task_2", CR
  GOTO Next_Task
@2fast2fourier
Copy link
Contributor

I was able to replicate this issue and tracked a connection leak to the serialport library. I've submitted a PR (garrows/browser-serialport/pull/38) and I will resolve this issue once the fix is verified and merged.

@PropGit
Copy link
Contributor Author

PropGit commented Aug 6, 2015

We have a customer in our educator's course today that is experiencing this; at least the pattern fits. Very frustrating. I hope this actually fixes it, but I'm skeptical. Please test using multiple programs that contain repetitive DEBUG output- that's where I see it happen most often.

I don't remember having to restart Parallax IDE in order to solve this particular problem; I just have to use the BSE software to reprogram the BS2-IC with a simple PBASIC program containing a STOP command, then Parallax IDE can recognize it and reprogram it.

@phated
Copy link
Contributor

phated commented Aug 7, 2015

This should be fixed with dependency updates. However, we were never able to duplicate this problem with the supplied source code, so it should be thoroughly tested by @PropGit. Please re-open if the problem persists.

@phated phated closed this as completed Aug 7, 2015
@PropGit
Copy link
Contributor Author

PropGit commented Aug 17, 2015

This problem is not entirely fixed.

The original problem still exists exactly as before using the Branch code and demonstrated by the first video in the first post.

The second and likely related problem noted by the second video and the Debug_Debugin code (below) still occurs, but not with as much frequency. Sometimes it fails, sometimes it does not. Perhaps the dependency updates improved things?

@PropGit PropGit reopened this Aug 17, 2015
@phated
Copy link
Contributor

phated commented Aug 17, 2015

Requires more research and information, as we are not able to duplicate with the given code.

@phated phated assigned PropGit and unassigned 2fast2fourier Aug 17, 2015
@PropGit
Copy link
Contributor Author

PropGit commented Aug 17, 2015

Let's add it to the next Sprint and I'll follow-up with more research at that time.

Question: Do you have a logic analyzer? I can't remember.

@urbantumbleweed urbantumbleweed modified the milestones: Sprint 9, Sprint 8 Aug 18, 2015
@2fast2fourier
Copy link
Contributor

Yea, I have a Saleae logic analyzer, it should work fine for this case. I wasn't able to get the code above to replicate the problem, but I can start by running it again and watching the serial lines. That'll let me see if I was just lucky and was avoiding a timing issue on my end.

If you can come up with any other examples that might cause the issue more reliably, that would help a ton.

@PropGit
Copy link
Contributor Author

PropGit commented Aug 18, 2015

Okay, I'll try to discover more.

As it might related to this (but maybe not), take a look at Issue #237.

@PropGit
Copy link
Contributor Author

PropGit commented Aug 18, 2015

Please disgregard #237. My mistake, it's a legal and optional part of the protocol.

But this issue here is still a problem. I'll try to give more info if I can.

@PropGit
Copy link
Contributor Author

PropGit commented Aug 18, 2015

Here's some more evidence to study. There are some unexpected behaviors from Parallax IDE that show up here.

To start with, for reference here's a trace of the BSE software:

  • Downloading the BRANCH source to the BS2 (there's an Identify step that is automatically done at the start of Download process as part of the port scanning procedure, so effectively there's two identifications for the download process)
  • Letting it run (BS2 transmits data from running code)
  • Identifying three more times (successfully)

image

Here's a trace of the Parallax IDE app trying to do the same thing as above:

image

First: Note that there are two reset pulses in every case. There should only be one for each time it wishes to Identify a particular stamp type.

Second: Parallax IDE isn't always transmitting any hello codes after the simulated break condition ends. This is something I pointed out in Issue #177 that I've just reopened.

Here's a close-up of the first Identify attempt (above) I made after the successful download.
image

  • There's a reset with no simulated break condition, so the BS2 starts running it's program and transmitting data.
  • Then there's a second reset followed by a simulated break condition (putting the BS2 into a host-communication mode, so it doesn't transmit any more), but then Parallax IDE never actually sends a hello code... it just starts the process all over again (4 more times) presumably looking for a BS2 (the first time) and a BS2e, BS2sx, BS2p, and BS2pe (the next 4 times). Without it actually transmitting any hello codes, it won't ever get a response from the stamp.

For reference, here's what the same moment looks like from the BSE's signals:
image

Of course, it Identifies the Stamp, so it doesn't attempt any more resets, breaks, and hello code transmissions.

NOTE: I'm running this on a Win 7 Pro 64-bit machine. This doesn't happen all the time. It will work fine at times and other times it behaves this way. To get these captures I simply keep downloading followed by identifying over and over again with difference cadence until it breaks. Sometimes it does this very frequently, sometimes not. It's not usual for a user to do this exactly thing, but I started this because it should never behave this way and all too often, when I was doing a simple download after a previous download, it'd get stuck in this kind of mode and I could usually never get it to snap back into proper behavior. Very frustrating. I started doing the routine above because it's the only thing I've found to cause the problem often enough that I could capture it reliably.

@PropGit
Copy link
Contributor Author

PropGit commented Aug 18, 2015

Here is the Saleae Logic 8 capture files from the previous post.

BSE (BASIC Stamp Editor) Run + Three Identifies

Parallax IDE Run + Three Identifies

@2fast2fourier
Copy link
Contributor

I've identified what appears to be the cause of this. I wasn't able to replicate this issue on OSX/Chromebook, but on WinXP and Win10 I was able to replicate it with the provided program and other programs with excessive DEBUG messaging early in the program.

I came across an existing ticket for Chrome that described the same issue, where it was tracked down to a CE_OVERRUN error state occurring when opening the port. However, chrome converts that code into a generic system_error that also encompasses several other system errors. The overrun state itself is recoverable, but since that state was indistinguishable from the other error states, we close the connection automatically. The other merged error cases are CE_FRAME (framing error), CE_RXOVER (buffer overflow), and CE_RXPARITY (parity error), and potentially generic errors reported by the OS.

I was able to write a simple workaround, where I attempt to recover the connection after a system_error and allow the identify/bootload process to continue. This seems to work pretty well, but this leaves us in a bad position where we can't handle the other error cases.

That Chrome ticket has a few patches that will break up system_error into new error cases we can use to resolve this, but the patches landed just a few weeks ago and won't appear in a stable release for a few months.

I'm putting together a branch with the workaround so you can verify it from your end.

@phated
Copy link
Contributor

phated commented Aug 25, 2015

@PropGit I think we should target the version of chrome which the linked patch lands in. It doesn't seem to be a good idea to forgo all error handling for a single case that only happens on Windows. I found a graph on new version uptake at http://clicky.com/marketshare/global/web-browsers/google-chrome/ that shows that chrome has near-perfect uptake when a new version is released.

@PropGit
Copy link
Contributor Author

PropGit commented Aug 25, 2015

@2fast2fourier - Thanks. Once I can test the workaround branch, we can decide further on this. If it all works, we or I can create another issue to further address this with a later stable release of Chrome and hopefully handle each of the distinct errors appropriately (choosing to lump some of them together, perhaps).

The CE_OVERRUN error is not one I would have expected at this point; that's a lower-level problem with system scheduling in the OS as far as I know.

@PropGit
Copy link
Contributor Author

PropGit commented Aug 25, 2015

@phated - Are we looking at a few months out by doing that? The noted errors are all reception errors that are soft issues from our perspective; things that happen due to context changes (like opening a port that was otherwise closed but with a device on it that was sending data already, or performing a reset on the hardware that was in the middle of sending data).

Unless there's other errors of a more serious nature (possible) then I don't think it's a dangerous thing.

@2fast2fourier
Copy link
Contributor

A small update, the patch I mentioned before is in the chrome dev channel and scheduled for v46. v45 should release in September (44 was 7/21), meaning 46 will probably release late October or early November.

After digging into this I found out that I'm actually running into framing errors, not overruns. The timing on this is unusual, I'm still working out the exact series of serial events. I'll update when I have more info.

@urbantumbleweed urbantumbleweed modified the milestone: Sprint 10 Sep 1, 2015
@PropGit
Copy link
Contributor Author

PropGit commented Sep 3, 2015

@2fast2fourier - I sent an email... I built and ran but get a "serial.break is not a function" error.

@PropGit
Copy link
Contributor Author

PropGit commented Sep 4, 2015

I can see the break condition working with the experimental-break branch's build and Chrome 46.0.2490.13 beta-m installed.

The reset signal is way outside of the break condition though, which causes the BASIC Stamp to wake up, see no break condition (yet) and move on to its normal operation, ignoring any further data from the host trying to program it.

The BASIC Stamp wakes up from reset in roughly 18 ms, so the break condition needs to be happening before that. In this capture, the break condition doesn't start until 32 ms after reset ends.

image

For a target signal reference, here's the BSE's signals (zoomed-in a little more) from the same computer successfully identifying a BS2-IC.

image

In past implementations, I found that I could get the proper response (DTR inside of Break condition) by, strangely, telling it to do a DTR toggle first, then a Break condition immediately. (I think this is what I did).

There's other potential tricks as well, like extending the DTR reset to start before the break, then start a break, then end the DTR reset, then end the break condition.

I know there may be different limitations with Chrome.SerialAPI, but I'm just suggesting trying different things that may not be obvious.

@2fast2fourier
Copy link
Contributor

That DTR is only showing the leading edge, and I was experimenting with bringing DTR high before break, and dropping it immediately after, like the trick you mention. I was under the assumption that the reset only finished on DTR falling edge, but what happens while DTR is high? There's a relatively long period between DTR on/off due to the callback events (~16ms at best), I might be able to work around it though.

@PropGit
Copy link
Contributor Author

PropGit commented Sep 5, 2015

@2fast2fourier - No, I just zoomed out the image too far to see all of DTR; it's a high pulse of only 1.08 uS in my capture, rising edge and falling edge.

You're right, the reset condition ends upon DTR falling edge (and about 18 mS later the BASIC Stamp is fully awake). While DTR is high, the BASIC Stamp is asleep.

By DTR on/off do you mean DTR high/low (as seen by the BASIC Stamp)? The length of time of DTR high is relatively inconsequential; it just needs to be long enough for the signal to generate a reset on the BASIC Stamp. The timing of the DTR fall and the break condition is the most critical part of this sequence.

@2fast2fourier
Copy link
Contributor

@PropGit I've made some changes on my experimental-break branch, can you remove node_modules, reinstall, and check it out? I've found a workaround for an issue where it can't recover from system_error state by structuring the handshake and closing/reopening the connection at a specific point if we run into the error state.

It seems to improve the success rate considerably, but I'm still working on the timing. I'd also like to see how the timing works out on your system, if you could send me a full scan that would be great.

@PropGit
Copy link
Contributor Author

PropGit commented Sep 14, 2015

@2fast2fourier - I removed node_modules and then did a git pull origin experimental-break and it tells me my branch is already up-to-date.

I'm on commit 5de0fe... from Sept 1st.

@2fast2fourier
Copy link
Contributor

@PropGit That's the right commit for the main repo, the changes are in bs2-serial-protocol which will be installed when you npm install again. Deleting node_modules ensures we'll get an up to date version.

@PropGit
Copy link
Contributor Author

PropGit commented Sep 14, 2015

@2fast2fourier - Okay, rebuilding now.

@PropGit
Copy link
Contributor Author

PropGit commented Sep 14, 2015

No successful identification on Windows 7.

Here's the full scan:

image

Here's progressive zooms of the scan for BS2:
image

image

The reset pulse is 1.22 uS in width and its rising edge is about 3 ms after the break condition ends. There's another reset that happens before the break condition starts, but that's not visible in the last zoom.

@urbantumbleweed urbantumbleweed modified the milestones: Sprint 10, Sprint 11 Sep 14, 2015
@PropGit
Copy link
Contributor Author

PropGit commented Sep 19, 2015

@2fast2fourier - Following up on our meeting discussion, here are some details from the Windows Delphi-based serial code I wrote for BSE.

The entire Serial.pas unit can be viewed, but here is a summary (with unimportant details left out for brevity):

The TSerial.Open method converts the COM port ID to UNC form, then opens the port using the CreateFile Windows API call:

strpcopy(TheBuffer, '\\.\'+Port);  {Copy comm port setting into buffer (using the UNC method to even open ports COM10 and above)}
nCid := CreateFile(TheBuffer, GENERIC_READ or GENERIC_WRITE, 0, nil, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, LongInt(0));    {Open the comm port}

Then, if the handle nCid is valid, it get's the port state information into a device control block:

if NOT GetCommState(nCid, LpDCB) then
  {some error handling here}
  ...
else
  {Control block was successfully built}
  begin
  SetComConfiguration(Baud, Parity, DataBits, FlowControl);  {Set Com port's settings}
  getmem(Buffer, BufferSize);                                {Allocate space for input buffer}
  SetTimeouts(0,0,0,0,0);                                    {Initialize Timeouts to defaults}
  end;
  • SIDENOTE Ha! I just realized I must have started refactoring attribute-setting code into SetComConfiguration and never quite finished, because this code gets the state, then if there's no error, it calls SetComConfiguration which itself gets the state again... oops!

SetComConfiguation gets the COM port state, sets the attributes it wants, then (critically) immediately sets DTR (and RTS) low. This is probably why we don't see the double DTR pulse in BSE anymore. NOTE: I think we treat both DTR and RTS the same way at the same time because some USB-to-Serial devices bring out RTS instead of DTR.

{Set Baud, Parity, Data Bits, and Stops Bits; all while limiting them to reasonable values; }
LpDCB.BaudRate := ifthen(StrToInt(Baud) < 110, LpDCB.BaudRate, StrToInt(Baud));
LpDCB.ByteSize := ifthen(not StrToInt(DataBits) in [7, 8], LpDCB.ByteSize, StrToInt(DataBits));
if uppercase(Parity) = 'EVEN' then LpDCB.Parity := EVENPARITY else LpDCB.Parity := NOPARITY;
LpDCB.StopBits := ONESTOPBIT;
LpDCB.Flags    := 0;
{Make final API call to set up initial state of port}
setcommstate(nCid, LpDCB);
ChangeDTR(False);
ChangeRTS(False);

@PropGit
Copy link
Contributor Author

PropGit commented Sep 19, 2015

@2fast2fourier - Some other notable methods in Serial.pas are TSerial.FlushRX and TSerial.SetTimeouts.

@2fast2fourier
Copy link
Contributor

I've isolated this down to an issue with chrome's serial connection hanging after receiving a framing error. I've opened an bug on the chromium project, and included a demo app that will hopefully make it really easy for them to replicate. https://code.google.com/p/chromium/issues/detail?id=534517

At this point, without a fix for that specific issue, I have minimized the cases that will trigger this issue. The timing is a little variable, but as long as any program has a delay of at least 100-200ms before outputting any serial data, we shouldn't run into any issues. If for any reason a framing error occurs, that connection will be unresponsive until we close it again. However, framing errors only really happen when starting the break condition, or if a reset happens outside of our break.

To explain that 100ms figure: The critical part is that we need to be able to enter the break condition before the any serial data is sent from the board. The chrome serial open call causes DTR to trigger, but we don't have any control over that and that opening DTR won't trigger a framing error. After the open callback (~10-20ms delay), we request a setBreak. The break condition seems to take an additional 10ms to establish, but at that point we don't have to worry about any data coming in from the device. Then we can complete a DTR cycle and enter programming mode.

I'll be cleaning up my experimental branch and preparing it to PR, that should resolve #144 as well. When a fix lands for the chrome issue, we shouldn't need to change anything on our side.

@PropGit
Copy link
Contributor Author

PropGit commented Sep 21, 2015

@2fast2fourier - Thank you for this report and for submitting the bug report on Chromium too.

The 200 ms delay will be a workaround we'll have to add to example code for now.

However, framing errors only really happen when starting the break condition, or if a reset happens outside of our break.

QUESTION: Are you saying that the reflected break condition doesn't cause a framing error and a resulting lock-out of controlling the port?

NOTE: Resets outside of break (via dev board reset button) are not super common, but they are far from rare as it's a test-and-diagnose step users take while developing.

There's also the case you haven't see yet: BASIC Stamp code uses SEROUT instead of DEBUG in order to transmit at a different baud rate. This will cause framing errors also.

The chrome serial open call causes DTR to trigger, but we don't have any control over that and that opening DTR won't trigger a framing error.

Have you tried to immediately set DTR low (Clear DTR) right after opening and configuring the port, like we do in the BSE example code to see if it gets rid of the port-opening-induced DTR toggle?

@PropGit
Copy link
Contributor Author

PropGit commented Sep 21, 2015

@2fast2fourier

In the Chromium bug report, you say:

As an experimental workaround, I removed RECEIVE_ERROR_FRAME_ERROR from ShouldPauseOnReceiveError in serial_event_dispatcher.cc. This meant it would no longer automatically pause when it received the frame_error. This worked, the connection no longer entered an unrecoverable state...

Is serial_event_dispatcher.cc a part of the Chromium project (ie: not something we can override in Parallax IDE)? Because, automatic pausing of port processing/activity upon errors like this is something we don't experience in BSE and thus we can handle it on a higher level right where we need it (and we choose to ignore it and move on; exactly what we need and expect).

@2fast2fourier
Copy link
Contributor

@PropGit For the first question, that reflected break is the primary cause of our framing errors if it occurs while the board is transmitting. Otherwise breaks seem to be pretty safe if it occurs while the device isn't sending.

SEROUT triggered frame errors are definitely something we probably want to handle, but they aren't as likely to be an issue while programming so I think that should be a separate issue. In any case, when the bug is fixed in chromium we won't need to differentiate framing errors from any other recoverable error state.

The issue with opening the port is that we don't have access to the serial port until after that callback is called. That means the initial DTR flip happens well before we know the connectionId that the rest of chrome.serial uses. Honestly, this initial DTR is mostly harmless, we never see any framing errors from it since the board is reset right before the port starts listening. The main issue with it is that the board will fully wake up about 18ms after that reset and potentially start transmitting data before we can setBreak.

Yes, we can't override anything in the chromium project, it's all deep in the API internals and apps aren't allowed to modify chrome itself.

@PropGit
Copy link
Contributor Author

PropGit commented Sep 21, 2015

@2fast2fourier - Thank you.

FYI: Ironically, this morning a customer reported having a problem that I think is exactly this issue.

@PropGit
Copy link
Contributor Author

PropGit commented Oct 2, 2015

@2fast2fourier - Regarding this post above, I whole-heartedly agree, you've greatly minimized the conditions that trigger the problem. I've tested with all the code (and more) that I found the original problem with and Parallax IDE never failed to download again on Win 7! This is incredibly exciting because I didn't expect it to work so well with this problem still existing! Great Job!

I've found that I can force it to fail in much the same way, but only by downloading code including a SEROUT instruction that transmits at a different baud rate than the Debug Terminal's default 9600.

Idea For Possible Interim Workaround
Considering all the possibilities, we still need Chrome's Serial API to be fixed to solve all the future problems we'll likely have with this, but maybe there's an interim workaround that will make things even more robust currently, now that Chrome v45+ supports the BREAK condition!

You know how simply opening the port causes the first DTR toggle event, as seen here?
image

The BS2 resets in about 18 ms and starts transmitting data (if the previously downloaded code does so).
image

So I see a possible way to nearly eliminate framing error problems when trying to Identify or Download:
Is it possible to set the BREAK condition immediately after opening the port, and will that be likely to occur in less than 18 ms? If so, after setting BREAK immediately, you'd simply toggle DTR (on purpose, the second time) and then clear BREAK and continue with the Identification and Download sequence as normal. NOTE: It seems we've learned from that error handling mistake this week that if you set BREAK and leave it before you closed the port, the BREAK condition is persistent and the port actually opens up next time with BREAK already set. If we can't set BREAK fast enough after opening the port, perhaps taking advantage of this BREAK persistence can pay off. If we can get BREAK to be set immediately, or nearly immediately, upon port opening, it completely prevents any spurious serial data from the BASIC Stamp from causing a framing error!

What do you think?

@phated phated modified the milestone: Sprint 11 Oct 5, 2015
@urbantumbleweed urbantumbleweed modified the milestone: Sprint 13 Oct 20, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants