Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Programming sync errors #50

Closed
bc547 opened this issue Sep 15, 2015 · 7 comments
Closed

Programming sync errors #50

bc547 opened this issue Sep 15, 2015 · 7 comments
Labels

Comments

@bc547
Copy link

bc547 commented Sep 15, 2015

Setup

  • Arduino pro mini 3.3V/8Mhz with optiboot bootloader v6.2 modified to serial speed 250000 baud.
  • ESP-07 with esp-link (commit f057a65), configured for 250000 baud, serial port connected to the arduino, gpio12 connected to arduino reset.

My programming environment (where I run avrdude) is a linux vmware host in a remote datacenter, connected to my home network via a vpn over a cable connection. The ESP module is associated with an access point on my home network. So the latency/jitter between the ESP and avrdude is a bit higher than a physical host on the same network as the ESP.

STK500 protocol

For short overview see: http://www.nongnu.org/pulsefire/apidocs/src-html/org/nongnu/pulsefire/device/flash/avr/Stk500Command.html, For the full specs see http://www.atmel.com/Images/doc2525.pdf)

For this explanation, only the next commands are parameters are relevant:

  • commands:
    • 0x10 = STK_OK
    • 0x14 = STK_INSYNC
    • 0x20 = CRC_EOP
    • 0x30 = STK_GET_SYNC
    • 0x41 = STK_GET_PARAMETER
  • parameters:
    • 0x80 = Parm_STK_HW_VER (hardware version)
    • 0x81 = Parm_STK_SW_MAJOR (software version major)
    • 0x82 = Parm_STK_SW_MINOR (software version minor)
    • 0x98 = Param_STK500_TOPCARD_DETECT (stk500 topcard detect)

STK500 synchronisation mechanism

The programmer sends STK_GET_SYNC;CRC_EOP (0x30;0x20) repeatedly until it receives STK_INSYNC;STK_OK (0x14;0x10) from the arduino. At that moment synchronisation is complete. After initial synchronisation, avrdude starts querying the device (hardware/software version,...) by sending for example STK_GET_PARAMETER;Parm_STK_HW_VER;CRC_EOP (0x41;0x80;0x20). The arduino replies with STK_INSYNC;hardwareversion number;STK_OK (0x14;0x3;0x10). This interrogation continues a for while and when all information is gathered and evaluated by avrdude, the programming sequence can start.

The following snippet illustrates initial synchronisation and beginning of the interrogation sequence. This is part of the console output if you run avrdude with multiple -v options (-v -v -v -v -v -v -v -v).

avrdude: Version 6.1, compiled on Aug 16 2014 at 00:04:23
avrdude: Send: 0 [30]   [20]
avrdude: Send: 0 [30]   [20]
avrdude: Send: 0 [30]   [20]
avrdude: Recv: . [14]
avrdude: Recv: . [10]
avrdude: Send: A [41] . [80]   [20]
avrdude: Recv: . [14]
avrdude: Recv: . [03]
avrdude: Recv: . [10]
avrdude: Send: A [41] . [81]   [20]
avrdude: Recv: . [14]
avrdude: Recv: . [06]
avrdude: Recv: . [10]
avrdude: Send: A [41] . [82]   [20]
avrdude: Recv: . [14]
avrdude: Recv: . [02]
avrdude: Recv: . [10]
avrdude: Send: A [41] . [98]   [20]
avrdude: Recv: . [14]
avrdude: Recv: . [03]
avrdude: Recv: . [10]
...

Synchronisation errors

The following snippet illustrates an out of sync sequence (as seen by avrdude) I get very regularly:

avrdude: Version 6.1, compiled on Aug 16 2014 at 00:04:23
avrdude: Send: 0 [30]   [20]
avrdude: Send: 0 [30]   [20]
avrdude: Send: 0 [30]   [20]
avrdude: Recv: . [14]
avrdude: Recv: . [10]
avrdude: Send: A [41] . [80]   [20]
avrdude: Recv: . [14]
avrdude: Recv: . [10]
avrdude: Recv: . [14]
avrdude: stk500_getparm(): (a) protocol error, expect=0x14, resp=0x14
avrdude: Send: A [41] . [81]   [20]
avrdude: Recv: . [03]
avrdude: stk500_getparm(): (a) protocol error, expect=0x14, resp=0x03
avrdude: Send: A [41] . [82]   [20]
avrdude: Recv: . [10]
avrdude: stk500_getparm(): (a) protocol error, expect=0x14, resp=0x10
avrdude: Send: A [41] . [98]   [20]
avrdude: Recv: . [14]
avrdude: Recv: . [06]
avrdude: Recv: . [10]
...

In this snippet, avrdude receives the reply (0x14;0x10) to the 0x30;0x20 synchronisation sequence twice. Since avrdude does not expect this, we get a 2 byte mismatch between issuing a command and receiving the answer to the command. This causes the protocol error messages and finally abort of the programming. In order to get a better understanding how this happens, we look at the serial communication between the ESP and the arduino.

The next image is captured with a logic analyser placed on the serial interface between ESP and arduino and shows 3 logic signals:

  1. RX: serial port on arduino and shows the information sent from the ESP to the arduino
  2. TX: serial port on arduino and shows the information sent from the arduino to the ESP
  3. RESET: show the status of the arduino reset line.

esp-link-nok-annotated

Initial situation: Arduino is running and idling, not sending or receiving anything on the serial port. ESP module is booted and configured for serial bridge. Next, avrdude is started and has opened a tcp connection to the ESP module and starts sending it's first command.

Let's see whats happening step by step:

  1. ESP received the first sync sequence (0x30;0x20) and toggles the reset pin of the arduino
  2. The first received sequence (0x30;0x20) is sent to the arduino. (It is sent very short after the reset, and can be considered lost in almost all situations because the MCU has not booted yet at this moment)
  3. avrdude did not receive a reply and resends the sync sequence (0x30;0x20)
  4. the arduino sends a very late reply (0x14;0x10) to the second sync sequence. A possible explanation for the very late reply is the arduino serial port is already configured and listening around step 3, but some other bootcode is running at that moment and serial processing has not yet started until now.
  5. avrdude did not receive the reply from 4 yet and resends for the 3th time the sync sequence (0x30;0x20)
  6. arduino bootloader is up and running and immediately sends a reply (0x14;0x10)
  7. by this time, avrdude has received the first reply (step 4) and sends the first getparm command
  8. arduino replies immediately to the received getparm command (0x14;0x03;0x10)
  9. avrdude sends the next getparm command
  10. arduino immediately replies to the second command

(I verified the exact data bytes sent/received at each step, but lost the screenshots)

A few possible solutions

  1. (simplified) ESP could discard late replies by the arduino. E.g. a reply to a sync command must occur within 50ms or else the reply is discarded. The effect would have been that the reply of step 4 is not sent to the arduino and protocol synchronisation would be ok.
  2. The ESP could do the first synchronisation step by itself: Send 0x30;0x20 to the arduino until 0x14;0x10 is received from the arduino. Once avrdude resends the sync sequence (0x30;0x20) after the arduino is in sync, the ESP can answer to avrdude immediately by itself with 0x14;0x10.

My feeling is solution 1 seems to be the most protocol agnostic. If needed, the 50ms timeout can be set as an esp-link parameter.
Solution 2 feels like the most robust algorithm, but must be adapted for each programming protocol.

Remark 1: Maybe my observations are an optiboot specific timing issue. If needed, I could try with some other bootloaders.

Remark 2: Still haven't found the time to get acquainted with the esp-link code itself :-(

@tve
Copy link
Member

tve commented Sep 15, 2015

Thanks for the detailed report! Can you confirm that you are using v1.0.3 of esp-link?

I've looked at similar traces and tweaked things back and forth. The avrdude sync code leaves a lot to be desired, for example, it doesn't deal with late sync responses coming in, as you've noted. I don't really agree with The first received sequence (0x30;0x20) is sent to the arduino. (It is sent very short after the reset, and can be considered lost in almost all situations because the MCU has not booted yet at this moment) because optiboot does come up in less than a millisecond, if I remember correctly. On the time-scale of your trace that's going to look like "at the same time". There is a delay of 1ms after toggling reset (https://github.com/jeelabs/esp-link/blob/v1.0/serial/serbridge.c#L235), which I thought was enough, but I can try it again myself as well. One thing I'm curious about is the detailed timing between the reset line, the actual reset pin on the arduino (there may be A/C coupling which can delay the rising edge), and the first character arriving.

WRT your overall set-up: there is a timing constant in optiboot which determines how much time the programmer has between an optiboot response and the next programming command. By default this seems to be 500ms and it applies to every single command. If at any point in time the programmer doesn't send the next command within that time-frame the programming sequence ends and the loaded sketch is run. With your long-distance set-up you may well find that it's difficult to meet this constraint reliably. I'm not saying it's impossible, I don't know what your network connectivity is for one, just mentioning it as something to look out for. Actually, since you're already compiling your own version of optiboot you might as well increase this, however it also increases the time optiboot waits at boot before running your sketch, so if you need that to be short then this won't work for you.

In terms of timing, right now esp-link does not interfere in any way with the data transmitted, and I'd like to keep it that way if at all possible. I could increase the initial delay after reset, but if it exceeds ~10ms I need to change the code quite a bit 'cause that starts to interfere with wifi operation.

@tve
Copy link
Member

tve commented Sep 15, 2015

With regard to boot time, there are fuse settings that determine how long it takes before the mcu boots (Clock source fuse bits). Typically the fuses are set to the most conservative timing which means at least a 65 ms delay before the mcu boots after a powerup or reset.

65ms is more than a busy-wait will accomodate. This brings me to the question why there is not a prompt reply to the second sync attempt (labeled 3 in your trace).

@tve tve added the bug label Sep 16, 2015
@bc547
Copy link
Author

bc547 commented Sep 16, 2015

The version of esp-link I used is the master tree code at commit f057a65.

Optiboot remarks

Looking at the code of optiboot I notice:

  1. The watchdog timer is initialised (optiboot.c line 502) with a timeout value of 1 second. So I assume delay can be up to 1 second (instead of 0.5s)?
  2. I was wondering too about the unusual delay of the response.
    It seems the UART is initialised first, next the led flashes in a blocking way and finally the serial response is processed.

So this timing issue might be optiboot specific. I'll try with the default arduino bootloader in the weekend to see if it behaves differently.

Reset timing

Currently I only have the logic analyser screenshot (see below), but if you want, I can record it in the analog domain too with my scope. But since you toggle the reset from 1 to 0 and back, the rising edge of the reset should be fine afaik. Just like the 1-0 pulse 'passes' the capacitor, the 0-1 pulse wil also 'pass' the capacitor and quickly restore reset back to a logical 1.

image

As can be seen on the screenshot above, the reset pulse is low for 0.2091 ms. The time between the end of the pulse and beginning of the serial TX (from ESP to arduino) is 1.124 ms (time between markers A1 and A2 is displayed on the right side of the screenshot).

Remark The reset channel is also measured on the reset pin of the arduino (just like TX and RX) and NOT on GPIO12 of the ESP module.

My network latency

Thanks for the remark about my overall setup. Just to get an idea, I just measured my round trip times.

I have this small serial-echo sketch running on the arduino that is connected to the ESP:

bool led=false;

void setup()
{
  pinMode(13, OUTPUT);      
  Serial.begin(250000);      
}

void loop() 
{
  while (Serial.available()) {
    Serial.write(Serial.read());
    digitalWrite(13, led);
    led=!led;
  }
}

I also have a small perl program running in my remote development environment that repeatedly sends 1 byte of data to the esp-link'ed arduino and waits for the response. Based on those full-loop round trip times I calculated some statistics (based on 10000 rtt measurements):

min       = 0.00993s
max       = 0.65968s
range     = 0.64974s
count     = 10000
mean      = 0.01488s
median    = 0.01415s
stddev    = 0.00932s

So occasionally, there are some outliers up to 650ms, but in general rtt is 14ms with a small standard deviation. An occasional watchdog timeout might happen sometimes, but in general I would not expect it to be a real issue with this setup.

conclusion & questions

Maybe my sync timing issue is mostly caused by the way optiboot blinks the led. I will test the default arduino bootloader during the weekend.

Some questions:

  • What bootloader do you use?
  • What does avrdude send/receive when you get an error? e.g. Do you also receive a double sync reply (0x14;0x10)?

@tve
Copy link
Member

tve commented Sep 18, 2015

FYI, I will work on this and want to say a big thanks for the detailed description. I also use optiboot but I was reminded of the AVR (and PIC) fuses which can delay the start-up time by varying amounts...

@tve
Copy link
Member

tve commented Sep 21, 2015

I'm now dropping the first sync bytes sent by the programmer on the floor. Let's see whether this improves things...

@bc547
Copy link
Author

bc547 commented Sep 21, 2015

Ok. I'll test it as soon as I got some free time.

@tve
Copy link
Member

tve commented Jan 1, 2016

I'm closing this given that the built-in programming support works pretty well at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

2 participants