Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not emitting frame issues #87

Closed
cblack-f3 opened this issue Feb 2, 2021 · 13 comments
Closed

Not emitting frame issues #87

cblack-f3 opened this issue Feb 2, 2021 · 13 comments

Comments

@cblack-f3
Copy link

We've been banging our heads for a bit on this issue, and I'd like to get some insight into what might be going on in the s2e_nextTxAction state machine to cause this "radio is not emitting frame" error. Firstly, the issue is intermittent (one out of every 10-30 or so downlinks). A bit about the setup

Environment

  • GW: Multitech Conduit fw version 5.3.0 : Station Ver : 2.0.5(mlinux/std) 2020-10-19 19:44:13
  • LNS: Chirpstack network server 3.12
  • End-device: Custom device - Class C
  • Station conf: station.conf.txt
  • Station bootlog for reference: basicstation_bootlog.txt

The Test

We are manually issuing an unsolicited downlink frame in class C mode (so no RX1/RX2 timing windows come into play here). The device is muted after joining the LNS, so as to not interfere with the test. We manually send a 2 byte dummy frame to this device every 5-10 seconds, though we have seen the error occur up to an hour after the previous packet, so I can safely say this is not a "back to back" frame issue. The attached logfile shows 7 successful downlinks, then the 8th fails. (primary error logfile: successes_then_failure.txt) When this occurs, it will never succeed in the "retry" (it always goes through 10 retries and eventually fails); maybe that nuance of behavior is important.

At this point I am not convinced it is a server issue, so I'd like to get some insight on this as a potential issue in basic station.

Also, I am very curious why the retries don't seem to function - for one test, I set CLASS_C_BACKOFF_BY setting in the station conf, to allow more time between retries. It still failed all 10 attempts across 5 seconds: emitting_frame_error_500ms.txt. Maybe this information is also useful

@beitler
Copy link
Contributor

beitler commented Feb 3, 2021

Thanks for the well structured description. I am sure we will find out what is going on in your setup.

Judging by the logs, your frame is definitely placed into the radio TX buffer, but the TX status verification is failing. These are time critical operations. from the high level description is sounds to me that the TX timing parameters are not properly tuned to the expected SPI delays for your gateway:

basicstation/src/s2conf.h

Lines 190 to 193 in bd17e53

CONF_PARAM(TX_MIN_GAP , ustime, tspan_s , DFLT_TX_MIN_GAP, "min distance between two frames being TXed")
CONF_PARAM(TX_AIM_GAP , ustime, tspan_s , DFLT_TX_AIM_GAP, "aim for this TX lead time, if delayed should not fall under min")
CONF_PARAM(TX_MAX_AHEAD , ustime, tspan_s , DFLT_TX_MAX_AHEAD, "maximum time message can be scheduled into the future")
CONF_PARAM(TXCHECK_FUDGE , ustime, tspan_s , DFLT_TXCHECK_FUDGE, "check radio state this time into ongoing TX")

To find out, please do the following:

  • Print the low-level tuning parameters with station -p and post them here.
  • Run your tests again with libloragw compiled in DEBUG mode (explained here) and station run with XDEBUG loglevel. Please post the logs here.
  • When conducting your test it would help to have a spectrum analyser handy to check if the frame for which station is unable to verify the TX status is actually emitted by the radio or not.

@cblack-f3
Copy link
Author

Thanks for the quick reply:
station -p

   str RADIODEV             = "/dev/spidev?.0" builtin      default radio device
    u4 LOGFILE_SIZE         = "10MB"     builtin      default size of a logfile
    u4 LOGFILE_ROTATE       = 3          builtin      besides current log file keep *.1..N (none if 0)
    u4 TCP_KEEPALIVE_EN     = 1          builtin      TCP keepalive enabled
    u4 TCP_KEEPALIVE_IDLE   = 60         builtin      TCP keepalive TCP_KEEPIDLE [s]
    u4 TCP_KEEPALIVE_INTVL  = 15         builtin      TCP keepalive TCP_KEEPINTVL [s]
    u4 TCP_KEEPALIVE_CNT    = 4          builtin      TCP keepalive TCP_KEEPCNT
    u4 MAX_JOINEUI_RANGES   = 10         builtin      max ranges to suppress unwanted join requests
ustime CUPS_CONN_TIMEOUT    = "60s"      builtin      connection timeout
ustime CUPS_OKSYNC_INTV     = "24h"      builtin      regular check-in with CUPS for updates
ustime CUPS_RESYNC_INTV     = "1m"       builtin      check-in with CUPS for updates after a failure
    u4 CUPS_BUFSZ           = "8KB"      builtin      read from CUPS in chunks of this size
ustime GPS_REPORT_DELAY     = "120s"     builtin      delay GPS reports and consolidate
ustime GPS_REOPEN_TTY_INTV  = "1s"       builtin      recheck TTY open if it failed
ustime GPS_REOPEN_FIFO_INTV = "1s"       builtin      recheck if FIFO writer fake GPS
ustime CMD_REOPEN_FIFO_INTV = "1s"       builtin      recheck if FIFO writer
ustime RX_POLL_INTV         = "20ms"     builtin      interval to poll SX1301 RX FIFO
ustime TC_TIMEOUT           = "2s"       station.conf reconnected to muxs
ustime CLASS_C_BACKOFF_BY   = "100ms"    builtin      retry interval for class C TX attempts
    u4 CLASS_C_BACKOFF_MAX  = 10         builtin      max number of class C TX attempts
ustime RADIO_INIT_WAIT      = "200ms"    builtin      max wait for radio init command to finish
ustime PPS_VALID_INTV       = "10m"      builtin      max age of last PPS sync for GPS time conversions
ustime TIMESYNC_RADIO_INTV  = "2100ms"   builtin      interval to resync MCU/SX1301
ustime TIMESYNC_LNS_RETRY   = "71ms"     builtin      resend timesync message to server
ustime TIMESYNC_LNS_PAUSE   = "5s"       builtin      pause after unsuccessful volley of timesync messages
    u4 TIMESYNC_LNS_BURST   = 10         builtin      volley of timesync messages before pausing
ustime TIMESYNC_REPORTS     = "5m"       builtin      report interval for current timesync status
ustime TX_MIN_GAP           = "10ms"     builtin      min distance between two frames being TXed
ustime TX_AIM_GAP           = "20ms"     builtin      aim for this TX lead time, if delayed should not fall under min
ustime TX_MAX_AHEAD         = "600s"     builtin      maximum time message can be scheduled into the future
ustime TXCHECK_FUDGE        = "5ms"      builtin      check radio state this time into ongoing TX
ustime BEACON_INTVL         = "128s"     builtin      beaconing interval
    u4 TLS_SNI              = true       builtin      Set and verify server name of TLS connections

I will work on libloragw in DEBUG and using a spec an to see if the packet actually sent. I would say it is not, since our end device sees a frame on 100% of the success cases, and does not see the frame on 100% of the 'frame not emitted' cases

@cblack-f3
Copy link
Author

cblack-f3 commented Feb 8, 2021

@beitler Here is a logfile again, with XDEBUG. I've been working with Multitech on the debug libloragw. They sent me a new ipk for station, let me know if this has the appropriate level. This one failed on the 9th frame I sent down. I added a space between each frame for readability

station_log_xdebug.txt

Also working on that spec an - I need to borrow one from the office so might be a few days

@cblack-f3
Copy link
Author

cblack-f3 commented Feb 11, 2021

I've hooked up an envelope follower to my scope

When it succeeds, I get this:
DS4_QuickPrint1

When I produce the error condition, there is NO RF activity.

@beitler
Copy link
Contributor

beitler commented Feb 11, 2021

Thank you for the details.

Let me walk through each line of the relevant part of the log:

INFO: tx_start_delay=1497 (1497.000000) - (1497, bw_delay=0.000000, notch_delay=0.000000)

The HAL prints this this line as part of the lgw_send setup sequence:
https://github.com/Lora-net/lora_gateway/blob/a955619271b5d0a46d32e08150acfbc1eed183b7/libloragw/src/loragw_hal.c#L1414
https://github.com/Lora-net/lora_gateway/blob/a955619271b5d0a46d32e08150acfbc1eed183b7/libloragw/src/loragw_hal.c#L407
Although this line is printed first, it does not mean that it happened chronologically before the next log lines. The reason is that the HAL uses printf directly for logging. Station caches the log lines in a buffer and emits the buffer once full or timeout reached in order to reduce IO overhead. That's why station logs are slightly delayed.


2021-02-08 18:32:51.516 [AIO:XDEB] [3|WS] < {"msgtype":"dnmsg","DevEui":"01-01-01-01-01-01-01-01","dC":2,"diid":13086,"pdu":"606d23dc000a0a0003000000700331ffff01c7fff68e2aafa3","priority":1,"RX2DR":10,"RX2Freq":923900000,"xtime":49258120969134059,"rctx":0}

This is the Websocket reporting that a downlink message came in. Notice, that a class C unsolicited downlink does not require xtime. It generates a warning later on.


2021-02-08 18:32:51.516 [S2E:WARN] RxDelay mapped to 1 as it was not present!

RXDelay parameter was not passed, so 1 second is assumed.


2021-02-08 18:32:51.516 [SYN:ERRO] Cannot convert xtime=0xAF000002BA622B - obsolete session: 175 (current 50)

This is the xtime related warning. Basically the timestamp cannot be properly mapped to any time reference and cannot be converted to radio time. However, since class C unsolicited downlinks are scheduled 'asap' the xtime value is ignored anyway and this warning does not have an effect on the downlink processing.


2021-02-08 18:32:51.516 [S2E:VERB] 101:101:101:101 diid=13086 [ant#0] - starting TX in 19ms694us

The received dnmsg generated a TX job which is scheduled to happen in about 20ms which corresponds to TX_AIM_GAP. This basically means that station schedules this downlink as soon as possible.


2021-02-08 18:32:51.541 [S2E:ERRO] 101:101:101:101 diid=13086 [ant#0] - radio is not emitting frame - abandoning TX, trying alternative

5ms into the ongoing transmission (TXCHECK_FUDGE) station retrieves the radio tx status and expects it to be in TXSTATUS_EMITTING.

int txs = ral_txstatus(txunit);

The error message indicates that the radio was found not to be in TXSTATUS_EMITTING state.


2021-02-08 18:32:51.641 [S2E:ERRO] 101:101:101:101 diid=13086 [ant#0] - radio is not emitting frame - abandoning TX, trying alternative

The retry happens 100ms later and failing for the same reason.


Unfortunately, I was not able to reproduce the problem on the hardware that is available to me.
Given that in your case this issue occurs sporadically, it is probably related so a timing problem varying with a jitter.
I propose you adjust the relevant TX timing parameters a bit and see if the problem occurs more or less often.
Specifically, please try the following values in various combinations:
TX_AIM_GAP --> 20ms, 30ms, 50ms
TXCHECK_FUDGE --> 5ms, 10ms, 20ms
These values can be set inside station.conf.

The following flow diagram may help to understand how these values work together:
image

Also, it would be good to print out the actual tx_status of the radio at the point where EMITTING is expected. This would require adding a corresponding printf statement either in the HAL:
https://github.com/Lora-net/lora_gateway/blob/a955619271b5d0a46d32e08150acfbc1eed183b7/libloragw/src/loragw_hal.c#L1636
or in station:

int ral_txstatus (u1_t txunit) {

@cblack-f3
Copy link
Author

Thanks for the very descriptive reply! I will start on this right away and ping MT about the rebuild with some extra debug statements.

Seeing that no frame is seen on the RF detector, would you expect "TXCHECK_FUDGE" to have an impact? I know I am getting beyond my understanding of the internals, but if this error is that we are checking at the wrong time, it would seem like I would still see the packet at the RF out.

@pferland-multitech
Copy link

Would this patch file work to output the status, or would you need it to be more specific?

`--- ral_lgw.c 2021-02-11 11:20:53.683043806 -0600
+++ ral_lgw_new.c 2021-02-11 11:29:49.287054602 -0600
@@ -240,6 +240,7 @@
#else
int err = lgw_status(TX_STATUS, &status);
#endif

  • LOG(XDEBUG, "TX STATUS: %#010x", status);
    if (err != LGW_HAL_SUCCESS) {
    LOG(MOD_RAL|ERROR, "lgw_status failed");
    return TXSTATUS_IDLE;
    `

@beitler
Copy link
Contributor

beitler commented Feb 12, 2021

@pferland-multitech Indeed, this patch would help to visualize the radio status when station expects the EMITTING state.

Also, what struck me in the log is that the retries are scheduled only 100ms apart while it should be 120ms (CLASS_C_BACKOFF_BY + TX_AIM_GAP ). Which would mean that TX_AIM_GAP is not honoured. It turns out the s2e tx state machine has a bug which causes this behavior. The patch below fixes that:

diff --git a/src/s2e.c b/src/s2e.c
index f8fd7f5..8d74d18 100644
--- a/src/s2e.c
+++ b/src/s2e.c
@@ -637,6 +645,7 @@ ustime_t s2e_nextTxAction (s2ctx_t* s2ctx, u1_t txunit) {
                 // Something went wrong - should be emitting
                 LOG(MOD_S2E|ERROR, "%J - radio is not emitting frame - abandoning TX, trying alternative", curr);
                 ral_txabort(txunit);
+                curr->txflags &= ~TXFLAG_TXING;
                 goto check_alt;
             }
             // Looks like it's on air

This patch will not fix the issue you are observing. It will just fix the retries after the initial TX failure. While it is still puzzling why the TX is failing in the first place, this patch will at least prevent the retries from failing. Please apply this patch for your further tests. It will flow into the next release as well.

@cblack-f3
Copy link
Author

I thought I would post the first of the 6 combinations of GAP/FUDGE, since it has the patch and TXSTATUS debug print, currently working on the rest of the combinations

gap20_fudge5.txt

@beitler
Copy link
Contributor

beitler commented Feb 15, 2021

Great, thanks for posting the data, which is already quite revealing. Have a look at these lines:

2021-02-15 19:08:27.644 [S2E:VERB] 101:101:101:101 diid=31845 [ant#0] - starting TX in 19ms685us
2021-02-15 19:08:27.670 [any:XDEB] TX STATUS: 0x00000003
2021-02-15 19:08:27.670 [S2E:ERRO] 101:101:101:101 diid=31845 [ant#0] - radio is not emitting frame - abandoning TX, trying alternative
2021-02-15 19:08:27.777 [any:XDEB] TX STATUS: 0x00000002

At time 08:27.644 The downlink command is received from the LNS and scheduled for transmission ~20ms later.
At time 08:27.665 (+20) The downlink is put into the TX buffer of the sx1301 (no log)
At time 08:27.670 (+5) The radio is polled for its status. The gateway reports 0x00000003 (TX_SCHEDULED). The TX operation is aborted. The status reported later by the sx1301 is TX_FREE.

https://github.com/Lora-net/lora_gateway/blob/a955619271b5d0a46d32e08150acfbc1eed183b7/libloragw/inc/loragw_hal.h#L141-L142

This indicates that at the time when the status is checked, the frame did not start emitting yet. This points at severe drift between MCU clock and radio clock.
The clock sync reports also show large worst case drifts:

2021-02-15 19:08:28.325 [SYN:INFO] MCU/SX130X drift stats: min: -3.3ppm  q50: -15.7ppm  q80: -63.8ppm  max: -102.1ppm - threshold q90: -100.4ppm
2021-02-15 19:08:28.325 [SYN:INFO] Mean MCU drift vs SX130X#0: -28.8ppm

but 100 ppm is not enough to drift away more than 5ms between sync events (occurring every TIMESYNC_RADIO_INTV - 2100ms by default).

Still, it would be interesting to see the behavior with larger values for TXCHECK_FUDGE.

If you could, please incorporate the patch from my previous comment into your tests going forward. Thanks!

@cblack-f3
Copy link
Author

cblack-f3 commented Feb 15, 2021

OK, these tests took awhile today, but here is the 9 combinations of TXFUDGE & GAP, and their results. The test consists of a script that sends down 200 frames per test, spaced out by 5 seconds.

  • ERRORS is the number of occurrences of the frame not emitting error
  • TXDONE is the number of frames successfully sent from the gateway (The patch is working: retries are now succeeding)
  • RECEIVED is from the device logs, showing how many frames made it to the end device
CONFIG            ERRORS      TXDONE     RECEIVED
-------------------------------------------------
Gap20 Fudge 5         12         200          200
Gap20 Fudge 10        17         200          199
Gap20 Fudge 20         6         200          200

Gap30 Fudge 5          1         200          200
Gap30 Fudge 10         1         200          200
Gap30 Fudge 20         2         200          200

Gap50 Fudge 5          0         200          200
Gap50 Fudge 10         0         200          199
Gap50 Fudge 20         0         200          200

Logs for reference:
gap20_fudge5.txt
gap20_fudge10.txt
gap20_fudge20.txt
gap30_fudge5.txt
gap30_fudge10.txt
gap30_fudge20.txt
gap50_fudge5.txt
gap50_fudge10.txt
gap50_fudge20.txt

pferland-multitech pushed a commit to pferland-multitech/basicstation that referenced this issue Feb 16, 2021
@beitler
Copy link
Contributor

beitler commented Feb 16, 2021

Thank you for your effort. These are good results. Apparently the default timing parameters are too tight for your hardware platform. I would suggest you configure your installation with TX_AIM_GAP=50ms and TXCHECK_FUDGE=5ms.

@pferland-multitech Maybe multitech could evaluate the findings in this thread and check if this is an isolated issue or whether the default (compile time) timing parameters should be adjusted for all multitech installations. For example, if the SPI speed is clocked down relative to the default, or if the SPI driver is introducing delays, this will require adjustments on the timing parameters.

@cblack-f3
Copy link
Author

I'm going to close this issue as it seems the fix on the retry bug has been tested enough on our end.

Further timing research on the conduit can occur outside this issue

beitler added a commit that referenced this issue Jan 31, 2022
* deps: Updated sx1302_hal dependency to version 2.1.0 (no LBT yet) (Resolves #89, #103, #121, #130)
* deps: Added sx1302_hal patch for handling of latched xticks rollover
* deps: Updated mbedTLS dependency to version 2.28.0 (LTS)
* deps: Fixed lgw patch causing IQ inversion in 500kHz channel (Resolves #81)
* s2e: Added support for AU915 (Resolves #43)
* s2e: Added support for LoRaWAN Regional Parameters Common Names (Resolves #18)
* s2e: Fixed dnchnl2 issue (Resolves #79)
* s2e: Fixed class C backoff logic (Resolves #87)
* s2e: Fixed class B beacon format (Resolves #129, #131)
* s2e: Fixed DR range check in upchannels list parser (Resolves #141)
* ral: Changed handling of xticks for lgw1302
* ral: Fixed radio in use issue (Resolves #53, #62)
* ral: Fixed types in txpow assignment (master/slave) (Resolves #118)
* ral: Fixed class B beacon parameters (Resolves #132)
* sx130xconf: Fixed parsing of rssi_tcomp values for sx1302 (Resolves #144)
* tls: Fixed TLS cert parsing issue (Resolves #76)
* sys_linux: Added support for usb/spi prefix in radio devname
* sys_linux: Added mbedTLS version to startup log
* sys_linux: Changed version to be printed to stdout (Resolves #51)
* sys_linux: Changed default max dbuf size (Resolves #95)
* sys_linux: Fixed relative home path handling (Resolves #140)
* sys_linux: Fixed memory corruption during system command execution (Resolves #146)
* tc/cups: Fixed sync on credset file IO (Resolves #94)
* timesync: Fixed UTC to PPS alignment
* log: Changed verbosity of XDEBUG log level
* log: Changed logging experience for improved clarity
* log: Added HAL log integration into logging module
* make: Changed makefiles for more space-friendliness (Resolves #66)
* net: Changed strictness on line-endings in key files (Resolves #68)
* gps: Fixed parsing of ublox NAV-TIMEGPS message
* Restored LICENSE file (Resolves #63, #67)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants