settings for logical switches are sometimes changed by itself #1834

wolkstein · 2014-11-04T19:52:18Z

hi,
i noticed a strange problem with logical switches. i have defined 6 Logical switches to get six flight-modes at my apm2.6 using switches SG and SD. I defined for each switch an function(and) V1 and V2 depending to switch position.
"And Switch", Duration and delay for all six Logical switches are unused.

now the problem. if the taranis is powered longer than 30-45 minuts suddenly my LS settings are changed by self. suddenly i can not switch back to first position(first flightmode LS1) . if i open the Logical Switches tab in model settings there is an And Switch Defined for LS 1. connected to !=SA↑. what goes wrong. i using opentx 2.0.12

i noticed this bug two times(taranis uptime >30 minutes). but i can not reliably reconstruct the bug. mostly nothing happens although the taranis uptime is >45 minutes.

/g
wolke

projectkk2glider · 2014-11-05T07:52:15Z

This is now the third report of such or similar incident I have seen on forums. Possible causes:

real bug in opentxt
a Lua script that does that

@wolkstein please provide more details such as:

do you have any Lua scripts active/running if yes, post them here
post your model setup (upload somewhere and post link here)
did you noticed that something else changes or it is always logical switches only
is the change random or always the same (here I mean which parameter changes and to what value)
is the changed logical switch always the same or not

wolkstein · 2014-11-05T10:45:28Z

hi,
thx for answering.
i use two different versions of models. the bad, i do not know if the bug happens at both models because i can not surly reconstruct it.

first

is the change random or always the same (here I mean which parameter changes and to what value)?

it is always the same change. LS1 get an additional "And Switch" Defined to !=SA↑. this was each time if it happens exactly the same!

second

do you have any Lua scripts active/running if yes, post them here

Yes, i have two scripts running. one telemetry script and one model script. and i have two models in use.
first FLY-bMANITA and second FLY-AMANITA. both models are the same but with different model setup and telemetry screen script.

FLY-AMANITA is my first and old setup, which use this files (bug happens surly at this model)

telemetry script: https://raw.githubusercontent.com/wolkstein/MavLink_FrSkySPort/single-cell-lipo-voltage/Lua_Telemetry/DisplayApmPosition/SCRIPTS/Model_Name/telem1.lua
model script: https://raw.githubusercontent.com/wolkstein/MavLink_FrSkySPort/single-cell-lipo-voltage/Lua_Telemetry/DisplayApmPosition/SCRIPTS/MIXES/cellinfo.lua
model setup as pdf: https://github.com/wolkstein/MavLink_FrSkySPort/raw/single-cell-lipo-voltage/Lua_Telemetry/SettingsWolkstein/FLY-AMANITA.pdf
model setup as eepe: https://raw.githubusercontent.com/wolkstein/MavLink_FrSkySPort/single-cell-lipo-voltage/Lua_Telemetry/SettingsWolkstein/05-11-2014-bmanita-and-amanita.eepe
both models are inside

FLY-bMANITA is my second and new setup, which use this files (i an unsure if it happens with this model)
changes to my first model are.

no engaged switch infos(playSound(*Engaged)
no use of bmp images in telemetry screen. in fist model, the telemetry screen have a massive use of bmp files to display different values. i change this using normal asci chars. i noticed that each time in loop you will call lcd.drawPixmap(*pixmap) an "f_open" happens. i am not sure but maybe this will cause a high I/O load.
telemetry script: https://raw.githubusercontent.com/wolkstein/MavLink_FrSkySPort/single-cell-lipo-voltage/Lua_Telemetry/DisplayApmPosition/SCRIPTS/Model_Name/wolkstein-telem1.lua
model script: https://raw.githubusercontent.com/wolkstein/MavLink_FrSkySPort/single-cell-lipo-voltage/Lua_Telemetry/DisplayApmPosition/SCRIPTS/MIXES/cellinfo.lua
model setup as pdf: https://github.com/wolkstein/MavLink_FrSkySPort/raw/single-cell-lipo-voltage/Lua_Telemetry/SettingsWolkstein/FLY-bMANITA.pdf
model setup as eepe: https://raw.githubusercontent.com/wolkstein/MavLink_FrSkySPort/single-cell-lipo-voltage/Lua_Telemetry/SettingsWolkstein/05-11-2014-bmanita-and-amanita.eepe
both models are inside

last

did you noticed that something else changes or it is always logical switches only?

YES!! i often lost connection. if i download my apm logfiles i see that my apm often switch to RTL as defind as Transmitter failsave operation. this happens although i am in a really short distance to model (4-10meters).

greetings
wolke

kilrah · 2014-11-05T10:53:45Z

Someone reported a similar issue on a forum, also L1 is being changed but not only the AND switch, also the other values. He seemed to have matched it as happening at the same time as an "SD card error".

Sacre100 · 2014-11-05T11:57:11Z

In the LUA script, there is some string concat operation ( .. operator) which means a lot of work for the garbage collector. There is also some local declaration within function.

Perhaps it's a similar problem as we had previously with the predimrc.lua script.

You should try to remove concat operation for exemple by replacing this kind of code :

lcd.drawText (192, 21, telem_sats.."sat", 0)

by this one :

lcd.drawText (192, 21, telem_sats, 0)
lcd.drawText (lcd.getLastPos(), 21, "sat", 0)

For the local declaration, you can try for example to move the declaration at the script level by replacing this kind of code :

local function rollpanel()

  local rangle=math.rad(getValue(204)*10)
  local rx2=145
  local rx1=100
  local ry1=round(32-20*math.sin(rangle))
  local ry2=round(32+20*math.sin(rangle))

  lcd.drawLine (rx1,ry1-1 ,rx2, ry2-1, SOLID, 2)
  lcd.drawLine (rx1,ry1 ,rx2, ry2, SOLID, 2)
  lcd.drawLine (rx1,ry1+1 ,rx2, ry2+1, SOLID, 2)
end

by this one :

local rangle, rx2, rx1, ry1, ry2

local function rollpanel()

  rangle=math.rad(getValue(204)*10)
  rx2=145
  rx1=100
  ry1=round(32-20*math.sin(rangle))
  ry2=round(32+20*math.sin(rangle))

  lcd.drawLine (rx1,ry1-1 ,rx2, ry2-1, SOLID, 2)
  lcd.drawLine (rx1,ry1 ,rx2, ry2, SOLID, 2)
  lcd.drawLine (rx1,ry1+1 ,rx2, ry2+1, SOLID, 2)
end

If by testing this you solve the issue, this will help the team to identify the cause of the problem.

wolkstein · 2014-11-05T12:59:32Z

thx,
i follow your tips. i change my model and telemetry script. i move all drawText("TEXT"..VARIABLE) to drawText("blahblah") and drawNumber(variable). also i move all vars to script head. if interested follow this. wolkstein/MavLink_FrSkySPort@e2b1f67 and this wolkstein/MavLink_FrSkySPort@0c8b26d

mostly i write in c++. lua is totally new for me. but making all vars script wide is a little bit strange for me. because i often use same var names in different functions.

/g
wolke

edit:
i will test my FLY-bMANITA model with all this changes in telemetry and model script against the first FLY-AMANITA version with his untouched files. hopefully i can isolate the problem.

wolkstein · 2014-11-05T15:46:23Z

hi, i have reproduce the error here after ~40 minuts uptime. taranis was running with modified lua scripts as described in previous comment. model was FLY-bMANITA. after ~40 minutes LS1 get an additional "And Switch" Defined to !=SA↑. this happens while taranis play massive audio. i have defined audio Messages to GEAR and FOTO switches. after 40 minuts uptime i toggle this switches very quickly ~30-40 times to produce massive audio messages. and than suddenly my apm switch from stabilize to althold. this happens because LS for my first position become the !=SA↑ and switch.

i am not really happy with this problem. it feels that you can not really build on your model settings.

i make a binary diff from my models.eepe files bevor and after the bug happens.
diff 05-11-2014-bmanita-and-amanita.eepe 05-11-2014-bmanita-and-amanita-after-bug.eepe

here the diff output.

2c2
< :20000000050001180040000001005810030021220C00D422000000000000000000000000D1
---
> :20000000050001180040000001005710030024220C00D422000000000000000000000000CF
12c12
< :20014000000041910350040180010AA20811D301E2D8E23206F102C10291014600000000F9
---
> :200140000000419103510380010AA20811D301E2D8E2401FF102C1029101460000000000D3
14,31c14,31
< :2001800004000B060C1926FE0D010E09140195014865786149020C2EC2082EA150E16449AF
< :2001A0000101A814F8EEF1ECECF4FB9101B164490102A412F1F4F4D102B164490103A41276
< :2001C0000500F1F4F4D103B164490104A319FFE9E104A4029CFF31F154A1FF480104A40237
< :2001E000DAFF32F154A1FE480104A402EFFF33F154A1FD480104A2020B9134F154A1FC482E
< :2002000006000104A202249135F154A1FB480104A202649136F154A1FA480105B1644901BB
< :20022000614B0106B164490163A4FAF1ECF1D107B16449015DA3EEECF47F7F7F7F7F7F7F5F
< :2002400007007F7F7F7F7F7F7F7F7F7F7F7F7F7F500614F8EEF1ECEC480601F7F4FBEEF178
< :20026000480412F1F4F44A0612EBFCFCFBEE48050DF1FCFBED4904070501124A0406F1EC4D
< :200280000800F14A0312140C7F7F7F7F7F59014DE262C04C014EA101B264C04C014CA10271
< :2002A000B29CC04C014BA103B264C07F7F7F7F7F7F7F7F7F7F7F7F7F7F5601FF7F7F7F7FDA
< :2002C00009007F7F7F7F7F7F7F7F7F7F7D02120AB108B2130AB108B2140AB108B314F6FFFE
< :2002E000A108B313F6FFA108B312F6FFA1087F7F7F7F7F7F58030F1401480A0F0B535441C1
< :200300000A0052544C4F4798100B67656172646EB8110B676561727570B6160B464F544FC0
< :20032000D9F10B53544F504C4F477F7F7F7F7F7F7F7F7F7F7F7F7F7F6001809130B20380F7
< :200340000B009130B403B5813FB401B5813FB301018491059F1A1B1C0F10112313182021FD
< :2003600022171516150E0F101112194225240C011413123C1223111915164F0301B9FBF10C
< :200380000000FC7F7FC1F97F7F7F7F480863656C6C696E666F4C02CE0191027F7F7C0314D0
< :2003A000F8EE9301F7F49305F4FB9312EBFC7F72013F47000000000000000000000000004D
---
> :2001800004000B060C1926FE0D010E09140195014865786149030C2B05B308ED049150E1B5
> :2001A00064490101A814F8EEF1ECECF4FB9101B164490102A412F1F4F4D102B1644901037F
> :2001C0000500A412F1F4F4D103B164490104A319FFE9E104A4029CFF31F154A1FF48010427
> :2001E000A402DAFF32F154A1FE480104A402EFFF33F154A1FD480104A2020B9134F154A1CC
> :200200000600FC480104A202249135F154A1FB480104A202649136F154A1FA480105B164C1
> :200220004901614B0106B164490163A4FAF1ECF1D107B16449015DA3EEECF47F7F7F7F7F13
> :2002400007007F7F7F7F7F7F7F7F7F7F7F7F7F7F7F7F500614F8EEF1ECEC480601F7F4FB59
> :20026000EEF1480412F1F4F44A0612EBFCFCFBEE48050DF1FCFBED4904070501124A04064B
> :200280000800F1ECF14A0312140C7F7F7F7F7F59014DE262C04C014EA101B264C04C014C37
> :2002A000A102B29CC04C014BA103B264C07F7F7F7F7F7F7F7F7F7F7F7F7F7F5601FF7F7F35
> :2002C00009007F7F7F7F7F7F7F7F7F7F7F7F7D02120AB108A3FF130AB108B2140AB108B319
> :2002E00014F6FFA108B313F6FFA108B312F6FFA1087F7F7F7F7F7F58030F1401480A0F0BA0
> :200300000A0053544152544C4F4798100B67656172646EB8110B676561727570B6160B46CA
> :200320004F544FD9F10B53544F504C4F477F7F7F7F7F7F7F7F7F7F7F7F7F7F60018091303A
> :200340000B00B203809130B403B5813FB401B5813FB301018491059F1A1B1C0F1011231321
> :2003600018202122171516150E0F101112194225240C011413123C1223111915164F030158
> :200380000000B9FBF1FC7F7FC1F97F7F7F7F480863656C6C696E666F4C02CE0191027F7FBE
> :2003A0007C0314F8EE9301F7F49305F4FB9312EBFC7F72013F4700000000000000000000BA

/g
wolke

bsongis · 2014-11-05T17:04:57Z

Would you give a link to your 2 eeproms? These eepe files are in compressed format, which is not the raw memory dump. It would be interesting to dump both of them and see which data have been written in the memory write overflow.

bsongis · 2014-11-05T17:10:56Z

On which screen are you when the problem happens? On your custom Lua telemetry screen?

bsongis · 2014-11-05T17:36:44Z

1st LS andsw is at offset 8 in the logical switches array. This array comes after the big array of 512 curve points (512 bytes) which is often unused. It would be interesting to know if there is some difference in this array as well ...

bsongis · 2014-11-05T17:44:53Z

Also would it be reproductible in the latest stable nightly build here:
http://jenkins.open-tx.org/nightly-20/
projectkk2glider added an interesting fix there ...

projectkk2glider · 2014-11-05T20:27:08Z

My main suspect is the audio thread. There are also some reports (on forums) that audio sometimes stops working (only wav, the tones continue to play) until radio is power cycled.

As Bertrand already pointed out, there are a couple of fixes related to this (SD card and audio thread) in master. @wolkstein please test with some of the latest nightly builds and report back.

wolkstein · 2014-11-05T21:48:48Z

hi,
@bsongis
both eepe files are here.
https://github.com/wolkstein/MavLink_FrSkySPort/tree/single-cell-lipo-voltage/Lua_Telemetry/SettingsWolkstein
the custom apm telemetry screen was running.

i will try a nightly build. can i install(flash) the nightly 2.0.13 builds via dfu util? i am on linux and i have companion Version 2.0.12, Sep 19 2014 installed. my dfu-util version is dfu-util 0.7.

edit:
is the raw memory dump inside the eepe files?. i ask because i overwrite(remove And Switch) in my bugy bManita model on taranis. i thinkit now it is to late for an dump.

bsongis · 2014-11-05T22:12:30Z

Those eepe files are compressed dumps, it would be nice to store somewhere the eepe before / after the bug so that I can compare them (after inflate, Companion can do that with one line uncommented in opentxinterface.cpp)

Yes using Linux is the easiest way. I do that everyday.

bsongis · 2014-11-06T20:07:08Z

No news, good news?

wolkstein · 2014-11-06T21:22:57Z

:) yesterday i install a nightly build on my taranis. but i have no time to play around. think at the weekend i will do some tests.

bsongis · 2014-11-06T21:26:15Z

We all cross the fingers :)

wolkstein · 2014-11-07T21:34:47Z

hi,
ok today i found some time for testing:). but it ends with a new issue:(
firstly i was not able to reproduce the LS Switch bug. but this happens because my taranis today play suddenly only "one" sound file with stuttering. in my case "stabilize active". i noticed that the stuttering wasn't randomly. the stuttering was exactly, each time, at the same position in sound. between stabilize and active. so i decide to interrupt the switch test and pay my attention to the audio or maybe more (sdcard/filesystem) problem.

i do some things to circle the effect. info, i get this (my first taranis) two weeks ago with this sdcard inside!!

firstly, i switch back from the nightly build to my old firmware 2.0.12
no changes, stuttering exactly as before.

second, i put the sd card into my computer an open the affected soundfile with audacity. soundfile was good no stuttering. samplerate, sample format and sample depth are as expected. so no problem here. i opend the file directly from sdcard, no caching to my computer!!

third, i make a file system check with gparted and fsck.fat.
gparted reports that the quantity of cluster is not as expected 60020 against 60021 expected clusters. size of fat is wrong. size was 256 sectors against 235 expected sectors.
fsck.fat -av reports the same.

wolke@schnell:~$ sudo fsck.fat -av /dev/mmcblk0p1
fsck.fat 3.0.26 (2014-03-07)
fsck.fat 3.0.26 (2014-03-07)
Checking we can access the last sector of the filesystem
Boot sector contents:
System ID "mkfs.fat"
Media byte 0xf8 (hard disk)
       512 bytes per logical sector
     32768 bytes per cluster
         1 reserved sector
First FAT starts at byte 512 (sector 1)
         2 FATs, 16 bit entries
    131072 bytes per FAT (= 256 sectors)
Root directory starts at byte 262656 (sector 513)
      1024 root directory entries
Data area starts at byte 295424 (sector 577)
     60020 data clusters (1966735360 bytes)
16 sectors/track, 4 heads
       137 hidden sectors
   3841911 sectors total
Reclaiming unconnected clusters.
/dev/mmcblk0p1: 1685 files, 4394/60020 clusters

you can see clusters and sectors are not as expected.

next step, i backup my sdcard and reformat the card with fat16 "mkfs.fat -F16 -v -I -n". than i copying all files back to card.

after starting taranis with my old 2.0.12 firmware "stabilize active" sounds without distortion. also sound was good after re flashing to nightly build firmware.

so at the end no result:(

but today, before i noticed the sound problem, i do ~100 file copy actions, mount and umount actions with taranis conected as usbdevice. maybe because this the problem starts. stabilise active was stuttering at the first time it was played today.

/g
wolke

bsongis · 2014-11-07T22:12:01Z

Interesting. So you already know our answer ... would you try to reproduce those ~100 copy actions and see if you reproduce the problem ?

projectkk2glider · 2014-11-08T08:00:34Z

This is very similar to audio stuttering issue in #1779. Looks like reading fragmented file with FatFs takes more SD sectors reads than without fragmentation (well that is expected), but exactly how much more? Maybe I will research this just for my curiosity sake.

But to solve this part of the issue, maybe we should recommend users prepare their SD card like so:

remove any unnecessary files (like logs)
defragment SD card
and when adding new sounds to SD card, do a defragment after that.

lvale · 2014-11-08T14:41:07Z

Hi

I know it's my fault because of my lousy script :)

Andre's workaround for the audio stuttering problem is the same I've pointed at the other issue #1779 although I'm a bit more radical because I format the card.

I'm building some audio files that will have different total occupied sectors (file length).

bsongis · 2014-11-08T15:32:35Z

So ... do we close both issues and release?

lvale · 2014-11-08T16:10:05Z

If I don't have more unexpected reboots :) Since it's raining this weekend I was planning on more testing of a recent build as you asked.

nigelsheffield · 2014-11-09T15:13:20Z

I had this for the first time ever today, only difference from anything before was I ran this winter postal task script from on4mj.
http://www.rcgroups.com/forums/member.php?u=536303
, I did the full 20 mins with this script running on a copy of a model , then I switched back to my normal model without this script but with my normal scripts which I have been running no problems, I noticed I had no thermal mode and the LS1 had !sa added like described.

So to be clear it did not change the model with the difference script in but changed the old normal model on load after flying with the other script.

wolkstein · 2014-11-09T19:47:40Z

hi,
i also do some time tests today because the LS1 problem. strange, because i get the problem in one model but not in an new model which i have created from scratch today. both models use the same scripts!!.
my old models FLY-bMANITA and FLY-AMANITA was a copy from an model-template i found somewhere in this thread. http://diydrones.com/forum/topics/amp-to-frsky-x8r-sport-converter?id=705844%3ATopic%3A1539556&page=1#comments i do not remember where i found it exactly:(. with this models together with my scripts i get the "!=SA↑" as and in LS1 problem. but with my new model created from scratch not!! i do long-time tests >45 minutes using massive audio output.
scripts in use:
https://github.com/wolkstein/MavLink_FrSkySPort/blob/single-cell-lipo-voltage/Lua_Telemetry/DisplayApmPosition/SCRIPTS/Model_Name/wolkstein-telem1.lua
and
https://github.com/wolkstein/MavLink_FrSkySPort/blob/single-cell-lipo-voltage/Lua_Telemetry/DisplayApmPosition/SCRIPTS/MIXES/cellinfo.lua

models in test, FLY-bMANITA and amanita. amanita is the new model i create today from scratch.
https://raw.githubusercontent.com/wolkstein/MavLink_FrSkySPort/single-cell-lipo-voltage/Lua_Telemetry/SettingsWolkstein/fly-bmanita-amanita-november9-longtimetest.eepe

today, all tests are on firmware nightly build from 5.11.2014.

/g
wolke

edit:
in model amanita i add an SB↓"and switch" in L6. this switch was created by user not by bug!!

bsongis · 2014-11-09T19:51:41Z

Do you mean that you had the problem with our nightly builds?

projectkk2glider · 2014-11-19T07:44:45Z

Same mutexID appearing twice in TCBTbl is perfectly normal. It just means that two threads are using it. One is inside mutex, other is waiting for the first one to release mutex and will take it right after that.

What is problematic is the case of the mutex being entered two times (your mutexCheck variable). That is plain wrong and must not happen. There are two possibilities why this happens:

first one is a bug in the code that uses mutexes - ie mutex enter and leave calls are not used in pairs.
second is a bug in the mutex implementation itself.

I will do some more research about the second point. Maybe we need to add some memory barrier statements also.

Now about the bug in FatFs f_close and 'f_closedir'. This bug happens only if SD card present signal is at logical 1 when one of above functions is called. This would only happen if for example the mechanical switch inside SD card slot would be defective - ie have a bad contact.

Anyway it is a simple fix in FatFs that I have already done and have stashed in my workspace. I will commit it later.

projectkk2glider · 2014-11-19T20:12:52Z

I am just checking assembler output of CoEnterMutexSection() with two optimization levels s and 2. First results look horrifying! The -O2 version is completely screwed, the -Os is a little better, but stil possible bug. Stay tuned...

projectkk2glider · 2014-11-19T20:36:02Z

First problem in CoOS. When entering mutex there is a check if scheduler is locked, if it is the call is abandoned, otherwise it continues. Code fragment:

    if(OSSchedLock != 0)                /* Is OS lock?                        */
    {                                
        return E_OS_IN_LOCK;            /* Yes,error return                   */
    }   

#if CFG_PAR_CHECKOUT_EN >0
    if(mutexID >= MutexFreeID)          /* Invalid 'mutexID'                  */
    {
        return E_INVALID_ID;    
    }
#endif

    OsSchedLock();  // just does OSSchedLock++

I think that the OSSchedLock checking and incrementing should be an atomic operation. As it is now there is possibility of thread being interrupted between OSSchedLock check and increment.

Same problem is in CoOS flags and mailboxes, but we don't use them.

bsongis · 2014-11-19T20:52:38Z

Would it be a problem in such case? I didn't read everything but I think no, it seems rather an internal sanity test, if CoOS schedules another task at this time, the schedlock will be incremented and decremented (possibly even many times if another task also takes any mutex)

Instead the code (checking and changing mutexState) after this lock seems to me really problematic. If it is not considered as volatile, I am afraid it can be moved before the lock when O2 optims enabled.

projectkk2glider · 2014-11-19T20:56:19Z

OsSchedLock(); section at -Os optimization level in function CoEnterMutexSection() (my comments begin with ###):

108:CoOS/kernel/mutex.c ****     OsSchedLock();
 194 0012 1C78        ldrb  r4, [r3, #0]  @ zero_extendqisi2  @ OSSchedLock, OSSchedLock   ### loads current value of OSSchedLock
 196 0014 2E4A        ldr r2, .L25+8  @ tmp175,       @ Loads address of MutexTbl which it needs later (this cames from code reordering even at -Os optimization level!)
 198 0016 651C        adds  r5, r4, #1  @ tmp171, OSSchedLock,    ### adds 1 TO OSSchedLock
 199 0018 E9B2        uxtb  r1, r5  @ OSSchedLock.6, tmp171  ###Zero extend Byte. Extends an 8-bit value to a 32-bit value.
 201 001a 1970        strb  r1, [r3, #0]  @ OSSchedLock.6, OsSchedLock   ### stores new value of OSSchedLock

This is ok, still there is little code reordering, but it does not affect desired result.

Same section at -O2:

109:CoOS/kernel/mutex.c ****     pCurTcb = TCBRunning;
 212 0016 2E4C        ldr r4, .L31+8  @ tmp174,       ###loads address of TCBRunning
 110:CoOS/kernel/mutex.c ****     pMutex  = &MutexTbl[mutexID];
 214 0018 2E49        ldr r1, .L31+12 @ tmp175,       ### loads address of MutexTbl
 109:CoOS/kernel/mutex.c ****     pCurTcb = TCBRunning;
 216 001a 2468        ldr r4, [r4, #0]  @ pCurTcb, TCBRunning     ### loads pCurTcb in r4
 217                .LVL11:
 108:CoOS/kernel/mutex.c ****     OsSchedLock();
 219 001c 1E78        ldrb  r6, [r3, #0]  @ zero_extendqisi2  @ OSSchedLock, OSSchedLock    ### loads current value of OSSchedLock
 110:CoOS/kernel/mutex.c ****     pMutex  = &MutexTbl[mutexID];
 221 001e 01EBC005    add r5, r1, r0, lsl #3  @, pMutex, tmp175, mutexID,       ### gets pMutex  = &MutexTbl[mutexID];      
 222                .LVL12:
 112:CoOS/kernel/mutex.c ****     pCurTcb->mutexID = mutexID;
 224 0022 E071        strb  r0, [r4, #7]  @ mutexID, pCurTcb_9->mutexID       ### ERROR does pCurTcb->mutexID = mutexID; before  storing OSSchedLock
 108:CoOS/kernel/mutex.c ****     OsSchedLock();
 226 0024 0136        adds  r6, r6, #1  @ tmp171, OSSchedLock,              ### adds 1 to OSSchedLock
 113:CoOS/kernel/mutex.c ****     if(pMutex->mutexFlag == MUTEX_FREE)       /* If mutex is available        */   
 228 0026 6A78        ldrb  r2, [r5, #1]  @ zero_extendqisi2  @ D.4522, pMutex_12->mutexFlag  ### ERROR loads pMutex_12->mutexFlag before  storing OSSchedLock
 108:CoOS/kernel/mutex.c ****     OsSchedLock();
 230 0028 F6B2        uxtb  r6, r6  @ OSSchedLock.6, tmp171
 231                .LVL13:
 232 002a 1E70        strb  r6, [r3, #0]  @ OSSchedLock.6, OSSchedLock    ### stores new value of OSSchedLock

This is very bad: before OSSchedLock is incremented and stored in memory this happens:

pCurTcb->mutexID = mutexID;
pMutex->mutexFlag is read for later use in comparison

@bsongis please review my findings.

projectkk2glider · 2014-11-19T21:10:38Z

Ha, compiler barrier completely cures above problem with -O2 optimization. New source:

    OsSchedLock();
    asm volatile ("" : : : "memory");         // prevents compiler reordering
    pCurTcb = TCBRunning;
    pMutex  = &MutexTbl[mutexID];

And resulting assembly:

108:CoOS/kernel/mutex.c ****     OsSchedLock();
 211                    .loc 1 108 0
 212 0016 1C78          ldrb    r4, [r3, #0]    @ zero_extendqisi2  @ OSSchedLock, OSSchedLock
 213 0018 651C          adds    r5, r4, #1  @ tmp171, OSSchedLock,
 214 001a E9B2          uxtb    r1, r5  @ OSSchedLock.6, tmp171
 215                .LVL11:
 216 001c 1970          strb    r1, [r3, #0]    @ OSSchedLock.6, OSSchedLock      ### OSSchedLock stored before any other operation on mutex - SUCCESS
 109:CoOS/kernel/mutex.c ****     asm volatile ("" : : : "memory");         // prevents compiler reordering
 217                    .loc 1 109 0
 110:CoOS/kernel/mutex.c ****     pCurTcb = TCBRunning;
 218                    .loc 1 110 0
 219 001e 2C4B          ldr r3, .L31+8  @ tmp174,
 111:CoOS/kernel/mutex.c ****     pMutex  = &MutexTbl[mutexID];
 220                    .loc 1 111 0
 221 0020 2C4A          ldr r2, .L31+12 @ tmp175,
 110:CoOS/kernel/mutex.c ****     pCurTcb = TCBRunning;

bsongis · 2014-11-19T21:13:48Z

Yes I confirm this is really bad to compile CoOS in -O2
I propose to continue with -Os and keep the barrier as well (does it change something in -Os) ?
And we release ASAP. This LS1 bug is a nightmare.
I will send an email to the CoOS team tomorrow.

projectkk2glider · 2014-11-19T21:18:58Z

With -Os it ok, the one reorder (which luckilly was not critical) is removed:

194 0012 1C78           ldrb    r4, [r3, #0]    @ zero_extendqisi2  @ OSSchedLock, OSSchedLock
 195 0014 651C          adds    r5, r4, #1  @ tmp171, OSSchedLock,
 196 0016 E9B2          uxtb    r1, r5  @ OSSchedLock.6, tmp171
 197                .LVL7:
 198 0018 1970          strb    r1, [r3, #0]    @ OSSchedLock.6, OSSchedLock
 109:CoOS/kernel/mutex.c ****     asm volatile ("" : : : "memory");         // prevents compiler reordering
 199                    .loc 1 109 0
 110:CoOS/kernel/mutex.c ****     pCurTcb = TCBRunning;
 200                    .loc 1 110 0
 201 001a 2D4B          ldr r3, .L25+8  @ tmp174,
 111:CoOS/kernel/mutex.c ****     pMutex  = &MutexTbl[mutexID];

So yes, I agree on -Os and on adding barrier. The barrier should also be added to some other functions like __disable_irq().

projectkk2glider · 2014-11-19T21:20:46Z

And I think that volatile for mutex structure or any of its members is not needed, because it is protected by OSSchedLock.

bsongis · 2014-11-19T21:22:52Z

Right I introduced it to prevent them from being read/modified before the lock, obviously it works as well, prevents the reorder (I didn't checked though) but it's not needed.

projectkk2glider · 2014-11-19T21:23:45Z

Will you add memory barriers?

bsongis · 2014-11-19T21:26:14Z

Yes. Perhaps in changing the OsSchedLock() macro. Are you sure that memory barrier is needed for __disable_irq(), it's already ASM code?

projectkk2glider · 2014-11-19T21:33:54Z

Are you sure that memory barrier is needed for __disable_irq(), it's already ASM code.

It can't hurt.

Fixes #1848

lvale · 2014-11-19T23:15:32Z

Congrats guys.
Chasing compiler "optimizations" is a real PIA....

LapinFou · 2014-11-20T09:24:06Z

Congratulation for this bug fix! A difficult one to tackle. 👍
I don't know if it related to the compilation directive, but the 2.0.13 is 20% smaller than the 2.0.12!

projectkk2glider · 2014-11-20T11:09:19Z

I don't know if it related to the compilation directive, but the 2.0.13 is 20% smaller than the 2.0.12!

It is, it is now optimized for size mainly.

LapinFou · 2014-11-20T11:17:57Z

More room for some new great feature !!!
;o)
Again, congrats to all of you for your hard work.

MikeBland · 2014-11-21T00:11:53Z

I've been using -Os for ersky9x anyway. I had a look at the assembler and did see the odd assembler instruction that appeared in the middle of the sequence for Incrementing the OsSchedLock variable.
I tried the following:
#define OsSchedLock() { asm(""); OSSchedLock++; asm volatile ("" : : : "memory"); }
and that seems to make every reference work OK.
I may be using a different version of the compiler, but you might try this.
I have found using asm("") useful in the AVR compiler as well.

Mike.

bsongis · 2014-11-21T06:12:17Z

Yes Mike, now we have:
#define OsSchedLock() { OSSchedLock++; asm volatile ("" : : : "memory"); }
And we compile in -Os
We are using the very last 4.7 arm-gcc release

MikeBland · 2014-11-21T09:32:22Z

But I have the extra asm("") in front as well. This brackets the OSSchedLock++ with 2 asm directives, so the compiler doesn't let other code cross either boundary.

Mike.

projectkk2glider · 2014-11-21T12:20:26Z

In my current understanding putting asm("") will change nothing. To archive a memory barrier the full statement must be included: asm volatile ("" : : : "memory")

I am also concerned about IRQ disable/enable functions. I propose (again) that we change them to (only add "memory" modifier):

__attribute__( ( always_inline ) ) static __INLINE void __enable_irq(void)
{
  __ASM volatile ("cpsie i" : : : "memory");
}
__attribute__( ( always_inline ) ) static __INLINE void __disable_irq(void)
{
  __ASM volatile ("cpsid i" : : : "memory");
}

Without the "memory" modifier the compiler is free to reorder code around __enable_irq() statement.

MikeBland · 2014-11-21T13:27:03Z

I believe there are TWO different things here, what the compiler does and what the processor does.

The code I get for OsSchedLock is usually like:

4083ca: 7e5a ldrb r2, [r3, #25]
4083cc: 3201 adds r2, #1
4083ce: b2d2 uxtb r2, r2
4083d0: 765a strb r2, [r3, #25]

Without the leading asm(""), sometimes other instructions are interleaved within this sequence.
With the leading asm(""), this sequence always appears as shown, with nothing else in between the 4 instructions.

I believe this is because the compiler assumes any asm code might have an effect the compiler doesn't "know" about, so must not rearrange an code before the asm to appear after the asm.

The second thing is the processor may reorder physical memory accesses, regardless of the actual instruction order. It may be necessary to use a real DMB (Data Memory Barrier) instruction. This may only be necessary for multiple processors, rather than multiple tasks on a single processor.

According to the datasheets for the STM ARM processor, and the Atmel ARM processor, this instruction is available.

I found this:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka14041.html

On the AVR compiler, I have noticed that if you have a return in the middle of a procedure, the compiler sometimes generates a set of pop instructions and a return in addition to those at the end of the procedure. terminating the procedure with:
asm("") ;
return ;
stops this happening and saves quite a few bytes of flash.

Mike.

Fixes #1848

CaptainSwashplate · 2015-03-29T13:54:47Z

Wow - quite the thread.
I was one (the only?) person reporting this problem in the forum
http://openrcforums.com/forum/viewtopic.php?f=92&t=6178
Thanks for tracking it down. I have had my intentions changed externally last year in October but now updating everything and looking forward to trying the new firmware on easter. If everything goes well with the test-plane I'll go forward and convert the next models.
Thanks for your hard work!

projectkk2glider added Radio Firmware Bug Needs more data labels Nov 5, 2014

projectkk2glider added this to the OpenTX 2.0.13 milestone Nov 5, 2014

projectkk2glider mentioned this issue Nov 6, 2014

Audio stops working #1838

Closed

bsongis added a commit that referenced this issue Nov 19, 2014

Fixes #1834

b359aa1

Fixes #1848

bsongis closed this as completed in b46b049 Nov 19, 2014

projectkk2glider mentioned this issue Nov 21, 2014

Find out why GCC 4.8 produces unusable opentx firmware #1723

Closed

projectkk2glider added a commit that referenced this issue Feb 14, 2015

Fixes #1834

eccec5b

Fixes #1848

projectkk2glider added a commit that referenced this issue Feb 20, 2015

Fixes #1834

6bf0418

Fixes #1848

projectkk2glider referenced this issue Apr 30, 2015

I found it ... 3 hours just for this #ifdef missing!

7ba457b

wolkstein mentioned this issue Nov 5, 2015

Distance from home not displaying in radar Clooney82/MavLink_FrSkySPort#43

Closed

settings for logical switches are sometimes changed by itself #1834

settings for logical switches are sometimes changed by itself #1834

Comments

wolkstein commented Nov 4, 2014

projectkk2glider commented Nov 5, 2014

wolkstein commented Nov 5, 2014

kilrah commented Nov 5, 2014

Sacre100 commented Nov 5, 2014

wolkstein commented Nov 5, 2014

wolkstein commented Nov 5, 2014

bsongis commented Nov 5, 2014

bsongis commented Nov 5, 2014

bsongis commented Nov 5, 2014

bsongis commented Nov 5, 2014

projectkk2glider commented Nov 5, 2014

wolkstein commented Nov 5, 2014

bsongis commented Nov 5, 2014

bsongis commented Nov 6, 2014

wolkstein commented Nov 6, 2014

bsongis commented Nov 6, 2014

wolkstein commented Nov 7, 2014

bsongis commented Nov 7, 2014

projectkk2glider commented Nov 8, 2014

lvale commented Nov 8, 2014

bsongis commented Nov 8, 2014

lvale commented Nov 8, 2014

nigelsheffield commented Nov 9, 2014

wolkstein commented Nov 9, 2014

bsongis commented Nov 9, 2014

projectkk2glider commented Nov 19, 2014

projectkk2glider commented Nov 19, 2014

projectkk2glider commented Nov 19, 2014

bsongis commented Nov 19, 2014

projectkk2glider commented Nov 19, 2014

projectkk2glider commented Nov 19, 2014

bsongis commented Nov 19, 2014

projectkk2glider commented Nov 19, 2014

projectkk2glider commented Nov 19, 2014

bsongis commented Nov 19, 2014

projectkk2glider commented Nov 19, 2014

bsongis commented Nov 19, 2014

projectkk2glider commented Nov 19, 2014

lvale commented Nov 19, 2014

LapinFou commented Nov 20, 2014

projectkk2glider commented Nov 20, 2014

LapinFou commented Nov 20, 2014

MikeBland commented Nov 21, 2014

bsongis commented Nov 21, 2014

MikeBland commented Nov 21, 2014

projectkk2glider commented Nov 21, 2014

MikeBland commented Nov 21, 2014

CaptainSwashplate commented Mar 29, 2015