Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

settings for logical switches are sometimes changed by itself #1834

Closed
wolkstein opened this issue Nov 4, 2014 · 149 comments
Closed

settings for logical switches are sometimes changed by itself #1834

wolkstein opened this issue Nov 4, 2014 · 149 comments

Comments

@wolkstein
Copy link

hi,
i noticed a strange problem with logical switches. i have defined 6 Logical switches to get six flight-modes at my apm2.6 using switches SG and SD. I defined for each switch an function(and) V1 and V2 depending to switch position.
"And Switch", Duration and delay for all six Logical switches are unused.

now the problem. if the taranis is powered longer than 30-45 minuts suddenly my LS settings are changed by self. suddenly i can not switch back to first position(first flightmode LS1) . if i open the Logical Switches tab in model settings there is an And Switch Defined for LS 1. connected to !=SA↑. what goes wrong. i using opentx 2.0.12

i noticed this bug two times(taranis uptime >30 minutes). but i can not reliably reconstruct the bug. mostly nothing happens although the taranis uptime is >45 minutes.

/g
wolke

@projectkk2glider
Copy link
Member

This is now the third report of such or similar incident I have seen on forums. Possible causes:

  • real bug in opentxt
  • a Lua script that does that

@wolkstein please provide more details such as:

  • do you have any Lua scripts active/running if yes, post them here
  • post your model setup (upload somewhere and post link here)
  • did you noticed that something else changes or it is always logical switches only
  • is the change random or always the same (here I mean which parameter changes and to what value)
  • is the changed logical switch always the same or not

@wolkstein
Copy link
Author

hi,
thx for answering.
i use two different versions of models. the bad, i do not know if the bug happens at both models because i can not surly reconstruct it.

first

  • is the change random or always the same (here I mean which parameter changes and to what value)?

it is always the same change. LS1 get an additional "And Switch" Defined to !=SA↑. this was each time if it happens exactly the same!

second

  • do you have any Lua scripts active/running if yes, post them here

Yes, i have two scripts running. one telemetry script and one model script. and i have two models in use.
first FLY-bMANITA and second FLY-AMANITA. both models are the same but with different model setup and telemetry screen script.

FLY-AMANITA is my first and old setup, which use this files (bug happens surly at this model)

FLY-bMANITA is my second and new setup, which use this files (i an unsure if it happens with this model)
changes to my first model are.

last

  • did you noticed that something else changes or it is always logical switches only?

YES!! i often lost connection. if i download my apm logfiles i see that my apm often switch to RTL as defind as Transmitter failsave operation. this happens although i am in a really short distance to model (4-10meters).

greetings
wolke

@kilrah
Copy link
Member

kilrah commented Nov 5, 2014

Someone reported a similar issue on a forum, also L1 is being changed but not only the AND switch, also the other values. He seemed to have matched it as happening at the same time as an "SD card error".

@Sacre100
Copy link

Sacre100 commented Nov 5, 2014

In the LUA script, there is some string concat operation ( .. operator) which means a lot of work for the garbage collector. There is also some local declaration within function.

Perhaps it's a similar problem as we had previously with the predimrc.lua script.

You should try to remove concat operation for exemple by replacing this kind of code :

lcd.drawText (192, 21, telem_sats.."sat", 0)

by this one :

lcd.drawText (192, 21, telem_sats, 0)
lcd.drawText (lcd.getLastPos(), 21, "sat", 0)

For the local declaration, you can try for example to move the declaration at the script level by replacing this kind of code :

local function rollpanel()

  local rangle=math.rad(getValue(204)*10)
  local rx2=145
  local rx1=100
  local ry1=round(32-20*math.sin(rangle))
  local ry2=round(32+20*math.sin(rangle))

  lcd.drawLine (rx1,ry1-1 ,rx2, ry2-1, SOLID, 2)
  lcd.drawLine (rx1,ry1 ,rx2, ry2, SOLID, 2)
  lcd.drawLine (rx1,ry1+1 ,rx2, ry2+1, SOLID, 2)
end

by this one :

local rangle, rx2, rx1, ry1, ry2

local function rollpanel()

  rangle=math.rad(getValue(204)*10)
  rx2=145
  rx1=100
  ry1=round(32-20*math.sin(rangle))
  ry2=round(32+20*math.sin(rangle))

  lcd.drawLine (rx1,ry1-1 ,rx2, ry2-1, SOLID, 2)
  lcd.drawLine (rx1,ry1 ,rx2, ry2, SOLID, 2)
  lcd.drawLine (rx1,ry1+1 ,rx2, ry2+1, SOLID, 2)
end

If by testing this you solve the issue, this will help the team to identify the cause of the problem.

@wolkstein
Copy link
Author

thx,
i follow your tips. i change my model and telemetry script. i move all drawText("TEXT"..VARIABLE) to drawText("blahblah") and drawNumber(variable). also i move all vars to script head. if interested follow this. wolkstein/MavLink_FrSkySPort@e2b1f67 and this wolkstein/MavLink_FrSkySPort@0c8b26d

mostly i write in c++. lua is totally new for me. but making all vars script wide is a little bit strange for me. because i often use same var names in different functions.

/g
wolke

edit:
i will test my FLY-bMANITA model with all this changes in telemetry and model script against the first FLY-AMANITA version with his untouched files. hopefully i can isolate the problem.

@wolkstein
Copy link
Author

hi, i have reproduce the error here after ~40 minuts uptime. taranis was running with modified lua scripts as described in previous comment. model was FLY-bMANITA. after ~40 minutes LS1 get an additional "And Switch" Defined to !=SA↑. this happens while taranis play massive audio. i have defined audio Messages to GEAR and FOTO switches. after 40 minuts uptime i toggle this switches very quickly ~30-40 times to produce massive audio messages. and than suddenly my apm switch from stabilize to althold. this happens because LS for my first position become the !=SA↑ and switch.

i am not really happy with this problem. it feels that you can not really build on your model settings.

i make a binary diff from my models.eepe files bevor and after the bug happens.
diff 05-11-2014-bmanita-and-amanita.eepe 05-11-2014-bmanita-and-amanita-after-bug.eepe

here the diff output.

2c2
< :20000000050001180040000001005810030021220C00D422000000000000000000000000D1
---
> :20000000050001180040000001005710030024220C00D422000000000000000000000000CF
12c12
< :20014000000041910350040180010AA20811D301E2D8E23206F102C10291014600000000F9
---
> :200140000000419103510380010AA20811D301E2D8E2401FF102C1029101460000000000D3
14,31c14,31
< :2001800004000B060C1926FE0D010E09140195014865786149020C2EC2082EA150E16449AF
< :2001A0000101A814F8EEF1ECECF4FB9101B164490102A412F1F4F4D102B164490103A41276
< :2001C0000500F1F4F4D103B164490104A319FFE9E104A4029CFF31F154A1FF480104A40237
< :2001E000DAFF32F154A1FE480104A402EFFF33F154A1FD480104A2020B9134F154A1FC482E
< :2002000006000104A202249135F154A1FB480104A202649136F154A1FA480105B1644901BB
< :20022000614B0106B164490163A4FAF1ECF1D107B16449015DA3EEECF47F7F7F7F7F7F7F5F
< :2002400007007F7F7F7F7F7F7F7F7F7F7F7F7F7F500614F8EEF1ECEC480601F7F4FBEEF178
< :20026000480412F1F4F44A0612EBFCFCFBEE48050DF1FCFBED4904070501124A0406F1EC4D
< :200280000800F14A0312140C7F7F7F7F7F59014DE262C04C014EA101B264C04C014CA10271
< :2002A000B29CC04C014BA103B264C07F7F7F7F7F7F7F7F7F7F7F7F7F7F5601FF7F7F7F7FDA
< :2002C00009007F7F7F7F7F7F7F7F7F7F7D02120AB108B2130AB108B2140AB108B314F6FFFE
< :2002E000A108B313F6FFA108B312F6FFA1087F7F7F7F7F7F58030F1401480A0F0B535441C1
< :200300000A0052544C4F4798100B67656172646EB8110B676561727570B6160B464F544FC0
< :20032000D9F10B53544F504C4F477F7F7F7F7F7F7F7F7F7F7F7F7F7F6001809130B20380F7
< :200340000B009130B403B5813FB401B5813FB301018491059F1A1B1C0F10112313182021FD
< :2003600022171516150E0F101112194225240C011413123C1223111915164F0301B9FBF10C
< :200380000000FC7F7FC1F97F7F7F7F480863656C6C696E666F4C02CE0191027F7F7C0314D0
< :2003A000F8EE9301F7F49305F4FB9312EBFC7F72013F47000000000000000000000000004D
---
> :2001800004000B060C1926FE0D010E09140195014865786149030C2B05B308ED049150E1B5
> :2001A00064490101A814F8EEF1ECECF4FB9101B164490102A412F1F4F4D102B1644901037F
> :2001C0000500A412F1F4F4D103B164490104A319FFE9E104A4029CFF31F154A1FF48010427
> :2001E000A402DAFF32F154A1FE480104A402EFFF33F154A1FD480104A2020B9134F154A1CC
> :200200000600FC480104A202249135F154A1FB480104A202649136F154A1FA480105B164C1
> :200220004901614B0106B164490163A4FAF1ECF1D107B16449015DA3EEECF47F7F7F7F7F13
> :2002400007007F7F7F7F7F7F7F7F7F7F7F7F7F7F7F7F500614F8EEF1ECEC480601F7F4FB59
> :20026000EEF1480412F1F4F44A0612EBFCFCFBEE48050DF1FCFBED4904070501124A04064B
> :200280000800F1ECF14A0312140C7F7F7F7F7F59014DE262C04C014EA101B264C04C014C37
> :2002A000A102B29CC04C014BA103B264C07F7F7F7F7F7F7F7F7F7F7F7F7F7F5601FF7F7F35
> :2002C00009007F7F7F7F7F7F7F7F7F7F7F7F7D02120AB108A3FF130AB108B2140AB108B319
> :2002E00014F6FFA108B313F6FFA108B312F6FFA1087F7F7F7F7F7F58030F1401480A0F0BA0
> :200300000A0053544152544C4F4798100B67656172646EB8110B676561727570B6160B46CA
> :200320004F544FD9F10B53544F504C4F477F7F7F7F7F7F7F7F7F7F7F7F7F7F60018091303A
> :200340000B00B203809130B403B5813FB401B5813FB301018491059F1A1B1C0F1011231321
> :2003600018202122171516150E0F101112194225240C011413123C1223111915164F030158
> :200380000000B9FBF1FC7F7FC1F97F7F7F7F480863656C6C696E666F4C02CE0191027F7FBE
> :2003A0007C0314F8EE9301F7F49305F4FB9312EBFC7F72013F4700000000000000000000BA

/g
wolke

@bsongis
Copy link
Member

bsongis commented Nov 5, 2014

Would you give a link to your 2 eeproms? These eepe files are in compressed format, which is not the raw memory dump. It would be interesting to dump both of them and see which data have been written in the memory write overflow.

@bsongis
Copy link
Member

bsongis commented Nov 5, 2014

On which screen are you when the problem happens? On your custom Lua telemetry screen?

@bsongis
Copy link
Member

bsongis commented Nov 5, 2014

1st LS andsw is at offset 8 in the logical switches array. This array comes after the big array of 512 curve points (512 bytes) which is often unused. It would be interesting to know if there is some difference in this array as well ...

@bsongis
Copy link
Member

bsongis commented Nov 5, 2014

Also would it be reproductible in the latest stable nightly build here:
http://jenkins.open-tx.org/nightly-20/
projectkk2glider added an interesting fix there ...

@projectkk2glider
Copy link
Member

My main suspect is the audio thread. There are also some reports (on forums) that audio sometimes stops working (only wav, the tones continue to play) until radio is power cycled.

As Bertrand already pointed out, there are a couple of fixes related to this (SD card and audio thread) in master. @wolkstein please test with some of the latest nightly builds and report back.

@wolkstein
Copy link
Author

hi,
@bsongis
both eepe files are here.
https://github.com/wolkstein/MavLink_FrSkySPort/tree/single-cell-lipo-voltage/Lua_Telemetry/SettingsWolkstein
the custom apm telemetry screen was running.

i will try a nightly build. can i install(flash) the nightly 2.0.13 builds via dfu util? i am on linux and i have companion Version 2.0.12, Sep 19 2014 installed. my dfu-util version is dfu-util 0.7.

edit:
is the raw memory dump inside the eepe files?. i ask because i overwrite(remove And Switch) in my bugy bManita model on taranis. i thinkit now it is to late for an dump.

@bsongis
Copy link
Member

bsongis commented Nov 5, 2014

Those eepe files are compressed dumps, it would be nice to store somewhere the eepe before / after the bug so that I can compare them (after inflate, Companion can do that with one line uncommented in opentxinterface.cpp)

Yes using Linux is the easiest way. I do that everyday.

@bsongis
Copy link
Member

bsongis commented Nov 6, 2014

No news, good news?

@wolkstein
Copy link
Author

:) yesterday i install a nightly build on my taranis. but i have no time to play around. think at the weekend i will do some tests.

@bsongis
Copy link
Member

bsongis commented Nov 6, 2014

We all cross the fingers :)

@wolkstein
Copy link
Author

hi,
ok today i found some time for testing:). but it ends with a new issue:(
firstly i was not able to reproduce the LS Switch bug. but this happens because my taranis today play suddenly only "one" sound file with stuttering. in my case "stabilize active". i noticed that the stuttering wasn't randomly. the stuttering was exactly, each time, at the same position in sound. between stabilize and active. so i decide to interrupt the switch test and pay my attention to the audio or maybe more (sdcard/filesystem) problem.

i do some things to circle the effect. info, i get this (my first taranis) two weeks ago with this sdcard inside!!

firstly, i switch back from the nightly build to my old firmware 2.0.12
no changes, stuttering exactly as before.

second, i put the sd card into my computer an open the affected soundfile with audacity. soundfile was good no stuttering. samplerate, sample format and sample depth are as expected. so no problem here. i opend the file directly from sdcard, no caching to my computer!!

third, i make a file system check with gparted and fsck.fat.
gparted reports that the quantity of cluster is not as expected 60020 against 60021 expected clusters. size of fat is wrong. size was 256 sectors against 235 expected sectors.
fsck.fat -av reports the same.

wolke@schnell:~$ sudo fsck.fat -av /dev/mmcblk0p1
fsck.fat 3.0.26 (2014-03-07)
fsck.fat 3.0.26 (2014-03-07)
Checking we can access the last sector of the filesystem
Boot sector contents:
System ID "mkfs.fat"
Media byte 0xf8 (hard disk)
       512 bytes per logical sector
     32768 bytes per cluster
         1 reserved sector
First FAT starts at byte 512 (sector 1)
         2 FATs, 16 bit entries
    131072 bytes per FAT (= 256 sectors)
Root directory starts at byte 262656 (sector 513)
      1024 root directory entries
Data area starts at byte 295424 (sector 577)
     60020 data clusters (1966735360 bytes)
16 sectors/track, 4 heads
       137 hidden sectors
   3841911 sectors total
Reclaiming unconnected clusters.
/dev/mmcblk0p1: 1685 files, 4394/60020 clusters

you can see clusters and sectors are not as expected.

next step, i backup my sdcard and reformat the card with fat16 "mkfs.fat -F16 -v -I -n". than i copying all files back to card.

after starting taranis with my old 2.0.12 firmware "stabilize active" sounds without distortion. also sound was good after re flashing to nightly build firmware.

so at the end no result:(

but today, before i noticed the sound problem, i do ~100 file copy actions, mount and umount actions with taranis conected as usbdevice. maybe because this the problem starts. stabilise active was stuttering at the first time it was played today.

/g
wolke

@bsongis
Copy link
Member

bsongis commented Nov 7, 2014

Interesting. So you already know our answer ... would you try to reproduce those ~100 copy actions and see if you reproduce the problem ?

@projectkk2glider
Copy link
Member

This is very similar to audio stuttering issue in #1779. Looks like reading fragmented file with FatFs takes more SD sectors reads than without fragmentation (well that is expected), but exactly how much more? Maybe I will research this just for my curiosity sake.

But to solve this part of the issue, maybe we should recommend users prepare their SD card like so:

  • remove any unnecessary files (like logs)
  • defragment SD card
  • and when adding new sounds to SD card, do a defragment after that.

@lvale
Copy link

lvale commented Nov 8, 2014

Hi

I know it's my fault because of my lousy script :)

Andre's workaround for the audio stuttering problem is the same I've pointed at the other issue #1779 although I'm a bit more radical because I format the card.

I'm building some audio files that will have different total occupied sectors (file length).

@bsongis
Copy link
Member

bsongis commented Nov 8, 2014

So ... do we close both issues and release?

@lvale
Copy link

lvale commented Nov 8, 2014

If I don't have more unexpected reboots :) Since it's raining this weekend I was planning on more testing of a recent build as you asked.

@nigelsheffield
Copy link

I had this for the first time ever today, only difference from anything before was I ran this winter postal task script from on4mj.
http://www.rcgroups.com/forums/member.php?u=536303
, I did the full 20 mins with this script running on a copy of a model , then I switched back to my normal model without this script but with my normal scripts which I have been running no problems, I noticed I had no thermal mode and the LS1 had !sa added like described.

So to be clear it did not change the model with the difference script in but changed the old normal model on load after flying with the other script.

@wolkstein
Copy link
Author

hi,
i also do some time tests today because the LS1 problem. strange, because i get the problem in one model but not in an new model which i have created from scratch today. both models use the same scripts!!.
my old models FLY-bMANITA and FLY-AMANITA was a copy from an model-template i found somewhere in this thread. http://diydrones.com/forum/topics/amp-to-frsky-x8r-sport-converter?id=705844%3ATopic%3A1539556&page=1#comments i do not remember where i found it exactly:(. with this models together with my scripts i get the "!=SA↑" as and in LS1 problem. but with my new model created from scratch not!! i do long-time tests >45 minutes using massive audio output.
scripts in use:
https://github.com/wolkstein/MavLink_FrSkySPort/blob/single-cell-lipo-voltage/Lua_Telemetry/DisplayApmPosition/SCRIPTS/Model_Name/wolkstein-telem1.lua
and
https://github.com/wolkstein/MavLink_FrSkySPort/blob/single-cell-lipo-voltage/Lua_Telemetry/DisplayApmPosition/SCRIPTS/MIXES/cellinfo.lua

models in test, FLY-bMANITA and amanita. amanita is the new model i create today from scratch.
https://raw.githubusercontent.com/wolkstein/MavLink_FrSkySPort/single-cell-lipo-voltage/Lua_Telemetry/SettingsWolkstein/fly-bmanita-amanita-november9-longtimetest.eepe

today, all tests are on firmware nightly build from 5.11.2014.

/g
wolke

edit:
in model amanita i add an SB↓"and switch" in L6. this switch was created by user not by bug!!

@bsongis
Copy link
Member

bsongis commented Nov 9, 2014

Do you mean that you had the problem with our nightly builds?

@projectkk2glider
Copy link
Member

Same mutexID appearing twice in TCBTbl is perfectly normal. It just means that two threads are using it. One is inside mutex, other is waiting for the first one to release mutex and will take it right after that.

What is problematic is the case of the mutex being entered two times (your mutexCheck variable). That is plain wrong and must not happen. There are two possibilities why this happens:

  • first one is a bug in the code that uses mutexes - ie mutex enter and leave calls are not used in pairs.
  • second is a bug in the mutex implementation itself.

I will do some more research about the second point. Maybe we need to add some memory barrier statements also.


Now about the bug in FatFs f_close and 'f_closedir'. This bug happens only if SD card present signal is at logical 1 when one of above functions is called. This would only happen if for example the mechanical switch inside SD card slot would be defective - ie have a bad contact.

Anyway it is a simple fix in FatFs that I have already done and have stashed in my workspace. I will commit it later.

@projectkk2glider
Copy link
Member

I am just checking assembler output of CoEnterMutexSection() with two optimization levels s and 2. First results look horrifying! The -O2 version is completely screwed, the -Os is a little better, but stil possible bug. Stay tuned...

@projectkk2glider
Copy link
Member

First problem in CoOS. When entering mutex there is a check if scheduler is locked, if it is the call is abandoned, otherwise it continues. Code fragment:

    if(OSSchedLock != 0)                /* Is OS lock?                        */
    {                                
        return E_OS_IN_LOCK;            /* Yes,error return                   */
    }   

#if CFG_PAR_CHECKOUT_EN >0
    if(mutexID >= MutexFreeID)          /* Invalid 'mutexID'                  */
    {
        return E_INVALID_ID;    
    }
#endif

    OsSchedLock();  // just does OSSchedLock++

I think that the OSSchedLock checking and incrementing should be an atomic operation. As it is now there is possibility of thread being interrupted between OSSchedLock check and increment.

Same problem is in CoOS flags and mailboxes, but we don't use them.

@bsongis
Copy link
Member

bsongis commented Nov 19, 2014

Would it be a problem in such case? I didn't read everything but I think no, it seems rather an internal sanity test, if CoOS schedules another task at this time, the schedlock will be incremented and decremented (possibly even many times if another task also takes any mutex)

Instead the code (checking and changing mutexState) after this lock seems to me really problematic. If it is not considered as volatile, I am afraid it can be moved before the lock when O2 optims enabled.

@projectkk2glider
Copy link
Member

OsSchedLock(); section at -Os optimization level in function CoEnterMutexSection() (my comments begin with ###):

108:CoOS/kernel/mutex.c ****     OsSchedLock();
 194 0012 1C78        ldrb  r4, [r3, #0]  @ zero_extendqisi2  @ OSSchedLock, OSSchedLock   ### loads current value of OSSchedLock
 196 0014 2E4A        ldr r2, .L25+8  @ tmp175,       @ Loads address of MutexTbl which it needs later (this cames from code reordering even at -Os optimization level!)
 198 0016 651C        adds  r5, r4, #1  @ tmp171, OSSchedLock,    ### adds 1 TO OSSchedLock
 199 0018 E9B2        uxtb  r1, r5  @ OSSchedLock.6, tmp171  ###Zero extend Byte. Extends an 8-bit value to a 32-bit value.
 201 001a 1970        strb  r1, [r3, #0]  @ OSSchedLock.6, OsSchedLock   ### stores new value of OSSchedLock

This is ok, still there is little code reordering, but it does not affect desired result.


Same section at -O2:

109:CoOS/kernel/mutex.c ****     pCurTcb = TCBRunning;
 212 0016 2E4C        ldr r4, .L31+8  @ tmp174,       ###loads address of TCBRunning
 110:CoOS/kernel/mutex.c ****     pMutex  = &MutexTbl[mutexID];
 214 0018 2E49        ldr r1, .L31+12 @ tmp175,       ### loads address of MutexTbl
 109:CoOS/kernel/mutex.c ****     pCurTcb = TCBRunning;
 216 001a 2468        ldr r4, [r4, #0]  @ pCurTcb, TCBRunning     ### loads pCurTcb in r4
 217                .LVL11:
 108:CoOS/kernel/mutex.c ****     OsSchedLock();
 219 001c 1E78        ldrb  r6, [r3, #0]  @ zero_extendqisi2  @ OSSchedLock, OSSchedLock    ### loads current value of OSSchedLock
 110:CoOS/kernel/mutex.c ****     pMutex  = &MutexTbl[mutexID];
 221 001e 01EBC005    add r5, r1, r0, lsl #3  @, pMutex, tmp175, mutexID,       ### gets pMutex  = &MutexTbl[mutexID];      
 222                .LVL12:
 112:CoOS/kernel/mutex.c ****     pCurTcb->mutexID = mutexID;
 224 0022 E071        strb  r0, [r4, #7]  @ mutexID, pCurTcb_9->mutexID       ### ERROR does pCurTcb->mutexID = mutexID; before  storing OSSchedLock
 108:CoOS/kernel/mutex.c ****     OsSchedLock();
 226 0024 0136        adds  r6, r6, #1  @ tmp171, OSSchedLock,              ### adds 1 to OSSchedLock
 113:CoOS/kernel/mutex.c ****     if(pMutex->mutexFlag == MUTEX_FREE)       /* If mutex is available        */   
 228 0026 6A78        ldrb  r2, [r5, #1]  @ zero_extendqisi2  @ D.4522, pMutex_12->mutexFlag  ### ERROR loads pMutex_12->mutexFlag before  storing OSSchedLock
 108:CoOS/kernel/mutex.c ****     OsSchedLock();
 230 0028 F6B2        uxtb  r6, r6  @ OSSchedLock.6, tmp171
 231                .LVL13:
 232 002a 1E70        strb  r6, [r3, #0]  @ OSSchedLock.6, OSSchedLock    ### stores new value of OSSchedLock

This is very bad: before OSSchedLock is incremented and stored in memory this happens:

  • pCurTcb->mutexID = mutexID;
  • pMutex->mutexFlag is read for later use in comparison

@bsongis please review my findings.

@projectkk2glider
Copy link
Member

Ha, compiler barrier completely cures above problem with -O2 optimization. New source:

    OsSchedLock();
    asm volatile ("" : : : "memory");         // prevents compiler reordering
    pCurTcb = TCBRunning;
    pMutex  = &MutexTbl[mutexID];

And resulting assembly:

108:CoOS/kernel/mutex.c ****     OsSchedLock();
 211                    .loc 1 108 0
 212 0016 1C78          ldrb    r4, [r3, #0]    @ zero_extendqisi2  @ OSSchedLock, OSSchedLock
 213 0018 651C          adds    r5, r4, #1  @ tmp171, OSSchedLock,
 214 001a E9B2          uxtb    r1, r5  @ OSSchedLock.6, tmp171
 215                .LVL11:
 216 001c 1970          strb    r1, [r3, #0]    @ OSSchedLock.6, OSSchedLock      ### OSSchedLock stored before any other operation on mutex - SUCCESS
 109:CoOS/kernel/mutex.c ****     asm volatile ("" : : : "memory");         // prevents compiler reordering
 217                    .loc 1 109 0
 110:CoOS/kernel/mutex.c ****     pCurTcb = TCBRunning;
 218                    .loc 1 110 0
 219 001e 2C4B          ldr r3, .L31+8  @ tmp174,
 111:CoOS/kernel/mutex.c ****     pMutex  = &MutexTbl[mutexID];
 220                    .loc 1 111 0
 221 0020 2C4A          ldr r2, .L31+12 @ tmp175,
 110:CoOS/kernel/mutex.c ****     pCurTcb = TCBRunning;

@bsongis
Copy link
Member

bsongis commented Nov 19, 2014

Yes I confirm this is really bad to compile CoOS in -O2
I propose to continue with -Os and keep the barrier as well (does it change something in -Os) ?
And we release ASAP. This LS1 bug is a nightmare.
I will send an email to the CoOS team tomorrow.

@projectkk2glider
Copy link
Member

With -Os it ok, the one reorder (which luckilly was not critical) is removed:

194 0012 1C78           ldrb    r4, [r3, #0]    @ zero_extendqisi2  @ OSSchedLock, OSSchedLock
 195 0014 651C          adds    r5, r4, #1  @ tmp171, OSSchedLock,
 196 0016 E9B2          uxtb    r1, r5  @ OSSchedLock.6, tmp171
 197                .LVL7:
 198 0018 1970          strb    r1, [r3, #0]    @ OSSchedLock.6, OSSchedLock
 109:CoOS/kernel/mutex.c ****     asm volatile ("" : : : "memory");         // prevents compiler reordering
 199                    .loc 1 109 0
 110:CoOS/kernel/mutex.c ****     pCurTcb = TCBRunning;
 200                    .loc 1 110 0
 201 001a 2D4B          ldr r3, .L25+8  @ tmp174,
 111:CoOS/kernel/mutex.c ****     pMutex  = &MutexTbl[mutexID];

So yes, I agree on -Os and on adding barrier. The barrier should also be added to some other functions like __disable_irq().

@projectkk2glider
Copy link
Member

And I think that volatile for mutex structure or any of its members is not needed, because it is protected by OSSchedLock.

@bsongis
Copy link
Member

bsongis commented Nov 19, 2014

Right I introduced it to prevent them from being read/modified before the lock, obviously it works as well, prevents the reorder (I didn't checked though) but it's not needed.

@projectkk2glider
Copy link
Member

Will you add memory barriers?

@bsongis
Copy link
Member

bsongis commented Nov 19, 2014

Yes. Perhaps in changing the OsSchedLock() macro. Are you sure that memory barrier is needed for __disable_irq(), it's already ASM code?

@projectkk2glider
Copy link
Member

Are you sure that memory barrier is needed for __disable_irq(), it's already ASM code.

It can't hurt.

bsongis added a commit that referenced this issue Nov 19, 2014
@lvale
Copy link

lvale commented Nov 19, 2014

Congrats guys.
Chasing compiler "optimizations" is a real PIA....

@LapinFou
Copy link
Contributor

Congratulation for this bug fix! A difficult one to tackle. 👍
I don't know if it related to the compilation directive, but the 2.0.13 is 20% smaller than the 2.0.12!

@projectkk2glider
Copy link
Member

I don't know if it related to the compilation directive, but the 2.0.13 is 20% smaller than the 2.0.12!

It is, it is now optimized for size mainly.

@LapinFou
Copy link
Contributor

More room for some new great feature !!!
;o)
Again, congrats to all of you for your hard work.

@MikeBland
Copy link

I've been using -Os for ersky9x anyway. I had a look at the assembler and did see the odd assembler instruction that appeared in the middle of the sequence for Incrementing the OsSchedLock variable.
I tried the following:
#define OsSchedLock() { asm(""); OSSchedLock++; asm volatile ("" : : : "memory"); }
and that seems to make every reference work OK.
I may be using a different version of the compiler, but you might try this.
I have found using asm("") useful in the AVR compiler as well.

Mike.

@bsongis
Copy link
Member

bsongis commented Nov 21, 2014

Yes Mike, now we have:
#define OsSchedLock() { OSSchedLock++; asm volatile ("" : : : "memory"); }
And we compile in -Os
We are using the very last 4.7 arm-gcc release

@MikeBland
Copy link

But I have the extra asm("") in front as well. This brackets the OSSchedLock++ with 2 asm directives, so the compiler doesn't let other code cross either boundary.

Mike.

@projectkk2glider
Copy link
Member

In my current understanding putting asm("") will change nothing. To archive a memory barrier the full statement must be included: asm volatile ("" : : : "memory")

I am also concerned about IRQ disable/enable functions. I propose (again) that we change them to (only add "memory" modifier):

__attribute__( ( always_inline ) ) static __INLINE void __enable_irq(void)
{
  __ASM volatile ("cpsie i" : : : "memory");
}
__attribute__( ( always_inline ) ) static __INLINE void __disable_irq(void)
{
  __ASM volatile ("cpsid i" : : : "memory");
}

Without the "memory" modifier the compiler is free to reorder code around __enable_irq() statement.

@MikeBland
Copy link

I believe there are TWO different things here, what the compiler does and what the processor does.

The code I get for OsSchedLock is usually like:

4083ca: 7e5a ldrb r2, [r3, #25]
4083cc: 3201 adds r2, #1
4083ce: b2d2 uxtb r2, r2
4083d0: 765a strb r2, [r3, #25]

Without the leading asm(""), sometimes other instructions are interleaved within this sequence.
With the leading asm(""), this sequence always appears as shown, with nothing else in between the 4 instructions.

I believe this is because the compiler assumes any asm code might have an effect the compiler doesn't "know" about, so must not rearrange an code before the asm to appear after the asm.

The second thing is the processor may reorder physical memory accesses, regardless of the actual instruction order. It may be necessary to use a real DMB (Data Memory Barrier) instruction. This may only be necessary for multiple processors, rather than multiple tasks on a single processor.

According to the datasheets for the STM ARM processor, and the Atmel ARM processor, this instruction is available.

I found this:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka14041.html

On the AVR compiler, I have noticed that if you have a return in the middle of a procedure, the compiler sometimes generates a set of pop instructions and a return in addition to those at the end of the procedure. terminating the procedure with:
asm("") ;
return ;
stops this happening and saves quite a few bytes of flash.

Mike.

projectkk2glider added a commit that referenced this issue Feb 14, 2015
projectkk2glider added a commit that referenced this issue Feb 20, 2015
@CaptainSwashplate
Copy link

Wow - quite the thread.
I was one (the only?) person reporting this problem in the forum
http://openrcforums.com/forum/viewtopic.php?f=92&t=6178
Thanks for tracking it down. I have had my intentions changed externally last year in October but now updating everything and looking forward to trying the new firmware on easter. If everything goes well with the test-plane I'll go forward and convert the next models.
Thanks for your hard work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests