Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SMP: CFE TIME uses OSAL IntLock/IntUnlock for mutual exclusion #74

Closed
skliper opened this issue Sep 30, 2019 · 10 comments
Closed

SMP: CFE TIME uses OSAL IntLock/IntUnlock for mutual exclusion #74

skliper opened this issue Sep 30, 2019 · 10 comments
Labels
Milestone

Comments

@skliper
Copy link
Contributor

skliper commented Sep 30, 2019

Interrupt locking simply prevents incoming interrupts; is not a mutual exclusion mechanism and will NOT achieve "exclusive access" on all types of processors. Mutual exclusion is more of a side-effect that occurs on a single-core processor that uses a timer interrupt to perform task switching duties. On a multi-core processor, this will not work.

Furthermore, in the POSIX OSAL, IntLock/Unlock are no-ops, and interrupt control is a kernel-level function and not something that user space tasks can do (even as root).

This should be replaced with a Mutex, as this is what the code is really trying to do.

@skliper skliper added this to the 6.6.0 milestone Sep 30, 2019
@skliper skliper self-assigned this Sep 30, 2019
@skliper skliper added the bug label Sep 30, 2019
@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Imported from trac issue 43. Created by jphickey on 2015-05-07T11:04:05, last modified: 2019-03-05T14:58:28

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by acudmore on 2015-05-19 12:46:24:

It looks like in most of the places where IntLock is used in TIME, a Mutex would work, since it is just updating a state variable.

There are a couple of places where IntLock/IntUnlock may have been intended to make sure the variables were updated without any task switches. It is also possible that the API to read these variables that are changing do not have mutex locks, which could return invalid data if this sequence of instructions are interrupted by a context switch.

But I agree that the Posix port is not helping with it's empty IntLock/IntUnlock. Maybe the OSAL IntLock/IntUnlock could be implemented with a Mutex?

I also agree that SMP is going to pose a problem.

I would suggest:

  • Replacing the obvious mutex cases
  • Looking into if an IntLock is needed in any of the time critical cases
  • Implementing IntLock in the POSIX OSAL with a Mutex ( not great, but better than nothing )
  • If the IntLock/Unlock is removed from TIME, then does the Mutex have to be put around the read functions?

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by glimes on 2016-11-08 14:17:08:

Current crop of cfe-next are all going into CFE 6.6

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2017-10-16 13:30:02:

I have started looking into this issue and as I peel back the onion on the TIME State machines -- I only find more race conditions...

However, many of these issues can be traced back to the "1HzISR", and if we fix this, then we can get a solution "good enough for now".

Background Info:

Currently the "1Hz ISR" performs some updates to the global time state in CFE_TIME_TaskData (global), then posts a semaphore to run the "Local 1Hz Task".

When the ISR reads and modifies the global state, it does so with out any explicit locking. Instead this relies on the OS_IntLock() to prevent this ISR from firing while the CFE TIME Task is processing commands and actively updating the same global state variables.

An interrupt lock like this is simply not possible in an SMP environment, because the interrupt could be serviced by another core. Even if it were possible to lock interrupts on //all// cores, it is still not sufficient because the 1Hz could have //already gotten triggered// and being serviced by another core by the time the interrupt lock is done.

The "Local 1Hz Task" which is triggered by the ISR is minimal, its only job is to generate up to two software bus messages, depending on compile-time config:

  • "Fake Tone" message - this drives the rest of the "time at tone" state machine in the absence of an external source for this tone message.
  • "1 Hz" message - this provides a general purpose software bus message that other applications can subscribe to, to perform any periodic work.

Both of these messages are currently //optional// based on compile-time platform configuration.

Proposed Solution:

Move the work of the 1Hz ISR task into the regular TIME task, leveraging the existing 1HZ software bus message. Most updates to the global state were already triggered via software bus messages anyway; "tone" messages come in through this same path and are processed by the same thread, as well as all configuration from the ground system. The ISR was the only other "writer" into the global state, so moving this to a software bus triggered message would implicitly synchronize everything as it would now all be done by the same thread.

This means:

  • "1Hz message" becomes required, not optional anymore (platform config for this goes away, always enabled)
  • "Fake Tone" becomes piggybacked onto this message, and does not need to be a separate message.
  • CFE TIME subscribes to 1Hz, rather than "Fake Tone" message.
  • The 1Hz message handler first does the work previously done directly in the ISR, then calls the "Tone 1Hz Task" if FAKE_TONE functionality is enabled (this is what the Fake Tone handler did).

OTHER NOTES:

Going through of the TIME subsystem in this level of detail further reinforces the need to overhaul time services in general. Multiple tasks and helper threads are created here, without any real justification, which only increases the risks of having thread safety issues. The state machines are way too complicated with multiple levels of indirection, to support features that do not actually exist. For instance see #123, which suggest the state machines in here have been hacked to support certain configurations which are not even possible with any current OSAL or PSP.

Perhaps most importantly: the update methods used will almost certainly "jump" the clock with updates come in, which is extremely undesirable for a control algorithm. For many applications, particularly anything involving real time control, the result of this algorithm is probably totally unusable.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2017-10-16 20:32:56:

Fix is ready for review.

Branch trac-43-time-no-intlock, commit [changeset:ebbb081]

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by glimes on 2017-10-17 12:12:07:

recommend accept.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by acudmore on 2017-10-20 12:00:04:

recommend approve

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by sstrege on 2017-10-20 12:59:17:

Recommend/accept

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by glimes on 2017-10-24 15:47:36:

commit [changeset:ebbb081] integrated to development.
commit [changeset:2a42f5f] integrated to development.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jhageman on 2019-03-05 14:58:28:

Milestone renamed

@skliper skliper closed this as completed Sep 30, 2019
@skliper skliper removed their assignment Sep 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant