
Employ mastership info for session management #1189

Merged: 23 commits merged into onosproject:master from mastership_sb on Sep 11, 2020

Conversation

@adibrastegarnia (Contributor) commented Aug 31, 2020

This PR makes the required changes in the southbound (SB) of onos-config to employ mastership information for session management. It is a WIP; the code still needs more work, and I will update this PR with more details about the changes.

#960 #843 #962

P.S. There are some log messages left in for debugging, but I will remove them at the end.

@adibrastegarnia added the labels enhancement (New feature or request), WIP (Work in progress), and do not merge ⚠️ (Do not Merge, used for testing or not ready) on Aug 31, 2020
return err
}
case topodevice.ListResponse_UPDATED:
log.Info("Process device event updated")

Contributor Author:

@kuujo
In the current code we check the address and, if there is a change, we reconnect; but if any other fields change, we also need to update the state of the device. Should we create a new session anyway?

Contributor:

I think either way is fine.

* Check device is added as a synchronizer correctly, times out on no gRPC device
* and then un-does everything
*/
func TestSessionManager(t *testing.T) {

Contributor Author:

We need a new unit test here

@adibrastegarnia force-pushed the mastership_sb branch 2 times, most recently from 37887ee to e4d07ab, on September 1, 2020 at 23:46
@@ -65,6 +66,7 @@ func (s *TestSuite) SetupTestSuite() error {
err = helm.Chart("onos-config", onostest.OnosChartRepo).
Release("onos-config").
Set("image.tag", "latest").
Set("replicaCount", 2).

Contributor Author:

I will change it later.

@kuujo (Contributor) left a comment:

Looks really good. We just have to make sure the device state update will be retried until it's either successful or superseded by a later master. Optimistic locking requires retries to function correctly. Also, if we're going to retry device updates, we have to ensure multiple concurrent state changes to the same device occur in the correct order. If we just start a new goroutine for each update then they could be performed in any order. Perhaps, in addition to the term, we should store a local logical timestamp in the device attributes.
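
As a rough illustration of the logical-timestamp idea, the sketch below tags each write with a monotonic counter kept in the device attributes and skips writes that are not newer than what is already stored. The Device type, the attribute key, and stampUpdate are illustrative names, not the PR's actual code.

```go
package synchronizer

import "strconv"

// Device stands in for the topo device type; only the attribute map matters here.
type Device struct {
	Attributes map[string]string
}

// timestampAttr is an illustrative attribute key for the logical timestamp.
const timestampAttr = "onos-config.logical-timestamp"

// stampUpdate tags an outgoing update with the given logical timestamp and
// reports whether it should still be applied, i.e. whether it is strictly
// newer than the timestamp already recorded on the device.
func stampUpdate(device *Device, local uint64) bool {
	stored, _ := strconv.ParseUint(device.Attributes[timestampAttr], 10, 64)
	if local <= stored {
		return false // an equal or newer update has already been written
	}
	if device.Attributes == nil {
		device.Attributes = make(map[string]string)
	}
	device.Attributes[timestampAttr] = strconv.FormatUint(local, 10)
	return true
}
```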

pkg/southbound/synchronizer/deviceUpdate.go (review thread resolved, outdated)

@kuujo (Contributor) commented Sep 2, 2020:

Either that or we can serialize updates to each device state on a separate channel.

}

// NewSessionManager create a new session manager
func NewSessionManager(options ...func(*SessionManager)) (*SessionManager, error) {

Contributor:

I think I reviewed something like this last week (maybe this is the same PR). You should not let it go too long without merging, or you might have conflicts.

Contributor Author:

@SeanCondon
Thanks for the review. This is the PR that I opened two days ago; I was waiting to get feedback from you and @kuujo before finalizing it. Now that I have some feedback, I will address it, and after a bit of testing we can merge. I will add new unit tests later based on the new changes.

Contributor Author:

@SeanCondon
The PR that you reviewed was a prerequisite to this one and we merged that one.

operationalStateCacheLock: sm.operationalStateCacheLock,
deviceChangeStore: sm.deviceChangeStore,
device: device,
target: sm.newTargetFn(),

Contributor:

This calls the function rather than storing the function value. Remove the brackets.

Contributor Author:

I tried to avoid any code changes that are not related to the session management work, but these are minor things and I will improve them in this same PR.

session.device.Attributes = make(map[string]string)
}

err := session.open()

Contributor:

The mutex is locked at this stage. Is that necessary? It could block everything up.

modelRegistry *modelregistry.ModelRegistry
sessions map[topodevice.ID]*Session
operationalStateCache map[topodevice.ID]devicechange.TypedValueMap
newTargetFn func() southbound.TargetIf

Contributor:

I don't think you need func() here - just the function name should be fine
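
For reference, a tiny sketch of the Go point behind this comment and the earlier one about the brackets, using hypothetical Target, manager, and session types: without "()" the field holds the factory itself, and the call belongs where the session is built.

```go
package synchronizer

// Target is a stand-in for southbound.TargetIf.
type Target interface {
	Connect() error
}

type manager struct {
	// Without "()", the field holds the factory, so a fresh target can be
	// built for every session.
	newTarget func() Target
}

type session struct {
	target Target
}

// newSession is the one place where the brackets belong: the factory is
// called when the session is built.
func (m *manager) newSession() *session {
	return &session{target: m.newTarget()}
}
```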


session := sm.sessions[id]
if session != nil {
currentTerm, err := session.getCurrentTerm()

@kuujo (Contributor) commented Sep 3, 2020:

Even though we’re getting the device’s current term from the topology and checking it against the local mastership term, this particular implementation doesn’t guarantee the update cannot be overwritten by an older master. In order to prevent that, the read-modify-write sequence comparing the stored term with the local term and then updating the device must be atomic, and in order for it to be atomic it must operate on the same revision of the device. When a device is read from the topology service, the device is returned with metadata that includes a revision number. When a device is updated, the topology service compares the device’s revision number with the stored revision number and rejects the update if the revisions don’t match. If an update is rejected, the client attempts the atomic read-modify-write sequence again.

But the code here is actually reading the device twice: once to retrieve the term, and once to update the device, so the sequence is not atomic. The term can change between the first and second read, and that would allow an older master to overwrite an update from a newer master. For example:

  • Node A reads the device with revision 1 and verifies the stored term matches its local term
  • Node B is elected master for a later term and updates the device state, incrementing its revision number to 2
  • Node A proceeds to read the device again, this time with revision 2, and updates it without checking the term. The update succeeds since the device was re-read, so it overwrites the update from the new master, node B

In this scenario, if node A’s update is attempted using the same object it read in the first step (revision 1), it would be rejected. Upon retrying, node A would read revision 2 and recognize the later term indicating it’s no longer the master.

Both the condition checking the device’s stored term and the write updating the device state must occur on the same device object returned from the topology service.
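
A sketch of that atomic read-modify-write loop, under assumed store semantics (Get returns the device with its revision; Update rejects a write whose revision is stale). The point is that the term check and the write operate on the same object returned by the store, and a rejected write triggers a full re-read and re-check.

```go
package synchronizer

import "errors"

// errConflict is what the assumed store returns when the revision is stale.
var errConflict = errors.New("revision conflict")

// Device and DeviceStore are stand-ins for the topo device and its store.
type Device struct {
	Revision uint64
	Term     uint64
	State    string
}

type DeviceStore interface {
	Get(id string) (*Device, error)
	// Update fails with errConflict if the stored revision differs from d.Revision.
	Update(d *Device) error
}

// setDeviceState retries the read-check-write sequence until it succeeds or a
// later master's term is observed on the device itself.
func setDeviceState(store DeviceStore, id string, localTerm uint64, state string) error {
	for {
		device, err := store.Get(id)
		if err != nil {
			return err
		}
		if device.Term > localTerm {
			return nil // superseded by a later master; stop quietly
		}
		device.State = state
		device.Term = localTerm
		err = store.Update(device) // same object, same revision as the check above
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err
		}
		// Conflict: a newer revision was written concurrently; re-read and
		// re-check the term before trying again.
	}
}
```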


go sm.processDeviceEvents(sm.topoChannel)
go func() {
err := sm.updateDeviceState()

Contributor:

I think we should create a separate event channel and goroutine for each session to allow session state changes to be notified and the topology to be updated concurrently. A new goroutine to process session state changes could be created in the session manager or in the session itself. We only need to maintain the order of updates within each session. Sharing a channel across sessions can only block the processing of session updates unnecessarily. A per-session loop could also simplify the algorithm for updating the device state.
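
One possible shape of this, with assumed names (sessionEvent and the updateDeviceState callback are illustrative): each session gets its own channel and goroutine, so state changes for one device are applied in order without blocking other sessions.

```go
package synchronizer

import "log"

// sessionEvent is an illustrative connectivity-state change for one device.
type sessionEvent struct {
	deviceID  string
	connected bool
}

// watchSession drains one session's event channel in a dedicated goroutine.
// Ordering is preserved per session, and sessions never block one another.
func watchSession(events <-chan sessionEvent, updateDeviceState func(sessionEvent) error) {
	go func() {
		for event := range events {
			if err := updateDeviceState(event); err != nil {
				log.Printf("updating state of %s failed: %v", event.deviceID, err)
			}
		}
	}()
}
```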

protocolState.ChannelState = channel
protocolState.ServiceState = service
topoDevice.Protocols = append(topoDevice.Protocols, protocolState)
mastershipState, err := s.mastershipStore.GetMastership(topoDevice.ID)

Contributor:

To ensure the update is atomic we just need to compare the mastership state to the stored device term after this.
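
Roughly, the comparison being suggested could look like the sketch below; the "mastership-term" attribute key and the shouldApply helper are assumptions for illustration.

```go
package synchronizer

import "strconv"

// shouldApply reports whether an update decided under nodeTerm may still be
// written to a device whose attributes may already record a later term. The
// "mastership-term" attribute key is an assumption for illustration.
func shouldApply(attributes map[string]string, nodeTerm uint64) bool {
	storedTerm, _ := strconv.ParseUint(attributes["mastership-term"], 10, 64)
	return nodeTerm >= storedTerm
}
```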

@kuujo (Contributor) commented Sep 4, 2020:

The file naming convention in Go is to use lower_case_and_underscores.go. We're mixing this with lowerCamelCase.go here and in other areas of onos-config. We should avoid doing that here and fix the places where it's already done in another PR.

@adibrastegarnia marked this pull request as draft on September 4, 2020 at 15:33
@adibrastegarnia added this to the Aether 2020Q3 milestone on Sep 4, 2020
@adibrastegarnia force-pushed the mastership_sb branch 4 times, most recently from e2e6169 to c24f3f7, on September 5, 2020 at 06:09
@adibrastegarnia marked this pull request as ready for review on September 5, 2020 at 06:15

// Do not update the state of a device if the node encounters a mastership term greater than its own
if uint64(mastershipState.Term) < uint64(currentTerm) {
return errors.New("device mastership term is greater than node mastership term")

Contributor:

When a higher mastership term is discovered, this node needs to stop attempting this update. But since this function is being called by the backoff library, returning a plain error will just cause it to retry, and it will loop forever since the update will always fail once this condition is met. The error returned has to be wrapped as a permanent error (backoff.Permanent, or whatever mechanism the library provides for stopping the retries).

Contributor Author:

Oops. You are right. We should not return a plain error here.
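
A sketch of that fix, assuming the retries are driven by github.com/cenkalti/backoff/v4 (the updateDevice callback and the term arguments are placeholders): wrapping the error with backoff.Permanent makes backoff.Retry stop instead of looping on a condition that can never clear.

```go
package synchronizer

import (
	"errors"

	"github.com/cenkalti/backoff/v4"
)

// updateWithRetry retries transient failures, but gives up permanently once
// the device is seen to carry a later mastership term than this node's.
func updateWithRetry(nodeTerm, deviceTerm uint64, updateDevice func() error) error {
	operation := func() error {
		if nodeTerm < deviceTerm {
			// Superseded by a later master: a permanent error stops backoff.Retry.
			return backoff.Permanent(errors.New("device mastership term is greater than node mastership term"))
		}
		return updateDevice() // ordinary errors are retried with exponential backoff
	}
	return backoff.Retry(operation, backoff.NewExponentialBackOff())
}
```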

protocolState.ChannelState = channel
protocolState.ServiceState = service
topoDevice.Protocols = append(topoDevice.Protocols, protocolState)
mastershipState, err := s.mastershipStore.GetMastership(topoDevice.ID)

@kuujo (Contributor) commented Sep 8, 2020:

This mastership state doesn't necessarily represent the state on which this update should be based. Since we're reading from the mastership store on every update attempt, it's possible for this node to read a different (newer) mastership state (e.g. for a new master/term) than the one this update is based on. Who knows if this node is even the master any more. Rather than reading from the store here, this function should take as an argument the mastership state on which the update is based, so that every attempt to perform the update is based on that state. Reading stores multiple times can always introduce this sort of consistency issue. The only reason we should read the store here is to determine whether the update should be cancelled due to a new master/higher term, but that's not strictly necessary; it would just be an optimization, since that will be determined when the device is updated by the new master anyway.
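
In code, the suggestion amounts to something like the sketch below, where MastershipState, MastershipReader, and decideAndUpdate are assumed names: the mastership is read once, at the point where the update is decided, and passed down so retries cannot silently rebase onto a newer term.

```go
package synchronizer

// MastershipState is a stand-in for the mastership store's state (master + term).
type MastershipState struct {
	Term   uint64
	Master string
}

// MastershipReader is the single read allowed: at decision time, not per retry.
type MastershipReader interface {
	GetMastership(deviceID string) (MastershipState, error)
}

// decideAndUpdate reads the mastership exactly once and hands it to the update
// function; every retry of that update is then based on the same master/term.
func decideAndUpdate(reader MastershipReader, deviceID string,
	update func(deviceID string, state MastershipState) error) error {
	state, err := reader.GetMastership(deviceID)
	if err != nil {
		return err
	}
	return update(deviceID, state)
}
```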

s.mu.Lock()
s.connected = false
s.mu.Unlock()
state, _ := s.mastershipStore.GetMastership(s.device.ID)

Contributor:

The session should be created with a mastership state. It shouldn't need to read the mastership state at all. That creates another race where the mastership could change between the time the session was created and the time it was opened. Just remove the mastershipStore field.

Contributor Author:

@kuujo
Done. I hope I addressed it properly.
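
For reference, a sketch of the shape being asked for two comments above, with assumed field names: the mastership state is fixed at construction time and the session carries no mastership store to re-read later.

```go
package synchronizer

import "sync"

// MastershipState and Device are stand-ins for the real types.
type MastershipState struct {
	Term   uint64
	Master string
}

type Device struct {
	ID string
}

// Session holds the mastership state it was created under; there is no
// mastershipStore field left to re-read and race against.
type Session struct {
	mu         sync.RWMutex
	device     *Device
	mastership MastershipState
	connected  bool
}

// newSession captures the mastership state once, at creation time.
func newSession(device *Device, state MastershipState) *Session {
	return &Session{device: device, mastership: state}
}
```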

@adibrastegarnia removed the do not merge ⚠️ label on Sep 11, 2020
@adibrastegarnia merged commit aed547f into onosproject:master on Sep 11, 2020
@adibrastegarnia deleted the mastership_sb branch on September 11, 2020 at 22:03
Labels: enhancement (New feature or request)
3 participants