Skip to content
This repository has been archived by the owner on Nov 24, 2023. It is now read-only.

scheduler: add Relay as a worker stage and fix bug #2219

Merged
merged 23 commits into from Oct 28, 2021

Conversation

lance6716
Copy link
Collaborator

@lance6716 lance6716 commented Oct 13, 2021

What problem does this PR solve?

close #2204

What is changed and how it works?

Now scheduler of DM master will maintain DM worker in 4 stage: Offline, Free, Relay, Bound. Relay is newly added to easily and correctly handle some stage transition.

Check List

Tests

  • Unit test
  • Integration test

Code changes

  • Has exported function/method change
  • Has exported variable/fields change

Side effects

  • Increased code complexity
  • Breaking backward compatibility (but fix a bug)

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation AND METRICS
  • Need to be included in the release note

@ti-chi-bot
Copy link
Member

ti-chi-bot commented Oct 13, 2021

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • Ehco1996
  • GMHDBJD

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@Ehco1996 Ehco1996 added this to the v2.0.8 milestone Oct 14, 2021
@lance6716 lance6716 changed the title WIP: add Relay as a worker stage and fix bug scheduler: add Relay as a worker stage and fix bug Oct 14, 2021
@lance6716 lance6716 added needs-cherry-pick-release-2.0 This PR should be cherry-picked to release-2.0. Remove this label after cherry-picked to release-2.0 needs-update-docs Should update docs after this PR is merged. Remove this label once the docs are updated needs-update-release-note This PR should be added into release notes. Remove this label once the release notes are updated labels Oct 14, 2021
@lance6716
Copy link
Collaborator Author

for some TODO, I need reviewer's opinion and fix them in next PR

@lance6716
Copy link
Collaborator Author

lance6716 commented Oct 15, 2021

Can we add more comments to descript our behavior to make it easier for our collaborators to read? For example (just suggestion):

If worker fails, it fallbacks to other workers without relay. Last worker goes online and dispatches back to open relay sync.
worker hangs with priority fallback other relay worker. If no other relay worker is found source will fallback idle worker, and with relay sync. If last worker goes online and schedules back with relay sync.

Later I'll add some comment of choose bound to tryBoundForWorker and tryBoundForSource. Currently we don't implement "Last worker goes online and dispatches back to open relay sync" in my memory.

@Ehco1996 Ehco1996 modified the milestones: v2.0.8, v2.1.0 Oct 18, 2021
Copy link
Contributor

@Ehco1996 Ehco1996 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
after rest comment are resolved

@ti-chi-bot ti-chi-bot added the status/LGT1 One reviewer already commented LGTM label Oct 19, 2021
// returns (true, nil) after bounded.
// tryBoundForWorker tries to bind a source to the given worker. The order of picking source is
// - try to bind the last bound source
// - try to bind sources on which the worker has unfinished load task
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think try to bind to the unfinished load task priority is higher than last bound souce.

relaySource := workers[i].RelaySourceID()
another := i ^ 1 // make use of XOR to flip 0 and 1
toBindSource := inputSources[another]
if relaySource != "" && toBindSource != "" && relaySource != toBindSource {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If SourceA finished dump phase on WorkerA and then workerA down, then sourceA was scheduled to WorkerB and enable relay on workerB, then workerA up again, sourceA should be transferred to workerA again. But in this situation, it will return a false positive error.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before transferWorkerAndSource: (workerA, ""), (workerB, sourceA), workerB has relay source of sourceA

in this check of line 555, when iterating workerA, relaySource is "" so not pass; when iterating workerB toBindSouce is "" so not pass. It won't raise false positive error.
This is to check workerB has started relay for sourceA but requested to bind to sourceB

// set the stage as Bound and record the bound relationship if exists.
if bound, ok := sbm[name]; ok {
// source bounds without source configuration should be deleted later
if _, ok := scm[bound.Source]; ok {
boundsToTrigger = append(boundsToTrigger, bound)
// TODO: if etcd has saved KV that worker1 started relay for source1, but bound to source2, should
// we stop relay or stop source to avoid DM master leader failed to bootstrap?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe stop source is a simpler choice because we don't support migrate relay bind relation to another worker.

@@ -1521,6 +1521,10 @@ func (t *testScheduler) TestTransferWorkerAndSource(c *C) {
c.Assert(s.bounds[sourceID2], DeepEquals, worker1)
c.Assert(s.bounds[sourceID3], DeepEquals, worker4)
c.Assert(s.bounds[sourceID4], DeepEquals, worker3)

c.Assert(worker1.StartRelay(sourceID2), IsNil)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add some unit tests for the newly added logic

defer w.mu.Unlock()

switch w.stage {
case WorkerOffline, WorkerRelay:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that the worker is in these two states? maybe add a log.DPanic here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WorkerOffline can directly turn on relay, to ensure the worker has no time sync without relay.
WorkerRelay can turn on relay just like WorkerBound can ToBound again. Maybe we'll change it in future.

@lance6716
Copy link
Collaborator Author

/cc @GMHDBJD

@@ -518,7 +518,7 @@ ErrSchedulerSourceOpRelayExist,[code=46023:class=scheduler:scope=internal:level=
ErrSchedulerLatchInUse,[code=46024:class=scheduler:scope=internal:level=low], "Message: when %s, resource %s is in use by other client, Workaround: Please try again later"
ErrSchedulerSourceCfgUpdate,[code=46025:class=scheduler:scope=internal:level=low], "Message: source can only update relay-log related parts for now"
ErrSchedulerWrongWorkerInput,[code=46026:class=scheduler:scope=internal:level=medium], "Message: require DM master to modify worker [%s] with source [%s], but currently the worker is bound to source [%s]"
ErrSchedulerCantTransferToRelayWorker,[code=46027:class=scheduler:scope=internal:level=medium], "Message: require DM worker to be bound to source [%s], but it has been started relay for source [%s]"
ErrSchedulerBoundDiffWithStartedRelay,[code=46027:class=scheduler:scope=internal:level=medium], "Message: require DM worker [%s] to be bound to source [%s], but it has been started relay for source [%s]"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Workaround? stop-relay for previous source and start for current source?

Comment on lines 1133 to 1138
for _, workerName := range workers {
worker := s.workers[workerName]
if relaySource := worker.RelaySourceID(); relaySource != "" && relaySource != source {
busyWorkers = append(busyWorkers, workerName)
busySources = append(busySources, relaySource)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about move to L1117

delete(s.relayWorkers[source], w)
for _, workerName := range workers {
delete(s.relayWorkers[source], workerName)
s.workers[workerName].StopRelay()
Copy link
Collaborator

@GMHDBJD GMHDBJD Oct 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to call tryBoundForWorker for this worker if it stopRelay and become free? or in later pr.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to wait DM worker DisableRelay then it can receive a new source bound 😂 much work to do, I'll leave it in future

// - remove the bound relationship in the scheduler.
// this func is called after the bound relationship removed from etcd.
func (s *Scheduler) updateStatusForUnbound(source string) {
w, ok := s.bounds[source]
if !ok {
return
}
w.ToFree()
if err := w.Unbound(); err != nil {
s.logger.DPanic("can updateStatusForUnbound", zap.Error(err))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s.logger.DPanic("can updateStatusForUnbound", zap.Error(err))
s.logger.DPanic("cannot updateStatusForUnbound", zap.Error(err))

@@ -270,7 +270,45 @@ function test_last_bound() {
echo "[$(date)] <<<<<< finish test_last_bound >>>>>>"
}

function test_exclusive_relay() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about add test for original issue.
e.g.

worker1 --relay-> source1
worker2 --relay-> source2
worker3 free
restart worker1,then worker3 --> source1
restart worker2, then worker1 --> source2
now worker1 should not error

Copy link
Collaborator Author

@lance6716 lance6716 Oct 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean worker1 should not be bound? source2 should keep unbound until worker2 has up?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

queshi.

Copy link
Collaborator

@GMHDBJD GMHDBJD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot ti-chi-bot added status/LGT2 Two reviewers already commented LGTM, ready for merge and removed status/LGT1 One reviewer already commented LGTM labels Oct 28, 2021
@lance6716
Copy link
Collaborator Author

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 93c4d59

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #2275.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
needs-cherry-pick-release-2.0 This PR should be cherry-picked to release-2.0. Remove this label after cherry-picked to release-2.0 needs-update-docs Should update docs after this PR is merged. Remove this label once the docs are updated needs-update-release-note This PR should be added into release notes. Remove this label once the release notes are updated size/XXL status/can-merge status/LGT2 Two reviewers already commented LGTM, ready for merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

mysql source worker %s has already started with source %s, but get a request with source %s
7 participants