
Support choosing nodes to schedule replicas #85

Merged
merged 8 commits into longhorn:master on Jun 27, 2018

Conversation

@JacieChao (Contributor) commented Jun 13, 2018

For now, this just adds a Node CRD for multiple-disk configuration and a replica scheduler that schedules replicas to a specified disk on a node.
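For context, a minimal sketch of what such a Node CRD type could look like in k8s/pkg/apis/longhorn/v1alpha1/types.go. Only AllowScheduling and the Disks map are visible in the diffs below; the other field names, JSON tags, and the DiskSpec contents are assumptions:

// assumes: metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type DiskSpec struct {
    Path            string `json:"path"`            // assumed field
    AllowScheduling bool   `json:"allowScheduling"` // assumed field
}

type NodeSpec struct {
    Name            string              `json:"name"`
    AllowScheduling bool                `json:"allowScheduling"`
    Disks           map[string]DiskSpec `json:"disks"`
}

type Node struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`
    Spec              NodeSpec `json:"spec"`
}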

@JacieChao changed the title from "Enable Multiple disk sche" to "Enable multiple disk scheduling" on Jun 13, 2018
@yasker (Member) commented Jun 13, 2018

@JacieChao

Please separate your commit into the following commits:

  1. Add CRD to https://github.com/rancher/longhorn-manager/blob/master/k8s/pkg/apis/longhorn/v1alpha1/types.go
  2. Generate the CRD using the code generator
  3. Add related fields to datastore
  4. Add related fields to API.
  5. Other logic.

Make sure the manager can be built after each commit.

You can split 5 into more commits if reasonable, or swap its order with 4.

@yasker (Member) commented Jun 13, 2018

Do you need the scheduler to be a controller? I think a function call from replica_controller should be enough. What do you think?

@JacieChao (Contributor, Author)

@yasker Sure, I think that's enough for now. At first I wanted to create a standalone controller to handle replica scheduling, but it's easier to call it from replica_controller. I will remove the unused code in the scheduler and split my commits.

api/model.go Outdated
Disks map[string]types.Disk `json:"disks"`
}

type DiskInput struct {

I think it's unnecessary to have separate APIs for adding/deleting a disk.

Considering the UI: it normally won't call the API every time a disk is added or removed; it's easier for the UI to call the backend once everything has been determined. It's also easier for the backend to store the change in one transaction. And since this is a rarely modified parameter, performance shouldn't be a problem. So it should be OK to just use UpdateNode to update all the fields of the node.

So much of the code below can be simplified.
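A hedged sketch of the single-call shape; the manager methods GetNode and UpdateNode used here are assumptions, the point is only that the complete Disks map is written in one update:

// sketch: one update replaces all user-editable node fields at once,
// instead of separate add-disk/remove-disk endpoints
func updateNode(m *manager.VolumeManager, name string, allowScheduling bool,
    disks map[string]types.DiskSpec) (*longhorn.Node, error) {
    node, err := m.GetNode(name) // assumed manager method
    if err != nil {
        return nil, err
    }
    node.Spec.AllowScheduling = allowScheduling
    node.Spec.Disks = disks // the whole map, persisted in one transaction
    return m.UpdateNode(node) // assumed manager method
}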

@@ -147,3 +147,22 @@ func (s *DataStore) DeleteEngineImageDaemonSet(name string) error {
}
return nil
}

func (s *DataStore) GetManagerNode() ([]string, error) {

Just use GetManagerNodeIPMap with the key as the node name.
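Something along these lines, assuming GetManagerNodeIPMap returns a map keyed by node name (the exact return type is an assumption):

ipMap, err := s.GetManagerNodeIPMap() // assumed map[string]string, node name -> IP
if err != nil {
    return nil, err
}
nodeNames := make([]string, 0, len(ipMap))
for name := range ipMap {
    nodeNames = append(nodeNames, name)
}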

return nil, err
}
if node == nil {
node, err = s.CreateDefaultNode(nodeName)

It's weird for GetNodeList to have the side effect of creating nonexistent nodes. You can have each manager create its own node if it doesn't exist when starting up, e.g. at https://github.com/rancher/longhorn-manager/blob/master/app/daemon.go#L100 .
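A minimal sketch of that startup path; GetNode and the currentNodeID variable are assumptions, CreateDefaultNode comes from the diff above:

// in app/daemon.go, during startup: register this manager's node if missing
node, err := ds.GetNode(currentNodeID) // assumed lookup by node name
if err != nil {
    return err
}
if node == nil {
    if _, err := ds.CreateDefaultNode(currentNodeID); err != nil {
        return err
    }
}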

}
if node == nil {
// create node for default path
return m.ds.CreateDefaultNode(name)

Same, don't write as a side-effect of reading.

ObjectMeta: metav1.ObjectMeta{
Name: name,
},
Spec: types.NodeSpec{

It's better to add a finalizer, so we can handle deletion (which we won't allow the user to do, but a node does go down sometimes (taking the manager with it) and we need to react to that).
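A minimal sketch of attaching a finalizer at creation time; the finalizer constant is hypothetical:

node := &longhorn.Node{
    ObjectMeta: metav1.ObjectMeta{
        Name:       name,
        Finalizers: []string{longhornFinalizerKey}, // hypothetical constant, e.g. "longhorn.rancher.io"
    },
    Spec: types.NodeSpec{
        // ... default spec as before ...
    },
}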

api/router.go Outdated
@@ -97,6 +97,17 @@ func NewRouter(s *Server) *mux.Router {
r.Methods("GET").Path("/v1/hosts").Handler(f(schemas, s.NodeList))
r.Methods("GET").Path("/v1/hosts/{id}").Handler(f(schemas, s.NodeGet))

r.Methods("GET").Path("/v1/nodes").Handler(f(schemas, s.MountNodeList))

The old /v1/hosts will be removed later, though we need some UI change to accommodate this. You can use the names NodeList and NodeGet; just rename the previous ones to HostList/HostGet (in a separate commit). MountNode sounds weird.
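In other words, roughly this wiring, with the old handlers renamed:

r.Methods("GET").Path("/v1/hosts").Handler(f(schemas, s.HostList))
r.Methods("GET").Path("/v1/hosts/{id}").Handler(f(schemas, s.HostGet))

r.Methods("GET").Path("/v1/nodes").Handler(f(schemas, s.NodeList))
r.Methods("GET").Path("/v1/nodes/{id}").Handler(f(schemas, s.NodeGet))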

}

func NewReplicaScheduler(
ds *datastore.DataStore) *ReplicaScheduler {

Um... this is a weird line break. Just merge it with the line above.
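For reference, the merged form is simply:

func NewReplicaScheduler(ds *datastore.DataStore) *ReplicaScheduler {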

@@ -270,6 +273,11 @@ func (rc *ReplicaController) syncReplica(key string) (err error) {
return err
}
}
// check whether the replica need to be scheduled
err = rc.scheduler.ScheduleReplica(replica)

I think there is a problem with this approach.

Not sure if you've observed it, but I think it can sometimes result in two replicas being scheduled on the same host while another host has none. Since multiple workers update the replicas simultaneously, several replicas may be scheduled at exactly the same time, and they can all choose the same node because, when scheduling starts, none of them can see the end result of the others.

It's more reliable to do the scheduling in the volume controller when we create the replica for the first time, since that happens in sequence for a single volume. https://github.com/rancher/longhorn-manager/blob/master/controller/volume_controller.go#L629

if err != nil {
return nil
}
rchecker := checker.NewReplicaChecker(nodeIPMap, replicas)

It looks unnecessary to create another package (checker); we only need a single function call.

Make it simple for now. We can expand it if necessary.


// TODO Need to add capacity.
// Just make sure replica of the same volume be scheduled to different nodes for now.
nodeMap := rchecker.ReplicasAffinity()

Use a better name, e.g. preferredNodes. nodeMap looks too similar to nodeIPMap, which can cause confusion.

@yasker (Member) commented Jun 13, 2018

Now I think we need a node controller and node status as well. See updated design.

@JacieChao (Contributor, Author)

@yasker NodeController is still a work in progress. I will add a new commit when I finish it.

@yasker (Member) left a review

@JacieChao

It looks good so far, though the scheduling code apparently lacks testing.

It's easy to test the scheduler by disabling the Kubernetes scheduler. You can make a commit that disables the anti-affinity rule and makes the replica controller error out if NodeID or the data path wasn't filled in. The commit can come after the scheduler one. It can help you spot bugs in the scheduler.

Then you can add a unit test case here to test your scheduling.

Then you should be able to add test cases to longhorn-tests to exercise the customized scheduling and make sure the anti-affinity scheduling works.

After that, please continue to work on the node controller and multiple-disk (random) scheduling.

@@ -630,6 +631,11 @@ func (vc *VolumeController) replenishReplicas(v *longhorn.Volume, rs map[string]
if err != nil {
return err
}
// check whether the replica need to be scheduled
err = vc.scheduler.ScheduleReplica(r)

Thinking about it, the scheduling should be done before the replica is created in the datastore. Once it's created, the replica controller will pick it up and try to start a replica pod with it, and that can happen before the scheduling result is written to the datastore. So the call should be put into createReplica() instead, replacing the DataPath assignment line.
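A minimal sketch of that ordering inside createReplica(); the exact signature, the buildReplica helper, and the datastore's CreateReplica return values are assumptions:

func (vc *VolumeController) createReplica(v *longhorn.Volume) (*longhorn.Replica, error) {
    replica := buildReplica(v) // hypothetical helper filling in everything except placement
    // schedule before persisting, so NodeID and DataPath are already set
    // by the time the replica controller first sees the object
    replica, err := vc.scheduler.ScheduleReplica(replica)
    if err != nil {
        return nil, err
    }
    return vc.ds.CreateReplica(replica)
}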

// if other replica has allocated to different nodes, then choose a random one
nodeID := ""
if len(preferredNodes) == 0 {
nodeID = rcs.getRandomNode(preferredNodes)

preferredNodes is empty here...
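For reference, a hedged sketch of the intended logic; nodeCandidates, standing in for the full set of schedulable nodes, is hypothetical:

nodeID := ""
if len(preferredNodes) == 0 {
    // no anti-affinity preference left, so pick from all schedulable nodes
    nodeID = rcs.getRandomNode(nodeCandidates)
} else {
    nodeID = rcs.getRandomNode(preferredNodes)
}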

api/model.go Outdated
client.Resource
Name string `json:"name"`
AllowScheduling bool `json:"allowScheduling"`
Disks map[string]types.DiskSpec `json:"disks"`

Define a new type Disk in the API to cover both DiskSpec and DiskStatus.
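A minimal sketch, assuming the two field sets can simply be embedded:

// api/model.go (sketch)
type Disk struct {
    types.DiskSpec
    types.DiskStatus
}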

@JacieChao (Contributor, Author)

@yasker I have added a unit test for replica_scheduler and updated the unit test cases for volume_controller and replica_controller. I will add test cases to longhorn-tests later.

@@ -371,24 +371,10 @@ func (rc *ReplicaController) CreatePodSpec(obj interface{}) (*v1.Pod, error) {
// will pin it down to the same host because we have data on it
if r.Spec.NodeID != "" {

r.Spec.NodeID cannot be "" after your scheduling change.

}
// error out if NodeID and DataPath wasn't filled in scheduler
if r.Spec.NodeID == "" || r.Spec.DataPath == "" {
return nil, fmt.Errorf("There has no avaible node for replica %v", r)

Wrong error message. Hitting this condition indicates a bug, so the message should say so:

fmt.Errorf("BUG: Node or datapath wasn't set for replica %v", r.Name)


Pull this functional change into another commit or a previous commit. It's kind of been smuggled into a commit that is supposed to add tests.

r, err := s.ScheduleReplica(r1)
assert.Nil(err)
assert.NotNil(r)
// assert could not scheduler to node2 and node3

Please also add a test for anti-affinity implementation of the scheduler.

}

func (rcs *ReplicaScheduler) ScheduleReplica(replica *longhorn.Replica) (*longhorn.Replica, error) {
// only called when replica is starting for the first time

If this function should only be called when NodeID == "", then error out if NodeID is not empty. It will help expose potential bugs.

}
}

func TestReplicaScheduler_ScheduleReplica(t *testing.T) {

Go style is CamelCase without underscores (_). Just TestReplicaScheduler should be fine for now.

@yasker (Member) left a review

The test cases are really nice now. Good job!

app/daemon.go Outdated
@@ -18,6 +18,7 @@ import (
"github.com/rancher/longhorn-manager/util"

longhorn "github.com/rancher/longhorn-manager/k8s/pkg/apis/longhorn/v1alpha1"
"os"

Put os along with fmt

Basically the import section looks like this:

<built-in> (os, fmt, etc.)
<3rd party> (logrus, etc.)
<current project> (longhorn-manager/xxx)
<manager k8s related> (longhorn-manager/k8s/xxx)

The last section exists because it's normally really long and can get ugly if put together with the <current project> section.
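A concrete sketch of that ordering for app/daemon.go; the exact import set and the third-party path are illustrative:

import (
    "fmt"
    "os"

    "github.com/sirupsen/logrus" // third-party; use the path vendored in the repo

    "github.com/rancher/longhorn-manager/util"

    longhorn "github.com/rancher/longhorn-manager/k8s/pkg/apis/longhorn/v1alpha1"
)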

}
replica.Spec.DataPath = dataPath + "/replicas/" + replica.Spec.VolumeName + "-" + util.RandomID()
} else {
return nil, fmt.Errorf("BUG: Replica %v has been scheduled to node %v", replica.Name, replica.Spec.NodeID)

OK, it's a really long if with extra indents.

Just do this at the beginning of the function:

if replica.Spec.NodeID != "" {
    return nil, fmt.Errorf("BUG: Replica %v has been scheduled to node %v", replica.Name, replica.Spec.NodeID)
}
...

@yasker (Member) left a review

Other than the coding style issues, LGTM.

After updating the code, please continue to work on the longhorn-tests for the phase 1.

@@ -21,6 +21,7 @@ import (
lhinformerfactory "github.com/rancher/longhorn-manager/k8s/pkg/client/informers/externalversions"

. "gopkg.in/check.v1"
"k8s.io/api/core/v1"

This k8s.io/api/core/v1 import should be put along with the other k8s.io imports above.

package scheduler

import (
"fmt"

Follow the import pattern of the other files:

"fmt"
"testing"
<space>
"k8s.io/xxx"
...
<space>
"github.com/rancher/longhorn-manager/xxxx"
...
<space>
"github.com/rancher/longhorn-manager/k8s/xxx"
...

return err
}
if node == nil {
logrus.Debugf("Longhorn node %v does not exist, regenerate a default one", key)

I think there is some confusion about how to handle node deletion. I've updated the design doc for that. Please check the Node creation and deletion section.

for _, pod := range managerPods {
err = nc.syncStatusWithPod(pod, node)
if err != nil {
return err

As I said in the design doc, don't retry if a conflict happens; assume the other managers are updating it. I am afraid the requeue could cause a storm of update conflicts in a larger system.
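A hedged sketch of that behavior, using the standard apimachinery conflict check (apierrors is k8s.io/apimachinery/pkg/api/errors); the exact control flow is an assumption:

for _, pod := range managerPods {
    if err := nc.syncStatusWithPod(pod, node); err != nil {
        if apierrors.IsConflict(err) {
            // another manager updated the node first; skip instead of
            // requeueing, to avoid a storm of conflicting updates
            logrus.Debugf("conflict while updating status of node %v, skipping", node.Name)
            continue
        }
        return err
    }
}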

listKind: NodeList
plural: nodes
shortNames:
- lhnode
@yasker (Member) commented Jun 26, 2018

Use lhn instead.

@JacieChao (Contributor, Author)

@yasker
Tested with longhorn-tests. Here's the result:

+ flake8 .
+ py.test -v .
============================= test session starts ==============================
platform linux2 -- Python 2.7.14, pytest-2.9.2, py-1.5.3, pluggy-0.3.1 -- /usr/local/bin/python
cachedir: .cache
rootdir: /integration/tests, inifile:
collecting ... collected 19 items
test_basic.py::test_hosts_and_settings PASSED
test_basic.py::test_volume_basic PASSED
test_basic.py::test_volume_iscsi_basic PASSED
test_basic.py::test_snapshot PASSED
test_basic.py::test_backup PASSED
test_basic.py::test_volume_multinode PASSED
test_basic.py::test_replica_scheduler PASSED
test_csi.py::test_csi_volume_mount SKIPPED
test_csi.py::test_csi_volume_io SKIPPED
test_driver.py::test_volume_mount PASSED
test_driver.py::test_volume_io PASSED
test_engine_upgrade.py::test_engine_image PASSED
test_engine_upgrade.py::test_engine_offline_upgrade PASSED
test_engine_upgrade.py::test_engine_live_upgrade PASSED
test_engine_upgrade.py::test_engine_image_incompatible PASSED
test_engine_upgrade.py::test_engine_live_upgrade_rollback PASSED
test_ha.py::test_ha_simple_recovery PASSED
test_ha.py::test_ha_salvage PASSED
test_recurring_job.py::test_recurring_job SKIPPED
=================== 16 passed, 3 skipped in 1492.87 seconds ====================

I tried tearing down one longhorn-manager to check that the Node CRD status is set to down and that the node is no longer available for scheduling replicas.

I have updated README.md to note that the Node CRD must be cleaned up before tearing down the Longhorn system, and tested the cleanup process.

@yasker merged commit f3c7858 into longhorn:master on Jun 27, 2018
@yasker (Member) commented Jun 27, 2018

LGTM. Merged, thanks.

@yasker changed the title from "Enable multiple disk scheduling" to "Support choosing nodes to schedule replicas" on Jun 27, 2018
@yasker (Member) commented Jun 27, 2018

@JacieChao And next time, remember to link your PR to the original issue.
