
Support choosing nodes to schedule replicas #85

Merged
merged 8 commits into longhorn:master on Jun 27, 2018

Conversation

@JacieChao (Contributor) commented Jun 13, 2018

For now, this just adds a Node CRD for multiple-disk configuration and a replica scheduler that schedules replicas to a specified disk on a node.
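For context, a minimal sketch of what such a Node CRD type could look like in k8s/pkg/apis/longhorn/v1alpha1/types.go. Only AllowScheduling and the Disks map are visible in the diffs below; the other field names, JSON tags, and the DiskSpec contents are assumptions:

// assumes: metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type DiskSpec struct {
    Path            string `json:"path"`            // assumed field
    AllowScheduling bool   `json:"allowScheduling"` // assumed field
}

type NodeSpec struct {
    Name            string              `json:"name"`
    AllowScheduling bool                `json:"allowScheduling"`
    Disks           map[string]DiskSpec `json:"disks"`
}

type Node struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`
    Spec              NodeSpec `json:"spec"`
}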

@JacieChao changed the title from "Enable Multiple disk sche" to "Enable multiple disk scheduling" on Jun 13, 2018
@yasker (Member) commented Jun 13, 2018

@JacieChao

Please separate your commit into the following commits:

  1. Add CRD to https://github.com/rancher/longhorn-manager/blob/master/k8s/pkg/apis/longhorn/v1alpha1/types.go
  2. Generate the CRD using the code generator
  3. Add related fields to datastore
  4. Add related fields to API.
  5. Other logic.

Make sure the manager can be built after each commit.

You can split 5 into more commits if reasonable, or swap its order with 4.

@yasker (Member) commented Jun 13, 2018

Do you need the scheduler to be a controller? I think a function call from replica_controller should be enough. What do you think?

@JacieChao (Contributor, Author)

@yasker Sure, I think that's enough for now. At first I wanted to create a standalone controller to handle replica scheduling, but it's easier to call it from replica_controller. I will remove the unused code in the scheduler and split my commits.

api/model.go Outdated
Disks map[string]types.Disk `json:"disks"`
}

type DiskInput struct {

I think it's unnecessary to have separate APIs for adding/deleting a disk.

Considering the UI: it normally won't call the API every time a disk is added or removed; it's easier for the UI to call the backend once everything has been determined. It's also easier for the backend to store the change in one transaction. And since this is a rarely modified parameter, performance shouldn't be a problem. So it should be OK to just use UpdateNode to update all the fields of the node.

So much of the code below can be simplified.
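A hedged sketch of the single-call shape; the manager methods GetNode and UpdateNode used here are assumptions, the point is only that the complete Disks map is written in one update:

// sketch: one update replaces all user-editable node fields at once,
// instead of separate add-disk/remove-disk endpoints
func updateNode(m *manager.VolumeManager, name string, allowScheduling bool,
    disks map[string]types.DiskSpec) (*longhorn.Node, error) {
    node, err := m.GetNode(name) // assumed manager method
    if err != nil {
        return nil, err
    }
    node.Spec.AllowScheduling = allowScheduling
    node.Spec.Disks = disks // the whole map, persisted in one transaction
    return m.UpdateNode(node) // assumed manager method
}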

@@ -147,3 +147,22 @@ func (s *DataStore) DeleteEngineImageDaemonSet(name string) error {
}
return nil
}

func (s *DataStore) GetManagerNode() ([]string, error) {

Just use GetManagerNodeIPMap with the key as the node name.
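Something along these lines, assuming GetManagerNodeIPMap returns a map keyed by node name (the exact return type is an assumption):

ipMap, err := s.GetManagerNodeIPMap() // assumed map[string]string, node name -> IP
if err != nil {
    return nil, err
}
nodeNames := make([]string, 0, len(ipMap))
for name := range ipMap {
    nodeNames = append(nodeNames, name)
}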

return nil, err
}
if node == nil {
node, err = s.CreateDefaultNode(nodeName)

It's weird for GetNodeList to have the side effect of creating nonexistent nodes. You can have each manager create its own node if it doesn't exist when starting up, e.g. at https://github.com/rancher/longhorn-manager/blob/master/app/daemon.go#L100 .
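A minimal sketch of that startup path; GetNode and the currentNodeID variable are assumptions, CreateDefaultNode comes from the diff above:

// in app/daemon.go, during startup: register this manager's node if missing
node, err := ds.GetNode(currentNodeID) // assumed lookup by node name
if err != nil {
    return err
}
if node == nil {
    if _, err := ds.CreateDefaultNode(currentNodeID); err != nil {
        return err
    }
}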

}
if node == nil {
// create node for default path
return m.ds.CreateDefaultNode(name)

Same, don't write as a side-effect of reading.

ObjectMeta: metav1.ObjectMeta{
Name: name,
},
Spec: types.NodeSpec{

It's better to add a finalizer, so we can handle deletion (which we won't allow the user to do, but a node does go down sometimes (taking the manager with it) and we need to react to that).
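A minimal sketch of attaching a finalizer at creation time; the finalizer constant is hypothetical:

node := &longhorn.Node{
    ObjectMeta: metav1.ObjectMeta{
        Name:       name,
        Finalizers: []string{longhornFinalizerKey}, // hypothetical constant, e.g. "longhorn.rancher.io"
    },
    Spec: types.NodeSpec{
        // ... default spec as before ...
    },
}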

api/router.go Outdated
@@ -97,6 +97,17 @@ func NewRouter(s *Server) *mux.Router {
r.Methods("GET").Path("/v1/hosts").Handler(f(schemas, s.NodeList))
r.Methods("GET").Path("/v1/hosts/{id}").Handler(f(schemas, s.NodeGet))

r.Methods("GET").Path("/v1/nodes").Handler(f(schemas, s.MountNodeList))

The old /v1/hosts will be removed later, though we need some UI change to accommodate this. You can use the names NodeList and NodeGet; just rename the previous ones to HostList/HostGet (in a separate commit). MountNode sounds weird.
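In other words, roughly this wiring, with the old handlers renamed:

r.Methods("GET").Path("/v1/hosts").Handler(f(schemas, s.HostList))
r.Methods("GET").Path("/v1/hosts/{id}").Handler(f(schemas, s.HostGet))

r.Methods("GET").Path("/v1/nodes").Handler(f(schemas, s.NodeList))
r.Methods("GET").Path("/v1/nodes/{id}").Handler(f(schemas, s.NodeGet))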

}

func NewReplicaScheduler(
ds *datastore.DataStore) *ReplicaScheduler {

Um... this is a weird line break. Just merge it with the line above.
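For reference, the merged form is simply:

func NewReplicaScheduler(ds *datastore.DataStore) *ReplicaScheduler {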

@@ -270,6 +273,11 @@ func (rc *ReplicaController) syncReplica(key string) (err error) {
return err
}
}
// check whether the replica need to be scheduled
err = rc.scheduler.ScheduleReplica(replica)

I think there is a problem with this approach.

Not sure if you've observed it, but I think it can sometimes result in two replicas being scheduled on the same host while another host has none. Since multiple workers update the replicas simultaneously, several replicas may be scheduled at exactly the same time, and they can all choose the same node because, when scheduling starts, none of them can see the end result of the others.

It's more reliable to do the scheduling in the volume controller when we create the replica for the first time, since that happens in sequence for a single volume. https://github.com/rancher/longhorn-manager/blob/master/controller/volume_controller.go#L629

if err != nil {
return nil
}
rchecker := checker.NewReplicaChecker(nodeIPMap, replicas)

It looks unnecessary to create another package (checker); we only need a single function call.

Make it simple for now. We can expand it if necessary.


// TODO Need to add capacity.
// Just make sure replica of the same volume be scheduled to different nodes for now.
nodeMap := rchecker.ReplicasAffinity()

Use a better name, e.g. preferredNodes. nodeMap looks too similar to nodeIPMap, which can cause confusion.

@yasker (Member) commented Jun 13, 2018

Now I think we need a node controller and node status as well. See updated design.

@JacieChao (Contributor, Author)

@yasker NodeController is still a work in progress. I will add a new commit when I finish it.

@yasker (Member) left a review

@JacieChao

It looks good so far, though the scheduling code apparently lacks testing.

It's easy to test the scheduler by disabling the Kubernetes scheduler. You can make a commit that disables the anti-affinity rule and makes the replica controller error out if NodeID or the data path wasn't filled in. The commit can come after the scheduler one. It can help you spot bugs in the scheduler.

Then you can add a unit test case here to test your scheduling.

Then you should be able to add test cases to longhorn-tests to exercise the customized scheduling and make sure the anti-affinity scheduling works.

After that, please continue to work on the node controller and multiple-disk (random) scheduling.

@@ -630,6 +631,11 @@ func (vc *VolumeController) replenishReplicas(v *longhorn.Volume, rs map[string]
if err != nil {
return err
}
// check whether the replica need to be scheduled
err = vc.scheduler.ScheduleReplica(r)

Thinking about it, the scheduling should be done before the replica is created in the datastore. Once it's created, the replica controller will pick it up and try to start a replica pod with it, and that can happen before the scheduling result is written to the datastore. So the call should be put into createReplica() instead, replacing the DataPath assignment line.
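A minimal sketch of that ordering inside createReplica(); the exact signature, the buildReplica helper, and the datastore's CreateReplica return values are assumptions:

func (vc *VolumeController) createReplica(v *longhorn.Volume) (*longhorn.Replica, error) {
    replica := buildReplica(v) // hypothetical helper filling in everything except placement
    // schedule before persisting, so NodeID and DataPath are already set
    // by the time the replica controller first sees the object
    replica, err := vc.scheduler.ScheduleReplica(replica)
    if err != nil {
        return nil, err
    }
    return vc.ds.CreateReplica(replica)
}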

// if other replica has allocated to different nodes, then choose a random one
nodeID := ""
if len(preferredNodes) == 0 {
nodeID = rcs.getRandomNode(preferredNodes)

preferredNodes is empty here...
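For reference, a hedged sketch of the intended logic; nodeCandidates, standing in for the full set of schedulable nodes, is hypothetical:

nodeID := ""
if len(preferredNodes) == 0 {
    // no anti-affinity preference left, so pick from all schedulable nodes
    nodeID = rcs.getRandomNode(nodeCandidates)
} else {
    nodeID = rcs.getRandomNode(preferredNodes)
}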

api/model.go Outdated
client.Resource
Name string `json:"name"`
AllowScheduling bool `json:"allowScheduling"`
Disks map[string]types.DiskSpec `json:"disks"`

Define a new type Disk in the API to cover both DiskSpec and DiskStatus.
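A minimal sketch, assuming the two field sets can simply be embedded:

// api/model.go (sketch)
type Disk struct {
    types.DiskSpec
    types.DiskStatus
}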

@JacieChao (Contributor, Author)

@yasker I have added a unit test for replica_scheduler and updated the unit test cases for volume_controller and replica_controller. I will add test cases to longhorn-tests later.

@@ -371,24 +371,10 @@ func (rc *ReplicaController) CreatePodSpec(obj interface{}) (*v1.Pod, error) {
// will pin it down to the same host because we have data on it
if r.Spec.NodeID != "" {

r.Spec.NodeID cannot be "" after your scheduling change.

}
// error out if NodeID and DataPath wasn't filled in scheduler
if r.Spec.NodeID == "" || r.Spec.DataPath == "" {
return nil, fmt.Errorf("There has no avaible node for replica %v", r)

Wrong error message. Hitting this condition indicates a bug, so the message should say so:

fmt.Errorf("BUG: Node or datapath wasn't set for replica %v", r.Name)


Pull this functional change into another commit or a previous commit. It's kind of been smuggled into a commit that is supposed to add tests.

r, err := s.ScheduleReplica(r1)
assert.Nil(err)
assert.NotNil(r)
// assert could not scheduler to node2 and node3

Please also add a test for anti-affinity implementation of the scheduler.

}

func (rcs *ReplicaScheduler) ScheduleReplica(replica *longhorn.Replica) (*longhorn.Replica, error) {
// only called when replica is starting for the first time

If this function should only be called when NodeID == "", then error out if NodeID is not empty. It will help expose potential bugs.

}
}

func TestReplicaScheduler_ScheduleReplica(t *testing.T) {

Go style is CamelCase without underscores (_). Just TestReplicaScheduler should be fine for now.

@yasker (Member) left a review

The test cases are really nice now. Good job!

app/daemon.go Outdated
@@ -18,6 +18,7 @@ import (
"github.com/rancher/longhorn-manager/util"

longhorn "github.com/rancher/longhorn-manager/k8s/pkg/apis/longhorn/v1alpha1"
"os"

Put os along with fmt

Basically the import section looks like this:

<built-in> (os, fmt, etc.)
<3rd party> (logrus, etc.)
<current project> (longhorn-manager/xxx)
<manager k8s related> (longhorn-manager/k8s/xxx)

The last section exists because it's normally really long and can get ugly if put together with the <current project> section.
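A concrete sketch of that ordering for app/daemon.go; the exact import set and the third-party path are illustrative:

import (
    "fmt"
    "os"

    "github.com/sirupsen/logrus" // third-party; use the path vendored in the repo

    "github.com/rancher/longhorn-manager/util"

    longhorn "github.com/rancher/longhorn-manager/k8s/pkg/apis/longhorn/v1alpha1"
)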

}
replica.Spec.DataPath = dataPath + "/replicas/" + replica.Spec.VolumeName + "-" + util.RandomID()
} else {
return nil, fmt.Errorf("BUG: Replica %v has been scheduled to node %v", replica.Name, replica.Spec.NodeID)

OK, it's a really long if with extra indents.

Just do this at the beginning of the function:

if replica.Spec.NodeID != "" {
    return nil, fmt.Errorf("BUG: Replica %v has been scheduled to node %v", replica.Name, replica.Spec.NodeID)
}
...

@yasker (Member) left a review

Other than the coding style issues, LGTM.

After updating the code, please continue to work on the longhorn-tests for the phase 1.

@@ -21,6 +21,7 @@ import (
lhinformerfactory "github.com/rancher/longhorn-manager/k8s/pkg/client/informers/externalversions"

. "gopkg.in/check.v1"
"k8s.io/api/core/v1"

This k8s.io/api/core/v1 import should be put along with the other k8s.io imports above.

package scheduler

import (
"fmt"

Follow the import pattern of the other files:

"fmt"
"testing"
<space>
"k8s.io/xxx"
...
<space>
"github.com/rancher/longhorn-manager/xxxx"
...
<space>
"github.com/rancher/longhorn-manager/k8s/xxx"
...

return err
}
if node == nil {
logrus.Debugf("Longhorn node %v does not exist, regenerate a default one", key)

I think there is some confusion about how to handle node deletion. I've updated the design doc for that. Please check the Node creation and deletion section.

for _, pod := range managerPods {
err = nc.syncStatusWithPod(pod, node)
if err != nil {
return err

As I said in the design doc, don't retry if a conflict happens; assume the other managers are updating it. I am afraid the requeue could cause a storm of update conflicts in a larger system.
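A hedged sketch of that behavior, using the standard apimachinery conflict check (apierrors is k8s.io/apimachinery/pkg/api/errors); the exact control flow is an assumption:

for _, pod := range managerPods {
    if err := nc.syncStatusWithPod(pod, node); err != nil {
        if apierrors.IsConflict(err) {
            // another manager updated the node first; skip instead of
            // requeueing, to avoid a storm of conflicting updates
            logrus.Debugf("conflict while updating status of node %v, skipping", node.Name)
            continue
        }
        return err
    }
}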

listKind: NodeList
plural: nodes
shortNames:
- lhnode
@yasker (Member) commented Jun 26, 2018

Use lhn instead.

@JacieChao (Contributor, Author)

@yasker
Tested with longhorn-tests. Here's the result:

+ flake8 .
+ py.test -v .
============================= test session starts ==============================
platform linux2 -- Python 2.7.14, pytest-2.9.2, py-1.5.3, pluggy-0.3.1 -- /usr/local/bin/python
cachedir: .cache
rootdir: /integration/tests, inifile:
collecting ... collected 19 items
test_basic.py::test_hosts_and_settings PASSED
test_basic.py::test_volume_basic PASSED
test_basic.py::test_volume_iscsi_basic PASSED
test_basic.py::test_snapshot PASSED
test_basic.py::test_backup PASSED
test_basic.py::test_volume_multinode PASSED
test_basic.py::test_replica_scheduler PASSED
test_csi.py::test_csi_volume_mount SKIPPED
test_csi.py::test_csi_volume_io SKIPPED
test_driver.py::test_volume_mount PASSED
test_driver.py::test_volume_io PASSED
test_engine_upgrade.py::test_engine_image PASSED
test_engine_upgrade.py::test_engine_offline_upgrade PASSED
test_engine_upgrade.py::test_engine_live_upgrade PASSED
test_engine_upgrade.py::test_engine_image_incompatible PASSED
test_engine_upgrade.py::test_engine_live_upgrade_rollback PASSED
test_ha.py::test_ha_simple_recovery PASSED
test_ha.py::test_ha_salvage PASSED
test_recurring_job.py::test_recurring_job SKIPPED
=================== 16 passed, 3 skipped in 1492.87 seconds ====================

I tried tearing down one longhorn-manager to check that the Node CRD status is set to down and that the node is no longer available for scheduling replicas.

I have updated README.md to note that the Node CRD must be cleaned up before tearing down the Longhorn system, and tested the cleanup process.

@yasker merged commit f3c7858 into longhorn:master on Jun 27, 2018
@yasker (Member) commented Jun 27, 2018

LGTM. Merged, thanks.

@yasker changed the title from "Enable multiple disk scheduling" to "Support choosing nodes to schedule replicas" on Jun 27, 2018
@yasker (Member) commented Jun 27, 2018

@JacieChao And next time, remember to link your PR to the original issue.
