Support choosing nodes to schedule replicas #85
Conversation
Please separate your commit into the following commits:
Make sure the manager can be built after each commit. You can split 5 into more commits if reasonable, or swap its position with 4.
Do you need the scheduler to be a controller? I think a function call from replica_controller should be enough. What do you think?
@yasker Sure, I think it's enough for now. At first I wanted to create a controller to handle replica scheduling alone, but I think it's easier to call it from replica_controller. I will remove the unused code in the scheduler and separate my commits.
Force-pushed from 1f7270c to f8d84fc.
api/model.go (outdated)
Disks map[string]types.Disk `json:"disks"`
}

type DiskInput struct {
I think it's unnecessary to separate the API for adding/deleting disks.
In the UI, we're normally not going to call the API every time a disk is added or removed; it's easy for the UI to call the backend once the changes have been determined, and it's easier for the backend to store them in one transaction. It's also a rarely modified parameter, so performance shouldn't be a problem. So it should be OK to just use UpdateNode to perform the update on all the fields of the node.
Much of the code below can then be simplified.
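For illustration, a single-update input could carry all the mutable node fields at once; a minimal sketch, with field names taken from the API model shown later in this thread:

type NodeInput struct {
	AllowScheduling bool                      `json:"allowScheduling"`
	Disks           map[string]types.DiskSpec `json:"disks"`
}

UpdateNode would then replace the whole disk map in one transaction, rather than exposing separate add/delete endpoints.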
datastore/kubernetes.go (outdated)
@@ -147,3 +147,22 @@ func (s *DataStore) DeleteEngineImageDaemonSet(name string) error {
}
return nil
}

func (s *DataStore) GetManagerNode() ([]string, error) {
Just use GetManagerNodeIPMap with the key as the node name.
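A sketch of what the call site could then look like (GetManagerNodeIPMap returning a map keyed by node name is assumed from the suggestion above; ds is the datastore handle):

nodeIPMap, err := ds.GetManagerNodeIPMap()
if err != nil {
	return err
}
for nodeName, ip := range nodeIPMap {
	logrus.Debugf("manager on node %v is reachable at %v", nodeName, ip)
}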
datastore/longhorn.go (outdated)
return nil, err
}
if node == nil {
node, err = s.CreateDefaultNode(nodeName)
It's weird for GetNodeList to have the side effect of creating nonexistent nodes. You can have each manager create its own node if it doesn't exist when starting up, e.g. at https://github.com/rancher/longhorn-manager/blob/master/app/daemon.go#L100 .
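A minimal sketch of that startup path (assuming GetNode returns nil, nil when the node doesn't exist, as the diff above implies; currentNodeID is whatever identifies this manager's node):

node, err := ds.GetNode(currentNodeID)
if err != nil {
	return err
}
if node == nil {
	if _, err := ds.CreateDefaultNode(currentNodeID); err != nil {
		return err
	}
}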
manager/volume.go (outdated)
}
if node == nil {
// create node for default path
return m.ds.CreateDefaultNode(name)
Same as above: don't write as a side effect of reading.
ObjectMeta: metav1.ObjectMeta{
Name: name,
},
Spec: types.NodeSpec{
It's better to add a finalizer so we can handle deletion. We won't allow the user to delete a node, but a node does go down sometimes (taking the manager with it), and we need to react to that.
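A sketch of what that could look like at creation time (the finalizer key name here is hypothetical):

ObjectMeta: metav1.ObjectMeta{
	Name: name,
	// hypothetical key; keeps the object around until the manager
	// reacts to the deletion and removes the finalizer
	Finalizers: []string{"longhorn.rancher.io"},
},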
api/router.go (outdated)
@@ -97,6 +97,17 @@ func NewRouter(s *Server) *mux.Router {
r.Methods("GET").Path("/v1/hosts").Handler(f(schemas, s.NodeList))
r.Methods("GET").Path("/v1/hosts/{id}").Handler(f(schemas, s.NodeGet))

r.Methods("GET").Path("/v1/nodes").Handler(f(schemas, s.MountNodeList))
The old /v1/hosts will be removed later, though we need some UI changes to accommodate this. You can use the names NodeList and NodeGet; just rename the previous ones to HostList/HostGet (in a separate commit). MountNode sounds weird.
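After the rename, the routes might read like this (a sketch following the router style above; HostList/HostGet are the renamed legacy handlers):

r.Methods("GET").Path("/v1/hosts").Handler(f(schemas, s.HostList))
r.Methods("GET").Path("/v1/hosts/{id}").Handler(f(schemas, s.HostGet))
r.Methods("GET").Path("/v1/nodes").Handler(f(schemas, s.NodeList))
r.Methods("GET").Path("/v1/nodes/{id}").Handler(f(schemas, s.NodeGet))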
scheduler/replica_scheduler.go (outdated)
}

func NewReplicaScheduler(
ds *datastore.DataStore) *ReplicaScheduler {
Um... it's a weird line break here. Just merge it with the line above.
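Merged, the constructor would read:

func NewReplicaScheduler(ds *datastore.DataStore) *ReplicaScheduler {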
controller/replica_controller.go (outdated)
@@ -270,6 +273,11 @@ func (rc *ReplicaController) syncReplica(key string) (err error) {
return err
}
}
// check whether the replica need to be scheduled
err = rc.scheduler.ScheduleReplica(replica)
I think there is a problem with this approach.
Not sure if you observed it, but it may sometimes result in two replicas being scheduled on the same host while another host has none. Since multiple workers update replicas simultaneously, several replicas may be scheduled at exactly the same time and all choose the same node, because when scheduling starts none of them can see the others' results.
It's more reliable to do scheduling in the volume controller when we create the replica for the first time; that's sequential for a single volume. https://github.com/rancher/longhorn-manager/blob/master/controller/volume_controller.go#L629
scheduler/replica_scheduler.go (outdated)
if err != nil {
return nil
}
rchecker := checker.NewReplicaChecker(nodeIPMap, replicas)
It looks unnecessary to create another package checker; it's only one function call that we need.
Keep it simple for now. We can expand it if necessary.
scheduler/replica_scheduler.go (outdated)

// TODO Need to add capacity.
// Just make sure replica of the same volume be scheduled to different nodes for now.
nodeMap := rchecker.ReplicasAffinity()
Use a better name, e.g. preferredNodes. It looks too similar to nodeIPMap, which can cause confusion.
Now I think we need a node controller and node status as well. See the updated design.
Force-pushed from e77c1dc to f860650.
@yasker NodeController is still a work in progress. I will add a new commit when I finish it.
It looks good so far, though the scheduling code apparently lacks testing.
It's easy to test the scheduler by disabling the Kubernetes scheduler. You can make a commit that disables the anti-affinity rule and makes the replica controller error out if NodeID or the data path wasn't filled in. That commit can come after the scheduler one; it will help you spot bugs in the scheduler.
Then you can add a unit test case here to test your scheduling.
Then you should be able to add test cases to longhorn-tests to test the customized scheduling and make sure the anti-affinity scheduling works.
After that, please continue to work on the node controller and multiple-disk (random) scheduling.
controller/volume_controller.go (outdated)
@@ -630,6 +631,11 @@ func (vc *VolumeController) replenishReplicas(v *longhorn.Volume, rs map[string]
if err != nil {
return err
}
// check whether the replica need to be scheduled
err = vc.scheduler.ScheduleReplica(r)
Thinking about it, the scheduling should be done before the replica is created in the datastore. Once it's created, the replica controller will pick it up and try to start a replica pod with it, which can happen before the scheduling result is written to the datastore. So it should be put into createReplica() instead, replacing the DataPath assignment line.
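A sketch of the suggested ordering inside createReplica() (method names are assumed from the discussion; ScheduleReplica returning the scheduled replica matches the scheduler code shown later):

// schedule before persisting, so the replica controller never
// picks up a replica with an empty NodeID or DataPath
r, err := vc.scheduler.ScheduleReplica(r)
if err != nil {
	return err
}
if _, err := vc.ds.CreateReplica(r); err != nil {
	return err
}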
scheduler/replica_scheduler.go (outdated)
// if other replica has allocated to different nodes, then choose a random one
nodeID := ""
if len(preferredNodes) == 0 {
nodeID = rcs.getRandomNode(preferredNodes)
preferredNodes is empty here...
api/model.go (outdated)
client.Resource
Name string `json:"name"`
AllowScheduling bool `json:"allowScheduling"`
Disks map[string]types.DiskSpec `json:"disks"`
Define a new type Disk in the API to cover both DiskSpec and DiskStatus.
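One possible shape, as a sketch (whether to embed or flatten the fields is an open design choice):

// API-facing Disk combining the CRD's spec and status
type Disk struct {
	types.DiskSpec
	types.DiskStatus
}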
Force-pushed from 8a72a36 to 51fc723.
@yasker I have added unit tests for replica_scheduler and updated the unit test cases for volume_controller and replica_controller. I will add test cases to longhorn-tests later.
controller/replica_controller.go (outdated)
@@ -371,24 +371,10 @@ func (rc *ReplicaController) CreatePodSpec(obj interface{}) (*v1.Pod, error) {
// will pin it down to the same host because we have data on it
if r.Spec.NodeID != "" {
r.Spec.NodeID cannot be "" after your scheduling change.
controller/replica_controller.go (outdated)
}
// error out if NodeID and DataPath wasn't filled in scheduler
if r.Spec.NodeID == "" || r.Spec.DataPath == "" {
return nil, fmt.Errorf("There has no avaible node for replica %v", r)
Wrong error message. It's a bug.
fmt.Errorf("BUG: Node or datapath wasn't set for replica %v", r.Name)
Pull this functional change into another commit or a previous commit. It's kind of been smuggled into a commit that's supposed to add tests.
scheduler/replica_scheduler_test.go (outdated)
r, err := s.ScheduleReplica(r1)
assert.Nil(err)
assert.NotNil(r)
// assert could not scheduler to node2 and node3
Please also add a test for the anti-affinity implementation of the scheduler.
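A hedged sketch of such a test, in the style of the snippet above (newReplicaForVolume is a hypothetical helper that builds a second replica of the same volume):

r2 := newReplicaForVolume(v1) // hypothetical helper
r2, err = s.ScheduleReplica(r2)
assert.Nil(err)
// replicas of the same volume must land on different nodes
assert.NotEqual(r1.Spec.NodeID, r2.Spec.NodeID)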
}

func (rcs *ReplicaScheduler) ScheduleReplica(replica *longhorn.Replica) (*longhorn.Replica, error) {
// only called when replica is starting for the first time
If this function should only be called when NodeID == "", then error out if NodeID is not empty. It will help expose potential bugs.
scheduler/replica_scheduler_test.go (outdated)
}
}

func TestReplicaScheduler_ScheduleReplica(t *testing.T) {
Go style is CamelCase without underscores (_). Just TestReplicaScheduler should be fine for now.
The test cases are really nice now. Good job!
app/daemon.go (outdated)
@@ -18,6 +18,7 @@ import (
"github.com/rancher/longhorn-manager/util"

longhorn "github.com/rancher/longhorn-manager/k8s/pkg/apis/longhorn/v1alpha1"
"os"
Put os along with fmt.
Basically, the import section looks like this:
<built-in> (os, fmt, etc.)
<3rd party> (logrus, etc.)
<current project> (longhorn-manager/xxx)
<manager k8s related> (longhorn-manager/k8s/xxx)
The last section exists because it's normally really long and can become ugly if put along with the <current project> section.
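Concretely, the daemon's import block might look like this (package paths are illustrative):

import (
	"fmt"
	"os"

	"github.com/sirupsen/logrus"

	"github.com/rancher/longhorn-manager/util"

	longhorn "github.com/rancher/longhorn-manager/k8s/pkg/apis/longhorn/v1alpha1"
)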
scheduler/replica_scheduler.go (outdated)
}
replica.Spec.DataPath = dataPath + "/replicas/" + replica.Spec.VolumeName + "-" + util.RandomID()
} else {
return nil, fmt.Errorf("BUG: Replica %v has been scheduled to node %v", replica.Name, replica.Spec.NodeID)
OK, it's a really long if with extra indentation.
Just do this at the beginning of the function:
if replica.Spec.NodeID != "" {
return nil, fmt.Errorf("BUG: Replica %v has been scheduled to node %v", replica.Name, replica.Spec.NodeID)
}
...
Force-pushed from bf8c807 to de8cfcd.
Other than the coding style issues, LGTM.
After updating the code, please continue to work on the longhorn-tests for phase 1.
controller/volume_controller_test.go (outdated)
@@ -21,6 +21,7 @@ import (
lhinformerfactory "github.com/rancher/longhorn-manager/k8s/pkg/client/informers/externalversions"

. "gopkg.in/check.v1"
"k8s.io/api/core/v1"
This k8s.io/api/core/v1 should be put along with the other k8s.io imports above.
package scheduler

import (
"fmt"
Follow the pattern of other files for import:
"fmt"
"testing"
<space>
"k8s.io/xxx"
...
<space>
"github.com/rancher/longhorn-manager/xxxx"
...
<space>
"github.com/rancher/longhorn-manager/k8s/xxx"
...
Force-pushed from 756106f to 69ab804.
Force-pushed from 0927fe1 to 348f4c7.
controller/node_controller.go (outdated)
return err
}
if node == nil {
logrus.Debugf("Longhorn node %v does not exist, regenerate a default one", key)
I think there is some confusion about how to handle node deletion. I've updated the design doc for that. Please check the Node creation and deletion section.
for _, pod := range managerPods {
err = nc.syncStatusWithPod(pod, node)
if err != nil {
return err
As I said in the design doc, don't retry if a conflict happens; assume other managers are updating it. I am afraid the requeue may cause a storm of update conflicts in a larger system.
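A sketch of dropping the work item on conflict instead of requeueing (apierrors here is k8s.io/apimachinery/pkg/api/errors; the exact wiring is assumed):

if err := nc.syncStatusWithPod(pod, node); err != nil {
	if apierrors.IsConflict(err) {
		// another manager is updating this node; skip instead of
		// requeueing to avoid a storm of update conflicts
		return nil
	}
	return err
}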
deploy/01-prerequisite/03-crd.yaml (outdated)
listKind: NodeList
plural: nodes
shortNames:
- lhnode
Use lhn instead.
@yasker I tried to tear down one longhorn-manager to check that the Node CRD status is set to […]. I have updated […].
LGTM. Merged, thanks.
@JacieChao And next time, remember to link your PR to the original issue.
For now, just add a Node CRD for multiple-disk configuration and a replica scheduler to schedule replicas to a specified disk on a node.