AWS EBS provisioner should get node zone info from k8s #76976

Merged: 1 commit, May 8, 2019
1 change: 1 addition & 0 deletions staging/src/k8s.io/legacy-cloud-providers/aws/BUILD
@@ -84,6 +84,7 @@ go_test(
         "//staging/src/k8s.io/apimachinery/pkg/apis/meta/v1:go_default_library",
         "//staging/src/k8s.io/apimachinery/pkg/types:go_default_library",
         "//staging/src/k8s.io/apimachinery/pkg/util/sets:go_default_library",
+        "//staging/src/k8s.io/client-go/kubernetes/fake:go_default_library",
         "//staging/src/k8s.io/cloud-provider/volume:go_default_library",
         "//vendor/github.com/aws/aws-sdk-go/aws:go_default_library",
         "//vendor/github.com/aws/aws-sdk-go/service/ec2:go_default_library",
120 changes: 69 additions & 51 deletions staging/src/k8s.io/legacy-cloud-providers/aws/aws.go
@@ -57,7 +57,7 @@ import (
 	v1core "k8s.io/client-go/kubernetes/typed/core/v1"
 	"k8s.io/client-go/pkg/version"
 	"k8s.io/client-go/tools/record"
-	cloudprovider "k8s.io/cloud-provider"
+	"k8s.io/cloud-provider"
 	nodehelpers "k8s.io/cloud-provider/node/helpers"
 	servicehelpers "k8s.io/cloud-provider/service/helpers"
 	cloudvolume "k8s.io/cloud-provider/volume"
@@ -205,6 +205,16 @@ const volumeAttachmentStuck = "VolumeAttachmentStuck"
 // Indicates that a node has volumes stuck in attaching state and hence it is not fit for scheduling more pods
 const nodeWithImpairedVolumes = "NodeWithImpairedVolumes"

+const (
+	// These constants help to identify if a node is a master or a minion
+	labelKeyNodeRole    = "kubernetes.io/role"
+	nodeMasterRole      = "master"
+	nodeMinionRole      = "node"
+	labelKeyNodeMaster  = "node-role.kubernetes.io/master"
Member: what behavior is being attached to these labels in this PR? Since these are ad hoc, it is not expected these labels will implicitly drive behavior of kubernetes components (the only existing example of LB routing not including nodes with a master label is considered a bug)

Contributor: Yeah, components of kube may not rely on role labels for proper function. Node role labels are informational only and optional.

+	labelKeyNodeCompute = "node-role.kubernetes.io/compute"
+	labelKeyNodeMinion  = "node-role.kubernetes.io/node"
+)
+
 const (
 	// volumeAttachmentConsecutiveErrorLimit is the number of consecutive errors we will ignore when waiting for a volume to attach/detach
 	volumeAttachmentStatusConsecutiveErrorLimit = 10
@@ -1598,69 +1608,77 @@ func (c *Cloud) InstanceType(ctx context.Context, nodeName types.NodeName) (string, error) {
 // GetCandidateZonesForDynamicVolume retrieves a list of all the zones in which nodes are running
 // It currently involves querying all instances
 func (c *Cloud) GetCandidateZonesForDynamicVolume() (sets.String, error) {
 	// We don't currently cache this; it is currently used only in volume
 	// creation which is expected to be a comparatively rare occurrence.

 	// TODO: Caching / expose v1.Nodes to the cloud provider?
 	// TODO: We could also query for subnets, I think

-	// Note: It is more efficient to call the EC2 API twice with different tag
-	// filters than to call it once with a tag filter that results in a logical
-	// OR. For really large clusters the logical OR will result in EC2 API rate
-	// limiting.
-	instances := []*ec2.Instance{}
-
-	baseFilters := []*ec2.Filter{newEc2Filter("instance-state-name", "running")}
+	zones := sets.NewString()

-	filters := c.tagging.addFilters(baseFilters)
-	di, err := c.describeInstances(filters)
+	// TODO: list from cache?
Member: yes, this should list from an informer-fed cache. Rapidly and repeatedly listing all nodes (in the example scenario of provisioning 50 PVs in 30 seconds) in a 5000-node cluster is problematic.

zhan849 (author, May 22, 2019): 👍 I do have plans to optimize that part further (#78140 is another example that can benefit from an informer); now that the cloud provider is out of tree, making such a change would be easier.

The intention is to move unnecessary "list" calls out of the cloud provider; they carry some inefficiency even compared with listing from the apiserver, and can have cross-cluster effects.

Member: a useful example could be the informer-fed node zone cache GCE maintains (see https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/gce/gce.go#L683-L745)

AWS has the extra wrinkle of excluding masters, but I wonder if that could be done as a much smaller metadata query that is marked as needing to be run when nodes are added/removed

zhan849 (author): Ah, it seems GCE has been doing this for a while; thanks a lot, Jordan! To be honest, having a dedicated zone for masters is probably not a common or even recommended practice. Still, for a non-topology-aware storage class this is a useful check, since having the volume provisioner pick a zone where a pod is more likely to be scheduled reduces operational overhead.

Member: I'd recommend talking with @msau42 about how best to filter the available zones (and whether there's additional info in the CreateVolume call that could be used to narrow down which nodes' zones should be considered)

zhan849 (author): Thanks Jordan. Pinged Michelle for input.

(A sketch of the informer-fed approach discussed here appears after the aws.go diff below.)

+	nodes, err := c.kubeClient.CoreV1().Nodes().List(metav1.ListOptions{})
 	if err != nil {
+		klog.Errorf("Failed to get nodes from api server: %#v", err)
 		return nil, err
 	}

-	instances = append(instances, di...)
-
-	if c.tagging.usesLegacyTags {
-		filters = c.tagging.addLegacyFilters(baseFilters)
-		di, err = c.describeInstances(filters)
-		if err != nil {
-			return nil, err
+	for _, n := range nodes.Items {
+		if !c.isNodeReady(&n) {
Member: node readiness is not the same as the running condition in the aws query... will being more restrictive here cause problems?

zhan849 (author): I think the intention of this function is to prevent the case where we create a volume in a region that the pod cannot even be scheduled onto. From cloud provider info, "running" is the most we can determine, but "NodeReady" brings more confidence.

Contributor: I agree with @zhan849; the more correct status is that the node is Ready according to the API server vs simply running according to the EC2 control plane. It's possible that a node is running but has not yet joined the cluster, or may never join for a variety of reasons.

klog.V(4).Infof("Ignoring not ready node %q in zone discovery", n.Name)
continue
}

instances = append(instances, di...)
}

if len(instances) == 0 {
return nil, fmt.Errorf("no instances returned")
}

zones := sets.NewString()

for _, instance := range instances {
// We skip over master nodes, if the installation tool labels them with one of the well-known master labels
// This avoids creating a volume in a zone where only the master is running - e.g. #34583
// This is a short-term workaround until the scheduler takes care of zone selection
master := false
for _, tag := range instance.Tags {
tagKey := aws.StringValue(tag.Key)
if awsTagNameMasterRoles.Has(tagKey) {
Member: checking against metadata was probably more reasonable if that was outside the node's control. does checking against these label keys imply all aws clusters must label node objects with these labels now? I see this PR is being cherry picked to 1.14. are all installations making use of the aws cloud provider aware they must now label nodes this way?

-				master = true
+		// In some cluster provisioning software, a node can be both a minion and a master. Therefore we white-list
+		// here, and only filter out a node that is not a minion AND is explicitly labeled as a master
+		if c.isMinionNode(&n) || !c.isMasterNode(&n) {
+			if zone, ok := n.Labels[v1.LabelZoneFailureDomain]; ok {
+				zones.Insert(zone)
+			} else {
+				klog.Warningf("Node %s does not have zone label, ignore for zone discovery.", n.Name)
+			}
+		} else {
+			klog.V(4).Infof("Ignoring master node %q in zone discovery", n.Name)
+		}
 	}

-		if master {
-			klog.V(4).Infof("Ignoring master instance %q in zone discovery", aws.StringValue(instance.InstanceId))
-			continue
-		}
+	klog.V(2).Infof("Found instances in zones %s", zones)
+	return zones, nil
 }

-		if instance.Placement != nil {
-			zone := aws.StringValue(instance.Placement.AvailabilityZone)
-			zones.Insert(zone)
+// isNodeReady checks the node conditions and returns true if NodeReady is marked as true
+func (c *Cloud) isNodeReady(node *v1.Node) bool {
+	for _, c := range node.Status.Conditions {
+		if c.Type == v1.NodeReady {
+			return c.Status == v1.ConditionTrue
 		}
 	}
+	return false
+}

+// isMasterNode checks if the node is labeled as master
+func (c *Cloud) isMasterNode(node *v1.Node) bool {
Member: this is elevating an ad hoc label to implicitly drive behavior. these labels are not uniformly applied, and are not maintained by kubernetes components (unlike the os/arch labels), so I would not expect them to be used this way

+	// Master node has one or more of the following labels:
+	//
+	//     kubernetes.io/role: master
+	//     node-role.kubernetes.io/master: ""
+	//     node-role.kubernetes.io/master: "true"
+	if val, ok := node.Labels[labelKeyNodeMaster]; ok && val != "false" {
+		return true
+	} else if role, ok := node.Labels[labelKeyNodeRole]; ok && role == nodeMasterRole {
+		return true
+	}
+	return false
+}

-	klog.V(2).Infof("Found instances in zones %s", zones)
-	return zones, nil
+// isMinionNode checks if the node is labeled as minion
+func (c *Cloud) isMinionNode(node *v1.Node) bool {
Contributor: This check is not safe. Nothing guarantees these labels and a conformant cluster is not required to set them. This would break backwards compatibility with existing clusters on upgrade as well.

+	// Minion node has one or more of the following labels:
+	//
+	//     kubernetes.io/role: "node"
+	//     node-role.kubernetes.io/compute: "true"
+	//     node-role.kubernetes.io/node: ""
+	if val, ok := node.Labels[labelKeyNodeMinion]; ok && val != "false" {
+		return true
+	} else if val, ok := node.Labels[labelKeyNodeCompute]; ok && val != "false" {
+		return true
+	} else if role, ok := node.Labels[labelKeyNodeRole]; ok && role == nodeMinionRole {
+		return true
+	}
+	return false
+}

 // GetZone implements Zones.GetZone
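To make the informer suggestion from the review thread concrete, here is a minimal sketch of an informer-fed zone cache, loosely modeled on the GCE approach linked above. It is illustrative only and not from this PR: the names (nodeZoneCache, newNodeZoneCache, candidateZones) are hypothetical, and the readiness and master/minion filtering done by the actual change are omitted for brevity.

// Illustrative sketch (not part of this PR): an informer-fed zone cache so that
// GetCandidateZonesForDynamicVolume could answer from memory instead of listing
// every node on each volume creation. All names here are hypothetical.
package main

import (
	"fmt"
	"sync"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/sets"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

// nodeZoneCache counts nodes per zone; counts (rather than a plain set) let
// node deletions decrement a zone without rescanning the whole node list.
type nodeZoneCache struct {
	mu    sync.RWMutex
	zones map[string]int // zone label value -> number of known nodes in it
}

func newNodeZoneCache(client kubernetes.Interface, stopCh <-chan struct{}) *nodeZoneCache {
	zc := &nodeZoneCache{zones: map[string]int{}}
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	factory.Core().V1().Nodes().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) { zc.update(obj.(*v1.Node), +1) },
		DeleteFunc: func(obj interface{}) {
			// Deletes can arrive wrapped in DeletedFinalStateUnknown; skipped here.
			if n, ok := obj.(*v1.Node); ok {
				zc.update(n, -1)
			}
		},
	})
	factory.Start(stopCh)
	return zc
}

func (zc *nodeZoneCache) update(n *v1.Node, delta int) {
	// Readiness and master/minion filtering (as done in this PR) omitted for brevity.
	zone, ok := n.Labels[v1.LabelZoneFailureDomain]
	if !ok {
		return
	}
	zc.mu.Lock()
	defer zc.mu.Unlock()
	zc.zones[zone] += delta
}

// candidateZones returns the zones that currently have at least one node.
func (zc *nodeZoneCache) candidateZones() sets.String {
	zc.mu.RLock()
	defer zc.mu.RUnlock()
	zones := sets.NewString()
	for zone, count := range zc.zones {
		if count > 0 {
			zones.Insert(zone)
		}
	}
	return zones
}

func main() {
	stopCh := make(chan struct{})
	defer close(stopCh)
	// The fake clientset (also used by the new test below) works here too.
	zc := newNodeZoneCache(fake.NewSimpleClientset(), stopCh)
	time.Sleep(time.Second) // toy example: give the informer a moment to sync
	fmt.Println("candidate zones:", zc.candidateZones().List())
}

Whether and how masters should be excluded at this layer is exactly the open question in the thread above, so a real implementation would fold the PR's isMasterNode/isMinionNode checks into update.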
113 changes: 113 additions & 0 deletions staging/src/k8s.io/legacy-cloud-providers/aws/aws_test.go
@@ -36,6 +36,7 @@ import (
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/types"
 	"k8s.io/apimachinery/pkg/util/sets"
+	"k8s.io/client-go/kubernetes/fake"
 	cloudvolume "k8s.io/cloud-provider/volume"
 )

@@ -1834,6 +1835,118 @@ func TestCreateDisk(t *testing.T) {
awsServices.ec2.(*MockedFakeEC2).AssertExpectations(t)
}

func TestGetCandidateZonesForDynamicVolume(t *testing.T) {
tests := []struct {
name string
labels map[string]string
ready bool
expectedZones sets.String
}{
{
name: "master node with role label",
labels: map[string]string{labelKeyNodeRole: nodeMasterRole, v1.LabelZoneFailureDomain: "us-east-1a"},
ready: true,
expectedZones: sets.NewString(),
},
{
name: "master node with master label empty",
labels: map[string]string{labelKeyNodeMaster: "", v1.LabelZoneFailureDomain: "us-east-1a"},
ready: true,
expectedZones: sets.NewString(),
},
{
name: "master node with master label true",
labels: map[string]string{labelKeyNodeMaster: "true", v1.LabelZoneFailureDomain: "us-east-1a"},
ready: true,
expectedZones: sets.NewString(),
},
{
name: "master node with master label false",
labels: map[string]string{labelKeyNodeMaster: "false", v1.LabelZoneFailureDomain: "us-east-1a"},
ready: true,
expectedZones: sets.NewString("us-east-1a"),
},
{
name: "minion node with role label",
labels: map[string]string{labelKeyNodeRole: nodeMinionRole, v1.LabelZoneFailureDomain: "us-east-1a"},
ready: true,
expectedZones: sets.NewString("us-east-1a"),
},
{
name: "minion node with minion label",
labels: map[string]string{labelKeyNodeMinion: "", v1.LabelZoneFailureDomain: "us-east-1a"},
ready: true,
expectedZones: sets.NewString("us-east-1a"),
},
{
name: "minion node with compute label",
labels: map[string]string{labelKeyNodeCompute: "true", v1.LabelZoneFailureDomain: "us-east-1a"},
ready: true,
expectedZones: sets.NewString("us-east-1a"),
},
{
name: "master and minion node",
labels: map[string]string{labelKeyNodeMaster: "true", labelKeyNodeCompute: "true", v1.LabelZoneFailureDomain: "us-east-1a"},
ready: true,
expectedZones: sets.NewString("us-east-1a"),
},
{
name: "node not ready",
labels: map[string]string{v1.LabelZoneFailureDomain: "us-east-1a"},
ready: false,
expectedZones: sets.NewString(),
},
{
name: "node has no zone",
labels: map[string]string{},
ready: true,
expectedZones: sets.NewString(),
},
{
name: "node with no label",
labels: map[string]string{v1.LabelZoneFailureDomain: "us-east-1a"},
ready: true,
expectedZones: sets.NewString("us-east-1a"),
},
}

for i, test := range tests {
t.Run(test.name, func(t *testing.T) {
awsServices := newMockedFakeAWSServices(TestClusterID)
c, _ := newAWSCloud(CloudConfig{}, awsServices)
c.kubeClient = fake.NewSimpleClientset()
nodeName := fmt.Sprintf("node-%d", i)
_, err := c.kubeClient.CoreV1().Nodes().Create(&v1.Node{
ObjectMeta: metav1.ObjectMeta{
Name: nodeName,
Labels: test.labels,
},
Status: genNodeStatus(test.ready),
})
assert.Nil(t, err)
zones, err := c.GetCandidateZonesForDynamicVolume()
assert.Nil(t, err)
assert.Equal(t, test.expectedZones, zones)
})
}
}

func genNodeStatus(ready bool) v1.NodeStatus {
status := v1.ConditionFalse
if ready {
status = v1.ConditionTrue
}
ret := v1.NodeStatus{
Conditions: []v1.NodeCondition{
{
Type: v1.NodeReady,
Status: status,
},
},
}
return ret
}

func newMockedFakeAWSServices(id string) *FakeAWSServices {
s := NewFakeAWSServices(id)
s.ec2 = &MockedFakeEC2{FakeEC2Impl: s.ec2.(*FakeEC2Impl)}