
Add OpenStack cloud provider #1690

Merged: 7 commits merged into kubernetes:master from cernops:openstack-autoscaler on Mar 15, 2019

Conversation

tghartland (Contributor) commented Feb 14, 2019

This pull request adds a cloud provider for OpenStack. It requires a more up-to-date version of gophercloud; the specific commit was chosen because the commit immediately after it changes a function signature, which causes problems with k8s.io/kubernetes itself.

Right now the autoscaler only supports OpenStack clusters using Magnum and Heat, but that support goes through an OpenstackManager interface. I will add another implementation of this interface to support Magnum clusters with proper nodegroups once that feature is available, and the interface could also be extended to support other kinds of OpenStack clusters. Only the Magnum/Heat manager exists today, but the provider is already set up to select other managers via an environment variable.

All nodegroups managed by the OpenstackCloudProvider interact with OpenStack through a single instance of the manager.

Since Magnum does not yet have nodegroups, only a single nodegroup is currently allowed, and it represents the entire cluster.
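
For illustration only, not code from this PR (the manager methods, the constructor, and the OS_MANAGER variable name are assumptions), a minimal sketch of an OpenstackManager interface selected via an environment variable:

package openstack

import (
	"fmt"
	"os"
)

// OpenstackManager abstracts whichever OpenStack service is used to resize the cluster,
// so the cloud provider code does not care whether Magnum/Heat or something else is behind it.
type OpenstackManager interface {
	NodeGroupSize(nodegroup string) (int, error)
	UpdateNodeCount(nodegroup string, nodes int) error
	DeleteNodes(nodegroup string, nodeIPs []string, updatedNodeCount int) error
}

// createOpenstackManager picks an implementation based on an environment variable.
// Only a Magnum/Heat manager exists for now; other values could map to future managers.
func createOpenstackManager() (OpenstackManager, error) {
	switch manager := os.Getenv("OS_MANAGER"); manager {
	case "", "heat":
		return newHeatManager()
	default:
		return nil, fmt.Errorf("openstack manager %q is not supported", manager)
	}
}

// newHeatManager is a stub standing in for the real Magnum/Heat implementation.
func newHeatManager() (OpenstackManager, error) {
	return nil, fmt.Errorf("not implemented in this sketch")
}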

if err != nil {
	return fmt.Errorf("scale up error: %v", err)
}
err = ng.WaitForClusterStatus(clusterStatusUpdateComplete, waitForCompleteStatusTimout)

MaciekPytel (Contributor) Feb 14, 2019

Does this wait until new nodes actually boot and register in k8s? In general this method should just request VMs, not block until they boot up.

tghartland (Author, Contributor) Feb 14, 2019

The NodeGroup interface suggests that the scale up/down implementations should wait for the operation to complete.

// IncreaseSize increases the size of the node group. To delete a node you need
// to explicitly name it and use DeleteNode. This function should wait until
// node group size is updated. Implementation required.
IncreaseSize(delta int) error

MaciekPytel (Contributor) Feb 14, 2019

It was designed with the assumption that underlying cloud provider has a 'desired size'. The method must wait until that 'desired size' is updated, not until VMs are actually created.

CA takes 'upcoming' nodes into account in its simulations and it can trigger another scale-up while the previous nodes are still starting up if even more pending pods show up. You'll lose that ability if you block until VMs are created.
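
A minimal sketch of the pattern described here, with assumed struct and method names rather than this PR's actual code: record the new desired size, request it from the cloud side, and return without waiting for the VMs to boot, so the autoscaler can count them as upcoming nodes.

package openstack

import (
	"fmt"
	"sync"
)

type nodeCountUpdater interface {
	UpdateNodeCount(nodes int) error
}

type OpenstackNodeGroup struct {
	sync.Mutex
	targetSize int
	maxSize    int
	manager    nodeCountUpdater
}

// IncreaseSize requests the new size and returns once the desired size is recorded,
// without blocking until the cluster reaches UPDATE_COMPLETE.
func (ng *OpenstackNodeGroup) IncreaseSize(delta int) error {
	ng.Lock()
	defer ng.Unlock()
	if delta <= 0 {
		return fmt.Errorf("size increase must be positive")
	}
	newSize := ng.targetSize + delta
	if newSize > ng.maxSize {
		return fmt.Errorf("size increase too large, desired:%d max:%d", newSize, ng.maxSize)
	}
	if err := ng.manager.UpdateNodeCount(newSize); err != nil {
		return fmt.Errorf("could not increase cluster size: %v", err)
	}
	ng.targetSize = newSize
	return nil
}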

tghartland (Author, Contributor) Feb 14, 2019

It was designed with the assumption that underlying cloud provider has a 'desired size'. The method must wait until that 'desired size' is updated, not until VMs are actually created.

OK, that makes sense. The node_count in Magnum is updated at the start of the scaling, so that would be the point at which it should stop waiting.

Even so, I have it set up so that any scale operations triggered while the cluster is in an updating status are ignored anyway. In my experience, letting it send another update to the cluster while it was still updating would always end up with the cluster in UPDATE_FAILED status.
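
If some wait is kept at all, a rough sketch (using a hypothetical NodeCount accessor, not this PR's code) of stopping as soon as Magnum reports the new node_count rather than waiting for the whole update to complete:

package openstack

import (
	"fmt"
	"time"
)

type clusterSizer interface {
	// NodeCount is a hypothetical accessor returning Magnum's current node_count.
	NodeCount() (int, error)
}

// waitForNodeCount polls until node_count reflects the requested size or the timeout expires.
func waitForNodeCount(m clusterSizer, target int, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		count, err := m.NodeCount()
		if err != nil {
			return fmt.Errorf("could not check node count: %v", err)
		}
		if count == target {
			return nil
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("timed out waiting for node count to reach %d", target)
}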

strigazi Feb 14, 2019

CA takes 'upcoming' nodes into account in its simulations and it can trigger another scale-up while the previous nodes are still starting up if even more pending pods show up. You'll lose that ability if you block until VMs are created.

Do you have a pointer to the code that does the simulations? With our implementation all VMs are identical, so the simulations could be made, but another scale-up will not be possible at the moment (this could change in the future).

MaciekPytel (Contributor) Feb 14, 2019

Do you have a pointer to the code that does the simulations?

Half of the CA codebase is dedicated to running simulations :) If you ask a specific question I can point you to the relevant code, but just the handling of upcoming nodes spans hundreds of lines spread across various files in clusterstate/, core/ and expander/.

scott-chaney Feb 14, 2019

It was designed with the assumption that underlying cloud provider has a 'desired size'. The method must wait until that 'desired size' is updated, not until VMs are actually created.

Magnum and the underlying Heat stack don't have the notion of a "desired size" that is decoupled from a scaling event. If the node count parameter changes on the stack, that immediately triggers the stack's size update policy. There is also no imperative method to add or remove nodes with the Heat::ResourceGroup that the Magnum clusters are built on (there is no "add X nodes" method for the ResourceGroup). I think this situation can be made more consistent with how the other cloud providers work by keeping track of the "target size" as scaling events are processed by the cloudprovider code. This would allow the openstack autoscaler code to issue concurrent updates, as long as the previous "target size" was calculated correctly from the previous update (the mutex should ensure that the new "target size" is set before another thread evaluates a scaling operation).

With our implementation all vms are identical, so the simulations could be made, but, another scale up will not be possible at the moment (this could change in the future)

From my tests, the Heat stack is able to handle multiple scale-up requests before one finishes, as long as each request knows the true reference size behind the "delta" parameter. I think this could be improved by updating ng.targetSize as updates come in to the cloudprovider code.

In my own experience letting it send another update to the cluster while it was still updating would always end up with the cluster in UPDATE_FAILED status.

I have not experienced this issue; which version of Heat were you observing this in?

Edit: I was thinking from Heat's perspective (for concurrent updates); it is possible that a Magnum update issued while another update is in flight does yield the failed status.
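
A rough sketch of the tracked-target-size idea above, with assumed names rather than this PR's code: each scale-down computes its new size from the locally stored target size under a mutex, so a second update does not have to wait for the previous Heat update to finish.

package openstack

import (
	"fmt"
	"sync"
)

type nodeDeleter interface {
	// DeleteNodes names the nodes to remove and the node count they leave behind.
	DeleteNodes(nodeIPs []string, updatedNodeCount int) error
}

type trackedNodeGroup struct {
	sync.Mutex
	targetSize int
	minSize    int
	manager    nodeDeleter
}

func (ng *trackedNodeGroup) DeleteNodes(nodeIPs []string) error {
	ng.Lock()
	defer ng.Unlock()
	newSize := ng.targetSize - len(nodeIPs)
	if newSize < ng.minSize {
		return fmt.Errorf("size decrease too large, desired:%d min:%d", newSize, ng.minSize)
	}
	if err := ng.manager.DeleteNodes(nodeIPs, newSize); err != nil {
		return fmt.Errorf("could not delete nodes: %v", err)
	}
	// Later scaling events compute their deltas from this value, not from the live stack.
	ng.targetSize = newSize
	return nil
}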

MaciekPytel (Contributor) commented Feb 14, 2019

Since Magnum does not yet have nodegroups, only a single nodegroup is currently allowed and this represents the entire cluster.

The requirement for NodeGroup is that it's made up of identical nodes. That basically implies that you can only have one type of VMs in your cluster. Is that by design?

MaciekPytel (Contributor) commented Feb 14, 2019

I don't know anything about how Kubernetes works on OpenStack. I can point out a few things from the CA side, but I'd appreciate a review from someone who actually knows something about OpenStack.

@kubernetes/sig-openstack-pr-reviews
+cc People active on #734:
@ricolin @chaosaffe @adsl123gg @dklyle

MaciekPytel (Contributor) commented Feb 14, 2019

cc: @multi-io who was also working on openstack support (via cluster API).

tghartland (Contributor, Author) commented Feb 14, 2019

@MaciekPytel

The requirement for NodeGroup is that it's made up of identical nodes. That basically implies that you can only have one type of VMs in your cluster. Is that by design?

This is how things currently are in Magnum; the only difference allowed is between the master and minion nodes. Support for nodegroups is being worked on, and that will allow for variation within the minions.
https://github.com/openstack/magnum-specs/blob/master/specs/stein/magnum-nodegroups.rst

strigazi commented Feb 14, 2019

Since Magnum does not yet have nodegroups, only a single nodegroup is currently allowed and this represents the entire cluster.

The requirement for NodeGroup is that it's made up of identical nodes. That basically implies that you can only have one type of VMs in your cluster. Is that by design?

Nodes within a nodegroup are identical by design.

This is an attempt to implement 6 from #734 (comment)

strigazi commented Feb 14, 2019

cc: @multi-io who was also working on openstack support (via cluster API).

Where is that work?


chaosaffe commented Feb 14, 2019

The question from my side is: why are we supporting Magnum or Heat when both APIs require and interact directly with Nova?
Targeting several wrappers around the Nova API (i.e. Magnum, Heat) seems to create an additional maintenance burden without any advantage I can see (please correct me if I am missing something obvious).

Why not target Nova directly, and therefore support all possible OpenStack environments managing compute?

strigazi commented Feb 14, 2019

The question from my side is why are we supporting magnum or heat when both APIs require and interact directly with nova?

This solution (see here https://github.com/kubernetes/autoscaler/pull/1690/files#diff-fee06febcf57b148fb92f53f3ddd44e2R28) is meant to answer this question. We don't have to support only one solution. #1481 addresses a solution for users of the cluster-api. This PR is not meant to be the de facto OpenStack solution; it can go in its own directory.

Targeting several wrappers around the nova API (i.e. magnum, heat) seems like it creates an additional maintenance burden without any apparent advantage I can see (please correct me if I am missing something obvious)

Why not target nova directly, and therefore all possible OpenStack environments managing compute?

The advantage is that people use different solutions, like cluster-api, Magnum, Heat, Senlin. Maybe I'm forgetting one.

This autoscaler implementation should not be a blocker for cluster-api users, in the same way that this repo already has many cloud providers.

MaciekPytel (Contributor) commented Feb 14, 2019

I don't have enough context to have an opinion on what is the right way to implement an OpenStack cloudprovider. However, as a CA maintainer I'd like to avoid ending up with several different OpenStack implementations, and in particular everyone implementing their own version in this repo.
I probably wouldn't mind having a cluster-api-based one and a non-cluster-api one, but having a separate one for Magnum, Nova, Senlin, Heat, whatever, seems like too much. That implies cloudproviders tailored to individual use cases rather than generic support for OpenStack.

dims (Member) commented Feb 14, 2019

/assign @losipiuk

strigazi commented Feb 14, 2019

I don't have enough context to have an opinion on what is the right way to implement openstack cloudprovider. However, as CA maintainer I'd like to avoid ending up with several different openstack implementations and in particular everyone implementing their own version in this repo.
I probably wouldn't mind having a cluster api based one and a non-cluster api one, but having a separate one for magnum, nova, senlin, heat, whatever seems like too much. That implies cloudproviders tailored to individual use-cases rather than generic support for openstack.

@MaciekPytel here is the entrypoint to have one OpenStack cloud provider which could be configured either for the cluster-api or for Magnum. AFAIK only cluster-api and Magnum are in a state that can be used today. @chaosaffe, when you say Nova, you mean cluster-api, right?

MaciekPytel (Contributor) commented Feb 14, 2019

@MaciekPytel here is the entrypoint to have one OpenStack cloud provider which could be configured either for the cluster-api or for Magnum

I'd like to have an OpenStack implementation that works for the majority of people and avoid implementations targeting someone's specific setup in the official repo. If there is a consensus that a small number of different implementations (or a few different X_manager.go files in your implementation) are required, I'm also fine with that.
As to whether this PR is the right way to implement the OpenStack provider, I'd rather defer to @kubernetes/sig-openstack-pr-reviews. If there is a consensus that this is how the OpenStack provider should work, I'm happy to do CA-side code review and merge this.

jim-bach commented Feb 14, 2019

I think being able to provide an optional code path, as you said with another X_manager.go, while keeping all of the Kubernetes conventions and most of the OpenStack conventions the same, is reasonable.

ricolin commented Feb 14, 2019

@tghartland Sorry for not updating #1226; some personal matters have kept me busy.
I've just posted my current work in #1691 so we can have a better discussion here.
My approach is to have a common path and to separate each use case inside the OpenStack manager.
I like your approach as well, and I actually think we should focus on Magnum for now. Since some of the common design in your patch is Magnum-specific, maybe we can leave openstack_manager_heat alone, since it's not even usable, is it? If we plan to keep the others (Nova, Heat, Senlin) on the table, then I think we'd better have a common path for them all.

}
klog.V(1).Infof("Deleting nodes: %v", nodeNames)

updatePossible, currentStatus, err := ng.openstackManager.CanUpdate()

scott-chaney Feb 14, 2019

Similarly to the IncreaseSize function, the stack/cluster should be able to handle multiple concurrent updates, as long as the intended result of each scaling action is stored and the subsequent action can act on that data.

// the clusterUpdateMutex lock. Once they get it, the deletion will
// already be done and they will return above at the check
// for len(ng.nodesToDelete) == 0.
time.Sleep(deleteNodesBatchingDelay)

scott-chaney Feb 14, 2019

I like that this saves on Heat API hits, but for this function I think it might be better to just issue the updates as they come in and not poll the API waiting for updates to finish. I think ng.nodesToDelete should be appended to as deletes come in (stored across calls) and just resubmitted. Once a scale-up is issued, ng.nodesToDelete can be purged to prevent accidental removal of a node with a reused IP. There should be no harm if hundreds of IPs are specified when removing a single node, as long as we know there were no scale-up operations in between the scale-downs.
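
A sketch of this accumulate-and-resubmit idea, with assumed field and method names rather than this PR's code: delete requests are appended to nodesToDelete across calls and resubmitted, and the list is cleared whenever a scale-up is issued so a reused IP cannot be deleted by mistake.

package openstack

import "sync"

type batchingNodeGroup struct {
	sync.Mutex
	nodesToDelete []string // node IPs pending removal, kept across DeleteNodes calls
}

// queueForDeletion records new delete requests and returns the full accumulated list
// to resubmit with the next stack update.
func (ng *batchingNodeGroup) queueForDeletion(nodeIPs []string) []string {
	ng.Lock()
	defer ng.Unlock()
	ng.nodesToDelete = append(ng.nodesToDelete, nodeIPs...)
	return append([]string(nil), ng.nodesToDelete...)
}

// onScaleUp purges pending deletions, since a scale-up may hand a queued IP to a new node.
func (ng *batchingNodeGroup) onScaleUp() {
	ng.Lock()
	defer ng.Unlock()
	ng.nodesToDelete = nil
}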

return fmt.Errorf("size decrease too large, desired:%d min:%d", minionCount-len(nodes), ng.MinSize())
}

var minionIPs []string

scott-chaney Feb 15, 2019

I was not able to successfully target minions by IP when scaling down; I had to use the node index instead. I attempted this in Queens Heat with both Juno-era and Queens-era templates. When I inspected the cluster's minion ResourceGroup, I noticed that the "refs" attribute was null. Would you be willing to share the output of the following API call?

curl -H 'X-Auth-Token:{OS_TOKEN}' {heat-api}/v1/{tenant_id}/stacks/{stack_name}/{stack_id}/resources/{resource_name}

Where resource_name is the kube_minions ResourceGroup. I am curious whether the null refs attribute is the cause of this.

tghartland (Author, Contributor) Feb 15, 2019

I wasn't able to find a list of what values are accepted by Heat for minions_to_remove, and the only examples I saw used the node index or the IP, so I chose to use the IP since it's directly available as a property of the node. I've now tried with the machine ID and that works as well for me. I won't be able to rework the tests for it and push the commit until Monday, but you can retest after that.

jim-bach Feb 15, 2019

We're not able to target the nodes by IP through this mechanism; even when the IPs are passed through, it just chooses the "newest" nodes for downscaling. Only the index works (maybe the machineID too? @scott-chaney). If you can show us the output from that curl, it will help us reduce our search space and get a better view of the problem.

tghartland (Author, Contributor) Feb 15, 2019
{
   "resource":{
      "resource_name":"kube_minions",
      "description":"",
      "links":[
         {
            "href":"https://openstack.cern.ch:8004/v1/d656bd69-82a2-4efe-bbee-abd0b0a83252/stacks/scaler-01-owjrxq5vkizf/5f3ad568-3260-4edb-91bb-c80dacaa275f/resources/kube_minions",
            "rel":"self"
         },
         {
            "href":"https://openstack.cern.ch:8004/v1/d656bd69-82a2-4efe-bbee-abd0b0a83252/stacks/scaler-01-owjrxq5vkizf/5f3ad568-3260-4edb-91bb-c80dacaa275f",
            "rel":"stack"
         },
         {
            "href":"https://openstack.cern.ch:8004/v1/d656bd69-82a2-4efe-bbee-abd0b0a83252/stacks/scaler-01-owjrxq5vkizf-kube_minions-g2tvz4jsooom/76467cd3-15db-43b8-a392-f8433e30d468",
            "rel":"nested"
         }
      ],
      "logical_resource_id":"kube_minions",
      "creation_time":"2019-02-15T13:25:01Z",
      "resource_status_reason":"state changed",
      "updated_time":"2019-02-15T13:50:40Z",
      "required_by":[

      ],
      "resource_status":"UPDATE_COMPLETE",
      "physical_resource_id":"76467cd3-15db-43b8-a392-f8433e30d468",
      "attributes":{
         "attributes":null,
         "refs":null,
         "refs_map":null,
         "removed_rsrc_list":[

         ]
      },
      "resource_type":"OS::Heat::ResourceGroup"
   }
}

I see refs as null as well.

tghartland (Author, Contributor) Feb 19, 2019

@scott-chaney please disregard my previous comment. Machine IDs do not work (they are just ignored by heat). IPs should work because heat is able to resolve them back to the node indices.

$ openstack stack show scaler-03-u6xxwz25daec-kube_minions-4ylgp5orn4qo
.....
| outputs               |
|                       |   output_key: refs_map
|                       |   output_value:
|                       |     '0': 188.185.64.117
.....

The solution is, as you said, to delete by the node index. Getting this from heat (without just parsing it from the node name) will be a bit of a pain but I'll start on that today. @strigazi and I have identified the change needed in the heat templates (here) to instead map between index and machine ID in the stack output, which we think is more useful than having the IP there. If we get that change merged then it will be possible to delete via the machine IDs as well, which would simplify things here.
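
A simplified sketch of the index resolution described above, assuming the refs_map output of the nested kube_minions stack has already been fetched into a map of index to reference (as in the '0': 188.185.64.117 output shown earlier); it accepts either the machine ID or the internal IP:

package openstack

import "fmt"

// minionIndexFor resolves a node back to its minion index by scanning the refs_map
// output of the nested kube_minions stack, matching either the machine ID or the IP.
func minionIndexFor(refsMap map[string]string, machineID, internalIP string) (string, error) {
	for index, ref := range refsMap {
		if ref == machineID || ref == internalIP {
			return index, nil
		}
	}
	return "", fmt.Errorf("node %s (%s) not found in kube_minions refs_map", machineID, internalIP)
}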

tghartland (Author, Contributor) Feb 19, 2019

@scott-chaney As for why IPs were failing for you, could you describe (or get node -o yaml) a node and compare the output for status.addresses to what I have?

status:
  addresses:
  - address: 188.185.64.117
    type: InternalIP
  - address: scaler-03-u6xxwz25daec-minion-0
    type: Hostname

And also check that the output corresponds to what you see in the outputs of the kube_minions stack, as I showed above.

$ openstack stack list --nested | grep scaler-03 | grep minions
| 48c8ebdf-5716-4dd5-acec-bec60838be0f | scaler-03-u6xxwz25daec-kube_minions-4ylgp5orn4qo-0-yv4zmqtad5go              | CREATE_COMPLETE | 2019-02-19T13:30:53Z | None                 | 99d3d801-9d6d-4f22-aa60-3bcab4e372bd |
| 99d3d801-9d6d-4f22-aa60-3bcab4e372bd | scaler-03-u6xxwz25daec-kube_minions-4ylgp5orn4qo                             | CREATE_COMPLETE | 2019-02-19T13:30:52Z | None                 | 91c102be-9b1c-4c12-a96f-b28b4c257cc8 |

$ openstack stack show scaler-03-u6xxwz25daec-kube_minions-4ylgp5orn4qo
........
|                       |   output_value:                                                                                           |
|                       |     '0': 188.185.64.117                                                                                   |
........

scott-chaney Feb 20, 2019

@tghartland thanks for helping to debug.

When viewing the nested stack output, I see a [resource_name]instance_id mapping. Your initial change of using the instance_id for scaling does appear to work for us.

This doesn't quite align with my expectations given the documentation for removal_policies on ResourceGroups. I would think the resource itself should have the refs and refs_map (rather than just the nested stack having them). I am going to dig into the code a bit.

I am also not sure how our nested stacks have different resource mappings (yours with [resource_name]ip_addr and ours with [resource_name]instance_id).

The default for a simple ResourceGroup appears to be [resource_name]instance_id. Do you somehow override this in your template or Heat configuration?

strigazi Feb 20, 2019

@scott-chaney
The mapping to IPs comes from [0], but instead of having [1] we have [2]. We don't have the last patch that changes it to the stack ID. We are going to make a new change to use the Nova server UUID [3], which is the most reasonable option. If you have a mapping with instance UUIDs, you must be using [3].

[0] https://github.com/openstack/magnum/blob/master/magnum/drivers/k8s_fedora_atomic_v1/templates/kubeminion.yaml#L546-L549
[1] {get_param: "OS::stack_id"}
[2] {get_attr: [kube_minion_eth0, fixed_ips, 0, ip_address]}
[3] {get_resource: kube-minion}

tghartland (Author, Contributor) Feb 22, 2019

@scott-chaney to avoid being stuck between deleting by machine ID (which works for you) and deleting by IP (which works for @strigazi and me), I have pushed a change that resolves nodes back to their minion index using the nested kube_minions stack mapping, checking for both the ID and the IP.

b3d2313#diff-d35c147e215a3b4712ad5f5cc79e6b82R356

mwielgus added the lgtm label Mar 15, 2019
tghartland (Contributor, Author) commented Mar 15, 2019

@mwielgus the formatting is one tab off in the change you made in builder_all.go. I have just rebased onto the master branch and fixed the conflict in builder_all.go myself; I can force-push that to this branch if you want, and then it will merge.

mwielgus (Contributor) commented Mar 15, 2019

Yes, please push the new change. Something went terribly wrong in the in-GitHub editor. Sorry for the inconvenience.

tghartland force-pushed the cernops:openstack-autoscaler branch from 211d591 to 2f6b02b Mar 15, 2019
k8s-ci-robot (Contributor) commented Mar 15, 2019

New changes are detected. LGTM label has been removed.

k8s-ci-robot removed the lgtm label Mar 15, 2019
mwielgus added the lgtm label Mar 15, 2019
k8s-ci-robot merged commit e56c913 into kubernetes:master Mar 15, 2019
2 of 3 checks passed
tide: Not mergeable. Job continuous-integration/travis-ci/pr has not succeeded.
cla/linuxfoundation: tghartland authorized
continuous-integration/travis-ci/pr: The Travis CI build passed
strigazi commented Mar 15, 2019

Thank you @tghartland !!!

mwielgus (Contributor) commented Mar 15, 2019

Merged. Thank you for the contribution!

openstacker added a commit to openstacker/gophercloud that referenced this pull request Mar 21, 2019
Recently in Magnum we introduced a new API, v1/clusters//actions/resize [1],
so that a consumer can resize the node count of the cluster; the API also
allows passing in node instance IDs in order to delete particular nodes.
That is very useful for the auto-healing and auto-scaling cases, for which
Magnum will leverage the k8s Cluster Autoscaler.

Without this new API, the consumer has to call the Heat API first to remove
the particular nodes and then call the Magnum API again to update the node
count, which is obviously not ideal. Hence the new API was proposed. For more
information please refer to [2] [3].

[1] https://review.openstack.org/638572
[2] https://review.openstack.org/631378
[3] kubernetes/autoscaler#1690
openstacker added a commit to openstacker/gophercloud that referenced this pull request Mar 22, 2019
ozerovandrei added a commit to gophercloud/gophercloud that referenced this pull request Mar 23, 2019