Validation doesn't catch CIDR blocks that Azure NSG won't accept #5599
Comments
Hi @phealy, I'm not sure that continuing the NSG deployment with just an event is a good idea. The cx may just look at their service and see that it's working without noticing the error event. Ignoring the "bad" network prefixes may generate an NSG rule that allows Internet access, which can pose a security issue. I'd rather add validation in the controller manager and report an error before sending it to NRP.
@jwtty I'll edit my original comment to be clearer about this case - I'm not suggesting that we skip adding the NSG rules entirely if an entry is malformed; I'm saying we add all of the rules except the one that's bad, especially including the default deny at the end. This way we fail-secure - if any entry is bad, the service ends up more locked down than the customer wants, not less. If all of the entries are malformed, then the service ends up with no access allowed. The cx would likely notice the failure immediately because the source they tried to add isn't actually allowed access. This still isn't great for troubleshooting if they don't see the event, but at least it's failing secure instead of now where it can fail either way depending on the state prior to the update. Consider this scenario for why I'd prefer to do it this way:
I do agree that validating up front would be a much better user experience, so what if we do both options? Let's fix the validation here to fail gracefully and securely and add a validating webhook in AKS to reject the update up front with an immediate failure and clear error message. That way non-AKS systems that are using cloud-provider-azure without our webhook still have good failure behavior, but AKS customers (or anyone else who wants to add a validating webhook for this) get better validation up front and an immediate rejection. I don't know if we'd get support upstream for adding this validation in kube-controller-manager - the restriction on not using a host address in the rule is likely not applicable to all clouds and thus adding that restriction in could break other environments where that sort of setup is allowed. What do you think?
As per discussion, we can include the correct network prefix in the route table rule and generate an event to report this.
@jwtty Sorry, I have to reverse myself on this - we need to make sure to fail secure by rejecting the bad entry and still including the deny rule, even if all of the entries are malformed (have host bits set). I was thinking about the possible failure cases of correcting this automatically and realized there's a really bad one - when the mistake is in the prefix mask, rather than the IP address portion, we can't assume intent. Example 1: customer sets
Example 2: customer tries to set their VNet CIDR of
Given that failure case, can we please adjust the behavior to something like what I originally described, where we log an error/emit an event about the malformed CIDR and skip adding it to the allow rule, but still add everything else including the deny rule? A few examples to make sure we're on the same page:
Let's discuss internally about adding front-end validation for this via a webhook.
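The fail-secure behavior requested in this thread (skip malformed entries, report them, and still add the deny rule) could be sketched roughly as follows. This is only an illustration and not cloud-provider-azure's actual code: partitionSourceRanges is a hypothetical helper name, and the use of net/netip for parsing is an assumption.

```go
package main

import (
	"fmt"
	"net/netip"
)

// partitionSourceRanges is a hypothetical helper (not part of
// cloud-provider-azure). It splits entries into those that are network
// addresses (host bits all zero) and those that should be skipped.
func partitionSourceRanges(ranges []string) (allowed, rejected []string) {
	for _, r := range ranges {
		p, err := netip.ParsePrefix(r)
		// Masked zeroes the host bits; if the result differs from the
		// parsed prefix, the entry had host bits set.
		if err != nil || p != p.Masked() {
			rejected = append(rejected, r)
			continue
		}
		allowed = append(allowed, r)
	}
	return allowed, rejected
}

func main() {
	allowed, rejected := partitionSourceRanges([]string{"10.0.0.0/24", "10.0.0.1/24"})
	// Only the valid entries go into the allow rule.
	fmt.Println("allow rule sources:", allowed)
	// Each rejected entry would be reported via an error event.
	fmt.Println("skipped entries:", rejected)
	// The deny rule is added regardless, even when allowed is empty.
	fmt.Println("deny rule: always added")
}
```

If every entry is rejected, allowed is empty and only the deny rule remains, which matches the "more locked down than the customer wants, not less" outcome described in the thread.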
What happened:
An entry in loadBalancerSourceRanges was added that contained bits set in the host portion of the address (example: 10.0.0.1/24 instead of 10.0.0.0/24). Kubernetes accepts this because the validation on the service object only checks for a properly formatted CIDR block. Azure NSG rules will not accept this because it's a host address rather than a network address, and this breaks the reconcile loop for the service. This prevents any further updates from occurring until the bad address entry is removed.
An example error in the event is as follows:
What you expected to happen:
cloud-provider-azure should validate that the host portion of the address is zero, and skip any entry from loadBalancerSourceRanges that fails this check when creating the NSG rules (with an error event logged to Kubernetes), so that other updates are not blocked. All other updates should be applied.
EDIT: The default deny rule should still be created even if all of the entries in loadBalancerSourceRanges fail validation - this ensures that we fail-secure by not leaving open access that the customer wanted to close off. This addresses @jwtty's point here about making sure that we don't leave something unrestricted when it shouldn't be.
How to reproduce it (as minimally and precisely as possible):
Add the following to a LoadBalancer service:
Anything else we need to know?:
Here is some quick Go sample code to check if the CIDR block will be accepted by Azure networking:
$ go build ip.go && ./ip 10.0.0.0/24 10.0.0.1/24 fe80::/64 fe80::1/64
10.0.0.0/24 is a valid network CIDR block
10.0.0.1/24 is not a valid network CIDR block
fe80::/64 is a valid network CIDR block
fe80::1/64 is not a valid network CIDR block
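The sample program itself did not survive in the text above. A minimal sketch of a check that classifies the arguments the same way might look like this. The function name isNetworkCIDR and the choice of net/netip are assumptions and are not taken from the original sample. The sketch also reports unparseable input as "not a valid network CIDR block".

```go
package main

import (
	"fmt"
	"net/netip"
	"os"
)

// isNetworkCIDR reports whether s parses as a CIDR block whose host bits
// are all zero (for example 10.0.0.0/24, but not 10.0.0.1/24).
// The name is hypothetical; it is not from the original sample.
func isNetworkCIDR(s string) bool {
	p, err := netip.ParsePrefix(s)
	if err != nil {
		return false
	}
	// Masked zeroes the host bits; an unchanged prefix had none set.
	return p == p.Masked()
}

func main() {
	for _, arg := range os.Args[1:] {
		if isNetworkCIDR(arg) {
			fmt.Printf("%s is a valid network CIDR block\n", arg)
		} else {
			fmt.Printf("%s is not a valid network CIDR block\n", arg)
		}
	}
}
```

Note that netip.ParsePrefix may be stricter than the parsing Kubernetes applies to service objects, so a production check would likely need to reuse the same parser as the existing validation.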
Environment:
Kubernetes version (kubectl version): checked in AKS 1.27 and 1.28
OS (cat /etc/os-release): n/a
Kernel (uname -a): n/a