Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure private endpoint support #1817

Merged
merged 2 commits into from
Jul 19, 2022

Conversation

m1kola
Copy link
Member

@m1kola m1kola commented Jul 8, 2022

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 8, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 8, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot requested review from abutcher and dlom July 8, 2022 16:48
@2uasimojo
Copy link
Member

Cool. So this new field is per spoke cluster, not per hub?

@m1kola
Copy link
Member Author

m1kola commented Jul 11, 2022

@2uasimojo yes, that's the idea. Each cluster has their own private endpoint so it has to be on per cluster basis.

@m1kola m1kola marked this pull request as ready for review July 15, 2022 15:39
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 15, 2022
@m1kola
Copy link
Member Author

m1kola commented Jul 15, 2022

I tested this in context of ARO and it seems to work as expected: without apiServerEndpointIP Hive fails to reach to the cluster. With apiServerEndpointIP populated Hive can talk to the cluster via private endpoint.

@codecov
Copy link

codecov bot commented Jul 15, 2022

Codecov Report

Merging #1817 (70383b9) into master (b87d3cb) will increase coverage by 0.00%.
The diff coverage is 46.42%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1817   +/-   ##
=======================================
  Coverage   41.68%   41.69%           
=======================================
  Files         353      353           
  Lines       33128    33156   +28     
=======================================
+ Hits        13810    13823   +13     
- Misses      18155    18170   +15     
  Partials     1163     1163           
Impacted Files Coverage Δ
pkg/remoteclient/mock/remoteclient_generated.go 54.79% <0.00%> (-14.18%) ⬇️
...shift/hive/apis/hive/v1/clusterdeployment_types.go 100.00% <ø> (ø)
pkg/remoteclient/remoteclient.go 70.00% <100.00%> (+2.65%) ⬆️

Copy link
Member

@2uasimojo 2uasimojo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make update should rebuild the saas template that's causing verify to fail.

But first...

I've been staring at the REST/HTTP code/docs for a while now, and I can't get my head around how Dial plays in relation to Host. My goal was really just to vet that APIServerEndpointIP is a sensible name for the new field. Because the way it's used makes it feel more like a "proxy". Or does Host not get used at all when Dial is set? Would the original code still work if instead of using APIURLOverride to set Host, we used it to do this DialContext thing? And if so, would that mean that we wouldn't need the new field, and you could just use APIURLOverride for this purpose? If not, what happens if you set both fields?

All of this ignorance makes me leery of introducing this field to the permanent API we have to support until we retire v1.

We have a couple of ways we could make this field "softer" to mitigate those concerns.

  • We could wrap it in an alpha feature gate, so in order to use it, you have to explicitly opt into it in HiveConfig, and you're agreeing it might go away at any point without a version bump. (Of course we would consult with you before considering such a change.)
  • We could feed it in via an annotation instead of making it a real API field. That way it's not part of the documented API, and we've got the flexibility to change/remove it if we come up with something better later on.

Thoughts?

@@ -668,6 +668,11 @@ type ControlPlaneConfigSpec struct {
// active, Hive will use the override URL for further communications with the API server of the remote cluster.
// +optional
APIURLOverride string `json:"apiURLOverride,omitempty"`

// APIServerEndpointIP is the optional override of the API server IP address.
// Hive will use this IP address for creating tcp connections using client-go.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, I wouldn't mention implementation details like client-go in this docstring.

@m1kola m1kola force-pushed the azure-private-endpoint-support branch 2 times, most recently from 6901df7 to 0018c56 Compare July 18, 2022 11:47
@m1kola
Copy link
Member Author

m1kola commented Jul 18, 2022

@2uasimojo verify should be fine now. Pleaes take another look.

I've been staring at the REST/HTTP code/docs for a while now, and I can't get my head around how Dial plays in relation to Host.

Let's imagine that we have an API server in kubeconfig set to example.com:6443 and DNS resolves to 93.184.216.34.

Without a custom dialer we will be making a TCP connection to 93.184.216.34 at port 6443 and sending an HTTP request with Host: example.com header.

With apiServerEndpointIP set to say 10.0.0.10 we will be using a custom dialer and making a TCP connection to 10.0.0.10 at port 6443 (original port from the kubeconfig). We still will be sending a HTTP request with Host: example.com header.

Or does Host not get used at all when Dial is set?

Host will not be used in transport/dial (TCP, layer 4), but will still be used on the application layer (in the Host header of HTTP request, layer 7).

Would the original code still work if instead of using APIURLOverride to set Host, we used it to do this DialContext thing? And if so, would that mean that we wouldn't need the new field, and you could just use APIURLOverride for this purpose?

We considered replacing API server hostname in kubeconfig with IP address such as example.com:6443 -> 10.0.0.10:6443, but it won't work because in this case we will be sending HTTP requests with Host: 10.0.0.10 header.

I believe it will be exact the same issue if we set APIURLOverride to 10.0.0.10.

If not, what happens if you set both fields?

Given that we have the following:

  • Kubeconfig has example.com:6443 (DNS resolves to 93.184.216.34)
  • APIURLOverride is set to bar.com:6443 (DNS resolves to 104.21.1.6)

I think this is what is going to happen:

  • Without apiServerEndpointIP field we will be making TCP connections to 104.21.1.6 at port 6443 and sending HTTP Host: bar.com in headers.
  • With apiServerEndpointIP set to 10.0.0.10 we will be making TCP connections to 10.0.0.10 at port 6443 and sending HTTP Host: bar.com in headers.

All of this ignorance makes me leery of introducing this field to the permanent API we have to support until we retire v1.

We have a couple of ways we could make this field "softer" to mitigate those concerns.

* We could wrap it in an alpha feature gate, so in order to use it, you have to explicitly opt into it in HiveConfig, and you're agreeing it might go away at any point without a version bump. (Of course we would consult with you before considering such a change.)

* We could feed it in via an annotation instead of making it a real API field. That way it's not part of the documented API, and we've got the flexibility to change/remove it if we come up with something better later on.

Thoughts?

I'll leave the decision on how to support this up to you. For ARO it is important to get guarantees that support won't get dropped and that this thing is tested (minimising chances of a regression).

@2uasimojo
Copy link
Member

Without a custom dialer we will be making a TCP connection to 93.184.216.34 at port 6443 and sending an HTTP request with Host: example.com header.

With apiServerEndpointIP set to say 10.0.0.10 we will be using a custom dialer and making a TCP connection to 10.0.0.10 at port 6443 (original port from the kubeconfig). We still will be sending a HTTP request with Host: example.com header.

And then what happens? Does the other end of the connection use example.com from that header? Does the DNS-resolved address 93.184.216.34 ever get used again?

IIUC the point is to allow the console and cluster workloads to be public, but hide the API behind a private endpoint, right? Hypothetically, would the effect here be the same as if DNS resolved console.example.com to 93.184.216.XX and api.example.com to 10.0.0.10?


Regarding unit testing, I thought I had left this comment last week, sorry:
We should be able to do similar to this just to prove that when the override is set in the ClusterDeployment, it makes its way into the REST Config. Though without some fancy mocking, I guess the only thing we would be able to check is that Config.Dial isn't nil...

@m1kola
Copy link
Member Author

m1kola commented Jul 18, 2022

And then what happens? Does the other end of the connection use example.com from that header? Does the DNS-resolved address 93.184.216.34 ever get used again?

There could potentially be multiple hosts at the same IP address. It is like setting up example.com and bar.com to be served by the same nginx at 10.0.0.10. If we do curl https://10.0.0.10/ we will end up with a request which as Host: 10.0.0.10 and nginx will have no idea wether it is a request for example.com or for bar.com.

IIUC the point is to allow the console and cluster workloads to be public, but hide the API behind a private endpoint, right? Hypothetically, would the effect here be the same as if DNS resolved console.example.com to 93.184.216.XX and api.example.com to 10.0.0.10?

We also support fully private clusters where both api server and ingress are private. To access the cluster you have to be on the same network as the cluster.

From ARO resource provider perspective we treat all the cluster as if they were private: we connect to all of them (even public ones) via private endpoint.

@m1kola m1kola force-pushed the azure-private-endpoint-support branch from 0018c56 to 5e3b2e4 Compare July 18, 2022 16:26
@m1kola m1kola requested a review from 2uasimojo July 18, 2022 16:27
@m1kola
Copy link
Member Author

m1kola commented Jul 18, 2022

@2uasimojo I updated Test_builder_RESTConfig to check for nil/not nil and added Test_DialContext to check that we correctly replace the address.

Please take another look.

Copy link
Member

@2uasimojo 2uasimojo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dig the unit tests ✓

After talking with the team:
We're cool making it a real API field, as it seems like it would be generally useful,
BUT
We (okay, I) would like to understand why this wouldn't be solved simply by repointing DNS for the API server URL to the desired IP

Inline suggestions for a couple of renames.

Comment on lines 672 to 675
// APIServerEndpointIP is the optional override of the API server IP address.
// Hive will use this IP address for creating TCP connections.
// +optional
APIServerEndpointIP string `json:"apiServerEndpointIP,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked about this with the team.
We would like to propose that the name include the word "Override"; and suggest that "Endpoint" is probably redundant.
Probably also useful to mention that the port from the original API server URL will be used.

Suggested change
// APIServerEndpointIP is the optional override of the API server IP address.
// Hive will use this IP address for creating TCP connections.
// +optional
APIServerEndpointIP string `json:"apiServerEndpointIP,omitempty"`
// APIServerIPOverride is the optional override of the API server IP address.
// Hive will use this IP address (with the port from the original API URL) for creating TCP connections.
// +optional
APIServerIPOverride string `json:"apiServerEndpointIP,omitempty"`


type dialerMock func(ctx context.Context, network, address string)

func (d dialerMock) DialContext(ctx context.Context, network, address string) (net.Conn, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so: This DialContext gets called by the DialContext in remoteclient.go:345 as a method on the dialer we pass into the DialContext called in the test driver (L352 below) which is actually calling the DialContext in remoteclient.go:334 -- phew!

I think this would be (slightly) more scrutable if the func in remoteclient.go:334 were named something different, like CreateDialContext.

Copy link
Member Author

@m1kola m1kola Jul 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@2uasimojo Np, I renamed to createDialContext.

CreateDialContext will be returning DialContext which is a wrapper around go's standard library DialContext from Dialer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also noticed that in this repo mocks are being generated with GoMock for every interface. I was going to use these mocks, but then realised that I cant: there will be a circular import.

@m1kola m1kola force-pushed the azure-private-endpoint-support branch from 5e3b2e4 to 70383b9 Compare July 19, 2022 09:48
Copy link
Member Author

@m1kola m1kola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We (okay, I) would like to understand why this wouldn't be solved simply by repointing DNS for the API server URL to the desired IP

We were considering solving it with the DNS however the issue is that we have customers who:

  1. Use our managed DNS aroapp.io
  2. Who use their own custom domain.

To handle clusters with custom domains we would have to create private DNS zone on the ARO RP side and there is a max limit of 1k private DNS zones per subscription. And we group multiple regions supported by ARO RP within a single subscription. So in every ARO RP sub, we would have an upper limit of 1k clusters with custom DNS. Sooner or later we will hit this limit.


type dialerMock func(ctx context.Context, network, address string)

func (d dialerMock) DialContext(ctx context.Context, network, address string) (net.Conn, error) {
Copy link
Member Author

@m1kola m1kola Jul 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@2uasimojo Np, I renamed to createDialContext.

CreateDialContext will be returning DialContext which is a wrapper around go's standard library DialContext from Dialer.

@@ -51,6 +53,10 @@ type Builder interface {
UseSecondaryAPIURL() Builder
}

type Dialer interface {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will make Dialer interface and DialContext/CreateDialContext non-exported. These are internal implementation details and don't need to be used externally.


type dialerMock func(ctx context.Context, network, address string)

func (d dialerMock) DialContext(ctx context.Context, network, address string) (net.Conn, error) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also noticed that in this repo mocks are being generated with GoMock for every interface. I was going to use these mocks, but then realised that I cant: there will be a circular import.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 19, 2022

@m1kola: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@m1kola m1kola requested a review from 2uasimojo July 19, 2022 13:40
Copy link
Member

@2uasimojo 2uasimojo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally this is good, so I'm happy to land it. Thanks for picking this up @m1kola!
Couple items for a followon PR:

  • Let's add to the docstring a statement like, "This field can be used when repointing the APIServer's DNS is not viable option."
  • I'd like to clean up the mocking. I think moving the Dialer interface to a separate file would probably be the first move; then we can either continue to generate that mock if we're going to use it (moving the interface should let us avoid the circular import) or just get rid of it.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 19, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 19, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, m1kola

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 19, 2022
@2uasimojo
Copy link
Member

/override tide

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 19, 2022

@2uasimojo: Overrode contexts on behalf of 2uasimojo: tide

In response to this:

/override tide

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@2uasimojo
Copy link
Member

This passed CI and nothing has merged since. ARO team is waiting on it, so I'm going to merge it manually. Otherwise it'd end up retesting.

@2uasimojo 2uasimojo merged commit c63c9b0 into openshift:master Jul 19, 2022
@m1kola m1kola deleted the azure-private-endpoint-support branch July 19, 2022 14:23
m1kola added a commit to m1kola/ARO-RP that referenced this pull request Jul 19, 2022
It includes support for private endpoint: openshift/hive#1817
m1kola added a commit to m1kola/ARO-RP that referenced this pull request Jul 19, 2022
It includes support for private endpoint: openshift/hive#1817
@@ -289,6 +295,14 @@ func (b *builder) RESTConfig() (*rest.Config, error) {
}
}

if override := b.cd.Spec.ControlPlaneConfig.APIServerIPOverride; override != "" {
dialer := &net.Dialer{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nts: net.Dialer implements DialContext -- that's the prod path. We define the interface so we can swap out the impl for UT.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants