RFD 128: AWS End-To-End tests #27156
Conversation
Force-pushed from 1165b77 to 880c27b
Force-pushed from 5455b0c to e549e7a
Force-pushed from e549e7a to d5aa58d
rfd/0128-aws-integration-tests.md (Outdated)
### GitHub OIDC Provider

The GitHub OIDC provider will be used to authenticate with the AWS API. This will allow us to use GitHub Actions to run the integration tests without having to manage AWS credentials. GitHub Actions will be allowed to assume a set of roles that simulate the minimum permissions each Teleport service requires to perform the required actions. An important consideration is that all OIDC tokens from GitHub that don't belong to the Teleport Enterprise Account actions, or that originated from repositories other than `gravitational/teleport`, must be rejected. This includes tokens from other GitHub organizations and tokens from the Teleport Enterprise Account that don't belong to the GitHub Actions pipeline.
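This token scoping is usually enforced in the IAM role's trust policy by matching the OIDC token's `sub` claim. A hedged sketch of what that condition could look like (the account ID and provider ARN here are illustrative placeholders, not the real CI account):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:gravitational/teleport:*"
        }
      }
    }
  ]
}
```

The `StringLike` condition on `sub` is what rejects tokens originating from other repositories or organizations.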
Does it mean that a single individual user won't be able to run these tests on a local machine? I think that we should consider providing a fallback mechanism and allowing a user outside the Teleport organisation to run these tests.
This only specifies the scope for the GitHub Actions flows.
Individuals with access to Teleport AWS accounts can always create EC2 instances with the associated IAM policies for testing and debugging purposes.
that we don't introduce regressions by changing the permissions required by Teleport. To achieve this, GitHub Actions will be allowed to assume a set of roles that simulate the minimum permissions each Teleport service requires to perform the required actions. Teleport services will be
Does the Kube Service/Kube Agent support the STS AssumeRole call in the EKS kube access flow?
I think this feature is not scoped only to the integration tests flow and could also be used in the case of one Kube Service accessing multiple EKS AWS accounts, so it is worth handling/tracking it in a separate issue.
Kubernetes service will support assuming role as part of the matcher selector.
I imagined something like:

```yaml
kubernetes_service:
  enabled: "yes"
  resources:
    - tags:
        "env": "prod"
      assume_aws_role: ...
```
Would we use our …

I'm more inclined to restrict the resources the IAM roles will access directly through their ARN. So, even if you use `'*': '*'` matchers, only the visible resources can be used. Regarding the AWS account, I think having a dedicated account for E2E tests is easier, but I am not sure about the implications and caveats of adding another account.
nit: consider replacing the "integration test" terminology with "end-to-end" or "e2e" throughout the RFD
@smallinsky this RFD essentially describes everything we will require for DB service e2e tests - we just need to define the things we want to test
First pass.
rfd/0128-aws-integration-tests.md (Outdated)
## When

AWS integration tests will be added to the Teleport CI pipeline as part of the cronjob that triggers integrations each day. During the release process, the
During the release process, the integration tests will be run automatically as part of the release process.
I'm not sure about this tbh. I understand the intent, but IMO blocking a release on the integration tests failure is suboptimal. I think that if something fails during release process, the issue was detected too late, and also imagine we need to push an important release we can't delay or something.
I know we originally discussed running these nightly but thinking some more it would make sense to see if we can run them on each PR; it's always better to have a fast feedback loop. I think it should be doable if we design the tests themselves so they don't conflict.
WDYT?
+1 for not blocking releases.
I think if we run them on every PR, we may need to restrict what our tests can do so they don't interfere with each other. I suppose if we can work around that, and the tests are relatively quick, then I'd be in favor of running on PR.
I was wondering if we could use AWS CloudFormation to create the testing environment 🤔
We deploy everything we need there: EKS, EC2, DBs, ... and wait for it to complete
We then start running the tests against those resources
Something like this
https://cluster-api-aws.sigs.k8s.io/development/e2e.html
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/6705ad80eb1c3fe47036505485f3c24141e88a59/Makefile#L407
Creating those resources is a slow process for most of them (we know that EKS alone takes around 20 minutes 😭), so I'm not sure doing this for every PR is feasible.
I would suggest we start with a nightly job, only for `master` and `branch/vX`, plus an on-demand way of running it against a specific commit or branch.
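The nightly-plus-on-demand trigger suggested above can be sketched as a GitHub Actions workflow (the workflow name, cron time, and `make` target are placeholders, not an agreed-upon setup):

```yaml
name: aws-e2e-tests

on:
  schedule:
    # Nightly run against the default branch (cron times are UTC).
    - cron: "0 4 * * *"
  # On-demand runs against a specific branch or commit, selected by
  # the ref the workflow is dispatched on.
  workflow_dispatch:

permissions:
  id-token: write   # required to mint the GitHub OIDC token for AWS
  contents: read

jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run AWS E2E tests
        run: make e2e-aws   # placeholder target
```

Release branches (`branch/vX`) would get the same workflow via the usual branch-workflow mechanism, since scheduled workflows only fire from the default branch.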
@r0mant I am ok running tests for each PR, but it is a bit less secure since we would allow executing E2E tests for code that hasn't been properly reviewed yet.
To improve security, if the contributor is external, we only allow E2E to run once the PR is approved.
For kube access, if we want to test the token reload once it's about to expire, we need to wait 15+ min.
That part is already covered by unit tests, so it's probably ok to remove it from the scope of E2E.
rfd/0128-aws-integration-tests.md (Outdated)
already exists, it will be updated with the new labels.
- The Kubernetes service receives the `kube_cluster` object and can generate a short-lived token to authenticate with the Kubernetes API.
- The Kubernetes service can forward the requests to the Kubernetes API.
How will we be testing client connections? Can we use actual `kubectl` to simulate end users?
Same goes for the database access cc @smallinsky. I wonder if we can build a Docker image with all the client tools we need (kubectl, psql, mysql, etc.) and use it to connect, instead of relying on drivers. This way we will also be able to test specific client version if needed.
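Sketching the suggested client-tools image (the base image, package names, and pinned kubectl version here are illustrative placeholders, not an agreed-upon setup):

```dockerfile
# Client tools image for e2e connection tests: kubectl, psql, mysql.
FROM debian:bookworm-slim

# Pin the client version so tests can target a specific release.
ARG KUBECTL_VERSION=v1.27.3

RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        ca-certificates curl postgresql-client default-mysql-client \
    && curl -fsSLo /usr/local/bin/kubectl \
        "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl" \
    && chmod +x /usr/local/bin/kubectl \
    && rm -rf /var/lib/apt/lists/*
```

Changing the `KUBECTL_VERSION` build arg (or the psql/mysql packages) is how a specific client version would be exercised.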
`tsh` ships `tsh kubectl`, so I was planning to use it to get secrets/services since both objects exist even if no cluster nodes exist.
rfd/0128-aws-integration-tests.md (Outdated)
the account that matches the configured labels and then using the Kubernetes API to forward the requests to the cluster.

The first step of this process happens on the Teleport Discovery service and
Can you refactor this section to be more specific please w/r/t what kind of Teleport services will be provisioned and what EKS clusters it will be discovering. Basically, instead of explaining internals (i.e. kube_cluster objects, etc.) I would focus on explaining test scenarios. For example:
- The test will start discovery_service with config "xxx" and kube_service with config "yyy".
- Discovery service will look for EKS clusters with tags "a: b".
- The test will expect the EKS cluster "a: b" to join and will execute "kubectl get pods" on it.
And so on.
Updated in 9d98ad5
rfd/0128-aws-integration-tests.md (Outdated)
Each integration test will use the minimum required AWS API permissions to test the integration. This ensures that we don't miss regressions where the permissions required by Teleport change but go undetected because another test happens to require the same permissions.
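The minimum-permission idea can be illustrated with the Discovery test's policy; the two actions are the ones this PR's description lists for the Discovery service, while the wildcard resource is a placeholder (the real policy can pin specific cluster ARNs):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "eks:ListClusters",
        "eks:DescribeCluster"
      ],
      "Resource": "*"
    }
  ]
}
```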
I think another argument in favor of keeping the minimum required permissions, is that we keep that list up to date and well documented.
This way we can always provide the required policies to customers 👍
Teleport integrates deeply with AWS to provide a seamless experience for users when using auto-discovery of nodes, databases and Kubernetes clusters. This integration is critical for the success of Teleport and we want to ensure that we can test it reliably.
There's also AWS OIDC Integration.
Although not as critical (for now 😎), I think it is still worth mentioning, even if only to say that it's currently out of scope.
I'm not aware of other features using AWS, but if they exist I think we should be explicit about them.
to generate the short-lived token and forward the requests to the Kubernetes API. [IAM Mapping](https://goteleport.com/docs/kubernetes-access/discovery/aws/#iam-mapping)

Spinning up EKS clusters takes a long time and it's not feasible to do it for
We have discussed this on the engineering offsite and one change to this system we'd like to make is instead of maintaining a set of persistent resources which can become obsolete, we want to recreate these on some cadence (e.g. daily) so we don't have to worry about patch management, updates, etc.
Can you please update the RFD to that effect? Basically, running "terraform destroy/apply" that recreates these on a daily basis.
Sure. Will do once I am back
if we run e2e tests on every PR, how will we handle this routine recreation of the test infrastructure?
Updated the RFD to include the recreate strategy on weekends to minimize the impact on development for AU/EU/US folks.
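One way the weekend destroy/apply cycle could be automated is a scheduled CI job along these lines (a sketch only; the workflow name, working directory, and schedule are placeholders, and other schedulers such as Spacelift can run the same commands):

```yaml
name: recreate-e2e-infra

on:
  schedule:
    # Sundays at 08:00 UTC, i.e. 12:00 AM PST.
    - cron: "0 8 * * 0"

jobs:
  recreate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - name: Destroy and recreate the test infrastructure
        working-directory: infra/e2e   # placeholder path
        run: |
          terraform init
          terraform destroy -auto-approve
          terraform apply -auto-approve
```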
@tigrato thinking about setting up E2E tests - do you plan to start the teleport cluster using test suite helpers, or via config file and actually doing …?

Also, have you given thought to the test environment? I'm writing these kinds of tests for DB access now, but I'd like to have the teleport cluster work with a locally trusted CA cert so I don't have to pass …
lgtm, just need a bit more details about how we're going to handle infrastructure recreation automation
rfd/0128-aws-integration-tests.md (Outdated)
deployed on a single availability zone is enough. To ensure that the EKS cluster is up to date and we don't have to deal with cluster upgrades having major impacts on development, we will run the Terraform script to destroy and recreate the cluster during the weekend. This ensures that the EKS cluster is available for the E2E tests during the week (with minor downtime on weekends) and that we don't have to deal with cluster upgrades/security patches.
@reedloden @wadells @jentfoo We're going to be maintaining persistent set of cloud resources in a CI AWS account for these new end-to-end integration tests. The plan is to be re-provisioning them weekly so we don't need to worry about patching. Do you have any suggestions on whats the best way to automate this process? Is this something that can be handled by Spacelift? Or do we just build a separate GHA workflow that runs on cron schedule?
We opted to use the Spacelift scheduled task to recreate the stack every Sunday at 12:00 AM PST.
This PR enables AWS E2E integration tests for EKS auto-discovery. The process uses GitHub's OIDC connector to access the AWS API by assuming the `arn:aws:iam::307493967395:role/tf-aws-e2e-gha-role` role.

```yaml
- name: Configure AWS Credentials
  uses: aws-actions/configure-aws-credentials@v2
  with:
    aws-region: ${{ env.AWS_REGION }}
    role-to-assume: ${{ env.GHA_ASSUME_ROLE }}
```

The `aws-actions/configure-aws-credentials` action generates a new ID token with the required information and signs it using GitHub's OIDC workflow.

The role `arn:aws:iam::307493967395:role/tf-aws-e2e-gha-role` is an intermediate role that allows the runner to assume two distinct roles:

- `arn:aws:iam::307493967395:role/tf-eks-discovery-ci-cluster-kubernetes-service-access-role` - used by the Kubernetes Service
- `arn:aws:iam::307493967395:role/tf-eks-discovery-ci-cluster-discovery-service-access-role` - used by the Discovery Service

The Discovery Service will assume the role `arn:aws:iam::307493967395:role/tf-eks-discovery-ci-cluster-discovery-service-access-role`, whose policy allows:

- `eks:ListClusters`
- `eks:DescribeCluster`

These are the minimal permissions required to list the available clusters and retrieve their state and labels. The Teleport Discovery Service will pull the available EKS clusters and, for each cluster to import, create a `kube_cluster` object in the Auth Server.

Once the cluster is discovered and the `kube_cluster` exists in the Auth Server, the Teleport Kubernetes Service will start proxying the cluster. For that, it must pull the cluster API endpoint and its CA data to create a client. The role `arn:aws:iam::307493967395:role/tf-eks-discovery-ci-cluster-kubernetes-service-access-role` allows the Kubernetes Service to describe the cluster and retrieve its details:

- `eks:DescribeCluster`

The IAM role used by the Kubernetes Service must be mapped to a Kubernetes Group that allows impersonation in order to be able to proxy requests with the user's permissions.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: teleport-role
rules:
- apiGroups:
  - ""
  resources:
  - users
  - groups
  - serviceaccounts
  verbs:
  - impersonate
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
- apiGroups:
  - "authorization.k8s.io"
  resources:
  - selfsubjectaccessreviews
  - selfsubjectrulesreviews
  verbs:
  - create
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: teleport-crb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: teleport-role
subjects:
- kind: Group
  name: ${group_name}
```

During the cluster provisioning phase, we mapped the Kubernetes Service IAM role into the Kubernetes Group `${group_name}`:

```yaml
mapRoles:
- groups:
  - ${group_name}
  rolearn: arn:aws:iam::307493967395:role/tf-eks-discovery-ci-cluster-kubernetes-service-access-role
  username: "teleport:{{SessionName}}"
```

The final step is to validate that the client works correctly and that the Kubernetes Service was able to generate a valid token that can impersonate Kubernetes groups and users. For that, we simulate a user calling `kubectl get services -n default` through Teleport, which must return one entry, the default service `kubernetes`.

Implements #27156
Force-pushed from 294beaf to 5f6c88d