RFD 128: AWS End-To-End tests #27156
Conversation
Force-pushed from 1165b77 to 880c27b
Force-pushed from 5455b0c to e549e7a
Force-pushed from e549e7a to d5aa58d
rfd/0128-aws-integration-tests.md (Outdated)
### GitHub OIDC Provider

The GitHub OIDC provider will be used to authenticate with the AWS API. This will allow us to use GitHub Actions to run the integration tests without having to manage AWS credentials. GitHub Actions will be allowed to assume a set of roles that simulate the minimum permissions each Teleport service requires to perform the required actions. An important consideration is that all OIDC tokens from GitHub that don't belong to the Teleport Enterprise Account actions, or that originated from repositories other than `gravitational/teleport`, must be rejected. This includes tokens from other GitHub organizations and tokens from the Teleport Enterprise Account that don't belong to the GitHub Actions pipeline.
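This token scoping is usually enforced in the IAM role's trust policy by matching the OIDC token's `sub` claim. A hedged sketch of what that condition could look like (the account ID and provider ARN here are illustrative placeholders, not the real CI account):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:gravitational/teleport:*"
        }
      }
    }
  ]
}
```

The `StringLike` condition on `sub` is what rejects tokens originating from other repositories or organizations.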
Does it mean that a single individual user won't be able to run these tests on a local machine? I think that we should consider providing a fallback mechanism and allowing a user outside the Teleport organisation to run these tests.
This only specifies the scope for the GitHub Actions flows.
Individuals with access to Teleport AWS accounts can always create EC2 instances with the associated IAM policies for testing and debugging purposes.
that we don't introduce regressions by changing the permissions required by Teleport. To achieve this, GitHub Actions will be allowed to assume a set of roles that simulate the minimum permissions each Teleport service requires to perform the required actions. Teleport services will be
Does the Kube Service/Kube Agent support the STS AssumeRole call in the EKS kube access flow?
I think this feature is not scoped only to the integration tests flow and could also be used in the case of one Kube Service accessing multiple EKS AWS accounts, so it is worth handling/tracking it in a separate issue.
Kubernetes service will support assuming role as part of the matcher selector.
I imagined something like:

```yaml
kubernetes_service:
  enabled: "yes"
  resources:
    - tags:
        "env": "prod"
      assume_aws_role: ...
```
Would we use our …

I'm more inclined to restrict the resources the IAM roles will access directly through their ARN. So, even if you use `'*': '*'` matchers, only the visible resources can be used. Regarding the AWS account, I think having a dedicated account for E2E tests is easier, but I am not sure about the implications and caveats of adding another account.
nit: consider replacing the "integration test" terminology with "end-to-end" or "e2e" throughout the RFD
@smallinsky this RFD essentially describes everything we will require for DB service e2e tests - we just need to define the things we want to test
First pass.
rfd/0128-aws-integration-tests.md (Outdated)
## When

AWS integration tests will be added to the Teleport CI pipeline as part of the cronjob that triggers integrations each day. During the release process, the
During the release process, the integration tests will be run automatically as part of the release process.
I'm not sure about this tbh. I understand the intent, but IMO blocking a release on the integration tests failure is suboptimal. I think that if something fails during release process, the issue was detected too late, and also imagine we need to push an important release we can't delay or something.
I know we originally discussed running these nightly but thinking some more it would make sense to see if we can run them on each PR; it's always better to have a fast feedback loop. I think it should be doable if we design the tests themselves so they don't conflict.
WDYT?
+1 for not blocking releases.
I think if we run them on every PR, we may need to restrict what our tests can do so they don't interfere with each other. I suppose if we can work around that, and the tests are relatively quick, then I'd be in favor of running on PR.
I was wondering if we could use AWS CloudFormation to create the testing environment 🤔
We deploy everything we need there: EKS, EC2, DBs, ... and wait for it to complete
We then start running the tests against those resources
Something like this
https://cluster-api-aws.sigs.k8s.io/development/e2e.html
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/6705ad80eb1c3fe47036505485f3c24141e88a59/Makefile#L407
Creating those resources is a slow process for most of them (we know that EKS alone takes around 20 minutes 😭), so I'm not sure doing this for every PR is feasible.
I would suggest we start with a nightly job, only for `master` and `branch/vX`, plus an on-demand way of running it against a specific commit or branch.
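The nightly-plus-on-demand trigger suggested above can be sketched as a GitHub Actions workflow (the workflow name, cron time, and `make` target are placeholders, not an agreed-upon setup):

```yaml
name: aws-e2e-tests

on:
  schedule:
    # Nightly run against the default branch (cron times are UTC).
    - cron: "0 4 * * *"
  # On-demand runs against a specific branch or commit, selected by
  # the ref the workflow is dispatched on.
  workflow_dispatch:

permissions:
  id-token: write   # required to mint the GitHub OIDC token for AWS
  contents: read

jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run AWS E2E tests
        run: make e2e-aws   # placeholder target
```

Release branches (`branch/vX`) would get the same workflow via the usual branch-workflow mechanism, since scheduled workflows only fire from the default branch.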
@r0mant I am ok running tests for each PR, but it is a bit less secure since we would allow executing E2E tests for code that hasn't been properly reviewed yet.
To improve security, if the contributor is external, we only allow E2E to run once the PR is approved.
For kube access, if we want to test the token reload once it's about to expire, we need to wait 15+ min.
That part is already covered by unit tests, so it's probably ok to remove it from the scope of E2E.
rfd/0128-aws-integration-tests.md (Outdated)
already exists, it will be updated with the new labels.
- The Kubernetes service receives the `kube_cluster` object and can generate a short-lived token to authenticate with the Kubernetes API.
- The Kubernetes service can forward the requests to the Kubernetes API.
How will we be testing client connections? Can we use actual `kubectl` to simulate end users?
Same goes for the database access cc @smallinsky. I wonder if we can build a Docker image with all the client tools we need (kubectl, psql, mysql, etc.) and use it to connect, instead of relying on drivers. This way we will also be able to test specific client version if needed.
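Sketching the suggested client-tools image (the base image, package names, and pinned kubectl version here are illustrative placeholders, not an agreed-upon setup):

```dockerfile
# Client tools image for e2e connection tests: kubectl, psql, mysql.
FROM debian:bookworm-slim

# Pin the client version so tests can target a specific release.
ARG KUBECTL_VERSION=v1.27.3

RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        ca-certificates curl postgresql-client default-mysql-client \
    && curl -fsSLo /usr/local/bin/kubectl \
        "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl" \
    && chmod +x /usr/local/bin/kubectl \
    && rm -rf /var/lib/apt/lists/*
```

Changing the `KUBECTL_VERSION` build arg (or the psql/mysql packages) is how a specific client version would be exercised.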
`tsh` ships `tsh kubectl`, so I was planning to use it to get secrets/services since both objects exist even if no cluster nodes exist.
rfd/0128-aws-integration-tests.md (Outdated)
the account that matches the configured labels and then using the Kubernetes API to forward the requests to the cluster.

The first step of this process happens on the Teleport Discovery service and
Can you refactor this section to be more specific please w/r/t what kind of Teleport services will be provisioned and what EKS clusters it will be discovering. Basically, instead of explaining internals (i.e. kube_cluster objects, etc.) I would focus on explaining test scenarios. For example:
- The test will start discovery_service with config "xxx" and kube_service with config "yyy".
- Discovery service will look for EKS clusters with tags "a: b".
- The test will expect the EKS cluster "a: b" to join and will execute "kubectl get pods" on it.
And so on.
Updated in 9d98ad5
rfd/0128-aws-integration-tests.md (Outdated)
Each integration test will use the minimum required AWS API permissions to test the integration. This ensures that we don't miss regressions where the permissions required by Teleport change but go undetected because another test happens to require the same permissions.
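The minimum-permission idea can be illustrated with the Discovery test's policy; the two actions are the ones this PR's description lists for the Discovery service, while the wildcard resource is a placeholder (the real policy can pin specific cluster ARNs):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "eks:ListClusters",
        "eks:DescribeCluster"
      ],
      "Resource": "*"
    }
  ]
}
```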
I think another argument in favor of keeping the minimum required permissions, is that we keep that list up to date and well documented.
This way we can always provide the required policies to customers 👍
Teleport integrates deeply with AWS to provide a seamless experience for users when using auto-discovery of nodes, databases and Kubernetes clusters. This integration is critical for the success of Teleport and we want to ensure that we can test it reliably.
There's also AWS OIDC Integration.
Although not as critical (for now 😎), I think it is still worth mentioning, even if only to say that it's currently out of scope.
I'm not aware of other features using AWS, but if they exist I think we should be explicit about them.
to generate the short-lived token and forward the requests to the Kubernetes API. [IAM Mapping](https://goteleport.com/docs/kubernetes-access/discovery/aws/#iam-mapping)

Spinning up EKS clusters takes a long time and it's not feasible to do it for
We have discussed this on the engineering offsite and one change to this system we'd like to make is instead of maintaining a set of persistent resources which can become obsolete, we want to recreate these on some cadence (e.g. daily) so we don't have to worry about patch management, updates, etc.
Can you please update the RFD to that effect? Basically, running "terraform destroy/apply" that recreates these on a daily basis.
Sure. Will do once I am back
if we run e2e tests on every PR, how will we handle this routine recreation of the test infrastructure?
Updated the RFD to include the recreate strategy on weekends to minimize the impact on development for AU/EU/US folks.
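One way the weekend destroy/apply cycle could be automated is a scheduled CI job along these lines (a sketch only; the workflow name, working directory, and schedule are placeholders, and other schedulers such as Spacelift can run the same commands):

```yaml
name: recreate-e2e-infra

on:
  schedule:
    # Sundays at 08:00 UTC, i.e. 12:00 AM PST.
    - cron: "0 8 * * 0"

jobs:
  recreate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - name: Destroy and recreate the test infrastructure
        working-directory: infra/e2e   # placeholder path
        run: |
          terraform init
          terraform destroy -auto-approve
          terraform apply -auto-approve
```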
@tigrato thinking about setting up E2E tests - do you plan to start the teleport cluster using test suite helpers, or via config file and actually doing …?

Also, have you given thought to the test environment? I'm writing these kinds of tests for DB access now, but I'd like to have the teleport cluster work with a locally trusted CA cert so I don't have to pass …
lgtm, just need a bit more details about how we're going to handle infrastructure recreation automation
rfd/0128-aws-integration-tests.md (Outdated)
deployed on a single availability zone is enough. To ensure that the EKS cluster is up to date and we don't have to deal with cluster upgrades having major impacts on development, we will run the Terraform script to destroy and recreate the cluster during the weekend. This ensures that the EKS cluster is available for the E2E tests during the week (with minor downtime on weekends) and that we don't have to deal with cluster upgrades/security patches.
@reedloden @wadells @jentfoo We're going to be maintaining persistent set of cloud resources in a CI AWS account for these new end-to-end integration tests. The plan is to be re-provisioning them weekly so we don't need to worry about patching. Do you have any suggestions on whats the best way to automate this process? Is this something that can be handled by Spacelift? Or do we just build a separate GHA workflow that runs on cron schedule?
We opted to use the Spacelift scheduled task to recreate the stack every Sunday at 12:00 AM PST.
This PR enables AWS E2E integration tests for EKS auto-discovery. The process uses GitHub's OIDC connector to access the AWS API by assuming the `arn:aws:iam::307493967395:role/tf-aws-e2e-gha-role` role.

```yaml
- name: Configure AWS Credentials
  uses: aws-actions/configure-aws-credentials@v2
  with:
    aws-region: ${{ env.AWS_REGION }}
    role-to-assume: ${{ env.GHA_ASSUME_ROLE }}
```

The `aws-actions/configure-aws-credentials` action generates a new ID token with the required information and signs it using GitHub's OIDC workflow.

The role `arn:aws:iam::307493967395:role/tf-aws-e2e-gha-role` is an intermediate role that allows the runner to assume two distinct roles:

- `arn:aws:iam::307493967395:role/tf-eks-discovery-ci-cluster-kubernetes-service-access-role` - used by the Kubernetes Service
- `arn:aws:iam::307493967395:role/tf-eks-discovery-ci-cluster-discovery-service-access-role` - used by the Discovery Service

The Discovery Service will assume the role `arn:aws:iam::307493967395:role/tf-eks-discovery-ci-cluster-discovery-service-access-role`, whose policy allows:

- `eks:ListClusters`
- `eks:DescribeCluster`

These are the minimal permissions required to list the available clusters and retrieve their state and labels. The Teleport Discovery Service will pull the available EKS clusters and, for each cluster to import, create a `kube_cluster` object in the Auth Server.

Once the cluster is discovered and the `kube_cluster` exists in the Auth Server, the Teleport Kubernetes Service will start proxying the cluster. For that, it must pull the cluster API endpoint and its CA data to create a client. The role `arn:aws:iam::307493967395:role/tf-eks-discovery-ci-cluster-kubernetes-service-access-role` allows the Kubernetes Service to describe the cluster and retrieve its details:

- `eks:DescribeCluster`

The IAM role used by the Kubernetes Service must be mapped to a Kubernetes Group that allows impersonation in order to be able to proxy requests with the user's permissions.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: teleport-role
rules:
- apiGroups:
  - ""
  resources:
  - users
  - groups
  - serviceaccounts
  verbs:
  - impersonate
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
- apiGroups:
  - "authorization.k8s.io"
  resources:
  - selfsubjectaccessreviews
  - selfsubjectrulesreviews
  verbs:
  - create
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: teleport-crb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: teleport-role
subjects:
- kind: Group
  name: ${group_name}
```

During the cluster provisioning phase, we mapped the Kubernetes Service IAM role into the Kubernetes Group `${group_name}`:

```yaml
mapRoles:
- groups:
  - ${group_name}
  rolearn: arn:aws:iam::307493967395:role/tf-eks-discovery-ci-cluster-kubernetes-service-access-role
  username: "teleport:{{SessionName}}"
```

The final step is to validate that the client works correctly and that the Kubernetes Service was able to generate a valid token that can impersonate Kubernetes groups and users. For that, we simulate a user calling `kubectl get services -n default` through Teleport, which must return one entry, the default service `kubernetes`.

Implements #27156
Force-pushed from 294beaf to 5f6c88d