
RFD 128: AWS End-To-End tests #27156

Merged 1 commit from rfd/0128-aws-integration-tests into master on Jul 19, 2023

Conversation

@tigrato (Contributor) commented May 31, 2023

@tigrato force-pushed the rfd/0128-aws-integration-tests branch 2 times, most recently from 1165b77 to 880c27b on May 31, 2023 13:47
@smallinsky self-requested a review on May 31, 2023 13:52
rfd/0128-aws-integration-tests.md: 3 review threads (outdated, resolved)
@tigrato force-pushed the rfd/0128-aws-integration-tests branch 2 times, most recently from 5455b0c to e549e7a on June 5, 2023 10:19
@tigrato changed the title from "RFD 128: AWS Integration tests" to "RFD 128: AWS End-To-End tests" on Jun 5, 2023
@tigrato requested a review from r0mant on June 5, 2023 12:26
@tigrato marked this pull request as ready for review on June 5, 2023 12:26
@tigrato requested a review from smallinsky on June 5, 2023 12:26
@github-actions bot added the rfd (Request for Discussion) and size/md labels on Jun 5, 2023
@tigrato force-pushed the rfd/0128-aws-integration-tests branch from e549e7a to d5aa58d on June 5, 2023 12:33
Comment on lines 154 to 218
### GitHub OIDC Provider

The GitHub OIDC provider will be used to authenticate with the AWS API. This
allows us to use GitHub Actions to run the integration tests without having to
manage AWS credentials. GitHub Actions will be allowed to assume a set of roles
that grant the minimum permissions each Teleport service needs to perform its
actions. An important consideration is that any GitHub OIDC token that doesn't
originate from the Teleport Enterprise account's GitHub Actions, or that
originates from a repository other than `gravitational/teleport`, must be
rejected. This includes tokens from other GitHub organizations and tokens from
the Teleport Enterprise account that don't belong to the GitHub Actions
pipeline.
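
For illustration, a minimal sketch of a trust policy that would enforce this restriction, rendered as YAML for readability (actual IAM trust policies are JSON documents; the account ID is a placeholder):

```yaml
# Sketch only: restricts AssumeRoleWithWebIdentity to OIDC tokens issued
# for workflows in the gravitational/teleport repository.
Version: "2012-10-17"
Statement:
  - Effect: Allow
    Principal:
      Federated: "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/token.actions.githubusercontent.com"
    Action: "sts:AssumeRoleWithWebIdentity"
    Condition:
      StringEquals:
        # Only tokens minted for the AWS STS audience are accepted.
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
      StringLike:
        # Only tokens whose subject is a workflow in gravitational/teleport.
        "token.actions.githubusercontent.com:sub": "repo:gravitational/teleport:*"
```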
Contributor:
Does it mean that a single individual user won't be able to run these tests on a local machine? I think we should consider providing a fallback mechanism that allows a user outside the Teleport organisation to run these tests.

Contributor (Author):

This only specifies the scope for the GitHub Actions flows.

Individuals with access to Teleport AWS accounts can always create EC2 instances with the associated IAM policies for testing and debugging purposes.

that we don't introduce regressions by changing the permissions required by
Teleport. To achieve this, GitHub Actions will be allowed to assume a set of
roles that grant the minimum permissions each Teleport service requires to
perform its actions. Teleport services will be
Contributor:
Does the Kube Service/Kube Agent support an STS AssumeRole call in the EKS kube access flow?

I think this feature is not scoped only to the integration tests flow; it could also serve the case of a single Kube Service accessing EKS clusters across multiple AWS accounts, so it's worth handling/tracking in a separate issue.

Contributor (Author):

The Kubernetes service will support assuming a role as part of the matcher selector.

I imagined something like:

```yaml
kubernetes_service:
  enabled: "yes"
  resources:
  - tags:
      "env": "prod"
    assume_aws_role: ...
```

@GavinFrazar (Contributor):
Would we use our teleport-dev AWS account for this? And can we expand the infra requirements to include a unique label to match on in the discovery service, such that only the intended EKS cluster(s) are tested against?

@tigrato (Author) commented Jun 6, 2023:

> Would we use our teleport-dev AWS account for this? And can we expand the infra requirements to include a unique label to match on in the discovery service, such that only the intended EKS cluster(s) are tested against?

I'm more inclined to restrict the resources the IAM roles can access directly through their ARNs. So even if you use `'*': '*'` matchers, only the visible resources can be used.

Regarding the AWS account, I think having a dedicated account for E2E tests is easier, but I'm not sure about the implications and caveats of adding another account.

@GavinFrazar (Contributor) left a comment:

nit: consider replacing the "integration test" terminology with "end-to-end" or "e2e" throughout the RFD

@GavinFrazar (Contributor):
@smallinsky this RFD essentially describes everything we will require for DB service e2e tests - we just need to define the things we want to test

@r0mant (Collaborator) left a comment:

First pass.

rfd/0128-aws-integration-tests.md: review thread (outdated, resolved)
## When

AWS integration tests will be added to the Teleport CI pipeline as part of the
cronjob that triggers integrations each day. During the release process, the
Collaborator:
> During the release process, the integration tests will be run automatically as part of the release process.

I'm not sure about this, tbh. I understand the intent, but IMO blocking a release on an integration test failure is suboptimal. If something fails during the release process, the issue was detected too late; also, imagine we need to push an important release we can't delay.

I know we originally discussed running these nightly, but thinking about it some more, it would make sense to see if we can run them on each PR; it's always better to have a fast feedback loop. I think it should be doable if we design the tests so they don't conflict.

WDYT?

Contributor:

+1 for not blocking releases.

I think if we run them on every PR, we may need to restrict what our tests can do so they don't interfere with each other. If we can work around that, and the tests are relatively quick, then I'd be in favor of running on PR.

Contributor:

I was wondering if we could use AWS CloudFormation to create the testing environment 🤔
We deploy everything we need there (EKS, EC2, DBs, ...), wait for it to complete, and then start running the tests against those resources.

Something like this:
https://cluster-api-aws.sigs.k8s.io/development/e2e.html
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/6705ad80eb1c3fe47036505485f3c24141e88a59/Makefile#L407

Creating most of those resources is a slow process (we know that EKS alone takes around 20 minutes 😭), so I'm not sure doing this for every PR is feasible.
I would suggest we start with a nightly job, only for master and branch/vX, plus an on-demand way to run it against a specific commit or branch.

@tigrato (Author) commented Jun 7, 2023:

@r0mant I'm OK with running tests for each PR, but it's a bit less secure, since we'd allow executing E2E tests for code that hasn't been properly reviewed yet.

To improve security, if the contributor is external, we only allow E2E tests to run once the PR is approved.

For kube access, if we want to test the token reload once it's about to expire, we need to wait 15+ minutes.
That part is already covered by unit tests, so it's probably OK to remove it from the scope of E2E.

rfd/0128-aws-integration-tests.md: review thread (outdated, resolved)
already exists, it will be updated with the new labels.
- The Kubernetes service receives the `kube_cluster` object and can generate
a short-lived token to authenticate with the Kubernetes API.
- The Kubernetes service can forward the requests to the Kubernetes API.
Collaborator:
How will we be testing client connections? Can we use actual kubectl to simulate end users?

The same goes for database access, cc @smallinsky. I wonder if we can build a Docker image with all the client tools we need (kubectl, psql, mysql, etc.) and use it to connect, instead of relying on drivers. This way we'd also be able to test specific client versions if needed.

Contributor (Author):
`tsh` ships `tsh kubectl`, so I was planning to use it to get secrets/services, since both objects exist even if no cluster nodes exist.

the account that matches the configured labels and then using the Kubernetes API
to forward the requests to the cluster.

The first step of this process happens on the Teleport Discovery service and
Collaborator:
Can you please refactor this section to be more specific w.r.t. what kind of Teleport services will be provisioned and what EKS clusters they will be discovering? Basically, instead of explaining internals (i.e. `kube_cluster` objects, etc.), I would focus on explaining test scenarios. For example:

  1. The test will start discovery_service with config "xxx" and kube_service with config "yyy".
  2. Discovery service will look for EKS clusters with tags "a: b".
  3. The test will expect the EKS cluster "a: b" to join and will execute "kubectl get pods" on it.

And so on.
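
For illustration, a minimal sketch of what configs "xxx" and "yyy" might look like, following the matcher shape sketched earlier in this thread (the region and tag values are placeholders, not the actual test configuration):

```yaml
# Hypothetical discovery_service config ("xxx"): watch one region for
# EKS clusters tagged "a: b".
discovery_service:
  enabled: "yes"
  aws:
  - types: ["eks"]
    regions: ["us-west-2"]
    tags:
      "a": "b"
# Hypothetical kubernetes_service config ("yyy"): proxy any discovered
# kube_cluster carrying the same label.
kubernetes_service:
  enabled: "yes"
  resources:
  - tags:
      "a": "b"
```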

Contributor (Author):

Updated in 9d98ad5

Comment on lines 32 to 35
Each integration test will use the minimum AWS API permissions required to
test the integration. This ensures that we don't introduce regressions where
the permissions Teleport requires change but the change goes undetected
because another test already grants the same permissions.
Contributor:

I think another argument in favor of keeping the minimum required permissions is that we keep that list up to date and well documented.
This way we can always provide the required policies to customers 👍

Comment on lines +27 to +30
Teleport integrates deeply with AWS to provide a seamless experience for users
when using auto-discovery of nodes, databases and Kubernetes clusters. This
integration is critical for the success of Teleport and we want to ensure that
we can test it reliably.
Contributor:

There's also the AWS OIDC integration. It's not as critical (for now 😎), but I think it's still worth mentioning, even if only to say it's currently out of scope.

I'm not aware of other features using AWS, but if they exist, I think we should be explicit about them.


to generate the short-lived token and forward the requests to the Kubernetes
API. [IAM Mapping](https://goteleport.com/docs/kubernetes-access/discovery/aws/#iam-mapping)

Spinning up EKS clusters takes a long time, and it's not feasible to do it for
Collaborator:
We discussed this at the engineering offsite, and one change we'd like to make to this system is: instead of maintaining a set of persistent resources that can become obsolete, we want to recreate them on some cadence (e.g. daily), so we don't have to worry about patch management, updates, etc.

Can you please update the RFD to that effect? Basically, running `terraform destroy`/`apply` to recreate these on a daily basis.

Contributor (Author):

Sure. Will do once I am back

Contributor:

If we run E2E tests on every PR, how will we handle this routine recreation of the test infrastructure?

Contributor (Author):

b60903b

Updated the RFD to include the recreate strategy on weekends to minimize the impact on development for AU/EU/US folks.

@tigrato requested a review from r0mant on June 21, 2023 10:16
@GavinFrazar (Contributor):
@tigrato, thinking about setting up E2E tests: do you plan to start the Teleport cluster using test-suite helpers, or via a config file and actually running `teleport start`?

Also, have you given thought to the test environment? I'm writing these kinds of tests for DB access now, and I'd like the Teleport cluster to work with a locally trusted CA cert so I don't have to pass `--insecure` in `tsh` commands.

@r0mant (Collaborator) left a comment:

lgtm, just need a bit more details about how we're going to handle infrastructure recreation automation

rfd/0128-aws-integration-tests.md: review thread (outdated, resolved)
Comment on lines 193 to 198
deployed in a single availability zone is enough. To ensure that the EKS cluster
stays up to date and that cluster upgrades don't have a major impact on
development, we will run the Terraform script to destroy and recreate the
cluster during the weekend. This ensures that the EKS cluster is available for
the E2E tests during the week (with minor downtime on weekends) and that we
don't have to deal with cluster upgrades/security patches.
Collaborator:

@reedloden @wadells @jentfoo We're going to be maintaining a persistent set of cloud resources in a CI AWS account for these new end-to-end integration tests. The plan is to re-provision them weekly so we don't need to worry about patching. Do you have any suggestions on the best way to automate this process? Is this something that can be handled by Spacelift? Or do we just build a separate GHA workflow that runs on a cron schedule?

Contributor:

cc @jof and @adaadb6

Contributor (Author):

We opted to use a Spacelift scheduled task to recreate the stack every Sunday at 12:00 AM PST.
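
For reference, if the GHA route had been chosen instead, the weekly recreation could look roughly like the sketch below; the workflow name, secret name, and Terraform invocation are all assumptions, not the Spacelift configuration that was actually adopted:

```yaml
name: recreate-e2e-infra
on:
  schedule:
    - cron: "0 8 * * 0" # Sundays 08:00 UTC == 12:00 AM PST
jobs:
  recreate:
    runs-on: ubuntu-latest
    permissions:
      id-token: write # allow the runner to request a GitHub OIDC token
      contents: read  # needed to check out the Terraform code
    steps:
      - uses: actions/checkout@v3
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-region: us-west-2
          role-to-assume: ${{ secrets.E2E_INFRA_ROLE }} # hypothetical secret
      - name: Recreate the stack
        run: |
          terraform init
          terraform destroy -auto-approve
          terraform apply -auto-approve
```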

tigrato added a commit that referenced this pull request Jul 19, 2023
This PR enables AWS E2E integration tests for EKS auto-discovery.

This process uses GitHub's OIDC connector to access the AWS API by assuming the `arn:aws:iam::307493967395:role/tf-aws-e2e-gha-role` role.

```yaml
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-region: ${{ env.AWS_REGION }}
          role-to-assume: ${{ env.GHA_ASSUME_ROLE }}
```

The `aws-actions/configure-aws-credentials` action requests a new ID token containing the required claims, signed via GitHub's OIDC workflow, and exchanges it for temporary AWS credentials.
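
As a side note (this stanza is not shown in the snippet above), the workflow or job must grant the `id-token` permission for the runner to be able to request that OIDC token; a minimal sketch:

```yaml
permissions:
  id-token: write # required so the job can request a GitHub OIDC ID token
  contents: read  # still needed to check out the repository
```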

The role `arn:aws:iam::307493967395:role/tf-aws-e2e-gha-role` is an intermediate role that allows the runner to assume two distinct roles:

-  `arn:aws:iam::307493967395:role/tf-eks-discovery-ci-cluster-kubernetes-service-access-role` - used by Kubernetes Service
-  `arn:aws:iam::307493967395:role/tf-eks-discovery-ci-cluster-discovery-service-access-role` - used by Discovery Service

The Discovery Service will assume the role `arn:aws:iam::307493967395:role/tf-eks-discovery-ci-cluster-discovery-service-access-role`, which defines the following policy:

- `eks:ListClusters`
- `eks:DescribeCluster`

These are the minimal permissions required to list the available clusters and retrieve their state and labels.
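
For illustration, the same permissions rendered as a policy-document sketch (shown as YAML for readability; actual IAM policies are JSON, and the resource scoping here is an assumption):

```yaml
Version: "2012-10-17"
Statement:
  - Effect: Allow
    Action:
      - "eks:ListClusters"
      - "eks:DescribeCluster"
    # Could be narrowed to specific cluster ARNs to keep the CI account tight.
    Resource: "*"
```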

The Teleport Discovery Service will pull the available EKS clusters and, for each cluster to import, create a `kube_cluster` object in the Auth Server.

Once the cluster is discovered and the `kube_cluster` exists in the Auth Server, the Teleport Kubernetes Service will start proxying the cluster.

For that, it must pull the cluster API endpoint and its CA data to create a client. The role `arn:aws:iam::307493967395:role/tf-eks-discovery-ci-cluster-kubernetes-service-access-role` allows the Kubernetes Service to describe the cluster and retrieve its details:

- `eks:DescribeCluster`

The IAM role used by the Kubernetes Service must be mapped to a Kubernetes group that allows impersonation, in order to proxy requests with the user's permissions.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: teleport-role
rules:
- apiGroups:
  - ""
  resources:
  - users
  - groups
  - serviceaccounts
  verbs:
  - impersonate
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
- apiGroups:
  - "authorization.k8s.io"
  resources:
  - selfsubjectaccessreviews
  - selfsubjectrulesreviews
  verbs:
  - create
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: teleport-crb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: teleport-role
subjects:
- kind: Group
  name: ${group_name}
```

During the cluster provisioning phase, we map the Kubernetes Service IAM role to the Kubernetes group `${group_name}`:

```yaml
mapRoles:
- groups:
  - ${group_name}
  rolearn: arn:aws:iam::307493967395:role/tf-eks-discovery-ci-cluster-kubernetes-service-access-role
  username: "teleport:{{SessionName}}"
```

The final step is to validate that the client works correctly and that the Kubernetes Service is able to generate a valid token that can impersonate Kubernetes groups and users.

For that, we simulate a user calling `kubectl get services -n default` through Teleport, which must return exactly one entry: the default `kubernetes` service.
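
As a sketch, that validation could be wired into the workflow like this (the step name, cluster placeholder, and exact `tsh` invocation are assumptions, not part of the actual commit):

```yaml
- name: Validate EKS access through Teleport
  run: |
    # Log into the discovered cluster via Teleport, then list services.
    tsh kube login <discovered-cluster-name>
    tsh kubectl get services -n default # expect exactly one entry: "kubernetes"
```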

Implements #27156
@tigrato force-pushed the rfd/0128-aws-integration-tests branch from 294beaf to 5f6c88d on July 19, 2023 08:03
@tigrato enabled auto-merge on July 19, 2023 08:03
@tigrato added this pull request to the merge queue on Jul 19, 2023
Merged via the queue into master with commit 14dfccc on Jul 19, 2023
20 checks passed
@tigrato deleted the rfd/0128-aws-integration-tests branch on July 19, 2023 08:21
github-merge-queue bot pushed a commit that referenced this pull request on Jul 19, 2023 (same commit message as above).
Labels: rfd (Request for Discussion), size/md