
E2e tests fixes and improvements #4243

Merged: 5 commits merged into main from e2e-tests on Feb 8, 2022

Conversation

@castelblanque (Collaborator) commented Feb 7, 2022

Description of the change

This PR tries to fix problems with CI E2E tests in both local Kind and GKE environments.
It adds some logging of the parameters used in the e2e-test.sh script.
Different timeouts are set when running in Kind or GKE, as GKE turned out to be slower while developing. Why not have the same long timeout for both? Because this way we get earlier feedback when running in Kind, which is what runs on every commit push.
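For illustration only (a minimal sketch, not code from this PR; the 4-minute default is assumed): with TEST_TIMEOUT expressed in minutes, as the CI config below sets it, a Playwright config can derive its per-test timeout from the environment, so the Kind and GKE jobs simply export different values:

  // playwright.config.ts (hypothetical sketch)
  import { PlaywrightTestConfig } from "@playwright/test";

  const timeoutMinutes = parseInt(process.env.TEST_TIMEOUT || "4", 10);
  const config: PlaywrightTestConfig = {
    timeout: timeoutMinutes * 60 * 1000, // Playwright timeouts are in milliseconds
  };
  export default config;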

One of the problems was that resources created by test 03 could not pull their images due to a 401 Unauthorized error. Now the existing Docker credentials are used, but the secret is still created in the test.

More comments below.

Benefits

E2E tests should now pass successfully in both local Kind and GKE.

Possible drawbacks

No one knows what the next surprise with E2E tests will be...

Applicable issues

castelblanque and others added 3 commits February 7, 2022 17:14
* Trying new CI setup

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Fix bearer token login on E2E tests (#2898)

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Added positive and negative permissions tests for regular user (#4081)

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Re-add rolebinding for kubeapps-user in its namespace (#4081)

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Fixing tests for GKE env (#2898)

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Fix for E2E running in GKE (#2898)

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Trying to fix E2E tests for GKE (#2898)

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Improving E2E tests setup (#2898)

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Fix for CI pipeline in GKE

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Fixes for E2E tests in GKE

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Trying E2E tests with lower cluster resources

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Removed temporary comments in CI config

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Temporary fix for e2e CI development

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Trying test without Docker credentials

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Use new integration image. Small fixes for E2E.

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* E2E Tests fix

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Fixing E2E tests

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Debugging CI

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Improving GKE CI setup

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Debugging E2E CI tests

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Adding missing timeout param for local E2E

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>

* Tidy up after debugging E2E test

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>
@@ -120,7 +120,7 @@ build_on_tag_or_prerelease: &build_on_tag_or_prerelease
tags:
only: /^v.*/
branches:
-        only: prerelease
+        only: e2e-tests
Collaborator Author:

Intentionally left until the PR is merged so that tests can be run in both local Kind and GKE.

Contributor:

Wouldn't that mean, now that you've tested in both, that you'd want to revert this now? (before review and landing?)

Collaborator Author:

That branch name will not be merged, of course. I left it intentionally while this PR is open so that, if we need to push new changes, we can still run the tests in GKE too. I didn't want to touch the actual prerelease branch.

@@ -241,6 +241,10 @@ common_envars: &common_envars
POSTGRESQL_VERSION: << pipeline.parameters.POSTGRESQL_VERSION >>
RUST_VERSION: << pipeline.parameters.RUST_VERSION >>

+common_gke_envars: &common_gke_envars
Collaborator Author:

Added common environment vars for GKE in a single place. Before, USE_MULTICLUSTER_OIDC_ENV was repeated.

@@ -364,7 +368,6 @@ install_additional_cluster: &install_additional_cluster
kubectl --context kind-kubeapps-ci-additional --kubeconfig ${HOME}/.kube/kind-config-kubeapps-ci-additional apply --kubeconfig=${HOME}/.kube/kind-config-kubeapps-ci-additional -f ./docs/user/manifests/kubeapps-local-dev-users-rbac.yaml &&
kubectl --context kind-kubeapps-ci-additional --kubeconfig ${HOME}/.kube/kind-config-kubeapps-ci-additional apply --kubeconfig=${HOME}/.kube/kind-config-kubeapps-ci-additional -f ./docs/user/manifests/kubeapps-local-dev-namespace-discovery-rbac.yaml &&

-kubectl --context kind-kubeapps-ci-additional --kubeconfig ${HOME}/.kube/kind-config-kubeapps-ci-additional delete rolebinding kubeapps-user -n kubeapps-user-namespace &&
Collaborator Author:

After adding the new tests, this rolebinding, which is created in the file kubeapps-local-dev-users-rbac.yaml, was needed, so it is no longer deleted here.

# that only applies to Kind clusters
name: Apply customizations to GKE cluster
command: |
kubectl create namespace kubeapps-user-namespace
Collaborator Author:

This applies only to GKE. Needed now that we are making use of new rolebindings (see script/e2e-test.sh).

Comment on lines +9 to +16
console.log("Setting up Playwright tests");
console.log(`>> Global timeout: ${config.globalTimeout / timeDisplayFactor} mins`);
config.projects.forEach(project => {
console.log(
`>> Project ${project.name} test timeout: ${project.timeout / timeDisplayFactor} mins`,
);
});
console.log(`>> Deployments timeout: ${utils.getDeploymentTimeout() / timeDisplayFactor} mins`);
Collaborator Author:

Adds some logging about different timeouts at the beginning of Playwright execution.

@@ -21,7 +21,7 @@ spec:
- tail
- -f
- /dev/null
-      image: kubeapps/integration-tests-pw:v1.0.0
+      image: kubeapps/integration-tests-pw:v1.0.1
Collaborator Author:

Manually pushed.

Comment on lines +14 to +17
// Change to user's namespace using UI
await page.click(".kubeapps-dropdown .kubeapps-nav-link");
await page.selectOption('select[name="namespaces"]', "default");
await page.click('cds-button:has-text("Change Context")');
Collaborator Author:

Due to the different RBAC in GKE, the test was running in a different namespace.

Contributor:

Ah - because token authentication is being used in GKE, it would have been defaulting to the namespace of the token - nice catch.

Comment on lines +140 to +145
await expect(page.locator(".left-menu")).toContainText("Up to date", { timeout: deployTimeout });
await page.waitForTimeout(5000);
await page.click('a.nav-link:has-text("Applications")');
await page.waitForTimeout(3000); // Sometimes typing was too fast to get the result shown
await page.locator("input#search").fill(appName);
await page.waitForTimeout(3000);
Collaborator Author:

Perhaps in the future we can fine-tune these timeouts, or even remove them completely.

Contributor:

Yeah - I worry about adding timeouts like this unless absolutely necessary. As above, it's not clear to me why we can't instead wait for the selector below to be available before clicking. There's 11 seconds of waiting in the last 5 lines.

Collaborator Author:

I don't like those timeouts either; they came later in the process out of necessity. Without them, sequential execution of e.g.

  await page.click('a.nav-link:has-text("Applications")');
  await page.locator("input#search").fill(appName);

sometimes would not produce any search result (and the app was definitely there).
The same happened with other combinations of instructions where, I would say, the frontend was unable to keep up with the speed of the test execution.
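For reference, the selector-based alternative the reviewer suggests could look roughly like this (a sketch only; the ".card-title" results selector is an assumption about the dashboard markup, not verified against it):

  // Hypothetical sketch: wait for rendered elements instead of fixed timeouts
  await page.click('a.nav-link:has-text("Applications")');
  const search = page.locator("input#search");
  await search.waitFor(); // ensure the search box exists before typing
  await search.fill(appName);
  // Block until a result matching the app name actually shows up
  await page.waitForSelector(`.card-title:has-text("${appName}")`, {
    timeout: utils.getDeploymentTimeout(),
  });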


// Trigger deploy
await page.locator('cds-button:has-text("Deploy")').click();

// Check deployment
await page.waitForSelector("css=.application-status-pie-chart-number >> text=1", {
-  timeout: 60000,
+  timeout: utils.getDeploymentTimeout(),
Collaborator Author:

Every waitForSelector for a deployment now uses the same unified timeout.

Comment on lines +19 to +20
DEX_IP=${8:-"172.18.0.2"}
ADDITIONAL_CLUSTER_IP=${9:-"172.18.0.3"}
Collaborator Author:

Intentionally left these two as the last params: they don't exist when running in GKE. Otherwise the param indexes would be wrong.

Comment on lines +194 to +196
--set frontend.replicaCount=1
--set kubeops.replicaCount=1
--set dashboard.replicaCount=1
Collaborator Author:

Running tests with 1 replica here turned out to work fine.

Contributor:

Hmm, it might be worth checking the commit msg there? I remember some environment we found during release was returning 502s, but it may have been for the kubeappsapis service only, which you've left with 2 below.

Collaborator Author:

Indeed, previously the only service with 2 replicas was kubeappsapis.

Comment on lines +393 to +394
kubectl create rolebinding kubeapps-view-user-apprepo-read -n kubeapps-user-namespace --clusterrole=kubeapps:kubeapps:apprepositories-read --serviceaccount kubeapps:kubeapps-view
kubectl create rolebinding kubeapps-view-user -n kubeapps-user-namespace --clusterrole=edit --serviceaccount kubeapps:kubeapps-view
Collaborator Author:

This has the same effect as the RBAC YAML file used for OIDC.

Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>
getRandomName: base => {
  const randomNumber = Math.floor(Math.random() * Math.floor(100000));
  return base + "-" + randomNumber;
},

getDeploymentTimeout: () => {
  // TODO(castelblanque) Fine tune this for cases of tests with two deployments or more
  return (process.env.TEST_TIMEOUT ? parseInt(process.env.TEST_TIMEOUT) / 2 : 2) * 60 * 1000;
},
Collaborator Author:

The wait time for deployments to be ready defaults to half of each test's timeout. In some tests we wait for two deployments, so we might run over the total time.
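As a worked example (assuming TEST_TIMEOUT=6, the value set for GKE above):

  // TEST_TIMEOUT=6   -> 6 / 2 = 3 minutes -> 3 * 60 * 1000 = 180000 ms per deployment wait
  // TEST_TIMEOUT unset -> falls back to 2 minutes = 120000 ms
  const deploymentTimeout = (6 / 2) * 60 * 1000; // 180000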

@castelblanque changed the title from "E2e tests" to "E2e tests fixes and improvements" on Feb 7, 2022
@absoludity (Contributor) left a comment:

Thanks Rafa - glad to see more consistency with the GKE tests.

This doesn't address the issue we mentioned on the standup yesterday of the screenshots not appearing though right? (as in, it's still worth my time to investigate that separately?)


@@ -241,6 +241,10 @@ common_envars: &common_envars
POSTGRESQL_VERSION: << pipeline.parameters.POSTGRESQL_VERSION >>
RUST_VERSION: << pipeline.parameters.RUST_VERSION >>

+common_gke_envars: &common_gke_envars
+  USE_MULTICLUSTER_OIDC_ENV: "false"
+  TEST_TIMEOUT: 6 # Timeout minutes for each test
Contributor:

Given that most times are in milliseconds, can we rename this to TEST_TIMEOUT_MINUTES rather than relying on the comment here? Ah - I see, you're keeping it consistent with the existing CI_TIMEOUT. Up to you whether you leave them as they are or update both with the _MINUTES suffix, but yes, better not to do one and not the other.

GKE_REGULAR_VERSION_LATEST_RELEASE:
<<: *gke_test
environment:
GKE_BRANCH: << pipeline.parameters.GKE_REGULAR_VERSION >>
TEST_LATEST_RELEASE: 1
-      USE_MULTICLUSTER_OIDC_ENV: "false"
-      <<: *common_envars
+      <<: [*common_envars, *common_gke_envars]
Contributor:

Much nicer, including the oidc env in the common ones :)


Comment on lines +42 to +43
await page.fill("input#kubeapps-docker-cred-username", process.env.DOCKER_USERNAME);
await page.fill("input#kubeapps-docker-cred-password", process.env.DOCKER_PASSWORD);
Contributor:

So I'm still confused why we are adding real credentials here, given that the image itself is public (the chartmuseum repo is private, but the chart in the repo refers to public images), so no image pull secret should be necessary (as a user, I wouldn't create one in this situation right?). If it is simply because we want to verify creating the image pull secret, even if not used, and can no longer send dummy credentials to fetch a public image, then let's be clear about that here with a comment, something like:

// We want to verify the use of adding an image pull secret here even though the images for the chart
// are public. We're setting the actual credentials as the docker registry server rejects a request to pull
// a public image with invalid credentials, returning a 403.

so it doesn't take us time to unravel in a year why we were sending creds in the first place?

Collaborator Author:

I haven't changed the essence of this test. Docker credentials were being added before in the same way, except that they were fake credentials. Does it make sense to instruct the retrieval of public images with credentials? Maybe we were relying on some Docker URL feature that changes over time or is inconsistent?

As mentioned here, that test was failing in both Kind and GKE because Docker was unable to pull the image docker.io/bitnami/apache:2.4.48-debian-10-r74 (and r75), with a 401 Unauthorized. That error looks right to me, as we were sending non-existent credentials. In Kind I worked around it by pre-pushing the apache images to Kind (only service images were being pushed before). But when coming to GKE, I preferred to use actual credentials rather than shortcuts, given that the credentials were already set up in CI... and that we really wanted to test the application of Docker credentials, I assume. Actually, we might be able to remove the pre-pushing used for Kind; I haven't tried.

Why wasn't it failing before? I'm not sure failures were being correctly detected with Puppeteer (remember the "retry" pilgrimage? :D). Nor am I sure whether credentials were sometimes being ignored/lost when pulling the images, maybe due to https://index.docker.io/v1/ inconsistently ignoring them when a public image is requested.
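Combining the reviewer's suggested comment with the existing code, the test could carry the rationale inline (a sketch; DOCKER_USERNAME/DOCKER_PASSWORD are the env vars already used above):

  // We want to verify adding an image pull secret even though the chart's images are
  // public: the registry rejects pulls that present *invalid* credentials (401/403),
  // so the real credentials already configured in CI are used instead of fake ones.
  await page.fill("input#kubeapps-docker-cred-username", process.env.DOCKER_USERNAME);
  await page.fill("input#kubeapps-docker-cred-password", process.env.DOCKER_PASSWORD);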

await page.click('a.nav-link:has-text("Applications")');
await page.waitForTimeout(3000); // Sometimes typing was too fast to get the result shown
await page.locator("input#search").fill(appName);
await page.waitForTimeout(3000);
Contributor:

That's an extra 6 seconds added to the test here. Could we instead just wait until the selector below is available to click on?


@@ -73,7 +75,8 @@ test.describe("Limited user simple deployments", () => {

// Search for package deployed
await page.click('a.nav-link:has-text("Applications")');
await page.locator("input#search").type("alertmanager");
await page.waitForTimeout(3000); // Sometimes typing was too fast to get the result shown
await page.locator("input#search").fill("alertmanager");
await page.waitForTimeout(3000);
Contributor:

Same here with the timeouts rather than waiting for the selector.


Signed-off-by: Rafa Castelblanque <rcastelblanq@vmware.com>
@castelblanque merged commit b584702 into main on Feb 8, 2022
@castelblanque deleted the e2e-tests branch on February 8, 2022 10:23