Travis tests failing #300

Closed
lukehoban opened this issue Jan 4, 2020 · 7 comments · Fixed by #309
Labels: area/testing, p1 (A bug severe enough to be the next item assigned to an engineer)
Milestone: 0.31

Comments

lukehoban (Member) commented Jan 4, 2020

Travis tests have failed or errored for the last 16 days, and for 20 of the 21 latest runs on master.

lukehoban added this to the 0.31 milestone on Jan 4, 2020
lukehoban mentioned this issue on Jan 4, 2020
metral (Contributor) commented Jan 6, 2020

I've combed through the last handful of failed CI tests on master, and the errors are the same as reported in #267:

Error: 2 UNKNOWN: invocation of aws:ec2/getRouteTable:getRouteTable returned an error:
invoking aws:ec2/getRouteTable:getRouteTable:
Your query returned no results. Please change your search criteria and try again, resp: undefined

We're seeing this error across all examples/tests that define at least one cluster that does not use a newly created VPC and instead depends on the account's default VPC.
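
For context, a minimal sketch of the kind of configuration that hits this path: an eks.Cluster created with no explicit vpcId or subnetIds, so the component falls back to the account's default VPC. The resource name and node group sizes below are illustrative, not taken from the failing examples:

import * as eks from "@pulumi/eks";

// No vpcId or subnetIds are supplied, so the component looks up the account's
// default VPC and its subnets, which exercises the getRouteTable invokes shown below.
const cluster = new eks.Cluster("example-cluster", {
    desiredCapacity: 2,
    minSize: 1,
    maxSize: 2,
});

// Export the kubeconfig so the test harness can talk to the cluster.
export const kubeconfig = cluster.kubeconfig;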

Givens:

  • The CI account in us-west-2 does have a default VPC with route table set.
  • All subnets for the default VPC have route tables associated.
  • All route tables have main association set to Yes.

Failing examples/tests:

  • examples/cluster
  • examples/nodegroup
  • examples/managed-nodegroups
  • examples/storage-classes
  • examples/tags
  • examples/tests/tag-input-types
  • examples/tests/nodegroup-options

// Excerpt from nodegroup.ts: the worker subnets are classified as public or
// private by inspecting each subnet's route table for an internet gateway route.
for (const subnetId of subnetIds) {
    // Fetch the route table for this subnet.
    const routeTable = await getRouteTableAsync(parent, subnetId);
    // Once we have the route table, check its list of routes for a route to an internet gateway.
    const hasInternetGatewayRoute = routeTable.routes
        .find(r => !!r.gatewayId && !isPrivateCIDRBlock(r.cidrBlock)) !== undefined;
    if (hasInternetGatewayRoute) {
        publicSubnets.push(subnetId);
    } else {
        privateSubnets.push(subnetId);
    }
}
return privateSubnets.length === 0 ? publicSubnets : privateSubnets;
}

async function getRouteTableAsync(parent: pulumi.Resource, subnetId: string) {
    const invokeOpts = { parent, async: true };
    try {
        // Attempt to get the explicit route table for this subnet. If there is no explicit route table for
        // this subnet, this call will throw.
        return await aws.ec2.getRouteTable({ subnetId }, invokeOpts);
    } catch {
        // If we reach this point, the subnet may not have an explicitly associated route table. In this case
        // the subnet is associated with its VPC's main route table (see
        // https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Route_Tables.html#RouteTables for details).
        const subnet = await aws.ec2.getSubnet({ id: subnetId }, invokeOpts);
        const mainRouteTableInfo = await aws.ec2.getRouteTables({
            vpcId: subnet.vpcId,
            filters: [{
                name: "association.main",
                values: ["true"],
            }],
        }, invokeOpts);
        return await aws.ec2.getRouteTable({ routeTableId: mainRouteTableInfo.ids[0] }, invokeOpts);
    }
}
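
For what it's worth, the same data-source call path can be exercised directly from a small Pulumi program run against the CI account in us-west-2, to see which default-VPC subnets (if any) trip the "Your query returned no results" error. A rough sketch using the @pulumi/aws data sources (not code from this repo; getVpc/getSubnetIds usage is assumed from the provider API):

import * as aws from "@pulumi/aws";

// Resolve the default VPC and its subnets, then attempt the same explicit
// route table lookup the component performs for each subnet.
async function checkDefaultVpcRouteTables(): Promise<void> {
    const defaultVpc = await aws.ec2.getVpc({ default: true }, { async: true });
    const subnets = await aws.ec2.getSubnetIds({ vpcId: defaultVpc.id }, { async: true });
    for (const subnetId of subnets.ids) {
        try {
            const rt = await aws.ec2.getRouteTable({ subnetId }, { async: true });
            console.log(`subnet ${subnetId}: explicit route table ${rt.id}`);
        } catch (err) {
            console.log(`subnet ${subnetId}: no explicit route table (${err}); falls back to the main route table`);
        }
    }
}

export const checked = checkDefaultVpcRouteTables();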

cc @CyrusNajmabadi


Related PRs that attempted to fix #267:

metral (Contributor) commented Jan 9, 2020

cc @lukehoban

metral (Contributor) commented Jan 14, 2020

Ran a slew of attempts, all with varying outcomes, but no clear resolution.

cc @lukehoban @stack72 @CyrusNajmabadi


For starters, these are all the examples/tests we have.

There are 6 core examples:

  1. examples/cluster
  2. examples/fargate
  3. examples/nodegroup
  4. examples/managed-nodegroups
  5. examples/storage-classes
  6. examples/tags

There are 4 with_update core examples:

  1. examples/cluster
  2. examples/nodegroup
  3. examples/storage-classes
  4. examples/tags

There are 6 regression tests currently, and 2 new ones in the pipeline per #302:

  1. examples/tests/awsx-network-and-subnetIds
  2. examples/tests/migrate-nodegroups
  3. examples/tests/nodegroup-options
  4. examples/tests/replace-cluster-add-subnets
  5. examples/tests/replace-secgroup
  6. examples/tests/tag-input-types

and

  1. examples/tests/managed-ng-aws-auth
  2. examples/tests/managed-ng-missing-role

Fix attempts and outcomes:

  1. Commit | Travis Run | PASSED:
    • Ran a subset of tests
    • Enabled logging in the getRouteTables() func of nodegroup.ts (see the illustrative sketch after this list) and did not encounter the getRouteTables() issue we've been seeing per the comment above.
    • PULUMI_TEST_DEBUG_LOG_LEVEL=2 && PULUMI_TEST_DEBUG_UPDATES=true
  2. Commit | Travis Run | FAILED after the 2-hour max test duration limit with The job exceeded the maximum time limit for jobs, and has been terminated.
    • Ran all tests
    • Enabled logging in the getRouteTables() func of nodegroup.ts and did not encounter the getRouteTables() issue we've been seeing per the comment above.
    • No PULUMI_TEST_DEBUG env-vars were enabled.
  3. Commit | Travis Run | FAILED with No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.:
    • Ran all tests
    • Removed debug logging in the getRouteTables() func of nodegroup.ts.
    • No PULUMI_TEST_DEBUG env-vars were enabled.
  4. Last handful of cron runs (#1306 and #1304) on the last working commit of master | FAILED with No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
    • PULUMI_TEST_DEBUG_LOG_LEVEL=9 && PULUMI_TEST_DEBUG_UPDATES=true, and
    • PULUMI_TEST_DEBUG_LOG_LEVEL=2 && PULUMI_TEST_DEBUG_UPDATES=true
  5. Re-ran the last working commit of master | Travis | FAILED with No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
    • No PULUMI_TEST_DEBUG env-vars were enabled.
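
For reference, "enabled logging in the getRouteTables() func" in attempts 1 and 2 above refers to extra diagnostics around the route table lookups in nodegroup.ts. A purely illustrative sketch of that kind of instrumentation (the actual commits may have logged different details):

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

async function getRouteTableAsync(parent: pulumi.Resource, subnetId: string) {
    const invokeOpts = { parent, async: true };
    try {
        // Illustrative debug logging; the real instrumentation may differ.
        pulumi.log.debug(`looking up explicit route table for subnet ${subnetId}`, parent);
        const routeTable = await aws.ec2.getRouteTable({ subnetId }, invokeOpts);
        pulumi.log.debug(`subnet ${subnetId} has explicit route table ${routeTable.id}`, parent);
        return routeTable;
    } catch (err) {
        pulumi.log.debug(`no explicit route table for subnet ${subnetId} (${err}); falling back to the VPC's main route table`, parent);
        const subnet = await aws.ec2.getSubnet({ id: subnetId }, invokeOpts);
        const mainRouteTableInfo = await aws.ec2.getRouteTables({
            vpcId: subnet.vpcId,
            filters: [{ name: "association.main", values: ["true"] }],
        }, invokeOpts);
        return await aws.ec2.getRouteTable({ routeTableId: mainRouteTableInfo.ids[0] }, invokeOpts);
    }
}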

In summary:

  • The getRouteTable: Your query returned no results error has not been seen in about a week on Travis or locally. That does not mean it has been resolved: it first surfaced back in Oct 2019 and resurfaced in Dec 2019. Instead, we've now seen tests stall out midway through with No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself, across both cron jobs and fix attempts.
  • Enabling the PULUMI_TEST_DEBUG_* env vars (or leaving them unset) does not appear to affect a) whether the tests run to completion or b) whether they pass.
  • The only run that passed had only a subset of tests enabled. Given that pulumi/examples runs ~90 tests, in theory there is no reason why all 16 current examples/tests of pulumi/eks cannot all be enabled.
  • The most plausible theory for this variance in errors and lack of job completion is Travis itself, whether due to network saturation, resource starvation, or spurious platform bugs that have been visible for a while now, both in the UI and on job runs.

CyrusNajmabadi (Contributor) commented:

Another issue:
https://travis-ci.com/pulumi/examples/builds/145618653

[ com/pulumi/examples/aws-ts-eks ]     Error: invocation of kubernetes:yaml:decode returned an error: unable to load Kubernetes client configuration from kubeconfig file: invalid configuration: no configuration has been provided

[ com/pulumi/examples/aws-ts-eks ]         at monitor.invoke (/tmp/p-it-travis-job-aws-ts-eks-c85be5f0-486422416/node_modules/@pulumi/pulumi/runtime/invoke.js:172:33)

[ com/pulumi/examples/aws-ts-eks ]         at Object.onReceiveStatus (/tmp/p-it-travis-job-aws-ts-eks-c85be5f0-486422416/node_modules/grpc/src/client_interceptors.js:1210:9)

metral (Contributor) commented Jan 23, 2020

Error: invocation of kubernetes:yaml:decode returned an error: unable to load Kubernetes client configuration from kubeconfig file: invalid configuration: no configuration has been provided

Related pulumi/pulumi-kubernetes#958

lblackstone (Member) commented:

The yaml:decode error was fixed in the 1.4.5 release of the k8s provider.

metral (Contributor) commented Jan 23, 2020

The yaml:decode error was fixed in the 1.4.5 release of the k8s provider.

Confirmed that I have not seen this today in EKS CI.

Still debugging the rest of the issues we're seeing today:

infin8x added the p1 (A bug severe enough to be the next item assigned to an engineer) label on Jul 10, 2021