
Add Istio to our nightly tests #240

Closed
geekflyer opened this issue Oct 14, 2018 · 11 comments
Labels: area/helm, p1 (a bug severe enough to be the next item assigned to an engineer)

Comments

geekflyer commented Oct 14, 2018

This is mostly related to #239, but I wanted to create a separate issue for Istio in general, given the severity of the problems I ran into, and to discuss whether other issues should be split out of this one.

In general I think Istio is an excellent example of a non-trivial Kubernetes application that Pulumi should adopt as part of its test suite. Why? Istio uses Helm, deploys more than 100 resources, and uses a lot of advanced resource types, like CRDs.

Currently the experience when attempting to deploy Istio via Pulumi is very bad.
This is the code that can be used for testing:

  1. Add the repo: helm repo add istio https://istio.io/charts
  2. Deploy:
import { helm, core } from '@pulumi/kubernetes';

const appName = 'istio';

const namespaceName = `${appName}-system`;

const namespace = new core.v1.Namespace(namespaceName, {
  metadata: { name: namespaceName }
});

const chart = new helm.v2.Chart(
  appName,
  {
    repo: 'istio',
    chart: appName,
    namespace: namespaceName,
    version: '1.0.1',
    // for all options check https://github.com/istio/istio/tree/master/install/kubernetes/helm/istio
    values: {
      kiali: { enabled: true }
    }
  },
  { dependsOn: [namespace] }
);

This is the list of issues I ran into, in order:

1. Getting the chart involves manual steps

I found no way to install a chart from a local directory. I then discovered that Istio actually has an undocumented custom Helm repo endpoint, but using it in Pulumi requires a manual helm repo add step. These two issues are already addressed in #229 and #238.

2. Pulumi crashes because of a duplicate YAML key in the kiali definitions with:

Previewing changes:

     Type                         Name                                 Plan          Info
 *   pulumi:pulumi:Stack          api-cluster-dev-b-api-cluster-dev-b  no change     2 errors
 ~   └─ kubernetes:helm.sh:Chart  istio                                update        changes: ~ values

Diagnostics:
  pulumi:pulumi:Stack: api-cluster-dev-b-api-cluster-dev-b
    error: YAMLException: duplicated mapping key at line 2450, column 5:
            name: http-kiali
            ^

    error: an unhandled error occurred: Program exited with non-zero exit code: 1

error: an error occurred while advancing the preview

This is because kiali duplicates the name key in its service definition here: https://github.com/istio/istio/blob/369bf50f45d9e2748b59726a466b840312633f2b/install/kubernetes/helm/istio/charts/kiali/templates/service.yaml#L10-L13. This mistake was already fixed on Istio's master branch, but the latest official chart still contains the duplicate key. The reason I mention it is that Helm itself can deal with duplicate keys and simply takes the value from the last key (similar to JSON). A quick look at the documentation of the js-yaml library that Pulumi uses reveals that this behaviour can be enabled via the json option on the safeLoad function: https://github.com/nodeca/js-yaml#safeload-string---options-. It is certainly debatable whether Pulumi should allow duplicate keys in YAML, but since Helm itself allows it, I wanted to at least bring it up for discussion. There also seem to be a bunch of other projects that intentionally allow duplicate keys in YAML for compatibility with earlier YAML specs.
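
For reference, here is a minimal sketch of the js-yaml behaviour in question (the manifest snippet is a stand-in for the kiali service definition, not the chart's exact contents):

import * as yaml from 'js-yaml';

// Stand-in for the kiali service manifest, which repeats the `name` key.
const manifest = `
ports:
  - name: http-kiali
    name: tcp-kiali
    port: 20001
`;

// Default behaviour: throws "YAMLException: duplicated mapping key".
// yaml.safeLoad(manifest);

// With { json: true }, the last key wins, matching Helm (and JSON.parse):
const doc = yaml.safeLoad(manifest, { json: true });
console.log(doc); // { ports: [ { name: 'tcp-kiali', port: 20001 } ] }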

3. Resources are not deployed in the right order, and Pulumi fails early when one resource fails to deploy

This is basically #239, but a few additional adverse behaviors surfaced while attempting to work around it. For the Istio deployment to succeed, the CRDs and the Deployments have to be deployed first. When running pulumi up, Pulumi currently starts 10 concurrent resource deployments by default, and once one of them fails it stops spawning further deployments. For Istio that means it takes many pulumi up attempts until all of the CRDs and Deployments happen to end up in the resource deployment queue.

A workaround that actually reduced the number of necessary pulumi up attempts was to increase the concurrency with pulumi up -p 50, which starts 50 concurrent deployments right from the start and increases the likelihood that the CRD and Deployment resources get picked up. Based on this workaround, I wonder if Pulumi could change, or introduce a parameter to control, the behaviour that decides when Pulumi stops an update and considers it a failure. In the case of Istio, I would like Pulumi to continue deploying resources even if the ones picked up first have failed. I'm not 100% sure the suggestion is clear from the description above, so please let me know if this needs further clarification. A sketch of the underlying ordering problem follows.
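
To make the ordering problem concrete, here is a minimal sketch with hypothetical resources (the chart's rendered resources can't easily be given individual dependsOn options from user code, which is exactly why the ordering inside the chart matters):

import { apiextensions } from '@pulumi/kubernetes';

// Hypothetical stand-in for one of the many CRDs the Istio chart installs.
const gatewayCrd = new apiextensions.v1beta1.CustomResourceDefinition('gateway-crd', {
  metadata: { name: 'gateways.networking.istio.io' },
  spec: {
    group: 'networking.istio.io',
    version: 'v1alpha3',
    scope: 'Namespaced',
    names: { kind: 'Gateway', plural: 'gateways' }
  }
});

// Without the explicit dependsOn, Pulumi may try to create the Gateway
// before its CRD is registered, and the API server rejects it.
const gateway = new apiextensions.CustomResource(
  'ingress-gateway',
  {
    apiVersion: 'networking.istio.io/v1alpha3',
    kind: 'Gateway',
    spec: { selector: { istio: 'ingressgateway' } }
  },
  { dependsOn: [gatewayCrd] }
);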

@lukehoban lukehoban added this to the 0.18 milestone Oct 15, 2018
@hausdorff (Contributor) commented:

Thanks for opening the issue; I 100% agree that we should fix this stuff. Istio has been on our radar for a while anyway, and (1) and (3) are problems we've encountered with another big, important, complex chart as well (kube-prometheus), so we are actively discussing fixes for those.

(2) is one I've not heard before; if Helm does it, we should probably support it too, even though I don't think it's a great decision in isolation. :)

Even though this issue is really a meta-issue, I'm going to keep it open for now, because I suspect that once we get through these issues, we'll find even more. My experience is that large Helm charts usually have subtle and unnoticed bugs that need to be fixed; our pattern has been to contribute fixes upstream, and I expect we'll do that here, too.

@hausdorff (Contributor) commented:

I'm moving the umbrella issue to M19. We'll try to get a quick fix in for most Charts in the next day or two (I'm guessing we will do infinite parallelism), and then follow up with a series of fixes that solve the problems of the Istio Chart generally. I also suspect we will discover more issues than those listed here -- so we'll want to budget time for that.

@hausdorff hausdorff modified the milestones: 0.18, 0.19 Oct 15, 2018
@joeduffy joeduffy changed the title Umbrella issue: It's close to impossible to deploy istio via pulumi Add Istio to our nightly tests Oct 17, 2018
@joeduffy (Member) commented:

I have updated the title. We should not only fix the underlying set of issues, but also aim to close out M19 with Istio actually harnessed and running in our nightly tests, so that we lock in the goodness.

hausdorff added a commit that referenced this issue Nov 5, 2018
This commit will introduce a simple retry mechanism that makes the
Kubernetes provider more tolerant to misordered resource creation, as is
particularly common in Helm and Kubernetes YAML. For example, if
Kubernetes YAML specifies to create a `Pod` before the `Namespace` it
resides in, it is likely this will cause `pulumi up` to fail. Another
example is creating `CustomResources` before an operator or controller has
the opportunity to register the relevant `CustomResourceDefinition`s.

We ameliorate this bug by retrying creation REST calls that fail, with
some backoff, and some maximum retry threshold. The effects of this are:

* Makes the Kubernetes provider somewhat tolerant to undeclared
  dependencies between resources.
* Makes the Kubernetes provider somewhat tolerant to under-specified
  CRDs, custom controllers, and custom operators, since the retry logic
  can stand in for all but the most complex of these (which would still
  require await logic).

This commit represents major progress towards #240, and should (we
believe) fix #239.
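
For illustration, a minimal sketch of the retry-with-backoff idea the commit describes (the real implementation lives in the provider; the helper below is hypothetical):

// Retry a creation call with exponential backoff, up to a maximum number
// of attempts. Hypothetical helper, not the provider's actual code.
async function createWithRetry<T>(
  create: () => Promise<T>,
  maxAttempts: number = 5,
  baseDelayMs: number = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await create();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) {
        throw err; // retry budget exhausted; surface the last error
      }
      // Backoff: 500ms, 1s, 2s, 4s, ...
      await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}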
hausdorff added five more commits with the same message that referenced this issue Nov 5 and Nov 6, 2018

hausdorff commented Nov 13, 2018

Ok, we're getting close, I think. Here are the outstanding issues that we're in the process of resolving:

  • Allow our caching discovery client to invalidate caches after CRD creation. Because the client caches once per engine invocation, if we boot up a CRD and a CR in the same program, the client does not get updated, which causes CR creation to fail.
  • Account for finalizers in the delete path. Kubernetes resources can have deletion hooks called "finalizers." For example, foreground cascading delete causes a resource to be marked as "deleting" (i.e., the deletion timestamp is set), and then it is deleted by the GC only after all the resources it owns are deleted. It's not yet clear to us what we should be doing in this case, but we're actively working on it (a sketch of the situation follows).
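
To sketch the finalizer situation (purely illustrative; readResource below is a hypothetical helper that GETs the live object and rejects with { code: 404 } once it is gone):

// A resource with finalizers lingers with metadata.deletionTimestamp set
// until the GC has collected everything it owns; only then does a read 404.
declare function readResource(
  name: string,
  namespace: string
): Promise<{ metadata: { deletionTimestamp?: string; finalizers?: string[] } }>;

async function awaitDeletion(name: string, namespace: string): Promise<void> {
  for (;;) {
    try {
      const obj = await readResource(name, namespace);
      // Still present: deletion is blocked on the remaining finalizers.
      console.log(`waiting on finalizers: ${obj.metadata.finalizers || []}`);
    } catch (err: any) {
      if (err.code === 404) {
        return; // object fully deleted
      }
      throw err;
    }
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
}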


hausdorff commented Nov 14, 2018

Ok, I'm marking the two items from my last comment off -- with #271, Istio does not (on my machine) error out during install or uninstall. I'm guessing we'll run into more bugs as I have time to explore the entire parameter space of the Istio Helm chart, but for now I think we've made significant progress on @geekflyer's point (3).

For (2) I haven't decided if we should allow duplicate keys for Helm 2 -- I'm thinking yes, because Helm itself supports this, and I'm a big believer in duplicating bugs when you're mirroring an API. Would appreciate thoughts on this.

For (1), we will finish this work before the sprint ends.

Over the next few days, I'm going to be:

  • Testing the configuration space to find more kinks. The chart is pretty big, and I want to make sure we encounter most of the errors a user would realistically encounter.
  • Looking to make sure that in addition to installing it, Istio is working. So far we've looked only at making sure it installs.
  • Writing custom classes and await logic for the various CRDs users expect to be able to submit.
  • Figuring out some way to test all of this.

@hausdorff (Contributor) commented:

Ok, we're now making significant progress. For @geekflyer's (2), we've decided to go ahead and conform to Helm's YAML parsing semantics. This is the final provider-level blocker, so in tandem I've put a WIP example for Istio here. We've tested this with a variety of settings so we have some confidence that this will work for normal users.

@hausdorff (Contributor) commented:

I've worked with several users (including @geekflyer) to make sure this works for them, so we have some confidence we're getting to the right place.

Let's move the "write custom CRDs" bit out to a new issue -- this is really outside the testing umbrella.

For the last remaining issue of writing tests:

  • We have successfully created a Pulumi app that will boot a testing/CI cluster. We've put this into our internal infrastructure repo, and it has a README that explains how to use it.
  • We've successfully plumbed all the secrets through Travis, so that tests can target this test cluster (Use test-ci k8s cluster; add Istio integration test #102).
  • Still to do: write the actual test cases. We already have the Istio program we want to test, so this mostly involves writing the integration test itself (a rough sketch follows).
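
As a rough illustration of that last item, a hedged sketch of what a nightly test could look like, shelling out to the pulumi CLI from mocha (the paths and harness are hypothetical, not the repo's actual test setup):

import { execSync } from 'child_process';

// Assumes mocha's global describe/it (via @types/mocha).
describe('istio example', function () {
  this.timeout(30 * 60 * 1000); // booting a cluster plus Istio is slow

  it('deploys and tears down cleanly', () => {
    // 'examples/istio' is a hypothetical path to the program under test.
    const opts = { cwd: 'examples/istio', stdio: 'inherit' as const };
    execSync('pulumi up --yes --skip-preview', opts);
    try {
      // Assertions against the live cluster would go here, e.g. checking
      // that the istio-system pods report Ready.
    } finally {
      execSync('pulumi destroy --yes', opts);
    }
  });
});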

@hausdorff (Contributor) commented:

Alright, we have a good start on the tests, but I realized yesterday that, because we can't have multiple instances of Istio running on a cluster, we will actually have to boot an entire GKE cluster every time we deploy it. This shouldn't be hard but so far it's pretty fiddly and I don't know what I don't know, so I am not sure how to make an accurate time estimate.

@lukehoban (Member) commented:

As a first step, could we just run this test in a nightly run? That should be easy to set up, avoid the parallelism issue, and still get us some coverage in the near term until we can do the work to create an entire cluster as part of each test.

@hausdorff (Contributor) commented:

Success. We're building a GKE cluster and deploying Istio to it. The test conditions need to be fixed, but the hard part seems to be over: https://travis-ci.com/pulumi/pulumi-kubernetes/builds/93263424

The PR will be up shortly.

@hausdorff (Contributor) commented:

Closed with #102.

@infin8x infin8x added the p1 A bug severe enough to be the next item assigned to an engineer label Jul 10, 2021