
Integration testing requirements #1

Open · rabernat opened this issue May 3, 2021 · 19 comments

@rabernat (Member) commented May 3, 2021

On today's Pangeo ML WG call, @nbren12 re-raised the perennial issue of the need for integration testing of the many layers of our stack, particularly when it comes to cloud i/o.

This has been discussed in many different related issues.

The feeling from heavy users is that there are still lingering intermittent bugs related to cloud i/o that render workflows unstable. One way to get to the bottom of this is to become more rigorous about system-wide integration testing.

Some questions we need to resolve to move forward with this idea, and my initial responses, are:

  • What are the workflows we want to test?
    • Write random data to cloud (see the sketch after this list)
    • Read back data
    • Copy data
    • Rechunk data
  • How big does the test need to be in order to be realistic?
    • My sense: > 100 GB
  • Which combinations of libraries do we want to include?
    • dask (w/o xarray)
    • xarray (via dask)
    • gcsfs
    • s3fs
    • adlfs
    • rechunker
  • How do we orchestrate an integration test like this? Here is one idea
    • Build dedicated docker containers with the desired environments using GitHub workflows
    • Connect to dask clusters from GitHub workflows (maybe could use Coiled to make this easy; otherwise will need dedicated dask_kubernetes or dask_gateway set up for the purpose)
    • Launch integration tests on remote dask clusters from github workflows
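
For concreteness, here is a rough sketch of what the first few workflows (write random data, read it back, rechunk it) could look like for one parameter combination, assuming a dask cluster is already connected. The bucket name, array shape, and chunk sizes are placeholders, not an agreed configuration.

import dask.array as da
import xarray as xr
from rechunker import rechunk

BUCKET = "gs://pangeo-integration-scratch"  # hypothetical scratch bucket

# ~100 GB of random data in ~200 MB chunks along time
data = da.random.random((12_500, 1_000, 1_000), chunks=(25, 1_000, 1_000))
ds = xr.Dataset({"foo": (("time", "y", "x"), data)})

# 1. Write random data to cloud (gcsfs/fsspec resolve the gs:// URL)
ds.to_zarr(f"{BUCKET}/original.zarr", mode="w")

# 2. Read the data back and check a summary statistic
ds_back = xr.open_zarr(f"{BUCKET}/original.zarr")
assert abs(float(ds_back.foo.mean()) - 0.5) < 1e-3

# 3. Rechunk into a time-contiguous layout
plan = rechunk(
    ds_back.foo.data,  # the underlying dask array
    target_chunks=(12_500, 50, 50),
    max_mem="2GB",
    target_store=f"{BUCKET}/rechunked.zarr",
    temp_store=f"{BUCKET}/tmp.zarr",
)
plan.execute()

The "copy data" workflow would be a variant of the write/read steps with a second bucket as the destination.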

If we can agree on a clearly scoped plan, I think we can support the cloud computing costs. We can also consider asking Anaconda (@martindurant) to spend some time on this via our Pangeo Forge contract with them.

cc @jhamman @cisaacstern @TomNicholas @dcherian

@rabernat (Member Author) commented May 3, 2021

Noah, it would be great to get your thoughts on what sort of testing would help surface the issues you've been having.

@nbren12 (Contributor) commented May 3, 2021

Thanks so much for setting up this repo! I think it will be a useful collaboration point. ccing @spencerkclark @oliverwm1 @frodre for attention. We are particularly attuned to the I/O related issues.

What are the workflows we want to test?

Your suggestions are good. Also, my experience is that coarse-graining operations tend to cause many problems. Unlike a time mean, the output will be too large for memory, but the data reduction is large enough that the input and output chunk sizes should differ.
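
As a scaled-down illustration of that pattern (shapes, chunk sizes, and the output URL are placeholders):

import dask.array as da
import xarray as xr

# ~75 GB input in ~200 MB chunks along time
ds = xr.Dataset(
    {"temp": (("time", "y", "x"),
              da.random.random((8_760, 1_024, 1_024), chunks=(24, 1_024, 1_024)))}
)

# 8x spatial coarse-graining: the output (8760, 128, 128) is still a large array,
# not a scalar reduction, but each chunk shrinks by a factor of 64
coarse = ds.coarsen(y=8, x=8).mean()

# so the output usually needs different chunking before it is written
coarse = coarse.chunk({"time": 1_440})
coarse.to_zarr("gs://some-scratch-bucket/coarse.zarr", mode="w")  # placeholder URL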

Generally, I think "writes" are less robust than "reads", but the latter is more frequently used by this community.

How big does the test need to be in order to be realistic?

I think so, but I need a better sense of the error rate per GB / HTTP request.

Which combinations of libraries do we want to include?

This often shows up in Apache Beam jobs for us, since we don't use dask clusters. We still use dask + xarray as a "lazy" array format, just not much for parallelism.

How do we orchestrate an integration test like this?

Agreed that we can build/push docker images from github CI.

For actually running the jobs, a cronjob in an existing k8s cluster might simplify deployment since there will be no need to authenticate access via dask gateway, firewall punching, kubernetes keys, etc. Also, E2E testing can take a long time, so CI is more costly especially if the compute is not actually happening on the CI server. This is the philosophy of CD tools like https://argoproj.github.io/argo-cd/.

If we can agree on a clearly scoped plan, I think we can support the cloud computing costs.

This would be great! I think I should be able to justify investing our time on this effort to my bosses, especially since we rely on this stack so much and already contribute to it.

@dcherian commented May 3, 2021

@scottyhq will be interested. If our NASA proposal gets funded, we were planning to put some time towards something like this.

@nbren12 (Contributor) commented May 3, 2021

Common problems arise from using chunks that are too small. It is easy to exceed Google's limits of 1000 API requests/s (reads), 100 API requests/s (writes), and 1 request/s (writes to the same object).
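
A back-of-envelope sketch of how quickly a modest cluster can hit those limits (the bandwidth and cluster size below are assumed values for illustration, not measurements):

chunk_mb = 2                  # a "too small" chunk
worker_bandwidth_mb_s = 200   # assumed per-worker throughput to object storage
n_workers = 100

requests_per_worker = worker_bandwidth_mb_s / chunk_mb   # ~100 requests/s each
total_requests_per_s = requests_per_worker * n_workers   # ~10,000 requests/s

print(f"~{total_requests_per_s:,.0f} requests/s vs. a 1,000/s read or 100/s write limit")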

@rabernat (Member Author) commented May 3, 2021

It is easy to exceed Google's limits of 1000 API requests/s (reads), 100 API requests/s (writes), and 1 request/s (writes to the same object).

What would you expect the client libraries to do to handle these limits, particularly in a distributed context? Retries? Exponential backoff?

@nbren12 (Contributor) commented May 3, 2021

Exponential backoff is Google's recommended solution. gcsfs has this, but not for all operations (see fsspec/gcsfs#387), and the test suite doesn't have great coverage for it (e.g. here's a bug fix I recently contributed: fsspec/gcsfs#380).
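
For reference, the generic shape of that pattern (retry with exponential backoff plus jitter) is sketched below; this is illustrative only and not gcsfs's actual implementation.

import random
import time

def call_with_backoff(fn, max_retries=6, base_delay=1.0, retriable=(Exception,)):
    """Call fn(), retrying transient failures with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retriable:
            if attempt == max_retries - 1:
                raise
            # 1s, 2s, 4s, ... plus up to 1s of random jitter
            time.sleep(base_delay * 2 ** attempt + random.random())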

@rabernat (Member Author) commented May 3, 2021

So the idea here is that the sort of integration testing we are proposing would help surface specific issues (like the ones you already raised), which would help guide upstream development?

@nbren12 (Contributor) commented May 4, 2021

Exactly. Currently I think gcsfs relies on user-provided bug reports to find these problems, but they usually only occur at scale, and it is hard for users to construct minimal examples divorced from their idiosyncratic infrastructure.

@nbren12 (Contributor) commented May 4, 2021

The usual workflow would be:

  1. bug shows up here
  2. open issue on gcsfs/...
  3. reproduce the bug in a local unit test of the buggy library (refactoring the code to be testable if required; see the sketch after this list)
  4. fix the bug.
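
To illustrate step 3, a local reproduction could look something like this sketch; FlakyTransport and fetch_with_retries are hypothetical stand-ins for the refactored pieces of the real library, not gcsfs code.

class FlakyTransport:
    """Returns a 429-style error for the first few calls, then succeeds."""
    def __init__(self, failures=2):
        self.failures = failures
        self.calls = 0

    def get(self, path):
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("429 Too Many Requests")
        return b"payload"

def fetch_with_retries(transport, path, retries=5):
    for attempt in range(retries):
        try:
            return transport.get(path)
        except RuntimeError:
            if attempt == retries - 1:
                raise

def test_retries_until_success():
    transport = FlakyTransport(failures=2)
    assert fetch_with_retries(transport, "bucket/key") == b"payload"
    assert transport.calls == 3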

@oliverwm1

Thanks for getting the ball rolling on this @rabernat! I think the tests you propose would reveal the main issues that have challenged us and make sure they don't creep back into the codebase once they are fixed. At least on the gcsfs side, I get the impression that a lot of people have been affected by intermittent I/O issues: fsspec/gcsfs#327, fsspec/gcsfs#323, fsspec/gcsfs#315, fsspec/gcsfs#316, fsspec/gcsfs#290

Beyond just reliability and catching bugs, it would be cool if these tests were also leveraged to monitor performance.

@rabernat (Member Author) commented May 4, 2021

If our NASA proposal gets funded, we were planning to put some time towards something like this.

@scottyhq and @dcherian - when do you expect to know about the NASA proposal? My understanding is that the scope is a lot broader than just cloud i/o. Would it be a problem if we bootstrapped something on a shorter timescale and then handed it off if / when your proposal is funded?

@nbren12 (Contributor) commented May 4, 2021

Another simple way to get some coverage is to install a fixed set of libraries and run their test suites. This will often find errors related to API changes, and is the usual process for preparing a "distribution" like Nix or Debian. FYI, I have been working on packaging some of the Pangeo stack with the Nix package manager. Here's a package I added for gcsfs. As a result, there is already some level of integration testing being done by their CI servers against specific versions of dask/xarray, etc.: https://hydra.nixos.org/build/142494940.
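
A crude sketch of that approach, assuming the pinned packages ship their test suites inside the installed package (true for some but not all of them, so the exact invocation may need adjusting per project):

import subprocess
import sys

PINNED_PACKAGES = ["gcsfs", "s3fs", "adlfs", "rechunker"]  # the set under test

failures = []
for pkg in PINNED_PACKAGES:
    # --pyargs tells pytest to locate tests inside the installed package
    result = subprocess.run([sys.executable, "-m", "pytest", "--pyargs", pkg])
    if result.returncode != 0:
        failures.append(pkg)

sys.exit(1 if failures else 0)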

@dcherian commented May 4, 2021

when do you expect to know about the NASA proposal?

🤷‍♂️

My understanding is that the scope is a lot broader than just cloud i/o.

Yeah I think so.

Would it be a problem if we bootstrapped something on a shorter timescale and then handed it off if / when your proposal is funded?

Of course! I don't see an issue at all.

@scottyhq (Member) commented May 5, 2021

when do you expect to know about the NASA proposal?

To give you a ballpark timeline, our proposal has a start date of 6/30/2021. If funded, I'd definitely be interested in collaborating on this. I don't know how many issues are specific to gcsfs, but NASA is still pretty focused on AWS, so we'd want to use S3 storage as well.

@rabernat (Member Author)

So here is one version of the steps that need to happen to get this working.

On each cloud provider we wish to test:

  • Someone sets up a dask_gateway / Coiled / whatever dask cluster we can connect to using credentials stored as secrets in this repo
  • Someone sets up a bucket with r/w access via credentials stored as secrets in this repo

In this repo:

  • We configure a workflow to build a docker image used by the workers (perhaps we could bootstrap or skip this step by using pangeo-docker-images)
  • We add python scripts that do the desired computations (a minimal sketch follows after this list)
  • We configure a workflow to run the scripts using a matrix of different clusters / configuration options
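
To make the "python scripts" step concrete, here is a minimal sketch of one such script, parameterized by whatever the CI matrix passes in; the argument names and the commented-out helper functions are placeholders, not an agreed interface.

import argparse

def main():
    parser = argparse.ArgumentParser(description="Pangeo cloud I/O integration test")
    parser.add_argument("--storage-url", required=True)    # e.g. gs://... or s3://...
    parser.add_argument("--cluster", default="gateway")    # gateway / coiled / local
    parser.add_argument("--size-gb", type=float, default=100.0)
    parser.add_argument("--chunk-mb", type=float, default=100.0)
    args = parser.parse_args()

    # Hypothetical helpers that would live in this repo:
    # client = connect_cluster(args.cluster)
    # run_write_read_roundtrip(args.storage_url, args.size_gb, args.chunk_mb)
    print(f"would test {args.size_gb} GB at {args.storage_url} via {args.cluster}")

if __name__ == "__main__":
    main()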

@nbren12 (Contributor) commented May 11, 2021

Someone sets up a dask_gateway / Coiled / whatever dask cluster we can connect to using credentials stored as secrets in this repo

Just wondering if it would be simpler to put a k8s cron job like this in each cluster:

apiVersion: batch/v1
kind: CronJob
metadata:
  generateName: pangeo-integration-test
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: accountWithNeededPrivs
          containers:
          - name: hello
            image: pangeo-integration-test:latest
            imagePullPolicy: IfNotPresent
            # all logic happens here
            # and publishes an integration report to a static webpage
            command: ["/bin/bash", "run_tests.sh"]

          restartPolicy: OnFailure

Then, this repo wouldn't need to authenticate against each cluster (or even know they exist) beyond providing a link to the integration test reports. Also, the CI wouldn't need to wait around while the jobs finish. Basically, this decouples the integration tests from the update cycle of this repo (i.e. continuous delivery). We could run some basic checks (read/write from a bucket, run the test suite, etc.) from this repo's CI before publishing the docker image pangeo-integration-test:latest.

@yuvipanda (Member)

How about we run a GitHub self-hosted runner on a Kubernetes cluster that also has dask gateway configured? That way, we get all the goodness of GitHub Actions without having to deal with nasty auth stuff.

@martindurant

A couple of notes:

  • for gcsfs, recent improvements should have helped things in the version currently waiting on conda-forge. Of course, full-scale testing would still be useful. This would require a real bucket on GCS, and payment.
  • for s3fs, there is already a proposal to add minio as an additional test runner to moto, which would certainly find some edge cases, particularly for non-AWS S3s. Testing against real S3 would, again, require money.
  • to my mind, designing xarray+dask+storage tests amounts to running some specialised recipes that fill much of the parameter space of (number of chunks, size of chunks, number of files, number of threads/processes); a sketch of such a sweep follows below.
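
For example, the recipes could be generated by sweeping that parameter space directly; the values and the run_roundtrip hook below are placeholders.

import itertools

n_chunks = [10, 1_000, 100_000]
chunk_size_mb = [1, 10, 100, 500]
n_files = [1, 100]
n_threads = [1, 8, 64]

for combo in itertools.product(n_chunks, chunk_size_mb, n_files, n_threads):
    params = dict(zip(["n_chunks", "chunk_size_mb", "n_files", "n_threads"], combo))
    # run_roundtrip(params)  -- hypothetical hook into the test harness
    print(params)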

@rabernat (Member Author)

Thanks for the input, Martin. The plan is indeed to spend real money to test the whole stack at scale on real cloud providers.
