Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async indexing with concurrency limit (and progress callback) #11

Merged
merged 9 commits into from
Jan 15, 2020

Conversation

manzt
Copy link
Collaborator

@manzt manzt commented Jan 14, 2020

Rationale:

  • Currently, each request is fetched synchronously, which doesn't exploit the browser's ability to make multiple requests concurrently.

Introduces:

  • Asynchronous indexing of the zarr HTTPStore.
  • Anoptions object as a parameter to ZarrArray.get. The options object exposes concurrencyLimit (limit of number of concurrent requests), and progressCallback (function which takes the current progress of the indexing 0 -> 1 as an argument) fields.

Usage:

const z = await openArray(config);
const slice2d = await z.get([0, null, null], { 
        concurrencyLimit: 20, 
        progressCallback: progress => console.log(progress)
    }
);

I think there is something much cleaner to do with the StoreGetOptions interface, but I am fairly new to Typescript. Hoping you have some suggestions there!

I decided to use p-limit as it seems to be the most popular tool in this space (used by 3.8M) and is maintained actively. It also provides a nice way to get the number of items remaining in the queue, which I use.

For some reason optional chaining is was not working in typescript, so there are some rather ugly blocks performing checks for null and undefined.

@manzt
Copy link
Collaborator Author

manzt commented Jan 14, 2020

This demo times multiple requests to a zarr store containing (3, 5000, 5000) array. Toggling between my build and v0.1.3 (and playing with the concurrency) demonstrates ~2x increase in performance for these 2D slices (5000 x 5000) of the data.

Make sure to turn off cache in your browser's network tab when trying this demo.

<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <meta http-equiv="X-UA-Compatible" content="ie=edge" />
  <title>zarr.js demo</title>
</head>

<body>
  <!-- <script src="https://unpkg.com/zarr@0.1.3/dist/zarr.umd.js"></script> -->
  <script src="./dist/zarr.umd.js"></script>
  <script>
    const baseUrl =
      'https://gist.githubusercontent.com/' +
      'manzt/d16dbac0ea3adc3c7b9b61f54fa1f78d';

    const config = {
      store: `${baseUrl}/raw/`,
      path: '95854058512862accb0182d4a02f86a55ad19139',
      mode: 'r',
    }; // (3, 5000, 5000) array

    async function main() {
      // connect to HTTPStore
      const z = await zarr.openArray(config);
      const times = [];
      const iters = 5;
     
      // repeat the same request for the (5000, 5000) slice
      for (let i = 1; i <= iters; i++) {
        console.time(`index [0, null, null] (${i} of ${iters})`);
        const t0 = performance.now();
        const slice = await z.get([0, null, null], {
          concurrencyLimit: 10,
          progressCallback: progress => console.log(progress),
        });
        const t1 = performance.now();
        console.timeEnd(`index [0, null, null] (${i} of ${iters})`);
        times.push(t1 - t0);
      }

      // summarize stats
      const n = times.length;
      const mean = times.reduce((a, b) => a + b) / n;
      const s = Math.sqrt(
        times.map(x => (x - mean) ** 2).reduce((a, b) => a + b) / n
      );

      console.log(`mean time ${mean}ms, sd ${s}ms`);
    }
    main();
  </script>
</body>

</html>

Try fetching the whole array,

const slice = await z.get(null, {
     concurrencyLimit: 10,
     progressCallback: progress => console.log(progress)
}

to see ~4x speed up (12s vs 3.5s).

@manzt manzt requested a review from gzuidhof January 14, 2020 20:36
Copy link
Owner

@gzuidhof gzuidhof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up 👍

I left some comments that I think can improve it, let me know whether you agree with them

src/core/index.ts Outdated Show resolved Hide resolved
src/core/index.ts Outdated Show resolved Hide resolved
src/core/index.ts Outdated Show resolved Hide resolved
src/core/index.ts Show resolved Hide resolved
@gzuidhof
Copy link
Owner

I love that you can use GitHub Gist as a backend! I'm so excited for how this library will be used in the future :)

@manzt
Copy link
Collaborator Author

manzt commented Jan 14, 2020

It was a bit of a hack since directories are not supported on GitHub Gist. I just created an empty gist, cloned it, dumped the contents of dummy_data.zarr, and crossed my fingers :)

@manzt
Copy link
Collaborator Author

manzt commented Jan 15, 2020

Ok! I think I found something really interesting. p-queue written by the same author as p-limit offers the ability to dynamically allocate a queue.

This seems perfect for our situation, and solves the callback issue.

See most recent commits.

@manzt
Copy link
Collaborator Author

manzt commented Jan 15, 2020

However, I'm still a bit confused with how to supply the default args in.getBasicSelection. I think there is something I don't understand about function overloading... probably just need to do some reading. Sorry for my lack of understanding here

  public async getBasicSelection(selection: Slice | ":" | "..." | null | (Slice | null | ":" | "...")[]): Promise<NestedArray<TypedArray>>;
  public async getBasicSelection(selection: ArraySelection): Promise<NestedArray<TypedArray> | number>;
  public async getBasicSelection(selection: ArraySelection, { concurrencyLimit = 10 }: StoreGetOptions): Promise<number | NestedArray<TypedArray>> {
    // Refresh metadata
    if (!this.cacheMetadata) {
      await this.reloadMetadata();
    }

    // Check fields (TODO?)

    if (this.shape === []) {
      throw new Error("Shape [] indexing is not supported yet");
    } else {
      return this.getBasicSelectionND(selection, concurrencyLimit, progressCallback); 
       //progressCallback isn't defined so how do I pass this explicitly?
    }
}

Then would we expect to see the subsequent definitions like:

private getBasicSelectionND(selection: ArraySelection, concurrencyLimit: number, progressCallback: ((progressUpdate: { progress: number; queueSize: number }) => void) | undefined) {
    const indexer = new BasicIndexer(selection, this);
    return this.getSelection(indexer, concurrencyLimit, progressCallback);
  }

@gzuidhof
Copy link
Owner

The way I see it is that this is actually not function overloading, but instead what different types of input (signatures) you can give and then what type that returns.

The last one in the list is the "actual" one, the ones above are a subset of it. In the case of this getSelection function it strongly types that if you select with a slice of some sort you will always get back a NestedArray and not a number. Whereas if you pass in an integer it may return a single number.

Hopefully that makes sense and helps

@manzt
Copy link
Collaborator Author

manzt commented Jan 15, 2020

I guess I'm confused because the defaults are ignored when the other signatures are used, and when I try to supply defaults (with destructuring) to the other signatures, the compiler complains

A parameter initializer is only allowed in a function or constructor implementation.

@gzuidhof
Copy link
Owner

  public async getBasicSelection(selection: Slice | ":" | "..." | null | (Slice | null | ":" | "...")[], opts?: GetOptions): Promise<NestedArray<TypedArray>>;
  public async getBasicSelection(selection: ArraySelection, opts?: GetOptions): Promise<NestedArray<TypedArray> | number>;
  public async getBasicSelection(selection: ArraySelection, {concurrencyLimit = 10}: GetOptions): Promise<number | NestedArray<TypedArray>> {

Does something like this work perhaps?

@manzt
Copy link
Collaborator Author

manzt commented Jan 15, 2020

Alright, with a bit of head banging I think I figured it out!

Thanks for the guidance & patience 🙏

Copy link
Owner

@gzuidhof gzuidhof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's worth adding a test case for the progressCallback, but we can do that another day. Thank you for the contribution!

@gzuidhof gzuidhof merged commit bb29412 into gzuidhof:master Jan 15, 2020
@manzt manzt deleted the async-indexing branch January 20, 2020 21:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants