
Improve loadImage performance by between 20% and 100% #726

Closed · wants to merge 8 commits

Conversation

@nathanbabcock (Contributor) commented Apr 2, 2023

TL;DR: toBlob() uses a relatively slow PNG encoder by default. An uncompressed format like image/bmp is better suited to this use case.

Canvas .toBlob() Benchmarks

I tested the performance of toBlob with 4 different lossless image formats:

  • PNG (the current default, mime type image/png)
  • RAW (an obscure mime type image/x-dcraw)
  • TIF (image/tif)
  • BMP (image/bmp)

There's a list of other potential mime types in this thread; however, image/bmp is widely supported and gave me the best results on average (see below).

On a 1955x3036px Canvas image:

(Using pixel data from meditations.jpg in the tesseract.js repo)

toPngBlob: 442.7470703125 ms
toRawBlob: 360.158935546875 ms
toTifBlob: 362.74609375 ms
toBmpBlob: 354.842041015625 ms

Average of 19.9% speedup for BMP

On a 100x100px Canvas image:

toPngBlob: 67.254150390625 ms
toRawBlob: 5.721923828125 ms
toTifBlob: 6.571044921875 ms
toBmpBlob: 1.542724609375 ms

Average of 97.7% speedup for BMP (!!)
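
For reference, a minimal sketch of how timings like these can be collected (illustrative only, not the exact benchmark code from this PR; timeToBlob and the canvas variable are hypothetical):

const timeToBlob = (canvas, mimeType) =>
  new Promise((resolve) => {
    const start = performance.now();
    canvas.toBlob((blob) => {
      resolve({ blob, ms: performance.now() - start });
    }, mimeType);
  });

// e.g. const { ms } = await timeToBlob(canvas, 'image/png');
// console.log(`toPngBlob: ${ms} ms`);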

Summary

The benefits of a faster image encoder are most noticeable on small images (100x100px or smaller). In addition, Tesseract.js can achieve truly lightning-fast speeds on small inputs -- I'm seeing single-digit-millisecond calls to recognize in some of my testing. With PNG encoding avoided, it is plenty fast enough for fully realtime use cases, since PNG encoding can take 4-5x longer than the actual text recognition at these sizes.

Other changes

Add support for OffscreenCanvas

This enables Tesseract.js to be used from inside another Web Worker, if desired (for example, after some image pre-processing that already happens off the main thread). It uses the OffscreenCanvas.convertToBlob method, which closely parallels HTMLCanvasElement.toBlob.
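
As an illustration, a minimal sketch of the OffscreenCanvas path (variable names are placeholders, not code from this PR):

// OffscreenCanvas works inside a Web Worker, where document and HTMLCanvasElement do not exist
const offscreen = new OffscreenCanvas(200, 100);
const ctx = offscreen.getContext('2d');
ctx.fillText('hello world', 10, 50);

// convertToBlob returns a Promise and takes an options object,
// whereas HTMLCanvasElement.toBlob takes a callback
const blob = await offscreen.convertToBlob({ type: 'image/png' });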

Also, since HTMLElement is not defined inside a web worker, I added a check for it before the instanceof test in order to avoid errors when calling loadImage from inside a Worker.
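
The guard amounts to something like the following sketch (not the exact diff from this PR):

// Inside a Worker, HTMLElement is undefined, so feature-detect before the instanceof test
if (typeof HTMLElement !== 'undefined' && image instanceof HTMLElement) {
  // handle <img>, <canvas>, and <video> elements (main-thread only)
}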

Update ImageLike types

type ImageLike currently contains two types for which there is no implementation:

  • CanvasRenderingContext2D, which is not the intended way to read from <canvas> and would have no effect
  • ImageData, which would similarly need to be converted to a blob somehow; it is easier to start from the <canvas> directly

Both options would currently result in runtime errors if they were used. So I've removed these two types, while also adding the OffscreenCanvas type as an option.

(not implemented) Better support for <video>

Support for <video> sources could be improved to obtain individual video frames instead of the static video.poster that is currently used. This would involve capturing video frames via canvasContext.drawImage(video, 0, 0) and then calling canvas.toBlob() like any other source. However, this would technically be a breaking change if anyone relies on the existing video.poster behavior, so I've left it out of this PR for now. I think reading video frames would be more intuitive than reading the video poster/thumbnail, but for now users can implement that manually, which also preserves backwards compatibility.
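
A rough sketch of what that could look like (hypothetical, assuming an existing <video> element named video; not part of this PR):

// Capture the current frame of a <video> element instead of its poster image
const canvas = document.createElement('canvas');
canvas.width = video.videoWidth;
canvas.height = video.videoHeight;
canvas.getContext('2d').drawImage(video, 0, 0);
canvas.toBlob((blob) => {
  // pass the blob to loadImage / recognize like any other source
});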

Checks

  • No breaking changes ✔
  • Eslint passing ✔
  • npm run test:all passes ✔
  • Small change, only a few lines in 1 file + type definitions ✔

@Balearica (Collaborator) commented Apr 2, 2023

Thank you for this thoughtful contribution and explanation. Recognizing extremely small images is outside of my personal use case, so I was unaware of the large performance overhead in this specific situation. My initial thoughts are below.

  1. Can you confirm that your addition of OffscreenCanvas support does not cause issues in browsers that do not support this feature?
    1. Up until very recently OffscreenCanvas was only available in Chrome
  2. Can you add a unit test for recognizing OffscreenCanvas?
    1. We have existing unit tests for other recognition sources, so you should be able to copy/paste and slightly modify an existing test
      1. See https://github.com/naptha/tesseract.js/blob/master/tests/recognize.test.js
  3. Can you run the benchmark code with/without your changes and report the results?
    1. I want to be sure we have numbers on how this impacts the entire runtime of recognize, not just the toBlob function
    2. Presumably this change won't lead to a big improvement in runtime for the larger test images, but we should confirm that this change does not hurt runtime for certain image types
    3. Edit: This request does not make sense because the existing benchmark code does not use canvas elements, so none of the changes here actually impact it

Additionally, if Tesseract.js switches to using bmp more, I will have to look into the impact of this part of the setImage function on runtime.

/*
 * Leptonica supports some but not all bmp files
 * @see https://github.com/DanBloomberg/leptonica/issues/607#issuecomment-1068802516
 * We therefore use bmp-js to convert all bmp files into a format Leptonica is known to support
 */
if (type && type.mime === 'image/bmp') {
  // Not sure what this line actually does, but removing breaks the function
  const buf = Buffer.from(Array.from({ ...image, length: Object.keys(image).length }));
  const bmpBuf = bmp.decode(buf);
  TessModule.FS.writeFile('/input', bmp.encode(bmpBuf).data);
} else {

At present all bmp images are re-encoded at this step to account for the fact that Leptonica (the image processing library used by Tesseract) cannot process certain bmp images. Presumably this adds some overhead. If this overhead is meaningful, and the bmp images produced by toBlob are not subject to these issues, we should think about bypassing this step.

@nathanbabcock (Contributor, Author)

Great points, I will follow up on these later this week! 👌

@Balearica (Collaborator)

I looked into this more today and am now unsure whether converting from canvas elements to bmp files is supported by browsers at all.

Running the following in Chrome returns an object where the type attribute is image/png (the default).

canvas = document.createElement('canvas');
canvas.toBlob((x) => console.log(x))

By setting the second argument of toBlob to image/jpeg, the type attribute of the returned object changes to image/jpeg.

canvas.toBlob((x) => console.log(x), "image/jpeg")

However, by setting the second argument to image/bmp, the type attribute of the returned object is still image/png. This is the same result that occurs when you set the type argument to gibberish.

canvas.toBlob((x) => console.log(x), "image/bmp")

Additionally, I did not find any indication in the documentation for toBlob that these image types are supported. The link you provided merely lists these formats as MIME types, which does not indicate that this particular function is capable of creating images in those formats.
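
One way to check whether the browser actually honored the requested type (a sketch, not code from this thread) is to compare the returned blob's type attribute against the request:

const requested = 'image/bmp';
canvas.toBlob((blob) => {
  console.log(blob.type === requested
    ? `browser encoded ${requested}`
    : `silently fell back to ${blob.type}`);
}, requested);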

@nathanbabcock (Contributor, Author)

I just realized a major problem with my benchmark...

I was using the API from convertToBlob (the OffscreenCanvas variant). That function accepts an options object like { type: 'image/bmp' }. But regular canvas toBlob takes a plain string as the second param. As you noted from your testing, no error or warning gets thrown for an invalid parameter; it just silently falls back to PNG. That means that every test in my benchmark was actually outputting PNG!
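
In other words, the two signatures differ like this (sketch with placeholder variables):

// OffscreenCanvas: promise-based, options object
offscreen.convertToBlob({ type: 'image/jpeg' });

// HTMLCanvasElement: callback-based, plain MIME-type string as the second argument
canvas.toBlob((blob) => { /* ... */ }, 'image/jpeg');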

But what caused the significant difference in performance then? There must be some kind of caching, because the first call to toBlob is always slower than subsequent ones. Because of the way I structured my tests, PNG came first in every single iteration and took the performance hit. The following three formats all had comparable performance because they were all just cached PNGs. Then I re-rendered the canvas and started again with the next iteration, getting similar results...

Between this and your notes about bitmap weirdness in Tesseract, I think I might need to rethink the whole approach.

I found another format called PNM in the Leptonica source code.

 /**
  *      The pnm formats are exceedingly simple, because they have
  *      no compression and no colormaps.  They support images that
  *      are 1 bpp; 2, 4, 8 and 16 bpp grayscale; and rgb.
  */

Source: pnmio.c

I assume Tesseract.js uses all the Leptonica stuff under the hood. In theory it would be trivial to take a canvas ImageData and just prepend a header to it.

@nathanbabcock (Contributor, Author) commented Apr 16, 2023

I've updated the implementation.


  1. Can you confirm that your addition of OffscreenCanvas support does not cause issues in browsers that do not support this feature?

Yes. I've guarded each usage of OffscreenCanvas with a check like this:

} else if (typeof OffscreenCanvas !== 'undefined' && image instanceof OffscreenCanvas) {
    // ...
}

If OffscreenCanvas is not defined in the context (e.g. on older browsers), it will skip over these branches entirely. The same applies in reverse for the usage of HTMLElement inside a Web Worker (which has access to OffscreenCanvas, but not HTMLCanvasElement or HTMLElement).

In addition to these fallbacks, OffscreenCanvas is supported by all major browsers now: https://caniuse.com/offscreencanvas.


  2. Can you add a unit test for recognizing OffscreenCanvas?

Yes, I've added a test that duplicates the regular canvas test exactly, using OffscreenCanvas: nathanbabcock@1ca7cc3


Additionally, if Tesseract.js switches to using bmp more, I will have to look into the impact of this part of the setImage function on runtime.

Between this and the questionable behavior of canvas.toBlob(), I've rewritten the implementation to convert from canvas ImageData directly into a variant of the PBM format. PBM is supported internally by Tesseract/Leptonica, and is even covered by the Tesseract.js unit tests already. It can be created instantly from ImageData by prepending a simple ASCII-encoded header. The implementation is straightforward:

const imageDataToPBM = (imageData) => {
  const { width, height, data } = imageData;
  const DEPTH = 4; // channels per pixel (RGBA = 4)
  const MAXVAL = 255; // range of each channel (0-255)
  const TUPLTYPE = 'RGB_ALPHA';
  let header = 'P7\n';
  header += `WIDTH ${width}\n`;
  header += `HEIGHT ${height}\n`;
  header += `DEPTH ${DEPTH}\n`;
  header += `MAXVAL ${MAXVAL}\n`;
  header += `TUPLTYPE ${TUPLTYPE}\n`;
  header += 'ENDHDR\n';
  const encoder = new TextEncoder();
  const binaryHeader = encoder.encode(header);
  const binary = new Uint8Array(binaryHeader.length + data.length);
  binary.set(binaryHeader);
  binary.set(data, binaryHeader.length);
  return binary;
};
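
For context, a usage sketch of how the resulting bytes could be handed to the Emscripten filesystem (the ctx and canvas variables here are illustrative; the actual wiring in the PR may differ):

const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height);
const pbm = imageDataToPBM(imageData);
TessModule.FS.writeFile('/input', pbm);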

@nathanbabcock (Contributor, Author)

Here is an updated sample run of the Canvas unit tests, before and after this PR. It includes both the total runtime of each test and a console.time placed inside the loadImage function to measure the impact of that change in isolation.
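
The instrumentation is along the lines of the following sketch (convertImage is a placeholder for the existing conversion logic, not a real function in the codebase):

const loadImage = async (image) => {
  console.time('loadImage()');
  const data = await convertImage(image); // placeholder for the existing conversion steps
  console.timeEnd('loadImage()');
  return data;
};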

Before

should read video from canvas DOM element (browser only)
loadImage(): 11.55908203125 ms
      ✅ support png format (165ms)
loadImage(): 9.030029296875 ms
      ✅ support jpg format (161ms)
loadImage(): 7.42333984375 ms
      ✅ support bmp format (163ms)
loadImage(): 12.073974609375 ms
      ✅ support webp format (163ms)
loadImage(): 8.673095703125 ms
      ✅ support gif format (161ms)

After

loadImage(): 0.3720703125 ms
      ✅ support png format (155ms)
loadImage(): 0.132080078125 ms
      ✅ support jpg format (153ms)
loadImage(): 0.22998046875 ms
      ✅ support bmp format (154ms)
loadImage(): 0.241943359375 ms
      ✅ support webp format (156ms)
loadImage(): 0.234130859375 ms
      ✅ support gif format (153ms)

Like before, the performance improvement is impressive percentage-wise, but only about a 10ms difference in absolute terms. That means it will be most noticeable on very small input images which already run end-to-end very fast (100ms or less).

@Balearica (Collaborator)

Thanks for updating, I will review at some point this week.

@Balearica (Collaborator)

I tested this today with the benchmark images, and for the larger images this branch appears to run significantly slower. For example, when I loaded the largest benchmark image (meditations.jpg) to a canvas and then ran recognition, this change increased recognition time from 4.4 seconds to 6.2 seconds on Chrome and from 4.7 seconds to 7.8 seconds on Firefox.

@nathanbabcock (Contributor, Author)

Interesting. I'll take a look and try to track down the source of the slowness on large images.

Maybe it's something like an unnecessary alpha channel that I'm always including? I'll also take a closer look at the PBM format I'm generating and make sure the resulting images appear as expected (visually). It could also be some kind of unintended color-shift under the hood if I've got the image format slightly off.
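
Removing the alpha would mean converting the RGBA payload to packed RGB before prepending the header, roughly like this sketch (the PAM header then uses DEPTH 3 and TUPLTYPE RGB):

// Drop the alpha channel so the payload is packed RGB
const stripAlpha = ({ width, height, data }) => {
  const rgb = new Uint8Array(width * height * 3);
  for (let src = 0, dst = 0; src < data.length; src += 4, dst += 3) {
    rgb[dst] = data[src];         // R
    rgb[dst + 1] = data[src + 1]; // G
    rgb[dst + 2] = data[src + 2]; // B
    // data[src + 3] (alpha) is discarded
  }
  return rgb;
};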

Thanks for your time looking into this.

@nathanbabcock (Contributor, Author)

I tried removing the alpha channel. It's a bit faster than before, but still slower than master. Here's the comparison (all using meditations.jpg on a canvas on Chrome):

master

loadImage: 452.35400390625 ms
recognize from canvas: 7434.01318359375 ms

pr

loadImage: 143.279052734375 ms
recognize from canvas: 9575.90087890625 ms

pr with no alpha channel

loadImage: 139.8369140625 ms
recognize from canvas: 8361.510986328125 ms

I'm pretty confused why recognize takes so much longer with this PBM image format, especially when loadImage itself runs roughly 3x faster. My assumption was that having immediate access to the raw, uncompressed pixels would be optimal for whatever happens internally inside tesseract-core. But this must be mistaken somehow.

Anyone have any insight into why raw image data would be slower to recognize than a PNG-encoded image?

@Balearica (Collaborator)

I ran your branch with meditations.jpg and produced qualitatively similar results. I am not sure why this is, and share your intuition that one would think a less complex format would run faster.

Runs with master:
5,656ms
5,706ms
5,719ms

Runs with pr:
7,235ms
7,239ms
7,250ms

I also ran the updated branch with testocr.png, and it failed to recognize this file correctly. I believe this is a binarized image, so the problem is likely with color channels. However, it is probably not worth investigating this as long as the above observation (regarding the runtime increase) remains true.

@nathanbabcock (Contributor, Author)

@Balearica

Let's ditch all this custom image loading stuff for now. It's too unclear what is causing the performance differences.

I'd like to recover a small piece of working functionality from this PR: support for OffscreenCanvas, and the unit tests to cover it.

Take a look at your convenience: #766
