Further modularize the pdf.js npm package #12900

rondonjon · 2021-01-24T17:13:07Z

Given that pdf.js was once built as a PDF rendering library for Firefox and bundled with the application, I understand that bundle size was never a top priority. On the other hand, pdf.js is still (and maybe more than ever) the first choice for developers who work on web applications that need to process or render PDF documents in some way. There is a myriad of questions on StackOverflow around how to bundle the huge pdfjs CommonJS modules with applications.

@robrez suggested to deploy at least tree-shakeable ES modules along with the CJS modules in 2017 (#10317), but the ticket has received little attention despite 33 thumb-ups. There is a module called @bundled-es-modules/pdfjs-dist on npm that claims to "bundle and expose" pdfjs as an ES module, but I think they're really only wrapping the pre-built pdf.js library (which is still only default-exporting and relying on that built-in require()), i.e. no tree-shaking advantage over the original library.

I spent many hours browsing the source code of pdf.js, and my greatest concerns with regard to reducing the bundle size are:

the highly generic API which even abstracts the different types of input sources from the caller
the hardcoded worker module resolution
the runtime evaluation of the PDFJSDev.* configuration variables

As an application developer, I probably don't need or want a PDF library to handle input abstraction (including different buffer formats, downloading from network streams, node.js integration, etc.) for me. I would probably prefer direct access to the underlying processor and provide my PDF data directly in its single, preferred buffer format. That, together with working tree-shaking, would probably already remove a big part from the resulting bundle.

The worker module resolution is a similar case. The published version of the library relies on a hardcoded (yet dynamic) require() which bundlers cannot replace with their own module resolution due to the sheer number of evaluated options. Again, I think this should better be handled outside the library, by the caller, who knows better about the (single, actual) environment.

And lastly, the whole PDFJSDev variable system could contribute to smaller bundle sizes if only it would qualify for build-time evaluation and dead code removal. Maybe I am missing something, but I have a feeling that the configuration (which is currently an object and runtime-processed with helpers like .test() and .eval() on top) could just as well be replaced with something more common like process.env variables, which bundlers like webpack or rollup would then evaluate earlier, at build time, so that unused code could be identified and removed from the bundle.

I think this whole topic deserves more attention than it is currently receiving. I would be willing to contribute a fair share, but it would need more volunteers, and contact to somebody who is more familiar with the code would surely speed things up.

The text was updated successfully, but these errors were encountered:

timvandermeij · 2021-01-24T17:19:40Z

I think the problem so far has been that most contributors work on the library itself and are not too familiar with the myriad of bundling tools. In the beginning we supported multiple bundling tools that contributors helped to implement, but this became complex quite soon, so we simplified this in order for it to be maintainable for us. I think we're all open to suggestions for improved bundling, but we rely on the community here to provide us with feedback and patches since this isn't something we work with often. We're happy to review any patches to improve the current situation, but we do want to focus on maintainability as well.

Snuffleupagus · 2021-01-24T18:07:37Z

Given that pdf.js was once built as a PDF rendering library for Firefox and bundled with the application,

Well, please note that this is still the main reason for this library to exist :-)

As mentioned in #12900 (comment), over the last few years we've purposely removed a lot of overall complexity related to bundling since that's allowed quite a bit of clean-up/simplification of old code.

the highly generic API which even abstracts the different types of input sources from the caller

Note that having one unified API-surface significantly simplifies both maintenance and support, not to mention that by not providing direct access to "lower"-level code we can much more easily make changes and implement new features.
Basically, we really want to avoid a situation where a significant portion of the underlying code is part of the official API-surface.

the hardcoded worker module resolution
[...]
The worker module resolution is a similar case. The published version of the library relies on a hardcoded (yet dynamic) require()

Note that this part in only relevant for Node.js environments, where Web Workers aren't available...

which bundlers cannot replace with their own module resolution due to the sheer number of evaluated options.

... for every other use-case, i.e. in browsers, note that you're not supposed to load the pdf.worker.js file manually since that will lead to very poor performance (since all parsing should run in a worker-thread).
Hence you should either use the workerSrc option to point its location, or use the workerPort option (see e.g. the Webpack use case in pdfjs-dist), or even provide your own PDFWorker-instance when calling getDocument (note the JSDocs here).

the runtime evaluation of the PDFJSDev.* configuration variables
[...]
Maybe I am missing something, but I have a feeling that the configuration (which is currently an object and runtime-processed with helpers like .test() and .eval() on top) could just as well be replaced with something

Huh; but those are evaluated when the library is built, see gulpfile.js, rather than at runtime!?

While you can obviously always create your own build of the PDF.js library, where you remove e.g. the network-code that you don't need, I'm a bit worried that trying to officially support such a thing in our builds could quickly lead to a high maintenance/support-burden for the few regular PDF.js contributors.

timvandermeij · 2021-02-15T19:35:40Z

I'm going to close this issue since there are currently no actionable points in here. Hopefully it does shed some light on the current situation and in which areas we will review contributions. Thanks.

rondonjon changed the title ~~Modularize pdf.js~~ Further modularize the pdf.js npm package Jan 24, 2021

timvandermeij added the other label Jan 24, 2021

timvandermeij closed this as completed Feb 15, 2021

localjo mentioned this issue Feb 26, 2021

Dependencies that use require.ensure for backward compatibility cause warnings facebook/create-react-app#6418

Open

This comment was marked as duplicate.

Sign in to view

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Further modularize the pdf.js npm package #12900

Further modularize the pdf.js npm package #12900

rondonjon commented Jan 24, 2021

timvandermeij commented Jan 24, 2021

Snuffleupagus commented Jan 24, 2021 •

edited

Loading

timvandermeij commented Feb 15, 2021

This comment was marked as duplicate.

Further modularize the pdf.js npm package #12900

Further modularize the pdf.js npm package #12900

Comments

rondonjon commented Jan 24, 2021

timvandermeij commented Jan 24, 2021

Snuffleupagus commented Jan 24, 2021 • edited Loading

timvandermeij commented Feb 15, 2021

This comment was marked as duplicate.

Snuffleupagus commented Jan 24, 2021 •

edited

Loading