Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Further modularize the pdf.js npm package #12900

Closed
rondonjon opened this issue Jan 24, 2021 · 4 comments
Closed

Further modularize the pdf.js npm package #12900

rondonjon opened this issue Jan 24, 2021 · 4 comments
Labels

Comments

@rondonjon
Copy link

Given that pdf.js was once built as a PDF rendering library for Firefox and bundled with the application, I understand that bundle size was never a top priority. On the other hand, pdf.js is still (and maybe more than ever) the first choice for developers who work on web applications that need to process or render PDF documents in some way. There is a myriad of questions on StackOverflow around how to bundle the huge pdfjs CommonJS modules with applications.

@robrez suggested to deploy at least tree-shakeable ES modules along with the CJS modules in 2017 (#10317), but the ticket has received little attention despite 33 thumb-ups. There is a module called @bundled-es-modules/pdfjs-dist on npm that claims to "bundle and expose" pdfjs as an ES module, but I think they're really only wrapping the pre-built pdf.js library (which is still only default-exporting and relying on that built-in require()), i.e. no tree-shaking advantage over the original library.

I spent many hours browsing the source code of pdf.js, and my greatest concerns with regard to reducing the bundle size are:

  • the highly generic API which even abstracts the different types of input sources from the caller
  • the hardcoded worker module resolution
  • the runtime evaluation of the PDFJSDev.* configuration variables

As an application developer, I probably don't need or want a PDF library to handle input abstraction (including different buffer formats, downloading from network streams, node.js integration, etc.) for me. I would probably prefer direct access to the underlying processor and provide my PDF data directly in its single, preferred buffer format. That, together with working tree-shaking, would probably already remove a big part from the resulting bundle.

The worker module resolution is a similar case. The published version of the library relies on a hardcoded (yet dynamic) require() which bundlers cannot replace with their own module resolution due to the sheer number of evaluated options. Again, I think this should better be handled outside the library, by the caller, who knows better about the (single, actual) environment.

And lastly, the whole PDFJSDev variable system could contribute to smaller bundle sizes if only it would qualify for build-time evaluation and dead code removal. Maybe I am missing something, but I have a feeling that the configuration (which is currently an object and runtime-processed with helpers like .test() and .eval() on top) could just as well be replaced with something more common like process.env variables, which bundlers like webpack or rollup would then evaluate earlier, at build time, so that unused code could be identified and removed from the bundle.

I think this whole topic deserves more attention than it is currently receiving. I would be willing to contribute a fair share, but it would need more volunteers, and contact to somebody who is more familiar with the code would surely speed things up.

@rondonjon rondonjon changed the title Modularize pdf.js Further modularize the pdf.js npm package Jan 24, 2021
@timvandermeij
Copy link
Contributor

I think the problem so far has been that most contributors work on the library itself and are not too familiar with the myriad of bundling tools. In the beginning we supported multiple bundling tools that contributors helped to implement, but this became complex quite soon, so we simplified this in order for it to be maintainable for us. I think we're all open to suggestions for improved bundling, but we rely on the community here to provide us with feedback and patches since this isn't something we work with often. We're happy to review any patches to improve the current situation, but we do want to focus on maintainability as well.

@Snuffleupagus
Copy link
Collaborator

Snuffleupagus commented Jan 24, 2021

Given that pdf.js was once built as a PDF rendering library for Firefox and bundled with the application,

Well, please note that this is still the main reason for this library to exist :-)

As mentioned in #12900 (comment), over the last few years we've purposely removed a lot of overall complexity related to bundling since that's allowed quite a bit of clean-up/simplification of old code.

the highly generic API which even abstracts the different types of input sources from the caller

Note that having one unified API-surface significantly simplifies both maintenance and support, not to mention that by not providing direct access to "lower"-level code we can much more easily make changes and implement new features.
Basically, we really want to avoid a situation where a significant portion of the underlying code is part of the official API-surface.

the hardcoded worker module resolution
[...]
The worker module resolution is a similar case. The published version of the library relies on a hardcoded (yet dynamic) require()

Note that this part in only relevant for Node.js environments, where Web Workers aren't available...

which bundlers cannot replace with their own module resolution due to the sheer number of evaluated options.

... for every other use-case, i.e. in browsers, note that you're not supposed to load the pdf.worker.js file manually since that will lead to very poor performance (since all parsing should run in a worker-thread).
Hence you should either use the workerSrc option to point its location, or use the workerPort option (see e.g. the Webpack use case in pdfjs-dist), or even provide your own PDFWorker-instance when calling getDocument (note the JSDocs here).

the runtime evaluation of the PDFJSDev.* configuration variables
[...]
Maybe I am missing something, but I have a feeling that the configuration (which is currently an object and runtime-processed with helpers like .test() and .eval() on top) could just as well be replaced with something

Huh; but those are evaluated when the library is built, see gulpfile.js, rather than at runtime!?


While you can obviously always create your own build of the PDF.js library, where you remove e.g. the network-code that you don't need, I'm a bit worried that trying to officially support such a thing in our builds could quickly lead to a high maintenance/support-burden for the few regular PDF.js contributors.

@timvandermeij
Copy link
Contributor

I'm going to close this issue since there are currently no actionable points in here. Hopefully it does shed some light on the current situation and in which areas we will review contributions. Thanks.

@yang1978u

This comment was marked as duplicate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants