feat: reading multiple pdf files with a single PDFParser object #371

Merged
modesty merged 17 commits into modesty:master from nicolabaesso:master on Jul 5, 2025

Conversation

@nicolabaesso
Contributor

@nicolabaesso nicolabaesso commented Oct 26, 2024

Elements changed:

  1. Added new test case in a separate file
  2. Added the two example PDFs
  3. Added a reset of the pages array when the data variable is null

I've added these elements because we use this library at my corporate job, and recreating the PDFParser object every time is not something I'm a fan of.
The other test cases are not failing, so there are no regressions.

Summary by CodeRabbit

  • New Features

    • Added a new command-line option to enable singleton mode for the PDF parser, allowing reuse of a single parser instance.
    • Introduced new methods to reset the internal state of the PDF parser without recreating the entire object.
  • Bug Fixes

    • Improved handling of multiple PDF files parsed sequentially to ensure correct data extraction.
  • Tests

    • Added tests to verify parsing multiple PDFs with the same structure using a single parser instance.

Comment thread pdfparser.js Outdated
Owner

@modesty modesty left a comment

thanks for adding more tests. A few thoughts on making the instance of PDFJSClass reusable:

  1. the pdfparser instance (or the client that instantiates PDFJSClass) needs to be reset/reusable whenever PDFParser is created (line 107 of pdfparser.js).
  2. lib/pdf.js: setting this.pages=[] is not sufficient to dispose of the object; pdfDocument and rawTextContents need to be reset too. I recommend calling the existing destroy method.

@nicolabaesso
Contributor Author

Hi @modesty,
thank you for your review. As suggested, I removed the this.pages=[] line and instead called the already available destroy() function.
I've also added a function to reset the PDFJS object; is this what you had in mind? If not, let me know.
I've also removed the if on line 120 of pdfparser.js; it was a leftover from a test I was doing to understand the code.

@nicolabaesso
Contributor Author

Hi @modesty,
sorry for the pressure, can you give me some feedback on the code? In the next few days I could make some changes if something is still wrong (I feel like the function for resetting the PDFJS object could use some more work, but I would like to have your opinion).

Thank you!

@modesty
Owner

modesty commented Dec 30, 2024

sorry for the delay. code LGTM. two notes:

  1. could you add a command line option to enable it optionally? default false would keep current clients intact.
  2. could you merge master into this branch and run all tests when ready?

@nicolabaesso
Contributor Author

> sorry for the delay. code LGTM. two notes:
>
> 1. could you add a command line option to enable it optionally? default `false` would keep current clients intact.
> 2. could you merge master into this branch and run all tests when ready?

  1. Sorry, I'm not getting it: how can we provide this feature when using the library via the command line? Should it be a whole section where, if the flag is provided, you can pass two or more files and the library reads them with a single instance? Just asking.

  2. Master branch merged. I modified the reset part to call a method similar to destroy(), except that it does not remove the listeners. Reusing destroy() itself made the test hang forever; with this change the test cases pass and finish in less than 5 seconds.
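
The difference between a full destroy() and a listener-preserving reset can be sketched roughly as follows. This is only an illustration, not the library's code: the class name ParserStateSketch is hypothetical, the fields pdfDocument, pages and rawTextContents are the ones named earlier in this thread, and the real destroy() in lib/pdf.js may do more than is shown here.

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical stand-in for the PDFJS wrapper class (illustration only).
class ParserStateSketch extends EventEmitter {
  constructor() {
    super();
    this.pdfDocument = null;
    this.pages = [];
    this.rawTextContents = [];
  }

  // Clears per-document state but keeps every registered listener.
  resetCurrentObject() {
    if (this.pdfDocument && typeof this.pdfDocument.destroy === "function") {
      this.pdfDocument.destroy();
    }
    this.pdfDocument = null;
    this.pages = [];
    this.rawTextContents = [];
  }

  // Full teardown: same cleanup, plus detaching all listeners.
  destroy() {
    this.resetCurrentObject();
    this.removeAllListeners();
  }
}
```

With a shape like this, listeners registered once on the instance keep receiving events after resetCurrentObject(), whereas after destroy() nothing is listening any more. That is one plausible reason a second parse would never resolve in the test.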

Let me know what to do next, I'll do it as soon as I can.

Thank you!

@nicolabaesso
Contributor Author

Merged recent changes of master, tests are ok

@nicolabaesso
Contributor Author

Added a small change to package.json: issue #381 says that semver is broken in 3.1.5.
Tests are still working fine; it would be better to make a major release, as suggested in the issue.
I can set up some scripts to automate the semver processing, just let me know if that's a change you would appreciate.

Thank you!

@nicolabaesso
Contributor Author

Hi @modesty,
it's been a while since your last comment, do you have any feedback for this PR?

Thank you

@modesty
Owner

modesty commented Jun 29, 2025

@nicolabaesso, sorry for replying late.
As for the CLI command line, it can be extended from the bottom of src/cli/p2jcliarg.ts. When executed, line 148 of src/cli/p2jcli.ts creates a new instance of PDFParser whenever parseOnePDF is invoked; it can be extended to use PDFParser as a singleton if the command line arg is set.
The idea is to keep the existing behavior as the default and make the singleton approach opt-in.

@coderabbitai

coderabbitai Bot commented Jul 3, 2025

Walkthrough

The changes introduce a singleton mode for the PDF parser in the CLI, allowing reuse of a single PDFParser instance across multiple parses. Supporting methods for resetting parser state are added. The Jest test pattern is broadened, and a new test verifies parsing multiple PDFs with the same structure using a persistent parser instance.

Changes

| File(s) | Change Summary |
| --- | --- |
| jest.config.json | Expanded testMatch pattern to match multiple test files with _test_ prefix in test directories. |
| lib/pdf.js | Added resetCurrentObject() method to PDFJSClass; updated parsing logic to use this for state reset. |
| pdfparser.js | Added resetPDFJS(needRawText) method to PDFParser for resetting its internal parser instance. |
| test/_test_testMultipleDataPDF.cjs | Added new test suite to verify parsing of multiple PDFs with the same structure using a single parser instance. |
| src/cli/p2jcli.ts | Added singleton parser mode via "si" flag; conditional parser instantiation based on singleton flag. |
| src/cli/p2jcliarg.ts | Added "si"/"singleton" CLI argument; updated Argv type and CLI argument parser. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant PDFProcessor
    participant PDFParser

    User->>CLI: Run with --singleton (si) flag
    CLI->>PDFProcessor: Initialize with singleton mode
    loop For each PDF to parse
        PDFProcessor->>PDFParser: parseOnePDF / parseOnePDFStream
        PDFParser-->>PDFProcessor: Parsing result
    end
    PDFProcessor-->>CLI: Output results

Poem

In the warren of code, a singleton's born,
One parser for many, no need to respawn.
Tests now abound for PDFs galore,
With stateful resets and features in store.
From CLI to Jest, the changes are bright—
Hopping through PDFs, all day and all night!
🐇✨

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

npm error Exit handler never called!
npm error This is an error with npm itself. Please report this error at:
npm error https://github.com/npm/cli/issues
npm error A complete log of this run can be found in: /.npm/_logs/2025-07-05T13_08_13_066Z-debug-0.log


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2e42cb4 and 26a9dc3.

📒 Files selected for processing (2)
  • lib/pdf.js (3 hunks)
  • src/cli/p2jcli.ts (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • lib/pdf.js
  • src/cli/p2jcli.ts

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
test/_test_testMultipleDataPDF.cjs (2)

1-4: Remove unused import and clarify testing framework usage.

The test imports assert but uses expect throughout, which suggests it's using Jest's expect. Remove the unused import for clarity.

-const assert = require("assert");
 const fs = require("fs");

10-16: Consider adding buffer content validation.

While the test checks that buffers are not the same reference, consider adding a more robust check to ensure the buffers have different content.

 //we need to check if buffers are indeed different, otherwise it's useless!
-expect(firstPDFBuffer).not.toBe(secondPDFBuffer);
+expect(firstPDFBuffer).not.toBe(secondPDFBuffer);
+expect(Buffer.compare(firstPDFBuffer, secondPDFBuffer)).not.toBe(0);
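
For context on the nitpick above: Jest's toBe compares by identity (Object.is), so two Buffer objects never count as "the same" even when their bytes match. A small Node sketch with made-up buffer contents (the variable names are illustrative and not taken from the test file):

```javascript
// Two distinct Buffer objects that hold identical bytes.
const bufA = Buffer.from("same bytes");
const bufB = Buffer.from("same bytes");

// Identity check: what a reference comparison sees.
const sameReference = Object.is(bufA, bufB);

// Content checks: byte-for-byte comparison.
const sameContent = bufA.equals(bufB);
const ordering = Buffer.compare(bufA, bufB); // 0 means equal contents

console.log({ sameReference, sameContent, ordering });
```

A reference-only assertion would therefore pass even if both fixture files were byte-identical, which is why the content comparison is the more meaningful check.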
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7d562af and 05c800f.

⛔ Files ignored due to path filters (2)
  • test/pdf/mpf/testPDF.pdf is excluded by !**/*.pdf
  • test/pdf/mpf/testPDF2.pdf is excluded by !**/*.pdf
📒 Files selected for processing (5)
  • jest.config.json (1 hunks)
  • lib/pdf.js (3 hunks)
  • package.json (1 hunks)
  • pdfparser.js (1 hunks)
  • test/_test_testMultipleDataPDF.cjs (1 hunks)
🔇 Additional comments (6)
lib/pdf.js (3)

188-192: LGTM! Reset functionality correctly implemented.

The automatic reset when data is null is well-designed and integrates cleanly with the existing event flow. This enables the parser to handle multiple PDFs sequentially without requiring manual reset calls.


413-420: LGTM! Reset method correctly clears internal state.

The implementation properly destroys the PDF document and resets all internal arrays. This ensures a clean state for parsing the next PDF.


306-306: Minor formatting improvement.

The blank line improves code readability by separating logical sections.

jest.config.json (1)

2-2: LGTM! Test pattern correctly broadened.

The updated pattern "**/test/_test_*" appropriately includes the new test file and allows for additional test files following the same naming convention.

pdfparser.js (1)

307-315: LGTM! Reset method correctly implemented.

The resetPDFJS method properly destroys the old PDFJS instance and creates a fresh one with the specified needRawText parameter. This enables reusing the PDFParser instance for multiple PDF files without recreating the entire parser.

package.json (1)

58-65: Verify Node.js (>=18.12.1) and npm (>=8.19.2) compatibility

Our automated check ran on Node v24.3.0 and npm v11.4.2 and couldn’t conclusively verify that the code—especially the new reset functionality—will work under Node 18.12.1 and npm 8.19.2. Please:

  • Manually test the reset workflow on Node 18.12.1 and npm 8.19.2.
  • Confirm that no ES-module or other newer-version–only APIs are used (e.g. run
    rg "import|export" --type js,ts).
  • Ensure all scripts and dependencies execute as expected under the downgraded engine constraints.

Comment on lines +17 to +36
const firstData=await new Promise((resolve,reject)=>{
    parser.parseBuffer(firstPDFBuffer,5);
    parser.on("pdfParser_dataReady", (evtData) => {
        resolve(evtData);
    });

    parser.on("pdfParser_dataError", (evtData) => {
        reject(evtData);
    });
});
const secondData=await new Promise((resolve,reject)=>{
    parser.parseBuffer(secondPDFBuffer,5);
    parser.on("pdfParser_dataReady", (evtData) => {
        resolve(evtData);
    });

    parser.on("pdfParser_dataError", (evtData) => {
        reject(evtData);
    });
});

🛠️ Refactor suggestion

Consider improving event listener management and reset method usage.

The current test has a few areas for improvement:

  1. Event listener cleanup: Event listeners are added for each parse operation but not removed, which could lead to memory leaks or duplicate event handling.

  2. Explicit reset: The test relies on automatic reset when data is null, but it would be more explicit to call parser.resetPDFJS(false) between parses to clearly demonstrate the feature.

  3. Error handling: Consider using Jest's timeout and proper error handling.

 const firstData=await new Promise((resolve,reject)=>{
+    // Clear any existing listeners
+    parser.removeAllListeners();
     parser.parseBuffer(firstPDFBuffer,5);
     parser.on("pdfParser_dataReady", (evtData) => {
         resolve(evtData);
     });
 
     parser.on("pdfParser_dataError", (evtData) => {
         reject(evtData);
     });
 });
+
+// Explicitly reset the parser for the next PDF
+parser.resetPDFJS(false);
+
 const secondData=await new Promise((resolve,reject)=>{
+    // Clear any existing listeners
+    parser.removeAllListeners();
     parser.parseBuffer(secondPDFBuffer,5);
     parser.on("pdfParser_dataReady", (evtData) => {
         resolve(evtData);
     });
 
     parser.on("pdfParser_dataError", (evtData) => {
         reject(evtData);
     });
 });
🤖 Prompt for AI Agents
In test/_test_testMultipleDataPDF.cjs around lines 17 to 36, the event listeners
for "pdfParser_dataReady" and "pdfParser_dataError" are added for each parse but
never removed, risking memory leaks and duplicate handling. To fix this, add
code to remove these listeners after each parse completes or fails. Also,
explicitly call parser.resetPDFJS(false) between the two parseBuffer calls to
clearly reset the parser state. Finally, enhance error handling by integrating
Jest's timeout features and ensuring promise rejections are properly caught and
reported.
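
One way to address the listener cleanup point is a small promise wrapper that detaches both handlers as soon as either one fires. This is a sketch under assumptions: parseBufferOnce and FakeParser are hypothetical names, and FakeParser only stands in for a PDFParser instance so the example runs on its own; the event names and the parseBuffer(buffer, verbosity) call shape are the ones used in the test above.

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical helper: parse one buffer, then always detach both listeners.
function parseBufferOnce(parser, buffer, verbosity) {
  return new Promise((resolve, reject) => {
    function cleanup() {
      parser.off("pdfParser_dataReady", onReady);
      parser.off("pdfParser_dataError", onError);
    }
    function onReady(evtData) {
      cleanup();
      resolve(evtData);
    }
    function onError(evtData) {
      cleanup();
      reject(evtData);
    }
    // Attach the listeners before the parse is started.
    parser.on("pdfParser_dataReady", onReady);
    parser.on("pdfParser_dataError", onError);
    parser.parseBuffer(buffer, verbosity);
  });
}

// Stand-in parser for illustration only; a real parser emits the events itself.
class FakeParser extends EventEmitter {
  parseBuffer(buffer, verbosity) {
    this.lastCall = { buffer, verbosity };
  }
}
```

Because each call removes exactly the handlers it added, repeated parses on one instance do not accumulate listeners, and listeners registered elsewhere are left untouched (unlike removeAllListeners()).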


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/cli/p2jcli.ts (1)

21-21: Fix code formatting.

Missing space after the assignment operator.

-const SINGLETON_PDF_PARSER= "si" in argv;
+const SINGLETON_PDF_PARSER = "si" in argv;
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 05c800f and 2e42cb4.

📒 Files selected for processing (2)
  • src/cli/p2jcli.ts (3 hunks)
  • src/cli/p2jcliarg.ts (2 hunks)
🔇 Additional comments (4)
src/cli/p2jcliarg.ts (2)

24-24: LGTM: Type definition correctly updated.

The addition of the si property to the Argv type maintains type safety for the new singleton CLI argument.


203-207: LGTM: CLI argument properly registered.

The singleton option is correctly registered with a clear description and follows the established pattern for other CLI arguments.

src/cli/p2jcli.ts (2)

127-133: LGTM: Singleton logic correctly implemented.

The conditional initialization logic is correct - it only creates a new PDFParser instance when singleton mode is disabled OR when no existing instance exists. The event listener is properly attached only during initialization to avoid duplicates.


153-159: LGTM: Consistent singleton implementation.

The singleton logic is consistently implemented across both parsing methods (parseOnePDFStream and parseOnePDF), maintaining the same conditional initialization pattern.

Comment thread src/cli/p2jcli.ts Outdated
Comment on lines +127 to +133
if(!SINGLETON_PDF_PARSER || !this.pdfParser){
    //we initialize the PDFParser object only if the object itself is null, or the singleton parameter was not provided
    this.pdfParser = new PDFParser(null, PROCESS_RAW_TEXT_CONTENT);
    this.pdfParser.on("pdfParser_dataError", (evtData: any) =>
        this.onPrimaryError(evtData.parserError, reject)
    );
}

💡 Verification agent

🧩 Analysis chain

Verify PDFParser state management between files.

While the singleton logic is correct, ensure that the PDFParser instance properly resets its internal state between different PDF files to avoid data contamination. The PR mentions a resetPDFJS method was added to the PDFParser class - verify this is being called appropriately.

Also applies to: 153-159


🏁 Script executed:

#!/bin/bash
# Description: Verify that PDFParser has proper state reset functionality
# Expected: Find resetPDFJS method calls or similar state management

# Search for resetPDFJS method usage
rg -A 5 -B 5 "resetPDFJS"

# Search for any state reset or cleanup calls in the singleton context
rg -A 10 -B 5 "SINGLETON_PDF_PARSER.*parser"

Length of output: 455


Ensure PDFParser state is reset when reusing the singleton

We didn’t find any calls to resetPDFJS in the CLI, so the same PDFJS instance may carry over data between files. To avoid contamination, invoke resetPDFJS before parsing each new document.

  • In src/cli/p2jcli.ts, around lines 127–133 (and similarly at 153–159), insert a reset when the singleton parser already exists:

 if (!SINGLETON_PDF_PARSER || !this.pdfParser) {
   this.pdfParser = new PDFParser(null, PROCESS_RAW_TEXT_CONTENT);
   this.pdfParser.on("pdfParser_dataError", (evtData: any) =>
     this.onPrimaryError(evtData.parserError, reject)
   );
 } else {
+  // Reinitialize internal PDFJS to clear previous state
+  this.pdfParser.resetPDFJS(PROCESS_RAW_TEXT_CONTENT);
 }

  • This guarantees each file is parsed with a fresh PDFJS instance and the correct raw-text setting.

🤖 Prompt for AI Agents
In src/cli/p2jcli.ts around lines 127 to 133, when reusing the singleton
PDFParser instance, call the resetPDFJS method on this.pdfParser before starting
to parse a new document to clear any previous state. Similarly, apply this reset
at lines 153 to 159 where the parser is reused. This ensures each file is parsed
with a fresh PDFJS instance and the correct raw-text setting, preventing data
contamination between parses.

@nicolabaesso
Contributor Author

> @nicolabaesso, sorry for replying late. As for the CLI command line, it can be extended from the bottom of src/cli/p2jcliarg.ts. When executed, line 148 of src/cli/p2jcli.ts creates a new instance of PDFParser whenever parseOnePDF is invoked; it can be extended to use PDFParser as a singleton if the command line arg is set. The idea is to keep the existing behavior as the default and make the singleton approach opt-in.

@modesty, I modified the two CLI files:
I've set si as the optional parameter for the singleton. If the parameter is not provided, the object is created as usual (that is, every time).

Let me know if there is something else that should be adjusted.

Thank you!

Owner

@modesty modesty left a comment

there are unresolved comments.

Comment thread lib/pdf.js Outdated

raiseReadyEvent(data) {
process.nextTick(() => this.emit("pdfjs_parseDataReady", data));
if(data===null){
Owner

the client will read data (not only the data from the event payload) when pdfjs_parseDataReady is raised, so resetting here could cause race conditions. Doing the reset before starting to load a new PDF is a safer bet.

Contributor Author

Good point, I've moved the method call to the parsePDFData method:

parsePDFData(arrayBuffer, password) {
	this.resetCurrentObject();
	const parameters = { password, data: arrayBuffer };
	PDFJS.getDocument(parameters).then(
		(pdfDocument) => this.load(pdfDocument, 1),
		(error) => this.raiseErrorEvent(error)
	);
}

As far as I understand, this is the earliest point at which a PDF file starts loading.
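
The ordering concern behind this change can be illustrated with a tiny emitter. Everything below is a hypothetical miniature (MiniPDFJS, finishThenReset and startParse are made-up names); only the process.nextTick emit pattern and the pdfjs_parseDataReady event name come from the code under review.

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical miniature of the wrapper, used only to show the ordering.
class MiniPDFJS extends EventEmitter {
  constructor() {
    super();
    this.pages = [];
  }

  // Flagged pattern: schedule the ready event, then reset right away.
  // The reset runs synchronously, so it happens BEFORE any listener runs.
  finishThenReset() {
    process.nextTick(() => this.emit("pdfjs_parseDataReady"));
    this.pages = [];
  }

  // Safer pattern: reset at the start of the next parse instead.
  startParse(newPages) {
    this.pages = [];
    this.pages.push(...newPages);
    process.nextTick(() => this.emit("pdfjs_parseDataReady"));
  }
}
```

In the first variant a listener that reads pages from the instance finds them already cleared; in the second, the state of the finished parse stays readable until the next parse begins.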

Comment thread package.json Outdated
"devEngines": {
"packageManager": {
"name": "npm",
"version": ">=8.19.2"
Owner

no downgrade of node & npm please

Contributor Author

Restored to your last changes. BTW, issue #381 points out that you're introducing a breaking change; maybe this could be version 4.0.0?
Minor and patch releases should not introduce breaking changes.

Comment thread src/cli/p2jcli.ts Outdated
this.pdfParser.on("pdfParser_dataError", (evtData: any) =>
this.onPrimaryError(evtData.parserError, reject)
);
if(!SINGLETON_PDF_PARSER || !this.pdfParser){
Owner

in a batch process, parseOnePDFStream will be invoked multiple times. Line 127 makes this.pdfParser always a singleton regardless of SINGLETON_PDF_PARSER. Please keep the current behavior unchanged when SINGLETON_PDF_PARSER is not set (the default), and also check other places that instantiate pdfParser instances.

Contributor Author

My bad, modified the part.
Now it does the following:

  • If the parameter is provided AND the pdfParser object is not initialized, the pdfParser object gets initialized
  • If the parameter is NOT provided, the pdfParser object gets initialized
  • If the parameter is provided AND the pdfParser object is already initialized, the pdfParser object DOES NOT get initialized again

Sorry for the previous mistake, is this the desired behaviour?
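
The three cases listed above could be captured in one small helper shared by both parse methods. This is a sketch only: acquireParser and its parameters are hypothetical names, and createParser stands in for whatever constructs and wires up a PDFParser in the CLI.

```javascript
// Hypothetical helper mirroring the three cases described above.
// current:      the existing parser instance, or null/undefined if none yet
// singleton:    true when the singleton option was requested
// createParser: factory that builds (and wires up) a fresh parser
function acquireParser(current, singleton, createParser) {
  if (singleton && current) {
    // Option set and an instance already exists: reuse it.
    return current;
  }
  // Option not set, or no instance yet: create a new one.
  return createParser();
}
```

Calling the same helper from both parseOnePDFStream and parseOnePDF would keep the default (a new parser per file) and the opt-in singleton path in a single place.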

Comment thread src/cli/p2jcli.ts Outdated
this.pdfParser.on("pdfParser_dataError", (evtData: any) => {
this.onPrimaryError(evtData.parserError, reject);
});
if(!SINGLETON_PDF_PARSER || !this.pdfParser){
Owner

in a batch process, parseOnePDF will be invoked multiple times. Line 153 makes this.pdfParser always a singleton regardless of SINGLETON_PDF_PARSER. Please keep the current behavior unchanged when SINGLETON_PDF_PARSER is not set (the default), and also check other places that instantiate pdfParser instances.

Contributor Author

Please refer to my comment on the parseOnePDFStream method (the same change was applied here; maybe this could use a refactor?)

@nicolabaesso nicolabaesso requested a review from modesty July 5, 2025 13:16
@modesty modesty merged commit dab556a into modesty:master Jul 5, 2025
2 checks passed