feat: reading multiple pdf files with a single PDFParser object#371
feat: reading multiple pdf files with a single PDFParser object#371modesty merged 17 commits intomodesty:masterfrom
Conversation
modesty
left a comment
There was a problem hiding this comment.
thanks for adding more tests. A few thoughts on making the instance of PDFJSClass reusable:
- pdfparser instance (or the client that instantiates PDFJSClass) needs to be reset/reusable whenever PDFParser is created. (line 107 of pdfparser.js)
- lib/pdf.js: setting this.pages=[] is not sufficient to dispose the object, pdfDocument and rawTextContents needs reset too. Recomment to call existing
destroymethod.
|
Hi @modesty, |
|
Hi @modesty, Thank you! |
|
sorry for the delay. code LGTM. two notes:
|
Let me know what to do next, I'll do it as soon as I can. Thank you! |
|
Merged recent changes of master, tests are ok |
|
Added a little change to package.json: issue #381 says that the semver is broken in 3.1.5. Thank you! |
|
Hi @modesty, Thank you |
|
@nicolabaesso , sorry for replying late. |
WalkthroughThe changes introduce a singleton mode for the PDF parser in the CLI, allowing reuse of a single Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant CLI
participant PDFProcessor
participant PDFParser
User->>CLI: Run with --singleton (si) flag
CLI->>PDFProcessor: Initialize with singleton mode
loop For each PDF to parse
PDFProcessor->>PDFParser: parseOnePDF / parseOnePDFStream
PDFParser-->>PDFProcessor: Parsing result
end
PDFProcessor-->>CLI: Output results
Poem
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
npm error Exit handler never called! 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
test/_test_testMultipleDataPDF.cjs (2)
1-4: Remove unused import and clarify testing framework usage.The test imports
assertbut usesexpectthroughout, which suggests it's using Jest's expect. Remove the unused import for clarity.-const assert = require("assert"); const fs = require("fs");
10-16: Consider adding buffer content validation.While the test checks that buffers are not the same reference, consider adding a more robust check to ensure the buffers have different content.
//we need to check if buffers are indeed different, otherwise it's useless! -expect(firstPDFBuffer).not.toBe(secondPDFBuffer); +expect(firstPDFBuffer).not.toBe(secondPDFBuffer); +expect(Buffer.compare(firstPDFBuffer, secondPDFBuffer)).not.toBe(0);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
test/pdf/mpf/testPDF.pdfis excluded by!**/*.pdftest/pdf/mpf/testPDF2.pdfis excluded by!**/*.pdf
📒 Files selected for processing (5)
jest.config.json(1 hunks)lib/pdf.js(3 hunks)package.json(1 hunks)pdfparser.js(1 hunks)test/_test_testMultipleDataPDF.cjs(1 hunks)
🔇 Additional comments (6)
lib/pdf.js (3)
188-192: LGTM! Reset functionality correctly implemented.The automatic reset when
datais null is well-designed and integrates cleanly with the existing event flow. This enables the parser to handle multiple PDFs sequentially without requiring manual reset calls.
413-420: LGTM! Reset method correctly clears internal state.The implementation properly destroys the PDF document and resets all internal arrays. This ensures a clean state for parsing the next PDF.
306-306: Minor formatting improvement.The blank line improves code readability by separating logical sections.
jest.config.json (1)
2-2: LGTM! Test pattern correctly broadened.The updated pattern
"**/test/_test_*"appropriately includes the new test file and allows for additional test files following the same naming convention.pdfparser.js (1)
307-315: LGTM! Reset method correctly implemented.The
resetPDFJSmethod properly destroys the old PDFJS instance and creates a fresh one with the specifiedneedRawTextparameter. This enables reusing the PDFParser instance for multiple PDF files without recreating the entire parser.package.json (1)
58-65: Verify Node.js (>=18.12.1) and npm (>=8.19.2) compatibilityOur automated check ran on Node v24.3.0 and npm v11.4.2 and couldn’t conclusively verify that the code—especially the new reset functionality—will work under Node 18.12.1 and npm 8.19.2. Please:
- Manually test the reset workflow on Node 18.12.1 and npm 8.19.2.
- Confirm that no ES-module or other newer-version–only APIs are used (e.g. run
rg "import|export" --type js,ts).- Ensure all scripts and dependencies execute as expected under the downgraded engine constraints.
| const firstData=await new Promise((resolve,reject)=>{ | ||
| parser.parseBuffer(firstPDFBuffer,5); | ||
| parser.on("pdfParser_dataReady", (evtData) => { | ||
| resolve(evtData); | ||
| }); | ||
|
|
||
| parser.on("pdfParser_dataError", (evtData) => { | ||
| reject(evtData); | ||
| }); | ||
| }); | ||
| const secondData=await new Promise((resolve,reject)=>{ | ||
| parser.parseBuffer(secondPDFBuffer,5); | ||
| parser.on("pdfParser_dataReady", (evtData) => { | ||
| resolve(evtData); | ||
| }); | ||
|
|
||
| parser.on("pdfParser_dataError", (evtData) => { | ||
| reject(evtData); | ||
| }); | ||
| }); |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Consider improving event listener management and reset method usage.
The current test has a few areas for improvement:
-
Event listener cleanup: Event listeners are added for each parse operation but not removed, which could lead to memory leaks or duplicate event handling.
-
Explicit reset: The test relies on automatic reset when
datais null, but it would be more explicit to callparser.resetPDFJS(false)between parses to clearly demonstrate the feature. -
Error handling: Consider using Jest's timeout and proper error handling.
const firstData=await new Promise((resolve,reject)=>{
+ // Clear any existing listeners
+ parser.removeAllListeners();
parser.parseBuffer(firstPDFBuffer,5);
parser.on("pdfParser_dataReady", (evtData) => {
resolve(evtData);
});
parser.on("pdfParser_dataError", (evtData) => {
reject(evtData);
});
});
+
+// Explicitly reset the parser for the next PDF
+parser.resetPDFJS(false);
+
const secondData=await new Promise((resolve,reject)=>{
+ // Clear any existing listeners
+ parser.removeAllListeners();
parser.parseBuffer(secondPDFBuffer,5);
parser.on("pdfParser_dataReady", (evtData) => {
resolve(evtData);
});
parser.on("pdfParser_dataError", (evtData) => {
reject(evtData);
});
});📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const firstData=await new Promise((resolve,reject)=>{ | |
| parser.parseBuffer(firstPDFBuffer,5); | |
| parser.on("pdfParser_dataReady", (evtData) => { | |
| resolve(evtData); | |
| }); | |
| parser.on("pdfParser_dataError", (evtData) => { | |
| reject(evtData); | |
| }); | |
| }); | |
| const secondData=await new Promise((resolve,reject)=>{ | |
| parser.parseBuffer(secondPDFBuffer,5); | |
| parser.on("pdfParser_dataReady", (evtData) => { | |
| resolve(evtData); | |
| }); | |
| parser.on("pdfParser_dataError", (evtData) => { | |
| reject(evtData); | |
| }); | |
| }); | |
| const firstData = await new Promise((resolve, reject) => { | |
| // Clear any existing listeners | |
| parser.removeAllListeners(); | |
| parser.parseBuffer(firstPDFBuffer, 5); | |
| parser.on("pdfParser_dataReady", (evtData) => { | |
| resolve(evtData); | |
| }); | |
| parser.on("pdfParser_dataError", (evtData) => { | |
| reject(evtData); | |
| }); | |
| }); | |
| // Explicitly reset the parser for the next PDF | |
| parser.resetPDFJS(false); | |
| const secondData = await new Promise((resolve, reject) => { | |
| // Clear any existing listeners | |
| parser.removeAllListeners(); | |
| parser.parseBuffer(secondPDFBuffer, 5); | |
| parser.on("pdfParser_dataReady", (evtData) => { | |
| resolve(evtData); | |
| }); | |
| parser.on("pdfParser_dataError", (evtData) => { | |
| reject(evtData); | |
| }); | |
| }); |
🤖 Prompt for AI Agents
In test/_test_testMultipleDataPDF.cjs around lines 17 to 36, the event listeners
for "pdfParser_dataReady" and "pdfParser_dataError" are added for each parse but
never removed, risking memory leaks and duplicate handling. To fix this, add
code to remove these listeners after each parse completes or fails. Also,
explicitly call parser.resetPDFJS(false) between the two parseBuffer calls to
clearly reset the parser state. Finally, enhance error handling by integrating
Jest's timeout features and ensuring promise rejections are properly caught and
reported.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/cli/p2jcli.ts (1)
21-21: Fix code formatting.Missing space after the assignment operator.
-const SINGLETON_PDF_PARSER= "si" in argv; +const SINGLETON_PDF_PARSER = "si" in argv;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/cli/p2jcli.ts(3 hunks)src/cli/p2jcliarg.ts(2 hunks)
🔇 Additional comments (4)
src/cli/p2jcliarg.ts (2)
24-24: LGTM: Type definition correctly updated.The addition of the
siproperty to theArgvtype maintains type safety for the new singleton CLI argument.
203-207: LGTM: CLI argument properly registered.The singleton option is correctly registered with a clear description and follows the established pattern for other CLI arguments.
src/cli/p2jcli.ts (2)
127-133: LGTM: Singleton logic correctly implemented.The conditional initialization logic is correct - it only creates a new PDFParser instance when singleton mode is disabled OR when no existing instance exists. The event listener is properly attached only during initialization to avoid duplicates.
153-159: LGTM: Consistent singleton implementation.The singleton logic is consistently implemented across both parsing methods (
parseOnePDFStreamandparseOnePDF), maintaining the same conditional initialization pattern.
| if(!SINGLETON_PDF_PARSER || !this.pdfParser){ | ||
| //we initialize the PDFParser object only if the object itself is null, or the singleton parameter was not provided | ||
| this.pdfParser = new PDFParser(null, PROCESS_RAW_TEXT_CONTENT); | ||
| this.pdfParser.on("pdfParser_dataError", (evtData: any) => | ||
| this.onPrimaryError(evtData.parserError, reject) | ||
| ); | ||
| } |
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
Verify PDFParser state management between files.
While the singleton logic is correct, ensure that the PDFParser instance properly resets its internal state between different PDF files to avoid data contamination. The PR mentions a resetPDFJS method was added to the PDFParser class - verify this is being called appropriately.
Also applies to: 153-159
🏁 Script executed:
#!/bin/bash
# Description: Verify that PDFParser has proper state reset functionality
# Expected: Find resetPDFJS method calls or similar state management
# Search for resetPDFJS method usage
rg -A 5 -B 5 "resetPDFJS"
# Search for any state reset or cleanup calls in the singleton context
rg -A 10 -B 5 "SINGLETON_PDF_PARSER.*parser"Length of output: 455
Ensure PDFParser state is reset when reusing the singleton
We didn’t find any calls to resetPDFJS in the CLI, so the same PDFJS instance may carry over data between files. To avoid contamination, invoke resetPDFJS before parsing each new document.
-
In
src/cli/p2jcli.ts, around lines 127–133 (and similarly at 153–159), insert a reset when the singleton parser already exists:if (!SINGLETON_PDF_PARSER || !this.pdfParser) { this.pdfParser = new PDFParser(null, PROCESS_RAW_TEXT_CONTENT); this.pdfParser.on("pdfParser_dataError", (evtData: any) => this.onPrimaryError(evtData.parserError, reject) ); } else {
- // Reinitialize internal PDFJS to clear previous state
- this.pdfParser.resetPDFJS(PROCESS_RAW_TEXT_CONTENT);
}
- This guarantees each file is parsed with a fresh PDFJS instance and the correct raw-text setting.
<!-- suggestion_start -->
<details>
<summary>📝 Committable suggestion</summary>
> ‼️ **IMPORTANT**
> Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
if (!SINGLETON_PDF_PARSER || !this.pdfParser) {
// we initialize the PDFParser object only if the object itself is null, or the singleton parameter was not provided
this.pdfParser = new PDFParser(null, PROCESS_RAW_TEXT_CONTENT);
this.pdfParser.on("pdfParser_dataError", (evtData: any) =>
this.onPrimaryError(evtData.parserError, reject)
);
} else {
// Reinitialize internal PDFJS to clear previous state
this.pdfParser.resetPDFJS(PROCESS_RAW_TEXT_CONTENT);
}
🤖 Prompt for AI Agents
In src/cli/p2jcli.ts around lines 127 to 133, when reusing the singleton
PDFParser instance, call the resetPDFJS method on this.pdfParser before starting
to parse a new document to clear any previous state. Similarly, apply this reset
at lines 153 to 159 where the parser is reused. This ensures each file is parsed
with a fresh PDFJS instance and the correct raw-text setting, preventing data
contamination between parses.
@modesty modified the two CLI files: Let me know if there is something else that should be adjusted. Thank you! |
modesty
left a comment
There was a problem hiding this comment.
there are unresolved comments.
|
|
||
| raiseReadyEvent(data) { | ||
| process.nextTick(() => this.emit("pdfjs_parseDataReady", data)); | ||
| if(data===null){ |
There was a problem hiding this comment.
the client will read data (not only the data from event payload) when pdfjs_parseDataReady is raised, resetting here could cause race conditions. Do the reset before start loading new PDF is a safer bet.
There was a problem hiding this comment.
Good point, I've moved the method call to the parsePdfData method:
parsePDFData(arrayBuffer, password) {
this.resetCurrentObject();
const parameters = { password, data: arrayBuffer };
PDFJS.getDocument(parameters).then(
(pdfDocument) => this.load(pdfDocument, 1),
(error) => this.raiseErrorEvent(error)
);
}By my understanding, this is the earliest it gets to load a pdf file.
| "devEngines": { | ||
| "packageManager": { | ||
| "name": "npm", | ||
| "version": ">=8.19.2" |
There was a problem hiding this comment.
no downgrade of node & npm please
There was a problem hiding this comment.
Restored to your last changes. BTW comment #381 underlines that you're introducing a breaking change, maybe this could be version 4.0.0?
Minor and patches release should not introduce breaking changes
| this.pdfParser.on("pdfParser_dataError", (evtData: any) => | ||
| this.onPrimaryError(evtData.parserError, reject) | ||
| ); | ||
| if(!SINGLETON_PDF_PARSER || !this.pdfParser){ |
There was a problem hiding this comment.
in a batch process, parseOnePDFStream will be invoked multiple times. Line 127 makes this.pdfParser always a singleton regardless of SINGLETON_PDF_PARSER. No change of current behavior if SINGLETON_PDF_PARSER not set (default) please, also check other places that instantiates pdfParser instances.
There was a problem hiding this comment.
My bad, modified the part.
Now it does the following:
- If the parameter is provided AND the pdfParser object is not initialized, the pdfParserObject gets initialized
- If the parameter is NOT provided, the pdfParser object gets initialized
- If the parameter is provided AND the pdfParser object is initialized, the pdfParserObject DOES NOT get initialized
Sorry for the previous mistake, is this the desired behaviour?
| this.pdfParser.on("pdfParser_dataError", (evtData: any) => { | ||
| this.onPrimaryError(evtData.parserError, reject); | ||
| }); | ||
| if(!SINGLETON_PDF_PARSER || !this.pdfParser){ |
There was a problem hiding this comment.
in a batch process, parseOnePDF will be invoked multiple times. Line 153 makes this.pdfParser always a singleton regardless of SINGLETON_PDF_PARSER. No change of current behavior if SINGLETON_PDF_PARSER not set (default) please, also check other places that instantiates pdfParser instances.
There was a problem hiding this comment.
Refer to my comment on parseOnePDFStream method (the same thing got applied, maybe we could use a refactor?)
Elements changed:
I've added this elements because in my corporate job we are using this library, and recreating everytime the PDFParser object is not something I'm a fan of.
Other test cases are not failing, so no regressions.
Summary by CodeRabbit
New Features
Bug Fixes
Tests