Fix: Resolve macOS FileNotFoundError in LLM_PARSE for docx (Issue #177)#178
Fix: Resolve macOS FileNotFoundError in LLM_PARSE for docx (Issue #177)#178notsooamit wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses docx→pdf conversion path issues that caused LLM_PARSE failures on macOS/Windows, and hardens parsing behavior and tests around those failures (Issue #177).
Changes:
- Ensure
.doc/.docx→PDF conversion uses absolute paths for compatibility withdocx2pdfon macOS/Windows. - Mock
parse_llm_docin the docx parsing test to avoid requiring live LLM credentials for that specific test. - Prevent
IndexErrorwhen checking for bounding boxes if a parse returns an emptysegmentslist.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
tests/test_parser.py |
Adds mocking for parse_llm_doc in the docx test to avoid live LLM calls for that case. |
lexoid/core/conversion_utils.py |
Converts docx conversion inputs (input_path, temp_dir) to absolute paths before calling platform-specific converters. |
lexoid/api.py |
Guards bbox detection against empty segments to avoid crashes during bbox routing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
8815af4 to
ac7d0ac
Compare
|
@notsooamit thanks again for looking into this. 🙏 It looks like the formatting or something else is causing most of the changes to appear as new files. Sharing .vscode config can be dumped settings.json |
ac7d0ac to
e543390
Compare
|
Hey @pramitchoudhary! You were completely right, my local formatter ran with an 88-character limit instead of the project's 120-character limit, causing a massive formatting diff. I reset the branch to upstream/main to wipe out the formatting artifacts and surgically re-applied the fixes. The PR diff is now clean (+26 -8) and ready for review! |
|
Just a quick note on the second commit: While doing native Windows testing, I realized that if os.path.abspath(temp_dir) is called after temp_path is constructed, docx2pdf is still handed a relative output path, causing Microsoft Word to crash upon saving. The second commit simply moves the path resolution to the top of the function so both the input and output paths are strictly absolute before the COM object ever sees them! |
|
Hi @notsooamit proposed solution is not ideal.
|
|
@dilithjay also can we swtich to |
e543390 to
3b10ddd
Compare
- Add _is_valid_pdf helper to validate converted output - Add _convert_with_soffice using LibreOffice (soffice/lowriter) - Use os.path.abspath to resolve paths for docx2pdf/COM/AppleScript - Linux: use LibreOffice as primary; macOS/Windows: docx2pdf with LibreOffice fallback - Add error propagation for failed LLM parses in parse_chunk - Guard against empty segments in bbox check - Fix expected_ouput -> expected_output typos in tests
3b10ddd to
53dc0fd
Compare
Fixes Issue #177
Description:
The
test_parsing_docx_typetest was failing forLLM_PARSEon macOS with aFileNotFoundError, and on native Windows due to COM object pathing constraints. While fixing this, I also fortified the test suite to not require live API keys and patched a fragile bounding-box check.Changes Included:
conversion_utils.py): Updatedconvert_doc_to_pdfto dynamically resolveinput_pathandtemp_dirusingos.path.abspath(). The underlyingdocx2pdfutility relies on AppleScript (macOS) and COM objects (Windows), both of which strictly require absolute file paths. Feeding them relative paths caused silent failures.test_parser.py): Added a@patchto mockparse_llm_doc. Previously, running this test locally without valid API keys in a.envfile would trigger a live network failure. This mock ensures the.docxto.pdfconversion is still rigorously tested without hitting API rate limits or requiring local credentials.api.py): Wrapped the fragile bounding box check (result["segments"][0].get("bboxes")) in a safe condition. If an LLM parse fails or returns an empty segment list, the router now gracefully triggers fallback logic instead of crashing with anIndexError.Testing:
Verified locally on a native Linux environment (Ubuntu/WSL) and natively on Windows. Test suite passes 100%