Skip to content

Fix: Resolve macOS FileNotFoundError in LLM_PARSE for docx (Issue #177)#178

Open
notsooamit wants to merge 2 commits into
oidlabs-com:mainfrom
notsooamit:fix-macos-docx-sandbox
Open

Fix: Resolve macOS FileNotFoundError in LLM_PARSE for docx (Issue #177)#178
notsooamit wants to merge 2 commits into
oidlabs-com:mainfrom
notsooamit:fix-macos-docx-sandbox

Conversation

@notsooamit
Copy link
Copy Markdown

@notsooamit notsooamit commented Apr 21, 2026

Fixes Issue #177

Description:
The test_parsing_docx_type test was failing for LLM_PARSE on macOS with a FileNotFoundError, and on native Windows due to COM object pathing constraints. While fixing this, I also fortified the test suite to not require live API keys and patched a fragile bounding-box check.

Changes Included:

  1. Core Pathing Fix (conversion_utils.py): Updated convert_doc_to_pdf to dynamically resolve input_path and temp_dir using os.path.abspath(). The underlying docx2pdf utility relies on AppleScript (macOS) and COM objects (Windows), both of which strictly require absolute file paths. Feeding them relative paths caused silent failures.
  2. Test Suite Stabilization (test_parser.py): Added a @patch to mock parse_llm_doc. Previously, running this test locally without valid API keys in a .env file would trigger a live network failure. This mock ensures the .docx to .pdf conversion is still rigorously tested without hitting API rate limits or requiring local credentials.
  3. API Robustness (api.py): Wrapped the fragile bounding box check (result["segments"][0].get("bboxes")) in a safe condition. If an LLM parse fails or returns an empty segment list, the router now gracefully triggers fallback logic instead of crashing with an IndexError.

Testing:
Verified locally on a native Linux environment (Ubuntu/WSL) and natively on Windows. Test suite passes 100%

Copilot AI review requested due to automatic review settings April 21, 2026 18:31
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses docx→pdf conversion path issues that caused LLM_PARSE failures on macOS/Windows, and hardens parsing behavior and tests around those failures (Issue #177).

Changes:

  • Ensure .doc/.docx→PDF conversion uses absolute paths for compatibility with docx2pdf on macOS/Windows.
  • Mock parse_llm_doc in the docx parsing test to avoid requiring live LLM credentials for that specific test.
  • Prevent IndexError when checking for bounding boxes if a parse returns an empty segments list.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
tests/test_parser.py Adds mocking for parse_llm_doc in the docx test to avoid live LLM calls for that case.
lexoid/core/conversion_utils.py Converts docx conversion inputs (input_path, temp_dir) to absolute paths before calling platform-specific converters.
lexoid/api.py Guards bbox detection against empty segments to avoid crashes during bbox routing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_parser.py Outdated
Comment thread tests/test_parser.py Outdated
Comment thread lexoid/api.py Outdated
@notsooamit notsooamit force-pushed the fix-macos-docx-sandbox branch from 8815af4 to ac7d0ac Compare April 21, 2026 18:43
@pramitchoudhary pramitchoudhary added the bug Something isn't working label Apr 23, 2026
@pramitchoudhary
Copy link
Copy Markdown
Contributor

@notsooamit thanks again for looking into this. 🙏

It looks like the formatting or something else is causing most of the changes to appear as new files.
This might be difficult to verify.
The project is using Ruff, which mimics black closely, if I think.

Sharing .vscode config can be dumped settings.json
.vscode/settings.json

{
  "[python]": {
        "editor.defaultFormatter": "charliermarsh.ruff",
        "editor.codeActionsOnSave": {
            // Organize imports using Ruff on save
            "source.organizeImports.ruff": "explicit",
            // Apply all auto-fixable problems on save
            "source.fixAll": "explicit"
        }
    },
    "python.linting.enabled": true,
    "python.linting.ruffEnabled": true,
    "python.linting.lintOnSave": true,
    "python.linting.pylintEnabled": false,
    "python.linting.flake8Enabled": false,
    "ruff.lint.args": ["--select=E,W,F,I001,PL", "--ignore=E203", "--per-file-ignores=**/__init__.py:F401"],
    "python.formatting.blackArgs": ["--line-length", "120"],
    "files.autoSave": "onWindowChange",
    "files.trimTrailingWhitespace": true,
    "git.ignoreLimitWarning": true,
    "cSpell.words": [
      "sonner"
    ],
}

@pramitchoudhary pramitchoudhary linked an issue Apr 23, 2026 that may be closed by this pull request
@notsooamit notsooamit force-pushed the fix-macos-docx-sandbox branch from ac7d0ac to e543390 Compare April 23, 2026 16:07
@notsooamit
Copy link
Copy Markdown
Author

Hey @pramitchoudhary! You were completely right, my local formatter ran with an 88-character limit instead of the project's 120-character limit, causing a massive formatting diff.

I reset the branch to upstream/main to wipe out the formatting artifacts and surgically re-applied the fixes. The PR diff is now clean (+26 -8) and ready for review!

@notsooamit
Copy link
Copy Markdown
Author

Just a quick note on the second commit: While doing native Windows testing, I realized that if os.path.abspath(temp_dir) is called after temp_path is constructed, docx2pdf is still handed a relative output path, causing Microsoft Word to crash upon saving. The second commit simply moves the path resolution to the top of the function so both the input and output paths are strictly absolute before the COM object ever sees them!

@pramitchoudhary
Copy link
Copy Markdown
Contributor

Hi @notsooamit proposed solution is not ideal.
Will it be possible to adopt the following suggestions and verify?

def _is_valid_pdf(path: str) -> bool:
    """Return True when the PDF exists and is non-empty."""
    return os.path.exists(path) and os.path.getsize(path) > 0


def _convert_with_soffice(input_path: str, out_dir: str) -> None:
    """Convert DOC/DOCX to PDF using LibreOffice-compatible binaries."""
    soffice_bin = shutil.which("soffice") or shutil.which("lowriter")
    if not soffice_bin:
        raise RuntimeError("LibreOffice binary not found (soffice/lowriter).")

    subprocess.run(
        [
            soffice_bin,
            "--headless",
            "--convert-to",
            "pdf",
            "--outdir",
            out_dir,
            input_path,
        ],
        check=True,
    )


def convert_doc_to_pdf(input_path: str, temp_dir: str) -> str:
    input_path = os.path.abspath(input_path)
    temp_dir = os.path.abspath(temp_dir)
    os.makedirs(temp_dir, exist_ok=True)
    doc_name = os.path.basename(input_path)

    temp_path = os.path.join(
        temp_dir, os.path.splitext(os.path.basename(input_path))[0] + ".pdf"
    )
    logger.info(f"DOCX->PDF conversion started for {doc_name}")

    # Linux: use LibreOffice path only.
    if "linux" in sys.platform.lower():
        _convert_with_soffice(input_path, temp_dir)
        if not _is_valid_pdf(temp_path):
            raise RuntimeError(
                f"LibreOffice conversion finished, but PDF is missing/empty: {temp_path}."
            )
        return temp_path

    # Non-Linux (macOS/Windows): try docx2pdf first, then fallback to soffice.
    try:
        logger.debug(f"Converting document {doc_name} to PDF using docx2pdf...")
        docx2pdf.convert(input_path, temp_path)
    except Exception as docx_err:
        logger.warning(
            f"docx2pdf failed for {doc_name}, will fallback to soffice: {docx_err}"
        )

    if _is_valid_pdf(temp_path):
        return temp_path

    logger.debug("Falling back to LibreOffice conversion via soffice/lowriter...")
    _convert_with_soffice(input_path, temp_dir)

    if not _is_valid_pdf(temp_path):
        raise RuntimeError(
            "DOCX to PDF conversion failed with both docx2pdf and soffice. "
            f"Expected output: {temp_path}."
        )

    return temp_path
  1. test_parser.py: Remove mock parsing calls that will hide the issue.

@pramitchoudhary
Copy link
Copy Markdown
Contributor

@dilithjay also can we swtich to soffice instead of lowriter?

@notsooamit notsooamit force-pushed the fix-macos-docx-sandbox branch from e543390 to 3b10ddd Compare May 27, 2026 19:53
- Add _is_valid_pdf helper to validate converted output
- Add _convert_with_soffice using LibreOffice (soffice/lowriter)
- Use os.path.abspath to resolve paths for docx2pdf/COM/AppleScript
- Linux: use LibreOffice as primary; macOS/Windows: docx2pdf with
  LibreOffice fallback
- Add error propagation for failed LLM parses in parse_chunk
- Guard against empty segments in bbox check
- Fix expected_ouput -> expected_output typos in tests
@notsooamit notsooamit force-pushed the fix-macos-docx-sandbox branch from 3b10ddd to 53dc0fd Compare May 27, 2026 19:56
@notsooamit notsooamit requested a review from Copilot May 27, 2026 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Test fails on test_parsing_docx_type

3 participants