v7.1.0

@github-actions github-actions released this 25 May 21:59
· 4 commits to master since this release

v7.1.0: 🛡️ Cancellation Control, Thread Safety & Robust Entity Decoding

I am excited to announce the release of officeParser v7.1.0! Following the massive paradigm shift of v7.0.0, this release is dedicated to enterprise-grade reliability, memory leak prevention, and precision parsing.

As officeParser scales to support millions of production workloads and AI pipelines, v7.1.0 introduces critical safety guards, cancellation capabilities, and robustness improvements for heavy-duty document processing.


🌟 Key Pillars of the v7.1.0 Update

1. Native Cancellation with AbortSignal

You can now pass an abortSignal in both OfficeParserConfig and OcrConfig, as well as in the configurations for PdfGenerator and ChunkingGenerator. Aborting the signal immediately interrupts:

  • Document loading and parsing loops.
  • Background Puppeteer browsers.
  • Active OCR worker recognition tasks.
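
One common use of a cancellation signal is to tie it to a deadline. The sketch below is a minimal illustration built only on the standard AbortController; runWithDeadline and the task callback are hypothetical names, not part of officeParser. In real use, the callback would forward the signal through the abortSignal option described above.

```javascript
// Hypothetical helper: run `task(signal)` and abort it if it exceeds `ms`.
// `task` is any function that accepts an AbortSignal and returns a promise,
// e.g. (signal) => parseOffice(file, { abortSignal: signal }).
async function runWithDeadline(task, ms) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await task(controller.signal);
  } finally {
    clearTimeout(timer); // never leave the timer pending after the task settles
  }
}
```

The helper only triggers the signal; it is still up to the task to observe it and reject, which is what the new abortSignal support is described as doing.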

2. Consolidated Timeouts & Memory Safety

To prevent execution stalls and hanging resources in serverless or containerized environments:

  • Consolidated OCR Timeouts: Timeout options have been unified under a structured timeout configuration (workerLoad, recognition, and autoTerminate in OcrTimeoutConfig).
  • Generator Timeouts: Added robust timeouts for PdfGenerator and ChunkingGenerator tasks.
  • Resource Leak Prevention: If a generator or parser execution fails, cancels, or times out, Puppeteer browser instances and Tesseract workers are forcefully terminated and evicted, ensuring no dangling resources are left behind.
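
The cleanup guarantee described above follows a familiar pattern: race the work against a timer and release the resource in a finally block. The following is a generic sketch of that pattern, not officeParser's internal code; withResource, acquire, use, and release are all hypothetical names.

```javascript
// Hypothetical illustration of "always release on failure, cancel, or timeout".
// `acquire` creates a resource (e.g. a browser or an OCR worker), `use` does the
// work, and `release` tears the resource down. None of these names come from
// officeParser.
async function withResource(acquire, use, release, timeoutMs) {
  const resource = await acquire();
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Timed out after ${timeoutMs} ms`)),
      timeoutMs
    );
  });
  try {
    // Whichever settles first wins: the work itself or the timeout.
    return await Promise.race([use(resource), timeout]);
  } finally {
    clearTimeout(timer);
    await release(resource); // runs on success, error, and timeout alike
  }
}
```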

3. Robust XLSX Parsing & Entity Decoding

  • XML Entity Decoding: Fixed bugs where decimal, hex, and named XML entities (e.g., &#38;, &#x26;, &lt;) in Excel sheets were left undecoded and appeared as raw strings in the output.
  • inlineStr Attribute Support: Fixed inlineStr tag attribute matching to correctly process inline spreadsheet strings.
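
To make the three entity forms concrete, here is a minimal decoder sketch. It is an illustration only and not officeParser's implementation; decodeXmlEntities and NAMED_ENTITIES are hypothetical names, and the table covers just the five entities predefined by XML.

```javascript
// Illustrative decoder for decimal, hex, and named XML entities.
// Not officeParser's implementation; names here are hypothetical.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

function decodeXmlEntities(text) {
  return text.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    // Unknown entity names are left as-is.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

console.log(decodeXmlEntities('Tom &#38; Jerry &#x26; &lt;b&gt;'));
// prints "Tom & Jerry & <b>"
```

Decoding in a single pass matters: a cell containing the literal text &amp;lt; should come out as &lt;, not be decoded a second time into <.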

4. Visualizer Panel Upgrades & Compliance

  • Timeout & Cancellation Controls: The web visualizer config drawer now exposes granular controls for OCR and generator timeouts.
  • ESM CSP Compliance: Replaced legacy dynamic module loading with direct ESM-native import() to comply with strict Content Security Policies.

🛠 Getting Started

npm install officeparser@7.1.0

Example of using the new AbortSignal and timeout suite:

const { parseOffice } = require('officeparser');

const controller = new AbortController();

// `await` is not allowed at the top level of a CommonJS module,
// so the call is wrapped in an async function.
async function main() {
  try {
    const ast = await parseOffice('large-file.docx', {
      abortSignal: controller.signal,
      ocr: {
        enable: true,
        // Consolidated OCR timeouts
        timeout: {
          workerLoad: 10000,   // Max time to load OCR worker (ms)
          recognition: 30000,  // Max time for text recognition (ms)
          autoTerminate: 60000 // Inactivity cleanup (ms)
        }
      }
    });
    return ast;
  } catch (error) {
    if (error.name === 'AbortError') {
      console.log('Parsing task was aborted successfully.');
    } else {
      console.error('Parsing failed:', error);
    }
  }
}

main();

// Cancel the operation at any point while parsing is still in flight:
// controller.abort();

🔗 Full Changelog: View v7.1.0 Details
🔗 Documentation & Visualizer: officeparser.harshankur.com


❤️ Supporting the Future of Document Infrastructure

Since 2019, officeParser has been maintained as a voluntary project, growing to over 10 million downloads and 300,000+ weekly installations.

As I build the ultimate document-to-AI pipeline, I am seeking sustainable funding for officeParser's next milestones:

  • Core Sustainability: Keeping up with dependency updates, test coverage, and performance tuning.
  • Multi-Runtime Excellence: Official support for Bun, Deno, and Edge (Cloudflare Workers, Vercel).
  • Enterprise Connectors: Dedicated integrations with LangChain, LlamaIndex, and Haystack.

If officeParser powers your production workflows or AI pipelines, please consider supporting its development:

👉 GitHub Sponsors
👉 Buy Me A Coffee