v7.2.0

Latest

github-actions released this 04 Jun 21:29
· 3 commits to master since this release

v7.2.0: 🏗️ Parser Enhancements, Granular HTML Generator Controls, and Strict AST Typings

I am thrilled to announce the release of officeParser v7.2.0! This release brings a significant architectural upgrade to the AST, giving developers deeper insight into document layout and embedded metadata, along with much stricter TypeScript typings.

As we pave the way for building advanced RAG architectures, deep-document search systems, and robust AI parsing pipelines on top of officeParser, v7.2.0 ensures that every piece of document content, from slide masters to hidden footnotes, is logically structured and strictly typed.

Warning

Soft Breaking Change: Notes Placement
If your application iterates over ast.content to manually extract footnotes, endnotes, or slide speaker notes, you will need to update your logic. These nodes are no longer appended to the main content array. They are now structurally nested inside the notes[] array of their logical parent or preceding text node.


🌟 Key Pillars of the v7.2.0 Update

1. Structural Notes Attachment

Previously, footnotes, endnotes, and slide speaker notes were flattened and appended to the end of the document content. In v7.2.0, these notes are now strictly attached to their logical parent or preceding sibling nodes via a new node.notes[] array.
Note: The legacy putNotesAtLast config flag is now deprecated.
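
For code that used to scan the end of ast.content for notes, the migration amounts to walking the tree and reading each node's notes[] array. Here is a minimal sketch under stated assumptions: NoteSketchNode, its children field, and the 'note' type string are hypothetical stand-ins for illustration, not the library's published types.

```typescript
// Hypothetical, simplified node shape for illustration only;
// the real officeParser node types carry more fields and may differ.
interface NoteSketchNode {
  type: string;
  text?: string;
  children?: NoteSketchNode[];
  notes?: NoteSketchNode[];
}

// Depth-first walk that gathers every note attached to any node.
// A node's own notes are collected before those of its children.
function collectNotes(nodes: NoteSketchNode[]): NoteSketchNode[] {
  const found: NoteSketchNode[] = [];
  for (const node of nodes) {
    if (node.notes) found.push(...node.notes);
    if (node.children) found.push(...collectNotes(node.children));
  }
  return found;
}

// Hand-built sample data standing in for ast.content.
const sampleContent: NoteSketchNode[] = [
  {
    type: 'paragraph',
    text: 'Body text',
    notes: [{ type: 'note', text: 'Footnote 1' }],
  },
  {
    type: 'slide',
    children: [
      { type: 'paragraph', text: 'Bullet', notes: [{ type: 'note', text: 'Nested note' }] },
    ],
    notes: [{ type: 'note', text: 'Speaker note' }],
  },
];

const noteTexts = collectNotes(sampleContent).map((n) => n.text);
console.log(noteTexts);
```

Because each note stays next to the node it annotates, a walker like this can also keep the parent node alongside the note if your pipeline needs that context.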

2. Auxiliary Content (Headers, Footers, Slide Masters)

The new ast.auxiliary property exposes out-of-band template content. officeParser now automatically extracts headers and footers from Word documents (ast.auxiliary.headers and ast.auxiliary.footers) and Slide Masters from PowerPoint presentations (ast.auxiliary.slideMasters). These are kept separate from the main sequential content in ast.content.
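
As a rough illustration of consuming these branches, the sketch below joins the text of each auxiliary branch while tolerating branches that are absent. Only the property names headers, footers, and slideMasters come from this release; the AuxSketch and AuxSketchNode shapes (including the assumption that each branch is a flat array of nodes with an optional text field) are hypothetical.

```typescript
// Hypothetical shapes for illustration; only the property names
// headers, footers, and slideMasters are taken from the release notes.
interface AuxSketchNode {
  type: string;
  text?: string;
}

interface AuxSketch {
  headers?: AuxSketchNode[];
  footers?: AuxSketchNode[];
  slideMasters?: AuxSketchNode[];
}

// Join the non-empty text of one auxiliary branch.
// A missing branch is treated as an empty list.
function auxiliaryText(nodes: AuxSketchNode[] | undefined): string {
  return (nodes ?? [])
    .map((n) => n.text ?? '')
    .filter((t) => t.length > 0)
    .join('\n');
}

// Hand-built sample standing in for ast.auxiliary of a Word document.
const sampleAuxiliary: AuxSketch = {
  headers: [{ type: 'paragraph', text: 'Quarterly Report' }],
  footers: [{ type: 'paragraph', text: 'Confidential' }, { type: 'paragraph' }],
};

const headerText = auxiliaryText(sampleAuxiliary.headers);
const footerText = auxiliaryText(sampleAuxiliary.footers);
const masterText = auxiliaryText(sampleAuxiliary.slideMasters);
console.log({ headerText, footerText, masterText });
```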

3. Native & Custom Document Properties

The OfficeMetadata interface has been radically upgraded. Alongside canonical metadata fields (title, author, dates), officeParser now exposes format-specific verbatim metadata via ast.metadata.nativeProperties (e.g., <meta> tags in HTML, app.xml stats in DOCX, XMP dicts in PDF) and user-defined variables via ast.metadata.customProperties.
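
One way to use these fields in a search or RAG index is to flatten them into a single key/value record. The sketch below assumes that nativeProperties and customProperties are plain key/value objects; that value shape, the MetadataSketch interface, and the native./custom. key prefixes are assumptions made for illustration, not the documented OfficeMetadata definition.

```typescript
// Hypothetical metadata shape; the value types are assumptions,
// only the property names come from the release notes.
interface MetadataSketch {
  title?: string;
  author?: string;
  nativeProperties?: Record<string, unknown>;
  customProperties?: Record<string, unknown>;
}

// Flatten canonical, native, and custom fields into one string record.
// Prefixes keep format-specific and user-defined keys from colliding.
function flattenMetadata(meta: MetadataSketch): Record<string, string> {
  const flat: Record<string, string> = {};
  if (meta.title) flat['title'] = meta.title;
  if (meta.author) flat['author'] = meta.author;
  for (const [key, value] of Object.entries(meta.nativeProperties ?? {})) {
    flat[`native.${key}`] = String(value);
  }
  for (const [key, value] of Object.entries(meta.customProperties ?? {})) {
    flat[`custom.${key}`] = String(value);
  }
  return flat;
}

// Hand-built sample standing in for ast.metadata.
const sampleMetadata: MetadataSketch = {
  title: 'Budget',
  nativeProperties: { Pages: 12 },
  customProperties: { Department: 'Finance' },
};

const flatMetadata = flattenMetadata(sampleMetadata);
console.log(flatMetadata);
```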

4. Discriminated Unions & Strict AST Typings

The generic OfficeContentNode interface has been completely refactored into a strict TypeScript Discriminated Union. This unlocks precise, compile-time type narrowing per node.type (e.g., safely accessing SlideMetadata only when type === 'slide'), eliminating the need for generic fallback assertions across your application.
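
To show what this narrowing looks like in practice, here is a self-contained sketch of the discriminated-union pattern. The SlideSketch and ParagraphSketch types are hypothetical, reduced stand-ins (the real union has more members and richer metadata); the point is only how checking node.type lets the compiler pick the right shape.

```typescript
// Hypothetical node variants for illustration only.
interface SlideSketch {
  type: 'slide';
  metadata: { pageNumber: number };
}

interface ParagraphSketch {
  type: 'paragraph';
  text: string;
}

// The shared literal `type` field is the discriminant.
type ContentSketch = SlideSketch | ParagraphSketch;

function describeNode(node: ContentSketch): string {
  switch (node.type) {
    case 'slide':
      // Narrowed to SlideSketch: metadata.pageNumber is known to exist.
      return `slide #${node.metadata.pageNumber}`;
    case 'paragraph':
      // Narrowed to ParagraphSketch: text is known to exist.
      return `paragraph: ${node.text}`;
    default: {
      // Exhaustiveness check: adding a new variant without a case
      // makes this assignment fail to compile.
      const unreachable: never = node;
      return unreachable;
    }
  }
}

const slideLabel = describeNode({ type: 'slide', metadata: { pageNumber: 3 } });
const paragraphLabel = describeNode({ type: 'paragraph', text: 'Hello' });
console.log(slideLabel, paragraphLabel);
```

The never assignment in the default branch is optional, but it turns a forgotten node type into a compile-time error rather than a silent fall-through.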

5. Interactive HTML Spreadsheet Layouts & DOM Injections

The HTML Generator just got significantly smarter:

  • Interactive Spreadsheets: Spreadsheets generated from Excel or CSV files now render with desktop-class interactivity, featuring native draggable boundary handles (.col-resizer) to dynamically resize rows and columns in the browser.
  • Granular Layout Controls: Expanded HtmlGeneratorConfig with containerWidth, customCss, and DOM injections (head/body hook insertions).

🛠 Getting Started

npm install officeparser@7.2.0

Example of using the new Discriminated Unions, Auxiliary nodes, and Structural Notes:

import { parseOffice } from 'officeparser';

const ast = await parseOffice('presentation.pptx', {
  ignoreSlideMasters: false
});

// Access Slide Masters from the new auxiliary AST branch
const masterSlides = ast.auxiliary?.slideMasters || [];
console.log(`Found ${masterSlides.length} master slides!`);

// Confidently narrow types using Discriminated Unions!
for (const node of ast.content) {
  if (node.type === 'slide') {
    // TypeScript now explicitly knows this is a Slide node.
    // Slide Notes are now structurally nested under the slide!
    const noteCount = node.notes?.length || 0;
    console.log(`Slide ${node.metadata.pageNumber} has ${noteCount} notes attached.`);
  }
}

🔗 Full Changelog: View v7.2.0 Details
🔗 Documentation & Visualizer: officeparser.harshankur.com


❤️ Supporting the Future of Document Infrastructure

Since 2019, officeParser has been maintained as a voluntary project, growing to over 10 million downloads and 300,000+ weekly installations.

As I build the ultimate document-to-AI pipeline, I am seeking sustainable funding for officeParser's next milestones:

  • Core Sustainability: Keeping up with dependency updates, test coverage, and performance tuning.
  • Multi-Runtime Excellence: Official support for Bun, Deno, and Edge (Cloudflare Workers, Vercel).
  • Enterprise Connectors: Dedicated integrations with LangChain, LlamaIndex, and Haystack.

If officeParser powers your production workflows or AI pipelines, please consider supporting its development:

👉 GitHub Sponsors
👉 Buy Me A Coffee


Changes: v7.1.0..v7.2.0