In [None]:
FILE_NAME='data/docx/sample-docx.docx'

Approaches to read a DOCX file in Python:
1. Convert to HTML with Mammoth (the Python port of mammoth.js), then parse the HTML with BeautifulSoup.
2. Convert to Markdown with MarkItDown, then use a Markdown parser.
3. Use unstructured.io to extract and process the document directly.
4. Use libraries like python-docx or docling to read the document structure.
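
All four approaches read the same container: a `.docx` file is a ZIP archive whose main body is stored in `word/document.xml`. The sketch below uses only the standard library to show that structure. The `read_paragraphs` helper and the in-memory demo archive are illustrative assumptions, not part of any library above, and the demo archive is a stripped-down stand-in rather than a valid Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by <w:p> (paragraph) and <w:t> (text) elements
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_paragraphs(docx_source):
    """Return the text of each non-empty <w:p> paragraph in word/document.xml."""
    with zipfile.ZipFile(docx_source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> runs
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs


# Hypothetical stand-in: an archive holding only word/document.xml.
# It is NOT a valid Word file; it only exercises the reader above.
demo_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:r><w:t>Hello</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> world</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
    '</w:body></w:document>'
)
demo_file = io.BytesIO()
with zipfile.ZipFile(demo_file, "w") as archive:
    archive.writestr("word/document.xml", demo_xml)

print(read_paragraphs(demo_file))
```

Because `zipfile.ZipFile` also accepts a path, `read_paragraphs(FILE_NAME)` should work on the sample document as well. It flattens everything to plain paragraph text (including text inside table cells), which is why the libraries below are usually the better choice.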

## mammoth.js

**Mammoth.js** is a JavaScript library that converts `.docx` documents into clean, semantic HTML. Rather than replicating visual styles, Mammoth maps meaningful structure, for example turning paragraphs styled "Heading 1" into `<h1>` elements. It supports:

*   Headings, lists, tables, footnotes, images, links
*   Custom style mappings
*   Raw text extraction
*   Browser and Node.js usage
*   CLI and API support

Mammoth avoids unnecessary formatting and works best with semantically styled documents. Packages are available via npm (JavaScript), PyPI (Python), Maven (Java), and NuGet (.NET), and as a WordPress plugin; this notebook uses the Python package.
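
The custom style mappings and raw text extraction listed above can be sketched with the Python package as follows. The style names `Section Title` and `Subsection Title` are hypothetical placeholders, and the two helper functions are illustrative wrappers rather than part of Mammoth itself.

```python
# Hypothetical style names: replace them with styles your document actually uses.
STYLE_MAP = """
p[style-name='Section Title'] => h1:fresh
p[style-name='Subsection Title'] => h2:fresh
"""


def convert_with_style_map(docx_path, style_map=STYLE_MAP):
    """Return (html, messages), applying custom style mappings during conversion."""
    import mammoth  # imported here so the helper can be defined before mammoth is installed

    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=style_map)
    return result.value, result.messages


def extract_plain_text(docx_path):
    """Return the document's raw text, ignoring all formatting."""
    import mammoth

    with open(docx_path, "rb") as docx_file:
        return mammoth.extract_raw_text(docx_file).value
```

For example, `html, messages = convert_with_style_map(FILE_NAME)` returns the converted HTML together with any warnings about styles Mammoth did not recognise.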



In [None]:
%pip install mammoth beautifulsoup4

In [None]:
import mammoth
from pathlib import Path
from bs4 import BeautifulSoup
from textwrap import shorten

output_path = Path(FILE_NAME).with_suffix(".html")

with open(FILE_NAME, "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)

html = result.value
messages = result.messages

output_path.write_text(html, encoding="utf-8")
print(f"Saved HTML to {output_path}")

if messages:
    print("Conversion messages:")
    for message in messages:
        print(f" - {message}")

soup = BeautifulSoup(html, "html.parser")

headings = [tag.get_text(' ', strip=True) for tag in soup.find_all(["h1", "h2", "h3"])]
if headings:
    print("\nHeadings detected:")
    for heading in headings:
        print(f" - {heading}")

paragraphs = [p.get_text(' ', strip=True) for p in soup.find_all("p")]
if paragraphs:
    print(f"\nFirst paragraph snippet: {shorten(paragraphs[0], width=120, placeholder='…')}")

tables = soup.find_all("table")
if tables:
    first_table = tables[0]
    rows = [
        [cell.get_text(' ', strip=True) for cell in row.find_all(["th", "td"])]
        for row in first_table.find_all("tr")
    ]
    print("\nFirst table preview:")
    for row in rows[:3]:
        print(" | ".join(row))

## MarkItDown 

**MarkItDown** is a lightweight Python utility developed by Microsoft for converting various file formats into Markdown, optimized for use with Large Language Models (LLMs) and text analysis pipelines. It preserves document structure (headings, tables, lists, etc.) and supports a wide range of input formats including:

*   PDF, Word, Excel, PowerPoint
*   Images (with OCR and EXIF metadata)
*   Audio (transcription)
*   HTML, CSV, JSON, XML
*   ZIP files, YouTube URLs, EPubs

### Key Features

*   Converts files to Markdown for efficient LLM processing
*   Supports plugins and integration with Azure Document Intelligence
*   Offers both CLI and Python API usage
*   Compatible with Docker and virtual environments
*   Optional dependencies for format-specific support
*   Supports LLM-based image descriptions (e.g., GPT-4o)
*   Open-source under the MIT License

In [None]:
%pip install "markitdown[all]"

In [None]:
from pathlib import Path
from markitdown import MarkItDown

converter = MarkItDown()  # Pass enable_plugins=True to load installed third-party plugins
result = converter.convert(FILE_NAME)

output_path = Path(FILE_NAME).with_suffix(".md")
output_path.write_text(result.text_content, encoding="utf-8")
print(f"Saved Markdown to {output_path}")

lines = result.text_content.splitlines()
preview_lines = lines[:15] if lines else []
if preview_lines:
    print("\nMarkdown preview:")
    for line in preview_lines:
        print(line)
else:
    print("\nNo textual content detected.")

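# Defensive lookup: the conversion result may not expose an `attachments`
# attribute, in which case this cell simply reports that none were found.
attachments = getattr(result, "attachments", None)
if attachments:
    print("\nEmbedded attachments:")
    for name, data in attachments.items():
        print(f" - {name} ({len(data)} bytes)")
else:
    print("\nNo embedded attachments detected.")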
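
The second approach in the list at the top calls for parsing the Markdown after conversion. As a minimal stand-in for a real Markdown parser, the sketch below splits Markdown text into sections at ATX (`#`) headings using only the standard library. The `split_markdown_sections` helper and the sample string are assumptions for illustration. It does not special-case `#` lines inside fenced code blocks, so a dedicated parser is the safer choice for anything beyond a quick look.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def split_markdown_sections(markdown_text):
    """Split Markdown into (level, title, body) tuples at ATX headings.

    Text before the first heading is returned with level 0 and an empty title.
    """
    sections = []
    level, title, body = 0, "", []

    def flush():
        # Keep a section only if it has a title or some non-blank body text
        if title or any(line.strip() for line in body):
            sections.append((level, title, "\n".join(body).strip()))

    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level, title, body = len(match.group(1)), match.group(2), []
        else:
            body.append(line)
    flush()
    return sections


# Hypothetical sample input standing in for result.text_content
sample = "Intro text\n\n# Title\nFirst paragraph.\n\n## Details\n| a | b |\n| - | - |\n"
for level, title, body in split_markdown_sections(sample):
    print(level, repr(title), repr(body))
```

On the converted document, `split_markdown_sections(result.text_content)` would give one tuple per heading, which is a convenient unit for chunking.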

## unstructured.io

**Unstructured.io** provides an open-source ETL library (`unstructured`) that converts complex documents, such as PDFs, Word files, HTML, and images, into clean, structured data for use with large language models (LLMs). It provides modular components for:

*   **Document ingestion and pre-processing**
*   **Auto-partitioning and format detection**
*   **Table and image enrichment**
*   **Chunking and embedding generation**

Unstructured supports both local development and containerized deployment via Docker. It integrates easily with Python and offers connectors for platforms like Discord. The library is ideal for building scalable, production-grade data pipelines and is available via PyPI and GitHub.

In [None]:
%pip install "unstructured[docx]"

In [None]:
from unstructured.partition.docx import partition_docx
elements = partition_docx(FILE_NAME)
print(elements)

In [None]:
print("Number of elements:", len(elements))
for i, element in enumerate(elements):
    if element.category == "Table":
        # Prefer the HTML rendering of a table; fall back to plain text if it is missing
        chunk_text = element.metadata.text_as_html or element.text
    elif element.category == "Title":
        # Mark titles as Markdown headings
        chunk_text = "# " + element.text
    else:
        chunk_text = element.text
    print(f"element {i} ({element.category}): Chunk len ({len(chunk_text)}) {chunk_text[:100]}...")
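
Building on the loop above, one simple way to turn the flat element list into sections is to start a new group at every `Title` element. This is a rough sketch, not Unstructured's own chunking (the library ships a `chunk_by_title` function in `unstructured.chunking.title` for production use). The `Section` and `FakeElement` classes are hypothetical stand-ins so the example runs without a document.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    parts: list = field(default_factory=list)


def group_by_title(elements):
    """Start a new Section at each 'Title' element and attach later elements to it."""
    sections = [Section(title="")]  # collects anything that precedes the first Title
    for element in elements:
        if element.category == "Title":
            sections.append(Section(title=element.text))
            continue
        # Tables carry an HTML rendering in their metadata when available
        html = getattr(getattr(element, "metadata", None), "text_as_html", None)
        use_html = element.category == "Table" and html
        sections[-1].parts.append(html if use_html else element.text)
    # Drop the leading placeholder if nothing preceded the first Title
    return [s for s in sections if s.title or s.parts]


# Hypothetical stand-in exposing the two attributes the function reads
@dataclass
class FakeElement:
    category: str
    text: str


demo = [
    FakeElement("Title", "Overview"),
    FakeElement("NarrativeText", "Some text."),
    FakeElement("Title", "Results"),
    FakeElement("ListItem", "First item"),
]
for section in group_by_title(demo):
    print(section.title, "->", section.parts)
```

With the real data, `group_by_title(elements)` works on the list returned by `partition_docx`, since it only reads `category`, `text`, and `metadata.text_as_html`.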
    

### 🆚 Tool Comparison: MarkItDown vs Unstructured.io vs Mammoth.js

| Feature / Aspect          | **MarkItDown**                                         | **Unstructured.io**                                        | **Mammoth.js**                                        |
| ------------------------- | ------------------------------------------------------ | ---------------------------------------------------------- | ----------------------------------------------------- |
| **Purpose**               | Converts files to Markdown for LLM-friendly input      | Transforms unstructured documents into structured data     | Converts `.docx` files to clean HTML markup           |
| **Supported Formats**     | PDF, Word, Excel, PPT, images, audio, HTML, JSON, etc. | PDFs, Word, PowerPoint, HTML, emails, images, EPUBs, etc.  | `.docx` only (Word)                                   |
| **Output Format**         | Markdown (`.md`)                                       | Structured JSON                                            | HTML (semantic, minimal markup)                       |
| **Installation**          | `pip install 'markitdown[all]'`                        | `pip install "unstructured[all-docs]"`                     | `pip install mammoth` (Python; also npm, Java, .NET)  |
| **Usage Style**           | CLI and Python API                                     | Python SDK, CLI, Docker, API, web UI                       | JavaScript API or CLI; separate Python port (PyPI)    |
| **Processing Focus**      | Retains document structure (headings, tables, lists)   | Partitioning, cleaning, chunking, metadata extraction      | Focused semantic mapping from Word styles to HTML     |
| **Image & Audio Support** | OCR for images, transcription for audio                | OCR, layout parsing, table extraction                      | Images embedded in output HTML                        |
| **Customization**         | Plugins, Azure integration                             | Partition strategies, connectors, pipeline orchestration   | Custom style maps to convert named styles to HTML     |
| **Enterprise Features**   | Lightweight utility with plugin support                | Full enterprise platform with UI, API, security, analytics | Community/open-source focused (GitHub-backed)         |
| **Open Source**           | Yes (Microsoft-backed)                                 | Yes (community-driven, GitHub-hosted)                      | Yes (BSD-licensed, multi-platform)                    |
| **Docker Support**        | Yes                                                    | Yes (multi-platform images)                                | Browser-demo support; core API usable in Node.js      |
| **Ideal Use Case**        | Quick Markdown conversion for LLM ingestion            | Building full ETL pipelines for AI/ML                      | Cleanly converting Word `.docx` to semantic HTML      |
