# Word Document Processing and Loaders 

Word document processing involves reading, extracting, and manipulating content from `.doc` and `.docx` files. This is commonly required for tasks such as data extraction, text analysis, and automated reporting.

## Common Loaders for Word Documents

- **python-docx**:  
    A popular library for reading and writing `.docx` files. It allows you to extract text, tables, images, and metadata from Word documents.

- **docx2txt**:  
    A simple tool that extracts plain text from `.docx` files and discards formatting; it can optionally save embedded images to a directory.

- **Mammoth**:  
    Focuses on converting `.docx` files to HTML or plain text, preserving semantic structure.

- **Unstructured Loader**:  
    Built on the `unstructured` library, it partitions Word documents into typed elements (such as titles, headers, and narrative text) that can be chunked for further analysis or machine learning tasks.

## Typical Workflow

1. **Loading the Document**:  
     Use a loader to open and read the Word file.

2. **Extracting Content**:  
     Retrieve text, tables, images, or metadata as needed.

3. **Processing**:  
     Clean, chunk, or analyze the extracted content for downstream tasks.
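
The three steps above can be sketched with only the Python standard library, because a `.docx` file is a ZIP archive whose main body is stored in `word/document.xml`. This is a minimal illustration under that assumption, not how any of the listed libraries is implemented. The function names `load_document_xml`, `extract_paragraphs`, and `clean` are hypothetical; paragraphs inside table cells are returned like any other paragraph, and headers, footnotes, and formatting are ignored.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def load_document_xml(path):
    """Step 1 (load): read the main document part out of the .docx ZIP archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.read("word/document.xml")


def extract_paragraphs(document_xml):
    """Step 2 (extract): join the text runs (<w:t>) of each paragraph (<w:p>)."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for p in root.iter(f"{W_NS}p"):
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{W_NS}t")))
    return paragraphs


def clean(paragraphs):
    """Step 3 (process): collapse whitespace and drop empty paragraphs."""
    collapsed = (re.sub(r"\s+", " ", p).strip() for p in paragraphs)
    return [p for p in collapsed if p]
```

Chaining the calls, for example `clean(extract_paragraphs(load_document_xml("data/word_docs/Final-Generative-AI-statement-v1.docx")))`, would give a list of non-empty paragraph strings. A dedicated loader remains the better choice for real documents.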

## Example Libraries

- `python-docx`
- `docx2txt`
- `unstructured`
- `mammoth`

These loaders help streamline the process of working with Word documents in Python and other programming environments.

**Example 1: Docx2txtLoader**

In [3]:
from docx import Document as DocxDocument
from langchain_community.document_loaders import Docx2txtLoader, UnstructuredWordDocumentLoader

try:
    docx_loader = Docx2txtLoader("data/word_docs/Final-Generative-AI-statement-v1.docx")
    docx_documents = docx_loader.load()
    print(f"\nLoaded {len(docx_documents)} documents using Docx2txtLoader.")
    print("Sample document content:")
    print(docx_documents[0])  # Print the first Document (page_content and metadata)
except Exception as e:
    print(f"Error loading DOCX with Docx2txtLoader: {e}")


Loaded 1 documents using Docx2txtLoader.
Sample document content:
page_content='This Statement has been provided to members of the Global Connections Network

by Zorva Consulting and All Nations.

Generative AI statement Template

This statement has been produced so that you can tailor it to your own circumstances and needs. It is supplied ‘as is’ and without warranty.  If you do make any changes, we would be pleased to know about them, so that they might be shared with other Global Connections members.

Its authors are Nick Swain, Zorva Consulting Ltd and Clive Thomas, All Nations.

Nick Swain provides a range of Generative AI services to help you stay safe and deliver more impact, alongside data protection and Theory of Change advice. 

You are welcome to contact both Nick and Clive via email at nick.swain@zorva.info or via the Zorva website at https://zorva.info/about-us/contact-us/. Please mark any messages for Clive clearly, and Nick will pass them on.



What is Generative AI?



**Example 2: UnstructuredWordDocumentLoader**

In [9]:
try:
    unstructured_loader = UnstructuredWordDocumentLoader("data/word_docs/Final-Generative-AI-statement-v1.docx",
                                                       mode="elements")
    unstructured_documents = unstructured_loader.load()
    print(f"\nLoaded {len(unstructured_documents)} documents using UnstructuredWordDocumentLoader.")
    print("Sample document content:")
    print(unstructured_documents[0])  # Print the first Document (page_content and metadata)

    for i, doc in enumerate(unstructured_documents[:3]):  # Inspect the first 3 elements
        print(f"\nDocument {i+1} content:")
        print(doc.page_content)
        # 'category' holds the element type assigned by unstructured (e.g. Header, Title)
        print(f"Metadata: {doc.metadata.get('category', 'N/A')}")
        print(f"File directory: {doc.metadata.get('file_directory', 'N/A')}")
        print(f"Element ID: {doc.metadata.get('element_id', 'N/A')}")
except Exception as e:
    print(f"Error loading DOCX with UnstructuredWordDocumentLoader: {e}")


Loaded 69 documents using UnstructuredWordDocumentLoader.
Sample document content:
page_content='This Statement has been provided to members of the Global Connections Network
by Zorva Consulting and All Nations.' metadata={'source': 'data/word_docs/Final-Generative-AI-statement-v1.docx', 'category_depth': 0, 'file_directory': 'data/word_docs', 'filename': 'Final-Generative-AI-statement-v1.docx', 'header_footer_type': 'primary', 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Header', 'element_id': 'd6838e6211e1d7ad7b9cbcb0bc3afac5'}

Document 1 content:
This Statement has been provided to members of the Global Connections Network
by Zorva Consulting and All Nations.
Metadata: Header
File directory: data/word_docs
Element ID: d6838e6211e1d7ad7b9cbcb0bc3afac5

Document 2 content:
Generative AI statement Template
Metadata: Title
File directory: data/word_docs
Element ID: a02e8e0e67590316fc9bb8582be9f134

Document 3

## Use Cases for Word Document Loaders 

### Docx2txtLoader

- **Bulk Text Extraction**:  
    Quickly extract all text from `.docx` files for downstream processing, such as NLP tasks or search indexing.

- **Simple Content Analysis**:  
    Useful for scenarios where only the raw text is needed, ignoring formatting, images, and complex structures.

- **Data Migration**:  
    Migrate content from Word documents to other formats (e.g., plain text, CSV) for reporting or archival.

- **Preprocessing for Machine Learning**:  
    Prepare textual data for training models by extracting and cleaning document content.
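
For the preprocessing use case, the single `page_content` string returned by `Docx2txtLoader` usually needs tidying before it is split. The following is a minimal sketch, assuming paragraphs are separated by blank lines as in the sample output of Example 1. The helpers `split_paragraphs` and `chunk_text`, and the `max_chars` parameter, are hypothetical and not part of LangChain or docx2txt.

```python
import re


def split_paragraphs(text):
    """Split raw text on blank lines and normalize whitespace inside each paragraph."""
    parts = re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in parts if p.strip()]


def chunk_text(text, max_chars=500):
    """Greedily pack whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for para in split_paragraphs(text):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With the loader output from Example 1, this could be called as `chunk_text(docx_documents[0].page_content)`. Packing whole paragraphs keeps sentences intact; a library text splitter would typically replace this helper in a real pipeline.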

---

### UnstructuredWordDocumentLoader

- **Element-Level Extraction**:  
    With `mode="elements"` (as in Example 2), returns one document per element (paragraph, table, heading) instead of a single document for the whole file, enabling fine-grained analysis.

- **Metadata Enrichment**:  
    Captures metadata such as category, file directory, element ID, and more, supporting advanced document management and search.

- **Content Chunking**:  
    Breaks documents into logical chunks for tasks like semantic search, question answering, or summarization.

- **Document Classification**:  
    Enables classification or tagging of document sections based on extracted metadata and content.

- **Automated Reporting & Knowledge Management**:  
    Integrates with pipelines to automate extraction, categorization, and storage of document knowledge for enterprise use.
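
Element-level metadata makes it possible to rebuild a rough section structure. The sketch below starts a new section at every element whose `category` is `Title` and skips page furniture such as `Header`. The `Element` dataclass is a hypothetical stand-in for the loaded documents (it only mirrors their `page_content` and `metadata` attributes), `group_by_title` is a hypothetical helper, and the sample strings are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for a loaded document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_title(elements, skip_categories=("Header", "Footer")):
    """Start a new section at every Title element; attach following text to it."""
    sections = []
    current = {"title": None, "body": []}
    for el in elements:
        category = el.metadata.get("category", "N/A")
        if category in skip_categories:
            continue  # page furniture, not section content
        if category == "Title":
            # Close the previous section if it holds anything.
            if current["title"] is not None or current["body"]:
                sections.append(current)
            current = {"title": el.page_content, "body": []}
        else:
            current["body"].append(el.page_content)
    if current["title"] is not None or current["body"]:
        sections.append(current)
    return sections


# Illustrative input only; these strings are invented, not taken from the file.
sample = [
    Element("Network notice", {"category": "Header"}),
    Element("Introduction", {"category": "Title"}),
    Element("First paragraph of the introduction.", {"category": "NarrativeText"}),
    Element("Scope", {"category": "Title"}),
    Element("What the policy covers.", {"category": "NarrativeText"}),
]
for section in group_by_title(sample):
    print(section["title"], "->", len(section["body"]), "body element(s)")
```

Because the function only reads `page_content` and `metadata`, it should also accept the `unstructured_documents` list from Example 2, although the quality of the grouping depends on how reliably titles were detected in the source file.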

---

These loaders streamline workflows for extracting, processing, and analyzing Word document content in data science, automation, and enterprise applications.

**Example 3: python-docx**

In [10]:
# python-docx example: read body paragraphs directly, without a LangChain loader
try:
    doc = DocxDocument("data/word_docs/Final-Generative-AI-statement-v1.docx")
    # doc.paragraphs yields the body paragraphs in document order
    full_text = [para.text for para in doc.paragraphs]
    print(f"\nLoaded document using python-docx with {len(full_text)} paragraphs.")
    print("Sample paragraph content:")
    print(full_text[0])  # Print the first paragraph's content
except Exception as e:
    print(f"Error loading DOCX with python-docx: {e}")


Loaded document using python-docx with 83 paragraphs.
Sample paragraph content:
Generative AI statement Template
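
The first paragraph printed here is the title, whereas Example 2 returned a `Header` element first. A likely explanation is that `doc.paragraphs` in python-docx covers only the document body, so header text has to be read from each section separately. The following is a minimal sketch, assuming a python-docx version that exposes `section.header`; the helper name `read_header_and_body` is hypothetical.

```python
def read_header_and_body(docx_file):
    """Return (header_paragraphs, body_paragraphs) as lists of non-empty strings."""
    # Imported lazily so this module still loads where python-docx is absent.
    from docx import Document as DocxDocument

    doc = DocxDocument(docx_file)

    header_text = []
    for section in doc.sections:
        header = section.header
        # A linked header has no definition of its own; skip it to avoid duplicates.
        if header.is_linked_to_previous:
            continue
        header_text.extend(p.text for p in header.paragraphs if p.text.strip())

    # Body paragraphs only: headers, footers, and table cells are not included.
    body_text = [p.text for p in doc.paragraphs if p.text.strip()]
    return header_text, body_text
```

It could be called as `header_text, body_text = read_header_and_body("data/word_docs/Final-Generative-AI-statement-v1.docx")` to compare the two lists against the elements returned in Example 2.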
