## 📄 Word Document Processing with LangChain

🎯 Purpose
- Load a .docx file
- Extract text content
- Understand how two different loaders process Word documents

🧠 What the Script Does

1️⃣ Using Docx2txtLoader
- Loads the Word document
- Converts the whole file into plain text
- Prints:
  - number of documents loaded
  - first 200 characters of the content
  - basic metadata (the source path)

2️⃣ Using UnstructuredWordDocumentLoader
- Loads the same document with `mode="elements"`
- Breaks the document into structured parts such as:
  - headings
  - paragraphs
  - lists / tables (if present)
- Prints:
  - total number of elements
  - for each of the first three elements:
    - element type (the `category` metadata field)
    - content preview (first 100 characters)

🧾 Key Idea

- Docx2txtLoader → simple full-text extraction
- UnstructuredWordDocumentLoader → structured, section-wise extraction

🚀 Why This Matters

- Useful for RAG, preprocessing, and document pipelines
- Helps choose the right loader depending on whether the document's structure is needed downstream
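
The choice between the two loaders can be sketched as a small helper. This is a minimal sketch, not part of the original script: `choose_loader_spec` and `build_loader` are hypothetical names, and the only library calls assumed are the two loader constructors already used in this notebook.

```python
from typing import Any


def choose_loader_spec(need_structure: bool) -> dict[str, Any]:
    """Return the loader class name and constructor kwargs for a .docx file."""
    if need_structure:
        # Element mode yields one Document per detected element.
        return {"loader": "UnstructuredWordDocumentLoader", "kwargs": {"mode": "elements"}}
    # Plain-text mode: a single Document holding the whole file's text.
    return {"loader": "Docx2txtLoader", "kwargs": {}}


def build_loader(path: str, need_structure: bool):
    """Instantiate the chosen loader (hypothetical helper)."""
    # Imported lazily so the selection logic works without LangChain installed.
    from langchain_community import document_loaders

    spec = choose_loader_spec(need_structure)
    loader_cls = getattr(document_loaders, spec["loader"])
    return loader_cls(path, **spec["kwargs"])
```

Under these assumptions, `build_loader("data/docs/doc1.docx", need_structure=True).load()` would return per-element documents, while `need_structure=False` would return one document for the whole file.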

## Word Document Processing

In [1]:
from langchain_community.document_loaders import Docx2txtLoader, UnstructuredWordDocumentLoader



In [6]:
## Method 1: Using Docx2txtLoader
print("Using Docx2txtLoader")
try:
    docx_loader = Docx2txtLoader("data/docs/doc1.docx")
    docs = docx_loader.load()  # a list with one Document for the whole file
    print(f"Loaded {len(docs)} document(s)")
    print(f"Content preview: {docs[0].page_content[:200]}")
    print(f"Metadata: {docs[0].metadata}")

except Exception as e:
    print(f"Error loading document with Docx2txtLoader: {e}")

Using Docx2txtLoader
Loaded 1 document(s)
Content preview: Synergizing Data Organization: Novel Paths in Conversation and Analysis





Laxman.B 
Department of CSE(Data Science) 
CMR Technical Campus
 Hyderabad, India
sailaxman24@gmail.com

 Nenavath Raviteja
Metadata: {'source': 'data/docs/doc1.docx'}
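
The preview above contains long runs of blank lines and trailing spaces carried over from the Word layout. Before chunking or embedding, it can help to normalize that whitespace. A minimal sketch using only the standard library; `collapse_blank_lines` is a hypothetical helper, not something Docx2txtLoader provides, and `sample` is a shortened stand-in for the real content:

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Strip trailing spaces per line and squeeze runs of blank lines to one."""
    # Remove spaces and tabs at the end of each line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Replace three or more consecutive newlines with a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "Title\n\n\n\n\nLaxman.B \nDepartment of CSE(Data Science) \n"
print(collapse_blank_lines(sample))
```

In the notebook this could be applied as `collapse_blank_lines(docs[0].page_content)`.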


In [11]:
## Method 2: Using UnstructuredWordDocumentLoader
print("Using UnstructuredWordDocumentLoader")
try:
    # mode="elements" returns one Document per detected element instead of one per file
    unstructured_loader = UnstructuredWordDocumentLoader("data/docs/doc1.docx", mode="elements")
    unstructured_docs = unstructured_loader.load()
    print(f"Loaded {len(unstructured_docs)} elements")
    for i, doc in enumerate(unstructured_docs[:3]):
        print(f"\nElement {i+1}:")
        print(f"Type: {doc.metadata.get('category', 'unknown')}")
        print(f"Content: {doc.page_content[:100]}...")

except Exception as e:
    print(f"Error loading document with UnstructuredWordDocumentLoader: {e}")

Using UnstructuredWordDocumentLoader
Loaded 70 elements

Element 1:
Type: UncategorizedText
Content: Synergizing Data Organization: Novel Paths in Conversation and Analysis...

Element 2:
Type: Footer
Content: XXX-X-XXXX-XXXX-X/XX/$XX.00 ©20XX IEEE...

Element 3:
Type: UncategorizedText
Content: Laxman.B 
Department of CSE(Data Science) 
CMR Technical Campus
 Hyderabad, India
sailaxman24@gmail....
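
With 70 elements loaded, a tally by `category` shows what the parser found, and the same field can be used to drop elements such as the `Footer` seen above before indexing. A minimal sketch: `summarize_categories` and `drop_categories` are hypothetical helpers, and `SimpleNamespace` objects stand in for LangChain `Document`s (anything with `.metadata` and `.page_content` works) so the example runs without the .docx file.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_categories(docs) -> Counter:
    """Count elements per 'category' metadata value."""
    return Counter(d.metadata.get("category", "unknown") for d in docs)


def drop_categories(docs, unwanted=("Header", "Footer")):
    """Return the elements whose category is not in `unwanted`."""
    return [d for d in docs if d.metadata.get("category") not in unwanted]


# Stand-in elements mirroring the first three shown above (text shortened).
elements = [
    SimpleNamespace(metadata={"category": "UncategorizedText"},
                    page_content="Synergizing Data Organization"),
    SimpleNamespace(metadata={"category": "Footer"},
                    page_content="XXX-X-XXXX-XXXX-X/XX/$XX.00"),
    SimpleNamespace(metadata={"category": "UncategorizedText"},
                    page_content="Laxman.B"),
]

print(summarize_categories(elements))
print(len(drop_categories(elements)))
```

On the real data, the same calls would take `unstructured_docs` in place of `elements`.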


In [16]:
unstructured_docs

[Document(metadata={'source': 'data/docs/doc1.docx', 'category_depth': 0, 'file_directory': 'data/docs', 'filename': 'doc1.docx', 'last_modified': '2025-11-20T22:12:32', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'UncategorizedText', 'element_id': 'a117a81f0e25991e1f1b8150ff907fb2'}, page_content='Synergizing Data Organization: Novel Paths in Conversation and Analysis'),
 Document(metadata={'source': 'data/docs/doc1.docx', 'category_depth': 0, 'file_directory': 'data/docs', 'filename': 'doc1.docx', 'header_footer_type': 'first_page', 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Footer', 'element_id': '004628369c8714ea3c7595e7fbdf197e'}, page_content='XXX-X-XXXX-XXXX-X/XX/$XX.00 ©20XX IEEE'),
 Document(metadata={'source': 'data/docs/doc1.docx', 'category_depth': 0, 'emphasized_text_contents': ['CMR Technical Campus'