# Excel Processing Techniques in LangChain

LangChain supports flexible strategies for loading and processing Excel files, enabling workflows that leverage multiple sheets, rich metadata, and custom document creation.

---

## 1. UnstructuredExcelLoader

- **Approach:**  
    Loads entire Excel files, including multiple sheets, as unstructured documents.
- **Implementation:**  
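    Uses `UnstructuredExcelLoader` to read `.xlsx` and `.xls` files. In the default `"single"` mode it returns one document containing the raw text of the whole workbook; in `"elements"` mode it returns one document per detected element, with an HTML rendering of each table under the `text_as_html` metadata key.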
- **Advantages:**  
    - Handles multiple sheets automatically.
    - In `"elements"` mode, keeps sheet-level information and table structure in each document's metadata.
    - Useful for exploratory analysis and multi-sheet reporting.
- **Limitations:**  
    - Less control over individual cell or row formatting.
    - May require post-processing for granular data extraction.

**Example:**
```python
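# Requires the langchain-community and unstructured packages
from langchain_community.document_loaders import UnstructuredExcelLoader

loader = UnstructuredExcelLoader("data/excel_files/Products.xlsx")
documents = loader.load()  # default "single" mode: the workbook text in one Document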
```
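
The post-processing mentioned under Limitations could look like the sketch below. It assumes the loader was run in `"elements"` mode, so that a table's HTML is available under `metadata["text_as_html"]`, and it turns that HTML back into a list of rows using only the standard library. `TableRowParser`, `html_table_to_rows`, and `sample_html` are hypothetical names, and `sample_html` is a hand-written stand-in rather than real loader output.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell text of every <tr> in an HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    """Return the table as a list of rows, each row a list of strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hand-written stand-in for doc.metadata["text_as_html"] (an assumption, not real output)
sample_html = (
    "<table>"
    "<tr><td>ProductID</td><td>ProductName</td><td>Price</td></tr>"
    "<tr><td>1</td><td>Laptop</td><td>999.99</td></tr>"
    "</table>"
)
rows = html_table_to_rows(sample_html)
print(rows)
```

With real documents, the same function would be applied to `doc.metadata.get("text_as_html", "")` for each element that carries a table.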

---

## 2. Custom Excel Loader

- **Approach:**  
    Use pandas to read Excel files, process sheets and rows, and create tailored document objects.
- **Implementation:**  
    Read each sheet with pandas, iterate over rows, and construct `Document` objects with custom content and metadata.
- **Advantages:**  
    - Full control over which sheets and rows to process.
    - Can combine data from multiple sheets.
    - Enables advanced filtering, summaries, and metadata enrichment.
- **Limitations:**  
    - Requires more code and maintenance.
    - Row-by-row iteration with `iterrows()` becomes slow on large sheets.

**Example:**
```python
import pandas as pd
from langchain_core.documents import Document

excel_file = "data/excel_files/Products.xlsx"
xls = pd.ExcelFile(excel_file)  # open the workbook once and reuse it for every sheet
documents = []
for sheet_name in xls.sheet_names:
    df = pd.read_excel(xls, sheet_name=sheet_name)
    # One Document per row; row.get() falls back to '' when a sheet lacks the column
    for idx, row in df.iterrows():
        content = f"Sheet: {sheet_name}\nProduct: {row.get('ProductName', '')}\nPrice: {row.get('Price', '')}"
        metadata = {"sheet": sheet_name, "row_index": idx, "source": excel_file}
        documents.append(Document(page_content=content, metadata=metadata))
```
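
Combining data from multiple sheets, listed among the advantages above, might be sketched as follows. The example joins a `Summary` sheet and an `Inventory` sheet on `ProductID` and emits one document per product. The function name `merge_sheets_to_documents` and the metadata keys are hypothetical, the frames are built in memory to mimic what `pd.read_excel(path, sheet_name=None)` returns, and the small `Document` stand-in exists only so the sketch also runs where LangChain is not installed.

```python
from dataclasses import dataclass, field
from typing import Dict, List

import pandas as pd

try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two fields, used only when LangChain is unavailable
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def merge_sheets_to_documents(sheets: Dict[str, pd.DataFrame], source: str) -> List[Document]:
    """Join the Summary and Inventory sheets on ProductID, one Document per product."""
    merged = sheets["Summary"].merge(sheets["Inventory"], on="ProductID", how="inner")
    documents = []
    for record in merged.to_dict(orient="records"):
        # Render each merged row as "column: value" lines
        content = "\n".join(f"{column}: {value}" for column, value in record.items())
        metadata = {
            "source": source,
            "sheets": ["Summary", "Inventory"],
            "product_id": int(record["ProductID"]),
        }
        documents.append(Document(page_content=content, metadata=metadata))
    return documents


# In-memory frames shaped like the output of pd.read_excel(path, sheet_name=None)
sheets = {
    "Summary": pd.DataFrame({
        "ProductID": [1, 2, 3],
        "ProductName": ["Laptop", "Smartphone", "Tablet"],
        "Price": [999.99, 499.99, 299.99],
    }),
    "Inventory": pd.DataFrame({
        "ProductID": [1, 2, 3],
        "Stock": [50, 200, 150],
        "Rating": [4.5, 4.7, 4.3],
    }),
}
docs = merge_sheets_to_documents(sheets, source="data/excel_files/Products.xlsx")
print(docs[0].page_content)
```

An inner join keeps only products present in both sheets; `how="left"` would keep every `Summary` row even when inventory data is missing.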

---

## Summary Table

| Technique                | Multi-Sheet Support | Customization | Metadata Richness | Use Case                  |
|--------------------------|--------------------|---------------|-------------------|---------------------------|
| UnstructuredExcelLoader  | ✅                 | Low           | Basic             | Quick multi-sheet loading |
| Custom Loader (pandas)   | ✅                 | High          | Advanced          | Tailored document creation|

Choose the technique that best fits your Excel data complexity and processing needs.

In [5]:
from langchain_core.documents import Document
from typing import List

In [1]:
# Create sample product data and save it as an Excel workbook with multiple sheets
import os

import pandas as pd

data = {
    "ProductID": [1, 2, 3],
    "ProductName": ["Laptop", "Smartphone", "Tablet"],
    "Price": [999.99, 499.99, 299.99],
    "Stock": [50, 200, 150],
    "Rating": [4.5, 4.7, 4.3],
    "Description": [
        "A high-performance laptop suitable for all your computing needs.",
        "A sleek and powerful smartphone with cutting-edge features.",
        "A lightweight tablet perfect for browsing and media consumption."
    ]
}

df = pd.DataFrame(data)

# ExcelWriter does not create missing folders, so make sure the target directory exists
os.makedirs("data/excel_files", exist_ok=True)

with pd.ExcelWriter("data/excel_files/Products.xlsx") as writer:
    # Full table first, then two narrower views of the same products
    df.to_excel(writer, sheet_name="Products", index=False)
    df[['ProductID', 'ProductName', 'Price']].to_excel(writer, sheet_name="Summary", index=False)
    df[['ProductID', 'Stock', 'Rating']].to_excel(writer, sheet_name="Inventory", index=False)

In [6]:

# Custom Excel loader: one Document per sheet
def custom_excel_loader(filepath: str) -> List[Document]:
    """Load every sheet of an Excel workbook as a separate Document."""
    # Open the workbook once so every sheet is read from the same handle
    xls = pd.ExcelFile(filepath)
    documents = []
    for sheet_name in xls.sheet_names:
        df = pd.read_excel(xls, sheet_name=sheet_name)
        # Build a plain-text rendering of the sheet: header lines, then the table
        sheet_content = f"Sheet: {sheet_name}\n"
        # Column labels may not be strings (e.g. integers), so convert before joining
        sheet_content += f"Columns: {', '.join(map(str, df.columns))}\n"
        sheet_content += f"Rows: {len(df)}\n"
        sheet_content += df.to_string(index=False)

        document = Document(
            page_content=sheet_content,
            metadata={
                "source": filepath,
                "sheet_name": sheet_name,
                "num_rows": len(df),
                "num_columns": len(df.columns),
                "data_type": "excel_sheet"
            }
        )
        documents.append(document)
    return documents

In [None]:
excel_documents = custom_excel_loader("data/excel_files/Products.xlsx")
print(f"Loaded {len(excel_documents)} documents using custom_excel_loader.")
excel_documents

Loaded 3 documents using custom_excel_loader.


[Document(metadata={'source': 'data/excel_files/Products.xlsx', 'sheet_name': 'Products', 'num_rows': 3, 'num_columns': 6, 'data_type': 'excel_sheet'}, page_content='Sheet: Products\nColumns: ProductID, ProductName, Price, Stock, Rating, Description\nRows: 3\n ProductID ProductName  Price  Stock  Rating                                                      Description\n         1      Laptop 999.99     50     4.5 A high-performance laptop suitable for all your computing needs.\n         2  Smartphone 499.99    200     4.7      A sleek and powerful smartphone with cutting-edge features.\n         3      Tablet 299.99    150     4.3 A lightweight tablet perfect for browsing and media consumption.'),
 Document(metadata={'source': 'data/excel_files/Products.xlsx', 'sheet_name': 'Summary', 'num_rows': 3, 'num_columns': 3, 'data_type': 'excel_sheet'}, page_content='Sheet: Summary\nColumns: ProductID, ProductName, Price\nRows: 3\n ProductID ProductName  Price\n         1      Laptop 999.99\n  

In [None]:
from langchain_community.document_loaders import UnstructuredExcelLoader

try:
    excel_loader = UnstructuredExcelLoader(
        file_path="data/excel_files/Products.xlsx",
        # sheet_name="Products"
        mode="elements",  # one Document per detected element instead of a single Document
    )
    excel_documents = excel_loader.load()
    print(f"Loaded {len(excel_documents)} documents using UnstructuredExcelLoader.")
except Exception as e:
    # Missing optional dependencies of unstructured also surface here
    print(f"Error loading Excel file: {e}")


Error loading Excel file: No module named 'msoffcrypto'
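
The error above means the `msoffcrypto` module could not be imported. That module is provided by the `msoffcrypto-tool` package on PyPI, so installing it (or the Excel extras of `unstructured`, for example `pip install "unstructured[xlsx]"`) should resolve the failure. A small pre-flight check like the sketch below can report missing modules before the loader runs. The `REQUIRED_MODULES` list is an assumption about what the Excel path needs, and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Top-level import names assumed to be needed for Excel loading; adjust for your setup
REQUIRED_MODULES = ["unstructured", "openpyxl", "msoffcrypto"]


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print(f"Missing modules: {', '.join(missing)}")
else:
    print("All listed modules are importable.")
```

The check only confirms that each module can be located; it does not verify versions or that the import itself succeeds.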
