# Loading Excel Files in LangChain

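## Overview

This tutorial shows how to load and process ```Microsoft Excel``` files in ```LangChain```.

It focuses on two main approaches: ```UnstructuredExcelLoader``` for extracting raw text, and ```DataFrameLoader``` for working with structured data.

The guide aims to help developers integrate Excel data into LangChain projects effectively, covering both basic and advanced use cases.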

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [UnstructuredExcelLoader](#UnstructuredExcelLoader)
- [DataFrameLoader](#DataFrameLoader)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- ```langchain-opentutorial``` is a package that provides easy-to-use environment setup tools, helper functions, and utilities for tutorials.
- You can check out the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [None]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [None]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain_community",
        "unstructured",
        "openpyxl"
    ],
    verbose=False,
    upgrade=False,
)

## ```UnstructuredExcelLoader```

```UnstructuredExcelLoader``` loads ```Microsoft Excel``` files.

The loader works with both ```.xlsx``` and ```.xls``` files.

When you run the loader in ```mode="elements"```, an HTML representation of the Excel file is stored under the ```text_as_html``` key of the document metadata.

In [None]:
# install
# %pip install -qU langchain-community unstructured openpyxl

In [None]:
import sys
from langchain_community.document_loaders import UnstructuredExcelLoader

# Set recursion limit
sys.setrecursionlimit(10**6)    

# Create UnstructuredExcelLoader 
loader = UnstructuredExcelLoader("./data/titanic.xlsx", mode="elements")

# Load a document
docs = loader.load()

# Print the number of documents
print(len(docs))

This shows that one document was loaded successfully.

```page_content``` contains the data of each row, while ```text_as_html``` in ```metadata``` stores the entire data in HTML format.

In [None]:
# Print the document
print(docs[0].page_content[:200])

In [None]:
# Print the text_as_html of metadata
print(docs[0].metadata["text_as_html"][:1000])

![text_as_html](./assets/05-excel-loader-text-as-html.png)
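
Because ```text_as_html``` holds an HTML table, it can be turned back into rows of cell values when the tabular structure is needed downstream. The following is a minimal sketch that uses only Python's standard ```html.parser``` module. The names ```TableRowParser``` and ```html_table_to_rows``` are hypothetical helpers, and ```sample_html``` is an assumed stand-in for ```docs[0].metadata["text_as_html"]```; the markup produced by your installed version of ```unstructured``` may carry extra tags or attributes.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    """Return the table in `html` as a list of rows (lists of strings)."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample standing in for docs[0].metadata["text_as_html"]
sample_html = (
    "<table>"
    "<tr><td>PassengerId</td><td>Name</td><td>Age</td></tr>"
    "<tr><td>1</td><td>Braund, Mr. Owen Harris</td><td>22</td></tr>"
    "</table>"
)

table_rows = html_table_to_rows(sample_html)
print(table_rows[0])  # first row of the table
print(table_rows[1])  # second row of the table
```

In the notebook, you would pass ```docs[0].metadata["text_as_html"]``` to ```html_table_to_rows``` in place of the sample string.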

## ```DataFrameLoader```

- As with CSV files, you can load an Excel file by reading it into a ```pandas.DataFrame``` with ```read_excel()``` and then passing that DataFrame to ```DataFrameLoader```.

In [None]:
import pandas as pd

# read the Excel file
df = pd.read_excel("./data/titanic.xlsx")

In [None]:
from langchain_community.document_loaders import DataFrameLoader

# Set up DataFrame loader, specifying the page content column
loader = DataFrameLoader(df, page_content_column="Name")

# Load the document
docs = loader.load()

# Print the data
print(docs[0].page_content)

# Print the metadata
print(docs[0].metadata)
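
To make the split between ```page_content``` and ```metadata``` concrete, here is a plain-Python sketch of the row-to-document mapping that ```DataFrameLoader``` performs: the column named in ```page_content_column``` becomes the content, and the remaining columns become the metadata. This is an illustration of the observable behavior, not the library's code. The helper ```rows_to_documents``` and the sample rows are hypothetical, and plain dictionaries stand in for LangChain ```Document``` objects.

```python
def rows_to_documents(rows, page_content_column):
    """Split each row into page content and metadata.

    `rows` is a list of dicts, one per spreadsheet row. The value in
    `page_content_column` becomes the content; all other columns are
    kept as metadata.
    """
    documents = []
    for row in rows:
        metadata = dict(row)                      # copy so the input row is untouched
        text = metadata.pop(page_content_column)  # chosen column becomes the content
        documents.append({"page_content": text, "metadata": metadata})
    return documents


# Hypothetical rows standing in for the DataFrame read from titanic.xlsx
sample_rows = [
    {"PassengerId": 1, "Name": "Braund, Mr. Owen Harris", "Age": 22},
    {"PassengerId": 2, "Name": "Heikkinen, Miss. Laina", "Age": 26},
]

docs_sketch = rows_to_documents(sample_rows, page_content_column="Name")
print(docs_sketch[0]["page_content"])
print(docs_sketch[0]["metadata"])
```

Choosing a text-heavy column as ```page_content_column``` keeps the searchable text in the content, while the other columns stay available in the metadata for filtering.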