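# CSV Loader

## Overview

This tutorial is a guide to using LangChain's ```CSVLoader``` to bring data from CSV files into your application. ```CSVLoader``` handles structured data: it lets developers extract, parse, and use information from CSV files within the LangChain framework.

[Comma-separated values (CSV)](https://en.wikipedia.org/wiki/Comma-separated_values) is one of the most widely used formats for storing and exchanging data.

```CSVLoader``` simplifies loading, parsing, and extracting data from CSV files, so the results can be integrated into LangChain workflows with little effort.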

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [How to load CSVs](#how-to-load-csvs)
- [Customizing the CSV parsing and loading](#customizing-the-csv-parsing-and-loading)
- [Specify a column to identify the document source](#specify-a-column-to-identify-the-document-source)
- [Generating XML document format](#generating-xml-document-format)
- [UnstructuredCSVLoader](#unstructuredcsvloader)
- [DataFrameLoader](#dataframeloader)

### References

- [LangChain CSVLoader API](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.csv_loader.CSVLoader.html)
- [LangChain: How to load CSVs](https://python.langchain.com/docs/how_to/document_loader_csv)
- [LangChain DataFrameLoader API](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.dataframe.DataFrameLoader.html#dataframeloader)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- ```langchain-opentutorial``` is a package that provides easy-to-use environment setup, helper functions, and utilities for tutorials.
- See the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.
- ```unstructured``` is a Python library for extracting text and metadata from various document formats such as PDF and CSV.


In [48]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [49]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain_community",
        "unstructured"
    ],
    verbose=False,
    upgrade=False,
)

In [50]:
# Set environment variables
from langchain_opentutorial import set_env
from dotenv import load_dotenv

if not load_dotenv():
    set_env(
        {
            "OPENAI_API_KEY": "",
            "LANGCHAIN_API_KEY": "",
            "LANGCHAIN_TRACING_V2": "true",
            "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
            "LANGCHAIN_PROJECT": "04-CSV-Loader",
        }
    )

Alternatively, you can set ```OPENAI_API_KEY``` in a ```.env``` file and load it.

[Note] This is not necessary if you've already set ```OPENAI_API_KEY``` in the previous steps.

In [51]:
from dotenv import load_dotenv

load_dotenv()

True

## How to load CSVs

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. LangChain can help you load CSV files easily—just import ```CSVLoader``` to get started. 

Each line of the file is a data record, and each record consists of one or more fields, separated by commas. 

We use a sample CSV file for the example.

In [52]:
from langchain_community.document_loaders.csv_loader import CSVLoader

# Create CSVLoader instance
loader = CSVLoader(file_path="./data/titanic.csv")

# Load documents
docs = loader.load()

for record in docs[:2]:
    print(record)

page_content='PassengerId: 1
Survived: 0
Pclass: 3
Name: Braund, Mr. Owen Harris
Sex: male
Age: 22
SibSp: 1
Parch: 0
Ticket: A/5 21171
Fare: 7.25
Cabin: 
Embarked: S' metadata={'source': './data/titanic.csv', 'row': 0}
page_content='PassengerId: 2
Survived: 1
Pclass: 1
Name: Cumings, Mrs. John Bradley (Florence Briggs Thayer)
Sex: female
Age: 38
SibSp: 1
Parch: 0
Ticket: PC 17599
Fare: 71.2833
Cabin: C85
Embarked: C' metadata={'source': './data/titanic.csv', 'row': 1}


In [53]:
print(docs[1].page_content)

PassengerId: 2
Survived: 1
Pclass: 1
Name: Cumings, Mrs. John Bradley (Florence Briggs Thayer)
Sex: female
Age: 38
SibSp: 1
Parch: 0
Ticket: PC 17599
Fare: 71.2833
Cabin: C85
Embarked: C
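
The ```column: value``` layout above can be reproduced with the standard library alone. The following is a minimal sketch of the idea, not the actual implementation of ```CSVLoader```; the helper name ```rows_to_page_content``` and the two-row inline sample are assumptions made for illustration.

```python
import csv
import io

# Hypothetical two-row sample mirroring a few of the Titanic columns.
SAMPLE = (
    "PassengerId,Survived,Name\n"
    '1,0,"Braund, Mr. Owen Harris"\n'
    '2,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)"\n'
)


def rows_to_page_content(text):
    """Return one 'column: value' string per CSV record (illustrative helper)."""
    reader = csv.DictReader(io.StringIO(text))
    return ["\n".join(f"{key}: {value}" for key, value in row.items()) for row in reader]


pages = rows_to_page_content(SAMPLE)
print(pages[1])
```

Note that the quoted comma inside the ```Name``` field is kept as part of the value rather than being treated as a separator.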


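## Customizing the CSV parsing and loading

```CSVLoader``` accepts a ```csv_args``` keyword argument that customizes the arguments passed to Python's built-in ```csv.DictReader```. This lets you handle CSV files in different formats, such as custom delimiters, quote characters, or specific newline handling.

See the Python [csv module documentation](https://docs.python.org/3/library/csv.html) for the supported ```csv_args``` options and how to adjust parsing to your needs.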

In [None]:
# Import the CSVLoader class from langchain_community.document_loaders
from langchain_community.document_loaders import CSVLoader

# Initialize the CSVLoader
# CSVLoader loads data from a CSV file and converts it into LangChain Document objects
loader = CSVLoader(
    # file_path: path of the CSV file to load
    file_path="./data/titanic.csv",  # assumes titanic.csv is in the ./data/ folder
    # csv_args: a dict of extra arguments passed to Python's built-in csv.DictReader
    csv_args={
        # "delimiter": the character separating fields; the default is a comma
        "delimiter": ",",
        # "quotechar": the character that quotes field contents; the default is a double quote
        "quotechar": '"',
        # "fieldnames": a list of field names.
        # CSVLoader uses these names as the labels for each row in the Document's page_content.
        # Useful for CSV files without a header row, or when you want to replace the existing header.
        "fieldnames": [
            "Passenger ID",
            "Survival (1: Survived, 0: Died)",
            "Passenger Class",
            "Name",
            "Sex",
            "Age",
            "Number of Siblings/Spouses Aboard",
            "Number of Parents/Children Aboard",
            "Ticket Number",
            "Fare",
            "Cabin",
            "Port of Embarkation",
        ],
    },
)

# Call load() to read and process the CSV file
# load() returns a list of Document objects, one per row of the CSV file
docs = loader.load()

# Print the page_content of the second Document (index 1)
# Because fieldnames is supplied, csv.DictReader treats the file's own header row as data,
# so docs[0] holds the original header and docs[1] is the first passenger:
# "Passenger ID: 1\nSurvival (1: Survived, 0: Died): 0\n..."
print(docs[1].page_content)

Passenger ID: 1
Survival (1: Survived, 0: Died): 0
Passenger Class: 3
Name: Braund, Mr. Owen Harris
Sex: male
Age: 22
Number of Siblings/Spouses Aboard: 1
Number of Parents/Children Aboard: 0
Ticket Number: A/5 21171
Fare: 7.25
Cabin: 
Port of Embarkation: S
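
The output starts at passenger 1 because of how ```csv.DictReader``` handles ```fieldnames```: when names are supplied, the first line of the file is no longer consumed as a header and is returned as an ordinary record. A small standard-library sketch of that behavior follows; the semicolon-delimited inline sample is an assumption chosen to show ```delimiter``` and ```quotechar``` at the same time.

```python
import csv
import io

# Hypothetical semicolon-delimited sample with its own header line.
SAMPLE = "PassengerId;Name\n1;'Braund; Mr. Owen Harris'\n"

# The same kind of dict that CSVLoader accepts as csv_args.
csv_args = {
    "delimiter": ";",
    "quotechar": "'",
    "fieldnames": ["Passenger ID", "Name"],
}

records = list(csv.DictReader(io.StringIO(SAMPLE), **csv_args))

# records[0] is the original header line, now treated as data.
print(records[0])
# records[1] is the first real record; the quoted ';' stays inside the value.
print(records[1])
```

If the goal is only to rename columns, it may be worth skipping the first document (for example with ```docs[1:]```) so the original header is not processed as a record.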


## Specify a column to identify the document source

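Use the ```source_column``` argument to specify the column that identifies the source of the document created from each row. Otherwise, the ```file_path``` of the CSV file is used as the source for all documents.

This is particularly useful when the documents loaded from CSV files are used in chains that answer questions using sources.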

In [55]:
loader = CSVLoader(
    file_path="./data/titanic.csv",
    source_column="PassengerId",  # Specify the source column
)

docs = loader.load()  

print(docs[1])
print(docs[1].metadata)

page_content='PassengerId: 2
Survived: 1
Pclass: 1
Name: Cumings, Mrs. John Bradley (Florence Briggs Thayer)
Sex: female
Age: 38
SibSp: 1
Parch: 0
Ticket: PC 17599
Fare: 71.2833
Cabin: C85
Embarked: C' metadata={'source': '2', 'row': 1}
{'source': '2', 'row': 1}
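
The effect of ```source_column``` on the metadata can be summarized in a few lines. This is a sketch of the behavior shown above, not the library's code; ```build_metadata``` is a hypothetical helper and a plain dict stands in for a parsed CSV row.

```python
def build_metadata(row, index, file_path, source_column=None):
    """Pick the document source: a column value if requested, else the file path."""
    source = row[source_column] if source_column is not None else file_path
    return {"source": source, "row": index}


row = {"PassengerId": "2", "Name": "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"}

# Without source_column, every document points at the file.
default_meta = build_metadata(row, 1, "./data/titanic.csv")
# With source_column, each document points at its own identifier.
column_meta = build_metadata(row, 1, "./data/titanic.csv", source_column="PassengerId")

print(default_meta)
print(column_meta)
```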


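## Generating XML document format

This example shows how to convert data loaded with ```CSVLoader``` into an XML document format.  
Each row and its columns can be turned into a structured XML representation, which is convenient for later use or for integration with other systems.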

Convert a row in the document.

In [56]:
row = docs[1].page_content.split("\n")  # split by new line
row_str = "<row>"
for element in row:
    splitted_element = element.split(":")  # split by ":"
    value = splitted_element[-1]  # get value
    col = ":".join(splitted_element[:-1])  # get column name

    row_str += f"<{col}>{value.strip()}</{col}>"
row_str += "</row>"
print(row_str)

<row><PassengerId>2</PassengerId><Survived>1</Survived><Pclass>1</Pclass><Name>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</Name><Sex>female</Sex><Age>38</Age><SibSp>1</SibSp><Parch>0</Parch><Ticket>PC 17599</Ticket><Fare>71.2833</Fare><Cabin>C85</Cabin><Embarked>C</Embarked></row>


Convert multiple rows in the document.

In [57]:
for doc in docs[1:6]:  # rows 2-6; docs[0] is the first passenger, not a header row
    row = doc.page_content.split("\n")
    row_str = "<row>"
    for element in row:
        splitted_element = element.split(":")  # split by ":"
        value = splitted_element[-1]  # get value
        col = ":".join(splitted_element[:-1])  # get column name
        row_str += f"<{col}>{value.strip()}</{col}>"
    row_str += "</row>"
    print(row_str)

<row><PassengerId>2</PassengerId><Survived>1</Survived><Pclass>1</Pclass><Name>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</Name><Sex>female</Sex><Age>38</Age><SibSp>1</SibSp><Parch>0</Parch><Ticket>PC 17599</Ticket><Fare>71.2833</Fare><Cabin>C85</Cabin><Embarked>C</Embarked></row>
<row><PassengerId>3</PassengerId><Survived>1</Survived><Pclass>3</Pclass><Name>Heikkinen, Miss. Laina</Name><Sex>female</Sex><Age>26</Age><SibSp>0</SibSp><Parch>0</Parch><Ticket>STON/O2. 3101282</Ticket><Fare>7.925</Fare><Cabin></Cabin><Embarked>S</Embarked></row>
<row><PassengerId>4</PassengerId><Survived>1</Survived><Pclass>1</Pclass><Name>Futrelle, Mrs. Jacques Heath (Lily May Peel)</Name><Sex>female</Sex><Age>35</Age><SibSp>1</SibSp><Parch>0</Parch><Ticket>113803</Ticket><Fare>53.1</Fare><Cabin>C123</Cabin><Embarked>S</Embarked></row>
<row><PassengerId>5</PassengerId><Survived>0</Survived><Pclass>3</Pclass><Name>Allen, Mr. William Henry</Name><Sex>male</Sex><Age>35</Age><SibSp>0</SibSp><Parch>0</
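
The string concatenation above produces invalid XML if a value contains characters such as ```&``` or ```<```, or if a column name is not a valid tag name. One possible variant, sketched with the standard library's ```xml.etree.ElementTree```, escapes values automatically. The helper ```page_content_to_xml``` and its tag-sanitizing rule are assumptions for illustration, and because it splits each line on the first colon it expects column names that contain no colon.

```python
import re
import xml.etree.ElementTree as ET


def page_content_to_xml(page_content):
    """Convert 'column: value' lines into a <row> element (illustrative helper)."""
    root = ET.Element("row")
    for line in page_content.split("\n"):
        if not line.strip():
            continue  # ignore blank lines
        col, _, value = line.partition(":")
        # Replace characters that are not allowed in XML tag names.
        tag = re.sub(r"[^\w.-]", "_", col.strip())
        ET.SubElement(root, tag).text = value.strip()
    return ET.tostring(root, encoding="unicode")


sample = "PassengerId: 2\nName: Smith & Sons\nCabin: "
xml_row = page_content_to_xml(sample)
print(xml_row)
```

Parsing the result back with ```ET.fromstring``` is a quick way to confirm that the generated row is well-formed.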

## UnstructuredCSVLoader 

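```UnstructuredCSVLoader``` can be used in either ```single``` or ```elements``` mode.

In ```elements``` mode, the entire CSV file is treated as a single Unstructured Table element.  
In this mode, an HTML representation of the table is stored under the ```text_as_html``` key in the document metadata.

This gives you access to the HTML form of the data whenever you need to preserve the table structure.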

In [58]:
from langchain_community.document_loaders.csv_loader import UnstructuredCSVLoader

# Generate UnstructuredCSVLoader instance with elements mode
loader = UnstructuredCSVLoader(file_path="./data/titanic.csv", mode="elements")

docs = loader.load()

html_content = docs[0].metadata["text_as_html"]

# Partial output due to space constraints
print(html_content[:810]) 

<table><tr><td>PassengerId</td><td>Survived</td><td>Pclass</td><td>Name</td><td>Sex</td><td>Age</td><td>SibSp</td><td>Parch</td><td>Ticket</td><td>Fare</td><td>Cabin</td><td>Embarked</td></tr><tr><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.25</td><td/><td>S</td></tr><tr><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</td><td>female</td><td>38</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr><tr><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.925</td><td/><td>S</td></tr><tr><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>
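
When the table structure is needed as data rather than markup, the ```text_as_html``` string can be parsed back into rows. The following is a minimal sketch using the standard library's ```html.parser```; the class name ```TableRows``` and the short inline HTML (shaped like the output above) are assumptions for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self.rows:
            self.rows[-1].append("".join(self._cell))
            self._cell = None


# Hypothetical snippet; an empty cell appears as <td/> in the real output.
html_sample = (
    "<table><tr><td>PassengerId</td><td>Cabin</td></tr>"
    "<tr><td>1</td><td/></tr></table>"
)

parser = TableRows()
parser.feed(html_sample)
print(parser.rows)
```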


## DataFrameLoader

```Pandas``` is an open-source data analysis and manipulation tool for the Python programming language. This library is widely used in data science, machine learning, and various fields for working with data.

LangChain's ```DataFrameLoader``` loads a ```Pandas``` ```DataFrame``` into a LangChain workflow, producing one document per row.

In [59]:
import pandas as pd

df = pd.read_csv("./data/titanic.csv")

Inspect the first 5 rows.

In [60]:
df.head(n=5)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


```DataFrameLoader``` takes a ```page_content_column``` parameter (str): the name of the column that holds the page content. It defaults to "text".



In [61]:
from langchain_community.document_loaders import DataFrameLoader

# The Name column of the DataFrame is specified to be used as the content of each document.
loader = DataFrameLoader(df, page_content_column="Name")

docs = loader.load()

print(docs[0].page_content)


Braund, Mr. Owen Harris
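
The split between content and metadata that ```page_content_column``` controls can be illustrated without pandas. In this sketch a plain dict stands in for one DataFrame row, and ```split_record``` is a hypothetical helper, not part of LangChain.

```python
def split_record(record, page_content_column="text"):
    """Return (page_content, metadata): one column is the content, the rest is metadata."""
    if page_content_column not in record:
        raise KeyError(f"column {page_content_column!r} not found")
    metadata = {key: value for key, value in record.items() if key != page_content_column}
    return record[page_content_column], metadata


record = {"PassengerId": 1, "Survived": 0, "Name": "Braund, Mr. Owen Harris"}
content, metadata = split_record(record, page_content_column="Name")

print(content)
print(metadata)
```

Here the chosen column becomes the document text and every other column ends up in the metadata, which is why a column other than the default "text" has to be named for the Titanic data.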


Use lazy loading for large tables: ```lazy_load()``` yields documents one at a time instead of building the full list of documents in memory.

In [62]:
# Lazy load records from dataframe.
for row in loader.lazy_load():
    print(row)
    break  # print only the first row


page_content='Braund, Mr. Owen Harris' metadata={'PassengerId': 1, 'Survived': 0, 'Pclass': 3, 'Sex': 'male', 'Age': 22.0, 'SibSp': 1, 'Parch': 0, 'Ticket': 'A/5 21171', 'Fare': 7.25, 'Cabin': nan, 'Embarked': 'S'}
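
The difference between eager and lazy loading can be sketched with a plain generator. This is an analogy built on the standard library rather than ```DataFrameLoader``` itself; ```lazy_records``` and the inline sample are assumptions for illustration.

```python
import csv
import io
from itertools import islice

# Hypothetical three-record sample.
SAMPLE = "PassengerId,Name\n1,Braund\n2,Cumings\n3,Heikkinen\n"


def lazy_records(handle):
    """Yield one record at a time instead of building a full list."""
    for row in csv.DictReader(handle):
        yield row


records = lazy_records(io.StringIO(SAMPLE))

# Only the first two records are parsed here; the third is not read until requested.
first_two = list(islice(records, 2))
print(first_two)
```

The same pattern applies to any loop over ```lazy_load()```: processing can stop early, as the ```break``` in the cell above does, without paying for the remaining rows.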
