![](example_content/undatasio_example.png)

## Generating LangChain Document objects with the Undatasio platform's Python client

The Undatasio platform processes unstructured data and can parse various file types, including PDF, Word, Excel, video, and audio. It outputs the parsed content in readily processable text formats such as JSON, Markdown, and TXT.

- - -

## The complete process for creating LangChain Document objects with Undatasio

## Installing the **Undatasio** Python API library

In [1]:
# install undatasio
!pip install -U undatasio



## To create an **UnDatasIO** object, you need a token from the Undatasio platform and, optionally, a task name.

In [1]:
from undatasio.undatasio import UnDatasIO

undatasio_obj = UnDatasIO('025ae1da7598456daa802fef7873e31b')
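A token pasted directly into a notebook is easy to leak when the notebook is shared. As a minimal sketch, the token could instead be read from an environment variable. The variable name `UNDATASIO_TOKEN` and the helper `load_token` are assumptions made for illustration; neither is part of the undatasio library.

```python
import os


def load_token(env_var: str = "UNDATASIO_TOKEN") -> str:
    """Read the platform token from an environment variable.

    The variable name is a hypothetical convention, not something
    the undatasio library defines.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Undatasio token."
        )
    return token


# Usage (requires undatasio to be installed and the variable to be set):
# from undatasio.undatasio import UnDatasIO
# undatasio_obj = UnDatasIO(load_token())
```
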

## The **show_version** function of the Undatasio object displays all version information and file lists for the current token's task name.

In [2]:
version_data = undatasio_obj.show_version()
version_data

Response(code=200, msg='success', data=      title version  count                                          file_name
0   1 files     v26      1                        [1d8c9bc374114b6e901da.pdf]
1   1 files     v25      1                     [页面提取自－VAR：风险价值—金融风险管理新标准.pdf]
2   1 files     v24      1                                          [100.pdf]
3   1 files     v23      1                                          [100.pdf]
4   2 files     v22      2   [棉花标准仓单销售合同.pdf, 调整组合结构应对短期波折基金每日资讯20150608.pdf]
5   3 files     v21      3  [高杠杆分级基金估值风险上升基金周报20110225.pdf, 风险化解紧迫信托保障基金成立...
6   1 files     v20      1                                    [借款合同-借款合同.pdf]
7   1 files     v19      1                                    [借款合同-借款合同.pdf]
8   1 files     v18      1                                    [借款合同-借款合同.pdf]
9    专业合同解析     v17      2   [调整组合结构应对短期波折基金每日资讯20150608.pdf, 棉花标准仓单销售合同.pdf]
10  1 files     v16      1                                   [漩涡中的信用衍生产品.pdf]
11  1 files     v15      
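The same file name can appear under several versions (for example `100.pdf` under both `v23` and `v24`), so it can help to pick the version programmatically. The sketch below assumes the table has been converted to a list of plain dictionaries with `version` and `file_name` keys (for a pandas DataFrame, `df.to_dict('records')` would produce that shape). The helper `latest_version_for` is hypothetical and not part of undatasio.

```python
def latest_version_for(rows, file_name):
    """Return the highest 'vN' version whose file list contains file_name, or None."""
    matches = [row["version"] for row in rows if file_name in row["file_name"]]
    if not matches:
        return None
    # Compare numerically: as plain strings, "v9" would sort after "v10".
    return max(matches, key=lambda v: int(v.lstrip("v")))


# A few rows copied from the output above.
rows = [
    {"version": "v26", "file_name": ["1d8c9bc374114b6e901da.pdf"]},
    {"version": "v24", "file_name": ["100.pdf"]},
    {"version": "v23", "file_name": ["100.pdf"]},
]

print(latest_version_for(rows, "100.pdf"))  # v24
```
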

## The **get_result_to_langchain_document** function of the Undatasio object returns a LangChain Document object. The values for its parameters, such as the file name and version, can be taken from the data returned by the **show_version** function.

In [3]:
lc_document = undatasio_obj.get_result_to_langchain_document(
    type_info=['text'],
    file_name='1d8c9bc374114b6e901da.pdf',
    version='v26'
)
lc_document

ValidationError: 1 validation error for Document
page_content
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.9/v/string_type
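The traceback shows that `page_content` arrived as `None`, while a LangChain `Document` requires a string. One possible reason is that no text was returned for this combination of file name, version, and `type_info`, though the output alone does not confirm that. As a minimal sketch, a guard like the one below could turn the failure into a clearer message wherever you build documents from raw text yourself. `require_text` is a hypothetical helper, not part of undatasio or LangChain.

```python
def require_text(page_content, file_name, version):
    """Return page_content if it is a non-empty string; otherwise raise ValueError."""
    if not isinstance(page_content, str) or not page_content.strip():
        raise ValueError(
            f"No text returned for {file_name!r} at version {version!r}; "
            "check the file name, version, and type_info values."
        )
    return page_content


# Usage: validate the text before passing it on as page_content.
text = require_text("Some parsed text", "1d8c9bc374114b6e901da.pdf", "v26")
```
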

## Use **RecursiveCharacterTextSplitter** from **langchain_text_splitters** to split the text returned by the **get_result_to_langchain_document** function.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.split_documents([lc_document.data])
texts
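To make the `chunk_size` and `chunk_overlap` settings concrete, here is a deliberately simplified sliding-window splitter in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break on separators such as paragraph and line breaks, so its chunks will generally differ. The function `sliding_chunks` is hypothetical and only illustrates how consecutive chunks can share characters.

```python
def sliding_chunks(text, chunk_size=100, chunk_overlap=20):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(250))
chunks = sliding_chunks(sample, chunk_size=100, chunk_overlap=20)

# The last 20 characters of each chunk reappear at the start of the next one.
print(chunks[0][-20:] == chunks[1][:20])
```

With the settings used above, each chunk holds at most 100 characters and repeats the final 20 characters of its predecessor, which helps keep sentences that straddle a boundary retrievable from either chunk.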