# Preprocess
[Preprocess](https://preprocess.co) is an API service that splits any kind of document into optimal chunks of text for use in language model tasks.

Given input documents, `Preprocess` splits them into chunks of text that respect the layout and semantics of the original document.
The splitting takes into account sections, paragraphs, lists, images, data tables, text tables, and slides, and follows the content semantics for long texts.

Preprocess supports:
- PDFs
- Microsoft Office documents (Word, PowerPoint, Excel)
- OpenOffice documents (ods, odt, odp)
- HTML content (web pages, articles, emails)
- plain text.

`PreprocessLoader` interacts with the `Preprocess API library` to provide document conversion and chunking, or to load already chunked files, inside LangChain.

## Requirements
Install the `Python Preprocess library` if it is not already present:

In [None]:
# Install Preprocess Python SDK package
pip install pypreprocess

## Usage

To use `PreprocessLoader`, pass your Preprocess `API Key` when initializing it. If you don't have one yet, please ask for one at [support@preprocess.co](mailto:support@preprocess.co). Without an `API Key`, the loader will raise an error.
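
Since the loader fails without a key, it can help to check for the key before constructing it, and to keep the key out of the notebook itself. The following is a minimal sketch, not part of the Preprocess library: the helper name `read_preprocess_api_key` and the environment variable name `PREPROCESS_API_KEY` are assumptions chosen for illustration.

```python
import os


def read_preprocess_api_key(env_var: str = "PREPROCESS_API_KEY") -> str:
    """Return the API key stored in env_var, or raise if it is missing or empty.

    Both the helper and the default variable name are hypothetical;
    use whatever secret storage your project already relies on.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Preprocess API key."
        )
    return api_key
```

The returned value can then be passed as the `api_key` argument when initializing the loader, instead of a hard-coded string.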

To chunk a file, pass a valid filepath and the loader will start converting and chunking it.
`Preprocess` will chunk your files by applying an internal splitter. For this reason, you shouldn't apply `TextSplitter`s to `Document`s returned by the `load()` method.


In [None]:
from langchain.document_loaders import PreprocessLoader

In [None]:
loader = PreprocessLoader("example_data/layout-parser-paper.pdf", api_key="PREPROCESS_API_KEY")

In [None]:
data = loader.load()


If you want to return only the extracted text and handle it with custom splitters, set `return_whole_document = True`.

In [None]:
data = loader.load(return_whole_document=True)
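
As an illustration of what a custom splitter applied to the whole-document text could look like, here is a minimal pure-Python sketch. The function `split_paragraphs` and the sample text are hypothetical and are not part of Preprocess or LangChain; in practice the input would be the text of each returned `Document` (its `page_content`).

```python
def split_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Chunks stay within max_chars where possible; a single paragraph longer
    than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty fragments left by repeated blank lines
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


# Stand-in text; replace with the text extracted by the loader.
sample = "First paragraph.\n\nSecond paragraph.\n\nThird."
chunks = split_paragraphs(sample, max_chars=40)
```

Any LangChain `TextSplitter` could be used in place of this function; the sketch only shows the kind of post-processing that `return_whole_document` leaves to the caller.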

If you want to load already chunked files, pass their `process_id` to the loader.

In [None]:
loader = PreprocessLoader(process_id="PROCESS_ID", api_key="PREPROCESS_API_KEY")

## Other info

`PreprocessLoader` is based on the [pypreprocess](https://github.com/preprocess-co/pypreprocess) library from Preprocess.
For more information or other integration needs, please check the [documentation](https://github.com/preprocess-co/pypreprocess).