# ZeroxPDFLoader

`ZeroxPDFLoader` is a document loader that leverages the [Zerox](https://github.com/getomni-ai/zerox) library. Zerox converts each page of a PDF into an image, processes the images with a vision-capable language model, and generates a structured Markdown representation. Zerox processes pages asynchronously, and the loader returns one document per page.

## Installation
To use `ZeroxPDFLoader`, install the Zerox Python package, which is published on PyPI as `py-zerox`:

```bash
pip install py-zerox
```


## Usage

`ZeroxPDFLoader` enables PDF text extraction using vision-capable language models by converting each page into an image and processing it asynchronously. To use this loader, you need to specify a model and configure any necessary environment variables for Zerox, such as API keys.

If you're working in an environment like Jupyter Notebook, you may need to handle asynchronous code by using `nest_asyncio`. You can set this up as follows:

```python
import nest_asyncio
nest_asyncio.apply()
```
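
As a minimal sketch of the setup above, the helper below applies `nest_asyncio` only when an event loop is already running, which is the situation in Jupyter. `event_loop_is_running` and `patch_event_loop_if_needed` are hypothetical helper names, not part of LangChain or Zerox.

```python
import asyncio


def event_loop_is_running() -> bool:
    """Return True if the current thread already has a running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running
        return False
    return True


def patch_event_loop_if_needed() -> bool:
    """Apply nest_asyncio only when a loop is already running; return whether it was applied."""
    if not event_loop_is_running():
        return False
    import nest_asyncio  # imported lazily so plain scripts do not need it installed

    nest_asyncio.apply()
    return True
```

In a plain Python script `patch_event_loop_if_needed()` does nothing and returns `False`, so the same code can be shared between scripts and notebooks.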


```python
import os

from langchain_community.document_loaders.pdf import ZeroxPDFLoader

# Specify the file path for the PDF you want to process
file_path = "sample_document.pdf"

# Set up the environment variables required by your vision model
os.environ["OPENAI_API_KEY"] = ""  # your-api-key

# Initialize ZeroxPDFLoader with the desired model
loader = ZeroxPDFLoader(file_path=file_path, model="gpt-4o-mini")

# Load every page into a list of Document objects
documents = loader.load()
```

## API Reference

### `ZeroxPDFLoader`

This loader class initializes with a file path and model type, and supports custom configurations via `zerox_kwargs` for handling Zerox-specific parameters.

**Arguments**:
- `file_path` (Union[str, Path]): Path to the PDF file.
- `model` (str): Vision-capable model to use for processing, typically in the format `<provider>/<model>`. Some models, such as `gpt-4o-mini`, are given without a provider prefix. Defaults to `"gpt-4o-mini"`.
  Some examples of valid values are:
  - `model = "gpt-4o-mini"` (OpenAI model)
  - `model = "azure/gpt-4o-mini"`
  - `model = "gemini/<gemini_model>"`
  - `model = "claude-3-opus-20240229"`
  - `model = "vertex_ai/gemini-1.5-flash-001"`
  - See the [Zerox documentation](https://github.com/getomni-ai/zerox) for more details.
- `**zerox_kwargs` (dict): Additional Zerox-specific parameters such as API key, endpoint, etc.
  - See [Zerox documentation](https://github.com/getomni-ai/zerox)
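
To illustrate the `model` format, the sketch below splits an identifier into its optional provider prefix and model name, and shows how extra keyword arguments would be forwarded to the loader. `split_model_id` and `make_loader` are hypothetical helpers, and the `concurrency` and `maintain_format` options in the closing comment are assumed from the Zerox README; check them against the Zerox version you have installed.

```python
from typing import Optional, Tuple


def split_model_id(model: str) -> Tuple[Optional[str], str]:
    """Split "<provider>/<model>" on the first slash; provider is None when there is no prefix."""
    provider, sep, name = model.partition("/")
    if not sep:
        return None, model
    return provider, name


def make_loader(file_path: str, model: str = "gpt-4o-mini", **zerox_kwargs):
    """Build a ZeroxPDFLoader, passing any extra keyword arguments through as zerox_kwargs."""
    # Imported lazily so split_model_id can be used without LangChain installed
    from langchain_community.document_loaders.pdf import ZeroxPDFLoader

    return ZeroxPDFLoader(file_path=file_path, model=model, **zerox_kwargs)


print(split_model_id("azure/gpt-4o-mini"))
print(split_model_id("gpt-4o-mini"))

# Assumed Zerox options, shown for illustration only:
# loader = make_loader("sample_document.pdf", model="azure/gpt-4o-mini",
#                      concurrency=5, maintain_format=True)
```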

**Methods**:
- `lazy_load`: Generates an iterator of `Document` instances, each representing a page of the PDF, along with metadata including page number and source.

See the [full API documentation](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.ZeroxPDFLoader.html) for details.
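
Because `lazy_load` yields one `Document` per page, pages can be handled as they arrive rather than collected first. The sketch below uses a hypothetical helper, `page_previews`, that works on any iterable of objects with `page_content` and `metadata` attributes. The metadata key names `page` and `source` are assumptions based on the description above, so they are read with `.get`.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Iterator


def page_previews(docs: Iterable[Any], width: int = 40) -> Iterator[str]:
    """Yield a one-line preview per page without building a full list first."""
    for doc in docs:
        page = doc.metadata.get("page", "?")      # assumed key name
        source = doc.metadata.get("source", "?")  # assumed key name
        text = " ".join(doc.page_content.split())  # collapse newlines and extra spaces
        yield f"{source} [page {page}]: {text[:width]}"


# Stand-in objects so the sketch runs without an API key;
# with a real loader, pass loader.lazy_load() instead.
fake_pages = [
    SimpleNamespace(
        page_content="# Title\n\nFirst page.",
        metadata={"page": 1, "source": "sample_document.pdf"},
    ),
    SimpleNamespace(
        page_content="Second page.",
        metadata={"page": 2, "source": "sample_document.pdf"},
    ),
]
for line in page_previews(fake_pages):
    print(line)
```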

## Notes
- **Model Compatibility**: Zerox supports a range of vision-capable models. Refer to [Zerox's GitHub documentation](https://github.com/getomni-ai/zerox) for a list of supported models and configuration details.
- **Environment Variables**: Make sure to set the environment variables your provider requires, such as `OPENAI_API_KEY` or endpoint details, as specified in the Zerox documentation.
- **Asynchronous Processing**: If you encounter errors related to event loops in Jupyter Notebooks, you may need to apply `nest_asyncio` as shown in the Usage section.


## Troubleshooting
- **RuntimeError: This event loop is already running**: Use `nest_asyncio.apply()` to prevent asynchronous loop conflicts in environments like Jupyter.
- **Configuration Errors**: Verify that the `zerox_kwargs` match the expected arguments for your chosen model and that all necessary environment variables are set.
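
One way to catch missing configuration before calling the loader is a small pre-flight check. The sketch below is hypothetical: `missing_env_vars` is not part of LangChain or Zerox, and the list of required variable names is something you supply yourself based on the Zerox documentation for your provider.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    required: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names in `required` that are unset or blank in `env` (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# OPENAI_API_KEY is the variable used in the usage example; other providers need other names.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Set these environment variables before loading:", ", ".join(missing))
```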


## Additional Resources
- **Zerox Documentation**: [Zerox GitHub Repository](https://github.com/getomni-ai/zerox)
- **LangChain Document Loaders**: [LangChain Documentation](https://python.langchain.com/docs/integrations/document_loaders/)