# CrateDB Document Loader

> [CrateDB] is capable of performing both vector and lexical search.
> It is built on top of the Apache Lucene library, talks SQL,
> is PostgreSQL-compatible, and scales like Elasticsearch.

This notebook covers how to get started with the CrateDB document loader.

The CrateDB document loader is based on [SQLAlchemy] and uses LangChain's
`SQLDatabaseLoader`. It runs a database query and returns the result as one
document per row.

[CrateDB]: https://github.com/crate/crate
[SQLAlchemy]: https://www.sqlalchemy.org/

## Overview

The `CrateDBLoader` class helps you get your unstructured content from CrateDB
into LangChain's `Document` format.

You must provide an SQLAlchemy-compatible connection string and an SQL query
whose result set you want to load.
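
To make the "one document per row" idea concrete, here is a minimal sketch that does not need a running CrateDB instance. It uses Python's built-in `sqlite3` module in place of CrateDB, and a hypothetical `Doc` dataclass in place of LangChain's `Document`. The `load_rows` helper and the "name: value" text layout are assumptions for illustration, not the loader's actual implementation.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document class."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def load_rows(conn: sqlite3.Connection, query: str) -> list[Doc]:
    """Run the query and build one Doc per result row."""
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    docs = []
    for row in cursor:
        record = dict(zip(columns, row))
        # Assumption: render each column as "name: value", one per line.
        text = "\n".join(f"{key}: {value}" for key, value in record.items())
        docs.append(Doc(page_content=text))
    return docs


# In-memory SQLite database standing in for CrateDB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE summits (mountain TEXT, height INTEGER)")
conn.executemany(
    "INSERT INTO summits VALUES (?, ?)",
    [("Mont Blanc", 4808), ("Monte Rosa", 4634)],
)

docs = load_rows(conn, "SELECT * FROM summits ORDER BY height DESC")
print(len(docs))
print(docs[0].page_content)
```

With the real loader, the connection string selects the database and the query decides which rows, and therefore which documents, you get back.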

### Integration details

| Class | Package | Local | Serializable | JS support |
| :--- | :--- | :---: | :---: | :---: |
| [CrateDBLoader](https://python.langchain.com/api_reference/cratedb/document_loaders/langchain_cratedb.document_loaders.cratedb.CrateDBLoader.html) | [langchain_cratedb](https://python.langchain.com/api_reference/cratedb/index.html) | ✅ | ❌ | ❌ |

### Loader features

| Source | Document Lazy Loading | Async Support |
| :---: | :---: | :---: |
| CrateDBLoader | ✅ | ❌ |

## Setup

You can run CrateDB Community Edition on your premises, or you can use CrateDB Cloud.

### Credentials

You will supply credentials through a regular SQLAlchemy connection string, like
`crate://username:password@cratedb.example.org/`.
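
If the user name or password contains characters such as `@` or `/`, they need to be percent-encoded before they go into the URL. The following sketch shows one way to do that with the standard library. The `build_cratedb_url` helper is hypothetical, and the explicit port `4200` (CrateDB's default HTTP port) is an assumption about your setup.

```python
from urllib.parse import quote


def build_cratedb_url(user: str, password: str, host: str, port: int = 4200) -> str:
    """Assemble a crate:// connection string with percent-encoded credentials."""
    # safe="" makes quote() also encode "/" so it cannot be read as a path separator.
    encoded_user = quote(user, safe="")
    encoded_password = quote(password, safe="")
    return f"crate://{encoded_user}:{encoded_password}@{host}:{port}/"


url = build_cratedb_url("admin", "p@ss/word", "cratedb.example.org")
print(url)
```

In practice you would likely read the user name and password from environment variables or a secrets store rather than hard-coding them in the notebook.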

### Installation

Install the **langchain-community** and **sqlalchemy-cratedb** packages.

In [None]:
%pip install -qU langchain-community sqlalchemy-cratedb

## Initialization

Now, initialize the loader with an SQL query and a connection string.

In [None]:
from langchain_community.document_loaders import CrateDBLoader

loader = CrateDBLoader("SELECT * FROM sys.summits", url="crate://crate@localhost/")

## Load

In [None]:
documents = loader.load()
print(documents)

## Lazy Load


In [None]:
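page = []
for doc in loader.lazy_load():
    page.append(doc)
    if len(page) >= 10:
        # Do some paged operation, e.g.
        # index.upsert(page)
        page = []

# The last page may hold fewer than 10 documents. Process it here as well, e.g.
# index.upsert(page)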
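
The paging pattern above can also be written as a small reusable generator. This is a sketch, not part of the loader API: `paged` is a hypothetical helper name, and a plain generator of strings stands in for `loader.lazy_load()`. It also yields the final page when that page holds fewer items than the page size.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def paged(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of up to `size` items, including the final partial page."""
    iterator = iter(items)
    while True:
        page = list(islice(iterator, size))
        if not page:
            return
        yield page


# Stand-in for loader.lazy_load(): any iterable of documents works here.
fake_docs = (f"doc-{i}" for i in range(25))

page_sizes = [len(page) for page in paged(fake_docs, 10)]
print(page_sizes)  # [10, 10, 5]
```

Because `paged` pulls items from the iterator only as needed, at most one page of documents is held in memory at a time.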

## API reference

For detailed documentation of all CrateDBLoader features and configurations, head to the API reference: https://python.langchain.com/api_reference/cratedb/document_loaders/langchain_cratedb.document_loaders.cratedb.CrateDBLoader.html

## Tutorial

### Populate database

In [None]:
!crash < ./example_data/mlb_teams_2012.sql
!crash --command "REFRESH TABLE mlb_teams_2012;"

### Usage

In [None]:
from pprint import pprint

from langchain_community.document_loaders import CrateDBLoader

CONNECTION_STRING = "crate://crate@localhost/"

loader = CrateDBLoader(
    'SELECT * FROM mlb_teams_2012 ORDER BY "Team" LIMIT 5;',
    url=CONNECTION_STRING,
)
documents = loader.load()

In [None]:
pprint(documents)

### Specifying Which Columns are Content vs Metadata

In [None]:
loader = CrateDBLoader(
    'SELECT * FROM mlb_teams_2012 ORDER BY "Team" LIMIT 5;',
    url=CONNECTION_STRING,
    page_content_columns=["Team"],
    metadata_columns=["Payroll (millions)"],
)
documents = loader.load()

In [None]:
pprint(documents)
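
As a rough mental model of what `page_content_columns` and `metadata_columns` do, the sketch below splits a single row into document text and metadata. The `split_row` helper is hypothetical, the row values are illustrative, and the "name: value" text layout is an assumption. Check the printed output of the cell above for the exact format the loader produces.

```python
def split_row(
    row: dict,
    page_content_columns: list[str],
    metadata_columns: list[str],
) -> tuple[str, dict]:
    """Split one result row into page content text and a metadata dict."""
    # Assumption: content columns are rendered as "name: value", one per line.
    content = "\n".join(f"{col}: {row[col]}" for col in page_content_columns)
    metadata = {col: row[col] for col in metadata_columns}
    return content, metadata


# Illustrative row, shaped like a record from the mlb_teams_2012 table.
row = {"Team": "Nationals", "Payroll (millions)": 81.34, "Wins": 98}

content, metadata = split_row(
    row,
    page_content_columns=["Team"],
    metadata_columns=["Payroll (millions)"],
)
print(content)
print(metadata)
```

Columns that appear in neither list, such as `Wins` here, are left out of the resulting document in this sketch.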

### Adding Source to Metadata

In [None]:
loader = CrateDBLoader(
    'SELECT * FROM mlb_teams_2012 ORDER BY "Team" LIMIT 5;',
    url=CONNECTION_STRING,
    source_columns=["Team"],
)
documents = loader.load()

In [None]:
pprint(documents)
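
The effect of `source_columns` can be pictured as adding a `source` entry to each document's metadata, taken from the named column. The following sketch shows this for a single source column. The `attach_source` helper and the row values are hypothetical, and how the loader combines several source columns is not covered here.

```python
def attach_source(metadata: dict, row: dict, source_column: str) -> dict:
    """Return a copy of metadata with a "source" entry taken from the row."""
    enriched = dict(metadata)
    # Assumption: the source value is stored as a string.
    enriched["source"] = str(row[source_column])
    return enriched


# Illustrative row, shaped like a record from the mlb_teams_2012 table.
row = {"Team": "Nationals", "Payroll (millions)": 81.34, "Wins": 98}

metadata = attach_source({}, row, source_column="Team")
print(metadata)
```

A `source` entry like this lets downstream components, such as retrievers that cite their sources, trace each document back to the row it came from.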