# CrateDB

> [CrateDB] is capable of performing both vector and lexical search.
> It is built on top of the Apache Lucene library, talks SQL,
> is PostgreSQL-compatible, and scales like Elasticsearch.

This notebook shows how to use the CrateDB vector store functionality around
[`FLOAT_VECTOR`] and [`KNN_MATCH`]. You will learn how to use LangChain's
`CrateDBVectorSearch` adapter for similarity search and other purposes.

It supports:
- Similarity Search with Euclidean Distance
- Maximal Marginal Relevance Search (MMR)
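
To give a feel for what `FLOAT_VECTOR` and `KNN_MATCH` look like at the SQL level, here is a minimal sketch that only builds statement strings. The table name `demo_embeddings`, its columns, and the helper functions are hypothetical, and the exact syntax should be checked against the CrateDB reference linked above. The `CrateDBVectorSearch` adapter generates its own statements, which may differ.

```python
# Illustrative only: these helpers build SQL strings and do not talk to a database.
# The table `demo_embeddings` and its columns are hypothetical names.
DIMENSIONS = 4


def vector_literal(values):
    """Render a sequence of numbers as a SQL array literal, e.g. [0.1, 0.2]."""
    return "[" + ", ".join(repr(float(v)) for v in values) + "]"


# A table with one text column and one fixed-size vector column.
CREATE_TABLE = f"""
CREATE TABLE IF NOT EXISTS demo_embeddings (
    content TEXT,
    embedding FLOAT_VECTOR({DIMENSIONS})
)
""".strip()


def knn_query(query_vector, k):
    """Build a nearest-neighbour query that uses KNN_MATCH and orders by _score."""
    # Real code should prefer bind parameters over string interpolation.
    return (
        "SELECT content, _score FROM demo_embeddings "
        f"WHERE KNN_MATCH(embedding, {vector_literal(query_vector)}, {int(k)}) "
        "ORDER BY _score DESC"
    )


print(CREATE_TABLE)
print(knn_query([0.3, 0.6, 0.0, 0.9], 2))
```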

## What is CrateDB?

[CrateDB] is an open-source, distributed, and scalable SQL analytics database
for storing and analyzing massive amounts of data in near real-time, even with
complex queries. It is PostgreSQL-compatible, based on [Lucene], and inherits
the shared-nothing distribution layer of [Elasticsearch].

This example uses the [Python client driver for CrateDB]. For more documentation,
see also [LangChain with CrateDB].


[CrateDB]: https://github.com/crate/crate
[Elasticsearch]: https://github.com/elastic/elasticsearch
[`FLOAT_VECTOR`]: https://cratedb.com/docs/crate/reference/en/latest/general/ddl/data-types.html#float-vector
[`KNN_MATCH`]: https://cratedb.com/docs/crate/reference/en/latest/general/builtins/scalar-functions.html#scalar-knn-match
[LangChain with CrateDB]: /docs/extras/integrations/providers/cratedb.html
[Lucene]: https://github.com/apache/lucene
[Python client driver for CrateDB]: https://cratedb.com/docs/python/

## Setup

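To use CrateDB vector search, install the `sqlalchemy-cratedb` package along with the LangChain packages used in this notebook.

In [None]:
# Install required packages: LangChain, OpenAI SDK, the CrateDB SQLAlchemy adapter,
# and the helpers used below (`python-dotenv` for `.env` files, `unstructured` for loading URLs).
%pip install -qU langchain-community langchain-openai sqlalchemy-cratedb python-dotenv unstructured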

## Initialization

### OpenAI API key

You need to provide an OpenAI API key, optionally using the environment
variable `OPENAI_API_KEY`.

In [None]:
import getpass
import os

from dotenv import find_dotenv, load_dotenv

# Load variables from a `.env` file, if one exists.
# To set the key in your shell instead, run `export OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY`.
# Prompt for the key only when the environment does not provide it.
_ = load_dotenv(find_dotenv())
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY") or getpass.getpass("OpenAI API key:")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

### CrateDB connection string

You also need to provide a connection string to your CrateDB database cluster,
optionally using the environment variable `CRATEDB_CONNECTION_STRING`.

This example uses a CrateDB instance on your workstation, which you can start by
running [CrateDB using Docker]. Alternatively, you can connect to a cluster
running on [CrateDB Cloud].

[CrateDB Cloud]: https://console.cratedb.cloud/
[CrateDB using Docker]: https://cratedb.com/docs/guide/install/container/

The connection string must be SQLAlchemy-compatible.

In [None]:
import os

CONNECTION_STRING = os.environ.get(
    "CRATEDB_CONNECTION_STRING",
    "crate://crate@localhost:4200/?schema=langchain",
)

# For CrateDB Cloud, use:
# CONNECTION_STRING = os.environ.get(
#     "CRATEDB_CONNECTION_STRING",
#     "crate://username:password@hostname:4200/?ssl=true&schema=langchain",
# )

In [None]:
"""
# Alternatively, the connection string can be assembled from individual
# environment variables.
import os

CONNECTION_STRING = CrateDBVectorSearch.connection_string_from_db_params(
    driver=os.environ.get("CRATEDB_DRIVER", "crate"),
    host=os.environ.get("CRATEDB_HOST", "localhost"),
    port=int(os.environ.get("CRATEDB_PORT", "4200")),
    database=os.environ.get("CRATEDB_DATABASE", "langchain"),
    user=os.environ.get("CRATEDB_USER", "crate"),
    password=os.environ.get("CRATEDB_PASSWORD", ""),
)
"""
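
If your credentials contain characters such as `@` or `:`, they need to be percent-encoded before they go into the URL. The following is a minimal sketch using only the standard library. The helper `build_connection_string` and the host name `example.org` are hypothetical and not part of the adapter.

```python
from urllib.parse import quote, urlencode


def build_connection_string(host, port=4200, user="crate", password="", schema=None, ssl=False):
    """Assemble a `crate://` URL, percent-encoding the credentials."""
    # Encode every reserved character so "@" or ":" cannot break the URL structure.
    credentials = quote(user, safe="")
    if password:
        credentials += ":" + quote(password, safe="")

    # Optional query parameters, in the same order as the examples above.
    params = {}
    if ssl:
        params["ssl"] = "true"
    if schema:
        params["schema"] = schema
    query = "?" + urlencode(params) if params else ""

    return f"crate://{credentials}@{host}:{port}/{query}"


# Local instance with the default user and no password.
print(build_connection_string("localhost", schema="langchain"))

# Hypothetical remote cluster with a password that needs encoding.
print(build_connection_string("example.org", user="admin", password="p@ss:word", schema="langchain", ssl=True))
```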

### Import Python Modules

You will start by importing all required modules.

In [None]:
from langchain_community.document_loaders import UnstructuredURLLoader
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain.vectorstores import CrateDBVectorSearch

## Manage vector store

To work with an existing vector store, initialize the adapter directly,
passing the name of its collection.

In [None]:
embeddings = OpenAIEmbeddings()

store = CrateDBVectorSearch(
    collection_name="testdrive",
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings,
)

### Add items to vector store

You can also add documents to an existing vector store.

In [None]:
store.add_documents([Document(page_content="foo")])

In [None]:
docs_with_score = store.similarity_search_with_score("foo")

In [None]:
docs_with_score[0]

In [None]:
docs_with_score[1]

### Update items in vector store

FIXME

In [None]:
# Foo.

### Delete items from vector store
To delete documents, pass their IDs to `delete`.

In [None]:
store.delete(ids=["foo"])

### Load and Index Documents

Next, you will load the input data and split it into chunks. The adapter will
create a table named after the collection. Make sure the collection name is
unique, and that you have permission to create a table.

In [None]:
# `urls` expects a list of URLs, not a single string.
loader = UnstructuredURLLoader(
    urls=[
        "https://github.com/langchain-ai/langchain/raw/v0.0.325/docs/docs/modules/state_of_the_union.txt"
    ]
)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

COLLECTION_NAME = "state_of_the_union_test"

db = CrateDBVectorSearch.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)
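
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of fixed-size chunking in plain Python. It is not how `CharacterTextSplitter` is implemented, which splits on a separator first and then merges the pieces. The function `split_fixed` is a hypothetical name used for illustration only.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and smaller than chunk_size")

    # Each window starts `step` characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij"
print(split_fixed(sample, chunk_size=4))
print(split_fixed(sample, chunk_size=4, chunk_overlap=1))
```

A larger overlap repeats more context across neighbouring chunks at the cost of storing and embedding more text.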

### Overwriting a Vector Store

If you have an existing collection, you can overwrite it by using `from_documents`
and setting `pre_delete_collection=True`.

In [None]:
db = CrateDBVectorSearch.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    pre_delete_collection=True,
)

In [None]:
docs_with_score = db.similarity_search_with_score("foo")

In [None]:
docs_with_score[0]

## Query vector store

### Query directly

#### Similarity search
Searching by Euclidean distance is the default.

In [None]:
query = "What did the president say about Ketanji Brown Jackson"
docs_with_score = db.similarity_search_with_score(query)

In [None]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)
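
As a rough illustration of ranking by Euclidean distance, the following sketch compares a query vector against a few toy vectors in plain Python. The vectors and helper names are made up for this example. In the notebook, the distance computation happens inside CrateDB, and the score the adapter reports may be scaled differently.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(query_vector, candidates, k=4):
    """Return the `k` candidates closest to the query, nearest first."""
    scored = [
        (name, euclidean_distance(query_vector, vector))
        for name, vector in candidates.items()
    ]
    # A smaller distance means the candidate is more similar to the query.
    scored.sort(key=lambda item: item[1])
    return scored[:k]


toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [3.0, 4.0]}
for name, distance in rank_by_distance([1.0, 0.0], toy_vectors, k=2):
    print(name, round(distance, 3))
```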

#### Maximal Marginal Relevance Search (MMR)
Maximal marginal relevance optimizes for both similarity to the query and diversity among the selected documents.

In [None]:
docs_with_score = db.max_marginal_relevance_search_with_score(query)

In [None]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)
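
The trade-off between relevance and diversity can be sketched in a few lines of plain Python. This is a generic, textbook-style MMR selection over toy vectors using cosine similarity, not the adapter's implementation. The function names and the `lambda_mult` weighting parameter are assumptions made for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, or 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vector, doc_vectors, k=2, lambda_mult=0.5):
    """Greedily pick up to `k` document indices, balancing relevance and diversity.

    `lambda_mult=1.0` ranks purely by relevance; lower values penalize
    documents that resemble ones already selected.
    """
    selected = []
    remaining = list(range(len(doc_vectors)))
    while remaining and len(selected) < k:

        def mmr_score(i):
            relevance = cosine_similarity(query_vector, doc_vectors[i])
            # Similarity to the closest already-selected document.
            redundancy = max(
                (cosine_similarity(doc_vectors[i], doc_vectors[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vector = [1.0, 0.2]
doc_vectors = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # the first two are near-duplicates
print(mmr(query_vector, doc_vectors, k=2, lambda_mult=0.5))
print(mmr(query_vector, doc_vectors, k=2, lambda_mult=1.0))
```

With the balanced setting, the second pick skips the near-duplicate in favour of the less similar third vector. With `lambda_mult=1.0`, the two most relevant vectors are returned regardless of their overlap.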

#### Searching in Multiple Collections
`CrateDBVectorSearchMultiCollection` is a special adapter that provides similarity search across
multiple collections. It cannot be used for indexing documents.

In [None]:
from langchain.vectorstores.cratedb import CrateDBVectorSearchMultiCollection

multisearch = CrateDBVectorSearchMultiCollection(
    collection_names=["test_collection_1", "test_collection_2"],
    embedding_function=embeddings,
    connection_string=CONNECTION_STRING,
)
docs_with_score = multisearch.similarity_search_with_score(query)

### Query by turning into retriever

In [None]:
retriever = store.as_retriever()

In [None]:
print(retriever)

## Usage for retrieval-augmented generation

For guides on how to use this vector store for retrieval-augmented generation (RAG), see the following sections:

- [Tutorials: working with external knowledge](https://python.langchain.com/docs/tutorials/#working-with-external-knowledge)
- [How-to: Question and answer with RAG](https://python.langchain.com/docs/how_to/#qa-with-rag)
- [Retrieval conceptual docs](https://python.langchain.com/docs/concepts/retrieval)

## API reference

For detailed documentation of all `CrateDBVectorSearch` features and configurations,
head to the API reference:
https://python.langchain.com/api_reference/cratedb/vectorstores/langchain_cratedb.vectorstores.CrateDBVectorSearch.html