# Google Bigtable

> [Bigtable](https://cloud.google.com/bigtable) is a key-value and wide-column store, ideal for fast access to structured, semi-structured, or unstructured data. Extend your database application to build AI-powered experiences leveraging Bigtable's LangChain integrations.

This notebook goes over how to use [Bigtable](https://cloud.google.com/bigtable) to [add documents to and query vector stores](https://python.langchain.com/docs/concepts/vectorstores/) with `BigtableVectorStore`.

Learn more about the package on [GitHub](https://github.com/googleapis/langchain-google-bigtable-python/).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/googleapis/langchain-google-bigtable-python/blob/main/docs/vector_store.ipynb)

## Before You Begin

To run this notebook, you will need to do the following:

* [Create a Google Cloud Project](https://developers.google.com/workspace/guides/create-project)
* [Enable the Bigtable API](https://console.cloud.google.com/flows/enableapi?apiid=bigtable.googleapis.com)
* [Create a Bigtable instance](https://cloud.google.com/bigtable/docs/creating-instance)
* [Create a Bigtable table](https://cloud.google.com/bigtable/docs/managing-tables)
* [Create Bigtable access credentials](https://developers.google.com/workspace/guides/create-credentials)

After confirming access to the database from this notebook's runtime environment, fill in the following values and run the cell before running the example scripts.

In [None]:
# @markdown Please specify an instance and a table for demo purposes.
INSTANCE_ID = "your-instance-id"  # @param {type:"string"}
TABLE_ID = "your-table-id"  # @param {type:"string"}

### 🦜🔗 Library Installation

The integration lives in its own `langchain-google-bigtable` package, so we need to install it.

In [None]:
%pip install --quiet langchain-google-bigtable

**Colab only**: Uncomment the code in the following cell to restart the kernel, or restart it with the button instead. For Vertex AI Workbench, you can restart the terminal using the button on top.

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

### ☁ Set Your Google Cloud Project
Set your Google Cloud project so that you can leverage Google Cloud resources within this notebook.

If you don't know your project ID, try the following:

* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113).

In [None]:
# @markdown Please fill in the value below with your Google Cloud project ID and then run the cell.

PROJECT_ID = "your-project-id"  # @param {type:"string"}

# Set the project id
!gcloud config set project {PROJECT_ID}

### 🔐 Authentication

Authenticate to Google Cloud as the IAM user logged into this notebook in order to access your Google Cloud Project.

- If you are using Colab to run this notebook, use the cell below and continue.
- If you are using Vertex AI Workbench, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env).

In [None]:
from google.colab import auth

auth.authenticate_user(project_id=PROJECT_ID)

## Embedding Service Setup

The vector store needs an embedding model to turn text into vectors for storage and search. This demonstration uses one of Vertex AI's embedding models.

In [None]:
%pip install --quiet langchain-google-vertexai

In [None]:
from langchain_google_vertexai import VertexAIEmbeddings

In [None]:
embedding_service = VertexAIEmbeddings(model_name="gemini-embedding-001")

## Instantiation

### Initialize a table

In [None]:
from langchain_google_bigtable.vector_store import init_vector_store_table

DATA_COLUMN_FAMILY = "doc_data"

try:
    init_vector_store_table(
        project_id=PROJECT_ID,
        instance_id=INSTANCE_ID,
        table_id=TABLE_ID,
        content_column_family=DATA_COLUMN_FAMILY,
        embedding_column_family=DATA_COLUMN_FAMILY,
    )
except ValueError as e:
    print(e)

## BigtableEngine

A `BigtableEngine` object will be used to handle the execution context of the store. While not strictly required for creating a `BigtableVectorStore`, it is highly recommended to initialize a single `BigtableEngine` instance and reuse it across multiple stores for better performance and resource management.

In [None]:
from langchain_google_bigtable import BigtableEngine

engine = await BigtableEngine.async_initialize(project_id=PROJECT_ID)

## BigtableVectorStore

In [None]:
from langchain_google_bigtable.vector_store import (
    BigtableVectorStore,
    ColumnConfig,
    DistanceStrategy,
    Encoding,
    QueryParameters,
    VectorDataType,
    VectorMetadataMapping,
    init_vector_store_table,
)

from langchain_core.documents import Document

### Configuring Metadata Storage

When creating a `BigtableVectorStore`, you have two optional parameters for handling metadata:

* `metadata_mappings`: This is a list of `VectorMetadataMapping` objects. You **must** define a mapping for any metadata key you wish to use for filtering in your search queries. Each mapping specifies the data type (`encoding`) for the metadata field, which is crucial for correct filtering.
* `metadata_as_json_column`: This is an optional `ColumnConfig` that tells the store to save the *entire* metadata dictionary as a single JSON string in a specific column. This is useful for efficiently retrieving all of a document's metadata at once, including fields not defined in `metadata_mappings`. **Note:** Fields stored only in this JSON column cannot be used for filtering.
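
To make the difference between the two options concrete, here is a minimal sketch in plain Python. The `split_metadata` helper and the `edition` key are hypothetical and are not part of `langchain-google-bigtable`. The sketch only mimics the behavior described above: mapped keys get their own filterable columns, and the JSON column keeps the whole dictionary.

```python
import json


# Hypothetical helper, not part of langchain-google-bigtable. It only
# illustrates the two storage options described above.
def split_metadata(metadata, mapped_keys, store_json=True):
    """Return (per-key columns, optional JSON column) for one document."""
    # Keys listed in the mappings get their own column and stay filterable.
    mapped_columns = {k: v for k, v in metadata.items() if k in mapped_keys}
    # The JSON column holds the entire dictionary, mapped keys included.
    json_column = json.dumps(metadata, sort_keys=True) if store_json else None
    return mapped_columns, json_column


# "edition" is an illustrative key that has no mapping.
metadata = {"author": "Frank Herbert", "year": 1965, "edition": "paperback"}
columns, blob = split_metadata(metadata, mapped_keys={"author", "year"})

print(columns)  # only these keys could be used in filters
print(json.loads(blob)["edition"])  # retrievable, but not filterable
```

In this sketch, `edition` survives the round trip through the JSON column but never appears among the per-key columns, which is why a filter on it could not work.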

In [None]:
collection = "fiction_docs"

# Prepare a list of MetadataMappings to be used for filtering later on.
metadata_mappings = [
    VectorMetadataMapping(metadata_key="author", encoding=Encoding.UTF8),
    VectorMetadataMapping(metadata_key="category", encoding=Encoding.UTF8),
    VectorMetadataMapping(metadata_key="year", encoding=Encoding.INT_BIG_ENDIAN),
    VectorMetadataMapping(metadata_key="is_public", encoding=Encoding.BOOL),
    VectorMetadataMapping(metadata_key="rating", encoding=Encoding.FLOAT),
]

# Configure the columns for your store.
content_column = ColumnConfig(
    column_family=DATA_COLUMN_FAMILY, column_qualifier="content"
)
embedding_column = ColumnConfig(
    column_family=DATA_COLUMN_FAMILY, column_qualifier="embedding"
)
metadata_as_json_column = ColumnConfig(
    column_family=DATA_COLUMN_FAMILY, column_qualifier="metadata_json"
)

In [None]:
vector_store = await BigtableVectorStore.create(
    project_id=PROJECT_ID,
    instance_id=INSTANCE_ID,
    table_id=TABLE_ID,
    engine=engine,
    embedding_service=embedding_service,
    collection=collection,
    metadata_mappings=metadata_mappings,
    metadata_as_json_column=metadata_as_json_column,  # Optional: for storing all metadata
    content_column=content_column,
    embedding_column=embedding_column,
)

## Usage

### Populate the Vector Store

#### Documents to Add

This is a small list of documents, but it contains enough variety in metadata and IDs to demonstrate the filtering capabilities covered below. If a document is added without an `id`, the vector store automatically generates a UUID4 for it.
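
As a rough illustration of the ID fallback, the sketch below uses Python's standard `uuid` module. The `ensure_id` helper is hypothetical; the vector store handles this internally and its exact code path is not shown here.

```python
import uuid


# Hypothetical helper for illustration only.
def ensure_id(doc_id=None):
    """Keep a caller-supplied ID, otherwise generate a random UUID4 string."""
    return doc_id if doc_id is not None else str(uuid.uuid4())


explicit = ensure_id("group_A/doc_1")
generated = ensure_id()

print(explicit)
print(generated)  # different on every run
```

One practical consequence: a generated ID carries no prefix such as `group_A/`, so documents that rely on it cannot be selected by a row-key prefix.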

In [None]:
# Example documents to use for this demonstration
documents_to_add = [
    Document(
        page_content="A young farm boy, Luke Skywalker, is thrust into a galactic conflict.",
        metadata={
            "author": "George Lucas",
            "year": 1977,
            "category": "sci-fi",
            "rating": 4.8,
            "is_public": True,
        },
        id="group_A/doc_1",
    ),
    Document(
        page_content="In a distant future, the noble House Atreides rules the desert planet Arrakis.",
        metadata={
            "author": "Frank Herbert",
            "year": 1965,
            "category": "sci-fi",
            "rating": 4.9,
            "is_public": True,
        },
        id="group_A/doc_2",
    ),
    Document(
        page_content="A hobbit named Frodo Baggins must destroy a powerful ring.",
        metadata={
            "author": "J.R.R. Tolkien",
            "year": 1954,
            "category": "fantasy",
            "rating": 4.9,
            "is_public": True,
        },
        id="group_B/doc_3",
    ),
    Document(
        page_content="A group of children confront an evil entity emerging from the sewers.",
        metadata={
            "author": "Stephen King",
            "year": 1986,
            "category": "horror",
            "rating": 4.5,
            "is_public": False,
        },
        id="group_B/doc_4",
    ),
    Document(
        page_content="The last human is whisked off Planet Earth by his alien friend.",
        metadata={
            "author": "Douglas Adams",
            "year": 1979,
            "category": "sci-fi",
            "rating": 4.8,
            "is_public": True,
        },
        id="group_A/doc_5",
    ),
    Document(
        page_content="A young wizard, Harry Potter, discovers his heritage and battles a dark wizard.",
        metadata={
            "author": "J.K. Rowling",
            "year": 1997,
            "category": "fantasy",
            "rating": 4.9,
            "is_public": True,
        },
        id="group_B/doc_6",
    ),
    # Documents without a pre-defined ID
    Document(
        page_content="A dystopian novel set where people are genetically engineered into a caste system.",
        metadata={
            "author": "Aldous Huxley",
            "year": 1932,
            "category": "dystopian",
            "rating": 4.5,
            "is_public": True,
        },
    ),
    Document(
        page_content="An exploration of the history of humankind, from the Stone Age to the present day.",
        metadata={
            "author": "Yuval Noah Harari",
            "year": 2011,
            "category": "non-fiction",
            "rating": 4.9,
            "is_public": True,
        },
    ),
]

#### Add Documents

In [None]:
added_ids = await vector_store.aadd_documents(documents_to_add)

In [None]:
print(added_ids)

### Using the Vector Store as a Retriever

In [None]:
query_params_retriever = QueryParameters(
    filters={"ColumnValueFilter": {"category": {"==": "sci-fi"}}}
)
retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 2,
        "lambda_mult": 0.2,  # prioritizes diversity over similarity
        "query_parameters": query_params_retriever,  # Apply a filter
    },
)

print("Created a retriever configured for MMR search on 'sci-fi' documents.\n")

query = "a story about humanity's future on a desert planet"
print(f"Invoking retriever with query: '{query}'\n")

retrieved_docs = await retriever.ainvoke(query)

print(f"Retriever found {len(retrieved_docs)} documents:")
for doc in retrieved_docs:
    print("-" * 30)
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")
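
To see why a low `lambda_mult` favors diversity, the following is a minimal sketch of greedy maximal marginal relevance (MMR) in plain Python. It is not the library's implementation. The `mmr` function and the toy two-dimensional vectors are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def mmr(query, candidates, k, lambda_mult):
    """Greedy MMR; returns indices into `candidates` in selection order."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Similarity to the closest item that was already picked.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.0],  # 0: identical to the query
    [0.99, 0.14],  # 1: near-duplicate of candidate 0
    [0.6, 0.8],  # 2: less relevant, but different
]

diverse = mmr(query, candidates, k=2, lambda_mult=0.2)
relevant = mmr(query, candidates, k=2, lambda_mult=0.9)
print(diverse)  # [0, 2]
print(relevant)  # [0, 1]
```

With `lambda_mult=0.2` the second pick skips the near-duplicate in favor of the more distinct vector, while `lambda_mult=0.9` keeps the two most similar vectors.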

### Search with Filters

#### The kNN Search Algorithm and Filtering

By default, `BigtableVectorStore` uses a **k-Nearest Neighbors (kNN)** search algorithm to find the `k` vectors in the database that are most similar to your query vector. The vector store offers filtering to reduce the search space *before* the kNN search is performed, which can make queries faster and more relevant.
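
The idea of filtering first and ranking second can be sketched in a few lines of plain Python. This is a brute-force illustration under assumed toy data, not how Bigtable executes the query. The `filtered_knn` function and the row layout are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def filtered_knn(query, rows, k, predicate=lambda metadata: True):
    """Drop rows that fail the filter, then rank the rest by distance."""
    candidates = [row for row in rows if predicate(row["metadata"])]
    candidates.sort(key=lambda row: cosine_distance(query, row["embedding"]))
    return [row["id"] for row in candidates[:k]]


rows = [
    {"id": "a", "embedding": [1.0, 0.0], "metadata": {"category": "sci-fi"}},
    {"id": "b", "embedding": [0.9, 0.1], "metadata": {"category": "fantasy"}},
    {"id": "c", "embedding": [0.0, 1.0], "metadata": {"category": "sci-fi"}},
]

unfiltered = filtered_knn([1.0, 0.0], rows, k=2)
sci_fi_only = filtered_knn(
    [1.0, 0.0], rows, k=2, predicate=lambda m: m["category"] == "sci-fi"
)
print(unfiltered)  # ['a', 'b']
print(sci_fi_only)  # ['a', 'c']
```

Because the filter runs before ranking, row `b` never competes for a top-`k` slot in the filtered query, even though it is the second-closest vector overall.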

#### Configuring Queries with `QueryParameters`

All search settings are controlled via the `QueryParameters` object. This object allows you to specify not only filters but also other important search aspects:
* `algorithm`: The search algorithm to use. Defaults to `"kNN"`.
* `distance_strategy`: The metric used for comparison, such as `COSINE` (default) or `EUCLIDEAN`.
* `vector_data_type`: The data type of the stored vectors, like `FLOAT32` or `DOUBLE64`. This should match the precision of your embeddings.
* `filters`: A dictionary defining the filtering logic to apply.
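
The choice of distance strategy and vector precision can change results, which the sketch below shows with the Python standard library only. The vectors are made up, and the `struct` round trip illustrates general 32-bit versus 64-bit float precision rather than the library's actual serialization.

```python
import math
import struct


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


query = [1.0, 1.0]
vectors = {
    "same_direction_far": [10.0, 10.0],
    "nearby_other_direction": [1.0, 0.5],
}

# Cosine distance ignores magnitude; Euclidean distance does not.
nearest_cosine = min(vectors, key=lambda name: cosine_distance(query, vectors[name]))
nearest_euclidean = min(vectors, key=lambda name: math.dist(query, vectors[name]))
print(nearest_cosine)  # same_direction_far
print(nearest_euclidean)  # nearby_other_direction

# A 32-bit float keeps fewer digits than Python's native 64-bit float.
as_float32 = struct.unpack("<f", struct.pack("<f", 0.1))[0]
as_float64 = struct.unpack("<d", struct.pack("<d", 0.1))[0]
print(as_float32 == 0.1)  # False
print(as_float64 == 0.1)  # True
```

The two metrics disagree on the nearest neighbor for the same data, so the strategy passed at query time should match how the embeddings are meant to be compared.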

#### Understanding Encodings

To filter on metadata fields, you must define them in `metadata_mappings` with the correct `encoding` so Bigtable can properly interpret the data. Supported encodings include:
* **String**: `UTF8`, `UTF16`, `ASCII` for text-based metadata.
* **Numeric**: `INT_BIG_ENDIAN` or `INT_LITTLE_ENDIAN` for integers, and `FLOAT` or `DOUBLE` for decimal numbers.
* **Boolean**: `BOOL` for true/false values.
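
To illustrate why the encoding matters, the sketch below packs a few metadata values into bytes with Python's `struct` module. The byte widths used here (8-byte integers, 4-byte floats, 1-byte booleans) are assumptions for illustration; the package's actual byte layout may differ.

```python
import struct

year = 1977

# Assumed layout: 8-byte signed integer in each byte order.
big = struct.pack(">q", year)
little = struct.pack("<q", year)
print(big.hex())  # 00000000000007b9
print(little.hex())  # b907000000000000

# Reading bytes with the wrong byte order yields a different number,
# so a comparison such as year > 1970 would be evaluated on garbage.
misread = struct.unpack("<q", big)[0]
print(misread == year)  # False

# Text, decimal, and boolean values, again under assumed layouts.
text_utf8 = "sci-fi".encode("utf-8")
text_utf16 = "sci-fi".encode("utf-16-le")
rating_float = struct.pack(">f", 4.8)
rating_double = struct.pack(">d", 4.8)
flag = struct.pack("?", True)
print(len(text_utf8), len(text_utf16))  # 6 12
print(len(rating_float), len(rating_double))  # 4 8
print(flag)  # b'\x01'
```

The same value produces different bytes under each encoding, which is why the mapping declared at write time has to match the one used when filtering.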

#### Supported Filter Types

`BigtableVectorStore` supports four main categories of filters to refine your search:

* **Row Key (Document ID) Filters**:
    * `RowKeyFilter`: Narrows the search to document IDs that start with a specific row-key prefix. The document ID filter is always applied by default.

* **Metadata Key (Qualifier) Filters**: Checks for the *presence* of metadata keys.
    * `ColumnQualifiers`: Checks for one or more exact keys.
    * `ColumnQualifierPrefix`: Checks if a key starts with a given prefix.
    * `ColumnQualifierRegex`: Checks if a key matches a regular expression.

* **Metadata Value Filters**: Filters on the actual values of metadata fields.
    * `ColumnValueFilter`: Filters on the values of the metadata keys. The following list showcases the supported operators.
    * **Supported Operators**: `==`, `!=`, `>`, `<`, `>=`, `<=`, `in`, `nin`, `contains`, `like`.

* **Logical Filters**: Combines value-based conditions.
    * `ColumnValueChainFilter`: Logical AND.
    * `ColumnValueUnionFilter`: Logical OR.
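
The semantics of these filters can be approximated in plain Python. The evaluator below is a hypothetical sketch, not the library's implementation: it treats `RowKeyFilter` as a prefix match on the document ID, covers only the value and logical filters, and leaves out the qualifier filters and the `like` operator because their exact matching rules are not described here.

```python
import operator

# Subset of the operators listed above.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "in": lambda value, options: value in options,
    "nin": lambda value, options: value not in options,
    "contains": lambda value, item: item in value,
}


def matches_values(metadata, conditions, combine=all):
    """Evaluate a ColumnValueFilter-style dict against one metadata dict."""
    results = []
    for key, spec in conditions.items():
        if key == "ColumnValueChainFilter":  # logical AND
            results.append(matches_values(metadata, spec, combine=all))
        elif key == "ColumnValueUnionFilter":  # logical OR
            results.append(matches_values(metadata, spec, combine=any))
        else:
            # A plain field: every operator given for it must hold.
            results.append(
                key in metadata
                and all(
                    OPERATORS[op](metadata[key], operand)
                    for op, operand in spec.items()
                )
            )
    return combine(results)


def matches(doc_id, metadata, filters):
    """Top-level filters are combined with a logical AND."""
    if not doc_id.startswith(filters.get("RowKeyFilter", "")):
        return False
    return matches_values(metadata, filters.get("ColumnValueFilter", {}))


docs = {
    "group_A/doc_1": {"category": "sci-fi", "year": 1977, "rating": 4.8, "is_public": True},
    "group_A/doc_2": {"category": "sci-fi", "year": 1965, "rating": 4.9, "is_public": True},
    "group_B/doc_3": {"category": "fantasy", "year": 1954, "rating": 4.9, "is_public": True},
    "group_B/doc_4": {"category": "horror", "year": 1986, "rating": 4.5, "is_public": False},
}

implicit_and = {
    "RowKeyFilter": "group_A/",
    "ColumnValueFilter": {"year": {">": 1970}, "category": {"==": "sci-fi"}},
}
nested = {
    "ColumnValueFilter": {
        "ColumnValueUnionFilter": {
            "ColumnValueChainFilter": {
                "category": {"==": "fantasy"},
                "is_public": {"==": True},
            },
            "rating": {"<": 4.6},
        }
    }
}

and_hits = [doc_id for doc_id, md in docs.items() if matches(doc_id, md, implicit_and)]
nested_hits = [doc_id for doc_id, md in docs.items() if matches(doc_id, md, nested)]
print(and_hits)  # ['group_A/doc_1']
print(nested_hits)  # ['group_B/doc_3', 'group_B/doc_4']
```

Running a filter through a local check like this before querying can help confirm that the nesting expresses the intended AND/OR logic.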

#### Filter Example 1 - Row Key and Value Filter (Implicit AND)

Find sci-fi books published after 1970 within `group_A`. When you provide multiple filters at the top level, they are combined with a logical AND.

In [None]:
query = "a story about space and destiny"

filter_1 = {
    "RowKeyFilter": "group_A/",
    "ColumnValueFilter": {"year": {">": 1970}, "category": {"==": "sci-fi"}},
}

query_params_1 = QueryParameters(filters=filter_1)

print(f"Querying for: '{query}'")
print("With filter: In 'group_A' AND category is 'sci-fi' AND year > 1970\n")

results_1 = await vector_store.asimilarity_search(
    query, k=5, query_parameters=query_params_1
)

print(f"Found {len(results_1)} documents matching the filter:")
for doc in results_1:
    print(f"- ID: {doc.id}, Metadata: {doc.metadata}")

#### Filter Example 2 - Complex Logical Filter (AND/OR)

Find books that are either (`category` is 'fantasy' AND `is_public` is `True`) OR (`rating` is less than 4.6).

In [None]:
query = "a tale of magic and quests"

filter_2 = {
    "ColumnValueFilter": {
        "ColumnValueUnionFilter": {  # Logical OR
            "ColumnValueChainFilter": {  # Logical AND
                "category": {"==": "fantasy"},
                "is_public": {"==": True},
            },
            "rating": {"<": 4.6},
        }
    }
}

query_params_2 = QueryParameters(filters=filter_2)

print(f"Querying for: '{query}'")
print("With filter: (category is 'fantasy' AND is_public) OR (rating < 4.6)\n")

results_2 = await vector_store.asimilarity_search(
    query, k=5, query_parameters=query_params_2
)

print(f"Found {len(results_2)} documents matching the filter:")
for doc in results_2:
    print(f"- ID: {doc.id}, Metadata: {doc.metadata}")

#### Filter Example 3 - Full `QueryParameters` with List Membership

Here, we use the entire `QueryParameters` object to customize the search. We are changing the `distance_strategy`, specifying the `vector_data_type`, and filtering for books where the author is in a specific list, using the `in` operator.

In [None]:
query = "a story of good versus evil"

filter_3 = {"ColumnValueFilter": {"author": {"in": ["J.R.R. Tolkien", "J.K. Rowling"]}}}

# Create a full QueryParameters object
query_params_3 = QueryParameters(
    algorithm="kNN",  # Explicitly setting the default algorithm
    distance_strategy=DistanceStrategy.EUCLIDEAN,
    vector_data_type=VectorDataType.FLOAT32,  # Specifying the vector data type
    filters=filter_3,
)

print(f"Querying for: '{query}'")
print("With filter: author is 'J.R.R. Tolkien' or 'J.K. Rowling'")
print(f"Using algorithm: {query_params_3.algorithm}")
print(f"Using distance strategy: {query_params_3.distance_strategy.name}\n")

results_3 = await vector_store.asimilarity_search(
    query, k=2, query_parameters=query_params_3
)

print(f"Found {len(results_3)} documents matching the filter:")
for doc in results_3:
    print(f"- ID: {doc.id}, Metadata: {doc.metadata}")