## Vertex AI Agent Builder Data Store Status Checker

### Overview

#### What is a Data Store?

A Data Store in Vertex AI Agent Builder is a collection of websites or documents, both structured and unstructured, that can be indexed for search and retrieval actions.

Data Stores are the fundamental building block behind Vertex AI Agent Builder.

#### Data Store Indexing Time

Each time a website or set of documents is added, the Data Store must index the content before it becomes searchable. For new web content, indexing can take up to 4 hours.

Using the attached example notebook, you can query your Data Store by ID to check whether indexing is complete. Once it is, you can also use the notebook to search the Data Store for specific pages or documents.

### Objectives

This lab uses the Cloud Discovery Engine API to check a Data Store for indexed documents.

It uses the `google-cloud-discoveryengine` Python library to perform the following tasks:

- Check the indexing status of a given Data Store ID.
- List all documents in a given Data Store ID.
- List all indexed URLs for a given Data Store ID.
- Search all indexed URLs for a specific URL within a given Data Store ID.

### Task 1. Enable APIs

Enable the Dialogflow API.

### Task 2. Install prerequisites

In [None]:
pip install --upgrade --quiet google-cloud-discoveryengine humanize
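
After installing, it can help to confirm which versions the notebook kernel actually sees. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the lab.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report on the two packages installed in the cell above.
for name in ("google-cloud-discoveryengine", "humanize"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either package is reported as not installed, restarting the kernel after the install cell is a common remedy.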

### Task 3. Helper Methods

In [2]:
import humanize
import time
import re
from typing import List, Optional

from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine_v1beta as discoveryengine


def _call_list_documents(
    project_id: str, location: str, datastore_id: str, page_token: Optional[str] = None
) -> discoveryengine.ListDocumentsResponse:
    """Fetch a single page of documents from the data store's default branch."""
    client_options = (
        ClientOptions(
            api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )
    client = discoveryengine.DocumentServiceClient(
        client_options=client_options)

    request = discoveryengine.ListDocumentsRequest(
        parent=client.branch_path(
            project_id, location, datastore_id, "default_branch"
        ),
        page_size=1000,
        page_token=page_token,
    )

    return client.list_documents(request=request)


def list_documents(
    project_id: str, location: str, datastore_id: str, rate_limit: int = 1
) -> List[discoveryengine.Document]:
    """Gets a list of docs in a datastore."""

    res = _call_list_documents(project_id, location, datastore_id)

    # Copy the first batch into a plain list so that extending it
    # does not mutate the response object.
    docs = list(res.documents)

    while res.next_page_token:
        # implement a rate_limit to prevent quota exhaustion
        time.sleep(rate_limit)

        res = _call_list_documents(
            project_id, location, datastore_id, res.next_page_token
        )
        docs.extend(res.documents)

    return docs


def list_indexed_urls(
    docs: Optional[List[discoveryengine.Document]] = None,
    project_id: Optional[str] = None,
    location: Optional[str] = None,
    datastore_id: Optional[str] = None,
) -> List[str]:
    """Get the list of docs in data store, then parse to only urls."""
    if not docs:
        docs = list_documents(project_id, location, datastore_id)
    urls = [doc.content.uri for doc in docs]

    return urls


def search_url(urls: List[str], url: str) -> None:
    """Searches a url in a list of urls."""
    for item in urls:
        if url in item:
            print(item)


def search_doc_id(
    doc_id: str,
    docs: Optional[List[discoveryengine.Document]] = None,
    project_id: Optional[str] = None,
    location: Optional[str] = None,
    datastore_id: Optional[str] = None,
) -> None:
    """Searches a doc_id in a list of docs."""
    if not docs:
        docs = list_documents(project_id, location, datastore_id)

    doc_found = False
    for doc in docs:
        if doc.parent_document_id == doc_id:
            doc_found = True
            print(doc)

    if not doc_found:
        print(f"Document not found for provided Doc ID: `{doc_id}`")


def estimate_data_store_size(
    urls: Optional[List[str]] = None,
    docs: Optional[List[discoveryengine.Document]] = None,
    project_id: Optional[str] = None,
    location: Optional[str] = None,
    datastore_id: Optional[str] = None,
) -> None:
    """For Advanced Website Indexing data stores only."""
    if not urls:
        if not docs:
            docs = list_documents(project_id, location, datastore_id)
        urls = list_indexed_urls(docs=docs)

    # Keep only website URLs (http or https).
    urls = [u for u in urls if re.search(r"https?://", u)]

    if not urls:
        print(
            "No urls found. Make sure this data store is for websites with advanced indexing."
        )
        return

    # For website indexing, each page is calculated as 500KB.
    size = len(urls) * 500_000
    print(f"Estimated data store size: {humanize.naturalsize(size)}")


PENDING_MESSAGE = """
No docs found.

It's likely one of the following issues:
  [1] Your data store is not finished indexing.
  [2] Your data store failed indexing.
  [3] Your data store is for website data without advanced indexing.

If you just added your data store, it can take up to 4 hours before it becomes available.
"""

### Task 4. User Inputs

#### Creating a new data store.

1. Navigate to the [Agent Builder console](https://console.cloud.google.com/gen-app-builder/start) and click the **CONTINUE AND ACTIVATE THE API** button.
2. On the **Create App** page, select **Chat** as the app type.
3. For **Company Name**, enter `Cymbal`. For **Agent Name**, enter `cymbalagent`, then click **CONTINUE**.
4. On the **Data Stores** page, click **+ CREATE NEW DATA STORE**.
5. Select **Cloud Storage** and enter the Google Cloud Storage location `cloud-samples-data/dialogflow-cx/arc-lifeblood` to add the folder. Under **What kind of data are you importing?**, select **Unstructured documents**, then click **CONTINUE**.
6. For **Data store name**, enter `cymbaldatastore` and click **CREATE**. This creates the data store.
7. Finally, on the app's Data page, select **cymbaldatastore** and click **CREATE**.
8. Click **cymbaldatastore** and note the **Data store ID**.

### Task 5. Check Data Store Index Status

Let's use the `list_documents` method to check whether the data store has finished indexing.
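
`list_documents` gathers every page of results before returning, following `next_page_token` until it is empty. The loop can be sketched without calling the service at all. In the sketch below, `collect_all_pages`, `fake_fetch`, and the in-memory pages are hypothetical stand-ins for illustration; only the page-token pattern itself comes from the helper.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# A page is (items, next_page_token); an empty token means "no more pages".
Page = Tuple[List[str], str]


def collect_all_pages(
    fetch_page: Callable[[Optional[str]], Page], pause_s: float = 0.0
) -> List[str]:
    """Follow next_page_token until the source reports no further pages."""
    items: List[str] = []
    token: Optional[str] = None
    while True:
        page_items, next_token = fetch_page(token)
        items.extend(page_items)
        if not next_token:
            return items
        # Pause between requests, like rate_limit in list_documents.
        time.sleep(pause_s)
        token = next_token


# Hypothetical in-memory stand-in for the remote service (not a real API).
_FAKE_PAGES: Dict[Optional[str], Page] = {
    None: (["doc-1", "doc-2"], "t1"),
    "t1": (["doc-3", "doc-4"], "t2"),
    "t2": (["doc-5"], ""),
}


def fake_fetch(token: Optional[str]) -> Page:
    """Return the page stored under the given token."""
    return _FAKE_PAGES[token]


all_docs = collect_all_pages(fake_fetch)
print(all_docs)
```

The first request carries no token, and each later request passes the token returned by the previous page.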

In [3]:
PROJECT = !gcloud config get-value project
project_id = PROJECT[0]
location = "global"  # Options: "global", "us", "eu"
datastore_id = "cymbaldatastore_1737322168745"  # Replace with your Data store ID
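
The `location` value decides which API endpoint the helper uses: the default endpoint for `"global"`, and a location-prefixed endpoint otherwise. The sketch below mirrors the condition in `_call_list_documents` and adds a validation step. The function name `api_endpoint_for` and the validation are assumptions added for illustration, and the allowed values are taken from the comment in the cell above.

```python
from typing import Optional

# Locations named in the comment of the configuration cell.
SUPPORTED_LOCATIONS = ("global", "us", "eu")


def api_endpoint_for(location: str) -> Optional[str]:
    """Return the regional endpoint, or None to use the library default."""
    if location not in SUPPORTED_LOCATIONS:
        raise ValueError(
            f"Unsupported location {location!r}; expected one of {SUPPORTED_LOCATIONS}"
        )
    if location == "global":
        return None
    return f"{location}-discoveryengine.googleapis.com"


for loc in SUPPORTED_LOCATIONS:
    print(loc, "->", api_endpoint_for(loc))
```

Validating early turns a typo in `location` into an immediate, readable error rather than a failed request later.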

In [6]:
docs = list_documents(project_id, location, datastore_id)

if len(docs) == 0:
    print(PENDING_MESSAGE)
else:
    SUCCESS_MESSAGE = f"""
  Success! 🎉\n
  Your indexing is complete.\n
  Your index contains {len(docs)} documents.
  """
    print(SUCCESS_MESSAGE)


  Success! 🎉

  Your indexing is complete.

  Your index contains 79 documents.
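
Because indexing can take up to 4 hours, you may prefer to poll rather than rerun the cell by hand. The following is a minimal sketch of such a loop. The function `wait_for_documents`, its parameters, and the 5-minute default interval are assumptions for illustration and are not part of the lab's helpers.

```python
import time
from typing import Callable


def wait_for_documents(
    count_documents: Callable[[], int],
    timeout_s: float = 4 * 60 * 60,
    poll_interval_s: float = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until at least one document is listed or the timeout elapses.

    Returns the document count, or 0 if the timeout was reached first.
    """
    waited = 0.0
    while True:
        count = count_documents()
        if count > 0:
            return count
        if waited >= timeout_s:
            return 0
        sleep(poll_interval_s)
        waited += poll_interval_s


# Stand-in that reports an empty index twice, then 79 documents.
_results = iter([0, 0, 79])
found = wait_for_documents(lambda: next(_results), sleep=lambda _s: None)
print(found)
```

In the notebook, the callable could be `lambda: len(list_documents(project_id, location, datastore_id))`. An empty listing can also mean that indexing failed or that the data store holds website data without advanced indexing, so the timeout matters.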
  


### Task 6. List Documents

In [7]:
docs = list_documents(project_id, location, datastore_id)
docs[0]

name: "projects/353022593964/locations/global/collections/default_collection/dataStores/cymbaldatastore_1737322168745/branches/0/documents/01c3140622cfed9a86572e550ed049a0"
id: "01c3140622cfed9a86572e550ed049a0"
schema_id: "default_schema"
struct_data {
}
parent_document_id: "01c3140622cfed9a86572e550ed049a0"
content {
  mime_type: "text/html"
  uri: "gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/testing.html"
}
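
The `name` field in the output above encodes the project, location, collection, data store, branch, and document ID in one path. A small parser can split it into its parts. This is a sketch whose pattern is inferred only from the single name shown above; the function `parse_document_name` is hypothetical, and other resource shapes are not handled.

```python
import re
from typing import Dict

# Pattern inferred from the document name printed in the output above.
_DOCUMENT_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/collections/(?P<collection>[^/]+)"
    r"/dataStores/(?P<data_store>[^/]+)"
    r"/branches/(?P<branch>[^/]+)"
    r"/documents/(?P<document>[^/]+)$"
)


def parse_document_name(name: str) -> Dict[str, str]:
    """Split a document resource name into its named components."""
    match = _DOCUMENT_NAME.match(name)
    if match is None:
        raise ValueError(f"Unrecognized document name: {name!r}")
    return match.groupdict()


parts = parse_document_name(
    "projects/353022593964/locations/global/collections/default_collection"
    "/dataStores/cymbaldatastore_1737322168745/branches/0"
    "/documents/01c3140622cfed9a86572e550ed049a0"
)
print(parts["data_store"], parts["branch"], parts["document"])
```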

### Task 7. Search Data Store by Doc ID

Let's search through all docs in a given Data Store and find a specific Doc ID.

In the following cell, replace the value of `document_id` with the **parent_document_id** from the previous output.

In [8]:
document_id = "01c3140622cfed9a86572e550ed049a0"
search_doc_id(document_id, docs=docs)

name: "projects/353022593964/locations/global/collections/default_collection/dataStores/cymbaldatastore_1737322168745/branches/0/documents/01c3140622cfed9a86572e550ed049a0"
id: "01c3140622cfed9a86572e550ed049a0"
schema_id: "default_schema"
struct_data {
}
parent_document_id: "01c3140622cfed9a86572e550ed049a0"
content {
  mime_type: "text/html"
  uri: "gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/testing.html"
}
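
`search_doc_id` scans the whole list on every call. If you expect to look up many IDs, building a dictionary once may be more convenient. The sketch below uses `SimpleNamespace` objects as stand-ins for documents; `index_by_parent_id` is a hypothetical helper, and the sample IDs are made up.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def index_by_parent_id(docs: Iterable[Any]) -> Dict[str, List[Any]]:
    """Group documents by parent_document_id for direct lookups."""
    index: Dict[str, List[Any]] = {}
    for doc in docs:
        index.setdefault(doc.parent_document_id, []).append(doc)
    return index


# Made-up stand-ins exposing only the attributes used here.
sample_docs = [
    SimpleNamespace(id="a1", parent_document_id="a1"),
    SimpleNamespace(id="b2", parent_document_id="b2"),
]

index = index_by_parent_id(sample_docs)
print(index.get("a1", []))
print(index.get("missing", []))
```

An ID that is absent yields an empty list, which plays the same role as the "Document not found" message in `search_doc_id`.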



### Task 8. List Indexed URLs

In [9]:
urls = list_indexed_urls(docs=docs)
urls[0]

'gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/testing.html'
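
The helper `estimate_data_store_size` applies to Advanced Website Indexing data stores only: it keeps `http`/`https` URLs and counts each page as 500 KB. The arithmetic is sketched below without the `humanize` dependency. `estimate_website_bytes` is a hypothetical name, the `example.com` URLs are made up, and unlike the helper this sketch anchors the pattern at the start of the string.

```python
import re
from typing import Iterable

BYTES_PER_PAGE = 500_000  # per-page figure used by estimate_data_store_size


def estimate_website_bytes(urls: Iterable[str]) -> int:
    """Estimate the size as (number of http/https URLs) * 500 KB."""
    web_urls = [u for u in urls if re.match(r"https?://", u)]
    return len(web_urls) * BYTES_PER_PAGE


# Cloud Storage URIs are filtered out, so the estimate is zero.
print(estimate_website_bytes(
    ["gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/testing.html"]
))

# Three website pages: 3 * 500,000 bytes.
print(estimate_website_bytes(
    ["https://example.com/a", "https://example.com/b", "http://example.com/c"]
))
```

For the Cloud Storage data store created in this lab, every URI starts with `gs://`, so the estimate does not apply.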

### Task 9. Search Indexed URLs

In [10]:
search_url(urls, "gs://cloud-samples-data/dialogflow-cx/arc-lifeblood")

gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/testing.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/strategy.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/making-blood-components.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/making-your-donation.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/products.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/high-ferritin.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/forms.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/blood-for-transfusion.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/how-you-can-give-life.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/transplantation-immunogenetics-services.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/blood.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/prepare-and-aftercare.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/our-strategy.html
gs://cloud-samples-data/dialogflow-cx/arc-l

In [11]:
search_url(urls, "dialogflow-cx")

gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/testing.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/strategy.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/making-blood-components.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/making-your-donation.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/products.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/high-ferritin.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/forms.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/blood-for-transfusion.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/how-you-can-give-life.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/transplantation-immunogenetics-services.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/blood.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/prepare-and-aftercare.html
gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/our-strategy.html
gs://cloud-samples-data/dialogflow-cx/arc-l
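
`search_url` prints each match. When you want to count or post-process the matches, a variant that returns them may be easier to work with. This is a sketch: `find_urls` and its `ignore_case` option are assumptions added for illustration, and the sample list reuses three URIs from the output above.

```python
from typing import Iterable, List


def find_urls(
    urls: Iterable[str], fragment: str, ignore_case: bool = False
) -> List[str]:
    """Return every URL that contains the given fragment."""
    if ignore_case:
        fragment = fragment.lower()
        return [u for u in urls if fragment in u.lower()]
    return [u for u in urls if fragment in u]


sample_urls = [
    "gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/testing.html",
    "gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/strategy.html",
    "gs://cloud-samples-data/dialogflow-cx/arc-lifeblood/our-strategy.html",
]

matches = find_urls(sample_urls, "strategy")
print(len(matches))
for url in matches:
    print(url)
```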