# Create a Vertex AI Datastore and Search Engine

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/search/create_datastore_and_search.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/search/create_datastore_and_search.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/search/create_datastore_and_search.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

<div style="clear: both;"></div>

---

* Author(s): [Kara Greenfield](https://github.com/kgreenfield2)
* Created: 22 Nov 2023
* Updated: 31 Oct 2024

---

## Objective

This notebook shows how to create and populate a Vertex AI Search Datastore, how to create a search app connected to that datastore, and how to submit queries through the search engine.


Services used in the notebook:

- ✅ Vertex AI Search for document search and retrieval

## Install pre-requisites

If running in Colab, install the prerequisites into the runtime. Otherwise, the notebook is assumed to be running in Vertex AI Workbench.

In [None]:
%pip install --upgrade --user -q google-cloud-discoveryengine

### Restart current runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [None]:
# Restart kernel after installs so that your environment can access the new packages

import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>
</div>


## Authenticate

If running in Colab, authenticate with `google.colab.auth`. Otherwise, the notebook is assumed to be running on Vertex AI Workbench.

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

## Configure notebook environment

In [1]:
from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine

PROJECT_ID = "sandbox-373102"  # @param {type:"string"}
LOCATION = "global"

Set [Application Default Credentials](https://cloud.google.com/docs/authentication/application-default-credentials)

In [None]:
!gcloud auth application-default login --project {PROJECT_ID}

## Create and Populate a Datastore

In [2]:
def create_data_store(
    project_id: str, location: str, data_store_name: str, data_store_id: str
):
    # Create a client
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )
    client = discoveryengine.DataStoreServiceClient(client_options=client_options)

    # Initialize request argument(s)
    data_store = discoveryengine.DataStore(
        display_name=data_store_name,
        industry_vertical=discoveryengine.IndustryVertical.GENERIC,
        content_config=discoveryengine.DataStore.ContentConfig.CONTENT_REQUIRED,
    )

    operation = client.create_data_store(
        request=discoveryengine.CreateDataStoreRequest(
            parent=client.collection_path(project_id, location, "default_collection"),
            data_store=data_store,
            data_store_id=data_store_id,
        )
    )

    # Make the request.
    # Creating a data store can take longer than the timeout below; catch the
    # resulting error so the notebook keeps running instead of halting.
    try:
        response = operation.result(timeout=90)
    except Exception as e:
        print(f"long-running operation error: {e}")
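
The endpoint selection at the top of `create_data_store` is repeated in every client constructor in this notebook. A minimal sketch of factoring it into a helper follows; `discovery_endpoint` is a hypothetical name, not part of the client library, and it returns only the endpoint string so it can be checked without network access.

```python
from typing import Optional


def discovery_endpoint(location: str) -> Optional[str]:
    """Return the regional API endpoint, or None for the default global endpoint.

    Hypothetical helper: it mirrors the f-string used in this notebook's cells.
    """
    if location == "global":
        return None
    return f"{location}-discoveryengine.googleapis.com"


print(discovery_endpoint("global"))  # None
print(discovery_endpoint("eu"))  # eu-discoveryengine.googleapis.com
```

When the result is not `None`, it could be passed as `ClientOptions(api_endpoint=...)`, as the cells here already do inline.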

In [3]:
# Keep the data store ID to lowercase letters, numbers, and hyphens (the underscore used here was also accepted)
DATASTORE_NAME = "pension_report"
DATASTORE_ID = f"{DATASTORE_NAME}-id"

# Blended search needs at least two data stores; this call creates the first one
create_data_store(PROJECT_ID, LOCATION, DATASTORE_NAME, DATASTORE_ID)

In [4]:
def import_documents(
    project_id: str,
    location: str,
    data_store_id: str,
    bigquery_dataset: str,
    bigquery_table: str,
):
    # Create a client
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )
    client = discoveryengine.DocumentServiceClient(client_options=client_options)

    # The full resource name of the search engine branch.
    # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )

    request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        bigquery_source=discoveryengine.BigQuerySource(
            project_id=project_id,
            dataset_id=bigquery_dataset,
            table_id=bigquery_table,
            data_schema="document",
        ),
        # Options: `FULL`, `INCREMENTAL`
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
    )

    # Make the request
    operation = client.import_documents(request=request)

    response = operation.result()

    # Once the operation is complete,
    # get information from operation metadata
    metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)

    # Return the name of the completed long-running operation
    return operation.operation.name
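
The two reconciliation modes named in `import_documents` differ in how they treat documents that have disappeared from the source. The following is a toy model only, using plain dictionaries in place of the data store and the BigQuery table; `reconcile` is a hypothetical function and does not reflect how the service is implemented.

```python
from typing import Dict


def reconcile(store: Dict[str, str], source: Dict[str, str], mode: str) -> Dict[str, str]:
    """Toy model of the two reconciliation modes (not the service's implementation)."""
    if mode == "INCREMENTAL":
        # Insert new documents and update existing ones; nothing is deleted.
        return {**store, **source}
    if mode == "FULL":
        # The store ends up mirroring the source, so documents missing from it are dropped.
        return dict(source)
    raise ValueError(f"unknown mode: {mode}")


store = {"doc-2019": "v1", "doc-2020": "v1"}
source = {"doc-2020": "v2", "doc-2021": "v1"}  # doc-2019 was deleted from the source

print(sorted(reconcile(store, source, "INCREMENTAL")))  # ['doc-2019', 'doc-2020', 'doc-2021']
print(sorted(reconcile(store, source, "FULL")))  # ['doc-2020', 'doc-2021']
```

In this model, a document removed from the source survives an `INCREMENTAL` import and disappears only after a `FULL` one.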

## Create a Search Engine

The engine created below sets `search_tier` to Enterprise and enables the advanced LLM add-on.

The Enterprise tier is required to get extractive answers from a search query, and the advanced LLM features are required to summarize search results.

In [5]:
def create_engine(
    project_id: str, location: str, engine_id: str, data_store_name: str, data_store_id: str
):
    # Create a client
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )
    client = discoveryengine.EngineServiceClient(client_options=client_options)

    # Initialize request argument(s)
    engine = discoveryengine.Engine(
        display_name=data_store_name,
        solution_type=discoveryengine.SolutionType.SOLUTION_TYPE_SEARCH,
        industry_vertical=discoveryengine.IndustryVertical.GENERIC,
        data_store_ids=[data_store_id],
        search_engine_config=discoveryengine.Engine.SearchEngineConfig(
            search_tier=discoveryengine.SearchTier.SEARCH_TIER_ENTERPRISE,
            search_add_ons=[discoveryengine.SearchAddOn.SEARCH_ADD_ON_LLM],
        ),
    )

    request = discoveryengine.CreateEngineRequest(
        parent=client.collection_path(project_id, location, "default_collection"),
        engine=engine,
        engine_id=engine_id,
    )

    # Make the request
    operation = client.create_engine(request=request)
    response = operation.result(timeout=90)

In [7]:
# Creating an engine usually takes more than 5 minutes
import uuid

ENGINE_ID = str(uuid.uuid4())
create_engine(PROJECT_ID, LOCATION, ENGINE_ID, DATASTORE_NAME, DATASTORE_ID)
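
Engine creation can outlast the 90-second timeout passed to `operation.result` in `create_engine`. One possible alternative is to poll the operation until it reports completion. The sketch below assumes only that the operation object exposes a `done()` method, as `google.api_core.operation.Operation` does; `wait_until_done` and `FakeOperation` are hypothetical names used for illustration.

```python
import time
from typing import Any, Callable


def wait_until_done(
    operation: Any,
    timeout_s: float = 900.0,
    poll_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `operation.done()` until it returns True or `timeout_s` elapses.

    Returns True if the operation finished in time, False otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        if operation.done():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


class FakeOperation:
    """Stand-in for a long-running operation that finishes on the third poll."""

    def __init__(self) -> None:
        self.polls = 0

    def done(self) -> bool:
        self.polls += 1
        return self.polls >= 3


op = FakeOperation()
print(wait_until_done(op, timeout_s=10, poll_s=0, sleep=lambda _: None))  # True
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting; the defaults use the standard library.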

In [6]:
# Importing from an empty BigQuery table took around 10 minutes
BQ_DATASET = "documents"
BQ_TABLE = "sandbox-373102"
import_documents(PROJECT_ID, LOCATION, DATASTORE_ID, BQ_DATASET, BQ_TABLE)

'projects/1045259343465/locations/global/collections/default_collection/dataStores/pension_report-id/branches/0/operations/import-documents-11336015597626721256'

## Query your Search Engine

Note: The Engine will take some time to be ready to query.

If you recently created an engine and you receive an error similar to:

`404 Engine {ENGINE_ID} is not found`

Then wait a few minutes and try your query again.
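
The wait-and-retry advice above can be sketched as a small helper. This is an illustration under assumptions: `retry_while_not_ready` and `EngineNotFound` are hypothetical names, and with the real client the exception to retry on would likely be `google.api_core.exceptions.NotFound`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_while_not_ready(
    call: Callable[[], T],
    not_ready: Tuple[Type[BaseException], ...],
    attempts: int = 5,
    delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call()` until it succeeds, retrying only on the `not_ready` exceptions."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except not_ready:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            sleep(delay_s)
    raise ValueError("attempts must be at least 1")


class EngineNotFound(Exception):
    """Stand-in for the 404 error raised while the engine is still being set up."""


calls = {"n": 0}


def flaky_search() -> str:
    # Fails twice, then succeeds, imitating an engine that becomes ready.
    calls["n"] += 1
    if calls["n"] < 3:
        raise EngineNotFound("404 Engine is not found")
    return "ok"


print(retry_while_not_ready(flaky_search, (EngineNotFound,), delay_s=0, sleep=lambda _: None))  # ok
```

Errors other than the listed ones propagate immediately, so a typo in the engine ID is not retried for minutes.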

In [15]:
def search_sample(
    project_id: str,
    location: str,
    engine_id: str,
    search_query: str,
) -> list[discoveryengine.SearchResponse]:
    #  For more information, refer to:
    # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )

    # Create a client
    client = discoveryengine.SearchServiceClient(client_options=client_options)

    # The full resource name of the search engine serving config
    # e.g. projects/{project_id}/locations/{location}/collections/default_collection/engines/{engine_id}/servingConfigs/{serving_config_id}
    serving_config = f"projects/{project_id}/locations/{location}/collections/default_collection/engines/{engine_id}/servingConfigs/default_search"

    # Optional: Configuration options for search
    # Refer to the `ContentSearchSpec` reference for all supported fields:
    # https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.types.SearchRequest.ContentSearchSpec
    content_search_spec = discoveryengine.SearchRequest.ContentSearchSpec(
        # For information about snippets, refer to:
        # https://cloud.google.com/generative-ai-app-builder/docs/snippets
        snippet_spec=discoveryengine.SearchRequest.ContentSearchSpec.SnippetSpec(
            return_snippet=True
        ),
        # For information about search summaries, refer to:
        # https://cloud.google.com/generative-ai-app-builder/docs/get-search-summaries
        summary_spec=discoveryengine.SearchRequest.ContentSearchSpec.SummarySpec(
            summary_result_count=5,
            include_citations=True,
            ignore_adversarial_query=True,
            ignore_non_summary_seeking_query=True,
        ),
    )

    # Refer to the `SearchRequest` reference for all supported fields:
    # https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.types.SearchRequest
    request = discoveryengine.SearchRequest(
        serving_config=serving_config,
        query=search_query,
        page_size=10,
        content_search_spec=content_search_spec,
        query_expansion_spec=discoveryengine.SearchRequest.QueryExpansionSpec(
            condition=discoveryengine.SearchRequest.QueryExpansionSpec.Condition.AUTO,
        ),
        spell_correction_spec=discoveryengine.SearchRequest.SpellCorrectionSpec(
            mode=discoveryengine.SearchRequest.SpellCorrectionSpec.Mode.AUTO
        ),
    )

    response = client.search(request)
    return response
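
The serving config name built inside `search_sample` follows a fixed pattern. As a small sketch, it can be produced by a standalone function so the string can be inspected before any request is sent; `engine_serving_config` is a hypothetical helper, and its defaults simply repeat the values used in this notebook.

```python
def engine_serving_config(
    project_id: str,
    location: str,
    engine_id: str,
    collection: str = "default_collection",
    serving_config_id: str = "default_search",
) -> str:
    """Build the serving config resource name for an engine (hypothetical helper)."""
    return (
        f"projects/{project_id}/locations/{location}"
        f"/collections/{collection}/engines/{engine_id}"
        f"/servingConfigs/{serving_config_id}"
    )


print(engine_serving_config("my-project", "global", "my-engine"))
# projects/my-project/locations/global/collections/default_collection/engines/my-engine/servingConfigs/default_search
```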

In [18]:
query = "국민연금 연도별 수익률"  # "National Pension returns by year"

response = search_sample(PROJECT_ID, LOCATION, ENGINE_ID, query)
print(response.summary.summary_text)

No results were found. Try modifying your search query.


In [19]:
# Took 10 minutes
import_documents(PROJECT_ID, LOCATION, DATASTORE_ID, BQ_DATASET, BQ_TABLE)

'projects/1045259343465/locations/global/collections/default_collection/dataStores/pension_report-id/branches/0/operations/import-documents-12149196499042085082'

In [21]:
query = "국민연금 연도별 수익률"  # "National Pension returns by year"

response = search_sample(PROJECT_ID, LOCATION, ENGINE_ID, query)
print(response.summary.summary_text)

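As a long-term investor, the National Pension Service has achieved solid returns over the medium to long term [1]. From the fund's establishment through the end of 2019, the annualized cumulative return was 5.86%, and cumulative earnings totaled approximately KRW 367.5 trillion [1]. The 3-year average return was 5.87% and the 5-year average return was 5.45% [1]. In 2019, the return on actively managed direct investments was 15.87%, exceeding the benchmark by 1.06 percentage points, and the return on actively managed externally delegated investments was 9.30%, exceeding the benchmark by 0.69 percentage points [1]. For 2019, the 3-year return is the average for 2017-2019 and the 5-year return is the average for 2015-2019 [1].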
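
Because `summary_spec` sets `include_citations=True`, the summary text carries bracketed markers such as `[1]`. A minimal sketch of pulling those markers out of the printed text follows; `cited_sources` is a hypothetical helper that works on the plain string only and does not use any citation metadata the API may return.

```python
import re
from typing import List


def cited_sources(summary_text: str) -> List[int]:
    """Return the distinct citation numbers in a summary, in order of first appearance."""
    seen: List[int] = []
    for match in re.finditer(r"\[(\d+)\]", summary_text):
        index = int(match.group(1))
        if index not in seen:
            seen.append(index)
    return seen


print(cited_sources("Returns were solid [2]. The cumulative return was 5.11% [3]. See also [2]."))  # [2, 3]
```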



In [22]:
import_documents(PROJECT_ID, LOCATION, DATASTORE_ID, BQ_DATASET, BQ_TABLE)

'projects/1045259343465/locations/global/collections/default_collection/dataStores/pension_report-id/branches/0/operations/import-documents-3743752173518605490'

In [23]:
query = "국민연금 연도별 수익률"  # "National Pension returns by year"

response = search_sample(PROJECT_ID, LOCATION, ENGINE_ID, query)
print(response.summary.summary_text)

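As a long-term investor, the National Pension Service has achieved solid returns over the medium to long term [2]. From the fund's establishment in 1988 through the end of 2022, the annualized cumulative return was 5.11%, and cumulative earnings were KRW 451.3 trillion [3].

The following is National Pension return information by year:

*   **2019:** Through the end of 2019, the annualized cumulative return was 5.86%, and cumulative earnings totaled KRW 367.5 trillion [2]. The 3-year average return was 5.87% and the 5-year average return was 5.45% [2].
*   **2020:** From the fund's establishment in 1988 through the end of 2020, the annualized cumulative return was 6.27%, and cumulative earnings totaled KRW 439.6 trillion [1]. The 3-year average return was 6.89% and the 5-year average return was 6.60% [1].
*   **2021:** From the fund's establishment in 1988 through the end of 2021, the annualized cumulative return was 6.76%, and cumulative earnings were KRW 530.8 trillion [4].
*   **2022:** The annual investment return for 2022 was -8.22% [3]. From the establishment of the National Pension Fund in 1988 through the end of 2022, the annualized cumulative return was 5.11%, and cumulative earnings were KRW 451.3 trillion [3].

The return on domestic equities in 2022 was -22.75% [3]. The 3-year average return for 2020-2022 was 3.27%, and the 5-year average return for 2018-2022 was 0.58% [3]. The return on domestic equities in 2021 was 5.88% [4]. The 3-year annualized return for 2019-2021 was 17.04%, and the 5-year annualized return for 2017-2021 was 10.97% [4].

Domestic equity holdings by sector in 2022 were IT 30.3%, industrials 16.8%, materials 10.0%, financials 9.4%, consumer discretionary 9.3%, health care 7.6%, communication services 7.5%, consumer staples 5.2%, energy 1.9%, utilities 1.2%, real estate 0.2%, and other 0.6% [3]. Domestic equity holdings by sector in 2021 were information technology 37.4%, communication services 11.0%, industrials 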

In [24]:
# Check whether deleted documents are purged
import_documents(PROJECT_ID, LOCATION, DATASTORE_ID, BQ_DATASET, BQ_TABLE)

'projects/1045259343465/locations/global/collections/default_collection/dataStores/pension_report-id/branches/0/operations/import-documents-17735843912080028684'

In [26]:
# To make sure deleted items are no longer returned, run the import with FULL reconciliation
query = "국민연금 연도별 수익률"  # "National Pension returns by year"

response = search_sample(PROJECT_ID, LOCATION, ENGINE_ID, query)
print(response.summary.summary_text)

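As a long-term investor, the National Pension Service has achieved solid returns over the medium to long term [2]. From the fund's establishment in 1988 through the end of 2022, the annualized cumulative return was 5.11%, and cumulative earnings were KRW 451.3 trillion [3].

The following is National Pension return information by year:

*   **2019:** Through the end of 2019, the annualized cumulative return was 5.86%, and cumulative earnings totaled KRW 367.5 trillion [2]. The 3-year average return was 5.87% and the 5-year average return was 5.45% [2].
*   **2020:** From the fund's establishment in 1988 through the end of 2020, the annualized cumulative return was 6.27%, and cumulative earnings totaled KRW 439.6 trillion [1]. The 3-year average return was 6.89% and the 5-year average return was 6.60% [1].
*   **2021:** From the fund's establishment in 1988 through the end of 2021, the annualized cumulative return was 6.76%, and cumulative earnings were KRW 530.8 trillion [4].
*   **2022:** The annual investment return for 2022 was -8.22% [3]. From the establishment of the National Pension Fund in 1988 through the end of 2022, the annualized cumulative return was 5.11%, and cumulative earnings were KRW 451.3 trillion [3].

The return on domestic equities in 2022 was -22.75% [3]. The 3-year average return for 2020-2022 was 3.27%, and the 5-year average return for 2018-2022 was 0.58% [3]. The return on domestic equities in 2021 was 5.88% [4]. The 3-year annualized return for 2019-2021 was 17.04%, and the 5-year annualized return for 2017-2021 was 10.97% [4].

Domestic equity holdings by sector in 2022 were IT 30.3%, industrials 16.8%, materials 10.0%, financials 9.4%, consumer discretionary 9.3%, health care 7.6%, communication services 7.5%, consumer staples 5.2%, energy 1.9%, utilities 1.2%, real estate 0.2%, and other 0.6% [3]. Domestic equity holdings by sector in 2021 were information technology 37.4%, communication services 11.0%, industrials 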

In [None]:
# Add the next data store