## Vertex AI Search > Data Source Access Control



Refs:

* https://github.com/GoogleCloudPlatform/generative-ai/blob/main/search/create_datastore_and_search.ipynb 
* https://cloud.google.com/generative-ai-app-builder/docs/data-source-access-control#acl-storage-unstructured



## Pre-requisites 

TODO

* * *

## Colab Setup

To run this notebook in Colab, click the "Open In Colab" badge and run the cells in this section. Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).
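
A minimal sketch of such an authentication cell, assuming the standard `google.colab.auth` helper that Colab runtimes provide. The `authenticate_if_colab` wrapper is a hypothetical name, added so the cell does nothing when run outside Colab:

```python
import sys


def authenticate_if_colab() -> bool:
    """Authenticate to Google Cloud when running inside Colab.

    Returns True if the Colab auth flow ran, False when not in Colab.
    """
    # Colab preloads the google.colab package, so its presence in
    # sys.modules is a common way to detect a Colab runtime.
    if "google.colab" not in sys.modules:
        return False
    from google.colab import auth

    auth.authenticate_user()  # follow the prompts in the popup
    return True


IN_COLAB = authenticate_if_colab()
print(f"Running in Colab: {IN_COLAB}")
```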

## Installs 


In [1]:
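# tuples of (import name, install name[, min_version])
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform'),
    ('google.cloud.storage', 'google-cloud-storage'),
    ('google.cloud.discoveryengine', 'google-cloud-discoveryengine'),
]

# import the submodules explicitly; a bare `import importlib` does not guarantee them
import importlib.metadata
import importlib.util


def version_tuple(version):
    # compare versions numerically, so that '1.10.0' sorts after '1.9.0'
    return tuple(int(part) for part in version.split('.') if part.isdigit())


install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # installed versions are looked up by distribution (install) name, not import name
        if version_tuple(importlib.metadata.version(package[1])) < version_tuple(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user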

installing package google-cloud-discoveryengine


### Restart Kernel (If Installs Occurred)

After the kernel restarts, continue from the cell after this one.


In [2]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Setup
inputs:

In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
print(PROJECT_ID)

demos-vertex-ai


In [2]:
from google.cloud import storage

import json

from google.cloud import discoveryengine_v1alpha as discoveryengine
from google.api_core.client_options import ClientOptions


### parameters:

In [3]:
# PROJECT_ID = '' # set above
REGION = 'us-central1'
EXPERIMENT = 'search-alphabet-investor-pdfs'
SERIES = "generative-ai"

LOCATION="global"

In [4]:
BUCKET = SERIES + EXPERIMENT 
BUCKET_URI = f"gs://{BUCKET}"

### Clients

In [5]:
gcs = storage.Client(project = PROJECT_ID)

### Create Storage Bucket

In [6]:
if not gcs.lookup_bucket(BUCKET):
    print("Bucket does not exist, creating it now...")
    bucketDef = gcs.bucket(BUCKET)
    bucket = gcs.create_bucket(bucketDef, project=PROJECT_ID, location=REGION)
    print(bucket)
else:
    print("Bucket already exists:")
    print(gcs.lookup_bucket(BUCKET))

Bucket already exists:
<Bucket: generative-aisearch-alphabet-investor-pdfs>


## Ingest Data into GCS

### Upload PDFs from Public Folder

Copy PDFs from a public GCS folder to the bucket we created. We'll use `gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs` for demonstration purposes.

In [7]:
# ! gsutil -m cp gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/* $BUCKET_URI # TODO - all pdfs 
! gsutil cp gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/20040630_google_10Q.pdf $BUCKET_URI
! gsutil cp gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/20040930_google_10Q.pdf $BUCKET_URI

Copying gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/20040630_google_10Q.pdf [Content-Type=application/pdf]...
/ [1 files][265.6 KiB/265.6 KiB]                                                
Operation completed over 1 objects/265.6 KiB.                                    
Copying gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/20040930_google_10Q.pdf [Content-Type=application/pdf]...
/ [1 files][962.2 KiB/962.2 KiB]                                                
Operation completed over 1 objects/962.2 KiB.                                    


### Metadata

To set ACLs for Vertex AI Search, we include the permissions in each document's metadata, under `acl_info`.

The following single record shows the format:

```json
{
  "id": "",
  "jsonData": "",
  "content": {
    "mimeType": "application/pdf",
    "uri": "gs://generative-aisearch-alphabet-investor-pdfs/20040630_google_10Q.pdf"
  },
  "acl_info": {
    "readers": [
      {
        "principals": [
          { "group_id": "group_1" },
          { "user_id": "user_1" }
        ]
      }
    ]
  }
}
```
https://cloud.google.com/generative-ai-app-builder/docs/data-source-access-control#acl-storage-unstructured
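
The same record shape can be produced programmatically. This is a minimal sketch, not an official helper: `make_acl_record`, its parameter names, and the `doc-1` id are hypothetical, and `jsonData` defaults to the empty string only because the example above leaves it empty.

```python
import json
from typing import Dict, List, Optional


def make_acl_record(
    doc_id: str,
    uri: str,
    user_ids: Optional[List[str]] = None,
    group_ids: Optional[List[str]] = None,
    mime_type: str = "application/pdf",
    json_data: str = "",
) -> Dict:
    """Build one metadata record in the format shown above."""
    # Groups first, then users, mirroring the order of the example record.
    principals = [{"group_id": g} for g in (group_ids or [])]
    principals += [{"user_id": u} for u in (user_ids or [])]
    return {
        "id": doc_id,
        "jsonData": json_data,
        "content": {"mimeType": mime_type, "uri": uri},
        "acl_info": {"readers": [{"principals": principals}]},
    }


record = make_acl_record(
    doc_id="doc-1",  # hypothetical id, for illustration only
    uri="gs://generative-aisearch-alphabet-investor-pdfs/20040630_google_10Q.pdf",
    user_ids=["user_1"],
    group_ids=["group_1"],
)
print(json.dumps(record, indent=2))
```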

#### Create JSON Metadata File

Create a JSON Lines metadata file (one JSON object per line) that sets the ACL rules.

To start, we specify ACLs for the two PDFs copied above, with a single reader each.


In [8]:
# TODO - fix format and filename to be correct
metadata_filename = "metadata.jsonl"

# One record per document, each readable by a single user.
# NOTE: "id" and "jsonData" are left empty, as in the format example above;
# give every document its own unique id before importing.
metadata = [
    {
        "id": "",
        "jsonData": "",
        "content": {
            # a real MIME type, without the <...> placeholder brackets
            "mimeType": "application/pdf",
            "uri": f"{BUCKET_URI}/20040630_google_10Q.pdf",
        },
        "acl_info": {
            "readers": [
                {"principals": [{"user_id": "bruce@justinjm.altostrat.com"}]}
            ]
        },
    },
    {
        "id": "",
        "jsonData": "",
        "content": {
            "mimeType": "application/pdf",
            "uri": f"{BUCKET_URI}/20040930_google_10Q.pdf",
        },
        "acl_info": {
            "readers": [
                {"principals": [{"user_id": "admin@justinjm.altostrat.com"}]}
            ]
        },
    },
]

# Write to a .jsonl file: one JSON object per line
with open(metadata_filename, "w") as file:
    for item in metadata:
        file.write(json.dumps(item) + "\n")
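
Before uploading, it can help to read the file back and confirm that every line is a standalone JSON object with the expected top-level keys. A small sketch under stated assumptions: `check_metadata_file` and the `REQUIRED_KEYS` set are hypothetical, drawn from the record format in this notebook rather than from an official validator, and the demo writes a throwaway file so the cell runs on its own.

```python
import json
import os
import tempfile

# Assumed from the record format used in this notebook.
REQUIRED_KEYS = {"id", "content", "acl_info"}


def check_metadata_file(path: str) -> int:
    """Return the number of valid records; raise ValueError on a bad line."""
    count = 0
    with open(path) as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_no}: invalid JSON") from exc
            if not isinstance(record, dict):
                raise ValueError(f"line {line_no}: expected a JSON object")
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            count += 1
    return count


# Demo on a temporary file with one well-formed record.
sample = {
    "id": "doc-1",
    "content": {"mimeType": "application/pdf", "uri": "gs://example-bucket/a.pdf"},
    "acl_info": {"readers": []},
}
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    tmp.write(json.dumps(sample) + "\n")
n_valid = check_metadata_file(tmp.name)
os.remove(tmp.name)
print(f"{n_valid} valid record(s)")
```

In the notebook itself, the call would be `check_metadata_file(metadata_filename)`.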

In [9]:

# TODO - add ACL for all files 
## get list of files from GCS 
## pick 5 files to be "secret"
## add bruce to all except "secret"
## save file
## upload file
## create new datastore and search App
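
The plan in the TODO above could be sketched as a pure function over a list of object URIs. Everything here is an assumption for illustration: the `build_acl_records` name, treating the first `n_secret` URIs in sorted order as the "secret" files, granting those only to a separate admin user, and deriving each `id` from a hash of the URI. In the notebook, the URI list would come from listing the bucket, for example with `gcs.list_blobs(BUCKET)`.

```python
import hashlib
import json
from typing import Dict, List


def build_acl_records(
    uris: List[str], reader: str, admin: str, n_secret: int = 5
) -> List[Dict]:
    """Grant `reader` access to every file except the "secret" ones."""
    ordered = sorted(uris)
    secret = set(ordered[:n_secret])  # assumption: first n_secret files are secret
    records = []
    for uri in ordered:
        user = admin if uri in secret else reader
        records.append(
            {
                # assumption: a stable id derived from the URI
                "id": hashlib.sha1(uri.encode("utf-8")).hexdigest()[:16],
                "jsonData": "",
                "content": {"mimeType": "application/pdf", "uri": uri},
                "acl_info": {"readers": [{"principals": [{"user_id": user}]}]},
            }
        )
    return records


# Hypothetical URIs and users, for illustration only.
uris = [f"gs://example-bucket/report_{i}.pdf" for i in range(8)]
records = build_acl_records(uris, reader="reader@example.com", admin="admin@example.com")

# Serialize as JSON Lines, ready to be written to the metadata file.
jsonl = "".join(json.dumps(r) + "\n" for r in records)
print(jsonl)
```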

#### Upload the Metadata File Just Created

In [10]:
! gsutil -m cp $metadata_filename $BUCKET_URI/$metadata_filename

Copying file://metadata.jsonl [Content-Type=application/octet-stream]...
/ [1/1 files][  490.0 B/  490.0 B] 100% Done                                    
Operation completed over 1 objects/490.0 B.                                      


## Create Vertex AI Search Datastore

TODO - API: when creating the data store, include the flag `"aclEnabled": "true"` in the JSON payload. See https://cloud.google.com/generative-ai-app-builder/docs/data-source-access-control#acl-storage-unstructured
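
For the REST route, the note above amounts to adding `aclEnabled` to the data store body. The following sketch only assembles the URL and body and sends nothing. The `build_create_data_store_request` helper is hypothetical, the path follows the `projects.locations.collections.dataStores/create` reference linked in this section, `v1alpha` is assumed to match the client library imported earlier, and the flag is emitted as a JSON boolean rather than the string form quoted in the note.

```python
import json
from typing import Dict, Tuple
from urllib.parse import urlencode


def build_create_data_store_request(
    project_id: str,
    location: str,
    display_name: str,
    data_store_id: str,
    api_version: str = "v1alpha",  # assumption: matches the client library in use
) -> Tuple[str, Dict]:
    """Return (url, body) for a dataStores.create call with ACLs enabled."""
    # Regional locations use a location-prefixed endpoint; "global" does not.
    host = (
        "discoveryengine.googleapis.com"
        if location == "global"
        else f"{location}-discoveryengine.googleapis.com"
    )
    parent = f"projects/{project_id}/locations/{location}/collections/default_collection"
    query = urlencode({"dataStoreId": data_store_id})
    url = f"https://{host}/{api_version}/{parent}/dataStores?{query}"
    body = {
        "displayName": display_name,
        "industryVertical": "GENERIC",
        "contentConfig": "CONTENT_REQUIRED",
        "aclEnabled": True,  # the flag called out in the note above
    }
    return url, body


url, body = build_create_data_store_request(
    "my-project", "global", "test-data-store", "test-data-store-id"
)
print(url)
print(json.dumps(body, indent=2))
```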

* https://cloud.google.com/generative-ai-app-builder/docs/create-data-store-es#cloud-storage
* https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/projects.locations.collections.dataStores/create
* https://github.com/GoogleCloudPlatform/generative-ai/blob/main/search/create_datastore_and_search.ipynb


In [19]:
def create_data_store(
    project_id: str, location: str, data_store_name: str, data_store_id: str
):
    # Create a client; non-global locations need a location-specific endpoint
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )
    client = discoveryengine.DataStoreServiceClient(client_options=client_options)

    # Initialize request argument(s)
    # TODO: enable ACLs on the data store (see the note above)
    data_store = discoveryengine.DataStore(
        display_name=data_store_name,
        industry_vertical="GENERIC",
        content_config="CONTENT_REQUIRED",
    )

    request = discoveryengine.CreateDataStoreRequest(
        parent=discoveryengine.DataStoreServiceClient.collection_path(
            project_id, location, "default_collection"
        ),
        data_store=data_store,
        data_store_id=data_store_id,
    )

    # Make the request; this returns a long-running operation
    operation = client.create_data_store(request=request)

    # Wait up to 90 seconds. The try block keeps execution from halting when
    # the data store takes longer than that to instantiate.
    try:
        response = operation.result(timeout=90)
        print(f"created data store: {response.name}")
    except Exception as exc:
        print(f"long-running operation: {exc!r}")

In [20]:
# The datastore name can only contain lowercase letters, numbers, and hyphens
# DATASTORE_NAME = EXPERIMENT
DATASTORE_NAME = "test-data-store"
DATASTORE_ID = f"{DATASTORE_NAME}-id"

create_data_store(PROJECT_ID, LOCATION, DATASTORE_NAME, DATASTORE_ID)
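
The naming rule in the comment above can be checked before calling the API. A minimal sketch that encodes only that stated rule (lowercase letters, numbers, and hyphens). `is_valid_data_store_name` is a hypothetical helper, and the service may enforce further limits, such as length, that are not covered here:

```python
import re

# Only the rule stated in the notebook: lowercase letters, numbers, hyphens.
_NAME_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_data_store_name(value: str) -> bool:
    """Return True if `value` uses only the allowed characters."""
    return _NAME_PATTERN.fullmatch(value) is not None


for candidate in ("test-data-store", "test-data-store-id", "Test_Data_Store"):
    print(candidate, is_valid_data_store_name(candidate))
```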

## Ingest data from Cloud Storage 



Refs

* https://cloud.google.com/generative-ai-app-builder/docs/create-data-store-es#discoveryengine_v1_generated_DocumentService_ImportDocuments_sync-python

Helper function to import data.

TODO - When following the data import steps in "Create a search data store", make sure to do the following: if using the API, set `GcsSource.dataSchema` to `document`.

https://cloud.google.com/generative-ai-app-builder/docs/data-source-access-control#before-you-begin
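
Expressed as a REST request body, the note above looks roughly like the following. A sketch only: `build_import_request_body` is a hypothetical helper, the field names are assumed to follow the `GcsSource` and `documents:import` REST reference, and the URI points at the `metadata.jsonl` file uploaded earlier in this notebook.

```python
import json
from typing import Dict, List


def build_import_request_body(input_uris: List[str], incremental: bool = True) -> Dict:
    """Build a documents:import body that reads JSONL document records."""
    return {
        "gcsSource": {
            "inputUris": list(input_uris),
            # "document": each input line is one JSON document record,
            # which is where the acl_info block lives.
            "dataSchema": "document",
        },
        "reconciliationMode": "INCREMENTAL" if incremental else "FULL",
    }


import_body = build_import_request_body(
    ["gs://generative-aisearch-alphabet-investor-pdfs/metadata.jsonl"]
)
print(json.dumps(import_body, indent=2))
```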


In [None]:
def import_documents(
    project_id: str,
    location: str,
    data_store_id: str,
    gcs_uri: str,
):
    # Create a client
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )
    client = discoveryengine.DocumentServiceClient(client_options=client_options)

    # The full resource name of the search engine branch.
    # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )

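    # With ACLs, import the JSONL metadata file(s) rather than the raw PDFs;
    # each record points at its PDF through content.uri.
    source_documents = [f"{gcs_uri}/*.jsonl"]

    request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        gcs_source=discoveryengine.GcsSource(
            # "document": one JSON document record per line (see the note above)
            input_uris=source_documents, data_schema="document"
        ),
        # Options: `FULL`, `INCREMENTAL`
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
    )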

    # Make the request
    operation = client.import_documents(request=request)

    response = operation.result()

    # Once the operation is complete,
    # get information from operation metadata
    metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)

    # Handle the response
    return operation.operation.name

In [None]:
# import_documents(PROJECT_ID, LOCATION, DATASTORE_ID, BUCKET_URI)

## Create Vertex AI Search Engine 

TODO

* https://cloud.google.com/generative-ai-app-builder/docs/create-engine-es


In [None]:
# def create_engine(
#     project_id: str, location: str, data_store_name: str, data_store_id: str
# ):
#     # Create a client
#     client_options = (
#         ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
#         if location != "global"
#         else None
#     )
#     client = discoveryengine.EngineServiceClient(client_options=client_options)

#     # Initialize request argument(s)
#     config = discoveryengine.Engine.SearchEngineConfig(
#         search_tier="SEARCH_TIER_ENTERPRISE", search_add_ons=["SEARCH_ADD_ON_LLM"]
#     )

#     engine = discoveryengine.Engine(
#         display_name=data_store_name,
#         solution_type="SOLUTION_TYPE_SEARCH",
#         industry_vertical="GENERIC",
#         data_store_ids=[data_store_id],
#         search_engine_config=config,
#     )

#     request = discoveryengine.CreateEngineRequest(
#         parent=discoveryengine.DataStoreServiceClient.collection_path(
#             project_id, location, "default_collection"
#         ),
#         engine=engine,
#         engine_id=engine.display_name,
#     )

#     # Make the request
#     operation = client.create_engine(request=request)
#     response = operation.result(timeout=90)

In [None]:
# create_engine(PROJECT_ID, LOCATION, DATASTORE_NAME, DATASTORE_ID)

### Query your datastore

In [14]:
# from typing import List


# def search_sample(
#     project_id: str,
#     location: str,
#     data_store_id: str,
#     search_query: str,
# ) -> List[discoveryengine.SearchResponse]:
#     #  For more information, refer to:
#     # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
#     client_options = (
#         ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
#         if location != "global"
#         else None
#     )

#     # Create a client
#     client = discoveryengine.SearchServiceClient(client_options=client_options)

#     # The full resource name of the search engine serving config
#     # e.g. projects/{project_id}/locations/{location}/dataStores/{data_store_id}/servingConfigs/{serving_config_id}
#     serving_config = client.serving_config_path(
#         project=project_id,
#         location=location,
#         data_store=data_store_id,
#         serving_config="default_config",
#     )

#     # Optional: Configuration options for search
#     # Refer to the `ContentSearchSpec` reference for all supported fields:
#     # https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.types.SearchRequest.ContentSearchSpec
#     content_search_spec = discoveryengine.SearchRequest.ContentSearchSpec(
#         # For information about snippets, refer to:
#         # https://cloud.google.com/generative-ai-app-builder/docs/snippets
#         snippet_spec=discoveryengine.SearchRequest.ContentSearchSpec.SnippetSpec(
#             return_snippet=True
#         ),
#         # For information about search summaries, refer to:
#         # https://cloud.google.com/generative-ai-app-builder/docs/get-search-summaries
#         summary_spec=discoveryengine.SearchRequest.ContentSearchSpec.SummarySpec(
#             summary_result_count=5,
#             include_citations=True,
#             ignore_adversarial_query=True,
#             ignore_non_summary_seeking_query=True,
#         ),
#     )

#     # Refer to the `SearchRequest` reference for all supported fields:
#     # https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.types.SearchRequest
#     request = discoveryengine.SearchRequest(
#         serving_config=serving_config,
#         query=search_query,
#         page_size=10,
#         content_search_spec=content_search_spec,
#         query_expansion_spec=discoveryengine.SearchRequest.QueryExpansionSpec(
#             condition=discoveryengine.SearchRequest.QueryExpansionSpec.Condition.AUTO,
#         ),
#         spell_correction_spec=discoveryengine.SearchRequest.SpellCorrectionSpec(
#             mode=discoveryengine.SearchRequest.SpellCorrectionSpec.Mode.AUTO
#         ),
#     )

#     response = client.search(request)
#     return response

In [None]:
# query = "Who is the CEO of Google?"

# print(search_sample(PROJECT_ID, LOCATION, DATASTORE_ID, query).summary.summary_text)