## Vertex AI Search > Data Source Access Control



Refs:

https://cloud.google.com/generative-ai-app-builder/docs/data-source-access-control#acl-storage-unstructured



## Prerequisites

* Set up GCP




## Setup
inputs:

In [5]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
print(PROJECT_ID)

demos-vertex-ai


In [6]:
from google.cloud import storage

import json

import requests
import os


parameters:

In [7]:
# PROJECT_ID = '' # set above
REGION = 'us-central1'
EXPERIMENT = 'search-alphabet-investor-pdfs'
SERIES = "generative-ai"

In [8]:
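# SERIES and EXPERIMENT are concatenated without a separator,
# giving "generative-aisearch-alphabet-investor-pdfs"
BUCKET = SERIES + EXPERIMENT
BUCKET_URI = f"gs://{BUCKET}"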

### Create Storage Bucket

In [9]:
gcs = storage.Client(project = PROJECT_ID)

In [10]:
# Look the bucket up once; create it only if it does not exist yet
bucket = gcs.lookup_bucket(BUCKET)
if bucket is None:
    bucket = gcs.create_bucket(gcs.bucket(BUCKET), project=PROJECT_ID, location=REGION)
print(bucket)

<Bucket: generative-aisearch-alphabet-investor-pdfs>


## Ingest data into GCS



### PDFs 

TODO - copy all files from the public GCS folder to the bucket we created:

gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs

#### Upload

In [11]:
# ! gsutil -m cp gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/* $BUCKET_URI

! gsutil cp gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/20040630_google_10Q.pdf $BUCKET_URI
! gsutil cp gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/20040930_google_10Q.pdf $BUCKET_URI

Copying gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/20040630_google_10Q.pdf [Content-Type=application/pdf]...
/ [1 files][265.6 KiB/265.6 KiB]                                                
Operation completed over 1 objects/265.6 KiB.                                    
Copying gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/20040930_google_10Q.pdf [Content-Type=application/pdf]...
/ [1 files][962.2 KiB/962.2 KiB]                                                
Operation completed over 1 objects/962.2 KiB.                                    


### Metadata

#### Format

The metadata file follows the format documented here:

https://cloud.google.com/generative-ai-app-builder/docs/data-source-access-control#acl-storage-unstructured

```json
{
  "id": "",
  "jsonData": "",
  "content": {
    "mimeType": "application/pdf",
    "uri": "gs://generative-aisearch-alphabet-investor-pdfs/20040630_google_10Q.pdf"
  },
  "acl_info": {
    "readers": [
      {
        "principals": [
          { "group_id": "group_1" },
          { "user_id": "user_1" }
        ]
      }
    ]
  }
}
```


#### Create JSON metadata

Create a JSON Lines file of metadata that sets the ACL rules.

To start, we specify ACLs only for the two files uploaded above, each with a single reader.

https://cloud.google.com/generative-ai-app-builder/docs/data-source-access-control#acl-storage-unstructured

In [12]:
# TODO - add ACL for all files 
## get list of files from GCS 
## pick 5 files to be "secret"
## add bruce to all except "secret"
## save file
## upload file
## create new datastore and search App
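The TODO steps above could be sketched roughly as follows. This is only an illustration: `build_acl_records` and `to_jsonl` are hypothetical helpers, the `doc-N` ID scheme is made up, and granting the admin account access to every file is an assumption rather than something the notes state.

```python
import json
from typing import Iterable


def build_acl_records(
    uris: Iterable[str],
    secret_uris: set[str],
    reader: str,
    admin: str,
) -> list[dict]:
    """Build one metadata record per PDF URI.

    Files in `secret_uris` are readable by `admin` only; every other
    file is readable by both `reader` and `admin` (an assumption).
    """
    records = []
    for index, uri in enumerate(sorted(uris)):
        principals = [{"user_id": admin}]
        if uri not in secret_uris:
            principals.insert(0, {"user_id": reader})
        records.append(
            {
                "id": f"doc-{index}",  # hypothetical ID scheme
                "jsonData": "",
                "content": {"mimeType": "application/pdf", "uri": uri},
                "acl_info": {"readers": [{"principals": principals}]},
            }
        )
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)
```

The URIs could come from listing the bucket, for example `[f"gs://{BUCKET}/{b.name}" for b in gcs.list_blobs(BUCKET) if b.name.endswith(".pdf")]`, and the "secret" subset could be any five of them.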

In [13]:
metadata_filename = "metadata.jsonl"

# One record per document; each record lists the principals allowed to read it.
# NOTE: "id" is left blank here; the import likely requires a unique,
# non-empty ID per document.
metadata = [
    {
        "id": "",
        "jsonData": "",
        "content": {
            "mimeType": "application/pdf",
            "uri": "gs://generative-aisearch-alphabet-investor-pdfs/20040630_google_10Q.pdf",
        },
        "acl_info": {
            "readers": [
                {"principals": [{"user_id": "bruce@justinjm.altostrat.com"}]}
            ]
        },
    },
    {
        "id": "",
        "jsonData": "",
        "content": {
            "mimeType": "application/pdf",
            "uri": "gs://generative-aisearch-alphabet-investor-pdfs/20040930_google_10Q.pdf",
        },
        "acl_info": {
            "readers": [
                {"principals": [{"user_id": "admin@justinjm.altostrat.com"}]}
            ]
        },
    },
]

# Write one JSON object per line (JSON Lines)
with open(metadata_filename, "w") as file:
    for item in metadata:
        file.write(json.dumps(item) + "\n")
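Before uploading, it may be worth sanity-checking the file. The sketch below is a hypothetical helper, not part of any Google library; the checks it performs (a `gs://` URI, a plausible MIME type, at least one ACL principal) are assumptions about what a later import would reject, not documented validation rules.

```python
import json


def find_metadata_problems(lines):
    """Return human-readable problems found in JSONL metadata lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"line {number}: invalid JSON ({error.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        content = record.get("content") or {}
        if not str(content.get("uri", "")).startswith("gs://"):
            problems.append(f"line {number}: content.uri is not a gs:// URI")
        mime_type = str(content.get("mimeType", ""))
        if "/" not in mime_type or "<" in mime_type or ">" in mime_type:
            problems.append(f"line {number}: suspicious mimeType {mime_type!r}")
        readers = (record.get("acl_info") or {}).get("readers") or []
        if not any(reader.get("principals") for reader in readers):
            problems.append(f"line {number}: no ACL principals listed")
    return problems
```

A possible use is `find_metadata_problems(open(metadata_filename))`, printing whatever it returns before running the upload cell.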

#### Upload

Upload the metadata file just created.

In [14]:
! gsutil -m cp $metadata_filename $BUCKET_URI/$metadata_filename

Copying file://metadata.jsonl [Content-Type=application/octet-stream]...
/ [1/1 files][  490.0 B/  490.0 B] 100% Done                                    
Operation completed over 1 objects/490.0 B.                                      


## Create Vertex AI Search Datastore


TODO - Create via API or GUI?

Example curl request:


```sh
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-H "X-Goog-User-Project: PROJECT_ID" \
"https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores?dataStoreId=DATA_STORE_ID" \
-d '{
  "displayName": "DATA_STORE_DISPLAY_NAME",
  "industryVertical": "GENERIC",
  "solutionTypes": ["SOLUTION_TYPE_SEARCH"]
}'
```


https://cloud.google.com/generative-ai-app-builder/docs/create-data-store-es#cloud-storage


In [15]:
# helper function
def create_vertex_datastore(project_id, data_store_id, display_name):
    """
    Creates a data store in Vertex AI Search.

    Args:
        project_id (str): Your Google Cloud project ID.
        data_store_id (str): The ID for the new data store.
        display_name (str): The display name for the data store.

    Returns:
        requests.Response: The response from the API call.
    """

    # Get an OAuth access token using gcloud
    access_token = os.popen("gcloud auth print-access-token").read().strip()

    # Construct the API endpoint URL
    url = (
        f"https://discoveryengine.googleapis.com/v1alpha/projects/{project_id}/"
        f"locations/global/collections/default_collection/dataStores"
        f"?dataStoreId={data_store_id}"
    )

    # Payload for the request
    data = {
        "displayName": display_name,
        "industryVertical": "GENERIC",
        "solutionTypes": ["SOLUTION_TYPE_SEARCH"]
    }

    # Headers for authorization and content type
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
        "X-Goog-User-Project": project_id
    }

    # Make the POST request
    response = requests.post(url, headers=headers, json=data)

    # Return the response for further handling, if needed
    return response




In [16]:
# Example usage (replace with your values)
project_id = PROJECT_ID
data_store_id = EXPERIMENT
display_name = "Test Data Store"

response = create_vertex_datastore(project_id, data_store_id, display_name)

if response.status_code == 200:
    print("Data store created successfully!")
else:
    print(f"Failed to create data store. Response: {response.text}")

Data store created successfully!
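The payload in `create_vertex_datastore` does not set the `aclEnabled` flag that the access-control documentation asks for when the data store is created. A minimal sketch of a payload builder that includes it is shown below; `build_datastore_payload` is a hypothetical helper, and sending the flag as a JSON boolean (rather than the string `"true"` quoted in the notes) is an assumption.

```python
def build_datastore_payload(display_name: str, acl_enabled: bool = True) -> dict:
    """Build the JSON body for a data store creation request."""
    payload = {
        "displayName": display_name,
        "industryVertical": "GENERIC",
        "solutionTypes": ["SOLUTION_TYPE_SEARCH"],
    }
    if acl_enabled:
        # Assumption: the flag is sent as a JSON boolean
        payload["aclEnabled"] = True
    return payload
```

The result could be passed as the `json=` argument of `requests.post` in place of the hard-coded `data` dictionary.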


### Ingest data from Cloud Storage 

Create datastore via UI

* https://cloud.google.com/generative-ai-app-builder/docs/create-data-store-es#cloud-storage
* https://cloud.google.com/generative-ai-app-builder/docs/create-data-store-es#discoveryengine_v1_generated_DocumentService_ImportDocuments_sync-python

TODO - update SCRIPT to create datastore

In [None]:
# # TODO - API: when creating the data store, include the flag "aclEnabled": "true" in your JSON payload.
# # https://cloud.google.com/generative-ai-app-builder/docs/data-source-access-control#acl-storage-unstructured

# from typing import Optional

# from google.api_core.client_options import ClientOptions
# from google.cloud import discoveryengine

# # TODO(developer): Uncomment these variables before running the sample.
# # project_id = "YOUR_PROJECT_ID"
# # location = "YOUR_LOCATION" # Values: "global"
# # data_store_id = "YOUR_DATA_STORE_ID"

# # Must specify either `gcs_uri` or (`bigquery_dataset` and `bigquery_table`)
# # Format: `gs://bucket/directory/object.json` or `gs://bucket/directory/*.json`
# # gcs_uri = "YOUR_GCS_PATH"
# # bigquery_dataset = "YOUR_BIGQUERY_DATASET"
# # bigquery_table = "YOUR_BIGQUERY_TABLE"


# def import_documents_sample(
#     project_id: str,
#     location: str,
#     data_store_id: str,
#     gcs_uri: Optional[str] = None,
#     bigquery_dataset: Optional[str] = None,
#     bigquery_table: Optional[str] = None,
# ) -> str:
#     #  For more information, refer to:
#     # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
#     client_options = (
#         ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
#         if location != "global"
#         else None
#     )

#     # Create a client
#     client = discoveryengine.DocumentServiceClient(client_options=client_options)

#     # The full resource name of the search engine branch.
#     # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
#     parent = client.branch_path(
#         project=project_id,
#         location=location,
#         data_store=data_store_id,
#         branch="default_branch",
#     )

#     if gcs_uri:
#         request = discoveryengine.ImportDocumentsRequest(
#             parent=parent,
#             gcs_source=discoveryengine.GcsSource(
#                 input_uris=[gcs_uri], data_schema="custom"
#             ),
#             # Options: `FULL`, `INCREMENTAL`
#             reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
#         )
#     else:
#         request = discoveryengine.ImportDocumentsRequest(
#             parent=parent,
#             bigquery_source=discoveryengine.BigQuerySource(
#                 project_id=project_id,
#                 dataset_id=bigquery_dataset,
#                 table_id=bigquery_table,
#                 data_schema="custom",
#             ),
#             # Options: `FULL`, `INCREMENTAL`
#             reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
#         )

#     # Make the request
#     operation = client.import_documents(request=request)

#     print(f"Waiting for operation to complete: {operation.operation.name}")
#     response = operation.result()

#     # Once the operation is complete,
#     # get information from operation metadata
#     metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)

#     # Handle the response
#     print(response)
#     print(metadata)

#     return operation.operation.name
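For consistency with the REST helper used earlier, the import could also be expressed as a plain request. The sketch below only builds the URL and body; `build_import_request` is a hypothetical helper, it assumes the `global` location, and it assumes `dataSchema` should be `"document"` because `metadata.jsonl` holds one document record per line (the commented sample above uses `"custom"`).

```python
def build_import_request(
    project_id: str,
    data_store_id: str,
    metadata_uri: str,
    api_version: str = "v1alpha",
) -> tuple[str, dict]:
    """Build the URL and JSON body for a documents:import call (global location)."""
    url = (
        f"https://discoveryengine.googleapis.com/{api_version}/projects/{project_id}/"
        "locations/global/collections/default_collection/"
        f"dataStores/{data_store_id}/branches/default_branch/documents:import"
    )
    body = {
        "gcsSource": {
            "inputUris": [metadata_uri],
            # Assumption: the JSONL file follows the document schema
            "dataSchema": "document",
        },
        "reconciliationMode": "INCREMENTAL",
    }
    return url, body
```

The pair could then be sent with `requests.post(url, headers=headers, json=body)`, reusing the same headers as `create_vertex_datastore`. The call starts a long-running operation, so a successful response does not mean the import has finished.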


In [None]:
## https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.services.document_service.DocumentServiceClient#google_cloud_discoveryengine_v1_services_document_service_DocumentServiceClient_import_documents
# This snippet has been automatically generated and should be regarded as a
# code template only.
# It will require modifications to work:
# - It may require correct/in-range values for request initialization.
# - It may require specifying regional endpoints when creating the service
#   client as shown in:
#   https://googleapis.dev/python/google-api-core/latest/client_options.html
# from google.cloud import discoveryengine_v1

# def sample_import_documents():
#     # Create a client
#     client = discoveryengine_v1.DocumentServiceClient()

#     # Initialize request argument(s)
#     request = discoveryengine_v1.ImportDocumentsRequest(
#         parent="parent_value",
#     )

#     # Make the request
#     operation = client.import_documents(request=request)

#     print("Waiting for operation to complete...")

#     response = operation.result()

#     # Handle the response
#     print(response)

## Create Vertex AI Search App 

TODO - console


* https://cloud.google.com/generative-ai-app-builder/docs/create-engine-es
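If the app is created through the API instead of the console, the request might look like the sketch below. `build_engine_request` is a hypothetical helper; the field names follow the engine creation request as I understand it from the page linked above, and the choice of `SEARCH_TIER_STANDARD` is an assumption, so check both against that page before relying on them.

```python
def build_engine_request(
    project_id: str,
    engine_id: str,
    display_name: str,
    data_store_id: str,
    api_version: str = "v1alpha",
) -> tuple[str, dict]:
    """Build the URL and JSON body for creating a search app (engine)."""
    url = (
        f"https://discoveryengine.googleapis.com/{api_version}/projects/{project_id}/"
        "locations/global/collections/default_collection/engines"
        f"?engineId={engine_id}"
    )
    body = {
        "displayName": display_name,
        "dataStoreIds": [data_store_id],
        "solutionType": "SOLUTION_TYPE_SEARCH",
        # Assumption: the standard tier is enough for this demo
        "searchEngineConfig": {"searchTier": "SEARCH_TIER_STANDARD"},
    }
    return url, body
```

As with the data store helper, the URL and body could be sent with `requests.post` using a bearer token from `gcloud auth print-access-token`.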

