In [1]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Setting up Vector Stores with Vertex Matching Engine
<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/language/intro_palm_api.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/intro_palm_api.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/blob/main/language/intro_palm_api.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

## Overview

<center>
<img src="imgs/zghost_overview_ME.png" width="1200"/>
</center>
When working with LLMs and conversational agents, how the data they access is stored is crucial. Efficient data processing is more important than ever for applications involving large language models, generative AI, and semantic search. Many of these applications work with large unstructured datasets and rely on vector embeddings, a data representation that carries semantic information LLMs can use to answer questions and to maintain long-term memory.

In this application we use a specialized database, a vector database, that is optimized for storing and querying embeddings. The GDELT dataset extract can be quite large depending on the `actor_name` and time range, so we want to avoid sacrificing performance when working with it. Vertex AI Matching Engine's vector database lets us scale to a very large number of embeddings.

In this notebook you'll create and deploy a vector store in Vertex AI Matching Engine. The setup may take 40-50 minutes, but it only needs to be done once; afterwards you can update, delete, and continue to add embeddings to this instance.

---

[Vertex AI Matching Engine](https://cloud.google.com/vertex-ai/docs/matching-engine/overview) provides a high-scale, low-latency vector database. Such vector databases are commonly referred to as vector similarity-matching or approximate nearest neighbor (ANN) services.

Matching Engine provides tooling to build use cases that match semantically similar items. More specifically, given a query item, Matching Engine finds the most semantically similar items to it from a large corpus of candidate items. This ability to search for semantically similar or semantically related items has many real world use cases and is a vital part of applications such as:

* Recommendation engines
* Search engines
* Ad targeting systems
* Image classification or image search
* Text classification
* Question answering
* Chatbots

To build semantic matching systems, you need to compute vector representations of all items. These vector representations are often called embeddings. Embeddings are computed by using machine learning models, which are trained to learn an embedding space where similar examples are close while dissimilar ones are far apart. The closer two items are in the embedding space, the more similar they are.

At a high level, semantic matching can be simplified into two critical steps:

* Generate embedding representations of items.
* Perform nearest neighbor searches on embeddings.
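
The two steps above can be illustrated with a minimal brute-force sketch. It assumes cosine similarity as the distance measure and uses made-up 3-dimensional vectors in place of real model embeddings; the function name `cosine_top_k` is hypothetical. Matching Engine replaces the exhaustive scan shown here with an approximate search that scales to much larger corpora.

```python
import numpy as np


def cosine_top_k(query, corpus, k=2):
    """Return the indices and scores of the k corpus rows most similar to the query."""
    # Normalize so that a dot product equals cosine similarity.
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = corpus_norm @ query_norm
    # argsort is ascending, so reverse it and keep the first k indices.
    order = np.argsort(scores)[::-1][:k]
    return order.tolist(), scores[order].tolist()


# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])

top_ids, top_scores = cosine_top_k(query, corpus, k=2)
print(top_ids)
```

The first two corpus rows point in nearly the same direction as the query, so they are returned as its nearest neighbors.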

### Objectives

In this notebook, you will create a vector store using Vertex AI Matching Engine.

The steps performed include:

- Install the Python SDK
- Create a new Matching Engine index or initialize an existing one
  - Creating a new index can take 40-50 minutes
  - If you have already created an index and want to reuse it, follow the instructions to initialize an existing index
  - While the new index is being created, consider proceeding to the [GDELT DataOps](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/intro_palm_api.ipynb) notebook
- Create the vector store with embeddings generated by the `textembedding-gecko@001` embeddings model
  

### Costs
This tutorial uses billable components of Google Cloud:

* Vertex AI Generative AI Studio
* Vertex AI Matching Engine

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Getting Started

**Colab only:** Uncomment the following cell to restart the kernel. For Vertex AI Workbench you can restart the kernel using the button on top.

In [2]:
# # Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

### Authenticating your notebook environment
* If you are using **Colab** to run this notebook, uncomment the cell below and continue.
* If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env).

In [3]:
# from google.colab import auth
# auth.authenticate_user()

### Make sure you edit the values below
The first time you run the notebook with a new set of variables, you only need to edit the actor prefix and version variables below. They are used to look up all the other variables in the notebook configuration.

In [4]:
# CREATE_NEW_ASSETS        = True # True | False
ACTOR_PREFIX             = "way"
VERSION                  = 'v1'

# print(f"CREATE_NEW_ASSETS  : {CREATE_NEW_ASSETS}")
print(f"ACTOR_PREFIX       : {ACTOR_PREFIX}")
print(f"VERSION            : {VERSION}")

ACTOR_PREFIX       : way
VERSION            : v1


### Load configuration settings from setup notebook
Set the variables used in this notebook and load the config settings from the `00-env-setup.ipynb` notebook.

In [5]:
# staging GCS
GCP_PROJECTS             = !gcloud config get-value project
PROJECT_ID               = GCP_PROJECTS[0]

BUCKET_NAME              = f'zghost-{ACTOR_PREFIX}-{VERSION}-{PROJECT_ID}'
BUCKET_URI               = f'gs://{BUCKET_NAME}'

config = !gsutil cat {BUCKET_URI}/config/notebook_env.py
print(config.n)
exec(config.n)

print(f"BUCKET_NAME        : {BUCKET_NAME}")
print(f"BUCKET_URI         : {BUCKET_URI}")


PROJECT_ID               = "wortz-project-352116"
PROJECT_NUM              = "679926387543"
LOCATION                 = "us-central1"

REGION                   = "us-central1"
BQ_LOCATION              = "US"
VPC_NETWORK_NAME         = "me-network"

CREATE_NEW_ASSETS        = "True"
ACTOR_PREFIX             = "way"
VERSION                  = "v1"
ACTOR_NAME               = "wayfair"
ACTOR_CATEGORY           = "retail"

BUCKET_NAME              = "zghost-way-v1-wortz-project-352116"
EMBEDDING_DIR_BUCKET     = "zghost-way-v1-wortz-project-352116-emd-dir"

BUCKET_URI               = "gs://zghost-way-v1-wortz-project-352116"
EMBEDDING_DIR_BUCKET_URI = "gs://zghost-way-v1-wortz-project-352116-emd-dir"

VPC_NETWORK_FULL         = "projects/679926387543/global/networks/me-network"

ME_INDEX_NAME            = "vectorstore_way_v1"
ME_INDEX_ENDPOINT_NAME   = "vectorstore_way_v1_endpoint"
ME_DIMENSIONS            = "768"

MY_BQ_DATASET            = "zghost_way_v1_wortz_project_352116"
MY_BQ_TRENDS
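
The cell above runs the downloaded config with `exec`, which places every setting in the notebook's global namespace. As a hedged alternative, the sketch below executes the same kind of text into a separate dictionary so the loaded values can be inspected and cast before use. The config text is a shortened, hypothetical stand-in for the real file, and `load_config` is an illustrative helper, not part of the repository. Note that `exec` still runs whatever code the file contains, so this is only suitable for config files you trust.

```python
# Hypothetical config text, shaped like the output above (values shortened).
config_text = '''
PROJECT_ID               = "my-project"
CREATE_NEW_ASSETS        = "True"
ME_DIMENSIONS            = "768"
'''


def load_config(text):
    """Execute config text with a separate local namespace and return its variables."""
    namespace = {}
    exec(text, {}, namespace)
    return namespace


settings = load_config(config_text)

# Every value arrives as a string, so cast before numeric or boolean use.
dimensions = int(settings["ME_DIMENSIONS"])
create_new_assets = settings["CREATE_NEW_ASSETS"] == "True"
print(dimensions, create_new_assets)
```

This also makes explicit why later cells call `int(ME_DIMENSIONS)` and compare `CREATE_NEW_ASSETS` against the string `'True'`.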

### Import Packages

In [7]:
import sys
import os
sys.path.append("..")
# the following helper classes create and instantiate the matching engine resources
from zeitghost.vertex.MatchingEngineCRUD import MatchingEngineCRUD
from zeitghost.vertex.MatchingEngineVectorstore import MatchingEngineVectorStore
from zeitghost.vertex.LLM import VertexLLM
from zeitghost.vertex.Embeddings import VertexEmbeddings

import uuid
import time
import numpy as np
import json

from google.cloud import aiplatform as vertex_ai
from google.cloud import storage
from google.cloud import bigquery

In [8]:
storage_client = storage.Client(project=PROJECT_ID)

vertex_ai.init(project=PROJECT_ID, location=LOCATION)

# bigquery client
bqclient = bigquery.Client(
    project=PROJECT_ID,
    # location=LOCATION
)

## Matching Engine Index: initialize existing or create a new one

Validate access to the embeddings bucket and check its contents.

In [9]:
! gsutil ls $EMBEDDING_DIR_BUCKET_URI/init_index

gs://zghost-way-v1-wortz-project-352116-emd-dir/init_index/embeddings_0.json
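
The contents of `embeddings_0.json` are not shown in this notebook. As a rough sketch, assuming the file follows the JSON-lines input format that Matching Engine accepts (one record per line with an `id` and an `embedding` array), a placeholder file with the right dimensionality could be produced locally as below. The helper name, the zero vector, and the local path are all hypothetical; the real file was created by the setup notebook.

```python
import json
import os
import tempfile
import uuid

EXAMPLE_DIMENSIONS = 768  # same value as ME_DIMENSIONS in the config above


def write_placeholder_embeddings(path, dimensions, count=1):
    """Write `count` JSON records, one per line, each with an id and a vector."""
    with open(path, "w") as f:
        for _ in range(count):
            record = {"id": str(uuid.uuid4()), "embedding": [0.0] * dimensions}
            f.write(json.dumps(record) + "\n")


# Write to a temporary local directory; uploading to GCS is out of scope here.
local_path = os.path.join(tempfile.mkdtemp(), "embeddings_0.json")
write_placeholder_embeddings(local_path, EXAMPLE_DIMENSIONS)

# Read the file back to confirm the record count and vector length.
with open(local_path) as f:
    records = [json.loads(line) for line in f]
print(len(records), len(records[0]["embedding"]))
```

The vector length in the file has to match the dimension passed when the index is created, which is why the sketch reuses the 768 value from the config.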


Pass the parameters required to create the Matching Engine index.

In [10]:
mengine = MatchingEngineCRUD(
    project_id=PROJECT_ID 
    , project_num=PROJECT_NUM
    , region=LOCATION 
    , index_name=ME_INDEX_NAME
    , vpc_network_name=VPC_NETWORK_FULL
)

### Create or Initialize Existing Index

Creating a Vertex AI Matching Engine index can take ~40-50 minutes due to the index compaction algorithm it uses to structure the index for high-performance queries at scale. Read more about the [novel algorithm](https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html) proposed by Google Research and the [official whitepaper](https://arxiv.org/abs/1908.10396).

**Given this setup time, proceed to Notebook `02-gdelt-data-ops.ipynb` to start extracting events and articles related to your actor.**

In [None]:
start = time.time()
# create ME index
me_index = mengine.create_index(
    f"{EMBEDDING_DIR_BUCKET_URI}/init_index"
    , int(ME_DIMENSIONS)
)

end = time.time()
print(f"elapsed time: {end - start}")

if me_index:
    print(me_index.name)

INFO:root:Index vectorstore_way_v1 does not exists. Creating index ...
INFO:root:Poll the operation to create index ...


....

### Create or Initialize Index Endpoint
Once your Matching Engine index has been created, create an index endpoint to which the index will be deployed.

In [None]:
start = time.time()

index_endpoint=mengine.create_index_endpoint(
    endpoint_name=ME_INDEX_ENDPOINT_NAME
    , network=VPC_NETWORK_FULL
)

end = time.time()
print(f"elapsed time: {end - start}")

Print detailed information about the index endpoint, the VPC network where it is deployed, and any indexes already deployed to that endpoint.

In [None]:
if index_endpoint:
    print(f"Index endpoint resource name: {index_endpoint.name}")
    print(f"Index endpoint VPC network name: {index_endpoint.network}")
    print("Deployed indexes on the index endpoint:")
    for d in index_endpoint.deployed_indexes:
        print(f"    {d.id}")

### Deploy Index to Index Endpoint
To interact with a matching engine index, you'll need to deploy it to an endpoint, where you can customize the underlying infrastructure behind the endpoint. For example, you can specify the scaling properties. 

In [None]:
if CREATE_NEW_ASSETS == 'True':
    
    index_endpoint = mengine.deploy_index(
        index_name = ME_INDEX_NAME
        , endpoint_name = ME_INDEX_ENDPOINT_NAME
        , min_replica_count = 2
        , max_replica_count = 2
    )

Print information about the Matching Engine resources.

In [None]:
if index_endpoint:
    print(f"Index endpoint resource name: {index_endpoint.name}")
    print(f"Index endpoint VPC network name: {index_endpoint.network}")
    print("Deployed indexes on the index endpoint:")
    for d in index_endpoint.deployed_indexes:
        print(f"    {d.id}")

### Get Index and IndexEndpoint IDs
Set the variable values and print the resource details.

In [None]:
ME_INDEX_RESOURCE_NAME, ME_INDEX_ENDPOINT_ID = mengine.get_index_and_endpoint()
ME_INDEX_ID=ME_INDEX_RESOURCE_NAME.split("/")[5]

print(f"ME_INDEX_RESOURCE_NAME  = {ME_INDEX_RESOURCE_NAME}")
print(f"ME_INDEX_ENDPOINT_ID    = {ME_INDEX_ENDPOINT_ID}")
print(f"ME_INDEX_ID             = {ME_INDEX_ID}")
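
The `split("/")[5]` call above relies on the index resource name having exactly six slash-separated parts. Assuming the name follows the pattern `projects/<project>/locations/<location>/indexes/<id>`, a slightly more defensive parse could look like the sketch below. The helper name and the example IDs are hypothetical.

```python
def parse_index_resource_name(resource_name):
    """Split 'projects/<p>/locations/<l>/indexes/<id>' into named parts."""
    parts = resource_name.split("/")
    # Even positions hold the fixed keywords, odd positions hold the values.
    if len(parts) != 6 or parts[0::2] != ["projects", "locations", "indexes"]:
        raise ValueError(f"unexpected index resource name: {resource_name!r}")
    return {"project": parts[1], "location": parts[3], "index_id": parts[5]}


# Hypothetical resource name; the IDs are made up for illustration.
example = "projects/123456789/locations/us-central1/indexes/987654321"
parsed = parse_index_resource_name(example)
print(parsed["index_id"])
```

Failing loudly on an unexpected format avoids silently passing the wrong path segment as the index ID to later cells.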

## Matching Engine Vector Store

### Define Vertex LLM & Embeddings
The base class for creating the various LLMs is defined in `LLM.py`, in the `zeitghost/vertex` package at the repository root.

In [None]:
llm = VertexLLM(
    stop=None 
    , temperature=0.0
    , max_output_tokens=1000
    , top_p=0.7
    , top_k=40
)

# llm that can be used for a BigQuery agent, containing stopwords to prevent hallucinations and string parsing
langchain_llm_for_bq = VertexLLM(
    stop=['Observation:'] 
    , strip=True 
    , temperature=0.0
    , max_output_tokens=1000
    , top_p=0.7
    , top_k=40
)

# llm that can be used for a pandas agent, containing stopwords to prevent hallucinations
langchain_llm_for_pandas = VertexLLM(
    stop=['Observation:']
    , strip=False
    , temperature=0.0
    , max_output_tokens=1000
    , top_p=0.7
    , top_k=40
)

Let's ping the language model to make sure it returns a sensible response.

In [None]:
# llm('how are you doing today?')
llm('In no more than 50 words, what can you tell me about the band Widespread Panic?')

Now let's instantiate the `VertexEmbeddings` class, which gets document embeddings using the [Vertex AI Embeddings model](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings). Make sure that `REQUESTS_PER_MINUTE` does not exceed your project quota.

In [None]:
from zeitghost.vertex.Embeddings import VertexEmbeddings

REQUESTS_PER_MINUTE = 299 # example project quota==300
vertex_embedding = VertexEmbeddings(requests_per_minute=REQUESTS_PER_MINUTE)
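
How `VertexEmbeddings` enforces `requests_per_minute` internally is not shown in this notebook. As a minimal sketch of one common approach, the hypothetical `RateLimiter` below spaces calls at least `60 / requests_per_minute` seconds apart. The clock and sleep functions are injectable, and the demonstration uses a fake clock so it finishes instantly.

```python
import time


class RateLimiter:
    """Space calls at least 60 / requests_per_minute seconds apart."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


class FakeClock:
    """Stand-in clock whose sleep() advances time without really waiting."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
limiter = RateLimiter(299, clock=fake.time, sleep=fake.sleep)
for _ in range(3):
    limiter.wait()  # in real use, call the embeddings API right after this
print(fake.now)
```

Three back-to-back calls need two waits, so the fake clock advances by two intervals. With the default `time.monotonic` and `time.sleep`, the same class would pause for real.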

## Initialize Matching Engine Vector Store
Finally, to interact with the Matching Engine instance, initialize the vector store with everything that you have created.

In [None]:
# initialize vector store
me = MatchingEngineVectorStore.from_components(
    project_id=PROJECT_ID
    # , project_num=PROJECT_NUM
    , region=LOCATION
    , gcs_bucket_name=EMBEDDING_DIR_BUCKET_URI
    , embedding=vertex_embedding
    , index_id=ME_INDEX_ID
    , endpoint_id=ME_INDEX_ENDPOINT_ID
)

Validate that the vector store was created with the Vertex embeddings.

In [None]:
me.embedding