In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Load Vector Store Index with Data Sources
<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/language/intro_palm_api.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/intro_palm_api.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/blob/main/language/intro_palm_api.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

## Overview

<center>
<img src="imgs/zghost_overview_load_index.png" width="1200"/>
</center>
To let our conversational agent interact with the GDELT data, we need to create embeddings for the data we extracted and insert the resulting vectors into the Matching Engine vector database that was created in the `01-setup-vertex-vector-store` notebook.

In this notebook, we index the extracted GDELT data (both events and entities) and load the embeddings into the Vertex AI Matching Engine Vector Store. Once the embeddings are loaded, we can start experimenting with information retrieval from the Vector Store, including retrieval through a Langchain agent.

By the end of the notebook, you will be able to run semantic searches and query the Vector Store using a Langchain agent.

---

### Objectives

In this notebook, the steps performed include:

- Initialize the previously created Vertex AI Matching Engine resources - if you did not do this already, please see [Setup Vertex Vector Store Notebook]()
    - Index
    - Endpoint
    - Vector Store 
- Instantiate the Vertex AI Generative AI Language Model 
    - Embeddings using `textembedding-gecko@001` model
    - Using the `MODEL_TEXT_BISON_001` LLM and including stopwords to help prevent hallucinations
    - Create the langchain agent by providing the vector store name and description
- Semantic Search & Agent Execution against the Vector Store with the ability to cite sources
    - Use simple semantic search functionality to return information from the Vector Store
    - Use the langchain agent to answer questions based on the information in the vector store

After you have run this notebook, you may want to set up a recurring schedule for extraction of the GDELT data - the [GDELT Pipelines Notebook]() shows how to create an end to end pipeline including updating the Matching Engine Vector store with the latest GDELT data.

### Costs
This tutorial uses billable components of Google Cloud:

* Vertex AI Generative AI Studio
* Vertex AI Matching Engine
* BigQuery Storage & BigQuery Compute
* Google Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Getting Started
**Colab only:** Uncomment the following cell to restart the kernel. For Vertex AI Workbench, you can restart the kernel using the button at the top of the notebook.

In [None]:
# # Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

### Authenticating your notebook environment
* If you are using **Colab** to run this notebook, uncomment the cell below and continue.
* If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env).

In [None]:
# from google.colab import auth
# auth.authenticate_user()

### Make sure you edit the values below
The first time you run this notebook with a new set of variables, edit the actor prefix and version below. They are used to locate and load all the other variables in the notebook configuration.

In [1]:
# CREATE_NEW_ASSETS        = True # True | False
ACTOR_PREFIX             = "ggl"
VERSION                  = 'v1'

# print(f"CREATE_NEW_ASSETS  : {CREATE_NEW_ASSETS}")
print(f"ACTOR_PREFIX       : {ACTOR_PREFIX}")
print(f"VERSION            : {VERSION}")

ACTOR_PREFIX       : ggl
VERSION            : v1


### Load configuration settings from setup notebook
> Set the constants used in this notebook and load the config settings from the `00-env-setup.ipynb` notebook.
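
The cell below loads the saved configuration with `exec`, which runs whatever code the file contains. As a minimal sketch of a stricter alternative, assuming the config file only holds simple `NAME = literal` assignments like the ones printed in its output, the standard-library `ast` module can read the values without executing anything. The `parse_config` helper and the inline `sample` string are hypothetical and are not part of the zeitghost package.

```python
import ast


def parse_config(source: str) -> dict:
    """Collect top-level NAME = literal assignments without executing the source."""
    settings = {}
    for node in ast.parse(source).body:
        is_simple_assign = (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
        )
        if is_simple_assign:
            # literal_eval only accepts literals (strings, numbers, lists, ...)
            settings[node.targets[0].id] = ast.literal_eval(node.value)
    return settings


# Hypothetical sample in the same shape as the printed config
sample = 'ACTOR_PREFIX = "ggl"\nVERSION = "v1"\nME_DIMENSIONS = "768"\n'
settings = parse_config(sample)
print(settings["ACTOR_PREFIX"], int(settings["ME_DIMENSIONS"]))
```

The saved values are strings, so a setting such as `ME_DIMENSIONS` needs an explicit `int(...)` before it is used as a number.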

In [2]:
# staging GCS
GCP_PROJECTS             = !gcloud config get-value project
PROJECT_ID               = GCP_PROJECTS[0]

BUCKET_NAME              = f'zghost-{ACTOR_PREFIX}-{VERSION}-{PROJECT_ID}'
BUCKET_URI               = f'gs://{BUCKET_NAME}'

config = !gsutil cat {BUCKET_URI}/config/notebook_env.py
print(config.n)
exec(config.n)

print(f"BUCKET_NAME        : {BUCKET_NAME}")
print(f"BUCKET_URI         : {BUCKET_URI}")


PROJECT_ID               = "wortz-project-352116"
PROJECT_NUM              = "679926387543"
LOCATION                 = "us-central1"

REGION                   = "us-central1"
BQ_LOCATION              = "US"
VPC_NETWORK_NAME         = "me-network"

CREATE_NEW_ASSETS        = "True"
ACTOR_PREFIX             = "ggl"
VERSION                  = "v1"
ACTOR_NAME               = "google"
ACTOR_CATEGORY           = "technology"

BUCKET_NAME              = "zghost-ggl-v1-wortz-project-352116"
EMBEDDING_DIR_BUCKET     = "zghost-ggl-v1-wortz-project-352116-emd-dir"

BUCKET_URI               = "gs://zghost-ggl-v1-wortz-project-352116"
EMBEDDING_DIR_BUCKET_URI = "gs://zghost-ggl-v1-wortz-project-352116-emd-dir"

VPC_NETWORK_FULL         = "projects/679926387543/global/networks/me-network"

ME_INDEX_NAME            = "vectorstore_ggl_v1"
ME_INDEX_ENDPOINT_NAME   = "vectorstore_ggl_v1_endpoint"
ME_DIMENSIONS            = "768"

MY_BQ_DATASET            = "zghost_ggl_v1"
MY_BQ_TRENDS_DATASET     = "zg

### Import Packages

In [5]:
import sys
import os
sys.path.append("..")
# helper classes for using the langchain agent, LLM class, and embeddings model
from zeitghost.agents.LangchainAgent import LangchainAgent
from zeitghost.vertex.LLM import VertexLLM
from zeitghost.vertex.Embeddings import VertexEmbeddings

In [6]:
from google.cloud import aiplatform as vertex_ai
from google.cloud import storage
from google.cloud import bigquery

from langchain.document_loaders import DataFrameLoader
from langchain.docstore.document import Document

import pandas as pd
import uuid
import numpy as np
import json
import time
import io

from IPython.display import display, Image, Markdown
from PIL import Image, ImageDraw
import logging
logging.basicConfig(level = logging.INFO)

Instantiate Google Cloud SDK clients

In [7]:
storage_client = storage.Client(project=PROJECT_ID)

vertex_ai.init(project=PROJECT_ID,location=LOCATION)

# bigquery client
bqclient = bigquery.Client(
    project=PROJECT_ID,
    # location=LOCATION
)

Helper function for inspecting blob data

In [8]:
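def test_gcs_blob_metadata(blob_name, bucket_name):
    """
    Inspect the metadata of a blob uploaded to GCS.
    """
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.get_blob(blob_name)
    # get_blob returns None when the object does not exist
    if blob is None:
        print(f"Blob not found: gs://{bucket_name}/{blob_name}")
        return
    print(f"Metadata: {blob.metadata}")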

## Initialize Vertex AI Matching Engine & GenAI Resources

### Matching Engine Index and IndexEndpoint
Initialize the Vertex AI Matching Engine resources. You should have already created these resources in the [Setup Vertex Vector Store]() notebook. 

In [9]:
from zeitghost.vertex.MatchingEngineCRUD import MatchingEngineCRUD

mengine = MatchingEngineCRUD(
    project_id=PROJECT_ID 
    , project_num=PROJECT_NUM
    , region=LOCATION 
    , index_name=ME_INDEX_NAME
    , vpc_network_name=VPC_NETWORK_FULL
)

Update the variable values and print out the matching engine resource information

In [10]:
ME_INDEX_RESOURCE_NAME, ME_INDEX_ENDPOINT_ID = mengine.get_index_and_endpoint()
ME_INDEX_ID=ME_INDEX_RESOURCE_NAME.split("/")[5]

print(f"ME_INDEX_RESOURCE_NAME  = {ME_INDEX_RESOURCE_NAME}")
print(f"ME_INDEX_ENDPOINT_ID    = {ME_INDEX_ENDPOINT_ID}")
print(f"ME_INDEX_ID             = {ME_INDEX_ID}")

ME_INDEX_RESOURCE_NAME  = projects/939655404703/locations/us-central1/indexes/5031884178191286272
ME_INDEX_ENDPOINT_ID    = projects/939655404703/locations/us-central1/indexEndpoints/3175838181761220608
ME_INDEX_ID             = 5031884178191286272
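
The index ID above is taken from the sixth `/`-separated field of the resource name. As a small sketch of the same idea with a format check, assuming resource names follow the `projects/.../locations/.../indexes/...` pattern shown in the output, a hypothetical `parse_index_resource_name` helper could look like this:

```python
def parse_index_resource_name(resource_name: str) -> dict:
    """Split an index resource name into its project, location, and index ID."""
    parts = resource_name.split("/")
    expected_keys = ("projects", "locations", "indexes")
    # Even positions hold the fixed keywords, odd positions hold the values
    if len(parts) != 6 or tuple(parts[0::2]) != expected_keys:
        raise ValueError(f"Unexpected index resource name: {resource_name!r}")
    return {"project": parts[1], "location": parts[3], "index_id": parts[5]}


parsed = parse_index_resource_name(
    "projects/939655404703/locations/us-central1/indexes/5031884178191286272"
)
print(parsed["index_id"])
```

Checking the shape first means an endpoint name passed by mistake raises a clear error instead of silently yielding the wrong ID.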


Uncomment the cell below to optionally list out existing index endpoints in your project and region

In [11]:
# !gcloud ai index-endpoints list --project=$PROJECT_ID --region=$LOCATION

# endpoint_data = ! gcloud ai index-endpoints list --region {LOCATION} --filter {ME_INDEX_ENDPOINT_ID}
# endpoint_address = [e for e in endpoint_data if 'publicEndpointDomainName' in e][0].partition(': ')[2] #careful - this is grabbing the first one in the list
# endpoint_address

Either create a new index endpoint or use an existing one. If you already created it in the Setup Vertex Vector Store notebook, the log output should report that the index endpoint already exists.

In [12]:
index_endpoint = mengine.create_index_endpoint()

if index_endpoint:
    print(f"Index endpoint resource name: {index_endpoint.name}")
    print(f"Index endpoint public domain name: {index_endpoint.public_endpoint_domain_name}")
    print("Deployed indexes on the index endpoint:")
    for d in index_endpoint.deployed_indexes:
        print(f"    {d.id}")

INFO:root:Index endpoint already exists


Index endpoint resource name: projects/939655404703/locations/us-central1/indexEndpoints/3175838181761220608
Index endpoint public domain name: 
Deployed indexes on the index endpoint:
    vectorstore_ggl_v1_20230614201535


### Vertex AI LLM & Embeddings Generator

Instantiate the Vertex AI LLMs using the helper classes, passing different parameters for different scenarios.

For more information on parameters such as temperature, top_p, top_k, see [Getting Started with the Vertex AI PaLM API & Python SDK](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/intro_palm_api.ipynb)

*  `stop=None`: in this case, we are NOT passing a stopword when calling the LLM
* `stop=['Observation:']`: in this case, we are stopping when the LLM response shows <i>Observation:</i> to avoid hallucinations from the BigQuery and Pandas agents
* `strip=True`: we've noticed that when working with PaLM + BigQuery Langchain Agents, enabling string parsing stripping capabilities can prevent hallucinations 
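
To make the effect of `stop` and `strip` concrete, here is a minimal sketch of what truncating at a stop sequence and stripping characters can look like on a raw completion. How `VertexLLM` implements these options internally is not shown in this notebook, so `apply_stop_and_strip` is a hypothetical illustration of the idea, not the helper's actual code.

```python
def apply_stop_and_strip(text, stop=None, strip=False, strip_chars=None):
    """Cut text at the earliest stop sequence, then optionally remove characters."""
    for sequence in stop or []:
        position = text.find(sequence)
        if position != -1:
            # Each cut only shortens the text, so the earliest stop sequence wins
            text = text[:position]
    if strip:
        for chars in strip_chars or []:
            text = text.replace(chars, "")
        text = text.strip()
    return text


raw = "Thought: search the news\nAction: news\\_search\nObservation: (text the model invented)"
truncated = apply_stop_and_strip(raw, stop=["Observation:"])
cleaned = apply_stop_and_strip(raw, stop=["Observation:"], strip=True, strip_chars=["\\"])
print(repr(cleaned))
```

Cutting at `Observation:` matters for agents because the observation is supposed to come from the tool, not from the model's own continuation.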

To read more about different types of agents, see the [Langchain Documentation on Agents](https://python.langchain.com/en/latest/modules/agents.html)

Make sure your REQUESTS_PER_MINUTE does not exceed your project quota
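
As an illustration of what a requests-per-minute cap means in practice, the sketch below spaces calls at least `60 / requests_per_minute` seconds apart. This is an assumption about one simple way to throttle, not a description of how `VertexEmbeddings` enforces its limit. The `RateLimiter` class and its injectable `clock` and `sleep` parameters are hypothetical.

```python
import time


class RateLimiter:
    """Space successive calls so they stay under a requests-per-minute cap."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


limiter = RateLimiter(requests_per_minute=299)
print(f"minimum spacing between requests: {limiter.min_interval:.3f} s")
```

Calling `limiter.wait()` before each embedding request would keep a single process under the cap. Several notebooks sharing one project quota would still need a lower per-notebook value.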

In [13]:
# llm = VertexLLM(
#     stop=None 
#     , temperature=0.0
#     , max_output_tokens=1000
#     , top_p=0.7
#     , top_k=40
# )

llm = VertexLLM(
    temperature=0
    , stop=['Observation:']
    # , stop=None
    , strip=True
    , strip_chars=['\\', '\\\\']
    , max_output_tokens=1000
    , top_p=0.7
    , top_k=40
)

langchain_llm_for_bq = VertexLLM(
    stop=['Observation:'] 
    , strip=True 
    , temperature=0.0
    , max_output_tokens=1000
    , top_p=0.7
    , top_k=40
)

langchain_llm_for_pandas = VertexLLM(
    stop=['Observation:']
    , strip=False
    , temperature=0.0
    , max_output_tokens=1000
    , top_p=0.7
    , top_k=40
)
REQUESTS_PER_MINUTE = 299 # project quota==300
vertex_embedding = VertexEmbeddings(requests_per_minute=REQUESTS_PER_MINUTE)

Let's ping the LLM and verify we are getting a response

In [14]:
# llm.predict('how are you doing today?')
llm('In no more than 50 words, what can you tell me about the band Widespread Panic?')

'Widespread Panic is an American rock band formed in Athens, Georgia, in 1986. The band\'s lineup consists of John Bell (vocals, guitar), Jimmy Herring (guitar, vocals), Dave Schools (bass), John "JoJo" Hermann (keyboards), and Duane Trucks (drums). Widespread Panic has released 15 studio albums, four live albums, and two compilation albums.'

### Initialize Matching Engine VectorStore
In this notebook we import the `MatchingEngineVectorStore` class, which:
- Enables streaming index updates to a Matching Engine Index
- Stores the embeddings in Matching Engine and the embedded documents in GCS
    - An existing Index and corresponding Endpoint are preconditions for using this module.
    - See usage in docs/modules/indexes/vectorstores/examples/matchingengine.ipynb
    - Note that this implementation is mostly meant for reading if you are planning a real-time implementation. While reading is a real-time operation, updating the index takes close to one hour.
- Note: at current time of writing, creating ME index from LangChain only supports batch index updates
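
The split described above (vectors in one place, document text in another, joined by a shared ID) can be sketched with two dictionaries. This is a toy in-memory model under stated assumptions: `ToyVectorStore`, `toy_embed`, and the tiny vocabulary are hypothetical stand-ins for the Matching Engine index, the embedding model, and the GCS bucket. None of it is the zeitghost implementation.

```python
import math
import uuid

VOCAB = ["google", "antitrust", "settlement", "search"]  # hypothetical toy vocabulary


def toy_embed(text):
    """Count vocabulary words; a stand-in for a real embedding model."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self, embed):
        self.embed = embed
        self.vectors = {}    # stands in for the vector index
        self.documents = {}  # stands in for the document bucket

    def add_texts(self, texts):
        """Store each text and its vector under the same generated ID."""
        ids = []
        for text in texts:
            doc_id = str(uuid.uuid4())
            self.documents[doc_id] = text
            self.vectors[doc_id] = self.embed(text)
            ids.append(doc_id)
        return ids

    def similarity_search(self, query, k=2):
        """Rank IDs by similarity to the query, then look up the stored texts."""
        query_vector = self.embed(query)
        ranked = sorted(
            self.vectors,
            key=lambda doc_id: cosine(query_vector, self.vectors[doc_id]),
            reverse=True,
        )
        return [self.documents[doc_id] for doc_id in ranked[:k]]


store = ToyVectorStore(toy_embed)
ids = store.add_texts(
    ["google antitrust charges", "search settlement payment", "weather today"]
)
top = store.similarity_search("antitrust", k=1)
print(top)
```

The index only ever returns IDs, so the document text has to be fetched from the second store. That is why both the index and the bucket must be kept in sync when documents are added.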

In [15]:
from zeitghost.vertex.MatchingEngineVectorstore import MatchingEngineVectorStore

Initialize the Vector Store

In [16]:
me = MatchingEngineVectorStore.from_components(
    project_id=PROJECT_ID
    # , project_num=PROJECT_NUM
    , region=LOCATION
    , gcs_bucket_name=EMBEDDING_DIR_BUCKET_URI
    , embedding=vertex_embedding
    , index_id=ME_INDEX_ID
    , endpoint_id=ME_INDEX_ENDPOINT_ID
    , k = 10
)

Verify the Vertex Embeddings & Endpoint

In [17]:
me.embedding

<zeitghost.vertex.Embeddings.VertexEmbeddings at 0x7fc3642bc820>

In [18]:
me.endpoint

<google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint.MatchingEngineIndexEndpoint object at 0x7fc364358550> 
resource name: projects/939655404703/locations/us-central1/indexEndpoints/3175838181761220608

## Index GDELT BigQuery Tables
Now that you have extracted the Event & Entity GDELT data into BigQuery, we create the vectors that will be streamed into the Vertex AI Matching Engine Index.

### GDELT Events Table
Update the variable values for the previously created GDELT events tables

In [19]:
GDELT_TABLE_NAME                 = f'test_events_gdelt_{ACTOR_PREFIX}_{VERSION}'

GDELT_TABLE_REF                  = f'{PROJECT_ID}.{MY_BQ_DATASET}.{GDELT_TABLE_NAME}'
SCRAPED_GDELT_TABLE_REF          = f'{PROJECT_ID}.{MY_BQ_DATASET}.scraped_{GDELT_TABLE_NAME}'

print(f"GDELT_TABLE_NAME         : {GDELT_TABLE_NAME}")
print(f"GDELT_TABLE_REF          : {GDELT_TABLE_REF}")
print(f"SCRAPED_GDELT_TABLE_REF  : {SCRAPED_GDELT_TABLE_REF}")

GDELT_TABLE_NAME         : test_events_gdelt_ggl_v1
GDELT_TABLE_REF          : cpg-cdp.zghost_ggl_v1.test_events_gdelt_ggl_v1
SCRAPED_GDELT_TABLE_REF  : cpg-cdp.zghost_ggl_v1.scraped_test_events_gdelt_ggl_v1


Let's preview some of this data before creating the embeddings

In [20]:
query = f"""
    SELECT * 
    FROM `{SCRAPED_GDELT_TABLE_REF}`
    -- LIMIT 5
"""
df=bqclient.query(query).to_dataframe().head(1)

df_col_list = df.columns
meta_cols_list = list(df_col_list)
meta_cols_list.remove('text')

df

Unnamed: 0,title,text,summary,publish_date,url,language,date,Actor1Name,Actor2Name,GoldsteinScale,NumMentions,NumSources,NumArticles,AvgTone,source
0,Chinese Threat Actor Abused ESXi Zero-Day to P...,A Chinese cyber-espionage group that researche...,The malware bundle also allowed UNC3886 actor ...,2023-06-13 22:00:00+00:00,https://www.darkreading.com/attacks-breaches/c...,en,20230614,VMWARE,VMWARE,2.35,[3],1,3,-3.42465753424657,https://www.darkreading.com/attacks-breaches/c...


List out the metadata columns - we will be providing this metadata information to the conversational agent

In [21]:
meta_cols_list

['title',
 'summary',
 'publish_date',
 'url',
 'language',
 'date',
 'Actor1Name',
 'Actor2Name',
 'GoldsteinScale',
 'NumMentions',
 'NumSources',
 'NumArticles',
 'AvgTone',
 'source']
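
As a rough sketch of how one table row might be split into embedded content and metadata, the example below builds a document from a row dictionary. The `Doc` dataclass and `row_to_doc` helper are hypothetical. The `column: value` content format is an assumption inferred from the `text:` prefix visible in the sampled chunks further down, not a documented behavior of `chunk_bq_table`.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)


def row_to_doc(row, page_content_cols, metadata_cols):
    """Put content columns into the text to embed and the rest into metadata."""
    content = "\n".join(f"{col}: {row[col]}" for col in page_content_cols)
    metadata = {col: row[col] for col in metadata_cols if col in row}
    return Doc(page_content=content, metadata=metadata)


# Hypothetical row with placeholder values
row = {"title": "Example title", "text": "Example body", "url": "https://example.com"}
doc = row_to_doc(row, page_content_cols=["text"], metadata_cols=["title", "url"])
print(doc)
```

Only `page_content` is embedded. The metadata travels alongside it so that search results can cite a title and source URL.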

Create the embeddings and write them to a GCS bucket so they can be inserted into the Matching Engine Vector Store.

After this job completes, the output shows how many documents were chunked, how long it took, and a confirmation that the documents were uploaded to Cloud Storage.
Example:

```
INFO:root:# of chunked documents = 60
INFO:root:# of texts = 60
INFO:root:# of metadatas = 60
elapsed time: 1.0051262378692627
...........................................................
len of embeddings: 60
len of embeddings[0]: 768
INFO:root:Uploaded 60 documents to GCS.
60
```
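
To show what `chunk_size` and `chunk_overlap` control, here is a minimal fixed-width chunker. It is a simplification under stated assumptions: the splitter used inside `chunk_bq_table` is not shown in this notebook and may respect sentence or paragraph boundaries, while the hypothetical `chunk_text` below cuts purely by character count.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    # Each window starts (chunk_size - chunk_overlap) characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


no_overlap = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0)
with_overlap = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(no_overlap)
print(with_overlap)
```

With `chunk_overlap=0`, as used in this notebook, every character lands in exactly one chunk. A positive overlap repeats the tail of each chunk at the head of the next, which can help when a relevant passage straddles a boundary.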

In [22]:
start = time.time()

docs = me.chunk_bq_table(
    bq_dataset_name=MY_BQ_DATASET
    , bq_table_name=SCRAPED_GDELT_TABLE_REF
    , query=query
    , page_content_cols=['text']
    , metadata_cols=meta_cols_list
    , chunk_size=2000
    , chunk_overlap=0
)

end = time.time()
print(f"elapsed time: {end - start}")

texts = [d.page_content for d in docs]
metas = [d.metadata for d in docs]

# chunk text and add to matching engine vector store
uploaded_ids = me.add_texts(
    texts=texts
    , metadatas=metas
)

# keep the returned document IDs as strings
uuid_strings = [str(uid) for uid in uploaded_ids]
    
print(len(uuid_strings))

INFO:root:# of chunked documents = 60
INFO:root:# of texts = 60
INFO:root:# of metadatas = 60


elapsed time: 1.0051262378692627
...........................................................
len of embeddings: 60
len of embeddings[0]: 768


INFO:root:Uploaded 60 documents to GCS.


60


Sample the text values - you should see the resulting text value from an article

In [23]:
texts[4]

'text: BRUSSELS (AP) — European Union regulators hit Google with fresh antitrust charges Wednesday, saying the only way to satisfy competition concerns about its lucrative digital ad business is by selling off parts of the tech giant’s main moneymaker.\n\nThe unprecedented decision to push for such a breakup marks a significant escalation by Brussels in its crackdown on Silicon Valley digital giants.\n\nThe European Commission, the bloc’s executive branch and top antitrust enforcer, said its preliminary view after an investigation is that “only the mandatory divestment by Google of part of its services” would satisfy the concerns.\n\nThe 27-nation EU has led the global movement to crack down on Big Tech companies — including groundbreaking rules on artificial intelligence — but it has previously relied on issuing blockbuster fines, including three antitrust penalties for Google worth billions of euros (dollars).\n\nIt’s the first time the bloc has told a tech giant that it should split

Inspect the metadata
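
GCS custom metadata is stored as string key-value pairs, so numeric columns such as `NumSources` come back as strings like `'1'` in the output of the cell below. As a small sketch, assuming you know which keys are numeric, a hypothetical `coerce_metadata` helper can convert them back before filtering or sorting:

```python
def coerce_metadata(metadata, numeric_keys):
    """Return a copy of metadata with the named keys converted to float."""
    coerced = dict(metadata)
    for key in numeric_keys:
        value = coerced.get(key)
        if value not in (None, ""):
            coerced[key] = float(value)
    return coerced


# Values copied from the metadata printed for the first uploaded blob
blob_metadata = {
    "Actor1Name": "VMWARE",
    "NumSources": "1",
    "NumArticles": "3",
    "AvgTone": "-3.42465753424657",
}
typed = coerce_metadata(blob_metadata, numeric_keys=["NumSources", "NumArticles", "AvgTone"])
print(typed)
```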

In [24]:
BLOB_UUID = str(uuid_strings[0])
print(BLOB_UUID)

BLOB_NAME=f'documents/{BLOB_UUID}'
test_gcs_blob_metadata(blob_name=BLOB_NAME,bucket_name=EMBEDDING_DIR_BUCKET)

31048973-5b3a-48e6-a810-1bf4f7b83350
Metadata: {'Actor1Name': 'VMWARE', 'NumSources': '1', 'NumArticles': '3', 'AvgTone': '-3.42465753424657', 'source': 'https://www.darkreading.com/attacks-breaches/chinese-threat-actor-abused-esxi-zero-day-pilfer-files-guest-vms', 'summary': 'The malware bundle also allowed UNC3886 actor to tamper with the hypervisor\'s logging service and to execute arbitrary commend between guest VMs on the same hypervisor.\nMandiant\'s analysis at the time showed the threat actor required admin-level privileges on the ESXi hypervisor to deploy the backdoors.\nThe threat actor targeted ESXi hosts belonging to defense, technology and telecommunications companies, Mandiant said.\nOnce connected to the ESXi hosts, the threat actor leveraged CVE-2023-20867 to run commands and transfer files on running guest machines without the need for the guest’s credentials, he says.\n"UNC3886 has shown itself to be a flexible, yet highly capable threat actor, which will modify open 

### Global Entity Graph (GEG) articles
We'll now follow a similar workflow to index the entity graph articles.

First set the variables based on the previously created GDELT entity tables

In [25]:
GDELT_TABLE_NAME                 = f'test_geg_articles_{ACTOR_PREFIX}_{VERSION}'

GDELT_TABLE_REF                  = f'{PROJECT_ID}.{MY_BQ_DATASET}.{GDELT_TABLE_NAME}'
SCRAPED_GDELT_TABLE_REF          = f'{PROJECT_ID}.{MY_BQ_DATASET}.scraped_{GDELT_TABLE_NAME}'

print(f"GDELT_TABLE_NAME         : {GDELT_TABLE_NAME}")
print(f"GDELT_TABLE_REF          : {GDELT_TABLE_REF}")
print(f"SCRAPED_GDELT_TABLE_REF  : {SCRAPED_GDELT_TABLE_REF}")

GDELT_TABLE_NAME         : test_geg_articles_ggl_v1
GDELT_TABLE_REF          : cpg-cdp.zghost_ggl_v1.test_geg_articles_ggl_v1
SCRAPED_GDELT_TABLE_REF  : cpg-cdp.zghost_ggl_v1.scraped_test_geg_articles_ggl_v1


Let's sample the data before creating embeddings

In [26]:
query = f"""
    SELECT * 
    FROM `{SCRAPED_GDELT_TABLE_REF}`
    -- LIMIT 5
"""
df=bqclient.query(query).to_dataframe().head(1)

df_col_list = df.columns
meta_cols_list = list(df_col_list)
# meta_cols_list.remove('text')
meta_cols_list = [
    e for e in meta_cols_list if e not in (
        'text'
        , 'language'
        , 'date'
        , 'Actor1Name'
        , 'Actor2Name'
        , 'GoldsteinScale'
    )
]

df

Unnamed: 0,title,text,summary,publish_date,url,language,date,Actor1Name,Actor2Name,GoldsteinScale,NumSources,NumArticles,AvgTone,source
0,Google Search agrees to pay $23 million settle...,People who searched using Google and clicked o...,People who searched using Google and clicked o...,2023-06-13 00:00:00+00:00,https://www.cincinnati.com/story/money/2023/06...,en,,,,,0,0,0.0,https://www.cincinnati.com/story/money/2023/06...


List out the column metadata - remember this will also be provided to the conversational agent

In [27]:
meta_cols_list

['title',
 'summary',
 'publish_date',
 'url',
 'NumSources',
 'NumArticles',
 'AvgTone',
 'source']

Create the embeddings and write them to a GCS bucket so they can be inserted into the Matching Engine Vector Store.

The output reports the number of documents loaded, the elapsed time, and the successful upload to GCS. Example:
```
INFO:root:# of chunked documents = 40
INFO:root:# of texts = 40
INFO:root:# of metadatas = 40
elapsed time: 0.6589460372924805
.......................................
len of embeddings: 40
len of embeddings[0]: 768
INFO:root:Uploaded 40 documents to GCS.
40
```

In [28]:
start = time.time()

docs = me.chunk_bq_table(
    bq_dataset_name=MY_BQ_DATASET
    , bq_table_name=SCRAPED_GDELT_TABLE_REF
    , query=query
    , page_content_cols=['text']
    , metadata_cols=meta_cols_list
    , chunk_size=1000
    , chunk_overlap=0
)

end = time.time()
print(f"elapsed time: {end - start}")

texts = [d.page_content for d in docs]
metas = [d.metadata for d in docs]

# chunk article text and add to Matching Engine Vector Store
uploaded_ids = me.add_texts(
    texts=texts
    , metadatas=metas
)

# keep the returned document IDs as strings
uuid_strings = [str(uid) for uid in uploaded_ids]
    
print(len(uuid_strings))

INFO:root:# of chunked documents = 40
INFO:root:# of texts = 40
INFO:root:# of metadatas = 40


elapsed time: 0.6589460372924805
.......................................
len of embeddings: 40
len of embeddings[0]: 768


INFO:root:Uploaded 40 documents to GCS.


40


Let's sample a text value

In [29]:
texts[1]

"To opt out or into of the settlement, visit refererheadersettlement.com.\n\nTo opt out the settlement, click on the Exclusion Form page, where you will register to receive a Class Member ID.\n\nTo opt into the settlement, click on the Registration Form page. Fill out your information and submit it to receive a Class Member ID at your email provided in your registration form. You'll then go to the Submit Claim page, where you can enter your Class Member ID and submit the form to file your claim.\n\nHow much is the payment for?\n\nBased on current data, the payment amounts for those with approved claims is expected to be around $7.70 per person.\n\nWhen can I expect the settlement to be finalized?\n\nThe final approval hearing for the settlement is currently scheduled for Oct. 12. Users who wish to object the settlement have to write to the court about why they disapprove of it by July 31, and can also ask to speak in court on the issue.\n\nHow is Google changing following the lawsuit?"

Inspect the metadata in blob storage

In [30]:
BLOB_UUID = str(uuid_strings[0])
print(BLOB_UUID)

BLOB_NAME=f'documents/{BLOB_UUID}'
test_gcs_blob_metadata(blob_name=BLOB_NAME,bucket_name=EMBEDDING_DIR_BUCKET)

9c33b1bc-e8b3-4cc2-baf2-11e4c72d1d9f
Metadata: {'title': 'Google Search agrees to pay $23 million settlement and you may be entitled to a portion', 'NumArticles': '0', 'publish_date': '2023-06-13 00:00:00+00:00', 'AvgTone': '0.0', 'summary': 'People who searched using Google and clicked on search results between October 2006 and September 2013 may be entitled to receive a cut of a $23 million settlement Google has agreed to pay.\nPeople who used Google Search and clicked on a search result link between or on Oct. 26, 2006 and Sept. 30, 2013 are included in the settlement as a settlement class member.\nSettlement class members have to decide to participate by July 31.\nTo opt out the settlement, click on the Exclusion Form page, where you will register to receive a Class Member ID.\nFollowing the settlement, Google will change its frequently asked questions (FAQs) and key terms pages to explain the circumstances under which search queries are shared with third parties using referrer hea

## Test streaming updates with Semantic Search and Action Agents

Because we created a Vertex Matching Engine index with streaming updates, we can query the index right away: the documents we just uploaded should immediately be available for the LLM to use when synthesizing an answer to a prompt.

1. We'll start with a generic semantic search, given a query. This will return `k` relevant documents (and their source URL/URI)

2. We'll build on this idea with [Agents](https://python.langchain.com/en/latest/modules/agents.html):

* Some user input is received
* The agent decides which tool - if any - to use, and what the input to that tool should be
* That tool is then called with that tool input, and an observation is recorded (this is just the output of calling that tool with that tool input)
* That history of tool, tool input, and observation is passed back into the agent, and it decides what step to take next
* This is repeated until the agent decides it no longer needs to use a tool, and then it responds directly to the user.
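
The loop described in the steps above can be sketched in a few lines. This is a toy model under stated assumptions: `run_agent`, the scripted `decide` function (which stands in for the LLM's decision), and the `news_search` tool are all hypothetical and are not Langchain APIs.

```python
def run_agent(question, decide, tools, max_steps=5):
    """Alternate between deciding on a tool and recording its observation."""
    history = []  # (tool, tool_input, observation) tuples
    for _ in range(max_steps):
        tool, tool_input = decide(question, history)
        if tool is None:
            # No tool needed: tool_input carries the final answer
            return tool_input, history
        observation = tools[tool](tool_input)
        history.append((tool, tool_input, observation))
    raise RuntimeError("agent did not finish within max_steps")


def scripted_decide(question, history):
    """Stand-in for the LLM: search once, then answer from the observation."""
    if not history:
        return "news_search", question
    return None, f"Based on the search: {history[-1][2]}"


tools = {"news_search": lambda query: f"1 document matched '{query}'"}
answer, history = run_agent("EU antitrust", scripted_decide, tools)
print(answer)
```

The observation is appended by the loop, not by the decision function. That is the behavior the `Observation:` stop sequence protects: the model should stop before writing an observation of its own.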

### Semantic search
Using the `similarity_search` method of `MatchingEngineVectorStore` (from `zeitghost.vertex.MatchingEngineVectorstore`), we can perform a simple semantic search.

Args:
- query: The string that will be used to search for similar documents.
- k: The number of neighbors that will be retrieved.

Returns:
- A list of k matching documents.

Uncomment to print the endpoint's public domain name

In [31]:
# me.endpoint.public_endpoint_domain_name

We have some preloaded semantic search prompts in the cells below, but feel free to change them

In [32]:
# Test whether search from vector store is working
# me.similarity_search("What are some good black friday deals?", k=2) # retail
me.similarity_search(f"What are some of {ACTOR_NAME}s recent board appointments?", k=2) # corporate board appointments - industry agnostic

INFO:root:Embedding query What are some of googles recent board appointments?.
INFO:root:Deployed Index ID = vectorstore_ggl_v1_20230614201535


[Document(page_content='text: European Union (EU) antitrust officials made a groundbreaking ruling on Wednesday, June 14, by suggesting the internet giant Google sell off a portion of its profitable digital advertising business due to competition concerns.\n\nAfter conducting an inquiry, the European Commission (EC), which is the bloc\'s executive body and top antitrust enforcer, came to the conclusion that "only the mandatory divestment by Google of part of its services" would address the issue.\n\nThe EU\'s 27 member states have been at the forefront of the international effort to regulate tech giants like Facebook, Google, and Amazon by passing groundbreaking regulations on AI and issuing massive fines in the past.\n\nHistoric EU Decision to Order Business Units to Break Up\n\nAccording to ABC News, this is the first time the EU has ordered a tech company to break up major elements of its business for violating the EU\'s stringent antitrust regulations. However, the specifics of thi

In [33]:
# me.similarity_search("What are some popular DIY projects?", k=2) # retail - home improvement
me.similarity_search(f"What are some of {ACTOR_NAME}s recent acquisitons?", k=2) # M&A activity - industry agnostic

INFO:root:Embedding query What are some of googles recent acquisitons?.
INFO:root:Deployed Index ID = vectorstore_ggl_v1_20230614201535


[Document(page_content='text: Google’s new AI-powered search experience, released in May, might have gotten a mixed reception. But the search giant isn’t letting that slow its feature roadmap.\n\nGoogle today announced new capabilities, some of which it previously previewed at its I/O conference, heading to Search Generative Experience (SGE) — the moniker for its experimental search experience — focused on travel and shopping.\n\nNow, when a user asks questions about a place or destination in Google Search (e.g. “Is this restaurant good for large groups?”), they’ll see a snapshot that brings together information from across the web as well as reviews, photos and business profile details submitted by business owners. And when they’re shopping for a product — Bluetooth speakers, say — SGE will show factors to consider along with product descriptions, reviews, ratings, prices, images and recommendations.', metadata={'source': 'https://techcrunch.com/2023/06/14/google-intros-new-ai-powered

### Action Agent with VectorStore
So, how do we move beyond basic semantic search and use a conversational agent to interact with this data?

Here, we will use the PaLM LLM with a langchain agent that has access to the Vector Store to answer questions.

We'll use an LLM configured with a stopword, which helps prevent hallucinations

Make sure your REQUESTS_PER_MINUTE does not exceed the project quota

In [34]:
REQUESTS_PER_MINUTE = 200 # project quota==300
vertex_embedding = VertexEmbeddings(requests_per_minute=REQUESTS_PER_MINUTE)

Instantiate the LLM to be used with the agent - the stopword here can help decrease hallucination

In [35]:
langchain_llm = VertexLLM(
    stop=['Observation:']
    , temperature=0.0
    , max_output_tokens=1000
    , top_p=0.7
    , top_k=10
)

Create the agent with the LLM and the Vector Store 

**Providing a descriptive Vector Store name and description can improve the quality of the results returned by the agent!**

In [36]:
from langchain.agents.agent_toolkits import (
    create_vectorstore_agent,
    VectorStoreToolkit,
    VectorStoreInfo
)

VECTOR_STORE_NAME = f"news on {ACTOR_NAME}"
VECTOR_STORE_DESCRIPTION = f"a vectorstore containing news and event documents on {ACTOR_NAME} from the GDELT 2.0 dataset"

print(f"VECTOR_STORE_NAME        = {VECTOR_STORE_NAME}")
print(f"VECTOR_STORE_DESCRIPTION  = {VECTOR_STORE_DESCRIPTION}")


vectorstore_info = VectorStoreInfo(
    name=VECTOR_STORE_NAME
    , description=VECTOR_STORE_DESCRIPTION
    , vectorstore=me
)

toolkit = VectorStoreToolkit(vectorstore_info=vectorstore_info, llm=langchain_llm)

agent_executor = create_vectorstore_agent(
    llm=langchain_llm,
    toolkit=toolkit,
    verbose=True
)

from langchain.indexes.vectorstore import VectorStoreIndexWrapper
index = VectorStoreIndexWrapper(vectorstore=me)

VECTOR_STORE_NAME        = news on google
VECTOR_STORE_DESCRIPTION  = a vectorstore containing news and event documents on google from the GDELT 2.0 dataset
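
Why do descriptive names and descriptions matter? ReAct-style agent prompts typically list each tool as a `name: description` line, and the LLM chooses a tool based on that text alone. The sketch below assumes that format; `render_tool_lines` is a hypothetical helper, not a LangChain function.

```python
def render_tool_lines(tools):
    """Render (name, description) pairs as one 'name: description' line per tool."""
    return "\n".join(f"{name}: {description}" for name, description in tools)


# A vague entry gives the LLM little to decide with.
vague = [("store", "a vectorstore")]

# A descriptive entry tells the LLM what the tool contains and when to use it.
descriptive = [(
    "news on google",
    "a vectorstore containing news and event documents on google "
    "from the GDELT 2.0 dataset",
)]

print(render_tool_lines(vague))
print(render_tool_lines(descriptive))
```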


Verify your toolkit

Example output:

```
VectorStoreToolkit(vectorstore_info=VectorStoreInfo(vectorstore=<zeitghost.vertex.MatchingEngineVectorstore.MatchingEngineVectorStore object at 0x7fc36431a1f0>, name='news on google', description='a vectorstore containing news and event documents on google from the GDELT 2.0 dataset'), llm=VertexLLM(cache=None, verbose=False, callbacks=None, callback_manager=None, model=<vertexai.language_models._language_models.TextGenerationModel object at 0x7fc364015880>, predict_kwargs={'temperature': 0.0, 'max_output_tokens': 1000, 'top_p': 0.7, 'top_k': 10}, model_source='text-bison@001', stop=['Observation:'], strip=False, strip_chars=['{', '}', '\n']))
```

In [37]:
toolkit

VectorStoreToolkit(vectorstore_info=VectorStoreInfo(vectorstore=<zeitghost.vertex.MatchingEngineVectorstore.MatchingEngineVectorStore object at 0x7fc36431a1f0>, name='news on google', description='a vectorstore containing news and event documents on google from the GDELT 2.0 dataset'), llm=VertexLLM(cache=None, verbose=False, callbacks=None, callback_manager=None, model=<vertexai.language_models._language_models.TextGenerationModel object at 0x7fc364015880>, predict_kwargs={'temperature': 0.0, 'max_output_tokens': 1000, 'top_p': 0.7, 'top_k': 10}, model_source='text-bison@001', stop=['Observation:'], strip=False, strip_chars=['{', '}', '\n']))

#### Use the langchain `VectorStoreAgent`

Note: we are using the LangChain LLM that has the stop sequence `Observation:` to prevent hallucinations.

For experimentation purposes, we have also noticed that running `agent_executor(query)` instead of `agent_executor.run(query)` can provide helpful insight into the reasoning process of the agent.

In [41]:
query = f"Use [{VECTOR_STORE_NAME}]: " + \
        f"What are the top news stories for {ACTOR_NAME}?"

# print(query)
agent_executor.run(query)



> Entering new AgentExecutor chain...


INFO:root:Embedding query What are the top news stories for google?
.
INFO:root:Deployed Index ID = vectorstore_ggl_v1_20230614201535


[32;1m[1;3mI need to use the news on google tool to answer this question
Action: news on google
Action Input: What are the top news stories for google?
[0m
Observation: [36;1m[1;3mrules by the European Commission.

The EU's antitrust chief, Margrethe Vestager, said in a statement that Google is "abusing its dominant position in the online advertising market."

"Google is present at almost every level of the so-called ad tech supply chain," Vestager said. "Our first concern is that Google is using its market position to favor its own intermediary services."

The commission said that Google's practices have led to higher prices for advertisers and less choice for consumers.

Google has been accused of using its dominance in the online advertising market to favor its own services, such as AdSense and AdMob. The company has also been accused of making it difficult for advertisers to use other advertising platforms.

Google has denied the allegations, saying that it has "always worked 

'Google is being investigated by the European Commission for violating antitrust'
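
The trace above follows a thought / `Action:` / `Action Input:` / `Observation:` pattern. As a rough sketch of how an agent framework could extract the tool call from the LLM text, consider the parser below. The regular expression and the `parse_action` function are simplified assumptions for illustration; LangChain's own output parser may differ.

```python
import re

# Assumed layout: an "Action:" line followed by an "Action Input:" line.
ACTION_PATTERN = re.compile(
    r"Action\s*:\s*(.*?)\s*\n\s*Action Input\s*:\s*(.*)", re.DOTALL
)


def parse_action(llm_text):
    """Return (tool_name, tool_input), or None if no action is found."""
    match = ACTION_PATTERN.search(llm_text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


llm_text = (
    "I need to use the news on google tool to answer this question\n"
    "Action: news on google\n"
    "Action Input: What are the top news stories for google?\n"
)
print(parse_action(llm_text))
```

The tool name returned here must match the name given to the Vector Store, which is one reason the queries in this notebook mention `VECTOR_STORE_NAME` explicitly.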

In [44]:
query = f"Use [{VECTOR_STORE_NAME}]: " + \
        f"How does {ACTOR_NAME}'s commitment to technology impact their brand perception?"

print(query)
agent_executor.run(query)

In [27]:
query = f"Use [{VECTOR_STORE_NAME}]: " + \
        f"What are some of {ACTOR_NAME}'s most popular brands?"

print(query)
agent_executor.run(query)

In [46]:
query = f"Use [{VECTOR_STORE_NAME}]: " + \
        f"How is {ACTOR_NAME} impacted by economic uncertainty?"

print(query)
agent_executor.run(query)

In [30]:
query = f"Use [{VECTOR_STORE_NAME}]: " + \
        f"What impact will inflation have on the {ACTOR_CATEGORY} category over the next 6 months?"

print(query)
agent_executor.run(query)

In [48]:
query = f"Use [{VECTOR_STORE_NAME}]: " + \
        f"What are the most popular TikTok trends for {ACTOR_CATEGORY} from the last 6 months?"

print(query)
agent_executor.run(query)
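
The cells above repeat the same query pattern. A small helper can build the prompts in one place, as sketched below. `build_query` and the example values are hypothetical; in the notebook, `VECTOR_STORE_NAME` and `ACTOR_NAME` would supply the real values.

```python
def build_query(store_name, question):
    """Prefix a question with the tool hint used throughout this notebook."""
    return f"Use [{store_name}]: {question}"


# Example values only; substitute the notebook's own variables.
example_actor = "google"
example_store = f"news on {example_actor}"

questions = [
    f"What are the top news stories for {example_actor}?",
    f"How is {example_actor} impacted by economic uncertainty?",
]

queries = [build_query(example_store, q) for q in questions]
for query in queries:
    print(query)
```

Each resulting string could then be passed to `agent_executor.run(query)` in a loop.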