In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Chunk up the Docs and Index them in Vertex Matching Engine 
<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/language/referencearchitectures/setup.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/referencearchitectures/setup.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/blob/main/language/referencearchitectures/setup.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

## Overview
This notebook is optional: it shows how to work with unstructured data sources in a variety of formats and insert them into your vector store, but it is not required to run the remaining notebooks in this repo.

Here we cover how to use [LangChain and Vertex AI](https://python.langchain.com/en/latest/modules/models/text_embedding/examples/google_vertex_ai_palm.html) to extract text embeddings from a variety of source document types and index them with Vertex AI Matching Engine. For each document source (see below), selected metadata fields are retained and saved to the GCS blob metadata of the original text. For a comprehensive review of using LangChain with Vertex AI, see [intro_langchain_palm_api.ipynb](https://github.com/GoogleCloudPlatform/generative-ai/blob/dev/language/examples/langchain/intro_langchain_palm_api.ipynb).

**Source sample document types included in this notebook:**
* BigQuery Tables
* PDF
* Word docs
* PowerPoint
* YouTube videos

**Throughout this notebook we will index documents with `add_texts()`**

* This method assigns a unique ID (UUID) to each source chunk
* The original text chunk is uploaded to GCS: its metadata are stored in the GCS blob metadata fields, and the blob name equals the UUID
* The associated embedding vector is indexed (via stream update) in Vertex AI Matching Engine using the same UUID  

<center>
<img src="imgs/langchain_intro.png" width="800"/>
</center>

### Objectives

This notebook shows how to apply the following transformations to each document source:
* Chunk the text to sizes that can be indexed using the embeddings model
* Upload the original text + metadata to Google Cloud Storage (GCS)
* Extract embedding representation of text chunk using the Vertex PaLM API
* Stream update embedding vector representation to Vertex AI Matching Engine
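
As a rough illustration of the first step, the sketch below splits text into fixed-size character windows with optional overlap. The function name `chunk_text` is hypothetical and is not the splitter used by `MatchingEngineVectorStore`; only the `chunk_size` and `chunk_overlap` parameter names mirror the arguments passed later in this notebook.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Split text into character windows of at most chunk_size (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Production splitters usually break on separators such as paragraphs or sentences rather than raw character offsets, so treat this only as a picture of what `chunk_size` and `chunk_overlap` control.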

### Costs
This tutorial uses billable components of Google Cloud:

* Vertex AI Generative AI Studio
* Vertex AI Matching Engine
* BigQuery Storage & BigQuery Compute
* Google Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Getting Started
**Colab only:** Uncomment the following cell to restart the kernel. For Vertex AI Workbench you can restart the kernel using the button at the top.

In [1]:
# # Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

### Authenticating your notebook environment
* If you are using **Colab** to run this notebook, uncomment the cell below and continue.
* If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env).

In [2]:
# from google.colab import auth
# auth.authenticate_user()

### Make sure you edit the values below
The first time you run the notebook with new variables, you only need to edit the actor prefix and version below. They are used to retrieve all the other variables in the notebook configuration.

In [3]:
# CREATE_NEW_ASSETS        = True # True | False
ACTOR_PREFIX             = "ggl"
VERSION                  = 'v1'

# print(f"CREATE_NEW_ASSETS  : {CREATE_NEW_ASSETS}")
print(f"ACTOR_PREFIX       : {ACTOR_PREFIX}")
print(f"VERSION            : {VERSION}")

ACTOR_PREFIX       : ggl
VERSION            : v1


### Load configuration settings from setup notebook
> Set the constants used in this notebook and load the config settings from the `00-env-setup.ipynb` notebook.

In [4]:
# staging GCS
BUCKET_NAME              = f'zghost-{ACTOR_PREFIX}-{VERSION}'
BUCKET_URI               = f'gs://{BUCKET_NAME}'

print(f"BUCKET_NAME        : {BUCKET_NAME}")
print(f"BUCKET_URI         : {BUCKET_URI}")

config = !gsutil cat {BUCKET_URI}/config/notebook_env.py
print(config.n)
exec(config.n)

BUCKET_NAME        : zghost-ggl-v1
BUCKET_URI         : gs://zghost-ggl-v1

PROJECT_ID               = "cpg-cdp"
PROJECT_NUM              = "939655404703"
LOCATION                 = "us-central1"

REGION                   = "us-central1"
BQ_LOCATION              = "US"
VPC_NETWORK_NAME         = "genai-haystack-vpc"

CREATE_NEW_ASSETS        = "True"
ACTOR_PREFIX             = "ggl"
VERSION                  = "v1"
ACTOR_NAME               = "google"
ACTOR_CATEGORY           = "cyber security"

BUCKET_NAME              = "zghost-ggl-v1"
EMBEDDING_DIR_BUCKET     = "zghost-ggl-v1-emd-dir"

BUCKET_URI               = "gs://zghost-ggl-v1"
EMBEDDING_DIR_BUCKET_URI = "gs://zghost-ggl-v1-emd-dir"

VPC_NETWORK_FULL         = "projects/939655404703/global/networks/genai-haystack-vpc"

ME_INDEX_NAME            = "vectorstore_ggl_v1"
ME_INDEX_ENDPOINT_NAME   = "vectorstore_ggl_v1_endpoint"
ME_DIMENSIONS            = "768"

MY_BQ_DATASET            = "zghost_ggl_v1"
MY_BQ_TRENDS_DATASET     = "zgho

### Import Packages

In [5]:
import sys
import os
sys.path.append("..")
from zeitghost.gdelt.GdeltData import GdeltData
from zeitghost.bigquery.BigQueryAccessor import BigQueryAccessor

from zeitghost.agents.LangchainAgent import LangchainAgent
from zeitghost.vertex.LLM import VertexLLM

from zeitghost.vertex.MatchingEngineCRUD import MatchingEngineCRUD
from zeitghost.vertex.MatchingEngineVectorstore import MatchingEngineVectorStore

from zeitghost.vertex.Embeddings import VertexEmbeddings

In [6]:
from google.cloud import aiplatform as vertex_ai
from google.cloud import storage
from google.cloud import bigquery

from langchain.document_loaders import DataFrameLoader
from langchain.docstore.document import Document
from langchain.document_loaders import BigQueryLoader

import pandas as pd
import uuid
import numpy as np
import json
import time
import io

from IPython.display import display, Markdown
from IPython.display import Image as IPyImage  # aliased so PIL's Image below does not shadow it
from PIL import Image, ImageDraw
import logging
logging.basicConfig(level = logging.INFO)

Instantiate Google Cloud SDK clients

In [7]:
storage_client = storage.Client(project=PROJECT_ID)

vertex_ai.init(project=PROJECT_ID,location=LOCATION)

# bigquery client
bqclient = bigquery.Client(
    project=PROJECT_ID,
    # location=LOCATION
)

Helper function for inspecting blob data

In [8]:
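def test_gcs_blob_metadata(blob_name, bucket_name):
    """
    Inspect the metadata of a blob uploaded to GCS
    """
    bucket = storage_client.bucket(bucket_name)
    # get_blob() returns None when the blob does not exist
    blob = bucket.get_blob(blob_name)
    if blob is None:
        print(f"Blob not found: gs://{bucket_name}/{blob_name}")
        return
    print(f"Metadata: {blob.metadata}")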

## Initialize Vertex AI Matching Engine & GenAI Resources

### Matching Engine Index and IndexEndpoint
Initialize the Vertex AI Matching Engine resources. You should have already created these resources in the [Setup Vertex Vector Store]() notebook. 

In [9]:
mengine = MatchingEngineCRUD(
    project_id=PROJECT_ID 
    , project_num=PROJECT_NUM
    , region=LOCATION 
    , index_name=ME_INDEX_NAME
    , vpc_network_name=VPC_NETWORK_FULL
)

In [10]:
ME_INDEX_RESOURCE_NAME, ME_INDEX_ENDPOINT_ID = mengine.get_index_and_endpoint()
ME_INDEX_ID=ME_INDEX_RESOURCE_NAME.split("/")[5]

print(f"ME_INDEX_RESOURCE_NAME  = {ME_INDEX_RESOURCE_NAME}")
print(f"ME_INDEX_ENDPOINT_ID    = {ME_INDEX_ENDPOINT_ID}")
print(f"ME_INDEX_ID             = {ME_INDEX_ID}")

ME_INDEX_RESOURCE_NAME  = projects/939655404703/locations/us-central1/indexes/5031884178191286272
ME_INDEX_ENDPOINT_ID    = projects/939655404703/locations/us-central1/indexEndpoints/3175838181761220608
ME_INDEX_ID             = 5031884178191286272


### Vertex AI LLM & Embeddings Generator

Instantiate the Vertex AI LLMs using the helper classes, passing different parameters for different scenarios.

For more information on parameters such as temperature, top_p, top_k, see [Getting Started with the Vertex AI PaLM API & Python SDK](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/intro_palm_api.ipynb)

* `stop=None`: no stop sequence is passed when calling the LLM
* `stop=['Observation:']`: generation stops when the LLM response contains <i>Observation:</i>, which helps avoid hallucinations from the BigQuery and Pandas agents
* `strip=True`: we've noticed that when working with PaLM + BigQuery LangChain agents, stripping unwanted characters from the response string can prevent hallucinations
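
To make the effect of these two options concrete, here is a minimal sketch of what stop-sequence truncation and character stripping could look like when applied to a raw response. The `postprocess` function is hypothetical and is not taken from the `VertexLLM` helper class; it only assumes that `stop` cuts the text at the first stop sequence and that `strip_chars` removes the listed substrings.

```python
from typing import Optional, Sequence


def postprocess(
    raw: str,
    stop: Optional[Sequence[str]] = None,
    strip_chars: Optional[Sequence[str]] = None,
) -> str:
    """Truncate at the earliest stop sequence, then remove unwanted substrings (assumed behavior)."""
    text = raw
    for marker in stop or []:
        idx = text.find(marker)
        if idx != -1:
            # Keep only the text before the stop sequence
            text = text[:idx]
    for chars in strip_chars or []:
        text = text.replace(chars, "")
    return text.strip()


# A response with a stray backslash, followed by an invented "Observation:"
raw = "SELECT COUNT(*) FROM t\\\nObservation: 10 rows"
cleaned = postprocess(raw, stop=["Observation:"], strip_chars=["\\", "\\\\"])
print(cleaned)
```

In an agent loop, cutting at `Observation:` keeps the model from inventing tool output, since the real observation is supplied by the tool call instead.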

To read more about different types of agents, see the [LangChain Documentation on Agents](https://python.langchain.com/en/latest/modules/agents.html).

Make sure your `REQUESTS_PER_MINUTE` does not exceed your project quota.
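
For intuition about what a requests-per-minute cap implies, the sketch below spaces calls evenly so the cap is never exceeded. The `RateLimiter` class is hypothetical and is not the throttling logic inside `VertexEmbeddings`; the injectable `clock` and `sleep` arguments exist only to make the sketch easy to test.

```python
import time


class RateLimiter:
    """Space calls so that at most requests_per_minute occur per minute (hypothetical helper)."""

    def __init__(self, requests_per_minute: int, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self) -> float:
        """Block until the next call is allowed and return the seconds slept."""
        now = self._clock()
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_call = self._clock()
        return slept


limiter = RateLimiter(requests_per_minute=299)
print(f"Minimum spacing between calls: {limiter.min_interval:.3f} s")
```

Calling `limiter.wait()` before each embedding request would keep a single process under the cap; it does not account for other clients sharing the same project quota.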

In [11]:
llm = VertexLLM(
    temperature=0
    , stop=['Observation:']
    # , stop=None
    , strip=True
    , strip_chars=['\\', '\\\\']
    , max_output_tokens=1000
    , top_p=0.7
    , top_k=40
)

REQUESTS_PER_MINUTE = 299 # project quota==300
vertex_embedding = VertexEmbeddings(requests_per_minute=REQUESTS_PER_MINUTE)

Let's ping the LLM and verify we are getting a response

In [12]:
llm('In no more than 50 words, what can you tell me about the band Widespread Panic?')

'Widespread Panic is an American rock band formed in Athens, Georgia, in 1986. The band\'s lineup consists of John Bell (vocals, guitar), Jimmy Herring (guitar, vocals), Dave Schools (bass), John "JoJo" Hermann (keyboards), and Duane Trucks (drums). Widespread Panic has released 15 studio albums, four live albums, and two compilation albums.'

### Initialize Matching Engine VectorStore
In this notebook we use the `MatchingEngineVectorStore` class, which:
- Enables streaming index updates to a Matching Engine index
- Stores the embeddings in Matching Engine and the embedded documents in GCS
    - An existing index and corresponding endpoint are prerequisites for
    using this class.
    - See usage in docs/modules/indexes/vectorstores/examples/matchingengine.ipynb
    - Note that this implementation is mostly meant for reading if you are
    planning a real-time implementation. While reading is a real-time
    operation, updating the index takes close to one hour.
- Note: at the time of writing, creating a Matching Engine index from LangChain supports only batch index updates

In [14]:
# initialize vector store
me = MatchingEngineVectorStore.from_components(
    project_id=PROJECT_ID
    , region=LOCATION
    , gcs_bucket_name=EMBEDDING_DIR_BUCKET_URI
    , embedding=vertex_embedding
    , index_id=ME_INDEX_ID
    , endpoint_id=ME_INDEX_ENDPOINT_ID
)

## BigQuery Tables

<center>
<img src="imgs/chunk_bq_tables_flow.png" width="750"/>
</center>

When chunking BigQuery tables, we specify the columns we want to chunk, and the columns we want to include as metadata

The `me.chunk_bq_table()` method returns LangChain `Document` objects, each consisting of `page_content` (the chunked text) and `metadata` (data describing attributes of the page content):

```python
from langchain.schema import Document

Document(
    page_content="This is my document. It is full of text that I've gathered from other places",
    metadata={
        "my_document_id": 234234,
        "my_document_source": "The LangChain Papers",
        "my_document_create_time": 1680013019,
    },
)
```
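
The split between content columns and metadata columns can be pictured with plain dictionaries. This is only a sketch: `row_to_record` is a hypothetical helper, the sample row is made up, and the `"column: value"` layout of `page_content` is inferred from the cell output shown later in this section rather than from the `chunk_bq_table()` source. The real method also chunks the text, which is omitted here.

```python
def row_to_record(row: dict, page_content_cols: list[str], metadata_cols: list[str]) -> dict:
    """Build a Document-like dict from one table row (hypothetical helper)."""
    # Content columns are rendered as "column: value", one per line
    page_content = "\n".join(f"{col}: {row[col]}" for col in page_content_cols)
    # Metadata columns are carried over unchanged
    metadata = {col: row[col] for col in metadata_cols}
    return {"page_content": page_content, "metadata": metadata}


# Made-up row for illustration only
row = {
    "title": "Example title",
    "text": "Example body",
    "url": "https://example.com",
    "language": "en",
}
record = row_to_record(row, page_content_cols=["text"], metadata_cols=["title", "url"])
print(record)
```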

In [15]:
TABLE_NAME = f'scraped_test_geg_articles_{ACTOR_PREFIX}_{VERSION}'
TABLE_REF  = f'{PROJECT_ID}.{MY_BQ_DATASET}.{TABLE_NAME}'

query = f"""
    SELECT * 
    FROM `{TABLE_REF}`
"""
df = bqclient.query(query).to_dataframe().head(2)
df

| | title | text | summary | publish_date | url | language | date | Actor1Name | Actor2Name | GoldsteinScale | NumSources | NumArticles | AvgTone | source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Google Search agrees to pay $23 million settle... | People who searched using Google and clicked o... | People who searched using Google and clicked o... | 2023-06-13 00:00:00+00:00 | https://www.cincinnati.com/story/money/2023/06... | en | | | | | 0 | 0 | 0.0 | https://www.cincinnati.com/story/money/2023/06... |
| 1 | Zib Digital Explains How Google Search Ads Wil... | Zib Digital Explains How Google Search Ads Wil... | Zib Digital Explains How Google Search Ads Wil... | NaT | http://www.itnewsonline.com/news/Zib-Digital-E... | en | | | | | 0 | 0 | 0.0 | http://www.itnewsonline.com/news/Zib-Digital-E... |


Now, let's define the metadata columns that we want to retain, keeping in mind what may be useful for a conversational agent when answering questions about this data.

In [16]:
df_col_list = df.columns
meta_cols_list = list(df_col_list)
meta_cols_list = [
    e for e in meta_cols_list if e not in (
        'text'
        , 'language'
        , 'date'
        , 'Actor1Name'
        , 'Actor2Name'
        , 'GoldsteinScale'
    )
]
meta_cols_list

['title',
 'summary',
 'publish_date',
 'url',
 'NumSources',
 'NumArticles',
 'AvgTone',
 'source']

In [20]:
CONTENT_COL_NAME = 'text'

docs = me.chunk_bq_table(
    bq_dataset_name=MY_BQ_DATASET
    , bq_table_name=TABLE_REF
    , query=query
    , page_content_cols=[CONTENT_COL_NAME]
    , metadata_cols=meta_cols_list
    , chunk_size=1000
    , chunk_overlap=0
)

texts = [d.page_content for d in docs]
metas = [d.metadata for d in docs]

docs[0]

INFO:root:# of chunked documents = 76


Document(page_content='text: In this article, we discuss 12 undervalued dividend stocks to buy according to analysts. You can skip our detailed analysis of dividend stocks and their performance over the years, and go directly to read 5 Undervalued Dividend Aristocrats To Buy According to Analysts.\n\nDividend aristocrats are the companies in the S&P 500 that have raised their dividends for 25 years or more. Value stocks trade at a lower price relative to their intrinsic value and typically have a lower price-to-earnings ratio. The concept of value investing, popularized by Benjamin Graham and Warren Buffett, revolves around the belief that the market occasionally misprices stocks, leading to opportunities for seasoned investors to buy them at a discount. This investment strategy has proven to be successful over the long haul, delivering a total return of 1,344,600% since 1926, compared with a 626,000% return on growth stocks, as reported by Bank of America.', metadata={'title': '12 Und

### Add texts (with metadata)

Given a list of texts and metadatas:
* assign unique ID to original `<text, metadata>` pair
* convert original text into embedding vector representation
* upload original text to GCS, where blob name == unique ID and blob metadata == metadata
* upsert embedding vector to Matching Engine index, where vector ID == unique ID
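
The four steps above can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: `fake_embed` replaces the real embeddings model and produces meaningless vectors, and the two dictionaries replace GCS and the Matching Engine index. It is not the implementation of `me.add_texts()`; only the `documents/` blob prefix is taken from the validation cell below.

```python
import uuid


def fake_embed(text: str, dims: int = 4) -> list[float]:
    """Stand-in for an embeddings model: deterministic but NOT semantically meaningful."""
    return [float((len(text) + i) % 7) for i in range(dims)]


def add_texts_sketch(texts, metadatas, blob_store: dict, vector_index: dict) -> list[str]:
    """Mimic the ID, embed, upload, upsert flow with plain dictionaries (hypothetical)."""
    ids = []
    for text, metadata in zip(texts, metadatas):
        doc_id = str(uuid.uuid4())          # 1. unique ID per <text, metadata> pair
        vector = fake_embed(text)           # 2. embedding vector for the text
        blob_store[f"documents/{doc_id}"] = {"text": text, "metadata": metadata}  # 3. blob name == ID
        vector_index[doc_id] = vector       # 4. vector ID == ID
        ids.append(doc_id)
    return ids


blob_store, vector_index = {}, {}
ids = add_texts_sketch(
    texts=["first chunk", "second chunk"],
    metadatas=[{"source": "a"}, {"source": "b"}],
    blob_store=blob_store,
    vector_index=vector_index,
)
print(ids)
```

Because the same ID keys both stores, a nearest-neighbor match returned by the index can be resolved back to its original text and metadata with a single blob lookup.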

In [21]:
# chunk text and add to matching engine vector store
uploaded_ids = me.add_texts(
    texts=texts
    , metadatas=metas
)

# convert the returned IDs to strings
uuid_strings = [str(uid) for uid in uploaded_ids]

INFO:root:# of texts = 76
INFO:root:# of metadatas = 76


...........................................................................
len of embeddings: 76
len of embeddings[0]: 768


INFO:root:Uploaded 76 documents to GCS.


### Validate the metadata upload to blob storage

In [22]:
BLOB_UUID = str(uuid_strings[0])
print(BLOB_UUID)

BLOB_NAME=f'documents/{BLOB_UUID}'
test_gcs_blob_metadata(blob_name=BLOB_NAME,bucket_name=EMBEDDING_DIR_BUCKET)

efefbf89-7731-4802-aaa8-7ca68146be61
Metadata: {'publish_date': '', 'NumArticles': '0', 'title': '12 Undervalued Dividend Aristocrats To Buy According to Analysts', 'url': 'https://finance.yahoo.com/news/12-undervalued-dividend-aristocrats-buy-172048478.html', 'NumSources': '0', 'AvgTone': '0.0', 'summary': 'In this article, we discuss 12 undervalued dividend stocks to buy according to analysts.\nYou can skip our detailed analysis of dividend stocks and their performance over the years, and go directly to read 5 Undervalued Dividend Aristocrats To Buy According to Analysts.\nWhen investing in dividend stocks, many investors find undervalued dividend aristocrat stocks appealing because they offer the potential for both income generation and capital appreciation.\nIn this article, we will discuss undervalued dividend aristocrats to buy according to analysts.\n12 Undervalued Dividend Aristocrats To Buy According to Analysts is originally published on Insider Monkey.', 'source': 'https://f

## GCSFileLoader
Google Cloud Storage file loader: `GCSFileLoader` supports text, Word (.docx), PowerPoint, HTML, PDF, and image files stored as blobs in Google Cloud Storage.

<center>
<img src="imgs/chunk_gcs_blobs_flow.png" width="750"/>
</center>

To follow along with this example, download the [John Deere 4 R Series manual](http://www.alltractormanuals.com/john-deere/john-deere-4r-series/), upload it to your GCS bucket, and uncomment the cells below.

In [29]:
# SOURCE_BLOB = 'docs/OMTR112287_EN_208_Operators_Manual_4044M_4044R_4052M_4052R_4066M_4066R.pdf'

# docs = me.chunk_unstructured_gcs_blob(
#     blob_name = SOURCE_BLOB
#     , bucket_name = BUCKET_NAME
# )

# texts = [d.page_content for d in docs]
# metas = [d.metadata for d in docs]

# docs[0]

### Add texts (with metadata)

Given a list of texts and metadatas:
* assign unique ID to original `<text, metadata>` pair
* convert original text into embedding vector representation
* upload original text to GCS, where blob name == unique ID and blob metadata == metadata
* upsert embedding vector to Matching Engine index, where vector ID == unique ID

In [31]:
# # chunk text and add to matching engine vector store
# uploaded_ids = me.add_texts(
#     texts=texts
#     , metadatas=metas
# )

# uuid_strings = []

# for uid in uploaded_ids:
#     uuid_strings.append(str(uid))

### Validate metadata upload to blob storage

In [32]:
# BLOB_UUID = str(uuid_strings[0])
# print(BLOB_UUID)

# BLOB_NAME=f'documents/{BLOB_UUID}'
# test_gcs_blob_metadata(blob_name=BLOB_NAME,bucket_name=EMBEDDING_DIR_BUCKET)

## YouTube Videos

We can retrieve transcripts of YouTube videos with the [youtube-transcript-api](https://pypi.org/project/youtube-transcript-api/).

<center>
<img src="imgs/chunk_youtube_flow.png" width="1000"/>
</center>

* Given a YouTube URL, e.g., `https://www.youtube.com/watch?v=rPq8UsaiR1I`, grab the ID after `watch?v=` (e.g., `'rPq8UsaiR1I'`)
* The transcription returns a list of dicts:
```python
[
    {
        'text': 'Hey there',
        'start': 7.58,
        'duration': 6.13
    },
    {
        'text': 'how are you',
        'start': 14.08,
        'duration': 7.58
    },
    # ...
]

```
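
As a small sketch of these two steps, the snippet below pulls the video ID out of a `watch?v=` URL with the standard library and joins transcript segments into one string. Both helper names are hypothetical and are not part of `me.chunk_youtube()`; the segment list reuses the sample values shown above instead of calling the transcript API.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> str:
    """Return the value of the 'v' query parameter from a YouTube watch URL."""
    query = parse_qs(urlparse(url).query)
    if "v" not in query:
        raise ValueError(f"no 'v' parameter in URL: {url}")
    return query["v"][0]


def join_transcript(segments: list[dict]) -> str:
    """Concatenate the 'text' field of each transcript segment, in order."""
    return " ".join(segment["text"].strip() for segment in segments)


video_id = extract_video_id("https://www.youtube.com/watch?v=rPq8UsaiR1I")
transcript = join_transcript([
    {"text": "Hey there", "start": 7.58, "duration": 6.13},
    {"text": "how are you", "start": 14.08, "duration": 7.58},
])
print(video_id)
print(transcript)
```

Parsing the query string handles URLs that carry extra parameters (for example a timestamp) more reliably than splitting on `watch?v=`.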

Sample YouTube videos for this tutorial:
* [Time to winterize - 10 tips to prepare your vehicle for Winter](https://www.youtube.com/watch?v=rPq8UsaiR1I)
* [The Great Pyramids are power plants](https://www.youtube.com/watch?v=SrT-1y1kyK8)

In [25]:
YOUTUBE_ID='rPq8UsaiR1I'

docs = me.chunk_youtube(
    youtube_id=YOUTUBE_ID
)

texts = [d.page_content for d in docs]
metas = [d.metadata for d in docs]

docs[0]

INFO:root:# of pages loaded (pre-chunking) = 1
INFO:root:# of documents = 13


Document(page_content="Hey, friends, it's Len here from 1A Auto. So, winter's coming, that means inclement\nweather and, of course, colder temperatures. There's a lot of things on your vehicle you're\ngoing to need to think about. So, let's go over a few right now. All right. Now, the first thing that we're going to do\nis get underneath the hood and we're going to start by checking all of our fluids. Go ahead and start on the driver side. Typically, you're going to find your master\ncylinder and that's going to have your brake fluid in it. Go ahead and take a look from the side and\nyou'll see where the maximum line is. You can also open the cap and make sure that\nit looks as though it's clean and clear. Typically, if the fluid starts getting dark\nbrown in any way, it's time for a flush. Moving along, you'd want to check your oil\nlevel. There's going to be a little oil dipstick\nand generally, it'll say engine oil right on it. Go ahead and pull that out, wipe it down,", metadata={'

### Add texts (with metadata)

Given a list of texts and metadatas:
* assign unique ID to original `<text, metadata>` pair
* convert original text into embedding vector representation
* upload original text to GCS, where blob name == unique ID and blob metadata == metadata
* upsert embedding vector to Matching Engine index, where vector ID == unique ID

In [26]:
# chunk text and add to matching engine vector store
uploaded_ids = me.add_texts(
    texts=texts
    , metadatas=metas
)

# convert the returned IDs to strings
uuid_strings = [str(uid) for uid in uploaded_ids]

INFO:root:# of texts = 13
INFO:root:# of metadatas = 13


............
len of embeddings: 13
len of embeddings[0]: 768


INFO:root:Uploaded 13 documents to GCS.


### Validate metadata upload to blob storage

In [27]:
BLOB_UUID = str(uuid_strings[0])
print(BLOB_UUID)

BLOB_NAME=f'documents/{BLOB_UUID}'
test_gcs_blob_metadata(blob_name=BLOB_NAME,bucket_name=EMBEDDING_DIR_BUCKET)

727ceac0-96c3-4b56-a192-82fc63ce7d33
Metadata: {'view_count': '10974', 'author': '1A Auto: Repair Tips & Secrets Only Mechanics Know', 'source': 'https://www.youtube.com/watch?v=rPq8UsaiR1I', 'description': '', 'publish_date': '2020-11-21 00:00:00', 'title': 'Time to Winterize! 10 Tips to Prepare Your Car, Truck, or SUV for Winter', 'length': '492', 'thumbnail_url': 'https://i.ytimg.com/vi/rPq8UsaiR1I/hq720.jpg'}
