*jkwng: Import LlamaIndex / Vertex AI integration*

In [None]:
%pip install --quiet llama-index-llms-vertex


# Document Search with LlamaIndex

This example shows how to use the Python [LlamaIndex](https://docs.llamaindex.ai/en/stable/) library to run a text-generation request on open-source LLMs and embedding models using the OpenAI SDK, then augment that request using the text stored in a collection of local PDF documents.

### <u>Requirements</u>
1. Because you will be accessing the LLMs and embedding models through Vector AI Engineering's Kaleidoscope Service (Vector Inference + Autoscaling), you will need to request a KScope API Key:

      Run the following command (replace ```<user_id>``` and ```<password>```) from **within the cluster** to obtain the API Key. The ```access_token``` in the output is your KScope API Key.
  ```bash
  curl -X POST -d "grant_type=password" -d "username=<user_id>" -d "password=<password>" https://kscope.vectorinstitute.ai/token
  ```
2. After obtaining the `.env` configurations, make sure to create the ```.kscope.env``` file in your home directory (```/h/<user_id>```) and set the following env variables:
- For local models through Kaleidoscope (KScope):
    ```bash
    export OPENAI_BASE_URL="https://kscope.vectorinstitute.ai/v1"
    export OPENAI_API_KEY=<kscope_api_key>
    ```
- For OpenAI models:
   ```bash
   export OPENAI_BASE_URL="https://api.openai.com/v1"
   export OPENAI_API_KEY=<openai_api_key>
   ```
3. (Optional) Upload some PDF files into the `source_documents` subfolder under this notebook. We have already provided some sample PDFs, but feel free to replace these with your own.

## Set up the RAG workflow environment

#### Import libraries

In [None]:
import warnings
warnings.filterwarnings('ignore')

*jkwng: added the below in colab enterprise - install faiss and langchain dependencies*

In [None]:
%pip install --quiet faiss-cpu langchain llama-index-vector-stores-faiss


In [None]:
import faiss
import os
import sys

from pathlib import Path

from langchain.text_splitter import RecursiveCharacterTextSplitter

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings, StorageContext
from llama_index.core.llms import ChatMessage
from llama_index.core.node_parser import LangchainNodeParser
from llama_index.core.query_engine import RetrieverQueryEngine

# jkwng: commented out the following on Vertex AI - faiss is in memory
# from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# from llama_index.llms.openai_like import OpenAILike
from llama_index.vector_stores.faiss import FaissVectorStore

#### Load config files

In [None]:
# Add root folder of the rag_bootcamp repo to PYTHONPATH
current_dir = Path().resolve()
parent_dir = current_dir.parent
sys.path.insert(0, str(parent_dir))

#jkwng: we don't need this ?
# from utils.load_secrets import load_env_file
# load_env_file()

In [None]:

#jkwng: we don't need this?
# GENERATOR_BASE_URL = os.environ.get("OPENAI_BASE_URL")

# OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

#### Set up some helper functions

In [None]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.text for i, d in enumerate(docs)]
        )
    )

#### Make sure other necessary items are in place

*jkwng: put the input files on GCS, here we test to make sure we can read our dataset*

In [None]:
from google.cloud import storage

bucket_name = "jkwng-vertex-experiments"
prefix = "rag_bootcamp/document_search/source_documents"


storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
bloblist = bucket.list_blobs(prefix=prefix)

for blob in bloblist:
    print(blob.name)

rag_bootcamp/document_search/source_documents/vector-institute-2021-22-annual-report_accessible.pdf


In [None]:
# Look for the source_documents folder and make sure there is at least 1 pdf file here
contains_pdf = False

#jkwng: migrated this to GCS

#directory_path = "./source_documents"
# if not os.path.exists(directory_path):
    # print(f"ERROR: The {directory_path} subfolder must exist under this notebook")
# for filename in os.listdir(directory_path):
# True as soon as any object under the prefix has a .pdf extension
contains_pdf = any(
    blob.name.lower().endswith(".pdf")
    for blob in bucket.list_blobs(prefix=prefix)
)
if not contains_pdf:
    print(f"ERROR: The gs://{bucket_name}/{prefix} subfolder must contain at least one .pdf file")

#### Choose LLM and embedding model

*jkwng: use Gemini 2.0 Flash and Gemini text embedding models instead of Llama and BGE - we can also deploy these models via Vertex model garden*

In [None]:
# GENERATOR_MODEL_NAME = "Meta-Llama-3.1-8B-Instruct"
# EMBEDDING_MODEL_NAME = "BAAI/bge-base-en-v1.5"

GENERATOR_MODEL_NAME = "gemini-2.0-flash-001"
EMBEDDING_MODEL_NAME = "text-embedding-005"

## Start with a basic generation request without RAG augmentation

Let's start by asking the LLM a difficult, domain-specific question we don't expect it to be able to answer. A simple question like "*What is the capital of France?*" is not a good choice here, because that's world knowledge we expect the LLM to have.

Instead, we want a domain-specific question whose answer the model won't know. A good example would be an obscure detail buried deep within a company's annual report. For example:

*How many Vector scholarships in AI were awarded in 2022?*

In [None]:
query = "How many Vector scholarships in AI were awarded in 2022?"

## Now send the query to the open source model using KScope

*jkwng: send the generation to Gemini 2.0 Flash*

In [None]:
from llama_index.llms.vertex import Vertex

# llm = OpenAILike(
#     model=GENERATOR_MODEL_NAME,
#     is_chat_model=True,
#     temperature=0,
#     max_tokens=None,
#     api_base=GENERATOR_BASE_URL,
#     api_key=OPENAI_API_KEY
# )

#jkwng: send to gemini 2.0
llm = Vertex(
    model=GENERATOR_MODEL_NAME,
    temperature=0
)

message = [
    ChatMessage(
        role="user",
        content=query
    )
]
try:
    result = llm.chat(message)
    print(f"Result: \n\n{result}")
except Exception as err:
    # Exceptions have no .message attribute in Python 3, so inspect str(err)
    if "Error code: 503" in str(err):
        print(f"The model {GENERATOR_MODEL_NAME} is not ready yet.")
    else:
        raise

Result: 

assistant: According to the Vector Institute, they awarded **170** Vector Scholarships in Artificial Intelligence in 2022.



Without additional information, the model is unable to answer the question correctly. **Vector in fact awarded 109 AI scholarships in 2022.** Fortunately, that information appears in Vector's 2021-22 Annual Report, which is stored in the `source_documents` folder. Let's see how we can use RAG to augment our question with a document search and get the correct answer.

## Ingestion: Load and store the documents from `source_documents`

Start by reading in all the PDF files from `source_documents`, break them up into smaller digestible chunks, then encode them as vector embeddings.

*jkwng - add the llama gcs integration*

In [None]:
%pip install --quiet llama-index-readers-gcs


*jkwng: we use the LlamaIndex GCS integration to read the input files directly off of GCS*

In [None]:
from llama_index.readers.gcs import GCSReader

# Load the pdfs
docs = GCSReader(bucket=bucket_name, prefix=prefix).load_data()

# directory_path = "./source_documents"
# os.makedirs(directory_path, exist_ok=True)

print(f"Number of source documents: {len(docs)}")

# Split the documents into smaller chunks
parser = LangchainNodeParser(RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=32))
chunks = parser.get_nodes_from_documents(docs)
print(f"Number of text chunks: {len(chunks)}")



Number of source documents: 42
Number of text chunks: 196
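
To make the chunking step more concrete, here is a minimal sketch of overlapping, fixed-size character windows. It is a simplification: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph and line breaks, which this sketch does not do. The function name `chunk_text` and the sample string are hypothetical and only serve the illustration; the defaults mirror the `chunk_size=512` and `chunk_overlap=32` used above.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 32) -> list[str]:
    """Split text into character windows of at most chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample input: 1200 characters of repeating digits
sample = "".join(str(i % 10) for i in range(1200))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

The overlap means the last 32 characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still partly visible in both neighbours.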


#### Define the embeddings model

*jkwng: use gemini embeddings model*

In [None]:
%pip install --quiet llama-index-embeddings-vertex

In [None]:
from llama_index.embeddings.vertex import VertexTextEmbedding
import google.auth

credentials, project_id = google.auth.default()

print("Setting up the embeddings model...")
# embeddings = HuggingFaceEmbedding(
#     model_name=EMBEDDING_MODEL_NAME,
#     device='cuda',
#     trust_remote_code=True,
# )

embeddings = VertexTextEmbedding(
    model_name=EMBEDDING_MODEL_NAME,
    credentials=credentials,
)

Setting up the embeddings model...


#### Set LLM and embedding model [recommended for LlamaIndex]

In [None]:
Settings.llm = llm
Settings.embed_model = embeddings

## Retrieval: Make the document chunks available via a retriever

The retriever will identify the document chunks that most closely match our original query. (This takes about 1-2 minutes)
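
Conceptually, "most closely match" here means a nearest-neighbour search over embedding vectors. The sketch below shows an exact brute-force search by squared L2 distance in plain Python, which is the same idea a flat L2 index implements at much larger scale. The names `squared_l2` and `top_k` and the two-dimensional toy vectors are hypothetical; real embeddings have hundreds of dimensions.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec: list[float], vectors: list[list[float]], k: int = 5) -> list[tuple[int, float]]:
    """Return (index, distance) pairs for the k stored vectors closest to query_vec."""
    scored = [(squared_l2(query_vec, v), i) for i, v in enumerate(vectors)]
    scored.sort()  # smallest distance first
    return [(i, d) for d, i in scored[:k]]


# Hypothetical toy "embeddings" for four chunks
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
hits = top_k([1.0, 1.0], toy_vectors, k=2)
print(hits)
```

A smaller distance means a closer match, so the first entry in `hits` is the best candidate.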

In [None]:
def get_embed_model_dim(embed_model):
    embed_out = embed_model.get_text_embedding("Dummy Text")
    return len(embed_out)

faiss_dim = get_embed_model_dim(embeddings)
faiss_index = faiss.IndexFlatL2(faiss_dim)

vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex(chunks, storage_context=storage_context)

In [None]:
retriever = index.as_retriever(similarity_top_k=5)

# Retrieve the most relevant context from the vector store based on the query
retrieved_docs = retriever.retrieve(query)

Let's see what results it found. Note that the results are listed in the order the retriever ranked them, with the best match first.

In [None]:
pretty_print_docs(retrieved_docs)

Document 1:

26 
 
 
VECTOR SCHOLARSHIPS IN 
AI ATTRACT TOP TALENT 
TO ONTARIO UNIVERSITIES 
109 
Vector Scholarships in AI awarded 
34 
Programs 
13 
Universities 
351 
Scholarships awarded since the 
program launched in 2018 
Supported with funding from the Province of 
Ontario, the Vector Institute Scholarship in Artifcial 
Intelligence (VSAI) helps Ontario universities to attract 
the best and brightest students to study in AI-related 
master’s programs. 
Scholarship recipients connect directly with leading
----------------------------------------------------------------------------------------------------
Document 2:

5 
Annual Report 2021–22Vector Institute
SPOTLIGHT ON FIVE YEARS OF AI 
LEADERSHIP FOR CANADIANS 
SINCE THE VECTOR INSTITUTE WAS FOUNDED IN 2017: 
2,080+ 
Students have graduated from 
Vector-recognized AI programs and 
study paths 
$6.2 M 
Scholarship funds committed to 
students in AI programs 
3,700+ 
Postings for AI-focused jobs and 
internships ofered on Vector’

## Now send the query to the RAG pipeline

In [None]:
query_engine = RetrieverQueryEngine(retriever=retriever)
result = query_engine.query(query)
print(f"Result: \n\n{result}")

Result: 

In 2022, 109 Vector Scholarships in AI were awarded. Since the program's launch in 2018, a total of 351 scholarships have been awarded.



The model provides the correct answer (109) using the retrieved information.
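
As a rough illustration of what "augmenting" the request means, the sketch below stuffs retrieved chunks into a prompt ahead of the question. This is an assumption-laden simplification, not the template LlamaIndex uses internally: the function name `build_augmented_prompt` and the prompt wording are hypothetical, and the two example chunks are shortened excerpts of the retrieved text shown earlier.

```python
def build_augmented_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into a single prompt string."""
    # Number each chunk so the model (and a human reader) can tell them apart
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Shortened excerpts of the retrieved chunks, used here as example context
example_chunks = [
    "109 Vector Scholarships in AI awarded",
    "351 Scholarships awarded since the program launched in 2018",
]
prompt = build_augmented_prompt(
    "How many Vector scholarships in AI were awarded in 2022?", example_chunks
)
print(prompt)
```

Because the figure 109 now appears in the prompt itself, the model can read the answer from the context instead of relying on what it memorised during training.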