# Vanilla RAG using Google Vertex AI

## GCP initialization

In [2]:
import os
import vertexai


project_id = "llm-rag-407816"
location = "us-central1"
# Initialize Vertex AI
vertexai.init(project=project_id, location=location)
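
Hard-coding the project ID ties the notebook to a single GCP project. Below is a minimal sketch of reading the settings from environment variables and falling back to the values used above. The variable names `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` and the helper `resolve_gcp_settings` are assumptions made for illustration; the notebook itself does not define them.

```python
import os


def resolve_gcp_settings(env=None,
                         default_project="llm-rag-407816",
                         default_location="us-central1"):
    """Return (project_id, location), preferring environment variables over defaults."""
    env = os.environ if env is None else env
    # The variable names are an assumption; use whatever your environment defines.
    project_id = env.get("GOOGLE_CLOUD_PROJECT") or default_project
    location = env.get("GOOGLE_CLOUD_LOCATION") or default_location
    return project_id, location


project_id, location = resolve_gcp_settings()
# These values would then be passed on unchanged:
# vertexai.init(project=project_id, location=location)
```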

## Data Retrieval

There are many ways to parse PDFs and other formats; see the [LangChain PDF loader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf/) for more options.

In [8]:
# Define the directory path
directory_path = "./data/docs"

# Get all file names in the specified directory
file_names = os.listdir(directory_path)

print(file_names)

['not-only-stand-with-ukraine-but-also-win-with-ukraine.pdf', 'pidtrymaty-ukrayinskyh-pidpryyemtsiv-za-kordonom-proyekt-studentiv-universytetu-notr-dam-ta-uku.pdf', 'putin-the-pilate.pdf', 'ukraine-stands-on-the-right-side-of-history.pdf', 'usu-foundation-welcomed-42-of-the-brightest-ukrainian-entrepreneurs-to-palo-alto-for-the-stanford-ignite-ukraine-program.pdf']
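
`os.listdir` returns every entry in the directory, in no guaranteed order, and the next cell relies on `file_names[0]`. A small sketch of a filtered, sorted listing that keeps only PDF files; `list_pdf_files` is a hypothetical helper, and the demo file names are made up.

```python
import tempfile
from pathlib import Path


def list_pdf_files(directory):
    """Return the names of the PDF files in `directory`, sorted for a stable order."""
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.PDF", "notes.txt"):
        (Path(tmp) / name).touch()
    demo_names = list_pdf_files(tmp)

print(demo_names)
```

In the notebook this would be called as `list_pdf_files(directory_path)`.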


In [9]:
input_file_path = os.path.join(directory_path, file_names[0])

In [10]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(input_file_path)
data = loader.load()
data[0]

Document(page_content='', metadata={'source': './data/docs\\not-only-stand-with-ukraine-but-also-win-with-ukraine.pdf', 'page': 0})

Since `page_content` is empty, PyPDFLoader is not the right choice for this file.
Let's try another PDF parser, PyMuPDFLoader.

In [11]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(input_file_path)
data = loader.load()
data[0]

Document(page_content='', metadata={'source': './data/docs\\not-only-stand-with-ukraine-but-also-win-with-ukraine.pdf', 'file_path': './data/docs\\not-only-stand-with-ukraine-but-also-win-with-ukraine.pdf', 'page': 0, 'total_pages': 10, 'format': 'PDF 1.7', 'title': 'Not only «Stand with Ukraine» but also «Win with Ukraine» - UCU', 'author': 'Kostya Zahorulko', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Microsoft: Print To PDF', 'creationDate': "D:20240409183252+03'00'", 'modDate': "D:20240409183252+03'00'", 'trapped': ''})
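
Both attempts inspected only `data[0]`. The metadata above reports 10 pages and a producer of 'Microsoft: Print To PDF', which may mean the pages carry no text layer at all. Here is a minimal sketch of a check over every loaded page; `has_extracted_text` is a hypothetical helper that works on any objects with a `page_content` attribute, and the `SimpleNamespace` objects only stand in for loader output.

```python
from types import SimpleNamespace


def has_extracted_text(docs, min_chars=1):
    """Return True if the pages together hold at least `min_chars` non-whitespace characters."""
    total = sum(len("".join(doc.page_content.split())) for doc in docs)
    return total >= min_chars


# Stand-ins for loader output: one fully empty result, one with some text.
empty_pages = [SimpleNamespace(page_content=""), SimpleNamespace(page_content=" \n")]
text_pages = [SimpleNamespace(page_content=""), SimpleNamespace(page_content="Not only «Stand with Ukraine»")]

print(has_extracted_text(empty_pages))
print(has_extracted_text(text_pages))
```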

PyMuPDFLoader returns empty content as well, so let's try a third way to parse the PDF: Unstructured.

But first, let's install the required dependency, [Tesseract](https://github.com/UB-Mannheim/tesseract/wiki), and add the Tesseract install directory (in my case, on Windows 11, "C:\Program Files\Tesseract-OCR") to the 'PATH' environment variable.
Otherwise we will get an error:
*TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.*
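
Before loading the PDF, it can save time to confirm that the executable is actually visible to Python. A minimal sketch using the standard library's `shutil.which`; the helper name `find_executable` is made up for illustration.

```python
import shutil


def find_executable(name="tesseract"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


tesseract_path = find_executable()
if tesseract_path is None:
    print("tesseract not found on PATH; install it or add its folder to PATH")
else:
    print(f"tesseract found at {tesseract_path}")
```

If PATH was changed while the notebook kernel was running, the kernel usually has to be restarted before the new value is picked up.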

In [12]:
from langchain_community.document_loaders import UnstructuredPDFLoader

In [13]:
loader = UnstructuredPDFLoader(input_file_path)
data = loader.load()

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\kosty\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


In [15]:
data[0]

Document(page_content="~)\n\nCBACKSHOmE: ff Wo &\n\nNot only «Stand with Ukraine» but also «Win with Ukraine»\n\nWEDHES a APRIL, 2026\n\nThe Russia-Ukraine tvar has geopolitical significance. Russia‘s imperial aspirations, which are clearly seen in its history and culture through the centuries, reach far beyond the borders af Ukraine. Russia is a threat to the security of all Europe and the world. Ukraine is today paying a great price to hold back the enemy - tens of thousands of lives of Ukraine's best sons aml daughters and millions of broken destinies and physical and spiritual wounds. Hows lame will this last? Ukraine cammot win this war without the consolidation of the demacratic world. Taday it is mot enough simply ta stand with Ukraime - it's mecessary to win with Uleraime!\n\n“try does Russian culture, the culture of empire, threaten the whole world? -\n\nce\n\nIria Starewoyt, poetess, literary schalar, ancl assistant prefessor at the Culture Studies\n\nDepartment of Ukrainian 

We parsed the PDF file successfully, although the OCR output contains recognition errors (for example, "tvar" instead of "war"). For more accurate results, consider the [Google Vision API](https://cloud.google.com/vision/docs/pdf).

In [16]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
)

chunks = text_splitter.split_documents(data)

In [18]:
len(chunks)

9
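
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter also tries the listed separators in order so that chunks end at paragraph or sentence boundaries where possible. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

Each chunk repeats the last character of the previous one, which is the role the 50-character overlap plays in the cell above.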

In [17]:
chunks[0]

Document(page_content="~)\n\nCBACKSHOmE: ff Wo &\n\nNot only «Stand with Ukraine» but also «Win with Ukraine»\n\nWEDHES a APRIL, 2026\n\nThe Russia-Ukraine tvar has geopolitical significance. Russia‘s imperial aspirations, which are clearly seen in its history and culture through the centuries, reach far beyond the borders af Ukraine. Russia is a threat to the security of all Europe and the world. Ukraine is today paying a great price to hold back the enemy - tens of thousands of lives of Ukraine's best sons aml daughters and millions of broken destinies and physical and spiritual wounds. Hows lame will this last? Ukraine cammot win this war without the consolidation of the demacratic world. Taday it is mot enough simply ta stand with Ukraime - it's mecessary to win with Uleraime!\n\n“try does Russian culture, the culture of empire, threaten the whole world? -\n\nce\n\nIria Starewoyt, poetess, literary schalar, ancl assistant prefessor at the Culture Studies\n\nDepartment of Ukrainian 

In [28]:
pages_content = [doc.page_content for doc in chunks]

## Embeddings

In [33]:
from vertexai.language_models import TextEmbeddingModel

In [45]:
# Chroma.from_documents expects an embedding object that exposes embed_documents() and embed_query()
# (the interface of, e.g., langchain_community.embeddings.sentence_transformer.SentenceTransformerEmbeddings).
# TextEmbeddingModel only offers get_embeddings(), so this subclass adds the two missing methods.

class CustomTextEmbeddingModel(TextEmbeddingModel):
    def embed_documents(self, documents):
        # Embed a list of texts and return one plain list of floats per text
        embeddings = self.get_embeddings(documents)
        embeddings = [embedding.values for embedding in embeddings]
        return embeddings

    def embed_query(self, query):
        # Embed a single query string and return its vector
        embeddings = self.get_embeddings([query])
        return embeddings[0].values
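
An alternative to subclassing is composition: wrap any object that exposes `get_embeddings(texts)` and send the texts in batches, which helps if the API limits the number of inputs per request. This is only a sketch under assumptions: `BatchedEmbeddingAdapter` is a made-up name, `batch_size=5` is an assumed value rather than a documented limit, and `FakeEmbeddingModel` merely stands in for the real model so the example runs offline.

```python
from types import SimpleNamespace


class BatchedEmbeddingAdapter:
    """Expose embed_documents/embed_query on top of a get_embeddings-style model."""

    def __init__(self, model, batch_size=5):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.model = model
        self.batch_size = batch_size  # assumed value; check your model's request limits

    def embed_documents(self, documents):
        vectors = []
        for start in range(0, len(documents), self.batch_size):
            batch = documents[start:start + self.batch_size]
            vectors.extend(list(e.values) for e in self.model.get_embeddings(batch))
        return vectors

    def embed_query(self, query):
        return list(self.model.get_embeddings([query])[0].values)


class FakeEmbeddingModel:
    """Stand-in for a real model: embeds each text as [len(text)]."""

    def __init__(self):
        self.batch_sizes = []

    def get_embeddings(self, texts):
        self.batch_sizes.append(len(texts))
        return [SimpleNamespace(values=[float(len(t))]) for t in texts]


fake = FakeEmbeddingModel()
adapter = BatchedEmbeddingAdapter(fake, batch_size=2)
vectors = adapter.embed_documents(["a", "bb", "ccc"])
print(vectors, fake.batch_sizes)
```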

In [46]:
# https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings
embedding_model = CustomTextEmbeddingModel.from_pretrained("textembedding-gecko-multilingual@001")
embeddings = embedding_model.get_embeddings(pages_content)
print(f"Length of Embedding Vector: {len(embeddings[0].values)}")

Length of Embedding Vector: 768


In [47]:
from langchain_community.vectorstores import Chroma

# Turn the chunks into embeddings and store them in Chroma
vectordb = Chroma.from_documents(chunks, embedding_model, persist_directory="./chroma_db")

# Expose Chroma as a retriever that returns the 5 most similar chunks (k=5)
retriever = vectordb.as_retriever(search_kwargs={"k": 5})

In [48]:
# Example of the DB usage for similarity_search
vectordb.similarity_search("«Stand with Ukraine» but also «Win with Ukraine»")

[Document(page_content="~)\n\nCBACKSHOmE: ff Wo &\n\nNot only «Stand with Ukraine» but also «Win with Ukraine»\n\nWEDHES a APRIL, 2026\n\nThe Russia-Ukraine tvar has geopolitical significance. Russia‘s imperial aspirations, which are clearly seen in its history and culture through the centuries, reach far beyond the borders af Ukraine. Russia is a threat to the security of all Europe and the world. Ukraine is today paying a great price to hold back the enemy - tens of thousands of lives of Ukraine's best sons aml daughters and millions of broken destinies and physical and spiritual wounds. Hows lame will this last? Ukraine cammot win this war without the consolidation of the demacratic world. Taday it is mot enough simply ta stand with Ukraime - it's mecessary to win with Uleraime!\n\n“try does Russian culture, the culture of empire, threaten the whole world? -\n\nce\n\nIria Starewoyt, poetess, literary schalar, ancl assistant prefessor at the Culture Studies\n\nDepartment of Ukrainian
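
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea in plain Python using cosine similarity. This is an illustrative choice: Chroma's actual index and distance metric may differ, and the two-dimensional vectors are toy data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, doc_vectors, k=4):
    """Return the indices of the k vectors most similar to the query, best first."""
    ranked = sorted(
        range(len(doc_vectors)),
        key=lambda i: cosine_similarity(query_vector, doc_vectors[i]),
        reverse=True,
    )
    return ranked[:k]


doc_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.1], doc_vectors, k=2))
```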

## LLM

In [56]:
from langchain_google_vertexai import VertexAI


# https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text
llm = VertexAI(
    model_name="text-bison@002",
    max_output_tokens=1024,
    temperature=0,
    top_p=0.8,
    top_k=40,
    verbose=False,
    streaming=False
)
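
As a rough illustration of what `top_k` and `top_p` mean, the sketch below filters a toy next-token distribution: keep the `top_k` most probable tokens, then the shortest prefix whose cumulative probability reaches `top_p`, and renormalize. It shows the general idea only, not Vertex AI's implementation, and the token probabilities are invented. With `temperature=0` the model is generally expected to pick the most probable token, so these two settings should matter little in this notebook.

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Apply top-k then top-p (nucleus) filtering to a {token: probability} mapping."""
    # Keep the top_k most probable tokens, highest first.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # Keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    total = sum(p for _, p in kept)
    if total == 0:
        return {}
    # Renormalize so the surviving probabilities sum to 1.
    return {token: p / total for token, p in kept}


toy_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = filter_top_k_top_p(toy_probs, top_k=3, top_p=0.75)
print(filtered)
```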

In [57]:
llm.invoke("«Stand with Ukraine» but also «Win with Ukraine»")

' "Stand with Ukraine" and "Win with Ukraine" are two slogans that have been used to express support for Ukraine during the ongoing conflict with Russia.\n\n"Stand with Ukraine" is a call for solidarity with the Ukrainian people and their government. It is a statement of support for Ukraine\'s sovereignty and territorial integrity, and a condemnation of Russia\'s aggression.\n\n"Win with Ukraine" is a more proactive slogan that calls for Ukraine to emerge victorious from the conflict. It is a statement of confidence in Ukraine\'s ability to defend itself and defeat Russia.\n\nBoth slogans are valid expressions of support for Ukraine, and they can be used in different contexts to convey different messages. "Stand with Ukraine" is a more general statement of support, while "Win with Ukraine" is a more specific call for action.\n\nUltimately, the best way to support Ukraine is to do what you can to help the Ukrainian people and their government. This could include donating to charities th

## RAG

In [58]:
from langchain.chains import RetrievalQA

In [59]:
def get_qa_chain(db_retriever, llm, prompt_template, verbose=False):
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=db_retriever,
        return_source_documents=True,
        verbose=verbose,
        chain_type_kwargs={
            "prompt": prompt_template,
        },
        # callbacks=[AsyncIteratorCallbackHandler()]
    )
    # Enable for troubleshooting
    qa_chain.combine_documents_chain.verbose = verbose
    # this prints the prompt in green
    qa_chain.combine_documents_chain.llm_chain.verbose = verbose
    qa_chain.combine_documents_chain.llm_chain.llm.verbose = verbose
    return qa_chain

In [70]:
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template("""
You are a helpful AI assistant.
Answer based on the context provided. 
context: {context}
input: {question}
answer:
""")

In [71]:
qa_chain = get_qa_chain(retriever, llm, prompt)

In [72]:
response = qa_chain.invoke(
    input={"query": "Tell me about last event at UCU on «Stand with Ukraine» vs «Win with Ukraine»"}
)

In [74]:
response.keys()

dict_keys(['query', 'result', 'source_documents'])

In [75]:
# Print the answer to the question
print(response["result"])

 The latest event at the Ukrainian Catholic University (UCU) focused on the theme of "Stand with Ukraine" versus "Win with Ukraine." The event highlighted the geopolitical significance of the Russia-Ukraine war and emphasized that Russia's imperial aspirations pose a threat to the security of Europe and the world.

The discussion centered around the idea that Ukraine is currently paying a significant price to hold back the Russian aggression, with numerous casualties and widespread devastation. The speakers stressed the importance of global solidarity and support for Ukraine, arguing that it is not enough to simply stand with Ukraine, but rather to actively work towards a Ukrainian victory.

The event featured insights from Iryna Starovoit, a poetess, literary scholar, and assistant professor at the Culture Studies Department of UCU. She provided an analysis of Russian culture and its imperialistic tendencies, explaining how they contribute to the threat posed by Russia to the world.



In [77]:
[doc.metadata for doc in response["source_documents"]]

[{'source': './data/docs\\not-only-stand-with-ukraine-but-also-win-with-ukraine.pdf'},
 {'source': './data/docs\\not-only-stand-with-ukraine-but-also-win-with-ukraine.pdf'},
 {'source': './data/docs\\not-only-stand-with-ukraine-but-also-win-with-ukraine.pdf'},
 {'source': './data/docs\\not-only-stand-with-ukraine-but-also-win-with-ukraine.pdf'},
 {'source': './data/docs\\not-only-stand-with-ukraine-but-also-win-with-ukraine.pdf'}]
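
All five retrieved chunks come from the same PDF, since only one file was indexed. When showing sources to a user, it is usually enough to list each file once. A minimal sketch with a hypothetical helper `unique_source_names`; `PureWindowsPath` is used because it accepts both the `/` and `\` separators that appear in the paths above.

```python
from pathlib import PureWindowsPath


def unique_source_names(metadatas):
    """Return distinct source file names, in first-seen order."""
    seen = []
    for meta in metadatas:
        name = PureWindowsPath(meta.get("source", "")).name
        if name and name not in seen:
            seen.append(name)
    return seen


example_metadata = [
    {"source": "./data/docs\\not-only-stand-with-ukraine-but-also-win-with-ukraine.pdf"}
] * 5
print(unique_source_names(example_metadata))
```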