## Building Up Our Multi-Agent LLM System with LlamaIndex Workflow
*Creating a digital twin of a Data Scientist job candidate for interview preparation.*

> *Outline of steps:*
- Load personal notes as documents for the RAG system.
- Build a hybrid retrieval query engine with Qdrant.
- Build a web-search tool.
- Integrate agents, events and tools into a LlamaIndex workflow.

In [None]:
# import dependencies
import qdrant_client
from IPython.display import Markdown, display
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.core import Settings
from llama_index.core.chat_engine import SimpleChatEngine


> *Set up Qdrant vector database with "docker run ..." command*

In [None]:
# load documents from the "../data/notes" folder
docs = SimpleDirectoryReader("../data/notes").load_data(show_progress=True)

# create Qdrant clients - sync & async
client = qdrant_client.QdrantClient(host='localhost', port=6333)
aclient = qdrant_client.AsyncQdrantClient(host="localhost", port=6333)

# Set LLM & embedding model
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.1, max_tokens=1024, streaming=True)
Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Initialize Qdrant vector store
vector_store = QdrantVectorStore(
    collection_name="interview_notes",
    client=client,
    aclient=aclient,
    enable_hybrid=True,
    fastembed_sparse_model="Qdrant/bm25",
)

# Create storage context container
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build vector store index -> Query Engine
index = VectorStoreIndex.from_documents(
    docs,
    storage_context=storage_context,
    embed_model=Settings.embed_model
)

Loading files:  11%|█         | 1/9 [00:00<00:03,  2.54it/s]Ignoring wrong pointing object 29 0 (offset 0)
Ignoring wrong pointing object 39 0 (offset 0)
Loading files: 100%|██████████| 9/9 [00:00<00:00, 11.55it/s]
Fetching 18 files: 100%|██████████| 18/18 [00:01<00:00, 15.34it/s]


In [5]:
# get example response - retrieve 2 sparse, 2 dense, and filter down to 3 total hybrid results
query_engine = index.as_query_engine(
    similarity_top_k=2, 
    sparse_top_k=2, 
    hybrid_top_k=3,
    vector_store_query_mode="hybrid", 
    llm=Settings.llm, 
    # use_async=True,
)
response = query_engine.query(
    "Write a self introduction for an interview for Data Scientist position with Singtel."
    )
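
To make the `similarity_top_k` / `sparse_top_k` / `hybrid_top_k` settings concrete, here is a minimal, library-free sketch of one common way to merge dense and sparse hits: min-max normalise each list's scores, blend them with a weight, and keep the top results. The function names, the `alpha` weight and the sample scores are illustrative assumptions; this is not the code that LlamaIndex or Qdrant runs internally.

```python
from typing import Dict, List, Tuple


def normalise(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max scale scores to [0, 1]; a constant list maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    alpha: float = 0.5,  # assumed blend weight: 1.0 = dense only, 0.0 = sparse only
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Blend normalised dense and sparse scores and keep the best top_k."""
    dense_n, sparse_n = normalise(dense), normalise(sparse)
    fused = {
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    # Sort by descending score, breaking ties by id for a stable order
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Two dense hits and two sparse hits (made-up scores), one document in common
dense_hits = {"note_a": 0.82, "note_b": 0.75}
sparse_hits = {"note_b": 7.1, "note_c": 3.4}
top = fuse(dense_hits, sparse_hits)
```

With two hits per retriever and one overlap, at most three distinct chunks survive, which is why a `hybrid_top_k` of 3 fits this configuration.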

In [6]:
display(Markdown(f"<b>{response}</b>"))

<b>Hello, my name is [Your Name], and I am excited about the opportunity to interview for the Data Scientist position with Singtel. I have a strong background in data science and business statistics, having worked with the Business Statistics Division at SingStat. In my role, I was involved in various projects, including the Statistical Business Register and Business Data Analytics teams, where I developed and maintained data extraction functions and collaborated with external agencies to support policy formulation.

My passion lies in leveraging data to derive actionable insights, particularly in the areas of natural language processing and machine learning. I have experience in building and refining machine learning classifiers, running statistical analyses, and translating complex data findings into clear, actionable insights for stakeholders. I believe that my skills in data analysis and my collaborative approach would make me a valuable addition to the Singtel team.

Outside of my professional experience, I am committed to continuous learning and self-development, which I believe is essential in the ever-evolving field of data science. I am also an avid sports enthusiast and enjoy reading, which helps me maintain a balanced perspective.

Thank you for considering my application. I look forward to discussing how my background and skills align with the goals of Singtel.</b>

> *Consider using structured outputs for workflow.*

> *The idea is to retrieve chunks from "past experiences" as a memory to personalise messages and answers, like a digital twin.*

> *Structure outputs to complement the web-search agent, e.g. related work experiences, how current pursuits fit with the role, etc.*

> *Index contains information pertaining to interview preparation notes: research, background, questions and answers, personal questions and self-reflections after interviews.*

**Outputs from Query Engine:**
1. Relevant research follow-up (if any): str
2. Relevant background information (if any): str
3. Relevant Q&A (if any): str
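
The three outputs above could be captured in a small typed container. The sketch below uses a standard-library dataclass so it runs on its own; the class name, field names and `from_json` helper are hypothetical, and the same fields would translate directly to a Pydantic model if the query engine is asked for structured output.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class NotesLookup:
    """Hypothetical container for the three query-engine outputs."""

    research_follow_up: Optional[str] = None
    background_information: Optional[str] = None
    related_qna: Optional[str] = None

    @classmethod
    def from_json(cls, raw: str) -> "NotesLookup":
        """Build the container from a JSON string, ignoring unknown keys."""
        data = json.loads(raw)
        allowed = {"research_follow_up", "background_information", "related_qna"}
        # Empty strings are treated as "nothing relevant found"
        cleaned = {k: (v or None) for k, v in data.items() if k in allowed}
        return cls(**cleaned)


example = NotesLookup.from_json(
    '{"background_information": "Worked at SingStat.", "related_qna": ""}'
)
```

Keeping every field optional mirrors the "(if any)" qualifier: a missing section stays `None` instead of forcing the model to invent content.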

> *Use another agent to formulate questions for research agent.*

#### Thought process in interview preparation
1. Look at Job Description - Upload to system. 
2. Research and find out more about the team and what they do - a question-formulation agent uses the JD to draft questions, and a web-search agent searches the web.
3. Structure response for (1) work experience and (2) Why {role}? - RAG agent with context. 
4. Prepare and structure responses for General Questions - RAG agent with context (can include loop back to check for past question bank)
5. Prepare and structure responses for technical questions - Web-search agent
6. Prepare questions for team - Compile agent. 
7. Review and structure outputs - Compile/Review agent.
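
The seven steps above can be written down as plain data to see which agents the workflow needs and where each one is used. This is only a planning sketch: the agent labels (`question`, `web_search`, `rag`, `compile`) are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (step number, task, agents involved); agent labels are hypothetical
PREP_PLAN: List[Tuple[int, str, List[str]]] = [
    (1, "Upload job description", []),
    (2, "Research the team and its projects", ["question", "web_search"]),
    (3, "Work experience and 'Why {role}?'", ["rag"]),
    (4, "General questions", ["rag"]),
    (5, "Technical questions", ["web_search"]),
    (6, "Questions for the team", ["compile"]),
    (7, "Review and structure outputs", ["compile"]),
]


def steps_by_agent(plan: List[Tuple[int, str, List[str]]]) -> Dict[str, List[int]]:
    """Invert the plan: map each agent to the step numbers it serves."""
    mapping: Dict[str, List[int]] = defaultdict(list)
    for number, _task, agents in plan:
        for agent in agents:
            mapping[agent].append(number)
    return dict(mapping)


AGENT_STEPS = steps_by_agent(PREP_PLAN)
```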

#### PDF extraction - Test

In [12]:
from PyPDF2 import PdfReader

reader = PdfReader("../pdf/JD Data Scientist - PM&S.pdf")
# Extract text once per page and skip pages that yield nothing
all_text = "\n".join(text for page in reader.pages if (text := page.extract_text()))
print(all_text)

  
Job Title: Data Scientist – Decision Science (DSAI)  
 
 
 
HTX is the world’s first Science and Technology agency that integrates a diverse range of 
scientific and engineering capabilities to innovate and deliver transformative and operationally -
ready solutions for homeland security. As a statutory board of the Ministry of Home Affairs 
(MHA) and integral to the Home Team, HTX works at the forefront of science and technology to 
empower Singapore’s frontline of security. Our shared mission is to amplify, augment and 
accelerate the Home Team’s advantage and secure Singapore as the safest place on planet earth.  
Join us if you would like to be part of HTX’s force multiplier to exponentially impact 
Singapore’s safety and security.  
 
 
We are looking for someone who is creative, curious, collaborative and enjoys challenges.  If 
you like to design, build and deploy scalable optimisation/simulation solutions, we want you!  
 
Successful candidate’s primary job role would be in o
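
The extracted text above carries trailing spaces, runs of near-empty lines, and words split across lines (`operationally -` / `ready`). A light clean-up pass before prompting might look like the sketch below; `clean_pdf_text` is a hypothetical helper, and the rules are assumptions tuned to this output, not a general PDF fix.

```python
import re


def clean_pdf_text(text: str) -> str:
    """Tidy whitespace and line-break hyphenation in extracted PDF text."""
    # Re-join words split as "word -\nword"; the hyphen is kept because
    # we cannot tell a soft hyphen from a real one
    text = re.sub(r"(\w) ?-[ \t]*\n[ \t]*(\w)", r"\1-\2", text)
    # Drop trailing spaces and tabs at the end of each line
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Collapse three or more newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "Job Title: Data Scientist  \n \n \n \ndeliver operationally -\nready solutions  \n"
cleaned = clean_pdf_text(sample)
```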

#### Define simple chat engine to return a list of questions for company websearch
- Manager LLM corresponds to "my thought process", like a "mini-me"/digital twin chain of thought reasoning process.

In [None]:
# Define chat engine. 
llm = OpenAI(model="gpt-4o-mini", temperature=0.1, max_tokens=1024, streaming=True)
manager_llm = SimpleChatEngine.from_defaults(llm=llm)

# Get response with prompt
response = manager_llm.chat(
    f"""
    Given the job description: {all_text}, 
    Formulate questions to: 
    1. Clarify job scope/key deliverables (if necessary).
    2. Find out more about the team and specific projects taken by the team.
    Return only a concise list of questions, each as an independent google search query for web search.
    Limit response to 5 questions.
    """
)

In [23]:
print(response)

1. What are the specific responsibilities of a Data Scientist in the Decision Science team at HTX?
2. Can you provide examples of recent projects undertaken by the Data Science team at HTX?
3. What types of data science and AI solutions have been implemented by HTX to address user needs?
4. How does HTX define success for the projects managed by the Data Science team?
5. What tools and technologies are primarily used by the Data Science team at HTX?
6. How does the collaboration process work between the Data Science team and business users at HTX?
7. What opportunities for professional development and upskilling are available for Data Scientists at HTX?
8. What is the typical project lifecycle for data science initiatives at HTX?
9. How does HTX measure the impact of its data-driven solutions on homeland security?
10. What are the key challenges faced by the Data Science team at HTX in their projects?
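
The prompt asked for 5 questions but the model returned 10, so the limit is better enforced in code than trusted to the prompt. A minimal sketch, assuming the response is a numbered list like the one above (`parse_questions` is a hypothetical helper):

```python
import re
from typing import List


def parse_questions(raw: str, limit: int = 5) -> List[str]:
    """Extract numbered-list items ("1. ..." or "1) ...") and cap the count."""
    questions: List[str] = []
    for line in raw.splitlines():
        match = re.match(r"^\s*\d+[.)]\s+(.*\S)\s*$", line)
        if match:
            questions.append(match.group(1))
    return questions[:limit]


raw_response = (
    "1. What are X?\n2. Can you Y?\n\n3) Third?\n"
    "not a question\n4. Four?\n5. Five?\n6. Six?"
)
queries = parse_questions(raw_response)
```

Each returned string can then be sent as its own search query instead of passing the whole list as one block of text.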


#### Simple workflow to coordinate question formulation and web-search loop. 

In [36]:
# Import dependencies for workflow and setup events
from llama_index.core.workflow import (
    Workflow,
    step,
    Context,
    Event,
    StartEvent,
    StopEvent
)
from llama_index.core.base.llms.types import ChatMessage
from llama_index.core.chat_engine import SimpleChatEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.prompts import PromptTemplate
from llama_index.core.agent.workflow import ReActAgent, FunctionAgent

from pydantic import BaseModel, Field
from typing import Literal, List, Dict, Optional, Any, Union, Annotated
from tavily import AsyncTavilyClient
from PyPDF2 import PdfReader

In [32]:
# Define agents
import os

# Define the LLM first so the agent below does not rely on an earlier cell
llm = OpenAI(model="gpt-4o-mini", temperature=0.1, max_tokens=1024, streaming=True)

async def search_web(query: str) -> str:
    """Useful for using the web to answer questions."""
    # Read the API key from the environment (assumes TAVILY_API_KEY is set)
    # instead of hard-coding a secret in the notebook
    client = AsyncTavilyClient(api_key=os.environ["TAVILY_API_KEY"])
    return str(await client.search(query))

web_agent = FunctionAgent(
    tools=[search_web],
    llm=llm,
    system_prompt="You are a helpful assistant that can search the web for information.",
)

# Define chat engine - "My thought process"
manager_llm = SimpleChatEngine.from_defaults(llm=llm)
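
`search_web` returns the raw search response as a string, which hands the agent a lot of noise. Since the final report needs project links, a compact title/URL/snippet format may serve better. The sketch below assumes the response is a dict with a `results` list whose items carry `title`, `url` and `content` keys; that shape is an assumption to verify against the actual Tavily response, and `format_search_results` is a hypothetical helper.

```python
from typing import Any, Dict


def format_search_results(response: Dict[str, Any], max_results: int = 3) -> str:
    """Render search hits as one bullet per result: title, URL, snippet."""
    lines = []
    for item in response.get("results", [])[:max_results]:
        title = item.get("title", "(untitled)")
        url = item.get("url", "")
        # Collapse internal whitespace so each hit stays on one line
        snippet = " ".join(str(item.get("content", "")).split())
        lines.append(f"- {title} ({url}): {snippet}")
    return "\n".join(lines) if lines else "No results found."


# Made-up response used only to show the output shape
fake_response = {
    "results": [
        {"title": "HTX careers", "url": "https://example.com/a", "content": "Data  science\nat HTX"},
        {"title": "HTX projects", "url": "https://example.com/b", "content": "Digital twin work"},
    ]
}
formatted = format_search_results(fake_response, max_results=1)
```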

In [79]:
## Define events
class structured_jd(BaseModel): 
    company_description: str = Field(..., description="Brief overview about company and team in point form with newline.")
    job_scope: str = Field(..., description="Concise summary of job scope and key deliverables in point form with newline.")
    example_projects: str = Field(..., description="Given context of job description, summarize projects (if any). If no projects detailed, return a string named hypothetical_project.")
    questions_ind: bool = Field(..., description="True if a web search is needed for more information (always True when no example projects are found); otherwise False.")
    questions: Optional[str] = Field(None, description="If questions_ind=True, return questions used for web-search. Ensure company name and job role is in each individual questions for search specificity.")

class WebSearchEvent(Event): 
    query: str

class CompileEvent(Event): 
    company_description: str
    job_scope: str
    example_projects: str
    web_responses: Optional[str] = None
    source: str

# Define workflow
class simpleFlow(Workflow): 
    def __init__(self, manager_llm=llm, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.web_agent = web_agent
        self.llm=llm
        self.manager_llm = SimpleChatEngine.from_defaults(llm=manager_llm)
        self.summary_prompt_template = PromptTemplate(
            """Given the job description pdf: {input}, return a structured output of the given fields."""
        )

    # Generate a structured response from the JD, then route to WebSearchEvent if supplementation is needed
    @step
    async def internalise_jd(self, ctx: Context, ev: StartEvent) -> WebSearchEvent | CompileEvent:
        pdf_input = ev.input
        # Get response with prompt
        response =  await self.llm.astructured_predict(
            structured_jd,
            self.summary_prompt_template,
            input = pdf_input
        )
        
        # if there are questions, go to WebSearchEvent
        if response.questions_ind: 
            await ctx.set("structured_jd", response)
            return WebSearchEvent(query=str(response.questions))
        # else, go to compile event
        else: 
            return CompileEvent(
                company_description=response.company_description,
                job_scope=response.job_scope,
                example_projects=response.example_projects,
                web_responses=None,
                source="manager_llm",
            )
    
    @step
    async def web_search(self, ctx: Context, ev: WebSearchEvent) -> CompileEvent:
        qns = ev.query
        # Retrieve the structured JD stored by internalise_jd
        jd = await ctx.get("structured_jd")
        # Use the agent attached to this workflow instance
        web_response = await self.web_agent.run(
            user_msg=f"Search and return questions and answers, separated by newlines. Questions: {qns}"
        )
        return CompileEvent(
            company_description=jd.company_description,
            job_scope=jd.job_scope,
            example_projects=jd.example_projects,
            web_responses=str(web_response),
            source="both",
        )
    
    @step
    async def compile(self, ctx: Context, ev: CompileEvent) -> StopEvent: # Return StopEvent to end loop first.
        response = await self.manager_llm.achat(
            f"""
            Internalise company description {ev.company_description} and job scope {ev.job_scope}.
            If there are no example projects {ev.example_projects}, supplement it with web-search results {ev.web_responses}. 
            Summarize all the above information into a neat write-up with 3 sections: Company description, job scope/key deliverables and example projects (with their respective weblinks).
            The report contents should be grounded in the information provided only. Return in neat, markdown format.
            """
        )
        return StopEvent(result=str(response))
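
The routing in this workflow boils down to one branch: search the web only when the job description lacks detail, then compile. The sketch below reproduces that control flow with plain `asyncio` and no LlamaIndex, so it can be run and tested in isolation. `Summary`, `fake_search` and `run_flow` are hypothetical stand-ins, not part of the workflow API.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Summary:
    """Stand-in for the structured JD summary."""

    example_projects: str
    needs_search: bool
    questions: Optional[str] = None


async def fake_search(query: str) -> str:
    """Stand-in for the web agent; returns a canned string."""
    return f"results for: {query}"


async def run_flow(summary: Summary) -> Dict[str, Any]:
    """Mirror the branch taken after the JD has been summarised."""
    if summary.needs_search:
        # The JD lacks detail: search first, then compile from both sources
        web = await fake_search(summary.questions or "")
        return {"source": "both", "web_responses": web}
    # The JD is sufficient: compile from the summary alone
    return {"source": "manager_llm", "web_responses": None}


with_search = asyncio.run(run_flow(Summary("", True, "HTX data science projects")))
without_search = asyncio.run(run_flow(Summary("Digital twin", False)))
```

Inside a notebook, where an event loop is already running, `await run_flow(...)` replaces `asyncio.run(...)`.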

#### Run the workflow

In [80]:
# Get JD PDF text
reader = PdfReader("../pdf/JD Data Scientist - PM&S.pdf")
all_text = "\n".join(text for page in reader.pages if (text := page.extract_text()))

w = simpleFlow(timeout=60, verbose=False)
result = await w.run(input=all_text)
print(str(result))



# HTX Overview

## Company Description
HTX is the world’s first Science and Technology agency dedicated to integrating a diverse range of scientific and engineering capabilities to innovate and deliver transformative and operationally-ready solutions for homeland security. As a statutory board under the Ministry of Home Affairs (MHA), HTX operates at the forefront of science and technology to empower Singapore’s frontline security efforts. Their mission is to amplify, augment, and accelerate the Home Team’s advantage, ensuring Singapore remains the safest place on earth.

## Job Scope / Key Deliverables
The Data Science team at HTX is responsible for identifying, scoping, developing, and implementing data science and AI solutions tailored to meet the business needs of the MHA. Key responsibilities include:
- Applying Data Science and Operations Research techniques to address various business challenges.
- Conceptualizing and designing digital twin solutions based on user specifications

> *Web-search not company-specific. Use web-search to supplement JD summary.*
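
One cheap guard against generic search results is to make sure every query names the company and the role before it is sent. A minimal sketch, assuming simple case-insensitive substring matching is good enough (`make_company_specific` is a hypothetical helper):

```python
def make_company_specific(query: str, company: str, role: str) -> str:
    """Append the company and role to a query when they are not mentioned."""
    parts = [query.strip()]
    lowered = query.lower()
    if company.lower() not in lowered:
        parts.append(company)
    if role.lower() not in lowered:
        parts.append(role)
    return " ".join(parts)


generic = make_company_specific("recent data science projects", "HTX", "Data Scientist")
specific = make_company_specific(
    "What tools does the HTX Data Scientist team use?", "HTX", "Data Scientist"
)
```

Substring matching can misfire on short company names that appear inside other words, so a stricter word-boundary check may be worth adding for real use.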