# Prerequisites
- WSL
- Miniconda3 

# Setup environment
- Create the conda env: `conda create -n langchain python=3.11`
- Select the newly created "langchain" env as the running environment in VS Code


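Install the `langchain` and `openai` packages, plus the other libraries imported in this notebook

In [None]:
! pip install langchain langchain-community openai python-dotenv pymupdf azure-search-documents duckduckgo-search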

# Init variables

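Set the Azure OpenAI and Azure AI Search credentials that you get from the training team in the `.env` file. The cells below read them with `os.getenv`, so no key needs to appear in the notebook itself.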
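
Before running the cells below, it can help to confirm that the `.env` file defines everything the notebook reads. The following is a minimal sketch, not part of the assignment: `missing_env_vars` is a hypothetical helper, and the variable names are the ones the indexing cell in Assignment 1 reads with `os.getenv`.

```python
import os

# Names the Assignment 1 indexing cell reads via os.getenv.
REQUIRED_VARS = [
    "AZURE_OPENAI_SERVICE",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT",
    "AZURE_SEARCH_SERVICE",
    "AZURE_SEARCH_API_KEY",
]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_VARS)
    if missing:
        print("Missing in .env:", ", ".join(missing))
    else:
        print("All required variables are set.")
```

Call `load_dotenv()` first so that values from `.env` are visible in `os.environ` when the check runs.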

In [2]:
import openai, os
from dotenv import load_dotenv

load_dotenv()

openai.api_type = "azure"
openai.api_key = os.getenv("AZURE_OPENAI_API_KEY")  # same variable name as the indexing cell below
openai.api_version = os.getenv("AZURE_OPENAI_API_VERSION")
deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT")


# Overview
The BonBon FAQ.pdf file contains frequently asked questions and answers for a customer support scenario. The topics cover troubleshooting IT issues such as networking, software, and hardware. Your task is to build a chatbot with LangChain that can answer user questions based on this file.

## Assignment 1: Document Indexing (mandatory)

- Index the content of BonBon FAQ.pdf into a local Chroma vector DB, where the chatbot can look up the information it needs to answer questions.
- Use an embedding model such as Azure OpenAI text-embedding-3-small to create the vectors; feel free to use any other open-source embedding model if it works.
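
To make the idea of a vector lookup concrete, here is a minimal, dependency-free sketch of what a vector index does at query time: it ranks stored chunks by cosine similarity to the query vector. Everything in it is illustrative: the three-dimensional vectors, the chunk texts, the page numbers, and the helper names `cosine_similarity` and `top_k` are made up, whereas a real index stores model-generated embeddings (1536 dimensions for text-embedding-3-small) and typically uses an approximate nearest-neighbour structure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, indexed_chunks, k=3):
    """Return the k (score, chunk) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vector, c["embedding"]), c) for c in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy data: made-up chunks with made-up 3-dimensional "embeddings".
toy_index = [
    {"content": "Map a network drive", "page": 9, "embedding": [0.9, 0.1, 0.0]},
    {"content": "Fix an offline printer", "page": 7, "embedding": [0.1, 0.9, 0.0]},
    {"content": "Remove malware", "page": 12, "embedding": [0.0, 0.2, 0.9]},
]

for score, chunk in top_k([1.0, 0.0, 0.0], toy_index, k=2):
    print(f"[Page {chunk['page']}] {score:.4f} {chunk['content']}")
```

Keeping the page number next to each embedding is what later lets the chatbot cite where an answer came from.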

In [3]:
import os
import dotenv
import fitz
from langchain.text_splitter import RecursiveCharacterTextSplitter
import openai

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex, SimpleField, SearchField, SearchFieldDataType,
    SearchableField, VectorSearch, HnswAlgorithmConfiguration,
    VectorSearchProfile, VectorSearchAlgorithmKind, HnswParameters
)
from azure.search.documents.models import VectorizedQuery

# ===== Load ENV =====
dotenv.load_dotenv()

AZURE_OPENAI_SERVICE = os.getenv("AZURE_OPENAI_SERVICE")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_EMBEDDING_DEPLOYMENT = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT")

AZURE_SEARCH_SERVICE = os.getenv("AZURE_SEARCH_SERVICE")
AZURE_SEARCH_API_KEY = os.getenv("AZURE_SEARCH_API_KEY")

AZURE_SEARCH_ENDPOINT = f"https://{AZURE_SEARCH_SERVICE}.search.windows.net"
AZURE_SEARCH_INDEX = "gptkbindex-pdf"

# ===== Load PDF file =====
def read_pdf_by_page(file_path):
    pages = []
    with fitz.open(file_path) as doc:
        for page in doc:
            text = page.get_text()
            pages.append(text)
    return pages

pages = read_pdf_by_page("./data/BonBon FAQ.pdf")

# ===== Split page into semantic chunks =====
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)

sourcefile_name = "bonbon_faq.pdf"
chunks_with_metadata = []
for page_number, page_text in enumerate(pages, start=1):
    sub_chunks = splitter.split_text(page_text)
    for chunk in sub_chunks:
        chunks_with_metadata.append({
            "content": chunk,
            "page": page_number,
            "sourcefile": sourcefile_name
        })

print(f"Total chunks: {len(chunks_with_metadata)}")

# ===== Setup Azure OpenAI client =====
openai_client = openai.AzureOpenAI(
    api_key=AZURE_OPENAI_API_KEY,
    api_version="2023-07-01-preview",
    azure_endpoint=f"https://{AZURE_OPENAI_SERVICE}.openai.azure.com"
)

def get_embedding(text):
    response = openai_client.embeddings.create(
        model=AZURE_OPENAI_EMBEDDING_DEPLOYMENT,
        input=text
    )
    return response.data[0].embedding

# ===== Generate embeddings & prepare documents =====
documents = []
for i, chunk in enumerate(chunks_with_metadata):
    embedding = get_embedding(chunk["content"])
    documents.append({
        "id": str(i),
        "content": chunk["content"],
        "embedding": embedding,
        "sourcefile": chunk["sourcefile"],
        "page": chunk["page"]
    })

# ===== Create index (once) =====
search_cred = AzureKeyCredential(AZURE_SEARCH_API_KEY)
index_client = SearchIndexClient(endpoint=AZURE_SEARCH_ENDPOINT, credential=search_cred)

index = SearchIndex(
    name=AZURE_SEARCH_INDEX,
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SimpleField(name="sourcefile", type=SearchFieldDataType.String, filterable=True),
        SimpleField(name="page", type=SearchFieldDataType.Int32, filterable=True),
        SearchField(
            name="embedding",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,
            vector_search_profile_name="embedding_profile"
        )
    ],
    vector_search=VectorSearch(
        algorithms=[
            HnswAlgorithmConfiguration(
                name="hnsw_config",
                kind=VectorSearchAlgorithmKind.HNSW,
                parameters=HnswParameters(metric="cosine")
            )
        ],
        profiles=[
            VectorSearchProfile(
                name="embedding_profile",
                algorithm_configuration_name="hnsw_config"
            )
        ]
    )
)

# Create the index if it does not exist yet
from azure.core.exceptions import ResourceNotFoundError

try:
    index_client.get_index(AZURE_SEARCH_INDEX)
    print("Index already exists.")
except ResourceNotFoundError:
    print("Creating new index...")
    index_client.create_index(index)

# ===== Upload documents =====
search_client = SearchClient(
    endpoint=AZURE_SEARCH_ENDPOINT,
    index_name=AZURE_SEARCH_INDEX,
    credential=search_cred
)

print("Uploading documents...")
upload_result = search_client.upload_documents(documents=documents)
print(f"Upload result: {upload_result}")

Total chunks: 63
Index already exists.
Uploading documents...
Upload result: [<azure.search.documents._generated.models._models_py3.IndexingResult object at 0x759b785217d0>, ...]

In [4]:
# ===== Try a vector search =====
search_query = "How do I access shared network drives?"
search_vector = get_embedding(search_query)

print("\n=== Search Results ===")
results = search_client.search(
    search_text=None,
    top=5,
    vector_queries=[
        VectorizedQuery(
            vector=search_vector,
            k_nearest_neighbors=5,
            fields="embedding"
        )
    ],
    filter=f"sourcefile eq '{sourcefile_name}'"
)

for doc in results:
    print(f"[Page {doc['page']}] Score: {doc['@search.score']:.4f}")
    print(doc["content"][:200])
    print("---")



=== Search Results ===
[Page 9] Score: 0.7684
2) Know the Shared Drive Path: 
• 
You should have the network path or UNC (Universal Naming Convention) of the shared 
drive. It typically looks like this: \\computername\sharename or \\IP_address\sh
---
[Page 9] Score: 0.7612
5) In File Explorer, go to "This PC." 
• 
Click on "Computer" in the top menu and select "Map network drive." 
• 
Choose a drive letter and enter the UNC path (e.g., \\server\share). 
• 
Check the box
---
[Page 9] Score: 0.7261
• 
If you haven't mapped the drive, you can directly access it by entering the UNC path in the 
address bar of File Explorer and pressing Enter. 
 
7) Provide Credentials (if required): 
• 
If the sha
---
[Page 10] Score: 0.7168
8) Access Files and Folders: 
• 
Once you're connected to the shared drive, you can browse, open, and manage files and 
folders just like you would on your local drive. 
Please note that the exact ste
---
[Page 9] Score: 0.6889
• 
If all else fails, uninstall the prin

## Assignment 2: Building Chatbot (mandatory)
- Build a chatbot solution for a customer support scenario using the Conversational ReAct agent supported in LangChain.
- The chatbot should be able to answer the FAQs in the sample BonBon FAQ.pdf file.
- The chatbot should use the Azure OpenAI GPT-4o LLM as its reasoning engine.
- The chatbot should be context-aware, meaning that it can chat with users in a conversational manner and take earlier turns into account.
- The agent is equipped with the following tools:
  - Internet Search: lets the chatbot automatically find out more about a topic using DuckDuckGo internet search
  - Knowledge Base Search: lets the chatbot look up information in the private knowledge base
- If the user asks about topics covered in the BonBon FAQ.pdf file, such as internet connection, printer, or malware issues, the chatbot must use the private knowledge base; otherwise, it should search the internet to answer the question.
- The chatbot's answer should mention the source file and the page that the answer comes from, for example "BonBon FAQ.pdf (page 2)".
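
One way to meet the citation requirement is to attach the source to every retrieved chunk before it is handed to the LLM, so the model can repeat it in its answer. The sketch below assumes each retrieved chunk is a dict with the `content`, `sourcefile`, and `page` keys used when indexing; `format_with_source` is a hypothetical helper (not a LangChain function), and the sample chunk text is invented.

```python
def format_with_source(chunks):
    """Join retrieved chunks, tagging each one with 'sourcefile (page N)'."""
    lines = []
    for chunk in chunks:
        source = chunk.get("sourcefile", "Unknown")
        page = chunk.get("page", "?")
        lines.append(f"{chunk['content']}\nSource: {source} (page {page})")
    return "\n\n".join(lines)

# Invented sample chunk, only to show the output shape.
example = [
    {"content": "Restart the router.", "sourcefile": "BonBon FAQ.pdf", "page": 2},
]
print(format_with_source(example))
```

Falling back to "Unknown" and "?" keeps the tool usable when a document is missing metadata, at the cost of a less useful citation.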

In [None]:
# Knowledge base and web search tools
from langchain.tools import Tool
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.tools import DuckDuckGoSearchRun
from langchain.retrievers.azure_ai_search import AzureAISearchRetriever
import os

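# AzureAISearchRetriever reads these variables; take the values from .env instead of hardcoding secrets.
os.environ["AZURE_AI_SEARCH_SERVICE_NAME"] = os.getenv("AZURE_SEARCH_SERVICE", "sd-1096-ai-search")
os.environ["AZURE_AI_SEARCH_INDEX_NAME"] = "gptkbindex-pdf"
os.environ["AZURE_AI_SEARCH_API_KEY"] = os.getenv("AZURE_SEARCH_API_KEY", "")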


def should_use_kb(query: str) -> bool:
    kb_keywords = ["internet", "printer", "malware", "vpn", "email", "wifi", "outlook", "virus", "support policy"]
    return any(kw in query.lower() for kw in kb_keywords)

def get_ddg_tool():
    # DuckDuckGoSearchRun already implements the search itself, so only the
    # name and description are overridden (it has no `func` argument).
    return DuckDuckGoSearchRun(name="DuckDuckGoSearch", description=(
        "Use this tool when the question is general or not related to BonBon company FAQs. "
        "This includes general internet knowledge, latest updates, or topics not covered by internal support."
    ))

def load_knowledge_base():
    embedding = OpenAIEmbeddings(
        deployment="text-embedding-3-small-deployment",
        model="text-embedding-3-small",
        openai_api_key=os.getenv("AZURE_OPENAI_API_KEY"),  # read from .env; never hardcode keys
        openai_api_base="https://sd1096-langchain-training.openai.azure.com",
        openai_api_type="azure",
        openai_api_version="2023-05-15"
    )

    vectordb = Chroma(
        persist_directory="./vectorstore/index",
        embedding_function=embedding
    )
    retriever = vectordb.as_retriever(search_kwargs={"k": 3})
    return retriever

def get_kb_tool():
    retriever = AzureAISearchRetriever(
        content_key="content", top_k=1, index_name="gptkbindex-pdf"
    )

    def _search_kb(query: str) -> str:
        docs = retriever.get_relevant_documents(query)
        return "\n".join([
            f"{doc.page_content}\nSource: {doc.metadata.get('sourcefile', 'Unknown')} (page {doc.metadata.get('page', '?')})"
            for doc in docs
        ])

    return Tool(
        name="KnowledgeBaseSearch",
        func=_search_kb,
        description="Use for answering BonBon FAQ-related topics like internet, printer, malware, VPN, etc."
    )


In [62]:
import os
from dotenv import load_dotenv
from langchain_core.prompts import ChatPromptTemplate
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain.agents import initialize_agent, AgentType
from langchain.memory import ConversationBufferMemory
from langchain.chat_models import AzureChatOpenAI


load_dotenv()

llm = AzureChatOpenAI(
    deployment_name="gpt-4o-deployment",
    model_name="gpt-4o",
    openai_api_key=os.getenv("AZURE_OPENAI_API_KEY"),  # read from .env; never hardcode keys
    openai_api_base="https://sd1096-langchain-training.openai.azure.com/",
    openai_api_type="azure",
    openai_api_version="2025-01-01-preview"
)


tools = [get_ddg_tool(), get_kb_tool()]

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant.",
        ),
        ("placeholder", "{chat_history}"),
        ("human", "{input}"),
        ("placeholder", "{agent_scratchpad}"),
    ]
)

# Build the Conversational ReAct agent; reuse the memory object created above
# so the conversation history lives in a single place.
agent_executor = initialize_agent(
    tools,
    llm,
    agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
    memory=memory,
    verbose=True
)

if __name__ == "__main__":
    print("Welcome to BonBon Support Chatbot!")
    while True:
        user_input = input("You: ")
        if user_input.lower() in {"exit", "quit"}:
            break
        response = agent_executor.invoke({"input": user_input})
        print("Bot:", response["output"])




Welcome to BonBon Support Chatbot!


> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? No
AI: A slow computer can be caused by various issues, such as too many programs running, insufficient memory, malware, or outdated software. Here are a few steps you can try to improve the performance of your computer:

1. **Restart Your Computer**: Sometimes, a simple restart can resolve performance issues.

2. **Close Unnecessary Programs**: Check your task manager (Ctrl + Shift + Esc on Windows, or Activity Monitor on macOS) to see if there are programs using a lot of resources, and close those you don't need.

3. **Free Up Disk Space**: Delete unnecessary files or programs, and clear your temporary files.

4. **Check for Malware**: Run a virus or malware scan using a trusted antivirus program.

5. **Update Your Software**: Ensure your operating system and programs are up to date with the latest patches and updates.

6. **Check Startup Programs**: Dis

## Assignment 3: Build a new assistant based on BonBon source code (optional)
Objectives:
- Run the code and index the sample BonBon FAQ.pdf file into Azure Cognitive Search
- Explore the code and implement a new assistant that behaves the same way as the one above
- Explore other features, such as RBAC and the features on the admin portal

Please contact the training team if you need the BonBon source code.