## Objective
Your assignment is to design, build, and explain a novel agentic workflow that utilizes a subset of the Wikipedia dataset. As part of this, you will need to define a distinctive GenAI use case that your system is intended to solve. The aim is to showcase not just your technical implementation skills, but also your ability to apply agentic system design innovatively and practically. You will implement your workflow in the Databricks Free Edition, starting from the provided notebook `01_agentic_wikipedia_aimpoint_interview.ipynb`.

To get you started, we pre-installed LangChain and LangGraph, which are open-source GenAI orchestration frameworks that work well in a Databricks workspace. In addition, we have provided a basic setup for accessing the data source with a LangChain document loader (https://python.langchain.com/docs/integrations/document_loaders/wikipedia/).

You may use coding assistants for this assignment, but you must provide your own custom prompts and demonstrate your own critical thinking. Large language models must not be used to generate responses for the open-response questions in Part B of this notebook.

Note: This assignment uses serverless clusters. At the time of creating this notebook, all components run successfully. However, you may need to address package dependency issues in the future to ensure your GenAI solution continues to function properly. 
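
One way to spot dependency drift early is to compare the installed package versions against the pins used in this notebook's `%pip install` cell. The sketch below is a minimal illustration that uses only the standard library; the helper name `check_pins` and the `PINNED` dictionary are assumptions introduced here, not part of the provided notebook.

```python
from importlib import metadata
from typing import Dict

# Pins copied from the %pip install cell in this notebook.
PINNED: Dict[str, str] = {"langchain": "0.3.7", "langgraph": "0.5.3"}


def check_pins(pins: Dict[str, str]) -> Dict[str, str]:
    """Return a status per package: 'ok', 'missing', or 'mismatch (found X)'."""
    report: Dict[str, str] = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            report[name] = "missing"
            continue
        report[name] = "ok" if found == wanted else f"mismatch (found {found})"
    return report


print(check_pins(PINNED))
```

Running this right after `dbutils.library.restartPython()` would show whether the environment still matches the versions the notebook was written against.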

## Deliverables

1. Reference Architecture
    - This should highlight your approach to addressing your use case or problem, in either PDF or image format; include the technical details of the agentic workflow here.

2. Databricks Notebook(s)
    - Includes the primary notebook `01_agentic_wikipedia_aimpoint_interview.ipynb` and any supplemental notebooks required to run the agent.
    - In the `01_agentic_wikipedia_aimpoint_interview.ipynb` notebook, complete the **GenAI Application Development** and **Reflection** sections. The GenAI Application Development section is where you add your own custom logic to create and run your agentic workflow. The Reflection section is a markdown response that answers the two questions.
    - To reduce your development time, we created the logic for a FAISS vector store and made the LLM accessible as well.
    - Before finalizing, make sure your code runs correctly by using "Run All" to validate functionality. Then go to "File" → "Export" → "HTML" to download the notebook as an HTML file. Next, open the HTML file and save it as a PDF (see the instructions below). __Note: Your submission must be in PDF format.__

    > **Save HTML as PDF**
    > - Windows: (ctrl + P) → Save as PDF → Save
    > - MacOS: (⌘ + P) → Save as PDF → Save


## Data Source

The Wikipedia Loader ingests documents from the Wikipedia API and converts them into LangChain document objects. The page content includes the first sections of the Wikipedia articles and the metadata is described in detail below.

__Recommendation__: If you are using the LangChain document loader, we recommend filtering down to 10k or fewer documents. The `query_terms` argument below can be updated to change the search terms used to query Wikipedia. Make sure you update it to match the use case you defined.

The metadata of each LangChain document object contains the following information:

| Column  | Definition                                                                 |
|---------|-----------------------------------------------------------------------------|
| title   | The Wikipedia page title (e.g., "Quantum Computing").                       |
| summary | A short extract or condensed description from the page content.             |
| source  | The URL link to the original Wikipedia article.                             |
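
Because the loader is called once per entry in `query_terms`, overlapping terms may return the same article more than once. The sketch below shows one possible way to drop such repeats by keying on the `source` metadata field. The `Doc` dataclass is a hypothetical stand-in for a LangChain document so that the example runs on its own, and `dedupe_by_source` is an illustrative helper, not part of the provided setup.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain document (page_content + metadata)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def dedupe_by_source(docs: List[Doc]) -> List[Doc]:
    """Keep the first document seen for each source URL (falling back to title)."""
    seen = set()
    unique: List[Doc] = []
    for d in docs:
        key = d.metadata.get("source") or d.metadata.get("title")
        if key is None:
            # No usable key: keep the document rather than guess.
            unique.append(d)
            continue
        if key in seen:
            continue
        seen.add(key)
        unique.append(d)
    return unique


docs = [
    Doc("Football text", {"title": "Association football", "source": "https://en.wikipedia.org/wiki/Association_football"}),
    Doc("Football text", {"title": "Association football", "source": "https://en.wikipedia.org/wiki/Association_football"}),
    Doc("Basketball text", {"title": "Basketball", "source": "https://en.wikipedia.org/wiki/Basketball"}),
]
print(len(dedupe_by_source(docs)))
```

Applied to the real `docs` list before splitting, a step like this would keep duplicate chunks out of the index, which matters when `k` is small.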

In [0]:
# %pip install -U -qqqq 
# backoff 
# databricks-langchain 
# langgraph==0.5.3 
# uv 
# databricks-agents 
# mlflow-skinny[databricks] 
# chromadb 
# sentence-transformers 
# langchain-huggingface
# langchain-chroma 
# wikipedia 
# faiss-cpu

In [0]:
%pip install -U -q databricks-langchain langchain==0.3.7 faiss-cpu wikipedia langgraph==0.5.3 

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
dbutils.library.restartPython()

### Build the Wikipedia subset + FAISS vector store

In [0]:
# #######################################################################################################
#  ###### Config (Define LLMs, Embeddings, Vector Store, Data Loader specs)                         ######
#  #######################################################################################################

# DataLoader Config
query_terms = ["sport", "football", "soccer", "basketball","baseball", "track","swimming", "gymnastics"] #TODO: update to match your use case requirements
max_docs = 10 #TODO: recommend starting with a smaller number for testing purposes

# Retriever Config
k = 2 # number of documents to return
EMBEDDING_MODEL = "databricks-bge-large-en" # Embedding model endpoint name

# LLM Config
LLM_ENDPOINT_NAME = "databricks-meta-llama-3-1-8b-instruct" # Model Serving endpoint name; for other options, see "Serving" under the AI/ML tab (e.g. databricks-gpt-oss-20b)

example_question = "What is the most popular sport in the US?"

In [0]:
from typing import List, Dict, Any
from langchain_community.document_loaders import WikipediaLoader  # community package is the current home of this loader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from databricks_langchain import ChatDatabricks, DatabricksEmbeddings

# 1) Initialize embeddings + LLM
embeddings = DatabricksEmbeddings(endpoint=EMBEDDING_MODEL)
llm = ChatDatabricks(endpoint=LLM_ENDPOINT_NAME, temperature=0.2)

# 2) Load Wikipedia docs (subset)
docs = []
for term in query_terms:
    loader = WikipediaLoader(query=term, load_max_docs=max_docs)
    docs.extend(loader.load())

# 3) Split
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)
splits = splitter.split_documents(docs)

# 4) Build FAISS index
vs = FAISS.from_documents(splits, embeddings)
retriever = vs.as_retriever(search_kwargs={"k": k})

len(splits), "chunks indexed"





(550, 'chunks indexed')
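
To build intuition for the `chunk_size=800` and `chunk_overlap=120` settings, the sketch below uses a simplified fixed-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph boundaries and can therefore produce shorter chunks; `sliding_chunks` is a hypothetical helper written only for illustration.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 120) -> List[str]:
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample = "".join(chr(97 + i % 26) for i in range(2000))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

With these settings each window advances by 680 characters, so the last 120 characters of one chunk reappear at the start of the next. That shared context is what lets a retrieved chunk still make sense when a sentence straddles a boundary.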

### Create Retrieval Tool

In [0]:
from langchain.tools import tool

@tool("wiki_subset_retriever")
def wiki_subset_retriever(query: str) -> str:
    """Retrieve relevant snippets from curated Wikipedia subset (FAISS)."""
    docs = retriever.invoke(query)  # <- new API
    lines = []
    for idx, d in enumerate(docs):
        title = d.metadata.get("title", "unknown_title")
        lines.append(f"[Doc {idx+1} | {title}] {d.page_content}")
    return "\n\n".join(lines)

In [0]:
q = "basketball history"
out = wiki_subset_retriever.func(q)   # direct python call to the wrapped function

print(type(out))
print(out[:1000])

<class 'str'>
[Doc 1 | History of basketball] Basketball began with its invention in 1891 in Springfield, Massachusetts, by Canadian physical education instructor James Naismith as a less injury-prone sport than football. Naismith was a 31-year-old graduate student when he created the indoor sport to keep athletes indoors during the winters. The game became established fairly quickly and grew very popular as the 20th century progressed, first in America and then in other parts of the world. After basketball became established in American colleges, the professional game followed. The American National Basketball Association (NBA), established in 1946, grew to a multibillion-dollar enterprise by the end of the century, and basketball became an integral part of American culture.


== Early history ==

[Doc 2 | National Basketball Association] The NBA traces its roots to the Basketball Association of America which was founded in 1946 by owners of the major ice hockey arenas in the Northeas

### Agentic workflow design (Planner → Researcher → Fact-checker → Writer)

In [0]:
from langchain_core.prompts import ChatPromptTemplate

planner_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are the Planner in an agentic RAG system.\n"
     "Your job: break the user question into 2-5 precise research queries.\n"
     "Also define what evidence would be required to answer confidently.\n"
     "Return ONLY JSON with keys: subqueries (list of strings), evidence_requirements (list of strings)."),
    ("user", "{question}")
])

In [0]:
print(planner_prompt.format(question="What is the most popular sport in the US?"))

System: You are the Planner in an agentic RAG system.
Your job: break the user question into 2-5 precise research queries.
Also define what evidence would be required to answer confidently.
Return ONLY JSON with keys: subqueries (list of strings), evidence_requirements (list of strings).
Human: What is the most popular sport in the US?
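
The planner prompt asks for 2-5 subqueries and a fixed set of JSON keys, but a model may not always comply. The sketch below shows one possible way to validate and normalize a parsed plan before it reaches the research step. `validate_plan` and its `min_q`/`max_q` parameters are assumptions introduced for illustration.

```python
from typing import Any, Dict, List


def validate_plan(plan: Dict[str, Any], min_q: int = 2, max_q: int = 5) -> Dict[str, List[str]]:
    """Check the planner output shape and return a cleaned copy."""
    subqueries = plan.get("subqueries")
    if not isinstance(subqueries, list):
        raise ValueError("plan['subqueries'] must be a list of strings")

    cleaned: List[str] = []
    for q in subqueries:
        # Drop non-strings, blanks, and exact repeats while keeping order.
        if isinstance(q, str) and q.strip() and q.strip() not in cleaned:
            cleaned.append(q.strip())

    if len(cleaned) < min_q:
        raise ValueError(f"expected at least {min_q} subqueries, got {len(cleaned)}")

    requirements = plan.get("evidence_requirements")
    if not isinstance(requirements, list):
        requirements = []
    requirements = [r for r in requirements if isinstance(r, str)]

    # Truncate rather than fail when the model returns too many subqueries.
    return {"subqueries": cleaned[:max_q], "evidence_requirements": requirements}


plan = validate_plan({
    "subqueries": ["US most popular sport", " football popularity US ", "", "US most popular sport"],
    "evidence_requirements": ["viewership stats"],
})
print(plan)
```

Failing fast here gives a clearer error than a `KeyError` raised later inside the research step.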


In [0]:
researcher_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are the Researcher.\n"
     "For each subquery, call the tool wiki_subset_retriever(subquery).\n"
     "Extract only relevant evidence. Do NOT add outside facts.\n"
     "Return ONLY JSON with keys: findings (list of objects with subquery, evidence_snippets)."),
    ("user", "Question: {question}\nSubqueries: {subqueries}")
])

In [0]:
print(researcher_prompt.format(
    question="What is the most popular sport in the US?",
    subqueries=["US most popular sport", "football popularity US", "baseball vs basketball popularity US"]
))

System: You are the Researcher.
For each subquery, call the tool wiki_subset_retriever(subquery).
Extract only relevant evidence. Do NOT add outside facts.
Return ONLY JSON with keys: findings (list of objects with subquery, evidence_snippets).
Human: Question: What is the most popular sport in the US?
Subqueries: ['US most popular sport', 'football popularity US', 'baseball vs basketball popularity US']


In [0]:
factchecker_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are the Fact-checker.\n"
     "Given the research evidence, identify candidate claims needed to answer the question.\n"
     "For each claim, decide: supported / weakly_supported / unsupported.\n"
     "A claim is supported ONLY if it is directly stated or clearly implied by evidence.\n"
     "Return ONLY JSON with keys: claims (list of objects: claim, status, supporting_doc_refs, rationale)."),
    ("user", "Question: {question}\nEvidence:\n{evidence}")
])

In [0]:
print(factchecker_prompt.format(
    question="What is the most popular sport in the US?",
    evidence="### Subquery: football in the United States\n[Doc 1 | Football in the United States] ...\n"
))

System: You are the Fact-checker.
Given the research evidence, identify candidate claims needed to answer the question.
For each claim, decide: supported / weakly_supported / unsupported.
A claim is supported ONLY if it is directly stated or clearly implied by evidence.
Return ONLY JSON with keys: claims (list of objects: claim, status, supporting_doc_refs, rationale).
Human: Question: What is the most popular sport in the US?
Evidence:
### Subquery: football in the United States
[Doc 1 | Football in the United States] ...



In [0]:
writer_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are the Final Writer.\n"
     "Write a concise, helpful answer.\n"
     "Use ONLY supported claims; weakly_supported claims must be explicitly qualified.\n"
     "If evidence is insufficient, say what is missing.\n"
     "Include citations in the form (Doc X | title) next to relevant sentences.\n"),
    ("user",
     "Question: {question}\n"
     "Fact-check results:\n{factcheck_json}\n"
     "Research evidence:\n{evidence}")
])

In [0]:
print(writer_prompt.format(
    question="What is the most popular sport in the US?",
    factcheck_json='{"claims":[{"claim":"American football is the most popular sport in the US by viewership.","status":"supported","supporting_doc_refs":["Doc 1 | American football"],"rationale":"Directly stated."}]}',
    evidence="[Doc 1 | American football] American football is the most popular sport in the United States by viewership."
))

System: You are the Final Writer.
Write a concise, helpful answer.
Use ONLY supported claims; weakly_supported claims must be explicitly qualified.
If evidence is insufficient, say what is missing.
Include citations in the form (Doc X | title) next to relevant sentences.

Human: Question: What is the most popular sport in the US?
Fact-check results:
{"claims":[{"claim":"American football is the most popular sport in the US by viewership.","status":"supported","supporting_doc_refs":["Doc 1 | American football"],"rationale":"Directly stated."}]}
Research evidence:
[Doc 1 | American football] American football is the most popular sport in the United States by viewership.
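
The writer prompt relies on the model to ignore unsupported claims. A complementary option, sketched below under assumptions, is to partition the fact-check results in plain Python first, so that unsupported claims never reach the writer. `partition_claims` is a hypothetical helper; the status labels are the three named in the fact-checker prompt, and any unrecognized status is treated conservatively as unsupported.

```python
from typing import Any, Dict, List

STATUSES = ("supported", "weakly_supported", "unsupported")


def partition_claims(factcheck: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Group fact-check claims by status; unknown statuses count as unsupported."""
    buckets: Dict[str, List[Dict[str, Any]]] = {s: [] for s in STATUSES}
    claims = factcheck.get("claims")
    if not isinstance(claims, list):
        return buckets
    for claim in claims:
        if not isinstance(claim, dict):
            continue  # skip malformed entries
        status = str(claim.get("status", "")).strip().lower()
        if status not in buckets:
            status = "unsupported"
        buckets[status].append(claim)
    return buckets


example = {"claims": [
    {"claim": "American football is the most popular sport in the US by viewership.", "status": "supported"},
    {"claim": "Baseball is the most popular sport in the US", "status": "unsupported"},
    {"claim": "Popularity is unclear", "status": "Weakly_Supported"},
    {"claim": "Something else", "status": "maybe"},
]}
buckets = partition_claims(example)
print({status: len(items) for status, items in buckets.items()})
```

Only the `supported` and `weakly_supported` buckets would then be serialized into `factcheck_json`.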


### a) GenAI Application Development

__REQUIRED__: This section is where you input your custom logic to create and run your agentic workflow. Feel free to add as many code cells as you need for this assignment.

In [0]:
import json

def extract_first_json(text: str):
    """
    Robustly extract the first valid JSON object from a string.
    Works even if the LLM adds extra text or multiple JSON blobs.
    """
    decoder = json.JSONDecoder()
    s = text.strip()

    # Find any '{' and attempt to decode from there
    for i, ch in enumerate(s):
        if ch == "{":
            try:
                obj, end = decoder.raw_decode(s[i:])
                return obj
            except json.JSONDecodeError:
                continue

    raise ValueError(f"No valid JSON object found in model output:\n{text}")

In [0]:
text = """
Sure! Here is the result you asked for:

{
  "subqueries": ["football popularity US", "baseball popularity US"],
  "evidence_requirements": ["viewership stats", "participation data"]
}

Hope this helps!
"""

out = extract_first_json(text)
print(out)

{'subqueries': ['football popularity US', 'baseball popularity US'], 'evidence_requirements': ['viewership stats', 'participation data']}
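
Extraction can still fail when the model returns no JSON at all. A minimal retry sketch follows, assuming the parser raises `ValueError` on failure (as `extract_first_json` does, and as `json.loads` does via `json.JSONDecodeError`). The names `call_with_json_retry`, `generate`, and `parse` are hypothetical, and the fake model below exists only so the example runs on its own.

```python
import json
from typing import Any, Callable, Dict


def call_with_json_retry(
    generate: Callable[[str], str],
    parse: Callable[[str], Dict[str, Any]],
    prompt: str,
    max_attempts: int = 3,
) -> Dict[str, Any]:
    """Call generate(prompt) until parse() succeeds or attempts run out."""
    reminder = "\n\nReturn ONLY a valid JSON object."
    current_prompt = prompt
    last_error = None
    for _ in range(max_attempts):
        raw = generate(current_prompt)
        try:
            return parse(raw)
        except ValueError as err:
            last_error = err
            # Retry with an explicit reminder appended once to the original prompt.
            current_prompt = prompt + reminder
    raise ValueError(f"No parseable JSON after {max_attempts} attempts") from last_error


# Fake model for demonstration: first reply is prose, second is valid JSON.
replies = iter(["Sorry, here you go", '{"subqueries": ["a", "b"]}'])
plan = call_with_json_retry(lambda _prompt: next(replies), json.loads, "Plan the question")
print(plan)
```

In the notebook, `parse` could be `extract_first_json` and `generate` a small wrapper that invokes the LLM and returns its text content.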


In [0]:
from typing import Dict, Any
from langchain_core.runnables import RunnableLambda

# Step 2: Research (tool calls manually, to keep behavior deterministic)
def run_research(inputs: Dict[str, Any]) -> Dict[str, Any]:
    question = inputs["question"]
    subqueries = inputs["plan"]["subqueries"]
    all_evidence_blocks = []
    findings = []

    for sq in subqueries:
        ev = wiki_subset_retriever.run(sq)  # tool call
        findings.append({"subquery": sq, "evidence_snippets": ev})
        all_evidence_blocks.append(f"### Subquery: {sq}\n{ev}")

    return {
        "question": question,
        "subqueries": subqueries,
        "findings": findings,
        "evidence": "\n\n".join(all_evidence_blocks),
    }

research_chain = RunnableLambda(run_research)

In [0]:
test_inputs = {
    "question": "What is the most popular sport in the US?",
    "plan": {
        "subqueries": ["football in the United States", "baseball in the United States"]
    }
}

out = run_research(test_inputs)

# Print a small preview
print("keys:", out.keys())
print("question:", out["question"])
print("subqueries:", out["subqueries"])
print("findings len:", len(out["findings"]))
print("evidence preview:\n", out["evidence"][:1200])

keys: dict_keys(['question', 'subqueries', 'findings', 'evidence'])
question: What is the most popular sport in the US?
subqueries: ['football in the United States', 'baseball in the United States']
findings len: 2
evidence preview:
 ### Subquery: football in the United States
[Doc 1 | American football] American football is the most popular sport in the United States; the most popular forms of the game are professional and college football, with the other major levels being high school and youth football. Over a million Americans played college or high school football in 2022, and the National Football League (NFL) has one of the highest average attendance of any professional sports league in the world. Its championship game, the Super Bowl, ranks among the most-watched club sporting events globally. Other professional and amateur leagues exist worldwide, but the sport does not have the international popularity of other American sports like baseball or basketball; the sport maintains 

In [0]:
# Step 3: Fact-check (robust parsing)
factcheck_chain = factchecker_prompt | llm | RunnableLambda(lambda x: extract_first_json(x.content))

In [0]:
test_question = "What is the most popular sport in the US?"
test_evidence = """
### Subquery: football in the United States
[Doc 1 | American football] American football is the most popular sport in the United States by viewership.

### Subquery: baseball in the United States
[Doc 2 | Baseball in the United States] Baseball has historically been called the national pastime in the United States.
""".strip()

out = factcheck_chain.invoke({
    "question": test_question,
    "evidence": test_evidence
})

print(out)
print(type(out))

{'claims': [{'claim': 'American football is the most popular sport in the US', 'status': 'supported', 'supporting_doc_refs': ['Doc 1'], 'rationale': 'Directly stated in the evidence'}, {'claim': 'Baseball is the most popular sport in the US', 'status': 'unsupported', 'supporting_doc_refs': ['Doc 2'], 'rationale': 'Historically called the national pastime, but does not imply current popularity'}, {'claim': 'The most popular sport in the US is unclear', 'status': 'weakly_supported', 'supporting_doc_refs': ['Doc 1', 'Doc 2'], 'rationale': 'Evidence suggests American football is the most popular by viewership, but baseball has historical significance'}]}
<class 'dict'>


In [0]:
# Step 4: Write final
final_chain = writer_prompt | llm

In [0]:
print(writer_prompt.format(
    question="What is the most popular sport in the US?",
    factcheck_json='{"claims":[{"claim":"American football is the most popular sport in the US by viewership.","status":"supported","supporting_doc_refs":["Doc 1 | American football"],"rationale":"Directly stated."}]}',
    evidence='[Doc 1 | American football] American football is the most popular sport in the United States by viewership.'
))

System: You are the Final Writer.
Write a concise, helpful answer.
Use ONLY supported claims; weakly_supported claims must be explicitly qualified.
If evidence is insufficient, say what is missing.
Include citations in the form (Doc X | title) next to relevant sentences.

Human: Question: What is the most popular sport in the US?
Fact-check results:
{"claims":[{"claim":"American football is the most popular sport in the US by viewership.","status":"supported","supporting_doc_refs":["Doc 1 | American football"],"rationale":"Directly stated."}]}
Research evidence:
[Doc 1 | American football] American football is the most popular sport in the United States by viewership.


In [0]:
from typing import TypedDict, List, Dict, Any, Optional
import json

from langgraph.graph import StateGraph, END
from langchain_core.runnables import RunnableLambda

# ---------- State ----------
class AgentState(TypedDict, total=False):
    question: str
    plan: Dict[str, Any]
    evidence: str
    findings: List[Dict[str, str]]
    factcheck: Dict[str, Any]
    final: str

# ---------- Nodes ----------
def planner_node(state: AgentState) -> AgentState:
    resp = (planner_prompt | llm).invoke({"question": state["question"]})
    plan = extract_first_json(resp.content)
    return {"plan": plan}

def research_node(state: AgentState) -> AgentState:
    subqueries = state["plan"]["subqueries"]
    all_evidence_blocks = []
    findings = []

    for sq in subqueries:
        ev = wiki_subset_retriever.func(sq)  # deterministic retrieval
        findings.append({"subquery": sq, "evidence_snippets": ev})
        all_evidence_blocks.append(f"### Subquery: {sq}\n{ev}")

    evidence = "\n\n".join(all_evidence_blocks)
    return {"findings": findings, "evidence": evidence}

def factcheck_node(state: AgentState) -> AgentState:
    resp = (factchecker_prompt | llm).invoke({
        "question": state["question"],
        "evidence": state["evidence"]
    })
    factcheck = extract_first_json(resp.content)
    return {"factcheck": factcheck}

def writer_node(state: AgentState) -> AgentState:
    resp = (writer_prompt | llm).invoke({
        "question": state["question"],
        "factcheck_json": json.dumps(state["factcheck"], indent=2),
        "evidence": state["evidence"]
    })
    return {"final": resp.content}

# ---------- Build graph ----------
graph = StateGraph(AgentState)

graph.add_node("planner", planner_node)
graph.add_node("research", research_node)
graph.add_node("factcheck", factcheck_node)
graph.add_node("writer", writer_node)

graph.set_entry_point("planner")
graph.add_edge("planner", "research")
graph.add_edge("research", "factcheck")
graph.add_edge("factcheck", "writer")
graph.add_edge("writer", END)

app = graph.compile()

# ---------- Run ----------
result = app.invoke({"question": example_question})
print(result["final"])

Based on the provided research evidence, the most popular sport in the US is American football. This is supported by multiple documents:

* [Doc 1 | American football] states that American football is the most popular sport in the United States, with over a million Americans playing college or high school football in 2022.
* [Doc 2 | National Football League] states that the NFL has the highest average attendance (67,591) of any professional sports league in the world and is the most popular sports league in the United States.
* [Doc 2 | National Football League] also states that the Super Bowl is among the most-watched sporting events in the world, with individual games accounting for many of the most watched television programs in American history.

Additionally, [Doc 1 | American football] states that the National Football League (NFL) has one of the highest average attendance of any professional sports league in the world, and its championship game, the Super Bowl, ranks among the 
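
The graph above is strictly linear, so the writer runs even when the fact-checker finds nothing supported. One possible extension, sketched below under assumptions, is a routing function that sends the workflow back to the planner when no claim is supported and a retry budget remains. `route_after_factcheck`, `MAX_RESEARCH_ROUNDS`, and the `research_rounds` state key are hypothetical and do not exist in the code above.

```python
from typing import Any, Dict

MAX_RESEARCH_ROUNDS = 2  # hypothetical retry budget


def route_after_factcheck(state: Dict[str, Any]) -> str:
    """Return the name of the next node: 'planner' to retry, else 'writer'."""
    factcheck = state.get("factcheck") or {}
    claims = factcheck.get("claims") or []
    supported = [
        c for c in claims
        if isinstance(c, dict) and c.get("status") == "supported"
    ]
    rounds = state.get("research_rounds", 1)  # assumed to be incremented per research pass
    if not supported and rounds < MAX_RESEARCH_ROUNDS:
        return "planner"
    return "writer"


print(route_after_factcheck({"factcheck": {"claims": []}, "research_rounds": 1}))
print(route_after_factcheck({"factcheck": {"claims": [{"status": "supported"}]}}))
```

In LangGraph, a function like this could replace the fixed `factcheck` to `writer` edge via `graph.add_conditional_edges("factcheck", route_after_factcheck, {"planner": "planner", "writer": "writer"})`. That wiring also needs `research_rounds` added to `AgentState` and incremented in `research_node`, and has not been run as part of this notebook.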

### b) Reflection

__REQUIRED:__ Provide a detailed reflection addressing these two questions:
1. If you had more time, which specific improvements or enhancements would you make to your agentic workflow, and why?
2. What concrete steps are required to move this workflow from prototype to production?


> Enter your reflection here


