
# AI Interviewer — Conversation Flow Manager (LangGraph MVP)
This notebook demonstrates a prototype conversation flow manager for an adaptive AI interviewer built using LangGraph. The system dynamically adjusts its questioning strategy based on a participant’s responses to verify their claimed skills efficiently and accurately.

At its core, the agent maintains a per-skill belief state — a running mean and confidence band - to represent how confident it is that the respondent truly possesses each skill. Each turn of the conversation updates these beliefs and influences what the next question should be. The goal is to maximise verification accuracy while minimising interview time.

## Objective
Design a conversation manager that can:
1.	Measure skill proficiency adaptively – gauge how well the respondent actually understands a claimed skill.
2.	Balance exploration and exploitation – explore unverified skills while deepening assessment of skills with partial evidence.
3.	Optimise for total verification reward – confirm as many skills as possible with strong evidence in limited turns.
4.	Minimise interview cost – reduce redundant or low-information questions to keep the process short and scalable. 

## System Overview
The system operates as an information-seeking loop:

Ask -> Evaluate -> Update Belief -> Choose Next Question 

Under the hood, it's implemented as a LangGraph state graph with a few key nodes: 

generate_candidates → select_question → ask → grade → update → decide


1. **Generate Questions**: Propose a set of questions to ask the respondent. 
2. **Select Question**: Choose the most informative question from the set. 
3. **Ask**: Collect a response from the respondent. 
4. **Evaluate**: Grade the response based on a predefined rubric. 
5. **Update Belief**: Update the belief state based on the response. 
6. **Decide**: Determine what to do next based on the belief state. 

At the end, the system outputs Verification Cards summarising each skill’s confidence, lower-bound certainty, and verification status.

## Design Reasoning 
1. Problem challenge: 
Traditional interview scripts are static and waste turns asking low-value or redundant questions. If we also do exhaustive probing of each skill, we may need 20+ turns. 
We need an interviewer that learns during the interaction. 

2. Key insight: 
The interview can be framed as an active learning or adaptive testing process — each question acts as an experiment that reduces uncertainty about the candidate’s expertise.

3. Belif + policy:
Maintain a running estimate of each skill’s confidence and pick the next question that yields the highest expected information gain (uncertainty reduction) within the turn budget. Compare to a brute-force Large Language Model (LLM) approach that simply generates questions based on a skill, this approach can learns and adapts within the interview, whilst still being able to verify the skills after the interview. 

4. Stopping criteration: 
Stop when further questioning provides minimal confidence improvement — efficient, not exhaustive.

5.	Proof-of-concept scope:
Domain: 3 claimed skills (e.g., PyTorch, LLM Evaluation, CUDA).
Interview length: 8–10 turns.
Output: dynamic evolution of skill confidence + verification summary.
6.	Risks and mitigations:
LLM grading bias: Use fixed rubric + schema validation.
Overconfidence early: Enforce minimum turns before verification.
Verbose or evasive answers: Penalise low specificity in scoring.
JSON drift: Strict schema with fallback templates.




In [47]:
import os
import math
from typing import TypedDict, List, Dict
from dotenv import load_dotenv

from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END

# --- Load API Key ---
load_dotenv()
if "OPENAI_API_KEY" not in os.environ:
    raise ValueError("OpenAI API key not found. Please set it in your .env file.")


# --- 1. Define the State for the Graph ---
# This is the central object that flows through the system, tracking everything.
class InterviewState(TypedDict):
    """Represents the state of the interview at any given time."""

    skills_to_verify: List[str]
    belief_state: Dict[
        str, Dict
    ]  # Maps skill -> {'mean_score': float, 'n_questions': int}
    conversation_history: List[str]
    candidate_questions: List["Question"]  # A pool of questions to choose from
    current_question: "Question"
    candidate_response: str
    grade: "Grade"
    turn_count: int
    max_turns: int
    final_report: str


# --- 2. Define Pydantic Models for Structured LLM Outputs ---
# These ensure the LLM provides responses in a reliable, parsable format.


class Question(BaseModel):
    """A question to ask the candidate about a specific skill."""

    skill: str = Field(description="The skill this question is designed to test.")
    text: str = Field(description="The text of the question.")
    difficulty: int = Field(
        description="The difficulty of the question, from 1 (easy) to 5 (expert)."
    )


class Grade(BaseModel):
    """A grade for the candidate's response."""

    score: int = Field(
        description="The score from 1 (poor) to 5 (excellent) based on the rubric."
    )
    reasoning: str = Field(description="A brief justification for the given score.")


# --- 3. Initialize the LLM ---
# We'll use this LLM for all nodes that require generation or grading.
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)

# --- 4. Define the Nodes of the Graph ---
# Each function represents a step in your conversation flow.


def initialize_state(state: InterviewState) -> InterviewState:
    """Initializes the belief state for all skills."""
    belief_state = {
        skill: {"mean_score": 0.0, "n_questions": 0}
        for skill in state["skills_to_verify"]
    }
    state["belief_state"] = belief_state
    state["turn_count"] = 0
    state["conversation_history"] = []
    print(
        "System: Initializing interview for skills:",
        ", ".join(state["skills_to_verify"]),
    )
    print("-" * 50)
    return state


def generate_questions_node(state: InterviewState) -> InterviewState:
    """
    Node 1: Generate a set of candidate questions, one for each skill.
    This corresponds to your "Generate Questions" step.
    """
    skills = state["skills_to_verify"]
    belief_state = state["belief_state"]
    history = "\n".join(state["conversation_history"])

    print("System: Generating candidate questions...")

    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                """You are an expert technical interviewer. Your task is to generate one insightful question for the given skill.
        The question should be designed to verify the candidate's true understanding.
        Based on the candidate's average score for this skill so far, adjust the difficulty:
        - If the score is low (0-2), ask an easy/foundational question.
        - If the score is medium (2-4), ask an intermediate/practical question.
        - If the score is high (4+), ask a difficult/nuanced question.
        Avoid repeating questions from the conversation history.""",
            ),
            (
                "human",
                """
        Skill: {skill}
        Current Average Score for this skill: {avg_score:.2f}
        Conversation History:
        {history}

        Generate one question about {skill}.
        """,
            ),
        ]
    )

    structured_llm = llm.with_structured_output(Question)
    question_prompts = [
        prompt.format_prompt(
            skill=skill, avg_score=belief_state[skill]["mean_score"], history=history
        )
        for skill in skills
    ]

    candidate_questions = structured_llm.batch(question_prompts)
    state["candidate_questions"] = candidate_questions
    return state


def select_question_node(state: InterviewState) -> InterviewState:
    """
    Node 2: Select the most informative question using the UCB1 algorithm.
    This corresponds to your "Select Question" step and implements the exploration/exploitation policy.
    """
    candidate_questions = state["candidate_questions"]
    belief_state = state["belief_state"]
    total_turns = state["turn_count"]
    exploration_factor = 2.0  # The 'C' constant in the UCB formula

    best_question = None
    max_ucb_score = -1

    print("System: Selecting best question using UCB policy...")

    for question in candidate_questions:
        skill = question.skill
        n_s = belief_state[skill]["n_questions"]
        mean_s = belief_state[skill]["mean_score"]

        if n_s == 0:
            # If a skill has never been asked about, its priority is infinite.
            ucb_score = float("inf")
        else:
            # UCB1 Formula
            ucb_score = mean_s + exploration_factor * math.sqrt(
                math.log(total_turns + 1) / n_s
            )

        print(
            f"  - Skill: {skill}, Mean: {mean_s:.2f}, N: {n_s}, UCB Score: {ucb_score:.2f}"
        )

        if ucb_score > max_ucb_score:
            max_ucb_score = ucb_score
            best_question = question

    state["current_question"] = best_question
    return state


def ask_question_and_get_response(state: InterviewState) -> InterviewState:
    """
    Node 3: Ask the selected question and get a response from the user.
    This is the human-in-the-loop part of the graph.
    """
    question = state["current_question"]

    # Append AI's turn to history
    ai_message = f"AI Interviewer: {question.text}"
    state["conversation_history"].append(ai_message)
    print("\n" + ai_message)

    # Get human response
    human_response = input("Your Answer: ")
    state["candidate_response"] = human_response

    # Append human's turn to history
    human_message = f"Candidate: {human_response}"
    state["conversation_history"].append(human_message)

    return state


def grade_response_node(state: InterviewState) -> InterviewState:
    """
    Node 4: Grade the candidate's response based on a rubric.
    This corresponds to your "Evaluate" step.
    """
    question = state["current_question"]
    response = state["candidate_response"]

    print("System: Grading response...")

    prompt = ChatPromptTemplate.from_template("""
    You are an expert technical interviewer grading a candidate's response.
    
    **Rubric:**
    - 1: Completely incorrect or irrelevant.
    - 2: Shows minimal understanding, contains major errors.
    - 3: Partially correct, but misses key details or contains inaccuracies.
    - 4: Mostly correct and well-explained, but could be more detailed or nuanced.
    - 5: Excellent. A thorough, accurate, and insightful answer, demonstrating deep understanding.

    **Question Asked:**
    '{question_text}'

    **Candidate's Response:**
    '{candidate_response}'

    Based on the rubric, provide a score and a brief reasoning. Penalize evasive or low-specificity answers.
    """)

    grading_llm = llm.with_structured_output(Grade)
    grade = grading_llm.invoke(
        prompt.format(question_text=question.text, candidate_response=response)
    )

    state["grade"] = grade
    print(f"System: Grade assigned: {grade.score}/5. Reason: {grade.reasoning}")
    return state


def update_belief_node(state: InterviewState) -> InterviewState:
    """
    Node 5: Update the belief state with the new grade.
    This corresponds to your "Update Belief" step.
    """
    grade = state["grade"]
    skill_in_focus = state["current_question"].skill

    # Update belief state for the questioned skill
    current_belief = state["belief_state"][skill_in_focus]
    old_mean = current_belief["mean_score"]
    n = current_belief["n_questions"]

    # Running mean calculation
    new_mean = (old_mean * n + grade.score) / (n + 1)

    state["belief_state"][skill_in_focus]["mean_score"] = new_mean
    state["belief_state"][skill_in_focus]["n_questions"] += 1

    # Increment global turn count
    state["turn_count"] += 1

    print("System: Belief state updated.")
    print("Current Beliefs:")
    for skill, belief in state["belief_state"].items():
        print(
            f"  - {skill}: Mean Score = {belief['mean_score']:.2f}, Questions Asked = {belief['n_questions']}"
        )
    print("-" * 50)

    return state


def generate_report_node(state: InterviewState) -> InterviewState:
    """
    Final Node: Generate a summary report of the interview.
    This creates the "Verification Cards" you described.
    """
    belief_state = state["belief_state"]
    report_parts = ["# AI Interviewer: Final Verification Report\n"]

    for skill, belief in belief_state.items():
        mean_score = belief["mean_score"]
        n_questions = belief["n_questions"]

        status = "Insufficient Data"
        if n_questions >= 2:  # Min turns before verification (as per your plan)
            if mean_score >= 3.75:
                status = "✅ Verified"
            else:
                status = "❌ Not Verified"

        card = f"""
        ## Skill: {skill}
        - **Verification Status**: {status}
        - **Confidence Score**: {mean_score:.2f} / 5.0
        - **Evidence Strength**: {n_questions} questions asked.
        """
        report_parts.append(card)

    final_report = "\n".join(report_parts)
    state["final_report"] = final_report
    return state


# --- 5. Define Conditional Logic ---
def should_continue(state: InterviewState) -> str:
    """
    Node 6: Decide whether to continue the interview or end it.
    This is the conditional edge logic.
    """
    if state["turn_count"] >= state["max_turns"]:
        print("System: Maximum turns reached. Concluding interview.")
        return "end"
    else:
        return "continue"


# --- 6. Assemble the Graph ---
workflow = StateGraph(InterviewState)

# Add nodes to the graph
workflow.add_node("initialize_state", initialize_state)
workflow.add_node("generate_questions", generate_questions_node)
workflow.add_node("select_question", select_question_node)
workflow.add_node(
    "ask_and_get_response", ask_question_and_get_response
)  # This will be handled in the loop
workflow.add_node("grade_response", grade_response_node)
workflow.add_node("update_belief", update_belief_node)
workflow.add_node("generate_report", generate_report_node)

# Set the entry point
workflow.set_entry_point("initialize_state")

# Add edges to define the flow
workflow.add_edge("initialize_state", "generate_questions")
workflow.add_edge("generate_questions", "select_question")
# The "ask" step is where we'll pause, so no direct edge in the same way
workflow.add_edge(
    "select_question", "grade_response"
)  # This is a conceptual link; we need to insert the human step
workflow.add_edge("grade_response", "update_belief")

# Add the conditional edge
workflow.add_conditional_edges(
    "update_belief",
    should_continue,
    {
        "continue": "generate_questions",
        "end": "generate_report",
    },
)

workflow.add_edge("generate_report", END)

# Compile the graph
app = workflow.compile()


# --- 7. Run the Interview ---
if __name__ == "__main__":
    # Define the interview parameters (as per your PoC scope)
    initial_skills = ["PyTorch", "LLM Evaluation", "CUDA"]
    max_interview_turns = 8

    # Initial state
    config = {"recursion_limit": 100}
    initial_state: InterviewState = {
        "skills_to_verify": initial_skills,
        "max_turns": max_interview_turns,
    }

    # Stream the graph execution
    for event in app.stream(initial_state, config=config):
        # The stream method yields a dictionary for each step that has finished
        # The key is the node name and the value is the output of that node
        node_name, state_update = next(iter(event.items()))

        # The graph pauses implicitly when it needs input that isn't available.
        # We design our graph so the 'select_question' node prepares the question.
        # After 'select_question' runs, we intercept, ask the user, and then reinvoke.
        # However, a simpler way for a script is to just manage the human-in-the-loop
        # part manually by modifying state.

        # A clean way to handle the human step:
        # A proper LangGraph pattern would be to have a separate node for the human input,
        # and compile the graph with an interruption. For this script, we can simulate it.
        # Let's rebuild the flow to be more interactive-script friendly.

    # A more script-friendly execution flow
    print("--- Starting AI Interview ---")

    current_state = initialize_state(initial_state)

    while current_state["turn_count"] < current_state["max_turns"]:
        print(
            f"\n--- Turn {current_state['turn_count'] + 1}/{current_state['max_turns']} ---"
        )
        current_state = generate_questions_node(current_state)
        current_state = select_question_node(current_state)
        current_state = ask_question_and_get_response(current_state)
        current_state = grade_response_node(current_state)
        current_state = update_belief_node(current_state)

    print("\n--- Interview Finished ---")
    final_state = generate_report_node(current_state)
    print(final_state["final_report"])


For example, replace imports like: `from langchain_core.pydantic_v1 import BaseModel`
with: `from pydantic import BaseModel`
or the v1 compatibility namespace if you are working in a code base that has not been fully upgraded to pydantic 2 yet. 	from pydantic.v1 import BaseModel

  exec(code_obj, self.user_global_ns, self.user_ns)


System: Initializing interview for skills: PyTorch, LLM Evaluation, CUDA
--------------------------------------------------
System: Generating candidate questions...
System: Selecting best question using UCB policy...
  - Skill: PyTorch, Mean: 0.00, N: 0, UCB Score: inf
  - Skill: LLM Evaluation, Mean: 0.00, N: 0, UCB Score: inf
  - Skill: CUDA, Mean: 0.00, N: 0, UCB Score: inf


KeyError: 'candidate_response'

In [26]:
# --- Toggle MOCK vs REAL ---
MOCK = (
    False  # Set to False to use real LLM + LangGraph (requires installs and API keys)
)

# Optional: your OpenAI key if MOCK=False
import os

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")


## 0) Setup (only needed for REAL mode)

If you plan to run with a real LLM and LangGraph, install:
```bash
pip install langgraph langchain openai pydantic
```
Set your environment variable:
```bash
export OPENAI_API_KEY=sk-...
```
Then set `MOCK=False` in the cell above.


In [27]:
from typing import Dict, List, Optional, Literal, TypedDict
import random
import json

# Try importing LangGraph only when needed
if not MOCK:
    from langgraph.graph import StateGraph, END
    from langgraph.checkpoint.memory import MemorySaver
else:
    # Light shims so the rest of the notebook runs in MOCK mode without LangGraph
    class StateGraph:
        def __init__(self, state_type=None):
            self.nodes = {}
            self.entry = None
            self.conditional_edges = {}
            self.edges = []

        def add_node(self, name, fn):
            self.nodes[name] = fn

        def set_entry_point(self, name):
            self.entry = name

        def add_edge(self, a, b):
            self.edges.append((a, b))

        def add_conditional_edges(self, node, fn, mapping):
            self.conditional_edges[node] = (fn, mapping)

        def compile(self, checkpointer=None):
            return self

        def invoke(self, state):
            # A tiny runner that follows the fixed path init→... with a single loop
            current = self.entry
            while True:
                state = self.nodes[current](state)
                # Decide next hop
                # 1) If conditional edges from current
                if current in self.conditional_edges:
                    fn, mapping = self.conditional_edges[current]
                    decision = fn(state)
                    nxt = mapping[decision]
                else:
                    # else find the first matching edge
                    nexts = [b for (a, b) in self.edges if a == current]
                    nxt = nexts[0] if nexts else None
                if not nxt:
                    break
                if nxt == "END":
                    break
                current = nxt
            return state

    class MemorySaver:
        pass


# Optional: real LLM clients when MOCK=False
if not MOCK:
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, api_key=OPENAI_API_KEY)
else:
    llm = None

In [28]:
Status = Literal["unverified", "probed", "verified"]


class EvalJSON(TypedDict):
    correctness: float
    specificity: float
    completeness: float
    evidence_strength: float
    inconsistency_flag: bool
    notes: Optional[str]


class QA(TypedDict):
    skill: str
    q_id: str
    question: str
    answer: str
    eval: EvalJSON
    reward: float


class SkillStat(TypedDict):
    n: int
    mean: float
    M2: float
    se: float
    lcb: float
    status: Status


class InterviewState(TypedDict):
    turn: int
    turn_budget: int
    skills: List[str]
    claims: Dict[str, List[Dict]]
    stats: Dict[str, SkillStat]
    last_qa: List[QA]
    candidates: List[Dict]
    selected_q: Optional[Dict]
    interview_complete: bool
    summary: Optional[Dict]

In [29]:
def welford_update(s: SkillStat, r: float) -> SkillStat:
    n = s["n"] + 1
    delta = r - s["mean"]
    mean = s["mean"] + delta / n
    M2 = s["M2"] + delta * (r - mean)
    var = (M2 / (n - 1)) if n > 1 else 0.25  # conservative early variance
    se = math.sqrt(var / max(1, n))
    z = 1.64  # ~95% one-sided
    lcb = mean - z * se
    status: Status = (
        "verified"
        if (n >= 3 and lcb >= 0.75)
        else "probed"
        if (mean >= 0.6 or n >= 2)
        else "unverified"
    )
    return {"n": n, "mean": mean, "M2": M2, "se": se, "lcb": lcb, "status": status}


def blend_reward(ev: EvalJSON) -> float:
    r = (
        0.35 * ev["correctness"]
        + 0.25 * ev["specificity"]
        + 0.10 * ev["completeness"]
        + 0.20 * ev["evidence_strength"]
        - (0.30 if ev["inconsistency_flag"] else 0.0)
    )
    return max(0.0, min(1.0, r))


def need_to_stop(state: InterviewState) -> bool:
    verified = sum(1 for s in state["stats"].values() if s["status"] == "verified")
    return (verified >= 2) or (state["turn"] >= state["turn_budget"])


def summarize_stat(s: SkillStat) -> str:
    return f"n={s['n']} mean={s['mean']:.2f} lcb={s['lcb']:.2f} se={s['se']:.2f} status={s['status']}"


def policy_score(skill_stat: SkillStat, early: bool) -> float:
    if early:
        return skill_stat["se"] + 0.3 * (1 - skill_stat["mean"])
    near_bonus = (
        0.15
        if (
            skill_stat["mean"] >= 0.70
            and skill_stat["lcb"] < 0.75
            and skill_stat["status"] != "verified"
        )
        else 0.0
    )
    return skill_stat["mean"] - 0.2 * skill_stat["se"] + near_bonus

In [30]:
def mock_generate_candidates(state: InterviewState) -> List[Dict]:
    # Produce simple but sensible question candidates
    rng = random.Random(1337 + state["turn"])
    out = []
    for skill in state["skills"]:
        if len(out) >= 5:
            break
        diff = rng.choice([1, 2, 2, 3])
        probe = rng.choice(["how", "why", "design", "debug", "critique", "code"])
        qtext = {
            "PyTorch": "Show a minimal training loop and explain how you manage optimizer and scheduler.",
            "LLM Evaluation": "Design a small test to detect spec gaming; what metrics and guardrails would you use?",
            "CUDA": "Explain how to diagnose an out-of-memory error and what tooling you would check first.",
        }.get(
            skill,
            f"Give me one concrete, falsifiable example proving your competence in {skill}.",
        )
        out.append(
            {
                "id": f"q_{skill}_{state['turn']}_{diff}",
                "skill": skill,
                "difficulty": diff,
                "probe_type": probe,
                "text": qtext,
                "expected_markers": {
                    "PyTorch": [
                        "DataLoader",
                        "optimizer",
                        "scheduler",
                        "backward",
                        "zero_grad",
                        "AdamW",
                        "param_groups",
                    ],
                    "LLM Evaluation": [
                        "gold data",
                        "leakage",
                        "inter-annotator",
                        "specification gaming",
                        "ROC/AUC",
                    ],
                    "CUDA": [
                        "nvidia-smi",
                        "OOM",
                        "profiling",
                        "torch.cuda.memory",
                        "mixed precision",
                    ],
                }.get(skill, []),
            }
        )
    return out


def mock_simulate_answer(q: Dict) -> str:
    # Deterministic, mildly skillful canned answers
    if q["skill"] == "PyTorch":
        return "Use DataLoader, model.train(); optimizer=AdamW with param_groups to exclude bias/LayerNorm from weight_decay; zero_grad(); loss.backward(); optimizer.step(); scheduler.step(); watch exploding grads."
    if q["skill"] == "LLM Evaluation":
        return "Use a held-out gold set; check leakage; track inter-annotator agreement; watch spec gaming by adding adversarial prompts; report precision/recall and ROC; add manual spot checks."
    if q["skill"] == "CUDA":
        return "Start with nvidia-smi, then torch.cuda.memory_summary(); try smaller batch, AMP; profile with Nsight; check tensor shapes; fix dataloader pin_memory; monitor fragmentation."
    return "I would provide a concrete example from past work with tools, logs, and a minimal repro."


def mock_grade_answer(qa: Dict, state: InterviewState) -> EvalJSON:
    # Grade based on overlap with expected_markers and some noise
    markers = qa.get("expected_markers", [])
    ans = qa["answer"].lower()
    hit = sum(1 for m in markers if m.lower() in ans)
    k = len(markers) or 1
    specificity = min(1.0, 0.4 + 0.6 * (hit / k))
    correctness = min(1.0, 0.6 + 0.4 * (hit / k))
    completeness = 0.6 if hit / k >= 0.5 else 0.4
    evidence_strength = (
        0.6
        if any(x in ans for x in ["repo", "code", "log", "nsight", "param_groups"])
        else 0.45
    )
    inconsistency_flag = False  # deterministic MOCK
    return {
        "correctness": round(correctness, 2),
        "specificity": round(specificity, 2),
        "completeness": round(completeness, 2),
        "evidence_strength": round(evidence_strength, 2),
        "inconsistency_flag": inconsistency_flag,
        "notes": "MOCK grading",
    }

In [39]:
def llm_generate_questions(state: InterviewState) -> List[Dict]:
    if MOCK:
        return mock_generate_candidates(state)

    from langgraph.graph import StateGraph
    from langchain_openai import ChatOpenAI
    from langchain.output_parsers import JsonOutputParser
    from langchain.prompts import ChatPromptTemplate

    llm = ChatOpenAI(model="gpt-4", temperature=0.1)

    template = """You are an expert technical interviewer. Generate 3 questions to verify the candidate's expertise in their claimed skills.

Current interview state:
- Skills being assessed: {skills}
- Previous Q&A history: {last_qa}
- Current skill statistics: {stats} 
- Remaining turns: {remaining_turns}

Generate questions that:
1. Focus on skills with high uncertainty (high standard error)
2. Vary difficulty based on previous answers
3. Are concrete and falsifiable
4. Target specific technical knowledge

Return a list of question objects in the following format:
[{{"id": "q_[skill]_[turn]_[difficulty]", 
   "skill": "[skill name]",
   "difficulty": "[easy/medium/hard]",
   "probe_type": "[concept/application/debugging]",
   "text": "[question text]",
   "expected_markers": ["[key terms/concepts that should appear in good answers]"]
}}]"""

    prompt = ChatPromptTemplate.from_template(template)
    parser = JsonOutputParser()

    # Define the question generation node
    def generate_node(state):
        chain = prompt | llm | parser
        questions = chain.invoke(
            {
                "skills": state["skills"],
                "last_qa": state["last_qa"],
                "stats": state["stats"],
                "remaining_turns": state["turn_budget"] - state["turn"],
            }
        )
        return {"questions": questions}

    # Create graph
    workflow = StateGraph(nodes=[])

    # Add node
    workflow.add_node("generate", generate_node)
    workflow.set_entry_point("generate")
    workflow.set_finish_point("generate")

    # Compile and run
    app = workflow.compile()
    result = app.invoke({"state": state})

    return result["questions"]


def llm_grade_answer(qa: Dict, state: InterviewState) -> EvalJSON:
    if MOCK:
        return mock_grade_answer(qa, state)

    from langgraph.graph import StateGraph
    from langchain_openai import ChatOpenAI
    from langchain.output_parsers import JsonOutputParser
    from langchain.prompts import ChatPromptTemplate

    llm = ChatOpenAI(model="gpt-4", temperature=0.1)

    template = """Grade this technical interview response.

Question: {question}
Answer: {answer}

Expected key concepts/markers: {markers}

Evaluate the response on:
1. Correctness (0-1): Technical accuracy and proper use of concepts
2. Specificity (0-1): Level of detail and precision
3. Completeness (0-1): Coverage of relevant aspects
4. Evidence strength (0-1): Concrete examples/experience demonstrated
5. Check for inconsistencies with technical best practices

Return evaluation in this format:
{{
    "correctness": float,
    "specificity": float, 
    "completeness": float,
    "evidence_strength": float,
    "inconsistency_flag": boolean,
    "notes": "Brief justification"
}}"""

    prompt = ChatPromptTemplate.from_template(template)
    parser = JsonOutputParser()

    # Define the grading node
    def grade_node(state):
        chain = prompt | llm | parser
        evaluation = chain.invoke(
            {
                "question": qa["question"],
                "answer": qa["answer"],
                "markers": qa.get("expected_markers", []),
            }
        )
        return {"evaluation": evaluation}

    # Create graph
    workflow = StateGraph(nodes=[])

    # Add node
    workflow.add_node("grade", grade_node)
    workflow.set_entry_point("grade")
    workflow.set_finish_point("grade")

    # Compile and run
    app = workflow.compile()
    result = app.invoke({"state": {"qa": qa}})

    return result["evaluation"]

In [40]:
def init_from_claims(state: InterviewState) -> InterviewState:
    if state["turn"] > 0:
        return state
    stats = {}
    for skill in state["skills"]:
        stats[skill] = {
            "n": 0,
            "mean": 0.5,
            "M2": 0.0,
            "se": 0.5,
            "lcb": 0.0,
            "status": "unverified",
        }  # neutral start
    state["stats"] = stats
    state["last_qa"] = []
    state["candidates"] = []
    state["selected_q"] = None
    state["interview_complete"] = False
    return state


def generate_candidates(state: InterviewState) -> InterviewState:
    state["candidates"] = llm_generate_questions(state)
    return state


def select_question(state: InterviewState) -> InterviewState:
    t, T = state["turn"], state["turn_budget"]
    early = t < (T // 2)
    scored = [
        (policy_score(state["stats"][q["skill"]], early), q)
        for q in state["candidates"]
    ]
    state["selected_q"] = max(scored, key=lambda x: x[0])[1]
    return state


def ask_and_collect(state: InterviewState) -> InterviewState:
    q = state["selected_q"]
    # In a real app, you'd interrupt here and collect human input
    answer = (
        mock_simulate_answer(q) if MOCK else ""
    )  # replace with interrupt in real flow
    qa = {
        "skill": q["skill"],
        "q_id": q["id"],
        "question": q["text"],
        "answer": answer,
        "eval": None,
        "reward": 0.0,
        "expected_markers": q.get("expected_markers", []),
    }
    state["last_qa"] = (state["last_qa"] + [qa])[-3:]
    return state


def grade_answer(state: InterviewState) -> InterviewState:
    qa = state["last_qa"][-1]
    # Include markers in the call for grading
    qa_for_grade = {
        "question": qa["question"],
        "answer": qa["answer"],
        "expected_markers": qa.get("expected_markers", []),
    }
    ev = llm_grade_answer(qa_for_grade, state)
    qa["eval"] = ev
    qa["reward"] = blend_reward(ev)
    state["last_qa"][-1] = qa
    return state


def update_belief(state: InterviewState) -> InterviewState:
    qa = state["last_qa"][-1]
    sk = qa["skill"]
    state["stats"][sk] = welford_update(state["stats"][sk], qa["reward"])
    state["turn"] += 1
    return state


def decide_next(state: InterviewState):
    return "finalise" if need_to_stop(state) else "loop"


def finalise_report(state: InterviewState) -> InterviewState:
    cards = []
    for skill, s in state["stats"].items():
        cards.append(
            {
                "skill": skill,
                "status": s["status"],
                "score_mean": round(s["mean"], 2),
                "lcb": round(s["lcb"], 2),
                "n": s["n"],
                "notes": f"se={s['se']:.2f}",
            }
        )
    state["summary"] = {"verification_cards": cards}
    state["interview_complete"] = True
    return state

In [41]:
graph = StateGraph(InterviewState)
graph.add_node("init", init_from_claims)
graph.add_node("generate_candidates", generate_candidates)
graph.add_node("select_question", select_question)
graph.add_node("ask_and_collect", ask_and_collect)
graph.add_node("grade_answer", grade_answer)
graph.add_node("update_belief", update_belief)
graph.add_node("finalise", finalise_report)

graph.set_entry_point("init")
graph.add_edge("init", "generate_candidates")
graph.add_edge("generate_candidates", "select_question")
graph.add_edge("select_question", "ask_and_collect")
graph.add_edge("ask_and_collect", "grade_answer")
graph.add_edge("grade_answer", "update_belief")
graph.add_conditional_edges(
    "update_belief",
    decide_next,
    {"loop": "generate_candidates", "finalise": "finalise"},
)

app = graph.compile()


## 1) Demo run (MOCK mode)

We simulate answers and grading. You’ll see **turn-by-turn** stats and a final **Verification Card** table.


In [42]:
# Initial state with 3 claimed skills and a small turn budget
initial_state: InterviewState = {
    "turn": 0,
    "turn_budget": 8,
    "skills": ["PyTorch", "LLM Evaluation", "CUDA"],
    "claims": {
        "PyTorch": [{"src": "CV", "confidence": 0.7}],
        "LLM Evaluation": [{"src": "LinkedIn", "confidence": 0.6}],
        "CUDA": [{"src": "CV", "confidence": 0.5}],
    },
    "stats": {},
    "last_qa": [],
    "candidates": [],
    "selected_q": None,
    "interview_complete": False,
    "summary": None,
}

state = app.invoke(initial_state)

# Pretty print turn-by-turn evolution from the mock last_qa (only last 3 saved)
print("Final per-skill stats:")
for skill, st in state["stats"].items():
    print(f"- {skill}: {summarize_stat(st)}")

print("\nVerification Cards:")
print(json.dumps(state["summary"], indent=2))

ImportError: cannot import name 'JsonOutputParser' from 'langchain.output_parsers' (/Users/jason_macstudio/.venv/lib/python3.11/site-packages/langchain/output_parsers/__init__.py)