# Evaluating Models

To evaluate the performance of our advanced Retrieval-Augmented Generation (RAG) system, we compared four configurations using a diverse set of five input questions about jobs and hiring trends: (a) a base LLM without retrieval, (b) a basic RAG setup with a retriever and generator, (c) an advanced multi-agent RAG using a base model, and (d) an advanced RAG using a fine-tuned model with LoRA.

Overall, quality improved at each step from the base LLM to the advanced agentic RAG. The agentic RAG with the base model gave the most precise and informative results, while the LoRA fine-tuned variant produced clean summaries but sometimes overgeneralized. The agent-based architecture helped most on tasks that demand structured extraction or constraint satisfaction, which reinforces the value of system orchestration in RAG pipelines.

## Code to Evaluate

In [None]:
from IPython.display import Image, display
from workflow import create_workflow  # Importing workflow function
from langgraph.graph import END, StateGraph, START
from langchain_core.runnables.graph import MermaidDrawMethod
from agents import (query_rewriter, retriever_agent, grade_documents, generate_agent, verification_agent, generate_agent_lora)
import pandas as pd
from agents import llm, retriever

In [2]:
from typing_extensions import TypedDict, List, Literal

class RAGState(TypedDict):
    query: str
    refined_query: str
    retrieved_docs: List[str]
    formatted_context: str
    llm_answer: str
    decision: Literal["relevant", "not_relevant", "useful", "not_useful", "end"]
    retries: int  # retry counter; run_query initialises it to 0

In [3]:
from pprint import pprint

def run_query(query: str, use_lora: bool = False):
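    """
    Runs the multi-agent RAG system on a query and returns the query,
    the final answer, and the verification decision.

    Set use_lora=True to build the workflow with the LoRA fine-tuned generator.
    """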

    # Initialize State
    input_state = {
        "query": query,
        "refined_query": "",
        "retrieved_docs": [],
        "formatted_context": "",
        "llm_answer": "",
        "decision": "",
        "retries": 0  # Initial retry count
    }
    app = create_workflow(use_lora=use_lora)
    # Store Final Output State
    final_output = {}

    # Run the workflow; each streamed item maps a node name to that node's output
    for output in app.stream(input_state):
        # Merge into the final output state (later outputs from the same node overwrite earlier ones)
        final_output.update(output)

    # Extract `llm_answer` 
    final_answer = final_output.get("generate_agent", {}).get("llm_answer", "No `llm_answer` generated.")

    # Extract `decision` 
    decision = final_output.get("verification_agent", {}).get("decision", "No `decision` available.")
    return {
        "query": query,
        "response": final_answer,
        "decision": decision
    }
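
The merge step in `run_query` can be illustrated without the real workflow. The following is a minimal sketch, assuming the stream yields one `{node_name: update}` dict per step and that the node names match the keys `run_query` reads (`generate_agent`, `verification_agent`). The `fake_stream` generator and all of its values are hypothetical stand-ins for `app.stream(input_state)`.

```python
from typing import Any, Dict, Iterable, Iterator


def fake_stream() -> Iterator[Dict[str, Dict[str, Any]]]:
    """Hypothetical stand-in for app.stream(input_state); values are made up."""
    yield {"query_rewriter": {"refined_query": "remote data science jobs by salary"}}
    yield {"generate_agent": {"llm_answer": "first draft"}}
    yield {"verification_agent": {"decision": "not_useful"}}
    # A retry: the same nodes run again and emit new outputs
    yield {"generate_agent": {"llm_answer": "second draft"}}
    yield {"verification_agent": {"decision": "useful"}}


def merge_stream(stream: Iterable[Dict[str, Dict[str, Any]]]) -> Dict[str, Dict[str, Any]]:
    """Keep only the most recent output of each node, as run_query does."""
    merged: Dict[str, Dict[str, Any]] = {}
    for output in stream:
        merged.update(output)  # a later output for a node replaces the earlier one
    return merged


final_output = merge_stream(fake_stream())
answer = final_output.get("generate_agent", {}).get("llm_answer", "No `llm_answer` generated.")
decision = final_output.get("verification_agent", {}).get("decision", "No `decision` available.")
print(answer)    # second draft
print(decision)  # useful
```

Because `dict.update` replaces the whole entry for a node, only the last attempt survives when the workflow retries. If the intermediate attempts matter, they would need to be collected in a list instead.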


In [4]:
generated = run_query("What are the highest paying remote Data Science jobs?")
generated['response']


'**Staff Data Scientist** at **Jobot Consulting**\nLocation: New York, NY\nExperience Level: Entry level\nWork Type: FULL_TIME\nSalary: 208000.0\n\nThis position offers the highest salary among the listed options at $208,000. It is located in New York, NY and is an entry-level position.'

In [5]:
test_queries = [
    "What marketing jobs are available in New York?",
    "Which companies are hiring AI engineers?",
    "Find jobs requiring Python but not Java.",
    "List remote AI research jobs.",
    "What are the highest-paying data science roles?"
]

def run_evaluation(query):
    results = {"Query": query}

    # Base LLM (No RAG)
    results["Base LLM"] = llm.predict(query)

    # Basic RAG (Retriever + LLM)
    retrieved_docs = retriever.invoke(query)
    if retrieved_docs:
        retrieved_text = "\n".join([doc.page_content for doc in retrieved_docs])
        results["Basic RAG"] = llm.predict(f"Based on these job listings:\n\n{retrieved_text}\n\nAnswer the query: {query}")
    else:
        results["Basic RAG"] = "No relevant documents found."

    # Advanced RAG (Base Model)
    ans_rag = run_query(query, use_lora=False)
    results["Advanced RAG (Base)"] = ans_rag['response']

    # Advanced RAG (Fine-Tuned LoRA)
    ans_rag_lora = run_query(query, use_lora=True)
    results["Advanced RAG (LoRA)"] = ans_rag_lora['response']

    return results

evaluation_results = [run_evaluation(query) for query in test_queries]
df_results = pd.DataFrame(evaluation_results)


In [11]:
# (printed heading, DataFrame column) pairs, in display order
sections = [
    ("BASE LLM", "Base LLM"),
    ("BASIC RAG", "Basic RAG"),
    ("Advanced RAG (Base)", "Advanced RAG (Base)"),
    ("Advanced RAG (LoRA)", "Advanced RAG (LoRA)"),
]

for _, row in df_results.iterrows():
    print("\n Query:", row["Query"])
    print("")
    for heading, column in sections:
        print(heading)
        print("")
        print(row[column])
        print("")
    print("=" * 100)


 Query: What marketing jobs are available in New York?

BASE LLM

1. Marketing Manager
2. Digital Marketing Specialist
3. Social Media Manager
4. Content Marketing Manager
5. Marketing Coordinator
6. Public Relations Specialist
7. Brand Manager
8. Market Research Analyst
9. Email Marketing Specialist
10. SEO Specialist
11. Advertising Account Executive
12. Event Marketing Manager
13. Influencer Marketing Manager
14. Marketing Communications Manager
15. Product Marketing Manager

BASIC RAG

Based on the job listings provided, the marketing jobs available in New York are:

1. Marketing and Business Development Coordinator at Withers in New Haven, with occasional travel to Greenwich, New York, and Boston.
2. Sales and Marketing Professionals at a leading sales & marketing firm in New Jersey, with territories in New Jersey and New York.
3. Entry Level Sales and Marketing Positions at a firm in New Jersey with opportunities for growth.
4. Face-to-Face Marketing position in Charlotte, with 

## Qualitative Evaluations

- The base LLM provided generic responses with little grounding in real data. While it offered reasonable guesses based on prior knowledge, the answers were often vague, outdated, or inaccurate; for example, it listed high-profile companies hiring for AI roles without citing real listings. This made the base model suitable only for brainstorming or general awareness, not for reliable insights.

- With the basic RAG, performance improved noticeably. The model could pull in job descriptions from a retrieval corpus, offering more grounded answers. However, it lacked the ability to prioritize or filter results meaningfully. This often led to irrelevant or geographically mismatched entries, and the answers sometimes missed key user constraints (e.g., excluding Java). It demonstrated that while retrieval alone enhances accuracy, it needs additional reasoning for refinement.

- The advanced agentic RAG using a base model outperformed the previous two by a wide margin. It combined retrieval with coordination between agents (query rewriting, document grading, answer generation, verification), leading to more accurate and context-aware outputs. It captured key constraints (such as excluding Java), offered well-structured summaries with salary and location details, and surfaced companies actually present in the dataset. This version struck the best balance between specificity, accuracy, and completeness.

- Finally, the advanced agentic RAG using a LoRA fine-tuned model showed both strengths and weaknesses. Its standout feature was the ability to generate well-summarized responses, often including helpful salary or level overviews. However, it sometimes abstracted too far and lost critical information like job titles, company names, or constraints explicitly stated in the question. This suggests the fine-tuning was skewed toward general summaries and lacked sufficient examples involving filtering or constraint-based reasoning.
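
The constraint failures described above (for example, answers to "Find jobs requiring Python but not Java." that still list Java roles) could be spot-checked automatically alongside the manual review. The sketch below is not part of the evaluation that was run; `mentioned_terms` is a hypothetical helper that only detects whole-word mentions of a term in an answer.

```python
import re
from typing import Iterable, List


def mentioned_terms(answer: str, terms: Iterable[str]) -> List[str]:
    """Return the terms that occur in the answer as whole words (case-insensitive)."""
    found = []
    for term in terms:
        # Lookarounds stop "Java" from matching inside "JavaScript"
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, answer, flags=re.IGNORECASE):
            found.append(term)
    return found


# Made-up answer text, for illustration only
sample_answer = "Backend Engineer: Python, SQL. Frontend Engineer: JavaScript."
print(mentioned_terms(sample_answer, ["Java"]))            # []
print(mentioned_terms(sample_answer, ["Python", "Java"]))  # ['Python']
```

A hit is only a flag for manual review: the check cannot tell "no Java required" from "Java required", so it complements the qualitative reading rather than replacing it.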
