# EcoHome Energy Advisor - Agent Run & Evaluation

In this notebook, you'll run the Energy Advisor agent with various real-world scenarios and see how it helps customers optimize their energy usage.

## Learning Objectives
- Create the agent's instructions
- Run the Energy Advisor with different types of questions
- Evaluate response quality and accuracy
- Measure tool usage effectiveness
- Identify areas for improvement
- Implement evaluation metrics

## Evaluation Criteria
- **Accuracy**: Correct information and calculations
- **Relevance**: Responses address the user's question
- **Completeness**: Comprehensive answers with actionable advice
- **Tool Usage**: Appropriate use of available tools
- **Reasoning**: Clear explanation of recommendations


## 1. Import and Initialize

In [2]:
from datetime import datetime
from agent import Agent

In [4]:
## TODO: Create the agent's instructions

ECOHOME_SYSTEM_PROMPT = '''
 You are "EcoHome Energy Advisor" — a highly knowledgeable, data-driven assistant specializing in residential energy efficiency, cost optimization, and sustainability.

MISSION
- Help homeowners reduce energy bills and improve sustainability without compromising comfort.
- Give personalized, actionable energy insights powered by tool data and user context.

YOUR CAPABILITIES
- Analyze household energy usage patterns
- Recommend time-shifting strategies based on peak/off-peak tariff schedules
- Provide solar power insights and potential ROI
- Suggest energy-efficient appliance usage and upgrades
- Use weather and device behavior for dynamic advice
- Quantify results in clear values:
  - Estimated kWh savings
  - Monthly/annual ₹ / $ cost reduction
  - Carbon footprint improvement (kg CO₂)

CONTEXT AWARENESS RULES
- Always reference recent tool data or user-provided info before making recommendations.
- If critical information is missing (e.g., location, solar capacity, appliance usage), ask 1–2 simple clarifying questions.
- Adapt advice based on:
  - Time of day
  - Seasonal variation
  - Local climate and tariff schemes
  - Household lifestyle (e.g., "EV charging needs")

COMMUNICATION STYLE
- Friendly, concise, and encouraging
- Use bullets or numbered steps for recommendations
- Avoid judgmental language
- Focus on high-impact suggestions first

TOOL-USAGE POLICY
- Always use available tools when quantitative information is needed.
- If a tool fails or returns unknown results:
  - Acknowledge gracefully
  - Provide a reasonable, safe fallback suggestion
- Do not estimate specific prices, weather, or consumption unless supported by tools.

SAFETY & VERIFICATION
- If asked for technical or installation tasks:
  - Provide guidance but recommend certified professionals for electrical work
- Never provide misinformation or arbitrary numbers; rely only on data sources or ask for clarification.

OUTPUT FORMAT
- Start with a **quick summary**
- Follow with **top 2–4 actionable recommendations**
- Include **approximate savings metrics** when supported by data
- Provide **next step instructions** if user interest appears strong

PERSONALITY
- Smart energy partner
- Empowering and approachable
- Focus on practical impact

'''

In [5]:
ecohome_agent = Agent(
    instructions=ECOHOME_SYSTEM_PROMPT
)

In [6]:
response = ecohome_agent.invoke(
    question="When should I charge my electric car tomorrow to minimize cost and maximize solar power?",
    context="Location: San Francisco, CA",
    return_raw=True
)

In [9]:
print(response["messages"][-1].content)

**Quick Summary:**
To minimize costs and maximize solar power for charging your electric vehicle (EV) tomorrow in San Francisco, you should charge during the early afternoon when solar generation is at its peak and electricity rates are lower.

**Top Recommendations:**
1. **Charge Time:**
   - **Best Time to Charge:** Between **12 PM and 3 PM** when solar irradiance is highest and electricity rates are still relatively low (around $0.15 per kWh).
   - **Peak Solar Generation:** Expect solar irradiance to peak around **1 PM** with values around **435.9 W/m²**.

2. **Electricity Rates:**
   - **Off-Peak Rates:** From **12 AM to 5 PM**, the rate is **$0.15 per kWh**.
   - **Peak Rates:** From **5 PM to 8 PM**, the rate increases to **$0.24 per kWh**.

3. **Solar Power Utilization:**
   - Charging during the day when solar power is available will help you utilize renewable energy and reduce your carbon footprint.

**Approximate Savings Metrics:**
- **Cost Savings:** Charging during off-pea

In [8]:
print("TOOLS:")
for msg in response["messages"]:
    if hasattr(msg,"tool_call_id"):
        print("-", msg.name)

TOOLS:
- get_electricity_prices
- get_weather_forecast
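
The loop above can be wrapped in a small reusable helper. This is a minimal sketch that assumes, as the loop does, that tool-result messages expose `tool_call_id` and `name` attributes. The `extract_tool_names` helper and the `SimpleNamespace` stand-in messages are hypothetical and only illustrate the idea.

```python
from types import SimpleNamespace


def extract_tool_names(messages):
    """Return tool names from tool-result messages, in order, without duplicates."""
    names = []
    for msg in messages:
        # Assumption: only tool-result messages carry a `tool_call_id` attribute
        if hasattr(msg, "tool_call_id") and msg.name not in names:
            names.append(msg.name)
    return names


# Hypothetical stand-ins for the agent's real message objects
demo_messages = [
    SimpleNamespace(content="When should I charge my EV?"),
    SimpleNamespace(name="get_electricity_prices", tool_call_id="call_1", content="{}"),
    SimpleNamespace(name="get_weather_forecast", tool_call_id="call_2", content="{}"),
    SimpleNamespace(content="Charge between 12 PM and 3 PM."),
]

print(extract_tool_names(demo_messages))
# ['get_electricity_prices', 'get_weather_forecast']
```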


## 2. Define Test Cases

In [None]:
# TODO: Define comprehensive test cases for the Energy Advisor
# Create 10 test cases covering different scenarios:
# - EV charging optimization
# - Thermostat settings
# - Appliance scheduling
# - Solar power maximization
# - Cost savings calculations

In [10]:
test_cases = [
    # EV Charging Optimization
    {
        "id": "ev_charging_1",
        "question": "When should I charge my electric car tomorrow to minimize cost and maximize solar power?",
        "location": "San Diego, CA",
        "expected_tools": ["get_weather_forecast", "get_electricity_prices"],
        "expected_response": "Should include a recommended time window, solar peak reasoning, cost savings.",
    },
    {
        "id": "ev_charging_2",
        "question": "When is electricity the cheapest for EV charging tonight?",
        "location": "Austin, TX",
        "expected_tools": ["get_electricity_prices"],
        "expected_response": "Should reference off-peak hours and price differences.",
    },

    # Thermostat
    {
        "id": "thermostat_1",
        "question": "What should I set my thermostat to this evening to save energy?",
        "location": "Phoenix, AZ",
        "expected_tools": ["get_weather_forecast"],
        "expected_response": "Should include recommended temperature range and comfort balance.",
    },
    {
        "id": "thermostat_2",
        "question": "How do I optimize my heating schedule for the weekend?",
        "location": "Chicago, IL",
        "expected_tools": ["get_weather_forecast"],
        "expected_response": "Should include day/night setback and forecast-based strategy.",
    },

    # Appliances
    {
        "id": "appliance_1",
        "question": "When should I run my dishwasher to save energy?",
        "location": "Los Angeles, CA",
        "expected_tools": ["get_electricity_prices"],
        "expected_response": "Should suggest avoiding peak hours and using off-peak windows.",
    },
    {
        "id": "appliance_2",
        "question": "What time tomorrow should I use my washing machine?",
        "location": "New York, NY",
        "expected_tools": ["get_electricity_prices"],
        "expected_response": "Should mention avoiding 4–9 PM peak if applicable.",
    },

    # Solar Optimization
    {
        "id": "solar_1",
        "question": "How can I maximize self-consumption of my solar energy?",
        "location": "Miami, FL",
        "expected_tools": ["get_weather_forecast"],
        "expected_response": "Should mention aligning loads with high solar generation.",
    },
    {
        "id": "solar_2",
        "question": "What do I do on cloudy days to still save energy?",
        "location": "Seattle, WA",
        "expected_tools": ["get_weather_forecast"],
        "expected_response": "Should advise shifting flexible usage around limited solar.",
    },

    # Cost Analysis
    {
        "id": "cost_1",
        "question": "How can I reduce my electricity bill this month?",
        "location": "Dallas, TX",
        "expected_tools": ["get_electricity_prices", "get_weather_forecast"],
        "expected_response": "Should combine HVAC, appliance timing, EV strategy.",
    },
    {
        "id": "cost_2",
        "question": "What should I do during high price alerts?",
        "location": "Boston, MA",
        "expected_tools": ["get_electricity_prices"],
        "expected_response": "Should recommend turning down HVAC + delaying flexible loads.",
    },
]


if len(test_cases) < 10:
    raise ValueError("You MUST have at least 10 test cases")
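
The length check above only counts the cases. A slightly stricter check could also confirm that each case has the fields the test loop relies on and that ids are unique. The sketch below is one possible way to do that; `REQUIRED_KEYS` and `validate_test_cases` are hypothetical names, and the two demo cases are made up for illustration.

```python
REQUIRED_KEYS = {"id", "question", "location", "expected_tools", "expected_response"}


def validate_test_cases(cases):
    """Return a list of problems found; an empty list means the cases look well-formed."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append(f"case {index}: missing keys {sorted(missing)}")
        case_id = case.get("id")
        if case_id in seen_ids:
            problems.append(f"case {index}: duplicate id {case_id!r}")
        seen_ids.add(case_id)
        if not case.get("expected_tools"):
            problems.append(f"case {index}: expected_tools is empty")
    return problems


# Made-up cases: the first is complete, the second reuses an id and lacks fields
demo_cases = [
    {"id": "ev_charging_1", "question": "When should I charge my EV?",
     "location": "San Diego, CA", "expected_tools": ["get_electricity_prices"],
     "expected_response": "Should include a time window."},
    {"id": "ev_charging_1", "question": "When is electricity cheapest?"},
]

for problem in validate_test_cases(demo_cases):
    print(problem)
```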

## 3. Run Agent Tests

In [11]:
CONTEXT = "Location: San Francisco, CA"

In [12]:
# Run the agent tests
# For each test case, call the agent and collect the response
# Store results for evaluation

print("=== Running Agent Tests ===")
test_results = []

for i, test_case in enumerate(test_cases):
    print(f"\nTest {i+1}: {test_case['id']}")
    print(f"Question: {test_case['question']}")
    print("-" * 50)
    
    try:
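        # Call the agent. Every test shares the San Francisco CONTEXT, so the
        # test case's own 'location' field is not passed to the agent here.
        # Without return_raw=True the stored response is the final answer text,
        # so tool messages are not available for evaluate_tool_usage.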
        response = ecohome_agent.invoke(
            question=test_case['question'],
            context=CONTEXT
        )
        
        # Store the result
        result = {
            'test_id': test_case['id'],
            'question': test_case['question'],
            'response': response,
            'expected_tools': test_case['expected_tools'],
            'expected_response': test_case['expected_response'],
            'timestamp': datetime.now().isoformat()
        }
        test_results.append(result)
                
    except Exception as e:
        print(f"Error: {e}")
        result = {
            'test_id': test_case['id'],
            'question': test_case['question'],
            'response': f"Error: {str(e)}",
            'expected_tools': test_case['expected_tools'],
            'expected_response': test_case['expected_response'],
            'timestamp': datetime.now().isoformat(),
            'error': str(e)
        }
        test_results.append(result)

print(f"\nCompleted {len(test_results)} tests")


=== Running Agent Tests ===

Test 1: ev_charging_1
Question: When should I charge my electric car tomorrow to minimize cost and maximize solar power?
--------------------------------------------------

Test 2: ev_charging_2
Question: When is electricity the cheapest for EV charging tonight?
--------------------------------------------------

Test 3: thermostat_1
Question: What should I set my thermostat to this evening to save energy?
--------------------------------------------------

Test 4: thermostat_2
Question: How do I optimize my heating schedule for the weekend?
--------------------------------------------------

Test 5: appliance_1
Question: When should I run my dishwasher to save energy?
--------------------------------------------------

Test 6: appliance_2
Question: What time tomorrow should I use my washing machine?
--------------------------------------------------

Test 7: solar_1
Question: How can I maximize self-consumption of my solar energy?
-------------------------

In [13]:
test_results

[{'test_id': 'ev_charging_1',
  'question': 'When should I charge my electric car tomorrow to minimize cost and maximize solar power?',
  'response': '**Quick Summary:**\nTo minimize costs and maximize solar power for charging your electric car tomorrow in San Francisco, you should charge during the early morning hours when solar generation begins and electricity rates are low.\n\n**Top Recommendations:**\n1. **Charge Time:**\n   - **Best Time to Charge:** Start charging between **6 AM and 10 AM**. \n   - **Solar Generation:** Solar irradiance starts at 713.4 W/m² at 6 AM and peaks around 888.0 W/m² at 1 PM.\n   - **Electricity Rates:** The rate is **$0.15 per kWh** from midnight until 5 PM, then increases to **$0.24 per kWh** from 5 PM to 9 PM.\n\n2. **Optimal Charging Strategy:**\n   - **6 AM - 10 AM:** Charge your EV during this time to take advantage of both low electricity rates and increasing solar power generation.\n   - **Avoid Charging:** After 5 PM, as rates increase signific

## 4. Evaluate Responses

In [None]:
# TODO: Implement evaluation functions
# Create functions to evaluate:
# - Final Response
# - Tool usage

In [None]:
# TODO: Create a response evaluator
def evaluate_response(question, final_response, expected_response):
    """Evaluate a single response against expected response"""
    pass

In [20]:
# TODO: Create a response evaluator
def evaluate_response(question, final_response, expected_response):
    """Evaluate a single response against expected response"""

    def score_accuracy():
        # Lexical proxy for accuracy: unique words shared with the expected
        # response, divided by the expected response's word count
        common = set(final_response.lower().split()) & set(expected_response.lower().split())
        return round(len(common) / max(len(expected_response.split()), 1), 2)

    def score_relevance():
        # Check topic adherence using keyword overlap
        q_words = set(question.lower().split())
        f_words = set(final_response.lower().split())
        common = q_words & f_words
        return round(len(common) / max(len(q_words), 1), 2)

    def score_completeness():
        # Compare content coverage vs expected answer
        expected_parts = expected_response.lower().split()
        matched = sum(1 for w in expected_parts if w in final_response.lower())
        return round(matched / max(len(expected_parts), 1), 2)

    def score_usefulness(accuracy, relevance, completeness):
        # Heuristic: mean of the other scores, minus 0.1 when the response is
        # more than twice as long as the expected response
        combined = (accuracy + relevance + completeness) / 3
        conciseness_penalty = 0.1 if len(final_response.split()) > len(expected_response.split()) * 2 else 0
        return max(round(combined - conciseness_penalty, 2), 0)

    accuracy = score_accuracy()
    relevance = score_relevance()
    completeness = score_completeness()
    usefulness = score_usefulness(accuracy, relevance, completeness)

    feedback = {
        "accuracy_feedback": "Correct and aligned with ground truth" if accuracy > 0.7 else "Contains partially correct or incorrect details",
        "relevance_feedback": "Focused and on-topic" if relevance > 0.7 else "Drifts away from the question",
        "completeness_feedback": "Covers most key information" if completeness > 0.7 else "Missing critical details",
        "usefulness_feedback": "Useful and actionable" if usefulness > 0.7 else "Needs more clarity/value",
    }

    return {
        "scores": {
            "accuracy": accuracy,
            "relevance": relevance,
            "completeness": completeness,
            "usefulness": usefulness,
        },
        "feedback": feedback,
        "overall_score": round(
            (accuracy + relevance + completeness + usefulness) / 4, 2
        ),
    }
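
The scores above compare whitespace-separated tokens, so `differences.` and `differences`, or `**off-peak**` and `off-peak`, count as different words. The sketch below shows one possible variation, not the evaluator's actual method: it normalizes tokens with a regular expression before comparing. `tokenize` and `overlap_ratio` are hypothetical helpers, and the sample response sentence is made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_ratio(reference, candidate):
    """Fraction of unique reference tokens that also appear in the candidate."""
    ref_tokens = tokenize(reference)
    if not ref_tokens:
        return 0.0
    return round(len(ref_tokens & tokenize(candidate)) / len(ref_tokens), 2)


expected = "Should reference off-peak hours and price differences."
# Illustrative response text, not real agent output
response = "**Off-peak** hours run from 12 AM to 5 PM, and price differences are about $0.09 per kWh."

# Whitespace splitting keeps punctuation and markdown markers attached to words
naive_common = set(expected.lower().split()) & set(response.lower().split())
print(len(naive_common))                   # 3
print(overlap_ratio(expected, response))   # 0.75
```

With normalization, six of the eight expected tokens match instead of three of seven, which is closer to what a reader would judge from the two sentences.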


In [None]:
# TODO: Create a tool usage evaluator
def evaluate_tool_usage(messages, expected_tools):
    """Evaluate if the right tools were used"""
    pass

In [None]:
# TODO: Create a tool usage evaluator
def evaluate_tool_usage(messages, expected_tools):
    """Evaluate if the right tools were used"""

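    # Extract tool names from tool-result messages. Message objects carry
    # `tool_call_id` and `name` (as in the tool listing earlier in this notebook);
    # dict-style messages with a "tool" key are still accepted.
    used_tools = set()
    for msg in messages:
        if isinstance(msg, dict):
            if msg.get("tool"):
                used_tools.add(msg["tool"])
        elif hasattr(msg, "tool_call_id"):
            used_tools.add(msg.name)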

    expected_tools = set(expected_tools)

    # Metrics
    correctly_used = used_tools & expected_tools
    unnecessary_used = used_tools - expected_tools
    missing_tools = expected_tools - used_tools

    tool_appropriateness = round(len(correctly_used) / max(len(used_tools), 1), 2)
    tool_completeness = round(len(correctly_used) / max(len(expected_tools), 1), 2)

    feedback = {
        "appropriateness_feedback":
            "Tools used match task requirements well"
            if tool_appropriateness > 0.7 else
            "Some tools used were irrelevant to the task",
        
        "completeness_feedback":
            "All expected tools were used"
            if tool_completeness == 1 else
            "Some expected tools were missing",
        
        "missing_tools": list(missing_tools),
        "unnecessary_tools": list(unnecessary_used),
        "correctly_used_tools": list(correctly_used)
    }

    return {
        "scores": {
            "tool_appropriateness": tool_appropriateness,
            "tool_completeness": tool_completeness,
        },
        "feedback": feedback,
        "overall_score": round((tool_appropriateness + tool_completeness) / 2, 2)
    }
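
The two tool scores behave like precision and recall over sets of tool names: appropriateness is the share of used tools that were expected, and completeness is the share of expected tools that were used. The following worked example isolates that arithmetic; `tool_overlap_scores` is a hypothetical helper written only for this illustration.

```python
def tool_overlap_scores(used_tools, expected_tools):
    """Return (appropriateness, completeness) for two collections of tool names."""
    used, expected = set(used_tools), set(expected_tools)
    correct = used & expected
    # max(..., 1) avoids division by zero when a set is empty
    appropriateness = round(len(correct) / max(len(used), 1), 2)
    completeness = round(len(correct) / max(len(expected), 1), 2)
    return appropriateness, completeness


# One expected tool, but the agent also called an extra one
print(tool_overlap_scores(
    ["get_electricity_prices", "get_weather_forecast"],
    ["get_electricity_prices"],
))  # (0.5, 1.0)

# No tools called at all: both scores fall to zero
print(tool_overlap_scores([], ["get_weather_forecast"]))  # (0.0, 0.0)
```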


In [None]:
# TODO: Generate a comprehensive evaluation report
# Calculate overall scores and metrics
# Identify strengths and weaknesses
# Provide recommendations for improvement
def generate_evaluation_report():
    pass

In [21]:
# Store completed evaluation results here
evaluation_results = {
    "responses": [],  # List of {scores, feedback, overall_score}
    "tool_usage": []  # List of {scores, feedback, overall_score}
}


# TODO: Generate a comprehensive evaluation report
def generate_evaluation_report():
    """Generate aggregated evaluation report from all test results"""

    # Helper to calculate mean safely
    def avg(values):
        return round(sum(values) / max(len(values), 1), 2)

    # Collect all response metrics
    response_scores = [r['overall_score'] for r in evaluation_results["responses"]]
    tool_scores = [t['overall_score'] for t in evaluation_results["tool_usage"]]

    # Overall performance
    overall_response_score = avg(response_scores)
    overall_tool_score = avg(tool_scores)
    total_score = round((overall_response_score + overall_tool_score) / 2, 2)

    # Determine strengths & weaknesses
    strengths = []
    weaknesses = []

    if overall_response_score > 0.7:
        strengths.append("High-quality and accurate final responses")
    else:
        weaknesses.append("Response accuracy and clarity need improvement")

    if overall_tool_score > 0.7:
        strengths.append("Good tool selection and usage based on tasks")
    else:
        weaknesses.append("Tool selection or usage needs refining")

    # Recommendations
    recommendations = []
    if "Response accuracy and clarity need improvement" in weaknesses:
        recommendations.append(
            "Enhance context understanding by using more precise phrasing "
            "and ensuring expected elements are fully covered"
        )
    if "Tool selection or usage needs refining" in weaknesses:
        recommendations.append(
            "Review when to call different tools and confirm alignment with task requirements"
        )

    return {
        "overall_scores": {
            "response_quality": overall_response_score,
            "tool_usage": overall_tool_score,
            "overall_rating": total_score
        },
        "strengths": strengths or ["No major strengths identified yet"],
        "weaknesses": weaknesses or ["No major weaknesses identified"],
        "recommendations": recommendations or ["Continue current approach"]
    }


# Display the report clearly
def display_evaluation_report(report):
    print("\n====== AI PERFORMANCE EVALUATION REPORT ======\n")

    print(">> Overall Scores:")
    for key, val in report["overall_scores"].items():
        print(f"  - {key.replace('_', ' ').title()}: {val}")

    print("\n>> Strengths:")
    for s in report["strengths"]:
        print(f"  ✓ {s}")

    print("\n>> Weaknesses:")
    for w in report["weaknesses"]:
        print(f"  ✗ {w}")

    print("\n>> Recommendations:")
    for r in report["recommendations"]:
        print(f"  → {r}")

    print("\n==============================================\n")


In [23]:
report = generate_evaluation_report()
display_evaluation_report(report)




>> Overall Scores:
  - Response Quality: 0.0
  - Tool Usage: 0.0
  - Overall Rating: 0.0

>> Strengths:
  ✓ No major strengths identified yet

>> Weaknesses:
  ✗ Response accuracy and clarity need improvement
  ✗ Tool selection or usage needs refining

>> Recommendations:
  → Enhance context understanding by using more precise phrasing and ensuring expected elements are fully covered
  → Review when to call different tools and confirm alignment with task requirements
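
The all-zero scores above are what the report produces while `evaluation_results` is still empty: nothing in the cells shown has fed the test results through the two evaluators. The sketch below shows one way that step might look. `collect_evaluations` is a hypothetical helper, the `"messages"` key is an assumption (the test loop would need to store the raw message list, for example by calling the agent with `return_raw=True`), and the demo evaluators are stand-ins that return fixed scores.

```python
def collect_evaluations(test_results, response_evaluator, tool_evaluator):
    """Run both evaluators over each test result and return the aggregated structure."""
    collected = {"responses": [], "tool_usage": []}
    for result in test_results:
        if "error" in result:
            # Design choice for this sketch: skip failed runs instead of scoring an error string
            continue
        collected["responses"].append(
            response_evaluator(
                result["question"], result["response"], result["expected_response"]
            )
        )
        # "messages" is a hypothetical key holding the raw message list
        collected["tool_usage"].append(
            tool_evaluator(result.get("messages", []), result["expected_tools"])
        )
    return collected


# Stand-in evaluators with fixed scores, for illustration only
def demo_response_evaluator(question, final_response, expected_response):
    return {"overall_score": 0.8}


def demo_tool_evaluator(messages, expected_tools):
    return {"overall_score": 1.0 if messages else 0.0}


demo_results = [
    {"question": "q1", "response": "r1", "expected_response": "e1",
     "expected_tools": ["get_electricity_prices"], "messages": ["tool message"]},
    {"question": "q2", "response": "Error: timeout", "expected_response": "e2",
     "expected_tools": ["get_weather_forecast"], "error": "timeout"},
]

demo_evaluations = collect_evaluations(
    demo_results, demo_response_evaluator, demo_tool_evaluator
)
print(demo_evaluations)
# {'responses': [{'overall_score': 0.8}], 'tool_usage': [{'overall_score': 1.0}]}
```

In the notebook, the equivalent call would be `evaluation_results = collect_evaluations(test_results, evaluate_response, evaluate_tool_usage)`, run before `generate_evaluation_report()`.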


