InfluxDB Data Validation Notebook

Import necessary modules

In [1]:
import logging
import os
import sys
from datetime import datetime, timedelta, timezone

import pandas as pd
from dotenv import load_dotenv

Add the `src` directory to the path so that its modules can be imported

In [2]:
sys.path.insert(0, os.path.abspath('../src'))

import config
from ha_client import HAClient
from influx_service import InfluxService

Set up logging

In [3]:
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

Load environment variables from the .env file and define helpers that build the service clients

In [4]:
load_dotenv()

def get_influx_service() -> InfluxService:
    """Initializes and returns an InfluxService instance."""
    influx_url = os.getenv("INFLUX_URL")
    influx_token = os.getenv("INFLUX_TOKEN")
    influx_org = os.getenv("INFLUX_ORG")
    # InfluxService constructor does not take influx_bucket directly
    return InfluxService(influx_url, influx_token, influx_org)

def get_ha_client() -> HAClient:
    """Initializes and returns an HAClient instance."""
    hass_url = os.getenv("HASS_URL")
    hass_token = os.getenv("HASS_TOKEN")
    return HAClient(hass_url, hass_token)
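
The helpers above pass whatever `os.getenv` returns straight to the constructors, so an unset variable reaches them as `None`. A minimal sketch of a pre-flight check follows; the function name `missing_env_vars` and the tuple `REQUIRED_ENV_VARS` are hypothetical and not part of the project, and the variable names are simply the ones read in the cell above.

```python
import os

# Names read by get_influx_service() and get_ha_client() in the cell above.
REQUIRED_ENV_VARS = ("INFLUX_URL", "INFLUX_TOKEN", "INFLUX_ORG", "HASS_URL", "HASS_TOKEN")


def missing_env_vars(names=REQUIRED_ENV_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Report problems before any client is constructed.
missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv()` makes a misconfigured `.env` visible immediately instead of surfacing later as a connection or authentication error.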

Configuration Overview

Verify the key configuration settings for InfluxDB connection and the historical lookback window.

In [5]:
print(f"INFLUX_URL: {os.getenv('INFLUX_URL')}")
print(f"INFLUX_ORG: {os.getenv('INFLUX_ORG')}")
print(f"INFLUX_BUCKET: {os.getenv('INFLUX_BUCKET')}")
print(f"TRAINING_LOOKBACK_HOURS: {config.TRAINING_LOOKBACK_HOURS} hours")

influx_service = get_influx_service()
ha_client = get_ha_client()

end_time = datetime.now(timezone.utc)
start_time = end_time - timedelta(hours=config.TRAINING_LOOKBACK_HOURS)
num_steps = int((config.TRAINING_LOOKBACK_HOURS * 60) / config.HISTORY_STEP_MINUTES)

print(f"\nQuerying data from {start_time.isoformat()} to {end_time.isoformat()} for {num_steps} steps")

INFLUX_URL: http://20.10.0.10:8086
INFLUX_ORG: erbehome
INFLUX_BUCKET: home_assistant/autogen
TRAINING_LOOKBACK_HOURS: 576 hours

Querying data from 2025-11-02T11:28:18.999203+00:00 to 2025-11-26T11:28:18.999203+00:00 for 3456 steps
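
The step count follows directly from the lookback window and the step size. The sketch below reproduces that arithmetic with the values from the output above. The 10-minute step is inferred from the 3456 steps reported for 576 hours, not read from `config`, and `lookback_grid` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def lookback_grid(end_time, lookback_hours, step_minutes):
    """Return (start_time, num_steps, timestamps) for a fixed-step lookback window."""
    num_steps = (lookback_hours * 60) // step_minutes
    start_time = end_time - timedelta(hours=lookback_hours)
    step = timedelta(minutes=step_minutes)
    # One timestamp per step, oldest first, with the last one equal to end_time.
    timestamps = [end_time - step * i for i in range(num_steps)][::-1]
    return start_time, num_steps, timestamps


end = datetime(2025, 11, 26, 11, 28, 18, tzinfo=timezone.utc)
start, n, ts = lookback_grid(end, lookback_hours=576, step_minutes=10)
print(n)      # 3456
print(start)  # 2025-11-02 11:28:18+00:00
print(ts[0])  # 2025-11-02 11:38:18+00:00
```

Note that the first reconstructed timestamp is one step after `start_time`, which is why the time ranges printed later begin at 11:38 rather than 11:28.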


Entity IDs to Validate

These are the entity IDs that `src/physics_features.py` uses. We will check their historical data in InfluxDB.

In [6]:
entity_ids_to_check = [
    config.INDOOR_TEMP_ENTITY_ID,
    config.OUTDOOR_TEMP_ENTITY_ID,
    config.ACTUAL_OUTLET_TEMP_ENTITY_ID,
    config.TARGET_INDOOR_TEMP_ENTITY_ID,
    config.DHW_STATUS_ENTITY_ID,
    config.DISINFECTION_STATUS_ENTITY_ID,
    config.DHW_BOOST_HEATER_STATUS_ENTITY_ID,
    config.DEFROST_STATUS_ENTITY_ID,
    config.PV1_POWER_ENTITY_ID,
    config.PV2_POWER_ENTITY_ID,
    config.PV3_POWER_ENTITY_ID,
    config.FIREPLACE_STATUS_ENTITY_ID,
    config.TV_STATUS_ENTITY_ID,
    config.PV_FORECAST_ENTITY_ID
]

print("Entities that will be validated:")
for eid in entity_ids_to_check:
    print(f"- {eid}")

Entities that will be validated:
- sensor.thermometer_wohnzimmer_kompensiert
- sensor.thermometer_waermepume_kompensiert
- sensor.hp_outlet_temp
- input_number.hp_auto_correct_target
- binary_sensor.hp_dhw_heating_status
- binary_sensor.hp_dhw_tank_disinfection_status
- binary_sensor.hp_dhw_boost_heater_status
- binary_sensor.hp_defrosting_status
- sensor.saj_pv1_power
- sensor.saj_pv2_power
- sensor.solarmax_pv_power
- binary_sensor.fireplace_active
- input_boolean.fernseher
- sensor.energy_production_today_4


InfluxDB Data Retrieval and Summary

This section queries InfluxDB for each entity and summarizes the retrieved data: record count, time range, a preview of the first and last rows, and the minimum, maximum, and average values. For binary sensors, it also reports the number of periods in which the sensor was on and the approximate total on duration.
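
The validation cell requests `max` aggregation for binary sensors and `mean` for numeric ones. How `fetch_history` applies `agg_fn` internally is not shown in this notebook; the sketch below assumes it downsamples raw samples into fixed windows, and uses pandas with made-up data to illustrate why the two choices differ.

```python
import pandas as pd

# Hypothetical raw samples: one reading per minute for 20 minutes,
# with a binary sensor that was on for a single minute.
idx = pd.date_range("2025-11-02 12:00", periods=20, freq="1min", tz="UTC")
raw = pd.Series([0.0] * 20, index=idx)
raw.iloc[3] = 1.0

windows = raw.resample("10min")
by_mean = windows.mean()  # dilutes the short on period to 0.1 in the first window
by_max = windows.max()    # keeps the on period as 1.0 in the first window

print(by_mean.tolist())
print(by_max.tolist())
```

With `max`, a window counts as on if the sensor was on at any point in it, so short events survive downsampling. With `mean`, the same event becomes a fraction, which suits temperatures and power readings but blurs on/off states.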

In [7]:
async def validate_entity_data(entity_id: str):
    print(f"\n--- Validating {entity_id} ---")

    # Special handling for PV forecast as it's an attribute, not a simple state
    if entity_id == config.PV_FORECAST_ENTITY_ID:
        print("PV Forecast (attributes) cannot be directly queried from InfluxDB by entity_id alone.")
        print("Its data is usually part of a specific sensor's state attributes in HA.")
        print("Please check the live HA state for this entity in the previous notebook if needed.")
        return

    is_binary_sensor = entity_id in (
        config.DHW_STATUS_ENTITY_ID,
        config.DISINFECTION_STATUS_ENTITY_ID,
        config.DHW_BOOST_HEATER_STATUS_ENTITY_ID,
        config.DEFROST_STATUS_ENTITY_ID,
        config.FIREPLACE_STATUS_ENTITY_ID,
        config.TV_STATUS_ENTITY_ID,
    )

    agg_fn = "max" if is_binary_sensor else "mean"
    # 0.0 for binary sensors; 20.0 suits temperatures but is also applied to the PV power sensors
    default_val = 0.0 if is_binary_sensor else 20.0

    try:
        # Use fetch_history with appropriate parameters
        history_values = influx_service.fetch_history(entity_id, num_steps, default_val, agg_fn=agg_fn)

        if not history_values:
            print(f"No data found for {entity_id} in the last {config.TRAINING_LOOKBACK_HOURS} hours.")
            return

        # Create a DataFrame for consistent processing.
        # Timestamps are approximate: they are reconstructed from the step size,
        # end at end_time, and are sized to the number of values actually returned.
        n_values = len(history_values)
        times = [end_time - timedelta(minutes=i * config.HISTORY_STEP_MINUTES) for i in range(n_values)][::-1]
        df = pd.DataFrame({'time': times, 'value': history_values})
        df['time'] = pd.to_datetime(df['time'])

        print(f"Total records found: {len(df)}")
        print(f"Time range: {df['time'].min()} to {df['time'].max()}")
        print("Data preview (first 5 rows):")
        print(df.head())
        print("\nData preview (last 5 rows):")
        print(df.tail())

        # Add min, max, and average value
        if not df.empty:
            print(f"Min value: {df['value'].min()}")
            print(f"Max value: {df['value'].max()}")
            print(f"Average value: {df['value'].mean()}")

        if is_binary_sensor:
            # Filter for 'on' states (assuming 1 for on, 0 for off)
            on_states = df[df['value'] > 0]

            if not on_states.empty:
                print(f"Binary sensor was 'ON' for {len(on_states)} of {num_steps} periods.")
                # Calculate total duration it was 'on'
                total_on_duration_seconds = len(on_states) * config.HISTORY_STEP_MINUTES * 60
                print(f"Approximate total 'ON' duration: {timedelta(seconds=total_on_duration_seconds)}")
            else:
                print("Binary sensor was never 'ON' in the queried period.")

    except Exception as e:
        print(f"Error fetching data for {entity_id}: {e}")


async def main():
    for entity_id in entity_ids_to_check:
        await validate_entity_data(entity_id)

# Jupyter (IPython) supports top-level await, so asyncio.run() is not needed here.
await main()


--- Validating sensor.thermometer_wohnzimmer_kompensiert ---
Total records found: 3456
Time range: 2025-11-02 11:38:18.999203+00:00 to 2025-11-26 11:28:18.999203+00:00
Data preview (first 5 rows):
                              time  value
0 2025-11-02 11:38:18.999203+00:00   22.6
1 2025-11-02 11:48:18.999203+00:00   22.5
2 2025-11-02 11:58:18.999203+00:00   22.4
3 2025-11-02 12:08:18.999203+00:00   22.2
4 2025-11-02 12:18:18.999203+00:00   22.0

Data preview (last 5 rows):
                                 time  value
3451 2025-11-26 10:48:18.999203+00:00   21.0
3452 2025-11-26 10:58:18.999203+00:00   21.0
3453 2025-11-26 11:08:18.999203+00:00   21.0
3454 2025-11-26 11:18:18.999203+00:00   21.0
3455 2025-11-26 11:28:18.999203+00:00   21.0
Min value: 17.8
Max value: 23.2
Average value: 20.985305748456792

--- Validating sensor.thermometer_waermepume_kompensiert ---
Total records found: 3456
Time range: 2025-11-02 11:38:18.999203+00:00 to 2025-11-26 11:28:18.999203+00:00
Data preview (fi

Total records found: 3456
Time range: 2025-11-02 11:38:18.999203+00:00 to 2025-11-26 11:28:18.999203+00:00
Data preview (first 5 rows):
                              time  value
0 2025-11-02 11:38:18.999203+00:00    0.0
1 2025-11-02 11:48:18.999203+00:00    0.0
2 2025-11-02 11:58:18.999203+00:00    0.0
3 2025-11-02 12:08:18.999203+00:00    0.0
4 2025-11-02 12:18:18.999203+00:00    1.0

Data preview (last 5 rows):
                                 time  value
3451 2025-11-26 10:48:18.999203+00:00    1.0
3452 2025-11-26 10:58:18.999203+00:00    1.0
3453 2025-11-26 11:08:18.999203+00:00    1.0
3454 2025-11-26 11:18:18.999203+00:00    1.0
3455 2025-11-26 11:28:18.999203+00:00    1.0
Min value: 0.0
Max value: 1.0
Average value: 0.9427083333333334
Binary sensor was 'ON' for 3258 of 3456 periods.
Approximate total 'ON' duration: 22 days, 15:00:00
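
The summary above counts on periods but not distinct on events. A minimal sketch of both, assuming the aggregated values are 0/1 per step; `summarize_on_periods` is a hypothetical helper, and the sample values are made up.

```python
from datetime import timedelta

import pandas as pd


def summarize_on_periods(values, step_minutes):
    """Count on steps, distinct on events (off-to-on transitions), and on duration."""
    is_on = pd.Series(values, dtype=float) > 0
    # A new event starts wherever a step is on and the previous step was off.
    # The first step has no predecessor, so it is treated as following an off state.
    previous = is_on.shift(1, fill_value=False)
    starts = is_on & ~previous
    on_steps = int(is_on.sum())
    return {
        "on_steps": on_steps,
        "on_events": int(starts.sum()),
        "on_duration": timedelta(minutes=on_steps * step_minutes),
    }


summary = summarize_on_periods([0, 0, 1, 1, 0, 1], step_minutes=10)
print(summary)
```

For the sample list this reports three on steps in two events and 30 minutes of on time. The duration is approximate in the same sense as in the notebook: with `max` aggregation, a step counts as fully on even if the sensor was on for only part of it.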

--- Validating sensor.saj_pv1_power ---
Total records found: 3456
Time range: 2025-11-02 11:38:18.999203+00:00 to 2025-11-26 11:28:18.999203+00:00
D

Next Steps

Based on the output above, you can now:
1.  **Verify Entity IDs**: Ensure that the entity IDs specified in your `.env` and `config.py` accurately reflect the sensors logging data to InfluxDB.

2.  **Check Data Presence**: Confirm that data exists for critical entities, especially for `pv_now`, `fireplace_on`, and `tv_on` if they are being used.

3.  **Inspect Values**: Look at the data previews and the min, max, and average values to confirm that readings are plausible rather than constant or obviously wrong.

4.  **Adjust `TRAINING_LOOKBACK_HOURS`**: If data is sparse, consider whether `TRAINING_LOOKBACK_HOURS` is too short to capture enough samples.

5.  **Review Home Assistant to InfluxDB Integration**: If data is missing or incorrect, check the Home Assistant configuration that sends data to InfluxDB.
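
The value inspection in step 3 can be partly automated. The sketch below flags series that are constant, that stay flat for a long run, or that consist largely of the default value passed to `fetch_history`. It assumes the default is used as a fill value when data is missing, which this notebook does not confirm. `quality_flags` and the `max_flat_steps` threshold are hypothetical.

```python
import pandas as pd


def quality_flags(values, default_val, max_flat_steps=36):
    """Return simple indicators of static or default-filled history values."""
    s = pd.Series(values, dtype=float)
    if s.empty:
        return {"is_constant": True, "longest_flat_run": 0,
                "flat_too_long": False, "share_equal_default": 0.0}
    # Give each run of identical consecutive values its own id, then measure run lengths.
    run_id = (s != s.shift()).cumsum()
    longest_run = int(s.groupby(run_id).size().max())
    return {
        "is_constant": s.nunique() <= 1,
        "longest_flat_run": longest_run,
        "flat_too_long": longest_run > max_flat_steps,
        "share_equal_default": float((s == default_val).mean()),
    }


flags = quality_flags([21.0, 21.0, 21.0, 20.0, 20.5], default_val=20.0, max_flat_steps=2)
print(flags)
```

Long flat runs are expected for binary sensors and for PV power at night, so the flat-run flag is mainly useful for temperature entities. A high share of values equal to the default is worth a closer look for any entity.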