# Explore District Heating and Kinergy Data

This notebook will

- show the data distribution (how many gas, how many district heating sensors)
- compare statistics about gas and district heating data
- show where the sensors are located
- find missing data spans

## Load Data

In [1]:
import polars as pl

from src.energy_forecast.config import RAW_DATA_DIR

dh_hourly = RAW_DATA_DIR / "district_heating_hourly.csv"
dh_daily = RAW_DATA_DIR / "district_heating_daily.csv"
kinergy_daily = RAW_DATA_DIR / "kinergy_daily.csv"
kinergy_hourly = RAW_DATA_DIR / "kinergy_hourly.csv"

[32m2025-01-25 10:47:50.480[0m | [1mINFO    [0m | [36msrc.energy_forecast.config[0m:[36m<module>[0m:[36m11[0m - [1mPROJ_ROOT path is: C:\Users\User\PycharmProjects\energy-forecast-wahl[0m


In [5]:
col_list = ["id", "date", "diff", "primary_energy", "adresse", "ort", "plz", "source"]

In [32]:
df_dh_daily = pl.read_csv(dh_daily)
df_dh_daily = df_dh_daily.with_columns(pl.col("date").str.to_date(),
                                       pl.lit("district heating").alias("source"),
                                       pl.concat_str([pl.col("eco_u_id"), pl.col("data_provider_id")]).alias("id")
                                       ).select(col_list)
df_dh_daily

id,date,diff,primary_energy,adresse,ort,plz,source
str,date,f64,str,str,str,i64,str
"""8f7b3862-a50d-44eb-8ac9-de0cf4…",2022-09-30,147.0,"""district heating""","""Kielort 20""","""Norderstedt""",22850,"""district heating"""
"""8f7b3862-a50d-44eb-8ac9-de0cf4…",2022-10-01,295.0,"""district heating""","""Kielort 20""","""Norderstedt""",22850,"""district heating"""
"""8f7b3862-a50d-44eb-8ac9-de0cf4…",2022-10-02,253.0,"""district heating""","""Kielort 20""","""Norderstedt""",22850,"""district heating"""
"""8f7b3862-a50d-44eb-8ac9-de0cf4…",2022-10-03,315.0,"""district heating""","""Kielort 20""","""Norderstedt""",22850,"""district heating"""
"""8f7b3862-a50d-44eb-8ac9-de0cf4…",2022-10-04,245.0,"""district heating""","""Kielort 20""","""Norderstedt""",22850,"""district heating"""
…,…,…,…,…,…,…,…
"""fb5cc271-ae15-4f24-b9d5-30782b…",2024-05-10,10.0,"""district heating""","""Moorbekstraße 33""","""Norderstedt""",22846,"""district heating"""
"""fb5cc271-ae15-4f24-b9d5-30782b…",2024-05-11,15.0,"""district heating""","""Moorbekstraße 33""","""Norderstedt""",22846,"""district heating"""
"""fb5cc271-ae15-4f24-b9d5-30782b…",2024-05-12,16.0,"""district heating""","""Moorbekstraße 33""","""Norderstedt""",22846,"""district heating"""
"""fb5cc271-ae15-4f24-b9d5-30782b…",2024-05-13,10.0,"""district heating""","""Moorbekstraße 33""","""Norderstedt""",22846,"""district heating"""


In [7]:
df_kinergy_daily = pl.read_csv(kinergy_daily)
df_kinergy_daily = df_kinergy_daily.with_columns(pl.col("date").str.to_date(),
                                                 pl.lit("kinergy").alias("source")
                                                 ).rename({"hash": "id"}
                                                          ).select(
    ["id", "date", "Val", "diff", "primary_energy", "qmbehfl", "anzlwhg", "adresse", "ort", "plz", "source"])
df_kinergy_daily

id,date,Val,diff,primary_energy,qmbehfl,anzlwhg,adresse,ort,plz,source
str,date,f64,f64,str,f64,i64,str,str,i64,str
"""2 # JMe4""",2021-08-13,1.0732e6,59.89,"""gas""",1201.78,64,"""Mittlere Schulstraße 4""","""Erlangen""",91054,"""kinergy"""
"""2 # JMe4""",2021-08-14,1.0733e6,189.84,"""gas""",1201.78,64,"""Mittlere Schulstraße 4""","""Erlangen""",91054,"""kinergy"""
"""2 # JMe4""",2021-08-15,1.0736e6,352.56,"""gas""",1201.78,64,"""Mittlere Schulstraße 4""","""Erlangen""",91054,"""kinergy"""
"""2 # JMe4""",2021-08-16,1.0738e6,244.08,"""gas""",1201.78,64,"""Mittlere Schulstraße 4""","""Erlangen""",91054,"""kinergy"""
"""2 # JMe4""",2021-08-17,1.0741e6,186.45,"""gas""",1201.78,64,"""Mittlere Schulstraße 4""","""Erlangen""",91054,"""kinergy"""
…,…,…,…,…,…,…,…,…,…,…
"""20 # SSg1-3""",2023-09-17,2.2951e6,214.7,"""gas""",8125.67,0,"""Slomanstieg 1-3""","""Hamburg""",20539,"""kinergy"""
"""20 # SSg1-3""",2023-09-18,2.2954e6,248.6,"""gas""",8125.67,0,"""Slomanstieg 1-3""","""Hamburg""",20539,"""kinergy"""
"""20 # SSg1-3""",2023-09-19,2.2957e6,350.3,"""gas""",8125.67,0,"""Slomanstieg 1-3""","""Hamburg""",20539,"""kinergy"""
"""20 # SSg1-3""",2023-09-20,2.2960e6,282.5,"""gas""",8125.67,0,"""Slomanstieg 1-3""","""Hamburg""",20539,"""kinergy"""


In [33]:
df_daily = pl.concat([df_dh_daily, df_kinergy_daily], how="diagonal")
df_daily.sort(by=["id", "date"])

id,date,diff,primary_energy,adresse,ort,plz,source,Val,qmbehfl,anzlwhg
str,date,f64,str,str,str,i64,str,f64,f64,i64
"""0c9ad311-b86f-4371-a695-512ca4…",2022-09-30,364.0,"""district heating""","""Kielortring 14""","""Norderstedt""",22850,"""district heating""",,,
"""0c9ad311-b86f-4371-a695-512ca4…",2022-10-01,703.0,"""district heating""","""Kielortring 14""","""Norderstedt""",22850,"""district heating""",,,
"""0c9ad311-b86f-4371-a695-512ca4…",2022-10-02,334.0,"""district heating""","""Kielortring 14""","""Norderstedt""",22850,"""district heating""",,,
"""0c9ad311-b86f-4371-a695-512ca4…",2022-10-03,891.0,"""district heating""","""Kielortring 14""","""Norderstedt""",22850,"""district heating""",,,
"""0c9ad311-b86f-4371-a695-512ca4…",2022-10-04,661.0,"""district heating""","""Kielortring 14""","""Norderstedt""",22850,"""district heating""",,,
…,…,…,…,…,…,…,…,…,…,…
"""fb684f25-a63d-4d3e-9277-6d759b…",2024-05-10,50.0,"""district heating""","""Heidehofweg 99c""","""Norderstedt""",22850,"""district heating""",,,
"""fb684f25-a63d-4d3e-9277-6d759b…",2024-05-11,51.0,"""district heating""","""Heidehofweg 99c""","""Norderstedt""",22850,"""district heating""",,,
"""fb684f25-a63d-4d3e-9277-6d759b…",2024-05-12,45.0,"""district heating""","""Heidehofweg 99c""","""Norderstedt""",22850,"""district heating""",,,
"""fb684f25-a63d-4d3e-9277-6d759b…",2024-05-13,46.0,"""district heating""","""Heidehofweg 99c""","""Norderstedt""",22850,"""district heating""",,,


In [34]:
print(f"Total number of datapoints (daily): {len(df_daily)}")
print(f"Total number of sensors: {len(df_daily.group_by(pl.col('id')).agg())}")

Total number of datapoints (daily): 56253
Total number of sensors: 103


In [49]:
df_dh_hourly = pl.read_csv(dh_hourly)
df_dh_hourly = df_dh_hourly.with_columns((pl.col("date").str.to_date().dt.combine(pl.time(pl.col("hour")))),
                                         pl.lit("dh").alias("source"),
                                         pl.concat_str([pl.col("eco_u_id"), pl.col("data_provider_id")]).alias("id")
                                         ).select(col_list)
df_dh_hourly

id,date,diff,primary_energy,adresse,ort,plz,source
str,datetime[μs],f64,str,str,str,i64,str
"""8f7b3862-a50d-44eb-8ac9-de0cf4…",2022-09-30 15:00:00,32.0,"""district heating""","""Kielort 20""","""Norderstedt""",22850,"""dh"""
"""8f7b3862-a50d-44eb-8ac9-de0cf4…",2022-09-30 17:00:00,30.0,"""district heating""","""Kielort 20""","""Norderstedt""",22850,"""dh"""
"""8f7b3862-a50d-44eb-8ac9-de0cf4…",2022-09-30 19:00:00,37.0,"""district heating""","""Kielort 20""","""Norderstedt""",22850,"""dh"""
"""8f7b3862-a50d-44eb-8ac9-de0cf4…",2022-09-30 23:00:00,48.0,"""district heating""","""Kielort 20""","""Norderstedt""",22850,"""dh"""
"""8f7b3862-a50d-44eb-8ac9-de0cf4…",2022-10-01 05:00:00,50.0,"""district heating""","""Kielort 20""","""Norderstedt""",22850,"""dh"""
…,…,…,…,…,…,…,…
"""fb5cc271-ae15-4f24-b9d5-30782b…",2024-05-13 23:00:00,0.0,"""district heating""","""Moorbekstraße 33""","""Norderstedt""",22846,"""dh"""
"""fb5cc271-ae15-4f24-b9d5-30782b…",2024-05-14 00:00:00,0.0,"""district heating""","""Moorbekstraße 33""","""Norderstedt""",22846,"""dh"""
"""fb5cc271-ae15-4f24-b9d5-30782b…",2024-05-14 03:00:00,1.0,"""district heating""","""Moorbekstraße 33""","""Norderstedt""",22846,"""dh"""
"""fb5cc271-ae15-4f24-b9d5-30782b…",2024-05-14 06:00:00,1.0,"""district heating""","""Moorbekstraße 33""","""Norderstedt""",22846,"""dh"""


In [47]:
df_kinergy_hourly = pl.read_csv(kinergy_hourly)
df_kinergy_hourly = df_kinergy_hourly.with_columns(pl.col("date").str.to_date().dt.combine(pl.time(pl.col("hour"))),
                                                   pl.lit("kinergy").alias("source")
                                                   ).rename({"hash": "id"}
                                                            ).select(
    ["id", "date", "avg_sum_kwh", "total_kwh_diff", "primary_energy", "qmbehfl", "anzlwhg", "adresse", "ort",
     "plz", "source"])
df_kinergy_hourly

id,date,avg_sum_kwh,total_kwh_diff,primary_energy,qmbehfl,anzlwhg,adresse,ort,plz,source
str,datetime[μs],f64,f64,str,f64,i64,str,str,i64,str
"""2 # JMe4""",2021-08-13 12:00:00,1.0731e6,0.0,"""gas""",1201.78,64,"""Mittlere Schulstraße 4""","""Erlangen""",91054,"""kinergy"""
"""2 # JMe4""",2021-08-13 13:00:00,1.0731e6,15.82,"""gas""",1201.78,64,"""Mittlere Schulstraße 4""","""Erlangen""",91054,"""kinergy"""
"""2 # JMe4""",2021-08-13 14:00:00,1.0732e6,22.6,"""gas""",1201.78,64,"""Mittlere Schulstraße 4""","""Erlangen""",91054,"""kinergy"""
"""2 # JMe4""",2021-08-13 15:00:00,1.0732e6,0.0,"""gas""",1201.78,64,"""Mittlere Schulstraße 4""","""Erlangen""",91054,"""kinergy"""
"""2 # JMe4""",2021-08-13 16:00:00,1.0732e6,21.47,"""gas""",1201.78,64,"""Mittlere Schulstraße 4""","""Erlangen""",91054,"""kinergy"""
…,…,…,…,…,…,…,…,…,…,…
"""20 # SSg1-3""",2023-09-21 05:00:00,2.2962e6,22.6,"""gas""",8125.67,0,"""Slomanstieg 1-3""","""Hamburg""",20539,"""kinergy"""
"""20 # SSg1-3""",2023-09-21 06:00:00,2.2962e6,11.3,"""gas""",8125.67,0,"""Slomanstieg 1-3""","""Hamburg""",20539,"""kinergy"""
"""20 # SSg1-3""",2023-09-21 07:00:00,2.2962e6,0.0,"""gas""",8125.67,0,"""Slomanstieg 1-3""","""Hamburg""",20539,"""kinergy"""
"""20 # SSg1-3""",2023-09-21 08:00:00,2.2962e6,22.6,"""gas""",8125.67,0,"""Slomanstieg 1-3""","""Hamburg""",20539,"""kinergy"""


In [51]:
df_hourly = pl.concat([df_dh_hourly, df_kinergy_hourly], how="diagonal")
df_hourly.sort(by=["id", "date"])

id,date,diff,primary_energy,adresse,ort,plz,source,avg_sum_kwh,total_kwh_diff,qmbehfl,anzlwhg
str,datetime[μs],f64,str,str,str,i64,str,f64,f64,f64,i64
"""0c9ad311-b86f-4371-a695-512ca4…",2022-09-30 11:00:00,24.0,"""district heating""","""Kielortring 14""","""Norderstedt""",22850,"""dh""",,,,
"""0c9ad311-b86f-4371-a695-512ca4…",2022-09-30 12:00:00,28.0,"""district heating""","""Kielortring 14""","""Norderstedt""",22850,"""dh""",,,,
"""0c9ad311-b86f-4371-a695-512ca4…",2022-09-30 13:00:00,29.0,"""district heating""","""Kielortring 14""","""Norderstedt""",22850,"""dh""",,,,
"""0c9ad311-b86f-4371-a695-512ca4…",2022-09-30 15:00:00,55.0,"""district heating""","""Kielortring 14""","""Norderstedt""",22850,"""dh""",,,,
"""0c9ad311-b86f-4371-a695-512ca4…",2022-09-30 16:00:00,27.0,"""district heating""","""Kielortring 14""","""Norderstedt""",22850,"""dh""",,,,
…,…,…,…,…,…,…,…,…,…,…,…
"""fb684f25-a63d-4d3e-9277-6d759b…",2024-05-14 03:00:00,3.0,"""district heating""","""Heidehofweg 99c""","""Norderstedt""",22850,"""dh""",,,,
"""fb684f25-a63d-4d3e-9277-6d759b…",2024-05-14 04:00:00,5.0,"""district heating""","""Heidehofweg 99c""","""Norderstedt""",22850,"""dh""",,,,
"""fb684f25-a63d-4d3e-9277-6d759b…",2024-05-14 05:00:00,6.0,"""district heating""","""Heidehofweg 99c""","""Norderstedt""",22850,"""dh""",,,,
"""fb684f25-a63d-4d3e-9277-6d759b…",2024-05-14 06:00:00,3.0,"""district heating""","""Heidehofweg 99c""","""Norderstedt""",22850,"""dh""",,,,


In [52]:
print(f"Total number of datapoints (hourly): {len(df_hourly)}")

Total number of datapoints (hourly): 1047541


## Data distribution

In [26]:
print(f"Number of gas sensors: {len(df_daily.filter(pl.col('primary_energy') == 'gas').group_by(pl.col('id')).agg())}")
print(f"Number of datapoints (daily) for gas sensors: {len(df_daily.filter(pl.col('primary_energy') == 'gas'))}")
print(
    f"Number of district heating sensors: {len(df_daily.filter(pl.col('primary_energy') == 'district heating').group_by(pl.col('id')).agg())}")
print(
    f"Number of datapoints (daily) for district heating sensors: {len(df_daily.filter(pl.col('primary_energy') == 'district heating'))}")

Number of gas sensors: 9
Number of datapoints (daily) for gas sensors: 5733
Number of district heating sensors: 94
Number of datapoints (daily) for district heating sensors: 50520


Datapoints (daily) per gas sensors:

In [29]:
df_daily.filter(pl.col('primary_energy') == 'gas').group_by(pl.col("id")).agg(pl.len(),
                                                                              pl.col("date").min().alias("min_date"),
                                                                              pl.col("date").max().alias("max_date"))

id,len,min_date,max_date
str,u32,date,date
"""7 # WHg2-8""",783,2021-07-08,2023-09-13
"""3 # JOe11""",92,2023-06-19,2023-09-18
"""4 # JSe21/23""",752,2021-08-13,2023-09-18
"""8 # WUe4""",783,2021-07-08,2023-09-13
"""19 # SKg63""",391,2022-08-31,2023-09-25
"""2 # JMe4""",752,2021-08-13,2023-09-18
"""6 # WGg8a""",856,2021-05-06,2023-09-13
"""20 # SSg1-3""",606,2022-01-24,2023-09-21
"""17 # SFp36""",718,2021-10-04,2023-09-21


In [35]:
print(
    f"Average length of datapoints (daily) for gas sensors: {df_daily.filter(pl.col('primary_energy') == 'gas').group_by(pl.col('id')).agg(pl.len()).mean()['len'].item()}")

Average length of datapoints (daily) for gas sensors: 637.0


Datapoints (daily) per district heating:

In [36]:
df_daily.filter(pl.col('primary_energy') == 'district heating').group_by(pl.col("id")).agg(pl.len(),
                                                                                           pl.col("date").min().alias(
                                                                                               "min_date"),
                                                                                           pl.col("date").max().alias(
                                                                                               "max_date"))

id,len,min_date,max_date
str,u32,date,date
"""e7ad9b75-bc6c-4891-a8fd-45e393…",537,2022-09-30,2024-05-14
"""1657f5b3-fad0-4685-b56c-d57982…",608,2022-08-23,2024-05-14
"""fb5cc271-ae15-4f24-b9d5-30782b…",600,2022-08-23,2024-05-14
"""c00c8cba-b6de-4c10-89c0-e92312…",593,2022-08-23,2024-05-14
"""c00c8cba-b6de-4c10-89c0-e92312…",592,2022-08-23,2024-05-14
…,…,…,…
"""fb5cc271-ae15-4f24-b9d5-30782b…",605,2022-08-23,2024-05-14
"""10af300b-a270-4e41-928d-e4048b…",535,2022-09-30,2024-05-14
"""8e9b1544-434e-44a7-8049-8f2e4b…",592,2022-08-23,2024-05-14
"""fb5cc271-ae15-4f24-b9d5-30782b…",602,2022-08-23,2024-05-14


In [37]:
print(
    f"Average length of datapoints (daily) for district heating sensors: {df_daily.filter(pl.col('primary_energy') == 'district heating').group_by(pl.col('id')).agg(pl.len()).mean()['len'].item()}")

Average length of datapoints (daily) for district heating sensors: 537.4468085106383


### Summary

- we have way less gas sensor data than district heating data so far (9/94)
- the gas sensors recorded about 100 days longer on average

## Statistics

In [41]:
df_daily.filter(pl.col("primary_energy") == "gas").describe()["statistic", "diff"]

statistic,diff
str,f64
"""count""",5733.0
"""null_count""",0.0
"""mean""",1140.123146
"""std""",1792.365846
"""min""",0.0
"""25%""",155.601
"""50%""",459.91
"""75%""",1126.61
"""max""",18028.02


In [42]:
df_daily.filter(pl.col("primary_energy") == "district heating").describe()["statistic", "diff"]

statistic,diff
str,f64
"""count""",50520.0
"""null_count""",0.0
"""mean""",342.830245
"""std""",7373.526528
"""min""",-1031452.0
"""25%""",6.0
"""50%""",124.0
"""75%""",500.0
"""max""",386659.0


## Erroneous Values

In [44]:
df_daily.filter((pl.col("primary_energy") == "district heating") & (pl.col("diff") < 0))

id,date,diff,primary_energy,adresse,ort,plz,source,Val,qmbehfl,anzlwhg
str,date,f64,str,str,str,i64,str,f64,f64,i64
"""8e9b1544-434e-44a7-8049-8f2e4b…",2023-04-14,-530377.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
"""bc098a2e-0cc7-4f01-b6ad-9d647a…",2023-05-09,-772883.0,"""district heating""","""Ulzburger Straße 459 A""","""Norderstedt""",22846,"""kinergy""",,,
"""5e2fd59d-603a-488b-a525-513541…",2023-12-13,-632734.0,"""district heating""","""Kielortring 51""","""Norderstedt""",22850,"""kinergy""",,,
"""d5fb4343-04d4-4521-8a4b-feaf77…",2023-11-08,-1031452.0,"""district heating""","""Kielort 21""","""Norderstedt""",22850,"""kinergy""",,,
"""8e9b1544-434e-44a7-8049-8f2e4b…",2023-04-14,-233503.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
"""bc098a2e-0cc7-4f01-b6ad-9d647a…",2023-05-04,-330862.0,"""district heating""","""Ulzburger Straße 459 A""","""Norderstedt""",22846,"""kinergy""",,,


In [49]:
df_daily.filter((pl.col("primary_energy") == "district heating") & (
        pl.col("id") == "bc098a2e-0cc7-4f01-b6ad-9d647ae9f627ebf76f41mvxg") & (
                        pl.col("date") > pl.date(2023, 4, 9)) & (pl.col("date") < pl.date(2023, 5, 6))).sort(
    by="date")

id,date,diff,primary_energy,adresse,ort,plz,source,Val,qmbehfl,anzlwhg
str,date,f64,str,str,str,i64,str,f64,f64,i64
"""bc098a2e-0cc7-4f01-b6ad-9d647a…",2023-04-10,190.0,"""district heating""","""Ulzburger Straße 459 A""","""Norderstedt""",22846,"""kinergy""",,,
"""bc098a2e-0cc7-4f01-b6ad-9d647a…",2023-04-11,180.0,"""district heating""","""Ulzburger Straße 459 A""","""Norderstedt""",22846,"""kinergy""",,,
"""bc098a2e-0cc7-4f01-b6ad-9d647a…",2023-04-12,45.0,"""district heating""","""Ulzburger Straße 459 A""","""Norderstedt""",22846,"""kinergy""",,,
"""bc098a2e-0cc7-4f01-b6ad-9d647a…",2023-04-13,0.0,"""district heating""","""Ulzburger Straße 459 A""","""Norderstedt""",22846,"""kinergy""",,,
"""bc098a2e-0cc7-4f01-b6ad-9d647a…",2023-04-14,0.0,"""district heating""","""Ulzburger Straße 459 A""","""Norderstedt""",22846,"""kinergy""",,,
…,…,…,…,…,…,…,…,…,…,…
"""bc098a2e-0cc7-4f01-b6ad-9d647a…",2023-04-24,0.0,"""district heating""","""Ulzburger Straße 459 A""","""Norderstedt""",22846,"""kinergy""",,,
"""bc098a2e-0cc7-4f01-b6ad-9d647a…",2023-04-25,0.0,"""district heating""","""Ulzburger Straße 459 A""","""Norderstedt""",22846,"""kinergy""",,,
"""bc098a2e-0cc7-4f01-b6ad-9d647a…",2023-04-26,0.0,"""district heating""","""Ulzburger Straße 459 A""","""Norderstedt""",22846,"""kinergy""",,,
"""bc098a2e-0cc7-4f01-b6ad-9d647a…",2023-05-04,-330862.0,"""district heating""","""Ulzburger Straße 459 A""","""Norderstedt""",22846,"""kinergy""",,,


In [52]:
df_daily.filter((pl.col("primary_energy") == "district heating") & (
        pl.col("id") == "8e9b1544-434e-44a7-8049-8f2e4b14a819c5ebff07syra") & (
                        pl.col("date") > pl.date(2023, 4, 9)))

id,date,diff,primary_energy,adresse,ort,plz,source,Val,qmbehfl,anzlwhg
str,date,f64,str,str,str,i64,str,f64,f64,i64
"""8e9b1544-434e-44a7-8049-8f2e4b…",2023-04-10,112.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
"""8e9b1544-434e-44a7-8049-8f2e4b…",2023-04-11,111.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
"""8e9b1544-434e-44a7-8049-8f2e4b…",2023-04-12,106.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
"""8e9b1544-434e-44a7-8049-8f2e4b…",2023-04-13,108.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
"""8e9b1544-434e-44a7-8049-8f2e4b…",2023-04-14,-233503.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
…,…,…,…,…,…,…,…,…,…,…
"""8e9b1544-434e-44a7-8049-8f2e4b…",2024-05-10,133.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
"""8e9b1544-434e-44a7-8049-8f2e4b…",2024-05-11,147.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
"""8e9b1544-434e-44a7-8049-8f2e4b…",2024-05-12,160.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
"""8e9b1544-434e-44a7-8049-8f2e4b…",2024-05-13,142.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,


In [69]:
df_daily.filter((pl.col("primary_energy") == "district heating") & (
        pl.col("id") == "5e2fd59d-603a-488b-a525-513541039c4a47e757c2y9pw") & (
                        pl.col("date") > pl.date(2023, 1, 1)))  # große Datenlücke märz bis juli 2023

id,date,diff,primary_energy,adresse,ort,plz,source,Val,qmbehfl,anzlwhg
str,date,f64,str,str,str,i64,str,f64,f64,i64
"""5e2fd59d-603a-488b-a525-513541…",2023-01-02,298.0,"""district heating""","""Kielortring 51""","""Norderstedt""",22850,"""kinergy""",,,
"""5e2fd59d-603a-488b-a525-513541…",2023-01-03,428.0,"""district heating""","""Kielortring 51""","""Norderstedt""",22850,"""kinergy""",,,
"""5e2fd59d-603a-488b-a525-513541…",2023-01-04,198.0,"""district heating""","""Kielortring 51""","""Norderstedt""",22850,"""kinergy""",,,
"""5e2fd59d-603a-488b-a525-513541…",2023-01-07,1157.0,"""district heating""","""Kielortring 51""","""Norderstedt""",22850,"""kinergy""",,,
"""5e2fd59d-603a-488b-a525-513541…",2023-01-11,1518.0,"""district heating""","""Kielortring 51""","""Norderstedt""",22850,"""kinergy""",,,
…,…,…,…,…,…,…,…,…,…,…
"""5e2fd59d-603a-488b-a525-513541…",2024-05-10,257.0,"""district heating""","""Kielortring 51""","""Norderstedt""",22850,"""kinergy""",,,
"""5e2fd59d-603a-488b-a525-513541…",2024-05-11,214.0,"""district heating""","""Kielortring 51""","""Norderstedt""",22850,"""kinergy""",,,
"""5e2fd59d-603a-488b-a525-513541…",2024-05-12,209.0,"""district heating""","""Kielortring 51""","""Norderstedt""",22850,"""kinergy""",,,
"""5e2fd59d-603a-488b-a525-513541…",2024-05-13,175.0,"""district heating""","""Kielortring 51""","""Norderstedt""",22850,"""kinergy""",,,


In [70]:
df_daily.filter((pl.col("primary_energy") == "district heating") & (
        pl.col("id") == "d5fb4343-04d4-4521-8a4b-feaf772ff376e225f415kyug") & (
                        pl.col("date") > pl.date(2023, 9, 1)))

id,date,diff,primary_energy,adresse,ort,plz,source,Val,qmbehfl,anzlwhg
str,date,f64,str,str,str,i64,str,f64,f64,i64
"""d5fb4343-04d4-4521-8a4b-feaf77…",2023-09-06,2147.0,"""district heating""","""Kielort 21""","""Norderstedt""",22850,"""kinergy""",,,
"""d5fb4343-04d4-4521-8a4b-feaf77…",2023-09-07,0.0,"""district heating""","""Kielort 21""","""Norderstedt""",22850,"""kinergy""",,,
"""d5fb4343-04d4-4521-8a4b-feaf77…",2023-09-08,0.0,"""district heating""","""Kielort 21""","""Norderstedt""",22850,"""kinergy""",,,
"""d5fb4343-04d4-4521-8a4b-feaf77…",2023-09-09,0.0,"""district heating""","""Kielort 21""","""Norderstedt""",22850,"""kinergy""",,,
"""d5fb4343-04d4-4521-8a4b-feaf77…",2023-09-10,0.0,"""district heating""","""Kielort 21""","""Norderstedt""",22850,"""kinergy""",,,
…,…,…,…,…,…,…,…,…,…,…
"""d5fb4343-04d4-4521-8a4b-feaf77…",2024-05-10,251.0,"""district heating""","""Kielort 21""","""Norderstedt""",22850,"""kinergy""",,,
"""d5fb4343-04d4-4521-8a4b-feaf77…",2024-05-11,242.0,"""district heating""","""Kielort 21""","""Norderstedt""",22850,"""kinergy""",,,
"""d5fb4343-04d4-4521-8a4b-feaf77…",2024-05-12,251.0,"""district heating""","""Kielort 21""","""Norderstedt""",22850,"""kinergy""",,,
"""d5fb4343-04d4-4521-8a4b-feaf77…",2024-05-13,176.0,"""district heating""","""Kielort 21""","""Norderstedt""",22850,"""kinergy""",,,


In [73]:
df_daily.filter((pl.col("primary_energy") == "district heating") & (
        pl.col("id") == "8e9b1544-434e-44a7-8049-8f2e4b14a819f1af853af3qs") & (
                        pl.col("date") > pl.date(2023, 4, 1)))

id,date,diff,primary_energy,adresse,ort,plz,source,Val,qmbehfl,anzlwhg
str,date,f64,str,str,str,i64,str,f64,f64,i64
"""8e9b1544-434e-44a7-8049-8f2e4b…",2023-04-02,260.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
"""8e9b1544-434e-44a7-8049-8f2e4b…",2023-04-03,269.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
"""8e9b1544-434e-44a7-8049-8f2e4b…",2023-04-04,275.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
"""8e9b1544-434e-44a7-8049-8f2e4b…",2023-04-05,311.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
"""8e9b1544-434e-44a7-8049-8f2e4b…",2023-04-06,260.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
…,…,…,…,…,…,…,…,…,…,…
"""8e9b1544-434e-44a7-8049-8f2e4b…",2024-05-10,36.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
"""8e9b1544-434e-44a7-8049-8f2e4b…",2024-05-11,36.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
"""8e9b1544-434e-44a7-8049-8f2e4b…",2024-05-12,39.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,
"""8e9b1544-434e-44a7-8049-8f2e4b…",2024-05-13,20.0,"""district heating""","""Ulzburger Straße 461""","""Norderstedt""",22846,"""kinergy""",,,


### Summary

- there are data holes where diff is zero and the end of the disconnection is marked by a negative large value -> ?
- there are large negative values that are probably mistakes in connection without preceding zeros -> take value of day before

## Location of Sensors

## Missing Data

Find missing data, like days inbetween without an entry

### Daily Data - District Heating

In [12]:
df_time = df_daily.filter(pl.col('primary_energy') == 'district heating').group_by(pl.col("id")).agg(pl.len(),
                                                                                                     pl.col(
                                                                                                         "date").min().alias(
                                                                                                         "min_date"),
                                                                                                     pl.col(
                                                                                                         "date").max().alias(
                                                                                                         "max_date"))
df_time

id,len,min_date,max_date
str,u32,date,date
"""1657f5b3-fad0-4685-b56c-d57982…",593,2022-08-23,2024-05-14
"""a141529b-2b91-4c85-ae9f-907901…",582,2022-08-23,2024-05-14
"""5 # WFe21-25""",527,2022-04-04,2023-09-12
"""2f025f96-af2c-4140-b955-766a79…",422,2022-10-27,2024-05-14
"""d00d6502-a08d-45df-99e3-7d8cd5…",590,2022-08-23,2024-05-14
…,…,…,…
"""c1469372-a3c4-4870-808c-8039cb…",603,2022-08-23,2024-05-14
"""c00c8cba-b6de-4c10-89c0-e92312…",594,2022-08-23,2024-05-13
"""c00c8cba-b6de-4c10-89c0-e92312…",596,2022-08-23,2024-05-14
"""fb5cc271-ae15-4f24-b9d5-30782b…",602,2022-08-23,2024-05-14


In [29]:
from datetime import timedelta
import pandas as pd

missing_dates = []

for row in df_time.iter_rows():
    id = row[0]
    print("\nSensor: ", id, "\n")
    print(f"Sensor {id} has {row[1]} datapoints")
    start_date = row[2]
    end_date = row[3]

    date_list_rec = df_daily.filter(pl.col("id") == id).select(pl.col("date")).to_pandas()["date"].tolist()
    one_day = timedelta(days=1)
    D = "D"

    date_list = pd.date_range(start_date, end_date, freq=D)

    missing_dates_sensor = list(set(date_list) - set(date_list_rec))
    print(f"Missing dates: {missing_dates_sensor}")
    missing_dates.append(
        {"id": id, "missing_dates": missing_dates_sensor, "len": len(missing_dates_sensor), "n": row[1],
         "per": (((row[1] + len(missing_dates_sensor)) / (row[1])) * 100) - 100})
pl.DataFrame(missing_dates).sort(pl.col("len"), descending=True)


Sensor:  1657f5b3-fad0-4685-b56c-d57982d5a35c86c8052ag2fa 

Sensor 1657f5b3-fad0-4685-b56c-d57982d5a35c86c8052ag2fa has 593 datapoints
Missing dates: [Timestamp('2023-02-11 00:00:00'), Timestamp('2024-02-03 00:00:00'), Timestamp('2023-02-16 00:00:00'), Timestamp('2022-11-12 00:00:00'), Timestamp('2024-01-30 00:00:00'), Timestamp('2023-02-14 00:00:00'), Timestamp('2024-03-16 00:00:00'), Timestamp('2023-05-04 00:00:00'), Timestamp('2023-02-10 00:00:00'), Timestamp('2023-05-06 00:00:00'), Timestamp('2024-03-17 00:00:00'), Timestamp('2023-04-27 00:00:00'), Timestamp('2024-02-01 00:00:00'), Timestamp('2023-02-15 00:00:00'), Timestamp('2022-11-13 00:00:00'), Timestamp('2023-02-13 00:00:00'), Timestamp('2023-02-12 00:00:00'), Timestamp('2024-02-07 00:00:00'), Timestamp('2022-12-01 00:00:00'), Timestamp('2023-05-02 00:00:00'), Timestamp('2023-04-29 00:00:00'), Timestamp('2023-05-05 00:00:00'), Timestamp('2024-02-05 00:00:00'), Timestamp('2024-02-04 00:00:00'), Timestamp('2023-02-09 00:00:00')

id,missing_dates,len,n,per
str,list[datetime[μs]],i64,i64,f64
"""d566a120-d232-489a-aa42-850e5a…","[2024-02-12 00:00:00, 2023-01-11 00:00:00, … 2023-05-25 00:00:00]",312,308,101.298701
"""7dd30c54-3be7-4a3c-b5e0-9841bb…","[2024-02-12 00:00:00, 2024-03-13 00:00:00, … 2023-05-25 00:00:00]",308,285,108.070175
"""5c8f03f4-9165-43a2-8c42-1e8133…","[2023-01-11 00:00:00, 2024-02-03 00:00:00, … 2023-08-01 00:00:00]",286,232,123.275862
"""4ccc1cea-534d-4dbe-bf66-0d31d8…","[2023-01-11 00:00:00, 2024-02-03 00:00:00, … 2023-08-01 00:00:00]",275,347,79.25072
"""5e2fd59d-603a-488b-a525-513541…","[2024-02-03 00:00:00, 2023-06-17 00:00:00, … 2023-05-15 00:00:00]",232,336,69.047619
…,…,…,…,…
"""5 # WFe21-25""",[],0,527,0.0
"""11 # BSeH2""",[],0,332,0.0
"""9 # BMr03""",[],0,257,0.0
"""12 # BTr9""",[],0,255,0.0


Summary
- first 4 sensors have as much missing days as data
- first 20 sensors -> remove ?

### Daily Data - Gas

In [30]:
df_time = df_daily.filter(pl.col('primary_energy') == 'gas').group_by(pl.col("id")).agg(pl.len(),
                                                                                        pl.col(
                                                                                            "date").min().alias(
                                                                                            "min_date"),
                                                                                        pl.col(
                                                                                            "date").max().alias(
                                                                                            "max_date"))
df_time

id,len,min_date,max_date
str,u32,date,date
"""6 # WGg8a""",856,2021-05-06,2023-09-13
"""3 # JOe11""",92,2023-06-19,2023-09-18
"""20 # SSg1-3""",606,2022-01-24,2023-09-21
"""4 # JSe21/23""",752,2021-08-13,2023-09-18
"""8 # WUe4""",783,2021-07-08,2023-09-13
"""17 # SFp36""",718,2021-10-04,2023-09-21
"""19 # SKg63""",391,2022-08-31,2023-09-25
"""7 # WHg2-8""",783,2021-07-08,2023-09-13
"""2 # JMe4""",752,2021-08-13,2023-09-18


In [31]:
missing_dates = []

for row in df_time.iter_rows():
    id = row[0]
    print("\nSensor: ", id, "\n")
    print(f"Sensor {id} has {row[1]} datapoints")
    start_date = row[2]
    end_date = row[3]

    date_list_rec = df_daily.filter(pl.col("id") == id).select(pl.col("date")).to_pandas()["date"].tolist()
    one_day = timedelta(days=1)
    D = "D"

    date_list = pd.date_range(start_date, end_date, freq=D)

    missing_dates_sensor = list(set(date_list) - set(date_list_rec))
    print(f"Missing dates: {missing_dates_sensor}")
    missing_dates.append(
        {"id": id, "missing_dates": missing_dates_sensor, "len": len(missing_dates_sensor), "n": row[1],
         "per": (((row[1] + len(missing_dates_sensor)) / (row[1])) * 100) - 100})
pl.DataFrame(missing_dates).sort(pl.col("len"), descending=True)


Sensor:  6 # WGg8a 

Sensor 6 # WGg8a has 856 datapoints
Missing dates: [Timestamp('2021-06-13 00:00:00'), Timestamp('2021-06-12 00:00:00'), Timestamp('2021-06-11 00:00:00'), Timestamp('2021-05-30 00:00:00'), Timestamp('2021-05-29 00:00:00')]

Sensor:  3 # JOe11 

Sensor 3 # JOe11 has 92 datapoints
Missing dates: []

Sensor:  20 # SSg1-3 

Sensor 20 # SSg1-3 has 606 datapoints
Missing dates: []

Sensor:  4 # JSe21/23 

Sensor 4 # JSe21/23 has 752 datapoints
Missing dates: [Timestamp('2021-09-08 00:00:00'), Timestamp('2021-09-20 00:00:00'), Timestamp('2021-09-17 00:00:00'), Timestamp('2021-09-11 00:00:00'), Timestamp('2021-09-12 00:00:00'), Timestamp('2021-09-09 00:00:00'), Timestamp('2021-09-07 00:00:00'), Timestamp('2021-09-21 00:00:00'), Timestamp('2021-09-18 00:00:00'), Timestamp('2021-09-16 00:00:00'), Timestamp('2021-09-19 00:00:00'), Timestamp('2021-09-14 00:00:00'), Timestamp('2021-09-15 00:00:00'), Timestamp('2021-09-13 00:00:00'), Timestamp('2021-09-10 00:00:00')]

Sensor:  8

id,missing_dates,len,n,per
str,list[datetime[μs]],i64,i64,f64
"""4 # JSe21/23""","[2021-09-08 00:00:00, 2021-09-20 00:00:00, … 2021-09-10 00:00:00]",15,752,1.994681
"""8 # WUe4""","[2021-09-08 00:00:00, 2021-09-20 00:00:00, … 2021-09-10 00:00:00]",15,783,1.915709
"""7 # WHg2-8""","[2021-09-08 00:00:00, 2021-09-20 00:00:00, … 2021-09-10 00:00:00]",15,783,1.915709
"""2 # JMe4""","[2021-09-08 00:00:00, 2021-09-20 00:00:00, … 2021-09-10 00:00:00]",15,752,1.994681
"""6 # WGg8a""","[2021-06-13 00:00:00, 2021-06-12 00:00:00, … 2021-05-29 00:00:00]",5,856,0.584112
"""3 # JOe11""",[],0,92,0.0
"""20 # SSg1-3""",[],0,606,0.0
"""17 # SFp36""",[],0,718,0.0
"""19 # SKg63""",[],0,391,0.0


Summary
- Gas sensors dont have as much missing data, can be disregarded

### Hourly Data - Gas

In [54]:
df_time = df_hourly.filter(pl.col('primary_energy') == 'gas').group_by(pl.col("id")).agg(pl.len(),
                                                                                        pl.col(
                                                                                            "date").min().alias(
                                                                                            "min_date"),
                                                                                        pl.col(
                                                                                            "date").max().alias(
                                                                                            "max_date"))
df_time

id,len,min_date,max_date
str,u32,datetime[μs],datetime[μs]
"""3 # JOe11""",2176,2023-06-19 13:00:00,2023-09-18 15:00:00
"""7 # WHg2-8""",18551,2021-07-08 15:00:00,2023-09-13 05:00:00
"""6 # WGg8a""",20151,2021-05-06 14:00:00,2023-09-13 07:00:00
"""8 # WUe4""",18550,2021-07-08 16:00:00,2023-09-13 06:00:00
"""19 # SKg63""",9323,2022-08-31 22:00:00,2023-09-25 07:00:00
"""17 # SFp36""",17047,2021-10-04 10:00:00,2023-09-21 10:00:00
"""20 # SSg1-3""",14414,2022-01-24 08:00:00,2023-09-21 09:00:00
"""4 # JSe21/23""",17834,2021-08-13 13:00:00,2023-09-18 15:00:00
"""2 # JMe4""",17815,2021-08-13 12:00:00,2023-09-18 13:00:00


In [55]:
for row in df_time.iter_rows():
    id = row[0]
    print("\nSensor: ", id, "\n")
    print(f"Sensor {id} has {row[1]} datapoints")
    start_date = row[2]
    end_date = row[3]

    date_list_rec = df_daily.filter(pl.col("id") == id).select(pl.col("date")).to_pandas()["date"].tolist()
    one_day = timedelta(days=1)
    D = "h"

    date_list = pd.date_range(start_date, end_date, freq=D)

    missing_dates_sensor = list(set(date_list) - set(date_list_rec))
    print(f"Missing dates: {missing_dates_sensor}")
    missing_dates.append(
        {"id": id, "missing_dates": missing_dates_sensor, "len": len(missing_dates_sensor), "n": row[1],
         "per": (((row[1] + len(missing_dates_sensor)) / (row[1])) * 100) - 100})
pl.DataFrame(missing_dates).sort(pl.col("len"), descending=True)


Sensor:  3 # JOe11 

Sensor 3 # JOe11 has 2176 datapoints
Missing dates: [Timestamp('2023-09-15 05:00:00'), Timestamp('2023-09-16 10:00:00'), Timestamp('2023-06-20 23:00:00'), Timestamp('2023-07-02 12:00:00'), Timestamp('2023-07-17 03:00:00'), Timestamp('2023-08-11 10:00:00'), Timestamp('2023-08-15 07:00:00'), Timestamp('2023-09-09 10:00:00'), Timestamp('2023-08-22 10:00:00'), Timestamp('2023-06-24 11:00:00'), Timestamp('2023-07-24 22:00:00'), Timestamp('2023-08-01 12:00:00'), Timestamp('2023-09-12 11:00:00'), Timestamp('2023-07-17 16:00:00'), Timestamp('2023-07-26 21:00:00'), Timestamp('2023-07-29 11:00:00'), Timestamp('2023-08-01 14:00:00'), Timestamp('2023-07-21 13:00:00'), Timestamp('2023-09-11 02:00:00'), Timestamp('2023-08-21 15:00:00'), Timestamp('2023-07-08 05:00:00'), Timestamp('2023-08-28 19:00:00'), Timestamp('2023-08-25 22:00:00'), Timestamp('2023-07-11 06:00:00'), Timestamp('2023-09-10 06:00:00'), Timestamp('2023-09-11 12:00:00'), Timestamp('2023-07-06 08:00:00'), Timesta

IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)



Missing dates: [Timestamp('2023-09-15 05:00:00'), Timestamp('2023-04-14 08:00:00'), Timestamp('2022-09-10 01:00:00'), Timestamp('2022-10-10 03:00:00'), Timestamp('2022-07-28 20:00:00'), Timestamp('2022-06-21 15:00:00'), Timestamp('2022-01-28 08:00:00'), Timestamp('2022-08-19 09:00:00'), Timestamp('2022-07-14 12:00:00'), Timestamp('2022-11-13 16:00:00'), Timestamp('2021-12-14 10:00:00'), Timestamp('2022-11-11 23:00:00'), Timestamp('2022-02-10 08:00:00'), Timestamp('2022-11-30 02:00:00'), Timestamp('2023-01-31 12:00:00'), Timestamp('2022-09-19 18:00:00'), Timestamp('2022-12-23 15:00:00'), Timestamp('2021-09-25 21:00:00'), Timestamp('2021-12-07 23:00:00'), Timestamp('2023-09-05 01:00:00'), Timestamp('2022-01-06 19:00:00'), Timestamp('2022-12-17 21:00:00'), Timestamp('2022-05-22 03:00:00'), Timestamp('2023-07-28 23:00:00'), Timestamp('2022-07-29 23:00:00'), Timestamp('2022-10-03 06:00:00'), Timestamp('2022-10-01 19:00:00'), Timestamp('2021-12-22 07:00:00'), Timestamp('2022-08-05 18:00:00')

id,missing_dates,len,n,per
str,list[datetime[μs]],i64,i64,f64
"""6 # WGg8a""","[2023-04-14 08:00:00, 2022-09-10 01:00:00, … 2022-12-20 14:00:00]",19779,20151,98.153938
"""7 # WHg2-8""","[2023-04-14 08:00:00, 2022-09-10 01:00:00, … 2022-12-20 14:00:00]",18337,18551,98.846423
"""8 # WUe4""","[2023-04-14 08:00:00, 2022-09-10 01:00:00, … 2022-12-20 14:00:00]",18337,18550,98.851752
"""4 # JSe21/23""","[2023-09-15 05:00:00, 2023-04-14 08:00:00, … 2022-12-20 14:00:00]",17636,17834,98.889761
"""2 # JMe4""","[2023-09-15 05:00:00, 2023-04-14 08:00:00, … 2022-12-20 14:00:00]",17635,17815,98.989615
…,…,…,…,…
"""6 # WGg8a""","[2021-06-13 00:00:00, 2021-06-12 00:00:00, … 2021-05-29 00:00:00]",5,856,0.584112
"""3 # JOe11""",[],0,92,0.0
"""20 # SSg1-3""",[],0,606,0.0
"""17 # SFp36""",[],0,718,0.0


- the gas values hourly data also has almost as much missing as much data we have