# Kinergy Data

This notebook analyzes the Kinergy dataset with regard to:

- number of datapoints and sensors
- location of the sensors
- recording runtime of each sensor and the average runtime across sensors
- data preprocessing steps that are needed


In [43]:
import json
from datetime import datetime

import pandas as pd
import polars as pl

In [23]:
kinergy_data_folder = '../kinergy/kinergy_data/consumption_data'
file_path = '/home/marja/PycharmProjects/energy-forecast-wahl/kinergy/kinergy_data/consumption_data/1a9266de-dfff-11eb-9d61-02b402f0c1de_consumption.csv'
eco_u_data_file = '/home/marja/PycharmProjects/energy-forecast-wahl/kinergy/kinergy_data/kinergy_eco_u_list.json'

In [24]:
# Load the per-building metadata (name, begin and end of the valid recording window)
with open(eco_u_data_file, "r", encoding="UTF-8") as f:
    eco_u_data = json.load(f)

## Count the datapoints

In [83]:
df = pl.read_csv(file_path)
df

bucket,7d6b8c0c-79d0-4d38-b2f5-9cb830a2df7f,sum_kwh,sum_kwh_diff,env_temp
str,str,f64,f64,f64
"""2021-07-08T15:30:00.000Z""",,0.0,,21.49
"""2021-07-08T16:00:00.000Z""",,0.0,0.0,21.95
"""2021-07-08T16:30:00.000Z""",,0.0,0.0,19.42
"""2021-07-08T17:31:00.000Z""",,0.0,0.0,19.93
"""2021-07-08T18:31:00.000Z""",,0.0,0.0,18.69
…,…,…,…,…
"""2023-09-12T15:50:00.000Z""","""545247.0""",545247.0,0.0,29.45
"""2023-09-12T15:51:00.000Z""","""545247.0""",545247.0,0.0,29.45
"""2023-09-12T15:52:00.000Z""","""545247.0""",545247.0,0.0,29.45
"""2023-09-12T15:53:00.000Z""","""545247.0""",545247.0,0.0,29.45


Add a `datetime` column: parse `bucket` and reformat it without the milliseconds and the `Z` suffix

In [84]:
df = df.with_columns(
    pl.col("bucket")
    .str.strptime(pl.Datetime, format="%Y-%m-%dT%H:%M:%S%.fZ")
    .dt.strftime("%Y-%m-%dT%H:%M:%S")
    .alias("datetime")
)
df

bucket,7d6b8c0c-79d0-4d38-b2f5-9cb830a2df7f,sum_kwh,sum_kwh_diff,env_temp,datetime
str,str,f64,f64,f64,str
"""2021-07-08T15:30:00.000Z""",,0.0,,21.49,"""2021-07-08T15:30:00"""
"""2021-07-08T16:00:00.000Z""",,0.0,0.0,21.95,"""2021-07-08T16:00:00"""
"""2021-07-08T16:30:00.000Z""",,0.0,0.0,19.42,"""2021-07-08T16:30:00"""
"""2021-07-08T17:31:00.000Z""",,0.0,0.0,19.93,"""2021-07-08T17:31:00"""
"""2021-07-08T18:31:00.000Z""",,0.0,0.0,18.69,"""2021-07-08T18:31:00"""
…,…,…,…,…,…
"""2023-09-12T15:50:00.000Z""","""545247.0""",545247.0,0.0,29.45,"""2023-09-12T15:50:00"""
"""2023-09-12T15:51:00.000Z""","""545247.0""",545247.0,0.0,29.45,"""2023-09-12T15:51:00"""
"""2023-09-12T15:52:00.000Z""","""545247.0""",545247.0,0.0,29.45,"""2023-09-12T15:52:00"""
"""2023-09-12T15:53:00.000Z""","""545247.0""",545247.0,0.0,29.45,"""2023-09-12T15:53:00"""


Remove unnecessary columns

In [85]:
df = df.select(["datetime", "sum_kwh", "sum_kwh_diff", "env_temp"])
df

datetime,sum_kwh,sum_kwh_diff,env_temp
str,f64,f64,f64
"""2021-07-08T15:30:00""",0.0,,21.49
"""2021-07-08T16:00:00""",0.0,0.0,21.95
"""2021-07-08T16:30:00""",0.0,0.0,19.42
"""2021-07-08T17:31:00""",0.0,0.0,19.93
"""2021-07-08T18:31:00""",0.0,0.0,18.69
…,…,…,…
"""2023-09-12T15:50:00""",545247.0,0.0,29.45
"""2023-09-12T15:51:00""",545247.0,0.0,29.45
"""2023-09-12T15:52:00""",545247.0,0.0,29.45
"""2023-09-12T15:53:00""",545247.0,0.0,29.45


Remove the block of zero readings at the beginning by restricting the data to the `begin`/`end` window from the metadata

In [86]:
start_date: str = eco_u_data["1a9266de-dfff-11eb-9d61-02b402f0c1de"]["begin"]
start_datetime = pd.to_datetime(start_date).strftime("%Y-%m-%dT%H:%M:%S")
start_datetime

'2023-03-01T00:00:00'

In [87]:
end_date: str = eco_u_data["1a9266de-dfff-11eb-9d61-02b402f0c1de"]["end"]
end_datetime = pd.to_datetime(end_date).strftime("%Y-%m-%dT%H:%M:%S")
end_datetime

'2023-05-01T00:00:00'
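
One caveat about the two cells above: `strftime` formats the wall-clock time of the parsed timestamp and drops any UTC offset, while the `bucket` timestamps end in `Z` (UTC). If the `begin`/`end` strings carry an offset, the window boundaries would be shifted by that offset. The following is a minimal standard-library sketch of normalizing to UTC first. The input string is hypothetical, because the raw `begin` value is not shown in this notebook, and `to_utc_string` is an illustrative helper, not part of the project code.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%dT%H:%M:%S"


def to_utc_string(iso_timestamp: str) -> str:
    """Parse an ISO 8601 timestamp and format it as a UTC string without offset."""
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).strftime(FMT)


# Hypothetical metadata value with a +02:00 offset
example_begin = "2023-03-01T00:00:00+02:00"
utc_begin = to_utc_string(example_begin)
print(utc_begin)  # 2023-02-28T22:00:00
```

Whether the window should be interpreted in local time or in UTC is a decision for the dataset owners; the sketch only makes the difference visible.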

In [68]:
duration = datetime.strptime(end_datetime, "%Y-%m-%dT%H:%M:%S") - datetime.strptime(start_datetime, "%Y-%m-%dT%H:%M:%S")
duration.days

61
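
The filter in the next cell compares the `datetime` strings directly instead of parsing them. This works because all values share one fixed-width, zero-padded ISO 8601 layout, so lexicographic order equals chronological order. A small sketch with made-up timestamps around the window boundaries illustrates this, and also shows that `<=` keeps a row lying exactly on the end boundary if one exists.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"

# Illustrative timestamps around the window boundaries (not taken from the file)
stamps = [
    "2023-02-28T23:59:00",
    "2023-03-01T00:00:00",
    "2023-04-30T23:59:00",
    "2023-05-01T00:00:00",
    "2023-05-01T00:01:00",
]
start, end = "2023-03-01T00:00:00", "2023-05-01T00:00:00"

# Comparison on the raw strings, as the polars filter does
by_string = [s for s in stamps if start <= s <= end]

# Same comparison on parsed datetime objects
start_dt, end_dt = datetime.strptime(start, FMT), datetime.strptime(end, FMT)
by_datetime = [s for s in stamps if start_dt <= datetime.strptime(s, FMT) <= end_dt]

print(by_string == by_datetime)  # True
print(by_string)
```

The equivalence only holds while every string uses the same format and timezone; mixing in offsets or differing precision would break it.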

In [70]:
df = df.filter(
    (pl.col("datetime") >= start_datetime) & (pl.col("datetime") <= end_datetime)
)
df

datetime,sum_kwh,sum_kwh_diff,env_temp
str,f64,f64,f64
"""2023-03-01T00:00:00""",446950.0,0.0,-1.59
"""2023-03-01T00:01:00""",446950.0,0.0,-2.11
"""2023-03-01T00:02:00""",446950.0,0.0,-2.11
"""2023-03-01T00:03:00""",446950.0,0.0,-2.11
"""2023-03-01T00:04:00""",446950.0,0.0,-2.11
…,…,…,…
"""2023-04-30T23:56:00""",500949.0,0.0,10.02
"""2023-04-30T23:57:00""",500949.0,0.0,10.02
"""2023-04-30T23:58:00""",500949.0,0.0,10.02
"""2023-04-30T23:59:00""",500949.0,0.0,10.02


Aggregate to hourly data (mean temperature and summed consumption per hour)

In [72]:
df = df.with_columns(
    pl.col("datetime")
    .str.strptime(pl.Datetime, format="%Y-%m-%dT%H:%M:%S")
    .dt.strftime("%Y-%m-%dT%H")
    .alias("datetime_hour")
)
df

datetime,sum_kwh,sum_kwh_diff,env_temp,datetime_hour
str,f64,f64,f64,str
"""2023-03-01T00:00:00""",446950.0,0.0,-1.59,"""2023-03-01T00"""
"""2023-03-01T00:01:00""",446950.0,0.0,-2.11,"""2023-03-01T00"""
"""2023-03-01T00:02:00""",446950.0,0.0,-2.11,"""2023-03-01T00"""
"""2023-03-01T00:03:00""",446950.0,0.0,-2.11,"""2023-03-01T00"""
"""2023-03-01T00:04:00""",446950.0,0.0,-2.11,"""2023-03-01T00"""
…,…,…,…,…
"""2023-04-30T23:56:00""",500949.0,0.0,10.02,"""2023-04-30T23"""
"""2023-04-30T23:57:00""",500949.0,0.0,10.02,"""2023-04-30T23"""
"""2023-04-30T23:58:00""",500949.0,0.0,10.02,"""2023-04-30T23"""
"""2023-04-30T23:59:00""",500949.0,0.0,10.02,"""2023-04-30T23"""


In [75]:
df = df.group_by("datetime_hour").agg([
    pl.col("env_temp").mean().alias("avg_env_temp"),
    pl.col("sum_kwh_diff").sum().alias("total_kwh_diff")
]).sort(by="datetime_hour")
df

datetime_hour,avg_env_temp,total_kwh_diff
str,f64,f64
"""2023-03-01T00""",-2.111,38.0
"""2023-03-01T01""",-2.381,41.0
"""2023-03-01T02""",-2.679667,28.0
"""2023-03-01T03""",-2.670333,27.0
"""2023-03-01T04""",-2.665,28.0
…,…,…
"""2023-04-30T20""",12.814333,18.0
"""2023-04-30T21""",11.7685,18.0
"""2023-04-30T22""",11.484167,20.0
"""2023-04-30T23""",9.953443,23.0


Overall statistics

In [82]:
# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"Stats for {eco_u_data['1a9266de-dfff-11eb-9d61-02b402f0c1de']['name']}")
print(f"All datapoints: {len(df)}")
print(f"Start date: {start_datetime}")
print(f"End date: {end_datetime}")
print(f"Duration: {duration.days} days")

Stats for Friedrichstraße 21-25
All datapoints: 1464
Start date: 2023-03-01T00:00:00
End date: 2023-05-01T00:00:00
Duration: 61 days
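
As a quick sanity check on the numbers above, the row count after hourly aggregation can be compared with the number of hours in the window. The values below are copied from the printed statistics; the variable names are only for illustration.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M:%S"

start = datetime.strptime("2023-03-01T00:00:00", FMT)
end = datetime.strptime("2023-05-01T00:00:00", FMT)

# Number of full hours in the half-open window [start, end)
expected_hours = int((end - start) / timedelta(hours=1))

observed_rows = 1464  # "All datapoints" from the output above
missing_hours = expected_hours - observed_rows

print(expected_hours, missing_hours)  # 1464 0
```

A difference of zero means every hour of the window contains at least one record. It does not rule out missing minutes inside an hour.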


## Further data preprocessing steps needed

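- check that the sampling interval is actually constant (minutely, half-hourly or hourly) and that there are no gaps in the data; the first rows of the raw file already show irregular steps (16:30 to 17:31)
- check whether the `begin` datetime in the metadata ends with a `+02:00` offset and, if so, how to handle it against the UTC (`Z`) timestamps in `bucket`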
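
The first check could be sketched as follows, using only the standard library. `find_gaps` is a hypothetical helper, and the expected step of 30 minutes is an assumption for the early part of the file; the sample timestamps are the first five rows shown at the top of this notebook.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M:%S"


def find_gaps(timestamps, expected_step):
    """Return (previous, current, delta) for consecutive timestamps
    that are further apart than expected_step."""
    parsed = sorted(datetime.strptime(t, FMT) for t in timestamps)
    gaps = []
    for prev, cur in zip(parsed, parsed[1:]):
        delta = cur - prev
        if delta > expected_step:
            gaps.append((prev, cur, delta))
    return gaps


sample = [
    "2021-07-08T15:30:00",
    "2021-07-08T16:00:00",
    "2021-07-08T16:30:00",
    "2021-07-08T17:31:00",
    "2021-07-08T18:31:00",
]

gaps = find_gaps(sample, timedelta(minutes=30))
for prev, cur, delta in gaps:
    print(f"{prev.strftime(FMT)} -> {cur.strftime(FMT)}: {delta}")
```

For the sample this reports two steps longer than 30 minutes (61 and 60 minutes). On the full dataset the same idea would probably be expressed with a polars `diff` on a parsed datetime column, which is left out here.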