# Data Vis: Plotting Time Series Data
* Notebook 1: Data Preparation

## Setup

In [None]:
import numpy as np
import pandas as pd
import missingno as msno

# Data

In this notebook, we use a private dataset on the (solar) power generation and consumption of a single-family house in Germany. The dataset contains the following columns:
- `timestamp`: The date and time of the measurement. The data is recorded hourly.
- `total_consumption_kw`: The amount of power consumed per hour in kilowatts.
- `from_grid_kw`: The amount of power provided from the grid per hour in kilowatts.
- `from_pv_kw`: The amount of power generated by the solar panels per hour in kilowatts.
- `from_battery_kw`: The amount of power provided by the battery per hour in kilowatts.
- `to_grid_kw`: The amount of power provided to the grid per hour in kilowatts.
- `to_battery_kw`: The amount of power provided to the battery per hour in kilowatts.
- `battery_percent`: The average percentage of battery charge at the time of measurement.
- `battery_kwh`: The average amount of energy stored in the battery at the time of measurement in kilowatt hours.
- various weather data, including temperature, humidity, precipitation, wind speed, and solar radiation (ghi, dni, dhi).

In [None]:
data = pd.read_csv("solar.csv")

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.columns

Use the `missingno` package to visualize the missing values in the dataframe.

In [None]:
msno.bar(data)
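
The bar chart gives a visual impression of missingness. As a complement, the same information can be read off numerically with plain pandas. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from `solar.csv`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values in one column
toy = pd.DataFrame({
    "from_pv_kw": [0.0, np.nan, 1.2, np.nan],
    "to_grid_kw": [0.0, 0.1, 0.3, 0.2],
})

# Number of missing values per column
missing_counts = toy.isna().sum()

# Share of missing values per column (mean of a boolean mask)
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```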

# Pandas Datetime Properties

Check whether the `timestamp` column is in datetime format. If not, convert it using `pd.to_datetime()`. Later in this section, we will set the `timestamp` column as the index of the dataframe.

In [None]:
data.info()

In [None]:
data["timestamp"] = pd.to_datetime(data["timestamp"])
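
To make the check explicit, the dtype can be tested before and after the conversion. This is a minimal sketch on made-up strings (the `toy` frame is hypothetical). Passing `errors="coerce"` is an optional choice that turns unparseable entries into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical toy data standing in for the raw `timestamp` column
toy = pd.DataFrame({
    "timestamp": ["2025-03-01 00:00:00", "2025-03-01 01:00:00", "not a date"]
})

# Before conversion the column holds plain strings
is_datetime_before = pd.api.types.is_datetime64_any_dtype(toy["timestamp"])

# errors="coerce" replaces entries that cannot be parsed with NaT
toy["timestamp"] = pd.to_datetime(toy["timestamp"], errors="coerce")

is_datetime_after = pd.api.types.is_datetime64_any_dtype(toy["timestamp"])
n_unparsed = toy["timestamp"].isna().sum()

print(is_datetime_before, is_datetime_after, n_unparsed)
```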

Now, we can use various pandas datetime properties to extract useful information from the `timestamp`. 

In [None]:
data["timestamp"].min(), data["timestamp"].max()

In [None]:
data["timestamp"].max() - data["timestamp"].min()
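
The difference of two timestamps is a pandas `Timedelta`. One possible use, sketched here on made-up timestamps, is to compare the covered span with a complete hourly range to spot missing measurements. The hourly frequency is taken from the dataset description; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical timestamps with one hour (03:00) missing
ts = pd.Series(pd.to_datetime([
    "2025-03-01 00:00", "2025-03-01 01:00",
    "2025-03-01 02:00", "2025-03-01 04:00",
]))

# Span between first and last measurement (a pandas Timedelta)
span = ts.max() - ts.min()

# Every hour we would expect between the first and last measurement
expected = pd.date_range(ts.min(), ts.max(), freq="h")

# Expected hours that do not appear in the data
missing_hours = expected.difference(pd.DatetimeIndex(ts))

print(span)
print(missing_hours)
```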

We can use the `dt` accessor to extract the year, month, day, and hour from the timestamp. We can also create a new column that indicates whether the timestamp is on a weekend or a weekday.

In [None]:
# Calendar components of each timestamp
data["year"] = data["timestamp"].dt.year
data["month"] = data["timestamp"].dt.month_name()  # month name, e.g. "March"
data["day"] = data["timestamp"].dt.day
data["hour"] = data["timestamp"].dt.hour
data["weekday"] = data["timestamp"].dt.day_name()  # day name, e.g. "Monday"
# 1 for Saturday and Sunday, 0 otherwise
data["is_weekend"] = np.where(data["weekday"].isin(["Saturday", "Sunday"]), 1, 0)

In [None]:
data.head()
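
The weekend flag above compares day names as strings. A possible alternative, sketched here on made-up timestamps, uses the numeric `dt.dayofweek` property, which counts Monday as 0 and Sunday as 6. The names `ts` and `is_weekend` are only for illustration.

```python
import pandas as pd

# Hypothetical hourly timestamps covering Saturday, Sunday, and Monday
ts = pd.Series(pd.date_range("2025-03-01", periods=72, freq="h"))

# dayofweek is 5 for Saturday and 6 for Sunday
is_weekend = (ts.dt.dayofweek >= 5).astype(int)

print(is_weekend.value_counts())
```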

We can use the extracted information to group and aggregate the data. For example, we can group the data by weekday and calculate the average power consumption for each weekday.

In [None]:
data.groupby("weekday")["total_consumption_kw"].mean().sort_values(ascending=False)
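
Sorting by value answers which weekday has the highest consumption. If calendar order is preferred instead, for example for a later plot, the grouped result can be reindexed with an explicit weekday list. The sketch below runs on synthetic data (all values are made up).

```python
import numpy as np
import pandas as pd

# Synthetic hourly consumption for two full weeks, starting on a Monday
idx = pd.date_range("2025-03-03", periods=14 * 24, freq="h")
toy = pd.DataFrame({
    "timestamp": idx,
    "total_consumption_kw": (np.arange(len(idx)) % 24) / 10,
})
toy["weekday"] = toy["timestamp"].dt.day_name()

weekday_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# groupby sorts the names alphabetically; reindex restores calendar order
by_weekday = (
    toy.groupby("weekday")["total_consumption_kw"]
    .mean()
    .reindex(weekday_order)
)

print(by_weekday)
```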

Working with a datetime index (i.e., a `DatetimeIndex`) provides powerful functionality. For example, we no longer need the `dt` accessor: properties such as year, month, and hour are available on the index directly, and we can select rows by date strings.

In [None]:
data.set_index("timestamp", inplace=True)

In [None]:
# Select all rows in March 2025 (both endpoints are included)
data_202503 = data.loc["2025-03-01":"2025-03-31"]

In [None]:
data_202503.head()

In [None]:
data_202503.tail()
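
Selecting rows by date strings is known as partial string indexing. As a sketch on a synthetic hourly index (the `toy` frame is made up), a whole month can also be selected with a single `"YYYY-MM"` string, and a day-level end label includes every hour of that day.

```python
import pandas as pd

# Synthetic hourly index that starts before and ends after March 2025
idx = pd.date_range("2025-02-27", "2025-04-02 23:00", freq="h")
toy = pd.DataFrame({"from_pv_kw": range(len(idx))}, index=idx)

# A single month string selects every row within that month
march = toy.loc["2025-03"]

# An explicit slice; the end label covers the whole of March 31
march_slice = toy.loc["2025-03-01":"2025-03-31"]

print(march.index.min(), march.index.max(), len(march))
```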

Another useful feature is the `resample()` method, which allows us to group the data by a specific time frequency. For example, we can resample the data to daily frequency and calculate the total power generation and consumption for each day. Note that we should actually rename the columns from "_kw" to "_kwh" to indicate that the daily sums are in kilowatt hours.


In [None]:
power_cols = ["total_consumption_kw", "from_grid_kw", "from_pv_kw", "from_battery_kw", "to_grid_kw", "to_battery_kw"]
data_daily = data[power_cols].resample("D").sum()

In [None]:
data_daily.head()
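
The renaming from "_kw" to "_kwh" mentioned above could look like the following sketch. It runs on synthetic data (constant made-up values) and assumes each row is an hourly average in kilowatts, so that the sum over a day can be read as kilowatt hours.

```python
import pandas as pd

# Synthetic hourly data for two days with constant, made-up values
idx = pd.date_range("2025-03-01", periods=48, freq="h")
toy = pd.DataFrame(
    {"total_consumption_kw": 0.5, "from_pv_kw": 0.25},
    index=idx,
)

# Daily totals: 24 hourly values are summed per calendar day
daily = toy.resample("D").sum()

# Replace only a trailing "_kw" suffix; other column names stay unchanged
daily = daily.rename(
    columns=lambda c: c[:-3] + "_kwh" if c.endswith("_kw") else c
)

print(daily)
```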