# Solution: Translate `pandas` to `polars` code

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import random
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
import polars as pl

## Generate Data
This simulated dataset contains a time series of electrical voltage and current values for different substations.

In [None]:
# Fix the random seeds to have the same outcome when the cell is executed again
np.random.seed(42)
random.seed(42)

# Number of generated data points
n = 1_000_000  

# Generate timestamps at one-minute intervals, going backwards from the current time
now = datetime.now()
timestamps = [now - timedelta(minutes=i) for i in range(n)]
# Generate substation ids in the range of 1 to 10 (both inclusive)
substation_ids = [random.randint(1, 10) for _ in range(n)]
# Generate voltages as normal distribution with mean 230 and standard deviation 10
voltages = np.random.normal(230, 10, n)   
# Generate currents as normal distribution with mean 5 and standard deviation 2
currents = np.random.normal(5, 2, n)

# Collect all lists and create a DataFrame
df = pd.DataFrame({
    'timestamp': timestamps,
    'substation_id': substation_ids,
    'voltage': voltages,
    'current': currents
})

# Write the DataFrame to a CSV file
df.to_csv("electricity_usage.csv", index=False)

## "What is the daily average power?" - `pandas` version
Now we want to analyse the data and answer the question "What is the daily average power for each substation, considering only daily averages greater than 1000 W?". The pipeline reads the data, prepares it, calculates the power, groups by substation, resamples to daily timestamps, calculates the mean, and finally keeps only the daily averages above the threshold.
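As a rough plausibility check for the threshold: the generation code draws voltage and current independently, so the expected power is the product of the two means, 230 V × 5 A = 1150 W. The following sketch uses only the standard library; the seed and the sample size of 100,000 are arbitrary choices for illustration and are not taken from the notebook.

```python
import random
import statistics

rng = random.Random(42)  # arbitrary seed, only for reproducibility of this sketch
n_samples = 100_000      # illustrative sample size, smaller than the notebook's n

# Same distributions as in the data generation: V ~ N(230, 10), I ~ N(5, 2)
powers = [rng.gauss(230, 10) * rng.gauss(5, 2) for _ in range(n_samples)]

mean_power = statistics.fmean(powers)
print(f"simulated mean power: {mean_power:.1f} W (expected value: 1150 W)")
```

Since the expected value lies above the threshold of 1000, one would expect most daily averages to pass the filter.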

In [None]:
%%timeit
# Step 1: Read data from csv
df = pd.read_csv("electricity_usage.csv")

# Step 2: Set the correct timestamp format and use the timestamp as index
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.set_index('timestamp')

# Step 3: Calculate power (P=VI) and add it as a new column
df['power'] = df['voltage'] * df['current']

# Step 4: Group by 'substation_id', resample timestamp and calculate daily average power
df_grouped = df.groupby('substation_id').resample("D")["power"].mean()
df_grouped = df_grouped.reset_index()

# Step 5: Filter out data where daily average power is less than a certain threshold
threshold = 1000
df_grouped = df_grouped[df_grouped['power'] > threshold]

# Step 6: Sort result by substation and timestamp
df_grouped = df_grouped.sort_values(by=["substation_id", "timestamp"])

# Step 7: Write the transformed data to a new CSV file
df_grouped.to_csv("transformed_electricity_usage_pandas.csv", index=False)

## Exercise: "What is the daily average power?" - `polars` version

Translate the above pipeline from `pandas` to `polars` and compare the run times with `%%timeit`. [`%%timeit`](https://ipython.org/ipython-doc/dev/interactive/magics.html#magic-timeit) is a so-called *cell magic* in IPython (for more information read [here](https://ipython.org/ipython-doc/dev/interactive/magics.html#cell-magics)). It measures the execution time of a cell by executing it several times. Feel free to use the [`polars`](https://www.pola.rs) documentation.
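Outside of IPython, a similar measurement can be sketched with Python's standard `timeit` module. The function `work` and the repeat counts below are placeholders chosen for illustration; in the notebook, the timed code would be the whole cell.

```python
import timeit


def work():
    # Placeholder workload (hypothetical); stands in for the content of a notebook cell
    return sum(i * i for i in range(10_000))


# Call the workload 10 times per measurement and repeat the measurement 5 times
timings = timeit.repeat(work, number=10, repeat=5)

# Per-call time of the fastest of the 5 measurements
best_per_call = min(timings) / 10
print(f"best of 5: {best_per_call:.2e} s per call")
```

Repeating the measurement reduces the influence of one-off effects such as other processes competing for the CPU.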

The pipeline should include the following steps:

- Read the data from the CSV file
- Convert the timestamp strings to a datetime format
- Calculate the power
- Calculate the daily average power for each substation
- Remove all entries with a daily average power of less than 1000
- Sort the result by substation and date
- Write the result to a CSV file
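To make the intended result concrete, the steps above can be spelled out in plain Python on a handful of rows. This is only a reference sketch with hypothetical sample rows, not a replacement for the `polars` pipeline; reading and writing the CSV file are left out.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (timestamp string, substation_id, voltage, current)
rows = [
    ("2024-01-01 00:00:00.000000", 1, 230.0, 5.0),
    ("2024-01-01 12:00:00.000000", 1, 230.0, 4.0),
    ("2024-01-02 06:00:00.000000", 1, 220.0, 4.0),
    ("2024-01-01 08:00:00.000000", 2, 240.0, 5.0),
]

# Parse the timestamps, calculate the power and collect it per (substation, day)
powers = defaultdict(list)
for ts, substation_id, voltage, current in rows:
    day = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f").date()
    powers[(substation_id, day)].append(voltage * current)

# Daily average power per substation
daily_means = {key: sum(values) / len(values) for key, values in powers.items()}

# Keep only averages above the threshold and sort by substation and date
threshold = 1000
result = sorted((key, mean) for key, mean in daily_means.items() if mean > threshold)

for (substation_id, day), mean_power in result:
    print(substation_id, day, mean_power)
```

Substation 1 on 2024-01-02 has an average of 880 and is therefore removed by the filter.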

**Hint**: Polars has a `group_by_dynamic` method.


### Solution 1: "What is the daily average power?" - `polars` version, with *group_by_dynamic*

In [None]:
%%timeit
# Step 1: Read data from csv
df = pl.read_csv("electricity_usage.csv")

# Step 2: Set the correct timestamp format
df = df.with_columns(pl.col("timestamp").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S%.f"))

# Step 3: Calculate power (P=VI) and add it as a new column
df = df.with_columns((pl.col("voltage") * pl.col("current")).alias("power"))

# Step 4: Group by 'substation_id', resample timestamp and calculate daily average power
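df = df.sort("timestamp")  # group_by_dynamic requires the index column to be sorted
df_grouped = df.group_by_dynamic(
    index_column="timestamp",
    every="1d",
    closed="left",  # left-closed daily windows, matching the default of pandas' resample("D")
    by="substation_id",  # newer polars releases name this parameter `group_by`
    include_boundaries=False,
).agg(pl.col("power").mean().alias("daily_avg_power"))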

# Step 5: Filter out data where daily average power is less than a certain threshold
threshold = 1000
df_grouped = df_grouped.filter(pl.col("daily_avg_power") > threshold)

# Step 6: Sort result by substation and timestamp
df_grouped = df_grouped.sort(["substation_id", "timestamp"])

# Step 7: Write the transformed data to a new CSV file
df_grouped.write_csv("transformed_electricity_usage.csv")

### Solution 2: "What is the daily average power?" - `polars` version, with *group_by*

In [None]:
%%timeit
# Step 1: Read data from csv
df = pl.read_csv("electricity_usage.csv")

# Step 2: Set the correct timestamp format
# Parse directly to pl.Date (instead of pl.Datetime) so that a plain group_by on the date is sufficient
df = df.with_columns(pl.col("timestamp").str.strptime(pl.Date, "%Y-%m-%d %H:%M:%S%.f").alias("date"))

# Step 3: Calculate power (P=VI) and add it as a new column
df = df.with_columns((pl.col("voltage") * pl.col("current")).alias("power"))

# Step 4: Group by 'substation_id' and date, and calculate daily average power
df_grouped = df.group_by(["date", "substation_id"]).agg(pl.col("power").mean().alias("daily_avg_power"))

# Step 5: Filter out data where daily average power is less than a certain threshold
threshold = 1000
df_grouped = df_grouped.filter(pl.col("daily_avg_power") > threshold)

# Step 6: Sort result by substation and date
df_grouped = df_grouped.sort(["substation_id", "date"])

# Step 7: Write the transformed data to a new CSV file
df_grouped.write_csv("transformed_electricity_usage.csv")

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © [Point 8 GmbH](https://point-8.de)_