# Solution: Write a query for larger-than-memory data sets

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import random
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
import polars as pl

## Generate Data
This simulated dataset contains a time series of electrical voltage and current values for different substations.

In [None]:
# Fix the random seeds to have the same outcome when the cell is executed again
np.random.seed(42)
random.seed(42)

# Number of generated data points
n = 4_000_000  

# Generate one timestamp per data point, stepping back from the current time in one-minute intervals
now = datetime.now()  # take the reference time once so that all timestamps are exactly one minute apart
timestamps = [now - timedelta(minutes=i) for i in range(n)]
# Generate substation IDs in the range of 1 to 10 (inclusive)
substation_ids = [random.randint(1, 10) for _ in range(n)]
# Generate voltages as normal distribution with mean 230 and standard deviation 10
voltages = np.random.normal(230, 10, n)   
# Generate currents as normal distribution with mean 5 and standard deviation 2
currents = np.random.normal(5, 2, n)

# Collect all lists and create a DataFrame
df = pd.DataFrame({
    'timestamp': timestamps,
    'substation_id': substation_ids,
    'voltage': voltages,
    'current': currents
})

# Write the DataFrame to a CSV file
df.to_csv("electricity_usage.csv", index=False)

## Exercise: "How can we make the code even faster?" - polars version with lazy mode
 
Now imagine that the data set is larger than the available memory. To simulate this, consider setting the number of generated data points above to e.g. 4_000_000 (not more than that, please).
Rewrite the previous `polars` code as a `polars` query/pipeline (use either your own code or the provided code below as a base).

Execute the query in lazy mode and measure the run time with `%%timeit`.


In [None]:
%%timeit
# Step 1: Read data from csv
df = pl.read_csv("electricity_usage.csv")

# Step 2: Set the correct timestamp format
df = df.with_columns(pl.col("timestamp").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S%.f"))

# Step 3: Calculate power (P=VI) and add it as a new column
df = df.with_columns((pl.col("voltage") * pl.col("current")).alias("power"))

# Step 4: Group by 'substation_id', resample timestamp and calculate daily average power
df = df.sort("timestamp")  # data has to be sorted for group_by_dynamic!
df_grouped = df.group_by_dynamic(
    index_column="timestamp",
    every="1d",
    closed="right",
    by="substation_id",
    include_boundaries=False,
).agg(pl.col("power").mean().alias("daily_avg_power"))

# Step 5: Filter out data where daily average power is less than a certain threshold
threshold = 1000
df_grouped = df_grouped.filter(pl.col("daily_avg_power") > threshold)

# Step 6: Sort result by substation and timestamp
df_grouped = df_grouped.sort(["substation_id", "timestamp"])

# Step 7: Write the transformed data to a new CSV file
df_grouped.write_csv("transformed_electricity_usage.csv")

### Solution: "How can we make the code even faster?" - polars version with lazy mode

In [None]:
%%timeit
threshold = 1000

# Define lazy query/pipeline
q = (
    pl.scan_csv("electricity_usage.csv")
    .with_columns(pl.col("timestamp").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S%.f"))
    .with_columns((pl.col("voltage") * pl.col("current")).alias("power"))
    .sort("timestamp")
    .group_by_dynamic(
        index_column="timestamp",
        every="1d",
        closed="right",
        by="substation_id",
        include_boundaries=False,
    ).agg(pl.col("power").mean().alias("daily_avg_power"))
    .filter(pl.col("daily_avg_power") > threshold)
    .sort(["substation_id", "timestamp"])
)

# Optionally test the pipeline on a reduced amount of data
# df_test = q.fetch(n_rows=100)

# Collect the data
df = q.collect(streaming=True)  # set streaming=True if the data might not fit into memory

# Write the transformed data to a new CSV file
df.write_csv("transformed_electricity_usage.csv")


### For plotting and further data exploration, convert to a `pandas.DataFrame`

In [None]:
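# In case you want a pandas DataFrame for plotting etc.
# Variables assigned inside a %%timeit cell are not kept afterwards,
# so load the result from the CSV file written by the query above.
df_pandas = pl.read_csv("transformed_electricity_usage.csv").to_pandas()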

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © [Point 8 GmbH](https://point-8.de)_