# Introduction to `polars`: Translate `pandas` to `polars` code

[`polars`](https://www.pola.rs) is a lightning-fast DataFrame library for Rust and Python. It is an alternative to `pandas` mainly for developing data processing pipelines, because as of late 2023 plotting libraries such as matplotlib or plotly have no real support for it.

It is written in Rust and uses the Apache Arrow data format in the background, whereas pandas is built on numpy. As of [`pandas 2.0`](https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html), pandas can use the Arrow data format as well, but it is not the default.

`polars` can be 5 to 10 times faster than pandas and provides "lazy execution" when reading data and executing queries. Use `polars` when you have larger-than-memory data sets or when you develop a pipeline for data processing and filtering.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import random
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
import polars as pl

## Generate Data
This simulated dataset contains a time series of electrical voltage and current values for different substations.

In [None]:
# Fix the random seeds to have the same outcome when the cell is executed again
np.random.seed(42)
random.seed(42)

# Number of generated data points
n = 1_000_000

# Generate timestamps by stepping back from the current time, one minute per data point
timestamps = [datetime.now() - timedelta(minutes=i) for i in range(n)]
# Generate substation ids in the range of 1 to 10
substation_ids = [random.randint(1, 10) for _ in range(n)]
# Draw voltages from a normal distribution with mean 230 and standard deviation 10
voltages = np.random.normal(230, 10, n)
# Draw currents from a normal distribution with mean 5 and standard deviation 2
currents = np.random.normal(5, 2, n)

# Collect all lists and create a DataFrame
df = pd.DataFrame(
    {
        "timestamp": timestamps,
        "substation_id": substation_ids,
        "voltage": voltages,
        "current": currents,
    }
)

# Write the DataFrame to a CSV file
df.to_csv("electricity_usage.csv", index=False)

In [None]:
df.head()

## "What is the daily average power?" - `pandas` version
Now we want to analyse the data and answer the question "What is the daily average power for each substation, considering only daily averages greater than 1000 W?". This involves reading the data, preparing it, calculating the power, grouping by substation, resampling to daily timestamps, calculating the mean, and finally filtering for powers greater than the threshold.

In [None]:
%%timeit
# Step 1: Read data from csv
df = pd.read_csv("electricity_usage.csv")

# Step 2: Set the correct timestamp format and use the timestamp as index
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.set_index(df['timestamp'])

# Step 3: Calculate power (P=VI) and add it as a new column
df['power'] = df['voltage'] * df['current']

# Step 4: Group by 'substation_id', resample timestamp and calculate daily average power
df_grouped = df.groupby('substation_id').resample("D")["power"].mean()
df_grouped = df_grouped.reset_index()

# Step 5: Keep only rows where the daily average power exceeds the threshold
threshold = 1000
df_grouped = df_grouped[df_grouped['power'] > threshold]

# Step 6: Sort result by substation and timestamp
df_grouped = df_grouped.sort_values(by=["substation_id", "timestamp"])

# Step 7: Write the transformed data to a new CSV file
df_grouped.to_csv("transformed_electricity_usage_pandas.csv", index=False)

## Exercise: "What is the daily average power?" - `polars` version

Translate the above steps/pipeline from `pandas` to `polars` and compare run times with `%%timeit`. [`%%timeit`](https://ipython.org/ipython-doc/dev/interactive/magics.html#magic-timeit) is known as *cell magic* in IPython (for more information read [here](https://ipython.org/ipython-doc/dev/interactive/magics.html#cell-magics)). It measures the execution time of the cell by executing it more than once. Feel free to use the [`polars`](https://www.pola.rs) documentation.
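
To get a feeling for what "executing it more than once" means, here is a rough analogue using the standard library `timeit` module. It is only a sketch of the idea, not what IPython does internally; the function `square_sum` and the chosen repeat counts are made up for this example.

```python
import timeit


def square_sum(n=1_000):
    """Small stand-in workload (hypothetical, for illustration only)."""
    return sum(i * i for i in range(n))


runs_per_measurement = 50

# Each measurement calls the function 50 times; the measurement is repeated 5 times
timings = timeit.repeat(square_sum, number=runs_per_measurement, repeat=5)

# Dividing by the number of calls gives the time per single call
best_per_call = min(timings) / runs_per_measurement
print(f"best of {len(timings)} measurements: {best_per_call * 1e6:.1f} µs per call")
```

Repeating the measurement smooths out noise from other processes, which is why a single timed run is rarely a reliable comparison between `pandas` and `polars`.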

The pipeline should include the following steps:

- Read the data from the CSV file
- Convert the date string to a datetime format
- Calculate the power
- Calculate the daily average power for each substation
- Remove all entries with a power value less than 1000
- Sort the result by substation and date
- Write the result to a CSV file

**Hint**: Polars has a `group_by_dynamic` method.


In [None]:
# Your code goes here!



---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © [Point 8 GmbH](https://point-8.de)_