# Accelerating data operations

Dataset - BlocPower

This notebook explores ways to speed up and optimize work with large dataframes in Python. Because compute and storage are limited, we need ways to:
- make operations faster

- consume less memory and/or handle data larger than memory

Three methods are tried here:

- Pandas: Regular workflow

- Dask: Flexible library for parallel computing in Python

- Polars: Pandas alternative written in Rust that uses the Apache Arrow columnar memory format

The first step is common to all three methods: **define a SQL query in the Python notebook** that refers to our chosen data. This is a distinct change from our usual approach, where we pre-add a dataset to the project or through the visual GUI, transform/filter it, and then import it into the notebook.


## Results:

For this limited use case, we loop through a list of state codes (here NY and RI), read in the data, count the missing values in each variable, append the dataframes together, and perform a simple groupby to calculate the mean Energy Use Intensity per county.

The time and memory taken by each library are given below.

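| Library | Time   | Memory   |
|---------|--------|----------|
| Pandas  | 36.06s | 3000 MB  |
| Dask    | 75.33s | 64 bytes |
| Polars  | 38.66s | 48 bytes |

The byte-level figures are `sys.getsizeof` readings taken on a Dask `Delayed` handle (the Polars measurement also appears to have been taken on `delayed_df`), so they reflect the size of a small wrapper object, not the data held in memory. They are not directly comparable with the Pandas figure.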

According to the libraries' documentation, the gap widens as datasets get larger, with Pandas performing much worse.
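To illustrate why `sys.getsizeof` can report only a few dozen bytes for a lazy object, here is a minimal standard-library sketch. `LazyResult` and `load_rows` are hypothetical stand-ins invented for this example (they are not Dask classes); the point is only that the handle stays tiny no matter how much data it will eventually produce.

```python
import sys


class LazyResult:
    """Hypothetical stand-in for a lazy handle such as a Dask Delayed object."""

    __slots__ = ("_loader",)

    def __init__(self, loader):
        # Store the function to call later; no data is loaded here.
        self._loader = loader

    def compute(self):
        # Only now is the data actually materialized.
        return self._loader()


def load_rows(n=100_000):
    # Hypothetical loader standing in for a query download.
    return list(range(n))


lazy = LazyResult(load_rows)
handle_bytes = sys.getsizeof(lazy)   # size of the wrapper only

data = lazy.compute()
data_bytes = sys.getsizeof(data)     # shallow size of the materialized list

print(f"handle: {handle_bytes} bytes, materialized list: {data_bytes} bytes")
```

Even the second number is a shallow measurement (it excludes the integers the list points to), so memory comparisons are more meaningful when each library's own size-reporting method is used on fully loaded data.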


In [None]:
# Full documentation available at:
# https://apidocs.redivis.com/client-libraries/redivis-python

# Install Polars first if it is not already available in the environment
!conda install -y polars

import sys
import time

import dask.dataframe as dd
import pandas as pd
import polars as pl
import redivis
from dask.delayed import delayed


# Dask
## Delayed
Dask Delayed is a way to build Dask task graphs for parallel computing. You describe the operations that need to be performed as a graph of tasks, and Dask then executes that graph in parallel.

Each computation in the graph is represented by a delayed object: a wrapper around a Python function call that has not been executed yet. Building the graph is lazy, so the function call only runs when you trigger the computation with the compute() method.

Because the tasks can be executed independently, the whole graph can be run in parallel once it has been constructed, by calling dask.compute() on the delayed objects.

Dask Delayed is useful when a large computation can be broken down into smaller, independent computations. Instead of loading the entire dataset into memory, you can process it in smaller chunks, which can be more efficient. It also lets you parallelize across multiple cores or machines, which can dramatically speed up your analysis.
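As a small illustration of the lazy task graph described above, the sketch below wraps two toy functions with `dask.delayed`. The functions `load` and `total` are made up for this example and stand in for real loading and aggregation steps; nothing runs until `compute()` is called.

```python
from dask import delayed


@delayed
def load(n):
    # Toy stand-in for reading one chunk of data.
    return list(range(n))


@delayed
def total(values):
    # Toy stand-in for a per-chunk aggregation.
    return sum(values)


# Building the graph: each call returns a Delayed object, no work happens yet.
partials = [total(load(n)) for n in (10, 100, 1000)]
grand_total = delayed(sum)(partials)

# Executing the graph: independent chunks can run in parallel.
result = grand_total.compute()
print(result)
```

The three `load`/`total` chains do not depend on each other, which is what allows the scheduler to run them concurrently.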



The delayed() function is used to create a delayed object for each query result. The delayed objects are collected in a list, dfs. Finally, the from_delayed() function is used to create the Dask dataframe dask_df from that list. Note that you can also use Dask's distributed computing capabilities to read the data in parallel, by setting up a Dask cluster through the dask.distributed module and submitting the delayed objects to it.

In [None]:

st1 = time.time()
states = ['NY', 'RI']
dfs = []
for state in states:
    query = redivis.query(f"""
        SELECT *
        FROM EIDC.blocpower_active.blocpower_core
        WHERE state = '{state}'
        """)
    # Wrap the download in a Dask Delayed object; nothing runs yet
    delayed_df = delayed(query.to_dataframe)()
    # NOTE: sys.getsizeof measures only the Delayed wrapper, not the data it will load
    print(f"Memory consumed by DASK Delayed DF for {state}: {sys.getsizeof(delayed_df)} bytes")
    # .compute() runs the query here to count missing values per column
    missing_values_count = delayed_df.isna().sum().compute()
    print("######################################")
    print(f"Missingness for {state}")
    print("-------------------------------------")
    print(missing_values_count)
    print("######################################")

    dfs.append(delayed_df)

# Build one Dask dataframe with one partition per state
dask_df = dd.from_delayed(dfs)

# Mean energy per county; this compute runs each delayed query again
dfcounty = dask_df.groupby("county").total_source_energy_GJ.mean().compute()

et1 = time.time()

dur1 = et1 - st1
print("time Dask -", dur1)

dask_df.head(1)
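One likely contributor to the longer Dask timing is that the cell above calls `.compute()` on the same delayed objects more than once (once per state for the missing-value count, then again for the groupby), and each call re-runs the underlying load. The sketch below shows the effect with a toy loader that records how often it runs; `load_rows`, `count_missing`, and `mean_present` are hypothetical helpers written for this example. Passing several results to a single `dask.compute()` call lets them share the loading step.

```python
import dask
from dask import delayed

calls = []  # records every time the expensive loader actually runs


def load_rows():
    # Toy stand-in for an expensive query download.
    calls.append("load")
    return [1.0, None, 3.0]


def count_missing(rows):
    return sum(value is None for value in rows)


def mean_present(rows):
    present = [value for value in rows if value is not None]
    return sum(present) / len(present)


rows = delayed(load_rows)()
missing = delayed(count_missing)(rows)
mean = delayed(mean_present)(rows)

# Two separate compute() calls: the loader runs once for each call.
missing.compute()
mean.compute()
separate_calls = len(calls)

# One combined dask.compute() call: the shared loader task runs only once.
calls.clear()
missing_value, mean_value = dask.compute(missing, mean)
shared_calls = len(calls)

print(separate_calls, shared_calls, missing_value, mean_value)
```

This is a sketch of the general idea, not a measured explanation of the timings reported in this notebook.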

# Pandas

In [None]:
st2 = time.time()
# The Results section reports NY and RI; this cell is set to DC
states = ['DC']
dfs = []
for state in states:
    query = redivis.query(f"""
        SELECT *
        FROM EIDC.blocpower_active.blocpower_core
        WHERE state = '{state}'
        """)
    # Download the full query result into a pandas dataframe
    df = query.to_dataframe()
    size_in_mb = sys.getsizeof(df) / (1024**2)
    print(f"Memory consumed by Pandas DF for {state}: {size_in_mb:.2f} MB")

    # Count missing values per column
    missing_values_count = df.isna().sum()

    print("######################################")
    print(f"Missingness for {state}")
    print("-------------------------------------")
    print(missing_values_count)
    dfs.append(df)

# Stack the per-state dataframes into one
pandas_df = pd.concat(dfs, ignore_index=True)

# Mean energy per county
dfcounty = pandas_df.groupby('county').total_source_energy_GJ.mean()

et2 = time.time()

dur2 = et2 - st2

print("time pandas -", dur2)
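For a per-column view of where the memory goes, pandas also offers `DataFrame.memory_usage`. The sketch below uses a small made-up dataframe (the values are invented for illustration) and compares the shallow and deep totals; `deep=True` also counts the contents of object (string) columns.

```python
import pandas as pd

# Made-up data standing in for a query result.
df = pd.DataFrame({
    "county": ["Kings", "Albany", "Kent"] * 1000,
    "total_source_energy_GJ": [1.5, 2.5, 3.5] * 1000,
})

# Shallow: size of the underlying arrays only.
shallow_bytes = int(df.memory_usage(deep=False).sum())

# Deep: additionally inspects the Python objects held in object columns.
deep_bytes = int(df.memory_usage(deep=True).sum())

# Per-column breakdown in MB, including the index.
per_column_mb = df.memory_usage(deep=True) / (1024**2)

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
print(per_column_mb)
```

The per-column breakdown makes it easier to spot which variables dominate memory, typically the string columns.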

In [None]:
st2=time.time()
states = ['DC']
dfs = []
for state in states:
    query = redivis.query(f"""
        SELECT * 
        FROM EIDC.blocpower_active.blocpower_core
        WHERE state = '{state}'
        """)
    df = query.to_dataframe()
    size_in_mb = sys.getsizeof(df)/ (1024**2)
    print(f"Memory consumed by Pandas DF for {state}: {size_in_mb:.2f} MB")
    
    missing_values_count = df.isna().sum()
    
    uniques = df.nunique()

    print("######################################")
    print(f"Missingness for {state}")
    print("-------------------------------------")
    print(missing_values_count)
    print("-------------------------------------")
    print(uniques)
    dfs.append(df)
    
pandas_df = pd.concat(dfs, ignore_index=True)


et2=time.time()

dur2 = et2-st2

print("time pandas -", dur2)

In [None]:
pandas_df.shape

# Polars

In [None]:
#https://towardsdatascience.com/understanding-groupby-in-polars-dataframe-by-examples-1e910e4095b3

st2=time.time()

states = ['NY', 'RI']
dfs = []
for state in states:
    query = redivis.query(f"""
        SELECT * 
        FROM EIDC.blocpower_active.blocpower_core
        WHERE state = '{state}'
        """)
    #CONVERT TO PANDAS
    df = query.to_dataframe()
    
    #CONVERT TO POLARS
    df = pl.from_pandas(df)
    #GET SIZE IN MEMORY (estimated size of the Polars dataframe itself)
    print(f"Memory consumed by Polars DF for {state}: {df.estimated_size()} BYTES ")
    
    null_count_df=df.null_count().to_pandas()
    
    print(null_count_df)
    dfs.append(df)
    
polars_df = pl.concat(dfs)

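# Lazy query: mean energy per county
q = (
    polars_df
    .lazy()
    .group_by('county')  # named `groupby` in Polars versions before 0.19
    .agg(
        [
            pl.col('total_source_energy_GJ').mean().alias('mean_energy'),
        ]
    )
)

# Run the query; keep polars_df intact and store the result separately
dfcounty = q.collect()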

# q = (
#     polars_df.lazy()
#     .groupby("county")
#     .agg(mean_energy=('total_source_energy_GJ', pl.mean()))
# )

et2=time.time()

dur2 = et2-st2
print("time polars -", dur2)


In [None]:

states = ["DC", "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", 
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
dfs = []
for state in states:
    query = redivis.query(f"""
        SELECT * 
        FROM EIDC.blocpower_active.blocpower_core
        WHERE state = '{state}'
        """)
    #CONVERT TO PANDAS
    df = query.to_dataframe()
    
    #CONVERT TO POLARS
    df = pl.from_pandas(df)
    #GET SIZE IN MEMORY (estimated size of the Polars dataframe itself)
    print(f"Memory consumed by Polars DF for {state}: {df.estimated_size()} BYTES ")
    
    null_count_df=df.null_count().to_pandas()
    
    print(null_count_df)
    dfs.append(df)
    
polars_df = pl.concat(dfs)
