# Dask Acceleration vs Pandas

This notebook explores speed advantages realized by using the Dask library over Pandas for dataframe operations in Python.

| Tech | Version |
| --- | --- |
| Python | 3.6.9 |
| jupyter-notebook | 6.0.1 |
| pandas | 1.0.3 |
| dask | 2.17.2 |

Hat tip to Saturn Cloud's [Your Practical Guide to Dask](https://www.saturncloud.io/s/practical-guide-to-dask/).

In [1]:
# importing required libraries
import random
import pandas as pd
import os # for directory operations

__Randomize stock data in Python and save the data as a Pandas dataframe.__

In [2]:
# instantiating 1M random stock records
num_rows = 1000000

symbols = ["AAPL", "AMD", "GOOG", "MSFT", "NVDA"]
prices = [random.randint(1, 500) for _ in range(50)]

In [3]:
def get_stock_data(symbols, prices):
    '''
    Generate one random stock record by drawing a symbol from the
    `symbols` list and a price from the randomized `prices` list.
    '''
    return {"symbol": random.choice(symbols),
            "price": random.choice(prices)}

In [4]:
# using the function to generate stock data for the
# number of rows instantiated in `num_rows`
stock_data = [get_stock_data(symbols, prices) for _ in range(num_rows)]

# instantiate data as a pandas dataframe
stock_df = pd.DataFrame(stock_data,
                        columns=["symbol", "price"])
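Because the records above are drawn without a fixed seed, each run of the notebook benchmarks different data. A minimal sketch of a seeded variant follows; the helper name `make_stock_rows`, its `seed` parameter, and the price list used here are hypothetical and not part of the original notebook.

```python
import random

def make_stock_rows(n, symbols, prices, seed=42):
    """Return n random {"symbol", "price"} records, reproducible for a given seed."""
    rng = random.Random(seed)  # private generator; the global RNG is left untouched
    return [{"symbol": rng.choice(symbols), "price": rng.choice(prices)}
            for _ in range(n)]

symbols = ["AAPL", "AMD", "GOOG", "MSFT", "NVDA"]
prices = list(range(1, 501))  # assumption: any price list works for this sketch

rows_a = make_stock_rows(5, symbols, prices)
rows_b = make_stock_rows(5, symbols, prices)
print(rows_a == rows_b)  # the same seed yields the same records
```

The resulting list can be passed to `pd.DataFrame(...)` exactly as `stock_data` is above.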

__Export stock data to a csv file.__

In [5]:
# save `stock_df` as a csv file in the data subdirectory
if not os.path.exists('data'):
    os.makedirs('data')

# prefix filename with '_rc_' so that .gitignore excludes it
stock_df.to_csv("data/_rc_stock_data.csv")

# preview the dataframe
stock_df.head()

|  | symbol | price |
| --- | --- | --- |
| 0 | AAPL | 229 |
| 1 | GOOG | 47 |
| 2 | AAPL | 213 |
| 3 | AMD | 325 |
| 4 | NVDA | 276 |


## Load csv Data to a Dask Dataframe

In [6]:
import dask.dataframe as dd

# loading csv data to a dask dataframe
dask_df = dd.read_csv("data/_rc_stock_data.csv")

__Repartition the Dask dataframe into partitions of 100 MB or less, per the [official documentation](https://docs.dask.org/en/latest/dataframe-best-practices.html#repartition-to-reduce-overhead).__

In [7]:
# repartitioning dask dataframe from csv
dask_df = dask_df.repartition(partition_size="100MB")

# loading pandas dataframe from csv
pandas_df = pd.read_csv("data/_rc_stock_data.csv")
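To get a feel for what a 100 MB partition target implies, the arithmetic can be sketched as a ceiling division. This is only a rough estimate under stated assumptions: `estimate_partitions` is a hypothetical helper, "100MB" is taken to mean 100,000,000 bytes, and the byte counts are made-up examples; Dask's own repartitioning works from in-memory sizes and may choose differently.

```python
import math

def estimate_partitions(total_bytes, target_bytes=100_000_000):
    """Smallest partition count that keeps each partition at or below target_bytes."""
    return max(1, math.ceil(total_bytes / target_bytes))

# illustrative sizes only, not measurements from this notebook
print(estimate_partitions(15_000_000))   # 1
print(estimate_partitions(250_000_000))  # 3
```

By this estimate, a dataset well under the target fits in a single partition, in which case Dask has little work to parallelize.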

## Compute CPU Acceleration with Dask vs Pandas

__Calculate Means__

In [8]:
from timeit import default_timer as timer

# computing and comparing task execution
start = timer()

# Calculate mean using pandas
pandas_df["price"].mean()

end = timer()

# print time elapsed in seconds
print("pandas execution time: ", round(end - start, 5))
print("-"*30)

start = timer()

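# Calculate mean using Dask
# NOTE: Dask is lazy, so this line only builds the task graph;
# append .compute() to actually run the calculation
dask_df["price"].mean()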

end = timer()

# print time elapsed in seconds
print("\ndask execution time: ", round(end - start, 5))

pandas execution time:  0.00275
------------------------------

dask execution time:  0.00224
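Dask dataframe expressions are lazy: `dask_df["price"].mean()` returns a task graph, and the work only runs when `.compute()` is called. The stdlib-only sketch below illustrates why timing the two gives very different numbers. The `time_call` helper and the `LazyMean` class are hypothetical stand-ins written for this illustration; they are not Dask APIs.

```python
from timeit import default_timer as timer

def time_call(fn, repeats=3):
    """Call fn() `repeats` times; return (best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = timer()
        result = fn()
        best = min(best, timer() - start)
    return best, result

class LazyMean:
    """Hypothetical stand-in for a lazy result: the work happens in compute()."""
    def __init__(self, values):
        self.values = values  # construction is cheap, nothing is calculated yet

    def compute(self):
        return sum(self.values) / len(self.values)

values = list(range(1, 1_000_001))

build_time, lazy = time_call(lambda: LazyMean(values))
run_time, mean = time_call(lambda: LazyMean(values).compute())

print("build only:      ", round(build_time, 5))
print("build + compute: ", round(run_time, 5))
```

With the notebook's own objects, the like-for-like Dask measurement would be `time_call(lambda: dask_df["price"].mean().compute())`.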


__Filter dataframes__

In [9]:
start_p = timer()
# Filtering by price in Pandas
pandas_df[pandas_df["price"] > 250]
end_p = timer()
pandas_time = round(end_p - start_p, 5)
print("pandas execution time: ", pandas_time)
print("-"*30)

start_d = timer()
# Filtering by price in Dask (lazy: this only builds the task graph)
dask_df[dask_df["price"] > 250]
end_d = timer()
dask_time = round(end_d - start_d, 5)
print("\ndask execution time: ", dask_time)

pandas execution time:  0.02367
------------------------------

dask execution time:  0.00097


In [10]:
# calculating the ratio of pandas execution time to dask execution time
print(round(pandas_time/dask_time, 3),
      "x speedup")

24.402 x speedup
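The quantity `pandas_time/dask_time` is a multiplicative factor rather than a percentage. A small sketch of the conversion, using the two timings printed above; the `speedup` helper is a hypothetical name introduced for illustration.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Return (factor, percent increase in speed) of candidate over baseline."""
    factor = baseline_seconds / candidate_seconds
    percent_increase = (factor - 1) * 100
    return factor, percent_increase

# timings taken from the filter cell's printed output
factor, percent = speedup(0.02367, 0.00097)
print(f"{factor:.3f}x speedup ({percent:.1f}% increase in speed)")
```

A factor of 1.0 means no change, which corresponds to a 0% increase.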


__Add dataframes__

In [11]:
start_p = timer()
# Adding big DataFrames in Pandas
pandas_df + pandas_df + pandas_df + pandas_df + pandas_df
end_p = timer()
pandas_time = round(end_p - start_p, 5)
print("pandas execution time: ", pandas_time)
print("-"*30)

start_d = timer()
# Adding big DataFrames in Dask (lazy: this only builds the task graph)
dask_df + dask_df + dask_df + dask_df + dask_df
end_d = timer()
dask_time = round(end_d - start_d, 5)
print("\ndask execution time: ", dask_time)

# calculating dask speed improvement
print("-"*30, "\n")

print(round(pandas_time/dask_time, 3),
      "x speedup")

pandas execution time:  0.41559
------------------------------

dask execution time:  0.01944
------------------------------ 

21.378 x speedup


## Compute GPU Acceleration with Dask vs Pandas

__The cuDF library enables GPU acceleration for Pandas-style and Dask dataframe computation.__

In [12]:
# import cudf

# # reinstantiate `dask_df` for use with GPU backend
# dask_df = dask_df.map_partitions(cudf.from_pandas) 

### _cuDF no longer available via pip (or for Windows?)_

__Resume in containerized environment.__