# Dask Speed Test

The purpose of this notebook is to play around with the Dask module and understand the speedups it offers over (1) conventional Python operations, (2) NumPy operations, and (3) vectorized Python code (via numba). I'll also include some experiments on profiling Dask code and, if things go well, build some example questions at the end that a reader can use to assess their understanding.

### Ex 1: Computation that could be parallelized

In [5]:
import pandas as pd
import numpy as np
import dask
from timeit import timeit
from numba import vectorize, float64

In [7]:
df = pd.DataFrame({'A':np.arange(0, 100, 0.01), 'B':np.random.randn(10000)})

In [12]:
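def slow_add():
    # Element-wise sum with an explicit Python loop.
    c = []
    for a, b in zip(df['A'], df['B']):
        c.append(a+b)
    return c

def fast_add():
    # Same sum as a single vectorized pandas operation.
    return df['A'] + df['B']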
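
Before timing the two versions, it is worth confirming that they compute the same thing. The following is a minimal sketch of such a check, not part of the original experiment: it assumes `fast_add` returns its result, and the names `loop_result`, `vector_result`, and `same` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(0, 100, 0.01), 'B': np.random.randn(10000)})

def slow_add():
    # Element-wise sum with an explicit Python loop.
    c = []
    for a, b in zip(df['A'], df['B']):
        c.append(a + b)
    return c

def fast_add():
    # Same sum as a single vectorized pandas operation.
    return df['A'] + df['B']

loop_result = slow_add()      # plain Python list of floats
vector_result = fast_add()    # pandas Series

# Compare the two results element by element, within floating-point tolerance.
same = np.allclose(loop_result, vector_result.to_numpy())
print(same)
```

If the check passes, any difference in the timings below comes from how the sum is computed, not from what is computed.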

In [3]:
SETUP_CODE = """
import pandas as pd
import numpy as np

df = pd.DataFrame({'A':np.arange(0, 100, 0.01), 'B':np.random.randn(10000)})

def slow_add():
    c = []
    for a, b in zip(df['A'], df['B']):
        c.append(a+b)
    return c

def fast_add():
    return df['A'] + df['B']
"""
timeit(setup=SETUP_CODE, stmt='slow_add()', number=1000)

9.708962222999986

In [4]:
SETUP_CODE = """
import pandas as pd
import numpy as np

df = pd.DataFrame({'A':np.arange(0, 100, 0.01), 'B':np.random.randn(10000)})

def slow_add():
    c = []
    for a, b in zip(df['A'], df['B']):
        c.append(a+b)
    return c

def fast_add():
    return df['A'] + df['B']
"""
timeit(setup=SETUP_CODE, stmt='fast_add()', number=1000)

0.41694216999999867
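
The two timing cells above repeat the whole setup as a string. As a sketch of an alternative, `timeit` also accepts a callable as `stmt`, so functions already defined in the session can be timed directly. The value of `n_runs` and the variable names here are my own choices, and the totals will differ from the numbers reported above depending on the machine.

```python
from timeit import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(0, 100, 0.01), 'B': np.random.randn(10000)})

def slow_add():
    # Element-wise sum with an explicit Python loop.
    c = []
    for a, b in zip(df['A'], df['B']):
        c.append(a + b)
    return c

def fast_add():
    # Same sum as a single vectorized pandas operation.
    return df['A'] + df['B']

n_runs = 20  # kept small so the sketch finishes quickly

# Passing the function objects avoids duplicating the setup in a string.
slow_total = timeit(slow_add, number=n_runs)
fast_total = timeit(fast_add, number=n_runs)

print(f"slow_add: {slow_total / n_runs * 1e3:.3f} ms per call")
print(f"fast_add: {fast_total / n_runs * 1e3:.3f} ms per call")
```

Timing a callable adds one function-call overhead per run, which is negligible next to a 10,000-element sum.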

Suppose we recognized that our function `slow_add` is slow, but we didn't know how to rewrite it in terms of pure NumPy operations. We might try using numba's `@vectorize` decorator to compile the scalar addition into a ufunc instead:

In [7]:
@vectorize([float64(float64, float64)])
def med_add(x, y):
    return x + y

SETUP_CODE = """
import pandas as pd
import numpy as np
from numba import vectorize, float64

df = pd.DataFrame({'A':np.arange(0, 100, 0.01), 'B':np.random.randn(10000)})

@vectorize([float64(float64, float64)])
def med_add(x, y):
    return x + y

def fast_add():
    return df['A'] + df['B']
"""

timeit(setup=SETUP_CODE, stmt="med_add(df['A'], df['B'])", number=1000)

0.846783060000007
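
To put the three totals side by side, the short sketch below converts them into per-call times and speedups relative to the loop version. The numbers are copied from the cell outputs above (each is the total for 1000 runs on the machine used for this notebook); the dictionary and variable names are only illustrative.

```python
# Totals in seconds for 1000 runs, copied from the outputs above.
timings = {
    "slow_add": 9.708962222999986,
    "fast_add": 0.41694216999999867,
    "med_add": 0.846783060000007,
}
n_runs = 1000

# Average time per call, in milliseconds.
per_call_ms = {name: total / n_runs * 1e3 for name, total in timings.items()}

# How many times faster each version is than the Python loop.
speedup_vs_loop = {name: timings["slow_add"] / total for name, total in timings.items()}

for name in timings:
    print(f"{name}: {per_call_ms[name]:.3f} ms per call, "
          f"{speedup_vs_loop[name]:.1f}x vs slow_add")
```

On these numbers the pandas version is roughly 23 times faster than the loop, and the numba version roughly 11 times faster, which puts it at about half the speed of the pandas version.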