## Basic Dask

What is an efficient way to process very large data? What if we do not have enough RAM to perform the calculations? Dask to the rescue! Dask splits the work into chunks, runs the resulting tasks in parallel, and only does the work when asked. It takes a little more code and adds some overhead, but for large computations it is worth the effort. It also allows reading and processing data larger than the available RAM.

In [1]:
import dask.array as da  # Convention is to import as da

Create a Dask array with the dask.array module, much as you would with Numpy. We define chunks to help manage parallel processing and to keep the RAM footprint small, which lets us process data larger than the RAM available to us. Jupyter and Dask play well together: Jupyter renders a nice graphical summary of the Dask array's size, shape, and type. Notice that displaying the array does not print its values, only information about the Dask array. That is because the values have not been created yet; only the steps needed to create them have been recorded.

In [2]:
x = da.random.random((10000, 10000), chunks=(1000, 1000))
x

| | Array | Chunk |
|---|---|---|
| Bytes | 762.94 MiB | 7.63 MiB |
| Shape | (10000, 10000) | (1000, 1000) |
| Dask graph | 100 chunks in 1 graph layer | |
| Data type | float64 numpy.ndarray | |
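The numbers in that summary can also be read directly from the array's attributes. The following is a minimal sketch that recreates the same array and inspects its chunk layout without computing any values; the variable names are only illustrative.

```python
import dask.array as da

# Same array as above; nothing is computed, only metadata is inspected.
x = da.random.random((10000, 10000), chunks=(1000, 1000))

print(x.chunksize)        # shape of the largest chunk: (1000, 1000)
print(x.numblocks)        # number of chunks along each axis: (10, 10)
print(x.npartitions)      # total number of chunks: 100
print(x.nbytes / 2**20)   # total size in MiB: 762.939453125
```

Each chunk holds 1000 x 1000 float64 values, which is where the 7.63 MiB per chunk in the summary comes from.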


Perform math just as you would with Numpy.

In [3]:
y = x + x.T  # .T means transpose the array

Here we calculate the mean along an axis.

But notice that when we display z it does not show any values, just information about the Dask object. At this point no computation has occurred; Dask has only recorded what we want to do.

In [4]:
z = y[::2, 5000:].mean(axis=1)
z

| | Array | Chunk |
|---|---|---|
| Bytes | 39.06 kiB | 3.91 kiB |
| Shape | (5000,) | (500,) |
| Dask graph | 10 chunks in 7 graph layers | |
| Data type | float64 numpy.ndarray | |
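To see that the lazy Dask expression describes the same calculation as Numpy, the sketch below runs the same pattern on a small array with both libraries. The 8 x 8 size, the seed, and the variable names are assumptions chosen only to keep the example quick.

```python
import numpy as np
import dask.array as da

# Small illustrative input so the Numpy and Dask results can be compared.
rng = np.random.default_rng(0)
x_np = rng.random((8, 8))
x_da = da.from_array(x_np, chunks=(4, 4))

# Same expression in both libraries: add the transpose, take every
# second row and the right half of the columns, then average each row.
z_np = (x_np + x_np.T)[::2, 4:].mean(axis=1)
z_da = (x_da + x_da.T)[::2, 4:].mean(axis=1)

print(z_da.shape)                          # (4,), known without computing
print(np.allclose(z_da.compute(), z_np))   # True
```

The shape is available before any work is done because Dask tracks shapes and chunk sizes as it builds the graph.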


We need to compute the values and convert them to Numpy before printing them to the screen. Calling .compute() on the object returns a Numpy array of the values.

In [5]:
z = z.compute()
z

array([1.00450745, 0.99041878, 0.99831558, ..., 0.99420789, 1.00546017,
       0.99417984])

Once we call .compute(), z is a normal Numpy array, so the operations below run in Numpy. We could instead have kept everything lazy and rolled the greater-than comparison and the .any() reduction into the Dask graph, which can be faster overall because only the final result is computed.

In [6]:
a = z > 0  # Return array of boolean values where value is greater than zero
b = a.any()  # Are any values set to True?
b

True

We can also keep every step lazy and compute only at the end.

In [7]:
x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = x + x.T  # .T means transpose the array
z = y[::2, 5000:].mean(axis=1)
a = z > 0  # Return array of boolean values where value greater than zero
b = a.any()  # Are any values set to True
print('b:', b)
b = b.compute()
print('b:', b)

b: dask.array<any-aggregate, shape=(), dtype=bool, chunksize=(), chunktype=numpy.ndarray>
b: True
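When more than one result is needed from the same lazy arrays, they can be requested together. The sketch below uses `dask.compute()` to evaluate two results in one call, so the shared intermediate `y` is only built once; the smaller array size and the names `row_means` and `largest` are assumptions for illustration.

```python
import dask
import dask.array as da

x = da.random.random((1000, 1000), chunks=(250, 250))
y = x + x.T                    # shared intermediate, still lazy

row_means = y.mean(axis=1)     # lazy: mean of each row
largest = y.max()              # lazy: largest value overall

# Evaluate both results together; dask.compute returns a tuple.
means, biggest = dask.compute(row_means, largest)

print(means.shape)    # (1000,)
print(biggest)        # a single value no larger than 2.0
```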


Most Numpy operations are available in Dask. Notice that nanmean() has not actually run yet; the displayed variable c is a placeholder for the operation.

In [8]:
a = da.random.random(1000, chunks=100)
b = da.ones(1000, chunks=100)

c = b - a
c = da.nanmean(c)
c

| | Array | Chunk |
|---|---|---|
| Bytes | 8 B | 8 B |
| Shape | () | () |
| Dask graph | 1 chunks in 6 graph layers | |
| Data type | float64 numpy.ndarray | |


We need to tell Dask to perform the computation to produce the value.

In [9]:
c = c.compute()
c

0.4979883783977874
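The array above contains no missing values, so nanmean() and mean() would give the same answer there. As a small sketch of where they differ, the example below uses a hand-made array (an assumption for illustration) with two NaN entries.

```python
import numpy as np
import dask.array as da

# Hand-made values with two missing entries, split into chunks of 2.
values = da.from_array(np.array([1.0, np.nan, 3.0, np.nan, 5.0]), chunks=2)

plain_mean = values.mean().compute()       # NaN propagates through mean()
nan_mean = da.nanmean(values).compute()    # NaN entries are ignored

print(plain_mean)   # nan
print(nan_mean)     # 3.0
```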

## How much faster is Dask than Numpy at some calculations?

Let's make a large array and time the same calculation in both libraries.

We import the standard-library time module to measure how long each computation takes.

In [10]:
import time
import numpy as np

Let's create two functions that perform the same task, one using Numpy and one using Dask. Each returns the number of seconds the computation took.

In [11]:
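def numpy_computation(num):
    """Return the seconds Numpy takes to build the array and take its nanmean."""
    start_time = time.perf_counter()
    b = np.ones(num) - np.random.random(num)
    # Set roughly 10% of the values to NaN (random indices may repeat)
    b[np.random.randint(0, num, num // 10)] = np.nan
    b = np.nanmean(b)

    return time.perf_counter() - start_time


def dask_computation(num):
    """Return the seconds Dask takes for the same task, split into 10 chunks."""
    start_time = time.perf_counter()
    chunks = max(1, num // 10)
    b = da.ones(num, chunks=chunks) - da.random.random(num, chunks=chunks)
    # Boolean mask that is True for roughly 10% of the values,
    # matching the share of NaN values in the Numpy version
    b[da.random.random(num, chunks=chunks) < 0.1] = np.nan
    b = da.nanmean(b)
    b = b.compute()  # Only here is the work actually performed

    return time.perf_counter() - start_time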
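The Dask timing covers two phases: building the task graph and running it. As a rough sketch of how the two could be measured separately, the hypothetical helper `time_dask_phases` below (not part of the notebook or of Dask) returns both durations along with the result.

```python
import time
import dask.array as da


def time_dask_phases(num, n_chunks=10):
    """Hypothetical helper: time graph construction and execution separately."""
    chunks = max(1, num // n_chunks)

    t0 = time.perf_counter()
    b = da.ones(num, chunks=chunks) - da.random.random(num, chunks=chunks)
    result = da.nanmean(b)      # lazy: only the graph is built here
    t1 = time.perf_counter()
    value = result.compute()    # the actual work happens here
    t2 = time.perf_counter()

    return t1 - t0, t2 - t1, float(value)


build_s, run_s, value = time_dask_phases(100_000)
print(f'Graph construction: {build_s} seconds')
print(f'Execution: {run_s} seconds')
```

The split depends on the machine and on the chunk count, so the numbers are only meaningful relative to each other.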

In [12]:
num = 100_000_000
numpy_time = numpy_computation(num)
dask_time = dask_computation(num)

print(f'Numpy Elapsed Time: {numpy_time} seconds')
print(f'Dask Elapsed Time: {dask_time} seconds')

print(f'\nDask is about {numpy_time/dask_time} times faster than Numpy for this operation.\n')

Numpy Elapsed Time: 1.7535278797149658 seconds
Dask Elapsed Time: 1.1301848888397217 seconds

Dask is about 1.5515407231423743 times faster than Numpy for this operation.



What about a smaller array? Building and scheduling the task graph adds overhead, so for small operations Dask can be slower than plain Numpy.

In [13]:
num = 1000
numpy_time = numpy_computation(num)
dask_time = dask_computation(num)

print(f'Numpy Elapsed Time: {numpy_time} seconds\n')
print(f'Dask Elapsed Time: {dask_time} seconds\n')

print(f'\nNumpy is about {dask_time/numpy_time} times faster than Dask for this much smaller operation.\n')

Numpy Elapsed Time: 0.00017404556274414062 seconds

Dask Elapsed Time: 0.008721113204956055 seconds


Numpy is about 50.108219178082194 times faster than Dask for this much smaller operation.
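Part of that overhead comes from splitting a tiny array into ten chunks, each of which becomes its own set of tasks. One option is to let Dask pick the chunk size with chunks="auto". The sketch below assumes Dask's default configuration for the target chunk size and only inspects the resulting layout, so nothing is computed.

```python
import dask.array as da

# Let Dask choose the chunking instead of always splitting into 10 pieces.
small = da.ones(1000, chunks="auto")
large = da.ones(100_000_000, chunks="auto")

# A small array fits comfortably in a single chunk.
print(small.npartitions, small.chunksize)

# A large array is split into several chunks that each fit in memory.
print(large.npartitions, large.chunksize)
```

With a single chunk there is far less scheduling work, although for arrays this small plain Numpy is still the simpler choice.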

