Like Pandas dataframes, NumPy arrays work only in memory. The good news is that Dask supports NumPy arrays in much the same way as it supports Pandas dataframes. It provides a structure called `Array` that adds parallelization capabilities to NumPy arrays. In a nutshell, a Dask array coordinates several NumPy arrays, as the figure below demonstrates:

![Dask array](dask_array.svg)

Moreover, Dask arrays provide most of the familiar NumPy functionality and APIs, so using them feels much like writing NumPy code. This is the beauty of Dask.

# Working with Dask arrays

There are three main advantages of using Dask arrays instead of NumPy arrays:

*  **Parallel**: Dask can use all of the cores on the computer it runs on. It can also scale to multiple machines.

*   **Larger-than-memory**: NumPy arrays can only work with data that fits in memory. Dask arrays let us work on datasets that are larger than the available memory by breaking up the arrays into many smaller pieces called **chunks**. This is illustrated in the figure below:

<img src="array.png" width="25%">

*  **Blocked Algorithms**: Blocked algorithms perform a large computation by dividing it into smaller computations and then aggregating the results. Because the smaller computations are independent of each other, a blocked algorithm implemented with Dask arrays instead of NumPy arrays can run them in parallel.
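To make the idea of a blocked algorithm concrete, here is a minimal sketch in plain NumPy. The helper `blocked_mean` and its `block_rows` parameter are hypothetical names used only for illustration; the sketch computes a mean one block of rows at a time and then combines the partial results. Dask arrays apply the same idea, but can run the per-block work in parallel.

```python
import numpy as np

def blocked_mean(a, block_rows):
    """Mean of a 2-D array, computed one block of rows at a time."""
    total = 0.0
    count = 0
    for start in range(0, a.shape[0], block_rows):
        block = a[start:start + block_rows]  # one smaller piece of the array
        total += block.sum()                 # small computation on that piece
        count += block.size
    return total / count                     # aggregate the partial results

rng = np.random.default_rng(0)
a = rng.random((1000, 50))
result = blocked_mean(a, block_rows=128)

# The blocked result should agree with NumPy's own mean.
print(np.isclose(result, a.mean()))
```

Each block's sum depends only on that block, which is what makes this kind of computation straightforward to parallelize.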

We'll start using Dask arrays by generating random values. Before that, we start a Dask client. Remember that starting a client isn't necessary when working with Dask on our local computer, but it provides a dashboard where we can investigate our computations. We'll again configure our client with four workers and two threads per worker. We also set a 2 GB memory limit for each worker.

In [1]:
import warnings
warnings.filterwarnings("ignore")

from dask.distributed import Client, progress

client = Client(n_workers=4, threads_per_worker=2, memory_limit='2GB')
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://127.0.0.1:50485 | Workers: 4 |
| Dashboard: http://127.0.0.1:50486/status | Cores: 8 |
| | Memory: 8.00 GB |


# Using Dask as if it's NumPy

We'll use Dask arrays by creating random arrays and doing mathematical calculations on them. We'll also do the same thing with NumPy arrays and compare the run times.

## Creating a random array

To use Dask arrays, we import `dask.array` as follows:

In [2]:
import dask.array as da
import numpy as np

Below, we create a 10000x10000 array of random numbers. Note that we set the `chunks` parameter to `(1000, 1000)`, which has no counterpart when generating random arrays with NumPy. By setting `chunks`, we tell Dask to represent the array as many NumPy arrays of size 1000x1000 (or smaller where the array cannot be divided evenly). In our case, there will be 100 NumPy arrays of size 1000x1000.
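As a quick check of the chunk arithmetic above, the following sketch inspects the chunk layout of a Dask array without computing any values. The second array, `small`, is a hypothetical example added here to show what happens when the shape is not evenly divisible by the chunk size.

```python
import dask.array as da

# Evenly divisible: 10000 / 1000 = 10 blocks along each axis.
x = da.random.random((10000, 10000), chunks=(1000, 1000))
print(x.numblocks)    # number of blocks along each axis
print(x.npartitions)  # total number of blocks

# Not evenly divisible: the last block along each axis is smaller.
small = da.ones((10, 10), chunks=(4, 4))
print(small.chunks)   # block sizes along each axis
```

Creating these arrays is lazy, so no random numbers are generated until `.compute()` is called.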

What we do below is:

* We first create a random Dask array of size 10000x10000.
* Then we add this array to its transpose.
* Last, we slice the resulting array, keeping every second row and the last 5000 columns, and calculate the mean of each row.

As usual, we call `.compute()` to make Dask evaluate the results. Note that we measure the run time of the following cell using Jupyter Notebook's magic command `%%time`.

In [3]:
%%time
x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = x + x.T
z = y[::2, 5000:].mean(axis=1)
z.compute()

CPU times: user 386 ms, sys: 40.6 ms, total: 427 ms
Wall time: 1.51 s


First, notice that the code above is almost identical to what we would write using NumPy. The only differences are that we set the `chunks` parameter when generating the random Dask array and that we call `.compute()` at the end.

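Second, our code block finished in 1.51 seconds of wall time. The 427 milliseconds reported as CPU time covers only the notebook's own process; most of the work runs in the Dask worker processes.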

Now, let's do the same thing using NumPy arrays:

In [4]:
%%time
x = np.random.random((10000, 10000))
y = x + x.T
z = y[::2, 5000:].mean(axis=1)

CPU times: user 2.89 s, sys: 873 ms, total: 3.76 s
Wall time: 3.97 s


When we ran the same code using NumPy, it took 3.97 seconds of wall time, compared with 1.51 seconds for Dask. That is, running the same operations took about 2.6 times longer when using NumPy instead of Dask.
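The two timed cells use different random data, so their outputs cannot be compared directly. As a sanity check that the Dask and NumPy versions of this calculation agree, the sketch below runs both on the same smaller array. It assumes a seeded 1000x1000 input and a 250x250 chunk size, both chosen here only for illustration, and it uses `da.from_array` to wrap the NumPy array in a Dask array.

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(42)
a = rng.random((1000, 1000))             # plain NumPy array
d = da.from_array(a, chunks=(250, 250))  # same data, split into 16 chunks

# The same expression, written once for NumPy and once for Dask.
numpy_result = (a + a.T)[::2, 500:].mean(axis=1)
dask_result = (d + d.T)[::2, 500:].mean(axis=1).compute()

print(np.allclose(numpy_result, dask_result))
```

This sketch does not need a running client; without one, Dask falls back to its default local scheduler.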

# Persisting data in memory

So far, we have seen that we can use Dask arrays instead of NumPy arrays to parallelize computations. Moreover, using Dask, we can even work with data that doesn't fit in memory. However, if we have enough memory for an array and just want to speed up the computations, we can persist the data in memory. The array is then computed once and its chunks are kept in memory, so future computations on the persisted array don't need to regenerate it and run faster.

We demonstrate this by doing the same computations before and after we persist our Dask array in memory. First, we run our computations without persisting the array:

In [5]:
x = da.random.random((10000, 10000), chunks=(1000, 1000))

In [6]:
%%time
y = x + x.T
z = y[::2, 5000:].mean(axis=1)
z.compute()

CPU times: user 343 ms, sys: 132 ms, total: 475 ms
Wall time: 1.08 s


Now we do the same thing after persisting our array in memory. To persist an array, we call its `.persist()` method. Note that `.persist()` returns a new array backed by in-memory data rather than modifying `x` in place, so we assign the result back to `x`:

In [7]:
x = da.random.random((10000, 10000), chunks=(1000, 1000))
# persist() returns a new array whose chunks are held in memory
x = x.persist()
x

| | Array | Chunk |
|---|---|---|
| Bytes | 800.00 MB | 8.00 MB |
| Shape | (10000, 10000) | (1000, 1000) |
| Count | 100 Tasks | 100 Chunks |
| Type | float64 | numpy.ndarray |


Then we run the same computations as above on the persisted array:

In [8]:
%%time
y = x + x.T
z = y[::2, 5000:].mean(axis=1)
z.compute()

CPU times: user 236 ms, sys: 22.5 ms, total: 259 ms
Wall time: 514 ms


As we see, the computations on the persisted array took 514 milliseconds of wall time, around half of the 1.08 seconds they took without persisting.
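The persist pattern can be sketched on a smaller array as follows. The array size and the variable name `x_persisted` are assumptions made for illustration. The key point is that `.persist()` returns a new array and leaves the original untouched, so the returned array is the one to use in later computations.

```python
import dask.array as da

x = da.random.random((2000, 2000), chunks=(500, 500))

# persist() computes the chunks now and returns a new array that
# holds them in memory; x itself is not modified.
x_persisted = x.persist()

# Later computations start from the in-memory chunks.
z = (x_persisted + x_persisted.T)[::2, 1000:].mean(axis=1)
result = z.compute()
print(result.shape)
```

Without a client, the chunks are kept in the local process; with a distributed client, they are kept in the workers' memory.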

# Drawbacks of Dask arrays

As is the case for Dask dataframes, the `dask.array` package doesn't implement the entire NumPy interface. The main reasons are the following:

1. Dask is an ongoing project and NumPy has a huge API, so implementing all of it takes time.

2. Some operations, like sorting, are difficult to parallelize, as we discussed in the previous checkpoint. So, some functionality around sorting is deliberately not supported in Dask.

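3. If the shape of an operation's result depends on the values in the inputs, then Dask generally doesn't implement the operation, or it returns an array whose shape is unknown. This is because of Dask's lazy evaluation strategy: the values aren't available until the computation actually runs.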
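The third point can be illustrated with boolean filtering, which is one value-dependent operation that Dask does accept. In the sketch below (array size and threshold are arbitrary choices for illustration), the length of the filtered array is unknown until the data is computed, so Dask reports it as `nan`. Calling `compute_chunk_sizes()` evaluates the graph to fill in the missing sizes.

```python
import math
import dask.array as da

x = da.random.random((1000,), chunks=(250,))

# How many elements pass the filter depends on the data itself,
# so Dask cannot know the output length before computing.
y = x[x > 0.5]
shape_before = y.shape
print(shape_before)        # the length is unknown at this point

# compute_chunk_sizes() runs the computation needed to learn the
# chunk sizes and updates y in place.
y.compute_chunk_sizes()
print(y.shape)
```

Until the chunk sizes are known, operations that need the exact shape may raise an error.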

That being said, many of the most commonly used NumPy functions are available for Dask arrays.

# Closing the connection

We're done with Dask for this checkpoint, so we can close our connection:

In [9]:
client.close()