**Dask Arrays** provide the following key features:
</br>
- **Larger-than-memory:** Dask Arrays let us work on datasets larger than the available memory. Dask breaks the array into many small chunks and operates on those chunks, which reduces the memory footprint of the computation and lets data stream efficiently from disk.</br>
- **Parallel:** Dask Arrays can use all available cores for parallel computation.</br>
- **Blocked Algorithms:** Dask Arrays implement blocked algorithms, which operate on blocks (submatrices) of an array rather than on the whole array at once. A large computation is thereby carried out as many smaller ones.
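
The blocked-algorithm idea can be illustrated without Dask at all. The following is a minimal sketch in plain NumPy, not Dask's actual implementation; `blocked_sum` and `block_size` are hypothetical names chosen for illustration.

```python
import numpy as np


def blocked_sum(values, block_size):
    """Sum a 1-D array by summing fixed-size blocks, then combining the partial sums."""
    partial_sums = [
        values[start:start + block_size].sum()  # each block only needs its own slice
        for start in range(0, values.size, block_size)
    ]
    return sum(partial_sums)


data = np.arange(35)                      # values 0..34
total = blocked_sum(data, block_size=6)   # blocks of 6, 6, 6, 6, 6 and 5 elements
print(total)                              # 595
```

Each partial sum depends only on its own block, which is what allows a scheduler to run the blocks in parallel or load them one at a time.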


In [1]:
import dask.array as darray    
      
# using arange for creating an array with values from 0 to 34  
my_array = darray.arange(35, chunks = 6)  
print( my_array.compute())  
      
# using chunks for checking the size of each chunk  
print(my_array.chunks)  

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34]
((6, 6, 6, 6, 6, 5),)


##### Converting numpy array into dask array

In [2]:
import numpy as np  
import dask.array as darray  
  
first_array = np.arange(15)  
  
second_array = darray.from_array(first_array, chunks = 5)  
  
# resulting in a dask array  
print(second_array.compute())  

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]


##### Calculating the sum of the first 100 numbers

In [3]:
import numpy as np  
import dask.array as darray  
  
# arange is used to create an array of values from 0 to 99  
first_array = np.arange(100)    
  
# converting numpy array to dask array  
second_array = darray.from_array(first_array, chunks = (10))    
  
# computing the sum of the array  
print(second_array.sum().compute())

4950


In [4]:
# NumPy array
a_np = np.ones(10)
a_np

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [5]:
a_np_sum = a_np[:5].sum() + a_np[5:].sum()
a_np_sum

10.0

Notice that each sum in the computation above is completely independent, so the two sums could be computed in parallel. To do this with a Dask array, we need to define our "slices" by specifying the number of elements we want per block with the `chunks` argument.</br>
**Important!**

Note that to get two blocks we specify `chunks=5`; in other words, each block holds 5 elements.
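
As a small sketch of how `chunks` relates to the number of blocks, the snippet below inspects the `chunks` and `numblocks` attributes of two arrays. The names `x_even` and `x_uneven` are made up for this illustration.

```python
import dask.array as darray

# 10 elements split into blocks of 5 -> two equal blocks
x_even = darray.ones(10, chunks=5)
print(x_even.chunks)     # ((5, 5),)
print(x_even.numblocks)  # (2,)

# 10 elements split into blocks of 4 -> the last block holds the remainder
x_uneven = darray.ones(10, chunks=4)
print(x_uneven.chunks)     # ((4, 4, 2),)
print(x_uneven.numblocks)  # (3,)
```

When the array length is not a multiple of the chunk size, the final block is simply smaller, as in the `arange(35, chunks=6)` example earlier.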

In [7]:
a_da = darray.ones(10, chunks=5)
a_da

| | Array | Chunk |
|---|---|---|
| Bytes | 80 B | 40 B |
| Shape | (10,) | (5,) |
| Dask graph | 2 chunks in 1 graph layer | |
| Data type | float64 numpy.ndarray | |


In [8]:
a_da_sum = a_da.sum()
a_da_sum

| | Array | Chunk |
|---|---|---|
| Bytes | 8 B | 8 B |
| Shape | () | () |
| Dask graph | 1 chunks in 3 graph layers | |
| Data type | float64 numpy.ndarray | |


In [9]:
# visualize the low level Dask graph using cytoscape
a_da_sum.visualize(engine="cytoscape")

CytoscapeWidget(cytoscape_layout={'name': 'dagre', 'rankDir': 'BT', 'nodeSep': 10, 'edgeSep': 10, 'spacingFact…

In [10]:
a_da_sum.compute()

10.0
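
To connect the Dask result back to the manual NumPy version, here is a hedged sketch that builds one lazy sum per block and evaluates them together. It assumes the `Array.blocks` accessor and `dask.compute` behave as in recent Dask releases; `partial_sums` and `manual_total` are illustrative names.

```python
import dask
import dask.array as darray

a = darray.ones(10, chunks=5)

# One lazy sum per block; nothing is computed at this point.
partial_sums = [a.blocks[i].sum() for i in range(a.numblocks[0])]

# dask.compute evaluates all partial sums in one call,
# so the scheduler is free to run them in parallel.
partial_values = dask.compute(*partial_sums)
manual_total = sum(partial_values)

# The manual combination matches what a.sum() does for us.
print(manual_total == a.sum().compute())  # True
```

This mirrors `a_np[:5].sum() + a_np[5:].sum()`, except that the two partial sums are only evaluated when `dask.compute` is called.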

#### Comparison between numpy array and dask array

In [11]:
%%time
xn = np.random.normal(10, 0.1, size=(30_000, 30_000))
yn = xn.mean(axis=0)
yn

CPU times: total: 1min 21s
Wall time: 2min 36s


array([ 9.99932323,  9.99921027,  9.99935457, ..., 10.00059416,
       10.00027634, 10.00027404])

In [12]:
xd = darray.random.normal(10, 0.1, size=(30_000, 30_000), chunks=(3000, 3000))
xd

| | Array | Chunk |
|---|---|---|
| Bytes | 6.71 GiB | 68.66 MiB |
| Shape | (30000, 30000) | (3000, 3000) |
| Dask graph | 100 chunks in 1 graph layer | |
| Data type | float64 numpy.ndarray | |


In [13]:
xd.nbytes / 1e9  # Gigabytes of the input processed lazily

7.2
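
The sizes reported above can be checked with plain arithmetic. This short sketch only assumes that each `float64` element takes 8 bytes; it shows why the same array appears as 7.2 (decimal gigabytes) here and as 6.71 GiB (binary units) in the array summary.

```python
n_rows, n_cols = 30_000, 30_000
chunk_rows, chunk_cols = 3_000, 3_000
itemsize = 8  # bytes per float64 element

array_bytes = n_rows * n_cols * itemsize
chunk_bytes = chunk_rows * chunk_cols * itemsize
n_chunks = (n_rows // chunk_rows) * (n_cols // chunk_cols)

print(array_bytes / 1e9)               # 7.2   (GB, powers of 10)
print(round(array_bytes / 2**30, 2))   # 6.71  (GiB, powers of 2)
print(round(chunk_bytes / 2**20, 2))   # 68.66 (MiB per chunk)
print(n_chunks)                        # 100
```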

In [14]:
yd = xd.mean(axis=0)
yd

| | Array | Chunk |
|---|---|---|
| Bytes | 234.38 kiB | 23.44 kiB |
| Shape | (30000,) | (3000,) |
| Dask graph | 10 chunks in 4 graph layers | |
| Data type | float64 numpy.ndarray | |


In [16]:
%%time
xd = darray.random.normal(10, 0.1, size=(30_000, 30_000), chunks=(3000, 3000))
yd = xd.mean(axis=0)
yd.compute()

CPU times: total: 1min 15s
Wall time: 22.9 s


array([ 9.99960988, 10.00018468,  9.99909511, ...,  9.99978773,
        9.99886381,  9.99955695])
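
The two timed runs above draw different random numbers, so their outputs cannot be compared element by element. As a sanity check, the sketch below uses a much smaller array (an assumption made so it runs quickly) and wraps the same NumPy data in a Dask array to confirm both libraries give the same column means.

```python
import numpy as np
import dask.array as darray

rng = np.random.default_rng(0)
xn_small = rng.normal(10, 0.1, size=(1_000, 1_000))

# Wrap the same data in a Dask array with a 4 x 4 grid of chunks
xd_small = darray.from_array(xn_small, chunks=(250, 250))

yn_small = xn_small.mean(axis=0)            # eager NumPy result
yd_small = xd_small.mean(axis=0).compute()  # blocked Dask result

print(np.allclose(yn_small, yd_small))  # True
```

The speed-up seen in the timings depends on the machine, the number of cores, and the chunk size, so it should not be read as a general benchmark.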

### Xarray
- In some applications we have multidimensional data, and working with all these dimensions can be confusing. 
  Xarray is an open-source Python package that makes working with labeled multi-dimensional arrays easier.
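
To make "labeled" concrete, here is a minimal sketch with a tiny made-up array. The dimension names `time` and `city` and the values are invented for illustration only.

```python
import numpy as np
import xarray as xr

temps = xr.DataArray(
    np.arange(6.0).reshape(2, 3),          # [[0, 1, 2], [3, 4, 5]]
    dims=("time", "city"),
    coords={"time": [0, 1], "city": ["a", "b", "c"]},
)

# Reduce by dimension name instead of remembering that time is axis 0
mean_over_time = temps.mean("time")
print(mean_over_time.values)   # [1.5 2.5 3.5]

# Select by coordinate label instead of by integer position
b_series = temps.sel(city="b")
print(b_series.values)         # [1. 4.]
```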

In [17]:
import xarray as xr

In [18]:
ds = xr.tutorial.open_dataset(
    "air_temperature",
    chunks={  # this tells xarray to open the dataset as a dask array
        "lat": 25,
        "lon": 25,
        "time": -1,
    },
)
ds

| | Array | Chunk |
|---|---|---|
| Bytes | 14.76 MiB | 6.96 MiB |
| Shape | (2920, 25, 53) | (2920, 25, 25) |
| Dask graph | 3 chunks in 2 graph layers | |
| Data type | float32 numpy.ndarray | |


In [19]:
ds.air

| | Array | Chunk |
|---|---|---|
| Bytes | 14.76 MiB | 6.96 MiB |
| Shape | (2920, 25, 53) | (2920, 25, 25) |
| Dask graph | 3 chunks in 2 graph layers | |
| Data type | float32 numpy.ndarray | |


In [20]:
ds.air.chunks

((2920,), (25,), (25, 25, 3))

In [21]:
mean = ds.air.mean("time")  # no activity on dashboard
mean  # contains a dask array

| | Array | Chunk |
|---|---|---|
| Bytes | 5.18 kiB | 2.44 kiB |
| Shape | (25, 53) | (25, 25) |
| Dask graph | 3 chunks in 4 graph layers | |
| Data type | float32 numpy.ndarray | |


In [22]:
# we will see dashboard activity
mean.load()
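
The lazy-then-load pattern can be reproduced on a small synthetic array, without downloading the tutorial dataset. This is a sketch under assumptions: `field` is an invented array, and Dask must be installed for `.chunk()` to work.

```python
import numpy as np
import xarray as xr

field = xr.DataArray(
    np.arange(24.0).reshape(4, 6),
    dims=("time", "lon"),
).chunk({"time": -1, "lon": 3})   # -1 keeps the whole time axis in one chunk

lazy_mean = field.mean("time")    # still backed by a Dask array
print(lazy_mean.chunks)           # ((3, 3),)

loaded = lazy_mean.load()         # triggers the computation
print(type(loaded.data).__name__) # ndarray
print(loaded.values)              # [ 9. 10. 11. 12. 13. 14.]
```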



In [23]:
dair = ds.air

In [24]:
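# monthly mean: average over time for each calendar month
dair2 = dair.groupby("time.month").mean("time")
# subtracting directly broadcasts over the new "month" dimension, giving a 4-D (time, lat, lon, month) result
dair_new = dair - dair2
dair_new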

| | Array | Chunk |
|---|---|---|
| Bytes | 177.11 MiB | 6.96 MiB |
| Shape | (2920, 25, 53, 12) | (2920, 25, 25, 1) |
| Dask graph | 36 chunks in 44 graph layers | |
| Data type | float32 numpy.ndarray | |
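
The fourth dimension of length 12 in the result above comes from broadcasting: the monthly mean has a `month` dimension that the original array lacks. If the goal is a per-timestep anomaly relative to each month's mean (an assumption about the intent here), xarray's grouped arithmetic keeps the original shape. The sketch below uses a tiny invented time series to show the difference.

```python
import pandas as pd
import xarray as xr

times = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-02-01", "2000-02-02"])
series = xr.DataArray([1.0, 3.0, 10.0, 20.0], dims="time", coords={"time": times})

monthly_mean = series.groupby("time.month").mean("time")  # Jan: 2.0, Feb: 15.0

# Plain subtraction broadcasts time against month
broadcast_diff = series - monthly_mean
print(broadcast_diff.shape)   # (4, 2)

# Grouped subtraction matches each timestep with its own month
anomalies = series.groupby("time.month") - monthly_mean
print(anomalies.values)       # [-1.  1. -5.  5.]
```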
