## Working with Dask Arrays

In this chapter we'll explore how to use `dask.array` to read multiple data sources and compute on them as a single array. We'll learn some advanced uses of NumPy arrays for high-dimensional data that also work on Dask arrays. Finally, we'll examine climate patterns in the US using monthly weather data.

### Chunking a NumPy array
A NumPy array has been provided for you as `energy`. It holds the electricity load in kWh for the state of Texas, sampled every 15 minutes over the year 2000 (about 35 thousand samples). In the cell below, the data is read from the `load` dataset of `texas2000.hdf5`.

Your job is to convert the NumPy array into a `dask.array` whose chunks each hold 1/4 of the elements of `energy`. You will then inspect the chunk sizes of the Dask array. Finally, you'll compute the mean electricity load in kWh in two ways (using the `dask.array` and using the NumPy array) to compare the results.

In [14]:
# read hdf5 file

import h5py
f = h5py.File('texas2000.hdf5', 'r')

# check keys
f.keys()
# define load
load = f['load']
# print
print(load)

import dask.array as da

# Call da.from_array() with an integer chunk length of 1/4 the array: energy_dask
energy_dask = da.from_array(load, chunks=load.shape[0] // 4)

# Print energy_dask.chunks
print(energy_dask.chunks)

# Print the Dask array average (.compute() triggers the actual calculation)
print(energy_dask.mean().compute())

<HDF5 dataset "load": shape (35136,), type "<f8">
((8784, 8784, 8784, 8784),)
6077.886444672131
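
The chunked mean matches a whole-array mean because a mean can be assembled from per-chunk sums and counts. The following plain-Python sketch illustrates that idea on synthetic values. It is a conceptual illustration under stated assumptions, not Dask's actual implementation, and the names `values`, `chunks`, and `partials` are hypothetical.

```python
# Conceptual sketch only: plain Python, not Dask's actual implementation.
n_samples = 35136                      # same length as the load dataset above
n_chunks = 4
chunk_len = n_samples // n_chunks      # integer chunk length

# Synthetic stand-in data (hypothetical values, not the Texas load).
values = [5000.0 + (i % 96) * 10.0 for i in range(n_samples)]

# Split the series into equal-length chunks.
chunks = [values[i:i + chunk_len] for i in range(0, n_samples, chunk_len)]
chunk_sizes = tuple(len(c) for c in chunks)

# Reduce each chunk independently, then combine the partial results.
partials = [(sum(c), len(c)) for c in chunks]
total = sum(s for s, _ in partials)
count = sum(k for _, k in partials)
chunked_mean = total / count

# Reference value computed over the whole series at once.
direct_mean = sum(values) / len(values)

print(chunk_sizes)
print(abs(chunked_mean - direct_mean) < 1e-9)
```

Because each chunk is reduced on its own before the partial results are combined, the per-chunk work can run independently, which is the property chunking is meant to expose.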


### Timing Dask array computations
Your job now is to create two Dask arrays from `energy` using different chunk sizes. You'll then measure the time required (in milliseconds) to compute the standard deviation of each Dask array.

In [16]:
# Import time
import time

# Call da.from_array() with chunks of 1/4 the array length: energy_dask4
energy_dask4 = da.from_array(load, chunks=load.shape[0] // 4)

# Time the call to .std(). Dask is lazy: without .compute(), this measures
# only task-graph construction, not the reduction itself.
t_start = time.time()
std_4 = energy_dask4.std()
t_end = time.time()
print((t_end - t_start) * 1.0e3)

1.068115234375


In [17]:
# Now set chunks to 1/8 the length of energy and, as before, print the time in milliseconds.

# Call da.from_array() with chunks of 1/8 the array length: energy_dask8
energy_dask8 = da.from_array(load, chunks=load.shape[0] // 8)

# Time the call to .std(). As before, without .compute() this measures
# only task-graph construction, not the reduction itself.
t_start = time.time()
std_8 = energy_dask8.std()
t_end = time.time()
print((t_end - t_start) * 1.0e3)

1.2850761413574219
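
The cells above call `.std()` without `.compute()`, so the printed times cover building the lazy task graph rather than the reduction. A minimal sketch of timing the full computation might look like the following. It assumes a synthetic NumPy array in place of the HDF5 dataset, and the helper `time_std` is a hypothetical name introduced for illustration.

```python
import time

import dask.array as da
import numpy as np

# Synthetic stand-in for the load data (hypothetical values).
rng = np.random.default_rng(0)
energy = rng.normal(loc=6000.0, scale=500.0, size=35136)


def time_std(arr, n_chunks):
    """Return (std, elapsed_ms) for arr split into n_chunks Dask chunks."""
    chunk_len = arr.shape[0] // n_chunks      # integer chunk length
    darr = da.from_array(arr, chunks=chunk_len)
    t_start = time.perf_counter()
    result = darr.std().compute()             # .compute() triggers the actual work
    t_end = time.perf_counter()
    return float(result), (t_end - t_start) * 1.0e3


std_4, ms_4 = time_std(energy, 4)
std_8, ms_8 = time_std(energy, 8)
print(f"4 chunks: {ms_4:.3f} ms, 8 chunks: {ms_8:.3f} ms")
```

Timings from a single run are noisy and machine dependent, so repeating each measurement several times gives a fairer comparison between chunk sizes.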
