### Lazy Data

Lazy data, or lazy loading of data, is a technique in which data is not loaded until it is needed.  Computational expressions are stored as objects but are not evaluated until explicitly requested.  Used effectively, this makes more efficient use of system resources and improves performance by letting a user separate a **list of computations** from actually **performing them**.  Here are a few examples of why storing evaluations lazily can be important:

* You can select different compute engines to do the computations stored in the lazy pipeline (e.g. you can parallelize the computation, you can run it on GPUs, or run it on a remote server, etc.).
* You can describe a series of operations where some of them might drastically reduce the size of the problem before computation begins, e.g. by sampling/selecting/slicing a large dataset for a case when only a small portion is actually used.
* You can run the operation "out of core", i.e., process only one chunk at a time rather than loading all of the data at once, when the same operation must be done on many datapoints that together would not fit into memory.
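
The out-of-core idea in the last bullet can be sketched with plain Python generators. This is a minimal illustration, not how any particular library implements it; the helper names `read_chunks` and `chunked_mean` and the chunk sizes are assumptions made for the example.

```python
from itertools import islice


def read_chunks(values, chunk_size):
    """Yield successive lists of at most chunk_size items from an iterable."""
    iterator = iter(values)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(values, chunk_size=1000):
    """Compute a mean while holding only one chunk in memory at a time."""
    total = 0
    count = 0
    for chunk in read_chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    return total / count


# range() is itself lazy, so the million values are never all in memory at once.
mean = chunked_mean(range(1_000_000), chunk_size=10_000)
print(mean)
```

Only the running `total` and `count` survive between chunks, so peak memory depends on `chunk_size` rather than on the size of the whole dataset.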


Let's look at an example of lazy data in action. A function that eagerly generates the first `n` values of the Fibonacci sequence might look like this:

In [1]:
def fib_eager(n):
    """Return a list of the first n Fibonacci numbers."""
    a, b = 0, 1
    vals = []
    for _ in range(n):
        vals.append(a)
        a, b = b, a + b  # advance to the next pair
    return vals

Calling this function with a value of ten gives the following:

In [150]:
eager = fib_eager(10)
print(eager)

[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]


If you want to generate a lazy sequence of Fibonacci numbers, you can use the `yield` statement, which makes the function return a generator instead of the actual result. The computation is not performed until later, when a value is requested.

In [152]:
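def fib_lazy(n):
    """Yield the first n Fibonacci numbers one at a time."""
    a, b = 0, 1
    for _ in range(n):
        yield a  # execution pauses here until the next value is requested
        a, b = b, a + b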


In [153]:
lazy = fib_lazy(10)

In [154]:
print(lazy)

<generator object fib_lazy at 0x7fd3edeee9d0>


If you want to actually see the values, you can loop through and generate them one at a time.

In [155]:
for n in range(10):
    print(next(lazy))

0
1
1
2
3
5
8
13
21
34


Let's say you only want to show the first 10 values, but the function is set up to generate 1000.  In the eager example, you have to compute all 1000 values and then slice off the first 10.

In [156]:
eager = fib_eager(1000)

In [157]:
len(eager)

1000

In [158]:
eager[0:10]

[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

In the lazy version, you can ask for 1000 values and then compute only the first 10.  The generator does not compute anything until a value is requested, and each request computes only the next value.
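
One way to take only the first 10 values from the lazy version is `itertools.islice` from the standard library. The sketch below repeats the `fib_lazy` definition so that it runs on its own; the `yielded` list is a hypothetical addition used only to record how much work the generator actually did.

```python
from itertools import islice

yielded = []  # hypothetical bookkeeping: records every value the generator produces


def fib_lazy(n):
    a, b = 0, 1
    for _ in range(n):
        yielded.append(a)
        yield a
        a, b = b, a + b


lazy = fib_lazy(1000)               # nothing is computed yet
first_ten = list(islice(lazy, 10))  # pulls exactly ten values from the generator

print(first_ten)
print(len(yielded))
```

Although the generator was set up for 1000 values, `yielded` holds only ten entries after the slice, so the remaining 990 were never computed.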

### HoloViews and Lazy Data

HoloViews is compatible with a few different libraries that make use of lazy evaluation.
As explained in the [HoloViz tutorial on large data](https://holoviz.org/tutorial/Large_Data.html), HoloViews can accept [Dask](https://dask.org/) dataframes just as well as Pandas dataframes.  It has the computational infrastructure to accept lazy data objects, and will evaluate the objects as needed in order to display the relevant information.  In addition, HoloViews has more recently implemented the ability to plot [Ibis](https://ibis-project.org/) objects, making it possible to visualize data stored on a remote database.

The following is a simple example that plots a set of 100 random points stored in a Pandas dataframe, then creates the same plot from a Dask dataframe.

In [165]:
import dask.dataframe as dd
import pandas as pd
import numpy as np
import holoviews as hv
hv.extension('bokeh')

The Pandas dataframe is created.

In [171]:
df = pd.DataFrame({'x': np.random.normal(size=100),
                   'y': np.random.normal(size=100)})

In [173]:
df.head()

|   | x | y |
|---|---|---|
| 0 | -0.063265 | -0.547954 |
| 1 | -0.648552 | 1.573911 |
| 2 | -1.485326 | 1.000877 |
| 3 | -0.165583 | -0.284508 |
| 4 | -0.938139 | -1.153042 |


Calling HoloViews' `Points` on the dataframe produces the following plot:

In [188]:
hv.Points(data=df)

The Dask dataframe is created from the Pandas dataframe.  It is split into 2 partitions, just for the sake of the example.

In [176]:
ddf = dd.from_pandas(df, npartitions=2)

As you can see here, the dataframe's values aren't actually shown unless you call `compute` on the dataframe.

In [177]:
ddf

| npartitions=2 | x | y |
|---|---|---|
| 0 | float64 | float64 |
| 50 | ... | ... |
| 99 | ... | ... |


In [186]:
ddf.compute()

|   | x | y |
|---|---|---|
| 0 | -0.063265 | -0.547954 |
| 1 | -0.648552 | 1.573911 |
| 2 | -1.485326 | 1.000877 |
| 3 | -0.165583 | -0.284508 |
| 4 | -0.938139 | -1.153042 |
| ... | ... | ... |
| 95 | 1.023809 | 0.709338 |
| 96 | -2.320280 | -0.415662 |
| 97 | -0.116019 | -0.656105 |
| 98 | 0.701812 | 0.051164 |


The same plot is created now using the Dask dataframe `ddf` instead of the pandas dataframe:

In [189]:
hv.Points(data=ddf).compute()

### Persist vs Compute

When using lazy evaluation of data, there are two different calls you can make on a lazy data object: `.compute()` or `.persist()`.  The call to `.compute()` does just what it sounds like: it runs the computation stored in your lazy object.  Instead of a task graph, you get the actual result returned to you.  Computing a Dask (or Ibis) dataframe results in a single Pandas dataframe on your local computer. This should only be done if the result can fit into memory.



### When would you want to use .persist() when using HoloViews?

In some cases, it may be useful to obtain results mid-way through a computation, but leave the results split up among the different parallel processes that make up the entire computation.  If you have a Dask cluster that is running a computation on several partitions of a large Dask dataframe, you can use `.persist()` to compute something on each of the partitions in the cluster.  The result of each of those partitions will be stored on their respective nodes.  The object returned to you will now point to those running processes, where the computation is stored in memory.  The object you see is still a lazy object, but some part of the computation is persisting in memory split among many nodes.  The benefit of this is that you do not have to keep running the same computation over and over again every time you need to do a different computation or plot further down the task chain.

If you do any computations before plotting (aggregating or reducing your data, for example), calling `.persist()` on the result avoids evaluating that portion of the task graph repeatedly, e.g. as a user zooms or pans on the plot.  The results are stored in memory on the cluster nodes, ready to be used when needed for your various plots.

### When do you use .compute()?  

Some methods you may use will not support data being stored on multiple nodes.  For example, you may need to sort the entire dataset or look up rows by position with `.iloc`, and these require putting all the data on one node.  Ideally, you would specify a series of evaluations that reduce your data (by aggregating, selecting, slicing, etc.), then call `.compute()` to obtain a result that fits in a single machine's memory, even though it originated from a large distributed dataset.
