# Performance notes

In most cases, minimizing memory usage is Vaex' first priority, and performance comes second. This allows Vaex to work with very large datasets, larger than the available RAM on the machine being used.

However, this sometimes comes at the cost of performance.

## Virtual columns effects

When we add a new column to a DataFrame based on existing columns, Vaex will create a virtual column, e.g.:


In [1]:
import vaex
import numpy as np

x = np.arange(100_000_000, dtype='float64')
df = vaex.from_arrays(x=x)
df['y'] = (df['x'] + 1).log() - np.abs(df['x']**2 + 1).log()

In this DataFrame, `x` uses memory while `y` does not; `y` is evaluated in chunks when needed. To demonstrate the performance implications, let us perform some computations with these columns to force evaluation.

In [2]:
%%time
df.x.mean()

CPU times: user 2.47 s, sys: 0 ns, total: 2.47 s
Wall time: 85.7 ms


array(49999999.5)

In [3]:
%%time
df.y.mean()

CPU times: user 3.47 s, sys: 603 ms, total: 4.07 s
Wall time: 342 ms


array(-17.42068049)

With this example we see that computations using virtual columns can be much slower (here roughly four times the wall time, 342 ms versus 85.7 ms), which is the penalty we pay for saving memory.
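The trade-off can be illustrated without Vaex. The following is only a conceptual sketch in plain Python, not how Vaex is implemented: `virtual_mean`, `y_expression` and `chunk_size` are hypothetical names, and the input is a small stand-in for the 100-million-row column. The "virtual" path recomputes the expression chunk by chunk on every call and never stores the full result, while the "materialized" path stores the result once and reuses it.

```python
import math


def y_expression(v):
    # Same formula as the virtual column `y` in the example above.
    return math.log(v + 1) - math.log(abs(v**2 + 1))


def virtual_mean(x, expression, chunk_size=1024):
    """Mean of expression(x), evaluated chunk by chunk without storing the result."""
    total = 0.0
    count = 0
    for start in range(0, len(x), chunk_size):
        chunk = x[start:start + chunk_size]
        total += sum(expression(v) for v in chunk)  # only one chunk is evaluated at a time
        count += len(chunk)
    return total / count


x = [float(i) for i in range(10_000)]  # small stand-in for the real column

# "Virtual": no extra memory, but the expression is re-evaluated on every call.
lazy = virtual_mean(x, y_expression)

# "Materialized": evaluate once, keep the result in memory, reuse it afterwards.
y_materialized = [y_expression(v) for v in x]
eager = sum(y_materialized) / len(y_materialized)
```

Both paths give the same mean; they differ only in whether the cost is paid in repeated computation or in memory.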

### Materializing the columns

We can ask Vaex to materialize a single column, or all virtual columns, using [df.materialize](https://vaex.io/docs/api.html#vaex.dataframe.DataFrame.materialize). Materialization means that the virtual columns are evaluated and stored in memory.

In [4]:
df_mat = df.materialize()

In [5]:
%%time
df_mat.x.mean()

CPU times: user 1.79 s, sys: 9.64 ms, total: 1.8 s
Wall time: 65.5 ms


array(49999999.5)

In [6]:
%%time
df_mat.y.mean()

CPU times: user 1.54 s, sys: 14.1 ms, total: 1.55 s
Wall time: 64 ms


array(-17.42068049)

We now get equal performance for both columns, since they are both in-memory columns.

### Consideration in backends with multiple workers

It is common in Python to serve a web application with multiple worker processes (e.g. using `gunicorn`). If every worker materializes the virtual columns, each one holds its own copy and a lot of memory is wasted. There are two solutions for this:
 
 - save data to disk
 - materialize a single time

#### Save to disk

One can export the DataFrame to disk in HDF5 or Arrow format as a preprocessing step, and let all workers access the same file. Due to memory mapping, all workers will share the same memory.

For example:
```python
df.export('materialized-data.hdf5', progress=True)
```
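The sharing relies on the operating system's memory mapping rather than on anything specific to Vaex. The sketch below uses only the Python standard library to show the mechanism, under the assumption of a hypothetical raw binary file `column.bin` (not the HDF5 layout Vaex writes): the data is written once, and any process can then map the file read-only and read values without loading the whole file into its own memory.

```python
import mmap
import os
import struct
import tempfile

# Write 1000 float64 values to a file once (stands in for the export step).
values = [float(i) for i in range(1000)]
path = os.path.join(tempfile.mkdtemp(), "column.bin")  # hypothetical file name
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(values)}d", *values))


def read_value(path, index):
    """Read one float64 through a read-only memory map, as a worker might."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Each float64 occupies 8 bytes; only the touched pages are read.
            return struct.unpack_from("<d", mapped, index * 8)[0]


first = read_value(path, 0)
last = read_value(path, 999)
```

When several processes map the same file read-only, the operating system can back all of the mappings with the same physical pages.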


#### Materialize a single time

Gunicorn has the following command line flag:
```
  --preload             Load application code before the worker processes are forked. [False]
```

This tells `gunicorn` to first run your app (a single time), allowing you to do the materialize step. After your script has run, it will fork, and all workers will share the same memory.
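The same option can also be set in a Gunicorn configuration file, where the setting is named `preload_app`. The values below (worker count, bind address) are illustrative assumptions, not recommendations.

```python
# gunicorn.conf.py -- a minimal sketch; the values are illustrative only.

# Load the application once in the master process before forking workers
# (equivalent to passing --preload on the command line).
preload_app = True

# Number of worker processes forked after the application has been loaded.
workers = 4

# Address the server listens on.
bind = "127.0.0.1:8000"
```

It could then be started with something like `gunicorn -c gunicorn.conf.py myapp:app`, where `myapp:app` is a hypothetical module and application name.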


#### Additional tips

A good idea could be to mix the two, and use Vaex' [df.fingerprint](https://vaex.io/docs/api.html#vaex.dataframe.DataFrame.fingerprint) method to cache the file to disk.

E.g.

```python
import vaex
import numpy as np
import os

x = np.arange(100_000_000, dtype='float64')
df = vaex.from_arrays(x=x)
df['y'] = (df['x'] + 1).log() - np.abs(df['x']**2 + 1).log()

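# The fingerprint changes when the data or the virtual columns change.
filename = "vaex-cache-" + df.fingerprint() + ".hdf5"
if not os.path.exists(filename):
    # No cache for this fingerprint yet: evaluate the columns and write them to disk.
    df.export(filename, progress=True)
# Always continue with the memory-mapped file, so that `y` is no longer virtual.
df = vaex.open(filename)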
```

In case the virtual columns change, rerunning will create a new cache file, and changing back will use the previously generated cache file. This is especially useful during development.

In this case, it is still important to let gunicorn run a single process first (using the `--preload` flag), to avoid multiple workers doing the same work.