In [None]:
#| default_exp test

In [None]:
#| export
from data_harmonising.data import *
from fastcore.utils import *

In [None]:
#| export
import pandas as pd
import dask.dataframe as dd
import pyreadstat
import pyspssio

## Handling data

### Ideas

- Persistent storage (e.g. Parquet)
- In-memory caching
- Batch processing/reading SPSS files
  - Parallel processing reading SPSS files
- Using profiling tools, like cProfile, to review time spent on each function
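
For the profiling idea, here is a minimal sketch using the standard library's `cProfile` and `pstats`. The helper `profile_call` and the stand-in function `slow_sum` are hypothetical names for illustration; in practice one of the SPSS readers would be passed in instead.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_sum(n):
    # Hypothetical stand-in for an expensive loader.
    return sum(i * i for i in range(n))


result, report = profile_call(slow_sum, 10_000)
print(report)
```

The same helper could wrap a reader call, for example `profile_call(pd.read_spss, file)`, to see where the load time goes.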

Using a persistent storage format:
```
# Save to Parquet (needs a Parquet engine such as pyarrow or fastparquet)
df.to_parquet('data.parquet')

# Save to HDF5 (needs PyTables installed)
df.to_hdf('data.h5', key='df', mode='w')
```

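Caching with `joblib.Memory` (which memoizes results on disk under `./cachedir`):
```
import pandas as pd
from joblib import Memory

# Results are stored in ./cachedir and reused when the arguments match
memory = Memory('./cachedir', verbose=0)

@memory.cache
def load_data(file_path):
    return pd.read_spss(file_path)

df = load_data('path_to_file.sav')
```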

Compare how quickly each library reads the same SPSS file into a dataframe.  
(Note: use `%timeit` or another benchmarking mechanism.)

In [None]:
file = "../data/G227_Q.sav"

In [None]:
# %%timeit
df_pd = pd.read_spss(file)

1.54 s ± 51.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
# %%timeit
df_prs, meta_prs = pyreadstat.read_sav(file)

In [None]:
# %%timeit
df_pys, meta_pys = pyspssio.read_sav(file)

`pyreadstat` and `pyspssio` perform similarly, and both are significantly faster than `pandas`.

How does that change when the load is cached with `joblib`?

In [None]:
from joblib import Memory

memory = Memory('./cachedir', verbose=0)

@memory.cache
def load_data(file_path):
    return pyreadstat.read_sav(file_path)

df, meta = load_data(file)

In [None]:
%%timeit
df, meta = load_data(file)

67.2 ms ± 635 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


That clearly makes a huge difference when re-loading. However, after restarting the kernel the first load still takes a while, which is the primary issue.  
Let's try converting the data to Parquet format now, and see how quickly that loads.  

In [None]:
PARQUET_FILE = '../data/G227_Q.parquet'

In [None]:
df.to_parquet(PARQUET_FILE)

In [None]:
%%timeit
pq_pd = pd.read_parquet(PARQUET_FILE)

117 ms ± 2.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [None]:
%%timeit
pq_dd = dd.read_parquet(PARQUET_FILE)

386 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
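
The convert-once pattern used here can be wrapped in a small helper, so that the slow SPSS read happens only when no up-to-date converted file exists. This is a sketch under assumptions: `load_via_cache` and its reader/writer parameters are hypothetical names, and the demonstration uses plain text files in place of `.sav` and `.parquet` so it runs without any data files.

```python
from pathlib import Path
import tempfile


def load_via_cache(source, read_source, read_cache, write_cache, suffix=".parquet"):
    """Read `source` once, then serve later calls from a converted copy."""
    source = Path(source)
    cache = source.with_suffix(suffix)
    # Reuse the converted file only if it is at least as new as the source.
    if cache.exists() and cache.stat().st_mtime >= source.stat().st_mtime:
        return read_cache(cache)
    data = read_source(source)
    write_cache(data, cache)
    return data


# Toy demonstration: text files stand in for the real formats.
calls = []


def slow_reader(path):
    calls.append(path.name)  # record each expensive read
    return path.read_text()


def write_text(data, path):
    path.write_text(data)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "survey.txt"
    src.write_text("raw data")
    first = load_via_cache(src, slow_reader, Path.read_text, write_text, suffix=".cache")
    second = load_via_cache(src, slow_reader, Path.read_text, write_text, suffix=".cache")
```

With the libraries in this notebook, the call might look like `load_via_cache(file, lambda p: pyreadstat.read_sav(str(p))[0], pd.read_parquet, lambda df, p: df.to_parquet(p))`. Note that this keeps only the dataframe, not the SPSS metadata.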


Of course, let's not forget to validate that no data has been lost in the conversion.  
We'll read the Parquet file back into pandas and compare it with the original dataframe.

In [None]:
pq_pd = pd.read_parquet(PARQUET_FILE)

In [None]:
df.compare(pq_pd)
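
An empty result from `compare` means no cell values differ, but a value comparison alone may not flag a changed dtype. A stricter check is sketched below with `pandas.testing.assert_frame_equal`, which also compares dtypes by default. The helper `frames_match` and the toy frames are hypothetical and stand in for `df` and `pq_pd`.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_match(left, right):
    """Return (True, "") if the frames are equal, else (False, reason)."""
    try:
        assert_frame_equal(left, right)
    except AssertionError as err:
        return False, str(err)
    return True, ""


# Toy frames standing in for the original and the round-tripped data.
original = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 1.5, 2.5]})
same = original.copy()
changed_dtype = original.astype({"id": "float64"})

ok_same, _ = frames_match(original, same)
ok_changed, message = frames_match(original, changed_dtype)
print(ok_same, ok_changed)
print(message)
```

In the notebook itself, the equivalent call would be `frames_match(df, pq_pd)`.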

## Handling metadata

Next, check how metadata is handled by the different libraries and methods.  
Check for both speed and accuracy, since some libraries drop data in specific cases.
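
One way to check accuracy is to diff the variable labels that two readers return. The sketch below works on plain dictionaries mapping variable names to labels. `diff_labels` and the sample labels are hypothetical. For `pyreadstat`, such a mapping is available as `meta.column_names_to_labels`; the equivalent mapping for the other library is an assumption to verify against its documentation.

```python
def diff_labels(left, right):
    """Return {name: (left_label, right_label)} for labels that differ or are missing."""
    diffs = {}
    for name in sorted(set(left) | set(right)):
        a, b = left.get(name), right.get(name)
        if a != b:
            diffs[name] = (a, b)
    return diffs


# Hypothetical label mappings, as two different readers might return them.
labels_a = {"q1": "Age", "q2": "Region", "q3": "Income"}
labels_b = {"q1": "Age", "q2": None}

differences = diff_labels(labels_a, labels_b)
print(differences)
```

An empty dictionary would indicate that both readers agree on every variable label; the same approach could be applied to value labels or formats.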