# Parallel processing with Pastastore

This notebook demonstrates the parallel processing capabilities of `PastaStore`.


<div class="alert alert-warning">

<strong>Note</strong> 

Parallel processing is platform dependent and may not always work. The current
implementation works well on Linux, though this may change in newer Python
versions (Python 3.14 changes the default multiprocessing start method on Linux
from `fork` to `forkserver`). On Windows, parallel solving does not work when
called directly from Jupyter Notebooks or IPython. To use parallel solving on
Windows, put the following code in a Python file and run that file.

</div>

```python
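from multiprocessing import freeze_support

# `pstore` (a PastaStore) and `some_func` (a function that takes a model name)
# are assumed to be defined at module level above this guard.

if __name__ == "__main__":
    freeze_support()  # only has an effect in a frozen Windows executable
    pstore.apply("models", some_func, parallel=True)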
```

In [1]:
import pastas as ps

import pastastore as pst
from pastastore.datasets import example_pastastore

ps.logger.setLevel("ERROR")  # silence Pastas logger for this notebook
pst.show_versions()

Pastastore version : 1.11.0

Python version     : 3.13.7
Pandas version     : 2.3.3
Matplotlib version : 3.10.6
Pastas version     : 1.11.0
PyYAML version     : 6.0.3



## Example pastastore

Load some example data, create models and solve them to showcase parallel processing.

In [2]:
# get the example pastastore
conn = pst.PasConnector("my_connector", "./temp")
# conn = pst.ArcticDBConnector("my_connector", "lmdb://./temp")
pstore = example_pastastore(conn)
pstore.create_models_bulk();

PasConnector: library 'oseries' created in '/home/david/github/pastastore/examples/notebooks/temp/my_connector/oseries'
PasConnector: library 'stresses' created in '/home/david/github/pastastore/examples/notebooks/temp/my_connector/stresses'
PasConnector: library 'models' created in '/home/david/github/pastastore/examples/notebooks/temp/my_connector/models'
PasConnector: library 'oseries_models' created in '/home/david/github/pastastore/examples/notebooks/temp/my_connector/oseries_models'
PasConnector: library 'stresses_models' created in '/home/david/github/pastastore/examples/notebooks/temp/my_connector/stresses_models'



## Solving models

The `PastaStore.solve_models()` method supports parallel processing.

In [3]:
pstore.solve_models(parallel=True)


## Parallel processing using `.apply()`

Define a function that takes a model name as input and returns a result. In this case,
the function returns the $R^2$ value of the model.

In [4]:
def rsq(model_name: str) -> float:
    """Compute the R-squared value of a Pastas model."""
    ml = pstore.get_models(model_name)
    return ml.stats.rsq()

We can apply this function to all models in the pastastore using `pstore.apply()`.
By default, the function is applied to the models sequentially.

In [5]:
pstore.apply("models", rsq, progressbar=True)


head_nb5    0.438129
head_mw     0.159318
oseries1    0.904487
oseries2    0.931883
oseries3    0.030468
dtype: float64

To run the function in parallel, pass `parallel=True` as a keyword argument.

In [6]:
pstore.apply("models", rsq, progressbar=True, parallel=True)


head_nb5    0.438129
head_mw     0.159318
oseries1    0.904487
oseries2    0.931883
oseries3    0.030468
dtype: float64
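
Conceptually, an `apply`-style helper maps a function over a list of names and collects the results by name, either in a plain loop or through a worker pool. The sketch below illustrates that idea with the standard library only. The names `apply_sketch`, `name_length`, and the `executor_cls` parameter are hypothetical and are not taken from the pastastore source, so treat it as an illustration rather than a description of the actual implementation.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from functools import partial


def apply_sketch(names, func, kwargs=None, parallel=False, max_workers=None,
                 executor_cls=ProcessPoolExecutor):
    """Map func over names and return {name: result} (illustrative only)."""
    # Bind the keyword arguments once so each task only needs the name.
    task = partial(func, **(kwargs or {}))
    if not parallel:
        return {name: task(name) for name in names}
    # Executor.map preserves input order, so names and results line up.
    with executor_cls(max_workers=max_workers) as pool:
        return dict(zip(names, pool.map(task, names)))


def name_length(name, offset=0):
    """Toy stand-in for a per-model computation."""
    return len(name) + offset


names = ["head_nb5", "head_mw", "oseries1"]
sequential = apply_sketch(names, name_length, kwargs={"offset": 1})
# Threads are used here so the demo runs in any environment.
threaded = apply_sketch(names, name_length, kwargs={"offset": 1},
                        parallel=True, executor_cls=ThreadPoolExecutor)
print(sequential == threaded)  # prints True
```

With the default `ProcessPoolExecutor`, the function must be picklable (defined at module level), and on Windows the call belongs under the `if __name__ == "__main__":` guard described in the note at the top of this notebook.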

## Get model statistics

The function `pstore.get_statistics` also supports parallel processing.

In [7]:
pstore.get_statistics(["rsq", "mae"])

|  | rsq | mae |
|---|---|---|
| head_nb5 | 0.438129 | 0.318361 |
| head_mw | 0.159318 | 0.631517 |
| oseries1 | 0.904487 | 0.091329 |
| oseries2 | 0.931883 | 0.087067 |
| oseries3 | 0.030468 | 0.106254 |


In [8]:
pstore.get_statistics(["rsq", "mae"], parallel=True)

| _get_statistics | rsq | mae |
|---|---|---|
| head_nb5 | 0.438129 | 0.318361 |
| head_mw | 0.159318 | 0.631517 |
| oseries1 | 0.904487 | 0.091329 |
| oseries2 | 0.931883 | 0.087067 |
| oseries3 | 0.030468 | 0.106254 |


## Compute prediction intervals

Next, pass a more complex function to `apply` and run it in parallel. In this
case we compute the prediction interval and pass the $\alpha$ value along via
the keyword arguments. The interval is estimated from random draws, so the
sequential and parallel results differ slightly between runs.

In [9]:
def prediction_interval(model_name, **kwargs):
    """Compute the prediction interval for a Pastas model."""
    ml = pstore.get_models(model_name)
    return ml.solver.prediction_interval(**kwargs)

In [10]:
pstore.apply("models", prediction_interval, kwargs={"alpha": 0.05})


|  | head_nb5 (0.025) | head_nb5 (0.975) | head_mw (0.025) | head_mw (0.975) | oseries1 (0.025) | oseries1 (0.975) | oseries2 (0.025) | oseries2 (0.975) | oseries3 (0.025) | oseries3 (0.975) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1960-04-29 |  |  | 6.308534 | 9.328015 |  |  |  |  |  |  |
| 1960-04-30 |  |  | 6.333755 | 9.493416 |  |  |  |  |  |  |
| 1960-05-01 |  |  | 6.293151 | 9.428607 |  |  |  |  |  |  |
| 1960-05-02 |  |  | 6.211644 | 9.640736 |  |  |  |  |  |  |
| 1960-05-03 |  |  | 6.098605 | 9.346611 |  |  |  |  |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-01-17 | 7.895097 | 9.660782 |  |  |  |  |  |  |  |  |
| 2020-01-18 | 7.933106 | 9.608321 |  |  |  |  |  |  |  |  |
| 2020-01-19 | 7.952470 | 9.699408 |  |  |  |  |  |  |  |  |
| 2020-01-20 | 7.873977 | 9.651855 |  |  |  |  |  |  |  |  |


In [11]:
pstore.apply("models", prediction_interval, kwargs={"alpha": 0.05}, parallel=True)


|  | head_nb5 (0.025) | head_nb5 (0.975) | head_mw (0.025) | head_mw (0.975) | oseries1 (0.025) | oseries1 (0.975) | oseries2 (0.025) | oseries2 (0.975) | oseries3 (0.025) | oseries3 (0.975) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1960-04-29 |  |  | 6.189416 | 9.571070 |  |  |  |  |  |  |
| 1960-04-30 |  |  | 6.260067 | 9.415113 |  |  |  |  |  |  |
| 1960-05-01 |  |  | 6.173244 | 9.503560 |  |  |  |  |  |  |
| 1960-05-02 |  |  | 6.182438 | 9.475765 |  |  |  |  |  |  |
| 1960-05-03 |  |  | 6.252413 | 9.440298 |  |  |  |  |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-01-17 | 7.950277 | 9.630808 |  |  |  |  |  |  |  |  |
| 2020-01-18 | 7.947599 | 9.664797 |  |  |  |  |  |  |  |  |
| 2020-01-19 | 8.003087 | 9.666395 |  |  |  |  |  |  |  |  |
| 2020-01-20 | 7.940711 | 9.688187 |  |  |  |  |  |  |  |  |
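
The column labels `0.025` and `0.975` in the outputs above correspond to the quantile levels of a two-sided interval with $\alpha = 0.05$, namely $\alpha/2$ and $1 - \alpha/2$. A small sketch of that arithmetic follows; `interval_quantiles` is a hypothetical helper written for this illustration and is not part of Pastas or pastastore.

```python
def interval_quantiles(alpha):
    """Return the lower and upper quantile levels of a two-sided interval."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha / 2, 1 - alpha / 2


lower, upper = interval_quantiles(0.05)
print(f"{lower:.3f}, {upper:.3f}")  # prints "0.025, 0.975"
```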


## Get signatures

The function `pstore.get_signatures` does not explicitly support parallel processing, but it can be run in parallel by passing it to `pstore.apply`.

In [12]:
signatures = [
    "cv_period_mean",
    "cv_date_min",
    "cv_date_max",
    "cv_fall_rate",
    "cv_rise_rate",
]

In [13]:
pstore.get_signatures(signatures=signatures)

|  | head_nb5 | head_mw | oseries1 | oseries2 | oseries3 |
|---|---|---|---|---|---|
| cv_period_mean | 0.061879 | 0.145062 | 0.013066 | 0.015199 | 0.029168 |
| cv_date_min | 0.246021 | 0.254627 | 0.145884 | 0.128636 | 1.394852 |
| cv_date_max | 1.262425 | 1.083929 | 0.300328 | 0.722945 | 0.444442 |
| cv_fall_rate | -1.13645 | -1.4302 | -0.744797 | -0.722718 | -1.032837 |
| cv_rise_rate | 1.25945 | 1.097257 | 0.862981 | 0.836678 | 0.931181 |


In [14]:
pstore.apply(
    "oseries", pstore.get_signatures, kwargs={"signatures": signatures}, parallel=True
)


| get_signatures | head_nb5 | head_mw | oseries1 | oseries2 | oseries3 |
|---|---|---|---|---|---|
| cv_period_mean | 0.061879 | 0.145062 | 0.013066 | 0.015199 | 0.029168 |
| cv_date_min | 0.246021 | 0.254627 | 0.145884 | 0.128636 | 1.394852 |
| cv_date_max | 1.262425 | 1.083929 | 0.300328 | 0.722945 | 0.444442 |
| cv_fall_rate | -1.13645 | -1.4302 | -0.744797 | -0.722718 | -1.032837 |
| cv_rise_rate | 1.25945 | 1.097257 | 0.862981 | 0.836678 | 0.931181 |


## Load models

Load models with `pstore.apply`, first sequentially and then in parallel.

In [15]:
pstore.apply("models", pstore.get_models, fancy_output=True)


{'head_nb5': Model(oseries=head_nb5, name=head_nb5, constant=True, noisemodel=False),
 'head_mw': Model(oseries=head_mw, name=head_mw, constant=True, noisemodel=False),
 'oseries1': Model(oseries=oseries1, name=oseries1, constant=True, noisemodel=False),
 'oseries2': Model(oseries=oseries2, name=oseries2, constant=True, noisemodel=False),
 'oseries3': Model(oseries=oseries3, name=oseries3, constant=True, noisemodel=False)}

The `max_workers` keyword argument sets the number of workers that are spawned. The default value is often fine, but it can be set explicitly.

The following works for `PasConnector`. For `ArcticDBConnector`, see the workaround in the next section.

In [16]:
pstore.apply(
    "models", pstore.get_models, fancy_output=True, parallel=True, max_workers=5
)


{'head_nb5': Model(oseries=head_nb5, name=head_nb5, constant=True, noisemodel=False),
 'head_mw': Model(oseries=head_mw, name=head_mw, constant=True, noisemodel=False),
 'oseries1': Model(oseries=oseries1, name=oseries1, constant=True, noisemodel=False),
 'oseries2': Model(oseries=oseries2, name=oseries2, constant=True, noisemodel=False),
 'oseries3': Model(oseries=oseries3, name=oseries3, constant=True, noisemodel=False)}
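
One way to reason about a value for `max_workers`: starting more workers than there are tasks only adds start-up cost, and starting more than the machine has cores rarely helps for CPU-bound work. The sketch below encodes that rule of thumb. `choose_workers` is a hypothetical helper and does not describe how pastastore picks its default.

```python
import os


def choose_workers(n_tasks, requested=None):
    """Pick a worker count that never exceeds the number of tasks."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    # os.cpu_count() may return None when the count cannot be determined.
    available = requested if requested is not None else (os.cpu_count() or 1)
    return max(1, min(available, n_tasks))


print(choose_workers(5, requested=8))  # prints 5
print(choose_workers(5, requested=2))  # prints 2
```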

## ArcticDBConnector workaround

For `ArcticDBConnector`, the underlying database connection objects cannot be pickled, and pickling is how Python's multiprocessing sends functions and arguments to worker processes. Passing bound methods of a `PastaStore` or `ArcticDBConnector` instance therefore does not work in parallel mode.
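
The failure mode can be reproduced with the standard library alone. In the sketch below, `FakeConnection` is a hypothetical stand-in for a database handle (it is not the ArcticDB API) that holds a lock, a typical unpicklable resource. Pickling the object fails, and so does pickling one of its bound methods, because a bound method carries its instance along.

```python
import pickle
import threading


class FakeConnection:
    """Stand-in for a database handle that holds an unpicklable resource."""

    def __init__(self, uri):
        self.uri = uri
        self._lock = threading.Lock()  # locks cannot be pickled

    def get_model(self, name):
        return f"{name} from {self.uri}"


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


fake_conn = FakeConnection("lmdb://./temp")
print(is_picklable(fake_conn))            # prints False
print(is_picklable(fake_conn.get_model))  # prints False
print(is_picklable(fake_conn.uri))        # prints True
```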

**The workaround:** The `_parallel()` method uses an initializer that creates a new connector instance in each worker process and stores it in a global `conn` variable. Your custom functions can then access this connector to retrieve data from the database.

This is a common Python pattern for using unpicklable objects with multiprocessing. See the [Python documentation](https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor) for details on the `initializer` argument.
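
The initializer pattern can be sketched with the standard library alone. Everything in the block below (`FakeConnection`, `init_worker`, `load_models`, the `executor_cls` parameter) is a hypothetical illustration and not pastastore's `_parallel()` implementation. The point is that only the picklable connection string travels to the workers, and each worker builds its own connection from it.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

_worker_conn = None  # one connection per worker, set by the initializer


class FakeConnection:
    """Stand-in for an unpicklable database connection."""

    def __init__(self, uri):
        self.uri = uri

    def get_model(self, name):
        return f"{name} loaded via {self.uri}"


def init_worker(uri):
    """Run once in each worker: build the connection from a picklable URI."""
    global _worker_conn
    _worker_conn = FakeConnection(uri)


def get_model(name):
    """Task function: use the worker-local connection, not a pickled one."""
    return _worker_conn.get_model(name)


def load_models(names, uri, max_workers=2, executor_cls=ProcessPoolExecutor):
    """Load all names through a pool whose workers each own a connection."""
    with executor_cls(
        max_workers=max_workers, initializer=init_worker, initargs=(uri,)
    ) as pool:
        return dict(zip(names, pool.map(get_model, names)))


# Threads are used here so the demo runs in any environment. Omit
# `executor_cls` to use processes (under the __main__ guard on Windows).
print(load_models(["head_nb5", "head_mw"], "lmdb://./temp",
                  executor_cls=ThreadPoolExecutor))
```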

**Example:** Write a simple function that uses the global `conn` variable to access the database:

In [17]:
# Simple function to get models from database
def get_model(model_name):
    """Get model using global connector (ArcticDBConnector workaround).

    The global 'conn' variable is set by the _parallel() initializer
    in each worker process, providing access to an ArcticDBConnector instance.
    """
    return conn.get_model(model_name)

In [18]:
pstore.apply("models", get_model, fancy_output=True, parallel=True, max_workers=5)


{'head_nb5': Model(oseries=head_nb5, name=head_nb5, constant=True, noisemodel=False),
 'head_mw': Model(oseries=head_mw, name=head_mw, constant=True, noisemodel=False),
 'oseries1': Model(oseries=oseries1, name=oseries1, constant=True, noisemodel=False),
 'oseries2': Model(oseries=oseries2, name=oseries2, constant=True, noisemodel=False),
 'oseries3': Model(oseries=oseries3, name=oseries3, constant=True, noisemodel=False)}

Clean up the temporary pastastore.

In [19]:
pst.util.delete_pastastore(pstore)

Deleting PasConnector database: 'my_connector' ...
Done!
