In [1]:
import pandas as pd
import numpy as np
import os
import random
import dill

Step 1: Save each multivariate time series (MTS) to its own numpy binary file (.npy):

In [2]:
# Generate a multivariate time-series dataset for this example
random.seed(42)
M = 3 # 3 independent processes
T = 100 # 100 samples per process
dataset = np.random.randn(M,T) # generate our multivariate time-series
dataset[:,:5] # Print the first five time points for each process

array([[-0.79067304, -0.08754497, -0.43380599, -1.13602887, -0.34579843],
       [ 0.53905942, -0.37755445,  1.15316678, -0.65020959,  0.10962188],
       [ 1.10066567,  0.01289209,  0.34136527, -0.26922331,  0.52186689]])

We'll generate three example datasets that are each stored as dictionary entries, then iterate over the dictioanry to save one .npy file per MTS dataset. Note that by default, each process is $z$-scored along the time domain, so no need to normalize the data beforehand -- although you can disable this functionality by setting `normalise=False` in line 108 of `distribute_jobs.py`.

In [3]:
# Generate dictionary of 3 MTS datasets as an example
MTS_datasets = {"Dataset_"+str(i) : np.random.randn(M,T) for i in range(3)}

# Save the datasets to files
for i, dataset in enumerate(MTS_datasets):
    np.save('example_data/multivariate_time_series_{}.npy'.format(i), dataset)

# Define the YAML file
yaml_file = "example_data/sample.yaml"

# Use ps dimension order to indicate that processes are the rows while timepoints are the columns
dim_order = "ps"

# Iterate over the keys and values of the dictionary
for key, value in MTS_datasets.items():
    # Define template string and fill in variables
    yaml_string = "{{file: example_data/{key}.npy, name: {key}, dim_order: {dim_order}, labels: [{key}]}}\n"
    yaml_string_formatted = f"{yaml_string.format(key=key, dim_order=dim_order)}"

    # Append line to file
    with open(yaml_file, "a") as f:
        f.write(yaml_string_formatted)



Note that here we set the MTS name to e.g. "Dataset_0" as well as the labels, but you can use the `labels` argument to include any metadata about a given MTS that you wish.

Now that we've saved our MTS datasets to .npy files and generated the configuration file, we're ready to submit PBS jobs through `pyspi-distribute`. Use the following as a guide:

```
cmd="python3 distribute_jobs.py --data_dir example_data/ --calc_file_name calc.pkl --compute_file pyspi_compute.py \
--template_pbs_file template.pbs --sample_yaml example_data/sample.yaml --pbs_notify a --email your_email_here \
--conda_env your_conda_env_here --queue your_PBS_queue_here --walltime_hrs 3 --cpu 2 --mem 20 --table_only"

echo $cmd
$cmd
```

Note that in the above command line code, you can customize the name of the pickle file (here we use `calc.pkl` as a standard) and you should input your email, conda environment, and PBS queue name as appropriate. The walltime hours, # CPUs, and memory requests are all examples---you should also do a trial run or two with a minimal example dataset to get a sense of the time/memory requirements for your dataset before submitting all of the jobs. We also include the optional flag `--table_only` such that only the SPI results table is saved as opposed to the entire `Calculator` object, but you can omit this if you wish to save the whole object.

Once an individual PBS job is completed, you will find a new folder in your data directory (`example_data/` here) with the corresponding sample name (e.g., `Dataset_0`) that contains job output information as well as the saved `pyspi` computation result in `calc.pkl`. Since we set the `--table_only` flag, we can read in `calc.pkl` to get the SPI results table:

In [8]:
with open('example_data/Dataset_0/calc.pkl', 'rb') as f:
    Dataset_0_res = dill.load(f)

# Print the results for the empirical covariance, which here is equivalent to the Pearson correlation
Dataset_0_res['cov_EmpiricalCovariance']

process,proc-0,proc-1,proc-2
proc-0,,-0.098141,0.134365
proc-1,-0.098141,,0.08786
proc-2,0.134365,0.08786,
