This Jupyter notebook is dedicated to generating time series data for multiple configurations using the `TSBuilder` class from the `d2c.data_generation.builder` module. The notebook is structured to perform data generation in several phases, leveraging Python’s multiprocessing capabilities to handle multiple data generation tasks simultaneously.

### Breakdown of the Notebook:

1. **Setup and Imports:**
   - The notebook starts by importing necessary libraries such as `Pool` from `multiprocessing` and `TSBuilder` for building time series.
   - Necessary constants like `N_JOBS` (number of parallel jobs) are defined.

2. **Initial Data Generation:**
   - A `run_process` function is defined to handle the generation of time series data. This function attempts to build time series with specific parameters (like the number of variables, neighborhood size, noise standard deviation, etc.) and save them as pickle files.
   - A multiprocessing pool is set up to run these data generation tasks in parallel across various configurations (different processes, variable counts, neighborhood sizes, and noise levels).

3. **Verification of Generated Data:**
   - The notebook includes a check for any missing data files based on expected combinations of parameters. If any files are missing, their names are collected into a list.

4. **Re-Generation of Missing Data:**
   - The missing data files are then targeted for regeneration. Adjustments are made (like changing the seed and increasing the number of max attempts) to ensure successful generation.
   - Another round of multiprocessing is initiated specifically for the missing files.

5. **Final Verification:**
   - A final check is performed to ensure all expected data files have been generated. Any files that are still missing get one more round with a different seed and an even higher `max_attempts`.

6. **Summary:**
   - At the end of the notebook, a check is performed to confirm the total number of files in the data directory, indicating the completion of the data generation process.

### Key Features:
- **Use of Multiprocessing:** To expedite data generation, the notebook utilizes the multiprocessing library to run multiple instances of data generation tasks simultaneously.
- **Error Handling:** Each data generation task wraps the build in a `try`/`except ValueError` block, so a configuration that fails is reported and does not abort the other tasks in the pool.
- **Progressive Problem Solving:** The notebook incrementally addresses issues by adjusting parameters and retrying data generation for failed tasks.

This notebook serves as a comprehensive script for generating a large dataset of time series with varying characteristics, aiming to robustly handle errors and ensure complete generation across specified configurations.
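
The generate, verify, and retry phases described above can be summarised in a few lines. The following is a minimal sketch, not the notebook's actual code: `generate` is a hypothetical callable standing in for the `TSBuilder` call and returning `True` on success, while the `(seed, max_attempts)` pairs are the ones used in the three rounds of this notebook.

```python
# Retry schedule taken from the notebook: (seed, max_attempts) for each round.
RETRY_SCHEDULE = [(42, 200), (24, 400), (0, 1000)]


def generate_with_retries(configs, generate, schedule=RETRY_SCHEDULE):
    """Try every config once per round, keeping only the failures for the next round.

    `generate(config, seed, max_attempts)` is a hypothetical callable that
    returns True when the data for `config` was generated successfully.
    Returns the configs that still failed after the last round.
    """
    pending = list(configs)
    for seed, max_attempts in schedule:
        if not pending:
            break  # nothing left to regenerate
        pending = [c for c in pending if not generate(c, seed, max_attempts)]
    return pending
```

Each round only revisits the configurations that failed in the previous one, which is the same idea as rebuilding the `missing` list between phases.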

In [None]:
#! pip install ../../../.

In [1]:
from multiprocessing import Pool
from d2c.data_generation.builder import TSBuilder

In [2]:
N_JOBS = 55
def run_process(params):
    """
    Run a single process of the data generation.
    """
    process, n_variables, max_neighborhood_size, noise_std = params
    try:
        tsbuilder = TSBuilder(observations_per_time_series=250, 
                              maxlags=5, 
                              n_variables=n_variables, 
                              time_series_per_process=40, 
                              processes_to_use=[process], 
                              noise_std=noise_std, 
                              max_neighborhood_size=max_neighborhood_size, 
                              seed=42, 
                              max_attempts=200,
                              verbose=True)

        tsbuilder.build()
        tsbuilder.to_pickle(f'/home/jpalombarini/td2c/notebooks/paper_td2c/.data/P{process}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std}.pkl')
        print(f'P{process}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std} done')
    except ValueError as e:
        print(f'P{process}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std} failed: {e}')

In [4]:
if __name__ == '__main__':
    # Generate the data for every combination of process, number of variables,
    # neighborhood size and noise level. The data is saved in the .data folder.
    #
    # The `if __name__ == '__main__':` guard makes this block run only when the
    # code is executed directly (as a script or in a notebook cell), where
    # __name__ is '__main__'. When the file is imported as a module, __name__ is
    # the name of the module, so the block is skipped. This matters for
    # multiprocessing start methods that re-import the main module in each
    # worker: without the guard, every worker would try to start its own pool.
    # To reuse the function elsewhere, write "from script import run_process".
    parameters = [(process, n_variables, max_neighborhood_size, noise_std)
                  for process in [1, 2, 3, 4, 6] # , 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20
                  for n_variables in [5, 10] # , 25
                  for max_neighborhood_size in [2, 4] # , 8
                  for noise_std in [0.01]] # , 0.005, 0.001

    with Pool(processes=N_JOBS) as pool:
        pool.map(run_process, parameters)

P3_N5_Nj2_n0.01 done
P4_N5_Nj2_n0.01 done
P2_N5_Nj2_n0.01 done
P6_N5_Nj2_n0.01 done
P3_N5_Nj4_n0.01 done
P1_N5_Nj2_n0.01 done
P2_N5_Nj4_n0.01 done
P6_N5_Nj4_n0.01 done
P4_N5_Nj4_n0.01 done
P1_N5_Nj4_n0.01 done
P3_N10_Nj2_n0.01 done
P6_N10_Nj2_n0.01 done
P3_N10_Nj4_n0.01 done
P4_N10_Nj2_n0.01 done
P4_N10_Nj4_n0.01 done
P2_N10_Nj2_n0.01 done
P1_N10_Nj2_n0.01 done
P6_N10_Nj4_n0.01 done
P1_N10_Nj4_n0.01 done
P2_N10_Nj4_n0.01 done
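
The run above reports success or failure only through printed messages, so failures have to be found afterwards by scanning the data folder. As an alternative, here is a minimal sketch in which each worker returns its status and the pool size is capped by the number of tasks and CPUs. It makes several assumptions: `run_one` is a hypothetical stand-in for `run_process` that simulates a failure instead of calling `TSBuilder`, and it uses the thread-backed `multiprocessing.dummy.Pool`, which exposes the same `map` interface as `multiprocessing.Pool`, so that the sketch runs in any environment.

```python
import os
from multiprocessing.dummy import Pool  # thread-backed, same API as multiprocessing.Pool


def label(params):
    """Build the same identifier the notebook uses in file names and messages."""
    process, n_variables, max_neighborhood_size, noise_std = params
    return f"P{process}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std}"


def run_one(params):
    """Hypothetical stand-in for run_process: returns (label, ok) instead of printing."""
    try:
        if params[0] == 3:  # simulate one configuration that cannot be generated
            raise ValueError("simulated failure")
        return label(params), True
    except ValueError:
        return label(params), False


def run_all(parameters, max_workers=55):
    """Run all tasks and return the labels of the ones that failed."""
    # Never start more workers than tasks or CPUs, and never fewer than one.
    n_workers = max(1, min(max_workers, len(parameters), os.cpu_count() or 1))
    with Pool(processes=n_workers) as pool:
        results = pool.map(run_one, parameters)
    return [name for name, ok in results if not ok]
```

With this pattern the list of failed configurations is available directly from `run_all`, and the file-system check becomes a second, independent verification.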


Let's check for any missing combinations.

In [10]:
import os 
missing = []
for process in [1, 2, 3, 4, 6]: # , 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20
    for n_variables in [5,10]: # ,25
        for max_neighborhood_size in [2,4]: # ,8
            for noise_std in [0.01]: # , 0.005, 0.001
                filename = f'/home/jpalombarini/td2c/notebooks/paper_td2c/.data/P{process}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std}.pkl'
                if not os.path.exists(filename):
                    missing.append(filename)


In [11]:
missing

[]
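
The nested loops used for this check can also be written with `itertools.product` and `pathlib`. The following is a minimal sketch under the assumption that the files follow the naming pattern used in this notebook; the helper names `expected_files` and `find_missing` are hypothetical and not part of `d2c`.

```python
from itertools import product
from pathlib import Path


def expected_files(data_dir, processes, n_variables, neighborhood_sizes, noise_stds):
    """Yield one expected pickle path per parameter combination."""
    for p, n, nj, s in product(processes, n_variables, neighborhood_sizes, noise_stds):
        yield Path(data_dir) / f"P{p}_N{n}_Nj{nj}_n{s}.pkl"


def find_missing(data_dir, processes, n_variables, neighborhood_sizes, noise_stds):
    """Return the expected paths that do not exist on disk."""
    return [
        path
        for path in expected_files(
            data_dir, processes, n_variables, neighborhood_sizes, noise_stds
        )
        if not path.exists()
    ]
```

Keeping the grid in one place means the generation step and the verification step cannot drift apart when values are commented in or out.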

Now we focus on recreating the missing ones by changing the seed and increasing `max_attempts`.

In [None]:
from multiprocessing import Pool
from d2c.data_generation.builder import TSBuilder
N_JOBS = 55
def run_process(params):
    process, n_variables, max_neighborhood_size, noise_std = params
    try:  # we change the seed and increase the max_attempts
        tsbuilder = TSBuilder(observations_per_time_series=250, 
                              maxlags=5, 
                              n_variables=n_variables, 
                              time_series_per_process=40, 
                              processes_to_use=[process], 
                              noise_std=noise_std, 
                              max_neighborhood_size=max_neighborhood_size, 
                              seed=24, 
                              max_attempts=400,
                              verbose=True)

        tsbuilder.build()
        tsbuilder.to_pickle(f'/home/jpalombarini/td2c/notebooks/paper_td2c/.data/P{process}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std}.pkl')
        print(f'P{process}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std} done')
    except ValueError as e:
        print(f'P{process}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std} failed: {e}')

if __name__ == '__main__':
    parameters = []
    for missing_file in missing:
        process = int(missing_file.split('/')[-1].split('_')[0][1:])
        n_variables = int(missing_file.split('/')[-1].split('_')[1][1:])
        max_neighborhood_size = int(missing_file.split('/')[-1].split('_')[2][2:])
        noise_std = float(missing_file.split('/')[-1].split('_')[3][1:-4])
        parameters.append((process, n_variables, max_neighborhood_size, noise_std))


    with Pool(processes=N_JOBS) as pool:
        pool.map(run_process, parameters)
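
The cell above recovers the parameters from each file name with repeated `split` calls and fixed slice offsets. A regular expression states the expected pattern once and fails loudly on anything unexpected. This is a minimal sketch assuming the `P{process}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std}.pkl` naming used in this notebook; `parse_params` is a hypothetical helper name.

```python
import re
from pathlib import Path

# Pattern for names such as "P3_N10_Nj4_n0.01.pkl".
_FILENAME_RE = re.compile(r"P(\d+)_N(\d+)_Nj(\d+)_n(.+)\.pkl")


def parse_params(path):
    """Return (process, n_variables, max_neighborhood_size, noise_std) from a file path."""
    name = Path(path).name
    match = _FILENAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    process, n_variables, max_neighborhood_size, noise_std = match.groups()
    return int(process), int(n_variables), int(max_neighborhood_size), float(noise_std)
```

The list of retry parameters could then be built as `[parse_params(f) for f in missing]`.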

Let's check what is still missing

In [None]:
import os 
missing = []
for process in [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20]:
    for n_variables in [5,10,25]:
        for max_neighborhood_size in [2,4,8]:
            for noise_std in [0.01, 0.005, 0.001]:
                filename = f'/home/jpalombarini/td2c/notebooks/paper_td2c/.data/P{process}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std}.pkl'
                if not os.path.exists(filename):
                    missing.append(filename)


In [None]:
missing

We try one last time with an even higher `max_attempts` and a different seed.

In [None]:
from multiprocessing import Pool
from d2c.data_generation.builder import TSBuilder
N_JOBS = max(1, len(missing))  # Pool needs at least one worker, even when nothing is missing
def run_process(params):
    process, n_variables, max_neighborhood_size, noise_std = params
    try:  # we change the seed and increase the max_attempts
        tsbuilder = TSBuilder(observations_per_time_series=250, 
                              maxlags=5, 
                              n_variables=n_variables, 
                              time_series_per_process=40, 
                              processes_to_use=[process], 
                              noise_std=noise_std, 
                              max_neighborhood_size=max_neighborhood_size, 
                              seed=0, 
                              max_attempts=1000,
                              verbose=True)

        tsbuilder.build()
        tsbuilder.to_pickle(f'/home/jpalombarini/td2c/notebooks/paper_td2c/.data/P{process}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std}.pkl')
        print(f'P{process}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std} done')
    except ValueError as e:
        print(f'P{process}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std} failed: {e}')

if __name__ == '__main__':
    parameters = []
    for missing_file in missing:
        process = int(missing_file.split('/')[-1].split('_')[0][1:])
        n_variables = int(missing_file.split('/')[-1].split('_')[1][1:])
        max_neighborhood_size = int(missing_file.split('/')[-1].split('_')[2][2:])
        noise_std = float(missing_file.split('/')[-1].split('_')[3][1:-4])
        parameters.append((process, n_variables, max_neighborhood_size, noise_std))
    # Run run_process in parallel, once for each tuple in `parameters`.
    # The `if __name__ == '__main__':` guard ensures this block only runs when the code is executed directly, not when it is imported.

    with Pool(processes=N_JOBS) as pool:
        pool.map(run_process, parameters)

All time series have been generated correctly

In [1]:
import os  # needed here because the kernel was restarted before running this cell

len(os.listdir('/home/jpalombarini/td2c/notebooks/paper_td2c/.data'))