# Dataset Configs

This notebook creates dataset configurations using the datasets in [GIFT-Eval](https://github.com/SalesforceAIResearch/gift-eval) to train the models proposed in the [Ensemble TEMPO slides](https://docs.google.com/presentation/d/1GxL_qUQKizv5C_RPzxYiJjxJSPD6-81x1VPOxxYXAl0/edit?slide=id.g35d6bf22e47_0_10#slide=id.g35d6bf22e47_0_10).

Ensemble TEMPO will have models at three different granularities:
- **General** 
  - Three models across three forecasting terms (short, medium, and long)
- **Domain-specific** 
  - 15 models across seven domains and three forecasting terms
- **Dataset-specific** 
  - 97+ models, one per dataset and term combination
  - **Note:** Dataset-specific models will only be trained on the train-test split

For each split (pretraining and train-test), we need to select the datasets used to train the:
- Three general models
- 15 domain-specific models
- 97+ dataset-specific models (only for train-test)

First, we'll get the dataset configurations for the pretraining split. Then, we'll get the dataset configurations for the train-test split.

## Pretraining Split

Load the pretraining split's metadata.

In [128]:
import pandas as pd
from pathlib import Path
from utils import SplitType

split_type = SplitType.PRETRAIN
metadata_path = Path("resources") / split_type / "metadata.csv"

df = pd.read_csv(metadata_path)

print(f"Number of unique {split_type} datasets: {df['name'].nunique()}")
print(f"Number of unique {split_type} terms: {df['term'].nunique()}")
print(f"Total number of {split_type} datasets: {len(df)}")
df.head()

Number of unique pretrain datasets: 152
Number of unique pretrain terms: 3
Total number of pretrain datasets: 456


|   | name | term | freq | domain | num_series | target_dim | _min_series_length | sum_series_length | prediction_length | windows |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | bull | short | H | Energy | 41 | 1 | 17544 | 719304 | 48 | 20 |
| 1 | bull | medium | H | Energy | 41 | 1 | 17544 | 719304 | 480 | 4 |
| 2 | bull | long | H | Energy | 41 | 1 | 17544 | 719304 | 720 | 3 |
| 3 | cmip6_1885 | short | 6H | Climate | 434176 | 53 | 7300 | 3169484800 | 48 | 16 |
| 4 | cmip6_1885 | medium | 6H | Climate | 434176 | 53 | 7300 | 3169484800 | 480 | 2 |
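
The counts above (152 names and 456 rows across 3 terms) suggest that every dataset appears exactly once per term. A minimal sketch of how that could be checked, using a toy frame as a stand-in for `metadata.csv` (the rows are illustrative, not the full metadata):

```python
import pandas as pd

# Toy stand-in for the metadata; the real frame is loaded from metadata.csv.
toy_df = pd.DataFrame(
    {
        "name": ["bull"] * 3 + ["cmip6_1885"] * 3,
        "term": ["short", "medium", "long"] * 2,
    }
)

# Number of distinct terms recorded for each dataset name.
terms_per_name = toy_df.groupby("name")["term"].nunique()

# True only if every dataset has all three terms and no (name, term) row repeats.
has_all_terms = bool((terms_per_name == 3).all())
has_duplicates = bool(toy_df.duplicated(["name", "term"]).any())
is_complete = has_all_terms and not has_duplicates
print(is_complete)
```

Running the same check on the real `df` would confirm that the general configurations for the three terms contain the same dataset names.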


### General

View all of the unique forecasting terms and count the number of datasets that belong to each term.
- Each forecasting term multiplies the original prediction length by a multiplier:
  - "long" multiplies the original prediction length by 15
  - "medium" multiplies the original prediction length by 10  
  - "short" multiplies the original prediction length by 1 (no change)
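
As a quick check on these multipliers, the sketch below derives the per-term prediction length from a base length. `TERM_MULTIPLIERS` and `prediction_length_for` are illustrative names, not part of this notebook's `utils` module; the base length of 48 is the `bull` dataset's short-term `prediction_length` from the metadata preview.

```python
# Multipliers as described in the list above; the names are hypothetical helpers.
TERM_MULTIPLIERS = {"short": 1, "medium": 10, "long": 15}


def prediction_length_for(base_length: int, term: str) -> int:
    """Scale a dataset's base (short-term) prediction length by its term."""
    return base_length * TERM_MULTIPLIERS[term]


# `bull` has a short-term prediction length of 48 in the metadata preview.
lengths = {term: prediction_length_for(48, term) for term in TERM_MULTIPLIERS}
print(lengths)  # {'short': 48, 'medium': 480, 'long': 720}
```

The results match the `prediction_length` column of the first three metadata rows (48, 480, 720).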

In [129]:
def get_count_df(
    df: pd.DataFrame,
    columns: list[str],
    ascending: bool = False,
) -> pd.DataFrame:
    """
    Counts the rows for each unique combination of values in the
    specified columns of the given DataFrame.

    Args:
        df (pd.DataFrame): The input DataFrame.
        columns (list[str]): A list of column names to group by.
        ascending (bool, optional): If True, sort the result in ascending order
            of count. Defaults to False (descending order).

    Returns:
        pd.DataFrame: A DataFrame with the grouped columns and a 'count'
            column, sorted by count.
    """
    count_df = df.groupby(columns).size().reset_index(name="count")
    return count_df.sort_values(by="count", ascending=ascending).reset_index(drop=True)


term_counts = get_count_df(df, columns=["term"], ascending=False)

terms = df["term"].unique()

print(f'Terms: {", ".join(terms)}')
print(f"Number of unique terms: {len(term_counts)}")
display(term_counts)

Terms: short, medium, long
Number of unique terms: 3


|   | term | count |
|---|---|---|
| 0 | long | 152 |
| 1 | medium | 152 |
| 2 | short | 152 |
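
To make the helper's output concrete, the sketch below applies the same group, count, and sort steps as `get_count_df` to a toy frame. The toy rows are illustrative only.

```python
import pandas as pd

toy_df = pd.DataFrame({"term": ["short", "short", "medium", "long", "short"]})

# Same steps as get_count_df: group, count rows per group, sort by count.
count_df = (
    toy_df.groupby(["term"])
    .size()
    .reset_index(name="count")
    .sort_values(by="count", ascending=False)
    .reset_index(drop=True)
)
print(count_df)
```

The most frequent term ends up in the first row. When counts tie, as they do for the pretraining split, the relative order of the tied rows should not be relied on.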


Save each term and its associated dataset names to a YAML file.

In [130]:
import yaml

general_document = []

for term in terms:
    names = df[df["term"] == term]["name"].tolist()

    # Add a mapping for each term
    general_document.append(
        {
            "term": term,
            "names": names,
        }
    )

documents = [general_document]

num_general_configs = len(general_document)

yaml_path = Path("configs") / split_type / "datasets.yaml"

print(f"Number of general configurations: {num_general_configs}")
with open(yaml_path, "w") as file:
    yaml.dump_all(documents, file, explicit_start=True)

Number of general configurations: 3
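
`yaml.dump_all` writes each element of `documents` as its own YAML document, and `explicit_start=True` prefixes each one with `---`. The sketch below round-trips a toy pair of documents through an in-memory string. The layout is a simplified version of what this notebook writes, and the toy entries are illustrative.

```python
import yaml

# Two toy documents: a general one and a domain-specific one.
toy_documents = [
    [{"term": "short", "names": ["bull", "cmip6_1885"]}],
    [{"domain": "Energy", "term": "short", "names": ["bull"]}],
]

# With no stream argument, dump_all returns the YAML text as a string.
text = yaml.dump_all(toy_documents, explicit_start=True)
print(text)

# load_all yields one Python object per `---`-delimited document.
loaded = list(yaml.load_all(text, Loader=yaml.SafeLoader))
print(len(loaded))
```

Because each level of granularity is stored as a separate document, later cells can append a document to the list and rewrite the file without restructuring the earlier ones.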


Load the YAML file to ensure the configurations were saved correctly.

In [131]:
with open(yaml_path, "r") as file:
    documents = list(yaml.load_all(file, Loader=yaml.SafeLoader))

num_documents = len(documents)
print(f"Number of documents loaded: {num_documents}")

for i, document in enumerate(documents, 1):
    print(f"--- Document {i} ---")
    print(yaml.dump(document, sort_keys=False))

# Access the general configurations
general_configs = documents[0]

assert len(general_configs) == 3

for i, config in enumerate(general_configs):
    print(f"Config {i+1}:")

    term = config["term"]
    print(f"  Term: {term}")

    num_names = len(config["names"])
    term_mask = term_counts["term"] == term
    assert num_names == term_counts.loc[term_mask, "count"].iloc[0]

    print(f"  Number of names: {num_names}")
    print(f"  Names: {', '.join(config['names'])}\n")

Number of documents loaded: 1
--- Document 1 ---
- names:
  - bull
  - cmip6_1885
  - era5_1991
  - SHMETRO
  - era5_2006
  - BEIJING_SUBWAY_30MIN
  - gfc12_load
  - buildings_900k
  - london_smart_meters_with_missing
  - residential_pv_power
  - PEMS_BAY
  - wind_power
  - cmip6_1975
  - elecdemand
  - wiki-rolling_nips
  - spain
  - cdc_fluview_who_nrevss
  - covid_mobility
  - monash_m3_quarterly
  - era5_2015
  - alibaba_cluster_trace_2018
  - cmip6_1930
  - uber_tlc_hourly
  - era5_2012
  - cmip6_1940
  - bdg-2_rat
  - era5_2018
  - largest_2020
  - cmip6_1875
  - cmip6_1905
  - borg_cluster_data_2011
  - tourism_yearly
  - traffic_weekly
  - largest_2018
  - PEMS07
  - era5_2001
  - cif_2016_6
  - era5_1996
  - bdg-2_panther
  - cmip6_1985
  - monash_m3_yearly
  - bitcoin_with_missing
  - era5_1998
  - era5_1992
  - favorita_sales
  - cmip6_1965
  - hog
  - era5_2005
  - cmip6_1850
  - PEMS03
  - cmip6_1920
  - covid19_energy
  - LOS_LOOP
  - cmip6_1895
  - tourism_quarterly
  - 

### Domain-Specific

View all of the unique domain-term combinations and count the number of datasets that belong to each one.

In [132]:
domains, terms = df["domain"].unique(), df["term"].unique()

domain_term_counts = get_count_df(
    df,
    columns=["domain", "term"],
    ascending=False,
)

# Convert each term into its own column
pivoted_df = domain_term_counts.pivot(
    index="domain",
    columns="term",
    values="count",
).reset_index()

# Remove old "term" column
pivoted_df.columns.name = None

# Reorder columns
pivoted_df = pivoted_df[
    [
        "domain",
        "short",
        "medium",
        "long",
    ]
]

print(f'Domains: {", ".join(domains)}')
print(f'Terms: {", ".join(terms)}')
print(f"Number of unique domain-term combinations: {len(domain_term_counts)}")
display(pivoted_df)

Domains: Energy, Climate, Transport, Web, Healthcare, Econ/Fin, CloudOps, Sales, Nature
Terms: short, medium, long
Number of unique domain-term combinations: 27


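|   | domain | short | medium | long |
|---|---|---|---|---|
| 0 | Climate | 67 | 67 | 67 |
| 1 | CloudOps | 3 | 3 | 3 |
| 2 | Econ/Fin | 17 | 17 | 17 |
| 3 | Energy | 28 | 28 | 28 |
| 4 | Healthcare | 3 | 3 | 3 |
| 5 | Nature | 3 | 3 | 3 |
| 6 | Sales | 3 | 3 | 3 |
| 7 | Transport | 25 | 25 | 25 |
| 8 | Web | 3 | 3 | 3 |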


Add each domain-term combination's domain, term, and dataset names to the YAML file.
- We'll only create short groups for the following domains because they don't have any medium or long datasets in the train-test split:
  - Econ/Fin
  - Healthcare
  - Sales
- We'll also create combined dataset configurations for the following domain pairs:
  - Web and CloudOps
  - Nature and Climate
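
Combining a domain pair amounts to selecting the rows whose `domain` is either member of the pair. A small sketch with a toy frame (the dataset names `web_a`, `cloud_a`, and `energy_a` are made up for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "name": ["web_a", "cloud_a", "energy_a", "web_a"],
        "domain": ["Web", "CloudOps", "Energy", "Web"],
        "term": ["short", "short", "short", "long"],
    }
)

pair = ("Web", "CloudOps")

# Rows belonging to either domain in the pair, restricted to one term.
pair_mask = toy_df["domain"].isin(pair)
term_mask = toy_df["term"] == "short"
names = toy_df.loc[pair_mask & term_mask, "name"].tolist()

combined_config = {"domain": ",".join(pair), "term": "short", "names": names}
print(combined_config)
```

The comma-joined `domain` label mirrors the `f"{domain_1},{domain_2}"` key used in the cell below.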

In [133]:
from utils import Domain

excluded_domains = [
    "Econ/Fin",
    "Healthcare",
    "Sales",
]

domain_document = []

for domain in domains:

    for term in terms:
        # Exclude domains that only have short term datasets
        if domain in excluded_domains and term != "short":
            print(f"Skipping '{term}' for '{domain}'...")
            continue

        domain_mask, term_mask = df["domain"] == domain, df["term"] == term
        filtered_df = df[domain_mask & term_mask]

        names = filtered_df["name"].tolist()

        if not names:
            print(f"No datasets found for domain '{domain}' and term '{term}'")
            continue

        # Add a mapping for each domain-term combination
        domain_document.append(
            {
                "domain": domain,
                "term": term,
                "names": names,
            }
        )

domain_pairs = [
    (Domain.WEB, Domain.CLOUDOPS),
    (Domain.NATURE, Domain.CLIMATE),
]

# Add custom domain pairs
for domain_pair in domain_pairs:
    num_short, num_medium, num_long = (0,) * 3
    for term in terms:
        domain_1, domain_2 = domain_pair
        domain_mask = df["domain"].isin([domain_1, domain_2])
        term_mask = df["term"] == term

        filtered_df = df[domain_mask & term_mask]
        names = filtered_df["name"].tolist()

        if not names:
            # Report the pair itself; `domain` is left over from the loop above
            print(
                f"No datasets found for domains '{domain_1},{domain_2}' "
                f"and term '{term}'"
            )
            continue

        if term == "short":
            num_short = len(names)
        elif term == "medium":
            num_medium = len(names)
        else:
            num_long = len(names)

        # Add a mapping for each domain-term combination
        domain_document.append(
            {
                "domain": f"{domain_1},{domain_2}",
                "term": term,
                "names": names,
            }
        )

    new_row = pd.DataFrame(
        [
            {
                "domain": f"{domain_1},{domain_2}",
                "short": num_short,
                "medium": num_medium,
                "long": num_long,
            }
        ]
    )

    pivoted_df = pd.concat([pivoted_df, new_row], ignore_index=True)

documents.append(domain_document)
num_domain_configs = len(domain_document)
print(f"Number of domain configs: {num_domain_configs}")

with open(yaml_path, "w") as file:
    yaml.dump_all(
        documents,
        file,
        explicit_start=True,
    )

Skipping 'medium' for 'Healthcare'...
Skipping 'long' for 'Healthcare'...
Skipping 'medium' for 'Econ/Fin'...
Skipping 'long' for 'Econ/Fin'...
Skipping 'medium' for 'Sales'...
Skipping 'long' for 'Sales'...
Number of domain configs: 27
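
The total of 27 follows from the counts above: nine domains with three terms each, minus the six skipped combinations, plus three terms for each of the two combined pairs. As a worked check:

```python
num_domains = 9   # domains listed in the pretraining metadata
num_terms = 3     # short, medium, long
num_excluded = 3  # Econ/Fin, Healthcare, Sales keep only "short"
num_pairs = 2     # (Web, CloudOps) and (Nature, Climate)

# Single-domain configs: all combinations minus the skipped medium/long ones.
per_domain = num_domains * num_terms - num_excluded * (num_terms - 1)

# Combined-pair configs: one per term for each pair.
per_pair = num_pairs * num_terms

total = per_domain + per_pair
print(per_domain, per_pair, total)  # 21 6 27
```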


Load the YAML file to ensure the configurations were saved correctly.

In [134]:
with open(yaml_path, "r") as file:
    documents = list(yaml.load_all(file, Loader=yaml.SafeLoader))

num_documents = len(documents)
print(f"Number of documents loaded: {num_documents}\n")

general_document, domain_document = documents[0], documents[1]

print("--- General Document ---")
for i, config in enumerate(general_document):
    names = config["names"]
    term = config["term"]
    print(f"Config {i + 1} | Name: {names}, Term: {term}")
print()

print("--- Domain Document ---")
for i, config in enumerate(domain_document):
    names = config["names"]
    term = config["term"]
    print(f"Config {i + 1} | Name: {names}, Term: {term}")
print()

assert len(general_document) + len(domain_document) == 3 + 27

Number of documents loaded: 2

--- General Document ---
Config 1 | Name: ['bull', 'cmip6_1885', 'era5_1991', 'SHMETRO', 'era5_2006', 'BEIJING_SUBWAY_30MIN', 'gfc12_load', 'buildings_900k', 'london_smart_meters_with_missing', 'residential_pv_power', 'PEMS_BAY', 'wind_power', 'cmip6_1975', 'elecdemand', 'wiki-rolling_nips', 'spain', 'cdc_fluview_who_nrevss', 'covid_mobility', 'monash_m3_quarterly', 'era5_2015', 'alibaba_cluster_trace_2018', 'cmip6_1930', 'uber_tlc_hourly', 'era5_2012', 'cmip6_1940', 'bdg-2_rat', 'era5_2018', 'largest_2020', 'cmip6_1875', 'cmip6_1905', 'borg_cluster_data_2011', 'tourism_yearly', 'traffic_weekly', 'largest_2018', 'PEMS07', 'era5_2001', 'cif_2016_6', 'era5_1996', 'bdg-2_panther', 'cmip6_1985', 'monash_m3_yearly', 'bitcoin_with_missing', 'era5_1998', 'era5_1992', 'favorita_sales', 'cmip6_1965', 'hog', 'era5_2005', 'cmip6_1850', 'PEMS03', 'cmip6_1920', 'covid19_energy', 'LOS_LOOP', 'cmip6_1895', 'tourism_quarterly', 'extended_web_traffic_with_missing', 'subse

## Train-Test Split

Load the train-test split's metadata.

In [135]:
split = SplitType.TRAIN_TEST
metadata_path = Path("resources") / split / "metadata.csv"

df = pd.read_csv(metadata_path)

print(f"Total number of {split} datasets: {len(df)}")
df.head()

Total number of train_test datasets: 97


|   | name | term | freq | domain | num_series | target_dim | _min_series_length | sum_series_length | prediction_length | windows |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LOOP_SEATTLE/5T | short | 5T | Transport | 323 | 1 | 105120 | 33953760 | 48 | 20 |
| 1 | LOOP_SEATTLE/D | short | D | Transport | 323 | 1 | 365 | 117895 | 30 | 2 |
| 2 | LOOP_SEATTLE/H | short | H | Transport | 323 | 1 | 8760 | 2829480 | 48 | 19 |
| 3 | M_DENSE/D | short | D | Transport | 30 | 1 | 730 | 21900 | 30 | 3 |
| 4 | M_DENSE/H | short | H | Transport | 30 | 1 | 17520 | 525600 | 48 | 20 |


### General

View all of the unique forecasting terms and count the number of datasets that belong to each term. Each forecasting term multiplies the original prediction length by a given multiplier:
- "long" multiplies the original prediction length by 15
- "medium" multiplies the original prediction length by 10  
- "short" multiplies the original prediction length by 1 (no change)

In [136]:
term_counts = get_count_df(df, columns=["term"], ascending=False)

print(f"Number of unique terms: {len(term_counts)}")
display(term_counts)

Number of unique terms: 3


|   | term | count |
|---|---|---|
| 0 | short | 55 |
| 1 | long | 21 |
| 2 | medium | 21 |


Save each term and its associated dataset names to a YAML file.

In [137]:
import yaml

general_document = []

for term in terms:
    names = df[df["term"] == term]["name"].tolist()

    # Add a mapping for each term
    general_document.append(
        {
            "term": term,
            "names": names,
        }
    )

documents = [general_document]

num_general_configs = len(general_document)

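# Use `split` here: `split_type` still holds SplitType.PRETRAIN from the
# pretraining section, which would overwrite the pretraining configs
yaml_path = Path("configs") / split / "datasets.yaml"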

print(f"Number of general configurations: {num_general_configs}")
with open(yaml_path, "w") as file:
    yaml.dump_all(documents, file, explicit_start=True)

Number of general configurations: 3


Load the YAML file to ensure the configurations were saved correctly.

In [138]:
with open(yaml_path, "r") as file:
    documents = list(yaml.load_all(file, Loader=yaml.SafeLoader))

num_documents = len(documents)
print(f"Number of documents loaded: {num_documents}")

for i, document in enumerate(documents, 1):
    print(f"--- Document {i} ---")
    print(yaml.dump(document, sort_keys=False))

# Access the general configurations
general_configs = documents[0]

assert len(general_configs) == 3

for i, config in enumerate(general_configs):
    print(f"Config {i+1}:")

    term = config["term"]
    print(f"  Term: {term}")

    num_names = len(config["names"])
    term_mask = term_counts["term"] == term
    assert num_names == term_counts.loc[term_mask, "count"].iloc[0]

    print(f"  Number of names: {num_names}")
    print(f"  Names: {', '.join(config['names'])}\n")

Number of documents loaded: 1
--- Document 1 ---
- names:
  - LOOP_SEATTLE/5T
  - LOOP_SEATTLE/D
  - LOOP_SEATTLE/H
  - M_DENSE/D
  - M_DENSE/H
  - SZ_TAXI/15T
  - SZ_TAXI/H
  - bitbrains_fast_storage/5T
  - bitbrains_fast_storage/H
  - bitbrains_rnd/5T
  - bitbrains_rnd/H
  - bizitobs_application
  - bizitobs_l2c/5T
  - bizitobs_l2c/H
  - bizitobs_service
  - car_parts_with_missing
  - covid_deaths
  - electricity/15T
  - electricity/D
  - electricity/H
  - electricity/W
  - ett1/15T
  - ett1/D
  - ett1/H
  - ett1/W
  - ett2/15T
  - ett2/D
  - ett2/H
  - ett2/W
  - hierarchical_sales/D
  - hierarchical_sales/W
  - hospital
  - jena_weather/10T
  - jena_weather/D
  - jena_weather/H
  - kdd_cup_2018_with_missing/D
  - kdd_cup_2018_with_missing/H
  - m4_daily
  - m4_hourly
  - m4_monthly
  - m4_quarterly
  - m4_weekly
  - m4_yearly
  - restaurant
  - saugeenday/D
  - saugeenday/M
  - saugeenday/W
  - solar/10T
  - solar/D
  - solar/H
  - solar/W
  - temperature_rain_with_missing
  - us_b

### Domain-Specific

View all of the unique domain-term combinations and count the number of datasets that belong to each one.

In [139]:
domains, terms = df["domain"].unique(), df["term"].unique()

domain_term_counts = get_count_df(
    df,
    columns=["domain", "term"],
    ascending=False,
)

# Convert each term into its own column
pivoted_df = domain_term_counts.pivot(
    index="domain",
    columns="term",
    values="count",
).reset_index()

# Remove old "term" column
pivoted_df.columns.name = None

# Reorder columns
pivoted_df = pivoted_df[
    [
        "domain",
        "short",
        "medium",
        "long",
    ]
]

print(f'Domains: {", ".join(domains)}')
print(f'Terms: {", ".join(terms)}')
print(f"Number of unique domain-term combinations: {len(domain_term_counts)}")
display(pivoted_df)

Domains: Transport, Web/CloudOps, Sales, Healthcare, Energy, Nature, Econ/Fin
Terms: short, medium, long
Number of unique domain-term combinations: 15


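|   | domain | short | medium | long |
|---|---|---|---|---|
| 0 | Econ/Fin | 6.0 | NaN | NaN |
| 1 | Energy | 16.0 | 8.0 | 8.0 |
| 2 | Healthcare | 5.0 | NaN | NaN |
| 3 | Nature | 9.0 | 3.0 | 3.0 |
| 4 | Sales | 4.0 | NaN | NaN |
| 5 | Transport | 7.0 | 4.0 | 4.0 |
| 6 | Web/CloudOps | 8.0 | 6.0 | 6.0 |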
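
The missing medium and long counts for Econ/Fin, Healthcare, and Sales arise because `pivot` leaves domain-term combinations with no rows as `NaN`, which also turns the count columns into floats. If integer counts are preferred, one option (not used in this notebook) is to fill the gaps with zero and cast back. The toy counts below mirror two rows of the table:

```python
import pandas as pd

# Toy counts mirroring the Econ/Fin and Energy rows of the table above.
counts = pd.DataFrame(
    {
        "domain": ["Econ/Fin", "Energy", "Energy", "Energy"],
        "term": ["short", "short", "medium", "long"],
        "count": [6, 16, 8, 8],
    }
)

pivoted = counts.pivot(index="domain", columns="term", values="count")

# Missing combinations become NaN; replace them with 0 and restore ints.
pivoted = pivoted.fillna(0).astype(int)[["short", "medium", "long"]]
print(pivoted)
```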


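Add each domain-term combination's domain, term, and dataset names to the YAML file.
- We'll only create short groups for the following domains because they don't have any medium or long datasets in the train-test split:
  - Econ/Fin
  - Healthcare
  - Sales
- As with the pretraining split, the cell below also attempts the combined (Web, CloudOps) and (Nature, Climate) domain pairs.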

In [140]:
from utils import Domain

excluded_domains = [
    "Econ/Fin",
    "Healthcare",
    "Sales",
]

domain_document = []

for domain in domains:

    for term in terms:
        # Exclude domains that only have short term datasets
        if domain in excluded_domains and term != "short":
            print(f"Skipping '{term}' for '{domain}'...")
            continue

        domain_mask, term_mask = df["domain"] == domain, df["term"] == term
        filtered_df = df[domain_mask & term_mask]

        names = filtered_df["name"].tolist()

        if not names:
            print(f"No datasets found for domain '{domain}' and term '{term}'")
            continue

        # Add a mapping for each domain-term combination
        domain_document.append(
            {
                "domain": domain,
                "term": term,
                "names": names,
            }
        )

domain_pairs = [
    (Domain.WEB, Domain.CLOUDOPS),
    (Domain.NATURE, Domain.CLIMATE),
]

# Add custom domain pairs
for domain_pair in domain_pairs:
    num_short, num_medium, num_long = (0,) * 3
    for term in terms:
        domain_1, domain_2 = domain_pair
        domain_mask = df["domain"].isin([domain_1, domain_2])
        term_mask = df["term"] == term

        filtered_df = df[domain_mask & term_mask]
        names = filtered_df["name"].tolist()

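        if not names:
            # NOTE: `domain` is left over from the loop above, so this message
            # names the last single domain rather than the (domain_1, domain_2) pair
            print(f"No datasets found for domain '{domain}' and term '{term}'")
            continue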

        if term == "short":
            num_short = len(names)
        elif term == "medium":
            num_medium = len(names)
        else:
            num_long = len(names)

        # Add a mapping for each domain-term combination
        domain_document.append(
            {
                "domain": f"{domain_1},{domain_2}",
                "term": term,
                "names": names,
            }
        )

    new_row = pd.DataFrame(
        [
            {
                "domain": f"{domain_1},{domain_2}",
                "short": num_short,
                "medium": num_medium,
                "long": num_long,
            }
        ]
    )

    pivoted_df = pd.concat([pivoted_df, new_row], ignore_index=True)

documents.append(domain_document)
num_domain_configs = len(domain_document)
print(f"Number of domain configs: {num_domain_configs}")

with open(yaml_path, "w") as file:
    yaml.dump_all(
        documents,
        file,
        explicit_start=True,
    )

Skipping 'medium' for 'Sales'...
Skipping 'long' for 'Sales'...
Skipping 'medium' for 'Healthcare'...
Skipping 'long' for 'Healthcare'...
Skipping 'medium' for 'Econ/Fin'...
Skipping 'long' for 'Econ/Fin'...
No datasets found for domain 'Econ/Fin' and term 'short'
No datasets found for domain 'Econ/Fin' and term 'medium'
No datasets found for domain 'Econ/Fin' and term 'long'
Number of domain configs: 18
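
The three `No datasets found` messages most likely come from the (`Domain.WEB`, `Domain.CLOUDOPS`) pair rather than from Econ/Fin. The message reuses the `domain` variable left over from the previous loop, and this split stores the merged label `Web/CloudOps`, which neither member of the pair matches exactly. The sketch below assumes that the `Domain` members resolve to the strings `"Web"`, `"CloudOps"`, `"Nature"`, and `"Climate"`; that is an assumption about `utils.Domain`, which is not shown here.

```python
# Domain labels printed for the train-test split.
train_test_domains = {
    "Transport", "Web/CloudOps", "Sales", "Healthcare",
    "Energy", "Nature", "Econ/Fin",
}

# Assumed string values of the Domain enum members (not confirmed by the source).
pairs = [("Web", "CloudOps"), ("Nature", "Climate")]

# A pair contributes configs only if at least one member is an exact label.
matching_pairs = [pair for pair in pairs if train_test_domains & set(pair)]
print(matching_pairs)  # [('Nature', 'Climate')]

# 15 domain-term combinations plus three terms for each matching pair.
num_domain_configs = 15 + 3 * len(matching_pairs)
print(num_domain_configs)  # 18
```

Under that assumption, the 18 configurations are the 15 single-domain combinations plus three `Nature,Climate` entries that contain only Nature datasets.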


Load the YAML file to ensure the configurations were saved correctly.

In [141]:
with open(yaml_path, "r") as file:
    documents = list(yaml.load_all(file, Loader=yaml.SafeLoader))

num_documents = len(documents)
print(f"Number of documents loaded: {num_documents}\n")

general_document, domain_document = documents[0], documents[1]

print("--- General Document ---")
for i, config in enumerate(general_document):
    names = config["names"]
    term = config["term"]
    print(f"Config {i + 1} | Name: {names}, Term: {term}")
print()

print("--- Domain Document ---")
for i, config in enumerate(domain_document):
    names = config["names"]
    term = config["term"]
    print(f"Config {i + 1} | Name: {names}, Term: {term}")
print()

assert len(general_document) + len(domain_document) == 3 + 18

Number of documents loaded: 2

--- General Document ---
Config 1 | Name: ['LOOP_SEATTLE/5T', 'LOOP_SEATTLE/D', 'LOOP_SEATTLE/H', 'M_DENSE/D', 'M_DENSE/H', 'SZ_TAXI/15T', 'SZ_TAXI/H', 'bitbrains_fast_storage/5T', 'bitbrains_fast_storage/H', 'bitbrains_rnd/5T', 'bitbrains_rnd/H', 'bizitobs_application', 'bizitobs_l2c/5T', 'bizitobs_l2c/H', 'bizitobs_service', 'car_parts_with_missing', 'covid_deaths', 'electricity/15T', 'electricity/D', 'electricity/H', 'electricity/W', 'ett1/15T', 'ett1/D', 'ett1/H', 'ett1/W', 'ett2/15T', 'ett2/D', 'ett2/H', 'ett2/W', 'hierarchical_sales/D', 'hierarchical_sales/W', 'hospital', 'jena_weather/10T', 'jena_weather/D', 'jena_weather/H', 'kdd_cup_2018_with_missing/D', 'kdd_cup_2018_with_missing/H', 'm4_daily', 'm4_hourly', 'm4_monthly', 'm4_quarterly', 'm4_weekly', 'm4_yearly', 'restaurant', 'saugeenday/D', 'saugeenday/M', 'saugeenday/W', 'solar/10T', 'solar/D', 'solar/H', 'solar/W', 'temperature_rain_with_missing', 'us_births/D', 'us_births/M', 'us_births/W']

### Dataset-Specific 

View the dataset name-term combinations. For brevity, we'll only display the first few.

In [142]:
name_term_combinations = df[["name", "term"]]

print(f"Number of unique name-term pairs: {len(name_term_combinations)}")
name_term_combinations.head()

Number of unique name-term pairs: 97


|   | name | term |
|---|---|---|
| 0 | LOOP_SEATTLE/5T | short |
| 1 | LOOP_SEATTLE/D | short |
| 2 | LOOP_SEATTLE/H | short |
| 3 | M_DENSE/D | short |
| 4 | M_DENSE/H | short |


Add each unique name-term combination to our YAML data and save the resulting YAML file.

In [143]:
dataset_document = []

for _, row in df.iterrows():
    # Add a mapping for each name-term combination
    dataset_document.append(
        {
            "term": row["term"],
            "names": [row["name"]],
        }
    )

documents.append(dataset_document)

num_dataset_specific_configs = len(dataset_document)
print(f"Number of dataset-specific configurations: {num_dataset_specific_configs}")

with open(yaml_path, "w") as file:
    yaml.dump_all(
        documents,
        file,
        explicit_start=True,
    )

Number of dataset-specific configurations: 97
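
`iterrows` is fine for 97 rows. For larger metadata files, an equivalent construction that avoids building a `Series` per row is to zip the two columns directly. A sketch on toy rows taken from the preview above:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "name": ["LOOP_SEATTLE/5T", "LOOP_SEATTLE/D", "M_DENSE/H"],
        "term": ["short", "short", "short"],
    }
)

# One config per row: a single-element `names` list plus the row's term.
dataset_document = [
    {"term": term, "names": [name]}
    for name, term in zip(toy_df["name"], toy_df["term"])
]
print(dataset_document[0])  # {'term': 'short', 'names': ['LOOP_SEATTLE/5T']}
```

Keeping `names` as a list, even with a single element, gives the dataset-specific configs the same shape as the general and domain-specific ones.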


Load the YAML file to ensure the configurations were saved correctly.

In [144]:
with open(yaml_path, "r") as file:
    documents = list(yaml.load_all(file, Loader=yaml.SafeLoader))

num_documents = len(documents)
print(f"Number of documents loaded: {num_documents}\n")
assert len(documents) == 3

general_document, domain_document, dataset_document = documents

print("--- General Document ---")
for i, config in enumerate(general_document[:5]):
    names = config["names"]
    term = config["term"]
    print(f"Config {i + 1} | Name: {names}, Term: {term}")
print()

print("--- Domain Document ---")
for i, config in enumerate(domain_document[:5]):
    names = config["names"]
    term = config["term"]
    print(f"Config {i + 1} | Name: {names}, Term: {term}")
print()

print("--- Dataset Level Document ---")
for i, config in enumerate(dataset_document[:5]):
    names = config["names"]
    term = config["term"]
    print(f"Config {i + 1} | Name: {names}, Term: {term}")
print()

assert (
    len(general_document) + len(domain_document) + len(dataset_document) == 3 + 18 + 97
)

Number of documents loaded: 3

--- General Document ---
Config 1 | Name: ['LOOP_SEATTLE/5T', 'LOOP_SEATTLE/D', 'LOOP_SEATTLE/H', 'M_DENSE/D', 'M_DENSE/H', 'SZ_TAXI/15T', 'SZ_TAXI/H', 'bitbrains_fast_storage/5T', 'bitbrains_fast_storage/H', 'bitbrains_rnd/5T', 'bitbrains_rnd/H', 'bizitobs_application', 'bizitobs_l2c/5T', 'bizitobs_l2c/H', 'bizitobs_service', 'car_parts_with_missing', 'covid_deaths', 'electricity/15T', 'electricity/D', 'electricity/H', 'electricity/W', 'ett1/15T', 'ett1/D', 'ett1/H', 'ett1/W', 'ett2/15T', 'ett2/D', 'ett2/H', 'ett2/W', 'hierarchical_sales/D', 'hierarchical_sales/W', 'hospital', 'jena_weather/10T', 'jena_weather/D', 'jena_weather/H', 'kdd_cup_2018_with_missing/D', 'kdd_cup_2018_with_missing/H', 'm4_daily', 'm4_hourly', 'm4_monthly', 'm4_quarterly', 'm4_weekly', 'm4_yearly', 'restaurant', 'saugeenday/D', 'saugeenday/M', 'saugeenday/W', 'solar/10T', 'solar/D', 'solar/H', 'solar/W', 'temperature_rain_with_missing', 'us_births/D', 'us_births/M', 'us_births/W']