## Step 0: Dataset Preparation

In this notebook, we prepare the input datasets for anomaly detection in streaming settings. We use three domains from the TSB-UAD benchmark:

- **Daphnet** (Parkinson's acceleration data)
- **Genesis** (Synthetic mechanical data)
- **NASA-MSL** (Mars spacecraft telemetry)

We generate datasets at three normality levels, defined by the number of domains concatenated into one series:

- **Normality 1**: One domain (no shift)
- **Normality 2**: Two domains concatenated (1 shift)
- **Normality 3**: Three domains concatenated (2 shifts)

Each time series is normalized individually before concatenation. We save the generated datasets as `.npy` files along with their distribution shift boundaries and visualizations.


### Import Libraries and Define Paths

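We begin by importing the required libraries and defining the folder structure. Make sure the folder `original_datasets/` exists and contains one subfolder per domain (`Daphnet/`, `Genesis/`, `NASA-MSL/`) with the `.out` files. The output folder `generated_datasets/` is created automatically if it is missing.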


In [7]:
import os
import numpy as np
import matplotlib.pyplot as plt

# Base paths
RAW_DATA_PATH = "original_datasets"
OUTPUT_PATH = "generated_datasets"
os.makedirs(OUTPUT_PATH, exist_ok=True)

### Define Helper Functions

We define utility functions to:
- Read `.out` files
- Normalize each time series with Z-score
- Load all series from a domain folder
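
The reader defined in the next cell keeps only the first comma-separated column of each line. As a minimal sketch, assuming each `.out` line has the form `value,label` (an assumption about the file layout, not something this notebook verifies), both columns could be read as shown here. The helper name `read_values_and_labels` and the toy numbers are hypothetical.

```python
import io
import numpy as np

def read_values_and_labels(source):
    """Hypothetical helper: parse 'value,label' rows into two arrays."""
    # ndmin=2 keeps a 2-D shape even if the file has a single row
    table = np.loadtxt(source, delimiter=",", ndmin=2)
    values = table[:, 0]
    labels = table[:, 1].astype(int)
    return values, labels

# Toy stand-in for a .out file (made-up numbers, not benchmark data)
toy_file = io.StringIO("0.5,0\n0.7,0\n9.9,1\n0.6,0\n")
values, labels = read_values_and_labels(toy_file)
print(values, labels)
```

Unlike the line-by-line reader, `np.loadtxt` raises an error on malformed rows instead of skipping them, which may or may not be desirable depending on how clean the files are.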


In [8]:
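def read_out_file(filepath):
    """Reads a .out time series file and returns a NumPy array of floats.

    Only the first comma-separated column of each line is kept.
    Lines whose first column is not numeric (e.g. blank lines) are skipped.
    """
    data = []
    with open(filepath, 'r') as f:
        for line in f:
            try:
                value = float(line.strip().split(',')[0])
                data.append(value)
            except ValueError:
                # float() failed: not a numeric entry, skip this line
                continue
    return np.array(data, dtype=float)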

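def normalize(ts):
    """Z-score normalization: zero mean, unit standard deviation.

    The small constant in the denominator avoids division by zero
    for constant series.
    """
    return (ts - np.mean(ts)) / (np.std(ts) + 1e-8)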

def load_domain_timeseries(domain_folder):
    """Loads all non-empty .out files from a domain folder under RAW_DATA_PATH."""
    full_path = os.path.join(RAW_DATA_PATH, domain_folder)
    series = []
    # os.listdir returns entries in arbitrary order; sort for a stable order
    for filename in sorted(os.listdir(full_path)):
        if filename.endswith(".out"):
            ts = read_out_file(os.path.join(full_path, filename))
            if len(ts) > 0:
                series.append(ts)
    return series


### Load and Normalize Time Series

Here, we load all available time series from each of the selected domains and apply Z-score normalization to each series individually. Every series then has zero mean and unit standard deviation, so magnitude differences between domains don't distort anomaly scoring later.
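
To illustrate the effect, the sketch below applies the same formula to two synthetic series with very different scales. The function name `zscore` and the generated data are illustrative assumptions, not part of the benchmark.

```python
import numpy as np

def zscore(ts, eps=1e-8):
    """Same formula as normalize(): subtract the mean, divide by std + eps."""
    return (ts - np.mean(ts)) / (np.std(ts) + eps)

rng = np.random.default_rng(0)
small = rng.normal(loc=0.0, scale=0.01, size=1000)    # tiny amplitude
large = rng.normal(loc=500.0, scale=50.0, size=1000)  # large offset and amplitude

for name, ts in [("small", small), ("large", large)]:
    z = zscore(ts)
    # Both series end up with mean close to 0 and std close to 1
    print(name, round(float(z.mean()), 4), round(float(z.std()), 4))

# A constant series has zero std; eps keeps the result finite (all zeros)
constant = zscore(np.ones(5))
```

Note that normalizing each series separately removes differences in offset and scale between domains, so the distribution shifts that remain after concatenation come from the shape and dynamics of the signals.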


In [9]:
# Load datasets from each domain
domains = ["Daphnet", "Genesis", "NASA-MSL"]
all_series = {}

for domain in domains:    
    raw_series = load_domain_timeseries(domain)
    norm_series = [normalize(ts) for ts in raw_series]
    all_series[domain] = norm_series
    print(f"Loaded {len(norm_series)} time series.")


Loaded 45 time series.
Loaded 6 time series.
Loaded 54 time series.


### Function to Save and Visualize Datasets

This function:
- Concatenates the selected time series
- Tracks the indices where domain boundaries (distribution shifts) occur
- Saves the data and the boundary indices as `.npy` files and a plot as `.png`
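
For an arbitrary number of concatenated series, the boundary indices are the cumulative lengths of all series except the last. The sketch below is one possible generalization, not the implementation used in this notebook; `shift_boundaries` is a hypothetical name.

```python
import numpy as np

def shift_boundaries(series_list):
    """Return the start index of every series after the first one."""
    lengths = [len(ts) for ts in series_list]
    # The final cumulative sum is the total length, not a shift, so drop it
    return np.cumsum(lengths)[:-1].tolist()

a, b, c = np.zeros(5), np.ones(3), np.full(4, 2.0)
print(shift_boundaries([a]))        # []
print(shift_boundaries([a, b]))     # [5]
print(shift_boundaries([a, b, c]))  # [5, 8]
```

With this variant, a single-domain dataset yields an empty boundary list, which matches the "no shift" definition of Normality 1. The function in the next cell instead records the series length in that case.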


In [10]:
def save_dataset(name, series_list, domains):
    """Concatenates the series, then saves data, boundaries, and a plot."""
    data = np.concatenate(series_list)
    # First boundary: end of the first series. With a single series this is
    # simply the series length (end of the data), not an actual shift.
    boundaries = [len(series_list[0])]
    if len(series_list) > 2:
        boundaries.append(boundaries[0] + len(series_list[1]))
    np.save(os.path.join(OUTPUT_PATH, f"{name}.npy"), data)
    np.save(os.path.join(OUTPUT_PATH, f"{name}_boundaries.npy"), np.array(boundaries))
    print(f"Saved {name} with shape {data.shape} and shift boundaries {boundaries}")

    # Plot the series with one dashed line per boundary
    plt.figure(figsize=(12, 4))
    plt.plot(data, label='Time Series')
    for i, b in enumerate(boundaries):
        # Label only the first line so the legend has a single entry
        label = 'Distribution shift' if i == 0 else '_nolegend_'
        plt.axvline(x=b, color='red', linestyle='--', label=label)
    plt.title(f"{name} - {' → '.join(domains)}")
    plt.xlabel("Time")
    plt.legend()
    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_PATH, f"{name}.png"))
    plt.close()


### Generate Normality 1, 2, and 3 Datasets

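For every generated dataset, we draw a fresh random normalized time series from each participating domain and combine them according to the rules below. Because no random seed is set, rerunning the cell selects different series, so shapes and boundaries vary between runs.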

- **Normality 1**: Single time series from one domain
- **Normality 2**: Two time series from two domains, concatenated
- **Normality 3**: Three time series from three domains, concatenated

Each result is saved along with a boundary marker file and a plot showing the transitions.
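
If reproducible selections are needed, one option is to draw from a dedicated seeded generator. This is a minimal sketch under that assumption. The helper `draw_picks`, the seed value, and the toy pools are hypothetical and are not used by the cell below.

```python
import random

def draw_picks(pools, seed):
    """Pick one series index per domain using a seeded generator."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return {domain: rng.randrange(len(pool)) for domain, pool in pools.items()}

# Toy pools standing in for all_series (placeholders, not real data)
toy_pools = {
    "Daphnet": list(range(45)),
    "Genesis": list(range(6)),
    "NASA-MSL": list(range(54)),
}

picks_a = draw_picks(toy_pools, seed=0)
picks_b = draw_picks(toy_pools, seed=0)
print(picks_a == picks_b)  # same seed, same picks
```

Returning indices rather than the arrays themselves also makes it easy to log which series went into each generated dataset.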


In [13]:
import random 

# Normality 1
for domain in domains:
    ts = random.choice(all_series[domain])
    save_dataset(f"normality_1_{domain.lower()}", [ts], [domain])

# Normality 2
pairs = [("Daphnet", "Genesis"), ("Daphnet", "NASA-MSL"), ("Genesis", "NASA-MSL")]
for i, (dom1, dom2) in enumerate(pairs, 1):
    ts1 = random.choice(all_series[dom1])
    ts2 = random.choice(all_series[dom2])
    save_dataset(f"normality_2_{i}_{dom1.lower()}_{dom2.lower()}", [ts1, ts2], [dom1, dom2])

# Normality 3 (three cyclic rotations of the domain order, not all six orderings)
perms = [
    ("Daphnet", "Genesis", "NASA-MSL"),
    ("Genesis", "NASA-MSL", "Daphnet"),
    ("NASA-MSL", "Daphnet", "Genesis")
]
for i, (dom1, dom2, dom3) in enumerate(perms, 1):
    ts1 = random.choice(all_series[dom1])
    ts2 = random.choice(all_series[dom2])
    ts3 = random.choice(all_series[dom3])
    save_dataset(f"normality_3_{i}_{dom1.lower()}_{dom2.lower()}_{dom3.lower()}", [ts1, ts2, ts3], [dom1, dom2, dom3])


Saved normality_1_daphnet with shape (35840,) and shift boundaries [35840]
Saved normality_1_genesis with shape (16220,) and shift boundaries [16220]
Saved normality_1_nasa-msl with shape (3969,) and shift boundaries [3969]
Saved normality_2_1_daphnet_genesis with shape (25820,) and shift boundaries [9600]
Saved normality_2_2_daphnet_nasa-msl with shape (36279,) and shift boundaries [35840]
Saved normality_2_3_genesis_nasa-msl with shape (18492,) and shift boundaries [16220]
Saved normality_3_1_daphnet_genesis_nasa-msl with shape (54363,) and shift boundaries [35840, 52060]
Saved normality_3_2_genesis_nasa-msl_daphnet with shape (53887,) and shift boundaries [16220, 18047]
Saved normality_3_3_nasa-msl_daphnet_genesis with shape (27871,) and shift boundaries [2051, 11651]
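
A downstream notebook can load a saved dataset with its boundary file and recover the per-domain segments. The sketch below demonstrates this on a toy array written to a temporary folder. `split_by_boundaries` and the `toy` file names are hypothetical, and only the `<name>.npy` / `<name>_boundaries.npy` naming follows the convention above.

```python
import os
import tempfile
import numpy as np

def split_by_boundaries(data, boundaries):
    """Split a concatenated series at the stored shift indices."""
    return np.split(data, boundaries)

with tempfile.TemporaryDirectory() as tmp:
    # Toy dataset: three segments of lengths 5, 3, and 4
    toy = np.concatenate([np.zeros(5), np.ones(3), np.full(4, 2.0)])
    np.save(os.path.join(tmp, "toy.npy"), toy)
    np.save(os.path.join(tmp, "toy_boundaries.npy"), np.array([5, 8]))

    data = np.load(os.path.join(tmp, "toy.npy"))
    boundaries = np.load(os.path.join(tmp, "toy_boundaries.npy"))
    segments = split_by_boundaries(data, boundaries)

print([len(s) for s in segments])  # [5, 3, 4]
```

For the Normality 1 files, the stored boundary equals the series length, so the same split returns the full series followed by an empty segment.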
