## Step 0: Dataset Preparation

In this notebook, we prepare the input datasets for anomaly detection in streaming settings. We use three domains from the TSB-UAD benchmark:

- **Daphnet** (Parkinson's acceleration data)
- **Genesis** (Synthetic mechanical data)
- **NASA-MSL** (Mars spacecraft telemetry)

We generate three Normality levels:

- **Normality 1**: One domain (no shift)
- **Normality 2**: Two domains concatenated (1 shift)
- **Normality 3**: Three domains concatenated (2 shifts)

Each time series is normalized individually before concatenation. We save the generated datasets as `.npy` files along with their distribution shift boundaries and visualizations.
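
The normalize-then-concatenate scheme can be sketched on toy data. The arrays below and the helper name `concat_with_boundaries` are hypothetical stand-ins, not the benchmark data or this notebook's own functions, and this sketch records only the interior boundaries (the index where each new segment starts).

```python
import numpy as np

def zscore(ts):
    """Z-score normalize one series (the epsilon avoids division by zero)."""
    return (ts - ts.mean()) / (ts.std() + 1e-8)

def concat_with_boundaries(segments):
    """Concatenate segments; return the data and the indices where a new segment starts."""
    data = np.concatenate(segments)
    # Cumulative lengths, dropping the last one (it equals len(data)).
    boundaries = np.cumsum([len(s) for s in segments])[:-1]
    return data, boundaries

rng = np.random.default_rng(0)
# Three toy "domains" with very different scales (hypothetical stand-ins).
toy = [rng.normal(0, 1, 100), rng.normal(50, 10, 150), rng.normal(-5, 0.1, 80)]
data, boundaries = concat_with_boundaries([zscore(ts) for ts in toy])
print(data.shape, boundaries.tolist())  # (330,) [100, 250]
```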


### Import Libraries and Define Paths

We begin by importing the required libraries and defining the folder structure. Make sure `original_datasets/` exists and contains one subfolder per domain with its `.out` files; `generated_datasets/` is created automatically if it is missing.


In [3]:
import os
import numpy as np
import matplotlib.pyplot as plt

# Base paths
RAW_DATA_PATH = "original_datasets"
OUTPUT_PATH = "generated_datasets"
os.makedirs(OUTPUT_PATH, exist_ok=True)

### Define Helper Functions

We define utility functions to:
- Read `.out` files
- Normalize each time series with Z-score
- Load all series from a domain folder


In [4]:
def read_out_file(filepath):
    """Read a .out file with one `value[,label]` row per line.

    Returns a tuple (values, labels) of NumPy arrays. Rows that fail to
    parse are skipped, and a missing label is stored as None.
    """
    data = []
    labels = []
    with open(filepath, 'r') as f:
        for line in f:
            parts = line.strip().split(',')
            try:
                value = float(parts[0])
                label = int(parts[1]) if len(parts) > 1 else None
            except ValueError:
                continue  # skip blank or malformed rows
            data.append(value)
            labels.append(label)
    return np.array(data), np.array(labels)

def normalize(ts):
    """Z-score normalization"""
    return (ts - np.mean(ts)) / (np.std(ts) + 1e-8)

def load_domain_timeseries(domain_folder):
    """Loads all .out files from a domain folder"""
    full_path = os.path.join(RAW_DATA_PATH, domain_folder)
    series = []
    series_labels = []
    for filename in sorted(os.listdir(full_path)):  # sorted so the series order is reproducible
        if filename.endswith(".out"):
            ts, labels = read_out_file(os.path.join(full_path, filename))
            if len(ts) > 0:
                series.append(ts)
                series_labels.append(labels)
    return series, series_labels
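
As a rough illustration of the file format the reader above expects, the sketch below writes a small comma-separated `value,label` file to a temporary directory and parses it the same way. The file name `toy.out` and its contents are made up for this example.

```python
import os
import tempfile

# Three valid rows and one malformed row (toy content, not benchmark data).
sample = "0.12,0\n0.57,1\nnot-a-number,0\n0.33,0\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.out")  # hypothetical file name
    with open(path, "w") as f:
        f.write(sample)

    values, labels = [], []
    with open(path) as f:
        for line in f:
            parts = line.strip().split(",")
            try:
                value = float(parts[0])
            except ValueError:
                continue  # skip malformed rows, as the reader above does
            values.append(value)
            labels.append(int(parts[1]) if len(parts) > 1 else None)

print(values, labels)  # [0.12, 0.57, 0.33] [0, 1, 0]
```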


### Load and Normalize Time Series

Here, we load all available time series from each of the selected domains and apply Z-score normalization individually. This ensures that magnitude differences between domains don't distort anomaly scoring later.
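
To see why per-series normalization removes scale differences, the sketch below applies the same `normalize` formula to two toy series with very different amplitudes. The series are synthetic stand-ins chosen for illustration, not data from the benchmark.

```python
import numpy as np

def normalize(ts):
    """Z-score normalization, mirroring the helper defined above."""
    return (ts - np.mean(ts)) / (np.std(ts) + 1e-8)

rng = np.random.default_rng(1)
small = rng.normal(loc=0.5, scale=0.01, size=1000)     # toy low-amplitude series
large = rng.normal(loc=900.0, scale=120.0, size=1000)  # toy high-amplitude series

for name, ts in [("small", small), ("large", large)]:
    z = normalize(ts)
    # After normalization both series have mean close to 0 and std close to 1.
    print(name, float(z.mean()), float(z.std()))

# A constant series maps to all zeros instead of dividing by zero.
constant = normalize(np.ones(5))
```

The small epsilon in the denominator is what keeps a constant series from producing NaN values.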


In [5]:
# Load datasets from each domain
domains = ["Daphnet", "Genesis", "NASA-MSL"]
all_series = {}

for domain in domains:    
    raw_series, raw_series_labels = load_domain_timeseries(domain)
    norm_series = [normalize(ts) for ts in raw_series]
    all_series[domain] = {
        "series": norm_series,
        "labels": raw_series_labels
    }
    print(f"Loaded {len(norm_series)} time series from {domain}.")

Loaded 40 time series from Daphnet.
Loaded 6 time series from Genesis.
Loaded 54 time series from NASA-MSL.


### Function to Save and Visualize Datasets

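This function:
- Concatenates the selected time series and their labels
- Records the cumulative end offset of each series as a boundary (a distribution shift occurs wherever consecutive series come from different domains)
- Saves the data, boundaries, and labels as `.npy` files, and a plot as `.png`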


In [6]:
def save_dataset(name, series_list, labels_list, domains):
    data = np.concatenate(series_list)
    labels = np.concatenate(labels_list)
    boundaries = []
    offset = 0
    for series in series_list:  # Cumulative end offset of every series; the final one equals len(data)
        offset += len(series)
        boundaries.append(offset)

    # Save data
    np.save(os.path.join(OUTPUT_PATH, f"{name}.npy"), data)
    np.save(os.path.join(OUTPUT_PATH, f"{name}_boundaries.npy"), np.array(boundaries))
    np.save(os.path.join(OUTPUT_PATH, f"{name}_labels.npy"), np.array(labels))
    print(f"Saved {name} with shape {data.shape} and shift boundaries {boundaries}")

    # --- Plotting ---
    plt.style.use("seaborn-v0_8-muted")  # set the style before creating the figure
    plt.figure(figsize=(14, 4))

    # Draw labeled segments
    start_idx = 0
    for i in range(1, len(data)):
        if labels[i] != labels[i - 1]:
            color = '#2c7bb6' if labels[i - 1] == 0 else '#d7191c'
            plt.plot(range(start_idx, i), data[start_idx:i], color=color, linewidth=1.5)
            start_idx = i

    # Draw last segment
    color = '#2c7bb6' if labels[-1] == 0 else '#d7191c'
    plt.plot(range(start_idx, len(data)), data[start_idx:], color=color, linewidth=1.5)

    # Draw domain boundaries
    for b in boundaries:
        plt.axvline(x=b, color='gray', linestyle='--', linewidth=1)

    # Decorations
    plt.title(f"{name.replace('_', ' ').title()}  |  Domains: {' → '.join(domains)}", fontsize=14, pad=10)
    plt.xlabel("Time", fontsize=12)
    plt.ylabel("Value", fontsize=12)
    plt.grid(False)
    plt.tight_layout()

    # Save figure
    plt.savefig(os.path.join(OUTPUT_PATH, f"{name}.png"), dpi=150)
    plt.close()
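
A downstream step might read a saved dataset back and recover the individual series from the boundary file. The sketch below does this on toy arrays written to a temporary directory, following the `{name}.npy` / `{name}_boundaries.npy` naming used above. The name `toy` and the values are hypothetical.

```python
import os
import tempfile
import numpy as np

# Toy stand-in for a generated dataset (hypothetical values, not benchmark data).
segments = [np.arange(4.0), np.arange(3.0), np.arange(5.0)]
data = np.concatenate(segments)
boundaries = np.cumsum([len(s) for s in segments])  # [4, 7, 12]

with tempfile.TemporaryDirectory() as tmp:
    np.save(os.path.join(tmp, "toy.npy"), data)
    np.save(os.path.join(tmp, "toy_boundaries.npy"), boundaries)

    loaded = np.load(os.path.join(tmp, "toy.npy"))
    loaded_boundaries = np.load(os.path.join(tmp, "toy_boundaries.npy"))

# Drop a trailing boundary equal to len(data) so np.split yields no empty tail.
interior = loaded_boundaries[loaded_boundaries < len(loaded)]
recovered = np.split(loaded, interior)
print([len(r) for r in recovered])  # [4, 3, 5]
```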

### Generate Normality 1, 2, and 3 Datasets

For each domain, we draw a random number of normalized time series (sampled without replacement, with a fixed seed) and combine them as follows:

- **Normality 1**: The selected series from one domain, concatenated
- **Normality 2**: The selected series from two domains, pooled, shuffled, and concatenated
- **Normality 3**: The selected series from all three domains, pooled, shuffled, and concatenated

Each result is saved along with its boundary and label files and a plot showing the transitions.
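
Because the generation cell shuffles the pooled series before concatenating them, the number of actual domain transitions in a generated stream can exceed the nominal shift count of its Normality level. A minimal sketch for counting transitions in a per-series domain sequence follows. The helper name `count_domain_shifts` and the example sequences are illustrative assumptions.

```python
def count_domain_shifts(domain_sequence):
    """Number of positions where consecutive series come from different domains."""
    return sum(1 for a, b in zip(domain_sequence, domain_sequence[1:]) if a != b)

# Grouped by domain: one transition per domain change.
ordered = ["Daphnet", "Daphnet", "Genesis", "NASA-MSL", "NASA-MSL"]
# Same series in an interleaved order: more transitions.
shuffled = ["Daphnet", "NASA-MSL", "Daphnet", "Genesis", "NASA-MSL"]

print(count_domain_shifts(ordered))   # 2
print(count_domain_shifts(shuffled))  # 4
```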


In [7]:
import random
random.seed(47)

In [8]:
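def select_random_samples(max_samples, all_series):
    """Pick a random subset of series indices (without replacement) for each domain."""
    selected_indices = {}

    for domain in all_series:  # iterate over the argument rather than the global `domains`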
        available = len(all_series[domain]["series"])
        num_samples = random.randint(1, min(max_samples, available))

        # Randomly sample without replacement
        indices = random.sample(range(available), num_samples)
        selected_indices[domain] = indices

    return selected_indices

In [9]:
selected_indices = select_random_samples(55, all_series)
print("Selected indices per domain:", selected_indices)

Selected indices per domain: {'Daphnet': [4, 27, 35, 29, 21, 16, 32, 24, 25, 19, 1, 31, 13, 37, 0, 3, 28, 8, 15, 17, 23, 10, 38], 'Genesis': [1, 4], 'NASA-MSL': [22, 28, 15, 2, 35, 23, 45, 44, 53, 38, 21, 49, 36, 12, 51, 50, 8, 43, 6, 48, 24, 32, 16, 3, 20, 27, 7, 47, 10, 42, 39, 30, 13]}
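
The selection is reproducible only because the random generator is seeded. The sketch below shows the same idea with a dedicated `random.Random` instance instead of the module-level seed. This is a variation for illustration, and `sample_indices` is a hypothetical helper, not part of this notebook.

```python
import random

def sample_indices(available, max_samples, rng):
    """Pick a random count in [1, min(max_samples, available)], then sample without replacement."""
    n = rng.randint(1, min(max_samples, available))
    return rng.sample(range(available), n)

# Two generators with the same seed produce the same selection.
first = sample_indices(40, 55, random.Random(47))
second = sample_indices(40, 55, random.Random(47))

print(first == second)                # True
print(len(first) == len(set(first)))  # True
```

Passing the generator explicitly keeps the result independent of any other code that draws from the global `random` state.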


In [10]:
# Normality 1
for domain in domains:
    # Get the series and labels
    series_list = all_series[domain]["series"]
    labels_list = all_series[domain]["labels"]
    # Get random domain indexes
    indices = selected_indices[domain]
    # Extract random sample
    random_series = [series_list[idx] for idx in indices]
    random_labels = [labels_list[idx] for idx in indices]
    save_dataset(f"normality_1_{domain.lower()}", random_series, random_labels, [domain])

# Normality 2
pairs = [("Daphnet", "Genesis"), ("Daphnet", "NASA-MSL"), ("Genesis", "NASA-MSL")]
for i, (dom1, dom2) in enumerate(pairs, 1):
    # Get random domain indexes
    indices1 = selected_indices[dom1]
    indices2 = selected_indices[dom2]

    # Collect (series, label, domain) tuples
    combined = [
        (all_series[dom1]["series"][idx], all_series[dom1]["labels"][idx], dom1)
        for idx in indices1
    ] + [
        (all_series[dom2]["series"][idx], all_series[dom2]["labels"][idx], dom2)
        for idx in indices2
    ]

    # Shuffle the combined data
    random.shuffle(combined)

    # Unpack shuffled components
    shuffled_series, shuffled_labels, shuffled_domains = zip(*combined)

    save_dataset(
        f"normality_2_{i}_{dom1.lower()}_{dom2.lower()}",
        list(shuffled_series),
        list(shuffled_labels),
        list(shuffled_domains)
    )

# Normality 3
perms = [
    ("Daphnet", "Genesis", "NASA-MSL"),
    ("Genesis", "NASA-MSL", "Daphnet"),
    ("NASA-MSL", "Daphnet", "Genesis")
]
for i, (dom1, dom2, dom3) in enumerate(perms, 1):
    # Get selected indices
    indices1 = selected_indices[dom1]
    indices2 = selected_indices[dom2]
    indices3 = selected_indices[dom3]

    # Collect (series, label, domain) tuples
    combined = [
        (all_series[dom1]["series"][idx], all_series[dom1]["labels"][idx], dom1)
        for idx in indices1
    ] + [
        (all_series[dom2]["series"][idx], all_series[dom2]["labels"][idx], dom2)
        for idx in indices2
    ] + [
        (all_series[dom3]["series"][idx], all_series[dom3]["labels"][idx], dom3)
        for idx in indices3
    ]

    # Shuffle the combined data
    random.shuffle(combined)

    # Unpack shuffled components
    shuffled_series, shuffled_labels, shuffled_domains = zip(*combined)

    save_dataset(
        f"normality_3_{i}_{dom1.lower()}_{dom2.lower()}_{dom3.lower()}",
        list(shuffled_series),
        list(shuffled_labels),
        list(shuffled_domains)
    )

Saved normality_1_daphnet with shape (445440,) and shift boundaries [25600, 35200, 44800, 54400, 71040, 87680, 97280, 112000, 121600, 138240, 167040, 176640, 212480, 222080, 250880, 276480, 286080, 341120, 357760, 374400, 400000, 435840, 445440]
Saved normality_1_genesis with shape (32440,) and shift boundaries [16220, 32440]
Saved normality_1_nasa-msl with shape (80427,) and shift boundaries [2127, 4176, 6687, 8738, 12707, 14744, 15889, 18319, 18758, 21614, 23822, 26094, 32194, 36116, 36864, 38383, 40574, 41719, 43877, 46095, 48133, 54233, 56720, 57484, 59761, 61793, 63867, 66139, 69561, 71991, 75673, 77829, 80427]
Saved normality_2_1_daphnet_genesis with shape (477880,) and shift boundaries [9600, 35200, 51420, 80220, 89820, 115420, 132060, 141660, 151260, 167900, 196700, 206300, 242140, 251740, 268380, 293980, 303580, 319800, 329400, 384440, 399160, 415800, 451640, 461240, 477880]
Saved normality_2_2_daphnet_nasa-msl with shape (525867,) and shift boundaries [2511, 4941, 7218, 43058