# BTS/WHITED Dataset - TSFM Embedding Generation

This notebook generates embeddings for both the BTS (Brick by Brick 2024) and WHITED datasets using time series foundation models (MOMENT and Chronos).

**Two workflows are demonstrated:**
1. **BTS Dataset:** Multi-label classification from building system sensor data (using MOMENT)
2. **WHITED Dataset:** Single-label appliance classification from electrical power signatures (using MOMENT and Chronos)

## Setup and Configuration

In [None]:
import numpy as np
import pandas as pd
import pickle
import zipfile
import torch
from torch.utils.data import Dataset, DataLoader
from scipy.interpolate import interp1d
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import matplotlib.pyplot as plt
import random
import soundfile as sf  # For WHITED FLAC files
import os

# Note: Foundation model imports are done in their respective sections:
# - momentfm (MOMENT model)
# - chronos (Chronos model)

In [None]:
# File paths for BTS dataset
zip_file_path = 'data/train_X_v0.1.0.zip'
train_y_path = 'data/train_y_v0.1.0.csv'

# WHITED file path is set later in the WHITED section

## BTS Label Mapping

Load and inspect the 94 label categories from the BTS dataset.

In [None]:
# Load labels and create mapping
df_train_y = pd.read_csv(train_y_path, index_col=0)

# Create label mapping
label_mapping = {idx: label for idx, label in enumerate(df_train_y.columns)}
print("Label mapping:", label_mapping)
print(f"\nTotal number of label categories: {len(label_mapping)}")
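
Each row of the label file is a multi-hot vector, so the mapping above is what turns active positions back into tag names. The following is a minimal sketch of that decoding step. The tag names in `columns` are hypothetical placeholders, not the real BTS label columns.

```python
import numpy as np

# Hypothetical tag names standing in for the real BTS label columns.
columns = ["Sensor", "Temperature_Sensor", "Flow_Sensor", "Setpoint"]
label_mapping = {idx: label for idx, label in enumerate(columns)}

# One multi-hot label row: 1 marks an active tag, 0 an inactive one.
label_row = np.array([1, 1, 0, 0])

# Recover the tag names whose entries are positive.
active_tags = [label_mapping[int(i)] for i in np.flatnonzero(label_row > 0)]
print(active_tags)
```

The same lookup is used later when plotting a sample together with its active labels.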

## BTS Data Loading and Processing

In [None]:
def resample_time_series(time_series, target_length=512):
    """
    Resamples a time series to the target length using linear interpolation.

    Parameters:
    - time_series: Original time series values
    - target_length: Desired length for the resampled time series

    Returns:
    - Resampled time series of length target_length
    """
    original_length = len(time_series)
    if original_length == target_length:
        return time_series
    original_indices = np.linspace(0, 1, original_length)
    target_indices = np.linspace(0, 1, target_length)
    interpolator = interp1d(original_indices, time_series, kind='linear', fill_value="extrapolate")
    return interpolator(target_indices)


def load_train_data(train_y_path, zip_file_path, seq_len=512, test_size=0.2, random_state=42):
    """
    Loads and resamples training data, then splits it into train and test sets.

    Parameters:
    - train_y_path: Path to the training target CSV file
    - zip_file_path: Path to the ZIP file containing the training .pkl files
    - seq_len: Desired sequence length for resampling
    - test_size: Proportion of the dataset to include in the test split
    - random_state: Seed for reproducibility of the split

    Returns:
    - train_data, test_data: Train and test datasets with resampled timeseries and associated labels
    """
    df_train_y = pd.read_csv(train_y_path, index_col=0)
    print(f"Loaded training labels. Number of samples: {len(df_train_y)}")

    resampled_data_list = []
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        pkl_files = set(zip_ref.namelist())  # Set gives fast membership checks below
        print(f"Number of .pkl files in ZIP: {len(pkl_files)}")

        for idx, row in df_train_y.iterrows():
            filename = row.name
            if filename.endswith('.pkl'):
                pkl_file = f"train_X/{filename}"
            else:
                pkl_file = f"train_X/{filename}.pkl"

            labels = row.values

            if pkl_file in pkl_files:
                with zip_ref.open(pkl_file, 'r') as f:
                    data = pickle.load(f)
                    resampled_values = resample_time_series(data['v'], target_length=seq_len)
                    resampled_data_list.append({
                        "timeseries": resampled_values,
                        "labels": labels
                    })
            else:
                print(f"File not found in ZIP: {pkl_file}")

    print(f"Number of resampled samples: {len(resampled_data_list)}")

    if len(resampled_data_list) == 0:
        raise ValueError("No matching files found between the labels and ZIP contents.")

    # Split into train and test sets
    train_data, test_data = train_test_split(
        resampled_data_list, test_size=test_size, random_state=random_state
    )
    return train_data, test_data


class ClassificationDatasetWithMask(Dataset):
    """
    Dataset for multi-label classification with MOMENT model.
    """
    def __init__(self, data_list, seq_len=512):
        """
        Parameters:
        - data_list: List of dictionaries containing "timeseries" and "labels"
        - seq_len: Fixed sequence length for each time series
        """
        self.seq_len = seq_len
        self.data = []
        self.input_masks = []
        self.labels = []

        for entry in data_list:
            timeseries = entry["timeseries"]
            labels = entry["labels"]

            # Add channel dimension (1 for single-channel data)
            timeseries = np.expand_dims(timeseries, axis=0)

            # Create input mask
            input_mask = np.ones(seq_len)

            # Append to dataset
            self.data.append(timeseries)
            self.input_masks.append(input_mask)
            self.labels.append(labels)

        # Convert lists to NumPy arrays
        self.data = np.array(self.data)
        self.input_masks = np.array(self.input_masks)
        self.labels = np.array(self.labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        """
        Returns:
        - Tuple of (timeseries, input_mask, label)
        """
        timeseries = self.data[idx]
        input_mask = self.input_masks[idx]
        label = self.labels[idx]
        return (
            torch.tensor(timeseries, dtype=torch.float32),  # Shape: (1, seq_len)
            torch.tensor(input_mask, dtype=torch.float32),  # Shape: (seq_len,)
            torch.tensor(label, dtype=torch.float32),       # Shape: (num_labels,)
        )
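
The resampling step can be sanity-checked on a toy signal. This sketch swaps in `np.interp` for SciPy's `interp1d`; on a normalized `[0, 1]` grid the target points never leave the source range, so no extrapolation is involved. The helper name `resample_linear` is hypothetical and exists only for this illustration.

```python
import numpy as np

def resample_linear(values, target_length=512):
    """Linearly resample a 1-D series onto target_length evenly spaced points."""
    values = np.asarray(values, dtype=float)
    if len(values) == target_length:
        return values
    src = np.linspace(0.0, 1.0, len(values))
    dst = np.linspace(0.0, 1.0, target_length)
    return np.interp(dst, src, values)

short = resample_linear([0.0, 1.0, 2.0], target_length=5)     # upsampling
long_ = resample_linear(np.arange(1000), target_length=512)   # downsampling
print(short)
print(long_.shape, long_[0], long_[-1])
```

Both directions keep the first and last values of the original series, which is a quick way to confirm the grid alignment.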

## BTS: Load and Prepare Data

In [None]:
# Load data with 80/20 train/test split
train_data, test_data = load_train_data(train_y_path, zip_file_path, seq_len=512)

# Create datasets and dataloaders
train_dataset = ClassificationDatasetWithMask(train_data)
test_dataset = ClassificationDatasetWithMask(test_data)

train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True, drop_last=False)
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=False, drop_last=False)

# Verify data format
for batch_x, batch_masks, batch_labels in train_dataloader:
    print(f"Batch X Shape: {batch_x.shape}")  # Should be (batch_size, 1, seq_len)
    print(f"Batch Masks Shape: {batch_masks.shape}")  # Should be (batch_size, seq_len)
    print(f"Batch Labels Shape: {batch_labels.shape}")  # Should be (batch_size, num_labels)
    break

## Optional: Visualize BTS Sample Time Series

In [None]:
def plot_random_timeseries(dataset, dataset_name="Dataset", label_mapping=None):
    """
    Plots a random time series from the dataset and prints its labels.

    Parameters:
    - dataset: ClassificationDatasetWithMask dataset
    - dataset_name: Name for the plot title
    - label_mapping: Mapping of label indices to tag names
    """
    random_idx = random.randint(0, len(dataset) - 1)
    timeseries, _, labels = dataset[random_idx]

    # Convert labels to numpy and find active labels
    labels = labels.numpy()
    active_label_indices = [idx for idx, value in enumerate(labels) if value > 0]
    active_labels = [label_mapping[idx] for idx in active_label_indices] if label_mapping else active_label_indices

    # Plot
    plt.figure(figsize=(10, 5))
    plt.plot(timeseries.squeeze(), label="Timeseries")
    plt.title(f"Random Time Series from {dataset_name}\nLabels: {', '.join(map(str, active_labels))}")
    plt.xlabel("Timestamps")
    plt.ylabel("Values")
    plt.legend()
    plt.grid(True)
    plt.show()

    print(f"Labels for the selected time series from {dataset_name}: {active_labels}")


# Plot random samples
plot_random_timeseries(train_dataset, dataset_name="Training Dataset", label_mapping=label_mapping)
plot_random_timeseries(test_dataset, dataset_name="Testing Dataset", label_mapping=label_mapping)

## MOMENT Embedding Generation (Used for Both Datasets)

In [None]:
def get_embedding(model, dataloader):
    """
    Generates embeddings using MOMENT model.
    
    Parameters:
    - model: MOMENT model in embedding mode
    - dataloader: DataLoader for the dataset
    
    Returns:
    - embeddings: Array of shape (num_samples, embedding_dim)
    - labels: Array of shape (num_samples, num_labels)
    """
    embeddings, labels = [], []
    with torch.no_grad():
        for batch_x, batch_masks, batch_labels in tqdm(dataloader, total=len(dataloader)):
            batch_x = batch_x.to("cpu").float()
            batch_masks = batch_masks.to("cpu")

            output = model(x_enc=batch_x, input_mask=batch_masks)  # [batch_size x d_model (=1024)]
            embedding = output.embeddings
            embeddings.append(embedding.detach().cpu().numpy())
            labels.append(batch_labels.cpu().numpy())  # Store labels as NumPy arrays

    embeddings, labels = np.concatenate(embeddings), np.concatenate(labels)
    return embeddings, labels
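
The accumulate-then-concatenate pattern in `get_embedding()` can be tried without downloading a model. In this sketch, `fake_embedder` is a hypothetical stand-in for MOMENT that returns two summary statistics per series, and `embed_in_batches` is an illustrative name. Only the batching logic mirrors the function above.

```python
import numpy as np

def fake_embedder(batch_x):
    """Hypothetical stand-in for MOMENT: mean and std of each series."""
    return np.stack([batch_x.mean(axis=(1, 2)), batch_x.std(axis=(1, 2))], axis=1)

def embed_in_batches(series, labels, batch_size=4):
    """Embed batch by batch, then concatenate into one array per output."""
    embeddings, kept_labels = [], []
    for start in range(0, len(series), batch_size):
        batch_x = series[start:start + batch_size]
        embeddings.append(fake_embedder(batch_x))
        kept_labels.append(labels[start:start + batch_size])
    return np.concatenate(embeddings), np.concatenate(kept_labels)

rng = np.random.default_rng(0)
series = rng.normal(size=(10, 1, 512))     # (num_samples, channels, seq_len)
labels = rng.integers(0, 2, size=(10, 3))  # multi-hot labels for 3 hypothetical tags
emb, lab = embed_in_batches(series, labels)
print(emb.shape, lab.shape)
```

Because labels are collected in the same loop as the embeddings, the rows stay aligned even when the final batch is smaller than `batch_size`.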

## Load MOMENT Model and Generate BTS Embeddings

In [None]:
from momentfm import MOMENTPipeline

# Load MOMENT-1-large in embedding mode
model = MOMENTPipeline.from_pretrained(
    "AutonLab/MOMENT-1-large", 
    model_kwargs={'task_name': 'embedding'},
)
model.init()
model.to("cpu").float()

# Generate embeddings for training data
print("Generating training embeddings...")
train_embeddings, train_labels = get_embedding(model, train_dataloader)
print(f"Train embeddings shape: {train_embeddings.shape}")
print(f"Train labels shape: {train_labels.shape}")

# Generate embeddings for test data
print("\nGenerating test embeddings...")
test_embeddings, test_labels = get_embedding(model, test_dataloader)
print(f"Test embeddings shape: {test_embeddings.shape}")
print(f"Test labels shape: {test_labels.shape}")

## Save BTS Embeddings and Labels

In [None]:
# Save embeddings and labels to files
np.save("train_embeddings.npy", train_embeddings)
np.save("train_labels.npy", train_labels)
np.save("test_embeddings.npy", test_embeddings)
np.save("test_labels.npy", test_labels)

print("Embeddings and labels saved successfully!")
print(f"  - train_embeddings.npy: {train_embeddings.shape}")
print(f"  - train_labels.npy: {train_labels.shape}")
print(f"  - test_embeddings.npy: {test_embeddings.shape}")
print(f"  - test_labels.npy: {test_labels.shape}")
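
Before using the saved files downstream, it can help to confirm that they reload with matching row counts. The sketch below writes synthetic arrays to a temporary directory and reads them back. The array contents are random placeholders, and the shapes follow the 1024-dimensional embeddings and 94 labels described in this notebook.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(8, 1024)).astype(np.float32)    # synthetic stand-in
labels = rng.integers(0, 2, size=(8, 94)).astype(np.float32)  # synthetic stand-in

with tempfile.TemporaryDirectory() as tmp_dir:
    emb_path = os.path.join(tmp_dir, "train_embeddings.npy")
    lab_path = os.path.join(tmp_dir, "train_labels.npy")
    np.save(emb_path, embeddings)
    np.save(lab_path, labels)

    # np.load reads the full array into memory by default.
    loaded_emb = np.load(emb_path)
    loaded_lab = np.load(lab_path)

# Every embedding row should have exactly one label row.
rows_match = loaded_emb.shape[0] == loaded_lab.shape[0]
print(loaded_emb.shape, loaded_lab.shape, rows_match)
```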

## Summary

This notebook demonstrates the TSFM embedding generation process for both BTS and WHITED datasets:

### BTS Dataset
1. Loaded the dataset (31,839 samples with 94 multi-label targets)
2. Resampled all time series to length 512 using linear interpolation
3. Split data into 80% train (25,471 samples) and 20% test (6,368 samples)
4. Generated 1024-dimensional embeddings using MOMENT-1-large foundation model
5. Saved embeddings to: `train_embeddings.npy`, `test_embeddings.npy`

### WHITED Dataset
1. Loaded FLAC audio files and extracted instantaneous power (V × I)
2. Processed 56 appliance types with ~1,339 total samples
3. Resampled to length 512 and split into 80% train / 20% test
4. Generated embeddings using two foundation models:
   - **MOMENT-1-large**: 1024-dimensional embeddings → `moment_train_embeddings_whited.npy`, `moment_test_embeddings_whited.npy`
   - **Chronos-T5-Large**: Embeddings mean-pooled over the sequence axis, giving one vector per sample whose length is the model's hidden size → `chronos_train_embeddings_whited.npy`, `chronos_test_embeddings_whited.npy`

**The saved embedding files can be used for any downstream classification algorithm (SVM, Random Forest, Neural Networks, etc.).**

**Note:** For a complete training and evaluation pipeline with SVM classification and metrics reporting, use the Python scripts:
- `train_tsfm_whited.py --model moment` (MOMENT with SVM)
- `train_tsfm_whited.py --model chronos` (Chronos with SVM)
- `train_ts2vec_whited.py` (TS2Vec baseline with SVM)
- `train_dtw_whited.py` (DTW baseline)
- `train_resnet_whited.py` (ResNet baseline)

---

# WHITED Dataset - MOMENT Embedding Generation

The WHITED dataset is structured differently from BTS: each recording is a FLAC audio file that stores an electrical appliance signature. The cells below apply the same MOMENT embedding generation process to WHITED data.

## Key Differences: WHITED vs BTS

**WHITED Dataset:**
- **File Format:** FLAC audio files (2-channel: voltage and current)
- **Feature Extraction:** Instantaneous power = Voltage × Current
- **Label Structure:** Single-label classification (56 appliance types)
- **Dataset Size:** ~1,339 samples (1,071 train / 268 test)
- **Examples:** Kettle, Toaster, TV, Laptop, Microwave, etc.

**BTS Dataset:**
- **File Format:** PKL files from ZIP archive
- **Feature Extraction:** Direct time series values
- **Label Structure:** Multi-label classification (94 building system labels)
- **Dataset Size:** ~31,839 samples (25,471 train / 6,368 test)
- **Examples:** Temperature sensors, flow sensors, pressure sensors, etc.
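
The train/test counts quoted for both datasets follow from the 20% test fraction. This short sketch reproduces the arithmetic, assuming the test share is rounded up and the remainder goes to training, which is how scikit-learn handles a float `test_size`. The helper `split_sizes` is a hypothetical name.

```python
import math

def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

bts_split = split_sizes(31839)    # BTS sample count from the summary
whited_split = split_sizes(1339)  # WHITED sample count from the summary
print(bts_split, whited_split)
```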

## WHITED Data Loading from FLAC Files

In [None]:
import soundfile as sf  # To handle FLAC files

# Path to WHITED dataset folder
whited_folder_path = "data/WHITEDv1.1"

def process_flac_files(folder_path):
    """
    Process all FLAC files in a folder and extract instantaneous power.
    
    Parameters:
    - folder_path: Path to folder containing FLAC files
    
    Returns:
    - Dictionary of time series data keyed by appliance type
    """
    data_dict = {}
    
    for file_name in os.listdir(folder_path):
        if file_name.endswith(".flac"):
            # Extract appliance key from filename (e.g., "Kettle_001.flac" -> "Kettle")
            key = file_name.split("_")[0]
            file_path = os.path.join(folder_path, file_name)
            
            # Load FLAC file (2 channels: voltage and current)
            data, samplerate = sf.read(file_path)
            
            # Calculate instantaneous power (V × I)
            instantaneous_power = data[:, 0] * data[:, 1]
            
            # Store in dictionary
            if key not in data_dict:
                data_dict[key] = []
            data_dict[key].append(instantaneous_power)
    
    print(f"Processed {len(data_dict)} unique appliance types.")
    for key, timeseries_list in data_dict.items():
        print(f"  {key}: {len(timeseries_list)} samples")
    
    return data_dict

# Load WHITED data
whited_data_dict = process_flac_files(whited_folder_path)
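
The power computation inside `process_flac_files()` can be illustrated on synthetic signals. This sketch builds a two-column array shaped like a two-channel audio read and multiplies the channels sample by sample. The sample rate, mains frequency, and amplitudes are assumptions chosen for illustration, not values read from the dataset.

```python
import numpy as np

sample_rate = 44_100  # Hz, assumed for illustration
mains_freq = 50       # Hz, assumed for illustration
t = np.arange(sample_rate) / sample_rate  # one second of samples

voltage = 1.0 * np.sin(2 * np.pi * mains_freq * t)
current = 0.5 * np.sin(2 * np.pi * mains_freq * t)  # in phase with the voltage

# Mimic the (num_samples, 2) array that a two-channel read returns.
data = np.column_stack([voltage, current])

# Instantaneous power is the element-wise product of the two channels.
instantaneous_power = data[:, 0] * data[:, 1]
mean_power = float(instantaneous_power.mean())
print(instantaneous_power.shape, round(mean_power, 3))
```

For in-phase sinusoids the mean power equals half the product of the two peak amplitudes, which gives a simple check on the result.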

## WHITED Data Resampling and Dataset Creation

In [None]:
def resample_whited_data(data_dict, target_steps=512):
    """
    Resample all WHITED time series to target length.
    
    Parameters:
    - data_dict: Dictionary of appliance time series
    - target_steps: Target length for resampling
    
    Returns:
    - Dictionary with resampled time series
    """
    resampled_dict = {}
    for key, timeseries_list in data_dict.items():
        resampled_dict[key] = []
        for ts in timeseries_list:
            # Resample using same function as BTS
            resampled_ts = resample_time_series(ts, target_length=target_steps)
            resampled_dict[key].append(resampled_ts)
    return resampled_dict


class WHITEDDatasetWithMask(Dataset):
    """
    Dataset for WHITED single-label classification with MOMENT model.
    """
    def __init__(self, resampled_data_dict, seq_len=512, data_split='train', 
                 test_size=0.2, random_state=42):
        """
        Parameters:
        - resampled_data_dict: Dictionary with appliance types and time series
        - seq_len: Fixed sequence length
        - data_split: 'train' or 'test'
        - test_size: Proportion for test split
        - random_state: Random seed
        """
        self.seq_len = seq_len
        
        # Flatten data and create labels
        all_data = []
        all_input_masks = []
        all_labels = []
        label_mapping = {key: idx for idx, key in enumerate(resampled_data_dict.keys())}
        
        print(f"Number of appliance classes: {len(label_mapping)}")
        
        for key, timeseries_list in resampled_data_dict.items():
            label = label_mapping[key]
            for timeseries in timeseries_list:
                # Flatten if multi-dimensional
                if timeseries.ndim > 1:
                    timeseries = timeseries.flatten()
                
                timeseries_len = len(timeseries)
                
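                # Create input mask: 1 marks observed values, 0 marks left-padding
                input_mask = np.ones(seq_len)
                input_mask[:seq_len - timeseries_len] = 0
                
                # Left-pad with zeros up to seq_len (a no-op once the series has length seq_len)
                padded_timeseries = np.pad(timeseries, (seq_len - timeseries_len, 0))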
                
                # Add channel dimension
                padded_timeseries = np.expand_dims(padded_timeseries, axis=0)
                
                all_data.append(padded_timeseries)
                all_input_masks.append(input_mask)
                all_labels.append(label)
        
        # Convert to arrays
        all_data = np.array(all_data)
        all_input_masks = np.array(all_input_masks)
        all_labels = np.array(all_labels)
        
        # Train/test split
        train_data, test_data, train_masks, test_masks, train_labels, test_labels = train_test_split(
            all_data, all_input_masks, all_labels, test_size=test_size, random_state=random_state
        )
        
        print(f"Train Data Shape: {train_data.shape}")
        print(f"Test Data Shape: {test_data.shape}")
        
        if data_split == 'train':
            self.data = train_data
            self.input_masks = train_masks
            self.labels = train_labels
        elif data_split == 'test':
            self.data = test_data
            self.input_masks = test_masks
            self.labels = test_labels
        else:
            raise ValueError("data_split must be 'train' or 'test'")
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        timeseries = self.data[idx]
        input_mask = self.input_masks[idx]
        label = self.labels[idx]
        return (
            torch.tensor(timeseries, dtype=torch.float32),
            torch.tensor(input_mask, dtype=torch.float32),
            torch.tensor(label, dtype=torch.long),  # Single label (not multi-label)
        )


# Resample WHITED data
whited_resampled = resample_whited_data(whited_data_dict, target_steps=512)

# Create datasets
whited_train_dataset = WHITEDDatasetWithMask(whited_resampled, data_split='train')
whited_test_dataset = WHITEDDatasetWithMask(whited_resampled, data_split='test')

# Create dataloaders
whited_train_dataloader = DataLoader(whited_train_dataset, batch_size=64, shuffle=True, drop_last=False)
whited_test_dataloader = DataLoader(whited_test_dataset, batch_size=64, shuffle=False, drop_last=False)

# Verify
for batch_x, batch_masks, batch_labels in whited_train_dataloader:
    print(f"WHITED Batch X Shape: {batch_x.shape}")
    print(f"WHITED Batch Masks Shape: {batch_masks.shape}")
    print(f"WHITED Batch Labels Shape: {batch_labels.shape}")
    break
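
Since every WHITED series is resampled to 512 points before it reaches the dataset class, the padding branch does nothing in this notebook. The sketch below shows what the same left-padding and masking logic produces for a shorter series. The helper `left_pad_with_mask` and its explicit length check are illustrative additions, not part of the class above.

```python
import numpy as np

def left_pad_with_mask(timeseries, seq_len):
    """Left-pad a 1-D series with zeros and flag the observed positions."""
    timeseries = np.asarray(timeseries, dtype=float)
    n_pad = seq_len - len(timeseries)
    if n_pad < 0:
        raise ValueError("series is longer than seq_len; resample or truncate first")
    input_mask = np.ones(seq_len)
    input_mask[:n_pad] = 0  # zeros mark the padded prefix
    padded = np.pad(timeseries, (n_pad, 0))
    return padded, input_mask

padded, mask = left_pad_with_mask([3.0, 4.0, 5.0], seq_len=8)
print(padded)
print(mask)
```

The mask tells the model which positions hold real observations, so padded zeros are not treated as signal.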

## Generate WHITED Embeddings with MOMENT

We use the same MOMENT model and `get_embedding()` function from the BTS section.

In [None]:
# Note: Reuse the same MOMENT model loaded earlier for BTS
# If not already loaded, uncomment the following:
# from momentfm import MOMENTPipeline
# model = MOMENTPipeline.from_pretrained(
#     "AutonLab/MOMENT-1-large", 
#     model_kwargs={'task_name': 'embedding'},
# )
# model.init()
# model.to("cpu").float()

# Generate WHITED embeddings
print("Generating WHITED training embeddings...")
whited_train_embeddings, whited_train_labels = get_embedding(model, whited_train_dataloader)
print(f"WHITED Train embeddings shape: {whited_train_embeddings.shape}")
print(f"WHITED Train labels shape: {whited_train_labels.shape}")

print("\nGenerating WHITED test embeddings...")
whited_test_embeddings, whited_test_labels = get_embedding(model, whited_test_dataloader)
print(f"WHITED Test embeddings shape: {whited_test_embeddings.shape}")
print(f"WHITED Test labels shape: {whited_test_labels.shape}")

## Save WHITED Embeddings and Labels

In [None]:
# Save WHITED embeddings and labels with distinct filenames
np.save("moment_train_embeddings_whited.npy", whited_train_embeddings)
np.save("moment_train_labels_whited.npy", whited_train_labels)
np.save("moment_test_embeddings_whited.npy", whited_test_embeddings)
np.save("moment_test_labels_whited.npy", whited_test_labels)

print("WHITED embeddings and labels saved successfully!")
print(f"  - moment_train_embeddings_whited.npy: {whited_train_embeddings.shape}")
print(f"  - moment_train_labels_whited.npy: {whited_train_labels.shape}")
print(f"  - moment_test_embeddings_whited.npy: {whited_test_embeddings.shape}")
print(f"  - moment_test_labels_whited.npy: {whited_test_labels.shape}")

---

## Alternative: Generate WHITED Embeddings with Chronos

Chronos is another time series foundation model that can generate embeddings. Below we show how to generate embeddings using Chronos-T5-Large for comparison with MOMENT.

In [None]:
from chronos import ChronosPipeline

# Load Chronos-T5-Large model
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-large",
    device_map="cpu",
    torch_dtype=torch.bfloat16,
)
chronos_model = pipeline.embed

In [None]:
def get_chronos_embedding(model, dataloader):
    """
    Generates embeddings using Chronos model.
    
    Parameters:
    - model: Chronos embed function (pipeline.embed)
    - dataloader: DataLoader for the dataset
    
    Returns:
    - embeddings: Array of shape (num_samples, embedding_dim)
    - labels: Array of labels
    """
    embeddings, labels = [], []
    with torch.no_grad():
        for batch_x, batch_masks, batch_labels in tqdm(dataloader, total=len(dataloader)):
            batch_x = batch_x.to("cpu").float()
            
            # Embed each series separately; pipeline.embed returns a tuple
            # (embeddings, tokenizer_state), so [0] keeps only the embeddings
            embedding = []
            for b in batch_x:
                _embedding = model(b[0].cpu())[0]
                embedding.append(_embedding)
            
            embedding = torch.stack(embedding).to(torch.float32)
            # Average the embedding over sequence length
            embedding = embedding.mean(dim=2)
            # Reshape to flatten
            embedding = embedding.reshape(embedding.shape[0], -1)
            
            embeddings.append(embedding.detach().cpu().numpy())
            labels.append(batch_labels.cpu().numpy())  # Store labels as NumPy arrays

    embeddings, labels = np.concatenate(embeddings), np.concatenate(labels)
    return embeddings, labels


# Generate Chronos embeddings for WHITED
print("Generating WHITED training embeddings with Chronos...")
chronos_train_embeddings, chronos_train_labels = get_chronos_embedding(chronos_model, whited_train_dataloader)
print(f"Chronos Train embeddings shape: {chronos_train_embeddings.shape}")
print(f"Chronos Train labels shape: {chronos_train_labels.shape}")

print("\nGenerating WHITED test embeddings with Chronos...")
chronos_test_embeddings, chronos_test_labels = get_chronos_embedding(chronos_model, whited_test_dataloader)
print(f"Chronos Test embeddings shape: {chronos_test_embeddings.shape}")
print(f"Chronos Test labels shape: {chronos_test_labels.shape}")
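
The stack, mean, and reshape steps in `get_chronos_embedding()` are easier to follow with explicit shapes. This sketch uses random NumPy arrays as stand-ins for the per-sample encoder outputs, assuming each call yields an array shaped `(1, seq_len, d_model)`. The sizes are toy values, not those of Chronos-T5-Large.

```python
import numpy as np

batch_size, seq_len, d_model = 4, 6, 8  # toy sizes, not the real model's
rng = np.random.default_rng(0)

# Stand-ins for per-sample encoder outputs, each shaped (1, seq_len, d_model).
per_sample = [rng.normal(size=(1, seq_len, d_model)) for _ in range(batch_size)]

stacked = np.stack(per_sample)              # (batch, 1, seq_len, d_model)
pooled = stacked.mean(axis=2)               # average over the sequence axis
flat = pooled.reshape(pooled.shape[0], -1)  # (batch, d_model)
print(stacked.shape, pooled.shape, flat.shape)
```

Under that shape assumption, each sample ends up as a single vector of length `d_model`, regardless of how many tokens the sequence produced.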

## Save Chronos Embeddings

In [None]:
# Save Chronos embeddings and labels with distinct filenames
np.save("chronos_train_embeddings_whited.npy", chronos_train_embeddings)
np.save("chronos_train_labels_whited.npy", chronos_train_labels)
np.save("chronos_test_embeddings_whited.npy", chronos_test_embeddings)
np.save("chronos_test_labels_whited.npy", chronos_test_labels)

print("Chronos embeddings and labels saved successfully!")
print(f"  - chronos_train_embeddings_whited.npy: {chronos_train_embeddings.shape}")
print(f"  - chronos_train_labels_whited.npy: {chronos_train_labels.shape}")
print(f"  - chronos_test_embeddings_whited.npy: {chronos_test_embeddings.shape}")
print(f"  - chronos_test_labels_whited.npy: {chronos_test_labels.shape}")

## Note: Using train_tsfm_whited.py for End-to-End Workflow

This notebook covers **embedding generation only**. The Python script provides a complete end-to-end workflow that includes:
- Embedding generation
- SVM classifier training
- Evaluation metrics (accuracy, precision, recall, F1-score)
- Results saving

Run the script with the model of your choice:
```bash
# For MOMENT embeddings + SVM classifier
python train_tsfm_whited.py --model moment

# For Chronos embeddings + SVM classifier
python train_tsfm_whited.py --model chronos
```

The script `train_tsfm_whited.py` implements the same embedding generation process shown in this notebook, plus automatic SVM training and comprehensive evaluation.
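
For readers who want to compute the listed metrics directly from predictions, here is a minimal sketch in NumPy. It uses macro averaging across classes, which is an assumption: the averaging used by `train_tsfm_whited.py` is not stated here. The function name `macro_metrics` and the toy labels are hypothetical.

```python
import numpy as np

def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    precisions, recalls, f1s = [], [], []
    for cls in np.unique(y_true):
        tp = int(np.sum((y_pred == cls) & (y_true == cls)))
        fp = int(np.sum((y_pred == cls) & (y_true != cls)))
        fn = int(np.sum((y_pred != cls) & (y_true == cls)))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "precision": float(np.mean(precisions)),
        "recall": float(np.mean(recalls)),
        "f1": float(np.mean(f1s)),
    }

# Toy single-label predictions for three hypothetical appliance classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
metrics = macro_metrics(y_true, y_pred)
print(metrics)
```

Macro averaging weights every class equally, which matters when some appliance types have far fewer recordings than others.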