# Data Preprocessing for Classification and Regression

This notebook handles data preprocessing tasks, which are essential before training machine learning models. The main steps include:

- Loading the dataset
- Separating features and targets for classification and regression
- Splitting the data into training and testing sets
- Selecting top features for modeling
- Scaling the data using `RobustScaler`

The preprocessed data is saved for further use in model training.

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
import joblib
import numpy as np

## Data Loading and Target Separation

The dataset is loaded using `pandas.read_csv()`. Features (`X`) are separated from the classification target (`NLOS`) and regression target (`RANGE`). This distinction allows for both classification and regression tasks to be performed later.

In [10]:
data = pd.read_csv('../data/processed/aggregated_dataset.csv')
X = data.drop(columns=['NLOS', 'RANGE'])  # Features
y_class = data['NLOS']                    # Classification target
y_reg = data['RANGE']                     # Regression target

## Splitting Data into Training and Testing Sets

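The dataset is split into training (80%) and test (20%) sets using `train_test_split` from `sklearn`. The split is stratified on the classification target `NLOS`, so the training and test sets keep the same class proportions as the full dataset. Stratification preserves the original class distribution; it does not rebalance the classes. A fixed `random_state` makes the split reproducible.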

In [11]:
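X_train, X_test, y_class_train, y_class_test, y_reg_train, y_reg_test = train_test_split(
    X, y_class, y_reg,
    test_size=0.2,      # hold out 20% of the rows for testing
    stratify=y_class,   # keep the NLOS class proportions in both subsets
    random_state=42     # fixed seed for a reproducible split
)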
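To illustrate what the `stratify` argument does, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` and the 30% positive rate are assumptions made up for this example; they are not taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 samples, 30% in class 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 300 + [0] * 700)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratification, the class-1 rate in each subset stays close to
# the rate in the full data.
print("full:", y_demo.mean(), "train:", y_tr.mean(), "test:", y_te.mean())
```

Without `stratify`, the class rates in the two subsets would vary with the random seed, which matters most when one class is rare.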

## Feature Selection

A set of top-performing features is selected for model training and stored in the `top_features` list. The training and test sets are then restricted to these columns. Using fewer features reduces model complexity and can improve performance.

In [12]:
# initial features set
top_features = [
    'RXPACC', 'FP_AMP1', 'FP_AMP2', 'FP_AMP3', 
    'RISE_TIME_CLIPPED', 'CIR_SKEW', 
    'CIR_ENERGY_FIRST_100', 'CIR_PWR', 'FP_IDX'
]

X_train_top = X_train[top_features]
X_test_top = X_test[top_features]

## Data Scaling using RobustScaler

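`RobustScaler` is used to scale the selected features. Unlike scalers based on the mean and standard deviation or on the minimum and maximum, it centers each feature on its median and scales it by its interquartile range (IQR). These statistics are barely affected by outliers, so extreme values do not dominate the scaling. The scaler is fitted on the training set only and then applied to the test set, so no test-set information leaks into the scaling parameters.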

In [13]:
scaler = RobustScaler()
X_train_top = scaler.fit_transform(X_train_top)  # fit on training data only
X_test_top = scaler.transform(X_test_top)        # reuse the training statistics

# Save the scaled top features
np.save('../data/processed/X_train_top.npy', X_train_top)
np.save('../data/processed/X_test_top.npy', X_test_top)
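The scaling rule described above can be checked by hand. The following sketch uses a toy one-column array (an assumption for illustration, not data from this project) and compares `RobustScaler` with its default settings against the manual formula `(x - median) / IQR`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy single-feature column with one extreme outlier (1000.0).
x_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

# Defaults: center on the median, scale by the 25th-75th percentile range.
scaler_demo = RobustScaler()
x_scaled = scaler_demo.fit_transform(x_demo)

# Manual equivalent: (x - median) / (Q3 - Q1)
median = np.median(x_demo, axis=0)
q1, q3 = np.percentile(x_demo, [25, 75], axis=0)
x_manual = (x_demo - median) / (q3 - q1)

print(np.allclose(x_scaled, x_manual))
```

The outlier remains large after scaling, but it does not distort the center or spread used to scale the other values.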

In [14]:
# updated features set
top_features = [
    'RXPACC', 'CIR_ENERGY_FIRST_100', 'RISE_TIME_CLIPPED', 'RISE_TIME', 
    'CIR742', 'CIR741', 'CIR574', 'CIR740', 'CIR326', 'CIR646', 'CIR582', 'CIR575', 
    'CIR246', 'CIR526', 'CIR654', 'CIR350', 'CIR590', 'CIR430'
]

# Note: despite the names, these are still unscaled here; scaling happens in the next cell
X_train_top_scaled = X_train[top_features]
X_test_top_scaled = X_test[top_features]

In [15]:
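# New scaler for the updated feature set (replaces the one fitted earlier)
scaler = RobustScaler()
X_train_top_scaled = scaler.fit_transform(X_train_top_scaled)  # fit on training data only
X_test_top_scaled = scaler.transform(X_test_top_scaled)        # reuse the training statistics

# Save the scaled updated features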
np.save('../data/processed/X_train_top_scaled.npy', X_train_top_scaled)
np.save('../data/processed/X_test_top_scaled.npy', X_test_top_scaled)
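The notebook imports `joblib` but saves only the scaled arrays, not the fitted scaler. If the same scaling needs to be applied to new data later, one option is to persist the scaler itself. The sketch below is an assumption about how that could look: the file name `robust_scaler.joblib`, the temporary directory, and the toy arrays are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy training data standing in for X_train[top_features].
X_fit = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
fitted = RobustScaler().fit(X_fit)

# Hypothetical location; the notebook does not define a path for the scaler.
scaler_path = os.path.join(tempfile.mkdtemp(), "robust_scaler.joblib")
joblib.dump(fitted, scaler_path)

# Later, for example in a training or inference script: reload and reuse.
restored = joblib.load(scaler_path)
X_new = np.array([[2.5, 25.0]])
X_new_scaled = restored.transform(X_new)
print(X_new_scaled)
```

Because the second feature set reuses the variable name `scaler`, each fitted scaler would need its own file if both feature sets are kept.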

In [16]:
# from sklearn.decomposition import PCA

# # Option A: PCA on all features 
# pca = PCA(n_components=0.95)  # Retain 95% variance
# X_train_pca = pca.fit_transform(X_train_scaled)
# X_test_pca = pca.transform(X_test_scaled)
# print(f"PCA reduced dimensions to {X_train_pca.shape[1]} components.")

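PCA was not effective, so the code above is left commented out. Note that `X_train_scaled` and `X_test_scaled`, which it references, are not defined in this notebook.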
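For reference, a runnable version of the PCA experiment might look like the sketch below. It runs on synthetic data, and the arrays, the 160/40 split, and the name `scaler_all` are assumptions for illustration; the notebook does not show how the all-feature scaled arrays were built.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 samples, 10 features driven by 3 latent factors plus noise.
latent = rng.normal(size=(200, 3))
X_all = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

# Scale all features, fitting on the training portion only.
split = 160
scaler_all = RobustScaler()
X_train_scaled = scaler_all.fit_transform(X_all[:split])
X_test_scaled = scaler_all.transform(X_all[split:])

# A float n_components keeps enough components to explain that share of variance.
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
print(f"PCA reduced dimensions to {X_train_pca.shape[1]} components.")
```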

In [17]:
# Targets are saved raw (no scaling)
# Save classification targets
np.save('../data/processed/y_class_train.npy', y_class_train)
np.save('../data/processed/y_class_test.npy', y_class_test)

# Save regression targets
np.save('../data/processed/y_reg_train.npy', y_reg_train)
np.save('../data/processed/y_reg_test.npy', y_reg_test)
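A later notebook can read these files back with `np.load`. The sketch below shows the round trip on a toy `pandas` Series written to a temporary directory, which stands in for `../data/processed`; the values are made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

out_dir = tempfile.mkdtemp()  # stand-in for ../data/processed

# Toy target with a non-default index, as it would be after a shuffled split.
y_demo_train = pd.Series([0, 1, 1, 0], index=[7, 2, 9, 4], name="NLOS")
np.save(os.path.join(out_dir, "y_class_train.npy"), y_demo_train)

# np.save stores only the values, so the pandas index and name are not kept.
y_loaded = np.load(os.path.join(out_dir, "y_class_train.npy"))
print(type(y_loaded), y_loaded)
```

Since the index is dropped, the saved targets line up with the saved feature arrays by row position only, which holds here because both come from the same `train_test_split` call.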