# Data Preprocessing for All Buildings

This notebook preprocesses the UJIIndoorLoc dataset for all three buildings (0, 1, 2) and saves the preprocessed data for use in subsequent experiments.

## Overview

- **Buildings**: 0, 1, 2
- **Preprocessing Steps**:
  1. Load training and validation data
  2. Filter by building ID
  3. Normalize RSSI values
  4. Normalize coordinates (longitude, latitude)
  5. Save preprocessed data to pickle and Excel formats
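
Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the code in `scripts.data.pre_processing`: it assumes min-max scaling to [0, 1], and it relies on the UJIIndoorLoc convention that a value of 100 marks an undetected access point while detected RSSI values lie between -104 and 0 dBm. The helper names `normalize_rssi` and `normalize_coords`, and the sample coordinates, are hypothetical.

```python
import numpy as np

NOT_DETECTED = 100                 # UJIIndoorLoc marker for "access point not heard"
RSSI_MIN, RSSI_MAX = -104.0, 0.0   # range of detected RSSI values in dBm


def normalize_rssi(rssi):
    """Map RSSI to [0, 1]; undetected access points become 0 (assumed scheme)."""
    rssi = np.asarray(rssi, dtype=float)
    floor = RSSI_MIN - 1.0
    # Treat "not detected" as one step weaker than the weakest detectable signal.
    rssi = np.where(rssi == NOT_DETECTED, floor, rssi)
    return (rssi - floor) / (RSSI_MAX - floor)


def normalize_coords(values, v_min, v_max):
    """Min-max scale a coordinate column to [0, 1] using the given bounds."""
    values = np.asarray(values, dtype=float)
    return (values - v_min) / (v_max - v_min)


# Illustrative inputs only; these are not taken from the dataset.
rssi_scaled = normalize_rssi([[100, -104, 0]])
lon_scaled = normalize_coords([-7700.0, -7650.0, -7600.0], -7700.0, -7600.0)
```

Keeping the bounds as explicit arguments means the same values can later be reused to map predictions back to the original coordinate system.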

## Output

For each building, the following files are generated:
- `data/output_data/preprocessed_data/preprocessed_building_<ID>.pkl` (fast loading)
- `data/output_data/preprocessed_data/preprocessed_building_<ID>.xlsx` (human-readable)
- `data/system_input/system_parameters_building_<ID>.csv` (normalization parameters)

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.preprocessing import MinMaxScaler
import warnings

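# Suppress library warnings to keep the notebook output readable.
# Note: this hides all warnings, including potentially useful ones.
warnings.filterwarnings('ignore')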

In [None]:
# Import custom preprocessing modules
from scripts.data.pre_processing import load_and_preprocess_data
from scripts.data import pre_processing
from scripts.data.data_loaders import save_preprocessed_data

print("✓ Modules imported successfully")

## Configuration

In [None]:
# Data paths
data_dir = Path('data') / 'input_data'
df_train_path = data_dir / 'TrainingData.csv'
df_validation_path = data_dir / 'ValidationData.csv'

# Building IDs to process
building_ids = [0, 1, 2]

# Floor height parameter (in meters)
floor_height = 3.0

print(f"Training data path: {df_train_path}")
print(f"Validation data path: {df_validation_path}")
print(f"Buildings to process: {building_ids}")
print(f"Floor height: {floor_height} meters")

## Data Loading and Basic Statistics

In [None]:
# Load raw data to check statistics
train_df = pd.read_csv(df_train_path)
val_df = pd.read_csv(df_validation_path)

print("Dataset Statistics:")
print("="*60)
print(f"Total training samples: {len(train_df):,}")
print(f"Total validation samples: {len(val_df):,}")
print(f"\nNumber of WAP columns: {len([col for col in train_df.columns if col.startswith('WAP')])}")

print("\nSamples per building:")
print("-"*60)
for building_id in building_ids:
    train_count = len(train_df[train_df['BUILDINGID'] == building_id])
    val_count = len(val_df[val_df['BUILDINGID'] == building_id])
    print(f"Building {building_id}: {train_count:,} training, {val_count:,} validation")

## Preprocess All Buildings

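The next cell processes each building in turn: it loads and normalizes the data, saves the preprocessed arrays, and writes the normalization parameters to a per-building CSV file. A failure in one building is recorded in the summary and does not stop the remaining buildings from being processed.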

In [None]:
# Process each building
preprocessing_summary = []

for building_id in building_ids:
    print("\n" + "="*80)
    print(f"Processing Building {building_id}")
    print("="*80)
    
    try:
        # 1. Load and preprocess data
        rssi_train, coords_train, rssi_val, coords_val, ap_columns = load_and_preprocess_data(
            df_train_path, 
            df_validation_path, 
            building_id,
            floor_height=floor_height
        )
        
        # 2. Save preprocessed data
        save_preprocessed_data(
            rssi_train, coords_train,
            rssi_val, coords_val,
            ap_columns,
            building_id=building_id
        )
        
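        # 3. Save system parameters (normalization values)
        # The bounds are read from module-level attributes of pre_processing,
        # which are assumed to be set by the load_and_preprocess_data call above.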
        system_params_dir = Path('data') / 'system_input'
        system_params_dir.mkdir(parents=True, exist_ok=True)
        
        system_params = pd.DataFrame({
            'Parameter': ['LON_MIN', 'LON_MAX', 'LAT_MIN', 'LAT_MAX', 'FLOOR_HEIGHT', 'BUILDING_ID'],
            'Value': [
                pre_processing.LON_MIN,
                pre_processing.LON_MAX,
                pre_processing.LAT_MIN,
                pre_processing.LAT_MAX,
                floor_height,
                building_id
            ],
            'Description': [
                'Minimum longitude value for denormalization',
                'Maximum longitude value for denormalization',
                'Minimum latitude value for denormalization',
                'Maximum latitude value for denormalization',
                'Height of each floor in meters',
                'Building ID used for preprocessing'
            ]
        })
        
        system_params_path = system_params_dir / f'system_parameters_building_{building_id}.csv'
        system_params.to_csv(system_params_path, index=False)
        
        print(f"\n✓ System parameters saved to: {system_params_path}")
        
        # Store summary
        preprocessing_summary.append({
            'Building_ID': building_id,
            'Train_Samples': len(rssi_train),
            'Val_Samples': len(rssi_val),
            'Num_APs': len(ap_columns),
            'Status': 'Success'
        })
        
        print(f"\n✓ Building {building_id} preprocessing complete!")
        
    except Exception as e:
        print(f"\n✗ Error processing building {building_id}: {str(e)}")
        preprocessing_summary.append({
            'Building_ID': building_id,
            'Train_Samples': 0,
            'Val_Samples': 0,
            'Num_APs': 0,
            'Status': f'Failed: {str(e)}'
        })

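# Report how many buildings actually succeeded rather than assuming all did
n_failed = sum(s['Status'] != 'Success' for s in preprocessing_summary)
print("\n" + "="*80)
print(f"Preprocessing finished: {len(preprocessing_summary) - n_failed} succeeded, {n_failed} failed")
print("="*80)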

## Preprocessing Summary

In [None]:
# Create summary DataFrame
summary_df = pd.DataFrame(preprocessing_summary)

print("\nPreprocessing Summary:")
print("="*80)
print(summary_df.to_string(index=False))

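# Save summary to file (create the directory first in case every building failed
# and save_preprocessed_data never had a chance to create it)
summary_path = Path('data') / 'output_data' / 'preprocessed_data' / 'preprocessing_summary.csv'
summary_path.parent.mkdir(parents=True, exist_ok=True)
summary_df.to_csv(summary_path, index=False)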
print(f"\n✓ Summary saved to: {summary_path}")

## Verification: Load Preprocessed Data

Verify that the preprocessed data can be loaded successfully.
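
Once the arrays are loaded, it can also help to confirm that they look normalized. The sketch below is only an example under stated assumptions: it presumes the arrays are numeric and were scaled to [0, 1], which may not match the actual output of `load_and_preprocess_data`. The function `check_unit_range` and the demo arrays are hypothetical.

```python
import numpy as np


def check_unit_range(name, array, low=0.0, high=1.0):
    """Return True if every value is finite and lies within [low, high]."""
    array = np.asarray(array, dtype=float)
    ok = bool(
        np.isfinite(array).all()
        and (array >= low).all()
        and (array <= high).all()
    )
    status = "ok" if ok else "out of range"
    print(f"{name}: {status} (min={array.min():.3f}, max={array.max():.3f})")
    return ok


# Demo arrays standing in for rssi_train / coords_train.
demo_ok = check_unit_range("demo_rssi", [[0.0, 0.25], [1.0, 0.5]])
demo_bad = check_unit_range("demo_coords", [[-0.1, 0.5]])
```

In the notebook, the same call could be applied to `rssi_train`, `coords_train`, `rssi_val`, and `coords_val` for each building, with the bounds adjusted if a different normalization range is used.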

In [None]:
from scripts.data.data_loaders import load_preprocessed_data

print("Verifying preprocessed data can be loaded...\n")

failed_buildings = []

for building_id in building_ids:
    try:
        rssi_train, coords_train, rssi_val, coords_val, ap_columns = load_preprocessed_data(
            building_id=building_id,
            use_pickle=True
        )
        print(f"Building {building_id}: ✓ Loaded successfully")
        print(f"  Training samples: {len(rssi_train)}")
        print(f"  Validation samples: {len(rssi_val)}")
        print(f"  Number of APs: {len(ap_columns)}")
        print()
    except Exception as e:
        # Remember the failure so the final message reflects the real outcome
        failed_buildings.append(building_id)
        print(f"Building {building_id}: ✗ Failed to load - {e}")
        print()

if failed_buildings:
    print(f"\n✗ Verification failed for buildings: {failed_buildings}")
else:
    print("\n✓ Verification complete!")

## Generated Files

A successful run creates the following files:

### Preprocessed Data (per building)
- `data/output_data/preprocessed_data/preprocessed_building_0.pkl`
- `data/output_data/preprocessed_data/preprocessed_building_0.xlsx`
- `data/output_data/preprocessed_data/preprocessed_building_1.pkl`
- `data/output_data/preprocessed_data/preprocessed_building_1.xlsx`
- `data/output_data/preprocessed_data/preprocessed_building_2.pkl`
- `data/output_data/preprocessed_data/preprocessed_building_2.xlsx`

### System Parameters (per building)
- `data/system_input/system_parameters_building_0.csv`
- `data/system_input/system_parameters_building_1.csv`
- `data/system_input/system_parameters_building_2.csv`
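
The parameter files use the `Parameter`, `Value`, `Description` layout written earlier in this notebook, so they can be read back to convert normalized coordinates to the original units. The following is a minimal sketch that assumes min-max normalization to [0, 1]; the numeric values are made up for illustration and the `denormalize` helper is hypothetical. An in-memory string stands in for the CSV file so the example runs on its own.

```python
import io

import pandas as pd

# Stand-in for a system_parameters_building_<ID>.csv file; values are invented.
csv_text = """Parameter,Value,Description
LON_MIN,-7700.0,Minimum longitude value for denormalization
LON_MAX,-7600.0,Maximum longitude value for denormalization
LAT_MIN,4864700.0,Minimum latitude value for denormalization
LAT_MAX,4864900.0,Maximum latitude value for denormalization
"""

# Index by parameter name so values can be looked up directly.
params = pd.read_csv(io.StringIO(csv_text)).set_index("Parameter")["Value"]


def denormalize(value, v_min, v_max):
    """Invert min-max scaling from [0, 1] back to the original units."""
    return value * (v_max - v_min) + v_min


lon = denormalize(0.5, params["LON_MIN"], params["LON_MAX"])
lat = denormalize(0.25, params["LAT_MIN"], params["LAT_MAX"])
```

To read a real file, the `io.StringIO(csv_text)` argument would be replaced with the path to the building's parameter CSV.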

### Summary
- `data/output_data/preprocessed_data/preprocessing_summary.csv`