## **Data Source Summary**

- **Name**: `APS Failure at Scania Trucks (Training Set)`

- **Description**: The dataset's positive class consists of failures of a specific component of the Air Pressure System (APS). The negative class consists of trucks with failures in components unrelated to the APS.
- **Source**: https://archive.ics.uci.edu/dataset/421/aps+failure+at+scania+trucks
- **Files**: aps_failure_training_set.csv

## **Data Import, Initial Cleaning, and Imputation**
> ⚠️ **Warning**: The cell below will generate and save a cleaned dataset to `data/training_cleaned.csv`.
>
> By default, it won't run if the cleaned file already exists.  
> Set `overwrite = True` in the code block to force regeneration.

In [None]:
# Setup
import pandas as pd
from pathlib import Path
from sklearn.impute import KNNImputer

# Set path to original file and output file
base_dir = Path.cwd()
input_path = base_dir.parent / 'data' / 'aps_failure_training_set.csv'
output_path = base_dir.parent / 'data' / 'training_cleaned.csv'

# Optional: overwrite existing cleaned file
overwrite = False  # Set to True if you want to reprocess and overwrite

# Check for file existence; load & impute if necessary/desired
if not output_path.exists() or overwrite:
    print("Cleaning and saving the data...")

    # Load the data; skip first 20 rows; treat 'na' as NaN
    df = pd.read_csv(input_path, skiprows=20, na_values='na')

    # Separate the first column ('class')
    first_col = df.iloc[:, 0]                   # Grab first column
    features = df.iloc[:, 1:]                   # All columns except the first

    # Perform KNN imputation on the feature set
    imputer = KNNImputer(n_neighbors=3)
    features_imputed = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)

    # Recombine the first column and the imputed features set
    df_imputed = pd.concat([first_col.reset_index(drop=True), features_imputed], axis=1)

    # Save cleaned file
    df_imputed.to_csv(output_path, index=False)

    print(f"Cleaned data saved to: {output_path}")
else:
    print(f"Cleaned file already exists at: {output_path}")
    print("Set `overwrite = True` to regenerate it.")
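To make the separate, impute, recombine pattern easier to follow, here is a minimal sketch on a toy frame. The column names and values are illustrative assumptions, not rows from the real file, and `n_neighbors=2` is chosen only so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy stand-in for the real data: a label column followed by numeric features.
# Column names and values are illustrative only.
toy = pd.DataFrame({
    "class": ["neg", "pos", "neg", "neg"],
    "aa_000": [1.0, 2.0, np.nan, 4.0],
    "ab_000": [10.0, 20.0, 30.0, 40.0],
})

# Keep the non-numeric label out of the imputer, as in the cell above
labels = toy.iloc[:, 0]
features = toy.iloc[:, 1:]

# Fill each missing value with the mean of that feature over the nearest rows
imputer = KNNImputer(n_neighbors=2)
features_imputed = pd.DataFrame(
    imputer.fit_transform(features),
    columns=features.columns,
    index=features.index,  # keep the original index so the concat aligns
)

# Recombine labels and imputed features
toy_imputed = pd.concat([labels, features_imputed], axis=1)
print(toy_imputed)
```

In this toy case the missing `aa_000` value is filled from the two rows whose `ab_000` is closest (20 and 40), giving the mean of 2 and 4. Note that `KNNImputer` drops columns that are entirely missing by default, so reusing `features.columns` assumes every feature has at least one observed value.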