## **Data Source Summary**

- **Name**: `APS Failure at Scania Trucks`

- **Description**: The dataset's positive class consists of failures of a specific component of the Air Pressure System (APS). The negative class consists of trucks with failures of components unrelated to the APS.
- **Source**: https://archive.ics.uci.edu/dataset/421/aps+failure+at+scania+trucks
- **Files**: aps_failure_training_set.csv, aps_failure_test_set.csv, aps_failure_description.txt
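The class split described above can be checked on the raw CSV before any cleaning. The following is a minimal, dependency-free sketch, not part of the original workflow. It assumes the label column is named `class` with values `neg`/`pos`, that missing cells are written as `na`, and that the file opens with a block of description lines before the header row. The helper name `summarize_raw` and the toy rows are hypothetical.

```python
import csv
import io
from collections import Counter

def summarize_raw(lines, preamble_rows=20, label_col="class", na_token="na"):
    """Count class labels and missing-value tokens in a raw APS-style CSV."""
    body = lines[preamble_rows:]      # drop the descriptive preamble
    reader = csv.DictReader(body)     # first remaining line is the header
    labels, missing = Counter(), 0
    for record in reader:
        labels[record[label_col]] += 1
        missing += sum(1 for v in record.values() if v == na_token)
    return labels, missing

# Toy stand-in for the real file (assumed layout): 2 preamble lines,
# then a header row and three data rows.
toy = io.StringIO(
    "description line 1\n"
    "description line 2\n"
    "class,aa_000,ab_000\n"
    "neg,10,na\n"
    "pos,na,na\n"
    "neg,7,3\n"
)
labels, missing = summarize_raw(toy.readlines(), preamble_rows=2)
print(labels, missing)
```

On the real training file, the same function could be called on the lines read from `aps_failure_training_set.csv`, provided the assumptions above hold.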

## **Data Import, Initial Cleaning, and Imputation**

> ⚠️ **Warning**: The cell below will generate and save a cleaned dataset to `data/cleaned.csv`.
>
> By default, it won't run if the cleaned file already exists.  
> Set `overwrite = True` in the code block to force regeneration.


In [None]:
# Setup
import pandas as pd
from pathlib import Path
from sklearn.impute import KNNImputer

# Set path to original file and output file
base_dir = Path.cwd()
input_path = base_dir.parent / 'data' / 'aps_failure_training_set.csv'
output_path = base_dir.parent / 'data' / 'cleaned.csv'

# Optional: overwrite existing cleaned file
overwrite = False  # Set to True if you want to reprocess and overwrite

# Check for file existence; load & impute if necessary/desired
if not output_path.exists() or overwrite:
    print("Cleaning and saving the data...")

    # Load the data; skip first 20 rows; treat 'na' as NaN
    df = pd.read_csv(input_path, skiprows=20, na_values='na')

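    # KNN imputation (3 nearest neighbors). KNNImputer accepts only numeric
    # input, so impute the numeric columns and leave any non-numeric column
    # (e.g. a string class label) untouched.
    numeric_cols = df.select_dtypes(include='number').columns
    imputer = KNNImputer(n_neighbors=3)
    df_imputed = df.copy()
    df_imputed[numeric_cols] = imputer.fit_transform(df[numeric_cols])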

    # Save cleaned file
    df_imputed.to_csv(output_path, index=False)

    print(f"Cleaned data saved to: {output_path}")
else:
    print(f"Cleaned file already exists at: {output_path}")
    print("Set `overwrite = True` to regenerate it.")

Cleaned data saved to c:\Users\mwals\OneDrive\Documents\VScode\Python\Data Analysis Projects\APS-Failure-Classification-Model\data\cleaned.csv
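
To make the imputation step in the cell above concrete, here is a minimal pure-Python sketch of the idea behind KNN imputation: each missing value is replaced by the mean of that column over the nearest rows that have it observed. It is an illustration, not scikit-learn's implementation. The helper names `nan_euclidean` and `knn_impute` and the toy matrix are hypothetical. The sketch follows `KNNImputer`'s default behavior (a NaN-aware Euclidean distance and an unweighted mean of the neighbors) but does not handle edge cases such as rows that share no observed columns.

```python
import math

def nan_euclidean(a, b):
    """Euclidean distance over coordinates present in both rows,
    rescaled by (total coordinates / shared coordinates)."""
    pairs = [(x, y) for x, y in zip(a, b)
             if not (math.isnan(x) or math.isnan(y))]
    if not pairs:
        return math.inf  # no shared coordinates: treat as infinitely far
    scale = len(a) / len(pairs)
    return math.sqrt(scale * sum((x - y) ** 2 for x, y in pairs))

def knn_impute(rows, n_neighbors=3):
    """Fill each NaN with the mean of its column over the nearest donor rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            # Donors: other rows in which column j is observed
            donors = [r for k, r in enumerate(rows)
                      if k != i and not math.isnan(r[j])]
            donors.sort(key=lambda r: nan_euclidean(row, r))
            nearest = donors[:n_neighbors]
            if nearest:
                filled[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return filled

nan = math.nan
toy = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
result = knn_impute(toy, n_neighbors=2)
print(result)
```

With `n_neighbors=2`, the missing value in the first row is filled from the two closest rows that have the third column (values 3 and 5), and the missing value in the third row from the two closest rows that have the first column (values 3 and 8).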
