# Data Preprocessing Notebook

This notebook demonstrates the data preprocessing pipeline using our custom `DataPreprocessor` class.  
We will perform the following steps:
1. Load the raw dataset.
2. Preprocess the data (remove duplicates, optimize data types, and impute missing values).
3. Save the cleaned data for further analysis.
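
For orientation, the three operations in step 2 can be sketched in plain pandas. This is a minimal illustration and not the actual `DataPreprocessor` implementation. The function name `preprocess_sketch`, the median/mode imputation rule, the category threshold, and running imputation before type optimization are all assumptions made for the example.

```python
import pandas as pd
from pandas.api import types as ptypes


def preprocess_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the preprocessing step (assumed rules, see lead-in)."""
    # Remove exact duplicate rows and rebuild a clean 0..n-1 index.
    out = df.drop_duplicates().reset_index(drop=True)

    # Impute missing values: median for numeric columns, most frequent value otherwise.
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if ptypes.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode(dropna=True)
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])

    # Optimize data types: downcast numerics, store repetitive text as category.
    for col in out.columns:
        if ptypes.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif ptypes.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif ptypes.is_object_dtype(out[col]) or ptypes.is_string_dtype(out[col]):
            # Assumed threshold: fewer distinct values than half the row count.
            if out[col].nunique() < 0.5 * len(out):
                out[col] = out[col].astype("category")
    return out
```

The real class may use different imputation rules or a different step order; the sketch only shows the kind of transformation each step performs.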

Let's get started!


In [None]:
import os
import pandas as pd

# Import our DataPreprocessor class
# Adjust the import path based on your project structure.
# For example, if the file is located at 'src/data_preprocessor.py':
from src.data_preprocessor import DataPreprocessor

# Alternatively, if you haven't packaged your source code yet, you could insert the class definition directly here.
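
If the import above fails because the notebook runs from a subfolder, one option is to put the project root on `sys.path` before importing. The helper below is a sketch under assumptions: `add_project_root` is a hypothetical name, and it assumes the project root is the nearest parent directory that contains a `src` folder.

```python
import sys
from pathlib import Path


def add_project_root(marker="src", start=None):
    """Find the nearest ancestor directory containing `marker` and add it to sys.path.

    Returns the directory that was found, or None if no ancestor contains `marker`.
    """
    current = Path(start or Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).is_dir():
            if str(candidate) not in sys.path:
                sys.path.insert(0, str(candidate))
            return candidate
    return None


# Example usage inside the notebook (run before the import of DataPreprocessor):
# project_root = add_project_root()
```

Installing the project as a package (for example with an editable install) avoids modifying `sys.path`; the helper is only a convenience for early-stage projects.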


In [None]:
# Define the paths to the raw data and the directory where processed data will be saved.
raw_file_path = r"C:\Users\Ken Ira Talingting\Desktop\anomaly-detection-project\data\raw\equipment_anomaly_data.csv"
processed_dir = r"C:\Users\Ken Ira Talingting\Desktop\anomaly-detection-project\data\processed"

print("Raw file path:", raw_file_path)
print("Processed data directory:", processed_dir)
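
The absolute paths above only work on one machine. A portable alternative is to derive both paths from the project root and fail early if the raw file is missing. This is a sketch: `resolve_data_paths` is a hypothetical helper, and it assumes the `data/raw` and `data/processed` layout visible in the paths above.

```python
from pathlib import Path


def resolve_data_paths(project_root, raw_name="equipment_anomaly_data.csv"):
    """Return (raw_file, processed_dir) under project_root/data.

    Raises FileNotFoundError if the raw file is missing and creates the
    processed directory if it does not exist yet.
    """
    root = Path(project_root)
    raw_file = root / "data" / "raw" / raw_name
    processed = root / "data" / "processed"
    if not raw_file.is_file():
        raise FileNotFoundError(f"Raw data file not found: {raw_file}")
    processed.mkdir(parents=True, exist_ok=True)
    return raw_file, processed
```

Whether `DataPreprocessor` accepts `Path` objects is not confirmed here; wrapping the results in `str(...)` keeps the call identical to the string-based version.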


In [None]:
# Initialize the DataPreprocessor and load the raw data.
preprocessor = DataPreprocessor(raw_file_path, processed_dir)

# Load the raw data
try:
    raw_df = preprocessor.load_data()
    print("Raw Data Loaded Successfully:")
    display(raw_df.head())
except Exception as e:
    print("Error loading raw data:", e)
    raise  # Later cells depend on raw_df, so stop here instead of continuing.


In [None]:
# Apply preprocessing to clean and prepare the data.
try:
    processed_df = preprocessor.preprocess(raw_df)
    print("Data Preprocessing Completed Successfully:")
    display(processed_df.head())
except Exception as e:
    print("Error during preprocessing:", e)
    raise  # The save cell depends on processed_df, so stop here instead of continuing.


In [None]:
# Save the processed data to the specified directory.
try:
    preprocessor.save_processed_data(processed_df)
    print("Processed data saved successfully!")
except Exception as e:
    print("Error saving processed data:", e)


In [None]:
# Check that the processed file exists in the target directory.
# The file name below must match the one written by save_processed_data.
processed_file_path = os.path.join(processed_dir, "equipment_anomaly_data_processed.csv")
if os.path.exists(processed_file_path):
    print("✅ Processed file exists:", processed_file_path)
else:
    print("❌ Processed file not found!")
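
An existence check does not confirm that the file contents are intact. A slightly stronger check reloads the CSV and compares its shape with the in-memory DataFrame. This is a sketch: `verify_saved_csv` is a hypothetical helper, and it assumes the file was written without the index column (as with `to_csv(..., index=False)`).

```python
import pandas as pd


def verify_saved_csv(csv_path, expected_df):
    """Reload csv_path and compare its row count and column names with expected_df."""
    reloaded = pd.read_csv(csv_path)
    same_rows = len(reloaded) == len(expected_df)
    same_cols = list(reloaded.columns) == [str(c) for c in expected_df.columns]
    return same_rows and same_cols


# Example usage in the notebook:
# print("Saved file matches:", verify_saved_csv(processed_file_path, processed_df))
```

If the file was saved with its index, the reloaded frame gains an extra column and the check returns `False`, which is a useful signal in itself.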
