# Exploratory Data Analysis (EDA)

This notebook serves as a placeholder for exploratory data analysis
as part of the Linear Regression Architecture Workshop.

At this stage, the focus is on project structure and reproducibility.
Full data ingestion and visualization will be implemented in a future sprint.

## Planned EDA Steps
1. Import the data from the Neon database (connstr: postgresql://neondb_owner:npg_Sh8bV3HjZvkd@ep-plain-scene-ahmzh8by-pooler.c-3.us-east-1.aws.neon.tech/neondb?sslmode=require&channel_binding=require, table: robot_data) and save it as 'data/raw/RMBR4-2_export_test.csv'.
2. By analyzing the robot’s data, we observe that it can be divided into working periods and idle periods. For each working period, we therefore extract the mean value and the peak value of the signal. (The mean value represents the level of equipment aging, while the peak value reflects the operational condition of the robot.)
3. Clean the data (handle missing values, normalize/standardize features, and split the data into **train/test sets**).
4. The training data are divided into multiple detection intervals, where each detection interval consists of 10 consecutive working cycles.
5. For each detection interval, a regression analysis is performed, resulting in a series of θ₀ (theta_0) and θ₁ (theta_1) values.
6. Based on these results, a threshold is defined using the training data, and the values from the test dataset are used for detection.
   When the slope of a detection interval exceeds the threshold, the trend of the curve is considered abnormal.
   In this case, we predict that the machine may experience a failure two weeks later.
   The system outputs the abnormal interval (start and end time), the predicted failure time, and the failure type (equipment aging or equipment malfunction).

### 1. Import data 
Import the data from the Neon database (connstr: postgresql://neondb_owner:npg_Sh8bV3HjZvkd@ep-plain-scene-ahmzh8by-pooler.c-3.us-east-1.aws.neon.tech/neondb?sslmode=require&channel_binding=require, table: robot_data) and save it as 'data/raw/RMBR4-2_export_test.csv'.

In [3]:
import os
import yaml

# If the notebook is inside notebooks/, go back to project root
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")


# Load config
with open("configs/experiment_config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

from src.db_export import export_postgres_table_to_csv

# Read database config
connstr = config["database"]["connstr"]
table_name = config["database"]["source_table"]

# Read output path from config
output_csv = config["paths"]["raw_csv"]

# Export data from Neon PostgreSQL to CSV
export_postgres_table_to_csv(
    connstr=connstr,
    table_name=table_name,
    output_csv=output_csv,
)

✅ Exported table 'robot_data' to data/raw/RMBR4-2_export_test.csv
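
The cells in this notebook read several keys from `configs/experiment_config.yaml`. As a minimal sketch, the check below lists any of those keys that are missing before a step runs. The key names are taken from the cells in this notebook; the helper name `find_missing_keys`, the example values, and the placeholder connection string are assumptions made for illustration only.

```python
# Keys that the notebook cells look up in the loaded config.
REQUIRED_KEYS = {
    "database": ["connstr", "source_table"],
    "paths": [
        "raw_csv",
        "period_csv",
        "preprocessed_train_csv",
        "preprocessed_test_csv",
        "theta_table_csv",
    ],
}


def find_missing_keys(config):
    """Return dotted names of required keys that are absent from config."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        section_values = config.get(section) or {}
        for key in keys:
            if key not in section_values:
                missing.append(f"{section}.{key}")
    return missing


# Hypothetical config with one path left out, to show the check in action.
example_config = {
    "database": {
        "connstr": "postgresql://user:password@host/dbname",  # placeholder
        "source_table": "robot_data",
    },
    "paths": {
        "raw_csv": "data/raw/RMBR4-2_export_test.csv",
        "period_csv": "data/raw/RMBR4-2_export_test_1.csv",
        "preprocessed_train_csv": "data/preprocessed/RMBR4-2_export_preprocessed_train.csv",
        "preprocessed_test_csv": "data/preprocessed/RMBR4-2_export_preprocessed_test.csv",
    },
}

print(find_missing_keys(example_config))  # ['paths.theta_table_csv']
```

Running a check like this right after `yaml.safe_load` would surface a missing path as a readable list rather than a `KeyError` partway through a later step.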


### 2. Divide working periods
By analyzing the robot’s data, we observe that it can be divided into working periods and idle periods. For each working period, we therefore extract the mean value and the peak value of the signal. (The mean value represents the level of equipment aging, while the peak value reflects the operational condition of the robot.)

In [4]:
import os
import yaml

# If the notebook is inside notebooks/, go back to project root
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")

# Load config
with open("configs/experiment_config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)
    
from src.preprocessing import run_preprocessing_pipeline
period_df, _preprocessed_df = run_preprocessing_pipeline(config)

print("✅ Step 1 done. Saved period summary to:", config["paths"]["period_csv"])
print("Period rows:", len(period_df))

# ---- Inspect result ----
print("\n📌 Period CSV columns:")
print(period_df.columns.tolist())

print("\n📌 First 5 rows of period summary:")
display(period_df.head())

✅ Step 1 done. Saved period summary to: data/raw/RMBR4-2_export_test_1.csv
Period rows: 858

📌 Period CSV columns:
['work_period', 'mean_value', 'peak_value', 'period_start_time', 'period_end_time']

📌 First 5 rows of period summary:


| | work_period | mean_value | peak_value | period_start_time | period_end_time |
|---|---|---|---|---|---|
| 0 | 1 | 3.302866 | 8.908955 | 2022-10-17 12:19:22.005000+00:00 | 2022-10-17 12:19:55.136000+00:00 |
| 1 | 2 | 3.449832 | 6.300471 | 2022-10-17 12:20:56.771000+00:00 | 2022-10-17 12:21:31.056000+00:00 |
| 2 | 3 | 3.491427 | 6.707684 | 2022-10-17 12:21:46.588000+00:00 | 2022-10-17 12:22:21.050000+00:00 |
| 3 | 4 | 3.255471 | 5.87491 | 2022-10-17 12:22:45.266000+00:00 | 2022-10-17 12:23:17.894000+00:00 |
| 4 | 5 | 2.881057 | 6.632019 | 2022-10-17 12:23:49.839000+00:00 | 2022-10-17 12:24:22.332000+00:00 |
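
The working-period summary described in this step can be sketched as follows. This is not the code in `src.preprocessing`. It assumes the raw signal has `time` and `value` columns and that a sample counts as "working" when it exceeds a hypothetical `idle_threshold`. The real pipeline may detect periods differently.

```python
import pandas as pd


def summarize_work_periods(df, idle_threshold):
    """Group consecutive above-threshold samples into working periods and summarize each."""
    working = df["value"] > idle_threshold
    # A new period starts wherever a working sample follows an idle one
    # (or opens the series).
    period_start = working & ~working.shift(fill_value=False)
    period_id = period_start.cumsum()
    active = df[working].assign(work_period=period_id[working])
    summary = active.groupby("work_period").agg(
        mean_value=("value", "mean"),
        peak_value=("value", "max"),
        period_start_time=("time", "min"),
        period_end_time=("time", "max"),
    )
    return summary.reset_index()


# Tiny synthetic signal: two working bursts separated by idle samples.
signal = pd.DataFrame({
    "time": pd.date_range("2022-10-17 12:00:00", periods=8, freq="min"),
    "value": [0.1, 3.0, 5.0, 0.2, 0.1, 4.0, 6.0, 2.0],
})
periods = summarize_work_periods(signal, idle_threshold=1.0)
print(periods)
```

The resulting columns mirror those of the period summary above (`work_period`, `mean_value`, `peak_value`, `period_start_time`, `period_end_time`).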


### 3. Clean data
Clean the data (handle missing values, normalize/standardize features, and split the data into **train/test sets**).

In [5]:
import pandas as pd
import os
import yaml

# If the notebook is inside notebooks/, go back to project root
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")

# Load config
with open("configs/experiment_config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

from src.splitter import split_period_csv_to_train_test

out = split_period_csv_to_train_test(config)

print("✅ Step 2 done.")
print(" - TRAIN:", out["preprocessed_train_csv"])
print(" - TEST :", out["preprocessed_test_csv"])

train_df = pd.read_csv(out["preprocessed_train_csv"])
test_df  = pd.read_csv(out["preprocessed_test_csv"])

# ---- Inspect TRAIN ----
print("\n📌 TRAIN columns:")
print(train_df.columns.tolist())

print("\n📌 TRAIN first 5 rows:")
display(train_df.head())

# ---- Inspect TEST ----
print("\n📌 TEST columns:")
print(test_df.columns.tolist())

print("\n📌 TEST first 5 rows:")
display(test_df.head())

✅ Step 2 done.
 - TRAIN: data/preprocessed/RMBR4-2_export_preprocessed_train.csv
 - TEST : data/preprocessed/RMBR4-2_export_preprocessed_test.csv

📌 TRAIN columns:
['work_period', 'mean_value', 'peak_value', 'period_start_time', 'period_end_time', 'interval_id', 'mean_value_z', 'peak_value_z']

📌 TRAIN first 5 rows:


| | work_period | mean_value | peak_value | period_start_time | period_end_time | interval_id | mean_value_z | peak_value_z |
|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 3.586541 | 8.039739 | 2022-10-17 12:35:51.838000+00:00 | 2022-10-17 12:36:25.716000+00:00 | 2 | 0.979374 | 0.741561 |
| 1 | 12 | 3.764583 | 7.634454 | 2022-10-17 12:36:48.315000+00:00 | 2022-10-17 12:37:23.086000+00:00 | 2 | 1.565188 | 0.409189 |
| 2 | 13 | 2.908206 | 6.261834 | 2022-10-17 12:38:03.164000+00:00 | 2022-10-17 12:38:32.099000+00:00 | 2 | -1.252553 | -0.716489 |
| 3 | 14 | 3.188224 | 6.185474 | 2022-10-17 12:44:50.740000+00:00 | 2022-10-17 12:45:25.360000+00:00 | 2 | -0.331209 | -0.779112 |
| 4 | 15 | 3.479667 | 6.995066 | 2022-10-17 12:45:46.216000+00:00 | 2022-10-17 12:46:19.107000+00:00 | 2 | 0.627726 | -0.11517 |



📌 TEST columns:
['work_period', 'mean_value', 'peak_value', 'period_start_time', 'period_end_time', 'interval_id', 'mean_value_z', 'peak_value_z']

📌 TEST first 5 rows:


| | work_period | mean_value | peak_value | period_start_time | period_end_time | interval_id | mean_value_z | peak_value_z |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3.302866 | 8.908955 | 2022-10-17 12:19:22.005000+00:00 | 2022-10-17 12:19:55.136000+00:00 | 1 | 0.046 | 1.4544 |
| 1 | 2 | 3.449832 | 6.300471 | 2022-10-17 12:20:56.771000+00:00 | 2022-10-17 12:21:31.056000+00:00 | 1 | 0.52956 | -0.684803 |
| 2 | 3 | 3.491427 | 6.707684 | 2022-10-17 12:21:46.588000+00:00 | 2022-10-17 12:22:21.050000+00:00 | 1 | 0.666422 | -0.35085 |
| 3 | 4 | 3.255471 | 5.87491 | 2022-10-17 12:22:45.266000+00:00 | 2022-10-17 12:23:17.894000+00:00 | 1 | -0.109944 | -1.033803 |
| 4 | 5 | 2.881057 | 6.632019 | 2022-10-17 12:23:49.839000+00:00 | 2022-10-17 12:24:22.332000+00:00 | 1 | -1.341881 | -0.412903 |
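
The notebook does not show how `src.splitter` handles missing values or computes the `_z` columns, so the following is only one plausible sketch of the cleaning step. It assumes rows with missing features are dropped, whole intervals are assigned to either TRAIN or TEST, and z-scores use the TRAIN mean and standard deviation only. The helper name `clean_and_split` and the `test_intervals` argument are hypothetical.

```python
import pandas as pd

FEATURES = ["mean_value", "peak_value"]


def clean_and_split(period_df, test_intervals):
    """Drop incomplete rows, z-score with TRAIN statistics, split by interval."""
    df = period_df.dropna(subset=FEATURES).copy()
    is_test = df["interval_id"].isin(test_intervals)

    # Statistics come from TRAIN rows only, so TEST data cannot leak into them.
    mu = df.loc[~is_test, FEATURES].mean()
    sigma = df.loc[~is_test, FEATURES].std()
    z = (df[FEATURES] - mu) / sigma
    df["mean_value_z"] = z["mean_value"]
    df["peak_value_z"] = z["peak_value"]

    train = df[~is_test].reset_index(drop=True)
    test = df[is_test].reset_index(drop=True)
    return train, test


# Synthetic example: one row has a missing mean_value and is dropped.
periods = pd.DataFrame({
    "work_period": [1, 2, 3, 4, 5, 6],
    "interval_id": [1, 1, 2, 2, 3, 3],
    "mean_value": [3.0, None, 2.0, 4.0, 3.0, 5.0],
    "peak_value": [6.0, 7.0, 5.0, 9.0, 7.0, 8.0],
})
train, test = clean_and_split(periods, test_intervals=[1])
print(train)
print(test)
```

Scaling TEST rows with TRAIN statistics keeps the two sets on the same scale, which matters if thresholds learned on TRAIN are later applied to TEST.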


### 4. Divide detection intervals
The training data are divided into multiple detection intervals, where each detection interval consists of 10 consecutive working cycles.
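
A minimal sketch of this grouping, assuming that `work_period` is a 1-based counter and that intervals are consecutive blocks of 10. This matches the TRAIN preview above, where `work_period` 11 carries `interval_id` 2. The helper name `assign_intervals` is hypothetical.

```python
import pandas as pd

INTERVAL_SIZE = 10  # working cycles per detection interval, as stated above


def assign_intervals(period_df, interval_size=INTERVAL_SIZE):
    """Label each working period with the detection interval it belongs to."""
    df = period_df.copy()
    # Periods 1-10 -> interval 1, periods 11-20 -> interval 2, and so on.
    df["interval_id"] = (df["work_period"] - 1) // interval_size + 1
    return df


periods = pd.DataFrame({"work_period": range(1, 26)})
labeled = assign_intervals(periods)

# Keep only intervals that contain a full set of 10 working cycles.
sizes = labeled.groupby("interval_id").size()
complete = sizes[sizes == INTERVAL_SIZE].index.tolist()
print(complete)  # [1, 2]
```

In this example the last interval holds only 5 periods. Whether such a partial interval is kept or dropped is a design choice that the notebook does not specify.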

### 5. Model Implementation
For each detection interval, a regression analysis is performed, resulting in a series of θ₀ (theta_0) and θ₁ (theta_1) values.

In [6]:
import os
import yaml
import pandas as pd

# If the notebook is inside /notebooks, go one level up to project root
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")

# Load config
with open("configs/experiment_config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# IMPORTANT: Train-only input
# Your model.py reads config["paths"]["preprocessed_train_csv"]
# and writes config["paths"]["theta_table_csv"]
from src.model import build_interval_theta_table

theta_df = build_interval_theta_table(config)

print("✅ Step 3 done (TRAIN only).")
print(" - Input :", config["paths"]["preprocessed_train_csv"])
print(" - Output:", config["paths"]["theta_table_csv"])

# Inspect result
print("\n📌 interval_theta_table columns:")
print(theta_df.columns.tolist())

print("\n📌 interval_theta_table first 5 rows:")
display(theta_df.head())

✅ Step 3 done (TRAIN only).
 - Input : data/preprocessed/RMBR4-2_export_preprocessed_train.csv
 - Output: data/models/interval_theta_table.csv

📌 interval_theta_table columns:
['interval_id', 'start_work_period', 'end_work_period', 'n_periods', 'scratch_mean_theta0', 'scratch_mean_theta1', 'scratch_peak_theta0', 'scratch_peak_theta1', 'sklearn_mean_theta0', 'sklearn_mean_theta1', 'sklearn_peak_theta0', 'sklearn_peak_theta1', 'learning_rate', 'iterations', 'target_space']

📌 interval_theta_table first 5 rows:


| | interval_id | start_work_period | end_work_period | n_periods | scratch_mean_theta0 | scratch_mean_theta1 | scratch_peak_theta0 | scratch_peak_theta1 | sklearn_mean_theta0 | sklearn_mean_theta1 | sklearn_peak_theta0 | sklearn_peak_theta1 | learning_rate | iterations | target_space |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 11 | 20 | 10 | -0.890499 | 0.10011 | -0.975986 | 0.07542 | -0.890499 | 0.10011 | -0.975986 | 0.07542 | 0.05 | 3000 | z |
| 1 | 4 | 31 | 40 | 10 | 3.687289 | -0.092567 | 5.146264 | -0.12838 | 3.687289 | -0.092567 | 5.146264 | -0.12838 | 0.05 | 3000 | z |
| 2 | 6 | 51 | 60 | 10 | -7.934994 | 0.149576 | -6.466034 | 0.124693 | -7.934994 | 0.149576 | -6.466034 | 0.124693 | 0.05 | 3000 | z |
| 3 | 7 | 61 | 70 | 10 | -1.38133 | 0.026113 | -3.10991 | 0.051936 | -1.38133 | 0.026113 | -3.10991 | 0.051936 | 0.05 | 3000 | z |
| 4 | 8 | 71 | 80 | 10 | 8.836954 | -0.118015 | 15.307592 | -0.196919 | 8.836954 | -0.118015 | 15.307592 | -0.196919 | 0.05 | 3000 | z |
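
The theta table reports a from-scratch fit alongside a scikit-learn fit, with a `learning_rate` of 0.05 and 3000 `iterations`. The sketch below illustrates how a per-interval line y ≈ θ₀ + θ₁·x could be fitted by batch gradient descent and checked against a closed-form fit. It is not the code in `src.model`. Standardizing x inside the function is an assumption of this sketch, `np.polyfit` stands in for scikit-learn as the reference, and the sample data are synthetic.

```python
import numpy as np


def fit_line_gradient_descent(x, y, learning_rate=0.05, iterations=3000):
    """Fit y ~ theta0 + theta1 * x by batch gradient descent on the mean squared error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize x so a fixed learning rate converges regardless of the
    # work_period scale (an assumption of this sketch).
    x_mean, x_std = x.mean(), x.std()
    xs = (x - x_mean) / x_std

    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        residual = b0 + b1 * xs - y
        b0 -= learning_rate * residual.mean()
        b1 -= learning_rate * (residual * xs).mean()

    # Map the coefficients back to the original x scale.
    theta1 = b1 / x_std
    theta0 = b0 - theta1 * x_mean
    return theta0, theta1


# Synthetic interval: work periods 11..20 with a slightly noisy upward trend.
x = np.arange(11, 21)
noise = np.array([0.02, -0.01, 0.03, -0.02, 0.00, 0.01, -0.03, 0.02, -0.01, 0.01])
y = 0.1 * x - 1.5 + noise

theta0, theta1 = fit_line_gradient_descent(x, y)
slope_ref, intercept_ref = np.polyfit(x, y, 1)  # closed-form least squares
print(theta0, theta1)
print(intercept_ref, slope_ref)
```

When gradient descent has converged, both fits should agree to many decimal places, as the identical `scratch_*` and `sklearn_*` columns in the table above do.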


### 6. Error Prediction
Based on these results, a threshold is defined using the training data, and the values from the test dataset are used for detection. 
When the slope of a detection interval exceeds the threshold, the trend of the curve is considered abnormal. 
In this case, we predict that the machine may experience a failure two weeks later.
The system outputs the abnormal interval (start and end time), the predicted failure time, and the failure type (equipment aging or equipment malfunction).

In [None]:
from src.alerts import detect_alerts_on_test

# ---------------------------------------------------------
# STEP 2: Generate alerts on TEST data
#
# In this step, the alert detection logic is applied to
# the TEST dataset only.
#
# IMPORTANT:
#   - Thresholds were derived from TRAIN data in Step 1
#   - No statistics are re-computed on TEST data
#   - This prevents data leakage
# ---------------------------------------------------------

# ---------------------------------------------------------
# Confirm the TEST preprocessed period-level CSV is configured
# (the path comes from the config loaded in the cells above)
# ---------------------------------------------------------
assert "preprocessed_test_csv" in config["paths"], "preprocessed_test_csv is missing from config['paths']"

# ---------------------------------------------------------
# Run alert detection on TEST data
#
# This function:
#   - Loads TEST period-level data
#   - Aggregates periods into interval-level summaries
#   - Loads TRAIN-derived alert thresholds
#   - Loads TRAIN regression slopes (theta table)
#   - Applies alert logic:
#       * Trend condition: |slope| > slope threshold
#       * Level condition: mean or peak > level threshold
#   - Reports only anomalous intervals
#
# If no anomalies are detected:
#   - An empty CSV file (headers only) is written
#   - A message is printed indicating no faults were found
# ---------------------------------------------------------
results_df = detect_alerts_on_test(config)

# Display detected anomalies (may be empty)
results_df


No anomalies detected. Wrote empty results file: experiments/results.csv


| | interval_id | start_work_period | end_work_period | interval_start_time | interval_end_time | predicted_failure_time | failure_type | alert_reason |
|---|---|---|---|---|---|---|---|---|
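
The trend condition described in this step can be sketched as follows. This is not the logic in `src.alerts`. The threshold rule (mean plus `k` standard deviations of the absolute TRAIN slopes, with `k = 3`) is a hypothetical choice. The mapping from mean-slope alerts to "equipment aging" and from peak-slope alerts to "equipment malfunction" follows the interpretation given in step 2. The level condition mentioned in the cell comments is omitted. The TRAIN slopes are the first five rows of the theta table, and the TEST intervals are synthetic.

```python
import pandas as pd

FAILURE_LEAD = pd.Timedelta(weeks=2)  # "two weeks later", per the description above
ALERT_COLUMNS = [
    "interval_id", "interval_start_time", "interval_end_time",
    "predicted_failure_time", "failure_type",
]


def slope_threshold(train_slopes, k=3.0):
    """Hypothetical rule: mean + k * std of the absolute TRAIN slopes."""
    magnitudes = train_slopes.abs()
    return float(magnitudes.mean() + k * magnitudes.std())


def detect_trend_alerts(test_intervals, mean_threshold, peak_threshold):
    """Return one row per TEST interval whose slope magnitude exceeds its threshold."""
    rows = []
    for row in test_intervals.itertuples(index=False):
        if abs(row.mean_theta1) > mean_threshold:
            failure_type = "equipment aging"
        elif abs(row.peak_theta1) > peak_threshold:
            failure_type = "equipment malfunction"
        else:
            continue  # normal interval, nothing to report
        rows.append({
            "interval_id": row.interval_id,
            "interval_start_time": row.interval_start_time,
            "interval_end_time": row.interval_end_time,
            "predicted_failure_time": row.interval_end_time + FAILURE_LEAD,
            "failure_type": failure_type,
        })
    # With no alerts this is an empty frame that still carries the headers.
    return pd.DataFrame(rows, columns=ALERT_COLUMNS)


# TRAIN slopes from the theta table preview (first five intervals).
mean_thr = slope_threshold(pd.Series([0.10011, -0.092567, 0.149576, 0.026113, -0.118015]))
peak_thr = slope_threshold(pd.Series([0.07542, -0.12838, 0.124693, 0.051936, -0.196919]))

# Synthetic TEST intervals for illustration.
test_intervals = pd.DataFrame({
    "interval_id": [1, 3, 5],
    "interval_start_time": pd.to_datetime(
        ["2022-10-17 12:19", "2022-10-17 13:00", "2022-10-17 14:00"], utc=True),
    "interval_end_time": pd.to_datetime(
        ["2022-10-17 12:30", "2022-10-17 13:15", "2022-10-17 14:20"], utc=True),
    "mean_theta1": [0.05, 0.40, -0.05],
    "peak_theta1": [0.04, 0.10, -0.50],
})
alerts = detect_trend_alerts(test_intervals, mean_thr, peak_thr)
print(alerts)
```

Because both thresholds come from TRAIN slopes only, applying them to TEST intervals does not recompute any statistics on TEST data.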
