# Data Preparation
## Getting the System Ready and Loading the Data

In [1]:
import sys
from pathlib import Path

# make sure parent folder (project root) is on the path
project_root = Path.cwd().parent  # adjust if your notebook lives somewhere else
sys.path.insert(0, str(project_root))

from src.data_ingestion import load_sensor_data, load_profile_data, merge_data, save_cycles


raw_dir = "../data/raw"
profile_path = f"{raw_dir}/profile.txt"
output_csv = "../data/processed/hydraulic_cycles.csv"

sensors = load_sensor_data(raw_dir)
profiles = load_profile_data(profile_path)
df = merge_data(sensors, profiles)
save_cycles(df, output_csv)


## Understanding the Data

### 1. Raw Sensor Data Overview

| Group | Sensors | Rate (Hz) | Samples per cycle | Flattened columns |
|-------|---------|-----------|-------------------|-------------------|
| PS    | PS1–PS6 (pressure) | 100 | 6 000 | 6 × 6 000 = 36 000 |
| EPS1  | EPS1 (power) | 100 | 6 000 | 1 × 6 000 = 6 000 |
| FS    | FS1–FS2 (flow) | 10 | 600 | 2 × 600 = 1 200 |
| LOW   | TS1–TS4 (temperature), VS1 (vibration), CE (cooling efficiency), CP (cooling power), SE (efficiency factor) | 1 | 60 | 8 × 60 = 480 |
| **Total** | | | | **43 680** |

- **Cycles (rows):** 2 205
- **Flattened readings (columns):** 43 680
- **Each cycle:** 60 s of sensor readings
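The column totals in the table follow from a simple product: number of sensors × sampling rate × 60 s. The short sketch below recomputes them so the expected width of the flattened frame can be checked before loading anything. The dictionary layout and the helper name `columns_per_group` are illustrative choices for this example, not part of `src.data_ingestion`.

```python
# Sampling layout copied from the table: group -> (sensor names, rate in Hz).
CYCLE_SECONDS = 60

SENSOR_GROUPS = {
    "PS": (["PS1", "PS2", "PS3", "PS4", "PS5", "PS6"], 100),
    "EPS1": (["EPS1"], 100),
    "FS": (["FS1", "FS2"], 10),
    "LOW": (["TS1", "TS2", "TS3", "TS4", "VS1", "CE", "CP", "SE"], 1),
}


def columns_per_group(groups, cycle_seconds=CYCLE_SECONDS):
    """Return the number of flattened columns each sensor group contributes."""
    return {
        group: len(names) * rate * cycle_seconds
        for group, (names, rate) in groups.items()
    }


per_group = columns_per_group(SENSOR_GROUPS)
total_columns = sum(per_group.values())

print(per_group)
print("Total columns:", total_columns)
```

The total can then be compared against `sensors.shape[1]` as a quick consistency check after loading.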

In [2]:
print("Sensor data shape:", sensors.shape)

Sensor data shape: (2205, 43680)


### 2. Health Profile Data

In [None]:
print("Profile data shape:", profiles.shape)
profiles.head()


### 3. Column Naming Convention
All sensor columns are named `<sensor>_<t>`, where:

- `<sensor>` is the sensor (file) name (e.g. PS1, TS3, CP, …)

- `<t>` is the zero-based sample index within the cycle (0–5 999 for a 100 Hz sensor, 0–59 for a 1 Hz sensor)

In [None]:
# Peek at first/last few column names
cols = sensors.columns.tolist()
print("First 5 cols:", cols[:5])
print("Last 5 cols: ", cols[-5:])

First 5 cols: ['PS1_0', 'PS1_1', 'PS1_2', 'PS1_3', 'PS1_4']
Last 5 cols:  ['SE_55', 'SE_56', 'SE_57', 'SE_58', 'SE_59']
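Because every column name encodes both the sensor and the sample index, the name can be split back into its two parts to select all readings of one sensor. The following is a minimal sketch using plain string handling; `split_column` and `columns_for` are hypothetical helper names introduced here, and `example_cols` is a tiny hand-written stand-in for `sensors.columns`.

```python
def split_column(name):
    """Split '<sensor>_<t>' into (sensor, t), e.g. 'PS1_0' -> ('PS1', 0)."""
    # rpartition splits on the last underscore, so the index is always
    # the trailing part of the name.
    sensor, _, index = name.rpartition("_")
    return sensor, int(index)


def columns_for(sensor, columns):
    """Return the columns that belong to exactly this sensor, in original order."""
    return [c for c in columns if split_column(c)[0] == sensor]


# Hand-written stand-in for sensors.columns (assumption, for illustration only).
example_cols = ["PS1_0", "PS1_1", "SE_58", "SE_59"]

print(split_column("PS1_0"))
print(columns_for("SE", example_cols))
```

Matching on the exact sensor name rather than a string prefix avoids accidentally picking up another sensor whose name merely starts with the same characters. With the real frame, something like `sensors[columns_for("TS3", sensors.columns)]` should return the 60 temperature readings of TS3 for every cycle.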


## Missing Value and Outlier Treatment