<a href="https://colab.research.google.com/github/meiladrahmani556/marine-cbm-ml-dissertation/blob/main/JupyterNotebook/03_data_cleaning_%26_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📓 Notebook 03 – Data Cleaning & Preprocessing

This notebook focuses on preparing the Marine Condition-Based Monitoring (CBM) dataset for machine learning modelling.

The main objectives are:

- Ensure all features are numeric
- Handle missing values
- Remove duplicate records
- Detect and treat outliers
- Perform feature scaling
- Split the dataset into training and testing sets
- Save the processed dataset for modelling

This notebook transforms raw data into a model-ready dataset.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## 📂 Load Dataset

We upload the CSV file saved in the previous notebook to the Colab session and load it into a DataFrame.

In [5]:
from google.colab import files
uploaded = files.upload()

Saving Conditional_Base_Monitoring in Marine_System.csv to Conditional_Base_Monitoring in Marine_System.csv


In [6]:
import os
os.listdir()

['.config', 'Conditional_Base_Monitoring in Marine_System.csv', 'sample_data']

In [9]:
os.listdir('/content')

['.config', 'Conditional_Base_Monitoring in Marine_System.csv', 'sample_data']

In [10]:
df = pd.read_csv("Conditional_Base_Monitoring in Marine_System.csv")
df.head()

Unnamed: 0,Lever position,Ship speed (v),Gas Turbine (GT) shaft torque (GTT) [kN m],GT rate of revolutions (GTn) [rpm],Gas Generator rate of revolutions (GGn) [rpm],Starboard Propeller Torque (Ts) [kN],Port Propeller Torque (Tp) [kN],Hight Pressure (HP) Turbine exit temperature (T48) [C],GT Compressor inlet air temperature (T1) [C],GT Compressor outlet air temperature (T2) [C],HP Turbine exit pressure (P48) [bar],GT Compressor inlet air pressure (P1) [bar],GT Compressor outlet air pressure (P2) [bar],GT exhaust gas pressure (Pexh) [bar],Turbine Injecton Control (TIC) [%],Fuel flow (mf) [kg/s],GT Compressor decay state coefficient,GT Turbine decay state coefficient
0,5.14,15,21640.162,1924.358,8516.691,175.324,175.324,706.702,288,640.873,2.072,0.998,10.916,1.026,24.96,0.494,0.951,1.0
1,9.3,27,72776.229,3560.412,9759.837,645.137,645.137,1060.156,288,774.302,4.511,0.998,22.426,1.051,87.741,1.737,0.982,0.997
2,8.206,24,50994.673,3087.535,9313.854,438.11,438.11,927.728,288,734.474,3.577,0.998,18.412,1.041,60.546,1.199,0.966,0.988
3,5.14,15,21626.805,1924.329,8472.097,175.221,175.221,695.477,288,633.124,2.086,0.998,11.074,1.027,24.549,0.486,0.989,0.991
4,5.14,15,21636.43,1924.313,8494.777,,,731.494,288,645.642,2.078,0.998,11.197,1.026,26.373,0.522,0.95,0.975


In [14]:
df.shape
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12434 entries, 0 to 12433
Data columns (total 18 columns):
 #   Column                                                    Non-Null Count  Dtype 
---  ------                                                    --------------  ----- 
 0   Lever position                                            12387 non-null  object
 1   Ship speed (v)                                            12374 non-null  object
 2   Gas Turbine (GT) shaft torque (GTT) [kN m]                12390 non-null  object
 3   GT rate of revolutions (GTn) [rpm]                        12383 non-null  object
 4   Gas Generator rate of revolutions (GGn) [rpm]             12389 non-null  object
 5   Starboard Propeller Torque (Ts) [kN]                      12385 non-null  object
 6   Port Propeller Torque (Tp) [kN]                           12377 non-null  object
 7   Hight Pressure (HP) Turbine exit temperature (T48) [C]    12371 non-null  object
 8   GT Compressor inlet 

Unnamed: 0,Lever position,Ship speed (v),Gas Turbine (GT) shaft torque (GTT) [kN m],GT rate of revolutions (GTn) [rpm],Gas Generator rate of revolutions (GGn) [rpm],Starboard Propeller Torque (Ts) [kN],Port Propeller Torque (Tp) [kN],Hight Pressure (HP) Turbine exit temperature (T48) [C],GT Compressor inlet air temperature (T1) [C],GT Compressor outlet air temperature (T2) [C],HP Turbine exit pressure (P48) [bar],GT Compressor inlet air pressure (P1) [bar],GT Compressor outlet air pressure (P2) [bar],GT exhaust gas pressure (Pexh) [bar],Turbine Injecton Control (TIC) [%],Fuel flow (mf) [kg/s],GT Compressor decay state coefficient,GT Turbine decay state coefficient
count,12387.0,12374,12390.0,12383.0,12389.0,12385.0,12377.0,12371.0,12376,12388.0,12386.0,12375.0,12391.0,12379.0,12373,12399.0,12393.0,12382.0
unique,18.0,20,11442.0,3899.0,11844.0,4296.0,4295.0,11781.0,3,11517.0,535.0,4.0,4217.0,27.0,8506,704.0,53.0,29.0
top,4.161,12,289.964,1547.465,7792.63,113.774,113.774,464.006,288,613.851,1.389,0.998,5.947,1.019,0,0.358,0.95,0.975
freq,1423.0,1421,94.0,142.0,97.0,117.0,120.0,93.0,12353,103.0,306.0,12351.0,100.0,2090.0,761,102.0,671.0,883.0


## 🔄 Convert Columns to Numeric

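Every column was read as `object` (string) dtype, so each one is converted to a numeric type before modelling. With `errors='coerce'`, any value that cannot be parsed as a number becomes `NaN` instead of raising an error.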

In [15]:
df = df.apply(pd.to_numeric, errors='coerce')
df.dtypes

Unnamed: 0,0
Lever position,float64
Ship speed (v),float64
Gas Turbine (GT) shaft torque (GTT) [kN m],float64
GT rate of revolutions (GTn) [rpm],float64
Gas Generator rate of revolutions (GGn) [rpm],float64
Starboard Propeller Torque (Ts) [kN],float64
Port Propeller Torque (Tp) [kN],float64
Hight Pressure (HP) Turbine exit temperature (T48) [C],float64
GT Compressor inlet air temperature (T1) [C],float64
GT Compressor outlet air temperature (T2) [C],float64
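
Because coercion replaces bad values silently, it can help to count how many cells it turned into `NaN`. The following is a minimal sketch on a made-up toy frame; the column names and values are hypothetical and are not taken from the CBM dataset.

```python
import pandas as pd

# Hypothetical toy data: two string columns, each with one unparseable entry.
raw = pd.DataFrame({
    "speed": ["15", "27", "n/a"],
    "torque": ["175.324", "--", "438.11"],
})

# Count the cells that are present before conversion.
present_before = raw.notna().sum()

# Same call as in the notebook: unparseable strings become NaN.
converted = raw.apply(pd.to_numeric, errors="coerce")

# Cells that held text but could not be parsed as numbers.
newly_missing = present_before - converted.notna().sum()

print(converted.dtypes)
print(newly_missing)
```

A non-zero count in `newly_missing` points to columns worth inspecting before deciding how to treat the gaps.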


In [16]:
df.isnull().sum()

Unnamed: 0,0
Lever position,71
Ship speed (v),83
Gas Turbine (GT) shaft torque (GTT) [kN m],68
GT rate of revolutions (GTn) [rpm],76
Gas Generator rate of revolutions (GGn) [rpm],68
Starboard Propeller Torque (Ts) [kN],74
Port Propeller Torque (Tp) [kN],80
Hight Pressure (HP) Turbine exit temperature (T48) [C],87
GT Compressor inlet air temperature (T1) [C],81
GT Compressor outlet air temperature (T2) [C],71
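
The cells shown here count the missing values but do not fill or drop them. One common option is median imputation, with the medians computed on the training rows only and reused for the test rows. The sketch below assumes that approach suits this data, which has not been verified; the frames `train` and `test` and their values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for a train/test split.
train = pd.DataFrame({"a": [1.0, np.nan, 3.0, 5.0],
                      "b": [10.0, 20.0, np.nan, 40.0]})
test = pd.DataFrame({"a": [np.nan, 2.0],
                     "b": [30.0, np.nan]})

# Per-column medians from the training rows only (NaN is skipped by default).
medians = train.median()

# Reuse the training medians for both sets so the test set
# does not influence the fill values.
train_filled = train.fillna(medians)
test_filled = test.fillna(medians)

print(medians)
print(test_filled)
```

Dropping incomplete rows with `dropna()` is the simpler alternative, at the cost of discarding every row that has a gap in any column.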


## 🧹 Remove Duplicate Rows

In [17]:
# Report the number of exact duplicate rows, then drop them
print("Duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()

In [18]:
target_column = "GT Compressor decay state coefficient"

## 🎯 Define Features and Target Variable

In [21]:
df.columns = df.columns.str.strip()
X = df.drop(columns=[target_column])
y = df[target_column]

X.shape, y.shape

((12367, 17), (12367,))
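
The `str.strip()` call matters because several column names in the `df.info()` output appear to end with a non-breaking space, which would make a lookup by the plain name fail. A minimal sketch of the effect, using hypothetical column names:

```python
import pandas as pd

# Hypothetical names: "\u00a0" is a non-breaking space, as can appear
# at the end of headers exported from spreadsheets.
cols = pd.Index(["Fuel flow (mf) [kg/s]\u00a0", " Ship speed (v)"])

# str.strip() removes leading and trailing Unicode whitespace,
# including the non-breaking space.
cleaned = cols.str.strip()

print("Fuel flow (mf) [kg/s]" in cols)     # False: trailing character differs
print("Fuel flow (mf) [kg/s]" in cleaned)  # True
```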

In [22]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

X_train.shape, X_test.shape

((9893, 17), (2474, 17))
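
The split sizes can be checked by hand. This sketch assumes the test count is the requested fraction rounded up and the training set receives the remaining rows, which matches the shapes reported above.

```python
import math

n_samples = 12367  # rows in X after duplicate removal
test_size = 0.2    # fraction passed to train_test_split

# Round the test count up and give the remainder to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 9893 2474
```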

## 📏 Feature Scaling

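Standardization rescales each feature to zero mean and unit variance, so that features measured in different units (rpm, bar, kN) contribute on a comparable scale.
The scaler is fitted on the training set only and then applied to the test set, which keeps test-set statistics out of training. Scale-sensitive algorithms, such as distance-based and gradient-based models, generally benefit from this step.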

In [23]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
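
To make the fit/transform distinction concrete, the same computation can be written out with NumPy. This is an illustrative sketch on made-up numbers, not the notebook's pipeline; it uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales.
X_tr = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0]])
X_te = np.array([[4.0, 400.0]])

# "fit": learn mean and standard deviation from the training rows only.
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)  # ddof=0 (population standard deviation)

# "transform": apply the training statistics to both sets.
X_tr_z = (X_tr - mean) / std
X_te_z = (X_te - mean) / std

print(X_tr_z.mean(axis=0), X_tr_z.std(axis=0))
print(X_te_z)
```

The scaled training columns have mean 0 and standard deviation 1. The test row does not, because it is expressed relative to the training distribution.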

In [24]:
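import os

# Create the output folder first; to_csv does not create missing directories
os.makedirs("data/processed", exist_ok=True)

# Restore the column names, which the scaler's NumPy output does not carry
pd.DataFrame(X_train_scaled, columns=X_train.columns).to_csv("data/processed/X_train.csv", index=False)
pd.DataFrame(X_test_scaled, columns=X_test.columns).to_csv("data/processed/X_test.csv", index=False)

y_train.to_csv("data/processed/y_train.csv", index=False)
y_test.to_csv("data/processed/y_test.csv", index=False)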

## ✅ Summary

In this notebook, we:

- Cleaned and validated the dataset
- Converted all columns to numeric format
- Removed duplicate rows and counted missing values per column
- Defined features and target variable
- Split the dataset into training and testing sets
- Applied feature scaling
- Saved the processed datasets

The dataset is now fully prepared for model development in Notebook 04.