<a href="https://colab.research.google.com/github/meiladrahmani556/marine-cbm-ml-dissertation/blob/main/JupyterNotebook/01_dataset_acquisition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 01 – Dataset Acquisition

This notebook downloads and prepares the Condition-Based Monitoring (CBM) Marine dataset for further analysis.

Steps covered:

1. Kaggle API configuration  
2. Dataset download  
3. File extraction  
4. Data loading into pandas  
5. Initial inspection  
6. Saving raw dataset to project structure

In [1]:
!pip install -q kaggle

## Why Use the Kaggle API?

Using the Kaggle API ensures:

- Reproducibility of dataset acquisition  
- Automated dataset downloading  
- Clear documentation of data source  
- No manual file handling errors  

By scripting the download process, this notebook can be executed from scratch by any user with Kaggle credentials.

In [2]:
from google.colab import files

# Assign the result so the uploaded file's contents (including the API key)
# are not echoed in the cell output.
uploaded = files.upload()

Saving kaggle.json to kaggle.json

## Dataset Overview

Dataset: *Condition-Based Monitoring (CBM) in Marine System*

This dataset contains multivariate numerical sensor measurements collected from a marine gas turbine propulsion system operating under various conditions.

Each row represents a system state.

The target variables are the two decay state coefficients in the last two columns: `GT Compressor decay state coefficient` and `GT Turbine decay state coefficient`.

This is a regression problem: the goal is to predict these continuous degradation coefficients from the sensor readings.

In [3]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
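
The same credential setup can be done in Python without uploading the file through the browser. The following is a minimal sketch, assuming the username and key are supplied from somewhere private such as an environment variable or a secrets manager. `install_kaggle_credentials` is a hypothetical helper, not part of the Kaggle package.

```python
import json
import os
import stat
from pathlib import Path


def install_kaggle_credentials(username: str, key: str, config_dir: Path) -> Path:
    """Write kaggle.json into config_dir with owner-only read/write access.

    Hypothetical helper for illustration; it mirrors the mkdir/cp/chmod
    shell commands in the cell above.
    """
    config_dir.mkdir(parents=True, exist_ok=True)
    target = config_dir / "kaggle.json"
    target.write_text(json.dumps({"username": username, "key": key}))
    # Equivalent to `chmod 600`: owner may read and write, nobody else has access.
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

A call such as `install_kaggle_credentials(user, key, Path.home() / ".kaggle")` would place the file where the Kaggle CLI looks for it. Because the key is never displayed, it does not end up in the saved notebook.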

## Project Folder Structure

The following directory structure is used:

data/
│
├── raw/        → Original downloaded dataset  
└── processed/  → Cleaned and transformed data  

Maintaining separate raw and processed folders prevents accidental overwriting of original data and supports reproducibility.

In [4]:
import os

os.makedirs("data/raw", exist_ok=True)
os.makedirs("data/processed", exist_ok=True)

print("Project folders created.")

Project folders created.


In [5]:
!kaggle datasets download -d kunalnehete/condition-based-monitoring-cbm-in-marine-system -p data/raw

Dataset URL: https://www.kaggle.com/datasets/kunalnehete/condition-based-monitoring-cbm-in-marine-system
License(s): Attribution 4.0 International (CC BY 4.0)
Downloading condition-based-monitoring-cbm-in-marine-system.zip to data/raw
  0% 0.00/483k [00:00<?, ?B/s]
100% 483k/483k [00:00<00:00, 676MB/s]


## Dataset Extraction

After downloading the compressed dataset from Kaggle, the ZIP file is extracted into the `data/raw` directory.

This ensures:

- The original archive is preserved  
- The extracted CSV files are accessible for loading  
- The dataset remains organised within the project structure  

In [6]:
import zipfile

for file in os.listdir("data/raw"):
    if file.endswith(".zip"):
        with zipfile.ZipFile(os.path.join("data/raw", file), 'r') as zip_ref:
            zip_ref.extractall("data/raw")

print("Extraction complete.")

Extraction complete.
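
The loop above extracts every member of every archive in `data/raw`. Where only the CSV files are wanted, extraction can be limited to those members. This is a minimal sketch using the standard `zipfile` module; `extract_csv_members` is a hypothetical helper and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_csv_members(archive: Path, dest: Path) -> list[str]:
    """Extract only the .csv members of `archive` into `dest`.

    Returns the member names that were extracted. The archive itself is
    opened read-only and left untouched.
    """
    with zipfile.ZipFile(archive) as zf:
        members = [name for name in zf.namelist() if name.lower().endswith(".csv")]
        zf.extractall(dest, members=members)
    return members
```

Returning the member names also gives the exact CSV filename to load, which avoids a separate directory listing.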


## Loading the Dataset

The dataset is loaded using pandas for:

- Initial inspection  
- Shape verification  
- Column identification  
- Missing value assessment  

This early inspection helps identify potential data quality issues before proceeding to Exploratory Data Analysis (EDA).

In [7]:
os.listdir("data/raw")

['condition-based-monitoring-cbm-in-marine-system.zip',
 'Conditional_Base_Monitoring in Marine_System.csv']

## Reproducibility

A copy of the loaded dataset is saved, unmodified, under a fixed filename:

data/raw/cbm_dataset_raw.csv

This provides:

- A stable reference version of the dataset  
- A consistent input path across future notebooks  
- A raw file that later steps read but do not overwrite  

All future preprocessing steps will read this saved raw dataset and write their results to `data/processed`.

In [8]:
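import pandas as pd

# Skip the copy saved at the end of this notebook so that re-runs always
# load the original Kaggle file; sort so the selection is deterministic.
csv_files = sorted(
    f for f in os.listdir("data/raw")
    if f.endswith(".csv") and f != "cbm_dataset_raw.csv"
)

print("CSV Files Found:", csv_files)

df = pd.read_csv(os.path.join("data/raw", csv_files[0]))

df.head()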

CSV Files Found: ['Conditional_Base_Monitoring in Marine_System.csv']


Unnamed: 0,Lever position,Ship speed (v),Gas Turbine (GT) shaft torque (GTT) [kN m],GT rate of revolutions (GTn) [rpm],Gas Generator rate of revolutions (GGn) [rpm],Starboard Propeller Torque (Ts) [kN],Port Propeller Torque (Tp) [kN],Hight Pressure (HP) Turbine exit temperature (T48) [C],GT Compressor inlet air temperature (T1) [C],GT Compressor outlet air temperature (T2) [C],HP Turbine exit pressure (P48) [bar],GT Compressor inlet air pressure (P1) [bar],GT Compressor outlet air pressure (P2) [bar],GT exhaust gas pressure (Pexh) [bar],Turbine Injecton Control (TIC) [%],Fuel flow (mf) [kg/s],GT Compressor decay state coefficient,GT Turbine decay state coefficient
0,5.14,15,21640.162,1924.358,8516.691,175.324,175.324,706.702,288,640.873,2.072,0.998,10.916,1.026,24.96,0.494,0.951,1.0
1,9.3,27,72776.229,3560.412,9759.837,645.137,645.137,1060.156,288,774.302,4.511,0.998,22.426,1.051,87.741,1.737,0.982,0.997
2,8.206,24,50994.673,3087.535,9313.854,438.11,438.11,927.728,288,734.474,3.577,0.998,18.412,1.041,60.546,1.199,0.966,0.988
3,5.14,15,21626.805,1924.329,8472.097,175.221,175.221,695.477,288,633.124,2.086,0.998,11.074,1.027,24.549,0.486,0.989,0.991
4,5.14,15,21636.43,1924.313,8494.777,,,731.494,288,645.642,2.078,0.998,11.197,1.026,26.373,0.522,0.95,0.975


In [9]:
print("Shape:", df.shape)
print("\nColumns:\n", df.columns)
print("\nMissing Values:\n", df.isnull().sum())

Shape: (12434, 18)

Columns:
 Index(['Lever position ', 'Ship speed (v) ',
       'Gas Turbine (GT) shaft torque (GTT) [kN m]  ',
       'GT rate of revolutions (GTn) [rpm]  ',
       'Gas Generator rate of revolutions (GGn) [rpm]  ',
       'Starboard Propeller Torque (Ts) [kN]  ',
       'Port Propeller Torque (Tp) [kN]  ',
       'Hight Pressure (HP) Turbine exit temperature (T48) [C]  ',
       'GT Compressor inlet air temperature (T1) [C]  ',
       'GT Compressor outlet air temperature (T2) [C]  ',
       'HP Turbine exit pressure (P48) [bar]  ',
       'GT Compressor inlet air pressure (P1) [bar]  ',
       'GT Compressor outlet air pressure (P2) [bar]  ',
       'GT exhaust gas pressure (Pexh) [bar]  ',
       'Turbine Injecton Control (TIC) [%]  ', 'Fuel flow (mf) [kg/s]  ',
       'GT Compressor decay state coefficient  ',
       'GT Turbine decay state coefficient '],
      dtype='object')

Missing Values:
 Lever position                                              47
Ship 
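
The `Columns` output above shows that the column names carry trailing spaces (for example `'Lever position '`), which makes lookups by name error-prone. One possible way to handle this is sketched below. `strip_column_names` is a hypothetical helper, and whether to apply it here or in a later preprocessing notebook is a design choice.

```python
def strip_column_names(columns):
    """Map each original column name to the same name without surrounding whitespace.

    Hypothetical helper: the returned dict can be passed to
    DataFrame.rename(columns=...) to rename the columns.
    """
    return {name: name.strip() for name in columns}
```

With the DataFrame from this notebook, `df.rename(columns=strip_column_names(df.columns))` would return a copy with cleaned names while leaving the raw file on disk unchanged.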

In [10]:
df.to_csv("data/raw/cbm_dataset_raw.csv", index=False)
print("Raw dataset saved successfully.")

Raw dataset saved successfully.
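
To confirm in later notebooks that the saved raw file has not been modified, one option is to record a checksum when it is written and compare it before use. The following is a minimal sketch, assuming SHA-256 is an acceptable fingerprint. `sha256_of_file` is a hypothetical helper and not part of the original workflow.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, the value of `sha256_of_file(Path("data/raw/cbm_dataset_raw.csv"))` could be printed here and checked again at the start of each downstream notebook.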
