# Intelligent Data Analysis Project
### Matej Bebej (50%), Marian Kurcina (50%)

## Table of Contents
- Assignment
- Phase 2 - Data preprocessing

# Assignment

Oxygen saturation is a key indicator of the proper functioning of the respiratory and circulatory systems. When its value drops to a critically low level, it may indicate life-threatening conditions such as hypoxemia, respiratory failure, or severe infections. In such cases, immediate intervention is essential. Traditional monitoring relies on pulse oximeters, which can be affected by noise and motion artifacts and may have limitations in certain clinical situations.

Modern machine-learning-based approaches can estimate and predict critical oxygen saturation values with higher accuracy (critical oxygen saturation estimation). Models can use multimodal data such as heart rate, respiratory rate, blood pressure, or sensor signals. Trained on diverse datasets, they can identify early warning signs of desaturation, filter out noise, and provide timely alerts before oxygen saturation drops below a safe threshold.

The goal of this assignment is to become familiar with the issue of oxygen saturation monitoring, understand the contribution of artificial intelligence, and design a solution that could improve critical care and reduce risks associated with undiagnosed hypoxemia.

Each pair of students will work with an assigned dataset starting from Week 2. Your task is to predict the dependent variable “oximetry” (the predicted variable) using machine learning methods. In doing so, you will need to deal with various issues present in the data, such as inconsistent formats, missing values, outliers, and others.

The expected outcomes of the project are:

- the best-performing machine learning model, and

- a data pipeline for building it from the input data.

# Phase 2 – Data Preprocessing

In [1]:
import pandas as pd
import numpy as np
import dateparser
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
observation = pd.read_csv("dataset/observation.csv", sep='\t')
patient = pd.read_csv("dataset/patient.csv", sep='\t')
station = pd.read_csv("dataset/station.csv", sep='\t')

In this phase, you are expected to carry out data preprocessing for machine learning. The result should be a dataset (CSV or TSV), where each observation is described by one row.
Since scikit-learn only works with numerical data, something must be done with the non-numerical data.

Ensure the preprocessing is reproducible on both the training and test datasets, so that you can repeat the process multiple times as needed (iteratively).

Because preprocessing can change the shape and characteristics of the data, you may need to perform EDA (Exploratory Data Analysis) again as necessary. These techniques will not be graded again, but document any changes in the chosen methods.
You can solve data-related issues iteratively across all phases, as needed.

## 2.1 Implementation of Data Preprocessing

### A - Train–Test Split

Split the data into training and test sets according to your predefined ratio. Continue working only with the training dataset.

In [2]:
train, test = train_test_split(
    observation, 
    test_size=0.2,        # 20% for testing
    random_state=42,      # reproducibility
    shuffle=True
)

print("train shape:", train.shape)
print("test shape:", test.shape)

train shape: (9685, 23)
test shape: (2422, 23)


We split the observation data into training and test sets: 20% of the rows (2,422) form the test set and 80% (9,685) form the training set.

In [3]:
station_schema = {
    'location':'string',
    'code':'string',
    'revision':'date',
    'station':'string',
    'latitude':'float',
    'longitude':'float',
}
observation_schema = {
    'SpO₂':'float',
    'HR':'float',
    'PI':'float',
    'RR':'float',
    'EtCO₂':'float',
    'FiO₂':'float',
    'PRV':'float',
    'BP':'float',
    'Skin Temperature':'float',
    'Motion/Activity index':'float',
    'PVI':'float',
    'Hb level':'float',
    'SV':'float',
    'CO':'float',
    'Blood Flow Index':'float',
    'PPG waveform features':'float',
    'Signal Quality Index':'float',
    'Respiratory effort':'float',
    'O₂ extraction ratio':'float',
    'SNR':'float',
    'oximetry':'int',
    'latitude':'float',
    'longitude':'float'
}
patient_schema = {
    'residence':'string',              
    'current_location':'string',    
    'blood_group':'string',          
    'job':'string',                 
    'mail':'string',                
    'user_id':'int',             
    'birthdate':'date',           
    'company':'string',             
    'name':'string',                
    'username':'string',            
    'ssn':'string',                 
    'registration':'date',        
    'station_ID':'int'            
}

In [4]:
class EnforceSchema(BaseEstimator, TransformerMixin):
    def __init__(self, schema=None):
        self.schema = schema

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.schema is None:
            return X

        X = X.copy()

        for col, col_type in self.schema.items():
            if col not in X.columns:
                continue
            if col_type == 'int':
                X[col] = pd.to_numeric(X[col], errors='coerce').astype('Int64')

            elif col_type == 'float':
                X[col] = pd.to_numeric(X[col], errors='coerce').astype(float)

            elif col_type == 'numeric':
                X[col] = pd.to_numeric(X[col], errors='coerce')

            elif col_type == 'date':
                X[col] = X[col].apply(
                    lambda x: dateparser.parse(str(x))
                    if pd.notnull(x) else pd.NaT
                )

            elif col_type == 'string':
                # Convert only non-missing values, so NaN never becomes the text 'nan'
                X[col] = X[col].astype(str).where(X[col].notna())

            elif col_type == 'bool':
                X[col] = (
                    X[col]
                    .astype(str)
                    .str.lower()
                    .map({'true': True, 'false': False, '1': True, '0': False})
                )

            elif col_type == 'category':
                X[col] = X[col].astype('category')

    
        return X
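
The numeric branches of `EnforceSchema` rely on `pd.to_numeric(..., errors='coerce')`. The following is a minimal sketch of that coercion in isolation; the input values are hypothetical and serve only to show how unparseable entries are handled.

```python
import pandas as pd

# Hypothetical raw values, as they might arrive from a messy CSV (assumption).
raw = pd.DataFrame({
    "HR": ["72.5", "abc", None],
    "oximetry": ["1", "0", "?"],
})

# Unparseable entries become NaN instead of raising an error.
hr = pd.to_numeric(raw["HR"], errors="coerce").astype(float)

# The nullable Int64 dtype keeps integers as integers while allowing <NA>.
oximetry = pd.to_numeric(raw["oximetry"], errors="coerce").astype("Int64")

print(hr.tolist())
print(oximetry.tolist())
```

Coercing rather than raising means that malformed cells surface later as missing values, which the `DropNA` step can then handle.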

In [5]:
valid_ranges = {
    'SpO₂': (95, 100),
    'HR': (60, 100),
    'PI': (0.2, 20),
    'RR': (12, 20),
    'EtCO₂': (35, 45),
    'FiO₂': (21, 100),
    'PRV': (20, 200),
    'BP': (60, 120),
    'Skin Temperature': (33, 38),
    'Motion/Activity index': None,
    'PVI': (10, 20),
    'Hb level': (12, 18),
    'SV': (60, 100),
    'CO': (4, 8),
    'Blood Flow Index': None,
    'PPG waveform features': None,
    'Signal Quality Index': (0, 100),
    'Respiratory effort': None,
    'O₂ extraction ratio': (0.2, 3),
    'SNR': (20, 40)
}

In [6]:
class EnforceValueRanges(BaseEstimator, TransformerMixin):
    def __init__(self, ranges=None):
        self.ranges = ranges

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        if self.ranges is None:
            return X

        for col, valid_range in self.ranges.items():
            if col not in X.columns:
                continue
            if valid_range is None:
                continue  

            low, high = valid_range
            X[col] = pd.to_numeric(X[col], errors='coerce')

            mask = ~X[col].between(low, high, inclusive='both')
            X.loc[mask, col] = np.nan

        return X
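
As a small illustration of the masking logic in `EnforceValueRanges`, the sketch below applies the `HR` range from `valid_ranges` to a few hypothetical readings. It uses `Series.mask` instead of `.loc` assignment, which should give the same result for a single column.

```python
import numpy as np
import pandas as pd

# Hypothetical heart-rate readings; (60, 100) is the HR range from valid_ranges.
hr = pd.Series([72.0, 150.0, 60.0, np.nan, 100.0])
low, high = 60, 100

# between() is False for NaN, so missing values stay missing after masking.
out_of_range = ~hr.between(low, high, inclusive="both")
cleaned = hr.mask(out_of_range)

print(cleaned.tolist())
```

Both bounds are inclusive, so 60 and 100 are kept while 150 is replaced by NaN. Because the observation pipeline later applies `DropNA(how='any')`, every out-of-range value removes its whole row.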

In [7]:
class ParseLocation(BaseEstimator, TransformerMixin):
    def __init__(self, column='location'):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        if self.column not in X.columns:
            return X
        
        split_cols = X[self.column].astype(str).str.split('/', n=1, expand=True)
        X['continent'] = split_cols[0].str.strip()
        X['city'] = split_cols[1].str.strip()
        X = X.drop(columns=[self.column], errors='ignore')
        return X
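
The split performed by `ParseLocation` can be sketched on its own as follows. The location strings are hypothetical examples in the `continent/city` form; the three-part value is included only to show the effect of `n=1`.

```python
import pandas as pd

# Hypothetical location strings in the "continent/city" form (assumption).
location = pd.Series(["Europe/Madrid", "America/Argentina/Salta", "Asia/Kolkata"])

# n=1 splits only at the first "/", so any later "/" stays in the city part.
parts = location.str.split("/", n=1, expand=True)
continent = parts[0].str.strip()
city = parts[1].str.strip()

print(continent.tolist())
print(city.tolist())
```

Note that if no value in the column contains a `/`, `expand=True` yields a single column and `parts[1]` would raise a `KeyError`, so this approach assumes at least one well-formed location.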

In [8]:
class DropColumns(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        # Store the parameter unchanged so that sklearn.base.clone() works
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Treat None as "drop nothing"
        return X.drop(columns=self.columns or [], errors='ignore')

In [9]:
class RemoveDuplicates(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop_duplicates()

In [10]:
class DropNA(BaseEstimator, TransformerMixin):
    def __init__(self, how='any', subset=None):
        self.how = how
        self.subset = subset

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.dropna(how=self.how, subset=self.subset)

In [15]:
observation_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=observation_schema)),
    ("ranges", EnforceValueRanges(ranges=valid_ranges)),
    ("drop_geo", DropColumns(columns=['latitude', 'longitude'])),
    ("drop_na", DropNA(how='any')),
    ("remove_duplicates", RemoveDuplicates()),
])

station_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=station_schema)),
    ("drop_station_and_date", DropColumns(columns=['station', 'revision'])),
    ("parse_location", ParseLocation(column='location')),
    ("drop_na", DropNA(how='any')),
    ("remove_duplicates", RemoveDuplicates()),
])

patient_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=patient_schema)),
    ("remove_duplicates", RemoveDuplicates()),
])

observation = observation_pipeline.fit_transform(train)
station = station_pipeline.fit_transform(pd.read_csv("dataset/station.csv", sep='\t'))
patient = patient_pipeline.fit_transform(pd.read_csv("dataset/patient.csv", sep='\t'))

station.head()



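|   | code | latitude | longitude | continent | city |
|---|------|----------|-----------|-----------|------|
| 0 | ES | 37.35813 | -6.03731 | Europe | Madrid |
| 1 | CO | 7.83389 | -72.47417 | America | Bogota |
| 2 | IN | 26.44931 | 91.61356 | Asia | Kolkata |
| 3 | US | 37.95143 | -91.77127 | America | Chicago |
| 4 | DE | 48.21644 | 9.02596 | Europe | Berlin |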


### B - Data Transformation

Transform the data into a format suitable for machine learning, i.e. each observation must be described by one row, and each attribute must be numeric.
Iteratively integrate preprocessing steps from Phase 1 as part of a unified process.

The observation dataset has no string or categorical features, so we demonstrate encoding on the station dataset. The observation pipeline still includes the encoding step, but with an empty column list it encodes nothing and passes all columns through.

We also ignore the revision and station columns.

In [14]:
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])

observation_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=observation_schema)),
    ("ranges", EnforceValueRanges(ranges=valid_ranges)),
    ("drop_geo", DropColumns(columns=['latitude', 'longitude'])),
    ("drop_na", DropNA(how='any')),
    ("remove_duplicates", RemoveDuplicates()),  
    ("encode", ColumnTransformer([
        ('cat', cat_pipeline, [])
    ], remainder='passthrough').set_output(transform="pandas")),
])

station_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=station_schema)),
    ("drop_station_and_date", DropColumns(columns=['station', 'revision'])),
    ("parse_location", ParseLocation(column='location')),
    ("drop_na", DropNA(how='any')),
    ("remove_duplicates", RemoveDuplicates()),
    ("encode", ColumnTransformer([
        ('cat', cat_pipeline, ['continent', 'city', 'code'])
    ], remainder='passthrough').set_output(transform="pandas")),
])

patient_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=patient_schema)),
    ("remove_duplicates", RemoveDuplicates()),
    ("encode", ColumnTransformer([
        ('cat', cat_pipeline, ['blood_group'])
    ], remainder='passthrough').set_output(transform="pandas")),
])



observation = observation_pipeline.fit_transform(train)
station = station_pipeline.fit_transform(pd.read_csv("dataset/station.csv", sep='\t'))
patient = patient_pipeline.fit_transform(pd.read_csv("dataset/patient.csv", sep='\t'))

station.head()



|   | cat__continent_Africa | cat__continent_America | cat__continent_Asia | cat__continent_Atlantic | cat__continent_Australia | cat__continent_Europe | cat__continent_Indian | cat__continent_Pacific | cat__city_Abidjan | cat__city_Accra | ... | cat__code_UA | cat__code_US | cat__code_UY | cat__code_UZ | cat__code_VE | cat__code_VU | cat__code_YE | cat__code_ZA | remainder__latitude | remainder__longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 37.35813 | -6.03731 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.83389 | -72.47417 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 26.44931 | 91.61356 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 37.95143 | -91.77127 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 48.21644 | 9.02596 |


### C - Feature Scaling and Transformation

Transform the dataset attributes for machine learning using at least the following techniques:

- Scaling (2 techniques)
- Transformers (2 techniques)
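
One possible way to cover these requirements is sketched below, assuming that "transformers" refers to scikit-learn's distribution-changing transformers such as `PowerTransformer` and `QuantileTransformer`. The data is a synthetic skewed feature, not a column from the project dataset, and the choice of techniques is an illustration rather than the selected solution.

```python
import numpy as np
from sklearn.preprocessing import (
    MinMaxScaler,
    PowerTransformer,
    QuantileTransformer,
    StandardScaler,
)

# Synthetic right-skewed feature standing in for a real column (assumption).
rng = np.random.default_rng(42)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 1))

# Scaling: change location and spread, keep the shape of the distribution.
X_std = StandardScaler().fit_transform(X)     # zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X)    # rescaled to [0, 1]

# Transformers: change the shape of the distribution itself.
X_power = PowerTransformer(method="yeo-johnson").fit_transform(X)
X_quantile = QuantileTransformer(
    n_quantiles=100, output_distribution="normal", random_state=42
).fit_transform(X)

print(X_std.mean(), X_std.std())
print(X_minmax.min(), X_minmax.max())
```

Scalers are usually enough for roughly symmetric features, while the power and quantile transformers may help for heavily skewed ones. Any of them can be added as a further step in the existing pipelines.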

### D - Justification and Documentation

Justify your choices/decisions for implementation.

## 2.2 Feature Selection 

### A - Identification of Informative Features

Identify which attributes (features) in your data are informative with respect to the target variable (use at least 3 techniques and compare their results).
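
A minimal sketch of comparing three techniques follows. It assumes that `oximetry` is treated as a binary class label and uses synthetic data from `make_classification` in place of the preprocessed training set; the three methods shown (ANOVA F-test, mutual information, random-forest importance) are one possible selection, not a prescribed one.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif, mutual_info_classif

# Synthetic stand-in for the preprocessed features and a binary target.
X_arr, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=42,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# 1) ANOVA F-test: measures separation of class means.
f_scores, _ = f_classif(X, y)

# 2) Mutual information: also captures non-linear dependence.
mi_scores = mutual_info_classif(X, y, random_state=42)

# 3) Tree-based importance from a random forest.
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

scores = pd.DataFrame(
    {
        "f_classif": f_scores,
        "mutual_info": mi_scores,
        "rf_importance": forest.feature_importances_,
    },
    index=X.columns,
)
print(scores.round(3))
```

The three columns are on different scales, so they are best compared by the ordering they induce rather than by raw values.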

### B - Ranking of Features

Rank the identified features by importance.
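
One simple way to combine several scoring methods into a single ranking is to rank features within each method and average the ranks. The sketch below uses made-up scores and placeholder feature names purely for illustration; they are not results from the project data.

```python
import pandas as pd

# Hypothetical scores from three selection methods (illustrative values only).
scores = pd.DataFrame(
    {
        "f_classif": [120.0, 3.5, 48.0],
        "mutual_info": [0.30, 0.02, 0.11],
        "rf_importance": [0.55, 0.05, 0.40],
    },
    index=["feature_a", "feature_b", "feature_c"],
)

# Rank within each method (1 = most informative), then average across methods.
ranks = scores.rank(ascending=False)
ranking = ranks.mean(axis=1).sort_values()

print(ranking)
```

Averaging ranks avoids having to normalize scores that live on different scales, at the cost of ignoring how large the gaps between features are.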

### C - Justification and Documentation

Justify your choices/decisions for implementation (i.e., provide documentation).

## 2.3 Reproducibility of Preprocessing

### Code Generalization for Reuse and Pipeline Implementation

Modify your preprocessing code for the training dataset so that it can be reused without further modifications to preprocess the test dataset in a machine learning context. Use the sklearn.pipeline functionality.