# Intelligent Data Analysis Project
### Matej Bebej (50%), Marian Kurcina (50%)

## Table of Contents
- Assignment
- Phase 2 – Data Preprocessing

# Assignment

Oxygen saturation is a key indicator of the proper functioning of the respiratory and circulatory systems. When its value drops to a critically low level, it may indicate life-threatening conditions such as hypoxemia, respiratory failure, or severe infections. In such cases, immediate intervention is essential. Traditional monitoring relies on pulse oximeters, which can be affected by noise and motion artifacts and have limitations in certain clinical situations.

Modern machine learning-based approaches make it possible to estimate and predict critical oxygen saturation values with higher accuracy (critical oxygen saturation estimation). Models can use multimodal data such as heart rate, respiratory rate, blood pressure, or sensor signals. Trained on diverse datasets, they can identify early warning signs of desaturation, filter out noise, and provide timely alerts before oxygen saturation drops below a safe threshold.

The goal of this assignment is to become familiar with the issue of oxygen saturation monitoring, understand the contribution of artificial intelligence, and design a solution that could improve critical care and reduce risks associated with undiagnosed hypoxemia.

Each pair of students will work with an assigned dataset starting from Week 2. Your task is to predict the dependent variable “oximetry” (the predicted variable) using machine learning methods. In doing so, you will need to deal with various issues present in the data, such as inconsistent formats, missing values, outliers, and others.

The expected outcomes of the project are:

- the best-performing machine learning model, and

- a data pipeline for building it from the input data.

# Phase 2 – Data Preprocessing

In [1]:
import pandas as pd
import numpy as np
import dateparser
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
observation = pd.read_csv("dataset/observation.csv", sep='\t')
patient = pd.read_csv("dataset/patient.csv", sep='\t')
station = pd.read_csv("dataset/station.csv", sep='\t')

In this phase, you are expected to carry out data preprocessing for machine learning. The result should be a dataset (CSV or TSV), where each observation is described by one row.
Since scikit-learn only works with numerical data, non-numerical data must be encoded or otherwise converted to numeric form.

Ensure the preprocessing is reproducible on both the training and test datasets, so that you can repeat the process multiple times as needed (iteratively).

Because preprocessing can change the shape and characteristics of the data, you may need to perform EDA (Exploratory Data Analysis) again as necessary. These techniques will not be graded again, but document any changes in the chosen methods.
You can solve data-related issues iteratively across all phases, as needed.

## 2.1 Implementation of Data Preprocessing

First, we preprocess all data by:
- enforcing a schema on each data frame
- removing the latitude and longitude columns from the observation data, since they are not medical data and showed no correlation with oximetry (as we learned in Phase 1 of the project, EDA)
- removing logical outliers, i.e. values outside the acceptable range
- removing records with a missing oximetry value
- removing duplicates

In [2]:
station_schema = {
    'location':'string',
    'code':'string',
    'revision':'date',
    'station':'string',
    'latitude':'float',
    'longitude':'float',
}
observation_schema = {
    'SpO₂':'float',
    'HR':'float',
    'PI':'float',
    'RR':'float',
    'EtCO₂':'float',
    'FiO₂':'float',
    'PRV':'float',
    'BP':'float',
    'Skin Temperature':'float',
    'Motion/Activity index':'float',
    'PVI':'float',
    'Hb level':'float',
    'SV':'float',
    'CO':'float',
    'Blood Flow Index':'float',
    'PPG waveform features':'float',
    'Signal Quality Index':'float',
    'Respiratory effort':'float',
    'O₂ extraction ratio':'float',
    'SNR':'float',
    'oximetry':'int',
    'latitude':'float',
    'longitude':'float'
}
patient_schema = {
    'residence':'string',              
    'current_location':'string',    
    'blood_group':'string',          
    'job':'string',                 
    'mail':'string',                
    'user_id':'int',             
    'birthdate':'date',           
    'company':'string',             
    'name':'string',                
    'username':'string',            
    'ssn':'string',                 
    'registration':'date',        
    'station_ID':'int'            
}

In [3]:

def enforce_schema(df, schema):
    if schema is None:
        return df 
        
    
    df = df.copy()

    for col, col_type in schema.items():
        if col not in df.columns:
            continue

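        if col_type == 'int':
            numeric = pd.to_numeric(df[col], errors='coerce')
            # NaN cannot be stored in a plain int column, so fall back to the
            # nullable Int64 dtype when any value is missing or failed to parse
            df[col] = numeric.astype(int) if numeric.notna().all() else numeric.astype('Int64')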

        elif col_type == 'float':
            df[col] = pd.to_numeric(df[col], errors='coerce').astype(float)

        elif col_type == 'numeric':
            df[col] = pd.to_numeric(df[col], errors='coerce')


        elif col_type == 'date':
            df[col] = df[col].apply(
                lambda x: dateparser.parse(str(x))
                if pd.notnull(x) else pd.NaT
            )

        elif col_type == 'string':
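            # Cast non-missing values to str and leave missing values as NaN
            # (avoids the str -> 'nan' -> NaN round trip)
            df[col] = df[col].where(df[col].isna(), df[col].astype(str))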

        elif col_type == 'bool':
            df[col] = (
                df[col]
                .astype(str)
                .str.lower()
                .map({'true': True, 'false': False, '1': True, '0': False})
            )

        elif col_type == 'category':
            df[col] = df[col].astype('category')

    return df
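
Coercing with `errors='coerce'` turns unparseable entries into NaN, which a plain integer column cannot hold. The following is a minimal sketch of that behaviour on a made-up toy series (the values are illustrative and not taken from the project data); it contrasts a plain `int` cast with pandas' nullable `Int64` dtype.

```python
import pandas as pd

# Toy values (assumption, not project data); "oops" cannot be parsed as a number
raw = pd.Series(["1", "0", "oops"])
parsed = pd.to_numeric(raw, errors="coerce")  # unparseable entry becomes NaN

try:
    parsed.astype(int)  # a plain int column cannot represent NaN
    plain_cast_failed = False
except ValueError:
    plain_cast_failed = True

# The nullable integer dtype keeps the gap as <NA> instead of raising
nullable = parsed.astype("Int64")

print("plain int cast failed:", plain_cast_failed)
print("missing values kept:", int(nullable.isna().sum()))
```

In this project the integer columns contain no missing values, so the distinction only matters if the schema is reused on messier data.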

In [4]:
valid_ranges = {
    'SpO₂': (95, 100),
    'HR': (60, 100),
    'PI': (0.2, 20),
    'RR': (12, 20),
    'EtCO₂': (35, 45),
    'FiO₂': (21, 100),
    'PRV': (20, 200),
    'BP': (60, 120),
    'Skin Temperature': (33, 38),
    'Motion/Activity index': None,
    'PVI': (10, 20),
    'Hb level': (12, 18),
    'SV': (60, 100),
    'CO': (4, 8),
    'Blood Flow Index': None,
    'PPG waveform features': None,
    'Signal Quality Index': (0, 100),
    'Respiratory effort': None,
    'O₂ extraction ratio': (0.2, 3),
    'SNR': (20, 40)
}

In [5]:
def enforce_value_ranges(df, ranges):
    df = df.copy()

    for col, valid_range in ranges.items():
        if col not in df.columns:
            continue
        if valid_range is None:
            continue  # skip features with no range

        low, high = valid_range

        # Convert to numeric if needed
        df[col] = pd.to_numeric(df[col], errors='coerce')

        # Replace anything outside the range with NaN
        mask = ~df[col].between(low, high, inclusive='both')
        df.loc[mask, col] = np.nan

    return df
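
The core of the range check can be illustrated on a single column. This is a small sketch with made-up heart-rate readings (an assumption, not project data), using the `HR` bounds from `valid_ranges`; it shows that both bounds are inclusive and that values already missing stay missing.

```python
import numpy as np
import pandas as pd

# Toy readings (assumption): one implausibly high, one missing, one just below the bound
hr = pd.Series([72.0, 250.0, np.nan, 60.0, 59.9], name="HR")
low, high = 60, 100  # HR bounds as listed in valid_ranges

# between() is False for NaN, so missing values are simply left as NaN by where()
in_range = hr.between(low, high, inclusive="both")
cleaned = hr.where(in_range)

print(cleaned)
```

Only 72.0 and 60.0 survive here; the remaining entries become NaN and are left for a later imputation step.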


In [6]:
observation = enforce_schema(observation, observation_schema)
observation = enforce_value_ranges(observation, valid_ranges)

station = enforce_schema(station, station_schema)

patient = enforce_schema(patient, patient_schema)
observation.head()



Unnamed: 0,SpO₂,HR,PI,RR,EtCO₂,FiO₂,PRV,BP,Skin Temperature,Motion/Activity index,...,CO,Blood Flow Index,PPG waveform features,Signal Quality Index,Respiratory effort,O₂ extraction ratio,SNR,oximetry,latitude,longitude
0,96.511604,67.597663,14.451123,17.461063,41.262037,87.821291,126.965235,109.471152,35.650826,11.429092,...,4.004591,44.065731,36.60455,69.140438,57.097309,0.210117,33.584512,1,19.64745,-102.04897
1,98.113516,72.8729,4.699563,17.231104,40.220086,64.283914,139.509502,100.943658,35.313317,11.188645,...,4.014983,36.498878,61.305805,50.733704,61.220158,0.293664,30.528645,1,28.15112,-82.46148
2,98.623248,81.418306,12.056504,16.832868,39.953184,77.164206,104.396821,107.401302,36.017931,8.980842,...,4.083922,52.803185,49.432273,41.841466,57.554854,0.232518,22.357337,1,-38.16604,145.13643
3,96.821905,69.356881,11.04441,14.876013,38.765113,59.296747,180.845101,106.786082,35.433515,9.952747,...,4.006763,52.800923,68.710875,47.524447,48.971775,0.288125,25.88619,0,40.63316,-74.13653
4,98.523262,70.686313,5.963887,16.933547,41.470854,66.145767,111.525074,108.354216,35.258355,10.619401,...,4.008813,25.406073,33.993656,60.323832,54.807359,0.295855,20.836752,1,4.88441,101.96857


In [7]:
observation = observation.drop(columns=['latitude', 'longitude'], errors='ignore')
observation.head()

Unnamed: 0,SpO₂,HR,PI,RR,EtCO₂,FiO₂,PRV,BP,Skin Temperature,Motion/Activity index,...,Hb level,SV,CO,Blood Flow Index,PPG waveform features,Signal Quality Index,Respiratory effort,O₂ extraction ratio,SNR,oximetry
0,96.511604,67.597663,14.451123,17.461063,41.262037,87.821291,126.965235,109.471152,35.650826,11.429092,...,13.922029,83.801243,4.004591,44.065731,36.60455,69.140438,57.097309,0.210117,33.584512,1
1,98.113516,72.8729,4.699563,17.231104,40.220086,64.283914,139.509502,100.943658,35.313317,11.188645,...,16.116704,78.047479,4.014983,36.498878,61.305805,50.733704,61.220158,0.293664,30.528645,1
2,98.623248,81.418306,12.056504,16.832868,39.953184,77.164206,104.396821,107.401302,36.017931,8.980842,...,13.672567,86.640923,4.083922,52.803185,49.432273,41.841466,57.554854,0.232518,22.357337,1
3,96.821905,69.356881,11.04441,14.876013,38.765113,59.296747,180.845101,106.786082,35.433515,9.952747,...,14.955819,87.187544,4.006763,52.800923,68.710875,47.524447,48.971775,0.288125,25.88619,0
4,98.523262,70.686313,5.963887,16.933547,41.470854,66.145767,111.525074,108.354216,35.258355,10.619401,...,15.016218,87.354,4.008813,25.406073,33.993656,60.323832,54.807359,0.295855,20.836752,1


In [8]:
null_oximetry_rows = observation[observation['oximetry'].isna()]
print(len(null_oximetry_rows))

0
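
The check above finds no rows with a missing target, so nothing has to be removed for this dataset. If a different data draw did contain missing `oximetry` values, the rows could be dropped roughly as in this sketch, shown on a made-up toy frame (the values are assumptions).

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): the second row has no target value
toy = pd.DataFrame({
    "HR": [70.0, 82.0, 91.0],
    "oximetry": [1.0, np.nan, 0.0],
})

n_missing = int(toy["oximetry"].isna().sum())

# Keep only rows where the target is present; other columns may still contain NaN
toy_clean = toy.dropna(subset=["oximetry"]).reset_index(drop=True)

print("rows with missing target:", n_missing)
print("rows kept:", len(toy_clean))
```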


In [9]:
observation = observation.drop_duplicates()


In [10]:
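def parse_location(df):
    """Split 'location' ("continent/city") into separate continent and city columns."""
    df = df.copy()

    # Split on the first '/' only, so everything after it stays in the city part
    split_cols = df['location'].astype(str).str.split('/', n=1, expand=True)

    df['continent'] = split_cols[0].str.strip()
    df['city'] = split_cols[1].str.strip()

    df = df.drop(columns=['location'])

    return df

station = parse_location(station)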
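
The effect of splitting with `n=1` is easiest to see on a few example strings. The values below are illustrative assumptions, not rows from `station.csv`; the sketch shows that only the first `/` is used as a separator and that a value without `/` yields a missing city.

```python
import pandas as pd

# Illustrative location strings (assumption, not taken from the dataset)
locations = pd.Series(["Europe/Berlin", "America/Argentina/Cordoba", "Asia"])

# n=1 splits on the first '/' only; expand=True returns two columns
parts = locations.str.split("/", n=1, expand=True)
continent = parts[0].str.strip()
city = parts[1].str.strip()

print(pd.DataFrame({"continent": continent, "city": city}))
```

A missing city produced this way would later be filled by the categorical imputer rather than dropped.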

### A - Train–Test Split

Split the data into training and test sets according to your predefined ratio. Continue working only with the training dataset.

In [11]:
X = observation.drop(columns=['oximetry'])  
y = observation['oximetry']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,   
    random_state=42, 
    stratify=y       
)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (9542, 20)
X_test shape: (2386, 20)
y_train shape: (9542,)
y_test shape: (2386,)


We split the observation data into training and test sets with a stratified 80/20 split: 80% of the rows (9,542) form the training set and 20% (2,386) form the test set.
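
To illustrate what `stratify=y` does, here is a small sketch on synthetic data. The 40/60 class mix and the single feature `f1` are assumptions chosen for the example; they do not describe the real `oximetry` distribution.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data (assumption): 400 samples of class 0 and 600 of class 1
X_toy = pd.DataFrame({"f1": rng.normal(size=1000)})
y_toy = pd.Series(np.repeat([0, 1], [400, 600]), name="oximetry")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, the share of class 1 is preserved in both parts
print("train share of class 1:", round(y_tr.mean(), 3))
print("test share of class 1:", round(y_te.mean(), 3))
```

Comparing the two shares on the real split is a quick sanity check that the test set is representative of the target.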

### B - Data Transformation

Transform the data into a format suitable for machine learning, i.e. each observation must be described by one row, and each attribute must be numeric.
Iteratively integrate preprocessing steps from Phase 1 as part of a unified process.

The observation dataset contains no string or categorical features, so we demonstrate encoding on the station dataset instead. The pipeline still includes the encoding step for observation data, but it has no effect there.

We also ignore the `revision` and `station` columns.
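
On the point that encoding has no effect when there are no categorical columns: the sketch below assumes a toy frame with two numeric columns and an empty categorical column list (both are assumptions for illustration). `ColumnTransformer` skips a transformer whose column selection is empty, so only the scaled numeric columns come out.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy numeric-only frame (assumption)
toy = pd.DataFrame({"HR": [70.0, 80.0, 90.0], "RR": [14.0, 16.0, 18.0]})
toy_categorical = []          # nothing to encode
toy_numeric = ["HR", "RR"]

pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), toy_categorical),
    ("num", StandardScaler(), toy_numeric),
])

# The encoder is skipped because its column selection is empty
out = pre.fit_transform(toy)
print(out.shape)
```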

In [12]:
# Drop columns that are not used as features
station_cleaned = station.drop(columns=['revision', 'station'], errors='ignore')

categorical_cols = ['continent', 'city', 'code']
# Every remaining numeric column (latitude, longitude) is scaled
numeric_cols = station_cleaned.select_dtypes(include=['number']).columns.difference(categorical_cols)


# Fill missing categories with the most frequent value, then one-hot encode
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore', dtype=int))
])

preprocessor = ColumnTransformer([
    ('cat', cat_pipeline, categorical_cols),
    ('num', StandardScaler(), numeric_cols)
])

processed = preprocessor.fit_transform(station_cleaned)

encoded_cols = preprocessor.named_transformers_['cat'].named_steps['encoder'].get_feature_names_out(categorical_cols)

processed_df = pd.DataFrame(processed, columns=list(encoded_cols) + list(numeric_cols), index=station_cleaned.index)

for col in encoded_cols:
    processed_df[col] = processed_df[col].astype(int)

processed_df.head()

Unnamed: 0,continent_Africa,continent_America,continent_Asia,continent_Atlantic,continent_Australia,continent_Europe,continent_Indian,continent_Pacific,city_Abidjan,city_Accra,...,code_UA,code_US,code_UY,code_UZ,code_VE,code_VU,code_YE,code_ZA,latitude,longitude
0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0.363461,-0.206644
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,-0.869177,-1.11164
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,-0.091983,1.123546
3,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0.388231,-1.374503
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0.816796,-0.001454


### C - Feature Scaling and Transformation

Transform the dataset attributes for machine learning using at least the following techniques:

- Scaling (2 techniques)
- Transformers (2 techniques)
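
One possible way to cover these requirements is sketched below. The choice of `StandardScaler` and `MinMaxScaler` as scalers and `PowerTransformer` and `QuantileTransformer` as transformers is an assumption, not the authors' final selection, and the skewed toy data is synthetic. Each technique is fitted on the training part only and then applied unchanged to the test part.

```python
import numpy as np
from sklearn.preprocessing import (
    MinMaxScaler,
    PowerTransformer,
    QuantileTransformer,
    StandardScaler,
)

rng = np.random.default_rng(42)

# Synthetic right-skewed features (assumption), standing in for train/test splits
X_train_toy = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 2))
X_test_toy = rng.lognormal(mean=0.0, sigma=1.0, size=(100, 2))

techniques = {
    "standard": StandardScaler(),                      # zero mean, unit variance
    "minmax": MinMaxScaler(),                          # rescale to [0, 1]
    "yeo_johnson": PowerTransformer(method="yeo-johnson"),  # reduce skew
    "quantile_normal": QuantileTransformer(
        n_quantiles=100, output_distribution="normal", random_state=42
    ),                                                 # map to a normal shape
}

results = {}
for name, tf in techniques.items():
    train_out = tf.fit_transform(X_train_toy)  # parameters learned from train only
    test_out = tf.transform(X_test_toy)        # same parameters reused on test
    results[name] = (train_out, test_out)
    print(f"{name:16s} train mean={train_out.mean():+.3f} std={train_out.std():.3f}")
```

Scalers only shift and rescale a feature, while the two transformers also change the shape of its distribution, which is why both groups are worth comparing.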

### D - Justification and Documentation

Justify your choices/decisions for implementation.

## 2.2 Feature Selection 

### A - Identification of Informative Features

Identify which attributes (features) in your data are informative with respect to the target variable (use at least 3 techniques and compare their results).
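
As a hedged sketch of how three techniques could be compared side by side, the example below uses the ANOVA F-test, mutual information, and random forest importances. The selection of these three methods is an assumption, and the data is generated with `make_classification`, so the scores say nothing about the real observation features.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif, mutual_info_classif

# Synthetic data (assumption): with shuffle=False the first 3 columns are informative
X_arr, y_toy = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
cols = [f"f{i}" for i in range(6)]
X_toy = pd.DataFrame(X_arr, columns=cols)

# Technique 1: univariate ANOVA F-test
f_scores, _ = f_classif(X_toy, y_toy)

# Technique 2: mutual information between each feature and the target
mi_scores = mutual_info_classif(X_toy, y_toy, random_state=0)

# Technique 3: impurity-based importances from a random forest
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)

scores = pd.DataFrame(
    {
        "anova_f": f_scores,
        "mutual_info": mi_scores,
        "rf_importance": forest.feature_importances_,
    },
    index=cols,
)
print(scores.round(3))
```

The three scores live on different scales, so they are better compared through their orderings than through raw values.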

### B - Ranking of Features

Rank the identified features by importance.
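
One simple way to combine several score columns into a single ordering is to average the per-method ranks. The sketch below assumes this aggregation rule and uses invented scores for generic feature names; none of the numbers are project results.

```python
import pandas as pd

# Invented scores for illustration only (assumption, not measured results)
scores = pd.DataFrame(
    {
        "anova_f": [12.0, 3.5, 40.2, 0.8],
        "mutual_info": [0.10, 0.02, 0.25, 0.00],
        "rf_importance": [0.30, 0.10, 0.55, 0.05],
    },
    index=["feat_a", "feat_b", "feat_c", "feat_d"],
)

# Rank within each method (1 = most important), then average across methods
ranks = scores.rank(ascending=False)
ranking = ranks.mean(axis=1).sort_values()

print(ranking)
```

Averaging ranks avoids having to put F-statistics, mutual information, and forest importances on a common scale.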

### C - Justification and Documentation

Justify your choices/decisions for implementation (i.e., provide documentation).

## 2.3 Reproducibility of Preprocessing

### Code Generalization for Reuse and Pipeline Implementation

Modify your preprocessing code for the training dataset so that it can be reused without further modifications to preprocess the test dataset in a machine learning context. Use the sklearn.pipeline functionality.
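
A minimal sketch of such a reusable pipeline is shown below. It assumes a stateless range-masking step wrapped in `FunctionTransformer`, median imputation, and standard scaling; the helper `mask_out_of_range`, the reduced `TOY_RANGES` dictionary, and the toy frames are hypothetical and only illustrate the fit-on-train, transform-on-test pattern.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Reduced copy of two entries from valid_ranges (kept small for the example)
TOY_RANGES = {"HR": (60, 100), "RR": (12, 20)}


def mask_out_of_range(df):
    """Hypothetical helper: set values outside TOY_RANGES to NaN (no fitted state)."""
    df = df.copy()
    for col, (low, high) in TOY_RANGES.items():
        df[col] = df[col].where(df[col].between(low, high))
    return df


pipeline = Pipeline([
    ("range_mask", FunctionTransformer(mask_out_of_range)),
    ("imputer", SimpleImputer(strategy="median")),  # median is an assumed choice
    ("scaler", StandardScaler()),
])

# Toy data (assumption): one out-of-range HR and one missing RR in train
train = pd.DataFrame({"HR": [70.0, 80.0, 250.0, 90.0], "RR": [14.0, np.nan, 16.0, 18.0]})
test = pd.DataFrame({"HR": [85.0, 10.0], "RR": [15.0, 19.0]})

train_out = pipeline.fit_transform(train)  # medians and scaling learned from train only
test_out = pipeline.transform(test)        # the same fitted steps are reused on test

print(train_out.round(3))
print(test_out.round(3))
```

Because every step lives inside one `Pipeline` object, the test set is processed with exactly the statistics learned from the training set, which avoids leakage from test data into preprocessing.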