# Intelligent Data Analysis Project
### Matej Bebej (50%), Marian Kurcina (50%)

## Table of Contents
- Assignment
- Phase 2 - Data preprocessing

# Assignment

Oxygen saturation is a key indicator of the proper functioning of the respiratory and circulatory systems. When its value drops to a critically low level, it may indicate life-threatening conditions such as hypoxemia, respiratory failure, or severe infections. In such cases, immediate intervention is essential. Traditional monitoring relies on pulse oximeters, which can be affected by noise and motion artifacts and have limitations in certain clinical situations.

Modern machine learning–based approaches make it possible to estimate and predict critical oxygen saturation values more accurately (critical oxygen saturation estimation). Models can use multimodal data, such as heart rate, respiratory rate, blood pressure, or sensor signals. When trained on diverse datasets, such models can identify early warning signs of desaturation, filter out noise, and provide timely alerts before oxygen saturation drops below a safe threshold.

The goal of this assignment is to become familiar with the issue of oxygen saturation monitoring, understand the contribution of artificial intelligence, and design a solution that could improve critical care and reduce risks associated with undiagnosed hypoxemia.

Each pair of students will work with an assigned dataset starting from Week 2. Your task is to predict the dependent variable “oximetry” (the predicted variable) using machine learning methods. In doing so, you will need to deal with various issues present in the data, such as inconsistent formats, missing values, outliers, and others.

The expected outcomes of the project are:

- the best-performing machine learning model, and

- a data pipeline for building it from the input data.

# Phase 2 – Data Preprocessing

In [1]:
import pandas as pd
import numpy as np
import dateparser
observation = pd.read_csv("dataset/observation.csv", sep='\t')
patient = pd.read_csv("dataset/patient.csv", sep='\t')
station = pd.read_csv("dataset/station.csv", sep='\t')

In this phase, you are expected to carry out data preprocessing for machine learning. The result should be a dataset (CSV or TSV), where each observation is described by one row.
Since scikit-learn estimators expect numerical input, non-numerical columns must be converted to a numerical representation (or removed) before modelling.
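
One common way to turn a categorical column into numbers is one-hot encoding. The sketch below is only an illustration: the `ward` column and its values are made up and are not part of the assigned dataset. It uses scikit-learn's `OneHotEncoder` with `handle_unknown="ignore"`, so categories that appear only in the test data are encoded as all-zero rows instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; 'ward' is an illustrative column, not from the dataset.
train = pd.DataFrame({"ward": ["icu", "er", "icu"], "HR": [72.0, 88.0, 95.0]})
test = pd.DataFrame({"ward": ["er", "surgery"], "HR": [80.0, 67.0]})

# Learn the set of categories from the training data only.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train[["ward"]]).toarray()

# Reuse the fitted encoder on the test data; the unseen 'surgery' value
# becomes a row of zeros.
test_encoded = encoder.transform(test[["ward"]]).toarray()

print(encoder.categories_[0])
print(test_encoded)
```

Fitting the encoder on the training data and only transforming the test data keeps the column layout identical for both sets.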

Ensure the preprocessing is reproducible on both the training and test datasets, so that you can repeat the process multiple times as needed (iteratively).
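
A minimal sketch of what "reproducible on both datasets" can mean in practice: any statistic a preprocessing step needs is learned from the training data and then applied unchanged to the test data. The helper names `fit_fill_values` and `apply_fill_values` and the toy values are hypothetical, chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing heart-rate values.
train = pd.DataFrame({"HR": [60.0, np.nan, 100.0, 80.0]})
test = pd.DataFrame({"HR": [np.nan, 90.0]})


def fit_fill_values(df):
    """Learn per-column medians from the training data only."""
    return df.median(numeric_only=True)


def apply_fill_values(df, fill_values):
    """Fill missing values with previously learned medians; returns a new frame."""
    return df.fillna(fill_values)


fill_values = fit_fill_values(train)
train_clean = apply_fill_values(train, fill_values)
test_clean = apply_fill_values(test, fill_values)  # same medians, no refitting
```

Because the two steps are separate functions, the same call sequence can be rerun whenever the data or the chosen methods change.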

Because preprocessing can change the shape and characteristics of the data, you may need to perform EDA (Exploratory Data Analysis) again as necessary. These techniques will not be graded again, but document any changes in the chosen methods.
You can solve data-related issues iteratively across all phases, as needed.

## 2.1 Implementation of Data Preprocessing

### A - Train–Test Split

### B - Data Transformation

### C - Feature Scaling and Transformation

### D - Justification and Documentation

## 2.2 Feature Selection 

### A - Identification of Informative Features

### B - Ranking of Features

### C - Justification and Documentation

## 2.3 Reproducibility of Preprocessing

### A - Code Generalization for Reuse

In [49]:
#class Preprocess:

In [50]:
#class ExtractFeatures:

### B - Pipeline Implementation

In [52]:

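def enforce_schema(df, schema):
    """Return a copy of df with columns cast to the types named in schema.

    schema maps column names to one of: 'int', 'float', 'numeric', 'date',
    'string', 'bool', 'category'. Columns missing from df are skipped.
    Values that cannot be parsed become missing values.
    """
    if schema is None:
        return df

    df = df.copy()

    for col, col_type in schema.items():
        if col not in df.columns:
            continue

        if col_type == 'int':
            # errors='coerce' yields NaN for unparseable values, and a plain
            # int dtype cannot hold NaN, so use the nullable 'Int64' dtype.
            df[col] = pd.to_numeric(df[col], errors='coerce').astype('Int64')

        elif col_type == 'float':
            df[col] = pd.to_numeric(df[col], errors='coerce').astype(float)

        elif col_type == 'numeric':
            df[col] = pd.to_numeric(df[col], errors='coerce')

        elif col_type == 'date':
            # dateparser handles mixed date formats; missing values stay NaT.
            df[col] = df[col].apply(
                lambda x: dateparser.parse(str(x))
                if pd.notnull(x) else pd.NaT
            )

        elif col_type == 'string':
            # Convert only non-missing entries so NaN is not turned into 'nan'.
            df[col] = df[col].where(df[col].isna(), df[col].astype(str))

        elif col_type == 'bool':
            # Values outside the mapping (including missing ones) become NaN.
            df[col] = (
                df[col]
                .astype(str)
                .str.lower()
                .map({'true': True, 'false': False, '1': True, '0': False})
            )

        elif col_type == 'category':
            df[col] = df[col].astype('category')

    return df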
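
One detail worth checking when casting to integers: `pd.to_numeric(..., errors='coerce')` turns unparseable entries into `NaN`, and a plain NumPy integer dtype cannot store `NaN`. The short sketch below uses a made-up series to show the difference between `astype(int)` and pandas' nullable `'Int64'` dtype; it is an illustration, not part of the project pipeline.

```python
import pandas as pd

# Made-up raw values: two valid integers, one unparseable string, one missing.
raw = pd.Series(["1", "0", "unknown", None])

# Unparseable and missing entries become NaN, so the result is float64.
numeric = pd.to_numeric(raw, errors="coerce")

# Casting to a plain int fails because NaN has no integer representation.
try:
    numeric.astype(int)
    plain_int_ok = True
except (ValueError, TypeError):
    plain_int_ok = False

# The nullable integer dtype keeps the missing entries as <NA>.
nullable = numeric.astype("Int64")

print(plain_int_ok)
print(nullable)
```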

In [83]:
station_schema = {
    'location':'string',
    'code':'string',
    'revision':'date',
    'station':'string',
    'latitude':'float',
    'longitude':'float',
}

In [84]:
observation_schema = {
    'SpO₂':'float',
    'HR':'float',
    'PI':'float',
    'RR':'float',
    'EtCO₂':'float',
    'FiO₂':'float',
    'PRV':'float',
    'BP':'float',
    'Skin Temperature':'float',
    'Motion/Activity index':'float',
    'PVI':'float',
    'Hb level':'float',
    'SV':'float',
    'CO':'float',
    'Blood Flow Index':'float',
    'PPG waveform features':'float',
    'Signal Quality Index':'float',
    'Respiratory effort':'float',
    'O₂ extraction ratio':'float',
    'SNR':'float',
    'oximetry':'int',
    'latitude':'float',
    'longitude':'float'
}

In [85]:
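# enforce_schema returns a new DataFrame, so the result must be assigned back.
observation = pd.read_csv("dataset/observation.csv", sep='\t')
observation = enforce_schema(observation, observation_schema)
station = pd.read_csv("dataset/station.csv", sep='\t')
station = enforce_schema(station, station_schema)
patient = pd.read_csv("dataset/patient.csv", sep='\t')
# NOTE: patient_schema is not defined in the cells above; it must be defined before this call.
patient = enforce_schema(patient, patient_schema)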





1. Before the split: enforce the schema, remove logical outliers, do not join the tables, drop rows with a missing oximetry value, remove irrelevant columns, and remove duplicates.
2. Split the data into training and test sets.
3. Define the pipeline.
4. Fit on the training data.
5. Predict on the test data.
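
The planned steps above could be sketched as follows. This is a rough illustration under stated assumptions, not the final solution: the data are synthetic (only the column names `HR`, `RR`, and `oximetry` are borrowed from the schema), `oximetry` is assumed to be a binary label because the schema types it as an integer, and the imputer, scaler, and logistic regression are placeholder choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the observation table (assumption, not real data).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"HR": rng.normal(80, 10, n), "RR": rng.normal(16, 3, n)})
df["oximetry"] = (df["HR"] + rng.normal(0, 5, n) > 80).astype(int)
df.loc[10::25, "RR"] = np.nan                          # inject missing values
df = pd.concat([df, df.iloc[:5]], ignore_index=True)   # inject duplicate rows

# 1. Before the split: remove duplicates and rows without a target value.
df = df.drop_duplicates().dropna(subset=["oximetry"])
X = df.drop(columns=["oximetry"])
y = df["oximetry"]

# 2. Split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Define the pipeline: every step is fitted on the training data only.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# 4. Fit on the training data.
pipeline.fit(X_train, y_train)

# 5. Predict on the test data.
y_pred = pipeline.predict(X_test)
print(len(y_pred))
```

Keeping imputation and scaling inside the `Pipeline` means the test data are transformed with statistics learned from the training data, which avoids leaking test information into the model.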