# Intelligent Data Analysis Project
### Matej Bebej (50%), Marian Kurcina (50%)

## Table of Contents
- Assignment
- Phase 2 – Data Preprocessing

# Assignment

Oxygen saturation is a key indicator of the proper functioning of the respiratory and circulatory systems. When its value drops to a critically low level, it may indicate life-threatening conditions such as hypoxemia, respiratory failure, or severe infections. In such cases, immediate intervention is essential. Traditional monitoring relies on pulse oximeters, which can be affected by noise and motion artifacts and have limitations in certain clinical situations.

Modern machine learning-based approaches make it possible to estimate and predict critical oxygen saturation values with higher accuracy (critical oxygen saturation estimation). Models can use multimodal data such as heart rate, respiratory rate, blood pressure, or sensor signals. When trained on diverse datasets, they can identify early warning signs of desaturation, filter out noise, and provide timely alerts before oxygen saturation drops below a safe threshold.

The goal of this assignment is to become familiar with the issue of oxygen saturation monitoring, understand the contribution of artificial intelligence, and design a solution that could improve critical care and reduce risks associated with undiagnosed hypoxemia.

Each pair of students will work with an assigned dataset starting from Week 2. Your task is to predict the dependent variable “oximetry” (the predicted variable) using machine learning methods. In doing so, you will need to deal with various issues present in the data, such as inconsistent formats, missing values, outliers, and others.

The expected outcomes of the project are:

- the best-performing machine learning model, and

- a data pipeline for building it from the input data.

# Phase 2 – Data Preprocessing

In [1]:
import pandas as pd
import numpy as np
import dateparser
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (
    OneHotEncoder,
    StandardScaler,
    MinMaxScaler,
    PowerTransformer,
    PolynomialFeatures,
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
observation = pd.read_csv("dataset/observation.csv", sep='\t')
patient = pd.read_csv("dataset/patient.csv", sep='\t')
station = pd.read_csv("dataset/station.csv", sep='\t')

In this phase, you are expected to carry out data preprocessing for machine learning. The result should be a dataset (CSV or TSV), where each observation is described by one row.
Since scikit-learn only works with numerical data, something must be done with the non-numerical data.

Ensure the preprocessing is reproducible on both the training and test datasets, so that you can repeat the process multiple times as needed (iteratively).

Because preprocessing can change the shape and characteristics of the data, you may need to perform EDA (Exploratory Data Analysis) again as necessary. These techniques will not be graded again, but document any changes in the chosen methods.
You can solve data-related issues iteratively across all phases, as needed.

## 2.1 Implementation of Data Preprocessing

### A - Train–Test Split

Split the data into training and test sets according to your predefined ratio. Continue working only with the training dataset.

In [2]:
before = len(observation)
observation = observation.drop_duplicates()
after = len(observation)
print("Removed duplicates:", before - after)

Removed duplicates: 1


In [3]:
train, test = train_test_split(
    observation, 
    test_size=0.2,        # 20% for testing
    random_state=42,      # reproducibility
    shuffle=True
)

print("train shape:", train.shape)
print("test shape:", test.shape)

train shape: (9684, 23)
test shape: (2422, 23)


We split the observation data into training and test sets: 20% of the rows form the test set (2,422 rows) and 80% form the training set (9,684 rows).

In [4]:
station_schema = {
    'location':'string',
    'code':'string',
    'revision':'date',
    'station':'string',
    'latitude':'float',
    'longitude':'float',
}
observation_schema = {
    'SpO₂':'float',
    'HR':'float',
    'PI':'float',
    'RR':'float',
    'EtCO₂':'float',
    'FiO₂':'float',
    'PRV':'float',
    'BP':'float',
    'Skin Temperature':'float',
    'Motion/Activity index':'float',
    'PVI':'float',
    'Hb level':'float',
    'SV':'float',
    'CO':'float',
    'Blood Flow Index':'float',
    'PPG waveform features':'float',
    'Signal Quality Index':'float',
    'Respiratory effort':'float',
    'O₂ extraction ratio':'float',
    'SNR':'float',
    'oximetry':'int',
    'latitude':'float',
    'longitude':'float'
}
patient_schema = {
    'residence':'string',              
    'current_location':'string',    
    'blood_group':'string',          
    'job':'string',                 
    'mail':'string',                
    'user_id':'int',             
    'birthdate':'date',           
    'company':'string',             
    'name':'string',                
    'username':'string',            
    'ssn':'string',                 
    'registration':'date',        
    'station_ID':'int'            
}

In [5]:
class EnforceSchema(BaseEstimator, TransformerMixin):
    def __init__(self, schema=None):
        self.schema = schema

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.schema is None:
            return X

        X = X.copy()

        for col, col_type in self.schema.items():
            if col not in X.columns:
                continue
            if col_type == 'int':
                X[col] = pd.to_numeric(X[col], errors='coerce').astype('Int64')

            elif col_type == 'float':
                X[col] = pd.to_numeric(X[col], errors='coerce').astype(float)

            elif col_type == 'numeric':
                X[col] = pd.to_numeric(X[col], errors='coerce')

            elif col_type == 'date':
                X[col] = X[col].apply(
                    lambda x: dateparser.parse(str(x))
                    if pd.notnull(x) else pd.NaT
                )

            elif col_type == 'string':
                X[col] = X[col].astype(str)
                X[col] = X[col].replace('nan', np.nan)

            elif col_type == 'bool':
                X[col] = (
                    X[col]
                    .astype(str)
                    .str.lower()
                    .map({'true': True, 'false': False, '1': True, '0': False})
                )

            elif col_type == 'category':
                X[col] = X[col].astype('category')

    
        return X

In [6]:
valid_ranges = {
    'SpO₂': (95, 100),
    'HR': (60, 100),
    'PI': (0.2, 20),
    'RR': (12, 20),
    'EtCO₂': (35, 45),
    'FiO₂': (21, 100),
    'PRV': (20, 200),
    'BP': (60, 120),
    'Skin Temperature': (33, 38),
    'Motion/Activity index': None,
    'PVI': (10, 20),
    'Hb level': (12, 18),
    'SV': (60, 100),
    'CO': (4, 8),
    'Blood Flow Index': None,
    'PPG waveform features': None,
    'Signal Quality Index': (0, 100),
    'Respiratory effort': None,
    'O₂ extraction ratio': (0.2, 3),
    'SNR': (20, 40)
}

In [7]:
class EnforceValueRanges(BaseEstimator, TransformerMixin):
    def __init__(self, ranges=None):
        self.ranges = ranges

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        if self.ranges is None:
            return X

        for col, valid_range in self.ranges.items():
            if col not in X.columns:
                continue
            if valid_range is None:
                continue  

            low, high = valid_range
            X[col] = pd.to_numeric(X[col], errors='coerce')

            mask = ~X[col].between(low, high, inclusive='both')
            X.loc[mask, col] = np.nan

        return X
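
The masking logic in `EnforceValueRanges` can be illustrated on a single toy column. This is a minimal sketch with made-up heart-rate values (not taken from the dataset). It shows that every value outside the closed interval becomes `NaN`, and that values which were already missing stay missing because `between` returns `False` for `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy readings; the (60, 100) bounds mirror the 'HR' entry above.
hr = pd.Series([55.0, 72.0, 100.0, 140.0, np.nan])
low, high = 60, 100

# True wherever the value is outside [low, high] or is already NaN.
mask = ~hr.between(low, high, inclusive="both")

# Replace flagged values with NaN, as the transformer does via .loc.
cleaned = hr.mask(mask)
print(cleaned.tolist())
```

Because the pipelines apply `DropNA(how='any')` afterwards, each out-of-range value eventually removes its whole row, so the chosen bounds directly control how much data survives.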

In [8]:
class ParseLocation(BaseEstimator, TransformerMixin):
    def __init__(self, column='location'):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        if self.column not in X.columns:
            return X
        
        split_cols = X[self.column].astype(str).str.split('/', n=1, expand=True)
        X['continent'] = split_cols[0].str.strip()
        X['city'] = split_cols[1].str.strip()
        X = X.drop(columns=[self.column], errors='ignore')
        return X

In [9]:
class DropColumns(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns or []

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop(columns=self.columns, errors='ignore')

In [10]:
class RemoveDuplicates(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop_duplicates()

In [11]:
class DropNA(BaseEstimator, TransformerMixin):
    def __init__(self, how='any', subset=None):
        self.how = how
        self.subset = subset

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.dropna(how=self.how, subset=self.subset)

In [12]:
observation_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=observation_schema)),
    ("ranges", EnforceValueRanges(ranges=valid_ranges)),
    ("drop_geo", DropColumns(columns=['latitude', 'longitude'])),
    ("drop_na", DropNA(how='any')),
    ("remove_duplicates", RemoveDuplicates()),
])

station_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=station_schema)),
    ("drop_station_and_date", DropColumns(columns=['station', 'revision'])),
    ("parse_location", ParseLocation(column='location')),
    ("drop_na", DropNA(how='any')),
    ("remove_duplicates", RemoveDuplicates()),
])

patient_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=patient_schema)),
    ("remove_duplicates", RemoveDuplicates()),
])

observation = observation_pipeline.fit_transform(train)
station = station_pipeline.fit_transform(pd.read_csv("dataset/station.csv", sep='\t'))
patient = patient_pipeline.fit_transform(pd.read_csv("dataset/patient.csv", sep='\t'))

station.head()



Unnamed: 0,code,latitude,longitude,continent,city
0,ES,37.35813,-6.03731,Europe,Madrid
1,CO,7.83389,-72.47417,America,Bogota
2,IN,26.44931,91.61356,Asia,Kolkata
3,US,37.95143,-91.77127,America,Chicago
4,DE,48.21644,9.02596,Europe,Berlin


### B - Data Transformation

Transform the data into a format suitable for machine learning, i.e. each observation must be described by one row, and each attribute must be numeric.
Iteratively integrate preprocessing steps from Phase 1 as part of a unified process.

The observation dataset has no string or categorical features, so we demonstrate encoding on the station dataset. The observation pipeline still includes the encoding step, but it has no effect there because no columns are passed to it.

We also drop the revision and station columns.

In [13]:
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])

observation_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=observation_schema)),
    ("ranges", EnforceValueRanges(ranges=valid_ranges)),
    ("drop_geo", DropColumns(columns=['latitude', 'longitude'])),
    ("drop_na", DropNA(how='any')),
    ("remove_duplicates", RemoveDuplicates()),  
    ("encode", ColumnTransformer([
        ('cat', cat_pipeline, [])
    ], remainder='passthrough').set_output(transform="pandas")),
])

station_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=station_schema)),
    ("drop_station_and_date", DropColumns(columns=['station', 'revision'])),
    ("parse_location", ParseLocation(column='location')),
    ("drop_na", DropNA(how='any')),
    ("remove_duplicates", RemoveDuplicates()),
    ("encode", ColumnTransformer([
        ('cat', cat_pipeline, ['continent', 'city', 'code'])
    ], remainder='passthrough').set_output(transform="pandas")),
])

patient_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=patient_schema)),
    ("remove_duplicates", RemoveDuplicates()),
    ("encode", ColumnTransformer([
        ('cat', cat_pipeline, ['blood_group'])
    ], remainder='passthrough').set_output(transform="pandas")),
])



observation = observation_pipeline.fit_transform(train)
station = station_pipeline.fit_transform(pd.read_csv("dataset/station.csv", sep='\t'))
patient = patient_pipeline.fit_transform(pd.read_csv("dataset/patient.csv", sep='\t'))

station.head()



Unnamed: 0,cat__continent_Africa,cat__continent_America,cat__continent_Asia,cat__continent_Atlantic,cat__continent_Australia,cat__continent_Europe,cat__continent_Indian,cat__continent_Pacific,cat__city_Abidjan,cat__city_Accra,...,cat__code_UA,cat__code_US,cat__code_UY,cat__code_UZ,cat__code_VE,cat__code_VU,cat__code_YE,cat__code_ZA,remainder__latitude,remainder__longitude
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.35813,-6.03731
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.83389,-72.47417
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,26.44931,91.61356
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,37.95143,-91.77127
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,48.21644,9.02596


In [14]:
def get_numeric_columns(df):
    return df.select_dtypes(include=['int', 'float']).columns.tolist()

numeric_pipeline = Pipeline([
    #("power", PowerTransformer(method="yeo-johnson")),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler()),
    #("scaler", MinMaxScaler(feature_range=(-1, 1)))
])

observation_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=observation_schema)),
    ("ranges", EnforceValueRanges(ranges=valid_ranges)),
    ("drop_geo", DropColumns(columns=['latitude', 'longitude'])),
    ("drop_na", DropNA(how='any')),
    ("remove_duplicates", RemoveDuplicates()),  
    ("encode", ColumnTransformer([
        ('cat', cat_pipeline, []),
        ('num', numeric_pipeline, get_numeric_columns)
    ], remainder='passthrough').set_output(transform="pandas")),
])

station_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=station_schema)),
    ("drop_station_and_date", DropColumns(columns=['station', 'revision'])),
    ("parse_location", ParseLocation(column='location')),
    ("drop_na", DropNA(how='any')),
    ("remove_duplicates", RemoveDuplicates()),
    ("encode", ColumnTransformer([
        ('cat', cat_pipeline, []),
        ('num', numeric_pipeline, get_numeric_columns)
    ], remainder='passthrough').set_output(transform="pandas")),
])

patient_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=patient_schema)),
    ("remove_duplicates", RemoveDuplicates()),
    ("encode", ColumnTransformer([
        ('cat', cat_pipeline, [])
        #('num', numeric_pipeline, get_numeric_columns)
    ], remainder='passthrough').set_output(transform="pandas")),
])



observation = observation_pipeline.fit_transform(train)
station = station_pipeline.fit_transform(pd.read_csv("dataset/station.csv", sep='\t'))
patient = patient_pipeline.fit_transform(pd.read_csv("dataset/patient.csv", sep='\t'))

observation.head()



Unnamed: 0,num__SpO₂,num__HR,num__PI,num__RR,num__EtCO₂,num__FiO₂,num__PRV,num__BP,num__Skin Temperature,num__Motion/Activity index,...,num__Respiratory effort^2,num__Respiratory effort O₂ extraction ratio,num__Respiratory effort SNR,num__Respiratory effort oximetry,num__O₂ extraction ratio^2,num__O₂ extraction ratio SNR,num__O₂ extraction ratio oximetry,num__SNR^2,num__SNR oximetry,num__oximetry^2
108,1.954521,0.033698,1.224197,0.73408,0.380885,-0.2285,0.321029,1.904331,0.891956,1.386029,...,1.3338,0.770413,1.776819,1.280842,-0.578689,0.52692,0.677264,0.99997,1.155678,0.813547
12092,-0.992378,0.267028,0.550853,-1.609397,-0.140488,1.056885,0.444067,0.587254,-0.308649,-0.970241,...,0.967185,1.308606,-0.455511,-1.170694,0.763855,-0.923175,-1.209004,-1.321273,-1.175528,-1.229185
9658,-2.104468,0.593214,0.227687,-3.442672,-0.016086,1.978872,-0.562443,-0.775785,-0.103155,2.704977,...,1.756834,1.643355,0.191396,-1.170694,0.341288,-0.755348,-1.209004,-1.053444,-1.175528,-1.229185
2929,-0.523476,-0.406923,-0.314754,-0.380334,0.271949,0.227378,-0.690989,-0.831707,-2.303606,-1.473373,...,-0.628914,0.002941,0.144519,-1.170694,1.141841,1.467996,-1.209004,0.897467,-1.175528,-1.229185
10405,0.634871,0.5919,0.354329,0.86405,1.820713,-0.106281,0.33212,-1.150007,-1.600295,-0.776803,...,-0.027473,0.627438,0.942852,0.803044,1.145685,1.834231,1.066976,1.352354,1.264562,0.813547


Logic behind StandardScaler

For each numeric column, it computes z = (x - mean) / std, where mean and std are estimated from the data passed to fit.

After the transformation, each column has mean = 0 and std = 1.

StandardScaler puts all columns on a comparable scale, which prevents a feature from being overly influential simply because of its units or magnitude.
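
The formula can be checked against scikit-learn directly. The sketch below uses a made-up column (values are illustrative only) and compares `StandardScaler` with the manual computation. Note that scikit-learn uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column, one feature with five samples.
x = np.array([[60.0], [70.0], [80.0], [90.0], [100.0]])

scaler = StandardScaler()
z = scaler.fit_transform(x)

# Manual z-score with the population standard deviation (ddof=0).
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(z.ravel())
print(np.allclose(z, manual))
```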

Logic behind MinMaxScaler

With feature_range=(-1, 1), it maps the minimum of each feature to -1 and the maximum to 1, using the formula z = -1 + 2 * (x - x_min) / (x_max - x_min).

It preserves the shape of the original distribution and only rescales the values into a bounded interval.

It is useful when the values must lie in a fixed, small range (here from -1 to 1).
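
As a quick check of the min-max mapping to the range (-1, 1), the sketch below compares `MinMaxScaler(feature_range=(-1, 1))` with the hand-written formula on a made-up column (values are illustrative only).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy column, one feature with four samples.
x = np.array([[35.0], [38.0], [40.0], [45.0]])

scaler = MinMaxScaler(feature_range=(-1, 1))
z = scaler.fit_transform(x)

# Manual mapping: minimum -> -1, maximum -> 1, linear in between.
x_min, x_max = x.min(axis=0), x.max(axis=0)
manual = -1 + 2 * (x - x_min) / (x_max - x_min)

print(z.ravel())
print(np.allclose(z, manual))
```

Since the minimum and maximum are learned at fit time, test values outside the training range map outside [-1, 1] unless `clip=True` is set.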

In [15]:
def get_numeric_columns(df):
    return df.select_dtypes(include=['int', 'float']).columns.tolist()

numeric_pipeline = Pipeline([
    ("power", PowerTransformer(method="yeo-johnson")),
    #("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler()),
    #("scaler", MinMaxScaler(feature_range=(-1, 1)))
])

observation_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=observation_schema)),
    ("ranges", EnforceValueRanges(ranges=valid_ranges)),
    ("drop_geo", DropColumns(columns=['latitude', 'longitude'])),
    ("drop_na", DropNA(how='any')),
    ("remove_duplicates", RemoveDuplicates()),  
    ("encode", ColumnTransformer([
        ('cat', cat_pipeline, []),
        ('num', numeric_pipeline, get_numeric_columns)
    ], remainder='passthrough').set_output(transform="pandas")),
])

station_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=station_schema)),
    ("drop_station_and_date", DropColumns(columns=['station', 'revision'])),
    ("parse_location", ParseLocation(column='location')),
    ("drop_na", DropNA(how='any')),
    ("remove_duplicates", RemoveDuplicates()),
    ("encode", ColumnTransformer([
        ('cat', cat_pipeline, []),
        ('num', numeric_pipeline, get_numeric_columns)
    ], remainder='passthrough').set_output(transform="pandas")),
])

patient_pipeline = Pipeline([
    ("schema", EnforceSchema(schema=patient_schema)),
    ("remove_duplicates", RemoveDuplicates()),
    ("encode", ColumnTransformer([
        ('cat', cat_pipeline, [])
        #('num', numeric_pipeline, get_numeric_columns)
    ], remainder='passthrough').set_output(transform="pandas")),
])



observation = observation_pipeline.fit_transform(train)
station = station_pipeline.fit_transform(pd.read_csv("dataset/station.csv", sep='\t'))
patient = patient_pipeline.fit_transform(pd.read_csv("dataset/patient.csv", sep='\t'))

observation.head()



Unnamed: 0,num__SpO₂,num__HR,num__PI,num__RR,num__EtCO₂,num__FiO₂,num__PRV,num__BP,num__Skin Temperature,num__Motion/Activity index,...,num__Hb level,num__SV,num__CO,num__Blood Flow Index,num__PPG waveform features,num__Signal Quality Index,num__Respiratory effort,num__O₂ extraction ratio,num__SNR,num__oximetry
108,2.244423,0.036336,1.188052,0.717171,0.375025,-0.227016,0.322453,1.906521,0.892475,1.382773,...,-0.873398,0.524507,-0.312674,-0.754727,-0.463949,-0.705653,1.282818,-0.532311,1.001464,0.813547
12092,-1.010517,0.269449,0.610112,-1.542601,-0.147392,1.056563,0.44529,0.586723,-0.306314,-0.969859,...,-0.297935,0.19869,-0.029034,0.820387,0.230167,-0.751392,0.978776,0.791416,-1.450768,-1.229185
9658,-1.890977,0.594859,0.311683,-2.918267,-0.023086,1.974625,-0.561045,-0.776138,-0.100607,2.684545,...,0.373125,-0.405019,0.254606,-0.844819,-0.01976,1.871517,1.617233,0.39927,-1.073636,-1.229185
2929,-0.589237,-0.404662,-0.227979,-0.43018,0.265558,0.228752,-0.68983,-0.831986,-2.314869,-1.477345,...,1.728421,0.808179,-0.596314,-1.748582,-0.918355,0.398479,-0.568385,1.126381,0.91846,-1.229185
10405,0.59767,0.59355,0.430469,0.859807,1.837275,-0.104766,0.333528,-1.14976,-1.604353,-0.775262,...,-2.12608,1.281858,0.254606,0.017713,0.403107,0.989404,0.067939,1.129716,1.276301,0.813547


PowerTransformer (Yeo-Johnson method)

It transforms skewed, non-normal data into a more Gaussian-like shape.

Many models tend to behave better when their inputs are closer to normally distributed.
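
The effect on skewness can be sketched on synthetic data. The example below draws a right-skewed log-normal sample (an assumption for illustration, not our dataset) and compares the skewness before and after the Yeo-Johnson transform.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed sample; the distribution is an illustrative choice.
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# standardize=True is the default, so the output is also zero-mean, unit-variance.
pt = PowerTransformer(method="yeo-johnson")
x_t = pt.fit_transform(x)

skew_before = skew(x.ravel())
skew_after = skew(x_t.ravel())
print(f"skew before: {skew_before:.3f}, after: {skew_after:.3f}")
```

Because `PowerTransformer` standardizes its output by default, a `StandardScaler` placed directly after it should change the values very little.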

Polynomial features

We create new features by multiplying existing features with each other (and with themselves), which expands the feature space.

We hope the added interaction and squared terms will improve the accuracy of our model.

### C - Feature Scaling and Transformation

Transform the dataset attributes for machine learning using at least the following techniques:

- Scaling (2 techniques)
- Transformers (2 techniques)

### D - Justification and Documentation

Justify your choices/decisions for implementation.

## 2.2 Feature Selection 

### A - Identification of Informative Features

Identify which attributes (features) in your data are informative with respect to the target variable (use at least 3 techniques and compare their results).

### B - Ranking of Features

Rank the identified features by importance.
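
One possible way to compare three techniques and combine them into a single ranking is sketched below. It is only an illustration under several assumptions: the data is synthetic (`make_classification`), the target is treated as binary, and the three techniques (ANOVA F-test, mutual information, random forest importance) as well as the mean-rank aggregation are our own illustrative choices rather than prescribed by the assignment.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif, mutual_info_classif

# Synthetic stand-in for the preprocessed training data (hypothetical columns f0..f5).
X_arr, y = make_classification(
    n_samples=500, n_features=6, n_informative=3,
    n_redundant=0, shuffle=False, random_state=42,
)
cols = [f"f{i}" for i in range(6)]
X = pd.DataFrame(X_arr, columns=cols)

# Technique 1: univariate ANOVA F-test.
f_scores, _ = f_classif(X, y)
# Technique 2: mutual information between each feature and the target.
mi_scores = mutual_info_classif(X, y, random_state=42)
# Technique 3: impurity-based importance from a random forest.
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

scores = pd.DataFrame(
    {"f_classif": f_scores, "mutual_info": mi_scores,
     "rf_importance": rf.feature_importances_},
    index=cols,
)

# Rank per technique (1 = most informative), then order by mean rank.
ranks = scores.rank(ascending=False)
ranking = ranks.mean(axis=1).sort_values()
print(ranking)
```

Ranking per technique before averaging avoids mixing scores that live on different scales, and features on which the three techniques disagree are easy to spot in `ranks`.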

### C - Justification and Documentation

Justify your choices/decisions for implementation (i.e., provide documentation).

## 2.3 Reproducibility of Preprocessing

### Code Generalization for Reuse and Pipeline Implementation

Modify your preprocessing code for the training dataset so that it can be reused without further modifications to preprocess the test dataset in a machine learning context. Use the sklearn.pipeline functionality.