## Name: Jackline Mboya
## Adm No: 193670
## Course: Applied Machine Learning (AML) - DSA 8401
### Assignment 1.

## Introduction to the Data Preprocessing Assignment
This assignment focuses on a crucial phase of any machine learning project: **data preprocessing**. The goal is to prepare a raw dataset for analysis by cleaning it and transforming it into a format suitable for machine learning models.

The assignment uses a dataset from the MIMIC-III database. It contains measurements taken while monitoring patients in the Intensive Care Unit (ICU), together with a `target` column indicating in-hospital mortality.

The dataset contains a mix of numerical and categorical variables. This report outlines the step-by-step process of cleaning the data, including:

- Categorizing variables based on their type.

- Performing an initial statistical description to understand the raw data.

- Identifying and treating outliers and missing values.

- Recalculating statistics to see the impact of the cleaning process.

- Analyzing the advantages and disadvantages of the chosen preprocessing techniques.

By the end of this process, the dataset will be cleaned and ready for building a machine learning model of in-hospital mortality.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/jackieMboyaAchieng/Applied-Machine-Learning/refs/heads/main/Assignments/assigment%201/ihm_48_hours.csv')
df.head()

Unnamed: 0,Capillary refill rate,Diastolic blood pressure,Fraction inspired oxygen,Glascow coma scale eye opening,Glascow coma scale motor response,Glascow coma scale total,Glascow coma scale verbal response,Glucose,Heart Rate,Height,Mean blood pressure,Oxygen saturation,Respiratory rate,Systolic blood pressure,Temperature,Weight,pH,Patient_id,target
0,,73.0,,Spontaneously,Obeys Commands,,Oriented,-11.396037,-19.976803,,76.0,94.0,17.0,116.0,36.388889,83.5,,30552,0
1,,73.0,,Spontaneously,Obeys Commands,,Oriented,115.0,96.0,,76.0,95.0,18.0,116.0,36.388889,83.5,,30552,0
2,,73.0,,Spontaneously,Obeys Commands,,Oriented,115.0,96.0,,76.0,-6.497052,18.0,116.0,36.388889,83.5,,30552,0
3,,73.0,,Spontaneously,Obeys Commands,,Oriented,115.0,96.0,,76.0,95.0,18.0,116.0,36.388889,83.5,,30552,0
4,,73.0,,Spontaneously,Obeys Commands,,Oriented,115.0,96.0,,76.0,95.0,18.0,116.0,36.388889,83.5,,30552,0


### Data Understanding

In [3]:
df.shape

(300912, 19)

This dataset contains 300,912 rows (entries) and 19 columns (variables).

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300912 entries, 0 to 300911
Data columns (total 19 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   Capillary refill rate               6336 non-null    float64
 1   Diastolic blood pressure            296944 non-null  float64
 2   Fraction inspired oxygen            88464 non-null   float64
 3   Glascow coma scale eye opening      274190 non-null  object 
 4   Glascow coma scale motor response   296978 non-null  object 
 5   Glascow coma scale total            184416 non-null  float64
 6   Glascow coma scale verbal response  296884 non-null  object 
 7   Glucose                             300698 non-null  float64
 8   Heart Rate                          300912 non-null  float64
 9   Height                              55824 non-null   float64
 10  Mean blood pressure                 296984 non-null  float64
 11  Oxygen saturation         
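
The non-null counts above translate directly into missing-value percentages. The sketch below copies a few of the counts from the `df.info()` output into a hypothetical dictionary, `non_null`, so it runs without the full dataset. On the real DataFrame, `df.isna().mean() * 100` should give the same figures for every column.

```python
# Total number of rows, taken from df.shape
total_rows = 300912

# Non-null counts copied from the df.info() output above (subset only)
non_null = {
    "Capillary refill rate": 6336,
    "Fraction inspired oxygen": 88464,
    "Height": 55824,
    "Glucose": 300698,
}

# Percentage of missing values per column, rounded to two decimals
missing_pct = {
    column: round(100 * (1 - count / total_rows), 2)
    for column, count in non_null.items()
}

# Print columns from most to least missing
for column, pct in sorted(missing_pct.items(), key=lambda item: -item[1]):
    print(f"{column}: {pct}% missing")
```

Columns that are mostly empty, such as `Capillary refill rate`, need a different treatment from columns with only a handful of gaps, such as `Glucose`.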

In [5]:
df.describe()

Unnamed: 0,Capillary refill rate,Diastolic blood pressure,Fraction inspired oxygen,Glascow coma scale total,Glucose,Heart Rate,Height,Mean blood pressure,Oxygen saturation,Respiratory rate,Systolic blood pressure,Temperature,Weight,pH,target
count,6336.0,296944.0,88464.0,184416.0,300698.0,300912.0,55824.0,296984.0,300912.0,300864.0,300912.0,298848.0,221040.0,230614.0,300912.0
mean,0.219223,62.541099,0.599884,10.818123,130.628329,79.447793,168.543422,78.79108,95.34346,18.731265,119.694213,36.832834,82.969018,5.573617,0.142128
std,0.413753,341.559624,0.253919,4.334923,84.171126,32.14592,15.137414,29.52986,2529.203751,6.884248,23.396042,1.000075,26.765857,5.963634,0.349182
min,0.0,0.0,0.0,3.0,-19.999974,-19.999623,0.0,-34.0,-19.999687,0.0,0.0,0.0,0.0,-19.999706,0.0
25%,0.0,51.0,0.4,8.0,101.0,70.0,160.0,68.0,95.0,15.0,103.0,36.277802,66.6,7.31,0.0
50%,0.0,59.0,0.5,11.0,126.0,84.0,170.0,77.0,98.0,18.0,117.0,36.833333,79.099998,7.37,0.0
75%,0.0,69.0,0.7,15.0,158.0,97.0,178.0,88.0,100.0,22.0,134.0,37.388889,94.699997,7.42,0.0
max,1.0,100105.01,7.1,15.0,9999.0,941.0,203.0,9381.0,981023.0,1211.0,295.0,73.760002,931.224376,99.0,1.0
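
The summary already exposes values that cannot be real measurements: a negative minimum heart rate, a maximum oxygen saturation of 981023, and a maximum pH of 99. One way to quantify such values is to count how many fall outside a plausible range per column. The sketch below is illustrative only: the bounds in `plausible_ranges` are assumptions chosen for demonstration (not clinical guidance), `count_out_of_range` is a hypothetical helper, and the toy frame reuses a few values seen in the outputs above instead of loading the full dataset.

```python
import pandas as pd

# Assumed plausibility bounds per column (illustrative, not clinical guidance)
plausible_ranges = {
    "Heart Rate": (0, 300),
    "Oxygen saturation": (0, 100),
    "pH": (6.5, 8.0),
}

def count_out_of_range(frame: pd.DataFrame, ranges: dict) -> pd.Series:
    """Count values per column that fall outside the (low, high) bounds."""
    counts = {}
    for column, (low, high) in ranges.items():
        values = frame[column]
        # Comparisons with NaN are False, so missing values are not counted
        counts[column] = int(((values < low) | (values > high)).sum())
    return pd.Series(counts, name="out_of_range")

# Toy rows built from values that appear in the outputs above
toy = pd.DataFrame({
    "Heart Rate": [96.0, -19.976803, 941.0],
    "Oxygen saturation": [95.0, -6.497052, 981023.0],
    "pH": [7.36, None, 99.0],
})

out_of_range = count_out_of_range(toy, plausible_ranges)
print(out_of_range)
```

Calling `count_out_of_range(df, plausible_ranges)` on the full dataset would work the same way, provided the keys match the column names in use at that point in the notebook.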


## Data Cleaning

In [8]:
# Rename columns (variables) to short, snake_case names
new_names = {
    'Diastolic blood pressure': 'diastolic_blood_pressure',
    'Systolic blood pressure': 'systolic_blood_pressure',
    'Mean blood pressure': 'mean_blood_pressure',
    'Heart Rate': 'heart_rate',
    'Oxygen saturation': 'oxygen_saturation',
    'Respiratory rate': 'respiratory_rate',
    'Temperature': 'temperature',
    'Capillary refill rate': 'capillary_refill_rate',
    'Fraction inspired oxygen': 'fraction_inspired_oxygen',
    # "Glascow" is the spelling used in the raw column names
    'Glascow coma scale eye opening': 'gcs_eye',
    'Glascow coma scale motor response': 'gcs_motor',
    'Glascow coma scale total': 'gcs_total',
    'Glascow coma scale verbal response': 'gcs_verbal',
    'Patient_id': 'PatientID'
}
df.rename(columns=new_names, inplace=True)

In [9]:
df

Unnamed: 0,capillary_refill_rate,diastolic_blood_pressure,fraction_inspired_oxygen,gcs_eye,gcs_motor,gcs_total,gcs_verbal,Glucose,heart_rate,Height,mean_blood_pressure,oxygen_saturation,respiratory_rate,systolic_blood_pressure,temperature,Weight,pH,PatientID,target
0,,73.0,,Spontaneously,Obeys Commands,,Oriented,-11.396037,-19.976803,,76.0,94.000000,17.0,116.0,36.388889,83.5,,30552,0
1,,73.0,,Spontaneously,Obeys Commands,,Oriented,115.000000,96.000000,,76.0,95.000000,18.0,116.0,36.388889,83.5,,30552,0
2,,73.0,,Spontaneously,Obeys Commands,,Oriented,115.000000,96.000000,,76.0,-6.497052,18.0,116.0,36.388889,83.5,,30552,0
3,,73.0,,Spontaneously,Obeys Commands,,Oriented,115.000000,96.000000,,76.0,95.000000,18.0,116.0,36.388889,83.5,,30552,0
4,,73.0,,Spontaneously,Obeys Commands,,Oriented,115.000000,96.000000,,76.0,95.000000,18.0,116.0,36.388889,83.5,,30552,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
300907,,59.0,,4 Spontaneously,6 Obeys Commands,15.0,5 Oriented,105.000000,129.000000,,89.0,98.000000,14.0,110.0,38.277802,,7.36,11623,0
300908,,61.0,,4 Spontaneously,,15.0,5 Oriented,105.000000,108.000000,,80.0,98.000000,15.0,116.0,37.444445,,7.36,11623,0
300909,,61.0,,4 Spontaneously,6 Obeys Commands,15.0,5 Oriented,100.000000,108.000000,,80.0,98.000000,15.0,116.0,37.444445,,7.36,11623,0
300910,,52.0,,4 Spontaneously,6 Obeys Commands,15.0,5 Oriented,100.000000,108.000000,,80.0,98.000000,16.0,101.0,37.444445,,7.36,11623,0


#### Why Rename the Variables?

Renaming columns is an important step in data preprocessing for several reasons:

- **Readability and Clarity**: It makes the names easy to read for both analysts and end users. Instead of long names with spaces, we use short, simple names. This helps us understand what the data is about just by looking at the column names.
- **Consistency**: It keeps all the names in the same style. This helps us avoid confusion and mistakes when we write code.
- **Usability**: It helps the data work well with programming tools. Some tools do not handle spaces or special symbols in names well, so removing them is the safer approach.
- **Documentation**: Clear names explain themselves. For example, 'systolic_blood_pressure' quickly tells us it is about "Systolic blood pressure" with no need to explain further.
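
The consistency point can also be enforced programmatically rather than by typing each new name. The sketch below is an alternative to the manual mapping, not the method used in this notebook: `to_snake_case` is a hypothetical helper that lowercases a name and replaces spaces and symbols with underscores. Its results differ from the manual names in places (for example `patient_id` rather than `PatientID`, and full-length Glasgow coma scale names rather than `gcs_eye`).

```python
import re

def to_snake_case(name: str) -> str:
    """Lowercase a column name and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()

# A few raw column names copied from the dataset
raw_columns = ["Heart Rate", "Systolic blood pressure", "pH", "Patient_id"]

# Build a rename mapping in the same shape as a hand-written dictionary
auto_names = {column: to_snake_case(column) for column in raw_columns}
print(auto_names)
```

A dictionary like `auto_names` can be passed to `df.rename(columns=...)` in the same way as a hand-written one. The manual mapping remains useful where shorter, domain-specific names are preferred.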

### Enumeration of Variables