# Step 2: Data Preprocessing

---

## Imports and environment setup

In [1]:
import pandas as pd
from utils.visualization import header
from utils.visualization import count_zero_vals

## Preprocessing

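As discussed in [Step 1. EDA](./Step1.EDA.ipynb), the raw data needs some cleaning before modelling. The cell below loads both datasets, removes duplicates and unused columns, fixes a misspelled column name, replaces zero values in the training set with column means, and encodes the target column.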

In [2]:
# Load datasets
df_train = pd.read_csv("../data/Paitients_Files_Train.csv")
df_test = pd.read_csv("../data/Paitients_Files_Test.csv")

# Remove duplicates (if any)
df_train.drop_duplicates(inplace=True)
df_test.drop_duplicates(inplace=True)

# Dropping unused columns
df_train = df_train.drop(columns=['ID', 'Insurance'])  # Unused col
df_test = df_test.drop(columns=['ID', 'Insurance'])    # Unused col

# Remove duplicates again: with two columns dropped, rows that differed only in those columns are now identical
df_train.drop_duplicates(inplace=True)
df_test.drop_duplicates(inplace=True)

# Fix incorrect column name spelling, because I am very particular about grammar and spelling :)
df_train = df_train.rename(columns={"Sepssis": "Sepsis"})

# Replace zeros (treated as invalid readings) with the column mean.
# Only the training set is changed; each mean is computed before replacement, so it still includes the zeros.
for col in ['PRG', 'PL', 'PR', 'SK', 'TS', 'M11', 'BD2']:
    df_train[col] = df_train[col].replace(0, df_train[col].mean())

# Turn categorical values in the target Sepsis column into numerical values so that our model can work with the data
df_train["Sepsis"] = df_train["Sepsis"].map({"Negative": 0.0, "Positive": 1.0})

# Save our newly processed data into separate datasets
df_train.to_csv("../data/cleaned_train.csv", index=False)
df_test.to_csv("../data/cleaned_test.csv", index=False)

From now on, the model only needs to load these two files to access the preprocessed data:
- /data/cleaned_train.csv
- /data/cleaned_test.csv
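
To see why duplicates are dropped a second time in the preprocessing cell above, consider rows that differ only in a column that is later removed. The sketch below uses a made-up three-row frame (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy data (hypothetical values): rows 0 and 1 differ only by ID.
toy = pd.DataFrame({
    "ID": ["A1", "A2", "A3"],
    "PRG": [1, 1, 2],
    "Age": [30, 30, 41],
})

# No exact duplicates while ID is present.
rows_before = len(toy.drop_duplicates())

# After dropping ID, rows 0 and 1 become identical and one is removed.
rows_after = len(toy.drop(columns=["ID"]).drop_duplicates())

print(rows_before, rows_after)  # 3 2
```

Note that this also collapses distinct patients who happen to share identical measurements, which may or may not be desirable depending on the modelling goal.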

In [3]:
df_train = pd.read_csv("../data/cleaned_train.csv")
df_test = pd.read_csv("../data/cleaned_test.csv")

In [4]:
print(f"{header(9, 'SHAPE')}\n{df_train.shape}")
print(f"\n{header(19, 'NULL COUNT')}\n{df_train.isna().sum()}")
print(f"\n{header(39, 'COLUMNS OVERVIEW')}")
print(df_train.info())
df_train.head()

╔═══════╗
║ SHAPE ║
╚═══════╝
(599, 9)

╔═════════════════╗
║   NULL COUNT    ║
╚═════════════════╝
PRG       0
PL        0
PR        0
SK        0
TS        0
M11       0
BD2       0
Age       0
Sepsis    0
dtype: int64

╔═════════════════════════════════════╗
║          COLUMNS OVERVIEW           ║
╚═════════════════════════════════════╝
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599 entries, 0 to 598
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   PRG     599 non-null    float64
 1   PL      599 non-null    float64
 2   PR      599 non-null    float64
 3   SK      599 non-null    float64
 4   TS      599 non-null    float64
 5   M11     599 non-null    float64
 6   BD2     599 non-null    float64
 7   Age     599 non-null    int64  
 8   Sepsis  599 non-null    float64
dtypes: float64(8), int64(1)
memory usage: 42.2 KB
None


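|   | PRG | PL | PR | SK | TS | M11 | BD2 | Age | Sepsis |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 79.460768 | 33.6 | 0.627 | 50 | 1.0 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 79.460768 | 26.6 | 0.351 | 31 | 0.0 |
| 2 | 8.0 | 183.0 | 64.0 | 20.562604 | 79.460768 | 23.3 | 0.672 | 32 | 1.0 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0.0 |
| 4 | 3.824708 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1.0 |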
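
`Series.map` with a dictionary yields `NaN` for any value missing from the dictionary, so a misspelled or unexpected label would silently become a null. The 599 non-null `Sepsis` entries above suggest this did not happen here. A minimal sketch of an explicit check, using a made-up series, might look like this:

```python
import pandas as pd

label_map = {"Negative": 0.0, "Positive": 1.0}

# Hypothetical labels, including one unexpected spelling.
labels = pd.Series(["Positive", "Negative", "positive"])
encoded = labels.map(label_map)

# Values absent from the mapping come back as NaN.
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)  # ['positive']
```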


In [5]:
print(f"{header(9, 'SHAPE')}\n{df_test.shape}")
print(f"\n{header(19, 'NULL COUNT')}\n{df_test.isna().sum()}")
print(f"\n{header(39, 'COLUMNS OVERVIEW')}")
print(df_test.info())
df_test.head()

╔═══════╗
║ SHAPE ║
╚═══════╝
(169, 8)

╔═════════════════╗
║   NULL COUNT    ║
╚═════════════════╝
PRG    0
PL     0
PR     0
SK     0
TS     0
M11    0
BD2    0
Age    0
dtype: int64

╔═════════════════════════════════════╗
║          COLUMNS OVERVIEW           ║
╚═════════════════════════════════════╝
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   PRG     169 non-null    int64  
 1   PL      169 non-null    int64  
 2   PR      169 non-null    int64  
 3   SK      169 non-null    int64  
 4   TS      169 non-null    int64  
 5   M11     169 non-null    float64
 6   BD2     169 non-null    float64
 7   Age     169 non-null    int64  
dtypes: float64(2), int64(6)
memory usage: 10.7 KB
None


In [6]:
print(f"Null values count: {df_test.isna().sum().sum() + df_train.isna().sum().sum()}")

Null values count: 0


In [7]:
lookup_cols = ['PRG', 'PL', 'PR', 'SK', 'TS', 'M11', 'BD2', 'Age']
print(f"{header(35, 'TRAIN DATA ZERO VALUES COUNT')}\n{count_zero_vals(df_train, lookup_cols)}")
print(f"{header(35, 'TEST DATA ZERO VALUES COUNT')}\n{count_zero_vals(df_test, lookup_cols)}")

╔═════════════════════════════════╗
║  TRAIN DATA ZERO VALUES COUNT   ║
╚═════════════════════════════════╝
               Count     Percentage
-----------------------------------
PRG                0          0.00%
PL                 0          0.00%
PR                 0          0.00%
SK                 0          0.00%
TS                 0          0.00%
M11                0          0.00%
BD2                0          0.00%
Age                0          0.00%

╔═════════════════════════════════╗
║   TEST DATA ZERO VALUES COUNT   ║
╚═════════════════════════════════╝
               Count     Percentage
-----------------------------------
PRG               18         10.65%
PL                 0          0.00%
PR                 7          4.14%
SK                52         30.77%
TS                85         50.30%
M11                2          1.18%
BD2                0          0.00%
Age                0          0.00%
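
`count_zero_vals` comes from this project's `utils.visualization` module, and its implementation is not shown here. As a rough idea of how such a per-column zero tally could be computed, here is a hypothetical stand-in (`zero_summary` is an illustrative name, not the project's function), run on made-up data:

```python
import pandas as pd

def zero_summary(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Hypothetical helper: count zeros and their percentage per column."""
    counts = (df[cols] == 0).sum()
    return pd.DataFrame({
        "Count": counts,
        "Percentage": (counts / len(df) * 100).round(2),
    })

# Made-up frame for illustration only.
toy = pd.DataFrame({"SK": [0, 20, 0, 35], "Age": [21, 33, 50, 31]})
summary = zero_summary(toy, ["SK", "Age"])
print(summary)
```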



After this step, the cleaned data is ready for model development. Note that zeros were replaced only in the training set, so the test set still contains zero values, as the counts above show.
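
The zero counts above show that zeros remain in the test set, and the preprocessing cell computes each replacement mean with the zeros still included. One alternative, sketched here as an assumption rather than what this notebook does, is to treat zeros as missing, compute the mean from the remaining training values, and reuse those training means for the test set. The toy frames and the `impute_zeros` name are illustrative:

```python
import numpy as np
import pandas as pd

def impute_zeros(train: pd.DataFrame, test: pd.DataFrame, cols: list[str]):
    """Replace zeros in `cols` with the training mean of the non-zero values."""
    train, test = train.copy(), test.copy()
    # Mark zeros as missing so they do not pull the mean down.
    means = train[cols].replace(0, np.nan).mean()
    for col in cols:
        train[col] = train[col].replace(0, means[col])
        test[col] = test[col].replace(0, means[col])
    return train, test

# Made-up values for illustration only.
toy_train = pd.DataFrame({"TS": [0, 100, 200]})
toy_test = pd.DataFrame({"TS": [0, 50]})
new_train, new_test = impute_zeros(toy_train, toy_test, ["TS"])
print(new_train["TS"].tolist())  # [150.0, 100.0, 200.0]
print(new_test["TS"].tolist())   # [150.0, 50.0]
```

Using only training statistics keeps information from the test set out of the preprocessing, which is the usual reason to fit imputation values on the training data alone.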

**Note:** The code in this notebook is also implemented in [utils.processing.Data](../utils/processing.py) for reuse. See its documentation for more details.