# Explore the dataset

## 0. Load libraries

In [9]:
from ydata_profiling import ProfileReport
import pandas as pd
import webbrowser
import os
from scipy.stats import kstest

## 1. Load dataframe

Load the dataset processed in the previous notebook.

In [10]:
df = pd.read_pickle("./data/patient_imputed.pkl")
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37472 entries, 0 to 49968
Data columns (total 53 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   id                         37472 non-null  object        
 1   sex                        37472 non-null  category      
 2   age                        37472 non-null  int64         
 3   num_shots                  37472 non-null  int64         
 4   type_center                37472 non-null  category      
 5   vaccinated                 37472 non-null  category      
 6   icu                        37472 non-null  category      
 7   inpatient_days             37472 non-null  int64         
 8   admission_datetime         37472 non-null  datetime64[ns]
 9   discharge_datetime         37472 non-null  datetime64[ns]
 10  hospital_outcome           37472 non-null  category      
 11  death_datetime             9724 non-null   datetime64[ns]
 12  delta_day

## 2. First Look

Let's explore the full dataset with the ydata-profiling library and save the result as an interactive HTML file.

In [11]:
report_path = "Ydata_report.html"

# Disabled because it takes a long time; run it only the first time.
#profile = ProfileReport(df.drop(columns='id'), title="Profile Report")  # id is dropped only for the report
#profile.to_file(report_path)
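
Instead of commenting the report generation in and out by hand, the slow step could be guarded by a file-existence check. The following is a minimal sketch under stated assumptions: `build_report_once` and `build_profile` are hypothetical helper names that are not part of the original notebook, and `ProfileReport` is imported lazily so that the guard itself does not need the library.

```python
import os


def build_report_once(report_path, build_fn):
    """Call build_fn(report_path) only if report_path does not exist yet.

    Returns True if the report was built, False if an existing file was reused.
    """
    if os.path.exists(report_path):
        return False
    build_fn(report_path)
    return True


def build_profile(df, path):
    """Hypothetical builder: write the profiling report for df to path."""
    # Imported lazily so the heavy dependency is loaded only when needed.
    from ydata_profiling import ProfileReport

    # The id column is dropped only for the report.
    profile = ProfileReport(df.drop(columns="id"), title="Profile Report")
    profile.to_file(path)
```

With the notebook's variables, a call might look like `build_report_once(report_path, lambda p: build_profile(df, p))`; deleting the HTML file forces the report to be rebuilt.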

Let's open the file in the default web browser.

In [12]:
from pathlib import Path

def open_html_file(file_path):
    # Check if the file exists
    if os.path.exists(file_path):
        # webbrowser.open expects a URL, so convert the path to an absolute file:// URI
        webbrowser.open(Path(file_path).resolve().as_uri())
    else:
        print(f"The file '{file_path}' does not exist.")

open_html_file(report_path)

## 3. Normality

Because the numerical variables have a large sample size, we run the Kolmogorov-Smirnov normality test to separate the variables that follow a normal distribution from those that do not. With this information, we can then choose the most representative summary statistics for the tableone.
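
To illustrate how the normality result could feed into the summary table, here is a minimal sketch in plain pandas. It assumes the common convention of reporting mean (SD) for normal variables and median [Q1, Q3] for non-normal ones; the helper `summarize` and the toy values are hypothetical and not taken from the patient dataset.

```python
import pandas as pd


def summarize(series, is_normal):
    """Return 'mean (SD)' if is_normal, otherwise 'median [Q1, Q3]'."""
    s = series.dropna()
    if is_normal:
        return f"{s.mean():.1f} ({s.std():.1f})"
    q1, med, q3 = s.quantile([0.25, 0.5, 0.75])
    return f"{med:.1f} [{q1:.1f}, {q3:.1f}]"


# Toy data with one extreme value
toy = pd.Series([1, 2, 3, 4, 100])
print(summarize(toy, is_normal=True))   # 22.0 (43.6)
print(summarize(toy, is_normal=False))  # 3.0 [2.0, 4.0]
```

The single extreme value dominates the mean and standard deviation, while the median and quartiles barely move, which is why the distinction matters for skewed variables.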

In [13]:
# List to store numerical variables that are normal
normal_numerical_variables = []

# List to store numerical variables that are not normal
non_normal_numerical_variables = []

for column in df.columns:
    # Check if the column is numeric and not a date
    if pd.api.types.is_numeric_dtype(df[column]) and not pd.api.types.is_datetime64_any_dtype(df[column]):
        # Drop NaN values for the normality test
        data_for_test = df[column].dropna()
        
        if data_for_test.dtype == 'Int64':
            data_for_test = data_for_test.astype('int64')

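        # Perform Kolmogorov-Smirnov test.
        # Note: 'norm' with no extra arguments means the standard normal N(0, 1).
        stat, p_value = kstest(data_for_test, 'norm')

        # Set significance level (you can adjust as needed)
        alpha = 0.05

        # If the p-value is greater than alpha, normality is not rejected
        if p_value > alpha: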
            normal_numerical_variables.append(column)
        else:
            non_normal_numerical_variables.append(column)

# Print the results
print("Normal Numerical Variables:", normal_numerical_variables)
print("Non-Normal Numerical Variables (excluding dates):", non_normal_numerical_variables)

Normal Numerical Variables: []
Non-Normal Numerical Variables (excluding dates): ['age', 'num_shots', 'inpatient_days', 'delta_days_death', 'lab_alt', 'lab_ast', 'lab_creatinine', 'lab_crp', 'lab_ddimer', 'lab_glucose', 'lab_hct', 'lab_hemoglobin', 'lab_inr', 'lab_ldh', 'lab_leukocyte', 'lab_lymphocyte', 'lab_lymphocyte_percentage', 'lab_mch', 'lab_mcv', 'lab_neutrophil', 'lab_neutrophil_percentage', 'lab_platelet', 'lab_potassium', 'lab_rbc', 'lab_sodium', 'lab_urea']



All numerical variables in the dataset are classified as non-normal by the Kolmogorov-Smirnov test at a significance level of alpha = 0.05. Note that `kstest(..., 'norm')` compares each variable against the standard normal distribution (mean 0, standard deviation 1), so a variable can be rejected because of its location and scale alone, not only its shape. Dates were not included in the normality analysis. Variables with missing values, such as 'delta_days_death', had their NA values dropped for the test only; in the dataframe itself they keep their original NA values.
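
A minimal sketch on synthetic data (not the patient dataset) of how the reference distribution passed to `kstest` affects the result. It assumes a normally distributed sample with mean 50 and standard deviation 10, tested once against the default standard normal and once against a normal with the sample's own mean and standard deviation, supplied through the `args` parameter.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)

# Synthetic, normally distributed sample with mean 50 and SD 10
sample = rng.normal(loc=50, scale=10, size=1000)

# Compared against the standard normal N(0, 1): rejected because of location and scale
_, p_standard = kstest(sample, "norm")

# Compared against a normal with the sample's own mean and SD
_, p_fitted = kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1)))

print(f"p-value vs N(0, 1):        {p_standard:.3g}")
print(f"p-value vs fitted normal:  {p_fitted:.3g}")
```

Estimating the mean and standard deviation from the same data makes the second p-value conservative, so a Lilliefors-type correction may be more appropriate when the parameters are not known in advance. With samples as large as this dataset, even small departures from normality can still lead to rejection.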