In [None]:
import pandas as pd
import boto3
import io
import numpy as np
from sklearn.preprocessing import LabelEncoder

In [None]:
bucket_name = "lneg-loka"
s3_key_name = "patient_data_raw/patient_data_raw.csv"
processed_file_name = "patient_data_processed.csv"

In [None]:
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket_name, Key=s3_key_name)
csv_string = obj['Body'].read().decode('utf-8')
df = pd.read_csv(io.StringIO(csv_string))

In [None]:
df

In [None]:
df['chronic_obstructive_pulmonary_disease'].unique()

In [None]:
df['chronic_obstructive_pulmonary_disease'].hist()

Check for NaN/missing values. Only the 'exercise_frequency' and 'education_level' columns contain any; the rest of the data is clean.

In [None]:
df.isna().sum()

Plotting the class-conditional distribution of 'exercise_frequency' shows that it should have little predictive power for chronic obstructive pulmonary disease: the distribution is virtually identical across all classes.

Plotting the class-conditional distribution of 'education_level' shows the same thing: the distribution is virtually identical across all classes, so it should also have little predictive power for chronic obstructive pulmonary disease.

Since the distributions of education_level and exercise_frequency are virtually class-independent, these features have little discriminative power for predicting chronic obstructive pulmonary disease. Because of this, I decided to discard them rather than use more complex or data-wasteful ways of dealing with missing/NaN values.
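The "virtually class-independent" claim above can also be checked numerically rather than only by eye. The following is a minimal sketch, assuming the feature is categorical; the helper name `class_conditional_table`, the toy values, and the class labels are hypothetical and not taken from the patient data.

```python
import pandas as pd

def class_conditional_table(frame, feature, target):
    """Estimate P(feature value | class): one row per class, each row sums to 1."""
    # Drop rows where the feature is missing so NaNs do not distort the proportions
    subset = frame[[feature, target]].dropna()
    return pd.crosstab(subset[target], subset[feature], normalize="index")

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "education_level": ["high", "low", "high", "low", "high", "low", None, "high"],
    "chronic_obstructive_pulmonary_disease": ["A", "A", "B", "B", "C", "C", "A", "D"],
})
table = class_conditional_table(
    toy, "education_level", "chronic_obstructive_pulmonary_disease"
)
print(table)

# Largest difference between any two classes for the same feature value;
# a value near 0 suggests the feature is close to class-independent.
max_gap = (table.max(axis=0) - table.min(axis=0)).max()
print(max_gap)
```

On the real data, the same helper could be called as `class_conditional_table(df, 'education_level', 'chronic_obstructive_pulmonary_disease')`; what counts as a "small" gap is a judgment call.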

In [None]:
covariate_name = "days_hospitalized"
target = "D"
# Values of the chosen covariate for patients in the target class
class_values = df.loc[df["chronic_obstructive_pulmonary_disease"] == target, covariate_name]
# Sorted unique non-NaN values; only needed to order the bars of the categorical plot
bins = np.sort(class_values.dropna().unique())
#class_values.value_counts().loc[bins].plot.bar() #use for categorical variables
class_values.hist(bins=20) #use for continuous real-valued variables

All variables have virtually equal class-conditional distributions, so it is unlikely that meaningful predictions for chronic obstructive pulmonary disease can be obtained from these features.
There is almost perfect collinearity between certain features, e.g. BMI and alanine_aminotransferase. Is the data for one of these features corrupted? In any case, at least one of them should be dropped.
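One way to look for such near-collinear pairs systematically is to scan the pairwise Pearson correlations. This is a rough sketch on synthetic data: the helper `highly_correlated_pairs`, the 0.95 threshold, the column name `BMI`, and the generated values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.95):
    """Return (feature_a, feature_b, r) for numeric column pairs with |Pearson r| >= threshold."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the strict upper triangle so that each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked-out cells) compare as False and are filtered out here
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

# Synthetic data (hypothetical): one column is almost a linear function of another
rng = np.random.default_rng(0)
bmi = rng.normal(27, 4, size=200)
toy = pd.DataFrame({
    "BMI": bmi,
    "alanine_aminotransferase": 2.0 * bmi + rng.normal(0, 0.1, size=200),
    "days_hospitalized": rng.normal(5, 2, size=200),
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

Pearson correlation only captures linear relationships, so this would miss non-linear dependencies between features.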

In [None]:
df_new = df.copy()

Preprocessing should be made part of the model pipeline, so that test data does not have to be preprocessed separately at inference time.
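A possible shape for this, sketched with scikit-learn's `Pipeline`: the column-dropping step is wrapped in a `FunctionTransformer` so that raw rows can be passed directly to `fit` and `predict`. The scaler, the logistic-regression classifier, the column name `BMI`, and the toy data are assumptions chosen for illustration, not the model used in this notebook; the sketch also assumes the remaining columns are numeric.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Columns this notebook decides to discard
DROPPED_COLUMNS = [
    "exercise_frequency",
    "education_level",
    "patient_id",
    "alanine_aminotransferase",
]

def drop_unused(frame):
    """Remove the discarded columns; columns that are absent are ignored."""
    return frame.drop(columns=DROPPED_COLUMNS, errors="ignore")

model = Pipeline([
    ("drop_unused", FunctionTransformer(drop_unused)),
    ("scale", StandardScaler()),
    ("classify", LogisticRegression(max_iter=1000)),  # placeholder classifier
])

# Toy raw data (hypothetical values), including NaNs in the dropped columns
rng = np.random.default_rng(0)
n = 40
raw = pd.DataFrame({
    "patient_id": np.arange(n),
    "education_level": [None, "high"] * (n // 2),
    "exercise_frequency": ["daily", None] * (n // 2),
    "alanine_aminotransferase": rng.normal(50, 10, n),
    "BMI": rng.normal(27, 4, n),
    "days_hospitalized": rng.integers(0, 30, n),
})
labels = rng.choice(["A", "B"], size=n)

# The same raw, unprocessed frame works for both training and inference
model.fit(raw, labels)
predictions = model.predict(raw)
```

With this arrangement the fitted pipeline object is the only thing that needs to be saved and loaded at inference time.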

Drop problematic/useless/suspicious features

In [None]:
df_new = df_new.drop('exercise_frequency', axis=1)  # has NaN values; since the feature is not predictive, might as well drop it
df_new = df_new.drop('education_level', axis=1)  # has NaN values; since the feature is not predictive, might as well drop it
df_new = df_new.drop('patient_id', axis=1)  # identifier, shouldn't be used to predict COPD: the model could memorize the label of each individual patient
df_new = df_new.drop('alanine_aminotransferase', axis=1)  # almost perfectly collinear with BMI; not meant to be used for COPD prediction

In [None]:
df_new