# Exploratory data analysis

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('data/D1.csv', low_memory=False)

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50031 entries, 0 to 50030
Data columns (total 39 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   encounter_id              50031 non-null  int64 
 1   patient_nbr               50031 non-null  int64 
 2   race                      50031 non-null  object
 3   gender                    50031 non-null  object
 4   age                       50031 non-null  object
 5   weight                    50031 non-null  object
 6   admission_type_id         50031 non-null  int64 
 7   discharge_disposition_id  50031 non-null  int64 
 8   admission_source_id       50031 non-null  int64 
 9   length_of_stay            50031 non-null  int64 
 10  payer_code                50031 non-null  object
 11  medical_specialty         50031 non-null  object
 12  num_lab_procedures        50031 non-null  int64 
 13  num_procedures            50031 non-null  int64 
 14  num_medications       

In [4]:
def value_counts(df):
    """
    Count the occurrences of each unique value in the object columns of a DataFrame.
    Present proportions of each unique value.

    :param df: A pandas DataFrame object.
    :return: None
    """
    for column in df.columns:
        if df[column].dtype == 'object':
            print(df[column].value_counts(normalize=True))
            print('-' * 50)


In [5]:
# value_counts(data)

In [6]:
data[['number_outpatient', 'number_inpatient', 'number_emergency']]

       number_outpatient  number_inpatient  number_emergency
0                      0                 0                 0
1                      0                 0                 0
2                      0                 0                 0
3                      0                 0                 0
4                      0                 0                 0
...                  ...               ...               ...
50026                  0                 0                 0
50027                  1                 0                 0
50028                  0                 0                 0
50029                  0                 0                 0


In [7]:
data['number_inpatient'].value_counts(normalize=True)

number_inpatient
0     0.684975
1     0.184386
2     0.068518
3     0.030141
4     0.014291
5     0.007136
6     0.004237
7     0.002419
8     0.001299
9     0.000859
10    0.000520
?     0.000300
11    0.000300
12    0.000220
13    0.000100
15    0.000080
14    0.000080
16    0.000080
17    0.000020
21    0.000020
18    0.000020
Name: proportion, dtype: float64

In [8]:
data['number_outpatient'].value_counts(normalize=True)

number_outpatient
0     0.888589
1     0.062501
2     0.021806
3     0.012732
4     0.006696
5     0.003538
6     0.001339
7     0.000620
8     0.000560
?     0.000400
9     0.000320
10    0.000240
11    0.000180
12    0.000100
13    0.000080
14    0.000080
16    0.000060
15    0.000040
20    0.000020
21    0.000020
35    0.000020
17    0.000020
29    0.000020
36    0.000020
Name: proportion, dtype: float64

In [9]:
data['number_emergency'].value_counts(normalize=True)

number_emergency
0     0.923128
1     0.052547
2     0.012872
3     0.004637
4     0.002638
?     0.001379
5     0.000899
6     0.000540
7     0.000480
8     0.000320
9     0.000200
10    0.000160
11    0.000060
22    0.000040
25    0.000020
13    0.000020
42    0.000020
16    0.000020
28    0.000020
Name: proportion, dtype: float64

In [10]:
data['weight'].value_counts(normalize=True)

weight
?            0.962783
[75-100)     0.015870
[50-75)      0.010993
[100-125)    0.006316
[125-150)    0.001459
[25-50)      0.001359
[0-25)       0.000740
[150-175)    0.000340
[175-200)    0.000120
>200         0.000020
Name: proportion, dtype: float64
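
Because `data.info()` reports no nulls while `?` stands in for missing values, the placeholder share can be summarised for every column in one pass instead of one `value_counts` call per column. The following is a minimal sketch on a toy frame, assuming `?` is the only placeholder in use; `placeholder_share` is a hypothetical helper name and the toy values are illustrative only.

```python
import pandas as pd


def placeholder_share(df: pd.DataFrame, placeholder: str = "?") -> pd.Series:
    """Return the fraction of rows equal to `placeholder` per column, largest first."""
    return df.eq(placeholder).mean().sort_values(ascending=False)


# Toy stand-in for the real data (values are illustrative, not taken from D1.csv).
toy = pd.DataFrame({
    "weight": ["?", "?", "?", "[75-100)"],
    "race": ["Caucasian", "?", "Asian", "Hispanic"],
    "length_of_stay": [3, 2, 4, 1],
})

shares = placeholder_share(toy)
print(shares)
```

On the real data this would be called as `placeholder_share(data)`; columns with a share near 1 are candidates for dropping.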

## Comments about the data types

- The `id` columns are integers, which is fine.
- The `race` and `gender` columns look fine apart from some missing values. We should convert them to a categorical data type.
- The `age` column is a string based on the formatting of the intervals. We should convert it to an interval data type.
- The `weight` column has about 96% missing values (`?`). I suggest dropping this column.
- The `payer_code` column has about 65% missing values (`?`). We should discuss whether it is necessary to keep this column. We might be able to assume that an empty payer code means that the patient does not have insurance. If not, I suggest dropping this column.
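
The conversions suggested in the list above could look roughly like the sketch below. It assumes every `age` label has the form `[low-high)` as in the value counts, and it runs on a toy frame rather than on `data`; `parse_age_bounds` is a hypothetical helper introduced for illustration.

```python
import pandas as pd


def parse_age_bounds(label: str) -> tuple[int, int]:
    """Turn a label such as '[70-80)' into the integer pair (70, 80)."""
    low, high = label.strip("[)").split("-")
    return int(low), int(high)


# Toy stand-in for the real data.
toy = pd.DataFrame({
    "race": ["Caucasian", "AfricanAmerican", "Caucasian"],
    "gender": ["Female", "Male", "Female"],
    "age": ["[70-80)", "[0-10)", "[90-100)"],
})

# Low-cardinality string columns become categoricals.
for column in ["race", "gender"]:
    toy[column] = toy[column].astype("category")

# Age labels become left-closed intervals, matching the '[low-high)' notation.
bounds = [parse_age_bounds(label) for label in toy["age"]]
toy["age"] = pd.Series(
    pd.IntervalIndex.from_tuples(bounds, closed="left"), index=toy.index
)

print(toy.dtypes)
```

An ordered categorical would be a simpler alternative for `age` if interval arithmetic is not needed.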

## Comments about the goal of the data mining

This looks like a 'length of stay' prediction problem. The goal is to predict the length of stay of a patient in the hospital. The `length_of_stay` column is the target variable. It has no missing values and the values fall in a manageable range (1 to 14). It is already stored as `int64`, so no type conversion is needed.

The `readmitted` column could be a secondary target variable. It is a categorical variable with three classes. We should convert this column to a categorical data type.

The `discharge_disposition_id` column could also be used as a secondary target variable. It is a categorical variable with 26 classes. It might be worth reducing the number of classes to a binary outcome variable (all-cause mortality) or to a categorical variable with fewer classes (e.g. discharged home, discharged to another facility, died).
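
One way to collapse the discharge IDs is a lookup dictionary with a catch-all class. The sketch below is only an illustration: the notebook does not show what each ID means, so `HYPOTHETICAL_GROUPS` is a placeholder mapping that would have to be replaced with the real data dictionary, and `group_discharge` is a hypothetical helper name.

```python
import pandas as pd

# HYPOTHETICAL mapping: the real meaning of each ID must come from the data dictionary.
HYPOTHETICAL_GROUPS = {1: "home", 3: "other_facility", 6: "home", 11: "died"}


def group_discharge(ids: pd.Series, groups: dict, default: str = "other") -> pd.Series:
    """Collapse discharge IDs into a few labelled classes; unmapped IDs get `default`."""
    return ids.map(groups).fillna(default).astype("category")


# Toy stand-in for data['discharge_disposition_id'].
toy_ids = pd.Series([1, 3, 11, 25, 1])
outcome = group_discharge(toy_ids, HYPOTHETICAL_GROUPS)

# Binary mortality flag derived under the same (hypothetical) mapping.
died = outcome == "died"

print(outcome.value_counts())
```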

We should discuss whether to filter rows by `admission_type_id`. If we choose length of stay as the target variable, we might want to exclude newborn and elective admissions. The same goes for `single_day_admission`: we might want to exclude single-day admissions.
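
The row filter discussed above could be sketched as follows. Two assumptions are made here: the notebook does not say which `admission_type_id` values mean newborn or elective, so the excluded set is a placeholder, and single-day admissions are approximated with `length_of_stay` because the encoding of `single_day_admission` is not shown. `filter_admissions` is a hypothetical helper name.

```python
import pandas as pd


def filter_admissions(df: pd.DataFrame, excluded_types, min_stay: int = 2) -> pd.DataFrame:
    """Drop rows with an excluded admission type or a stay shorter than `min_stay` days."""
    keep = ~df["admission_type_id"].isin(excluded_types) & (df["length_of_stay"] >= min_stay)
    return df.loc[keep].copy()


# Toy stand-in for the real data.
toy = pd.DataFrame({
    "admission_type_id": [1, 3, 4, 1],
    "length_of_stay": [3, 5, 2, 1],
})

# PLACEHOLDER IDs: replace with the codes that actually denote newborn/elective.
filtered = filter_admissions(toy, excluded_types={3, 4})
print(filtered)
```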

In [11]:
data['length_of_stay'].value_counts(normalize=True)

length_of_stay
3     0.166677
2     0.164018
4     0.134497
1     0.133717
5     0.097220
6     0.076233
7     0.058764
8     0.047231
9     0.032060
10    0.026364
11    0.020947
12    0.016770
13    0.013252
14    0.012252
Name: proportion, dtype: float64

In [12]:
data['admission_type_id'].value_counts(normalize=True)

admission_type_id
1    0.484999
2    0.194180
3    0.167816
6    0.083628
5    0.066099
8    0.003098
4    0.000140
7    0.000040
Name: proportion, dtype: float64

### Data mismatch corrections

Some of the data fields contain a `?` character. We should replace these with `np.nan` and convert the columns to the correct data type.
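
An alternative worth considering is to let `pd.read_csv` do the replacement at load time through its `na_values` parameter. The sketch below uses an in-memory CSV with made-up rows rather than `data/D1.csv`. Note that this turns every `?` into `NaN`, including those in string columns such as `race`, so it does not fit if some columns should get a label like `'Other'` instead.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file (rows are illustrative only).
csv_text = "number_inpatient,race\n0,Caucasian\n?,?\n2,Asian\n"

# Treat '?' as missing while parsing.
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")

# The count column is parsed as float because of the NaN; the nullable
# integer dtype keeps whole numbers and still allows missing values.
toy["number_inpatient"] = toy["number_inpatient"].astype("Int64")

print(toy.dtypes)
```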

In [13]:
# Make a copy for data quality corrections

data_quality = data.copy()

In [14]:
# Recode `?` in the race column as 'Other'
data_quality['race'] = data['race'].replace('?', 'Other')

In [15]:
data_quality.race.value_counts()

race
Caucasian          35732
AfricanAmerican    11149
Other               1867
Hispanic            1020
Asian                263
Name: count, dtype: int64

In [16]:
data_quality.gender.value_counts()

gender
Female             27000
Male               23030
Unknown/Invalid        1
Name: count, dtype: int64

In [17]:
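# Filter data_quality (not data) so the earlier corrections are kept;
# .copy() avoids chained-assignment warnings on later column updates.
data_quality = data_quality[data_quality['gender'] != 'Unknown/Invalid'].copy()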

In [18]:
data['age'].value_counts()

age
[70-80)     13109
[60-70)     10874
[50-60)      8775
[80-90)      7530
[40-50)      5064
[30-40)      2053
[90-100)     1178
[20-30)       842
[10-20)       468
[0-10)        138
Name: count, dtype: int64

In [19]:
data['weight'].value_counts(normalize=True)

weight
?            0.962783
[75-100)     0.015870
[50-75)      0.010993
[100-125)    0.006316
[125-150)    0.001459
[25-50)      0.001359
[0-25)       0.000740
[150-175)    0.000340
[175-200)    0.000120
>200         0.000020
Name: proportion, dtype: float64

In [20]:
data['discharge_disposition_id'].value_counts()

discharge_disposition_id
1     29385
3      5856
6      5309
18     3668
2       981
25      975
5       952
11      847
22      704
4       486
7       268
23      181
14      137
13       99
8        98
15       21
28       19
17       14
16       11
10        6
24        4
12        3
9         3
20        2
19        1
27        1
Name: count, dtype: int64

In [21]:
data['payer_code'].value_counts(normalize=True)

payer_code
?     0.652895
MC    0.187983
HM    0.031441
BC    0.029822
UN    0.025104
SP    0.023985
MD    0.019968
CP    0.014651
CM    0.004817
OG    0.003078
DM    0.002738
PO    0.001619
WC    0.000700
SI    0.000580
CH    0.000380
OT    0.000240
Name: proportion, dtype: float64

In [22]:
data_quality.loc[:,'payer_code'] = data.loc[:, 'payer_code'].replace('?', 'Unknown')

In [23]:
data_quality['payer_code'].value_counts(normalize=True)

payer_code
Unknown    0.652888
MC         0.187987
HM         0.031441
BC         0.029822
UN         0.025105
SP         0.023986
MD         0.019968
CP         0.014651
CM         0.004817
OG         0.003078
DM         0.002738
PO         0.001619
WC         0.000700
SI         0.000580
CH         0.000380
OT         0.000240
Name: proportion, dtype: float64

In [24]:
data.medical_specialty.value_counts(normalize=True)

medical_specialty
?                                0.354880
InternalMedicine                 0.215606
Family/GeneralPractice           0.109812
Cardiology                       0.073674
Emergency/Trauma                 0.039076
                                   ...   
SurgicalSpecialty                0.000020
Proctology                       0.000020
Psychiatry-Addictive             0.000020
Pediatrics-InfectiousDiseases    0.000020
Cardiology-Pediatric             0.000020
Name: proportion, Length: 68, dtype: float64

In [25]:
data_quality.loc[:,'medical_specialty'] = data.loc[:, 'medical_specialty'].replace('?', 'Unknown')

In [26]:
data_quality['medical_specialty'].value_counts(normalize=True)

medical_specialty
Unknown                          0.354887
InternalMedicine                 0.215611
Family/GeneralPractice           0.109814
Cardiology                       0.073656
Emergency/Trauma                 0.039077
                                   ...   
SurgicalSpecialty                0.000020
Proctology                       0.000020
Psychiatry-Addictive             0.000020
Pediatrics-InfectiousDiseases    0.000020
Cardiology-Pediatric             0.000020
Name: proportion, Length: 68, dtype: float64

In [27]:
def replace_question_mark(series):
    """
    Replace '?' values in the Series with NaN, and convert the Series to numeric.
    :param series: A pandas Series object.
    :return: Transformed Series
    """
    series = series.replace('?', np.nan)
    series = pd.to_numeric(series, errors='coerce')
    series = series.astype('Int64')
    return series



In [33]:
data_quality.loc[:, 'number_outpatient'] = replace_question_mark(data['number_outpatient'])
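
The value counts above show `?` in `number_inpatient` and `number_emergency` as well, so the same conversion presumably applies to all three count columns. A minimal sketch on a toy frame, repeating the notebook's `replace_question_mark` so the block is self-contained:

```python
import numpy as np
import pandas as pd


def replace_question_mark(series):
    """Replace '?' with NaN and convert the Series to the nullable Int64 dtype."""
    series = series.replace("?", np.nan)
    series = pd.to_numeric(series, errors="coerce")
    return series.astype("Int64")


# Toy stand-in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "number_outpatient": ["0", "1", "?"],
    "number_inpatient": ["2", "?", "0"],
    "number_emergency": ["0", "0", "4"],
})

count_columns = ["number_outpatient", "number_inpatient", "number_emergency"]
for column in count_columns:
    # Plain column assignment replaces the column, so the new dtype is kept.
    toy[column] = replace_question_mark(toy[column])

print(toy.dtypes)
```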


In [36]:
data_quality.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50030 entries, 0 to 50030
Data columns (total 39 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   encounter_id              50030 non-null  int64 
 1   patient_nbr               50030 non-null  int64 
 2   race                      50030 non-null  object
 3   gender                    50030 non-null  object
 4   age                       50030 non-null  object
 5   weight                    50030 non-null  object
 6   admission_type_id         50030 non-null  int64 
 7   discharge_disposition_id  50030 non-null  int64 
 8   admission_source_id       50030 non-null  int64 
 9   length_of_stay            50030 non-null  int64 
 10  payer_code                50030 non-null  object
 11  medical_specialty         50030 non-null  object
 12  num_lab_procedures        50030 non-null  int64 
 13  num_procedures            50030 non-null  int64 
 14  num_medications           5

In [38]:
data_quality.number_outpatient.value_counts()

number_outpatient
0     44456
1      3127
2      1091
3       637
4       335
5       177
6        67
7        31
8        28
9        16
10       12
11        9
12        5
13        4
14        4
16        3
15        2
20        1
21        1
35        1
17        1
29        1
36        1
Name: count, dtype: int64

In [32]:
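# Fails: Series.map calls replace_question_mark once per element, so each call
# receives a str, and str.replace rejects np.nan. Pass the whole Series to the
# function instead, as in cell In [33].
data_quality['number_outpatient'] = data['number_outpatient'].map(replace_question_mark)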

TypeError: replace() argument 2 must be str, not float
