# Variable transformation

In [134]:
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import requests
import sklearn
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from collections import Counter

%matplotlib inline

First, let's load in the full X and y dataframes.

In [52]:
with open('data/full_X_y_df.pkl', 'rb') as picklefile:
    [X, y] = pickle.load(picklefile)

We will separate predictors into 3 groups in order to keep track of them.

target: patient status (at time of discharge)

Predictor group 1: Personal demographics
* Age
* Gender
* Race
* Ethnicity
* Primary payer (Medicare, Medicaid, private insurance, etc.)
* Patient location (zip code/county vs public health region)

Predictor group 2: Details about hospital stay
* Day of the week patient admitted (difference between mortality in patients admitted on weekday vs weekend?)
* Type of hospital patient is admitted to; academic vs private vs community vs critical access hospitals
* Length of stay
* Type of admission (urgent vs emergent vs elective)
* Source of admission

Predictor group 3: Medical/procedural
* In-hospital vs out-of-hospital cardiac arrest (presumably, patients who had an admitting diagnosis of cardiac arrest experienced their arrest out-of-hospital, although there may be inaccuracy here if the patient was transferred from another hospital; need to account for this by looking at the source of admission)
* Other associated diagnosis codes, e.g. diabetes, heart failure, etc.
* Other associated procedural codes, e.g. heart surgery, mechanical ventilation, etc.
* Can separate out the associated diagnosis codes to distinguish between the medical conditions that are present on arrival (e.g. chronic conditions) vs medical conditions that develop during the hospital stay.

In [53]:
# target = 'pat_status'
# index = 'record_id'

personal_demographic_predictors = ['pat_age', 'sex_code', 'race', 'ethnicity', 'public_health_region', 'first_payment_src']
hospital_stay_predictors = ['provider_name', 'type_of_admission', 'source_of_admission', 'admit_weekday', 'length_of_stay', 'type_of_bill']

diag_codes_predictors = ['admitting_diagnosis']
diag_codes_predictors.extend([col for col in X.columns if 'diag_code' in col])

e_code_predictors = [col for col in X.columns if 'e_code' in col]

proc_code_predictors = ['princ_surg_proc_code', 'princ_surg_proc_day', 'princ_icd9_code']
proc_code_predictors.extend([col for col in X.columns if 'oth_surg' in col or 'oth_icd9' in col])

## Transform predictor columns

We now have to transform the features of the dataframe into a form that our models can actually use. Most of the predictor columns are categorical, so we use pd.get_dummies (or sklearn's OneHotEncoder) to create the dummy variables they require.
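As a minimal sketch of what dummification does, here is pd.get_dummies applied to a toy frame. The column names mirror the notebook, but the values are made up for illustration.

```python
import pandas as pd

# Toy frame; the column names mirror the notebook, the values are made up.
toy = pd.DataFrame({
    "sex_code": ["M", "F", "F"],
    "first_payment_src": ["MA", "MA", "16"],
})

# One indicator column per (column, level) pair, prefixed with the column name.
toy_dummies = pd.get_dummies(toy, columns=["sex_code", "first_payment_src"])
print(sorted(toy_dummies.columns))
```

Each row has exactly one indicator set per original column, which is the form the classifiers below expect.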

In [54]:
X[personal_demographic_predictors].head()

record_id,pat_age,sex_code,race,ethnicity,public_health_region,first_payment_src
120110921391,20,M,4,2,7,MA
120110922505,11,F,3,2,7,MA
120110921645,17,F,4,2,7,MA
120110919484,19,F,4,2,7,16
120110919499,16,M,5,1,7,MA


We need to fix the age ranges for patients who have a pat_age of 22 through 26. These codes are used for patients with HIV or drug/alcohol-related diagnoses, and they cover broader age ranges than the other codes. Those ranges are listed below:

* 22: 0-17
* 23: 18-44
* 24: 45-64
* 25: 65-74
* 26: 75+

Therefore, we can map these age ranges to the following:
* 22 -> (encompasses 00 to 05) -> 05 (15-17)
* 23 -> (encompasses 06 to 11) -> 09 (30-34)
* 24 -> (encompasses 12 to 15) -> 14 (50-54)
* 25 -> (encompasses 16 to 17) -> 17 (70-74)
* 26 -> (encompasses 18 to 21) -> 18 (75-79)

We'll also add another flag called 'hiv_drug' which is true for the patients with these values for their patient age, and false for all other patients.

In [55]:
# First we need to make X['pat_age'] numeric.

X['pat_age'] = pd.to_numeric(X['pat_age'])

In [56]:
# Now create a binary variable called 'hiv_drug' and initialize it to 0 for all patients.
X['hiv_drug'] = 0

# For every row where 'pat_age' is greater than 21, set the hiv_drug flag to 1 (true).
X.loc[X['pat_age'] > 21, 'hiv_drug'] = 1

In [57]:
# Now replace the patient ages for the hiv_drug patients with an approximate mapping to another age category.
age_replace_dict = {22: 5, 23: 9, 24: 14, 25: 17, 26: 18}

X = X.replace({'pat_age': age_replace_dict})

Let's go ahead and create the dummified dataframe for the personal demographic predictors.

In [63]:
personal_demographic_predictors

['pat_age',
 'sex_code',
 'race',
 'ethnicity',
 'public_health_region',
 'first_payment_src']

In [153]:
test = X[personal_demographic_predictors].copy()

In [159]:
[col for col in test]

['pat_age',
 'sex_code',
 'race',
 'ethnicity',
 'public_health_region',
 'first_payment_src']

In [162]:
# https://stackoverflow.com/questions/24109779/running-get-dummies-on-several-dataframe-columns

test = pd.concat([pd.get_dummies(test[col], prefix=col) for col in test if col != 'pat_age'], axis=1)

In [163]:
test.head()

record_id,sex_code_F,sex_code_M,sex_code_U,race_1,race_2,race_3,race_4,race_5,race_`,ethnicity_1,...,first_payment_src_CI,first_payment_src_HM,first_payment_src_LM,first_payment_src_MA,first_payment_src_MB,first_payment_src_MC,first_payment_src_OF,first_payment_src_VA,first_payment_src_WC,first_payment_src_ZZ
120110921391,0,1,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
120110922505,1,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
120110921645,1,0,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
120110919484,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
120110919499,0,1,0,0,0,0,0,1,0,1,...,0,0,0,1,0,0,0,0,0,0


In [164]:
test.columns

Index(['sex_code_F', 'sex_code_M', 'sex_code_U', 'race_1', 'race_2', 'race_3',
       'race_4', 'race_5', 'race_`', 'ethnicity_1', 'ethnicity_2',
       'ethnicity_`', 'public_health_region_01', 'public_health_region_02',
       'public_health_region_03', 'public_health_region_04',
       'public_health_region_05', 'public_health_region_06',
       'public_health_region_07', 'public_health_region_08',
       'public_health_region_09', 'public_health_region_10',
       'public_health_region_11', 'first_payment_src_09',
       'first_payment_src_11', 'first_payment_src_12', 'first_payment_src_13',
       'first_payment_src_14', 'first_payment_src_15', 'first_payment_src_16',
       'first_payment_src_AM', 'first_payment_src_BL', 'first_payment_src_CH',
       'first_payment_src_CI', 'first_payment_src_HM', 'first_payment_src_LM',
       'first_payment_src_MA', 'first_payment_src_MB', 'first_payment_src_MC',
       'first_payment_src_OF', 'first_payment_src_VA', 'first_payment_src_WC',
  

In [64]:
personal_demographic_dict = {
    'sex_code': pd.get_dummies(X['sex_code'], prefix='sex'),
    'race': pd.get_dummies(X['race'], prefix='race'),
    'ethnicity': pd.get_dummies(X['ethnicity'], prefix='ethn'),
    'public_health_region': pd.get_dummies(X['public_health_region'], prefix='ph_reg'),
    'first_payment_src': pd.get_dummies(X['first_payment_src'], prefix='first_pay')
}

In [120]:
personal_demographic_df = personal_demographic_dict['sex_code'].copy()
personal_demographic_df = personal_demographic_df.join(personal_demographic_dict['race'])\
.join(personal_demographic_dict['ethnicity'])\
.join(personal_demographic_dict['public_health_region'])\
.join(personal_demographic_dict['first_payment_src'])

In [121]:
personal_demographic_df.head()

record_id,sex_F,sex_M,sex_U,race_1,race_2,race_3,race_4,race_5,race_`,ethn_1,...,first_pay_CI,first_pay_HM,first_pay_LM,first_pay_MA,first_pay_MB,first_pay_MC,first_pay_OF,first_pay_VA,first_pay_WC,first_pay_ZZ
120110921391,0,1,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
120110922505,1,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
120110921645,1,0,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
120110919484,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
120110919499,0,1,0,0,0,0,0,1,0,1,...,0,0,0,1,0,0,0,0,0,0


In [122]:
personal_demographic_df.columns

Index(['sex_F', 'sex_M', 'sex_U', 'race_1', 'race_2', 'race_3', 'race_4',
       'race_5', 'race_`', 'ethn_1', 'ethn_2', 'ethn_`', 'ph_reg_01',
       'ph_reg_02', 'ph_reg_03', 'ph_reg_04', 'ph_reg_05', 'ph_reg_06',
       'ph_reg_07', 'ph_reg_08', 'ph_reg_09', 'ph_reg_10', 'ph_reg_11',
       'first_pay_09', 'first_pay_11', 'first_pay_12', 'first_pay_13',
       'first_pay_14', 'first_pay_15', 'first_pay_16', 'first_pay_AM',
       'first_pay_BL', 'first_pay_CH', 'first_pay_CI', 'first_pay_HM',
       'first_pay_LM', 'first_pay_MA', 'first_pay_MB', 'first_pay_MC',
       'first_pay_OF', 'first_pay_VA', 'first_pay_WC', 'first_pay_ZZ'],
      dtype='object')

In [135]:
personal_demographic_df['pat_age'] = X['pat_age']

In [127]:
personal_demographic_for_logr = personal_demographic_df.drop(['sex_U', 'race_`', 'ethn_`', 'ph_reg_11', 'first_pay_ZZ'], axis=1)

In [128]:
logr = LogisticRegression()

logr.fit(personal_demographic_for_logr, y)

logr.score(personal_demographic_for_logr, y)

0.6878935768261965

In [130]:
dummy_classifier = DummyClassifier(strategy='most_frequent')
dummy_classifier.fit(personal_demographic_for_logr, y)
dummy_classifier.score(personal_demographic_for_logr, y)

0.6874212846347607

In [136]:
dtc = DecisionTreeClassifier()
dtc.fit(personal_demographic_df, y)
dtc.score(personal_demographic_df, y)

0.796205919395466

In [137]:
rfc = RandomForestClassifier()
rfc.fit(personal_demographic_df, y)
rfc.score(personal_demographic_df, y)

0.7899086901763224

Let's try using sklearn's LabelEncoder and OneHotEncoder to see if we can make a pipeline that automatically creates a sparse matrix for the demographic features.

In [138]:
test = X[personal_demographic_predictors].copy()

In [139]:
test = test.replace({None: 'U'})

In [148]:
le = LabelEncoder()
test = test.apply(le.fit_transform)

In [152]:
one_hot_enc = OneHotEncoder(sparse=False)
one_hot_enc.fit_transform(test)
one_hot_enc.feature_indices_

array([ 0, 22, 25, 32, 36, 48, 69])

In [87]:
le = LabelEncoder()

z = X['sex_code'].copy()
z = z.replace({None: 'U'})

z = le.fit_transform(z)

In [98]:
le.classes_

array(['F', 'M', 'U'], dtype=object)

In [90]:
one_hot_enc = OneHotEncoder()

zed = one_hot_enc.fit_transform(np.asarray(z).reshape(-1,1))

In [97]:
zed.toarray()

array([[0., 1., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       ...,
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])
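For comparison, here is a hedged sketch of the same idea without the LabelEncoder step. It assumes a scikit-learn version in which OneHotEncoder accepts string columns directly (0.20 or later); the toy values are made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy demographic frame; missing sex codes are filled with 'U' as in the cells above.
toy = pd.DataFrame({
    "sex_code": ["M", "F", None, "F"],
    "race": ["4", "3", "4", "5"],
})
toy = toy.fillna("U")

# By default the encoder returns a scipy sparse matrix.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy)

print(encoded.shape)
print([list(levels) for levels in encoder.categories_])
```

The `categories_` attribute keeps the levels for every column, whereas reusing a single LabelEncoder across columns only retains the classes of the last column it was fitted on.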

In [62]:
# We can also make length of stay numeric.
X['length_of_stay'] = X['length_of_stay'].apply(pd.to_numeric)

### Description of each of the predictor groups

In [None]:
X.shape

In [None]:
X[personal_demographic_predictors].describe()

In [None]:
X[hospital_stay_predictors].describe()

In [None]:
X[diag_codes_predictors].describe()

In [None]:
X[e_code_predictors].describe()

In [None]:
X[proc_code_predictors].describe()

In [None]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

When logistic regression is fit on age alone, its accuracy matches that of the dummy classifier. The model appears never to predict that a patient is alive at the time of discharge; is this how the model should behave?

In [None]:
logr = LogisticRegression()

logr.fit(np.asarray(X['pat_age']).reshape(-1,1), y)

y_pred = logr.predict(np.asarray(X['pat_age']).reshape(-1,1))

print(set(y_pred))
print(accuracy_score(y, y_pred))
#logr.score(np.asarray(x_temp).reshape(-1, 1), y)

In [None]:
z = logr.predict_proba(np.asarray(X['pat_age']).reshape(-1,1))

In [None]:
z[0:20]

In [None]:
from sklearn.dummy import DummyClassifier

In [None]:
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(np.asarray(X['pat_age']).reshape(-1, 1), y)
y_pred = dummy.predict(np.asarray(X['pat_age']).reshape(-1, 1))

print(set(y_pred))
print(accuracy_score(y, y_pred))
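A small sketch of why the two accuracies can coincide: a model that predicts one class for every row scores exactly the share of that class, which is also what a most-frequent baseline scores. The labels below are made up purely to show the arithmetic.

```python
from collections import Counter

# Made-up labels, only to illustrate the arithmetic.
y_toy = ["expired"] * 7 + ["alive"] * 3

# The most-frequent baseline always predicts the majority label.
counts = Counter(y_toy)
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(y_toy)

# Any model that predicts a single class for every row scores the same.
constant_pred = [majority_label] * len(y_toy)
accuracy = sum(p == t for p, t in zip(constant_pred, y_toy)) / len(y_toy)

print(majority_label, baseline_accuracy, accuracy)
```

So a matching score suggests that age alone does not separate the classes well enough to move any prediction across the decision threshold.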

In [None]:
t_df = pd.DataFrame({'pat_age': X['pat_age'], 'pat_status': y})

In [None]:
t_df.groupby('pat_status').describe()

In [None]:
dead = t_df[t_df.pat_status == 'expired'].copy()
alive = t_df[t_df.pat_status == 'alive'].copy()

In [None]:
data = [dead['pat_age'], alive['pat_age']]
labels = ['expired', 'alive']

In [None]:
plt.figure(figsize=(10,6))
plt.boxplot(data, labels=labels)
plt.show()

### Dummy dataframe dictionary

Most of the predictor variables are categorical, so we are going to have to dummify multiple variables. We will keep track of each dummified categorical variable by using a dictionary to store the dataframe for each one.

In [None]:
dummy_df_dictionary = {}

In [None]:
dummy_df_dictionary['sex_code'] = pd.get_dummies(X['sex_code'], prefix='sex')
dummy_df_dictionary['race'] = pd.get_dummies(X['race'], prefix='race')
dummy_df_dictionary['ethnicity'] = pd.get_dummies(X['ethnicity'], prefix='ethnicity')
dummy_df_dictionary['first_payment_src'] = pd.get_dummies(X['first_payment_src'], prefix='first_pay')
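Once the dictionary is filled, the per-variable frames can be combined into a single design matrix. The sketch below uses a toy frame with made-up values and hypothetical names (`toy`, `toy_dummy_dict`); pd.concat aligns the frames on the shared record_id index.

```python
import pandas as pd

# Toy frame indexed by record_id; values are made up.
toy = pd.DataFrame(
    {"sex_code": ["M", "F"], "race": ["4", "3"]},
    index=pd.Index(["r1", "r2"], name="record_id"),
)

# One dummified frame per categorical variable, as in the dictionary above.
toy_dummy_dict = {
    "sex_code": pd.get_dummies(toy["sex_code"], prefix="sex"),
    "race": pd.get_dummies(toy["race"], prefix="race"),
}

# Column-wise concatenation joins the frames on their common index.
combined = pd.concat(list(toy_dummy_dict.values()), axis=1)
print(combined)
```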

In [None]:
from sklearn import tree

In [None]:
dtc = tree.DecisionTreeClassifier()

In [None]:
dtc.fit(dummy_df_dictionary['first_payment_src'], y)

In [None]:
dtc.score(dummy_df_dictionary['first_payment_src'], y)

### Personal demographic predictors

Now, let's look at the personal demographic predictors and figure out how they may need to be transformed.

List of things that need to get fixed:
1. There are 1125 records where the sex_code for these patients is missing. Need to figure out how to compensate for this, e.g. randomly assign half to male and the other half to female, vs. keeping 'none' as a third value for this field, vs. dropping all the rows without a gender. I don't think dropping all the rows without a gender is a good idea, since 1125 out of 12704 records do not have a gender, and I also don't think gender is a hugely influential factor where not having a gender means that that record can't be analyzed.
2. Need to fix pat_age so that 22-26 gets recoded to the appropriate ages, and then you can also add an HIV/drug and alcohol use flag for those patients.
3. Need to figure out how many dummy variables to use for first_payment_src; can I assign 3-4 dummy variables for the major types of insurance, and then ignore everything else? (everybody else who doesn't have any of those gets 0s for all those variables as a default level)
4. Also need to figure out how to handle patient location data, as in which one would be the most useful; probably a combination of county and state? People who are out of state don't have a county location. Can again prob pick the top 10 most frequent counties as categorical variables; the alternative is to create ~200 dummy variables for each county.
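Items 3 and 4 both amount to keeping indicators for the k most frequent levels and letting every other level fall into an all-zero default. One possible sketch follows; `top_k_dummies` is a hypothetical helper and the payer codes are made up.

```python
import pandas as pd

def top_k_dummies(series, k, prefix):
    """Dummify only the k most frequent levels; all other rows get zeros."""
    top_levels = series.value_counts().nlargest(k).index
    # Levels outside the top k become NaN, which get_dummies skips by default.
    kept = series.where(series.isin(top_levels))
    return pd.get_dummies(kept, prefix=prefix)

payer = pd.Series(["MA", "MA", "MB", "16", "MA", "MB", "ZZ"])
payer_dummies = top_k_dummies(payer, k=2, prefix="first_pay")
print(payer_dummies)
```

The same helper could be applied to county codes with k=10, which avoids creating roughly 200 county columns.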

In [None]:
X[personal_demographic_predictors].head()

In [None]:
#X[personal_demographic_predictors][X[personal_demographic_predictors].isnull().any(axis=1)]

no_sex_code = X[personal_demographic_predictors][X['sex_code'].isnull()]
print(no_sex_code.shape)
no_sex_code.head()

In [None]:
# Okay, so actually the reason why there's no gender for these patients is because they are either HIV or alcohol
# use patients.
no_sex_code.pat_age.describe()

In [None]:
hiv_alc_dr = X[personal_demographic_predictors][X.pat_age >= 22]

In [None]:
alive_df.head(10)

In [None]:
z = master_2011_df[master_2011_df['record_id'] == '120110918756']
print(z[personal_demographic_predictors])
z['pat_status']

### Diagnosis Codes

Let's deal with the diagnosis codes first. We can form tuples out of each diagnosis code and whether or not it was present on arrival (POA is yes vs. any other value, e.g. no or none), then count how many times each tuple occurs across all records. We can then pick the 10 or 20 most frequent tuples and use them as dummy variables (present or absent for each record).

You could also look at 4 different categories of tuple/poa pairs. 1) most common diag code + present on arrival, and pick the chronic conditions, 2) most common diag codes + *not* present on arrival, looking at the things that happen while the patient is in the hospital, and then 3) & 4) pick the top 10 of each for patients who did survive vs patients who did not survive.
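The overall idea, including the final step of turning the most frequent tuples into present/absent flags, can be sketched on toy data. The records below are made up, and `flags` is a hypothetical name for the resulting per-record indicators.

```python
from collections import Counter

# Each record is a list of (diagnosis_code, poa) tuples; values are made up.
records = [
    [("4275", "Y"), ("51881", "N")],
    [("4275", "N"), ("5849", "N")],
    [("4275", "Y"), ("5849", "N"), ("", None)],
]

# Count every tuple that has a non-empty diagnosis code.
pair_counts = Counter(
    pair for record in records for pair in record if pair[0]
)
top_pairs = [pair for pair, _ in pair_counts.most_common(2)]

# One 0/1 flag per frequent tuple for each record.
flags = [
    {pair: int(pair in record) for pair in top_pairs}
    for record in records
]
print(flags)
```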

In [165]:
# Let's make tuple pairs for each diagnosis and whether or not it was present on arrival, for each
# diagnosis column. First we get the names of the diagnosis columns, and then the names of the poa columns.

diag_codes_cols = (X.filter(regex=r'(^princ_diag)|(^oth_diag)', axis=1)).columns
diag_codes_poa_cols = (X.filter(regex=r'(^poa_princ)|(^poa_oth)', axis=1)).columns

In [166]:
# We'll pair the diag columns with the poa columns and store them as tuples, and then also join the
# paired column names together to come up with the name for the new tuple column.
paired_diag_poa = list(zip(diag_codes_cols, diag_codes_poa_cols))
paired_diag_poa_names = [pair[0] + '_AND_' + pair[1] for pair in paired_diag_poa]

In [167]:
# We'll create a mini-dataframe with the first two columns of X, just so that we can have a frame to add our
# tuple diag/POA columns to.
diag_poa_frame = X[['discharge', 'thcic_id']].copy()

# Iterate through the paired diagnosis/POA column names with an index, and zip those two columns together into
# a new column with that name.
for i, paired_diag_poa_col in enumerate(paired_diag_poa_names):
    diag_poa_frame[paired_diag_poa_col] = list(X[list(paired_diag_poa[i])].itertuples(index=False, name=None))

# Now that we've created the diagnosis/POA tuple columns, we can drop the first two columns copied from X.
diag_poa_frame.drop(['discharge', 'thcic_id'], axis=1, inplace=True)

In [168]:
# Make a list of all the diagnosis tuples by making a list of all rows
diag_poa_lists = diag_poa_frame.apply(list, axis=1)

# We now add this list of all the diagnosis/POA tuples as a new column to X.
X['diag_poa_list'] = diag_poa_lists

In [169]:
# We can make a list of lists from the diag/poa tuples for each record.
diag_poas_list_of_lists = X['diag_poa_list'].tolist()

In [170]:
# Let's flatten out the list of lists, and also sort the diagnosis/poa tuples into lists based on whether or not
# they were present on arrival.

not_poa_diags = []
poa_diags = []

for sublist in diag_poas_list_of_lists:
    for item in sublist:
        if item[0]:
            if item[1] == 'Y':
                poa_diags.append(item)
            else:
                not_poa_diags.append(item)

In [174]:
poa_diags_counter = Counter(poa_diags)
not_poa_diags_counter = Counter(not_poa_diags)

N = 30

# We can look at the top N most common present on arrival diagnosis codes:
poa_diags_most_common_dict = dict(poa_diags_counter.most_common(N))

# And we can look at the top N most common *not* present on arrival diagnosis codes:
not_poa_diags_most_common_dict = dict(not_poa_diags_counter.most_common(N))

In [177]:
not_poa_diags_most_common_dict

{('4275', 'N'): 6075,
 ('51881', 'N'): 2349,
 ('V4986', 'N'): 1340,
 ('V667', None): 1224,
 ('5849', 'N'): 1134,
 ('V1582', None): 1122,
 ('2762', 'N'): 886,
 ('V4581', None): 857,
 ('0389', 'N'): 820,
 ('99592', 'N'): 788,
 ('3481', 'N'): 782,
 ('412', None): 780,
 ('2768', 'N'): 777,
 ('V5869', None): 731,
 ('4271', 'N'): 723,
 ('V4511', None): 714,
 ('V5866', None): 704,
 ('V1254', None): 694,
 ('5070', 'N'): 682,
 ('V4582', None): 664,
 ('4275', None): 651,
 ('2760', 'N'): 635,
 ('78552', 'N'): 627,
 ('V5867', None): 609,
 ('42741', 'N'): 600,
 ('42789', 'N'): 596,
 ('486', 'N'): 554,
 ('5180', 'N'): 547,
 ('78551', 'N'): 539,
 ('2851', 'N'): 521}

In [182]:
clinical_tables_query = "https://clinicaltables.nlm.nih.gov/api/icd9cm_dx/v3/search?terms={}"

# Look up the description of each present-on-arrival code. This dictionary is displayed in the
# next cell, so it has to be defined here.
poa_diag_most_common_text_dict = {}

for key, value in poa_diags_most_common_dict.items():
    response = requests.get(clinical_tables_query.format(key[0]))
    val = response.json()
    dx = val[3][0][1].strip()
    poa_diag_most_common_text_dict[dx] = value

# Do the same for the codes that were not present on arrival. The same code can appear with
# different POA values (e.g. 'N' and None), so counts are summed per description.
not_poa_diag_most_common_text_dict = {}

for key, value in not_poa_diags_most_common_dict.items():
    response = requests.get(clinical_tables_query.format(key[0]))
    val = response.json()
    dx = val[3][0][1].strip()
    if dx in not_poa_diag_most_common_text_dict:
        not_poa_diag_most_common_text_dict[dx] += value
    else:
        not_poa_diag_most_common_text_dict[dx] = value
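The expression `val[3][0][1]` raises an IndexError whenever a search returns no match. A defensive variant might look like the sketch below. `first_display_name` is a hypothetical helper, and the two payloads are hand-written stand-ins shaped like the responses the loop above indexes into, not real API output.

```python
def first_display_name(payload):
    """Return the description of the first hit, or None if there is none."""
    # Index 3 holds the display rows, as the notebook's val[3][0][1] assumes.
    display_rows = payload[3] if len(payload) > 3 else []
    if not display_rows or len(display_rows[0]) < 2:
        return None
    return display_rows[0][1].strip()

# Hypothetical payloads, written by hand for illustration.
hit = [1, ["427.5"], None, [["427.5", "Cardiac arrest "]]]
miss = [0, [], None, []]

print(first_display_name(hit))
print(first_display_name(miss))
```

Codes that come back as None can then be skipped or reported under their raw code instead of stopping the loop.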

In [183]:
display(poa_diag_most_common_text_dict)
display(not_poa_diag_most_common_text_dict)

{'Cardiac arrest': 5052,
 'Acute respiratory failure': 4747,
 'Unspecified essential hypertension': 3722,
 'Congestive heart failure, unspecified': 3525,
 'Coronary atherosclerosis of native coronary artery': 3087,
 'Other and unspecified hyperlipidemia': 2801,
 'Diabetes mellitus without mention of complication, type II or unspecified type, not stated as uncontrolled': 2661,
 'Acute kidney failure, unspecified': 2619,
 'Anoxic brain damage': 2540,
 'Acidosis': 2269,
 'Atrial fibrillation': 2115,
 'Unspecified septicemia': 1738,
 'Anemia, unspecified': 1652,
 'Severe sepsis': 1588,
 'Hypertensive chronic kidney disease, unspecified, with chronic kidney disease stage I through stage IV, or unspecified': 1507,
 'End stage renal disease': 1501,
 'Pneumonia, organism unspecified': 1453,
 'Chronic airway obstruction, not elsewhere classified': 1403,
 'Tobacco use disorder': 1401,
 'Unspecified acquired hypothyroidism': 1331,
 'Septic shock': 1273,
 'Hypertensive chronic kidney disease, unsp

{'Cardiac arrest': 6726,
 'Acute respiratory failure': 2349,
 'Do not resuscitate status': 1340,
 'Encounter for palliative care': 1224,
 'Acute kidney failure, unspecified': 1134,
 'Personal history of tobacco use': 1122,
 'Acidosis': 886,
 'Aortocoronary bypass status': 857,
 'Unspecified septicemia': 820,
 'Severe sepsis': 788,
 'Anoxic brain damage': 782,
 'Old myocardial infarction': 780,
 'Hypopotassemia': 777,
 'Long-term (current) use of other medications': 731,
 'Paroxysmal ventricular tachycardia': 723,
 'Renal dialysis status': 714,
 'Long-term (current) use of aspirin': 704,
 'Personal history of transient ischemic attack (TIA), and cerebral infarction without residual deficits': 694,
 'Pneumonitis due to inhalation of food or vomitus': 682,
 'Percutaneous transluminal coronary angioplasty status': 664,
 'Hyperosmolality and/or hypernatremia': 635,
 'Septic shock': 627,
 'Long-term (current) use of insulin': 609,
 'Ventricular fibrillation': 600,
 'Other specified cardiac d

### Procedure codes

We should also track the procedure codes that are the most common for patients. The process is similar to the one used for the diagnosis codes.

In [184]:
proc_codes_cols = (X.filter(regex=r'(icd9_code)', axis=1)).columns
proc_day_cols = (X.filter(regex=r'(proc_day)', axis=1)).columns

paired_proc_day = list(zip(proc_codes_cols, proc_day_cols))
paired_proc_day_names = [pair[0] + '_AND_day' for pair in paired_proc_day]

X[proc_day_cols] = X[proc_day_cols].apply(pd.to_numeric)

In [185]:
proc_day_frame = X[['discharge', 'thcic_id']].copy()

for i, paired_proc_day_col in enumerate(paired_proc_day_names):
    proc_day_frame[paired_proc_day_col] = list(X[list(paired_proc_day[i])].itertuples(index=False, name=None))

proc_day_frame.drop(['discharge', 'thcic_id'], axis=1, inplace=True)

In [186]:
# Make a list of all the procedure tuples by making a list of all rows
proc_day_lists = proc_day_frame.apply(list, axis=1)

# We now add this list of all the procedure/day tuples as a new column to X.
X['proc_day_list'] = proc_day_lists

### Get most common procedures

Steps to get the most common procedures:
1. Make a list of lists of all the procedures
2. Flatten list
3. Create counter object that gets the most common N procedures
4. Look up the value of these procedure codes in ICD 9
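Steps 1 to 3 can be sketched with the standard library alone. The procedure codes below are made up, and step 4 (the ICD-9 lookup) is left out because it needs a network call.

```python
from collections import Counter
from itertools import chain

# Step 1: one list of procedure codes per discharge (made-up values).
proc_lists_toy = [
    ["9604", "9671", None],
    ["9604", "9671", ""],
    ["9960", "9604"],
]

# Step 2: flatten, dropping empty or missing codes.
all_procs = [code for code in chain.from_iterable(proc_lists_toy) if code]

# Step 3: count and keep the N most common codes.
top_procs = Counter(all_procs).most_common(2)
print(top_procs)
```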

In [188]:
# Let's also make a list of all the procedure codes associated with a particular discharge.
proc_lists = X[proc_codes_cols].apply(list, axis=1)

# Remove duplicate codes within each discharge.
proc_lists = proc_lists.apply(set).apply(list)

#proc_lists = [list(filter(None, x)) for x in z]

# Let's add the list of procedures back to the original dataframe.
X['proc_list'] = proc_lists

all_proc_list = []

for sublist in proc_lists:
    for item in sublist:
        if item:
            all_proc_list.append(item)

all_proc_count = Counter(all_proc_list)

all_proc_dict = dict(all_proc_count.most_common(N))

clinical_tables_proc_query = "https://clinicaltables.nlm.nih.gov/api/icd9cm_sg/v3/search?terms={}"

all_proc_text_dict = {}

for key, value in all_proc_dict.items():
    response = requests.get(clinical_tables_proc_query.format(key))
    val = response.json()
    dx = val[3][0][1].strip()
    all_proc_text_dict[dx] = value

all_proc_text_dict

{'Insertion of endotracheal tube': 7356,
 'Continuous invasive mechanical ventilation for less than 96 consecutive hours': 6208,
 'Cardiopulmonary resuscitation, not otherwise specified': 4933,
 'Venous catheterization, not elsewhere classified': 4672,
 'Transfusion of packed cells': 3107,
 'Continuous invasive mechanical ventilation for 96 consecutive hours or more': 2996,
 'Arterial catheterization': 1945,
 'Hemodialysis': 1763,
 'Coronary arteriography using two catheters': 1746,
 'Left heart cardiac catheterization': 1599,
 'Angiocardiography of left heart structures': 1313,
 'Infusion of vasopressor agent': 1068,
 'Transfusion of other serum': 1041,
 'Central venous catheter placement with guidance': 1013,
 'Percutaneous transluminal coronary angioplasty [PTCA]': 967,
 'Venous catheterization for renal dialysis': 911,
 'Procedure on single vessel': 792,
 'Other electric countershock of heart': 756,
 'Temporary tracheostomy': 653,
 'Transfusion of platelets': 643,
 'Implant of puls

In [None]:
# We can make a list of lists from the proc/day tuples for each record.
proc_day_list_of_lists = X['proc_day_list'].tolist()

# Let's flatten out the list of lists, and then pick the most common N proc/day tuples that are present.

proc_day_pairs = []

for sublist in proc_day_list_of_lists:
    for item in sublist:
        if item[0]:
            proc_day_pairs.append(item)

proc_day_count = Counter(proc_day_pairs)

proc_day_dict = dict(proc_day_count.most_common(30))

proc_day_dict

clinical_tables_proc_query = "https://clinicaltables.nlm.nih.gov/api/icd9cm_sg/v3/search?terms={}"

proc_day_text_dict = {}

for key, value in proc_day_dict.items():
    response = requests.get(clinical_tables_proc_query.format(key[0]))
    val = response.json()
    dx = val[3][0][1].strip()
    proc_day_text_dict[dx] = value

proc_day_text_dict

### Scratch cells

In [None]:
list(X.columns)

The important predictors are likely to differ between in-hospital and out-of-hospital cardiac arrests. How should we account for this? One option is to split the two populations manually.
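A manual split could start from the admitting diagnosis, as a rough sketch. It assumes that an admitting diagnosis of '4275' (which the lookup output above resolves to 'Cardiac arrest') marks a presumed out-of-hospital arrest; transfers from other hospitals would still need to be checked through source_of_admission. The toy values are made up.

```python
import pandas as pd

# Assumption: '4275' is the cardiac arrest code, as in the lookup output above.
CARDIAC_ARREST_CODE = "4275"

toy = pd.DataFrame(
    {"admitting_diagnosis": ["4275", "486", "4275", "41071"]},
    index=pd.Index(["r1", "r2", "r3", "r4"], name="record_id"),
)

# Arrest as the admitting diagnosis -> presumed out-of-hospital arrest.
arrest_on_admission = toy["admitting_diagnosis"] == CARDIAC_ARREST_CODE
presumed_out_of_hospital = toy[arrest_on_admission]
presumed_in_hospital = toy[~arrest_on_admission]

print(len(presumed_out_of_hospital), len(presumed_in_hospital))
```

Models could then be fit separately on each subset to see whether the important predictors differ.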

The columns that I would be interested in as predictors:
* provider_name (also the same as thcic_id)
* type_of_admission (exclude 4 which is Newborn, and 9 which is unknown? prob okay to include 9)
* source_of_admission
* patient_state (can likely separate into in-state vs out of state patients)
* pat_zip (patient zip code)
* county (fips code)
* public_health_region
* sex_code
* race
* ethnicity
* admit_weekday (weekend vs weekday)
* length_of_stay
* pat_age (patient age on day of discharge)
* first_payment_src
* secondary_payment_src
* admitting_diagnosis (this would be how you would split the data between out-of-hospital vs in-hospital cardiac arrests)
* all of the other diagnoses codes (this is how you can figure out what are the comorbidities for these patients; you could use the most common ?chronic comorbidities as predictors, e.g. the presence or absence of them in any of the diagnoses codes; will have to do rearranging of this dataframe in order to come up with these 'flag' variables)
* e-code (again, you should probably look to see if there are external injury codes that come up particularly frequently, and then use those as flag variables)
* procedure codes (again, look at the procedures that are most frequently performed on these patients, use them as flags; these primarily serve as a surrogate marker for how severe the disease is, e.g. if a patient requires a cardiac cath then they're sick, if they require a CABG then they're probably sicker, etc.) --> if you have time to get really granular about it, there are fields that record the day on which a procedure occurred, you could try to tie that in somehow.
* MS-MDC (major diagnosis code; not sure if it will add more info than already provided in the diagnosis codes, but maybe?)
* MS-DRG (diagnosis-related group)
* risk_mortality (Assignment of a risk of mortality score from the All Patient Refined (APR) Diagnosis Related Group (DRG) from the 3M APR-DRG Grouper. Indicates the likelihood of dying.)
* illness_severity (Assignment of a severity of illness score from the All Patient Refined (APR) Diagnosis Related Group (DRG) from the 3M APR-DRG Grouper. Indicates the extent of physiologic decompensation.)
* attending_physician
* operating_physician

Target variable:
* pat_status -> for a binary classifier, you could separate it out into either expired (20, 40, 41, 42) or hospice (50, 51) vs alive at discharge. You could probably more usefully separate it into 3 classes: expired, discharged home, and discharged to some other facility (most likely a SNF or inpatient rehabilitation facility). You should probably pull all the rows where the patients experienced a cardiac arrest, and see what their patient status is.
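The binary version of this target can be sketched as a small mapping. It uses only the codes listed above, assumes pat_status values can be compared as strings, and `binary_status` is a hypothetical helper name.

```python
# Codes taken from the note above; anything else is treated as alive at discharge.
EXPIRED_CODES = {"20", "40", "41", "42"}
HOSPICE_CODES = {"50", "51"}

def binary_status(pat_status_code):
    """Collapse a pat_status code into 'expired' (incl. hospice) or 'alive'."""
    # Missing values should be handled before calling this function.
    code = str(pat_status_code).strip()
    if code in EXPIRED_CODES or code in HOSPICE_CODES:
        return "expired"
    return "alive"

print([binary_status(code) for code in ["20", "51", "01"]])
```

A three-class version would need the specific codes for discharge to home and to other facilities, which are not listed here.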


In [None]:
t_df['oth_diag_code_1'].value_counts()

In [None]:
test_patient = X[diag_codes_predictors].loc['120110924439']

In [None]:
test_patient

In [None]:
test_patient_dxes = test_patient.filter(regex=r'(^adm)|(^princ)|(^oth)', axis=0)

In [None]:
clinical_tables_query = "https://clinicaltables.nlm.nih.gov/api/icd9cm_dx/v3/search?terms={}"

for dx in test_patient_dxes:
    if dx:
        response = requests.get(clinical_tables_query.format(dx))
        val = response.json()
        print(val[3][0][1].strip())