To start, we'll look at each candidate variable on its own as a single predictor of 30-day readmission.

In [1]:
import pandas as pd
import numpy as np
import sklearn as skl
from sklearn import preprocessing as pp
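try:
    from sklearn.preprocessing import Imputer  # removed in scikit-learn 0.22
except ImportError:
    from sklearn.impute import SimpleImputer as Imputer  # replacement class; it takes no axis argument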
import statsmodels.discrete.discrete_model as sm
import math
import warnings 
warnings.filterwarnings("ignore") # Ignore annoying warnings

dataDir = './Data/'
mungedFileName = dataDir + 'mungedData.pkl'

cdf = pd.read_pickle(mungedFileName)

There are a few different variables in this dataset that include readmission data. READMISSION is deprecated, but still has data for older records. READMISSION1 is the current variable. No records have data for both, so it makes sense to merge them into a single regression target. Of the 19533 records, 4537 don't have readmission data recorded either way. The code below drops these records; the alternative of imputing the missing values is left commented out.

In [2]:
# y is True if any readmission variable is hot. It's NaN if all variables are null.
readmitCols = ['READMISSION1-Yes','READMISSION-Yes']
y = cdf[readmitCols].any(axis=1).astype(float)
y[cdf[readmitCols].isnull().all(axis=1)] = np.nan

# Drop rows with NaN y data, from both the target and the predictors
keepIdx = y.notnull().values
y = y.values[keepIdx]
cdf = cdf[keepIdx]

# For now we'll impute those missing values...
# imp = Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=False)
# imp.fit(y.reshape(-1,1))
# y = imp.transform(y.reshape(-1,1))
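
As a small illustration of the merge rule described above, here is a minimal pure-Python sketch on toy data. The function name `merge_readmission` and the sample records are hypothetical and not part of the notebook; `None` stands in for a missing (NaN) value.

```python
def merge_readmission(current, legacy):
    """Combine two readmission flags into one target value.

    Each argument is True, False, or None (None = not recorded).
    Returns None when neither flag is recorded, otherwise True if
    either recorded flag is True.
    """
    if current is None and legacy is None:
        return None
    return bool(current) or bool(legacy)

# Toy records as (READMISSION1, READMISSION) pairs; no record has both.
records = [(True, None), (None, False), (None, None), (False, None)]

targets = [merge_readmission(c, l) for c, l in records]

# Records with no readmission data either way are dropped, as in the cell above.
kept = [t for t in targets if t is not None]

print(targets)
print(kept)
```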

Now we need to build a predictor array to use for logistic regression, dropping our source data columns in the process. We also need to drop variables that are post-hoc predictors. For example, patients who undergo reoperation will of course be readmitted, so reoperation isn't a helpful predictor. Because the list of columns to drop gets long, the code below instead keeps a short list of columns.

In [3]:
# dropList = ['READMISSION-','READMISSION1-','REOPERATION-',\
#             'REOPERATION1-','NWNDINFD-','WNDINFD-','DEHIS-','NDEHIS-',\
#             'MORBPROB','NSUPINFEC-','SUPINFEC-','RETORPODAYS','OTHSYSEP',\
#             'NOTHSYSEP-']
# colsToDrop = [colName for colName in cdf.columns if np.any([dropItem in colName for dropItem in dropList])]
# cdf = cdf.drop(colsToDrop,1)
# print('Dropped some variables: ')
# print(colsToDrop)

# Dropping these columns is super-cumbersome. Let's find a list to keep
keepList = ['DISCHDEST-','URNINFEC-','DIABETES-','PRHCT', 'PRALBUM']
colsToKeep = [colName for colName in cdf.columns if np.any([keepItem in colName for keepItem in keepList])]
cdf = cdf[colsToKeep]

To start, let's just cycle through each variable independently, fitting a univariate logistic regression and recording the p-value (before multiple-comparisons correction) and coefficient for each. We also need to impute missing data (for the numeric columns; categoricals should be fine).

In [4]:
rows = []  # one dict per column that was fit successfully

for colName in cdf.columns:

    # Mean-impute missing values in this column
    col = cdf[colName].astype(float)
    X = col.fillna(col.mean()).values.reshape(-1,1)
    X = np.concatenate((X, np.ones([X.shape[0],1])),axis=1) # Add an intercept term

    try:
        logit = sm.Logit(y, X)
        smresult = logit.fit(disp=False)
        # print(smresult.summary())
        # Index 0 is the predictor; index 1 is the intercept
        rows.append({'colName': colName,
                     'pValue': smresult.pvalues[0],
                     'coeff': smresult.params[0]})

    except Exception:
        print('*** ' + colName + ': Exception ***')

fitResults = pd.DataFrame(rows, columns=['colName','pValue','coeff'])


*** DIABETES-ORAL: Exception ***
*** DISCHDEST-Expired: Exception ***


In [5]:
fitResults = fitResults.sort_values(by='pValue')
for index, row in fitResults.iterrows():
    print('\t %d \t %.4f \t %.2f \t %s' % (index, row['pValue'],row['coeff'], row['colName']))  

	 16 	 0.0000 	 -0.12 	 PRHCT
	 1 	 0.0000 	 -1.28 	 DIABETES-NO
	 0 	 0.0000 	 1.49 	 DIABETES-INSULIN
	 13 	 0.0000 	 2.59 	 NURNINFEC-1
	 4 	 0.0000 	 -1.26 	 DISCHDEST-Home
	 17 	 0.0000 	 2.57 	 URNINFEC-No Complication
	 12 	 0.0000 	 -2.57 	 NURNINFEC-0
	 15 	 0.0000 	 -1.25 	 PRALBUM
	 7 	 0.0000 	 1.23 	 DISCHDEST-Skilled Care, Not Home
	 5 	 0.0000 	 1.04 	 DISCHDEST-Rehab
	 2 	 0.0000 	 0.71 	 DIABETES-NON-INSULIN
	 3 	 0.0117 	 0.88 	 DISCHDEST-Facility Which was Home
	 6 	 0.0152 	 1.14 	 DISCHDEST-Separate Acute Care
	 8 	 0.2484 	 1.21 	 DISCHDEST-Unknown
	 10 	 0.6481 	 -0.00 	 DPRALBUM
	 11 	 0.6922 	 0.00 	 DPRHCT
	 9 	 0.9018 	 -0.13 	 DISCHDEST-Unskilled Facility Not Home
	 14 	 0.9992 	 -15.74 	 NURNINFEC-2
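
The p-values above are uncorrected. As a rough sketch of what a multiple-comparisons correction would do, the snippet below applies a Bonferroni adjustment to a few of the printed values. The function name `bonferroni` is hypothetical, the 0.05 threshold is an assumption, the test count of 18 is the number of fits that succeeded, and the four-decimal p-values are the rounded figures from the printout rather than exact values.

```python
def bonferroni(p_values, n_tests=None, alpha=0.05):
    """Return {name: (adjusted_p, is_significant)} under Bonferroni correction.

    p_values: dict mapping variable name to its uncorrected p-value.
    n_tests:  total number of tests performed (defaults to len(p_values)).
    """
    m = n_tests if n_tests is not None else len(p_values)
    out = {}
    for name, p in p_values.items():
        adjusted = min(1.0, p * m)  # adjusted p-values are capped at 1
        out[name] = (adjusted, adjusted < alpha)
    return out

# A subset of the uncorrected p-values reported above
p_values = {
    "PRHCT": 6.605492e-34,
    "DISCHDEST-Rehab": 1.243691e-07,
    "DISCHDEST-Facility Which was Home": 0.0117,
    "DISCHDEST-Separate Acute Care": 0.0152,
    "DISCHDEST-Unknown": 0.2484,
}

corrected = bonferroni(p_values, n_tests=18)
for name, (adjusted, significant) in corrected.items():
    print("%-36s adjusted p = %.3g  significant: %s" % (name, adjusted, significant))
```

Bonferroni is the most conservative common choice; a less strict procedure such as Holm or Benjamini-Hochberg could be substituted without changing the overall approach.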


In [6]:
fitResults

    colName                           pValue        coeff
16  PRHCT                             6.605492e-34  -0.118061
1   DIABETES-NO                       1.993851e-33  -1.275105
0   DIABETES-INSULIN                  9.580115e-33   1.494727
13  NURNINFEC-1                       5.206892e-32   2.585926
4   DISCHDEST-Home                    7.135653e-32  -1.264956
17  URNINFEC-No Complication          8.151353e-32   2.573587
12  NURNINFEC-0                       8.151353e-32  -2.573587
15  PRALBUM                           2.623961e-30  -1.249008
7   DISCHDEST-Skilled Care, Not Home  7.621395e-23   1.228765
5   DISCHDEST-Rehab                   1.243691e-07   1.038261
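
The coefficients above are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The snippet below is a minimal sketch using three coefficients copied from the table; the helper name `odds_ratio` is hypothetical. These are univariate, unadjusted estimates, so they describe association only.

```python
import math

# Coefficients copied from the fit results above (log-odds scale)
coeffs = {
    "DIABETES-INSULIN": 1.494727,
    "DIABETES-NO": -1.275105,
    "PRHCT": -0.118061,
}

def odds_ratio(coeff):
    """Odds ratio for a one-unit increase in the predictor."""
    return math.exp(coeff)

for name, beta in coeffs.items():
    print("%-18s coeff = %+.3f   odds ratio = %.2f" % (name, beta, odds_ratio(beta)))
```

For a 0/1 dummy column the odds ratio compares records in that category with all other records; for a continuous column such as PRHCT it is the multiplicative change in odds per unit increase.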
