# NSQIP Analysis - Readmission risk factors

Ok, first load the libraries and the processed data from disk:

In [9]:
import pandas as pd
import numpy as np
from sklearn import preprocessing as pp
import statsmodels.api as sm
import pylab as pl

dataDir = './Data/'
compiledFileName = dataDir + 'compiledData.pkl'

df = pd.read_pickle(compiledFileName)

Ok, we'll look for risk factors that predict 30-day readmission. The response variable to use here is READMISSION1. READMISSION is deprecated, but we should merge its data into our readmission variable. The strategy will be to use hierarchical logistic regression to predict these readmissions, using the statsmodels package.

Where we have missing response data we're going to need to drop those rows.
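
The merge-then-drop step described above could look roughly like the sketch below. It runs on a toy frame, not the real NSQIP data, and it assumes the deprecated `READMISSION` column uses the same 'Yes'/'No' coding as `READMISSION1`; that assumption should be checked against the data dictionary. The column name `READMIT_MERGED` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the compiled data (values are invented for illustration).
toy = pd.DataFrame({
    'READMISSION1': ['No', np.nan, 'Yes', np.nan],
    'READMISSION':  [np.nan, 'Yes', np.nan, np.nan],
})

# Prefer READMISSION1; fall back to the deprecated column where it is missing.
toy['READMIT_MERGED'] = toy['READMISSION1'].combine_first(toy['READMISSION'])

# Drop rows that still have no response after the merge.
toy_clean = toy.dropna(subset=['READMIT_MERGED'])
print('Dropped %d of %d rows.' % (len(toy) - len(toy_clean), len(toy)))
```

Using `combine_first` keeps the newer variable authoritative and only fills its gaps, which seems to match the intent stated above.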

In [24]:
df.iloc[:,9].unique()

array(['3-Severe Disturb', '2-Mild Disturb', '1-No Disturb',
       '4-Life Threat', nan, 'None assigned', '5-Moribund'], dtype=object)

In [8]:
n_null = df['READMISSION1'].isnull().sum()
print('Found: %d null rows.' % n_null)
# Drop rows with a missing response
df = df[~pd.isnull(df['READMISSION1'])]
print('We still have: %d rows left.' % len(df.index))

Found: 6408 null rows.
We still have: 13125 rows left.


In [3]:
# Show how many cases fall into each readmission category
print(df['READMISSION1'].value_counts())

# Encode the labels into a binary response variable y (No -> 0, Yes -> 1)
le_readmit = pp.LabelEncoder()
le_readmit.fit(df['READMISSION1'].unique().tolist())
print(le_readmit.classes_)
y = le_readmit.transform(df['READMISSION1'].tolist())

# Pick out predictors (copy so that adding columns does not touch df)
X = df.loc[:, ['AGE', 'WEIGHT']].copy()
X['READMIT'] = y

# Some fields have non-numeric values:
# X[~X.applymap(np.isreal).all(1)]
# AGE has some '90+' strings.
# WEIGHT has some NaNs.

# Replace 90+ with 90
print('Changing: %d listings of 90+ to 90.' % (X['AGE'] == '90+').sum())
X.loc[X['AGE'] == '90+', 'AGE'] = 90.0

# Drop rows with a missing value (in practice, missing weights)
missing = X.isnull().any(axis=1)
print('Dropping: %d rows with missing weights.' % missing.sum())
X = X[~missing]
print('Retained %d rows.' % len(X.index))

No     12774
Yes      351
Name: READMISSION1, dtype: int64
['No' 'Yes']
Changing: 91 listings of 90+ to 90.
Dropping: 425 rows with missing weights.
Retained 12700 rows.
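
One possible variation on the cleaning step, sketched on a toy frame: instead of assigning 90.0 into an object column, strip the '+' and convert the whole column with `pd.to_numeric`, so `AGE` ends up with a numeric dtype. The toy values are invented, and treating '90+' as exactly 90 is the same top-coding choice made above.

```python
import numpy as np
import pandas as pd

# Toy predictors with the same two problems: a '90+' string and a missing weight.
toy_X = pd.DataFrame({
    'AGE': [45, '90+', 67, 72],
    'WEIGHT': [180.0, 150.0, np.nan, 210.0],
})

# Strip the '+' so that '90+' becomes '90', then convert to a numeric dtype.
age_text = toy_X['AGE'].astype(str).str.replace('+', '', regex=False)
toy_X['AGE'] = pd.to_numeric(age_text)

# Drop rows with any missing predictor.
toy_X = toy_X.dropna()

print(toy_X.dtypes)
print('Retained %d rows.' % len(toy_X))
```

With a numeric `AGE` column, a later `.astype('float64')` cast before fitting would no longer be strictly necessary.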


In [105]:
# X[X['AGE'] == '90+']
# X[~X.applymap(np.isreal).all(1)]

Ok, now let's try some logistic regression.

In [4]:
X['intercept'] = 1.0
logit = sm.Logit(X['READMIT'], X.loc[:,['AGE','WEIGHT','intercept']].astype('float64'))
result = logit.fit()

Optimization terminated successfully.
         Current function value: 0.120358
         Iterations 8


In [5]:
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:                READMIT   No. Observations:                12700
Model:                          Logit   Df Residuals:                    12697
Method:                           MLE   Df Model:                            2
Date:                Sun, 04 Dec 2016   Pseudo R-squ.:                 0.02637
Time:                        17:39:08   Log-Likelihood:                -1528.5
converged:                       True   LL-Null:                       -1569.9
                                        LLR p-value:                 1.053e-18
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
AGE            0.0295      0.003      8.832      0.000         0.023     0.036
WEIGHT         0.0023      0.001      2.149      0.032         0.000     0.004
intercept     -5.6183      0.308    -18.223      0.0
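
As a sanity check on the summary, the fit statistics can be reproduced by hand from the two log-likelihoods. This is a small sketch using only the rounded numbers printed above, so the results agree with the table only up to rounding. statsmodels reports McFadden's pseudo R-squared, and the closed-form p-value used here holds only because the model has exactly 2 degrees of freedom.

```python
import math

# Values copied from the printed summary table.
ll_model = -1528.5   # Log-Likelihood
ll_null = -1569.9    # LL-Null
df_model = 2         # Df Model

# McFadden's pseudo R-squared: 1 - LL_model / LL_null.
pseudo_r2 = 1.0 - ll_model / ll_null

# Likelihood-ratio statistic; chi-squared with df_model degrees of freedom.
llr = 2.0 * (ll_model - ll_null)

# For exactly 2 degrees of freedom the chi-squared survival function is exp(-x/2).
llr_pvalue = math.exp(-llr / 2.0)

print('pseudo R^2 = %.5f' % pseudo_r2)
print('LLR = %.1f, p = %.3e' % (llr, llr_pvalue))
```

The tiny LLR p-value says the two predictors improve on the intercept-only model, while the pseudo R-squared of roughly 0.026 says they explain very little of the variation in readmission.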

In [5]:
np.exp(result.params)

AGE          1.029903
WEIGHT       1.002278
intercept    0.003631
dtype: float64

10.299
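
To make the exponentiated coefficients easier to read, here is a small sketch that turns them into an odds ratio per decade of age and a predicted probability for one hypothetical patient. The coefficients are the rounded values from the summary. The patient (age 70, weight 180) is invented, and `WEIGHT` is assumed to be in whatever unit the data set uses.

```python
import math

# Coefficients copied from the fitted model summary (log-odds scale).
b_age, b_weight, b_intercept = 0.0295, 0.0023, -5.6183

# Odds ratio for a 10-year increase in age, holding weight fixed.
or_age_10 = math.exp(10 * b_age)

def predicted_probability(age, weight):
    """Inverse logit of the linear predictor for one hypothetical patient."""
    eta = b_intercept + b_age * age + b_weight * weight
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical patient, for illustration only.
p = predicted_probability(age=70, weight=180)
print('OR per 10 years of age: %.2f' % or_age_10)
print('Predicted readmission probability: %.3f' % p)
```

Each extra year of age multiplies the odds of readmission by about 1.03, so a decade compounds to roughly a one-third increase in the odds.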

In [60]:
X = sm.add_constant(X, prepend=False)
model = sm.OLS(y, X)
#results = model.fit()


ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
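
A likely cause of this error is that `AGE` is still an object column: it originally held '90+' strings, and assigning 90.0 into it does not change its dtype. The sketch below reproduces that on a toy frame (values invented) using the `np.asarray` check the error message suggests, then casts to float in the same way as the earlier Logit cell.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking X: AGE starts out with a '90+' string, so it is object dtype
# and stays object dtype after the string is replaced by a number.
toy_X = pd.DataFrame({'AGE': [45, '90+', 67], 'WEIGHT': [180.0, 150.0, 200.0]})
toy_X.loc[toy_X['AGE'] == '90+', 'AGE'] = 90.0

# The check suggested by the error message: the whole array is object dtype.
print(np.asarray(toy_X).dtype)

# Casting to float gives the numeric array a regression routine expects.
design = toy_X[['AGE', 'WEIGHT']].astype('float64')
print(np.asarray(design).dtype)
```

Two further things would probably need attention before an OLS fit on the real data, assuming the cells ran in the order shown: `X` still contains the `READMIT` response column, and `y` was built before the rows with missing weights were dropped, so it has 13125 entries against 12700 rows in `X`. Passing `X['READMIT']` as the response, as in the Logit cell, avoids the length mismatch.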