### Predicting length of stay (LoS) from reason for admission, age, clinic, month and season of admission, and disease group of the main diagnosis (first letter of the ICD code)

Only patients who stay at least 1 day and fewer than 11 days are considered. Over all data, the mean LoS is 7 days and 75% of stays are shorter than 8 days.

As the data comes from a private source, it cannot be disclosed.

In [1]:
import pandas as pd
import numpy as np

In [13]:
data = pd.read_csv('data.csv')

In [14]:
data.drop_duplicates('record_id', keep='first', inplace=True)

In [22]:
from sklearn.preprocessing import LabelEncoder

le1, le2 = LabelEncoder(), LabelEncoder()

data['icd_group'] = le1.fit_transform(data['icd_group'].values)
data['aufnahmeanlass'] = le2.fit_transform(data['aufnahmeanlass'].values)
data.drop(data[data['duration'] == 0].index, inplace=True)
data.drop(data[data['alter_in_jahren_am_aufnahmetag'] == 0].index, inplace=True)

print(' we have {} records'.format(len(data)))

 we have 295618 records


In [120]:
#look up the integer code that the encoder assigned to icd group 'E'
le1.transform(['E'])

array([4], dtype=int32)
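
The lookup above goes from letter to integer code; the reverse direction can be sketched with `inverse_transform`. The toy letters below are made up for illustration and are not the real `icd_group` values. `LabelEncoder` assigns codes in sorted order of the labels, so the integers carry no clinical meaning.

```python
from sklearn.preprocessing import LabelEncoder

# Toy ICD chapter letters (illustrative only, not the private data)
letters = ['I', 'E', 'A', 'C', 'E', 'I']

le = LabelEncoder()
codes = le.fit_transform(letters)

# classes_ holds the sorted unique labels; the position is the integer code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# Going back from an integer code to the original letter
decoded = le.inverse_transform([mapping['E']])
print(decoded[0])
```

Keeping such a mapping next to the feature importances makes it easier to read results per disease group later on.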

In [24]:
#take only patients that stayed less than 11 days but at least 1 day
#(.copy() avoids pandas' SettingWithCopyWarning on the drop below)
data1 = data[(data['duration'] < 11)].copy()
duration1 = data1['duration'].values
data1 = data1.drop(['Unnamed: 0', 'duration'], axis=1)
print(' we have {} records with LoS between 1 and 11 days'.format(len(data1)))

 we have 240519 records with LoS between 1 and 11 days


In [17]:
#proportion of patients staying less than 11 days but at least 1 day
len(data1)/len(data)

0.8136141912874046

In [8]:
from sklearn.ensemble import RandomForestRegressor
seed = 7

from sklearn.model_selection import train_test_split 

train_x,test_x, train_y, test_y = train_test_split(data1.iloc[:,2:].values, duration1,
                                                   test_size=0.33, random_state=seed)
rf = RandomForestRegressor(max_depth=12, n_estimators=100, criterion='mse', n_jobs=-1, random_state=seed, verbose=1)

In [156]:
rf.fit(train_x, train_y)

[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   25.3s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   52.9s finished


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=12,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=-1, oob_score=False, random_state=7,
           verbose=1, warm_start=False)

In [157]:
from sklearn.metrics import mean_absolute_error

print(mean_absolute_error(train_y, rf.predict(train_x)))
print(mean_absolute_error(test_y, rf.predict(test_x)))

[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.8s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    1.7s finished


1.83518028317


[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.3s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    0.8s finished


1.8960053116


On the test set the mean absolute error is about 1.9 days (1.84 on the training set) for stays shorter than 11 days.
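
To judge whether an error of this size is good, it helps to compare it with a constant predictor. Below is a minimal sketch, on synthetic durations (the real data is private), of the MAE obtained by always predicting the training mean or median. The array `durations`, the split and the helper `constant_mae` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-in for LoS values between 1 and 10 days (not the real data)
durations = rng.integers(1, 11, size=10_000)

# Simple split: first two thirds for "training", the rest for "testing"
split = int(len(durations) * 0.67)
train_y, test_y = durations[:split], durations[split:]

def constant_mae(y_true, constant):
    """MAE when every prediction equals the same constant."""
    return float(np.mean(np.abs(y_true - constant)))

baseline_mean = constant_mae(test_y, train_y.mean())
baseline_median = constant_mae(test_y, np.median(train_y))

print('MAE predicting train mean:   {:.2f}'.format(baseline_mean))
print('MAE predicting train median: {:.2f}'.format(baseline_median))
```

The model is informative only to the extent that its test MAE falls clearly below such a baseline computed on the same data.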

In [158]:
list(zip(data1.columns[2:], rf.feature_importances_))

[('aufnahmeanlass', 0.022895805047159559),
 ('alter_in_jahren_am_aufnahmetag', 0.40511244902384641),
 ('clinic_id', 0.092023937932102201),
 ('aufnahmegrund', 0.16228628297978662),
 ('month', 0.060762018624316128),
 ('season', 0.016958308518769581),
 ('clinic_bin', 0.016240319454861107),
 ('icd_group', 0.22372087841915864)]

Age (0.41), disease group (0.22) and reason for admission (0.16) are the three most important features.

In [178]:
print('mean LoS < 11 days: {} days'.format(duration1.mean()))

mean LoS < 11 days: 4.20803761864967 days


### Now what if we specify the disease a bit further, by taking the 3-character ICD code of the main diagnosis

In [18]:
from config import db

query = '''
    SELECT DISTINCT record_id, icd_three
    FROM icd_records 
    WHERE diagnoseart = 'HD' 
    
    '''

df = pd.read_sql(query, con=db)


In [25]:
data3 = pd.merge(data, df, how='inner', on='record_id')
le3 = LabelEncoder()

data3['icd_three'] = le3.fit_transform(data3['icd_three'].values)

data3 = data3[(data3['duration'] < 11)] 
duration3 = data3.duration.values
data3.drop(['icd_group', 'Unnamed: 0', 'duration'], axis=1, inplace=True)
print(' we have {} records with LoS between 1 and 11 days'.format(len(data3)))

 we have 248720 records with LoS between 1 and 11 days
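
The filtered set here has 248720 rows, more than the 240519 obtained before the merge, which suggests that some `record_id` values match more than one row of `df` (for example, several main-diagnosis codes per record). Below is a minimal sketch of how such duplicates could be detected, using toy frames whose values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for `data` and `df` (values invented for illustration)
records = pd.DataFrame({'record_id': [1, 2, 3], 'duration': [4, 2, 7]})
diagnoses = pd.DataFrame({'record_id': [1, 2, 2, 3],
                          'icd_three': ['I10', 'E11', 'I50', 'J18']})

# Keys that occur more than once on the diagnosis side
dup_ids = diagnoses.loc[diagnoses['record_id'].duplicated(keep=False),
                        'record_id'].unique().tolist()
print('record_ids with several diagnoses:', dup_ids)

# A plain inner merge silently repeats record 2
merged = pd.merge(records, diagnoses, how='inner', on='record_id')
print(len(records), '->', len(merged))

# validate= makes pandas raise instead of duplicating rows
try:
    pd.merge(records, diagnoses, how='inner', on='record_id',
             validate='one_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print('one_to_one check failed:', raised)
```

If duplicates turn up in the real data, the same stay would be counted more than once and could land in both the training and the test split.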


In [18]:
train_x3,test_x3, train_y3, test_y3 = train_test_split(data3.iloc[:,1:].values, duration3,
                                                   test_size=0.33, random_state=seed)
rf = RandomForestRegressor(max_depth=12, n_estimators=100, criterion='mse', n_jobs=-1, random_state=seed, verbose=1)


In [19]:
rf.fit(train_x3, train_y3)

[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   26.7s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   59.6s finished


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=12,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=-1, oob_score=False, random_state=7,
           verbose=1, warm_start=False)

In [20]:
print(mean_absolute_error(train_y3, rf.predict(train_x3)))
print(mean_absolute_error(test_y3, rf.predict(test_x3)))

[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.7s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    1.6s finished


1.6786488606


[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.4s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    0.9s finished


1.75920380522


The test error drops only slightly, from 1.90 to 1.76 days, while the mean LoS stays at about 4.2 days.

In [21]:
duration3.mean()

4.2104253779350271

In [23]:
list(zip(data3.columns[1:], rf.feature_importances_))

[('aufnahmeanlass', 0.015394602417021643),
 ('alter_in_jahren_am_aufnahmetag', 0.27246264748035098),
 ('clinic_id', 0.081939274305180471),
 ('aufnahmegrund', 0.11460884618733401),
 ('month', 0.035708046591666681),
 ('season', 0.011074636977076476),
 ('clinic_bin', 0.0065808520902133931),
 ('icd_three', 0.46223109395115647)]

About twice as much weight is put on icd_three (0.46) as was put on icd_group (0.22), suggesting that the finer code is indeed more discriminative.
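
Impurity-based importances from a random forest are known to favour features with many distinct values, and icd_three has far more levels than icd_group, so part of the shift may come from cardinality alone. One way to cross-check is permutation importance on held-out data. Below is a minimal sketch on synthetic data; the feature names and the data-generating rule are assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 2000

# Synthetic features: one informative, one high-cardinality pure noise
informative = rng.integers(0, 5, size=n)
noise_many_levels = rng.integers(0, 500, size=n)
X = np.column_stack([informative, noise_many_levels])
y = 2.0 * informative + rng.normal(0, 0.5, size=n)

train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.33, random_state=7)

rf = RandomForestRegressor(n_estimators=50, max_depth=12, random_state=7)
rf.fit(train_x, train_y)

# Shuffle each column of the test set and measure the increase in MAE
result = permutation_importance(
    rf, test_x, test_y, scoring='neg_mean_absolute_error',
    n_repeats=5, random_state=7)

names = ['informative', 'noise_many_levels']
for name, imp in zip(names, result.importances_mean):
    print('{:>18}: {:.3f}'.format(name, imp))
```

Because the score is computed on the test split, a feature the forest merely memorised during training should receive an importance close to zero here.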

### Next, predicting LoS from exact ICD codes within category I (diseases of the circulatory system)

In [20]:

query = '''
    SELECT DISTINCT record_id, icd_kode
    FROM icd_records 
    WHERE diagnoseart = 'HD' AND icd_group = 'I'
    
    '''

data_i = pd.read_sql(query, con=db)

In [26]:
data_heart = pd.merge(data, data_i, how='inner', on='record_id')
le4 = LabelEncoder()

data_heart['icd_kode'] = le4.fit_transform(data_heart['icd_kode'].values)
data_heart = data_heart[(data_heart['duration'] < 11)]
duration4 = data_heart.duration.values
data_heart.drop(['icd_group', 'Unnamed: 0', 'duration'], axis=1, inplace=True)
print(' we have {} records with LoS between 1 and 11 days and icd_group = "I"'.format(len(data_heart)))

 we have 42900 records with LoS between 1 and 11 days and icd_group = "I"


In [19]:
train_x4,test_x4, train_y4, test_y4 = train_test_split(data_heart.iloc[:,1:].values, duration4,
                                                   test_size=0.33, random_state=seed)
rf = RandomForestRegressor(max_depth=12, n_estimators=100, criterion='mse', n_jobs=-1, random_state=seed, verbose=1)

In [20]:
rf.fit(train_x4, train_y4)

from sklearn.metrics import mean_absolute_error

print(mean_absolute_error(train_y4, rf.predict(train_x4)))
print(mean_absolute_error(test_y4, rf.predict(test_x4)))

[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    5.3s finished
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.1s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    0.2s finished


1.60084030663
1.84931063976


[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.0s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    0.1s finished


The test error is again close to 2 days (1.85), with a larger gap to the training error (1.60).

In [21]:
duration4.mean()

4.599254079254079

The mean LoS in this subgroup is 4.6 days.

In [22]:
list(zip(data_heart.columns[1:], rf.feature_importances_))

[('aufnahmeanlass', 0.020927654141984373),
 ('alter_in_jahren_am_aufnahmetag', 0.18358119032521056),
 ('clinic_id', 0.078151177703175242),
 ('aufnahmegrund', 0.20069638324999939),
 ('month', 0.077888155574361823),
 ('season', 0.023865166716885784),
 ('clinic_bin', 0.010990796524561046),
 ('icd_kode', 0.40389947576382168)]

## Conclusions
Predicting the LoS of patients from data available at the beginning of their stay proved possible for stays shorter than 11 days, which make up the majority of cases (about 81%). The prediction error of about 2 days is similar when using all data and when using a specific subgroup of diseases (ICD group I).