
In [1]:
import pandas as pd
import numpy as np
# For interacting with PostgreSQL database for mimic queries
import psycopg2

# Plotting functions
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.figure_factory as ff

from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

plotly.tools.set_credentials_file(username='mlpaff', api_key='lYV8hhGxZlP988tplymj')
plotly.tools.set_config_file(world_readable=True,
                             sharing='public')

from IPython.core.pylabtools import figsize
import matplotlib.pyplot as plt
figsize(20, 10)
plt.style.use(['dark_background'])

from sklearn.model_selection import train_test_split 

from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
from collections import Counter

from sklearn.preprocessing import MinMaxScaler

color_set = ['#A9CA59', '#6582C4', '#62C9BC', '#F58D50', '#2AD7F4',
             '#AB3EED', '#FF6CB2', '#FFA466', '#FFE256', '#47EAAC', '#2AD7F4', '#3C8CF9']

# specify user and database for SQL queries
sqluser = 'mattmimic'
dbname = 'mimic'
set_schema = '--search_path=mimiciii'

# Connect to the database
# con = psycopg2.connect(dbname = dbname, user = sqluser, options = set_schema)


# Read in preprocessed patient data

In [2]:
adm_processed = pd.read_csv('../admission_processed.csv', parse_dates=['admittime', 'dischtime', 'deathtime', 'edregtime', 'edouttime', 'next_admittime', 'dob'], date_parser=pd.to_datetime)

In [3]:
# Define the structured features that will be included in the model
feature_set_1 = ['admission_type', 'total_prior_admits','gender', 'age', 'length_of_stay', 'num_medications', 'num_lab_tests', 'perc_tests_abnormal', 'num_diagnosis']

In [4]:
# Filter only those features that we want and add in hadm_id (to merge notes on)
adm_processed = adm_processed[['hadm_id', 'subject_id', 'days_next_admit'] + feature_set_1]

## Encode categorical features and scale numerical values

In [5]:
# Defining dictionaries for encoding
admin_type_dict = {'EMERGENCY': 0, 'URGENT': 0, 'ELECTIVE': 1}
gender_dict = {'M': 0, 'F': 1}

# Mapping dictionaries to binary features
adm_processed['admission_type'] = adm_processed['admission_type'].map(admin_type_dict).astype(int)
adm_processed['gender'] = adm_processed['gender'].map(gender_dict).astype(int)

# Scale numerical features
scaler = MinMaxScaler()
standard_cols = ['total_prior_admits', 'age', 'length_of_stay', 'num_medications', 'num_lab_tests', 'num_diagnosis']

# Normalizing the numeric columns
adm_processed[standard_cols] = scaler.fit_transform(adm_processed[standard_cols])


Data with input dtype int64, float64 were all converted to float64 by MinMaxScaler.
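One caveat on the cell above: the scaler is fit on every admission before the train/test split further down, so test rows contribute to each column's minimum and maximum. A minimal NumPy sketch of the alternative, learning the range from training rows only and reusing it for test rows. The helper names and toy values are made up for illustration.

```python
import numpy as np

def fit_min_max(train):
    """Return per-column (min, range) computed from training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # Guard against constant columns to avoid division by zero
    col_range[col_range == 0] = 1.0
    return col_min, col_range

def apply_min_max(values, col_min, col_range):
    """Scale values with statistics learned from the training rows."""
    return (values - col_min) / col_range

# Hypothetical toy data: columns stand in for age and length_of_stay
train = np.array([[20.0, 1.0], [60.0, 5.0], [40.0, 3.0]])
test = np.array([[80.0, 2.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
```

Test values can fall outside [0, 1] under this scheme, which is expected. With scikit-learn the same pattern is `scaler.fit(X_train)` followed by `scaler.transform(X_test)`.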



# Read in NOTEEVENTS table

In [6]:
# %%time
con = psycopg2.connect(dbname = dbname, user = sqluser, options = set_schema)
query = 'SELECT * FROM noteevents;'
notes = pd.read_sql_query(query, con)
con.close()
notes.head()

Unnamed: 0,row_id,subject_id,hadm_id,chartdate,charttime,storetime,category,description,cgid,iserror,text
0,804333,5289,194762.0,2110-11-05,2110-11-05 06:52:00,NaT,Radiology,CHEST (PORTABLE AP),,,[**2110-11-5**] 6:52 AM\n CHEST (PORTABLE AP) ...
1,804334,13993,180704.0,2103-11-07,2103-11-07 06:53:00,NaT,Radiology,CHEST (PORTABLE AP),,,[**2103-11-7**] 6:53 AM\n CHEST (PORTABLE AP) ...
2,804467,4599,109574.0,2120-10-31,2120-10-31 12:37:00,NaT,Radiology,CT PERITONEAL DRAINAGE,,,[**2120-10-31**] 12:37 PM\n CT PERITONEAL DRAI...
3,804108,9090,,2180-09-25,2180-09-25 08:20:00,NaT,Radiology,UGI SGL CONTRAST W/ KUB,,,[**2180-9-25**] 8:20 AM\n UGI SGL CONTRAST W/ ...
4,804109,19621,102739.0,2193-09-23,2193-09-23 00:00:00,NaT,Radiology,PERSANTINE MIBI,,,PERSANTINE MIBI ...


In [7]:
# Filter to discharge summary notes only
dis_notes = notes[notes['category'] == 'Discharge summary'].copy()

assert dis_notes.duplicated(['hadm_id']).sum() == 0, 'Multiple discharge summaries per admission'

AssertionError: Multiple discharge summaries per admission

In [8]:
# Grab just the last discharge summaries by hadm_id
last_dis_notes = dis_notes.groupby(['subject_id', 'hadm_id']).nth(-1).reset_index()
assert last_dis_notes.duplicated(['hadm_id']).sum() == 0, 'Multiple discharge summaries per admission'
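An equivalent way to keep one summary per admission, sketched on a made-up frame: sort by chart date, then drop duplicates on `hadm_id` and keep the last row. The explicit sort makes "last" mean "most recent" rather than "last in query order", which is an assumption about what the cell above intends. Rows without an `hadm_id` are dropped explicitly here; `groupby` also leaves out rows with a missing key by default.

```python
import pandas as pd

# Hypothetical toy data standing in for the discharge summaries
toy_notes = pd.DataFrame({
    'subject_id': [1, 1, 2, 3],
    'hadm_id': [100.0, 100.0, 200.0, None],
    'chartdate': pd.to_datetime(['2150-01-01', '2150-01-03', '2151-05-05', '2152-07-07']),
    'text': ['summary', 'addendum', 'summary', 'orphan note'],
})

latest = (
    toy_notes
    .dropna(subset=['hadm_id'])              # notes without an admission cannot be merged
    .sort_values(['hadm_id', 'chartdate'])   # oldest first within each admission
    .drop_duplicates('hadm_id', keep='last') # keep the most recent note
    .reset_index(drop=True)
)
```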

In [9]:
last_dis_notes.head()
note_features = ['subject_id', 'hadm_id', 'text']

# Merge notes table with adm_processed table

In [51]:
# Merge notes table with processed structural data
adm_notes = adm_processed.merge(last_dis_notes[note_features], on = ['subject_id', 'hadm_id'], how = 'left')

# assert len(admissions) == len(adm_notes), 'Number of rows increased'

# Generate output label for readmissions under 30 days
adm_notes['output_label'] = (adm_notes['days_next_admit'] < 30).astype('int')

In [52]:
adm_notes.head()

Unnamed: 0,hadm_id,subject_id,days_next_admit,admission_type,total_prior_admits,gender,age,length_of_stay,num_medications,num_lab_tests,perc_tests_abnormal,num_diagnosis,text,output_label
0,185777,4,,0,0.0,1,0.523442,0.026332,0.043219,0.017723,0.240816,0.230769,Admission Date: [**2191-3-16**] Discharge...,0
1,107064,6,,1,0.0,1,0.721418,0.055537,0.109538,0.041645,0.448517,0.205128,Admission Date: [**2175-5-30**] Dischar...,0
2,194540,11,,0,0.0,1,0.548635,0.086639,0.067064,0.032091,0.104072,0.025641,Admission Date: [**2178-4-16**] ...,0
3,143045,13,,0,0.0,1,0.436122,0.023266,0.061848,0.026037,0.29805,0.128205,"Name: [**Known lastname 9900**], [**Known fir...",0
4,194023,17,128.920833,1,0.0,1,0.519159,0.014824,0.040238,0.013857,0.296875,0.102564,Admission Date: [**2134-12-27**] ...,0


In [11]:
print('Fraction of admissions without notes:', round(adm_notes.text.isnull().sum() / len(adm_notes), 4))
print('number of patients that were re-admitted within 30 days:', len(adm_notes[adm_notes['output_label'] == 1]))
print('fraction of patients re-admitted within 30 days:', len(adm_notes[adm_notes['output_label'] == 1]) / len(adm_notes))

Fraction of admissions without notes: 0.0232
number of patients that were re-admitted within 30 days: 2779
fraction of patients re-admitted within 30 days: 0.0672441745106105


# Create training and test dataframes

In [12]:
# shuffle the samples
adm_notes = adm_notes.sample(n = len(adm_notes), random_state=42)
adm_notes.reset_index(drop=True, inplace=True)

target = adm_notes[['output_label']]
data = adm_notes[feature_set_1 + ['text']]

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state = 0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(33061, 10) (8266, 10) (33061, 1) (8266, 1)


# Preprocess text data

In [13]:
def preprocess_text(df):
    ''' Fill missing notes with a blank string and replace newlines ('\n')
    and carriage returns ('\r') with spaces.
    '''
    # Work on a copy so the frame passed in is not modified in place
    df = df.copy()
    df['text'] = df['text'].fillna(' ')
    df['text'] = df['text'].str.replace('\n', ' ')
    df['text'] = df['text'].str.replace('\r', ' ')
    return df

In [14]:
X_test, X_train = preprocess_text(X_test), preprocess_text(X_train)
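The two replace calls can also be folded into one regular expression that collapses any run of whitespace. A small standard-library sketch; `normalize_whitespace` is a hypothetical helper, not part of the notebook.

```python
import re

_WHITESPACE = re.compile(r'\s+')

def normalize_whitespace(text):
    """Replace newlines, carriage returns, tabs and repeated spaces with one space."""
    if text is None:
        # Mirror the blank-string fill used for missing notes
        return ' '
    return _WHITESPACE.sub(' ', text)

sample = 'Admission Date:\n\n  [**2191-3-16**]\r\nDischarge Date: ...'
cleaned = normalize_whitespace(sample)
```

On a pandas column the same pattern would be `df['text'].str.replace(r'\s+', ' ', regex=True)`.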

# Word2Vec processing

In [15]:
from gensim.models import Word2Vec
from gensim.models import word2vec
import re
import nltk
import string
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

## Prepare text data for W2V modeling
- Here we convert everything to lowercase and split each note into a list of tokenized sentences, dropping stop words

In [16]:
my_stop_words = ['the','and','to','of','was','with','a','on','in','for','name',
                 'is','patient','s','he','at','as','or','one','she','his','her','am',
                 'were','you','pt','pm','by','be','had','your','this','date',
                'from','there','an','that','p','are','have','has','h','but','o',
                'namepattern','which','every','also', 'b', 'i', 'd', 'admission', 'q', 't']

In [17]:
def w2vTokenizer(sentence):
    ''' Tokenize the text by replacing punctuations and numbers with spaces and lowercase all words
    '''
    punc_list = string.punctuation + '0123456789'
    t = str.maketrans(dict.fromkeys(punc_list, ' '))
    text = str(sentence).lower().translate(t)
    tokens = [x for x in nltk.word_tokenize(text.strip()) if x not in my_stop_words]
#     tokens = word_tokenize(text)
    return tokens

def notes_to_sentences(notes, tokenizer, remove_stopwords=False):
    ''' Split text data into tokenized list of sentences
    '''
    try:
        # use NLTK tokenizer to split the text into sentences
        raw_sentences = tokenizer.tokenize(notes)
        
        # Loop over each sentence
        sentences = []
        for sent in raw_sentences:
            # if sentence is empty, skip it
            if len(sent) > 0:
                tokens = [x for x in w2vTokenizer(sent.strip()) if x not in my_stop_words]
                if len(tokens) > 0:
                    sentences.append(tokens)
        # Return the list of sentences
        return sentences
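    except Exception as err:
        # Return an empty list so the caller can keep concatenating sentences
        print('Could not tokenize note:', err)
        return []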
        
def prepareW2Vtext(notes_list):
    ''' Build the Word2Vec training corpus: split every note into sentences and tokenize each sentence
    '''
    
    sentences = []
    for note in notes_list:
        note = str(note)
        if len(note) > 0:
            sentences += notes_to_sentences(note, tokenizer)
    return(sentences)
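To make the tokenization step concrete, here is a dependency-free sketch of the same idea: map punctuation and digits to spaces, lowercase, split, and drop stop words. It uses `str.split` in place of `nltk.word_tokenize` and a shortened stop-word list, so its output can differ from the notebook's functions.

```python
import string

# Shortened, illustrative stop-word list (the notebook's list is longer)
STOP_WORDS = {'the', 'and', 'to', 'of', 'was', 'with', 'a', 'on', 'in', 'for', 'patient'}

# Translation table that turns every punctuation mark and digit into a space
_TABLE = str.maketrans(dict.fromkeys(string.punctuation + string.digits, ' '))

def simple_tokens(sentence):
    """Lowercase, strip punctuation and digits, split on whitespace, drop stop words."""
    words = str(sentence).lower().translate(_TABLE).split()
    return [w for w in words if w not in STOP_WORDS]

tokens = simple_tokens('The patient was admitted on 3/16 with chest pain.')
```

Storing the stop words in a set makes each membership test constant-time, whereas a list is scanned for every token.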

In [18]:
notes_list = list(X_train['text'])
processed_text = prepareW2Vtext(notes_list)

In [19]:
print(len(processed_text))

3354537


## Train Word2Vec model

In [20]:
num_features = 400      # Word vector dimensionality
min_word_count = 50     # min word count
num_workers = 4         # number of threads to run in parallel
context = 4             # Context window size
downsampling = 1e-3     # Downsample setting for frequent words

In [21]:
w2vModel = Word2Vec(processed_text, workers=num_workers, \
                          size=num_features, min_count=min_word_count, \
                          window=context, sample=downsampling)

In [96]:
# w2v = dict(zip(w2vModel.wv.index2word, w2vModel.wv.vectors))

w2vModel.wv.save_word2vec_format('mimic_w2v_model.bin')

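# Load the saved vectors (written in word2vec text format, since binary=False is the default)
# from gensim.models import KeyedVectors
# wv = KeyedVectors.load_word2vec_format('mimic_w2v_model.bin')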

# Vectorize clinic notes
- Using the Word2Vec model trained on the clinic notes corpus, vectorize each patient's discharge summary notes

In [23]:
def tokenize_clinic_notes(note):
    ''' Tokenize the patient text by replacing punctuations and numbers with spaces and lowercase all words
    '''
    punc_list = string.punctuation + '0123456789'
    t = str.maketrans(dict.fromkeys(punc_list, ' '))
    text = str(note).lower().translate(t)
#     tokens = (x for x in word_tokenize(text.strip()) if x not in my_stop_words)
    tokens = nltk.word_tokenize(text)
    return tokens

In [24]:
class MyTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        transformed_X = []
        for document in X:
            doc = [word for word in document if word in self.vocab]
            transformed_X.append(doc)
        return transformed_X
    
    def fit_transform(self, X, y=None):
        return self.transform(X)
            

class MeanEmbeddingVectorizer(object):
    ''' Convert notes to vector
    '''
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(word2vec.wv.vectors[0])

    def fit(self, X, y=None):
        return self

    def transform(self, X):
#         doc = [word for word in X if word in self.word2vec.wv.vocab]
        X = MyTokenizer(self.word2vec.wv.vocab).fit_transform(X)
    
        return np.array([
                    np.mean([self.word2vec.wv[w] for w in document] or 
                            [np.zeros(self.dim)], axis = 0) for document in X
        ])
#         return np.mean(self.word2vec[doc], axis = 0)
    
    def fit_transform(self, X, y=None):
        return self.transform(X)
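The core of `MeanEmbeddingVectorizer` is an average over the vectors of in-vocabulary tokens, with a zero vector as the fallback for notes that have none. A stripped-down NumPy sketch of that step, using a plain dictionary of made-up 3-dimensional vectors in place of the trained model:

```python
import numpy as np

def mean_embedding(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; all-zero vector if none match."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical 3-dimensional vectors standing in for a trained model
toy_vectors = {
    'chest': np.array([1.0, 0.0, 0.0]),
    'pain': np.array([0.0, 1.0, 0.0]),
}

doc_vec = mean_embedding(['chest', 'pain', 'unknownword'], toy_vectors, dim=3)
empty_vec = mean_embedding([], toy_vectors, dim=3)
```

Out-of-vocabulary tokens are ignored rather than counted as zeros, so they do not pull the average toward the origin.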

In [25]:
# Tokenize the clinic notes so that they can be vectorized: Do this for X_train and X_test
X_train['tokens'], X_test['tokens'] = X_train['text'].apply(lambda x: tokenize_clinic_notes(x)), X_test['text'].apply(lambda x: tokenize_clinic_notes(x))

## Vectorize notes and store as text data frame

In [26]:
w2vVectorizer = MeanEmbeddingVectorizer(w2vModel)

# Get the transformed, vectorized text data
X_train_vectors, X_test_vectors = pd.DataFrame(w2vVectorizer.fit_transform(X_train['tokens'])), pd.DataFrame(w2vVectorizer.fit_transform(X_test['tokens']))

In [27]:
# X_train_vectors[X_train_vectors.columns[-400:]].shape

## Append vectorized notes to train and test X dataframes

In [28]:
X_train_tf, X_test_tf = pd.concat([X_train.reset_index(drop=True), X_train_vectors], axis = 1),  pd.concat([X_test.reset_index(drop=True), X_test_vectors], axis = 1)

# Drop text data
X_train_tf.drop(['text', 'tokens'], axis = 1, inplace=True)
X_test_tf.drop(['text', 'tokens'], axis = 1, inplace=True)

# Check that the number of rows has not changed
assert X_train_tf.shape[0] == X_train.shape[0], 'Train data frame shape has changed'
assert X_test_tf.shape[0] == X_test.shape[0], 'Test data frame shape has changed'

In [41]:
X_train_tf.shape

(33061, 409)

# SMOTE Balancing
- Rebalance the training data (SMOTE oversampling, optionally combined with ENN undersampling) before model training

In [53]:
print(X_train_tf.shape)
print(y_train.shape)
print('Original train dataset shape {}'.format(Counter(y_train['output_label'])))
print('Original test dataset shape {}'.format(Counter(y_test['output_label'])))

(33061, 409)
(33061, 1)
Original train dataset shape Counter({0: 30832, 1: 2229})
Original test dataset shape Counter({0: 7716, 1: 550})


In [30]:
print('Original train dataset shape {}'.format(Counter(y_train['output_label'])))
print('Original test dataset shape {}'.format(Counter(y_test['output_label'])))


def balancing(X, Y, undersample = None):
    # Oversampling with SMOTE
    smt = SMOTE(random_state=20)
    if undersample:
        smt = SMOTEENN(random_state=20)

    X_new, Y_new = smt.fit_resample(X, Y)
    print('New train dataset shape {}'.format(Counter(Y_new)))
    X_new = pd.DataFrame(X_new, columns = list(X.columns))
    return X_new, Y_new

X_train_balanced, y_train_balanced = balancing(X_train_tf, y_train['output_label'], undersample=True)

Original train dataset shape Counter({'output_label': 1})
Original test dataset shape Counter({'output_label': 1})
New train dataset shape Counter({1: 30741, 0: 12698})
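The first two output lines above show a column name instead of class counts. That is what `Counter` returns when it is handed a DataFrame, because iterating a DataFrame yields its column labels; presumably the cell was last run with `y_train` rather than `y_train['output_label']`. A toy reproduction (the frame below is made up):

```python
from collections import Counter
import pandas as pd

toy_y = pd.DataFrame({'output_label': [0, 0, 1, 0]})

# Iterating a DataFrame yields column labels, so this counts column names
by_frame = Counter(toy_y)

# Iterating a Series yields its values, so this counts classes
by_series = Counter(toy_y['output_label'])
```

`toy_y['output_label'].value_counts()` gives the same class counts without leaving pandas.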


In [123]:
# X_train_balanced.head()

# Modeling
- Train 3 models:
    1. Using only structured data
    2. Using only notes data
    3. Using all data features

In [154]:
# Model building
from scipy import interp
from scipy.stats import randint as sp_randint
from sklearn import svm
from sklearn.metrics import roc_curve, auc, roc_auc_score, confusion_matrix, classification_report
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def DTCGrid(X_train, X_test, Y_train, Y_test, model):
    if model == 'lr':
        pipeline = Pipeline([('clf',LogisticRegression(penalty = 'l2', max_iter=1000, random_state = 20000, solver='lbfgs'))])
        param_dist = {'clf__C': [0.0001, 0.0005, 0.001, 0.0033, 0.0066, 0.01, 0.033, 0.066, 0.1, 0.33, 1, 3, 6, 10, 100]}
    
    if model == 'dt':
        pipeline = Pipeline([('clf',DecisionTreeClassifier(criterion='entropy', random_state=20000))])
        # specify parameters and distributions to sample from
        param_dist = {'clf__max_depth': sp_randint(20, 30),
                 'clf__min_samples_split': sp_randint(2, 11)
                    }
    if model == 'rf':
        pipeline = Pipeline([('clf',RandomForestClassifier(criterion='entropy', random_state=20000))])
        # specify parameters and distributions to sample from
        param_dist = {'clf__max_depth': sp_randint(20, 30),
                     'clf__max_features': sp_randint(1, X_train.shape[1]),
                 'clf__min_samples_split': sp_randint(2, 11)
                    }
    # run randomized search
    n_iter_search = 20
    rand_search = RandomizedSearchCV(pipeline, param_distributions=param_dist, random_state=20000,
                                    n_iter=n_iter_search, cv = 10, n_jobs=-1,verbose=1, scoring='recall')
    rand_search.fit(X_train, Y_train)
    print('Best score: %0.3f' % rand_search.best_score_)
    print('Best parameters set:')
    best_parameters = rand_search.best_estimator_.get_params()
    for param_name in sorted(param_dist.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    predictions = rand_search.predict(X_test)
    print(classification_report(Y_test, predictions))
    print("AUC is {0:.2f}".format(roc_auc_score(Y_test, predictions)))
    print(confusion_matrix(Y_test, predictions))
    return rand_search.best_estimator_
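Note that `DTCGrid` passes hard class predictions to `roc_auc_score`. AUC is defined over a ranking of scores, so with 0/1 predictions it collapses to a single operating point and can differ from (and is often lower than) the AUC computed from predicted probabilities. A pure-Python sketch of the pairwise definition on made-up numbers shows the gap; `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities
y_true = [0, 0, 0, 1, 1]
probs = [0.1, 0.4, 0.6, 0.7, 0.9]

# Thresholding at 0.5 throws away the ranking within each predicted class
hard_labels = [int(p >= 0.5) for p in probs]

auc_from_probs = pairwise_auc(y_true, probs)
auc_from_labels = pairwise_auc(y_true, hard_labels)
```

With scikit-learn, the probability-based figure would come from `roc_auc_score(Y_test, rand_search.predict_proba(X_test)[:, 1])`.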

In [93]:
# Structured features only, trained on the original (unbalanced) training set
best_estimator = DTCGrid(X_train_tf[feature_set_1], X_test_tf[feature_set_1], y_train['output_label'], y_test['output_label'], model = 'lr')

Fitting 10 folds for each of 15 candidates, totalling 150 fits



The total space of parameters 15 is smaller than n_iter=20. Running 15 iterations. For exhaustive searches, use GridSearchCV.

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    5.2s


Best score: 0.704
Best parameters set:
	clf__C: 100
              precision    recall  f1-score   support

           0       0.96      0.70      0.81      7716
           1       0.12      0.58      0.20       550

   micro avg       0.69      0.69      0.69      8266
   macro avg       0.54      0.64      0.50      8266
weighted avg       0.90      0.69      0.77      8266

AUC is 0.64
[[5384 2332]
 [ 230  320]]


[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed:   10.3s finished


In [95]:
# best_estimator = DTCGrid(X_train_tf[feature_set_1], X_test_tf[feature_set_1], y_train['output_label'], y_test['output_label'], model = 'rf')

In [None]:
# Train a logistic regression on just the vectorized text data
nlp_model = DTCGrid(X_train_tf[X_train_tf.columns[-400:]], X_test_tf[X_test_tf.columns[-400:]], y_train['output_label'], y_test['output_label'], model = 'lr')

In [101]:
final_model = DTCGrid(X_train_tf, X_test_tf, y_train['output_label'], y_test['output_label'], model = 'lr')

Fitting 10 folds for each of 15 candidates, totalling 150 fits



The total space of parameters 15 is smaller than n_iter=20. Running 15 iterations. For exhaustive searches, use GridSearchCV.

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   19.7s
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed:  2.3min finished


Best score: 0.725
Best parameters set:
	clf__C: 0.1
              precision    recall  f1-score   support

           0       0.96      0.68      0.80      7716
           1       0.13      0.65      0.21       550

   micro avg       0.68      0.68      0.68      8266
   macro avg       0.54      0.66      0.50      8266
weighted avg       0.91      0.68      0.76      8266

AUC is 0.66
[[5247 2469]
 [ 195  355]]



lbfgs failed to converge. Increase the number of iterations.



In [50]:
# best_estimator = DTCGrid(X_train_tf[X_train_tf.columns[-400:]], X_test_tf[X_test_tf.columns[-400:]], y_train, y_test, model = 'dt')

In [61]:
# Logistic regression
from sklearn.linear_model import LogisticRegression as LR
from sklearn.metrics import roc_curve, auc, roc_auc_score, confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

In [97]:
model_lr = LR(C = 100, penalty = 'l2', class_weight='balanced', random_state = 3, solver="lbfgs")


print("Cross Validation Score: {:.2%}".format(np.mean(cross_val_score(model_lr, X_train_tf, y_train['output_label'], cv=5))))
# logreg.fit(X_train, Y_train)


model_lr.fit(X_train_tf, y_train['output_label'])
print("Test Set score: {:.2%}".format(model_lr.score(X_test_tf, y_test['output_label'])))


lbfgs failed to converge. Increase the number of iterations.


lbfgs failed to converge. Increase the number of iterations.


lbfgs failed to converge. Increase the number of iterations.


lbfgs failed to converge. Increase the number of iterations.


lbfgs failed to converge. Increase the number of iterations.



Cross Validation Score: 69.17%
Test Set score: 68.47%



lbfgs failed to converge. Increase the number of iterations.



In [99]:
y_test_preds = model_lr.predict(X_test_tf)

print("Accuracy is {0:.2f}".format(accuracy_score(y_test, y_test_preds)))
print("Precision is {0:.2f}".format(precision_score(y_test, y_test_preds)))
print("Recall is {0:.2f}".format(recall_score(y_test, y_test_preds)))
print("AUC is {0:.2f}".format(roc_auc_score(y_test, y_test_preds)))

print(confusion_matrix(y_test, y_test_preds))

Accuracy is 0.68
Precision is 0.12
Recall is 0.61
AUC is 0.65
[[5322 2394]
 [ 212  338]]


In [None]:
# Grid search cv 
from sklearn.model_selection import GridSearchCV
# def get_lr_hyperparams(x, y, nfolds):
#     scoring = {'AUC': 'roc_auc', 
#                'prec': 'precision',
#                'recall': 'recall'}
#     param_grid = {'C': [0.0001, 0.0005, 0.001, 0.0033, 0.0066, 0.01, 0.033, 0.066, 0.1, 0.33, 1, 3, 6, 10, 100]}
#     grid_search = GridSearchCV(LR(penalty='l1', solver='saga', class_weight='balanced', random_state=5, max_iter=100), 
#                                param_grid, scoring=scoring, cv=nfolds, refit='recall')
#     grid_search.fit(x, y)
#     grid_search.best_params_
#     return grid_search.best_params_

# Oversample the minority class

In [102]:
from imblearn.over_sampling import RandomOverSampler

In [165]:
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X_train_tf, y_train['output_label'])

X_resampled = pd.DataFrame(X_resampled, columns = X_train_tf.columns)

print(sorted(Counter(y_resampled).items()))

[(0, 30832), (1, 30832)]


In [121]:
y_test_preds = model_lr.predict(X_test_tf)

print("Accuracy is {0:.2f}".format(accuracy_score(y_test, y_test_preds)))
print("Precision is {0:.2f}".format(precision_score(y_test, y_test_preds)))
print("Recall is {0:.2f}".format(recall_score(y_test, y_test_preds)))
print("AUC is {0:.2f}".format(roc_auc_score(y_test, y_test_preds)))

print(confusion_matrix(y_test, y_test_preds))

Accuracy is 0.68
Precision is 0.12
Recall is 0.62
AUC is 0.65
[[5316 2400]
 [ 211  339]]


## Train 2 models

- Model 1: Using only structured features
- Model 2: Adding in discharge notes

In [166]:
struct_model = DTCGrid(X_resampled[feature_set_1], X_test_tf[feature_set_1], y_resampled, y_test['output_label'], model = 'lr')

Fitting 10 folds for each of 15 candidates, totalling 150 fits



The total space of parameters 15 is smaller than n_iter=20. Running 15 iterations. For exhaustive searches, use GridSearchCV.

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    3.1s


Best score: 0.676
Best parameters set:
	clf__C: 0.0001
              precision    recall  f1-score   support

           0       0.96      0.54      0.69      7716
           1       0.09      0.67      0.17       550

   micro avg       0.55      0.55      0.55      8266
   macro avg       0.53      0.61      0.43      8266
weighted avg       0.90      0.55      0.66      8266

AUC is 0.61
[[4196 3520]
 [ 181  369]]


[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed:   12.4s finished


In [167]:
bootstrap_model = DTCGrid(X_resampled, X_test_tf, y_resampled, y_test['output_label'], model = 'lr')


The total space of parameters 15 is smaller than n_iter=20. Running 15 iterations. For exhaustive searches, use GridSearchCV.

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 10 folds for each of 15 candidates, totalling 150 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   27.7s
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed: 10.1min finished


Best score: 0.706
Best parameters set:
	clf__C: 3
              precision    recall  f1-score   support

           0       0.96      0.69      0.80      7716
           1       0.12      0.62      0.21       550

   micro avg       0.68      0.68      0.68      8266
   macro avg       0.54      0.65      0.51      8266
weighted avg       0.91      0.68      0.76      8266

AUC is 0.65
[[5320 2396]
 [ 210  340]]


## Pickle all of the models we need for the dashboard

In [168]:
import pickle

In [169]:
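# Save the model trained on structured features only
with open('model1.pkl', 'wb') as f:
    pickle.dump(struct_model, f)

# Save the model trained on structured features plus note vectors
with open('nlp_model.pkl', 'wb') as f:
    pickle.dump(bootstrap_model, f)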
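For the dashboard side, loading a pickled object back is the mirror image of saving it. A minimal round-trip sketch, using a plain dictionary as a stand-in for a fitted estimator and a temporary directory so nothing is overwritten; the file name is illustrative.

```python
import os
import pickle
import tempfile

# Placeholder object, not a real estimator
stand_in_model = {'name': 'struct_model', 'C': 0.0001}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model1.pkl')

    # The context manager closes the file even if dump raises
    with open(path, 'wb') as fh:
        pickle.dump(stand_in_model, fh)

    with open(path, 'rb') as fh:
        restored = pickle.load(fh)
```

Unpickling a real estimator needs a compatible scikit-learn version in the loading environment, and pickle files should only be loaded from trusted sources.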

# Try random forest on non-normalized values

In [145]:
no_norm_adm = pd.read_csv('../admission_processed.csv', parse_dates=['admittime', 'dischtime', 'deathtime', 'edregtime', 'edouttime', 'next_admittime', 'dob'], date_parser=pd.to_datetime)

In [146]:
no_norm_adm = no_norm_adm[['hadm_id', 'subject_id', 'days_next_admit'] + feature_set_1]

# Defining dictionaries for encoding
admin_type_dict = {'EMERGENCY': 0, 'URGENT': 0, 'ELECTIVE': 1}
gender_dict = {'M': 0, 'F': 1}

# Mapping dictionaries to binary features
no_norm_adm['admission_type'] = no_norm_adm['admission_type'].map(admin_type_dict).astype(int)
no_norm_adm['gender'] = no_norm_adm['gender'].map(gender_dict).astype(int)

# Generate output label for readmissions under 30 days
no_norm_adm['output_label'] = (no_norm_adm['days_next_admit'] < 30).astype('int')

In [147]:
no_norm_adm.head()

Unnamed: 0,hadm_id,subject_id,days_next_admit,admission_type,total_prior_admits,gender,age,length_of_stay,num_medications,num_lab_tests,perc_tests_abnormal,num_diagnosis,output_label
0,185777,4,,0,0,1,47.843943,7.759028,59,245.0,0.240816,9,0
1,107064,6,,1,0,1,65.938398,16.364583,148,573.0,0.448517,8,0
2,194540,11,,0,0,1,50.146475,25.529167,91,442.0,0.104072,1,0
3,143045,13,,0,0,1,39.863107,6.855556,84,359.0,0.29805,5,0
4,194023,17,128.920833,1,0,1,47.45243,4.368056,55,192.0,0.296875,4,0


In [148]:
# shuffle the samples
no_norm_adm = no_norm_adm.sample(n = len(no_norm_adm), random_state=42)
no_norm_adm.reset_index(drop=True, inplace=True)

no_norm_target = no_norm_adm[['output_label']]
no_norm_data = no_norm_adm[feature_set_1]

nn_X_train, nn_X_test, nn_y_train, nn_y_test = train_test_split(no_norm_data, no_norm_target, test_size=0.2, random_state = 0)
print(nn_X_train.shape, nn_X_test.shape, nn_y_train.shape, nn_y_test.shape)

(33061, 9) (8266, 9) (33061, 1) (8266, 1)


In [150]:
nn_X_train_tf, nn_X_test_tf = pd.concat([nn_X_train.reset_index(drop=True), X_train_vectors], axis=1), pd.concat([nn_X_test.reset_index(drop=True), X_test_vectors], axis=1)

In [151]:
print(nn_X_train_tf.shape, nn_X_test_tf.shape)

(33061, 409) (8266, 409)


In [163]:
# Oversample the training set
nn_X_resampled, nn_y_resampled = ros.fit_resample(nn_X_train_tf, nn_y_train['output_label'])

nn_X_resampled = pd.DataFrame(nn_X_resampled, columns=nn_X_train_tf.columns)

print(sorted(Counter(nn_y_resampled).items()))

[(0, 30832), (1, 30832)]


In [164]:
nn_bootstrap_model = DTCGrid(nn_X_resampled[feature_set_1], nn_X_test_tf[feature_set_1], nn_y_resampled, nn_y_test['output_label'], model = 'rf')

Fitting 10 folds for each of 20 candidates, totalling 200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   17.4s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed:  1.5min finished

The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.



Best score: 1.000
Best parameters set:
	clf__max_depth: 28
	clf__max_features: 4
	clf__min_samples_split: 2
              precision    recall  f1-score   support

           0       0.94      0.98      0.96      7716
           1       0.20      0.05      0.08       550

   micro avg       0.92      0.92      0.92      8266
   macro avg       0.57      0.52      0.52      8266
weighted avg       0.89      0.92      0.90      8266

AUC is 0.52
[[7598  118]
 [ 521   29]]
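A cross-validated recall of 1.000 next to a test recall of 0.05 is consistent with duplicated minority rows landing in both the training and validation folds, since the oversampling was done before the search. The sketch below illustrates that effect with made-up labels and hypothetical helpers (`oversample`, `leaked_fraction`); it is a diagnostic illustration, not the notebook's code.

```python
import random

def oversample(labels, seed=0):
    """Return source-row indices after duplicating class-1 rows up to the class-0 count."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

def leaked_fraction(source_idx, n_folds=5, seed=0):
    """Share of validation rows whose source row also sits in the training folds."""
    rng = random.Random(seed)
    rows = list(source_idx)
    rng.shuffle(rows)
    leaked = total = 0
    for k in range(n_folds):
        valid = rows[k::n_folds]
        train = set(rows[i] for i in range(len(rows)) if i % n_folds != k)
        leaked += sum(1 for r in valid if r in train)
        total += len(valid)
    return leaked / total

labels = [0] * 90 + [1] * 10          # made-up 10% positive rate
resampled = oversample(labels)
fraction = leaked_fraction(resampled)
```

One common remedy is to resample inside each fold, for example by placing the sampler in an imbalanced-learn `Pipeline` that is passed to the search, so that validation folds contain only original rows.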
