This notebook does the following:
 * Extract data from the MIMIC-III database
 * Inspect the data and choose an appropriate subset of documents
 * Divide the documents into a structured and a free-text component
 * Parse the structured component of the documents
 * Convert the free text into bag-of-words (BOW) format
 * Write the documents in ARFF format

In [51]:
import os
import pandas as pd
from random import sample
from sklearn.feature_extraction.text import CountVectorizer  # used in the parsing cells below

In [78]:
from bow_machine import BOWMachine

ModuleNotFoundError: No module named 'bow_machine'

In [32]:
import wasabi
msg = wasabi.Printer()

In [6]:
MIMIC_path = os.path.abspath('../../FeatureCat/data/raw/NOTEEVENTS.csv')
data = pd.read_csv(MIMIC_path)
data.head()


Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,CHARTDATE,CHARTTIME,STORETIME,CATEGORY,DESCRIPTION,CGID,ISERROR,TEXT
0,174,22532,167853.0,2151-08-04,,,Discharge summary,Report,,,Admission Date: [**2151-7-16**] Dischar...
1,175,13702,107527.0,2118-06-14,,,Discharge summary,Report,,,Admission Date: [**2118-6-2**] Discharg...
2,176,13702,167118.0,2119-05-25,,,Discharge summary,Report,,,Admission Date: [**2119-5-4**] D...
3,177,13702,196489.0,2124-08-18,,,Discharge summary,Report,,,Admission Date: [**2124-7-21**] ...
4,178,26880,135453.0,2162-03-25,,,Discharge summary,Report,,,Admission Date: [**2162-3-3**] D...
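
Loading the full NOTEEVENTS table is memory-hungry, and pandas has to infer every column type. The following is a minimal sketch, assuming the column names shown in the table above, of reading only the columns used later with explicit dtypes. The helper name `load_notes`, the column selection, and the in-memory demo data are illustrative assumptions and not part of the original notebook.

```python
import io
import pandas as pd

# Columns used later in this notebook; names taken from the table above.
NOTE_COLUMNS = ["ROW_ID", "SUBJECT_ID", "HADM_ID", "CATEGORY", "DESCRIPTION", "TEXT"]

# Explicit dtypes so pandas does not have to guess them.
NOTE_DTYPES = {
    "ROW_ID": "int64",
    "SUBJECT_ID": "int64",
    "HADM_ID": "float64",    # float so that missing admission IDs can be NaN
    "CATEGORY": "category",  # few distinct values, so a categorical saves memory
    "DESCRIPTION": str,
    "TEXT": str,
}

def load_notes(path_or_buffer):
    """Hypothetical helper: read a NOTEEVENTS-style CSV with fixed columns and dtypes."""
    return pd.read_csv(path_or_buffer, usecols=NOTE_COLUMNS, dtype=NOTE_DTYPES)

# Tiny in-memory stand-in for NOTEEVENTS.csv so the sketch runs without the real data.
demo_csv = io.StringIO(
    "ROW_ID,SUBJECT_ID,HADM_ID,CHARTDATE,CATEGORY,DESCRIPTION,TEXT\n"
    "1,10,100.0,2151-08-04,Discharge summary,Report,Admission Date: ...\n"
    "2,11,,2118-06-14,Nursing,Nursing Progress Note,Pt stable overnight\n"
)
demo = load_notes(demo_csv)
print(demo.dtypes)
```

With the real file, the call would be `load_notes(MIMIC_path)`.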


## Data Inspection

In [38]:
data.dtypes

ROW_ID           int64
SUBJECT_ID       int64
HADM_ID        float64
CHARTDATE       object
CHARTTIME       object
STORETIME       object
CATEGORY        object
DESCRIPTION     object
CGID           float64
ISERROR        float64
TEXT            object
dtype: object

In [13]:
data['CATEGORY'].value_counts()

Nursing/other        822497
Radiology            522279
Nursing              223556
ECG                  209051
Physician            141624
Discharge summary     59652
Echo                  45794
Respiratory           31739
Nutrition              9418
General                8301
Rehab Services         5431
Social Work            2670
Case Management         967
Pharmacy                103
Consult                  98
Name: CATEGORY, dtype: int64

In [37]:
data['DESCRIPTION'].value_counts()[:20]

Report                               1132519
Nursing Progress Note                 191836
CHEST (PORTABLE AP)                   169270
Physician Resident Progress Note       62698
CHEST (PA & LAT)                       43158
CT HEAD W/O CONTRAST                   34485
Respiratory Care Shift Note            31105
Nursing Transfer Note                  30773
Intensivist Note                       26144
CHEST PORT. LINE PLACEMENT             21596
Physician Attending Progress Note      21023
Physician Resident Admission Note      10654
Clinical Nutrition Note                 9395
PORTABLE ABDOMEN                        8143
CHEST (PRE-OP PA & LAT)                 8064
CT CHEST W/CONTRAST                     8001
CT ABDOMEN W/CONTRAST                   7304
MR HEAD W & W/O CONTRAST                7062
CT CHEST W/O CONTRAST                   6745
Generic Note                            6649
Name: DESCRIPTION, dtype: int64

## Look at n examples from each category.

In [56]:
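n_examples = 2  # number of notes to print per category

for cat in data['CATEGORY'].unique():
    # select the note text for this category
    cat_text = data.loc[data['CATEGORY'] == cat, 'TEXT']
    # draw n_examples notes at random, without replacement
    cat_sample = sample(list(cat_text), n_examples)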
    for i, example in enumerate(cat_sample):
        msg.divider(f'{cat} {i+1}')
        print()
        print(example)
        print()


Admission Date:  [**2152-4-10**]     Discharge Date:  [**2152-4-19**]

Date of Birth:   [**2078-9-11**]     Sex:  F

Service:  CCU

CHIEF COMPLAINT:  Respiratory distress.

HISTORY OF PRESENT ILLNESS:  This is a 73-year-old female
with a history of coronary artery disease, status post
anterior wall myocardial infarction with cardiogenic shock in
[**Month (only) 956**] of this year.  At that time, the patient had
underwent percutaneous transluminal coronary angioplasty and
stent of her left anterior descending with an ejection
fraction of 25% at [**Hospital3 **] Hospital.  She had required
balloon pump and pressors for cardiogenic shock and had been
transferred to the [**Hospital6 256**] post
catheterization for further evaluation.  She was discharged
on [**3-22**] on medical therapy to a Rehabilitation facility.
Patient had been discharged from the Rehabilitation facility
and was doing well at home until the morning of admission,
when she was found to be in respiratory distress b
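
The sampled note contains MIMIC-III de-identification placeholders such as `[**2152-4-10**]` and `[**Hospital3 **]`. If such notes are later tokenized on alphabetic characters only, words inside the placeholders (for example `Hospital` or `Month`) would be counted as ordinary tokens. A minimal sketch of removing placeholders beforehand, assuming every placeholder is delimited by `[**` and `**]` on a single line, as in the excerpt; the name `strip_deid` is hypothetical.

```python
import re

# Matches "[** ... **]" non-greedily; assumes placeholders never nest or span lines.
DEID_PATTERN = re.compile(r"\[\*\*.*?\*\*\]")

def strip_deid(text: str) -> str:
    """Replace each de-identification placeholder with a single space."""
    return DEID_PATTERN.sub(" ", text)

example = "transferred to the [**Hospital6 256**] post catheterization on [**3-22**]"
cleaned = strip_deid(example)
print(cleaned)
```

Whether to drop placeholders or map them to marker tokens (for instance one token per placeholder type) depends on the downstream task; dropping them is only the simplest option.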

## Parse Discharge Summaries
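
Discharge summaries such as the sample above open with `Field:  value` pairs (`Admission Date`, `Discharge Date`, `Date of Birth`, `Sex`, `Service`), followed by capitalised section headings. A rough sketch of dividing such a document into a structured and a free-text component, assuming the free text starts at the first all-caps heading (such as `CHIEF COMPLAINT:`) and that header values end at a run of two or more spaces or at the end of a line. Both assumptions are drawn from the single example shown and may not hold for every note; `split_summary` is a hypothetical name.

```python
import re

# First line that starts with an all-caps heading followed by a colon.
SECTION_START = re.compile(r"^[A-Z][A-Z /&-]+:", re.MULTILINE)
# "Key:  value" pairs; a value ends at two or more spaces or at the end of the line.
HEADER_FIELD = re.compile(
    r"([A-Z][A-Za-z ]+?):[ \t]+(.*?)(?=[ \t]{2,}|$)", re.MULTILINE
)

def split_summary(text):
    """Return (header_fields, free_text) for a discharge-summary-like string."""
    match = SECTION_START.search(text)
    cut = match.start() if match else len(text)
    header, free_text = text[:cut], text[cut:]
    fields = {key.strip(): value.strip() for key, value in HEADER_FIELD.findall(header)}
    return fields, free_text.strip()

note = (
    "Admission Date:  [**2152-4-10**]     Discharge Date:  [**2152-4-19**]\n"
    "\n"
    "Date of Birth:   [**2078-9-11**]     Sex:  F\n"
    "\n"
    "Service:  CCU\n"
    "\n"
    "CHIEF COMPLAINT:  Respiratory distress.\n"
)
fields, free_text = split_summary(note)
print(fields)
print(free_text)
```

The returned dictionary can feed the structured columns, while the remaining string goes to the bag-of-words step.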

In [52]:
help(CountVectorizer)

Help on class CountVectorizer in module sklearn.feature_extraction.text:

class CountVectorizer(_VectorizerMixin, sklearn.base.BaseEstimator)
 |  CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
 |  
 |  Convert a collection of text documents to a matrix of token counts
 |  
 |  This implementation produces a sparse representation of the counts using
 |  scipy.sparse.csr_matrix.
 |  
 |  If you do not provide an a-priori dictionary and you do not use an analyzer
 |  that does some kind of feature selection then the number of features will
 |  be equal to the vocabulary size found by analyzing the data.
 |  
 |  Read more in the :ref:`User Guide <text_feature_extraction>`.
 |  
 |  Parameters
 |  ---------

In [68]:
vectorizer = CountVectorizer(
    lowercase=True, # convert to lowercase
    stop_words='english', # remove English stopwords
    binary=False, # use counts rather than binary inclusion
    max_df=0.99, # ignore tokens that occur in more than 99% of documents
    min_df=0.01, # ignore tokens that occur in fewer than 1% of documents
    token_pattern='[A-Za-z]+' # only use pure-alphabetic tokens (no numeric chars)
)


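# NOTE: `cat` is the loop variable left over from the sampling cell above,
# so this selects whichever category that loop visited last.
chosen_data = data[data['CATEGORY'] == cat]
free_text = chosen_data['TEXT'].values
bow_obj = vectorizer.fit_transform(free_text)
# scikit-learn >= 1.0; older versions use get_feature_names() instead
vocab = vectorizer.get_feature_names_out()
bow_df = pd.DataFrame(bow_obj.toarray(), columns=vocab)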
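
To make the effect of the `min_df` and `max_df` thresholds concrete, here is a small sketch on a made-up four-document corpus. The documents and the `toy_` names are illustrative assumptions, and `min_df` is raised to 0.5 so that the filtering is visible on so few documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus; not taken from MIMIC-III.
docs = [
    "pt stable on abx",
    "pt resting, abd soft",
    "pt alert, abx continued",
    "pt afebrile",
]

toy_vectorizer = CountVectorizer(
    min_df=0.5,   # keep tokens found in at least 50% of documents
    max_df=0.99,  # drop tokens found in more than 99% of documents
    token_pattern="[A-Za-z]+",
)
toy_bow = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each surviving token to its column index.
kept = sorted(toy_vectorizer.vocabulary_)
print(kept)
print(toy_bow.toarray())
```

Here `pt` occurs in all four documents and is dropped by `max_df`, the tokens that occur only once are dropped by `min_df`, and only `abx` remains.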

In [69]:
bow_df.head()

Unnamed: 0,abd,abdomen,abdominal,abg,abgs,able,abp,absent,abx,ac,...,written,wt,x,y,year,yellow,yes,yesterday,yo,zosyn
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,1,0,0,1,0,0,0,0,0,0,...,0,0,3,0,0,1,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


In [62]:
len(vocab)

1409
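
Most entries in a bag-of-words matrix are zero, and `bow_obj.toarray()` stores every one of them. For a large category this dense copy can become very large. A hedged sketch of keeping the counts sparse inside pandas via `pd.DataFrame.sparse.from_spmatrix`; the small matrix and token list below are made-up stand-ins for `bow_obj` and `vocab`.

```python
import pandas as pd
from scipy import sparse

# Stand-in for the matrix returned by fit_transform (3 documents, 4 tokens).
counts = sparse.csr_matrix([[0, 1, 0, 0], [2, 0, 0, 1], [0, 0, 0, 0]])
tokens = ["abd", "abx", "pt", "wt"]

# Each column becomes a pandas SparseArray; zero counts are not stored.
sparse_df = pd.DataFrame.sparse.from_spmatrix(counts, columns=tokens)
print(sparse_df.sparse.density)  # fraction of cells that are actually stored
```

In the notebook this would correspond to `pd.DataFrame.sparse.from_spmatrix(bow_obj, columns=vocab)`; writing to CSV still expands the zeros, so the saving applies to memory only.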

In [71]:
bow_df.to_csv('mimic_bow.csv', index=False)
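
The introduction names ARFF as the target format, whereas the cell above writes CSV. The following is a minimal sketch of writing a bag-of-words table as a dense ARFF file by hand, assuming every column is a numeric count and that token names contain no quote characters. The function `write_arff` and the relation name are hypothetical, and a maintained ARFF library may be preferable in practice.

```python
import io

def write_arff(handle, relation, columns, rows):
    """Write numeric rows to `handle` in dense ARFF layout."""
    handle.write(f"@RELATION {relation}\n\n")
    for name in columns:
        # Quote attribute names so unusual tokens stay valid identifiers.
        handle.write(f"@ATTRIBUTE '{name}' NUMERIC\n")
    handle.write("\n@DATA\n")
    for row in rows:
        handle.write(",".join(str(value) for value in row) + "\n")

# Demo on two documents and two tokens, written to an in-memory buffer.
buffer = io.StringIO()
write_arff(buffer, "mimic_bow", ["abd", "abx"], [[0, 1], [2, 0]])
arff_text = buffer.getvalue()
print(arff_text)
```

With the notebook's data the call would look like `write_arff(f, 'mimic_bow', bow_df.columns, bow_df.values)` inside a `with open('mimic_bow.arff', 'w') as f:` block, where the file name is again an assumption.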