# Part A - Finding the NASH

Natural Language Processing (NLP) doesn't have to be hard! For many tasks, simply finding the notes that mention a helpful term is enough. In this example we have a nice term, NASH (nonalcoholic steatohepatitis), that is fairly unambiguous. We just want to find patients who may have NASH for some further study.

In [0]:
# First off - load all the silly Python libraries we are going to need
import pandas as pd
import numpy as np
import random
from IPython.display import display, HTML


from google.colab import auth
from google.cloud import bigquery
from google.colab import files
import os

In [0]:
auth.authenticate_user()  # authenticate this Colab session with Google Cloud

In [0]:
project_id = 'hst-953-2018'
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id

# Read data from BigQuery into a pandas DataFrame.
def run_query(query):
  # useLegacySql=False selects standard SQL, which the backtick-quoted table names below require.
  return pd.io.gbq.read_gbq(query, project_id=project_id, configuration={'query': {'useLegacySql': False}})

In [0]:
# Now load the data. In general you'd load the whole set of notes, but that would take
# several minutes, so for this example we only pull the discharge summaries.
# notes = pd.read_csv('A.csv')
notes = run_query('''
SELECT row_id, subject_id, hadm_id, TEXT
FROM `physionet-data.mimiciii_notes.noteevents`
WHERE CATEGORY = 'Discharge summary'
''')

In [0]:
# Here is the list of terms we are going to consider "good"
terms = ['NASH', 'nonalcoholic steatohepatitis']
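
One thing to keep in mind with a plain term list: a Python substring check is case-sensitive and will also match inside longer words. The block below is a minimal sketch of one possible alternative, assuming you want whole-word, case-insensitive matching. `build_pattern` is a hypothetical helper (not part of the original notebook), the term list is redefined so the block runs on its own, and the example sentences are made up rather than taken from MIMIC.

```python
import re

# Term list redefined here so the block is self-contained.
terms = ['NASH', 'nonalcoholic steatohepatitis']

def build_pattern(terms):
    """Hypothetical helper: one regex that matches any term as a whole word, ignoring case."""
    alternatives = "|".join(re.escape(t) for t in terms)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

pattern = build_pattern(terms)

# Made-up sentences for illustration only.
examples = [
    "History of NASH cirrhosis.",
    "Nonalcoholic steatohepatitis suspected.",
    "Patient transferred from Nashville.",
]
hits = [bool(pattern.search(s)) for s in examples]
print(hits)  # prints [True, True, False]
```

Ignoring case is a trade-off: it picks up "Nonalcoholic" at the start of a sentence, but it would also match the surname "Nash", so it is worth eyeballing a few matches either way.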

In [8]:
notes.head()

Unnamed: 0,row_id,subject_id,hadm_id,TEXT
0,811,88360,130127,Admission Date: [**2149-11-16**] ...
1,1495,32013,185178,Admission Date: [**2188-7-2**] D...
2,2714,31260,191494,Admission Date: [**2107-8-5**] D...
3,2615,58938,178153,Admission Date: [**2179-1-31**] ...
4,3455,17552,175683,Admission Date: [**2150-4-3**] Discharge ...


In [9]:
# Now scan through all of the notes. Do any of the terms appear? If so, stash the note's
# row_id for future use

matches = []

for index, row in notes.iterrows():
    if any(x in row['TEXT'] for x in terms):
        matches.append(row['row_id'])

print("Found " + str(len(matches)) + " matching notes.")

Found 256 matching notes.
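
The loop above collects note ids, but the stated goal is to find patients. Here is a small sketch of how the same case-sensitive check could be done in one vectorized step and then rolled up to `subject_id`. It runs on a tiny synthetic frame (`toy_notes`, made-up rows, not MIMIC data) so it works without BigQuery access; the variable names are assumptions for illustration.

```python
import re
import pandas as pd

# Tiny synthetic stand-in for the `notes` frame (made-up rows, not MIMIC data).
toy_notes = pd.DataFrame({
    "row_id": [1, 2, 3, 4],
    "subject_id": [10, 10, 20, 30],
    "TEXT": ["Known NASH cirrhosis.", "Follow-up for NASH.", "No liver disease.", None],
})
terms = ["NASH", "nonalcoholic steatohepatitis"]

# Literal, case-sensitive alternation, mirroring the `x in text` check in the loop.
pattern = "|".join(re.escape(t) for t in terms)

# na=False treats notes with missing text as non-matches instead of raising.
mask = toy_notes["TEXT"].str.contains(pattern, regex=True, na=False)

matching_rows = toy_notes.loc[mask, "row_id"].tolist()
matching_patients = sorted(toy_notes.loc[mask, "subject_id"].unique().tolist())
print(matching_rows, matching_patients)
```

A patient with several matching notes shows up only once in `matching_patients`, which is usually what you want when building a cohort.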


In [10]:
# Display a random note that matches. You can rerun this cell to get another note.
# The fancy stuff just highlights the match to make it easier to find.
import html

display_id = random.choice(matches)
text = notes[notes['row_id'] == display_id].iloc[0]['TEXT']
# Escape characters such as < and & so the raw note text is not parsed as HTML.
text = html.escape(text)
for term in terms:
    text = text.replace(term, '<span style="color: red">' + term + '</span>')
display(HTML("<pre>" + text + "</pre>"))
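
Reading whole discharge summaries gets slow when there are hundreds of matches. As a rough sketch, assuming you only need a quick look at each hit, the hypothetical helper `snippets` below returns every occurrence of a term with a few characters of context on either side. The example text is made up, not a MIMIC note.

```python
def snippets(text, term, width=40):
    """Hypothetical helper: each occurrence of `term` with `width` characters of context."""
    out = []
    start = text.find(term)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(text), start + len(term) + width)
        # Flatten newlines so each snippet prints on one line.
        out.append(text[lo:hi].replace("\n", " "))
        start = text.find(term, start + len(term))
    return out

# Made-up note text for illustration only.
example = (
    "Past medical history:\n1. NASH cirrhosis\n2. Hypertension\n"
    "Plan: hepatology follow-up for NASH."
)
found = snippets(example, "NASH", width=12)
for s in found:
    print(s)
```

Skimming these short windows is often enough to spot obvious false positives before opening the full note.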
