# Phrases

So far we have only thought in terms of single words: "lower", "lobe", "University", "of", "Utah". In practice, though, several words often form a single unit of thought, such as "University of Utah". Our word vectors will do a better job of representing our text if we first recognize these phrases. We are going to use the [gensim](https://radimrehurek.com/gensim/models/phrases.html) package to detect and transform these phrases.

For example, the sentence, "I am a faculty member in the departments of Biomedical Informatics and Radiology and Imaging Sciences at the University of Utah." would be transformed to "I am a faculty member in the departments of Biomedical_Informatics and Radiology_and_Imaging_Sciences at the University_of_Utah."
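
To make the transformation concrete, here is a minimal pure-Python sketch that joins already-known phrases with underscores. It is not how gensim works internally: the helper `apply_phrases` and the hand-written `known_phrases` set are hypothetical, and a real detector learns its phrases from data instead of being given them.

```python
def apply_phrases(tokens, phrases):
    """Greedily join the longest known phrase starting at each position."""
    max_len = max((len(p) for p in phrases), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word phrases.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                out.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here, so keep the single word.
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical, hand-picked phrases (a trained model would discover these).
known_phrases = {
    ("Biomedical", "Informatics"),
    ("University", "of", "Utah"),
}
tokens = "in the departments of Biomedical Informatics at the University of Utah".split()
phrased = " ".join(apply_phrases(tokens, known_phrases))
print(phrased)
```

Trying the longest match first is a design choice: it keeps "University_of_Utah" whole rather than stopping at a shorter phrase that happens to be a prefix of it.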

"Biomedical_Informatics is an example of a **bigram phrase** and "University_of_Utah" is a **trigram phrase**. I guess "Radiology_and_Imaging_Sciences" is a quadgram phrase, but we will likely not try to detect phrases that long.

# Using the Gensim Phrases Module

In [8]:
%matplotlib inline

In [9]:
from nose.tools import assert_almost_equal, assert_true, assert_equal, assert_raises
from numbers import Number

## Upgrade to the latest version of gensim

In [10]:
#!conda install gensim -y

In [11]:
import pymysql
import pandas as pd
import getpass
from textblob import TextBlob
import re
from gensim.models.phrases import Phraser, Phrases
from IPython.display import clear_output, display, HTML
import pickle
import gzip
import seaborn as sns
from collections import Counter

In [12]:
import gensim
gensim.__version__

'3.1.0'

In [13]:
with open("rad_data.pickle.gz", "rb") as f0:
    rad_data = pickle.load(f0)
rad_data.head()

Unnamed: 0,subject_id,hadm_id,text,impression,impression no stops
0,56,28766.0,\n\n\n DATE: [**2644-1-17**] 10:53 AM\n ...,\n\n\n date dd:dd am\n mr head w & w/o...,date dd:dd mr head w & w/o contrast; mr contra...
1,56,28766.0,\n\n\n DATE: [**2644-1-17**] 10:43 AM\n ...,impression: stable appearance of right pariet...,impression: stable appearance right parietal l...
2,56,28766.0,\n\n\n DATE: [**2644-1-17**] 6:37 AM\n ...,impression:\n \n cardiomegaly and mild...,impression: cardiomegaly mild chf. nasogastric...
3,56,28766.0,\n\n\n DATE: [**2644-1-19**] 12:09 PM\n ...,impression:\n \n marked improvement in...,impression: marked improvement in left perihil...
4,37,18052.0,\n\n\n DATE: [**3264-8-14**] 6:06 AM\n ...,impression: stable cardiomegaly with pulmonary...,impression: stable cardiomegaly with pulmonary...


In [14]:
with open("rad_vocabulary.pickle.gz", "rb") as f0:
    word_map = pickle.load(f0)

## Let's recompute the impression column, this time without converting to lowercase first

In [15]:
rad_data["impression"] = \
rad_data.apply(lambda row: get_impression(row["text"]), axis=1)

NameError: ("name 'get_impression' is not defined", 'occurred at index 0')

In [None]:
rad_data.shape

## What are our most common words?
### Hint: use a `Counter` and `most_common`
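
As a small illustration of the hint, the sketch below counts whitespace-separated tokens in a few made-up report fragments. The `toy_reports` list is invented for this example; with the real data you would count over the impression column instead.

```python
from collections import Counter

# Made-up report fragments, for illustration only.
toy_reports = [
    "impression: stable cardiomegaly with pulmonary edema",
    "impression: cardiomegaly and mild chf",
    "impression: stable appearance of right parietal lobe",
]

# Count every whitespace-separated token across all reports.
word_counts = Counter(w for report in toy_reports for w in report.split())

# most_common(n) returns the n highest-count (token, count) pairs.
top_three = word_counts.most_common(3)
print(top_three)
```

Note that a plain `split()` keeps the colon attached to "impression:", which is one reason a tokenizer is usually preferable to whitespace splitting.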

### Write a function to pre-process our text

* Lower case?
* Digits?
* Strip dates/times?
* stop words?
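
One possible answer to these questions is sketched below. It is only an assumption about what a reasonable `preprocess` might do: the name `preprocess_sketch`, the tiny `EXAMPLE_STOP_WORDS` set, and the time pattern are all hypothetical and are not the function used later in this notebook.

```python
import re

# Hypothetical stop-word list, for illustration only.
EXAMPLE_STOP_WORDS = {"the", "of", "a", "is", "and"}

# Clock times such as "10:53 am" (assumed format).
TIME = re.compile(r"\b\d{1,2}:\d{2}\s?(am|pm)\b")
DIGIT = re.compile(r"\d")


def preprocess_sketch(text, stop_words=EXAMPLE_STOP_WORDS):
    text = text.lower()          # lower case
    text = TIME.sub("", text)    # strip times
    text = DIGIT.sub("d", text)  # mask remaining digits with "d"
    # Drop stop words and normalize whitespace.
    return " ".join(w for w in text.split() if w not in stop_words)


cleaned = preprocess_sketch("The mass is 2 cm at 10:53 AM")
print(cleaned)
```

The order of the steps matters: times are removed before digits are masked, otherwise the time pattern would no longer match.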

### But first, write unit tests to test whether `preprocess` is functioning correctly
#### Then write functionality to pass tests

You might want to use the `string` module

In [18]:
import string
string.ascii_uppercase

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [24]:
date=re.compile(r"""((?P<month>[A-Z][a-z]{2,}(\.)?) (?P<day>[0-9]{1,2}))""")
last_name=re.compile(r"""(\[\*\*Last Name \((NamePattern|STitle)(\d+)?\) [0-9]*\*\*\])""")
clip = re.compile(r"""\[\*\*Clip Number \(Radiology\) \d+\*\*\]""")
date2 = re.compile(r"""DATE\: \[\*\*\d+-\d+-\d+\*\*\]""")
hospital=re.compile(r"""(\[\*\*Hospital \d+\*\*\])""")
unders = re.compile(r"""_{2,}""")

age2 = re.compile(r"""(?P<age>[0-9]+)(-|\s)y(ear(s)?|\.)(-|\s)?o(ld|\.)""")
age3 = re.compile(r"""\bage(d)? (?P<age>[0-9]+)""")
digits = re.compile(r"""\d""")
def age_in_decades(m):
    age = int(m.group("age"))
    
    return "[** Age in %ss**]"%(int(age/10)*10,)

age_in_decades(next(age2.finditer("74-year-old")))

'[** Age in 70s**]'
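
A replacement function like `age_in_decades` is typically passed to `re.sub`, which calls it once per match. The sketch below shows that usage; `age_pattern` copies the `age2` pattern above, and the example sentence is made up.

```python
import re

# Same pattern as age2 above, repeated so this example is self-contained.
age_pattern = re.compile(r"(?P<age>[0-9]+)(-|\s)y(ear(s)?|\.)(-|\s)?o(ld|\.)")


def age_in_decades(m):
    age = int(m.group("age"))
    # Round the age down to its decade, e.g. 74 -> 70.
    return "[** Age in %ss**]" % (age // 10 * 10,)


# re.sub accepts a callable: it receives each match object and returns
# the replacement string for that match.
deidentified = age_pattern.sub(age_in_decades, "A 74-year-old man and a 35 y.o. woman.")
print(deidentified)
```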

In [25]:
def preprocess(report):
    return digits.sub("d",unders.sub("\n",
               hospital.sub("HOSPITAL", 
                            date2.sub("DATE", 
                                      clip.sub("CLIP", 
                                               last_name.sub("LASTNAME", report))))))


In [26]:
def preprocess(txt):
    return txt.lower()

## Do we return a string

In [27]:
assert_true(isinstance(preprocess("my name"), str))

## Do we remove what we intend to?

In [28]:
assert_equal(
    set(string.ascii_uppercase).intersection(preprocess("The patient's name is Oscar Wilde.")), set())

In [None]:
assert_equal()

### Use our `preprocess` function to create a new column "clean_impression_no_stops"

In [33]:
rad_data["clean_impression_no_stops"] = \
rad_data.apply(lambda row: preprocess(row["impression no stops"]), axis = 1)
rad_data.head(5)

Unnamed: 0,subject_id,hadm_id,text,impression,impression no stops,clean_impression_no_stops
0,56,28766.0,\n\n\n DATE: [**2644-1-17**] 10:53 AM\n ...,\n\n\n date dd:dd am\n mr head w & w/o...,date dd:dd mr head w & w/o contrast; mr contra...,date dd:dd mr head w & w/o contrast; mr contra...
1,56,28766.0,\n\n\n DATE: [**2644-1-17**] 10:43 AM\n ...,impression: stable appearance of right pariet...,impression: stable appearance right parietal l...,impression: stable appearance right parietal l...
2,56,28766.0,\n\n\n DATE: [**2644-1-17**] 6:37 AM\n ...,impression:\n \n cardiomegaly and mild...,impression: cardiomegaly mild chf. nasogastric...,impression: cardiomegaly mild chf. nasogastric...
3,56,28766.0,\n\n\n DATE: [**2644-1-19**] 12:09 PM\n ...,impression:\n \n marked improvement in...,impression: marked improvement in left perihil...,impression: marked improvement in left perihil...
4,37,18052.0,\n\n\n DATE: [**3264-8-14**] 6:06 AM\n ...,impression: stable cardiomegaly with pulmonary...,impression: stable cardiomegaly with pulmonary...,impression: stable cardiomegaly with pulmonary...


## Create a TextBlob from all the text in `rad_data["clean_impression_no_stops"]`

In [34]:
blob = TextBlob(preprocess(" ".join(rad_data["clean_impression_no_stops"])))

# Creating the blob is fast; `words = blob.words` is the slow step, because TextBlob doesn't compute anything until you ask for it

## Write a function `train_phrases` that will train bigram and trigram detectors

* We want to be able to ignore common terms in our phrase detection
* We want to be able to specify the minimum number of occurrences in our text to be considered a phrase
* Return a dictionary of detectors
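
To build intuition for what a minimum count does, here is a simplified pure-Python sketch of count-based bigram scoring. It follows the general form of the scoring in Mikolov et al. that gensim documents as its default, but it is not gensim's implementation; the function `score_bigrams` and the toy sentences are hypothetical.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1):
    """Score adjacent word pairs: frequent pairs of otherwise rare words score high."""
    unigrams = Counter(w for s in sentences for w in s)
    # zip(s, s[1:]) pairs each word with its successor, within one sentence only.
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
        if n >= min_count
    }


toy_sentences = [
    ["mass", "in", "left", "lower", "lobe"],
    ["left", "lower", "lobe", "consolidation"],
    ["no", "change", "in", "left", "lung"],
]
scores = score_bigrams(toy_sentences, min_count=1)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Subtracting `min_count` pushes pairs seen only a handful of times toward a score of zero, so raising it makes the detector more conservative. A pair would then be accepted as a phrase only if its score exceeds some threshold.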

### Write unit tests to determine whether `train_phrases` is working as expected

In [35]:
def train_phrases(blob, common_terms=None, min_count=5):
    # A phrase won't cross a sentence boundary, so we train on a list of
    # per-sentence word lists.
    sentences = [s.words for s in blob.sentences]
    if common_terms is None:
        common_terms = []
    # common_terms (e.g. "of", "and") can sit inside a phrase without
    # interrupting the count.
    phrases = Phrases(sentences, common_terms=common_terms,
                      min_count=min_count)
    bigram = Phraser(phrases)
    # Train a second pass on the bigram-transformed sentences to find trigrams.
    trigram = Phrases(bigram[sentences])

    return {"bigram": bigram, "trigram": trigram}
        

In [36]:
common_terms = ["of", "with", "without", "and", "or", "the", "a"]
generators = train_phrases(blob, common_terms=common_terms, min_count=5)

### Write a function that takes a `TextBlob` instance and phrase generators and returns a string of text
#### Unit tests first

In [37]:
def get_phrased_text(blob, generators):
    #must iterate across sentences
    return " ".join([w for s in blob.sentences for w in generators["trigram"][generators["bigram"][s.tokens]] ])

In [38]:
get_phrased_text(TextBlob("There is a mass in the left lower lobe, a 2.0 cm bvbs."), 
                generators)



'There is a mass in the left_lower_lobe'

In [39]:
TextBlob("There is a mass in the left lower lobe, a 2.0 cm bvbs.").tokens

WordList(['There', 'is', 'a', 'mass', 'in', 'the', 'left', 'lower', 'lobe', ',', 'a', '2.0', 'cm', 'bvbs', '.'])

In [None]:
assert_true()


In [None]:
assert_true()

In [40]:
phrased_txt = get_phrased_text(blob, generators)



## What phrases did we detect?

In [41]:
found_phrases = set([w for w in TextBlob(phrased_txt).words if "_" in w])
print(len(found_phrases))

2233


In [42]:
found_phrases

{'check_line_placement',
 'r/o_pe',
 'not_completely',
 'central_venous_line',
 'common_carotid',
 'same_date',
 'linear_atelectasis',
 'tubes_remain',
 'hip_fx',
 'heart_enlarged',
 'interval_improvement_in',
 'standard_placement',
 'gated_images',
 'date_mailed',
 'subcutaneous_edema',
 'anterior_tibial_artery',
 'ld_vertebral',
 'ct_guided',
 'low_lung',
 'tube_coiled',
 'one_view_portable',
 'feeding_tube_placement',
 'central_venous_catheter',
 'fatty_infiltration',
 'bone_destruction',
 'd-d_cm',
 'intrahepatic_ductal_dilatation',
 'small_bowel_obstruction',
 'supine_positioning',
 'septum_pellucidum',
 'has_been',
 'it_unclear',
 'abdomen_pelvis',
 'could_reflect',
 'bilateral_pulmonary_nodules',
 'more_prominent',
 'once_again',
 'asymmetric_pulmonary_edema',
 'dd_pm_date_mailed',
 'well_visualized',
 'sinus_disease',
 'no_significant_change_since',
 'bowel_ischemia',
 'pa_cath',
 'improvement_in',
 'lateral_aspect',
 'gated_wall_motion_images',
 'may_helpful',
 'bilateral_laye

### How often did each phrase occur?

In [43]:
from collections import Counter

In [44]:
phrased_blob = TextBlob(phrased_txt)

In [51]:
counter = Counter(phrased_blob.words)
counter.most_common(20)  # "d" and "dd" are digits that were masked during preprocessing

[('d', 5619),
 ('impression', 4096),
 ('with', 3492),
 ('in', 3027),
 ('right', 2403),
 ('dd', 2144),
 ('left', 1946),
 ('there', 1650),
 ('clip', 1309),
 ('date', 1242),
 ('to', 1102),
 ('no', 1078),
 ('clip_reason', 1039),
 ('chest', 1031),
 ('on', 996),
 ('or', 983),
 ('reason_this_examination', 740),
 ('admitting_diagnosis', 704),
 ('this', 664),
 ('within', 646)]

In [54]:
word_map_phrases = dict(zip(counter.keys(), range(len(counter))))

In [45]:
counted_phrases = Counter([w for w in phrased_blob.words if "_" in w])
counted_phrases

Counter({'check_line_placement': 11,
         'r/o_pe': 9,
         'central_venous_line': 28,
         'common_carotid': 3,
         'same_date': 11,
         'lower_lobe_consolidation': 22,
         'linear_atelectasis': 16,
         'tubes_remain': 2,
         'hip_fx': 9,
         'heart_enlarged': 10,
         'interval_improvement_in': 28,
         'standard_placement': 2,
         'gated_images': 1,
         'not_completely': 6,
         'date_mailed': 20,
         "paget_'s_disease": 8,
         'subcutaneous_edema': 7,
         'proximal_jejunum': 17,
         'anterior_tibial_artery': 8,
         'ld_vertebral': 3,
         'ct_guided': 11,
         'fractures_identified': 9,
         'low_lung': 1,
         'tube_coiled': 2,
         'feeding_tube_placement': 42,
         'central_venous_catheter': 32,
         'fatty_infiltration': 6,
         'brachial_vein': 21,
         'bone_destruction': 9,
         'd-d_cm': 13,
         'intrahepatic_ductal_dilatation': 8,
         '

In [48]:
list(counted_phrases.most_common())[-500:]

[('extensive_degenerative_changes', 6),
 ('innominate_vein', 6),
 ('stenosis_origin', 6),
 ('should_pulled_back', 6),
 ('vascular_redistribution', 6),
 ('last_cxr', 6),
 ('obtained_with_thallium', 6),
 ('mildly_enlarged', 6),
 ('degraded_by', 6),
 ('detailed_above', 6),
 ('intra-abdominal_ascites', 6),
 ('operative_report', 6),
 ('pulmonary_arteries', 6),
 ('stab_wound', 6),
 ('single-lumen_picc', 6),
 ('r/o_mets', 6),
 ('two_hours', 6),
 ('ij_line_placement', 6),
 ('double_lumen', 6),
 ('this_examination', 6),
 ('major_tributaries', 6),
 ('hazy_opacity', 6),
 ('evaluate_progression', 6),
 ('infant_with_increasing', 6),
 ('dth_rib', 6),
 ('overlies_proximal', 6),
 ('foci_abnormally', 6),
 ('cavity_size', 6),
 ('flow_signal', 6),
 ('femoral_vein', 6),
 ('cecum_not', 6),
 ('inguinal_region', 6),
 ('distended_gallbladder', 6),
 ('esophageal_cancer', 6),
 ('mra_neck', 6),
 ('without_radiologist_present', 6),
 ('chest_compared_to', 6),
 ('brain_imaging_was_obtained', 6),
 ('alveolar_infiltr

In [None]:
for phrase, count in list(counted_phrases.items())[:100]:
    print("%s\t%03d"%(phrase.ljust(40),count))


## Create a word vector vocabulary using only words and phrases that occur more than N times
### How to choose N?
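
One way to explore the choice of N is to see how the vocabulary size shrinks as the threshold rises. The sketch below does this on a made-up token list; `build_vocabulary` and `toy_tokens` are hypothetical names, and on the real data you would pass in the words of the phrased text.

```python
from collections import Counter


def build_vocabulary(tokens, min_occurrences):
    """Map each token that occurs more than min_occurrences times to an integer id."""
    counts = Counter(tokens)
    kept = sorted(w for w, c in counts.items() if c > min_occurrences)
    return {w: i for i, w in enumerate(kept)}


# Made-up tokens: one phrase seen 4 times, down to a typo seen once.
toy_tokens = ("left_lower_lobe " * 4 + "effusion " * 3 + "mass " * 2 + "bvbs").split()

# Vocabulary size for several candidate thresholds.
vocab_sizes = {n: len(build_vocabulary(toy_tokens, n)) for n in (0, 1, 2, 3)}
print(vocab_sizes)
```

A small N keeps rare tokens such as typos, while a large N discards uncommon but meaningful terms, so plotting the size against N on the real corpus is a reasonable starting point.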

### What is our vocabulary from phrased_txt (how many unique words)?

Why use `TextBlob.words` instead of just `phrased_txt.split()`?
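
Part of the answer shows up with plain Python: whitespace splitting leaves punctuation attached, so the same phrase can be counted under several spellings. In the sketch below, stripping punctuation from both ends is only a rough stand-in for a tokenizer that drops punctuation; it is not what TextBlob does internally.

```python
import string

sentence = "There is a mass in the left_lower_lobe, a 2.0 cm bvbs."

# Whitespace splitting keeps the trailing comma and period.
naive = sentence.split()

# Rough stand-in for a punctuation-aware tokenizer: strip punctuation
# from both ends of each token (interior characters are untouched).
stripped = [w.strip(string.punctuation) for w in naive]

print(naive)
print(stripped)
```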

#### Why is `phrased_blob = TextBlob(phrased_txt)` fast and `print(len(set(phrased_blob.words)))` slow?
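
The usual explanation is lazy evaluation: constructing the object only stores the string, and the expensive tokenization runs the first time `words` is requested. The sketch below imitates that pattern with a hypothetical `LazyText` class; it is an analogy, not TextBlob's actual code.

```python
from functools import cached_property


class LazyText:
    def __init__(self, raw):
        self.raw = raw            # construction only stores the string
        self.tokenize_calls = 0   # counts how often we really tokenize

    @cached_property
    def words(self):
        # Runs on first access only; the result is cached afterwards.
        self.tokenize_calls += 1
        return self.raw.split()


doc = LazyText("no significant change since prior study")
calls_before = doc.tokenize_calls   # nothing has been tokenized yet
first = doc.words                   # tokenization happens here
second = doc.words                  # served from the cache
print(calls_before, doc.tokenize_calls)
```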

In [None]:
phrased_blob = TextBlob(phrased_txt)

In [None]:
print(len(set(phrased_blob.words)))

In [None]:
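# Assumption: phrased_blob_count holds (token, count) pairs for every token
phrased_blob_count = counter.most_common()
sns.distplot([c[1] for c in phrased_blob_count if c[1] > 500])

In [None]:
# Assumption: lcounted_phrases is the phrase counter as a list of (phrase, count) pairs
lcounted_phrases = counted_phrases.most_common()
len([w for w in lcounted_phrases if w[1]>10])

In [None]:
# stop_words is not defined in this notebook; it must be supplied before running this cell
vwords = [w for w in lcounted_phrases if w[1]>100 and w[0] not in stop_words]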

In [None]:
len(vwords)

### Determining Similarity Between Reports
* CXR vs CT vs MR
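
The notebook does not say which similarity measure is intended. As one common option, the sketch below computes cosine similarity over bag-of-words counts, an assumption chosen for illustration. The function name and the three toy token lists are made up.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only tokens present in both reports contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


# Made-up phrased reports: two chest films and one MR.
cxr = "chest pa_and_lateral no pleural_effusion".split()
cxr2 = "chest no pleural_effusion no pneumothorax".split()
mr = "mr head multiplanar images brain".split()

print(cosine_similarity(cxr, cxr2))
print(cosine_similarity(cxr, mr))
```

With phrases joined by underscores, a token such as "pleural_effusion" counts as a single dimension, which is one way phrase detection can influence report similarity.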

In [None]:
rad_data[rad_data["text"].str.contains("MRI")]

## Create a Report Browser

In [None]:
num_reports = rad_data.shape[0]
while True:
    try:
        i = int(input("Enter a number between 0 and %d (anything else to quit): " % (num_reports - 1)))
        clear_output()

        if i < 0 or i >= num_reports:
            break
        # `digits` is the \d pattern compiled earlier; mask every digit with "d"
        txt = TextBlob(digits.sub("d", rad_data.iloc[i]['text'].strip().lower()))
        # Apply the bigram detector first, then the trigram detector
        phrased = generators["trigram"][generators["bigram"][txt.tokens]]
        display(HTML("<p>%s</p>" % " ".join(phrased)))

    except ValueError:
        break


In [None]:
type(txt)

## Wrangling Doesn't Always Do What You Want

>technique : multiplanar_td and td-weighted_images of the brain with gadolinium_according to standard departmental protocol .