# Phrases

So far we have only thought in terms of single words: "lower", "lobe", "University", "of", "Utah". In reality, though, multiple words often form one unit of thought: "University of Utah". Our word vectors will do a better job of representing our text if we first recognize these phrases. We are going to use the [gensim](https://radimrehurek.com/gensim/models/phrases.html) package to detect and transform these phrases.

For example, the sentence, "I am a faculty member in the departments of Biomedical Informatics and Radiology and Imaging Sciences at the University of Utah." would be transformed to "I am a faculty member in the departments of Biomedical_Informatics and Radiology_and_Imaging_Sciences at the University_of_Utah."

"Biomedical_Informatics" is an example of a **bigram phrase** and "University_of_Utah" is a **trigram phrase**. I guess "Radiology_and_Imaging_Sciences" is a quadgram phrase, but we will likely not try to detect phrases that long.

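To make the transformation concrete, here is a minimal sketch that joins phrases we already know about. The function name `join_known_phrases` and the hand-written phrase list are hypothetical and only for illustration; gensim learns the phrases from the text instead of being handed a list.

```python
def join_known_phrases(tokens, phrases, delimiter="_"):
    """Greedily merge the longest known phrase starting at each position."""
    phrase_set = {tuple(p.split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word phrases
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in phrase_set:
                out.append(delimiter.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so keep the single word
            out.append(tokens[i])
            i += 1
    return out

sentence = "I work in Biomedical Informatics at the University of Utah"
known = ["Biomedical Informatics", "University of Utah"]  # hand-picked for the example
joined = " ".join(join_known_phrases(sentence.split(), known))
print(joined)
```

The hard part, which this sketch skips entirely, is deciding which word sequences deserve to be phrases in the first place.
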
# Using the Gensim Phrases Module

In [1]:
%matplotlib inline

In [2]:
from nose.tools import assert_almost_equal, assert_true, assert_equal, assert_raises
from numbers import Number

## Upgrade to the latest version of gensim

In [3]:
#!conda install gensim -y

In [4]:
import pymysql
import pandas as pd
import getpass
from textblob import TextBlob
import re
from gensim.models.phrases import Phraser, Phrases
from IPython.display import clear_output, display, HTML
import pickle
import gzip
import seaborn as sns
from collections import Counter

In [5]:
import gensim
gensim.__version__

'2.2.0'

In [None]:
# The file is gzip-compressed, so open it with gzip rather than the built-in open
with gzip.open("rad_data.pickle.gz", "rb") as f0:
    rad_data = pickle.load(f0)
rad_data.head()

In [None]:
# The file is gzip-compressed, so open it with gzip rather than the built-in open
with gzip.open("rad_vocabulary.pickle.gz", "rb") as f0:
    word_map = pickle.load(f0)

## Let's recompute the impression column, but without converting to lowercase first

In [None]:
rad_data["impression"] = \
rad_data.apply(lambda row: get_impression(row["text"]), axis=1)

In [None]:
rad_data.shape

## What are our most common words?
### Hint: use a `Counter` and `most_common`
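
As a minimal sketch of the hint, here is `Counter` with `most_common` on a few made-up impressions. The `toy_impressions` list is invented for the example; in the notebook the text would come from `rad_data["impression"]`.

```python
from collections import Counter

# Toy data standing in for the real impression column
toy_impressions = [
    "no acute cardiopulmonary disease",
    "no acute intracranial abnormality",
    "mass in the left lower lobe",
]

# Count every whitespace-separated word across all impressions
word_counts = Counter(
    word for text in toy_impressions for word in text.split()
)

# The two most frequent words with their counts
top_two = word_counts.most_common(2)
print(top_two)
```

Splitting on whitespace is the crudest possible tokenizer, which is one reason to think about pre-processing next.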

### Write a function to pre-process our text

* Lower case?
* Digits?
* Strip dates/times?
* stop words?
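
One possible set of answers to these questions, as a sketch rather than the intended solution. The name `preprocess_sketch`, the date pattern, and the `date_token` placeholder are all assumptions made for illustration, and real report dates may be formatted differently.

```python
import re

# Hypothetical patterns for illustration; real report dates may look different
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
DIGIT_PATTERN = re.compile(r"\d")

def preprocess_sketch(txt, stop_words=None):
    """Lowercase, mask dates and digits, and optionally drop stop words."""
    txt = txt.lower()
    txt = DATE_PATTERN.sub("date_token", txt)   # whole dates become one token
    txt = DIGIT_PATTERN.sub("d", txt)           # remaining digits become "d"
    words = txt.split()
    if stop_words:
        words = [w for w in words if w not in stop_words]
    return " ".join(words)

example = preprocess_sketch("The CT on 3/14/2017 showed a 5 mm nodule",
                            stop_words={"the", "a"})
print(example)
```

Stop-word removal is off by default here because words like "of" and "the" can sit in the middle of phrases we may want to detect later.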

### But first, write unit tests to test whether `preprocess` is functioning correctly
#### Then write functionality to pass tests

You might want to use the `string` module

In [None]:
import string
string.ascii_uppercase
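
For instance, `string.punctuation` combined with `str.translate` can strip punctuation in one pass. This sketch keeps periods on the assumption that sentence boundaries will still be needed later; whether that is the right choice depends on how you pre-process.

```python
import string

# Remove every punctuation character except the period
to_remove = string.punctuation.replace(".", "")
strip_punct = str.maketrans("", "", to_remove)

cleaned = "No acute disease; follow-up in 6 months.".translate(strip_punct)
print(cleaned)
```

Note that the hyphen disappears too, so "follow-up" becomes "followup"; that may or may not be what you want.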

In [None]:
def preprocess(txt):
    pass

## Do we return a string?

In [None]:
assert_true(isinstance(preprocess("my name"), str))

## Do we remove what we intend to?

In [None]:
assert_equal()

In [None]:
assert_equal()

### Use our `preprocess` function to create a new column "clean_impression"

## Create a TextBlob from all the text in `rad_data["clean_impression"]`

In [None]:
blob = TextBlob(preprocess(" ".join(rad_data["clean_impression"])))


## Write a function `train_phrases` that will train bigram and trigram detectors

* We want to be able to ignore common terms in our phrase detection
* We want to be able to specify the minimum number of occurrences in our text to be considered a phrase
* Return a dictionary of detectors
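
To build some intuition for what `min_count` does, here is a pure-Python sketch of a count-based bigram score of the kind phrase detectors commonly use. The function `score_bigrams` and its `threshold` default are hypothetical; this is not gensim's code, and gensim's actual scoring, defaults, and handling of common terms may differ.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1, threshold=1.0):
    """Return {(a, b): score} for adjacent word pairs scoring above threshold."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue  # too rare to be considered a phrase
        # Pairs seen often relative to their individual words score high
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores

toy_sentences = [
    ["left", "lower", "lobe", "mass"],
    ["left", "lower", "lobe", "nodule"],
    ["lower", "extremity", "edema"],
    ["left", "lung", "clear"],
]
scores = score_bigrams(toy_sentences, min_count=1, threshold=1.0)
print(scores)
```

Raising `min_count` or `threshold` makes the detector more conservative, which is a useful thing to pin down in unit tests.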

### Write unit tests to determine whether `train_phrases` is working as expected

In [None]:
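def train_phrases(blob, common_terms=None, min_count=5):
    # One list of word tokens per sentence
    sentences = [s.words for s in blob.sentences]
    if common_terms is None:
        common_terms = []
    # Bigram detector: joins pairs of words that co-occur often enough
    phrases = Phrases(sentences, common_terms=common_terms, 
                      min_count=min_count)
    bigram = Phraser(phrases)
    # Trigram detector: trained on the bigram-transformed sentences,
    # with the same settings as the bigram detector
    trigram = Phraser(Phrases(bigram[sentences], common_terms=common_terms,
                              min_count=min_count))
    
    return {"bigram": bigram, "trigram": trigram}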
        

In [None]:
common_terms = ["of", "with", "without", "and", "or", "the", "a"]
generators = train_phrases(blob, common_terms=common_terms, min_count=5)

### Write a function that takes a `TextBlob` instance and phrase generators and returns a string of text
#### Unit tests first

In [None]:
def get_phrased_text(blob, generators):
    
    pass

In [None]:
get_phrased_text(TextBlob("There is a mass in the left lower lobe"), 
                generators)

In [None]:
assert_true()


In [None]:
assert_true()

In [None]:
phrased_txt = get_phrased_text(blob, generators)

## What phrases did we detect?

In [None]:
found_phrases = set([w for w in phrased_txt.split() if "_" in w])
print(len(found_phrases))

In [None]:
found_phrases

### How often did each phrase occur?

In [None]:
from collections import Counter

In [None]:
phrased_blob = TextBlob(phrased_txt)

In [None]:
counted_phrases = Counter([w for w in phrased_blob.words if "_" in w])
counted_phrases

In [None]:
for phrase, count in list(counted_phrases.items())[:100]:
    print("%s\t%03d"%(phrase.ljust(40),count))


## Create a word vector vocabulary using only words and phrases that occur more than N times
### How to choose N?
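
One way to approach the question is to look at how the vocabulary size changes as N grows. A minimal sketch, with invented counts in `toy_counts` and a hypothetical helper `vocab_size_by_threshold`:

```python
from collections import Counter

def vocab_size_by_threshold(counts, thresholds):
    """Map each N to the number of tokens occurring more than N times."""
    return {n: sum(1 for c in counts.values() if c > n) for n in thresholds}

# Made-up counts for illustration only
toy_counts = Counter({
    "no": 900,
    "acute": 450,
    "left_lower_lobe": 120,
    "lobe": 60,
    "gadolinium_according": 7,
    "pneumomediastinum": 2,
})

sizes = vocab_size_by_threshold(toy_counts, [1, 10, 100])
print(sizes)
```

A higher N drops noise such as accidental phrases, but it also drops rare terms that may be clinically meaningful, so there is no single right answer.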

### What is our vocabulary from phrased_txt (how many unique words)?

Why use `TextBlob.words` instead of just `phrased_txt.split()`?
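
A small experiment can help with this question. The sketch below uses a regular expression as a rough stand-in for a word tokenizer; it is not what TextBlob does internally, and TextBlob's tokenizer differs in its details.

```python
import re

text = "There is a mass in the left_lower_lobe. No effusion."

# Plain whitespace splitting keeps punctuation attached to words
split_tokens = text.split()

# A simple word tokenizer (stand-in only) separates words from punctuation
word_tokens = re.findall(r"\w+", text)

print(split_tokens)
print(word_tokens)
```

Compare how the phrase at the end of the first sentence shows up in each list, and think about what that does to the vocabulary count.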

#### Why is `phrased_blob = TextBlob(phrased_txt)` fast and `print(len(set(phrased_blob.words)))` slow?

In [None]:
phrased_blob = TextBlob(phrased_txt)

In [None]:
print(len(set(phrased_blob.words)))

In [None]:
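phrased_blob_count = Counter(phrased_blob.words).most_common()  # assumed definition: (word, count) pairs
sns.distplot([c[1] for c in phrased_blob_count if c[1] > 500])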

In [None]:
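lcounted_phrases = Counter(phrased_blob.words).most_common()  # assumed definition: (token, count) pairs
len([w for w in lcounted_phrases if w[1]>10])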

In [None]:
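stop_words = set(common_terms)  # assumption: reuse the common terms defined earlier as the stop list
vwords = [w for w in lcounted_phrases if w[1]>100 and w[0] not in stop_words]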

In [None]:
len(vwords)

### Determining Similarity Between Reports
* CXR vs CT vs MR

In [None]:
rad_data[rad_data["text"].str.contains("MRI")]

## Create a Report Browser

In [None]:
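num_reports = rad_data.shape[0]
rd = re.compile(r"\d")  # assumed definition: matches any single digit
while True:
    try:
        i = int(input("Enter a number between 0 and %d (anything else to quit): " % (num_reports - 1)))
        clear_output()

        if i < 0 or i >= num_reports:
            break
        # Replace every digit with "d" in the lowercased report, then tokenize
        txt = TextBlob(rd.sub("d", rad_data.iloc[i]['text'].strip().lower()))
        # Apply the bigram detector first, then the trigram detector
        phrased = generators["trigram"][generators["bigram"][txt.tokens]]
        display(HTML("<p>%s</p>" % " ".join(phrased)))
        
    except ValueError:
        break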


In [None]:
type(txt)

## Wrangling Doesn't Always Do What You Want

>technique : multiplanar_td and td-weighted_images of the brain with gadolinium_according to standard departmental protocol .
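
The `td` tokens in that output are presumably what is left of "t1" and "t2" once every digit has been replaced with the letter "d", as the `rd.sub` call in the report browser suggests. A minimal sketch of that effect, with an invented input string:

```python
import re

# Invented example text; the original report wording is an assumption
original = "t1 and t2-weighted images of the brain"

# Replace each digit with the letter "d"
masked = re.sub(r"\d", "d", original)
print(masked)
```

After masking, "t1" and "t2" are indistinguishable, and the phrase detector has also glued "gadolinium" to "according", which is not a phrase a radiologist would recognize.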