# Florida Physician Violations: Dealing with ugly documents

I did most of the grunt work for you: scraped a database, downloaded a million PDFs, and automatically ran them through `convert` and `tesseract` in the hopes of coming out with readable text.

Unfortunately, **the text looks like trash**. Scans at weird angles, nothing looks right, text issues all over the place.

Maybe I'm looking for documents that deal with **opioids** - Vicodin, hydrocodone, oxycodone, etc. What are some techniques we can use to effectively search these documents? And most importantly, **is it different from what we did with the New York doctors?**

Maybe I'm looking for documents that deal with **sexual assault** or **sexual harassment**, which take a lot more forms than opioids, and may have indirect language used to describe them. Are our search techniques the same as above?

> **Tip:** When dealing with dirty text, you might want to use something like [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy), which we [covered in class](http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/fuzzing-matching-in-pandas-with-fuzzywuzzy/), or [jellyfish](https://github.com/jamesturk/jellyfish), which we didn't.
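
To make the tip concrete, here is a minimal sketch of fuzzy matching against OCR-damaged text. It uses Python's standard-library `difflib` as a stand-in for fuzzywuzzy, so the scores will not be identical to what fuzzywuzzy reports. The function names, the sample strings, and the `0.8` threshold are all assumptions for illustration, not values tuned on these documents.

```python
from difflib import SequenceMatcher

def best_word_ratio(term, text):
    """Return the highest similarity (0.0 to 1.0) between `term` and any
    whitespace-separated word in `text`."""
    term = term.lower()
    best = 0.0
    for word in text.lower().split():
        ratio = SequenceMatcher(None, term, word).ratio()
        best = max(best, ratio)
    return best

def fuzzy_contains(term, text, threshold=0.8):
    """True if some word in `text` is at least `threshold` similar to `term`.
    The 0.8 default is an arbitrary starting point, not a tuned value."""
    return best_word_ratio(term, text) >= threshold

# Made-up sample strings: one clean, one with an OCR-style misread (o -> 0)
clean = "prescribed oxycodone to the patient"
dirty = "prescribed oxyc0done to the patient"
unrelated = "failed to keep adequate medical records"

for sample in (clean, dirty, unrelated):
    print(fuzzy_contains("oxycodone", sample), "-", sample)
```

Comparing word by word is slow on long documents, and a lower threshold will pull in more false positives, so it is worth spot-checking the matches by hand.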

> **Question:** Is machine learning really the best technique? Maybe, maybe not! There are a handful of other approaches I can think of off the top of my head, and not all of them are that technical.

I'm going to clean the data up a *little* down below, then it's up to you. You can search for whatever you'd like. Yes, there aren't that many documents, but that's just because it's soooo slow to convert to JPG and run tesseract!

## Combining the documents and the violations records

### Step 1.1: Read in the violations

In [39]:
import pandas as pd
violations_df = pd.read_csv("florida-violations.csv", dtype={'case': str})
violations_df.head(2)

Unnamed: 0,action_date,action_taken,case,case_url,city,county,lic_number,lic_url,name,profession,state
0,06/26/2001,Obligations Imposed,199615776,https://appsmqa.doh.state.fl.us/MQASearchServi...,ORMOND BEACH,VOLUSIA,27057,https://appsmqa.doh.state.fl.us/MQASearchServi...,"PARIKH, MADHUSUDAN",Medical Doctor,FL
1,03/27/2001,Obligations Imposed,199955593,https://appsmqa.doh.state.fl.us/MQASearchServi...,AUCKLAND,UNKNOWN,52282,https://appsmqa.doh.state.fl.us/MQASearchServi...,"WIMBROW, THOMAS",Medical Doctor,ZZ


### Step 1.2: Read in the converted PDFs

In [40]:
import glob

filenames = glob.glob("converted-docs/*")
# Read each OCR'd text file, closing the file handle when done
contents = []
for filename in filenames:
    with open(filename) as f:
        contents.append(f.read())
docs_df = pd.DataFrame({
    'filename': filenames,
    'contents': contents
})
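# The case number is the filename without the folder or the .txt extension
# (raw string with an escaped dot, so "." only matches a literal period)
docs_df['case'] = docs_df.filename.str.extract(r"converted-docs/(.*)\.txt", expand=False)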
docs_df.head(2)

Unnamed: 0,contents,filename,case
0,6M}:\n\nFILED\n\nDepartment of Professional Re...,converted-docs/100040.txt,100040
1,\n\nSTATE OF FLORIDA\nDEPARTMENT OF BUSINESS ...,converted-docs/100142.txt,100142


### Step 1.3: Merge the two

In [41]:
# Inner join: keeps only documents whose case number appears in the violations
df = docs_df.merge(violations_df, on='case')
df.head(2)

Unnamed: 0,contents,filename,case,action_date,action_taken,case_url,city,county,lic_number,lic_url,name,profession,state
0,6M}:\n\nFILED\n\nDepartment of Professional Re...,converted-docs/100040.txt,100040,08/16/1990,Probation-App Rpts/Screens Req,https://appsmqa.doh.state.fl.us/MQASearchServi...,JACKSONVILLE,DUVAL,26676,https://appsmqa.doh.state.fl.us/MQASearchServi...,"DRUCKER, MICHAEL",Medical Doctor,FL
1,\n\nSTATE OF FLORIDA\nDEPARTMENT OF BUSINESS ...,converted-docs/100142.txt,100142,03/25/1994,Voluntary Surrender,https://appsmqa.doh.state.fl.us/MQASearchServi...,CORAL GABLES,MIAMI-DADE,34265,https://appsmqa.doh.state.fl.us/MQASearchServi...,"RICO-PEREZ, MANUEL",Medical Doctor,FL


In [42]:
# How many do we have?
df.shape

(190, 13)

## Finding the documents

This part is up to you! What topic are you trying to find?

In [43]:
df.head()

Unnamed: 0,contents,filename,case,action_date,action_taken,case_url,city,county,lic_number,lic_url,name,profession,state
0,6M}:\n\nFILED\n\nDepartment of Professional Re...,converted-docs/100040.txt,100040,08/16/1990,Probation-App Rpts/Screens Req,https://appsmqa.doh.state.fl.us/MQASearchServi...,JACKSONVILLE,DUVAL,26676,https://appsmqa.doh.state.fl.us/MQASearchServi...,"DRUCKER, MICHAEL",Medical Doctor,FL
1,\n\nSTATE OF FLORIDA\nDEPARTMENT OF BUSINESS ...,converted-docs/100142.txt,100142,03/25/1994,Voluntary Surrender,https://appsmqa.doh.state.fl.us/MQASearchServi...,CORAL GABLES,MIAMI-DADE,34265,https://appsmqa.doh.state.fl.us/MQASearchServi...,"RICO-PEREZ, MANUEL",Medical Doctor,FL
2,FILED\n\nDepartment of Professional Reguiation...,converted-docs/100146.txt,100146,02/06/1990,Voluntary Surrender,https://appsmqa.doh.state.fl.us/MQASearchServi...,SCOTTSDALE,OUT OF STATE,23691,https://appsmqa.doh.state.fl.us/MQASearchServi...,"KRASNER, BERNARD",Medical Doctor,AZ
3,DEPARTMENT OF PROFESSIONAL Rmﬁzgk'rgjp W ‘E E ...,converted-docs/100197.txt,100197,01/02/1998,Obligations Imposed-Othr Penal,https://appsmqa.doh.state.fl.us/MQASearchServi...,AVENTURA,MIAMI-DADE,22998,https://appsmqa.doh.state.fl.us/MQASearchServi...,"TICKTIN, STEPHEN",Medical Doctor,FL
4,\n\n \n\nFlLvED\n\nDepartment of Prolcsséonal...,converted-docs/100415.txt,100415,08/13/1992,Suspension-Other Penalty Imposed,https://appsmqa.doh.state.fl.us/MQASearchServi...,WEST PALM BEACH,PALM BEACH,22484,https://appsmqa.doh.state.fl.us/MQASearchServi...,"PEOPLES, SUZANNE",Medical Doctor,FL
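
The previews above show typical OCR damage: stray ligatures (`ﬁ`), curly quotes, and runs of newlines. A light normalization pass before searching could be sketched as follows. The function name `normalize_ocr_text` is hypothetical, and this only tidies characters and whitespace. It does not repair misread letters such as "Reguiation".

```python
import re
import unicodedata

def normalize_ocr_text(text):
    """Light cleanup for OCR output: fold ligatures, straighten quotes,
    collapse whitespace, and lowercase."""
    # NFKC folds compatibility characters, e.g. the "ﬁ" ligature becomes "fi"
    text = unicodedata.normalize("NFKC", text)
    # Replace curly quotes with straight ones
    quotes = {"‘": "'", "’": "'", "“": '"', "”": '"'}
    text = text.translate(str.maketrans(quotes))
    # Collapse runs of whitespace (including newlines) into a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

print(normalize_ocr_text("\n\n \n\nFILED\n\nDepartment  of Professional"))
```

Since it works on plain strings, it could be applied to the whole column with something like `df['contents'].apply(normalize_ocr_text)`.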


## Using a simple word search

In [74]:
from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob

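def textblob_tokenizer(str_input):
    # Lowercase, split into tokens, then stem each token
    blob = TextBlob(str_input.lower())
    tokens = blob.tokens
    words = [token.stem() for token in tokens]
    return words

# Caveat: the tokenizer stems every token, so a vocabulary entry only matches
# if it is already in stemmed form. For example, 'harassment' is stemmed to
# 'harass', so the unstemmed entry below will not be counted.
sexual_misconduct = ['sexual', 'assault', 'harassment']
opioids = ['opioid', 'vicodin', 'hydrocodone', 'oxycodone']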


vec = CountVectorizer(
    stop_words='english',
    tokenizer=textblob_tokenizer,
    vocabulary=sexual_misconduct)

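matrix = vec.fit_transform(df['contents'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = vec.get_feature_names_out()
wordcount_df = pd.DataFrame(matrix.toarray(), columns=vocab)
# Show the documents with the most matches first
wordcount_df.sort_values(by=sexual_misconduct, ascending=False).head()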

Unnamed: 0,sexual,assault,harassment
68,35,0,0
87,35,0,0
13,10,0,0
168,10,0,0
163,8,0,0
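
The `assault` and `harassment` columns above are all zero, so it may help to cross-check with a search that does not depend on stemming. Below is a minimal sketch using prefix regular expressions from the standard library. The `PATTERNS` dictionary and the sample sentence are hypothetical and only for illustration.

```python
import re
from collections import Counter

# Hypothetical pattern list for illustration; extend it for your own topic.
# \w* lets each pattern match suffixes (harassed, harassment, sexually, ...)
PATTERNS = {
    "sexual": r"\bsexual\w*",
    "assault": r"\bassault\w*",
    "harass": r"\bharass\w*",
}

def count_patterns(text, patterns=PATTERNS):
    """Count case-insensitive matches of each labelled pattern in `text`."""
    counts = Counter()
    for label, pattern in patterns.items():
        counts[label] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# Made-up sentence, not taken from the documents
sample = "Respondent sexually harassed the patient. The harassment continued."
print(count_patterns(sample))
```

To use it on the notebook's data, something like `df['contents'].apply(lambda text: count_patterns(text)['harass'])` would give one count per document. Exact prefixes still miss OCR misreads, so this complements fuzzy matching rather than replacing it.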


## Trying to cluster the contents to see what we can learn

In [61]:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(stop_words='english',
                      max_df=0.75,
                      use_idf=False)

matrix = vec.fit_transform(df['contents'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vec.get_feature_names_out()
results = pd.DataFrame(matrix.toarray(), columns=terms)

number_of_clusters = 5
km = KMeans(n_clusters=number_of_clusters)
km.fit(matrix)

# For each cluster, print the 12 terms with the largest centroid weights
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(number_of_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :12]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

Cluster 0: mm 93 33 hearing disciplinary relinquishment dr appeal patient wynn aa district
Cluster 1: agreement patient consent stipulation parties failed incorporating stipulated care terms treatment waives
Cluster 2: patient records treatment tests failed care count similar examination results patients course
Cluster 3: education continuing stipulation hours period renewal rule american management documentation code 002
Cluster 4: patient committee agreement monitoring monitor terms supervisor consent stipulation supervision care provisions
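
Several of the top terms above (`mm`, `93`, `33`, `aa`, `002`) look like OCR noise rather than words. One option is a stricter tokenizer that drops digits and very short tokens before vectorizing. This is a sketch under assumptions: `clean_tokens` is a hypothetical helper, the minimum length of 3 is arbitrary (it also drops real abbreviations such as `dr`), and I have not checked whether the clusters actually improve.

```python
import re

def clean_tokens(text, min_length=3):
    """Return lowercase alphabetic tokens of at least `min_length` letters.
    Digits and punctuation act as separators and are discarded."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_length]

# The top terms of Cluster 0, used here as a quick check
print(clean_tokens("mm 93 33 hearing disciplinary dr 002 aa"))
```

It can be passed to the vectorizer through the same `tokenizer=` parameter used in the word search, for example `TfidfVectorizer(stop_words='english', max_df=0.75, use_idf=False, tokenizer=clean_tokens)`.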
