# Classifying doctor violations by hand, then by machine

You're looking out for certain types of doctor violations: keeping poor records, being addicted to drugs, or anything else. **You decide.**

**You're going to see how often doctors lose their license for that violation.** There are about 7000 records, though, and you ain't going to read all of them!

Steps:

1. **Classify some violations by hand**
1. Vectorize the **hand-classified violations**
1. Train a classifier on the **hand-classified violations**.
1. **Test the classifier**. If it's good, next step! If not, go back to training.
1. Vectorize the **unclassified violations**
1. Use the classifier to **predict the labels of the unclassified violations**
1. What actions were taken against those doctors?

It'll be magic!

In [501]:
import pandas as pd
import numpy as np

In [502]:
df = pd.read_csv("physicians-ny-violations.csv")
df = df.sample(frac=1).reset_index(drop=True) #Shuffle
df.head(2)

Unnamed: 0,action,date_updated,eff_date,first,last,lic_num,lic_type,middle,misconduct,order_pdf,restrictions,url,year_of_birth
0,Probation for three years. The physician assi...,09/24/2009,10/01/2009,Michael,Francis,9745,RPA,George,The physician assistant did not contest the ch...,https://apps.health.ny.gov/pubdoh/professional...,The physician assistant shall practice medicin...,https://apps.health.ny.gov/pubdoh/professional...,1969.0
1,License revocation.,03/11/2010,03/17/2010,Inna,Rozentsvit,209254,MD,,The Hearing Committee sustained the charge fin...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1958.0


## Step 1: Classify some by hand

If you had a CSV with some sort of key in common, you'd be able to just do a join. But we don't! So **I'm going to help you out**.
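
For comparison, here is a rough sketch of what that join would look like if a lookup file did exist. The `hand_labels` table and its `category` values are made up for illustration; only the `lic_num` column name comes from the dataset.

```python
import pandas as pd

# Hypothetical toy data: the real dataset has no such lookup table.
violations = pd.DataFrame({
    "lic_num": [9745, 209254, 211304],
    "action": ["Probation", "License revocation", "Interim order"],
})
hand_labels = pd.DataFrame({
    "lic_num": [9745, 209254],
    "category": ["NO", "YES"],  # invented labels, for illustration only
})

# A left join keeps every violation; rows without a match get NaN.
joined = violations.merge(hand_labels, on="lic_num", how="left")
print(joined)
```

Since no such file exists here, the labels have to come from reading the descriptions.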

I wrote this little script to help you **classify content by hand**. It prints each violation, then asks whether it's what you're looking for. If you type "y" or "Y" before hitting enter, that means YES; anything else means NO. Once it's done, it adds the results to the dataframe in a column called `category`.

In [503]:
number_to_classify_by_hand = 30

In [504]:
def is_what_you_want(row):
    # Show the misconduct description and wait for a typed answer
    response = input("\n------------\n\n{desc}\n\n\nIS THIS WHAT YOU'RE LOOKING FOR? y for YES ".format(desc=row.misconduct))
    if response == "y" or response == "Y":
        print("\n** Classified as YES **")
        return "YES"
    else:
        print("\n** Classified as NO **")
        return "NO"

# Reset category column
df['category'] = np.nan
df['category'] = df[:number_to_classify_by_hand].apply(is_what_you_want, axis=1)

df.category.value_counts()


------------

The physician assistant did not contest the charge of negligence on more than one occasion.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES n

** Classified as NO **

------------

The Hearing Committee sustained the charge finding the physician guilty of  having been convicted in the United States District Court, Eastern District of New York of making false statements relating to health care matters.  Previously on August 20, 2009, the physician's New York State medical license was summarily suspended by the New York State Commissioner of Health.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES n

** Classified as NO **

------------

This action is not disciplinary in nature.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES n

** Classified as NO **

------------

The physician did not contest the charge of having been disciplined by the Illinois State Division of Professional Regulation for issuing prescriptions for non-controlled substances for patients over the internet.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES


** Classified as NO **

------------

The physician agreed he could not successfully defend against at least one of the charges of negligence on more than one occasion and practicing while his ability was impaired.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES n

** Classified as NO **


NO     26
YES     4
Name: category, dtype: int64

In [505]:
df.head()

Unnamed: 0,action,date_updated,eff_date,first,last,lic_num,lic_type,middle,misconduct,order_pdf,restrictions,url,year_of_birth,category
0,Probation for three years. The physician assi...,09/24/2009,10/01/2009,Michael,Francis,9745,RPA,George,The physician assistant did not contest the ch...,https://apps.health.ny.gov/pubdoh/professional...,The physician assistant shall practice medicin...,https://apps.health.ny.gov/pubdoh/professional...,1969.0,NO
1,License revocation.,03/11/2010,03/17/2010,Inna,Rozentsvit,209254,MD,,The Hearing Committee sustained the charge fin...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1958.0,NO
2,Interim non-disciplinary order of conditions p...,12/07/2016,11/29/2016,Julia,Oweis,211304,MD,Y,This action is not disciplinary in nature.,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1969.0,NO
3,License surrender,03/04/2013,03/11/2013,James,Frede,256199,MD,R,The physician did not contest the charge of ha...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1951.0,NO
4,"Censure and reprimand and $2,500 fine. The ph...",05/05/2014,05/12/2014,Samuel,Grief,216343,MD,N,The physician did not contest the charge of ha...,https://apps.health.ny.gov/pubdoh/professional...,The physician must provide ninety (90) days no...,https://apps.health.ny.gov/pubdoh/professional...,1965.0,NO


## Step 2: Vectorize the violation descriptions

You want to **ONLY DO THIS WITH THE ONES YOU CLASSIFIED.**

In [506]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

vec = CountVectorizer(stop_words='english', ngram_range=(1,3))
# vec = TfidfVectorizer(
#     stop_words='english',
#     norm='l1',
#     min_df=0.2,
#     max_df=0.85)
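# Strip digits before vectorizing; recent pandas needs regex=True for a pattern
matrix = vec.fit_transform(df[:number_to_classify_by_hand].misconduct.str.replace(r'\d', '', regex=True))
# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn removed
features_df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())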
features_df.head()

Unnamed: 0,ability,ability impaired,abusing,abusing patient,acceptable,acceptable medical,acceptable medical practice,accurate,accurate patient,accurate patient records,...,york conspiracy commit,york making,york making false,york petit,york petit larceny,york state,york state board,york state commissioner,york state medical,york state supreme
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,2,0,1,1,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Step 3: Create a classifier and train a model using the violation descriptions

You want to **ONLY DO THIS WITH THE ONES YOU CLASSIFIED.** You'll also need to make the `category` column a number, probably.

And remember your test/train split!
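
One thing to watch with so few YES rows: a purely random split can leave the test set with no YES at all. A possible workaround, sketched here on toy arrays with the same 26/4 balance as the sample above, is the `stratify` argument of `train_test_split`. The names `X_toy` and `y_toy` are placeholders, not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 30 rows, 4 of them positive, like the hand-labeled sample
X_toy = np.arange(60).reshape(30, 2)
y_toy = np.array([1] * 4 + [0] * 26)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    stratify=y_toy,    # keep the YES/NO ratio similar on both sides
    random_state=0)    # fixed seed so the split is repeatable

print("YES in train:", y_tr.sum(), "YES in test:", y_te.sum())
```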

In [507]:
df['label'] = (df.category == "YES").astype(int)
df.head()

Unnamed: 0,action,date_updated,eff_date,first,last,lic_num,lic_type,middle,misconduct,order_pdf,restrictions,url,year_of_birth,category,label
0,Probation for three years. The physician assi...,09/24/2009,10/01/2009,Michael,Francis,9745,RPA,George,The physician assistant did not contest the ch...,https://apps.health.ny.gov/pubdoh/professional...,The physician assistant shall practice medicin...,https://apps.health.ny.gov/pubdoh/professional...,1969.0,NO,0
1,License revocation.,03/11/2010,03/17/2010,Inna,Rozentsvit,209254,MD,,The Hearing Committee sustained the charge fin...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1958.0,NO,0
2,Interim non-disciplinary order of conditions p...,12/07/2016,11/29/2016,Julia,Oweis,211304,MD,Y,This action is not disciplinary in nature.,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1969.0,NO,0
3,License surrender,03/04/2013,03/11/2013,James,Frede,256199,MD,R,The physician did not contest the charge of ha...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1951.0,NO,0
4,"Censure and reprimand and $2,500 fine. The ph...",05/05/2014,05/12/2014,Samuel,Grief,216343,MD,N,The physician did not contest the charge of ha...,https://apps.health.ny.gov/pubdoh/professional...,The physician must provide ninety (90) days no...,https://apps.health.ny.gov/pubdoh/professional...,1965.0,NO,0


In [508]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# What kind of classifier?
# clf = BernoulliNB()
#clf = MultinomialNB()
clf = DecisionTreeClassifier()

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features_df.values,
    df[:number_to_classify_by_hand].label,
    test_size=0.2) 

## Step 4: Test the classifier

How does it look? Remember, we're only using the classified ones so far!

**If you don't like its predicting ability**, go back up and play around with your vectorizer, and even with your classifier. There are a lot of options!

In [509]:
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.83333333333333337

In [510]:
clf.score(X_train, y_train)

1.0
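
Before trusting those scores, it may help to compare them with a do-nothing baseline. With 26 NO and 4 YES, a model that always answers NO already looks accurate. The sketch below uses toy arrays with that same balance (not the real features) and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix

# Toy labels with the same balance as the hand-labeled sample: 26 NO, 4 YES
y_toy = np.array([0] * 26 + [1] * 4)
X_toy = np.zeros((30, 1))  # features are irrelevant to this baseline

# Always predicts the most common class, here NO
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(round(baseline_accuracy, 3))

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_toy, baseline.predict(X_toy))
print(cm)
```

The bottom row of the confusion matrix is the one to look at: it shows how many real YES rows were found, which accuracy alone hides.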

## Step 5: Vectorize the unclassified violations

Now we need to vectorize the violations we didn't classify by hand.

You **DO NOT MAKE A NEW VECTORIZER**. You just use the one we already have! Also, you **DON'T FIT IT AGAIN!** You just transform. I hope you read this line, but I'll give you some code anyway.
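
Here is a tiny sketch of why that matters, using made-up sentences. `fit_transform` learns the vocabulary; `transform` reuses it, so the new matrix has exactly the same columns and words the vectorizer never saw are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented examples, only for illustration
train_docs = ["negligence on more than one occasion",
              "practicing while impaired"]
new_docs = ["gross negligence and fraud"]

toy_vec = CountVectorizer()
train_matrix = toy_vec.fit_transform(train_docs)  # learns the vocabulary
new_matrix = toy_vec.transform(new_docs)          # reuses it, learns nothing

# Same number of columns, so the classifier can read both
print(train_matrix.shape, new_matrix.shape)
```

A freshly fitted vectorizer would produce different columns, and the classifier would have no idea what they mean.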

In [511]:
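# Only the rows we did not label by hand
not_categorized = df[number_to_classify_by_hand:]
# Same digit-stripping as in Step 2, then transform only (no fitting); fillna keeps empty rows aligned
matrix = vec.transform(not_categorized.misconduct.fillna('').str.replace(r'\d', '', regex=True))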

## Step 6: Use the classifier to predict the labels of the unclassified violations

You **DON'T NEED A NEW CLASSIFIER**, use the one you have! You'll use `clf.predict`, and feed it... what? What does it need to predict the labels?

In [512]:
uncategorized_labels = list(clf.predict(matrix))
categorized_labels = list(df[:number_to_classify_by_hand]['label'])
df_labels =  categorized_labels + uncategorized_labels

### Step 6.2: Those labels are ugly

If you used a `LabelEncoder` to create your categories, you can feed the numbers to `le.inverse_transform` to get actual text back.
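
In case you did go the `LabelEncoder` route, a minimal sketch of the round trip looks like this. The variable name `le` matches the text above; the label list is made up.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Classes are sorted alphabetically, so NO becomes 0 and YES becomes 1
encoded = le.fit_transform(["NO", "YES", "NO", "NO"])
print(list(encoded))

# Turn predicted numbers back into text
decoded = le.inverse_transform([0, 1, 1])
print(list(decoded))
```

This notebook built the numbers by hand instead, so a plain dictionary does the same job.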

In [514]:
labels = {
    0: 'NO',
    1: 'YES'
}

df['label'] = df_labels
# Assign the result back; chained inplace replace is unreliable in recent pandas
df['label'] = df['label'].replace(labels)

In [515]:
df.head()

Unnamed: 0,action,date_updated,eff_date,first,last,lic_num,lic_type,middle,misconduct,order_pdf,restrictions,url,year_of_birth,category,label
0,Probation for three years. The physician assi...,09/24/2009,10/01/2009,Michael,Francis,9745,RPA,George,The physician assistant did not contest the ch...,https://apps.health.ny.gov/pubdoh/professional...,The physician assistant shall practice medicin...,https://apps.health.ny.gov/pubdoh/professional...,1969.0,NO,NO
1,License revocation.,03/11/2010,03/17/2010,Inna,Rozentsvit,209254,MD,,The Hearing Committee sustained the charge fin...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1958.0,NO,NO
2,Interim non-disciplinary order of conditions p...,12/07/2016,11/29/2016,Julia,Oweis,211304,MD,Y,This action is not disciplinary in nature.,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1969.0,NO,NO
3,License surrender,03/04/2013,03/11/2013,James,Frede,256199,MD,R,The physician did not contest the charge of ha...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1951.0,NO,NO
4,"Censure and reprimand and $2,500 fine. The ph...",05/05/2014,05/12/2014,Samuel,Grief,216343,MD,N,The physician did not contest the charge of ha...,https://apps.health.ny.gov/pubdoh/professional...,The physician must provide ninety (90) days no...,https://apps.health.ny.gov/pubdoh/professional...,1965.0,NO,NO


### Step 6.3: Put the category labels back into the original dataframe

In [516]:
df['category'] = df['label']
df.drop(columns='label', inplace=True)

## Step 7: What actions were taken against those doctors?

In [517]:
# alcoholic and drug addict doctors:
# Mostly License surrender and revocation
df[df['category'] == 'YES']['action'].value_counts()

License surrender     24
License revocation
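
Raw counts are hard to compare, and the `action` text varies (some rows end in a period, some carry extra sentences). One possible way to turn this into a "how often" number is sketched below on invented rows. Treating both surrender and revocation as "lost the license" is an assumption you may want to change.

```python
import pandas as pd

# Hypothetical toy rows; the real `df` has thousands
toy = pd.DataFrame({
    "category": ["YES", "YES", "YES", "NO", "YES"],
    "action": ["License surrender", "License revocation.",
               "Probation for three years.", "License surrender",
               "License surrender"],
})

flagged = toy[toy["category"] == "YES"]

# Share of each action among the flagged rows
shares = flagged["action"].value_counts(normalize=True)
print(shares)

# Assumption: surrender or revocation both count as losing the license
lost_license = flagged["action"].str.contains("surrender|revocation", case=False)
lost_share = lost_license.mean()
print(lost_share)
```

Running the same lines on the NO rows gives a comparison point for the violation you picked.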