# Drug Utilization EDA
Explore the drug utilization CSV found at https://data.medicaid.gov/api/1/datastore/query/eec7fbe6-c4c4-5915-b3d0-be5828ef4e9d/0/download?format=csv

In [1]:
import pandas as pd
import os

The preferred route to get all the raw data is via the Makefile.\
The following cell is just a catch-all to ensure the data is available for EDA.

In [2]:
path = 'raw_data/drug_utilization_2021.csv'
if not os.path.exists(path):
    import get_data
    get_data.download_drug_utiliztion(path)

In [3]:
df = pd.read_csv(path)

### What do we notice about the first 3 rows?

In [4]:
df.head(3)

Unnamed: 0,utilization_type,state,ndc,labeler_code,product_code,package_size,year,quarter,suppression_used,product_name,units_reimbursed,number_of_prescriptions,total_amount_reimbursed,medicaid_amount_reimbursed,non_medicaid_amount_reimbursed
0,FFSU,AK,2143380,2,1433,80,2021,4,False,TRULICITY,544.0,222.0,220042.29,215557.09,4485.2
1,FFSU,AK,2143480,2,1434,80,2021,4,False,TRULICITY,706.0,275.0,286542.68,281194.56,5348.12
2,FFSU,AK,2143611,2,1436,11,2021,4,False,EMGALITY P,27.0,27.0,16648.52,16648.52,0.0


Product names are more human-readable than NDC codes, so let's see how many we're dealing with.

In [5]:
# let's look at a specific record
df[df.ndc==2143380]['product_name'].unique()

array(['TRULICITY '], dtype=object)

The name comes back with trailing padding ('TRULICITY '), which can be kind of a pain. Let's strip that out and see if that changes our unique count.

In [6]:
# count of unique product_name before blanks stripped out
len(df['product_name'].unique())

15326

In [7]:
df['product_name'] = df['product_name'].str.strip()

# count of unique product_name after blanks stripped out - compare with the count above
len(df['product_name'].unique())

15325

The unique count drops by one (15326 to 15325), which means some names differed only by padding and are now merged. That is the effect we want, so we'll include stripping the padding in our cleaning step (clean.py), as it will make dealing with product name queries a little easier.
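As a rough sketch of what that cleaning step might look like, the snippet below strips the padding and lists which names were merged by it. The contents of clean.py are not shown here, so the function name `strip_product_name` and the toy frame are assumptions made purely for illustration.

```python
import pandas as pd


def strip_product_name(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with whitespace padding removed from product_name.

    Hypothetical helper; the real clean.py may be organised differently.
    """
    out = df.copy()
    out['product_name'] = out['product_name'].str.strip()
    return out


# Toy frame: the first two rows differ only by trailing padding
toy = pd.DataFrame({'product_name': ['TRULICITY ', 'TRULICITY', 'EMGALITY P']})

before = toy['product_name'].nunique()
cleaned = strip_product_name(toy)
after = cleaned['product_name'].nunique()

# Stripped names that map back to more than one raw spelling
raw_variants = toy.groupby(toy['product_name'].str.strip())['product_name'].nunique()
collapsed = raw_variants[raw_variants > 1].index.tolist()
```

Running the same `groupby` against the full data frame before stripping would show which product names account for the drop in the unique count.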

### General data set stats

In [8]:
df.describe()

Unnamed: 0,ndc,labeler_code,product_code,package_size,year,quarter,units_reimbursed,number_of_prescriptions,total_amount_reimbursed,medicaid_amount_reimbursed,non_medicaid_amount_reimbursed
count,5020759.0,5020759.0,5020759.0,5020759.0,5020759.0,5020759.0,2571171.0,2571171.0,2571171.0,2571171.0,2571171.0
mean,37549800000.0,37549.71,1386.811,22.32003,2021.0,2.511108,55677.04,571.795,66235.37,63319.87,2915.5
std,28251870000.0,28251.97,2162.05,27.59782,0.0,1.117977,10768420.0,5147.971,1307438.0,1281386.0,85685.18
min,2010102.0,2.0,0.0,0.0,2021.0,1.0,0.006,11.0,0.0,0.0,0.0
25%,781108900.0,781.0,185.0,1.0,2021.0,2.0,776.0,23.0,432.33,412.07,0.0
50%,47781070000.0,47781.0,500.0,10.0,2021.0,3.0,2368.0,56.0,1565.07,1486.29,0.0
75%,64679080000.0,64679.0,1079.0,31.0,2021.0,4.0,9480.0,198.0,7544.5,7117.82,54.79
max,100000000000.0,99999.0,9999.0,99.0,2021.0,4.0,10280450000.0,1234377.0,518548900.0,514041100.0,36270570.0


The first thing I notice is that while there are about 5M rows, only about half of them (2,571,171 of 5,020,759) have entries for:
* units_reimbursed
* number_of_prescriptions
* total_amount_reimbursed
* medicaid_amount_reimbursed
* non_medicaid_amount_reimbursed
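
The share of populated rows per field can also be computed directly rather than read off the `describe()` counts. The following is a minimal sketch on a toy frame; the helper name `non_null_share` and the toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

amount_cols = [
    'units_reimbursed',
    'number_of_prescriptions',
    'total_amount_reimbursed',
    'medicaid_amount_reimbursed',
    'non_medicaid_amount_reimbursed',
]


def non_null_share(df: pd.DataFrame, cols: list) -> pd.Series:
    """Fraction of rows with a non-NA value, per column (hypothetical helper)."""
    # notna() gives a boolean frame; the mean of booleans is the share of True
    return df[cols].notna().mean()


# Toy frame: every field is populated in 2 of 4 rows
toy = pd.DataFrame({c: [1.0, np.nan, 2.0, np.nan] for c in amount_cols})
share = non_null_share(toy, amount_cols)
```

On the real data, `non_null_share(df, amount_cols)` should come out at roughly 0.51 for each field, given the counts above.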

Populated values in these fields seem to coincide with suppression_used being False. To confirm, we'll perform the following two tests:
1. Whenever suppression_used is true, all the above fields are na (no non na fields)
2. Whenever suppression_used is false, none of the above fields are na (no na fields)

##### 1. Whenever suppression_used is true, all the above fields are na (no non na fields)

In [9]:
# ensure count of suppression_used==True > 0
df[df['suppression_used']].shape[0]

2449588

In [10]:
# Find the intersection of suppression_used==True and any non-na values
# A count of 0 means no row has suppression_used True while any of the specified fields has a non-na value

df[(df['suppression_used']) & 
   (
       (~df['number_of_prescriptions'].isna()) | 
       (~df['units_reimbursed'].isna()) | 
       (~df['total_amount_reimbursed'].isna()) |
       (~df['medicaid_amount_reimbursed'].isna()) |
       (~df['non_medicaid_amount_reimbursed'].isna()) 
)].shape[0]

0

##### 2. Whenever suppression_used is false, none of the above fields are na (no na fields)

In [11]:
# ensure count of suppression_used==False > 0
df[~df['suppression_used']].shape[0]

2571171

In [12]:
# Find the intersection of suppression_used==False and any na values
# A count of 0 means no row has suppression_used False while any of the specified fields is na

df[(~df['suppression_used']) & 
   (
       (df['number_of_prescriptions'].isna()) | 
       (df['units_reimbursed'].isna()) | 
       (df['total_amount_reimbursed'].isna()) |
       (df['medicaid_amount_reimbursed'].isna()) |
       (df['non_medicaid_amount_reimbursed'].isna()) 
)].shape[0]

0
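
The two tests above can also be written more compactly with row-wise `isna().all(axis=1)` and `notna().all(axis=1)`. This is only a sketch of an alternative formulation, not the check used in this notebook; the helper name `suppression_matches_missingness` and the toy frames are assumptions.

```python
import numpy as np
import pandas as pd

amount_cols = [
    'units_reimbursed',
    'number_of_prescriptions',
    'total_amount_reimbursed',
    'medicaid_amount_reimbursed',
    'non_medicaid_amount_reimbursed',
]


def suppression_matches_missingness(df: pd.DataFrame, cols: list) -> bool:
    """True if suppressed rows are all-NA and unsuppressed rows have no NA."""
    suppressed = df['suppression_used']
    all_missing = df[cols].isna().all(axis=1)   # every field is NA in the row
    all_present = df[cols].notna().all(axis=1)  # no field is NA in the row
    return bool(all_missing[suppressed].all() and all_present[~suppressed].all())


# Toy frame that satisfies both tests
good = pd.DataFrame({c: [1.0, np.nan] for c in amount_cols})
good['suppression_used'] = [False, True]

# Toy frame that violates test 2: an unsuppressed row with a missing field
bad = good.copy()
bad.loc[0, 'units_reimbursed'] = np.nan

good_ok = suppression_matches_missingness(good, amount_cols)
bad_ok = suppression_matches_missingness(bad, amount_cols)
```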

#### Conclusion
Both tests return 0 rows, so suppression_used can be used as a filter: the above mentioned fields have non-na values exactly when suppression_used is False.
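
A minimal sketch of using the flag as a filter before aggregating is shown below. The helper name `unsuppressed` is hypothetical, and the toy values are borrowed from the `df.head(3)` output with one suppressed row added for illustration.

```python
import numpy as np
import pandas as pd


def unsuppressed(df: pd.DataFrame) -> pd.DataFrame:
    """Rows where suppression_used is False, i.e. rows with reported values."""
    return df[~df['suppression_used']].copy()


# Toy frame: one suppressed row with no reported prescription count
toy = pd.DataFrame({
    'product_name': ['TRULICITY', 'TRULICITY', 'EMGALITY P'],
    'suppression_used': [False, True, False],
    'number_of_prescriptions': [222.0, np.nan, 27.0],
})

reported = unsuppressed(toy)
total_prescriptions = reported['number_of_prescriptions'].sum()
```

Filtering first makes it explicit that any totals cover only the unsuppressed rows, rather than relying on pandas silently skipping NA values during aggregation.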