# Physician datasets exploratory analysis

The Physician Compare <a href="https://data.medicare.gov/data/physician-compare">datasets</a> provide information about clinicians and facilities that are enrolled in Medicare. Please download all four flat CSV files. You may also find the "Downloadable Database Dictionary", under the "Get supporting documents" dropdown menu, helpful when answering these questions.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import statistics

# Raw strings keep the backslashes in the Windows paths from being read as escape sequences
ind = pd.read_csv(r'D:\Data\Physician_Compare_2015_Individual_EP_Public_Reporting___Performance_Scores.csv')
natdl = pd.read_csv(r'D:\Data\Physician_Compare_National_Downloadable_File.csv')
pubrep = pd.read_csv(r'D:\Data\Physician_Compare_2015_Group_Public_Reporting___Performance_Scores.csv')



How many clinicians are in the dataset? Each clinician has a unique NPI and PAC ID. However, there may be errors in the data. For this question, you will need to determine whether to identify clinicians using their NPI or PAC ID (or both).

In [2]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way
clinit = pd.concat([natdl, ind])
#print('The number of unique NPIs is: ', len(clinit['NPI'].unique()))
#print('The number of unique PAC IDs is: ', len(clinit['PAC ID'].unique()))

g = clinit.groupby('NPI')
n=0 # The number of extra (incorrect) PAC IDs
for NPI, g_df in g:
    n += len(g_df['PAC ID'].unique()) - 1
print('The number of unique PAC IDs corresponding to unique NPIs is: ', len(clinit['PAC ID'].unique())-n )
g = clinit.groupby('PAC ID')
n=0 # The number of extra (incorrect) NPIs
for PAC_ID, g_df in g:
    n += len(g_df['NPI'].unique()) - 1
print('The number of unique NPIs corresponding to unique PAC IDs is: ', len(clinit['NPI'].unique())-n )

The number of unique PAC IDs corresponding to unique NPIs is:  1078908
The number of unique NPIs corresponding to unique PAC IDs is:  1078908
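The consistency check above can also be written with `nunique`. The following is a minimal sketch on a made-up toy frame (the values are hypothetical and are not taken from the Physician Compare files); it lists the identifiers that break a one-to-one NPI to PAC ID mapping.

```python
import pandas as pd

# Hypothetical toy data: NPI 3 appears with two different PAC IDs
toy = pd.DataFrame({
    'NPI':    [1, 1, 2, 3, 3],
    'PAC ID': [10, 10, 20, 30, 31],
})

# Number of distinct PAC IDs seen for each NPI, and the reverse
pac_per_npi = toy.groupby('NPI')['PAC ID'].nunique()
npi_per_pac = toy.groupby('PAC ID')['NPI'].nunique()

# Identifiers that break the expected one-to-one mapping
ambiguous_npis = pac_per_npi[pac_per_npi > 1].index.tolist()
ambiguous_pacs = npi_per_pac[npi_per_pac > 1].index.tolist()

print('NPIs with more than one PAC ID:', ambiguous_npis)
print('PAC IDs with more than one NPI:', ambiguous_pacs)
```

If both lists come back empty on the real data, either identifier should give the same clinician count.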


What is the ratio of male to female clinicians?

In [3]:
g = natdl.groupby('Gender')
print('The ratio of male to female clinicians is: ', len(g.get_group('M').NPI.unique())/len(g.get_group('F').NPI.unique()))

The ratio of male to female clinicians is:  1.1742541194058955


What is the highest ratio of female clinicians to male clinicians with a given type of credential?

In [4]:
df_cre = natdl.copy()
df_cre = df_cre.dropna(subset=['Credential'])
df_cre = df_cre.drop_duplicates(subset='NPI', keep='first')
g = df_cre.groupby(['Credential','Gender'])
clinicians = []  # (credential, gender, count) tuples
for cred, g_cre in g:
    tup = cred + (len(g_cre),)
    clinicians.append(tup)
ratios = []
for tupf in clinicians:
    if tupf[1] == 'F':
        f = tupf[2]
        # Find the male count for the same credential, if there is one
        for tupm in clinicians:
            if (tupm[0] == tupf[0]) and (tupm[1] == 'M'):
                m = tupm[2]
                rat = tuple([tupf[0], f/m])
                ratios.append(rat)
ratios.sort(key=lambda tup: tup[1], reverse=True)
print('The highest ratio of female clinicians to male clinicians is in Credential %s, which is equal to:' 
      % ratios[0][0], ratios[0][1])

The highest ratio of female clinicians to male clinicians is in Credential CNM, which is equal to: 130.85714285714286
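The nested loops above amount to a credential-by-gender count table. As a rough sketch of the same idea, the example below uses `pd.crosstab` on a small invented frame (the rows are hypothetical, one per clinician). Credentials with no male clinicians are dropped so that the ratio is defined, which mirrors how the loop skips them.

```python
import pandas as pd

# Hypothetical toy data, one row per clinician
toy = pd.DataFrame({
    'Credential': ['CNM', 'CNM', 'CNM', 'MD', 'MD', 'MD', 'NP'],
    'Gender':     ['F',   'F',   'M',   'M',  'M',  'F',  'F'],
})

# Rows are credentials, columns are genders, cells are clinician counts
counts = pd.crosstab(toy['Credential'], toy['Gender'])

# Keep only credentials with at least one male clinician to avoid dividing by zero
counts = counts[counts['M'] > 0]
ratio = (counts['F'] / counts['M']).sort_values(ascending=False)

top_credential = ratio.index[0]
top_ratio = ratio.iloc[0]
print(top_credential, top_ratio)
```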


How many states have fewer than 10 healthcare facilities in this dataset? Include Washington D.C. and U.S. territories in this calculation.

In [5]:
g = pubrep.groupby('State')

stt = 0
for state, g_df in g:
    if len(g_df['Group PAC ID'].unique())<10:
        stt += 1
print(r'Among all states and US territories (including Washington DC), %d have fewer than 10 healthcare facilities' % stt)

Among all states and US territories (including Washington DC), 9 have fewer than 10 healthcare facilities
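The same count can be obtained without an explicit loop. This is a small sketch on invented data (the states, IDs, and the threshold of 3 are stand-ins chosen for illustration); it counts distinct facility IDs per state and then counts the states below the threshold.

```python
import pandas as pd

# Hypothetical toy data: (state, facility id) pairs, with a repeated facility in GU
toy = pd.DataFrame({
    'State':        ['CA', 'CA', 'CA', 'GU', 'GU', 'DC'],
    'Group PAC ID': [1,    2,    3,    4,    4,    5],
})

threshold = 3  # stand-in for the 10 used in the question

# Distinct facilities per state, so repeated rows for one facility count once
facilities_per_state = toy.groupby('State')['Group PAC ID'].nunique()
n_small = int((facilities_per_state < threshold).sum())
print(n_small)
```

Note that a state with no rows at all never appears in the grouped result, so it is not counted by either approach.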


All measure performance rates are on a 0 to 100 scale. Compute the average measure performance rate for each clinician (across all measures). Consider the distribution of these average rates for individuals who have at least 10 rates. What is the standard deviation of that distribution?

In [None]:
g = ind.groupby('NPI')
indpr = []
for NPI, g_df in g:
    if len(g_df['Measure Performance Rate'])>=10:
        indpr.append(g_df['Measure Performance Rate'].mean())
print("The standard deviation of clinicians' performance rate is: ", statistics.stdev(indpr))

The standard deviation of clinicians' performance rate is:  15.959754013918348
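For reference, the per-clinician averaging and filtering can be sketched with a single `groupby(...).agg(...)`. The frame below is invented and uses a minimum of 3 rates instead of 10 so that the example stays small. `Series.std()` uses the sample standard deviation (`ddof=1`), the same convention as `statistics.stdev`.

```python
import pandas as pd

# Hypothetical toy data: NPI 2 has only two rates and should be excluded
toy = pd.DataFrame({
    'NPI': [1, 1, 1, 2, 2, 3, 3, 3],
    'Measure Performance Rate': [90, 80, 100, 50, 70, 60, 60, 60],
})

min_rates = 3  # stand-in for the 10 used in the question

# Mean rate and number of rows for each clinician
per_npi = toy.groupby('NPI')['Measure Performance Rate'].agg(['mean', 'size'])

# Averages of the clinicians that have enough rates
eligible = per_npi.loc[per_npi['size'] >= min_rates, 'mean']

# Sample standard deviation of those averages
spread = eligible.std()
print(spread)
```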


What is the p-value of the linear regression of performance rates vs. graduation year? Consider the average performance rates (across all measures) of every doctor (MD) who graduated between 1973 and 2003 (inclusive). Only consider doctors who have at least 10 rates. For each graduation year, compute the mean of these rates. Assuming the relationship between graduation year and performance rates is linear, find the slope and determine if the relationship is significant. Return the p-value of the linear regression.

In [None]:
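natdl_gy = natdl[pd.notnull(natdl['Graduation year'])]
natdl_gy = natdl_gy.drop_duplicates(subset='NPI', keep='first')
# The question is about doctors only, so keep clinicians whose credential is MD
natdl_gy = natdl_gy[natdl_gy['Credential'] == 'MD']
pr_gry = pd.merge(ind[['NPI','Measure Performance Rate']], natdl_gy[['NPI','Graduation year']], on='NPI')
g = pr_gry.groupby('NPI')
prrt = []
gryr = []
for NPI, g_df in g:
    if (len(g_df)>=10) and (g_df['Graduation year'].iloc[0]>=1973) and (g_df['Graduation year'].iloc[0]<=2003):
        prrt.append(g_df['Measure Performance Rate'].mean())
        gryr.append(g_df['Graduation year'].iloc[0])
# Mean of the per-clinician averages for each graduation year
by_year = pd.DataFrame({'year': gryr, 'rate': prrt}).groupby('year')['rate'].mean()
# Regress performance rate (y) on graduation year (x)
slope, intercept, r_value, p_value, std_err = stats.linregress(by_year.index.values, by_year.values)
print('The p-value of performance rates vs. graduation year is: ', p_value)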
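To make the argument order of `stats.linregress` concrete, here is a minimal sketch on four invented yearly means (the numbers are hypothetical). The first argument is the explanatory variable (graduation year) and the second is the response (mean performance rate). Swapping them leaves the p-value unchanged in simple regression but changes the slope.

```python
from scipy import stats

# Hypothetical yearly means: graduation year (x) and mean performance rate (y)
years = [2000, 2001, 2002, 2003]
mean_rates = [80.0, 83.0, 83.0, 86.0]

# x first, y second: the slope is in rate points per graduation year
fit = stats.linregress(years, mean_rates)

# Same data with the arguments swapped, for comparison
swapped = stats.linregress(mean_rates, years)

print('slope:', fit.slope, 'p-value:', fit.pvalue)
print('swapped slope:', swapped.slope, 'p-value:', swapped.pvalue)
```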

What is the absolute difference in the average performance rates between doctors (MD) and nurse practitioners (NP)? For each clinician, use his or her average performance rates across all measures. Furthermore, only consider individuals who have at least 10 rates.

In [None]:
natdl_cr = natdl.drop_duplicates(subset='NPI', keep='first')
mdnp = pd.merge(ind[['NPI','Measure Performance Rate']], natdl_cr[['NPI','Credential']], on='NPI')
g = mdnp.groupby('Credential')
# Use md_df / np_df so the NP frame does not shadow the numpy alias np
md_df = g.get_group('MD')
np_df = g.get_group('NP')
md_pr = []
np_pr = []
g = md_df.groupby('NPI')
for NPI, g_df in g:
    if (len(g_df)>=10):
        md_pr.append(g_df['Measure Performance Rate'].mean())
g = np_df.groupby('NPI')
for NPI, g_df in g:
    if (len(g_df['Measure Performance Rate'])>=10):
        np_pr.append(g_df['Measure Performance Rate'].mean())
print('The difference between average performance rates of MDs vs. NPs is: ',
      abs(sum(md_pr) / float(len(md_pr))-sum(np_pr) / float(len(np_pr))))

What is the p-value of the difference in MD and NP performance rates from the previous question? Perform a two-sample t-test and compute the two-tailed p-value. Assume that the distributions are normal and have equal variance.

In [None]:
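# equal_var=True (the default) gives the pooled-variance Student's t-test the question asks for
print('The two-tailed p-value of the difference between MD and NP performance rates is: ',
      stats.ttest_ind(a= md_pr, b= np_pr, equal_var=True).pvalue)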
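As a sanity check on what the equal-variance t-test computes, the sketch below compares `stats.ttest_ind` with the pooled-variance formula written out by hand. The two samples are invented placeholders, not real MD or NP averages.

```python
import statistics
from scipy import stats

# Hypothetical per-clinician averages for two groups
md_rates = [1.0, 2.0, 3.0, 4.0, 5.0]
np_rates = [2.0, 3.0, 4.0, 5.0, 6.0]

# Two-sample t-test assuming equal variances (two-tailed p-value)
result = stats.ttest_ind(md_rates, np_rates, equal_var=True)

# The same statistic by hand: pooled sample variance, then the t ratio
n1, n2 = len(md_rates), len(np_rates)
pooled_var = ((n1 - 1) * statistics.variance(md_rates) +
              (n2 - 1) * statistics.variance(np_rates)) / (n1 + n2 - 2)
t_manual = (statistics.mean(md_rates) - statistics.mean(np_rates)) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5

print('scipy t:', result.statistic, 'manual t:', t_manual, 'p-value:', result.pvalue)
```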