## Healthcare Bluebook Initial Importing and Pickling
**Files to import**
**Refer to class_notebook pickles_and_chunks**

1. Physician & Other Supplier Payments   
https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier2017   
Downloaded in class_notebook pickles_and_chunks   
Only saved the resulting pickle file to data in HCBB repo  


2. Detailed Data Hospital Outpatient   
https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Outpatient   

3. Detailed Data APC to CPT/HCPCS crosswalk, Addendum B – January 2020 (correction files aren't necessary)   
https://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/HospitalOutpatientPPS/Addendum-A-and-Addendum-B-Updates

4. Zip Code to CBSA   
https://www.huduser.gov/portal/datasets/usps_crosswalk.html

5. Other   
https://Data.CMS.gov

In [1]:
import pandas as pd
import pickle
import pprint    # Pretty-printer; call as pprint.pprint(obj)

###  File #1 already read in and pickled - see details in class_notebook named pickles_and_chunks

- **Filename:  hcpcs_pay_99213**

### File #2 - Detailed Data Hospital Outpatient  

In [6]:
%%time

# Read the CSV in 1000-row chunks and keep only the first chunk as a sample.
# (DataFrame.append was removed in pandas 2.0, so the chunk is taken directly.)
chunks = pd.read_csv('../data/Provider_Outpatient_Hospital_Charge_Data_by_APC__CY2017.csv',
                     chunksize = 1000)
hosp_outpatient_df = next(chunks)    # first 1000 rows only
hosp_outpatient_df.head()

Wall time: 26.9 ms


Unnamed: 0,Provider_ID,Provider_Name,Provider_Street_Address,Provider_City,Provider_State,Provider_Zip_Code,Provider_HRR,APC,APC_Desc,Beneficiaries,CAPC_Services,Average_Total_Submitted_Charges,Average_Medicare_Allowed_Amount,Average_Medicare_Payment_Amount,Outlier_Services,Average_Medicare_Outlier_Amount
0,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5072,Level 2 Excision/ Biopsy/ Incision and Drainage,249.0,259,9575.01,1038.45,826.28,,
1,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5073,Level 3 Excision/ Biopsy/ Incision and Drainage,52.0,53,12578.28,1792.6,1423.25,,
2,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5091,Level 1 Breast/Lymphatic Surgery and Related P...,26.0,27,11337.61,2113.58,1683.99,0.0,0.0
3,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5092,Level 2 Breast/Lymphatic Surgery and Related P...,23.0,23,17116.16,3737.14,2977.55,0.0,0.0
4,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5112,Level 2 Musculoskeletal Procedures,17.0,17,7382.73,1029.46,820.21,0.0,0.0
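
The cell above keeps only the first chunk. A minimal sketch of how the same chunking pattern could collect several chunks and combine them with `pd.concat` follows. The helper name `read_first_chunks` and the in-memory demo CSV are assumptions for illustration; they are not part of the class notebook.

```python
import io

import pandas as pd


def read_first_chunks(source, chunksize=1000, max_chunks=1):
    """Read up to max_chunks chunks of a CSV and return them as one DataFrame."""
    chunks = []
    for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize)):
        if i >= max_chunks:
            break
        chunks.append(chunk)
    # Concatenate once at the end instead of growing a DataFrame inside the loop
    return pd.concat(chunks, ignore_index=True)


# Hypothetical 25-row CSV held in memory so the sketch runs without the CMS file
demo_csv = io.StringIO(
    "Provider_ID,APC\n" + "\n".join(f"{10001 + i},5072" for i in range(25))
)
sample_df = read_first_chunks(demo_csv, chunksize=10, max_chunks=2)
print(sample_df.shape)
```

With the real file, `source` would be the CSV path used above, and raising `max_chunks` would pull in more rows.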


In [7]:
hosp_outpatient_df.shape

(1000, 16)

In [9]:
# To look at first row of data
hosp_outpatient_df.loc[0, :].values

array([10001, 'Southeast Alabama Medical Center',
       '1108 Ross Clark Circle', 'Dothan', 'AL', 36301, 'AL - Dothan',
       5072, 'Level 2 Excision/ Biopsy/ Incision and Drainage', 249.0,
       259, 9575.01, 1038.45, 826.28, nan, nan], dtype=object)

In [10]:
hosp_outpatient_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
Provider_ID                        1000 non-null int64
Provider_Name                      1000 non-null object
Provider_Street_Address            1000 non-null object
Provider_City                      1000 non-null object
Provider_State                     1000 non-null object
Provider_Zip_Code                  1000 non-null int64
Provider_HRR                       1000 non-null object
APC                                1000 non-null int64
APC_Desc                           1000 non-null object
Beneficiaries                      989 non-null float64
CAPC_Services                      1000 non-null int64
Average_Total_Submitted_Charges    1000 non-null float64
Average_Medicare_Allowed_Amount    1000 non-null float64
Average_Medicare_Payment_Amount    1000 non-null float64
Outlier_Services                   788 non-null float64
Average_Medicare_Outlier_Amount    788 non-null float64

In [18]:
# Change Provider_Zip_Code from int64 to object dtype
hosp_outpatient_df['Provider_Zip_Code'] = hosp_outpatient_df['Provider_Zip_Code'].astype('object')
hosp_outpatient_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
Provider_ID                        1000 non-null int64
Provider_Name                      1000 non-null object
Provider_Street_Address            1000 non-null object
Provider_City                      1000 non-null object
Provider_State                     1000 non-null object
Provider_Zip_Code                  1000 non-null object
Provider_HRR                       1000 non-null object
APC                                1000 non-null int64
APC_Desc                           1000 non-null object
Beneficiaries                      989 non-null float64
CAPC_Services                      1000 non-null int64
Average_Total_Submitted_Charges    1000 non-null float64
Average_Medicare_Allowed_Amount    1000 non-null float64
Average_Medicare_Payment_Amount    1000 non-null float64
Outlier_Services                   788 non-null float64
Average_Medicare_Outlier_Amount    788 non-null float64
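
Note that `astype('object')` keeps the values as integers inside an object column, so any leading zeros lost when the CSV was read are not restored. A small sketch of one possible alternative, converting to zero-padded 5-character strings, is shown here. The demo values other than 36301 are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: 501 and 2115 stand in for ZIP codes that start with zeros
zip_demo_df = pd.DataFrame({"Provider_Zip_Code": [36301, 501, 2115]})

# Convert to string, then left-pad with zeros to 5 characters
zip_demo_df["Provider_Zip_Code"] = (
    zip_demo_df["Provider_Zip_Code"].astype(str).str.zfill(5)
)
print(zip_demo_df["Provider_Zip_Code"].tolist())
```

Whether strings or integers are the better choice depends on how the ZIP column is stored in the files it will later be matched against.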

In [16]:
%%time
# Pickle the 1000-row sample DataFrame

hosp_outpatient_df.to_pickle("../data/hosp_outpatient.pkl")

Wall time: 2.99 ms
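
A pickled DataFrame can be loaded back with `pd.read_pickle`, which is presumably how the later EDA notebook would pick these files up. The round trip below is a sketch on a tiny made-up frame written to a temporary directory, not the actual `../data/hosp_outpatient.pkl` file.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the sample output above, used only as demo data
demo_df = pd.DataFrame({
    "APC": [5072, 5073],
    "Average_Medicare_Payment_Amount": [826.28, 1423.25],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "demo.pkl")
    demo_df.to_pickle(pkl_path)              # write
    restored_df = pd.read_pickle(pkl_path)   # read back

# dtypes and values survive the round trip
print(restored_df.equals(demo_df))
```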


In [17]:
print(hosp_outpatient_df.shape)
hosp_outpatient_df.head()

(1000, 16)


Unnamed: 0,Provider_ID,Provider_Name,Provider_Street_Address,Provider_City,Provider_State,Provider_Zip_Code,Provider_HRR,APC,APC_Desc,Beneficiaries,CAPC_Services,Average_Total_Submitted_Charges,Average_Medicare_Allowed_Amount,Average_Medicare_Payment_Amount,Outlier_Services,Average_Medicare_Outlier_Amount
0,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5072,Level 2 Excision/ Biopsy/ Incision and Drainage,249.0,259,9575.01,1038.45,826.28,,
1,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5073,Level 3 Excision/ Biopsy/ Incision and Drainage,52.0,53,12578.28,1792.6,1423.25,,
2,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5091,Level 1 Breast/Lymphatic Surgery and Related P...,26.0,27,11337.61,2113.58,1683.99,0.0,0.0
3,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5092,Level 2 Breast/Lymphatic Surgery and Related P...,23.0,23,17116.16,3737.14,2977.55,0.0,0.0
4,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5112,Level 2 Musculoskeletal Procedures,17.0,17,7382.73,1029.46,820.21,0.0,0.0


### 3. Detailed Data APC to CPT/HCPCS crosswalk, Addendum B – January 2020 (correction files aren't necessary)

This is a small file, so chunking is not needed. It is still pickled below.

In [32]:
## The chunksize keyword is not implemented in pd.read_excel, so this cell raises an error.
## The next cell imports the full file instead.


apc_to_cpt_crosswalk_df = pd.DataFrame()
chunk_nbr = 1

for chunk in pd.read_excel('../data/apc_to_cpt_crosswalk_2020_january_web_addendum_b_jan_2020_12312019.xlsx',
                         chunksize = 1000):
    if chunk_nbr == 1:
        apc_to_cpt_crosswalk_df = apc_to_cpt_crosswalk_df.append(chunk)
        chunk_nbr += 1
    else: break
apc_to_cpt_crosswalk_df.head()

NotImplementedError: chunksize keyword of read_excel is not implemented

In [33]:
apc_to_cpt_crosswalk_df = pd.read_excel('../data/apc_to_cpt_crosswalk_2020_january_web_addendum_b_jan_2020_12312019.xlsx')
apc_to_cpt_crosswalk_df.head()

Unnamed: 0.1,Unnamed: 0,Addendum B.-Final OPPS Payment by HCPCS Code for CY 2020,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9
0,,CPT codes and descriptions only are copyright ...,,,,,,,,
1,HCPCS Code,Short Descriptor,SI,APC,Relative Weight,Payment Rate,National Unadjusted Copayment,Minimum Unadjusted Copayment,Note: Actual copayments would be lower due to ...,* Indicates a Change
2,00100,Anesth salivary gland,N,,,,,,,
3,00102,Anesth repair of cleft lip,N,,,,,,,
4,00103,Anesth blepharoplasty,N,,,,,,,
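
In the output above, the real column names (HCPCS Code, Short Descriptor, ...) sit in row 1 rather than in the header. One way this could be cleaned up after reading is sketched below on a small hand-built frame that mimics the layout, with only four of the columns. The `header` parameter of `pd.read_excel` may achieve the same thing at read time, but the right row index would need to be checked against the actual file.

```python
import pandas as pd

# Hand-built stand-in for the raw read_excel result (subset of columns)
raw_df = pd.DataFrame(
    [
        [None, "CPT codes and descriptions only are copyright ...", None, None],
        ["HCPCS Code", "Short Descriptor", "SI", "APC"],
        ["00100", "Anesth salivary gland", "N", None],
        ["00102", "Anesth repair of cleft lip", "N", None],
    ],
    columns=[
        "Unnamed: 0",
        "Addendum B.-Final OPPS Payment by HCPCS Code for CY 2020",
        "Unnamed: 2",
        "Unnamed: 3",
    ],
)

header_row = 1  # assumption: position of the row holding the real column names

# Keep only the rows below the header row, then promote that row to column names
clean_df = raw_df.iloc[header_row + 1:].reset_index(drop=True)
clean_df.columns = raw_df.iloc[header_row].tolist()
print(clean_df.columns.tolist())
```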


In [34]:
# To pickle apc_to_cpt_crosswalk_df.
# (Not certain why this is needed, since file is small. Maybe for merging later?)
apc_to_cpt_crosswalk_df.to_pickle("../data/apc_to_cpt_crosswalk.pkl")

### 4. Zip Code to CBSA   
https://www.huduser.gov/portal/datasets/usps_crosswalk.html

In [29]:
zip_to_cbsa_032020_df = pd.read_excel('../data/ZIP_CBSA_032020.xlsx')
zip_to_cbsa_032020_df.head()

Unnamed: 0,ZIP,CBSA,RES_RATIO,BUS_RATIO,OTH_RATIO,TOT_RATIO
0,501,35620,0.0,1.0,0.0,1.0
1,601,38660,1.0,1.0,1.0,1.0
2,602,10380,1.0,1.0,1.0,1.0
3,603,10380,1.0,1.0,1.0,1.0
4,604,10380,1.0,1.0,1.0,1.0


In [30]:
# To pickle 
# (Not certain why this is needed, since file is small. Maybe for merging later?)
zip_to_cbsa_032020_df.to_pickle("../data/zip_to_cbsa_032020.pkl")
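
If the crosswalk is later merged onto the hospital data, the join might look something like the sketch below. Everything here is an assumption for illustration: the tiny frames, the CBSA values for ZIP 36301, the provider ID 99999, and the rule of keeping the CBSA with the largest `TOT_RATIO` when a ZIP appears more than once.

```python
import pandas as pd

# Hypothetical provider rows (column names follow the hospital outpatient file)
providers_df = pd.DataFrame({
    "Provider_ID": [10001, 99999],
    "Provider_Zip_Code": [36301, 601],
})

# Hypothetical crosswalk rows; ZIP 36301 is listed twice to mimic a split ZIP
zip_to_cbsa_df = pd.DataFrame({
    "ZIP": [601, 36301, 36301],
    "CBSA": [38660, 20020, 99999],
    "TOT_RATIO": [1.0, 0.9, 0.1],
})

# Keep one CBSA per ZIP: the row with the largest TOT_RATIO
best_cbsa_df = (
    zip_to_cbsa_df.sort_values("TOT_RATIO", ascending=False)
    .drop_duplicates(subset="ZIP", keep="first")
)

# Left merge so every provider row is kept even if its ZIP has no match
merged_df = providers_df.merge(
    best_cbsa_df[["ZIP", "CBSA"]],
    how="left",
    left_on="Provider_Zip_Code",
    right_on="ZIP",
)
print(merged_df[["Provider_ID", "Provider_Zip_Code", "CBSA"]])
```

Both key columns need the same type (both integers or both strings) for the rows to match.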

### 5. Other   https://Data.CMS.gov    
Nothing downloaded from this site yet.

###  EDA (using pickled files) will be done in new notebook, named hcbb_eda