# Landry BlueBook
This is the notebook for my Project BlueBook code

# What's working?

- imported libraries<br>
- imported all .csv files, even the one that was really a .tsv<br>
- combined classification files into one dataframe<br>
- removed trailing spaces from classification dataframe keys<br>
- lowercased column titles and replaced spaces and \n with underscores

# What's not working?

- 

In [None]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [None]:
# set up dataframes
#   absolute path: /Users/landrybutler/github/healthcare-bluebook-project-bluebook/data/my_file.csv
#   relative path: ../data/my_file.csv

# paths to files
#     NOTE: providers_tsv is too large to store in git. It's kept outside this project folder in 'oversized_files'
providers_tsv = '/Users/landrybutler/github/oversized_files/Medicare_Provider_Util_Payment_PUF_CY2017.tsv'
outpatient_csv = '../data/MUP_OHP_R19_P04_V10_D17_APC_Provider.csv'
classification1_csv = '../data/508-Compliant-Version-of-2020_january_web_addendum_b.12312019.csv'
classification2_csv = '../data/2020_january_web_addendum_b.12312019.csv'
cbsa_csv = '../data/ZIP_CBSA_032020.csv'

# NOTE: providers_tsv is a TAB-DELIMITED file, so use sep='\t'
providers = pd.read_csv(providers_tsv, sep='\t', low_memory=False) 
outpatient = pd.read_csv(outpatient_csv, low_memory=False) 
classification1 = pd.read_csv(classification1_csv) 
classification2 = pd.read_csv(classification2_csv) 
cbsa = pd.read_csv(cbsa_csv) 

# NOTE: the first line in providers_tsv is copyright info
providers.head()

In [None]:
providers.info()
# providers[['provider_type']]

In [None]:
# how can I join the classification files? 
# look at df.head()

classification1.head()

In [None]:
classification2.head()

In [None]:
# look at df.info()

classification1.info()

In [None]:
classification2.info()

NOTE: based on a quick examination of head() and info(), both files appear to be laid out in the same manner:
        - HCPCS Code
        - Short Descriptor
        - SI
        - APC
        - Relative Weight
        - Payment Rate
        - National Unadjusted Copayment
        - Minimum Unadjusted Copayment
        - Column1 or Unnamed: 8
        - Column2 or Unnamed: 9
        - Column3 or Unnamed: 10
            
The range index and memory usage are the same for both files. I wonder if they contain duplicate info?
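
One way to test the duplicate hunch is to compare the columns the two files share. This is only a minimal sketch: `a` and `b` are tiny made-up frames standing in for classification1 and classification2, and every value in them is invented for illustration.

```python
import pandas as pd

# Toy stand-ins for classification1 / classification2 (values are made up)
a = pd.DataFrame({'HCPCS Code': ['0001A', '0002B'],
                  'APC ': [1001, 1002],
                  'Column1': [None, None]})
b = pd.DataFrame({'HCPCS Code': ['0001A', '0002B'],
                  'APC ': [1001, 1002],
                  'Unnamed: 8': [None, None]})

# Columns that appear under the same name in both frames
shared = [c for c in a.columns if c in b.columns]

# True only if the shared columns hold identical values, row for row
same_shared = a[shared].equals(b[shared])
print(shared, same_shared)
```

If the check comes back True on the real files, the only difference between them would be the names of the trailing empty columns.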

In [None]:
# Inner join on HCPCS Code plus the other seven shared columns:
#   a row that appears in both files is kept once (assuming no repeats within a file)
# NOTE: Trailing spaces were found in column names, they're included below and will be removed later
# memory usage after join only increased by 0.5 MB

classifications = pd.merge(left=classification1, right=classification2, 
                           how='inner', 
                           on=['HCPCS Code','Short Descriptor','SI','APC ',
                               'Relative Weight','Payment Rate ','National Unadjusted Copayment ',
                              'Minimum Unadjusted Copayment '])
classifications.info()

In [None]:
# Rename columns to remove trailing spaces
# df.rename(columns=lambda x: x.strip())
classifications = classifications.rename(columns=lambda x: x.strip())

classifications.keys()

In [None]:
# Look at outpatient dataframe
outpatient.info()

In [None]:
outpatient.head()

In [None]:
# column names contain newline characters, replace them with underscores
# also convert spaces to underscores and make lowercase
outpatient.columns = outpatient.columns.str.replace('\n', '_').str.replace(' ', '_').str.lower()
providers.columns = providers.columns.str.replace(' ', '_').str.lower()
classifications.columns = classifications.columns.str.replace(' ', '_').str.lower()
cbsa.columns = cbsa.columns.str.replace(' ', '_').str.lower()

# providers.head()

In [None]:
# Look at cbsa dataframe
cbsa.info()

In [None]:
cbsa.head()

In [None]:
cbsa.shape

The cbsa dataframe appears to be standard tabular data without any issues, so far.
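
One thing worth checking before trusting that impression is how the ZIP column was parsed, since ZIP codes that start with zero lose digits when read as integers. The sketch below uses made-up rows in the same column layout as the cbsa file (the CBSA and ratio values are invented) and reads them from an in-memory string rather than the real CSV.

```python
import io
import pandas as pd

# Made-up rows in the same column layout as the cbsa file
raw = (
    "ZIP,CBSA,RES_RATIO,BUS_RATIO,OTH_RATIO,TOT_RATIO\n"
    "00601,10380,1.0,1.0,1.0,1.0\n"
    "37203,34980,1.0,1.0,1.0,1.0\n"
)

# Default parsing infers integers, which drops leading zeros from ZIP codes
as_int = pd.read_csv(io.StringIO(raw))

# Reading ZIP as a string keeps all five digits
as_str = pd.read_csv(io.StringIO(raw), dtype={'ZIP': str})

print(as_int['ZIP'].tolist())
print(as_str['ZIP'].tolist())

# Quick sanity checks: total missing values and fully duplicated rows
print(as_str.isna().sum().sum(), as_str.duplicated().sum())
```

Passing the same `dtype` mapping to the real `pd.read_csv(cbsa_csv)` call would be one way to keep ZIPs comparable across dataframes.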

In [None]:
cbsa.tail()

# Observations

## Datafields in use
<b>providers</b><br>
    - Contains provider name and contact info, HCPCS and cost info
    - Joinability: 'HCPCS Code' is a common field with classifications
    - Fields:
       - npi
       - nppes_provider_last_org_name
       - nppes_provider_first_name
       - nppes_provider_mi
       - nppes_credentials
       - nppes_provider_gender
       - nppes_entity_code
       - nppes_provider_street1
       - nppes_provider_street2
       - nppes_provider_city
       - nppes_provider_zip
       - nppes_provider_state
       - nppes_provider_country
       - provider_type
       - medicare_participation_indicator
       - place_of_service
       - hcpcs_code
       - hcpcs_description
       - hcpcs_drug_indicator
       - line_srvc_cnt
       - bene_unique_cnt
       - bene_day_srvc_cnt
       - average_Medicare_allowed_amt
       - average_submitted_chrg_amt
       - average_Medicare_payment_amt
       - average_Medicare_standard_amt
<b>outpatient</b><br>
    - Contains provider name and contact info, procedures and costs
    - Joinability: 'Zip' is common field with cbsa and providers
    - Fields:
        - Provider ID
        - Provider Name
        - Provider Street Address
        - Provider City
        - Provider\nState
        - Provider\nZip Code
        - Provider\nHospital Referral Region\n(HRR)
        - APC
        - APC\nDescription
        - Beneficiaries
        - Comprehensive APC\nServices
        - Average\nEstimated\nTotal\nSubmitted\nCharges
        - Average\nMedicare\nAllowed\nAmount
        - Average\nMedicare\nPayment\nAmount
        - Outlier\nComprehensive\nAPC\nServices
        - Average\nMedicare\nOutlier\nAmount
<b>classifications</b><br>
    - Contains classification info about each procedure
    - Joinability: 'HCPCS Code' is a common field with providers
    - Fields:
        - HCPCS Code
        - Short Descriptor
        - SI
        - APC
        - Relative Weight
        - Payment Rate
        - National Unadjusted Copayment
        - Minimum Unadjusted Copayment
        - Column1
        - Column2
        - Column3
        - Unnamed: 8
        - Unnamed: 9
        - Unnamed: 10
<b>cbsa</b><br>
    - Contains Zip Code and Core Based Statistical Area (CBSA) data
    - Joinability: 'Zip' is a common field with outpatient and providers
    - Fields:
        - ZIP
        - CBSA
        - RES_RATIO
        - BUS_RATIO
        - OTH_RATIO
        - TOT_RATIO
    - What's this used for, do we need it? 


## These common fields can be used for joins
<b>HCPCS Code:</b> providers, classifications<br>
<b>Zip:</b> cbsa, outpatient, providers<br>
<b>Provider ID:</b> outpatient
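
A rough sketch of how these joins might look, using small made-up frames (`providers_demo`, `classifications_demo`, `cbsa_demo`) rather than the real data. The column names follow the lowercased names used earlier in the notebook; the values are invented, and the idea that provider ZIPs may carry a ZIP+4 suffix is an assumption that should be checked against the actual file.

```python
import pandas as pd

# Toy frames with invented values
providers_demo = pd.DataFrame({
    'npi': [1, 2, 3],
    'hcpcs_code': ['99213', '99214', 'G0008'],
    'nppes_provider_zip': ['372031234', '37203', '006011111'],
})
classifications_demo = pd.DataFrame({
    'hcpcs_code': ['99213', '99214'],
    'si': ['V', 'V'],
})
cbsa_demo = pd.DataFrame({'zip': ['37203', '00601'],
                          'cbsa': ['34980', '10380']})

# Left join keeps every provider row; _merge shows which rows found a match
joined = providers_demo.merge(classifications_demo, on='hcpcs_code',
                              how='left', indicator=True)

# Assumption: provider ZIPs may be ZIP+4, so keep only the first five digits
joined['zip'] = joined['nppes_provider_zip'].str[:5]
joined = joined.merge(cbsa_demo, on='zip', how='left')

print(joined[['npi', 'hcpcs_code', '_merge', 'zip', 'cbsa']])
```

Rows marked `left_only` in `_merge` are provider rows whose HCPCS code has no entry in the classification table, which is a quick way to see how complete the join is.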

## Potential things to look at
- <b>sort by Zip Code:</b>
    - what procedures are more popular in which locations?
    - which providers provide which procedures?
    - what's the average cost per zip code?
- <b>sort by Procedure:</b>
    - what's the average cost per procedure?
    - which procedures are outpatient vs. inpatient?
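
Several of these questions come down to a groupby. The sketch below shows the pattern on a tiny made-up frame (`demo`); the column names mirror the lowercased providers columns, and all numbers are invented. The means here are unweighted, so weighting by `line_srvc_cnt` might give a fairer average on the real data.

```python
import pandas as pd

# Toy data with invented values
demo = pd.DataFrame({
    'nppes_provider_zip': ['37203', '37203', '00601', '00601'],
    'hcpcs_code': ['99213', '99214', '99213', '99213'],
    'line_srvc_cnt': [10, 5, 20, 30],
    'average_medicare_payment_amt': [50.0, 80.0, 40.0, 60.0],
})

# Unweighted mean payment per procedure
cost_per_procedure = demo.groupby('hcpcs_code')['average_medicare_payment_amt'].mean()

# Unweighted mean payment per ZIP code
cost_per_zip = demo.groupby('nppes_provider_zip')['average_medicare_payment_amt'].mean()

# Most-billed procedure in each ZIP code, by total service count
counts = demo.groupby(['nppes_provider_zip', 'hcpcs_code'])['line_srvc_cnt'].sum()
top_procedure = counts.groupby(level='nppes_provider_zip').idxmax()

print(cost_per_procedure)
print(cost_per_zip)
print(top_procedure)
```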