## Main EDA file for teamwork - Team Phoenix
- Team members
    - Jack McCann (leader)
    - Nicole Muldowney
    - Teresa Whitesell
    - Ari Khursheed
    - Diego Alvarez
    - Lori Butler

## MVP

**Question from HCBB Presentation, Goal #2:**  
- Which procedures (HCPCS Codes) had the largest change in Average payment (using Average Medicare Allowed Amount, per pg 12 of HCBB presentation)?
- Which procedures (HCPCS Codes) had the largest change in Utilization (using Number of Distinct Medicare Beneficiary/Per Day Services)?   
 
**Stretch goal:**   
Include Hospitals in analysis. Remember to convert old APCs in the 2015 hospital file to match the 2016-2017 APC codes.   
- APC is a group of related procedures (see pg 26 of HCBB presentation for more info).      
    - All procedures within an APC are paid at the same rate.  
    - Not every HCPCS/CPT code is assigned to an APC, but the most important ones are.  
    - APCs were renumbered in 2016.    
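
A minimal sketch of the APC conversion, assuming a crosswalk table from the old to the new numbering is available. The column names (`apc`, `old_apc`, `new_apc`) and all code values are placeholders made up for illustration. They are not real APC numbers.

```python
import pandas as pd

# Hypothetical 2015 hospital rows; the codes are placeholders, not real APCs.
df_hospital_2015 = pd.DataFrame({
    "apc": ["A001", "A002", "A003"],
    "average_payment": [100.0, 250.0, 75.0],
})

# Hypothetical crosswalk from the 2015 numbering to the 2016-2017 numbering.
apc_crosswalk = pd.DataFrame({
    "old_apc": ["A001", "A002"],
    "new_apc": ["B101", "B102"],
})

old_to_new = dict(zip(apc_crosswalk["old_apc"], apc_crosswalk["new_apc"]))

# Map old codes to new ones; codes missing from the crosswalk become NaN,
# so they can be reviewed instead of being silently carried over.
df_hospital_2015["apc_2016"] = df_hospital_2015["apc"].map(old_to_new)

unmapped = df_hospital_2015[df_hospital_2015["apc_2016"].isna()]
print(unmapped)
```

Leaving unmapped codes as NaN makes it easy to list the APCs that still need a manual decision before the 2015 file is combined with 2016-2017.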

**MVP**  
- Static charts and/or interactive dashboard showing answers to questions:  
    - Largest Change in Ave. Payment over time  
    - Largest Change in Utilization  
        - Might also look at change in number of lives covered by Medicare as a whole, to see if there's useful information there (i.e., did utilization go up or down at a faster rate than the number of covered beneficiaries?)   
- Create dashboard that allows user to select   
    - A: specific HCPCS code/description (and maybe ACS for hospitals) change over the 3 years based on
    - B: amount, or utilization 

## Naming conventions

- **Columns:** Kept the full column name, replacing spaces and “/” with underscores and making no further edits.
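
A small sketch of how this column convention could be applied in pandas. The two raw column names are written in the style of the source files, and the lowercasing step is an assumption added to match the lowercase names used in the code cells of this notebook.

```python
import pandas as pd

# Toy frame with two raw column names in the style of the source files.
df = pd.DataFrame(columns=[
    "HCPCS Code",
    "Number of Distinct Medicare Beneficiary/Per Day Services",
])

# Replace spaces and "/" with underscores. Lowercasing is an assumption here,
# made to match the lowercase names used in the rest of this notebook.
df.columns = (
    df.columns
      .str.replace(" ", "_", regex=False)
      .str.replace("/", "_", regex=False)
      .str.lower()
)

print(list(df.columns))
```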

Dataframes:  
- **Initial DataFrames by year:**  
    - df_payments_20yy
- **Combined DataFrame** - concatenated, with separate row for each year (each NPI will have three rows):  
    - df_payments_combined
- **New Column for payment type description** (i.e., Doctor Only, Facility Only, or Doctor and Facility)
    - payment_type   


- **New DataFrame to show average payment - concatenated** (year as value in ‘year’ column)
    - df_avg_pmt
- **New DataFrame to show average of the number_of_distinct_medicare_beneficiary_per_day_services**:
    - df_med_services_day   


- **New DataFrame to show average payment - pivoted** (year as column head)
    - df_pmt_pvt
- **Dataframe of unique beneficiaries per day pivoted by year**
    - df_bpd_pvt


## How we handled nulls

##### All years: Discovered 427 rows have null in last name. 
- Researched this info on data.gov visualization tool. (See notes under dropped columns)
- All other information was there (first name, middle initial). 
- Decided to KEEP these rows since all other columns that were meaningful to us were not null.

##### 2015 - Problem with single row
- One row had irrelevant data in the last name field (“CPT Copyright 2014….”) and null values in all other columns.
- Not relevant to our project, so we dropped that row (done in the prior notebook, "step02...").
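
A minimal sketch of the null handling described above, on toy data. The column names `last_name` and `npi` are hypothetical stand-ins, and the actual drop was done in the earlier notebook, so this only illustrates the idea.

```python
import pandas as pd
import numpy as np

# Toy data; "last_name" and "npi" are hypothetical column names.
df = pd.DataFrame({
    "last_name": ["Smith", None, "CPT Copyright 2014"],
    "npi": [1001, 1002, np.nan],
    "hcpcs_code": ["99213", "99214", None],
    "average_medicare_allowed_amount": [70.5, 105.2, np.nan],
})

# Count nulls per column.
null_counts = df.isnull().sum()

# Rows where every column except the last name is null carry no usable data.
other_cols = df.columns.drop("last_name")
is_empty_row = df[other_cols].isnull().all(axis=1)

# Keep rows that are only missing a last name; drop the fully empty ones.
df_clean = df.loc[~is_empty_row].reset_index(drop=True)
print(df_clean.shape)
```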

In [None]:
import pandas as pd
import pickle
from glob import glob
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

## Read in Combined DataFrame

This combined df has already eliminated the one irrelevant row from 2015, and has added a column for payment_type (Doctor Only, Facility Only, Doctor & Facility).

In [None]:
df_payments_combined = pd.read_pickle('../data/pickled_files/payments_combined.pkl')
print(df_payments_combined.shape)


In [None]:
df_payments_combined.head()

In [None]:
# To export file to .csv for the purpose of creating visualizations in Tableau

df_payments_combined.to_csv('../data/pickled_files/payments_combined.csv', index = False)

## EDA of combined file

In [None]:
df_payments_combined.shape

In [None]:
df_payments_combined.columns

In [None]:
df_payments_combined.head()

In [None]:
df_payments_combined.info()

In [None]:
# To look at count of null values

df_payments_combined.isnull().sum()


# Results are as expected. We researched the 427 last name/org name nulls as a team earlier
# They don't have last/org name, many had first name (which we didn't import), yet all of them
# had full data in all other fields. We chose to keep the 427 rows with null last names
# since none of our final reporting requires knowing the name.


In [None]:
# To see the number of unique HCPCS codes

count_of_codes = df_payments_combined.hcpcs_code.nunique()
print("There are", count_of_codes, "unique HCPCS codes in 2015-2017")

In [None]:
# Statistics for number_of_services column. 
# Range from 2.4M to 7.1M min/max

# NOTABLE: 
#  Statistics for number_of_services column:  
#  Big difference between mean 2.4M and 50th percentile (median) 4.4M. 

df_payments_combined.number_of_services.describe()  

In [None]:
df_payments_combined.number_of_medicare_beneficiaries.describe()

In [None]:
df_payments_combined.number_of_distinct_medicare_beneficiary_per_day_services.describe()

In [None]:
df_payments_combined.average_medicare_allowed_amount.describe()

In [None]:
# Checking the min in above number...
# 6.03538e-05 = 6.035380 x 10^-05 = 0.00006035380
# Used converter here: https://www.calculatorsoup.com/calculators/math/scientific-notation-converter.php

df_payments_combined.average_medicare_allowed_amount.min()

In [None]:
# Tried to plot the full DataFrame, but it takes too long to run, so it is left commented out.
# df_payments_2015to2017.plot()

## Notable findings from EDA
### Update: These were all resolved/researched. None caused problems with the findings we are trying to discover.
- Summary statistics:  
    - Big difference between mean (average) and median (50th percentile) in these columns:
        - number_of_services
        - number_of_medicare_beneficiaries
        - number_of_distinct_medicare_beneficiary_per_day_services
        - average_medicare_allowed_amount

    - The minimum average Medicare allowed amount is less than 1 cent. That seems odd.
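
One way to look at a mean/median gap like this is sketched below on toy values (not the project data). The 1.5 x IQR fence is a common rule of thumb and is an assumption here, not something the team necessarily used.

```python
import pandas as pd

# Toy values: one very large value pulls the mean far from the median.
services = pd.Series([10, 12, 15, 18, 20, 5000], name="number_of_services")

summary = pd.Series({
    "mean": services.mean(),
    "median": services.median(),
    "p99": services.quantile(0.99),
    "skew": services.skew(),
})
print(summary)

# Values far above the interquartile range are candidates for a closer look.
q1, q3 = services.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = services[services > upper_fence]
print(outliers)
```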
    
===================================================================================


In [None]:
df_payments_combined.number_of_services.describe()

In [None]:
df_payments_combined.number_of_distinct_medicare_beneficiary_per_day_services.describe()

In [None]:
df_payments_combined.average_medicare_allowed_amount.describe()

# Thursday - new approach with Ari & Teresa

## Creating new df showing average payment by year, then by payment type, then by HCPCS code.  (Code from Nicole, from earlier in the week)

- **WEDNESDAY 6/3:** 
    - Team was still struggling this morning to get the data to work in Tableau (and had similar issues when I attempted to work with it in Power BI). We had been working with the full combined df.
        - We started by using the pivoted table (with a separate column for each year - with the year in the column header, not as a value in the table)
        - That didn't work, so yesterday we also tried using the combined df that had a single column for year, with the year 2015, 2016, 2017 as a value in the table. This worked a little better in Tableau, but there were still problems caused by trying to do the difference calculation in Tableau then also using it in a chart.
        - There was also a problem of trying to figure out how to find the **absolute** largest change, regardless of positive or negative.
    
- **THURSDAY 6/4:** 
    - We split up into teams, with Ari, Teresa and me going back to Tableau with a couple of ideas to try; and Nicole, Diego and Jack continuing to work in Tableau. Both sub-teams are now working on both questions (amount and utilization), trying to come up with a method to create the calculated differences we need.
    - Ari and Teresa took the approach of using the initial df_avg_pmt and pivoting it to create columns for 2015, 2016, 2017, then calculating the differences between years.
    - For comparison and validation, I took an alternate approach (that was suggested by Mahesh) of splitting the df_avg_pmt into 3 different dfs, by year; then merging just 2015 and 2017; then calculating the difference.
        - I successfully split the df_avg_pmt into 3 dfs by year, and renamed columns. My df had the same number of rows as the df that Ari and Teresa had (total of 8379 rows), so that part worked to validate the results. 
            - The 8379 rows are the rows that had values in both 2015 and 2017.
            - Mahesh had said that it was appropriate to use ONLY those years because it answers the specific business question that was asked (difference between 2015 to 2017).
        - I ran into difficulties when trying to merge and create the calculated columns. At that point I rejoined Ari and Teresa and we got their df to work. 
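
The pivot approach and the "absolute largest change" problem from the notes above can be sketched as follows. The frame `df_avg_pmt_demo` is a toy stand-in for `df_avg_pmt` with made-up values, and the `abs_change` column name is hypothetical.

```python
import pandas as pd

# Toy stand-in for df_avg_pmt (values are illustrative only).
df_avg_pmt_demo = pd.DataFrame({
    "year": [2015, 2017, 2015, 2017, 2015],
    "payment_type": ["Doctor Only"] * 5,
    "hcpcs_code": ["11111", "11111", "22222", "22222", "33333"],
    "average_medicare_allowed_amount": [100.0, 130.0, 80.0, 20.0, 50.0],
})

# One row per payment_type/hcpcs_code, one column per year.
pvt = df_avg_pmt_demo.pivot_table(
    index=["payment_type", "hcpcs_code"],
    columns="year",
    values="average_medicare_allowed_amount",
    aggfunc="mean",
).reset_index()

# Keep only codes with a value in both 2015 and 2017.
pvt = pvt.dropna(subset=[2015, 2017])

pvt["change_2017_2015"] = pvt[2017] - pvt[2015]
pvt["abs_change"] = pvt["change_2017_2015"].abs()

# Rank by size of change, regardless of direction.
largest = pvt.sort_values("abs_change", ascending=False)
print(largest[["hcpcs_code", "change_2017_2015"]])
```

Sorting on a separate absolute-value column keeps the signed change available for charts while still ranking increases and decreases together.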

## Next steps, Friday 6/5:
- With Ari, Teresa and me: Create total of 4 new dfs: 
    - df_pmt_melt_amt = 2-year change in ALLOWED AMOUNT, by AMOUNT
    - df_pmt_melt_pct = 2-year change in ALLOWED AMOUNT, by PERCENT
    - df_util_melt_number = 2-year change in BEN/DAY SERVICES, by NUMBER
    - df_util_melt_pct = 2-year change in BEN/DAY SERVICES, by PERCENT
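
A rough sketch of how the amount and percent versions of the 2-year change could be computed and reshaped, using toy values. Reading the `_melt` suffix as "long form via `DataFrame.melt`" is an assumption, as are the column names `change_amt`, `change_pct`, `measure`, and `value`.

```python
import pandas as pd

# Toy pivoted frame: one row per HCPCS code, one column per year.
df_pvt = pd.DataFrame({
    "hcpcs_code": ["11111", "22222"],
    "2015": [100.0, 80.0],
    "2017": [130.0, 20.0],
})

# Two-year change as an amount and as a percent of the 2015 value.
# A 2015 value of 0 would give inf or NaN and would need separate handling.
df_pvt["change_amt"] = df_pvt["2017"] - df_pvt["2015"]
df_pvt["change_pct"] = df_pvt["change_amt"] / df_pvt["2015"] * 100

# Melt to long form (one row per code and measure), which is an assumed
# reading of the "_melt" suffix in the planned DataFrame names.
df_melt = df_pvt.melt(
    id_vars="hcpcs_code",
    value_vars=["change_amt", "change_pct"],
    var_name="measure",
    value_name="value",
)
print(df_melt)
```

The same pattern would apply to the utilization column by swapping in the beneficiary/day services averages.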

In [None]:
# New approach Thursday: 
# STEP 1 create df_avg_pmt with only 4 columns, for all years

df_avg_pmt = df_payments_combined.groupby(['year',
                                           'payment_type',
                                           'hcpcs_code']).average_medicare_allowed_amount.mean().to_frame().reset_index()

In [None]:
df_avg_pmt.head()

In [None]:
# STEP 2: New df for 2015

df_avg_pmt_2015 = df_avg_pmt.loc[df_avg_pmt['year'] == 2015]
df_avg_pmt_2015.head()

In [None]:
df_avg_pmt_2015.columns = ['2015_year', 'payment_type', 'hcpcs_code', '2015_avg_medicare_allowed_amt']
df_avg_pmt_2015.head()

In [None]:
# STEP 2: New df for 2016

df_avg_pmt_2016 = df_avg_pmt.loc[df_avg_pmt['year'] == 2016]
df_avg_pmt_2016.head(2)

In [None]:
df_avg_pmt_2016.columns = ['2016_year', 'payment_type', 'hcpcs_code', '2016_avg_medicare_allowed_amt']
df_avg_pmt_2016.head(2)

In [None]:
# STEP 2: New df for 2017

df_avg_pmt_2017 = df_avg_pmt.loc[df_avg_pmt['year'] == 2017]
df_avg_pmt_2017.head(2)

In [None]:
df_avg_pmt_2017.columns = ['2017_year', 'payment_type', 'hcpcs_code', '2017_avg_medicare_allowed_amt']
df_avg_pmt_2017.head(2)

In [None]:
# Coming at the same thing Teresa and Ari are working on, and did get same result
# Total of 8379 rows on df that only has amounts in 2015 and 2017.
# Next step is to get the difference between 2015 and 2017.

df_avg_pmt_merge = pd.merge(df_avg_pmt_2015,
                            df_avg_pmt_2017,
                            how='inner',
                            on=['hcpcs_code',
                                'payment_type']
                           )
print(df_avg_pmt_merge.shape)
df_avg_pmt_merge.head(2)

In [None]:
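# rename() returns a new DataFrame, so assign the result back.
# Keys that are not present are ignored, so this handles the year columns
# with or without the _x/_y merge suffixes.
df_avg_pmt_merge = df_avg_pmt_merge.rename(columns = {'2015_year_x' : '2015',
                                                      '2015_year_y' : '2017',
                                                      '2015_year' : '2015',
                                                      '2017_year' : '2017'})
df_avg_pmt_merge.head(2)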

In [None]:
df_avg_pmt_merge['change_2017_2015'] = df_avg_pmt_merge['2017_avg_medicare_allowed_amt'] - df_avg_pmt_merge['2015_avg_medicare_allowed_amt']
df_avg_pmt_merge.head(2)

## End of work on Thursday 6/4/2020.  The code below is from work on prior days.

In [None]:
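# df_avg_pmt was already given a flat index when it was created above,
# so drop=True avoids adding a redundant "index" column

df_avg_pmt = df_avg_pmt.reset_index(drop=True)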

In [None]:
df_avg_pmt.head()

In [None]:
df_median_pmt = df_payments_combined.groupby(['year',
                                           'payment_type',
                                           'hcpcs_code']).average_medicare_allowed_amount.median().to_frame()

In [None]:
# To reset index

df_median_pmt = df_median_pmt.reset_index()

### Creating pivot table from above, to get one row per HCPCS by year/pmt type

In [None]:
pivot_index = ['payment_type',
               'hcpcs_code'
              ]

pivot_cols = ['year']

In [None]:
%%time
# The values column must not also be part of the index
df_median_pmt_pvt = df_median_pmt.pivot_table(index = pivot_index, 
                                              columns = pivot_cols, 
                                              values = 'average_medicare_allowed_amount', 
                                              aggfunc='median')
# Reset the index of the pivoted frame (not the unpivoted df_median_pmt)
df_median_pmt_pvt = df_median_pmt_pvt.reset_index()

### Creating new df showing average count (# distinct med. beneficiary/per day services) by year, then by payment type, then by HCPCS code.  (Code from Nicole)

In [None]:
df_med_services_day = df_payments_combined.groupby(['year',
                                                    'hcpcs_code']).number_of_distinct_medicare_beneficiary_per_day_services.mean().to_frame().reset_index()

In [None]:
df_med_services_day.head()

In [None]:
# Still seeing wide variance between mean and median (50th percentile). 
# Need to look for outliers before plotting.

df_avg_pmt.describe()

In [None]:
# Still seeing wide variance between mean and median (50th percentile). 
# Need to look for outliers before plotting.

df_med_services_day.describe()

### Started new notebook named step04_eda_and_analysis. 
- Worked with Teresa on that notebook, attempting to get to the point of calculating the variance between 2015 and 2017.