## Team code for loading and pickling data

### UPDATED 6/2/2020:
- Deleted an irrelevant row from the 2015 data (it had NaNs in almost all fields).
- This is done at this stage so that the .pkl file is clean.
- Also updated the "step02..." notebook to remove this step.

Data source: 
Physician & Other Supplier Payments   
https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier2017  

### Reading in 1 dataset
- Bringing in only specific columns, chosen after initial EDA and a look at the stated business requirements (from the HCBB presentation)
- Adding year column to each year's df

#### Columns to keep
- National Provider Identifier
- Last Name/Organization Name of the Provider
- Entity Type of the Provider
- City of the Provider
- Zip Code of the Provider
- State Code of the Provider
- Provider Type
- Place of Service
- HCPCS Code
- HCPCS Description
- Number of Services
- Number of Medicare Beneficiaries
- Number of Distinct Medicare Beneficiary/Per Day Services
- Average Medicare Allowed Amount
- ADD: Year (in each df on import)

In [None]:
import pandas as pd
import pickle
import numpy as np

### Tried importing via chunking and that took longer than NOT chunking. See bottom of notebook for code used for chunking experiment
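
For readers unfamiliar with chunked reads, here is a minimal sketch of the general pattern. It is not the team's experiment code; the tiny in-memory CSV and the names `csv_text`, `df_chunked`, and `df_whole` are made up for illustration.

```python
import io
import pandas as pd

# Made-up rows standing in for a large CSV file (illustration only)
csv_text = "HCPCS Code,Number of Services\n" + "\n".join(
    f"9921{i % 5},{i}" for i in range(10)
)

# Passing chunksize makes read_csv return an iterator of DataFrames
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)
df_chunked = pd.concat([chunk for chunk in reader], ignore_index=True)

# Reading the same data in a single call, for comparison
df_whole = pd.read_csv(io.StringIO(csv_text))
print(df_chunked.equals(df_whole))
```

Chunking mainly helps when a file does not fit in memory; when it does fit, the extra concat step can make the chunked read slower, which is consistent with the result noted above.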

In [None]:
%%time
# This step loads in only the columns we want, 
# adds a column for year, and 
# converts column headers to have no spaces or special characters
# This is for 2017. Years 2016 and 2015 are below.

cols = ['National Provider Identifier',
        'Last Name/Organization Name of the Provider',
        'Entity Type of the Provider',
        'City of the Provider',
        'Zip Code of the Provider',
        'State Code of the Provider',
        'Provider Type',
        'Place of Service',
        'HCPCS Code',
        'HCPCS Description',
        'Number of Services',
        'Number of Medicare Beneficiaries',
        'Number of Distinct Medicare Beneficiary/Per Day Services',
        'Average Medicare Allowed Amount']

df_payments_2017 = pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2017.csv', 
                               usecols = cols)
df_payments_2017['year'] = 2017
df_payments_2017.columns = df_payments_2017.columns.str.replace(" ", "_").str.replace("/", "_").str.lower()
df_payments_2017.head()

In [None]:
%%time

df_payments_2016 = pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2016.csv', 
                               usecols = cols)
df_payments_2016['year'] = 2016
df_payments_2016.columns = df_payments_2016.columns.str.replace(" ", "_").str.replace("/", "_").str.lower()
df_payments_2016.head()

In [None]:
%%time

df_payments_2015 = pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2015.csv', 
                               usecols = cols)
df_payments_2015['year'] = 2015
df_payments_2015.columns = df_payments_2015.columns.str.replace(" ", "_").str.replace("/", "_").str.lower()
df_payments_2015.head()
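
The three load cells above repeat the same steps, so they could be folded into one helper. The sketch below is one possible way to do that, not the code the team ran; `clean_column_names`, `load_payments`, and the in-memory sample CSV are hypothetical names and data used only for illustration.

```python
import io
import pandas as pd

def clean_column_names(columns):
    """Lowercase headers and replace spaces and slashes with underscores."""
    return (columns.str.replace(" ", "_", regex=False)
                   .str.replace("/", "_", regex=False)
                   .str.lower())

def load_payments(source, year, usecols):
    """Read one year's file, keep the selected columns, and tag rows with the year."""
    df = pd.read_csv(source, usecols=usecols)
    df["year"] = year
    df.columns = clean_column_names(df.columns)
    return df

# Tiny in-memory stand-in for a CMS file (made-up values, illustration only)
sample_csv = io.StringIO(
    "National Provider Identifier,Last Name/Organization Name of the Provider,Unused Column\n"
    "1111111111,EXAMPLE,x\n"
)
sample_cols = ["National Provider Identifier",
               "Last Name/Organization Name of the Provider"]

df_sample = load_payments(sample_csv, 2017, sample_cols)
print(df_sample.columns.tolist())
```

With the real files, `source` would be the CSV path and `usecols` would be the `cols` list defined earlier.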


## UPDATE:  6/2/2020  To remove irrelevant row from 2015 file (all values are NaN)

In [None]:
# To find the row that needs to be dropped from 2015: it has irrelevant text in the last name field
# (like a footnote) and no other data in any column

df_payments_2015[df_payments_2015.national_provider_identifier == 1]

In [None]:
# To drop the irrelevant row, index # 7205022

df_payments_2015 = df_payments_2015.drop(labels = 7205022)

In [None]:
# To ensure that the irrelevant row was dropped (CONFIRMED)

df_payments_2015[df_payments_2015.national_provider_identifier == 1]
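
Dropping by a hard-coded index label only works while the file's row order stays the same. A possible alternative, sketched here on made-up data, is to filter on the same condition used to find the row. The frame `df_demo` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: one footnote-like row (identifier == 1) and two ordinary rows
df_demo = pd.DataFrame({
    "national_provider_identifier": [1111111111, 1, 2222222222],
    "last_name_organization_name_of_the_provider": ["PROVIDER A", "footnote text", "PROVIDER B"],
    "number_of_services": [27.0, np.nan, 15.0],
})

# Boolean mask for the footnote row, then keep everything else
is_footnote = df_demo["national_provider_identifier"] == 1
df_clean = df_demo.loc[~is_footnote].copy()

# Confirm that no footnote rows remain
remaining = int((df_clean["national_provider_identifier"] == 1).sum())
print(remaining)
```

This keeps the original index labels of the surviving rows, just as `drop(labels=...)` does.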

## Create pickle file for each year

In [None]:
df_payments_2017.to_pickle('../data/pickled_files/payments_2017.pkl')

In [None]:
df_payments_2016.to_pickle('../data/pickled_files/payments_2016.pkl')

In [None]:
df_payments_2015.to_pickle('../data/pickled_files/payments_2015.pkl')

## Combine three years into a single file using concat.  Create pickle file.
- The resulting df has up to 3 rows for each provider and HCPCS code combination: one for each year (2015, 2016, 2017) in which that combination appears.

In [None]:
df_payments_combined_draft = pd.concat([df_payments_2015, df_payments_2016, df_payments_2017], ignore_index= True)
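
A quick way to sanity-check a concat like the one above is to count rows per year and confirm the index was rebuilt. The sketch below uses small made-up frames; `demo_year` and its values are hypothetical.

```python
import pandas as pd

def demo_year(year):
    """Build a two-row stand-in for one year's data (made-up values)."""
    return pd.DataFrame({
        "national_provider_identifier": [1111111111, 2222222222],
        "hcpcs_code": ["99213", "99214"],
        "year": year,
    })

# Stack the three years; ignore_index=True renumbers rows from 0
df_demo_combined = pd.concat(
    [demo_year(y) for y in (2015, 2016, 2017)], ignore_index=True
)

# Each year should contribute its own rows to the combined frame
rows_per_year = df_demo_combined["year"].value_counts().sort_index()
print(rows_per_year)
```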

In [None]:
# To create a pickle file for combined data

df_payments_combined_draft.to_pickle('../data/pickled_files/payments_combined_draft.pkl')
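
To confirm that a pickle file can be read back unchanged, a round trip like the sketch below can help. It writes to a temporary directory rather than the project's data folder, and `df_demo` and the file name are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame standing in for one of the payments DataFrames
df_demo = pd.DataFrame({"hcpcs_code": ["99213", "99214"], "year": [2017, 2017]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # pathlib joins path parts with the correct separator on any OS
    pkl_path = Path(tmp_dir) / "pickled_files" / "payments_demo.pkl"
    pkl_path.parent.mkdir(parents=True, exist_ok=True)

    df_demo.to_pickle(pkl_path)
    df_restored = pd.read_pickle(pkl_path)

# True when values, dtypes, and index all survive the round trip
round_trip_ok = df_restored.equals(df_demo)
print(round_trip_ok)
```

Building paths with `pathlib` also avoids backslash escape issues in string literals.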

### Work is continued in the next notebook, named "step02_clean_combined_df"