# HCPCS Data Structures from Medicare Sample

This notebook uses a sample of 2012 Medicare Part B data to construct the HCPCS data structures needed to train the skip-gram model.

A HCPCS set is an unordered collection of unique HCPCS codes that occur together.

In the Medicare data set, data is aggregated by provider (NPI), HCPCS code, and year.

For each (NPI, year) pair, we combine all HCPCS codes that the provider rendered that year. Because this notebook loads only the 2012 file, grouping by NPI alone is equivalent. Each resulting set is effectively a group of procedure codes rendered within the same context.
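
The grouping described above can be sketched on a few made-up rows. This is only an illustration: the `toy` frame, its NPI values, and the `toy_sets` name are assumptions and do not come from the Medicare file. It shows that repeated codes for one provider collapse into a single entry.

```python
import pandas as pd

# Made-up rows in the same (npi, hcpcs) layout as the Medicare sample.
toy = pd.DataFrame({
    'npi':   [1001, 1001, 1001, 1002, 1002],
    'hcpcs': ['99213', '99214', '99213', '83036', '99213'],
})

# One collection of unique codes per provider.
# Sorting is only for a stable display; the set itself has no meaningful order.
toy_sets = {
    int(npi): sorted(set(codes))
    for npi, codes in toy.groupby('npi')['hcpcs']
}

print(toy_sets)
# {1001: ['99213', '99214'], 1002: ['83036', '99213']}
```

Provider 1001 has three rows but only two distinct codes, so its set has two members.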

In [15]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import os
import sys
import pickle

pd.set_option('display.max_columns', 500)

## Load Medicare Data

In [18]:
proj_dir = '/Users/jujohnson/git/Hcpcs2Vec/'
data_dir = os.environ['CMS_RAW']

In [2]:
%%time

data_file = os.path.join(
    data_dir, 
    '2012', 
    'Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_CY2012.csv.gz')


# We only need the NPI and HCPCS columns.
columns = {
    'National Provider Identifier': 'npi',
    'HCPCS Code': 'hcpcs'
}

data = pd.read_csv(data_file, usecols=list(columns.keys()))
data.rename(columns=columns, inplace=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9153272 entries, 0 to 9153271
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   npi     int64 
 1   hcpcs   object
dtypes: int64(1), object(1)
memory usage: 139.7+ MB
CPU times: user 21.1 s, sys: 388 ms, total: 21.4 s
Wall time: 21.4 s


In [3]:
data.head()

|   | npi | hcpcs |
|---|---|---|
| 0 | 1194848424 | 76830 |
| 1 | 1497798078 | 99238 |
| 2 | 1609810126 | 99214 |
| 3 | 1093718066 | 83036 |
| 4 | 1417027897 | 72170 |


## Create HCPCS <-> ID Mapping

In [31]:
%%time

le = LabelEncoder()

data['hcpcs_id'] = le.fit_transform(data['hcpcs'])
print(f'Min hcpcs_id: {data["hcpcs_id"].min()}')
print(f'Max hcpcs_id: {data["hcpcs_id"].max()}')

# save label encoder results to enable inverse transform later
hcpcsIdFile = os.path.join(proj_dir, 'data', 'hcpcs-labelencoding.pickle')
with open(hcpcsIdFile, 'wb') as fout:
    pickle.dump(le.classes_, fout)

print(f'Saved HCPCS label encoded classes to {hcpcsIdFile}')

Min hcpcs_id: 0
Max hcpcs_id: 5948
Saved HCPCS label encoded classes to /Users/jujohnson/git/Hcpcs2Vec/data/hcpcs-labelencoding.pickle
CPU times: user 1.35 s, sys: 77.7 ms, total: 1.43 s
Wall time: 1.43 s
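
As a minimal sketch of the inverse transform that the saved classes are meant to enable: `LabelEncoder` stores its classes as a sorted array, so an ID is simply an index into that array. The sample codes and the temporary file location below are assumptions for illustration; the real file is the `hcpcs-labelencoding.pickle` written above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up sample codes standing in for data['hcpcs'].
codes = np.array(['99213', '83036', '99214', '83036'])

le = LabelEncoder()
ids = le.fit_transform(codes)

# Round-trip the classes through a pickle file in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'hcpcs-labelencoding.pickle')
    with open(path, 'wb') as fout:
        pickle.dump(le.classes_, fout)
    with open(path, 'rb') as fin:
        classes = pickle.load(fin)

# Map IDs back to codes by indexing into the sorted classes array.
decoded = classes[ids]

print(ids.tolist())      # [1, 0, 2, 0]
print(decoded.tolist())  # ['99213', '83036', '99214', '83036']
```

Only the classes array is needed for decoding, which is why the notebook pickles `le.classes_` rather than the whole encoder.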


## Extract HCPCS Sets

In [33]:
%%time

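# Build one sorted array of unique HCPCS IDs per provider.
# int16 is safe here because the largest hcpcs_id (5948) is well below 32767.
corpus = []
for npi, group in data.groupby(by='npi'):
    hcpcs_set = np.unique(np.asarray(group['hcpcs_id'], dtype='int16'))
    corpus.append(hcpcs_set)
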
corpusFile = os.path.join(proj_dir, 'data', '2012-hcpcs-sets.pickle')
with open(corpusFile, 'wb') as fout:
    pickle.dump(corpus, fout)
    
print(f'Saved preprocessed HCPCS corpus to {corpusFile}')

Saved preprocessed HCPCS corpus to /Users/jujohnson/git/Hcpcs2Vec/data/2012-hcpcs-sets.pickle
CPU times: user 2min 49s, sys: 606 ms, total: 2min 50s
Wall time: 2min 50s
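
The per-provider loop above takes close to three minutes. One possible alternative, shown here only as a sketch on made-up data and not timed on the full Medicare file, is to deduplicate and sort once and then split the ID column at provider boundaries. The `toy`, `dedup`, and `boundaries` names are assumptions for illustration; the result has the same shape as `corpus`, a list of sorted `int16` arrays ordered by NPI.

```python
import numpy as np
import pandas as pd

# Made-up rows in the same (npi, hcpcs_id) layout as the encoded data.
toy = pd.DataFrame({
    'npi':      [1002, 1001, 1001, 1002, 1001],
    'hcpcs_id': [7, 3, 5, 7, 3],
})

# Drop repeated (npi, hcpcs_id) rows, then sort so that each provider's
# IDs are contiguous and ascending, matching what np.unique returns.
dedup = toy.drop_duplicates().sort_values(['npi', 'hcpcs_id'])

npis = dedup['npi'].to_numpy()
ids = dedup['hcpcs_id'].to_numpy(dtype='int16')

# Positions where the NPI changes mark the start of the next provider's set.
boundaries = np.flatnonzero(npis[1:] != npis[:-1]) + 1
toy_corpus = np.split(ids, boundaries)

print([s.tolist() for s in toy_corpus])
# [[3, 5], [7]]
```

This replaces many small `np.unique` calls with one sort and one split; whether it is actually faster on the 9 million row sample would need to be measured.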
