# Unsupervised Learning

# 04_create_unsupervised_features

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 28/09/2025   | Adrienne | Created | Created file for unsupervised learning | 
| 29/09/2025   | Martin | New   | Processing to apply the HCPCS code descriptions + EDA on the new descriptions | 
| 02/10/2025 | Adrienne | Update | Created features |
| 05/10/2025 | Martin | Update | Added TFIDF transformation section for any "list-like" columns |
| 05/10/2025 | Adrienne | Update | Added a feature and cleaned up dataset to include relevant columns |
| 07/10/2025 | Adrienne | Update | Added preventative care indicator feature |

## Notes

- Preventative care indicator

## Content

* [Introduction](#introduction)
* [Load Data](#load-data)
* [Additional Features](#additional-features)
* [EDA](#eda)

## Introduction

In [3]:
%load_ext watermark

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

## Load Data

In [2]:
path = "../data/clean"
df = pd.read_pickle(f"{path}/patient_level.pkl")
#df = pd.read_pickle(f"{path}/patient_level_sample.pkl")

In [4]:
mapper_path = "../data/mappers"
combined_mapper = pd.read_pickle(f"{mapper_path}/combined_mapper.pkl")
preventative_mapper = pd.read_pickle(f"{mapper_path}/preventative_mapper.pkl")

In [5]:
combined_mapper.head()

Unnamed: 0,code,category,description
0,99201,HCPCS_level_1,Evaluation and Management (E/M) Codes
1,99202,HCPCS_level_1,Evaluation and Management (E/M) Codes
2,99203,HCPCS_level_1,Evaluation and Management (E/M) Codes
3,99204,HCPCS_level_1,Evaluation and Management (E/M) Codes
4,99205,HCPCS_level_1,Evaluation and Management (E/M) Codes


Drop columns that are not needed or that would be a source of data leakage.

In [6]:
# diagnosis columns:
keep_cols = ['patient_medicare_number', 'gender', 'age', 'number_of_claims', 'combined_hcpcs_ls', 'billablePeriod_start_ls', 'billablePeriod_end_ls', 'location_of_bill_ls', 'total_value']
df = df[keep_cols]

Drop the rows where `age` is missing.

In [7]:
df[df['age'].isnull()]

Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,billablePeriod_start_ls,billablePeriod_end_ls,location_of_bill_ls,total_value
36187,1S00E00HT00,male,,99,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[2012-02-19, 2012-04-15, 2012-05-13, 2012-06-1...","[2012-02-19, 2012-04-15, 2012-05-13, 2012-06-1...",[],142.58
72387,1S00E00ME11,male,,91,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[2012-05-14, 2012-06-11, 2012-07-23, 2012-09-1...","[2012-05-14, 2012-06-11, 2012-07-23, 2012-09-1...",[],105.46
53496,1S00E00JU46,male,,40,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[2013-10-25, 2015-03-20, 2015-12-11, 2016-02-1...","[2013-10-25, 2015-03-20, 2015-12-11, 2016-02-1...",[],123.66
14425,1S00E00GK24,male,,18,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[2013-10-14, 2013-12-16, 2015-07-13, 2016-01-0...","[2013-10-14, 2013-12-16, 2015-07-13, 2016-01-0...",[],114.9


In [8]:
df = df[df['age'].notnull()]

Limit the data to patients with fewer than 1,000 codes in `combined_hcpcs_ls`. This removes only five patients and reduces the longest list to 670 codes.

In [9]:
df['ls_len'] = df['combined_hcpcs_ls'].str.len()
df = df[df['ls_len'] < 1000]
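
Before applying a cutoff like this, it can help to confirm how many patients it removes and how long the longest remaining list is. A minimal sketch on toy data; the frame `toy` and the cutoff of 4 are illustrative assumptions, not values from the real dataset:

```python
import pandas as pd

# Toy stand-in for the patient-level frame; values are illustrative only.
toy = pd.DataFrame({
    "patient": ["a", "b", "c", "d"],
    "combined_hcpcs_ls": [["99241"], ["G0444"] * 3, ["99241"] * 6, ["G0402"] * 2],
})

cutoff = 4  # hypothetical threshold; the notebook uses 1000
lengths = toy["combined_hcpcs_ls"].str.len()

n_dropped = int((lengths >= cutoff).sum())            # patients the filter removes
max_remaining = int(lengths[lengths < cutoff].max())  # longest list that survives

print(f"dropped: {n_dropped}, longest remaining list: {max_remaining}")
```

Running the same two lines against the real `df` before filtering gives the figures to quote in the markdown cell.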

## Additional Features

Focusing on transforming the HCPCS codes into a useable format for unsupervised learning.

- HCPCS
  - code
  - category
  - description

## Apply mapper to HCPCS lists

Using the mapper, attach the category and description to each HCPCS position column.

In [10]:
# drop hcpcs columns that are all NaN
print(len(df))
df.dropna(axis=1, how='all', inplace=True)
print(len(df))

1138
1138


In [11]:
unique_values = set(value for sublist in df['combined_hcpcs_ls'] for value in sublist)
print(unique_values)
print(len(unique_values))

{'S9131', 'G0158', 'G0155', 'G0157', 'C8928', 'G0299', 'H2000', 'G9573', 'G0402', 'C8905', 'S9473', 'G0153', '99221', 'T1021', 'G8111', 'G0152', 'T1502', 'G9858', 'S8075', 'C8908', 'G0154', 'S9126', 'G0424', 'G0464', 'G0102', 'G0458', 'G0444', '99241', 'G9572', 'G0107', 'G9829', 'G9708', 'G0151', 'G8946', 'Q5001', 'S9122', 'S0605', 'G0129', 'S9129', 'G9857', 'G8159', 'G0156', 'G9833', 'G0300'}
44


In [12]:
maxlen = max(df['combined_hcpcs_ls'].str.len())
print(f"max combined_hcpcs_ls length: {maxlen}")
df_hcpcs = df['combined_hcpcs_ls'].apply(pd.Series)
df_hcpcs = df_hcpcs.add_prefix('hcpcs_')
df_hcpcs = pd.concat([df, df_hcpcs], axis = 1)
df_hcpcs.head()

max combined_hcpcs_ls length: 592


Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,billablePeriod_start_ls,billablePeriod_end_ls,location_of_bill_ls,total_value,ls_len,...,hcpcs_582,hcpcs_583,hcpcs_584,hcpcs_585,hcpcs_586,hcpcs_587,hcpcs_588,hcpcs_589,hcpcs_590,hcpcs_591
8249,1S00E00GA44,female,71.0,14,"[G0444, 99241, G0444, 99241, 99221, G0444, G04...","[2012-05-20, 2012-05-27, 2013-08-04, 2014-06-1...","[2012-05-20, 2012-05-27, 2013-08-04, 2014-06-1...","[002, 002, 002, 002, 002, 002, 002]",60.7,14,...,,,,,,,,,,
48386,1S00E00JN08,female,74.0,32,"[G0444, 99241, 99241, G0444, 99241, 99241, 992...","[2012-09-05, 2014-01-18, 2014-06-23, 2014-09-1...","[2012-09-05, 2014-01-18, 2014-06-23, 2014-09-1...","[002, 002, 002, 002, 002, 002, 002, 003, 002, ...",2733.16,217,...,,,,,,,,,,
36869,1S00E00HT71,female,77.0,25,"[G0444, 99241, G0402, Q5001, S9131, G0300, G01...","[2012-07-07, 2012-07-28, 2013-04-20, 2013-07-1...","[2012-07-07, 2012-07-28, 2013-04-20, 2013-07-1...","[002, 003, 002, 002, 002, 002, 002, 002, 002, ...",85.56,87,...,,,,,,,,,,
84632,1S00E00MW82,male,76.0,110,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[2013-06-29, 2013-12-02, 2014-02-01, 2014-02-0...","[2013-06-29, 2013-12-02, 2014-02-01, 2014-02-0...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",82.02,111,...,,,,,,,,,,
10413,1S00E00GE68,female,79.0,56,"[99241, S8075, 99241, 99241, G0444, S8075, 992...","[2012-05-11, 2012-09-08, 2012-09-08, 2012-09-0...","[2012-05-11, 2012-09-09, 2012-09-08, 2012-09-0...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",21270.24,160,...,,,,,,,,,,


In [13]:
for i in range(maxlen):
  df_hcpcs = pd.merge( df_hcpcs, 
    combined_mapper,
    left_on=f"hcpcs_{i}",
    right_on="code",
    how='left'
  )
  df_hcpcs = df_hcpcs.drop(['code'], axis=1)
  df_hcpcs = df_hcpcs.rename({
    'category': f"category_{i}",
    'description': f"description_{i}",
  }, axis=1)

df_hcpcs_combined = df_hcpcs.fillna(np.nan)

In [14]:
df_hcpcs_combined[['hcpcs_1', 'category_1', 'description_1', 'hcpcs_2', 'category_2',  'description_2']].head()

Unnamed: 0,hcpcs_1,category_1,description_1,hcpcs_2,category_2,description_2
0,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes,G0444,HCPCS_level_2,Procedures/Professional Services
1,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes
2,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes,G0402,HCPCS_level_2,Procedures/Professional Services
3,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes
4,S8075,HCPCS_level_2,Temporary National Codes (Non-Medicare),99241,HCPCS_level_1,Evaluation and Management (E/M) Codes
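
The loop above performs one merge per list position, which can be slow when there are hundreds of positions. A minimal sketch of an alternative that builds the lookups once and maps each position column. The frames `toy_mapper` and `wide` are illustrative assumptions, and the approach assumes each code appears only once in the mapper:

```python
import pandas as pd

# Toy mapper with the same columns as combined_mapper (rows are illustrative).
toy_mapper = pd.DataFrame({
    "code": ["99241", "G0444"],
    "category": ["HCPCS_level_1", "HCPCS_level_2"],
    "description": ["Evaluation and Management (E/M) Codes",
                    "Procedures/Professional Services"],
})

# Toy wide frame: one column per list position, None where a list is shorter.
wide = pd.DataFrame({"hcpcs_0": ["G0444", "99241"], "hcpcs_1": ["99241", None]})
position_cols = [c for c in wide.columns if c.startswith("hcpcs_")]

# Build code -> value lookups once (requires unique codes in the mapper).
cat_lookup = toy_mapper.set_index("code")["category"]
desc_lookup = toy_mapper.set_index("code")["description"]

for col in position_cols:
    i = col.split("_")[1]
    # Codes missing from the mapper, and empty positions, map to NaN.
    wide[f"category_{i}"] = wide[col].map(cat_lookup)
    wide[f"description_{i}"] = wide[col].map(desc_lookup)

print(wide[["hcpcs_1", "category_1", "description_1"]])
```

Unlike `pd.merge`, `Series.map` keeps the original index and row order, so no index reset is needed afterwards.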


## Time interval between claims

Using `billablePeriod_end_ls`, sort the claim end dates and compute the number of days between consecutive claims. Then expand the intervals into individual columns.

In [235]:
def days_between_claim(item):
  sorted_dates = pd.to_datetime(pd.Series(item)).sort_values().reset_index(drop=True)
  return sorted_dates.diff().dt.days.dropna().astype(int).tolist()

In [236]:
day_interval = pd.DataFrame(df['billablePeriod_end_ls'].apply(days_between_claim))
day_maxlen = max(day_interval['billablePeriod_end_ls'].str.len())
df_day_interval = pd.DataFrame(day_interval['billablePeriod_end_ls'].to_list(), columns=[f"day_interval_{i}" for i in range(day_maxlen)])
df_day_interval.head()

Unnamed: 0,day_interval_0,day_interval_1,day_interval_2,day_interval_3,day_interval_4,day_interval_5,day_interval_6,day_interval_7,day_interval_8,day_interval_9,...,day_interval_655,day_interval_656,day_interval_657,day_interval_658,day_interval_659,day_interval_660,day_interval_661,day_interval_662,day_interval_663,day_interval_664
0,28,343,371,371,14,241,3,106,52,92,...,,,,,,,,,,
1,27,338,33,332,39,326,45,137,183,365,...,,,,,,,,,,
2,32,8,31,44,9,3,18,28,84,140,...,,,,,,,,,,
3,94,360,11,371,124,247,136,235,74,61,...,,,,,,,,,,
4,320,9,42,371,360,11,371,289,82,29,...,,,,,,,,,,
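
As a quick check of the interval logic, the function can be run on a short, deliberately unsorted list of dates. The dates below are illustrative; the function body is reproduced from the cell above so the example is self-contained:

```python
import pandas as pd

def days_between_claim(item):
    # Sort the claim end dates, then take day differences between neighbours.
    sorted_dates = pd.to_datetime(pd.Series(item)).sort_values().reset_index(drop=True)
    return sorted_dates.diff().dt.days.dropna().astype(int).tolist()

# Illustrative dates, out of order on purpose.
example = ["2012-04-15", "2012-02-19", "2012-05-13"]
intervals = days_between_claim(example)
print(intervals)  # [56, 28]

# A patient with a single claim has no intervals.
print(days_between_claim(["2018-01-21"]))  # []
```

A list of n dates yields n - 1 intervals, which is why patients with fewer claims end up with trailing NaN columns once the intervals are expanded.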


## Preventative Care Indicator

In [15]:
# flag whether a patient has had any preventative care, based on combined_hcpcs_ls
prev_set = set(preventative_mapper['HCPCS Code'])
df_hcpcs_combined['preventative_care_ind'] = df_hcpcs_combined['combined_hcpcs_ls'].apply(
    lambda ls: int(any(code in prev_set for code in ls))
)
    

In [16]:
# quick check
df_hcpcs_combined['preventative_care_ind'].value_counts()

preventative_care_ind
1    1122
0      16
Name: count, dtype: int64

## Variable Encoding

Three datasets will be created, each with a different encoding of the HCPCS features:
- df_lab_enc: the hcpcs columns encoded using label encoding
- df_freq_enc: the hcpcs columns encoded using frequency encoding
- df_TD_enc: the combined_hcpcs_ls column treated as a bag-of-words problem and transformed with TF-IDF

Some variables are label encoded in all three datasets.

In [17]:
# will always encode gender using labels
le_gen = LabelEncoder()
df_hcpcs_combined['gender'] = le_gen.fit_transform(df_hcpcs_combined['gender'])

In [18]:
# create a list of category columns
category_cols = df_hcpcs_combined.columns[df_hcpcs_combined.columns.str.contains("category")]

# create a dataframe of unique category values for encoding
ls = list(set(value for value in combined_mapper['category']))
# new columns are filled with nan
ls.append(np.nan)
df_unique_category = pd.DataFrame( {'unique_category': ls})


# create instance of label encoder
le = LabelEncoder()
# fit label encoding on the full set of unique category values
le.fit(df_unique_category['unique_category'])
 
# apply same encoder to rest of columns
for col in category_cols:
    df_hcpcs_combined[col + '_enc'] = le.transform(df_hcpcs_combined[col])
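
The encoder is fitted once on the full vocabulary so that a given value receives the same integer in every column. A minimal sketch of that idea using `pd.Categorical` in place of `LabelEncoder` (a swapped-in technique, not what the notebook uses). The frame `toy` and the list `vocab` are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Toy category columns; NaN marks positions past the end of a patient's list.
toy = pd.DataFrame({
    "category_0": ["HCPCS_level_2", "HCPCS_level_1"],
    "category_1": ["HCPCS_level_1", np.nan],
})

# One shared vocabulary, defined once and reused for every column.
vocab = ["HCPCS_level_1", "HCPCS_level_2"]

for col in ["category_0", "category_1"]:
    # Codes follow the order of `vocab`; missing values become -1.
    toy[col + "_enc"] = pd.Categorical(toy[col], categories=vocab).codes

print(toy)
```

Here missing values are encoded as -1, whereas the notebook appends NaN to the vocabulary so that it gets its own label.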


In [19]:
# create a list of description columns
desc_cols = df_hcpcs_combined.columns[df_hcpcs_combined.columns.str.contains("description")]

# create a dataframe of unique description values for encoding
ls = list(set(value for value in combined_mapper['description']))
# new columns are filled with nan
ls.append(np.nan)
df_unique_desc = pd.DataFrame( {'unique_desc': ls})


# create instance of label encoder
le = LabelEncoder()
# fit label encoding on the full set of unique description values
le.fit(df_unique_desc['unique_desc'])
 
# apply same encoder to rest of columns
for col in desc_cols:
    df_hcpcs_combined[col + '_enc'] = le.transform(df_hcpcs_combined[col])


In [20]:
# create a list of hcpcs columns
# first col in list needs to be dropped
hcpcs_cols = df_hcpcs_combined.columns[df_hcpcs_combined.columns.str.contains("hcpcs")][1:]

# create a dataframe of unique hcpcs values for encoding
ls = list(set(value for sublist in df_hcpcs_combined['combined_hcpcs_ls'] for value in sublist))
# new hcpcs columns are filled with nan
ls.append(np.nan)
df_unique_hcpcs = pd.DataFrame( {'unique_hcpcs': ls})


In [21]:
# create copies of the dataset
df_lab_enc = df_hcpcs_combined.copy()
df_freq_enc = df_hcpcs_combined.copy()
df_TD_enc = df_hcpcs_combined.copy()

### Label Encoding HCPCS

In [22]:
# create instance of label encoder
le = LabelEncoder()
# fit label encoding on the full set of unique hcpcs values
le.fit(df_unique_hcpcs['unique_hcpcs'])
 
# apply same encoder to rest of columns
for col in hcpcs_cols:
    df_lab_enc[col + '_enc'] = le.transform(df_lab_enc[col])


In [23]:
# check encodings
df_lab_enc[['category_0', 'category_0_enc', 'hcpcs_0', 'hcpcs_0_enc', 'hcpcs_1', 'hcpcs_1_enc', 'hcpcs_2', 'hcpcs_2_enc', 'gender']].head()

Unnamed: 0,category_0,category_0_enc,hcpcs_0,hcpcs_0_enc,hcpcs_1,hcpcs_1_enc,hcpcs_2,hcpcs_2_enc,gender
0,HCPCS_level_2,2,G0444,20,99241,1,G0444,20,0
1,HCPCS_level_2,2,G0444,20,99241,1,99241,1,0
2,HCPCS_level_2,2,G0444,20,99241,1,G0402,18,0
3,HCPCS_level_1,1,99241,1,99241,1,99241,1,1
4,HCPCS_level_1,1,99241,1,S8075,36,99241,1,0


In [24]:
# drop original columns and list columns
drop_ls = list(category_cols) + list(desc_cols) + list(hcpcs_cols) + ['patient_medicare_number', 'combined_hcpcs_ls', 'billablePeriod_start_ls', 'billablePeriod_end_ls','location_of_bill_ls', 'ls_len']
df_lab_enc = df_lab_enc.drop(drop_ls, axis = 1)

In [25]:
df_lab_enc.to_pickle("../data/clean/patient_level_lab_enc.pkl")

In [26]:
df_lab_enc.head()

Unnamed: 0,gender,age,number_of_claims,total_value,preventative_care_ind,category_0_enc,category_1_enc,category_2_enc,category_3_enc,category_4_enc,...,hcpcs_582_enc,hcpcs_583_enc,hcpcs_584_enc,hcpcs_585_enc,hcpcs_586_enc,hcpcs_587_enc,hcpcs_588_enc,hcpcs_589_enc,hcpcs_590_enc,hcpcs_591_enc
0,0,71.0,14,60.7,1,2,1,2,1,1,...,44,44,44,44,44,44,44,44,44,44
1,0,74.0,32,2733.16,1,2,1,1,2,1,...,44,44,44,44,44,44,44,44,44,44
2,0,77.0,25,85.56,1,2,1,2,2,2,...,44,44,44,44,44,44,44,44,44,44
3,1,76.0,110,82.02,1,1,1,1,1,1,...,44,44,44,44,44,44,44,44,44,44
4,0,79.0,56,21270.24,1,1,2,1,1,2,...,44,44,44,44,44,44,44,44,44,44


### Frequency Encoded HCPCS

In [None]:



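# relative frequency of each HCPCS code across all patients' claims
freq_enc = df_hcpcs_combined['combined_hcpcs_ls'].explode().value_counts(normalize=True)

# map each code to its frequency; empty (NaN) positions become 0
for col in hcpcs_cols:
    df_freq_enc[col + '_enc'] = df_freq_enc[col].map(freq_enc).fillna(0)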
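
Frequency encoding replaces each code with how often it occurs across the dataset. A minimal sketch on toy data; the frame `toy` and the choice of relative (rather than absolute) frequency are assumptions made for illustration:

```python
import pandas as pd

# Toy patient-level frame with one list of codes per patient.
toy = pd.DataFrame({"combined_hcpcs_ls": [["99241", "99241", "G0444"], ["99241"]]})

# Relative frequency of each code over every claim in the toy data.
freq = toy["combined_hcpcs_ls"].explode().value_counts(normalize=True)

# Expand lists to position columns, then replace each code by its frequency.
wide = pd.DataFrame(toy["combined_hcpcs_ls"].tolist()).add_prefix("hcpcs_")
encoded = wide.apply(lambda col: col.map(freq)).fillna(0.0)

print(encoded)
```

Filling empty positions with 0 keeps them distinct from any observed code, since every observed code has a frequency above zero.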

### TF-IDF Encoding

In [23]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.pipeline import Pipeline

In [24]:
df_hcpcs_reset = df_hcpcs.reset_index(drop=True)
list_cols = df_hcpcs_reset.columns[:12]
df_hcpcs_reset[list_cols].head()

Unnamed: 0,patient_medicare_number,patient_first_name,patient_last_name,gender,birthdate,number_of_claims,drg_ls,combined_diagnosis_ls,combined_hcpcs_ls,billablePeriod_start_ls,billablePeriod_end_ls,location_of_bill_ls
0,1S00E00AA16,Franklyn36,Tromp100,male,1950-08-12,1,[],"[J329, E669, E785, J0190]",[99241],[2018-01-21],[2018-01-21],[002]
1,1S00E00AA23,Bonita405,Hagenes547,female,1948-05-23,4,[],"[J329, E785, I10, B002, J029, J329, E785, I10,...","[99241, 99241, 99241, 99241]","[2012-05-15, 2012-07-12, 2017-02-20, 2020-02-20]","[2012-05-15, 2012-07-12, 2017-02-20, 2020-02-20]","[002, 002, 002, 002]"
2,1S00E00AA25,Carlota980,Gamez720,female,1947-04-15,5,[],"[E669, D649, O039, M810, J329, E669, D649, O03...","[G0444, G0444, 99241, 99241, 99241]","[2015-08-05, 2016-08-10, 2020-07-29, 2021-03-0...","[2015-08-05, 2016-08-10, 2020-07-29, 2021-03-0...","[002, 002, 002]"
3,1S00E00AA32,D.,Watsic,male,,1,[],"[I10, E669, I2510, I219]",[G0444],[2018-06-09],[2018-06-09],[]
4,1S00E00AA48,Man114,Halvorson124,male,1945-08-04,1,[],"[E785, P292, I2510, E669, J0190, J329]",[99241],[2021-09-01],[2021-09-01],[002]


In [25]:
def tokeniser(text):
  return text.split()

def get_corpus_and_vocab(df, col):
  corpus = df[col].apply(lambda x: " ".join(x)).to_list()

  vocab = list(set([i for sublist in df[col].to_list() for i in sublist]))
  vocab = {k: i for i, k in enumerate(vocab)}
  return corpus, vocab

In [26]:
corpus, vocab = get_corpus_and_vocab(df_hcpcs_reset, 'combined_hcpcs_ls')

In [27]:
# Pipeline for tfidf and countvectoriser
pipe = Pipeline([
  ('count', CountVectorizer(vocabulary=vocab, tokenizer=tokeniser, lowercase=False)),
  ('tfidf', TfidfTransformer())
])

tfidf_hcpcs = pipe.fit_transform(corpus)
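
To make the transformation concrete, the weighting can be reproduced by hand with NumPy. This is a sketch that assumes the default `TfidfTransformer` settings (smoothed idf and L2 row normalisation); the toy `docs` list is illustrative:

```python
import numpy as np

# Toy corpus: each "document" is one patient's list of HCPCS codes.
docs = [["99241", "99241", "G0444"], ["99241"]]
vocab = sorted({code for doc in docs for code in doc})

# Term counts: rows are patients, columns follow `vocab`.
counts = np.array([[doc.count(term) for term in vocab] for doc in docs], dtype=float)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight counts by idf, then scale each row to unit Euclidean length.
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

print(dict(zip(vocab, idf.round(3))))
print(tfidf.round(3))
```

A code that appears for every patient, such as `99241` here, gets the minimum idf of 1, while rarer codes are weighted more heavily.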



In [28]:
# Create a useable dataframe - NOTE: DO NOT use this if the vocab is too big
df_tfidf = pd.DataFrame(tfidf_hcpcs.toarray(), columns=pipe['count'].get_feature_names_out())

In [30]:
# Combine with the original fixed columns
col_list = ['gender', 'age', 'number_of_claims']
out = pd.concat([
  df_hcpcs[col_list].reset_index(drop=True),
  df_tfidf
], axis=1)

# Export
out.to_pickle(f"{path}/hcpcs_tfidf.pkl")

## EDA

## Breakdown of procedures

NOTE: This is on the `claim_mini_sample` dataset (10,000 entries)

- The most common procedure descriptions are:
  1. Evaluation and Management (E/M) Codes (HCPCS Level I)
  2. Procedures/Professional Services (HCPCS Level II)
- Other codes include:
  1. Alcohol and Drug Abuse Treatment
  2. National Codes Established for State Medicaid Agencies

In [198]:
def countplot_with_labels(l, title):
  # horizontal bars, so that each patch width is the count used for the labels
  ax = sns.countplot(y=l, palette='pastel')

  for p in ax.patches:
    ax.text(
      p.get_width() + 1,
      p.get_y() + p.get_height() / 2,
      int(p.get_width()),
      ha="center",
      va="center",
      color="black",
      fontsize=12,
      fontweight="bold"
    )
  
  plt.title(title)

  return plt

Overall, what is the distribution of HCPCS codes across all claims?

In [199]:
all_hcpcs = df['combined_hcpcs_ls'].explode().reset_index()
all_hcpcs = all_hcpcs.merge(
  combined_mapper,
  left_on='combined_hcpcs_ls',
  right_on='code',
  how='left'
)
all_hcpcs = all_hcpcs.drop(['index', 'code'], axis=1)
all_hcpcs = all_hcpcs.fillna("Unknown")

In [None]:
countplot_with_labels(all_hcpcs['category'], "Breakdown of Category for HCPCS")

In [None]:
countplot_with_labels(all_hcpcs['description'], "Breakdown of Descriptions for HCPCS")

For the first and second HCPCS codes, what are the most common procedure descriptions?

In [None]:
countplot_with_labels(df_hcpcs_combined['description_0'], "Breakdown of First Procedure")

In [None]:
countplot_with_labels(df_hcpcs_combined['description_1'], "Breakdown of Second Procedure")

## How long between claim submissions

In [None]:
plt.hist(day_interval['billablePeriod_end_ls'].explode().dropna().astype(int), bins=50)
plt.title("Histogram of all Day Intervals between Claim Submissions")
plt.show()

In [None]:
plt.hist(df_day_interval['day_interval_0'].dropna())
plt.title('How long between the first and second claim submissions in Days')

In [None]:
%watermark