# Unsupervised Learning

# 04_create_unsupervised_features

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 28/09/2025   | Adrienne | Created | Created file for unsupervised learning | 
| 29/09/2025   | Martin | New   | Processing to apply the HCPCS code descriptions + EDA on the new descriptions | 
| 02/10/2025 | Adrienne | Update | Created features |
| 05/10/2025 | Martin | Update | Added TFIDF transformation section for any "list-like" columns |
| 05/10/2025 | Adrienne | Update | Added a feature and cleaned up dataset to include relevant columns |
| 07/10/2025 | Adrienne | Update | Added preventative care indicator feature |

## Notes

- Preventative care indicator

## Content

* [Introduction](#introduction)
* [Load Data](#load-data)
* [Additional Features](#additional-features)
* [Variable Encoding](#variable-encoding)
* [EDA](#eda)

## Introduction

In [3]:
%load_ext watermark

In [140]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.pipeline import Pipeline

## Load Data

In [177]:
path = "../data/clean"
df = pd.read_pickle(f"{path}/patient_level.pkl")

In [178]:
mapper_path = "../data/mappers"
combined_mapper = pd.read_pickle(f"{mapper_path}/combined_mapper.pkl")
preventative_mapper = pd.read_pickle(f"{mapper_path}/preventative_mapper.pkl")

In [179]:
combined_mapper.head()

Unnamed: 0,code,category,description
0,99201,HCPCS_level_1,Evaluation and Management (E/M) Codes
1,99202,HCPCS_level_1,Evaluation and Management (E/M) Codes
2,99203,HCPCS_level_1,Evaluation and Management (E/M) Codes
3,99204,HCPCS_level_1,Evaluation and Management (E/M) Codes
4,99205,HCPCS_level_1,Evaluation and Management (E/M) Codes


Drop columns that are not needed or that would be a source of data leakage.

In [180]:
# columns to keep
keep_cols = [
    'patient_medicare_number', 'gender', 'age', 'number_of_claims',
    'combined_hcpcs_ls', 'billablePeriod_start_ls', 'billablePeriod_end_ls',
    'location_of_bill_ls', 'total_value',
]
df = df[keep_cols]

In [181]:
df.head()

Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,billablePeriod_start_ls,billablePeriod_end_ls,location_of_bill_ls,total_value
1,1S00E00AA10,female,79.0,18,"[G0444, 99221, G0444, G0444, G0444, 99221, 992...","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",149.37
18,1S00E00AA16,male,75.0,17,"[99241, G0444, 99241, G0444, 99241, G0444, G95...","[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...","[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",234.72
35,1S00E00AA23,female,77.0,29,"[99241, 99241, 99241, G0444, 99241, 99241, 992...","[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...","[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",85.55
64,1S00E00AA25,female,78.0,24,"[G0107, G0444, 99241, G0444, G0444, 99241, G04...","[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...","[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",21901.4
89,1S00E00AA32,male,80.0,19,"[G0444, 99241, G0444, G0444, G0444, G9572, 992...","[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...","[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...","[002, 002, 002, 002, 002, 002, 002, 002, 002]",8388.69


Six patients have no recorded age, so those rows are dropped.

In [182]:
df[df['age'].isnull()]

Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,billablePeriod_start_ls,billablePeriod_end_ls,location_of_bill_ls,total_value
8773,1S00E00GA64,male,,49,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[2012-04-24, 2012-05-22, 2012-12-04, 2013-05-1...","[2012-04-24, 2012-05-22, 2012-12-04, 2013-05-1...",[],101.17
14912,1S00E00GK24,male,,18,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[2013-10-14, 2013-12-16, 2015-07-13, 2016-01-0...","[2013-10-14, 2013-12-16, 2015-07-13, 2016-01-0...",[],114.9
37555,1S00E00HT00,male,,99,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[2012-02-19, 2012-04-15, 2012-05-13, 2012-06-1...","[2012-02-19, 2012-04-15, 2012-05-13, 2012-06-1...",[],142.58
55609,1S00E00JU46,male,,40,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[2013-10-25, 2015-03-20, 2015-12-11, 2016-02-1...","[2013-10-25, 2015-03-20, 2015-12-11, 2016-02-1...",[],123.66
75508,1S00E00ME11,male,,91,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[2012-05-14, 2012-06-11, 2012-07-23, 2012-09-1...","[2012-05-14, 2012-06-11, 2012-07-23, 2012-09-1...",[],105.46
78272,1S00E00MH19,male,,44,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[2012-04-30, 2012-05-21, 2012-06-18, 2013-05-2...","[2012-04-30, 2012-05-21, 2012-06-18, 2013-05-2...",[],134.38


In [183]:
# copy so that later column assignments do not act on a view of the original frame
df = df[df['age'].notnull()].copy()

Limit the data to patients with fewer than 1,000 codes in `combined_hcpcs_ls`. This removes only five patients and reduces the longest list to 670 codes.

In [184]:
print(f"Before Length: {len(df)}")
df['ls_len'] = df['combined_hcpcs_ls'].str.len()
df = df[df['ls_len'] < 1000]
print(f"After Length: {len(df)}")

Before Length: 2612
After Length: 2607


## Additional Features

This section transforms the HCPCS codes into a format usable for unsupervised learning. Each code carries the following fields:

- HCPCS
  - code
  - category
  - description

### Apply mapper to HCPCS lists

Using the mapper, attach a category column and a description column to each expanded HCPCS column.

In [185]:
# drop columns that are entirely NaN; the row count printed below should be unchanged
print(len(df))
df = df.dropna(axis=1, how='all')
print(len(df))

2607
2607


In [186]:
unique_values = set(value for sublist in df['combined_hcpcs_ls'] for value in sublist)
print(unique_values)
print(len(unique_values))

{'G9829', 'G0155', 'G0153', 'G8946', '99241', 'S0605', 'G0424', 'T1502', 'G0107', 'G9833', 'G9857', 'G0152', 'G0402', 'Q5001', 'G0154', 'S8075', 'G0299', 'G0102', 'C8908', 'G8159', 'S9129', 'G0157', 'G9573', 'S9122', 'G9858', 'S9473', 'C8905', 'C8928', 'G0158', 'G9708', 'G9572', 'G0464', 'G0458', 'G0129', 'G0151', 'G0444', '99221', 'S9131', 'T1021', 'H2000', 'S9126', 'G0300', 'G8111', 'G0156'}
44


In [187]:
maxlen = max(df['combined_hcpcs_ls'].str.len())
print(f"max combined_hcpcs_ls length: {maxlen}")
df_hcpcs = df['combined_hcpcs_ls'].apply(pd.Series)
df_hcpcs = df_hcpcs.add_prefix('hcpcs_')
df_hcpcs = pd.concat([df, df_hcpcs], axis = 1)
df_hcpcs.head()

max combined_hcpcs_ls length: 670


Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,billablePeriod_start_ls,billablePeriod_end_ls,location_of_bill_ls,total_value,ls_len,...,hcpcs_660,hcpcs_661,hcpcs_662,hcpcs_663,hcpcs_664,hcpcs_665,hcpcs_666,hcpcs_667,hcpcs_668,hcpcs_669
1,1S00E00AA10,female,79.0,18,"[G0444, 99221, G0444, G0444, G0444, 99221, 992...","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",149.37,19,...,,,,,,,,,,
18,1S00E00AA16,male,75.0,17,"[99241, G0444, 99241, G0444, 99241, G0444, G95...","[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...","[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",234.72,18,...,,,,,,,,,,
35,1S00E00AA23,female,77.0,29,"[99241, 99241, 99241, G0444, 99241, 99241, 992...","[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...","[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",85.55,30,...,,,,,,,,,,
64,1S00E00AA25,female,78.0,24,"[G0107, G0444, 99241, G0444, G0444, 99241, G04...","[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...","[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",21901.4,24,...,,,,,,,,,,
89,1S00E00AA32,male,80.0,19,"[G0444, 99241, G0444, G0444, G0444, G9572, 992...","[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...","[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...","[002, 002, 002, 002, 002, 002, 002, 002, 002]",8388.69,24,...,,,,,,,,,,


In [188]:
for i in range(maxlen):
  df_hcpcs = pd.merge( df_hcpcs, 
    combined_mapper,
    left_on=f"hcpcs_{i}",
    right_on="code",
    how='left'
  )
  df_hcpcs = df_hcpcs.drop(['code'], axis=1)
  df_hcpcs = df_hcpcs.rename({
    'category': f"category_{i}",
    'description': f"description_{i}",
  }, axis=1)

df_hcpcs_combined = df_hcpcs.fillna(np.nan)

In [189]:
df_hcpcs_combined[['hcpcs_1', 'category_1', 'description_1', 'hcpcs_2', 'category_2',  'description_2']].head()

Unnamed: 0,hcpcs_1,category_1,description_1,hcpcs_2,category_2,description_2
0,99221,HCPCS_level_1,Evaluation and Management (E/M) Codes,G0444,HCPCS_level_2,Procedures/Professional Services
1,G0444,HCPCS_level_2,Procedures/Professional Services,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes
2,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes
3,G0444,HCPCS_level_2,Procedures/Professional Services,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes
4,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes,G0444,HCPCS_level_2,Procedures/Professional Services
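
The per-position merge above produces one `category_i` and one `description_i` column for each of the 670 list positions. Where only the per-patient list of categories is needed (as in the TF-IDF section later), a dictionary lookup is a lighter alternative. The sketch below is an assumption-laden illustration, not the notebook's method: `toy_mapper`, `toy_df`, and the code `X0000` are made-up stand-ins, and codes missing from the mapper are simply skipped.

```python
import pandas as pd

# Toy stand-ins for combined_mapper and df (illustrative values only).
toy_mapper = pd.DataFrame({
    "code": ["99241", "G0444"],
    "category": ["HCPCS_level_1", "HCPCS_level_2"],
})
toy_df = pd.DataFrame({
    "combined_hcpcs_ls": [["99241", "G0444", "99241"], ["G0444", "X0000"]],
})

# Build a code -> category dictionary once, then look each code up directly
# instead of merging once per list position.
code_to_category = (
    toy_mapper.drop_duplicates("code").set_index("code")["category"].to_dict()
)

# Codes that are not in the mapper (here the made-up "X0000") are dropped.
toy_df["cat_ls"] = toy_df["combined_hcpcs_ls"].apply(
    lambda codes: [code_to_category[c] for c in codes if c in code_to_category]
)
print(toy_df["cat_ls"].tolist())
```

The same pattern would apply to descriptions by swapping the `category` column for `description` when building the dictionary.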


### Time interval between claims

Using `billablePeriod_end_ls`, sort each patient's claim dates and compute the number of days between consecutive claims. Then expand the intervals into individual columns.

In [235]:
def days_between_claim(item):
  sorted_dates = pd.to_datetime(pd.Series(item)).sort_values().reset_index(drop=True)
  return sorted_dates.diff().dt.days.dropna().astype(int).tolist()
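
As a quick sanity check, the function can be run on a short, deliberately unsorted list. The three dates below are the first three `billablePeriod_end_ls` values of the first patient shown earlier; the variable names `example_dates` and `gaps` are introduced here for illustration only.

```python
import pandas as pd

def days_between_claim(item):
    # Sort the dates, then take the difference between consecutive entries.
    sorted_dates = pd.to_datetime(pd.Series(item)).sort_values().reset_index(drop=True)
    # The first diff is NaT (no previous claim), so it is dropped.
    return sorted_dates.diff().dt.days.dropna().astype(int).tolist()

example_dates = ["2012-05-15", "2012-04-17", "2013-04-23"]  # unsorted on purpose
gaps = days_between_claim(example_dates)
print(gaps)  # [28, 343]
```

A list of n dates yields n - 1 intervals, which is why the widest row in `df_day_interval` is shorter than the longest date list.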

In [236]:
day_interval = pd.DataFrame(df['billablePeriod_end_ls'].apply(days_between_claim))
day_maxlen = max(day_interval['billablePeriod_end_ls'].str.len())
df_day_interval = pd.DataFrame(day_interval['billablePeriod_end_ls'].to_list(), columns=[f"day_interval_{i}" for i in range(day_maxlen)])
df_day_interval.head()

Unnamed: 0,day_interval_0,day_interval_1,day_interval_2,day_interval_3,day_interval_4,day_interval_5,day_interval_6,day_interval_7,day_interval_8,day_interval_9,...,day_interval_655,day_interval_656,day_interval_657,day_interval_658,day_interval_659,day_interval_660,day_interval_661,day_interval_662,day_interval_663,day_interval_664
0,28,343,371,371,14,241,3,106,52,92,...,,,,,,,,,,
1,27,338,33,332,39,326,45,137,183,365,...,,,,,,,,,,
2,32,8,31,44,9,3,18,28,84,140,...,,,,,,,,,,
3,94,360,11,371,124,247,136,235,74,61,...,,,,,,,,,,
4,320,9,42,371,360,11,371,289,82,29,...,,,,,,,,,,


### Preventative Care Indicator

In [190]:
# flag patients who have had any preventative care, based on combined_hcpcs_ls
# (a set gives fast membership checks; any() replaces the row-by-row loop)
prev_codes = set(preventative_mapper['HCPCS Code'])
df_hcpcs_combined['preventative_care_ind'] = df_hcpcs_combined['combined_hcpcs_ls'].apply(
    lambda ls: int(any(code in prev_codes for code in ls))
)

In [191]:
# quick check
df_hcpcs_combined['preventative_care_ind'].value_counts()

preventative_care_ind
1    2578
0      29
Name: count, dtype: int64

## Variable Encoding

Three datasets are created, each encoding the HCPCS features differently:
- df_lab_enc: the hcpcs columns are encoded using label encoding
- df_freq_enc: the hcpcs columns are encoded using frequency encoding
- df_TF_enc: the combined_hcpcs_ls column is treated as a bag-of-words problem and transformed with TF-IDF
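
Frequency encoding is not worked through in this notebook, so the following is only a minimal sketch of one common variant: each code is replaced by its relative frequency across all positions. The frame `toy` and its values are made up, and mapping the NaN padding to `0.0` is an assumption rather than a decision taken here.

```python
import pandas as pd

# Toy stand-in for the expanded hcpcs_* columns (None marks padding).
toy = pd.DataFrame({
    "hcpcs_0": ["99241", "G0444", "99241"],
    "hcpcs_1": ["G0444", None, "99241"],
})
toy_cols = ["hcpcs_0", "hcpcs_1"]

# Relative frequency of each code over every position; missing values are
# excluded by value_counts by default.
freq = pd.Series(toy[toy_cols].to_numpy().ravel()).value_counts(normalize=True)

# Replace each code with its frequency; padding becomes 0.0 (assumption).
toy_freq = toy[toy_cols].apply(lambda col: col.map(freq)).fillna(0.0)
print(toy_freq)
```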

Some variables, such as gender, are always label encoded.

In [192]:
# Check to make sure gender is not missing
vals = df_hcpcs_combined['gender'].value_counts(normalize=True) * 100
pd.DataFrame({
  'gender_breakdown': vals
}).head(10)

Unnamed: 0_level_0,gender_breakdown
gender,Unnamed: 1_level_1
female,53.126199
male,46.873801


In [193]:
# will always encode gender using labels
le_gen = LabelEncoder()
df_hcpcs_combined['gender'] = le_gen.fit_transform(df_hcpcs_combined['gender'])

In [194]:
# create a list of hcpcs columns
# the first match is the 'combined_hcpcs_ls' list column itself, so drop it
hcpcs_cols = df_hcpcs_combined.columns[df_hcpcs_combined.columns.str.contains("hcpcs")][1:]

# create a dataframe of unique hcpcs values for encoding
ls = list(set(value for sublist in df_hcpcs_combined['combined_hcpcs_ls'] for value in sublist))
# the expanded hcpcs columns are padded with NaN, so NaN must be a known label
ls.append(np.nan)
df_unique_hcpcs = pd.DataFrame( {'unique_hcpcs': ls})


In [195]:
# create copies of the dataset
df_lab_enc = df_hcpcs_combined.copy()
df_freq_enc = df_hcpcs_combined.copy()
df_TF_enc = df_hcpcs_combined.copy()

### Label Encoding HCPCS, Category and Description

In [231]:
# create a list of category columns
category_cols =  df_lab_enc.columns[ df_lab_enc.columns.str.contains("category")]

# create a dataframe of unique category values for encoding
ls = list(set(value for value in combined_mapper['category']))
# new columns are filled with nan
ls.append(np.nan)
df_unique_category = pd.DataFrame( {'unique_category': ls})


# create instance of label encoder
le = LabelEncoder()
# fit the encoder on the unique category values (plus NaN)
le.fit(df_unique_category['unique_category'])
 
# apply same encoder to rest of columns
for col in category_cols:
    df_lab_enc[col + '_enc'] = le.transform( df_lab_enc[col])


In [232]:
# create a list of description columns
desc_cols =  df_lab_enc.columns[ df_lab_enc.columns.str.contains("description")]

# create a dataframe of unique description values for encoding
ls = list(set(value for value in combined_mapper['description']))
# new columns are filled with nan
ls.append(np.nan)
df_unique_desc = pd.DataFrame( {'unique_desc': ls})


# create instance of label encoder
le = LabelEncoder()
# fit the encoder on the unique description values (plus NaN)
le.fit(df_unique_desc['unique_desc'])
 
# apply same encoder to rest of columns
for col in desc_cols:
    df_lab_enc[col + '_enc'] = le.transform( df_lab_enc[col])


In [96]:
# create instance of label encoder
le = LabelEncoder()
# fit the encoder on the unique hcpcs values (plus NaN)
le.fit(df_unique_hcpcs['unique_hcpcs'])
 
# apply same encoder to rest of columns
for col in hcpcs_cols:
    df_lab_enc[col + '_enc'] = le.transform(df_lab_enc[col])


In [233]:
# check encodings
df_lab_enc[['category_0', 'category_0_enc', 'hcpcs_0', 'hcpcs_0_enc', 'hcpcs_1', 'hcpcs_1_enc', 'hcpcs_2', 'hcpcs_2_enc', 'gender']].head()

KeyError: "['hcpcs_0_enc', 'hcpcs_1_enc', 'hcpcs_2_enc'] not in index"

In [37]:
# drop original columns and list columns
drop_ls = list(category_cols) + list(desc_cols) + list(hcpcs_cols) + ['patient_medicare_number', 'total_value', 'combined_hcpcs_ls', 'billablePeriod_start_ls', 'billablePeriod_end_ls','location_of_bill_ls', 'ls_len']
df_lab_enc = df_lab_enc.drop(drop_ls, axis = 1)

In [39]:
df_lab_enc.head()

Unnamed: 0,gender,age,number_of_claims,preventative_care_ind,category_0_enc,category_1_enc,category_2_enc,category_3_enc,category_4_enc,category_5_enc,...,hcpcs_582_enc,hcpcs_583_enc,hcpcs_584_enc,hcpcs_585_enc,hcpcs_586_enc,hcpcs_587_enc,hcpcs_588_enc,hcpcs_589_enc,hcpcs_590_enc,hcpcs_591_enc
0,0,71.0,14,1,2,1,2,1,1,2,...,44,44,44,44,44,44,44,44,44,44
1,0,74.0,32,1,2,1,1,2,1,1,...,44,44,44,44,44,44,44,44,44,44
2,0,77.0,25,1,2,1,2,2,2,2,...,44,44,44,44,44,44,44,44,44,44
3,1,76.0,110,1,1,1,1,1,1,1,...,44,44,44,44,44,44,44,44,44,44
4,0,79.0,56,1,1,2,1,1,2,2,...,44,44,44,44,44,44,44,44,44,44


In [38]:
df_lab_enc.to_pickle("../data/clean/patient_level_lab_enc.pkl")

### TF-IDF Encoding

In [None]:
# Combined Category column
cat_cols = df_TF_enc.columns[df_TF_enc.columns.str.contains("category")]
df_TF_enc['cat_ls'] = df_TF_enc[cat_cols].apply(lambda row: [x for x in row if pd.notnull(x)] , axis = 1)

In [198]:
df_TF_enc.head()

Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,billablePeriod_start_ls,billablePeriod_end_ls,location_of_bill_ls,total_value,ls_len,...,category_666,description_666,category_667,description_667,category_668,description_668,category_669,description_669,preventative_care_ind,cat_ls
0,1S00E00AA10,0,79.0,18,"[G0444, 99221, G0444, G0444, G0444, 99221, 992...","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",149.37,19,...,,,,,,,,,1,"[HCPCS_level_2, HCPCS_level_1, HCPCS_level_2, ..."
1,1S00E00AA16,1,75.0,17,"[99241, G0444, 99241, G0444, 99241, G0444, G95...","[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...","[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",234.72,18,...,,,,,,,,,1,"[HCPCS_level_1, HCPCS_level_2, HCPCS_level_1, ..."
2,1S00E00AA23,0,77.0,29,"[99241, 99241, 99241, G0444, 99241, 99241, 992...","[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...","[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",85.55,30,...,,,,,,,,,1,"[HCPCS_level_1, HCPCS_level_1, HCPCS_level_1, ..."
3,1S00E00AA25,0,78.0,24,"[G0107, G0444, 99241, G0444, G0444, 99241, G04...","[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...","[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",21901.4,24,...,,,,,,,,,1,"[HCPCS_level_2, HCPCS_level_2, HCPCS_level_1, ..."
4,1S00E00AA32,1,80.0,19,"[G0444, 99241, G0444, G0444, G0444, G9572, 992...","[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...","[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...","[002, 002, 002, 002, 002, 002, 002, 002, 002]",8388.69,24,...,,,,,,,,,1,"[HCPCS_level_2, HCPCS_level_1, HCPCS_level_2, ..."


In [199]:
# Combined Description column
desc_cols = df_TF_enc.columns[df_TF_enc.columns.str.contains("description")]
df_TF_enc['desc_ls'] = df_TF_enc[desc_cols].apply(lambda row: [x for x in row if pd.notnull(x)] , axis = 1)

In [200]:
df_TF_enc.head()

Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,billablePeriod_start_ls,billablePeriod_end_ls,location_of_bill_ls,total_value,ls_len,...,description_666,category_667,description_667,category_668,description_668,category_669,description_669,preventative_care_ind,cat_ls,desc_ls
0,1S00E00AA10,0,79.0,18,"[G0444, 99221, G0444, G0444, G0444, 99221, 992...","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",149.37,19,...,,,,,,,,1,"[HCPCS_level_2, HCPCS_level_1, HCPCS_level_2, ...","[Procedures/Professional Services, Evaluation ..."
1,1S00E00AA16,1,75.0,17,"[99241, G0444, 99241, G0444, 99241, G0444, G95...","[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...","[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",234.72,18,...,,,,,,,,1,"[HCPCS_level_1, HCPCS_level_2, HCPCS_level_1, ...","[Evaluation and Management (E/M) Codes , Proce..."
2,1S00E00AA23,0,77.0,29,"[99241, 99241, 99241, G0444, 99241, 99241, 992...","[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...","[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",85.55,30,...,,,,,,,,1,"[HCPCS_level_1, HCPCS_level_1, HCPCS_level_1, ...","[Evaluation and Management (E/M) Codes , Evalu..."
3,1S00E00AA25,0,78.0,24,"[G0107, G0444, 99241, G0444, G0444, 99241, G04...","[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...","[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",21901.4,24,...,,,,,,,,1,"[HCPCS_level_2, HCPCS_level_2, HCPCS_level_1, ...","[Procedures/Professional Services, Procedures/..."
4,1S00E00AA32,1,80.0,19,"[G0444, 99241, G0444, G0444, G0444, G9572, 992...","[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...","[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...","[002, 002, 002, 002, 002, 002, 002, 002, 002]",8388.69,24,...,,,,,,,,1,"[HCPCS_level_2, HCPCS_level_1, HCPCS_level_2, ...","[Procedures/Professional Services, Evaluation ..."


In [201]:
drop_ls = list(cat_cols) + list(desc_cols)
df_TF_enc = df_TF_enc.drop(drop_ls, axis = 1)

In [None]:
# df_TF_enc_reset = df_TF_enc.reset_index(drop=True)
# list_cols = df_TF_enc_reset.columns[:9]
# df_TF_enc_reset[list_cols].head()

Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,billablePeriod_start_ls,billablePeriod_end_ls,location_of_bill_ls,total_value
0,1S00E00AA10,0,79.0,18,"[G0444, 99221, G0444, G0444, G0444, 99221, 992...","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",149.37
1,1S00E00AA16,1,75.0,17,"[99241, G0444, 99241, G0444, 99241, G0444, G95...","[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...","[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",234.72
2,1S00E00AA23,0,77.0,29,"[99241, 99241, 99241, G0444, 99241, 99241, 992...","[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...","[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",85.55
3,1S00E00AA25,0,78.0,24,"[G0107, G0444, 99241, G0444, G0444, 99241, G04...","[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...","[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",21901.4
4,1S00E00AA32,1,80.0,19,"[G0444, 99241, G0444, G0444, G0444, G9572, 992...","[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...","[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...","[002, 002, 002, 002, 002, 002, 002, 002, 002]",8388.69


In [202]:
def tokeniser(text):
  return text.split()

def get_corpus_and_vocab(df, col):
  corpus = df[col].apply(lambda x: " ".join(x)).to_list()

  vocab = list(set([i for sublist in df[col].to_list() for i in sublist]))
  vocab = {k: i for i, k in enumerate(vocab)}
  return corpus, vocab

In [210]:
corpus, vocab = get_corpus_and_vocab(df_TF_enc, 'combined_hcpcs_ls')
corpus_cat, vocab_cat = get_corpus_and_vocab(df_TF_enc, 'cat_ls')
corpus_desc, vocab_desc = get_corpus_and_vocab(df_TF_enc, 'desc_ls')

In [220]:
# Pipeline for tfidf and countvectoriser
def vectorize(corpus, vocab):
  pipe = Pipeline([
    ('count', CountVectorizer(vocabulary=vocab, tokenizer=tokeniser, lowercase=False)),
    ('tfidf', TfidfTransformer())
  ])

  tfidfs = pipe.fit_transform(corpus)
  df_tfidfs = pd.DataFrame(tfidfs.toarray(), columns=pipe['count'].get_feature_names_out())
  
  return df_tfidfs

In [221]:
tfidf_hcpcs = vectorize(corpus, vocab)
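
For reference, the weights produced by `vectorize` can be written out as follows, assuming the pipeline runs with scikit-learn's documented defaults for `TfidfTransformer` (`smooth_idf=True`, `norm='l2'`). Here $n$ is the number of patients, $\mathrm{df}(t)$ is the number of patients whose list contains code $t$, and $\mathrm{tf}(t,d)$ is the count of code $t$ in patient $d$'s list.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w_{t,d} = \frac{\mathrm{tf}(t,d)\,\mathrm{idf}(t)}{\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t',d)\,\mathrm{idf}(t')\bigr)^{2}}}
$$

Because of the L2 normalisation, each patient's row has unit length. For example, the first `tfidf_cat` row shown later satisfies $0.665558^2 + 0.746346^2 \approx 1$.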



In [240]:
tfidf_hcpcs.tail()

Unnamed: 0,G9829,G0155,G0153,G8946,99241,S0605,G0424,T1502,G0107,G9833,...,G0151,G0444,99221,S9131,T1021,H2000,S9126,G0300,G8111,G0156
2602,0.0,0.189309,0.43075,0.0,0.138251,0.0,0.0,0.0,0.0,0.0,...,0.347628,0.031643,0.022347,0.253637,0.0,0.0,0.0,0.229096,0.0,0.0
2603,0.0,0.209019,0.341454,0.0,0.079149,0.0,0.0,0.0,0.0,0.0,...,0.162833,0.244563,0.028786,0.367558,0.0,0.0,0.0,0.337263,0.0,0.050729
2604,0.0,0.223091,0.47536,0.0,0.049585,0.0,0.0,0.0,0.0,0.0,...,0.204021,0.011349,0.008015,0.295649,0.0,0.0,0.0,0.164334,0.0,0.0
2605,0.0,0.329302,0.0,0.0,0.411498,0.0,0.0,0.0,0.0,0.0,...,0.0,0.03853,0.081633,0.0,0.0,0.0,0.417632,0.0,0.0,0.335671
2606,0.0,0.128664,0.472918,0.0,0.255785,0.0,0.0,0.0,0.116927,0.0,...,0.375876,0.112907,0.0,0.301672,0.0,0.0,0.0,0.233556,0.0,0.0


In [241]:
tfidf_hcpcs['G8946'].value_counts()

G8946
0.000000    2565
0.019588       1
0.040388       1
0.065541       1
0.062876       1
0.055496       1
0.132660       1
0.109426       1
0.121929       1
0.066414       1
0.079641       1
0.019807       1
0.074587       1
0.076923       1
0.064960       1
0.058925       1
0.035548       1
0.049982       1
0.065852       1
0.020787       1
0.096617       1
0.057098       1
0.073511       1
0.026140       1
0.065933       1
0.084639       1
0.063018       1
0.042087       1
0.325489       1
0.074116       1
0.021936       1
0.076889       1
0.024468       1
0.024929       1
0.078955       1
0.069787       1
0.066649       1
0.053317       1
0.085166       1
0.085916       1
0.253100       1
0.094325       1
0.065006       1
Name: count, dtype: int64

In [223]:
tfidf_cat = vectorize(corpus_cat, vocab_cat)
tfidf_cat.head()



Unnamed: 0,HCPCS_level_1,HCPCS_level_2
0,0.665558,0.746346
1,0.89277,0.450512
2,0.938776,0.344529
3,0.855392,0.51798
4,0.313617,0.94955


In [239]:
tfidf_desc = vectorize(corpus_desc, vocab_desc)
tfidf_desc.tail()



Unnamed: 0,National Codes Established for State Medicaid Agencies,Evaluation and Management (E/M) Codes,Temporary Codes,Temporary National Codes (Non-Medicare),Outpatient PPS,Alcohol and Drug Abuse Treatment,Procedures/Professional Services
2602,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2603,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2604,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2605,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2606,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [236]:
tfidf_desc.head(50).tail()

Unnamed: 0,National Codes Established for State Medicaid Agencies,Evaluation and Management (E/M) Codes,Temporary Codes,Temporary National Codes (Non-Medicare),Outpatient PPS,Alcohol and Drug Abuse Treatment,Procedures/Professional Services
45,0.0,0.0,0.0,0.0,0.0,0.0,0.0
46,0.0,0.0,0.0,0.0,0.0,0.0,0.0
47,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [242]:
tfidf_desc['Temporary Codes'].value_counts()

Temporary Codes
0.0    2607
Name: count, dtype: int64

In [225]:
# Combine with the original fixed columns
col_list = ['gender', 'age', 'number_of_claims', 'preventative_care_ind', ]
out = pd.concat([
  df_TF_enc[col_list].reset_index(drop=True),
  tfidf_hcpcs, 
  tfidf_cat,
  tfidf_desc
], axis=1)

In [226]:
out.head()

Unnamed: 0,gender,age,number_of_claims,preventative_care_ind,G9829,G0155,G0153,G8946,99241,S0605,...,G0156,HCPCS_level_1,HCPCS_level_2,National Codes Established for State Medicaid Agencies,Evaluation and Management (E/M) Codes,Temporary Codes,Temporary National Codes (Non-Medicare),Outpatient PPS,Alcohol and Drug Abuse Treatment,Procedures/Professional Services
0,0,79.0,18,1,0.0,0.0,0.0,0.0,0.506433,0.0,...,0.0,0.665558,0.746346,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,75.0,17,1,0.0,0.0,0.0,0.0,0.901996,0.0,...,0.0,0.89277,0.450512,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,77.0,29,1,0.0,0.0,0.0,0.0,0.944587,0.0,...,0.0,0.938776,0.344529,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,78.0,24,1,0.0,0.0,0.0,0.0,0.842321,0.0,...,0.0,0.855392,0.51798,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,80.0,19,1,0.0,0.0,0.0,0.0,0.323873,0.0,...,0.0,0.313617,0.94955,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [227]:
# Export
#out.to_pickle(f"{path}/patient_level_TF_enc.pkl")
out.to_pickle(f"{path}/patient_level_features.pkl")

## EDA

### Breakdown of procedures

NOTE: This is on the `claim_mini_sample` dataset (10,000 entries)

- The most common procedure descriptions are:
  1. Evaluation and Management (E/M) Codes (HCPCS Level I)
  2. Procedures/Professional Services (HCPCS Level II)
- Other descriptions include:
  1. Alcohol and Drug Abuse Treatment
  2. National Codes Established for State Medicaid Agencies
In [198]:
def countplot_with_labels(l, title):
  # horizontal bars (y=), since the labels below are placed using each bar's width
  ax = sns.countplot(y=l, palette='pastel')

  for p in ax.patches:
    ax.text(
      p.get_width() + 1,
      p.get_y() + p.get_height() / 2,
      int(p.get_width()),
      ha="center",
      va="center",
      color="black",
      fontsize=12,
      fontweight="bold"
    )
  
  plt.title(title)

  return plt

Across all claims, what is the overall distribution of HCPCS codes?

In [199]:
all_hcpcs = df['combined_hcpcs_ls'].explode().reset_index()
all_hcpcs = all_hcpcs.merge(
  combined_mapper,
  left_on='combined_hcpcs_ls',
  right_on='code',
  how='left'
)
all_hcpcs = all_hcpcs.drop(['index', 'code'], axis=1)
all_hcpcs = all_hcpcs.fillna("Unknown")

In [None]:
countplot_with_labels(all_hcpcs['category'], "Breakdown of Category for HCPCS")

In [None]:
countplot_with_labels(all_hcpcs['description'], "Breakdown of Descriptions for HCPCS")

For the first and second HCPCS codes of each patient, which procedure descriptions are most common?

In [None]:
countplot_with_labels(df_hcpcs_combined['description_0'], "Breakdown of First Procedure")

In [None]:
countplot_with_labels(df_hcpcs_combined['description_1'], "Breakdown of Second Procedure")

### How long between claim submissions

In [None]:
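# flatten the per-patient interval lists; patients with no interval produce NaN and are dropped
plt.hist(day_interval['billablePeriod_end_ls'].explode().dropna().astype(int), bins=50)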
plt.title("Histogram of all Day Intervals between Claim Submissions")
plt.show()

In [None]:
plt.hist(df_day_interval['day_interval_0'].dropna())
plt.title('Days between the First and Second Claim Submissions')
plt.show()

In [None]:
%watermark