# Unsupervised Learning
## 04_create_unsupervised_features

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 28/09/2025   | Adrienne | Created | Created file for unsupervised learning | 
| 29/09/2025   | Martin | New   | Processing to apply the HCPCS code descriptions + EDA on the new descriptions | 
| 02/10/2025 | Adrienne | Update | Created features |
| 05/10/2025 | Martin | Update | Added TFIDF transformation section for any "list-like" columns |
| 05/10/2025 | Adrienne | Update | Added a feature and cleaned up dataset to include relevant columns |
| 07/10/2025 | Adrienne | Update | Added preventative care indicator feature |
| 15/10/2025 | Adrienne | Update | Code cleanup |

## Content

* [Introduction](#introduction)
* [Load Data](#load-data)
* [Data Processing](#data-processing)
* [Create Features](#create-features)
* [Variable Encoding](#variable-encoding)
* [EDA](#eda)

## Introduction

This program creates features for the unsupervised learning model. In this approach, we TF-IDF encode the combined list columns: each patient's list is treated as a document, which turns the encoding into a bag-of-words problem.

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.pipeline import Pipeline

## Load Data

In [57]:
path = "../data/clean"
df = pd.read_pickle(f"{path}/patient_level.pkl")

In [58]:
mapper_path = "../data/mappers"
combined_mapper = pd.read_pickle(f"{mapper_path}/combined_mapper.pkl")
preventative_mapper = pd.read_pickle(f"{mapper_path}/preventative_mapper.pkl")

## Data Processing

To make the descriptions easier to TF-IDF encode, replace spaces with underscores so that each description becomes a single token.

In [59]:
combined_mapper['description'] = combined_mapper['description'].apply(lambda x: x.replace(' ', '_' ))
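A minimal sketch of why the underscores matter, assuming a whitespace tokenizer like the `tokeniser` function defined later in this notebook. The description string and the helper name `split_tokens` are illustrative only.

```python
# Illustrative description string; not read from combined_mapper.
raw = "Evaluation and Management Codes"
joined = raw.replace(" ", "_")


def split_tokens(text):
    """Whitespace tokenizer, mirroring the behaviour of a plain str.split()."""
    return text.split()


tokens_raw = split_tokens(raw)        # one token per word
tokens_joined = split_tokens(joined)  # the whole description stays a single token

print(tokens_raw)
print(tokens_joined)
```

Without the replacement, each word would become its own vocabulary entry and the description would no longer be counted as one unit.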

In [60]:
combined_mapper.head(500).tail(10)

Unnamed: 0,code,category,description
490,291,HCPCS_level_1,Anesthesia_Codes
491,292,HCPCS_level_1,Anesthesia_Codes
492,293,HCPCS_level_1,Anesthesia_Codes
493,294,HCPCS_level_1,Anesthesia_Codes
494,295,HCPCS_level_1,Anesthesia_Codes
495,296,HCPCS_level_1,Anesthesia_Codes
496,297,HCPCS_level_1,Anesthesia_Codes
497,298,HCPCS_level_1,Anesthesia_Codes
498,299,HCPCS_level_1,Anesthesia_Codes
499,300,HCPCS_level_1,Anesthesia_Codes


Keep only the columns needed downstream; the rest are either unnecessary or a potential source of data leakage.

In [61]:
# columns to keep (identifiers, demographics and the combined list columns)
keep_cols = ['patient_medicare_number', 'gender', 'age', 'number_of_claims', 'combined_hcpcs_ls', 'combined_diagnosis_ls', 'combined_principal_diagnosis_ls', 'drg_ls', 'billablePeriod_start_ls', 'billablePeriod_end_ls', 'location_of_bill_ls', 'total_value']
df = df[keep_cols]

In [62]:
df.head()

Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,combined_diagnosis_ls,combined_principal_diagnosis_ls,drg_ls,billablePeriod_start_ls,billablePeriod_end_ls,location_of_bill_ls,total_value
1,1S00E00AA10,female,79.0,18,"[G0444, 99221, G0444, G0444, G0444, 99221, 992...","[O039, O039, S63509, O039, O039, O039, O039, S...","[O039, O039, B002, B002, B085, O039, S8290X, J...","[001, 001, 001]","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",149.37
18,1S00E00AA16,male,75.0,17,"[99241, G0444, 99241, G0444, 99241, G0444, G95...","[E669, E785, E669, E785, E669, E785, E669, E78...","[E785, E785, E785, E785, B085, E785, E785, J01...",[],"[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...","[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",234.72
35,1S00E00AA23,female,77.0,29,"[99241, 99241, 99241, G0444, 99241, 99241, 992...","[J329, E785, I10, J329, E785, P292, J329, E785...","[J329, E785, J329, J029, J329, J029, J329, J32...",[],"[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...","[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",85.55
64,1S00E00AA25,female,78.0,24,"[G0107, G0444, 99241, G0444, G0444, 99241, G04...","[E669, D649, K635, O039, M810, J329, Z3400, E6...","[Z3400, Z3400, Z3400, J0190, E669, S72009, J01...",[001],"[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...","[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",21901.4
89,1S00E00AA32,male,80.0,19,"[G0444, 99241, G0444, G0444, G0444, G9572, 992...","[P292, E669, I2510, P292, J209, E669, I2510, I...","[J209, J209, J329, J0390, I10, J209, J209, J20...",[001],"[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...","[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...","[002, 002, 002, 002, 002, 002, 002, 002, 002]",8388.69


Just drop rows where age is missing

In [63]:
df[df['age'].isnull()]

Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,combined_diagnosis_ls,combined_principal_diagnosis_ls,drg_ls,billablePeriod_start_ls,billablePeriod_end_ls,location_of_bill_ls,total_value
8773,1S00E00GA64,male,,49,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[I639, I639, I639, I639, I639, I639, I639, I63...",[],[],"[2012-04-24, 2012-05-22, 2012-12-04, 2013-05-1...","[2012-04-24, 2012-05-22, 2012-12-04, 2013-05-1...",[],101.17
14912,1S00E00GK24,male,,18,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[E781, E781, E781, E781, E781, E781, E781, E78...",[],[],"[2013-10-14, 2013-12-16, 2015-07-13, 2016-01-0...","[2013-10-14, 2013-12-16, 2015-07-13, 2016-01-0...",[],114.9
37555,1S00E00HT00,male,,99,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[R739, R739, R739, R739, R739, R739, R739, R73...",[],[],"[2012-02-19, 2012-04-15, 2012-05-13, 2012-06-1...","[2012-02-19, 2012-04-15, 2012-05-13, 2012-06-1...",[],142.58
55609,1S00E00JU46,male,,40,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[R739, R739, R739, R739, R739, R739, R739, R73...",[],[],"[2013-10-25, 2015-03-20, 2015-12-11, 2016-02-1...","[2013-10-25, 2015-03-20, 2015-12-11, 2016-02-1...",[],123.66
75508,1S00E00ME11,male,,91,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[E781, E781, E781, E781, E781, E781, E781, E78...",[],[],"[2012-05-14, 2012-06-11, 2012-07-23, 2012-09-1...","[2012-05-14, 2012-06-11, 2012-07-23, 2012-09-1...",[],105.46
78272,1S00E00MH19,male,,44,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[E119, E119, E119, E119, E119, E119, E119, E11...",[],[],"[2012-04-30, 2012-05-21, 2012-06-18, 2013-05-2...","[2012-04-30, 2012-05-21, 2012-06-18, 2013-05-2...",[],134.38


In [64]:
# copy so that later column assignments do not act on a view of the original frame
df = df[df['age'].notnull()].copy()

Limit the dataset to patients with fewer than 1000 entries in `combined_hcpcs_ls`. This removes only five patients and reduces the longest list to 670 codes.

In [65]:
print(f"Before Length: {len(df)}")
df['ls_len'] = df['combined_hcpcs_ls'].str.len()
df = df[df['ls_len'] < 1000]
print(f"After Length: {len(df)}")

Before Length: 2612
After Length: 2607


## Create Features

Focusing on transforming the HCPCS codes into a useable format for unsupervised learning.

- HCPCS
  - code
  - category
  - description

### Apply mapper to HCPCS lists

Using the mapper, we attach a category and a description column to each individual HCPCS column.

In [None]:
# drop columns that are entirely NaN (axis=1); the row count printed below is unaffected
print(len(df))
df.dropna(axis=1, how='all', inplace=True)
print(len(df))

2607
2607


In [67]:
# examining unique hcpcs codes in dataset
unique_values = set(value for sublist in df['combined_hcpcs_ls'] for value in sublist)
print(unique_values)
print(len(unique_values))

{'G9857', 'G0300', 'G9573', 'C8905', 'G0464', 'G0424', 'C8928', 'S9473', 'G9708', 'G0402', 'G9829', 'G8946', 'G0107', 'Q5001', 'G0158', 'T1021', 'G0458', 'G0129', 'G9572', 'G0154', 'S9129', 'G8111', 'G8159', 'G0156', 'S9122', 'C8908', 'G0155', 'G0152', 'G0157', 'S9131', 'G0151', 'G9858', '99241', 'G0102', 'S0605', 'G0444', 'G9833', 'T1502', '99221', 'G0299', 'G0153', 'H2000', 'S9126', 'S8075'}
44


In [68]:
# create individual hcpcs columns
maxlen = max(df['combined_hcpcs_ls'].str.len())
print(f"max combined_hcpcs_ls length: {maxlen}")
df_hcpcs = df['combined_hcpcs_ls'].apply(pd.Series)
df_hcpcs = df_hcpcs.add_prefix('hcpcs_')
df_hcpcs = pd.concat([df, df_hcpcs], axis = 1)
df_hcpcs.head()

max combined_hcpcs_ls length: 670


Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,combined_diagnosis_ls,combined_principal_diagnosis_ls,drg_ls,billablePeriod_start_ls,billablePeriod_end_ls,...,hcpcs_660,hcpcs_661,hcpcs_662,hcpcs_663,hcpcs_664,hcpcs_665,hcpcs_666,hcpcs_667,hcpcs_668,hcpcs_669
1,1S00E00AA10,female,79.0,18,"[G0444, 99221, G0444, G0444, G0444, 99221, 992...","[O039, O039, S63509, O039, O039, O039, O039, S...","[O039, O039, B002, B002, B085, O039, S8290X, J...","[001, 001, 001]","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...",...,,,,,,,,,,
18,1S00E00AA16,male,75.0,17,"[99241, G0444, 99241, G0444, 99241, G0444, G95...","[E669, E785, E669, E785, E669, E785, E669, E78...","[E785, E785, E785, E785, B085, E785, E785, J01...",[],"[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...","[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...",...,,,,,,,,,,
35,1S00E00AA23,female,77.0,29,"[99241, 99241, 99241, G0444, 99241, 99241, 992...","[J329, E785, I10, J329, E785, P292, J329, E785...","[J329, E785, J329, J029, J329, J029, J329, J32...",[],"[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...","[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...",...,,,,,,,,,,
64,1S00E00AA25,female,78.0,24,"[G0107, G0444, 99241, G0444, G0444, 99241, G04...","[E669, D649, K635, O039, M810, J329, Z3400, E6...","[Z3400, Z3400, Z3400, J0190, E669, S72009, J01...",[001],"[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...","[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...",...,,,,,,,,,,
89,1S00E00AA32,male,80.0,19,"[G0444, 99241, G0444, G0444, G0444, G9572, 992...","[P292, E669, I2510, P292, J209, E669, I2510, I...","[J209, J209, J329, J0390, I10, J209, J209, J20...",[001],"[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...","[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...",...,,,,,,,,,,
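As a side note, the list-to-columns expansion can also be sketched without `apply(pd.Series)`, by handing the lists straight to the `DataFrame` constructor. The toy Series below is a stand-in for `df['combined_hcpcs_ls']`; its values and index are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for df['combined_hcpcs_ls'], with a non-default index.
codes = pd.Series(
    [["G0444", "99221", "G0444"], ["99241"]],
    index=[1, 18],
    name="combined_hcpcs_ls",
)

# Build the wide frame in one step; shorter lists are padded with missing values.
wide = pd.DataFrame(codes.tolist(), index=codes.index).add_prefix("hcpcs_")

print(wide)
```

Passing `index=codes.index` keeps the original row labels, so the result can still be concatenated with the source frame along `axis=1`.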


In [69]:
# adding category and description for each hcpcs code
for i in range(maxlen):
  df_hcpcs = pd.merge( df_hcpcs, 
    combined_mapper,
    left_on=f"hcpcs_{i}",
    right_on="code",
    how='left'
  )
  df_hcpcs = df_hcpcs.drop(['code'], axis=1)
  df_hcpcs = df_hcpcs.rename({
    'category': f"category_{i}",
    'description': f"description_{i}",
  }, axis=1)

df_hcpcs_combined = df_hcpcs.fillna(np.nan)

In [70]:
df_hcpcs_combined[['hcpcs_1', 'category_1', 'description_1', 'hcpcs_2', 'category_2',  'description_2']].head()

Unnamed: 0,hcpcs_1,category_1,description_1,hcpcs_2,category_2,description_2
0,99221,HCPCS_level_1,Evaluation_and_Management_(E/M)_Codes_,G0444,HCPCS_level_2,Procedures/Professional_Services
1,G0444,HCPCS_level_2,Procedures/Professional_Services,99241,HCPCS_level_1,Evaluation_and_Management_(E/M)_Codes_
2,99241,HCPCS_level_1,Evaluation_and_Management_(E/M)_Codes_,99241,HCPCS_level_1,Evaluation_and_Management_(E/M)_Codes_
3,G0444,HCPCS_level_2,Procedures/Professional_Services,99241,HCPCS_level_1,Evaluation_and_Management_(E/M)_Codes_
4,99241,HCPCS_level_1,Evaluation_and_Management_(E/M)_Codes_,G0444,HCPCS_level_2,Procedures/Professional_Services
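The per-column merges above run once per list position. If only the combined category and description lists are needed (as in the TF-IDF section), a dictionary lookup on the list column is one possible shortcut. This is a sketch on toy data, not the notebook's method; the mapper rows and the code `X0000` are made up.

```python
import pandas as pd

# Toy mapper with the same column names as combined_mapper; values are illustrative.
mapper = pd.DataFrame({
    "code": ["99221", "G0444"],
    "category": ["HCPCS_level_1", "HCPCS_level_2"],
    "description": [
        "Evaluation_and_Management_(E/M)_Codes_",
        "Procedures/Professional_Services",
    ],
})
patients = pd.DataFrame(
    {"combined_hcpcs_ls": [["G0444", "99221"], ["99221", "X0000"]]}
)

# One lookup dict per target column. If a code appeared twice in the mapper,
# the last occurrence would win.
cat_lookup = dict(zip(mapper["code"], mapper["category"]))
desc_lookup = dict(zip(mapper["code"], mapper["description"]))

# Translate each list element; unmapped codes (the made-up 'X0000') are skipped.
patients["cat_ls"] = patients["combined_hcpcs_ls"].apply(
    lambda ls: [cat_lookup[c] for c in ls if c in cat_lookup]
)
patients["desc_ls"] = patients["combined_hcpcs_ls"].apply(
    lambda ls: [desc_lookup[c] for c in ls if c in desc_lookup]
)

print(patients[["cat_ls", "desc_ls"]])
```

Skipping unmapped codes matches what a left merge followed by a not-null filter would produce for the combined lists.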


### Time interval between claims

Using `billablePeriod_end_ls`, sort the claim end dates and compute the number of days between consecutive claims, then expand the intervals into individual columns.

In [22]:
def days_between_claim(item):
  sorted_dates = pd.to_datetime(pd.Series(item)).sort_values().reset_index(drop=True)
  return sorted_dates.diff().dt.days.dropna().astype(int).tolist()
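A quick usage check of `days_between_claim`, repeated here so the snippet runs on its own. The sample dates are illustrative and deliberately unsorted.

```python
import pandas as pd


def days_between_claim(item):
    # Sort the dates, then take the difference between consecutive entries in days.
    sorted_dates = pd.to_datetime(pd.Series(item)).sort_values().reset_index(drop=True)
    return sorted_dates.diff().dt.days.dropna().astype(int).tolist()


sample_dates = ["2012-05-15", "2012-04-17", "2013-04-23"]
intervals = days_between_claim(sample_dates)
print(intervals)  # [28, 343]

# A patient with a single claim has no interval at all.
single = days_between_claim(["2012-01-01"])
print(single)  # []
```

A list of `n` dates yields `n - 1` intervals, which is why the expanded frame has fewer `day_interval_*` columns than the longest claim list has entries.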

In [23]:
day_interval = pd.DataFrame(df['billablePeriod_end_ls'].apply(days_between_claim))
day_maxlen = max(day_interval['billablePeriod_end_ls'].str.len())
df_day_interval = pd.DataFrame(day_interval['billablePeriod_end_ls'].to_list(), columns=[f"day_interval_{i}" for i in range(day_maxlen)])
df_day_interval.head()

Unnamed: 0,day_interval_0,day_interval_1,day_interval_2,day_interval_3,day_interval_4,day_interval_5,day_interval_6,day_interval_7,day_interval_8,day_interval_9,...,day_interval_655,day_interval_656,day_interval_657,day_interval_658,day_interval_659,day_interval_660,day_interval_661,day_interval_662,day_interval_663,day_interval_664
0,28,343,371,371,14,241,3,106,52,92.0,...,,,,,,,,,,
1,27,338,33,332,39,326,45,137,183,365.0,...,,,,,,,,,,
2,32,8,31,44,9,3,18,28,84,140.0,...,,,,,,,,,,
3,94,360,11,371,124,247,136,235,74,61.0,...,,,,,,,,,,
4,320,9,42,371,360,11,371,289,82,29.0,...,,,,,,,,,,


### Preventative Care Indicator

In [71]:
# flag patients with at least one preventative-care code in combined_hcpcs_ls
prev_codes = set(preventative_mapper['HCPCS Code'])
df_hcpcs_combined['preventative_care_ind'] = df_hcpcs_combined['combined_hcpcs_ls'].apply(
    lambda ls: int(any(code in prev_codes for code in ls))
)

In [72]:
# quick check
df_hcpcs_combined['preventative_care_ind'].value_counts()

preventative_care_ind
1    2578
0      29
Name: count, dtype: int64

## Variable Encoding

Tried two different feature encodings:
- df_lab_enc: will have the hcpcs columns encoded using label encoding
- df_TF_enc: will use the combined_hcpcs_ls column and treat it like a bag of words problem and use a TF-IDF transformation

There are some variables that will always be label encoded

In [73]:
# Check to make sure gender is not missing
vals = df_hcpcs_combined['gender'].value_counts(normalize=True) * 100
pd.DataFrame({
  'gender_breakdown': vals
}).head(10)

Unnamed: 0_level_0,gender_breakdown
gender,Unnamed: 1_level_1
female,53.126199
male,46.873801


In [74]:
# will always encode gender using labels
le_gen = LabelEncoder()
df_hcpcs_combined['gender'] = le_gen.fit_transform(df_hcpcs_combined['gender'])
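To see which integer each gender receives, the fitted encoder's `classes_` attribute can be inspected. A minimal sketch on toy values; the name `le_demo` is used only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(["female", "male", "female"])

# classes_ is sorted, so a label's position in the array is its integer code.
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)  # {'female': 0, 'male': 1}

# inverse_transform recovers the original labels from the codes.
decoded = le_demo.inverse_transform([1]).tolist()
```

This agrees with the encoded `gender` column shown in later outputs, where female rows carry 0 and male rows carry 1.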

In [75]:
# create a list of hcpcs columns
# the first match is combined_hcpcs_ls itself, so it is dropped
hcpcs_cols = df_hcpcs_combined.columns[df_hcpcs_combined.columns.str.contains("hcpcs")][1:]

# create a dataframe of unique hcpcs values for encoding
ls = list(set(value for sublist in df_hcpcs_combined['combined_hcpcs_ls'] for value in sublist))
# new hcpcs columns are filled with nan
ls.append(np.nan)
df_unique_hcpcs = pd.DataFrame( {'unique_hcpcs': ls})


In [84]:
# create copies of the dataset
df_lab_enc = df_hcpcs_combined.copy()
df_TF_enc = df_hcpcs_combined.copy()

### Label Encoding HCPCS, Category and Description

In [85]:
# create a list of category columns
category_cols =  df_lab_enc.columns[ df_lab_enc.columns.str.contains("category")]

# create a dataframe of unique category values for encoding
ls = list(set(value for value in combined_mapper['category']))
# new columns are filled with nan
ls.append(np.nan)
df_unique_category = pd.DataFrame( {'unique_category': ls})

# create instance of label encoder
le = LabelEncoder()
# fit label encoding on the unique category values (including NaN)
le.fit(df_unique_category['unique_category'])
 
# apply same encoder to rest of columns
for col in category_cols:
    df_lab_enc[col + '_enc'] = le.transform( df_lab_enc[col])


In [86]:
# create a list of description columns
desc_cols =  df_lab_enc.columns[ df_lab_enc.columns.str.contains("description")]

# create a dataframe of unique description values for encoding
ls = list(set(value for value in combined_mapper['description']))
# new columns are filled with nan
ls.append(np.nan)
df_unique_desc = pd.DataFrame( {'unique_desc': ls})


# create instance of label encoder
le = LabelEncoder()
# fit label encoding on the unique description values (including NaN)
le.fit(df_unique_desc['unique_desc'])
 
# apply same encoder to rest of columns
for col in desc_cols:
    df_lab_enc[col + '_enc'] = le.transform( df_lab_enc[col])


In [87]:
# create instance of label encoder
le = LabelEncoder()
# fit label encoding on the unique hcpcs values (including NaN)
le.fit(df_unique_hcpcs['unique_hcpcs'])
 
# apply same encoder to rest of columns
for col in hcpcs_cols:
    df_lab_enc[col + '_enc'] = le.transform(df_lab_enc[col])


In [88]:
# check encodings
df_lab_enc[['category_0', 'category_0_enc', 'hcpcs_0', 'hcpcs_0_enc', 'hcpcs_1', 'hcpcs_1_enc', 'hcpcs_2', 'hcpcs_2_enc', 'gender']].head()

Unnamed: 0,category_0,category_0_enc,hcpcs_0,hcpcs_0_enc,hcpcs_1,hcpcs_1_enc,hcpcs_2,hcpcs_2_enc,gender
0,HCPCS_level_2,2,G0444,20,99221,0,G0444,20,0
1,HCPCS_level_1,1,99241,1,G0444,20,99241,1,1
2,HCPCS_level_1,1,99241,1,99241,1,99241,1,0
3,HCPCS_level_2,2,G0107,6,G0444,20,99241,1,0
4,HCPCS_level_2,2,G0444,20,99241,1,G0444,20,1
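The shared-encoder pattern used for the category, description and HCPCS columns can be illustrated on toy data. This sketch differs from the notebook in two labelled ways: it uses an explicit placeholder string (`MISSING`, a hypothetical name) instead of NaN, and it builds all encoded columns in one step rather than assigning them one at a time, which can avoid pandas' fragmentation warning on very wide frames.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

MISSING = "MISSING"  # hypothetical placeholder; the notebook encodes NaN instead

# Toy wide frame: the second patient has only one code.
wide = pd.DataFrame({
    "hcpcs_0": ["G0444", "99241"],
    "hcpcs_1": ["99221", None],
}).fillna(MISSING)

# Fit once on the union of all values so every column shares the same codes.
all_values = pd.unique(wide.to_numpy().ravel())
shared_le = LabelEncoder().fit(all_values)

# Encode every column with the same fitted encoder, building the frame in one go.
encoded = pd.DataFrame(
    {f"{col}_enc": shared_le.transform(wide[col]) for col in wide.columns},
    index=wide.index,
)

print(list(shared_le.classes_))
print(encoded)
```

Because the encoder is fitted once, a given code maps to the same integer in every position, which is what makes the encoded columns comparable.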


In [90]:
df_lab_enc.head()

Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,combined_diagnosis_ls,combined_principal_diagnosis_ls,drg_ls,billablePeriod_start_ls,billablePeriod_end_ls,...,hcpcs_660_enc,hcpcs_661_enc,hcpcs_662_enc,hcpcs_663_enc,hcpcs_664_enc,hcpcs_665_enc,hcpcs_666_enc,hcpcs_667_enc,hcpcs_668_enc,hcpcs_669_enc
0,1S00E00AA10,0,79.0,18,"[G0444, 99221, G0444, G0444, G0444, 99221, 992...","[O039, O039, S63509, O039, O039, O039, O039, S...","[O039, O039, B002, B002, B085, O039, S8290X, J...","[001, 001, 001]","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...",...,44,44,44,44,44,44,44,44,44,44
1,1S00E00AA16,1,75.0,17,"[99241, G0444, 99241, G0444, 99241, G0444, G95...","[E669, E785, E669, E785, E669, E785, E669, E78...","[E785, E785, E785, E785, B085, E785, E785, J01...",[],"[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...","[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...",...,44,44,44,44,44,44,44,44,44,44
2,1S00E00AA23,0,77.0,29,"[99241, 99241, 99241, G0444, 99241, 99241, 992...","[J329, E785, I10, J329, E785, P292, J329, E785...","[J329, E785, J329, J029, J329, J029, J329, J32...",[],"[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...","[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...",...,44,44,44,44,44,44,44,44,44,44
3,1S00E00AA25,0,78.0,24,"[G0107, G0444, 99241, G0444, G0444, 99241, G04...","[E669, D649, K635, O039, M810, J329, Z3400, E6...","[Z3400, Z3400, Z3400, J0190, E669, S72009, J01...",[001],"[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...","[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...",...,44,44,44,44,44,44,44,44,44,44
4,1S00E00AA32,1,80.0,19,"[G0444, 99241, G0444, G0444, G0444, G9572, 992...","[P292, E669, I2510, P292, J209, E669, I2510, I...","[J209, J209, J329, J0390, I10, J209, J209, J20...",[001],"[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...","[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...",...,44,44,44,44,44,44,44,44,44,44


In [92]:
# drop original columns and list columns
drop_ls = list(category_cols) + list(desc_cols) + list(hcpcs_cols) + ['patient_medicare_number', 'total_value', 'combined_hcpcs_ls', 'billablePeriod_start_ls', 'billablePeriod_end_ls','location_of_bill_ls', 'ls_len']
df_lab_enc_drop = df_lab_enc.drop(drop_ls, axis = 1)
df_lab_enc_drop.head()

Unnamed: 0,gender,age,number_of_claims,combined_diagnosis_ls,combined_principal_diagnosis_ls,drg_ls,preventative_care_ind,category_0_enc,category_1_enc,category_2_enc,...,hcpcs_660_enc,hcpcs_661_enc,hcpcs_662_enc,hcpcs_663_enc,hcpcs_664_enc,hcpcs_665_enc,hcpcs_666_enc,hcpcs_667_enc,hcpcs_668_enc,hcpcs_669_enc
0,0,79.0,18,"[O039, O039, S63509, O039, O039, O039, O039, S...","[O039, O039, B002, B002, B085, O039, S8290X, J...","[001, 001, 001]",1,2,1,2,...,44,44,44,44,44,44,44,44,44,44
1,1,75.0,17,"[E669, E785, E669, E785, E669, E785, E669, E78...","[E785, E785, E785, E785, B085, E785, E785, J01...",[],1,1,2,1,...,44,44,44,44,44,44,44,44,44,44
2,0,77.0,29,"[J329, E785, I10, J329, E785, P292, J329, E785...","[J329, E785, J329, J029, J329, J029, J329, J32...",[],1,1,1,1,...,44,44,44,44,44,44,44,44,44,44
3,0,78.0,24,"[E669, D649, K635, O039, M810, J329, Z3400, E6...","[Z3400, Z3400, Z3400, J0190, E669, S72009, J01...",[001],1,2,2,1,...,44,44,44,44,44,44,44,44,44,44
4,1,80.0,19,"[P292, E669, I2510, P292, J209, E669, I2510, I...","[J209, J209, J329, J0390, I10, J209, J209, J20...",[001],1,2,1,2,...,44,44,44,44,44,44,44,44,44,44


In [93]:
# save dataset
path = "../data/clean"
df_lab_enc_drop.to_pickle(f"{path}/patient_level_lab_enc.pkl")

### TF-IDF Encoding

Combine the category and description columns into lists, so that each list can be treated as a document in a corpus.

In [94]:
# Combined Category column
cat_cols = df_TF_enc.columns[df_TF_enc.columns.str.contains("category")]
df_TF_enc['cat_ls'] = df_TF_enc[cat_cols].apply(lambda row: [x for x in row if pd.notnull(x)] , axis = 1)

# Combined Description column
desc_cols = df_TF_enc.columns[df_TF_enc.columns.str.contains("description")]
df_TF_enc['desc_ls'] = df_TF_enc[desc_cols].apply(lambda row: [ x for x in row if pd.notnull(x)] , axis = 1)

In [95]:
df_TF_enc.head()

Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,combined_diagnosis_ls,combined_principal_diagnosis_ls,drg_ls,billablePeriod_start_ls,billablePeriod_end_ls,...,description_666,category_667,description_667,category_668,description_668,category_669,description_669,preventative_care_ind,cat_ls,desc_ls
0,1S00E00AA10,0,79.0,18,"[G0444, 99221, G0444, G0444, G0444, 99221, 992...","[O039, O039, S63509, O039, O039, O039, O039, S...","[O039, O039, B002, B002, B085, O039, S8290X, J...","[001, 001, 001]","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...","[2012-04-17, 2012-05-15, 2013-04-23, 2014-04-2...",...,,,,,,,,1,"[HCPCS_level_2, HCPCS_level_1, HCPCS_level_2, ...","[Procedures/Professional_Services, Evaluation_..."
1,1S00E00AA16,1,75.0,17,"[99241, G0444, 99241, G0444, 99241, G0444, G95...","[E669, E785, E669, E785, E669, E785, E669, E78...","[E785, E785, E785, E785, B085, E785, E785, J01...",[],"[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...","[2012-09-23, 2012-10-20, 2013-09-23, 2013-10-2...",...,,,,,,,,1,"[HCPCS_level_1, HCPCS_level_2, HCPCS_level_1, ...","[Evaluation_and_Management_(E/M)_Codes_, Proce..."
2,1S00E00AA23,0,77.0,29,"[99241, 99241, 99241, G0444, 99241, 99241, 992...","[J329, E785, I10, J329, E785, P292, J329, E785...","[J329, E785, J329, J029, J329, J029, J329, J32...",[],"[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...","[2012-01-21, 2012-02-22, 2012-03-01, 2012-04-0...",...,,,,,,,,1,"[HCPCS_level_1, HCPCS_level_1, HCPCS_level_1, ...","[Evaluation_and_Management_(E/M)_Codes_, Evalu..."
3,1S00E00AA25,0,78.0,24,"[G0107, G0444, 99241, G0444, G0444, 99241, G04...","[E669, D649, K635, O039, M810, J329, Z3400, E6...","[Z3400, Z3400, Z3400, J0190, E669, S72009, J01...",[001],"[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...","[2012-04-15, 2012-07-18, 2013-07-13, 2013-07-2...",...,,,,,,,,1,"[HCPCS_level_2, HCPCS_level_2, HCPCS_level_1, ...","[Procedures/Professional_Services, Procedures/..."
4,1S00E00AA32,1,80.0,19,"[G0444, 99241, G0444, G0444, G0444, G9572, 992...","[P292, E669, I2510, P292, J209, E669, I2510, I...","[J209, J209, J329, J0390, I10, J209, J209, J20...",[001],"[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...","[2012-05-05, 2013-03-21, 2013-03-30, 2013-05-1...",...,,,,,,,,1,"[HCPCS_level_2, HCPCS_level_1, HCPCS_level_2, ...","[Procedures/Professional_Services, Evaluation_..."


In [96]:
# drop original columns as they are not needed
drop_ls = list(cat_cols) + list(desc_cols)
df_TF_enc = df_TF_enc.drop(drop_ls, axis = 1)

Functions to TF-IDF Encode

In [97]:
def tokeniser(text):
  return text.split()

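def get_corpus_and_vocab(df, col):
  # join each patient's list into one space-separated "document"
  corpus = df[col].apply(lambda x: " ".join(x)).to_list()

  # map every distinct token in the column to a feature index
  vocab = list(set([i for sublist in df[col].to_list() for i in sublist]))
  vocab = {k: i for i, k in enumerate(vocab)}
  return corpus, vocab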

In [98]:
corpus, vocab = get_corpus_and_vocab(df_TF_enc, 'combined_hcpcs_ls')
corpus_cat, vocab_cat = get_corpus_and_vocab(df_TF_enc, 'cat_ls')
corpus_desc, vocab_desc = get_corpus_and_vocab(df_TF_enc, 'desc_ls')

In [99]:
# Pipeline for tfidf and countvectoriser
def vectorize(corpus, vocab):
  pipe = Pipeline([
    ('count', CountVectorizer(vocabulary=vocab, tokenizer=tokeniser, lowercase=False)),
    ('tfidf', TfidfTransformer())
  ])

  tfidfs = pipe.fit_transform(corpus)
  df_tfidfs = pd.DataFrame(tfidfs.toarray(), columns=pipe['count'].get_feature_names_out())
  
  return df_tfidfs
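To make the weights concrete, here is a small worked example that runs the same two-step pipeline on a toy corpus and recomputes the result by hand. It assumes the `TfidfTransformer` defaults (smoothed IDF, L2 row normalisation); the corpus and vocabulary are illustrative, and `token_pattern=None` is passed only to silence the warning about the unused default pattern.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Two toy "documents", one per patient, and a fixed vocabulary.
corpus = ["G0444 99221 G0444", "99241 G0444"]
vocab = {"99221": 0, "99241": 1, "G0444": 2}

pipe = Pipeline([
    ("count", CountVectorizer(vocabulary=vocab, tokenizer=str.split,
                              token_pattern=None, lowercase=False)),
    ("tfidf", TfidfTransformer()),
])
weights = pipe.fit_transform(corpus).toarray()

# Manual recomputation with the default settings:
# idf = ln((1 + n_docs) / (1 + doc_freq)) + 1, then L2-normalise each row.
counts = np.array([[1, 0, 2], [0, 1, 1]], dtype=float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

print(np.round(weights, 4))
```

A code that appears in every document (here `G0444`) gets the minimum IDF of 1, so rarer codes are weighted up relative to it, and every row has unit length after normalisation.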

In [100]:
# encode hcpcs combined column
tfidf_hcpcs = vectorize(corpus, vocab)
tfidf_hcpcs.head()



Unnamed: 0,G9857,G0300,G9573,C8905,G0464,G0424,C8928,S9473,G9708,G0402,...,S0605,G0444,G9833,T1502,99221,G0299,G0153,H2000,S9126,S8075
0,0.0,0.0,0.0,0.0,0.279341,0.0,0.0,0.0,0.0,0.0,...,0.0,0.695481,0.0,0.0,0.368375,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.387095,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.309558,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.495751,0.0,0.0,0.087528,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.136924,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.667158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [101]:
# encode category combined column
tfidf_cat = vectorize(corpus_cat, vocab_cat)
tfidf_cat.head()



Unnamed: 0,HCPCS_level_2,HCPCS_level_1
0,0.746346,0.665558
1,0.450512,0.89277
2,0.344529,0.938776
3,0.51798,0.855392
4,0.94955,0.313617


In [102]:
# encode description combined column
tfidf_desc = vectorize(corpus_desc, vocab_desc)
vocab_desc



{'Temporary_National_Codes_(Non-Medicare)': 0,
 'Evaluation_and_Management_(E/M)_Codes_': 1,
 'Alcohol_and_Drug_Abuse_Treatment': 2,
 'Procedures/Professional_Services': 3,
 'Outpatient_PPS': 4,
 'Temporary_Codes': 5,
 'National_Codes_Established_for_State_Medicaid_Agencies': 6}

In [103]:
tfidf_desc.head()

Unnamed: 0,Temporary_National_Codes_(Non-Medicare),Evaluation_and_Management_(E/M)_Codes_,Alcohol_and_Drug_Abuse_Treatment,Procedures/Professional_Services,Outpatient_PPS,Temporary_Codes,National_Codes_Established_for_State_Medicaid_Agencies
0,0.0,0.665558,0.0,0.746346,0.0,0.0,0.0
1,0.0,0.89277,0.0,0.450512,0.0,0.0,0.0
2,0.0,0.938776,0.0,0.344529,0.0,0.0,0.0
3,0.0,0.855392,0.0,0.51798,0.0,0.0,0.0
4,0.0,0.313617,0.0,0.94955,0.0,0.0,0.0


In [105]:
# Combine with the original fixed columns
col_list = ['patient_medicare_number', 'gender', 'age', 'number_of_claims', 'preventative_care_ind', 'combined_hcpcs_ls', 'combined_diagnosis_ls', 'combined_principal_diagnosis_ls', 'drg_ls', 'total_value']
out = pd.concat([
  df_TF_enc[col_list].reset_index(drop=True),
  tfidf_hcpcs, 
  tfidf_cat,
  tfidf_desc
], axis=1)
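The `reset_index(drop=True)` call matters because `pd.concat` with `axis=1` aligns rows on index labels, not on position. A small sketch with made-up values shows what can go wrong when one frame carries a non-default index:

```python
import pandas as pd

# Left frame keeps labels from an earlier filter; right frame has a fresh 0..n-1 index.
base = pd.DataFrame({"age": [79.0, 75.0]}, index=[1, 18])
tfidf = pd.DataFrame({"G0444": [0.70, 0.39]})

# Label alignment: the union of {1, 18} and {0, 1} gives three rows with gaps.
misaligned = pd.concat([base, tfidf], axis=1)

# Resetting the index first makes the rows line up by position.
aligned = pd.concat([base.reset_index(drop=True), tfidf], axis=1)

print(misaligned)
print(aligned)
```

The TF-IDF frames are built from plain arrays, so they always carry a default index; resetting the fixed columns keeps the two sides consistent.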

In [107]:
# save version of dataset with patient_medicare_numbers still in it
out.to_pickle(f"{path}/patient_level_features_med_num.pkl")

In [108]:
# Export
drop_ls = ['patient_medicare_number', 'combined_diagnosis_ls', 'combined_principal_diagnosis_ls', 'drg_ls', 'total_value']
out.drop(drop_ls, axis = 1, inplace=True)
out.to_pickle(f"{path}/patient_level_features.pkl")

## EDA

### Breakdown of procedures

NOTE: This is on the `claim_mini_sample` dataset (10,000 entries)

- The most common procedure descriptions are:
  1. Evaluation and Management (E/M) Codes (HCPCS Level I)
  2. Procedures/Professional Services (HCPCS Level II)
- Other codes include
  1. Alcohol and Drug Abuse Treatment
  2. National Codes Established for State Medicaid Agencies

In [198]:
def countplot_with_labels(l, title):
  # plot horizontally (y=...) so each bar's width is its count, as the labels below assume
  ax = sns.countplot(y=l, palette='pastel')

  for p in ax.patches:
    ax.text(
      p.get_width() + 1,
      p.get_y() + p.get_height() / 2,
      int(p.get_width()),
      ha="center",
      va="center",
      color="black",
      fontsize=12,
      fontweight="bold"
    )
  
  plt.title(title)

  return plt

Overall, what is the distribution of HCPCS codes across all claims?

In [199]:
all_hcpcs = df['combined_hcpcs_ls'].explode().reset_index()
all_hcpcs = all_hcpcs.merge(
  combined_mapper,
  left_on='combined_hcpcs_ls',
  right_on='code',
  how='left'
)
all_hcpcs = all_hcpcs.drop(['index', 'code'], axis=1)
all_hcpcs = all_hcpcs.fillna("Unknown")

In [None]:
countplot_with_labels(all_hcpcs['category'], "Breakdown of Category for HCPCS")

In [None]:
countplot_with_labels(all_hcpcs['description'], "Breakdown of Descriptions for HCPCS")

For the first and second HCPCS codes, compare which procedure descriptions are most common.

In [None]:
countplot_with_labels(df_hcpcs_combined['description_0'], "Breakdown of First Procedure")

In [None]:
countplot_with_labels(df_hcpcs_combined['description_1'], "Breakdown of Second Procedure")

### How long between claim submissions

In [None]:
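# flatten the per-patient interval lists; patients with a single claim contribute no values
plt.hist(day_interval['billablePeriod_end_ls'].explode().dropna().astype(int), bins=50)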
plt.title("Histogram of all Day Intervals between Claim Submissions")
plt.show()

In [None]:
plt.hist(df_day_interval['day_interval_0'].dropna())
plt.title('How long between the first and second claim submissions in Days')

In [None]:
%watermark