# Unsupervised Learning

# 04_create_unsupervised_features

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 28/09/2025   | Adrienne | Created | Created file for unsupervised learning | 
| 29/09/2025   | Martin | New   | Processing to apply the HCPCS code descriptions + EDA on the new descriptions | 
| 02/10/2025 | Adrienne | Update | Created features |
| 05/10/2025 | Martin | Update | Added TFIDF transformation section for any "list-like" columns |
| 05/10/2025 | Adrienne | Update | Added a feature and cleaned up dataset to include relevant columns |
| 07/10/2025 | Adrienne | Update | Added preventative care indicator feature |

## Notes

- Preventative care indicator

## Content

* [Introduction](#introduction)
* [Load Data](#load-data)
* [Additional Features](#additional-features)
* [Variable Encoding](#variable-encoding)
* [EDA](#eda)

## Introduction

In [3]:
%load_ext watermark

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.pipeline import Pipeline

## Load Data

In [76]:
path = "../data/clean"
#df = pd.read_pickle(f"{path}/patient_level_unsupervised.pkl")
df = pd.read_pickle(f"{path}/patient_level_supervised.pkl")

In [77]:
mapper_path = "../data/mappers"
combined_mapper = pd.read_pickle(f"{mapper_path}/combined_mapper.pkl")
preventative_mapper = pd.read_pickle(f"{mapper_path}/preventative_mapper.pkl")

In [78]:
combined_mapper.head()

Unnamed: 0,code,category,description
0,99201,HCPCS_level_1,Evaluation and Management (E/M) Codes
1,99202,HCPCS_level_1,Evaluation and Management (E/M) Codes
2,99203,HCPCS_level_1,Evaluation and Management (E/M) Codes
3,99204,HCPCS_level_1,Evaluation and Management (E/M) Codes
4,99205,HCPCS_level_1,Evaluation and Management (E/M) Codes


Drop columns that are not needed or that would be a source of data leakage.

In [79]:
# diagnosis columns:
keep_cols = ['patient_medicare_number', 'gender', 'age', 'number_of_claims', 'combined_hcpcs_ls', 'billablePeriod_start_ls', 'billablePeriod_end_ls', 'location_of_bill_ls', 'total_value']
df = df[keep_cols]

In [80]:
df.head()

Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,billablePeriod_start_ls,billablePeriod_end_ls,location_of_bill_ls,total_value
78168,1S00E00MH18,male,80.0,107,"[99241, 99241, 99241, G0444, G0444, G0444, 992...","[2012-04-03, 2012-05-01, 2012-06-05, 2012-07-0...","[2012-04-03, 2012-05-01, 2012-06-05, 2012-07-0...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",615.92
5310,1S00E00AH98,female,83.0,62,"[G0444, G0107, G8111, 99221, 99241, G0444, 992...","[2012-05-27, 2012-06-13, 2013-01-18, 2013-01-1...","[2012-05-27, 2012-06-13, 2013-01-19, 2013-01-1...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",27169.36
45691,1S00E00JE80,male,74.0,57,"[99241, G0444, 99241, 99241, 99241, G0444, G95...","[2012-02-20, 2012-04-05, 2012-08-26, 2012-09-2...","[2012-02-20, 2012-04-05, 2012-08-26, 2012-09-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",219.02
52689,1S00E00JP90,female,76.0,24,"[99241, 99241, G0444, 99241, G0444, 99241, 992...","[2012-05-07, 2013-05-07, 2013-05-28, 2014-05-0...","[2012-05-07, 2013-05-07, 2013-05-28, 2014-05-0...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",369.63
84110,1S00E00MQ98,male,72.0,16,"[99241, G0444, G0444, 99221, G0444, G0444, 992...","[2012-12-14, 2013-02-09, 2014-02-15, 2014-08-3...","[2012-12-14, 2013-02-09, 2014-02-15, 2014-08-3...","[002, 002, 002, 002, 002, 003, 002, 002, 002, ...",14950.72


Drop the rows where age is missing (three patients).

In [81]:
df[df['age'].isnull()]

Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,billablePeriod_start_ls,billablePeriod_end_ls,location_of_bill_ls,total_value
55609,1S00E00JU46,male,,40,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[2013-10-25, 2015-03-20, 2015-12-11, 2016-02-1...","[2013-10-25, 2015-03-20, 2015-12-11, 2016-02-1...",[],123.66
8773,1S00E00GA64,male,,49,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[2012-04-24, 2012-05-22, 2012-12-04, 2013-05-1...","[2012-04-24, 2012-05-22, 2012-12-04, 2013-05-1...",[],101.17
37555,1S00E00HT00,male,,99,"[99241, 99241, 99241, 99241, 99241, 99241, 992...","[2012-02-19, 2012-04-15, 2012-05-13, 2012-06-1...","[2012-02-19, 2012-04-15, 2012-05-13, 2012-06-1...",[],142.58


In [82]:
# .copy() so that later column assignments do not act on a slice of the original frame
df = df[df['age'].notnull()].copy()

Limit the data to patients with fewer than 1000 codes in `combined_hcpcs_ls`. This removes only five patients and reduces the longest list to 670 codes.

In [83]:
df['ls_len'] = df['combined_hcpcs_ls'].str.len()
df = df[df['ls_len'] < 1000]
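
Before fixing a cutoff like the one above, it can help to count how many patients it removes and what the longest remaining list is. The following is a minimal sketch on toy data; the frame `toy`, the `patient` column and the `threshold` of 10 are hypothetical stand-ins, not values from the notebook.

```python
import pandas as pd

# Toy frame; list lengths and threshold are illustrative, not the notebook's data
toy = pd.DataFrame({
    "patient": ["a", "b", "c"],
    "combined_hcpcs_ls": [["99241"] * 3, ["G0444"] * 12, ["99241"] * 5],
})
threshold = 10  # hypothetical cutoff standing in for 1000

# .str.len() also works on list-valued cells and returns each list's length
lengths = toy["combined_hcpcs_ls"].str.len()
n_dropped = int((lengths >= threshold).sum())

# .copy() so the new column is added to an independent frame
kept = toy[lengths < threshold].copy()
kept["ls_len"] = lengths[lengths < threshold]
max_kept = int(kept["ls_len"].max())

print(f"dropped {n_dropped} patient(s); longest remaining list: {max_kept}")
```
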

## Additional Features

This section focuses on transforming the HCPCS codes into a usable format for unsupervised learning.

- HCPCS
  - code
  - category
  - description

### Apply mapper to HCPCS lists

Using the mapper, we can attach the category and description to each HCPCS code column.

In [84]:
# drop any columns that are entirely NaN (the row count printed below is unchanged)
print(len(df))
df.dropna(axis=1, how='all', inplace=True)
print(len(df))

1304
1304


In [85]:
unique_values = set(value for sublist in df['combined_hcpcs_ls'] for value in sublist)
print(unique_values)
print(len(unique_values))

{'G9829', 'G0155', 'G0153', 'G8946', '99241', 'S0605', 'G0424', 'T1502', 'G0107', 'G9833', 'G9857', 'G0152', 'G0402', 'Q5001', 'G0154', 'S8075', 'G0299', 'G0102', 'C8908', 'G8159', 'S9129', 'G0157', 'G9573', 'S9122', 'G9858', 'S9473', 'C8905', 'C8928', 'G0158', 'G9708', 'G9572', 'G0464', 'G0458', 'G0129', 'G0151', 'G0444', '99221', 'S9131', 'T1021', 'H2000', 'S9126', 'G0300', 'G8111', 'G0156'}
44


In [86]:
maxlen = max(df['combined_hcpcs_ls'].str.len())
print(f"max combined_hcpcs_ls length: {maxlen}")
df_hcpcs = df['combined_hcpcs_ls'].apply(pd.Series)
df_hcpcs = df_hcpcs.add_prefix('hcpcs_')
df_hcpcs = pd.concat([df, df_hcpcs], axis = 1)
df_hcpcs.head()

max combined_hcpcs_ls length: 670


Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,billablePeriod_start_ls,billablePeriod_end_ls,location_of_bill_ls,total_value,ls_len,...,hcpcs_660,hcpcs_661,hcpcs_662,hcpcs_663,hcpcs_664,hcpcs_665,hcpcs_666,hcpcs_667,hcpcs_668,hcpcs_669
78168,1S00E00MH18,male,80.0,107,"[99241, 99241, 99241, G0444, G0444, G0444, 992...","[2012-04-03, 2012-05-01, 2012-06-05, 2012-07-0...","[2012-04-03, 2012-05-01, 2012-06-05, 2012-07-0...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",615.92,110,...,,,,,,,,,,
5310,1S00E00AH98,female,83.0,62,"[G0444, G0107, G8111, 99221, 99241, G0444, 992...","[2012-05-27, 2012-06-13, 2013-01-18, 2013-01-1...","[2012-05-27, 2012-06-13, 2013-01-19, 2013-01-1...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",27169.36,98,...,,,,,,,,,,
45691,1S00E00JE80,male,74.0,57,"[99241, G0444, 99241, 99241, 99241, G0444, G95...","[2012-02-20, 2012-04-05, 2012-08-26, 2012-09-2...","[2012-02-20, 2012-04-05, 2012-08-26, 2012-09-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",219.02,278,...,,,,,,,,,,
52689,1S00E00JP90,female,76.0,24,"[99241, 99241, G0444, 99241, G0444, 99241, 992...","[2012-05-07, 2013-05-07, 2013-05-28, 2014-05-0...","[2012-05-07, 2013-05-07, 2013-05-28, 2014-05-0...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",369.63,26,...,,,,,,,,,,
84110,1S00E00MQ98,male,72.0,16,"[99241, G0444, G0444, 99221, G0444, G0444, 992...","[2012-12-14, 2013-02-09, 2014-02-15, 2014-08-3...","[2012-12-14, 2013-02-09, 2014-02-15, 2014-08-3...","[002, 002, 002, 002, 002, 003, 002, 002, 002, ...",14950.72,331,...,,,,,,,,,,


In [87]:
for i in range(maxlen):
  df_hcpcs = pd.merge( df_hcpcs, 
    combined_mapper,
    left_on=f"hcpcs_{i}",
    right_on="code",
    how='left'
  )
  df_hcpcs = df_hcpcs.drop(['code'], axis=1)
  df_hcpcs = df_hcpcs.rename({
    'category': f"category_{i}",
    'description': f"description_{i}",
  }, axis=1)

df_hcpcs_combined = df_hcpcs.fillna(np.nan)

In [88]:
df_hcpcs_combined[['hcpcs_1', 'category_1', 'description_1', 'hcpcs_2', 'category_2',  'description_2']].head()

Unnamed: 0,hcpcs_1,category_1,description_1,hcpcs_2,category_2,description_2
0,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes
1,G0107,HCPCS_level_2,Procedures/Professional Services,G8111,HCPCS_level_2,Procedures/Professional Services
2,G0444,HCPCS_level_2,Procedures/Professional Services,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes
3,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes,G0444,HCPCS_level_2,Procedures/Professional Services
4,G0444,HCPCS_level_2,Procedures/Professional Services,G0444,HCPCS_level_2,Procedures/Professional Services
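
The loop above performs one merge per list position (670 in this run). As an alternative sketch, not the notebook's method, the same lookup can be done once on a long-format table and pivoted back to wide columns. The frames `mapper` and `patients` and the `patient` column are hypothetical toy stand-ins for `combined_mapper` and `df`.

```python
import pandas as pd

# Toy stand-ins for the notebook's objects (values are illustrative only)
mapper = pd.DataFrame({
    "code": ["99241", "G0444"],
    "category": ["HCPCS_level_1", "HCPCS_level_2"],
})
patients = pd.DataFrame({
    "patient": ["a", "b"],
    "combined_hcpcs_ls": [["99241", "G0444", "99241"], ["G0444"]],
})

# One row per (patient, position) instead of one column per position
long = patients.explode("combined_hcpcs_ls", ignore_index=True)
long = long.rename(columns={"combined_hcpcs_ls": "code"})
long["position"] = long.groupby("patient").cumcount()

# A single lookup covers every position; unmapped codes become NaN
code_to_category = mapper.drop_duplicates("code").set_index("code")["category"]
long["category"] = long["code"].map(code_to_category)

# Pivot back to the wide layout: category_0, category_1, ...
wide = long.pivot(index="patient", columns="position", values="category")
wide = wide.add_prefix("category_")
print(wide)
```

Shorter lists are padded with NaN after the pivot, which matches the padding produced by `apply(pd.Series)` in the cell above.
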


### Time interval between claims

Sort the dates in `billablePeriod_end_ls` and compute the number of days between consecutive claims, then expand the intervals into individual columns.

In [235]:
def days_between_claim(item):
  sorted_dates = pd.to_datetime(pd.Series(item)).sort_values().reset_index(drop=True)
  return sorted_dates.diff().dt.days.dropna().astype(int).tolist()

In [236]:
day_interval = pd.DataFrame(df['billablePeriod_end_ls'].apply(days_between_claim))
day_maxlen = max(day_interval['billablePeriod_end_ls'].str.len())
df_day_interval = pd.DataFrame(day_interval['billablePeriod_end_ls'].to_list(), columns=[f"day_interval_{i}" for i in range(day_maxlen)])
df_day_interval.head()

Unnamed: 0,day_interval_0,day_interval_1,day_interval_2,day_interval_3,day_interval_4,day_interval_5,day_interval_6,day_interval_7,day_interval_8,day_interval_9,...,day_interval_655,day_interval_656,day_interval_657,day_interval_658,day_interval_659,day_interval_660,day_interval_661,day_interval_662,day_interval_663,day_interval_664
0,28,343,371,371,14,241,3,106,52,92,...,,,,,,,,,,
1,27,338,33,332,39,326,45,137,183,365,...,,,,,,,,,,
2,32,8,31,44,9,3,18,28,84,140,...,,,,,,,,,,
3,94,360,11,371,124,247,136,235,74,61,...,,,,,,,,,,
4,320,9,42,371,360,11,371,289,82,29,...,,,,,,,,,,
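
As a small worked example of the interval calculation, the sketch below reproduces the same idea with the standard library only. The helper name `day_gaps` is hypothetical and is not part of the notebook.

```python
from datetime import date

def day_gaps(date_strings):
    """Sort ISO dates and return the day gaps between consecutive entries."""
    ordered = sorted(date.fromisoformat(s) for s in date_strings)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]

# Unsorted input: the dates are sorted before differencing
gaps = day_gaps(["2012-05-01", "2012-04-03", "2012-06-05"])
print(gaps)

# A patient with a single claim has no intervals
single = day_gaps(["2012-01-01"])
print(single)
```

A patient with n claims yields n - 1 intervals, which is why the widest interval frame above has fewer columns than the longest claim list.
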


### Preventative Care Indicator

In [89]:
# flag patients who have had any preventative care, based on combined_hcpcs_ls
prev_codes = set(preventative_mapper['HCPCS Code'])
df_hcpcs_combined['preventative_care_ind'] = df_hcpcs_combined['combined_hcpcs_ls'].apply(
    lambda ls: int(any(code in prev_codes for code in ls))
)

In [90]:
# quick check
df_hcpcs_combined['preventative_care_ind'].value_counts()

preventative_care_ind
1    1287
0      17
Name: count, dtype: int64

## Variable Encoding

Three datasets will be created, each with a different encoding of the features:
- df_lab_enc: the hcpcs columns encoded using label encoding
- df_freq_enc: the hcpcs columns encoded using frequency encoding
- df_TD_enc: treats the combined_hcpcs_ls column as a bag-of-words problem and applies a TF-IDF transformation
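
The notebook names `df_freq_enc` but does not show the frequency encoding itself. The sketch below illustrates one common definition, assuming each code is replaced by its relative frequency among all observed codes and that padding values map to 0. The toy `codes` series is hypothetical.

```python
import pandas as pd

# Flattened toy codes; None stands in for the NaN padding of shorter lists
codes = pd.Series(["99241", "99241", "G0444", "99241", None])

# Relative frequency of each code; value_counts ignores missing values by default
freq = codes.value_counts(normalize=True)

# Replace each code by its frequency; padding has no frequency, so fill with 0
encoded = codes.map(freq).fillna(0.0)
print(encoded.tolist())
```

Unlike label encoding, this gives the integers no arbitrary ordering, but codes that happen to share a frequency become indistinguishable.
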

Some variables are always label encoded.

In [91]:
# will always encode gender using labels
le_gen = LabelEncoder()
df_hcpcs_combined['gender'] = le_gen.fit_transform(df_hcpcs_combined['gender'])

In [92]:
# create a list of category columns
category_cols = df_hcpcs_combined.columns[df_hcpcs_combined.columns.str.contains("category")]

# create a dataframe of unique category values for encoding
ls = list(set(value for value in combined_mapper['category']))
# new columns are filled with nan
ls.append(np.nan)
df_unique_category = pd.DataFrame( {'unique_category': ls})


# create instance of label encoder
le = LabelEncoder()
# fit the encoder on the full set of unique category values (plus NaN)
le.fit(df_unique_category['unique_category'])

# apply the same encoder to every category column
for col in category_cols:
    df_hcpcs_combined[col + '_enc'] = le.transform(df_hcpcs_combined[col])
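
The point of fitting one encoder on the full vocabulary is that a given value receives the same integer in every column. A minimal sketch of that idea follows, using `pandas.Categorical` in place of `LabelEncoder`; this is a swapped-in technique, and with it missing values are encoded as -1. The `wide` frame and `vocab` list are hypothetical toy data.

```python
import pandas as pd

# Shared vocabulary, fixed up front so every column uses the same mapping
vocab = sorted(["HCPCS_level_1", "HCPCS_level_2"])

wide = pd.DataFrame({
    "category_0": ["HCPCS_level_2", "HCPCS_level_1"],
    "category_1": ["HCPCS_level_1", None],  # None is padding for a shorter list
})

for col in ["category_0", "category_1"]:
    # .codes gives the position of each value in vocab, or -1 if missing
    wide[col + "_enc"] = pd.Categorical(wide[col], categories=vocab).codes

print(wide)
```
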


In [93]:
# create a list of description columns
desc_cols = df_hcpcs_combined.columns[df_hcpcs_combined.columns.str.contains("description")]

# create a dataframe of unique description values for encoding
ls = list(set(value for value in combined_mapper['description']))
# new columns are filled with nan
ls.append(np.nan)
df_unique_desc = pd.DataFrame( {'unique_desc': ls})


# create instance of label encoder
le = LabelEncoder()
# fit the encoder on the full set of unique description values (plus NaN)
le.fit(df_unique_desc['unique_desc'])

# apply the same encoder to every description column
for col in desc_cols:
    df_hcpcs_combined[col + '_enc'] = le.transform(df_hcpcs_combined[col])


In [94]:
# create a list of hcpcs columns
# skip the first match, which is the combined_hcpcs_ls list column itself
hcpcs_cols = df_hcpcs_combined.columns[df_hcpcs_combined.columns.str.contains("hcpcs")][1:]

# create a dataframe of unique hcpcs values for encoding
ls = list(set(value for sublist in df_hcpcs_combined['combined_hcpcs_ls'] for value in sublist))
# new hcpcs columns are filled with nan
ls.append(np.nan)
df_unique_hcpcs = pd.DataFrame( {'unique_hcpcs': ls})


In [95]:
# create copies of the dataset
df_lab_enc = df_hcpcs_combined.copy()
df_freq_enc = df_hcpcs_combined.copy()
df_TD_enc = df_hcpcs_combined.copy()

### Label Encoding HCPCS

In [96]:
# create instance of label encoder
le = LabelEncoder()
# fit the encoder on the full set of unique hcpcs codes (plus NaN)
le.fit(df_unique_hcpcs['unique_hcpcs'])

# apply the same encoder to every hcpcs column
for col in hcpcs_cols:
    df_lab_enc[col + '_enc'] = le.transform(df_lab_enc[col])


In [34]:
# check encodings
df_lab_enc[['category_0', 'category_0_enc', 'hcpcs_0', 'hcpcs_0_enc', 'hcpcs_1', 'hcpcs_1_enc', 'hcpcs_2', 'hcpcs_2_enc', 'gender']].head()

Unnamed: 0,category_0,category_0_enc,hcpcs_0,hcpcs_0_enc,hcpcs_1,hcpcs_1_enc,hcpcs_2,hcpcs_2_enc,gender
0,HCPCS_level_2,2,G0444,20,99241,1,G0444,20,0
1,HCPCS_level_2,2,G0444,20,99241,1,99241,1,0
2,HCPCS_level_2,2,G0444,20,99241,1,G0402,18,0
3,HCPCS_level_1,1,99241,1,99241,1,99241,1,1
4,HCPCS_level_1,1,99241,1,S8075,36,99241,1,0


In [37]:
# drop original columns and list columns
drop_ls = list(category_cols) + list(desc_cols) + list(hcpcs_cols) + ['patient_medicare_number', 'total_value', 'combined_hcpcs_ls', 'billablePeriod_start_ls', 'billablePeriod_end_ls','location_of_bill_ls', 'ls_len']
df_lab_enc = df_lab_enc.drop(drop_ls, axis = 1)

In [39]:
df_lab_enc.head()

Unnamed: 0,gender,age,number_of_claims,preventative_care_ind,category_0_enc,category_1_enc,category_2_enc,category_3_enc,category_4_enc,category_5_enc,...,hcpcs_582_enc,hcpcs_583_enc,hcpcs_584_enc,hcpcs_585_enc,hcpcs_586_enc,hcpcs_587_enc,hcpcs_588_enc,hcpcs_589_enc,hcpcs_590_enc,hcpcs_591_enc
0,0,71.0,14,1,2,1,2,1,1,2,...,44,44,44,44,44,44,44,44,44,44
1,0,74.0,32,1,2,1,1,2,1,1,...,44,44,44,44,44,44,44,44,44,44
2,0,77.0,25,1,2,1,2,2,2,2,...,44,44,44,44,44,44,44,44,44,44
3,1,76.0,110,1,1,1,1,1,1,1,...,44,44,44,44,44,44,44,44,44,44
4,0,79.0,56,1,1,2,1,1,2,2,...,44,44,44,44,44,44,44,44,44,44


In [38]:
df_lab_enc.to_pickle("../data/clean/patient_level_lab_enc.pkl")

### TF-IDF Encoding

In [97]:
df_TD_enc_reset = df_TD_enc.reset_index(drop=True)
list_cols = df_TD_enc_reset.columns[:9]
df_TD_enc_reset[list_cols].head()

Unnamed: 0,patient_medicare_number,gender,age,number_of_claims,combined_hcpcs_ls,billablePeriod_start_ls,billablePeriod_end_ls,location_of_bill_ls,total_value
0,1S00E00MH18,1,80.0,107,"[99241, 99241, 99241, G0444, G0444, G0444, 992...","[2012-04-03, 2012-05-01, 2012-06-05, 2012-07-0...","[2012-04-03, 2012-05-01, 2012-06-05, 2012-07-0...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",615.92
1,1S00E00AH98,0,83.0,62,"[G0444, G0107, G8111, 99221, 99241, G0444, 992...","[2012-05-27, 2012-06-13, 2013-01-18, 2013-01-1...","[2012-05-27, 2012-06-13, 2013-01-19, 2013-01-1...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",27169.36
2,1S00E00JE80,1,74.0,57,"[99241, G0444, 99241, 99241, 99241, G0444, G95...","[2012-02-20, 2012-04-05, 2012-08-26, 2012-09-2...","[2012-02-20, 2012-04-05, 2012-08-26, 2012-09-2...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",219.02
3,1S00E00JP90,0,76.0,24,"[99241, 99241, G0444, 99241, G0444, 99241, 992...","[2012-05-07, 2013-05-07, 2013-05-28, 2014-05-0...","[2012-05-07, 2013-05-07, 2013-05-28, 2014-05-0...","[002, 002, 002, 002, 002, 002, 002, 002, 002, ...",369.63
4,1S00E00MQ98,1,72.0,16,"[99241, G0444, G0444, 99221, G0444, G0444, 992...","[2012-12-14, 2013-02-09, 2014-02-15, 2014-08-3...","[2012-12-14, 2013-02-09, 2014-02-15, 2014-08-3...","[002, 002, 002, 002, 002, 003, 002, 002, 002, ...",14950.72


In [98]:
def tokeniser(text):
  return text.split()

def get_corpus_and_vocab(df, col):
  corpus = df[col].apply(lambda x: " ".join(x)).to_list()

  vocab = list(set([i for sublist in df[col].to_list() for i in sublist]))
  vocab = {k: i for i, k in enumerate(vocab)}
  return corpus, vocab

In [99]:
corpus, vocab = get_corpus_and_vocab(df_TD_enc_reset, 'combined_hcpcs_ls')

In [100]:
# Pipeline for tfidf and countvectoriser
pipe = Pipeline([
  ('count', CountVectorizer(vocabulary=vocab, tokenizer=tokeniser, lowercase=False)),
  ('tfidf', TfidfTransformer())
])

tfidf_hcpcs = pipe.fit_transform(corpus)



In [101]:
tfidf_hcpcs.shape

(1304, 44)
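
To make the weights in this matrix easier to interpret, the sketch below works through the calculation by hand with the standard library. It assumes `TfidfTransformer` is used with its default settings, as in the pipeline above: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The two toy documents and the helper `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

# Two toy "patients", each a list of HCPCS codes
docs = [["99241", "99241", "G0444"], ["G0444"]]
vocab = sorted({code for doc in docs for code in doc})
n_docs = len(docs)

# Document frequency: in how many patients each code appears at least once
doc_freq = Counter(code for doc in docs for code in set(doc))

# Smoothed inverse document frequency
idf = {code: math.log((1 + n_docs) / (1 + doc_freq[code])) + 1 for code in vocab}

def tfidf_row(doc):
    """Raw counts times IDF, scaled so the row has unit L2 norm."""
    counts = Counter(doc)
    weights = [counts[code] * idf[code] for code in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

rows = [tfidf_row(doc) for doc in docs]
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

A code that appears in every patient gets the minimum IDF of 1, so rarer codes carry relatively more weight within a patient's row.
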

In [102]:
# Create a usable dense dataframe - NOTE: avoid this if the vocab is large, since toarray() materialises every zero
df_tfidf = pd.DataFrame(tfidf_hcpcs.toarray(), columns=pipe['count'].get_feature_names_out())

In [103]:
# Combine with the original fixed columns
col_list = ['gender', 'age', 'number_of_claims']
out = pd.concat([
  df_TD_enc[col_list].reset_index(drop=True),
  df_tfidf
], axis=1)

In [104]:
out.head()

Unnamed: 0,gender,age,number_of_claims,G9829,G0155,G0153,G8946,99241,S0605,G0424,...,G0151,G0444,99221,S9131,T1021,H2000,S9126,G0300,G8111,G0156
0,1,80.0,107,0.0,0.0,0.0,0.0,0.699881,0.0,0.0,...,0.0,0.709139,0.038586,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,83.0,62,0.0,0.116396,0.0,0.0,0.527722,0.0,0.0,...,0.0,0.175802,0.248711,0.0,0.0,0.0,0.328635,0.0,0.265379,0.384709
2,1,74.0,57,0.0,0.279301,0.0,0.0,0.185474,0.0,0.0,...,0.0,0.027363,0.025807,0.0,0.0,0.0,0.499333,0.0,0.0,0.465726
3,0,76.0,24,0.0,0.097707,0.0,0.0,0.914553,0.0,0.0,...,0.0,0.354177,0.08351,0.115169,0.0,0.0,0.0,0.0,0.0,0.0
4,1,72.0,16,0.0,0.11759,0.454811,0.0,0.03175,0.0,0.0,...,0.288175,0.032788,0.023193,0.405153,0.0,0.0,0.0,0.164951,0.0,0.0


In [None]:
# Export
#out.to_pickle(f"{path}/patient_level_TF_enc.pkl")
out.to_pickle(f"{path}/patient_level_supervised_TF_enc.pkl")

## EDA

### Breakdown of procedures

NOTE: This is on the `claim_mini_sample` dataset (10,000 entries)

- The most common procedure descriptions are:
  1. Evaluation and Management (E/M) Codes (HCPCS Level I)
  2. Procedures/Professional Services (HCPCS Level II)
- Other codes include:
  1. Alcohol and Drug Abuse Treatment
  2. National Codes Established for State Medicaid Agencies

In [198]:
def countplot_with_labels(l, title):
  # pass the values as y so the bars are horizontal and each bar's width is its count
  ax = sns.countplot(y=l, palette='pastel')

  for p in ax.patches:
    ax.text(
      p.get_width() + 1,
      p.get_y() + p.get_height() / 2,
      int(p.get_width()),
      ha="center",
      va="center",
      color="black",
      fontsize=12,
      fontweight="bold"
    )
  
  plt.title(title)

  return plt

What is the overall distribution of HCPCS codes across all claims?

In [199]:
all_hcpcs = df['combined_hcpcs_ls'].explode().reset_index()
all_hcpcs = all_hcpcs.merge(
  combined_mapper,
  left_on='combined_hcpcs_ls',
  right_on='code',
  how='left'
)
all_hcpcs = all_hcpcs.drop(['index', 'code'], axis=1)
all_hcpcs = all_hcpcs.fillna("Unknown")

In [None]:
countplot_with_labels(all_hcpcs['category'], "Breakdown of Category for HCPCS")

In [None]:
countplot_with_labels(all_hcpcs['description'], "Breakdown of Descriptions for HCPCS")

Compare the most common procedure descriptions for the first and second HCPCS codes.

In [None]:
countplot_with_labels(df_hcpcs_combined['description_0'], "Breakdown of First Procedure")

In [None]:
countplot_with_labels(df_hcpcs_combined['description_1'], "Breakdown of Second Procedure")

### How long between claim submissions

In [None]:
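# flatten the per-patient interval lists into one numeric series before plotting
all_intervals = day_interval['billablePeriod_end_ls'].explode().dropna().astype(int)
plt.hist(all_intervals, bins=50)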
plt.title("Histogram of all Day Intervals between Claim Submissions")
plt.show()

In [None]:
plt.hist(df_day_interval['day_interval_0'])
plt.title('How long between the first and second claim submissions in Days')

In [None]:
%watermark