# Unsupervised Learning

# 04_create_unsupervised_features

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 28/09/2025   | Adrienne | Created | Created file for unsupervised learning | 
| 29/09/2025   | Martin | New   | Processing to apply the HCPCS code descriptions + EDA on the new descriptions | 
| 02/10/2025 | Adrienne | Update | Created features |
| 05/10/2025 | Martin | Update | Added TFIDF transformation section for any "list-like" columns |

# Notes

- Preventative care indicator

# Content

* [Introduction](#introduction)
* [Load Data](#load-data)
* [Additional Features](#additional-features)
* [EDA](#eda)

# Introduction

In [3]:
%load_ext watermark

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

# Load Data

In [2]:
path = "../data/clean"
#df = pd.read_pickle(f"{path}/patient_level.pkl")
df = pd.read_pickle(f"{path}/patient_level_sample.pkl")

In [3]:

mapper_path = "../data/mappers"
combined_mapper = pd.read_pickle(f"{mapper_path}/combined_mapper.pkl")

In [4]:
combined_mapper.head()

Unnamed: 0,code,category,description
0,99201,HCPCS_level_1,Evaluation and Management (E/M) Codes
1,99202,HCPCS_level_1,Evaluation and Management (E/M) Codes
2,99203,HCPCS_level_1,Evaluation and Management (E/M) Codes
3,99204,HCPCS_level_1,Evaluation and Management (E/M) Codes
4,99205,HCPCS_level_1,Evaluation and Management (E/M) Codes


# Additional Features

This section transforms the HCPCS codes into a format usable for unsupervised learning. Each code carries three pieces of information:

- HCPCS
  - code
  - category
  - description

## Apply mapper to HCPCS lists

Using the mapper, we attach the category and description to each expanded HCPCS column (`hcpcs_0`, `hcpcs_1`, ...).

In [5]:
# drop hcpcs columns that are all NaN
print(len(df))
df.dropna(axis=1, how='all', inplace=True)
print(len(df))

2510
2510


In [6]:
maxlen = max(df['combined_hcpcs_ls'].str.len())
df_hcpcs = df['combined_hcpcs_ls'].apply(pd.Series)
df_hcpcs = df_hcpcs.add_prefix('hcpcs_')
df_hcpcs = pd.concat([df, df_hcpcs], axis = 1)
df_hcpcs.head()

Unnamed: 0,patient_medicare_number,patient_first_name,patient_last_name,gender,birthdate,number_of_claims,drg_ls,combined_diagnosis_ls,combined_hcpcs_ls,billablePeriod_start_ls,...,hcpcs_566,hcpcs_567,hcpcs_568,hcpcs_569,hcpcs_570,hcpcs_571,hcpcs_572,hcpcs_573,hcpcs_574,hcpcs_575
0,1S00E00AA16,Franklyn36,Tromp100,male,1950-08-12,1,[],"[J329, E669, E785, J0190]",[99241],[2018-01-21],...,,,,,,,,,,
1,1S00E00AA23,Bonita405,Hagenes547,female,1948-05-23,4,[],"[J329, E785, I10, B002, J029, J329, E785, I10,...","[99241, 99241, 99241, 99241]","[2012-05-15, 2012-07-12, 2017-02-20, 2020-02-20]",...,,,,,,,,,,
7,1S00E00AA25,Carlota980,Gamez720,female,1947-04-15,5,[],"[E669, D649, O039, M810, J329, E669, D649, O03...","[G0444, G0444, 99241, 99241, 99241]","[2015-08-05, 2016-08-10, 2020-07-29, 2021-03-0...",...,,,,,,,,,,
10,1S00E00AA32,D.,Watsic,male,,1,[],"[I10, E669, I2510, I219]",[G0444],[2018-06-09],...,,,,,,,,,,
11,1S00E00AA48,Man114,Halvorson124,male,1945-08-04,1,[],"[E785, P292, I2510, E669, J0190, J329]",[99241],[2021-09-01],...,,,,,,,,,,


In [50]:
for i in range(maxlen):
  df_hcpcs = pd.merge( df_hcpcs, 
    combined_mapper,
    left_on=f"hcpcs_{i}",
    right_on="code",
    how='left'
  )
  df_hcpcs = df_hcpcs.drop(['code'], axis=1)
  df_hcpcs = df_hcpcs.rename({
    'category': f"category_{i}",
    'description': f"description_{i}",
  }, axis=1)

df_hcpcs_combined = df_hcpcs.fillna(np.nan)
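
The loop above performs one merge per list position, which can become slow and produces a very wide frame when a patient has hundreds of codes. As a rough alternative sketch (not the notebook's method), the same mapping can be done in long format with a single merge. The toy frames `toy_df` and `toy_mapper` and the `patient_id` column are hypothetical stand-ins, and the sketch assumes each `code` appears only once in the mapper.

```python
import pandas as pd

# Toy stand-ins for the notebook's `df` and `combined_mapper` (illustrative values).
toy_df = pd.DataFrame({
    "patient_id": ["p1", "p2"],
    "combined_hcpcs_ls": [["99241", "G0444"], ["G0444"]],
})
toy_mapper = pd.DataFrame({
    "code": ["99241", "G0444"],
    "category": ["HCPCS_level_1", "HCPCS_level_2"],
    "description": [
        "Evaluation and Management (E/M) Codes",
        "Procedures/Professional Services",
    ],
})

# One row per (patient, code), remembering each code's position in the list.
long_df = toy_df.explode("combined_hcpcs_ls").rename(columns={"combined_hcpcs_ls": "code"})
long_df["position"] = long_df.groupby("patient_id").cumcount()

# A single left merge attaches category and description to every code.
long_df = long_df.merge(toy_mapper, on="code", how="left")

# Pivot back to the wide layout (category_0, category_1, ...) only if it is needed.
wide = long_df.pivot(index="patient_id", columns="position", values="category")
wide = wide.add_prefix("category_")
print(wide)
```

Patients with fewer codes than the longest list end up with NaN in the trailing columns, which matches the wide layout produced by the loop.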

In [51]:
df_hcpcs_combined[['hcpcs_1', 'category_1', 'description_1', 'hcpcs_2', 'category_2',  'description_2']].head()

Unnamed: 0,hcpcs_1,category_1,description_1,hcpcs_2,category_2,description_2
0,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes,99221,HCPCS_level_1,Evaluation and Management (E/M) Codes
1,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes,99221,HCPCS_level_1,Evaluation and Management (E/M) Codes
2,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes,G0444,HCPCS_level_2,Procedures/Professional Services
3,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes,G0444,HCPCS_level_2,Procedures/Professional Services
4,G0444,HCPCS_level_2,Procedures/Professional Services,99241,HCPCS_level_1,Evaluation and Management (E/M) Codes


In [164]:
df_hcpcs_combined.head()

Unnamed: 0,patient_medicare_number,patient_first_name,patient_last_name,gender,birthdate,number_of_claims,drg_ls,combined_diagnosis_ls,combined_hcpcs_ls,billablePeriod_start_ls,...,category_336,description_336,category_337,description_337,category_338,description_338,category_339,description_339,category_340,description_340
0,1S00E00AA10,Brandon214,Roob72,female,1946-01-15,3,[],"[O039, O039, B085, B002, O039, J029]","[G0444, 99241, G0444, G9572]","[2013-04-23, 2016-01-15, 2020-06-02]",...,,,,,,,,,,
1,1S00E00AA23,B.,Hagene,female,,1,[],"[J329, E785, P292]","[G0444, G9572]",[2014-04-13],...,,,,,,,,,,
2,1S00E00AA25,Carlota980,Gamez720,female,1947-04-15,2,[],"[E669, D649, K635, O039, M810, J329, E669, D64...","[G0444, 99241]","[2012-07-18, 2021-11-23]",...,,,,,,,,,,
3,1S00E00AA32,Denny560,Watsica258,male,1945-06-09,3,[],"[P292, E669, I2510, B349, J329, I10, E669, I25...","[99241, 99241, 99241]","[2015-05-12, 2021-02-20, 2021-03-20]",...,,,,,,,,,,
4,1S00E00AA54,Lashawnda5,Greenfelder433,female,1950-12-23,11,[],"[E119, R739, E781, E8881, D649, E11319, P292, ...","[G0444, 99241, 99241, 99241, 99241, G0444, 992...","[2012-10-27, 2013-01-26, 2014-06-21, 2014-07-2...",...,,,,,,,,,,


## Time interval between claims

Using `billablePeriod_end_ls`, sort each patient's claim end dates and compute the number of days between consecutive claims. The resulting list of intervals is then expanded into individual columns.

In [25]:
def days_between_claim(item):
  sorted_dates = pd.to_datetime(pd.Series(item)).sort_values().reset_index(drop=True)
  return sorted_dates.diff().dt.days.dropna().astype(int).tolist()
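
To make the behaviour concrete, here is a small self-contained check of the function on made-up dates (the dates are illustrative, not taken from the dataset). The function is repeated so the snippet runs on its own.

```python
import pandas as pd

def days_between_claim(item):
    # Sort the dates, then take the difference between consecutive entries in days.
    sorted_dates = pd.to_datetime(pd.Series(item)).sort_values().reset_index(drop=True)
    return sorted_dates.diff().dt.days.dropna().astype(int).tolist()

# Dates are deliberately out of order to show that they are sorted first.
unordered = ["2020-01-10", "2020-01-01", "2020-01-31"]
print(days_between_claim(unordered))       # [9, 21]

# A patient with a single claim has no interval to report.
print(days_between_claim(["2018-01-21"]))  # []
```

A patient with n claims therefore yields n - 1 intervals, which is why patients with one claim contribute only NaN once the lists are expanded into columns.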

In [26]:
day_interval = pd.DataFrame(df['billablePeriod_end_ls'].apply(days_between_claim))
day_maxlen = max(day_interval['billablePeriod_end_ls'].str.len())
df_day_interval = pd.DataFrame(day_interval['billablePeriod_end_ls'].to_list(), columns=[f"day_interval_{i}" for i in range(day_maxlen)])
df_day_interval.head()

Unnamed: 0,day_interval_0,day_interval_1,day_interval_2,day_interval_3,day_interval_4,day_interval_5,day_interval_6,day_interval_7,day_interval_8,day_interval_9,...,day_interval_196,day_interval_197,day_interval_198,day_interval_199,day_interval_200,day_interval_201,day_interval_202,day_interval_203,day_interval_204,day_interval_205
0,1.0,29.0,2.0,72.0,33.0,72.0,3.0,112.0,1.0,104.0,...,,,,,,,,,,
1,46.0,129.0,196.0,371.0,16.0,12.0,343.0,101.0,36.0,234.0,...,,,,,,,,,,
2,7.0,7.0,238.0,55.0,105.0,582.0,742.0,345.0,26.0,258.0,...,,,,,,,,,,
3,365.0,57.0,187.0,9.0,112.0,63.0,302.0,365.0,365.0,81.0,...,,,,,,,,,,
4,15.0,350.0,21.0,344.0,41.0,7.0,0.0,317.0,365.0,365.0,...,,,,,,,,,,
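
Note that building the frame from `to_list()` gives it a fresh 0..n-1 index, so it no longer shares row labels with `df`. A minimal sketch of the same expansion that keeps the original index is shown below. The Series `intervals` and its index values are hypothetical.

```python
import pandas as pd

# Ragged lists of intervals, indexed like a filtered patient frame (illustrative values).
intervals = pd.Series([[9, 21], [], [365]], index=[0, 7, 10])

# Width of the widest list decides how many columns are needed.
n_cols = intervals.str.len().max()

# Shorter lists are padded with NaN; passing index= keeps rows aligned with the source.
wide = pd.DataFrame(
    intervals.to_list(),
    columns=[f"day_interval_{i}" for i in range(n_cols)],
    index=intervals.index,
)
print(wide)
```

Keeping the index makes a later `pd.concat(..., axis=1)` with the patient-level frame line up row by row.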


## Variable Encoding

We create three datasets, each with a different encoding of the features:
- df_lab_enc: the hcpcs columns are encoded using label encoding
- df_freq_enc: the hcpcs columns are encoded using frequency encoding
- df_TD_enc: the combined_hcpcs_ls column is treated as a bag-of-words problem and transformed with TF-IDF
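
As a quick illustration of how the first two encodings differ, the sketch below applies both to a toy column of codes. The values are made up, and the frequency encoding shown here (relative frequency via `value_counts(normalize=True)`) is one common definition, assumed for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

codes = pd.Series(["99241", "G0444", "99241", "99241", "G9572"])

# Label encoding: every distinct code gets an integer id, assigned in sorted order.
le = LabelEncoder()
label_enc = le.fit_transform(codes)

# Frequency encoding: every code is replaced by its share of all observations.
freq = codes.value_counts(normalize=True)
freq_enc = codes.map(freq)

print(pd.DataFrame({"code": codes, "label": label_enc, "frequency": freq_enc}))
```

Label ids carry no meaningful order, whereas frequency values do, but two different codes with the same frequency become indistinguishable. Which trade-off is acceptable depends on the downstream clustering method.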

In [52]:
# will always encode gender using labels
le_gen = LabelEncoder()
df_hcpcs_combined['gender'] = le_gen.fit_transform(df_hcpcs_combined['gender'])

In [53]:
# collect the category_* columns produced by the mapper merge
category_cols = df_hcpcs_combined.columns[df_hcpcs_combined.columns.str.contains("category")]

# build the full set of category values the encoder must know
ls = list(set(value for value in combined_mapper['category']))
# positions beyond a patient's last code are NaN, so NaN must be a known class
ls.append(np.nan)
df_unique_category = pd.DataFrame( {'unique_category': ls})


# create instance of label encoder
le = LabelEncoder()
# fit once on the full set of category values
le.fit(df_unique_category['unique_category'])
 
# apply the same fitted encoder to every category column
for col in category_cols:
    df_hcpcs_combined[col + '_enc'] = le.transform(df_hcpcs_combined[col])


In [54]:
# collect the description_* columns produced by the mapper merge
desc_cols = df_hcpcs_combined.columns[df_hcpcs_combined.columns.str.contains("description")]

# build the full set of description values the encoder must know
ls = list(set(value for value in combined_mapper['description']))
# positions beyond a patient's last code are NaN, so NaN must be a known class
ls.append(np.nan)
df_unique_desc = pd.DataFrame( {'unique_desc': ls})


# create instance of label encoder
le = LabelEncoder()
# fit once on the full set of description values
le.fit(df_unique_desc['unique_desc'])
 
# apply the same fitted encoder to every description column
for col in desc_cols:
    df_hcpcs_combined[col + '_enc'] = le.transform(df_hcpcs_combined[col])


In [55]:
# create copies of the dataset
df_lab_enc = df_hcpcs_combined.copy()
df_freq_enc = df_hcpcs_combined.copy()
df_TD_enc = df_hcpcs_combined.copy()

In [56]:
# create a list of hcpcs columns
# first col in list needs to be dropped
hcpcs_cols = df_lab_enc.columns[df_lab_enc.columns.str.contains("hcpcs")][1:]

# create a dataframe of unique hcpcs values for encoding
ls = list(set(value for sublist in df_hcpcs_combined['combined_hcpcs_ls'] for value in sublist))
# new hcpcs columns are filled with nan
ls.append(np.nan)
df_unique_hcpcs = pd.DataFrame( {'unique_hcpcs': ls})


### Label Encoding HCPCS

In [58]:

# create instance of label encoder
le = LabelEncoder()
# fit once on every HCPCS code observed in the data (plus NaN)
le.fit(df_unique_hcpcs['unique_hcpcs'])
 
# apply the same fitted encoder to every hcpcs column
for col in hcpcs_cols:
    df_lab_enc[col + '_enc'] = le.transform(df_lab_enc[col])


In [59]:
# check encodings
df_lab_enc[['category_0', 'category_0_enc', 'hcpcs_0', 'hcpcs_0_enc', 'hcpcs_1', 'hcpcs_1_enc', 'hcpcs_2', 'hcpcs_2_enc', 'gender']].head()

Unnamed: 0,category_0,category_0_enc,hcpcs_0,hcpcs_0_enc,hcpcs_1,hcpcs_1_enc,hcpcs_2,hcpcs_2_enc,gender
0,HCPCS_level_2,2,G8111,19,99241,1,99221,0,0
1,HCPCS_level_2,2,G0444,17,99241,1,99221,0,0
2,HCPCS_level_1,1,99241,1,99241,1,G0444,17,0
3,HCPCS_level_1,1,99241,1,99241,1,G0444,17,0
4,HCPCS_level_1,1,99241,1,G0444,17,99241,1,0


In [75]:
drop_ls = list(category_cols) + list(desc_cols) + list(hcpcs_cols)
df_lab_enc = df_lab_enc.drop(drop_ls, axis = 1)

In [76]:
df_lab_enc.to_pickle("../data/clean/patient_level_lab_enc.pkl")

### Frequency Encoded HCPCS

In [None]:



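# relative frequency of each HCPCS code across all patients' claims
# (grouping by the list column itself fails because lists are unhashable)
freq_map = df_hcpcs_combined['combined_hcpcs_ls'].explode().value_counts(normalize=True)

# replace each code with its frequency; empty positions stay NaN
for col in hcpcs_cols:
    df_freq_enc[col + '_enc'] = df_freq_enc[col].map(freq_map)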

### TF-IDF Encoding

In [39]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.pipeline import Pipeline

In [153]:
df_hcpcs_reset = df_hcpcs.reset_index(drop=True)
list_cols = df_hcpcs_reset.columns[:12]
df_hcpcs_reset[list_cols].head()

Unnamed: 0,patient_medicare_number,patient_first_name,patient_last_name,gender,birthdate,number_of_claims,drg_ls,combined_diagnosis_ls,combined_hcpcs_ls,billablePeriod_start_ls,billablePeriod_end_ls,location_of_bill_ls
0,1S00E00AA16,Franklyn36,Tromp100,male,1950-08-12,1,[],"[J329, E669, E785, J0190]",[99241],[2018-01-21],[2018-01-21],[002]
1,1S00E00AA23,Bonita405,Hagenes547,female,1948-05-23,4,[],"[J329, E785, I10, B002, J029, J329, E785, I10,...","[99241, 99241, 99241, 99241]","[2012-05-15, 2012-07-12, 2017-02-20, 2020-02-20]","[2012-05-15, 2012-07-12, 2017-02-20, 2020-02-20]","[002, 002, 002, 002]"
2,1S00E00AA25,Carlota980,Gamez720,female,1947-04-15,5,[],"[E669, D649, O039, M810, J329, E669, D649, O03...","[G0444, G0444, 99241, 99241, 99241]","[2015-08-05, 2016-08-10, 2020-07-29, 2021-03-0...","[2015-08-05, 2016-08-10, 2020-07-29, 2021-03-0...","[002, 002, 002]"
3,1S00E00AA32,D.,Watsic,male,,1,[],"[I10, E669, I2510, I219]",[G0444],[2018-06-09],[2018-06-09],[]
4,1S00E00AA48,Man114,Halvorson124,male,1945-08-04,1,[],"[E785, P292, I2510, E669, J0190, J329]",[99241],[2021-09-01],[2021-09-01],[002]


In [None]:
def tokeniser(text):
  return text.split()

def get_corpus_and_vocab(df, col):
  corpus = df[col].apply(lambda x: " ".join(x)).to_list()

  # sort the unique tokens so the column order is reproducible between runs
  vocab = sorted(set([i for sublist in df[col].to_list() for i in sublist]))
  vocab = {k: i for i, k in enumerate(vocab)}
  return corpus, vocab

In [155]:
corpus, vocab = get_corpus_and_vocab(df_hcpcs_reset, 'combined_hcpcs_ls')

In [176]:
# Pipeline for tfidf and countvectoriser
pipe = Pipeline([
  ('count', CountVectorizer(vocabulary=vocab, tokenizer=tokeniser, lowercase=False)),
  ('tfidf', TfidfTransformer())
])

tfidf_hcpcs = pipe.fit_transform(corpus)



In [182]:
# Create a useable dataframe - NOTE: DO NOT use this if the vocab is too big
df_tfidf = pd.DataFrame(tfidf_hcpcs.toarray(), columns=pipe['count'].get_feature_names_out())

In [188]:
# Combine with the original fixed columns
# (use the reset-index frame so rows align with df_tfidf)
col_list = ['gender', 'birthdate', 'number_of_claims']
out = pd.concat([
  df_hcpcs_reset[col_list],
  df_tfidf
], axis=1)

# Export
out.to_pickle(f"{path}/hcpcs_tfidf.pkl")

# EDA

## Breakdown of procedures

NOTE: This is on the `claim_mini_sample` dataset (10,000 entries)

- The most common procedure descriptions are:
  1. Evaluation and Management (E/M) Codes (HCPCS Level I)
  2. Procedures/Professional Services (HCPCS Level II)
- Other codes include:
  1. Alcohol and Drug Abuse Treatment
  2. National Codes Established for State Medicaid Agencies

In [None]:
def countplot_with_labels(l, title):
  # draw horizontal bars (y=...) so the width-based labels below are placed correctly
  ax = sns.countplot(y=l, palette='pastel')

  for p in ax.patches:
    ax.text(
      p.get_width() + 1,
      p.get_y() + p.get_height() / 2,
      int(p.get_width()),
      ha="center",
      va="center",
      color="black",
      fontsize=12,
      fontweight="bold"
    )
  
  plt.title(title)

  return plt

Overall, what is the distribution of HCPCS codes across all claims?

In [None]:
all_hcpcs = df['combined_hcpcs_ls'].explode().reset_index()
all_hcpcs = all_hcpcs.merge(
  combined_mapper,
  left_on='combined_hcpcs_ls',
  right_on='code',
  how='left'
)
all_hcpcs = all_hcpcs.drop(['index', 'code'], axis=1)
all_hcpcs = all_hcpcs.fillna("Unknown")

In [None]:
countplot_with_labels(all_hcpcs['category'], "Breakdown of Category for HCPCS")

In [None]:
countplot_with_labels(all_hcpcs['description'], "Breakdown of Descriptions for HCPCS")

Compare the first and second HCPCS codes of each patient: which procedure descriptions are most common for each?

In [None]:
countplot_with_labels(df_hcpcs_combined['description_0'], "Breakdown of First Procedure")

In [None]:
countplot_with_labels(df_hcpcs_combined['description_1'], "Breakdown of Second Procedure")

## How long between claim submissions

In [None]:
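# flatten the per-patient interval lists into one numeric series; patients with one claim yield NaN
all_intervals = day_interval['billablePeriod_end_ls'].explode().dropna().astype(int)
plt.hist(all_intervals, bins=50)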
plt.title("Histogram of all Day Intervals between Claim Submissions")
plt.show()

In [None]:
plt.hist(df_day_interval['day_interval_0'])
plt.title('How long between the first and second claim submissions in Days')

In [None]:
%watermark