## Cui2Vec Embeddings

This notebook prepares cui2vec embeddings for Medicare fraud prediction.

Original paper: https://psb.stanford.edu/psb-online/proceedings/psb20/Beam.pdf

Public embeddings: http://cui2vec.dbmi.hms.harvard.edu/

cui2vec provides embeddings for over 108,000 medical concepts. These embeddings were created using insurance claims for 60 million Americans, 1.7 million full-text PubMed articles, and 20 million clinical notes from Stanford.

In [1]:
import pandas as pd
import numpy as np
from gensim.models import KeyedVectors

## Load Part B - HCPCS Codes

In [4]:
partb_file = '/Users/jujohnson/cms-data/raw/Medicare_PUF_PartB_2012to2017.csv.gz'
partb_cols = ['hcpcs_code', 'hcpcs_description']
hcpcs_df = pd.read_csv(partb_file, usecols=partb_cols)
unique_hcpcs = hcpcs_df['hcpcs_code'].unique()
print(f'Loaded {len(unique_hcpcs)} unique hcpcs codes')

Loaded 7520 unique hcpcs codes


## Convert HCPCS Codes to CUI Codes

The UMLS (Unified Medical Language System) provides a catalog of medical terms and their respective CUIs (Concept Unique Identifiers).

In [21]:
hcpcs_df.loc[hcpcs_df['hcpcs_code']=='G0438'].head(1).values

array([['G0438',
        'Annual wellness visit; includes a personalized prevention plan of service (pps), initial visit']],
      dtype=object)

In [25]:
with open('unique-hcpcs.csv', 'w') as fout:
  fout.write(','.join(unique_hcpcs))

In [20]:
unique_hcpcs[129]

'G0438'

## Load Cui2Vec Embeddings

In [33]:
cui2vec_file = '/Users/jujohnson/Desktop/cui2vec_pretrained.csv'
cui2vecs = {}

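# Read the whole file; `lines` is reused below to count the embeddings.
with open(cui2vec_file, 'r') as fin:
  lines = fin.readlines()

# Drop the header row.
lines = lines[1:]

# The first field is the concept ID; the remaining fields are the vector components.
for line in lines:
  code, *embedding = line.strip().split(',')
  cui2vecs[code] = [float(x) for x in embedding]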
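As an alternative to reading every line into memory, the same table can be parsed row by row with the standard `csv` module. The sketch below is illustrative only: `load_embeddings` is a hypothetical helper, and the sample rows (IDs and values) are made up rather than taken from the real cui2vec file. One practical difference from a plain `split(',')` is that `csv.reader` strips surrounding double quotes from fields, which matters if the file stores its IDs in quoted form.

```python
import csv
import io


def load_embeddings(fin):
    """Parse a CSV whose first column is a concept ID and whose
    remaining columns are the float components of its vector.

    Hypothetical helper, not part of the original notebook.
    """
    reader = csv.reader(fin)
    next(reader, None)  # skip the header row
    vectors = {}
    for row in reader:
        if not row:  # ignore blank lines
            continue
        code, *values = row
        vectors[code] = [float(x) for x in values]
    return vectors


# Toy stand-in for the real file; the IDs and values are made up.
sample = io.StringIO(
    '"","V1","V2","V3"\n'
    '"C0000001",0.1,-0.2,0.3\n'
    '"C0000002",0.4,0.5,-0.6\n'
)
sample_vectors = load_embeddings(sample)
print(len(sample_vectors), 'vectors loaded')
```

When reading from disk, the `csv` documentation recommends opening the file with `newline=''`, for example `open(cui2vec_file, newline='')`.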

## Create HCPCS-Embedding Mapping

Before we can map HCPCS -> CUI, we need access to the UMLS. As a first check, count how many HCPCS codes appear directly as keys in the embedding table.
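Once a UMLS release is available, one way to build the HCPCS -> CUI lookup is to scan its `MRCONSO.RRF` concept table, which is pipe-delimited with the CUI in the first field, the source abbreviation (`SAB`) in the twelfth, and the source code (`CODE`) in the fourteenth. The sketch below assumes that layout. `build_code_to_cuis` and `toy_row` are hypothetical helpers, the choice of `HCPCS` and `CPT` as source abbreviations is an assumption to verify against the release in use, and the CUIs in the sample rows are placeholders, not real identifiers.

```python
from collections import defaultdict

# Assumed zero-based field positions in MRCONSO.RRF.
CUI_COL, SAB_COL, CODE_COL = 0, 11, 13


def build_code_to_cuis(rrf_lines, sources=("HCPCS", "CPT")):
    """Map each source code to the set of CUIs it is listed under.

    Hypothetical helper; `sources` is an assumption about which
    UMLS vocabularies carry the Part B procedure codes.
    """
    code_to_cuis = defaultdict(set)
    for line in rrf_lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= CODE_COL:  # skip malformed rows
            continue
        if fields[SAB_COL] in sources:
            code_to_cuis[fields[CODE_COL]].add(fields[CUI_COL])
    return dict(code_to_cuis)


def toy_row(cui, sab, code):
    """Build an 18-field MRCONSO-style row with only three fields filled."""
    fields = [""] * 18
    fields[CUI_COL], fields[SAB_COL], fields[CODE_COL] = cui, sab, code
    return "|".join(fields) + "|\n"


# Placeholder rows; the CUIs are made up for illustration.
toy_rows = [
    toy_row("C0000001", "HCPCS", "G0438"),
    toy_row("C0000002", "CPT", "99213"),
    toy_row("C0000003", "MSH", "D000001"),  # other vocabulary, ignored
]
code_to_cuis = build_code_to_cuis(toy_rows)
print(code_to_cuis)
```

With a real release, the same function can be fed an open file handle, since iterating over a file yields one line at a time.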

In [34]:
len(lines)

109053

In [36]:
exists = 0

# Count HCPCS codes that appear directly as keys in the embedding table.
for hcpcs in unique_hcpcs:
  if hcpcs in cui2vecs:
    exists += 1

In [37]:
exists

0
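A count of zero is what one would expect if the embedding table is keyed by CUI rather than by HCPCS code, so the lookup has to go through an HCPCS -> CUI mapping first. The sketch below shows how coverage could be measured once such a mapping exists. `embedding_coverage` is a hypothetical helper, and every code, CUI, and vector in the example is a made-up placeholder.

```python
def embedding_coverage(codes, code_to_cuis, vectors):
    """Return {code: [CUIs with an embedding]} for codes that have any.

    Hypothetical helper, not part of the original notebook.
    """
    covered = {}
    for code in codes:
        hits = [cui for cui in sorted(code_to_cuis.get(code, ()))
                if cui in vectors]
        if hits:
            covered[code] = hits
    return covered


# Placeholder inputs for illustration only.
toy_codes = ["G0438", "99213", "A0000"]
toy_code_to_cuis = {"G0438": {"C0000001"}, "99213": {"C0000002"}}
toy_embeddings = {"C0000001": [0.1, 0.2]}

covered = embedding_coverage(toy_codes, toy_code_to_cuis, toy_embeddings)
print(f"{len(covered)} of {len(toy_codes)} codes have an embedding")
```

A code that maps to several CUIs keeps all of its matches here; how to combine them into one vector per code (for example by averaging) is a separate design decision.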