# Merging Datasets
# 02_EDA_merging_files

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 22/09/25 | Adrienne | Update | Made significant changes to program | 
| 23/09/25 | Adrienne | Update | Restructured file to merge EOB and Claim |

### Content

* [Summary](#summary)
* [Merging files](#merging-files)


### Summary 
Through EDA we found that the Coverage and Claim Response files contain no information useful for our unsupervised learning task. We are now doing further EDA on the Patient file, because the number of records we can pull from the API appears to be limited. This notebook attempts to merge some of the files to check whether the same patients appear in more than one file.

Some notes on identifiers:
- EOB does not have Patient Medicare Number
- Claim does not have Patient Number
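
Because of the two gaps noted above, one way to relate EOB rows to Claim rows is to use the Patient file as a bridge, since it carries both identifiers. The following is a minimal sketch on made-up toy data; the frames `eob`, `patient`, and `claim` and the columns `eob_id` and `claim_id` are hypothetical stand-ins, not the real files.

```python
import pandas as pd

# Toy stand-ins for the three files (all values are made up).
eob = pd.DataFrame({"patient_number": ["100", "101"], "eob_id": ["e1", "e2"]})
patient = pd.DataFrame({
    "patient_number": ["100", "101", "102"],
    "patient_medicare_number": ["M-A", "M-B", "M-C"],
})
claim = pd.DataFrame({
    "patient_medicare_number": ["M-A", "M-C"],
    "claim_id": ["c1", "c2"],
})

# Step 1: attach the Medicare number to each EOB row through Patient.
eob_with_medicare = eob.merge(patient, on="patient_number", how="left")

# Step 2: join to Claim on the Medicare number.
linked = eob_with_medicare.merge(claim, on="patient_medicare_number", how="inner")
print(linked)
```

This only works if the patient numbers in EOB and Patient share a format, which is what the checks later in this notebook examine.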

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from datetime import datetime
import json_lines

In [3]:
# Read in the clean data files
path = "../data/clean/"
#eob_df = pd.read_pickle(path + 'explanation_of_benefit_clean.pkl')
eob_df = pd.read_pickle(path + 'eob.pkl')
#coverage_df = pd.read_pickle(path + 'coverage.pkl')
claim_df = pd.read_pickle(path + 'claim_sample.pkl')
#claim_df =  pd.read_pickle(path + 'claim.pkl')
#claim_response_df =  pd.read_pickle(path + 'claim_response.pkl')
patient_df = pd.read_pickle(path + 'patient.pkl')

Creating some common variables across datafiles

In [5]:
# Preprocess identifiers in EOB
eob_df['patient_medicare_number'] = eob_df['contained_0_identifier_0_value']
eob_df['patient_number'] = eob_df['id'].str.split('-').str[-1]
#eob_df['unique_claim_ID'] = eob_df['claimId_2'].str.replace(r'[-]', '', regex=True)

In [None]:
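# Need to convert beneficiary to string so it can be used as a merge key.
# Requires coverage_df, whose read_pickle line above is currently commented out.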
coverage_df['patient_number'] = coverage_df['beneficiary'].apply(str)
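
As a rough illustration of why the key is cast to string before merging, the sketch below assumes `beneficiary` is stored as integers while the other file stores patient numbers as strings. The data and the `name` column are made up.

```python
import pandas as pd

# Hypothetical toy frames: one key is numeric, the other is text.
coverage = pd.DataFrame({"beneficiary": [1001, 1002, 1003]})
patient = pd.DataFrame({"patient_number": ["1001", "1003"], "name": ["A", "B"]})

# Cast the numeric key to string so both merge keys share a dtype.
coverage["patient_number"] = coverage["beneficiary"].astype(str)

merged = coverage.merge(patient, on="patient_number", how="inner")
print(merged)
```

`astype(str)` is used here instead of `apply(str)`; for plain integer columns the two should give the same strings.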

In [6]:
# Preprocess Identifier in Patient
patient_df['patient_medicare_number'] = patient_df['identifier_1_value']
patient_df['patient_number'] = patient_df['identifier_0_value'].str.replace(r'[-]', '', regex=True)
patient_df['patient_first_name'] = patient_df['name_0_given'].str.replace(r'[ \[ \]"]', '', regex=True)
patient_df['patient_last_name'] = patient_df['name_0_family']
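
To make the string operations in the preprocessing cells easier to follow, here is a small sketch that applies the same three patterns to made-up values. The input strings are invented for illustration and may not match the real formats.

```python
import pandas as pd

# Made-up raw values shaped loosely like the columns used above.
raw = pd.DataFrame({
    "id": ["ExplanationOfBenefit-123", "eob-45-678"],
    "identifier_0_value": ["12-34-56", "789"],
    "name_0_given": ['["Jane"]', '[ "John" ]'],
})

# Keep only the text after the last hyphen.
raw["patient_number_from_id"] = raw["id"].str.split("-").str[-1]

# Remove every hyphen.
raw["patient_number"] = raw["identifier_0_value"].str.replace(r"[-]", "", regex=True)

# Strip spaces, square brackets, and double quotes.
raw["patient_first_name"] = raw["name_0_given"].str.replace(r'[ \[ \]"]', "", regex=True)

print(raw[["patient_number_from_id", "patient_number", "patient_first_name"]])
```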

Checking unique values and identifier string lengths before merging

In [None]:
# EOB
print(f"num of unique patient numbers in EOB: {len(eob_df['patient_number'].unique())}")
#print(f"num of unique claim IDs in EOB: {len(eob_df['unique_claim_ID'].unique())}")

# Patient
print(f"num of unique patient numbers in Patient: {len(patient_df['patient_number'].unique())}")
print(f"num of unique patient medicare numbers in Patient: {len(patient_df['patient_medicare_number'].unique())}")
print(f"num of unique patient first names in Patient: {len(patient_df['patient_first_name'].unique())}")

# Coverage
#print(f"num of unique patient numbers in Coverage: {len(coverage_df['patient_number'].unique())}")

# Claim
print(f"num of unique patient medicare numbers in Claim: {len(claim_df['patient_medicare_number'].unique())}")
print(f"num of unique claim IDs in Claim: {len(claim_df['unique_claim_ID'].unique())}")
#print(f"num of unique patient medicare numbers in Claim Response: {len(claim_response_df['patient_medicare_number'].unique())}")

num of unique patient numbers in EOB: 3233
num of unique patient numbers in Patient: 5000
num of unique patient medicare numbers in Patient: 5000
num of unique patient first names in Patient: 2
num of unique patient medicare numbers in Claim: 2719
num of unique claim IDs in Claim: 19139
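
One caveat when reading these counts: `len(series.unique())` counts a missing value as its own entry, while `series.nunique()` drops missing values by default. The toy series below is made up to show the difference; whether the real columns contain missing values is not checked here.

```python
import numpy as np
import pandas as pd

# Toy series with one repeated value and one missing value.
s = pd.Series(["a", "b", "a", np.nan])

count_with_missing = len(s.unique())          # missing value counted as an entry
count_without_missing = s.nunique()           # missing values dropped by default
count_explicit = s.nunique(dropna=False)      # missing value counted again

print(count_with_missing, count_without_missing, count_explicit)
```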


In [9]:
# Checking lengths of values in columns
print(eob_df['patient_number'].str.len().unique())
#print(eob_df['unique_claim_ID'].str.len().unique())
print(patient_df['patient_number'].str.len().unique())
print(patient_df['patient_medicare_number'].str.len().unique())
print(claim_df['patient_medicare_number'].str.len().unique())
print(claim_df['unique_claim_ID'].str.len().unique())
print(claim_df['identifier_0_value'].str.len().unique())
#print(claim_response_df['patient_medicare_number'].str.len().unique())

[10 11  9  8  7]
[14]
[11]
[11]
[ 9 19 14]
[10 13 23 15]


Counts by identifier length

In [10]:
eob_df['length_counts'] = eob_df['patient_number'].str.len()
#claim_df['length_counts'] = claim_df['identifier_0_value'].str.len()
length_distribution = eob_df['length_counts'].value_counts()
length_distribution

length_counts
10    1914
11    1138
9      161
8       19
7        1
Name: count, dtype: int64

In [11]:
claim_df['length_counts'] = claim_df['unique_claim_ID'].str.len()
#claim_df['length_counts'] = claim_df['identifier_0_value'].str.len()
length_distribution = claim_df['length_counts'].value_counts()
length_distribution

length_counts
9     17986
19     1146
14      868
Name: count, dtype: int64
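
When an identifier comes in several lengths, looking at one example value per length can help show whether the lengths reflect different formats. A minimal sketch on made-up IDs follows; the values are invented and are not taken from the Claim file.

```python
import pandas as pd

# Made-up IDs with lengths 9, 9, 19, and 14.
ids = pd.Series(
    ["123456789", "987654321", "AB-1234567890-XYZ12", "12345678901234"],
    name="unique_claim_ID",
)

# Group the IDs by their string length and keep the first example of each.
lengths = ids.str.len()
examples = ids.groupby(lengths).first()
print(examples)
```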

## Merging files

In [None]:
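# For ease of merging, initially limit the dataset columns.
# Note: eob_df['unique_claim_ID'] is only created by the commented-out line in the
# EOB preprocessing cell, so that line must be enabled before this cell will run.
eob_df_part = eob_df[['patient_number', 'unique_claim_ID']]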
patient_df_part = patient_df[['patient_number', 'patient_medicare_number']]
claim_df_part = claim_df[['patient_medicare_number', 'unique_claim_ID']]
#claim_response_df_part = claim_response_df[['patient_medicare_number', 'unique_claim_ID']]


Checking whether any patient numbers appear in both EOB and Patient

In [12]:
l1 = set(eob_df['patient_number'].unique()).intersection(patient_df['patient_number'])
len(l1)

0
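
An empty intersection is consistent with the length check above, where EOB patient numbers were 7 to 11 characters long and Patient patient numbers were all 14 characters: strings of different lengths can never be equal. The sketch below shows that quick pre-check on made-up values.

```python
import pandas as pd

# Made-up keys: lengths 7 and 11 on the left, length 14 on the right.
left = pd.Series(["1234567", "12345678901"])
right = pd.Series(["12345678901234", "98765432109876"])

# If no string length is shared, no value can be shared either.
shared_lengths = set(left.str.len()) & set(right.str.len())
shared_values = set(left) & set(right)

print(f"shared lengths: {shared_lengths}, shared values: {len(shared_values)}")
```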

In [None]:
df = pd.merge(eob_df_part, patient_df_part, how = 'outer', on = 'patient_number')

In [None]:
df1 = pd.merge(eob_df_part, claim_df_part, how = 'outer', on = 'unique_claim_ID')
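
To see how many rows of an outer merge actually matched, pandas can add a `_merge` column through the `indicator=True` argument of `pd.merge`. This is a small sketch on made-up frames, not a result from the real data.

```python
import pandas as pd

# Made-up frames that share two of their claim IDs.
left = pd.DataFrame({
    "unique_claim_ID": ["c1", "c2", "c3"],
    "patient_number": ["p1", "p2", "p3"],
})
right = pd.DataFrame({
    "unique_claim_ID": ["c2", "c3", "c4"],
    "patient_medicare_number": ["m2", "m3", "m4"],
})

# indicator=True labels each row as left_only, right_only, or both.
merged = pd.merge(left, right, how="outer", on="unique_claim_ID", indicator=True)
counts = merged["_merge"].value_counts()
print(counts)
```

A `both` count of zero would mean the merge key has no matching values in the two frames.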

In [28]:
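# coverage_df_part is not defined earlier in this notebook; this assumes coverage_df
# has been loaded and its patient_number column created as in the cell above.
coverage_df_part = coverage_df[['patient_number']]
df2 = pd.merge(coverage_df_part, patient_df_part, how = 'outer', on = 'patient_number')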
