# Merging & EDA

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 22/09/2025   | Martin | Created   | Notebook created for merging and EDA | 
| 23/09/2025   | Martin | New   | Re-explored claims data. Added findings | 

# Content

* [Loading Data](#loading-data)
* [Merging Data](#merging-data)
* [Exploring a Patient](#exploring-a-patient)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Loading Data

In [2]:
%load_ext watermark

In [3]:
path = "../data/clean"
# claim = pd.read_pickle(f"{path}/claim_sample.pkl")
claim_old = pd.read_pickle(f"{path}/claim_old.pkl")
# claim_resp = pd.read_pickle(f"{path}/claim_response.pkl")
patient = pd.read_pickle(f"{path}/patient.pkl")
eob = pd.read_pickle(f"{path}/explanation_of_benefit.pkl")
# coverage = pd.read_pickle(f"{path}/coverage.pkl")

In [4]:
# Select relevant columns for initial exploration
patient = patient[['address_0_state', 'birthDate', 'deceasedBoolean', 'extension_1_valueCoding_display', 'extension_3_valueDate', 'gender', 'id', 'identifier_1_value', 'deceasedDateTime', 'address_0_postalCode']]
# 'address_0_postalCode' is left unrenamed: later cells refer to it by this name
patient = patient.rename({
  'extension_1_valueCoding_display': 'race',
  'extension_3_valueDate': 'referenceYear',
  'identifier_1_value': 'patient_medicare_number',
}, axis=1)


# l = claim_resp[['id', 'contained_birthDate', 'contained_gender', 'identifier_value', 'contained_identifer_patient_medicare_number']]
# claim_ids = claim[['id', 'patient_medicare_number', 'identifier_0_value']]

# Merging Data

- Coverage (`subscriberId`) -> (Many-to-One) -> Patient (`patient_medicare_number`)
  - Not every patient has a value; those that do have 4 rows each (Part A-D)
- EOB (`patient`) -> (Many-to-One) -> Patient (`id`)
  - The EOB `patient` column needs to be converted to int before joining
  - Only ~50 patients have EOB information
- Claim (`patient_medicare_number`) -> (Many-to-One) -> Patient (`identifier_1_value`, renamed to `patient_medicare_number`)
- Claim Response (`id`) -> (One-to-One) -> Claim (`id`)
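
The relationships listed above could be checked at merge time rather than assumed. A minimal sketch, assuming toy stand-in frames: `patient_demo`, `eob_demo`, the `eob_id` column and all values are made up for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values)
patient_demo = pd.DataFrame({
    "id": [-1, -2, -3],
    "patient_medicare_number": ["A1", "B2", "C3"],
})
eob_demo = pd.DataFrame({
    "patient": ["-1", "-1", "-2"],  # stored as strings, like the EOB column
    "eob_id": ["e1", "e2", "e3"],   # hypothetical identifier column
})

# Convert the EOB patient reference to int so it matches Patient.id
eob_demo["patient"] = eob_demo["patient"].astype("int64")

# validate="m:1" raises a MergeError if Patient.id is not unique;
# indicator=True adds a _merge column showing which rows found a match
merged = eob_demo.merge(
    patient_demo,
    left_on="patient",
    right_on="id",
    how="left",
    validate="m:1",
    indicator=True,
)
print(merged)
```

The same pattern should apply to the Claim and Coverage joins by swapping the key columns; `validate="1:1"` would be the equivalent check for Claim Response against Claim.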

<u>Findings</u>

- Claim Response has a One-to-One relationship with Claim, but doesn't appear to provide any additional information
- After looking through the claims data again: besides the usual indexing columns and patient info, only diagnosis, priority, claim type and total seem to be the "relevant" columns (though I'm not sure, given my limited knowledge of the context)

In [48]:
def find_common_patients(claim, eob, patient):
  """Prints the patients that have data in all 3 datasets

  Args:
      claim (pd.DataFrame): claim dataset
      eob (pd.DataFrame): eob dataset
      patient (pd.DataFrame): patient dataset

  Returns:
      None: the matching Medicare numbers and patient IDs are printed
  """
  # Medicare numbers that appear in both claim and patient
  l1 = set(claim['patient_medicare_number'].unique()).intersection(patient['patient_medicare_number'])

  # Patient IDs of those patients that also have an EOB (EOB stores the ID as a string)
  l2 = set(patient[patient['patient_medicare_number'].isin(l1)]['id']).intersection(set(eob['patient'].astype(int)))

  # Medicare numbers of the patients present in all three datasets
  mcare = set(patient[patient['id'].isin(l2)]['patient_medicare_number'])
  print(len(l1))  # number of patients with at least one claim

  print("Patient Medicare Numbers:")
  print(mcare)
  print()
  print("Patient IDs")
  print(l2)

In [51]:
# These are patients that have data across all the data sets
find_common_patients(claim, eob, patient)

132
Patient Medicare Numbers:
{'1S00E00AE22', '1S00E00AA50'}

Patient IDs
{-10000000000050, -10000000000322}


## Re-exploring some details from claim

<u>Findings</u>

- `diagnosis` can be split into individual code columns: the codes all look like ICD-10-CM codes (although the `system` field in the raw records reads `icd-9-cm`), and the longest list has 23 entries, so at most 23 new columns
  - Missing on a few rows (191 of 178,761)
  - The `sequence` values are always in order, so each list can be treated as an ordered sequence
  - https://www.icd10data.com/ICD10CM/Codes/S00-T88
- Some entries in claim appear to be exactly the same apart from their `billablePeriod`. See entries 1 and 2 in the raw `Claim.ndjson`
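
One way to look for claims that differ only by their billable period is to compare rows on every other column. This is a rough sketch on made-up data: `claims_demo`, `diagnosis_0_code` and the values are hypothetical, and it assumes the record `id` also differs between such entries. Nested dict or list columns are not hashable, so the raw columns would likely need to be flattened or converted to strings first.

```python
import pandas as pd

# Hypothetical flattened claims; only the billablePeriod_* names mirror the real data
claims_demo = pd.DataFrame({
    "id": ["c1", "c2", "c3"],
    "billablePeriod_start": ["2000-06-25", "2001-01-10", "2000-06-25"],
    "billablePeriod_end": ["2000-06-25", "2001-01-10", "2000-06-26"],
    "patient_medicare_number": ["A1", "A1", "B2"],
    "diagnosis_0_code": ["R52", "R52", "E669"],
})

# Columns that are allowed to differ between "identical" claims (an assumption)
ignore = {"id", "billablePeriod_start", "billablePeriod_end"}
compare_cols = [c for c in claims_demo.columns if c not in ignore]

# keep=False flags every member of a duplicate group, not just the repeats
near_dupes = claims_demo[claims_demo.duplicated(subset=compare_cols, keep=False)]
print(near_dupes)
```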

In [3]:
path = "../data/clean"
claim_old = pd.read_pickle(f'{path}/claim_old.pkl')

In [4]:
claim_old.shape

(178761, 69)

In [42]:
# Length of each diagnosis list; missing entries are float NaN and are skipped
lengths = [len(i) for i in claim_old['diagnosis'] if not isinstance(i, float)]

print(pd.Series(lengths).unique())
print(len(lengths))

[10 12 11  2  1  3  5  4  6  7  8  9 13 15 16 18 17 14 19 20 21 23 22]
178570


In [None]:
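# Take the first coding's code from every diagnosis entry; rows without a diagnosis get an empty list
codes = []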
for entry in claim_old['diagnosis']:
  t = []
  if not isinstance(entry, float):
    for item in entry:
      t.append(item['diagnosisCodeableConcept']['coding'][0]['code'])
  codes.append(t)

In [None]:
# We can expand this list into columns
pd.DataFrame({
  'codes': codes
})

|  | codes |
| ---- | ----- |
| 0 | [R52, R739, E1149, E119, E781, E8881, T50904, ... |
| 1 | [R52, R739, E1142, E119, E781, E8881, T50904, ... |
| 2 | [R52, R739, E1149, E119, E781, E8881, T50904, ... |
| 3 | [R52, R739, E1149, E119, E781, E8881, T50904, ... |
| 4 | [R52, R739, E1143, E119, E781, E8881, T50904, ... |
| ... | ... |
| 178756 | [E669] |
| 178757 | [E669] |
| 178758 | [E669] |
| 178759 | [K635] |
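
Expanding the lists into one column per position can be done with the `DataFrame` constructor, which pads shorter rows with missing values. A minimal sketch on a few made-up rows (`codes_demo` is illustrative; the `diagnosis_` prefix is an assumed naming choice):

```python
import pandas as pd

# A small ragged list of code lists, including a row with no diagnosis
codes_demo = [["R52", "R739", "E1149"], ["E669"], []]

# One column per list position; shorter rows are padded with missing values
wide = pd.DataFrame(codes_demo).add_prefix("diagnosis_")
print(wide)
```

On the full data this would give at most 23 columns, since that is the longest list observed.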


In [None]:
claim_old['diagnosis'][0]

[{'diagnosisCodeableConcept': {'coding': [{'code': 'R52',
     'system': 'http://hl7.org/fhir/sid/icd-9-cm'}]},
  'sequence': 1},
 {'diagnosisCodeableConcept': {'coding': [{'code': 'R739',
     'system': 'http://hl7.org/fhir/sid/icd-9-cm'}]},
  'sequence': 2},
 {'diagnosisCodeableConcept': {'coding': [{'code': 'E1149',
     'system': 'http://hl7.org/fhir/sid/icd-9-cm'}]},
  'sequence': 3},
 {'diagnosisCodeableConcept': {'coding': [{'code': 'E119',
     'system': 'http://hl7.org/fhir/sid/icd-9-cm'}]},
  'sequence': 4},
 {'diagnosisCodeableConcept': {'coding': [{'code': 'E781',
     'system': 'http://hl7.org/fhir/sid/icd-9-cm'}]},
  'sequence': 5},
 {'diagnosisCodeableConcept': {'coding': [{'code': 'E8881',
     'system': 'http://hl7.org/fhir/sid/icd-9-cm'}]},
  'sequence': 6},
 {'diagnosisCodeableConcept': {'coding': [{'code': 'T50904',
     'system': 'http://hl7.org/fhir/sid/icd-9-cm'}]},
  'sequence': 7,
  'type': [{'coding': [{'code': 'admitting',
      'display': 'Admitting Diagnosi

In [15]:
# Collect the sequence numbers of every diagnosis list; rows without a diagnosis get an empty list
sequences = []
for entry in claim_old['diagnosis']:
  t = []
  if not isinstance(entry, float):
    for seq in entry:
      t.append(seq['sequence'])
  sequences.append(t)
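
The claim that the sequences are all in order could be verified with a check along these lines. This sketch assumes "in order" means the values run 1, 2, 3, ... with no gaps, as in the sample record shown earlier; `is_consecutive` and `sequences_demo` are hypothetical names with made-up values.

```python
def is_consecutive(seq):
    """Return True if seq is exactly 1, 2, ..., len(seq) (an empty list counts as in order)."""
    return list(seq) == list(range(1, len(seq) + 1))

# Made-up examples: three well-formed lists and one out-of-order list
sequences_demo = [[1, 2, 3], [1], [], [1, 3, 2]]
in_order = [is_consecutive(s) for s in sequences_demo]
print(in_order)
print(all(in_order))
```

Running `all(is_consecutive(s) for s in sequences)` on the real list would confirm or refute the finding.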

# Exploring a Patient

In [5]:
patient_id = -10000000000050
medicare_number = "1S00E00AE22"

In [15]:
c = claim_old[claim_old['contained_identifer_patient_medicare_number'] == medicare_number]
e = eob[eob['patient'] == str(patient_id)]
p = patient[patient['id'] == patient_id]

## Patient Details

In [None]:
# The filter returns a single row; drop duplicated column names before taking it
row = p.loc[:, ~p.columns.duplicated()].iloc[0]
print(f"Patient - Medicare Number: {medicare_number}, ID: {patient_id}")
print(f"Race: {row['race']}")
print(f"Gender: {row['gender']}")
print(f"Birth Date: {row['birthDate']}")
print(f"Deceased: {row['deceasedBoolean'] == 1}")
print(f"Postal Code: {row['address_0_postalCode']}")


Patient - Medicare Number: 1S00E00AE22, ID: -10000000000050
Race: White
Gender: female
Birth Date: 1954-07-31
Deceased: False
Postal Code: 01851


In [19]:
c.head()

Unnamed: 0,billablePeriod,contained,created,diagnosis,extension,facility,id,identifier,insurance,item,...,diagnosis_14,diagnosis_15,diagnosis_16,diagnosis_17,diagnosis_18,diagnosis_19,diagnosis_20,diagnosis_21,diagnosis_22,contained_identifer_patient_medicare_number
121986,"{'end': '2000-06-25', 'start': '2000-06-25'}","[{'birthDate': '1947-06-12', 'extension': [{'u...",2025-09-03T22:17:04+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,{'extension': [{'url': 'https://bluebutton.cms...,f-LTA1NWM1MmQyYmMyMWI5ODU1MWY1ZDE,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,,...,,,,,,,,,,1S00E00AE22
121987,"{'end': '1947-08-09', 'start': '1947-08-09'}","[{'birthDate': '1947-06-29', 'extension': [{'u...",2025-09-03T22:17:04+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,{'extension': [{'url': 'https://bluebutton.cms...,f-LTEwMDA5MzA3Mg,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'detail': [{'productOrService': {'coding': [...,...,,,,,,,,,,1S00E00AE22
121988,"{'end': '1990-03-09', 'start': '1990-03-08'}","[{'birthDate': '1947-06-29', 'extension': [{'u...",2025-09-03T22:17:04+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,{'extension': [{'url': 'https://bluebutton.cms...,f-LTEwMDA5MzA4NA,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'detail': [{'productOrService': {'coding': [...,...,,,,,,,,,,1S00E00AE22
121989,"{'end': '1990-06-05', 'start': '1990-06-04'}","[{'birthDate': '1947-06-29', 'extension': [{'u...",2025-09-03T22:17:04+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,{'extension': [{'url': 'https://bluebutton.cms...,f-LTEwMDA5MzA5MQ,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'detail': [{'productOrService': {'coding': [...,...,,,,,,,,,,1S00E00AE22
121990,"{'end': '1991-10-18', 'start': '1991-10-17'}","[{'birthDate': '1947-06-29', 'extension': [{'u...",2025-09-03T22:17:04+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,{'extension': [{'url': 'https://bluebutton.cms...,f-LTEwMDA5MzA5NA,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'detail': [{'productOrService': {'coding': [...,...,,,,,,,,,,1S00E00AE22


In [20]:

claim = pd.read_pickle(f"{path}/claim_sample.pkl")

In [22]:
c = claim[claim['patient_medicare_number'] == medicare_number]

In [24]:
c.sort_values('billablePeriod_start')

Unnamed: 0,billablePeriod_end,billablePeriod_start,contained_0_birthDate,contained_0_extension_0_url,contained_0_extension_0_valueCode,contained_0_gender,contained_0_id,contained_0_identifier_0_system,contained_0_identifier_0_type_coding_0_code,contained_0_identifier_0_type_coding_0_display,...,item_153_detail_0_quantity_system,item_153_detail_0_quantity_unit,item_153_detail_0_quantity_value,item_153_detail_0_sequence,patient_medicare_number,patient_number,patient_first_name,patient_last_name,unique_claim_ID,hcpcs_code
121987,1947-08-09,1947-08-09,1947-06-29,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,,,,1S00E00AE22,#patient,Vennie613,Wiza601,100054616,99221
122088,1981-09-13,1981-09-13,,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,,,,1S00E00AE22,#patient,V.,Wiza60,100038542,
122080,1989-06-22,1989-06-22,,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,,,,1S00E00AE22,#patient,K.,Runolf,07cad5d54d0936,
122076,1997-06-12,1997-06-12,1947-06-12,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,,,,1S00E00AE22,#patient,Kylee806,Runolfsdottir78,dcn9db1eaad6e9b3d3c4e7,
122018,1999-10-20,1999-10-19,1947-06-29,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,,,,1S00E00AE22,#patient,Vennie613,Wiza601,100054666,99221
122126,2002-07-11,2002-07-11,,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,,,,1S00E00AE22,#patient,K.,Runolf,bb9ae6501358cb,
121997,2004-10-30,2004-10-29,1947-06-29,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,,,,1S00E00AE22,#patient,Vennie613,Wiza601,100054696,99221
122078,2012-06-14,2012-06-14,1947-06-12,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,,,,1S00E00AE22,#patient,Kylee806,Runolfsdottir78,dcncebbdfc6866fb78546b,
122069,2013-07-21,2013-07-21,1947-06-12,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,,,,1S00E00AE22,#patient,Kylee806,Runolfsdottir78,dcn2a91fbb24d6c4348362,
122071,2018-07-21,2018-07-21,1947-06-12,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,,,,1S00E00AE22,#patient,Kylee806,Runolfsdottir78,dcn65228a34a407a576d67,


In [25]:
[i for i in c.columns if i.startswith('diagnosis')]

['diagnosis_0_diagnosisCodeableConcept_coding_0_code',
 'diagnosis_0_diagnosisCodeableConcept_coding_0_system',
 'diagnosis_0_sequence',
 'diagnosis_1_diagnosisCodeableConcept_coding_0_code',
 'diagnosis_1_diagnosisCodeableConcept_coding_0_system',
 'diagnosis_1_sequence',
 'diagnosis_2_diagnosisCodeableConcept_coding_0_code',
 'diagnosis_2_diagnosisCodeableConcept_coding_0_system',
 'diagnosis_2_sequence',
 'diagnosis_3_diagnosisCodeableConcept_coding_0_code',
 'diagnosis_3_diagnosisCodeableConcept_coding_0_system',
 'diagnosis_3_sequence',
 'diagnosis_4_diagnosisCodeableConcept_coding_0_code',
 'diagnosis_4_diagnosisCodeableConcept_coding_0_system',
 'diagnosis_4_sequence',
 'diagnosis_5_diagnosisCodeableConcept_coding_0_code',
 'diagnosis_5_diagnosisCodeableConcept_coding_0_system',
 'diagnosis_5_sequence',
 'diagnosis_6_diagnosisCodeableConcept_coding_0_code',
 'diagnosis_6_diagnosisCodeableConcept_coding_0_system',
 'diagnosis_6_sequence',
 'diagnosis_6_type_0_coding_0_code',
 'di

In [2]:
%watermark

Last updated: 2025-09-22T23:59:12.209541+08:00

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 9.5.0

Compiler    : MSC v.1938 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
CPU cores   : 20
Architecture: 64bit

