# Merging & EDA

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 22/09/2025   | Martin | Created   | Notebook created for merging and EDA | 
| 23/09/2025   | Martin | New   | Re-explored claims data. Added findings | 

# Content

* [Loading Data](#loading-data)
* [Merging Data](#merging-data)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Loading Data

In [2]:
%load_ext watermark

In [3]:
path = "../data/clean"
claim = pd.read_pickle(f"{path}/claim_sample.pkl")
claim_old = pd.read_pickle(f"{path}/claim_old.pkl")
claim_resp = pd.read_pickle(f"{path}/claim_response.pkl")
patient = pd.read_pickle(f"{path}/patient.pkl")
eob = pd.read_pickle(f"{path}/explanation_of_benefit.pkl")
coverage = pd.read_pickle(f"{path}/coverage.pkl")

In [4]:
# Select relevant columns for initial exploration
# ('deceasedBoolean' is listed once; listing it twice would create a duplicate column)
patient = patient[['address_0_state', 'birthDate', 'deceasedBoolean', 'extension_1_valueCoding_display', 'extension_3_valueDate', 'gender', 'id', 'identifier_1_value', 'deceasedDateTime', 'address_0_postalCode']]
# Rename keys must match the source column names exactly, otherwise they are silently ignored
patient = patient.rename({
  'extension_1_valueCoding_display': 'race',
  'extension_3_valueDate': 'referenceYear',
  'identifier_1_value': 'patient_medicare_number',
  'address_0_postalCode': 'postalCode'
}, axis=1)


# l = claim_resp[['id', 'contained_birthDate', 'contained_gender', 'identifier_value', 'contained_identifer_patient_medicare_number']]
# claim_ids = claim[['id', 'patient_medicare_number', 'identifier_0_value']]

# Merging Data

- Coverage (subscriberId) -> (Many-to-One) -> Patient (patient_medicare_number)
  - Not all have a value, but if there is, 4 values per patient (Part A-D)
- EOB (patient) -> (Many-to-One) -> Patient (id)
  - Need to cast the EOB `patient` column to int before joining
  - Only ~50 patients have EOB information
- Claim (patient_medicare_number) -> (Many-to-One) -> Patient (identifier_1_value)
- Claim Response (id) -> (One-to-One) -> Claim (id)

<u>Findings</u>

- Claim Response has a one-to-one relationship with Claim, but adds little information beyond what Claim already holds
- After another look through the claims data: apart from the usual index columns and patient info, only diagnosis, priority, claim-type total appear to be "relevant" columns (tentative, given my limited knowledge of the context)
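
The relationships listed above could be wired together roughly as follows. This is a minimal sketch on toy frames, not the notebook's actual merge: the toy values, the `patient_id` rename, the `eob_id` column and the choice of left joins are all assumptions made for illustration. `validate="m:1"` is the pandas check that raises a `MergeError` when the right-hand key is not unique, which is a cheap way to confirm the many-to-one claims.

```python
import pandas as pd

# Toy stand-ins for the real frames (values are made up for illustration)
patient = pd.DataFrame({
    "id": [-1, -2, -3],
    "patient_medicare_number": ["A1", "B2", "C3"],
})
claim = pd.DataFrame({
    "id": ["c1", "c2", "c3"],
    "patient_medicare_number": ["A1", "A1", "B2"],
})
eob = pd.DataFrame({"patient": ["-1", "-1", "-3"], "eob_id": ["e1", "e2", "e3"]})

# EOB stores the patient id as a string here, so cast it before joining
eob = eob.assign(patient=eob["patient"].astype(int))

# Claim -> Patient is many-to-one on the Medicare number
claim_patient = claim.merge(
    patient.rename(columns={"id": "patient_id"}),  # avoid clashing with claim's own id
    on="patient_medicare_number",
    how="left",
    validate="m:1",
)

# EOB -> Patient is many-to-one on the integer patient id
eob_patient = eob.merge(
    patient, left_on="patient", right_on="id", how="left", validate="m:1"
)
```

A left join keeps every claim or EOB row even when no patient matches; an inner join would be the alternative if only fully matched rows are wanted.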

In [5]:
def find_common_patients(claim, eob, patient):
  """Finds the common patient ids that contain data in all 3 datasets

  Args:
      claim (pd.DataFrame): claim dataset
      eob (pd.DataFrame): eob dataset
      patient (pd.DataFrame): patient dataset

  Returns:
      None: Prints the matching Medicare numbers and patient IDs
  """
  # Patients that have claim
  l1 = set(claim['patient_medicare_number'].unique()).intersection(patient['patient_medicare_number'])

  # Patients that have an eob & claim info
  l2 = set(patient[patient['patient_medicare_number'].isin(l1)]['id']).intersection(set(eob['patient'].astype(int)))

  # Get the medicare numbers and patient ids
  mcare = set(patient[patient['id'].isin(l2)]['patient_medicare_number'])

  print("Patient Medicare Numbers:")
  print(mcare)
  print()
  print("Patient IDs")
  print(l2)

In [6]:
# These are patients that have data across all the data sets
find_common_patients(claim, eob, patient)

Patient Medicare Numbers:
{'1S00E00AE22', '1S00E00AA50'}

Patient IDs
{-10000000000050, -10000000000322}


## Re-exploring some details from claim

<u>Findings</u>

- `diagnosis` can be split into individual code columns: the codes look like ICD-10-CM codes (although the `system` field in the raw entries reads `http://hl7.org/fhir/sid/icd-9-cm`) and the longest list has 23 entries, so at most 23 new columns
  - Missing on a few rows (178,761 rows in total, 178,570 with a diagnosis list, so 191 missing)
  - The `sequence` values are also all in order, so we can represent the codes as a sequence data type
  - https://www.icd10data.com/ICD10CM/Codes/S00-T88
- Some of the entries in claim appear to be exactly the same apart from their billablePeriod. See entries 1 and 2 in the raw `Claim.ndjson`
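
One way to test the near-duplicate observation is to drop the billing-period column and look for rows that are otherwise identical. The sketch below uses toy data; the column names and values are assumptions, and on the real frame any per-row identifier columns (such as `id`) would also need to be dropped before comparing.

```python
import pandas as pd

# Toy claims: rows 0 and 1 differ only in billablePeriod (made-up values)
claims = pd.DataFrame({
    "patient_medicare_number": ["A1", "A1", "B2"],
    "diagnosis": [[{"sequence": 1}], [{"sequence": 1}], [{"sequence": 1}]],
    "billablePeriod": [
        {"start": "2020-01-01"}, {"start": "2020-02-01"}, {"start": "2020-01-01"},
    ],
})

# Lists and dicts are unhashable, so compare their string representations instead
comparable = claims.drop(columns=["billablePeriod"]).astype(str)

# keep=False marks every member of a duplicate group, not just the repeats
near_duplicates = claims[comparable.duplicated(keep=False)]
print(near_duplicates.index.tolist())
```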

In [3]:
path = "../data/clean"
claim_old = pd.read_pickle(f'{path}/claim_old.pkl')

In [4]:
claim_old.shape

(178761, 69)

In [42]:
lengths = []
for i in claim_old['diagnosis']:
  if not isinstance(i, float):
    lengths.append(len(i))

print(pd.Series(lengths).unique())
print(len(lengths))

[10 12 11  2  1  3  5  4  6  7  8  9 13 15 16 18 17 14 19 20 21 23 22]
178570


In [None]:
codes = []
for entry in claim_old['diagnosis']:
  t = []
  if not isinstance(entry, float):
    for item in entry:
      t.append(item['diagnosisCodeableConcept']['coding'][0]['code'])
  codes.append(t)

In [None]:
# We can expand this list into columns
pd.DataFrame({
  'codes': codes
})

|  | codes |
| ---- | ----- |
| 0 | [R52, R739, E1149, E119, E781, E8881, T50904, ... |
| 1 | [R52, R739, E1142, E119, E781, E8881, T50904, ... |
| 2 | [R52, R739, E1149, E119, E781, E8881, T50904, ... |
| 3 | [R52, R739, E1149, E119, E781, E8881, T50904, ... |
| 4 | [R52, R739, E1143, E119, E781, E8881, T50904, ... |
| ... | ... |
| 178756 | [E669] |
| 178757 | [E669] |
| 178758 | [E669] |
| 178759 | [K635] |
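
Expanding the list column into one column per code could look like the sketch below. It runs on a small toy `codes` list rather than the real one, and the `diagnosis_N` column names are an assumption. Passing ragged lists to `pd.DataFrame` pads the shorter rows with missing values on the right, so rows without a diagnosis end up entirely missing.

```python
import pandas as pd

# Toy version of the `codes` list built above; an empty list means no diagnosis
codes = [["R52", "R739", "E1149"], ["E669"], []]

# Ragged lists are padded with missing values on the right
wide = pd.DataFrame(codes)

# Name the columns by 1-based position to mirror the FHIR `sequence` field
wide.columns = [f"diagnosis_{i + 1}" for i in wide.columns]
```

On the real data this would give at most 23 columns, most of them sparse, so keeping the list form may be preferable for sequence-style models.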


In [None]:
claim_old['diagnosis'][0]

[{'diagnosisCodeableConcept': {'coding': [{'code': 'R52',
     'system': 'http://hl7.org/fhir/sid/icd-9-cm'}]},
  'sequence': 1},
 {'diagnosisCodeableConcept': {'coding': [{'code': 'R739',
     'system': 'http://hl7.org/fhir/sid/icd-9-cm'}]},
  'sequence': 2},
 {'diagnosisCodeableConcept': {'coding': [{'code': 'E1149',
     'system': 'http://hl7.org/fhir/sid/icd-9-cm'}]},
  'sequence': 3},
 {'diagnosisCodeableConcept': {'coding': [{'code': 'E119',
     'system': 'http://hl7.org/fhir/sid/icd-9-cm'}]},
  'sequence': 4},
 {'diagnosisCodeableConcept': {'coding': [{'code': 'E781',
     'system': 'http://hl7.org/fhir/sid/icd-9-cm'}]},
  'sequence': 5},
 {'diagnosisCodeableConcept': {'coding': [{'code': 'E8881',
     'system': 'http://hl7.org/fhir/sid/icd-9-cm'}]},
  'sequence': 6},
 {'diagnosisCodeableConcept': {'coding': [{'code': 'T50904',
     'system': 'http://hl7.org/fhir/sid/icd-9-cm'}]},
  'sequence': 7,
  'type': [{'coding': [{'code': 'admitting',
      'display': 'Admitting Diagnosi

In [15]:
sequences = []
for entry in claim_old['diagnosis']:
  t = []
  if not isinstance(entry, float):
    for seq in entry:
      t.append(seq['sequence'])
  sequences.append(t)

In [16]:
sequences

[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
 [
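
Rather than eyeballing the printed lists, the "all in order" observation can be checked programmatically. The sketch below is a hypothetical helper run on a toy `sequences` list; `is_consecutive` is not part of the notebook, and the last toy entry is deliberately out of order to show what a failure looks like.

```python
# Toy version of the `sequences` list; the last entry is deliberately out of order
sequences = [[1, 2, 3], [1, 2], [], [1, 3, 2]]


def is_consecutive(seq):
    """Return True if seq is exactly 1, 2, ..., len(seq); empty lists pass."""
    return seq == list(range(1, len(seq) + 1))


# Positions of the entries whose sequence numbers are not 1..n in order
out_of_order = [i for i, seq in enumerate(sequences) if not is_consecutive(seq)]
print(out_of_order)
```

An empty result on the real list would support treating the diagnosis codes as an ordered sequence.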

In [2]:
%watermark

Last updated: 2025-09-22T23:59:12.209541+08:00

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 9.5.0

Compiler    : MSC v.1938 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
CPU cores   : 20
Architecture: 64bit

