# Cleaning Claim Data
# 01_claim_data_cleaning

| Date | User | Change Type | Remarks |
| ---- | ---- | ----------- | ------- |
| 18/09/2025 | Adrienne | Created | Created to flatten data |
| 27/09/2025 | Adrienne | Update | Added Martin's code and other updates |

## Content

* [Introduction](#introduction)
* [Preprocess JSON](#preprocess-json)
* [Functions to Flatten JSON File](#functions-to-flatten-json-file)

# Preprocess JSON

## Claims

__Columns__

- `contained` - two nested resources: the patient (birthDate, extension, gender, id, identifier, name, resourceType) and the provider-org (id, identifier, resourceType)
- `created`
- `diagnosis` - [diagnosisCodeableConcept, sequence, type] x 23
- `extension`
- `id`
- `resourceType`
- `status`
- `supportingInfo`
- `type`
- `use`
- `billablePeriod_end`
- `billablePeriod_start`
- `facility_extension`
- `identifier_system`
- `identifier_type`
- `identifier_value`
- `insurance_coverage`
- `insurance_focal`
- `insurance_sequence`
- `item_extension`
- `item_productOrService`
- `item_revenue`
- `item_sequence`
- `item_servicedDate`
- `meta_lastUpdated`
- `patient_reference`
- `priority_coding`
- `provider_reference`
- `total_currency`
- `total_value`
- `contained_identifer_patient_medicare_number`
- `contained_name_family`
- `contained_name_given`

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from io import StringIO
import os
import json
from collections import OrderedDict
import pickle
import numpy as np

In [2]:
# Read in the JSON file (first 10,000 rows only; the full read is commented out)
path = "../data/raw"
#claim = pd.read_json(f"{path}/Claim.ndjson", lines=True)
claim = pd.read_json(f"{path}/Claim.ndjson", lines=True, nrows=10000)
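
In `pd.read_json`, `lines=True` treats every line of the file as one JSON record, and `nrows` limits how many of those lines are read. The sketch below mimics that behaviour with the standard library only, to show what ends up in each row before pandas builds the DataFrame. The two-record payload and the `read_ndjson` helper are hypothetical and stand in for `Claim.ndjson`.

```python
import json
from io import StringIO
from itertools import islice

# Hypothetical two-record NDJSON payload standing in for Claim.ndjson
ndjson_text = StringIO(
    '{"id": "a", "billablePeriod": {"start": "2012-09-16"}}\n'
    '{"id": "b", "billablePeriod": {"start": "2013-06-11"}}\n'
)

def read_ndjson(handle, nrows=None):
    """Parse up to nrows lines of newline-delimited JSON into a list of dicts."""
    lines = islice(handle, nrows)  # islice(handle, None) reads every line
    return [json.loads(line) for line in lines if line.strip()]

records = read_ndjson(ndjson_text, nrows=1)
print(len(records), records[0]["id"])
```

Each parsed record is still a nested dict, which is why the flattening functions further down are needed.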

The claims file is deeply nested: columns hold dictionaries, lists of dictionaries, and lists inside those. The code below prints a single row so that the structure is easy to see.

In [3]:
claim_text = claim.head(1)
#claim_text = claim.head(93532).tail(1)
for col_name, col_values in claim_text.items():
    print(f"key*: {col_name}")
    for item in col_values:
        if isinstance(item, dict):
            # Iterate over the dict itself, not the enclosing Series
            for key, value in item.items():
                print(f"\tkey:: ({key}) value: ({value})\n")
        elif isinstance(item, list):
            for i in item:
                if isinstance(i, dict):
                    for key, value in i.items():
                        if isinstance(value, list):
                            for y in value:
                                print(f"\t\tkey: {key} list: ({y})\n")
                        else:
                            print(f"\t\tkey: ({key}) value: ({value})\n")
                else:
                    print(f"{i}\n")
        else:
            print(f"\tvalue$: {item}\n")


key*: billablePeriod
	key:: (end) value: (2012-09-16)

	key:: (start) value: (2012-09-16)

key*: contained
		key: (birthDate) value: (1944-05-25)

		key: extension list: ({'url': 'http://hl7.org/fhir/us/core/StructureDefinition/us-core-sex', 'valueCode': '248152002'})

		key: (gender) value: (female)

		key: (id) value: (patient)

		key: identifier list: ({'system': 'http://hl7.org/fhir/sid/us-mbi', 'type': {'coding': [{'code': 'MC', 'display': "Patient's Medicare Number", 'system': 'http://terminology.hl7.org/CodeSystem/v2-0203'}]}, 'value': '1S00E00JK17'})

		key: name list: ({'family': 'Wiza601', 'given': ['Patrina117'], 'text': 'Patrina117 Wiza601 ([max 10 chars of first], [max 15 chars of last])'})

		key: (resourceType) value: (Patient)

		key: (id) value: (provider-org)

		key: identifier list: ({'system': 'https://bluebutton.cms.gov/resources/variables/fiss/meda-prov-6', 'type': {'coding': [{'code': 'PRN', 'display': 'Provider number', 'system': 'http://terminology.hl7.org/CodeSyst

### Functions to Flatten JSON File

In [4]:
def flatten_json(nested_json, prefix=''):
    """
    Recursively flattens a nested JSON object or dictionary into a single level.

    Notes:
        - Nested dictionaries and lists are flattened such that keys from deeper levels
          in the hierarchy are concatenated with underscores
        - Lists of dictionaries are handled by appending index numbers to the keys.
        - Non-dict lists are serialized using JSON encoding
        - Returns OrderedDict, a flattened version of the input json, where keys represent
          the nested structure and values are the corresponding data
    """
    out = OrderedDict()
    for key, value in nested_json.items():
        if isinstance(value, dict):
            # Recursively flatten nested dictionaries
            out.update(flatten_json(value, prefix + key + '_'))
        elif isinstance(value, list):
            if len(value) > 0:
                if isinstance(value[0], dict):
                    # Handle list of dictionaries by flattening each item
                    for i, item in enumerate(value):
                        out.update(flatten_json(item, prefix + key + '_' + str(i) + '_'))
                else:
                    # Non-dict lists are serialized into a JSON string
                    out[prefix + key] = json.dumps(value)
            else:
                # Empty lists are serialized as JSON strings
                out[prefix + key] = json.dumps(value)
        else:
            # Base case: key-value pair where value is not a list or dict
            out[prefix + key] = value
    return out

def process_dataframe(df):
    """
    Processes a pandas DataFrame by flattening any JSON-like data (dictionaries or lists)
    present in its columns and converting it into a new DataFrame.

    Notes:
        - The function iterates through each row and flattens any JSON-like data (dictionaries or lists)
        - Non-nested data is left unchanged
        - The resulting DataFrame will contain a combination of original columns and
          additional columns derived from the flattened structure
        - Returns a new pandas DataFrame with the flattened data
    """
    flattened_data = []
    for _, row in df.iterrows():
        flattened_row = {}
        for column, value in row.items():
            if isinstance(value, (dict, list)):
                # Flatten any dictionary or list
                flattened = flatten_json({column: value})
                flattened_row.update(flattened)
            else:
                # Keep non-nested columns unchanged
                flattened_row[column] = value
        flattened_data.append(flattened_row)
    return pd.DataFrame(flattened_data)
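
To make the key naming concrete, here is a small worked example. It repeats a condensed copy of `flatten_json` so that it runs on its own, and applies it to a hypothetical record trimmed from the structure printed earlier. The `sample` dictionary is illustrative and is not taken from the data file.

```python
import json
from collections import OrderedDict

def flatten_json(nested_json, prefix=''):
    # Condensed copy of the notebook's function, repeated so this cell runs on its own
    out = OrderedDict()
    for key, value in nested_json.items():
        if isinstance(value, dict):
            out.update(flatten_json(value, prefix + key + '_'))
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            for i, item in enumerate(value):
                out.update(flatten_json(item, prefix + key + '_' + str(i) + '_'))
        elif isinstance(value, list):
            # Empty lists and lists of scalars are kept as JSON strings
            out[prefix + key] = json.dumps(value)
        else:
            out[prefix + key] = value
    return out

# Hypothetical record, trimmed from the structure printed earlier
sample = {
    "billablePeriod": {"start": "2012-09-16", "end": "2012-09-16"},
    "contained": [
        {"id": "patient", "name": [{"family": "Wiza601", "given": ["Patrina117"]}]},
        {"id": "provider-org"},
    ],
}

flat = flatten_json(sample)
for key, value in flat.items():
    print(f"{key}: {value}")
```

Note that `contained_0_name_0_given` comes out as the string `["Patrina117"]` rather than a bare name, because lists of scalars are serialized with `json.dumps`. Columns built from such lists therefore need a clean-up step before use.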


In [31]:
# Flatten the claim DataFrame
flat_claim_df = process_dataframe(claim)

### Specific Column Processing

In [32]:
# Preprocess other columns
flat_claim_df['patient_medicare_number'] = flat_claim_df['contained_0_identifier_0_value']
flat_claim_df['patient_first_name'] = flat_claim_df['contained_0_name_0_given'].str.replace(r'[ \[ \]"]', '', regex=True)
flat_claim_df['patient_last_name'] = flat_claim_df['contained_0_name_0_family']
flat_claim_df['unique_claim_ID'] = flat_claim_df['identifier_0_value'].str.replace(r'[-]', '', regex=True).str.replace('dcn', '')
flat_claim_df['drg_code'] = flat_claim_df['supportingInfo_1_code_coding_0_code']
flat_claim_df['provider_number'] = flat_claim_df['contained_1_identifier_0_value']
flat_claim_df['national_provider_identifier'] = flat_claim_df['contained_1_identifier_1_value']
flat_claim_df['type_of_bill'] = flat_claim_df['supportingInfo_0_code_coding_0_code']
flat_claim_df['claim_type'] = flat_claim_df['type_coding_0_code']
flat_claim_df['location_of_bill'] = flat_claim_df['facility_extension_0_valueCoding_code']
flat_claim_df['gender'] = flat_claim_df['contained_0_gender']
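
The given-name column holds a JSON-encoded list, so the regex above strips spaces, brackets and quotes to recover the name. That works for a single one-word name, but a value such as `["Mary Ann", "Lou"]` would collapse to `MaryAnn,Lou`. As an alternative sketch, the value can be decoded with `json.loads` instead. The helper name `first_given_name` is hypothetical, and taking only the first entry is an assumption about what is wanted.

```python
import json

def first_given_name(serialized):
    """Decode a JSON-encoded list of given names and return the first one (hypothetical helper)."""
    if not isinstance(serialized, str):
        # Missing values arrive as NaN (a float), not as a string
        return None
    names = json.loads(serialized)
    return names[0] if names else None

print(first_given_name('["Patrina117"]'))
print(first_given_name('["Mary Ann", "Lou"]'))
```

It could be applied to the column with `flat_claim_df['contained_0_name_0_given'].map(first_given_name)`.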

In [33]:
diagnosis_cols = [col for col in flat_claim_df.columns if 'diagnosisCodeableConcept_coding_0_code' in col]
#diagnosis_cols

In [34]:
# Create list column of diagnoses
flat_claim_df['diagnosis_ls'] = flat_claim_df[diagnosis_cols].apply(lambda row: [x for x in row if pd.notnull(x)] , axis = 1)

In [35]:
# Quick check
diagnosis_cols.append('diagnosis_ls')
flat_claim_df[diagnosis_cols].head(3)

Unnamed: 0,diagnosis_0_diagnosisCodeableConcept_coding_0_code,diagnosis_1_diagnosisCodeableConcept_coding_0_code,diagnosis_2_diagnosisCodeableConcept_coding_0_code,diagnosis_3_diagnosisCodeableConcept_coding_0_code,diagnosis_4_diagnosisCodeableConcept_coding_0_code,diagnosis_5_diagnosisCodeableConcept_coding_0_code,diagnosis_6_diagnosisCodeableConcept_coding_0_code,diagnosis_7_diagnosisCodeableConcept_coding_0_code,diagnosis_8_diagnosisCodeableConcept_coding_0_code,diagnosis_9_diagnosisCodeableConcept_coding_0_code,diagnosis_10_diagnosisCodeableConcept_coding_0_code,diagnosis_11_diagnosisCodeableConcept_coding_0_code,diagnosis_12_diagnosisCodeableConcept_coding_0_code,diagnosis_13_diagnosisCodeableConcept_coding_0_code,diagnosis_14_diagnosisCodeableConcept_coding_0_code,diagnosis_15_diagnosisCodeableConcept_coding_0_code,diagnosis_16_diagnosisCodeableConcept_coding_0_code,diagnosis_17_diagnosisCodeableConcept_coding_0_code,diagnosis_ls
0,R52,R739,E1149,E119,E781,E8881,T50904,M19049,G43719,K011,,,,,,,,,"[R52, R739, E1149, E119, E781, E8881, T50904, ..."
1,R52,R739,E1142,E119,E781,E8881,T50904,M19049,G43719,K011,,,,,,,,,"[R52, R739, E1142, E119, E781, E8881, T50904, ..."
2,R52,R739,E1149,E119,E781,E8881,T50904,M19049,G43719,K011,,,,,,,,,"[R52, R739, E1149, E119, E781, E8881, T50904, ..."


In [None]:
# Some claims flag an admitting and/or principal diagnosis; copy each into its own column
diagnosis_code_type_cols = flat_claim_df.columns[flat_claim_df.columns.str.contains('type_0_coding_0_code|type_1_coding_0_code')]

for index, row in flat_claim_df.iterrows():
    for col in diagnosis_code_type_cols:
        if row[col] in ('admitting', 'principal'):
            # Keep the 'diagnosis_<n>_' prefix: 12 characters for a one-digit n, 13 for two digits
            prefix = col[0:12] if col[11:12] == '_' else col[0:13]
            col_name = prefix + 'diagnosisCodeableConcept_coding_0_code'
            # Writes to 'admitting_diagnosis' or 'principal_diagnosis'
            flat_claim_df.loc[index, row[col] + '_diagnosis'] = row[col_name]
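
The loop above recovers the `diagnosis_<n>_` prefix by slicing at fixed character positions, which relies on `n` having one or two digits and on every matched column starting with `diagnosis_`. A regex makes both assumptions explicit. This is a sketch only: the helper `code_column_for` is hypothetical, and the column pattern `diagnosis_<n>_type_<m>_coding_0_code` is inferred from the flattening rules rather than confirmed against the full data.

```python
import re

# Captures the leading 'diagnosis_<n>_' of a flattened diagnosis type column
DIAGNOSIS_PREFIX = re.compile(r'^(diagnosis_\d+_)type_\d+_coding_0_code$')

def code_column_for(type_col):
    """Return the diagnosis code column paired with a type column, or None (hypothetical helper)."""
    match = DIAGNOSIS_PREFIX.match(type_col)
    if match is None:
        return None
    return match.group(1) + 'diagnosisCodeableConcept_coding_0_code'

print(code_column_for('diagnosis_3_type_0_coding_0_code'))
print(code_column_for('diagnosis_12_type_1_coding_0_code'))
print(code_column_for('identifier_0_type_0_coding_0_code'))
```

Columns that do not belong to a diagnosis return `None`, so they can be skipped instead of producing a wrong column name.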

In [73]:
flat_claim_df.head(10)

Unnamed: 0,billablePeriod_end,billablePeriod_start,contained_0_birthDate,contained_0_extension_0_url,contained_0_extension_0_valueCode,contained_0_gender,contained_0_id,contained_0_identifier_0_system,contained_0_identifier_0_type_coding_0_code,contained_0_identifier_0_type_coding_0_display,...,drg_code,provider_number,national_provider_identifier,type_of_bill,claim_type,location_of_bill,gender,diagnosis_ls,admitting_diagnosis,principal_diagnosis
0,2012-09-16,2012-09-16,1944-05-25,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,922.0,321301,8885675875,1,institutional,2,female,"[R52, R739, E1149, E119, E781, E8881, T50904, ...",T50904,T50904
1,2013-06-11,2013-06-11,1944-05-25,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,917.0,321301,8885675875,1,institutional,2,female,"[R52, R739, E1142, E119, E781, E8881, T50904, ...",T50904,T50904
2,2014-04-02,2014-04-01,1944-05-25,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,947.0,321301,8885675875,1,institutional,2,female,"[R52, R739, E1149, E119, E781, E8881, T50904, ...",,R52
3,2014-11-18,2014-11-17,1944-05-25,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,947.0,321301,8885675875,1,institutional,2,female,"[R52, R739, E1149, E119, E781, E8881, T50904, ...",,R52
4,2016-04-04,2016-04-04,1944-05-25,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,917.0,321301,8885675875,1,institutional,2,female,"[R52, R739, E1143, E119, E781, E8881, T50904, ...",T50904,T50904
5,2014-11-10,2014-11-10,1944-05-25,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,922.0,321301,8885675875,1,institutional,2,female,"[R52, R739, E11610, E119, E781, E8881, T50904,...",T50904,T50904
6,2012-09-30,2012-09-30,1944-05-25,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,321301,8885675875,1,institutional,2,female,"[R52, R739, E1140, E119, E781, E8881, T50904, ...",,R52
7,2012-05-11,2012-05-11,1944-05-25,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,321301,8885675875,1,institutional,2,female,"[R52, R739, G53, E119, E781, E8881, T50904, M1...",,R52
8,2014-02-01,2014-02-01,1944-05-25,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,321301,8885675875,1,institutional,2,female,"[R52, R739, E1143, E119, E781, E8881, T50904, ...",,R52
9,2014-03-17,2014-03-17,1944-05-25,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,321301,8885675875,1,institutional,2,female,"[R52, R739, E1142, E119, E781, E8881, T50904, ...",,R52


In [74]:
# Create list column of HCPCS codes
hcpcs_cols = [col for col in flat_claim_df.columns if 'productOrService_coding_0_code' in col]
#hcpcs_cols

In [75]:
flat_claim_df['hcpcs_ls'] = flat_claim_df[hcpcs_cols].apply(lambda row: [x for x in row if pd.notnull(x)] , axis = 1)

In [76]:
# Quick check
hcpcs_cols.append('hcpcs_ls')
flat_claim_df[hcpcs_cols].head(3)

Unnamed: 0,item_0_productOrService_coding_0_code,item_1_productOrService_coding_0_code,item_2_productOrService_coding_0_code,item_3_productOrService_coding_0_code,item_4_productOrService_coding_0_code,item_5_productOrService_coding_0_code,item_6_productOrService_coding_0_code,item_7_productOrService_coding_0_code,item_8_productOrService_coding_0_code,item_9_productOrService_coding_0_code,...,item_26_detail_0_productOrService_coding_0_code,item_274_productOrService_coding_0_code,item_275_productOrService_coding_0_code,item_276_productOrService_coding_0_code,item_277_productOrService_coding_0_code,item_278_productOrService_coding_0_code,item_279_productOrService_coding_0_code,item_280_productOrService_coding_0_code,item_281_productOrService_coding_0_code,hcpcs_ls
0,99221,,,,,,,,,,...,,,,,,,,,,[99221]
1,99221,,,,,,,,,,...,,,,,,,,,,[99221]
2,99221,,,,,,,,,,...,,,,,,,,,,[99221]


In [77]:
procedure_cols = [col for col in flat_claim_df.columns if 'procedureCodeableConcept_coding_0_code' in col]
#procedure_cols

In [78]:
# Create list column of procedure codes
flat_claim_df['procedure_ls'] = flat_claim_df[procedure_cols].apply(lambda row: [x for x in row if pd.notnull(x)] , axis = 1)

In [79]:
# Quick check
procedure_cols.append('procedure_ls')
flat_claim_df[procedure_cols].head(8500).tail(5)

Unnamed: 0,procedure_0_procedureCodeableConcept_coding_0_code,procedure_1_procedureCodeableConcept_coding_0_code,procedure_2_procedureCodeableConcept_coding_0_code,procedure_3_procedureCodeableConcept_coding_0_code,procedure_4_procedureCodeableConcept_coding_0_code,procedure_5_procedureCodeableConcept_coding_0_code,procedure_6_procedureCodeableConcept_coding_0_code,procedure_7_procedureCodeableConcept_coding_0_code,procedure_8_procedureCodeableConcept_coding_0_code,procedure_9_procedureCodeableConcept_coding_0_code,...,procedure_16_procedureCodeableConcept_coding_0_code,procedure_17_procedureCodeableConcept_coding_0_code,procedure_18_procedureCodeableConcept_coding_0_code,procedure_19_procedureCodeableConcept_coding_0_code,procedure_20_procedureCodeableConcept_coding_0_code,procedure_21_procedureCodeableConcept_coding_0_code,procedure_22_procedureCodeableConcept_coding_0_code,procedure_23_procedureCodeableConcept_coding_0_code,procedure_24_procedureCodeableConcept_coding_0_code,procedure_ls
8495,,,,,,,,,,,...,,,,,,,,,,[]
8496,,,,,,,,,,,...,,,,,,,,,,[]
8497,,,,,,,,,,,...,,,,,,,,,,[]
8498,,,,,,,,,,,...,,,,,,,,,,[]
8499,,,,,,,,,,,...,,,,,,,,,,[]


## Drop Columns

In [80]:
bef_len = str(len(flat_claim_df.columns))

In [81]:
# Remove all columns that have system or url in the name
drop_cols = flat_claim_df.columns[flat_claim_df.columns.str.contains("system|url")]
flat_claim_df = flat_claim_df.drop(drop_cols, axis = 1)

# Remove all columns that have extension in the name
ext_cols = sorted([col for col in flat_claim_df.columns if 'extension' in col])
flat_claim_df = flat_claim_df.drop(ext_cols, axis = 1)
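
Both filters above are substring matches on the column name: `str.contains` treats `"system|url"` as a regular expression, and `'extension' in col` matches anywhere in the name. The sketch below applies the same patterns to a handful of hypothetical column names, written in the style the flattening step produces, to show which ones survive.

```python
import re

# Hypothetical column names in the style produced by the flattening step
columns = [
    "billablePeriod_start",
    "contained_0_identifier_0_system",
    "contained_0_extension_0_url",
    "contained_0_extension_0_valueCode",
    "contained_0_gender",
    "item_0_productOrService_coding_0_code",
]

# Same patterns as the cell above, applied as substring searches
drop_pattern = re.compile(r"system|url|extension")
kept = [col for col in columns if not drop_pattern.search(col)]
dropped = [col for col in columns if drop_pattern.search(col)]

print(kept)
print(dropped)
```

Derived columns such as `location_of_bill` survive this step even though their source column (`facility_extension_0_valueCoding_code`) is dropped, because the match is on the column name and not on where the values came from.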

In [82]:
print(f'number of columns before dropping: {bef_len}')
print(f'number of columns after dropping: {len(flat_claim_df.columns)}')

number of columns before dropping: 5337
number of columns after dropping: 1512


In [83]:
flat_claim_df.head()

Unnamed: 0,billablePeriod_end,billablePeriod_start,contained_0_birthDate,contained_0_gender,contained_0_id,contained_0_identifier_0_type_coding_0_code,contained_0_identifier_0_type_coding_0_display,contained_0_identifier_0_value,contained_0_name_0_family,contained_0_name_0_given,...,national_provider_identifier,type_of_bill,claim_type,location_of_bill,gender,diagnosis_ls,admitting_diagnosis,principal_diagnosis,hcpcs_ls,procedure_ls
0,2012-09-16,2012-09-16,1944-05-25,female,patient,MC,Patient's Medicare Number,1S00E00JK17,Wiza601,"[""Patrina117""]",...,8885675875,1,institutional,2,female,"[R52, R739, E1149, E119, E781, E8881, T50904, ...",T50904,T50904,[99221],[]
1,2013-06-11,2013-06-11,1944-05-25,female,patient,MC,Patient's Medicare Number,1S00E00JK17,Wiza601,"[""Patrina117""]",...,8885675875,1,institutional,2,female,"[R52, R739, E1142, E119, E781, E8881, T50904, ...",T50904,T50904,[99221],[]
2,2014-04-02,2014-04-01,1944-05-25,female,patient,MC,Patient's Medicare Number,1S00E00JK17,Wiza601,"[""Patrina117""]",...,8885675875,1,institutional,2,female,"[R52, R739, E1149, E119, E781, E8881, T50904, ...",,R52,[99221],[]
3,2014-11-18,2014-11-17,1944-05-25,female,patient,MC,Patient's Medicare Number,1S00E00JK17,Wiza601,"[""Patrina117""]",...,8885675875,1,institutional,2,female,"[R52, R739, E1149, E119, E781, E8881, T50904, ...",,R52,[99221],[]
4,2016-04-04,2016-04-04,1944-05-25,female,patient,MC,Patient's Medicare Number,1S00E00JK17,Wiza601,"[""Patrina117""]",...,8885675875,1,institutional,2,female,"[R52, R739, E1143, E119, E781, E8881, T50904, ...",T50904,T50904,[99221],[]


In [None]:
# Save the flattened claims to pickle. The sampling lines are commented out,
# so claim_sample.pkl currently holds the same rows as claim.pkl.
flat_claim_df.to_pickle("../data/clean/claim.pkl")

# flat_claim_df =  flat_claim_df.sample(n=20000)
flat_claim_df.to_pickle("../data/clean/claim_sample.pkl")

# flat_claim_df =  flat_claim_df.sample(n=10000)
# flat_claim_df.to_pickle("../data/clean/claim_mini_sample.pkl")