# Cleaning Claim Data
# 01_claim_data_cleaning

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 18/09/2025   | Adrienne | Created   | Created to flatten data | 
|    | |   | |

# Content

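* [Preprocess JSON](#preprocess-json)
    * [Claims](#claims)
        * [Functions to Flatten JSON File](#functions-to-flatten-json-file)
        * [Specific Column Processing](#specific-column-processing)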

# Preprocess JSON

## Claims

__Columns__

- `contained` - a list of two resources: the patient (birthDate, extension, gender, id, identifier, name, resourceType) and the provider organization with id `provider-org` (id, identifier, resourceType)
- `created`
- `diagnosis` - [diagnosisCodeableConcept, sequence, type] x 23
- `extension`
- `id`
- `resourceType`
- `status`
- `supportingInfo`
- `type`
- `use`
- `billablePeriod_end`
- `billablePeriod_start`
- `facility_extension`
- `identifier_system`
- `identifier_type`
- `identifier_value`
- `insurance_coverage`
- `insurance_focal`
- `insurance_sequence`
- `item_extension`
- `item_productOrService`
- `item_revenue`
- `item_sequence`
- `item_servicedDate`
- `meta_lastUpdated`
- `patient_reference`
- `priority_coding`
- `provider_reference`
- `total_currency`
- `total_value`
- `contained_identifer_patient_medicare_number`
- `contained_name_family`
- `contained_name_given`

In [43]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from io import StringIO
import os
import json
from collections import OrderedDict
import pickle

In [44]:
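# Read the newline-delimited JSON file (one claim per line)
path = "../data/raw"
# Full file: uncomment the next line and comment out the sampled read below
#claim = pd.read_json(f"{path}/Claim.ndjson", lines=True)
# Sampled read: only the first 1,000 claims
claim = pd.read_json(f"{path}/Claim.ndjson", lines=True, nrows=1000)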
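
For readers less familiar with newline-delimited JSON, the sketch below shows what `lines=True` and `nrows` do on a tiny in-memory example. The three records are made up for illustration and are not taken from `Claim.ndjson`.

```python
import pandas as pd
from io import StringIO

# Three made-up NDJSON records (one JSON object per line); not real claim data
ndjson_text = (
    '{"id": "c1", "billablePeriod": {"start": "2012-09-16", "end": "2012-09-16"}}\n'
    '{"id": "c2", "billablePeriod": {"start": "2013-06-11", "end": "2013-06-11"}}\n'
    '{"id": "c3", "billablePeriod": {"start": "2014-04-01", "end": "2014-04-02"}}\n'
)

# lines=True parses one object per line; nrows caps how many lines are read
sample = pd.read_json(StringIO(ndjson_text), lines=True, nrows=2)

print(sample.shape)
# Nested objects are not expanded: each cell holds a Python dict
print(type(sample.loc[0, "billablePeriod"]))
```

Because nested objects arrive as dicts (and nested arrays as lists) inside single cells, a separate flattening step is needed before the data can be analysed column by column.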

The claims file is deeply nested: most fields are dictionaries, lists of dictionaries, or lists nested inside those dictionaries. The code below prints the first row key by key so the structure is easy to see.

In [3]:
claim_text = claim.head(1)
for column, values in claim_text.items():
    print(f"key*: {column}")
    for item in values:
        if isinstance(item, dict):
            # Nested object: print each of its key-value pairs
            for key, value in item.items():
                print(f"\tkey:: ({key}) value: ({value})\n")
        elif isinstance(item, list):
            for i in item:
                if isinstance(i, dict):
                    # List of objects: print each pair, expanding inner lists
                    for key, value in i.items():
                        if isinstance(value, list):
                            for y in value:
                                print(f"\t\tkey: {key} list: ({y})\n")
                        else:
                            print(f"\t\tkey: ({key}) value: ({value})\n")
                else:
                    print(f"{i}\n")
        else:
            # Scalar value
            print(f"\tvalue$: {item}\n")


key*: billablePeriod
	key:: (end) value: (2012-09-16)

	key:: (start) value: (2012-09-16)

key*: contained
		key: (birthDate) value: (1944-05-25)

		key: extension list: ({'url': 'http://hl7.org/fhir/us/core/StructureDefinition/us-core-sex', 'valueCode': '248152002'})

		key: (gender) value: (female)

		key: (id) value: (patient)

		key: identifier list: ({'system': 'http://hl7.org/fhir/sid/us-mbi', 'type': {'coding': [{'code': 'MC', 'display': "Patient's Medicare Number", 'system': 'http://terminology.hl7.org/CodeSystem/v2-0203'}]}, 'value': '1S00E00JK17'})

		key: name list: ({'family': 'Wiza601', 'given': ['Patrina117'], 'text': 'Patrina117 Wiza601 ([max 10 chars of first], [max 15 chars of last])'})

		key: (resourceType) value: (Patient)

		key: (id) value: (provider-org)

		key: identifier list: ({'system': 'https://bluebutton.cms.gov/resources/variables/fiss/meda-prov-6', 'type': {'coding': [{'code': 'PRN', 'display': 'Provider number', 'system': 'http://terminology.hl7.org/CodeSyst
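
As an alternative way to inspect the nesting, a record can be pretty-printed with the standard library. The sketch below uses a hypothetical, heavily trimmed record that only mimics the shape printed above.

```python
import json

# Hypothetical, trimmed record mimicking the structure shown above
record = {
    "billablePeriod": {"end": "2012-09-16", "start": "2012-09-16"},
    "contained": [
        {
            "id": "patient",
            "resourceType": "Patient",
            "name": [{"family": "Wiza601", "given": ["Patrina117"]}],
        },
    ],
}

# indent=2 places each nesting level on its own indented line
pretty = json.dumps(record, indent=2)
print(pretty)
```

With the real data, something like `json.dumps(claim.iloc[0].to_dict(), indent=2, default=str)` may work; `default=str` is an assumption here, added in case a value such as a timestamp is not JSON-serializable.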

### Functions to Flatten JSON File

In [45]:
def flatten_json(nested_json, prefix=''):
    """
    Recursively flattens a nested JSON object or dictionary into a single level.

    Notes:
        - Nested dictionaries and lists are flattened such that keys from deeper levels
          in the hierarchy are concatenated with underscores
        - Lists of dictionaries are handled by appending index numbers to the keys.
        - Non-dict lists are serialized using JSON encoding
        - Returns OrderedDict, a flattened version of the input json, where keys represent
          the nested structure and values are the corresponding data
    """
    out = OrderedDict()
    for key, value in nested_json.items():
        if isinstance(value, dict):
            # Recursively flatten nested dictionaries
            out.update(flatten_json(value, prefix + key + '_'))
        elif isinstance(value, list):
            if len(value) > 0:
                if isinstance(value[0], dict):
                    # Handle list of dictionaries by flattening each item
                    for i, item in enumerate(value):
                        out.update(flatten_json(item, prefix + key + '_' + str(i) + '_'))
                else:
                    # Non-dict lists are serialized into a JSON string
                    out[prefix + key] = json.dumps(value)
            else:
                # Empty lists are serialized as JSON strings
                out[prefix + key] = json.dumps(value)
        else:
            # Base case: key-value pair where value is not a list or dict
            out[prefix + key] = value
    return out

def process_dataframe(df):
    """
    Processes a pandas DataFrame by flattening any JSON-like data (dictionaries or lists)
    present in its columns and converting it into a new DataFrame.

    Notes:
        - The function iterates through each row and flattens any JSON-like data (dictionaries or lists)
        - Non-nested data is left unchanged
        - The resulting DataFrame will contain a combination of original columns and
          additional columns derived from the flattened structure
        - Returns a new pandas DataFrame with the flattened data
    """
    flattened_data = []
    for _, row in df.iterrows():
        flattened_row = {}
        for column, value in row.items():
            if isinstance(value, (dict, list)):
                # Flatten any dictionary or list
                flattened = flatten_json({column: value})
                flattened_row.update(flattened)
            else:
                # Keep non-nested columns unchanged
                flattened_row[column] = value
        flattened_data.append(flattened_row)
    return pd.DataFrame(flattened_data)
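
To make the key-naming rules concrete, here is a small worked example on a made-up record. The `flatten` helper is a condensed restatement of `flatten_json` above, repeated only so the snippet runs on its own; the record is hypothetical.

```python
import json

def flatten(nested, prefix=""):
    """Condensed restatement of flatten_json's rules (for illustration only)."""
    out = {}
    for key, value in nested.items():
        if isinstance(value, dict):
            # Nested dict: recurse, joining keys with underscores
            out.update(flatten(value, f"{prefix}{key}_"))
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            # List of dicts: add the list index to the key
            for i, item in enumerate(value):
                out.update(flatten(item, f"{prefix}{key}_{i}_"))
        elif isinstance(value, list):
            # Empty or non-dict list: keep as a JSON string
            out[prefix + key] = json.dumps(value)
        else:
            out[prefix + key] = value
    return out

# Made-up record with a nested dict, a list of dicts, and a list of strings
toy = {
    "id": "c1",
    "billablePeriod": {"start": "2012-09-16", "end": "2012-09-16"},
    "contained": [{"name": [{"family": "Wiza601", "given": ["Patrina117"]}]}],
}

flat = flatten(toy)
for k, v in flat.items():
    print(k, "->", v)
```

Note that `contained_0_name_0_given` ends up holding the string `["Patrina117"]` rather than a list, which is why the given name needs extra cleaning later.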


In [46]:
# Flatten the nested claim DataFrame into one column per leaf field
flat_claim_df = process_dataframe(claim)

In [None]:
# Check one flattened column for a sample of rows
print(flat_claim_df.item_0_extension_0_url.loc[90:99])

90                                                  NaN
91                                                  NaN
92                                                  NaN
93    https://bluebutton.cms.gov/resources/variables...
94    https://bluebutton.cms.gov/resources/variables...
95    https://bluebutton.cms.gov/resources/variables...
96    https://bluebutton.cms.gov/resources/variables...
97    https://bluebutton.cms.gov/resources/variables...
98    https://bluebutton.cms.gov/resources/variables...
99    https://bluebutton.cms.gov/resources/variables...
Name: item_0_extension_0_url, dtype: object
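
The `NaN` values in rows 90 to 92 are presumably claims whose first item has no `extension` entry, so the flattened row never contained that key. A minimal sketch with made-up rows shows how a DataFrame built from dictionaries with different key sets fills the gaps (the URL is a placeholder):

```python
import pandas as pd

# Two made-up flattened rows; only the second has the extension key
rows = [
    {"id": "c1"},
    {"id": "c2", "item_0_extension_0_url": "https://example.org/variable"},
]

df = pd.DataFrame(rows)

# The column is created for every row; rows lacking the key are marked missing
missing = df["item_0_extension_0_url"].isna().tolist()
print(missing)
```

This is also why the flattened dataset is wide and sparse: every key seen in any claim becomes a column for all claims.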


### Specific Column Processing

In [47]:
# Derive readable columns from the flattened fields
# Medicare number of the contained Patient resource
flat_claim_df['patient_medicare_number'] = flat_claim_df['contained_0_identifier_0_value']
flat_claim_df['patient_number'] = flat_claim_df['patient_reference']
# The given name is a JSON string such as '["Patrina117"]'; strip spaces, brackets and quotes
flat_claim_df['patient_first_name'] = flat_claim_df['contained_0_name_0_given'].str.replace(r'[ \[ \]"]', '', regex=True)
flat_claim_df['patient_last_name'] = flat_claim_df['contained_0_name_0_family']
# Remove hyphens from the claim identifier
flat_claim_df['Unique Claim ID'] = flat_claim_df['identifier_0_value'].str.replace(r'[-]', '', regex=True)
# HCPCS code of the first line item only
flat_claim_df['hcpcs_code'] = flat_claim_df['item_0_productOrService_coding_0_code']
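
The effect of the first-name regex is easiest to see on a few values. The sketch below uses made-up inputs (the two-name case is hypothetical) and also shows one possible alternative, parsing the JSON string, which is not what the notebook does.

```python
import json
import pandas as pd

# Made-up JSON-serialized given names, including a two-name case and a missing value
given = pd.Series(['["Patrina117"]', '["Mary", "Ann"]', None])

# Same pattern as above: drop spaces, square brackets and double quotes
stripped = given.str.replace(r'[ \[ \]"]', '', regex=True)

# Alternative (an assumption, not the notebook's method): parse the JSON and join the names
parsed = given.map(lambda s: " ".join(json.loads(s)) if isinstance(s, str) else s)

print(stripped.tolist())
print(parsed.tolist())
```

With the regex, a record holding several given names keeps the separating comma (`Mary,Ann`), whereas parsing the JSON gives control over the separator. Whether multi-name records occur in this dataset has not been checked here.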

In [48]:
# List every column in the flattened dataset, one per line
for col in flat_claim_df.columns:
    print(col)
print(f"\noriginal json file had {len(claim.columns)} columns and now the dataset contains {len(flat_claim_df.columns)}")

billablePeriod_end
billablePeriod_start
contained_0_birthDate
contained_0_extension_0_url
contained_0_extension_0_valueCode
contained_0_gender
contained_0_id
contained_0_identifier_0_system
contained_0_identifier_0_type_coding_0_code
contained_0_identifier_0_type_coding_0_display
contained_0_identifier_0_type_coding_0_system
contained_0_identifier_0_value
contained_0_name_0_family
contained_0_name_0_given
contained_0_name_0_text
contained_0_resourceType
contained_1_id
contained_1_identifier_0_system
contained_1_identifier_0_type_coding_0_code
contained_1_identifier_0_type_coding_0_display
contained_1_identifier_0_type_coding_0_system
contained_1_identifier_0_value
contained_1_identifier_1_system
contained_1_identifier_1_type_coding_0_code
contained_1_identifier_1_type_coding_0_display
contained_1_identifier_1_type_coding_0_system
contained_1_identifier_1_value
contained_1_resourceType
created
diagnosis_0_diagnosisCodeableConcept_coding_0_code
diagnosis_0_diagnosisCodeableConcept_coding

In [49]:
# take a look at the dataset
flat_claim_df.head()

Unnamed: 0,billablePeriod_end,billablePeriod_start,contained_0_birthDate,contained_0_extension_0_url,contained_0_extension_0_valueCode,contained_0_gender,contained_0_id,contained_0_identifier_0_system,contained_0_identifier_0_type_coding_0_code,contained_0_identifier_0_type_coding_0_display,...,item_1_detail_0_sequence,diagnosis_4_type_1_coding_0_code,diagnosis_4_type_1_coding_0_display,diagnosis_4_type_1_coding_0_system,patient_medicare_number,patient_number,patient_first_name,patient_last_name,Unique Claim ID,hcpcs_code
0,2012-09-16,2012-09-16,1944-05-25,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,,,,1S00E00JK17,#patient,Patrina117,Wiza601,100125087,99221
1,2013-06-11,2013-06-11,1944-05-25,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,,,,1S00E00JK17,#patient,Patrina117,Wiza601,100125090,99221
2,2014-04-02,2014-04-01,1944-05-25,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,,,,1S00E00JK17,#patient,Patrina117,Wiza601,100125092,99221
3,2014-11-18,2014-11-17,1944-05-25,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,,,,1S00E00JK17,#patient,Patrina117,Wiza601,100125096,99221
4,2016-04-04,2016-04-04,1944-05-25,http://hl7.org/fhir/us/core/StructureDefinitio...,248152002,female,patient,http://hl7.org/fhir/sid/us-mbi,MC,Patient's Medicare Number,...,,,,,1S00E00JK17,#patient,Patrina117,Wiza601,100125098,99221


In [50]:
# save to pickle
flat_claim_df.to_pickle("../data/clean/claim.pkl")
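
A quick round trip can confirm that the pickle preserves the data. The sketch below writes a made-up two-column frame to a temporary directory instead of `../data/clean`, so it can be run anywhere.

```python
import os
import tempfile
import pandas as pd

# Made-up stand-in for flat_claim_df
df = pd.DataFrame({"Unique Claim ID": ["100125087"], "hcpcs_code": ["99221"]})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "claim.pkl")
    df.to_pickle(pkl_path)
    # read_pickle restores the DataFrame with the same columns and values
    restored = pd.read_pickle(pkl_path)

print(restored.equals(df))
```

Note that `to_pickle` does not create missing folders; if `../data/clean` might not exist, calling `os.makedirs("../data/clean", exist_ok=True)` first avoids an error.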