# Cleaning Claim Response Data
# 01_claim_response_cleaning

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 18/09/2025   | Adrienne | Created   | Created to flatten data | 
| 13/10/2025   | Martin | Updated   | Code cleanup. Added docstrings to functions | 

# Content

* [Preprocess JSON](#preprocess-json)
  * [Function to Flatten a Column](#function-to-flatten-a-column)
  * [Specific Column Processing](#specific-column-processing)

# Preprocess JSON

### Claim Responses

__Columns__

- `contained` - birthDate, extension, gender, id, identifier, name, resourceType, id, identifier, resourceType
- `created`
- `extension` - url, valueCoding, url, valueDate, url, valueDate
- `id`
- `outcome`
- `resourceType`
- `status`
- `use`
- `identifier_system`
- `identifier_type`
- `identifier_value`
- `insurer_identifier`
- `meta_lastUpdated`
- `patient_reference`
- `request_reference`
- `type_coding`
- `contained_identifier_patient_medicare_number`
- `contained_name_family`
- `contained_name_given`

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from io import StringIO
import os
import json
from collections import OrderedDict
import pickle

In [36]:
path = "../data/raw"
claim_response = pd.read_json(f"{path}/ClaimResponse.ndjson", lines=True)
#claim_response = pd.read_json(f"{path}/ClaimResponse.ndjson", lines=True, nrows=10)

The claim response file is deeply nested: several columns hold dictionaries, lists of dictionaries, or lists nested inside those dictionaries. The code below prints the first row so the structure is easy to see.

In [16]:
for column, series in claim_response.head(1).items():
    print(f"key: {column}")
    for item in series:
        if isinstance(item, dict):
            # Nested object: print each of its key-value pairs
            for key, value in item.items():
                print(f"\tkey: ({key}) value ({value})\n")
        elif isinstance(item, list):
            # List of objects (or scalars): print every element
            for element in item:
                if isinstance(element, dict):
                    for key, value in element.items():
                        if isinstance(value, list):
                            for entry in value:
                                print(f"\t\tkey: {key} list: ({entry})\n")
                        else:
                            print(f"\t\tkey: ({key}) value ({value})\n")
                else:
                    print(f"{element}\n")
        else:
            # Scalar value
            print(f"\tvalue: {item}\n")


key: contained
		key: (birthDate) value (1944-05-25)

		key: extension list: ({'url': 'http://hl7.org/fhir/us/core/StructureDefinition/us-core-sex', 'valueCode': '248152002'})

		key: (gender) value (female)

		key: (id) value (patient)

		key: identifier list: ({'system': 'http://hl7.org/fhir/sid/us-mbi', 'type': {'coding': [{'code': 'MC', 'display': "Patient's Medicare Number", 'system': 'http://terminology.hl7.org/CodeSystem/v2-0203'}]}, 'value': '1S00E00JK17'})

		key: name list: ({'family': 'Wiza601', 'given': ['Patrina117'], 'text': 'Patrina117 Wiza601 ([max 10 chars of first], [max 15 chars of last])'})

		key: (resourceType) value (Patient)

key: created
	value: 2025-09-03T22:20:31+00:00

key: extension
		key: (url) value (https://bluebutton.cms.gov/resources/variables/fiss/curr-status)

		key: (valueCoding) value ({'code': 'A', 'system': 'https://bluebutton.cms.gov/resources/variables/fiss/curr-status'})

		key: (url) value (https://bluebutton.cms.gov/resources/variables/fiss/
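
The nesting shown in the output above can also be summarised with a small recursive helper. This is only a sketch: `structure_lines` is a hypothetical name that does not appear elsewhere in this notebook, and `sample` is a hand-written fragment modelled on the printed row, not the full record.

```python
def structure_lines(value, name="record", depth=0):
    """Return one indented line per node of a nested dict/list structure."""
    indent = "  " * depth
    if isinstance(value, dict):
        lines = [f"{indent}{name}: dict"]
        for key, child in value.items():
            lines.extend(structure_lines(child, key, depth + 1))
    elif isinstance(value, list):
        lines = [f"{indent}{name}: list of {len(value)}"]
        for i, child in enumerate(value):
            lines.extend(structure_lines(child, f"[{i}]", depth + 1))
    else:
        # Scalar leaf: show the value itself
        lines = [f"{indent}{name}: {value!r}"]
    return lines


# Hand-written fragment shaped like the first row (assumption: not the full record)
sample = {
    "created": "2025-09-03T22:20:31+00:00",
    "contained": [
        {
            "gender": "female",
            "name": [{"family": "Wiza601", "given": ["Patrina117"]}],
        }
    ],
}

lines = structure_lines(sample)
print("\n".join(lines))
```

To try it on the real data, something like `structure_lines(claim_response.iloc[0].to_dict())` should work, assuming the row contains only dictionaries, lists, and scalars.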

### Function to Flatten a Column

In [None]:
def flatten_json(nested_json: dict, prefix: str='') -> OrderedDict:
    """
    Recursively flattens a nested JSON object or dictionary into a single level.

    Notes:
        - Nested dictionaries and lists are flattened such that keys from deeper levels
          in the hierarchy are concatenated with underscores
        - Lists of dictionaries are handled by appending index numbers to the keys.
        - Non-dict lists are serialized using JSON encoding
        - Returns OrderedDict, a flattened version of the input json, where keys represent
          the nested structure and values are the corresponding data

    Args:
        nested_json (dict): JSON object
        prefix (str, optional): Prefix for keys. Defaults to ''.

    Returns:
        OrderedDict: unnested JSON object
    """
    out = OrderedDict()
    for key, value in nested_json.items():
        if isinstance(value, dict):
            # Recursively flatten nested dictionaries
            out.update(flatten_json(value, prefix + key + '_'))
        elif isinstance(value, list):
            if value and isinstance(value[0], dict):
                # Handle list of dictionaries by flattening each item
                for i, item in enumerate(value):
                    out.update(flatten_json(item, prefix + key + '_' + str(i) + '_'))
            else:
                # Empty lists and non-dict lists are serialized into a JSON string
                out[prefix + key] = json.dumps(value)
        else:
            # Base case: key-value pair where value is not a list or dict
            out[prefix + key] = value
    return out

def process_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """
    Processes a pandas DataFrame by flattening any JSON-like data (dictionaries or lists)
    present in its columns and converting it into a new DataFrame.

    Notes:
        - The function iterates through each row and flattens any JSON-like data (dictionaries or lists)
        - Non-nested data is left unchanged
        - The resulting DataFrame will contain a combination of original columns and
          additional columns derived from the flattened structure
        - Returns a new pandas DataFrame with the flattened data

    Args:
        df (pd.DataFrame): Dataframe containing JSON objects in columns

    Returns:
        pd.DataFrame: Flattened dataframe
    """
    flattened_data = []
    for _, row in df.iterrows():
        flattened_row = {}
        for column, value in row.items():
            if isinstance(value, (dict, list)):
                # Flatten any dictionary or list
                flattened = flatten_json({column: value})
                flattened_row.update(flattened)
            else:
                # Keep non-nested columns unchanged
                flattened_row[column] = value
        flattened_data.append(flattened_row)
    return pd.DataFrame(flattened_data)
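
The key-naming rules in the docstrings above are easier to follow on a tiny input. The sketch below restates the same rules in a condensed helper so that it runs on its own. `flatten_sketch` is a hypothetical name used only for this illustration, and `record` is a hand-written fragment, not a row from the file.

```python
import json


def flatten_sketch(nested, prefix=""):
    """Condensed restatement of the flattening rules, for illustration only."""
    out = {}
    for key, value in nested.items():
        if isinstance(value, dict):
            # Nested dict: recurse, extending the prefix with the key
            out.update(flatten_sketch(value, f"{prefix}{key}_"))
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            # List of dicts: recurse per element, adding its index to the prefix
            for i, item in enumerate(value):
                out.update(flatten_sketch(item, f"{prefix}{key}_{i}_"))
        elif isinstance(value, list):
            # Empty or non-dict list: keep as a JSON-encoded string
            out[prefix + key] = json.dumps(value)
        else:
            out[prefix + key] = value
    return out


# Hand-written fragment shaped like the `contained` column (assumption)
record = {
    "contained": [
        {
            "id": "patient",
            "name": [{"family": "Wiza601", "given": ["Patrina117"]}],
        }
    ],
    "status": "active",
}

flat = flatten_sketch(record)
for key, value in flat.items():
    print(f"{key}: {value!r}")
```

Note that `given` stays a JSON-encoded string (`'["Patrina117"]'`) rather than a plain name, which is why the given-name column needs extra cleaning later.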


In [38]:
# Flatten the claim response DataFrame
flat_claim_response_df = process_dataframe(claim_response)

### Specific Column Processing

In [39]:
# Derive convenience columns (patient identifiers, names, claim ID) from the flattened fields
flat_claim_response_df['patient_medicare_number'] = flat_claim_response_df['contained_0_identifier_0_value']
flat_claim_response_df['patient_number'] = flat_claim_response_df['patient_reference']
# The given name is stored as a JSON-encoded list, e.g. '["Patrina117"]'; strip spaces, brackets and quotes
flat_claim_response_df['patient_first_name'] = flat_claim_response_df['contained_0_name_0_given'].str.replace(r'[ \[ \]"]', '', regex=True)
flat_claim_response_df['patient_last_name'] = flat_claim_response_df['contained_0_name_0_family']
flat_claim_response_df['unique_claim_ID'] = flat_claim_response_df['identifier_0_value'].str.replace(r'[-]', '', regex=True)
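
The first-name cleanup above relies on the given name arriving as a JSON-encoded list. The sketch below shows what the same character class does to such strings, next to one possible alternative that decodes the list first. `strip_with_regex` and `join_given_names` are hypothetical helpers, and the two-name value is invented for illustration; only `Patrina117` comes from the sample output.

```python
import json
import re

# Value as flatten_json would store a single given name
encoded_single = json.dumps(["Patrina117"])
# Hypothetical record with two given names (assumption: not taken from the data)
encoded_double = json.dumps(["Ann", "Marie"])


def strip_with_regex(text):
    """Same character class as the cell above: drop spaces, brackets and double quotes."""
    return re.sub(r'[ \[ \]"]', '', text)


def join_given_names(text, sep=" "):
    """Alternative sketch: decode the JSON list, then join its elements."""
    return sep.join(json.loads(text))


print(strip_with_regex(encoded_single))
print(strip_with_regex(encoded_double))
print(join_given_names(encoded_double))
```

For a single given name both approaches agree. With several names the regex leaves the comma separator in place and removes the space, so decoding may be preferable if such records occur. Applied to the DataFrame this could look like `flat_claim_response_df['contained_0_name_0_given'].map(join_given_names)`, assuming the column has no missing values.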

In [40]:
#print(f" list of columns in unnested dataset {flat_claim_response_df.columns}")
for col in flat_claim_response_df.columns:
    print(col)
print(f"\noriginal json file had {len(claim_response.columns)} columns and now the dataset contains {len(flat_claim_response_df.columns)}")

contained_0_birthDate
contained_0_extension_0_url
contained_0_extension_0_valueCode
contained_0_gender
contained_0_id
contained_0_identifier_0_system
contained_0_identifier_0_type_coding_0_code
contained_0_identifier_0_type_coding_0_display
contained_0_identifier_0_type_coding_0_system
contained_0_identifier_0_value
contained_0_name_0_family
contained_0_name_0_given
contained_0_name_0_text
contained_0_resourceType
created
extension_0_url
extension_0_valueCoding_code
extension_0_valueCoding_system
extension_1_url
extension_1_valueDate
extension_2_url
extension_2_valueDate
id
identifier_0_system
identifier_0_type_coding_0_code
identifier_0_type_coding_0_display
identifier_0_type_coding_0_system
identifier_0_value
insurer_identifier_value
meta_lastUpdated
outcome
patient_reference
request_reference
resourceType
status
type_coding_0_code
type_coding_0_display
type_coding_0_system
use
patient_medicare_number
patient_number
patient_first_name
patient_last_name
unique_claim_ID

original json 

In [None]:
flat_claim_response_df.head()

|  | contained_0_birthDate | contained_0_extension_0_url | contained_0_extension_0_valueCode | contained_0_gender | contained_0_id | contained_0_identifier_0_system | contained_0_identifier_0_type_coding_0_code | contained_0_identifier_0_type_coding_0_display | contained_0_identifier_0_type_coding_0_system | contained_0_identifier_0_value | ... | status | type_coding_0_code | type_coding_0_display | type_coding_0_system | use | patient_medicare_number | patient_number | patient_first_name | patient_last_name | unique_claim_ID |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | 1944-05-25 | http://hl7.org/fhir/us/core/StructureDefinitio... | 248152002 | female | patient | http://hl7.org/fhir/sid/us-mbi | MC | Patient's Medicare Number | http://terminology.hl7.org/CodeSystem/v2-0203 | 1S00E00JK17 | ... | active | institutional | Institutional | http://terminology.hl7.org/CodeSystem/claim-type | claim | 1S00E00JK17 | #patient | Patrina117 | Wiza601 | 100125087 |
| 1 | 1944-05-25 | http://hl7.org/fhir/us/core/StructureDefinitio... | 248152002 | female | patient | http://hl7.org/fhir/sid/us-mbi | MC | Patient's Medicare Number | http://terminology.hl7.org/CodeSystem/v2-0203 | 1S00E00JK17 | ... | active | institutional | Institutional | http://terminology.hl7.org/CodeSystem/claim-type | claim | 1S00E00JK17 | #patient | Patrina117 | Wiza601 | 100125090 |
| 2 | 1944-05-25 | http://hl7.org/fhir/us/core/StructureDefinitio... | 248152002 | female | patient | http://hl7.org/fhir/sid/us-mbi | MC | Patient's Medicare Number | http://terminology.hl7.org/CodeSystem/v2-0203 | 1S00E00JK17 | ... | active | institutional | Institutional | http://terminology.hl7.org/CodeSystem/claim-type | claim | 1S00E00JK17 | #patient | Patrina117 | Wiza601 | 100125092 |
| 3 | 1944-05-25 | http://hl7.org/fhir/us/core/StructureDefinitio... | 248152002 | female | patient | http://hl7.org/fhir/sid/us-mbi | MC | Patient's Medicare Number | http://terminology.hl7.org/CodeSystem/v2-0203 | 1S00E00JK17 | ... | active | institutional | Institutional | http://terminology.hl7.org/CodeSystem/claim-type | claim | 1S00E00JK17 | #patient | Patrina117 | Wiza601 | 100125096 |
| 4 | 1944-05-25 | http://hl7.org/fhir/us/core/StructureDefinitio... | 248152002 | female | patient | http://hl7.org/fhir/sid/us-mbi | MC | Patient's Medicare Number | http://terminology.hl7.org/CodeSystem/v2-0203 | 1S00E00JK17 | ... | active | institutional | Institutional | http://terminology.hl7.org/CodeSystem/claim-type | claim | 1S00E00JK17 | #patient | Patrina117 | Wiza601 | 100125098 |


In [None]:
# save to pickle
path = "../data/clean"
flat_claim_response_df.to_pickle(f"{path}/claim_response.pkl")