# Cleaning Claim Data
# 01_claim_data_cleaning

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 18/09/2025   | Adrienne | Created   | Created to flatten data | 
|    | |   | |

# Content

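* [Preprocess JSON](#preprocess-json)
  * [Claims](#claims)
    * [Function to Flatten a Column](#function-to-flatten-a-column)
    * [Specific Column Processing](#specific-column-processing)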

# Preprocess JSON

## Claims

__Columns__

- `contained` - patient resource: birthDate, extension, gender, id, identifier, name, resourceType; provider-org resource: id, identifier, resourceType
- `created`
- `diagnosis` - [diagnosisCodeableConcept, sequence, type] x 9
- `extension`
- `id`
- `resourceType`
- `status`
- `supportingInfo`
- `type`
- `use`
- `billablePeriod_end`
- `billablePeriod_start`
- `facility_extension`
- `identifier_system`
- `identifier_type`
- `identifier_value`
- `insurance_coverage`
- `insurance_focal`
- `insurance_sequence`
- `item_extension`
- `item_productOrService`
- `item_revenue`
- `item_sequence`
- `item_servicedDate`
- `meta_lastUpdated`
- `patient_reference`
- `priority_coding`
- `provider_reference`
- `total_currency`
- `total_value`
- `contained_identifer_patient_medicare_number`
- `contained_name_family`
- `contained_name_given`

In [424]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from datetime import datetime
import json_lines
import json
import copy
import math

In [490]:
# Read in the newline-delimited JSON file
path = "../data/raw"
# Full load (currently disabled):
# claim = pd.read_json(f"{path}/Claim.ndjson", lines=True)
# Load only the first 100 rows
claim = pd.read_json(f"{path}/Claim.ndjson", lines=True, nrows=100)

The claims file is deeply nested. The code below prints one row of data so that the structure is easy to see.

In [491]:
claim_text = claim.head(1)
for col_name, col_values in claim_text.items():
    print(f"key*: {col_name}")
    for item in col_values:
        if isinstance(item, dict):
            # iterate over the dictionary itself, not the enclosing Series
            for key, value in item.items():
                print(f"\tkey:: ({key}) value: ({value})\n")
        elif isinstance(item, list):
            for i in item:
                if isinstance(i, dict):
                    for key, value in i.items():
                        if isinstance(value, list):
                            for y in value:
                                print(f"\t\tkey: {key} list: ({y})\n")
                        else:
                            print(f"\t\tkey: ({key}) value: ({value})\n")
                else:
                    print(f"{i}\n")
        else:
            print(f"\tvalue$: {item}\n")


key*: billablePeriod
	key:: (end) value: (2012-09-16)

	key:: (start) value: (2012-09-16)

key*: contained
		key: (birthDate) value: (1944-05-25)

		key: extension list: ({'url': 'http://hl7.org/fhir/us/core/StructureDefinition/us-core-sex', 'valueCode': '248152002'})

		key: (gender) value: (female)

		key: (id) value: (patient)

		key: identifier list: ({'system': 'http://hl7.org/fhir/sid/us-mbi', 'type': {'coding': [{'code': 'MC', 'display': "Patient's Medicare Number", 'system': 'http://terminology.hl7.org/CodeSystem/v2-0203'}]}, 'value': '1S00E00JK17'})

		key: name list: ({'family': 'Wiza601', 'given': ['Patrina117'], 'text': 'Patrina117 Wiza601 ([max 10 chars of first], [max 15 chars of last])'})

		key: (resourceType) value: (Patient)

		key: (id) value: (provider-org)

		key: identifier list: ({'system': 'https://bluebutton.cms.gov/resources/variables/fiss/meda-prov-6', 'type': {'coding': [{'code': 'PRN', 'display': 'Provider number', 'system': 'http://terminology.hl7.org/CodeSyst
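The fixed-depth loops above stop after two or three levels of nesting. A recursive walk is one way to list every leaf of a record regardless of depth. The sketch below is illustrative only: `walk` is a hypothetical helper name, and the sample record is a shortened version of the output shown above.

```python
def walk(obj, path=""):
    """Yield (path, value) pairs for every leaf in a nested dict/list structure."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from walk(value, f"{path}.{key}" if path else str(key))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from walk(value, f"{path}[{i}]")
    else:
        yield path, obj


# Shortened sample record, based on the output shown above
record = {
    "billablePeriod": {"end": "2012-09-16", "start": "2012-09-16"},
    "contained": [
        {"birthDate": "1944-05-25", "gender": "female", "id": "patient",
         "name": [{"family": "Wiza601", "given": ["Patrina117"]}]},
        {"id": "provider-org"},
    ],
}

leaves = list(walk(record))
for leaf_path, leaf_value in leaves:
    print(f"{leaf_path}: {leaf_value}")
```

Each printed path (for example `contained[0].name[0].family`) shows exactly which indexes and keys are needed to reach a value, which helps when writing the column-specific extraction code later on.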

### Function to Flatten a Column

In [482]:
def flatten_dict_col(df, col):
    """Expand a column of dictionaries into one column per key, prefixed with the column name."""
    # if the column contains a dictionary, it is easy to create columns from the key-value pairs
    if df[col].apply(type).eq(dict).any():
        print(f"dict {col}")
        temp = df[col].apply(pd.Series)
        temp = temp.add_prefix(str(col) + '_')
        print(temp.columns)
        # the original column is dropped
        df = pd.concat([df, temp], axis=1).drop(col, axis=1)

    # # if the column is a list, then the list needs to be broken into columns
    # elif df[col].apply(type).eq(list).any():
    #     print(f"list {col}")
    #     # if all items extracted from the list are dictionaries, then each dictionary is unnested into columns
    #     # otherwise the column is left as-is.  Original column will not be deleted
    #     for i in range(0, np.int64(df[col].str.len().unique()[0]).item()):
    #         # to make the code more readable, creating a next col variable for the item in the list
    #         next_col = df[col].apply(lambda x: x[i])
    #
    #         if next_col.apply(type).eq(dict).all():
    #             temp = next_col.apply(pd.Series)
    #             temp = temp.add_prefix(str(col) + '_')
    #             print(temp.columns)
    #
    #             df = pd.concat([df, temp], axis=1)
    #
    #         else:
    #             print('list contains mixed elements.  leaving alone for now')
    return df
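For comparison, pandas also provides `pd.json_normalize`, which expands dictionaries into columns in one call. The sketch below is not the method used in this notebook. It runs on a small hypothetical DataFrame and assumes every row of the column holds a dict (missing values would need to be handled first).

```python
import pandas as pd

# Hypothetical two-row sample with the same shape as a dict-valued claim column
df = pd.DataFrame({
    "id": ["a", "b"],
    "total": [{"currency": "USD", "value": 119.62},
              {"currency": "USD", "value": 80.0}],
})

# Expand the dict column; the result has a fresh 0..n-1 index,
# so align it with the source index before concatenating
expanded = pd.json_normalize(df["total"].tolist())
expanded.index = df.index
expanded = expanded.add_prefix("total_")

flat = pd.concat([df.drop(columns="total"), expanded], axis=1)
print(flat.columns.tolist())
```

Unlike `apply(pd.Series)`, `json_normalize` also expands dictionaries nested inside dictionaries, joining the key names with a separator (a dot by default).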
            

The code below unnests two levels using the function above; column-specific code then unnests the remaining columns. Some columns are not fully unnested because they do not contain data we intend to use, so we skip over those columns/fields.

In [483]:
claim = flatten_dict_col(claim, 'billablePeriod')
claim = flatten_dict_col(claim, 'contained')
#claim = flatten_col(claim, 'diagnosis')
claim = flatten_dict_col(claim, 'facility')
claim = flatten_dict_col(claim, 'identifier')
#claim = flatten_col(claim, 'insurance')
#claim = flatten_col(claim, 'item')
claim = flatten_dict_col(claim, 'meta')
claim = flatten_dict_col(claim, 'patient')
claim = flatten_dict_col(claim, 'priority')
claim = flatten_dict_col(claim, 'provider')
# supportingInfo has lists that are not the same length so it will need to be unnested separately
claim = flatten_dict_col(claim, 'total')

dict billablePeriod
Index(['billablePeriod_end', 'billablePeriod_start'], dtype='object')
dict facility
Index(['facility_extension', 'facility_0'], dtype='object')
dict meta
Index(['meta_lastUpdated'], dtype='object')
dict patient
Index(['patient_reference'], dtype='object')
dict priority
Index(['priority_coding'], dtype='object')
dict provider
Index(['provider_reference'], dtype='object')
dict total
Index(['total_currency', 'total_value'], dtype='object')
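
Because `supportingInfo` holds lists of different lengths, one option is to unnest it into a long table (one row per list element) rather than into numbered columns. The following is a minimal sketch on hypothetical data using `DataFrame.explode`; the `sequence` and `code` keys are placeholders, not the actual `supportingInfo` fields.

```python
import pandas as pd

# Hypothetical sample: the two claims carry lists of different lengths
df = pd.DataFrame({
    "id": ["claim-1", "claim-2"],
    "supportingInfo": [
        [{"sequence": 1, "code": "A"}, {"sequence": 2, "code": "B"}],
        [{"sequence": 1, "code": "C"}],
    ],
})

# One row per list element; the claim id is repeated for each element
long_df = df.explode("supportingInfo", ignore_index=True)

# Expand each element's dictionary into prefixed columns
info = pd.json_normalize(long_df["supportingInfo"].tolist()).add_prefix("supportingInfo_")
long_df = pd.concat([long_df.drop(columns="supportingInfo"), info], axis=1)
print(long_df)
```

The long layout avoids the mostly empty columns that appear when short lists are padded out to the length of the longest one.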


### Specific Column Processing

In [473]:
claim_cp = copy.deepcopy(claim)

In [487]:
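def flatten_list_col(df, col, prefix):
    """Expand a column of lists of dictionaries into prefixed columns."""
    # one column per list position: <prefix>0, <prefix>1, ...
    temp = df[col].apply(pd.Series)
    temp = temp.add_prefix(prefix)
    print(temp.columns)
    df = pd.concat([df, temp], axis=1)

    # expanding sequence dictionaries
    for i in range(0, len(temp.columns)):
        temp_col = df[prefix + str(i)].apply(pd.Series)
        temp_col = temp_col.add_prefix(prefix + str(i) + '_')
        df = pd.concat([df, temp_col], axis=1)

    # return the widened frame so callers can reassign it
    return df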

In [492]:
claim.head()

Unnamed: 0,billablePeriod,contained,created,diagnosis,extension,facility,id,identifier,insurance,item,...,patient,priority,provider,resourceType,status,supportingInfo,total,type,use,procedure
0,"{'end': '2012-09-16', 'start': '2012-09-16'}","[{'birthDate': '1944-05-25', 'extension': [{'u...",2025-09-03T22:14:42+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,{'extension': [{'url': 'https://bluebutton.cms...,f-LTEwMDAwMDAzNTUxNzU5,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'extension': [{'url': 'https://bluebutton.cm...,...,{'reference': '#patient'},"{'coding': [{'code': 'normal', 'display': 'Nor...",{'reference': '#provider-org'},Claim,active,[{'category': {'coding': [{'code': 'typeofbill...,"{'currency': 'USD', 'value': 119.62}","{'coding': [{'code': 'institutional', 'display...",claim,
1,"{'end': '2013-06-11', 'start': '2013-06-11'}","[{'birthDate': '1944-05-25', 'extension': [{'u...",2025-09-03T22:14:42+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,{'extension': [{'url': 'https://bluebutton.cms...,f-LTEwMDAwMDAzNTUxNzY0,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'extension': [{'url': 'https://bluebutton.cm...,...,{'reference': '#patient'},"{'coding': [{'code': 'normal', 'display': 'Nor...",{'reference': '#provider-org'},Claim,active,[{'category': {'coding': [{'code': 'typeofbill...,"{'currency': 'USD', 'value': 119.62}","{'coding': [{'code': 'institutional', 'display...",claim,
2,"{'end': '2014-04-02', 'start': '2014-04-01'}","[{'birthDate': '1944-05-25', 'extension': [{'u...",2025-09-03T22:14:42+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,{'extension': [{'url': 'https://bluebutton.cms...,f-LTEwMDAwMDAzNTUxNzY4,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'extension': [{'url': 'https://bluebutton.cm...,...,{'reference': '#patient'},"{'coding': [{'code': 'normal', 'display': 'Nor...",{'reference': '#provider-org'},Claim,active,[{'category': {'coding': [{'code': 'typeofbill...,"{'currency': 'USD', 'value': 119.62}","{'coding': [{'code': 'institutional', 'display...",claim,
3,"{'end': '2014-11-18', 'start': '2014-11-17'}","[{'birthDate': '1944-05-25', 'extension': [{'u...",2025-09-03T22:14:42+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,{'extension': [{'url': 'https://bluebutton.cms...,f-LTEwMDAwMDAzNTUxNzc2,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'extension': [{'url': 'https://bluebutton.cm...,...,{'reference': '#patient'},"{'coding': [{'code': 'normal', 'display': 'Nor...",{'reference': '#provider-org'},Claim,active,[{'category': {'coding': [{'code': 'typeofbill...,"{'currency': 'USD', 'value': 119.62}","{'coding': [{'code': 'institutional', 'display...",claim,
4,"{'end': '2016-04-04', 'start': '2016-04-04'}","[{'birthDate': '1944-05-25', 'extension': [{'u...",2025-09-03T22:14:42+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,{'extension': [{'url': 'https://bluebutton.cms...,f-LTEwMDAwMDAzNTUxNzc4,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'extension': [{'url': 'https://bluebutton.cm...,...,{'reference': '#patient'},"{'coding': [{'code': 'normal', 'display': 'Nor...",{'reference': '#provider-org'},Claim,active,[{'category': {'coding': [{'code': 'typeofbill...,"{'currency': 'USD', 'value': 119.62}","{'coding': [{'code': 'institutional', 'display...",claim,


In [488]:
claim = flatten_list_col(claim, 'diagnosis', 'sequence_')
claim_test = flatten_list_col(claim, 'item', 'item_')


Index(['sequence_0', 'sequence_1', 'sequence_2', 'sequence_3', 'sequence_4',
       'sequence_5', 'sequence_6', 'sequence_7', 'sequence_8', 'sequence_9',
       'sequence_10', 'sequence_11'],
      dtype='object')


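TypeError: 'NoneType' object is not subscriptable

This error is raised when `flatten_list_col` has no `return` statement: the first call then assigns `None` to `claim`, and the second call fails on `df[col]`.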

In [None]:
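# Preprocess other columns
# 'contained_identifier' appears twice (patient and provider-org), so selecting it yields
# a DataFrame; .iloc[:, 0] picks the patient's identifier list
claim['contained_identifer_patient_medicare_number'] = pd.DataFrame(claim['contained_identifier']).iloc[:,0].apply(lambda x: x[0]['value'])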
# text version of name needs some extra processing 
claim['contained_name_family'] = claim['contained_name'].apply(lambda x: x[0]['family'])
claim['contained_name_given'] = claim['contained_name'].apply(lambda x: x[0]['given'][0])
claim['item_0_productOrService_coding_hcpcs'] =  claim['item_0_productOrService'].apply(lambda x: x['coding'][0]['code'] if isinstance(x, dict) else 'no code')
claim['item_1_productOrService_coding_hcpcs'] =  claim['item_1_productOrService'].apply(lambda x: x['coding'][0]['code'] if isinstance(x, dict) else 'no code')

# TODO item_revenue supportingInfo
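
The lambdas above index directly into nested lists and dictionaries, which raises an exception if an element is missing or `None`. A small helper that follows a path and falls back to a default is one way to guard against that. In this sketch `dig` is a hypothetical name and the sample values are illustrative.

```python
import pandas as pd


def dig(obj, path, default=None):
    """Follow a sequence of keys/indexes through nested dicts and lists."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


# Illustrative values shaped like the 'contained_name' column
names = pd.Series([
    [{"family": "Wiza601", "given": ["Patrina117"]}],
    None,                  # missing value
    [{"family": "Doe"}],   # no given name
])

family = names.apply(lambda x: dig(x, [0, "family"]))
given = names.apply(lambda x: dig(x, [0, "given", 0], default="unknown"))
print(family.tolist())
print(given.tolist())
```

The same helper could replace the `isinstance(x, dict)` checks used for the `item_*_productOrService` columns, for example with the path `["coding", 0, "code"]` and `default="no code"`.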

In [368]:
print(f" list of columns in unnested dataset {list(claim.columns)}")
print(f" original json file had 20 columns and now the dataset contains {len(list(claim.columns))}")

 list of columns in unnested dataset ['contained', 'created', 'diagnosis', 'extension', 'id', 'identifier', 'insurance', 'item', 'resourceType', 'status', 'supportingInfo', 'type', 'use', 'procedure', 'billablePeriod_end', 'billablePeriod_start', 'contained_birthDate', 'contained_extension', 'contained_gender', 'contained_id', 'contained_identifier', 'contained_name', 'contained_resourceType', 'contained_id', 'contained_identifier', 'contained_resourceType', 'contained_extension', 'facility_extension', 'facility_0', 'identifier_system', 'identifier_type', 'identifier_value', 'meta_lastUpdated', 'patient_reference', 'priority_coding', 'provider_reference', 'total_currency', 'total_value', 'sequence_0', 'sequence_1', 'sequence_2', 'sequence_3', 'sequence_4', 'sequence_5', 'sequence_6', 'sequence_7', 'sequence_8', 'sequence_9', 'sequence_10', 'sequence_11', 'sequence_0_diagnosisCodeableConcept', 'sequence_0_sequence', 'sequence_0_type', 'sequence_1_diagnosisCodeableConcept', 'sequence_1_s

In [369]:
# take a look at the dataset
claim.head()

Unnamed: 0,contained,created,diagnosis,extension,id,identifier,insurance,item,resourceType,status,...,sequence_10_diagnosisCodeableConcept,sequence_10_sequence,sequence_10_type,sequence_11_0,sequence_11_diagnosisCodeableConcept,sequence_11_sequence,sequence_11_type,contained_identifer_patient_medicare_number,contained_name_family,contained_name_given
0,"[{'birthDate': '1944-05-25', 'extension': [{'u...",2025-09-03T22:14:42+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,f-LTEwMDAwMDAzNTUxNzU5,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'extension': [{'url': 'https://bluebutton.cm...,Claim,active,...,,,,,,,,1S00E00JK17,Wiza601,Patrina117
1,"[{'birthDate': '1944-05-25', 'extension': [{'u...",2025-09-03T22:14:42+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,f-LTEwMDAwMDAzNTUxNzY0,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'extension': [{'url': 'https://bluebutton.cm...,Claim,active,...,,,,,,,,1S00E00JK17,Wiza601,Patrina117
2,"[{'birthDate': '1944-05-25', 'extension': [{'u...",2025-09-03T22:14:42+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,f-LTEwMDAwMDAzNTUxNzY4,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'extension': [{'url': 'https://bluebutton.cm...,Claim,active,...,,,,,,,,1S00E00JK17,Wiza601,Patrina117
3,"[{'birthDate': '1944-05-25', 'extension': [{'u...",2025-09-03T22:14:42+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,f-LTEwMDAwMDAzNTUxNzc2,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'extension': [{'url': 'https://bluebutton.cm...,Claim,active,...,,,,,,,,1S00E00JK17,Wiza601,Patrina117
4,"[{'birthDate': '1944-05-25', 'extension': [{'u...",2025-09-03T22:14:42+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,f-LTEwMDAwMDAzNTUxNzc4,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'extension': [{'url': 'https://bluebutton.cm...,Claim,active,...,,,,,,,,1S00E00JK17,Wiza601,Patrina117


In [923]:
# save to pickle
claim.to_pickle("../data/clean/claim.pkl")
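
Pickle preserves the nested Python objects (lists and dicts) that remain in some columns, whereas a CSV export would turn them into strings. A quick round-trip check is sketched here with a temporary directory and a hypothetical sample frame rather than the real output path:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample with one column that is still nested
sample = pd.DataFrame({
    "id": ["claim-1"],
    "item": [[{"sequence": 1}, {"sequence": 2}]],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "claim.pkl"
    sample.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

# Nested values come back as real lists, not strings
print(type(restored.loc[0, "item"]))
print(restored.loc[0, "item"] == sample.loc[0, "item"])
```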

This was the original code that was turned into a function. Keeping it for now.

In [None]:
# claim_cp = copy.deepcopy(claim)
# for col in claim_cp.columns:
#     # if the column contains a dictionary, that is easy to create columns from the key-value pairs
#     if claim_cp[col].apply(type).eq(dict).any():
#         print(f"dict {col}")
#         # original column is dropped
#         claim_cp = pd.concat([claim_cp, claim_cp[col].apply(pd.Series)], axis=1).drop(col, axis=1)
        
#     # if the column is a list, then the list needs to be broken into columns
#     elif claim_cp[col].apply(type).eq(list).any():
#         print(f"list {col}")
#         # the supportingInfo column is not all the same length, so this code won't process that column
#         if col != 'supportingInfo':
            
            
#             # if all items extracted from the list are dictionaries, then each dictionary is unnested into columns
#             # other wise the column is left as-is.  Original column will not be deleted
#             for i in range(0, claim_cp[col].str.len().unique()[0]):
#                 # to make the code more readable, creating a next col variable for the item in the list
#                 next_col = claim_cp[col].apply(lambda x:x[i])
                
                
#                 if next_col.apply(type).eq(dict).all():
#                     claim_cp = pd.concat([claim_cp, next_col.apply(pd.Series)], axis=1)
                                       
#                 else:
#                     print('list contains mixed elements.  leaving alone for now')    
           

dict billablePeriod
list contained
list diagnosis
dict facility
list insurance
list item
dict meta
dict patient
dict priority
dict provider
list supportingInfo
dict total
