# Cleaning Claim Data
# 01_claim_data_cleaning

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 18/09/2025   | Adrienne | Created   | Created to flatten data | 
|    | |   | |

# Content

* [Introduction](#introduction)

# Preprocess JSON

In [905]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from datetime import datetime
import json_lines
import json
import copy

## Claims

__Columns__

- `contained` - two resources: the patient (birthDate, extension, gender, id, identifier, name, resourceType) and the provider organization (id, identifier, resourceType)
- `created`
- `diagnosis` - [diagnosisCodeableConcept, sequence, type] x 9
- `extension`
- `id`
- `resourceType`
- `status`
- `supportingInfo`
- `type`
- `use`
- `billablePeriod_end`
- `billablePeriod_start`
- `facility_extension`
- `identifier_system`
- `identifier_type`
- `identifier_value`
- `insurance_coverage`
- `insurance_focal`
- `insurance_sequence`
- `item_extension`
- `item_productOrService`
- `item_revenue`
- `item_sequence`
- `item_servicedDate`
- `meta_lastUpdated`
- `patient_reference`
- `priority_coding`
- `provider_reference`
- `total_currency`
- `total_value`
- `contained_identifer_patient_medicare_number`
- `contained_name_family`
- `contained_name_given`

In [918]:
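# read in the newline-delimited JSON file
path = "../data/raw"
# full load (disabled while developing):
# claim = pd.read_json(f"{path}/Claim.ndjson", lines=True)
# development sample: first 10 rows only
claim = pd.read_json(f"{path}/Claim.ndjson", lines=True, nrows=10)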

The claims file is deeply nested: columns hold dictionaries, lists of dictionaries, and lists inside those dictionaries. The code below prints one row of data so the structure is easy to see.

In [907]:
claim_text = claim.head(1)
for key, value in claim_text.items():
    print(f"key*: {key}")
    for item in value:
        if isinstance(item, dict):
            # iterate over the dict held in this cell, not over the column itself
            for sub_key, sub_value in item.items():
                print(f"\tkey:: ({sub_key}) value: ({sub_value})\n")
        elif isinstance(item, list):
            for i in item:
                if isinstance(i, dict):
                    for sub_key, sub_value in i.items():
                        if isinstance(sub_value, list):
                            for y in sub_value:
                                print(f"\t\tkey: {sub_key} list: ({y})\n")
                        else:
                            print(f"\t\tkey: ({sub_key}) value: ({sub_value})\n")
                else:
                    print(f"{i}\n")
        else:
            print(f"\tvalue$: {item}\n")


key*: billablePeriod
	key:: (end) value: (2012-09-16)

	key:: (start) value: (2012-09-16)

key*: contained
		key: (birthDate) value: (1944-05-25)

		key: extension list: ({'url': 'http://hl7.org/fhir/us/core/StructureDefinition/us-core-sex', 'valueCode': '248152002'})

		key: (gender) value: (female)

		key: (id) value: (patient)

		key: identifier list: ({'system': 'http://hl7.org/fhir/sid/us-mbi', 'type': {'coding': [{'code': 'MC', 'display': "Patient's Medicare Number", 'system': 'http://terminology.hl7.org/CodeSystem/v2-0203'}]}, 'value': '1S00E00JK17'})

		key: name list: ({'family': 'Wiza601', 'given': ['Patrina117'], 'text': 'Patrina117 Wiza601 ([max 10 chars of first], [max 15 chars of last])'})

		key: (resourceType) value: (Patient)

		key: (id) value: (provider-org)

		key: identifier list: ({'system': 'https://bluebutton.cms.gov/resources/variables/fiss/meda-prov-6', 'type': {'coding': [{'code': 'PRN', 'display': 'Provider number', 'system': 'http://terminology.hl7.org/CodeSyst
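
A more compact way to inspect this kind of nesting is to list the path to every leaf value. The sketch below is one possible approach, not part of the original notebook: `iter_paths` is a hypothetical helper and `sample` is a hand-written record shaped like a small part of the claim printed above.

```python
def iter_paths(obj, prefix=""):
    """Yield (path, type_name) pairs for every leaf in a nested dict/list structure."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from iter_paths(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from iter_paths(value, f"{prefix}[{index}]")
    else:
        yield prefix, type(obj).__name__


# Hypothetical record, shaped like a small part of the claim printed above.
sample = {
    "billablePeriod": {"end": "2012-09-16", "start": "2012-09-16"},
    "contained": [
        {"id": "patient", "name": [{"family": "Wiza601", "given": ["Patrina117"]}]},
        {"id": "provider-org"},
    ],
}

paths = list(iter_paths(sample))
for path, kind in paths:
    print(f"{path}: {kind}")
```

Empty dicts and lists yield nothing here, so a path listing like this shows populated fields only.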

### Function to Flatten a Column

In [908]:
def flatten_col(df, col):
    """Unnest `col` into new columns prefixed with the column name."""
    # if the column contains dictionaries, create one column per key
    if df[col].apply(type).eq(dict).any():
        print(f"dict {col}")
        temp = df[col].apply(pd.Series)
        temp = temp.add_prefix(str(col) + '_')
        print(temp.columns)
        # the original column is dropped
        df = pd.concat([df, temp], axis=1).drop(col, axis=1)

    # if the column contains lists, each list position is unnested in turn
    elif df[col].apply(type).eq(list).any():
        print(f"list {col}")
        # the number of positions is taken from the first list length found,
        # so every list in the column is assumed to have that length
        for i in range(0, df[col].str.len().unique()[0]):
            # to make the code more readable, keep the i-th item of each list in its own variable
            next_col = df[col].apply(lambda x: x[i])

            # if every i-th item is a dictionary, unnest it into columns; otherwise leave it as-is.
            # The original column is not dropped, and every position gets the same prefix,
            # so repeated keys produce duplicate column names.
            if next_col.apply(type).eq(dict).all():
                temp = next_col.apply(pd.Series)
                temp = temp.add_prefix(str(col) + '_')
                print(temp.columns)
                df = pd.concat([df, temp], axis=1)
            else:
                print('list contains mixed elements.  leaving alone for now')
    return df
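
To make the dictionary branch concrete, here is a minimal sketch on a toy frame. The frame and its values are hypothetical. Note that the sketch builds the new columns with `pd.DataFrame(series.tolist())` rather than `apply(pd.Series)`; that swap assumes every row holds a dict with no missing values.

```python
import pandas as pd

# Hypothetical toy frame with one dict-valued column, mimicking 'billablePeriod'.
toy = pd.DataFrame(
    {
        "id": ["a", "b"],
        "billablePeriod": [
            {"end": "2012-09-16", "start": "2012-09-16"},
            {"end": "2012-10-01", "start": "2012-09-30"},
        ],
    }
)

# Expand each dict into its own columns, then prefix them with the source column name.
expanded = pd.DataFrame(toy["billablePeriod"].tolist(), index=toy.index)
expanded = expanded.add_prefix("billablePeriod_")

# Attach the new columns and drop the nested original.
flat = pd.concat([toy.drop(columns="billablePeriod"), expanded], axis=1)
print(list(flat.columns))
```

The result has the same shape as the function's dictionary branch: one prefixed column per key, with the nested column removed.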
            

The code below unnests two levels using the function above; column-specific code then unnests the remaining fields we need. Some columns are not fully unnested because they do not contain data we plan to use, so we skip them.

In [919]:
claim = flatten_col(claim, 'billablePeriod')
claim = flatten_col(claim, 'contained')
claim = flatten_col(claim, 'diagnosis')
claim = flatten_col(claim, 'facility')
claim = flatten_col(claim, 'identifier')
claim = flatten_col(claim, 'insurance')
claim = flatten_col(claim, 'item')
claim = flatten_col(claim, 'meta')
claim = flatten_col(claim, 'patient')
claim = flatten_col(claim, 'priority')
claim = flatten_col(claim, 'provider')
# supportingInfo has lists that are not the same length so it will need to be unnested separately
claim = flatten_col(claim, 'total')

dict billablePeriod
Index(['billablePeriod_end', 'billablePeriod_start'], dtype='object')
list contained
Index(['contained_birthDate', 'contained_extension', 'contained_gender',
       'contained_id', 'contained_identifier', 'contained_name',
       'contained_resourceType'],
      dtype='object')
Index(['contained_id', 'contained_identifier', 'contained_resourceType'], dtype='object')
list diagnosis
Index(['diagnosis_diagnosisCodeableConcept', 'diagnosis_sequence',
       'diagnosis_type'],
      dtype='object')
Index(['diagnosis_diagnosisCodeableConcept', 'diagnosis_sequence'], dtype='object')
Index(['diagnosis_diagnosisCodeableConcept', 'diagnosis_sequence'], dtype='object')
Index(['diagnosis_diagnosisCodeableConcept', 'diagnosis_sequence'], dtype='object')
Index(['diagnosis_diagnosisCodeableConcept', 'diagnosis_sequence'], dtype='object')
Index(['diagnosis_diagnosisCodeableConcept', 'diagnosis_sequence'], dtype='object')
Index(['diagnosis_diagnosisCodeableConcept', 'diagnosis_seque
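
`supportingInfo` is skipped above because its lists differ in length. One possible way to handle such ragged lists is to go long instead of wide: one row per list element. The sketch below is an assumption about how this could be done, not the notebook's method; the toy data and the `category` values are hypothetical, and it assumes every list is non-empty and contains only dicts.

```python
import pandas as pd

# Hypothetical data: each claim carries a different number of supportingInfo entries.
toy = pd.DataFrame(
    {
        "id": ["claim-1", "claim-2"],
        "supportingInfo": [
            [{"sequence": 1, "category": "info"}, {"sequence": 2, "category": "admission"}],
            [{"sequence": 1, "category": "info"}],
        ],
    }
)

# One row per list element; the claim id is repeated for each element.
long_df = toy.explode("supportingInfo", ignore_index=True)

# Expand the dicts into prefixed columns; row order matches the exploded frame.
details = pd.json_normalize(long_df["supportingInfo"].tolist()).add_prefix("supportingInfo_")
long_df = pd.concat([long_df.drop(columns="supportingInfo"), details], axis=1)
print(long_df)
```

The long table can be joined back to the claim-level data on `id` when needed, which avoids creating a variable number of columns per claim.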

### Specific Column Processing

In [920]:
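# Preprocess other columns
# 'contained_identifier' appears twice (patient first, then provider-org), so selecting it by name
# returns more than one column; the first column holds the patient's identifier
claim['contained_identifer_patient_medicare_number'] = pd.DataFrame(claim['contained_identifier']).iloc[:,0].apply(lambda x: x[0]['value'])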
# text version of name needs some extra processing 
claim['contained_name_family'] = claim['contained_name'].apply(lambda x: x[0]['family'])
claim['contained_name_given'] = claim['contained_name'].apply(lambda x: x[0]['given'][0])
claim['item_productOrService_coding_hcpcs'] =  claim['item_productOrService'].apply(lambda x: x['coding'][0]['code'])

# TODO item_revenue supportingInfo
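
The lambdas above index straight into the nested values, so a single record with a missing key or an empty list would raise an error. A defensive accessor is one way to guard against that. `dig` below is a hypothetical helper, not part of the notebook, and the sample values are hand-written.

```python
def dig(obj, *path, default=None):
    """Follow a sequence of dict keys / list indexes, returning default if any step is missing."""
    current = obj
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hypothetical cell values shaped like the 'contained_name' entries above.
complete = [{"family": "Wiza601", "given": ["Patrina117"]}]
missing_given = [{"family": "Wiza601"}]

print(dig(complete, 0, "given", 0))
print(dig(missing_given, 0, "given", 0))
print(dig(float("nan"), 0, "family", default=""))
```

It could be used in place of the direct indexing, for example `claim['contained_name'].apply(lambda x: dig(x, 0, 'family'))`.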

In [921]:
print(f" list of columns in unnested dataset {list(claim.columns)}")
print(f" original json file had 20 columns and now the dataset contains {len(list(claim.columns))}")

 list of columns in unnested dataset ['contained', 'created', 'diagnosis', 'extension', 'id', 'identifier', 'insurance', 'item', 'resourceType', 'status', 'supportingInfo', 'type', 'use', 'billablePeriod_end', 'billablePeriod_start', 'contained_birthDate', 'contained_extension', 'contained_gender', 'contained_id', 'contained_identifier', 'contained_name', 'contained_resourceType', 'contained_id', 'contained_identifier', 'contained_resourceType', 'diagnosis_diagnosisCodeableConcept', 'diagnosis_sequence', 'diagnosis_type', 'diagnosis_diagnosisCodeableConcept', 'diagnosis_sequence', 'diagnosis_diagnosisCodeableConcept', 'diagnosis_sequence', 'diagnosis_diagnosisCodeableConcept', 'diagnosis_sequence', 'diagnosis_diagnosisCodeableConcept', 'diagnosis_sequence', 'diagnosis_diagnosisCodeableConcept', 'diagnosis_sequence', 'diagnosis_diagnosisCodeableConcept', 'diagnosis_sequence', 'diagnosis_type', 'diagnosis_diagnosisCodeableConcept', 'diagnosis_sequence', 'diagnosis_diagnosisCodeableConcep
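
The column list above contains repeated names such as `contained_id` and `diagnosis_sequence`, because every list position gets the same prefix. If unique names are needed, one option is to number the repeats. `number_duplicates` below is a hypothetical helper; it assumes that no existing column already uses a numbered name like `contained_id_1`.

```python
from collections import defaultdict


def number_duplicates(names):
    """Append _1, _2, ... to repeated names, leaving the first occurrence unchanged."""
    seen = defaultdict(int)
    result = []
    for name in names:
        count = seen[name]
        result.append(name if count == 0 else f"{name}_{count}")
        seen[name] += 1
    return result


# A few of the repeated names from the column list above.
columns = [
    "contained_id",
    "contained_identifier",
    "contained_id",
    "diagnosis_sequence",
    "diagnosis_sequence",
    "diagnosis_sequence",
]
unique_columns = number_duplicates(columns)
print(unique_columns)
```

Applied to the real frame this would look like `claim.columns = number_duplicates(claim.columns)`, run after the column-specific processing.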

In [922]:
# take a look at the dataset
claim.head()

|  | contained | created | diagnosis | extension | id | identifier | insurance | item | resourceType | status | ... | meta_lastUpdated | patient_reference | priority_coding | provider_reference | total_currency | total_value | contained_identifer_patient_medicare_number | contained_name_family | contained_name_given | item_productOrService_coding_hcpcs |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | [{'birthDate': '1944-05-25', 'extension': [{'u... | 2025-09-03T22:14:42+00:00 | [{'diagnosisCodeableConcept': {'coding': [{'co... | [{'url': 'https://bluebutton.cms.gov/resources... | f-LTEwMDAwMDAzNTUxNzU5 | [{'system': 'https://bluebutton.cms.gov/resour... | [{'coverage': {'identifier': {'system': 'https... | [{'extension': [{'url': 'https://bluebutton.cm... | Claim | active | ... | 2023-05-11T21:17:37.364+00:00 | #patient | [{'code': 'normal', 'display': 'Normal', 'syst... | #provider-org | USD | 119.62 | 1S00E00JK17 | Wiza601 | Patrina117 | 99221 |
| 1 | [{'birthDate': '1944-05-25', 'extension': [{'u... | 2025-09-03T22:14:42+00:00 | [{'diagnosisCodeableConcept': {'coding': [{'co... | [{'url': 'https://bluebutton.cms.gov/resources... | f-LTEwMDAwMDAzNTUxNzY0 | [{'system': 'https://bluebutton.cms.gov/resour... | [{'coverage': {'identifier': {'system': 'https... | [{'extension': [{'url': 'https://bluebutton.cm... | Claim | active | ... | 2023-05-11T21:17:36.876+00:00 | #patient | [{'code': 'normal', 'display': 'Normal', 'syst... | #provider-org | USD | 119.62 | 1S00E00JK17 | Wiza601 | Patrina117 | 99221 |
| 2 | [{'birthDate': '1944-05-25', 'extension': [{'u... | 2025-09-03T22:14:42+00:00 | [{'diagnosisCodeableConcept': {'coding': [{'co... | [{'url': 'https://bluebutton.cms.gov/resources... | f-LTEwMDAwMDAzNTUxNzY4 | [{'system': 'https://bluebutton.cms.gov/resour... | [{'coverage': {'identifier': {'system': 'https... | [{'extension': [{'url': 'https://bluebutton.cm... | Claim | active | ... | 2023-05-11T21:17:37.098+00:00 | #patient | [{'code': 'normal', 'display': 'Normal', 'syst... | #provider-org | USD | 119.62 | 1S00E00JK17 | Wiza601 | Patrina117 | 99221 |
| 3 | [{'birthDate': '1944-05-25', 'extension': [{'u... | 2025-09-03T22:14:42+00:00 | [{'diagnosisCodeableConcept': {'coding': [{'co... | [{'url': 'https://bluebutton.cms.gov/resources... | f-LTEwMDAwMDAzNTUxNzc2 | [{'system': 'https://bluebutton.cms.gov/resour... | [{'coverage': {'identifier': {'system': 'https... | [{'extension': [{'url': 'https://bluebutton.cm... | Claim | active | ... | 2023-05-11T21:17:37.145+00:00 | #patient | [{'code': 'normal', 'display': 'Normal', 'syst... | #provider-org | USD | 119.62 | 1S00E00JK17 | Wiza601 | Patrina117 | 99221 |
| 4 | [{'birthDate': '1944-05-25', 'extension': [{'u... | 2025-09-03T22:14:42+00:00 | [{'diagnosisCodeableConcept': {'coding': [{'co... | [{'url': 'https://bluebutton.cms.gov/resources... | f-LTEwMDAwMDAzNTUxNzc4 | [{'system': 'https://bluebutton.cms.gov/resour... | [{'coverage': {'identifier': {'system': 'https... | [{'extension': [{'url': 'https://bluebutton.cm... | Claim | active | ... | 2023-05-11T21:17:37.099+00:00 | #patient | [{'code': 'normal', 'display': 'Normal', 'syst... | #provider-org | USD | 119.62 | 1S00E00JK17 | Wiza601 | Patrina117 | 99221 |


In [923]:
# save to pickle
claim.to_pickle("../data/clean/claim.pkl")
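
Pickle suits this intermediate file because several columns still hold Python lists and dicts, which a CSV would turn into strings. The round trip below is a small sketch on a hypothetical toy frame, written to a temporary directory rather than to `../data/clean`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with a column that still holds a nested list of dicts.
toy = pd.DataFrame({"id": ["a"], "item": [[{"sequence": 1}]]})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "toy.pkl")
    toy.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# The nested value comes back as a Python list, not as a string.
print(type(restored.loc[0, "item"]))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.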

This is the original code that was later turned into the `flatten_col` function. It is kept here for reference for now.

In [589]:
claim_cp = copy.deepcopy(claim)
for col in claim_cp.columns:
    # if the column contains a dictionary, that is easy to create columns from the key-value pairs
    if claim_cp[col].apply(type).eq(dict).any():
        print(f"dict {col}")
        # original column is dropped
        claim_cp = pd.concat([claim_cp, claim_cp[col].apply(pd.Series)], axis=1).drop(col, axis=1)
        
    # if the column is a list, then the list needs to be broken into columns
    elif claim_cp[col].apply(type).eq(list).any():
        print(f"list {col}")
        # the supportingInfo column is not all the same length, so this code won't process that column
        if col != 'supportingInfo':
            
            
            # if all items extracted from the list are dictionaries, then each dictionary is unnested into columns
            # other wise the column is left as-is.  Original column will not be deleted
            for i in range(0, claim_cp[col].str.len().unique()[0]):
                # to make the code more readable, creating a next col variable for the item in the list
                next_col = claim_cp[col].apply(lambda x:x[i])
                
                
                if next_col.apply(type).eq(dict).all():
                    claim_cp = pd.concat([claim_cp, next_col.apply(pd.Series)], axis=1)
                                       
                else:
                    print('list contains mixed elements.  leaving alone for now')    
           

dict billablePeriod
list contained
list diagnosis
dict facility
list insurance
list item
dict meta
dict patient
dict priority
dict provider
list supportingInfo
dict total
