# Cleaning Claim Data
# 01_claim_data_cleaning

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 18/09/2025   | Adrienne | Created   | Created to flatten data | 
|    | |   | |

# Content

* [Introduction](#introduction)

# Preprocess JSON

## Claims

__Columns__

- `contained` - birthDate, extension, gender, id, identifier, name, resourceType, id, identifier, resourceType
- `created`
- `diagnosis` - [diagnosisCodeableConcept, sequence, type] x 9
- `extension`
- `id`
- `resourceType`
- `status`
- `supportingInfo`
- `type`
- `use`
- `billablePeriod_end`
- `billablePeriod_start`
- `facility_extension`
- `identifier_system`
- `identifier_type`
- `identifier_value`
- `insurance_coverage`
- `insurance_focal`
- `insurance_sequence`
- `item_extension`
- `item_productOrService`
- `item_revenue`
- `item_sequence`
- `item_servicedDate`
- `meta_lastUpdated`
- `patient_reference`
- `priority_coding`
- `provider_reference`
- `total_currency`
- `total_value`
- `contained_identifer_patient_medicare_number`
- `contained_name_family`
- `contained_name_given`

In [662]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from datetime import datetime
import json_lines
import json
import copy
import math

In [663]:
# read in the JSON file; only the first 100 rows are loaded (the commented-out line reads the full file)
path = "../data/raw"
#claim = pd.read_json(f"{path}/Claim.ndjson", lines=True)
claim = pd.read_json(f"{path}/Claim.ndjson", lines=True, nrows=100)

The claims file is deeply nested. The code below prints one row of data so that the structure is easy to see.

In [None]:
claim_text = claim.head(1)
for key, value in claim_text.items():
    print(f"key*: {key}")
    for item in value:
        # dictionary cell: print each key-value pair
        if isinstance(item, dict):
            for sub_key, sub_value in item.items():
                print(f"\tkey:: ({sub_key}) value: ({sub_value})\n")
        # list cell: walk the elements and print the dictionaries inside
        elif isinstance(item, list):
            for i in item:
                if isinstance(i, dict):
                    for sub_key, sub_value in i.items():
                        if isinstance(sub_value, list):
                            for y in sub_value:
                                print(f"\t\tkey: {sub_key} list: ({y})\n")
                        else:
                            print(f"\t\tkey: ({sub_key}) value: ({sub_value})\n")
                else:
                    print(f"{i}\n")
        # scalar cell: print the value directly
        else:
            print(f"\tvalue$: {item}\n")
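
The same inspection can be sketched more compactly with a recursive helper. This is only a minimal sketch and not part of the original notebook: `describe_structure` is a hypothetical name, and the sample record is made up to mimic the shape of a claim row. It describes only the first element of each list, on the assumption that the remaining elements share its shape.

```python
def describe_structure(value, name="root", depth=0, lines=None):
    """Collect one line per nested key, indented by nesting depth."""
    if lines is None:
        lines = []
    indent = "  " * depth
    if isinstance(value, dict):
        lines.append(f"{indent}{name}: dict")
        for key, sub_value in value.items():
            describe_structure(sub_value, key, depth + 1, lines)
    elif isinstance(value, list):
        lines.append(f"{indent}{name}: list[{len(value)}]")
        # Assumption: later elements look like the first one.
        if value:
            describe_structure(value[0], f"{name}[0]", depth + 1, lines)
    else:
        lines.append(f"{indent}{name}: {type(value).__name__}")
    return lines


# Made-up record that mimics the shape of one claim row (illustrative only).
sample = {
    "billablePeriod": {"start": "2012-09-16", "end": "2012-09-16"},
    "item": [{"sequence": 1, "servicedDate": "2012-09-16"}],
    "status": "active",
}
structure = describe_structure(sample, name="claim")
print("\n".join(structure))
```

On the real data, something like `describe_structure(claim.iloc[0].to_dict(), name="claim")` should give a similar outline of one row.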


### Functions to Flatten a Column

The function below unnests columns that contain dictionaries, and unnests list columns when every list has the same length. Some columns are not fully unnested: where a column or field does not contain data we intend to use, it is skipped.

In [664]:
def flatten_dict_col(df, col):
        # if the column contains dictionaries, each key becomes a new column
        if df[col].apply(type).eq(dict).any():
            print(f"dict {col}")
            # the original column is kept (the drop below is commented out)
            temp = df[col].apply(pd.Series)
            temp = temp.add_prefix(str(col) + '_')
            print(temp.columns)
            #df = pd.concat([df, temp], axis=1).drop(col, axis=1)
            df = pd.concat([df, temp], axis=1)
            
            
        # if the column is a list, then the list needs to be broken into columns
        elif df[col].apply(type).eq(list).any():
            print(f"list {col}")
            # if every item at a given list position is a dictionary, that position is unnested into columns;
            # otherwise it is left as-is. The original column is not deleted
            for i in range(0, np.int64(df[col].str.len().unique()[0]).item()):
                # to make the code more readable, creating a next col variable for the item in the list
                next_col = df[col].apply(lambda x:x[i])
                    
                    
                if next_col.apply(type).eq(dict).all():
                    temp = next_col.apply(pd.Series)
                    temp = temp.add_prefix(str(col) + '_')
                    print(temp.columns)
                     
                    df = pd.concat([df, temp], axis=1)
                                        
                else:
                    print('list contains mixed elements.  leaving alone for now')    
        return df
            

In [665]:
claim = flatten_dict_col(claim, 'billablePeriod')
claim = flatten_dict_col(claim, 'contained')
#claim = flatten_col(claim, 'diagnosis')
claim = flatten_dict_col(claim, 'facility')
claim = flatten_dict_col(claim, 'identifier')
#claim = flatten_dict_col(claim, 'insurance')
#claim = flatten_col(claim, 'item')
claim = flatten_dict_col(claim, 'meta')
claim = flatten_dict_col(claim, 'patient')
claim = flatten_dict_col(claim, 'priority')
claim = flatten_dict_col(claim, 'provider')
# supportingInfo has lists that are not the same length so it will need to be unnested separately
claim = flatten_dict_col(claim, 'total')

dict billablePeriod
Index(['billablePeriod_end', 'billablePeriod_start'], dtype='object')
list contained
Index(['contained_birthDate', 'contained_extension', 'contained_gender',
       'contained_id', 'contained_identifier', 'contained_name',
       'contained_resourceType'],
      dtype='object')
Index(['contained_id', 'contained_identifier', 'contained_resourceType',
       'contained_extension'],
      dtype='object')
dict facility
Index(['facility_extension', 'facility_0'], dtype='object')
list identifier
Index(['identifier_system', 'identifier_type', 'identifier_value'], dtype='object')
dict meta
Index(['meta_lastUpdated'], dtype='object')
dict patient
Index(['patient_reference'], dtype='object')
dict priority
Index(['priority_coding'], dtype='object')
dict provider
Index(['provider_reference'], dtype='object')
dict total
Index(['total_currency', 'total_value'], dtype='object')
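
As a point of comparison, pandas also provides `pd.json_normalize`, which expands dictionaries into columns. The sketch below is not the notebook's method; it swaps in `json_normalize` for `apply(pd.Series)` on made-up data shaped like the `billablePeriod` column, and it assumes every row holds a dictionary.

```python
import pandas as pd

# Made-up rows that mimic a dictionary column such as `billablePeriod`.
toy = pd.DataFrame({
    "id": ["a", "b"],
    "billablePeriod": [
        {"start": "2012-09-16", "end": "2012-09-16"},
        {"start": "2013-06-11", "end": "2013-06-12"},
    ],
})

# json_normalize expands each dictionary into one column per key.
expanded = pd.json_normalize(toy["billablePeriod"].tolist())
expanded = expanded.add_prefix("billablePeriod_")
expanded.index = toy.index  # keep row alignment for the concat

flat = pd.concat([toy, expanded], axis=1)
print(list(flat.columns))
```

Unlike `apply(pd.Series)`, `json_normalize` also flattens dictionaries nested inside dictionaries, joining the key names with a `.` separator by default. Rows with missing values may need to be replaced by empty dictionaries before it is called.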


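This function unnests columns that are lists of dictionaries. It first creates one column per list position, then expands each of those dictionaries into its own columns.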

In [666]:
def flatten_list_col(df, col, prefix):
    temp = df[col].apply(pd.Series)
    temp = temp.add_prefix(prefix)
    print(temp.columns)
    print(len(temp.columns))
    
    df = pd.concat([df, temp], axis=1)

    # expanding sequence dictionaries
    for sub_col in temp.columns.values:
        temp_col = df[sub_col].apply(pd.Series)
        temp_col = temp_col.add_prefix( str(sub_col) + '_')
    
        df = pd.concat([df, temp_col], axis=1)
        #print(temp_col.columns)
    
    return df
        

In [667]:
# can only get function to unnest diagnosis column
claim = flatten_list_col(claim, 'diagnosis', 'diagnosis_')

Index(['diagnosis_0', 'diagnosis_1', 'diagnosis_2', 'diagnosis_3',
       'diagnosis_4', 'diagnosis_5', 'diagnosis_6', 'diagnosis_7',
       'diagnosis_8', 'diagnosis_9', 'diagnosis_10', 'diagnosis_11'],
      dtype='object')
12
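
The two expansion steps can be illustrated on a toy column. This is a minimal sketch with made-up data, not the notebook's own output. It shows how shorter lists are padded with NaN, and it likely explains why columns ending in `_0` (such as `facility_0` above) appear: expanding a NaN cell with `pd.Series` produces a one-element Series whose index is 0.

```python
import pandas as pd

# Made-up column: lists of dictionaries with different lengths.
toy = pd.DataFrame({
    "diagnosis": [
        [{"sequence": 1}, {"sequence": 2}],
        [{"sequence": 1}],
    ]
})

# Step 1: one column per list position; shorter lists are padded with NaN.
positions = toy["diagnosis"].apply(pd.Series).add_prefix("diagnosis_")
print(positions)

# Step 2: expanding a position that contains NaN adds a column named 0,
# because pd.Series(NaN) is a one-element Series with index 0.
second = positions["diagnosis_1"].apply(pd.Series).add_prefix("diagnosis_1_")
print(list(second.columns))
```

Columns created this way hold only NaN and could be dropped after flattening.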


## Issue with the item, insurance, and supportingInfo columns

These columns are lists of dictionaries that are not all the same length. See the example below.

In [None]:
print(f"len of rows in item: {claim['item'].str.len().unique()}")
print(f"item col with len 1: {list(claim['item'].loc[70])}")
print(f"item col with len 2: {list(claim['item'].loc[99])}")

In [None]:
# This is an example of the column containing NaN
print(claim['item'].loc[67])
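
One possible way around the unequal list lengths is to reshape the column into a long table, with one row per list element, instead of adding numbered columns. The sketch below is an assumption about how that could be done, not something the notebook implements. It uses `DataFrame.explode` and `pd.json_normalize` on made-up data, and the key names `sequence` and `revenue` are illustrative only.

```python
import pandas as pd

# Made-up rows: `item` holds one dict, two dicts, or a missing value.
toy = pd.DataFrame({
    "id": ["c1", "c2", "c3"],
    "item": [
        [{"sequence": 1, "revenue": "0450"}],
        [{"sequence": 1, "revenue": "0301"}, {"sequence": 2, "revenue": "0636"}],
        float("nan"),
    ],
})

# explode gives every list element its own row and repeats the claim id.
long_df = toy[["id", "item"]].explode("item").dropna(subset=["item"])
long_df = long_df.reset_index(drop=True)

# Each remaining cell is a dictionary, so it can be expanded into columns.
items = pd.json_normalize(long_df["item"].tolist()).add_prefix("item_")
claim_items = pd.concat([long_df[["id"]], items], axis=1)
print(claim_items)
```

The result has one row per claim item, keyed by the claim `id`, so it can be joined back to the claim-level table when needed. Claims with no items drop out of the long table.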

### Specific Column Processing

In [668]:
# Preprocess other columns
# contained_identifier appears twice after flattening, so the first occurrence is selected by position
claim['contained_identifer_patient_medicare_number'] = pd.DataFrame(claim['contained_identifier']).iloc[:,0].apply(lambda x: x[0]['value'])
# the text version of the name needs extra processing because the field contains text notes
claim['contained_name_family'] = claim['contained_name'].apply(lambda x: x[0]['family'])
claim['contained_name_given'] = claim['contained_name'].apply(lambda x: x[0]['given'][0])
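
The lambdas above assume that every row has at least one identifier and one name entry. A more defensive variant could look like the following sketch. `first_value` is a hypothetical helper, not part of the notebook, and the sample values are made up to match the shape of the `contained_name` column.

```python
import pandas as pd


def first_value(entries, key, default=None):
    """Return entries[0][key] when it exists, otherwise the default."""
    if isinstance(entries, list) and entries and isinstance(entries[0], dict):
        return entries[0].get(key, default)
    return default


# Made-up values shaped like the `contained_name` column.
names = pd.Series([
    [{"family": "Wiza601", "given": ["Patrina117"]}],
    [],            # empty list
    float("nan"),  # missing value
])

# Extra keyword arguments to Series.apply are passed on to the function.
family = names.apply(first_value, key="family")
print(family.tolist())
```

Rows without a usable entry come back as missing values instead of raising an `IndexError` or `TypeError`.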

In [None]:
# once item is unnested correctly, these lines can be updated and run
# claim['item_0_productOrService_coding_hcpcs'] =  claim['item_0_productOrService'].apply(lambda x: x['coding'][0]['code'] if isinstance(x, dict) else 'no code')
# claim['item_1_productOrService_coding_hcpcs'] =  claim['item_1_productOrService'].apply(lambda x: x['coding'][0]['code'] if isinstance(x, dict) else 'no code')


In [669]:
print(f" list of columns in unnested dataset {list(claim.columns)}")
print(f" original json file had 20 columns and now the dataset contains {len(list(claim.columns))}")

 list of columns in unnested dataset ['billablePeriod', 'contained', 'created', 'diagnosis', 'extension', 'facility', 'id', 'identifier', 'insurance', 'item', 'meta', 'patient', 'priority', 'provider', 'resourceType', 'status', 'supportingInfo', 'total', 'type', 'use', 'procedure', 'billablePeriod_end', 'billablePeriod_start', 'contained_birthDate', 'contained_extension', 'contained_gender', 'contained_id', 'contained_identifier', 'contained_name', 'contained_resourceType', 'contained_id', 'contained_identifier', 'contained_resourceType', 'contained_extension', 'facility_extension', 'facility_0', 'identifier_system', 'identifier_type', 'identifier_value', 'meta_lastUpdated', 'patient_reference', 'priority_coding', 'provider_reference', 'total_currency', 'total_value', 'diagnosis_0', 'diagnosis_1', 'diagnosis_2', 'diagnosis_3', 'diagnosis_4', 'diagnosis_5', 'diagnosis_6', 'diagnosis_7', 'diagnosis_8', 'diagnosis_9', 'diagnosis_10', 'diagnosis_11', 'diagnosis_0_diagnosisCodeableConcept',

In [670]:
# take a look at the dataset
claim.head()

Unnamed: 0,billablePeriod,contained,created,diagnosis,extension,facility,id,identifier,insurance,item,...,diagnosis_10_diagnosisCodeableConcept,diagnosis_10_sequence,diagnosis_10_type,diagnosis_11_0,diagnosis_11_diagnosisCodeableConcept,diagnosis_11_sequence,diagnosis_11_type,contained_identifer_patient_medicare_number,contained_name_family,contained_name_given
0,"{'end': '2012-09-16', 'start': '2012-09-16'}","[{'birthDate': '1944-05-25', 'extension': [{'u...",2025-09-03T22:14:42+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,{'extension': [{'url': 'https://bluebutton.cms...,f-LTEwMDAwMDAzNTUxNzU5,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'extension': [{'url': 'https://bluebutton.cm...,...,,,,,,,,1S00E00JK17,Wiza601,Patrina117
1,"{'end': '2013-06-11', 'start': '2013-06-11'}","[{'birthDate': '1944-05-25', 'extension': [{'u...",2025-09-03T22:14:42+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,{'extension': [{'url': 'https://bluebutton.cms...,f-LTEwMDAwMDAzNTUxNzY0,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'extension': [{'url': 'https://bluebutton.cm...,...,,,,,,,,1S00E00JK17,Wiza601,Patrina117
2,"{'end': '2014-04-02', 'start': '2014-04-01'}","[{'birthDate': '1944-05-25', 'extension': [{'u...",2025-09-03T22:14:42+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,{'extension': [{'url': 'https://bluebutton.cms...,f-LTEwMDAwMDAzNTUxNzY4,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'extension': [{'url': 'https://bluebutton.cm...,...,,,,,,,,1S00E00JK17,Wiza601,Patrina117
3,"{'end': '2014-11-18', 'start': '2014-11-17'}","[{'birthDate': '1944-05-25', 'extension': [{'u...",2025-09-03T22:14:42+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,{'extension': [{'url': 'https://bluebutton.cms...,f-LTEwMDAwMDAzNTUxNzc2,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'extension': [{'url': 'https://bluebutton.cm...,...,,,,,,,,1S00E00JK17,Wiza601,Patrina117
4,"{'end': '2016-04-04', 'start': '2016-04-04'}","[{'birthDate': '1944-05-25', 'extension': [{'u...",2025-09-03T22:14:42+00:00,[{'diagnosisCodeableConcept': {'coding': [{'co...,[{'url': 'https://bluebutton.cms.gov/resources...,{'extension': [{'url': 'https://bluebutton.cms...,f-LTEwMDAwMDAzNTUxNzc4,[{'system': 'https://bluebutton.cms.gov/resour...,[{'coverage': {'identifier': {'system': 'https...,[{'extension': [{'url': 'https://bluebutton.cm...,...,,,,,,,,1S00E00JK17,Wiza601,Patrina117


In [671]:
# save to pickle
claim.to_pickle("../data/clean/claim.pkl")