## healthcare_data_manipulation.ipynb

This notebook extracts synthetic healthcare data from a SQL Server database and uses pandas to join, aggregate, filter, sort, and subset it.

Written by Stephen Lew

In [149]:
import pyodbc  # not referenced directly; the mssql+pyodbc dialect used below needs it installed
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

### Extract data from database stored on SQL Server

In [150]:
engine = create_engine(
    "mssql+pyodbc://localhost\\SQLEXPRESS/SyntheticHealthcareData?driver=ODBC+Driver+17+for+SQL+Server&Trusted_Connection=yes"
)
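The connection URL above packs the server, database, driver, and authentication mode into one string. As a minimal sketch, the same settings can be assembled from named parts and passed through SQLAlchemy's `odbc_connect` query parameter, which accepts a URL-encoded ODBC connection string. The helper name `build_mssql_url` is hypothetical, and the `create_engine` call is left commented out because it needs SQLAlchemy, pyodbc, and a reachable server.

```python
from urllib.parse import quote_plus


def build_mssql_url(server: str, database: str,
                    driver: str = "ODBC Driver 17 for SQL Server") -> str:
    """Build a SQLAlchemy URL for SQL Server using Windows authentication.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    odbc_str = (
        f"DRIVER={{{driver}}};"      # braces are required around the driver name
        f"SERVER={server};"
        f"DATABASE={database};"
        "Trusted_Connection=yes;"
    )
    # quote_plus escapes the backslash, braces, spaces, and semicolons
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc_str)


url = build_mssql_url(r"localhost\SQLEXPRESS", "SyntheticHealthcareData")
print(url)

# from sqlalchemy import create_engine
# engine = create_engine(url)  # assumes sqlalchemy and pyodbc are installed
```

Building the string this way avoids hand-escaping the backslash in the instance name and the spaces in the driver name.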

Synthetic healthcare claims data

Source: https://www.kaggle.com/datasets/abuthahir1998/synthetic-healthcare-claims-dataset/data

In [151]:
df_claim_data = pd.read_sql("SELECT * FROM claim_data", engine)
df_claim_data.sample(5)

Unnamed: 0,claim_id,provider_id,patient_id,date_of_service,billed_amount,procedure_code,diagnosis_code,allowed_amount,paid_amount,insurance_type,claim_status,reason_code,followup_required,ar_status,outcome
236,DZFSOUHGBF,3284140919,7714365852,2024-06-27,135.0,99221,R40.2331,107.0,88.0,Medicaid,Under Review,Pre-existing condition,Yes,Pending,Denied
833,A25OXEV0JO,9122941551,8439672012,2024-07-09,269.0,99231,A07.9,186.0,154.0,Commercial,Under Review,Incorrect billing information,No,Open,Paid
570,FKTZG0CNQO,5629941385,3025164227,2024-08-16,326.0,99231,A04.5,269.0,240.0,Commercial,Paid,Missing documentation,No,On Hold,Paid
535,CLT90VMN8P,6079651870,5818045824,2024-09-09,316.0,99233,O88.33,278.0,228.0,Medicare,Paid,Pre-existing condition,No,Partially Paid,Paid
905,QCXU5JFFML,4857852547,4151838566,2024-07-25,220.0,99233,A01.4,162.0,139.0,Medicare,Denied,Pre-existing condition,No,Open,Partially Paid
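`SELECT *` pulls every column and leaves `date_of_service` in whatever type the driver returns. The sketch below shows how `pd.read_sql` can select named columns and parse a date column on load. It uses an in-memory SQLite database as a stand-in for SQL Server so that it runs anywhere; the three-column table is a cut-down assumption, not the real schema.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the SQL Server database (assumption for illustration)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claim_data (claim_id TEXT, date_of_service TEXT, billed_amount REAL)"
)
conn.executemany(
    "INSERT INTO claim_data VALUES (?, ?, ?)",
    [("DZFSOUHGBF", "2024-06-27", 135.0), ("A25OXEV0JO", "2024-07-09", 269.0)],
)

# Name the columns explicitly and convert the date column while loading
df_demo = pd.read_sql(
    "SELECT claim_id, date_of_service, billed_amount FROM claim_data",
    conn,
    parse_dates=["date_of_service"],
)
conn.close()

print(df_demo.dtypes)
```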


ICD-10 code descriptions

Source: https://www.kaggle.com/datasets/mrhell/icd10cm-codeset-2023

In [152]:
df_icd_codes = pd.read_sql("SELECT * FROM icd_codes", engine)
df_icd_codes.sample(5)

Unnamed: 0,icd_code,description
71356,Y36.330S,"War operations involving flamethrower, militar..."
38340,S59.901D,"Unspecified injury of right elbow, subsequent ..."
24158,S00.219S,Abrasion of unspecified eyelid and periocular ...
40658,S62.366S,Nondisplaced fracture of neck of fifth metacar...
39719,S62.132P,Displaced fracture of capitate [os magnum] bon...


### Join data

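Add the ICD-10 code descriptions to the claims data by matching `diagnosis_code` to `icd_code`. A left join keeps every claim, including any whose diagnosis code has no match in the ICD-10 table.

Rename the new field from "description" to "diagnosis_description" and drop the redundant "icd_code" column.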

In [153]:
df_claim_data = (df_claim_data.merge(df_icd_codes, left_on = "diagnosis_code", right_on = "icd_code", how = "left")
                 .rename(columns = {"description": "diagnosis_description"})
                 .drop("icd_code", axis = 1))
df_claim_data.sample(5)

Unnamed: 0,claim_id,provider_id,patient_id,date_of_service,billed_amount,procedure_code,diagnosis_code,allowed_amount,paid_amount,insurance_type,claim_status,reason_code,followup_required,ar_status,outcome,diagnosis_description
131,I3W13NNO8Q,6763652717,7851510784,2024-07-15,359.0,99233,A04.1,303.0,276.0,Medicare,Denied,Authorization not obtained,No,On Hold,Paid,Enterotoxigenic Escherichia coli infection
841,IDINVUDJGV,5145102885,413266303,2024-07-30,212.0,99233,A01.2,143.0,122.0,Medicaid,Under Review,Authorization not obtained,No,Denied,Partially Paid,Paratyphoid fever B
107,OJFM3JQZ1S,2285909913,853961177,2024-05-18,303.0,99215,A04.1,230.0,222.0,Self-Pay,Paid,Patient eligibility issues,Yes,Pending,Paid,Enterotoxigenic Escherichia coli infection
229,IJ5T168FLX,7864007404,7250117317,2024-05-06,267.0,99222,A19.9,221.0,182.0,Medicaid,Under Review,Service not covered,Yes,Partially Paid,Denied,"Miliary tuberculosis, unspecified"
373,AZLH3XYRRW,8105376676,4316744945,2024-06-12,395.0,99232,A03.3,342.0,288.0,Medicare,Under Review,Patient eligibility issues,Yes,Open,Partially Paid,Shigellosis due to Shigella sonnei
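A left join can silently duplicate claims if a code appears more than once in the lookup table, and it leaves missing descriptions for codes that do not match. The sketch below, on toy data, uses the `validate` and `indicator` parameters of `DataFrame.merge` to check both. The claim IDs and the unmatched code `ZZZ` are made up for illustration.

```python
import pandas as pd

claims = pd.DataFrame({
    "claim_id": ["C1", "C2", "C3"],               # hypothetical IDs
    "diagnosis_code": ["A04.1", "A01.2", "ZZZ"],  # "ZZZ" is a deliberately invalid code
})
icd = pd.DataFrame({
    "icd_code": ["A04.1", "A01.2"],
    "description": [
        "Enterotoxigenic Escherichia coli infection",
        "Paratyphoid fever B",
    ],
})

merged = (
    claims.merge(
        icd,
        left_on="diagnosis_code",
        right_on="icd_code",
        how="left",
        validate="many_to_one",  # raises MergeError if icd_code is not unique
        indicator=True,          # adds a _merge column: "both" or "left_only"
    )
    .rename(columns={"description": "diagnosis_description"})
    .drop(columns="icd_code")
)

# Diagnosis codes that found no description
unmatched = merged.loc[merged["_merge"] == "left_only", "diagnosis_code"].tolist()
print(unmatched)
```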


### Create variable and aggregate data

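Create a new dataframe that has, for each insurance type, the percentage of claims with a claim status of "Denied". The row-level `denial_rate` column is 100 for a denied claim and 0 otherwise, so its mean within each group is the denial rate in percent.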

In [154]:
# Row-level flag: 100 if the claim was denied, else 0, so the group mean is a percentage
df_claim_data["denial_rate"] = np.where(df_claim_data["claim_status"] == "Denied", 100, 0)
df_denial_rate = (df_claim_data.groupby("insurance_type")["denial_rate"].mean()
                  .reset_index())
print(df_denial_rate)


  insurance_type  denial_rate
0     Commercial    34.362934
1       Medicaid    28.957529
2       Medicare    36.051502
3       Self-Pay    32.128514
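The same percentage can be computed without adding a helper column to the claims dataframe, because the mean of a boolean Series is the share of `True` values. The sketch below runs both approaches on toy data and is only an illustration of the equivalence; the column name `denied_flag` is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "insurance_type": ["Commercial"] * 4 + ["Medicare"] * 2,
    "claim_status": ["Denied", "Paid", "Paid", "Under Review", "Denied", "Denied"],
})

# Approach used in the notebook: 100/0 flag column, then group mean
toy["denied_flag"] = np.where(toy["claim_status"] == "Denied", 100, 0)
via_flag = toy.groupby("insurance_type")["denied_flag"].mean().reset_index()

# Alternative: mean of a boolean Series, scaled to percent
via_bool = (
    toy["claim_status"].eq("Denied")
    .groupby(toy["insurance_type"])
    .mean()
    .mul(100)
    .rename("denial_rate")
    .reset_index()
)

print(via_bool)
```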


### Filter data

Subset the data to only records with a diagnosis code of A05.4 (Foodborne Bacillus cereus intoxication).

In [155]:
df_filter = df_claim_data.query("diagnosis_code == 'A05.4'")
df_filter.sample(5)

Unnamed: 0,claim_id,provider_id,patient_id,date_of_service,billed_amount,procedure_code,diagnosis_code,allowed_amount,paid_amount,insurance_type,claim_status,reason_code,followup_required,ar_status,outcome,diagnosis_description,denial_rate
745,6O1LURCE9Z,2352266822,1246918644,2024-06-02,365.0,99232,A05.4,254.0,204.0,Self-Pay,Paid,Incorrect billing information,No,On Hold,Paid,Foodborne Bacillus cereus intoxication,0
475,DX78LUNF87,2316904587,9259142711,2024-06-30,492.0,99238,A05.4,379.0,352.0,Self-Pay,Under Review,Lack of medical necessity,No,Closed,Paid,Foodborne Bacillus cereus intoxication,0
600,S1SUBLMW4D,9685974415,3445672900,2024-09-16,151.0,99213,A05.4,92.0,75.0,Medicaid,Denied,Pre-existing condition,Yes,Partially Paid,Paid,Foodborne Bacillus cereus intoxication,100
417,EYOS09JPEW,3050607331,6787726636,2024-09-12,193.0,99215,A05.4,148.0,146.0,Medicaid,Denied,Missing documentation,No,Pending,Partially Paid,Foodborne Bacillus cereus intoxication,100
720,ALIEYHSV55,8958518789,1059063310,2024-06-17,434.0,99223,A05.4,385.0,384.0,Commercial,Denied,Service not covered,Yes,Open,Paid,Foodborne Bacillus cereus intoxication,100
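Two details of the filter step are worth a small sketch on toy data: `query` can reference a Python variable with `@`, and `sample(5)` raises a `ValueError` when fewer than five rows match, so capping the sample size makes the preview safe for rare codes. The three rows below reuse claim IDs from the outputs above, and the variable names are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "claim_id": ["6O1LURCE9Z", "DX78LUNF87", "I3W13NNO8Q"],
    "diagnosis_code": ["A05.4", "A05.4", "A04.1"],
})

code = "A05.4"

# query() with a variable reference, and the equivalent boolean mask
by_query = toy.query("diagnosis_code == @code")
by_mask = toy[toy["diagnosis_code"] == code]

# Cap the sample size at the number of matching rows
preview = by_query.sample(n=min(5, len(by_query)), random_state=0)
print(preview)
```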


### Sort data

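Sort the data by date of service, then by claim ID, and show the first five rows. The sorted result is displayed only; it is not assigned back to `df_claim_data`.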

In [156]:
(df_claim_data.sort_values(["date_of_service", "claim_id"])
 .head(5))

Unnamed: 0,claim_id,provider_id,patient_id,date_of_service,billed_amount,procedure_code,diagnosis_code,allowed_amount,paid_amount,insurance_type,claim_status,reason_code,followup_required,ar_status,outcome,diagnosis_description,denial_rate
235,4AECRPPSTL,6034872297,1183596686,2024-05-01,214.0,99213,M72.8,179.0,178.0,Commercial,Under Review,Duplicate claim,No,Partially Paid,Partially Paid,Other fibroblastic disorders,0
403,8H6FAKD1OH,1109442085,5127757508,2024-05-01,244.0,99238,A04.8,168.0,150.0,Medicaid,Under Review,Duplicate claim,Yes,Pending,Paid,Other specified bacterial intestinal infections,0
267,AFCVL9N05Z,369668919,3458366761,2024-05-01,108.0,99222,A05.0,66.0,58.0,Medicaid,Paid,Missing documentation,Yes,Closed,Partially Paid,Foodborne staphylococcal intoxication,0
919,WR6QACIBSK,6636473801,6871324728,2024-05-01,252.0,99213,M93.259,222.0,183.0,Commercial,Paid,Duplicate claim,Yes,Denied,Partially Paid,"Osteochondritis dissecans, unspecified hip",0
174,0APP6HH8P5,5577749071,8116288647,2024-05-02,224.0,99221,A05.3,158.0,133.0,Commercial,Denied,Lack of medical necessity,Yes,Open,Paid,Foodborne Vibrio parahaemolyticus intoxication,100
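`sort_values` returns a new dataframe and keeps the original row labels, which is why the index values in the output above are out of order. The sketch below, on toy data, assumes the dates arrive as ISO-format strings, converts them to datetimes so the sort is chronological regardless of string format, and uses `ignore_index=True` to renumber the rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "claim_id": ["WR6QACIBSK", "0APP6HH8P5", "4AECRPPSTL"],
    "date_of_service": ["2024-05-01", "2024-05-02", "2024-05-01"],
})

# Convert to datetime so ordering does not depend on the string format
toy["date_of_service"] = pd.to_datetime(toy["date_of_service"])

# Sort by date, break ties by claim ID, and renumber the index from 0
sorted_claims = toy.sort_values(["date_of_service", "claim_id"], ignore_index=True)
print(sorted_claims)
```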


### Keep select fields

Keep only the fields for claim ID, procedure code, and diagnosis code.

In [157]:
df_select = df_claim_data[["claim_id", "procedure_code", "diagnosis_code"]]
df_select.sample(5)

Unnamed: 0,claim_id,procedure_code,diagnosis_code
218,3SH5BZYEI6,99221,A07.3
286,XRRAATHX7I,99238,S54.01XS
962,NHZOO9HJH4,99214,S62.212A
460,BCI8Z8GZAG,99213,A18.6
193,J0FY34DBL2,99214,A19.0
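If the subset will be modified later, taking an explicit copy keeps it independent of the full dataframe and avoids pandas' `SettingWithCopyWarning`. The sketch below shows this on toy data; the integer type of `procedure_code` and the `billed_amount` values are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "claim_id": ["3SH5BZYEI6", "XRRAATHX7I"],
    "procedure_code": [99221, 99238],       # assumed to be stored as integers
    "diagnosis_code": ["A07.3", "S54.01XS"],
    "billed_amount": [100.0, 200.0],        # hypothetical values
})

keep = ["claim_id", "procedure_code", "diagnosis_code"]

# .copy() makes the subset independent of the source dataframe
subset = toy[keep].copy()

# Changing the subset leaves the original untouched
subset["procedure_code"] = subset["procedure_code"].astype(str)
print(subset)
```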
