<a href="https://colab.research.google.com/github/maxruther/HCP_Fraud_Detection/blob/main/analysis_in_segments/HPFD_2_Integration1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Integration I**
## Healthcare Provider Fraud Detection - **Part 2**

In this second notebook of my multi-notebook analysis, I handle the first phase of data integration.

This entails a two-step process:
- [Merging the two claims tables](#merge-claims-tables)
- [Joining in the beneficiary data](#join-in-beneficiaries)

<br></br>

### **Merge all claims, then join in the beneficiaries.**

This section concerns a central task of this project: integrating the predictor data from its various files. This task breaks down into three subtasks:

1. Combine all of the records of the Outpatient and Inpatient claim files to form one set.
2. Join that set to the patient-level data, which is held in the beneficiary file.
3. Use feature engineering to aggregate over providers, creating the provider-level features on which I will train my provider-fraud classifiers.

The first two of these subtasks make up this section, *Data Integration I*. The third is completed in the *Feature Engineering* section that follows.

After that, the *Data Integration II* section draws on these results to create the full datasets of predictors and labels. That part concludes my integration work and sets the stage for the ensuing EDA.

### **Quick Setup**

**Importing libraries**

In [None]:
import pandas as pd
import numpy as np

**Loading objects from the preceding part**

In [None]:
# Mounting my Google Drive, where I've saved the preceding part's objects to file:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
# Loading the objects necessary for this part

# Project directory path in Google Drive
project_dir_path = '/content/gdrive/MyDrive/fraud_data_dsc540/'

# Path to the files saved in part 1 (project_dir_path already ends with '/').
part1_filepath = project_dir_path + 'walkthrough/part_1/'

ip_df = pd.read_pickle(f'{part1_filepath}ip_df.pkl')
op_df = pd.read_pickle(f'{part1_filepath}op_df.pkl')
bene_df = pd.read_pickle(f'{part1_filepath}bene_df.pkl')

<a name="merge-claims-tables"></a>
### **Merging the outpatient and inpatient claim data**

To merge the records from these two similar datasets, I concatenate them along the index axis (i.e., I stack them vertically to combine their rows).

I do this with an inner join, so only the attributes shared by both datasets appear in the merged result.
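As a minimal sketch of what `join='inner'` does in `pd.concat`, the toy frames below stand in for the two claims tables. Their values and the column name `IpOnlyCol` are invented for illustration and are not taken from the project data.

```python
import pandas as pd

# Toy stand-ins for the claims tables; all values are made up.
op_toy = pd.DataFrame({'ClaimID': ['C1', 'C2'], 'Provider': ['P1', 'P2']})
ip_toy = pd.DataFrame({'ClaimID': ['C3'], 'Provider': ['P1'],
                       'IpOnlyCol': ['x']})  # hypothetical column found in one table only

# Inner join: keep only the columns present in both frames.
inner = pd.concat([op_toy, ip_toy], axis=0, join='inner')

# Outer join (the pandas default): keep the union of columns, filling gaps with NaN.
outer = pd.concat([op_toy, ip_toy], axis=0, join='outer')

print(list(inner.columns))
print(list(outer.columns))
```

With the inner join, the column that exists in only one table is dropped rather than padded with missing values for every row of the other table.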

#### **Noting claim records' table of origin**

Before concatenating, I create an attribute *PatType* in both claims datasets that marks each record's origin as either *Inpatient* or *Outpatient*. This way, that important information is not lost once the datasets are merged.

In [None]:
ip_df['PatType'] = 'Inpatient'
op_df['PatType'] = 'Outpatient'

To check this change, I glance at a few records from each dataset:

In [None]:
ip_df.iloc[:3, [1,-1]]

Unnamed: 0,ClaimID,PatType
0,CLM46614,Inpatient
1,CLM66048,Inpatient
2,CLM68358,Inpatient


In [None]:
op_df.iloc[:3, [1,-1]]

Unnamed: 0,ClaimID,PatType
0,CLM624349,Outpatient
1,CLM189947,Outpatient
2,CLM438021,Outpatient


#### **Executing the merge**

I merge the outpatient and inpatient claim datasets by concatenating them, using an inner join:

In [None]:
AllP_train = pd.concat([op_df, ip_df], axis=0, join='inner')
AllP_train.iloc[[0,1,2,-3,-2,-1], [1,-1]]

Unnamed: 0,ClaimID,PatType
0,CLM624349,Outpatient
1,CLM189947,Outpatient
2,CLM438021,Outpatient
40471,CLM76485,Inpatient
40472,CLM79949,Inpatient
40473,CLM69948,Inpatient
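Note that the row labels above restart for the inpatient block, since `pd.concat` keeps each input's original index by default. The sketch below, on invented toy frames, shows that behavior and the `ignore_index=True` option that renumbers the rows instead. It is an optional variation, not the approach used in this notebook.

```python
import pandas as pd

# Toy frames with default 0..n-1 indexes; values are made up.
top = pd.DataFrame({'ClaimID': ['C1', 'C2']})
bottom = pd.DataFrame({'ClaimID': ['C3', 'C4']})

# Default behavior: the original row labels are carried over, so they repeat.
kept = pd.concat([top, bottom], axis=0, join='inner')

# Alternative: discard the old labels and build a fresh 0..n-1 index.
fresh = pd.concat([top, bottom], axis=0, join='inner', ignore_index=True)

print(kept.index.tolist())    # [0, 1, 0, 1]
print(kept.index.is_unique)   # False
print(fresh.index.tolist())   # [0, 1, 2, 3]
```

Repeated labels only matter for label-based lookups such as `.loc`. Positional access with `.iloc` is unaffected.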


Checking that the row count of the merged result equals the sum of the row counts of its two source sets:

In [None]:
print(f'Count of outpatient claims: {op_df.shape[0]}')
print(f'Count of inpatient claims: {ip_df.shape[0]}')

AllP_train['PatType'].value_counts()

Count of outpatient claims: 517737
Count of inpatient claims: 40474


PatType,count
Outpatient,517737
Inpatient,40474


In [None]:
AllP_train.shape[0] == ip_df.shape[0] + op_df.shape[0]

True

<a name="join-in-beneficiaries"></a>
### **Joining the claims and beneficiary data**

Before joining the newly concatenated claims data to the beneficiary data, I perform a couple of checks.

First, I glance at the first few records of the beneficiary data to refresh my sense of it:

In [None]:
bene_df.head(3)

Unnamed: 0,BeneID,DOB,DOD,Gender,Race,RenalDiseaseIndicator,State,County,NoOfMonths_PartACov,NoOfMonths_PartBCov,...,ChronicCond_Depression,ChronicCond_Diabetes,ChronicCond_IschemicHeart,ChronicCond_Osteoporasis,ChronicCond_rheumatoidarthritis,ChronicCond_stroke,IPAnnualReimbursementAmt,IPAnnualDeductibleAmt,OPAnnualReimbursementAmt,OPAnnualDeductibleAmt
0,BENE11001,1943-01-01,,1,1,0,39,230,12,12,...,1,1,1,2,1,1,36000,3204,60,70
1,BENE11002,1936-09-01,,2,1,0,39,280,12,12,...,2,2,2,2,2,2,0,0,30,50
2,BENE11003,1936-08-01,,1,1,0,52,590,12,12,...,2,2,1,2,2,2,0,0,90,40


Next, I reconfirm that the inclusion dependency constraint of the foreign key _BeneID_ has persisted through the concatenation of the claims datasets.

(This is equivalent to checking that all _BeneID_ values in the dataset of merged claims are also present in that same attribute of the beneficiary file.)

In [None]:
print(f'Combined IP and OP claim records: {AllP_train.shape[0]}')
AllP_train['BeneID'].isin(bene_df['BeneID']).value_counts()

Combined IP and OP claim records: 558211


BeneID,count
True,558211
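If the check had failed, it would help to see which keys were missing. The sketch below shows one way to list such orphaned foreign-key values. The toy frames and the orphan `B9` are invented for illustration.

```python
import pandas as pd

# Toy child (claims) and parent (beneficiary) tables; values are made up.
claims_toy = pd.DataFrame({'BeneID': ['B1', 'B2', 'B2', 'B9']})
bene_toy = pd.DataFrame({'BeneID': ['B1', 'B2', 'B3']})

# True where the claim's BeneID also appears in the beneficiary table.
in_parent = claims_toy['BeneID'].isin(bene_toy['BeneID'])

# Distinct BeneID values that violate the inclusion dependency.
orphans = claims_toy.loc[~in_parent, 'BeneID'].unique().tolist()

print(in_parent.all())
print(orphans)
```

An empty `orphans` list corresponds to the all-`True` result shown above.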


With those confirmed, I **execute the join** and then glance at the result's first few records:

In [None]:
AllPnB_train = pd.merge(AllP_train, bene_df, how='inner', on='BeneID')
AllPnB_train.head(3)

Unnamed: 0,BeneID,ClaimID,ClaimStartDt,ClaimEndDt,Provider,InscClaimAmtReimbursed,AttendingPhysician,OperatingPhysician,OtherPhysician,ClmDiagnosisCode_1,...,ChronicCond_Depression,ChronicCond_Diabetes,ChronicCond_IschemicHeart,ChronicCond_Osteoporasis,ChronicCond_rheumatoidarthritis,ChronicCond_stroke,IPAnnualReimbursementAmt,IPAnnualDeductibleAmt,OPAnnualReimbursementAmt,OPAnnualDeductibleAmt
0,BENE11002,CLM624349,2009-10-11,2009-10-11,PRV56011,30,PHY326117,,,78943,...,2,2,2,2,2,2,0,0,30,50
1,BENE11003,CLM189947,2009-02-12,2009-02-12,PRV57610,80,PHY362868,,,6115,...,2,2,1,2,2,2,0,0,90,40
2,BENE11003,CLM438021,2009-06-27,2009-06-27,PRV57595,10,PHY328821,,,2723,...,2,2,1,2,2,2,0,0,90,40
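As an optional safeguard, `pd.merge` accepts a `validate` argument that raises an error if the join keys do not have the expected cardinality. The sketch below assumes each beneficiary should appear exactly once in the beneficiary table (a many-to-one join from claims). The toy frames are invented for illustration.

```python
import pandas as pd

# Toy claims (many rows per beneficiary) and beneficiaries (one row each).
claims_toy = pd.DataFrame({'BeneID': ['B1', 'B2', 'B2'],
                           'ClaimID': ['C1', 'C2', 'C3']})
bene_toy = pd.DataFrame({'BeneID': ['B1', 'B2'], 'Gender': [1, 2]})

# validate='many_to_one' checks that BeneID is unique in the right-hand table.
joined = pd.merge(claims_toy, bene_toy, how='inner', on='BeneID',
                  validate='many_to_one')
print(len(joined) == len(claims_toy))  # the join neither drops nor duplicates claims

# A duplicated beneficiary row breaks the assumption and raises MergeError.
bene_dup = pd.concat([bene_toy, bene_toy.iloc[[0]]], ignore_index=True)
try:
    pd.merge(claims_toy, bene_dup, how='inner', on='BeneID',
             validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```

Without such a check, duplicate keys in the beneficiary table would silently multiply the matching claim rows.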


To finish, I assign `AllPnB_train` to a new variable, `df`, simply because `df` is much faster to type. (In my experience, this convenience matters when manipulating and exploring data.)

In [None]:
df = AllPnB_train



---


### *Saving objects to file for part #3*

In [None]:
filesave_path = project_dir_path + 'walkthrough/part_2'
!mkdir -p {filesave_path}

bene_df.to_pickle(f'{filesave_path}/bene_df.pkl')
ip_df.to_pickle(f'{filesave_path}/ip_df.pkl')
op_df.to_pickle(f'{filesave_path}/op_df.pkl')
df.to_pickle(f'{filesave_path}/df.pkl')
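
The save step can be sanity-checked with a round trip through `to_pickle` and `read_pickle`. The sketch below does this in a temporary directory, using `pathlib` in place of the shell `mkdir -p`. The toy frame and the temporary location are assumptions for illustration, not the project's Drive paths.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the joined dataset; values are made up.
toy = pd.DataFrame({'ClaimID': ['C1', 'C2'],
                    'PatType': ['Outpatient', 'Inpatient']})

with tempfile.TemporaryDirectory() as tmp:
    # Path objects join segments without doubled or missing slashes.
    save_dir = Path(tmp) / 'walkthrough' / 'part_2'
    save_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`

    toy.to_pickle(save_dir / 'df.pkl')
    restored = pd.read_pickle(save_dir / 'df.pkl')

# True when values, dtypes, and index all match the original frame.
print(restored.equals(toy))
```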