<a href="https://colab.research.google.com/github/maxruther/HCP_Fraud_Detection/blob/main/analysis_in_segments/HPFD_1_Setup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Setup**
### Healthcare Provider Fraud Detection - **Part 1**

This is the first part of my segmented analysis of healthcare claim fraud, and it focuses on setup. This entails:
- [Loading the various datasets](#loading-the-data)
- Preliminary analysis of those datasets, which involves:
    - [Running some basic checks](#Preliminary-checks)
    - [Identifying the need for data integration](#identifying-a-challenge-of-data-integration)
    - [Verifying suspected foreign key relationships](#determining-tables-relationships)

<br></br>

### **Importing Libraries**

First, I import the libraries. Mostly for reference, the cells below import every library used across the entire project, not just this part:

In [None]:
!pip install imblearn

Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl.metadata (355 bytes)
Collecting imbalanced-learn (from imblearn)
  Downloading imbalanced_learn-0.13.0-py3-none-any.whl.metadata (8.8 kB)
Collecting sklearn-compat<1,>=0.1 (from imbalanced-learn->imblearn)
  Downloading sklearn_compat-0.1.3-py3-none-any.whl.metadata (18 kB)
Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Downloading imbalanced_learn-0.13.0-py3-none-any.whl (238 kB)
Downloading sklearn_compat-0.1.3-py3-none-any.whl (18 kB)
Installing collected packages: sklearn-compat, imbalanced-learn, imblearn
Successfully installed imbalanced-learn-0.13.0 imblearn-0.0 sklearn-compat-0.1.3


In [None]:
import pandas as pd
import numpy as np

import time
from datetime import datetime

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

from sklearn.metrics import precision_score, make_scorer
from sklearn.metrics import classification_report, confusion_matrix

import matplotlib.pyplot as plt
import seaborn as sns

from imblearn.over_sampling import RandomOverSampler

import pickle

import warnings

<a name="loading-the-data"></a>
### **Loading the Data**

I have already stored copies of the Kaggle data in my Google Drive. I mount my Google Drive in this Colab session (which requires my authentication) then read in the Kaggle data:

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
# Project directory path in my Google Drive
project_dir_path = '/content/gdrive/MyDrive/fraud_data_dsc540/'
raw_file_subpath = 'data/raw_files/'  # no leading slash: project_dir_path already ends with one

# Read in data files
bene_df = pd.read_csv(f'{project_dir_path}{raw_file_subpath}Train_Beneficiarydata-1542865627584.csv')
ip_df = pd.read_csv(f'{project_dir_path}{raw_file_subpath}Train_Inpatientdata-1542865627584.csv')
op_df = pd.read_csv(f'{project_dir_path}{raw_file_subpath}Train_Outpatientdata-1542865627584.csv')
label_df = pd.read_csv(f'{project_dir_path}{raw_file_subpath}Train-1542865627584.csv')
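
As an aside, the same paths could be assembled with `pathlib` instead of string concatenation, which avoids managing slashes by hand. This is only a sketch: the directory and file names are copied from the cell above, while `raw_files` and `raw_paths` are names I introduce here for illustration.

```python
from pathlib import PurePosixPath

# Same directory names as in the cell above; the '/' operator joins
# segments with exactly one separator.
project_dir = PurePosixPath('/content/gdrive/MyDrive/fraud_data_dsc540')
raw_dir = project_dir / 'data' / 'raw_files'

# File names copied from the read_csv calls above.
raw_files = {
    'bene': 'Train_Beneficiarydata-1542865627584.csv',
    'ip': 'Train_Inpatientdata-1542865627584.csv',
    'op': 'Train_Outpatientdata-1542865627584.csv',
    'label': 'Train-1542865627584.csv',
}
raw_paths = {name: str(raw_dir / fname) for name, fname in raw_files.items()}

for name, path in raw_paths.items():
    print(f'{name}: {path}')
```

Each resulting string can be passed straight to `pd.read_csv`.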

I have only read in the 'Train' files provided by the Kaggle post, even though 'Test' files were provided as well. I have intentionally omitted those 'Test' files because their labels are virtually empty.

I will create my own test sets later on, split from data sourced in these "Train_*" files.

(***Sidenote for the Kaggle-uninitiated:*** Kaggle competitions typically withhold the test set's labels, because competitors submit their predictions on that set for grading. Here, I am conducting the analysis as an academic exercise, not for a Kaggle competition. To acknowledge the omitted test data in code, I leave commented-out load commands:)

In [None]:
# bene_test_df = pd.read_csv(f'{project_dir_path}{raw_file_subpath}Test_Beneficiarydata-1542969243754.csv')
# ip_test_df = pd.read_csv(f'{project_dir_path}{raw_file_subpath}Test_Inpatientdata-1542969243754.csv')
# op_test_df = pd.read_csv(f'{project_dir_path}{raw_file_subpath}Test_Outpatientdata-1542969243754.csv')
# label_test_df = pd.read_csv(f'{project_dir_path}{raw_file_subpath}Test-1542969243754.csv')

<a name="Preliminary-checks"></a>
#### **Preliminary checks**

To take a brief look at each freshly loaded dataset, I check its shape and first few records.

I start with the beneficiary file:

In [None]:
print(f'Number of records in the beneficiary file: {bene_df.shape[0]}')
print(f'Number of attributes in the beneficiary file: {bene_df.shape[1]}')
print()

bene_df.head(3)

Number of records in the beneficiary file: 138556
Number of attributes in the beneficiary file: 25



|  | BeneID | DOB | DOD | Gender | Race | RenalDiseaseIndicator | State | County | NoOfMonths_PartACov | NoOfMonths_PartBCov | ... | ChronicCond_Depression | ChronicCond_Diabetes | ChronicCond_IschemicHeart | ChronicCond_Osteoporasis | ChronicCond_rheumatoidarthritis | ChronicCond_stroke | IPAnnualReimbursementAmt | IPAnnualDeductibleAmt | OPAnnualReimbursementAmt | OPAnnualDeductibleAmt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BENE11001 | 1943-01-01 |  | 1 | 1 | 0 | 39 | 230 | 12 | 12 | ... | 1 | 1 | 1 | 2 | 1 | 1 | 36000 | 3204 | 60 | 70 |
| 1 | BENE11002 | 1936-09-01 |  | 2 | 1 | 0 | 39 | 280 | 12 | 12 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 0 | 30 | 50 |
| 2 | BENE11003 | 1936-08-01 |  | 1 | 1 | 0 | 52 | 590 | 12 | 12 | ... | 2 | 2 | 1 | 2 | 2 | 2 | 0 | 0 | 90 | 40 |


Moving on to the inpatient claim file:

In [None]:
print(f'Number of records in the Inpatient file: {ip_df.shape[0]}')
print(f'Number of attributes in the Inpatient file: {ip_df.shape[1]}')
print()

ip_df.head(3)

Number of records in the Inpatient file: 40474
Number of attributes in the Inpatient file: 30



|  | BeneID | ClaimID | ClaimStartDt | ClaimEndDt | Provider | InscClaimAmtReimbursed | AttendingPhysician | OperatingPhysician | OtherPhysician | AdmissionDt | ... | ClmDiagnosisCode_7 | ClmDiagnosisCode_8 | ClmDiagnosisCode_9 | ClmDiagnosisCode_10 | ClmProcedureCode_1 | ClmProcedureCode_2 | ClmProcedureCode_3 | ClmProcedureCode_4 | ClmProcedureCode_5 | ClmProcedureCode_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BENE11001 | CLM46614 | 2009-04-12 | 2009-04-18 | PRV55912 | 26000 | PHY390922 |  |  | 2009-04-12 | ... | 2724.0 | 19889.0 | 5849.0 |  |  |  |  |  |  |  |
| 1 | BENE11001 | CLM66048 | 2009-08-31 | 2009-09-02 | PRV55907 | 5000 | PHY318495 | PHY318495 |  | 2009-08-31 | ... |  |  |  |  | 7092.0 |  |  |  |  |  |
| 2 | BENE11001 | CLM68358 | 2009-09-17 | 2009-09-20 | PRV56046 | 5000 | PHY372395 |  | PHY324689 | 2009-09-17 | ... |  |  |  |  |  |  |  |  |  |  |


Now checking the outpatient claim file:

In [None]:
print(f'Number of records in the Outpatient file: {op_df.shape[0]}')
print(f'Number of attributes in the Outpatient file: {op_df.shape[1]}')
print()

op_df.head(3)

Number of records in the Outpatient file: 517737
Number of attributes in the Outpatient file: 27



|  | BeneID | ClaimID | ClaimStartDt | ClaimEndDt | Provider | InscClaimAmtReimbursed | AttendingPhysician | OperatingPhysician | OtherPhysician | ClmDiagnosisCode_1 | ... | ClmDiagnosisCode_9 | ClmDiagnosisCode_10 | ClmProcedureCode_1 | ClmProcedureCode_2 | ClmProcedureCode_3 | ClmProcedureCode_4 | ClmProcedureCode_5 | ClmProcedureCode_6 | DeductibleAmtPaid | ClmAdmitDiagnosisCode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BENE11002 | CLM624349 | 2009-10-11 | 2009-10-11 | PRV56011 | 30 | PHY326117 |  |  | 78943 | ... |  |  |  |  |  |  |  |  | 0 | 56409.0 |
| 1 | BENE11003 | CLM189947 | 2009-02-12 | 2009-02-12 | PRV57610 | 80 | PHY362868 |  |  | 6115 | ... |  |  |  |  |  |  |  |  | 0 | 79380.0 |
| 2 | BENE11003 | CLM438021 | 2009-06-27 | 2009-06-27 | PRV57595 | 10 | PHY328821 |  |  | 2723 | ... |  |  |  |  |  |  |  |  | 0 |  |


Finally, checking the shape of the label file:

In [None]:
print(f'Number of records in the label file: {label_df.shape[0]}')
print(f'Number of attributes in the label file: {label_df.shape[1]}')
print()

label_df.head(3)

Number of records in the label file: 5410
Number of attributes in the label file: 2



|  | Provider | PotentialFraud |
|---|---|---|
| 0 | PRV51001 | No |
| 1 | PRV51003 | Yes |
| 2 | PRV51004 | No |
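
The four shape checks above follow the same pattern, so they could be folded into one helper. Below is a minimal sketch on toy data; `summarize_frames` and the toy frames are hypothetical names of mine, not part of the original analysis.

```python
import pandas as pd

def summarize_frames(frames):
    """Return one row per DataFrame with its record and attribute counts."""
    rows = [
        {'dataset': name, 'records': df.shape[0], 'attributes': df.shape[1]}
        for name, df in frames.items()
    ]
    return pd.DataFrame(rows).set_index('dataset')

# Toy stand-ins; in this notebook the dict would instead hold
# bene_df, ip_df, op_df and label_df.
toy_frames = {
    'claims': pd.DataFrame({'ClaimID': ['C1', 'C2', 'C3'],
                            'Provider': ['P1', 'P1', 'P2']}),
    'labels': pd.DataFrame({'Provider': ['P1', 'P2'],
                            'PotentialFraud': ['No', 'Yes']}),
}
summary = summarize_frames(toy_frames)
print(summary)
```

Putting the counts side by side in one table makes differences in record counts between the files easier to spot.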


The label file, 'Train-1542865627584.csv', contains only 5,410 records, while each of the other training files holds at least seven times as many (40,474 inpatient claims, 138,556 beneficiaries, and 517,737 outpatient claims). If this discrepancy strikes you as confusing, or as a sign that the datasets are unreconciled, I would agree. The next section breaks it down:

<a name="identifying-a-challenge-of-data-integration"></a>
#### **Identifying a Challenge of Data Integration**
#### *Transforming towards the labels' provider level*

The record-count discrepancy noted above arises because none of the files containing predictor data are at the provider level, which is the level of the label data. The 'level' of a dataset is a term often used for the entity that each row of data represents.

The four files provided in this Kaggle post span three different levels, meaning their records reflect different things:
- The label file's records reflect healthcare providers, specifically whether each is flagged in the *PotentialFraud* column or not.

- The beneficiary file's records reflect individual beneficiaries (a.k.a. patients).

- The records of both the outpatient and inpatient claim files reflect individual claims. These files also contain what appear to be key attributes: they share *Provider* with the label file and *BeneID* with the beneficiary file.
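
One way to probe a table's level is to test which column takes a distinct value on every row. The sketch below uses a toy claim table. `is_row_identifier` is a hypothetical helper of mine, and treating *ClaimID* as the claim identifier is an assumption that this notebook does not itself verify.

```python
import pandas as pd

def is_row_identifier(df, columns):
    """True if the given column(s) take a distinct value on every row."""
    return not df.duplicated(subset=columns).any()

# Toy claim-level table: one row per claim, several claims per provider.
toy_claims = pd.DataFrame({
    'ClaimID': ['CLM1', 'CLM2', 'CLM3', 'CLM4'],
    'BeneID': ['BENE1', 'BENE1', 'BENE2', 'BENE3'],
    'Provider': ['PRV1', 'PRV1', 'PRV2', 'PRV2'],
})

for col in ['ClaimID', 'BeneID', 'Provider']:
    print(col, is_row_identifier(toy_claims, col))
```

In this toy table only *ClaimID* identifies rows, which is what "claim level" means: providers and beneficiaries each repeat across many rows.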

Since neither the claim nor the beneficiary datasets match the provider level of the labels, **much of this project's challenge is data integration: adapting those datasets to the provider level.** I meet that challenge by carrying out the following two steps:

  1. **Integrate the three non-label files**, by combining the claims data and then joining in the patient data. This gets me a claim-level dataset with all the original predictor attributes still available.

  2. **Aggregate on that result through feature engineering**, to transform that claim-level dataset to a provider-level one (which matches the labels.)
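
The two steps can be sketched end to end on toy data. Everything below is illustrative: the values are made up, the `is_inpatient` flag is a hypothetical addition, and the aggregates are placeholders rather than the features engineered later in this project.

```python
import pandas as pd

# Toy stand-ins for the four files (real column names, made-up values).
toy_ip = pd.DataFrame({'ClaimID': ['CLM1'], 'BeneID': ['BENE1'],
                       'Provider': ['PRV1'], 'InscClaimAmtReimbursed': [5000]})
toy_op = pd.DataFrame({'ClaimID': ['CLM2', 'CLM3'], 'BeneID': ['BENE1', 'BENE2'],
                       'Provider': ['PRV1', 'PRV2'], 'InscClaimAmtReimbursed': [30, 80]})
toy_bene = pd.DataFrame({'BeneID': ['BENE1', 'BENE2'], 'Gender': [1, 2]})
toy_labels = pd.DataFrame({'Provider': ['PRV1', 'PRV2'],
                           'PotentialFraud': ['Yes', 'No']})

# Step 1: stack the claims, then join in the patient attributes (claim level).
claims = pd.concat(
    [toy_ip.assign(is_inpatient=1), toy_op.assign(is_inpatient=0)],
    ignore_index=True,
)
claim_level = claims.merge(toy_bene, on='BeneID', how='left')

# Step 2: aggregate to one row per provider, then attach the labels.
provider_level = (
    claim_level.groupby('Provider')
    .agg(n_claims=('ClaimID', 'count'),
         n_patients=('BeneID', 'nunique'),
         total_reimbursed=('InscClaimAmtReimbursed', 'sum'),
         share_inpatient=('is_inpatient', 'mean'))
    .reset_index()
    .merge(toy_labels, on='Provider', how='left')
)
print(provider_level)
```

The result has one row per provider, so it lines up one-to-one with the label file.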

But before starting on all this, I need to verify that these datasets' apparent key attributes really behave as keys, and to examine how the tables otherwise relate. (The Kaggle post provided scant explanation of the schema.)

<a name="determining-tables-relationships"></a>
##### **Determining tables' relationships**

###### **Verifying Foreign Keys**

**Key #1 - *Beneficiary.BeneID***

It appears that the *BeneID* field in the beneficiary data is referenced as a foreign key by both the Inpatient (IP) and Outpatient (OP) datasets. The Kaggle project's description doesn't establish this, so here I double-check that the field does behave as a foreign key, by checking:
1. its uniqueness in the Beneficiary data and
2. whether all of its values in the IP and OP claim datasets are contained in the Beneficiary dataset. This property of a foreign key is known as the *inclusion dependency constraint.*
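
Both checks can be bundled into a small helper. This is a sketch on toy data: `check_foreign_key` is a hypothetical function of mine, whereas the cells that follow run the same checks step by step on the real tables.

```python
import pandas as pd

def check_foreign_key(parent, child, key):
    """Run the two checks listed above for `key`.

    Returns (unique_in_parent, n_orphans), where n_orphans counts the child
    rows whose key value is missing from the parent.
    """
    unique_in_parent = bool(parent[key].is_unique)
    n_orphans = int((~child[key].isin(parent[key])).sum())
    return unique_in_parent, n_orphans

toy_parent = pd.DataFrame({'BeneID': ['BENE1', 'BENE2', 'BENE3']})
toy_child_ok = pd.DataFrame({'BeneID': ['BENE1', 'BENE1', 'BENE3']})
toy_child_bad = pd.DataFrame({'BeneID': ['BENE1', 'BENE9']})

print(check_foreign_key(toy_parent, toy_child_ok, 'BeneID'))
print(check_foreign_key(toy_parent, toy_child_bad, 'BeneID'))
```

A key passes when the first value is `True` and the second is `0`. The second toy child fails because `BENE9` has no matching parent row.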

**Checking (1)** the uniqueness of _BeneID_ within the parent table, that of the Beneficiary file:

In [None]:
bene_df.shape[0]

138556

In [None]:
bene_df['BeneID'].nunique()

138556

Indeed, every *BeneID* value in the beneficiary file is unique.

Also, as shown in the following cell, all of these *BeneID* values are contained in either the IP or OP claim data. (This is helpful to know, but not necessary for defining it as a foreign key):

In [None]:
(bene_df['BeneID'].isin(op_df['BeneID']) | bene_df['BeneID'].isin(ip_df['BeneID'])).value_counts()

| BeneID | count |
|---|---|
| True | 138556 |


**Checking (2)** the inclusion dependency constraint:

Are all *BeneID* values featured in the IP and OP claim files also contained in the Beneficiary file?

**OP**

In [None]:
op_df.shape[0]

517737

In [None]:
op_df['BeneID'].isin(bene_df['BeneID']).value_counts()

| BeneID | count |
|---|---|
| True | 517737 |


True for OP file.

**IP**

In [None]:
ip_df.shape[0]

40474

In [None]:
ip_df['BeneID'].isin(bene_df['BeneID']).value_counts()

| BeneID | count |
|---|---|
| True | 40474 |


True for IP file.

'BeneID' seems to be an effective foreign key in the OP and IP claim files (with the bene file being the parent.)


---



**Key #2 - *Labels.Provider***

**Checking (1)** the uniqueness of *Provider* within the parent table, that of the label file:

In [None]:
label_df.shape[0]

5410

In [None]:
label_df['Provider'].nunique()

5410

Indeed, every *Provider* value in the label file is unique.

Also, as shown in the following cell, all of these *Provider* values are contained in either the IP or OP claim data. (This is helpful to know, but not necessary for defining it as a foreign key):

In [None]:
(label_df['Provider'].isin(op_df['Provider']) | label_df['Provider'].isin(ip_df['Provider'])).value_counts()

| Provider | count |
|---|---|
| True | 5410 |


**Checking (2)** the inclusion dependency constraint:

Are all *Provider* values featured in the IP and OP claim files also contained in the label file?

**OP**

In [None]:
op_df.shape[0]

517737

In [None]:
op_df['Provider'].isin(label_df['Provider']).value_counts()

| Provider | count |
|---|---|
| True | 517737 |


True for OP file.

**IP**

In [None]:
ip_df.shape[0]

40474

In [None]:
ip_df['Provider'].isin(label_df['Provider']).value_counts()

| Provider | count |
|---|---|
| True | 40474 |


True for IP file.

*Provider* is an effective foreign key in the OP and IP claim files (referencing that same attribute in the label file as parent.)
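
As a cross-check, pandas can enforce the same relationship at join time. The sketch below, on toy data, uses the `validate` and `indicator` parameters of `DataFrame.merge`. It is an alternative to the manual checks above, not the method used in this notebook.

```python
import pandas as pd

toy_claims = pd.DataFrame({'ClaimID': ['CLM1', 'CLM2', 'CLM3'],
                           'Provider': ['PRV1', 'PRV1', 'PRV2']})
toy_labels = pd.DataFrame({'Provider': ['PRV1', 'PRV2', 'PRV3'],
                           'PotentialFraud': ['Yes', 'No', 'No']})

# validate='many_to_one' raises pandas.errors.MergeError if Provider
# repeats in toy_labels (check 1); indicator=True adds a '_merge' column
# recording where each row's key was found (check 2).
merged = toy_claims.merge(toy_labels, on='Provider', how='left',
                          validate='many_to_one', indicator=True)

# 'left_only' rows would be claims whose Provider has no label.
n_unlabeled = int((merged['_merge'] == 'left_only').sum())
print(n_unlabeled)
```

A count of zero with no `MergeError` corresponds to passing both checks.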

###### **Comparing the OP and IP claims tables**
###### *Specifically, comparing their attribute sets.*

These tables appear to be similar, but if there are any attributes that are exclusive to one or the other, I want to identify them. I do so programmatically:

In [None]:
op_df_cols = op_df.columns.values.tolist()
ip_df_cols = ip_df.columns.values.tolist()
OP_only_cols = [x for x in op_df_cols if x not in ip_df_cols]
IP_only_cols = [x for x in ip_df_cols if x not in op_df_cols]

print(f'{OP_only_cols=}')
print(f'{IP_only_cols=}')

OP_only_cols=[]
IP_only_cols=['AdmissionDt', 'DischargeDt', 'DiagnosisGroupCode']


There are no attributes in the OP claim data that aren't featured in that of the IP claims.

**However, there are three attributes that are exclusive to the IP claim data**:
- *AdmissionDt*
- *DischargeDt*
- *DiagnosisGroupCode*.
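
The same comparison can also be written with `Index.difference`, which pandas provides for column indexes. This is a sketch on toy frames that carry only a few of the real column names. Note that `difference` returns its result sorted where possible, rather than in original column order.

```python
import pandas as pd

# Toy frames with a subset of the real column names.
toy_op = pd.DataFrame(columns=['BeneID', 'ClaimID', 'Provider'])
toy_ip = pd.DataFrame(columns=['BeneID', 'ClaimID', 'Provider',
                               'AdmissionDt', 'DischargeDt'])

# Index.difference keeps the labels of the left index that are absent
# from the right one.
op_only = toy_op.columns.difference(toy_ip.columns).tolist()
ip_only = toy_ip.columns.difference(toy_op.columns).tolist()

print(f'{op_only=}')
print(f'{ip_only=}')
```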


This concludes my investigation of the tables' relationships. With this deeper familiarity, I proceed with integrating the datasets.



---


### *Saving objects to file for part #2*

In [None]:
filesave_path = '/content/gdrive/MyDrive/fraud_data_dsc540/walkthrough/part_1'
!mkdir -p {filesave_path}

bene_df.to_pickle(f'{filesave_path}/bene_df.pkl')
ip_df.to_pickle(f'{filesave_path}/ip_df.pkl')
op_df.to_pickle(f'{filesave_path}/op_df.pkl')
label_df.to_pickle(f'{filesave_path}/label_df.pkl')
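
For reference, pickled DataFrames like these can be restored with `pd.read_pickle`, which is presumably how the next part picks them up. Here is a minimal round-trip sketch, where a temporary directory and a toy frame stand in for the Google Drive folder and the real data:

```python
import tempfile
from pathlib import Path

import pandas as pd

toy_df = pd.DataFrame({'Provider': ['PRV1', 'PRV2'],
                       'PotentialFraud': ['No', 'Yes']})

# A temporary directory stands in for the Google Drive folder used above.
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / 'label_df.pkl'
    toy_df.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# The round trip preserves values, dtypes and the index.
print(restored.equals(toy_df))
```

Unlike a CSV round trip, pickling keeps dtypes intact. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.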
