<a href="https://colab.research.google.com/github/maxruther/HCP_Fraud_Detection/blob/main/analysis_in_segments/HPFD_4_Integration2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Integration II**
## Healthcare Provider Fraud Detection - **Part 4**

In this fourth segment of my claim fraud analysis, I complete the integration and transformation of the data. I first select only the provider-aggregate attributes created in the last section to form a predictor dataset `X`, then pair it with the labels and split the result into training and testing sets.

This section covers:
- [Forming the predictor dataset, `X`](#prepare-X)
- [Forming the label dataset, `y`](#prepare-y)
- [Rejoining those, then train-test splitting them](#rejoin-then-split)

<br></br>

### **Overview**

I have now, through aggregation, created all my provider-level attributes of interest. With these, I can form a dataset where the records each reflect a provider, as in the label data.

My process of forming that dataset breaks down into the following steps, which make up the subsections:
1. **Prepare predictor dataset `X`** - I identify and select all engineered, provider-level attributes from the working dataset, `df`.

2. **Prepare label dataset `y`** - I transform the labels from a categorical attribute to a binary numeric one.

3. **Join `X` and `y`** - I join these prepared sets on the *Provider* key attribute, reassigning `df` to the result.

4. **Finalize `X_all` and `y_all`** - I create the final predictor and label datasets, `X_all` and `y_all`. By drawing both from the joined result of `X` and `y`, I ensure that their records are in matching order.

5. **Train-test splitting** - I split `X_all` and `y_all` into training and testing subsets.

6. **Forming a full training set for EDA** - I create a full training set for EDA by concatenating `X_train` and `y_train` along the column axis.

### **Quick Setup**



**Importing libraries**

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

import pickle

**Loading objects from the preceding part**

In [None]:
# Mounting my Google Drive, where I've saved the preceding part's objects to file:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
# Loading the objects necessary for this part

# Project directory path in Google Drive (no trailing slash, since the
# subpaths appended below each begin with one)
project_dir_path = '/content/gdrive/MyDrive/fraud_data_dsc540'

# Loading the 'df' from the preceding part, 3:
part3_filepath = project_dir_path + '/walkthrough/part_3/'
df = pd.read_pickle(f'{part3_filepath}df.pkl')

# Loading the label df, which was last used in part 1:
part1_filepath = project_dir_path + '/walkthrough/part_1/'
label_df = pd.read_pickle(f'{part1_filepath}label_df.pkl')

<a name="prepare-X"></a>
### **Preparing predictor dataset** `X`

To create a full predictor dataset `X`, I identify and select my provider-level aggregate attributes from the main working dataset, `df`, plus the provider key attribute *Provider*.

Each engineered attribute's name begins with the prefix 'Prv_'. I use this pattern to identify and list them:

In [None]:
prv_fields = [col for col in df.columns if col.startswith('Prv_')]
prv_fields

['Prv_Claim_Count',
 'Prv_Claim_AmtTotal',
 'Prv_Claim_AmtAvg',
 'Prv_Claim_IPShare',
 'Prv_Claim_LengthAvg',
 'Prv_condPrev_Alz',
 'Prv_condPrev_HeartF',
 'Prv_condPrev_KidneyD',
 'Prv_condPrev_Cancer',
 'Prv_condPrev_ObstrP',
 'Prv_condPrev_Depr',
 'Prv_condPrev_Diab',
 'Prv_condPrev_IschemicH',
 'Prv_condPrev_Osteo',
 'Prv_condPrev_Rheuma',
 'Prv_condPrev_Stroke',
 'Prv_InsAmt_IP_ReimbAvg',
 'Prv_InsAmt_IP_DeductAvg',
 'Prv_InsAmt_OP_ReimbAvg',
 'Prv_InsAmt_OP_DeductAvg']
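
As a small aside, the sketch below illustrates how a substring test and a prefix test can differ when selecting columns by naming pattern. The toy column names are hypothetical and do not come from the project data; with the actual columns listed above, both tests would select the same fields.

```python
import pandas as pd

# Hypothetical column names, for illustration only.
toy = pd.DataFrame(columns=['Provider', 'Prv_Claim_Count', 'Bene_Prv_Flag'])

# Substring test: also matches names that merely contain 'Prv_'.
by_substring = [c for c in toy.columns if 'Prv_' in c]

# Prefix test: matches only names that begin with 'Prv_'.
by_prefix = [c for c in toy.columns if c.startswith('Prv_')]

print(by_substring)  # ['Prv_Claim_Count', 'Bene_Prv_Flag']
print(by_prefix)     # ['Prv_Claim_Count']
```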

To create a first full dataset of predictors `X`, I project onto these aggregate attributes and _Provider_, then drop the duplicate rows, which should leave one record per provider:

In [None]:
X = df[['Provider'] + prv_fields].drop_duplicates().sort_values('Provider', ascending=True)
X

Unnamed: 0,Provider,Prv_Claim_Count,Prv_Claim_AmtTotal,Prv_Claim_AmtAvg,Prv_Claim_IPShare,Prv_Claim_LengthAvg,Prv_condPrev_Alz,Prv_condPrev_HeartF,Prv_condPrev_KidneyD,Prv_condPrev_Cancer,...,Prv_condPrev_Depr,Prv_condPrev_Diab,Prv_condPrev_IschemicH,Prv_condPrev_Osteo,Prv_condPrev_Rheuma,Prv_condPrev_Stroke,Prv_InsAmt_IP_ReimbAvg,Prv_InsAmt_IP_DeductAvg,Prv_InsAmt_OP_ReimbAvg,Prv_InsAmt_OP_DeductAvg
2572,PRV51001,25,104640,4185.600000,0.200000,1.440000,0.583333,0.750000,0.708333,0.208333,...,0.375000,0.833333,0.916667,0.250000,0.333333,0.208333,18047.916667,890.000000,2537.500000,474.916667
1056,PRV51003,132,605670,4588.409091,0.469697,3.674242,0.376068,0.598291,0.444444,0.085470,...,0.401709,0.743590,0.846154,0.239316,0.273504,0.076923,6814.017094,822.632479,2490.598291,664.529915
4172,PRV51004,149,52170,350.134228,0.000000,1.429530,0.434783,0.594203,0.340580,0.115942,...,0.434783,0.695652,0.710145,0.311594,0.297101,0.115942,4596.739130,454.144928,2095.144928,600.869565
1052,PRV51005,1165,280910,241.124464,0.000000,1.088412,0.333333,0.531313,0.359596,0.119192,...,0.371717,0.634343,0.701010,0.286869,0.256566,0.078788,3717.232323,398.698990,1798.808081,475.965657
3723,PRV51007,72,33710,468.194444,0.041667,0.958333,0.362069,0.517241,0.293103,0.103448,...,0.362069,0.620690,0.689655,0.293103,0.275862,0.155172,3109.655172,423.517241,1497.241379,430.689655
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21682,PRV57759,28,10640,380.000000,0.000000,2.142857,0.500000,0.750000,0.541667,0.125000,...,0.291667,0.750000,1.000000,0.458333,0.333333,0.125000,3414.166667,445.000000,2910.416667,755.000000
39875,PRV57760,22,4770,216.818182,0.000000,0.318182,0.222222,0.444444,0.222222,0.000000,...,0.444444,0.666667,1.000000,0.444444,0.111111,0.000000,1240.000000,237.333333,1883.333333,832.222222
5906,PRV57761,82,18470,225.243902,0.000000,1.390244,0.432836,0.671642,0.462687,0.164179,...,0.417910,0.686567,0.761194,0.358209,0.328358,0.119403,6737.313433,573.850746,2506.716418,631.492537
511155,PRV57762,1,1900,1900.000000,0.000000,0.000000,0.000000,0.000000,1.000000,1.000000,...,0.000000,1.000000,1.000000,0.000000,0.000000,0.000000,15000.000000,1068.000000,2540.000000,400.000000
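
To illustrate the projection-then-deduplication step, here is a minimal sketch on toy data. The frame `claims` and its values are hypothetical; the sketch only assumes, as in the working dataset, that each provider-level aggregate is repeated on every claim of the same provider.

```python
import pandas as pd

# Hypothetical claim-level data: one row per claim, with the provider-level
# aggregate repeated on every claim of the same provider.
claims = pd.DataFrame({
    'ClaimID': ['C1', 'C2', 'C3'],
    'Provider': ['PRV2', 'PRV1', 'PRV2'],
    'Prv_Claim_Count': [2, 1, 2],
})

# Projecting onto the provider-level columns leaves repeated rows;
# drop_duplicates keeps the first of each, giving one row per provider.
providers = (claims[['Provider', 'Prv_Claim_Count']]
             .drop_duplicates()
             .sort_values('Provider'))

print(providers)
```

The surviving rows keep their original row labels, which is why the index of `X` above is non-sequential.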


I finish this section by making a couple of cursory checks on this new dataset:

1) Are all values of *Provider* in `X` unique?

2) Is every value of *Provider* in `X` contained in this same attribute of the labels?

It's on this *Provider* attribute that I will soon join this predictor dataset to the labels, so it's important to know whether either check fails (especially the first one).

In [None]:
# Check if all 'Provider' values in X are unique
if X['Provider'].nunique() == X.shape[0]:
  print("Confirmed: all 'Provider' values in X are unique.", end='\n\n')
else:
  print("Uh oh, there exist 'Provider' duplicate values in X.")

# Check whether any 'Provider' values in X are missing from the label's
# 'Provider' field.
if X['Provider'].isin(label_df['Provider']).all():
  print("Confirmed: all 'Provider' values in X are contained in the label's "
        "'Provider' field.")
else:
  print("Uh oh, there exist 'Provider' values in X that aren't contained in "
        "the label's 'Provider' field.")

Confirmed: all 'Provider' values in X are unique.

Confirmed: all 'Provider' values in X are contained in the label's 'Provider' field.
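
Had either check failed, it would help to know which keys were at fault. The following is a rough sketch of how the offending values could be listed; `X_toy` and `labels_toy` are hypothetical stand-ins for `X` and `label_df`.

```python
import pandas as pd

# Hypothetical stand-ins for X and label_df.
X_toy = pd.DataFrame({'Provider': ['PRV1', 'PRV2', 'PRV2', 'PRV9']})
labels_toy = pd.DataFrame({'Provider': ['PRV1', 'PRV2', 'PRV3']})

# Keys that occur more than once in the predictors.
duplicated_keys = X_toy.loc[X_toy['Provider'].duplicated(), 'Provider'].unique()

# Keys in the predictors that have no counterpart in the labels.
unmatched_keys = X_toy.loc[~X_toy['Provider'].isin(labels_toy['Provider']),
                           'Provider'].unique()

print(list(duplicated_keys))  # ['PRV2']
print(list(unmatched_keys))   # ['PRV9']
```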


I can therefore expect every predictor record in `X` to be joined to a label (which, for now, still resides in `label_df`).


<a name="prepare-y"></a>
### **Preparing label dataset `y`**

To make the target attribute _PotentialFraud_ more amenable to statistical analysis and modelling, I need to transform it from a categorical binary variable to a numeric one.

First, I glance at the label data's first few records:

In [None]:
label_df.head(3)

Unnamed: 0,Provider,PotentialFraud
0,PRV51001,No
1,PRV51003,Yes
2,PRV51004,No


Next, I create a working label set `y` by transforming a copy of the original set of labels. By making such a copy instead of transforming directly, I retain a reference to the original label data, in `label_df`. This reference will be useful if I ever need to backtrack.

Creating said copy in `y`:

In [None]:
y = label_df.copy()

Encoding _PotentialFraud_ into a temporary numeric attribute, _Potential**ly**Fraud_, where 'No' becomes 0 and 'Yes' becomes 1:

In [None]:
y['PotentiallyFraud'] = pd.factorize(y['PotentialFraud'])[0]
y.head(3)

Unnamed: 0,Provider,PotentialFraud,PotentiallyFraud
0,PRV51001,No,0
1,PRV51003,Yes,1
2,PRV51004,No,0
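
One caveat worth illustrating: `pd.factorize` numbers the categories in order of first appearance, so 'No' receives 0 above because the first record happens to be 'No'. The sketch below, on hypothetical toy labels, shows that behaviour alongside an explicit mapping that does not depend on row order. It is offered as an alternative under that assumption, not as the method used in this notebook.

```python
import pandas as pd

# Hypothetical labels whose first record is 'Yes'.
toy_labels = pd.Series(['Yes', 'No', 'Yes'], name='PotentialFraud')

# factorize numbers the categories in order of first appearance,
# so here 'Yes' receives code 0.
codes, uniques = pd.factorize(toy_labels)
print(codes.tolist(), list(uniques))  # [0, 1, 0] ['Yes', 'No']

# An explicit mapping fixes the encoding regardless of row order.
encoded = toy_labels.map({'No': 0, 'Yes': 1})
print(encoded.tolist())  # [1, 0, 1]
```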


Dropping the original target attribute *PotentialFraud*, then giving its name to the encoded one:

In [None]:
y = y.drop('PotentialFraud', axis=1)
y = y.rename(columns={'PotentiallyFraud': 'PotentialFraud'})
y.head(3)

Unnamed: 0,Provider,PotentialFraud
0,PRV51001,0
1,PRV51003,1
2,PRV51004,0


<a name="rejoin-then-split"></a>
### **Joining `X` and `y`**

I now join these prepared sets of predictors and labels `X` and `y` to create a full dataset, reassigning `df` to that result.

In [None]:
df = X.merge(y, how='inner', on='Provider')
df.head(3)

Unnamed: 0,Provider,Prv_Claim_Count,Prv_Claim_AmtTotal,Prv_Claim_AmtAvg,Prv_Claim_IPShare,Prv_Claim_LengthAvg,Prv_condPrev_Alz,Prv_condPrev_HeartF,Prv_condPrev_KidneyD,Prv_condPrev_Cancer,...,Prv_condPrev_Diab,Prv_condPrev_IschemicH,Prv_condPrev_Osteo,Prv_condPrev_Rheuma,Prv_condPrev_Stroke,Prv_InsAmt_IP_ReimbAvg,Prv_InsAmt_IP_DeductAvg,Prv_InsAmt_OP_ReimbAvg,Prv_InsAmt_OP_DeductAvg,PotentialFraud
0,PRV51001,25,104640,4185.6,0.2,1.44,0.583333,0.75,0.708333,0.208333,...,0.833333,0.916667,0.25,0.333333,0.208333,18047.916667,890.0,2537.5,474.916667,0
1,PRV51003,132,605670,4588.409091,0.469697,3.674242,0.376068,0.598291,0.444444,0.08547,...,0.74359,0.846154,0.239316,0.273504,0.076923,6814.017094,822.632479,2490.598291,664.529915,1
2,PRV51004,149,52170,350.134228,0.0,1.42953,0.434783,0.594203,0.34058,0.115942,...,0.695652,0.710145,0.311594,0.297101,0.115942,4596.73913,454.144928,2095.144928,600.869565,0
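
As an optional safeguard, `merge` accepts a `validate` argument that enforces the expected key relationship at join time. The sketch below uses hypothetical toy frames in place of `X` and `y`.

```python
import pandas as pd

# Hypothetical stand-ins for X and y.
X_toy = pd.DataFrame({'Provider': ['PRV1', 'PRV2'], 'Claim_Count': [25, 132]})
y_toy = pd.DataFrame({'Provider': ['PRV1', 'PRV2', 'PRV3'],
                      'PotentialFraud': [0, 1, 0]})

# validate='one_to_one' raises pandas.errors.MergeError if 'Provider'
# is duplicated on either side, instead of silently multiplying rows.
joined = X_toy.merge(y_toy, how='inner', on='Provider',
                     validate='one_to_one')

# An inner join keeps only the providers present on both sides.
print(joined.shape)  # (2, 3)
```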


#### **Finalizing `X_all` and `y_all`**

Following the join, the predictor records are now matched to their labels. Now that I'm sure of their correspondence, I create the final, full predictor and target datasets, `X_all` and `y_all`.

**`X_all`**

Creating the set of predictors `X_all` by removing the key identifier _Provider_ as well as the target variable _PotentialFraud_:

In [None]:
# Excluding the first and last fields: Provider and PotentialFraud.
X_all = df.iloc[:, 1:-1]
X_all.head(3)

Unnamed: 0,Prv_Claim_Count,Prv_Claim_AmtTotal,Prv_Claim_AmtAvg,Prv_Claim_IPShare,Prv_Claim_LengthAvg,Prv_condPrev_Alz,Prv_condPrev_HeartF,Prv_condPrev_KidneyD,Prv_condPrev_Cancer,Prv_condPrev_ObstrP,Prv_condPrev_Depr,Prv_condPrev_Diab,Prv_condPrev_IschemicH,Prv_condPrev_Osteo,Prv_condPrev_Rheuma,Prv_condPrev_Stroke,Prv_InsAmt_IP_ReimbAvg,Prv_InsAmt_IP_DeductAvg,Prv_InsAmt_OP_ReimbAvg,Prv_InsAmt_OP_DeductAvg
0,25,104640,4185.6,0.2,1.44,0.583333,0.75,0.708333,0.208333,0.375,0.375,0.833333,0.916667,0.25,0.333333,0.208333,18047.916667,890.0,2537.5,474.916667
1,132,605670,4588.409091,0.469697,3.674242,0.376068,0.598291,0.444444,0.08547,0.282051,0.401709,0.74359,0.846154,0.239316,0.273504,0.076923,6814.017094,822.632479,2490.598291,664.529915
2,149,52170,350.134228,0.0,1.42953,0.434783,0.594203,0.34058,0.115942,0.268116,0.434783,0.695652,0.710145,0.311594,0.297101,0.115942,4596.73913,454.144928,2095.144928,600.869565


In [None]:
print(f'The predictor data has:\n\n{X_all.shape[0]} records\n{X_all.shape[1]} attributes')

The predictor data has:

5410 records
20 attributes


To improve the legibility of the predictors' names, I now remove their 'Prv_' prefixes:

In [None]:
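# Build a mapping from each current name to the same name without 'Prv_'.
curr_names = list(X_all.columns)
renamings = [x.replace('Prv_', '') for x in curr_names]
renaming_dict = dict(zip(curr_names, renamings))

# Reassign rather than rename in place, since X_all was sliced from df.
X_all = X_all.rename(columns=renaming_dict)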

list(X_all.columns)

['Claim_Count',
 'Claim_AmtTotal',
 'Claim_AmtAvg',
 'Claim_IPShare',
 'Claim_LengthAvg',
 'condPrev_Alz',
 'condPrev_HeartF',
 'condPrev_KidneyD',
 'condPrev_Cancer',
 'condPrev_ObstrP',
 'condPrev_Depr',
 'condPrev_Diab',
 'condPrev_IschemicH',
 'condPrev_Osteo',
 'condPrev_Rheuma',
 'condPrev_Stroke',
 'InsAmt_IP_ReimbAvg',
 'InsAmt_IP_DeductAvg',
 'InsAmt_OP_ReimbAvg',
 'InsAmt_OP_DeductAvg']

**`y_all`**

Creating the set of labels `y_all` by only including the _PotentialFraud_ attribute:

In [None]:
y_all = df.iloc[:, -1]
y_all.shape

(5410,)

In [None]:
print(f'The target data has:\n\n{y_all.shape[0]} records\n1 attribute')

The target data has:

5410 records
1 attribute


In [None]:
y_all.head(3)

Unnamed: 0,PotentialFraud
0,0
1,1
2,0


#### **Train-test splitting**

From these final predictor and label sets, I can create subsets for training and testing.

Before splitting, I fix a seed for the random state, so that the split gives the same result on every run and the analysis and modelling stay reproducible.

##### **Random State (42)**

I keep the seed in a variable, `rand_st`, so that the same value can be passed to every step that accepts one.

In [None]:
rand_st = 42

##### **Executing the train-test split:**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.2,
                                                    stratify=y_all,
                                                    random_state=rand_st)

In [None]:
print(f'{X_train.shape=}', f'{X_test.shape=}', '',
      f'{y_train.shape=}', f'{y_test.shape=}',
      sep = '\n')

X_train.shape=(4328, 20)
X_test.shape=(1082, 20)

y_train.shape=(4328,)
y_test.shape=(1082,)
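
The `stratify` argument asks `train_test_split` to keep the class proportions of the labels the same in both subsets. A minimal sketch of that effect, on hypothetical toy data with a 10% positive rate:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10 positives among 100 records.
X_toy = pd.DataFrame({'feature': range(100)})
y_toy = pd.Series([1] * 10 + [0] * 90, name='PotentialFraud')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

# With stratify, the positives are shared out in proportion to subset size.
print(len(y_tr), int(y_tr.sum()))  # 80 8
print(len(y_te), int(y_te.sum()))  # 20 2
```

The same comparison could be made on the real subsets with `y_train.mean()` and `y_test.mean()`.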


#### **For EDA, merge `X_train` and `y_train`**

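It is best practice to perform EDA on the training data only. Leaving the test data unexamined preserves its 'unseen' quality, which is conventionally considered crucial for validation. To that end, I concatenate `X_train` and `y_train` along the column axis to form a full training set, `df_train`.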

In [None]:
print(f'{X_train.shape=}')
print(f'{y_train.shape=}')

X_train.shape=(4328, 20)
y_train.shape=(4328,)


In [None]:
df_train = pd.concat([X_train, y_train], axis=1)
df_train.head(3)

Unnamed: 0,Claim_Count,Claim_AmtTotal,Claim_AmtAvg,Claim_IPShare,Claim_LengthAvg,condPrev_Alz,condPrev_HeartF,condPrev_KidneyD,condPrev_Cancer,condPrev_ObstrP,...,condPrev_Diab,condPrev_IschemicH,condPrev_Osteo,condPrev_Rheuma,condPrev_Stroke,InsAmt_IP_ReimbAvg,InsAmt_IP_DeductAvg,InsAmt_OP_ReimbAvg,InsAmt_OP_DeductAvg,PotentialFraud
3090,105,25220,240.190476,0.0,1.6,0.406593,0.450549,0.263736,0.10989,0.164835,...,0.703297,0.703297,0.340659,0.285714,0.076923,3499.120879,484.703297,2026.153846,603.956044,0
3462,14,8320,594.285714,0.0,0.571429,0.5,0.875,0.5,0.125,0.125,...,0.75,0.75,0.625,0.25,0.125,4662.5,400.5,3138.75,828.75,0
3446,144,36470,253.263889,0.0,0.847222,0.416667,0.62963,0.37037,0.194444,0.268519,...,0.62963,0.731481,0.333333,0.37037,0.074074,3988.055556,454.888889,1721.759259,500.648148,0


In [None]:
print(f'{df_train.shape=}')

df_train.shape=(4328, 21)
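
This concatenation is safe because `pd.concat` with `axis=1` aligns rows by index label rather than by position, and `X_train` and `y_train` carry the same shuffled index. A small sketch with hypothetical values, where the two inputs are deliberately given in different row orders:

```python
import pandas as pd

# Hypothetical training subsets that share the same index labels,
# listed here in different orders.
X_tr = pd.DataFrame({'Claim_Count': [105, 14]}, index=[3090, 3462])
y_tr = pd.Series([0, 1], index=[3462, 3090], name='PotentialFraud')

# concat with axis=1 aligns rows by index label, not by position,
# so each label lands beside its own predictors.
df_tr = pd.concat([X_tr, y_tr], axis=1)
print(df_tr)
```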


### **Save the various dataframes to file:**

The desired datasets are finally complete and ready for EDA. I save them to file before proceeding to that section:

In [None]:
# Save these preprocessed datasets as pickle files
data_save_filepath = project_dir_path + '/walkthrough/data/preprocessed/pkl/'
!mkdir -p {data_save_filepath}

df_train.to_pickle(data_save_filepath + 'df_train.pkl')
X_train.to_pickle(data_save_filepath + 'X_train.pkl')
y_train.to_pickle(data_save_filepath + 'y_train.pkl')
X_test.to_pickle(data_save_filepath + 'X_test.pkl')
y_test.to_pickle(data_save_filepath + 'y_test.pkl')

# Save datasets as csv files
data_save_filepath = project_dir_path + '/walkthrough/data/preprocessed/csv/'
!mkdir -p {data_save_filepath}

df_train.to_csv(data_save_filepath + 'df_train.csv')
X_train.to_csv(data_save_filepath + 'X_train.csv')
y_train.to_csv(data_save_filepath + 'y_train.csv')
X_test.to_csv(data_save_filepath + 'X_test.csv')
y_test.to_csv(data_save_filepath + 'y_test.csv')
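
The two formats behave differently on reload, which matters for the next part. The sketch below round-trips a hypothetical toy frame through a temporary directory; the file names are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame standing in for df_train.
toy = pd.DataFrame({'Claim_Count': [105, 14], 'PotentialFraud': [0, 1]},
                   index=[3090, 3462])

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / 'toy.pkl'
    csv_path = Path(tmp_dir) / 'toy.csv'

    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path)

    # Pickle restores the frame as it was, index and dtypes included.
    from_pkl = pd.read_pickle(pkl_path)

    # CSV writes the index as an unnamed first column, so it has to be
    # declared on reading to recover the original row labels.
    from_csv = pd.read_csv(csv_path, index_col=0)

print(from_pkl.equals(toy))
print(from_csv.index.tolist())
```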



---


### *Saving objects to file for part #5*

In [None]:
filesave_path = project_dir_path + '/walkthrough/part_4'
!mkdir -p {filesave_path}

df.to_pickle(f'{filesave_path}/df.pkl')
df_train.to_pickle(f'{filesave_path}/df_train.pkl')
X_train.to_pickle(f'{filesave_path}/X_train.pkl')
y_train.to_pickle(f'{filesave_path}/y_train.pkl')
X_test.to_pickle(f'{filesave_path}/X_test.pkl')
y_test.to_pickle(f'{filesave_path}/y_test.pkl')

with open(f'{filesave_path}/rand_st.pkl', 'wb') as file:
  pickle.dump(rand_st, file)