---
---

# **`Medicare Provider Fraud Detection`**

---
---

In [None]:
from IPython.display import Image
Image("../input/medicare-prv-fraud-files/Display_Pic.png")

**This Kernel comprises the below tasks:**
   - EDA on Beneficiary Data
   
   - EDA on In-patients & Out-patients Data
   
   - EDA on entire data after joining with Provider target labels
   
   - Feature Engineered (SET-1):
       - Basic EDA features
       - Single level Aggregated features
       - Multiple levels Aggregated features
   
   
   - Models trained and evaluated on SET-1 features
   
   
   - Feature Engineered (SET-2):
       - Basic EDA features
       - Code Embeddings Similarity Features:
           - CAD <--> Dx
           - CAD <--> PROC
           - Dx <--> PROC
       - Single level Aggregated features (excluding CAD, Dx & PROC codes)
       - Multiple levels Aggregated features (excluding CAD, Dx & PROC codes)
   
   
   - Models trained and evaluated on SET-2 features
   

**Check out the WebApp below to access the best trained model for this problem:**

- [WebApp](https://medicare-prv-fraud-detection.herokuapp.com/)


**Check out the link below for BUSINESS related insights about this problem:**

- [Deck : Detailed Explanation](https://docs.google.com/presentation/d/1Thuw_eZskafkl9W3xYuEVsjgTJKFeEkzEfAmczH96Uw/)


**Check out the link below for a TECHNICAL description of this problem:**

- [Technical Document](https://docs.google.com/document/d/10z9xbn4dZWkforlAszCDaa1M0roAW-MOf2pjJHSnEXk/)


**Check out the link below for an in-depth Description and Reasoning of all the Features:**

- [Features Description](https://docs.google.com/spreadsheets/d/1ktwjad3U-hGT_7yccGyZC4AzGhWAMs3RHrIMk8Gi8xQ/)

### **Importing_Libraries**

In [None]:
import os
import sys
import math
import scipy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
pd.set_option('display.max_columns',30)
label_font_dict = {'family':'sans-serif','size':13.5,'color':'brown','style':'italic'}
title_font_dict = {'family':'sans-serif','size':16.5,'color':'Blue','style':'italic'}

# **BENE Data - EDA**

In [None]:
train_bene_df = pd.read_csv("input/medicare-prv-fraud-files/Train_Beneficiarydata-1542865627584.csv")
train_ip_df = pd.read_csv("input/healthcare-provider-fraud-detection-analysis/Train_Inpatientdata-1542865627584.csv")
train_op_df = pd.read_csv("input/healthcare-provider-fraud-detection-analysis/Train_Outpatientdata-1542865627584.csv")

In [None]:
train_bene_df.shape

In [None]:
train_bene_df.columns

In [None]:
train_bene_df.dtypes

In [None]:
train_bene_df.head()

### **Q1. How many unique beneficiaries do we have in our dataset?**

In [None]:
train_bene_df['BeneID'].nunique()

### **Q2. How many records do we have at the GENDER level?**

In [None]:
train_bene_df['Gender'].unique()

In [None]:
# Recode GENDER to a binary flag: code 2 becomes 0, every other code becomes 1
train_bene_df['Gender'] = train_bene_df['Gender'].apply(lambda val: 0 if val == 2 else 1)

In [None]:
# Here, I'm displaying the distribution of BENEFICIARIES on the basis of GENDER 
with plt.style.context('seaborn'):
  plt.figure(figsize=(10,8))
  fig = train_bene_df['Gender'].value_counts().plot(kind='bar', color=['yellow','purple'])
  # The "patches" attribute of the axes holds the rectangle bars of the graph.
  ## Using each bar's location (x, y) and size (width & height) we add the % annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/train_bene_df.shape[0],2))+"%"}', (x + width/2, y + height*1.015), ha='center', fontsize=13.5)
  # Providing the labels and title to the graph
  plt.xlabel("Gender Code", fontdict=label_font_dict)
  plt.ylabel("Number or % share of patients\n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Distribution of BENEFICIARIES based on GENDER\n", fontdict=title_font_dict)

**`OBSERVATION`**
* From the above plot, we can deduce that the ratio between GENDER_0 : GENDER_1 is 57% : 43%.
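
The percentages annotated on the bars can also be cross-checked numerically. The snippet below is a minimal sketch on made-up toy values (not the real `train_bene_df`); it only illustrates that `value_counts(normalize=True)` yields the same share that the annotation loop computes by hand.

```python
import pandas as pd

# Toy stand-in for train_bene_df['Gender'] after the 0/1 recoding (values are made up).
gender = pd.Series([0, 0, 0, 1, 1], name="Gender")

# normalize=True returns fractions instead of counts; multiply by 100 for percentages.
gender_share_pct = (gender.value_counts(normalize=True) * 100).round(2)
print(gender_share_pct)
```

On the real data, replacing `gender` with `train_bene_df['Gender']` should reproduce the figures shown on the plot.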





### **Q3. Let's calculate the AGE of every BENEFICIARY.**

In [None]:
train_bene_df['DOB'] = pd.to_datetime(train_bene_df['DOB'], format="%Y-%m-%d")

In [None]:
# Despite the "Age" in their names, these columns hold the YEAR and MONTH of birth
train_bene_df['Patient_Age_Year'] = train_bene_df['DOB'].dt.year
train_bene_df['Patient_Age_Month'] = train_bene_df['DOB'].dt.month

* **`Adding new feature`** <::::::::::::::::::::::::> **"`YEAR of birth of beneficiaries`"**

In [None]:
bene_age_year_df = pd.DataFrame(train_bene_df['Patient_Age_Year'].value_counts()).reset_index(drop=False)
bene_age_year_df.columns= ['year','num_of_beneficiaries']
bene_age_year_df = bene_age_year_df.sort_values(by='year')

In [None]:
# Here, I'm displaying the distribution of BENEFICIARIES on the basis of their YEAR of Birth 
with plt.style.context('seaborn'):
  plt.figure(figsize=(21,9))
  fig = sns.barplot(data=bene_age_year_df, x='year', y='num_of_beneficiaries', palette='inferno')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/train_bene_df.shape[0],1))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
  # Providing the labels and title to the graph
  plt.xlabel("\nBeneficiary YEAR of Birth", fontdict=label_font_dict)
  plt.xticks(rotation=90)
  plt.ylabel("Number or % share of patients\n", fontdict=label_font_dict)
  plt.minorticks_on()
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.title("Distribution of BENEFICIARIES based on their YEAR of birth\n", fontdict=title_font_dict)

**`OBSERVATION`**
*   From the above plot, we can deduce that the majority of the beneficiaries were born between 1919 and 1943.
  *   More specifically, the share of beneficiaries born between 1939 and 1943 is the highest, whereas it is the lowest for those born between 1978 and 1983.


**`REASONING`**
*   The "YEAR of Birth of Beneficiaries" is added as a new feature with the intent of finding whether personal details have been forged by the beneficiary, or whether the year of birth has a direct/indirect association with Fraud claims.

* **`Adding new feature`** <::::::::::::::::::::::::> **"`MONTH of birth of beneficiaries`"**

In [None]:
bene_age_month_df = pd.DataFrame(train_bene_df['Patient_Age_Month'].value_counts()).reset_index(drop=False)
bene_age_month_df.columns= ['month','num_of_beneficiaries']
bene_age_month_df = bene_age_month_df.sort_values(by='month')

In [None]:
# Here, I'm displaying the distribution of BENEFICIARIES on the basis of their MONTH of Birth 
with plt.style.context('seaborn'):
  plt.figure(figsize=(12,8))
  fig = sns.barplot(data=bene_age_month_df, x='month', y='num_of_beneficiaries', palette='summer')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/train_bene_df.shape[0],1))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
  # Providing the labels and title to the graph
  plt.xlabel("\nBeneficiary MONTH of Birth\n", fontdict=label_font_dict)
  plt.xticks(rotation=90)
  plt.ylabel("Number or % share of patients", fontdict=label_font_dict)
  plt.minorticks_on()
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.title("Distribution of BENEFICIARIES based on their MONTH of birth\n", fontdict=title_font_dict)

**`OBSERVATION`**
* From the above plot, we can deduce that there is no notable difference between the patients based on their MONTH of birth.
  * An initial look suggests that this feature might not be of much use.


**`REASONING`**
* The "MONTH of Birth of Beneficiaries" is added as a new feature with the intent of finding whether personal details have been forged by the beneficiary, or whether the month of birth has a direct/indirect association with Fraud claims.

In [None]:
train_bene_df.head()

* **`Adding new indicator`** <::::::::::::::::::::::::> **"`Beneficiary Dead or Alive?`"**
  * This represents whether a patient is alive or not.
    * The point to understand here is that there can be a ONE to MANY relationship between a BENEFICIARY and the CLAIMS filed. I'm therefore assuming that beneficiaries with DOD as NA are alive; however, this may not always hold once we join with the CLAIMS data.
      * For example, a patient is successfully operated on for Kidney Failure in Sep 2012 and files CLAIM-I. This claim gets approved with DOD as NA.
        * The same patient is operated on 4 months later for Cardiac Failure but does not survive, and CLAIM-II is filed. Thus, for CLAIM-I this person should have the DEAD_or_ALIVE indicator as FALSE, whereas for CLAIM-II it should be TRUE.


* **`REASONING`**
  * My intention behind adding this feature is to see whether FRAUD claims are higher for DEAD patients.
    * The reasoning is that in an organized fraud, healthcare providers may falsely diagnose, misdiagnose, or overdiagnose a disease, which can lead to harmful medications, treatments, or procedures being prescribed or performed. All of this together can lead to patient death.
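
The CLAIM-I / CLAIM-II example above suggests that the indicator is really a claim-level property. The following is a minimal sketch of that idea on made-up data, not the approach used in this notebook: the `ClaimID` and `ClaimEndDt` column names and all values are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical toy data: one beneficiary with two claims. The ClaimID and
# ClaimEndDt column names and all dates are assumptions for this illustration.
bene = pd.DataFrame({"BeneID": ["B1"], "DOD": pd.to_datetime(["2013-01-20"])})
claims = pd.DataFrame({
    "BeneID": ["B1", "B1"],
    "ClaimID": ["CLAIM-I", "CLAIM-II"],
    "ClaimEndDt": pd.to_datetime(["2012-09-15", "2013-01-20"]),
})

# Attach the beneficiary's date of death to each of their claims.
merged = claims.merge(bene, on="BeneID", how="left")

# A claim is flagged 1 (DEAD) only if a date of death exists and falls on or
# before the claim's end date; comparisons against a missing DOD (NaT) are False.
merged["Dead_at_claim"] = (merged["DOD"] <= merged["ClaimEndDt"]).astype(int)
print(merged[["ClaimID", "Dead_at_claim"]])
```

Under these assumptions CLAIM-I is flagged as alive and CLAIM-II as dead, matching the example in the text.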

In [None]:
# 0 means ALIVE (DOD is missing) and 1 means DEAD (DOD is present)
train_bene_df['Dead_or_Alive'] = train_bene_df['DOD'].notna().astype(int)

In [None]:
train_bene_df['Dead_or_Alive'].value_counts()

In [None]:
# Here, I'm displaying the distribution of whether BENEFICIARY is ALIVE or NOT?
with plt.style.context('seaborn'):
  plt.figure(figsize=(10,8))
  fig = train_bene_df['Dead_or_Alive'].value_counts().plot(kind='bar', color=['lightgreen','coral'])
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/train_bene_df.shape[0],2))+"%"}', (x + width/2, y + height*1.015), ha='center', fontsize=13.5)
  # Providing the labels and title to the graph
  plt.xlabel("Alive or Dead Status?", fontdict=label_font_dict)
  plt.xticks(labels=["ALIVE","DEAD"], ticks=[0,1], rotation=20)
  plt.ylabel("Number or % share of patients\n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Distribution of BENEFICIARIES based on Alive or Dead Status\n", fontdict=title_font_dict)

- **`OBSERVATION`**
  - The above graph tells us that almost 99% of the beneficiaries are ALIVE and only a very small percentage of patients are DEAD. It would be good to see how much impact this feature has on the actual label.

In [None]:
train_bene_df['DOD'] = pd.to_datetime(train_bene_df['DOD'])

In [None]:
# Latest Date of Death in the TRAIN set for beneficiaries (Series.max() skips missing values)
max_bene_DOD = train_bene_df['DOD'].max()
max_bene_DOD

In [None]:
# For all NaN DODs, fill in the latest Date of Death (assigning back avoids in-place edits on a column selection)
train_bene_df['DOD'] = train_bene_df['DOD'].fillna(value=max_bene_DOD)

* **`Adding new feature`** <::::::::::::::::::::::::> **"`Beneficiary AGE`"**
  * This represents the AGE of the beneficiary.


* **`REASONING`**
  * My intention behind adding this feature is to better analyse the CLAIM and Patient related data.

In [None]:
train_bene_df['AGE'] = np.round(((train_bene_df['DOD'] - train_bene_df['DOB']).dt.days)/365.0,1)

In [None]:
train_bene_df.drop(labels=['DOD'],axis=1,inplace=True)

In [None]:
train_bene_df.head()

In [None]:
# Here, I'm displaying the distribution of AGE of Beneficiaries?
with plt.style.context('seaborn'):
  plt.figure(figsize=(10,8))
  train_bene_df['AGE'].plot(kind='hist', color='purple')
  # Providing the labels and title to the graph
  plt.xlabel("\nBeneficiaries Age in years", fontdict=label_font_dict)
  plt.ylabel("Frequency of patients\n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Distribution of BENEFICIARIES AGE", fontdict=title_font_dict)
  plt.legend();

- **`OBSERVATION`**
  - The above graph tells us that the majority of the beneficiaries are between 65 and 85 years old.

In [None]:
train_bene_df['AGE'].describe()

* **`Adding new categorical column`** <::::::::::::::::::::::::> **"`Beneficiary AGE brackets`"**
  * This represents the AGE bracket that a patient falls into:
    * [1, 40] --> Group 1 --> Young
    * (40, 60] --> Group 2 --> Mid Aged
    * (60, 80] --> Group 3 --> Old
    * (80 or more) --> Group 4 --> Very Old


* **`REASONING`**
  * My intention behind adding this feature is to see whether FRAUD claims are higher for specific AGE Groups.
    * The reasoning here is that there may be a pattern where providers file a higher number of FRAUD claims for either a younger age group or a very old age group.
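
As an alternative way of expressing the brackets listed above, the sketch below uses `pd.cut` on made-up ages. It is an illustration under stated assumptions rather than the notebook's method: the bin edges mirror the brackets in the text, and unlike a catch-all `else` branch, values below 1 come out as missing instead of being labelled.

```python
import numpy as np
import pandas as pd

# Made-up ages that sit on and around each bracket boundary.
ages = pd.Series([1, 40, 40.1, 60, 75, 80, 80.5, 101])

# Right-closed bins: [1, 40], (40, 60], (60, 80], (80, inf).
# include_lowest=True makes the first bin include its left edge (age 1).
age_groups = pd.cut(
    ages,
    bins=[1, 40, 60, 80, np.inf],
    labels=["Young", "Mid", "Old", "Very Old"],
    include_lowest=True,
)
print(pd.DataFrame({"AGE": ages, "AGE_groups": age_groups}))
```

The result is an ordered categorical, which keeps the groups in a sensible order when plotting or grouping.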

In [None]:
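def bene_age_brackets(val):
  """Map an AGE in years to its bracket label."""
  if 1 <= val <= 40:
    return 'Young'
  elif 40 < val <= 60:
    return 'Mid'
  elif 60 < val <= 80:
    return 'Old'
  else:
    # Catch-all branch: ages above 80, plus any value not matched above (e.g. below 1 or missing)
    return 'Very Old'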

In [None]:
train_bene_df['AGE_groups'] = train_bene_df['AGE'].apply(bene_age_brackets)

In [None]:
age_grps = list(train_bene_df['AGE_groups'].unique())
for grp in age_grps:
  # Here, I'm displaying the distribution of AGE GROUPS of Beneficiaries?
  with plt.style.context('seaborn'):
    plt.figure(figsize=(8,6))
    train_bene_df[train_bene_df['AGE_groups'] == grp]['AGE'].plot(kind='hist', color='grey')
    # Providing the labels and title to the graph
    plt.xlabel("\nBeneficiaries Age in years", fontdict=label_font_dict)
    plt.ylabel("Frequency of patients\n", fontdict=label_font_dict)
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title("Distribution of BENEFICIARIES Age group -- {}".format(str(grp).upper()), fontdict=title_font_dict)
    plt.legend();

- **`OBSERVATION`**
  - The above graphs tell us about the spread of beneficiaries within each of the Age Groups.
    - For the YOUNG group, we can say that the spread is quite even across the ages of beneficiaries.

### **Q4. Let's see the ratio of GENDER across the various HUMAN RACES.**

*  **The world population can be divided into 4 major races:**
  * white/Caucasian 
  * Mongoloid/Asian
  * Negroid/Black
  * Australoid. 

**This is based on a racial classification made by Carleton S. Coon in 1962. Refer [here](https://www.umsl.edu/~naumannj/culture%20and%20cultural%20geography/articles/How%20many%20major%20races%20are%20there%20in%20the%20world.docx#:~:text=The%20world%20population%20can%20be,classification%20made%20by%20Carleton%20S.).**

In [None]:
train_bene_df['Race'].unique()

In [None]:
# Here, I'm displaying the distribution of BENEFICIARIES on the basis of Human RACE
with plt.style.context('seaborn'):
  plt.figure(figsize=(10,8))
  fig = train_bene_df['Race'].value_counts().plot(kind='bar', color=['lightgreen','coral','purple','red'])
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/train_bene_df.shape[0],2))+"%"}', (x + width/2, y + height*1.02), ha='center', fontsize=13.5)
  # Providing the labels and title to the graph
  plt.xlabel("Human RACE", fontdict=label_font_dict)
  plt.xticks(labels=["Race_1","Race_2","Race_3","Race_5"], ticks=[0,1,2,3], rotation=10)
  plt.ylabel("Number or % share of patients\n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Distribution of BENEFICIARIES on the basis of Human RACE\n", fontdict=title_font_dict)

- **`OBSERVATION`**
  - The above graph tells us that there is a serious imbalance in the records for Human Race categories.

In [None]:
# Lets validate whether we have imbalance of males and females across the human races
with plt.style.context('seaborn'):
  plt.figure(figsize=(12,8))
  fig = train_bene_df.groupby(['Gender','Race'])['AGE'].count().plot(kind='bar', color=['lightgreen','coral','purple','red','lightgreen','coral','purple','red'])
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/train_bene_df.shape[0],2))+"%"}', (x + width/2, y + height*1.02), ha='center', fontsize=13.5)
  # Providing the labels and title to the graph
  plt.xlabel("Gender & Human RACE", fontdict=label_font_dict)
  plt.xticks(labels=[("Gender_0","Race_1"), ("Gender_0","Race_2"), ("Gender_0","Race_3"), ("Gender_0","Race_5"),("Gender_1","Race_1"), ("Gender_1","Race_2"), ("Gender_1","Race_3"), ("Gender_1","Race_5")],
             ticks=[0,1,2,3,4,5,6,7], rotation=80)
  plt.ylabel("Number or % share of patients\n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Imbalance of males and females across the human races?\n", fontdict=title_font_dict)

- **`OBSERVATION`**
  - The above graph shows us whether we have an imbalance of males or females across the human races. 
    - And, it looks like within a specific human race there is no such gender imbalance.
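
The within-race gender balance can also be read off a normalised cross-tabulation instead of the bar heights. This is a minimal sketch on made-up rows, assuming only the `Race` and `Gender` column names used above.

```python
import pandas as pd

# Made-up rows standing in for train_bene_df[['Race', 'Gender']].
toy = pd.DataFrame({
    "Race":   [1, 1, 1, 1, 2, 2, 2, 2],
    "Gender": [0, 0, 1, 1, 0, 1, 1, 1],
})

# normalize="index" makes each Race row sum to 1, so every cell is the
# share of a gender code within that race.
gender_share_by_race = pd.crosstab(toy["Race"], toy["Gender"], normalize="index")
print(gender_share_by_race)
```

Rows with shares close to the overall gender split indicate no additional imbalance inside that race category.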

### **Q5. Let's see the number of beneficiaries with Chronic Renal Disease.**

*  The two main types of kidney disease are short-term (acute kidney injury) and lifelong (chronic kidney disease).
  *  Chronic kidney disease, also known as chronic renal disease or CKD, is a condition characterized by a gradual loss of kidney function over time.

* **What are the main causes of chronic kidney disease?**
  * Diabetes and high blood pressure, or hypertension, are responsible for two-thirds of chronic kidney disease cases.

**For more info refer here -->** [link-1](https://kidney.org.au/your-kidneys/what-is-kidney-disease/types-of-kidney-disease) 
[link-2](https://www.kidney.org/atoz/content/about-chronic-kidney-disease)

In [None]:
# Here, I'm displaying the distribution of BENEFICIARIES on the basis of "RenalDiseaseIndicator"
with plt.style.context('seaborn'):
  plt.figure(figsize=(10,8))
  fig = train_bene_df['RenalDiseaseIndicator'].value_counts().plot(kind='bar', color=['green','orange'])
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/train_bene_df.shape[0],2))+"%"}', (x + width/2, y + height*1.02), ha='center', fontsize=13.5, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nRenal Disease present or not?", fontdict=label_font_dict)
  plt.xticks(labels=["NO","YES"], ticks=[0,1], rotation=10, size=12)
  plt.ylabel("Number or % share of patients\n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Distribution of BENEFICIARIES on the basis of Renal Disease\n", fontdict=title_font_dict)

  # 0 means NO and 1 means YES
  print(pd.DataFrame(train_bene_df['RenalDiseaseIndicator'].value_counts()),"\n")

- **`OBSERVATION`**
  - The above graph tells us that around 14% of beneficiaries have or had Kidney Failure (Renal Disease).

In [None]:
# Here, I'm displaying the distribution of BENEFICIARIES on the basis of "ChronicCond_KidneyDisease"
with plt.style.context('seaborn'):
  plt.figure(figsize=(10,8))
  fig = train_bene_df['ChronicCond_KidneyDisease'].value_counts().plot(kind='bar', color=['green','orange'])
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/train_bene_df.shape[0],2))+"%"}', (x + width/2, y + height*1.02), ha='center', fontsize=13.5, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\n Chronic Kidney Disease present or not?", fontdict=label_font_dict)
  plt.xticks(labels=["NO","YES"], ticks=[0,1], rotation=10, size=12)
  plt.ylabel("Number or % share of patients\n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Distribution of BENEFICIARIES on the basis of 'ChronicCond_KidneyDisease'\n", fontdict=title_font_dict)

# I believe 2 means NO and 1 means YES
print(pd.DataFrame(train_bene_df['ChronicCond_KidneyDisease'].value_counts()),'\n')

- **`OBSERVATION`**
  - Above graph tells us that around 31% of Beneficiaries had Chronic Kidney Disease.

* In the beneficiary dataset, we have 2 columns:
  * RenalDiseaseIndicator
  * ChronicCond_KidneyDisease

Both of these columns appear to represent the beneficiary's history of kidney disease.

I found this [link](https://www.kidney.org/blog/ask-doctor/chronic-kidney-disease-and-chronic-renal-failure-same-thing) useful for understanding the difference between the two. It looks like RenalDiseaseIndicator represents whether the beneficiary has or had Kidney Failure, whereas ChronicCond_KidneyDisease represents long-term Kidney Disease, for example kidneys that are not functioning to their fullest.

In [None]:
# Lets validate whether we have beneficiaries with both RKD & CKD?
with plt.style.context('seaborn'):
  plt.figure(figsize=(12,8))
  fig = train_bene_df.groupby(['RenalDiseaseIndicator','ChronicCond_KidneyDisease'])['Gender'].count().plot(kind='bar', color=['orange','green','purple','red'])
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/train_bene_df.shape[0],2))+"%"}', (x + width/2, y + height*1.02), ha='center', fontsize=13.5)
  
  # Added a description for one of the bars
  fig.annotate('This is interesting!!\n Beneficiaries with NO Long Term Kidney Malfunction but\n suffered from Kidney Failure.', 
               xy=(2.85, 0.85),  xycoords='data', xytext=(1.02, 0.65), textcoords='axes fraction', fontsize=12.5,
               arrowprops=dict(facecolor='black', shrink=0.03), horizontalalignment='right', verticalalignment='top')

  # Providing the labels and title to the graph
  plt.xlabel("\nRKD and CKD both are present?", fontdict=label_font_dict)
  plt.xticks(labels=[("RKD_No","CKD_Yes"), ("RKD_No","CKD_No"), ("RKD_Yes","CKD_Yes"), ("RKD_Yes","CKD_No")], ticks=[0,1,2,3], rotation=0, size=12)
  plt.ylabel("Number or % share of patients\n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Distribution of BENEFICIARIES across RKD and CKD combinations\n", fontdict=title_font_dict)

# RKD --> Renal Disease Indicator
# CKD --> Chronic Condition Kidney Disease
print(pd.DataFrame(train_bene_df.groupby(['RenalDiseaseIndicator','ChronicCond_KidneyDisease'])['Gender'].count()),"\n")

- **`OBSERVATION`**
  - From the above, we learn that 2.94% of beneficiaries had no previous CKD but suffered from RKD. It would be good to check how many of these end up as FRAUD claims.
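
The share of beneficiaries with RKD but without CKD can be computed directly with a boolean mask. The sketch below uses made-up rows; the value encodings (`RKD_YES = "Y"` and `CKD_NO = 2`) are assumptions for illustration and should be adjusted to whatever `unique()` reports for the two columns.

```python
import pandas as pd

# Assumed encodings, for illustration only; check them against the real columns.
RKD_YES = "Y"   # assumed code for "renal disease present"
CKD_NO = 2      # assumed code for "no chronic kidney disease"

# Made-up rows standing in for the two kidney-related columns.
toy = pd.DataFrame({
    "RenalDiseaseIndicator": ["0", "0", "Y", "Y", "0"],
    "ChronicCond_KidneyDisease": [1, 2, 1, 2, 2],
})

# True where a beneficiary has RKD but no CKD; the mean of a boolean
# mask is the fraction of rows that satisfy the condition.
rkd_without_ckd = (toy["RenalDiseaseIndicator"] == RKD_YES) & (
    toy["ChronicCond_KidneyDisease"] == CKD_NO
)
share_pct = round(rkd_without_ckd.mean() * 100, 2)
print(share_pct)
```

The same mask can later be used to filter these beneficiaries and inspect how many of their claims are labelled as fraud.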

### **Q6. Let's see the number of beneficiaries on the basis of State Codes.**

In [None]:
train_bene_df['State'].unique()

In [None]:
# Here, I'm displaying the distribution of BENEFICIARIES on the basis of State
with plt.style.context('seaborn'):
  plt.figure(figsize=(20,9))
  fig = train_bene_df['State'].value_counts().plot(kind='bar')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/train_bene_df.shape[0],2))+"%"}', (x + width/2, y + height*1.03), ha='center', fontsize=13.5, rotation=90)
  # Providing the labels and title to the graph
  plt.xlabel("\nState Codes", fontdict=label_font_dict)
  plt.ylabel("Number or % share of patients\n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Distribution of BENEFICIARIES on the basis of States\n", fontdict=title_font_dict)

- **`OBSERVATION`**
  - The above graph shows us that some States have more beneficiaries than others, and that the maximum number of beneficiaries are from State-5.

### **Q7. Let's see the number of beneficiaries on the basis of County Codes.**

In [None]:
# Here, I'm displaying the distribution of BENEFICIARIES on the basis of County
with plt.style.context('seaborn-poster'):
  plt.figure(figsize=(18,8))
  fig = train_bene_df['County'].value_counts()[0:45].plot(kind='bar', color='palegreen')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/train_bene_df.shape[0],2))+"%"}', (x + width/2, y + height*1.03), ha='center', fontsize=13.5, rotation=90)
  # Providing the labels and title to the graph
  plt.xlabel("\nCounty Codes", fontdict=label_font_dict)
  plt.ylabel("Number or % share of patients\n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Distribution of BENEFICIARIES on the basis of Counties (Top-45)", fontdict=title_font_dict)

In [None]:
# Counties with only a handful of beneficiaries
train_bene_df['County'].value_counts()[::-1][0:45]

- **`OBSERVATION`**
  - The above graph shows us the distribution of beneficiaries based upon the county codes. Some counties have a very high number of beneficiaries, while others have very few.

- **`NOTE`**
  - It still needs to be evaluated whether specific state or county codes have a higher number of frauds.
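
One way the open question in the note could be approached is to join claims with beneficiary and provider-label data and compare fraud rates per state. The sketch below is only an illustration on made-up data: the `Provider` and `PotentialFraud` column names, the "Yes"/"No" label values, and all rows are assumptions, not results from this dataset.

```python
import pandas as pd

# Made-up claims, beneficiaries and provider labels. The Provider and
# PotentialFraud column names and the "Yes"/"No" values are assumptions.
claims = pd.DataFrame({
    "BeneID": ["B1", "B2", "B3", "B4"],
    "Provider": ["P1", "P1", "P2", "P3"],
})
bene = pd.DataFrame({"BeneID": ["B1", "B2", "B3", "B4"], "State": [5, 5, 10, 10]})
labels = pd.DataFrame({
    "Provider": ["P1", "P2", "P3"],
    "PotentialFraud": ["Yes", "No", "Yes"],
})

# Bring the State and the provider label onto every claim row.
merged = claims.merge(bene, on="BeneID").merge(labels, on="Provider")
merged["is_fraud"] = (merged["PotentialFraud"] == "Yes").astype(int)

# Mean of the 0/1 flag per state = share of claims tied to a flagged provider.
fraud_rate_by_state = (
    merged.groupby("State")["is_fraud"].mean().sort_values(ascending=False)
)
print(fraud_rate_by_state)
```

Swapping `"State"` for `"County"` would give the same view at county level; states or counties with very few claims would need care, since their rates are noisy.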

### **Q8.1 Let's see the number of beneficiaries on the basis of 'NoOfMonths_PartACov'.**

In [None]:
train_bene_df['NoOfMonths_PartACov'].unique()

In [None]:
# Here, I'm displaying the distribution of BENEFICIARIES on the basis of NoOfMonths_PartACov
with plt.style.context('seaborn-poster'):
  plt.figure(figsize=(18,8))
  fig = train_bene_df['NoOfMonths_PartACov'].value_counts().plot(kind='bar', color='palegreen')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/train_bene_df.shape[0],2))+"%"}', (x + width/2, y + height*1.03), ha='center', fontsize=13.5, rotation=90)
  # Providing the labels and title to the graph
  plt.xlabel("\nMonths for Part-A Coverage", fontdict=label_font_dict)
  plt.ylabel("Number or % share of patients\n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Distribution of BENEFICIARIES on the basis of 'NoOfMonths_PartACov'", fontdict=title_font_dict)

- **`OBSERVATION`**
  - The above graph shows us that more than 99% of beneficiaries have 12 months of Part-A Coverage. My initial look at the above plot tells me that this feature won't have much impact on the target labels.

### **Q8.2 Let's see the number of beneficiaries on the basis of 'NoOfMonths_PartBCov'.**

In [None]:
train_bene_df['NoOfMonths_PartBCov'].unique()

In [None]:
# Here, I'm displaying the distribution of BENEFICIARIES on the basis of NoOfMonths_PartBCov
with plt.style.context('seaborn-poster'):
  plt.figure(figsize=(18,8))
  fig = train_bene_df['NoOfMonths_PartBCov'].value_counts().plot(kind='bar', color='palegreen')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/train_bene_df.shape[0],2))+"%"}', (x + width/2, y + height*1.03), ha='center', fontsize=13.5, rotation=90)
  # Providing the labels and title to the graph
  plt.xlabel("\nMonths for Part-B Coverage", fontdict=label_font_dict)
  plt.ylabel("Number or % share of patients\n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Distribution of BENEFICIARIES on the basis of 'NoOfMonths_PartBCov'", fontdict=title_font_dict)

- **`OBSERVATION`**
  - The above graph shows us that around 99% of beneficiaries have 12 months of Part-B Coverage. My initial look at the above plot tells me that this feature won't have much impact on the target labels.
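
A quick numeric check for near-constant columns like the two coverage features is the share of rows taken by the most frequent value. The helper below is a hypothetical sketch (the function name and toy data are not from the notebook).

```python
import pandas as pd

def dominant_value_share(series: pd.Series) -> float:
    """Return the fraction of rows taken by the most frequent value."""
    # value_counts sorts by frequency, so the first entry is the dominant value.
    return series.value_counts(normalize=True, dropna=False).iloc[0]

# Made-up column: 99 beneficiaries with 12 months of coverage, 1 with none.
toy_months = pd.Series([12] * 99 + [0])
share = dominant_value_share(toy_months)
print(f"{share:.2%}")
```

A share close to 1 means the column carries very little variation, which is consistent with the expectation above that such a feature may contribute little on its own.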

### **Q9. Let's see the number of beneficiaries on the basis of 'ChronicCond_Alzheimer', and the Annual IP & OP expenditures for such patients.**

In [None]:
# Here, I'm displaying the distribution of BENEFICIARIES on the basis of 'ChronicCond_Alzheimer'
with plt.style.context('seaborn-poster'):
  plt.figure(figsize=(12,8))
  fig = train_bene_df['ChronicCond_Alzheimer'].value_counts().plot(kind='bar', color=['palegreen','orange'])
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/train_bene_df.shape[0],2))+"%"}', (x + width/2, y + height*1.01), ha='center', fontsize=13.5, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic ALZH Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['NO', 'YES'], fontsize=13, rotation=30)
  plt.ylabel("Number or % share of patients\n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Distribution of BENEFICIARIES on the basis of 'ChronicCond_Alzheimer'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic ALZH Disease
# 2 means -ve with Chronic ALZH Disease
print(pd.DataFrame(train_bene_df['ChronicCond_Alzheimer'].value_counts()),"\n")

- **`OBSERVATION`**
  - The above graph shows us that beneficiaries with Chronic ALZH Disease are roughly half as numerous as the non-ALZH beneficiaries.

In [None]:
# Here, I'm displaying the Total Annual Sum of Max IP Reimbursement for 'ChronicCond_Alzheimer'
with plt.style.context('seaborn-poster'):
  plt.figure(figsize=(12,8))
  fig = train_bene_df.groupby(['ChronicCond_Alzheimer'])['IPAnnualReimbursementAmt'].sum().plot(kind='bar', color=['orange','palegreen'])
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/(train_bene_df["IPAnnualReimbursementAmt"].sum()),2))+"%"}', (x + width/2, y + height*1.01), ha='center', fontsize=13.5, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic ALZH Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Total Annual Sum of Max IP Reimbursement \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Total Annual Sum of Max IP Reimbursement : 'ChronicCond_Alzheimer'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic ALZH Disease
# 2 means -ve with Chronic ALZH Disease
print(pd.DataFrame(train_bene_df.groupby(['ChronicCond_Alzheimer'])['IPAnnualReimbursementAmt'].sum()),"\n")

- **`OBSERVATION`**
  - The above graph shows that although non-ALZH beneficiaries are about twice as many as ALZH beneficiaries, the annual sum of IP reimbursement is almost the same for both groups.
    - Since the totals are similar while the ALZH group is about half the size, the average IP reimbursement per ALZH beneficiary is roughly double that of a non-ALZH beneficiary.
      - The difference between the two totals is about 13 Million, or 2.52%.

In [None]:
# Here, I'm displaying the Total Annual Sum of Max OP Reimbursement for 'ChronicCond_Alzheimer'
with plt.style.context('seaborn-poster'):
  plt.figure(figsize=(12,8))
  fig = train_bene_df.groupby(['ChronicCond_Alzheimer'])['OPAnnualReimbursementAmt'].sum().plot(kind='bar', color=['orange','palegreen'])
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/(train_bene_df["OPAnnualReimbursementAmt"].sum()),2))+"%"}', (x + width/2, y + height*1.01), ha='center', fontsize=13.5, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic ALZH Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Total Annual Sum of Max OP Reimbursement \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Total Annual Sum of Max OP Reimbursement : 'ChronicCond_Alzheimer'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic ALZH Disease
# 2 means -ve with Chronic ALZH Disease
print(pd.DataFrame(train_bene_df.groupby(['ChronicCond_Alzheimer'])['OPAnnualReimbursementAmt'].sum()),"\n")

- **`OBSERVATION`**
  - The above graph shows that the annual sum of OP reimbursement for non-ALZH beneficiaries exceeds that of the ALZH group by more than 31 Million, or 16.92%.
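
The plotting cells in this section repeat the same bar-annotation loop. As a sketch only, that loop could be factored into a small helper. The name `annotate_bar_shares` and the toy totals below are hypothetical and not part of the original notebook; the label format (`.2f`) also differs slightly from the original `round(..., 2)`.

```python
import matplotlib.pyplot as plt
import pandas as pd


def annotate_bar_shares(ax, total):
    """Write each bar's height as a percentage of `total` just above the bar."""
    for p in ax.patches:
        height = p.get_height()
        x, y = p.get_xy()
        ax.annotate(
            f"{height * 100 / total:.2f}%",
            (x + p.get_width() / 2, y + height * 1.01),
            ha="center",
            fontsize=13.5,
        )


# Toy group totals (made-up values) standing in for a groupby(...).sum() result.
toy_totals = pd.Series([300.0, 100.0], index=[1, 2], name="OPAnnualReimbursementAmt")
ax = toy_totals.plot(kind="bar", color=["orange", "palegreen"])
annotate_bar_shares(ax, toy_totals.sum())

# Collect the annotation strings, in bar order, for a quick check.
labels = [t.get_text() for t in ax.texts]
print(labels)
```

With a helper like this, each plotting cell would only need to build the grouped Series, plot it, and call the helper with the column total.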

In [None]:
# Here, I'm displaying the Total Annual Sum of IP Co-payment for 'ChronicCond_Alzheimer'
with plt.style.context('seaborn-poster'):
  plt.figure(figsize=(12,8))
  fig = train_bene_df.groupby(['ChronicCond_Alzheimer'])['IPAnnualDeductibleAmt'].sum().plot(kind='bar', color=['orange','palegreen'])
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/(train_bene_df["IPAnnualDeductibleAmt"].sum()),2))+"%"}', (x + width/2, y + height*1.01), ha='center', fontsize=13.5, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic ALZH Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Total Annual Sum of IP Co-payment \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Total Annual Sum of IP Co-payment paid by patient : 'ChronicCond_Alzheimer'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic ALZH Disease
# 2 means -ve with Chronic ALZH Disease
print(pd.DataFrame(train_bene_df.groupby(['ChronicCond_Alzheimer'])['IPAnnualDeductibleAmt'].sum()),"\n")

- **`OBSERVATION`**
  - The above graph shows that the annual sum of IP co-payment paid by patients is almost the same for the ALZH and non-ALZH groups. The difference is only about 10 Million, or 1.8%.
    - So, in total, both the IP reimbursement and the IP co-payment amounts are roughly equal for the two groups.

In [None]:
# Here, I'm displaying the Total Annual Sum of OP Co-payment for 'ChronicCond_Alzheimer'
with plt.style.context('seaborn-poster'):
  plt.figure(figsize=(12,8))
  fig = train_bene_df.groupby(['ChronicCond_Alzheimer'])['OPAnnualDeductibleAmt'].sum().plot(kind='bar', color=['orange','palegreen'])
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/(train_bene_df["OPAnnualDeductibleAmt"].sum()),2))+"%"}', (x + width/2, y + height*1.01), ha='center', fontsize=13.5, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic ALZH Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Total Annual Sum of OP Co-payment \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Total Annual Sum of OP Co-payment paid by patient : 'ChronicCond_Alzheimer'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic ALZH Disease
# 2 means -ve with Chronic ALZH Disease
print(pd.DataFrame(train_bene_df.groupby(['ChronicCond_Alzheimer'])['OPAnnualDeductibleAmt'].sum()),"\n")

- **`OBSERVATION`**
  - The above graph shows that the annual sum of OP co-payment paid by non-ALZH patients is higher than that paid by ALZH patients. The difference is 93.3 Million, or 17.82%.
    - So, for outpatient (non-admitted) care, the non-ALZH group pays more co-payment in total than the ALZH group.

In [None]:
# Number of beneficiaries with chronic or no-chronic conditions
pd.DataFrame(train_bene_df.groupby(['ChronicCond_Alzheimer'])['BeneID'].count())

In [None]:
CC_ALZH_IP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Alzheimer'])['IPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_Alzheimer'])['BeneID'].count())
CC_ALZH_IP_R.columns = ['AVG IP Reimbursement Amt']
CC_ALZH_IP_R

In [None]:
CC_ALZH_OP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Alzheimer'])['OPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_Alzheimer'])['BeneID'].count())
CC_ALZH_OP_R.columns = ['AVG OP Reimbursement Amt']
CC_ALZH_OP_R

In [None]:
CC_ALZH_IP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Alzheimer'])['IPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_Alzheimer'])['BeneID'].count())
CC_ALZH_IP_D.columns = ['AVG IP Co-payment Amt']
CC_ALZH_IP_D

In [None]:
CC_ALZH_OP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Alzheimer'])['OPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_Alzheimer'])['BeneID'].count())
CC_ALZH_OP_D.columns = ['AVG OP Co-payment Amt']
CC_ALZH_OP_D

In [None]:
CC_ALZH_all_amts = pd.concat([CC_ALZH_IP_R, CC_ALZH_OP_R, CC_ALZH_IP_D, CC_ALZH_OP_D], axis=1)
CC_ALZH_all_amts
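
The four per-group averages built above (sum divided by beneficiary count) can also be obtained in one step with `groupby(...).mean()`. This is a sketch on a toy frame with made-up values; it assumes the amount columns contain no missing values and that each row is one beneficiary, in which case the two approaches agree.

```python
import pandas as pd

# Toy stand-in for train_bene_df (made-up values, one row per beneficiary).
toy_bene_df = pd.DataFrame({
    "BeneID": ["B1", "B2", "B3", "B4"],
    "ChronicCond_Alzheimer": [1, 1, 2, 2],
    "IPAnnualReimbursementAmt": [1000, 3000, 500, 1500],
    "OPAnnualReimbursementAmt": [200, 400, 100, 300],
    "IPAnnualDeductibleAmt": [100, 300, 0, 200],
    "OPAnnualDeductibleAmt": [20, 40, 10, 30],
})

# Source column -> display name, matching the column names used in the notebook.
amount_cols = {
    "IPAnnualReimbursementAmt": "AVG IP Reimbursement Amt",
    "OPAnnualReimbursementAmt": "AVG OP Reimbursement Amt",
    "IPAnnualDeductibleAmt": "AVG IP Co-payment Amt",
    "OPAnnualDeductibleAmt": "AVG OP Co-payment Amt",
}

# Mean of every amount column per condition flag, in a single groupby.
avg_amts = (
    toy_bene_df.groupby("ChronicCond_Alzheimer")[list(amount_cols)]
    .mean()
    .rename(columns=amount_cols)
)
print(avg_amts)
```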

In [None]:
# Here, I'm displaying the average annual IP/OP reimbursement and co-payment amounts per beneficiary for 'ChronicCond_Alzheimer'
with plt.style.context('seaborn-poster'):
  fig = CC_ALZH_all_amts.plot(kind='bar', colormap='rainbow')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{round(height,0)}', (x + width/2, y + height*1.015), ha='center', fontsize=12, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic ALZH Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Average Annual Amount per Beneficiary \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="-.", color='lightgrey')
  plt.minorticks_on()
  plt.title("Average Annual Amounts per Beneficiary : 'ChronicCond_Alzheimer'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic ALZH Disease
# 2 means -ve with Chronic ALZH Disease
print(CC_ALZH_all_amts,"\n")

- **`OBSERVATION`**
  - The above graph tells us the following:
    - The payer bears a large share of the expenses, especially for IP (admitted) beneficiaries, with or without chronic ALZH disease. To be more precise, for the +ve group it is more than 50%.
    - For the other comparisons, the difference is not very high.
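
One way to put a number on "a large share" is to express each average amount as a percentage of its row total. This is only a sketch of one possible reading of the observation above; the table `toy_avg_amts` uses made-up values and stands in for `CC_ALZH_all_amts`.

```python
import pandas as pd

# Toy stand-in for CC_ALZH_all_amts (made-up averages per beneficiary).
toy_avg_amts = pd.DataFrame(
    {
        "AVG IP Reimbursement Amt": [6000.0, 2000.0],
        "AVG OP Reimbursement Amt": [2500.0, 1500.0],
        "AVG IP Co-payment Amt": [1000.0, 300.0],
        "AVG OP Co-payment Amt": [500.0, 200.0],
    },
    index=pd.Index([1, 2], name="ChronicCond_Alzheimer"),
)

# Divide every row by its own total so that each row sums to 100%.
row_share_pct = (
    toy_avg_amts.div(toy_avg_amts.sum(axis=1), axis=0).mul(100).round(2)
)
print(row_share_pct)
```

Applied to the real table, the first column of the result would give the IP reimbursement share for each group.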

### **Q10. Let's see the number of beneficiaries on the basis of 'ChronicCond_Heartfailure', and the annual IP & OP expenditures for such patients.**

In [None]:
# Here, I'm displaying the distribution of BENEFICIARIES on the basis of 'ChronicCond_Heartfailure'
with plt.style.context('seaborn-poster'):
  plt.figure(figsize=(12,8))
  fig = train_bene_df['ChronicCond_Heartfailure'].value_counts().plot(kind='bar', color=['palegreen','orange'])
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/train_bene_df.shape[0],2))+"%"}', (x + width/2, y + height*1.01), ha='center', fontsize=13.5, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic HF Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['NO', 'YES'], fontsize=13, rotation=30)
  plt.ylabel("Number or % share of patients\n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Distribution of BENEFICIARIES on the basis of 'ChronicCond_Heartfailure'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic HF Disease
# 2 means -ve with Chronic HF Disease
print(pd.DataFrame(train_bene_df['ChronicCond_Heartfailure'].value_counts()),"\n")

- **`OBSERVATION`**
  - The above graph shows that the number of beneficiaries with chronic HF disease is almost equal to the number of non-HF beneficiaries.
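
Because `value_counts()` orders the bars by frequency, hard-coded tick labels such as `['NO', 'YES']` are only correct when the assumed group really is the larger one, which is fragile when the two groups are nearly equal in size. A minimal sketch of a safer pattern, using made-up codes in place of the real column, is to map the codes to labels before counting:

```python
import pandas as pd

# Toy stand-in for train_bene_df['ChronicCond_Heartfailure'] (made-up values):
# 1 = has the chronic condition, 2 = does not.
toy_codes = pd.Series([2, 1, 2, 1, 2], name="ChronicCond_Heartfailure")

# Map codes to readable labels first, so each count carries its own label.
labelled_counts = toy_codes.map({1: "YES", 2: "NO"}).value_counts()
print(labelled_counts)
```

Plotting `labelled_counts` directly would then label the bars from the index rather than from a fixed list.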

In [None]:
# Here, I'm displaying the Total Annual Sum of Max IP Reimbursement for 'ChronicCond_Heartfailure'
with plt.style.context('seaborn-poster'):
  plt.figure(figsize=(12,8))
  fig = train_bene_df.groupby(['ChronicCond_Heartfailure'])['IPAnnualReimbursementAmt'].sum().plot(kind='bar', color=['orange','palegreen'])
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/(train_bene_df["IPAnnualReimbursementAmt"].sum()),2))+"%"}', (x + width/2, y + height*1.01), ha='center', fontsize=13.5, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic HF Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Total Annual Sum of Max IP Reimbursement \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Total Annual Sum of Max IP Reimbursement : 'ChronicCond_Heartfailure'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic HF Disease
# 2 means -ve with Chronic HF Disease
print(pd.DataFrame(train_bene_df.groupby(['ChronicCond_Heartfailure'])['IPAnnualReimbursementAmt'].sum()),"\n")

- **`OBSERVATION`**
  - The above graph shows that the payer pays a very high annual amount for patients with chronic heart failure.
      - The difference is around 234 Million, or 47%. This is very high and can be a potential sign of FRAUD claims.

In [None]:
# Here, I'm displaying the Total Annual Sum of Max OP Reimbursement for 'ChronicCond_Heartfailure'
with plt.style.context('seaborn-poster'):
  plt.figure(figsize=(12,8))
  fig = train_bene_df.groupby(['ChronicCond_Heartfailure'])['OPAnnualReimbursementAmt'].sum().plot(kind='bar', color=['orange','palegreen'])
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/(train_bene_df["OPAnnualReimbursementAmt"].sum()),2))+"%"}', (x + width/2, y + height*1.01), ha='center', fontsize=13.5, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic HF Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Total Annual Sum of Max OP Reimbursement \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Total Annual Sum of Max OP Reimbursement : 'ChronicCond_Heartfailure'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic HF Disease
# 2 means -ve with Chronic HF Disease
print(pd.DataFrame(train_bene_df.groupby(['ChronicCond_Heartfailure'])['OPAnnualReimbursementAmt'].sum()),"\n")

- **`OBSERVATION`**
  - The above graph shows us that the Annual Sum paid by the payers for non-admitted patients is 49 Million more than the other group. This again can be a potential sign of fraud claims.

In [None]:
# Here, I'm displaying the Total Annual Sum of IP Co-payment for 'ChronicCond_Heartfailure'
with plt.style.context('seaborn-poster'):
  plt.figure(figsize=(12,8))
  fig = train_bene_df.groupby(['ChronicCond_Heartfailure'])['IPAnnualDeductibleAmt'].sum().plot(kind='bar', color=['orange','palegreen'])
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/(train_bene_df["IPAnnualDeductibleAmt"].sum()),2))+"%"}', (x + width/2, y + height*1.01), ha='center', fontsize=13.5, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic HF Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Total Annual Sum of IP Co-payment \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Total Annual Sum of IP Co-payment paid by patient : 'ChronicCond_Heartfailure'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic HF Disease
# 2 means -ve with Chronic HF Disease
print(pd.DataFrame(train_bene_df.groupby(['ChronicCond_Heartfailure'])['IPAnnualDeductibleAmt'].sum()),"\n")

- **`OBSERVATION`**
  - The above graph shows that the annual sum of IP co-payment paid by non-HF and HF patients differs by 25 Million.
    - However, note that the co-payment is very small compared to the part paid by the payer. This, again, can be a potential sign of FRAUD.

In [None]:
# Here, I'm displaying the Total Annual Sum of OP Co-payment for 'ChronicCond_Heartfailure'
with plt.style.context('seaborn-poster'):
  plt.figure(figsize=(12,8))
  fig = train_bene_df.groupby(['ChronicCond_Heartfailure'])['OPAnnualDeductibleAmt'].sum().plot(kind='bar', color=['orange','palegreen'])
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{str(round((height*100)/(train_bene_df["OPAnnualDeductibleAmt"].sum()),2))+"%"}', (x + width/2, y + height*1.01), ha='center', fontsize=13.5, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic HF Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Total Annual Sum of OP Co-payment \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="--", color='lightgrey')
  plt.minorticks_on()
  plt.title("Total Annual Sum of OP Co-payment paid by patient : 'ChronicCond_Heartfailure'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic HF Disease
# 2 means -ve with Chronic HF Disease
print(pd.DataFrame(train_bene_df.groupby(['ChronicCond_Heartfailure'])['OPAnnualDeductibleAmt'].sum()),"\n")

- **`OBSERVATION`**
  - The above graph shows that the annual sum of OP co-payment paid by non-HF and HF patients differs by 13 Million.
    - However, note that the co-payment is very small compared to the part paid by the payer. This, again, can be a potential sign of FRAUD.

In [None]:
# Number of beneficiaries with chronic or no-chronic conditions
pd.DataFrame(train_bene_df.groupby(['ChronicCond_Heartfailure'])['BeneID'].count())

In [None]:
CC_HF_IP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Heartfailure'])['IPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_Heartfailure'])['BeneID'].count())
CC_HF_IP_R.columns = ['AVG IP Reimbursement Amt']
CC_HF_IP_R

In [None]:
CC_HF_OP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Heartfailure'])['OPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_Heartfailure'])['BeneID'].count())
CC_HF_OP_R.columns = ['AVG OP Reimbursement Amt']
CC_HF_OP_R

In [None]:
CC_HF_IP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Heartfailure'])['IPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_Heartfailure'])['BeneID'].count())
CC_HF_IP_D.columns = ['AVG IP Co-payment Amt']
CC_HF_IP_D

In [None]:
CC_HF_OP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Heartfailure'])['OPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_Heartfailure'])['BeneID'].count())
CC_HF_OP_D.columns = ['AVG OP Co-payment Amt']
CC_HF_OP_D

In [None]:
CC_HF_all_amts = pd.concat([CC_HF_IP_R, CC_HF_OP_R, CC_HF_IP_D, CC_HF_OP_D], axis=1)
CC_HF_all_amts

In [None]:
# Here, I'm displaying the average annual IP/OP reimbursement and co-payment amounts per beneficiary for 'ChronicCond_Heartfailure'
with plt.style.context('seaborn-poster'):
  fig = CC_HF_all_amts.plot(kind='bar', colormap='rainbow')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{round(height,0)}', (x + width/2, y + height*1.015), ha='center', fontsize=12, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic HF Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Average Annual Amount per Beneficiary \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="-.", color='lightgrey')
  plt.minorticks_on()
  plt.title("Average Annual Amounts per Beneficiary : 'ChronicCond_Heartfailure'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic HF Disease
# 2 means -ve with Chronic HF Disease
print(CC_HF_all_amts,"\n")

- **`OBSERVATION`**
  - The above graph tells us the following:
    - The payer bears a large share of the expenses, especially for IP (admitted) beneficiaries, with or without chronic HF disease. To be more precise, for the +ve group it is more than 50%.
    - For the other comparisons, the difference is not very high.
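
The same average-amount table is rebuilt cell by cell for each chronic condition in the questions that follow. As a sketch, the computation could be wrapped in one function and applied to any condition column. The function name `avg_amounts_by_condition` and the toy data are hypothetical; the sketch assumes one row per beneficiary and no missing amounts.

```python
import pandas as pd

# Source column -> display name, matching the column names used in the notebook.
AMOUNT_COLS = {
    "IPAnnualReimbursementAmt": "AVG IP Reimbursement Amt",
    "OPAnnualReimbursementAmt": "AVG OP Reimbursement Amt",
    "IPAnnualDeductibleAmt": "AVG IP Co-payment Amt",
    "OPAnnualDeductibleAmt": "AVG OP Co-payment Amt",
}


def avg_amounts_by_condition(df, cond_col):
    """Average annual amounts per beneficiary, grouped by a chronic-condition flag."""
    return df.groupby(cond_col)[list(AMOUNT_COLS)].mean().rename(columns=AMOUNT_COLS)


# Toy stand-in for train_bene_df (made-up values, one row per beneficiary).
toy_bene_df = pd.DataFrame({
    "BeneID": ["B1", "B2", "B3", "B4"],
    "ChronicCond_KidneyDisease": [1, 1, 2, 2],
    "ChronicCond_Cancer": [1, 2, 1, 2],
    "IPAnnualReimbursementAmt": [1000, 3000, 500, 1500],
    "OPAnnualReimbursementAmt": [200, 400, 100, 300],
    "IPAnnualDeductibleAmt": [100, 300, 0, 200],
    "OPAnnualDeductibleAmt": [20, 40, 10, 30],
})

# One table per condition column, keyed by column name.
tables = {
    col: avg_amounts_by_condition(toy_bene_df, col)
    for col in ["ChronicCond_KidneyDisease", "ChronicCond_Cancer"]
}
print(tables["ChronicCond_KidneyDisease"])
```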

### **Q11. Let's see the number of beneficiaries on the basis of 'ChronicCond_KidneyDisease', and the annual IP & OP expenditures for such patients.**

In [None]:
# Number of beneficiaries with chronic or no-chronic conditions
pd.DataFrame(train_bene_df.groupby(['ChronicCond_KidneyDisease'])['BeneID'].count())

In [None]:
CC_KD_IP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_KidneyDisease'])['IPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_KidneyDisease'])['BeneID'].count())
CC_KD_IP_R.columns = ['AVG IP Reimbursement Amt']
CC_KD_IP_R

In [None]:
CC_KD_OP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_KidneyDisease'])['OPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_KidneyDisease'])['BeneID'].count())
CC_KD_OP_R.columns = ['AVG OP Reimbursement Amt']
CC_KD_OP_R

In [None]:
CC_KD_IP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_KidneyDisease'])['IPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_KidneyDisease'])['BeneID'].count())
CC_KD_IP_D.columns = ['AVG IP Co-payment Amt']
CC_KD_IP_D

In [None]:
CC_KD_OP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_KidneyDisease'])['OPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_KidneyDisease'])['BeneID'].count())
CC_KD_OP_D.columns = ['AVG OP Co-payment Amt']
CC_KD_OP_D

In [None]:
CC_KD_all_amts = pd.concat([CC_KD_IP_R, CC_KD_OP_R, CC_KD_IP_D, CC_KD_OP_D], axis=1)
CC_KD_all_amts

In [None]:
# Here, I'm displaying the average annual IP/OP reimbursement and co-payment amounts per beneficiary for 'ChronicCond_KidneyDisease'
with plt.style.context('seaborn-poster'):
  fig = CC_KD_all_amts.plot(kind='bar', colormap='rainbow')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{round(height,0)}', (x + width/2, y + height*1.015), ha='center', fontsize=12, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic KD Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Average Annual Amount per Beneficiary \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="-.", color='lightgrey')
  plt.minorticks_on()
  plt.title("Average Annual Amounts per Beneficiary : 'ChronicCond_KidneyDisease'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic KD Disease
# 2 means -ve with Chronic KD Disease
print(CC_KD_all_amts,"\n")

- **`OBSERVATION`**
  - The above graph tells us the following:
    - The payer bears a large share of the expenses, especially for IP (admitted) beneficiaries, with or without chronic KD disease. To be more precise, for the +ve group it is more than 50%.
    - For the other comparisons, the difference is not very high.

### **Q12. Let's see the number of beneficiaries on the basis of 'ChronicCond_Cancer', and the annual IP & OP expenditures for such patients.**

In [None]:
# Number of beneficiaries with chronic or no-chronic conditions
pd.DataFrame(train_bene_df.groupby(['ChronicCond_Cancer'])['BeneID'].count())

In [None]:
CC_CN_IP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Cancer'])['IPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_Cancer'])['BeneID'].count())
CC_CN_IP_R.columns = ['AVG IP Reimbursement Amt']
CC_CN_IP_R

In [None]:
CC_CN_OP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Cancer'])['OPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_Cancer'])['BeneID'].count())
CC_CN_OP_R.columns = ['AVG OP Reimbursement Amt']
CC_CN_OP_R

In [None]:
CC_CN_IP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Cancer'])['IPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_Cancer'])['BeneID'].count())
CC_CN_IP_D.columns = ['AVG IP Co-payment Amt']
CC_CN_IP_D

In [None]:
CC_CN_OP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Cancer'])['OPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_Cancer'])['BeneID'].count())
CC_CN_OP_D.columns = ['AVG OP Co-payment Amt']
CC_CN_OP_D

In [None]:
CC_CN_all_amts = pd.concat([CC_CN_IP_R, CC_CN_OP_R, CC_CN_IP_D, CC_CN_OP_D], axis=1)
CC_CN_all_amts

In [None]:
# Here, I'm displaying the average annual IP/OP reimbursement and co-payment amounts per beneficiary for 'ChronicCond_Cancer'
with plt.style.context('seaborn-poster'):
  fig = CC_CN_all_amts.plot(kind='bar', colormap='rainbow')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{round(height,0)}', (x + width/2, y + height*1.015), ha='center', fontsize=12, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic CN Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Average Annual Amount per Beneficiary \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="-.", color='lightgrey')
  plt.minorticks_on()
  plt.title("Average Annual Amounts per Beneficiary : 'ChronicCond_Cancer'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic CN Disease
# 2 means -ve with Chronic CN Disease
print(CC_CN_all_amts,"\n")

- **`OBSERVATION`**
  - The above graph tells us the following:
    - The payer bears a large share of the expenses, especially for IP (admitted) beneficiaries, with or without chronic CN disease. To be more precise, for the +ve group it is around 50%.
    - For the other comparisons, the difference is not very high.

### **Q13. Let's see the number of beneficiaries on the basis of 'ChronicCond_ObstrPulmonary', and the annual IP & OP expenditures for such patients.**

In [None]:
# Number of beneficiaries with chronic or no-chronic conditions
pd.DataFrame(train_bene_df.groupby(['ChronicCond_ObstrPulmonary'])['BeneID'].count())

In [None]:
CC_PL_IP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_ObstrPulmonary'])['IPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_ObstrPulmonary'])['BeneID'].count())
CC_PL_IP_R.columns = ['AVG IP Reimbursement Amt']
CC_PL_IP_R

In [None]:
CC_PL_OP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_ObstrPulmonary'])['OPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_ObstrPulmonary'])['BeneID'].count())
CC_PL_OP_R.columns = ['AVG OP Reimbursement Amt']
CC_PL_OP_R

In [None]:
CC_PL_IP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_ObstrPulmonary'])['IPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_ObstrPulmonary'])['BeneID'].count())
CC_PL_IP_D.columns = ['AVG IP Co-payment Amt']
CC_PL_IP_D

In [None]:
CC_PL_OP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_ObstrPulmonary'])['OPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_ObstrPulmonary'])['BeneID'].count())
CC_PL_OP_D.columns = ['AVG OP Co-payment Amt']
CC_PL_OP_D

In [None]:
CC_PL_all_amts = pd.concat([CC_PL_IP_R, CC_PL_OP_R, CC_PL_IP_D, CC_PL_OP_D], axis=1)
CC_PL_all_amts

In [None]:
# Here, I'm displaying the average annual IP/OP reimbursement and co-payment amounts per beneficiary for 'ChronicCond_ObstrPulmonary'
with plt.style.context('seaborn-poster'):
  fig = CC_PL_all_amts.plot(kind='bar', colormap='rainbow')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{round(height,0)}', (x + width/2, y + height*1.015), ha='center', fontsize=12, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic PL Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Average Annual Amount per Beneficiary \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="-.", color='lightgrey')
  plt.minorticks_on()
  plt.title("Average Annual Amounts per Beneficiary : 'ChronicCond_ObstrPulmonary'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic PL Disease
# 2 means -ve with Chronic PL Disease
print(CC_PL_all_amts,"\n")

- **`OBSERVATION`**
  - The above graph tells us the following:
    - The payer bears a large share of the expenses, especially for IP (admitted) beneficiaries, with or without chronic PL disease. To be more precise, for the +ve group it is more than 50%.
    - For the other comparisons, the difference is not very high.

### **Q14. Let's see the number of beneficiaries on the basis of 'ChronicCond_Depression', and the annual IP & OP expenditures for such patients.**

In [None]:
# Number of beneficiaries with chronic or no-chronic conditions
pd.DataFrame(train_bene_df.groupby(['ChronicCond_Depression'])['BeneID'].count())

In [None]:
CC_DP_IP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Depression'])['IPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_Depression'])['BeneID'].count())
CC_DP_IP_R.columns = ['AVG IP Reimbursement Amt']
CC_DP_IP_R

In [None]:
CC_DP_OP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Depression'])['OPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_Depression'])['BeneID'].count())
CC_DP_OP_R.columns = ['AVG OP Reimbursement Amt']
CC_DP_OP_R

In [None]:
CC_DP_IP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Depression'])['IPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_Depression'])['BeneID'].count())
CC_DP_IP_D.columns = ['AVG IP Co-payment Amt']
CC_DP_IP_D

In [None]:
CC_DP_OP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Depression'])['OPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_Depression'])['BeneID'].count())
CC_DP_OP_D.columns = ['AVG OP Co-payment Amt']
CC_DP_OP_D

In [None]:
CC_DP_all_amts = pd.concat([CC_DP_IP_R, CC_DP_OP_R, CC_DP_IP_D, CC_DP_OP_D], axis=1)
CC_DP_all_amts

In [None]:
# Here, I'm displaying the average annual IP/OP reimbursement and co-payment amounts per beneficiary for 'ChronicCond_Depression'
with plt.style.context('seaborn-poster'):
  fig = CC_DP_all_amts.plot(kind='bar', colormap='rainbow')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{round(height,0)}', (x + width/2, y + height*1.015), ha='center', fontsize=12, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic DP Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Average Annual Amount per Beneficiary \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="-.", color='lightgrey')
  plt.minorticks_on()
  plt.title("Average Annual Amounts per Beneficiary : 'ChronicCond_Depression'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic DP Disease
# 2 means -ve with Chronic DP Disease
print(CC_DP_all_amts,"\n")

- **`OBSERVATION`**
  - The above graph tells us the following:
    - The payer bears the largest share of the expenses when a beneficiary is admitted as an in-patient, with or without chronic depression. More precisely, for +ve beneficiaries the difference is around 42%.
    - For the other comparisons, the difference is not very high.
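
The four per-group averages above each divide a group sum by a group count. When the amount columns have no missing values, that is the same as the group mean, so the combined table can be built in one pass. The following is a minimal sketch under that assumption; the helper name `avg_amounts_by_flag` and the toy frame standing in for `train_bene_df` are hypothetical.

```python
import pandas as pd

# Source column -> display name, mirroring the column names used in the cells above.
AMOUNT_COLS = {
    "IPAnnualReimbursementAmt": "AVG IP Reimbursement Amt",
    "OPAnnualReimbursementAmt": "AVG OP Reimbursement Amt",
    "IPAnnualDeductibleAmt": "AVG IP Co-payment Amt",
    "OPAnnualDeductibleAmt": "AVG OP Co-payment Amt",
}

def avg_amounts_by_flag(df, flag_col):
    """Return the per-category mean of the four annual amount columns."""
    means = df.groupby(flag_col)[list(AMOUNT_COLS)].mean()
    return means.rename(columns=AMOUNT_COLS)

# Toy data (made-up values) standing in for train_bene_df.
toy_dep = pd.DataFrame({
    "BeneID": ["B1", "B2", "B3", "B4"],
    "ChronicCond_Depression": [1, 1, 2, 2],
    "IPAnnualReimbursementAmt": [1000, 3000, 500, 1500],
    "OPAnnualReimbursementAmt": [200, 400, 100, 300],
    "IPAnnualDeductibleAmt": [1068, 0, 0, 1068],
    "OPAnnualDeductibleAmt": [50, 150, 20, 80],
})
toy_avgs = avg_amounts_by_flag(toy_dep, "ChronicCond_Depression")
print(toy_avgs)
```

Note that `mean()` skips missing values, whereas dividing a sum by the `BeneID` count treats them as zero, so the two approaches differ if an amount column contains NaNs.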

### **Q15. Let's see the number of beneficiaries on the basis of 'ChronicCond_Diabetes', and the Annual IP & OP expenditures for such patients.**

In [None]:
# Number of beneficiaries with chronic or no-chronic conditions
pd.DataFrame(train_bene_df.groupby(['ChronicCond_Diabetes'])['BeneID'].count())

In [None]:
CC_DB_IP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Diabetes'])['IPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_Diabetes'])['BeneID'].count())
CC_DB_IP_R.columns = ['AVG IP Reimbursement Amt']
CC_DB_IP_R

In [None]:
CC_DB_OP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Diabetes'])['OPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_Diabetes'])['BeneID'].count())
CC_DB_OP_R.columns = ['AVG OP Reimbursement Amt']
CC_DB_OP_R

In [None]:
CC_DB_IP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Diabetes'])['IPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_Diabetes'])['BeneID'].count())
CC_DB_IP_D.columns = ['AVG IP Co-payment Amt']
CC_DB_IP_D

In [None]:
CC_DB_OP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Diabetes'])['OPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_Diabetes'])['BeneID'].count())
CC_DB_OP_D.columns = ['AVG OP Co-payment Amt']
CC_DB_OP_D

In [None]:
CC_DB_all_amts = pd.concat([CC_DB_IP_R, CC_DB_OP_R, CC_DB_IP_D, CC_DB_OP_D], axis=1)
CC_DB_all_amts

In [None]:
# Here, I'm displaying the average annual IP & OP reimbursement and co-payment amounts for 'ChronicCond_Diabetes'
with plt.style.context('seaborn-poster'):
  fig = CC_DB_all_amts.plot(kind='bar', colormap='rainbow')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{round(height,0)}', (x + width/2, y + height*1.015), ha='center', fontsize=12, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic DB Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Average Annual Amount \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="-.", color='lightgrey')
  plt.minorticks_on()
  plt.title("Average annual amounts : 'ChronicCond_Diabetes'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic DB Disease
# 2 means -ve with Chronic DB Disease

- **`OBSERVATION`**
  - The above graph tells us the following:
    - The payer bears the largest share of the expenses when a beneficiary is admitted as an in-patient, with or without chronic diabetes. More precisely, for +ve beneficiaries the difference is more than 50%.
    - For the other comparisons, the difference is not very high.

### **Q16. Let's see the number of beneficiaries on the basis of 'ChronicCond_IschemicHeart', and the Annual IP & OP expenditures for such patients.**

In [None]:
# Number of beneficiaries with chronic or no-chronic conditions
pd.DataFrame(train_bene_df.groupby(['ChronicCond_IschemicHeart'])['BeneID'].count())

In [None]:
CC_IH_IP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_IschemicHeart'])['IPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_IschemicHeart'])['BeneID'].count())
CC_IH_IP_R.columns = ['AVG IP Reimbursement Amt']
CC_IH_IP_R

In [None]:
CC_IH_OP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_IschemicHeart'])['OPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_IschemicHeart'])['BeneID'].count())
CC_IH_OP_R.columns = ['AVG OP Reimbursement Amt']
CC_IH_OP_R

In [None]:
CC_IH_IP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_IschemicHeart'])['IPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_IschemicHeart'])['BeneID'].count())
CC_IH_IP_D.columns = ['AVG IP Co-payment Amt']
CC_IH_IP_D

In [None]:
CC_IH_OP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_IschemicHeart'])['OPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_IschemicHeart'])['BeneID'].count())
CC_IH_OP_D.columns = ['AVG OP Co-payment Amt']
CC_IH_OP_D

In [None]:
CC_IH_all_amts = pd.concat([CC_IH_IP_R, CC_IH_OP_R, CC_IH_IP_D, CC_IH_OP_D], axis=1)
CC_IH_all_amts

In [None]:
# Here, I'm displaying the average annual IP & OP reimbursement and co-payment amounts for 'ChronicCond_IschemicHeart'
with plt.style.context('seaborn-poster'):
  fig = CC_IH_all_amts.plot(kind='bar', colormap='rainbow')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{round(height,0)}', (x + width/2, y + height*1.015), ha='center', fontsize=12, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic IH Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Average Annual Amount \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="-.", color='lightgrey')
  plt.minorticks_on()
  plt.title("Average annual amounts : 'ChronicCond_IschemicHeart'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic IH Disease
# 2 means -ve with Chronic IH Disease

- **`OBSERVATION`**
  - The above graph tells us the following:
    - The payer bears the largest share of the expenses when a beneficiary is admitted as an in-patient, with or without chronic ischemic heart disease. More precisely, for +ve beneficiaries the difference is more than 50%.
    - For the other comparisons, the difference is not very high.

### **Q17. Let's see the number of beneficiaries on the basis of 'ChronicCond_Osteoporasis', and the Annual IP & OP expenditures for such patients.**

In [None]:
# Number of beneficiaries with chronic or no-chronic conditions
pd.DataFrame(train_bene_df.groupby(['ChronicCond_Osteoporasis'])['BeneID'].count())

In [None]:
CC_OS_IP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Osteoporasis'])['IPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_Osteoporasis'])['BeneID'].count())
CC_OS_IP_R.columns = ['AVG IP Reimbursement Amt']
CC_OS_IP_R

In [None]:
CC_OS_OP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Osteoporasis'])['OPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_Osteoporasis'])['BeneID'].count())
CC_OS_OP_R.columns = ['AVG OP Reimbursement Amt']
CC_OS_OP_R

In [None]:
CC_OS_IP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Osteoporasis'])['IPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_Osteoporasis'])['BeneID'].count())
CC_OS_IP_D.columns = ['AVG IP Co-payment Amt']
CC_OS_IP_D

In [None]:
CC_OS_OP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_Osteoporasis'])['OPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_Osteoporasis'])['BeneID'].count())
CC_OS_OP_D.columns = ['AVG OP Co-payment Amt']
CC_OS_OP_D

In [None]:
CC_OS_all_amts = pd.concat([CC_OS_IP_R, CC_OS_OP_R, CC_OS_IP_D, CC_OS_OP_D], axis=1)
CC_OS_all_amts

In [None]:
# Here, I'm displaying the average annual IP & OP reimbursement and co-payment amounts for 'ChronicCond_Osteoporasis'
with plt.style.context('seaborn-poster'):
  fig = CC_OS_all_amts.plot(kind='bar', colormap='rainbow')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{round(height,0)}', (x + width/2, y + height*1.015), ha='center', fontsize=12, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic OS Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Average Annual Amount \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="-.", color='lightgrey')
  plt.minorticks_on()
  plt.title("Average annual amounts : 'ChronicCond_Osteoporasis'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic OS Disease
# 2 means -ve with Chronic OS Disease

- **`OBSERVATION`**
  - The above graph tells us the following:
    - The payer bears the largest share of the expenses when a beneficiary is admitted as an in-patient, with or without chronic osteoporosis. More precisely, for +ve beneficiaries the difference is not that high.
    - For the other comparisons, the difference is not very high.

### **Q18. Let's see the number of beneficiaries on the basis of 'ChronicCond_rheumatoidarthritis', and the Annual IP & OP expenditures for such patients.**

In [None]:
# Number of beneficiaries with chronic or no-chronic conditions
pd.DataFrame(train_bene_df.groupby(['ChronicCond_rheumatoidarthritis'])['BeneID'].count())

In [None]:
CC_RH_IP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_rheumatoidarthritis'])['IPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_rheumatoidarthritis'])['BeneID'].count())
CC_RH_IP_R.columns = ['AVG IP Reimbursement Amt']
CC_RH_IP_R

In [None]:
CC_RH_OP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_rheumatoidarthritis'])['OPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_rheumatoidarthritis'])['BeneID'].count())
CC_RH_OP_R.columns = ['AVG OP Reimbursement Amt']
CC_RH_OP_R

In [None]:
CC_RH_IP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_rheumatoidarthritis'])['IPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_rheumatoidarthritis'])['BeneID'].count())
CC_RH_IP_D.columns = ['AVG IP Co-payment Amt']
CC_RH_IP_D

In [None]:
CC_RH_OP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_rheumatoidarthritis'])['OPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_rheumatoidarthritis'])['BeneID'].count())
CC_RH_OP_D.columns = ['AVG OP Co-payment Amt']
CC_RH_OP_D

In [None]:
CC_RH_all_amts = pd.concat([CC_RH_IP_R, CC_RH_OP_R, CC_RH_IP_D, CC_RH_OP_D], axis=1)
CC_RH_all_amts

In [None]:
# Here, I'm displaying the average annual IP & OP reimbursement and co-payment amounts for 'ChronicCond_rheumatoidarthritis'
with plt.style.context('seaborn-poster'):
  fig = CC_RH_all_amts.plot(kind='bar', colormap='rainbow')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{round(height,0)}', (x + width/2, y + height*1.015), ha='center', fontsize=12, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic RH Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Average Annual Amount \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="-.", color='lightgrey')
  plt.minorticks_on()
  plt.title("Average annual amounts : 'ChronicCond_rheumatoidarthritis'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic RH Disease
# 2 means -ve with Chronic RH Disease

- **`OBSERVATION`**
  - The above graph tells us the following:
    - The payer bears the largest share of the expenses when a beneficiary is admitted as an in-patient, with or without chronic rheumatoid arthritis. More precisely, for +ve beneficiaries the difference is not that high.
    - For the other comparisons, the difference is not very high.

### **Q19. Let's see the number of beneficiaries on the basis of 'ChronicCond_stroke', and the Annual IP & OP expenditures for such patients.**

In [None]:
# Number of beneficiaries with chronic or no-chronic conditions
pd.DataFrame(train_bene_df.groupby(['ChronicCond_stroke'])['BeneID'].count())

In [None]:
CC_ST_IP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_stroke'])['IPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_stroke'])['BeneID'].count())
CC_ST_IP_R.columns = ['AVG IP Reimbursement Amt']
CC_ST_IP_R

In [None]:
CC_ST_OP_R = pd.DataFrame(train_bene_df.groupby(['ChronicCond_stroke'])['OPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['ChronicCond_stroke'])['BeneID'].count())
CC_ST_OP_R.columns = ['AVG OP Reimbursement Amt']
CC_ST_OP_R

In [None]:
CC_ST_IP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_stroke'])['IPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_stroke'])['BeneID'].count())
CC_ST_IP_D.columns = ['AVG IP Co-payment Amt']
CC_ST_IP_D

In [None]:
CC_ST_OP_D = pd.DataFrame(train_bene_df.groupby(['ChronicCond_stroke'])['OPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['ChronicCond_stroke'])['BeneID'].count())
CC_ST_OP_D.columns = ['AVG OP Co-payment Amt']
CC_ST_OP_D

In [None]:
CC_ST_all_amts = pd.concat([CC_ST_IP_R, CC_ST_OP_R, CC_ST_IP_D, CC_ST_OP_D], axis=1)
CC_ST_all_amts

In [None]:
# Here, I'm displaying the average annual IP & OP reimbursement and co-payment amounts for 'ChronicCond_stroke'
with plt.style.context('seaborn-poster'):
  fig = CC_ST_all_amts.plot(kind='bar', colormap='rainbow')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{round(height,0)}', (x + width/2, y + height*1.015), ha='center', fontsize=12, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic ST Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['YES', 'NO'], fontsize=13, rotation=30)
  plt.ylabel("Average Annual Amount \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="-.", color='lightgrey')
  plt.minorticks_on()
  plt.title("Average annual amounts : 'ChronicCond_stroke'\n", fontdict=title_font_dict)

# 1 means +ve with Chronic ST Disease
# 2 means -ve with Chronic ST Disease

- **`OBSERVATION`**
  - The above graph tells us the following:
    - The payer bears the largest share of the expenses when a beneficiary is admitted as an in-patient, with or without the chronic stroke condition. More precisely, for +ve beneficiaries the difference is more than 50%.
    - For the other comparisons, the difference is not very high.

### **Q20. Let's see the number of beneficiaries on the basis of 'RenalDiseaseIndicator', and the Annual IP & OP expenditures for such patients.**

In [None]:
# Number of beneficiaries with chronic or no-chronic conditions
pd.DataFrame(train_bene_df.groupby(['RenalDiseaseIndicator'])['BeneID'].count())

In [None]:
RKD_IP_R = pd.DataFrame(train_bene_df.groupby(['RenalDiseaseIndicator'])['IPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['RenalDiseaseIndicator'])['BeneID'].count())
RKD_IP_R.columns = ['AVG IP Reimbursement Amt']
RKD_IP_R

In [None]:
RKD_OP_R = pd.DataFrame(train_bene_df.groupby(['RenalDiseaseIndicator'])['OPAnnualReimbursementAmt'].sum() / train_bene_df.groupby(['RenalDiseaseIndicator'])['BeneID'].count())
RKD_OP_R.columns = ['AVG OP Reimbursement Amt']
RKD_OP_R

In [None]:
RKD_IP_D = pd.DataFrame(train_bene_df.groupby(['RenalDiseaseIndicator'])['IPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['RenalDiseaseIndicator'])['BeneID'].count())
RKD_IP_D.columns = ['AVG IP Co-payment Amt']
RKD_IP_D

In [None]:
RKD_OP_D = pd.DataFrame(train_bene_df.groupby(['RenalDiseaseIndicator'])['OPAnnualDeductibleAmt'].sum() / train_bene_df.groupby(['RenalDiseaseIndicator'])['BeneID'].count())
RKD_OP_D.columns = ['AVG OP Co-payment Amt']
RKD_OP_D

In [None]:
RKD_all_amts = pd.concat([RKD_IP_R, RKD_OP_R, RKD_IP_D, RKD_OP_D], axis=1)
RKD_all_amts

In [None]:
# Here, I'm displaying the average annual IP & OP reimbursement and co-payment amounts for 'RenalDiseaseIndicator'
with plt.style.context('seaborn-poster'):
  fig = RKD_all_amts.plot(kind='bar', colormap='rainbow')
  # Using the "patches" function we will get the location of the rectangle bars from the graph.
  ## Then by using those location(width & height) values we will add the annotations
  for p in fig.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    fig.annotate(f'{round(height,0)}', (x + width/2, y + height*1.015), ha='center', fontsize=12, rotation=0)
  # Providing the labels and title to the graph
  plt.xlabel("\nHaving Chronic Renal Kidney Disease?", fontdict=label_font_dict)
  plt.xticks(ticks=[0,1], labels=['NO', 'YES'], fontsize=13, rotation=30)
  plt.ylabel("Average Annual Amount \n", fontdict=label_font_dict)
  plt.grid(which='major', linestyle="-.", color='lightgrey')
  plt.minorticks_on()
  plt.title("Average annual amounts : 'RenalDiseaseIndicator'\n", fontdict=title_font_dict)

# Y means +ve with Renal Kidney Disease
# 0 means -ve with Renal Kidney Disease

- **`OBSERVATION`**
  - The above graph tells us the following:
    - The payer bears the largest share of the expenses when a beneficiary is admitted as an in-patient, with or without Renal Kidney Disease. More precisely, for +ve beneficiaries the difference is more than 50%.
    - For the other comparisons, the difference is not very high.
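
The percentage figures quoted in the observations for Q14 to Q20 are read off the bar charts. One way to put a number on such a comparison is the relative gap between the +ve and -ve group averages. This is only a sketch of one possible definition; the helper `relative_gap` and the toy values are hypothetical, and the figures quoted above may have been derived differently.

```python
import pandas as pd

def relative_gap(df, flag_col, amount_col, positive_code, negative_code):
    """Percent by which the +ve group's mean amount exceeds the -ve group's mean."""
    means = df.groupby(flag_col)[amount_col].mean()
    pos, neg = means.loc[positive_code], means.loc[negative_code]
    return 100.0 * (pos - neg) / neg

# Toy data (made-up values); 'Y' / '0' follow the RenalDiseaseIndicator coding.
toy_renal = pd.DataFrame({
    "RenalDiseaseIndicator": ["Y", "Y", "0", "0"],
    "IPAnnualReimbursementAmt": [4000, 8000, 3000, 5000],
})
gap = relative_gap(toy_renal, "RenalDiseaseIndicator",
                   "IPAnnualReimbursementAmt", positive_code="Y", negative_code="0")
print(f"+ve average is {gap:.1f}% higher than the -ve average")
```

For the chronic-condition columns the same call would use `positive_code=1` and `negative_code=2`.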

### **Q21. Let's check the percentiles of the Annual IP and OP reimbursements for the pre-existing disease indicators.**

In [None]:
def cal_display_percentiles(x_col, y_col, title_lbl, x_filter_code):
    """
    Description : This function calculates the percentiles of an amount column for one category of a pre-existing disease indicator.
    
    Input: It accepts the below parameters:
        1. x_col : Disease indicator feature name.
        2. y_col : Feature, such as the reimbursement or deductible amount, whose percentiles you want to generate.
        3. title_lbl : Label to be used in the title of the plot.
        4. x_filter_code : Category code of `x_col` for which you want to generate the percentiles.
        
    Output: It returns a dataframe with the percentiles and their respective values for the given disease indicator category,
    and displays a pointplot of the same.
    """
    percentiles = []
    percentiles_vals = []

    # Calculating & storing the various percentiles and their respective values
    for val in [0.1,0.2,0.25,0.3,0.4,0.5,0.6,0.7,0.75,0.8,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.999,0.9999,0.99999,0.999999,1.0]:
        percentile = round(float(val*100),6)
        percentiles.append(percentile)

        percentile_val = round(train_bene_df[train_bene_df[x_col] == x_filter_code][y_col].quantile(val),1)
        percentiles_vals.append(percentile_val)

    # Creating the temp dataframe for displaying the results
    tmp_percentiles = pd.DataFrame([percentiles, percentiles_vals]).T
    tmp_percentiles.columns = ['Percentiles', 'Values']

    # Here, I'm displaying the Percentiles values for all disease code features
    with plt.style.context('seaborn-poster'):
        plt.figure(figsize=(15,7))
        sns.pointplot(data=tmp_percentiles, x='Percentiles', y='Values', markers="o", palette='spring')
        sns.pointplot(data=tmp_percentiles, x='Percentiles', y='Values', markers="", color='grey', linestyles="solid")
        # Providing the labels and title to the graph
        plt.xlabel("\nPercentiles", fontdict=label_font_dict)
        plt.xticks(rotation=90, size=12)
        plt.ylabel("Total Annual `{}` Sum \n".format(y_col), fontdict=label_font_dict)
        plt.grid(which='major', linestyle="-.", color='lightpink')
        plt.minorticks_on()
        plt.title("Percentile values of `{}` :: `{}`\n".format(y_col,title_lbl), fontdict=title_font_dict)
        
    return tmp_percentiles

- **(RenalDiseaseIndicator == YES) and (IPAnnualReimbursementAmt)**

In [None]:
RKD_YES_IP_R_percentiles = cal_display_percentiles(x_col='RenalDiseaseIndicator', 
                                                   y_col='IPAnnualReimbursementAmt',
                                                   title_lbl="Renal Kidney Disease = YES",
                                                   x_filter_code='Y')

- **`OBSERVATION`**
  - The above graph shows that a few of the reimbursements paid by the payer are very high compared to the rest of the records.
      - This can be a potential sign of fraud, because forged claims are often filed with unusually high amounts.
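
The percentile table that `cal_display_percentiles` builds in a loop can also be obtained with a single `Series.quantile` call, which accepts a list of quantiles. A minimal sketch, with a hypothetical helper name `percentile_table` and made-up amounts:

```python
import pandas as pd

def percentile_table(values, quantiles):
    """Build a Percentiles/Values table from one vectorised quantile call."""
    q_vals = values.quantile(quantiles)  # Series indexed by the requested quantiles
    return pd.DataFrame({
        "Percentiles": [round(q * 100, 6) for q in quantiles],
        "Values": q_vals.round(1).to_numpy(),
    })

# Toy amounts (made-up): ten ordinary values and one very large one.
toy_amounts = pd.Series([0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 50000])
tbl = percentile_table(toy_amounts, [0.5, 0.9, 1.0])
print(tbl)
```

The jump between the 90th and 100th percentile in the toy output is the same pattern the point plot makes visible.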

- **(RenalDiseaseIndicator == NO) and (IPAnnualReimbursementAmt)**

In [None]:
RKD_NO_IP_R_percentiles = cal_display_percentiles(x_col='RenalDiseaseIndicator', 
                                                   y_col='IPAnnualReimbursementAmt',
                                                   title_lbl="Renal Kidney Disease = NO",
                                                   x_filter_code='0')

- **`OBSERVATION`**
  - The above graph shows that a few of the reimbursements paid by the payer are very high compared to the rest of the records.
      - This can be a potential sign of fraud, because forged claims are often filed with unusually high amounts.

- **(RenalDiseaseIndicator == YES) and (OPAnnualReimbursementAmt)**

In [None]:
RKD_YES_OP_R_percentiles = cal_display_percentiles(x_col='RenalDiseaseIndicator', 
                                                   y_col='OPAnnualReimbursementAmt',
                                                   title_lbl="Renal Kidney Disease = YES",
                                                   x_filter_code='Y')

- **`OBSERVATION`**
  - The above graph shows that a few of the reimbursements paid by the payer are very high compared to the rest of the records.
      - This can be a potential sign of fraud, because forged claims are often filed with unusually high amounts.

- **(RenalDiseaseIndicator == NO) and (OPAnnualReimbursementAmt)**

In [None]:
RKD_NO_OP_R_percentiles = cal_display_percentiles(x_col='RenalDiseaseIndicator', 
                                                   y_col='OPAnnualReimbursementAmt',
                                                   title_lbl="Renal Kidney Disease = NO",
                                                   x_filter_code='0')

- **`OBSERVATION`**
  - The above graph shows that a few of the reimbursements paid by the payer are very high compared to the rest of the records.
      - This can be a potential sign of fraud, because forged claims are often filed with unusually high amounts.

- **(ChronicCond_stroke == YES) and (IPAnnualReimbursementAmt)**

In [None]:
CC_ST_YES_IP_R_percentiles = cal_display_percentiles(x_col='ChronicCond_stroke', 
                                                     y_col='IPAnnualReimbursementAmt',
                                                     title_lbl="ChronicCond_stroke = YES",
                                                     x_filter_code=1)

- **`OBSERVATION`**
  - The above graph shows that a few of the reimbursements paid by the payer are very high compared to the rest of the records.
      - This can be a potential sign of fraud, because forged claims are often filed with unusually high amounts.

- **(ChronicCond_stroke == NO) and (IPAnnualReimbursementAmt)**

In [None]:
CC_ST_NO_IP_R_percentiles = cal_display_percentiles(x_col='ChronicCond_stroke', 
                                                     y_col='IPAnnualReimbursementAmt',
                                                     title_lbl="ChronicCond_stroke = NO",
                                                     x_filter_code=2)

- **`OBSERVATION`**
  - The above graph shows that a few of the reimbursements paid by the payer are very high compared to the rest of the records.
      - This can be a potential sign of fraud, because forged claims are often filed with unusually high amounts.

- **(ChronicCond_stroke == YES) and (OPAnnualReimbursementAmt)**

In [None]:
CC_ST_YES_OP_R_percentiles = cal_display_percentiles(x_col='ChronicCond_stroke', 
                                                     y_col='OPAnnualReimbursementAmt',
                                                     title_lbl="ChronicCond_stroke = YES",
                                                     x_filter_code=1)

- **`OBSERVATION`**
  - The above graph shows that a few of the reimbursements paid by the payer are very high compared to the rest of the records.
      - This can be a potential sign of fraud, because forged claims are often filed with unusually high amounts.

- **(ChronicCond_stroke == NO) and (OPAnnualReimbursementAmt)**

In [None]:
CC_ST_NO_OP_R_percentiles = cal_display_percentiles(x_col='ChronicCond_stroke', 
                                                     y_col='OPAnnualReimbursementAmt',
                                                     title_lbl="ChronicCond_stroke = NO",
                                                     x_filter_code=2)

- **`OBSERVATION`**
  - The above graph shows that a few of the reimbursements paid by the payer are very high compared to the rest of the records.
      - This can be a potential sign of fraud, because forged claims are often filed with unusually high amounts.
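
To follow up on the very high reimbursements seen in the percentile plots, one option is to flag the rows that lie above a high quantile within their own indicator category. This is a sketch only: the helper `flag_extreme_amounts`, the threshold and the toy data are hypothetical, and a flagged row is merely a candidate for review, not evidence of fraud.

```python
import pandas as pd

def flag_extreme_amounts(df, group_col, amount_col, q=0.99):
    """Mark rows whose amount exceeds the q-th quantile of their own group."""
    # transform broadcasts each group's quantile back onto that group's rows.
    threshold = df.groupby(group_col)[amount_col].transform(lambda s: s.quantile(q))
    return df[amount_col] > threshold

# Toy data (made-up values): one obvious outlier in the stroke = 1 group.
toy_stroke = pd.DataFrame({
    "ChronicCond_stroke": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "IPAnnualReimbursementAmt": [10, 20, 30, 40, 1000, 5, 5, 5, 5, 5],
})
flags = flag_extreme_amounts(toy_stroke, "ChronicCond_stroke",
                             "IPAnnualReimbursementAmt", q=0.9)
print(toy_stroke[flags])
```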

### **Q22. Let's visualize the spread of the Annual IP and OP expenditures for the pre-existing disease indicators across males and females.**

In [None]:
def plot_strip_plots(x_col, hue_col, y_col, lgd_title):
    """
    Description : This function plots the spread of the Annual IP and OP expenditures across males and females,
    coloured by a pre-existing disease indicator.
    
    Input: It accepts the below parameters:
        1. x_col : Gender feature.
        2. hue_col : Pre-existing disease indicator feature.
        3. y_col : Feature, such as the reimbursement or deductible amount, whose spread you want to plot.
        4. lgd_title : Title to be shown on the legend.
        
    Output: It displays the stripplot of the same.
    """
    with plt.style.context('seaborn-poster'):
        plt.figure(figsize=(10,7))
        sns.stripplot(data=train_bene_df, x=x_col, y=y_col, hue=hue_col, palette='plasma')
        # Providing the labels and title to the graph
        plt.xlabel("\n{}".format(x_col), fontdict=label_font_dict)
        plt.xticks(rotation=90, size=12)
        plt.ylabel("{}\n".format(y_col), fontdict=label_font_dict)
        plt.grid(which='major', linestyle="-.", color='lightpink')
        plt.minorticks_on()
        plt.title("Spread of payment paid by payer\n", fontdict=title_font_dict)
        plt.legend(loc='upper center',title=lgd_title)

- **(RenalDiseaseIndicator) , (IPAnnualReimbursementAmt) and (GENDER)**

In [None]:
plot_strip_plots(x_col='Gender', hue_col="RenalDiseaseIndicator", y_col='IPAnnualReimbursementAmt', lgd_title="Renal Kidney Disease")

- **`OBSERVATION`**
  - The above graph shows that the data points overlap completely, with some potential outliers (possibly fraud).
      - Another thing I found is that a few of the points lie in the negative range (this is quite strange and may be a data error).

In [None]:
train_bene_df['OPAnnualReimbursementAmt'].min(), train_bene_df['IPAnnualReimbursementAmt'].min()

In [None]:
train_bene_df['OPAnnualDeductibleAmt'].min(), train_bene_df['IPAnnualDeductibleAmt'].min()
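
Beyond the minimum values checked above, it can help to see how many rows are negative in each amount column. A minimal sketch, using a hypothetical helper `negative_amount_counts` and a made-up frame in place of `train_bene_df`:

```python
import pandas as pd

# The four annual amount columns of the beneficiary data.
AMOUNT_COLS_TO_CHECK = [
    "IPAnnualReimbursementAmt",
    "OPAnnualReimbursementAmt",
    "IPAnnualDeductibleAmt",
    "OPAnnualDeductibleAmt",
]

def negative_amount_counts(df, cols=AMOUNT_COLS_TO_CHECK):
    """Count, per column, the rows whose amount is below zero."""
    return (df[cols] < 0).sum()

# Toy data (made-up values) with two negative reimbursements.
toy_neg = pd.DataFrame({
    "IPAnnualReimbursementAmt": [5000, -100, 0],
    "OPAnnualReimbursementAmt": [200, 300, -50],
    "IPAnnualDeductibleAmt": [1068, 0, 0],
    "OPAnnualDeductibleAmt": [10, 20, 30],
})
neg_counts = negative_amount_counts(toy_neg)
print(neg_counts)
```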

- **(RenalDiseaseIndicator) , (OPAnnualReimbursementAmt) and (GENDER)**

In [None]:
plot_strip_plots(x_col='Gender', hue_col="RenalDiseaseIndicator", y_col='OPAnnualReimbursementAmt', lgd_title="Renal Kidney Disease")

- **`OBSERVATION`**
  - The above graph shows that the data points overlap completely, with some potential outliers (possibly fraud).
      - Another thing I found is that a few of the points lie in the negative range (this is quite strange and may be a data error).

- **(ChronicCond_rheumatoidarthritis) , (IPAnnualReimbursementAmt) and (GENDER)**

In [None]:
plot_strip_plots(x_col='Gender', hue_col="ChronicCond_rheumatoidarthritis", y_col='IPAnnualReimbursementAmt', lgd_title="Rheumatoidarthritis")

- **`OBSERVATION`**
  - The above graph shows that the data points overlap completely, with some potential outliers (possibly fraud).
      - Another thing I found is that a few of the points lie in the negative range (this is quite strange and may be a data error).

- **(ChronicCond_rheumatoidarthritis) , (OPAnnualReimbursementAmt) and (GENDER)**

In [None]:
plot_strip_plots(x_col='Gender', hue_col="ChronicCond_rheumatoidarthritis", y_col='OPAnnualReimbursementAmt', lgd_title="Rheumatoidarthritis")

- **`OBSERVATION`**
  - The above graph shows that the data points overlap completely, with some potential outliers (possibly fraud).
      - Another thing I found is that a few of the points lie in the negative range (this is quite strange and may be a data error).

- **(ChronicCond_IschemicHeart) , (IPAnnualReimbursementAmt) and (GENDER)**

In [None]:
plot_strip_plots(x_col='Gender', hue_col="ChronicCond_IschemicHeart", y_col='IPAnnualReimbursementAmt', lgd_title="Ischemic Heart")

- **`OBSERVATION`**
  - The above graph shows that the data points overlap completely, with some potential outliers, specifically for patients with chronic ischemic heart disease (possibly fraud).
      - Another thing I found is that a few of the points lie in the negative range (this is quite strange and may be a data error).

- **(ChronicCond_IschemicHeart) , (OPAnnualReimbursementAmt) and (GENDER)**

In [None]:
plot_strip_plots(x_col='Gender', hue_col="ChronicCond_IschemicHeart", y_col='OPAnnualReimbursementAmt', lgd_title="Ischemic Heart")

- **`OBSERVATION`**
  - The above graph shows that the data points overlap completely, with some potential outliers, specifically for patients with chronic ischemic heart disease (possibly fraud).
      - Another thing I found is that a few of the points lie in the negative range (this is quite strange and may be a data error).

### **Q23. Let's visualize the spread of the Annual IP and OP expenditures across AGE and its associated features for males and females.**

In [None]:
with plt.style.context('seaborn-poster'):
    plt.figure(figsize=(12,12))
    sns.scatterplot(data=train_bene_df, x='AGE', y='IPAnnualReimbursementAmt', hue='Gender', palette='cubehelix')
    plt.grid(which='major', linestyle="-.", color='lightpink')
    plt.minorticks_on()
    plt.title("Spread of payment paid by payer for entire age\n", fontdict=title_font_dict)

- **`OBSERVATION`**
  - The above graph shows that the data points of the two genders overlap completely.

In [None]:
with plt.style.context('seaborn-poster'):
    plt.figure(figsize=(12,12))
    sns.scatterplot(data=train_bene_df, x='AGE', y='OPAnnualReimbursementAmt', hue='Gender', palette='cubehelix')
    plt.grid(which='major', linestyle="-.", color='lightpink')
    plt.minorticks_on()
    plt.title("Spread of payment paid by payer for entire age\n", fontdict=title_font_dict)

- **`OBSERVATION`**
  - The above graph shows that the data points of the two genders overlap completely.
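
The overlap seen in the scatter plots can be complemented with a rank correlation between age and the reimbursement amount. The sketch below computes a Spearman-style coefficient as the Pearson correlation of the ranked values; the helper `rank_correlation` and the toy values are hypothetical, and a value near zero would only be consistent with, not proof of, age carrying little signal.

```python
import pandas as pd

def rank_correlation(df, x_col, y_col):
    """Spearman-style coefficient: Pearson correlation of the ranked columns."""
    ranks = df[[x_col, y_col]].rank()
    return ranks[x_col].corr(ranks[y_col])

# Toy data (made-up values) standing in for train_bene_df.
toy_age = pd.DataFrame({
    "AGE": [65, 70, 75, 80],
    "IPAnnualReimbursementAmt": [100, 300, 200, 400],
})
rho = rank_correlation(toy_age, "AGE", "IPAnnualReimbursementAmt")
print(round(rho, 3))
```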

- **(AGE GROUPS) , (IPAnnualReimbursementAmt) and (GENDER)**

In [None]:
with plt.style.context('seaborn-poster'):
    plt.figure(figsize=(12,12))
    sns.boxenplot(data=train_bene_df, x='AGE_groups', y='IPAnnualReimbursementAmt', hue='Gender', palette='cubehelix')
    plt.minorticks_on()
    plt.title("Spread of payment paid by payer across age groups\n", fontdict=title_font_dict)

- **`OBSERVATION`**
  - The above graph shows that there is no notable difference in the amounts across the different age groups.

- **(AGE GROUPS) , (OPAnnualReimbursementAmt) and (GENDER)**

In [None]:
with plt.style.context('seaborn-poster'):
    plt.figure(figsize=(12,12))
    sns.boxenplot(data=train_bene_df, x='AGE_groups', y='OPAnnualReimbursementAmt', hue='Gender', palette='cubehelix')
    plt.minorticks_on()
    plt.title("Spread of payment paid by payer across age groups\n", fontdict=title_font_dict)

- **`OBSERVATION`**
  - The above graph shows us there is no such difference in the amounts across different AGE Groups.

- **(DOB MONTH) , (IPAnnualReimbursementAmt) and (GENDER)**

In [None]:
with plt.style.context('seaborn-poster'):
    plt.figure(figsize=(16,12))
    sns.boxenplot(data=train_bene_df, x='Patient_Age_Month', y='IPAnnualReimbursementAmt', hue='Gender', palette='cubehelix')
    plt.minorticks_on()
    plt.title("Spread of payment paid by payer across DOB Months\n", fontdict=title_font_dict)
    plt.legend(loc='upper right', title="Gender")

- **`OBSERVATION`**
  - The above graph shows us there is no such difference in the amounts across different DOB months.

- **(DOB MONTH) , (OPAnnualReimbursementAmt) and (GENDER)**

In [None]:
with plt.style.context('seaborn-poster'):
    plt.figure(figsize=(16,12))
    sns.boxenplot(data=train_bene_df, x='Patient_Age_Month', y='OPAnnualReimbursementAmt', hue='Gender', palette='cubehelix')
    plt.minorticks_on()
    plt.title("Spread of payment paid by payer across DOB Months\n", fontdict=title_font_dict)
    plt.legend(loc='upper right', title="Gender")

- **`OBSERVATION`**
  - The above graph shows us there is no such difference in the amounts across different DOB months.

- **(DOB YEARS) , (IPAnnualReimbursementAmt) and (GENDER)**

In [None]:
with plt.style.context('seaborn'):
    plt.figure(figsize=(16,12))
    sns.stripplot(data=train_bene_df, x='Patient_Age_Year', y='IPAnnualReimbursementAmt', hue='Gender', palette='cubehelix')
    plt.xticks(rotation=90, fontsize=11)
    plt.grid(which='major', linestyle="-.", color='lightpink')
    plt.minorticks_on()
    plt.title("Spread of payment paid by payer across DOB Years\n", fontdict=title_font_dict)

- **`OBSERVATION`**
  - The above graph shows us there is no such difference in the amounts across different DOB years.

- **(DOB YEARS) , (OPAnnualReimbursementAmt) and (GENDER)**

In [None]:
with plt.style.context('seaborn'):
    plt.figure(figsize=(16,12))
    sns.stripplot(data=train_bene_df, x='Patient_Age_Year', y='OPAnnualReimbursementAmt', hue='Gender', palette='cubehelix')
    plt.xticks(rotation=90, fontsize=11)
    plt.grid(which='major', linestyle="-.", color='lightpink')
    plt.minorticks_on()
    plt.title("Spread of payment paid by payer across DOB Years\n", fontdict=title_font_dict)

- **`OBSERVATION`**
  - The above graph shows us there is no such difference in the amounts across different DOB years.

- **(LIFE Status) , (IPAnnualReimbursementAmt) and (GENDER)**

In [None]:
with plt.style.context('seaborn-poster'):
    plt.figure(figsize=(12,8))
    sns.boxenplot(data=train_bene_df, x='Dead_or_Alive', y='IPAnnualReimbursementAmt', hue='Gender', palette='autumn')
    plt.minorticks_on()
    plt.title("Spread of payment paid by payer based on life status\n", fontdict=title_font_dict)

- **`OBSERVATION`**
  - The above graph shows us there is no such difference in the amounts between the two life statuses.

- **(LIFE STATUS) , (OPAnnualReimbursementAmt) and (GENDER)**

In [None]:
with plt.style.context('seaborn-poster'):
    plt.figure(figsize=(12,8))
    sns.boxenplot(data=train_bene_df, x='Dead_or_Alive', y='OPAnnualReimbursementAmt', hue='Gender', palette='autumn')
    plt.minorticks_on()
    plt.title("Spread of payment paid by payer based on life status\n", fontdict=title_font_dict)

- **`OBSERVATION`**
  - The above graph shows us there is no such difference in the amounts between the two life statuses.

- **(HUMAN RACE) , (IPAnnualReimbursementAmt) and (GENDER)**

In [None]:
with plt.style.context('seaborn-poster'):
    plt.figure(figsize=(12,8))
    sns.boxenplot(data=train_bene_df, x='Race', y='IPAnnualReimbursementAmt', hue='Gender', palette='twilight')
    plt.minorticks_on()
    plt.title("Spread of payment paid by payer based on race\n", fontdict=title_font_dict)

- **`OBSERVATION`**
  - The above graph shows us there is no such difference in the amounts across different races.

- **(HUMAN RACE) , (OPAnnualReimbursementAmt) and (GENDER)**

In [None]:
with plt.style.context('seaborn-poster'):
    plt.figure(figsize=(12,8))
    sns.boxenplot(data=train_bene_df, x='Race', y='OPAnnualReimbursementAmt', hue='Gender', palette='twilight')
    plt.minorticks_on()
    plt.title("Spread of payment paid by payer based on race\n", fontdict=title_font_dict)

- **`OBSERVATION`**
  - The above graph shows us there is no such difference in the amounts across different races.

# **`BENE - EDA - SUMMARY`**

1. For the below mentioned features, based on the above initial analysis it looks like these might not be able to provide much information or differentiation but still I would like to check them after adding CLAIMS data.
    - `DOB YEAR`
    - `DOB MONTH`
    - `AGE GROUPS`
    - `LIFE STATUS`
    - `HUMAN RACE`
    - `STATE`
      
      
2. For the below mentioned features, the majority of the values are the same, so they most probably won't be of any use; thus I am removing these from the BENE dataset.
    - `NoOfMonths_PartACov`
    - `NoOfMonths_PartBCov`
    

3. The `Pre-disease` indicators look like important features based on the initial analysis, so it would be interesting to see how useful they are after adding the CLAIMS dataset.


4. `Date of Death` is also removed from the dataset, as we have already calculated bene age, life status and others out of it.
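
Point 2 above can be checked numerically before dropping anything. The sketch below is only an illustration on made-up toy data (not the real BENE values), and the helper name `dominant_value_share` is my own assumption; it reports what share of rows is taken by the most frequent value of each column.

```python
import pandas as pd

def dominant_value_share(df, columns):
    """Return, for each column, the share (in %) of rows taken by its most frequent value."""
    shares = {}
    for col in columns:
        # value_counts() sorts in descending order, so the first entry is the dominant value
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        shares[col] = round(top_share * 100, 2)
    return pd.Series(shares, name="dominant_value_share_pct")

# Toy data, made up for illustration only
toy_bene = pd.DataFrame({
    "NoOfMonths_PartACov": [12, 12, 12, 12, 12, 12, 12, 12, 12, 0],
    "NoOfMonths_PartBCov": [12, 12, 12, 12, 12, 12, 12, 12, 6, 0],
})

dominant_shares = dominant_value_share(toy_bene, ["NoOfMonths_PartACov", "NoOfMonths_PartBCov"])
print(dominant_shares)
```

A column whose dominant share is close to 100% carries almost no variation, which is the reasoning behind removing the two coverage columns.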

In [None]:
train_bene_df.drop(["NoOfMonths_PartACov", "NoOfMonths_PartBCov"], axis=1, inplace=True)

In [None]:
train_bene_df.shape

In [None]:
train_bene_df.head()

In [None]:
train_bene_df.to_csv("train_bene_1.csv")

# **IP & OP Data - EDA**

In [None]:
train_bene_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Train_Beneficiarydata-1542865627584.csv")
train_ip_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Train_Inpatientdata-1542865627584.csv")
train_op_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Train_Outpatientdata-1542865627584.csv")

## ***Records_counts_for_In-patient_&_Out-patient_Data***

- **In-patients**

In [None]:
train_ip_df.shape

In [None]:
train_ip_df.dtypes

In [None]:
train_ip_df.head()

In [None]:
print("### Number of records where patient gets admitted --> {} ###".format(train_ip_df.shape[0]))

- **Out-patients**

In [None]:
train_op_df.shape

In [None]:
train_op_df.dtypes

In [None]:
train_op_df.head()

In [None]:
print("### Number of records where patients didn't get admitted --> {} ###".format(train_op_df.shape[0]))

- **Patient IDs who were treated with or without admission**

In [None]:
ip_bene_unq = set(train_ip_df['BeneID'])
op_bene_unq = set(train_op_df['BeneID'])

In [None]:
len(ip_bene_unq), len(op_bene_unq)

- **Number of patients who either are in-patients or out-patients**

In [None]:
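# Note: intersection() keeps the beneficiaries present in BOTH the IP and OP data;
# ip_bene_unq.difference(op_bene_unq) would give those who appear only as in-patients.
only_in_patients = ip_bene_unq.intersection(op_bene_unq)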
len(only_in_patients)

In [None]:
print("### Only admitted in-patients --> {} ###".format(len(only_in_patients)))

In [None]:
only_out_patients = op_bene_unq.difference(ip_bene_unq)
len(only_out_patients)

In [None]:
print("### Only out-patients --> {} ###".format(len(only_out_patients)))

In [None]:
patients_counts = pd.DataFrame([len(only_in_patients), len(only_out_patients)]).T
patients_counts.columns = ['Only In-patients', 'Only Out-patients']
patients_counts

In [None]:
tot_patients = len(only_in_patients) + len(only_out_patients)
tot_patients

In [None]:
# Here, I'm displaying the number of only in-patients and out-patients
with plt.style.context('seaborn-poster'):
    fig = patients_counts.plot(kind='bar',colormap='rainbow')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_patients,2))+"%"}', (x + width/2, y + height*1.015), ha='center', fontsize=13.5)
    # Providing the labels and title to the graph
    plt.xticks(labels=["Patients Counts"], ticks=[0], rotation=10)
    plt.ylabel("Number or % share of patients\n", fontdict=label_font_dict)
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title("Number of only In or Out patients\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* From the above plot, we can deduce that about 80% of the patients get treated without even being admitted.
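
To keep the set logic explicit, here is a small sketch on made-up beneficiary IDs. It splits the IDs into three disjoint groups: `difference` returns IDs found in only one dataset, while `intersection` returns IDs found in both. The function name `partition_beneficiaries` and the toy IDs are assumptions for illustration only.

```python
def partition_beneficiaries(ip_ids, op_ids):
    """Split beneficiary IDs into three disjoint groups."""
    ip_ids, op_ids = set(ip_ids), set(op_ids)
    return {
        "only_in_patients": ip_ids - op_ids,   # same as ip_ids.difference(op_ids)
        "only_out_patients": op_ids - ip_ids,  # same as op_ids.difference(ip_ids)
        "both": ip_ids & op_ids,               # same as ip_ids.intersection(op_ids)
    }

# Toy IDs, made up for illustration only
toy_ip = ["B1", "B2", "B3"]
toy_op = ["B2", "B3", "B4", "B5", "B6"]

groups = partition_beneficiaries(toy_ip, toy_op)
total = len(set(toy_ip) | set(toy_op))

for name, ids in groups.items():
    print(f"{name}: {len(ids)} ({round(len(ids) * 100 / total, 2)}%)")
```

Because the three groups are disjoint and cover every ID, their sizes add up to the number of unique beneficiaries, which makes the percentages easy to sanity-check.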

# ***Exploring the In-patients Data***

In [None]:
train_ip_df

- **NULL records in the in-patients data**

In [None]:
# Here, I'm displaying the number of NULLs for every feature of the in-patients data
with plt.style.context('seaborn'):
    plt.figure(figsize=(15,12))
    fig = sns.heatmap(pd.DataFrame(train_ip_df.isnull().sum()), annot=True, fmt=".7g", cmap='inferno', cbar=True)
    # Providing the labels and title to the graph
    plt.xticks(labels=[" "], ticks=[0])
    plt.xlabel("Null Counts", fontdict=label_font_dict)
    plt.ylabel("Features Names\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Number of Nulls in In-patients dataset\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* From the above plot, we can see that the majority of the ProcedureCode features are mostly NULL. And, ClmDiagnosisCode_10 is very rarely filled for patients.
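
The heatmap shows raw null counts; a percentage view can make "majority" and "very rare" easier to compare across features. Below is a minimal sketch on a made-up toy frame. The helper name `null_summary` is an assumption, and the toy column names are only illustrative.

```python
import numpy as np
import pandas as pd

def null_summary(df):
    """Return null counts and null percentages per column, most-missing first."""
    summary = pd.DataFrame({
        "null_count": df.isnull().sum(),
        "null_pct": (df.isnull().mean() * 100).round(2),
    })
    return summary.sort_values("null_count", ascending=False)

# Toy claims, made up for illustration only
toy_claims = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3", "C4"],
    "ClmDiagnosisCode_1": ["4019", "2724", None, "25000"],
    "ClmProcedureCode_1": [np.nan, np.nan, np.nan, 3893.0],
})

toy_null_summary = null_summary(toy_claims)
print(toy_null_summary)
```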

- **Added Flag for indicating whether beneficiary admitted or not?**

In [None]:
train_ip_df["Admitted?"] = 1

- **Added Claim_Duration (claim clearance days)**

In [None]:
train_ip_df['ClaimStartDt'] = pd.to_datetime(train_ip_df['ClaimStartDt'], format="%Y-%m-%d")
train_ip_df['ClaimEndDt'] = pd.to_datetime(train_ip_df['ClaimEndDt'], format="%Y-%m-%d")

In [None]:
train_ip_df['Claim_Duration'] = (train_ip_df['ClaimEndDt'] - train_ip_df['ClaimStartDt']).dt.days

In [None]:
train_ip_df['Claim_Duration'].describe()

In [None]:
# Here, I'm displaying the distribution of claim durations for in-patients
with plt.style.context('seaborn-poster'):
    plt.figure(figsize=(12,8))
    train_ip_df['Claim_Duration'].plot(kind='hist', colormap="viridis");
    # Providing the labels and title to the graph
    plt.xlabel("Claim Duration(in days)", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Claim Duration Days\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* From the above plot, we can deduce that the majority of the claims are filed for less than 7 days.


- **Percentiles values**

In [None]:
for val in [0.1,0.2,0.25,0.3,0.4,0.5,0.6,0.7,0.75,0.8,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.999,0.9999,0.99999,0.999999,1.0]:
    percentile = round(float(val*100),6)
    percentile_val = round(train_ip_df["Claim_Duration"].quantile(val),1)
    print("Percentile --> {} and its value is --> {}".format(percentile,percentile_val))

**`OBSERVATION`**
* From the results, we can say that 95% of the claims are filed for 17 days or fewer.
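
As a side note, `Series.quantile` also accepts a list, so the percentile loop can be expressed in one call. The sketch below uses a made-up toy series (durations 0 to 20 days), not the real claim data, and also shows how a percentile value translates into a "share of claims at or below" statement.

```python
import pandas as pd

# Toy durations 0, 1, ..., 20 days, made up for illustration only
toy_durations = pd.Series(range(21), name="Claim_Duration")

# One call returns all the requested percentiles, indexed by the quantile level
toy_percentiles = toy_durations.quantile([0.5, 0.9, 0.95, 1.0])
print(toy_percentiles)

# Rounded 95th percentile (third requested level), as done in the loop above
p95 = round(toy_percentiles.iloc[2], 1)

# Share of claims whose duration is at or below that value
share_at_or_below_p95 = (toy_durations <= p95).mean()
print("95th percentile --> {} days, share at or below --> {}%".format(p95, round(share_at_or_below_p95 * 100, 1)))
```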

### **Q1. What is the relationship b/w Amount of Insurance Claim Reimbursed v/s Claim Clearance Days?**

In [None]:
unq_claim_duration_days = train_ip_df['Claim_Duration'].unique()
unq_claim_duration_days

In [None]:
tot_claims_filed_for_specific_days = pd.DataFrame(train_ip_df.groupby(['Claim_Duration'])['BeneID'].count())
tot_claims_filed_for_specific_days

In [None]:
tot_insc_amount_for_claim_durations = pd.DataFrame(train_ip_df.groupby(['Claim_Duration'])['InscClaimAmtReimbursed'].sum())
tot_insc_amount_for_claim_durations

In [None]:
claim_clearance_amts = pd.merge(left=tot_claims_filed_for_specific_days, right=tot_insc_amount_for_claim_durations,
                                how='inner',
                                left_on=tot_claims_filed_for_specific_days.index,
                                right_on=tot_insc_amount_for_claim_durations.index)

claim_clearance_amts.columns = ['Claim_durations_in_days', 'Total_claims', 'All_Claims_Total_Amount']
claim_clearance_amts.head()

In [None]:
claim_clearance_amts['Avg_Claim_Insc_Amount'] = np.round(claim_clearance_amts['All_Claims_Total_Amount']/claim_clearance_amts['Total_claims'],2)

In [None]:
claim_clearance_amts.head()

In [None]:
with plt.style.context('seaborn'):
    plt.figure(figsize=(16,16))
    sns.pointplot(data=claim_clearance_amts, x='Claim_durations_in_days', y='Total_claims', 
                  color='k', markers="^", linestyles="")
    sns.pointplot(data=claim_clearance_amts, x='Claim_durations_in_days', y='Total_claims', 
                  color='coral', markers="", linestyles="-")
     
    # Providing the labels and title to the graph
    plt.xticks(rotation=90)
    plt.xlabel("\nClaims Durations(in days)", fontdict= label_font_dict)
    plt.ylabel("Total Claims\n", fontdict= label_font_dict)
    plt.yticks(np.arange(0,7500,200))
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title('\nTrend of "Total Filed Claims" for every duration(in days)', fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above graph tells us that the highest number of claims are filed for 3 days. And, there are very few claims with a duration greater than 15 days.
    * However, we can witness a little spike in claims for a duration of 35 days.

* And, there are around 600 claims for which the duration is 0, which means the Claim Start Date and End Date are the same.
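
The per-duration table behind these plots (claim count, total amount and average amount) can also be sketched with a single named aggregation instead of two groupbys plus a merge. This is only an alternative sketch on made-up toy claims; the `toy_` names are assumptions, while the output column names mirror the ones used above.

```python
import pandas as pd

# Toy in-patient claims, made up for illustration only
toy_ip = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3", "C4", "C5"],
    "Claim_Duration": [3, 3, 3, 0, 35],
    "InscClaimAmtReimbursed": [4000, 6000, 5000, 1000, 57000],
})

toy_claim_clearance_amts = (
    toy_ip.groupby("Claim_Duration")
    .agg(
        Total_claims=("ClaimID", "count"),                          # claims per duration
        All_Claims_Total_Amount=("InscClaimAmtReimbursed", "sum"),   # total re-imbursed amount
        Avg_Claim_Insc_Amount=("InscClaimAmtReimbursed", "mean"),    # average re-imbursed amount
    )
    .round(2)
    .reset_index()
    .rename(columns={"Claim_Duration": "Claim_durations_in_days"})
)
print(toy_claim_clearance_amts)
```

Computing the average inside the same aggregation avoids the separate division step and keeps the three statistics aligned on the same group keys.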

In [None]:
with plt.style.context('seaborn'):
    plt.figure(figsize=(16,7))
    sns.pointplot(data=claim_clearance_amts, x='Claim_durations_in_days', y='Avg_Claim_Insc_Amount', 
                  color='blue', markers="^", linestyles="")
    sns.pointplot(data=claim_clearance_amts, x='Claim_durations_in_days', y='Avg_Claim_Insc_Amount', 
                  color='coral', markers="", linestyles="-")
    # Providing the labels and title to the graph
    plt.xticks(rotation=90)
    plt.xlabel("\nVarious Claims duration (in days)", fontdict= label_font_dict)
    plt.ylabel(" ", fontdict= label_font_dict)
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title('\nTrend of "Avg Re-imbursed Claim Amount"', fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above graph tells us that as the claim duration increases, the Avg Re-imbursed Amount also increases; however, as we have already seen, the total number of claims is very low when the duration is greater than 15 days.

* Another thing to look here is that if the duration is b/w [30-35] then the Average Re-imbursed amount is very high and reaches its maximum.

In [None]:
with plt.style.context('seaborn'):
    plt.figure(figsize=(16,7))
    sns.pointplot(data=claim_clearance_amts, x='Claim_durations_in_days', y='All_Claims_Total_Amount', color='green')
    # Providing the labels and title to the graph
    plt.xticks(rotation=90)
    plt.xlabel("\nClaims Durations (in days)", fontdict= label_font_dict)
    plt.ylabel("Total Re-imbursed Claim Amount", fontdict= label_font_dict)
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title("\nTrend of `Total Re-imbursed Claim Amount` for each filed duration(in days)", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above graph tells us that the Total Re-imbursed Amount is the highest for 3 days claims

* And, for claims with durations from 12 to 34 days the total re-imbursed amount is very low; however, for the 35 days duration we can witness a clear spike that can be a potential sign of fraud.

### **Q2. What is the relationship b/w Claimed and Admitted Durations with Re-imbursed Amount?**

In [None]:
train_ip_df['DischargeDt'] = pd.to_datetime(train_ip_df['DischargeDt'], format="%Y-%m-%d")
train_ip_df['AdmissionDt'] = pd.to_datetime(train_ip_df['AdmissionDt'], format="%Y-%m-%d")

In [None]:
train_ip_df['Admitted_Days'] = train_ip_df['DischargeDt'] - train_ip_df['AdmissionDt']
train_ip_df['Admitted_Days'] = train_ip_df['Admitted_Days'].dt.days

In [None]:
claims_with_diff_admitted_and_claimed_dur = train_ip_df[~(train_ip_df['Claim_Duration'] == train_ip_df['Admitted_Days'])]
claims_with_diff_admitted_and_claimed_dur

In [None]:
claims_with_diff_admitted_and_claimed_dur['InscClaimAmtReimbursed'].sum()

**`OBSERVATION`**
* The above table tells us that there are 49 claims whose Claimed Duration and Admitted Duration are different.

* And, for these 49 claims the total re-imbursed amount is around 0.67 Million. So, this doesn't look like an issue, as the admitted days can be greater than the claimed duration depending on the plan bought by the beneficiary.

* **Let's check whether the claimed duration is greater than the admitted duration**

In [None]:
claims_with_diff_admitted_and_claimed_dur[claims_with_diff_admitted_and_claimed_dur['Claim_Duration']  > \
                                          claims_with_diff_admitted_and_claimed_dur['Admitted_Days']]

In [None]:
claims_with_diff_admitted_and_claimed_dur[claims_with_diff_admitted_and_claimed_dur['Claim_Duration']  > \
                                          claims_with_diff_admitted_and_claimed_dur['Admitted_Days']]['InscClaimAmtReimbursed'].sum()

**`OBSERVATION`**
* The above table tells us that 17 claims out of 49 have Claimed Duration greater than the Admitted Duration.

* And, for these 17 claims the total re-imbursed amount is around 0.27 Million. For now, I'll keep this feature but my initial look says that it won't be much of a use.
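
If this comparison is kept as a feature, one possible encoding is a signed gap plus a binary flag. The sketch below uses made-up toy dates; the column names `Claim_Minus_Admitted_Days` and `Claim_Longer_Than_Stay` are hypothetical and not used elsewhere in this kernel.

```python
import pandas as pd

# Toy in-patient claims, made up for illustration only
toy_ip = pd.DataFrame({
    "ClaimStartDt": ["2009-01-01", "2009-02-10", "2009-03-05"],
    "ClaimEndDt":   ["2009-01-04", "2009-02-15", "2009-03-06"],
    "AdmissionDt":  ["2009-01-01", "2009-02-12", "2009-03-01"],
    "DischargeDt":  ["2009-01-04", "2009-02-15", "2009-03-06"],
})
for col in ["ClaimStartDt", "ClaimEndDt", "AdmissionDt", "DischargeDt"]:
    toy_ip[col] = pd.to_datetime(toy_ip[col], format="%Y-%m-%d")

toy_ip["Claim_Duration"] = (toy_ip["ClaimEndDt"] - toy_ip["ClaimStartDt"]).dt.days
toy_ip["Admitted_Days"] = (toy_ip["DischargeDt"] - toy_ip["AdmissionDt"]).dt.days

# Signed gap: positive means the claim period is longer than the hospital stay
toy_ip["Claim_Minus_Admitted_Days"] = toy_ip["Claim_Duration"] - toy_ip["Admitted_Days"]
# Binary flag (hypothetical feature): 1 when the claim period exceeds the stay
toy_ip["Claim_Longer_Than_Stay"] = (toy_ip["Claim_Minus_Admitted_Days"] > 0).astype(int)

print(toy_ip[["Claim_Duration", "Admitted_Days", "Claim_Minus_Admitted_Days", "Claim_Longer_Than_Stay"]])
```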

### **Q3. What is the relationship b/w DeductibleAmtPaid and Re-imbursed Amount?**

In [None]:
no_of_claim_with_no_copay = train_ip_df[train_ip_df['DeductibleAmtPaid'].isna()].shape[0]
no_of_claim_with_no_copay

In [None]:
no_of_claim_with_copay = train_ip_df[~train_ip_df['DeductibleAmtPaid'].isna()].shape[0]
no_of_claim_with_copay

In [None]:
percent_of_no_copay_claims = round((no_of_claim_with_no_copay / (no_of_claim_with_copay + no_of_claim_with_no_copay)) * 100,1)
print("### Percentage of claims with no co-payment or deductible --> {}% ###".format(percent_of_no_copay_claims))

In [None]:
re_imbursed_amt_for_no_copay = train_ip_df[train_ip_df['DeductibleAmtPaid'].isna()]['InscClaimAmtReimbursed'].sum()
re_imbursed_amt_for_no_copay

In [None]:
re_imbursed_amt_with_some_copay = train_ip_df[~train_ip_df['DeductibleAmtPaid'].isna()]['InscClaimAmtReimbursed'].sum()
re_imbursed_amt_with_some_copay

In [None]:
tot_sum_of_claims_with_copay = re_imbursed_amt_with_some_copay / (re_imbursed_amt_with_some_copay + re_imbursed_amt_for_no_copay)
tot_sum_of_claims_with_no_copay = re_imbursed_amt_for_no_copay / (re_imbursed_amt_with_some_copay + re_imbursed_amt_for_no_copay)

In [None]:
percent_of_tot_sum_no_copay_claims_amt = round(tot_sum_of_claims_with_no_copay * 100,1)
print("### Percentage of Total Re-imbursed Amount for claims with no co-payment or deductible --> {}% ###".\
      format(percent_of_tot_sum_no_copay_claims_amt))

**`OBSERVATION`**
* The above results tell us that 2% of the total claims have no co-payment.
    * And, for these 2% (or 899) of the total claims, the total re-imbursed amount is 10.6 Million, that is, 2.6% of the total re-imbursed amount.
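
The two percentages above come from the same mask, so they can be wrapped in one small helper. This is a sketch on made-up toy values; the function name `no_copay_shares` is an assumption, and "no co-payment" follows the definition used in this question (a NULL `DeductibleAmtPaid`).

```python
import numpy as np
import pandas as pd

def no_copay_shares(df):
    """Return (% of claims, % of re-imbursed amount) for claims with a NULL deductible."""
    no_copay = df["DeductibleAmtPaid"].isna()
    pct_claims = round(no_copay.mean() * 100, 1)
    amt_no_copay = df.loc[no_copay, "InscClaimAmtReimbursed"].sum()
    pct_amount = round(amt_no_copay * 100 / df["InscClaimAmtReimbursed"].sum(), 1)
    return pct_claims, pct_amount

# Toy in-patient claims, made up for illustration only
toy_ip = pd.DataFrame({
    "DeductibleAmtPaid": [1068.0, 1068.0, np.nan, 1068.0],
    "InscClaimAmtReimbursed": [5000, 7000, 6000, 2000],
})

pct_claims, pct_amount = no_copay_shares(toy_ip)
print("### Claims with no co-payment --> {}% of claims, {}% of re-imbursed amount ###".format(pct_claims, pct_amount))
```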

In [None]:
# Here, updating the NULL values of DeductibleAmtPaid feature as 0
train_ip_df['DeductibleAmtPaid'] = train_ip_df['DeductibleAmtPaid'].fillna(value=0.0)

### **Q4. What is the relationship of Providers with Total number of claims filed & Re-imbursed Amount?**

In [None]:
# How many unique providers are there in the dataset?
print("We have {} unique number of Providers in the in-patient dataset.".format(train_ip_df['Provider'].nunique()))

In [None]:
provider_tot_claims_filed = pd.DataFrame(train_ip_df.groupby(['Provider'])['ClaimID'].count())
provider_tot_reimbursed_amt = pd.DataFrame(train_ip_df.groupby(['Provider'])['InscClaimAmtReimbursed'].sum())

prv_tot_filed_claims_and_tot_reimb_amt = pd.merge(left=provider_tot_claims_filed, right=provider_tot_reimbursed_amt, how='inner',
                                                  left_on=provider_tot_claims_filed.index, right_on=provider_tot_reimbursed_amt.index)

prv_tot_filed_claims_and_tot_reimb_amt.columns = ['ProviderID', 'Tot_Claims_Filed', 'Tot_Re_Imbursed_Amt']
prv_tot_filed_claims_and_tot_reimb_amt.reset_index(drop=True,inplace=True)
prv_tot_filed_claims_and_tot_reimb_amt['Percentage_out_of_tot_reimb_amt'] = round((prv_tot_filed_claims_and_tot_reimb_amt['Tot_Re_Imbursed_Amt'] / train_ip_df['InscClaimAmtReimbursed'].sum()) * 100, 3)

provider_max_reimbursed_amt = pd.DataFrame(train_ip_df.groupby(['Provider'])['InscClaimAmtReimbursed'].max())
provider_max_reimbursed_amt.rename(columns={"InscClaimAmtReimbursed": "Max_Re_Imbursed_Amt"}, inplace=True)

prv_tot_filed_claims_tot_max_reimb_amt = pd.merge(left=prv_tot_filed_claims_and_tot_reimb_amt, 
                                                  right=provider_max_reimbursed_amt, how='inner',
                                                  left_on=prv_tot_filed_claims_and_tot_reimb_amt['ProviderID'], 
                                                  right_on=provider_max_reimbursed_amt.index)

prv_tot_filed_claims_tot_max_reimb_amt.drop(['key_0'], axis=1, inplace=True)
prv_tot_filed_claims_tot_max_reimb_amt['Diff_in_Tot_and_Max'] = prv_tot_filed_claims_tot_max_reimb_amt['Tot_Re_Imbursed_Amt'] - \
prv_tot_filed_claims_tot_max_reimb_amt['Max_Re_Imbursed_Amt']

prv_tot_filed_claims_tot_max_reimb_amt.head()

In [None]:
prv_tot_filed_claims_tot_max_reimb_amt.sort_values(by=['Diff_in_Tot_and_Max','Max_Re_Imbursed_Amt','Percentage_out_of_tot_reimb_amt'],
                                                   axis=0, inplace=True,
                                                   ascending=[True, False, False])

In [None]:
prv_tot_filed_claims_tot_max_reimb_amt.head(60)

**`OBSERVATION`**
* The above table shows us the Provider IDs who filed only 1 or 2 claims and got the entire amount re-imbursed.
    * This can be a potential sign of fraud, because many small hospitals in rural areas that don't have many facilities or much equipment have committed fraud for benefits. A similar case happened recently; refer here: https://www.justice.gov/opa/pr/two-individuals-convicted-14-billion-health-care-fraud-scheme-involving-rural-hospitals
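
The provider table above is built with several groupbys and merges. As an alternative sketch under the same column names, a single named aggregation gives the same kind of table; the toy providers and amounts below are made up for illustration only.

```python
import pandas as pd

# Toy in-patient claims, made up for illustration only
toy_ip = pd.DataFrame({
    "Provider": ["PRV1", "PRV2", "PRV2", "PRV3", "PRV3", "PRV3"],
    "ClaimID": ["C1", "C2", "C3", "C4", "C5", "C6"],
    "InscClaimAmtReimbursed": [57000, 10000, 30000, 2000, 3000, 5000],
})

toy_prv = (
    toy_ip.groupby("Provider")
    .agg(
        Tot_Claims_Filed=("ClaimID", "count"),
        Tot_Re_Imbursed_Amt=("InscClaimAmtReimbursed", "sum"),
        Max_Re_Imbursed_Amt=("InscClaimAmtReimbursed", "max"),
    )
    .reset_index()
    .rename(columns={"Provider": "ProviderID"})
)

# Share of each provider in the overall re-imbursed amount
toy_prv["Percentage_out_of_tot_reimb_amt"] = (
    toy_prv["Tot_Re_Imbursed_Amt"] * 100 / toy_ip["InscClaimAmtReimbursed"].sum()
).round(3)

# A difference of 0 means a single claim accounts for the provider's whole total
toy_prv["Diff_in_Tot_and_Max"] = toy_prv["Tot_Re_Imbursed_Amt"] - toy_prv["Max_Re_Imbursed_Amt"]

toy_prv = toy_prv.sort_values(
    by=["Diff_in_Tot_and_Max", "Max_Re_Imbursed_Amt", "Percentage_out_of_tot_reimb_amt"],
    ascending=[True, False, False],
).reset_index(drop=True)
print(toy_prv)
```

Sorting by `Diff_in_Tot_and_Max` in ascending order puts the single-claim providers at the top, which is the group highlighted in the observation.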

In [None]:
tot_re_imb_amt_for_prv_with_5orless_claims = prv_tot_filed_claims_tot_max_reimb_amt[prv_tot_filed_claims_tot_max_reimb_amt['Tot_Claims_Filed'] < 5] \
                                            ['Tot_Re_Imbursed_Amt'].sum()

pp_re_imb_amt_for_prv_with_5orless_claims = round((tot_re_imb_amt_for_prv_with_5orless_claims / train_ip_df['InscClaimAmtReimbursed'].sum()) * 100,2)
print("### Total Re-imbursed Amount for Providers with less than 5 filed claims is --> {} (17 Million). ###\n\
### And, this is {}% of Total Re-imbursed Claim Amount (408 Million). ###".format(tot_re_imb_amt_for_prv_with_5orless_claims, 
                                                                    pp_re_imb_amt_for_prv_with_5orless_claims))

# ***Exploring the Out-patients Data***

In [None]:
train_op_df.shape

- **NULL records in the out-patients data**

In [None]:
# Here, I'm displaying the number of NULLs for every feature of the out-patients data
with plt.style.context('seaborn'):
    plt.figure(figsize=(15,12))
    fig = sns.heatmap(pd.DataFrame(train_op_df.isnull().sum()), annot=True, fmt=".7g", cmap='inferno', cbar=True)
    # Providing the labels and title to the graph
    plt.xticks(labels=[" "], ticks=[0])
    plt.xlabel("Null Counts", fontdict=label_font_dict)
    plt.ylabel("Features Names\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Number of Nulls in Out-patients dataset\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* From the above plot, we can say that the majority of the ProcedureCode features are mostly NULL.

* ClmDiagnosisCode_9 & ClmDiagnosisCode_10 are very rare among Patients.

- **Added Flag for indicating whether beneficiary admitted or not?**

In [None]:
train_op_df["Admitted?"] = 0

- **Added Claim_Duration (claim clearance days)**

In [None]:
train_op_df['ClaimStartDt'] = pd.to_datetime(train_op_df['ClaimStartDt'], format="%Y-%m-%d")
train_op_df['ClaimEndDt'] = pd.to_datetime(train_op_df['ClaimEndDt'], format="%Y-%m-%d")

In [None]:
train_op_df['Claim_Duration'] = (train_op_df['ClaimEndDt'] - train_op_df['ClaimStartDt']).dt.days

In [None]:
train_op_df['Claim_Duration'].describe()

In [None]:
# Here, I'm displaying the distribution of claim durations for out-patients
with plt.style.context('seaborn-poster'):
    plt.figure(figsize=(12,8))
    train_op_df['Claim_Duration'].plot(kind='hist', colormap="viridis");
    # Providing the labels and title to the graph
    plt.xlabel("Claim Duration(in days)", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Claim Duration Days\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* From the above plot, we can deduce that the majority of the claims are filed for 2 days or fewer.


- **Percentiles values**

In [None]:
for val in [0.1,0.2,0.25,0.3,0.4,0.5,0.6,0.7,0.75,0.8,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.999,0.9999,0.99999,0.999999,1.0]:
    percentile = round(float(val*100),6)
    percentile_val = round(train_op_df["Claim_Duration"].quantile(val),1)
    print("Percentile --> {} and its value is --> {}".format(percentile,percentile_val))

**`OBSERVATION`**
* From the results, we can say that 90% of the claims are filed for 2 days or fewer.

### **Q5. What is the relationship b/w Claim Duration and Re-imbursed Amount?**

In [None]:
with plt.style.context("seaborn-poster"):
    sns.stripplot(x="Claim_Duration", y="InscClaimAmtReimbursed", data=train_op_df, palette="plasma")
    # Providing the labels and title to the graph
    plt.xlabel("Claim Durations (in days)", fontdict=label_font_dict)
    plt.ylabel("Claim Re-imbursed Amount\n", fontdict=label_font_dict)
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title("Various re-imbursed amounts for different claim durations\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* From the above plot, we can deduce that the majority of the claims filed have a re-imbursed amount of less than 20,000. And, very few have more than 100,000.

In [None]:
for val in [0.1,0.2,0.25,0.3,0.4,0.5,0.6,0.7,0.75,0.8,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.999,0.9999,0.99999,0.999999,1.0]:
    percentile = round(float(val*100),6)
    percentile_val = round(train_op_df["InscClaimAmtReimbursed"].quantile(val),1)
    print("Percentile --> {} and its value is --> {}".format(percentile,percentile_val))

* **99.9% of claims have Re-imbursed amount less than 3500.**

### **Q5.1 What is the relationship b/w Claim Duration and Co-Payment?**

In [None]:
with plt.style.context("seaborn-poster"):
    sns.stripplot(x="Claim_Duration", y="DeductibleAmtPaid", data=train_op_df, palette="plasma")
    # Providing the labels and title to the graph
    plt.xlabel("Claim Durations (in days)", fontdict=label_font_dict)
    plt.ylabel("Co-payment Amount\n", fontdict=label_font_dict)
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title("Various co-payments paid for different claim durations\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* From the above plot, we can deduce that the trend of co-payment is similar across the various durations; however, there are a few co-payments which are very high (more than 800).

### **Q6. What is the relationship b/w Amount of Insurance Claim Reimbursed v/s Claim Clearance Days?**

In [None]:
unq_claim_duration_days = train_op_df['Claim_Duration'].unique()
unq_claim_duration_days

In [None]:
train_op_df.columns

In [None]:
tot_claims_filed_for_specific_days = pd.DataFrame(train_op_df.groupby(['Claim_Duration'])['ClaimID'].count())
tot_claims_filed_for_specific_days

In [None]:
tot_insc_amount_for_claim_durations = pd.DataFrame(train_op_df.groupby(['Claim_Duration'])['InscClaimAmtReimbursed'].sum())
tot_insc_amount_for_claim_durations

In [None]:
claim_clearance_amts = pd.merge(left=tot_claims_filed_for_specific_days, right=tot_insc_amount_for_claim_durations,
                                how='inner',
                                left_on=tot_claims_filed_for_specific_days.index,
                                right_on=tot_insc_amount_for_claim_durations.index)

claim_clearance_amts.columns = ['Claim_durations_in_days', 'Total_claims', 'All_Claims_Total_Amount']
claim_clearance_amts.head()

In [None]:
claim_clearance_amts['Avg_Claim_Insc_Amount'] = np.round(claim_clearance_amts['All_Claims_Total_Amount']/claim_clearance_amts['Total_claims'],2)

In [None]:
claim_clearance_amts.head()

In [None]:
with plt.style.context('seaborn'):
    plt.figure(figsize=(16,10))
    fig = sns.barplot(data=claim_clearance_amts, x='Claim_durations_in_days', y='Total_claims', palette='plasma')     
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/claim_clearance_amts["Total_claims"].sum(),2))+"%"}', (x + width/2, y + height*1.02), ha='center', fontsize=9, rotation=0)
    
    # Providing the labels and title to the graph
    plt.xticks(rotation=90)
    plt.xlabel("\nClaims Durations(in days)", fontdict= label_font_dict)
    plt.ylabel("Total Claims\n", fontdict= label_font_dict)
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title('\nTrend of "Total Filed Claims" for every duration(in days)', fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above graph tells us that the highest number of claims are filed for 0 days. And, there are very few claims for the other durations.
    * However, we can witness a little spike in claims for a duration of 20 days.

In [None]:
with plt.style.context('seaborn'):
    plt.figure(figsize=(14,7))
    sns.pointplot(data=claim_clearance_amts, x='Claim_durations_in_days', y='Avg_Claim_Insc_Amount', 
                  color='blue', markers="^", linestyles="")
    sns.pointplot(data=claim_clearance_amts, x='Claim_durations_in_days', y='Avg_Claim_Insc_Amount', 
                  color='coral', markers="", linestyles="-")
    # Providing the labels and title to the graph
    plt.xticks(rotation=90)
    plt.xlabel("\nVarious Claims duration (in days)", fontdict= label_font_dict)
    plt.ylabel(" ", fontdict= label_font_dict)
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title('\nTrend of "Avg Re-imbursed Claim Amount"', fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above graph tells us that the Average Re-imbursed Amount is roughly the same across the various durations, except for 21 and 23 days.

In [None]:
with plt.style.context('seaborn'):
    plt.figure(figsize=(16,7))
    sns.pointplot(data=claim_clearance_amts, x='Claim_durations_in_days', y='All_Claims_Total_Amount', color='green')
    # Providing the labels and title to the graph
    plt.xticks(rotation=90)
    plt.xlabel("\nClaims Durations (in days)", fontdict= label_font_dict)
    plt.ylabel("Total Re-imbursed Claim Amount", fontdict= label_font_dict)
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title("\nTrend of `Total Re-imbursed Claim Amount` for each filed duration(in days)", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above graph tells us that the Total Re-imbursed Amount is the highest for 0 days claims.

* And, for claims with durations from 2 to 19 days the total re-imbursed amount is very low or similar; however, for the 20 days duration we can witness a clear spike that can be a potential sign of fraud.

### **Q7. What is the relationship b/w DeductibleAmtPaid and Re-imbursed Amount?**

In [None]:
no_of_claim_with_no_copay = train_op_df[train_op_df['DeductibleAmtPaid'] == 0].shape[0]
no_of_claim_with_no_copay

In [None]:
no_of_claim_with_copay = train_op_df[train_op_df['DeductibleAmtPaid'] != 0].shape[0]
no_of_claim_with_copay

In [None]:
percent_of_no_copay_claims = round((no_of_claim_with_no_copay / (no_of_claim_with_copay + no_of_claim_with_no_copay)) * 100,1)
print("### Percentage of claims with no co-payment or deductible --> {}% ###".format(percent_of_no_copay_claims))

In [None]:
re_imbursed_amt_for_no_copay = train_op_df[train_op_df['DeductibleAmtPaid'] == 0]['InscClaimAmtReimbursed'].sum()
re_imbursed_amt_for_no_copay

In [None]:
re_imbursed_amt_with_some_copay = train_op_df[train_op_df['DeductibleAmtPaid'] != 0]['InscClaimAmtReimbursed'].sum()
re_imbursed_amt_with_some_copay

In [None]:
tot_sum_of_claims_with_copay = re_imbursed_amt_with_some_copay / (re_imbursed_amt_with_some_copay + re_imbursed_amt_for_no_copay)
tot_sum_of_claims_with_no_copay = re_imbursed_amt_for_no_copay / (re_imbursed_amt_with_some_copay + re_imbursed_amt_for_no_copay)

In [None]:
percent_of_tot_sum_no_copay_claims_amt = round(tot_sum_of_claims_with_no_copay * 100,1)
print("### Percentage of Total Re-imbursed Amount for claims with no co-payment or deductible --> {}% ###".\
      format(percent_of_tot_sum_no_copay_claims_amt))

**`OBSERVATION`**
* The outputs above show that 95.9% of all claims have no co-payment (deductible).
    * These claims account for a total re-imbursed amount of 142.3 Million, which is 96.1% of the total re-imbursed amount.
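
The two shares above (by claim count and by re-imbursed amount) can also be computed in one pass. The sketch below is only an illustration on hypothetical toy data: the `copay_share` helper and the `toy_claims` frame are assumptions, and only the column names `DeductibleAmtPaid` and `InscClaimAmtReimbursed` come from the dataset.

```python
import pandas as pd

def copay_share(claims: pd.DataFrame) -> pd.DataFrame:
    """Share of claim count and re-imbursed amount, split by whether a deductible was paid."""
    # A NaN deductible is "not equal to 0", so it lands in the "copay" bucket,
    # which matches the `!= 0` filter used in the cells above.
    bucket = (
        claims["DeductibleAmtPaid"].eq(0)
        .map({True: "no_copay", False: "copay"})
        .rename("bucket")
    )
    grouped = claims.groupby(bucket)["InscClaimAmtReimbursed"]
    summary = pd.DataFrame({"n_claims": grouped.size(), "amount": grouped.sum()})
    summary["pct_claims"] = (summary["n_claims"] / summary["n_claims"].sum() * 100).round(1)
    summary["pct_amount"] = (summary["amount"] / summary["amount"].sum() * 100).round(1)
    return summary

# Hypothetical toy claims: three without a deductible, one with a deductible
toy_claims = pd.DataFrame({
    "DeductibleAmtPaid": [0, 0, 0, 50],
    "InscClaimAmtReimbursed": [100, 200, 300, 400],
})
toy_summary = copay_share(toy_claims)
print(toy_summary)
```

On the real data the same helper could be called as `copay_share(train_op_df)`, which would keep the count share and the amount share side by side in a single table.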

### **Q8. What is the relationship of Providers with Total number of claims filed & Re-imbursed Amount?**

In [None]:
# How many unique providers are there in the dataset?
print("We have {} unique Providers in the out-patient dataset.".format(train_op_df['Provider'].nunique()))

In [None]:
provider_tot_claims_filed = pd.DataFrame(train_op_df.groupby(['Provider'])['ClaimID'].count())
provider_tot_reimbursed_amt = pd.DataFrame(train_op_df.groupby(['Provider'])['InscClaimAmtReimbursed'].sum())

prv_tot_filed_claims_and_tot_reimb_amt = pd.merge(left=provider_tot_claims_filed, right=provider_tot_reimbursed_amt, how='inner',
                                                  left_on=provider_tot_claims_filed.index, right_on=provider_tot_reimbursed_amt.index)

prv_tot_filed_claims_and_tot_reimb_amt.columns = ['ProviderID', 'Tot_Claims_Filed', 'Tot_Re_Imbursed_Amt']
prv_tot_filed_claims_and_tot_reimb_amt.reset_index(drop=True,inplace=True)
prv_tot_filed_claims_and_tot_reimb_amt['Percentage_out_of_tot_reimb_amt'] = round((prv_tot_filed_claims_and_tot_reimb_amt['Tot_Re_Imbursed_Amt'] / train_op_df['InscClaimAmtReimbursed'].sum()) * 100, 3)

provider_max_reimbursed_amt = pd.DataFrame(train_op_df.groupby(['Provider'])['InscClaimAmtReimbursed'].max())
provider_max_reimbursed_amt.rename(columns={"InscClaimAmtReimbursed": "Max_Re_Imbursed_Amt"}, inplace=True)

prv_tot_filed_claims_tot_max_reimb_amt = pd.merge(left=prv_tot_filed_claims_and_tot_reimb_amt, 
                                                  right=provider_max_reimbursed_amt, how='inner',
                                                  left_on=prv_tot_filed_claims_and_tot_reimb_amt['ProviderID'], 
                                                  right_on=provider_max_reimbursed_amt.index)

prv_tot_filed_claims_tot_max_reimb_amt.drop(['key_0'], axis=1, inplace=True)
prv_tot_filed_claims_tot_max_reimb_amt['Diff_in_Tot_and_Max'] = prv_tot_filed_claims_tot_max_reimb_amt['Tot_Re_Imbursed_Amt'] - \
prv_tot_filed_claims_tot_max_reimb_amt['Max_Re_Imbursed_Amt']

prv_tot_filed_claims_tot_max_reimb_amt.head()

In [None]:
prv_tot_filed_claims_tot_max_reimb_amt.sort_values(by=['Diff_in_Tot_and_Max','Max_Re_Imbursed_Amt','Percentage_out_of_tot_reimb_amt'],
                                                   axis=0, inplace=True,
                                                   ascending=[True, False, False])

In [None]:
prv_tot_filed_claims_tot_max_reimb_amt.head(60)

**`OBSERVATION`**
* The table above shows the Provider IDs that filed only 1 or 2 claims and had the entire amount re-imbursed.
    * This can be a potential sign of fraud, because small hospitals in rural areas that lack facilities or equipment have been involved in fraud schemes for financial benefit. A similar case happened recently, refer here: https://www.justice.gov/opa/pr/two-individuals-convicted-14-billion-health-care-fraud-scheme-involving-rural-hospitals
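
The provider-level table built above through several merges can also be produced with a single named aggregation. This is a minimal sketch on hypothetical toy data (the `PRV_A`, `PRV_B`, `PRV_C` IDs and the `toy_claims` frame are assumptions); the output column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy claims for three providers
toy_claims = pd.DataFrame({
    "Provider": ["PRV_A", "PRV_A", "PRV_A", "PRV_B", "PRV_C", "PRV_C"],
    "ClaimID": ["C1", "C2", "C3", "C4", "C5", "C6"],
    "InscClaimAmtReimbursed": [100, 50, 50, 300, 200, 0],
})

# One groupby with named aggregations replaces the separate count/sum/max merges
provider_summary = (
    toy_claims.groupby("Provider")
    .agg(
        Tot_Claims_Filed=("ClaimID", "count"),
        Tot_Re_Imbursed_Amt=("InscClaimAmtReimbursed", "sum"),
        Max_Re_Imbursed_Amt=("InscClaimAmtReimbursed", "max"),
    )
    .reset_index()
)
provider_summary["Diff_in_Tot_and_Max"] = (
    provider_summary["Tot_Re_Imbursed_Amt"] - provider_summary["Max_Re_Imbursed_Amt"]
)

# Providers whose total equals their single largest claim
single_claim_dominated = provider_summary[provider_summary["Diff_in_Tot_and_Max"] == 0]
print(single_claim_dominated)
```

A difference of 0 means that one claim accounts for the provider's whole re-imbursed total, which is the pattern the sorted table surfaces at the top.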

In [None]:
tot_re_imb_amt_for_prv_with_5orless_claims = prv_tot_filed_claims_tot_max_reimb_amt[prv_tot_filed_claims_tot_max_reimb_amt['Tot_Claims_Filed'] < 5] \
                                            ['Tot_Re_Imbursed_Amt'].sum()

pp_re_imb_amt_for_prv_with_5orless_claims = round((tot_re_imb_amt_for_prv_with_5orless_claims / train_op_df['InscClaimAmtReimbursed'].sum()) * 100,2)
print("### Total Re-imbursed Amount for Providers with less than 5 filed claims is --> {} (0.52 Million). ###\n\
### And, this is {}% of Total Re-imbursed Claim Amount (148 Million). ###".format(tot_re_imb_amt_for_prv_with_5orless_claims, 
                                                                    pp_re_imb_amt_for_prv_with_5orless_claims))

# **`IP & OP - EDA - SUMMARY`**

- Features to be added:
    - Claim Duration
    - Admitted Duration
    - Admitted or not?


- Relationships to be validated:
    - Providers <--> Physicians <--> Fraud or not?
    - Providers <--> Physicians <--> Diagnosis and Procedure Codes <--> Fraud or not?
    - Providers with very few claim submissions but a high Re-imbursed amount <--> Fraud or not?

# **Entire Data -- EDA**

In [None]:
train_bene_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Train_Beneficiarydata-1542865627584.csv")
train_ip_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Train_Inpatientdata-1542865627584.csv")
train_op_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Train_Outpatientdata-1542865627584.csv")
train_tgt_lbls_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Train-1542865627584.csv")

## ***Exploring_Target_Labels_Data***

In [None]:
train_tgt_lbls_df.head()

* **Check the Fraud and Non-Fraud Counts**

In [None]:
print("### The unique number of providers are {}. ###".format(train_tgt_lbls_df.shape[0]))

In [None]:
with plt.style.context('seaborn-poster'):
    fig = train_tgt_lbls_df["PotentialFraud"].value_counts().plot(kind='bar', color=['green','orange'])
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/train_tgt_lbls_df.shape[0],2))+"%"}', (x + width/2, y + height*1.015), ha='center', fontsize=13.5)
    # Providing the labels and title to the graph
    plt.xlabel("Provider Fraud or Not?", fontdict=label_font_dict)
    plt.ylabel("Number or % share of providers\n", fontdict=label_font_dict)
    plt.yticks(np.arange(0,5100,500))
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title("Distribution of Fraud & Non-fraud providers\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* From the above plot, we can say that roughly 90% of the providers are not fraudsters and only around 9-10% of them are flagged as potentially fraudulent.
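
The class shares read off the bar chart can be cross-checked numerically, which also confirms that they sum to 100%. The sketch below uses hypothetical toy labels (`toy_labels` is an assumption); only the `PotentialFraud` column name comes from the dataset.

```python
import pandas as pd

# Hypothetical toy label table: 9 non-fraud providers and 1 potentially fraudulent one
toy_labels = pd.DataFrame({
    "Provider": [f"PRV_{i}" for i in range(10)],
    "PotentialFraud": ["No"] * 9 + ["Yes"],
})

# normalize=True returns fractions instead of raw counts
fraud_share_pct = (toy_labels["PotentialFraud"].value_counts(normalize=True) * 100).round(2)
print(fraud_share_pct)
```

Applied to `train_tgt_lbls_df["PotentialFraud"]`, the same call would give the exact percentages annotated on the bars.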

### **Adding the `Admitted` or `Not Admitted` indicator in IP and OP Dataset**

* **Adding in IP Dataset**

In [None]:
train_ip_df["Admitted?"] = 1

In [None]:
train_ip_df.head()

* **Adding in OP Dataset**

In [None]:
train_op_df["Admitted?"] = 0

In [None]:
train_op_df.head()

### **Merging the Datasets**

In [None]:
# Common columns must be 28
common_cols = [col for col in train_ip_df.columns if col in train_op_df.columns]
len(common_cols)

In [None]:
# Merging the IP and OP dataset on the basis of common columns
train_ip_op_df = pd.merge(left=train_ip_df, right=train_op_df, on=common_cols, how="outer")
train_ip_op_df.shape

In [None]:
train_ip_op_df.head()

### **Merging the IP_OP Dataset with BENE Data**

In [None]:
# Joining the IP_OP dataset with the BENE data
train_ip_op_bene_df = pd.merge(left=train_ip_op_df, right=train_bene_df, left_on='BeneID', right_on='BeneID',how='inner')
train_ip_op_bene_df.shape

### **Merging the IP_OP_BENE Dataset with PROVIDER level Tgt Labels Data**

In [None]:
# Joining the IP_OP_BENE dataset with the Tgt Label Provider Data
train_iobp_df = pd.merge(left=train_ip_op_bene_df, right=train_tgt_lbls_df, left_on='Provider', right_on='Provider',how='inner')
train_iobp_df.shape

### **Entire Dataset**

In [None]:
train_iobp_df.shape

In [None]:
# Unique Providers
train_iobp_df["Provider"].nunique()

In [None]:
# Unique Claim numbers
train_iobp_df["ClaimID"].nunique()

### *`ASSUMPTION`* :: One provider may be involved in more than one claim. So, are all the claims filed by a potentially fraud provider fraudulent?
    - This cannot hold true for every provider: if a provider has filed, say, 50 claims, we can't say that all 50 claims of that provider are fraudulent. 
        - There may be a pattern where only 1 or 2 out of the 50 claims filed by a provider are fraudulent. 
    
#### **`Therefore, it is a big assumption to make that all the claims filed by a potentially fraud provider are fraudulent.`**

In [None]:
prvs_claims_df = pd.DataFrame(train_iobp_df.groupby(['Provider'])['ClaimID'].count()).reset_index()
prvs_claims_tgt_lbls_df = pd.merge(left=prvs_claims_df, right=train_tgt_lbls_df, on='Provider', how='inner')
prvs_claims_tgt_lbls_df

**`OBSERVATION`**
* As shown in the above table, PRV51005 has filed 1165 claims, so after joining the datasets all of them will be marked as Fraud.
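
How a provider-level label spreads to every claim after the join can be seen on a small example. This is a sketch under stated assumptions: the provider IDs, claim IDs and both toy frames are hypothetical, and `validate="many_to_one"` is added only as a safety check that each provider carries exactly one label.

```python
import pandas as pd

# Hypothetical toy data: PRV_A files 4 claims, the other two providers 1 claim each
toy_claims = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3", "C4", "C5", "C6"],
    "Provider": ["PRV_A", "PRV_A", "PRV_A", "PRV_A", "PRV_B", "PRV_C"],
})
toy_labels = pd.DataFrame({
    "Provider": ["PRV_A", "PRV_B", "PRV_C"],
    "PotentialFraud": ["Yes", "No", "No"],
})

# Every claim inherits the label of its provider
merged = toy_claims.merge(toy_labels, on="Provider", how="inner", validate="many_to_one")

provider_share = toy_labels["PotentialFraud"].eq("Yes").mean()  # share of flagged providers
claim_share = merged["PotentialFraud"].eq("Yes").mean()         # share of flagged claims
print(f"Flagged providers: {provider_share:.1%} | Flagged claims: {claim_share:.1%}")
```

Because a flagged provider with many claims contributes all of them to the positive class, the claim-level fraud share can be much higher than the provider-level share.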

- **Fraud Count at Claims level**

In [None]:
print(pd.DataFrame(train_iobp_df['PotentialFraud'].value_counts()), "\n")

with plt.style.context('seaborn-poster'):
    fig = train_iobp_df['PotentialFraud'].value_counts().plot(kind='bar', color=['green','orange'])
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/train_iobp_df.shape[0],2))+"%"}', (x + width/2, y + height*1.015), ha='center', fontsize=13.5)
    # Providing the labels and title to the graph
    plt.xlabel("Fraud or Not?", fontdict=label_font_dict)
    plt.ylabel("Number (or %) of claims\n", fontdict=label_font_dict)
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title("Distribution of Fraud & Non-fraud claims\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows us that about 62% of claims are Non-Fraud and about 38% of them are Fraudulent. 
    * By looking at the percentages we may say that there is a class-imbalance problem, but after looking at the number of records it doesn't seem to be a severe one. 
        * So, I'll try class balancing techniques only after training a baseline model w/o any synthetic sampling or class weighting.
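
For reference, one common form of class weighting is the "balanced" heuristic, where each class gets the weight `n_samples / (n_classes * class_count)`. The sketch below is only an illustration on hypothetical toy labels; it is not necessarily the weighting used later in this notebook.

```python
import pandas as pd

# Hypothetical toy labels with a 60/40 split
toy_labels = pd.Series(["No"] * 6 + ["Yes"] * 4, name="PotentialFraud")

counts = toy_labels.value_counts()
# Balanced heuristic: the rarer class receives the larger weight
class_weights = len(toy_labels) / (counts.size * counts)
print(class_weights.to_dict())
```

With a mild imbalance like this the two weights stay close to 1, which is in line with treating the imbalance as not severe.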

# **Feature Engineering + Impact Analysis**
**`Let's create some features`**

### **Adding `New Feature` :: `Is_Alive?`**

    - Is Alive? = Yes if DOD is NaN (no date of death recorded) else No

In [None]:
train_iobp_df['DOB'] = pd.to_datetime(train_iobp_df['DOB'], format="%Y-%m-%d")
train_iobp_df['DOD'] = pd.to_datetime(train_iobp_df['DOD'], format="%Y-%m-%d")

In [None]:
# A missing DOD means no death was recorded, so the beneficiary is treated as alive
train_iobp_df['Is_Alive?'] = np.where(train_iobp_df['DOD'].isna(), 'Yes', 'No')

In [None]:
train_iobp_df['Is_Alive?'].value_counts()

### **Adding `New Feature` :: `Claim_Duration`**
    
    - Claim Duration = Claim End Date - Claim Start Date

In [None]:
train_iobp_df['ClaimStartDt'] = pd.to_datetime(train_iobp_df['ClaimStartDt'], format="%Y-%m-%d")
train_iobp_df['ClaimEndDt'] = pd.to_datetime(train_iobp_df['ClaimEndDt'], format="%Y-%m-%d")

train_iobp_df['Claim_Duration'] = (train_iobp_df['ClaimEndDt'] - train_iobp_df['ClaimStartDt']).dt.days

In [None]:
with plt.style.context('seaborn'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='Claim_Duration', palette='dark')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Duration for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Claim Duration for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Claim Duration alone might not be useful in segregating the Fraud cases.
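
A per-class numeric summary can back up the visual impression from the plot. The sketch below runs on hypothetical toy data (`toy_durations` is an assumption); the column names `PotentialFraud` and `Claim_Duration` are the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy data with identical duration values in both classes
toy_durations = pd.DataFrame({
    "PotentialFraud": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "Claim_Duration": [0, 1, 20, 0, 1, 20],
})

# Count, mean, median and max of the duration for each class
duration_by_label = (
    toy_durations.groupby("PotentialFraud")["Claim_Duration"]
    .describe()[["count", "mean", "50%", "max"]]
)
print(duration_by_label)
```

If the rows of this table are close to each other on the real data, that supports the reading that the feature alone does not separate the two classes.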

- **Relationship b/w `Claim_Duration` and `Potentially Fraud` for both the `Genders`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='Claim_Duration', hue='Gender', palette='prism')      
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Duration of both the genders for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Claim Duration of males and females for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Claim Duration might not be useful in segregating the Fraud cases.

- **Relationship b/w `Claim_Duration` and `Potentially Fraud` for `Is_Alive?`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='Claim_Duration', hue='Is_Alive?', palette='prism')      
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Duration by patient life status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows no notable difference in the distribution of Claim Duration by patient life status between Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Claim Duration might not be useful in segregating the Fraud cases.

- **Relationship b/w `Claim_Duration` and `Potentially Fraud` for all `Human Races`**

In [None]:
with plt.style.context('seaborn-poster'):
    plt.figure(figsize=(16,8))
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='Claim_Duration', hue='Race', palette='cubehelix')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Duration of all human races for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc='upper center',title='Race');

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Claim Duration of all human races for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Claim Duration might not be useful in segregating the Fraud cases.

- **Relationship b/w `Claim_Duration` and `Potentially Fraud` for `RenalDiseaseIndicator`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='Claim_Duration', hue='RenalDiseaseIndicator', palette='copper')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Duration by RKD status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title='RKD');

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Claim Duration of patients with or w/o RKD for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Claim Duration might not be useful in segregating the Fraud cases.

### **Adding `New Feature` :: `Admitted_Duration`**

    - Admitted Duration = Discharge Date - Admission Date

In [None]:
train_iobp_df['AdmissionDt'] = pd.to_datetime(train_iobp_df['AdmissionDt'], format="%Y-%m-%d")
train_iobp_df['DischargeDt'] = pd.to_datetime(train_iobp_df['DischargeDt'], format="%Y-%m-%d")

train_iobp_df['Admitted_Duration'] = (train_iobp_df['DischargeDt'] - train_iobp_df['AdmissionDt']).dt.days

In [None]:
with plt.style.context('seaborn'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Admitted_Duration', palette='Accent_r')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Admitted Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Admitted Duration for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Admit Duration for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Admit Duration alone might not be useful in segregating the Fraud cases.

- **Relationship b/w `Admitted_Duration` and `Potentially Fraud` for both the `Genders`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Admitted_Duration', hue='Gender', palette='inferno')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Admit Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Admit Duration of both the genders for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Admit Duration of males and females for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Admit Duration might not be useful in segregating the Fraud cases.

- **Relationship b/w `Admitted_Duration` and `Potentially Fraud` for `Is_Alive?`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Admitted_Duration', hue='Is_Alive?', palette='prism')      
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Admit Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Admit Duration by patient life status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title='Is_Alive?');

**`OBSERVATION`**
* The above plot shows no notable difference in the distribution of Admit Duration by patient life status between Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Admit Duration might not be useful in segregating the Fraud cases.

- **Relationship b/w `Admitted_Duration` and `Potentially Fraud` for all `Human Races`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Admitted_Duration', hue='Race', palette='plasma')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Admit Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Admit Duration of all human races for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title="Race");

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Admit Duration of all human races for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Admit Duration might not be useful in segregating the Fraud cases.

- **Relationship b/w `Admitted_Duration` and `Potentially Fraud` for `RenalDiseaseIndicator`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Admitted_Duration', hue='RenalDiseaseIndicator',palette='magma')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Admit Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Admit Duration by RKD status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title="RKD");

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Admit Duration of patients with or w/o RKD for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Admit Duration might not be useful in segregating the Fraud cases.

### **Adding `New Feature` :: `Bene_Age`**

    - Bene Age = DOD - DOB (if DOD is Null then replace it with MAX date in DOD)
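
The rule above can be tried on a small example first. This is a minimal sketch on hypothetical toy data (the `toy_bene` frame and its dates are assumptions): a beneficiary without a date of death gets the latest DOD in the data as the reference date, and the age is the day difference divided by 365.

```python
import pandas as pd

# Hypothetical toy beneficiaries: the second one has no recorded date of death
toy_bene = pd.DataFrame({
    "DOB": pd.to_datetime(["1940-01-01", "1950-06-15"]),
    "DOD": pd.to_datetime(["2009-01-01", None]),
})

# Latest recorded DOD acts as the reference date for beneficiaries still alive
reference_date = toy_bene["DOD"].max()
end_date = toy_bene["DOD"].fillna(reference_date)

# Age in years, rounded to one decimal (365-day years, as in this notebook)
toy_bene["Bene_Age"] = ((end_date - toy_bene["DOB"]).dt.days / 365).round(1)
print(toy_bene)
```

Keeping the filled dates in a separate `end_date` series leaves the original `DOD` column untouched, so the missing values can still be used to derive life status afterwards.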

In [None]:
# Filling the Null values with the MAX Date of Death in the Dataset
# (assigning the result back avoids relying on in-place fillna on a column selection)
train_iobp_df['DOD'] = train_iobp_df['DOD'].fillna(train_iobp_df['DOD'].max())

In [None]:
train_iobp_df['Bene_Age'] = round(((train_iobp_df['DOD'] - train_iobp_df['DOB']).dt.days)/365,1)

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Bene_Age', palette='Pastel2')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Beneficiary Age (in years)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Beneficiary Age for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Beneficiary Age for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Beneficiary Age alone might not be useful in segregating the Fraud cases.

- **Relationship b/w `Bene_Age` and `Potentially Fraud` for both the `Genders`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Bene_Age', hue='Gender', palette='inferno')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Bene_Age (in years)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Bene_Age of both the genders for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Bene_Age of males and females for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Bene_Age might not be useful in segregating the Fraud cases.

- **Relationship b/w `Bene_Age` and `Potentially Fraud` for `Is_Alive?`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Bene_Age', hue='Is_Alive?', palette='prism')      
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Bene_Age (in years)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Bene_Age by patient life status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title='Is_Alive?');

**`OBSERVATION`**
* The above plot shows no notable difference in the distribution of Bene_Age by patient life status between Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Bene_Age might not be useful in segregating the Fraud cases.

- **Relationship b/w `Bene_Age` and `Potentially Fraud` for all `Human Races`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Bene_Age', hue='Race', palette='plasma')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Bene_Age (in years)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Bene_Age of all human races for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title="Race");

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Bene_Age of all human races for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Bene_Age might not be useful in segregating the Fraud cases.

- **Relationship b/w `Bene_Age` and `Potentially Fraud` for `RenalDiseaseIndicator`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Bene_Age', hue='RenalDiseaseIndicator',palette='magma')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Bene_Age (in years)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Bene_Age by RKD status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc='lower center', title='RKD');

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Bene_Age of patients with or w/o RKD for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Bene_Age might not be useful in segregating the Fraud cases.

### **Does `InscClaimAmtReimbursed` influence `Potentially Fraud`?**
   - **Relationship b/w `InscClaimAmtReimbursed` and `Potentially Fraud`**

In [None]:
with plt.style.context('seaborn'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='InscClaimAmtReimbursed', palette='flag')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Re-Imb Amount for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Claim Re-Imb Amount for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Claim Re-Imb Amount alone might not be useful in segregating the Fraud cases.

- **Relationship b/w `InscClaimAmtReimbursed` and `Potentially Fraud` for both the `Genders`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='InscClaimAmtReimbursed', hue='Gender', palette='flare')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Re-Imb Amount of both the genders for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Claim Re-Imb Amount of males and females for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Claim Re-Imb Amount might not be useful in segregating the Fraud cases.

- **Relationship b/w `InscClaimAmtReimbursed` and `Potentially Fraud` for `Is_Alive?`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='InscClaimAmtReimbursed', hue='Is_Alive?', palette='prism')      
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Re-Imb Amount by patient life status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title='Is_Alive?');

**`OBSERVATION`**
* The above plot shows no notable difference in the distribution of Claim Re-Imb Amount by patient life status between Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Claim Re-Imb Amount might not be useful in segregating the Fraud cases.

- **Relationship b/w `InscClaimAmtReimbursed` and `Potentially Fraud` for all `Human Races`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='InscClaimAmtReimbursed', hue='Race', palette='plasma')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Re-Imb Amount of all human races for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title="Race");

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Claim Re-Imb Amount of all human races for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Claim Re-Imb Amount might not be useful in segregating the Fraud cases.

- **Relationship b/w `InscClaimAmtReimbursed` and `Potentially Fraud` for `RenalDiseaseIndicator`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='InscClaimAmtReimbursed', hue='RenalDiseaseIndicator',palette='magma')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Re-Imb Amount by RKD status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc='upper center', title='RKD');

**`OBSERVATION`**
* The above plot shows no visible difference in the distribution of Claim Re-Imb Amount for patients with or without RKD between Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Claim Re-Imb Amount might not be useful in segregating the Fraud cases.
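
The observations in this section are read off the plots by eye. As a rough numeric cross-check, the sketch below tabulates a few quantiles per group. It assumes the same column names as `train_iobp_df`; `quantile_table` is a hypothetical helper and the toy values are made up for illustration.

```python
import pandas as pd

def quantile_table(df, value_col, group_cols, qs=(0.25, 0.5, 0.75, 0.95)):
    """Return one row per group with the requested quantiles of value_col."""
    table = df.groupby(list(group_cols))[value_col].quantile(list(qs)).unstack()
    table.columns = [f"q{round(q * 100)}" for q in table.columns]
    return table

# Toy frame standing in for train_iobp_df (values are made up)
toy = pd.DataFrame({
    "PotentialFraud": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "Is_Alive?": [1, 1, 1, 1, 1, 1],
    "InscClaimAmtReimbursed": [100, 200, 300, 100, 200, 300],
})
summary = quantile_table(toy, "InscClaimAmtReimbursed", ["PotentialFraud", "Is_Alive?"])
print(summary)
```

If the quantile rows for `Yes` and `No` are close within every hue level, that supports the "no visible difference" reading; large gaps would call for a closer look.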

### **Does `IPAnnualReimbursementAmt` influence `Potentially Fraud`?**
   - **Relationship b/w `IPAnnualReimbursementAmt` and `Potentially Fraud`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 1], y='PotentialFraud',x='IPAnnualReimbursementAmt', 
                         palette='dark', orient='h')   
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.yticks(rotation=90, fontsize=12)
    plt.xlabel("\nAnnual IP Re-Imb Amount", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Annual IP Re-Imb Amount for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows no visible difference in the distribution of Annual IP Re-Imb Amount between Potentially Fraud and Non-Fraud Providers.
    * The distribution is heavily right-skewed and resembles a Pareto Distribution.
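
A quick way to put a number on the long right tail is the sample skewness, before and after a log transform. This is only a sketch on made-up amounts; in the notebook the series would be the admitted-patient slice of `IPAnnualReimbursementAmt`.

```python
import numpy as np
import pandas as pd

# Toy heavy-tailed amounts (made up for illustration)
amounts = pd.Series([500, 1000, 2000, 3000, 5000, 8000, 20000, 150000], dtype=float)

raw_skew = amounts.skew()            # sample skewness; > 0 means a long right tail
log_skew = np.log1p(amounts).skew()  # log1p is defined at 0, which matters for zero amounts

print(f"skew raw={raw_skew:.2f}, after log1p={log_skew:.2f}")
```

A much smaller skew after `log1p` suggests that plotting or modelling the amount on a log scale may be easier to read than the raw scale.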

#### **Why do we have `IP Annual Re-Imb Amount` as `0` for `Admitted Patients`?**

In [None]:
print(pd.DataFrame(train_iobp_df[(train_iobp_df['IPAnnualReimbursementAmt'] == 0)]['Admitted?'].value_counts()))

- So, we have 413 claims where Patients were admitted to the hospital but the allocated `IP Annual Re-Imb Amt` is 0.

In [None]:
print(pd.DataFrame(train_iobp_df[(train_iobp_df['IPAnnualReimbursementAmt'] == 0) & (train_iobp_df['Admitted?'] == 1)]\
                   ['PotentialFraud'].value_counts()))

- So, out of the 413 claims, about 60% (249) are fraudulent, whereas the remaining 164 (about 40%) are non-fraudulent.

**`OBSERVATION`**

- **Thus, we can say that if the IP Annual Amount is 0 and the patient was admitted to the hospital, the chances of the claim being fraudulent are high (about 60% of such claims).**
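
The two `value_counts()` calls above can also be expressed as one row-normalised crosstab, which gives the fraud share directly. A minimal sketch on made-up rows, assuming the same column names as `train_iobp_df`:

```python
import pandas as pd

toy = pd.DataFrame({
    "Admitted?": [1, 1, 1, 1, 1, 0, 0, 0],
    "IPAnnualReimbursementAmt": [0, 0, 0, 5000, 12000, 0, 0, 3000],
    "PotentialFraud": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes"],
})

admitted = toy[toy["Admitted?"] == 1]
ip_amt = (
    admitted["IPAnnualReimbursementAmt"]
    .eq(0)
    .map({True: "zero", False: "non-zero"})
    .rename("ip_amt")
)

# normalize="index" makes every row sum to 1, so each cell is a share
share = pd.crosstab(ip_amt, admitted["PotentialFraud"], normalize="index")
print(share)
```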

- **Relationship b/w `IPAnnualReimbursementAmt` and `Potentially Fraud` for both the `Genders`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 1], x='PotentialFraud',y='IPAnnualReimbursementAmt', hue='Gender',
                      palette='flare')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Annual IP Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Annual IP Re-Imb Amount of both the genders for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Annual IP Re-Imb Amount of males and females for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Annual IP Re-Imb Amount might not be useful in segregating the Fraud cases.

- **Relationship b/w `IPAnnualReimbursementAmt` and `Potentially Fraud` for `Is_Alive?`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 1], x='PotentialFraud',y='IPAnnualReimbursementAmt', hue='Is_Alive?',
                      palette='prism')      
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Annual IP Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Annual IP Re-Imb Amount by patient life status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title='Is_Alive?');

**`OBSERVATION`**
* The above plot shows no visible difference in the distribution of Annual IP Re-Imb Amount by patient life status between Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Annual IP Re-Imb Amount might not be useful in segregating the Fraud cases.

- **Relationship b/w `IPAnnualReimbursementAmt` and `Potentially Fraud` for all `Human Races`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 1], x='PotentialFraud',y='IPAnnualReimbursementAmt', hue='Race',
                      palette='plasma')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Annual IP Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Annual IP Re-Imb Amount of all human races for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title="Race");

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Annual IP Re-Imb Amount of all human races for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Annual IP Re-Imb Amount might not be useful in segregating the Fraud cases.

- **Relationship b/w `IPAnnualReimbursementAmt` and `Potentially Fraud` for `RenalDiseaseIndicator`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 1], x='PotentialFraud',y='IPAnnualReimbursementAmt', 
                      hue='RenalDiseaseIndicator',palette='magma')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Annual IP Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Annual IP Re-Imb Amount by RKD status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc='upper center', title='RKD');

**`OBSERVATION`**
* The above plot shows only a very slight difference in the distribution of Annual IP Re-Imb Amount for patients with or without RKD between Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Annual IP Re-Imb Amount might not be useful in segregating the Fraud cases.

### **Does `OPAnnualReimbursementAmt` influence `Potentially Fraud`?**
   - **Relationship b/w `OPAnnualReimbursementAmt` and `Potentially Fraud`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 0], y='PotentialFraud', x='OPAnnualReimbursementAmt', 
                         palette='dark', orient='h')   
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.yticks(rotation=90, fontsize=12)
    plt.xlabel("\nAnnual OP Re-Imb Amount", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Annual OP Re-Imb Amount for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows no visible difference in the distribution of Annual OP Re-Imb Amount between Potentially Fraud and Non-Fraud Providers.
    * The distribution is heavily right-skewed and resembles a Pareto Distribution.

#### **Why do we have `OP Annual Re-Imb Amount` as `0` for `Non-Admitted Patients`?**

In [None]:
print(pd.DataFrame(train_iobp_df[(train_iobp_df['OPAnnualReimbursementAmt'] == 0)]['Admitted?'].value_counts()))

- So, we have 1009 claims where Patients were not admitted to the hospital but the allocated `OP Annual Re-Imb Amt` is 0.

In [None]:
print(pd.DataFrame(train_iobp_df[(train_iobp_df['OPAnnualReimbursementAmt'] == 0) & (train_iobp_df['Admitted?'] == 0)]\
                   ['PotentialFraud'].value_counts()))

- So, out of the 1009 claims, about 39% (392) are fraudulent, whereas about 61% (617) are non-fraudulent.

**`OBSERVATION`**

- **Thus, we can say that if the OP Annual Amount is 0 and the patient was not admitted to the hospital, the chances of the claim being fraudulent are lower.**
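
Whether a fraud share counts as "high" or "low" depends on the share across all claims. The sketch below compares a subset's fraud share with that baseline. `fraud_share_vs_baseline` is a hypothetical helper and the rows are made up; only the column names are taken from `train_iobp_df`.

```python
import pandas as pd

def fraud_share_vs_baseline(df, mask, label_col="PotentialFraud", positive="Yes"):
    """Return (positive share inside mask, positive share overall, ratio of the two)."""
    is_pos = df[label_col].eq(positive)
    subset_share = is_pos[mask].mean()
    base_share = is_pos.mean()
    return subset_share, base_share, subset_share / base_share

toy = pd.DataFrame({
    "Admitted?": [0, 0, 0, 0, 1, 1, 0, 0],
    "OPAnnualReimbursementAmt": [0, 0, 0, 0, 100, 0, 250, 900],
    "PotentialFraud": ["Yes", "No", "No", "No", "Yes", "No", "Yes", "Yes"],
})
mask = (toy["OPAnnualReimbursementAmt"] == 0) & (toy["Admitted?"] == 0)
subset_share, base_share, lift = fraud_share_vs_baseline(toy, mask)
print(f"subset={subset_share:.2f} baseline={base_share:.2f} lift={lift:.2f}")
```

A ratio well below 1 would back the "lower" reading, while a ratio close to 1 would mean the subset behaves like the dataset as a whole.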

- **Relationship b/w `OPAnnualReimbursementAmt` and `Potentially Fraud` for both the `Genders`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 0], x='PotentialFraud',y='OPAnnualReimbursementAmt', hue='Gender',
                      palette='flare')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Annual OP Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Annual OP Re-Imb Amount of both the genders for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Annual OP Re-Imb Amount of males and females for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Annual OP Re-Imb Amount might not be useful in segregating the Fraud cases.

- **Relationship b/w `OPAnnualReimbursementAmt` and `Potentially Fraud` for `Is_Alive?`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 0], x='PotentialFraud',y='OPAnnualReimbursementAmt', hue='Is_Alive?',
                      palette='prism')      
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Annual OP Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Annual OP Re-Imb Amount by patient life status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title='Is_Alive?');

**`OBSERVATION`**
* The above plot shows no visible difference in the distribution of Annual OP Re-Imb Amount by patient life status between Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Annual OP Re-Imb Amount might not be useful in segregating the Fraud cases.

- **Relationship b/w `OPAnnualReimbursementAmt` and `Potentially Fraud` for all `Human Races`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 0], x='PotentialFraud',y='OPAnnualReimbursementAmt', hue='Race',
                      palette='plasma')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Annual OP Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Annual OP Re-Imb Amount of all human races for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title="Race");

**`OBSERVATION`**
* The above plot clearly shows us that there is no difference in the distribution of Annual OP Re-Imb Amount of all human races for Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Annual OP Re-Imb Amount might not be useful in segregating the Fraud cases.

- **Relationship b/w `OPAnnualReimbursementAmt` and `Potentially Fraud` for `RenalDiseaseIndicator`**

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 0], x='PotentialFraud',y='OPAnnualReimbursementAmt', 
                      hue='RenalDiseaseIndicator',palette='magma')   
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Annual OP Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Annual OP Re-Imb Amount by RKD status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc='upper center', title='RKD');

**`OBSERVATION`**
* The above plot shows only a very slight difference in the distribution of Annual OP Re-Imb Amount for patients with or without RKD between Potentially Fraud and Non-Fraud Providers.
    * Therefore, we can say that Annual OP Re-Imb Amount might not be useful in segregating the Fraud cases.

### **Adding `New Feature` :: `Total Number of false claims filed by a Provider`**

    - Logic :: COUNT(all claims submitted by a Provider) - COUNT(all non-fraud claims submitted by a Provider)

**`REASONING`**

- The idea behind adding this feature is to see how many fraudulent and non-fraudulent claims each provider has submitted.

- _In Medicare fraud, it has generally been observed that many small hospitals in rural areas were deliberately used to file false claims in exchange for bribes or kickbacks. For such providers, the total number of claims submitted will be low, but the majority of them will be false._

- ***But the problem with the given dataset after joining (IP, OP, BENE with PRV TGT) is that***

In [None]:
from IPython.display import Image
Image("../input/medicare-prv-fraud-files/Joining Datasets.png")

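- ***if a provider is labelled FRAUD, then all the claims associated with it are marked as FRAUD, which I believe is wrong and doesn't provide any useful information.***
    - With provider-level labels only, the logic above returns the provider's total claim count for FRAUD providers and 0 for all others, so the feature would simply restate the label.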
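
A small sketch of why the proposed feature collapses once provider labels are joined onto claims. The frames and IDs below are made up; only the `Provider`, `ClaimID` and `PotentialFraud` column names follow the dataset.

```python
import pandas as pd

claims = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3", "C4", "C5"],
    "Provider": ["P1", "P1", "P1", "P2", "P2"],
})
labels = pd.DataFrame({"Provider": ["P1", "P2"], "PotentialFraud": ["Yes", "No"]})

# The join broadcasts the provider-level label onto every claim of that provider
joined = claims.merge(labels, on="Provider", how="left")

# COUNT(all claims) - COUNT(non-fraud claims), per provider
total = joined.groupby("Provider")["ClaimID"].count()
non_fraud = joined[joined["PotentialFraud"] == "No"].groupby("Provider")["ClaimID"].count()
false_claims = total.sub(non_fraud, fill_value=0).astype(int)
print(false_claims)
```

The result is the full claim count for the provider labelled `Yes` and 0 for the provider labelled `No`, so the feature carries no claim-level information and would leak the target.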

### **Adding `New Feature` :: `Total Number of claims or cases seen by Attending Physician`**

In [None]:
# Total unique number of Attended Physicians
print("Unique number of Attending Physicians present in the dataset are --> {}".format(train_iobp_df['AttendingPhysician'].nunique()))

In [None]:
train_iobp_df['Att_Phy_tot_claims'] = train_iobp_df.groupby(['AttendingPhysician'])['ClaimID'].transform('count')
train_iobp_df['Att_Phy_tot_claims'].describe()

In [None]:
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['Att_Phy_tot_claims'],color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['Att_Phy_tot_claims'],color='green')
    # Providing the labels and title to the graph
    plt.xlabel("\nAttending Physicians Total Claims Submitted", fontdict=label_font_dict)
    plt.xticks(np.arange(0,2800,100), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Total claims filed by Attending Physicians", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='Att_Phy_tot_claims', palette='prism_r', orient='h')   
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,2800,100), rotation=90, fontsize=12)
    plt.xlabel("\nAttending Physician Total Claims", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Total claims filed by Attending Physicians", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above KDE and Box plots suggest that the newly added feature `Att_Phy_tot_claims` may be useful in segregating the potentially fraudulent and non-fraudulent cases.
    * For example, if the total number of claims filed by an Attending Physician is greater than 500, then the chances of being fraudulent are high.
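
`Att_Phy_tot_claims` is built with a group-wise `transform('count')`. The sketch below, on made-up rows, shows how that call behaves: every claim row receives its physician's claim count, and rows with a missing physician ID come back as `NaN`, which is why the missing values are filled before these counts are summed later in the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3", "C4", "C5"],
    "AttendingPhysician": ["PHY1", "PHY1", "PHY2", np.nan, "PHY1"],
})

# transform('count') returns one value per row, aligned to the original index
toy["Att_Phy_tot_claims"] = toy.groupby("AttendingPhysician")["ClaimID"].transform("count")
print(toy)
```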

### **Adding `New Feature` :: `Total Number of claims or cases seen by Operating Physician`**

In [None]:
# Total unique number of Operating Physicians
print("Unique number of Operating Physicians present in the dataset are --> {}".format(train_iobp_df['OperatingPhysician'].nunique()))

In [None]:
train_iobp_df['Opr_Phy_tot_claims'] = train_iobp_df.groupby(['OperatingPhysician'])['ClaimID'].transform('count')
train_iobp_df['Opr_Phy_tot_claims'].describe()

In [None]:
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['Opr_Phy_tot_claims'],color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['Opr_Phy_tot_claims'],color='green')
    # Providing the labels and title to the graph
    plt.xlabel("\nOperating Physicians Total Claims Submitted", fontdict=label_font_dict)
    plt.xticks(np.arange(0,500,20), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Total claims filed by Operating Physicians", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='Opr_Phy_tot_claims', palette='prism_r', orient='h')   
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,500,20), rotation=90, fontsize=12)
    plt.xlabel("\nOperating Physician Total Claims", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Total claims filed by Operating Physicians", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above KDE and Box plots suggest that the newly added feature `Opr_Phy_tot_claims` may be useful in segregating the potentially fraudulent and non-fraudulent cases.
    * Results are very similar to `Att_Phy_tot_claims`.
        * For example, if the total number of claims filed by an Operating Physician is greater than 100, then the chances of being fraudulent are high.

### **Adding `New Feature` :: `Total Number of claims or cases seen by Other Physician`**

In [None]:
# Total unique number of Other Physicians
print("Unique number of Other Physicians present in the dataset are --> {}".format(train_iobp_df['OtherPhysician'].nunique()))

In [None]:
train_iobp_df['Oth_Phy_tot_claims'] = train_iobp_df.groupby(['OtherPhysician'])['ClaimID'].transform('count')
train_iobp_df['Oth_Phy_tot_claims'].describe()

In [None]:
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['Oth_Phy_tot_claims'],color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['Oth_Phy_tot_claims'],color='green')
    # Providing the labels and title to the graph
    plt.xlabel("Other Physicians Total Claims Submitted", fontdict=label_font_dict)
    plt.xticks(np.arange(0,1450,50), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Total claims filed by Other Physicians", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='Oth_Phy_tot_claims', palette='prism_r', orient='h')   
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,1500,50), rotation=90, fontsize=12)
    plt.xlabel("Other Physician Total Claims", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Total claims filed by Other Physicians", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above KDE and Box plots suggest that the newly added feature `Oth_Phy_tot_claims` may be useful in segregating the potentially fraudulent and non-fraudulent cases.
    * Results are very similar to `Att_Phy_tot_claims`.
        * For example, if the total number of claims filed by an Other Physician is greater than 100, then the chances of being fraudulent are high.

In [None]:
# Simultaneously viewing the plots for better understanding
with plt.style.context('seaborn'):
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15,8), sharey=False)
    sns.boxplot(data=train_iobp_df[train_iobp_df["PotentialFraud"] == 'Yes'][['Att_Phy_tot_claims','Opr_Phy_tot_claims','Oth_Phy_tot_claims']],
                ax=ax1, palette='viridis')
    ax1.set_title("Potential Fraud = Yes", fontdict=title_font_dict)
    ax1.set_xlabel("\nPhysicians Categories", fontdict=label_font_dict)
    ax1.set_ylabel("Claims Filed", fontdict=label_font_dict)
    
    sns.boxplot(data=train_iobp_df[train_iobp_df["PotentialFraud"] == 'No'][['Att_Phy_tot_claims','Opr_Phy_tot_claims','Oth_Phy_tot_claims']],
                ax=ax2, palette='magma')
    ax2.set_title("Potential Fraud = No", fontdict=title_font_dict)
    ax2.set_xlabel("\nPhysicians Categories", fontdict=label_font_dict)
    ax2.set_ylabel("Claims Filed", fontdict=label_font_dict)
    # Providing the title to the figure
    fig.suptitle("Distribution of Total Claims filed by Attending(Att), Operating(Opr) and Other(Oth) Physicians.\n", fontdict=title_font_dict)
    plt.minorticks_on()
    plt.plot();    

**`OBSERVATION`**
* The above Box plots suggest that these newly added features may be slightly useful in segregating the potentially fraudulent and non-fraudulent cases.
    * Physicians attached to claims from potentially fraudulent providers tend to have slightly higher total claim counts than those attached to claims from non-fraudulent providers.

### **Adding `Combined Feature` :: `Att_Opr_Oth_Phy_Tot_Claims`**
    
   * It represents the total claims submitted by Attending, Operating and Other Physicians.
       
       * **`Reasoning`** :: The idea behind adding this feature is to see whether the combined number of claims submitted by the physicians helps in identifying potential frauds.


   * **`Logic`** :: Att_Phy_tot_claims + Opr_Phy_tot_claims + Oth_Phy_tot_claims
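
A minimal sketch of this sum on made-up rows. It shows why missing counts need handling first: plain `+` propagates `NaN`, whereas filling with 0 (or using `sum(axis=1)`, which skips `NaN` by default) keeps the total defined.

```python
import numpy as np
import pandas as pd

cols = ["Att_Phy_tot_claims", "Opr_Phy_tot_claims", "Oth_Phy_tot_claims"]
toy = pd.DataFrame({
    "Att_Phy_tot_claims": [10.0, 4.0, np.nan],
    "Opr_Phy_tot_claims": [np.nan, 2.0, np.nan],
    "Oth_Phy_tot_claims": [5.0, np.nan, np.nan],
})

# Plain "+" propagates NaN, so any missing component makes the total NaN
naive_total = toy[cols[0]] + toy[cols[1]] + toy[cols[2]]

# Filling with 0 first keeps the total defined for every row
combined = toy[cols].fillna(0).sum(axis=1)
print(combined.tolist())
```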

In [None]:
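# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may not update the DataFrame under pandas copy-on-write
phy_claim_cols = ['Att_Phy_tot_claims', 'Opr_Phy_tot_claims', 'Oth_Phy_tot_claims']
train_iobp_df[phy_claim_cols] = train_iobp_df[phy_claim_cols].fillna(0)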

In [None]:
train_iobp_df['Att_Opr_Oth_Phy_Tot_Claims'] = train_iobp_df['Att_Phy_tot_claims'] + train_iobp_df['Opr_Phy_tot_claims'] + train_iobp_df['Oth_Phy_tot_claims']

In [None]:
train_iobp_df['Att_Opr_Oth_Phy_Tot_Claims'].describe()

In [None]:
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['Att_Opr_Oth_Phy_Tot_Claims'],color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['Att_Opr_Oth_Phy_Tot_Claims'],color='green')
    # Providing the labels and title to the graph
    plt.xlabel("\nAttending, Operating & Other Physicians Total Claims Submitted", fontdict=label_font_dict)
    plt.xticks(np.arange(0,3600,100), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Total Claims filed by Attending, Operating & Other Physicians", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='Att_Opr_Oth_Phy_Tot_Claims', palette='prism_r', orient='h')   
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,3600,100), rotation=90, fontsize=12)
    plt.xlabel("\nAttending, Operating & Other Physicians Total Claims", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Total claims filed by Attending, Operating & Other Physicians", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plots show results similar to the previous features. There is a slight difference in the data distributions, which may be useful in segregating the potential frauds.

### **Adding `3` `New Features` :: `Prv_Tot_Att_Phy`, `Prv_Tot_Opr_Phy` and `Prv_Tot_Oth_Phy`**
    
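   * These features are meant to represent the total Attending, Operating and Other Physicians for every provider.
       * **`Reasoning`** :: The idea behind adding these features is to see whether working with a very low or very high number of physicians increases or decreases a provider's chances of potential fraud.
       * **`Note`** :: `transform('count')` counts the claim rows with a non-null physician ID for each provider, so a physician who appears on several claims is counted once per claim.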

In [None]:
train_iobp_df["Prv_Tot_Att_Phy"] = train_iobp_df.groupby(['Provider'])['AttendingPhysician'].transform('count')
train_iobp_df["Prv_Tot_Opr_Phy"] = train_iobp_df.groupby(['Provider'])['OperatingPhysician'].transform('count')
train_iobp_df["Prv_Tot_Oth_Phy"] = train_iobp_df.groupby(['Provider'])['OtherPhysician'].transform('count')

In [None]:
# Nulls in the above features
train_iobp_df.isna().sum().tail(3)

In [None]:
train_iobp_df["Prv_Tot_Att_Phy"].describe()

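* The mean of `Prv_Tot_Att_Phy` is about 820. This is an average over claim rows of the per-provider count of claims with an attending physician, not a count of distinct physicians.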

In [None]:
train_iobp_df["Prv_Tot_Opr_Phy"].describe()

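* The mean of `Prv_Tot_Opr_Phy` is about 155. This is an average over claim rows of the per-provider count of claims with an operating physician, not a count of distinct physicians.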

In [None]:
train_iobp_df["Prv_Tot_Oth_Phy"].describe()

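* The mean of `Prv_Tot_Oth_Phy` is about 306. This is an average over claim rows of the per-provider count of claims with an other physician, not a count of distinct physicians.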
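
If the intent is the number of distinct physicians per provider, `nunique` is one alternative to `count`. The sketch below contrasts the two on made-up rows; the output column names are hypothetical, and this is not a replacement for the features computed above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Provider": ["P1", "P1", "P1", "P1", "P2", "P2"],
    "AttendingPhysician": ["PHY1", "PHY1", "PHY2", np.nan, "PHY3", "PHY3"],
})

grouped = toy.groupby("Provider")["AttendingPhysician"]
toy["claims_with_att_phy"] = grouped.transform("count")   # non-null rows per provider
toy["distinct_att_phy"] = grouped.transform("nunique")    # distinct physician IDs per provider

per_provider = toy.drop_duplicates("Provider").set_index("Provider")
print(per_provider[["claims_with_att_phy", "distinct_att_phy"]])
```

The two columns can diverge a lot for providers where a few physicians handle most claims, so it is worth deciding which meaning the feature should carry.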

* **Relationship b/w PROVIDER and ATTENDING PHYSICIAN**

In [None]:
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['Prv_Tot_Att_Phy'],color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['Prv_Tot_Att_Phy'],color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nProviders interacted with how many attending physicians?", fontdict=label_font_dict)
    plt.xticks(np.arange(0,8800,400), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Providers interaction with attending physicians", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='Prv_Tot_Att_Phy', palette='prism_r', orient='h')   
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,8800,400), rotation=90, fontsize=12)
    plt.xlabel("\nProviders interacted with how many attending physicians?", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Providers interaction with attending physicians", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above KDE and Box plots are quite interesting, as we can see that if `Prv_Tot_Att_Phy` is high then the chances of fraud are quite high.

* **Relationship b/w PROVIDER and OPERATING PHYSICIAN**

In [None]:
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['Prv_Tot_Opr_Phy'],color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['Prv_Tot_Opr_Phy'],color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nProviders interacted with how many operating physicians?", fontdict=label_font_dict)
    plt.xticks(np.arange(0,1600,100), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Providers interaction with operating physicians", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='Prv_Tot_Opr_Phy', palette='prism_r', orient='h')   
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,1600,100), rotation=90, fontsize=12)
    plt.xlabel("\nProviders interacted with how many operating physicians?", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Providers interaction with operating physicians", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above KDE and Box plots are quite interesting, as we can see that if `Prv_Tot_Opr_Phy` is high then the chances of fraud are quite high.

* **Relationship b/w PROVIDER and OTHER PHYSICIAN**

In [None]:
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['Prv_Tot_Oth_Phy'],color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['Prv_Tot_Oth_Phy'],color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nProviders interacted with how many other physicians?", fontdict=label_font_dict)
    plt.xticks(np.arange(0,3600,200), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Providers interaction with other physicians", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='Prv_Tot_Oth_Phy', palette='prism_r', orient='h')   
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,3600,200), rotation=90, fontsize=12)
    plt.xlabel("\nProviders interacted with how many other physicians?", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Providers interaction with other physicians", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above KDE and Box plots are quite interesting, as we can see that if `Prv_Tot_Oth_Phy` is high then the chances of fraud are quite high.
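
Because each `Prv_Tot_*` value is repeated on every claim of a provider, claim-level plots weight large providers more heavily. A sketch of a provider-level view on made-up rows follows; it assumes one label per provider, as in this dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Provider": ["P1"] * 4 + ["P2"] + ["P3"],
    "PotentialFraud": ["Yes"] * 4 + ["No"] * 2,
    "Prv_Tot_Att_Phy": [4] * 4 + [1, 1],
})

# Claim-level mean: P1 contributes four rows, so it dominates
claim_level_mean = toy["Prv_Tot_Att_Phy"].mean()

# Provider-level: keep one row per provider before summarising
per_provider = toy.drop_duplicates("Provider")
provider_level_mean = per_provider["Prv_Tot_Att_Phy"].mean()
by_label = per_provider.groupby("PotentialFraud")["Prv_Tot_Att_Phy"].median()

print(claim_level_mean, provider_level_mean)
print(by_label)
```

Comparing the two views helps separate "fraudulent providers have higher counts" from "high-volume providers contribute more rows to the plot".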

### **Adding `Combined Feature` :: `Prv_Tot_Att_Opr_Oth_Phys`**
    
   * It represents the total of all kinds of physicians that a provider has interacted with.
       
       * **`Reasoning`** :: The idea behind adding this feature is to see whether a fraudulent provider interacts with a higher or lower number of physicians across all roles.


   * **`Logic`** :: Prv_Tot_Att_Phy + Prv_Tot_Opr_Phy + Prv_Tot_Oth_Phy
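
One caveat with adding per-role totals is that a physician who appears in more than one role for the same provider is counted more than once. If distinct physicians across roles are of interest, a possible variant is sketched below on made-up rows; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

roles = ["AttendingPhysician", "OperatingPhysician", "OtherPhysician"]
toy = pd.DataFrame({
    "Provider": ["P1", "P1", "P2"],
    "AttendingPhysician": ["PHY1", "PHY2", "PHY3"],
    "OperatingPhysician": ["PHY1", np.nan, "PHY4"],
    "OtherPhysician": [np.nan, "PHY1", np.nan],
})

# Stack the three role columns into one long column, then count distinct IDs
long = toy.melt(id_vars="Provider", value_vars=roles, value_name="Physician")
distinct_any_role = long.groupby("Provider")["Physician"].nunique()

# Sum of per-role distinct counts, for comparison (can double-count a physician)
sum_of_roles = toy.groupby("Provider")[roles].nunique().sum(axis=1)

print(pd.DataFrame({"distinct_any_role": distinct_any_role, "sum_of_roles": sum_of_roles}))
```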

In [None]:
train_iobp_df['Prv_Tot_Att_Opr_Oth_Phys'] = train_iobp_df['Prv_Tot_Att_Phy'] + train_iobp_df['Prv_Tot_Opr_Phy'] + train_iobp_df['Prv_Tot_Oth_Phy']

In [None]:
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['Prv_Tot_Att_Opr_Oth_Phys'],color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['Prv_Tot_Att_Opr_Oth_Phys'],color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nProviders interacted with how many all kind of physicians?", fontdict=label_font_dict)
    plt.xticks(np.arange(0,14000,1000), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Providers interaction with all kind of physicians", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='Prv_Tot_Att_Opr_Oth_Phys', palette='prism_r', orient='h')   
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,14000,1000), rotation=90, fontsize=12)
    plt.xlabel("\nProviders interacted with how many all kind of physicians?", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Providers interaction with all kind of physicians", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The KDE and Box plots above show that when `Prv_Tot_Att_Opr_Oth_Phys` is high, the chances of fraud are noticeably higher.

### **Adding `New Feature` :: `Total Unique Claim Admit Codes used by a PROVIDER`**
   
   * **`Reasoning`** :: The idea behind adding this feature is to see how many unique `Claim Admit Diagnosis Codes` a Provider has used. 
       * There may be a pattern in which a provider using a large number of Admit Diagnosis Codes has a higher or lower chance of fraud.
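
As a small illustration on toy data (the code values are made up), `transform('nunique')` computes the number of distinct codes per provider and broadcasts that value back to every claim row of the provider. Missing codes are not counted.

```python
import numpy as np
import pandas as pd

# Toy claims; one missing admit code for P2.
toy = pd.DataFrame({
    "Provider": ["P1", "P1", "P1", "P2", "P2"],
    "ClmAdmitDiagnosisCode": ["4019", "4019", "2724", np.nan, "486"],
})

# Same pattern as the feature cell: one value per provider, repeated on each row.
toy["PRV_Tot_Admit_DCodes"] = (
    toy.groupby("Provider")["ClmAdmitDiagnosisCode"].transform("nunique")
)

print(toy)
```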

In [None]:
train_iobp_df['PRV_Tot_Admit_DCodes'] = train_iobp_df.groupby(['Provider'])['ClmAdmitDiagnosisCode'].transform('nunique')

In [None]:
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['PRV_Tot_Admit_DCodes'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['PRV_Tot_Admit_DCodes'], color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nProviders used unique Claim Admit Diagnosis Codes", fontdict=label_font_dict)
    plt.xticks(np.arange(0,600,50), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Providers used unique Claim Admit Diagnosis Codes", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='PRV_Tot_Admit_DCodes', palette='prism_r', orient='h')   
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,600,50), rotation=90, fontsize=12)
    plt.xlabel("\nProviders used unique Claim Admit Diagnosis Codes", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Providers used unique Claim Admit Diagnosis Codes", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The KDE and Box plots above show that as `PRV_Tot_Admit_DCodes` increases, the chances of fraud also increase.
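
The plots are drawn on claim-level rows, so a provider with many claims contributes the same feature value many times. A hedged sketch of checking the trend with one row per provider is given below; `provider_level_summary` is a hypothetical helper and the toy numbers are made up.

```python
import pandas as pd

def provider_level_summary(df, feature, label_col="PotentialFraud", key="Provider"):
    """Describe a provider-level feature per class, using one row per provider."""
    per_provider = df.drop_duplicates(subset=key)[[key, label_col, feature]]
    return per_provider.groupby(label_col)[feature].describe()

# Toy data: P1 has three claims, P2 one claim, P3 two claims.
toy = pd.DataFrame({
    "Provider": ["P1"] * 3 + ["P2"] + ["P3"] * 2,
    "PotentialFraud": ["Yes"] * 3 + ["No"] + ["No"] * 2,
    "PRV_Tot_Admit_DCodes": [30] * 3 + [4] + [8] * 2,
})

summary = provider_level_summary(toy, "PRV_Tot_Admit_DCodes")
print(summary[["count", "mean", "50%"]])
```

On the real data this would be called as `provider_level_summary(train_iobp_df, 'PRV_Tot_Admit_DCodes')`, assuming the feature column has already been added.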

**`NOTE` :: `What didn't work?`**
* I also tried adding the `unique number of Admit Diagnosis Codes` used by each of the `3 different classes of physicians`, but the variation was minimal, so those features were not added.

### **Adding `New Feature` :: `Total Unique Number of Diagnosis Group Codes used by a PROVIDER`**
   
   * **`Reasoning`** :: The idea behind adding this feature is to see how many unique `Diagnosis Group Codes` a Provider has used.
       * There may be a pattern in which a provider using a large number of Diagnosis Group Codes has a higher or lower chance of fraud.

In [None]:
train_iobp_df['PRV_Tot_DGrpCodes'] = train_iobp_df.groupby(['Provider'])['DiagnosisGroupCode'].transform('nunique')

In [None]:
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['PRV_Tot_DGrpCodes'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['PRV_Tot_DGrpCodes'], color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nProviders used unique Diagnosis Group Codes", fontdict=label_font_dict)
    plt.xticks(np.arange(0,400,40), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Providers used unique Diagnosis Group Codes", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='PRV_Tot_DGrpCodes', palette='prism_r', orient='h')   
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,400,40), rotation=90, fontsize=12)
    plt.xlabel("\nProviders used unique Diagnosis Group Codes", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Providers used unique Diagnosis Group Codes", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above KDE and Box plots suggest that a high `PRV_Tot_DGrpCodes` slightly increases the chances of fraud.


**`NOTE` :: `What didn't work?`**
* I also tried adding the `unique number of Diagnosis Group Codes` used by each of the `3 different classes of physicians`, but the variation was minimal, so those features were not added.


**`NOTE` :: `What didn't work?`**
* I also tried adding `in how many claims a unique Diagnosis Group Code` is used, but there was no variation at all, so that feature was not added. Kindly refer to the image below:

In [None]:
from IPython.display import Image
Image("../input/medicare-prv-fraud-files/What_didnt_worked_DGCode_across_claims.png")

**`NOTE` :: `What didn't work?`**
* I also tried adding `DOB -- Month`, `DOD -- Year` and `DOD -- Month` to see whether some pattern of bogus DOB or DOD values emerges, but there was no variation at all, so those features were not added. The raw `DOB -- Year` also showed no variation. Kindly refer to the image below:

In [None]:
from IPython.display import Image
Image("../input/medicare-prv-fraud-files/What_didnt_worked_DOB_Years_Month.png")

### **Adding `New Feature` :: `Total unique Date of Birth years of beneficiaries provided by a Provider`**
   
   * **`Reasoning`** :: The idea behind adding this feature is that very high variability in the birth years of a provider's patients might be one of the signs of medicare fraud.
       - Private hospitals that treat poor patients have been reported to file false claims in their names. For example, Nazia is 10 years old, but according to a claim filed by Chhattisgarh-based Shaheed Hospital with the Rashtriya Swasthya Bima Yojna (RSBY), she delivered a baby after a caesarean operation. Mukul (name changed) is only 7, but Agarwal Hospital, Raipur, made a claim for removing a cataract from his eyes.

Read more at:
https://economictimes.indiatimes.com/news/politics-and-nation/private-hospitals-perform-fake-surgeries-to-claim-thousands-in-insurance-cover/articleshow/16934229.cms?utm_source=contentofinterest&utm_medium=text&utm_campaign=cppst

In [None]:
train_iobp_df['DOB_Year'] = train_iobp_df['DOB'].dt.year

In [None]:
train_iobp_df['PRV_Tot_Unq_DOB_Years'] = train_iobp_df.groupby(['Provider'])['DOB_Year'].transform('nunique')

In [None]:
train_iobp_df['PRV_Tot_Unq_DOB_Years'].describe()

In [None]:
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['PRV_Tot_Unq_DOB_Years'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['PRV_Tot_Unq_DOB_Years'], color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nTotal unique Years of birth of patients", fontdict=label_font_dict)
    plt.xticks(np.arange(0,80,4), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution Providers treated Patients of various DOB Years", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='PRV_Tot_Unq_DOB_Years', palette='prism_r', orient='h')   
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,80,4), rotation=90, fontsize=12)
    plt.xlabel("\nTotal unique Years of birth of patients", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution Providers treated Patients of various DOB Years", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above KDE and Box plots suggest that a very high `PRV_Tot_Unq_DOB_Years` increases the chances of fraud as well.
    * As shown below, when `PRV_Tot_Unq_DOB_Years` is greater than or equal to 67, the large majority of cases are fraud.

In [None]:
train_iobp_df[train_iobp_df['PRV_Tot_Unq_DOB_Years'] >=67]['PotentialFraud'].value_counts()
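
A hedged sketch on toy data of turning the threshold check into shares rather than raw counts. It also repeats the check with one row per provider, since raw claim counts are dominated by providers with many claims. The toy values and the `claim_share` / `provider_share` names are hypothetical.

```python
import pandas as pd

# Toy data: P1 and P3 are flagged, P2 and P4 are not.
toy = pd.DataFrame({
    "Provider": ["P1", "P1", "P2", "P3", "P3", "P3", "P4"],
    "PotentialFraud": ["Yes", "Yes", "No", "Yes", "Yes", "Yes", "No"],
    "PRV_Tot_Unq_DOB_Years": [70, 70, 40, 68, 68, 68, 69],
})

threshold = 67  # Cut-off taken from the observation above.
above = toy[toy["PRV_Tot_Unq_DOB_Years"] >= threshold]

# Share of claims above the threshold, by label.
claim_share = above["PotentialFraud"].value_counts(normalize=True)

# Share of providers above the threshold, by label (one row per provider).
provider_share = (
    above.drop_duplicates("Provider")["PotentialFraud"].value_counts(normalize=True)
)

print(claim_share, provider_share, sep="\n\n")
```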

### **Adding `New Feature` :: `Sum of patients age treated by a Provider`**
   
   * **`Reasoning`** :: The idea behind adding this feature is that a very high or very low sum of the ages of the patients treated by a provider might be related to fraud.
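
Because a sum over claim rows grows with the number of claims, this feature partly reflects claim volume. The sketch below, on made-up toy data, places the sum next to a claim count and a mean so the two effects can be told apart; `PRV_Claim_Count` and `PRV_Bene_Age_Mean` are hypothetical helper columns, not features of this kernel.

```python
import pandas as pd

# Toy data: P1 has more claims from younger patients, P2 fewer claims from older ones.
toy = pd.DataFrame({
    "Provider": ["P1"] * 4 + ["P2"] * 2,
    "Bene_Age": [70, 72, 68, 70, 85, 95],
})

grp = toy.groupby("Provider")["Bene_Age"]
toy["PRV_Bene_Age_Sum"] = grp.transform("sum")
toy["PRV_Claim_Count"] = grp.transform("count")   # hypothetical helper column
toy["PRV_Bene_Age_Mean"] = grp.transform("mean")  # hypothetical helper column

print(toy.drop_duplicates("Provider"))
```

In this toy case P1 has the larger sum only because it has more claims; its mean patient age is lower.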

In [None]:
train_iobp_df['PRV_Bene_Age_Sum'] = train_iobp_df.groupby(['Provider'])['Bene_Age'].transform('sum')

In [None]:
train_iobp_df['PRV_Bene_Age_Sum'].describe()

In [None]:
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['PRV_Bene_Age_Sum'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['PRV_Bene_Age_Sum'], color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nSum of patients age treated by Providers", fontdict=label_font_dict)
    plt.xticks(np.arange(0,620000,50000), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Sum of patients age treated by Providers", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='PRV_Bene_Age_Sum', palette='prism_r', orient='h')   
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,620000,50000), rotation=90, fontsize=11)
    plt.xlabel("\nSum of patients age treated by Providers", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Sum of patients age treated by Providers", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above KDE and Box plots suggest that if `PRV_Bene_Age_Sum` is high then it increases the chances of fraud.

### **Adding `New Feature` :: `Sum of Insc Claim Re-Imb Amount for a Provider`**
   
   * **`Reasoning`** :: The idea behind adding this feature is that a very high or very low sum of claim re-imb amounts for a provider might be related to fraud.

In [None]:
train_iobp_df['PRV_Insc_Clm_ReImb_Amt'] = train_iobp_df.groupby(['Provider'])['InscClaimAmtReimbursed'].transform('sum')

In [None]:
train_iobp_df['PRV_Insc_Clm_ReImb_Amt'].describe()

In [None]:
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['PRV_Insc_Clm_ReImb_Amt'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['PRV_Insc_Clm_ReImb_Amt'], color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nSum of Insc Claim Re-Imb Amount for a Provider", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Sum of Insc Claim Re-Imb Amount for a Provider", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='PRV_Insc_Clm_ReImb_Amt', palette='prism_r', orient='h')   
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xlabel("\nSum of Insc Claim Re-Imb Amount for a Provider", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Sum of Insc Claim Re-Imb Amount for a Provider", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above KDE and Box plots suggest that if `PRV_Insc_Clm_ReImb_Amt` is high then it increases the chances of fraud.

### **Adding `New Feature` :: `Total number of RKD Patients seen by a Provider`**
   
   * **`Reasoning`** :: The idea behind adding this feature is that a very high or very low total number of RKD Patients seen by a Provider might be related to fraud.
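
A minimal sketch on toy data of the indicator encoding and the per-provider total, assuming the raw column marks renal disease with `"Y"`. Writing the flag to a separate column (the hypothetical `RKD_Flag`) keeps the step safe to re-run, and the last lines show that summing over claim rows counts claims, which can differ from the number of distinct beneficiaries.

```python
import pandas as pd

# Toy claims; beneficiary B1 appears on two claims for P1.
toy = pd.DataFrame({
    "Provider": ["P1", "P1", "P1", "P2"],
    "BeneID": ["B1", "B1", "B2", "B3"],
    "RenalDiseaseIndicator": ["Y", "Y", "0", "Y"],
})

# Vectorised 0/1 encoding; the raw column is left untouched.
toy["RKD_Flag"] = toy["RenalDiseaseIndicator"].eq("Y").astype(int)

# Claim-level total per provider, broadcast to every row.
toy["PRV_Tot_RKD_Patients"] = toy.groupby("Provider")["RKD_Flag"].transform("sum")

# Distinct beneficiaries with the flag set, for comparison.
distinct_rkd = toy[toy["RKD_Flag"] == 1].groupby("Provider")["BeneID"].nunique()

print(toy)
print(distinct_rkd)
```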

In [None]:
train_iobp_df['RenalDiseaseIndicator'] = train_iobp_df['RenalDiseaseIndicator'].apply(lambda val: 1 if val == "Y" else 0)

In [None]:
train_iobp_df['PRV_Tot_RKD_Patients'] = train_iobp_df.groupby(['Provider'])['RenalDiseaseIndicator'].transform('sum')

In [None]:
train_iobp_df['PRV_Tot_RKD_Patients'].describe()

In [None]:
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['PRV_Tot_RKD_Patients'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['PRV_Tot_RKD_Patients'], color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nRKD Patients seen by a Provider", fontdict=label_font_dict)
    plt.xticks(np.arange(0,1600,100), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of total number of RKD Patients seen by a Provider", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");

In [None]:
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='PRV_Tot_RKD_Patients', palette='prism_r', orient='h')   
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,1600,100), rotation=90, fontsize=11)
    plt.xlabel("\nRKD Patients seen by a Provider", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of total number of RKD Patients seen by a Provider", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above KDE and Box plots suggest that if `PRV_Tot_RKD_Patients` is high then it increases the chances of fraud.

# **`Some trends`**

### **Q1. Which are the Top-25 `Providers` with maximum number of fraudulent cases?**

In [None]:
tmp = pd.DataFrame(train_iobp_df.groupby(['Provider','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['Provider', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)

tmp.head()

In [None]:
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_frauds[['Provider','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_frauds.iloc[0:25], x="Provider", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent Providers", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 Providers with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows the Top-25 Providers with the highest percentage share of Fraudulent Case Submissions.
    * Here, PRV51459 has the highest percentage share of fraudulent cases. The difference between the other providers is not that high.

### **Q2. Which are the Top-25 `Providers` with maximum number of non-fraudulent cases?**

In [None]:
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_non_frauds[['Provider','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_non_frauds.iloc[0:25], x="Provider", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent Providers", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 Providers with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows the Top-25 Providers with the highest percentage share of Non-Fraudulent Case Submissions.
    * Here, PRV53750 has the highest percentage share of non-fraudulent cases, but the difference from the other providers is not so high.
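
The per-class share calculation used in these questions repeats for every entity column. A hedged sketch of wrapping it in one helper is shown below; `top_n_by_class` is a hypothetical name, and the toy data is made up. Rows with a missing entity are excluded from both the counts and the denominator, which matches the default `groupby` behaviour of dropping missing keys.

```python
import pandas as pd

def top_n_by_class(df, entity_col, label, n=25, label_col="PotentialFraud"):
    """Return the n entities with the most rows for one label value,
    plus each entity's percentage share of all rows carrying that label."""
    subset = df[(df[label_col] == label) & df[entity_col].notna()]
    counts = subset[entity_col].value_counts()
    out = counts.head(n).rename_axis(entity_col).reset_index(name="Num_of_cases")
    out["Percentage"] = (out["Num_of_cases"] / len(subset) * 100).round(2)
    return out

# Toy data: four claims from flagged providers, two from a non-flagged one.
toy = pd.DataFrame({
    "Provider": ["P1", "P1", "P1", "P2", "P3", "P3"],
    "PotentialFraud": ["Yes", "Yes", "Yes", "Yes", "No", "No"],
})

top_fraud = top_n_by_class(toy, "Provider", "Yes", n=1)
print(top_fraud)
```

On the real data, the same call with `entity_col` set to `'AttendingPhysician'`, `'OperatingPhysician'`, `'OtherPhysician'` or a code column would give the table behind each bar plot.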

### **Q3. Which are the Top-25 `Attending Physicians` with maximum number of fraudulent cases?**

In [None]:
tmp = pd.DataFrame(train_iobp_df.groupby(['AttendingPhysician','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['AttendingPhysician', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)

tmp.head()

In [None]:
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_frauds[['AttendingPhysician','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_frauds.iloc[0:25], x="AttendingPhysician", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent AttendingPhysician", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 AttendingPhysician with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows the Top-25 Attending Physicians with the highest percentage share of Fraudulent Case Submissions.
    * Here, PHY330576 has the highest percentage share of fraudulent cases, but the difference from the other physicians is not so high.

### **Q4. Which are the Top-25 `Attending Physicians` with maximum number of non-fraudulent cases?**

In [None]:
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_non_frauds[['AttendingPhysician','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_non_frauds.iloc[0:25], x="AttendingPhysician", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent AttendingPhysician", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 AttendingPhysician with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows the Top-25 Attending Physicians with the highest percentage share of Non-Fraudulent Case Submissions.
    * Here, PHY351121 has the highest percentage share of non-fraudulent cases, but the difference from the other physicians is not so high.

### **Q5. Which are the Top-25 `Operating Physicians` with maximum number of fraudulent cases?**

In [None]:
tmp = pd.DataFrame(train_iobp_df.groupby(['OperatingPhysician','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['OperatingPhysician', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)

tmp.head()

In [None]:
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_frauds[['OperatingPhysician','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_frauds.iloc[0:25], x="OperatingPhysician", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent OperatingPhysician", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 OperatingPhysician with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows the Top-25 Operating Physicians with the highest percentage share of Fraudulent Case Submissions.
    * Here, PHY330576 has the highest percentage share of fraudulent cases, but the difference from the other physicians is not so high.

### **Q6. Which are the Top-25 `Operating Physicians` with maximum number of non-fraudulent cases?**

In [None]:
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_non_frauds[['OperatingPhysician','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_non_frauds.iloc[0:25], x="OperatingPhysician", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent OperatingPhysician", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 OperatingPhysician with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows the Top-25 Operating Physicians with the highest percentage share of Non-Fraudulent Case Submissions.
    * Here, PHY387900 and PHY351121 have the highest percentage shares of non-fraudulent cases, but the difference between physicians is not so high.

### **Q7. Which are the Top-25 `Other Physicians` with maximum number of fraudulent cases?**

In [None]:
tmp = pd.DataFrame(train_iobp_df.groupby(['OtherPhysician','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['OtherPhysician', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)

tmp.head()

In [None]:
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_frauds[['OtherPhysician','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_frauds.iloc[0:25], x="OtherPhysician", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent OtherPhysician", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 OtherPhysician with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows the Top-25 Other Physicians with the highest percentage share of Fraudulent Case Submissions.
    * Here, PHY412132 has the highest percentage share of fraudulent cases, but the difference from the other physicians is not so high.

### **Q8. Which are the Top-25 `Other Physicians` with maximum number of non-fraudulent cases?**

In [None]:
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_non_frauds[['OtherPhysician','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_non_frauds.iloc[0:25], x="OtherPhysician", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent OtherPhysician", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 OtherPhysician with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows the Top-25 Other Physicians with the highest percentage share of Non-Fraudulent Case Submissions.
    * Here, PHY422235 has the highest percentage share of non-fraudulent cases, but the difference from the other physicians is not so high.

### **Q9. Which are the Top-25 `ClmAdmitDiagnosisCode` with maximum number of fraudulent cases?**

In [None]:
tmp = pd.DataFrame(train_iobp_df.groupby(['ClmAdmitDiagnosisCode','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['ClmAdmitDiagnosisCode', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)

tmp.head()

In [None]:
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_frauds[['ClmAdmitDiagnosisCode','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_frauds.iloc[0:25], x="ClmAdmitDiagnosisCode", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent ClmAdmitDiagnosisCode", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 ClmAdmitDiagnosisCode with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows the Top-25 `Claim Admit Diagnosis Codes` with the highest percentage share of Fraudulent Case Submissions.

### **Q10. Which are the Top-25 `ClmAdmitDiagnosisCode` with maximum number of non-fraudulent cases?**

In [None]:
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_non_frauds[['ClmAdmitDiagnosisCode','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_non_frauds.iloc[0:25], x="ClmAdmitDiagnosisCode", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent ClmAdmitDiagnosisCode", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 ClmAdmitDiagnosisCode with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**

* The above plot shows the Top-25 'Claim Admit Diagnosis Code' values with the highest share of Non-Fraudulent Case Submissions.

* The main observation from the above 2 plots is that the same `Claim Admit Diagnosis Codes` have similar percentages for fraudulent and non-fraudulent claims. Therefore, this feature on its own might not be very useful for separating the two classes.
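The comparison between the two plots can also be made in a single table. The snippet below is a minimal sketch on made-up data (the `toy_tmp` frame and its values are assumptions, standing in for the `tmp` table built above): it places the fraud and non-fraud shares of each code side by side and ranks the codes by the gap between them.

```python
import pandas as pd

# Toy stand-in for the `tmp` table built above (values are made up for illustration).
toy_tmp = pd.DataFrame({
    "ClmAdmitDiagnosisCode": ["V7612", "V7612", "42731", "42731", "4019", "4019"],
    "Fraud?": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "Percentage": [1.20, 1.10, 0.90, 0.95, 0.50, 0.80],
})

# One row per code, one column per class, so the shares sit side by side.
side_by_side = toy_tmp.pivot(index="ClmAdmitDiagnosisCode", columns="Fraud?", values="Percentage")

# Absolute gap between the two class shares; a large gap hints at a discriminative code.
side_by_side["abs_diff"] = (side_by_side["Yes"] - side_by_side["No"]).abs()
side_by_side = side_by_side.sort_values("abs_diff", ascending=False)
print(side_by_side)
```

If most gaps are close to zero, that supports the reading that the raw code alone carries little signal.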

### **Q11. Which are the Top-25 `DiagnosisGroupCode` with maximum number of fraudulent cases?**

In [None]:
tmp = pd.DataFrame(train_iobp_df.groupby(['DiagnosisGroupCode','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['DiagnosisGroupCode', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)

tmp.head()

In [None]:
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_frauds[['DiagnosisGroupCode','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_frauds.iloc[0:25], x="DiagnosisGroupCode", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent DiagnosisGroupCode", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 DiagnosisGroupCode with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows the Top-25 'Diagnosis Group Code' values with the highest share of Fraudulent Case Submissions.

### **Q12. Which are the Top-25 `DiagnosisGroupCode` with maximum number of non-fraudulent cases?**

In [None]:
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_non_frauds[['DiagnosisGroupCode','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_non_frauds.iloc[0:25], x="DiagnosisGroupCode", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent DiagnosisGroupCode", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 DiagnosisGroupCode with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**

* The above plot shows the Top-25 'Diagnosis Group Code' values with the highest share of Non-Fraudulent Case Submissions.

* The main observation from the above 2 plots is that the same `Diagnosis Group Codes` have similar percentages for fraudulent and non-fraudulent claims. Therefore, this feature on its own might not be very useful for separating the two classes.

### **Q13. Does `Age_groups` have any relationship with maximum number of fraudulent cases?**

In [None]:
def bene_age_brackets(val):
    """
    Description : This function is created for allocating the age groups based on Beneficiary Age.
    """
    if val >=1 and val <=40:
        return 'Young'
    elif val > 40 and val <=60:
        return 'Mid'
    elif val > 60 and val <= 80:
        return 'Old'
    else:
        return 'Very Old'

In [None]:
train_iobp_df['AGE_groups'] = train_iobp_df['Bene_Age'].apply(lambda age: bene_age_brackets(age))
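A vectorised alternative to the row-wise function is `pd.cut`. This is only a sketch on a toy series, not the notebook's method, and it differs in one respect: ages outside the bins (including missing ages) become `NaN` instead of falling into `'Very Old'`, and ages between 0 and 1 land in `'Young'`.

```python
import numpy as np
import pandas as pd

# Toy ages chosen to sit on and around the bin edges (made-up values).
ages = pd.Series([0.5, 25.0, 40.0, 40.1, 60.0, 75.3, 80.0, 80.1, np.nan], name="Bene_Age")

# Right-closed bins: (0, 40], (40, 60], (60, 80], (80, inf]
age_groups = pd.cut(
    ages,
    bins=[0, 40, 60, 80, np.inf],
    labels=["Young", "Mid", "Old", "Very Old"],
)
print(age_groups.tolist())
```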

In [None]:
tmp = pd.DataFrame(train_iobp_df.groupby(['AGE_groups','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['AGE_groups', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)

tmp.head()

In [None]:
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_frauds[['AGE_groups','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(10,8))
    fig = sns.barplot(data=tmp_only_frauds, x="AGE_groups", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=0)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent AGE_groups", fontdict=label_font_dict)
    plt.xticks(rotation=0, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("AGE_groups with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows us the percentage of Fraudulent Case Submissions for various Age Groups.

### **Q14. Does `Age_groups` have any relationship with maximum number of non-fraudulent cases?**

In [None]:
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_non_frauds[['AGE_groups','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(10,8))
    fig = sns.barplot(data=tmp_only_non_frauds, x="AGE_groups", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=0)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent AGE_groups", fontdict=label_font_dict)
    plt.xticks(rotation=0, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("AGE_groups with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**

* The above plot shows us the percentage of Non-Fraudulent Case Submissions for various Age Groups.

* The main observation from the above 2 plots is that the same `Age Groups` have similar percentages for fraudulent and non-fraudulent claims. Therefore, this feature on its own might not be very useful for separating the two classes.

### **Q15. Which are the Top-25 `States` with maximum number of fraudulent cases?**

In [None]:
tmp = pd.DataFrame(train_iobp_df.groupby(['State','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['State', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)

tmp.head()

In [None]:
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_frauds[['State','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_frauds.iloc[0:25], x="State", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent State Codes", fontdict=label_font_dict)
    plt.xticks(rotation=0, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 State Codes with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows the Top-25 State Codes with the highest share of Fraudulent Case Submissions.

### **Q16. What are the Top-25 `States` with maximum number of non-fraudulent cases?**

In [None]:
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_non_frauds[['State','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_non_frauds.iloc[0:25], x="State", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent State Codes", fontdict=label_font_dict)
    plt.xticks(rotation=0, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 State Codes with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**

* The above plot shows the Top-25 State Codes with the highest share of Non-Fraudulent Case Submissions.

* The main observation from the above 2 plots is that the same `State Codes` have similar percentages for fraudulent and non-fraudulent claims. Therefore, this feature on its own might not be very useful for separating the two classes.

### **Q17. Which are the Top-25 `County` codes with maximum number of fraudulent cases?**

In [None]:
tmp = pd.DataFrame(train_iobp_df.groupby(['County','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['County', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)

tmp.head()

In [None]:
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_frauds[['County','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(15,10))
    fig = sns.barplot(data=tmp_only_frauds.iloc[0:25], x="County", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent County Codes", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 County Codes with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows the Top-25 County Codes with the highest share of Fraudulent Case Submissions.

### **Q18. What are the Top-25 `County` codes with maximum number of non-fraudulent cases?**

In [None]:
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_non_frauds[['County','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(15,10))
    fig = sns.barplot(data=tmp_only_non_frauds.iloc[0:25], x="County", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent County Codes", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 County Codes with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**

* The above plot shows the Top-25 County Codes with the highest share of Non-Fraudulent Case Submissions.

* The main observation from the above 2 plots is that the same `County Codes` have similar percentages for fraudulent and non-fraudulent claims. Therefore, this feature on its own might not be very useful for separating the two classes.

### **Q19. Do the various `Human Races` have any relationship with maximum number of fraudulent cases?**

In [None]:
tmp = pd.DataFrame(train_iobp_df.groupby(['Race','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['Race', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)

tmp.head()

In [None]:
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_frauds[['Race','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(10,8))
    fig = sns.barplot(data=tmp_only_frauds, x="Race", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=0)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent Race", fontdict=label_font_dict)
    plt.xticks(rotation=0, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Human Race with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows us the percentage of Fraudulent Case Submissions for various Human Races.

### **Q20. Do the various `Human Races` have any relationship with maximum number of non-fraudulent cases?**

In [None]:
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)

In [None]:
print(tmp_only_non_frauds[['Race','Num_of_cases','Percentage']].head(25), "\n")

with plt.style.context('seaborn'):
    plt.figure(figsize=(10,8))
    fig = sns.barplot(data=tmp_only_non_frauds, x="Race", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=0)
    
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent Race", fontdict=label_font_dict)
    plt.xticks(rotation=0, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Human Race with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**

* The above plot shows us the percentage of Non-Fraudulent Case Submissions for various Human Races.

* The main observation from the above 2 plots is that the same `Human Races` have similar percentages for fraudulent and non-fraudulent claims. Therefore, this feature on its own might not be very useful for separating the two classes.
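The same count-and-percentage table is rebuilt for every column in the questions above. As a sketch only, it could be wrapped in one helper; the name `class_share_table` is hypothetical, and it counts rows with `size()` on the assumption that `BeneID` is never null (so the result matches counting `BeneID`).

```python
import pandas as pd

def class_share_table(df, col, label_col="PotentialFraud"):
    """Count rows per (col, label) and express each count as a % of its class total."""
    out = (
        df.groupby([col, label_col])
          .size()
          .reset_index(name="Num_of_cases")
          .rename(columns={label_col: "Fraud?"})
    )
    # Class totals broadcast back onto every row of the same class.
    class_totals = out.groupby("Fraud?")["Num_of_cases"].transform("sum")
    out["Percentage"] = (out["Num_of_cases"] / class_totals * 100).round(2)
    return out

# Toy claims table (made-up values) to exercise the helper.
toy = pd.DataFrame({
    "Race": [1, 1, 1, 2, 2, 1, 2, 2],
    "PotentialFraud": ["Yes", "Yes", "No", "No", "No", "No", "Yes", "No"],
})
shares = class_share_table(toy, "Race")
print(shares)
```

Within each class the `Percentage` column sums to roughly 100, which is a quick sanity check on any of the tables above.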

# **Feature Engg - SET 1**

In [None]:
from IPython.display import Image
Image("../input/medicare-prv-fraud-files/Feat_Engg_SET1.png")

In [None]:
train_bene_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Train_Beneficiarydata-1542865627584.csv")
train_ip_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Train_Inpatientdata-1542865627584.csv")
train_op_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Train_Outpatientdata-1542865627584.csv")
train_tgt_lbls_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Train-1542865627584.csv")

## ***Exploring_Target_Labels_Data***

In [None]:
train_tgt_lbls_df.head()

* **Check the Fraud and Non-Fraud Counts**

In [None]:
print("### The unique number of providers are {}. ###".format(train_tgt_lbls_df.shape[0]))

In [None]:
with plt.style.context('seaborn-poster'):
    fig = train_tgt_lbls_df["PotentialFraud"].value_counts().plot(kind='bar', color=['green','orange'])
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/train_tgt_lbls_df.shape[0],2))+"%"}', (x + width/2, y + height*1.015), ha='center', fontsize=13.5)
    # Providing the labels and title to the graph
    plt.xlabel("Provider Fraud or Not?", fontdict=label_font_dict)
    plt.ylabel("Number or % share of providers\n", fontdict=label_font_dict)
    plt.yticks(np.arange(0,5100,500))
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title("Distribution of Fraud & Non-fraud providers\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* From the above plot, we can say that roughly 90% of the providers are not fraudsters and only about 9% of them are involved in fraud.
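The shares annotated on the bars can also be read off directly with `value_counts(normalize=True)`. A minimal sketch on toy labels (the 18/2 split is made up, not the real distribution):

```python
import pandas as pd

# Toy provider-level labels (made-up proportions, not the real data).
toy_labels = pd.DataFrame({
    "Provider": [f"PRV{i:05d}" for i in range(20)],
    "PotentialFraud": ["Yes"] * 2 + ["No"] * 18,
})

# normalize=True returns fractions instead of raw counts.
share_pct = (toy_labels["PotentialFraud"].value_counts(normalize=True) * 100).round(2)
print(share_pct)
```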

### **Adding `New Feature - 1` :: `Admitted` or `Not Admitted` indicator in IP and OP Dataset**

* **Adding in IP Dataset**

In [None]:
train_ip_df["Admitted?"] = 1

In [None]:
train_ip_df.head()

* **Adding in OP Dataset**

In [None]:
train_op_df["Admitted?"] = 0

In [None]:
train_op_df.head()

### **Merging the Datasets**

In [None]:
# Common columns must be 28
common_cols = [col for col in train_ip_df.columns if col in train_op_df.columns]
len(common_cols)

In [None]:
# Merging the IP and OP dataset on the basis of common columns
train_ip_op_df = pd.merge(left=train_ip_df, right=train_op_df, left_on=common_cols, right_on=common_cols, how="outer")
train_ip_op_df.shape
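Because `Admitted?` is 1 for every in-patient row and 0 for every out-patient row, no row can match on all common columns, so the outer merge behaves like stacking the two tables. The sketch below, on toy frames with made-up values, uses the `indicator` option of `pd.merge` to confirm that and shows `pd.concat` as an equivalent way to stack.

```python
import pandas as pd

# Toy in-patient / out-patient frames; column names mirror the notebook, values are made up.
toy_ip = pd.DataFrame({"ClaimID": ["C1", "C2"], "Provider": ["P1", "P2"],
                       "AdmissionDt": ["2009-01-01", "2009-02-01"], "Admitted?": 1})
toy_op = pd.DataFrame({"ClaimID": ["C3"], "Provider": ["P1"], "Admitted?": 0})

common = [c for c in toy_ip.columns if c in toy_op.columns]

# indicator=True adds a `_merge` column recording which side each row came from.
merged = pd.merge(toy_ip, toy_op, on=common, how="outer", indicator=True)
print(merged["_merge"].value_counts())

# No row matches on every common column, so stacking yields the same set of rows.
stacked = pd.concat([toy_ip, toy_op], ignore_index=True, sort=False)
print(len(merged) == len(stacked))
```

In-patient-only columns such as `AdmissionDt` come out as missing for the out-patient rows in both versions.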

In [None]:
train_ip_op_df.head()

### **Merging the IP_OP Dataset with BENE Data**

In [None]:
# Joining the IP_OP dataset with the BENE data
train_ip_op_bene_df = pd.merge(left=train_ip_op_df, right=train_bene_df, left_on='BeneID', right_on='BeneID',how='inner')
train_ip_op_bene_df.shape

### **Merging the IP_OP_BENE Dataset with PROVIDER level Tgt Labels Data**

In [None]:
# Joining the IP_OP_BENE dataset with the Tgt Label Provider Data
train_iobp_df = pd.merge(left=train_ip_op_bene_df, right=train_tgt_lbls_df, left_on='Provider', right_on='Provider',how='inner')
train_iobp_df.shape
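When attaching provider-level labels to claim-level rows, it can help to assert that each provider appears only once in the label table and that no claims are lost. A hedged sketch on toy data, using the `validate` option of `pd.merge`:

```python
import pandas as pd

# Toy claim-level and provider-level tables (made-up values).
toy_claims = pd.DataFrame({"ClaimID": ["C1", "C2", "C3"], "Provider": ["P1", "P1", "P2"]})
toy_labels = pd.DataFrame({"Provider": ["P1", "P2"], "PotentialFraud": ["Yes", "No"]})

# validate="many_to_one" raises pandas.errors.MergeError if `Provider` is duplicated on the right.
joined = pd.merge(toy_claims, toy_labels, on="Provider", how="inner", validate="many_to_one")

# An inner join silently drops claims whose provider has no label; compare row counts to detect that.
rows_dropped = len(toy_claims) - len(joined)
print(joined)
print("claims dropped by the join:", rows_dropped)
```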

### **Entire Dataset**

In [None]:
train_iobp_df.shape

In [None]:
# Unique Providers
train_iobp_df["Provider"].nunique()

In [None]:
# Unique Claim numbers
train_iobp_df["ClaimID"].nunique()

In [None]:
# Joining with the PRV Tgt Labels
prvs_claims_df = pd.DataFrame(train_iobp_df.groupby(['Provider'])['ClaimID'].count()).reset_index()
prvs_claims_tgt_lbls_df = pd.merge(left=prvs_claims_df, right=train_tgt_lbls_df, on='Provider', how='inner')
prvs_claims_tgt_lbls_df

- **Fraud Count at Claims level**

In [None]:
print(pd.DataFrame(train_iobp_df['PotentialFraud'].value_counts()), "\n")

with plt.style.context('seaborn-poster'):
    fig = train_iobp_df['PotentialFraud'].value_counts().plot(kind='bar', color=['green','orange'])
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/train_iobp_df.shape[0],2))+"%"}', (x + width/2, y + height*1.015), ha='center', fontsize=13.5)
    # Providing the labels and title to the graph
    plt.xlabel("Fraud or Not?", fontdict=label_font_dict)
    plt.ylabel("Number (or %) of claims\n", fontdict=label_font_dict)
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title("Distribution of Fraud & Non-fraud claims\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows that about 62% of claims are Non-Fraud and the remaining ~38% are Fraudulent.
    * Going by the percentages alone, there is some class imbalance, but given the number of records in each class it does not look like a severe class-imbalance problem.
        * So, I'll try class-balancing techniques only after training a baseline model without any synthetic sampling or class weighting.
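For reference, one common class-weighting scheme is the "balanced" heuristic, which weights each class by `n_samples / (n_classes * class_count)`. The snippet below is only an illustration on toy labels with a made-up 60/40 split; it is not a step the notebook performs here.

```python
import pandas as pd

# Toy claim-level labels with a made-up 60/40 split.
labels = pd.Series(["No"] * 60 + ["Yes"] * 40, name="PotentialFraud")

counts = labels.value_counts()
n_samples, n_classes = len(labels), counts.size

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
class_weights = (n_samples / (n_classes * counts)).round(3).to_dict()
print(class_weights)
```

The minority class receives the larger weight, so its errors count for more during training.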

## **Feature Engineering**
**`Let's create some features`**

### **Adding `New Feature - 2` :: `Is_Alive?`**

    - Is Alive? = Yes if DOD is NaN (no date of death recorded) else No

In [None]:
train_iobp_df['DOB'] = pd.to_datetime(train_iobp_df['DOB'], format="%Y-%m-%d")
train_iobp_df['DOD'] = pd.to_datetime(train_iobp_df['DOD'], format="%Y-%m-%d")

In [None]:
# A missing DOD means no death was recorded, so the beneficiary is treated as alive
train_iobp_df['Is_Alive?'] = np.where(train_iobp_df['DOD'].isna(), 'Yes', 'No')

In [None]:
train_iobp_df['Is_Alive?'].value_counts()
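Detecting a missing date of death comes down to a missing-value check on the parsed `DOD` column. The sketch below, on a toy series, shows two equivalent ways to express that check: comparing a value with itself (a missing timestamp is never equal to itself) and the more direct `Series.isna()`.

```python
import pandas as pd

# Toy DOD column: two recorded deaths, two missing values (made-up dates).
dod = pd.to_datetime(pd.Series(["2009-03-01", None, "2009-11-15", None]), format="%Y-%m-%d")

# A missing timestamp (NaT) is the only value that is not equal to itself.
missing_via_self_compare = dod.apply(lambda val: val != val)

# Series.isna() expresses the same check directly and is vectorised.
missing_via_isna = dod.isna()

print(missing_via_isna.tolist())
```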

### **Adding `New Feature - 3` :: `Claim_Duration`**
    
    - Claim Duration = Claim End Date - Claim Start Date

In [None]:
train_iobp_df['ClaimStartDt'] = pd.to_datetime(train_iobp_df['ClaimStartDt'], format="%Y-%m-%d")
train_iobp_df['ClaimEndDt'] = pd.to_datetime(train_iobp_df['ClaimEndDt'], format="%Y-%m-%d")

train_iobp_df['Claim_Duration'] = (train_iobp_df['ClaimEndDt'] - train_iobp_df['ClaimStartDt']).dt.days

### **Adding `New Feature - 4` :: `Admitted_Duration`**

    - Admitted Duration = Discharge Date - Admission Date

In [None]:
train_iobp_df['AdmissionDt'] = pd.to_datetime(train_iobp_df['AdmissionDt'], format="%Y-%m-%d")
train_iobp_df['DischargeDt'] = pd.to_datetime(train_iobp_df['DischargeDt'], format="%Y-%m-%d")

train_iobp_df['Admitted_Duration'] = (train_iobp_df['DischargeDt'] - train_iobp_df['AdmissionDt']).dt.days
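Both duration features are plain date differences expressed in days. A small sketch on made-up rows, assuming (as in the merged data) that out-patient claims carry no admission or discharge date, shows that such rows end up with a missing `Admitted_Duration`:

```python
import pandas as pd

toy = pd.DataFrame({
    "ClaimStartDt": ["2009-04-12", "2009-08-31"],
    "ClaimEndDt":   ["2009-04-18", "2009-08-31"],
    "AdmissionDt":  ["2009-04-12", None],   # second row mimics an out-patient claim
    "DischargeDt":  ["2009-04-18", None],
})
for col in toy.columns:
    toy[col] = pd.to_datetime(toy[col], format="%Y-%m-%d")

# Subtracting two datetime columns gives timedeltas; .dt.days extracts whole days.
toy["Claim_Duration"] = (toy["ClaimEndDt"] - toy["ClaimStartDt"]).dt.days
toy["Admitted_Duration"] = (toy["DischargeDt"] - toy["AdmissionDt"]).dt.days
print(toy[["Claim_Duration", "Admitted_Duration"]])
```

A same-day claim gets a duration of 0, so the missing values need to be handled before modelling if 0 should stay distinguishable from "not admitted".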

### **Adding `New Feature - 5` :: `Bene_Age`**

    - Bene Age = DOD - DOB (if DOD is Null then replace it with MAX date in DOD)

In [None]:
# Filling the Null values with the MAX Date of Death in the Dataset
train_iobp_df['DOD'] = train_iobp_df['DOD'].fillna(value=train_iobp_df['DOD'].max())

In [None]:
train_iobp_df['Bene_Age'] = round(((train_iobp_df['DOD'] - train_iobp_df['DOB']).dt.days)/365,1)
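The age logic can be checked on a couple of made-up beneficiaries. In this sketch the latest recorded date of death acts as the reference date for beneficiaries with no `DOD`, mirroring the rule stated above; dividing by 365 ignores leap days, so ages are approximate.

```python
import pandas as pd

# Toy beneficiaries: one with a recorded DOD, one without (made-up dates).
toy = pd.DataFrame({
    "DOB": pd.to_datetime(["1940-01-01", "1950-06-15"]),
    "DOD": pd.to_datetime(["2009-12-01", None]),
})

# Living beneficiaries (missing DOD) are aged up to the latest DOD in the data.
reference_date = toy["DOD"].max()
toy["DOD"] = toy["DOD"].fillna(reference_date)

toy["Bene_Age"] = ((toy["DOD"] - toy["DOB"]).dt.days / 365).round(1)
print(toy)
```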

### **Adding `New Feature - 6` :: `Att_Opr_Oth_Phy_Tot_Claims`**
    
   * It represents the total number of claims submitted by the Attending, Operating and Other Physicians listed on a claim.
       
       * **`Reasoning`** :: The idea behind adding this feature is to see whether the combined claim volume of the physicians on a claim helps in identifying potential frauds.


   * **`Logic`** :: Att_Phy_tot_claims + Opr_Phy_tot_claims + Oth_Phy_tot_claims

- **`Att_Phy_tot_claims`** :: **Total Number of claims or cases seen by Attending Physician**

In [None]:
# Total unique number of Attending Physicians
print("Unique number of Attending Physicians present in the dataset are --> {}".format(train_iobp_df['AttendingPhysician'].nunique()))

In [None]:
train_iobp_df['Att_Phy_tot_claims'] = train_iobp_df.groupby(['AttendingPhysician'])['ClaimID'].transform('count')
train_iobp_df['Att_Phy_tot_claims'].describe()

- **`Opr_Phy_tot_claims`** :: **Total Number of claims or cases seen by Operating Physician**

In [None]:
# Total unique number of Operating Physicians
print("Unique number of Operating Physicians present in the dataset are --> {}".format(train_iobp_df['OperatingPhysician'].nunique()))

In [None]:
train_iobp_df['Opr_Phy_tot_claims'] = train_iobp_df.groupby(['OperatingPhysician'])['ClaimID'].transform('count')
train_iobp_df['Opr_Phy_tot_claims'].describe()

- **`Oth_Phy_tot_claims`** :: **Total Number of claims or cases seen by Other Physician**

In [None]:
# Total unique number of Other Physicians
print("Unique number of Other Physicians present in the dataset are --> {}".format(train_iobp_df['OtherPhysician'].nunique()))

In [None]:
train_iobp_df['Oth_Phy_tot_claims'] = train_iobp_df.groupby(['OtherPhysician'])['ClaimID'].transform('count')
train_iobp_df['Oth_Phy_tot_claims'].describe()

In [None]:
# Creating the combined feature
# Physicians missing on a claim contribute 0 claims to the combined total
phy_claim_cols = ['Att_Phy_tot_claims', 'Opr_Phy_tot_claims', 'Oth_Phy_tot_claims']
train_iobp_df[phy_claim_cols] = train_iobp_df[phy_claim_cols].fillna(value=0)

In [None]:
train_iobp_df['Att_Opr_Oth_Phy_Tot_Claims'] = train_iobp_df['Att_Phy_tot_claims'] + train_iobp_df['Opr_Phy_tot_claims'] + train_iobp_df['Oth_Phy_tot_claims']

In [None]:
train_iobp_df['Att_Opr_Oth_Phy_Tot_Claims'].describe()
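The mechanics of this feature can be traced on a tiny example. The sketch below uses made-up claims and only two physician columns for brevity; the column name `Phy_Tot_Claims` is hypothetical. Rows with a missing physician get no per-physician count, which is why the zero-fill is needed before summing.

```python
import numpy as np
import pandas as pd

# Toy claims with some missing physician entries (made-up values).
toy = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3", "C4"],
    "AttendingPhysician": ["PHY1", "PHY1", "PHY2", np.nan],
    "OperatingPhysician": [np.nan, "PHY1", np.nan, np.nan],
})

parts = []
for phys_col in ["AttendingPhysician", "OperatingPhysician"]:
    # Rows whose physician is missing fall out of the groupby and come back as NaN.
    per_row = toy.groupby(phys_col)["ClaimID"].transform("count")
    parts.append(per_row.fillna(0))

# Row-wise sum of the per-physician claim counts.
toy["Phy_Tot_Claims"] = sum(parts)
print(toy)
```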

In [None]:
train_iobp_df.drop(['Att_Phy_tot_claims', 'Opr_Phy_tot_claims', 'Oth_Phy_tot_claims'], axis=1, inplace=True)

### **Adding `New Feature - 7` :: `Prv_Tot_Att_Opr_Oth_Phys`**
    
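   * It represents the total number of claims on which a provider has an Attending, Operating or Other Physician recorded. Note that `transform('count')` counts non-null entries, not distinct physicians; `transform('nunique')` would give the number of distinct physicians.
       
       * **`Reasoning`** :: The idea behind adding this feature is to see whether a fraudulent provider interacts with a higher or lower number of physicians.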


   * **`Logic`** :: Prv_Tot_Att_Phy + Prv_Tot_Opr_Phy + Prv_Tot_Oth_Phy

In [None]:
train_iobp_df["Prv_Tot_Att_Phy"] = train_iobp_df.groupby(['Provider'])['AttendingPhysician'].transform('count')
train_iobp_df["Prv_Tot_Opr_Phy"] = train_iobp_df.groupby(['Provider'])['OperatingPhysician'].transform('count')
train_iobp_df["Prv_Tot_Oth_Phy"] = train_iobp_df.groupby(['Provider'])['OtherPhysician'].transform('count')

In [None]:
# Nulls in the above features
train_iobp_df.isna().sum().tail(3)

In [None]:
train_iobp_df["Prv_Tot_Att_Phy"].describe()

* On average (across claim rows), a provider has about 820 claims with an attending physician recorded.

In [None]:
train_iobp_df["Prv_Tot_Opr_Phy"].describe()

* On average (across claim rows), a provider has about 155 claims with an operating physician recorded.

In [None]:
train_iobp_df["Prv_Tot_Oth_Phy"].describe()

* On average (across claim rows), a provider has about 306 claims with an other physician recorded.

In [None]:
train_iobp_df['Prv_Tot_Att_Opr_Oth_Phys'] = train_iobp_df['Prv_Tot_Att_Phy'] + train_iobp_df['Prv_Tot_Opr_Phy'] + train_iobp_df['Prv_Tot_Oth_Phy']

In [None]:
train_iobp_df["Prv_Tot_Att_Opr_Oth_Phys"].describe()
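It is worth being explicit about what `transform('count')` measures at the provider level. The sketch below, on made-up rows with hypothetical output column names, contrasts it with `transform('nunique')`: the first counts claims that have a physician recorded, the second counts distinct physicians.

```python
import numpy as np
import pandas as pd

# Toy claims: provider P1 has three claims across two physicians, P2 has none recorded.
toy = pd.DataFrame({
    "Provider": ["P1", "P1", "P1", "P2"],
    "AttendingPhysician": ["PHY1", "PHY1", "PHY2", np.nan],
})

grp = toy.groupby("Provider")["AttendingPhysician"]
toy["n_claims_with_phys"] = grp.transform("count")    # non-null rows per provider
toy["n_distinct_phys"] = grp.transform("nunique")     # distinct physicians per provider
print(toy)
```

Which of the two is the better fraud signal is an open modelling question; they simply answer different questions about a provider.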

In [None]:
train_iobp_df.drop(['Prv_Tot_Att_Phy', 'Prv_Tot_Opr_Phy', 'Prv_Tot_Oth_Phy'], axis=1, inplace=True)

### **Adding `New Feature - 8` :: `Total Unique Claim Admit Codes used by a PROVIDER`**
   
   * **`Reasoning`** :: The idea behind adding this feature is to see how many unique `Claim Admit Diagnosis Codes` are used by the Provider. 
       * There may be a pattern where a provider using a large number of Admit Diagnosis Codes has a higher or lower chance of fraud.

In [None]:
train_iobp_df['PRV_Tot_Admit_DCodes'] = train_iobp_df.groupby(['Provider'])['ClmAdmitDiagnosisCode'].transform('nunique')

In [None]:
train_iobp_df["PRV_Tot_Admit_DCodes"].describe()

### **Adding `New Feature - 9` :: `Total Unique Number of Diagnosis Group Codes used by a PROVIDER`**
   
   * **`Reasoning`** :: The idea behind adding this feature is to see how many unique `Diagnosis Group Codes` a Provider has used.
       * There may be a pattern in which the number of distinct Diagnosis Group Codes a provider uses increases or decreases the chance of fraud.

In [None]:
train_iobp_df['PRV_Tot_DGrpCodes'] = train_iobp_df.groupby(['Provider'])['DiagnosisGroupCode'].transform('nunique')

In [None]:
train_iobp_df["PRV_Tot_DGrpCodes"].describe()

### **Adding `New Feature - 10` :: `Total unique Date of Birth years of beneficiaries served by a Provider`**
   
   * **`Reasoning`** :: The idea behind adding this feature is that very high variability in the birth years of a provider's patients might be one of the signs of Medicare fraud.
       - Private hospitals that treat poor patients have been reported to make false claims in their names. For example, Nazia is 10 years old, but according to a claim filed by Chhattisgarh-based Shaheed Hospital with the Rashtriya Swasthya Bima Yojna (RSBY), she delivered a baby by caesarean operation. Mukul (name changed) is only 7, but Agarwal Hospital, Raipur, made a claim for removing a cataract from his eyes.

Source:
https://economictimes.indiatimes.com/news/politics-and-nation/private-hospitals-perform-fake-surgeries-to-claim-thousands-in-insurance-cover/articleshow/16934229.cms?utm_source=contentofinterest&utm_medium=text&utm_campaign=cppst
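
A minimal sketch of this feature on made-up rows (the toy values are assumptions). It assumes `DOB` might still be stored as text, so it parses the column first; `.dt.year` only works on a datetime column.

```python
import pandas as pd

# Hypothetical beneficiaries: P1 serves patients born in two different years
toy = pd.DataFrame({
    "Provider": ["P1", "P1", "P1", "P2"],
    "DOB": ["1940-01-01", "1940-06-01", "1955-03-01", "1938-09-01"],
})

# Parse to datetime so the .dt accessor is available
toy["DOB"] = pd.to_datetime(toy["DOB"])
toy["DOB_Year"] = toy["DOB"].dt.year

# Number of distinct birth years per provider, broadcast to every row
toy["PRV_Tot_Unq_DOB_Years"] = toy.groupby("Provider")["DOB_Year"].transform("nunique")
print(toy)
```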

In [None]:
train_iobp_df['DOB_Year'] = train_iobp_df['DOB'].dt.year

In [None]:
train_iobp_df['PRV_Tot_Unq_DOB_Years'] = train_iobp_df.groupby(['Provider'])['DOB_Year'].transform('nunique')

In [None]:
train_iobp_df['PRV_Tot_Unq_DOB_Years'].describe()

In [None]:
train_iobp_df.drop(['DOB_Year'], axis=1, inplace=True)

### **Adding `New Feature - 11` :: `Sum of the ages of patients treated by a Provider`**
   
   * **`Reasoning`** :: The idea behind adding this feature is that there might be a pattern in which a very high or very low sum of patient ages for a provider is associated with fraud.

In [None]:
train_iobp_df['PRV_Bene_Age_Sum'] = train_iobp_df.groupby(['Provider'])['Bene_Age'].transform('sum')

In [None]:
train_iobp_df['PRV_Bene_Age_Sum'].describe()

### **Adding `New Feature - 12` :: `Sum of Insc Claim Re-Imb Amount for a Provider`**
   
   * **`Reasoning`** :: The idea behind adding this feature is that there might be a pattern in which a very high or very low total claim reimbursement amount for a provider is associated with fraud.

In [None]:
train_iobp_df['PRV_Insc_Clm_ReImb_Amt'] = train_iobp_df.groupby(['Provider'])['InscClaimAmtReimbursed'].transform('sum')

In [None]:
train_iobp_df['PRV_Insc_Clm_ReImb_Amt'].describe()

### **Adding `New Feature - 13` :: `Total number of RKD Patients seen by a Provider`**
   
   * **`Reasoning`** :: The idea behind adding this feature is that there might be a pattern in which a very high or very low total number of RKD (renal disease) patients seen by a Provider is associated with fraud.
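
A small sketch of this feature on made-up rows (the toy values, including the `"0"` marker for non-renal patients, are assumptions). It uses a vectorised comparison, which gives the same 0/1 encoding as a row-wise `apply`, and it shows that summing over claim rows counts a patient once per claim. The `distinct` series is a hypothetical comparison and not a feature the notebook builds.

```python
import pandas as pd

toy = pd.DataFrame({
    "Provider": ["P1", "P1", "P2", "P2"],
    "BeneID":   ["B1", "B1", "B2", "B3"],
    "RenalDiseaseIndicator": ["Y", "Y", "0", "Y"],
})

# "Y" -> 1, anything else -> 0
toy["RenalDiseaseIndicator"] = (toy["RenalDiseaseIndicator"] == "Y").astype(int)

# Sum over claim rows: beneficiary B1 has two claims and is counted twice
toy["PRV_Tot_RKD_Patients"] = toy.groupby("Provider")["RenalDiseaseIndicator"].transform("sum")

# Distinct RKD beneficiaries per provider, for comparison
rkd_rows = toy[toy["RenalDiseaseIndicator"] == 1]
distinct = rkd_rows.groupby("Provider")["BeneID"].nunique()
print(toy)
print(distinct)
```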

In [None]:
train_iobp_df['RenalDiseaseIndicator'] = train_iobp_df['RenalDiseaseIndicator'].apply(lambda val: 1 if val == "Y" else 0)

In [None]:
train_iobp_df['PRV_Tot_RKD_Patients'] = train_iobp_df.groupby(['Provider'])['RenalDiseaseIndicator'].transform('sum')

In [None]:
train_iobp_df['PRV_Tot_RKD_Patients'].describe()

In [None]:
# Dropping these 2 columns as 99% of their values are the same
train_iobp_df.drop(['NoOfMonths_PartACov', 'NoOfMonths_PartBCov'], axis=1, inplace=True)

In [None]:
# Filling null values in Admitted_Duration with 0 (representing patients admitted for 0 days)
# Assigning the result back avoids calling fillna(inplace=True) on a column selection
train_iobp_df['Admitted_Duration'] = train_iobp_df['Admitted_Duration'].fillna(value=0)

In [None]:
train_iobp_df.shape

### **Adding `Aggregated Features` :: For every possible level**
    - Provider
    - Beneficiary
    - Attending Physician
    - Operating Physician
    - Other Physician, etc.
   
   
   * **`Reasoning`** :: The idea behind adding the aggregated features at different levels is that fraud can be done by an individual or group of individuals or entities involved in the claim process.
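
One way to sketch the per-level aggregation is a single loop over a suffix-to-column mapping. The helper name `add_agg_feats`, the `AGG_COLS` mapping and the toy values are assumptions made for illustration; the column names are the ones used in this notebook.

```python
import pandas as pd

# Output-name suffix -> source column to aggregate
AGG_COLS = {
    "_Insc_ReImb_Amt": "InscClaimAmtReimbursed",
    "_CoPayment": "DeductibleAmtPaid",
    "_IP_Annual_ReImb_Amt": "IPAnnualReimbursementAmt",
    "_IP_Annual_Ded_Amt": "IPAnnualDeductibleAmt",
    "_OP_Annual_ReImb_Amt": "OPAnnualReimbursementAmt",
    "_OP_Annual_Ded_Amt": "OPAnnualDeductibleAmt",
    "_Admit_Duration": "Admitted_Duration",
    "_Claim_Duration": "Claim_Duration",
}

def add_agg_feats(df, grp_col, feat_name, operation="sum"):
    """Hypothetical helper: add one aggregated column per AGG_COLS entry, in place."""
    for suffix, src in AGG_COLS.items():
        df[feat_name + suffix] = df.groupby(grp_col)[src].transform(operation)
    return df

# Toy frame: every numeric column holds the same made-up values
toy = pd.DataFrame({
    "BeneID": ["B1", "B1", "B2"],
    **{col: [1.0, 2.0, 4.0] for col in AGG_COLS.values()},
})
add_agg_feats(toy, grp_col="BeneID", feat_name="BENE")
print(toy.filter(like="BENE_"))
```

Passing the frame in as an argument, rather than relying on a global, makes the same helper reusable on a test set.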

In [None]:
# PRV Aggregate features
train_iobp_df["PRV_CoPayment"] = train_iobp_df.groupby('Provider')['DeductibleAmtPaid'].transform('sum')
train_iobp_df["PRV_IP_Annual_ReImb_Amt"] = train_iobp_df.groupby('Provider')['IPAnnualReimbursementAmt'].transform('sum')
train_iobp_df["PRV_IP_Annual_Ded_Amt"] = train_iobp_df.groupby('Provider')['IPAnnualDeductibleAmt'].transform('sum')
train_iobp_df["PRV_OP_Annual_ReImb_Amt"] = train_iobp_df.groupby('Provider')['OPAnnualReimbursementAmt'].transform('sum')
train_iobp_df["PRV_OP_Annual_Ded_Amt"] = train_iobp_df.groupby('Provider')['OPAnnualDeductibleAmt'].transform('sum')
train_iobp_df["PRV_Admit_Duration"] = train_iobp_df.groupby('Provider')['Admitted_Duration'].transform('sum')
train_iobp_df["PRV_Claim_Duration"] = train_iobp_df.groupby('Provider')['Claim_Duration'].transform('sum')

In [None]:
def create_agg_feats(grp_col, feat_name, operation='sum'):
    """
    Description :: This function is created for adding the aggregated features in the dataset for every level like:
        - Beneficiary
        - Attending Physician
        - Operating Physician
        - Other Physician, etc.
        
    Input Parameters :: It accepts below inputs:
        - grp_col : `str`
            - It represents the feature or level at which you want to perform the aggregation.
        
        - feat_name : `str`
            - It represents the feature whose aggregated aspect you want to capture.
        
        - operation : `str`
            - It represents the aggregation operation you want to perform (default: 'sum').
    """
    feat_1 = feat_name + "_Insc_ReImb_Amt"
    train_iobp_df[feat_1] = train_iobp_df.groupby(grp_col)['InscClaimAmtReimbursed'].transform(operation)

    feat_2 = feat_name + "_CoPayment"
    train_iobp_df[feat_2] = train_iobp_df.groupby(grp_col)['DeductibleAmtPaid'].transform(operation)

    feat_3 = feat_name + "_IP_Annual_ReImb_Amt"
    train_iobp_df[feat_3] = train_iobp_df.groupby(grp_col)['IPAnnualReimbursementAmt'].transform(operation)

    feat_4 = feat_name + "_IP_Annual_Ded_Amt"
    train_iobp_df[feat_4] = train_iobp_df.groupby(grp_col)['IPAnnualDeductibleAmt'].transform(operation)

    feat_5 = feat_name + "_OP_Annual_ReImb_Amt"
    train_iobp_df[feat_5] = train_iobp_df.groupby(grp_col)['OPAnnualReimbursementAmt'].transform(operation)

    feat_6 = feat_name + "_OP_Annual_Ded_Amt"
    train_iobp_df[feat_6] = train_iobp_df.groupby(grp_col)['OPAnnualDeductibleAmt'].transform(operation)

    feat_7 = feat_name + "_Admit_Duration"
    train_iobp_df[feat_7] = train_iobp_df.groupby(grp_col)['Admitted_Duration'].transform(operation)

    feat_8 = feat_name + "_Claim_Duration"
    train_iobp_df[feat_8] = train_iobp_df.groupby(grp_col)['Claim_Duration'].transform(operation)

In [None]:
# BENE, PHYs, Diagnosis Admit and Group Codes columns
create_agg_feats(grp_col='BeneID', feat_name="BENE")
create_agg_feats(grp_col='AttendingPhysician', feat_name="ATT_PHY")
create_agg_feats(grp_col='OperatingPhysician', feat_name="OPT_PHY")
create_agg_feats(grp_col='OtherPhysician', feat_name="OTH_PHY")
create_agg_feats(grp_col='ClmAdmitDiagnosisCode', feat_name="Claim_Admit_Diag_Code")
create_agg_feats(grp_col='DiagnosisGroupCode', feat_name="Diag_GCode")

In [None]:
# Dropping these 3 columns as 99% of their values are the same
train_iobp_df.drop(['ClmProcedureCode_4', 'ClmProcedureCode_5', 'ClmProcedureCode_6'], axis=1, inplace=True)

In [None]:
# Diagnosis Codes columns
create_agg_feats(grp_col='ClmDiagnosisCode_1', feat_name="Claim_DiagCode1")
create_agg_feats(grp_col='ClmDiagnosisCode_2', feat_name="Claim_DiagCode2")
create_agg_feats(grp_col='ClmDiagnosisCode_3', feat_name="Claim_DiagCode3")
create_agg_feats(grp_col='ClmDiagnosisCode_4', feat_name="Claim_DiagCode4")
create_agg_feats(grp_col='ClmDiagnosisCode_5', feat_name="Claim_DiagCode5")
create_agg_feats(grp_col='ClmDiagnosisCode_6', feat_name="Claim_DiagCode6")
create_agg_feats(grp_col='ClmDiagnosisCode_7', feat_name="Claim_DiagCode7")
create_agg_feats(grp_col='ClmDiagnosisCode_8', feat_name="Claim_DiagCode8")
create_agg_feats(grp_col='ClmDiagnosisCode_9', feat_name="Claim_DiagCode9")
create_agg_feats(grp_col='ClmDiagnosisCode_10', feat_name="Claim_DiagCode10")

# Medical Procedure Codes columns
create_agg_feats(grp_col='ClmProcedureCode_1', feat_name="Claim_ProcCode1")
create_agg_feats(grp_col='ClmProcedureCode_2', feat_name="Claim_ProcCode2")
create_agg_feats(grp_col='ClmProcedureCode_3', feat_name="Claim_ProcCode3")

In [None]:
train_iobp_df.shape

### **Adding `Aggregated Features` :: Based on various combinations of different levels in order to introduce their interactions in the dataset.**
    - PROVIDER <--> BENE <--> PHYSICIANS
    - PROVIDER <--> BENE <--> ATTENDING PHYSICIAN <--> PROCEDURE CODES
    - PROVIDER <--> BENE <--> OPERATING PHYSICIAN <--> PROCEDURE CODES
    - PROVIDER <--> BENE <--> OTHER PHYSICIAN <--> PROCEDURE CODES
    - PROVIDER <--> BENE <--> ATTENDING PHYSICIAN <--> DIAGNOSIS CODES
    - PROVIDER <--> BENE <--> OPERATING PHYSICIAN <--> DIAGNOSIS CODES
    - PROVIDER <--> BENE <--> OTHER PHYSICIAN <--> DIAGNOSIS CODES
    - PROVIDER <--> BENE <--> DIAGNOSIS CODES <--> PROCEDURE CODES, etc.

   * **`Reasoning`** :: The idea behind adding aggregated features based on combinations of various features is that many parties or entities might work together to commit Medicare fraud. Thus, we need to capture the interactions among them to better classify the fraudsters.
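
The interaction counts can be sketched as a loop over lists of grouping keys. The helper name `add_claim_counts` and the toy rows are assumptions made for illustration. The sketch also shows one behaviour worth knowing: with the default `dropna=True`, recent pandas versions leave rows whose grouping key is null out of the groups, so their count comes back as `NaN` and has to be filled afterwards.

```python
import numpy as np
import pandas as pd

def add_claim_counts(df, key_sets, id_col="ClaimID"):
    """Hypothetical helper: add one ClmCount_* column per list of grouping keys."""
    for keys in key_sets:
        name = "ClmCount_" + "_".join(keys)
        df[name] = df.groupby(keys)[id_col].transform("count")
    return df

# Toy claims: the last two rows have no operating physician
toy = pd.DataFrame({
    "ClaimID":  ["C1", "C2", "C3", "C4"],
    "Provider": ["P1", "P1", "P1", "P2"],
    "BeneID":   ["B1", "B1", "B2", "B3"],
    "OperatingPhysician": ["X", "X", np.nan, np.nan],
})

add_claim_counts(toy, [
    ["Provider"],
    ["Provider", "BeneID"],
    ["Provider", "BeneID", "OperatingPhysician"],
])
print(toy.filter(like="ClmCount_"))
```

Generating the column names from the key lists keeps the naming scheme (`ClmCount_` followed by the keys joined with underscores) consistent across all levels.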

In [None]:
# PROVIDER <--> other features :: To get claim counts
train_iobp_df["ClmCount_Provider"]=train_iobp_df.groupby(['Provider'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID"]=train_iobp_df.groupby(['Provider','BeneID'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_AttendingPhysician"]=train_iobp_df.groupby(['Provider','AttendingPhysician'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_OtherPhysician"]=train_iobp_df.groupby(['Provider','OtherPhysician'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_OperatingPhysician"]=train_iobp_df.groupby(['Provider','OperatingPhysician'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_ClmAdmitDiagnosisCode"]=train_iobp_df.groupby(['Provider','ClmAdmitDiagnosisCode'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_ClmProcedureCode_1"]=train_iobp_df.groupby(['Provider','ClmProcedureCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_ClmProcedureCode_2"]=train_iobp_df.groupby(['Provider','ClmProcedureCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_ClmProcedureCode_3"]=train_iobp_df.groupby(['Provider','ClmProcedureCode_3'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_ClmDiagnosisCode_1"]=train_iobp_df.groupby(['Provider','ClmDiagnosisCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_ClmDiagnosisCode_2"]=train_iobp_df.groupby(['Provider','ClmDiagnosisCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_ClmDiagnosisCode_3"]=train_iobp_df.groupby(['Provider','ClmDiagnosisCode_3'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_ClmDiagnosisCode_4"]=train_iobp_df.groupby(['Provider','ClmDiagnosisCode_4'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_ClmDiagnosisCode_5"]=train_iobp_df.groupby(['Provider','ClmDiagnosisCode_5'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_ClmDiagnosisCode_6"]=train_iobp_df.groupby(['Provider','ClmDiagnosisCode_6'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_ClmDiagnosisCode_7"]=train_iobp_df.groupby(['Provider','ClmDiagnosisCode_7'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_ClmDiagnosisCode_8"]=train_iobp_df.groupby(['Provider','ClmDiagnosisCode_8'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_ClmDiagnosisCode_9"]=train_iobp_df.groupby(['Provider','ClmDiagnosisCode_9'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_ClmDiagnosisCode_10"]=train_iobp_df.groupby(['Provider','ClmDiagnosisCode_10'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_DiagnosisGroupCode"]=train_iobp_df.groupby(['Provider','DiagnosisGroupCode'])['ClaimID'].transform('count')

# PROVIDER <--> BENE <--> PHYSICIANS :: To get claim counts
train_iobp_df["ClmCount_Provider_BeneID_AttendingPhysician"]=train_iobp_df.groupby(['Provider','BeneID','AttendingPhysician'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OtherPhysician"]=train_iobp_df.groupby(['Provider','BeneID','OtherPhysician'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OperatingPhysician"]=train_iobp_df.groupby(['Provider','BeneID','OperatingPhysician'])['ClaimID'].transform('count')

# PROVIDER <--> BENE <--> ATTENDING PHYSICIAN <--> PROCEDURE CODES :: To get claim counts
train_iobp_df["ClmCount_Provider_BeneID_AttendingPhysician_ClmProcedureCode_1"]=train_iobp_df.groupby(['Provider','BeneID','AttendingPhysician','ClmProcedureCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_AttendingPhysician_ClmProcedureCode_2"]=train_iobp_df.groupby(['Provider','BeneID','AttendingPhysician','ClmProcedureCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_AttendingPhysician_ClmProcedureCode_3"]=train_iobp_df.groupby(['Provider','BeneID','AttendingPhysician','ClmProcedureCode_3'])['ClaimID'].transform('count')

# PROVIDER <--> BENE <--> OPERATING PHYSICIAN <--> PROCEDURE CODES :: To get claim counts
train_iobp_df["ClmCount_Provider_BeneID_OperatingPhysician_ClmProcedureCode_1"]=train_iobp_df.groupby(['Provider','BeneID','OperatingPhysician','ClmProcedureCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OperatingPhysician_ClmProcedureCode_2"]=train_iobp_df.groupby(['Provider','BeneID','OperatingPhysician','ClmProcedureCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OperatingPhysician_ClmProcedureCode_3"]=train_iobp_df.groupby(['Provider','BeneID','OperatingPhysician','ClmProcedureCode_3'])['ClaimID'].transform('count')

# PROVIDER <--> BENE <--> OTHER PHYSICIAN <--> PROCEDURE CODES :: To get claim counts
train_iobp_df["ClmCount_Provider_BeneID_OtherPhysician_ClmProcedureCode_1"]=train_iobp_df.groupby(['Provider','BeneID','OtherPhysician','ClmProcedureCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OtherPhysician_ClmProcedureCode_2"]=train_iobp_df.groupby(['Provider','BeneID','OtherPhysician','ClmProcedureCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OtherPhysician_ClmProcedureCode_3"]=train_iobp_df.groupby(['Provider','BeneID','OtherPhysician','ClmProcedureCode_3'])['ClaimID'].transform('count')

# PROVIDER <--> BENE <--> ATTENDING PHYSICIAN <--> DIAGNOSIS CODES :: To get claim counts
train_iobp_df["ClmCount_Provider_BeneID_AttendingPhysician_ClmDiagnosisCode_1"]=train_iobp_df.groupby(['Provider','BeneID','AttendingPhysician','ClmDiagnosisCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_AttendingPhysician_ClmDiagnosisCode_2"]=train_iobp_df.groupby(['Provider','BeneID','AttendingPhysician','ClmDiagnosisCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_AttendingPhysician_ClmDiagnosisCode_3"]=train_iobp_df.groupby(['Provider','BeneID','AttendingPhysician','ClmDiagnosisCode_3'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_AttendingPhysician_ClmDiagnosisCode_4"]=train_iobp_df.groupby(['Provider','BeneID','AttendingPhysician','ClmDiagnosisCode_4'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_AttendingPhysician_ClmDiagnosisCode_5"]=train_iobp_df.groupby(['Provider','BeneID','AttendingPhysician','ClmDiagnosisCode_5'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_AttendingPhysician_ClmDiagnosisCode_6"]=train_iobp_df.groupby(['Provider','BeneID','AttendingPhysician','ClmDiagnosisCode_6'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_AttendingPhysician_ClmDiagnosisCode_7"]=train_iobp_df.groupby(['Provider','BeneID','AttendingPhysician','ClmDiagnosisCode_7'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_AttendingPhysician_ClmDiagnosisCode_8"]=train_iobp_df.groupby(['Provider','BeneID','AttendingPhysician','ClmDiagnosisCode_8'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_AttendingPhysician_ClmDiagnosisCode_9"]=train_iobp_df.groupby(['Provider','BeneID','AttendingPhysician','ClmDiagnosisCode_9'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_AttendingPhysician_ClmDiagnosisCode_10"]=train_iobp_df.groupby(['Provider','BeneID','AttendingPhysician','ClmDiagnosisCode_10'])['ClaimID'].transform('count')

# PROVIDER <--> BENE <--> OPERATING PHYSICIAN <--> DIAGNOSIS CODES :: To get claim counts
train_iobp_df["ClmCount_Provider_BeneID_OperatingPhysician_ClmDiagnosisCode_1"]=train_iobp_df.groupby(['Provider','BeneID','OperatingPhysician','ClmDiagnosisCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OperatingPhysician_ClmDiagnosisCode_2"]=train_iobp_df.groupby(['Provider','BeneID','OperatingPhysician','ClmDiagnosisCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OperatingPhysician_ClmDiagnosisCode_3"]=train_iobp_df.groupby(['Provider','BeneID','OperatingPhysician','ClmDiagnosisCode_3'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OperatingPhysician_ClmDiagnosisCode_4"]=train_iobp_df.groupby(['Provider','BeneID','OperatingPhysician','ClmDiagnosisCode_4'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OperatingPhysician_ClmDiagnosisCode_5"]=train_iobp_df.groupby(['Provider','BeneID','OperatingPhysician','ClmDiagnosisCode_5'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OperatingPhysician_ClmDiagnosisCode_6"]=train_iobp_df.groupby(['Provider','BeneID','OperatingPhysician','ClmDiagnosisCode_6'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OperatingPhysician_ClmDiagnosisCode_7"]=train_iobp_df.groupby(['Provider','BeneID','OperatingPhysician','ClmDiagnosisCode_7'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OperatingPhysician_ClmDiagnosisCode_8"]=train_iobp_df.groupby(['Provider','BeneID','OperatingPhysician','ClmDiagnosisCode_8'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OperatingPhysician_ClmDiagnosisCode_9"]=train_iobp_df.groupby(['Provider','BeneID','OperatingPhysician','ClmDiagnosisCode_9'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OperatingPhysician_ClmDiagnosisCode_10"]=train_iobp_df.groupby(['Provider','BeneID','OperatingPhysician','ClmDiagnosisCode_10'])['ClaimID'].transform('count')

# PROVIDER <--> BENE <--> OTHER PHYSICIAN <--> DIAGNOSIS CODES :: To get claim counts
train_iobp_df["ClmCount_Provider_BeneID_OtherPhysician_ClmDiagnosisCode_1"]=train_iobp_df.groupby(['Provider','BeneID','OtherPhysician','ClmDiagnosisCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OtherPhysician_ClmDiagnosisCode_2"]=train_iobp_df.groupby(['Provider','BeneID','OtherPhysician','ClmDiagnosisCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OtherPhysician_ClmDiagnosisCode_3"]=train_iobp_df.groupby(['Provider','BeneID','OtherPhysician','ClmDiagnosisCode_3'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OtherPhysician_ClmDiagnosisCode_4"]=train_iobp_df.groupby(['Provider','BeneID','OtherPhysician','ClmDiagnosisCode_4'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OtherPhysician_ClmDiagnosisCode_5"]=train_iobp_df.groupby(['Provider','BeneID','OtherPhysician','ClmDiagnosisCode_5'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OtherPhysician_ClmDiagnosisCode_6"]=train_iobp_df.groupby(['Provider','BeneID','OtherPhysician','ClmDiagnosisCode_6'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OtherPhysician_ClmDiagnosisCode_7"]=train_iobp_df.groupby(['Provider','BeneID','OtherPhysician','ClmDiagnosisCode_7'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OtherPhysician_ClmDiagnosisCode_8"]=train_iobp_df.groupby(['Provider','BeneID','OtherPhysician','ClmDiagnosisCode_8'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OtherPhysician_ClmDiagnosisCode_9"]=train_iobp_df.groupby(['Provider','BeneID','OtherPhysician','ClmDiagnosisCode_9'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OtherPhysician_ClmDiagnosisCode_10"]=train_iobp_df.groupby(['Provider','BeneID','OtherPhysician','ClmDiagnosisCode_10'])['ClaimID'].transform('count')

# PROVIDER <--> BENE <--> PROCEDURE CODES :: To get claim counts
train_iobp_df["ClmCount_Provider_BeneID_ClmProcedureCode_1"]=train_iobp_df.groupby(['Provider','BeneID','ClmProcedureCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmProcedureCode_2"]=train_iobp_df.groupby(['Provider','BeneID','ClmProcedureCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmProcedureCode_3"]=train_iobp_df.groupby(['Provider','BeneID','ClmProcedureCode_3'])['ClaimID'].transform('count')

# PROVIDER <--> BENE <--> DIAGNOSIS CODES :: To get claim counts
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_1"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_2"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_3"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_3'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_4"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_4'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_5"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_5'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_6"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_6'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_7"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_7'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_8"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_8'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_9"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_9'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_10"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_10'])['ClaimID'].transform('count')

# PROVIDER <--> BENE <--> DIAGNOSIS CODES <--> PROCEDURE CODES :: To get claim counts
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_1_ClmProcedureCode_1"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_1','ClmProcedureCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_1_ClmProcedureCode_2"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_1','ClmProcedureCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_1_ClmProcedureCode_3"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_1','ClmProcedureCode_3'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_2_ClmProcedureCode_1"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_2','ClmProcedureCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_2_ClmProcedureCode_2"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_2','ClmProcedureCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_2_ClmProcedureCode_3"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_2','ClmProcedureCode_3'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_3_ClmProcedureCode_1"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_3','ClmProcedureCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_3_ClmProcedureCode_2"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_3','ClmProcedureCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_3_ClmProcedureCode_3"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_3','ClmProcedureCode_3'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_4_ClmProcedureCode_1"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_4','ClmProcedureCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_4_ClmProcedureCode_2"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_4','ClmProcedureCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_4_ClmProcedureCode_3"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_4','ClmProcedureCode_3'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_5_ClmProcedureCode_1"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_5','ClmProcedureCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_5_ClmProcedureCode_2"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_5','ClmProcedureCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_5_ClmProcedureCode_3"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_5','ClmProcedureCode_3'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_6_ClmProcedureCode_1"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_6','ClmProcedureCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_6_ClmProcedureCode_2"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_6','ClmProcedureCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_6_ClmProcedureCode_3"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_6','ClmProcedureCode_3'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_7_ClmProcedureCode_1"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_7','ClmProcedureCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_7_ClmProcedureCode_2"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_7','ClmProcedureCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_7_ClmProcedureCode_3"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_7','ClmProcedureCode_3'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_8_ClmProcedureCode_1"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_8','ClmProcedureCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_8_ClmProcedureCode_2"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_8','ClmProcedureCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_8_ClmProcedureCode_3"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_8','ClmProcedureCode_3'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_9_ClmProcedureCode_1"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_9','ClmProcedureCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_9_ClmProcedureCode_2"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_9','ClmProcedureCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_9_ClmProcedureCode_3"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_9','ClmProcedureCode_3'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_10_ClmProcedureCode_1"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_10','ClmProcedureCode_1'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_10_ClmProcedureCode_2"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_10','ClmProcedureCode_2'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_ClmDiagnosisCode_10_ClmProcedureCode_3"]=train_iobp_df.groupby(['Provider','BeneID','ClmDiagnosisCode_10','ClmProcedureCode_3'])['ClaimID'].transform('count')

In [None]:
train_iobp_df.shape

In [None]:
# Removing unwanted columns
remove_unwanted_columns=['BeneID', 'ClaimID', 'ClaimStartDt','ClaimEndDt','AttendingPhysician','OperatingPhysician', 'OtherPhysician',
                      'AdmissionDt', 'ClmAdmitDiagnosisCode', 'DischargeDt', 'DiagnosisGroupCode',
                      'ClmDiagnosisCode_1', 'ClmDiagnosisCode_2', 'ClmDiagnosisCode_3', 'ClmDiagnosisCode_4', 'ClmDiagnosisCode_5', 
                      'ClmDiagnosisCode_6', 'ClmDiagnosisCode_7', 'ClmDiagnosisCode_8', 'ClmDiagnosisCode_9', 'ClmDiagnosisCode_10',
                      'ClmProcedureCode_1', 'ClmProcedureCode_2', 'ClmProcedureCode_3', 'DOB', 'DOD', 'State', 'County']

train_iobp_df.drop(columns=remove_unwanted_columns, inplace=True)

In [None]:
train_iobp_df.shape

In [None]:
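# Filling Nulls in Deductible Amt Paid by Patient
# Assigning the result back avoids calling fillna(inplace=True) on a column selection
train_iobp_df['DeductibleAmtPaid'] = train_iobp_df['DeductibleAmtPaid'].fillna(value=0)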

In [None]:
# Binary encoding the categorical features --> 0 means No and 1 means Yes
train_iobp_df['Gender'] = train_iobp_df['Gender'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['PotentialFraud'] = train_iobp_df['PotentialFraud'].apply(lambda val: 0 if val == "No" else 1)
train_iobp_df['Is_Alive?'] = train_iobp_df['Is_Alive?'].apply(lambda val: 0 if val == "No" else 1)

train_iobp_df['ChronicCond_Alzheimer'] = train_iobp_df['ChronicCond_Alzheimer'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_Heartfailure'] = train_iobp_df['ChronicCond_Heartfailure'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_KidneyDisease'] = train_iobp_df['ChronicCond_KidneyDisease'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_Cancer'] = train_iobp_df['ChronicCond_Cancer'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_ObstrPulmonary'] = train_iobp_df['ChronicCond_ObstrPulmonary'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_Depression'] = train_iobp_df['ChronicCond_Depression'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_Diabetes'] = train_iobp_df['ChronicCond_Diabetes'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_IschemicHeart'] = train_iobp_df['ChronicCond_IschemicHeart'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_Osteoporasis'] = train_iobp_df['ChronicCond_Osteoporasis'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_rheumatoidarthritis'] = train_iobp_df['ChronicCond_rheumatoidarthritis'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_stroke'] = train_iobp_df['ChronicCond_stroke'].apply(lambda val: 0 if val == 2 else val)

In [None]:
# Encoding the Categorical features
train_iobp_df = pd.get_dummies(train_iobp_df,columns=['Gender', 'Race', 'Admitted?', 'Is_Alive?'], drop_first=True)

In [None]:
pd.set_option('display.max_rows',310)

In [None]:
# Checking Nulls in the features
pd.DataFrame(train_iobp_df.isna().sum())

In [None]:
# Filling Nulls in the aggregated features
train_iobp_df.fillna(value=0, inplace=True)

In [None]:
# Checking Nulls in the features
pd.DataFrame(train_iobp_df.isna().sum())

In [None]:
# Checking the Datatypes of the features
train_iobp_df.dtypes

## **Entire Data `Aggregation` :: At provider level**

   * **`Reasoning`** :: The main objective is to predict `Medicare Provider Fraud`, and the target labels are defined per provider. We therefore group the entire dataset by PROVIDER and take the SUM of every column, which gives one n-dimensional feature vector per provider.
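
To make the roll-up concrete, here is a minimal sketch on a toy frame. The rows and values are hypothetical, and `InscClaimAmtReimbursed` is only assumed to be one of the numeric claim columns; the point is that many claim rows collapse into one row per provider.

```python
import pandas as pd

# Toy claim-level rows (hypothetical values): one row per claim.
claims = pd.DataFrame({
    "Provider": ["PRV1", "PRV1", "PRV2"],
    "PotentialFraud": [1, 1, 0],
    "InscClaimAmtReimbursed": [100, 250, 40],  # assumed column name
    "Admitted?_1": [1, 0, 0],
})

# Summing within each provider collapses the claim rows into a single feature vector.
provider_level = claims.groupby(["Provider", "PotentialFraud"], as_index=False).agg("sum")
print(provider_level)
```

Including `PotentialFraud` in the grouping keys does not split a provider into several rows, because the label is constant for every claim of the same provider.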

In [None]:
train_iobp_df = train_iobp_df.groupby(['Provider','PotentialFraud'],as_index=False).agg('sum')

In [None]:
train_iobp_df.shape

# **`Data Segregation`**

## **Creating separate sets of independent features and target column.**

   * **`Reasoning`** :: These sets will be used for training the ML Models.

In [None]:
X = train_iobp_df.drop(axis=1, columns=['Provider','PotentialFraud'])
y = train_iobp_df['PotentialFraud']

In [None]:
X.shape, type(X), y.shape, type(y)

In [None]:
X.head()

In [None]:
y.head()

## **`Train Test Split` :: Creating TRAIN and VALIDATION sets.**

   * **`Reasoning`** :: These sets will be used for measuring the performance of ML Models.
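
Because fraud is the minority class, the split is stratified on the label. The sketch below uses made-up labels (90 non-fraud, 10 fraud) to illustrate what `stratify` does: the fraud share stays the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 non-fraud (0) and 10 fraud (1) providers.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=39
)

# Stratification keeps the fraud share identical in the train and validation parts.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a small validation set could end up with very few fraud cases, which would make F1 and AUC estimates unstable.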

In [None]:
from IPython.display import Image
Image("../input/medicare-prv-fraud-files/Model_Training_and_Validation_Strategy.png")

In [None]:
from sklearn.model_selection import train_test_split as tts

In [None]:
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.20, stratify=y, random_state=39)

In [None]:
# Checking shape of each set
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# Checking count of tgt labels in y_train
y_train.value_counts()

In [None]:
# Checking count of tgt labels in y_test
y_test.value_counts()

# **`Standardizing TRAIN & TEST sets`** 
## **Bringing every feature into the same scale.**
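
`RobustScaler` centres each feature on its median and divides by its interquartile range, so a few extreme providers do not dominate the scale. The sketch below, on a hypothetical single column with one outlier, checks that against a manual calculation and shows that the statistics come from the TRAIN part only.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature column with one extreme value.
train_col = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
test_col = np.array([[3.0], [5.0]])

# Fit on TRAIN only; the same median and IQR are then reused for TEST.
scaler = RobustScaler().fit(train_col)

median = np.median(train_col)
iqr = np.percentile(train_col, 75) - np.percentile(train_col, 25)
manual = (test_col - median) / iqr

print(scaler.transform(test_col).ravel(), manual.ravel())
```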

In [None]:
from sklearn.preprocessing import RobustScaler

In [None]:
# Standardize the data (train and test)
robust_scaler = RobustScaler()
robust_scaler.fit(X_train)
X_train_std = robust_scaler.transform(X_train)
X_test_std = robust_scaler.transform(X_test)

# **`Baseline Model Training`**

### **`Using Class Weighting Scheme`**

#### **`1. Logistic Regression`**

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, roc_curve, auc
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.calibration import CalibratedClassifierCV

In [None]:
# Training the model with all features and hyper-parameterized values
log_reg_1 = LogisticRegression(C=0.0316228, penalty='l1',
                               fit_intercept=True, solver='liblinear', tol=0.0001, max_iter=500, 
                               class_weight='balanced',
                               verbose=0, 
                               intercept_scaling=1.0,
                               multi_class='auto',
                               random_state=49)

log_reg_1.fit(X_train_std, y_train)

In [None]:
def pred_prob(clf, data): 
    """
    Description :: This function is created for storing the predicted probability using the trained model.
    
    Input :: It accepts below input parameters :
      - clf : Trained model classifier
      - data : Dataset for which we want to generate the predictions
    """
    y_pred = clf.predict_proba(data)[:,1]
    return y_pred

def draw_roc(train_fpr, train_tpr, test_fpr, test_tpr):
    """
    Description :: This function is created for calculating the AUC score on train and test data. And, plotting the ROC curve.
    
    Input :: It accepts below input parameters :
      - train_fpr : Train False +ve rate
      - train_tpr : Train True +ve rate
      - test_fpr : Test False +ve rate
      - test_tpr : Test True +ve rate
    """
    # calculate auc for train and test
    train_auc = auc(train_fpr, train_tpr)
    test_auc = auc(test_fpr, test_tpr)
    with plt.style.context('seaborn-poster'):
      plt.plot(train_fpr, train_tpr, label="Train AUC ="+"{:.4f}".format(train_auc), color='blue')
      plt.plot(test_fpr, test_tpr, label="Test AUC ="+"{:.4f}".format(test_auc), color='red')
      plt.legend()
      plt.xlabel("False Positive Rate(FPR)", fontdict=label_font_dict)
      plt.ylabel("True Positive Rate(TPR)", fontdict=label_font_dict)
      plt.title("Area Under Curve", fontdict=title_font_dict)
      plt.grid(True, which='major', color='lightgrey', linestyle='--')
      plt.minorticks_on()
      plt.show()
    
def find_best_threshold(threshold, fpr, tpr):
    """
    Description :: This function is created for finding the best threshold value.
    """
    t = threshold[np.argmax(tpr * (1-fpr))]
    return t

def predict_with_best_t(proba, threshold):
    """
    Description :: This function is created for generating the predictions based on the best threshold value.
    """
    predictions = []
    for i in proba:
        if i>=threshold:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions

def draw_confusion_matrix(best_t, x_train, x_test, y_train, y_test, y_train_pred, y_test_pred):
    """
    Description :: This function is created for plotting the confusion matrix of TRAIN and TEST sets.
    """
    fig, ax = plt.subplots(1,2, figsize=(20,6))

    train_prediction = predict_with_best_t(y_train_pred, best_t)
    cm = confusion_matrix(y_train, train_prediction)
    with plt.style.context('seaborn'):
        sns.heatmap(cm, annot=True, fmt='d', ax=ax[0], cmap='viridis')
        ax[0].set_title('Train Dataset Confusion Matrix', fontdict=title_font_dict)
        ax[0].set_xlabel("Predicted Label", fontdict=label_font_dict)
        ax[0].set_ylabel("Actual Label", fontdict=label_font_dict)

    test_prediction = predict_with_best_t(y_test_pred, best_t)
    cm = confusion_matrix(y_test, test_prediction)
    with plt.style.context('seaborn'):
        sns.heatmap(cm, annot=True, fmt='d', ax=ax[1], cmap='summer')
        ax[1].set_title('Test Dataset Confusion Matrix', fontdict=title_font_dict)
        ax[1].set_xlabel("Predicted Label", fontdict=label_font_dict)
        ax[1].set_ylabel("Actual Label", fontdict=label_font_dict)
    
    plt.show()
    
    return train_prediction, test_prediction

In [None]:
def validate_model(clf, x_train, x_test, y_train, y_test):
    """
    Description :: This function is created for performing the evaluation of the trained model.
    """
    # predict the probability of train data
    y_train_pred = pred_prob(clf, x_train)
    
    # predict the probability of test data
    y_test_pred = pred_prob(clf, x_test)
    
    # calculate tpr, fpr using roc_curve
    train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
    test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)
    
    # calculate auc for train and test
    train_auc = auc(train_fpr, train_tpr)
    print("### Train AUC = {}".format(train_auc))
    test_auc = auc(test_fpr, test_tpr)
    print("### Test AUC = {}".format(test_auc))
    
    # plotting the ROC curve
    draw_roc(train_fpr, train_tpr, test_fpr, test_tpr)
    
    # Best threshold value
    best_t = find_best_threshold(tr_thresholds, train_fpr, train_tpr)
    
    # Plotting the confusion matrices
    train_prediction, test_prediction = draw_confusion_matrix(best_t, x_train, x_test, y_train, y_test, y_train_pred, y_test_pred)
    
    # Generating the F1-scores
    train_f1_score = f1_score(y_train, train_prediction)
    test_f1_score = f1_score(y_test, test_prediction)
    
    return test_auc, train_f1_score, test_f1_score, best_t
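
As a rough illustration of how `find_best_threshold` picks its value, the sketch below uses made-up ROC points (not results from this notebook). The quantity `tpr * (1 - fpr)` is sensitivity times specificity, and the threshold at its maximum is returned.

```python
import numpy as np

# Hypothetical ROC points, in the layout returned by sklearn.metrics.roc_curve.
thresholds = np.array([0.9, 0.7, 0.4, 0.2])
fpr = np.array([0.0, 0.1, 0.3, 1.0])
tpr = np.array([0.2, 0.7, 0.95, 1.0])

# Sensitivity * specificity for every candidate threshold.
score = tpr * (1 - fpr)
best_t = thresholds[np.argmax(score)]
print(score, best_t)
```

In `validate_model` the threshold is chosen on the TRAIN ROC curve and then reused for the TEST predictions, so the test labels play no part in choosing it.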

In [None]:
# Validate Logistic Regression model
test_auc, train_f1_score, test_f1_score, best_t = validate_model(log_reg_1, X_train_std, X_test_std, y_train, y_test)

print("\n")
print("### Best Threshold = {:.4f}".format(best_t))
print("### Model AUC is : {:.4f}".format(test_auc))
print("### Model Train F1 Score is : {:.4f}".format(train_f1_score))
print("### Model Test F1 Score is : {:.4f}".format(test_f1_score))
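
The F1 scores printed above are computed on the thresholded predictions. For reference, a small worked example with hypothetical confusion-matrix counts shows how F1 combines precision and recall; `f1_from_counts` is an illustrative helper, not part of the notebook.

```python
def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 8 frauds caught, 4 false alarms, 2 frauds missed.
example_f1 = f1_from_counts(tp=8, fp=4, fn=2)
print(round(example_f1, 4))
```

True negatives do not enter the formula, which is why F1 is a more informative summary than accuracy when non-fraud providers dominate.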

In [None]:
feats_imps = pd.DataFrame({'Features': X_train.columns, 'Importance_Model_1': log_reg_1.coef_[0]})
feats_imps = feats_imps[feats_imps['Importance_Model_1'] != 0]
feats_imps.reset_index(drop=True, inplace=True)
feats_imps.head()
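
With an L1 penalty, many coefficients are driven to exactly zero, which is why the zero rows are filtered out above. The sign of a remaining coefficient gives the direction of its effect on the fraud probability. A minimal sketch on synthetic data (an assumption for illustration, not the notebook's features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first of three features carries signal.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_model = LogisticRegression(C=0.05, penalty="l1", solver="liblinear", random_state=49)
toy_model.fit(X_toy, y_toy)

coefs = toy_model.coef_[0]
kept = np.flatnonzero(coefs != 0)  # indices of features that survive the L1 penalty
print(coefs, kept)
```

Because the inputs were scaled beforehand, coefficient magnitudes are roughly comparable across features, but they remain a linear, model-specific notion of importance.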

In [None]:
top_15_pos_feats = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Features'].iloc[0:15]
top_15_pos_feats_scores = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Importance_Model_1'].iloc[0:15]

In [None]:
top_15_neg_feats = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=True)['Features'].iloc[0:15]
top_15_neg_feats_scores = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=True)['Importance_Model_1'].iloc[0:15]

In [None]:
with plt.style.context('seaborn-poster'):
    sns.barplot(y=top_15_pos_feats, x=top_15_pos_feats_scores, orient='h', palette='coolwarm')
    plt.xlabel("\nFeatures Importance", fontdict=label_font_dict)
    plt.ylabel("Features\n", fontdict=label_font_dict)
    plt.title("Top 15 Important Positive Features\n", fontdict=title_font_dict)

In [None]:
with plt.style.context('seaborn-poster'):
    sns.barplot(y=top_15_neg_feats, x=top_15_neg_feats_scores, orient='h', palette='coolwarm')
    plt.xlabel("\nFeatures Importance", fontdict=label_font_dict)
    plt.ylabel("Features\n", fontdict=label_font_dict)
    plt.title("Top 15 Important Negative Features\n", fontdict=title_font_dict)

#### **`2. Decision Tree`**

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Training the model with all features and hyper-parameterized values
dec_tree_2 = DecisionTreeClassifier(criterion='gini',
                                   max_depth= 6,
                                   max_features='log2',
                                   min_samples_leaf=150,
                                   min_samples_split=150,
                                   class_weight='balanced',
                                   random_state=49,
                                   splitter='best',
                                   min_weight_fraction_leaf=0.0,
                                   max_leaf_nodes=None,
                                   min_impurity_decrease=0.0,
                                   ccp_alpha=0.0,)

dec_tree_2.fit(X_train_std, y_train)

In [None]:
# Validate model
test_auc, train_f1_score, test_f1_score, best_t = validate_model(dec_tree_2, X_train_std, X_test_std, y_train, y_test)

print("\n")
print("### Best Threshold = {:.4f}".format(best_t))
print("### Model AUC is : {:.4f}".format(test_auc))
print("### Model Train F1 Score is : {:.4f}".format(train_f1_score))
print("### Model Test F1 Score is : {:.4f}".format(test_f1_score))

In [None]:
feats_imps = pd.DataFrame({'Features': X_train.columns, 'Importance_Model_1': dec_tree_2.feature_importances_})
feats_imps = feats_imps[feats_imps['Importance_Model_1'] != 0]
feats_imps.reset_index(drop=True, inplace=True)
feats_imps.head()

In [None]:
top_15_pos_feats = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Features'].iloc[0:15]
top_15_pos_feats_scores = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Importance_Model_1'].iloc[0:15]

In [None]:
with plt.style.context('seaborn-poster'):
    sns.barplot(y=top_15_pos_feats, x=top_15_pos_feats_scores, orient='h', palette='coolwarm')
    plt.xlabel("\nFeatures Importance", fontdict=label_font_dict)
    plt.ylabel("Features\n", fontdict=label_font_dict)
    plt.title("Top 15 Important Positive Features\n", fontdict=title_font_dict)

#### **`3. Random Forest Classifier`**

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Training the model with all features and hyper-parameterized values
rfc_3 = RandomForestClassifier(n_estimators=30,criterion='gini',
                                   max_depth= 4,
                                   max_features='sqrt',
                                   min_samples_leaf=50,
                                   min_samples_split=50,
                                   class_weight='balanced',
                                   random_state=49,
                                   min_weight_fraction_leaf=0.0,
                                   max_leaf_nodes=None,
                                   min_impurity_decrease=0.0,
                                   ccp_alpha=0.0,)

rfc_3.fit(X_train_std, y_train)

In [None]:
# Validate model
test_auc, train_f1_score, test_f1_score, best_t = validate_model(rfc_3, X_train_std, X_test_std, y_train, y_test)

print("\n")
print("### Best Threshold = {:.4f}".format(best_t))
print("### Model AUC is : {:.4f}".format(test_auc))
print("### Model Train F1 Score is : {:.4f}".format(train_f1_score))
print("### Model Test F1 Score is : {:.4f}".format(test_f1_score))

In [None]:
feats_imps = pd.DataFrame({'Features': X_train.columns, 'Importance_Model_1': rfc_3.feature_importances_})
feats_imps = feats_imps[feats_imps['Importance_Model_1'] != 0]
feats_imps.reset_index(drop=True, inplace=True)
feats_imps.head()

In [None]:
top_20_pos_feats = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Features'].iloc[0:20]
top_20_pos_feats_scores = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Importance_Model_1'].iloc[0:20]

In [None]:
with plt.style.context('seaborn-poster'):
    sns.barplot(y=top_20_pos_feats, x=top_20_pos_feats_scores, orient='h', palette='coolwarm')
    plt.xlabel("\nFeatures Importance", fontdict=label_font_dict)
    plt.ylabel("Features\n", fontdict=label_font_dict)
    plt.title("Top 20 Important Positive Features\n", fontdict=title_font_dict)

### **`Using Minority Synthetic Oversampling`**

#### **`Train Test Split` :: Creating TRAIN and VALIDATION sets.**

   * **`Reasoning`** :: These sets will be used for measuring the performance of ML Models.

In [None]:
from sklearn.model_selection import train_test_split as tts

In [None]:
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.25, stratify=y, random_state=39)

In [None]:
# Checking shape of each set
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# Checking count of tgt labels in y_train
y_train.value_counts()

In [None]:
# Checking count of tgt labels in y_test
y_test.value_counts()

#### **`Standardizing the TRAIN & TEST sets` :: Bringing every feature into the same scale.**

In [None]:
from sklearn.preprocessing import RobustScaler

In [None]:
# Standardize the data (train and test)
robust_scaler = RobustScaler()
robust_scaler.fit(X_train)
X_train_std = robust_scaler.transform(X_train)
X_test_std = robust_scaler.transform(X_test)

In [None]:
from collections import Counter

In [None]:
# BEFORE Oversampling :: Checking the percentage share of fraud and non-fraud records in the TRAIN set
counter = Counter(y_train)
counter

In [None]:
fraud_percentage = (counter[1]*100 / (counter[0]+counter[1]))
non_fraud_percentage = (counter[0]*100 / (counter[0]+counter[1]))
print("Fraud Percentage = {:.2f}% and Non-Fraud Percentage = {:.2f}%".format(fraud_percentage, non_fraud_percentage))

In [None]:
# Performing minority oversampling
from imblearn.over_sampling import ADASYN

In [None]:
oversample = ADASYN(sampling_strategy=0.35, n_neighbors=12)
X_train_ovsamp, y_train_ovsamp = oversample.fit_resample(X_train_std, y_train)

X_train_ovsamp.shape, y_train_ovsamp.shape

In [None]:
counter = Counter(y_train_ovsamp)
counter

In [None]:
fraud_percentage = (counter[1]*100 / (counter[0]+counter[1]))
non_fraud_percentage = (counter[0]*100 / (counter[0]+counter[1]))
print("Fraud Percentage = {:.2f}% and Non-Fraud Percentage = {:.2f}%".format(fraud_percentage, non_fraud_percentage))
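
In imbalanced-learn, a float `sampling_strategy` is the desired ratio of minority to majority samples after resampling. The arithmetic can be sketched as below; the class counts are hypothetical and `synthetic_needed` is an illustrative helper. ADASYN distributes the synthetic points adaptively, so the count it actually produces may differ slightly from this target.

```python
def synthetic_needed(n_majority, n_minority, ratio):
    """Approximate number of synthetic minority rows needed to reach the ratio."""
    target_minority = round(ratio * n_majority)
    return max(target_minority - n_minority, 0)

# Hypothetical TRAIN counts: 4000 non-fraud and 400 fraud providers.
extra = synthetic_needed(n_majority=4000, n_minority=400, ratio=0.35)
fraud_share = 100 * (400 + extra) / (4000 + 400 + extra)
print(extra, round(fraud_share, 2))
```

Only the TRAIN set is oversampled; the validation set keeps its original class balance so that the reported scores reflect real-world proportions.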

#### **`4. Logistic Regression`**

In [None]:
# Training the model with all features and hyper-parameterized values
log_reg_4 = LogisticRegression(C=0.0316228, penalty='l1',
                               fit_intercept=True, 
                               solver='liblinear', 
                               tol=0.0001, 
                               max_iter=500, 
                               verbose=0, 
                               intercept_scaling=1.0,
                               multi_class='auto',
                               random_state=49)

log_reg_4.fit(X_train_ovsamp, y_train_ovsamp)

In [None]:
# Validate Logistic Regression model
test_auc, train_f1_score, test_f1_score, best_t = validate_model(log_reg_4, X_train_ovsamp, X_test_std, y_train_ovsamp, y_test)

print("\n")
print("### Best Threshold = {:.4f}".format(best_t))
print("### Model AUC is : {:.4f}".format(test_auc))
print("### Model Train F1 Score is : {:.4f}".format(train_f1_score))
print("### Model Test F1 Score is : {:.4f}".format(test_f1_score))

In [None]:
feats_imps = pd.DataFrame({'Features': X_train.columns, 'Importance_Model_1': log_reg_4.coef_[0]})
feats_imps = feats_imps[feats_imps['Importance_Model_1'] != 0]
feats_imps.reset_index(drop=True, inplace=True)
feats_imps.head()

In [None]:
top_15_pos_feats = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Features'].iloc[0:15]
top_15_pos_feats_scores = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Importance_Model_1'].iloc[0:15]

In [None]:
top_15_neg_feats = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=True)['Features'].iloc[0:15]
top_15_neg_feats_scores = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=True)['Importance_Model_1'].iloc[0:15]

In [None]:
with plt.style.context('seaborn-poster'):
    sns.barplot(y=top_15_pos_feats, x=top_15_pos_feats_scores, orient='h', palette='coolwarm')
    plt.xlabel("\nFeatures Importance", fontdict=label_font_dict)
    plt.ylabel("Features\n", fontdict=label_font_dict)
    plt.title("Top 15 Important Positive Features\n", fontdict=title_font_dict)

In [None]:
with plt.style.context('seaborn-poster'):
    sns.barplot(y=top_15_neg_feats, x=top_15_neg_feats_scores, orient='h', palette='coolwarm')
    plt.xlabel("\nFeatures Importance", fontdict=label_font_dict)
    plt.ylabel("Features\n", fontdict=label_font_dict)
    plt.title("Top 15 Important Negative Features\n", fontdict=title_font_dict)

#### **`5. Decision Tree`**

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Training the model with all features and hyper-parameterized values
dec_tree_5 = DecisionTreeClassifier(criterion='gini',
                                   max_depth= 6,
                                   max_features='log2',
                                   min_samples_leaf=150,
                                   min_samples_split=150,
                                   random_state=49,
                                   splitter='best',
                                   min_weight_fraction_leaf=0.0,
                                   max_leaf_nodes=None,
                                   min_impurity_decrease=0.0,
                                   ccp_alpha=0.0,)

dec_tree_5.fit(X_train_ovsamp, y_train_ovsamp)

In [None]:
# Validate model
test_auc, train_f1_score, test_f1_score, best_t = validate_model(dec_tree_5, X_train_ovsamp, X_test_std, y_train_ovsamp, y_test)

print("\n")
print("### Best Threshold = {:.4f}".format(best_t))
print("### Model AUC is : {:.4f}".format(test_auc))
print("### Model Train F1 Score is : {:.4f}".format(train_f1_score))
print("### Model Test F1 Score is : {:.4f}".format(test_f1_score))

In [None]:
feats_imps = pd.DataFrame({'Features': X_train.columns, 'Importance_Model_1': dec_tree_5.feature_importances_})
feats_imps = feats_imps[feats_imps['Importance_Model_1'] != 0]
feats_imps.reset_index(drop=True, inplace=True)
feats_imps.head()

In [None]:
top_15_pos_feats = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Features'].iloc[0:15]
top_15_pos_feats_scores = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Importance_Model_1'].iloc[0:15]

In [None]:
with plt.style.context('seaborn-poster'):
    sns.barplot(y=top_15_pos_feats, x=top_15_pos_feats_scores, orient='h', palette='coolwarm')
    plt.xlabel("\nFeatures Importance", fontdict=label_font_dict)
    plt.ylabel("Features\n", fontdict=label_font_dict)
    plt.title("Top 15 Important Positive Features\n", fontdict=title_font_dict)

#### **`6. Random Forest Classifier`**

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Training the model with all features and hyper-parameterized values
rfc_6 = RandomForestClassifier(n_estimators=30,criterion='gini',
                                   max_depth= 4,
                                   max_features='sqrt',
                                   min_samples_leaf=50,
                                   min_samples_split=50,
                                   random_state=49,
                                   min_weight_fraction_leaf=0.0,
                                   max_leaf_nodes=None,
                                   min_impurity_decrease=0.0,
                                   ccp_alpha=0.0,)

rfc_6.fit(X_train_ovsamp, y_train_ovsamp)

In [None]:
# Validate model
test_auc, train_f1_score, test_f1_score, best_t = validate_model(rfc_6, X_train_ovsamp, X_test_std, y_train_ovsamp, y_test)

print("\n")
print("### Best Threshold = {:.4f}".format(best_t))
print("### Model AUC is : {:.4f}".format(test_auc))
print("### Model Train F1 Score is : {:.4f}".format(train_f1_score))
print("### Model Test F1 Score is : {:.4f}".format(test_f1_score))

In [None]:
feats_imps = pd.DataFrame({'Features': X_train.columns, 'Importance_Model_1': rfc_6.feature_importances_})
feats_imps = feats_imps[feats_imps['Importance_Model_1'] != 0]
feats_imps.reset_index(drop=True, inplace=True)
feats_imps.head()

In [None]:
top_20_pos_feats = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Features'].iloc[0:20]
top_20_pos_feats_scores = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Importance_Model_1'].iloc[0:20]

In [None]:
with plt.style.context('seaborn-poster'):
    sns.barplot(y=top_20_pos_feats, x=top_20_pos_feats_scores, orient='h', palette='coolwarm')
    plt.xlabel("\nFeatures Importance", fontdict=label_font_dict)
    plt.ylabel("Features\n", fontdict=label_font_dict)
    plt.title("Top 20 Important Positive Features\n", fontdict=title_font_dict)

# **`Models - SET 1 - RESULTS`**
- The best-performing model is highlighted in light yellow in the table below.

In [None]:
from IPython.display import Image
Image("../input/medicare-prv-fraud-files/Models_Set_1_Results.png")

## **`Models - SET 1 - OBSERVATIONS`**

- **Adding aggregated features at the levels listed below clearly helped in achieving good performance scores:**
    - Provider
    - Beneficiary
    - Attending Physician
    - Operating Physician
    - Other Physician, etc.
    
    
- **Adding the aggregated features listed below, which capture the interactions between the different parties involved in the CLAIM process, also helped in achieving good performance scores:**  
    - PROVIDER <--> BENE <--> PHYSICIANS
    - PROVIDER <--> BENE <--> ATTENDING PHYSICIAN <--> PROCEDURE CODES
    - PROVIDER <--> BENE <--> OPERATING PHYSICIAN <--> PROCEDURE CODES
    - PROVIDER <--> BENE <--> OTHER PHYSICIAN <--> PROCEDURE CODES
    - PROVIDER <--> BENE <--> ATTENDING PHYSICIAN <--> DIAGNOSIS CODES
    - PROVIDER <--> BENE <--> OPERATING PHYSICIAN <--> DIAGNOSIS CODES
    - PROVIDER <--> BENE <--> OTHER PHYSICIAN <--> DIAGNOSIS CODES
    - PROVIDER <--> BENE <--> DIAGNOSIS CODES <--> PROCEDURE CODES, etc.
    

- **Synthetic oversampling of the minority class does not improve the models' performance; instead, we see a slight drop in performance.** 
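
The difference between a single-level and a multi-level aggregated feature can be sketched as follows. The rows are made up, the new feature names are hypothetical, and `InscClaimAmtReimbursed` and `AttendingPhysician` are assumed to be the relevant claim columns.

```python
import pandas as pd

# Toy claim rows (hypothetical values).
claims = pd.DataFrame({
    "Provider": ["PRV1", "PRV1", "PRV1", "PRV2"],
    "BeneID": ["B1", "B1", "B2", "B1"],
    "AttendingPhysician": ["PHY1", "PHY1", "PHY2", "PHY1"],
    "InscClaimAmtReimbursed": [100, 300, 50, 80],
})

# Single-level feature: mean reimbursed amount per provider.
claims["PerProvider_Mean_Amt"] = (
    claims.groupby("Provider")["InscClaimAmtReimbursed"].transform("mean")
)

# Multi-level feature: mean per PROVIDER <--> BENE <--> ATTENDING PHYSICIAN combination.
keys = ["Provider", "BeneID", "AttendingPhysician"]
claims["PerProviderBenePhy_Mean_Amt"] = (
    claims.groupby(keys)["InscClaimAmtReimbursed"].transform("mean")
)
print(claims)
```

`transform` keeps one value per claim row, so these columns can later be summed at provider level together with all the other features.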

# **Feature Engg - SET 2**

In [None]:
from IPython.display import Image
Image("../input/medicare-prv-fraud-files/Feat_Engg_SET2.png")

In [None]:
train_bene_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Train_Beneficiarydata-1542865627584.csv")
train_ip_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Train_Inpatientdata-1542865627584.csv")
train_op_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Train_Outpatientdata-1542865627584.csv")
train_tgt_lbls_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Train-1542865627584.csv")

In [None]:
test_bene_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Test_Beneficiarydata-1542969243754.csv")
test_ip_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Test_Inpatientdata-1542969243754.csv")
test_op_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Test_Outpatientdata-1542969243754.csv")
test_tgt_lbls_df = pd.read_csv("../input/healthcare-provider-fraud-detection-analysis/Test-1542969243754.csv")

## ***Exploring_Target_Labels_Data***

In [None]:
train_tgt_lbls_df.head()

In [None]:
test_tgt_lbls_df.head()

* **Check the Fraud and Non-Fraud Counts**

In [None]:
print("### The unique number of train providers are {}. ###".format(train_tgt_lbls_df.shape[0]))

In [None]:
print("### The unique number of test providers are {}. ###".format(test_tgt_lbls_df.shape[0]))

In [None]:
with plt.style.context('seaborn-poster'):
    fig = train_tgt_lbls_df["PotentialFraud"].value_counts().plot(kind='bar', color=['green','orange'])
    # The "patches" attribute holds the rectangle bars drawn on the graph.
    ## Their position and size (x, y, width & height) are then used to place the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/train_tgt_lbls_df.shape[0],2))+"%"}', (x + width/2, y + height*1.015), ha='center', fontsize=13.5)
    # Providing the labels and title to the graph
    plt.xlabel("Provider Fraud or Not?", fontdict=label_font_dict)
    plt.ylabel("Number or % share of providers\n", fontdict=label_font_dict)
    plt.yticks(np.arange(0,5100,500))
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title("Distribution of Fraud & Non-fraud providers\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* From the above plot, we can say that about 90% of the providers are not fraudsters and only about 9% of them are involved in fraud.
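
The same shares can be read directly from the label column instead of from the plot annotations. A minimal sketch on a made-up label series (the 9:1 split here is illustrative, not the dataset's exact split):

```python
import pandas as pd

# Hypothetical provider labels: 9 non-fraud and 1 fraud.
labels = pd.Series(["No"] * 9 + ["Yes"] * 1, name="PotentialFraud")

# normalize=True returns fractions; multiply by 100 for percentages.
share = labels.value_counts(normalize=True).mul(100).round(2)
print(share)
```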

- **Removing some entirely NULL Procedure Codes Features**

- TRAIN Set

In [None]:
train_ip_df.shape, train_op_df.shape

In [None]:
(train_ip_df['ClmProcedureCode_4'].isna().sum() / train_ip_df.shape[0])*100,\
(train_ip_df['ClmProcedureCode_5'].isna().sum() / train_ip_df.shape[0])*100,\
(train_ip_df['ClmProcedureCode_6'].isna().sum() / train_ip_df.shape[0])*100

In [None]:
(train_op_df['ClmProcedureCode_4'].isna().sum() / train_op_df.shape[0])*100,\
(train_op_df['ClmProcedureCode_5'].isna().sum() / train_op_df.shape[0])*100,\
(train_op_df['ClmProcedureCode_6'].isna().sum() / train_op_df.shape[0])*100

- Unseen Set

In [None]:
test_ip_df.shape, test_op_df.shape

In [None]:
(test_ip_df['ClmProcedureCode_4'].isna().sum() / test_ip_df.shape[0])*100,\
(test_ip_df['ClmProcedureCode_5'].isna().sum() / test_ip_df.shape[0])*100,\
(test_ip_df['ClmProcedureCode_6'].isna().sum() / test_ip_df.shape[0])*100

In [None]:
(test_op_df['ClmProcedureCode_4'].isna().sum() / test_op_df.shape[0])*100,\
(test_op_df['ClmProcedureCode_5'].isna().sum() / test_op_df.shape[0])*100,\
(test_op_df['ClmProcedureCode_6'].isna().sum() / test_op_df.shape[0])*100

- **Removing the above columns**
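
The null-percentage check and the drop can also be expressed in a more general form. The sketch below runs on a toy frame, and the 99% cut-off is an assumption made for illustration; the notebook itself simply drops the three columns named above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one partly filled and one entirely empty column.
toy = pd.DataFrame({
    "ClmProcedureCode_1": [4019.0, np.nan, 9904.0, np.nan],
    "ClmProcedureCode_4": [np.nan, np.nan, np.nan, np.nan],
})

# Percentage of nulls per column.
null_pct = toy.isna().mean() * 100

# Columns at or above the (assumed) 99% cut-off are dropped.
mostly_null = null_pct[null_pct >= 99.0].index.tolist()
trimmed = toy.drop(columns=mostly_null)
print(null_pct, mostly_null)
```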

In [None]:
train_ip_df.drop(['ClmProcedureCode_4','ClmProcedureCode_5','ClmProcedureCode_6'],axis=1,inplace=True)
train_op_df.drop(['ClmProcedureCode_4','ClmProcedureCode_5','ClmProcedureCode_6'],axis=1,inplace=True)

In [None]:
test_ip_df.drop(['ClmProcedureCode_4','ClmProcedureCode_5','ClmProcedureCode_6'],axis=1,inplace=True)
test_op_df.drop(['ClmProcedureCode_4','ClmProcedureCode_5','ClmProcedureCode_6'],axis=1,inplace=True)

In [None]:
# Converting the PROC CODES into STRING format
proc_code_cols = ['ClmProcedureCode_1', 'ClmProcedureCode_2', 'ClmProcedureCode_3']
for df in (train_ip_df, train_op_df):
    for col in proc_code_cols:
        # val == val is False only for NaN, so nulls stay null; other values lose the ".0" float suffix
        df[col] = df[col].apply(lambda val: str(val).split(".")[0] if val == val else np.nan)

In [None]:
# Converting the PROC CODES into STRING format
proc_code_cols = ['ClmProcedureCode_1', 'ClmProcedureCode_2', 'ClmProcedureCode_3']
for df in (test_ip_df, test_op_df):
    for col in proc_code_cols:
        # val == val is False only for NaN, so nulls stay null; other values lose the ".0" float suffix
        df[col] = df[col].apply(lambda val: str(val).split(".")[0] if val == val else np.nan)

### **Adding `New Feature - 1` :: `Admitted` or `Not Admitted` indicator in IP and OP Dataset**

* **Adding in IP Dataset**

In [None]:
train_ip_df["Admitted?"] = 1

In [None]:
test_ip_df["Admitted?"] = 1

In [None]:
train_ip_df.head()

In [None]:
test_ip_df.head()

* **Adding in OP Dataset**

In [None]:
train_op_df["Admitted?"] = 0

In [None]:
test_op_df["Admitted?"] = 0

In [None]:
train_op_df.head()

In [None]:
test_op_df.head()

### **Merging the Datasets**
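
An outer merge on all common columns effectively stacks the IP and OP claims: each claim appears once, and the IP-only columns are null for OP rows. A minimal sketch on toy frames, where the values are hypothetical and `AdmissionDt` is assumed to be an IP-only column:

```python
import pandas as pd

# Toy in-patient and out-patient claims (hypothetical values).
ip = pd.DataFrame({
    "ClaimID": ["C1"], "Provider": ["PRV1"],
    "Admitted?": [1], "AdmissionDt": ["2009-01-01"],
})
op = pd.DataFrame({
    "ClaimID": ["C2", "C3"], "Provider": ["PRV1", "PRV2"],
    "Admitted?": [0, 0],
})

common = [col for col in ip.columns if col in op.columns]
merged = pd.merge(ip, op, on=common, how="outer")
print(merged)
```

The `Admitted?` flag added earlier is what lets later steps tell IP rows from OP rows after the two sources are combined.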

In [None]:
# Common columns must be 28
common_cols_tr = [col for col in train_ip_df.columns if col in train_op_df.columns]
len(common_cols_tr)

In [None]:
# Merging the IP and OP dataset on the basis of common columns
train_ip_op_df = pd.merge(left=train_ip_df, right=train_op_df, left_on=common_cols_tr, right_on=common_cols_tr, how="outer")
train_ip_op_df.shape

In [None]:
# Merging the IP and OP dataset on the basis of common columns
test_ip_op_df = pd.merge(left=test_ip_df, right=test_op_df, left_on=common_cols_tr, right_on=common_cols_tr, how="outer")
test_ip_op_df.shape

In [None]:
train_ip_op_df.head()

In [None]:
test_ip_op_df.head()

### **Merging the IP_OP Dataset with BENE Data**

In [None]:
# Joining the IP_OP dataset with the BENE data
train_ip_op_bene_df = pd.merge(left=train_ip_op_df, right=train_bene_df, left_on='BeneID', right_on='BeneID',how='inner')
train_ip_op_bene_df.shape

In [None]:
# Joining the IP_OP dataset with the BENE data
test_ip_op_bene_df = pd.merge(left=test_ip_op_df, right=test_bene_df, left_on='BeneID', right_on='BeneID',how='inner')
test_ip_op_bene_df.shape

### **Merging the IP_OP_BENE Dataset with PROVIDER level Tgt Labels Data**
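
Since every provider should carry exactly one label, this join is many-to-one. As an optional safeguard (not used in the notebook), pandas can check that assumption through the `validate` argument of `pd.merge`. A sketch on toy frames with hypothetical values:

```python
import pandas as pd

# Toy claims and provider labels (hypothetical values).
claims = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3"],
    "Provider": ["PRV1", "PRV1", "PRV2"],
})
provider_labels = pd.DataFrame({
    "Provider": ["PRV1", "PRV2"],
    "PotentialFraud": ["Yes", "No"],
})

# validate="many_to_one" raises MergeError if a provider appears twice in the labels.
joined = pd.merge(claims, provider_labels, on="Provider", how="inner", validate="many_to_one")
print(joined)
```

If the row count after the join equals the claim count before it, no claims were lost to providers that are missing from the label file.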

In [None]:
# Joining the IP_OP_BENE dataset with the Tgt Label Provider Data
train_iobp_df = pd.merge(left=train_ip_op_bene_df, right=train_tgt_lbls_df, left_on='Provider', right_on='Provider',how='inner')
train_iobp_df.shape

In [None]:
# Joining the IP_OP_BENE dataset with the Tgt Label Provider Data
test_iobp_df = pd.merge(left=test_ip_op_bene_df, right=test_tgt_lbls_df, left_on='Provider', right_on='Provider',how='inner')
test_iobp_df.shape
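
An inner join silently drops claims whose key has no match, and it duplicates claims if the right-hand table has repeated keys. The sketch below (toy data, hypothetical values) shows one way to guard against both: the `validate` argument of `pd.merge` checks the key relationship, and a row-count comparison confirms that no claim was lost.

```python
import pandas as pd

# Hypothetical toy data: three claims belonging to two beneficiaries.
claims = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3"],
    "BeneID": ["B1", "B1", "B2"],
})
bene = pd.DataFrame({
    "BeneID": ["B1", "B2"],
    "Gender": [1, 2],
})

# validate="many_to_one" raises a MergeError if BeneID is duplicated in `bene`.
joined = pd.merge(claims, bene, on="BeneID", how="inner", validate="many_to_one")

# If every claim found its beneficiary, the row count is unchanged.
rows_preserved = len(joined) == len(claims)
print(rows_preserved)
```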

### **Entire Dataset**

In [None]:
train_iobp_df.shape

In [None]:
test_iobp_df.shape

In [None]:
# Unique Providers
train_iobp_df["Provider"].nunique()

In [None]:
# Unique Providers
test_iobp_df["Provider"].nunique()

In [None]:
# Unique Claim numbers
train_iobp_df["ClaimID"].nunique()

In [None]:
# Unique Claim numbers
test_iobp_df["ClaimID"].nunique()

In [None]:
# Joining with the PRV Tgt Labels
prvs_claims_df = pd.DataFrame(train_iobp_df.groupby(['Provider'])['ClaimID'].count()).reset_index()
prvs_claims_tgt_lbls_df = pd.merge(left=prvs_claims_df, right=train_tgt_lbls_df, on='Provider', how='inner')
prvs_claims_tgt_lbls_df

In [None]:
# Joining with the PRV Tgt Labels
prvs_claims_df = pd.DataFrame(test_iobp_df.groupby(['Provider'])['ClaimID'].count()).reset_index()
prvs_claims_tgt_lbls_df = pd.merge(left=prvs_claims_df, right=test_tgt_lbls_df, on='Provider', how='inner')
prvs_claims_tgt_lbls_df

- **Fraud Count at Claims level**

In [None]:
print(pd.DataFrame(train_iobp_df['PotentialFraud'].value_counts()), "\n")

with plt.style.context('seaborn-poster'):
    fig = train_iobp_df['PotentialFraud'].value_counts().plot(kind='bar', color=['green','orange'])
    # The "patches" attribute gives us the rectangle bars of the graph.
    ## Using their location (x, y) and size (width & height) values we add the percentage annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/train_iobp_df.shape[0],2))+"%"}', (x + width/2, y + height*1.015), ha='center', fontsize=13.5)
    # Providing the labels and title to the graph
    plt.xlabel("Fraud or Not?", fontdict=label_font_dict)
    plt.ylabel("Number (or %) of claims\n", fontdict=label_font_dict)
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title("Distribution of Fraud & Non-fraud claims\n", fontdict=title_font_dict)
    plt.plot();

**`OBSERVATION`**
* The above plot shows us that 62% of claims are Non-Fraud and 38% of them are Fraudulent. 
    * By looking at the percentages we may say that there is a class-imbalance problem but after looking at the number of records it doesn't seem to be a severe class-imbalance problem. 
        * So, I'll try some class balancing techniques only after training a baseline model w/o any synthetic or class weighting techniques.
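
The class shares can also be computed directly instead of being read off the plot. This is a small sketch on a made-up label column (the 6/4 split is purely illustrative and is not the real distribution); it includes a sanity check that the percentages add up to 100.

```python
import pandas as pd

# Hypothetical labels: 6 non-fraud and 4 fraud claims.
labels = pd.Series(["No"] * 6 + ["Yes"] * 4, name="PotentialFraud")

counts = labels.value_counts()
shares = labels.value_counts(normalize=True).mul(100).round(2)

# Sanity check: the class percentages must sum to 100.
assert abs(shares.sum() - 100) < 1e-6

summary = pd.DataFrame({"claims": counts, "percent": shares})
print(summary)
```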

# **`VECTOR Embeddings`**

- Generating the vector embeddings of ::
    - `CLAIM Admit Diagnosis Codes`
    - `Diagnosis Codes`
    - `Procedure Codes`


- For now, I'm not including the `Dx Related Group Code` for generating the similarity score b/w these features.

In [None]:
import pickle 

# Loading the embeddings of CAD, DIAG and PROC codes
## Sentence embeddings are generated from the pre-trained Bio-BERT on PubMed and PMC datasets
### Dx and Proc Codes are downloaded from ICD-9 portal Effective from 2014
#### Refer Notebook --> CS_1_Codes_Desc_Embeddings.ipynb
with open('../input/medicare-prv-fraud-files/cad_diag_codes_embeddings.pkl', 'rb') as f:
    loaded_cad_dict = pickle.load(f)

with open('../input/medicare-prv-fraud-files/proc_codes_embeddings.pkl', 'rb') as f:
    loaded_proc_dict = pickle.load(f)

In [None]:
zeros_vec = np.zeros(shape=(1,768),dtype='float')

In [None]:
# Fetching the embeddings of every CAD and Dx CODE
train_iobp_df['Clm_Admit_Dx_embeddings'] = train_iobp_df['ClmAdmitDiagnosisCode'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
train_iobp_df['Clm_Dx_1_embeddings'] = train_iobp_df['ClmDiagnosisCode_1'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
train_iobp_df['Clm_Dx_2_embeddings'] = train_iobp_df['ClmDiagnosisCode_2'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
train_iobp_df['Clm_Dx_3_embeddings'] = train_iobp_df['ClmDiagnosisCode_3'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
train_iobp_df['Clm_Dx_4_embeddings'] = train_iobp_df['ClmDiagnosisCode_4'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
train_iobp_df['Clm_Dx_5_embeddings'] = train_iobp_df['ClmDiagnosisCode_5'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
train_iobp_df['Clm_Dx_6_embeddings'] = train_iobp_df['ClmDiagnosisCode_6'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
train_iobp_df['Clm_Dx_7_embeddings'] = train_iobp_df['ClmDiagnosisCode_7'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
train_iobp_df['Clm_Dx_8_embeddings'] = train_iobp_df['ClmDiagnosisCode_8'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
train_iobp_df['Clm_Dx_9_embeddings'] = train_iobp_df['ClmDiagnosisCode_9'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
train_iobp_df['Clm_Dx_10_embeddings'] = train_iobp_df['ClmDiagnosisCode_10'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))

In [None]:
# Fetching the embeddings of every CAD and Dx CODE
test_iobp_df['Clm_Admit_Dx_embeddings'] = test_iobp_df['ClmAdmitDiagnosisCode'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
test_iobp_df['Clm_Dx_1_embeddings'] = test_iobp_df['ClmDiagnosisCode_1'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
test_iobp_df['Clm_Dx_2_embeddings'] = test_iobp_df['ClmDiagnosisCode_2'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
test_iobp_df['Clm_Dx_3_embeddings'] = test_iobp_df['ClmDiagnosisCode_3'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
test_iobp_df['Clm_Dx_4_embeddings'] = test_iobp_df['ClmDiagnosisCode_4'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
test_iobp_df['Clm_Dx_5_embeddings'] = test_iobp_df['ClmDiagnosisCode_5'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
test_iobp_df['Clm_Dx_6_embeddings'] = test_iobp_df['ClmDiagnosisCode_6'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
test_iobp_df['Clm_Dx_7_embeddings'] = test_iobp_df['ClmDiagnosisCode_7'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
test_iobp_df['Clm_Dx_8_embeddings'] = test_iobp_df['ClmDiagnosisCode_8'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
test_iobp_df['Clm_Dx_9_embeddings'] = test_iobp_df['ClmDiagnosisCode_9'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))
test_iobp_df['Clm_Dx_10_embeddings'] = test_iobp_df['ClmDiagnosisCode_10'].apply(lambda code: loaded_cad_dict.get(code,zeros_vec[0]))

In [None]:
# Adding the embeddings of all the Dx Codes for every claim
train_iobp_df['Clm_All_Dx_embeddings'] = train_iobp_df[['Clm_Dx_1_embeddings','Clm_Dx_2_embeddings','Clm_Dx_3_embeddings','Clm_Dx_4_embeddings','Clm_Dx_5_embeddings','Clm_Dx_6_embeddings','Clm_Dx_7_embeddings','Clm_Dx_8_embeddings','Clm_Dx_9_embeddings','Clm_Dx_10_embeddings']]\
.apply(lambda row : row['Clm_Dx_1_embeddings'] + row['Clm_Dx_2_embeddings'] + row['Clm_Dx_3_embeddings'] + row['Clm_Dx_4_embeddings'] + row['Clm_Dx_5_embeddings'] + row['Clm_Dx_6_embeddings'] + row['Clm_Dx_7_embeddings'] + row['Clm_Dx_8_embeddings'] + row['Clm_Dx_9_embeddings'] + row['Clm_Dx_10_embeddings'], axis=1)

In [None]:
# Final embeddings of all the Dx Codes for every claim
train_iobp_df['Clm_All_Dx_embeddings']

In [None]:
# Adding the embeddings of all the Dx Codes for every claim
test_iobp_df['Clm_All_Dx_embeddings'] = test_iobp_df[['Clm_Dx_1_embeddings','Clm_Dx_2_embeddings','Clm_Dx_3_embeddings','Clm_Dx_4_embeddings','Clm_Dx_5_embeddings','Clm_Dx_6_embeddings','Clm_Dx_7_embeddings','Clm_Dx_8_embeddings','Clm_Dx_9_embeddings','Clm_Dx_10_embeddings']]\
.apply(lambda row : row['Clm_Dx_1_embeddings'] + row['Clm_Dx_2_embeddings'] + row['Clm_Dx_3_embeddings'] + row['Clm_Dx_4_embeddings'] + row['Clm_Dx_5_embeddings'] + row['Clm_Dx_6_embeddings'] + row['Clm_Dx_7_embeddings'] + row['Clm_Dx_8_embeddings'] + row['Clm_Dx_9_embeddings'] + row['Clm_Dx_10_embeddings'], axis=1)

In [None]:
# Final embeddings of all the Dx Codes for every claim
test_iobp_df['Clm_All_Dx_embeddings']
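
The lookup-and-sum steps above can be sketched compactly on a toy example. Everything here is hypothetical: the dictionary `toy_embeddings`, the two codes and the 4-dimensional vectors stand in for the real 768-dimensional embeddings. Codes that are missing or not in the dictionary fall back to the zero vector, so they add nothing to the claim-level sum.

```python
import numpy as np
import pandas as pd

dim = 4  # the notebook uses 768-dimensional vectors
zeros = np.zeros(dim)

# Hypothetical code -> embedding dictionary.
toy_embeddings = {
    "4019": np.array([1.0, 0.0, 0.0, 0.0]),
    "2724": np.array([0.0, 1.0, 0.0, 0.0]),
}

df = pd.DataFrame({
    "ClmDiagnosisCode_1": ["4019", "2724"],
    "ClmDiagnosisCode_2": ["2724", np.nan],  # second claim has no 2nd code
})
code_cols = ["ClmDiagnosisCode_1", "ClmDiagnosisCode_2"]

# One summed vector per claim, kept as a plain (n_claims, dim) matrix.
dx_matrix = np.stack([
    np.sum([toy_embeddings.get(code, zeros) for code in codes], axis=0)
    for codes in df[code_cols].to_numpy()
])

print(dx_matrix)
```

Keeping the result as a NumPy matrix rather than a column of arrays is only a design choice for this sketch; the notebook stores the vectors inside the DataFrame.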

In [None]:
# Fetching the embeddings of every PROC Code
train_iobp_df['Clm_PROC_1_embeddings'] = train_iobp_df['ClmProcedureCode_1'].apply(lambda code: loaded_proc_dict.get(code,zeros_vec[0]))
train_iobp_df['Clm_PROC_2_embeddings'] = train_iobp_df['ClmProcedureCode_2'].apply(lambda code: loaded_proc_dict.get(code,zeros_vec[0]))
train_iobp_df['Clm_PROC_3_embeddings'] = train_iobp_df['ClmProcedureCode_3'].apply(lambda code: loaded_proc_dict.get(code,zeros_vec[0]))

In [None]:
# Fetching the embeddings of every PROC Code
test_iobp_df['Clm_PROC_1_embeddings'] = test_iobp_df['ClmProcedureCode_1'].apply(lambda code: loaded_proc_dict.get(code,zeros_vec[0]))
test_iobp_df['Clm_PROC_2_embeddings'] = test_iobp_df['ClmProcedureCode_2'].apply(lambda code: loaded_proc_dict.get(code,zeros_vec[0]))
test_iobp_df['Clm_PROC_3_embeddings'] = test_iobp_df['ClmProcedureCode_3'].apply(lambda code: loaded_proc_dict.get(code,zeros_vec[0]))

In [None]:
# Adding the embeddings of all the PROC Codes for every claim
train_iobp_df['Clm_All_PROC_embeddings'] = train_iobp_df[['Clm_PROC_1_embeddings','Clm_PROC_2_embeddings','Clm_PROC_3_embeddings']]\
.apply(lambda row : row['Clm_PROC_1_embeddings'] + row['Clm_PROC_2_embeddings'] + row['Clm_PROC_3_embeddings'], axis=1)

In [None]:
# Final embeddings of all the PROC Codes for every claim
train_iobp_df['Clm_All_PROC_embeddings']

In [None]:
# Adding the embeddings of all the PROC Codes for every claim
test_iobp_df['Clm_All_PROC_embeddings'] = test_iobp_df[['Clm_PROC_1_embeddings','Clm_PROC_2_embeddings','Clm_PROC_3_embeddings']]\
.apply(lambda row : row['Clm_PROC_1_embeddings'] + row['Clm_PROC_2_embeddings'] + row['Clm_PROC_3_embeddings'], axis=1)

In [None]:
# Final embeddings of all the PROC Codes for every claim
test_iobp_df['Clm_All_PROC_embeddings']

In [None]:
from scipy.spatial import distance

In [None]:
# Generating the similarity scores features
## Similarity b/w CAD and Dx Codes
## Similarity b/w CAD and Proc Codes
## Similarity b/w Dx and Proc Codes
train_iobp_df['Clm_Admit_Diag_Dx_Similarity'] = train_iobp_df[['Clm_Admit_Dx_embeddings','Clm_All_Dx_embeddings']].apply(lambda row: 1 - distance.cosine(row['Clm_Admit_Dx_embeddings'], row['Clm_All_Dx_embeddings']), axis=1)
train_iobp_df['Clm_Admit_Diag_Proc_Similarity'] = train_iobp_df[['Clm_Admit_Dx_embeddings','Clm_All_PROC_embeddings']].apply(lambda row: 1 - distance.cosine(row['Clm_Admit_Dx_embeddings'], row['Clm_All_PROC_embeddings']), axis=1)
train_iobp_df['Clm_Dx_Proc_Similarity'] = train_iobp_df[['Clm_All_Dx_embeddings','Clm_All_PROC_embeddings']].apply(lambda row: 1 - distance.cosine(row['Clm_All_Dx_embeddings'], row['Clm_All_PROC_embeddings']), axis=1)

In [None]:
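# Assigning the result back (instead of inplace=True on a column) also works with pandas copy-on-write
train_iobp_df['Clm_Admit_Diag_Dx_Similarity'] = train_iobp_df['Clm_Admit_Diag_Dx_Similarity'].fillna(value=0)
train_iobp_df['Clm_Admit_Diag_Proc_Similarity'] = train_iobp_df['Clm_Admit_Diag_Proc_Similarity'].fillna(value=0)
train_iobp_df['Clm_Dx_Proc_Similarity'] = train_iobp_df['Clm_Dx_Proc_Similarity'].fillna(value=0)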

In [None]:
train_iobp_df['Clm_CAD_Dx_Proc_Similarity'] = train_iobp_df['Clm_Admit_Diag_Dx_Similarity'] + train_iobp_df['Clm_Admit_Diag_Proc_Similarity'] + train_iobp_df['Clm_Dx_Proc_Similarity']

In [None]:
# Generating the similarity scores features
## Similarity b/w CAD and Dx Codes
## Similarity b/w CAD and Proc Codes
## Similarity b/w Dx and Proc Codes
test_iobp_df['Clm_Admit_Diag_Dx_Similarity'] = test_iobp_df[['Clm_Admit_Dx_embeddings','Clm_All_Dx_embeddings']].apply(lambda row: 1 - distance.cosine(row['Clm_Admit_Dx_embeddings'], row['Clm_All_Dx_embeddings']), axis=1)
test_iobp_df['Clm_Admit_Diag_Proc_Similarity'] = test_iobp_df[['Clm_Admit_Dx_embeddings','Clm_All_PROC_embeddings']].apply(lambda row: 1 - distance.cosine(row['Clm_Admit_Dx_embeddings'], row['Clm_All_PROC_embeddings']), axis=1)
test_iobp_df['Clm_Dx_Proc_Similarity'] = test_iobp_df[['Clm_All_Dx_embeddings','Clm_All_PROC_embeddings']].apply(lambda row: 1 - distance.cosine(row['Clm_All_Dx_embeddings'], row['Clm_All_PROC_embeddings']), axis=1)

In [None]:
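# Assigning the result back (instead of inplace=True on a column) also works with pandas copy-on-write
test_iobp_df['Clm_Admit_Diag_Dx_Similarity'] = test_iobp_df['Clm_Admit_Diag_Dx_Similarity'].fillna(value=0)
test_iobp_df['Clm_Admit_Diag_Proc_Similarity'] = test_iobp_df['Clm_Admit_Diag_Proc_Similarity'].fillna(value=0)
test_iobp_df['Clm_Dx_Proc_Similarity'] = test_iobp_df['Clm_Dx_Proc_Similarity'].fillna(value=0)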

In [None]:
test_iobp_df['Clm_CAD_Dx_Proc_Similarity'] = test_iobp_df['Clm_Admit_Diag_Dx_Similarity'] + test_iobp_df['Clm_Admit_Diag_Proc_Similarity'] + test_iobp_df['Clm_Dx_Proc_Similarity']
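
An all-zero vector has no direction, so its cosine similarity with anything is undefined; the cells above deal with this by filling the resulting nulls with 0. The helper below is a hypothetical alternative (the name `safe_cosine_similarity` is not from the notebook) that returns 0 directly in that case, sketched with plain NumPy.

```python
import numpy as np

def safe_cosine_similarity(u, v):
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    if norm_u == 0.0 or norm_v == 0.0:
        # Undefined similarity: treat "no embedding" as "no similarity".
        return 0.0
    return float(np.dot(u, v) / (norm_u * norm_v))

same_direction = safe_cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = safe_cosine_similarity([1.0, 0.0], [0.0, 1.0])
with_zero_vec = safe_cosine_similarity([0.0, 0.0], [1.0, 1.0])
print(same_direction, orthogonal, with_zero_vec)
```

Since each similarity lies in [-1, 1], the combined `Clm_CAD_Dx_Proc_Similarity` feature (a sum of three similarities) lies in [-3, 3].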

In [None]:
# Dropping the individual CAD, Dx and PROC embeddings features
train_iobp_df.drop(['Clm_Admit_Dx_embeddings',
'Clm_Dx_1_embeddings',
'Clm_Dx_2_embeddings',
'Clm_Dx_3_embeddings',
'Clm_Dx_4_embeddings',
'Clm_Dx_5_embeddings',
'Clm_Dx_6_embeddings',
'Clm_Dx_7_embeddings',
'Clm_Dx_8_embeddings',
'Clm_Dx_9_embeddings',
'Clm_Dx_10_embeddings',
'Clm_PROC_1_embeddings',
'Clm_PROC_2_embeddings',
'Clm_PROC_3_embeddings'],axis=1,inplace=True)

test_iobp_df.drop(['Clm_Admit_Dx_embeddings',
'Clm_Dx_1_embeddings',
'Clm_Dx_2_embeddings',
'Clm_Dx_3_embeddings',
'Clm_Dx_4_embeddings',
'Clm_Dx_5_embeddings',
'Clm_Dx_6_embeddings',
'Clm_Dx_7_embeddings',
'Clm_Dx_8_embeddings',
'Clm_Dx_9_embeddings',
'Clm_Dx_10_embeddings',
'Clm_PROC_1_embeddings',
'Clm_PROC_2_embeddings',
'Clm_PROC_3_embeddings'],axis=1,inplace=True)

## **Feature Engineering**
**`Let's create some features`**

### **Adding `New Feature - 2` :: `Is_Alive?`**

    - Is Alive? = Yes if DOD is NaN else No

In [None]:
train_iobp_df['DOB'] = pd.to_datetime(train_iobp_df['DOB'], format="%Y-%m-%d")
train_iobp_df['DOD'] = pd.to_datetime(train_iobp_df['DOD'], format="%Y-%m-%d")

In [None]:
test_iobp_df['DOB'] = pd.to_datetime(test_iobp_df['DOB'], format="%Y-%m-%d")
test_iobp_df['DOD'] = pd.to_datetime(test_iobp_df['DOD'], format="%Y-%m-%d")

In [None]:
# A missing DOD (a null is the only value not equal to itself) means no death was recorded, so the beneficiary is alive
train_iobp_df['Is_Alive?'] = train_iobp_df['DOD'].apply(lambda val: 'Yes' if val != val else 'No')

In [None]:
# A missing DOD (a null is the only value not equal to itself) means no death was recorded, so the beneficiary is alive
test_iobp_df['Is_Alive?'] = test_iobp_df['DOD'].apply(lambda val: 'Yes' if val != val else 'No')

In [None]:
train_iobp_df['Is_Alive?'].value_counts()

In [None]:
test_iobp_df['Is_Alive?'].value_counts()

### **Adding `New Feature - 3` :: `Claim_Duration`**
    
    - Claim Duration = Claim End Date - Claim Start Date

In [None]:
train_iobp_df['ClaimStartDt'] = pd.to_datetime(train_iobp_df['ClaimStartDt'], format="%Y-%m-%d")
train_iobp_df['ClaimEndDt'] = pd.to_datetime(train_iobp_df['ClaimEndDt'], format="%Y-%m-%d")

train_iobp_df['Claim_Duration'] = (train_iobp_df['ClaimEndDt'] - train_iobp_df['ClaimStartDt']).dt.days

In [None]:
test_iobp_df['ClaimStartDt'] = pd.to_datetime(test_iobp_df['ClaimStartDt'], format="%Y-%m-%d")
test_iobp_df['ClaimEndDt'] = pd.to_datetime(test_iobp_df['ClaimEndDt'], format="%Y-%m-%d")

test_iobp_df['Claim_Duration'] = (test_iobp_df['ClaimEndDt'] - test_iobp_df['ClaimStartDt']).dt.days

### **Adding `New Feature - 4` :: `Admitted_Duration`**

    - Admitted Duration = Discharge Date - Admission Date

In [None]:
train_iobp_df['AdmissionDt'] = pd.to_datetime(train_iobp_df['AdmissionDt'], format="%Y-%m-%d")
train_iobp_df['DischargeDt'] = pd.to_datetime(train_iobp_df['DischargeDt'], format="%Y-%m-%d")

train_iobp_df['Admitted_Duration'] = (train_iobp_df['DischargeDt'] - train_iobp_df['AdmissionDt']).dt.days

In [None]:
test_iobp_df['AdmissionDt'] = pd.to_datetime(test_iobp_df['AdmissionDt'], format="%Y-%m-%d")
test_iobp_df['DischargeDt'] = pd.to_datetime(test_iobp_df['DischargeDt'], format="%Y-%m-%d")

test_iobp_df['Admitted_Duration'] = (test_iobp_df['DischargeDt'] - test_iobp_df['AdmissionDt']).dt.days
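
A minimal sketch of the duration logic on made-up dates: subtracting two datetime columns gives a timedelta, `.dt.days` turns it into a number of days, and rows without admission dates (out-patient claims) come out as null, which can then be filled with 0.

```python
import pandas as pd

# Hypothetical rows: one in-patient claim and one out-patient claim (no dates).
df = pd.DataFrame({
    "AdmissionDt": ["2009-04-12", None],
    "DischargeDt": ["2009-04-18", None],
})

for col in ["AdmissionDt", "DischargeDt"]:
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d")

# Days between admission and discharge; null where the dates are missing.
df["Admitted_Duration"] = (df["DischargeDt"] - df["AdmissionDt"]).dt.days

# Out-patient claims were admitted for 0 days.
df["Admitted_Duration"] = df["Admitted_Duration"].fillna(0)

print(df)
```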

### **Adding `New Feature - 5` :: `Bene_Age`**

    - Bene Age = DOD - DOB (if DOD is Null then replace it with MAX date in DOD)

In [None]:
# Filling the Null values as MAX Date of Death in the Dataset
train_iobp_df['DOD'] = train_iobp_df['DOD'].fillna(value=train_iobp_df['DOD'].max())

In [None]:
# Filling the Null values with the MAX Date of Death from the TRAIN dataset (same reference date as above)
test_iobp_df['DOD'] = test_iobp_df['DOD'].fillna(value=train_iobp_df['DOD'].max())

In [None]:
train_iobp_df['Bene_Age'] = round(((train_iobp_df['DOD'] - train_iobp_df['DOB']).dt.days)/365,1)

In [None]:
test_iobp_df['Bene_Age'] = round(((test_iobp_df['DOD'] - test_iobp_df['DOB']).dt.days)/365,1)
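
The age computation can be sketched on a toy frame with made-up dates. A beneficiary without a date of death gets the latest recorded date of death as a reference date, and the age is the day difference divided by 365, which is an approximation because it ignores leap days.

```python
import pandas as pd

# Hypothetical beneficiaries: the second one has no recorded date of death.
df = pd.DataFrame({
    "DOB": ["1940-01-01", "1950-06-01"],
    "DOD": ["2009-01-01", None],
})
df["DOB"] = pd.to_datetime(df["DOB"], format="%Y-%m-%d")
df["DOD"] = pd.to_datetime(df["DOD"], format="%Y-%m-%d")

# Reference date for living beneficiaries: the latest date of death in the data.
reference_date = df["DOD"].max()
df["DOD_filled"] = df["DOD"].fillna(reference_date)

# Age in (approximate) years, rounded to one decimal like in the notebook.
df["Bene_Age"] = ((df["DOD_filled"] - df["DOB"]).dt.days / 365).round(1)

print(df[["DOB", "DOD_filled", "Bene_Age"]])
```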

### **Adding `New Feature - 6` :: `Att_Opr_Oth_Phy_Tot_Claims`**
    
   * It represents the total claims submitted by Attending, Operating and Other Physicians.
       
       * **`Reasoning`** :: The idea behind adding this feature is to see whether the total number of claims submitted by the physicians involved in a claim helps in identifying potential fraud.


   * **`Logic`** :: Att_Phy_tot_claims + Opr_Phy_tot_claims + Oth_Phy_tot_claims

- **`Att_Phy_tot_claims`** :: **Total Number of claims or cases seen by Attending Physician**

In [None]:
# Total unique number of Attending Physicians
print("Unique number of Attending Physicians present in the dataset are --> {}".format(train_iobp_df['AttendingPhysician'].nunique()))

In [None]:
# Total unique number of Attending Physicians
print("Unique number of Attending Physicians present in the dataset are --> {}".format(test_iobp_df['AttendingPhysician'].nunique()))

In [None]:
train_iobp_df['Att_Phy_tot_claims'] = train_iobp_df.groupby(['AttendingPhysician'])['ClaimID'].transform('count')
train_iobp_df['Att_Phy_tot_claims'].describe()

In [None]:
test_iobp_df['Att_Phy_tot_claims'] = test_iobp_df.groupby(['AttendingPhysician'])['ClaimID'].transform('count')
test_iobp_df['Att_Phy_tot_claims'].describe()

- **`Opr_Phy_tot_claims`** :: **Total Number of claims or cases seen by Operating Physician**

In [None]:
# Total unique number of Operating Physicians
print("Unique number of Operating Physicians present in the dataset are --> {}".format(train_iobp_df['OperatingPhysician'].nunique()))

In [None]:
# Total unique number of Operating Physicians
print("Unique number of Operating Physicians present in the dataset are --> {}".format(test_iobp_df['OperatingPhysician'].nunique()))

In [None]:
train_iobp_df['Opr_Phy_tot_claims'] = train_iobp_df.groupby(['OperatingPhysician'])['ClaimID'].transform('count')
train_iobp_df['Opr_Phy_tot_claims'].describe()

In [None]:
test_iobp_df['Opr_Phy_tot_claims'] = test_iobp_df.groupby(['OperatingPhysician'])['ClaimID'].transform('count')
test_iobp_df['Opr_Phy_tot_claims'].describe()

- **`Oth_Phy_tot_claims`** :: **Total Number of claims or cases seen by Other Physician**

In [None]:
# Total unique number of Other Physicians
print("Unique number of Other Physicians present in the dataset are --> {}".format(train_iobp_df['OtherPhysician'].nunique()))

In [None]:
# Total unique number of Other Physicians
print("Unique number of Other Physicians present in the dataset are --> {}".format(test_iobp_df['OtherPhysician'].nunique()))

In [None]:
train_iobp_df['Oth_Phy_tot_claims'] = train_iobp_df.groupby(['OtherPhysician'])['ClaimID'].transform('count')
train_iobp_df['Oth_Phy_tot_claims'].describe()

In [None]:
test_iobp_df['Oth_Phy_tot_claims'] = test_iobp_df.groupby(['OtherPhysician'])['ClaimID'].transform('count')
test_iobp_df['Oth_Phy_tot_claims'].describe()

In [None]:
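# Filling the nulls (claims without that kind of physician) with 0 before creating the combined feature
train_iobp_df['Att_Phy_tot_claims'] = train_iobp_df['Att_Phy_tot_claims'].fillna(value=0)
train_iobp_df['Opr_Phy_tot_claims'] = train_iobp_df['Opr_Phy_tot_claims'].fillna(value=0)
train_iobp_df['Oth_Phy_tot_claims'] = train_iobp_df['Oth_Phy_tot_claims'].fillna(value=0)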

In [None]:
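# Filling the nulls (claims without that kind of physician) with 0 before creating the combined feature
test_iobp_df['Att_Phy_tot_claims'] = test_iobp_df['Att_Phy_tot_claims'].fillna(value=0)
test_iobp_df['Opr_Phy_tot_claims'] = test_iobp_df['Opr_Phy_tot_claims'].fillna(value=0)
test_iobp_df['Oth_Phy_tot_claims'] = test_iobp_df['Oth_Phy_tot_claims'].fillna(value=0)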

In [None]:
train_iobp_df['Att_Opr_Oth_Phy_Tot_Claims'] = train_iobp_df['Att_Phy_tot_claims'] + train_iobp_df['Opr_Phy_tot_claims'] + train_iobp_df['Oth_Phy_tot_claims']

In [None]:
test_iobp_df['Att_Opr_Oth_Phy_Tot_Claims'] = test_iobp_df['Att_Phy_tot_claims'] + test_iobp_df['Opr_Phy_tot_claims'] + test_iobp_df['Oth_Phy_tot_claims']

In [None]:
train_iobp_df['Att_Opr_Oth_Phy_Tot_Claims'].describe()

In [None]:
test_iobp_df['Att_Opr_Oth_Phy_Tot_Claims'].describe()

In [None]:
train_iobp_df.drop(['Att_Phy_tot_claims', 'Opr_Phy_tot_claims', 'Oth_Phy_tot_claims'], axis=1, inplace=True)

In [None]:
test_iobp_df.drop(['Att_Phy_tot_claims', 'Opr_Phy_tot_claims', 'Oth_Phy_tot_claims'], axis=1, inplace=True)
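
A small sketch of why the null fill is needed for the per-physician claim counts. With the default `dropna=True`, `groupby` ignores rows whose key is missing, so `transform('count')` leaves those rows null (behaviour of recent pandas versions). The physician IDs below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical claims; the last one has no attending physician.
df = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3", "C4"],
    "AttendingPhysician": ["PHY1", "PHY1", "PHY2", np.nan],
})

# Claims per physician, broadcast back to every claim row.
df["Att_Phy_tot_claims"] = (
    df.groupby("AttendingPhysician")["ClaimID"].transform("count")
)

# Rows without a physician have no group, so they are filled with 0.
df["Att_Phy_tot_claims"] = df["Att_Phy_tot_claims"].fillna(0)

print(df)
```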

### **Adding `New Feature - 7` :: `Prv_Tot_Att_Opr_Oth_Phys`**
    
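   * It represents the total of all kinds of physician entries recorded on a provider's claims.
       * `transform('count')` counts the non-null entries (one per claim), so a physician who appears on several claims is counted several times.
       
       * **`Reasoning`** :: The idea behind adding this feature is to see whether a fraudulent provider interacts with a higher or lower number of physicians of each kind.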


   * **`Logic`** :: Prv_Tot_Att_Phy + Prv_Tot_Opr_Phy + Prv_Tot_Oth_Phy

In [None]:
train_iobp_df["Prv_Tot_Att_Phy"] = train_iobp_df.groupby(['Provider'])['AttendingPhysician'].transform('count')
train_iobp_df["Prv_Tot_Opr_Phy"] = train_iobp_df.groupby(['Provider'])['OperatingPhysician'].transform('count')
train_iobp_df["Prv_Tot_Oth_Phy"] = train_iobp_df.groupby(['Provider'])['OtherPhysician'].transform('count')

In [None]:
test_iobp_df["Prv_Tot_Att_Phy"] = test_iobp_df.groupby(['Provider'])['AttendingPhysician'].transform('count')
test_iobp_df["Prv_Tot_Opr_Phy"] = test_iobp_df.groupby(['Provider'])['OperatingPhysician'].transform('count')
test_iobp_df["Prv_Tot_Oth_Phy"] = test_iobp_df.groupby(['Provider'])['OtherPhysician'].transform('count')

In [None]:
# Nulls in the above features
train_iobp_df.isna().sum().tail(3)

In [None]:
# Nulls in the above features
test_iobp_df.isna().sum().tail(3)

In [None]:
train_iobp_df["Prv_Tot_Att_Phy"].describe()

In [None]:
test_iobp_df["Prv_Tot_Att_Phy"].describe()

* The average number of attending physicians for providers is 820.

In [None]:
train_iobp_df["Prv_Tot_Opr_Phy"].describe()

In [None]:
test_iobp_df["Prv_Tot_Opr_Phy"].describe()

* The average number of operating physicians for providers is 155.

In [None]:
train_iobp_df["Prv_Tot_Oth_Phy"].describe()

In [None]:
test_iobp_df["Prv_Tot_Oth_Phy"].describe()

* The average number of other physicians for providers is 306.

In [None]:
train_iobp_df['Prv_Tot_Att_Opr_Oth_Phys'] = train_iobp_df['Prv_Tot_Att_Phy'] + train_iobp_df['Prv_Tot_Opr_Phy'] + train_iobp_df['Prv_Tot_Oth_Phy']

In [None]:
test_iobp_df['Prv_Tot_Att_Opr_Oth_Phys'] = test_iobp_df['Prv_Tot_Att_Phy'] + test_iobp_df['Prv_Tot_Opr_Phy'] + test_iobp_df['Prv_Tot_Oth_Phy']

In [None]:
train_iobp_df["Prv_Tot_Att_Opr_Oth_Phys"].describe()

In [None]:
test_iobp_df["Prv_Tot_Att_Opr_Oth_Phys"].describe()

In [None]:
train_iobp_df.drop(['Prv_Tot_Att_Phy', 'Prv_Tot_Opr_Phy', 'Prv_Tot_Oth_Phy'], axis=1, inplace=True)

In [None]:
test_iobp_df.drop(['Prv_Tot_Att_Phy', 'Prv_Tot_Opr_Phy', 'Prv_Tot_Oth_Phy'], axis=1, inplace=True)
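
The two aggregations `count` and `nunique` answer different questions, which matters when interpreting the feature above. This sketch on made-up data contrasts them: `count` tallies the non-null physician entries per provider, while `nunique` would give the number of distinct physicians.

```python
import pandas as pd

# Hypothetical claims: provider P1 has three claims handled by two physicians.
df = pd.DataFrame({
    "Provider": ["P1", "P1", "P1", "P2"],
    "AttendingPhysician": ["A", "A", "B", None],
})

grouped = df.groupby("Provider")["AttendingPhysician"]

# Non-null physician entries per provider (what the notebook computes).
df["entries"] = grouped.transform("count")

# Distinct physicians per provider (an alternative, not used in the notebook).
df["distinct_physicians"] = grouped.transform("nunique")

print(df)
```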

### **Adding `New Feature - 8` :: `Total Unique Claim Admit Codes used by a PROVIDER`**
   
   * **`Reasoning`** :: The idea behind adding this feature is to see how many unique `Claim Admit Diagnosis Codes` a Provider has used. 
       * There may be a pattern: if a provider has used a large number of distinct Admit Diagnosis Codes, that might increase or decrease the chances of fraud.

In [None]:
train_iobp_df['PRV_Tot_Admit_DCodes'] = train_iobp_df.groupby(['Provider'])['ClmAdmitDiagnosisCode'].transform('nunique')

In [None]:
test_iobp_df['PRV_Tot_Admit_DCodes'] = test_iobp_df.groupby(['Provider'])['ClmAdmitDiagnosisCode'].transform('nunique')

In [None]:
train_iobp_df["PRV_Tot_Admit_DCodes"].describe()

In [None]:
test_iobp_df["PRV_Tot_Admit_DCodes"].describe()

### **Adding `New Feature - 9` :: `Total Unique Number of Diagnosis Group Codes used by a PROVIDER`**
   
   * **`Reasoning`** :: The idea behind adding this feature is to see how many unique `Diagnosis Group Codes` a Provider has used.
       * There may be a pattern: if a provider has used a large number of distinct Diagnosis Group Codes, that might increase or decrease the chances of fraud.

In [None]:
train_iobp_df['PRV_Tot_DGrpCodes'] = train_iobp_df.groupby(['Provider'])['DiagnosisGroupCode'].transform('nunique')

In [None]:
test_iobp_df['PRV_Tot_DGrpCodes'] = test_iobp_df.groupby(['Provider'])['DiagnosisGroupCode'].transform('nunique')

In [None]:
train_iobp_df["PRV_Tot_DGrpCodes"].describe()

In [None]:
test_iobp_df["PRV_Tot_DGrpCodes"].describe()

### **Adding `New Feature - 10` :: `Total unique Date of Birth years of beneficiaries provided by a Provider`**
   
   * **`Reasoning`** :: The idea behind adding this feature is that a very high variability in the birth years of a provider's patients might be one of the signs of medicare fraud.
       - Some private hospitals that treat poor patients have been reported to file false claims in their patients' names. For example, Nazia is 10 years old, but according to a claim filed by Chhattisgarh-based Shaheed Hospital with the Rashtriya Swasthya Bima Yojna (RSBY), she delivered a baby after a caesarean operation. Mukul (name changed) is only 7, but Agarwal Hospital, Raipur, made a claim for removing a cataract from his eyes.

Read more at:
https://economictimes.indiatimes.com/news/politics-and-nation/private-hospitals-perform-fake-surgeries-to-claim-thousands-in-insurance-cover/articleshow/16934229.cms?utm_source=contentofinterest&utm_medium=text&utm_campaign=cppst

In [None]:
train_iobp_df['DOB_Year'] = train_iobp_df['DOB'].dt.year

In [None]:
test_iobp_df['DOB_Year'] = test_iobp_df['DOB'].dt.year

In [None]:
train_iobp_df['PRV_Tot_Unq_DOB_Years'] = train_iobp_df.groupby(['Provider'])['DOB_Year'].transform('nunique')

In [None]:
test_iobp_df['PRV_Tot_Unq_DOB_Years'] = test_iobp_df.groupby(['Provider'])['DOB_Year'].transform('nunique')

In [None]:
train_iobp_df['PRV_Tot_Unq_DOB_Years'].describe()

In [None]:
test_iobp_df['PRV_Tot_Unq_DOB_Years'].describe()

In [None]:
train_iobp_df.drop(['DOB_Year'], axis=1, inplace=True)

In [None]:
test_iobp_df.drop(['DOB_Year'], axis=1, inplace=True)

### **Adding `New Feature - 11` :: `Sum of patients age treated by a Provider`**
   
   * **`Reasoning`** :: The idea behind adding this feature is that there might be a pattern: if the sum of the ages of the patients treated by a provider is very high or very low, it might be associated with fraud.

In [None]:
train_iobp_df['PRV_Bene_Age_Sum'] = train_iobp_df.groupby(['Provider'])['Bene_Age'].transform('sum')

In [None]:
test_iobp_df['PRV_Bene_Age_Sum'] = test_iobp_df.groupby(['Provider'])['Bene_Age'].transform('sum')

In [None]:
train_iobp_df['PRV_Bene_Age_Sum'].describe()

In [None]:
test_iobp_df['PRV_Bene_Age_Sum'].describe()

### **Adding `New Feature - 12` :: `Sum of Insc Claim Re-Imb Amount for a Provider`**
   
   * **`Reasoning`** :: The idea behind adding this feature is that there might be a pattern: if the sum of the claim re-imb amounts for a provider is very high or very low, it might be associated with fraud.

In [None]:
train_iobp_df['PRV_Insc_Clm_ReImb_Amt'] = train_iobp_df.groupby(['Provider'])['InscClaimAmtReimbursed'].transform('sum')

In [None]:
test_iobp_df['PRV_Insc_Clm_ReImb_Amt'] = test_iobp_df.groupby(['Provider'])['InscClaimAmtReimbursed'].transform('sum')

In [None]:
train_iobp_df['PRV_Insc_Clm_ReImb_Amt'].describe()

In [None]:
test_iobp_df['PRV_Insc_Clm_ReImb_Amt'].describe()

### **Adding `New Feature - 13` :: `Total number of RKD Patients seen by a Provider`**
   
   * **`Reasoning`** :: The idea behind adding this feature is that there might be a pattern: if the total number of RKD Patients seen by a Provider is very high or very low, it might be associated with fraud.

In [None]:
train_iobp_df['RenalDiseaseIndicator'] = train_iobp_df['RenalDiseaseIndicator'].apply(lambda val: 1 if val == "Y" else 0)

In [None]:
test_iobp_df['RenalDiseaseIndicator'] = test_iobp_df['RenalDiseaseIndicator'].apply(lambda val: 1 if val == "Y" else 0)

In [None]:
train_iobp_df['PRV_Tot_RKD_Patients'] = train_iobp_df.groupby(['Provider'])['RenalDiseaseIndicator'].transform('sum')

In [None]:
test_iobp_df['PRV_Tot_RKD_Patients'] = test_iobp_df.groupby(['Provider'])['RenalDiseaseIndicator'].transform('sum')

In [None]:
train_iobp_df['PRV_Tot_RKD_Patients'].describe()

In [None]:
test_iobp_df['PRV_Tot_RKD_Patients'].describe()
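
Because the data is at claim level, summing the renal indicator per provider counts claims of RKD patients, so a patient with several claims is counted several times. The sketch below (made-up rows) shows the notebook's sum next to a hypothetical alternative that counts distinct RKD beneficiaries.

```python
import pandas as pd

# Hypothetical claims: beneficiary B1 (an RKD patient) has two claims with P1.
df = pd.DataFrame({
    "Provider": ["P1", "P1", "P1", "P2"],
    "BeneID": ["B1", "B1", "B2", "B3"],
    "RenalDiseaseIndicator": ["Y", "Y", "0", "Y"],
})

# Encode "Y" as 1 and everything else as 0.
df["RenalDiseaseIndicator"] = (df["RenalDiseaseIndicator"] == "Y").astype(int)

# Claim-level sum per provider (what the notebook computes).
df["PRV_Tot_RKD_Patients"] = (
    df.groupby("Provider")["RenalDiseaseIndicator"].transform("sum")
)

# Alternative: number of distinct RKD beneficiaries per provider.
unique_rkd = (
    df[df["RenalDiseaseIndicator"] == 1].groupby("Provider")["BeneID"].nunique()
)

print(df)
print(unique_rkd)
```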

In [None]:
# Dropping these 2 columns as 99% of their values are the same
train_iobp_df.drop(['NoOfMonths_PartACov', 'NoOfMonths_PartBCov'], axis=1, inplace=True)

In [None]:
# Dropping these 2 columns as 99% of their values are the same
test_iobp_df.drop(['NoOfMonths_PartACov', 'NoOfMonths_PartBCov'], axis=1, inplace=True)

In [None]:
# Filling null values in Admitted_Duration with 0 (as it will represent the patients were admitted for 0 days)
train_iobp_df['Admitted_Duration'] = train_iobp_df['Admitted_Duration'].fillna(value=0)

In [None]:
# Filling null values in Admitted_Duration with 0 (as it will represent the patients were admitted for 0 days)
test_iobp_df['Admitted_Duration'] = test_iobp_df['Admitted_Duration'].fillna(value=0)

In [None]:
train_iobp_df.shape

In [None]:
test_iobp_df.shape

### **Adding `Aggregated Features` :: For every possible level**
    - Provider
    - Beneficiary
    - Attending Physician
    - Operating Physician
    - Other Physician, etc.
   
   
   * **`Reasoning`** :: The idea behind adding the aggregated features at different levels is that fraud can be done by an individual or group of individuals or entities involved in the claim process.

In [None]:
# PRV Aggregate features
train_iobp_df["PRV_CoPayment"] = train_iobp_df.groupby('Provider')['DeductibleAmtPaid'].transform('sum')
train_iobp_df["PRV_IP_Annual_ReImb_Amt"] = train_iobp_df.groupby('Provider')['IPAnnualReimbursementAmt'].transform('sum')
train_iobp_df["PRV_IP_Annual_Ded_Amt"] = train_iobp_df.groupby('Provider')['IPAnnualDeductibleAmt'].transform('sum')
train_iobp_df["PRV_OP_Annual_ReImb_Amt"] = train_iobp_df.groupby('Provider')['OPAnnualReimbursementAmt'].transform('sum')
train_iobp_df["PRV_OP_Annual_Ded_Amt"] = train_iobp_df.groupby('Provider')['OPAnnualDeductibleAmt'].transform('sum')
train_iobp_df["PRV_Admit_Duration"] = train_iobp_df.groupby('Provider')['Admitted_Duration'].transform('sum')
train_iobp_df["PRV_Claim_Duration"] = train_iobp_df.groupby('Provider')['Claim_Duration'].transform('sum')

In [None]:
# PRV Aggregate features
test_iobp_df["PRV_CoPayment"] = test_iobp_df.groupby('Provider')['DeductibleAmtPaid'].transform('sum')
test_iobp_df["PRV_IP_Annual_ReImb_Amt"] = test_iobp_df.groupby('Provider')['IPAnnualReimbursementAmt'].transform('sum')
test_iobp_df["PRV_IP_Annual_Ded_Amt"] = test_iobp_df.groupby('Provider')['IPAnnualDeductibleAmt'].transform('sum')
test_iobp_df["PRV_OP_Annual_ReImb_Amt"] = test_iobp_df.groupby('Provider')['OPAnnualReimbursementAmt'].transform('sum')
test_iobp_df["PRV_OP_Annual_Ded_Amt"] = test_iobp_df.groupby('Provider')['OPAnnualDeductibleAmt'].transform('sum')
test_iobp_df["PRV_Admit_Duration"] = test_iobp_df.groupby('Provider')['Admitted_Duration'].transform('sum')
test_iobp_df["PRV_Claim_Duration"] = test_iobp_df.groupby('Provider')['Claim_Duration'].transform('sum')

In [None]:
def create_agg_feats(grp_col, feat_name, operation='sum'):
    """
    Description :: This function adds the aggregated features to the global `train_iobp_df` and `test_iobp_df` datasets (in place) for every level like:
        - Beneficiary
        - Attending Physician
        - Operating Physician
        - Other Physician, etc.
        
    Input Parameters :: It accepts below inputs:
        - grp_col : `str`
            - It represents the feature or level at which you want to perform the aggregation.
        
        - feat_name : `str`
            - It represents the feature whose aggregated aspect you want to capture.
        
        - operation : `str`
            - It represents the aggregation operation you want to perform.(By default it is SUM)
    """
    # Mapping of new feature-name suffix -> source column to aggregate
    agg_cols = {
        "_Insc_ReImb_Amt": 'InscClaimAmtReimbursed',
        "_CoPayment": 'DeductibleAmtPaid',
        "_IP_Annual_ReImb_Amt": 'IPAnnualReimbursementAmt',
        "_IP_Annual_Ded_Amt": 'IPAnnualDeductibleAmt',
        "_OP_Annual_ReImb_Amt": 'OPAnnualReimbursementAmt',
        "_OP_Annual_Ded_Amt": 'OPAnnualDeductibleAmt',
        "_Admit_Duration": 'Admitted_Duration',
        "_Claim_Duration": 'Claim_Duration',
    }

    # Adding every aggregated feature to both the train and the test dataset
    for suffix, src_col in agg_cols.items():
        new_feat = feat_name + suffix
        train_iobp_df[new_feat] = train_iobp_df.groupby(grp_col)[src_col].transform(operation)
        test_iobp_df[new_feat] = test_iobp_df.groupby(grp_col)[src_col].transform(operation)

In [None]:
# BENE, PHYs, Diagnosis Admit and Group Codes columns
create_agg_feats(grp_col='BeneID', feat_name="BENE")
create_agg_feats(grp_col='AttendingPhysician', feat_name="ATT_PHY")
create_agg_feats(grp_col='OperatingPhysician', feat_name="OPT_PHY")
create_agg_feats(grp_col='OtherPhysician', feat_name="OTH_PHY")
create_agg_feats(grp_col='ClmAdmitDiagnosisCode', feat_name="Claim_Admit_Diag_Code")
create_agg_feats(grp_col='DiagnosisGroupCode', feat_name="Diag_GCode")

In [None]:
train_iobp_df.shape

In [None]:
test_iobp_df.shape

### **Adding `Aggregated Features` :: Based on various combinations of different levels in order to introduce their interactions in the dataset.**
    - PROVIDER <--> BENE <--> PHYSICIANS
    - PROVIDER <--> BENE <--> ATTENDING PHYSICIAN <--> PROCEDURE CODES
    - PROVIDER <--> BENE <--> OPERATING PHYSICIAN <--> PROCEDURE CODES
    - PROVIDER <--> BENE <--> OTHER PHYSICIAN <--> PROCEDURE CODES
    - PROVIDER <--> BENE <--> ATTENDING PHYSICIAN <--> DIAGNOSIS CODES
    - PROVIDER <--> BENE <--> OPERATING PHYSICIAN <--> DIAGNOSIS CODES
    - PROVIDER <--> BENE <--> OTHER PHYSICIAN <--> DIAGNOSIS CODES
    - PROVIDER <--> BENE <--> DIAGNOSIS CODES <--> PROCEDURE CODES, etc.

   * **`Reasoning`** :: Several parties or entities may work together to commit Medicare fraud. Aggregating over combinations of these levels captures the interactions among them, which should help in classifying the fraudsters.
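
As a minimal sketch of this idea, the toy frame below (hypothetical IDs, not taken from the dataset) counts claims at one, two and three levels using the same `groupby(...).transform('count')` pattern as the cells that follow. The column names mirror the ones used in this notebook; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy claims: 5 claims spread over 2 providers
claims = pd.DataFrame({
    "Provider": ["P1", "P1", "P1", "P2", "P2"],
    "BeneID": ["B1", "B1", "B2", "B1", "B3"],
    "AttendingPhysician": ["D1", "D1", "D1", "D2", "D2"],
    "ClaimID": ["C1", "C2", "C3", "C4", "C5"],
})

# Each count is broadcast back to every claim row of its group
claims["ClmCount_Provider"] = (
    claims.groupby(["Provider"])["ClaimID"].transform("count")
)
claims["ClmCount_Provider_BeneID"] = (
    claims.groupby(["Provider", "BeneID"])["ClaimID"].transform("count")
)
claims["ClmCount_Provider_BeneID_AttendingPhysician"] = (
    claims.groupby(["Provider", "BeneID", "AttendingPhysician"])["ClaimID"]
    .transform("count")
)

print(claims)
```

A provider whose claims concentrate on a few beneficiary and physician combinations gets large values in the deeper-level counts, which is the kind of interaction these features are meant to expose.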

In [None]:
# PROVIDER <--> other features :: To get claim counts
train_iobp_df["ClmCount_Provider"]=train_iobp_df.groupby(['Provider'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID"]=train_iobp_df.groupby(['Provider','BeneID'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_AttendingPhysician"]=train_iobp_df.groupby(['Provider','AttendingPhysician'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_OtherPhysician"]=train_iobp_df.groupby(['Provider','OtherPhysician'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_OperatingPhysician"]=train_iobp_df.groupby(['Provider','OperatingPhysician'])['ClaimID'].transform('count')

# PROVIDER <--> BENE <--> PHYSICIANS :: To get claim counts
train_iobp_df["ClmCount_Provider_BeneID_AttendingPhysician"]=train_iobp_df.groupby(['Provider','BeneID','AttendingPhysician'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OtherPhysician"]=train_iobp_df.groupby(['Provider','BeneID','OtherPhysician'])['ClaimID'].transform('count')
train_iobp_df["ClmCount_Provider_BeneID_OperatingPhysician"]=train_iobp_df.groupby(['Provider','BeneID','OperatingPhysician'])['ClaimID'].transform('count')

In [None]:
train_iobp_df.shape

In [None]:
# PROVIDER <--> other features :: To get claim counts
test_iobp_df["ClmCount_Provider"]=test_iobp_df.groupby(['Provider'])['ClaimID'].transform('count')
test_iobp_df["ClmCount_Provider_BeneID"]=test_iobp_df.groupby(['Provider','BeneID'])['ClaimID'].transform('count')
test_iobp_df["ClmCount_Provider_AttendingPhysician"]=test_iobp_df.groupby(['Provider','AttendingPhysician'])['ClaimID'].transform('count')
test_iobp_df["ClmCount_Provider_OtherPhysician"]=test_iobp_df.groupby(['Provider','OtherPhysician'])['ClaimID'].transform('count')
test_iobp_df["ClmCount_Provider_OperatingPhysician"]=test_iobp_df.groupby(['Provider','OperatingPhysician'])['ClaimID'].transform('count')

# PROVIDER <--> BENE <--> PHYSICIANS :: To get claim counts
test_iobp_df["ClmCount_Provider_BeneID_AttendingPhysician"]=test_iobp_df.groupby(['Provider','BeneID','AttendingPhysician'])['ClaimID'].transform('count')
test_iobp_df["ClmCount_Provider_BeneID_OtherPhysician"]=test_iobp_df.groupby(['Provider','BeneID','OtherPhysician'])['ClaimID'].transform('count')
test_iobp_df["ClmCount_Provider_BeneID_OperatingPhysician"]=test_iobp_df.groupby(['Provider','BeneID','OperatingPhysician'])['ClaimID'].transform('count')

In [None]:
test_iobp_df.shape

In [None]:
# Removing unwanted columns
remove_unwanted_columns=['BeneID', 'ClaimID', 'ClaimStartDt','ClaimEndDt','AttendingPhysician','OperatingPhysician', 'OtherPhysician',
                      'AdmissionDt', 'ClmAdmitDiagnosisCode', 'DischargeDt', 'DiagnosisGroupCode',
                      'ClmDiagnosisCode_1', 'ClmDiagnosisCode_2', 'ClmDiagnosisCode_3', 'ClmDiagnosisCode_4', 'ClmDiagnosisCode_5', 
                      'ClmDiagnosisCode_6', 'ClmDiagnosisCode_7', 'ClmDiagnosisCode_8', 'ClmDiagnosisCode_9', 'ClmDiagnosisCode_10',
                      'ClmProcedureCode_1', 'ClmProcedureCode_2', 'ClmProcedureCode_3', 'DOB', 'DOD', 'State', 'County']

train_iobp_df.drop(columns=remove_unwanted_columns, axis=1, inplace=True)
test_iobp_df.drop(columns=remove_unwanted_columns, axis=1, inplace=True)

In [None]:
train_iobp_df.shape

In [None]:
test_iobp_df.shape

In [None]:
# Filling Nulls in Deductible Amt Paid by Patient
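# Assign the result back instead of using a chained in-place fill, which newer pandas versions warn about
train_iobp_df['DeductibleAmtPaid'] = train_iobp_df['DeductibleAmtPaid'].fillna(value=0)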

In [None]:
# Filling Nulls in Deductible Amt Paid by Patient
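# Assign the result back instead of using a chained in-place fill, which newer pandas versions warn about
test_iobp_df['DeductibleAmtPaid'] = test_iobp_df['DeductibleAmtPaid'].fillna(value=0)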

In [None]:
# Binary encoding the categorical features --> 0 means No and 1 means Yes
train_iobp_df['Gender'] = train_iobp_df['Gender'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['PotentialFraud'] = train_iobp_df['PotentialFraud'].apply(lambda val: 0 if val == "No" else 1)
train_iobp_df['Is_Alive?'] = train_iobp_df['Is_Alive?'].apply(lambda val: 0 if val == "No" else 1)

train_iobp_df['ChronicCond_Alzheimer'] = train_iobp_df['ChronicCond_Alzheimer'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_Heartfailure'] = train_iobp_df['ChronicCond_Heartfailure'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_KidneyDisease'] = train_iobp_df['ChronicCond_KidneyDisease'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_Cancer'] = train_iobp_df['ChronicCond_Cancer'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_ObstrPulmonary'] = train_iobp_df['ChronicCond_ObstrPulmonary'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_Depression'] = train_iobp_df['ChronicCond_Depression'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_Diabetes'] = train_iobp_df['ChronicCond_Diabetes'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_IschemicHeart'] = train_iobp_df['ChronicCond_IschemicHeart'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_Osteoporasis'] = train_iobp_df['ChronicCond_Osteoporasis'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_rheumatoidarthritis'] = train_iobp_df['ChronicCond_rheumatoidarthritis'].apply(lambda val: 0 if val == 2 else val)
train_iobp_df['ChronicCond_stroke'] = train_iobp_df['ChronicCond_stroke'].apply(lambda val: 0 if val == 2 else val)

In [None]:
# Binary encoding the categorical features --> 0 means No and 1 means Yes
test_iobp_df['Gender'] = test_iobp_df['Gender'].apply(lambda val: 0 if val == 2 else val)
test_iobp_df['Is_Alive?'] = test_iobp_df['Is_Alive?'].apply(lambda val: 0 if val == "No" else 1)

test_iobp_df['ChronicCond_Alzheimer'] = test_iobp_df['ChronicCond_Alzheimer'].apply(lambda val: 0 if val == 2 else val)
test_iobp_df['ChronicCond_Heartfailure'] = test_iobp_df['ChronicCond_Heartfailure'].apply(lambda val: 0 if val == 2 else val)
test_iobp_df['ChronicCond_KidneyDisease'] = test_iobp_df['ChronicCond_KidneyDisease'].apply(lambda val: 0 if val == 2 else val)
test_iobp_df['ChronicCond_Cancer'] = test_iobp_df['ChronicCond_Cancer'].apply(lambda val: 0 if val == 2 else val)
test_iobp_df['ChronicCond_ObstrPulmonary'] = test_iobp_df['ChronicCond_ObstrPulmonary'].apply(lambda val: 0 if val == 2 else val)
test_iobp_df['ChronicCond_Depression'] = test_iobp_df['ChronicCond_Depression'].apply(lambda val: 0 if val == 2 else val)
test_iobp_df['ChronicCond_Diabetes'] = test_iobp_df['ChronicCond_Diabetes'].apply(lambda val: 0 if val == 2 else val)
test_iobp_df['ChronicCond_IschemicHeart'] = test_iobp_df['ChronicCond_IschemicHeart'].apply(lambda val: 0 if val == 2 else val)
test_iobp_df['ChronicCond_Osteoporasis'] = test_iobp_df['ChronicCond_Osteoporasis'].apply(lambda val: 0 if val == 2 else val)
test_iobp_df['ChronicCond_rheumatoidarthritis'] = test_iobp_df['ChronicCond_rheumatoidarthritis'].apply(lambda val: 0 if val == 2 else val)
test_iobp_df['ChronicCond_stroke'] = test_iobp_df['ChronicCond_stroke'].apply(lambda val: 0 if val == 2 else val)

In [None]:
# Encoding the Categorical features
train_iobp_df = pd.get_dummies(train_iobp_df,columns=['Gender', 'Race', 'Admitted?', 'Is_Alive?'], drop_first=True)

In [None]:
# Encoding the Categorical features
test_iobp_df = pd.get_dummies(test_iobp_df,columns=['Gender', 'Race', 'Admitted?', 'Is_Alive?'], drop_first=True)

In [None]:
train_iobp_df.shape

In [None]:
test_iobp_df.shape

In [None]:
pd.set_option('display.max_rows',120)

In [None]:
# Checking Nulls in the features
pd.DataFrame(train_iobp_df.isna().sum())

In [None]:
# Checking Nulls in the features
pd.DataFrame(test_iobp_df.isna().sum())

In [None]:
# Filling Nulls in the aggregated features
train_iobp_df.fillna(value=0, inplace=True)

In [None]:
# Filling Nulls in the aggregated features
test_iobp_df.fillna(value=0, inplace=True)

In [None]:
# Checking Nulls in the features
pd.DataFrame(train_iobp_df.isna().sum())

In [None]:
# Checking Nulls in the features
pd.DataFrame(test_iobp_df.isna().sum())

In [None]:
# Checking the Datatypes of the features
train_iobp_df.dtypes

In [None]:
# Checking the Datatypes of the features
test_iobp_df.dtypes

## **Entire Data `Aggregation` :: At provider level**

   * **`Reasoning`** :: The main objective is to predict `Medicare Provider Fraud`. Thus, we group the entire dataset at the PROVIDER level and take the SUM of all the columns to create an n-dimensional representation of each provider.
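
The roll-up can be sketched on a toy frame (hypothetical values, with column names borrowed from this notebook). It collapses claim-level rows into one row per provider, the same way the cells below do.

```python
import pandas as pd

# Hypothetical claim-level rows for two providers
claims = pd.DataFrame({
    "Provider": ["P1", "P1", "P1", "P2", "P2"],
    "PotentialFraud": [1, 1, 1, 0, 0],
    "InscClaimAmtReimbursed": [100, 200, 300, 50, 150],
})

# A row-level aggregate that was broadcast to every claim of the provider
claims["ClmCount_Provider"] = (
    claims.groupby("Provider")["InscClaimAmtReimbursed"].transform("count")
)

# One row per provider: every remaining column is summed
provider_level = claims.groupby(
    ["Provider", "PotentialFraud"], as_index=False
).agg("sum")

print(provider_level)
```

In this toy frame, a feature that was already broadcast to every claim row, such as `ClmCount_Provider`, is summed once per claim. A provider with `n` claims therefore ends up with `n * n` in that column, which is worth keeping in mind when reading the provider-level features.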

In [None]:
train_iobp_df = train_iobp_df.groupby(['Provider','PotentialFraud'],as_index=False).agg('sum')

In [None]:
test_iobp_df = test_iobp_df.groupby(['Provider'],as_index=False).agg('sum')

In [None]:
train_iobp_df.shape

In [None]:
test_iobp_df.shape

## **`Data Segregation` :: Creating separate sets of independent features and target column.**

   * **`Reasoning`** :: These sets will be used for training the ML Models.

In [None]:
X = train_iobp_df.drop(axis=1, columns=['Provider','PotentialFraud'])
y = train_iobp_df['PotentialFraud']

In [None]:
X_unseen = test_iobp_df.drop(axis=1, columns=['Provider'])
y_unseen_prvs = test_iobp_df['Provider']

In [None]:
X.shape, type(X), y.shape, type(y)

In [None]:
X_unseen.shape, type(X_unseen), y_unseen_prvs.shape, type(y_unseen_prvs)

In [None]:
X.head()

In [None]:
X_unseen.head()

In [None]:
y.head()

In [None]:
y_unseen_prvs

## **`Train Test Split` :: Creating TRAIN and VALIDATION sets.**

   * **`Reasoning`** :: These sets will be used for measuring the performance of the ML models.

In [None]:
from sklearn.model_selection import train_test_split as tts

In [None]:
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.20, stratify=y, random_state=39)

In [None]:
# Checking shape of each set
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# Checking count of tgt labels in y_train
y_train.value_counts()

In [None]:
# Checking count of tgt labels in y_test
y_test.value_counts()

## **`Standardizing the TRAIN & TEST sets` :: Bringing every feature into the same scale.**
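
For intuition, the scaling applied here can be sketched in plain NumPy. With its default settings, `RobustScaler` centers each feature on its median and divides by its interquartile range (25th to 75th percentile), with both statistics learned from the TRAIN set only. The helper names `fit_robust` and `apply_robust` below are hypothetical and the data is a toy example.

```python
import numpy as np

def fit_robust(train):
    """Learn the per-feature median and IQR from the training data only."""
    median = np.median(train, axis=0)
    q1, q3 = np.percentile(train, [25, 75], axis=0)
    iqr = q3 - q1
    iqr[iqr == 0] = 1.0  # guard in this sketch against constant features
    return median, iqr

def apply_robust(data, median, iqr):
    """Scale any set with the statistics learned on the training data."""
    return (data - median) / iqr

# Toy single-feature data with one extreme outlier in the train set
train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
test = np.array([[3.0], [5.0]])

median, iqr = fit_robust(train)
train_scaled = apply_robust(train, median, iqr)
test_scaled = apply_robust(test, median, iqr)

print(median, iqr)
print(test_scaled.ravel())
```

Because the median and IQR barely move when a single extreme value is present, the outlier does not squash the rest of the feature into a narrow range, which suits heavily skewed claim amounts.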

In [None]:
from sklearn.preprocessing import RobustScaler

In [None]:
# Standardize the data (train and test)
robust_scaler = RobustScaler()
robust_scaler.fit(X_train)
X_train_std = robust_scaler.transform(X_train)
X_test_std = robust_scaler.transform(X_test)
X_unseen_std = robust_scaler.transform(X_unseen)

## **`Baseline Model Training`**

### **`Using Class Weighting Scheme`**
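
As a rough illustration of what `class_weight='balanced'` does in the models below, scikit-learn weights each class by `n_samples / (n_classes * count_of_class)`. The helper `balanced_class_weights` is a hypothetical name used only for this sketch, and it assumes integer labels `0..k-1` that all occur at least once.

```python
import numpy as np

def balanced_class_weights(y):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    y = np.asarray(y)
    counts = np.bincount(y)  # assumes labels are 0..k-1 and all present
    n_classes = len(counts)
    return {cls: len(y) / (n_classes * cnt) for cls, cnt in enumerate(counts)}

# Toy target: 9 non-fraud providers and 1 fraud provider
y_toy = [0] * 9 + [1] * 1
weights = balanced_class_weights(y_toy)

print(weights)
```

The rarer class receives the larger weight, so misclassifying a fraudulent provider costs more during training than misclassifying a non-fraudulent one.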

#### **`1. Logistic Regression`**

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, roc_curve, auc
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.calibration import CalibratedClassifierCV

In [None]:
# Training the model with all features and hyper-parameterized values
log_reg_1 = LogisticRegression(C=0.03, penalty='l1',
                               fit_intercept=True, solver='liblinear', tol=0.0001, max_iter=500, 
                               class_weight='balanced',
                               verbose=0, 
                               intercept_scaling=1.0,
                               multi_class='auto',
                               random_state=49)

log_reg_1.fit(X_train_std, y_train)

In [None]:
# Validate Logistic Regression model
test_auc, train_f1_score, test_f1_score, best_t = validate_model(log_reg_1, X_train_std, X_test_std, y_train, y_test)

print("\n")
print("### Best Threshold = {:.4f}".format(best_t))
print("### Model AUC is : {:.4f}".format(test_auc))
print("### Model Train F1 Score is : {:.4f}".format(train_f1_score))
print("### Model Test F1 Score is : {:.4f}".format(test_f1_score))

In [None]:
feats_imps = pd.DataFrame({'Features': X_train.columns, 'Importance_Model_1': log_reg_1.coef_[0]})
feats_imps = feats_imps[feats_imps['Importance_Model_1'] != 0]
feats_imps.reset_index(drop=True, inplace=True)
feats_imps.head()

In [None]:
top_15_pos_feats = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Features'].iloc[0:15]
top_15_pos_feats_scores = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Importance_Model_1'].iloc[0:15]

In [None]:
top_15_neg_feats = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=True)['Features'].iloc[0:15]
top_15_neg_feats_scores = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=True)['Importance_Model_1'].iloc[0:15]

In [None]:
with plt.style.context('seaborn-poster'):
    sns.barplot(y=top_15_pos_feats, x=top_15_pos_feats_scores, orient='h', palette='coolwarm')
    plt.xlabel("\nFeature Importance", fontdict=label_font_dict)
    plt.ylabel("Features\n", fontdict=label_font_dict)
    plt.title("Top 15 Important Positive Features\n", fontdict=title_font_dict)

In [None]:
with plt.style.context('seaborn-poster'):
    sns.barplot(y=top_15_neg_feats, x=top_15_neg_feats_scores, orient='h', palette='coolwarm')
    plt.xlabel("\nFeature Importance", fontdict=label_font_dict)
    plt.ylabel("Features\n", fontdict=label_font_dict)
    plt.title("Top 15 Important Negative Features\n", fontdict=title_font_dict)

#### **`2. Decision Tree`**

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Training the model with all features and hyper-parameterized values
dec_tree_2 = DecisionTreeClassifier(criterion='gini',
                                   max_depth= 6,
                                   max_features='sqrt',  # 'auto' meant 'sqrt' here and was removed in scikit-learn 1.3
                                   min_samples_leaf=100,
                                   min_samples_split=50,
                                   class_weight='balanced',
                                   random_state=49,
                                   splitter='best',
                                   min_weight_fraction_leaf=0.0,
                                   max_leaf_nodes=None,
                                   min_impurity_decrease=0.0,
                                   ccp_alpha=0.0,)

dec_tree_2.fit(X_train_std, y_train)

In [None]:
# Validate model
test_auc, train_f1_score, test_f1_score, best_t = validate_model(dec_tree_2, X_train_std, X_test_std, y_train, y_test)

print("\n")
print("### Best Threshold = {:.4f}".format(best_t))
print("### Model AUC is : {:.4f}".format(test_auc))
print("### Model Train F1 Score is : {:.4f}".format(train_f1_score))
print("### Model Test F1 Score is : {:.4f}".format(test_f1_score))

In [None]:
feats_imps = pd.DataFrame({'Features': X_train.columns, 'Importance_Model_1': dec_tree_2.feature_importances_})
feats_imps = feats_imps[feats_imps['Importance_Model_1'] != 0]
feats_imps.reset_index(drop=True, inplace=True)
feats_imps.head()

In [None]:
top_15_pos_feats = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Features'].iloc[0:15]
top_15_pos_feats_scores = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Importance_Model_1'].iloc[0:15]

In [None]:
with plt.style.context('seaborn-poster'):
    sns.barplot(y=top_15_pos_feats, x=top_15_pos_feats_scores, orient='h', palette='coolwarm')
    plt.xlabel("\nFeature Importance", fontdict=label_font_dict)
    plt.ylabel("Features\n", fontdict=label_font_dict)
    plt.title("Top 15 Important Positive Features\n", fontdict=title_font_dict)

#### **`3. Random Forest Classifier`**

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Training the model with all features and hyper-parameterized values
rfc_3 = RandomForestClassifier(n_estimators=30,criterion='gini',
                                   max_depth= 4,
                                   max_features='sqrt',  # 'auto' meant 'sqrt' here and was removed in scikit-learn 1.3
                                   min_samples_leaf=50,
                                   min_samples_split=50,
                                   class_weight='balanced',
                                   random_state=49,
                                   min_weight_fraction_leaf=0.0,
                                   max_leaf_nodes=None,
                                   min_impurity_decrease=0.0,
                                   ccp_alpha=0.0,)

rfc_3.fit(X_train_std, y_train)

In [None]:
# Validate model
test_auc, train_f1_score, test_f1_score, best_t = validate_model(rfc_3, X_train_std, X_test_std, y_train, y_test)

print("\n")
print("### Best Threshold = {:.4f}".format(best_t))
print("### Model AUC is : {:.4f}".format(test_auc))
print("### Model Train F1 Score is : {:.4f}".format(train_f1_score))
print("### Model Test F1 Score is : {:.4f}".format(test_f1_score))

In [None]:
feats_imps = pd.DataFrame({'Features': X_train.columns, 'Importance_Model_1': rfc_3.feature_importances_})
feats_imps = feats_imps[feats_imps['Importance_Model_1'] != 0]
feats_imps.reset_index(drop=True, inplace=True)
feats_imps.head()

In [None]:
top_20_pos_feats = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Features'].iloc[0:20]
top_20_pos_feats_scores = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Importance_Model_1'].iloc[0:20]

In [None]:
with plt.style.context('seaborn-poster'):
    sns.barplot(y=top_20_pos_feats, x=top_20_pos_feats_scores, orient='h', palette='coolwarm')
    plt.xlabel("\nFeature Importance", fontdict=label_font_dict)
    plt.ylabel("Features\n", fontdict=label_font_dict)
    plt.title("Top 20 Important Positive Features\n", fontdict=title_font_dict)

### **`Using Minority Synthetic Oversampling`**
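
The ADASYN step later in this subsection uses `sampling_strategy=0.35`. In imbalanced-learn, a float value is the desired ratio of minority to majority samples after resampling. The arithmetic can be sketched as follows; the class counts are hypothetical, `synthetic_samples_needed` is an illustrative helper, and ADASYN itself only approximately hits the target because of how it allocates samples per neighbourhood.

```python
def synthetic_samples_needed(n_majority, n_minority, sampling_strategy):
    """Approximate number of synthetic minority rows needed for the target ratio."""
    target_minority = round(sampling_strategy * n_majority)
    return max(target_minority - n_minority, 0)

# Hypothetical class counts in a TRAIN set
n_majority, n_minority = 3000, 300

extra = synthetic_samples_needed(n_majority, n_minority, 0.35)
new_share = (n_minority + extra) / (n_majority + n_minority + extra)

print("Synthetic rows to add (approx.):", extra)
print("Fraud share after oversampling: {:.2f}%".format(new_share * 100))
```

A ratio of 0.35 therefore does not balance the classes; it only raises the minority share to roughly a quarter of the training rows.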

#### **`Train Test Split` :: Creating TRAIN and VALIDATION sets.**

   * **`Reasoning`** :: These sets will be used for measuring the performance of the ML models.

In [None]:
from sklearn.model_selection import train_test_split as tts

In [None]:
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.25, stratify=y, random_state=39)

In [None]:
# Checking shape of each set
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# Checking count of tgt labels in y_train
y_train.value_counts()

In [None]:
# Checking count of tgt labels in y_test
y_test.value_counts()

#### **`Standardizing the TRAIN & TEST sets` :: Bringing every feature into the same scale.**

In [None]:
from sklearn.preprocessing import RobustScaler

In [None]:
# Standardize the data (train and test)
robust_scaler = RobustScaler()
robust_scaler.fit(X_train)
X_train_std = robust_scaler.transform(X_train)
X_test_std = robust_scaler.transform(X_test)

In [None]:
from collections import Counter

In [None]:
# BEFORE Oversampling :: Checking the percentage share of fraud and non-fraud records in the TRAIN set
counter = Counter(y_train)
counter

In [None]:
fraud_percentage = (counter[1]*100 / (counter[0]+counter[1]))
non_fraud_percentage = (counter[0]*100 / (counter[0]+counter[1]))
print("Fraud Percentage = {:.2f}% and Non-Fraud Percentage = {:.2f}%".format(fraud_percentage, non_fraud_percentage))

In [None]:
# Performing minority oversampling
from imblearn.over_sampling import ADASYN

In [None]:
oversample = ADASYN(sampling_strategy=0.35, n_neighbors=12)
X_train_ovsamp, y_train_ovsamp = oversample.fit_resample(X_train_std, y_train)

X_train_ovsamp.shape, y_train_ovsamp.shape

In [None]:
counter = Counter(y_train_ovsamp)
counter

In [None]:
fraud_percentage = (counter[1]*100 / (counter[0]+counter[1]))
non_fraud_percentage = (counter[0]*100 / (counter[0]+counter[1]))
print("Fraud Percentage = {:.2f}% and Non-Fraud Percentage = {:.2f}%".format(fraud_percentage, non_fraud_percentage))

#### **`4. Logistic Regression`**

In [None]:
# Training the model with all features and hyper-parameterized values
log_reg_4 = LogisticRegression(C=0.03, penalty='l1',
                               fit_intercept=True, 
                               solver='liblinear', 
                               tol=0.0001, 
                               max_iter=500, 
                               verbose=0, 
                               intercept_scaling=1.0,
                               multi_class='auto',
                               random_state=49)

log_reg_4.fit(X_train_ovsamp, y_train_ovsamp)

In [None]:
# Validate Logistic Regression model
test_auc, train_f1_score, test_f1_score, best_t = validate_model(log_reg_4, X_train_ovsamp, X_test_std, y_train_ovsamp, y_test)

print("\n")
print("### Best Threshold = {:.4f}".format(best_t))
print("### Model AUC is : {:.4f}".format(test_auc))
print("### Model Train F1 Score is : {:.4f}".format(train_f1_score))
print("### Model Test F1 Score is : {:.4f}".format(test_f1_score))

In [None]:
feats_imps = pd.DataFrame({'Features': X_train.columns, 'Importance_Model_1': log_reg_4.coef_[0]})
feats_imps = feats_imps[feats_imps['Importance_Model_1'] != 0]
feats_imps.reset_index(drop=True, inplace=True)
feats_imps.head()

In [None]:
top_15_pos_feats = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Features'].iloc[0:15]
top_15_pos_feats_scores = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Importance_Model_1'].iloc[0:15]

In [None]:
top_15_neg_feats = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=True)['Features'].iloc[0:15]
top_15_neg_feats_scores = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=True)['Importance_Model_1'].iloc[0:15]

In [None]:
with plt.style.context('seaborn-poster'):
    sns.barplot(y=top_15_pos_feats, x=top_15_pos_feats_scores, orient='h', palette='coolwarm')
    plt.xlabel("\nFeature Importance", fontdict=label_font_dict)
    plt.ylabel("Features\n", fontdict=label_font_dict)
    plt.title("Top 15 Important Positive Features\n", fontdict=title_font_dict)

In [None]:
with plt.style.context('seaborn-poster'):
    sns.barplot(y=top_15_neg_feats, x=top_15_neg_feats_scores, orient='h', palette='coolwarm')
    plt.xlabel("\nFeature Importance", fontdict=label_font_dict)
    plt.ylabel("Features\n", fontdict=label_font_dict)
    plt.title("Top 15 Important Negative Features\n", fontdict=title_font_dict)

#### **`5. Decision Tree`**

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Training the model with all features and hyper-parameterized values
dec_tree_5 = DecisionTreeClassifier(criterion='gini',
                                   max_depth= 6,
                                   max_features='log2',
                                   min_samples_leaf=150,
                                   min_samples_split=150,
                                   random_state=49,
                                   splitter='best',
                                   min_weight_fraction_leaf=0.0,
                                   max_leaf_nodes=None,
                                   min_impurity_decrease=0.0,
                                   ccp_alpha=0.0,)

dec_tree_5.fit(X_train_ovsamp, y_train_ovsamp)

In [None]:
# Validate model
test_auc, train_f1_score, test_f1_score, best_t = validate_model(dec_tree_5, X_train_ovsamp, X_test_std, y_train_ovsamp, y_test)

print("\n")
print("### Best Threshold = {:.4f}".format(best_t))
print("### Model AUC is : {:.4f}".format(test_auc))
print("### Model Train F1 Score is : {:.4f}".format(train_f1_score))
print("### Model Test F1 Score is : {:.4f}".format(test_f1_score))

In [None]:
feats_imps = pd.DataFrame({'Features': X_train.columns, 'Importance_Model_1': dec_tree_5.feature_importances_})
feats_imps = feats_imps[feats_imps['Importance_Model_1'] != 0]
feats_imps.reset_index(drop=True, inplace=True)
feats_imps.head()

In [None]:
top_15_pos_feats = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Features'].iloc[0:15]
top_15_pos_feats_scores = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Importance_Model_1'].iloc[0:15]

In [None]:
with plt.style.context('seaborn-poster'):
    sns.barplot(y=top_15_pos_feats, x=top_15_pos_feats_scores, orient='h', palette='coolwarm')
    plt.xlabel("\nFeature Importance", fontdict=label_font_dict)
    plt.ylabel("Features\n", fontdict=label_font_dict)
    plt.title("Top 15 Important Positive Features\n", fontdict=title_font_dict)

#### **`6. Random Forest Classifier`**

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Training the model with all features and hyper-parameterized values
rfc_6 = RandomForestClassifier(n_estimators=30,criterion='gini',
                                   max_depth= 4,
                                   max_features='sqrt',  # 'auto' meant 'sqrt' here and was removed in scikit-learn 1.3
                                   min_samples_leaf=100,
                                   min_samples_split=50,
                                   random_state=49,
                                   min_weight_fraction_leaf=0.0,
                                   max_leaf_nodes=None,
                                   min_impurity_decrease=0.0,
                                   ccp_alpha=0.0,)

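# Fit on the oversampled TRAIN set, consistent with models 4 and 5 and with the validation call below
rfc_6.fit(X_train_ovsamp, y_train_ovsamp)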

In [None]:
# Validate model
test_auc, train_f1_score, test_f1_score, best_t = validate_model(rfc_6, X_train_ovsamp, X_test_std, y_train_ovsamp, y_test)

print("\n")
print("### Best Threshold = {:.4f}".format(best_t))
print("### Model AUC is : {:.4f}".format(test_auc))
print("### Model Train F1 Score is : {:.4f}".format(train_f1_score))
print("### Model Test F1 Score is : {:.4f}".format(test_f1_score))

In [None]:
feats_imps = pd.DataFrame({'Features': X_train.columns, 'Importance_Model_1': rfc_6.feature_importances_})
feats_imps = feats_imps[feats_imps['Importance_Model_1'] != 0]
feats_imps.reset_index(drop=True, inplace=True)
feats_imps.head()

In [None]:
top_20_pos_feats = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Features'].iloc[0:20]
top_20_pos_feats_scores = feats_imps.sort_values(by='Importance_Model_1',axis=0,ascending=False)['Importance_Model_1'].iloc[0:20]

In [None]:
with plt.style.context('seaborn-poster'):
    sns.barplot(y=top_20_pos_feats, x=top_20_pos_feats_scores, orient='h', palette='coolwarm')
    plt.xlabel("\nFeature Importance", fontdict=label_font_dict)
    plt.ylabel("Features\n", fontdict=label_font_dict)
    plt.title("Top 20 Important Positive Features\n", fontdict=title_font_dict)

# **`Models - SET 2 - RESULTS`**

In [None]:
from IPython.display import Image
Image("../input/medicare-prv-fraud-files/Models_Set_2_Results.png")

## **`Models - SET 2 - OBSERVATIONS`**

- **Adding the similarity-score features between the embeddings of the code pairs listed below did not really help improve model performance:**
    - CAD and Dx Codes
    - Dx and PROC Codes
    - CAD and PROC Codes
    
- **Synthetic oversampling of the minority class provides no gain in model performance; on the contrary, there is a noticeable drop in performance.**

# **`Referred Links`**

- https://www.icd10data.com/ICD10CM/Codes

- https://www.icd10data.com/ICD10CM/DRG

- https://www.plasticsurgery.org/Documents/Health-Policy/Coding-Payment/ICD-10/icd-10-medical-diagnosis-codes.pdf

- https://ftp.cdc.gov/pub/health_statistics/nchs/publications/ICD10CM/2019/icd10cm_tabular_2019.pdf


- Very important link (files were also downloaded from here) :: https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes

- https://www.cms.gov/Medicare/Coding/OutpatientCodeEdit/Downloads/ICD-10-IOCE-Code-Lists.pdf

- https://medicaid.utah.gov/Documents/pdfs/ClaimDenialCodes.pdf

- DRG :: https://www.findacode.com/code-set.php?set=DRG
