
## **Healthcare Fraud Detection using Machine Learning**

#### **Python Code with Sample Dataset**

The sample dataset contains 20 claims, each described by the following columns:

* **ClaimID:** A unique identifier for each healthcare claim.
* **PatientID:** Identifier for the patient associated with the claim.
* **ProviderID:** Identifier for the healthcare provider submitting the claim.
* **ProcedureCode:** Code representing the medical procedure performed.
* **DiagnosisCode:** Code representing the medical diagnosis associated with the claim.
* **ClaimAmount:** The amount claimed for the healthcare service.
* **AdmissionType:** Type of admission (e.g., Inpatient, Outpatient).
* **AdmissionSource:** Source of admission (e.g., Referral, Emergency Room).
* **IsFraud:** Binary indicator (0 or 1) for whether the claim is fraudulent (1) or not (0).
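
To make the schema above concrete, the sketch below rebuilds by hand the five rows that `df.head()` displays later in this notebook. The variable name `sample` is an assumption made for illustration; the notebook itself loads the full 20-row file from CSV.

```python
import pandas as pd

# The five rows shown by df.head() in this notebook, rebuilt by hand.
# `sample` is an illustrative name, not part of the original notebook.
sample = pd.DataFrame({
    "ClaimID": [1, 2, 3, 4, 5],
    "PatientID": [101, 102, 103, 104, 105],
    "ProviderID": ["P1", "P2", "P3", "P4", "P1"],
    "ProcedureCode": ["PC1", "PC2", "PC3", "PC1", "PC2"],
    "DiagnosisCode": ["D1", "D2", "D3", "D4", "D2"],
    "ClaimAmount": [1000, 1500, 800, 2000, 1200],
    "AdmissionType": ["Inpatient", "Outpatient", "Inpatient", "Inpatient", "Outpatient"],
    "AdmissionSource": ["Referral", "Direct Admit", "Emergency Room",
                        "Referral", "Physician Referral"],
    "IsFraud": [0, 0, 0, 1, 0],
})

# Basic schema checks: claim IDs are unique and the target is binary.
assert sample["ClaimID"].is_unique
assert set(sample["IsFraud"]) <= {0, 1}
print(sample.dtypes)
```

Checks like these are cheap to run right after loading and catch duplicated claims or unexpected target values before any modelling starts.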

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder


In [None]:
# Load the dataset
df = pd.read_csv('/healthcare.csv')

# Display the first few rows of the dataset
df.head()

| | ClaimID | PatientID | ProviderID | ProcedureCode | DiagnosisCode | ClaimAmount | AdmissionType | AdmissionSource | IsFraud |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 101 | P1 | PC1 | D1 | 1000 | Inpatient | Referral | 0 |
| 1 | 2 | 102 | P2 | PC2 | D2 | 1500 | Outpatient | Direct Admit | 0 |
| 2 | 3 | 103 | P3 | PC3 | D3 | 800 | Inpatient | Emergency Room | 0 |
| 3 | 4 | 104 | P4 | PC1 | D4 | 2000 | Inpatient | Referral | 1 |
| 4 | 5 | 105 | P1 | PC2 | D2 | 1200 | Outpatient | Physician Referral | 0 |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   ClaimID          20 non-null     int64 
 1   PatientID        20 non-null     int64 
 2   ProviderID       20 non-null     object
 3   ProcedureCode    20 non-null     object
 4   DiagnosisCode    20 non-null     object
 5   ClaimAmount      20 non-null     int64 
 6   AdmissionType    20 non-null     object
 7   AdmissionSource  20 non-null     object
 8   IsFraud          20 non-null     int64 
dtypes: int64(4), object(5)
memory usage: 1.5+ KB


In [None]:
# Feature Engineering
# Encode categorical variables as integer codes using Label Encoding.
# A fresh LabelEncoder is fitted per column so no fitted state is shared between columns.
for col in ['AdmissionType', 'AdmissionSource', 'ProviderID']:
    df[col] = LabelEncoder().fit_transform(df[col])

# Drop ProcedureCode and DiagnosisCode: they are not encoded and are not used as features
df = df.drop(['ProcedureCode', 'DiagnosisCode'], axis=1)
df.head()

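| | ClaimID | PatientID | ProviderID | ClaimAmount | AdmissionType | AdmissionSource | IsFraud |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 101 | 0 | 1000 | 0 | 3 | 0 |
| 1 | 2 | 102 | 1 | 1500 | 1 | 0 | 0 |
| 2 | 3 | 103 | 2 | 800 | 0 | 1 | 0 |
| 3 | 4 | 104 | 3 | 2000 | 0 | 3 | 1 |
| 4 | 5 | 105 | 0 | 1200 | 1 | 2 | 0 |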
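
The integer codes in the encoded table come from `LabelEncoder` sorting the distinct labels and numbering them from zero. The sketch below reproduces this for the `AdmissionSource` values of the first five rows only. The names `admission_source` and `mapping` are assumptions made for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# AdmissionSource values from the first five rows of the sample data.
admission_source = ["Referral", "Direct Admit", "Emergency Room",
                    "Referral", "Physician Referral"]

encoder = LabelEncoder()
codes = encoder.fit_transform(admission_source)

# classes_ holds the distinct labels in sorted order; a label's position is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())  # [3, 0, 1, 3, 2]
```

Because the codes follow the sorted order of the labels, they carry no domain meaning. Keeping the fitted encoder, or the `mapping` dictionary, lets the codes be translated back to labels later with `encoder.inverse_transform`.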


In [None]:
# Split the data into features (X) and target variable (y)
X = df.drop('IsFraud', axis=1)
y = df['IsFraud']

# Split the data into training and testing sets (80/20; with 20 rows the test set holds 4 claims)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest classifier
clf = RandomForestClassifier(random_state=42)

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 1.0

Confusion Matrix:
 [[2 0]
 [0 2]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         2
           1       1.00      1.00      1.00         2

    accuracy                           1.00         4
   macro avg       1.00      1.00      1.00         4
weighted avg       1.00      1.00      1.00         4
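
The figures in the classification report follow directly from the confusion matrix printed above. As a minimal sketch, the arithmetic for the fraud class (label 1) can be reproduced with NumPy alone. The matrix is typed in by hand from the output, and the variable names are assumptions made for illustration.

```python
import numpy as np

# Confusion matrix as printed above: rows are true labels, columns are predictions.
cm = np.array([[2, 0],
               [0, 2]])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # share of predicted fraud that is actual fraud
recall = tp / (tp + fn)          # share of actual fraud that was flagged
accuracy = (tp + tn) / cm.sum()  # share of all test claims classified correctly
print(precision, recall, accuracy)
```

With only four test claims, each metric moves in steps of 0.25 or 0.5, so a single misclassified claim would change these values substantially.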

