<a href="https://colab.research.google.com/github/nallagondu/datatrained_inter_public/blob/main/Insurance_Claim_Fraud_Detection_3phase_Evaluation3_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Project Description**

Insurance fraud is a major problem for the industry, and fraudulent claims are difficult to identify. Machine learning is well placed to help the auto insurance industry with this problem.

In this project, you are provided with a dataset containing the details of each insurance policy and its customer, along with the details of the accident on which the claim is based.
Using this auto insurance data, you will build a predictive model that classifies an insurance claim as fraudulent or not.


**Independent Variables**

1.	months_as_customer: Number of months the customer has been with the insurer

2.	age: Age of the customer

3.	policy_number: Unique ID given to the customer, used to track the subscription status and other customer details

4.	policy_bind_date: Date on which the policy was bound, i.e. when the insurer accepted the proposal for insurance

5.	policy_state: State in which the policy was issued

6.	policy_csl: Combined Single Limit (CSL) of the policy

7.	policy_deductable: Amount of money the customer is responsible for paying toward an insured loss

8.	policy_annual_premium: Amount of regular premium payable by the policyholder in a policy year

9.	umbrella_limit: Limit of the extra insurance that provides protection beyond the existing limits and coverages of other policies

10.	insured_zip: ZIP code associated with the insured customer

11.	insured_sex: Sex of the customer (male or female)

12.	insured_education_level: Level of education of the customer

13.	insured_occupation: Occupation of the customer

14.	insured_hobbies: Activity the customer does regularly in his/her leisure time for pleasure

15.	insured_relationship: Relationship status of the customer (e.g. single, married, in a de facto relationship, or in a civil partnership)

16.	capital-gains: Profit accrued due to the insurance premium

17.	capital-loss: Losses incurred due to insurance claims

18.	incident_date: Date of the incident for which the claim was made

19.	incident_type: Type of incident (claim/vehicle damage) reported by the customer

20.	collision_type: Area of the vehicle that was damaged

21.	incident_severity: Extent/level of the damage

22.	authorities_contacted: Government agency that was contacted after the damage

23.	incident_state: State in which the accident happened

24.	incident_city: City in which the accident happened

25.	incident_location: Location at which the accident happened

26.	incident_hour_of_the_day: Hour of the day at which the accident took place

27.	number_of_vehicles_involved: Number of vehicles involved in the accident

28.	property_damage: Whether property was damaged or not

29.	bodily_injuries: Number of bodily injuries sustained

30.	witnesses: Number of witnesses to the accident

31.	police_report_available: Whether a police report on the damage is available or not

32.	total_claim_amount: Total amount claimed

33.	injury_claim: Amount claimed for physical injuries sustained

34.	property_claim: Amount claimed for property damage during the incident

35.	vehicle_claim: Amount claimed for vehicle damage during the incident

36.	auto_make: Make of the vehicle

37.	auto_model: Model of the vehicle

38.	auto_year: Year in which the vehicle was manufactured

39.	fraud_reported: Whether the claim was reported as fraudulent (this is the target variable, not an independent variable)






**Dataset Link-**


•	https://raw.githubusercontent.com/FlipRoboTechnologies/ML_-Datasets/main/Insurance%20Claim%20Fraud%20Detection/Automobile_insurance_fraud.csv


As a first step, we start with **Data Preparation**:

- Load the dataset from the URL above
- Handle missing values
- Encode categorical variables
- Split the dataset into training and testing sets
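
The preparation steps listed above can be sketched on a tiny in-memory frame. This is only an illustration under stated assumptions, not the notebook's actual pipeline: the `toy` frame and all of its values are invented, and only a handful of columns are shown.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real CSV (all values are invented).
toy = pd.DataFrame({
    "authorities_contacted": ["Police", None, "Fire", "Police", "Police", "Fire"],
    "insured_sex": ["MALE", "FEMALE", "MALE", "FEMALE", "MALE", "FEMALE"],
    "total_claim_amount": [5000, 7200, 1300, 6400, 900, 8100],
    "fraud_reported": ["Y", "N", "N", "Y", "N", "N"],
})

# Handle missing values: fill the gap with the most frequent category.
mode_value = toy["authorities_contacted"].mode()[0]
toy["authorities_contacted"] = toy["authorities_contacted"].fillna(mode_value)

# Encode categorical variables: map the binary target to 0/1,
# then one-hot encode the remaining text columns.
toy["fraud_reported"] = toy["fraud_reported"].map({"N": 0, "Y": 1})
toy = pd.get_dummies(
    toy, columns=["authorities_contacted", "insured_sex"], drop_first=True
)

# Split into training and testing sets (2 of the 6 rows held out).
X = toy.drop(columns="fraud_reported")
y = toy["fraud_reported"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, random_state=42
)
print(X.columns.tolist())
print(len(X_train), len(X_test))
```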

In [None]:
#To load the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

dataset_url = 'https://raw.githubusercontent.com/FlipRoboTechnologies/ML_-Datasets/main/Insurance%20Claim%20Fraud%20Detection/Automobile_insurance_fraud.csv'
df_ICFD = pd.read_csv(dataset_url)
df_ICFD.head()


In [None]:
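# The CSV has no header row, so the column names are assigned manually.
# Note: read_csv above treats the first record as a header; pass header=None there to keep that record as data.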

headers_col = [
    'months_as_customer', 'age', 'policy_number', 'policy_bind_date', 'policy_state','policy_csl', 'policy_deductable', 'policy_annual_premium', 'umbrella_limit', 'insured_zip',
    'insured_sex', 'insured_education_level', 'insured_occupation', 'insured_hobbies', 'insured_relationship','capital_gains', 'capital_loss', 'incident_date', 'incident_type', 'collision_type',
    'incident_severity', 'authorities_contacted', 'incident_state', 'incident_city', 'incident_location','incident_hour_of_the_day', 'number_of_vehicles_involved', 'property_damage', 'bodily_injuries', 'witnesses',
    'police_report_available', 'total_claim_amount', 'injury_claim', 'property_claim', 'vehicle_claim','auto_make', 'auto_model', 'auto_year', 'fraud_reported'
]
df_ICFD.columns = headers_col
df_ICFD.head()

In [None]:
df_ICFD.shape

In [None]:
df_ICFD.info()

In [None]:
#Find any missing values
df_ICFD.isnull().sum()

In [None]:
df_ICFD['authorities_contacted'].value_counts()

In [None]:
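# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the frame.
df_ICFD['authorities_contacted'] = df_ICFD['authorities_contacted'].fillna(df_ICFD['authorities_contacted'].mode()[0])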

The missing values in '**authorities_contacted**' were filled with the column's mode (its most frequent value).

In [None]:
df_ICFD.duplicated().sum()

In [None]:
df_ICFD.nunique()

In [None]:
#Categorical Columns Unique values encoding

df_ICFD['policy_state'].value_counts()



In [None]:
df_ICFD['authorities_contacted'].value_counts()

In [None]:
#Find any missing values
df_ICFD.isnull().sum()

In [None]:
# Convert the date columns 'policy_bind_date' and 'incident_date' to datetime
df_ICFD['policy_bind_date'] = pd.to_datetime(df_ICFD['policy_bind_date'], format='%d-%m-%Y')
df_ICFD['incident_date'] = pd.to_datetime(df_ICFD['incident_date'], format='%d-%m-%Y')

# Extract year, month, and day as new features

df_ICFD['policy_bind_year'] = df_ICFD['policy_bind_date'].dt.year
df_ICFD['policy_bind_month'] = df_ICFD['policy_bind_date'].dt.month
df_ICFD['policy_bind_day'] = df_ICFD['policy_bind_date'].dt.day
df_ICFD['incident_year'] = df_ICFD['incident_date'].dt.year
df_ICFD['incident_month'] = df_ICFD['incident_date'].dt.month
df_ICFD['incident_day'] = df_ICFD['incident_date'].dt.day

df_ICFD.head()

In [None]:
df_ICFD.info()

In [None]:
# Drop policy_bind_date and incident_date
df_ICFD.drop(['policy_bind_date', 'incident_date'], axis=1, inplace=True)
df_ICFD.head()

In [None]:
print(df_ICFD.columns ,'\n')

df_objectes_list =df_ICFD.dtypes[df_ICFD.dtypes == 'object']
df_objectes_list

In [None]:
# Print the value counts of every object-typed column
for col in df_objectes_list.index:
  print(df_ICFD[col].value_counts(), '\n')


In [None]:
"""#policy_state               category
policy_csl                 category
insured_sex                category
insured_education_level    object
insured_occupation         object
insured_hobbies            object
insured_relationship       object
incident_type              object
collision_type             object
incident_severity          object
authorities_contacted      object
incident_state             object
incident_city              object
incident_location          object
property_damage            category
bodily_injuries            object
witnesses                  object
police_report_available    category
auto_year                  object
auto_make                  object
auto_model                 object
fraud_reported                category"""


In [None]:
import sklearn.preprocessing as preprocessing
from sklearn.preprocessing import LabelEncoder
##Binary and Small numbers of categories

le = LabelEncoder()

label_col = ['policy_state', 'policy_csl', 'insured_sex', 'property_damage', 'police_report_available', 'fraud_reported']
for col in label_col:
    df_ICFD[col] = le.fit_transform(df_ICFD[col])

df_ICFD.head()

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns

num_df = df_ICFD.select_dtypes(include=['int64', 'float64'])

fige, ax = plt.subplots(figsize=(25, 20))
cmap = sns.diverging_palette(220, 10, as_cmap=True)

sns.heatmap(num_df.corr(), annot=True,vmax = .3, cmap=cmap, ax=ax)

plt.title('Correlation Heatmap')
plt.show()

The heatmap shows a high correlation between age and months_as_customer (0.92).

The claim-amount columns are also strongly positively correlated with each other:

- total_claim_amount vs injury_claim (0.81)
- total_claim_amount vs property_claim (0.81)
- total_claim_amount vs vehicle_claim (0.98)
- injury_claim vs vehicle_claim (0.72)
- property_claim vs vehicle_claim (0.73)

The weakest pair among them is property_claim vs injury_claim (0.56).
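
Reading strongly correlated pairs off a large heatmap is error-prone, so one option is to list them programmatically. The sketch below is a minimal illustration: the helper `high_corr_pairs`, its `threshold` default, and the synthetic `toy` data are all assumptions made for the example, not part of the notebook.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.7):
    """Return (col_a, col_b, r) for every column pair with |r| >= threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) fail this comparison and are dropped here.
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, round(float(r), 2)) for (a, b), r in pairs.items()]

# Hypothetical toy data: vehicle_claim is built to track total_claim_amount.
rng = np.random.default_rng(0)
total = rng.uniform(1000, 10000, size=200)
toy = pd.DataFrame({
    "total_claim_amount": total,
    "vehicle_claim": 0.7 * total + rng.normal(0, 200, size=200),
    "witnesses": rng.integers(0, 4, size=200),
})

for col_a, col_b, r in high_corr_pairs(toy):
    print(f"{col_a} vs {col_b}: {r}")
```

On the real data the same helper could be called on `num_df` to reproduce the list above.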

   

In [None]:
#Categorical columns with many Unique values
#policy_number, insured_zip , incident_location ,auto_make,auto_model

#Unique values : 999,994,999,14,39

def frequency_encoding(df_ICFD,column):
  freq= df_ICFD[column].value_counts()
  df_ICFD[column + '_freq']=df_ICFD[column].map(freq)
  df_ICFD.drop(column,axis=1,inplace=True)

frequency_col = ['policy_number','insured_zip', 'incident_location', 'auto_make', 'auto_model']
for col in frequency_col:
  frequency_encoding(df_ICFD,col)

df_ICFD.head()
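
To show what frequency encoding produces, here is a small sketch on invented values. The helper name `frequency_encode` is hypothetical; unlike the function above it returns a new frame instead of modifying its argument, which is a design choice for the example only.

```python
import pandas as pd

def frequency_encode(frame, column):
    """Return a copy where `column` is replaced by `<column>_freq` (occurrence counts)."""
    counts = frame[column].value_counts()
    out = frame.copy()
    out[column + "_freq"] = out[column].map(counts)
    return out.drop(columns=column)

# Hypothetical values, invented for illustration.
toy = pd.DataFrame({"auto_make": ["Saab", "Audi", "Saab", "Ford", "Saab", "Audi"]})
encoded = frequency_encode(toy, "auto_make")
print(encoded["auto_make_freq"].tolist())  # [3, 2, 3, 1, 3, 2]
```

One consequence worth keeping in mind: a column in which every value is distinct is mapped to a constant 1 by this scheme, so its encoded version carries no information for a model.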





In [None]:
#One-hot Encoding or label encoding
one_hot_ecode_col  = ['insured_education_level','insured_occupation','insured_hobbies','insured_relationship','incident_type', 'collision_type', 'incident_severity', 'authorities_contacted', 'incident_state', 'incident_city']
df_ICFD = pd.get_dummies(df_ICFD, columns=one_hot_ecode_col, drop_first=True)
df_ICFD.head()

In [None]:
df_ICFD.info()

In [None]:
df_ICFD.dtypes

In [None]:
df_ICFD.describe()

In [None]:
df_ICFD.head()

In [None]:
bool_val = df_ICFD.select_dtypes(include='bool').columns
df_ICFD[bool_val] = df_ICFD[bool_val].astype(int)
df_ICFD.head()

In [None]:
print(df_ICFD.isnull().sum())

In [None]:
df_ICFD.describe()

In [None]:
object_cols = df_ICFD.select_dtypes(include=['object']).columns
print(object_cols)

In [None]:
df_ICFD['fraud_reported'].value_counts()


In [None]:
#Skewness
skewness = df_ICFD.skew()
skewness

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns

fige, ax = plt.subplots(figsize=(25, 20))
cmap = sns.diverging_palette(220, 10, as_cmap=True)

sns.heatmap(df_ICFD.corr(), annot=True,vmax = .3, cmap=cmap, ax=ax)

plt.title('Correlation Heatmap')
plt.show()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

for feature in df_ICFD.columns:
    plt.figure(figsize=(8, 6))
    sns.histplot(df_ICFD[feature], kde=True)
    plt.title(f'Distribution of {feature}')

    plt.show()

In [None]:
#for Box plot

for feature in df_ICFD.columns:
    plt.figure(figsize=(8, 6))
    sns.boxplot(x=df_ICFD[feature])
    plt.title(f'Box Plot of {feature}')

    plt.show()

In [None]:
f,ax = plt.subplots(figsize=(10, 8))

sns.countplot(x='fraud_reported', data=df_ICFD)
plt.show()

In [None]:
# The hobbies are spread over many one-hot columns, so melt them back into a single 'insured_hobbies' column to plot fraud counts per hobby

df_melted = pd.melt(df_ICFD, id_vars=['fraud_reported'], value_vars=[col for col in df_ICFD.columns if 'insured_hobbies_' in col ],var_name = 'insured_hobbies', value_name='is_hobbies')
df_melted = df_melted[df_melted['is_hobbies'] == 1]
df_melted.head()

f,ax = plt.subplots(figsize=(10, 8))
sns.countplot(x='fraud_reported', hue = 'insured_hobbies', data=df_melted)
plt.show()


In [None]:
#Second time correlation test

corr = df_ICFD.corr()
corr


In [None]:
!pip install lightgbm==3.3.2


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier
import xgboost as xgb
import lightgbm as lgb

X = df_ICFD.drop('fraud_reported', axis=1)
y = df_ICFD['fraud_reported']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
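
If the two classes of `fraud_reported` are imbalanced, a plain random split can leave the test set with a different fraud rate than the training set. A possible variation, sketched here on an invented 25%-positive target, is to pass `stratify=y` so both parts keep the same class ratio. The arrays below are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 25 positives out of 100 (invented).
y_demo = np.array([1] * 25 + [0] * 75)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both parts keep the original 25% positive rate.
print(y_tr.mean(), y_te.mean())  # 0.25 0.25
```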

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

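def lgb_f1_score(y_hat, data):
    # Custom LightGBM eval function; `data` is the lgb.Dataset being evaluated.
    # Returns (metric name, value, is_higher_better).
    y_true = data.get_label()
    y_hat = np.where(y_hat < 0.5, 0, 1)  # Convert probabilities to 0/1
    return 'f1', f1_score(y_true, y_hat), True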
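
To make the custom metric concrete, the following worked example applies the same 0.5 threshold to a few invented probabilities and checks the F1 value by hand against `f1_score`. The arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels and predicted probabilities (invented).
y_true = np.array([1, 0, 1, 1, 0, 0])
y_prob = np.array([0.9, 0.4, 0.3, 0.7, 0.6, 0.1])

# Same thresholding rule as in lgb_f1_score.
y_pred = np.where(y_prob < 0.5, 0, 1)

# Count the confusion-matrix cells that F1 depends on.
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())

# F1 = 2*TP / (2*TP + FP + FN)
manual_f1 = 2 * tp / (2 * tp + fp + fn)
print(tp, fp, fn)  # 2 1 1
print(manual_f1, f1_score(y_true, y_pred))
```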

In [None]:
def r_lgb(X_train, X_test, y_train, y_test, test_data):
  # LightGBM has no built-in 'f1' metric, so built-in metrics are switched off
  # ('None') and F1 is supplied through feval below.
  # Each regularisation/bagging setting is given once: LightGBM treats
  # reg_alpha, reg_lambda and subsample_freq as aliases of
  # lambda_l1, lambda_l2 and bagging_freq.
  params = {
      'objective': 'binary',
      'metric': 'None',
      'boosting': 'gbdt',
      'num_leaves': 3,
      'learning_rate': 0.05,
      'feature_fraction': 0.9,
      #'bagging_fraction': 0.8,
      'bagging_freq': 5,
      'verbose': 0,
      'n_jobs': -1,
      'random_state': 42,
      'lambda_l1': 0.1,
      'lambda_l2': 0.1,
      'min_data_in_leaf': 10,
      'max_depth': -1,
      'min_child_weight': 0.001,
      'min_split_gain': 0.0222415,
      'subsample': 1.0,
      'force_row_wise': True
  }

  # LightGBM does not accept spaces in feature names
  X_train.columns = [col.replace(' ','_') for col in X_train.columns]
  X_test.columns = [col.replace(' ','_') for col in X_test.columns]

  # Create LightGBM datasets
  lgtrain = lgb.Dataset(X_train, label=y_train)
  lgval = lgb.Dataset(X_test, label=y_test)
  evals_result = {}
  # num_boost_round is set here rather than via 'num_iterations' in params,
  # which would override it.
  model = lgb.train(params, lgtrain, num_boost_round=200,
                      valid_sets=[lgtrain, lgval],
                      early_stopping_rounds=100,
                      verbose_eval=100,
                      evals_result=evals_result, feval=lgb_f1_score)
  pred_test_y = model.predict(test_data, num_iteration=model.best_iteration)
  return pred_test_y, model, evals_result

In [None]:
from sklearn.metrics import f1_score
pred_test, model,evals_result = r_lgb(X_train, X_test, y_train, y_test, X_test)

#Convert probabilities to 0/1 predictions
pred_test_binary = np.where(pred_test < 0.5, 0, 1)

# Calculate F1 score
f1 = f1_score(y_test, pred_test_binary)

# Print the F1 score
print("F1 Score:", f1)
print("LightGBM Model Performance")
print(f"Accuracy Score: {accuracy_score(y_test, pred_test_binary)}")


In [None]:
roc_auc_score(y_test,pred_test)

#roc_auc_score(y_test,pred_test_binary)
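
The commented-out line above would score hard 0/1 labels instead of probabilities. The sketch below, on invented numbers, suggests why the probability form is usually preferred for ROC AUC: thresholding collapses the ranking information that the metric measures. All arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities (invented).
y_true = np.array([0, 0, 1, 1, 1])
y_prob = np.array([0.2, 0.6, 0.4, 0.7, 0.9])

# The same predictions after thresholding at 0.5, as done for the F1 score.
y_bin = np.where(y_prob < 0.5, 0, 1)

auc_prob = roc_auc_score(y_true, y_prob)  # uses the full ranking
auc_bin = roc_auc_score(y_true, y_bin)    # ranking reduced to two levels
print(f"AUC on probabilities: {auc_prob:.3f}")  # 0.833
print(f"AUC on 0/1 labels:    {auc_bin:.3f}")   # 0.583
```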

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score, roc_curve
fpr, tpr, thresholds = roc_curve(y_test, pred_test)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

In [None]:
from sklearn import metrics
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score, roc_curve

fpr, tpr, thresholds = metrics.roc_curve(y_test, pred_test)
roc_auc = metrics.auc(fpr, tpr)
f, ax = plt.subplots(figsize=(10, 8))
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

In [None]:

print('plot features')

ax = lgb.plot_importance(model, max_num_features=20, height=0.5)
plt.show()

In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
import xgboost as xgb


In [None]:
classifiers = {
          'svc':SVC(),
          'rfc':RandomForestClassifier(),
          'knc':KNeighborsClassifier(),
          'gau':GaussianNB(),
          'dtc' : DecisionTreeClassifier(),
          'abc' : AdaBoostClassifier(),
          'grd':GradientBoostingClassifier(),
          'bagg':BaggingClassifier(),
          'xgb':xgb.XGBClassifier(),
          'lgb':lgb.LGBMClassifier()
}

In [None]:
from sklearn.impute import SimpleImputer
# Train and evaluate the classifiers for fraud prediction
imputer = SimpleImputer(strategy='mean')

X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

for name, classifier in classifiers.items():
  classifier.fit(X_train,y_train)
  y_pred = classifier.predict(X_test)
  accuracy = accuracy_score(y_test,y_pred)
  print(f'{name} : {accuracy}')
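
Accuracy on a single split can be a noisy basis for picking a model, especially when one class is rarer than the other. As a hedged alternative, the sketch below compares two of the classifiers with stratified 5-fold cross-validation and the F1 score. It runs on synthetic data from `make_classification`, so the numbers it prints say nothing about the insurance dataset; the names `X_demo`, `y_demo`, `models`, and `results` are invented for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data with roughly 25% positives (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.75, 0.25], random_state=42
)

models = {
    "gau": GaussianNB(),
    "grd": GradientBoostingClassifier(random_state=42),
}

# Stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")
    results[name] = scores.mean()
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```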

In [None]:
GaussianNB_model = GaussianNB()
GaussianNB_model.fit(X_train,y_train)
y_pred = GaussianNB_model.predict(X_test)

Accuracy = accuracy_score(y_test,y_pred)
print(f'Accuracy : {Accuracy}')

In [None]:
SVC_model = SVC()
SVC_model.fit(X_train,y_train)
y_pred = SVC_model.predict(X_test)

Accuracy = accuracy_score(y_test,y_pred)
print(f'Accuracy SVC_model: {Accuracy}')

In [None]:
Ada_model = AdaBoostClassifier()
Ada_model.fit(X_train,y_train)
y_pred = Ada_model.predict(X_test)

Accuracy = accuracy_score(y_test,y_pred)
print(f'Accuracy Ada_model: {Accuracy}')

In [None]:
Gradient_model  = GradientBoostingClassifier()
Gradient_model.fit(X_train,y_train)
y_pred = Gradient_model.predict(X_test)

Accuracy = accuracy_score(y_test,y_pred)
print(f'Accuracy Gradient_model: {Accuracy}')

## Best model
Comparing the accuracy scores above, **GradientBoostingClassifier** is the best-performing model, so it is selected as the final model.

In [None]:
import pickle
filename = 'Gradient_model.pkl'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as model_file:
    pickle.dump(Gradient_model, model_file)
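
A saved model is only useful if it can be loaded back and gives the same predictions. The round trip can be sketched as follows. The example trains a small model on invented data and writes it to a temporary directory, so the file name `demo_model.pkl` and the arrays are assumptions, not the notebook's artefacts.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical training data (invented): the label depends on the first feature.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_model = GradientBoostingClassifier(random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")

    # Save the fitted model.
    with open(path, "wb") as fh:
        pickle.dump(demo_model, fh)

    # Load it back as a separate object.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print("Predictions identical after reload:", same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.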