# **Week #5 - Fraud Analytics: Predictive Modeling**

Fraud Analytics - Sekolah Data - Pacmann Academy

---
**Objectives**

- You can perform predictive modeling for fraud analytics


---
**Outline**

1. Business Understanding
2. Importing Data
3. Splitting Data
4. Pre-Processing
5. Modeling
6. Prediction

Run this cell first!

In [3]:
#%pip install seaborn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# <font color='blue'> 1. Business Understanding
---

- A vehicle insurance company may receive false or exaggerated claims involving vehicle damage or personal injury after an accident. Common schemes include:
  - An accident that is "arranged" (staged) by a fraudster
  - Ghost passengers: people who were not involved in the accident but claim to have suffered serious injuries
  - Personal injury claims that are false or grossly exaggerated
- We need to detect the fraudulent cases
- Thus, we will perform predictive modeling for fraud detection

Suspected claims should be put on hold and investigated manually by a dedicated team

## 1.2 Data Description

___

- We will use the vehicle insurance claim dataset from [Vehicle Insurance Claim - Kaggle](https://www.kaggle.com/datasets/shivamb/vehicle-claim-fraud-detection?resource=download)
- This dataset contains accident, vehicle, and policy details.
  - Accident Details:
    1. Time: - `Month`
  - Vehicle Details:
    1. Vehicle:
       - `VehicleCategory`,
       - `VehiclePrice`,
       - `AgeOfVehicle`
       - `Make`
  - Claim Details:
    1. Time:
       - `MonthClaimed`
    2. `Fault`
    3. Policy:
       - `PastNumberOfClaims`,
       - `AgeOfPolicyHolder`,
       - `BasePolicy`,
       - `AgentType`
    4. Others:
       - `NumberOfSuppliments`,
       - `AddressChange_Claim`
    5. Target variable: `FraudFound_P`

## 1.3 Define the Problem
---

- We are facing two problems
  - If we incorrectly predict a fraudulent claim as not fraud, we lose money to the fraud.
  - If we incorrectly predict a legitimate claim as fraud, we risk losing the customer.

- Hence, we create a model to **classify** claims as fraud or not fraud
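
The two kinds of mistake above correspond to false negatives and false positives. A minimal sketch that counts them, using made-up labels purely for illustration (the lists `y_true` and `y_pred` are not taken from the dataset):

```python
# Hypothetical labels for illustration only: 1 = fraud, 0 = not fraud
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]

# Fraud predicted as not fraud: the company pays out a fraudulent claim
false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

# Not fraud predicted as fraud: an honest customer is held up for investigation
false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

print("False negatives (missed fraud):", false_negatives)
print("False positives (false alarms):", false_positives)
```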

Model descriptions:

1. **What are the inputs?** Insurance claim records, described by the features listed in Section 1.2 (for example `Fault`, `NumberOfSuppliments`, `PastNumberOfClaims`)

2. **What are the outputs?** The `FraudFound_P` prediction:
      - `1` if the claim is predicted as fraud, and
      - `0` otherwise.

3. **What do we do with the prediction?** If a claim is predicted as fraud, the decision maker may:
   - Flag the claim as suspicious AND
   - Alert a fraud investigation officer.

## 1.4 Task
---

- **Task:**
  - Fraudulent insurance claim classification

- **Technique used:**
  - Baseline: predict every claim as not fraudulent
  - Logistic Regression
  - Decision Tree
  - Support Vector Machine
  - Random Forest

- **Evaluation:**
  - ROC - AUC
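
ROC AUC can be read as the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case. The sketch below computes it by direct pair counting on made-up scores (the function `roc_auc` and all values are illustrative assumptions); it also shows why a constant-prediction baseline lands at 0.5.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (fraud, non-fraud) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores, for illustration only
y_true = [0, 0, 1, 0, 1]
model_scores = [0.1, 0.4, 0.35, 0.2, 0.8]
constant_scores = [0.0] * len(y_true)   # a baseline that scores every claim the same

print("Model AUC   :", roc_auc(y_true, model_scores))
print("Baseline AUC:", roc_auc(y_true, constant_scores))
```

This pairwise version is only practical for small samples; it is meant to explain the metric, not to replace a library implementation.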

# <font color='blue'> 2. Importing Data


- We import the data from `fraud_oracle.csv`

In [7]:
# read dataset function
def read_data(path):
    """
    Reads a CSV file at the given path, prints its shape,
    and returns its contents as a pandas DataFrame.

    Parameters
    ----------
    path : str
        The sample data input path (csv format)

    Return
    ------
    df : pd.DataFrame
        The sample data input
    """
    # Read data
    df = pd.read_csv(path)

    # Validate
    print('Data shape:', df.shape)

    return df

In [8]:
col = ["AgeOfVehicle",
       "NumberOfSuppliments",
       "AgentType",
       "AgeOfPolicyHolder",
       "Month",
       "MonthClaimed",
       "AddressChange_Claim",
       "VehiclePrice",
       "PastNumberOfClaims",
       "Make",
       "VehicleCategory",
       "Fault",
       "BasePolicy",
       "FraudFound_P"]

In [9]:
# Dataset name / Path
path = 'fraud_oracle.csv'

# Read the data
df = read_data(path = path)
df.head()

Data shape: (15420, 33)


Unnamed: 0,Month,WeekOfMonth,DayOfWeek,Make,AccidentArea,DayOfWeekClaimed,MonthClaimed,WeekOfMonthClaimed,Sex,MaritalStatus,...,AgeOfVehicle,AgeOfPolicyHolder,PoliceReportFiled,WitnessPresent,AgentType,NumberOfSuppliments,AddressChange_Claim,NumberOfCars,Year,BasePolicy
0,Dec,5,Wednesday,Honda,Urban,Tuesday,Jan,1,Female,Single,...,3 years,26 to 30,No,No,External,none,1 year,3 to 4,1994,Liability
1,Jan,3,Wednesday,Honda,Urban,Monday,Jan,4,Male,Single,...,6 years,31 to 35,Yes,No,External,none,no change,1 vehicle,1994,Collision
2,Oct,5,Friday,Honda,Urban,Thursday,Nov,2,Male,Married,...,7 years,41 to 50,No,No,External,none,no change,1 vehicle,1994,Collision
3,Jun,2,Saturday,Toyota,Rural,Friday,Jul,1,Male,Married,...,more than 7,51 to 65,Yes,No,External,more than 5,no change,1 vehicle,1994,Liability
4,Jan,5,Monday,Honda,Urban,Tuesday,Feb,2,Female,Single,...,5 years,31 to 35,No,No,External,none,no change,1 vehicle,1994,Collision


In [10]:
# get the selected cols
df = df[col]
df.head()

Unnamed: 0,AgeOfVehicle,NumberOfSuppliments,AgentType,AgeOfPolicyHolder,Month,MonthClaimed,AddressChange_Claim,VehiclePrice,PastNumberOfClaims,Make,VehicleCategory,Fault,BasePolicy,FraudFound_P
0,3 years,none,External,26 to 30,Dec,Jan,1 year,more than 69000,none,Honda,Sport,Policy Holder,Liability,0
1,6 years,none,External,31 to 35,Jan,Jan,no change,more than 69000,none,Honda,Sport,Policy Holder,Collision,0
2,7 years,none,External,41 to 50,Oct,Nov,no change,more than 69000,1,Honda,Sport,Policy Holder,Collision,0
3,more than 7,more than 5,External,51 to 65,Jun,Jul,no change,20000 to 29000,1,Toyota,Sport,Third Party,Liability,0
4,5 years,none,External,31 to 35,Jan,Feb,no change,more than 69000,none,Honda,Sport,Third Party,Collision,0


# <font color='blue'> 3. Splitting Data
    
___

- Our objective is to classify unseen claims, so we must make sure that no information from the unseen data leaks into training.
- Our tasks:
  1. Split into `input` (`X`) and `output` (`y`)
  2. Split into `train` (60% of the data), `valid` (20%), and `test` (20%).
    - `train` data: used to build the model
    - `valid` data: used to choose the best model
    - `test` data: used for the final evaluation

**Split Input & Output**

In [11]:
# function split input and output
def split_input_output(data, target_column):
    """
    Function to split input (x) and output (y)

    Parameters
    ----------
    data : pd.DataFrame
        The sample data input

    target_column : str
        The output column name

    Return
    ------
    X : pd.DataFrame
        input data

    y : pd.Series
        output data
    """
    X = data.drop(columns = target_column)
    y = data[target_column]

    # Validate
    print('X shape:', X.shape)
    print('y shape :', y.shape)

    return X, y


In [12]:
# Split input x and output y
X, y = split_input_output(data = df,
                          target_column = "FraudFound_P")

# Show 5 first rows of input
X.head()

X shape: (15420, 13)
y shape : (15420,)


Unnamed: 0,AgeOfVehicle,NumberOfSuppliments,AgentType,AgeOfPolicyHolder,Month,MonthClaimed,AddressChange_Claim,VehiclePrice,PastNumberOfClaims,Make,VehicleCategory,Fault,BasePolicy
0,3 years,none,External,26 to 30,Dec,Jan,1 year,more than 69000,none,Honda,Sport,Policy Holder,Liability
1,6 years,none,External,31 to 35,Jan,Jan,no change,more than 69000,none,Honda,Sport,Policy Holder,Collision
2,7 years,none,External,41 to 50,Oct,Nov,no change,more than 69000,1,Honda,Sport,Policy Holder,Collision
3,more than 7,more than 5,External,51 to 65,Jun,Jul,no change,20000 to 29000,1,Toyota,Sport,Third Party,Liability
4,5 years,none,External,31 to 35,Jan,Feb,no change,more than 69000,none,Honda,Sport,Third Party,Collision


**Validate the Data Dimension**

We want to check that the data dimensions match what was defined in the data description step.

In [13]:
# check data dimension
n_samples, n_features = X.shape

# print number samples and features
print(f"Number of samples  : {n_samples}")
print(f"Number of features : {n_features}")

Number of samples  : 15420
Number of features : 13


In [14]:
# check data features name
features_names = X.columns

# print name of features
print(f"name of features : {features_names}")

name of features : Index(['AgeOfVehicle', 'NumberOfSuppliments', 'AgentType', 'AgeOfPolicyHolder',
       'Month', 'MonthClaimed', 'AddressChange_Claim', 'VehiclePrice',
       'PastNumberOfClaims', 'Make', 'VehicleCategory', 'Fault', 'BasePolicy'],
      dtype='object')


**Split Train, Valid, and Test**

- Create a function to split train-valid-test

In [17]:
from sklearn.model_selection import train_test_split

def split_train_valid_test(X, y, test_size, valid_size, stratify, random_state=42):
    """
    Split data into train, valid & test

    Parameters
    ----------
    X : pd.DataFrame
        The input data

    y : pd.Series
        The output data

    test_size : float
        The proportion of number of test data to total data

    valid_size : float
        The proportion of number of validation data to total data

    stratify : pd.Series
        Reference to stratify the splitting

    random_state : int, default=42
        The random seed, for reproducibility

    Returns
    -------
    X_train : pd.DataFrame
        The input train data

    X_valid : pd.DataFrame
        The input validation data

    X_test : pd.DataFrame
        The input test data

    y_train : pd.Series
        The output train data

    y_valid : pd.Series
        The output validation data

    y_test : pd.Series
        The output test data
    """
    # Split the data
    X_train, X_not_train, y_train, y_not_train = train_test_split(
        X,
        y,
        test_size = test_size + valid_size,
        stratify = stratify,
        random_state = random_state
    )

    # Then, split valid and test from not_train
    # (here test_size is the share of the hold-out that becomes the test set)
    X_valid, X_test, y_valid, y_test = train_test_split(
        X_not_train,
        y_not_train,
        test_size = test_size/(test_size + valid_size),
        stratify = y_not_train,
        random_state = random_state
    )

    # Validate
    print('X train shape:', X_train.shape)
    print('y train shape:', y_train.shape)
    print('X valid shape :', X_valid.shape)
    print('y valid shape :', y_valid.shape)
    print('X test shape :', X_test.shape)
    print('y test shape :', y_test.shape)

    return X_train, X_valid, X_test, y_train, y_valid, y_test
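
The two-stage split can be checked with simple arithmetic. The sketch below assumes the 15,420 rows reported when the data was loaded and the 20%/20% sizes used in this notebook; the `round` calls only approximate how scikit-learn rounds split sizes.

```python
total_rows = 15420      # number of rows reported by read_data
test_size = 0.2
valid_size = 0.2

# Stage 1: hold out validation + test together
holdout_fraction = test_size + valid_size               # 0.4 of all rows

# Stage 2: share of the hold-out that becomes the test set
test_share_of_holdout = test_size / holdout_fraction    # 0.5

n_holdout = round(total_rows * holdout_fraction)
n_test = round(n_holdout * test_share_of_holdout)
n_valid = n_holdout - n_test
n_train = total_rows - n_holdout

print("train:", n_train, "valid:", n_valid, "test:", n_test)
```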


In [18]:
# Run the code
splitted_data = split_train_valid_test(
    X = X,
    y = y,
    test_size = 0.2,
    valid_size = 0.2,
    stratify = y,
    random_state = 42
)

X_train, X_valid, X_test, y_train, y_valid, y_test = splitted_data

X train shape: (9252, 13)
y train shape: (9252,)
X valid shape : (3084, 13)
y valid shape : (3084,)
X test shape : (3084, 13)
y test shape : (3084,)


**Summary**
- Now we have training, validation and testing data
  - test_size = 20% from original data,
  - validation_size = 20% from original data,
  - train_size = 60% from original data

# <font color='blue'> 4. Preprocess Data


### **4.1 Split numerical and categoric data**


- All of our columns are stored as categorical (string) data, but some of them have naturally ordered values
- We separate the ordered columns (treated as numerical, `NUM_COLS`) from the unordered ones (`CAT_COLS`) so that each group can be handled differently
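
To illustrate what "ordered" means here, the sketch below declares an explicit order for a few `VehiclePrice` values using `pd.Categorical`. This is only an illustration of the idea, not the encoding used later in this notebook; the integer codes produced here start at 0.

```python
import pandas as pd

# A few VehiclePrice values as they appear in the dataset
prices = pd.Series(["more than 69000", "20000 to 29000", "less than 20000"])

# Declaring the order lets pandas compare and sort the price bands meaningfully
price_order = ["less than 20000", "20000 to 29000", "30000 to 39000",
               "40000 to 59000", "60000 to 69000", "more than 69000"]
prices_ordered = pd.Categorical(prices, categories=price_order, ordered=True)

# Integer codes follow the declared order (0 = cheapest band)
print(prices_ordered.codes.tolist())
```

An unordered column such as `Make` has no such ranking, which is why it is one-hot encoded instead.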

In [19]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9252 entries, 2839 to 5044
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   AgeOfVehicle         9252 non-null   object
 1   NumberOfSuppliments  9252 non-null   object
 2   AgentType            9252 non-null   object
 3   AgeOfPolicyHolder    9252 non-null   object
 4   Month                9252 non-null   object
 5   MonthClaimed         9252 non-null   object
 6   AddressChange_Claim  9252 non-null   object
 7   VehiclePrice         9252 non-null   object
 8   PastNumberOfClaims   9252 non-null   object
 9   Make                 9252 non-null   object
 10  VehicleCategory      9252 non-null   object
 11  Fault                9252 non-null   object
 12  BasePolicy           9252 non-null   object
dtypes: object(13)
memory usage: 1011.9+ KB


In [20]:
# define numerical data
NUM_COLS = ['AgeOfVehicle', 'NumberOfSuppliments', 'AgeOfPolicyHolder',
            'MonthClaimed', 'Month',
            'VehiclePrice', 'PastNumberOfClaims']
# define categorical
CAT_COLS = ['AgentType', 'AddressChange_Claim', 'Make',
            'VehicleCategory', 'Fault',"BasePolicy"]

In [21]:
# split
X_train_num = X_train[NUM_COLS]
X_train_cat = X_train[CAT_COLS]

In [22]:
X_train_cat.head()

Unnamed: 0,AgentType,AddressChange_Claim,Make,VehicleCategory,Fault,BasePolicy
2839,External,no change,Pontiac,Sport,Policy Holder,Liability
5783,External,no change,Pontiac,Sedan,Third Party,All Perils
10425,External,no change,Pontiac,Sedan,Policy Holder,Collision
10966,External,no change,Toyota,Sport,Policy Holder,Liability
14520,External,no change,Dodge,Sport,Third Party,Liability


### **4.2 Encode categorical data**


- Use one-hot encoding for the unordered categorical data

<center>
<img src="https://datagy.io/wp-content/uploads/2022/01/One-Hot-Encoding-for-Scikit-Learn-in-Python-Explained-1024x576.png">

- Use the `OneHotEncoder` class from the `sklearn` library

In [23]:
from sklearn.preprocessing import OneHotEncoder

- Create the encoder

In [26]:
# initiate encoder
encoder = OneHotEncoder(drop='if_binary',           # for two-category features, keep only one column
                        handle_unknown='ignore')    # unseen categories are encoded as all zeros

In [27]:
# Fit encoder to categorical data in train data
encoder.fit(X_train_cat)

In [28]:
# result encoder: feature name example
encoder.get_feature_names_out()

array(['AgentType_Internal', 'AddressChange_Claim_1 year',
       'AddressChange_Claim_2 to 3 years',
       'AddressChange_Claim_4 to 8 years',
       'AddressChange_Claim_no change',
       'AddressChange_Claim_under 6 months', 'Make_Accura', 'Make_BMW',
       'Make_Chevrolet', 'Make_Dodge', 'Make_Ferrari', 'Make_Ford',
       'Make_Honda', 'Make_Jaguar', 'Make_Lexus', 'Make_Mazda',
       'Make_Mecedes', 'Make_Mercury', 'Make_Nisson', 'Make_Pontiac',
       'Make_Porche', 'Make_Saab', 'Make_Saturn', 'Make_Toyota',
       'Make_VW', 'VehicleCategory_Sedan', 'VehicleCategory_Sport',
       'VehicleCategory_Utility', 'Fault_Third Party',
       'BasePolicy_All Perils', 'BasePolicy_Collision',
       'BasePolicy_Liability'], dtype=object)

- Perform transformation

In [30]:
# Transform
X_train_cat_enc = pd.DataFrame(
    encoder.transform(X_train_cat).toarray(),
    columns = encoder.get_feature_names_out(),
    index = X_train_cat.index
)

X_train_cat_enc.head()

Unnamed: 0,AgentType_Internal,AddressChange_Claim_1 year,AddressChange_Claim_2 to 3 years,AddressChange_Claim_4 to 8 years,AddressChange_Claim_no change,AddressChange_Claim_under 6 months,Make_Accura,Make_BMW,Make_Chevrolet,Make_Dodge,...,Make_Saturn,Make_Toyota,Make_VW,VehicleCategory_Sedan,VehicleCategory_Sport,VehicleCategory_Utility,Fault_Third Party,BasePolicy_All Perils,BasePolicy_Collision,BasePolicy_Liability
2839,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
5783,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
10425,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
10966,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
14520,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0


In [32]:
X_train_cat.columns

Index(['AgentType', 'AddressChange_Claim', 'Make', 'VehicleCategory', 'Fault',
       'BasePolicy'],
      dtype='object')

In [35]:
encoder.transform(X_train_cat).toarray()

array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 1., 0., 0.]])

In [36]:
X_train_cat.head()

Unnamed: 0,AgentType,AddressChange_Claim,Make,VehicleCategory,Fault,BasePolicy
2839,External,no change,Pontiac,Sport,Policy Holder,Liability
5783,External,no change,Pontiac,Sedan,Third Party,All Perils
10425,External,no change,Pontiac,Sedan,Policy Holder,Collision
10966,External,no change,Toyota,Sport,Policy Holder,Liability
14520,External,no change,Dodge,Sport,Third Party,Liability


### 4.3 Transform category into ordered values


**Transform** - convert the ordered categories into integers that preserve their order

In [42]:
X_train_num["VehiclePrice"].value_counts()

VehiclePrice
20000 to 29000     4844
30000 to 39000     2117
more than 69000    1298
less than 20000     649
40000 to 59000      290
60000 to 69000       54
Name: count, dtype: int64

In [43]:
price_list = ["less than 20000", "20000 to 29000", "30000 to 39000", "40000 to 59000",
              "60000 to 69000", "more than 69000"]

number_list = [i+1 for i in range(len(price_list))]

In [44]:
number_list

[1, 2, 3, 4, 5, 6]

In [45]:
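def transformVehiclePrice(data):
    """Map the ordered `VehiclePrice` bands to integers 1..6 (cheapest = 1)."""
    price_list = ["less than 20000", "20000 to 29000", "30000 to 39000", "40000 to 59000",
                  "60000 to 69000", "more than 69000"]

    number_list = [i+1 for i in range(len(price_list))]

    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()
    data["VehiclePrice"] = data["VehiclePrice"].replace(price_list, number_list)

    return data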

In [46]:
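def transformAgeOfVehicle(data):
    """Map the ordered `AgeOfVehicle` bands to integers 0..7 (new = 0)."""
    age_list = ["new", "2 years", "3 years", "4 years",
                "5 years", "6 years", "7 years", "more than 7"]

    number_list = [i for i in range(len(age_list))]

    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()
    data["AgeOfVehicle"] = data["AgeOfVehicle"].replace(age_list, number_list)

    return data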

In [47]:
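def transformPastNumberOfClaims(data):
    """Map the ordered `PastNumberOfClaims` bands to integers 0..3 (none = 0)."""
    claimn_list = ["none", "1", "2 to 4", "more than 4"]

    number_list = [i for i in range(len(claimn_list))]

    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()
    data["PastNumberOfClaims"] = data["PastNumberOfClaims"].replace(claimn_list, number_list)

    return data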

In [48]:
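def transformAgeOfPolicyHolder(data):
    """Map the ordered `AgeOfPolicyHolder` bands to integers 1..9 (youngest = 1)."""
    agepolice_list = ["16 to 17", "18 to 20", "21 to 25", "26 to 30",
                      "31 to 35", "36 to 40", "41 to 50", "51 to 65",
                      "over 65"]

    number_list = [i+1 for i in range(len(agepolice_list))]

    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()
    data["AgeOfPolicyHolder"] = data["AgeOfPolicyHolder"].replace(agepolice_list, number_list)

    return data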

In [49]:
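def transformNumberOfSuppliments(data):
    """Map the ordered `NumberOfSuppliments` bands to integers 0..3 (none = 0)."""
    nsup = ["none", "1 to 2", "3 to 5", "more than 5"]

    number_list = [i for i in range(len(nsup))]

    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()
    data["NumberOfSuppliments"] = data["NumberOfSuppliments"].replace(nsup, number_list)

    return data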

In [50]:
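def transformMonth(data):
    """Map month abbreviations in `Month` and `MonthClaimed` to integers 0..11 (Jan = 0)."""
    month_list = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

    number_list = [i for i in range(len(month_list))]

    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()
    data["Month"] = data["Month"].replace(month_list, number_list)
    data["MonthClaimed"] = data["MonthClaimed"].replace(month_list, number_list)

    return data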
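
The transform functions above all follow the same pattern: an ordered list of levels is mapped to consecutive integers. A possible generic version is sketched below. The helper `encode_ordinal` is hypothetical and not part of this notebook, and it uses `Series.map` instead of `replace`, so values missing from the list become `NaN` rather than staying as strings.

```python
import pandas as pd

def encode_ordinal(series, ordered_levels, start=0):
    """Map each level to its position in ordered_levels (hypothetical helper)."""
    mapping = {level: i + start for i, level in enumerate(ordered_levels)}
    # .map returns NaN for values that are missing from the mapping,
    # which makes unexpected categories easy to spot
    return series.map(mapping)

# "unknown" is a made-up value used to show how unexpected categories behave
claims = pd.Series(["none", "2 to 4", "1", "more than 4", "unknown"])
encoded = encode_ordinal(claims, ["none", "1", "2 to 4", "more than 4"])
print(encoded)
```

Any `NaN` produced this way would then be handled by the imputation step.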

In [51]:
# Apply every ordinal transformation to a copy of the ordered columns,
# so that no value is written into a slice of X_train
X_train_num = X_train[NUM_COLS].copy()
X_train_num = transformVehiclePrice(data = X_train_num)
X_train_num = transformAgeOfVehicle(data = X_train_num)
X_train_num = transformPastNumberOfClaims(data = X_train_num)
X_train_num = transformAgeOfPolicyHolder(data = X_train_num)
X_train_num = transformNumberOfSuppliments(data = X_train_num)
X_train_num = transformMonth(data = X_train_num)

# Write the encoded columns back
X_train[NUM_COLS] = X_train_num

In [52]:
X_train[NUM_COLS]

Unnamed: 0,AgeOfVehicle,NumberOfSuppliments,AgeOfPolicyHolder,MonthClaimed,Month,VehiclePrice,PastNumberOfClaims
2839,7,3,8,0,0,3,2
5783,6,0,6,2,2,2,0
10425,6,0,6,0,0,1,0
10966,7,0,8,0,0,2,2
14520,5,3,5,6,6,2,3
...,...,...,...,...,...,...,...
11935,7,3,8,9,9,3,2
12067,7,1,8,0,0,2,1
14159,5,0,5,0,0,3,3
2829,6,0,5,8,8,6,2


**Missing Values** - Numerical

In [54]:
# Check for missing values
X_train.isna().any()

AgeOfVehicle           False
NumberOfSuppliments    False
AgentType              False
AgeOfPolicyHolder      False
Month                  False
MonthClaimed           False
AddressChange_Claim    False
VehiclePrice           False
PastNumberOfClaims     False
Make                   False
VehicleCategory        False
Fault                  False
BasePolicy             False
dtype: bool

In [55]:
# Create an imputer that can be reused on validation and test data
from sklearn.impute import SimpleImputer

def imputerNumeric(data, imputer = None):
    if imputer is None:
        # Create imputer
        imputer = SimpleImputer(missing_values = np.nan,
                                strategy = "median")
        imputer.fit(data)

    # Transform data
    data_imputed = imputer.transform(data)
    data_imputed = pd.DataFrame(data = data_imputed,
                                columns = data.columns,
                                index = data.index)

    return data_imputed, imputer

In [56]:
train_imputed, train_imputer = imputerNumeric(data = X_train[NUM_COLS])
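
The reason for returning the fitted imputer is that the same medians, learned from the training data, should be reused on later data. A minimal sketch on toy values (the frames `train_num` and `valid_num` are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical, already-encoded values
train_num = pd.DataFrame({"VehiclePrice": [1, 2, 2, 6], "AgeOfVehicle": [7, 6, 5, 7]})
valid_num = pd.DataFrame({"VehiclePrice": [np.nan, 3], "AgeOfVehicle": [4, np.nan]})

# Fit on the training data only: the medians come from train_num
imputer = SimpleImputer(missing_values=np.nan, strategy="median")
imputer.fit(train_num)

# Reuse the fitted imputer on the validation data (no refit)
valid_imputed = pd.DataFrame(imputer.transform(valid_num),
                             columns=valid_num.columns,
                             index=valid_num.index)
print(valid_imputed)
```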

### **4.4 Combine categorical and numerical data**


- Next, combine the one-hot encoded categorical data with the numerically encoded ordered data

In [58]:
# Join data
X_train_concat = pd.concat((X_train[NUM_COLS].copy(), X_train_cat_enc), axis=1)

# Validate
print('Data shape:', X_train_concat.shape)
X_train_concat.head()

Data shape: (9252, 39)


Unnamed: 0,AgeOfVehicle,NumberOfSuppliments,AgeOfPolicyHolder,MonthClaimed,Month,VehiclePrice,PastNumberOfClaims,AgentType_Internal,AddressChange_Claim_1 year,AddressChange_Claim_2 to 3 years,...,Make_Saturn,Make_Toyota,Make_VW,VehicleCategory_Sedan,VehicleCategory_Sport,VehicleCategory_Utility,Fault_Third Party,BasePolicy_All Perils,BasePolicy_Collision,BasePolicy_Liability
2839,7,3,8,0,0,3,2,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
5783,6,0,6,2,2,2,0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
10425,6,0,6,0,0,1,0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
10966,7,0,8,0,0,2,2,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
14520,5,3,5,6,6,2,3,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
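
The resulting 39 columns are the 7 ordered columns plus the 32 one-hot columns. `pd.concat` with `axis=1` aligns rows by index label rather than by position, which is why the encoded frame was built with the original index. A toy sketch (the two small frames are illustrative only):

```python
import pandas as pd

num_part = pd.DataFrame({"VehiclePrice": [3, 2]}, index=[2839, 5783])
# Same rows, deliberately listed in a different order
cat_part = pd.DataFrame({"Fault_Third Party": [1.0, 0.0]}, index=[5783, 2839])

# Rows are matched on their index labels, not on their position
combined = pd.concat((num_part, cat_part), axis=1)
print(combined)
```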


# <font color='blue'> 5. Modeling


In [59]:
# Import model

from sklearn.dummy import DummyClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier


## 5.1 Create Models
---

- Define model & hyperparameter

In [60]:
model_dict = {
    'baseline': DummyClassifier(),
    'logistic regression': LogisticRegression(),
    'svm': SVC(),
    'decision tree': DecisionTreeClassifier(random_state=42),
    'random forest': RandomForestClassifier(random_state=42)
}

hyperparam_dict = {
    'baseline': {'strategy':['most_frequent']},
    'logistic regression': {},
    'svm': {'C':[0.1, 0.5, 1], 'kernel': ["linear", "poly", "rbf"]},
    'decision tree': {'max_depth': [5, 10, 20]},
    'random forest': {'n_estimators': [100, 300]}
}
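
Each hyperparameter dictionary describes a grid: every combination of the listed values is one candidate. The sketch below enumerates the SVM grid with `itertools.product` and counts the model fits needed, assuming 5-fold cross-validation as used in the next section.

```python
from itertools import product

svm_grid = {'C': [0.1, 0.5, 1], 'kernel': ["linear", "poly", "rbf"]}

# Every combination of C and kernel is one candidate
candidates = [dict(zip(svm_grid.keys(), values))
              for values in product(*svm_grid.values())]

n_folds = 5                          # assumed number of cross-validation folds
n_fits = len(candidates) * n_folds   # each candidate is fitted once per fold

print("Candidates:", len(candidates))
print("Total fits:", n_fits)
```

An empty dictionary, as used for logistic regression, yields a single candidate with default settings.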

## 5.2 Hyperparameter Tuning
---

In [61]:
from sklearn.model_selection import GridSearchCV

In [62]:
# Perform modeling
models = []
auc_trains = []
auc_tests = []
best_params = []

for model_name in model_dict.keys():
    # Log
    print('start modeling', model_name)

    cv_ = GridSearchCV(estimator = model_dict[model_name],
                       param_grid = hyperparam_dict[model_name],
                       cv = 5,
                       scoring = 'roc_auc',
                       return_train_score = True,
                       verbose = 3)
    cv_.fit(X_train_concat, y_train)

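    # NOTE: index 0 is the first candidate in the grid, which is not always the
    # best one; cv_.cv_results_['mean_train_score'][cv_.best_index_] would give
    # the train score of the selected candidate
    auc_trains_ = cv_.cv_results_['mean_train_score'][0]
    # best_score_ is the mean cross-validated AUC of the best candidate
    auc_tests_ = cv_.best_score_
    best_params_ = cv_.best_params_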

    # append
    models.append(model_name)
    auc_trains.append(auc_trains_)
    auc_tests.append(auc_tests_)
    best_params.append(best_params_)

    # log
    print('finish modeling', model_name)
    print('')

start modeling baseline
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5] END strategy=most_frequent;, score=(train=0.500, test=0.500) total time=   0.0s
[CV 2/5] END strategy=most_frequent;, score=(train=0.500, test=0.500) total time=   0.0s
[CV 3/5] END strategy=most_frequent;, score=(train=0.500, test=0.500) total time=   0.0s
[CV 4/5] END strategy=most_frequent;, score=(train=0.500, test=0.500) total time=   0.0s
[CV 5/5] END strategy=most_frequent;, score=(train=0.500, test=0.500) total time=   0.0s
finish modeling baseline

start modeling logistic regression
Fitting 5 folds for each of 1 candidates, totalling 5 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 1/5] END ..............., score=(train=0.818, test=0.812) total time=   0.0s
[CV 2/5] END ..............., score=(train=0.824, test=0.785) total time=   0.0s




[CV 3/5] END ..............., score=(train=0.821, test=0.794) total time=   0.0s
[CV 4/5] END ..............., score=(train=0.818, test=0.823) total time=   0.0s




[CV 5/5] END ..............., score=(train=0.814, test=0.823) total time=   0.0s
finish modeling logistic regression

start modeling svm
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END C=0.1, kernel=linear;, score=(train=0.658, test=0.615) total time=   0.2s
[CV 2/5] END C=0.1, kernel=linear;, score=(train=0.721, test=0.712) total time=   0.2s
[CV 3/5] END C=0.1, kernel=linear;, score=(train=0.683, test=0.684) total time=   0.2s
[CV 4/5] END C=0.1, kernel=linear;, score=(train=0.729, test=0.722) total time=   0.2s
[CV 5/5] END C=0.1, kernel=linear;, score=(train=0.726, test=0.707) total time=   0.2s
[CV 1/5] END C=0.1, kernel=poly;, score=(train=0.625, test=0.600) total time=   0.7s
[CV 2/5] END C=0.1, kernel=poly;, score=(train=0.643, test=0.565) total time=   0.6s
[CV 3/5] END C=0.1, kernel=poly;, score=(train=0.615, test=0.570) total time=   0.6s
[CV 4/5] END C=0.1, kernel=poly;, score=(train=0.635, test=0.547) total time=   0.6s
[CV 5/5] END C=0.1, kernel=p

## 5.3 Best Parameters
---

In [63]:
summ_exp = pd.DataFrame(
    {'model': models,
     'AUC train': auc_trains,
     'AUC test': auc_tests,
     'Best param': best_params}
)

summ_exp

Unnamed: 0,model,AUC train,AUC test,Best param
0,baseline,0.5,0.5,{'strategy': 'most_frequent'}
1,logistic regression,0.818985,0.8077,{}
2,svm,0.70329,0.719045,"{'C': 1, 'kernel': 'rbf'}"
3,decision tree,0.832718,0.818199,{'max_depth': 5}
4,random forest,0.999867,0.780936,{'n_estimators': 300}
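
One way to read the summary is to compare the cross-validated AUC across models and to look at the gap between train and cross-validated scores. The sketch below copies the values from the table above; the column names `auc_train`, `auc_cv`, and `gap` are chosen here for illustration.

```python
import pandas as pd

# Values copied from the summary table above ("AUC test" is the mean CV score)
summary = pd.DataFrame({
    "model": ["baseline", "logistic regression", "svm", "decision tree", "random forest"],
    "auc_train": [0.5, 0.818985, 0.70329, 0.832718, 0.999867],
    "auc_cv": [0.5, 0.8077, 0.719045, 0.818199, 0.780936],
})

# A large gap between train and cross-validated AUC hints at overfitting
summary["gap"] = summary["auc_train"] - summary["auc_cv"]

best_model = summary.loc[summary["auc_cv"].idxmax(), "model"]
largest_gap_model = summary.loc[summary["gap"].idxmax(), "model"]

print("Highest cross-validated AUC:", best_model)
print("Largest train/CV gap       :", largest_gap_model)
```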


- Retrain each model with its best hyperparameters

In [65]:
# random_state matches the value used during the grid search, for reproducibility
dt_best = DecisionTreeClassifier(max_depth = 5, random_state = 42)
dt_best.fit(X_train_concat, y_train)

lr_best = LogisticRegression()
lr_best.fit(X_train_concat, y_train)

svm_best = SVC(C = 1, kernel = 'rbf')
svm_best.fit(X_train_concat, y_train)

rf_best = RandomForestClassifier(n_estimators = 300, random_state = 42)
rf_best.fit(X_train_concat, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
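
The convergence warning above suggests either raising `max_iter` or scaling the features. A minimal sketch of both options on synthetic data; the arrays `X_demo` and `y_demo` and the value `max_iter=1000` are illustrative assumptions, not the settings used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic integer-coded features, loosely resembling the encoded columns
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 12, size=(200, 3)).astype(float)
y_demo = (X_demo[:, 0] + rng.normal(0, 2, size=200) > 6).astype(int)

# Scale the features first, then fit logistic regression with more iterations
lr_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_scaled.fit(X_demo, y_demo)

# Number of solver iterations actually used by the final step
n_iter_used = lr_scaled[-1].n_iter_[0]
print("Iterations used:", n_iter_used)
```

Putting the scaler inside the pipeline means it is fitted on the training folds only when the pipeline is passed to a grid search.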


# <font color='blue'> 6. Prediction


To make predictions, we need
- The trained models
- Data that has been preprocessed in the same way as the training data

## 6.1 Preprocess validation and test data
---

- The validation and test data must be preprocessed before they can be passed to the models
- Reuse the preprocessing steps from training (the same fitted encoder and the same ordinal mappings); do not refit them
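
After preprocessing, the validation matrix should have the same columns, in the same order, as the training matrix. A small sanity check is sketched below on toy frames (`train_matrix` and `valid_matrix` are made up for illustration):

```python
import pandas as pd

train_matrix = pd.DataFrame({"VehiclePrice": [3, 2], "Fault_Third Party": [0.0, 1.0]})
# Same columns as the training matrix, but in a different order
valid_matrix = pd.DataFrame({"Fault_Third Party": [1.0], "VehiclePrice": [5]})

# Check that both matrices contain the same set of columns
same_columns = set(train_matrix.columns) == set(valid_matrix.columns)

# Reorder the validation columns to match the training layout
valid_aligned = valid_matrix[train_matrix.columns]

print("Same columns:", same_columns)
print(list(valid_aligned.columns))
```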

In [67]:
# Split cat & num (copy so the transforms do not write into a slice of X_valid)
X_valid_num = X_valid[NUM_COLS].copy()
X_valid_cat = X_valid[CAT_COLS]

# Encode cat
X_valid_cat_enc = pd.DataFrame(
    encoder.transform(X_valid_cat).toarray(),
    index = X_valid_cat.index,
    columns = encoder.get_feature_names_out()
)

X_valid_num = transformVehiclePrice(data = X_valid_num)
X_valid_num = transformAgeOfVehicle(data = X_valid_num)
X_valid_num = transformPastNumberOfClaims(data = X_valid_num)
X_valid_num = transformAgeOfPolicyHolder(data = X_valid_num)
X_valid_num = transformNumberOfSuppliments(data = X_valid_num)
X_valid_num = transformMonth(data = X_valid_num)

# Concat
X_valid_concat = pd.concat((X_valid_num, X_valid_cat_enc), axis=1)
X_valid_concat

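| | AgeOfVehicle | NumberOfSuppliments | AgeOfPolicyHolder | MonthClaimed | Month | VehiclePrice | PastNumberOfClaims | AgentType_Internal | AddressChange_Claim_1 year | AddressChange_Claim_2 to 3 years | ... | Make_Saturn | Make_Toyota | Make_VW | VehicleCategory_Sedan | VehicleCategory_Sport | VehicleCategory_Utility | Fault_Third Party | BasePolicy_All Perils | BasePolicy_Collision | BasePolicy_Liability |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13829 | 7 | 0 | 8 | 2 | 2 | 3 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 6974 | 7 | 3 | 7 | 3 | 2 | 5 | 2 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 14783 | 6 | 0 | 6 | 11 | 11 | 2 | 1 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 8888 | 6 | 3 | 5 | 8 | 8 | 2 | 1 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 7083 | 7 | 3 | 7 | 1 | 1 | 3 | 3 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12962 | 5 | 3 | 4 | 1 | 1 | 6 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 103 | 7 | 3 | 7 | 0 | 0 | 2 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 5712 | 7 | 1 | 8 | 5 | 4 | 1 | 1 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 11575 | 5 | 1 | 5 | 11 | 11 | 2 | 2 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |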


In [68]:
# Split cat & num
# .copy() avoids writing into a slice of X_test inside the transform functions
X_test_num = X_test[NUM_COLS].copy()
X_test_cat = X_test[CAT_COLS]

# Encode cat with the already fitted encoder
X_test_cat_enc = pd.DataFrame(
    encoder.transform(X_test_cat).toarray(),
    index = X_test_cat.index,
    columns = encoder.get_feature_names_out()
)

# Apply the same transform functions used on the training data
X_test_num = transformVehiclePrice(data = X_test_num)
X_test_num = transformAgeOfVehicle(data = X_test_num)
X_test_num = transformPastNumberOfClaims(data = X_test_num)
X_test_num = transformAgeOfPolicyHolder(data = X_test_num)
X_test_num = transformNumberOfSuppliments(data = X_test_num)
X_test_num = transformMonth(data = X_test_num)

# Concat
X_test_concat = pd.concat((X_test_num, X_test_cat_enc), axis=1)
X_test_concat

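| | AgeOfVehicle | NumberOfSuppliments | AgeOfPolicyHolder | MonthClaimed | Month | VehiclePrice | PastNumberOfClaims | AgentType_Internal | AddressChange_Claim_1 year | AddressChange_Claim_2 to 3 years | ... | Make_Saturn | Make_Toyota | Make_VW | VehicleCategory_Sedan | VehicleCategory_Sport | VehicleCategory_Utility | Fault_Third Party | BasePolicy_All Perils | BasePolicy_Collision | BasePolicy_Liability |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2497 | 6 | 2 | 5 | 8 | 7 | 3 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 9707 | 6 | 3 | 6 | 7 | 7 | 3 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4989 | 6 | 3 | 6 | 11 | 11 | 3 | 3 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 13894 | 7 | 0 | 8 | 1 | 0 | 2 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 15288 | 6 | 0 | 8 | 5 | 5 | 1 | 0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2436 | 5 | 1 | 5 | 4 | 2 | 2 | 1 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 10295 | 7 | 0 | 6 | 0 | 0 | 2 | 3 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 5339 | 7 | 2 | 6 | 0 | 0 | 2 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4796 | 6 | 2 | 5 | 10 | 10 | 2 | 1 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |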


## 6.2 Predict


### Logistic Regression

- Predict on the validation data, then score the predictions with ROC AUC

In [70]:
y_val_pred = lr_best.predict(X_valid_concat)
y_val_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [72]:
from sklearn.metrics import roc_auc_score

roc = roc_auc_score(y_valid, y_val_pred)

print('ROC  :', roc)

ROC  : 0.49948258020006897
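Note that `roc_auc_score` is usually given continuous scores (for example the positive-class column of `predict_proba`). When it receives hard 0/1 labels from `predict`, as in the cell above, the ROC curve has a single threshold point and the result for a binary target equals the balanced accuracy. A minimal sketch on toy values (the arrays below are hypothetical, not taken from the claim data):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Toy labels and toy predicted fraud probabilities (hypothetical values)
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9])

# Hard labels at the default 0.5 cut-off, as .predict() would return
y_hard = (y_score >= 0.5).astype(int)

auc_hard = roc_auc_score(y_true, y_hard)            # AUC computed from labels
bal_acc = balanced_accuracy_score(y_true, y_hard)   # (TPR + TNR) / 2
auc_score = roc_auc_score(y_true, y_score)          # AUC computed from scores

print("AUC on hard labels :", auc_hard)    # 0.625
print("Balanced accuracy  :", bal_acc)     # 0.625
print("AUC on scores      :", auc_score)   # 0.875
```

For a model such as `lr_best`, the score-based variant would be `roc_auc_score(y_valid, lr_best.predict_proba(X_valid_concat)[:, 1])`. The numbers reported in this section are the label-based variant, so they should be read as balanced accuracy at the 0.5 cut-off.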


### Support Vector Machine

In [73]:
y_val_pred = svm_best.predict(X_valid_concat)
y_val_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [74]:
roc = roc_auc_score(y_valid, y_val_pred)

print('ROC  :', roc)

ROC  : 0.5
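A value of exactly 0.5 from hard labels means the model predicted a single class for every validation row. To rank claims with an SVM, the continuous output of `decision_function` can be passed to `roc_auc_score` instead; `SVC` only offers `predict_proba` when it is created with `probability=True`. The sketch below uses synthetic data and a fresh `SVC` (`svm_demo` is a hypothetical name), since the configuration of `svm_best` is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the claim features (assumption)
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

svm_demo = SVC(kernel="rbf").fit(X_tr, y_tr)

# Signed distance to the decision boundary: a continuous ranking score
svm_scores = svm_demo.decision_function(X_va)

auc_from_scores = roc_auc_score(y_va, svm_scores)
auc_from_labels = roc_auc_score(y_va, svm_demo.predict(X_va))

print("ROC AUC from decision_function :", auc_from_scores)
print("ROC AUC from hard labels       :", auc_from_labels)
```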


### Decision Tree

In [76]:
y_val_pred = dt_best.predict(X_valid_concat)
y_val_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [77]:
roc = roc_auc_score(y_valid, y_val_pred)

print('ROC  :', roc)

ROC  : 0.5324324324324324


### Random Forest

In [78]:
y_val_pred = rf_best.predict(X_valid_concat)
y_val_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [79]:
roc = roc_auc_score(y_valid, y_val_pred)

print('ROC  :', roc)

ROC  : 0.5143190102831359


## 6.3 Best Model on Train Data
---

The decision tree has the highest ROC value on the validation data (about 0.53), so we take the decision tree as the best model.
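The comparison can be made explicit by collecting the validation scores in one place. The values below are copied from the outputs in section 6.2; the variable names are ours.

```python
import pandas as pd

# Validation ROC values reported in section 6.2
valid_roc = pd.Series(
    {
        "Logistic Regression": 0.49948258020006897,
        "Support Vector Machine": 0.5,
        "Decision Tree": 0.5324324324324324,
        "Random Forest": 0.5143190102831359,
    },
    name="valid_roc",
)

# Rank the models and pick the one with the highest score
print(valid_roc.sort_values(ascending=False))
best_model_name = valid_roc.idxmax()
print("Best model:", best_model_name)  # Decision Tree
```

All four values sit close to 0.5, so the margin between the models is small.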

In [80]:
y_train_pred = dt_best.predict(X_train_concat)
y_train_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [81]:
roc = roc_auc_score(y_train, y_train_pred)

print('ROC  :', roc)

ROC  : 0.5189530685920578


In [82]:
result  = pd.crosstab(y_train_pred,
            y_train,
            margins = True)

In [83]:
result

| Predicted (rows) / FraudFound_P (columns) | 0 | 1 | All |
|---|---|---|---|
| 0 | 8698 | 533 | 9231 |
| 1 | 0 | 21 | 21 |
| All | 8698 | 554 | 9252 |
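The crosstab can be read as a confusion matrix, with predictions in the rows and the actual `FraudFound_P` in the columns. A small sketch that derives the usual metrics from the counts above (the variable names are ours):

```python
# Counts copied from the training crosstab (rows = predicted, columns = actual)
tn, fn = 8698, 533   # predicted 0: actual 0, actual 1
fp, tp = 0, 21       # predicted 1: actual 0, actual 1

recall = tp / (tp + fn)            # share of fraud cases that were caught
precision = tp / (tp + fp)         # share of fraud alerts that were correct
specificity = tn / (tn + fp)       # share of non-fraud cases left alone
balanced_accuracy = (recall + specificity) / 2

print(f"recall            : {recall:.4f}")             # 0.0379
print(f"precision         : {precision:.4f}")          # 1.0000
print(f"specificity       : {specificity:.4f}")        # 1.0000
print(f"balanced accuracy : {balanced_accuracy:.4f}")  # 0.5190
```

The balanced accuracy matches the ROC value printed above (0.5189...). The model raises no false alarms but catches only 21 of the 554 fraud cases in the training data, which is the costly kind of error described in section 1.3.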


## 6.4 Best Model on Test Data


In [84]:
y_test_pred = dt_best.predict(X_test_concat)
y_test_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [85]:
roc = roc_auc_score(y_test, y_test_pred)

print('ROC  :', roc)

ROC  : 0.5190217391304348


In [86]:
result  = pd.crosstab(y_test_pred,
            y_test,
            margins = True)

In [87]:
result

| Predicted (rows) / FraudFound_P (columns) | 0 | 1 | All |
|---|---|---|---|
| 0 | 2900 | 177 | 3077 |
| 1 | 0 | 7 | 7 |
| All | 2900 | 184 | 3084 |
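The same reading can be done with scikit-learn's metric functions. In the notebook they would be called directly on `y_test` and `y_test_pred`; the sketch below rebuilds equivalent label vectors from the test crosstab counts so that it runs on its own (the rebuilt arrays are our construction, and row order is not preserved).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Test crosstab counts: 2900 true negatives, 177 false negatives,
# 0 false positives, 7 true positives
y_true_demo = np.repeat([0, 1, 1], [2900, 177, 7])
y_pred_demo = np.repeat([0, 0, 1], [2900, 177, 7])

test_recall = recall_score(y_true_demo, y_pred_demo)
test_precision = precision_score(y_true_demo, y_pred_demo)
test_roc = roc_auc_score(y_true_demo, y_pred_demo)

print(f"recall    : {test_recall:.4f}")     # 0.0380
print(f"precision : {test_precision:.4f}")  # 1.0000
print(f"ROC       : {test_roc:.4f}")        # 0.5190
```

Test recall (7 of 184, about 3.8%) is almost the same as training recall (21 of 554, about 3.8%), so the decision tree behaves consistently on unseen data but still misses most fraudulent claims.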
