# Predicting Fraudulent Claims in Auto Insurance Using SVC Modeling on Virtualized Data

This notebook trains, saves, and deploys a fraud prediction model.

### Contents

- [Setup](#setup)
- [Loading Refined data](#data)
- [Model building](#model)
- [Saving the model](#save)
- [Model Deployment](#deployment)
- [Testing the model](#testing)

## 1. Set Up the Notebook Environment <a name="setup"></a>


### 1.1 Review Use Case


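The analytics use case implemented in this notebook is fraud claim prediction in auto insurance. We virtualized data sets from Db2 Warehouse on Cloud and use them here to build a predictive model. Logistic regression, random forest, and linear SVC classifiers are compared (an XGBoost variant is included but commented out), and the linear SVC pipeline is the one that is saved and deployed.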


### Working with Notebooks

If you are new to Notebooks, here's a quick overview of how to work in this environment.

1. The notebook has two types of cells: markdown (text) and code.
2. Code cells can be run one at a time or all together (see the options under the Cell menu). In this notebook we run one cell at a time, because some cells need code changes first.
3. To run a cell, place the cursor in the code cell and click the Run (arrow) icon. A `*` next to the cell means it is still running. Some cells print output.
4. Work through this notebook by reading the instructions and executing the code cell by cell. Some cells might require modifications before you run them.

### 1.2 Import Packages


In [4]:
# Standard library
import json
import os
import os.path
import sys

# Data handling
import numpy as np
import pandas as pd
from scipy import sparse

# scikit-learn
from sklearn import metrics, preprocessing, svm
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Watson Machine Learning client
from ibm_watson_machine_learning import APIClient

## 2. Add Dataset <a name="data"></a>

Select the cell below, then choose the Insert Pandas DataFrame option to load the data. Make sure the resulting variable is named `df_data_1`.



## 3. Create the Fraud claim prediction Model using Scikit-Learn <a name="model"></a>


From our exploratory data analysis in Data Refinery, we observed that the predictors listed below were significantly correlated with the `FRAUD_REPORTED` target label. The list also includes the target column itself.

### 3.1 Feature Selection

In [6]:
required_columns = ['INSURED_SEX', 'INSURED_OCCUPATION', 'INSURED_HOBBIES',
       'CAPITAL_GAINS', 'CAPITAL_LOSS', 'INCIDENT_TYPE', 'COLLISION_TYPE', 'INCIDENT_SEVERITY',
       'AUTHORITIES_CONTACTED', 'INCIDENT_HOUR_OF_THE_DAY', 'NUMBER_OF_VEHICLES_INVOLVED',
       'WITNESSES', 'TOTAL_CLAIM_AMOUNT', 'FRAUD_REPORTED', 'POLICY_ANNUAL_PREMIUM']

Therefore, we will use only these features to create our initial model.
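Before subsetting `df_data_1`, it can be worth confirming that every name in `required_columns` actually exists in the virtualized table, since a renamed or missing column would otherwise surface as a `KeyError`. The following is a minimal sketch; the helper `missing_columns` and the toy frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def missing_columns(df, expected):
    """Return the expected column names that are absent from df, in order."""
    present = set(df.columns)
    return [col for col in expected if col not in present]


# Toy frame standing in for df_data_1 (values are illustrative only)
toy = pd.DataFrame({"INSURED_SEX": ["FEMALE"], "WITNESSES": [3]})
expected = ["INSURED_SEX", "WITNESSES", "FRAUD_REPORTED"]

print(missing_columns(toy, expected))  # ['FRAUD_REPORTED']
```

In the notebook itself, the equivalent call would be `missing_columns(df_data_1, required_columns)`, which should return an empty list before continuing.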

In [7]:
df1 = df_data_1[required_columns]
df1.head()

| | INSURED_SEX | INSURED_OCCUPATION | INSURED_HOBBIES | CAPITAL_GAINS | CAPITAL_LOSS | INCIDENT_TYPE | COLLISION_TYPE | INCIDENT_SEVERITY | AUTHORITIES_CONTACTED | INCIDENT_HOUR_OF_THE_DAY | NUMBER_OF_VEHICLES_INVOLVED | WITNESSES | TOTAL_CLAIM_AMOUNT | FRAUD_REPORTED | POLICY_ANNUAL_PREMIUM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FEMALE | tech-support | other | 0 | 0 | Multi-vehicle Collision | Side Collision | Major Damage | Fire | 23 | 3 | 3 | 77880 | N | 1003.23 |
| 1 | FEMALE | machine-op-inspct | other | 0 | 0 | Multi-vehicle Collision | Front Collision | Minor Damage | Ambulance | 17 | 3 | 3 | 47080 | N | 987.42 |
| 2 | FEMALE | other-service | other | 58100 | 0 | Single Vehicle Collision | Front Collision | Minor Damage | Fire | 21 | 1 | 1 | 47300 | N | 1355.08 |
| 3 | MALE | machine-op-inspct | other | 0 | -39100 | Vehicle Theft | Unknown | Trivial Damage | Police | 7 | 1 | 2 | 4680 | N | 1344.56 |
| 4 | FEMALE | transport-moving | other | 0 | 0 | Multi-vehicle Collision | Rear Collision | Minor Damage | Police | 1 | 3 | 0 | 31700 | N | 903.32 |


#### Check for missing values

In [8]:
df1.isnull().sum()

INSURED_SEX                    0
INSURED_OCCUPATION             0
INSURED_HOBBIES                0
CAPITAL_GAINS                  0
CAPITAL_LOSS                   0
INCIDENT_TYPE                  0
COLLISION_TYPE                 0
INCIDENT_SEVERITY              0
AUTHORITIES_CONTACTED          0
INCIDENT_HOUR_OF_THE_DAY       0
NUMBER_OF_VEHICLES_INVOLVED    0
WITNESSES                      0
TOTAL_CLAIM_AMOUNT             0
FRAUD_REPORTED                 0
POLICY_ANNUAL_PREMIUM          0
dtype: int64

### 3.2 Encode categorical features

In [9]:
# Collect every text (object-dtype) column except the target for one-hot encoding
columns_to_encode = [
    col for col in df1.columns
    if col != 'FRAUD_REPORTED' and df1[col].dtype == 'object'
]

columns_to_encode

['INSURED_SEX',
 'INSURED_OCCUPATION',
 'INSURED_HOBBIES',
 'INCIDENT_TYPE',
 'COLLISION_TYPE',
 'INCIDENT_SEVERITY',
 'AUTHORITIES_CONTACTED']

In [10]:
df2 = pd.get_dummies(df1, columns = columns_to_encode)

df2.head()

| | CAPITAL_GAINS | CAPITAL_LOSS | INCIDENT_HOUR_OF_THE_DAY | NUMBER_OF_VEHICLES_INVOLVED | WITNESSES | TOTAL_CLAIM_AMOUNT | FRAUD_REPORTED | POLICY_ANNUAL_PREMIUM | INSURED_SEX_FEMALE | INSURED_SEX_MALE | ... | COLLISION_TYPE_Unknown | INCIDENT_SEVERITY_Major Damage | INCIDENT_SEVERITY_Minor Damage | INCIDENT_SEVERITY_Total Loss | INCIDENT_SEVERITY_Trivial Damage | AUTHORITIES_CONTACTED_Ambulance | AUTHORITIES_CONTACTED_Fire | AUTHORITIES_CONTACTED_None | AUTHORITIES_CONTACTED_Other | AUTHORITIES_CONTACTED_Police |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 23 | 3 | 3 | 77880 | N | 1003.23 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 17 | 3 | 3 | 47080 | N | 987.42 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 58100 | 0 | 21 | 1 | 1 | 47300 | N | 1355.08 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | -39100 | 7 | 1 | 2 | 4680 | N | 1344.56 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 | 3 | 0 | 31700 | N | 903.32 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
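The output above shows 0/1 indicator columns. In pandas 2.0 and later, `pd.get_dummies` returns boolean columns by default, so a rerun on a newer runtime may show `True`/`False` instead. If integer indicators are wanted regardless of version, passing `dtype=int` is one option. A minimal sketch on a toy frame (the values are illustrative, not taken from the data set):

```python
import pandas as pd

# Toy frame with one categorical and one numeric column
toy = pd.DataFrame({
    "INCIDENT_SEVERITY": ["Major Damage", "Minor Damage", "Major Damage"],
    "WITNESSES": [3, 1, 2],
})

# dtype=int forces 0/1 integer indicator columns on any pandas version
encoded = pd.get_dummies(toy, columns=["INCIDENT_SEVERITY"], dtype=int)

print(encoded.columns.tolist())
print(encoded["INCIDENT_SEVERITY_Major Damage"].tolist())  # [1, 0, 1]
```

Untouched columns come first and the new indicator columns are appended, which matches the column order seen in `df2.head()`.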


#### Convert target label from Y/N to 1/0

In [11]:
# Map Y/N to 1/0; astype(int) raises if any other value is present
df2['FRAUD_REPORTED'] = df2['FRAUD_REPORTED'].map({'Y': 1, 'N': 0}).astype(int)

#### Features and Target

In [12]:
# Every column except the target is a model input
features = [col for col in df2.columns if col != 'FRAUD_REPORTED']

target = 'FRAUD_REPORTED'

X = df2[features]
y = df2[target]

#### Split the dataset into training and testing data

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
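The call above relies on the default `test_size` of 0.25 and does not stratify on the target. Because fraud is the minority class, one variation (an assumption here, not what the notebook does) is to pass `stratify=y` so that the train and test sets keep the same fraud rate. A minimal sketch on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows with a 20% positive class
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"x": rng.normal(size=1000)})
y_toy = pd.Series([1] * 200 + [0] * 800)

# stratify=y_toy keeps the class ratio identical in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

Note that switching to a stratified split would change which rows land in the test set, so the scores reported in this notebook would no longer be reproduced exactly.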

### 3.3 Modeling

#### Logistic Regression

In [14]:
lr = LogisticRegression()
pipeline = Pipeline([
        ('scale', StandardScaler()),
        ('clf', lr)])
pipeline.fit(X_train, y_train)

preds = pipeline.predict(X_test)

print(accuracy_score(y_test, preds))
print(classification_report(y_test, preds))

0.848
              precision    recall  f1-score   support

           0       0.90      0.91      0.90       193
           1       0.67      0.65      0.66        57

    accuracy                           0.85       250
   macro avg       0.79      0.78      0.78       250
weighted avg       0.85      0.85      0.85       250
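To put the 0.848 accuracy in context, the support column shows an imbalanced test set (193 non-fraud, 57 fraud), so a model that always predicts "no fraud" already scores fairly well on accuracy alone. The short calculation below uses only the numbers printed in the report above; the fraud count is approximate because the reported recall is rounded to two decimals.

```python
# Class support copied from the classification report above
support = {0: 193, 1: 57}
total = sum(support.values())

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(support.values()) / total
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")

# Approximate number of fraud claims detected, from the rounded recall of 0.65
reported_recall_fraud = 0.65
approx_fraud_caught = round(reported_recall_fraud * support[1])
print(f"approx. fraud claims detected: {approx_fraud_caught} of {support[1]}")
```

This is why the per-class recall for class 1 is a more informative figure than overall accuracy when comparing the classifiers in this section.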



#### Random Forest Classifier

In [15]:
random_forest = RandomForestClassifier()
pipeline = Pipeline([
        ('scale', StandardScaler()),
        ('clf', random_forest)])
pipeline.fit(X_train, y_train)

preds = pipeline.predict(X_test)

print(accuracy_score(y_test, preds))
print(classification_report(y_test, preds))

0.824
              precision    recall  f1-score   support

           0       0.90      0.87      0.88       193
           1       0.60      0.68      0.64        57

    accuracy                           0.82       250
   macro avg       0.75      0.77      0.76       250
weighted avg       0.83      0.82      0.83       250



#### XGBoost Classifier

In [16]:
# import xgboost as xgb

# xgb_model = xgb.XGBClassifier(max_depth=5, learning_rate=0.01, objective= 'binary:logistic',n_jobs=-1)
# xgb_model.fit(X_train, y_train)
# predictions_test = xgb_model.predict(X_test)

# print(accuracy_score(y_test, predictions_test))
# print(classification_report(y_test, predictions_test))

#### Linear SVC (Support Vector Classifier)

In [17]:
from sklearn.svm import SVC
pipeline = Pipeline([
        ('scale', StandardScaler()),
        ('clf', SVC(kernel = 'linear'))])
pipeline.fit(X_train, y_train)

preds = pipeline.predict(X_test)

print(accuracy_score(y_test, preds))
print(classification_report(y_test, preds))

0.84
              precision    recall  f1-score   support

           0       0.93      0.85      0.89       193
           1       0.62      0.79      0.69        57

    accuracy                           0.84       250
   macro avg       0.77      0.82      0.79       250
weighted avg       0.86      0.84      0.85       250
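Because the one-hot encoding was done with `pd.get_dummies` before the pipeline, the pipeline itself expects already-encoded columns, as the scoring payload later in this notebook shows. One alternative (not what this notebook does) is to move the encoding into the pipeline with a `ColumnTransformer`, so the model accepts raw categorical values. A minimal sketch on a small synthetic frame; the rows and labels are invented for illustration only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Synthetic rows; the labels are made up and carry no real signal
toy = pd.DataFrame({
    "INCIDENT_SEVERITY": ["Major Damage", "Minor Damage", "Major Damage",
                          "Trivial Damage", "Minor Damage", "Major Damage"],
    "TOTAL_CLAIM_AMOUNT": [77880, 47080, 60000, 4680, 31700, 82800],
    "FRAUD_REPORTED": [1, 0, 1, 0, 0, 1],
})
categorical = ["INCIDENT_SEVERITY"]
numeric = ["TOTAL_CLAIM_AMOUNT"]

# Encode categoricals and scale numerics inside the pipeline
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])
raw_pipeline = Pipeline([("prep", preprocess), ("clf", SVC(kernel="linear"))])
raw_pipeline.fit(toy[categorical + numeric], toy["FRAUD_REPORTED"])

# A category unseen during training is encoded as all zeros instead of failing
new_claim = pd.DataFrame({"INCIDENT_SEVERITY": ["Total Loss"],
                          "TOTAL_CLAIM_AMOUNT": [50000]})
prediction = raw_pipeline.predict(new_claim)
print(prediction)
```

The trade-off is that the stored model and its scoring payload would then use the original column names rather than the expanded indicator columns.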



## 4. Save the model <a name="save"></a>


### 4.1 Input your WML Credentials


In [18]:
WML_CREDENTIALS = {
"token": os.environ['USER_ACCESS_TOKEN'],
"instance_id" : "wml_local",
"url" : os.environ['RUNTIME_ENV_APSX_URL'],
"version": "4.0"
}
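The cell above reads two environment variables and raises a bare `KeyError` if either is missing, which can happen when the notebook runs outside the platform runtime. The sketch below wraps the same dictionary in a hypothetical helper, `build_wml_credentials`, that reports which variables are absent. The fake environment and example URL are placeholders.

```python
import os


def build_wml_credentials(env=None):
    """Build the WML credentials dict, failing clearly if variables are missing."""
    env = os.environ if env is None else env
    required = ("USER_ACCESS_TOKEN", "RUNTIME_ENV_APSX_URL")
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "token": env["USER_ACCESS_TOKEN"],
        "instance_id": "wml_local",
        "url": env["RUNTIME_ENV_APSX_URL"],
        "version": "4.0",
    }


# Placeholder values for demonstration; real values come from the runtime
fake_env = {
    "USER_ACCESS_TOKEN": "dummy-token",
    "RUNTIME_ENV_APSX_URL": "https://cpd.example.com",
}
credentials = build_wml_credentials(fake_env)
print(credentials["url"])
```

Inside the platform, calling `build_wml_credentials()` with no argument falls back to `os.environ` and produces the same dictionary as the cell above.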

### 4.2 Setup Watson Machine Learning Client 

In [19]:
wml_client = APIClient(WML_CREDENTIALS)
wml_client.spaces.list()


Note: 'limit' is not provided. Only first 50 records will be displayed if the number of records exceed 50
------------------------------------  ---------------------------------  ------------------------
ID                                    NAME                               CREATED
792f03b9-e291-4d4e-82c5-cc7af74cb213  Fraud prediction model deployment  2022-06-10T22:38:49.184Z
------------------------------------  ---------------------------------  ------------------------


### 4.3 Configuration

Now the model can be saved for future deployment. The model will be saved using the Watson Machine Learning client, to a deployment space.

**<font color='red'> UPDATE THE VARIABLE 'MODEL_NAME' TO A UNIQUE NAME</font>**

**<font color='red'> UPDATE THE VARIABLE 'dep_name' TO THE NAME OF THE DEPLOYMENT SPACE DISPLAYED IN THE LAST CELL OUTPUT</font>**

**<font color='red'> NOTE: If you have not completed the AutoAI tutorial in this learning path, you might not have created the deployment space. Follow step 3.3 in this tutorial: https://github.com/ibm-hcbt/cp4d-assets/blob/511667b796031bff325c3cb8672973407b2026ef/Fraud_claim_use_case/3A.%20Build%20model%20using%20AutoAI.md</font>**

In [20]:
MODEL_NAME="fraud_prediction"
DEPLOYMENT_NAME="fraud_prediction_deployment"

# Enter the name of the deployment space associated with your project
dep_name="Fraud prediction model deployment"

In [21]:
project_id = os.environ['PROJECT_ID']
wml_client.set.default_project(project_id)

'SUCCESS'

In [22]:
meta_props={
 wml_client.repository.ModelMetaNames.NAME: MODEL_NAME,
 wml_client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: wml_client.software_specifications.get_id_by_name("runtime-22.1-py3.9"),
 wml_client.repository.ModelMetaNames.TYPE: "scikit-learn_1.0",
}

In [23]:
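def get_id_from_space_name(client, space_name):
    """Return the ID of the deployment space called `space_name`."""
    spaces = client.spaces.get_details()
    for item in spaces['resources']:
        if item['entity']['name'] == space_name:
            return item['metadata']['id']
    # Fail early with a clear message instead of returning an empty list
    raise ValueError(f"No deployment space named {space_name!r} was found")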

The name of your deployment space is read from the variable `dep_name`, and the matching space ID is stored in `space_id`.


In [24]:

space_id = get_id_from_space_name(wml_client, dep_name)

In [25]:
space_id

'792f03b9-e291-4d4e-82c5-cc7af74cb213'

In [26]:
wml_client.set.default_space(space_id)

Unsetting the project_id ...


'SUCCESS'

### 4.4 Store the model

In [27]:
deploy_meta = {
     wml_client.deployments.ConfigurationMetaNames.NAME: DEPLOYMENT_NAME,
     wml_client.deployments.ConfigurationMetaNames.ONLINE: {}
 }

In [28]:
# Store the fitted pipeline (the linear SVC from In [17]) in the WML repository
published_model = wml_client.repository.store_model(
    pipeline,
    meta_props=meta_props,
    training_data=X_train,
    training_target=y_train,
)

At this point you can verify the stored model by going to the deployment space you created earlier. You will be able to see the model listed in the assets tab.

In [29]:
published_model_id = wml_client.repository.get_model_id(published_model)

## 5. Deploy the model <a name="deployment"></a>

In [30]:
## Create a Deployment for your stored model

created_deployment = wml_client.deployments.create(published_model_id, meta_props=deploy_meta)



#######################################################################################

Synchronous deployment creation for uid: '6040e9aa-2f06-46a5-a875-00abd41b178e' started

#######################################################################################


initializing
Note: online_url is deprecated and will be removed in a future release. Use serving_urls instead.
.
ready


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='2b714dcb-2687-4dfd-bf41-f52402f2d32d'
------------------------------------------------------------------------------------------------




At this point you can verify the deployed model by going to the deployment space you created earlier. The model deployment is listed in the Deployments tab, with a green tick once it has deployed successfully.

In [31]:
scoring_endpoint = None
deployment_uid=created_deployment['metadata']['id']

## 6. Testing the deployed model <a name="testing"></a>

In [32]:
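fields = list(X_test.columns)
score = X_test.head(20)
# Transpose the per-column lists into one list per row; per-column tolist() keeps integer columns as plain ints
scoring_data = [list(row) for row in zip(*(score[col].tolist() for col in score.columns))]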
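The zip-based expression above may look roundabout compared with `score.values.tolist()`, but the two are not equivalent when the frame mixes integer and float columns: going through `.values` first converts the whole frame to one float array. The sketch below illustrates the difference on a two-column toy frame whose values are taken from the sample rows shown earlier.

```python
import json

import pandas as pd

# Toy frame with one integer and one float column
toy = pd.DataFrame({
    "CAPITAL_GAINS": [0, 58100],
    "POLICY_ANNUAL_PREMIUM": [865.33, 1355.08],
})

# Column-wise tolist() then transpose: each value keeps its column's type
rows_by_column = [list(row) for row in zip(*(toy[col].tolist() for col in toy.columns))]

# Whole-frame conversion: integers are upcast to floats
rows_by_values = toy.values.tolist()

print(rows_by_column)  # [[0, 865.33], [58100, 1355.08]]
print(rows_by_values)  # [[0.0, 865.33], [58100.0, 1355.08]]

# Both forms are plain Python lists and serialize to JSON
payload = {"input_data": [{"fields": list(toy.columns), "values": rows_by_column}]}
print(json.dumps(payload))
```

Either form can be scored; the column-wise version simply keeps the printed payload closer to the original data.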

In [33]:
job_payload = {
wml_client.deployments.ScoringMetaNames.INPUT_DATA: [{
 'values': scoring_data
}]
}
print(job_payload)

{'input_data': [{'values': [[0, 0, 7, 3, 1, 59040, 865.33, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0], [0, -13200, 22, 1, 3, 82800, 1609.67, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0], [43400, -91200, 12, 1, 1, 89700, 1239.22, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0], [0, 0, 13, 4, 3, 65070, 1451.54, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1], [0, -45700, 11, 3, 0, 57970, 1446.98, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0], [37300, -31700, 16, 3, 3, 28800, 1497.35, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0], [64800, -44200, 17, 1, 0, 73260, 1209.63, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0

In [34]:
scoring_response = wml_client.deployments.score(deployment_uid, job_payload)

print(scoring_response)

{'predictions': [{'fields': ['prediction'], 'values': [[1], [1], [0], [0], [0], [0], [1], [1], [0], [1], [0], [0], [1], [1], [1], [0], [0], [1], [0], [0]]}]}
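The response nests each prediction in its own single-element list under `predictions[0]['values']`. To work with the results as a flat list, a small helper can unpack them. The sketch below uses the response printed above as a literal; `extract_predictions` is a hypothetical helper name, not part of the WML client.

```python
# Response literal copied from the output above
scoring_response = {'predictions': [{'fields': ['prediction'], 'values': [
    [1], [1], [0], [0], [0], [0], [1], [1], [0], [1],
    [0], [0], [1], [1], [1], [0], [0], [1], [0], [0]]}]}


def extract_predictions(response):
    """Flatten the 'prediction' field of the first prediction block."""
    block = response["predictions"][0]
    idx = block["fields"].index("prediction")
    return [row[idx] for row in block["values"]]


predicted = extract_predictions(scoring_response)
print(predicted)
print(f"{sum(predicted)} of {len(predicted)} claims flagged as fraud")
```

In the notebook, the flat list could then be compared against `y_test.head(20).tolist()` to see how the deployed model does on these rows.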


In [35]:
job_payload_ui = {
wml_client.deployments.ScoringMetaNames.INPUT_DATA: [{
 "fields": fields,
 "values": scoring_data
}]
}
print(json.dumps(job_payload_ui))

{"input_data": [{"fields": ["CAPITAL_GAINS", "CAPITAL_LOSS", "INCIDENT_HOUR_OF_THE_DAY", "NUMBER_OF_VEHICLES_INVOLVED", "WITNESSES", "TOTAL_CLAIM_AMOUNT", "POLICY_ANNUAL_PREMIUM", "INSURED_SEX_FEMALE", "INSURED_SEX_MALE", "INSURED_OCCUPATION_adm-clerical", "INSURED_OCCUPATION_armed-forces", "INSURED_OCCUPATION_craft-repair", "INSURED_OCCUPATION_exec-managerial", "INSURED_OCCUPATION_farming-fishing", "INSURED_OCCUPATION_handlers-cleaners", "INSURED_OCCUPATION_machine-op-inspct", "INSURED_OCCUPATION_other-service", "INSURED_OCCUPATION_priv-house-serv", "INSURED_OCCUPATION_prof-specialty", "INSURED_OCCUPATION_protective-serv", "INSURED_OCCUPATION_sales", "INSURED_OCCUPATION_tech-support", "INSURED_OCCUPATION_transport-moving", "INSURED_HOBBIES_chess", "INSURED_HOBBIES_cross-fit", "INSURED_HOBBIES_other", "INCIDENT_TYPE_Multi-vehicle Collision", "INCIDENT_TYPE_Parked Car", "INCIDENT_TYPE_Single Vehicle Collision", "INCIDENT_TYPE_Vehicle Theft", "COLLISION_TYPE_Front Collision", "COLLISION_

Copy the JSON printed above and paste it into the `Enter Input data` box to test the deployed model. The results should match the predictions shown below.

In [36]:
scoring_response = wml_client.deployments.score(deployment_uid, job_payload_ui)

print(scoring_response)

{'predictions': [{'fields': ['prediction'], 'values': [[1], [1], [0], [0], [0], [0], [1], [1], [0], [1], [0], [0], [1], [1], [1], [0], [0], [1], [0], [0]]}]}


## Congratulations!

You have finished running the notebook for training, saving, and deploying a fraud claim prediction model. To view the deployed model, go to the Project and select the `Settings` tab. Under `Associated deployment space`, choose the space you created and click `Open`. Select the `Deployments` tab and click your deployment to open it, then use the Test tab to test the model.
