# Predicting Fraud Claims in Auto Insurance Using SVC Modeling on Virtualized Data

This notebook trains, saves, and deploys a fraud prediction model.

### Contents

- [Setup](#setup)
- [Loading Refined data](#data)
- [Model building](#model)
- [Saving the model](#save)
- [Model Deployment](#deployment)
- [Testing the model](#testing)

## 1. Set Up the Notebook Environment <a name="setup"></a>


### 1.1 Review Use Case


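The analytics use case implemented in this notebook is fraud claim prediction in auto insurance. We virtualized data sets from DB2Warehouse in Cloud and use them here to build a predictive model. Several classifiers are compared (logistic regression, random forest, XGBoost, and a linear SVC); the linear SVC pipeline is the one that is saved and deployed.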


### Working with Notebooks

If you are new to Notebooks, here's a quick overview of how to work in this environment.

1. The notebook has two types of cells: markdown (text) and code.
2. Code cells can be executed one at a time or all together (see the options under the Cell menu). In this notebook we run one cell at a time, because some cells need code changes first.
3. To run a cell, position the cursor in the code cell and click the Run (arrow) icon. A cell is running while a * is shown next to it. Some cells produce printed output.
4. Work through this notebook by reading the instructions and executing the code cell by cell. Some cells might require modifications before you run them.

### 1.2 Install the necessary packages


### Scikit-learn version 0.22

In [None]:
!pip install scikit-learn==0.22.0

### Watson Machine Learning Python SDK


In [None]:
!pip install --upgrade watson-machine-learning-client-V4==1.0.93 | tail -n 1

### Action: restart the kernel!
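
After the restart, it can help to confirm that the kernel actually sees the pinned version. The following is a minimal sketch using only the standard library. It assumes Python 3.8 or later for `importlib.metadata` (on an older kernel, `sklearn.__version__` gives the same information), and the helper names `installed_version` and `matches_pin` are illustrative, not part of the notebook.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(version, pin):
    """True if `version` equals `pin` or is a patch release of it (0.22.0 for 0.22)."""
    return version is not None and (version == pin or version.startswith(pin + "."))


found = installed_version("scikit-learn")
print("scikit-learn:", found, "| matches 0.22:", matches_pin(found, "0.22"))
```

If the second value printed is `False`, the install cell probably ran in a different environment or the kernel was not restarted.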

### 1.3 Import Packages


In [1]:
import json
import os
import os.path
import sys

import numpy as np
import pandas as pd
from scipy import sparse

from sklearn import metrics, preprocessing, svm
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

from watson_machine_learning_client import WatsonMachineLearningAPIClient

## 2. Add Dataset <a name="data"></a>

Select the empty code cell below, then choose the Insert Pandas DataFrame option for the data set. Ensure that the resulting variable is named `df_data_1`, because the cells that follow refer to it by that name.



## 3. Create the Fraud claim prediction Model using Scikit-Learn <a name="model"></a>


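From our exploratory data analysis using Data Refinery, we observed that the predictors listed below were significantly correlated with the `fraud_reported` target label. The list also contains the target column itself, so that it is carried along into the working DataFrame.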

### 3.1 Feature Selection

In [3]:
required_columns = ['insured_sex', 'insured_occupation', 'insured_hobbies',
       'capital_gains', 'capital_loss', 'incident_type', 'collision_type', 'incident_severity',
       'authorities_contacted', 'incident_hour_of_the_day', 'number_of_vehicles_involved',
       'witnesses', 'total_claim_amount', 'fraud_reported', 'policy_annual_premium']

Therefore, we will use only these features to create our initial model.

In [4]:
df1 = df_data_1[required_columns]
df1.head()

Unnamed: 0,insured_sex,insured_occupation,insured_hobbies,capital_gains,capital_loss,incident_type,collision_type,incident_severity,authorities_contacted,incident_hour_of_the_day,number_of_vehicles_involved,witnesses,total_claim_amount,fraud_reported,policy_annual_premium
0,FEMALE,craft-repair,other,0,-36600,Single Vehicle Collision,Rear Collision,Minor Damage,Ambulance,16,1,3,45180,N,1416.08
1,MALE,machine-op-inspct,other,67800,-48600,Multi-vehicle Collision,Side Collision,Total Loss,Police,21,3,3,83160,N,1356.64
2,FEMALE,adm-clerical,other,0,-48800,Vehicle Theft,Unknown,Trivial Damage,Police,16,1,0,7590,N,1074.99
3,FEMALE,craft-repair,cross-fit,0,-36400,Vehicle Theft,Unknown,Minor Damage,Police,7,1,1,3900,Y,1200.33
4,FEMALE,sales,other,0,0,Multi-vehicle Collision,Rear Collision,Minor Damage,Other,21,3,0,62900,N,1441.6


#### Check for missing values

In [5]:
df1.isnull().sum()

insured_sex                    0
insured_occupation             0
insured_hobbies                0
capital_gains                  0
capital_loss                   0
incident_type                  0
collision_type                 0
incident_severity              0
authorities_contacted          0
incident_hour_of_the_day       0
number_of_vehicles_involved    0
witnesses                      0
total_claim_amount             0
fraud_reported                 0
policy_annual_premium          0
dtype: int64

### 3.2 Encode categorical features

In [6]:
columns_to_encode = []
for col in df1.columns:
    if col != 'fraud_reported':
      if df1[col].dtype == 'object':
        columns_to_encode.append(col)

columns_to_encode

['insured_sex',
 'insured_occupation',
 'insured_hobbies',
 'incident_type',
 'collision_type',
 'incident_severity',
 'authorities_contacted']

In [7]:
df2 = pd.get_dummies(df1, columns = columns_to_encode)

df2.head()

Unnamed: 0,capital_gains,capital_loss,incident_hour_of_the_day,number_of_vehicles_involved,witnesses,total_claim_amount,fraud_reported,policy_annual_premium,insured_sex_FEMALE,insured_sex_MALE,...,collision_type_Unknown,incident_severity_Major Damage,incident_severity_Minor Damage,incident_severity_Total Loss,incident_severity_Trivial Damage,authorities_contacted_Ambulance,authorities_contacted_Fire,authorities_contacted_None,authorities_contacted_Other,authorities_contacted_Police
0,0,-36600,16,1,3,45180,N,1416.08,1,0,...,0,0,1,0,0,1,0,0,0,0
1,67800,-48600,21,3,3,83160,N,1356.64,0,1,...,0,0,0,1,0,0,0,0,0,1
2,0,-48800,16,1,0,7590,N,1074.99,1,0,...,1,0,0,0,1,0,0,0,0,1
3,0,-36400,7,1,1,3900,Y,1200.33,1,0,...,1,0,1,0,0,0,0,0,0,1
4,0,0,21,3,0,62900,N,1441.6,1,0,...,0,0,1,0,0,0,0,0,1,0
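
One consequence of encoding with `pd.get_dummies` is that the model expects exactly the dummy columns seen at training time. A later batch of claims that lacks some category would produce fewer columns. The sketch below shows one way to align such a batch, assuming toy data and illustrative names (`train`, `new_batch`, `train_columns`) that are not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the training frame and a later batch of claims.
train = pd.DataFrame({
    "witnesses": [3, 0, 1],
    "incident_severity": ["Minor Damage", "Total Loss", "Trivial Damage"],
})
new_batch = pd.DataFrame({
    "witnesses": [2],
    "incident_severity": ["Total Loss"],
})

train_encoded = pd.get_dummies(train, columns=["incident_severity"])
train_columns = list(train_encoded.columns)

# Encoding the new batch alone yields only the categories it contains,
# so reindex it onto the training columns and fill the gaps with 0.
new_encoded = pd.get_dummies(new_batch, columns=["incident_severity"])
new_aligned = new_encoded.reindex(columns=train_columns, fill_value=0)

print(new_aligned)
```

Reindexing also drops any dummy column for a category that was never seen in training, which keeps the column order identical to what the model was fitted on.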


#### Convert target label from Y/N to 1/0

In [8]:
df2['fraud_reported'] = df2['fraud_reported'].str.replace('Y', '1')
df2['fraud_reported'] = df2['fraud_reported'].str.replace('N', '0')
df2['fraud_reported'] = df2['fraud_reported'].astype(int)
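
An alternative to the chained string replacements is an explicit mapping, which makes unexpected labels easy to detect. This is a sketch on a toy Series, not a change to the notebook's cell; the variable names `labels` and `encoded` are illustrative.

```python
import pandas as pd

labels = pd.Series(["N", "Y", "N", "Y"], name="fraud_reported")

# Map each label explicitly; anything other than 'Y' or 'N' becomes NaN.
encoded = labels.map({"Y": 1, "N": 0})

# Fail loudly on unexpected labels before converting to integers.
if encoded.isnull().any():
    raise ValueError("unexpected value in fraud_reported")

encoded = encoded.astype(int)
print(encoded.tolist())  # [0, 1, 0, 1]
```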

#### Features and Target

In [9]:
features = []
for col in df2.columns:
  if col != 'fraud_reported':
    features.append(col)

target = 'fraud_reported'

X = df2[features]
y = df2[target]

#### Split the dataset into training and testing data

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
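
The call above relies on the default `test_size` of 0.25 and does not stratify. Because fraud cases are the minority class, a stratified split can keep the class ratio the same in both partitions. The sketch below illustrates this on synthetic data; the `_demo` names are hypothetical and chosen so that they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 25% positive labels.
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.array([1] * 250 + [0] * 750)

# stratify=y_demo keeps the share of positives (almost) equal in both parts.
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=1
)

print(len(y_te_demo), round(y_tr_demo.mean(), 3), round(y_te_demo.mean(), 3))
```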

### 3.3 Modeling

#### Logistic Regression

In [11]:
lr = LogisticRegression()
pipeline = Pipeline([
        ('scale', StandardScaler()),
        ('clf', lr)])
pipeline.fit(X_train, y_train)

preds = pipeline.predict(X_test)

print(accuracy_score(y_test, preds))
print(classification_report(y_test, preds))

0.856
              precision    recall  f1-score   support

           0       0.90      0.91      0.90       184
           1       0.73      0.71      0.72        66

    accuracy                           0.86       250
   macro avg       0.82      0.81      0.81       250
weighted avg       0.85      0.86      0.86       250
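
Because the test set is imbalanced (184 non-fraud claims against 66 fraud claims), accuracy is easier to interpret next to the accuracy of always predicting the majority class. A small worked calculation using the support counts from the report above:

```python
# Support counts taken from the classification report above.
support = {0: 184, 1: 66}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(majority_class, round(baseline_accuracy, 3))  # 0 0.736
```

Any model in this section should therefore be compared against roughly 0.74 accuracy, and the recall on class 1 is the more telling number for fraud detection.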



#### Random Forest Classifier

In [12]:
random_forest = RandomForestClassifier()
pipeline = Pipeline([
        ('scale', StandardScaler()),
        ('clf', random_forest)])
pipeline.fit(X_train, y_train)

preds = pipeline.predict(X_test)

print(accuracy_score(y_test, preds))
print(classification_report(y_test, preds))

0.836
              precision    recall  f1-score   support

           0       0.88      0.90      0.89       184
           1       0.70      0.67      0.68        66

    accuracy                           0.84       250
   macro avg       0.79      0.78      0.79       250
weighted avg       0.83      0.84      0.83       250



#### XGBoost Classifier

In [13]:
import xgboost as xgb

xgb_model = xgb.XGBClassifier(max_depth=5, learning_rate=0.01, objective= 'binary:logistic',n_jobs=-1)
xgb_model.fit(X_train, y_train)
predictions_test = xgb_model.predict(X_test)

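# accuracy_score and classification_report both expect (y_true, y_pred).
# The report recorded below was generated with the two arguments reversed,
# so it shows precision and recall exchanged and support counted over predicted labels.
print(accuracy_score(y_test, predictions_test))
print(classification_report(y_test, predictions_test))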

0.84
              precision    recall  f1-score   support

           0       0.90      0.89      0.89       186
           1       0.68      0.70      0.69        64

    accuracy                           0.84       250
   macro avg       0.79      0.80      0.79       250
weighted avg       0.84      0.84      0.84       250



#### Linear SVC (Support Vector Classifier)

In [14]:
from sklearn.svm import SVC
pipeline = Pipeline([
        ('scale', StandardScaler()),
        ('clf', SVC(kernel = 'linear'))])
pipeline.fit(X_train, y_train)

preds = pipeline.predict(X_test)

print(accuracy_score(y_test, preds))
print(classification_report(y_test, preds))

0.884
              precision    recall  f1-score   support

           0       0.96      0.88      0.92       184
           1       0.72      0.91      0.81        66

    accuracy                           0.88       250
   macro avg       0.84      0.89      0.86       250
weighted avg       0.90      0.88      0.89       250
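
The four models above are compared on a single train/test split, so differences of a few points may partly reflect that particular split. Cross-validation gives a steadier comparison. The following sketch runs on synthetic data from `make_classification`, not on the claims data, and the names `candidates` and `cv_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded claims data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)

candidates = {
    "logistic_regression": LogisticRegression(),
    "linear_svc": SVC(kernel="linear"),
}

cv_scores = {}
for name, estimator in candidates.items():
    pipe = Pipeline([("scale", StandardScaler()), ("clf", estimator)])
    # 5-fold cross-validated accuracy; the scaler is refit inside each fold.
    cv_scores[name] = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
    print(name, round(cv_scores[name].mean(), 3))
```

Passing the whole pipeline to `cross_val_score` keeps the scaling step inside each fold, so no statistics from the held-out fold leak into training.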



## 4. Save the model <a name="save"></a>


### 4.1 Configuration

Steps to complete before running the below cell:

1. Right Click on the project name in the upper left section of the screen
2. Click on the tab where the project is opened
3. Click on Settings tab
4. Click on `Associate a deployment Space`
5. Enter `fraud_prediction_deployment_space` in the deployment space name
6. Click on `Associate` to associate the `fraud_prediction_deployment_space` deployment space to the project

Now the model can be saved for future deployment. The model will be saved using the Watson Machine Learning client, to a deployment space.

**<font color='red'> UPDATE THE VARIABLE 'MODEL_NAME' TO A UNIQUE NAME</font>**

**<font color='red'> UPDATE THE VARIABLE 'dep_name' TO THE NAME OF THE DEPLOYMENT SPACE CREATED PREVIOUSLY</font>**

In [15]:
MODEL_NAME="fraud_prediction"
DEPLOYMENT_NAME="fraud_prediction_deployment"

# Enter the name of the deployment space you associated with the project
dep_name="fraud_prediction_deployment_space"

### 4.2 Input your WML Credentials


In [16]:
WML_CREDENTIALS = {
"token": os.environ['USER_ACCESS_TOKEN'],
"instance_id" : "wml_local",
"url" : os.environ['RUNTIME_ENV_APSX_URL'],
"version": "3.0.0"
}
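
The credentials cell and the project cell further down read `USER_ACCESS_TOKEN`, `RUNTIME_ENV_APSX_URL`, and `PROJECT_ID` from the environment, and raise a `KeyError` if one is missing. A minimal sketch of a pre-check, assuming a hypothetical helper `missing_env_vars` that is not part of the notebook or the WML client:

```python
import os

REQUIRED_ENV_VARS = ("USER_ACCESS_TOKEN", "RUNTIME_ENV_APSX_URL", "PROJECT_ID")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in `environ` (default: os.environ)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Not set in this runtime:", ", ".join(missing))
```

Outside the platform's notebook runtime these variables are normally absent, which is a quick way to tell that the notebook is running somewhere the credentials cell will not work as written.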

### 4.3 Setup Watson Machine Learning Client 

In [17]:
client = WatsonMachineLearningAPIClient(WML_CREDENTIALS)


In [18]:
meta_props={
 client.repository.ModelMetaNames.NAME: MODEL_NAME,
 client.repository.ModelMetaNames.RUNTIME_UID: "scikit-learn_0.22-py3.6",
 client.repository.ModelMetaNames.TYPE: "scikit-learn_0.22",
}

In [19]:
project_id = os.environ['PROJECT_ID']
client.set.default_project(project_id)

'SUCCESS'

In [20]:
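def guid_from_space_name(client, space_name):
    """Return the GUID of the deployment space named `space_name`.

    Returns an empty list if no space with that name exists.
    """
    # Service instance details are fetched but not used further.
    instance_details = client.service_instance.get_details()

    space = client.spaces.get_details()
    res = []
    for item in space['resources']:
        if item['entity']["name"] == space_name:
            res = item['metadata']['guid']

    return res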

The name of the deployment space associated with the current project is read from the variable `dep_name`, and the GUID of the matching space is stored in `space_uid`.
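
Note that the helper above returns an empty list when no space matches, and that value would then be passed on to `client.set.default_space`. The sketch below is a stricter variant that raises instead. It works on a plain dictionary, so it runs without a WML client; `find_space_guid` and the sample GUIDs are hypothetical, and unlike the notebook's helper it returns the first match rather than the last.

```python
def find_space_guid(spaces_details, space_name):
    """Return the GUID of the space called `space_name`, or raise if absent."""
    for item in spaces_details.get("resources", []):
        if item["entity"]["name"] == space_name:
            return item["metadata"]["guid"]
    raise LookupError("no deployment space named %r" % space_name)


# Hand-written stand-in for the dictionary returned by client.spaces.get_details(),
# reduced to the keys that the notebook's helper reads.
sample_details = {
    "resources": [
        {"metadata": {"guid": "guid-aaa"}, "entity": {"name": "other_space"}},
        {"metadata": {"guid": "guid-bbb"},
         "entity": {"name": "fraud_prediction_deployment_space"}},
    ]
}

print(find_space_guid(sample_details, "fraud_prediction_deployment_space"))  # guid-bbb
```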


In [21]:

space_uid = guid_from_space_name(client, dep_name)

In [22]:
space_uid

'0c2a7455-5905-4906-bc37-b217dca1a6cf'

In [23]:
client.set.default_space(space_uid)

Unsetting the project_id ...


'SUCCESS'

### 4.4 Store the model

In [24]:
deploy_meta = {
     client.deployments.ConfigurationMetaNames.NAME: DEPLOYMENT_NAME,
     client.deployments.ConfigurationMetaNames.ONLINE: {}
 }

In [25]:
## Store the model on WML
published_model = client.repository.store_model(pipeline,
                                             meta_props=meta_props,
                                             training_data=X_train,
                                             training_target=y_train
                                                )

At this point you can verify the stored model by going to the deployment space you created earlier. You will be able to see the model listed in the assets tab.

In [26]:
published_model_uid = client.repository.get_model_uid(published_model)

## 5. Deploy the model <a name="deployment"></a>

In [27]:
## Create a Deployment for your stored model

created_deployment = client.deployments.create(published_model_uid, meta_props=deploy_meta)



#######################################################################################

Synchronous deployment creation for uid: '92bd9967-ad96-47ee-afa9-48452386627e' started

#######################################################################################


initializing
ready


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='6b0aebd6-e22b-4b69-95c5-e09a0b599f61'
------------------------------------------------------------------------------------------------




At this point you can verify the deployed model by going to the deployment space you created earlier. The model deployment is listed in the Deployments tab, with a green tick once it has deployed successfully.

In [28]:
scoring_endpoint = None
deployment_uid=created_deployment['metadata']['guid']

## 6. Testing the deployed model <a name="testing"></a>

In [29]:
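fields = list(X_test.columns)
# Score the first 20 rows of the test set
score = X_test.head(20)
# Transpose the per-column value lists into one list of values per row
scoring_data = list(list(x) for x in zip(*(score[x].values.tolist() for x in score.columns)))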

In [30]:
job_payload = {
client.deployments.ScoringMetaNames.INPUT_DATA: [{
 'values': scoring_data
}]
}
print(job_payload)

{'input_data': [{'values': [[0, 0, 5, 1, 1, 60700, 1672.88, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0], [0, 0, 16, 1, 2, 100210, 1241.04, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1], [46300, 0, 21, 3, 3, 61440, 1132.47, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 6, 3, 2, 53730, 1437.53, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0], [91900, 0, 22, 4, 0, 71760, 1083.01, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 10, 1, 1, 70700, 1405.71, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1], [0, -39500, 14, 1, 1, 75500, 1286.44, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 
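
The payload is a plain dictionary: an `input_data` list whose entries carry `values` (one list per row) and, optionally, `fields` (the column names). The sketch below builds the same layout with a basic shape check. The function `build_scoring_payload` is hypothetical and not part of the WML client, and the three-column rows are shortened toy data.

```python
def build_scoring_payload(rows, fields=None):
    """Build a payload dict with the 'input_data' layout printed above."""
    rows = [list(row) for row in rows]
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError("rows have differing lengths: %s" % sorted(widths))
    if fields is not None and rows and len(fields) != len(rows[0]):
        raise ValueError(
            "expected %d values per row, got %d" % (len(fields), len(rows[0]))
        )

    entry = {"values": rows}
    if fields is not None:
        entry = {"fields": list(fields), "values": rows}
    return {"input_data": [entry]}


demo_payload = build_scoring_payload(
    [[0, 0, 5], [46300, 0, 21]],
    fields=["capital_gains", "capital_loss", "incident_hour_of_the_day"],
)
print(demo_payload)
```

Checking row widths before scoring catches a column mismatch locally, before the request reaches the deployment.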

In [31]:
scoring_response = client.deployments.score(deployment_uid, job_payload)

print(scoring_response)

{'predictions': [{'fields': ['prediction'], 'values': [[0], [0], [0], [1], [1], [0], [0], [0], [1], [0], [0], [1], [0], [1], [0], [0], [0], [1], [1], [0]]}]}


In [32]:
job_payload_ui = {
client.deployments.ScoringMetaNames.INPUT_DATA: [{
 "fields": fields,
 "values": scoring_data
}]
}
print(json.dumps(job_payload_ui))

{'input_data': [{'fields': ['capital_gains', 'capital_loss', 'incident_hour_of_the_day', 'number_of_vehicles_involved', 'witnesses', 'total_claim_amount', 'policy_annual_premium', 'insured_sex_FEMALE', 'insured_sex_MALE', 'insured_occupation_adm-clerical', 'insured_occupation_armed-forces', 'insured_occupation_craft-repair', 'insured_occupation_exec-managerial', 'insured_occupation_farming-fishing', 'insured_occupation_handlers-cleaners', 'insured_occupation_machine-op-inspct', 'insured_occupation_other-service', 'insured_occupation_priv-house-serv', 'insured_occupation_prof-specialty', 'insured_occupation_protective-serv', 'insured_occupation_sales', 'insured_occupation_tech-support', 'insured_occupation_transport-moving', 'insured_hobbies_chess', 'insured_hobbies_cross-fit', 'insured_hobbies_other', 'incident_type_Multi-vehicle Collision', 'incident_type_Parked Car', 'incident_type_Single Vehicle Collision', 'incident_type_Vehicle Theft', 'collision_type_Front Collision', 'collision_

Copy the JSON printed above and paste it into the `Enter Input data` box to test the deployed model. The results should match the predictions shown below.

In [33]:
scoring_response = client.deployments.score(deployment_uid, job_payload_ui)

print(scoring_response)

{'predictions': [{'fields': ['prediction'], 'values': [[0], [0], [0], [1], [1], [0], [0], [0], [1], [0], [0], [1], [0], [1], [0], [0], [0], [1], [1], [0]]}]}
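
The response nests each prediction in its own single-element list. For further processing, for example comparing against `y_test`, a flat list is usually more convenient. A minimal sketch, assuming the response layout printed above; `flatten_predictions` is an illustrative helper, and the sample is a shortened copy of the response.

```python
def flatten_predictions(scoring_response):
    """Return the 'prediction' column of the first result as a flat list."""
    result = scoring_response["predictions"][0]
    column = result["fields"].index("prediction")
    return [row[column] for row in result["values"]]


# Shortened copy of the response printed above (first five rows only).
sample_response = {
    "predictions": [{"fields": ["prediction"], "values": [[0], [0], [0], [1], [1]]}]
}

labels = flatten_predictions(sample_response)
print(labels)  # [0, 0, 0, 1, 1]
print(sum(labels), "of", len(labels), "flagged as fraud")
```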


## Congratulations!

You have finished running the notebook for training, saving, and deploying a fraud claim prediction model. To view the deployed model, go to the project and select the `Settings` tab. Choose the `Associated deployment space` that you created and click `Open`. Select the `Deployments` tab and click your deployment to open it, then click the test tab of the deployment to test the model.
