## Train the dataset


Let's prepare the dataset for training. We will load it from the feature store and split it into 80% training data and 20% test data.

In [1]:
import pandas as pd
import time
import sagemaker
import boto3
from sagemaker_graph_fraud_detection import config
from sagemaker.feature_store.feature_group import FeatureGroup

role = config.role
sess = sagemaker.Session()
bucket = sess.default_bucket()
region = sess.boto_region_name

#source = f"s3://{bucket}/AutoInsuranceFraudDetection/Results/DataWrangler/output_1631641776/part-00000-34b6c48b-2e45-4f81-8243-a598721e5e61-c000.csv"
#dataset = pd.read_csv(source)

In [2]:
fraud_fg_name = "auto-fraud"
fraud_feature_group = FeatureGroup(name=fraud_fg_name, sagemaker_session=sess)

fraud_query = fraud_feature_group.athena_query()
fraud_table = fraud_query.table_name

In [3]:
# Build the Athena query against the feature group's table.
query_string = f'SELECT * FROM "{fraud_table}"'

# Run the Athena query and load the result into a pandas DataFrame.
fraud_query.run(query_string=query_string, output_location=f"s3://{bucket}/query_results/")
fraud_query.wait()
dataset = fraud_query.as_dataframe()

In [4]:
dataset.head(5)

|  | fraud_reported | age | policy_number | policy_state | policy_deductable | policy_annual_premium | umbrella_limit | insured_sex | insured_education_level | insured_occupation | ... | total_claim_amount | injury_claim | property_claim | vehicle_claim | auto_make | fraud_id | fraud_time | write_time | api_invocation_time | is_deleted |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 46 | 283267 | 2 | 2000 | 1090.32 | 0 | 1 | 1 | 5 | ... | 51390 | 5710 | 5710 | 39970 | 13 | 1709 | 1631681000.0 | 2021-09-15 04:42:36.972 | 2021-09-15 04:37:16.000 | False |
| 1 | 0 | 33 | 529112 | 1 | 500 | 1240.47 | 0 | 0 | 0 | 7 | ... | 77110 | 0 | 14020 | 63090 | 6 | 1046 | 1631681000.0 | 2021-09-15 04:42:36.972 | 2021-09-15 04:37:16.000 | False |
| 2 | 1 | 22 | 691115 | 1 | 500 | 1173.21 | 0 | 1 | 3 | 4 | ... | 86130 | 15660 | 7830 | 62640 | 11 | 1727 | 1631681000.0 | 2021-09-15 04:42:36.972 | 2021-09-15 04:37:16.000 | False |
| 3 | 1 | 41 | 621756 | 1 | 1000 | 1129.23 | 0 | 0 | 1 | 4 | ... | 50300 | 10060 | 5030 | 35210 | 11 | 1729 | 1631681000.0 | 2021-09-15 04:42:36.972 | 2021-09-15 04:37:16.000 | False |
| 4 | 0 | 48 | 290971 | 2 | 500 | 1698.51 | 0 | 1 | 3 | 7 | ... | 51840 | 8640 | 8640 | 34560 | 13 | 1736 | 1631681000.0 | 2021-09-15 04:42:36.972 | 2021-09-15 04:37:16.000 | False |


In [5]:
train = dataset.sample(frac=0.80, random_state=0)
test = dataset.drop(train.index)
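
The `sample`/`drop` pattern above only yields a clean split when the DataFrame index is unique, so a quick sanity check can be worthwhile. The following is a minimal sketch on a made-up stand-in for `dataset` (the `toy` frame and its values are assumptions, not the real data); it checks that the two parts are disjoint and compares the label balance in each.

```python
import pandas as pd

# Hypothetical stand-in for `dataset`: 100 rows with a binary label.
toy = pd.DataFrame({"fraud_reported": [0, 1, 0, 0] * 25, "age": range(100)})

# Same split pattern as in the notebook cell.
train_part = toy.sample(frac=0.80, random_state=0)
test_part = toy.drop(train_part.index)

# The two splits should be disjoint and together cover every row.
overlap = train_part.index.intersection(test_part.index)
print(len(train_part), len(test_part), len(overlap))

# Share of each label per split; a large gap suggests an unlucky split.
print(train_part["fraud_reported"].value_counts(normalize=True))
print(test_part["fraud_reported"].value_counts(normalize=True))
```

If the label shares differ noticeably between the two parts, a stratified split may be a better fit for an imbalanced fraud dataset.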

In [6]:
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
dataset.to_csv("dataset.csv", index=True)
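
When the built-in SageMaker XGBoost algorithm reads CSV training data, it expects the target in the first column and no header row. The files written above include a header and the Feature Store bookkeeping columns, so it may be worth checking them against that format. The sketch below shows one way to produce such a file; the helper `to_xgboost_csv`, the output name `train_xgb.csv`, the sample values, and the choice of which columns count as metadata are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature of the feature-group export; the column names follow
# the preview above, the values are made up for illustration.
sample = pd.DataFrame(
    {
        "age": [46, 33, 22],
        "fraud_reported": [1, 0, 1],
        "total_claim_amount": [51390, 77110, 86130],
        "write_time": ["2021-09-15 04:42:36.972"] * 3,
        "is_deleted": [False, False, False],
    }
)

LABEL = "fraud_reported"
# Feature Store bookkeeping columns (assumed here to carry no predictive signal).
METADATA_COLUMNS = ["fraud_id", "fraud_time", "write_time", "api_invocation_time", "is_deleted"]


def to_xgboost_csv(df: pd.DataFrame, path: str) -> pd.DataFrame:
    """Write df as a headerless CSV with the label in the first column."""
    # Drop the label and any metadata columns that are present.
    features = df.drop(columns=[LABEL] + METADATA_COLUMNS, errors="ignore")
    # Put the label first, followed by the remaining feature columns.
    ordered = pd.concat([df[LABEL], features], axis=1)
    ordered.to_csv(path, header=False, index=False)
    return ordered


prepared = to_xgboost_csv(sample, "train_xgb.csv")
print(list(prepared.columns))
```

The same helper could be applied to both `train` and `test` before uploading, so that the two channels share one column layout.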

Write the train and test data to S3.


In [7]:
# Initialize the boto3 S3 client in the session's region.
boto3.setup_default_session(region_name=region)
s3_client = boto3.client("s3", region_name=region)

# Upload both splits under the DataSet/ prefix.
s3_client.upload_file(
    Filename="train.csv", Bucket=bucket, Key="AutoInsuranceFraudDetection/Results/DataSet/train/train.csv"
)
s3_client.upload_file(
    Filename="test.csv", Bucket=bucket, Key="AutoInsuranceFraudDetection/Results/DataSet/test/test.csv"
)

In [8]:
train.head(5)

|  | fraud_reported | age | policy_number | policy_state | policy_deductable | policy_annual_premium | umbrella_limit | insured_sex | insured_education_level | insured_occupation | ... | total_claim_amount | injury_claim | property_claim | vehicle_claim | auto_make | fraud_id | fraud_time | write_time | api_invocation_time | is_deleted |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 993 | 1 | 24 | 326180 | 0 | 2000 | 1304.46 | 0 | 0 | 6 | 6 | ... | 5940 | 540 | 1080 | 4320 | 1 | 1837 | 1631681000.0 | 2021-09-15 04:42:38.464 | 2021-09-15 04:37:17.000 | False |
| 859 | 0 | 45 | 322609 | 2 | 1000 | 1230.69 | 0 | 1 | 5 | 13 | ... | 53800 | 5380 | 5380 | 43040 | 0 | 1632 | 1631681000.0 | 2021-09-15 04:42:37.305 | 2021-09-15 04:37:18.000 | False |
| 298 | 0 | 45 | 272910 | 1 | 500 | 1594.37 | 0 | 1 | 0 | 3 | ... | 7260 | 660 | 1320 | 5280 | 10 | 1896 | 1631681000.0 | 2021-09-15 04:42:36.591 | 2021-09-15 04:37:17.000 | False |
| 553 | 1 | 31 | 651948 | 1 | 1000 | 1354.5 | 0 | 1 | 5 | 6 | ... | 64800 | 6480 | 12960 | 45360 | 11 | 1424 | 1631681000.0 | 2021-09-15 04:42:38.868 | 2021-09-15 04:37:16.000 | False |
| 672 | 0 | 39 | 728839 | 2 | 2000 | 1524.18 | 0 | 1 | 3 | 2 | ... | 48870 | 5430 | 5430 | 38010 | 10 | 1841 | 1631681000.0 | 2021-09-15 04:42:38.222 | 2021-09-15 04:37:17.000 | False |


In [9]:
test.head(5)

|  | fraud_reported | age | policy_number | policy_state | policy_deductable | policy_annual_premium | umbrella_limit | insured_sex | insured_education_level | insured_occupation | ... | total_claim_amount | injury_claim | property_claim | vehicle_claim | auto_make | fraud_id | fraud_time | write_time | api_invocation_time | is_deleted |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9 | 1 | 35 | 998192 | 0 | 500 | 1433.24 | 0 | 0 | 1 | 3 | ... | 24570 | 2730 | 2730 | 19110 | 10 | 1558 | 1631681000.0 | 2021-09-15 04:42:36.972 | 2021-09-15 04:37:18.000 | False |
| 11 | 1 | 42 | 779156 | 0 | 1000 | 1230.76 | 0 | 1 | 2 | 6 | ... | 78980 | 7180 | 14360 | 57440 | 10 | 1623 | 1631681000.0 | 2021-09-15 04:42:36.972 | 2021-09-15 04:37:18.000 | False |
| 19 | 1 | 29 | 789208 | 2 | 500 | 1304.35 | 0 | 1 | 3 | 13 | ... | 75400 | 11600 | 11600 | 52200 | 4 | 1489 | 1631681000.0 | 2021-09-15 04:42:36.645 | 2021-09-15 04:37:17.000 | False |
| 23 | 0 | 28 | 818413 | 2 | 1000 | 1377.94 | 0 | 1 | 5 | 6 | ... | 44640 | 9920 | 4960 | 29760 | 12 | 1330 | 1631681000.0 | 2021-09-15 04:42:36.645 | 2021-09-15 04:37:19.000 | False |
| 28 | 0 | 29 | 221283 | 2 | 500 | 914.85 | 0 | 1 | 0 | 9 | ... | 7110 | 790 | 1580 | 4740 | 0 | 1723 | 1631681000.0 | 2021-09-15 04:42:39.214 | 2021-09-15 04:37:16.000 | False |


## Train a model using XGBoost

Once the training and test datasets are stored in S3, you can start training a model by choosing which SageMaker Estimator to use. This guide uses the open-source XGBoost framework. The estimator is accessed through the SageMaker SDK but mirrors the open-source XGBoost Python package, so any functionality provided by that package can be implemented in your training script.

In [10]:
from sagemaker.debugger import Rule, rule_configs
from sagemaker.session import TrainingInput

# Reuse the session, bucket, and region created in the first cell.
s3_output_location = "s3://{}/{}/{}".format(bucket, "AutoInsuranceFraudDetection/Results", "xgboost_model")

# Look up the built-in XGBoost container image for this region.
container = sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")
print(container)

xgb_model = sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    volume_size=5,  # named train_volume_size before sagemaker v2
    output_path=s3_output_location,
    sagemaker_session=sess,
    rules=[Rule.sagemaker(rule_configs.create_xgboost_report())],
)


720646828776.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-xgboost:1.2-1


### Set the hyperparameters
These parameters are passed to the training job. Although they are all set as “hyperparameters” here, they can include XGBoost’s learning task parameters, tree booster parameters, or any other XGBoost parameter you’d like to configure.

In [11]:
xgb_model.set_hyperparameters(objective="binary:logistic", num_round=100)
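
To make the distinction between parameter groups concrete, the sketch below collects learning task and tree booster parameters in separate dictionaries before merging them. The parameter names are standard XGBoost hyperparameters, but every value other than `objective` and `num_round` is an illustrative assumption, not a tuned setting from this notebook.

```python
# Learning task parameters: what is optimized and how it is evaluated.
learning_task_params = {"objective": "binary:logistic", "eval_metric": "auc"}

# Tree booster parameters: how each tree is grown (illustrative values, not tuned).
tree_booster_params = {"max_depth": 5, "eta": 0.2, "subsample": 0.8, "min_child_weight": 1}

# num_round is the number of boosting rounds, as in the cell above.
hyperparameters = {**learning_task_params, **tree_booster_params, "num_round": 100}

# In the notebook, these would be passed to the estimator defined earlier:
# xgb_model.set_hyperparameters(**hyperparameters)
for name, value in sorted(hyperparameters.items()):
    print(f"{name}={value}")
```

Keeping the groups separate makes it easier to see which settings change the objective and which only change how the trees are built.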

### Create and fit the estimator
Use the TrainingInput class to configure the data input channels for training. The following code creates TrainingInput objects for the train and test CSV files you uploaded to Amazon S3; the test split is passed as the validation channel.

In [12]:
from sagemaker.session import TrainingInput

train_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, "AutoInsuranceFraudDetection/Results", "DataSet/train/train.csv"), content_type="csv"
)
validation_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, "AutoInsuranceFraudDetection/Results", "DataSet/test/test.csv"), content_type="csv"
)
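
The S3 URIs above must match the keys used when the files were uploaded, and the shared prefix is repeated in several cells. One way to keep them in sync is a pair of small helpers, sketched below; `s3_key`, `s3_uri`, and the bucket name `example-bucket` are hypothetical names introduced for illustration, not part of the SageMaker SDK.

```python
PREFIX = "AutoInsuranceFraudDetection/Results"  # prefix used throughout this notebook


def s3_key(*parts: str) -> str:
    """Join the shared prefix and path parts into an S3 object key."""
    return "/".join(p.strip("/") for p in (PREFIX, *parts))


def s3_uri(bucket: str, *parts: str) -> str:
    """Build the s3:// URI for the same object."""
    return f"s3://{bucket}/{s3_key(*parts)}"


# The key would be passed to upload_file, the URI to TrainingInput.
train_key = s3_key("DataSet/train/train.csv")
train_uri = s3_uri("example-bucket", "DataSet/train/train.csv")
print(train_key)
print(train_uri)
```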

In [13]:
xgb_model.fit({"train": train_input, "validation": validation_input}, wait=True)

2021-09-15 04:51:32 Starting - Starting the training job...
2021-09-15 04:51:56 Starting - Launching requested ML instances
CreateXgboostReport: InProgress
ProfilerReport-1631681491: InProgress
......
2021-09-15 04:52:56 Starting - Preparing the instances for training.........
2021-09-15 04:54:29 Downloading - Downloading input data
2021-09-15 04:54:29 Training - Downloading the training image.....
[2021-09-15 04:55:16.150 ip-10-0-164-211.ap-south-1.compute.internal:1 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','