# Fraud Detection

This notebook shows how to use Amazon Sagemaker Processsing, Data Wrangler and Amazon Glue Data Brew to prepare the data. 

First, we process the raw dataset to prepare the features and extract the interactions in the dataset that will be used to construct the graph. 

Then, we create a launch a training job using the SageMaker framework estimator to train a XGBoost model.

## Sagemaker Initial Setup

The below code is used to get the S3 Bucket name configured for Sagemaker

In [3]:
!bash setup.sh
! pip install -qU sagemaker
import sagemaker
from sagemaker_graph_fraud_detection import config

role = config.role
sess = sagemaker.Session()
bucket = sess.default_bucket()  
print("S3 bucket name: ", bucket)

  from cryptography.utils import int_from_bytes
  from cryptography.utils import int_from_bytes
  from cryptography.utils import int_from_bytes
  from cryptography.utils import int_from_bytes
[31mERROR: File "setup.py" or "setup.cfg" not found. Directory cannot be installed in editable mode: /root/AutoInsuranceFraudDetection/Code/sagemaker_graph_fraud_detection[0m
S3 bucket name:  sagemaker-us-east-1-367858208265


## Amazon Sagemaker Data Preprocessing 

In [2]:
#container to run the processing. The ecr_repository_uri will vary depending on the region. The "source" field is used for the dataset and the "destination" is used to store the prepared data
ecr_repository_uri = "683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3"
source = 's3://'+bucket+'/AutoInsuranceFraudDetection/DataSet/insurance_claims.csv'
destination = 's3://'+bucket+'/AutoInsuranceFraudDetection/Results/DataProcessing'

In [15]:
%%writefile AutoInsuranceFraudProcessing.py
#This block of code generates a file "AutoInsuranceFraudProcessing.py" which has the code to process the data

import argparse
import os
import warnings

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer, KBinsDiscretizer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import make_column_transformer

from sklearn.exceptions import DataConversionWarning
from sklearn.preprocessing import LabelEncoder

warnings.filterwarnings(action="ignore", category=DataConversionWarning)

if __name__ == "__main__":
    #get arguments
    parser = argparse.ArgumentParser()
    args, _ = parser.parse_known_args()
    print("Received arguments {}".format(args))
    
    #get the input data
    input_data_path = os.path.join("/opt/ml/processing/input", "insurance_claims.csv")
    print("Reading input data from {}".format(input_data_path))
    df = pd.read_csv(input_data_path)
    df = pd.DataFrame(data=df)
    print(df.head())

    #replacing ? with nan for the columns
    df['police_report_available']=df['police_report_available'].replace('?',np.nan)
    df['collision_type']=df['collision_type'].replace('?',np.nan)
    df['property_damage']=df['property_damage'].replace('?',np.nan)
    
    #droping rows with nan
    df=df.dropna(subset=['police_report_available', 'collision_type','property_damage'])
    
    #dropping the unnecessary rows
    df=df.drop(['months_as_customer','policy_number','policy_bind_date','policy_csl','auto_year','auto_model','insured_hobbies','insured_zip'],axis=1)
    
    #now deal with the categorical features
    le=LabelEncoder()
    for i in list(df.columns):
        if df[i].dtype=='object':
            df[i]=le.fit_transform(df[i])
    
    #final preprocessed data
    print(df.head())
    train_features_output_path = os.path.join("/opt/ml/processing/output", "preprocessed_data.csv")
    df.to_csv(train_features_output_path, index=False)
    print("done")
    

Overwriting AutoInsuranceFraudProcessing.py


### Run Preprocessing job with Amazon SageMaker Processing

The script we have defined at `AutoInsuranceFraudProcessing.py` performs data preprocessing transformations on the raw data. The preproceesing involes replacing values, droping rows, dropping columns and categorical encoding

In [13]:
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

script_processor = ScriptProcessor(command=['python3'],
                                   image_uri=ecr_repository_uri,
                                   role=role,
                                   instance_count=1,
                                   instance_type='ml.m4.2xlarge')

script_processor.run(code='AutoInsuranceFraudProcessing.py',
                     inputs=[ProcessingInput(source=source,
                                             destination='/opt/ml/processing/input')],
                     outputs=[ProcessingOutput(output_name="preprocessed_data.csv", destination=destination,
                                               source='/opt/ml/processing/output')])



Parameter 'session' will be renamed to 'sagemaker_session' in SageMaker Python SDK v2.



Job Name:  sagemaker-scikit-learn-2021-09-08-12-41-10-629
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-367858208265/AutoInsuranceFraudDetection/DataSet/insurance_claims.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-367858208265/sagemaker-scikit-learn-2021-09-08-12-41-10-629/input/code/AutoInsuranceFraudProcessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'preprocessed_data.csv', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-367858208265/AutoInsuranceFraudDetection/Results/DataProcessing', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
..............................
[34mRec

### View Results of Data Preprocessing

Once the preprocessing job is complete, we can take a look at the contents of the S3 bucket to see the transformed data.

In [8]:
preprocessing_job_description = script_processor.jobs[-1].describe()

output_config = preprocessing_job_description["ProcessingOutputConfig"]
for output in output_config["Outputs"]:
    print(output)
    preprocessed_data = output["S3Output"]["S3Uri"]
    print(preprocessed_data)


{'OutputName': 'preprocessed_data.csv', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-367858208265/AutoInsuranceFraud/Results/DataProcessing', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}, 'AppManaged': False}
s3://sagemaker-us-east-1-367858208265/AutoInsuranceFraud/Results/DataProcessing


Once the training is completed, the training instances are automatically saved and SageMaker stores the trained model and evaluation results to a location in S3.

## Amazon Sagemaker Feature Store

In [4]:
import pandas as pd
import time

# read the prepared data from S3. Enter any of the Results processed file S3 location
source = 's3://sagemaker-us-east-1-367858208265/Results/DataWrangler/output_1631272206/part-00000-d3369d58-6799-4d9f-91bd-0f0159be50b4-c000.csv'
df = pd.read_csv(source)

When creating a feature group, you can also create the metadata for the feature group, such as a short description, storage configuration, features for identifying each record, and the event time, as well as tags to store information such as the author, data source, version, and more. Since we do not have any such column, we are adding two extra columns called Fraud_ID and Fraud_time

In [4]:
#Add unique ID and event time for features store
df['Fraud_ID'] = df.index + 1000
current_time_sec = int(round(time.time()))
df['Fraud_time'] = pd.Series([current_time_sec]*len(df), dtype="float64")
df=df.drop(['_c0'],axis=1)
df.head()

Unnamed: 0,age,policy_number,policy_state,policy_deductable,policy_annual_premium,umbrella_limit,insured_sex,insured_education_level,insured_occupation,insured_relationship,...,witnesses,police_report_available,total_claim_amount,injury_claim,property_claim,vehicle_claim,auto_make,fraud_reported,Fraud_ID,Fraud_time
0,48,521585,2,1000,1406.91,0,1,4,2,0,...,2,1,71610,6510,13020,52080,10,1.0,1000,1631350000.0
1,29,687698,2,2000,1413.14,5000000,0,6,11,3,...,3,0,34650,7700,3850,23100,4,0.0,1001,1631350000.0
2,41,227811,0,2000,1415.74,6000000,0,6,1,4,...,2,0,63400,6340,6340,50720,3,1.0,1002,1631350000.0
3,44,367455,0,1000,1583.91,6000000,1,0,11,4,...,1,0,6500,1300,650,4550,0,0.0,1003,1631350000.0
4,39,104594,2,1000,1351.1,0,0,6,12,4,...,2,0,64100,6410,6410,51280,10,1.0,1004,1631350000.0


In [18]:
# initialize necessary variables
import boto3
region = sagemaker.Session().boto_region_name
boto3.setup_default_session(region_name=region)
s3_client = boto3.client("s3", region_name=region)

### Configure the feature groups
The datatype for each feature is set by passing a dataframe and inferring the proper datatype. Feature data types can also be set via a config variable, but it will have to match the correspongin Python data type in the Pandas dataframe when it’s ingested to the Feature Group.

In [1]:
#configure the features
from sagemaker.feature_store.feature_group import FeatureGroup
fraud_fg_name = f"auto-fraud"
fraud_feature_group = FeatureGroup(name=fraud_fg_name, sagemaker_session=sess)
fraud_feature_group.load_feature_definitions(data_frame=df)

ModuleNotFoundError: No module named 'sagemaker.feature_store'

### Create the feature groups
You must tell the Feature Group which columns in the dataframe correspond to the required record indentifier and event time features.

In [4]:
record_identifier_feature_name = "Fraud_ID"
event_time_feature_name = "Fraud_time"
sagemaker_role = sagemaker.get_execution_role()
try:
    print(f"\n Using s3://{bucket}/DataSet/insurance_claims.csv")
    fraud_feature_group.create(
        s3_uri=f"s3://{bucket}/DataSet/insurance_claims.csv",
        record_identifier_name='Fraud_ID',
        event_time_feature_name='Fraud_time',
        role_arn=sagemaker_role,
        enable_online_store=True,
    )
    print(f'Create "fraud" feature group: SUCCESS')
except Exception as e:
    code = e.response.get("Error").get("Code")
    if code == "ResourceInUse":
        print(f"Using existing feature group: {fraud_fg_name}")
    else:
        raise (e)


 Using s3://sagemaker-us-east-1-367858208265/DataSet/insurance_claims.csv


AttributeError: 'NameError' object has no attribute 'response'

### Wait until feature group creation has fully completed

In [9]:
def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group Creation")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    if status != "Created":
        raise RuntimeError(f"Failed to create feature group {feature_group.name}")
    print(f"FeatureGroup {feature_group.name} successfully created.")


wait_for_feature_group_creation_complete(feature_group=fraud_feature_group)


FeatureGroup auto-fraud successfully created.


### Ingest records into the Feature Groups
After the Feature Groups have been created, we can put data into each store by using the PutRecord API. This API can handle high TPS and is designed to be called by different streams. The data from all of these Put requests is buffered and written to s3 in chunks. The files will be written to the offline store within a few minutes of ingestion.

In [10]:
fraud_feature_group.ingest(data_frame=df, max_workers=3, wait=True)

IngestionManagerPandas(feature_group_name='auto-fraud', sagemaker_fs_runtime_client_config=<botocore.config.Config object at 0x7f6bc411b910>, max_workers=3, max_processes=1, _async_result=<multiprocess.pool.MapResult object at 0x7f6bc4d86950>, _processing_pool=<pool ProcessPool(ncpus=1)>, _failed_indices=[])

### Wait for offline store data to become available
This usually takes 5-8 minutes

In [None]:
fraud_feature_group_s3_prefix = (
    f"ResultSet/FeatureStore/offline-store/data"
)

offline_store_contents = None
while offline_store_contents is None:
    objects_in_bucket = s3_client.list_objects(
        Bucket=bucket, Prefix=fraud_feature_group_s3_prefix
    )
    if "Contents" in objects_in_bucket and len(objects_in_bucket["Contents"]) > 1:
        offline_store_contents = objects_in_bucket["Contents"]
    else:
        print("Waiting for data in offline store...")
        time.sleep(60)

print("\nData available.")

Waiting for data in offline store...
Waiting for data in offline store...
Waiting for data in offline store...
Waiting for data in offline store...
Waiting for data in offline store...
Waiting for data in offline store...
Waiting for data in offline store...
Waiting for data in offline store...
Waiting for data in offline store...
Waiting for data in offline store...
Waiting for data in offline store...
Waiting for data in offline store...
Waiting for data in offline store...
Waiting for data in offline store...


## Create train and test datasets

Lets train the dataset. We will split the dataset into 80% train data and the rest 20% test data

In [19]:
source = 's3://sagemaker-us-east-1-367858208265/Results/DataWrangler/output_1631272206/part-00000-d3369d58-6799-4d9f-91bd-0f0159be50b4-c000.csv'
dataset = pd.read_csv(source)

In [20]:
train = dataset.sample(frac=0.80, random_state=0)
test = dataset.drop(train.index)


Write train, test data to S3


In [12]:
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
dataset.to_csv("dataset.csv", index=True)

In [14]:
s3_client.upload_file(
    Filename="train.csv", Bucket=bucket, Key=f"Results/DataSet/train/train.csv"
)
s3_client.upload_file(Filename="test.csv", Bucket=bucket, Key=f"Results/DataSet/test/test.csv")

In [15]:
train.head(5)

Unnamed: 0,_c0,age,policy_number,policy_state,policy_deductable,policy_annual_premium,umbrella_limit,insured_sex,insured_education_level,insured_occupation,...,number_of_vehicles_involved,bodily_injuries,witnesses,police_report_available,total_claim_amount,injury_claim,property_claim,vehicle_claim,auto_make,fraud_reported
538,806,37,798579,1,1000,1114.23,0,1,1,9,...,1,0,1,1,52200,10440,5220,36540,9,0.0
493,743,39,892148,1,500,1359.36,5000000,1,6,3,...,3,2,2,1,71610,13020,6510,52080,12,1.0
14,18,37,921202,2,500,1374.22,0,0,4,2,...,1,1,0,0,72930,6630,13260,53040,0,0.0
247,383,46,858060,0,2000,1209.07,0,1,3,1,...,1,0,1,1,56430,6270,6270,43890,2,1.0
85,133,33,649082,0,1000,1922.84,0,0,2,6,...,1,2,1,0,46800,4680,9360,32760,7,0.0


In [16]:
test.head(5)

Unnamed: 0,_c0,age,policy_number,policy_state,policy_deductable,policy_annual_premium,umbrella_limit,insured_sex,insured_education_level,insured_occupation,...,number_of_vehicles_involved,bodily_injuries,witnesses,police_report_available,total_claim_amount,injury_claim,property_claim,vehicle_claim,auto_make,fraud_reported
9,13,34,626808,2,1000,936.61,0,0,4,1,...,1,1,1,0,7280,1120,1120,5040,12,0.0
11,15,58,892874,1,2000,1131.4,0,0,4,13,...,4,0,0,0,63120,10520,10520,42080,0,1.0
19,26,43,863236,1,2000,1322.1,0,1,2,9,...,1,1,3,1,9020,1640,820,6560,12,0.0
23,34,37,990493,0,500,1415.68,0,1,6,9,...,1,0,1,1,64800,10800,5400,48600,1,0.0
28,42,23,448961,0,500,1475.93,0,0,1,9,...,3,1,0,0,51660,5740,5740,40180,4,0.0


## Train a model using XGBoost

Once the training and test datasets have been persisted in S3, you can start training a model by defining which SageMaker Estimator you’d like to use. For this guide, you will use the XGBoost Open Source Framework to train your model. This estimator is accessed via the SageMaker SDK, but mirrors the open source version of the XGBoost Python package. Any functioanlity provided by the XGBoost Python package can be implemented in your training script.

In [21]:
from sagemaker.debugger import Rule, rule_configs
from sagemaker.session import TrainingInput

sess = sagemaker.Session()
bucket = sess.default_bucket()  
s3_output_location='s3://{}/{}/{}'.format(bucket, "Results/", 'xgboost_model')

container=sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")
print(container)

xgb_model=sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    train_volume_size=5,
    output_path=s3_output_location,
    sagemaker_session=sagemaker.Session(),
    rules=[Rule.sagemaker(rule_configs.create_xgboost_report())]
)

### Set the hyperparameters
These are the parameters which will be sent to our training script in order to train the model. Although they are all defined as “hyperparameters” here, they can encompass XGBoost’s Learning Task Parameters, Tree Booster Parameters, or any other parameters you’d like to configure for XGBoost.

In [22]:
xgb_model.set_hyperparameters(
    max_depth = 5,
    eta = 0.2,
    gamma = 4,
    min_child_weight = 6,
    subsample = 0.7,
    objective = "binary:logistic",
    num_round = 1000
)


### Create and fit the estimator
If you want to explore the breadth of functionailty offered by the SageMaker XGBoost Framework you can read about all the configuration parameters by referencing the inhereting classes. The XGBoost class inherets from the Framework class and Framework inherets from the EstimatorBase class

In [33]:
xgb_estimator = XGBoost(
    entry_point="xgboost_starter_script.py",
    hyperparameters=hyperparameters,
    role=sagemaker.get_execution_role(),
    train_instance_count = 1,
    train_instance_type = "ml.m4.xlarge",
    framework_version="1.0-1",
)


In [5]:
from sagemaker.debugger import Rule, rule_configs

s3_output_location='s3://{}/{}/{}'.format(bucket, 'Results/Train/', 'xgboost_model')
train_data_uri=sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")
print(container)
xgb_model=sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    train_volume_size=5,
    output_path=s3_output_location,
    sagemaker_session=sagemaker.Session(),
    rules=[Rule.sagemaker(rule_configs.create_xgboost_report())]
)

AttributeError: module 'sagemaker' has no attribute 'image_uris'

In [1]:
train_data_uri=sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")
print(container)
if 'training_job_1_name' not in locals():

    xgb_estimator.fit(inputs = {'train': train_data_uri})
    training_job_1_name = xgb_estimator.latest_training_job.job_name
    %store training_job_1_name

else:
    print(f'Using previous training job: {training_job_1_name}')

NameError: name 'sagemaker' is not defined