# Fraud Detection

This notebook shows how to use Amazon SageMaker Processing, Data Wrangler, and AWS Glue DataBrew to prepare the data.

First, we process the raw dataset to prepare the features and extract the interactions in the dataset that will be used to construct the graph. 

Then, we create and launch a training job using the SageMaker framework estimator to train an XGBoost model.

## SageMaker Initial Setup

The code below retrieves the execution role and the name of the default S3 bucket for the SageMaker session.

In [None]:
import pandas as pd
import time
import sagemaker
import boto3
from sagemaker import get_execution_role

role = get_execution_role()
sess = sagemaker.Session()
bucket = sess.default_bucket()
print("S3 bucket name: ", bucket)

## Amazon SageMaker Data Preprocessing

In [None]:
# ECR URI reference for the prebuilt scikit-learn containers:
# https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-docker-containers-scikit-learn-spark.html

# Container image used to run the processing job. The ECR URI varies by region; this one is for ap-south-1.
# "source" is the S3 location of the raw dataset; "destination" is the S3 prefix where the prepared data is stored.
ecr_repository_uri = "720646828776.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3"
    
source = 's3://'+bucket+'/AutoInsuranceFraudDetection/DataSet/insurance_claims.csv'
destination = 's3://'+bucket+'/AutoInsuranceFraudDetection/Results/DataProcessing'

In [None]:
%%writefile AutoInsuranceFraudProcessing.py
#This block of code generates a file "AutoInsuranceFraudProcessing.py" which has the code to process the data
import argparse
import os
import warnings
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer, KBinsDiscretizer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import make_column_transformer
from sklearn.exceptions import DataConversionWarning
from sklearn.preprocessing import LabelEncoder

warnings.filterwarnings(action="ignore", category=DataConversionWarning)

if __name__ == "__main__":
    #get arguments
    parser = argparse.ArgumentParser()
    args, _ = parser.parse_known_args()
    print("Received arguments {}".format(args))
    
    #get the input data
    input_data_path = os.path.join("/opt/ml/processing/input", "insurance_claims.csv")
    print("Reading input data from {}".format(input_data_path))
    df = pd.read_csv(input_data_path)  # read_csv already returns a DataFrame
    print(df.head())

    # Replace the "?" placeholder in police_report_available with NaN
    df['police_report_available'] = df['police_report_available'].replace('?', np.nan)
    # Drop the rows where police_report_available is missing
    df = df.dropna(subset=['police_report_available'])
    
    #drop unnecessary columns
    df=df.drop(['months_as_customer'],axis=1)
    
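    # Label-encode every categorical (object-dtype) column,
    # which includes insured_sex and fraud_reported
    le = LabelEncoder()
    for col in list(df.columns):
        if df[col].dtype == 'object':
            # fit_transform refits the encoder, so each column gets its own mapping
            df[col] = le.fit_transform(df[col])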
    
    #final preprocessed data
    print(df.head())
    train_features_output_path = os.path.join("/opt/ml/processing/output", "preprocessed_data.csv")
    df.to_csv(train_features_output_path, index=False)
    print("done")
    

### Run Preprocessing job with Amazon SageMaker Processing

The script defined in `AutoInsuranceFraudProcessing.py` applies the preprocessing transformations to the raw data: replacing placeholder values, dropping rows, dropping columns, and encoding categorical columns.
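Before paying for a processing instance, it can help to dry-run the same transformations locally. The sketch below applies the four steps to a tiny in-memory DataFrame. The `preprocess` function and the sample rows are made up for illustration and are not part of the notebook; only the column names come from the script.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical local stand-in for the steps in AutoInsuranceFraudProcessing.py."""
    df = df.copy()
    # 1. Treat the "?" placeholder as a missing value.
    df["police_report_available"] = df["police_report_available"].replace("?", np.nan)
    # 2. Drop the rows where that value is missing.
    df = df.dropna(subset=["police_report_available"])
    # 3. Drop the column the script discards.
    df = df.drop(columns=["months_as_customer"])
    # 4. Label-encode every non-numeric column. The script checks dtype == "object";
    #    a non-numeric check is used here so the sketch also works where pandas
    #    stores strings in a dedicated string dtype.
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            df[col] = LabelEncoder().fit_transform(df[col])
    return df


# Made-up rows that only mimic the shape of insurance_claims.csv.
sample = pd.DataFrame(
    {
        "months_as_customer": [12, 48, 7, 30],
        "insured_sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
        "police_report_available": ["YES", "?", "NO", "YES"],
        "fraud_reported": ["Y", "N", "N", "Y"],
    }
)

result = preprocess(sample)
print(result)
```

`LabelEncoder` assigns integer codes in sorted order of the labels it sees, so the codes depend on which values survive the row drop. That is worth keeping in mind if the encoded data is later compared across different subsets.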

In [None]:
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

script_processor = ScriptProcessor(command=['python3'],
                                   image_uri=ecr_repository_uri,
                                   role=role,
                                   instance_count=1,
                                   instance_type='ml.m4.2xlarge')

script_processor.run(code='AutoInsuranceFraudProcessing.py',
                     inputs=[ProcessingInput(source=source,
                                             destination='/opt/ml/processing/input')],
                     outputs=[ProcessingOutput(output_name="preprocessed_data.csv", destination=destination,
                                               source='/opt/ml/processing/output')])


### View Results of Data Preprocessing

Once the preprocessing job is complete, we can take a look at the contents of the S3 bucket to see the transformed data.

In [None]:
preprocessing_job_description = script_processor.jobs[-1].describe()

output_config = preprocessing_job_description["ProcessingOutputConfig"]
for output in output_config["Outputs"]:
    print(output)
    preprocessed_data = output["S3Output"]["S3Uri"]
    print(preprocessed_data)
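To locate the output file itself rather than just the prefix, the S3 URI can be split into a bucket and an object key. The following is a minimal sketch: `split_s3_uri` is a hypothetical helper, and `example_description` is a trimmed, made-up stand-in for the dictionary returned by `describe()`, with a placeholder bucket name.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/prefix' into (bucket, prefix). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Made-up description containing only the fields read by the cell above.
example_description = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "preprocessed_data.csv",
                "S3Output": {
                    "S3Uri": "s3://example-bucket/AutoInsuranceFraudDetection/Results/DataProcessing",
                    "LocalPath": "/opt/ml/processing/output",
                },
            }
        ]
    }
}

for output in example_description["ProcessingOutputConfig"]["Outputs"]:
    out_bucket, prefix = split_s3_uri(output["S3Output"]["S3Uri"])
    # The processing script writes preprocessed_data.csv into the output
    # directory, so the object key is the prefix plus that file name.
    object_key = f"{prefix}/preprocessed_data.csv"
    print(out_bucket, object_key)
```

The resulting bucket and key could then be passed to whichever S3 client is in use to download or read the file.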


Once training completes, SageMaker automatically terminates the training instances and stores the trained model and evaluation results in S3.