# Introduction to Amazon Fraud Detector
---

**[Amazon Fraud Detector](https://aws.amazon.com/fraud-detector/)** is a fully managed fraud detection service that automates the detection of potentially *fraudulent activities online*. These activities include unauthorized transactions, the creation of fake accounts, and fraudulent claims. Amazon Fraud Detector uses machine learning to analyze your data, building on more than 20 years of fraud detection expertise at Amazon.

You can use Amazon Fraud Detector to build customized fraud-detection models, add decision logic to interpret the model’s fraud evaluations, and assign an outcome, such as *pass* or *send for review*, to each possible fraud evaluation. With Amazon Fraud Detector, you <u>don't need machine learning expertise</u> to detect fraudulent activities.

For more details, please visit the [documentation](https://docs.aws.amazon.com/frauddetector/latest/ug/what-is-frauddetector.html).


## About dataset...
---

In this notebook demonstration, we use **Auto Insurance** claim data with 100k observations. The data is provided in the `../data/` folder.


## Set up

In [None]:
%pip install --upgrade pip awscli botocore boto3 sagemaker --quiet --root-user-action=ignore
%pip install Jinja2 awswrangler jsonpath-ng --quiet --root-user-action=ignore

## Data Exploration
---

Let's quickly go through our sample dataset.

In [None]:
import pandas as pd
import boto3
import sagemaker
import numpy as np

# Change this according to your S3 location
S3_BUCKET_NM = sagemaker.Session().default_bucket()
S3_PREFIX = 'amazon-fraud-detector'

In [None]:
df = pd.read_csv('../data/Insurance_FraudulentAutoInsuranceClaims_100k.csv')
print(df.shape)
display(df.head(3))

In [None]:
print(np.unique(df['EVENT_LABEL'], return_counts=True))

### Split data into Train and test

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2)
print(f'''
Train label distribution: {np.unique(train_df.EVENT_LABEL, return_counts=True)[1] / train_df.shape[0]}
Test label distribution: {np.unique(test_df.EVENT_LABEL, return_counts=True)[1] / test_df.shape[0]}
''')

In [None]:
train_df.to_csv('../data/auto_insurance_fraud_train.csv', index=False, header=True)
test_df.to_csv('../data/auto_insurance_fraud_test.csv', index=False, header=True)

Write to Amazon S3

In [None]:
train_df.to_csv(f's3://{S3_BUCKET_NM}/{S3_PREFIX}/train/auto_insurance_fraud_train.csv', index=False, header=True)
test_df.to_csv(f's3://{S3_BUCKET_NM}/{S3_PREFIX}/test/auto_insurance_fraud_test.csv', index=False, header=True)

### Prepare the configuration file

In [None]:
RECIPE = {
    "Insurance_FraudulentAutoInsuranceClaims": 
    {
        "data_path": "train/auto_insurance_fraud_train.csv",
        "variable_mappings": [
            {
                "variable_name": "first_name",
                "variable_type": "SHIPPING_NAME",
                "data_type": "STRING"
            },
            {
                "variable_name": "last_name",
                "variable_type": "BILLING_NAME",
                "data_type": "STRING"
            },
            {
                "variable_name": "policy_id",
                "variable_type": "ORDER_ID",
                "data_type": "STRING"
            },
            {
                "variable_name": "policy_deductable",
                "variable_type": "NUMERIC",
                "data_type": "FLOAT"
            },
            {
                "variable_name": "customer_age",
                "variable_type": "NUMERIC",
                "data_type": "FLOAT"
            },
            {
                "variable_name": "policy_annual_premium",
                "variable_type": "NUMERIC",
                "data_type": "FLOAT"
            },
            {
                "variable_name": "incident_severity",
                "variable_type": "NUMERIC",
                "data_type": "FLOAT"
            },
            {
                "variable_name": "vehicle_claim",
                "variable_type": "NUMERIC",
                "data_type": "FLOAT"
            },
            {
                "variable_name": "incident_hour",
                "variable_type": "NUMERIC",
                "data_type": "FLOAT"
            },
            {
                "variable_name": "num_injuries",
                "variable_type": "NUMERIC",
                "data_type": "FLOAT"
            },
            {
                "variable_name": "num_claims_past_year",
                "variable_type": "NUMERIC",
                "data_type": "FLOAT"
            },
            {
                "variable_name": "injury_claim",
                "variable_type": "NUMERIC",
                "data_type": "FLOAT"
            },
            {
                "variable_name": "num_vehicles_involved",
                "variable_type": "NUMERIC",
                "data_type": "FLOAT"
            },
            {
                "variable_name": "num_witnesses",
                "variable_type": "NUMERIC",
                "data_type": "FLOAT"
            },
            {
                "variable_name": "incident_type",
                "variable_type": "CATEGORICAL",
                "data_type": "STRING"
            },
            {
                "variable_name": "police_report_available",
                "variable_type": "CATEGORICAL",
                "data_type": "STRING"
            }
        ],
        "label_mappings": {
            "FRAUD": ["fraud"],
            "LEGIT": ["legit"]
        },
    }
}

## Amazon Fraud Detector

### Set up parameters

In [None]:
fraud_use_case = 'Insurance_FraudulentAutoInsuranceClaims'
config_file = RECIPE[fraud_use_case]

Set up IAM role

In [None]:
sts_client = boto3.client('sts')
ACCT_NUM = sts_client.get_caller_identity().get('Account')

In [None]:
__trust_policy__ = """{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "frauddetector.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
"""

__managed_policy__ = """{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "frauddetector:*"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:*",
                "sagemaker-geospatial:*"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*",
                "s3-object-lambda:*"
            ],
            "Resource": "*"
        }
    ]
}
"""

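# Write both policy documents with context managers so the files are
# flushed and closed before the AWS CLI reads them in the next cell
with open('afd-managed-policy.json', 'w') as f:
    f.write(__managed_policy__)

with open('afd-trust-policy.json', 'w') as f:
    f.write(__trust_policy__)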

AFD_ROLE = 'AWSFraudDetectorServiceRole-demoAutoInsuranceClaim'

In [None]:
%%sh -s "$AFD_ROLE"
echo $1
afd_role="$1"
managed_policy_name="managed_policy-$afd_role"
echo $managed_policy_name
aws iam create-role --role-name $afd_role --assume-role-policy-document file://afd-trust-policy.json
output=$(aws iam create-policy --policy-document file://afd-managed-policy.json --policy-name $managed_policy_name)
arn=$(echo "$output" | grep -oP '"Arn": "\K[^"]+')
echo "$arn"
aws iam attach-role-policy --policy-arn $arn --role-name $afd_role

In [None]:
iam_client = boto3.client('iam')
iam_resp = iam_client.get_role(
    RoleName=AFD_ROLE,
)

<div class="alert alert-block alert-info">
    <b>Remark</b>: You need to ensure that this role can access <b>Amazon Fraud Detector</b> and <b>Amazon S3</b>.
</div>
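
The managed policy above grants `s3:*` and `sagemaker:*` on every resource, which is broader than this demo appears to need. As a hedged sketch, assuming the role only has to read training data and write batch output under a single prefix, a narrower S3 policy could be generated as follows. `build_s3_policy` and the bucket name are hypothetical, and the action list is an assumption that should be checked against the Amazon Fraud Detector documentation before use.

```python
import json


def build_s3_policy(bucket: str, prefix: str) -> dict:
    """Return an IAM policy document limited to one bucket prefix (illustrative)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself, restricted by prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Object reads and writes are limited to keys under the prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical bucket name; the prefix matches S3_PREFIX in this notebook.
policy = build_s3_policy("my-example-bucket", "amazon-fraud-detector")
print(json.dumps(policy, indent=2))
```

The resulting JSON could replace the S3 statement in `afd-managed-policy.json` if the narrower scope fits your setup.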

In [None]:
IAM_ROLE = iam_resp['Role']['Arn']
EVENT_VARIABLES = [variable["variable_name"] for variable in config_file["variable_mappings"]]
EVENT_LABELS = ["fraud", "legit"]

In [None]:
pd.DataFrame(config_file["variable_mappings"])

In [None]:
s3_client = boto3.client('s3')
data_path = config_file['data_path']
# S3 keys always use forward slashes, so build them with f-strings
s3_key = f'{S3_PREFIX}/{data_path}'
resp = s3_client.list_objects_v2(
    Bucket=S3_BUCKET_NM, 
    Prefix=s3_key
)

if resp['KeyCount'] == 0:
    # The local copy was written to ../data/ earlier in this notebook
    local_path = f"../data/{data_path.split('/')[-1]}"
    with open(local_path, 'rb') as f:
        s3_client.put_object(Bucket=S3_BUCKET_NM, Key=s3_key, Body=f)
    
S3_DATA_PATH = f's3://{S3_BUCKET_NM}/{s3_key}'
print(S3_DATA_PATH)

### Create variables and label
---

The first step in Amazon Fraud Detector is to create variables. A **variable** is a data element from your dataset that you want to use to create event types, models, and rules.

Variables are created with the `create_variable` method from boto3. You need to map each variable to its data type and variable type. Amazon Fraud Detector uses the variable type to interpret the variable during <u>model training</u> and when <u>getting predictions</u>. **Only variables with an associated variable type can be used for model training**.

<br>
    
A **label** classifies an event as **fraudulent** or **legitimate** and is used to train the fraud detection model. The model learns to classify events using these label values. The `put_label` method from boto3 is used to create 2 labels, fraud and legit.
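
As a minimal sketch of the mapping described above, the helper below turns one entry of the recipe into the keyword arguments that this notebook passes to `create_variable`. The function name `variable_request` is hypothetical, and the default-value convention (`'0.0'` for `FLOAT`, `'<null>'` otherwise) simply mirrors the choice made in this notebook.

```python
def variable_request(mapping: dict) -> dict:
    """Translate one recipe entry into keyword arguments for create_variable."""
    # Same convention as this notebook: numeric variables default to '0.0'.
    default_value = "0.0" if mapping["data_type"] == "FLOAT" else "<null>"
    return {
        "name": mapping["variable_name"],
        "dataType": mapping["data_type"],
        "dataSource": "EVENT",
        "defaultValue": default_value,
        "description": mapping["variable_name"],
        "variableType": mapping["variable_type"],
    }


example = {
    "variable_name": "incident_hour",
    "variable_type": "NUMERIC",
    "data_type": "FLOAT",
}
request = variable_request(example)
print(request)
# With a client in scope, the call would be: afd_client.create_variable(**request)
```

Keeping the translation in a pure function makes it easy to inspect the request before any AWS call is made.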


In [None]:
afd_client = boto3.client('frauddetector')

In [None]:
for variable in config_file["variable_mappings"]:
    # Loop through our variable config files
    DEFAULT_VALUE = '0.0' if variable["data_type"] == "FLOAT" else '<null>'
    
    try:
        resp = afd_client.get_variables(name=variable["variable_name"])
        print(
            'variable {0} exists, data type: {1}'.format(variable['variable_name'], resp['variables'][0]['dataType'])
        )
    
    except (afd_client.exceptions.ResourceNotFoundException, IndexError):  # variable not found
        print(
            'Creating variable: {0}'.format(variable['variable_name'])
        )
        resp = afd_client.create_variable(
            name=variable['variable_name'],
            dataType=variable['data_type'],
            dataSource='EVENT',
            defaultValue=DEFAULT_VALUE,
            description=variable['variable_name'],
            variableType=variable['variable_type']
        )
        
    
        
resp = afd_client.put_label(
    name='fraud',
    description='FRAUD'
)

resp = afd_client.put_label(
    name='legit',
    description='LEGIT'
)

### Create Entity Type and Event Type
---

An **entity** represents who is performing the event, and an **entity type** classifies the entity. Examples of entity types include customer, merchant, account, and hospital.

<br>

With Amazon Fraud Detector, you build models that evaluate risk and generate fraud predictions for individual events. An **event type** defines the structure of an individual event.

Let's look at a simple example. Suppose you want to predict fraud on auto-insurance claim transactions:
- You create an event type named **claim_transaction**
- You define the event type by specifying its variables (e.g., claim amount, incident type, incident severity).
- You define the entity type as **customer** (or for the sake of demo, **demo_customer**).
- You define labels (fraud and legit)
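
The relationship between the pieces in the list above can be sketched as a plain data structure. `EventTypeSpec` is a hypothetical local class, not part of boto3, and the sample event values are made up for illustration. It only shows that an event type fixes which variables an event may carry.

```python
from dataclasses import dataclass


@dataclass
class EventTypeSpec:
    """Local, illustrative mirror of an event type definition (not an AWS class)."""
    name: str
    variables: list
    labels: list
    entity_types: list

    def unknown_variables(self, event: dict) -> list:
        """Return the keys of `event` that the event type does not declare."""
        return sorted(set(event) - set(self.variables))


spec = EventTypeSpec(
    name="claim_transaction",
    variables=["vehicle_claim", "incident_type", "incident_severity"],
    labels=["fraud", "legit"],
    entity_types=["demo_customer"],
)

# Made-up event: 'colour' is not declared on the event type.
event = {"vehicle_claim": "5200.0", "incident_type": "collision", "colour": "red"}
print(spec.unknown_variables(event))
```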


In [None]:
ENTITY_TYPE = 'demo_customer'
ENTITY_DESC = 'Demo of customer claiming auto-insurance'
EVENT_TYPE = 'auto_insurance_claim_event'

In [None]:
try:
    resp = afd_client.get_entity_types(name=ENTITY_TYPE)
    print(f'The given entity type - {ENTITY_TYPE} - already exists')
    print(resp)
    
except:
    resp = afd_client.put_entity_type(
        name=ENTITY_TYPE,
        description=ENTITY_DESC
    )
    print(f'Create Entity Type: {ENTITY_TYPE}')
    print(resp)

In [None]:
try:
    resp = afd_client.get_event_types(name=EVENT_TYPE)
    print(f'Event type: {EVENT_TYPE} already exists')
    print(resp)
    
except:
    resp = afd_client.put_event_type(
        name=EVENT_TYPE,
        eventVariables=EVENT_VARIABLES,
        labels=EVENT_LABELS,
        entityTypes=[ENTITY_TYPE]
    )
    print(f'Create event type: {EVENT_TYPE}')
    print(resp)

### Create, train, and deploy model
---

Amazon Fraud Detector trains models to learn to detect fraud for a specific event type. In the previous step, you have created the event type. Now, you will create and train a model for that specific event type. <u>The model acts as a container for your model versions.</u> **Each time you train a model, a new version is created.**

For more information about different model types that Amazon Fraud Detector supports, see [Choose a model type](https://docs.aws.amazon.com/frauddetector/latest/ug/choosing-model-type.html).


#### Create a model


In [None]:
MODEL_NAME = 'demo_auto_insurance_claim_fraud'
MODEL_TYPE = "ONLINE_FRAUD_INSIGHTS"
MODEL_DESC = 'Demo model to detect the fraud or legit of the auto-insurance claim'

In [None]:
try:
    resp = afd_client.create_model(
        modelId=MODEL_NAME,
        modelType=MODEL_TYPE,
        description=MODEL_DESC,
        eventTypeName=EVENT_TYPE
    )
    print(f'Initialize the model: {MODEL_NAME}')
    print(resp)

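except Exception as err:
    # Most likely the model already exists; report the error instead of hiding it
    print(f'Model {MODEL_NAME} was not created: {err}')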


#### Define the training data schema

In [None]:
training_data_schema = {
    'modelVariables': EVENT_VARIABLES,
    'labelSchema': {
        'labelMapper': {
            'FRAUD': ["fraud"],
            'LEGIT': ["legit"]
        }
    }
}

#### Train model
---

I will use the `create_model_version` method to train the model, specifying **EXTERNAL_EVENTS** for `trainingDataSource` together with the Amazon S3 location where the dataset is stored.

We also need to describe which event variables and labels to include, and how to classify the labels, in the `trainingDataSchema` parameter.


**Note**: for `trainingDataSource`, you can either specify it as internal (data stored in Amazon Fraud Detector) or external (data stored in Amazon S3). Please visit [Event Data Storage](https://docs.aws.amazon.com/frauddetector/latest/ug/event-data-storage.html) page for more information.
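
To make the two options in the note concrete, here is a hedged sketch of how the data-source arguments might be assembled for either case. `training_source_kwargs` is a hypothetical helper, the bucket and role ARN are placeholders, and the `INGESTED_EVENTS` / `ingestedEventsDetail` shape is my understanding of the boto3 API rather than something this notebook exercises, so verify it against the boto3 reference before relying on it.

```python
def training_source_kwargs(source, *, s3_uri=None, role_arn=None,
                           start_time=None, end_time=None):
    """Build the data-source arguments for create_model_version (illustrative)."""
    if source == "EXTERNAL_EVENTS":
        # Data lives in Amazon S3, as in this notebook.
        if not (s3_uri and role_arn):
            raise ValueError("EXTERNAL_EVENTS needs an S3 location and a role ARN")
        return {
            "trainingDataSource": source,
            "externalEventsDetail": {
                "dataLocation": s3_uri,
                "dataAccessRoleArn": role_arn,
            },
        }
    if source == "INGESTED_EVENTS":
        # Data stored in Amazon Fraud Detector; assumed parameter shape.
        if not (start_time and end_time):
            raise ValueError("INGESTED_EVENTS needs a start and end time")
        return {
            "trainingDataSource": source,
            "ingestedEventsDetail": {
                "ingestedEventsTimeWindow": {
                    "startTime": start_time,
                    "endTime": end_time,
                }
            },
        }
    raise ValueError(f"Unknown training data source: {source}")


# Placeholder values for illustration only.
kwargs = training_source_kwargs(
    "EXTERNAL_EVENTS",
    s3_uri="s3://my-example-bucket/amazon-fraud-detector/train/auto_insurance_fraud_train.csv",
    role_arn="arn:aws:iam::123456789012:role/ExampleRole",
)
print(kwargs)
```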


In [None]:
resp = afd_client.create_model_version(
    modelId=MODEL_NAME,
    modelType=MODEL_TYPE,
    trainingDataSource='EXTERNAL_EVENTS',
    trainingDataSchema=training_data_schema,
    externalEventsDetail={
        'dataLocation': S3_DATA_PATH,
        'dataAccessRoleArn': IAM_ROLE
    }
)

model_version = resp['modelVersionNumber']
print('Model training...')
print(resp)

Remark: Model training can take up to 1 hour to finish.

In [None]:
import sys
class StatusIndicator:
    def __init__(self):
        self.previous_status = None
        self.need_newline = False
        
    def update( self, status ):
        if self.previous_status != status:
            if self.need_newline:
                sys.stdout.write("\n")
            sys.stdout.write( status + " ")
            self.need_newline = True
            self.previous_status = status
        else:
            sys.stdout.write(".")
            self.need_newline = True
        sys.stdout.flush()

    def end(self):
        if self.need_newline:
            sys.stdout.write("\n")

In [None]:
import time
def wait(callback, time_interval: int=30):
    status_indicator = StatusIndicator()

    while True:
        status = callback()['modelVersionDetails'][0]['status']
        status_indicator.update(status)
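        # Stop on success and on terminal failure states, so the loop cannot spin forever
        if status.upper() in ('READY_TO_DEPLOY', 'TRAINING_COMPLETE', 'TRAINING_CANCELLED', 'ERROR'): break
        time.sleep(time_interval)

    status_indicator.end()
    
    return status.upper() in ('READY_TO_DEPLOY', 'TRAINING_COMPLETE')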

In [None]:
resp = afd_client.describe_model_versions(
    modelId=MODEL_NAME,
    modelType=MODEL_TYPE,
    modelVersionNumber=model_version
)

resp['modelVersionDetails'][0]['status']

In [None]:
status = wait(
    lambda: afd_client.describe_model_versions(
        modelId=MODEL_NAME,
        modelType=MODEL_TYPE,
        modelVersionNumber=model_version
    )
)
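
Because training can run for a long time, it may be worth bounding the polling loop. The sketch below is one possible variant with an overall timeout, not the notebook's own helper. `wait_for_status` and its parameters are hypothetical, and the status sequence is simulated so the example runs without AWS access.

```python
import time


def wait_for_status(get_status, done, failed, interval=30.0, timeout=3600.0,
                    sleep=time.sleep):
    """Poll get_status() until it returns a value in `done` or `failed`,
    or until `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in done:
            return status
        if status in failed:
            raise RuntimeError(f"Stopped with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout} seconds")
        sleep(interval)


# Simulated statuses standing in for describe_model_versions responses.
fake_statuses = iter(
    ["TRAINING_IN_PROGRESS", "TRAINING_IN_PROGRESS", "TRAINING_COMPLETE"]
)
final = wait_for_status(
    lambda: next(fake_statuses),
    done={"TRAINING_COMPLETE"},
    failed={"ERROR"},
    interval=0,
    timeout=5,
)
print(final)
```

In the notebook, `get_status` would wrap the `describe_model_versions` call and return `['modelVersionDetails'][0]['status']`.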

#### Review model performance

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

resp = afd_client.describe_model_versions(
    modelId=MODEL_NAME,
    modelVersionNumber=model_version,
    modelType=MODEL_TYPE,
)

training_metrics = resp['modelVersionDetails'][0]['trainingResult']['trainingMetrics']
perf_auc = training_metrics['auc']
df_model = pd.DataFrame(training_metrics['metricDataPoints'])

plt.figure(figsize=(15, 6))
plt.plot(df_model["fpr"], df_model["tpr"], color='darkorange', lw=2, label='ROC curve (area = %0.3f)'%perf_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(MODEL_NAME+' ROC Chart')
plt.legend(loc="lower right", fontsize=12)
plt.show()

In [None]:
var_imp_metrics = resp['modelVersionDetails'][0]['trainingResult']['variableImportanceMetrics']
df_var_imp = pd.DataFrame(var_imp_metrics['logOddsMetrics']).sort_values(by='variableImportance')
df_var_imp.plot.barh(
    x='variableName',
    y='variableImportance',
    figsize=(14, int(0.5 * df_var_imp.shape[0]))
)
plt.xlabel('Variable Importance (logOdds)')
plt.legend(loc="lower right", fontsize=12)
plt.show();

#### Deploy a model
--- 
After reviewing the model performance, deploy the trained model with the `update_model_version_status` method.

In [None]:
import time
from IPython.display import clear_output

resp = afd_client.update_model_version_status (
    modelId=MODEL_NAME,
    modelType=MODEL_TYPE,
    modelVersionNumber=model_version,
    status='ACTIVE'
)
print(f'Activate model: {MODEL_NAME}')
print(resp)

print('Waiting until model status is active ...')
stime = time.time()

while True:
    clear_output(wait=True)
    resp = afd_client.get_model_version(
        modelId=MODEL_NAME,
        modelType=MODEL_TYPE, 
        modelVersionNumber=model_version
    )
    if resp['status'] != 'ACTIVE':
        print(f"current progress: {(time.time() - stime)/60:{3}.{3}} minutes")
        time.sleep(60)
        
    if resp['status'] == 'ACTIVE':
        print("Model status : " +  resp['status'])
        break
        
etime = time.time()
print("Elapsed time : %s" % (etime - stime) + " seconds \n"  )
print(resp)

You can check the status of the model. It should be marked as **ACTIVE**.

In [None]:
resp = afd_client.describe_model_versions(
    modelId= MODEL_NAME,
    modelVersionNumber=model_version,
    modelType=MODEL_TYPE,
)
print(resp['modelVersionDetails'][0]['status'])

### Create detector, outcomes, rules, and detector version
---

A **detector** contains the detection logic, such as the models and rules, for a particular event that you want to evaluate for fraud. A **rule** is a condition that tells Amazon Fraud Detector how to interpret variable values during prediction. An **outcome** is the result of a fraud prediction. A detector can have multiple versions, each with a status of *DRAFT*, *ACTIVE*, or *INACTIVE*. A detector version must have at least one rule associated with it.


#### Create a detector

In this demonstration, I will use the `put_detector` method to create an **auto_insurance_claim** detector for my event type.

In [None]:
DETECTOR_NAME = 'auto_insurance_claim'
DETECTOR_DESC = 'Detector to detect fraud in auto insurance claim'

In [None]:
resp = afd_client.put_detector(
    detectorId=DETECTOR_NAME,
    description=DETECTOR_DESC,
    eventTypeName=EVENT_TYPE
)
print(resp)

#### Create Rule and Outcomes

In this demonstration, I will also create **rules**. A rule consists of one or more variables from the dataset, a logical expression, and one or more outcomes.


##### Before we create the rule...

Let's review the earlier evaluation metrics.

- **TPR (True Positive Rate)** or Recall (Formula: # True positives / # positives)

Recall is the fraction of relevant instances that were retrieved. In simpler terms, it measures what proportion of actual positives was identified correctly.

- **FPR (False Positive Rate)** (Formula: # False Positives / # negatives)

This is <u>not</u> precision. While precision measures the probability of a sample classified as positive to actually be positive, the **false positive rate measures the ratio of false positives within the negative samples**.
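
To see the difference between these metrics numerically, the sketch below computes TPR, FPR, and precision from confusion-matrix counts. The counts are hypothetical (1,000 claims, 50 of them fraudulent) and are not taken from the model trained in this notebook.

```python
def rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute TPR (recall), FPR, and precision from confusion-matrix counts."""
    return {
        "tpr": tp / (tp + fn),        # share of actual positives that were caught
        "fpr": fp / (fp + tn),        # share of actual negatives that were flagged
        "precision": tp / (tp + fp),  # share of flagged cases that are truly positive
    }


# Hypothetical counts: 50 fraudulent claims and 950 legitimate ones.
metrics = rates(tp=40, fp=19, tn=931, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the FPR is only 2%, yet roughly one in three flagged claims is legitimate, which is why a low FPR does not by itself imply high precision on imbalanced data.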

In [None]:
# -- check the score thresholds with FPR from 1% to 10% --
model_stat = df_model.sort_values(by='fpr')
model_stat['fpr_bin'] = np.ceil(model_stat['fpr'] * 100) * 0.01
m = model_stat.loc[model_stat.groupby(["fpr_bin"])["threshold"].idxmin()] 
m = m.round(decimals=2)[['fpr','precision','tpr','threshold']]
print ("--- score thresholds 1% to 10% ---")
print(m.loc[(m['fpr'] > 0.00 ) & (m['fpr'] <= 0.10)].reset_index(drop=True))

The choice of score thresholds depends heavily on your problem statement and objective.

For this specific example, I will create 3 simple rules by cutting the score at 955 and 815, giving 3 possible outcomes: `fraud`, `investigate`, and `approve`.

- score > 955: fraud
- score > 815: investigate
- score <= 815: approve
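
The three rules above amount to a first-match lookup over the cutoffs. The following sketch reproduces that logic locally so the boundaries can be checked before any rule is created. `outcome_for` is a hypothetical helper and is not part of Amazon Fraud Detector.

```python
def outcome_for(score: float, cutoffs=(955, 815),
                outcomes=("fraud", "investigate", "approve")) -> str:
    """Return the first outcome whose cutoff the score exceeds, else the last outcome."""
    for cutoff, outcome in zip(cutoffs, outcomes):
        if score > cutoff:
            return outcome
    return outcomes[-1]


for score in (990, 955, 900, 815, 120):
    print(score, outcome_for(score))
```

Note that the comparisons are strict, so a score of exactly 955 falls into `investigate` and a score of exactly 815 falls into `approve`.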

In [None]:
score_cutoffs = [955, 815]
outcomes = ['fraud', 'investigate', 'approve']
afd_client = boto3.client('frauddetector')

In [None]:
def create_afd_outcomes(outcomes: list):
    for outcome in outcomes:
        print(f'Creating outcome variables: {outcome}')
        resp = afd_client.put_outcome(
            name=outcome,
            description=outcome
        )
        
    return None


def create_afd_rules(score_cutoffs: list, outcomes: list):
    if len(score_cutoffs) + 1 != len(outcomes):
        raise ValueError('Your score cutoffs and outcomes do not match.')
        
    rule_list = []  # initialize
    for i in range(len(outcomes)):
        if i < (len(outcomes) - 1):
            rule = f'${MODEL_NAME}_insightscore > {score_cutoffs[i]}'
            
        else:
            rule = f'${MODEL_NAME}_insightscore <= {score_cutoffs[i - 1]}'
            
        rule_id = f'rule_{i}_{MODEL_NAME}'
        
        rule_list.append({
            'ruleId': rule_id,
            'ruleVersion': '1',
            'detectorId': DETECTOR_NAME
        })
        
        print(f'Creating rule: {rule_id}: IF {rule} THEN {outcomes[i]}')
        
        try:
            resp = afd_client.create_rule(
                detectorId=DETECTOR_NAME,
                ruleId=rule_id,
                expression=rule,
                language='DETECTORPL',
                outcomes=[outcomes[i]]
            )
            print(resp)
            
        except:
            print(f'This rule already exists in this detector ({DETECTOR_NAME})')
        

    return rule_list
     
print(" -- create AFD outcomes --")
_ = create_afd_outcomes(outcomes)

print(" -- create AFD rules --")
rule_list = create_afd_rules(score_cutoffs, outcomes)

#### Create Detector Version

A detector version defines the model and rules that are used to generate fraud predictions.


Note that there are 2 options in `ruleExecutionMode` parameter:
- **ALL_MATCHED**   - return all matched rules' outcome
- **FIRST_MATCHED** - return first matched rule's outcome
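
The difference between the two modes can be illustrated with a small local simulation. This is only a sketch of the semantics described above, not the service's implementation. `evaluate` and the lambda-based rules are hypothetical stand-ins for the rules created earlier.

```python
def evaluate(score, rules, mode="FIRST_MATCHED"):
    """Return the outcomes of matching rules, evaluated in list order."""
    matched = [outcome for predicate, outcome in rules if predicate(score)]
    if mode == "FIRST_MATCHED":
        return matched[:1]
    if mode == "ALL_MATCHED":
        return matched
    raise ValueError(f"Unknown rule execution mode: {mode}")


# Local stand-ins for the three score rules, in the same order.
rules = [
    (lambda s: s > 955, "fraud"),
    (lambda s: s > 815, "investigate"),
    (lambda s: s <= 815, "approve"),
]

print(evaluate(980, rules, "FIRST_MATCHED"))  # ['fraud']
print(evaluate(980, rules, "ALL_MATCHED"))    # ['fraud', 'investigate']
```

A score of 980 satisfies both the first and second rule, so rule order matters under `FIRST_MATCHED`, while `ALL_MATCHED` reports both outcomes.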

In [None]:
resp = afd_client.create_detector_version(
    detectorId=DETECTOR_NAME,
    rules=rule_list,
    modelVersions=[{
        'modelId': MODEL_NAME,
        'modelType': MODEL_TYPE,
        'modelVersionNumber': model_version
    }],
    ruleExecutionMode='FIRST_MATCHED'
)

print(f"Detector created: {DETECTOR_NAME}")
print(resp) 

Once we create the **detector version**, it will be in **DRAFT** status. To use it, let's **ACTIVATE** it.

In [None]:
detector_version_summaries = afd_client.describe_detector(
    detectorId=DETECTOR_NAME
)['detectorVersionSummaries']

# Version IDs are strings, so compare them numerically ('10' must rank above '9')
latest_detector_version = max([det['detectorVersionId'] for det in detector_version_summaries], key=int)
print(f'Latest Detector Version: {latest_detector_version}')

resp = afd_client.update_detector_version_status(
    detectorId=DETECTOR_NAME,
    detectorVersionId=latest_detector_version,
    status='ACTIVE'
)
print(f'Activating the detector: {DETECTOR_NAME}')
print(resp)

We can verify the version and status of the detector by using `describe_detector` method.

In [None]:
afd_client.describe_detector(detectorId=DETECTOR_NAME)

### Make Prediction
---
Lastly, we will use the detector to make predictions on the dataset. There are 2 ways to make predictions:

1. `get_event_prediction` method - this method evaluates a single event against a detector version. 
2. `create_batch_prediction_job` method - this method creates a batch prediction job. This is more suitable for **offline** prediction (e.g., on a schedule).
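
For the real-time path, each record has to be turned into a map of string values plus an event timestamp. The sketch below shows one way to prepare both, assuming you prefer to omit missing values rather than send them and that timestamps are sent in UTC. `to_event_variables` and `utc_timestamp` are hypothetical helpers, and the sample row is made up.

```python
import datetime
import math


def to_event_variables(record: dict) -> dict:
    """Stringify values and drop missing ones (None or NaN)."""
    cleaned = {}
    for name, value in record.items():
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        cleaned[name] = str(value)
    return cleaned


def utc_timestamp(now=None) -> str:
    """Format a timezone-aware datetime as an ISO 8601 UTC string ending in 'Z'."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return now.astimezone(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Made-up row with two missing values.
row = {
    "customer_age": 42.0,
    "incident_type": None,
    "num_witnesses": float("nan"),
    "vehicle_claim": 5200,
}
print(to_event_variables(row))
print(utc_timestamp(datetime.datetime(2023, 11, 6, 8, 35, 33,
                                      tzinfo=datetime.timezone.utc)))
```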



#### Real-time Prediction

In [None]:
%%time
import datetime
import uuid

def _predict(record, event_variables: list=EVENT_VARIABLES):
    """
    Get prediction on one event
    """
    
    # Initialize
    _event_id = str(uuid.uuid1())
    _entity_id = str(uuid.uuid1())
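    # The trailing 'Z' denotes UTC, so take the current time in UTC rather than local time
    _event_timestamp = datetime.datetime.now(datetime.timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')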
    _client = boto3.client('frauddetector')
    try:
        rec_content = {
            event_variables[i]: str(record[i]) for i in range(len(event_variables))
        }
        pred = _client.get_event_prediction(
            detectorId=DETECTOR_NAME,
            detectorVersionId=latest_detector_version,
            eventId=_event_id,
            eventTypeName=EVENT_TYPE,
            eventTimestamp=_event_timestamp,
            entities=[{
                'entityType': ENTITY_TYPE,
                'entityId': _entity_id
            }],
            eventVariables=rec_content
        )

        record.append(pred['modelScores'][0]['scores'][f'{MODEL_NAME}_insightscore'])
        record.append(pred['ruleResults'][0]['outcomes'])

    except:
        record.append("-999")
        record.append(["error"])
    
    return record


In [None]:
print(test_df.shape)

In [None]:
from multiprocessing import Pool

test_list = test_df[EVENT_VARIABLES + ['EVENT_LABEL']].values.tolist()
with Pool(processes=2) as p:
    result = p.map(_predict, test_list)
    
test_pred = pd.DataFrame(
    result, columns=EVENT_VARIABLES + ['EVENT_LABEL', 'score', 'outcomes']
)

In [None]:
test_pred.head(3)

##### Check the distribution by labels

In [None]:
import numpy as np
import warnings

plt.figure(figsize = (20, 8))
warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
ax = plt.hist(
    [
        test_pred.loc[test_pred['EVENT_LABEL'].isin(config_file['label_mappings']['LEGIT'])]['score'],
        test_pred.loc[test_pred['EVENT_LABEL'].isin(config_file['label_mappings']['FRAUD'])]['score']
    ], 
    bins = 50
)
plt.legend(["Legit", "Fraud"], fontsize=12)
plt.title("Predicted Score Distribution By Label")
plt.xlabel("Predicted Score")
plt.ylabel("Frequency")
plt.show();

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

test_pred['outcome_str'] = test_pred['outcomes'].apply(lambda x: x[0])
sns.catplot(
    data=test_pred, x='outcome_str', y='score',
    col='EVENT_LABEL', kind='box', col_wrap=2
)
plt.show();


ax = sns.countplot(data=test_pred, x="outcome_str", hue="EVENT_LABEL")
ax.set_yscale('log')
plt.show();

#### Batch prediction
---

You can also schedule and run the **detector** in batch mode. To call batch prediction, the dataset must have the same column names as those we defined in the **fraud detector**.

In [None]:
col_required = [
    'customer_age', 'incident_type', 'last_name', 'num_witnesses', 'EVENT_TIMESTAMP', 'num_vehicles_involved', 
    'policy_deductable', 'injury_claim', 'num_claims_past_year', 'policy_id', 'incident_hour', 
    'ENTITY_TYPE', 'vehicle_claim', 'num_injuries', 'policy_annual_premium', 'incident_severity', 'EVENT_ID', 
    'police_report_available', 'first_name', 'ENTITY_ID'
]

# Copy the slice so the assignment below does not write into a view of test_df
batch_df = test_df[col_required].copy()
batch_df['ENTITY_TYPE'] = ENTITY_TYPE
print(batch_df.shape)
batch_df.to_csv(f's3://{S3_BUCKET_NM}/{S3_PREFIX}/batch_input/batch_data.csv', header=True, index=False)

In [None]:
JOB_ID = f'auto-insurance-batch-prediction-{datetime.datetime.now().strftime("%Y%m%d_%H%M%S")}'
print(JOB_ID)

In [None]:
resp = afd_client.create_batch_prediction_job(
    jobId=JOB_ID,
    inputPath=f's3://{S3_BUCKET_NM}/{S3_PREFIX}/batch_input/batch_data.csv',
    outputPath=f's3://{S3_BUCKET_NM}/{S3_PREFIX}/batch_output/',
    eventTypeName=EVENT_TYPE,
    detectorName=DETECTOR_NAME,
    detectorVersion=latest_detector_version,
    iamRoleArn=IAM_ROLE
)

In [None]:
resp = afd_client.get_batch_prediction_jobs(jobId=JOB_ID)
resp

Once complete, let's look at the batch output file.

In [None]:
batch_output_path = '../batch_output/'
!mkdir -p {batch_output_path}
# Reuse the S3 client created earlier in this notebook (s3_client)
resp = s3_client.list_objects_v2(
    Bucket=S3_BUCKET_NM,
    Prefix=f'{S3_PREFIX}/batch_output/',
)

for content_ in resp['Contents']:
    print(content_['Key'])
    s3_client.download_file(S3_BUCKET_NM, content_['Key'], f"{batch_output_path}{content_['Key'].split('/')[-1]}")

In [None]:
out_batch_df = pd.read_csv(f"{batch_output_path}auto-insurance-batch-prediction-20231106_083533_1699259778_output.csv")
out_batch_df.head()

As can be seen, the detector appends 4 new columns:
- **status**: Whether the record was processed successfully or failed
- **outcomes**: The outcomes produced by the model and the rules defined in the detector
- **rule_results**: The rule that each record matched
- **model_scores**: The score from the supervised ML model