# Amazon Fraud Detector - Data Profiler Notebook 


### Dataset Guidance
-------

Amazon Fraud Detector's Online Fraud Insights (OFI) model supports a flexible schema, enabling you to train an OFI model on your specific data and business needs. This notebook helps you profile your data and identify potential issues before you train an OFI model. The minimum CSV file requirements are summarized below:

* The files are in CSV UTF-8 (comma delimited) format (*.csv).
* The file should contain at least 10k rows and the following __four__ required fields:  

    * Event timestamp 
    * IP address 
    * Email address
    * Fraud label 
    
* The maximum file size is 10 gigabytes (GB).  

* The following dates and datetime formats are supported:
    * Dates: YYYY-MM-DD (e.g. 2019-03-21)
    * Datetime: YYYY-MM-DD HH:mm:ss (e.g. 2019-03-21 12:01:32) 
    * ISO 8601 Datetime: YYYY-MM-DDTHH:mm:ss+/-HH:mm (e.g. 2019-03-21T20:58:41+07:00)

* The decimal precision is up to four decimal places.
* Numeric data should not contain commas or currency symbols. 
* Columns whose values could contain commas, such as addresses or custom text, should be enclosed in double quotes. 
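
The row-count and required-field rules above can be checked before any upload. The following is a minimal sketch, assuming pandas is available; the helper name `check_minimum_requirements` and the sample column names are hypothetical and are not part of Amazon Fraud Detector or `afd_profile`.

```python
import io
import pandas as pd

# Minimum row count taken from the guidance above.
MIN_ROWS = 10_000

def check_minimum_requirements(df, required_columns, min_rows=MIN_ROWS):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        problems.append(f"missing required columns: {missing}")
    if len(df) < min_rows:
        problems.append(f"only {len(df)} rows; at least {min_rows} are required")
    return problems

# Tiny in-memory example (hypothetical data, far below the real row minimum).
sample_csv = io.StringIO(
    "EVENT_TIMESTAMP,ip_address,email_address,EVENT_LABEL\n"
    "2019-03-21 12:01:32,192.0.2.10,fake_user@example.com,legit\n"
)
sample_df = pd.read_csv(sample_csv)
problems = check_minimum_requirements(
    sample_df,
    ["EVENT_TIMESTAMP", "ip_address", "email_address", "EVENT_LABEL"],
)
print(problems)
```

For a real file, replace the in-memory sample with `pd.read_csv` on your own CSV and pass the column names that hold the four required fields.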



### Getting Started with Data 
-------
The following general guidance will help you get the most out of your Amazon Fraud Detector Online Fraud Insights model. 

* Gathering Data: The OFI model requires a minimum of 10k records. We recommend collecting at least 6 weeks of historic data, though 3 to 6 months is preferable. During training, the OFI model partitions your data by event timestamp so that performance metrics are calculated on the out-of-sample (latest) data, which is why the format of the event timestamp is important. 

  
* Data & Label Maturity: As part of the data gathering process, ensure that records have had sufficient time to "mature", i.e. that enough time has passed for "non-fraud" and "fraud" records to be correctly identified. It often takes 30 to 45 days (or more) to correctly identify fraudulent events, so it is important that the latest records are at least 30 days old.  

  
* Sampling: The OFI training process samples and partitions historic data based on event timestamp. There is no need to manually sample the data, and doing so may negatively influence your model's results.  

  
* Fraud Labels: The OFI model requires that a minimum of 500 observations are identified and labeled as "fraud". As noted above, fraud label maturity is important. Ensure that the extracted data has matured enough for fraudulent events to have been reliably found. 
  
  
* Custom Fields: The OFI model requires 4 fields: event timestamp, IP address, email address and fraud label. The more custom fields you provide, the better the OFI model can differentiate between fraud and non-fraud.  
  
  
* Nulls and Missing Values: The OFI model handles null and missing values, but the percentage of nulls in key fields should be limited. In particular, the timestamp and fraud label columns should not contain any missing values.   
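
Several of the points above (length of history, label maturity, fraud count, nulls) reduce to simple measurements on the data. The sketch below is one way to compute them with pandas; the function name `summarize_readiness` and its parameters are assumptions made for illustration and do not come from `afd_profile`.

```python
import pandas as pd

def summarize_readiness(df, timestamp_col, label_col, fraud_value, as_of):
    """Compute simple indicators that correspond to the guidance above."""
    timestamps = pd.to_datetime(df[timestamp_col], errors="coerce")
    return {
        # Days between the oldest and newest event (guidance: 6 weeks or more).
        "history_days": (timestamps.max() - timestamps.min()).days,
        # Age of the newest event relative to `as_of` (guidance: 30 days or more).
        "newest_record_age_days": (pd.Timestamp(as_of) - timestamps.max()).days,
        # Number of rows labeled as fraud (guidance: 500 or more).
        "fraud_count": int((df[label_col] == fraud_value).sum()),
        # Fraction of missing values per column.
        "null_fraction": df.isna().mean().to_dict(),
    }

# Hypothetical three-row example; real data would have thousands of rows.
example = pd.DataFrame({
    "EVENT_TIMESTAMP": ["2020-01-01 00:00:00", "2020-03-01 00:00:00", "2020-04-01 00:00:00"],
    "EVENT_LABEL": ["legit", "fraud", None],
})
summary = summarize_readiness(
    example, "EVENT_TIMESTAMP", "EVENT_LABEL", "fraud", as_of="2020-05-01"
)
print(summary)
```

Passing `as_of` explicitly, rather than reading the current date inside the function, keeps the result reproducible when the notebook is rerun later.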

  
If you would like to know more, please check out the [Fraud Detector's Documentation](https://docs.aws.amazon.com/frauddetector/). 


In [1]:
# display, HTML, IFrame and clear_output are all exported by IPython.display
from IPython.display import display, HTML, IFrame, clear_output
# Widen the notebook container so the embedded report fits
display(HTML("<style>.container { width:90% }</style>"))
# ------------------------------------------------------------------
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.options.display.float_format = '{:.4f}'.format

# -- AWS stuff -- 
import boto3


### Amazon Fraud Detector Profiling 
-----

Download the afd_profile.py Python program and the templates directory from GitHub and copy them to your notebook directory.  

<div class="alert alert-info"> <strong> afd_profile.py </strong>

- afd_profile.py: the Python module that generates your profile report. 
- /templates: the directory that contains the supporting profile templates 


</div>


In [2]:
# -- get this package from github -- 
import afd_profile

### Initialize your S3 client 
-----
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html

In [3]:
client = boto3.client('s3')

### File & Field Mapping
-----
Simply map your file and field names to the required config values. 

<div class="alert alert-info"> <strong> Map the Required fields </strong>

- input_file: this is your CSV file in your S3 bucket 

<b> required_features </b> are the minimum required features to run Amazon Fraud Detector 
- EVENT_TIMESTAMP: map this to your file's Date or Datetime field.    
- IP_ADDRESS: map this to your file's IP address field.   
- EMAIL_ADDRESS: map this to your file's email address field.  
- EVENT_LABEL: map this to your file's fraud label field.  
    **Note: the profiler will identify the "rare" case and assume that it is fraud.**
    
</div>
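
Because the profiler treats the rare label value as fraud, it can help to confirm which value is actually the rare one in your file before mapping it. The snippet below is a minimal sketch using pandas; `rare_label` is a hypothetical helper and is not how `afd_profile` necessarily implements this step.

```python
import pandas as pd

def rare_label(labels):
    """Return the least frequent label value and its share of all non-null labels."""
    shares = labels.value_counts(normalize=True)
    return shares.idxmin(), float(shares.min())

# Hypothetical label column: 95 legitimate events and 5 fraudulent ones.
labels = pd.Series(["legit"] * 95 + ["fraud"] * 5)
value, share = rare_label(labels)
print(value, share)
```

If the rare value turns out not to be your fraud class, the labels in the file may need to be reviewed before profiling.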


In [4]:
input_file = 's3://sagemaker-us-east-2-057716757052/fraud/console/registration_data_20K_minimum.csv'
! aws s3 cp {input_file} .

download: s3://sagemaker-us-east-2-057716757052/fraud/console/registration_data_20K_minimum.csv to ./registration_data_20K_minimum.csv


In [5]:
local_input_file = 'registration_data_20K_minimum.csv'
%store -r train_data_file_name
%store -r test_data_file_name

## Use the data created in the previous notebook
train_data_file_name = '../' + train_data_file_name
test_data_file_name = '../' +  test_data_file_name

In [6]:
import pandas as pd
df = pd.read_csv(local_input_file)
df.head()

Unnamed: 0,ip_address,email_address,EVENT_TIMESTAMP,EVENT_LABEL
0,46.41.252.160,fake_acostasusan@example.org,10/8/2019 20:44,legit
1,152.58.247.12,fake_christopheryoung@example.com,5/23/2020 19:44,legit
2,12.252.206.222,fake_jeffrey09@example.org,4/24/2020 18:26,legit
3,170.81.164.240,fake_ncastro@example.org,4/22/2020 19:07,legit
4,165.182.68.217,fake_charles99@example.org,12/31/2019 17:08,legit
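
The timestamps in this preview (for example `10/8/2019 20:44`) do not match any of the supported formats listed in the Dataset Guidance. The following sketch shows one way to convert them to `YYYY-MM-DD HH:mm:ss` with pandas. It assumes the values are in month/day/year order, which is consistent with rows such as `5/23/2020` and `12/31/2019`.

```python
import pandas as pd

# Sample values copied from the preview above.
raw = pd.Series(["10/8/2019 20:44", "5/23/2020 19:44", "12/31/2019 17:08"])

# Parse with an explicit format (assumed month/day/year), then re-render
# in the supported "YYYY-MM-DD HH:mm:ss" layout.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M")
formatted = parsed.dt.strftime("%Y-%m-%d %H:%M:%S")
print(formatted.tolist())
```

Giving `pd.to_datetime` an explicit `format` avoids silent day/month swaps that can occur when the parser has to guess.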


In [7]:
# -- update your configuration -- 
config = {  
    "input_file"        : local_input_file,
    "required_features" : {
        "EVENT_TIMESTAMP" : "EVENT_TIMESTAMP",
        "EVENT_LABEL"     : "EVENT_LABEL",
        "IP_ADDRESS"      : "ip_address",
        "EMAIL_ADDRESS"   : "email_address"
    }
}


#### Run Profiler
-----
The profiler will read your file and produce an HTML file as a result which will be displayed inline within this notebook.  
  
Note: you can also open **report.html** in a separate browser tab. 

In [8]:
# -- generate the report object --
report = afd_profile.profile_report(config)

0


In [9]:
with open("report.html", "w") as file:
    file.write(report)

IFrame(src='report.html', width=1500, height=800)


## Profiling the training data

In [10]:
# -- update your configuration -- 
config = {  
    "input_file"        : train_data_file_name,
    "required_features" : {
        "EVENT_TIMESTAMP" : "EVENT_TIMESTAMP",
        "EVENT_LABEL"     : "EVENT_LABEL",
        "IP_ADDRESS"      : "ip_address",
        "EMAIL_ADDRESS"   : "email_address"
    }
}

# -- generate the report object --
report = afd_profile.profile_report(config)

with open("report.html", "w") as file:
    file.write(report)

IFrame(src='report.html', width=1500, height=800)



0


In [11]:
import pandas as pd
train_df = pd.read_csv(train_data_file_name)
train_df

Unnamed: 0,ip_address,email_address,EVENT_TIMESTAMP,EVENT_LABEL
0,124.199.26.246,fake_valdezstephen@example.com,2019-07-16 06:36:00,legit
1,105.141.20.185,fake_markwilliams@example.org,2019-07-16 06:45:00,legit
2,105.141.20.185,fake_velasquezjonathan@example.com,2019-07-16 06:47:00,legit
3,87.199.12.89,fake_richardsmith@example.net,2019-07-16 06:59:00,legit
4,31.175.34.58,fake_nolandenise@example.net,2019-07-16 07:07:00,legit
...,...,...,...,...
15995,83.177.133.184,fake_dmyers@example.net,2020-05-04 05:17:00,legit
15996,114.138.128.120,fake_christophersummers@example.org,2020-05-04 05:50:00,legit
15997,198.223.24.11,fake_claire35@example.org,2020-05-04 06:46:00,legit
15998,3.86.113.1,fake_elizabethyoung@example.com,2020-05-04 06:52:00,legit


In [12]:

test_df = pd.read_csv(test_data_file_name)
test_df

Unnamed: 0,ip_address,email_address,EVENT_TIMESTAMP,EVENT_LABEL
0,101.217.74.233,fake_tonyawhite@example.com,2020-05-04 07:48:00,legit
1,15.152.71.113,fake_alexandra43@example.org,2020-05-04 08:03:00,legit
2,20.202.25.188,fake_thamilton@example.org,2020-05-04 08:10:00,legit
3,138.156.38.109,fake_cwright@example.net,2020-05-04 08:18:00,legit
4,34.252.83.112,fake_cheyenne26@example.net,2020-05-04 08:45:00,legit
...,...,...,...,...
3995,120.7.171.56,fake_susan29@example.org,2020-07-15 10:24:00,fraud
3996,58.165.149.100,fake_lynn08@example.net,2020-07-15 10:41:00,legit
3997,222.130.170.141,fake_jonathananderson@example.net,2020-07-15 11:26:00,legit
3998,103.157.16.47,fake_julia74@example.com,2020-07-15 11:33:00,legit


## Profiling the test data

An error occurs because the number of fraud records is less than 500.

In [13]:
# -- update your configuration -- 
config = {  
    "input_file"        : test_data_file_name,
    "required_features" : {
        "EVENT_TIMESTAMP" : "EVENT_TIMESTAMP",
        "EVENT_LABEL"     : "EVENT_LABEL",
        "IP_ADDRESS"      : "ip_address",
        "EMAIL_ADDRESS"   : "email_address"
    }
}

# -- generate the report object --
report = afd_profile.profile_report(config)

with open("report.html", "w") as file:
    file.write(report)

IFrame(src='report.html', width=1500, height=800)




TypeError: must be str, not int
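
Since the error above is attributed to the test split having fewer than 500 fraud records, counting fraud labels in each split before profiling can flag the problem early. This is a minimal sketch; `fraud_counts` is a hypothetical helper, and the two small DataFrames are stand-ins for `train_df` and `test_df`, not the real data.

```python
import pandas as pd

MIN_FRAUD = 500  # minimum number of fraud-labeled records stated in the guidance

def fraud_counts(frames, label_col="EVENT_LABEL", fraud_value="fraud"):
    """Map each named DataFrame to (fraud count, whether it meets MIN_FRAUD)."""
    result = {}
    for name, frame in frames.items():
        count = int((frame[label_col] == fraud_value).sum())
        result[name] = (count, count >= MIN_FRAUD)
    return result

# Hypothetical stand-ins for the train and test splits.
train_like = pd.DataFrame({"EVENT_LABEL": ["fraud"] * 600 + ["legit"] * 1400})
test_like = pd.DataFrame({"EVENT_LABEL": ["fraud"] * 120 + ["legit"] * 380})
counts = fraud_counts({"train": train_like, "test": test_like})
print(counts)
```

In this notebook the same call could be made with `{"train": train_df, "test": test_df}` once both files have been loaded.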