# Intro

This notebook presents an ETL pipeline that maps raw name-screening data from AIA Bank to "agent-friendly" input.

The `data` folder initially contains only raw data. Each step writes its transformed data to a new folder.

# Libraries

In [1]:
from pipeline.preprocessors import (convert_to_standardized, 
                                    spark_instance, 
                                    RAW_DATA_DIR, 
                                    STANDARDIZED_DATA_DIR, 
                                    CLEANSED_DATA_DIR,
                                    APPLICATION_DATA_DIR,
                                    transform_cleansed_to_application,
                                    show_files_in_directory,
                                    get_pandas_dataframe,
                                    convert_standardized_to_cleansed,)

{'dataset': 'learning_quality', 'input_files_dir': 'inputs', 'artifact_files_dir': 'artifacts', 'crucial_artifact_files_dir': 'crucial_artifacts', 'exportable_files_dir': 'exportable', 'tmp_dir': 'tmp', 'processing_pool_size': 10, 'customer_specific': {'agents': {'name': {'data_availability': {'agent_input': 'agent-name-inputs.csv', 'fields': {'alerted_party_field': 'ad_alerted_name', 'watchlist_field': 'ad_matched_name'}, 'crucial_artifact': 'agent-name-availability.csv'}, 'artifact_file': 'agent-name-results.csv', 'tuning_report_path': 'agent-name-tuning-report.csv'}, 'dob': {'data_availability': {'agent_input': 'agent-dob-inputs.csv', 'fields': {'alerted_party_field': 'ad_alerted_dob', 'watchlist_field': 'ad_matched_dob'}, 'crucial_artifact': 'agent-dob-availability.csv'}, 'artifact_file': 'agent-dob-results.csv'}, 'nationality': {'data_availability': {'agent_input': 'agent-nationality-inputs.csv', 'fields': {'alerted_party_field': 'ad_alerted_nationality', 'watchlist_field': 'ad_ma

Ivy Default Cache set to: /app/dependencies/ivy2/cache
The jars for the packages stored in: /app/dependencies/ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e3ae5b3d-9e31-4fd8-b3b5-d6abb90e9c48;1.0
	confs: [default]


:: loading settings :: url = jar:file:/env/ds/anaconda/envs/pipeline/lib/python3.7/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


	found io.delta#delta-core_2.12;1.0.0 in central
	found org.antlr#antlr4;4.7 in central
	found org.antlr#antlr4-runtime;4.7 in central
	found org.antlr#antlr-runtime;3.5.2 in central
	found org.antlr#ST4;4.0.8 in central
	found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in central
	found org.glassfish#javax.json;1.0.4 in central
	found com.ibm.icu#icu4j;58.2 in central
downloading https://repo1.maven.org/maven2/io/delta/delta-core_2.12/1.0.0/delta-core_2.12-1.0.0.jar ...
	[SUCCESSFUL ] io.delta#delta-core_2.12;1.0.0!delta-core_2.12.jar (183ms)
downloading https://repo1.maven.org/maven2/org/antlr/antlr4/4.7/antlr4-4.7.jar ...
	[SUCCESSFUL ] org.antlr#antlr4;4.7!antlr4.jar (49ms)
downloading https://repo1.maven.org/maven2/org/antlr/antlr4-runtime/4.7/antlr4-runtime-4.7.jar ...
	[SUCCESSFUL ] org.antlr#antlr4-runtime;4.7!antlr4-runtime.jar (30ms)
downloading https://repo1.maven.org/maven2/org/antlr/antlr-runtime/3.5.2/antlr-runtime-3.5.2.jar ...
	[SUCCESSFUL ] org.antlr#antlr-ru

21/12/10 15:05:11 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 318f71919d8a, 34959, None)
21/12/10 15:05:12 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/app/spark-warehouse').
21/12/10 15:05:12 INFO SharedState: Warehouse path is 'file:/app/spark-warehouse'.
21/12/10 15:05:12 INFO SparkContext: Added file /app/dependencies/ivy2/cache/io.delta/delta-core_2.12/jars/delta-core_2.12-1.0.0.jar at file:/app/dependencies/ivy2/cache/io.delta/delta-core_2.12/jars/delta-core_2.12-1.0.0.jar with timestamp 1639148712535
21/12/10 15:05:12 INFO Utils: Copying /app/dependencies/ivy2/cache/io.delta/delta-core_2.12/jars/delta-core_2.12-1.0.0.jar to /tmp/app/spark/spark-7ed021cc-799a-4bf1-9229-d56bc17a6185/userFiles-d1dc15db-5582-4a11-9074-720a6028745a/delta-core_2.12-1.0.0.jar
21/12/10 15:05:12 WARN aia-ns-pov: Test warn log
2021/12/10 15:05:12 - root INFO: Test info log, from logging


pySpark version: 3.1.1
Spark UI - http://318f71919d8a:4040


# 1. Convert raw csv data to standardized format 

## Input (raw data)

In [2]:
show_files_in_directory(RAW_DATA_DIR)

./data/1.raw/ACM_ALERT_NOTES.csv
./data/1.raw/ACM_MD_ALERT_STATUSES.csv
./data/1.raw/ACM_ITEM_STATUS_HISTORY.csv
./data/1.raw/ALERTS.csv


## Process

In [3]:
convert_to_standardized(RAW_DATA_DIR, STANDARDIZED_DATA_DIR)

2021/12/10 15:05:12 - root INFO: Start to process ./data/1.raw/ACM_ALERT_NOTES.csv
2021/12/10 15:05:17 - root INFO: Data saved to ./data/2.standardized/ACM_ALERT_NOTES.delta
2021/12/10 15:05:17 - root INFO: Time lapsed 5.19 s





2021/12/10 15:05:18 - root INFO: Start to process ./data/1.raw/ACM_MD_ALERT_STATUSES.csv
2021/12/10 15:05:19 - root INFO: Data saved to ./data/2.standardized/ACM_MD_ALERT_STATUSES.delta
2021/12/10 15:05:19 - root INFO: Time lapsed 0.85 s





2021/12/10 15:05:20 - root INFO: Start to process ./data/1.raw/ACM_ITEM_STATUS_HISTORY.csv
2021/12/10 15:05:21 - root INFO: Data saved to ./data/2.standardized/ACM_ITEM_STATUS_HISTORY.delta
2021/12/10 15:05:21 - root INFO: Time lapsed 0.73 s





2021/12/10 15:05:22 - root INFO: Start to process ./data/1.raw/ALERTS.csv
21/12/10 15:05:22 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
2021/12/10 15:05:23 - root INFO: Data saved to ./data/2.standardized/ALERTS.delta
2021/12/10 15:05:23 - root INFO: Time lapsed 0.77 s





## Output (standardized format)

In [4]:
show_files_in_directory(STANDARDIZED_DATA_DIR)

./data/2.standardized/ACM_ALERT_NOTES.delta
./data/2.standardized/ACM_ITEM_STATUS_HISTORY.delta
./data/2.standardized/ALERTS.delta
./data/2.standardized/ACM_MD_ALERT_STATUSES.delta
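
The listings above suggest a simple naming convention: each raw `.csv` file becomes a `.delta` table with the same base name in the next stage folder. Here is a minimal sketch of that mapping using only `pathlib`. The helper `standardized_path_for` is hypothetical and is not part of `pipeline.preprocessors`. The directory values are copied from the listings and may differ from the real constants.

```python
from pathlib import Path

# Directory values copied from the listings above; the real constants live in
# pipeline.preprocessors and may be defined differently.
RAW_DIR = Path("./data/1.raw")
STANDARDIZED_DIR = Path("./data/2.standardized")


def standardized_path_for(raw_file: Path, target_dir: Path = STANDARDIZED_DIR) -> Path:
    """Hypothetical helper: map a raw CSV path to the path of its Delta table."""
    if raw_file.suffix.lower() != ".csv":
        raise ValueError(f"expected a .csv file, got {raw_file}")
    # Keep the base name, swap the extension, and move to the target folder.
    return target_dir / f"{raw_file.stem}.delta"


print(standardized_path_for(RAW_DIR / "ALERTS.csv"))
```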


In [5]:
get_pandas_dataframe("./data/2.standardized/ALERTS.delta")

Unnamed: 0,ALERT_INTERNAL_ID,ENTITY_TYPE_ID,ALERT_DATE,ALERT_TYPE_ID,STATUS_ID,STATUS_INTERNAL_ID,DELETED,HTML_FILE_KEY,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30,P31,P32,P33,P34,P35,P36,P37,P38,P39,P40,P41,P42,P43,P44,P45,P46,P47,P48,P49,IS_CASE,BUNIT_IDENTIFIER,OWNER_INTERNAL_ID,BU_INTERNAL_ID,ORIGINAL_BU_INTERNAL_ID,FL_ARCHIVE,FL_READ,FL_READ_BY_OWNER,LAST_READ_DATE,LAST_READ_USER_ID,LAST_UPDATE_DATE,LAST_UPDATE_USER_ID,CLOSED_DATE,CREATE_DATE,ALERT_CUSTOM_ATTRIBUTES_ID,SCORE,ALERT_TYPE_VERSION,FL_MANUAL,FL_GENERATED_BY_ACM,RESOLUTION_ID,ALERT_ID,ALERT_TYPE_INTERNAL_ID,FL_HAS_ATTACHMENTS,FL_UPDATED_BY_ACM,ENTITY_ID,PREV_STATUS_INTERNAL_ID,FL_ENCRYPTED,LAST_REFRESH_MODIFED_DATE,DEADLINE_DATE,HIGHLIGHT_DATE,EMAIL_DATE,AUTO_ESC_STATUS_INTERNAL_ID,CASE_COUNT_FOR_CONFIDENTIAL,P50,GLOBAL_DEADLINE_DATE,GLOBAL_HIGHLIGHT_DATE,GLOBAL_EMAIL_DATE,GLOBAL_AUTO_ESC_STATUS_ID,RFI_STATE,FL_HAS_NOTES,FL_HAS_CONFIDENTIAL_NOTES,CONSOLIDATION_KEY,HIBERNATE_OBJECT_VERSION,OWNER_IDENTIFIER,FL_DOUBT,NUM_EXISTING_ENTITIES,WORKSPACE_INTERNAL_ID,ALERT_NAME,PRIORITY_INTERNAL_ID,DETAILS_FOR_SEARCH,DETAILS
0,1649364,,24-Feb-20,,,7027,0,"<?xml version=""1.0"" encoding=""UTF16"" standalon...",01-13636,SSBSAPB0025000389,,,I,NAME;,,,,,,,,,,,,,,,AML_SAP_BN01_20022415001_334,FACTIVA_PEP_SAN,CPF BOARD,,AML-EWLF,FACTIVA_SIE;,1995-12-08;,Special Interest Entity (SIE)-Other Official L...,Self Service Batch Scan,,,,,,,,,,Watch List: FACTIVA_SIE; Entry ID: 1091285; En...,0,1,,3280419,3280419,0,1,0,12-Jul-21,30801,12-Jul-21,30801,12-Jul-21,24-Feb-20,865017,80,AML-EWLF/3.5.1.33,0,1,,WLF101-1363601-89626,119,0,1,,,0,,,,,,0,,,,,,0,1,0,,13,,0,,,,,,\r
1,1619405,,8-Oct-18,,,7027,0,"<?xml version=""1.0"" encoding=""UTF16"" standalon...",Jan-97,42823012-P560016505,1.0,1/1/89,P,NAME;,,,,,,,,,,,,,,,AML_MAGNUM_1810081211_55,FACTIVA_PEP_SAN,P ONE,:S0135242C,AML-EWLF,FACTIVA_SAN;,1992; 1994; 1993; 1995;,Special Interest Person (SIP)-Sanctions Lists;...,Self Service Batch Scan,,,,,,,,,,Watch List: FACTIVA_SAN; Entry ID: 4790496; En...,0,1,,3280419,3280419,0,1,0,12-Jul-21,30801,29-Dec-20,28202,29-Dec-20,8-Oct-18,812105,80,AML-EWLF/3.4.0.8,0,1,,WLF101-939701-62908,119,0,1,,,0,,,,,,0,,,,,,0,1,0,,9,,0,0.0,,,,,\r
2,1619436,,10-Oct-18,,,7027,0,"<?xml version=""1.0"" encoding=""UTF16"" standalon...",Jan-54,SSBPLASC550212199,1.0,5/12/62,P,NAME;,,,,,,,,,,,,,,,AML_PLAS_W_1810061800_971,FACTIVA_PEP_SAN,KIM,,AML-EWLF,FACTIVA_SAN;,1964; 1962-08-28;,Politically Exposed Person (PEP);Special Inter...,Self Service Batch Scan,,,,,,,,,,Watch List: FACTIVA_SAN; Entry ID: 1198704; En...,0,1,,3280419,3280419,0,1,0,12-Jul-21,30801,29-Dec-20,28202,29-Dec-20,10-Oct-18,812136,115,AML-EWLF/3.4.0.8,0,1,,WLF101-945401-62939,119,0,1,,,0,,,,,,0,,,,,,0,1,0,,9,,0,0.0,,,,,


In [6]:
get_pandas_dataframe("./data/2.standardized/ACM_ALERT_NOTES.delta")

Unnamed: 0,ALERT_ID,NOTE,CREATE_DATE
0,WLF101-945401-62939,<p>RBA applied. Approval from JM to close the ...,2021-03-25 16:35:44
1,WLF101-1363601-89626,"<p><span style=""font-family: 'courier new', co...",2021-08-16 18:02:47
2,WLF101-939701-62908,"<p><span style=""font-family: 'courier new', co...",2021-03-17 15:00:55
3,WLF101-1363601-89626,<p>*1-3: Name mismatch</p>,2021-08-17 09:03:03
4,WLF101-945401-62939,"<p><br />hit 1: Positive hit. noted PH is YOB,...",2020-03-17 15:22:06


In [7]:
get_pandas_dataframe("./data/2.standardized/ACM_ITEM_STATUS_HISTORY.delta").head(5)

Unnamed: 0,STATUS_JOIN_ID,ITEM_JOIN_ID,ITEM_ID,FROM_STATUS_IDENTIFIER,FROM_STATE,FROM_FINDING,TO_STATUS_IDENTIFIER,TO_STATE,TO_FINDING,CREATE_DATE,USER_JOIN_ID
0,6059387,3658498,WLF101-1363601-89626,2.0,Open,No_Determination,3,Open,No_Determination,2021-08-16 18:02:47,34001
1,6059825,3658498,WLF101-1363601-89626,3.0,Open,No_Determination,IMPL_AML_FALSE_POSITIVE,Closed,Non_Issue,2021-08-17 09:03:03,30203
2,5780701,3541863,WLF101-939701-62908,,,,2,Open,No_Determination,2021-03-17 14:52:06,22601
3,5135431,3292184,WLF101-945401-62939,2.0,Open,No_Determination,24,Open,No_Determination,2020-03-17 15:22:11,30202
4,5782825,3541863,WLF101-939701-62908,3.0,Open,No_Determination,IMPL_AML_FALSE_POSITIVE,Closed,Non_Issue,2021-03-18 12:04:25,30204


In [8]:
get_pandas_dataframe("./data/2.standardized/ACM_MD_ALERT_STATUSES.delta").head(5)

Unnamed: 0,STATUS_INTERNAL_ID,NUMERIC_IDENTIFIER,STATUS_DELETABLE,DESCRIPTION,E_STATE,E_ISSUE,E_SCOPE,STATUS_IDENTIFIER,STATUS_NAME
0,7021,,1,AIA China Alert Step #5,Open,No_Determination,Selected,IMPL_AML_INFO_REQUESTED_WAIT,Information requested - Response awaited
1,7020,,1,AIA China CDD Alert Step #7,Closed,Issue,Selected,IMPL_AML_RISK_SCORE_UP,Risk score manually increased
2,7025,,1,AIA China Alert Step #8,Closed,Issue,Selected,IMPL_AML_TRUE_POSITIVE,True Positive - Sanctions
3,7030,19.0,1,,Open,No_Determination,All,19,Potentially suspicious transaction - escalated...
4,7019,,1,AIA China CDD Alert Step #10,Closed,Issue,Selected,IMPL_AML_EDD_COMPLETED,EDD completed


# 2. Create cleansed data

## Input 

See: output from previous stage

## Process

In [9]:
convert_standardized_to_cleansed()

Dimension 3 98


21/12/10 15:05:30 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
21/12/10 15:05:30 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
21/12/10 15:05:30 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
ANTLR Tool version 4.7 used for code generation does not match the current runtime version 4.8
ANTLR Tool version 4.7 used for code generation does not match the current runtime version 4.8

## Output (cleansed data)

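Note: the `ACM_MD_ALERT_STATUSES` table is not written as a separate file at this stage. Its status names are merged into the other tables (for example, the `STATUS_NAME` column in `ALERTS`).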
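
The merge described in the note can be pictured as a lookup join. The sketch below is plain Python rather than the Spark code the pipeline presumably uses. The function name `add_status_name` and the choice of `STATUS_INTERNAL_ID` as the join key are assumptions made for illustration. The sample values (`7027`, `False Positive`) are taken from the tables shown in this notebook.

```python
from typing import Dict, List


def add_status_name(rows: List[Dict], statuses: List[Dict],
                    id_column: str = "STATUS_INTERNAL_ID") -> List[Dict]:
    """Hypothetical helper: attach STATUS_NAME to each row via a lookup table."""
    # Build the lookup once: status id -> human-readable status name.
    lookup = {s["STATUS_INTERNAL_ID"]: s["STATUS_NAME"] for s in statuses}
    # Left-join semantics: rows with an unknown status id get None.
    return [{**row, "STATUS_NAME": lookup.get(row[id_column])} for row in rows]


statuses = [{"STATUS_INTERNAL_ID": 7027, "STATUS_NAME": "False Positive"}]
alerts = [{"ALERT_ID": "WLF101-945401-62939", "STATUS_INTERNAL_ID": 7027}]

merged_alerts = add_status_name(alerts, statuses)
print(merged_alerts[0]["STATUS_NAME"])
```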

In [10]:
show_files_in_directory(CLEANSED_DATA_DIR)

./data/3.cleansed/ACM_ALERT_NOTES.delta
./data/3.cleansed/ACM_ITEM_STATUS_HISTORY.delta
./data/3.cleansed/ALERTS.delta


##### Before 

In [11]:
get_pandas_dataframe("./data/2.standardized/ALERTS.delta")

Unnamed: 0,ALERT_INTERNAL_ID,ENTITY_TYPE_ID,ALERT_DATE,ALERT_TYPE_ID,STATUS_ID,STATUS_INTERNAL_ID,DELETED,HTML_FILE_KEY,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30,P31,P32,P33,P34,P35,P36,P37,P38,P39,P40,P41,P42,P43,P44,P45,P46,P47,P48,P49,IS_CASE,BUNIT_IDENTIFIER,OWNER_INTERNAL_ID,BU_INTERNAL_ID,ORIGINAL_BU_INTERNAL_ID,FL_ARCHIVE,FL_READ,FL_READ_BY_OWNER,LAST_READ_DATE,LAST_READ_USER_ID,LAST_UPDATE_DATE,LAST_UPDATE_USER_ID,CLOSED_DATE,CREATE_DATE,ALERT_CUSTOM_ATTRIBUTES_ID,SCORE,ALERT_TYPE_VERSION,FL_MANUAL,FL_GENERATED_BY_ACM,RESOLUTION_ID,ALERT_ID,ALERT_TYPE_INTERNAL_ID,FL_HAS_ATTACHMENTS,FL_UPDATED_BY_ACM,ENTITY_ID,PREV_STATUS_INTERNAL_ID,FL_ENCRYPTED,LAST_REFRESH_MODIFED_DATE,DEADLINE_DATE,HIGHLIGHT_DATE,EMAIL_DATE,AUTO_ESC_STATUS_INTERNAL_ID,CASE_COUNT_FOR_CONFIDENTIAL,P50,GLOBAL_DEADLINE_DATE,GLOBAL_HIGHLIGHT_DATE,GLOBAL_EMAIL_DATE,GLOBAL_AUTO_ESC_STATUS_ID,RFI_STATE,FL_HAS_NOTES,FL_HAS_CONFIDENTIAL_NOTES,CONSOLIDATION_KEY,HIBERNATE_OBJECT_VERSION,OWNER_IDENTIFIER,FL_DOUBT,NUM_EXISTING_ENTITIES,WORKSPACE_INTERNAL_ID,ALERT_NAME,PRIORITY_INTERNAL_ID,DETAILS_FOR_SEARCH,DETAILS
0,1649364,,24-Feb-20,,,7027,0,"<?xml version=""1.0"" encoding=""UTF16"" standalon...",01-13636,SSBSAPB0025000389,,,I,NAME;,,,,,,,,,,,,,,,AML_SAP_BN01_20022415001_334,FACTIVA_PEP_SAN,CPF BOARD,,AML-EWLF,FACTIVA_SIE;,1995-12-08;,Special Interest Entity (SIE)-Other Official L...,Self Service Batch Scan,,,,,,,,,,Watch List: FACTIVA_SIE; Entry ID: 1091285; En...,0,1,,3280419,3280419,0,1,0,12-Jul-21,30801,12-Jul-21,30801,12-Jul-21,24-Feb-20,865017,80,AML-EWLF/3.5.1.33,0,1,,WLF101-1363601-89626,119,0,1,,,0,,,,,,0,,,,,,0,1,0,,13,,0,,,,,,\r
1,1619405,,8-Oct-18,,,7027,0,"<?xml version=""1.0"" encoding=""UTF16"" standalon...",Jan-97,42823012-P560016505,1.0,1/1/89,P,NAME;,,,,,,,,,,,,,,,AML_MAGNUM_1810081211_55,FACTIVA_PEP_SAN,P ONE,:S0135242C,AML-EWLF,FACTIVA_SAN;,1992; 1994; 1993; 1995;,Special Interest Person (SIP)-Sanctions Lists;...,Self Service Batch Scan,,,,,,,,,,Watch List: FACTIVA_SAN; Entry ID: 4790496; En...,0,1,,3280419,3280419,0,1,0,12-Jul-21,30801,29-Dec-20,28202,29-Dec-20,8-Oct-18,812105,80,AML-EWLF/3.4.0.8,0,1,,WLF101-939701-62908,119,0,1,,,0,,,,,,0,,,,,,0,1,0,,9,,0,0.0,,,,,\r
2,1619436,,10-Oct-18,,,7027,0,"<?xml version=""1.0"" encoding=""UTF16"" standalon...",Jan-54,SSBPLASC550212199,1.0,5/12/62,P,NAME;,,,,,,,,,,,,,,,AML_PLAS_W_1810061800_971,FACTIVA_PEP_SAN,KIM,,AML-EWLF,FACTIVA_SAN;,1964; 1962-08-28;,Politically Exposed Person (PEP);Special Inter...,Self Service Batch Scan,,,,,,,,,,Watch List: FACTIVA_SAN; Entry ID: 1198704; En...,0,1,,3280419,3280419,0,1,0,12-Jul-21,30801,29-Dec-20,28202,29-Dec-20,10-Oct-18,812136,115,AML-EWLF/3.4.0.8,0,1,,WLF101-945401-62939,119,0,1,,,0,,,,,,0,,,,,,0,1,0,,9,,0,0.0,,,,,


##### After transformation (e.g. XML extraction)

In [12]:
get_pandas_dataframe("./data/3.cleansed/ALERTS.delta")

Unnamed: 0,STATUS_INTERNAL_ID,STATUS_NAME,ALERT_INTERNAL_ID,ENTITY_TYPE_ID,ALERT_DATE,ALERT_TYPE_ID,STATUS_ID,DELETED,HTML_FILE_KEY,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30,P31,P32,P33,P34,P35,P36,P37,P38,P39,P40,P41,P42,P43,P44,P45,P46,P47,P48,P49,IS_CASE,BUNIT_IDENTIFIER,OWNER_INTERNAL_ID,BU_INTERNAL_ID,ORIGINAL_BU_INTERNAL_ID,FL_ARCHIVE,FL_READ,FL_READ_BY_OWNER,LAST_READ_DATE,LAST_READ_USER_ID,LAST_UPDATE_DATE,LAST_UPDATE_USER_ID,CLOSED_DATE,CREATE_DATE,ALERT_CUSTOM_ATTRIBUTES_ID,SCORE,ALERT_TYPE_VERSION,FL_MANUAL,FL_GENERATED_BY_ACM,RESOLUTION_ID,ALERT_ID,ALERT_TYPE_INTERNAL_ID,FL_HAS_ATTACHMENTS,FL_UPDATED_BY_ACM,ENTITY_ID,PREV_STATUS_INTERNAL_ID,FL_ENCRYPTED,LAST_REFRESH_MODIFED_DATE,DEADLINE_DATE,HIGHLIGHT_DATE,EMAIL_DATE,AUTO_ESC_STATUS_INTERNAL_ID,CASE_COUNT_FOR_CONFIDENTIAL,P50,GLOBAL_DEADLINE_DATE,GLOBAL_HIGHLIGHT_DATE,GLOBAL_EMAIL_DATE,GLOBAL_AUTO_ESC_STATUS_ID,RFI_STATE,FL_HAS_NOTES,FL_HAS_CONFIDENTIAL_NOTES,CONSOLIDATION_KEY,HIBERNATE_OBJECT_VERSION,OWNER_IDENTIFIER,FL_DOUBT,NUM_EXISTING_ENTITIES,WORKSPACE_INTERNAL_ID,ALERT_NAME,PRIORITY_INTERNAL_ID,DETAILS_FOR_SEARCH,DETAILS,alert_alertId,alert_alertDate,alert_alertEntityKey,alert_score,alert_ahData_alertID,alert_ahData_alertDateTime,alert_ahData_jobID,alert_ahData_jobName,alert_ahData_jobType,alert_ahData_score,alert_ahData_numberOfHits,alert_ahData_partyKey,alert_ahData_partyName,alert_ahData_entityExcludeListsNames,alert_ahData_hitExcludeListsNames,alert_partyType,alert_partyDOB,alert_partyYOB,alert_partyBirthCountry,alert_partyBirthLocation,alert_partyGender,alert_partyIds_idType,alert_partyIds_idNumber,alert_partyIds_idCountry,alert_partyNatCountries_countryCd,alert_partyAddresses_partyAddressLine1,alert_partyAddresses_partyAddressLine2,alert_partyAddresses_partyCity,alert_partyAddresses_partyPostalCd,alert_partyAddresses_partyStateProvince,alert_partyAddresses_partyCountry,hit_listId,hit_entryId,hit_listVersion,hit_entryType,hit_listUpdateDate,hit_entryCrea
tedDate,hit_entryUpdateDate,hit_displayName,hit_matchedName,hit_isNameBroken,hit_aliases_displayName,hit_aliases_matchedName,hit_aliases_isNameBroken,hit_aliases_matchStrength,hit_addresses_streetAddress1,hit_addresses_streetAddress2,hit_addresses_city,hit_addresses_stateProvince,hit_addresses_postalCode,hit_addresses_country,hit_ids_idType,hit_ids_idNumber,hit_ids_idCountry,hit_nationalityCountries_country,hit_placesOfBirth_birthPlace,hit_placesOfBirth_birthCountry,hit_age,hit_ageAsOfDate,hit_datesOfBirth_birthDate,hit_datesOfBirth_yearOfBirth,hit_categories_category,hit_keywords_keyword,hit_title,hit_position,hit_gender,hit_isDeceased,hit_deceasedDate,hit_cs_1,hit_cs_2,hit_cs_3,hit_cs_4,hit_cs_5,hit_cs_6,hit_cs_7,hit_cs_8,hit_cs_9,hit_cs_10,hit_cs_11,hit_cs_12,hit_cs_13,hit_cs_14,hit_cs_15,hit_cs_16,hit_cs_17,hit_cs_18,hit_additionalInfo_name,hit_additionalInfo_value,hit_score,hit_matchType,hit_scoreFactors_factorId,hit_scoreFactors_factorDesc,hit_scoreFactors_factorValue,hit_scoreFactors_factorScore,hit_scoreFactors_factorImpact,hit_scoresBreakdown_matchedName,hit_scoresBreakdown_aliases_matchedName,hit_scoresBreakdown_addresses_city,hit_scoresBreakdown_addresses_country,hit_scoresBreakdown_addresses_stateProvince,hit_scoresBreakdown_ids_idNumber,hit_explanations_matchedName_Explanation,hit_explanations_aliases_matchedName_Explanation,hit_explanations_nationalityCountries_country_Explanation,hit_explanations_address_city_Explanation,hit_explanations_address_country_Explanation,hit_explanations_addresses_stateProvince_Explanation,hit_explanations_ids_idNumber_Explanation,hit_inputExplanations_matchedName_inputExplanation,hit_inputExplanations_aliases_matchedName_inputExplanation,hit_inputExplanations_nationalityCountries_country_inputExplanation,hit_inputExplanations_address_city_inputExplanation,hit_inputExplanations_address_country_inputExplanation,hit_inputExplanations_addresses_stateProvince_inputExplanation,hit_inputExplanations_ids_idNumber_inputExplanation,
ap_hit_names,wl_hit_matched_name,wl_hit_aliases_matched_name,wl_hit_names,hit_cs_1_data_points,ap_nric
0,7027,False Positive,1649364,,24-Feb-20,,,0,"<?xml version=""1.0"" encoding=""UTF16"" standalon...",01-13636,SSBSAPB0025000389,,,I,NAME;,,,,,,,,,,,,,,,AML_SAP_BN01_20022415001_334,FACTIVA_PEP_SAN,CPF BOARD,,AML-EWLF,FACTIVA_SIE;,1995-12-08;,Special Interest Entity (SIE)-Other Official L...,Self Service Batch Scan,,,,,,,,,,Watch List: FACTIVA_SIE; Entry ID: 1091285; En...,0,1,,3280419,3280419,0,1,0,12-Jul-21,30801,12-Jul-21,30801,12-Jul-21,24-Feb-20,865017,80,AML-EWLF/3.5.1.33,0,1,,WLF101-1363601-89626,119,0,1,,,0,,,,,,0,,,,,,0,1,0,,13,,0,,,,,,\r,WLF101-1363601-89626,24/02/20 15:02:12,SSBSAPB0025000389,80,[WLF101-1363601-89626],[24/02/20 15:02:12],[01-13636],[AML_SAP_BN01_20022415001_334],[Self Service Batch Scan],[80],1,[SSBSAPB0025000389],[CPF BOARD],[None],[None],Organization,,,,,UNKNOWN,[None],[None],[None],[None],[],[],[],[],[],[],FACTIVA_SIE,1091285,160,ORGANIZATION,21/02/20,02/03/18,02/03/18,China Petroleum Finance,China Petroleum Finance,,"[CPF, China Petroleum Finance Co. Ltd, Zhongyo...","[CPF, China Petroleum Finance Co. Ltd, Zhongyo...","[False, False, False, False, False, False]","[H, H, H, H, H, H]","[No 5 Gulouwai Avenue, Xicheng District]",[None],[Beijing;;100029],[None],[None],[CN],"[Company Identification No., DUNS Number, Nati...","[91110000100018558M, 544956829, 91110000100018...","[None, None, None, None, None, None]",[CN],[],[],,,[1995-12-08],[1995],[Special Interest Entity (SIE)-Other Official ...,[2395],,,,,,4;4;,4;4;,12;14;,,,,,,,,,,,,,,,,"[ProfileNotes, Sanction References]",[IOWA BOARD OF REGENTS NOTES: SUDAN Scrutiniz...,80.0,NAME,"[AML_WLF_CF_SF_matchingEngine, AML_WLF_CF_SF_s...","[Matching Engine Match, Single Token Match]","[95, YES]","[95.0, -15.0]","[MEDIUM, CORRECTIVE]",0,95,0,0,0,0,[],[CPF],[],[],[],[],[],[],[CPF BOARD],[],[],[],[],[],[CPF BOARD],,[CPF],[CPF],"{'possible_nric': None, 'nric': None, 'dob': N...",[]
1,7027,False Positive,1619405,,8-Oct-18,,,0,"<?xml version=""1.0"" encoding=""UTF16"" standalon...",Jan-97,42823012-P560016505,1.0,1/1/89,P,NAME;,,,,,,,,,,,,,,,AML_MAGNUM_1810081211_55,FACTIVA_PEP_SAN,P ONE,:S0135242C,AML-EWLF,FACTIVA_SAN;,1992; 1994; 1993; 1995;,Special Interest Person (SIP)-Sanctions Lists;...,Self Service Batch Scan,,,,,,,,,,Watch List: FACTIVA_SAN; Entry ID: 4790496; En...,0,1,,3280419,3280419,0,1,0,12-Jul-21,30801,29-Dec-20,28202,29-Dec-20,8-Oct-18,812105,80,AML-EWLF/3.4.0.8,0,1,,WLF101-939701-62908,119,0,1,,,0,,,,,,0,,,,,,0,1,0,,9,,0,0.0,,,,,\r,WLF101-939701-62908,08/10/18 12:31:01,42823012-P560016505,80,[WLF101-939701-62908],[08/10/18 12:31:01],[01-9397],[AML_MAGNUM_1810081211_55],[Self Service Batch Scan],[80],1,[42823012-P560016505],[P ONE],[None],[None],Person,01/01/89,1989.0,,,MALE,[None],[S0135242C],[None],[SG],[],[],[],[],[],[],FACTIVA_SAN,4790496,167,PERSON,06/06/18,17/08/17,17/08/17,Ali Kony,{GN=Ali}{SN=Kony},True,"[アリ・コニー, アリ・ラロボ・バシール, Bashir,Ali,Lalobo, Kaper...","[アリ・コニー, アリ・ラロボ・バシール, {GN=Ali,Lalobo}{SN=Bashi...","[False, False, True, True, True, True, True, T...","[H, H, H, H, H, H, H, H, H, H, H, H, L, L, L, ...",[Kafia Kingi (border of Sudan and South Sudan)...,"[None, None, None]","[None, None, ;Kafia Kingi]","[None, None, None]","[None, None, None]","[None, None, None]","[SECO SSID, DFAT Reference Number]","[34712, 3194]","[None, None]",[None],[None],[None],,,"[1992, 1994, 1993, 1995]","[1992, 1994, 1993, 1995]",[Special Interest Person (SIP)-Sanctions Lists...,"[1, 275, 281, 283, 1156, 1326, 2556, 2644, 316...",,,MALE,False,,3;3;,1;2;,,,,,,,,,,,,,,,,,"[ProfileNotes, Images, Primary Occupation, Oth...",[OFFICE OF FOREIGN ASSETS CONTROL (OFAC) NOTES...,80.0,NAME,"[AML_WLF_CF_SF_matchingEngine, AML_WLF_CF_SF_y...","[Matching Engine Match, Year of Birth Match, L...","[100, MISMATCH, YES]","[100.0, -10.0, -10.0]","[HIGH, CORRECTIVE, CORRECTIVE]",0,100,0,0,0,0,[],"[One P, One P]",[],[],[],[],[],[],"[P ONE, P 
ONE]",[],[],[],[],[],[P ONE],,[One P],[One P],"{'possible_nric': None, 'nric': None, 'dob': N...",[S0135242C]
2,7027,False Positive,1619436,,10-Oct-18,,,0,"<?xml version=""1.0"" encoding=""UTF16"" standalon...",Jan-54,SSBPLASC550212199,1.0,5/12/62,P,NAME;,,,,,,,,,,,,,,,AML_PLAS_W_1810061800_971,FACTIVA_PEP_SAN,KIM,,AML-EWLF,FACTIVA_SAN;,1964; 1962-08-28;,Politically Exposed Person (PEP);Special Inter...,Self Service Batch Scan,,,,,,,,,,Watch List: FACTIVA_SAN; Entry ID: 1198704; En...,0,1,,3280419,3280419,0,1,0,12-Jul-21,30801,29-Dec-20,28202,29-Dec-20,10-Oct-18,812136,115,AML-EWLF/3.4.0.8,0,1,,WLF101-945401-62939,119,0,1,,,0,,,,,,0,,,,,,0,1,0,,9,,0,0.0,,,,,,WLF101-945401-62939,10/10/18 11:43:55,SSBPLASC550212199,115,[WLF101-945401-62939],[10/10/18 11:43:55],[01-9454],[AML_PLAS_W_1810061800_971],[Self Service Batch Scan],[115],1,[SSBPLASC550212199],[KIM],[None],[None],Person,05/12/62,1962.0,,,UNKNOWN,[None],[None],[None],[None],[],[],[],[],[],[],FACTIVA_SAN,1198704,168,PERSON,06/06/18,06/06/18,06/06/18,Tong-Myo'ng Kim,{GN=Tong-Myo'ng}{SN=Kim},True,"[김동명, キム・トンミョン, Дон Мён Ким, キム・チンソク, Чин Сок ...","[김동명, キム・トンミョン, Дон Мён Ким, キム・チンソク, Чин Сок ...","[False, False, False, False, False, True, True...","[H, H, H, H, H, H, H, H, H, H, H, H, H, H, H, ...","[c/o Tanchon Commercial Bank, Saemul 1-Dong Py...","[None, None]","[Pyongyang, Pyongyang]","[None, None]","[None, None]","[KP, KP]","[Passport No., SECO SSID, DFAT Reference Numbe...","[290320764, 33845, 1133, 3157, P23]","[None, None, None, None, None]","[KP, KP]",[None],[None],,,"[1964, 1962-08-28]","[1964, 1962]","[Politically Exposed Person (PEP), Special Int...","[13, 275, 281, 283, 627, 1310, 1318, 1324, 132...",,"President, Tanchon Commercial Bank",MALE,False,,1;3;,1;,,,,,,,,,,,,,,,,,"[ProfileNotes, Images, Primary Occupation, Oth...",[OFFICE OF FOREIGN ASSETS CONTROL (OFAC) NOTES...,115.0,NAME,"[AML_WLF_CF_SF_matchingEngine, AML_WLF_CF_SF_y...","[Matching Engine Match, Year of Birth Match, S...","[100, MATCH, YES]","[100.0, 30.0, -15.0]","[HIGH, LOW, 
CORRECTIVE]",0,100,0,0,0,0,[],"[Kim,]",[],[],[],[],[],[],[KIM],[],[],[],[],[],[KIM],,"[Kim,]","[Kim,]","{'possible_nric': None, 'nric': None, 'dob': N...",[]


##### Before 

In [13]:
get_pandas_dataframe("./data/2.standardized/ACM_ALERT_NOTES.delta")

Unnamed: 0,ALERT_ID,NOTE,CREATE_DATE
0,WLF101-945401-62939,<p>RBA applied. Approval from JM to close the ...,2021-03-25 16:35:44
1,WLF101-1363601-89626,"<p><span style=""font-family: 'courier new', co...",2021-08-16 18:02:47
2,WLF101-939701-62908,"<p><span style=""font-family: 'courier new', co...",2021-03-17 15:00:55
3,WLF101-1363601-89626,<p>*1-3: Name mismatch</p>,2021-08-17 09:03:03
4,WLF101-945401-62939,"<p><br />hit 1: Positive hit. noted PH is YOB,...",2020-03-17 15:22:06


##### After transformation (e.g. the note stage is extracted)

In [14]:
get_pandas_dataframe("./data/3.cleansed/ACM_ALERT_NOTES.delta")

Unnamed: 0,ALERT_ID,NOTE,CREATE_DATE,row_num,first_analyst_row_num,last_analyst_row_num,analyst_note_stage
0,WLF101-939701-62908,"<p><span style=""font-family: 'courier new', co...",2021-03-17 15:00:55,1,1,1,first_last_analyst_note
1,WLF101-1363601-89626,"<p><span style=""font-family: 'courier new', co...",2021-08-16 18:02:47,1,1,2,first_analyst_note
2,WLF101-1363601-89626,<p>*1-3: Name mismatch</p>,2021-08-17 09:03:03,2,1,2,last_analyst_note
3,WLF101-945401-62939,"<p><br />hit 1: Positive hit. noted PH is YOB,...",2020-03-17 15:22:06,1,1,2,first_analyst_note
4,WLF101-945401-62939,<p>RBA applied. Approval from JM to close the ...,2021-03-25 16:35:44,2,1,2,last_analyst_note
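
The `analyst_note_stage` column above can be reproduced with a small amount of logic: order each alert's notes by `CREATE_DATE`, number them, and label the first and last. The following is a plain-Python sketch, not the pipeline's Spark implementation. The function name `label_note_stages` is hypothetical, and it assumes every note was written by an analyst, which is true of the sample shown. The label for a note that is neither first nor last is not visible in the output above, so the sketch leaves it as `None`.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List


def label_note_stages(notes: List[Dict]) -> List[Dict]:
    """Hypothetical helper: number notes per alert and label first/last ones."""
    by_alert = defaultdict(list)
    for note in notes:
        by_alert[note["ALERT_ID"]].append(note)

    labelled = []
    for group in by_alert.values():
        # Order the notes of one alert chronologically.
        group.sort(key=lambda n: datetime.strptime(n["CREATE_DATE"], "%Y-%m-%d %H:%M:%S"))
        last = len(group)
        for row_num, note in enumerate(group, start=1):
            if last == 1:
                stage = "first_last_analyst_note"
            elif row_num == 1:
                stage = "first_analyst_note"
            elif row_num == last:
                stage = "last_analyst_note"
            else:
                stage = None  # assumption: the real label for middle notes is unknown
            labelled.append({**note, "row_num": row_num, "analyst_note_stage": stage})
    return labelled


# ALERT_ID and CREATE_DATE values taken from the table above; NOTE text omitted.
notes = [
    {"ALERT_ID": "WLF101-945401-62939", "CREATE_DATE": "2021-03-25 16:35:44"},
    {"ALERT_ID": "WLF101-939701-62908", "CREATE_DATE": "2021-03-17 15:00:55"},
    {"ALERT_ID": "WLF101-945401-62939", "CREATE_DATE": "2020-03-17 15:22:06"},
]
labelled = label_note_stages(notes)
for row in labelled:
    print(row["ALERT_ID"], row["row_num"], row["analyst_note_stage"])
```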


##### Before

In [15]:
get_pandas_dataframe("./data/2.standardized/ACM_ITEM_STATUS_HISTORY.delta").head(5)

Unnamed: 0,STATUS_JOIN_ID,ITEM_JOIN_ID,ITEM_ID,FROM_STATUS_IDENTIFIER,FROM_STATE,FROM_FINDING,TO_STATUS_IDENTIFIER,TO_STATE,TO_FINDING,CREATE_DATE,USER_JOIN_ID
0,6059387,3658498,WLF101-1363601-89626,2.0,Open,No_Determination,3,Open,No_Determination,2021-08-16 18:02:47,34001
1,6059825,3658498,WLF101-1363601-89626,3.0,Open,No_Determination,IMPL_AML_FALSE_POSITIVE,Closed,Non_Issue,2021-08-17 09:03:03,30203
2,5780701,3541863,WLF101-939701-62908,,,,2,Open,No_Determination,2021-03-17 14:52:06,22601
3,5135431,3292184,WLF101-945401-62939,2.0,Open,No_Determination,24,Open,No_Determination,2020-03-17 15:22:11,30202
4,5782825,3541863,WLF101-939701-62908,3.0,Open,No_Determination,IMPL_AML_FALSE_POSITIVE,Closed,Non_Issue,2021-03-18 12:04:25,30204


##### After transformation (e.g. status names are added and the status stage is extracted)

In [16]:
get_pandas_dataframe("./data/3.cleansed/ACM_ITEM_STATUS_HISTORY.delta").head(5)

Unnamed: 0,STATUS_JOIN_ID,ITEM_JOIN_ID,ITEM_ID,FROM_STATUS_IDENTIFIER,FROM_STATUS_NAME,FROM_STATE,FROM_FINDING,TO_STATUS_IDENTIFIER,TO_STATUS_NAME,TO_STATE,TO_FINDING,CREATE_DATE,USER_JOIN_ID,row_num,first_analyst_row_num,last_analyst_row_num,analyst_status_stage
0,5074477,3292184,WLF101-945401-62939,,,,,2,Ready,Open,No_Determination,2020-02-11 16:00:01,22601,1,2,3,system_activity
1,5135431,3292184,WLF101-945401-62939,2.0,Ready,Open,No_Determination,24,Potential PEP / RCA match - escalated to Compl...,Open,No_Determination,2020-03-17 15:22:11,30202,2,2,3,first_analyst_status
2,5800865,3292184,WLF101-945401-62939,24.0,Potential PEP / RCA match - escalated to Compl...,Open,No_Determination,IMPL_AML_FALSE_POSITIVE,False Positive,Closed,Non_Issue,2021-03-25 16:35:50,18506,3,2,3,last_analyst_status
3,6059345,3658498,WLF101-1363601-89626,,,,,2,Ready,Open,No_Determination,2021-08-16 17:54:06,22601,1,2,3,system_activity
4,6059387,3658498,WLF101-1363601-89626,2.0,Ready,Open,No_Determination,3,In Process,Open,No_Determination,2021-08-16 18:02:47,34001,2,2,3,first_analyst_status


# 3. Prepare agent inputs

## Input

See: previous tables

## Process

In [17]:
transform_cleansed_to_application()

2021/12/10 15:05:40 - root INFO: Agent: rba_agent, Input written to ./data/4.application/agent-input/party_type_agent_input.delta, elapsed time: 0.63s
2021/12/10 15:05:40 - root INFO: Agent: rba_agent, Input written to ./data/4.application/agent-input/name_agent_input.delta, elapsed time: 0.63s
2021/12/10 15:05:41 - root INFO: Agent: rba_agent, Input written to ./data/4.application/agent-input/dob_agent_input.delta, elapsed time: 0.65s
2021/12/10 15:05:42 - root INFO: Agent: rba_agent, Input written to ./data/4.application/agent-input/pob_agent_input.delta, elapsed time: 0.60s
2021/12/10 15:05:42 - root INFO: Agent: rba_agent, Input written to ./data/4.application/agent-input/gender_agent_input.delta, elapsed time: 0.61s
2021/12/10 15:05:43 - root INFO: Agent: rba_agent, Input written to ./data/4.application/agent-input/national_id_agent_input.delta, elapsed time: 0.61s
2021/12/10 15:05:44 - root INFO: Agent: rba_agent, Input written to ./data/4.application/agent-input/document_number_

## Output (example)

In [18]:
show_files_in_directory(APPLICATION_DATA_DIR)

./data/4.application/agent-input
./data/4.application/agent_input_agg_df.delta


In [19]:
show_files_in_directory('./data/4.application/agent-input')

./data/4.application/agent-input/pob_agent_input.delta
./data/4.application/agent-input/document_number_agent_input.delta
./data/4.application/agent-input/rba_agent_input.delta
./data/4.application/agent-input/gender_agent_input.delta
./data/4.application/agent-input/hit_has_dob_id_address_agent_input.delta
./data/4.application/agent-input/name_agent_input.delta
./data/4.application/agent-input/hit_is_deceased_agent_input.delta
./data/4.application/agent-input/pep_payment_agent_input.delta
./data/4.application/agent-input/historical_decision_name_agent_input.delta
./data/4.application/agent-input/party_type_agent_input.delta
./data/4.application/agent-input/dob_agent_input.delta
./data/4.application/agent-input/national_id_agent_input.delta
./data/4.application/agent-input/hit_is_san_agent_input.delta
./data/4.application/agent-input/nationality_agent_input.delta


In [20]:
get_pandas_dataframe('./data/4.application/agent-input/name_agent_input.delta')

Unnamed: 0,_index,ALERT_INTERNAL_ID,ALERT_ID,hit_listId,hit_entryId,ap_all_names_aggregated,wl_all_names_aggregated,party_type_agent_ap,party_type_agent_wl
0,0,1649364,WLF101-1363601-89626,FACTIVA_SIE,1091285,[CPF BOARD],[CPF],Organization,ORGANIZATION
1,1,1619405,WLF101-939701-62908,FACTIVA_SAN,4790496,[P ONE],[One P],Person,PERSON
2,2,1619436,WLF101-945401-62939,FACTIVA_SAN,1198704,[KIM],"[Kim,]",Person,PERSON


# 4. Appendix

__Detailed implementation__

Producing the dataframe that an agent consumes is fairly easy on its own. However, some interim data must also be created to support analytics.

There are 2 main categories of transformations.
1. Interface/config transformation: Activities on the agent input config/interface.
1. Data transformation: Activities on the data based on the config/interface.

Steps
1. Create the agent input config.
    1. Interface/config transformation. Define the agent input template. Each agent's input is a dictionary with 4 key-value pairs.
    ```
    {
        'ap': [],
        'ap_aliases': [],
        'wl': [] ,
        'wl_aliases': []
    }
    ```

        - `ap`: The primary value(s) of a specific attribute of the alerted party, e.g. name. The values can come from one or multiple columns.
        - `ap_aliases`: The aliases of that attribute of the alerted party. They can also come from one or multiple columns.
        - `wl` and `wl_aliases`: The same as above, for the watchlist party.

    1. Interface/config transformation. Define the list of agents. __Each agent's name must end with `_agent`.__
    ```
    agent_list = [
        'name_agent',
        'gender_agent'
    ]
    ```

    1. Interface/config transformation. Configure the agent input by specifying, for each agent and each party, which column(s) provide the primary value(s) and which provide the alias value(s).
    ```
    {
        'name_agent': {
            'ap': ['record_name', 'short_name'],
            'ap_aliases': ['alternate_name'],
            'wl': ['name_hit'],
            'wl_aliases': []
        },
        'gender_agent': {'ap': ['record_gender'],
                         'ap_aliases': [],
                         'wl': ['additional_infos_gender'],
                         'wl_aliases': []
                        },
    }
    ```
    The following terms are used in the rest of this appendix.
        1. `level-1-key`: The name of each agent, here `name_agent` and `gender_agent`.
        1. `level-1-value`: The value of each agent's config, which is a dictionary, e.g.
        ```
        {
            'ap': ['record_name', 'short_name'],
            'ap_aliases': ['alternate_name'],
            'wl': ['name_hit'],
            'wl_aliases': []
        }
        ```
        1. `level-2-key`: A key inside `level-1-value`, i.e. `ap`, `ap_aliases`, `wl` or `wl_aliases`.
        1. `level-2-value`: The list of column names stored under a `level-2-key`, e.g. `['record_name', 'short_name']`.
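
The rules stated in step 1 (four keys per agent, lists of column names as values, agent names ending in `_agent`) can be checked before any data is touched. The following is a minimal sketch under assumptions: `AGENT_INPUT_TEMPLATE_KEYS` and `validate_agent_input_config` are hypothetical names introduced here for illustration and are not part of the `pipeline` package.

```python
# Hypothetical helper, not part of the pipeline package: checks an agent
# input config against the template and naming rule described in step 1.
AGENT_INPUT_TEMPLATE_KEYS = ('ap', 'ap_aliases', 'wl', 'wl_aliases')


def validate_agent_input_config(agent_list, agent_input_config):
    """Return a list of problems; an empty list means the config looks consistent."""
    problems = []
    for agent_name in agent_list:
        if not agent_name.endswith('_agent'):
            problems.append(f"{agent_name}: name must end with '_agent'")
        if agent_name not in agent_input_config:
            problems.append(f"{agent_name}: no entry in the agent input config")
    for agent_name, level_1_value in agent_input_config.items():
        if agent_name not in agent_list:
            problems.append(f"{agent_name}: configured but not in the agent list")
        missing = [key for key in AGENT_INPUT_TEMPLATE_KEYS if key not in level_1_value]
        if missing:
            problems.append(f"{agent_name}: missing keys {missing}")
        for level_2_key, level_2_value in level_1_value.items():
            if level_2_key not in AGENT_INPUT_TEMPLATE_KEYS:
                problems.append(f"{agent_name}: unexpected key '{level_2_key}'")
            elif not isinstance(level_2_value, list):
                problems.append(f"{agent_name}.{level_2_key}: expected a list of column names")
    return problems


# The agent list and config from the steps above.
agent_list = ['name_agent', 'gender_agent']
agent_input_config = {
    'name_agent': {
        'ap': ['record_name', 'short_name'],
        'ap_aliases': ['alternate_name'],
        'wl': ['name_hit'],
        'wl_aliases': [],
    },
    'gender_agent': {
        'ap': ['record_gender'],
        'ap_aliases': [],
        'wl': ['additional_infos_gender'],
        'wl_aliases': [],
    },
}

problems = validate_agent_input_config(agent_list, agent_input_config)
```

Returning a list of problems instead of raising on the first one makes it easier to fix a whole config in one pass.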

1. Create the interim agent input config and data. The interface is standardized from here onwards. In reality, the data format can be more complex, e.g. for national IDs both the ID type and the document number need to be considered.
    1. Interface/config transformation. Prepend `level-1-key` to each `level-2-key` so that the prefixed keys can be used as new column names that hold the interim data for analytics and/or debugging activities. Take `name_agent` as an example.
    ```
    {
        'name_agent': {
            'name_agent_ap': ['record_name', 'short_name'],
            'name_agent_ap_aliases': ['alternate_name'],
            'name_agent_wl': ['name_hit'],
            'name_agent_wl_aliases': []
        }
    }
    ```
    1. Data transformation. Merge the values from the `level-2-value` columns into the column named by the prefixed `level-2-key`. The table below shows the result.
    
| uuid | record_name | short_name | alternate_name | wl_primary_name | name_hit |   name_agent_ap   | name_agent_ap_aliases | name_agent_wl | name_agent_wl_aliases |
| ---- | :---------: | :--------: | :------------: | :-------------: | :------: | :---------------: | :-------------------: | :-----------: | :-------------------: |
| 1234 |  Jim Green  |    J.G.    |      Jim       |   James Greg    |   J.G    | [Jim Green, J.G.] |          Jim          |      J.G      |         None          |
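
The two parts of step 2 can be sketched in plain Python on a single row represented as a dictionary. This is only an illustration of the logic, not the pipeline's Spark implementation: `prefix_level_2_keys` and `add_interim_columns` are hypothetical names, and the sketch always produces a list (empty when no source column is configured), whereas the table above shows single values unwrapped and missing ones as `None`.

```python
# Hypothetical helpers illustrating step 2; the real pipeline works on
# Spark dataframes, this sketch works on one row held in a dict.

def prefix_level_2_keys(agent_input_config):
    """Prepend each agent name (level-1-key) to its level-2-keys."""
    return {
        agent_name: {
            f"{agent_name}_{level_2_key}": list(columns)
            for level_2_key, columns in level_1_value.items()
        }
        for agent_name, level_1_value in agent_input_config.items()
    }


def add_interim_columns(row, prefixed_config):
    """Copy the row and add one interim column per prefixed level-2-key."""
    new_row = dict(row)
    for level_1_value in prefixed_config.values():
        for interim_column, source_columns in level_1_value.items():
            # Collect the source values in config order, skipping missing ones.
            new_row[interim_column] = [
                row[column] for column in source_columns
                if row.get(column) is not None
            ]
    return new_row


agent_input_config = {
    'name_agent': {
        'ap': ['record_name', 'short_name'],
        'ap_aliases': ['alternate_name'],
        'wl': ['name_hit'],
        'wl_aliases': [],
    }
}
row = {
    'uuid': 1234,
    'record_name': 'Jim Green',
    'short_name': 'J.G.',
    'alternate_name': 'Jim',
    'wl_primary_name': 'James Greg',
    'name_hit': 'J.G',
}

prefixed_config = prefix_level_2_keys(agent_input_config)
interim_row = add_interim_columns(row, prefixed_config)
# interim_row['name_agent_ap'] is ['Jim Green', 'J.G.']
```

The original columns are kept next to the interim ones, which is what makes the interim data usable for debugging.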

1. Create the final agent input config and data based on the standardized interface.
    1. Interface/config transformation. Now we have a consistent schema from which to create one list of alerted party values and one list of watchlist party values. We no longer need to worry about the customer-specific schema, e.g. `record_name`, `short_name`, etc. These have been standardized as `name_agent_ap`, `name_agent_ap_aliases`, etc.
    ```
    {
        'name_agent': {'ap_all_names_aggregated': ['name_agent_ap', 'name_agent_ap_aliases'],
                       'wl_all_names_aggregated': ['name_agent_wl', 'name_agent_wl_aliases']
                      }
    }
    ```
    1. Data transformation. Merge the values from the primary and alias columns. The table below shows the result.
    
| uuid | ap_all_names_aggregated | wl_all_names_aggregated |
| ---- | :---------------------: | :---------------------: |
| 1234 | [Jim Green, J.G., Jim]  |          [J.G]          |
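
The final merge in step 3 can be sketched the same way, again on a single row held in a dictionary. The names `as_list` and `build_agent_input` are hypothetical and used only for illustration; the interim row below copies the values from the interim table in step 2, including the unwrapped single values and the `None`.

```python
# Hypothetical helpers illustrating step 3; not the pipeline's Spark code.

final_agent_input_config = {
    'name_agent': {
        'ap_all_names_aggregated': ['name_agent_ap', 'name_agent_ap_aliases'],
        'wl_all_names_aggregated': ['name_agent_wl', 'name_agent_wl_aliases'],
    }
}


def as_list(value):
    """Normalize None, a single value, or a list/tuple into a list."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def build_agent_input(interim_row, final_config, agent_name, id_column='uuid'):
    """Merge the primary and alias columns into the agent's final input columns."""
    agent_input = {id_column: interim_row[id_column]}
    for target_column, source_columns in final_config[agent_name].items():
        merged = []
        for column in source_columns:
            merged.extend(as_list(interim_row.get(column)))
        agent_input[target_column] = merged
    return agent_input


interim_row = {
    'uuid': 1234,
    'name_agent_ap': ['Jim Green', 'J.G.'],
    'name_agent_ap_aliases': 'Jim',
    'name_agent_wl': 'J.G',
    'name_agent_wl_aliases': None,
}

name_agent_input = build_agent_input(interim_row, final_agent_input_config, 'name_agent')
# name_agent_input['ap_all_names_aggregated'] is ['Jim Green', 'J.G.', 'Jim']
# name_agent_input['wl_all_names_aggregated'] is ['J.G']
```

Because the source columns already follow the standardized `<agent>_ap` / `<agent>_wl` naming, this step does not need to know anything about the customer-specific column names.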