# Intro

This notebook walks through an ETL pipeline that transforms raw data from AIA Bank into "agent-friendly" input in four stages: raw, standardized, cleansed, and application.

# Libraries

In [1]:
import os
os.chdir("/app")
import warnings
warnings.simplefilter("ignore", ResourceWarning)

In [2]:
import pathlib  # used by the Config(...) call in section 4
import sys
import xml.etree.ElementTree as ET

import company_name
import organization_name_knowledge as org_know
import pipeline.knowledge_service
from company_name import CompanyNameAgent
from pipeline.preprocessors import (
    convert_raw_to_standardized,
    get_pandas_dataframe,
    show_files_in_directory,
    spark_instance,
    transform_cleansed_to_application,
    transform_standardized_to_cleansed,
)

# spark_manager must be on the path before its config can be imported.
sys.path.append("spark_manager")
from spark_manager.config.config import (
    APPLICATION_DATA_DIR,
    CLEANSED_DATA_DIR,
    RAW_DATA_DIR,
    STANDARDIZED_DATA_DIR,
)


Ivy Default Cache set to: /app/dependencies/ivy2/cache
The jars for the packages stored in: /app/dependencies/ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-a5f8e83f-ce6d-4674-8b56-87864991b0d2;1.0
	confs: [default]


:: loading settings :: url = jar:file:/env/ds/anaconda/envs/pipeline/lib/python3.8/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


	found io.delta#delta-core_2.12;1.0.0 in spark-list
	found org.antlr#antlr4;4.7 in spark-list
	found org.antlr#antlr4-runtime;4.7 in central
	found org.antlr#antlr-runtime;3.5.2 in central
	found org.antlr#ST4;4.0.8 in spark-list
	found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in spark-list
	found org.glassfish#javax.json;1.0.4 in spark-list
	found com.ibm.icu#icu4j;58.2 in central
:: resolution report :: resolve 161ms :: artifacts dl 4ms
	:: modules in use:
	com.ibm.icu#icu4j;58.2 from central in [default]
	io.delta#delta-core_2.12;1.0.0 from spark-list in [default]
	org.abego.treelayout#org.abego.treelayout.core;1.0.3 from spark-list in [default]
	org.antlr#ST4;4.0.8 from spark-list in [default]
	org.antlr#antlr-runtime;3.5.2 from central in [default]
	org.antlr#antlr4;4.7 from spark-list in [default]
	org.antlr#antlr4-runtime;4.7 from central in [default]
	org.glassfish#javax.json;1.0.4 from spark-list in [default]
	-------------------------------------------------------

# 1. Convert raw CSV data to standardized format

## Input (raw data)

In [3]:
show_files_in_directory(RAW_DATA_DIR)

./data/1.raw/ACM_ALERT_NOTES.csv
./data/1.raw/ACM_MD_ALERT_STATUSES.csv
./data/1.raw/ACM_ITEM_STATUS_HISTORY.csv
./data/1.raw/ALERTS.csv


## Process

In [4]:
convert_raw_to_standardized(RAW_DATA_DIR, STANDARDIZED_DATA_DIR)

## Output (standardized format)

In [5]:
show_files_in_directory(STANDARDIZED_DATA_DIR)

./data/2.standardized/ACM_ALERT_NOTES.delta
./data/2.standardized/ACM_ITEM_STATUS_HISTORY.delta
./data/2.standardized/ALERTS.delta
./data/2.standardized/ACM_MD_ALERT_STATUSES.delta


In [10]:
standardized_data = get_pandas_dataframe("./data/2.standardized/ALERTS.delta")

In [11]:
get_pandas_dataframe("./data/2.standardized/ALERTS.delta/")

Unnamed: 0,ALERT_INTERNAL_ID,ENTITY_TYPE_ID,ALERT_DATE,ALERT_TYPE_ID,STATUS_ID,STATUS_INTERNAL_ID,DELETED,HTML_FILE_KEY,P11,P12,...,CONSOLIDATION_KEY,HIBERNATE_OBJECT_VERSION,OWNER_IDENTIFIER,FL_DOUBT,NUM_EXISTING_ENTITIES,WORKSPACE_INTERNAL_ID,ALERT_NAME,PRIORITY_INTERNAL_ID,DETAILS_FOR_SEARCH,DETAILS
0,1649364,,24-Feb-20,,,7027,0,"<?xml version=""1.0"" encoding=""UTF16"" standalon...",01-13636,SSBSAPB0025000389,...,,13,,0,,,,,,
1,1619405,,8-Oct-18,,,7027,0,"<?xml version=""1.0"" encoding=""UTF16"" standalon...",Jan-97,42823012-P560016505,...,,9,,0,0.0,,,,,
2,1619436,,10-Oct-18,,,7027,0,"<?xml version=""1.0"" encoding=""UTF16"" standalon...",Jan-54,SSBPLASC550212199,...,,9,,0,0.0,,,,,


# 2. Create cleansed data

## Input 

See: output from previous stage

## Process

In [5]:
# %%capture --no-display --no-stderr --no-stdout
transform_standardized_to_cleansed()

  df[column_name] = series


> /app/pipeline/preprocessors.py(211)transform_standardized_to_cleansed()
    209     )
    210     import pdb; pdb.set_trace()
--> 211     reordered_columns = spark_instance.reorder_columns(
    212         item_status_history_df, "FROM_STATUS_IDENTIFIER", ["FROM_STATUS_NAME"]
    213     )

ipdb> c


ANTLR Tool version 4.7 used for code generation does not match the current runtime version 4.8

## Output (cleansed data)

Note: the "ACM_MD_ALERT_STATUSES" table was merged into the other tables, so it no longer appears as a separate file in the cleansed output.
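
As a rough illustration of how a status lookup table can be folded into another table, the sketch below joins status names onto alerts with pandas. The toy rows and the layout of the lookup table are assumptions made for illustration; only the column names `STATUS_INTERNAL_ID` and `STATUS_NAME` and the sample values appear in this notebook's outputs. The actual merge happens inside `transform_standardized_to_cleansed` and may be implemented differently.

```python
import pandas as pd

# Toy stand-ins for the standardized tables. The values mirror the notebook
# outputs; the layout of the status lookup table is an assumption.
alerts = pd.DataFrame(
    {
        "ALERT_INTERNAL_ID": [1649364, 1619405, 1619436],
        "STATUS_INTERNAL_ID": [7027, 7027, 7027],
    }
)
alert_statuses = pd.DataFrame(
    {"STATUS_INTERNAL_ID": [7027], "STATUS_NAME": ["False Positive"]}
)

# A left join keeps every alert and attaches the readable status name, so the
# lookup table does not need to be kept as a separate file afterwards.
alerts_with_status = alerts.merge(alert_statuses, on="STATUS_INTERNAL_ID", how="left")
print(alerts_with_status)
```

A left join is used here so that alerts with an unknown status would be kept (with a missing `STATUS_NAME`) rather than silently dropped.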

In [6]:
show_files_in_directory(CLEANSED_DATA_DIR)

./data/3.cleansed/ACM_ALERT_NOTES.delta
./data/3.cleansed/ACM_ITEM_STATUS_HISTORY.delta
./data/3.cleansed/ALERTS.delta


In [4]:
cleansed_data = get_pandas_dataframe("./data/3.cleansed/ALERTS.delta")

  df[column_name] = series


In [7]:
cleansed_data

Unnamed: 0,STATUS_INTERNAL_ID,STATUS_NAME,ALERT_INTERNAL_ID,ENTITY_TYPE_ID,ALERT_DATE,ALERT_TYPE_ID,STATUS_ID,DELETED,HTML_FILE_KEY,P11,...,hit_inputExplanations_address_city_inputExplanation,hit_inputExplanations_address_country_inputExplanation,hit_inputExplanations_addresses_stateProvince_inputExplanation,hit_inputExplanations_ids_idNumber_inputExplanation,ap_hit_names,wl_hit_matched_name,wl_hit_aliases_matched_name,wl_hit_names,hit_cs_1_data_points,ap_nric
0,7027,False Positive,1649364,,24-Feb-20,,,0,"<?xml version=""1.0"" encoding=""UTF16"" standalon...",01-13636,...,[],[],[],[],[CPF BOARD],,[CPF],[CPF],"{'possible_nric': None, 'nric': None, 'dob': N...",[]
1,7027,False Positive,1619405,,8-Oct-18,,,0,"<?xml version=""1.0"" encoding=""UTF16"" standalon...",Jan-97,...,[],[],[],[],[P ONE],,[One P],[One P],"{'possible_nric': None, 'nric': None, 'dob': N...",[S0135242C]
2,7027,False Positive,1619436,,10-Oct-18,,,0,"<?xml version=""1.0"" encoding=""UTF16"" standalon...",Jan-54,...,[],[],[],[],[KIM],,"[Kim,]","[Kim,]","{'possible_nric': None, 'nric': None, 'dob': N...",[]


In [8]:
new_columns = set(cleansed_data.columns).difference(set(standardized_data.columns))

NameError: name 'standardized_data' is not defined

In [9]:
len(standardized_data.columns), len(cleansed_data.columns)

NameError: name 'standardized_data' is not defined

In [17]:
len(new_columns)

122

In [6]:
cleansed_data[['hit_addresses_country', 'hit_datesOfBirth_birthDate',  'wl_hit_names', 'party_type_agent_ap']]

KeyError: "['ap_all_party_types_aggregated'] not in index"

# 3. Prepare agent inputs

## Input

See: previous tables

## Process

In [3]:
transform_cleansed_to_application()

  df[column_name] = series


> /app/pipeline/preprocessors.py(449)transform_cleansed_to_application()
    447     import pdb; pdb.set_trace()
    448     # Agent input creator
--> 449     key_cols = ["_index", "ALERT_INTERNAL_ID", "ALERT_ID", "hit_listId", "hit_entryId"]
    450     for agent_name, input_agg_col_config in agent_input_agg_col_config.items():
    451         start = time.time()

## Output (example)

In [16]:
show_files_in_directory(APPLICATION_DATA_DIR)

./data/4.application/agent-input
./data/4.application/agent_input_agg_df.delta


In [17]:
show_files_in_directory('./data/4.application/agent-input')

./data/4.application/agent-input/pob_agent_input.delta
./data/4.application/agent-input/document_number_agent_input.delta
./data/4.application/agent-input/rba_agent_input.delta
./data/4.application/agent-input/gender_agent_input.delta
./data/4.application/agent-input/hit_has_dob_id_address_agent_input.delta
./data/4.application/agent-input/name_agent_input.delta
./data/4.application/agent-input/hit_is_deceased_agent_input.delta
./data/4.application/agent-input/pep_payment_agent_input.delta
./data/4.application/agent-input/historical_decision_name_agent_input.delta
./data/4.application/agent-input/party_type_agent_input.delta
./data/4.application/agent-input/dob_agent_input.delta
./data/4.application/agent-input/national_id_agent_input.delta
./data/4.application/agent-input/hit_is_san_agent_input.delta
./data/4.application/agent-input/nationality_agent_input.delta


In [4]:
from glob import glob

REFERENCE_DIR = "tests/data"
ID_COLUMN = "_index"  # sort key shared by the reference and tested tables

# Compare every generated agent-input table against its reference copy under tests/data.
for reference in glob(os.path.join(REFERENCE_DIR, "4.application/agent-input/*delta")):
    # Map tests/data/4.application/... to data/4.application/...
    tested_path = os.path.relpath(reference, "tests")
    if not os.path.exists(tested_path):
        print("not found", tested_path)
        continue
    reference_dataframe = spark_instance.read_delta(reference)
    tested_dataframe = spark_instance.read_delta(tested_path)
    print(tested_path)
    reference_rows = reference_dataframe.sort(ID_COLUMN, ascending=False).collect()
    tested_rows = tested_dataframe.sort(ID_COLUMN, ascending=False).collect()
    # zip() stops at the shorter list, so check the row counts first.
    assert len(tested_rows) == len(reference_rows)
    for tested_row, reference_row in zip(tested_rows, reference_rows):
        assert tested_row == reference_row
    print(f"PASSED for \n- {tested_path} \n -{reference}\n")

data/4.application/agent-input/pob_agent_input.delta
PASSED for 
- data/4.application/agent-input/pob_agent_input.delta 
 -tests/data/4.application/agent-input/pob_agent_input.delta

data/4.application/agent-input/document_number_agent_input.delta
PASSED for 
- data/4.application/agent-input/document_number_agent_input.delta 
 -tests/data/4.application/agent-input/document_number_agent_input.delta

data/4.application/agent-input/rba_agent_input.delta
PASSED for 
- data/4.application/agent-input/rba_agent_input.delta 
 -tests/data/4.application/agent-input/rba_agent_input.delta

data/4.application/agent-input/gender_agent_input.delta
PASSED for 
- data/4.application/agent-input/gender_agent_input.delta 
 -tests/data/4.application/agent-input/gender_agent_input.delta

data/4.application/agent-input/hit_has_dob_id_address_agent_input.delta
PASSED for 
- data/4.application/agent-input/hit_has_dob_id_address_agent_input.delta 
 -tests/data/4.application/agent-input/hit_has_dob_id_address_ag

In [6]:
tested_row == reference_row

True

In [17]:
tested_dataframe

DataFrame[_index: bigint, ALERT_INTERNAL_ID: int, ALERT_ID: string, hit_listId: string, hit_entryId: string, ap_all_names_aggregated: array<string>, wl_all_names_aggregated: array<string>, party_type_agent_ap: array<string>, party_type_agent_wl: array<string>]

In [23]:
reference_dataframe.toPandas().party_type_agent_ap

0    Organization
1          Person
2          Person
Name: party_type_agent_ap, dtype: object

In [7]:
get_pandas_dataframe('tests/data/4.application/agent-input/name_agent_input.delta')

Unnamed: 0,_index,ALERT_INTERNAL_ID,ALERT_ID,hit_listId,hit_entryId,ap_all_names_aggregated,wl_all_names_aggregated,party_type_agent_ap,party_type_agent_wl
0,0,1649364,WLF101-1363601-89626,FACTIVA_SIE,1091285,[CPF BOARD],[CPF],Organization,ORGANIZATION
1,1,1619405,WLF101-939701-62908,FACTIVA_SAN,4790496,[P ONE],[One P],Person,PERSON
2,2,1619436,WLF101-945401-62939,FACTIVA_SAN,1198704,[KIM],"[Kim,]",Person,PERSON


In [18]:
get_pandas_dataframe('./data/4.application/agent-input/name_agent_input.delta')

Unnamed: 0,_index,ALERT_INTERNAL_ID,ALERT_ID,hit_listId,hit_entryId,ap_all_names_aggregated,wl_all_names_aggregated,party_type_agent_ap,party_type_agent_wl
0,0,1649364,WLF101-1363601-89626,FACTIVA_SIE,1091285,[CPF BOARD],[CPF],Organization,ORGANIZATION
1,1,1619405,WLF101-939701-62908,FACTIVA_SAN,4790496,[P ONE],[One P],Person,PERSON
2,2,1619436,WLF101-945401-62939,FACTIVA_SAN,1198704,[KIM],"[Kim,]",Person,PERSON


In [19]:
get_pandas_dataframe('./data/4.application/agent-input/nationality_agent_input.delta')

Unnamed: 0,_index,ALERT_INTERNAL_ID,ALERT_ID,hit_listId,hit_entryId,ap_all_nationalities_aggregated,wl_all_nationalities_aggregated
0,0,1649364,WLF101-1363601-89626,FACTIVA_SIE,1091285,[],[CN]
1,1,1619405,WLF101-939701-62908,FACTIVA_SAN,4790496,[SG],[]
2,2,1619436,WLF101-945401-62939,FACTIVA_SAN,1198704,[],[KP]


# 4. Knowledge services

In [20]:
config = Config(configuration_dirs=(pathlib.Path("./config"),))
org_name_agent = CompanyNameAgent(config)


NameError: name 'Config' is not defined

In [None]:
org_name_agent.resolve(["China Petroleum Finance Co. Ltd"], ["CPF"])

In [None]:
company_name.compare("China Petroleum Finance Co. Ltd", "CPF")

## Data discovery example

In [None]:
ROW_ID = 0

row_tree = ET.ElementTree(ET.fromstring(standardized_data['HTML_FILE_KEY'][ROW_ID]))
row_values = standardized_data[[i for i in standardized_data if 'HTML_FILE_KEY' not in i]].loc[ROW_ID]

In [None]:
org_features = pipeline.knowledge_service.get_features(row_tree, org_know.parse_freetext, pipeline.knowledge_service.print_org_results)

In [None]:
org_features_columns = [
    pipeline.knowledge_service.get_column_features(
        column, value, org_know.parse_freetext, pipeline.knowledge_service.print_column_results
    )
    for column, value in row_values.to_dict().items()
]

# 5. Appendix

__Detailed implementation__

Producing the dataframe that the agents consume is straightforward on its own. The pipeline additionally creates some interim data to support analytics.

There are two main categories of transformation.
1. Interface/config transformation: Activities on the agent input config/interface.
1. Data transformation: Activities on the data, driven by the config/interface.

Steps
1. Create the agent input config.
    1. Interface/config transformation. Define agent input template. Each agent's input is a dictionary with 4 key-value pairs.
    ```
    {
        'ap': [],
        'ap_aliases': [],
        'wl': [],
        'wl_aliases': []
    }
    ```

        - `ap`: The primary value(s) of a specific attribute of the alerted party, e.g., name. The values can come from one or multiple columns.
        - `ap_aliases`: The aliases of the same attribute of the alerted party. They can also come from one or multiple columns.
        - `wl` and `wl_aliases`: The same, for the watchlist party.

    1. Interface/config transformation. Define the list of agents. __Each agent's name must end with `_agent`.__
    ```
    agent_list = [
        'name_agent',
        'gender_agent'
    ]
    ```

    1. Interface/config transformation. Configure the agent input by specifying, for each agent and each party, which column(s) supply the primary value(s) and which supply the alias value(s).
    ```
    {
        'name_agent': {
            'ap': ['record_name', 'short_name'],
            'ap_aliases': ['alternate_name'],
            'wl': ['name_hit'],
            'wl_aliases': []
        },
        'gender_agent': {'ap': ['record_gender'],
                         'ap_aliases': [],
                         'wl': ['additional_infos_gender'],
                         'wl_aliases': []
                        },
    }
    ```
    The following terms are used in the rest of this appendix.
        1. `level-1-key`: The name of each agent, here `name_agent` and `gender_agent`.
        1. `level-1-value`: The config of each agent, which is a dictionary, e.g.,
        ```
        {
            'ap': ['record_name', 'short_name'],
            'ap_aliases': ['alternate_name'],
            'wl': ['name_hit'],
            'wl_aliases': []
        }
        ```
        1. `level-2-key`: A key of `level-1-value`, i.e., one of `ap`, `ap_aliases`, `wl` and `wl_aliases`.
        1. `level-2-value`: The list of column names, e.g., `['record_name', 'short_name']`.

1. Create the interim agent input config and data. The interface is standardized from here onwards. In reality, the data format can be more complex, e.g., for national IDs we need to consider both the type and the document number.
    1. Interface/config transformation. Prepend `level-1-key` to `level-2-key` so that the prefixed `level-2-key` can be used as a new column name that hosts the interim data for analytics and/or debugging activities. Take `name_agent` for example.
    ```
    {
        'name_agent': {
            'name_agent_ap': ['record_name', 'short_name'],
            'name_agent_ap_aliases': ['alternate_name'],
            'name_agent_wl': ['name_hit'],
            'name_agent_wl_aliases': []
        }
    }
    ```
    1. Data transformation. Merge the values from the `level-2-value` columns into the `level-2-key` column. The table below shows the result.
    
| uuid | record_name | short_name | alternate_name | wl_primary_name | name_hit |   name_agent_ap   | name_agent_ap_aliases | name_agent_wl | name_agent_wl_aliases |
| ---- | :---------: | :--------: | :------------: | :-------------: | :------: | :---------------: | :-------------------: | :-----------: | :-------------------: |
| 1234 |  Jim Green  |    J.G.    |      Jim       |   James Greg    |   J.G    | [Jim Green, J.G.] |          Jim          |      J.G      |         None          |
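
The two sub-steps above can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline's implementation: the helper names `prefix_level_2_keys` and `merge_columns` are hypothetical, a single row is represented as a dictionary, and every interim value is kept as a list (the table above shows single values unwrapped).

```python
from typing import Dict, List, Optional

AgentConfig = Dict[str, Dict[str, List[str]]]

# Agent input config from step 1, restricted to name_agent.
agent_input_config: AgentConfig = {
    "name_agent": {
        "ap": ["record_name", "short_name"],
        "ap_aliases": ["alternate_name"],
        "wl": ["name_hit"],
        "wl_aliases": [],
    }
}


def prefix_level_2_keys(config: AgentConfig) -> AgentConfig:
    """Prepend each level-1 key (agent name) to its level-2 keys (hypothetical helper)."""
    return {
        agent: {f"{agent}_{key}": columns for key, columns in agent_config.items()}
        for agent, agent_config in config.items()
    }


def merge_columns(row: Dict[str, Optional[str]], config: AgentConfig) -> Dict[str, List[str]]:
    """Collect the non-empty source values of one row into the interim columns."""
    interim: Dict[str, List[str]] = {}
    for agent_config in config.values():
        for interim_column, source_columns in agent_config.items():
            values = [row.get(column) for column in source_columns]
            interim[interim_column] = [value for value in values if value is not None]
    return interim


interim_config = prefix_level_2_keys(agent_input_config)
row = {
    "uuid": "1234",
    "record_name": "Jim Green",
    "short_name": "J.G.",
    "alternate_name": "Jim",
    "wl_primary_name": "James Greg",
    "name_hit": "J.G",
}
interim_row = merge_columns(row, interim_config)
print(interim_row)
```

Keeping every interim value as a list, including empty ones, means later steps can concatenate columns without special-casing missing values.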

1. Create the final agent input config and data based on the standardized interface.
    1. Interface/config transformation. Now we have a consistent schema for creating one list of alerted party values and one list of watchlist party values. We no longer need to worry about the customer-specific schema, e.g., `record_name`, `short_name`, etc. These have been standardized as `name_agent_ap`, `name_agent_ap_aliases`, etc.
    ```
    {
        'name_agent': {'ap_all_names_aggregated': ['name_agent_ap', 'name_agent_ap_aliases'],
                       'wl_all_names_aggregated': ['name_agent_wl', 'name_agent_wl_aliases']
                      }
    }
    ```
    1. Data transformation. Merge the values from the primary and alias columns. The table below shows the result.
    
| uuid | ap_all_names_aggregated | wl_all_names_aggregated |
| ---- | :---------------------: | :---------------------: |
| 1234 | [Jim Green, J.G., Jim]  |          [J.G]          |
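
A minimal sketch of this final aggregation in plain Python. It assumes one row is represented as a dictionary whose interim columns hold lists; the helper name `aggregate_row` is hypothetical. The sketch concatenates the primary and alias lists and drops repeated values while preserving order. Whether the real pipeline deduplicates is not stated in this notebook, so treat that part as an assumption.

```python
from typing import Dict, List

# Final config: aggregated column -> interim columns to merge.
final_config: Dict[str, Dict[str, List[str]]] = {
    "name_agent": {
        "ap_all_names_aggregated": ["name_agent_ap", "name_agent_ap_aliases"],
        "wl_all_names_aggregated": ["name_agent_wl", "name_agent_wl_aliases"],
    }
}


def aggregate_row(
    interim_row: Dict[str, List[str]],
    config: Dict[str, Dict[str, List[str]]],
) -> Dict[str, List[str]]:
    """Concatenate the interim list columns of one row into the aggregated columns."""
    aggregated: Dict[str, List[str]] = {}
    for agent_config in config.values():
        for target_column, interim_columns in agent_config.items():
            merged: List[str] = []
            for column in interim_columns:
                for value in interim_row.get(column) or []:
                    if value not in merged:  # assumption: drop repeats, keep order
                        merged.append(value)
            aggregated[target_column] = merged
    return aggregated


interim_row = {
    "name_agent_ap": ["Jim Green", "J.G."],
    "name_agent_ap_aliases": ["Jim"],
    "name_agent_wl": ["J.G"],
    "name_agent_wl_aliases": [],
}
aggregated_row = aggregate_row(interim_row, final_config)
print(aggregated_row)
```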