###  Notebook 01 – Simulating Flags for Sender & Receiver Entities

####  Objective:
To simulate real-world **risk indicator flags** for **sender and receiver entities** involved in financial transactions. These flags serve as essential features for downstream **risk scoring** and **alert generation** logic.

####  What We Did:
- Created a synthetic dataset of sender and receiver entities.
- Simulated the following key flags:
 
  - **KYC Status** (e.g., Verified, Unverified, Expired)
  - **PEP (Politically Exposed Person)**: Higher-risk individuals.
  - **Watchlist**: Entities listed in global sanctions/watchlists.
  - **Risk Country**: Entities from high-risk jurisdictions.
  - **Sanctioned**: Entities already sanctioned (randomly tagged).
  - **Fraud**: Entities flagged for potential fraud.
  - **Any Risk Flag**: At least one risk factor is triggered.
  
  - Generated composite risk flags (`Sender_Any_Risk_Flag`, `Receiver_Any_Risk_Flag`) for quick aggregation of risk.
- Saved the final dataset for use in later notebooks for scoring, rule logic, and dashboards.

####  Why This Matters:
This type of risk flag simulation closely resembles pre-processing workflows in AML/KYC pipelines. It enables:
- Financial risk scoring systems
- Real-time or batch alert simulations
- Regulatory compliance reporting and dashboards

  **Note**: This dataset was simulated and enriched to resemble realistic financial transaction patterns with embedded compliance risk indicators. All transformations are documented step-by-step for transparency.




---

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Step 1 Define data types to reduce memory usage
dtypes = {
    'From Bank': 'int32',
    'From_Account': 'category',
    'To Bank': 'int32',
    'To_Account': 'category',
    'Amount Received': 'float32',
    'Amount Paid': 'float32',
    'Receiving Currency': 'category',
    'Payment Currency': 'category',
    'Payment Format': 'category',
    'Is Laundering': 'int8'
}

# Step 2  Define datetime columns to parse
parse_dates = ['Timestamp']

In [3]:
df =pd.read_csv(r"C:\Users\amalm\OneDrive\Desktop\finamcial_data_analysis_lerning\project\Client Data Lifecycle Simulation\merged_data\merged_transactions_with_accounts.csv" ,dtype=dtypes,parse_dates=parse_dates,low_memory=False)
df.head(-10)

Unnamed: 0,Timestamp,From Bank,From Account,To Bank,To Account,Amount Received,Receiving Currency,Amount Paid,Payment Currency,Payment Format,...,From Bank Name,Bank ID,Account Number,Sender Entity ID,Sender Entity Name,To Bank Name,Bank ID_Receiver,Account Number_Receiver,Receiver Entity ID,Receiver Entity Name
0,2022-09-01 00:08:00,11,8000ECA90,11,8000ECA90,3.195403e+06,US Dollar,3.195403e+06,US Dollar,Reinvestment,...,Savings Bank of Madison,11,8000ECA90,80006E250,Individual #1,Savings Bank of Madison,11,8000ECA90,80006E250,Individual #1
1,2022-09-01 00:21:00,3402,80021DAD0,3402,80021DAD0,1.858960e+03,US Dollar,1.858960e+03,US Dollar,Reinvestment,...,National Bank of New York,3402,80021DAD0,80154AF70,Sole Proprietorship #1,National Bank of New York,3402,80021DAD0,80154AF70,Sole Proprietorship #1
2,2022-09-01 00:00:00,11,8000ECA90,1120,8006AA910,5.925710e+05,US Dollar,5.925710e+05,US Dollar,Cheque,...,Savings Bank of Madison,11,8000ECA90,80006E250,Individual #1,First Bank of the South,1120,8006AA910,801537F30,Partnership #1
3,2022-09-01 00:16:00,3814,8006AD080,3814,8006AD080,1.232000e+01,US Dollar,1.232000e+01,US Dollar,Reinvestment,...,Sappo Thrift,3814,8006AD080,801415540,Partnership #2,Sappo Thrift,3814,8006AD080,801415540,Partnership #2
4,2022-09-01 00:00:00,20,8006AD530,20,8006AD530,2.941560e+03,US Dollar,2.941560e+03,US Dollar,Reinvestment,...,First Bank of Danbury,20,8006AD530,800CA6070,Corporation #1,First Bank of Danbury,20,8006AD530,800CA6070,Corporation #1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6924034,2022-09-10 23:50:00,70426,819DE8791,71657,81BE73611,2.924000e-02,Bitcoin,2.924000e-02,Bitcoin,Bitcoin,...,Crytpo Bank #59,70426,819DE8791,8015B10D0,Partnership #953,Crytpo Bank #97,71657,81BE73611,8006D12E0,Partnership #50307
6924035,2022-09-10 23:30:00,70,10042BA51,170558,81BEAF511,3.661940e-01,Bitcoin,3.661940e-01,Bitcoin,Bitcoin,...,Willows Thrift,70,10042BA51,8015B10D0,Partnership #953,Crytpo Bank #41,170558,81BEAF511,8000FBB80,Corporation #49952
6924036,2022-09-10 23:59:00,70,10042BA51,170558,81BEAF511,5.755200e-02,Bitcoin,5.755200e-02,Bitcoin,Bitcoin,...,Willows Thrift,70,10042BA51,8015B10D0,Partnership #953,Crytpo Bank #41,170558,81BEAF511,8000FBB80,Corporation #49952
6924037,2022-09-10 23:45:00,70,10042BA51,270308,81BEBEE71,1.623860e-01,Bitcoin,1.623860e-01,Bitcoin,Bitcoin,...,Willows Thrift,70,10042BA51,8015B10D0,Partnership #953,Crytpo Bank #16,270308,81BEBEE71,8015B10D0,Partnership #953


In [4]:
entities = pd.concat ([
    df[['Sender Entity ID', 'Sender Entity Name']].rename(columns={
        'Sender Entity ID': 'Entity ID', 'Sender Entity Name': 'Entity Name'}),
    df[['Receiver Entity ID', 'Receiver Entity Name']].rename(columns={
        'Receiver Entity ID': 'Entity ID', 'Receiver Entity Name': 'Entity Name'})
]).drop_duplicates().reset_index(drop=True)

##  Generating Risk Flags from Entity Attributes

We simulate whether an entity is a PEP, is on a Watchlist, or is from a Risk Country using existing columns:

- If the entity's `type` is in a pre-defined list → PEP flag = 1.
- If the entity's `name` matches names in a known watchlist → Watchlist flag = 1.
- If the entity is from a known high-risk country → Risk Country flag = 1.

These flags are useful for identifying red flags in KYC (Know Your Customer) processes.

In [5]:
# Simulate KYC attributes
np.random.seed(42)
entities['PEP_Flag'] = np.random.choice([0, 1], size=len(entities), p=[0.95, 0.05])
entities['Watchlist_Flag'] = np.random.choice([0, 1], size=len(entities), p=[0.97, 0.03])
entities['Risk_Country_Flag'] = np.random.choice([0, 1], size=len(entities), p=[0.90, 0.10])
entities['KYC_Status'] = np.where(entities['Watchlist_Flag'] == 1, 'Fail', 'Pass')


In [6]:
entities.head()


Unnamed: 0,Entity ID,Entity Name,PEP_Flag,Watchlist_Flag,Risk_Country_Flag,KYC_Status
0,80006E250,Individual #1,0,0,0,Pass
1,80154AF70,Sole Proprietorship #1,1,0,1,Pass
2,801415540,Partnership #2,0,0,0,Pass
3,800CA6070,Corporation #1,0,0,0,Pass
4,8011BBB00,Partnership #3,0,1,0,Fail


In [7]:
# Merge with sender
df = df.merge(
    entities.rename(columns=lambda x: f"Sender_{x}" if x not in ['Entity ID'] else 'Sender Entity ID'),
    on='Sender Entity ID',
    how='left'
)

# Merge with receiver
df = df.merge(
    entities.rename(columns=lambda x: f"Receiver_{x}" if x not in ['Entity ID'] else 'Receiver Entity ID'),
    on='Receiver Entity ID',
    how='left'
)

In [8]:
df.head(10)

Unnamed: 0,Timestamp,From Bank,From Account,To Bank,To Account,Amount Received,Receiving Currency,Amount Paid,Payment Currency,Payment Format,...,Sender_Entity Name,Sender_PEP_Flag,Sender_Watchlist_Flag,Sender_Risk_Country_Flag,Sender_KYC_Status,Receiver_Entity Name,Receiver_PEP_Flag,Receiver_Watchlist_Flag,Receiver_Risk_Country_Flag,Receiver_KYC_Status
0,2022-09-01 00:08:00,11,8000ECA90,11,8000ECA90,3195403.0,US Dollar,3195403.0,US Dollar,Reinvestment,...,Individual #1,0,0,0,Pass,Individual #1,0,0,0,Pass
1,2022-09-01 00:21:00,3402,80021DAD0,3402,80021DAD0,1858.96,US Dollar,1858.96,US Dollar,Reinvestment,...,Sole Proprietorship #1,1,0,1,Pass,Sole Proprietorship #1,1,0,1,Pass
2,2022-09-01 00:00:00,11,8000ECA90,1120,8006AA910,592571.0,US Dollar,592571.0,US Dollar,Cheque,...,Individual #1,0,0,0,Pass,Partnership #1,0,0,1,Pass
3,2022-09-01 00:16:00,3814,8006AD080,3814,8006AD080,12.32,US Dollar,12.32,US Dollar,Reinvestment,...,Partnership #2,0,0,0,Pass,Partnership #2,0,0,0,Pass
4,2022-09-01 00:00:00,20,8006AD530,20,8006AD530,2941.56,US Dollar,2941.56,US Dollar,Reinvestment,...,Corporation #1,0,0,0,Pass,Corporation #1,0,0,0,Pass
5,2022-09-01 00:24:00,12,8006ADD30,12,8006ADD30,6473.62,US Dollar,6473.62,US Dollar,Reinvestment,...,Partnership #3,0,1,0,Fail,Partnership #3,0,1,0,Fail
6,2022-09-01 00:17:00,11,800059120,1217,8006AD4E0,60562.0,US Dollar,60562.0,US Dollar,ACH,...,Sole Proprietorship #2,0,0,1,Pass,Corporation #2,0,0,0,Pass
7,2022-09-01 00:07:00,11,8000ECA90,11,8000ECA90,22.97,US Dollar,22.97,US Dollar,Reinvestment,...,Individual #1,0,0,0,Pass,Individual #1,0,0,0,Pass
8,2022-09-01 00:28:00,1120,8006AA910,243166,81470DCF0,43.53,US Dollar,43.53,US Dollar,Credit Card,...,Partnership #1,0,0,1,Pass,Partnership #4,1,0,0,Pass
9,2022-09-01 00:22:00,1217,8006AD4E0,1217,8006AD4E0,5.04,US Dollar,5.04,US Dollar,Reinvestment,...,Corporation #2,0,0,0,Pass,Corporation #2,0,0,0,Pass


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6924049 entries, 0 to 6924048
Data columns (total 31 columns):
 #   Column                      Dtype         
---  ------                      -----         
 0   Timestamp                   datetime64[ns]
 1   From Bank                   int32         
 2   From Account                object        
 3   To Bank                     int32         
 4   To Account                  object        
 5   Amount Received             float32       
 6   Receiving Currency          category      
 7   Amount Paid                 float32       
 8   Payment Currency            category      
 9   Payment Format              category      
 10  Is Laundering               int8          
 11  From Bank Name              object        
 12  Bank ID                     int64         
 13  Account Number              object        
 14  Sender Entity ID            object        
 15  Sender Entity Name          object        
 16  To Bank Name      

##  Simulating Fraud and Sanctioned Columns

We introduce randomness to simulate:
- **Fraud Flag**: Randomly assigned to ~5% of rows to simulate real-world fraud prevalence.
- **Sanctioned Flag**: Assigned to ~2% of rows, representing entities flagged by regulatory bodies.

These probabilities reflect realistic compliance data distributions and help in building synthetic datasets for supervised learning or dashboard monitoring.


In [10]:
# Simulate Fraud Flags: 5% chance of being flagged
df['Sender_Fraud_Flag'] = np.random.choice([0, 1], size=len(df), p=[0.95, 0.05])
df['Receiver_Fraud_Flag'] = np.random.choice([0, 1], size=len(df), p=[0.95, 0.05])

# Simulate Sanctioned Flags: 2% chance
df['Sender_Sanctioned_Flag'] = np.random.choice([0, 1], size=len(df), p=[0.98, 0.02])
df['Receiver_Sanctioned_Flag'] = np.random.choice([0, 1], size=len(df), p=[0.98, 0.02])

##  Aggregating Risk Flags

We create a single `any_flagged` column that indicates whether an entity is flagged for **any** of the risk factors (PEP, Watchlist, Risk Country, Fraud, or Sanctioned).

This simplifies filtering and visualization in downstream applications (e.g., dashboards, machine learning).

In [11]:
# Does entity raise any flag 
df['Sender_Any_Risk_Flag'] = (
    df['Sender_PEP_Flag'] | 
    df['Sender_Watchlist_Flag'] | 
    df['Sender_Risk_Country_Flag'] |
    df['Sender_Fraud_Flag'] |
    df['Sender_Sanctioned_Flag']
)
df['Receiver_Any_Risk_Flag'] = (
    df['Receiver_PEP_Flag'] | 
    df['Receiver_Watchlist_Flag'] | 
    df['Receiver_Risk_Country_Flag'] |
    df['Receiver_Fraud_Flag']|
    df['Receiver_Sanctioned_Flag']
    
)

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6924049 entries, 0 to 6924048
Data columns (total 37 columns):
 #   Column                      Dtype         
---  ------                      -----         
 0   Timestamp                   datetime64[ns]
 1   From Bank                   int32         
 2   From Account                object        
 3   To Bank                     int32         
 4   To Account                  object        
 5   Amount Received             float32       
 6   Receiving Currency          category      
 7   Amount Paid                 float32       
 8   Payment Currency            category      
 9   Payment Format              category      
 10  Is Laundering               int8          
 11  From Bank Name              object        
 12  Bank ID                     int64         
 13  Account Number              object        
 14  Sender Entity ID            object        
 15  Sender Entity Name          object        
 16  To Bank Name      

##  Saving Transformed Data

The final dataframe is saved as a CSV file for use in subsequent analysis notebooks. This modular design ensures reusability and clean data flow across the project.


In [13]:
df.to_csv(r"C:\Users\amalm\OneDrive\Desktop\finamcial_data_analysis_lerning\project\Client Data Lifecycle Simulation\merged_data\flagged_data.csv", index =False)


##  Summary

- Created 5 binary risk flags using a combination of logic and randomness.
- Introduced synthetic but realistic indicators used in financial compliance workflows.
- Saved the enriched dataset for use in dashboards, fraud detection models, or reporting pipelines.

This notebook lays the foundation for visualizing and analyzing entity-level risk in a structured and scalable way.
