# Libraries

In [1]:
import pandas as pd

# Data sources

- Original source of the data: https://webpages.charlotte.edu/mirsad/itcs6265/group1/domain.html. Some relevant information:
  - "The Berka dataset is a collection of financial information from a Czech bank. The dataset deals with over 5,300 bank clients with approximately 1,000,000 transactions. Additionally, the bank represented in the dataset has extended close to 700 loans and issued nearly 900 credit cards, all of which are represented in the data." 
    - Note that to replicate the results of Tariq and Hassani (2023) but with this dataset, only the bank transactions are relevant (and not the issued loans and credit cards).
    - According to Tariq and Hassani (2023), we need 5 variables to implement their algorithm (pp. 6-7):
      - *transaction_id*: unique identifier for a transaction.
      - *transaction_timestamp*: Unix time of the transaction.
      - *account_identifiers*: one for the sender and one for the receiver of a transaction (two variables).
      - *amount*: amount involved in the transaction.
    - Tables of interest from the dataset, considering the previous variables:
      - *transaction*: each record describes one transaction on an account.
    - Potential problems of this dataset:
      1. The dataset is much smaller than the one used in Tariq and Hassani (2023), which initially contains 1.1 billion transactions.
      2. In contrast to the data used in Tariq and Hassani (2023), **all of the data comes from a single Czech bank** (there are no operations between banks), which may significantly limit the analysis. This does not mean, however, that the origin and destination banks of a transaction are always the same, as the dataset records this potential distinction.
      3. We have the *date* of each transaction, but not the time of day, so no exact Unix timestamp is available.
      4. It only has credit card transactions.
- Source for data download: https://www.kaggle.com/datasets/marceloventura/the-berka-dataset

Alternative datasets that could overcome some of the problems of the Berka dataset:
- Synthetic transaction data from IBM: https://www.kaggle.com/datasets/ealtman2019/ibm-transactions-for-anti-money-laundering-aml. Source paper for the generation of this data: https://arxiv.org/abs/2306.16424
  - Advantages:
    - Labelled.
    - Realistic.
    - Has all of the necessary data, including a timestamp with minute-level resolution.
    - Huge dataset.
  - Disadvantages:
    - Synthetic.
- Anti Money Laundering Transaction Data (SAML-D): https://www.kaggle.com/datasets/berkanoztas/synthetic-transaction-monitoring-dataset-aml. Source paper for the generation of this data: https://ieeexplore.ieee.org/document/10356193
  - Advantages:
    - Labelled.
    - Realistic.
    - Has all of the necessary data, including a timestamp with second-level resolution.
    - Huge dataset.
  - Disadvantages:
    - Synthetic.

# Importing the data

In [7]:
transactions = pd.read_csv('berka_dataset/trans.csv', delimiter = ';', 
                           low_memory = False)

transactions.head()

Unnamed: 0,trans_id,account_id,date,type,operation,amount,balance,k_symbol,bank,account
0,695247,2378,930101,PRIJEM,VKLAD,700.0,700.0,,,
1,171812,576,930101,PRIJEM,VKLAD,900.0,900.0,,,
2,207264,704,930101,PRIJEM,VKLAD,1000.0,1000.0,,,
3,1117247,3818,930101,PRIJEM,VKLAD,600.0,600.0,,,
4,579373,1972,930102,PRIJEM,VKLAD,400.0,400.0,,,
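
The `date` column is read as an integer in `YYMMDD` form (for example, `930101`). Since the algorithm expects a Unix timestamp, one possible workaround is sketched below on a small hand-made sample rather than the full file: parse the date and use midnight UTC of that day as the transaction time. The midnight choice and the `unix_time` column name are assumptions made for illustration; they are not part of the dataset or of Tariq and Hassani (2023).

```python
import pandas as pd

# Small hand-made sample mimicking the `date` column of trans.csv (YYMMDD integers).
sample = pd.DataFrame({"trans_id": [695247, 579373], "date": [930101, 930102]})

# Parse YYMMDD; zero-pad in case a value lost leading zeros when read as an integer.
parsed = pd.to_datetime(sample["date"].astype(str).str.zfill(6), format="%y%m%d")

# Assumption: no time of day is available, so midnight UTC of each date is used.
sample["unix_time"] = (parsed - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

print(sample)
```

With day-level resolution, all transactions on the same date share one timestamp, so any ordering within a day is lost.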


In [None]:
# Remove uninteresting columns (see https://webpages.charlotte.edu/mirsad/itcs6265/group1/transaction_domain.html)

transactions = transactions.drop(columns = ['balance', 'operation', 'k_symbol'])

transactions.head()

In [11]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1056320 entries, 0 to 1056319
Data columns (total 7 columns):
 #   Column      Non-Null Count    Dtype  
---  ------      --------------    -----  
 0   trans_id    1056320 non-null  int64  
 1   account_id  1056320 non-null  int64  
 2   date        1056320 non-null  int64  
 3   type        1056320 non-null  object 
 4   amount      1056320 non-null  float64
 5   bank        273508 non-null   object 
 6   account     295389 non-null   float64
dtypes: float64(2), int64(3), object(2)
memory usage: 56.4+ MB


Note that `account`, which denotes the destination account of the transaction, has many null values: only 295,389 of the 1,056,320 rows (about 28%) are non-null. The amount of usable data is therefore much smaller than the original dataset, since only rows with a non-null destination account can be used.
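
As a quick check before dropping rows, the share of missing destination accounts can be measured directly. The snippet below is a minimal sketch on a hand-made sample; the variable names `missing_share` and `usable_rows` are illustrative.

```python
import pandas as pd

# Hand-made sample: three of four rows lack a destination account.
sample = pd.DataFrame({
    "trans_id": [695247, 637742, 171812, 207264],
    "account": [float("nan"), 62457513.0, float("nan"), float("nan")],
})

# Fraction of rows with a missing destination account.
missing_share = sample["account"].isna().mean()

# Number of rows that remain usable after dropping the missing ones.
usable_rows = int(sample["account"].notna().sum())

print(f"Missing share: {missing_share:.0%}, usable rows: {usable_rows}")
```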

In [12]:
transactions = transactions.dropna(subset = ['account'])

transactions.head()

Unnamed: 0,trans_id,account_id,date,type,amount,bank,account
15,637742,2177,930105,PRIJEM,5123.0,YZ,62457513.0
17,232961,793,930105,PRIJEM,3401.0,IJ,6149286.0
21,542216,1844,930107,PRIJEM,3242.0,ST,42988401.0
24,579374,1972,930107,PRIJEM,5298.0,UV,14132887.0
46,1049882,3592,930110,PRIJEM,6007.0,MN,73166322.0
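
Because the column contained missing values when it was read, `account` is stored as `float64` and displayed with a trailing `.0`. Once the null rows are dropped, it can be cast back to an integer so that it behaves like an identifier. This is a minimal sketch on a hand-made sample, assuming every remaining value is a whole number, as in the rows shown above.

```python
import pandas as pd

# Hand-made sample: `account` is float64 because one value is missing.
sample = pd.DataFrame({
    "account_id": [2177, 793, 576],
    "account": [62457513.0, 6149286.0, float("nan")],
})

# Drop rows without a destination account, then cast the identifier to int64.
clean = sample.dropna(subset=["account"]).copy()
clean["account"] = clean["account"].astype("int64")

print(clean.dtypes)
```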


In [15]:
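# Number of distinct origin and destination accounts
# (double-quoted f-strings so the code also runs on Python versions before 3.12)
print(f"Number of distinct origin accounts: {transactions['account_id'].nunique()}\n")
print(f"Number of distinct destination accounts: {transactions['account'].nunique()}\n")
# Caveat: `+` adds the two columns element-wise, so this counts distinct sums,
# not the distinct account IDs across both columns.
print(f"Total number of nodes (unique bank accounts): {(transactions['account_id'] + transactions['account']).nunique()}\n")
print(f"Total number of edges (transactions): {len(transactions)}")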

Number of distinct origin accounts: 3941

Number of distinct destination accounts: 7665

Total number of nodes (unique bank accounts): 9084

Total number of edges (transactions): 295389
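
Note that `transactions['account_id'] + transactions['account']` is an element-wise sum, so calling `.nunique()` on it counts distinct per-row sums rather than distinct accounts. A minimal sketch of counting the distinct identifiers across both columns is shown below on a hand-made sample. It assumes that `account_id` and `account` share the same numbering scheme, which the dataset documentation does not confirm; if they do not, the two distinct counts should simply be added.

```python
import pandas as pd

# Hand-made sample: IDs 2 and 3 appear in both columns.
sample = pd.DataFrame({"account_id": [1, 2, 3], "account": [4.0, 3.0, 2.0]})

# Element-wise sum: 1+4, 2+3, 3+2 are all 5, so only one distinct value.
distinct_sums = (sample["account_id"] + sample["account"]).nunique()

# Stack both columns into one Series and count the distinct IDs: {1, 2, 3, 4}.
distinct_nodes = pd.concat([sample["account_id"], sample["account"]]).nunique()

print(distinct_sums, distinct_nodes)
```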
