## Online banking fraud detection using natural language processing techniques

**Team 9:**

- Latif Masud
- Wesley Mitchell
- Gerald Wagner

**Course:** AI 574 – Natural Language Processing (Spring 2025)

### Problem Statement
* This project aims to identify fraudulent activity in online banking transactions using Natural Language Processing techniques. Online banking activity can be monitored through the webpages or API endpoints a user interacts with over the course of a session. Given this sequence of user actions, a binary classifier can be trained to label the activity as valid or fraudulent; if it is fraudulent, remediation steps such as denying the transaction can then be taken. With online banking a staple of people's everyday financial lives and hundreds of millions of dollars transacted daily, identifying fraudulent activity is of utmost importance to prevent unnecessary monetary losses for both individuals and financial institutions.
    
* **Keywords:** Online banking, fraud, fraud detection, financial industry 

### Data Collection

* Source(url): https://github.com/pboulieris/FraudNLP/blob/master/Fraud%20Detection%20with%20Natural%20Language%20Processing.rar
* Short Description: A dataset of 105,303 online banking transactions with 9 transaction characteristics, including:
    * Action time mean: the average time between actions in a transaction
    * Action time std: the standard deviation of the time between actions
    * log(amount): the natural logarithm of the transaction amount
    * Transaction Type: a string indicating whether the transaction is fraudulent or not
    * time_to_first_action: the time between the start of the transaction and the first action taken
    * actions_str: a string containing the names of all actions taken in the transaction
    * total_time_to_transaction: the total time elapsed from the start of the transaction to its completion
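
As a rough illustration of how the timing and amount features listed above could be derived, the sketch below computes them for a single made-up transaction. The helper name `timing_features`, the input values, and the choice of the population standard deviation are assumptions for illustration; the dataset's authors may have computed these features differently.

```python
import math
import statistics


def timing_features(action_gaps_ms, amount):
    """Summarise one transaction (hypothetical helper, not from the dataset's code).

    action_gaps_ms: time in ms between consecutive user actions; the first
    entry is treated as the delay before the first action (an assumption).
    amount: transaction amount, which must be positive for the logarithm.
    """
    return {
        "action_time_mean": statistics.fmean(action_gaps_ms),
        "action_time_std": statistics.pstdev(action_gaps_ms),
        "log_amount": math.log(amount),  # natural logarithm
        "time_to_first_action": action_gaps_ms[0],
        "total_time_to_transaction": sum(action_gaps_ms),
    }


# made-up transaction: three actions and an amount of 250
features = timing_features([1200, 800, 1000], amount=250.0)
print(features)
```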

* Keywords: bank transactions, user actions, API endpoints, webpage urls, dollar amount

### Required packages

* The following packages are required to run this notebook:
    * pandas
    * scikit-learn

Install by creating and activating a virtual environment, then installing via the pip command:

!pip install pandas scikit-learn

### Imports

In [None]:
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

### Load the data

In [None]:
# online banking transaction data
path = Path('./data/Fraud Detection with Natural Language Processing.pkl')
df = pd.read_pickle(path)

In [None]:
# vocabulary of API calls
path_vocab = Path('./data/vocab.csv')
df_vocab = pd.read_csv(path_vocab)

### Exploratory data analysis (EDA)

In [None]:
df.shape

In [None]:
df.head(10)

In [None]:
df.describe()

In [None]:
df.dtypes

In [None]:
df.groupby('is_fraud').describe()

In [None]:
df_vocab.head()

In [None]:
df_vocab.shape

In [None]:
df_vocab.describe()

EDA Summary

- the transaction dataset contains 105303 online banking transactions
- of the 105303 transactions, 105202 are valid while only 101 are fraudulent
    - this is a severe class imbalance that will have to be accounted for when training the neural network
- there are 9 attributes for each banking transaction:
    - a label marking each transaction as valid or fraudulent (0 or 1, respectively)
    - a list of user actions encoded as integers, each corresponding to an entry in the vocabulary dataframe
    - list of times in ms for each user action to occur
    - the total elapsed time of the transaction in ms
    - Recency, Frequency, and Monetary features:
        - the transaction amount in log(Euros)
        - the device characteristics
        - the IP address of the user
        - the beneficiary's frequency of conducting a transaction
        - the applications used for the transaction (i.e., Android or iOS)
- there is also a vocabulary dataset, which contains the API endpoints/webpage URLs that a user can access
    - these are used to translate the encoded user action column of the transaction dataset back to the original URLs
    - there are 1916 endpoints/URLs in total, all of which are unique
    - the index of the dataframe corresponds to the id value used in the user action list from the transaction dataframe
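
One common way to account for the class imbalance noted above is to weight each class inversely to its frequency. The sketch below shows what such weights would look like for the class counts reported in the summary, using scikit-learn's `compute_class_weight`. This is only one possible approach and an assumption on our part; resampling or a different loss function may turn out to be a better fit for the final model.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# class counts taken from the EDA summary above
n_valid, n_fraud = 105202, 101
y = np.array([0] * n_valid + [1] * n_fraud)

# "balanced" weights are n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = {0: float(weights[0]), 1: float(weights[1])}
print(class_weight)
```

With these counts, the fraudulent class receives a weight roughly a thousand times larger than the valid class, so each misclassified fraud contributes far more to a weighted loss.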

### Data Preprocessing

* Enumerate and present the main steps you performed in the data preprocessing
* Add your code and interpret the outcome of main steps/functions

In [None]:
# dictionary mapping ids in transaction data to vocabulary
vocab = df_vocab['Name'].to_list()

vocab_sentences = []
for endpoint in vocab:
    sentence = endpoint.replace('/', ' ').lstrip() + ' .'
    vocab_sentences.append(sentence)

id_to_action = {i:a for i, a in enumerate(vocab_sentences)}
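
To make the conversion above concrete, the sketch below applies the same string transformation to a few made-up endpoint names. The endpoints and the `toy_` names are illustrative assumptions and are not taken from `vocab.csv`.

```python
def endpoint_to_sentence(endpoint):
    # replace path separators with spaces, drop the leading space, end with ' .'
    return endpoint.replace('/', ' ').lstrip() + ' .'


# hypothetical endpoints, not taken from the real vocabulary
toy_endpoints = ['/api/accounts/balance', '/login', 'transfer/confirm']
toy_sentences = [endpoint_to_sentence(e) for e in toy_endpoints]

# position in the list doubles as the integer id, as in the cell above
toy_id_to_action = dict(enumerate(toy_sentences))
print(toy_id_to_action)
```

Each endpoint becomes a short "sentence" of path segments terminated by a period, so that a whole session can later be read as a sequence of sentences.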

In [None]:
# convert the tokenized user actions during online banking to API endpoint calls
actions_raw = df['actions'].to_list()

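actions = []
for raw_actions in actions_raw:
    # each entry is a string-encoded list of ids; strip brackets and spaces, then split on commas
    id_strings = (raw_actions.replace('[', '')
                  .replace(']', '')
                  .replace(' ', '')
                  .split(','))

    # map every non-empty id to its endpoint sentence (an empty list splits into one empty string)
    action_sentences = []
    for id_str in id_strings:
        if id_str:
            action_sentences.append(id_to_action[int(id_str)])

    # one document per transaction: all endpoint sentences joined in order
    actions.append(' '.join(action_sentences))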
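
The manual string cleanup in the cell above implies that each entry of the `actions` column is a string-encoded list such as `'[0, 1, 2]'`. Under that same assumption, `ast.literal_eval` from the standard library is one alternative way to parse it. The toy vocabulary, the ids, and the helper name `decode_actions` below are invented for illustration.

```python
import ast

# hypothetical id-to-sentence mapping, standing in for the real vocabulary
toy_id_to_action = {0: 'login .', 1: 'accounts balance .', 2: 'transfer confirm .'}


def decode_actions(raw, id_to_action):
    """Turn a string-encoded list of ids into one space-joined document."""
    ids = ast.literal_eval(raw)  # safely parses literals such as '[0, 1, 2]'
    return ' '.join(id_to_action[int(i)] for i in ids)


decoded = decode_actions('[0, 1, 2]', toy_id_to_action)
empty = decode_actions('[]', toy_id_to_action)
print(decoded)
print(repr(empty))
```

An empty action list decodes to an empty string, which matches the behavior of the loop above.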

In [None]:
# create an array of labels
labels = df['is_fraud'].to_list()

print(f'there are {sum(labels)} fraudulent transactions')
print(f'which is only {sum(labels)/len(labels)*100:0.2f}% of the total transactions')

In [None]:
# separate the data into training and testing datasets
# enable the stratify option to ensure there are proportional amounts of fraudulent transactions in the training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(actions, labels, test_size=0.2, shuffle=True, stratify=labels)

In [None]:
print(sum(y_train))
print(sum(y_test))
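
The effect of the `stratify` option can be checked on a small toy dataset. In the sketch below, the labels, the document strings, and the `random_state` value are assumptions chosen only to keep the example repeatable; they are not part of the project's data or settings.

```python
from sklearn.model_selection import train_test_split

# toy data: 10 positive labels out of 1000 samples (1% positives)
toy_labels = [1] * 10 + [0] * 990
toy_docs = [f'session {i}' for i in range(len(toy_labels))]

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_docs, toy_labels,
    test_size=0.2, shuffle=True,
    stratify=toy_labels,
    random_state=0,  # fixed seed so the toy split is repeatable
)

# positives per split, and their share of each split
print(sum(y_tr), sum(y_te))
print(sum(y_tr) / len(y_tr), sum(y_te) / len(y_te))
```

Both splits keep the same share of positive labels as the full toy dataset, which is the property relied on when splitting the transaction data.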

### Methodology

1. Explain your Deep Learning process / methodology



2. Introduce the Deep Neural Networks you used in your project
 * Model 1
    * Description 
 
 * Model 2
    * Description
 
 * Ensemble method
     * Description 
 
 
3. Add keywords  
**Keywords:** natural language processing, sentiment analysis, clustering, binary classification, multi-label classification, prediction
	___
 **Example**
* ConvNet
    * A convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery (source: Wikipedia).
 
* **Keywords:** supervised learning, classification, ...

In [None]:
# TODO: Add code

### Model Fitting and Validation

1. model 1
    - description
2. model 2
    - description

In [None]:
# TODO: Add Code

### Model Evaluation 

* Examine your models (coefficients, parameters, errors, etc...)

* Compute and interpret your results in terms of accuracy, precision, recall, ROC etc. 
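
With a class imbalance as severe as the one in this dataset, accuracy alone can be misleading, so precision, recall, and ROC AUC are worth reporting together. The sketch below computes them with scikit-learn on made-up labels and scores; the numbers and the 0.5 decision threshold are assumptions for illustration, not project results.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# made-up ground truth: 95 valid (0) and 5 fraudulent (1) transactions
y_true = [0] * 95 + [1] * 5
# made-up model scores: 5 valid sessions score high, 2 frauds score low
y_score = [0.1] * 90 + [0.7] * 5 + [0.9] * 3 + [0.2] * 2
y_pred = [int(s >= 0.5) for s in y_score]  # assumed decision threshold of 0.5

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_score),
}
print(metrics)

# a trivial baseline that labels everything as valid
baseline_pred = [0] * len(y_true)
baseline = {
    "accuracy": accuracy_score(y_true, baseline_pred),
    "recall": recall_score(y_true, baseline_pred),
}
print(baseline)
```

In this toy case the trivial baseline has higher accuracy than the model while catching no fraud at all, which is why recall and precision on the fraudulent class matter more here than overall accuracy.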

In [None]:
# TODO: Add code

### Issues / Improvements
1. Dataset is very small
2. Use regularization / initialization
3. Use cross-validation
4. ...
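
As a sketch of the cross-validation item above, stratified folds keep the rare class represented in every validation split, which matters when so few fraudulent transactions are available. The toy labels, the number of folds, and the seed below are assumptions for illustration only.

```python
from sklearn.model_selection import StratifiedKFold

# toy data: 10 positive labels out of 100 samples
toy_labels = [1] * 10 + [0] * 90
toy_docs = [f'session {i}' for i in range(len(toy_labels))]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_positives = []
for train_idx, val_idx in skf.split(toy_docs, toy_labels):
    # count how many positive labels land in this validation fold
    fold_positives.append(sum(toy_labels[i] for i in val_idx))

print(fold_positives)
```

Each validation fold receives the same number of positive samples, so every fold can be scored on fraud recall rather than only on the majority class.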

###  References
   - Academic (if any)
   - Online (if any)
	

### Credits

- If you use and/or adapt code from existing projects, you must provide links and acknowledge the authors. Keep in mind that all documents in your projects and code will be checked against the official plagiarism detection tool used by Penn State ([Turnitin](https://turnitin.psu.edu)).

> *This code is based on .... (if any)*