From 381458aef09dd4573038ce831c777ccbcc8c4037 Mon Sep 17 00:00:00 2001
From: Maksym Zhytnikov <63515947+Maxxx-zh@users.noreply.github.com>
Date: Wed, 5 Jun 2024 16:57:18 +0300
Subject: [PATCH 1/5] Create README.md

---
 .../fraud_cheque_detection/README.md          | 72 +++++++++++++++++++
 1 file changed, 72 insertions(+)
 create mode 100644 advanced_tutorials/fraud_cheque_detection/README.md

diff --git a/advanced_tutorials/fraud_cheque_detection/README.md b/advanced_tutorials/fraud_cheque_detection/README.md
new file mode 100644
index 00000000..527c22bd
--- /dev/null
+++ b/advanced_tutorials/fraud_cheque_detection/README.md
@@ -0,0 +1,72 @@
+## 🏦 Cheque Fraud Detection
+
+### 👨🏻‍🏫 Overview
+
+The Cheque Fraud Detection project is designed to identify fraudulent cheques and provide detailed explanations for each validation. The project encompasses three main components: feature pipeline, training pipeline, and inference pipeline.
+
+### 📖 Feature Pipeline
+
+The Feature Pipeline does the following:
+
+1. **Data Loading**: The process begins with downloading the dataset into a pandas DataFrame. Initial data exploration is conducted to understand the structure and distribution of the data.
+
+2. **Feature Engineering**: Key features are extracted from the raw data, including textual corrections and matching the amount written in words with the numerical amount. These features are essential for training an effective fraud detection model.
+
+4. **Feature Group Creation**: Create a Cheque Feature Group to store your cheque data.
+
+This notebook lays the groundwork for building a robust fraud detection model by ensuring the data is accurate, consistent, and ready for the training phase.
+
+### 🏃🏻‍♂️ Training Pipeline
+
+The Training Pipeline notebook focuses on building and evaluating the fraud detection model. This phase is crucial for developing a robust classifier capable of accurately identifying fraudulent cheques. The notebook includes several important steps:
+
+1. **Data Import and Setup**: The notebook begins by importing the necessary libraries and connecting to the Hopsworks feature store. This setup phase ensures that all required resources and datasets are accessible.
+
+2. **Feature Retrieval**: Create a Feature View with selected features and retrieve a train-test split for model training.
+
+3. **Model Training**: An XGBoost classifier is trained using the retrieved features. XGBoost is chosen for its efficiency and high performance in classification tasks. The model learns to differentiate between valid and fraudulent cheques based on the provided features.
+
+4. **Model Evaluation**: The trained model is evaluated using various metrics, including accuracy, precision, recall, F1 score, and confusion matrix. These metrics provide a comprehensive understanding of the model's performance and highlight areas for potential improvement.
+
+5. **Model Saving**: The trained model is saved in the Hopsworks Model Registry.
+
+### 🚀 Inference Pipeline
+
+The Inference Pipeline notebook is designed to make predictions on new cheques and provide detailed explanations for each validation. This phase is crucial for deploying the model in a real-world scenario, where it can validate cheques and explain the reasons behind its decisions. The notebook includes the next steps:
+
+1. **Model Loading**: The trained XGBoost model is loaded from the Hopsworks Model Registry to perform inference on new data. Additionally, the Donut OCR model is loaded for text extraction from cheque images.
+
+2. **Text Parsing and Validation**: Donut OCR is used to parse text from new cheque images. This step extracts the necessary textual data, such as the written amount, for validation against numerical data.
+
+3. **Fraud Detection**: The extracted features are fed into the trained model to predict whether a cheque is fraudulent or valid. This prediction is based on the learned patterns from the training phase.
+
+
+4. **LLM Loading**: Load the `meta-llama/Meta-Llama-3-8B-Instruct` model and set up LLM Chain for text generation using Langchain.
+
+5. **Explanation Generation**: An LLM chain is used to generate detailed explanations for each validation result. This step leverages a Large Language Model to provide insights into why a cheque is considered fraudulent or valid, enhancing transparency and understanding.
+
+6. **Batch Inference**: The notebook also supports batch inference, allowing multiple cheques to be processed at once. Then validations and correcponding descriptions are saved in the **cheque_validation** Feature Group. This is useful for validating large volumes of cheques efficiently.
+
+🛠 Setup and Installation
+
+1. Clone the Repository:
+
+```
+git clone https://github.com/logicalclocks/hopsworks-tutorials.git
+cd hopsworks-tutorials/advanced_tutorials/fraud_cheque_detection
+```
+
+2. Install Dependencies:
+
+```
+pip install -r requirements.txt
+```
+
+3. Run Notebooks Sequentially:
+
+- Start with 1_feature_pipeline.ipynb to preprocess and store features.
+- Proceed with 2_training_pipeline.ipynb to train and save the model.
+- Finally, execute 3_inference_pipeline.ipynb to perform predictions and generate explanations.
+
+---
+

From 221fb8b99494759de4e68ed8e060ef18f4c6fd9f Mon Sep 17 00:00:00 2001
From: Maksym Zhytnikov <63515947+Maxxx-zh@users.noreply.github.com>
Date: Wed, 5 Jun 2024 17:01:17 +0300
Subject: [PATCH 2/5] Update README.md

---
 .../fraud_cheque_detection/README.md          | 29 +++++++++----------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/advanced_tutorials/fraud_cheque_detection/README.md b/advanced_tutorials/fraud_cheque_detection/README.md
index 527c22bd..1c743450 100644
--- a/advanced_tutorials/fraud_cheque_detection/README.md
+++ b/advanced_tutorials/fraud_cheque_detection/README.md
@@ -1,10 +1,10 @@
-## 🏦 Cheque Fraud Detection
+# 🏦 Cheque Fraud Detection
 
-### 👨🏻‍🏫 Overview
+## 👨🏻‍🏫 Overview
 
 The Cheque Fraud Detection project is designed to identify fraudulent cheques and provide detailed explanations for each validation. The project encompasses three main components: feature pipeline, training pipeline, and inference pipeline.
 
-### 📖 Feature Pipeline
+## 📖 Feature Pipeline
 
 The Feature Pipeline does the following:
 
@@ -12,42 +12,41 @@ The Feature Pipeline does the following:
 
 2. **Feature Engineering**: Key features are extracted from the raw data, including textual corrections and matching the amount written in words with the numerical amount. These features are essential for training an effective fraud detection model.
 
-4. **Feature Group Creation**: Create a Cheque Feature Group to store your cheque data.
+4. **Feature Group Creation**: Create a **Cheque Feature Group** to store your cheque data.
 
 This notebook lays the groundwork for building a robust fraud detection model by ensuring the data is accurate, consistent, and ready for the training phase.
 
-### 🏃🏻‍♂️ Training Pipeline
+## 🏃🏻‍♂️ Training Pipeline
 
 The Training Pipeline notebook focuses on building and evaluating the fraud detection model. This phase is crucial for developing a robust classifier capable of accurately identifying fraudulent cheques. The notebook includes several important steps:
 
 1. **Data Import and Setup**: The notebook begins by importing the necessary libraries and connecting to the Hopsworks feature store. This setup phase ensures that all required resources and datasets are accessible.
 
-2. **Feature Retrieval**: Create a Feature View with selected features and retrieve a train-test split for model training.
+2. **Feature Retrieval**: Create a **Feature View** with selected features and retrieve a train-test split for model training.
 
 3. **Model Training**: An XGBoost classifier is trained using the retrieved features. XGBoost is chosen for its efficiency and high performance in classification tasks. The model learns to differentiate between valid and fraudulent cheques based on the provided features.
 
 4. **Model Evaluation**: The trained model is evaluated using various metrics, including accuracy, precision, recall, F1 score, and confusion matrix. These metrics provide a comprehensive understanding of the model's performance and highlight areas for potential improvement.
 
-5. **Model Saving**: The trained model is saved in the Hopsworks Model Registry.
+5. **Model Saving**: The trained model is saved in the **Hopsworks Model Registry**.
 
-### 🚀 Inference Pipeline
+## 🚀 Inference Pipeline
 
 The Inference Pipeline notebook is designed to make predictions on new cheques and provide detailed explanations for each validation. This phase is crucial for deploying the model in a real-world scenario, where it can validate cheques and explain the reasons behind its decisions. The notebook includes the next steps:
 
-1. **Model Loading**: The trained XGBoost model is loaded from the Hopsworks Model Registry to perform inference on new data. Additionally, the Donut OCR model is loaded for text extraction from cheque images.
+1. **Model Loading**: The trained XGBoost model is loaded from the **Hopsworks Model Registry** to perform inference on new data. Additionally, the `Donut OCR model` is loaded for text extraction from cheque images.
 
-2. **Text Parsing and Validation**: Donut OCR is used to parse text from new cheque images. This step extracts the necessary textual data, such as the written amount, for validation against numerical data.
+2. **Text Parsing and Validation**: `Donut OCR` is used to parse text from new cheque images. This step extracts the necessary textual data, such as the written amount, for validation against numerical data.
 
 3. **Fraud Detection**: The extracted features are fed into the trained model to predict whether a cheque is fraudulent or valid. This prediction is based on the learned patterns from the training phase.
 
-
 4. **LLM Loading**: Load the `meta-llama/Meta-Llama-3-8B-Instruct` model and set up LLM Chain for text generation using Langchain.
 
 5. **Explanation Generation**: An LLM chain is used to generate detailed explanations for each validation result. This step leverages a Large Language Model to provide insights into why a cheque is considered fraudulent or valid, enhancing transparency and understanding.
 
 6. **Batch Inference**: The notebook also supports batch inference, allowing multiple cheques to be processed at once. Then validations and correcponding descriptions are saved in the **cheque_validation** Feature Group. This is useful for validating large volumes of cheques efficiently.
 
-🛠 Setup and Installation
+## 🛠 Setup and Installation
 
 1. Clone the Repository:
 
@@ -64,9 +63,9 @@ pip install -r requirements.txt
 
 3. Run Notebooks Sequentially:
 
-- Start with 1_feature_pipeline.ipynb to preprocess and store features.
-- Proceed with 2_training_pipeline.ipynb to train and save the model.
-- Finally, execute 3_inference_pipeline.ipynb to perform predictions and generate explanations.
+- Start with `1_feature_pipeline.ipynb` to preprocess and store features.
+- Proceed with `2_training_pipeline.ipynb` to train and save the model.
+- Finally, execute `3_inference_pipeline.ipynb` to perform predictions and generate explanations.
 
 ---
 

From 6da78048cc23006b63ce992680815dd93e7f1e67 Mon Sep 17 00:00:00 2001
From: Maksym Zhytnikov <63515947+Maxxx-zh@users.noreply.github.com>
Date: Wed, 5 Jun 2024 17:06:24 +0300
Subject: [PATCH 3/5] Typo

---
 advanced_tutorials/fraud_cheque_detection/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/advanced_tutorials/fraud_cheque_detection/README.md b/advanced_tutorials/fraud_cheque_detection/README.md
index 1c743450..efa3b87e 100644
--- a/advanced_tutorials/fraud_cheque_detection/README.md
+++ b/advanced_tutorials/fraud_cheque_detection/README.md
@@ -44,7 +44,7 @@ The Inference Pipeline notebook is designed to make predictions on new cheques a
 
 5. **Explanation Generation**: An LLM chain is used to generate detailed explanations for each validation result. This step leverages a Large Language Model to provide insights into why a cheque is considered fraudulent or valid, enhancing transparency and understanding.
 
-6. **Batch Inference**: The notebook also supports batch inference, allowing multiple cheques to be processed at once. Then validations and correcponding descriptions are saved in the **cheque_validation** Feature Group. This is useful for validating large volumes of cheques efficiently.
+6. **Batch Inference**: The notebook also supports batch inference, allowing multiple cheques to be processed at once. Then validations and corresponding descriptions are saved in the **cheque_validation** Feature Group. This is useful for validating large volumes of cheques efficiently.
 
 ## 🛠 Setup and Installation
 

From c94f7754a1ad9d53a0255944520766bba7475f83 Mon Sep 17 00:00:00 2001
From: Maksym Zhytnikov <63515947+Maxxx-zh@users.noreply.github.com>
Date: Wed, 5 Jun 2024 17:10:01 +0300
Subject: [PATCH 4/5] VideoLink

---
 advanced_tutorials/fraud_cheque_detection/README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/advanced_tutorials/fraud_cheque_detection/README.md b/advanced_tutorials/fraud_cheque_detection/README.md
index efa3b87e..b9bf90a7 100644
--- a/advanced_tutorials/fraud_cheque_detection/README.md
+++ b/advanced_tutorials/fraud_cheque_detection/README.md
@@ -4,6 +4,8 @@
 
 The Cheque Fraud Detection project is designed to identify fraudulent cheques and provide detailed explanations for each validation. The project encompasses three main components: feature pipeline, training pipeline, and inference pipeline.
 
+[Helper video describing how to implement the Cheque Fraud Detection system](https://www.youtube.com/watch?v=0H-XJ5qLFyA&t=1s)
+
 ## 📖 Feature Pipeline
 
 The Feature Pipeline does the following:

From 7d56f956ae456c02c748aa8a181ea33022bc34bc Mon Sep 17 00:00:00 2001
From: rcnnnghm
Date: Wed, 5 Jun 2024 15:24:26 +0100
Subject: [PATCH 5/5] Update README.md

Minor grammatical changes.
---
 advanced_tutorials/fraud_cheque_detection/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/advanced_tutorials/fraud_cheque_detection/README.md b/advanced_tutorials/fraud_cheque_detection/README.md
index b9bf90a7..3cc5d98b 100644
--- a/advanced_tutorials/fraud_cheque_detection/README.md
+++ b/advanced_tutorials/fraud_cheque_detection/README.md
@@ -2,7 +2,7 @@
 
 ## 👨🏻‍🏫 Overview
 
-The Cheque Fraud Detection project is designed to identify fraudulent cheques and provide detailed explanations for each validation. The project encompasses three main components: feature pipeline, training pipeline, and inference pipeline.
+The Cheque Fraud Detection project is designed to identify fraudulent cheques and provide detailed explanations for each validation. The project encompasses three main components: a feature pipeline, a training pipeline, and an inference pipeline.
 
 [Helper video describing how to implement the Cheque Fraud Detection system](https://www.youtube.com/watch?v=0H-XJ5qLFyA&t=1s)