## **Fraud Detection in Banking Systems**
## **Ray machine learning, tuning, and Kubeflow (KServe) serving experiment**

Fraudulent behavior can be seen across many different fields such as e-commerce, healthcare, payment and banking systems. Fraud is a billion-dollar business and it is increasing every year. The PwC global economic crime survey of 2018 found that about half (49 percent) of the 7,200 companies surveyed had experienced fraud of some kind.

Even though fraud is a serious threat to businesses, it can be detected using intelligent systems such as rules engines or machine learning. A rules engine is a software system that executes one or more business rules in a runtime production environment. These rules are generally written by domain experts, who transfer their knowledge of the problem to the rules engine and from there to production. Two examples of fraud detection rules are limiting the number of transactions in a time period (velocity rules) and denying transactions that come from previously known fraudulent IPs and/or domains.
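The two rule types just mentioned can be sketched in a few lines of Python. This is only an illustration: the class name `RulesEngine`, the denylisted address, and the limit of 3 transactions per 60 seconds are assumptions made for the example, not values taken from the experiment.

```python
from collections import defaultdict, deque

# Hypothetical values chosen for illustration only.
DENYLISTED_IPS = {"203.0.113.7"}
MAX_TX_PER_WINDOW = 3
WINDOW_SECONDS = 60


class RulesEngine:
    """Checks each transaction against an IP denylist rule and a velocity rule."""

    def __init__(self):
        # Recent transaction timestamps per customer, oldest first.
        self._recent = defaultdict(deque)

    def evaluate(self, customer, ip, timestamp):
        """Return the names of the rules that the transaction violates."""
        violations = []

        if ip in DENYLISTED_IPS:
            violations.append("denylisted_ip")

        window = self._recent[customer]
        # Drop timestamps that have fallen out of the sliding window.
        while window and timestamp - window[0] >= WINDOW_SECONDS:
            window.popleft()
        window.append(timestamp)
        if len(window) > MAX_TX_PER_WINDOW:
            violations.append("velocity")

        return violations


engine = RulesEngine()
# Four transactions from the same customer within 60 seconds.
results = [engine.evaluate("C1", "198.51.100.1", t) for t in (0, 10, 20, 30)]
print(results)
print(engine.evaluate("C2", "203.0.113.7", 5))
```

Only the fourth transaction of customer `C1` trips the velocity rule, while the single transaction of `C2` is rejected because of its source address.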

Rules are great for detecting some types of fraud, but because they rely on predefined threshold values they can produce many false positives or false negatives. For example, consider a rule that denies any transaction above 10000 dollars for a specific user. An experienced fraudster may be aware that the system has such a threshold and can simply make a transaction just below it (9999 dollars).

For these types of problems ML helps to reduce both the risk of fraud and the risk of the business losing money. By combining rules with machine learning, fraud detection becomes more precise and confident.
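A minimal sketch of such a combination, assuming the 10000 dollar rule from the example above and a model that returns a fraud probability. The function names and the 0.8 score cut-off are hypothetical; the probabilities are passed in by hand as a stand-in for a trained classifier.

```python
AMOUNT_THRESHOLD = 10_000  # fixed rule threshold from the example above
SCORE_THRESHOLD = 0.8      # hypothetical cut-off for the model's fraud probability


def rule_flags(amount):
    """Static rule: flag any transaction above the amount threshold."""
    return amount > AMOUNT_THRESHOLD


def decide(amount, fraud_probability):
    """Combine the rule with a model score; either one can block the transaction."""
    if rule_flags(amount):
        return "deny (rule)"
    if fraud_probability >= SCORE_THRESHOLD:
        return "deny (model)"
    return "allow"


print(decide(12_000, 0.10))  # caught by the rule
print(decide(9_999, 0.93))   # slips under the rule, caught by the model score
print(decide(45, 0.02))      # passes both checks
```

The second call is the interesting one: the amount sits just under the rule threshold, so only the model score stops it.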


We detect the fraudulent transactions from the Banksim dataset. This synthetically generated dataset consists of payments from various customers made in different time periods and with different amounts.

Here is what we'll do in this kernel:
1. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
2. [Install Required Prerequisites Packages](#Install-Required-Prerequisites-Packages)
3. [Data Preprocessing](#Data-Preprocessing)
4. [XGBoost Classifier](#XGBoost-Classifier)
5. [Logistic Regression Classifier](#Logistic-Regression-Classifier)
6. [ASHAScheduler](#ASHAScheduler)
7. [accuracy_score](#accuracy_score)
8. [precision_score](#precision_score)
9. [recall_score](#recall_score)
10. [f1_score](#f1_score)
11. [roc_auc_score](#roc_auc_score)
12. [Conclusion](#Conclusion)

## **Exploratory Data Analysis**

Here, we will perform an EDA on the data and try to gain some insight from it.

**Data**
The dataset has 9 feature columns and a target column.
The feature columns are:
* **Step**: This feature represents the day from the start of the simulation. It has 180 steps, so the simulation ran for virtually 6 months.
* **Customer**: The customer's id
* **zipcodeOri**: The zip code of origin/source
* **Merchant**: The merchant's id
* **zipMerchant**: The merchant's zip code
* **Age**: Categorized age
    * 0: <= 18,
    * 1: 19-25,
    * 2: 26-35,
    * 3: 36-45,
    * 4: 46-55,
    * 5: 56-65,
    * 6: > 65
    * U: Unknown
* **Gender**: Gender of the customer
     * E: Enterprise,
     * F: Female,
     * M: Male,
     * U: Unknown
* **Category**: Category of the purchase. I won't write all categories here, we'll see them later in the analysis.
* **Amount**: Amount of the purchase
* **Fraud**: Target variable that shows whether the transaction is fraudulent (1) or benign (0)
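A first look at columns like these usually starts with the class balance and a per-category summary. The sketch below does this on a tiny made-up frame that only mimics the `category`, `amount` and `fraud` columns; the values are invented for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Toy rows that only mimic the column layout; the values are made up.
toy = pd.DataFrame({
    "category": ["es_transportation", "es_transportation", "es_health",
                 "es_health", "es_travel", "es_travel"],
    "amount": [20.0, 30.0, 100.0, 300.0, 2000.0, 4000.0],
    "fraud": [0, 0, 0, 1, 1, 1],
})

# Share of fraudulent rows overall: a first look at class imbalance.
fraud_rate = toy["fraud"].mean()

# Mean amount and fraud percentage per category.
summary = toy.groupby("category").agg(
    mean_amount=("amount", "mean"),
    fraud_percent=("fraud", lambda s: s.mean() * 100),
)

print(f"overall fraud rate: {fraud_rate:.2f}")
print(summary)
```

The same two calls work unchanged on the full frame once it has been loaded with `pd.read_csv`.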

## **Experiment UML Sequence Diagram**
In this kernel experiment, I have represented the end-to-end flow as a sequence diagram. A sequence diagram is a type of interaction diagram: it describes how, and in what order, a group of underlying services and objects work together, in this case to deliver a robust, real-time fraud detection classification workflow.

## **The steps are as follows:**
1. Initialize the Jupyter notebook service.
2. Import all required packages and user-defined functions, and initialize global objects.
3. Fetch the dataset from object storage.
4. Once the dataset has loaded successfully, pre-process and validate the data, then define the required variables.
5. Once the global variables are available, start model training and pick the model with the best satisfactory score.
6. Based on the satisfactory score, deploy the model with KServe on top of the Kubernetes cluster.

<style>
.center {
  display: block;
  margin-left: auto;
  margin-right: auto;
  width: 80%;
}
</style>
<p><big><center> End to End Ray Experiment UML Sequence Diagram </center></big></p>
<img src="./images/ray-fraud-detection-on-banking.png" alt="ray-fraud-detection-on-banking" class="center"> 

In [1]:
from imblearn.over_sampling import SMOTE
import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import xgboost as xgb
import matplotlib.pyplot as plt
from minio import Minio
import urllib3
import uuid
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
import time
import pickle



In [2]:
@ray.remote(num_cpus=3)
class RayFraudDetectionExperiment:
    def __init__(self):
        self.data = None
        self.preprocessed_data = None
        self.models = []
        self._s3_host_endpoint = "home.hpe-staging-ezaf.com:31900"
        self._s3_access_key = "minioadmin"
        self._s3_secret_key = "minioadmin"
        self._s3_bucket_name = "experiments"
        self._s3_bucket_path = "ray/pickels/{0}/model/model.pkl"
        self._raypath = "/home/ray/ray_results"
        
            
    def load_data(self):
        # Load the dataset (source/feed.csv) from the MinIO object store
        MINIO_CLIENT_DATASET = Minio(
            endpoint=self._s3_host_endpoint,
            access_key=self._s3_access_key,
            secret_key=self._s3_secret_key,
            secure=True,
            # TLS is used, but certificate verification is switched off for this endpoint
            http_client=urllib3.PoolManager(cert_reqs='CERT_NONE'))
        print("MINIO_CLIENT", MINIO_CLIENT_DATASET)
        print("=============:Fetching", "."*10)

        buckets = MINIO_CLIENT_DATASET.list_buckets()
        for bucket in buckets:
            print("=============:", bucket.name, bucket.creation_date)

        csv_file = MINIO_CLIENT_DATASET.get_object(self._s3_bucket_name, "/source/feed.csv")
        self.data = pd.read_csv(csv_file)
    
    def preprocess_data(self):
        # Implement data preprocessing steps here
        preprocessed_data = self.data.copy()
        # Remove rows with missing values
        preprocessed_data = preprocessed_data.dropna()  
        # Remove duplicate rows
        preprocessed_data = preprocessed_data.drop_duplicates()  
        
        # Reset the index
        preprocessed_data = preprocessed_data.reset_index(drop=True)  
        # Additional preprocessing steps based on specific requirements
        # ... we can add here
        print("=============:after preprocess:", preprocessed_data)
        self.preprocessed_data = preprocessed_data

    def data_splitting(self):
        df_fraud = self.preprocessed_data.loc[self.preprocessed_data.fraud == 1]
        df_non_fraud = self.preprocessed_data.loc[self.preprocessed_data.fraud == 0]
        # Mean amount per category for fraudulent and non-fraudulent rows, plus the fraud percentage
        category_summary = pd.concat(
            [df_fraud.groupby('category')['amount'].mean(),
             df_non_fraud.groupby('category')['amount'].mean(),
             self.preprocessed_data.groupby('category')['fraud'].mean() * 100],
            keys=["Fraudulent", "Non-Fraudulent", "Percent(%)"], axis=1,
            sort=False).sort_values(by=['Non-Fraudulent'])
        print("=============:category summary:\n", category_summary)

        # Drop the zip code columns before modelling
        data_reduced = self.preprocessed_data.drop(['zipcodeOri', 'zipMerchant'], axis=1)

        # turning object columns type to categorical for easing the transformation process
        col_categorical = data_reduced.select_dtypes(include=['object']).columns
        for col in col_categorical:
            data_reduced[col] = data_reduced[col].astype('category')
        # categorical values ==> numeric values
        data_reduced[col_categorical] = data_reduced[col_categorical].apply(lambda x: x.cat.codes)

        # Separate the features from the target
        X = data_reduced.drop(['fraud'], axis=1)
        y = data_reduced['fraud']
        print(X.head(), "\n")
        print(y.head())

        # Oversampling with SMOTE (Synthetic Minority Oversampling Technique) balances the dataset,
        # so the resampled data has the same number of instances of each class (1 and 0).
        # Note: resampling happens before the split, so the test set also contains synthetic rows.
        sm = SMOTE(random_state=42)
        X_res, y_res = sm.fit_resample(X, y)

        # Split the data into train and test sets
        # I won't do cross validation since we have a lot of instances
        X_train, X_test, y_train, y_test = train_test_split(
            X_res, y_res, test_size=0.3, random_state=42, shuffle=True, stratify=y_res)
        print("=============:AFTER SPLIT", X_train.shape, X_test.shape, y_train.shape, y_test.shape)

        return X_train, X_test, y_train, y_test

    def train_models(self, config, checkpoint_dir=None):
        X_train, X_test, y_train, y_test = self.data_splitting()
        print("X_train, X_test, y_train, y_test :::" , X_train, X_test, y_train, y_test)
        model = None
        model_name = config["model"]
        
        if model_name == "logisticregression":
            model = LogisticRegression(
                C=config.get("C", 1.0),
                max_iter=config.get("max_iter", 999),
                solver=config.get("solver", "lbfgs"),
            )
        elif model_name == "xgboost":
            config["verbosity"]=1
            model = xgb.XGBClassifier(
                silent=config.get("silent",None), 
                seed=config.get("seed",42),
                colsample_bynode=config.get("colsample_bynode",1), 
                max_depth=config.get("max_depth",6), 
                learning_rate=config.get("learning_rate",0.05), 
                n_estimators=config.get("n_estimators",400), 
                objective=config.get("objective","binary:hinge"), 
                booster=config.get("booster","gbtree"),
                missing=config.get("missing",1), 
                n_jobs=config.get("n_jobs",-1), 
                nthread=config.get("nthread",None), 
                gamma=config.get("gamma",0), 
                min_child_weight=config.get("min_child_weight",1), 
                max_delta_step=config.get("max_delta_step",0),
                subsample=config.get("subsample",1), 
                colsample_bytree=config.get("colsample_bytree",1), 
                colsample_bylevel=config.get("colsample_bylevel",1), 
                reg_alpha=config.get("reg_alpha",0),
                reg_lambda=config.get("reg_lambda",1), 
                base_score=config.get("base_score",0.5), 
                random_state=config.get("random_state",42), 
                verbosity=config.get("verbosity", 999))

        if model:
            # Evaluate the model
            print("=============:avi: system reached phase-3")
            model.fit(X_train, y_train)
            self.models.append(model)
            print("===========:", self.models)
            y_pred = model.predict(X_test)
            accuracy = accuracy_score(y_test, y_pred)
            precision = precision_score(y_test, y_pred)
            recall = recall_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred)
            auc_roc = roc_auc_score(y_test, y_pred)
            print("=============:accuracy:", accuracy)
            print("=============:precision:", precision)
            print("=============:recall:", recall)
            print("=============:f1:", f1)
            print("=============:auc_roc:", auc_roc)
            print("=============:avi: system reached final phase-4")
            
            # Upload the trained model to the object store (ray-cluster --> s3/minio)
            print("Model path========>", f"{model_name}.pkl")
            client = Minio(
                endpoint=self._s3_host_endpoint,
                access_key=self._s3_access_key,
                secret_key=self._s3_secret_key,
                secure=True,
                http_client=urllib3.PoolManager(cert_reqs='CERT_NONE'))

            print("MINIO_CLIENT", client)

            # Both model types are pickled and uploaded the same way
            if model_name in ("logisticregression", "xgboost"):
                run_model_path = f"{self._raypath}/model.pkl"
                with open(run_model_path, "wb") as model_file:
                    pickle.dump(model, model_file)
                client.fput_object(
                    bucket_name=self._s3_bucket_name,
                    object_name=self._s3_bucket_path.format(model_name),
                    file_path=run_model_path)
                print(f"The fraud identification model {run_model_path} classifier finalized model upload completed!")

            tune.report(mean_accuracy=accuracy, mean_precision=precision, mean_recall=recall,
                        mean_f1=f1, mean_auc_roc=auc_roc)
    
    
    def run_experiment(self,index):
        try:
            print("=============:avi: system reached phase-2")
            self.load_data()
            self.preprocess_data()
            if index == 0:
                model = tune.choice(["logisticregression"])
                config = {
                    "model": model, "max_depth": tune.choice([8]), "C": tune.loguniform(0.01, 10),
                    "solver": tune.choice(["lbfgs"]), "max_iter": tune.choice([999]),
                }
            else:
                model = tune.choice(["xgboost"])
                config = {
                    "model": model, "estimators": tune.choice([100]), "max_depth": tune.choice([8]),
                    "C": tune.loguniform(0.01, 10), "solver": tune.choice(["lbfgs", "liblinear"]),
                    "kernel": tune.choice(["linear", "rbf"]), "random_state": tune.choice([42]),
                    "verbose": tune.choice([1]), "class_weight": tune.choice(["balanced"]),
                    "max_iter": tune.choice([999]), "silent": tune.choice([None]), "seed": tune.choice([42]),
                    "colsample_bynode": tune.choice([1]), "learning_rate": tune.choice([0.05]),
                    "n_estimators": tune.choice([400]), "objective": tune.choice(["binary:hinge"]),
                    "booster": tune.choice(["gbtree"]), "missing": tune.choice([1]), "n_jobs": tune.choice([-1]),
                    "nthread": tune.choice([None]), "gamma": tune.choice([0]), "min_child_weight": tune.choice([1]),
                    "max_delta_step": tune.choice([0]), "subsample": tune.choice([1]),
                    "colsample_bytree": tune.choice([1]), "colsample_bylevel": tune.choice([1]),
                    "reg_alpha": tune.choice([0]), "reg_lambda": tune.choice([1]),
                    "base_score": tune.choice([0.5]), "verbosity": tune.choice([999]),
                }

            analysis = tune.run(
                self.train_models,
                config=config,
                resources_per_trial={"cpu": 3},
                metric="mean_accuracy",
                mode="max",
                num_samples=1,
                reuse_actors=True,
                stop={
                    "mean_accuracy": 0.50, 
                    "training_iteration": 1},
                scheduler=ASHAScheduler(max_t=10)
            )


            best_config = analysis.get_best_config(metric="mean_accuracy", mode="max")
            print("Best Configuration:", best_config)


        except Exception as e:
            # Exception handling
            print("An error occurred:", str(e))
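
In `data_splitting`, SMOTE is applied to the whole dataset before `train_test_split`, so synthetic rows can end up in the test set. A common alternative is to hold out the test set first and oversample only the training part. The sketch below illustrates that order with a simplified, dependency-free SMOTE-style interpolation; `smote_like` is a hypothetical helper written for this example and is not the `imblearn` implementation, and the data points are made up.

```python
import random


def smote_like(rows, labels, minority=1, k=2, rng=None):
    """Add synthetic minority rows by interpolating towards a near minority neighbour."""
    if rng is None:
        rng = random.Random(0)
    minority_rows = [r for r, lab in zip(rows, labels) if lab == minority]
    majority_count = len(rows) - len(minority_rows)
    new_rows, new_labels = list(rows), list(labels)
    if len(minority_rows) < 2:
        return new_rows, new_labels  # nothing to interpolate between
    for _ in range(max(majority_count - len(minority_rows), 0)):
        base = rng.choice(minority_rows)
        # k nearest minority neighbours of the base point (squared Euclidean distance).
        neighbours = sorted(
            (r for r in minority_rows if r is not base),
            key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        # The new point lies on the segment between the base point and its neighbour.
        new_rows.append([a + gap * (b - a) for a, b in zip(base, other)])
        new_labels.append(minority)
    return new_rows, new_labels


rng = random.Random(42)
# Made-up 2-D points: 20 legitimate rows (label 0) and 4 fraudulent rows (label 1).
X = [[rng.random(), rng.random()] for _ in range(20)] + \
    [[5 + rng.random(), 5 + rng.random()] for _ in range(4)]
y = [0] * 20 + [1] * 4

# 1) Hold out a test set first (every fourth row, to keep the sketch deterministic).
test_idx = set(range(0, len(X), 4))
X_train = [row for i, row in enumerate(X) if i not in test_idx]
y_train = [lab for i, lab in enumerate(y) if i not in test_idx]
X_test = [row for i, row in enumerate(X) if i in test_idx]
y_test = [lab for i, lab in enumerate(y) if i in test_idx]

# 2) Oversample the training split only; the test split keeps its original balance.
X_bal, y_bal = smote_like(X_train, y_train, rng=rng)

print("train before:", y_train.count(0), y_train.count(1))
print("train after: ", y_bal.count(0), y_bal.count(1))
print("test (untouched):", y_test.count(0), y_test.count(1))
```

With this order, the scores measured on the test split reflect the original class imbalance rather than the balanced, partly synthetic one.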

In [3]:
def main_experiment():
    # Start timer
    start_time = time.time()
    ray.init(address="ray://kuberay-head-svc.kuberay:10001", 
             runtime_env={
                #"env_vars":{"http_proxy": "http://10.78.90.46:80", "https_proxy": "http://10.78.90.46:80"} #needed for LR1 network
             }
    )

    print("=============:", ray.cluster_resources())

    # Create the remote RayFraudDetectionExperiment actors
    model_loop = ["logisticregression", "xgboost"]
    
    # Run the experiments one after another: ray.get blocks until each actor has finished
    for index in range(len(model_loop)):
        fraud_detection_experiments = RayFraudDetectionExperiment.remote()
        ray.get([fraud_detection_experiments.run_experiment.remote(index)])

    # Stop timer
    end_time = time.time()
    execution_time = end_time - start_time
    print("=============: Execution Time:", execution_time, "seconds")
    ray.shutdown()

In [4]:
if __name__ == "__main__":
    main_experiment()

(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) MINIO_CLIENT <minio.api.Minio object at 0x7f2f1ad2c280>

(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) 0          0  C1093826151   4  ...  es_transportation    4.55     0
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) 1          0   C352968107   2  ...  es_transportation   39.68     0
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) 2          0  C2054744914   4  ...  es_transportation   26.89     0
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) 3          0  C1760612790   3  ...  es_transportation   17.25     0
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) 4          0   C757503768   5  ...  es_transportation   35.72     0
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) ...      ...          ...  ..  ...                ...     ...   ...
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) 594638   179  C1753498738   3  ...  es_transportation   20.53     0
(RayFraudDetectionExperiment pid=7760, 

(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) 2023-07-07 07:05:39,947 INFO registry.py:96 -- Detected unknown callable for trainable. Converting to class.
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46)
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) from ray.air import session
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46)
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) def train(config):
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46)     # ...
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46)     session.report({"metric": metric}, checkpoint=checkpoint)
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46)
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) For more information please see https://docs.ray.io/en/master/tune/api_docs/trainable.html
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46)

(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) == Status ==
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) Current time: 2023-07-07 07:05:49 (running for 00:00:07.51)
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) Memory usage on this node: 14.7/123.8 GiB
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) Using AsyncHyperBand: num_stopped=0
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) Bracket: Iter 4.000: None | Iter 1.000: None
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/14.9 GiB heap, 0.0/4.38 GiB objects
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) Result logdir: /home/ray/ray_results/train_models_2023-07-07_07-05-41
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) Number of trials: 1/1 (1 PENDING)
(RayFraudDetectionExperiment pid=7760, ip=10.244.3.46) +-----

(train_models pid=4882, ip=10.244.3.50)    step  customer  age  gender  merchant  category  amount
(train_models pid=4882, ip=10.244.3.50) 0     0       210    4       2        30        12    4.55
(train_models pid=4882, ip=10.244.3.50) 1     0      2753    2       2        30        12   39.68
(train_models pid=4882, ip=10.244.3.50) 2     0      2285    4       1        18        12   26.89
(train_models pid=4882, ip=10.244.3.50) 3     0      1650    3       2        30        12   17.25
(train_models pid=4882, ip=10.244.3.50) 4     0      3585    5       2        30        12   35.72
(train_models pid=4882, ip=10.244.3.50)
(train_models pid=4882, ip=10.244.3.50) 0    0
(train_models pid=4882, ip=10.244.3.50) 1    0
(train_models pid=4882, ip=10.244.3.50) 2    0
(train_models pid=4882, ip=10.244.3.50) 3    0
(train_models pid=4882, 

(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) 0          0  C1093826151   4  ...  es_transportation    4.55     0
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) 1          0   C352968107   2  ...  es_transportation   39.68     0
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) 2          0  C2054744914   4  ...  es_transportation   26.89     0
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) 3          0  C1760612790   3  ...  es_transportation   17.25     0
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) 4          0   C757503768   5  ...  es_transportation   35.72     0
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) ...      ...          ...  ..  ...                ...     ...   ...
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) 594638   179  C1753498738   3  ...  es_transportation   20.53     0
(RayFraudDetectionExperiment pid=4757, 

(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) 2023-07-07 07:06:37,293 INFO registry.py:96 -- Detected unknown callable for trainable. Converting to class.
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52)
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) from ray.air import session
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52)
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) def train(config):
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52)     # ...
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52)     session.report({"metric": metric}, checkpoint=checkpoint)
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52)
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) For more information please see https://docs.ray.io/en/master/tune/api_docs/trainable.html
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52)

(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) == Status ==
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) Current time: 2023-07-07 07:06:48 (running for 00:00:09.49)
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) Memory usage on this node: 15.8/123.8 GiB
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) Using AsyncHyperBand: num_stopped=0
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) Bracket: Iter 4.000: None | Iter 1.000: None
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) Resources requested: 3.0/9 CPUs, 0/0 GPUs, 0.0/29.8 GiB heap, 0.0/8.8 GiB objects
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) Result logdir: /home/ray/ray_results/train_models_2023-07-07_07-06-39
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) Number of trials: 1/1 (1 RUNNING)
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) +----

(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) == Status ==
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) Current time: 2023-07-07 07:13:42 (running for 00:07:03.61)
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) Memory usage on this node: 16.2/123.8 GiB
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) Using AsyncHyperBand: num_stopped=0
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) Bracket: Iter 4.000: None | Iter 1.000: 0.9939256552405055
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) Resources requested: 0/9 CPUs, 0/0 GPUs, 0.0/29.8 GiB heap, 0.0/8.8 GiB objects
(RayFraudDetectionExperiment pid=4757, ip=10.244.3.52) Current best trial: 7dfa5_00000 with mean_accuracy=0.9939256552405055 and parameters={'model': 'xgboost', 'estimators': 100, 'max_depth': 8, 'C': 0.022323878349504816, 'solver': 'lbfgs', 'kernel': 'rbf', 'random_state': 42, 'verbose'

## **Configuration**

## Note
1. Experiment configuration details.

In [5]:
import json
configuration_dtl = {
    "NAMESPACE":"hpedemo-user01",
    "MINIO_HOST_URL":"home.hpe-staging-ezaf.com:31900",
    "MINIO_ACCESS_KEY":"minioadmin",
    "MINIO_SECRET_KEY":"minioadmin",
    "SOURCE_PATH":"/source/feed.csv",
    "GENERATED_PATH":"/source/generated-data.csv",
    "KUBEFLOW_HOSTURL":"kubeflow.hpe-staging-ezaf.com",
    "KSERVE_DNS_MODEL_SERVING_NAME":"ray-fraud-detection-lr-8062023-3-predictor-default",
    "KSERVE_MODEL_NAME":"ray-fraud-detection-lr-8062023-3",
    "service_account_name": "fraud-detection-kserver-service",  
    "storage_uri": "s3://experiments/ray/pickels/logisticregression/model", 
    "protocol_version": "v2",
    "bucket_name": "experiments",
    "EZAF_ENV":"hpe-staging-ezaf",
    "token_url":"https://keycloak.{0}.com/realms/UA/protocol/openid-connect/token"
}

print("configuration_dtl :: ===== ::", json.loads(json.dumps(configuration_dtl)))

configuration_dtl :: ===== :: {'NAMESPACE': 'hpedemo-user01', 'MINIO_HOST_URL': 'home.hpe-staging-ezaf.com:31900', 'MINIO_ACCESS_KEY': 'minioadmin', 'MINIO_SECRET_KEY': 'minioadmin', 'SOURCE_PATH': '/source/feed.csv', 'GENERATED_PATH': '/source/generated-data.csv', 'KUBEFLOW_HOSTURL': 'kubeflow.hpe-staging-ezaf.com', 'KSERVE_DNS_MODEL_SERVING_NAME': 'ray-fraud-detection-lr-8062023-3-predictor-default', 'KSERVE_MODEL_NAME': 'ray-fraud-detection-lr-8062023-3', 'service_account_name': 'fraud-detection-kserver-service', 'storage_uri': 's3://experiments/ray/pickels/logisticregression/model', 'protocol_version': 'v2', 'bucket_name': 'experiments', 'EZAF_ENV': 'hpe-staging-ezaf', 'token_url': 'https://keycloak.{0}.com/realms/UA/protocol/openid-connect/token'}
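
The `protocol_version` of `v2` configured above means the served model is queried with V2 (Open Inference Protocol) request bodies, where each input tensor carries a name, shape, datatype and flat data list. The sketch below shows one way to build such a body and read the predictions back, assuming a single 7-feature row. The helper names are hypothetical, and the response string is a hand-written stand-in that only imitates the `outputs[0]["data"]` layout, not a captured server reply.

```python
import json


def build_infer_request(name, row):
    """Wrap one feature row in a V2-style inference request body."""
    return {
        "inputs": [{
            "name": name,
            "shape": [1, len(row)],
            "datatype": "FP32",
            "data": [float(v) for v in row],
        }]
    }


def parse_infer_response(body):
    """Extract the list of predictions from a V2-style response body (JSON text)."""
    outputs = json.loads(body)["outputs"]
    return outputs[0]["data"]


request = build_infer_request(
    "ray-fraud-detection-infer-001",
    [1.0, 4.0, 4.0, 0.0, 4.0, 1.0, 255.14],
)

# Hand-written stand-in for a server reply; field values are illustrative.
fake_response = '{"outputs": [{"name": "predict", "shape": [1], "datatype": "INT64", "data": [1]}]}'
predictions = parse_infer_response(fake_response)

print(json.dumps(request))
print("fraud" if predictions[0] == 1 else "non-fraud")
```

Deriving the shape from the row length avoids a mismatch between the declared shape and the data when the number of features changes.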


## **Model Deployment**

## Note
1. The secrets and the service account used below are created in the traditional way, outside this notebook.

In [7]:
#kserve
from kubernetes import client
from kserve import KServeClient
from kserve import constants
from kserve import utils
from kserve import V1beta1InferenceService
from kserve import V1beta1InferenceServiceSpec
from kserve import V1beta1PredictorSpec
from kserve import V1beta1SKLearnSpec

default_model_spec = V1beta1InferenceServiceSpec(predictor=V1beta1PredictorSpec(
    service_account_name=configuration_dtl.get('service_account_name'),
    sklearn=V1beta1SKLearnSpec(
        storage_uri=configuration_dtl.get('storage_uri'),
        protocol_version=configuration_dtl.get('protocol_version')
    )))

isvc = V1beta1InferenceService(api_version=constants.KSERVE_V1BETA1,
                          kind=constants.KSERVE_KIND,
                          metadata=client.V1ObjectMeta(name=configuration_dtl.get('KSERVE_MODEL_NAME'), namespace=configuration_dtl.get('NAMESPACE')),
                          spec=default_model_spec)

# print(isvc)
kserve = KServeClient()
kserve.create(isvc)

{'apiVersion': 'serving.kserve.io/v1beta1',
 'kind': 'InferenceService',
 'metadata': {'creationTimestamp': '2023-07-07T14:15:45Z',
  'generation': 1,
  'labels': {'modelClass': 'mlserver_sklearn.SKLearnModel'},
  'managedFields': [{'apiVersion': 'serving.kserve.io/v1beta1',
    'fieldsType': 'FieldsV1',
    'fieldsV1': {'f:spec': {'.': {},
      'f:predictor': {'.': {},
       'f:serviceAccountName': {},
       'f:sklearn': {'.': {},
        'f:name': {},
        'f:protocolVersion': {},
        'f:storageUri': {}}}}},
    'manager': 'OpenAPI-Generator',
    'operation': 'Update',
    'time': '2023-07-07T14:15:42Z'}],
  'name': 'ray-fraud-detection-lr-8062023-3',
  'namespace': 'hpedemo-user01',
  'resourceVersion': '138785832',
  'uid': '88fceaa8-22ce-4740-bfef-610f2891d9a8'},
 'spec': {'predictor': {'model': {'env': [{'name': 'MLSERVER_MODEL_NAME',
      'value': 'ray-fraud-detection-lr-8062023-3'},
     {'name': 'MLSERVER_MODEL_URI', 'value': '/mnt/models'}],
    'modelFormat': {'n

In [12]:
import time
import requests
import json

time.sleep(5)
#---------------- ::: get-generated-data-test ::: ----------------
MINIO_CLIENT_INFR = Minio(
    endpoint= configuration_dtl.get('MINIO_HOST_URL'), 
    access_key=configuration_dtl.get('MINIO_ACCESS_KEY'), 
    secret_key=configuration_dtl.get('MINIO_SECRET_KEY'),
    secure=True,
    http_client = urllib3.PoolManager(cert_reqs='CERT_NONE'))

print("MINIO_CLIENT", MINIO_CLIENT_INFR)
csv_file = MINIO_CLIENT_INFR.get_object(configuration_dtl.get('bucket_name'), configuration_dtl.get('GENERATED_PATH'))
data = pd.read_csv(csv_file)
data_reduced = data.drop(['zipcodeOri','zipMerchant'],axis=1)

# turning object columns type to categorical for easing the transformation process
# (the codes are derived from this file alone, so they match training only if the same categories are present)
col_categorical = data_reduced.select_dtypes(include= ['object']).columns
for col in col_categorical:
    data_reduced[col] = data_reduced[col].astype('category')
data_reduced[col_categorical] = data_reduced[col_categorical].apply(lambda x: x.cat.codes)

#---------------- ::: inference input ::: ----------------
# In contrast, model inference is the process of using a trained model to infer a result from live data.
X = data_reduced.drop(['fraud'], axis=1)
y = data_reduced['fraud']
print("shape==============", [len(X.values), len(X.values[0])])
print("X.values[0]==============", X.values[0], "=======",  list(X.values[0]))

inference_request = {
    "inputs" : [{
        "name" : "ray-fraud-detection-infer-001",
        "datatype": "FP32",
        # !!! Multiple record infer !!!
        # "data": X.values.tolist(),
        # "shape": [len(X.values), len(X.values[0])],
 
        # !!! One record infer !!!
        "shape": [1, 7],
        # "data": X.values[14].tolist(), # Non-fraud transaction details
        "data": X.values[17].tolist(), # Fraud transaction details
    }]
}
print("data::", inference_request)

config_data = {
    "username" : "hpedemo-user01",
    "password" : "Hpepoc@123",
    "grant_type" : "password",
    "client_id" : "ua-grant",
}

token_response = requests.post(
    configuration_dtl.get('token_url').format(configuration_dtl.get('EZAF_ENV')), 
    data=config_data, 
    allow_redirects=True, 
    verify=False)

token = token_response.json()["access_token"]
headers = {"Authorization": f"Bearer {token}"}
#print("token", token)

#---------------- ::: Trigger KServe ::: ----------------
kserve_client = KServeClient()
isvc_status = kserve_client.get(
    configuration_dtl.get('KSERVE_MODEL_NAME'),
    namespace=configuration_dtl.get('NAMESPACE'))
# Switch the scheme to https (assumes the reported predictor URL starts with http://)
server_isvc_resp = isvc_status["status"]["components"]["predictor"]["url"].replace("http://", "https://", 1)
infer_url = f"{server_isvc_resp}/v2/models/{configuration_dtl.get('KSERVE_MODEL_NAME')}/infer"
print("server_isvc_resp", server_isvc_resp)
print("inference::", infer_url)

session = requests.Session()
message = {"message": "", "value": ""}
response = session.post(
    infer_url,
    json=inference_request,
    headers=headers,
    verify=False)
if response.status_code == 200:
    predictions = response.json()["outputs"][0]["data"]
    if len(predictions) > 1:
        # Multiple records were sent: print every prediction
        print("Model-Infer-dtl:[data]:\n", predictions)
    elif predictions[0] == 1:
        message['message'] = "Fraud Banking Transaction !"
        message['value'] = predictions[0]
        print('\033[91m' "Prediction Result:", json.dumps(message))
    else:
        message['message'] = "Non-fraud Banking Transaction !"
        message['value'] = predictions[0]
        print('\033[92m' "Prediction Result:", json.dumps(message))
else:
    print("service issue::", response.status_code)
    print("service issue::", response.content)

MINIO_CLIENT <minio.api.Minio object at 0x7f54b33580a0>
data:: {'inputs': [{'name': 'ray-fraud-detection-infer-001', 'datatype': 'FP32', 'shape': [1, 7], 'data': [1.0, 4.0, 4.0, 0.0, 4.0, 1.0, 255.14]}]}
server_isvc_resp https://ray-fraud-detection-lr-8062023-3-predictor-default.hpedemo-user01.hpe-staging-ezaf.com
inference:: https://ray-fraud-detection-lr-8062023-3-predictor-default.hpedemo-user01.hpe-staging-ezaf.com/v2/models/ray-fraud-detection-lr-8062023-3/infer
Prediction Result: {"message": "Fraud Banking Transaction !", "value": 1}


## **Conclusion**
In this kernel we performed fraud detection on bank payment data with two classifiers; the logged XGBoost trial reached a mean accuracy of about 0.994 on the SMOTE-balanced test split. I haven't put the classification results without SMOTE here, but I added them to my GitHub repo earlier, so if you are interested in comparing both results you can check the repo.
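
When comparing results with and without SMOTE, it helps to read accuracy together with precision, recall and F1, because accuracy alone can look high on imbalanced data. The sketch below computes these scores from confusion counts in plain Python on made-up labels (2 fraudulent rows out of 100); the helper names are hypothetical and the numbers are not results from this experiment.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def scores(y_true, y_pred):
    """Accuracy, precision, recall and F1, with 0.0 used when a ratio is undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 2 fraudulent and 98 legitimate transactions.
y_true = [1, 1] + [0] * 98

# A model that never predicts fraud still reaches 98% accuracy.
naive = scores(y_true, [0] * 100)

# A model that catches one fraud and raises one false alarm has the same accuracy.
better = scores(y_true, [1, 0, 1] + [0] * 97)

print(naive)
print(better)
```

Both models score the same accuracy here, and only recall and precision show that the first one never detects a fraudulent transaction.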

Thanks for taking the time to read or just view the results from my first kernel; I hope you enjoyed it. I would be grateful for any kind of critique, suggestion or comment, and I wish you a great day with lots of beautiful data!

## **References**

1. Lavion, Didier; et al., "PwC's Global Economic Crime and Fraud Survey 2022", PwC.com.
    * https://www.pwc.com/gx/en/services/forensics/economic-crime-survey.html
    * https://www.pwc.com/gx/en/services/forensics/gecs/outcomes-of-platform-fraud.svg
    * https://www.pwc.com/gx/en/forensics/gecsm-2022/pdf/PwC%E2%80%99s-Global-Economic-Crime-and-Fraud-Survey-2022.pdf **(pdf)**
2. SMOTE: Synthetic Minority Over-sampling Technique, https://jair.org/index.php/jair/article/view/10302
3. Banksim Data Set, paper: http://www.msc-les.org/proceedings/emss/2014/EMSS2014_144.pdf **(pdf)**