## Credit Card Streaming Data Anomaly Detection
In this experiment, I apply anomaly detection to streaming credit card transaction data. The models are semi-online: each model is first trained on the training set, then the test set is streamed to it one instance at a time, and the model keeps updating itself as the stream arrives. The models considered are:
* IForestASD
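
To make the semi-online protocol concrete, here is a minimal sketch using a toy detector. `RunningZScoreDetector` is a hypothetical stand-in written for illustration, not IForestASD; it only mirrors the method names used later in this notebook (`fit`, `fit_score_partial`) and assumes each instance is a single number.

```python
import math


class RunningZScoreDetector:
    """Toy streaming detector (hypothetical stand-in, not IForestASD)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # Sum of squared deviations (Welford's algorithm).

    def fit_partial(self, x):
        # Update the running mean and variance with one observation.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return self

    def score_partial(self, x):
        # Absolute z-score: larger means more unusual.
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

    def fit(self, values):
        # Offline phase: consume the whole training set.
        for x in values:
            self.fit_partial(x)
        return self

    def fit_score_partial(self, x):
        # Online phase: update with the new instance, then score it.
        return self.fit_partial(x).score_partial(x)


detector = RunningZScoreDetector().fit([10, 11, 9, 10, 10, 11, 9, 10])
stream = [10, 10.5, 50, 10]
scores = [detector.fit_score_partial(x) for x in stream]
print(scores)
```

The outlier in the stream (`50`) receives the largest score, and the detector's state keeps changing after training, which is what "semi-online" means here.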

##### Make sure requirements are installed first

In [None]:
%pip install -r requirements.txt

##### Add imports

In [4]:
import numpy as np
from sklearn.utils import shuffle
from pysad.models import IForestASD
from pysad.transform.preprocessing import InstanceUnitNormScaler
from pysad.transform.postprocessing import RunningAveragePostprocessor
from pysad.utils import Data
from pysad.evaluation import AUROCMetric, PrecisionMetric, RecallMetric
from pysad.utils.array_streamer import ArrayStreamer
from tqdm import tqdm
import pandas as pd
from datetime import date, datetime

##### Utility function for extracting the label and features from the DataFrame

In [1]:
# Return the labels and the numeric features of the given dataframe
def get_label_and_features(df):
    # Convert a 'YYYY-MM-DD' date of birth to an age in whole years as of today
    def age(born):
        born = datetime.strptime(born, "%Y-%m-%d").date()
        today = date.today()
        # Subtract one year if this year's birthday has not happened yet
        return today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    
    # Assuming the 'is_fraud' column contains the labels (1 for fraud, 0 for normal)
    labels = df['is_fraud']

    # Drop identifier and free-text columns plus the label column.
    # 'category', 'gender' and the timestamp are kept and converted to numbers below.
    features = df.drop(['merchant', 'first', 'last',
                        'street', 'city', 'state', 'job', 'dob', 'trans_num', 'is_fraud'], axis=1)

    # Convert the time string to an integer timestamp
    features['trans_date_trans_time'] = pd.to_datetime(features['trans_date_trans_time'], format='%Y-%m-%d %H:%M:%S').astype(np.int64)

    # Encode categorical columns as integer codes
    features['category'] = pd.Categorical(features['category'])
    features['category'] = features['category'].cat.codes

    features['gender'] = pd.Categorical(features['gender'])
    features['gender'] = features['gender'].cat.codes

    # Calculate current age
    features['age'] = df['dob'].apply(age)

    return labels, features
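
The nested `age` helper measures age against `date.today()`, so the `age` feature changes depending on when the notebook is run. A minimal sketch of a reproducible variant follows; `age_on` and the reference dates are hypothetical names and values chosen for illustration, not part of the original code.

```python
from datetime import date, datetime


def age_on(born, reference):
    """Age in whole years on a fixed reference date."""
    born = datetime.strptime(born, "%Y-%m-%d").date()
    # Has the birthday already occurred in the reference year?
    had_birthday = (reference.month, reference.day) >= (born.month, born.day)
    return reference.year - born.year - (0 if had_birthday else 1)


# The day before the 30th birthday versus the birthday itself.
day_before = age_on("1990-06-15", date(2020, 6, 14))
on_birthday = age_on("1990-06-15", date(2020, 6, 15))
print(day_before, on_birthday)
```

Another option would be to pass each transaction's own date as the reference, so the feature reflects the cardholder's age at the time of purchase.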

##### Read The Training Dataset

In [3]:
# Load the training data from CSV using pandas.
# index_col='Unnamed: 0' uses the unnamed first column as the index,
# so it does not end up among the features.
training_df = pd.read_csv('data/fraudTrain.csv', index_col='Unnamed: 0')

_, training_features = get_label_and_features(training_df)

##### Train the model first

In [None]:
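# Train the IForestASD model. Rows are scaled to unit norm first so the model
# sees the same representation InstanceUnitNormScaler produces while streaming.
train_X = training_features.values.astype(float)
train_X /= np.maximum(np.linalg.norm(train_X, axis=1, keepdims=True), 1e-8)
model = IForestASD()
model.fit(train_X)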

##### Read The Streaming Data

In [None]:
# Load the streaming (test) data from CSV using pandas.
# index_col='Unnamed: 0' uses the unnamed first column as the index,
# so it does not end up among the features.
df = pd.read_csv('data/fraudTest.csv', index_col='Unnamed: 0')
labels, features = get_label_and_features(df)
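
Because `get_label_and_features` encodes `category` and `gender` separately for each dataframe, the integer codes only line up between training and streaming data if both files contain exactly the same set of values. The sketch below shows the problem and one way around it using plain Python; `build_codebook` and `encode` are hypothetical helpers and the category values are illustrative. Passing an explicit `categories=` list to `pd.Categorical` has the same effect, including the `-1` code for unseen values.

```python
def build_codebook(train_values):
    """Map each distinct training value to a stable integer code."""
    return {value: code for code, value in enumerate(sorted(set(train_values)))}


def encode(values, codebook):
    """Encode values with a fixed codebook; unseen values become -1."""
    return [codebook.get(value, -1) for value in values]


train_categories = ["grocery_pos", "travel", "misc_net", "travel"]
test_categories = ["travel", "misc_net", "shopping_pos"]

# Encoding each split on its own gives 'travel' two different codes.
train_codebook = build_codebook(train_categories)
independent_codebook = build_codebook(test_categories[:2])
print(train_codebook["travel"], independent_codebook["travel"])

# Reusing the training codebook keeps the codes consistent.
test_codes = encode(test_categories, train_codebook)
print(test_codes)
```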

##### Prepare for streaming

In [None]:
# Shuffle the data
features, labels = shuffle(features, labels)

iterator = ArrayStreamer(shuffle=False)  # Init streamer to simulate streaming data.

preprocessor = InstanceUnitNormScaler()  # Init normalizer.
postprocessor = RunningAveragePostprocessor(window_size=5)  # Init running average postprocessor.
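
To show what these two steps do to the data, here is a plain-Python sketch of per-instance unit-norm scaling and a running average over the last five scores. It reflects my understanding of `InstanceUnitNormScaler` and `RunningAveragePostprocessor` rather than their actual implementation; `unit_norm`, `RunningAverage`, and the example row are hypothetical.

```python
import math
from collections import deque


def unit_norm(x):
    """Divide one instance by its L2 norm (left unchanged if the norm is ~0)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 1e-8 else list(x)


class RunningAverage:
    """Average of the most recent `window_size` scores."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)

    def update(self, score):
        self.window.append(score)
        return sum(self.window) / len(self.window)


# Illustrative row: [timestamp in nanoseconds, amount, age].
row = [1.6e18, 120.0, 35.0]
scaled = unit_norm(row)
print(scaled)

smoother = RunningAverage(window_size=5)
smoothed = [smoother.update(s) for s in [0, 0, 0, 0, 10, 0]]
print(smoothed)
```

One thing the example makes visible: a feature on the scale of a nanosecond timestamp dominates the norm, so after scaling every other feature is close to zero. Standardizing each column before the unit-norm step may be worth considering.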

##### Instantiate metrics

In [5]:
auroc = AUROCMetric()  # Init area under the receiver-operating characteristics curve metric.
precision = PrecisionMetric() # Init precision metric.
recall = RecallMetric() # Init recall metric.
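
For intuition about what AUROC measures, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen fraud gets a higher score than a randomly chosen normal transaction, with ties counted as half. The `auroc_pairwise` function and the toy labels and scores are hypothetical and only meant to illustrate the metric, not how `AUROCMetric` is implemented.

```python
def auroc_pairwise(labels, scores):
    """AUROC as the fraction of (fraud, normal) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.2, 0.8]
toy_auroc = auroc_pairwise(toy_labels, toy_scores)
print(toy_auroc)
```

The `ValueError` branch matters for this dataset: fraud is rare, so a short slice of the stream can contain no fraud at all, in which case AUROC is undefined.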

##### Stream the data

In [None]:
start_idx = 100
end_idx = 1000
stream = iterator.iter(features.values[start_idx:end_idx], labels.values[start_idx:end_idx])
for X, y in tqdm(stream, total=end_idx - start_idx):  # Stream data.
    X = preprocessor.fit_transform_partial(X)  # Scale the instance to unit norm.
    
    score = model.fit_score_partial(X)  # Fit model to and score the instance.
    score = postprocessor.fit_transform_partial(score)  # Apply running averaging to the score.

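    auroc.update(y, score)  # Update AUROC metric.
    # Note: precision and recall are defined on binary (0/1) predictions,
    # so the score should be thresholded before these two updates.
    precision.update(y, score)  # Update precision metric.
    recall.update(y, score)  # Update recall metric.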
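
Precision and recall are defined on binary predictions, while the loop above produces a continuous anomaly score. A minimal sketch of the missing step is shown below in plain Python: scores at or above a threshold are flagged as fraud, and the two metrics are computed from the resulting counts. The `precision_recall` helper, the toy data, and the threshold values are assumptions for illustration; a sensible threshold for IForestASD scores would have to be chosen from the data.

```python
def precision_recall(labels, scores, threshold):
    """Threshold scores into 0/1 predictions, then compute precision and recall."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


toy_labels = [0, 0, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.2, 0.8]

low = precision_recall(toy_labels, toy_scores, threshold=0.3)
high = precision_recall(toy_labels, toy_scores, threshold=0.5)
print(low, high)
```

Lowering the threshold catches more fraud (higher recall) at the cost of more false alarms (lower precision), which is the trade-off the two metrics are meant to expose.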

##### Print performance metrics

In [None]:
# Print the resulting metrics.
print("AUROC: ", auroc.get())
print("Precision: ", precision.get())
print("Recall: ", recall.get())