<a href="https://www.kaggle.com/code/mikedelong/acc-0-9505-with-a-simple-ensemble?scriptVersionId=166073548" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import pandas as pd

METAVERSE = '/kaggle/input/metaverse-financial-transactions-dataset/metaverse_transactions_dataset.csv'
# drop the columns we aren't going to use; timestamp is one of them, so there is no need to parse it
df = pd.read_csv(filepath_or_buffer=METAVERSE).drop(columns=['timestamp', 'sending_address', 'receiving_address', 'ip_prefix'])
df.head()

Unnamed: 0,hour_of_day,amount,transaction_type,location_region,login_frequency,session_duration,purchase_pattern,age_group,risk_score,anomaly
0,12,796.949206,transfer,Europe,3,48,focused,established,18.75,low_risk
1,19,0.01,purchase,South America,5,61,focused,established,25.0,low_risk
2,16,778.19739,purchase,Asia,3,74,focused,established,31.25,low_risk
3,9,300.838358,transfer,South America,8,111,high_value,veteran,36.75,low_risk
4,14,775.569344,sale,Africa,6,100,high_value,veteran,62.5,moderate_risk


In [2]:
import warnings
from plotly import express
warnings.filterwarnings(action='ignore', category=FutureWarning)

express.histogram(data_frame=df, x='risk_score', color='anomaly')

Our risk assessment builds a risk score and then segments it into low, moderate, and high buckets. The surprise here is that the risk score is a discrete variable with a non-uniform distribution. Rather than try to predict the score, we'll focus on predicting the anomaly label.

We know from a prior analysis that we want two different models: one for the real-valued features and one for the categorical features. Both models will share a common train/test split.

In [3]:
from sklearn.model_selection import train_test_split

RANDOM_STATE = 2024
TARGET = 'anomaly'
TEST_SIZE = 0.2

X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['risk_score', 'anomaly']), df[TARGET], test_size=TEST_SIZE, random_state=RANDOM_STATE, shuffle=True, stratify=df[TARGET])
X_train.head(n=5)

Unnamed: 0,hour_of_day,amount,transaction_type,location_region,login_frequency,session_duration,purchase_pattern,age_group
49619,20,464.264187,purchase,North America,5,64,focused,established
6024,9,686.530626,transfer,South America,7,117,high_value,veteran
39833,12,452.645601,purchase,North America,2,31,random,new
67412,23,919.036919,sale,South America,8,82,high_value,veteran
50274,2,462.039875,purchase,South America,4,43,focused,established


Let's do our categorical value model first.

In [4]:
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder
from sklearn.inspection import permutation_importance

categories = ['transaction_type', 'purchase_pattern', 'age_group', 'login_frequency', 'session_duration']

# we could do this a column at a time with LabelEncoder,
# but since we treat all of these columns as categorical we can use OrdinalEncoder
# and do the whole thing at once
encoder = OrdinalEncoder(categories='auto').set_output(transform='pandas').fit(X=X_train[categories])

# the other parameters stay at their defaults; passing force_alpha='warn' explicitly
# may be rejected by newer scikit-learn releases and has no effect when alpha=1.0
categorical = CategoricalNB(alpha=1.0)
categorical.fit(X=encoder.transform(X=X_train[categories]), y=y_train)
print('accuracy: {:5.4f} '.format(categorical.score(X=encoder.transform(X=X_test[categories]), y=y_test)))

express.histogram(y=permutation_importance(estimator=categorical, X=encoder.transform(X=X_test[categories]), y=y_test)['importances_mean'].tolist(),
                  x=categories, title='Categorical mean importance').show(validate=True)

accuracy: 0.8659 


In [5]:
from sklearn.metrics import classification_report
print(classification_report(y_true=y_test, y_pred=categorical.predict(X=encoder.transform(X=X_test[categories])), zero_division=0))

               precision    recall  f1-score   support

    high_risk       1.00      1.00      1.00      1299
     low_risk       0.99      0.84      0.91     12699
moderate_risk       0.45      0.95      0.61      1722

     accuracy                           0.87     15720
    macro avg       0.81      0.93      0.84     15720
 weighted avg       0.93      0.87      0.88     15720



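Our categorical model always gets the high-risk cases right, but it has trouble with the moderate-risk cases: recall is 0.95 but precision is only 0.45, so it flags many low-risk transactions as moderate risk. Let's look at what we can do with a simple logistic regression model for the numerical values.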
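As a reading aid for the report above, the two summary rows can be rebuilt from the per-class rows. This is a minimal sketch that copies the rounded f1 scores and supports from the printed report, so it only reproduces the averages to two decimals.

```python
# per-class f1 and support, copied from the classification report above
f1 = {'high_risk': 1.00, 'low_risk': 0.91, 'moderate_risk': 0.61}
support = {'high_risk': 1299, 'low_risk': 12699, 'moderate_risk': 1722}

total = sum(support.values())

# macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1[label] * support[label] for label in f1) / total

print('macro: {:.2f} weighted: {:.2f}'.format(macro_f1, weighted_f1))
```

The macro average is pulled down by the weak moderate-risk class, while the weighted average is dominated by the much larger low-risk class.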

In [6]:
from sklearn.linear_model import LogisticRegression

reals = ['hour_of_day', 'amount', 'login_frequency', 'session_duration']

regression = LogisticRegression(max_iter=1000, tol=1e-6)
regression.fit(X=X_train[reals], y=y_train)
print('fit complete after {} iterations.'.format(regression.n_iter_[0]))
print('accuracy: {:5.4f} '.format(regression.score(X=X_test[reals], y=y_test)))
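# coef_ has one row per class; the first row belongs to regression.classes_[0]
express.histogram(y=regression.coef_[0].tolist(), x=reals, title='Regression coefficients ({})'.format(regression.classes_[0])).show(validate=True)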

fit complete after 744 iterations.
accuracy: 0.8302 


In [7]:
from sklearn.metrics import classification_report
print(classification_report(y_true=y_test, y_pred=regression.predict(X=X_test[reals]), zero_division=0))

               precision    recall  f1-score   support

    high_risk       0.00      0.00      0.00      1299
     low_risk       0.84      0.97      0.90     12699
moderate_risk       0.66      0.43      0.52      1722

     accuracy                           0.83     15720
    macro avg       0.50      0.47      0.47     15720
 weighted avg       0.75      0.83      0.79     15720



Our regression model can't identify any of the high-risk cases, but it has better precision than the categorical model for moderate-risk cases (0.66 versus 0.45). Let's have a little model cook-off to see if we can do better on accuracy using the numerical features.
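Before comparing accuracies it helps to know what a model earns by always predicting the most common class. This is a small sketch of that baseline, using only the class supports printed in the reports above.

```python
# class supports in the test set, copied from the classification reports above
support = {'high_risk': 1299, 'low_risk': 12699, 'moderate_risk': 1722}

total = sum(support.values())
majority_class = max(support, key=support.get)

# accuracy of a model that always predicts the majority class
baseline = support[majority_class] / total

print('{} baseline accuracy: {:5.4f}'.format(majority_class, baseline))
```

Against this baseline, the regression model's 0.8302 accuracy is only about two points better than always answering low_risk.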

In [8]:
from arrow import now
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

MODEL = {
    'Naive Bayes': GaussianNB(),
    'QDA': QuadraticDiscriminantAnalysis(),
    '3 Nearest Neighbors': KNeighborsClassifier(n_neighbors=3),
    '5 Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    '7 Nearest Neighbors': KNeighborsClassifier(n_neighbors=7),
    '8 Nearest Neighbors': KNeighborsClassifier(n_neighbors=8),
    '9 Nearest Neighbors': KNeighborsClassifier(n_neighbors=9),
    '10 Nearest Neighbors': KNeighborsClassifier(n_neighbors=10),
    '1 deep Decision Tree': DecisionTreeClassifier(max_depth=1, random_state=2024),
    '2 deep Decision Tree': DecisionTreeClassifier(max_depth=2, random_state=2024),
    '3 deep Decision Tree': DecisionTreeClassifier(max_depth=3, random_state=2024),
    '4 deep Decision Tree': DecisionTreeClassifier(max_depth=4, random_state=2024),
    '5 deep Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=2024),
    '6 deep Decision Tree': DecisionTreeClassifier(max_depth=6, random_state=2024),
    '7 deep Decision Tree': DecisionTreeClassifier(max_depth=7, random_state=2024),
    '8 deep Decision Tree': DecisionTreeClassifier(max_depth=8, random_state=2024),
    '9 deep Decision Tree': DecisionTreeClassifier(max_depth=9, random_state=2024),
    '10 deep Decision Tree': DecisionTreeClassifier(max_depth=10, random_state=2024),
    '10 estimator Random Forest': RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1, random_state=2024),
    '20 estimator Random Forest': RandomForestClassifier(max_depth=5, n_estimators=20, max_features=1, random_state=2024),
    '30 estimator Random Forest': RandomForestClassifier(max_depth=5, n_estimators=30, max_features=1, random_state=2024),
    'Neural Net': MLPClassifier(alpha=1, max_iter=1000, random_state=2024),
    'AdaBoost': AdaBoostClassifier(algorithm='SAMME', random_state=2024),
}

result = []
for name, clf in MODEL.items():
    time_start = now()
    clf.fit(X_train[reals], y_train)
    score = clf.score(X_test[reals], y_test)
    result.append((score, name))
    print('{:5.4f} {} {}'.format(score, now() - time_start, name))
result = sorted(result, key=lambda x: x[0], reverse=True)
print('best: {:5.4f} {}'.format(result[0][0], result[0][1]))

0.6433 0:00:00.115166 Naive Bayes
0.7163 0:00:00.160496 QDA
0.8219 0:00:00.981955 3 Nearest Neighbors
0.8356 0:00:00.946877 5 Nearest Neighbors
0.8415 0:00:00.953017 7 Nearest Neighbors
0.8304 0:00:00.964228 8 Nearest Neighbors
0.8441 0:00:00.980653 9 Nearest Neighbors
0.8347 0:00:00.976050 10 Nearest Neighbors
0.8078 0:00:00.128870 1 deep Decision Tree
0.8475 0:00:00.148036 2 deep Decision Tree
0.8679 0:00:00.160273 3 deep Decision Tree
0.8676 0:00:00.170412 4 deep Decision Tree
0.8676 0:00:00.180255 5 deep Decision Tree
0.8677 0:00:00.189121 6 deep Decision Tree
0.8669 0:00:00.195242 7 deep Decision Tree
0.8669 0:00:00.204685 8 deep Decision Tree
0.8656 0:00:00.208655 9 deep Decision Tree
0.8656 0:00:00.225173 10 deep Decision Tree
0.8559 0:00:00.415486 10 estimator Random Forest
0.8615 0:00:00.713955 20 estimator Random Forest
0.8608 0:00:00.945812 30 estimator Random Forest
0.8472 0:00:11.982420 Neural Net
0.8163 0:00:05.815017 AdaBoost
best: 0.8679 3 deep Decision Tree


Our winner is the depth-3 decision tree, by a hair. How does its classification report look?

In [9]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

tree = DecisionTreeClassifier(max_depth=3, random_state=2024).fit(X=X_train[reals], y=y_train)
print('accuracy: {:5.4f} '.format(tree.score(X=X_test[reals], y=y_test)))
express.histogram(y=tree.feature_importances_.tolist(), x=reals, title='Feature importances').show(validate=True)
print(classification_report(y_true=y_test, y_pred=tree.predict(X=X_test[reals]), zero_division=0))

accuracy: 0.8679 


               precision    recall  f1-score   support

    high_risk       0.00      0.00      0.00      1299
     low_risk       0.90      0.94      0.92     12699
moderate_risk       0.70      0.95      0.81      1722

     accuracy                           0.87     15720
    macro avg       0.53      0.63      0.58     15720
 weighted avg       0.80      0.87      0.83     15720



Our tree model has a better f1-score than our naive Bayes model for both low- and moderate-risk cases, but it completely misses the high-risk cases. What does it look like when the two models disagree?

In [10]:
import numpy as np
from collections import Counter

y_c = categorical.predict(X=encoder.transform(X=X_test[categories]))
y_r = tree.predict(X=X_test[reals])

# each key is categorical prediction/tree prediction
print(Counter('{}/{}'.format(c, r) for c, r in zip(y_c, y_r)))

Counter({'low_risk/low_risk': 10050, 'moderate_risk/low_risk': 2029, 'moderate_risk/moderate_risk': 1643, 'high_risk/low_risk': 1299, 'low_risk/moderate_risk': 699})
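The pair counts above can be folded into per-model totals to see where the two models part ways. This is a sketch that copies the printed counts by hand; the names `pairs`, `by_categorical`, and `by_tree` are illustrative and not part of the notebook.

```python
from collections import Counter

# counts copied from the output above; keys are categorical/tree
pairs = Counter({
    'low_risk/low_risk': 10050,
    'moderate_risk/low_risk': 2029,
    'moderate_risk/moderate_risk': 1643,
    'high_risk/low_risk': 1299,
    'low_risk/moderate_risk': 699,
})

total = sum(pairs.values())
by_categorical = Counter()
by_tree = Counter()
disagree = 0
for key, count in pairs.items():
    c, r = key.split('/')
    by_categorical[c] += count
    by_tree[r] += count
    if c != r:
        disagree += count

print('total: {} disagreements: {}'.format(total, disagree))
print('categorical:', dict(by_categorical))
print('tree:', dict(by_tree))
```

The tree never predicts high_risk, and every high_risk call from the categorical model lands on a transaction the tree calls low_risk.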


Let's do the dumbest thing possible and combine them: take the high-risk assessment from the categorical model and otherwise use the assessment from the tree model.

In [11]:
from sklearn.metrics import accuracy_score

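def combine(categorical_label: str, tree_label: str) -> str:
    # trust the categorical model only when it says high risk
    if categorical_label == 'high_risk':
        return categorical_label
    return tree_label

y_combined = np.array([combine(c, r) for c, r in zip(y_c, y_r)])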

print('accuracy: {:5.4f}'.format(accuracy_score(y_true=y_test, y_pred=y_combined)))
print(classification_report(y_true=y_test, y_pred=y_combined, zero_division=0))

accuracy: 0.9505
               precision    recall  f1-score   support

    high_risk       1.00      1.00      1.00      1299
     low_risk       0.99      0.94      0.97     12699
moderate_risk       0.70      0.95      0.81      1722

     accuracy                           0.95     15720
    macro avg       0.90      0.97      0.93     15720
 weighted avg       0.96      0.95      0.95     15720
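The same combination rule can also be written without a Python-level loop. This is a minimal sketch, assuming both prediction arrays are NumPy string arrays of equal length; the toy arrays `y_cat` and `y_tree` are hypothetical stand-ins, not the notebook's predictions.

```python
import numpy as np

# hypothetical stand-ins for the categorical and tree predictions
y_cat = np.array(['high_risk', 'low_risk', 'moderate_risk', 'low_risk'])
y_tree = np.array(['low_risk', 'low_risk', 'low_risk', 'moderate_risk'])

# keep the categorical label where it says high risk, else take the tree label
y_both = np.where(y_cat == 'high_risk', y_cat, y_tree)

print(y_both.tolist())
```

Only the first element comes from the categorical model; the rest follow the tree, including the case where the categorical model said moderate_risk.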



We still have trouble, relatively speaking, with the moderate-risk cases (precision 0.70), but an accuracy of 0.9505 is markedly better than either model by itself (0.8659 and 0.8679).
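The size of the gain can be checked by hand. This sketch assumes that the rounded 1.00 precision and recall mean all 1299 high-risk calls from the categorical model are correct, and that the tree gets none of those rows right (its high-risk recall is 0.00). Under those assumptions the ensemble keeps every correct tree prediction and adds the high-risk rows on top.

```python
n_test = 15720          # test rows, from the classification reports
tree_accuracy = 0.8679  # printed accuracy of the depth-3 tree
n_high = 1299           # high_risk support

# the printed accuracy is rounded, so recover the nearest whole count (an assumption)
tree_correct = round(tree_accuracy * n_test)

# high-risk rows were all wrong for the tree and are all right after the override
combined_correct = tree_correct + n_high

print('accuracy: {:5.4f}'.format(combined_correct / n_test))
```

The result agrees with the reported ensemble accuracy, which suggests the entire improvement comes from the high-risk override.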