# Credit Card Fraud Detection: Harnessing the Power of Machine Learning in Snowflake ML

The prerequisite for this notebook is the completion of setup in the 1_cc_fins_setup notebook.

To get started, click the **Start** button! Once it says **Active**, you're ready to run the rest of the Notebook. All the packages have been pre-uploaded. 
We will be consuming the features from the Feature Store.

### Snowflake ML Feature Store
A Python SDK for defining, registering, retrieving, and managing features.

Entity: Entities are the underlying objects that features and feature views are associated with. They encapsulate the join keys used for feature lookups. 

FeatureView: A feature view is a group of logically-related features that are refreshed on the same schedule.


In [None]:
# Standard library imports
import os
import time
import math

# Third-party library imports
import pandas as pd
import numpy as np



# Streamlit and plotting imports
import streamlit as st
import altair as alt
import matplotlib.pyplot as plt
import seaborn as sns

# Snowflake library imports
from snowflake.ml.feature_store import (
    FeatureStore,
    FeatureView,
    CreationMode,
)

from snowflake.ml import dataset
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T
from snowflake.snowpark.context import get_active_session
session = get_active_session()
session.query_tag = {"origin":"sf_sit-is", 
                     "name":"credit_card_fraud", 
                     "version":{"major":1, "minor":0},
                     "attributes":{"is_quickstart":0, "source":"notebook"}}

# Set the style for the plots
sns.set(style="whitegrid")

# Custom color palettes
colors = {
    'Non-Fraud Bars': '#4C72B0',
    'Fraud Bars': '#55A868',
    'Non-Fraud Line': '#1f77b4',
    'Fraud Line': '#ff7f0e'
}

Set the context for the role, database, and schema.

In [None]:

session.sql("USE ROLE SYSADMIN").collect()
session.sql("USE DATABASE CC_FINS_DB").collect()
session.sql("USE SCHEMA ANALYTICS").collect()

### Generating Datasets for Training
We are now ready to generate our training set. We'll define a spine DataFrame to form the backbone of our generated dataset and pass it into FeatureStore.generate_dataset() along with our Feature Views.

NOTE: The spine serves as a request template and specifies the entities, labels and timestamps (when applicable). The feature store then attaches feature values along the spine using an AS-OF join to efficiently combine and serve the relevant, point-in-time correct feature data.
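
To make the AS-OF join concrete, here is a minimal pure-Python sketch of the idea. It is not the Feature Store's implementation. The function `as_of_join` and the sample rows are hypothetical, and the spine in this notebook carries no timestamp column, so this applies only to the "when applicable" case mentioned above.

```python
from bisect import bisect_right

def as_of_join(spine, feature_rows):
    """Attach to each spine row the latest feature value at or before its timestamp.

    spine:        list of (entity_id, timestamp) request rows
    feature_rows: list of (entity_id, timestamp, value) feature history
    """
    # Group the feature history per entity, sorted by timestamp.
    history = {}
    for entity_id, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        times, values = history.setdefault(entity_id, ([], []))
        times.append(ts)
        values.append(value)

    joined = []
    for entity_id, ts in spine:
        times, values = history.get(entity_id, ([], []))
        # Index of the last feature timestamp that is <= the spine timestamp.
        idx = bisect_right(times, ts) - 1
        joined.append((entity_id, ts, values[idx] if idx >= 0 else None))
    return joined

# Hypothetical data: a running transaction count for one user.
features = [("u1", 10, 1), ("u1", 20, 2), ("u1", 30, 3)]
spine = [("u1", 5), ("u1", 20), ("u1", 25), ("u2", 25)]
result = as_of_join(spine, features)
print(result)
```

A spine row never sees a feature value recorded after its own timestamp, which is what keeps the generated dataset point-in-time correct.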

In [None]:
session.sql("create or replace TABLE TRANSACTIONS_DATA (USER_ID VARCHAR,TRANSACTION_ID VARCHAR(16777216),IS_FRAUD VARCHAR)").collect()

Save the spine dataframe to a table

In [None]:
session.sql("insert into TRANSACTIONS_DATA(User_ID, Transaction_ID, IS_FRAUD) SELECT distinct User_ID, Transaction_ID, IS_FRAUD FROM CREDITCARD_TRANSACTIONS").collect()
TRANSACTIONS_DATA_df = session.table("TRANSACTIONS_DATA")
TRANSACTIONS_DATA_df.show()

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution.

In [None]:
full_df = session.sql("SELECT * FROM CREDITCARD_TRANSACTIONS")
full_df.describe()
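
As a rough illustration of what such a summary contains, the sketch below computes the count, mean, standard deviation, minimum, and maximum of one numeric column using only the standard library. The helper `describe` and the sample values are hypothetical, and the sample standard deviation used here is an assumption about how the summary is defined.

```python
import statistics

def describe(values):
    """Return count, mean, sample standard deviation, min, and max."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

amounts = [12.0, 15.0, 18.0, 15.0]  # hypothetical transaction amounts
summary = describe(amounts)
print(summary)
```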

In [None]:
full_df.columns

Visualize the fraud and normal transactions with a Streamlit bar chart that shows the number of distinct transactions in each fraud category.

In [None]:
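# Load the full dataset into pandas for the plots that follow.
# Note: this rebinds the name `dataset`, shadowing the snowflake.ml `dataset` module imported earlier.
dataset = full_df.toPandas()

# Group by IS_FRAUD and count the distinct TRANSACTION_ID values in each group
df = (
    TRANSACTIONS_DATA_df
    .select(F.col("TRANSACTION_ID"), F.col("IS_FRAUD"))
    .groupBy(F.col("IS_FRAUD"))
    .agg(F.count_distinct(F.col("TRANSACTION_ID")).alias("TOTAL_TRANSACTIONS"))
)

st.bar_chart(df, x="IS_FRAUD", y="TOTAL_TRANSACTIONS")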

Create a histogram that shows the distribution of transaction amounts, distinguishing between fraudulent and non-fraudulent transactions. 

In [None]:

# Cast IS_FRAUD to int so the 0/1 hue levels and color mapping used in the plots work
dataset['IS_FRAUD'] = dataset['IS_FRAUD'].astype(int)
# Set the style for the plots
sns.set(style="whitegrid")
# Distribution of transaction amounts
plt.figure(figsize=(4,4))
sns.histplot(data=dataset, x='TRANSACTION_AMOUNT', hue='IS_FRAUD', kde=True, bins=50)
plt.title('Distribution of Transaction Amounts')
plt.xlabel('Transaction Amount')
plt.ylabel('Frequency')
plt.legend(title='Transaction', loc='upper right', labels=['Normal', 'Fraud'])
plt.show()
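
Behind a histogram split by class is a per-class bin count. The sketch below shows that counting step in plain Python. The helper `binned_counts`, the fixed bin width, and the sample data are all hypothetical. The plot above instead asks seaborn for 50 bins over the data range.

```python
from collections import Counter

def binned_counts(amounts, labels, bin_width):
    """Count values per (label, bin start) pair using fixed-width bins."""
    counts = Counter()
    for amount, label in zip(amounts, labels):
        # Start of the bin that contains this amount, e.g. 18.0 -> 10 for width 10.
        bin_start = int(amount // bin_width) * bin_width
        counts[(label, bin_start)] += 1
    return dict(counts)

amounts = [5.0, 12.0, 18.0, 250.0, 260.0]  # hypothetical amounts
labels = [0, 0, 0, 1, 1]                   # 0 = normal, 1 = fraud
counts = binned_counts(amounts, labels, 10)
print(counts)
```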


Create a histogram that shows the distribution of clicks, distinguishing between fraudulent and non-fraudulent transactions. 

In [None]:
# CLICKS distribution (LOGIN_PER_HOUR and TIME_ELAPSED follow in the next cells)
sns.set(style="whitegrid")

# Custom color palettes
colors = {
    'Normal Bars': '#4C72B0',
    'Fraud Bars': '#55A868',
    'Normal Line': '#1f77b4',
    'Fraud Line': '#ff7f0e'
}
# CLICKS distribution
plt.figure(figsize=(4, 4))
sns.histplot(data=dataset, x='CLICKS', hue='IS_FRAUD', multiple='dodge', kde=True, bins=30)
plt.title('Clicks Distribution')
plt.xlabel('Clicks')
plt.ylabel('Frequency')
plt.legend(title='Transaction', loc='upper right', labels=['Normal', 'Fraud'])
plt.show()

Create a histogram that shows the distribution of logins, distinguishing between fraudulent and non-fraudulent transactions. 

In [None]:
plt.figure(figsize=(4, 4))
sns.histplot(data=dataset, x='LOGIN_PER_HOUR', hue='IS_FRAUD', multiple='dodge', kde=True, bins=30)
plt.title('Login Per Hour Distribution')
plt.xlabel('Login Per Hour')
plt.ylabel('Frequency')
plt.legend(title='Is Fraud', loc='upper right', labels=['Non-Fraud', 'Fraud'])
plt.show()

Create a histogram that shows the distribution of time elapsed online, distinguishing between fraudulent and non-fraudulent transactions. 

In [None]:
plt.figure(figsize=(4,4))
sns.histplot(data=dataset, x='TIME_ELAPSED', hue='IS_FRAUD', kde=True, bins=50)
plt.title('Time Elapsed Distribution')
plt.xlabel('Time Elapsed (seconds)')
plt.ylabel('Frequency')
plt.legend(title='Is Fraud', loc='upper right', labels=['Non-Fraud', 'Fraud'])
plt.show()


Create a scatter plot that shows the geographical distribution of transactions by location, distinguishing between fraudulent and non-fraudulent transactions.

In [None]:

# Define location coordinates
location_coords = {
    'New York': (40.7128, -74.0060),
    'Los Angeles': (34.0522, -118.2437),
    'Chicago': (41.8781, -87.6298),
    'Houston': (29.7604, -95.3698),
    'Phoenix': (33.4484, -112.0740),
    'Philadelphia': (39.9526, -75.1652),
    'San Antonio': (29.4241, -98.4936),
    'San Diego': (32.7157, -117.1611),
    'Dallas': (32.7767, -96.7970),
    'San Jose': (37.3382, -121.8863),
    'Moscow': (55.7558, 37.6176)  # Add Moscow coordinates
}

# Add latitude and longitude based on location
dataset['LATITUDE'] = dataset['LOCATION'].map(lambda loc: location_coords.get(loc, (None, None))[0])
dataset['LONGITUDE'] = dataset['LOCATION'].map(lambda loc: location_coords.get(loc, (None, None))[1])

# Create the figure
plt.figure(figsize=(6, 6))

# Plot all locations
scatter = plt.scatter(dataset['LONGITUDE'], dataset['LATITUDE'], 
                      c=dataset['IS_FRAUD'].map({0: 'purple', 1: 'red'}),
                      alpha=0.5)

# Create custom legend
purple_patch = plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='purple', markersize=10, label='Normal')
red_patch = plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='red', markersize=10, label='Fraud')

# Plot details
plt.title('Geographical Distribution of Transactions')
plt.xlabel('Longitude')
plt.ylabel('Latitude')

# Set legend with custom handles
plt.legend(handles=[purple_patch, red_patch], title='Transaction Type', loc='upper left', bbox_to_anchor=(1, 1), frameon=True, fontsize='small')

plt.grid(True)

# Set background color for the plot
plt.gcf().set_facecolor("#f0f0f0")  # Light gray
plt.show()

## Feature Store
The feature store contains feature views for customers and transactions. Model features will be accessed from the feature store.

**Snowflake Feature:** Feature Store (private preview) - Easily find features that work with your data

In [None]:
# Access feature views

fs = FeatureStore(
    session=session,
    database="CC_FINS_DB",
    name="ANALYTICS",
    default_warehouse="CC_FINS_WH",
    creation_mode=CreationMode.FAIL_IF_NOT_EXIST
)

customer_fv : FeatureView = fs.get_feature_view(
    name='Customer_Features',
    version='V1'
)
print(customer_fv)

trans_fv : FeatureView = fs.get_feature_view(
    name='Trans_Features',
    version='V1'
)
print(trans_fv)


Generate a training dataset with the feature store's generate_dataset method, which enriches a Snowpark DataFrame that contains the source data (the spine) with the derived feature values.


In [None]:
# Attach features from the feature store to a spine DataFrame
def create_dataset(spine_df, name):
    train_dataset = fs.generate_dataset(
        name=name,
        spine_df=spine_df,
        features=[customer_fv, trans_fv],
    )
    # Read the materialized Dataset back as a Snowpark DataFrame
    return train_dataset.read.to_snowpark_dataframe()

# Split the spine 80/20 into training and validation sets
datasets = TRANSACTIONS_DATA_df.random_split([0.8, 0.2])

# Build the training and validation tables
train_df = create_dataset(datasets[0], "train")
val_df = create_dataset(datasets[1], "validation")
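
The weighted split above assigns rows at random, so the resulting sizes only approximate 80/20. The following pure-Python sketch illustrates that behavior under the assumption that each row is assigned independently. The function `random_split` here is a hypothetical stand-in, not Snowpark's implementation.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one partition with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative, normalized upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last partition catches any floating-point remainder.
            if draw < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts

train, val = random_split(list(range(1000)), [0.8, 0.2], seed=42)
print(len(train), len(val))
```

Fixing the seed makes the split reproducible, which is useful when the training and validation tables need to be rebuilt consistently.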



View the training dataset.

The dataset combines the spine columns with the feature values. The label, IS_FRAUD, is included because it is specified as the target column during model training. The ID columns are left out later, when the training view is created.


In [None]:
train_df.show()

Create separate views for training and validation to use with a binary classifier. Columns in the inference data that were not present in the training dataset are ignored.


In [None]:

train_df.write.mode("overwrite").save_as_table("training_fd_table")

session.sql("CREATE OR REPLACE VIEW fraud_classification_training_view AS SELECT IS_FRAUD,LATITUDE,LONGITUDE,LOCATION,TOTAL_TRANSACTIONS,STDDEV_TRANSACTION_AMOUNT,NUM_UNIQUE_MERCHANTS, MEAN_WEEKLY_SPENT,MEAN_MONTHLY_SPENT,MEAN_YEARLY_SPENT,TIME_ELAPSED,CLICKS,CUMULATIVE_CLICKS,CUMULATIVE_LOGINS_PER_HOUR FROM training_fd_table").collect()

# Save the validation set with its label; the inference view below excludes IS_FRAUD
val_df.write.mode("overwrite").save_as_table("val_fd_table")

session.sql("CREATE OR REPLACE VIEW fraud_classification_val_view AS SELECT * EXCLUDE IS_FRAUD FROM val_fd_table").collect()

In [None]:
SELECT * FROM fraud_classification_val_view LIMIT 2;

## Build the model
We can create the classification model by running the following statement:

In [None]:
CREATE OR REPLACE SNOWFLAKE.ML.CLASSIFICATION fraud_classification_model(
    INPUT_DATA => SYSTEM$REFERENCE('view', 'fraud_classification_training_view'),
    TARGET_COLNAME => 'IS_FRAUD'
);

To view all classification models, use the SHOW command.

In [None]:
SHOW SNOWFLAKE.ML.CLASSIFICATION;

Create a table for the Streamlit app to use for ongoing predictions. It holds the transactions that were not part of the training set, with the IS_FRAUD label removed.

In [None]:
CREATE or replace table CC_APP_TBL AS SELECT * FROM CREDITCARD_TRANSACTIONS WHERE TRANSACTION_ID NOT IN (SELECT DISTINCT TRANSACTION_ID FROM training_fd_table);
alter table CC_APP_TBL drop column IS_FRAUD;

To run inference (prediction) on a dataset, use the model’s PREDICT method.

In [None]:
CREATE OR REPLACE TABLE fraud_predictions AS
SELECT *,fraud_classification_model!PREDICT(INPUT_DATA => object_construct(*)) as predictions
from fraud_classification_val_view;


View the predictions. The prediction object includes the predicted probability for each class and the predicted class, which is the class with the highest predicted probability. The predictions are returned in the same order as the original features were provided.

In [None]:
SELECT * FROM fraud_predictions;

In the result set, the model returns both the predicted class, in the class field, and the probability of membership in each class. Often we want to parse out the prediction or its probability into its own column:

In [None]:
select * EXCLUDE PREDICTIONS,
        predictions:class::STRING AS class,
      round(predictions['probability'][class], 3) as probability
from fraud_predictions;
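
The same parsing can be sketched in Python for a single prediction object. The JSON string below is a hypothetical example shaped after the description above (a predicted class plus per-class probabilities). The real object may carry additional fields.

```python
import json

# Hypothetical prediction object for one row.
raw = '{"class": "1", "probability": {"0": 0.02, "1": 0.98}}'
prediction = json.loads(raw)

# The predicted class, and the probability the model assigned to that class.
predicted_class = prediction["class"]
probability = round(prediction["probability"][predicted_class], 3)

# The predicted class is the one with the highest predicted probability.
best = max(prediction["probability"], key=prediction["probability"].get)
print(predicted_class, probability, best)
```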

Now that we have built our classifier, we can evaluate it to better understand both its performance and the primary factors in the dataset that drive its predictions. Follow along below to see the commands you can run to evaluate your own classifier:

# Confusion Matrix & Model Accuracy
One of the most common ways to evaluate a classifier is with a confusion matrix, which lets us see the types of errors the model makes. It is typically used to calculate a classifier's precision and recall: precision is the fraction of predictions for a class of interest that are correct, and recall is the fraction of actual members of that class that the model identifies.

In [None]:
CALL fraud_classification_model!SHOW_CONFUSION_MATRIX();
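
To show how precision and recall follow from the confusion matrix counts, here is a small self-contained sketch. The helpers `confusion_counts` and `precision_recall` and the label lists are hypothetical. The model's own method returns these counts for you.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for the chosen positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn

def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

actual = [1, 1, 1, 0, 0, 0, 0, 0]     # hypothetical true labels (1 = fraud)
predicted = [1, 1, 0, 1, 0, 0, 0, 0]  # hypothetical model output
tp, fp, fn, tn = confusion_counts(actual, predicted)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, tn, precision, recall)
```

In fraud detection the positive class is usually rare, so precision and recall on the fraud class tend to be more informative than overall accuracy.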

SHOW_EVALUATION_METRICS returns evaluation metrics for each class, such as precision, recall, and F1 score. These are derived from the counts of true positives, false positives, true negatives, and false negatives.

In [None]:
CALL fraud_classification_model!SHOW_EVALUATION_METRICS();

SHOW_THRESHOLD_METRICS provides raw counts and metrics at specific thresholds for each class. These can be used to plot ROC and PR curves or to tune the threshold. The threshold varies from 0 to 1 for each class.

In [None]:
CALL fraud_classification_model!SHOW_THRESHOLD_METRICS();
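
The sketch below illustrates how per-threshold counts turn into points on an ROC curve. It is a simplified, hypothetical example for a single positive class. The function `threshold_metrics`, the scores, and the thresholds are assumptions, not the output format of the model's method.

```python
def threshold_metrics(actual, scores, thresholds):
    """For each threshold, count outcomes and compute TPR and FPR for the positive class."""
    rows = []
    for t in thresholds:
        # Predict the positive class when the score reaches the threshold.
        predicted = [1 if s >= t else 0 for s in scores]
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if a == 1 and p == 1)
        fp = sum(1 for a, p in pairs if a == 0 and p == 1)
        fn = sum(1 for a, p in pairs if a == 1 and p == 0)
        tn = sum(1 for a, p in pairs if a == 0 and p == 0)
        tpr = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate (recall)
        fpr = fp / (fp + tn) if (fp + tn) else 0.0  # false positive rate
        rows.append({"threshold": t, "tp": tp, "fp": fp, "fn": fn, "tn": tn,
                     "tpr": tpr, "fpr": fpr})
    return rows

actual = [1, 1, 0, 0]          # hypothetical true labels
scores = [0.9, 0.4, 0.6, 0.1]  # hypothetical fraud probabilities
rows = threshold_metrics(actual, scores, [0.0, 0.5, 1.0])
for row in rows:
    print(row)
```

Plotting `fpr` against `tpr` across many thresholds gives the ROC curve. Raising the threshold trades recall for fewer false positives.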

# Feature Importances
The last thing we want to understand when evaluating the classifier is the importance of each of the input columns, or features, that we used. This helps us to:

- Better understand what is driving the model's predictions, which gives more insight into the business process we are modeling.
- Engineer new features, or remove ones that have little impact, to improve the model's performance.

The ML Classification function provides a method for this. It returns a ranked list of the relative importance of all input features, where each value is between 0 and 1 and the importances across all features sum to 1.

In [None]:
CALL fraud_classification_model!SHOW_FEATURE_IMPORTANCE();
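
As an illustration of the "sum to 1" property, the sketch below normalizes a set of raw scores and ranks them. The raw scores and the helper `normalize_importances` are hypothetical. How the model derives its underlying scores is not covered here.

```python
def normalize_importances(raw_scores):
    """Scale non-negative scores so they sum to 1, then rank from most to least important."""
    total = sum(raw_scores.values())
    normalized = {name: score / total for name, score in raw_scores.items()}
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)

# Hypothetical raw scores for three of the input columns.
raw = {"CLICKS": 3.0, "TIME_ELAPSED": 6.0, "LOCATION": 1.0}
ranked = normalize_importances(raw)
print(ranked)
```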

This completes end-to-end model building with Snowflake ML and fraud detection on a validation dataset.