# Credit Card Fraud Prediction - Loading Dataset using Snowpark Python

This example is based on the handbook "Machine Learning for Credit Card Fraud Detection - Practical Handbook": https://fraud-detection-handbook.github.io/fraud-detection-handbook/

## Loading Credit Card Transactions into Snowflake

### Import the dependencies and connect to Snowflake

In [1]:
# Snowpark
from snowflake.snowpark import Session
import snowflake.snowpark.types as T
import snowflake.snowpark.functions as F

# Print the version of Snowpark we are using
from importlib.metadata import version
version('snowflake_snowpark_python')

'0.11.0'

In [2]:
# Other
import json

**Before connecting, make sure you have updated creds.json with the information for your Snowflake account.**

In [3]:
with open('creds.json') as f:
    connection_parameters = json.load(f)
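
If creds.json is incomplete, the connection attempt fails later with a less obvious error. A minimal sketch of an early check is shown here. The helper name `load_connection_parameters` and the choice of required keys are assumptions for illustration: `account`, `user` and `password` are common Snowpark connection parameters, but accounts using key-pair or SSO authentication will need a different list.

```python
import json

# Assumption: password-based authentication. Adjust this tuple for your setup.
REQUIRED_KEYS = ("account", "user", "password")


def load_connection_parameters(path="creds.json"):
    """Load the connection parameters and fail early if a required value is empty.

    Hypothetical helper, not part of Snowpark.
    """
    with open(path) as f:
        params = json.load(f)

    # Collect every required key that is absent or has an empty value
    missing = [key for key in REQUIRED_KEYS if not params.get(key)]
    if missing:
        raise ValueError(f"{path} is missing values for: {', '.join(missing)}")
    return params
```

The result can be passed to `Session.builder.configs(...)` in the same way as the dictionary loaded above.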

In [4]:
session = Session.builder.configs(connection_parameters).create()

The **get_** functions can be used to get information about the current database, schema, role, etc.

In [5]:
print(f"Current schema: {session.get_fully_qualified_current_schema()}, current role: {session.get_current_role()}, current warehouse:  {session.get_current_warehouse()}")

Current schema: "FRAUD_DATA"."PUBLIC", current role: "ACCOUNTADMIN", current warehouse:  "COMPUTE_WH"


### Define Staging Area and the Schema for the transaction table

Using SQL, we can create an internal stage and then use the **put** function to upload the **fraud_transactions.csv.gz** file to it.

In [6]:
stage_name = "FRAUD_DATA"
# Create an internal staging area for uploading the source file
session.sql(f"CREATE OR REPLACE STAGE {stage_name}").collect()

# Upload the source file to the stage
putResult = session.file.put("data/fraud_transactions.csv.gz", f"@{stage_name}", auto_compress=False, overwrite=True)

putResult

[PutResult(source='fraud_transactions.csv.gz', target='fraud_transactions.csv.gz', source_size=21382572, target_size=21382576, source_compression='GZIP', target_compression='GZIP', status='UPLOADED', message='')]
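
The upload passes `auto_compress=False`, and the result reports `source_compression='GZIP'`, so the file is expected to be gzip-compressed already. A minimal sketch of two sanity checks around the upload follows. Both helper names are hypothetical, and treating any status other than `'UPLOADED'` as a problem is an assumption that fits `overwrite=True` as used here.

```python
def is_gzip_file(path):
    """Return True if the file starts with the two-byte gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"


def not_uploaded(put_results):
    """Return the entries of a put result list whose status is not 'UPLOADED'.

    Works on any objects with a `status` attribute, such as the PutResult
    entries shown in the output above.
    """
    return [result for result in put_results if result.status != "UPLOADED"]
```

For example, `is_gzip_file("data/fraud_transactions.csv.gz")` could be checked before calling `session.file.put`, and `not_uploaded(putResult)` should return an empty list afterwards.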

Define the schema for our **CUSTOMER_TRANSACTIONS_FRAUD** table

In [7]:
# Define the schema for the Frauds table
dfCustTrxFraudSchema = T.StructType(
    [
        T.StructField("TRANSACTION_ID", T.IntegerType()),
        T.StructField("TX_DATETIME", T.TimestampType()),
        T.StructField("CUSTOMER_ID", T.IntegerType()),
        T.StructField("TERMINAL_ID", T.IntegerType()),
        T.StructField("TX_AMOUNT", T.FloatType()),
        T.StructField("TX_TIME_SECONDS", T.IntegerType()),
        T.StructField("TX_TIME_DAYS", T.IntegerType()),
        T.StructField("TX_FRAUD", T.IntegerType()),
        T.StructField("TX_FRAUD_SCENARIO", T.IntegerType())
    ]
)
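
Before loading, it can help to confirm locally that a line of the CSV matches this schema. The sketch below mirrors the column order and types with plain Python converters. It is an illustration only: `EXPECTED_COLUMNS` and `parse_row` are hypothetical names, and the timestamp format `YYYY-MM-DD HH:MM:SS` is an assumption about how TX_DATETIME is written in the file.

```python
from datetime import datetime


def _parse_timestamp(value):
    # Assumed timestamp layout; adjust if the file uses a different one
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


# Column names and converters in the same order as the Snowpark schema
EXPECTED_COLUMNS = [
    ("TRANSACTION_ID", int),
    ("TX_DATETIME", _parse_timestamp),
    ("CUSTOMER_ID", int),
    ("TERMINAL_ID", int),
    ("TX_AMOUNT", float),
    ("TX_TIME_SECONDS", int),
    ("TX_TIME_DAYS", int),
    ("TX_FRAUD", int),
    ("TX_FRAUD_SCENARIO", int),
]


def parse_row(fields):
    """Convert one list of CSV field strings into a typed dict, or raise ValueError."""
    if len(fields) != len(EXPECTED_COLUMNS):
        raise ValueError(
            f"expected {len(EXPECTED_COLUMNS)} fields, got {len(fields)}"
        )
    return {
        name: convert(value)
        for (name, convert), value in zip(EXPECTED_COLUMNS, fields)
    }
```

A row such as `"0,2019-04-01 00:00:31,596,3156,57.16,31,0,0,0".split(",")` should convert without error; a row with a missing field or a non-numeric ID raises `ValueError`.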

Read **fraud_transactions.csv.gz** from the stage with a DataFrame reader and save the result into a table

In [8]:
# Create a reader
dfReader = session.read.schema(dfCustTrxFraudSchema)

# Get the data into the data frame
dfCustTrxFraudRd = dfReader.csv(f"@{stage_name}/fraud_transactions.csv.gz")

In [9]:
# Write the DataFrame to a table
ret = dfCustTrxFraudRd.write.mode("overwrite").saveAsTable("CUSTOMER_TRANSACTIONS_FRAUD")

### Create the CUSTOMERS and TERMINALS tables from CUSTOMER_TRANSACTIONS_FRAUD

In [10]:
# Now create the CUSTOMERS and TERMINALS tables

dfCustTrxFraudTb = session.table("CUSTOMER_TRANSACTIONS_FRAUD")

dfCustomers = dfCustTrxFraudTb.select(F.col("CUSTOMER_ID")).distinct().sort(F.col("CUSTOMER_ID"))

dfTerminals = dfCustTrxFraudTb.select(F.col("TERMINAL_ID")).distinct().sort(F.col("TERMINAL_ID"))

ret2 = dfCustomers.write.mode("overwrite").saveAsTable("CUSTOMERS")

ret3 = dfTerminals.write.mode("overwrite").saveAsTable("TERMINALS")
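
To confirm that the tables were written, the row counts can be read back. This is a minimal sketch using `session.table(...)` and the DataFrame `count()` method; the helper name `table_row_counts` is hypothetical, and it assumes the tables live in the session's current schema.

```python
def table_row_counts(session, table_names):
    """Return a dict mapping each table name to its row count.

    `session` is expected to be an open Snowpark Session; each count()
    call runs a query in Snowflake.
    """
    return {name: session.table(name).count() for name in table_names}
```

For example, `table_row_counts(session, ["CUSTOMER_TRANSACTIONS_FRAUD", "CUSTOMERS", "TERMINALS"])` returns one count per table. Since CUSTOMERS and TERMINALS hold distinct IDs, their counts should not exceed the transaction count.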

In [11]:
dfCustTrxFraudTb.show()

----------------------------------------------------------------------------------------------------------------------------------------------------------------
|"TRANSACTION_ID"  |"TX_DATETIME"        |"CUSTOMER_ID"  |"TERMINAL_ID"  |"TX_AMOUNT"  |"TX_TIME_SECONDS"  |"TX_TIME_DAYS"  |"TX_FRAUD"  |"TX_FRAUD_SCENARIO"  |
----------------------------------------------------------------------------------------------------------------------------------------------------------------
|0                 |2019-04-01 00:00:31  |596            |3156           |57.16        |31                 |0               |0           |0                    |
|1                 |2019-04-01 00:02:10  |4961           |3412           |81.51        |130                |0               |0           |0                    |
|2                 |2019-04-01 00:07:56  |2              |1365           |146.0        |476                |0               |0           |0                    |
|3                 |2019-04-01 00: