# About the Dataset

This is a simulated credit card transaction dataset containing legitimate and fraudulent transactions from January 1, 2019 to December 31, 2020. The data is synthetic rather than collected from real cardholders, and it is published on Kaggle.

### Dataset Columns

1. **trans_date_trans_time**: Transaction date and time.
2. **cc_num**: Credit card number.
3. **merchant**: Name of the merchant where the transaction was made.
4. **category**: Merchant category.
5. **amt**: Transaction amount.
6. **first**: First name of the cardholder.
7. **last**: Last name of the cardholder.
8. **gender**: Gender of the cardholder.
9. **street**: Street address of the cardholder.
10. **city**: City of the cardholder's address.
11. **state**: State of the cardholder's address.
12. **zip**: Zip code of the cardholder's address.
13. **lat**: Latitude of the cardholder's address.
14. **long**: Longitude of the cardholder's address.
15. **city_pop**: Population of the cardholder's city.
16. **job**: Job of the cardholder.
17. **dob**: Date of birth of the cardholder.
18. **trans_num**: Transaction number.
19. **unix_time**: Transaction timestamp in Unix format.
20. **merch_lat**: Merchant's latitude.
21. **merch_long**: Merchant's longitude.
22. **is_fraud**: Fraud indicator (0 for legitimate transactions, 1 for fraudulent transactions).

### Usage Examples

This dataset can be used for various analyses and predictive models, such as:

- Exploratory Data Analysis (EDA) to understand patterns of fraudulent transactions.
- Training machine learning models for fraud detection.
- Studies on consumer and merchant behavior.

### Source

The dataset is available on Kaggle and can be accessed [here](https://www.kaggle.com/datasets/kartik2112/fraud-detection/data).

# Importing Libraries

In [1]:
# System
import os
import sys

# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import plotly.graph_objects as go

# Adding the project's root directory to sys.path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir, os.pardir)))

from app.lib.crisp_dm.data_understanding import DataUnderstanding
from app.lib.crisp_dm.data_vizualization import DataVisualization

## Settings

In [2]:
pd.set_option('display.max_columns', 500)

# Importing the Dataset

In [3]:
df_path = "dataset/fraudTest.csv"
df = pd.read_csv(df_path, index_col=0)
df.head()

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,state,zip,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,2020-06-21 12:14:25,2291163933867244,fraud_Kirlin and Sons,personal_care,2.86,Jeff,Elliott,M,351 Darlene Green,Columbia,SC,29209,33.9659,-80.9355,333497,Mechanical engineer,1968-03-19,2da90c7d74bd46a0caf3777415b3ebd3,1371816865,33.986391,-81.200714,0
1,2020-06-21 12:14:33,3573030041201292,fraud_Sporer-Keebler,personal_care,29.84,Joanne,Williams,F,3638 Marsh Union,Altonah,UT,84002,40.3207,-110.436,302,"Sales professional, IT",1990-01-17,324cc204407e99f51b0d6ca0055005e7,1371816873,39.450498,-109.960431,0
2,2020-06-21 12:14:53,3598215285024754,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,Ashley,Lopez,F,9333 Valentine Point,Bellmore,NY,11710,40.6729,-73.5365,34496,"Librarian, public",1970-10-21,c81755dbbbea9d5c77f094348a7579be,1371816893,40.49581,-74.196111,0
3,2020-06-21 12:15:15,3591919803438423,fraud_Haley Group,misc_pos,60.05,Brian,Williams,M,32941 Krystal Mill Apt. 552,Titusville,FL,32780,28.5697,-80.8191,54767,Set designer,1987-07-25,2159175b9efe66dc301f149d3d5abf8c,1371816915,28.812398,-80.883061,0
4,2020-06-21 12:15:17,3526826139003047,fraud_Johnston-Casper,travel,3.19,Nathan,Massey,M,5783 Evan Roads Apt. 465,Falmouth,MI,49632,44.2529,-85.017,1126,Furniture designer,1955-07-06,57ff021bd3f328f8738bb535c302a31b,1371816917,44.959148,-85.884734,0


# Data Understanding

In [4]:
du = DataUnderstanding(df=df)
du.generate_metadata()

**Dataset Information**

The dataset has 555719 rows and 22 columns. Of these, we have:

Column,Not Null,Null,Percent Null,Dtype,Cardinality
trans_date_trans_time,555719,0,0.00%,object,544760
cc_num,555719,0,0.00%,int64,924
merchant,555719,0,0.00%,object,693
category,555719,0,0.00%,object,14
amt,555719,0,0.00%,float64,37256
first,555719,0,0.00%,object,341
last,555719,0,0.00%,object,471
gender,555719,0,0.00%,object,2
street,555719,0,0.00%,object,924
city,555719,0,0.00%,object,849


**Data Types Information:**

Unnamed: 0,Dtype,Count,Percentage
0,object,12,55.00%
1,int64,5,23.00%
2,float64,5,23.00%


The dataset has no null values, so no imputation or row filtering is needed before the analysis.
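The internals of `DataUnderstanding.generate_metadata()` are not shown in this notebook. As a rough sketch, a comparable per-column summary can be built with plain pandas. The helper name `summarize_columns` is hypothetical, and the toy frame below uses made-up values only to exercise it.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column null counts, dtype and cardinality (hypothetical helper)."""
    n_rows = len(frame)
    summary = pd.DataFrame({
        "Not Null": frame.notna().sum(),
        "Null": frame.isna().sum(),
        "Dtype": frame.dtypes.astype(str),
        # nunique() ignores missing values by default.
        "Cardinality": frame.nunique(),
    })
    # Share of missing values per column, as a percentage of all rows.
    summary["Percent Null"] = (summary["Null"] / n_rows * 100).round(2)
    return summary


# Toy frame with made-up values (not taken from the dataset).
toy = pd.DataFrame({
    "category": ["travel", "misc_pos", "travel", None],
    "amt": [3.19, 60.05, 3.19, 41.28],
    "is_fraud": [0, 0, 1, 0],
})

summary = summarize_columns(toy)
print(summary)
```

On the real data, `summarize_columns(df)` should give the same kind of table, with one row per column.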

The columns most relevant to the analysis are:
- **trans_date_trans_time:** check whether fraud follows a pattern over time (hour of day, day of week); if it does, time-based features may help predict fraud.
- **cc_num:** check whether fraud is concentrated in a small number of cards.
- **merchant and category:** check whether some merchants or categories are more prone to fraud than others.
- **locations (lat, long, city, state, zip, etc.):** check whether fraud is concentrated in certain locations.
- **amt:** check whether certain amount ranges contain more fraud; if they do, the amount may be a good predictor.
- **is_fraud:** the target variable; run a multivariate analysis to see how the other features relate to it.
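Most of the checks listed above come down to the same operation: group the transactions by some key and compare the share of fraudulent ones per group. The sketch below assumes only the dataset's column names; the helper `fraud_rate_by`, the amount bands, and the toy rows are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Toy transactions with made-up values; column names follow the dataset.
toy = pd.DataFrame({
    "trans_date_trans_time": pd.to_datetime([
        "2020-06-21 01:10:00", "2020-06-21 01:45:00",
        "2020-06-21 12:14:25", "2020-06-21 12:15:17",
        "2020-06-22 23:30:00", "2020-06-22 23:55:00",
    ]),
    "category": ["shopping_net", "shopping_net", "personal_care",
                 "travel", "shopping_net", "travel"],
    "amt": [950.00, 12.50, 2.86, 3.19, 780.00, 45.00],
    "is_fraud": [1, 0, 0, 0, 1, 0],
})


def fraud_rate_by(frame: pd.DataFrame, key) -> pd.DataFrame:
    """Transaction count, fraud count and fraud rate per group (hypothetical helper)."""
    grouped = frame.groupby(key, observed=True)["is_fraud"]
    out = grouped.agg(n_trans="count", n_fraud="sum")
    out["fraud_rate"] = out["n_fraud"] / out["n_trans"]
    return out.sort_values("fraud_rate", ascending=False)


# Group by an existing column.
by_category = fraud_rate_by(toy, "category")

# Group by a derived key: hour of day of the transaction.
hour = toy["trans_date_trans_time"].dt.hour.rename("hour")
by_hour = fraud_rate_by(toy, hour)

# Group by amount band; the bin edges here are arbitrary.
amt_band = pd.cut(
    toy["amt"],
    bins=[0, 50, 500, float("inf")],
    labels=["0-50", "50-500", "500+"],
).rename("amt_band")
by_amount = fraud_rate_by(toy, amt_band)

print(by_category, by_hour, by_amount, sep="\n\n")
```

The same helper works for `cc_num`, `merchant`, or `state`. Groups with very few transactions produce noisy rates, so it is worth reading `n_trans` alongside `fraud_rate`.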

Before looking at correlations, we parse the date columns, split the transaction timestamp into a date and a time, and encode `gender` as a number.

In [5]:
# Parse the string columns into datetimes
df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])
df['dob'] = pd.to_datetime(df['dob'])

# Split the timestamp into a date (datetime64 at midnight) and a time of day
df['trans_date'] = df['trans_date_trans_time'].dt.normalize()
df['trans_time'] = df['trans_date_trans_time'].dt.time

# Encode gender as 0 (F) / 1 (M)
df['gender'] = df['gender'].map({'F': 0, 'M': 1})
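One caveat about encoding with `Series.map` and a dictionary: any value that is not a key of the dictionary becomes `NaN`. The metadata above reports a cardinality of 2 and no nulls for `gender`, so the mapping should be safe here, but a quick check like the following sketch (toy values, with a deliberately unexpected code `"X"`) makes that explicit.

```python
import pandas as pd

# Toy values, including one code ("X") that the mapping does not cover.
gender_raw = pd.Series(["F", "M", "M", "X"], name="gender")
mapping = {"F": 0, "M": 1}

gender_encoded = gender_raw.map(mapping)

# Values absent from the mapping become NaN, which also turns the column into floats.
unmapped = gender_raw[gender_encoded.isna()].unique()
print(unmapped)
```

For the same reason, running the encoding cell a second time would map the already-encoded 0/1 values to `NaN`, so the cell should be executed only once per load of the data.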

In [6]:
df.head(2)

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,state,zip,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud,trans_date,trans_time
0,2020-06-21 12:14:25,2291163933867244,fraud_Kirlin and Sons,personal_care,2.86,Jeff,Elliott,1,351 Darlene Green,Columbia,SC,29209,33.9659,-80.9355,333497,Mechanical engineer,1968-03-19,2da90c7d74bd46a0caf3777415b3ebd3,1371816865,33.986391,-81.200714,0,2020-06-21,12:14:25
1,2020-06-21 12:14:33,3573030041201292,fraud_Sporer-Keebler,personal_care,29.84,Joanne,Williams,0,3638 Marsh Union,Altonah,UT,84002,40.3207,-110.436,302,"Sales professional, IT",1990-01-17,324cc204407e99f51b0d6ca0055005e7,1371816873,39.450498,-109.960431,0,2020-06-21,12:14:33


## Correlation Analysis

We start by checking how the numeric features correlate with each other and with `is_fraud`.

In [7]:
dv = DataVisualization(df=df)
dv.plot_correlation_heatmap()

### Insights from the Heatmap

1. **Correlation between `lat` and `long`**:
   - The heatmap shows a very high correlation between the cardholder's latitude (`lat`) and longitude (`long`). Two coordinates of the same address are not correlated by construction, so this reflects how cardholder addresses are distributed geographically in this dataset.

2. **Correlation between `merch_lat` and `merch_long`**:
   - The merchant's latitude (`merch_lat`) and longitude (`merch_long`) show the same pattern, which reflects the geographic distribution of the merchants.

3. **Correlation between `amt` and `is_fraud`**:
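   - The correlation between the transaction amount (`amt`) and fraud (`is_fraud`) is the most relevant one for detection. A clearly positive coefficient would indicate that fraudulent transactions tend to have larger amounts. Since a correlation coefficient summarizes only the overall linear or monotonic trend, fraud concentrated in specific amount ranges may not show up here and is better checked by binning `amt`.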
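Reading coefficients off a heatmap by color is imprecise, so it can help to print them. The sketch below assumes the heatmap is based on the pairwise correlation of the numeric columns, which is what `DataFrame.corr` computes (Pearson by default); how `plot_correlation_heatmap()` works internally is not shown in this notebook. The data is synthetic and its correlations are there by construction, so the printed values say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data (not the real dataset): merchant coordinates are drawn close
# to the cardholder's, and the fraud flag is set for large amounts.
rng = np.random.default_rng(0)
n = 2000
lat = rng.uniform(25, 48, n)
long = rng.uniform(-120, -70, n)
toy = pd.DataFrame({
    "lat": lat,
    "long": long,
    "merch_lat": lat + rng.uniform(-1, 1, n),
    "merch_long": long + rng.uniform(-1, 1, n),
    "amt": rng.exponential(70, n),
})
toy["is_fraud"] = (toy["amt"] > 300).astype(int)

# Pairwise Pearson correlation of the numeric columns.
corr = toy.corr(numeric_only=True)

# Correlation of every feature with the target, strongest first.
target_corr = corr["is_fraud"].drop("is_fraud").sort_values(key=abs, ascending=False)
print(target_corr)

# Keep each pair once (upper triangle, diagonal excluded) and rank by |r|.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)
print(pairs.head(5))
```

Applied to the notebook's `df`, the `pairs` ranking shows directly which coordinate columns are strongly correlated with each other, and `target_corr` shows how strongly `amt` and the other numeric features are associated with `is_fraud`.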

In [None]:
df.loc[df['cc_num'] == 2291163933867244]