# IEEE Card Fraud Detection

## Contents
1. Load Data
2. Exploratory Data Analysis
3. Building the base model
4. Building the tuned model

## 1. Load Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
train_id = pd.read_csv("data/train_identity.csv")
train_txn = pd.read_csv("data/train_transaction.csv")

print(f"Shape of train identity: {train_id.shape}")
print(f"Shape of train transaction: {train_txn.shape}")

In [None]:
test_id = pd.read_csv("data/test_identity.csv")
test_txn = pd.read_csv("data/test_transaction.csv")

In [None]:
# primary key is TransactionID
train_df = pd.merge(train_txn,train_id, how = 'left', on = 'TransactionID',validate = "many_to_one")
test_df = pd.merge(test_txn,test_id, how = 'left', on = 'TransactionID',validate = "many_to_one")
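A left merge keeps every transaction even when no identity row exists for it. The following is a minimal sketch on toy data (the frames `txn` and `ident` and their values are hypothetical, not taken from the competition files) showing how `indicator=True` can report what share of transactions actually found identity data:

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (hypothetical values).
txn = pd.DataFrame({"TransactionID": [1, 2, 3, 4],
                    "TransactionAmt": [10.0, 25.5, 7.0, 120.0]})
ident = pd.DataFrame({"TransactionID": [2, 4],
                      "DeviceType": ["mobile", "desktop"]})

# validate="many_to_one" raises if TransactionID is duplicated in `ident`.
merged = pd.merge(txn, ident, how="left", on="TransactionID",
                  validate="many_to_one", indicator=True)

# "_merge" is "both" where an identity row was found, "left_only" otherwise.
identity_coverage = (merged["_merge"] == "both").mean()
print(f"{identity_coverage:.0%} of transactions have identity data")
```

Rows without a match keep their transaction columns and get `NaN` in the identity columns, which is worth keeping in mind when those columns are explored later.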

In [None]:
train_df.head()

## 2. Exploratory Data Analysis

In [None]:
import seaborn as sns
import math

In [None]:
fraud_rows = train_df[train_df['isFraud']==1].shape[0]
total_rows = train_df.shape[0]
print(f"{fraud_rows} out of {total_rows} observations were fraudulent ({round(fraud_rows / total_rows * 100, 2)}%)")

### Detect outliers

In [None]:
amts = train_df['TransactionAmt']

In [None]:
def zscore(lst):
    std_val = lst.std()
    avg_val = sum(lst) / len(lst)
    
    z_lst = [((x-avg_val) / std_val) for x in lst]
    
    return z_lst
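The same standardization can be written without a Python-level loop. This is a sketch of a vectorized variant, assuming the input is a pandas Series; the name `zscore_vectorized` is illustrative and not part of the notebook. Like `Series.std()`, it uses the sample standard deviation (`ddof=1`).

```python
import pandas as pd

def zscore_vectorized(values: pd.Series) -> pd.Series:
    """Return (x - mean) / std using the sample standard deviation (ddof=1)."""
    return (values - values.mean()) / values.std()

# Small illustrative input (not real transaction amounts).
amounts = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
z = zscore_vectorized(amounts)
print(z.round(3).tolist())
```

On a column with hundreds of thousands of rows this form is usually much faster than a list comprehension, and it preserves the original index.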

In [None]:
train_df['TransactionAmtStd'] = zscore(train_df['TransactionAmt'])

In [None]:
def remove_outliers(df):
    df = df[df['TransactionAmtStd'] < 3]
    df = df[df['TransactionAmtStd'] > -3]
    
    return df

In [None]:
train_df.shape

In [None]:
train_df = remove_outliers(train_df)

In [None]:
train_df.shape

### Transaction Amount

In [None]:
sns.distplot(train_df['TransactionAmt']).set_title('Distribution of TransactionAmt')

The distribution is right-skewed: most transactions are small (under 100 in this case), with a long tail of larger amounts.
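The direction of skew can also be checked numerically. The sketch below uses synthetic log-normal data as a stand-in for transaction amounts (an assumption for illustration, not a claim about the real distribution): a positive `Series.skew()` indicates a long right tail, and taking the log pulls the value toward zero.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Log-normal draws: many small values and a few very large ones (synthetic data).
amounts = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=10_000))

raw_skew = amounts.skew()          # positive => long right tail
log_skew = np.log(amounts).skew()  # close to 0 => roughly symmetric
print(f"skew of raw amounts: {raw_skew:.2f}")
print(f"skew of log amounts: {log_skew:.2f}")
```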

In [None]:
sns.distplot(np.log(train_df['TransactionAmt'])).set_title('Log Distribution of TransactionAmt')

### Transaction Date
Times are reported in seconds relative to an unspecified reference point.

In [None]:
# Convert the smallest and largest timestamps from seconds to days
earliest_day = train_df['TransactionDT'].min() / 86400
latest_day = train_df['TransactionDT'].max() / 86400
print(f"First day in Dataset: {earliest_day}")
print(f"Last day in Dataset: {latest_day}")
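Because the timestamps are plain second offsets, integer arithmetic is enough to derive coarser time features. A minimal sketch with hypothetical `TransactionDT` values follows; note that `hour_of_day` is only meaningful relative to the unknown reference point, so treating it as a wall-clock hour is an assumption.

```python
import pandas as pd

SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600

# Hypothetical TransactionDT values in seconds from an unknown reference point.
dt = pd.Series([86_400, 90_000, 172_799, 1_000_000])

day_index = dt // SECONDS_PER_DAY                         # whole days elapsed
hour_of_day = (dt % SECONDS_PER_DAY) // SECONDS_PER_HOUR  # 0..23 within each day
span_days = (dt.max() - dt.min()) / SECONDS_PER_DAY       # length of the covered period

print(day_index.tolist(), hour_of_day.tolist(), round(span_days, 2))
```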

### Product Attributes
Distribution of the types of products being bought

In [None]:
product_codes = list(train_df['ProductCD'].unique())
print(f"There are {len(product_codes)} product codes: {product_codes}")

In [None]:
sns.catplot(x="TransactionAmt", y="ProductCD", data=train_df)

In [None]:
sns.catplot(x="ProductCD", y="TransactionAmt", kind="boxen", data=train_df)\
            .set(xlabel='Product Category', ylabel='Transaction Amount')

In [None]:
# Select the column before aggregating so non-numeric columns are never touched
product_type_stdev = train_df.groupby('ProductCD')['TransactionAmt'].std()
product_type_avg = train_df.groupby('ProductCD')['TransactionAmt'].mean()
product_type = pd.DataFrame({"ProductCD": product_type_avg.index,
                             "Mean": list(product_type_avg), 
                             "Stdev": list(product_type_stdev)})

In [None]:
product_type

`W` seems to have a high average because its distribution is more skewed, which is consistent with its high standard deviation.

In [None]:
sns.barplot(x="ProductCD", y="Mean", data=product_type, color='#baf5ff')

In terms of average transaction amount, there are large differences between product categories. Because of the nature of certain products, these differences in amount could influence the likelihood of fraudulent transactions occurring.

### Card attributes

In [None]:
card_df = train_df[['card1', 'card2', 'card3', 'card4', 'card5', 'card6', 'isFraud']]

#### `card4`: Are certain credit card providers more likely to have fraudulent transactions?

In [None]:
card_df.head()

In [None]:
# Select isFraud before aggregating so string columns such as card6 are not summed
provider_count = card_df.groupby("card4")['isFraud'].count()
provider_fraud_count = card_df.groupby("card4")['isFraud'].sum()

In [None]:
provider_df = pd.merge(provider_count, provider_fraud_count, on='card4')
provider_df['pctFraud'] = provider_df['isFraud_y'] / provider_df['isFraud_x'] * 100
provider_df
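The count, sum, and percentage can also be produced in a single grouped aggregation, since the mean of a 0/1 flag is the fraud rate. The sketch below uses a toy frame with made-up values; the column names `total` and `fraud` are illustrative choices, not names from the notebook.

```python
import pandas as pd

# Toy data (hypothetical values); the None row mimics a missing provider.
cards = pd.DataFrame({
    "card4": ["visa", "visa", "visa", "visa",
              "mastercard", "mastercard", "discover", None],
    "isFraud": [0, 0, 0, 1, 0, 1, 1, 0],
})

provider_summary = (
    cards.groupby("card4")["isFraud"]
         .agg(total="count", fraud="sum", pctFraud="mean")
)
provider_summary["pctFraud"] *= 100  # fraction -> percent
print(provider_summary)
```

By default `groupby` drops rows whose key is missing, so transactions without a `card4` value do not appear in the summary.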

#### `card6`: is credit or debit more likely to be fraudulent?

In [None]:
cred_deb = train_df[['card6', 'TransactionAmt', 'isFraud']]

In [None]:
ax = sns.violinplot(x="card6", y="TransactionAmt", hue="isFraud", data=cred_deb, palette="muted", split=True)

On a proportional basis, credit cards appear more likely to have fraudulent transactions.

In [None]:
debit_fraud = cred_deb[cred_deb['isFraud']==1]['card6'].value_counts()['debit']
credit_fraud = cred_deb[cred_deb['isFraud']==1]['card6'].value_counts()['credit']

debit_total = cred_deb['card6'].value_counts()['debit']
credit_total = cred_deb['card6'].value_counts()['credit']

print(f"There are {debit_fraud} debit fraudulent transactions ({round(debit_fraud / debit_total * 100, 3)}%)")
print(f"There are {credit_fraud} credit fraudulent transactions ({round(credit_fraud / credit_total * 100, 3)}%)")

Credit cards appear to be about 3 times as likely to have a fraudulent transaction as debit cards.
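A "times more likely" statement is a ratio of two fraud rates, which can be computed directly. The sketch below uses toy data constructed so that the ratio comes out near 3; the numbers are illustrative only and are not the dataset's actual rates.

```python
import pandas as pd

# Toy data: 3 of 10 credit rows and 2 of 20 debit rows are flagged as fraud.
toy = pd.DataFrame({
    "card6": ["credit"] * 10 + ["debit"] * 20,
    "isFraud": [1, 1, 1] + [0] * 7 + [1, 1] + [0] * 18,
})

rates = toy.groupby("card6")["isFraud"].mean()  # fraud rate per card type
rate_ratio = rates["credit"] / rates["debit"]   # how many times higher credit is
print(f"credit: {rates['credit']:.1%}, debit: {rates['debit']:.1%}, ratio: {rate_ratio:.1f}")
```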

### Device Type

#### `DeviceType` is either mobile, desktop, or NaN

In [None]:
train_df['DeviceType'].unique()
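`unique()` lists the categories but not how common each one is. A minimal sketch on toy data (hypothetical values) counts missing entries alongside the named device types and compares fraud rates across all three groups; `dropna=False` in `groupby` assumes pandas 1.1 or later.

```python
import pandas as pd

# Toy data (hypothetical values); None stands in for a missing DeviceType.
devices = pd.DataFrame({
    "DeviceType": ["mobile", "desktop", None, None, "desktop", None],
    "isFraud": [1, 0, 0, 0, 0, 1],
})

# dropna=False keeps missing values as their own category.
counts = devices["DeviceType"].value_counts(dropna=False)
fraud_by_device = devices.groupby("DeviceType", dropna=False)["isFraud"].mean()

print(counts)
print(fraud_by_device)
```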