In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("whitegrid")

# Credit Card Fraud Detection 

https://www.kaggle.com/mlg-ulb/creditcardfraud

- - - 

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
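
As a quick sanity check on the scale of the imbalance, the sketch below recomputes the fraud rate from the two counts quoted above. The comparison with a model that labels every transaction as normal is an added illustration, not part of the dataset description.

```python
# Counts quoted in the dataset description above.
n_transactions = 284_807
n_frauds = 492

fraud_rate = n_frauds / n_transactions
normal_per_fraud = (n_transactions - n_frauds) / n_frauds

# Accuracy of a trivial model that labels every transaction as normal.
baseline_accuracy = 1 - fraud_rate

print(f"fraud rate:          {fraud_rate:.4%}")
print(f"normal per fraud:    {normal_per_fraud:.0f}")
print(f"all-normal accuracy: {baseline_accuracy:.4%}")
```

Because the trivial model already scores very high on accuracy, plain accuracy is a poor yardstick for this dataset.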

## Dataset description

The dataset contains only numerical input variables which are the result of a PCA transformation. Due to confidentiality issues, the original features and further background information about the data are not provided.

* Features V1, V2, … V28 are the principal components obtained with PCA. 
* Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. 
* Feature 'Amount' is the transaction amount; it can be used for example-dependent cost-sensitive learning.
* Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.
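
To illustrate what "example-dependent" cost could mean for the 'Amount' feature, here is a minimal sketch. The cost rule is an assumption made for illustration: a flagged transaction costs a fixed, hypothetical `ADMIN_COST`, and a missed fraud costs its full amount. The function name `transaction_cost` and the sample tuples are likewise made up and do not come from the dataset.

```python
ADMIN_COST = 2.5  # hypothetical fixed cost of investigating a flagged transaction


def transaction_cost(y_true, y_pred, amount, admin_cost=ADMIN_COST):
    """Cost of one prediction under the assumed rule described above."""
    if y_pred == 1:
        return admin_cost  # flagged: pay the investigation cost, fraud or not
    if y_true == 1:
        return amount      # missed fraud: lose the transaction amount
    return 0.0             # correctly passed normal transaction


# (true class, predicted class, amount) for four made-up transactions
transactions = [(0, 0, 20.0), (1, 0, 150.0), (1, 1, 900.0), (0, 1, 5.0)]
total_cost = sum(transaction_cost(*t) for t in transactions)
print(total_cost)
```

Under a rule like this, missing one large fraud outweighs many false alarms, which a class-level cost matrix cannot express.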

- - - 

## Exploratory Data Analysis (EDA)

In [None]:
data = pd.read_csv('../dataset/creditcard.csv')
data.head()

In [None]:
data.describe()

### Data balance

In [None]:
# Horizontal bar chart of class counts; the log scale keeps the rare fraud class visible
ax = data['Class'].value_counts().plot(kind='barh')
ax.set_xscale('log')
ax.grid(True)
ax.set_title('class balance');

### Spend distribution

How much do the amounts of normal (Class=0) and fraudulent (Class=1) transactions overlap?

In [None]:
g = sns.catplot(x="Class", y="Amount", kind='boxen', data=data);

In [None]:
# Drop very small amounts so the log-scaled y-axis is not dominated by near-zero values
minimum_spend = 0.5
filtered_data = data[data['Amount'] > minimum_spend]
g = sns.catplot(x="Class", y="Amount", kind='boxen', data=filtered_data);
g.set(yscale="log");

### Summary statistics

In [None]:
data.groupby('Class')['Amount'].describe().T

### Correlation with target ('Class')

In [None]:
cmap = sns.diverging_palette(240, 10, n=10)

# Pearson correlation of each feature with the binary target
correlation = data.drop(columns=['Time','Class']).corrwith(data['Class'])
f, ax = plt.subplots(figsize=(15, 5))
corr_df = pd.DataFrame(correlation).reset_index()
corr_df.columns = ['Variable', 'Correlation']
sns.barplot(x='Variable', y='Correlation', data=corr_df, palette=cmap, ax=ax)
sns.despine()

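A negative correlation means the feature tends to take lower values for fraudulent transactions (Class=1) than for normal ones (Class=0); a positive correlation means the opposite. The coefficient only captures linear association.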
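
To make the sign of the coefficient concrete, here is a minimal sketch on made-up numbers (not the credit card data). With a 0/1 target, the Pearson coefficient is negative exactly when the feature's mean is lower in class 1 than in class 0. The helper `pearson` and the sample values are illustrative assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Made-up data: eight normal rows followed by two fraud rows.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
feature = [1.0, 0.5, 0.8, 1.2, 0.9, 1.1, 0.7, 1.0, -2.0, -3.0]

mean_normal = sum(f for f, y in zip(feature, labels) if y == 0) / labels.count(0)
mean_fraud = sum(f for f, y in zip(feature, labels) if y == 1) / labels.count(1)
r = pearson(feature, labels)

print(f"mean (Class=0): {mean_normal:.2f}")
print(f"mean (Class=1): {mean_fraud:.2f}")
print(f"correlation with Class: {r:.2f}")
```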

In [None]:
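# Feature-to-feature correlation matrix, computed separately for each class
x = data.groupby('Class').corr()

# Mask the upper triangle (np.bool was removed in NumPy 1.24; use the builtin bool)
mask = np.triu(np.ones_like(x.loc[0], dtype=bool))
cmap = sns.diverging_palette(240, 10, as_cmap=True, n=3)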

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 12))
sns.heatmap(x.loc[0], mask=mask, cmap=cmap, vmax=1, vmin=-1, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .25}, ax=ax1)
ax1.set_title('Class (0)')

sns.heatmap(x.loc[1], mask=mask, cmap=cmap, vmax=1, vmin=-1, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .25}, ax=ax2)
ax2.set_title('Class (1)');

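Note that the features in the majority class are almost uncorrelated with one another. This is a consequence of the PCA transformation used to build V1, V2, … V28: principal components are uncorrelated over the full dataset, and since Class 0 makes up more than 99.8% of the rows, its within-class correlations stay close to zero.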
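
The sketch below illustrates why PCA outputs are uncorrelated, using synthetic two-dimensional data rather than the credit card features. In two dimensions the principal axes can be found with a closed-form rotation angle, so no linear algebra library is needed. All names and values here are assumptions for illustration.

```python
import math
import random

random.seed(0)

# Synthetic, strongly correlated 2-D data (not the credit card features).
xs = [random.gauss(0, 1) for _ in range(2000)]
ys = [0.8 * x + random.gauss(0, 0.5) for x in xs]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Centre the data and compute the entries of the 2x2 covariance matrix.
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
cx = [x - mean_x for x in xs]
cy = [y - mean_y for y in ys]
sxx = sum(u * u for u in cx) / n
syy = sum(v * v for v in cy) / n
sxy = sum(u * v for u, v in zip(cx, cy)) / n

# Rotation angle that diagonalises the covariance matrix (2-D PCA).
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
c, s = math.cos(theta), math.sin(theta)
pc1 = [u * c + v * s for u, v in zip(cx, cy)]
pc2 = [-u * s + v * c for u, v in zip(cx, cy)]

r_before = pearson(xs, ys)
r_after = pearson(pc1, pc2)
print(f"correlation before PCA: {r_before:.3f}")
print(f"correlation after PCA:  {r_after:.3e}")
```

The rotated coordinates are uncorrelated over the whole sample; a subset that dominates the sample inherits that property approximately, while a small subset need not.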