This dataset contains simulated money transfers with fraudulent activities. In this notebook we want to answer the above question. :)

First, we will have a look at basic statistics of the data including the quantiles of the numeric features and correlations.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import warnings
warnings.filterwarnings('ignore')

# Load the full transaction log; pass nrows (commented out below) to read only a subsample
df = pd.read_csv("../input/PS_20174392719_1491204439457_log.csv")#, nrows=int(1e6))
df.head()

In [None]:
df.info()

Describe the numeric features in terms of their quantiles.

What we can see here is that a few transactions have very large amounts. Also, the mean of isFraud is 0.00129, meaning there are ~1.3 frauds per 1000 transactions.

In [None]:
df.describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.99])
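
To make the step from the mean of isFraud to "frauds per 1000 transactions" explicit, here is a minimal sketch. The `toy` frame and its numbers are made up for illustration and are not taken from the dataset; only the column name `isFraud` comes from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real data: 1 marks a fraudulent transaction.
toy = pd.DataFrame({"isFraud": [0] * 997 + [1] * 3})

# The mean of a 0/1 column is the share of rows that equal 1.
fraud_rate = toy["isFraud"].mean()

# Scale the share to a count per 1000 transactions.
frauds_per_1000 = fraud_rate * 1000

print(f"fraud rate: {fraud_rate:.5f}, frauds per 1000: {frauds_per_1000:.1f}")
```

Applied to the real frame, the same two lines with `df` in place of `toy` should reproduce the rate read off the `describe` output.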

In [None]:
df_corr = df[['amount', 'oldbalanceOrg', 'oldbalanceDest', 'isFraud']]

data = [
    go.Heatmap(
        z=df_corr.corr().values,
        x=df_corr.columns.values,
        y=df_corr.columns.values,
        colorscale='Viridis',
        text=df_corr.corr().round(2).values,  # hover text: rounded coefficients
        opacity=1.0,
    )
]


layout = go.Layout(
    title='Pearson correlation of selected numeric features',
    #xaxis = dict(ticks='', nticks=36),
    #yaxis = dict(ticks='' ),
    #width = 900, height = 700,
    
)


fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='labelled-heatmap')
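
For readers who want to see what each cell of the heatmap contains, the sketch below computes one Pearson coefficient by hand and compares it with pandas. The `toy` values are invented for illustration; only the column names are borrowed from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
    "isFraud": [0, 0, 0, 1, 1],
})

x = toy["amount"].to_numpy()
y = toy["isFraud"].to_numpy(dtype=float)

# Pearson r: covariance of the centered values divided by the
# product of their standard deviations.
xc = x - x.mean()
yc = y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# pandas uses the Pearson method by default.
r_pandas = toy["amount"].corr(toy["isFraud"])

print(r_manual, r_pandas)
```

Because isFraud is binary and very rare in the real data, its Pearson coefficients with the other features are likely to be small even when the groups differ, so the heatmap is best read as a first look only.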

# Categorical data

To see how fraud is distributed across transaction types, we group on the type and the fraud indicator and count the rows in each group.

Interestingly, fraudulent activities occur only for the **CASH_OUT** and **TRANSFER** types.

In [None]:
# Number of rows for each (type, isFraud) combination
df.groupby(['type', 'isFraud']).size()
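
The grouped counts can also be reshaped into a small table with one row per type, which makes types without any fraud easy to spot. This is a sketch on a hypothetical `toy` frame; the rows are invented and only the column names and type labels come from the notebook.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame({
    "type": ["CASH_OUT", "CASH_OUT", "TRANSFER", "PAYMENT", "PAYMENT", "TRANSFER"],
    "isFraud": [1, 0, 1, 0, 0, 0],
})

# Count rows per (type, isFraud) combination.
counts = toy.groupby(["type", "isFraud"]).size()

# Move isFraud into the columns; combinations that never occur become 0.
table = counts.unstack(fill_value=0)

# Types that have at least one fraudulent row.
fraud_types = table.index[table[1] > 0].tolist()

print(table)
print(fraud_types)
```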

From here on, we will work with subsamples of the data, since the dataframe is quite large and we want the kernel to be fast.

In [None]:
df = df.sample(int(5e5))
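
Since the subsample is drawn at random, the plots that follow can change slightly between runs. A possible way to make them repeatable is to pass a seed via `random_state`; the sketch below assumes this is acceptable for the analysis and uses a synthetic `toy` frame with an invented 1% fraud rate, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: roughly 1% of rows are flagged as fraud.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"isFraud": (rng.random(100_000) < 0.01).astype(int)})

# A fixed random_state yields the same subsample on every run.
sample = toy.sample(n=10_000, random_state=42)
same = toy.sample(n=10_000, random_state=42)

# With a rare class, it is worth checking that the subsample keeps
# a fraud rate close to that of the full data.
print(toy["isFraud"].mean(), sample["isFraud"].mean())
```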

The boxplots of amount, restricted to transactions below 1e5 for readability, suggest that fraudulent activities tend to have larger amounts.

In [None]:
plt.figure(figsize=(12,8))
sns.boxplot(x = 'isFraud', y = 'amount', data = df[df.amount < 1e5])

In [None]:
plt.figure(figsize=(12,8))
sns.boxplot(hue = 'isFraud', x = 'type', y = 'amount', data = df[df.amount < 1e5])
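
The impression from the boxplots can be checked numerically by comparing quartiles of amount per fraud class. The sketch below uses a hypothetical `toy` frame with invented amounts, so its numbers say nothing about the real data; it only shows the shape of the computation.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame({
    "amount": [100.0, 200.0, 300.0, 400.0, 5000.0, 7000.0],
    "isFraud": [0, 0, 0, 0, 1, 1],
})

# Median amount per class, i.e. the center line of each box.
medians = toy.groupby("isFraud")["amount"].median()

# Quartiles per class, i.e. the box edges and the center line.
quartiles = toy.groupby("isFraud")["amount"].quantile([0.25, 0.5, 0.75]).unstack()

print(medians)
print(quartiles)
```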

In [None]:
# pairplot creates its own figure, so a preceding plt.figure call would only add an empty one
sns.pairplot(df[['amount', 'oldbalanceOrg', 'oldbalanceDest', 'isFraud']], hue='isFraud')

In [None]:
from scipy.stats import probplot
fig = plt.figure()
ax = fig.add_subplot(111)

# Q-Q plot of amount against a normal distribution (probplot's default)
probplot(df['amount'], plot=ax)