# Fraud Detection

Description:
1. step: represents a unit of time where 1 step equals 1 hour
2. type: type of online transaction
3. amount: the amount of the transaction
4. nameOrig: customer starting the transaction
5. oldbalanceOrg: balance before the transaction
6. newbalanceOrig: balance after the transaction
7. nameDest: recipient of the transaction
8. oldbalanceDest: initial balance of recipient before the transaction
9. newbalanceDest: the new balance of recipient after the transaction
10. isFraud: whether the transaction is fraudulent (1) or not (0)

# Data Preparation

In [None]:
import pandas as pd

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('PS_20174392719_1491204439457_log.csv')

In [None]:
df

# Exploratory Data Analysis

In [None]:
def preprocessing(df):
    """
    Objective:
    Summarize every column of a dataframe: number of unique values, percentage of
    null values, number of duplicated values, dtype, and the unique values themselves.
    df is a pandas DataFrame.
    """
    try:
        import pandas as pd
        variables = pd.DataFrame(columns=['Variable','Number of unique values','Percent of Null(%)','Number of Duplicated','Type','Values'])
        for i, var in enumerate(df.columns):
            variables.loc[i] = [var, df[var].nunique(),df[var].isnull().sum()/df.shape[0]*100,df[var].duplicated(keep=False).sum(),df[var].dtypes,df[var].unique()]
        return variables.set_index('Variable')
    except Exception as err:
        # Report the problem instead of silently swallowing every error
        print('Invalid input:', err)

In [None]:
preprocessing(df)

**Data Insight**

    1. There are no null values in the dataset
    2. A single user can make several transactions (nameOrig contains duplicates)
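
Both observations can be checked directly. The following is a minimal sketch on a small made-up dataframe; the `toy` data and the variable names are illustrative assumptions, and on the real data `toy` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the dataset
toy = pd.DataFrame({
    'nameOrig': ['C1', 'C2', 'C1', 'C3'],
    'amount': [100.0, 250.0, 75.0, 30.0],
})

# Number of null values per column (all zeros means no missing data)
null_counts = toy.isnull().sum()

# Senders that appear in more than one transaction
tx_per_sender = toy['nameOrig'].value_counts()
repeat_senders = tx_per_sender[tx_per_sender > 1]

print(null_counts)
print(repeat_senders)
```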

In [None]:
round(df.describe())

In [None]:
df.duplicated().sum()

## Univariate Analysis

### Transaction Time

In [None]:
num_days = 7
num_hours = 24
df['days'] = df['step']%num_days
df['hours'] = df['step']%num_hours

In [None]:
df['day_trans'] = df['step']/24

In [None]:
df['day_trans'] = round(df['day_trans'])
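
As a hedged alternative to the cells above, the sketch below derives the hour of day and a day index from `step` with modulo and floor division. The column names `hour_of_day`, `day_index`, and `day_rounded`, as well as the toy steps, are assumptions for illustration. Floor division keeps every step in the day it falls in, whereas rounding `step/24` moves steps from the second half of a day into the next day.

```python
import pandas as pd

# Hypothetical steps (1 step = 1 hour)
toy = pd.DataFrame({'step': [1, 12, 13, 24, 25, 49]})

toy['hour_of_day'] = toy['step'] % 24             # values in 0..23
toy['day_index'] = toy['step'] // 24              # whole days elapsed
toy['day_rounded'] = (toy['step'] / 24).round()   # rounding, as in the cells above

print(toy)
```

In this toy data, step 13 lands on day 0 with floor division but on day 1 with rounding, which is worth keeping in mind when reading the per-day plots.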

In [None]:
time = ['step','days','hours','day_trans']

In [None]:
for num in time:
    # Create the figure before plotting so the histogram is drawn on it
    plt.figure(figsize=(3,3))
    sns.histplot(df[num])
    plt.show()

**Data Insight**

    1. The number of transactions has the same frequency for each day
    2. Most transactions occur after 10 a.m.
    3. More transactions occur from day 1 through day 18

### Transaction Amount

In [None]:
round(df['amount'].describe())

**Transaction Data**

    1. There are 6,362,620 transactions over 32 days
    2. The mean transaction amount is Rp 179,862
    3. The maximum transaction amount is Rp 92,445,517

In [None]:
sns.kdeplot(df['amount'])

**Data Insight**

    1. The distribution of amount is strongly right-skewed (positive skew) because a small number of transactions have very large amounts
    2. For further analysis, amount can be binned into 4 categories: Low, Mid, High, Very High
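
Quartile-based binning like this can also be sketched with `pd.qcut`, which computes the percentile edges itself. The toy amounts below are made up for illustration; the labels follow the four categories used in this notebook.

```python
import pandas as pd

# Hypothetical, right-skewed amounts
amounts = pd.Series([10, 20, 30, 40, 50, 60, 70, 1_000_000], name='amount')

labels = ['Low', 'Mid', 'High', 'Very High']

# q=4 splits the data at the 25th, 50th and 75th percentiles
amount_bins = pd.qcut(amounts, q=4, labels=labels)

# Each quartile bin holds the same number of rows
counts = amount_bins.value_counts().sort_index()
print(counts)
```

If many amounts are identical, percentile edges can coincide and `pd.qcut` raises an error unless `duplicates='drop'` is passed, in which case the number of labels has to match the remaining bins.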

In [None]:
bins = [0,np.percentile(df['amount'],25),np.percentile(df['amount'],50),np.percentile(df['amount'],75),np.percentile(df['amount'],100)]

In [None]:
kategori = ['Low','Mid','High','Very High']

In [None]:
df['amount_bins'] = pd.cut(df['amount'], bins, labels=kategori, include_lowest=True)

In [None]:
amount_cat = df.groupby('amount_bins').agg({'amount':('min','max')}).reset_index()
amount_cat.columns = ['kategori','minimum amount','maximum amount']
amount_cat

In [None]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='amount_bins',data=df)

In [None]:
amount_sum = df.groupby('amount_bins')['amount'].sum().reset_index()
amount_sum.columns = ['kategori','jumlah transaksi']

In [None]:
sns.barplot(x='kategori',y='jumlah transaksi',data=amount_sum)

In [None]:
sns.catplot(x='amount_bins',y='amount',data=df,kind='box')

**Data Insight**

    1. Very High transactions have very large amounts

In [None]:
df_cat = df[~(df['amount_bins']=='Very High')]

In [None]:
sns.catplot(x='amount_bins',y='amount',data=df_cat,kind='box')

In [None]:
sns.histplot(df_cat['amount'])

**Data Insight**

    1. Most transactions are below 50,000

### Sender

In [None]:
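# Count transactions per sender; name the count column explicitly so the result
# does not depend on the pandas version
sender = df['nameOrig'].value_counts().rename_axis('nameOrig').reset_index(name='n_transactions')

In [None]:
sender.sort_values('n_transactions',ascending=False).head(5)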

**Data Insight**

    1. A single customer makes at most 3 transactions in 32 days

In [None]:
sender_amount = df.groupby('nameOrig')['amount'].sum().reset_index()

In [None]:
sender_amount.sort_values('amount',ascending=False)

In [None]:
sender_amount[sender_amount['amount']<=0].count()

**Data Insight**

    1. There are 16 customers whose total transaction amount is Rp 0 or less

### Recipient

In [None]:
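# Count transactions per recipient; name the count column explicitly so the result
# does not depend on the pandas version
receiver = df['nameDest'].value_counts().rename_axis('nameDest').reset_index(name='n_transactions')

In [None]:
receiver.sort_values('n_transactions',ascending=False).head(10)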

**Data Insight**

    1. There are 7 recipients involved in more than 100 transactions

### Transaction Type

In [None]:
type_count = df.groupby('type')['amount'].count().reset_index()

In [None]:
type_count.columns = ['type','banyaknya transaksi']

In [None]:
sns.barplot(x='type',y='banyaknya transaksi',data=type_count)

**Data Insight**

    1. Debit and transfer transactions are rare

In [None]:
type_sum = df.groupby('type')['amount'].sum().reset_index()

In [None]:
type_sum.columns = ['type','jumlah transaksi']

In [None]:
sns.barplot(x='type',y='jumlah transaksi',data=type_sum)

**Data Insight**

    1. The total amount of transfer transactions exceeds Rp 50,000,000,000

## Bivariate Analysis with Fraud

In [None]:
df_fraud = df[df['isFraud']==1]

In [None]:
df_fraud.describe()

In [None]:
# Plot directly; wrapping the call in print() only prints the Axes object
df['isFraud'].value_counts().plot(kind='pie',autopct='%1.2f%%')

**Data Insight**

    1. There are 8,213 fraudulent transactions
    2. The probability that a transaction is fraudulent is 0.13%
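
The fraud share follows from the counts reported in this notebook: 8,213 / 6,362,620 ≈ 0.13%. The sketch below shows the same calculation on a made-up 0/1 label column; `is_fraud` and its values are illustrative assumptions, not the real data.

```python
import pandas as pd

# Hypothetical labels: 2 fraudulent transactions out of 1,000
is_fraud = pd.Series([1] * 2 + [0] * 998, name='isFraud')

fraud_count = int(is_fraud.sum())        # number of fraudulent transactions
fraud_rate_pct = is_fraud.mean() * 100   # the mean of a 0/1 column is the fraud share

# The same arithmetic with the figures reported in this notebook
reported_rate_pct = 8213 / 6362620 * 100

print(fraud_count, round(fraud_rate_pct, 2), round(reported_rate_pct, 2))
```

A share this small means the classes are heavily imbalanced, so plain accuracy would be a misleading metric for any model built on this data.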

In [None]:
df_fraud['amount'].sum()

In [None]:
df_fraud['amount'].sum()/df['amount'].sum()*100

**Data Insight**

    1. The total amount of fraudulent transactions is Rp 12,056,415,427, which is 1% of the total transaction amount

### Transaction Time

In [None]:
time = ['days','hours','day_trans']

In [None]:
for num in time:
    # Create the figure before plotting so the histogram is drawn on it
    plt.figure(figsize=(3,3))
    sns.histplot(df_fraud[num])
    plt.show()

**Data Insight**

    1. Fraud occurs frequently in transactions at 0:00 and 23:00

In [None]:
df['IsFraud'] = df['isFraud']

In [None]:
for j in range (0,len(time)):
    num = time[j]
    df['IsFraud_new'] = df['IsFraud']
    df_num = df.groupby([num])['IsFraud'].count()
    df_num = df_num.reset_index()
    df_num_corr = df[df['IsFraud']==1].groupby([num])['IsFraud_new'].count()
    df_num_corr = df_num_corr.reset_index()
    df_num_corr = df_num.merge(df_num_corr,how='left',on=num)
    df_num_corr['%IsFraud'] = (df_num_corr['IsFraud_new']/df_num_corr['IsFraud'])*100
    sns.lineplot(x=df_num_corr[num],y=df_num_corr['%IsFraud'])
    plt.xlabel(num)
    plt.ylabel('%IsFraud')
    plt.show()

**Data Insight**

    1. Transactions at hours 1-10 have a higher probability of fraud
    2. Transactions on days 2 through 4 have a very high probability of fraud
    3. Transactions on days 31 and 32 have a high probability of fraud
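
The per-group fraud percentage used in this section can be computed in a single `groupby`, without the count-and-merge steps. This is a minimal sketch: the helper name `fraud_rate_by`, its parameters, and the toy data are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def fraud_rate_by(data, column, target='isFraud'):
    """Return transaction count, fraud count and fraud percentage per group."""
    out = data.groupby(column)[target].agg(total='count', fraud='sum').reset_index()
    out['%IsFraud'] = out['fraud'] / out['total'] * 100
    return out

# Hypothetical toy data
toy = pd.DataFrame({
    'hours': [0, 0, 1, 1, 1, 2],
    'isFraud': [1, 0, 0, 0, 1, 0],
})

rates = fraud_rate_by(toy, 'hours')
print(rates)
```

Because the fraud count comes from a sum rather than a left merge, groups without any fraud get 0 instead of NaN, so no `dropna` or `fillna` step is needed afterwards.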

### Transaction Amount

In [None]:
num = 'amount_bins'

In [None]:
df['IsFraud_new'] = df['IsFraud']
df_num = df.groupby([num])['IsFraud'].count()
df_num = df_num.reset_index()
df_num_corr = df[df['IsFraud']==1].groupby([num])['IsFraud_new'].count()
df_num_corr = df_num_corr.reset_index()
df_num_corr = df_num.merge(df_num_corr,how='left',on=num)
df_num_corr['%IsFraud'] = (df_num_corr['IsFraud_new']/df_num_corr['IsFraud'])*100

In [None]:
df_num_corr = df_num_corr.dropna()

In [None]:
sns.barplot(x=df_num_corr['amount_bins'],y=df_num_corr['%IsFraud'])

**Data Insight**

    1. Very High transactions have a high fraud probability of 35%

In [None]:
df_very_high = df_fraud[df_fraud['amount_bins']=='Very High']

In [None]:
round(df_very_high['amount'].describe())

In [None]:
sns.histplot(df_very_high['amount'])

**Data Insight**

    1. Fraudulent transactions of Rp 10,000,000 occurred 400 times

### Sender

In [None]:
num = 'nameOrig'

In [None]:
df['IsFraud_new'] = df['IsFraud']
df_num = df.groupby([num])['IsFraud'].count()
df_num = df_num.reset_index()
df_num_corr = df[df['IsFraud']==1].groupby([num])['IsFraud_new'].count()
df_num_corr = df_num_corr.reset_index()
df_num_corr = df_num.merge(df_num_corr,how='left',on=num)
df_num_corr['%IsFraud'] = (df_num_corr['IsFraud_new']/df_num_corr['IsFraud'])*100

In [None]:
df_num_corr = df_num_corr.fillna(0)

In [None]:
df_num_corr.sort_values('%IsFraud',ascending=False)

**Transaction Data**

For customers who made fraudulent transactions, fraud accounts for 100% of their transactions