## Kampus Merdeka 6: IBM & Skilvul
## Capstone Banking & Finance Challenge

### Problem Definition: Fraud Detection App

### Background

**Fraud in financial systems, including credit card transactions, online banking, and e-commerce, poses a significant threat to financial institutions and their customers. It can cause large financial losses and damage an institution's reputation. As online transactions and digital banking grow, detecting fraudulent activity becomes increasingly complex and challenging. There is therefore an urgent need for an effective fraud detection system that can identify and mitigate fraudulent activity in real time.**

### Objective

**The goal is to detect fraud by predicting the binary value (0 or 1) in the isFraud column.**

### Data

**We use data sourced from Kaggle, available at the following link (https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data/data).**

### Method

**The method used is binary classification.**

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier

### Get Data

In [None]:
data = pd.read_csv('../Fraud.csv')

### Clean Data

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.shape

In [None]:
data.isnull().sum()

In [None]:
data.duplicated().sum()

In [None]:
data.describe().T

In [None]:
fraud_counts = data['isFraud'].value_counts()

print(fraud_counts)

In [None]:
data['amount'].describe()

In [None]:
data.isFraud.nunique()

### Explore Data (EDA)

In [None]:
# Keep only the numeric columns, since correlation is not defined for text columns
data_numeric = data.select_dtypes(include=[np.number])

fig, ax = plt.subplots(figsize=(21,10))
sns.set_context('poster')
corr = data_numeric.corr()
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,cmap='gist_rainbow', annot = True)
ax.set_title('Collinearity of Feature Attributes')
plt.savefig('cormap.png')


In [None]:
# Distribution graphs (histogram/bar graph) of column data
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    nunique = df.nunique()
    df = df[[col for col in df if nunique[col] > 1 and nunique[col] < 50]] # For displaying purposes, pick columns that have between 1 and 50 unique values
    nRow, nCol = df.shape
    columnNames = list(df)
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow # integer division: plt.subplot needs an integer row count
    plt.figure(num = None, figsize = (6 * nGraphPerRow, 8 * nGraphRow), dpi = 80, facecolor = 'w', edgecolor = 'k')
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if (not np.issubdtype(type(columnDf.iloc[0]), np.number)):
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            columnDf.hist()
        plt.ylabel('counts')
        plt.xticks(rotation = 90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad = 1.0, w_pad = 1.0, h_pad = 1.0)
    plt.show()

In [None]:
# Correlation matrix
def plotCorrelationMatrix(df, graphWidth):
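    filename = getattr(df, 'dataframeName', 'DataFrame') # fall back to a generic name if the attribute was never set
    df = df.select_dtypes(include=[np.number]) # correlation is only defined for numeric columns
    df = df.dropna(axis='columns') # drop columns with NaN
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values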
    if df.shape[1] < 2:
        print(f'No correlation plots shown: The number of non-NaN or constant columns ({df.shape[1]}) is less than 2')
        return
    corr = df.corr()
    plt.figure(num=None, figsize=(graphWidth, graphWidth), dpi=80, facecolor='w', edgecolor='k')
    corrMat = plt.matshow(corr, fignum = 1)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.gca().xaxis.tick_bottom()
    plt.colorbar(corrMat)
    plt.title(f'Correlation Matrix for {filename}', fontsize=15)
    plt.show()

In [None]:
# Scatter and density plots
def plotScatterMatrix(df, plotSize, textSize):
    df = df.select_dtypes(include =[np.number]) # keep only numerical columns
    # Remove rows and columns that would lead to df being singular
    df = df.dropna(axis='columns')
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    columnNames = list(df)
    if len(columnNames) > 10: # reduce the number of columns for matrix inversion of kernel density plots
        columnNames = columnNames[:10]
    df = df[columnNames]
    ax = pd.plotting.scatter_matrix(df, alpha=0.75, figsize=[plotSize, plotSize], diagonal='kde')
    corrs = df.corr().values
    for i, j in zip(*np.triu_indices_from(ax, k = 1)):
        ax[i, j].annotate('Corr. coef = %.3f' % corrs[i, j], (0.8, 0.2), xycoords='axes fraction', ha='center', va='center', size=textSize)
    plt.suptitle('Scatter and Density Plot')
    plt.show()

In [None]:
# Percentage of transactions labeled as fraud (isFraud == 1) out of all transactions
percent = data['isFraud'].mean() * 100
print(f"Percentage of Fraudulent Transactions in the Dataset: {percent:.4f}%")

In [None]:
data["type"].nunique()

In [None]:
data["type"].unique()

### Feature Engineering

In [None]:
data.drop(['nameOrig', 'nameDest'], axis=1, inplace=True)

In [None]:
df = data.copy(deep = True)

In [None]:
# get all categorical columns in the dataframe
catCols = [col for col in data.columns if data[col].dtype=="O"]

lb_make = LabelEncoder()  # LabelEncoder is already imported in the Import Libraries cell

for item in catCols:
    data[item] = lb_make.fit_transform(data[item])
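
`LabelEncoder` maps each category to an integer, which can suggest an ordering that the transaction types do not have. For linear models, one-hot encoding is a common alternative. The sketch below is only an illustration on a hypothetical toy frame: the category values and the `toy` and `encoded` names are assumptions, not output from this notebook.

```python
import pandas as pd

# Hypothetical toy frame; the category values are assumed for illustration only.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [10.0, 250.0, 90.0, 5.0],
})

# One indicator column per category, so no artificial ordering is implied.
encoded = pd.get_dummies(toy, columns=["type"], prefix="type")
print(encoded.columns.tolist())
```

Tree-based models such as Random Forest are usually less sensitive to this choice, so the integer encoding used above may be adequate there.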

In [None]:
data.head()

### Model Training

In [None]:
# The function below prints the main classification metrics and plots the confusion matrix as a heatmap.
def evaluate_model(y_test, y_pred):
    # zero_division=0 reports 0 without a warning when a model never predicts the positive class
    print("Accuracy Score: ", accuracy_score(y_test, y_pred))
    print("Precision Score: ", precision_score(y_test, y_pred, zero_division=0))
    print("Recall Score: ", recall_score(y_test, y_pred, zero_division=0))
    print("F1 Score: ", f1_score(y_test, y_pred, zero_division=0))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

    # Build the frame from plain arrays so the shuffled index of y_test plays no role
    results = pd.DataFrame({'y_Actual': np.asarray(y_test), 'y_Predicted': np.asarray(y_pred)})

    clf_confusion_matrix = pd.crosstab(results['y_Predicted'], results['y_Actual'], rownames = ['Predicted'], colnames=['Actual'])

    sns.heatmap(clf_confusion_matrix, annot=True, fmt='d')

### Model Selection

**Dummy Classifier**

In [None]:
X = data.drop('isFraud', axis=1)
y = data.isFraud


# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)
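
Because fraud cases are rare, a purely random split can leave the test set with a different fraud ratio than the training set. Passing `stratify` to `train_test_split` keeps the class ratio the same in both parts. The sketch below shows this on hypothetical toy data (the `X_toy` and `y_toy` arrays are assumptions); the split above does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 20 positives among 1000 rows (2%).
rng = np.random.default_rng(27)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 20 + [0] * 980)

# stratify=y_toy preserves the 2% positive rate in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=27, stratify=y_toy
)
print("positives in train:", y_tr.sum(), "| positives in test:", y_te.sum())
```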

In [None]:
# DummyClassifier to predict only target 0
dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)

# checking unique labels
print('Unique predicted labels: ', (np.unique(dummy_pred)))

# checking accuracy
evaluate_model(y_test, dummy_pred)

As we can see, the Dummy Classifier reaches 99.8% accuracy simply by always predicting the non-fraud class, but that is not our focus. We need to be able to predict fraudulent transactions accurately.
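
The gap between accuracy and recall for a majority-class baseline can be reproduced on a small example. The sketch below uses hypothetical toy labels (998 legitimate and 2 fraudulent transactions), not the notebook's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical toy labels: 998 legitimate (0) and 2 fraudulent (1) transactions.
y_true = np.array([0] * 998 + [1] * 2)

# A majority-class baseline predicts 0 for every transaction.
y_baseline = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_baseline)
baseline_recall = recall_score(y_true, y_baseline, zero_division=0)
print(f"accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.3f}")
```

The baseline is right on every legitimate transaction and wrong on every fraudulent one, which is why recall and F1 are more informative than accuracy for this problem.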

### Model Selection

**Logistic Regression**

In [None]:
# Modeling the data as is
# Train model
lr = LogisticRegression()
model1 = lr.fit(X_train, y_train)
 
# Predict on test set
lr_pred = model1.predict(X_test)

In [None]:
evaluate_model(y_test, lr_pred)

In [None]:
# Checking unique values
predictions = pd.DataFrame(lr_pred)
predictions[0].value_counts()

In [None]:
pd.DataFrame(confusion_matrix(y_test, lr_pred))

The Logistic Regression model performs reasonably well overall, but its recall is still very low. More work is needed on the dataset.
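
One common way to raise recall on imbalanced data is to reweight the classes during training. The sketch below is a hedged illustration on hypothetical synthetic data, not a result from this notebook: it compares a default `LogisticRegression` with one that uses `class_weight="balanced"`. The toy arrays and variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one feature, about 4% positives whose values are shifted by +2.
rng = np.random.default_rng(0)
X_toy = np.concatenate([
    rng.normal(0.0, 1.0, size=5000),
    rng.normal(2.0, 1.0, size=200),
]).reshape(-1, 1)
y_toy = np.array([0] * 5000 + [1] * 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=27, stratify=y_toy
)

# Same model twice; the second one weights errors on the rare class more heavily.
plain = LogisticRegression().fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"recall without weights: {recall_plain:.2f}")
print(f"recall with class_weight='balanced': {recall_weighted:.2f}")
```

Class weighting usually trades precision for recall, so both metrics should be checked before adopting it.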

### Model Selection

**Random Forest Classifier**

In [None]:
rfc = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)

# predict on test set
rfc_pred = rfc.predict(X_test)

evaluate_model(y_test, rfc_pred)

In [None]:
# Checking unique values
predictions = pd.DataFrame(rfc_pred)
predictions[0].value_counts()

In [None]:
pd.DataFrame(confusion_matrix(y_test, rfc_pred))

We can see that Random Forest has the best scores so far across the metrics, with a recall of 77% and an F1 score of 86%.

In [None]:
import pickle

In [None]:
filename = 'FraudDetect_model.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(rfc, f)
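
A saved model can be restored with `pickle.load` and used for prediction right away. The sketch below shows the round trip on a hypothetical toy model written to a temporary directory, so it does not touch `FraudDetect_model.sav`; the toy data and the `toy_model.sav` file name are assumptions.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy model standing in for the trained classifier.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)
toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.sav")
    # Save the fitted model ...
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    # ... and load it back from disk.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give the same predictions as the original.
same_predictions = np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.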

### Model Selection

The final model, which has the best score on all metrics, is the Random Forest Classifier.

### Conclusion

**Of the three reports above, the Random Forest Classifier is the one that performs well enough to be considered good, with a recall of 77% and an F1 score of 90%. Overall, the classification model has an accuracy of 99%.**