<a href="https://colab.research.google.com/github/nshamid/fraud_detection_in_transactions/blob/main/logistic_regression_fraud_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fraud Detection with Logistic Regression and SMOTE

In [None]:
# Import Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

## Download Dataset from GitHub

In [None]:
import requests
import zipfile

dataset_url = 'https://github.com/Deploy-Camp-Team-6/dataset/raw/refs/heads/main/fraud_detection.zip'
# Fail fast on network or HTTP errors instead of writing an error page to disk.
response = requests.get(dataset_url, timeout=30)
response.raise_for_status()
with open('dataset.zip', 'wb') as f:
    f.write(response.content)
with zipfile.ZipFile('dataset.zip', 'r') as zip_ref:
    zip_ref.extractall()
print("Dataset downloaded and extracted successfully.")

Dataset downloaded and extracted successfully.


## Load Dataset

In [None]:
df = pd.read_csv("fraud_detection.csv")
df.head()

   transaction_id  amount merchant_type device_type  label
0               1   46.93        travel      tablet      0
1               2  301.01     groceries     desktop      0
2               3  131.67        others      tablet      0
3               4   91.29   electronics     desktop      0
4               5   16.96        others      mobile      0


## Dataset Overview

In [None]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   transaction_id  1000 non-null   int64  
 1   amount          1000 non-null   float64
 2   merchant_type   1000 non-null   object 
 3   device_type     1000 non-null   object 
 4   label           1000 non-null   int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 39.2+ KB
None


In [None]:
print(df.isnull().sum())

transaction_id    0
amount            0
merchant_type     0
device_type       0
label             0
dtype: int64


In [None]:
print(df["label"].value_counts(normalize=True))

label
0    0.95
1    0.05
Name: proportion, dtype: float64
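
With a 95/5 split, plain accuracy is a weak yardstick. The short sketch below works through the arithmetic for a hypothetical classifier that always predicts class `0`; the counts are derived from the 1000 rows and the proportions printed above, and the "always predict 0" baseline is an illustration, not something the notebook trains.

```python
# Counts derived from df.info() (1000 rows) and the 5% fraud proportion above.
n_total = 1000
n_fraud = round(0.05 * n_total)
n_legit = n_total - n_fraud

# Hypothetical baseline: label every transaction as non-fraud (class 0).
baseline_accuracy = n_legit / n_total   # every legitimate row is "correct"
baseline_fraud_recall = 0 / n_fraud     # no fraud row is ever flagged

print(f"Baseline accuracy:     {baseline_accuracy:.2f}")
print(f"Baseline fraud recall: {baseline_fraud_recall:.2f}")
```

This is why the evaluation later in the notebook looks at precision, recall, and F1 for class `1` rather than at accuracy alone.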


## Drop the ID Column

In [None]:
df = df.drop(columns=["transaction_id"])

## Separate Features and Label

In [None]:
X = df.drop(columns=["label"])
y = df["label"]

## Define Numeric and Categorical Columns

In [None]:
numeric_features = ["amount"]
categorical_features = ["merchant_type", "device_type"]

## Preprocessing Pipeline

In [None]:
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown='ignore'), categorical_features)
])
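
To make the two transformations concrete, here is a minimal pure-Python sketch of what they compute, using the five rows shown by `df.head()`. It is a simplified illustration, not scikit-learn's implementation: the helper names `standard_scale`, `fit_categories`, and `one_hot` are hypothetical, and `"crypto"` is a made-up unseen category used to show the effect of `handle_unknown='ignore'`.

```python
from statistics import mean, pstdev

# Values taken from the df.head() output.
amounts = [46.93, 301.01, 131.67, 91.29, 16.96]
merchant_types = ["travel", "groceries", "others", "electronics", "others"]

def standard_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def fit_categories(values):
    """Collect the distinct categories seen during fitting, in sorted order."""
    return sorted(set(values))

def one_hot(value, categories):
    """One column per known category; an unknown value yields all zeros."""
    return [1.0 if value == c else 0.0 for c in categories]

scaled = standard_scale(amounts)
categories = fit_categories(merchant_types)

print(categories)
print(one_hot("travel", categories))
print(one_hot("crypto", categories))  # unseen category -> all-zero row
```

The all-zero row is the behavior to keep in mind when a user later types a merchant or device type that never appeared in the training data: the pipeline will not raise an error, it will simply encode that field as "none of the known categories".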

## Pipeline: SMOTE + Logistic Regression

In [None]:
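# imblearn's Pipeline applies the SMOTE step only during fit; predict skips it.
pipeline = ImbPipeline(steps=[
    ("preprocessor", preprocessor),
    ("smote", SMOTE(random_state=42)),
    # class_weight="balanced" is effectively redundant here: SMOTE has already
    # equalized the class counts in the training data.
    ("classifier", LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42))
])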
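
The core idea behind SMOTE is interpolation: a synthetic minority sample is placed somewhere on the line segment between an existing minority sample and one of its minority-class neighbors. The sketch below illustrates only that idea. It is not imbalanced-learn's implementation: it uses a single nearest neighbor instead of choosing among several, and the 2-D points and the helper names `nearest_neighbor` and `smote_like_sample` are hypothetical.

```python
import random

# Hypothetical minority-class points in a 2-D feature space (illustrative only).
minority = [(1.0, 2.0), (1.5, 1.8), (3.0, 3.5), (2.8, 3.9)]

def nearest_neighbor(point, others):
    """Return the point in `others` with the smallest squared Euclidean distance."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)))

def smote_like_sample(points, rng):
    """Create one synthetic point between a random point and its nearest neighbor."""
    i = rng.randrange(len(points))
    base = points[i]
    others = points[:i] + points[i + 1:]
    neighbor = nearest_neighbor(base, others)
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))

rng = random.Random(42)
synthetic = [smote_like_sample(minority, rng) for _ in range(4)]
for point in synthetic:
    print(point)
```

In the notebook's pipeline this step runs after the preprocessor, so the interpolation happens on the scaled amount and the one-hot columns, not on the raw values.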

## Splitting Dataset

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
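
Passing `stratify=y` keeps the 95/5 class ratio in both splits. The sketch below shows the idea with a hypothetical pure-Python helper, `stratified_split`, applied to a label list that mirrors the dataset's class counts. It is a simplified illustration under those assumptions, not scikit-learn's algorithm, and the same seed will not select the same rows as `train_test_split`.

```python
import random
from collections import Counter

def stratified_split(labels, test_size, seed):
    """Split indices so each class contributes the same fraction to the test set."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Label list mirroring the dataset: 950 legitimate rows, 50 fraud rows.
labels = [0] * 950 + [1] * 50
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)

test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

With these counts the test set holds 190 legitimate and 10 fraud rows, which matches the `support` column of the classification report further down.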

## Training Model

In [None]:
pipeline.fit(X_train, y_train)

## Model Evaluation

In [None]:
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      0.58      0.73       190
           1       0.06      0.50      0.11        10

    accuracy                           0.58       200
   macro avg       0.51      0.54      0.42       200
weighted avg       0.91      0.58      0.69       200
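
The class-`1` row of the report can be traced back to confusion-matrix counts. The notebook does not print a confusion matrix, so the counts below are an assumption: they were chosen because they reproduce the rounded precision, recall, F1, and accuracy shown above, and other nearby counts could round to the same figures.

```python
# Assumed confusion-matrix counts for the 200 test rows (not printed by the notebook).
tp = 5    # fraud correctly flagged
fn = 5    # fraud missed
fp = 79   # legitimate transactions flagged as fraud
tn = 111  # legitimate transactions correctly passed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Under these assumed counts, 79 of the 84 flagged transactions are false positives, which is what a precision of 0.06 means in practice.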



# Fraud Detection from User Input

In [None]:
def get_user_input():
    """Prompt for one transaction and return it as a single-row DataFrame."""
    print("Enter transaction data:")
    amount = float(input("Transaction amount: "))
    merchant_type = input("Merchant type (e.g. electronics, groceries, travel, etc.): ").strip().lower()
    device_type = input("Device type (e.g. desktop, mobile, tablet): ").strip().lower()

    return pd.DataFrame([{
        "amount": amount,
        "merchant_type": merchant_type,
        "device_type": device_type
    }])

user_input_df = get_user_input()

pred = pipeline.predict(user_input_df)
proba = pipeline.predict_proba(user_input_df)

label = "FRAUD" if pred[0] == 1 else "NO FRAUD"
# Column 1 of predict_proba is the probability of class 1 (fraud).
confidence = round(proba[0][1] * 100, 2)

print("\n===== TRANSACTION DETECTION RESULT =====")
print(f"Prediction: {label}")
print(f"Fraud probability: {confidence}%")

Enter transaction data:
Transaction amount: 300
Merchant type (e.g. electronics, groceries, travel, etc.): others
Device type (e.g. desktop, mobile, tablet): tablet

===== TRANSACTION DETECTION RESULT =====
Prediction: NO FRAUD
Fraud probability: 49.36%
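
The run above lands just under the line: for a binary `LogisticRegression`, `predict` returns class `1` only when the fraud probability exceeds 0.5, so 49.36% is labeled as not fraud. The sketch below makes that cut-off explicit with a hypothetical helper, `label_from_probability`. The alternative threshold of 0.4 is an arbitrary illustration, not a tuned or recommended value.

```python
def label_from_probability(fraud_probability, threshold=0.5):
    """Map a fraud probability to a label using an explicit decision threshold."""
    return "FRAUD" if fraud_probability > threshold else "NO FRAUD"

fraud_probability = 0.4936  # the probability reported in the run above

default_label = label_from_probability(fraud_probability)
lowered_label = label_from_probability(fraud_probability, threshold=0.4)

print(f"threshold=0.5 -> {default_label}")
print(f"threshold=0.4 -> {lowered_label}")
```

Lowering the threshold flags more transactions, which can raise recall but would push the already low precision down further, so any change should be checked against held-out data.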


# Save the Model Pipeline to a .pkl File

In [None]:
import joblib

joblib.dump(pipeline, "logistic_model.pkl")
print("Model pipeline saved successfully.")

Model pipeline saved successfully.


# Model Summary: Logistic Regression

* **Methodology:**
    * The model uses an `imbalanced-learn` pipeline (`ImbPipeline`) that combines preprocessing, imbalanced-data handling, and a `scikit-learn` `LogisticRegression` classifier.
    * **Preprocessing:** The numeric feature (`amount`) is standardized with `StandardScaler`, while the categorical features (`merchant_type`, `device_type`) are encoded with `OneHotEncoder`.
    * **Imbalance Handling:** `SMOTE` (Synthetic Minority Over-sampling Technique) oversamples the minority class to balance the class distribution in the training data. The classifier is also configured with `class_weight="balanced"`.

* **Performance on the Test Data (focus on class `1: Fraud`):**

| Metric    | Score  |
| :-------- | :----- |
| Precision | 0.06   |
| Recall    | 0.50   |
| F1-Score  | 0.11   |

* **Conclusion:**
    The model identifies **50% of the actual fraud transactions** in the test data (Recall = 0.50, i.e. 5 of the 10 fraud cases). The trade-off is very low precision (0.06): roughly 94% of the transactions flagged as fraud are in fact legitimate (*false positives*). Overall accuracy is 0.58, below the 0.95 that always predicting the majority class would reach on this test set.

