## Machine Learning Midterm Exam (UTS) - Fraud Detection (Classification)

**Name:** Agatha Kinanthi Pramdriswara Truly Amorta

**Class:** TK-46-04

**Student ID (NIM):** 1103223212


- *This notebook is part of the midterm assignment for the Machine Learning course.*
- *The objective is to build an end-to-end pipeline to detect fraudulent online transactions.*


**1. Imports**

In [None]:
!pip install imbalanced-learn

In [None]:
# Core libraries
import numpy as np
import pandas as pd
# Preprocessing & splitting
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Handle imbalance (install if needed)
from imblearn.over_sampling import SMOTE
# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Evaluation metrics
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix

**2. Mount Google Drive and Load Data**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Path to the dataset folder on Drive
datasets = '/content/drive/MyDrive/Machine-Learning-Midterm-Datasets/'

# Read only the first 200,000 rows of the training file
df = pd.read_csv(datasets + 'train_transaction.csv', nrows=200000)
print("Shape:", df.shape)
df.head()


**3. Exploratory Data Analysis (EDA)**

In [None]:
df.info()
df.describe().T.head()

In [None]:
print("Target distribution:")
print(df['isFraud'].value_counts(normalize=True))
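
The class shares printed above matter because plain accuracy is misleading on imbalanced data, which is why the notebook later reports ROC AUC instead. The sketch below is a standard-library illustration only: the label counts are made up (not taken from `train_transaction.csv`), and the helper names `class_shares`, `accuracy`, and `recall` are hypothetical.

```python
# Hypothetical labels: 1 = fraud, 0 = legitimate.
# The counts are invented for illustration, not the notebook's data.
labels = [1] * 35 + [0] * 965


def class_shares(y):
    """Return {class: fraction}, similar to value_counts(normalize=True)."""
    counts = {}
    for value in y:
        counts[value] = counts.get(value, 0) + 1
    return {value: n / len(y) for value, n in counts.items()}


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that the predictions actually catch."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total


shares = class_shares(labels)

# A "model" that always predicts the majority class (legitimate)
always_legit = [0] * len(labels)
baseline_accuracy = accuracy(labels, always_legit)
baseline_recall = recall(labels, always_legit)

print("Class shares:", shares)
print("Accuracy of always predicting 0:", baseline_accuracy)
print("Fraud recall of always predicting 0:", baseline_recall)
```

With these made-up counts, a model that never flags fraud still scores high accuracy while catching no fraud at all, so the accuracy number alone says little about fraud detection quality.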

**4. Preprocessing**

In [None]:
if 'TransactionID' in df.columns:
    df = df.drop(columns=['TransactionID'])

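# Define target column
target = 'isFraud'

# Keep only numeric (int64/float64) feature columns; non-numeric columns are dropped
num_cols = df.drop(columns=[target], errors='ignore') \
             .select_dtypes(include=['int64','float64']).columns.tolist()

# Fill missing values with 0 (a simple baseline imputation)
X = df[num_cols].fillna(0)
y = df[target]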

print("Number of numeric features:", len(num_cols))

**5. Train-Test Split and Handle Imbalance**

In [None]:
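# Stratified 80/20 split: both sets keep the same fraud ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Oversample the minority class on the training split only,
# so the test split keeps the original class distribution
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)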

print("Resampled training shape:", X_train_res.shape)
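
To make the resampling step less of a black box, here is a rough sketch of the idea behind SMOTE: each synthetic minority sample is placed at a random position on the line segment between a real minority sample and one of its nearest minority neighbours. This is a simplified illustration written with the standard library only, not the `imblearn` implementation; the function name `smote_like_oversample`, the toy points, and `k=2` are all assumptions made for the example.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared distance to `base`
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy 2-D minority class (hypothetical values) and a majority class of 10 samples
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (3.0, 3.0)]
n_majority = 10

# Generate enough synthetic points to match the majority class size
new_points = smote_like_oversample(minority, n_new=n_majority - len(minority))
balanced_minority = minority + new_points

print("Minority size after oversampling:", len(balanced_minority))
```

Because the synthetic points are interpolated from training samples, resampling before the split would leak information into the test set; resampling only `X_train`, as the cell above does, avoids that.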

**6. Baseline Model**

In [None]:
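# Train on the SMOTE-resampled data
clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf.fit(X_train_res, y_train_res)

# Evaluate on the untouched test split; column 1 holds the fraud probability
probs = clf.predict_proba(X_test)[:, 1]
print("ROC AUC Score:", roc_auc_score(y_test, probs))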
print(classification_report(y_test, (probs>0.5).astype(int)))
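
The two printed results measure different things: ROC AUC depends only on how the scores rank fraud above legitimate transactions, while the classification report depends on the 0.5 cut-off. The sketch below illustrates this with the standard library; the toy labels and scores are invented, and `pairwise_auc` and `confusion_counts` are hypothetical helpers, not scikit-learn functions (the counts follow the same tn, fp, fn, tp order as `confusion_matrix(...).ravel()`).

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (fraud, legit) pairs in which the fraud
    case receives the higher score; ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion_counts(y_true, scores, threshold=0.5):
    """Return (tn, fp, fn, tp) for the decision rule score > threshold."""
    tn = fp = fn = tp = 0
    for t, s in zip(y_true, scores):
        predicted = 1 if s > threshold else 0
        if t == 1 and predicted == 1:
            tp += 1
        elif t == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Toy labels and predicted fraud probabilities (invented for illustration)
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.40, 0.45, 0.60, 0.70, 0.90]

auc = pairwise_auc(y_true, scores)  # 14 of the 15 pairs are ranked correctly
print("Pairwise AUC:", auc)

# The same scores give different error counts at different thresholds
for threshold in (0.5, 0.3):
    print(threshold, confusion_counts(y_true, scores, threshold))
```

On this toy data, lowering the threshold from 0.5 to 0.3 removes the missed fraud case at the cost of one more false alarm, while the AUC stays the same. The threshold is therefore a separate choice from the model's ranking quality.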

**7. Submission File**

In [None]:
# Load test dataset (first 200,000 rows only, so the submission covers just those rows)
test_df = pd.read_csv(datasets + 'test_transaction.csv', nrows=200000)

# Use exactly the training feature columns, in the same order;
# a column missing from the test file becomes all-NaN and is filled with 0
test_features = test_df.reindex(columns=num_cols).fillna(0)
predict = clf.predict_proba(test_features)[:, 1]

# Build submission DataFrame
submission = pd.DataFrame({
    'TransactionID': test_df['TransactionID'],
    'isFraud': predict
})

# Save submission file to Google Drive
submission.to_csv(datasets + 'submission_fraud.csv', index=False)

# Preview the first rows (last expression in the cell, so it is displayed)
submission.head()

### **Conclusion**

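- The baseline Random Forest model, trained on the numeric features of the first 200,000 training rows, achieved a ROC-AUC score of about 0.895 (*0.8954709909894493*) on the held-out 20% test split.
- Class imbalance was handled with SMOTE, applied to the training split only so that the test split keeps the original class distribution.
- Further improvements can be made by testing other models (for example the imported but unused `LogisticRegression`), using the non-numeric features that were dropped here, and tuning hyperparameters.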