# IEEE-Computational Intelligence Society Fraud Detection

**Goal:** In this competition you are predicting the probability that an online transaction is fraudulent, as denoted by the binary target ```isFraud```. 

[Kaggle Competition](https://www.kaggle.com/competitions/ieee-fraud-detection/overview)

## Evaluation Metric

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
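
Because ROC AUC depends only on how the predictions rank the observations, rescaling the scores without changing their order leaves it unchanged. The following is a minimal sketch on made-up labels and scores (not competition data), using scikit-learn's `roc_auc_score`:

```python
from sklearn.metrics import roc_auc_score

# Toy labels and scores, made up for illustration (not competition data)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (fraud, non-fraud) pairs are ranked correctly, so AUC = 0.75
auc = roc_auc_score(y_true, y_score)
print(f"ROC AUC: {auc:.2f}")

# Halving every score preserves the ordering, so the AUC stays the same
auc_rescaled = roc_auc_score(y_true, [s / 2 for s in y_score])
print(f"ROC AUC after rescaling: {auc_rescaled:.2f}")
```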

## Submission File

For each ```TransactionID``` in the test set, you must predict a probability for the ```isFraud``` variable. The file should contain a header and have the following format:

```
TransactionID,isFraud
3663549,0.5
3663550,0.5
3663551,0.5
etc.
```

## Dataset Description

The data is broken into two files ```identity``` and ```transaction```, which are joined by ```TransactionID```. Not all transactions have corresponding identity information.
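
One way to see how many transactions lack identity information is a left join with `indicator=True`. The sketch below uses two small made-up tables as stand-ins for the real files; only the `TransactionID` key and the `DeviceType` and `TransactionAmt` column names come from the dataset.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (values are made up)
transactions = pd.DataFrame(
    {"TransactionID": [1, 2, 3, 4], "TransactionAmt": [68.5, 29.0, 59.0, 50.0]}
)
identity = pd.DataFrame(
    {"TransactionID": [2, 4], "DeviceType": ["mobile", "desktop"]}
)

# A left join keeps every transaction; indicator=True adds a "_merge" column
# that records whether a matching identity row was found
joined = transactions.merge(identity, how="left", on="TransactionID", indicator=True)
has_identity = joined["_merge"].eq("both")

print(f"Transactions with identity info: {has_identity.mean():.0%}")
print(joined)
```

Rows without a match keep their transaction columns and get `NaN` in every identity column.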

### Categorical Features - Transaction
* ```ProductCD```
* ```card1``` - ```card6```
* ```addr1```, ```addr2```
* ```P_emaildomain```
* ```R_emaildomain```
* ```M1``` - ```M9```

### Categorical Features - Identity
* ```DeviceType```
* ```DeviceInfo```
* ```id_12```-```id_38```

## Files

* **train_{transaction, identity}.csv** - the training set
* **test_{transaction, identity}.csv** - the test set (you must predict the ```isFraud``` value for these observations)
* **sample_submission.csv** - a sample submission file in the correct format

## Data Loading and Preprocessing

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score, roc_auc_score, make_scorer
from sklearn.compose import ColumnTransformer
from collections import Counter
import xgboost as xgb

In [34]:
train_id = pd.read_csv('train_data/train_identity.csv')
train_tr = pd.read_csv('train_data/train_transaction.csv')
test_id = pd.read_csv('test_data/test_identity.csv')
test_tr = pd.read_csv('test_data/test_transaction.csv')

In [21]:
train_tr.head()

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
train_id.head()

Unnamed: 0,TransactionID,id_01,id_02,id_03,id_04,id_05,id_06,id_07,id_08,id_09,...,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2987004,0.0,70787.0,,,,,,,,...,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M
1,2987008,-5.0,98945.0,,,0.0,-5.0,,,,...,mobile safari 11.0,32.0,1334x750,match_status:1,T,F,F,T,mobile,iOS Device
2,2987010,-5.0,191631.0,0.0,0.0,0.0,0.0,,,0.0,...,chrome 62.0,,,,F,F,T,T,desktop,Windows
3,2987011,-5.0,221832.0,,,0.0,-6.0,,,,...,chrome 62.0,,,,F,F,T,T,desktop,
4,2987016,0.0,7460.0,0.0,0.0,1.0,0.0,,,0.0,...,chrome 62.0,24.0,1280x800,match_status:2,T,F,T,T,desktop,MacOS


In [36]:
# Join the two datasets (transaction, identity) on TransactionID for train and test 
train = train_tr.merge(train_id, how='left', on = 'TransactionID')
test = test_tr.merge(test_id, how='left',on='TransactionID')

# Separating our target feature
y_train = train['isFraud'].copy()
X_train = train.drop('isFraud', axis=1)

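# Copy the test features
X_test = test.copy()
# Optional: fill NaNs with a sentinel value (the -999 values in the
# X_train.head() output below come from a run with these lines enabled)
# X_train = X_train.fillna(-999)
# X_test = X_test.fillna(-999)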

In [97]:
print(y_train.value_counts())


isFraud
0    569877
1     20663
Name: count, dtype: int64


In [100]:
X_train.head()

Unnamed: 0,TransactionID,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,...,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2987000,86400,68.5,W,13926,-999.0,150.0,discover,142.0,credit,...,-999,-999.0,-999,-999,-999,-999,-999,-999,-999,-999
1,2987001,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,credit,...,-999,-999.0,-999,-999,-999,-999,-999,-999,-999,-999
2,2987002,86469,59.0,W,4663,490.0,150.0,visa,166.0,debit,...,-999,-999.0,-999,-999,-999,-999,-999,-999,-999,-999
3,2987003,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,debit,...,-999,-999.0,-999,-999,-999,-999,-999,-999,-999,-999
4,2987004,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,credit,...,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M


In [26]:
print(X_train.columns)
print(X_test.columns)

Index(['TransactionID', 'TransactionDT', 'TransactionAmt', 'ProductCD',
       'card1', 'card2', 'card3', 'card4', 'card5', 'card6',
       ...
       'id_31', 'id_32', 'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38',
       'DeviceType', 'DeviceInfo'],
      dtype='object', length=433)
Index(['TransactionID', 'TransactionDT', 'TransactionAmt', 'ProductCD',
       'card1', 'card2', 'card3', 'card4', 'card5', 'card6',
       ...
       'id-31', 'id-32', 'id-33', 'id-34', 'id-35', 'id-36', 'id-37', 'id-38',
       'DeviceType', 'DeviceInfo'],
      dtype='object', length=433)


In [38]:
# The identity columns are named inconsistently:
# id_x in train, id-x in test. Let's make them compatible:
X_test.columns = X_test.columns.str.replace('-', '_')

In [40]:
# As per the dataset description, "Not all transactions have corresponding
# identity information".
#
# To preprocess the joined datasets consistently, categorical features must be
# encoded the same way in the training and test sets. We therefore label-encode
# only the columns present in both, fitting each encoder on the combined
# train and test values so that every category receives a code.
for f in X_train.columns:
    # Check if the column exists in both X_train and X_test
    if f in X_test.columns:
        if X_train[f].dtype == 'object' or X_test[f].dtype == 'object': 
            lbl = preprocessing.LabelEncoder()
            lbl.fit(list(X_train[f].values) + list(X_test[f].values))
            X_train[f] = lbl.transform(list(X_train[f].values))
            X_test[f] = lbl.transform(list(X_test[f].values))
    else:
        # Optionally handle columns that exist in X_train but not in X_test
        print(f"Column {f} exists in X_train but not in X_test.")
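
The loop above fits each encoder on the union of train and test values, which works here because the test features are available up front. As a hedged alternative for settings where they are not, the sketch below fits scikit-learn's `OrdinalEncoder` on the training data only and maps unseen categories to `-1`. The toy frames and the choice of `-1` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical columns (made up); "american express" never appears in train
train_cat = pd.DataFrame({"card4": ["visa", "mastercard", "discover"]})
test_cat = pd.DataFrame({"card4": ["visa", "american express"]})

# Fit on train only; categories unseen at fit time are encoded as -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train_cat)
test_codes = encoder.transform(test_cat)

print(encoder.categories_[0])  # categories learned from train, in sorted order
print(test_codes.ravel())      # the unseen category becomes -1
```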


In [42]:
# Split the training data into training and hold-out validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.20)
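
The split above is random and unseeded, so the fraud rate in each part can drift slightly from run to run. A possible variant, sketched here on synthetic data with roughly the 3.5% positive rate seen in the full training set, passes `stratify` and `random_state` to `train_test_split`. The array sizes and the seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 35 positives (3.5%)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:35] = 1

# stratify=y keeps the class ratio the same in both parts;
# random_state makes the split reproducible
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=57
)

print(f"Train fraud rate:      {y_tr.mean():.3f}")
print(f"Validation fraud rate: {y_va.mean():.3f}")
```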

## Model Building and Cross-Validation

How rare are fraudulent transactions? We need to check if our labels are imbalanced before training.

In [38]:
print(y_train.value_counts())

isFraud
0    455944
1     16488
Name: count, dtype: int64


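Very rare! Only about 3.5% of the transactions are fraudulent, which indicates a heavily imbalanced dataset. Gradient-boosted trees such as XGBoost often cope reasonably well with this, and ROC AUC depends only on how the predictions are ranked. It is, however, something to keep in mind when using other learning models or threshold-based metrics.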
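
To put numbers on the imbalance, the short calculation below uses the class counts printed above. The ratio of negatives to positives is the heuristic the XGBoost documentation suggests for `scale_pos_weight`. Whether re-weighting actually improves AUC on this dataset is not tested here.

```python
# Class counts copied from the value_counts() output above
n_negative = 455944
n_positive = 16488

fraud_rate = n_positive / (n_negative + n_positive)

# Heuristic from the XGBoost docs: sum(negative) / sum(positive)
suggested_scale_pos_weight = n_negative / n_positive

print(f"Fraud rate: {fraud_rate:.2%}")
print(f"Suggested scale_pos_weight: {suggested_scale_pos_weight:.2f}")
```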

The ```XGBClassifier``` is a gradient boosting model that builds an ensemble of decision trees sequentially, with each new tree correcting the errors of the trees before it. It is well-suited to complex, non-linear relationships and often outperforms simpler models on tabular data. However, it requires more computation and parameter tuning to achieve good performance.

In [46]:
xgb_model = xgb.XGBClassifier(
    max_depth=8,              # Max depth of trees
    learning_rate=0.005,       # Learning rate (step size shrinkage)
    n_estimators=8000,         # Number of trees to be built
    subsample=0.8,            # Fraction of samples used per tree
    colsample_bytree=1,     # Fraction of features used per tree
    colsample_bylevel=0.8,    # Fraction of features per tree level
    min_child_weight=1,       # Minimum sum of instance weight in a child
    gamma=0.005,                # Minimum loss reduction required for split
    scale_pos_weight=1,       # Weight on the positive class (1 = no re-weighting)
    reg_alpha=0.4,           # L1 regularization
    reg_lambda=0.15,           # L2 regularization
    tree_method='hist',       # Use histogram-based algorithm
    random_state=57,          # Seed for reproducibility
    objective='binary:logistic',  # For binary classification
    eval_metric='auc',        # Evaluation metric
    n_jobs=-1                 # Use all available cores
)

In [48]:
xgb_model.fit(X_train, y_train)

## Assessing Model Performance

In [53]:
# Obtain predicted probabilities for the positive class
y_pred_proba_train = xgb_model.predict_proba(X_train)[:, 1]

# Calculate ROC AUC score
roc_auc = roc_auc_score(y_train, y_pred_proba_train)

print(f"Train ROC AUC: {roc_auc}")

Train ROC AUC: 0.9940969007501437


In [62]:
# Obtain predicted probabilities for the positive class
y_pred_proba_valid = xgb_model.predict_proba(X_valid)[:, 1]

# Calculate ROC AUC score
roc_auc_valid = roc_auc_score(y_valid, y_pred_proba_valid)

print(f"Validation ROC AUC: {roc_auc_valid}")

Validation ROC AUC: 0.9671422121132122


The Train ROC AUC of 0.9941 and the Validation ROC AUC of 0.9671 indicate that our XGBoost model performs very well on both the training and validation datasets. There is mild evidence of overfitting: the train score exceeds the validation score by about 0.027, which is to be expected, and the gap is not very large. There is potentially room for improvement via additional hyperparameter tuning.

However, given the size of this dataset, it is not feasible (on my computer) to run an exhaustive grid search over the hyperparameters. Given more resources, I would adjust the following parameters:

* **Tree Parameters:**
  * ```max_depth```: Controls the maximum depth of a tree. Lower values prevent overfitting.
  * ```min_child_weight```: Minimum sum of instance weight needed in a child. Higher values prevent overfitting.
  * ```gamma```: Minimum loss reduction required to make a further partition. Higher values make the algorithm more conservative.

* **Regularization Parameters:**
  * ```reg_alpha``` (L1 regularization/Lasso) and ```reg_lambda``` (L2 regularization/Ridge): Help in reducing overfitting.
* **Learning Rate and Estimators:**
  * ```learning_rate```: Lower values make the model more robust but require more trees.
  * ```n_estimators```: Number of trees. Can be adjusted in conjunction with ```learning_rate```.
* **Subsampling Parameters**:
  * ```subsample```: Fraction of samples to be used for fitting the individual trees.
  * ```colsample_bytree```: Fraction of features to be used for each tree.
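
When an exhaustive grid is too expensive, randomly sampling a handful of configurations is a cheaper option. The sketch below draws candidate settings for the parameters listed above with scikit-learn's `ParameterSampler`. The search ranges are illustrative guesses, not tuned values, and no model is trained here.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.model_selection import ParameterSampler

# Hypothetical search ranges; the bounds are illustrative assumptions
search_space = {
    "max_depth": randint(4, 11),             # integers 4..10
    "min_child_weight": randint(1, 11),      # integers 1..10
    "gamma": loguniform(1e-3, 1.0),
    "reg_alpha": loguniform(1e-2, 10.0),
    "reg_lambda": loguniform(1e-2, 10.0),
    "learning_rate": loguniform(5e-3, 0.1),
    "subsample": uniform(0.6, 0.4),          # uniform on [0.6, 1.0]
    "colsample_bytree": uniform(0.6, 0.4),   # uniform on [0.6, 1.0]
}

# Draw 5 random configurations; random_state makes the draw reproducible
candidates = list(ParameterSampler(search_space, n_iter=5, random_state=57))
for params in candidates:
    print(params)
```

Each sampled dictionary could then be passed as `xgb.XGBClassifier(**params)`, fitted on the training split, and scored by validation ROC AUC, keeping the best configuration.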

## Saving Submission File

In [64]:
sample_submission = pd.read_csv('sample_submission.csv')
sample_submission['isFraud'] = xgb_model.predict_proba(X_test)[:,1]
sample_submission.to_csv('submission_xgboost.csv',index=False)
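
The cell above assigns predictions by position, which relies on the rows of `X_test` being in the same order as `sample_submission.csv`. A few cheap checks before uploading can catch ordering or formatting slips. The sketch below runs them on a toy frame: the IDs come from the sample format shown earlier, and the probabilities are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: IDs from the sample format, probabilities made up
test_ids = pd.Series([3663549, 3663550, 3663551], name="TransactionID")
predicted = np.array([0.02, 0.91, 0.10])

submission = pd.DataFrame({"TransactionID": test_ids, "isFraud": predicted})

# Basic sanity checks on the submission layout and values
assert list(submission.columns) == ["TransactionID", "isFraud"]
assert submission["TransactionID"].is_unique
assert not submission["isFraud"].isna().any()
assert submission["isFraud"].between(0, 1).all()

# Render as CSV text (pass a file path to to_csv to write a file instead)
csv_text = submission.to_csv(index=False)
print(csv_text)
```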