# 🕵️‍♂️ IEEE-CIS Fraud Detection - Team Project

Welcome to our team project on **Fraud Detection using the IEEE-CIS Dataset**, a real-world challenge involving identifying fraudulent online transactions.

## 🧑‍💻 Team Members:
- **Mostapha Abdulaziz**
- **Ahmed Imad**
- **Noha Ashraf**
- **Rana Ahmed**
- **Sondos Wael**

---

## 📌 Project Objective

The objective of this project is to detect fraudulent transactions using the **IEEE-CIS Fraud Detection Dataset**, a large, anonymized dataset of real-world online transactions. Using data preprocessing techniques and machine learning models, we aim to build an accurate fraud detection system that flags suspicious transactions.

---

## 📦 Dataset Overview

The dataset is provided by **Vesta Corporation** and hosted on **Kaggle**. It contains **anonymized transactional data** and **user-related identity information**. The dataset is divided into four main files:

### 1. `train_transaction.csv` & `test_transaction.csv`
These contain transaction-level features such as:
- `TransactionID` – unique ID for each transaction
- `TransactionDT` – time in seconds from a reference date
- `TransactionAmt` – amount of the transaction
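- `ProductCD` – product code of the transaction
- `card1`–`card6` – payment card information (e.g., card type, category, issuing bank, country)
- `addr1`, `addr2`, `dist1`, `dist2` – address and distance features
- `P_emaildomain`, `R_emaildomain` – purchaser and recipient email domains
- `C1`–`C14` – count-based features (anonymized)
- `D1`–`D15` – time deltas from prior events (e.g., days since a previous transaction)
- `M1`–`M9` – match flags (e.g., names on card and address)
- `V1`–`V339` – Vesta-engineered features (rankings, counts, and other entity relations)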
- `isFraud` (only in train) – target label: 1 if fraudulent, 0 otherwise

### 2. `train_identity.csv` & `test_identity.csv`
Contain additional information about:
- Device details (e.g., `DeviceType`, `DeviceInfo`)
- Browser data
- Network address and anonymized identity signals (`id_01` to `id_38`)

---

## 🧹 Data Cleaning & Preprocessing

The dataset requires extensive preprocessing due to:
- High number of null values
- Anonymized and encoded variables
- Mixed data types (numeric, categorical, textual)

We'll explore:
- Handling missing data
- Feature selection and dimensionality reduction
- Encoding of categorical variables
- Time feature extraction
- Device and browser parsing
- Merging identity and transaction data
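
As a minimal sketch of the time feature extraction step: `TransactionDT` is an offset in seconds from an undisclosed reference date, so only relative features (hour of day, day index, position in a 7-day cycle) can be derived. The helper name `add_time_features` and the `dt_*` column names are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with relative time features derived from TransactionDT."""
    out = df.copy()
    # Whole days elapsed since the (unknown) reference date
    out["dt_day"] = out["TransactionDT"] // SECONDS_PER_DAY
    # Hour within the day, 0-23
    out["dt_hour"] = (out["TransactionDT"] // SECONDS_PER_HOUR) % 24
    # Position in a 7-day cycle; not a calendar weekday, since the reference date is unknown
    out["dt_day_of_cycle"] = out["dt_day"] % 7
    return out


# Toy values only; real data comes from train_transaction.csv
sample = pd.DataFrame({"TransactionDT": [86400, 86401, 90000, 691200]})
features = add_time_features(sample)
```

Because the reference date is anonymized, these features capture periodic behavior (for example, activity at unusual hours) rather than calendar dates.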

---

## ⚙️ Our Approach

1. **Exploratory Data Analysis (EDA)**: Understand distribution, correlations, and missing values.
2. **Feature Engineering**: Create new features from device/browser info, time, emails, etc.
3. **Modeling**: Train models like XGBoost, LightGBM, and compare results.
4. **Evaluation**: Use metrics such as AUC-ROC, precision, recall to assess fraud detection performance.
5. **Interpretation**: Analyze important features and understand model decisions.

Further steps may be added as the project progresses.
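
To make the evaluation step concrete, here is a small pure-Python sketch of precision, recall, and AUC-ROC on toy labels and scores. The function names and the 0.5 decision threshold are illustrative assumptions; in practice a library routine such as scikit-learn's `roc_auc_score` would likely be used instead of this quadratic-time version.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (fraud) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def roc_auc(y_true, scores):
    """AUC as the probability that a random fraud case outscores a random
    non-fraud case, counting ties as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 4 legitimate transactions, 2 fraudulent ones
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
y_pred = [int(s >= 0.5) for s in scores]  # assumed threshold of 0.5

precision, recall = precision_recall(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

AUC-ROC does not depend on the threshold, whereas precision and recall do. On imbalanced data, these three are more informative than plain accuracy.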

---

## 📊 Why This Problem Matters

Fraud detection is critical for the financial industry. By working on this project, we simulate what it’s like to deal with:
- Imbalanced classification problems
- Real-world noise and data anonymization
- Behavioral pattern detection



# Mounting the drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


# Importing the dataset from Kaggle


In [None]:
from google.colab import files
files.upload()


Saving kaggle.json to kaggle (1).json


{'kaggle (1).json': b'{"username":"mostaphaabdulaziz","key":"<redacted>"}'}

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
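
One caveat with the upload cell: when `files.upload()` is the last expression in a cell, the notebook echoes the uploaded file contents, API key included. Assigning the result to a variable (for example `_ = files.upload()`) avoids that echo. As a hedged alternative to the shell commands, the sketch below writes the credentials file from Python with owner-only permissions. The helper name `write_kaggle_credentials` and the placeholder values are assumptions for illustration only.

```python
import json
import stat
import tempfile
from pathlib import Path


def write_kaggle_credentials(username: str, key: str, config_dir: Path) -> Path:
    """Write kaggle.json into config_dir, readable and writable by the owner only."""
    config_dir.mkdir(parents=True, exist_ok=True)
    cred_path = config_dir / "kaggle.json"
    cred_path.write_text(json.dumps({"username": username, "key": key}))
    cred_path.chmod(stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600
    return cred_path


# Demo in a temporary directory with placeholder values (not real credentials).
# In a real session, config_dir would be Path.home() / ".kaggle".
demo_dir = Path(tempfile.mkdtemp())
demo_path = write_kaggle_credentials("example-user", "example-key", demo_dir / ".kaggle")
demo_mode = stat.S_IMODE(demo_path.stat().st_mode)
```

The Kaggle CLI can also read credentials from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, which keeps the key out of the notebook entirely.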


In [None]:
!kaggle competitions download -c ieee-fraud-detection -p /content/drive/MyDrive/ieee-fraud


In [None]:
import zipfile
with zipfile.ZipFile('/content/drive/MyDrive/ieee-fraud/ieee-fraud-detection.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/drive/MyDrive/ieee-fraud')



---




# Loading the datasets

In [7]:
import zipfile

zip_path = '/content/drive/MyDrive/ieee-fraud/ieee-fraud-detection.zip'
extract_path = '/content/drive/MyDrive/ieee-fraud'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

In [8]:
import pandas as pd

train_transaction = pd.read_csv('/content/drive/MyDrive/ieee-fraud/train_transaction.csv')
train_identity = pd.read_csv('/content/drive/MyDrive/ieee-fraud/train_identity.csv')
test_transaction = pd.read_csv('/content/drive/MyDrive/ieee-fraud/test_transaction.csv')
test_identity = pd.read_csv('/content/drive/MyDrive/ieee-fraud/test_identity.csv')

train_transaction.head()

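|   | TransactionID | isFraud | TransactionDT | TransactionAmt | ProductCD | card1 | card2 | card3 | card4 | card5 | ... | V330 | V331 | V332 | V333 | V334 | V335 | V336 | V337 | V338 | V339 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2987000 | 0 | 86400 | 68.5 | W | 13926 | NaN | 150.0 | discover | 142.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2987001 | 0 | 86401 | 29.0 | W | 2755 | 404.0 | 150.0 | mastercard | 102.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2987002 | 0 | 86469 | 59.0 | W | 4663 | 490.0 | 150.0 | visa | 166.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2987003 | 0 | 86499 | 50.0 | W | 18132 | 567.0 | 150.0 | mastercard | 117.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2987004 | 0 | 86506 | 50.0 | H | 4497 | 514.0 | 150.0 | mastercard | 102.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |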


# Merge transaction and identity datasets

In [9]:
# Merge identity with transaction data using TransactionID
train_df = train_transaction.merge(train_identity, on='TransactionID', how='left')
test_df = test_transaction.merge(test_identity, on='TransactionID', how='left')
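
A left merge keeps every transaction, whether or not it has an identity record. The sketch below, on toy frames, shows two optional safeguards that `DataFrame.merge` provides: `validate` raises an error if `TransactionID` is duplicated on either side, and `indicator` records which rows found a match. Treating the relationship as one-to-one is an assumption here, and the toy values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "TransactionAmt": [68.5, 29.0, 59.0],
})
identity = pd.DataFrame({
    "TransactionID": [2],
    "DeviceType": ["mobile"],
})

merged = transactions.merge(
    identity,
    on="TransactionID",
    how="left",              # keep all transactions
    validate="one_to_one",   # raise if TransactionID repeats on either side
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Flag and share of transactions that have identity information
has_identity = merged["_merge"] == "both"
identity_share = has_identity.mean()
```

Rows without an identity match get `NaN` in every identity column, which is one source of the high null rates in the `id_*` features.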

# Understand and handle missing values

In [10]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Combine train_transaction and train_identity (if not already)
train = pd.merge(train_transaction, train_identity, on='TransactionID', how='left')
test = pd.merge(test_transaction, test_identity, on='TransactionID', how='left')

# Show null % for training data
def missing_summary(df, name=''):
    nulls = df.isnull().sum()
    null_pct = nulls / len(df) * 100
    missing_df = pd.DataFrame({'Missing Count': nulls, 'Missing %': null_pct})
    missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values(by='Missing %', ascending=False)
    print(f"\n🧼 Null Summary for {name} Dataset:")
    return missing_df

train_missing = missing_summary(train, 'Train')
test_missing = missing_summary(test, 'Test')

# Display top 30 null-heavy features (train)
train_missing.head(30)


🧼 Null Summary for Train Dataset:

🧼 Null Summary for Test Dataset:


| Column | Missing Count | Missing % |
|---|---|---|
| id_24 | 585793 | 99.196159 |
| id_25 | 585408 | 99.130965 |
| id_07 | 585385 | 99.12707 |
| id_08 | 585385 | 99.12707 |
| id_21 | 585381 | 99.126393 |
| id_26 | 585377 | 99.125715 |
| id_23 | 585371 | 99.124699 |
| id_22 | 585371 | 99.124699 |
| id_27 | 585371 | 99.124699 |
| dist2 | 552913 | 93.628374 |


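The preprocessing plan:

- Drop columns with very high missingness (~99%)
- Add missing-value flags for selected columns before imputing them
- Impute the remaining high-missing categorical features with a `'missing'` category
- Impute the remaining high-missing numerical features with the training-set median
- Handle the V-features in bulk, also with the training-set median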

In [13]:
def preprocess_data(train, test):
    import numpy as np
    import pandas as pd

    # 1. Drop columns with >99% missing
    drop_cols = [
        'id_24', 'id_25', 'id_07', 'id_08', 'id_21',
        'id_26', 'id_23', 'id_22', 'id_27'
    ]
    train.drop(columns=drop_cols, inplace=True, errors='ignore')
    test.drop(columns=drop_cols, inplace=True, errors='ignore')

    # 2. Add null flags to selected columns
    null_flag_cols = ['id_03', 'id_04', 'id_30', 'id_32', 'id_33', 'id_34']
    for col in null_flag_cols:
        train[f'{col}_missing'] = train[col].isnull().astype(int)
        test[f'{col}_missing'] = test[col].isnull().astype(int)

    # 3. Impute categorical columns with 'missing'
    cat_cols = ['id_30', 'id_31', 'id_33', 'id_34']
    for col in cat_cols:
        train[col] = train[col].fillna('missing')
        test[col] = test[col].fillna('missing')

    # 4. Impute numerical columns with median (especially D & id time features)
    num_cols = ['D6', 'D7', 'D8', 'D9', 'D12', 'D13', 'D14', 'id_03', 'id_04', 'id_32']
    for col in num_cols:
        median = train[col].median()
        train[col] = train[col].fillna(median)
        test[col] = test[col].fillna(median)

    # 5. Handle V-features
    v_cols = [col for col in train.columns if col.startswith('V')]
    for col in v_cols:
        median = train[col].median()
        train[col] = train[col].fillna(median)
        test[col] = test[col].fillna(median)

    # Done
    print(f"Preprocessing done. Final shape: Train={train.shape}, Test={test.shape}")
    return train, test

In [16]:
# Rebuild the merged training frame from the raw tables (no cleaning applied yet)
import pandas as pd

df = pd.merge(train_transaction, train_identity, on='TransactionID', how='left')

# Check total remaining nulls
nulls_remaining = df.isnull().sum()
nulls_remaining = nulls_remaining[nulls_remaining > 0].sort_values(ascending=False)

if not nulls_remaining.empty:
    print("\nRemaining Null Columns:\n", nulls_remaining)
else:
    print("\nNo remaining nulls.")



Remaining Null Columns:
 id_24    585793
id_25    585408
id_07    585385
id_08    585385
id_21    585381
          ...  
V309         12
V312         12
V311         12
V310         12
V316         12
Length: 414, dtype: int64


In [17]:
import pandas as pd

# Assuming 'df' is your DataFrame, merge the train_transaction and train_identity if not done already
# df = pd.merge(train_transaction, train_identity, on='TransactionID', how='left')

# Step 1: Check for missing values
print("Initial missing values per column:\n")
print(df.isnull().sum().sort_values(ascending=False).head())

# Step 2: Drop columns with more than 90% missing data
# (thresh is the minimum number of non-null values a column needs to be kept: 10% of the rows)
df = df.dropna(thresh=len(df) * 0.1, axis=1)

# Step 3: Recheck the number of missing values after dropping columns
print("\nMissing values after dropping columns with > 90% missing data:\n")
print(df.isnull().sum().sort_values(ascending=False).head())

# Step 4: Handle missing values in numeric columns (impute with median)
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which acts on a temporary copy and will stop working in pandas 3.0.
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())

# Step 5: Handle missing values in categorical columns (impute with mode)
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Step 6: Recheck if any null values remain
nulls_remaining = df.isnull().sum()
nulls_remaining = nulls_remaining[nulls_remaining > 0].sort_values(ascending=False)

if not nulls_remaining.empty:
    print("\nRemaining Null Columns:\n", nulls_remaining)
else:
    print("\n✅ All null values handled successfully.")

Initial missing values per column:

id_24    585793
id_25    585408
id_07    585385
id_08    585385
id_21    585381
dtype: int64

Missing values after dropping columns with > 90% missing data:

D13      528588
D14      528353
D12      525823
id_03    524216
id_04    524216
dtype: int64



✅ All null values handled successfully.
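
The `thresh` argument of `dropna` counts required non-null values, which makes a "more than 90% missing" rule easy to misread. A sketch of an equivalent filter, written directly in terms of the missing fraction, is shown below on toy data. The helper name `drop_sparse_columns` and the toy columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing_frac: float = 0.9) -> pd.DataFrame:
    """Keep only columns whose share of missing values is at most max_missing_frac."""
    missing_frac = df.isna().mean()  # fraction of NaN per column
    keep = missing_frac[missing_frac <= max_missing_frac].index
    return df.loc[:, keep]


# Toy frame with 20 rows
toy = pd.DataFrame({
    "mostly_missing": [np.nan] * 19 + [1.0],     # 95% missing, should be dropped
    "half_missing": [np.nan] * 10 + [1.0] * 10,  # 50% missing, should be kept
    "complete": list(range(20)),                 # 0% missing, should be kept
})

kept = drop_sparse_columns(toy)
kept_columns = kept.columns.tolist()
```

On this toy frame the result matches `toy.dropna(axis=1, thresh=2)`, since 2 non-null values is 10% of 20 rows.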


In [21]:
import pandas as pd
import numpy as np

# Assuming 'df' is your DataFrame

# Step 1: Drop columns with more than 90% missing values (already handled above; repeated as a safeguard)
df = df.dropna(thresh=len(df) * 0.1, axis=1)

# Step 2: Drop constant features (columns where all values are the same)
df = df.loc[:, df.nunique() > 1]

# Step 3: Drop identifier columns
# TransactionID is only a row key (already used for the merge), so it carries no predictive signal.
# The id_* columns listed here were already removed by the 90% threshold; errors='ignore' skips them.
df.drop(columns=['TransactionID', 'id_24', 'id_25', 'id_07', 'id_08', 'id_21'], inplace=True, errors='ignore')

# Step 4: Calculate the absolute correlation matrix for numeric columns only
numeric_df = df.select_dtypes(include=[np.number])
corr_matrix = numeric_df.corr().abs()

# Step 5: Keep only the upper triangle (k=1 excludes the diagonal) so each pair is checked once
upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Step 6: Flag every column that correlates above 0.9 with an earlier column
to_drop = [column for column in upper_triangle.columns if any(upper_triangle[column] > 0.9)]

# Step 7: Drop the flagged columns, keeping the first column of each correlated pair
df.drop(columns=to_drop, inplace=True)

print(f"✅ Final DataFrame Shape: {df.shape}")


✅ Final DataFrame Shape: (590540, 248)
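
The correlation filter above can be seen in isolation on a toy frame. This sketch wraps the same upper-triangle logic in a helper and adds one assumption of ours: a `protect` argument that keeps chosen columns, such as the `isFraud` target, out of the comparison so they can never be dropped. The helper name, the `protect` parameter, and the toy data are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def correlated_columns_to_drop(df, threshold=0.9, protect=()):
    """List numeric columns whose absolute correlation with an earlier
    column exceeds threshold. Columns in protect are never considered."""
    numeric = df.select_dtypes(include=[np.number])
    numeric = numeric.drop(columns=list(protect), errors="ignore")
    corr = numeric.corr().abs()
    # Upper triangle without the diagonal: each pair appears exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "a_copy": base * 2.0 + 1.0,           # linear function of "a", correlation 1
    "b": rng.normal(size=200),            # independent noise
    "isFraud": (base > 1.5).astype(int),  # toy target, protected below
})

to_drop = correlated_columns_to_drop(toy, protect=["isFraud"])
reduced = toy.drop(columns=to_drop)
```

The second column of each correlated pair is the one dropped, so column order determines which one survives.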
