# 🕵️‍♂️ IEEE-CIS Fraud Detection - Team Project

Welcome to our team project on **Fraud Detection using the IEEE-CIS Dataset**, a real-world challenge involving identifying fraudulent online transactions.

## 🧑‍💻 Team Members:
- **Mostapha Abdulaziz**
- **Ahmed Imad**
- **Noha Ashraf**
- **Rana Ahmed**
- **Sondos Wael**

---

## 📌 Project Objective

The objective of this project is to detect fraudulent transactions using the **IEEE-CIS Fraud Detection Dataset**, a large, anonymized dataset of real-world online transactions. Using data preprocessing techniques and machine learning models, we aim to build an accurate fraud detection system capable of identifying suspicious activity.

---

## 📦 Dataset Overview

The dataset is provided by **Vesta Corporation** and hosted on **Kaggle**. It contains **anonymized transactional data** and **user-related identity information**. The dataset is divided into four main files:

### 1. `train_transaction.csv` & `test_transaction.csv`
These contain transaction-level features such as:
- `TransactionID` – unique ID for each transaction
- `TransactionDT` – time offset in seconds from a reference datetime (not an actual timestamp)
- `TransactionAmt` – amount of the transaction
- `ProductCD` – product code; `card1`–`card6` – payment card information
- `addr1`, `addr2`, `dist1`, `dist2` – location and distance metrics
- `P_emaildomain`, `R_emaildomain` – purchaser & recipient emails
- `C1`–`C14` – count-based features (anonymized)
- `D1`–`D15` – time deltas from prior events
- `M1`–`M9` – matching flags
- `V1`–`V339` – anonymized features engineered by Vesta (rankings, counts, and other entity relations)
- `isFraud` (only in train) – target label: 1 if fraudulent, 0 otherwise

### 2. `train_identity.csv` & `test_identity.csv`
Contain additional information about:
- Device details (e.g., `DeviceType`, `DeviceInfo`)
- Browser data
- Network address and anonymized identity signals (`id_01` to `id_38`)

---

## 🧹 Data Cleaning & Preprocessing

The dataset requires extensive preprocessing due to:
- High number of null values
- Anonymized and encoded variables
- Mixed data types (numeric, categorical, textual)

We'll explore:
- Handling missing data
- Feature selection and dimensionality reduction
- Encoding of categorical variables
- Time feature extraction
- Device and browser parsing
- Merging identity and transaction data

---

## ⚙️ Our Approach

1. **Exploratory Data Analysis (EDA)**: Understand distribution, correlations, and missing values.
2. **Feature Engineering**: Create new features from device/browser info, time, emails, etc.
3. **Modeling**: Train models like XGBoost, LightGBM, and compare results.
4. **Evaluation**: Use metrics such as AUC-ROC, precision, recall to assess fraud detection performance.
5. **Interpretation**: Analyze important features and understand model decisions.
- And more
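
The evaluation step above can be sketched on toy data. This is a minimal illustration, assuming scikit-learn's `roc_auc_score`, `precision_score`, and `recall_score`; the labels, scores, and the 0.5 threshold are made up for the example and are not results from this project.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Toy labels and model scores (hypothetical, not project results).
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.8, 0.7, 0.9])

# AUC-ROC is threshold-free: it measures how well scores rank fraud above non-fraud.
auc = roc_auc_score(y_true, y_score)

# Precision and recall need a hard decision; 0.5 is an arbitrary cut-off here.
y_pred = (y_score >= 0.5).astype(int)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"AUC-ROC={auc:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```

With heavily imbalanced classes, plain accuracy can look high even when no fraud is caught, which is why ranking and per-class metrics like these are the usual choice.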

---

## 📊 Why This Problem Matters

Fraud detection is critical for the financial industry. By working on this project, we simulate what it’s like to deal with:
- Imbalanced classification problems
- Real-world noise and data anonymization
- Behavioral pattern detection



# Mounting the drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Importing the dataset from Kaggle


In [None]:
# Installing Miniconda for RAPIDS
!wget -nc https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local

import sys  # needed for sys.path below; this cell runs before the imports cell

sys.path.append('/usr/local/lib/python3.11/site-packages/')

# Installing RAPIDS for CUDA 12.x
!conda create -n rapids -c rapidsai -c nvidia -c conda-forge rapids=24.02 python=3.11 cudatoolkit=12.2 --yes
!pip install cudf-cu12 cuml-cu12 --extra-index-url=https://pypi.nvidia.com

In [2]:
import sys
import cudf
import cuml
import pandas as pd
import numpy as np
import cupy as cp
from tqdm import tqdm
from sklearn.decomposition import PCA
print(f"GPU available: {cp.cuda.is_available()}")
print(f"CUDA version: {cp.cuda.runtime.runtimeGetVersion()}")
print(f"cuDF version: {cudf.__version__}")
print(f"cuML version: {cuml.__version__}")

GPU available: True
CUDA version: 12060
cuDF version: 25.02.01
cuML version: 25.02.01


In [3]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


cp: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory


In [4]:
!kaggle competitions download -c ieee-fraud-detection -p /content/drive/MyDrive/ieee-fraud


Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 4, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.11/dist-packages/kaggle/__init__.py", line 6, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.11/dist-packages/kaggle/api/kaggle_api_extended.py", line 433, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method. See setup instructions at https://github.com/Kaggle/kaggle-api/


In [5]:
import zipfile
with zipfile.ZipFile('/content/drive/MyDrive/ieee-fraud/ieee-fraud-detection.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/drive/MyDrive/ieee-fraud')


# Loading the datasets

In [6]:
import zipfile

zip_path = '/content/drive/MyDrive/ieee-fraud/ieee-fraud-detection.zip'
extract_path = '/content/drive/MyDrive/ieee-fraud'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

In [7]:
print("Loading datasets...")
train_transaction = pd.read_csv('/content/drive/MyDrive/ieee-fraud/train_transaction.csv')
train_identity = pd.read_csv('/content/drive/MyDrive/ieee-fraud/train_identity.csv')
test_transaction = pd.read_csv('/content/drive/MyDrive/ieee-fraud/test_transaction.csv')
test_identity = pd.read_csv('/content/drive/MyDrive/ieee-fraud/test_identity.csv')


print(f"train_transaction shape: {train_transaction.shape}")
print(f"train_identity shape: {train_identity.shape}")
print(f"test_transaction shape: {test_transaction.shape}")
print(f"test_identity shape: {test_identity.shape}")
print(f"card1 in train_transaction: {'card1' in train_transaction.columns}")
print(f"V-features in train_transaction: {len([col for col in train_transaction.columns if col.startswith('V')])}")

Loading datasets...
train_transaction shape: (590540, 394)
train_identity shape: (144233, 41)
test_transaction shape: (506691, 393)
test_identity shape: (141907, 41)
card1 in train_transaction: True
V-features in train_transaction: 339


# Merge transaction and identity datasets

In [8]:
import pandas as pd

# normalizing test_identity column names
test_identity.columns = [col.replace('-', '_') for col in test_identity.columns]

# merging transaction and identity data
train = train_transaction.merge(train_identity, on='TransactionID', how='left')
test = test_transaction.merge(test_identity, on='TransactionID', how='left')
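
The left merge above can be sanity-checked on a small scale. The sketch below uses toy frames with made-up values; `validate` and `indicator` are standard `DataFrame.merge` parameters, used here on the assumption that each `TransactionID` appears at most once per table.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (values are hypothetical).
transactions = pd.DataFrame(
    {"TransactionID": [1, 2, 3], "TransactionAmt": [10.0, 25.5, 99.0]}
)
identity = pd.DataFrame(
    {"TransactionID": [1, 3], "DeviceType": ["mobile", "desktop"]}
)

# how="left" keeps every transaction; validate raises if a key is duplicated;
# indicator adds a "_merge" column telling us which rows found an identity match.
merged = transactions.merge(
    identity,
    on="TransactionID",
    how="left",
    validate="one_to_one",
    indicator=True,
)

has_identity = merged["_merge"] == "both"
print(f"rows: {len(merged)}, with identity: {int(has_identity.sum())}")
```

Rows without an identity record keep their transaction columns and get nulls in the identity columns, which is one reason the `id_*` columns show such high missingness after merging.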

# **Missing Value Summary**

In [10]:
def missing_summary(df, name=''):
    # calculating null counts and percentages
    nulls = df.isnull().sum()
    null_pct = nulls / len(df) * 100
    missing_df = pd.DataFrame({'Missing Count': nulls, 'Missing %': null_pct})
    missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values(by='Missing %', ascending=False)
    print(f"\nnull Summary for {name} Dataset:")
    print(missing_df.head(30))
    return missing_df


train_missing = missing_summary(train, 'Train')
test_missing = missing_summary(test, 'Test')


null Summary for Train Dataset:
       Missing Count  Missing %
id_24         585793  99.196159
id_25         585408  99.130965
id_08         585385  99.127070
id_07         585385  99.127070
id_21         585381  99.126393
id_26         585377  99.125715
id_23         585371  99.124699
id_22         585371  99.124699
id_27         585371  99.124699
dist2         552913  93.628374
D7            551623  93.409930
id_18         545427  92.360721
D13           528588  89.509263
D14           528353  89.469469
D12           525823  89.041047
id_04         524216  88.768923
id_03         524216  88.768923
D6            517353  87.606767
id_33         517251  87.589494
id_09         515614  87.312290
D8            515614  87.312290
D9            515614  87.312290
id_10         515614  87.312290
id_30         512975  86.865411
id_32         512954  86.861855
id_34         512735  86.824771
id_14         510496  86.445626
V153          508595  86.123717
V141          508595  86.123717
V142   

Here we define a function, `missing_summary`, that computes and displays the count and percentage of missing values for each column of a dataset. We apply it to both the training and test datasets and print the 30 columns with the highest missingness. This identifies columns that are almost entirely null (e.g., `id_24`, `id_25`) and informs the preprocessing decisions that follow.

# **Preprocessing Data**

In [11]:
def preprocess_data(train, test):
    # dropping columns with >80% missing values unless critical
    key_cols = ['DeviceInfo', 'P_emaildomain', 'R_emaildomain', 'D1', 'D2', 'id_01', 'id_02', 'card1']
    high_missing_cols = [col for col in tqdm(train.columns, desc="Checking missingness") if train[col].isnull().mean() > 0.8 and col not in key_cols]
    train.drop(columns=high_missing_cols, inplace=True)
    test.drop(columns=high_missing_cols, inplace=True, errors='ignore')
    print(f"Dropped columns: {len(high_missing_cols)} columns")
    print(f"card1 present after missingness: {'card1' in train.columns}")

    # aligning train and test columns
    common_cols = train.columns.intersection(test.columns)
    train = train[common_cols.union(['isFraud'])]
    test = test[common_cols]
    print(f"card1 present after alignment: {'card1' in train.columns}")

    # adding null indicator flags for key columns
    null_flags_train = {}
    null_flags_test = {}
    for col in tqdm(key_cols, desc="Adding null flags"):
        if col in train.columns and col != 'isFraud' and train[col].isnull().mean() > 0.1:
            null_flags_train[f'{col}_missing'] = train[col].isnull().astype(int)
            null_flags_test[f'{col}_missing'] = test[col].isnull().astype(int)
    train = train.join(pd.DataFrame(null_flags_train, index=train.index))
    test = test.join(pd.DataFrame(null_flags_test, index=test.index))

    # imputing categorical columns with 'missing'
    cat_cols = train.select_dtypes(include=['object']).columns
    for col in tqdm(cat_cols, desc="Imputing categoricals"):
        train.loc[:, col] = train[col].fillna('missing')
        test.loc[:, col] = test[col].fillna('missing')

    # imputing numerical columns with median
    num_cols = train.select_dtypes(include=['float64', 'int64']).columns.drop(['isFraud'], errors='ignore')
    for col in tqdm(num_cols, desc="Imputing numericals"):
        median = train[col].median()
        train.loc[:, col] = train[col].fillna(median)
        test.loc[:, col] = test[col].fillna(median)

    # applying PCA to V-features
    v_cols = [col for col in train.columns if col.startswith('V') and train[col].nunique() > 1 and train[col].isnull().mean() < 0.9]
    if v_cols:
        print(f"applying PCA to {len(v_cols)} V-features...")
        v_data_train = train[v_cols].copy()
        v_data_test = test[v_cols].copy()
        for col in v_cols:
            median = v_data_train[col].median()
            v_data_train[col] = v_data_train[col].fillna(median)
            v_data_test[col] = v_data_test[col].fillna(median)
        if v_data_train.isnull().sum().sum() == 0 and v_data_train.nunique().min() > 1:
            pca = PCA(n_components=min(50, len(v_cols)), random_state=42)
            train_v_pca = pca.fit_transform(v_data_train)
            test_v_pca = pca.transform(v_data_test)
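            # NOTE: wrapping the PCA output in a DataFrame here gives it integer column
            # labels (0..49). Passing that DataFrame to pd.DataFrame(..., columns=[...])
            # below selects columns by label instead of renaming them, so every
            # V_pca_* column comes out all-NaN (see the null counts in the output).
            train_v_pca = pd.DataFrame(train_v_pca).fillna(0)
            test_v_pca = pd.DataFrame(test_v_pca).fillna(0)
            train = train.drop(columns=v_cols).join(pd.DataFrame(train_v_pca, columns=[f'V_pca_{i}' for i in range(train_v_pca.shape[1])], index=train.index))
            test = test.drop(columns=v_cols).join(pd.DataFrame(test_v_pca, columns=[f'V_pca_{i}' for i in range(test_v_pca.shape[1])], index=test.index))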
            print("PCA completed")
        else:
            print("skipping PCA: V-features have insufficient variance or nulls")
            train = train.drop(columns=v_cols)
            test = test.drop(columns=v_cols)
    else:
        print("no valid V-features for PCA")

    # engineering feature: missing count per row
    train = train.copy()
    test = test.copy()
    train.loc[:, 'missing_count'] = train.isnull().sum(axis=1)
    test.loc[:, 'missing_count'] = test.isnull().sum(axis=1)

    # dropping highly correlated numerical features with sampling
    numeric_df = train.select_dtypes(include=[np.number]).drop(['isFraud'], axis=1, errors='ignore')
    if len(numeric_df.columns) > 100:
        print("sampling data for correlation matrix...")
        sample_df = numeric_df.sample(n=10000, random_state=42)
    else:
        sample_df = numeric_df
    print("computing correlation matrix...")
    corr_matrix = sample_df.corr().abs()
    print("correlation matrix completed")
    upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    protected_cols = ['card1'] if 'card1' in train.columns else []
    to_drop = [col for col in tqdm(upper_triangle.columns, desc="checking correlations") if col not in protected_cols and any(upper_triangle[col] > 0.9)]
    train.drop(columns=to_drop, inplace=True)
    test.drop(columns=to_drop, inplace=True)
    print(f"dropped correlated columns: {len(to_drop)} columns")
    print(f"card1 present after correlations: {'card1' in train.columns}")

    print("remaining nulls in train:", train.isnull().sum().sum())
    print("remaining nulls in test:", test.isnull().sum().sum())
    print(f"final shape: Train={train.shape}, Test={test.shape}")

    return train, test

# preprocessing data
train, test = preprocess_data(train, test)

Checking missingness: 100%|██████████| 434/434 [00:00<00:00, 458.17it/s]


Dropped columns: 74 columns
card1 present after missingness: True
card1 present after alignment: True


Adding null flags: 100%|██████████| 8/8 [00:00<00:00, 45.15it/s]
Imputing categoricals: 100%|██████████| 26/26 [00:02<00:00,  8.97it/s]
Imputing numericals: 100%|██████████| 339/339 [00:03<00:00, 93.83it/s]


applying PCA to 292 V-features...
PCA completed
computing correlation matrix...
correlation matrix completed


checking correlations: 100%|██████████| 98/98 [00:00<00:00, 11971.86it/s]


dropped correlated columns: 14 columns
card1 present after correlations: True
remaining nulls in train: 29527000
remaining nulls in test: 25334550
final shape: Train=(590540, 111), Test=(506691, 110)


# **Additional Preprocessing and Validation**

In [12]:
# dropping constant features
train = train.loc[:, train.nunique() > 1]
test = test.loc[:, test.nunique() > 1]

# dropping irrelevant columns
train.drop(columns=['TransactionID'], inplace=True, errors='ignore')
test.drop(columns=['TransactionID'], inplace=True, errors='ignore')

# engineering group-based feature
if 'card1' in train.columns and 'card1' in test.columns:
    print("creating card1_fraud_rate feature...")
    card1_fraud_rate = train.groupby('card1')['isFraud'].mean()
    train['card1_fraud_rate'] = train['card1'].map(card1_fraud_rate)
    test['card1_fraud_rate'] = test['card1'].map(card1_fraud_rate.fillna(card1_fraud_rate.mean()))
else:
    print("skipping card1_fraud_rate: card1 column missing")
    train['card1_fraud_rate'] = 0
    test['card1_fraud_rate'] = 0

print(f"final DataFrame Shape: Train={train.shape}, Test={test.shape}")
print("remaining nulls in train:", train.isnull().sum().sum())
print("remaining nulls in test:", test.isnull().sum().sum())

creating card1_fraud_rate feature...
final DataFrame Shape: Train=(590540, 61), Test=(506691, 60)
remaining nulls in train: 0
remaining nulls in test: 9360


# Why Cell 4 Left Tons of Missing Values but Cell 5 Nearly Fixed It

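Wondering why **Cell 4** left a mess of missing values (29.5 million in `train`, 25.3 million in `test`) but **Cell 5** dropped those to 0 in `train` and just 9360 in `test`? Here, Cell 4 is the `preprocess_data` cell (In [11]) and Cell 5 is the additional preprocessing cell (In [12]).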

## Cell 4: The Big Cleanup
**What it does**: Cell 4 is like a giant vacuum cleaner for the dataset (434 columns to start). It tries to tidy up by tossing bad columns, filling gaps, simplifying data, and adding new bits. Here's what happens:

- Drops 74 columns with over 80% missing data (like `id_26`, 99% empty). Keeps `card1`.
- Matches train and test columns (except `isFraud`, which is only in `train`).
- Adds flags (named `<column>_missing`) for key columns with more than 10% missing data.
- Fills gaps: text columns (like `ProductCD`) get “missing”; number columns (like `TransactionAmt`) get the median.
- Squashes 292 V-columns (like `V1`, `V2`) into 50 using PCA (a dimensionality-reduction technique).
- Adds `missing_count` to track gaps per row.
- Drops 14 too-similar number columns, keeping `card1`.

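**What went wrong**: It should fill all gaps but left 29.5 million in `train` and 25.3 million in `test`. Why? Those totals are exactly 50 nulls per row (590540 × 50 = 29527000 and 506691 × 50 = 25334550), which points at the 50 PCA columns rather than at the imputation step. The PCA output is first wrapped in a DataFrame with integer column labels and then passed to `pd.DataFrame(..., columns=[...])` with `V_pca_*` names. When given a DataFrame, that constructor selects columns by label instead of renaming them, so all 50 `V_pca_*` columns came out entirely empty. The `float64`/`int64` imputation ran before PCA and did its job on the other columns.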

**Result**: We got `train` (590540 rows, 111 columns) and `test` (506691 rows, 110 columns), but with tons of missing values.
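
The null counts from Cell 4 equal exactly 50 nulls per row, which matches the 50 PCA columns. A minimal sketch of how that can happen, and one way to avoid it, is shown below. The random array is a stand-in for the `pca.fit_transform` output, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
components = rng.normal(size=(4, 2))  # stand-in for pca.fit_transform(...)
names = ["V_pca_0", "V_pca_1"]

# Pitfall: wrap the array first (columns are labelled 0 and 1), then ask for
# string labels. Given a DataFrame, the constructor selects by label, and no
# column is called "V_pca_0", so the result is all-NaN.
wrapped = pd.DataFrame(components)
reindexed = pd.DataFrame(wrapped, columns=names)

# One possible fix: name the columns while wrapping the NumPy array itself.
named = pd.DataFrame(components, columns=names)

print(int(reindexed.isnull().sum().sum()))  # 8 (every cell is NaN)
print(int(named.isnull().sum().sum()))      # 0
```

Renaming an existing DataFrame with `wrapped.set_axis(names, axis=1)` would work as well; the key point is to rename rather than re-select.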

## Cell 5: Quick Polish
**What it does**: Cell 5 is a fast touch-up. It simplifies data, ditches a useless column, and adds a smart feature. Here's how:

- Tosses columns with one value or all blanks (useless for fraud detection).
- Drops `TransactionID` (just an ID).
- Adds `card1_fraud_rate`, showing how often each `card1` is linked to fraud in `train`. Tries to apply it to `test`.

**Why nulls dropped**:
- **Train: 29.5 million to 0**:
  - The “toss boring columns” step (`train.nunique() > 1`) cut about 50 columns. The 50 all-empty `V_pca_*` columns (590540 × 50 = 29527000 nulls) have no distinct values, so they were cut, taking every remaining gap with them. The downside is that the final dataset carries none of the 50 PCA features, and the 292 V-columns they replaced are gone too.
  - `card1_fraud_rate` has no blanks in `train` (it uses `train`’s clean `card1` and `isFraud`).
  - Train ends clean with 61 columns and 0 nulls.
- **Test: 25.3 million to 9360**:
  - Same thing: the same kind of columns were cut in `test`, wiping the 25.3 million gaps.
  - But `card1_fraud_rate` messed up. If `test` has `card1` values not in `train`, those spots stay empty. That’s the 9360 nulls.
  - Cell 6 will fix this.
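
The `card1_fraud_rate` gap in `test` can be reproduced with a few rows. The sketch below uses made-up values and assumes the global training fraud rate is an acceptable fallback for unseen `card1` values; the important detail is that `fillna` has to be applied to the mapped result, not to the lookup table.

```python
import pandas as pd

# Toy data: card1 value 300 appears only in test (values are hypothetical).
train = pd.DataFrame({"card1": [100, 100, 200, 200], "isFraud": [1, 0, 0, 0]})
test = pd.DataFrame({"card1": [100, 200, 300]})

rate = train.groupby("card1")["isFraud"].mean()  # 100 -> 0.5, 200 -> 0.0
global_rate = train["isFraud"].mean()            # 0.25

# Mapping alone leaves NaN wherever test has a card1 that train never saw.
unfilled = test["card1"].map(rate)

# Filling after the map covers those unseen keys.
filled = test["card1"].map(rate).fillna(global_rate)

print(int(unfilled.isna().sum()))  # 1
print(filled.tolist())             # [0.5, 0.0, 0.25]
```

Because this rate is computed from the target, mapping it back onto the same training rows lets each row see its own label. Computing it out-of-fold is a common way to limit that leakage.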

# ***Final Null Imputation***

In [13]:
print("Columns with nulls in train:")
print(train.isnull().sum()[train.isnull().sum() > 0])
print("Columns with nulls in test:")
print(test.isnull().sum()[test.isnull().sum() > 0])

# imputing card1_fraud_rate in test
if 'card1_fraud_rate' in test.columns:
    print("imputing card1_fraud_rate in test...")
    test['card1_fraud_rate'] = test['card1_fraud_rate'].fillna(train['card1_fraud_rate'].mean())

# imputing numerical columns
num_cols = test.select_dtypes(include=['float64', 'float32', 'int64', 'int32']).columns
for col in num_cols:
    test.loc[:, col] = test[col].fillna(-999)
    train.loc[:, col] = train[col].fillna(-999)

# imputing categorical columns
cat_cols = test.select_dtypes(include=['object']).columns
for col in cat_cols:
    test.loc[:, col] = test[col].fillna('missing')
    train.loc[:, col] = train[col].fillna('missing')

print("remaining nulls in train:", train.isnull().sum().sum())
print("remaining nulls in test:", test.isnull().sum().sum())
print(f"final shape: Train={train.shape}, Test={test.shape}")
print(f"card1 present in train: {'card1' in train.columns}")
print(f"isFraud present in train: {'isFraud' in train.columns}")
print(f"card1_fraud_rate present in test: {'card1_fraud_rate' in test.columns}")

Columns with nulls in train:
Series([], dtype: int64)
Columns with nulls in test:
card1_fraud_rate    9360
dtype: int64
imputing card1_fraud_rate in test...
remaining nulls in train: 0
remaining nulls in test: 0
final shape: Train=(590540, 61), Test=(506691, 60)
card1 present in train: True
isFraud present in train: True
card1_fraud_rate present in test: True


# Preprocessing and nulls handling overview

We cleaned up a huge fraud detection dataset (590540 train rows, 506691 test rows, ~434 columns) to make it ready for modeling.

**Cell 1: Load Data**  
Loaded train and test data from CSV files (transactions and identity). Checked we got `card1` and the V-features. Transaction shapes: train (590540 rows, 394 columns), test (506691 rows, 393 columns).

**Cell 2: Merge Data**  
Combined transaction and identity data using `TransactionID`. Fixed test column names. Got train (434 columns) and test (433 columns).

**Cell 3: Check Missing Values**  
Looked at missing data. Found tons (like `id_26`, 99% empty). Helped us plan the cleanup.

**Cell 4: Big Cleanup**  
Dropped 74 columns with >80% missing. Kept `card1`. Aligned train/test columns. Added flags for missing key data. Filled text columns with “missing” and numbers with medians. Squashed 292 V-features into 50 using PCA. Added `missing_count`. Dropped 14 similar columns. Left 29.5M nulls in train and 25.3M in test (the 50 PCA columns came out all-empty). Got train (111 columns) and test (110 columns).

**Cell 5: Quick Polish**  
Dropped columns with one value or all blanks. Ditched `TransactionID`. Added `card1_fraud_rate` (fraud likelihood per `card1`). Train nulls dropped to 0 (the empty columns were cut). Test nulls dropped to 9360 (`card1_fraud_rate` glitch). Got train (61 columns) and test (60 columns).

**Cell 6: Final Fix**  
Fixed the 9360 nulls in test’s `card1_fraud_rate` with train’s average. Filled any stray number/text columns. No nulls left. Confirmed `card1`, `isFraud`, and `card1_fraud_rate` are there. Final shapes: train (590540 rows, 61 columns), test (506691 rows, 60 columns).

**Preprocessing Done**  
We went from messy data to clean, null-free datasets ready for modeling.

# **Feature Engineering**




In [None]:
# Feature Engineering Cell
def feature_engineering(df, is_train=True):
    """
    Applies feature engineering to the dataset
    Args:
        df: DataFrame (train or test)
        is_train: Whether this is training data (needed for some target-based features)
    Returns:
        DataFrame with new features
    """
    # Make a copy to avoid SettingWithCopyWarning
    df = df.copy()
    
    print("Starting feature engineering...")
    
    # ======================
    # 1. Time-Based Features
    # ======================
    if 'TransactionDT' in df.columns:
        print("Creating time-based features...")
        # Convert seconds to days
        df['Transaction_day'] = df['TransactionDT'] // (24*60*60)
        
        # Time of day features (morning, afternoon, evening, night)
        df['Transaction_hour'] = (df['TransactionDT'] % (24*60*60)) / (60*60)
        df['Is_night'] = ((df['Transaction_hour'] >= 22) | (df['Transaction_hour'] <= 6)).astype(int)
        df['Is_morning'] = ((df['Transaction_hour'] > 6) & (df['Transaction_hour'] <= 12)).astype(int)
        df['Is_afternoon'] = ((df['Transaction_hour'] > 12) & (df['Transaction_hour'] <= 18)).astype(int)
        df['Is_evening'] = ((df['Transaction_hour'] > 18) & (df['Transaction_hour'] < 22)).astype(int)
        
        # Weekend flag
        # Assuming day 0 was a Monday (you may need to adjust based on actual data)
        df['Is_weekend'] = ((df['Transaction_day'] % 7) >= 5).astype(int)
    
    # ===========================
    # 2. Transaction Amount Features
    # ===========================
    if 'TransactionAmt' in df.columns:
        print("Creating transaction amount features...")
        # Log transform of amount
        df['TransactionAmt_log'] = np.log1p(df['TransactionAmt'])
        
        # Binned amounts
        df['TransactionAmt_bin'] = pd.cut(df['TransactionAmt'], 
                                         bins=[0, 10, 50, 100, 500, 1000, float('inf')],
                                         labels=['0-10', '10-50', '50-100', '100-500', '500-1000', '1000+'])
    
    # ===========================
    # 3. Email Domain Features
    # ===========================
    for col in ['P_emaildomain', 'R_emaildomain']:
        if col in df.columns:
            print(f"Creating features from {col}...")
            # Extract domain provider
            df[f'{col}_provider'] = df[col].apply(lambda x: str(x).split('.')[0] if pd.notnull(x) else 'missing')
            
            # Free email flag (common for fraud)
            free_emails = ['gmail', 'yahoo', 'hotmail', 'outlook', 'aol', 'protonmail']
            df[f'{col}_is_free'] = df[f'{col}_provider'].isin(free_emails).astype(int)
    
    # ===========================
    # 4. Card Features
    # ===========================
    if 'card1' in df.columns:
        print("Creating card features...")
        # Number of transactions per card (only meaningful if we can group across full dataset)
        if is_train:
            card_counts = df['card1'].value_counts().to_dict()
            df['card1_count'] = df['card1'].map(card_counts)
        else:
            # For test data, we'd need to use counts from training data
            # This would need to be handled separately if doing proper train/test split
            df['card1_count'] = -1  # Placeholder
    
    # ===========================
    # 5. Device Features
    # ===========================
    if 'DeviceInfo' in df.columns:
        print("Creating device features...")
        # Extract device type (simplified)
        df['Device_type'] = df['DeviceInfo'].str.split('/', n=1).str[0]
        df['Device_type'] = df['Device_type'].str.split(' ', n=1).str[0]
        
        # Common device flag
        top_devices = df['Device_type'].value_counts().head(5).index
        df['Is_common_device'] = df['Device_type'].isin(top_devices).astype(int)
    
    if 'DeviceType' in df.columns:
        # Simple binary feature
        df['Is_mobile'] = (df['DeviceType'] == 'mobile').astype(int)
    
    # ===========================
    # 6. Interaction Features
    # ===========================
    if all(col in df.columns for col in ['TransactionAmt', 'card1_count']):
        print("Creating interaction features...")
        # Amount relative to how often the card appears.
        # clip(lower=0) keeps the -1 test placeholder from producing a zero denominator.
        df['Amt_per_card_count'] = df['TransactionAmt'] / (df['card1_count'].clip(lower=0) + 1)
        
    if all(col in df.columns for col in ['TransactionAmt', 'Is_night']):
        # Nighttime high-value transactions might be suspicious
        df['Night_high_value'] = (df['Is_night'] & (df['TransactionAmt'] > 500)).astype(int)
    
    # ===========================
    # 7. Frequency Encoding
    # ===========================
    print("Adding frequency encoding for categoricals...")
    cat_cols = [col for col in df.columns if df[col].dtype == 'object']
    for col in cat_cols:
        if is_train:
            freq = df[col].value_counts(normalize=True).to_dict()
            df[f'{col}_freq'] = df[col].map(freq)
        else:
            # For test data, we'd need to use frequencies from training data
            df[f'{col}_freq'] = -1  # Placeholder
    
    print("Feature engineering complete!")
    return df

# Apply feature engineering
print("\nEngineering features for training data...")
train_fe = feature_engineering(train, is_train=True)

print("\nEngineering features for test data...")
test_fe = feature_engineering(test, is_train=False)

# Verify the results
print("\nFeature engineering results:")
print(f"Train shape after FE: {train_fe.shape}")
print(f"Test shape after FE: {test_fe.shape}")
print(f"New features added: {set(train_fe.columns) - set(train.columns)}")
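
The `-1` placeholders used for test data above could be replaced by statistics learned on the training set. The following is a minimal sketch of that idea for frequency encoding. `fit_frequency_maps` and `apply_frequency_maps` are hypothetical helper names, not part of the notebook, and the default of `0.0` for unseen categories is an assumption.

```python
import pandas as pd


def fit_frequency_maps(train_df, columns):
    """Learn the relative frequency of each category from training data only."""
    return {col: train_df[col].value_counts(normalize=True) for col in columns}


def apply_frequency_maps(df, freq_maps, default=0.0):
    """Add <col>_freq columns; categories unseen in training get `default`."""
    out = df.copy()
    for col, freq in freq_maps.items():
        out[f"{col}_freq"] = out[col].map(freq).fillna(default)
    return out


# Toy data (hypothetical values): "S" never appears in training.
train_df = pd.DataFrame({"ProductCD": ["W", "W", "C", "H"]})
test_df = pd.DataFrame({"ProductCD": ["W", "C", "S"]})

freq_maps = fit_frequency_maps(train_df, ["ProductCD"])
train_enc = apply_frequency_maps(train_df, freq_maps)
test_enc = apply_frequency_maps(test_df, freq_maps)

print(test_enc["ProductCD_freq"].tolist())  # [0.5, 0.25, 0.0]
```

The same fit-then-apply split would work for `card1_count`: count on train, map onto test, and fill unseen cards with 0.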

# **Saving the preprocessed dataset.**

In [14]:
train_path = '/content/drive/MyDrive/ieee-fraud/train_preprocessed.parquet'
test_path = '/content/drive/MyDrive/ieee-fraud/test_preprocessed.parquet'
print("Saving preprocessed datasets...")
train.to_parquet(train_path, index=False)
test.to_parquet(test_path, index=False)

print("verifying saved datasets...")
train_saved = pd.read_parquet(train_path)
test_saved = pd.read_parquet(test_path)

print(f"train saved shape: {train_saved.shape}")
print(f"test saved shape: {test_saved.shape}")
print(f"train saved nulls: {train_saved.isnull().sum().sum()}")
print(f"test saved nulls: {test_saved.isnull().sum().sum()}")
print(f"card1 in train: {'card1' in train_saved.columns}")
print(f"isFraud in train: {'isFraud' in train_saved.columns}")
print(f"card1_fraud_rate in test: {'card1_fraud_rate' in test_saved.columns}")

Saving preprocessed datasets...
verifying saved datasets...
train saved shape: (590540, 61)
test saved shape: (506691, 60)
train saved nulls: 0
test saved nulls: 0
card1 in train: True
isFraud in train: True
card1_fraud_rate in test: True
