# 🛡️ Fraud Detection using Transaction Data

This project detects fraudulent transactions using machine learning (Decision Tree Classifier).  
We process and analyze transaction data from `.pkl` and `.csv` files.

---

## 📁 Files Used:
- `dataset.zip` → Contains multiple `.pkl` files with transaction data
- `combined_transactions.csv` → All `.pkl` files combined into one CSV
- `new_transaction.csv` → A sample transaction file for prediction
- `fraud_detection_model.pkl` → Saved model used for final prediction

---

## 🧪 Steps Covered:
1. **Extract Dataset** – Unzip and read `.pkl` files
2. **Load & Combine** – Convert `.pkl` files to a single CSV
3. **Preprocess** – Prepare data for modeling
4. **Train Model** – Fit Decision Tree Classifier
5. **Save Artifacts** – Store the combined CSV and the trained model for reuse
6. **Predict** – Load the saved model and predict fraud status for new transactions

---

> Created using Python, pandas, scikit-learn, and joblib.

In [None]:
# extract_dataset.py
import zipfile
import os

# Extract the ZIP
with zipfile.ZipFile("dataset.zip", "r") as zip_ref:
    zip_ref.extractall("unzipped")  # extract to 'unzipped' folder

# Print all files inside it
for root, dirs, files in os.walk("unzipped"):
    print("\nInside folder:", root)
    for file in files:
        print("  └──", file)



Inside folder: unzipped

Inside folder: unzipped\data
  └── 2018-04-01.pkl
  └── 2018-04-02.pkl
  └── 2018-04-03.pkl
  └── 2018-04-04.pkl
  └── 2018-04-05.pkl
  └── 2018-04-06.pkl
  └── 2018-04-07.pkl
  └── 2018-04-08.pkl
  └── 2018-04-09.pkl
  └── 2018-04-10.pkl
  └── 2018-04-11.pkl
  └── 2018-04-12.pkl
  └── 2018-04-13.pkl
  └── 2018-04-14.pkl
  └── 2018-04-15.pkl
  └── 2018-04-16.pkl
  └── 2018-04-17.pkl
  └── 2018-04-18.pkl
  └── 2018-04-19.pkl
  └── 2018-04-20.pkl
  └── 2018-04-21.pkl
  └── 2018-04-22.pkl
  └── 2018-04-23.pkl
  └── 2018-04-24.pkl
  └── 2018-04-25.pkl
  └── 2018-04-26.pkl
  └── 2018-04-27.pkl
  └── 2018-04-28.pkl
  └── 2018-04-29.pkl
  └── 2018-04-30.pkl
  └── 2018-05-01.pkl
  └── 2018-05-02.pkl
  └── 2018-05-03.pkl
  └── 2018-05-04.pkl
  └── 2018-05-05.pkl
  └── 2018-05-06.pkl
  └── 2018-05-07.pkl
  └── 2018-05-08.pkl
  └── 2018-05-09.pkl
  └── 2018-05-10.pkl
  └── 2018-05-11.pkl
  └── 2018-05-12.pkl
  └── 2018-05-13.pkl
  └── 2018-05-14.pkl
  └── 2018-05-15.pkl
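
If only the file names are needed, the archive can also be inspected without extracting it. The sketch below uses `ZipFile.namelist()` for that; the helper name `list_pickles` is illustrative and not part of the original scripts.

```python
import zipfile

def list_pickles(zip_path):
    """Return the sorted names of all .pkl members in a ZIP archive."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # namelist() reads the archive index only; nothing is written to disk
        return sorted(name for name in zip_ref.namelist() if name.endswith(".pkl"))
```

Calling `list_pickles("dataset.zip")` would return the member paths as stored in the archive, which always use forward slashes regardless of the operating system.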


In [None]:
# Load_data.py
import os
import pandas as pd

# Path to the folder containing the .pkl files
folder_path = 'unzipped/data'

# Check that the folder exists; stop with an error message if it does not
if not os.path.isdir(folder_path):
    raise SystemExit(f"Error: Folder path {folder_path} does not exist.")

# List to store individual DataFrames
dataframes = []

# Load all .pkl files from the folder
for filename in sorted(os.listdir(folder_path)):
    if filename.endswith('.pkl'):
        file_path = os.path.join(folder_path, filename)
        try:
            # Read the .pkl file and append to the dataframes list
            df = pd.read_pickle(file_path)
            dataframes.append(df)
        except Exception as e:
            print(f"Error loading {filename}: {e}")

# Combine all the DataFrames into one
if dataframes:
    combined_df = pd.concat(dataframes, ignore_index=True)

    # Display the combined DataFrame's shape and first few rows
    print("Shape of combined DataFrame:", combined_df.shape)
    print(combined_df.head())

    # Save the combined data into a CSV file
    combined_df.to_csv('combined_transactions.csv', index=False)
    print("Combined data saved to 'combined_transactions.csv'.")
else:
    print("No valid .pkl files found to load.")

Shape of combined DataFrame: (1754155, 9)
   TRANSACTION_ID         TX_DATETIME CUSTOMER_ID TERMINAL_ID  TX_AMOUNT  \
0               0 2018-04-01 00:00:31         596        3156      57.16   
1               1 2018-04-01 00:02:10        4961        3412      81.51   
2               2 2018-04-01 00:07:56           2        1365     146.00   
3               3 2018-04-01 00:09:29        4128        8737      64.49   
4               4 2018-04-01 00:10:34         927        9906      50.99   

  TX_TIME_SECONDS TX_TIME_DAYS  TX_FRAUD  TX_FRAUD_SCENARIO  
0              31            0         0                  0  
1             130            0         0                  0  
2             476            0         0                  0  
3             569            0         0                  0  
4             634            0         0                  0  
Combined data saved to 'combined_transactions.csv'.
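
In the five rows shown above, `TX_TIME_SECONDS` matches the number of seconds elapsed since 2018-04-01 00:00:00 (for example, 00:02:10 gives 130), and `TX_TIME_DAYS` is 0 on that first day. Below is a minimal sketch of how the two columns could be derived from `TX_DATETIME`, assuming both keep counting from that start date across the whole dataset. The helper name `time_features` and the `START` constant are hypothetical.

```python
from datetime import datetime

# Assumed reference point: midnight of the first day in the dataset.
START = datetime(2018, 4, 1)

def time_features(tx_datetime: datetime) -> tuple[int, int]:
    """Return (seconds since START, whole days since START)."""
    elapsed = tx_datetime - START
    seconds = int(elapsed.total_seconds())
    days = elapsed.days
    return seconds, days

# Rows 0 and 4 from the output above
print(time_features(datetime(2018, 4, 1, 0, 0, 31)))   # (31, 0)
print(time_features(datetime(2018, 4, 1, 0, 10, 34)))  # (634, 0)
```

Under that assumption, a transaction on the second day would have `TX_TIME_DAYS` equal to 1 and `TX_TIME_SECONDS` of at least 86400.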


In [3]:
# preprocess_data.py

import pandas as pd
from sklearn.model_selection import train_test_split

# Step 1: Load the saved data
data = pd.read_csv('combined_transactions.csv')

# Step 2: Select features (input columns) and target (what you want to predict)
features = ['TX_AMOUNT', 'TX_TIME_SECONDS', 'TX_TIME_DAYS']  # Example features
X = data[features]
y = data['TX_FRAUD']

# Step 3: Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Print shapes to check
print(f"Training Features Shape: {X_train.shape}")
print(f"Testing Features Shape: {X_test.shape}")
print(f"Training Labels Shape: {y_train.shape}")
print(f"Testing Labels Shape: {y_test.shape}")

Training Features Shape: (1403324, 3)
Testing Features Shape: (350831, 3)
Training Labels Shape: (1403324,)
Testing Labels Shape: (350831,)
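
Fraud makes up less than 1% of the test labels reported later in this notebook, so a plain random split can leave the train and test sets with slightly different fraud rates. One option, not used in the scripts here, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both sets. The sketch below uses synthetic labels; the 1% fraud rate and the array contents are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1,000 transactions, 10 of them labelled fraud (1%).
X = np.arange(1000).reshape(-1, 1)
y = np.zeros(1000, dtype=int)
y[:10] = 1

# stratify=y asks scikit-learn to preserve the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train fraud rate:", y_train.mean())
print("Test fraud rate: ", y_test.mean())
```

Switching the real pipeline to a stratified split would change the reported metrics, so the outputs in this notebook apply to the unstratified split only.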


In [4]:
# train_model.py

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 1: Load the saved CSV
data = pd.read_csv('combined_transactions.csv')

# Step 2: Define features (X) and label (y)
features = ['TX_AMOUNT', 'TX_TIME_SECONDS', 'TX_TIME_DAYS']  # Same as before
X = data[features]
y = data['TX_FRAUD']

# Step 3: Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Create and train the model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Step 5: Predict on the test set
y_pred = model.predict(X_test)

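# Step 6: Evaluate the model
# Note: fraud is under 1% of the test set, so accuracy alone says little;
# the per-class precision and recall in the report are more informative.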
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Step 7: Save the model so that predict.py can load it later
import joblib
joblib.dump(model, 'fraud_detection_model.pkl')
print("\nModel saved as 'fraud_detection_model.pkl'")

Accuracy: 0.99

Confusion Matrix:
[[345578   2392]
 [  2198    663]]

Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99    347970
           1       0.22      0.23      0.22      2861

    accuracy                           0.99    350831
   macro avg       0.61      0.61      0.61    350831
weighted avg       0.99      0.99      0.99    350831


Model saved as 'fraud_detection_model.pkl'
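
The 0.99 accuracy is worth reading against the class balance. As a rough check using only the numbers printed above, the sketch below recomputes accuracy, fraud precision, and fraud recall from the confusion matrix, then compares them with a hypothetical baseline that labels every transaction "Not Fraud". The variable names are illustrative and not part of the original scripts.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
tn, fp = 345578, 2392   # true class 0 (not fraud)
fn, tp = 2198, 663      # true class 1 (fraud)

total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_fraud = tp / (tp + fp)   # share of fraud alerts that are real fraud
recall_fraud = tp / (tp + fn)      # share of real fraud that is caught

# Hypothetical baseline: always predict "Not Fraud".
baseline_accuracy = (tn + fp) / total

print(f"Model accuracy:    {accuracy:.4f}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Fraud precision:   {precision_fraud:.2f}")
print(f"Fraud recall:      {recall_fraud:.2f}")
```

On these numbers the always-"Not Fraud" baseline scores slightly higher on accuracy (about 0.992 versus 0.987), while the model catches roughly 23% of fraud cases with about 22% precision. The fraud-class row of the classification report is therefore the figure to watch when comparing models.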


In [5]:
# predict.py

import pandas as pd
import joblib

# Step 1: Load the saved model
model = joblib.load('fraud_detection_model.pkl')
print("Model loaded successfully!")

# Step 2: Create some new sample transactions
# (You can change these values to test different transactions!)
new_transactions = pd.DataFrame({
    'TX_AMOUNT': [100.50, 5.75, 3000.00],
    'TX_TIME_SECONDS': [3600, 7200, 10800],
    'TX_TIME_DAYS': [0, 1, 2]
})

print("\nNew transactions:")
print(new_transactions)

# Step 3: Make predictions
predictions = model.predict(new_transactions)

# Step 4: Show the predictions
for idx, prediction in enumerate(predictions):
    result = "Fraud" if prediction == 1 else "Not Fraud"
    print(f"Transaction {idx+1}: {result}")



Model loaded successfully!

New transactions:
   TX_AMOUNT  TX_TIME_SECONDS  TX_TIME_DAYS
0     100.50             3600             0
1       5.75             7200             1
2    3000.00            10800             2
Transaction 1: Not Fraud
Transaction 2: Not Fraud
Transaction 3: Fraud
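
The file list mentions `new_transaction.csv`, while the cell above builds its sample transactions inline. If transactions are read from a CSV instead, it helps to check that the data has the same three feature columns the model was trained on, in the same order. Here is a minimal sketch with pandas; the helper name `select_features` is hypothetical, and the layout of `new_transaction.csv` is an assumption because the file is not shown here.

```python
import pandas as pd

# Feature columns used in train_model.py, in training order.
FEATURES = ['TX_AMOUNT', 'TX_TIME_SECONDS', 'TX_TIME_DAYS']

def select_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the model's feature columns, in training order."""
    missing = [col for col in FEATURES if col not in df.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    return df[FEATURES]

# Example with an in-memory frame standing in for pd.read_csv(...)
sample = pd.DataFrame({
    'TX_TIME_DAYS': [0],
    'TX_AMOUNT': [100.50],
    'TX_TIME_SECONDS': [3600],
    'TERMINAL_ID': [3156],   # extra column, dropped by select_features
})
print(select_features(sample))
```

The returned frame could then be passed to `model.predict` in place of `new_transactions`.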
