## Data Loading & Initial Exploration

In [1]:
import pandas as pd

file_path = r"C:\Users\Asus\Downloads\creditcard.csv\creditcard.csv"

print("--- Step 1: Data Loading ---")
try:
    df = pd.read_csv(file_path)
    
    print(f" Data file successfully loaded. Total transactions: {len(df)}")
except FileNotFoundError:
    print(f" Error: File not found at '{file_path}'. Please check the path.")
    raise  # Stop here: later cells need df, and exit() would shut down the notebook kernel

# 2. Initial Inspection
print("\n--- Initial Inspection (Head & Info) ---")
print(df.head())
df.info()  # info() prints its own summary and returns None, so it is not wrapped in print()

# 3. Target Variable Distribution (Class)
print("\n--- Target Variable Distribution (Class) ---")
# 'Class' (0: Legitimate, 1: Fraud) is the target variable
class_counts = df['Class'].value_counts()
print(class_counts)

# Calculate the imbalance
fraud_count = class_counts.get(1, 0) # Use .get(1, 0) in case no fraud exists (unlikely)
total_count = len(df)
fraud_percentage = (fraud_count / total_count) * 100

print(f"\nFraudulent transactions (Class 1): {fraud_count}")
print(f"Percentage of Fraud: {fraud_percentage:.6f}%")

--- Step 1: Data Loading ---
 Data file successfully loaded. Total transactions: 284807

--- Initial Inspection (Head & Info) ---
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175

## Data Preprocessing (Scaling and Feature Selection)

#### 1. Code for Scaling

In [2]:
from sklearn.preprocessing import RobustScaler

# For anomaly detection we prefer RobustScaler because it is less affected
# by outliers (which here are the fraud transactions themselves).
robust_scaler = RobustScaler()

# Features to be scaled: Time and Amount
df['scaled_amount'] = robust_scaler.fit_transform(df['Amount'].values.reshape(-1, 1))
df['scaled_time'] = robust_scaler.fit_transform(df['Time'].values.reshape(-1, 1))

# Drop original Time and Amount columns
df = df.drop(['Time', 'Amount'], axis=1)

# Split the data into features (X) and target (y = 'Class')
X = df.drop('Class', axis=1)
y = df['Class']

print(" Time and Amount features successfully scaled using RobustScaler.")
print("\n--- Processed Data Head (Showing scaled features) ---")
print(X.head())

 Time and Amount features successfully scaled using RobustScaler.

--- Processed Data Head (Showing scaled features) ---
         V1        V2        V3        V4        V5        V6        V7  \
0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9       V10  ...       V21       V22       V23       V24  \
0  0.098698  0.363787  0.090794  ... -0.018307  0.277838 -0.110474  0.066928   
1  0.085102 -0.255425 -0.166974  ... -0.225775 -0.638672  0.101288 -0.339846   
2  0.247676 -1.514654  0.207643  ...  0.247998  0.771679  0.909412 -0.689281   
3  0.377436 -1.387024 -0.054952  ... -0.108300  0.005274 -0.190321 -1.175575   
4 -0.270533  0.817739  0.753
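
To make the scaling choice concrete, here is a minimal sketch of the transformation that `RobustScaler` applies with its default settings: subtract the median, then divide by the interquartile range. The helper name `robust_scale` and the sample amounts are hypothetical and used only for illustration.

```python
import numpy as np

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    q1, q3 = np.percentile(values, [25, 75])
    return (values - median) / (q3 - q1)

# Hypothetical amounts: four ordinary values and one extreme outlier.
amounts = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = robust_scale(amounts)
print(scaled)

# For comparison, standard scaling (mean / standard deviation) lets the
# single outlier squeeze the ordinary values close together.
standardized = (np.asarray(amounts) - np.mean(amounts)) / np.std(amounts)
print(standardized)
```

With robust scaling the four ordinary values keep a spread of 1.5 units, while under standard scaling the outlier inflates the standard deviation and the same values end up less than 0.1 apart.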

#### 2. Isolation Forest Setup (One-Class Model)

In [3]:
# Separate Legitimate and Fraudulent transactions
df_legitimate = df[df['Class'] == 0]
df_fraud = df[df['Class'] == 1]

# Train Isolation Forest on the legitimate data only. Selecting these rows uses the
# labels, so this is a one-class (novelty detection) setup rather than a fully unsupervised one.
X_train_isolation = df_legitimate.drop('Class', axis=1)

print(f"\nIsolation Forest Training Data size (Legitimate only): {len(X_train_isolation)}")


Isolation Forest Training Data size (Legitimate only): 284315


## Model Training (Isolation Forest)

In [5]:
from sklearn.ensemble import IsolationForest

# Contamination: the proportion of the data we expect to be anomalies (fraud).
# We set it close to the actual fraud percentage (0.172749%).
contamination_rate = 0.0017275  # ~0.17275%

# Initialize the Isolation Forest model
# random_state=42 for reproducibility
iso_forest = IsolationForest(
    n_estimators=100, 
    max_samples='auto', 
    contamination=contamination_rate, 
    random_state=42, 
    verbose=0
)

# Train the model ONLY on the legitimate transactions (X_train_isolation)
print("\nTraining Isolation Forest Model (Unsupervised)...")
iso_forest.fit(X_train_isolation)

print(" Isolation Forest Model successfully trained.")


Training Isolation Forest Model (Unsupervised)...
 Isolation Forest Model successfully trained.
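
The effect of `contamination` can be sketched on synthetic data. When it is set to a number, scikit-learn places the decision threshold (`offset_`) at the matching percentile of the training scores, so roughly that fraction of the training rows is flagged as outliers. The data, the variable names, and the 1% rate below are assumptions for illustration, not the notebook's values.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (an assumption; not the credit card dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 2))

contamination = 0.01
model = IsolationForest(n_estimators=100, contamination=contamination, random_state=0)
model.fit(X_demo)

# Lower scores mean "more anomalous". The fitted offset_ is the score
# percentile that corresponds to the requested contamination.
scores = model.score_samples(X_demo)
manual_threshold = np.percentile(scores, 100.0 * contamination)

flagged_fraction = np.mean(model.predict(X_demo) == -1)
print(f"offset_          : {model.offset_:.6f}")
print(f"manual threshold : {manual_threshold:.6f}")
print(f"flagged fraction : {flagged_fraction:.4f}")
```

Because the notebook fits on legitimate rows only, about 284315 × 0.0017275 ≈ 491 legitimate transactions are flagged by construction, which is consistent with the 492 false positives in the confusion matrix reported in the evaluation section.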


## Prediction & Evaluation (Isolation Forest)

#### 1. Prediction

In [7]:
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Predict anomalies (frauds) on the entire dataset (X)
# X contains all transactions (Legitimate and Fraud)
print("Making predictions on the entire dataset...")
y_pred_iso = iso_forest.predict(X)

# Map the Isolation Forest output (1: Inlier, -1: Outlier) to our target labels (0: Legitimate, 1: Fraud)
# Fraud (1) is represented by -1 (Outlier) in Isolation Forest.
y_pred_mapped = np.where(y_pred_iso == -1, 1, 0) 

print(" Predictions completed and mapped to 0/1.")

Making predictions on the entire dataset...
 Predictions completed and mapped to 0/1.


#### 2. Evaluation Metrics

In [9]:
# Calculate Confusion Matrix on the full dataset (this includes the legitimate rows used for fitting)
cm_iso = confusion_matrix(y, y_pred_mapped)
print("\n--- Confusion Matrix ---")
print(cm_iso)

# Generate the detailed Classification Report
print("\n--- Classification Report (Focus on Class 1: Fraud) ---")
print(classification_report(y, y_pred_mapped))


--- Confusion Matrix ---
[[283823    492]
 [   357    135]]

--- Classification Report (Focus on Class 1: Fraud) ---
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    284315
           1       0.22      0.27      0.24       492

    accuracy                           1.00    284807
   macro avg       0.61      0.64      0.62    284807
weighted avg       1.00      1.00      1.00    284807
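
The class 1 figures in the report can be reproduced by hand from the confusion matrix above. The sketch below also computes the accuracy of a trivial baseline that labels every transaction as legitimate, which helps explain why the reported accuracy of 1.00 says little on data this imbalanced. The variable names are illustrative.

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
tn, fp = 283823, 492
fn, tp = 357, 135
total = tn + fp + fn + tp

precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of fraud transactions that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

# Baseline: a model that labels every transaction as legitimate.
baseline_accuracy = (tn + fp) / total

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.4f} baseline_accuracy={baseline_accuracy:.4f}")
```

The baseline reaches a slightly higher accuracy than the Isolation Forest while catching no fraud at all, so precision and recall on class 1 are the figures to compare.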



## Supervised Approach with SMOTE (Logistic Regression)

In [10]:
# We will test this data with a supervised model such as Logistic Regression,
# using SMOTE or class weighting to handle the imbalance.

from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE # Requires installation: pip install imbalanced-learn
from sklearn.model_selection import train_test_split

# 1. Data Splitting (Using all features X)
# X (V1-V28, scaled_time, scaled_amount), y (Class)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y # Preserve the class imbalance in both the train and test sets
)

# 2. Apply SMOTE to the training data ONLY to balance the classes
print("\nApplying SMOTE to training data...")
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

print(f"Original Training Data size: {len(X_train)}")
print(f"Resampled Training Data size: {len(X_train_res)}")
print(f"Resampled Class Distribution:\n{y_train_res.value_counts()}")


Applying SMOTE to training data...


MemoryError: Unable to allocate 52.1 MiB for an array with shape (30, 227845) and data type float64
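
The opening comment of this cell names class weighting as an alternative to SMOTE. Since it creates no synthetic rows, it is one option when memory is tight. As a rough sketch, the "balanced" heuristic gives each class a weight of `n_samples / (n_classes * class_count)`. The counts below are the full-dataset figures reported earlier; in practice the weights would come from the training labels only.

```python
# Class counts reported earlier in the notebook (full dataset).
n_legit, n_fraud = 284315, 492
n_samples = n_legit + n_fraud
n_classes = 2

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weight_legit = n_samples / (n_classes * n_legit)
weight_fraud = n_samples / (n_classes * n_fraud)

print(f"weight for class 0: {weight_legit:.4f}")
print(f"weight for class 1: {weight_fraud:.2f}")
```

In scikit-learn, passing `class_weight="balanced"` to `LogisticRegression` applies this heuristic to the labels it is fit on, so each fraud row counts several hundred times more than a legitimate row in the loss.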

## Downsampling and Supervised Training (Logistic Regression)

In [11]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Separate Legitimate and Fraudulent transactions (using the variables created earlier)
# df_legitimate (284315 samples), df_fraud (492 samples)

# 1. Downsample Legitimate Transactions (Class 0)
# Target ratio: 5:1 legitimate-to-fraud (Fraud: 492, Legitimate: 492 * 5 = 2460)
legit_sample_size = len(df_fraud) * 5 # 492 * 5 = 2460 samples

# Randomly select a smaller subset of legitimate transactions
df_legitimate_sampled = df_legitimate.sample(n=legit_sample_size, random_state=42)

# 2. Combine the sampled legitimate data and all fraud data
df_balanced = pd.concat([df_legitimate_sampled, df_fraud], ignore_index=True)

print(f" Legitimate data successfully downsampled to {legit_sample_size} samples.")
print(f"Total Combined Balanced Data Size: {len(df_balanced)}")
print(f"Balanced Data Class Distribution:\n{df_balanced['Class'].value_counts()}")

# 3. Final Split (Downsampled)
X_balanced = df_balanced.drop('Class', axis=1)
y_balanced = df_balanced['Class']

# Split the balanced data into Training and Testing sets
X_train_bal, X_test_bal, y_train_bal, y_test_bal = train_test_split(
    X_balanced, y_balanced, 
    test_size=0.3, # Using a standard 70/30 split on the balanced data
    random_state=42, 
    stratify=y_balanced
)

 Legitimate data successfully downsampled to 2460 samples.
Total Combined Balanced Data Size: 2952
Balanced Data Class Distribution:
Class
0    2460
1     492
Name: count, dtype: int64


In [12]:
# Supervised Model Training (Logistic Regression)
# Initialize and Train Logistic Regression on the downsampled data
lr_model = LogisticRegression(random_state=42)

print("\nTraining Logistic Regression Model on Downsampled Data...")
lr_model.fit(X_train_bal, y_train_bal)
print(" Logistic Regression Model successfully trained.")


Training Logistic Regression Model on Downsampled Data...
 Logistic Regression Model successfully trained.


## Evaluation on the Original, Unbalanced Test Set

In [13]:
# We reuse the same split as in Step 5: a 20% test set drawn from the original data,
# using the original X (all features) and y (Class).

# Original Data Splitting (20% for testing)
from sklearn.model_selection import train_test_split
# The X and y variables were defined in Step 2.
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y # Preserve the class imbalance
)

print(f"Original Test Set Size: {len(X_test_orig)}")
print(f"Original Test Set Fraud Count: {y_test_orig.value_counts()[1]}")

Original Test Set Size: 56962
Original Test Set Fraud Count: 98
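
Because the downsampled training set was drawn from the full dataset, rows in this test split can also appear in the data the model was trained on. One way to avoid that overlap, sketched here on synthetic data, is to split first and then downsample only the training portion. The frame `demo`, its columns, and the 1% fraud rate are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's df (an assumption for illustration).
rng = np.random.default_rng(42)
n_rows = 10_000
demo = pd.DataFrame({
    "V1": rng.normal(size=n_rows),
    "V2": rng.normal(size=n_rows),
    "Class": (rng.random(n_rows) < 0.01).astype(int),
})

# 1. Split first, so the test rows are set aside before any resampling.
train_df, test_df = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["Class"]
)

# 2. Downsample legitimate rows inside the training portion only (5:1).
train_fraud = train_df[train_df["Class"] == 1]
train_legit = train_df[train_df["Class"] == 0].sample(
    n=len(train_fraud) * 5, random_state=42
)
train_down = pd.concat([train_legit, train_fraud])

# 3. No row index is shared between the downsampled training set and the test set.
overlap = train_down.index.intersection(test_df.index)
print(f"train rows: {len(train_down)}, test rows: {len(test_df)}, overlap: {len(overlap)}")
```

The test set keeps the original class imbalance, and the model never sees any of its rows during fitting.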


In [14]:
# 1. Prediction
# Use the trained Logistic Regression model (lr_model)
# to predict on the Original Test Set (X_test_orig)
print("\nMaking predictions on the ORIGINAL, UNBALANCED Test Set...")
y_pred_lr = lr_model.predict(X_test_orig)

print(" Predictions completed.")


Making predictions on the ORIGINAL, UNBALANCED Test Set...
 Predictions completed.


In [15]:
# 2. Evaluation Metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Calculate Confusion Matrix
cm_lr = confusion_matrix(y_test_orig, y_pred_lr)
print("\n--- Confusion Matrix (Logistic Regression) ---")
print(cm_lr)

# Generate the detailed Classification Report
print("\n--- Classification Report (Focus on Class 1: Fraud) ---")
print(classification_report(y_test_orig, y_pred_lr))

# Calculate Accuracy (just for reporting)
accuracy_lr = accuracy_score(y_test_orig, y_pred_lr)
print(f"Overall Accuracy: {accuracy_lr:.4f}")


--- Confusion Matrix (Logistic Regression) ---
[[56377   487]
 [   10    88]]

--- Classification Report (Focus on Class 1: Fraud) ---
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     56864
           1       0.15      0.90      0.26        98

    accuracy                           0.99     56962
   macro avg       0.58      0.94      0.63     56962
weighted avg       1.00      0.99      0.99     56962

Overall Accuracy: 0.9913
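
The report above reflects the default 0.5 probability cutoff used by `predict`. As a hedged sketch of how the precision/recall balance could be explored, the example below fits a logistic regression on synthetic data and sweeps the cutoff applied to `predict_proba`. The data, the `class_weight="balanced"` setting, and the chosen thresholds are assumptions rather than part of the original analysis, and scoring is done on the same synthetic data for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic imbalanced data (an assumption; not the credit card dataset).
rng = np.random.default_rng(0)
n_legit, n_fraud = 5000, 100
X_demo = np.vstack([
    rng.normal(loc=0.0, size=(n_legit, 2)),
    rng.normal(loc=2.0, size=(n_fraud, 2)),
])
y_demo = np.concatenate([np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)])

clf = LogisticRegression(class_weight="balanced").fit(X_demo, y_demo)
fraud_proba = clf.predict_proba(X_demo)[:, 1]

# predict() corresponds to a 0.5 cutoff; raising it flags fewer transactions.
for threshold in (0.5, 0.7, 0.9):
    y_hat = (fraud_proba >= threshold).astype(int)
    p = precision_score(y_demo, y_hat, zero_division=0)
    r = recall_score(y_demo, y_hat)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

Raising the cutoff can only keep recall the same or lower it, since fewer transactions are flagged; whether the precision gained is worth the fraud missed depends on the cost of each kind of error.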


## Project Conclusion: Downsampled Logistic Regression (Supervised)

The Credit Card Fraud Detection project explored both unsupervised (Isolation Forest) and supervised (Logistic Regression with Downsampling) methods on a highly imbalanced dataset (0.17% Fraud).

### Key Findings and Model Selection

1.  **Isolation Forest (Unsupervised)**: Performed poorly when evaluated on the full dataset, with a **Recall of $0.27$** (missing 73% of actual fraud cases) and a Precision of $0.22$, making it unsuitable for a production environment.

2.  **Logistic Regression (Downsampled Supervised)**:
    * The majority class (Legitimate) was downsampled to a 5:1 legitimate-to-fraud ratio (2460 legitimate, 492 fraud), so the model saw enough of the rare Fraud class to learn its patterns.
    * It achieved a **Recall of $0.90$**, correctly identifying 88 out of 98 fraud transactions in the test set. This reduces **False Negatives** (missed fraud) to just **10**.
    * **Trade-off:** This high Recall came at the cost of **Precision ($0.15$)**, resulting in 487 **False Positives** (legitimate transactions incorrectly flagged as fraud).
    * **Caveat:** The downsampled training data and the 20% test split were both drawn from the full dataset, so some test transactions can also appear in the training data; the reported Recall may therefore be optimistic.

### Final Recommendation

The **Logistic Regression (Downsampled)** model is the preferred solution. In financial security, **maximizing Recall (catching fraud)** is typically prioritized over maximizing Precision, as the cost of a missed fraud case is generally higher than the inconvenience of flagging a legitimate transaction (which can be manually verified).