# Credit Card Fraud Detection

In this notebook, we construct and assess a model that can flag suspicious transactions as potentially fraudulent using credit card transaction data.

# 1. Environment and Data Initialization

We begin by bringing in essential libraries that support data operations, analysis, and visualization.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("C:/Users/DELL/Downloads/creditcard (1).csv")

# 2. Exploratory Data Analysis (EDA)

Understanding the dataset's dimensions and contents helps us prepare for model development by identifying potential issues and opportunities in the data.

In [None]:
df.shape

This line reveals the structure of the dataset, highlighting the total number of records and features.

In [None]:
df.isna().sum()

`df.isna().sum()` counts the missing (NaN/null) values in each column.  
No column contains missing values, so no imputation or removal of rows/columns is needed.

In [None]:
df.describe()

This provides descriptive statistics for numerical columns, such as count, mean, standard deviation, min, max, and quartile values.  
Most columns (V1 through V28) are anonymized features, likely obtained through a technique called PCA (Principal Component Analysis) to protect user privacy.  
The Time column represents the seconds elapsed between each transaction and the first transaction in the dataset.  
The Amount column shows the transaction amount.  
The Class column is our target variable, where 0 indicates a legitimate transaction and 1 indicates a fraudulent one.  
The mean of the Class column is 0.001727, which is very close to zero.  
This immediately signals a severe class imbalance: fraudulent transactions are very rare.

#### Renaming 'Class' Values
The numerical 0 and 1 in the Class column are replaced with more descriptive labels 'Not Fraud' and 'Fraud' for better readability and understanding of results.

In [None]:
df['Class'] = df['Class'].replace({0:'Not Fraud',1:'Fraud'})

The next cell counts the occurrences of each unique value in the 'Class' column.

In [None]:
df['Class'].value_counts()

This confirms the extreme imbalance in the dataset. Out of 284,807 transactions, 284,315 are legitimate, while only 492 are fraudulent. This means fraud accounts for approximately (492 / 284807) * 100 = 0.17% of the total transactions. This imbalance is a significant challenge for machine learning models, as they might be biased towards the majority class ('Not Fraud').
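The percentage above can be reproduced with `value_counts(normalize=True)`. The sketch below is only an illustration: it rebuilds a stand-in for `df['Class']` from the two counts quoted in the text rather than loading the CSV, and the names `labels`, `shares`, and `legit_per_fraud` are assumptions introduced here.

```python
import pandas as pd

# Stand-in for df['Class'], rebuilt from the class counts quoted above
labels = pd.Series(["Not Fraud"] * 284_315 + ["Fraud"] * 492, name="Class")

counts = labels.value_counts()                # absolute counts per class
shares = labels.value_counts(normalize=True)  # relative frequencies per class

fraud_pct = shares["Fraud"] * 100
legit_per_fraud = counts["Not Fraud"] / counts["Fraud"]

print(f"Fraud share: {fraud_pct:.3f}%")
print(f"Legitimate transactions per fraud: {legit_per_fraud:.0f}")
```

Expressing the imbalance as "legitimate transactions per fraud" is sometimes easier to reason about than a small percentage.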

#### Fraud Pie Chart
A pie chart is generated to visually represent the distribution of 'Fraud' vs. 'Not Fraud' transactions.

In [None]:
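fraud = df['Class'].value_counts()
# Map each class to a colour so the slices stay correct whatever the count order
color = {'Not Fraud': 'green', 'Fraud': 'red'}

plt.figure(figsize=(10,8))
plt.pie(fraud, labels=fraud.index, colors=[color[c] for c in fraud.index], autopct='%1.2f%%')
plt.title("Fraud Pie Chart")
plt.show()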

The pie chart makes the imbalance obvious: the 'Fraud' slice is barely visible, which highlights how rare fraudulent transactions are.

# 3. Data Preparation
Before training a machine learning model, the data needs to be structured appropriately.

#### Separating Features (X) and Target (Y)  
The dataset is divided into two parts:  
x (Features): Contains all columns from the DataFrame except the 'Class' column. These are the input variables the model will use to make predictions.  
y (Target): Contains only the 'Class' column. This is the output variable that the model will try to predict.

In [None]:
x = df.drop('Class',axis=1)
y = df['Class']

#### Splitting Data into Training and Testing Sets  
The dataset is split into training and testing subsets to evaluate the model's performance on unseen data.  
train_test_split: A function from scikit-learn for splitting arrays or matrices into random train and test subsets.  
test_size=0.2: 20% of the data will be used for testing, and the remaining 80% for training.  
random_state=7: This ensures reproducibility. If you run the code multiple times, the data will be split in the exact same way.  
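With so few fraud cases, a purely random split can leave the two subsets with slightly different fraud rates. One common option, which this notebook does not use, is the `stratify` argument of `train_test_split`. The sketch below runs on synthetic stand-in data (`X_demo`, `y_demo` are made up for illustration) and only shows the mechanics.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10,000 rows with 1% positives (not the real dataset)
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(10_000, 5))
y_demo = np.array([1] * 100 + [0] * 9_900)

# stratify=y_demo keeps the class ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate: ", y_te.mean())
```

Whether stratification changes the results for the real data is not something this sketch can show; it only demonstrates that both subsets keep the 1% rate.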


In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=7)

In [None]:
len(x_train),len(x_test)

Output: (227845, 56962).  
This confirms that 227,845 samples are used for training and 56,962 samples for testing.

#### Feature Scaling (StandardScaler)
Feature scaling is applied to standardize the range of independent variables or features.

StandardScaler: Transforms data such that its mean is 0 and standard deviation is 1 (Z-score normalization). This is important for algorithms that are sensitive to feature scales (like Logistic Regression or K-Nearest Neighbors).  

sc.fit_transform(x_train): The StandardScaler is fitted (learns the mean and standard deviation) only on the training data and then transforms it.  

sc.transform(x_test): The same scaler (with means and standard deviations learned from x_train) is used to transform the test data. This prevents "data leakage" where information from the test set could influence the training process.  

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
sc = StandardScaler()

x_train_sc = sc.fit_transform(x_train)
x_test_sc = sc.transform(x_test)

# 4. Model Training & Evaluation
This section involves selecting a machine learning model, training it, and then evaluating its performance.
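The scale-then-fit steps used in this notebook can also be bundled into a single scikit-learn `Pipeline`, which makes it harder to fit the scaler on test data by accident. This is an alternative arrangement, not what the notebook does; the data below is synthetic (`make_classification`) and all variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the credit card dataset)
X_demo, y_demo = make_classification(
    n_samples=5_000, n_features=10, weights=[0.97, 0.03], random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo
)

# fit() learns the scaling statistics from the training data only;
# predict() reuses those statistics on the test data.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
y_hat = pipe.predict(X_te)

scaler = pipe.named_steps["standardscaler"]
print("Means learned from the training data:", scaler.mean_[:3])
```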

#### Logistic Regression Model
Logistic Regression is a common algorithm for binary classification problems.  

LogisticRegression(): An instance of the Logistic Regression model is created and stored in lr_model.  
lr_model.fit(x_train_sc, y_train): The model is trained using the scaled training features (x_train_sc) and their corresponding target labels (y_train).  
y_pred = lr_model.predict(x_test_sc): After training, the model predicts the 'Class' for the scaled test features (x_test_sc).  

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
lr_model = LogisticRegression()

lr_model.fit(x_train_sc,y_train)
y_pred  =lr_model.predict(x_test_sc)

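lr_accu = accuracy_score(y_test, y_pred)
print("Logistic regression accuracy_score: ",lr_accu*100)

# Compute the confusion matrix once and reuse it for printing and plotting
cm_lr = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm_lr)

import seaborn as sns

# fmt='d' shows integer counts; classes_ labels the axes in the matrix's own order
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues',
            xticklabels=lr_model.classes_, yticklabels=lr_model.classes_)
plt.title("Confusion Matrix")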

In [None]:
print("\nClassification Report:")
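# The classes are strings, so pick them with labels=; target_names would be
# applied to the alphabetically sorted classes and swap the two row names
print(classification_report(y_test, y_pred, labels=['Not Fraud', 'Fraud']))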

#### Accuracy Score
Accuracy measures the proportion of total predictions that were correct.  

accuracy_score compares the model's predictions (y_pred) with the actual test labels (y_test) and returns the fraction that match.  
Output: accuracy_score: 99.90344440153085  
Explanation: The model achieved an accuracy of approximately 99.90%. While this number seems very high, it is misleading in the context of our highly imbalanced dataset.  

A simple model that always predicts 'Not Fraud' would achieve an accuracy of 284315 / 284807 ≈ 99.83% on the full dataset, so the trained model improves on that trivial baseline by less than 0.1 percentage points. Therefore, accuracy alone is not a sufficient metric for evaluating fraud detection models.
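The "always predict 'Not Fraud'" baseline can be checked with scikit-learn's `DummyClassifier`. The sketch below does not load the CSV: it builds stand-in labels from the class counts given earlier, and `X_demo` is a placeholder feature matrix that the baseline ignores.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Stand-in labels with the same class counts as the full dataset
y_demo = np.array(["Not Fraud"] * 284_315 + ["Fraud"] * 492)
X_demo = np.zeros((len(y_demo), 1))  # placeholder; the baseline ignores features

# strategy="most_frequent" always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

base_acc = accuracy_score(y_demo, y_base)
base_recall = recall_score(y_demo, y_base, pos_label="Fraud")

print(f"Baseline accuracy: {base_acc:.4%}")
print(f"Baseline fraud recall: {base_recall:.0%}")
```

A baseline like this gives a reference point: a useful model has to beat it on fraud recall, not just match it on accuracy.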

#### Confusion Matrix

A confusion matrix provides a more detailed breakdown of correct and incorrect classifications for each class.  
confusion_matrix(y_test, y_pred): This call generates the confusion matrix, with actual classes as rows and predicted classes as columns. The same counts can also be tabulated with labelled axes using pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted']).  

#### Interpretation of the Confusion Matrix:
True Negatives (TN - Not Fraud Actual, Not Fraud Predicted): 56,851 transactions were correctly identified as legitimate.  
False Positives (FP - Not Fraud Actual, Fraud Predicted): 11 legitimate transactions were incorrectly classified as fraudulent (false alarms).  
False Negatives (FN - Fraud Actual, Not Fraud Predicted): 44 fraudulent transactions were incorrectly classified as legitimate. This is the most critical error in fraud detection, as these are the frauds that go undetected.  
True Positives (TP - Fraud Actual, Fraud Predicted): 56 fraudulent transactions were correctly identified.  
Conclusion from Confusion Matrix: Out of a total of 56 + 44 = 100 actual fraudulent transactions in the test set, the model successfully detected only 56. This means 44% of the actual fraud cases were missed by the model, which is a significant weakness for a fraud detection system, despite the high overall accuracy.
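The four counts above are enough to derive the usual metrics by hand. The following sketch simply plugs in the reported TN, FP, FN, and TP values; the variable names are illustrative.

```python
# Counts reported in the confusion matrix interpretation above
tn, fp, fn, tp = 56_851, 11, 44, 56

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of fraud alerts that were real fraud
recall = tp / (tp + fn)     # share of real fraud that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The gap between the accuracy (about 0.999) and the fraud recall (0.56) is the core reason accuracy alone is misleading here.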

#### Confusion Matrix Heatmap
The confusion matrix is visualized as a heatmap for easier interpretation.  
Because the class labels are strings, scikit-learn orders them alphabetically, so 'Fraud' occupies the first row and column. The heatmap therefore shows the vast number of correctly predicted 'Not Fraud' transactions in the bottom-right cell and the smaller but significant number of missed frauds (False Negatives) in the top-right cell.
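Which cell holds which count depends on the label order: `confusion_matrix` sorts string labels alphabetically unless `labels=` is passed. The tiny example below uses hand-made, hypothetical labels (not notebook output) to show both layouts.

```python
from sklearn.metrics import confusion_matrix

# Tiny hand-made example (hypothetical labels, not notebook output)
y_true = ["Not Fraud", "Not Fraud", "Not Fraud", "Fraud", "Fraud"]
y_hat = ["Not Fraud", "Not Fraud", "Fraud", "Fraud", "Not Fraud"]

# Default: labels sorted alphabetically, i.e. ['Fraud', 'Not Fraud']
cm_default = confusion_matrix(y_true, y_hat)

# Explicit order: 'Not Fraud' first, as with the original 0/1 encoding
cm_explicit = confusion_matrix(y_true, y_hat, labels=["Not Fraud", "Fraud"])

print(cm_default)
print(cm_explicit)
```

Passing `labels=` explicitly keeps the TN/FP/FN/TP positions stable regardless of how the classes are encoded.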

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(x_train_sc, y_train)
y_pred_dt = dt_model.predict(x_test_sc)

dt_accu = accuracy_score(y_test, y_pred_dt)

print("Decision Tree Accuracy: ",dt_accu * 100)

cm_dt = confusion_matrix(y_test, y_pred_dt)

print("\nConfusion Matrix:")
print(cm_dt)

# fmt='d' shows integer counts instead of scientific notation
sns.heatmap(cm_dt,annot=True,fmt='d',cmap='Blues')
plt.title("Confusion Matrix")


## Decision Tree Model
Decision Trees classify data by learning simple decision rules inferred from features, forming a tree-like structure.

Explanation: A DecisionTreeClassifier is initialized and trained, then used to predict on the test set.

#### Performance Metrics
Accuracy Score:  
Output: Decision Tree Accuracy: [e.g., 99.91]%  
Interpretation: Similar to Logistic Regression, overall accuracy can be deceptive; deeper analysis is required.



In [None]:
print("\nClassification Report:")
# labels= selects the string classes by name, so each row is labelled correctly
print(classification_report(y_test, y_pred_dt, labels=['Not Fraud', 'Fraud']))

Interpretation: Highlights the precision, recall, and F1-score, with recall for 'Fraud' being a key indicator of fraud detection capability.
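To pull out only the fraud-class numbers instead of reading them off the full report, the individual metric functions accept a `pos_label`. The helper below is a hypothetical convenience function (`fraud_metrics` is not part of the notebook), shown on a few hand-made labels; `pos_label="Fraud"` assumes the string labels created earlier.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

def fraud_metrics(y_true, y_pred, pos_label="Fraud"):
    """Return precision, recall and F1 for the positive (fraud) class."""
    return {
        "precision": precision_score(y_true, y_pred, pos_label=pos_label),
        "recall": recall_score(y_true, y_pred, pos_label=pos_label),
        "f1": f1_score(y_true, y_pred, pos_label=pos_label),
    }

# Hypothetical labels for illustration only: 3 TP, 1 FN, 1 FP, 3 TN
y_true = ["Fraud"] * 4 + ["Not Fraud"] * 4
y_hat = ["Fraud", "Fraud", "Fraud", "Not Fraud",
         "Fraud", "Not Fraud", "Not Fraud", "Not Fraud"]

metrics = fraud_metrics(y_true, y_hat)
print(metrics)
```

In the notebook, the same helper could be called as `fraud_metrics(y_test, y_pred_dt)` once the cells above have run.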

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=42)

rf_model.fit(x_train_sc, y_train)

y_pred_rf = rf_model.predict(x_test_sc)

rf_accu = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy: ",rf_accu * 100)

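# Confusion matrix for the Random Forest predictions
cm_rf = confusion_matrix(y_test, y_pred_rf)

print("\nConfusion Matrix:")
print(cm_rf)

# fmt='d' shows integer counts instead of scientific notation
sns.heatmap(cm_rf,annot=True,fmt='d',cmap='Blues')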
plt.title("Confusion Matrix")


## Random Forest Model
Random Forest is an ensemble method that builds multiple decision trees and merges their predictions to improve accuracy and control overfitting.

Explanation: A RandomForestClassifier is trained on the scaled data, and predictions are made.

#### Performance Metrics
Accuracy Score:  
Output: Random Forest Accuracy: [e.g., 99.95]%  
Interpretation: Often higher than single trees, but still requires scrutiny beyond raw accuracy for imbalanced data.

In [None]:
print("\nClassification Report:")
# labels= selects the string classes by name, so each row is labelled correctly
print(classification_report(y_test, y_pred_rf, labels=['Not Fraud', 'Fraud']))

Interpretation: Generally shows robust performance across metrics due to its ensemble nature, making it suitable for fraud detection.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier()

knn_model.fit(x_train_sc, y_train)

y_pred_knn = knn_model.predict(x_test_sc)

knn_accu = accuracy_score(y_test, y_pred_knn)
print("K-Nearest Neighbors Accuracy: ",knn_accu * 100)

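# Confusion matrix for the KNN predictions
cm_knn = confusion_matrix(y_test, y_pred_knn)

print("\nConfusion Matrix:")
print(cm_knn)

# fmt='d' shows integer counts instead of scientific notation
sns.heatmap(cm_knn,annot=True,fmt='d',cmap='Blues')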
plt.title("Confusion Matrix")


## K-Nearest Neighbors (KNN) Model
KNN is a non-parametric, instance-based learning algorithm that classifies new data points based on the majority class of their 'k' nearest neighbors.

Explanation: A KNeighborsClassifier is trained, and predictions are generated for the test set.

#### Performance Metrics
Accuracy Score:  
Output: K-Nearest Neighbors Accuracy: [e.g., 99.95]%  
Interpretation: KNN's performance is sensitive to k and feature scaling; accuracy alone is insufficient.
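The sensitivity to k can be explored by refitting the classifier for several values of `n_neighbors` and tracking minority-class recall. The sketch below uses synthetic data in which label 1 plays the role of 'Fraud'; the chosen k values and all names are assumptions for illustration, and no claim is made about which k suits the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (label 1 plays the role of 'Fraud')
X_demo, y_demo = make_classification(
    n_samples=4_000, n_features=10, weights=[0.95, 0.05], random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo
)

# Scale with statistics learned from the training split only
scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)
X_te_sc = scaler.transform(X_te)

# Minority-class recall for a few neighbourhood sizes
recall_by_k = {}
for k in (1, 3, 5, 11, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr_sc, y_tr)
    recall_by_k[k] = recall_score(y_te, knn.predict(X_te_sc), pos_label=1)

for k, rec in recall_by_k.items():
    print(f"k={k:>2}: minority recall = {rec:.2f}")
```

For an actual model choice, k would normally be selected with cross-validation on the training data rather than by looking at test-set recall, as this sketch does for brevity.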

In [None]:
print("\nClassification Report:")
print(classification_report(y_test, y_pred_knn, target_names=['Not Fraud', 'Fraud']))

Interpretation: Crucial for understanding the balance between precision and recall for the minority class.

In [None]:
from sklearn.svm import SVC

svm_model = SVC(random_state=42)

svm_model.fit(x_train_sc, y_train)
y_pred_svm = svm_model.predict(x_test_sc)

svc_accu = accuracy_score(y_test, y_pred_svm)
print("SVM Accuracy: ", svc_accu * 100)

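# Confusion matrix for the SVM predictions
cm_svc = confusion_matrix(y_test, y_pred_svm)

print("\nConfusion Matrix:")
print(cm_svc)

# fmt='d' shows integer counts instead of scientific notation
sns.heatmap(cm_svc,annot=True,fmt='d',cmap='Blues')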
plt.title("Confusion Matrix")

## Support Vector Machine (SVM) Model
SVMs are powerful supervised learning models that find an optimal hyperplane to separate data points into classes, maximizing the margin between them.

Explanation: An SVC (Support Vector Classifier) is trained and used to make predictions.

#### Performance Metrics
Accuracy Score:  
Output: SVM Accuracy: [e.g., 99.94]%  
Interpretation: SVMs can achieve high accuracy but are computationally intensive for large datasets; raw accuracy needs careful interpretation.

In [None]:

print("\nClassification Report:")
# labels= selects the string classes by name, so each row is labelled correctly
print(classification_report(y_test, y_pred_svm, labels=['Not Fraud', 'Fraud']))

Interpretation: Evaluating precision and recall for the 'Fraud' class is key to assessing its practical utility.

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_model = LinearDiscriminantAnalysis()
lda_model.fit(x_train_sc, y_train)
y_pred_lda = lda_model.predict(x_test_sc)

lda_accu = accuracy_score(y_test, y_pred_lda)
print("LDA Accuracy: ", lda_accu * 100)

# Confusion matrix for the LDA predictions
cm_lda = confusion_matrix(y_test, y_pred_lda)

print("\nLDA Confusion Matrix:")
print(cm_lda)

# fmt='d' shows integer counts instead of scientific notation
sns.heatmap(cm_lda,annot=True,fmt='d',cmap='Blues')
plt.title("Confusion Matrix")

## Linear Discriminant Analysis (LDA) Model
LDA is a classification and dimensionality reduction technique that projects data onto a lower-dimensional space to maximize class separability.

Explanation: A LinearDiscriminantAnalysis model is trained and used for predictions.

#### Performance Metrics
Accuracy Score:  
Output: LDA Accuracy: [e.g., 99.95]%  
Interpretation: LDA assumes normal distribution and equal covariance; recall for fraud cases is critical.

In [None]:

print("\nClassification Report:")
# labels= selects the string classes by name, so each row is labelled correctly
print(classification_report(y_test, y_pred_lda, labels=['Not Fraud', 'Fraud']))

Interpretation: Helps assess LDA's ability to distinguish classes based on its linear decision boundary.

In [None]:
from sklearn.naive_bayes import GaussianNB

gnb_model = GaussianNB()
gnb_model.fit(x_train_sc, y_train)
y_pred_gnb = gnb_model.predict(x_test_sc)

gnb_accu = accuracy_score(y_test, y_pred_gnb)
print("Gaussian Naive Bayes Accuracy: ", gnb_accu * 100)

# Confusion matrix for the Gaussian Naive Bayes predictions
cm_gnb = confusion_matrix(y_test, y_pred_gnb)

print("\nGaussian Naive Bayes Confusion Matrix:")
print(cm_gnb)

# fmt='d' shows integer counts instead of scientific notation
sns.heatmap(cm_gnb,annot=True,fmt='d',cmap='Blues')
plt.title("Confusion Matrix")

## Gaussian Naive Bayes Model
Gaussian Naive Bayes is a probabilistic classifier based on Bayes' theorem, assuming feature independence and Gaussian distribution.

Explanation: A GaussianNB model is trained and used to generate predictions.

#### Performance Metrics
Accuracy Score:  
Output: Gaussian Naive Bayes Accuracy: [e.g., 97.72]%  
Interpretation: Computationally efficient, but its strong independence assumption can limit performance on complex data.

In [None]:

print("\nClassification Report:")
# labels= selects the string classes by name, so each row is labelled correctly
print(classification_report(y_test, y_pred_gnb, labels=['Not Fraud', 'Fraud']))

Interpretation: Helps assess the balance between identifying true frauds (recall) and minimizing false alarms (precision).
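Since every model is evaluated the same way, the per-model numbers can be gathered into one table for side-by-side comparison. The sketch below is an assumption-laden illustration: it uses synthetic data, only three of the classifiers, and a hypothetical `summary` DataFrame; it says nothing about how the models rank on the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (label 1 plays the role of 'Fraud')
X_demo, y_demo = make_classification(
    n_samples=5_000, n_features=10, weights=[0.97, 0.03], random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Fit each model and record accuracy plus minority-class precision and recall
rows = []
for name, model in models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, y_hat),
        "precision": precision_score(y_te, y_hat, zero_division=0),
        "recall": recall_score(y_te, y_hat),
    })

summary = pd.DataFrame(rows).set_index("model").sort_values("recall", ascending=False)
print(summary.round(3))
```

Sorting by recall rather than accuracy reflects the priority argued for in this notebook: missed frauds matter more than the overall hit rate.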

# 5. Conclusion

This project shows how challenging it is to detect fraud when real fraud cases are so rare in the data. Even though models like Random Forest reached over 99% accuracy, that number is misleading, because always predicting 'Not Fraud' already scores about 99.8%. The confusion matrix and detailed metrics revealed that many fraudulent transactions still slipped through undetected (44 of the 100 frauds in the test set for Logistic Regression), which is a big problem in banking.

In fraud detection, catching as many real frauds as possible is more important than having high overall accuracy. That’s why improving the recall for fraud cases should be the top priority. Going forward, we need to move beyond the default models and try approaches designed for imbalanced data, such as balancing the dataset with SMOTE, tuning stronger ensemble methods, or applying cost-sensitive learning.
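As one concrete starting point for cost-sensitive learning, many scikit-learn classifiers accept `class_weight="balanced"`, which weights errors on the rare class more heavily. The sketch below compares a plain and a weighted Logistic Regression on synthetic data; it is an illustration under those assumptions, not a result for the credit card dataset, and SMOTE (from the separate imbalanced-learn package) is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced stand-in data (label 1 plays the role of 'Fraud')
X_demo, y_demo = make_classification(
    n_samples=20_000, n_features=10, weights=[0.99, 0.01], random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo
)

candidates = {
    "plain": LogisticRegression(max_iter=1000),
    # 'balanced' reweights classes inversely to their frequency
    "balanced": LogisticRegression(max_iter=1000, class_weight="balanced"),
}

results = {}
for name, model in candidates.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "recall": recall_score(y_te, y_hat),
        "precision": precision_score(y_te, y_hat, zero_division=0),
    }

for name, scores in results.items():
    print(f"{name:>8}: recall={scores['recall']:.2f} precision={scores['precision']:.2f}")
```

Class weighting typically trades precision for recall, so the right setting depends on how costly a missed fraud is compared with a false alarm.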

By combining these advanced strategies, we can build a more reliable fraud detection system that actually protects customers and companies from hidden threats.
