<a href="https://colab.research.google.com/github/jhammans/fraud_busters/blob/Manahil2/RandomForestClassifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Import dependencies
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

In [None]:
data=pd.read_csv('/content/fraudTest.csv')
data.head()

In [None]:

# Drop unnecessary columns
columns_to_drop = ['Unnamed: 0', 'cc_num', 'merchant',
                   'first', 'last', 'street', 'trans_num', 'dob']
data_cleaned = data.drop(columns=columns_to_drop)  # `columns=` already implies axis=1


In [None]:
# Convert 'trans_date_trans_time' to a numeric format (optional)
data_cleaned['trans_date_trans_time'] = pd.to_datetime(data['trans_date_trans_time']).astype('int64') // 10**9  # Convert to Unix timestamp


In [None]:

# Encode non-numeric columns, including city and state
non_numeric_columns = ['category', 'gender', 'job', 'city', 'state']
label_encoders = {}
for col in non_numeric_columns:
    le = LabelEncoder()
    data_cleaned[col] = le.fit_transform(data_cleaned[col].astype(str))  # Ensure all data is string before encoding
    label_encoders[col] = le


In [None]:

# Extract target variable and features
X = data_cleaned.drop('is_fraud', axis=1)
y = data_cleaned['is_fraud']


In [None]:

# Ensure all features are numeric
print("Data types after encoding:", X.dtypes)


In [None]:

# Scale numerical features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


In [None]:

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)


In [None]:

# Display the first few rows of the processed dataset
print(X_train[:5], y_train[:5])


## **Training a Random Forest Model**

In [None]:

# Train the baseline model and evaluate it on the test set
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


# Performance Analysis of Random Forest Model

The Random Forest model achieved **99.83% accuracy** on the test set. Accuracy alone says little on a dataset this imbalanced, so the per-class results below matter more:


## **Performance Analysis**

### **Class 0 (Non-Fraudulent Transactions)**:
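- **Precision**: 1.00 (rounded; the missed fraud cases that end up labeled non-fraudulent are negligible next to the size of this class)
- **Recall**: 1.00 (rounded; almost no legitimate transactions are flagged as fraud)
- **F1-Score**: 1.00 (Excellent balance between precision and recall)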

### **Class 1 (Fraudulent Transactions)**:
- **Precision**: 0.95 (about 1 in 20 fraud predictions is a false positive)
- **Recall**: 0.63 (roughly 37% of fraudulent transactions are missed)
- **F1-Score**: 0.76 (held down by the low recall; there is room for improvement)
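
As a quick check on how the fraud-class F1-score follows from the two numbers above, here is a minimal sketch that recomputes it from the rounded precision and recall. The variable names are illustrative and not part of the notebook.

```python
# Precision and recall for the fraud class, as rounded in the report above
precision = 0.95
recall = 0.63

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2f}")  # F1 = 0.76
```

Because the harmonic mean is pulled toward the smaller value, the F1-score stays well below the precision until recall improves.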

### **Class Imbalance**
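- The test set contains only **124 fraudulent transactions** against **28,613 non-fraudulent transactions** (about 0.43% fraud).
- With so few fraud examples, the model has little evidence for the minority class, which contributes to the low fraud recall.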
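
To put the accuracy figure in context, the sketch below computes what a trivial model that labels every transaction as non-fraudulent would score. It assumes the two counts above are the class counts of the evaluated test split; the variable names are illustrative.

```python
# Class counts reported above (assumed to be the test-set class counts)
n_fraud = 124
n_legit = 28_613
total = n_fraud + n_legit

# A trivial model that labels every transaction as non-fraudulent
fraud_share = n_fraud / total
baseline_accuracy = n_legit / total

print(f"Fraud share:       {fraud_share:.2%}")
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
```

Under that assumption the trivial baseline already reaches about 99.57% accuracy while catching no fraud at all, which is why recall and F1 for class 1 are the more useful figures here.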



## **Adjusting the Model's Class Weights to Penalize Misclassification of Fraudulent Transactions**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Initialize the Random Forest model with class weights
model = RandomForestClassifier(random_state=42, class_weight={0: 1, 1: 10})

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


# Performance Summary

### Class 0 (Non-Fraudulent Transactions):
- **Precision**: 1.00 (rounded; the missed fraud cases labeled non-fraudulent are negligible next to the size of this class).
- **Recall**: 1.00 (rounded; almost all non-fraudulent transactions correctly identified).

### Class 1 (Fraudulent Transactions):
- **Precision**: 0.95 (unchanged from the unweighted model; still very high).
- **Recall**: 0.64 (a marginal improvement over 0.63; about a third of fraudulent transactions are still missed).
- **F1-Score**: 0.76 (unchanged after rounding).

### Overall Accuracy:
- **99.84%**: Essentially unchanged from 99.83%, and still dominated by the majority class.

### Macro and Weighted Averages:
- **Macro Avg Recall**: 0.82 (the unweighted mean of the two class recalls, pulled down by the fraud recall of 0.64).
- **Weighted Avg Recall**: 1.00 (rounded; dominated by the majority class).
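
The gap between the two averages can be reproduced by hand. This sketch assumes the class counts are the 124 fraudulent and 28,613 non-fraudulent test transactions reported earlier (the same split is reused for this model) and uses the rounded per-class recalls; the variable names are illustrative.

```python
# Rounded per-class recalls from the report, with the assumed test-set counts
recall_legit, recall_fraud = 1.00, 0.64
n_legit, n_fraud = 28_613, 124

# Macro average: every class counts equally
macro_recall = (recall_legit + recall_fraud) / 2

# Weighted average: every class counts in proportion to its size
weighted_recall = (recall_legit * n_legit + recall_fraud * n_fraud) / (n_legit + n_fraud)

print(f"Macro avg recall:    {macro_recall:.2f}")
print(f"Weighted avg recall: {weighted_recall:.4f}")
```

The weighted figure rounds to 1.00 because the fraud class contributes well under 1% of the weight, so the macro average is the one that exposes the missed fraud cases.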

## Observations:
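- The weighted averages look near-perfect only because the majority class (non-fraudulent transactions) dominates them; the class 1 recall and the macro average are the more informative figures.
- Raising the fraud class weight to 10 changed the fraud metrics only marginally (recall 0.63 to 0.64).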
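
The weights `{0: 1, 1: 10}` used in this section were picked by hand. For comparison, the sketch below shows what scikit-learn's `"balanced"` heuristic would assign, which is `n_samples / (n_classes * class_count)`. This is an alternative, not what the notebook ran, and the label array is a made-up stand-in built from the class counts reported earlier rather than the real training labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels reproducing the reported class counts (not the real data)
y_example = np.array([0] * 28_613 + [1] * 124)

# "balanced" weights each class by n_samples / (n_classes * class_count)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_example
)
balanced_weights = dict(zip([0, 1], weights))
print(balanced_weights)
```

With this class ratio the heuristic weights a fraud row far more heavily than the factor of 10 used above. The same weighting can be requested directly with `RandomForestClassifier(class_weight="balanced")`; whether it improves recall on this data would need to be tested.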


## **Manually Oversampling the Data to Maximize Accuracy**

In [None]:
# Separate the majority and minority classes
minority_class = data_cleaned[data_cleaned['is_fraud'] == 1]
majority_class = data_cleaned[data_cleaned['is_fraud'] == 0]

In [None]:

# Oversample the minority class
oversampled_minority_class = minority_class.sample(n=len(majority_class), replace=True, random_state=42)


In [None]:

# Combine the majority class with the oversampled minority class
balanced_data = pd.concat([majority_class, oversampled_minority_class])


In [None]:

# Shuffle the balanced dataset
balanced_data = balanced_data.sample(frac=1, random_state=42).reset_index(drop=True)


In [None]:

# Split features and target variable
X_balanced = balanced_data.drop('is_fraud', axis=1)
y_balanced = balanced_data['is_fraud']


In [None]:

# Scale the features
scaler = StandardScaler()
X_balanced_scaled = scaler.fit_transform(X_balanced)


In [None]:

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_balanced_scaled, y_balanced, test_size=0.2, random_state=42)

# Verify the class distribution
print("Class distribution in y_train:\n", y_train.value_counts())


## **Random Forest Classifier on Balanced Dataset**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Train the model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


# Performance Summary

## **Overall Accuracy**
- **99.97%**: Almost perfect accuracy on the balanced dataset.



## **Class 0 (Non-Fraudulent Transactions)**
- **Precision**: 1.00 (No false positives).
- **Recall**: 1.00 (All non-fraudulent transactions correctly identified).
- **F1-Score**: 1.00 (Perfect balance between precision and recall).



## **Class 1 (Fraudulent Transactions)**
- **Precision**: 1.00 (rounded; almost no false positives).
- **Recall**: 1.00 (fraudulent transactions in this test split correctly identified).
- **F1-Score**: 1.00 (on this test split, whose fraud rows are copies of rows also sampled into the training split).


## **Macro and Weighted Averages**
- **Precision, Recall, F1-Score**: All metrics round to 1.00 on the balanced dataset; the 99.97% accuracy shows that a few test rows are still misclassified.



## **Observations**
1. **Balanced Data**:
   - Balancing the dataset allowed the model to perform equally well for both classes.

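2. **Overfitting Not Ruled Out**:
   - The test split shows no drop in performance, but it cannot rule out overfitting: the minority class was oversampled before the train/test split, so copies of the same fraudulent rows appear in both splits. The scores above are therefore likely optimistic and should be confirmed on a test set held out before oversampling.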
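
Because the notebook oversamples the minority class before calling `train_test_split`, duplicated fraud rows can land in both the training and the test split. One way to avoid that is to hold out the test set first and oversample only the training rows. The sketch below illustrates that order of operations on synthetic data from `make_classification`; the data, the column names `f0` to `f9`, and the reduced `n_estimators=50` are assumptions for the demo, not the notebook's dataset or settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the fraud data (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=42
)
demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(10)])
demo["is_fraud"] = y_demo

# 1. Hold out the test set first, keeping the original class ratio
train_df, test_df = train_test_split(
    demo, test_size=0.2, stratify=demo["is_fraud"], random_state=42
)

# 2. Oversample the minority class inside the training set only
train_majority = train_df[train_df["is_fraud"] == 0]
train_minority = train_df[train_df["is_fraud"] == 1]
train_minority_up = train_minority.sample(
    n=len(train_majority), replace=True, random_state=42
)
train_balanced = pd.concat([train_majority, train_minority_up]).sample(
    frac=1, random_state=42
)

# 3. Train on the balanced training data, evaluate on the untouched test set
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(train_balanced.drop(columns="is_fraud"), train_balanced["is_fraud"])
test_pred = clf.predict(test_df.drop(columns="is_fraud"))
print(classification_report(test_df["is_fraud"], test_pred))
```

The test set here keeps its original imbalance, so the reported fraud recall reflects transactions the model has never seen in any copy. Feature scaling is left out of the sketch because tree-based models do not depend on it.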

