# Fraud Detection in Transactions

1. **Data cleaning including missing values, outliers, and multi-collinearity**:
   We performed exploratory data analysis and checked the dataset for missing values. The classes turned out to be heavily imbalanced. To examine relationships between variables, we plotted a correlation heatmap to check for multi-collinearity.

    To analyze outliers, we plotted different graphs, including "Amount" for cases with 'isFraud' equal to 1 and 0 separately. This allowed us to understand the distribution and potential presence of outliers in each class.

    To handle outliers, we applied two unsupervised anomaly detection models, Isolation Forest and Local Outlier Factor (LOF), and compared the points they flagged as outliers against the 'isFraud' labels.

2. **Describe your fraud detection model in elaboration**:
    Isolation Forest is an unsupervised anomaly detection algorithm that isolates rare and different instances from the majority of the data. It uses binary trees to partition the data and identify outliers, which require fewer partitions to be isolated compared to normal instances. The IsolationForest class from scikit-learn is used to create the Isolation Forest model.
    
    Local Outlier Factor is another unsupervised anomaly detection algorithm that measures the local deviation of density for each data point with respect to its neighbors. It assigns an outlier score based on the local density compared to its neighbors, with significantly lower density indicating outliers. The LocalOutlierFactor class from scikit-learn is used to create the LOF model.

    The final fraud detection model in this code is a RandomForestClassifier. Random Forest is an ensemble learning method that combines many decision trees and aggregates their votes. It suits classification tasks like fraud detection because it captures complex, non-linear patterns in the data; to cope with the class imbalance, we train it on resampled data.

3. **How did you select variables to be included in the model?**:
    Whether a variable is included or excluded depends on factors like relevance and multicollinearity. It is crucial to include variables that have a meaningful impact on the target variable and to exclude irrelevant ones, such as a "Name" column (had one been present in our dataset).

    Additionally, multicollinearity, where variables are highly correlated, can lead to unstable model coefficients. To avoid redundancy, highly correlated variables can be removed. However, in this dataset multicollinearity was not significant, so no variables were dropped on that basis before model training.

    During the analysis, we observed that the "isFlaggedFraud" column performed poorly as a predictor. Hence, we removed it from the dataset to prevent noise from affecting the model's accuracy and effectiveness in detecting fraud.
   

4. **Demonstrate the performance of the model by using the best set of tools**:
   The code evaluates the RandomForestClassifier on resampled data (the notebook prepares sampled, undersampled, and oversampled variants). It prints the accuracy score, classification report, and confusion matrix on the held-out test set.

5. **What are the key factors that predict fraudulent customers?**:
   Visualization Plots:
   Amount per transaction by class: It displays histograms of transaction amounts for both fraud and normal transactions. The x-axis represents the transaction amount, and the y-axis shows the number of transactions within each bin.

   Time of transaction vs Amount by class: It shows scatter plots of transaction time against transaction amount for fraud and normal transactions.

6. **Do these factors make sense? If yes, how? If not, how not?**:
    Yes, these factors make sense in the context of fraud detection. Fraudsters often exhibit abnormal behavior compared to legitimate customers. Unusual transaction amounts and frequencies can indicate attempts to exploit vulnerabilities. Anomalies and time-based patterns can help identify potential unauthorized access. Analyzing this information can aid in recognizing suspicious activities.


7. **What kind of prevention should be adopted while the company updates its infrastructure?**:
   Preventing fraud requires a multi-layered approach. Some prevention measures include:
   1. Implement real-time transaction monitoring to detect and flag unusual patterns or suspicious activities promptly.
   2. Strengthen authentication and authorization mechanisms to ensure secure access to critical systems and data.
   3. Conduct regular security audits and vulnerability assessments to identify and address potential weaknesses in the infrastructure.
   4. Educate customers and employees about fraud risks and best practices for prevention, creating a vigilant and informed workforce.
   5. Foster collaboration with industry peers and share threat intelligence to stay ahead of emerging fraud threats and enhance overall defense measures.

8. **Assuming these actions have been implemented, how would you determine if they work?**:
   The effectiveness of fraud prevention measures can be assessed using various metrics, such as:
   1. Reduction in the number of confirmed fraud cases.
   2. Decrease in financial losses due to fraud.
   3. Increase in fraud detection rate (sensitivity/recall).
   4. Improvement in precision to reduce false positives.
   5. Feedback from customers and employees regarding security perceptions.
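
The detection rate (recall) and precision mentioned in item 8 can both be read off a confusion matrix. The sketch below uses made-up labels purely for illustration; `y_true` and `y_pred` are hypothetical arrays, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = fraud, 0 = legitimate (made-up data for illustration)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall: share of actual fraud cases that were caught
recall = tp / (tp + fn)
# Precision: share of fraud alerts that were actually fraud
precision = tp / (tp + fp)

print("Recall:", recall, "Precision:", precision)
# The same values are available directly from scikit-learn
print(recall_score(y_true, y_pred), precision_score(y_true, y_pred))
```

Tracking these two numbers before and after a prevention measure goes live is one way to compare periods, provided the labeling process stays the same.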

In [None]:
!pip install seaborn 

In [None]:
#Importing Necessary Libraries 
import numpy as np
import pandas as pd
import sklearn
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.model_selection import train_test_split

In [None]:
#Importing the dataset
df = pd.read_csv(r"/kaggle/input/financial-dataset-for-fraud-detection-in-a-comapny/Fraud.csv")

In [None]:
#Visualizing the Dataset
df.head()

# EDA 

In [None]:
#Checking the datatype of column
df.info()

In [None]:
#Calculating the number of Fraud 
df["isFraud"].sum()

In [None]:
#Overall Number of Rows
df["isFraud"].shape[0]

In [None]:
#Plotting isFraud (0 vs 1)
sns.countplot(x="isFraud", data=df)

Here, we can see that the dataset is heavily imbalanced, so accuracy alone would be misleading and a plain classification model would tend to favor the majority class.
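
To make the imbalance concern concrete, here is a small sketch on synthetic labels. The fraud rate of roughly 0.1% is an assumed figure chosen for illustration, and `y_all_normal` is a hypothetical "model" that never flags anything.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)

# Synthetic labels with roughly 0.1% positives (assumed rate, for illustration only)
y_true = (rng.random(100_000) < 0.001).astype(int)

# A trivial baseline that predicts "not fraud" for every transaction
y_all_normal = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_all_normal)
baseline_recall = recall_score(y_true, y_all_normal, zero_division=0)

print("Accuracy of the do-nothing baseline:", baseline_accuracy)
print("Recall of the do-nothing baseline:", baseline_recall)
```

The baseline scores very high on accuracy while catching no fraud at all, which is why the later cells report recall, precision, and the confusion matrix as well.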

In [None]:
#Checking the null values
df.isnull().sum()

In [None]:
#Number of Unique Values 
df["nameDest"].nunique()

In [None]:
#Total Number of Rows
df["nameDest"].shape[0]

In [None]:
# Total rows minus the number of distinct destination accounts
print("Repeated Values : ")
print(df.shape[0] - df["nameDest"].nunique())

In [None]:
#Visualizing the Count of Types of Transaction Made
df["type"].value_counts()

In [None]:
sns.countplot(x ="type", data = df)

In [None]:
df["amount"].describe()

In [None]:
#Checking IsFlaggedFraud Column Accuracy
cm = confusion_matrix(df["isFraud"], df["isFlaggedFraud"])
print("Confusion Matrix:")
print(cm)

In [None]:
df[df["isFraud"] == 1]["amount"].describe()

In [None]:
fraud = df[df['isFraud'] == 1]
normal = df[df['isFraud'] == 0]

# Calculate the maximum 'amount' for fraud and normal cases
max_fraud_amount = fraud["amount"].max()
max_normal_amount = normal["amount"].max()

# Print the results
#print("Is the sum of 'isFlaggedFraud' equal to the total number of fraud cases?", is_flagged_fraud_equals_total)
print("Maximum 'amount' for fraud cases:", max_fraud_amount)
print("Maximum 'amount' for normal cases:", max_normal_amount)
print("Total number of fraud cases:", fraud.shape[0])

In [None]:
sns.countplot(x ="type", data = fraud)

In [None]:
# Count how often each amount occurs among fraudulent transactions
# (a pandas Series avoids shadowing the built-in `dict`)
amount_counts = fraud["amount"].value_counts()

In [None]:
%%capture
amount_counts

In [None]:
# Most frequent fraud amount and how many times it occurs
max_key = amount_counts.idxmax()
max_value = amount_counts.max()
print("Amount corresponding to the maximum number of transactions:", max_key)
print("Number of transactions with that amount:", max_value)

In [None]:
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
f.suptitle('Amount per transaction by class')
bins = 25
ax1.hist(fraud.amount, bins = bins)
ax1.set_title('Fraud')
ax2.hist(normal.amount, bins = bins)
ax2.set_title('Normal')
plt.xlabel('Amount ($)')
plt.ylabel('Number of Transactions')
plt.xlim((0, 10000000 ))
plt.yscale('log')
plt.show()

In [None]:
# Do fraudulent transactions occur more often during certain time frames? Let us check with a visual representation.

f, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
f.suptitle('Time of transaction vs Amount by class')
ax1.scatter(fraud.step, fraud.amount)
ax1.set_title('Fraud')
ax2.scatter(normal.step, normal.amount)
ax2.set_title('Normal')
plt.xlabel('Time (step; 1 step is assumed to be 1 hour)')
plt.ylabel('Amount')
plt.show()
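
A scatter plot can hide time-of-day effects when points overlap. One complementary check is the fraud rate per hour of day. The sketch below runs on a tiny hypothetical frame (`toy`) that only mimics the notebook's `step` and `isFraud` columns, and it assumes that one step corresponds to one hour.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the notebook's df
toy = pd.DataFrame({
    "step":    [1, 2, 25, 26, 49, 50, 3, 27],
    "isFraud": [0, 1, 0, 1, 0, 1, 0, 0],
})

# Assumption: 1 step = 1 hour, so step % 24 approximates the hour of day
toy["hour"] = toy["step"] % 24

# Share of fraudulent transactions within each hour of day
fraud_rate_by_hour = toy.groupby("hour")["isFraud"].mean()
print(fraud_rate_by_hour)
```

On the real data the same two lines could be applied to `df`, if the one-hour assumption holds for this dataset.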

In [None]:
#Plotting the heatmap to check the correlation between variables
data1 = df.sample(frac=0.1, random_state=1)  # 10% sample of the data
# Correlations between the numeric features of the sample
corrmat = data1.corr(numeric_only=True)
plt.figure(figsize=(20, 20))
# Plot the heat map from the sample's correlation matrix
g = sns.heatmap(corrmat, annot=True, cmap="RdYlGn")

In [None]:
#For the whole dataset
corrmat = df.corr(numeric_only=True)
plt.figure(figsize=(20, 20))
# Plot the heat map (corrmat already holds the correlations, no need to recompute)
g = sns.heatmap(corrmat, annot=True, cmap="RdYlGn")

In [None]:
# # Handle multi-collinearity
# corr = df.corr()
# print(corr)
# high_corr_cols = corr.where(np.triu(np.ones(corr.shape), k=1).astype(np.bool_)).abs()
# to_drop = [column for column in high_corr_cols.columns if any(high_corr_cols[column] > 0.8)]
# df = df.drop(to_drop, axis=1)

In [None]:
Fraud = data1[data1['isFraud']==1]

Valid = data1[data1['isFraud']==0]

# contamination expects the share of outliers in the whole dataset
outlier_fraction = len(Fraud) / len(data1)

In [None]:
outlier_fraction 

In [None]:
columns = data1.columns.tolist()
# Exclude the target column so the label does not leak into the features
columns = [c for c in columns if c not in ["isFraud"]]
# Store the variable we are predicting 
target = "isFraud"
# Define a random state 
state = np.random.RandomState(42)
X = data1[columns]
Y = data1[target]
X_outliers = state.uniform(low=0, high=1, size=(X.shape[0], X.shape[1]))
# Print the shapes of X & Y
print(X.shape)
print(Y.shape)

In [None]:
Y.sum()

In [None]:
%%capture
!pip install tensorflow
import tensorflow
from tensorflow.keras.utils import to_categorical

In [None]:
X = pd.get_dummies(X, columns=["type"])

In [None]:
# Drop identifier columns and the weak isFlaggedFraud indicator
X = X.drop(columns=['nameDest', 'isFlaggedFraud', 'nameOrig'])

In [None]:
##Define the outlier detection methods

classifiers ={
    
    "Isolation Forest":IsolationForest(n_estimators=100, max_samples=len(X), 
                                       contamination=outlier_fraction,random_state=state, verbose=0),
    "Local Outlier Factor":LocalOutlierFactor(n_neighbors=20, algorithm='auto', 
                                              leaf_size=30, metric='minkowski',
                                              p=2, metric_params=None, contamination=outlier_fraction),
   
            }

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

n_outliers = len(Fraud)

for i, (clf_name, clf) in enumerate(classifiers.items()):
    # Fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_prediction = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_prediction = clf.decision_function(X)
        y_pred = clf.predict(X)
    
    # Reshape the prediction values to 0 for Valid transactions, 1 for Fraud transactions
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    
    n_errors = (y_pred != Y).sum()
    # Run Classification Metrics
    print("{}: {}".format(clf_name, n_errors))
    print("Accuracy Score:")
    print(accuracy_score(Y, y_pred))
    print("Classification Report:")
    print(classification_report(Y, y_pred))
    
    # Calculate and print the confusion matrix
    cm = confusion_matrix(Y, y_pred)
    print("Confusion Matrix:")
    print(cm)

    print("--------------------")
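
Hard 0/1 predictions depend on the chosen `contamination` value. A threshold-free view is to rank points by anomaly score and compute the average precision. The following is a minimal sketch on synthetic 2-D data; `X_toy`, `y_toy`, and the cluster locations are made up for illustration and do not reflect this dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(42)

# Synthetic data: 500 "normal" points around the origin, 10 far-away anomalies
normal_points = rng.normal(0, 1, size=(500, 2))
anomalies = rng.uniform(6, 8, size=(10, 2))
X_toy = np.vstack([normal_points, anomalies])
y_toy = np.array([0] * 500 + [1] * 10)

iso = IsolationForest(n_estimators=100, contamination=10 / 510, random_state=0)
iso.fit(X_toy)

# decision_function returns lower values for more anomalous points,
# so negate it to get a score where higher means "more suspicious"
anomaly_score = -iso.decision_function(X_toy)

# Average precision summarizes the precision-recall trade-off across all thresholds
ap = average_precision_score(y_toy, anomaly_score)
print("Average precision:", ap)
```

For comparison, a random ranking would have an average precision close to the positive rate (here 10 out of 510).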


In [None]:
# Handle multi-collinearity
corr = X.corr()
print(corr)
high_corr_cols = corr.where(np.triu(np.ones(corr.shape), k=1).astype(np.bool_)).abs()
to_drop = [column for column in high_corr_cols.columns if any(high_corr_cols[column] > 0.8)]
new_data = X.drop(to_drop, axis=1)
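
To see what the upper-triangle filter above actually removes, here is the same idea wrapped in a small helper and run on a toy frame. The function name `drop_highly_correlated` and the columns `a`, `a_copy`, and `b` are hypothetical, introduced only for this sketch.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(frame, threshold=0.8):
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=200),  # nearly a rescaled copy of "a"
    "b": rng.normal(size=200),                           # independent of "a"
})

reduced, dropped = drop_highly_correlated(toy)
print("Dropped:", dropped)
print("Remaining:", list(reduced.columns))
```

Because only the upper triangle is inspected, the first column of a correlated pair is kept and the later one is dropped.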

In [None]:
##Define the outlier detection methods

classifiers ={
    
    "Isolation Forest":IsolationForest(n_estimators=100, max_samples=len(X), 
                                       contamination=outlier_fraction,random_state=state, verbose=0),
    "Local Outlier Factor":LocalOutlierFactor(n_neighbors=20, algorithm='auto', 
                                              leaf_size=30, metric='minkowski',
                                              p=2, metric_params=None, contamination=outlier_fraction),
   
            }

for i, (clf_name, clf) in enumerate(classifiers.items()):
    # Fit on the reduced feature set (highly correlated columns removed) and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(new_data)
        scores_prediction = clf.negative_outlier_factor_
    else:
        clf.fit(new_data)
        scores_prediction = clf.decision_function(new_data)
        y_pred = clf.predict(new_data)
    
    # Reshape the prediction values to 0 for Valid transactions, 1 for Fraud transactions
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    
    n_errors = (y_pred != Y).sum()
    # Run Classification Metrics
    print("{}: {}".format(clf_name, n_errors))
    print("Accuracy Score:")
    print(accuracy_score(Y, y_pred))
    print("Classification Report:")
    print(classification_report(Y, y_pred))
    
    # Calculate and print the confusion matrix
    cm = confusion_matrix(Y, y_pred)
    print("Confusion Matrix:")
    print(cm)

    print("--------------------")


In [None]:
# One-hot encode the transaction type and drop the account identifier columns
data_new = pd.get_dummies(df, columns=["type"])
data_new = data_new.drop(columns=["nameOrig", "nameDest"])

In [None]:
from sklearn.utils import resample
#create two different dataframe of majority and minority class 
df_majority = data_new[(data_new['isFraud']==0)] 
df_minority = data_new[(data_new['isFraud']==1)]

# upsample minority class
df_minority_oversampled = resample(df_minority,
                                 replace=True,
                                 n_samples=len(df_majority),  # match the majority-class size
                                 random_state=42)
# Combine majority class with upsampled minority class
df_oversampled = pd.concat([df_minority_oversampled, df_majority])
df_oversampled.isFraud.value_counts()
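
One caveat with oversampling: if duplicated minority rows are created before the train/test split, copies of the same transaction can end up on both sides and inflate test scores. A common alternative is to split first and upsample only the training part. The sketch below shows that order of operations on a hypothetical toy frame; the column names mirror the notebook but the data is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy frame: 10 fraud rows, 90 normal rows
toy = pd.DataFrame({
    "amount": range(100),
    "isFraud": [1] * 10 + [0] * 90,
})

# 1) Split first, keeping the class ratio in both parts
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["isFraud"], random_state=42
)

# 2) Upsample the minority class inside the training split only
train_majority = train[train["isFraud"] == 0]
train_minority = train[train["isFraud"] == 1]
train_minority_up = resample(
    train_minority,
    replace=True,
    n_samples=len(train_majority),
    random_state=42,
)
train_balanced = pd.concat([train_majority, train_minority_up])

print(train_balanced["isFraud"].value_counts())
print(test["isFraud"].value_counts())
```

The test split keeps its original imbalance, so metrics computed on it reflect the class ratio the model would meet in practice.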

In [None]:
X_oversampled = df_oversampled.drop('isFraud', axis=1)
y_oversampled = df_oversampled['isFraud']
X_oversampled.shape, y_oversampled.shape

In [None]:
y1= df["isFraud"]

In [None]:
X1 = data_new.drop("isFraud", axis=1)

In [None]:
X1 = X1.drop("isFlaggedFraud" , axis=1)
X1

In [None]:
!pip install imbalanced-learn

In [None]:
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

rus = RandomUnderSampler(random_state=42)

X_undersampled, y_undersampled = rus.fit_resample(X1, y1)
print(f"Class counts before resampling: {Counter(y1)}")
print(f"Class counts after resampling: {Counter(y_undersampled)}")

In [None]:
data_new

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def train_and_evaluate_rf(X, y):
    """Train a RandomForestClassifier and print its metrics on a held-out test set."""
    # Split the data into training and testing sets, preserving the class ratio
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42, stratify=y
    )

    # Initialize and train the model
    rf_clf = RandomForestClassifier(random_state=42)
    rf_clf.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = rf_clf.predict(X_test)
    print("Accuracy Score:")
    print(accuracy_score(y_test, y_pred))

    # Print the classification report
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

    # Print the confusion matrix
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))


In [None]:
train_and_evaluate_rf(X_undersampled, y_undersampled)
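
To connect the model back to the question of which factors predict fraud, a fitted random forest exposes `feature_importances_`. The sketch below demonstrates the idea on synthetic data; the columns `informative` and `noise` are hypothetical and the ranking says nothing about this dataset. Impurity-based importances can also be biased toward high-cardinality features, so treat them as a first hint rather than a conclusion.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Synthetic features: the label depends only on "informative"
toy_X = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy_y = (toy_X["informative"] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(toy_X, toy_y)

# Impurity-based importances, one value per feature, summing to 1
importances = pd.Series(rf.feature_importances_, index=toy_X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Applied to a forest trained on the undersampled data, the same `pd.Series(...)` line would rank columns such as the amount and balance features.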

## Thank you for giving me such a good opportunity