<a href="https://colab.research.google.com/github/AnahitShekikyan/ADS-505-Final-Team-Project/blob/main/Final_Project_Updated.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Credit Card Fraud Detection Project**

## **Business Problem**
The objective of this project is to identify fraudulent credit card transactions. Credit card fraud detection is critical for financial institutions, as it helps prevent financial loss and maintain customer trust. We aim to build a model that can accurately detect fraudulent transactions, which account for only 0.17% of all transactions in our dataset.

## **Dataset Information**
The dataset contains 284,807 transactions and 31 columns: 30 features describing each transaction plus the target variable, which indicates whether the transaction is fraudulent (1) or legitimate (0). Given the class imbalance, specific techniques will be applied to address this challenge.
    

# **Import Data & Libraries**

In [None]:
%%capture
!pip install dmba

#library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.tree import plot_tree
from sklearn.decomposition import PCA
from dmba import classificationSummary
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_auc_score, roc_curve, precision_score, recall_score, f1_score
from sklearn.metrics import precision_recall_curve, auc
from sklearn.ensemble import IsolationForest
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Suppress all warnings
warnings.filterwarnings("ignore")

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
# Path to the dataset in Google Drive
data = '/content/drive/MyDrive/Jason Documents/creditcard.csv'

# Load the CSV into a pandas DataFrame
df = pd.read_csv(data)

# Display the first few rows
df.head()

# **Basic Data Information**

In [None]:
df.info()

In [None]:
# Checking for duplicate rows
df.duplicated().sum()

In [None]:
# Plotting the class distribution to visualize the class imbalance
plt.figure(figsize=(6, 4))
sns.countplot(x='Class', data=df)
plt.title('Distribution of Legitimate vs Fraudulent Transactions')
plt.show()

# Checking the class counts to quantify the imbalance
df['Class'].value_counts()

The count plot confirms the extreme imbalance in the dataset, with very few fraudulent instances compared to legitimate ones. We need to handle the class imbalance and carry out additional exploratory data analysis focused on the minority class (fraud) to extract insights and patterns specific to those cases.


In [None]:
# Plotting the transaction amounts for fraud vs non-fraud
plt.figure(figsize=(8, 6))
sns.boxplot(x='Class', y='Amount', data=df)

# Limiting y-axis to focus on smaller amounts for clearer visualization
plt.ylim(0, 500)
plt.title('Transaction Amounts by Class')
plt.show()

This box plot indicates that fraudulent transactions (Class 1) have a higher median transaction amount than legitimate transactions (Class 0). Both classes show a wide range of amounts, and the legitimate transactions have more outliers. Three next steps follow from this: keep "Amount" as a predictor, since there appears to be a noticeable difference between the classes; investigate the outliers among the legitimate transactions to determine whether they might be mislabeled frauds or anomalies that need special handling; and normalize or standardize "Amount", since its range is wide and could influence model performance if left unscaled.

# **Data Quality Report**


In [None]:
# Summary stats
df.describe()

In [None]:
def data_quality_report(df):

    # Initializing the report as a DataFrame indexed by column name
    report = pd.DataFrame(index=df.columns)

    # Data types
    report['Data Type'] = df.dtypes

    # Counting missing values
    report['Missing Values'] = df.isnull().sum()

    # Percentage of missing values
    report['% Missing'] = (df.isnull().sum() / len(df)) * 100

    # Number of unique values
    report['Unique Values'] = df.nunique()

    # Continuous features summary (every column in this dataset is numeric)
    report['Min'] = df.min()
    report['Max'] = df.max()
    report['Mean'] = df.mean()
    report['Median'] = df.median()
    report['Standard Deviation'] = df.std()

    # Checking for duplicates
    report['Duplicates'] = df.duplicated().sum()

    # Determine cardinality for categorical variables (assumes non-continuous variables)
    report['Cardinality'] = [df[col].nunique() if df[col].dtype == 'object' else 'N/A' for col in df.columns]

    # Return the report
    return report

# Generating the Data Quality Report for the dataset
report = data_quality_report(df)

# Display the report
report

Here is a detailed summary of each feature in the dataset, covering various statistical and descriptive properties that are crucial for data exploration and preparation.


*   **Data Type:** All features, except the target variable (Class), are continuous and represented as floats.

*   **Missing Values:** There are no missing values, which indicates the dataset is complete and requires no imputation.


*   **Unique Values:** The number of distinct values in each feature. High cardinality (e.g., for Time and most other features) indicates a diverse set of values, typical for continuous variables.


*   **Min/Max:** These give the range of each feature. For example, Amount ranges from 0 to 25691.16, which is consistent with transaction values.

*   **Mean/Median:** Most features have means close to zero, which suggests that they were transformed (e.g., with PCA) to center their distributions.


*   **Standard Deviation:** This shows the spread of values within each feature. Features like "Amount" have a high standard deviation (250.12), reflecting a wide range of transaction values.

*   **Duplicates:** There are 1081 duplicate rows in the dataset, which should be removed to avoid biasing model training.


*   **Cardinality:** Not applicable for continuous features.

Based on the data quality report, we need to remove duplicates, handle outliers, scale the data, and reduce multicollinearity.

# **Univariate Analysis**

**For Continuous Features**

In [None]:
# Generating summary statistics for continuous variables
continuous_features = df.select_dtypes(include=['float64', 'int64']).columns
print("Summary statistics for continuous features:")
print(df[continuous_features].describe())

# Visualizing continuous variables in sets of 5 per row
n_features = len(continuous_features)
n_cols = 5  # Number of plots per row

for i in range(0, n_features, n_cols):
    plt.figure(figsize=(15, 4))  # One row of up to 5 histograms
    for j in range(n_cols):
        if i + j < n_features:
            feature = continuous_features[i + j]

            # Create subplot for the histogram (with KDE overlay)
            plt.subplot(1, n_cols, j + 1)
            sns.histplot(df[feature], bins=30, kde=True)
            plt.title(f'Distribution of {feature}')

    plt.tight_layout()  # Adjust layout for better spacing
    plt.show()

    # Second row with boxplots for the same features
    plt.figure(figsize=(15, 4))  # Separate figure for boxplots
    for j in range(n_cols):
        if i + j < n_features:
            feature = continuous_features[i + j]

            # Create subplot for boxplot
            plt.subplot(1, n_cols, j + 1)
            sns.boxplot(x=df[feature])
            plt.title(f'Boxplot of {feature}')

    plt.tight_layout()  # Adjust layout for better spacing
    plt.show()

Histograms and boxplots are drawn for each continuous feature to explore its distribution.

Histograms show the distribution of values within each feature. They help identify skewness, modality (e.g., unimodal or bimodal), and whether the data is approximately normal. For example, features like V1, V2, and others show roughly bell-shaped distributions centered around zero, which indicates they may have been scaled or transformed.

Boxplots display the spread and the presence of outliers for each feature, giving a view of the median, quartiles, and extreme values. For example, the boxplots of features like V5 and V6 reveal many outliers, which matters when choosing preprocessing techniques such as robust scaling or the treatment of extreme values, so that these points do not have undue influence on the models.

For each set of five features, a row of histograms is followed by a row of boxplots, so both views of the same features can be compared directly.
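
The points drawn beyond the whiskers in the boxplots follow, by default, the 1.5 x IQR rule. As a minimal sketch of that rule, the snippet below counts such points for a small made-up array; the helper name `iqr_outlier_mask` and the sample values are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def iqr_outlier_mask(values, whis=1.5):
    """Return a boolean mask of points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return (lower > values) | (values > upper)

# Hypothetical feature values with two extreme points at the end
sample = np.array([-0.4, -0.1, 0.0, 0.2, 0.3, 0.5, 9.0, -12.0])
mask = iqr_outlier_mask(sample)
print(f"{mask.sum()} of {sample.size} points fall outside the whiskers")
```

Applied column by column to `df`, a count like this could help rank features by how heavy their tails are before deciding on a scaling strategy.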

**For Categorical Features**

In [None]:
# Visualizing the target variable 'Class'
plt.figure(figsize=(4, 3))
sns.countplot(x='Class', data=df)
plt.title('Distribution of Class (Fraud vs Non-Fraud)')
plt.show()

# Display percentage of fraud vs non-fraud transactions
class_counts = df['Class'].value_counts(normalize=True) * 100
print("Percentage distribution of the target variable 'Class':")
print(class_counts)

The bar chart displays the distribution of the target variable, Class, which represents legitimate transactions (Class 0) and fraudulent transactions (Class 1). It shows a significant imbalance between the two classes, with legitimate transactions making up 99.83% of the dataset, while fraudulent transactions only constitute 0.17%.

The extreme class imbalance indicated by the chart requires special consideration in the modeling process to ensure the model's effectiveness in identifying the minority class (fraudulent transactions). Proper resampling, algorithm adjustments, and appropriate evaluation metrics will be crucial to developing an effective fraud detection model.
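
One of the algorithm adjustments mentioned above is class weighting. As a rough sketch, the "balanced" heuristic (the one scikit-learn applies for `class_weight='balanced'`) gives each class the weight `n_samples / (n_classes * n_samples_in_class)`. The helper name `balanced_class_weights` and the toy label counts are assumptions made for illustration only.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy labels (hypothetical): 998 legitimate (0) and 2 fraudulent (1) transactions
toy_labels = [0] * 998 + [1] * 2
weights = balanced_class_weights(toy_labels)
print(weights)
```

With these toy counts, each fraudulent sample weighs roughly 500 times as much as a legitimate one, which is why a weighted model tends to trade precision for recall.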


# **Multivariate Analysis**

In [None]:
# Using a subset of features to avoid too many plots (e.g., V1 to V5)
subset_features = ['V1', 'V2', 'V3', 'V4', 'V5', 'Class']

# Creating a pair plot
sns.pairplot(df[subset_features], hue='Class', diag_kind='kde', plot_kws={'alpha': 0.3})
plt.show()

The pair plot shows the pairwise relationships and distributions of the selected features (V1 to V5), colored by class. There are clusters and some separable patterns between the legitimate and fraudulent classes. Features that show clear separation between the classes should be prioritized as predictors for modeling. We might also apply PCA if highly correlated features are identified, and use clustering algorithms to check whether the clusters align with the fraud and non-fraud classes.

In [None]:
# Computing the correlation matrix
corr_matrix = df.corr()

# Plotting the heatmap of the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, cmap='coolwarm', annot=False, fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

This heatmap illustrates the correlation between the features. Some features show strong correlations, which could indicate redundancy or multicollinearity issues. Possible next steps:


*   Remove or combine highly correlated features to avoid multicollinearity in model building (e.g., using PCA or dropping one of the correlated features).

*   Focus on features with stronger correlations with the target variable (Class), as they might be more predictive.

In [None]:
# Violin plot for 'Amount' based on 'Class'
plt.figure(figsize=(8, 6))
sns.violinplot(x='Class', y='Amount', data=df)
plt.title('Distribution of Transaction Amounts by Class')
plt.show()

The violin plot gives a closer look at the distribution and density of transaction amounts by class. It shows that fraudulent transactions have a more concentrated distribution than legitimate ones, which are more spread out and have more outliers. From here we can explore additional features that might help differentiate frauds based on amount distributions, and handle outliers in the legitimate transactions to refine model performance.

In [None]:
# Only keeping numeric features (skip target 'Class')
X = df.drop(columns=['Class'])

# Calculating VIF (Variance Inflation Factor) for each feature
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)

The VIF output indicates the level of multicollinearity among the features in the dataset.

Most features (e.g., "V1", "V3", "V6", "V8", and others) have VIF values close to 1, indicating very low multicollinearity. These features are not strongly correlated with other features in the dataset, so their coefficients in regression models should be stable.

Features like "V2", "V5", "V7", and "V20" have VIF values between 2 and 4. These values are still within an acceptable range but indicate some degree of correlation with other features. They do not warrant immediate removal, but these features should be monitored in case they lead to multicollinearity issues in the model.

The "Amount" feature has a VIF value of approximately 11.5, which is quite high and indicates significant multicollinearity. This suggests that "Amount" may be highly correlated with one or more other features in the dataset. High VIF values like this could distort regression coefficients and affect the stability and interpretability of the model.

Since "Amount" has a high VIF value, its correlation with other features should be explored. If it is highly correlated with other variables, we can remove it if it does not contribute new information, combine it with other features, or transform it to reduce the correlation.

Although the VIF values between 2 and 4 are acceptable, it's a good idea to keep these features in mind when evaluating model performance, as they might introduce minor multicollinearity.
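
For reference, the VIF of feature i is 1 / (1 - R_i^2), where R_i^2 comes from regressing that feature on all the others. The NumPy sketch below computes it on synthetic columns; the helper `vif`, the synthetic data, and the choice to fit an intercept in each auxiliary regression are assumptions of this illustration rather than a description of the statsmodels call above.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of a 2-D array, fitting an intercept each time."""
    n, p = X.shape
    out = []
    for i in range(p):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n), others])  # intercept + remaining features
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)  # nearly a linear combination of a and b
d = rng.normal(size=500)                     # unrelated to the other columns
vif_demo = vif(np.column_stack([a, b, c, d]))
print(vif_demo.round(1))
```

The column built from the others gets a very large VIF, while the independent column stays close to 1, mirroring the pattern seen for "Amount" versus most of the V features.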

In [None]:
# Group by 'Class' and compute summary statistics for continuous features
grouped_stats = df.groupby('Class').mean()
print(grouped_stats)

In [None]:
# Box plot of Amount grouped by Class
plt.figure(figsize=(4, 3))
sns.boxplot(x='Class', y='Amount', data=df)
plt.title('Transaction Amount by Class')
plt.show()

This boxplot visualizes the distribution of transaction amounts for legitimate (Class 0) and fraudulent (Class 1) transactions. For legitimate transactions (Class 0), the majority of amounts are concentrated at lower values, with a few extreme outliers reaching over 25,000. The numerous outliers suggest that legitimate transactions vary widely in amount, with most being small but some being very large.

Fraudulent transactions (Class 1) also have low amounts but appear to be more tightly distributed, with significantly fewer extreme outliers than legitimate transactions. This suggests that fraudulent transactions generally tend to have lower amounts, with less variation in their values.

# **Data Preprocessing**
In this section, we will remove duplicates, scale the data, and address class imbalance.

In [None]:
# Removing duplicates
df = df.drop_duplicates()

In [None]:
# Checking whether any duplicates remain
df.duplicated().sum()

Our dataset contains substantial outliers, which is common in fraud detection, so RobustScaler is our primary choice: it centers the data using the median and scales using the interquartile range (IQR), which makes it robust to outliers. To limit the influence of outliers, we will first apply RobustScaler and then further normalize the data with StandardScaler for algorithms that are sensitive to feature scale.
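
As a minimal sketch of what the two scalers compute, the snippet below applies the median/IQR formula and the mean/standard-deviation formula to a made-up array. It also checks, on this toy data, what happens when standard scaling is chained after robust scaling: since both steps are per-feature affine transformations, the chained result matches standard scaling alone, which may be worth keeping in mind when interpreting the two-step approach. The function names and values are illustrative assumptions.

```python
import numpy as np

def robust_scale(x):
    """Center on the median and divide by the interquartile range."""
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return (x - med) / (q3 - q1)

def standard_scale(x):
    """Center on the mean and divide by the population standard deviation."""
    return (x - x.mean()) / x.std()

# Hypothetical amounts with one extreme value
amounts = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])

robust = robust_scale(amounts)
chained = standard_scale(robust)   # robust scaling followed by standard scaling
direct = standard_scale(amounts)   # standard scaling alone

print(robust)
print(np.allclose(chained, direct))
```

In the robust-scaled output the five ordinary values stay within about one unit of zero, while the extreme value remains clearly visible as an outlier.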




In [None]:
# Creating a correlation matrix with imbalanced and SubSample data
fraud = df[df['Class'] == 1]
non_fraud = df[df['Class'] == 0].sample(n=len(fraud), random_state=42)

# Concatenate fraud and non-fraud samples to create a balanced subsample
subsample = pd.concat([fraud, non_fraud])

# Compute correlation matrices
corr_matrix_full = df.corr()
corr_matrix_subsample = subsample.corr()

# Set up the matplotlib figure
fig, ax = plt.subplots(2, 1, figsize=(8, 6))

# Imbalanced correlation matrix
sns.heatmap(corr_matrix_full, cmap='coolwarm', ax=ax[0], cbar_kws={'shrink': 0.5}, vmin=-1, vmax=1)
ax[0].set_title('Imbalanced Correlation Matrix\n(don\'t use for reference)', fontsize=16)

# Subsample correlation matrix
sns.heatmap(corr_matrix_subsample, cmap='coolwarm', ax=ax[1], cbar_kws={'shrink': 0.5}, vmin=-1, vmax=1)
ax[1].set_title('SubSample Correlation Matrix\n(use for reference)', fontsize=16)

plt.tight_layout()
plt.show()

The two correlation matrices illustrate the relationship between different features in the dataset, showing how preprocessing, especially subsampling, impacts the representation and correlations between features.

The imbalanced correlation matrix lacks variability and detail, indicating that the class imbalance masks the true relationships between features. The overwhelming majority of non-fraudulent (Class 0) data points dominate the calculation, making the correlation values less reliable for model development. This matrix should not be used as a reference for feature selection or for understanding relationships, since it is skewed by the imbalance.

The subsample correlation matrix shows a more diverse pattern, with a mixture of positive and negative correlations across different feature pairs. This suggests that the balanced subsample allows the true relationships to surface. There are strong correlations between some features (e.g., among V3, V6, and V9), which could indicate multicollinearity. These features might need to be addressed using techniques like Principal Component Analysis (PCA) or feature elimination to reduce redundancy.



# Supervised Algorithm


## Model 1 _ Logistic Regression

In [None]:
# Re-assigning X after removing duplicates from df
X = df.drop(columns=['Class'])

# Applying RobustScaler first to the de-duplicated feature matrix X
robust_scaler = RobustScaler()
X_robust_scaled = robust_scaler.fit_transform(X)

# Applying StandardScaler on the robust-scaled data
standard_scaler = StandardScaler()
X_final_scaled = standard_scaler.fit_transform(X_robust_scaled)
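
The cell above fits both scalers on all rows before the train/test split, so statistics from the test rows inform the transformation. A common alternative is to split first and estimate the scaling statistics on the training rows only. The NumPy sketch below illustrates the idea on synthetic data; the array names and the simple index-based split are assumptions for illustration. In scikit-learn the same idea is usually expressed by calling `fit` on the training split and `transform` on the test split, or by wrapping the scaler and the model in a `Pipeline`.

```python
import numpy as np

rng = np.random.default_rng(42)
X_demo = rng.lognormal(mean=3.0, sigma=1.0, size=(1000, 3))  # synthetic, skewed features

# Split first: a simple 80/20 split of shuffled row indices
idx = rng.permutation(len(X_demo))
train_idx, test_idx = idx[:800], idx[800:]
X_tr, X_te = X_demo[train_idx], X_demo[test_idx]

# Estimate the scaling statistics on the training rows only
q1, med, q3 = np.percentile(X_tr, [25, 50, 75], axis=0)
iqr = q3 - q1

# Apply the same training statistics to both subsets
X_tr_scaled = (X_tr - med) / iqr
X_te_scaled = (X_te - med) / iqr

print(np.median(X_tr_scaled, axis=0))  # zero by construction
print(np.median(X_te_scaled, axis=0))  # near zero, but not forced to be
```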

In [None]:
# Here we are setting 'y' as the target variable, df is the original dataset
y = df['Class']

# Splitting the dataset into training and testing sets using X_final_scaled that was created earlier
# We stratify on y during the train/test split due to the imbalanced nature of the dataset
X_train, X_test, y_train, y_test = train_test_split(X_final_scaled, y, test_size=0.2, stratify=y, random_state=42)

# Initializing the logistic regression model, adjusting max_iter if necessary for convergence
# class_weight='balanced' automatically assigns weights inversely proportional to the class frequencies in the data
log_reg = LogisticRegression(random_state=42, class_weight='balanced', max_iter=1000)

# Training the model using the training data
log_reg.fit(X_train, y_train)

# Making predictions on the test set
y_pred_lr = log_reg.predict(X_test)
y_pred_prob_lr = log_reg.predict_proba(X_test)[:, 1]

# Evaluating the model
# We will be skipping the accuracy since in imbalanced dataset, it may not be meaningful
# Instead, we will focus more on Precision, Recall, and AUPRC
precision_lr, recall_lr, _ = precision_recall_curve(y_test, y_pred_prob_lr)
conf_matrix_lr = confusion_matrix(y_test, y_pred_lr)
report_lr = classification_report(y_test, y_pred_lr)

# Calculating AUPRC
auprc_lr = auc(recall_lr, precision_lr)

# Displaying the evaluation metrics
print("\nConfusion Matrix:\n", conf_matrix_lr)
print("\nClassification Report:\n", report_lr)
print(f"Area Under the Precision-Recall Curve (AUPRC): {auprc_lr:.2f}")
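
To make concrete why accuracy is skipped here, the sketch below scores a trivial "model" that labels every transaction as legitimate. The confusion-matrix counts are hypothetical (10,000 transactions with 17 frauds, matching the roughly 0.17% fraud rate), and `metrics_from_confusion` is an illustrative helper, not part of the notebook.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical test set: 10,000 transactions, 17 of them fraudulent
# A "model" that predicts legitimate for everything:
acc, prec, rec = metrics_from_confusion(tn=9983, fp=0, fn=17, tp=0)
print(f"accuracy={acc:.4f}, precision={prec:.2f}, recall={rec:.2f}")
# prints: accuracy=0.9983, precision=0.00, recall=0.00
```

An accuracy above 99.8% with zero recall shows why precision, recall, and AUPRC are the more informative measures for this problem.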

## Model 2 _ Random Forest Classifier

In [None]:
# Initialize RandomForestClassifier with class weights to handle imbalance
rf_model = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)

# Fitting the model
rf_model.fit(X_train, y_train)

# Making predictions on the test set
y_pred_rf = rf_model.predict(X_test)
y_pred_prob_rf = rf_model.predict_proba(X_test)[:, 1]

# Evaluating the model
precision_rf, recall_rf, _ = precision_recall_curve(y_test, y_pred_prob_rf)
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
report_rf = classification_report(y_test, y_pred_rf)

# Calculating AUPRC
auprc_rf = auc(recall_rf, precision_rf)

# Displaying the evaluation metrics
print("\nConfusion Matrix:\n", conf_matrix_rf)
print("\nClassification Report:\n", report_rf)
print(f"Area Under the Precision-Recall Curve (AUPRC): {auprc_rf:.2f}")

In [None]:
# Access one of the trees in the forest (e.g., the first tree)
tree = rf_model.estimators_[0]  # Accessing the first tree in the forest

# Visualizing the tree
plt.figure(figsize=(10, 8))
plot_tree(tree,
          filled=True,
          feature_names=X_train.columns if isinstance(X_train, pd.DataFrame) else ['Feature_' + str(i) for i in range(X_train.shape[1])],
          class_names=['Not Fraud', 'Fraud'],
          rounded=True)
plt.title('Decision Tree Visualization (Tree 1 of Random Forest)')
plt.show()


The decision tree visualization provided represents one of the individual trees from the Random Forest model built to classify transactions as either fraudulent or non-fraudulent. The tree is composed of several levels where splits are made based on specific feature values, with each node representing a decision point. The tree uses these splits to divide the dataset into smaller subsets, attempting to group similar observations together. The decision rules at each node are determined based on features that maximize information gain, measured by a reduction in Gini impurity, which indicates how mixed the classes are within a node. A Gini value of 0 at a node means that all samples belong to a single class, making the node pure.

As the tree progresses down each path, it uses different features, such as Feature_1, Feature_14, and Feature_12, suggesting that these attributes are influential in distinguishing between fraudulent and non-fraudulent transactions. Early splits in the tree rely on these critical features, helping the model make initial decisions about which direction the transaction should proceed. The tree then continues to branch out further using other features, each split further refining the classification until the leaf nodes are reached. These leaf nodes represent the final classification for the observations, where the "class" value shows whether the majority of samples at that node are classified as fraud or not fraud. It's notable that some branches become pure (Gini = 0) quite quickly, indicating a clear separation of classes based on certain feature values, while other branches require more depth to achieve purity.

This individual tree is just one part of the entire Random Forest, which is composed of many such trees, each capturing different aspects of the data. While this single tree provides insight into how certain features and their values influence the classification process, it is important to remember that the forest, as a whole, averages the predictions from all these trees, reducing the risk of overfitting. Overfitting is a concern when trees grow too deep and fit noise in the training data rather than capturing general patterns. However, the Random Forest model mitigates this by aggregating results across multiple trees, leading to a more robust and generalizable model. This visualization allows us to understand the decision-making process within the forest and highlights the importance of certain features in classifying fraudulent versus non-fraudulent transactions.
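
The Gini impurity referred to in the tree discussion can be computed by hand. The sketch below shows the impurity of a node and the reduction achieved by a split; the label lists are made up for illustration, and the helper names `gini` and `impurity_decrease` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Weighted reduction in Gini impurity from splitting parent into left and right."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - children

# Hypothetical node with 6 legitimate (0) and 2 fraudulent (1) samples
parent = [0, 0, 0, 0, 0, 0, 1, 1]
left, right = [0, 0, 0, 0, 0, 0], [1, 1]  # a split that separates the classes perfectly

print(gini(parent))                           # 0.375
print(gini(left), gini(right))                # 0.0 0.0 (both children are pure)
print(impurity_decrease(parent, left, right)) # 0.375
```

At each node, the tree picks the feature and threshold with the largest such decrease among the candidates it considers.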

# Unsupervised Algorithm

## Model 3 _ Isolation Forest

In [None]:
# Initialize the Isolation Forest model
# Here we use a high contamination of 0.05 to try to catch more fraudulent transactions
isolation_forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)

# Fit the model using the X_final_scaled from before
isolation_forest.fit(X_final_scaled)

# Predict anomalies using the Isolation Forest model
# The predict method returns -1 for anomalies and 1 for normal data
y_pred_if = isolation_forest.predict(X_final_scaled)

# Convert -1 (anomalies) to 1 (fraud) and 1 (normal) to 0 (non-fraud)
y_pred_converted_if = [1 if x == -1 else 0 for x in y_pred_if]

# Evaluating the model
conf_matrix_if = confusion_matrix(y, y_pred_converted_if)
report_if = classification_report(y, y_pred_converted_if)

# Step 1: Get anomaly scores from the Isolation Forest model
# The decision_function method provides the anomaly scores (higher score indicates a normal point, lower score indicates an outlier)
anomaly_scores = isolation_forest.decision_function(X_final_scaled)

# Step 2: Calculate precision, recall, and thresholds using precision_recall_curve
# Use negative scores to match with anomaly prediction
precision_if, recall_if, thresholds = precision_recall_curve(y, -anomaly_scores)

# Step 3: Calculate the AUPRC
auprc_if = auc(recall_if, precision_if)

# Displaying the evaluation metrics
print("\nConfusion Matrix:\n", conf_matrix_if)
print("\nClassification Report:\n", report_if)
print(f"Area Under the Precision-Recall Curve (AUPRC): {auprc_if:.2f}")
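
The `contamination` setting fixes the share of rows that are labelled as anomalies, which in turn caps the precision that the hard predictions can reach. A back-of-the-envelope sketch, using the approximate fraud rate of 0.17% and the contamination of 0.05 from this notebook (the helper name is hypothetical):

```python
def precision_upper_bound(fraud_rate, contamination):
    """Best achievable precision when `contamination` of all rows are flagged.

    Even if every fraud lands among the flagged rows, at most
    fraud_rate / contamination of the flagged rows can be fraud.
    """
    return min(1.0, fraud_rate / contamination)

# About 0.17% fraud, and contamination=0.05 as in the cell above
bound = precision_upper_bound(fraud_rate=0.0017, contamination=0.05)
print(f"precision can be at most {bound:.1%}")
```

So with these settings, low precision in the classification report is expected by construction; a smaller contamination value would raise the ceiling at the cost of recall.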

## Model 4 _ DBSCAN

In [None]:
# Initialize the DBSCAN model
dbscan = DBSCAN(eps=0.5, min_samples=5)  # You can tune these hyperparameters

# Here we would fit and predict directly
y_preds_dbscan = dbscan.fit_predict(X_final_scaled)

# Convert -1 (anomalies) to 1 (fraud) and others to 0 (non-fraud)
y_preds_convert_dbscan = [1 if x == -1 else 0 for x in y_preds_dbscan]

# Evaluating the model
conf_matrix_dbscan = confusion_matrix(y, y_preds_convert_dbscan)
report_dbscan = classification_report(y, y_preds_convert_dbscan)

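# Calculate precision, recall, and thresholds using precision_recall_curve
# Note: DBSCAN yields hard labels only, so this curve has a single operating point
# and the resulting area is not comparable with the score-based curves of the other models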
precision_dbscan, recall_dbscan, thresholds_dbscan = precision_recall_curve(y, y_preds_convert_dbscan)

# Calculate the AUPRC
auprc_dbscan = auc(recall_dbscan, precision_dbscan)

print("\nConfusion Matrix:\n", conf_matrix_dbscan)
print("\nClassification Report:\n", report_dbscan)
print(f"Area Under the Precision-Recall Curve (AUPRC): {auprc_dbscan:.2f}")
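
Because DBSCAN supplies hard 0/1 labels rather than scores, the precision-recall curve computed above consists of three points: (recall 1, precision equal to the fraud prevalence), the single operating point of the predictions, and the end point (recall 0, precision 1). The sketch below computes the trapezoidal area under such a three-point curve. The operating point used in the example is hypothetical, not the notebook's actual DBSCAN result.

```python
def three_point_auprc(precision, recall, prevalence):
    """Trapezoidal area under a PR curve that has one real operating point.

    Curve points, as (recall, precision):
    (0, 1) -> (recall, precision) -> (1, prevalence)
    """
    left = recall * (1.0 + precision) / 2.0                   # from recall 0 to the point
    right = (1.0 - recall) * (precision + prevalence) / 2.0   # from the point to recall 1
    return left + right

# Hypothetical case: almost every row is labelled as noise, so recall is 1
# while precision is no better than the prevalence
area = three_point_auprc(precision=0.002, recall=1.0, prevalence=0.002)
print(round(area, 3))  # 0.501
```

In this hypothetical case the area is about 0.5 even though the labels carry no information, so an AUPRC near 0.5 obtained from hard labels should be read with caution.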

# Area Under the Precision-Recall Curve for All the Models

In [None]:
plt.figure(figsize=(8, 6))

# Plotting Precision-Recall Curve for Logistic Regression
plt.plot(recall_lr, precision_lr, label=f'Logistic Regression (AUPRC = {auprc_lr:.2f})')

# Plotting Precision-Recall Curve for Random Forest
plt.plot(recall_rf, precision_rf, label=f'Random Forest (AUPRC = {auprc_rf:.2f})')

# Plotting Precision-Recall Curve for Isolation Forest
plt.plot(recall_if, precision_if, label=f'Isolation Forest (AUPRC = {auprc_if:.2f})')

# Plotting Precision-Recall Curve for DBSCAN
plt.plot(recall_dbscan, precision_dbscan, label=f'DBSCAN (AUPRC = {auprc_dbscan:.2f})')

# Adding labels, title, legend, and grid
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve for Multiple Models')
plt.legend()
plt.grid()

# Show plot
plt.show()

# Evaluating the Models

In [None]:
# Initializing an empty list to store results
results = []

models = {
    'Logistic Regression': y_pred_lr,
    'Random Forest': y_pred_rf,
    'Isolation Forest': y_pred_converted_if,
    'DBSCAN': y_preds_convert_dbscan
}

# Evaluating the stored predictions of each model
for model_name, preds in models.items():

    # Calculating metrics
    # Supervised models are scored on the held-out test set, unsupervised models on the full dataset
    if (model_name == 'Logistic Regression') or (model_name == 'Random Forest'):
        precision = precision_score(y_test, preds)
        recall = recall_score(y_test, preds)
    else:
        precision = precision_score(y, preds)
        recall = recall_score(y, preds)

    # Append results to the list
    results.append({
        'Model': model_name,
        'Precision': precision,
        'Recall': recall
    })

# Converting results to DataFrame
df_results = pd.DataFrame(results)

# Displaying the results
display(df_results)

# Conclusion


**Logistic Regression (AUPRC = 0.70):** This model performs moderately, with an AUPRC (area under the precision-recall curve) of 0.70. Its precision, however, is very poor. This is understandable given the imbalanced dataset: with balanced class weights, the model is likely to flag many legitimate transactions in order to catch more frauds. Note that the supervised models were evaluated on the stratified 20% test split, which is a smaller sample than the full dataset.

**Random Forest (AUPRC = 0.80):** The Random Forest model achieves the best AUPRC of the four models, 0.80, which indicates good separation between the classes, although it remains below a perfect score of 1.0. It also has very good precision and decent recall. As with Logistic Regression, this result comes from the stratified 20% test split, which is a smaller sample than the full dataset.

**Isolation Forest (AUPRC = 0.09):** Isolation Forest performs poorly, with an AUPRC of only 0.09. For a precision-recall curve, the random-guess baseline is the fraud prevalence (about 0.0017 here), not 0.5, so the model does better than chance but falls far short of the supervised models. This is despite being fitted and evaluated on the entire dataset rather than on a held-out split.

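**DBSCAN (AUPRC = 0.50):** DBSCAN (density-based spatial clustering of applications with noise) produces only hard labels, so its precision-recall curve is built from a single operating point, and the reported AUPRC of 0.50 is not directly comparable with the score-based curves of the other models. Its precision and recall in the results table above are the more informative measures. Overall, the unsupervised algorithms do not perform as well as the supervised ones.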
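
When reading AUPRC values, it helps to know what an uninformative model would score. The sketch below estimates this on synthetic labels that are roughly as imbalanced as this dataset, using a rank-based average precision (a step-wise estimate that is close to, but not identical to, the trapezoidal `auc` used in this notebook). The data, the prevalence of 0.002, and the helper `average_precision` are assumptions for illustration.

```python
import numpy as np

def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive."""
    order = np.argsort(-scores, kind="stable")
    y_sorted = y_true[order]
    hits = np.cumsum(y_sorted)
    precision_at_k = hits / np.arange(1, len(y_sorted) + 1)
    return float(precision_at_k[y_sorted == 1].mean())

rng = np.random.default_rng(0)
n, prevalence = 100_000, 0.002  # synthetic, roughly as imbalanced as this dataset
y_demo = (prevalence > rng.random(n)).astype(int)

ap_random = average_precision(y_demo, rng.random(n))           # uninformative scores
ap_perfect = average_precision(y_demo, y_demo.astype(float))   # scores equal to the labels

print(f"prevalence={y_demo.mean():.4f}, random AP={ap_random:.4f}, perfect AP={ap_perfect:.1f}")
```

Random scores land near the prevalence rather than near 0.5, which is the reference point against which the AUPRC values above should be judged.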