#**SUMMARY**:

This capstone project applies data science techniques to fraud detection and covers each stage of the project lifecycle. The sections are outlined below:

1. Dataset: The project uses a credit card transactions dataset (`creditcard.csv`) in which the `Class` column marks each transaction as fraudulent (1) or non-fraudulent (0).

2. Project Objectives and Scope: Key questions include defining success for the fraud detection model and anticipating challenges during the model's implementation.

3. Data Analysis: Exploring how to handle multicollinearity in the dataset and the significance of feature importance in the analysis process.

4. Data Preprocessing: Techniques for dealing with categorical data and the significance of splitting the dataset into training and testing sets for model development.

5. Model Training: Understanding the differences between Gaussian Naive Bayes and other variants, and strategies for optimizing model parameters.

6. Model Evaluation: Ensuring the reliability of evaluation metrics and understanding the impact of false positives and false negatives in fraud detection.

7. Results and Interpretation: Validating model results and discussing the implications of the findings for stakeholders.

8. Model Improvement: Incorporating domain knowledge into the model and considering advanced techniques like ensemble methods for enhancing performance.

9. Practical Implementation: Ensuring scalability of the fraud detection model and addressing data privacy and security concerns.

10. Technical Implementation: Managing dependencies and version control for implementation and utilizing pipelines in machine learning workflows for efficiency.

Together, these sections take the project from data analysis through model training and evaluation to practical implementation, with the goal of accurate, reliable fraud detection that remains scalable and secure.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel


In [None]:
# Load dataset
df = pd.read_csv('/content/creditcard.csv')

#**Data** **Analysis**

In [None]:
# Compute the pairwise correlation matrix to check for multicollinearity
corr_matrix = df.corr()
print(corr_matrix)

            Time            V1            V2            V3            V4  \
Time    1.000000  1.173963e-01 -1.059333e-02 -4.196182e-01 -1.052602e-01   
V1      0.117396  1.000000e+00  3.777823e-12 -2.118614e-12 -1.733159e-13   
V2     -0.010593  3.777823e-12  1.000000e+00  2.325661e-12 -2.314981e-12   
V3     -0.419618 -2.118614e-12  2.325661e-12  1.000000e+00  2.046235e-13   
V4     -0.105260 -1.733159e-13 -2.314981e-12  2.046235e-13  1.000000e+00   
V5      0.173072 -3.473231e-12 -1.831952e-12 -4.032993e-12 -2.552389e-13   
V6     -0.063016 -1.306165e-13  9.438444e-13 -1.574471e-13  1.084041e-12   
V7      0.084714 -1.116494e-13  5.403436e-12  3.405586e-12  8.135064e-13   
V8     -0.036949  2.114527e-12  2.133785e-14 -1.272385e-12  7.334818e-13   
V9     -0.008660  3.016285e-14  3.238513e-13 -6.812351e-13 -7.143069e-13   
V10     0.030617 -2.615192e-12  1.463282e-12 -1.609126e-12 -1.938143e-12   
V11    -0.247689  1.866551e-12 -8.314960e-13  8.707055e-13  1.874473e-12   
V12     0.12

In [None]:
# Remove highly correlated features
threshold = 0.7
# Compare features only; 'Class' is the target and must never be dropped
corr_abs = corr_matrix.drop(index='Class', columns='Class').abs()
# Keep the strict upper triangle so each feature pair is checked once
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if (upper[column] > threshold).any()]
df.drop(columns=to_drop, inplace=True)
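
The upper-triangle check can be exercised on a small synthetic frame to confirm that it drops only the redundant column. This is a minimal sketch: the helper name `correlated_columns` and the toy columns `a`, `b`, `c` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def correlated_columns(features: pd.DataFrame, threshold: float = 0.7) -> list:
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = features.corr().abs()
    # Strict upper triangle: each pair of columns is inspected exactly once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Hypothetical toy data: "b" is almost a copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=500),
    "c": rng.normal(size=500),
})

dropped = correlated_columns(toy)
print(dropped)  # ['b']
```

Only the later column of a correlated pair is reported, so one representative of each pair is always kept.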

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('Class', axis=1), df['Class'], test_size=0.2, random_state=42)
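
With a class as rare as fraud, a purely random split can leave the test set with noticeably more or fewer positive cases than expected. One common option, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. The sketch below uses hypothetical synthetic labels rather than the project's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 20 positives among 10,000 rows.
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 3))
y = np.zeros(10_000, dtype=int)
y[:20] = 1

# stratify=y preserves the 0.2% positive rate in both train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(y_tr.sum(), y_te.sum())  # 16 4
```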



#**Data** **Preprocessing**

In [None]:
# Standardize features; the scaler is fit on the training set only to avoid leaking test statistics
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
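
`Pipeline` is imported at the top of the notebook but not otherwise used. A possible way to apply it is to chain the scaler and the classifier, so that the scaler is always fit on training data only and the same transformation is reused at prediction time. The following is a sketch on hypothetical random data, not the notebook's actual workflow.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for X_train / X_test.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_train = (X_train[:, 0] > 5.0).astype(int)
X_test = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

# fit() scales then trains; predict() reuses the scaler fitted on the training data.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("gnb", GaussianNB()),
])
model.fit(X_train, y_train)
pred = model.predict(X_test)

print(pred.shape)  # (50,)
```

Because the whole pipeline is a single estimator, it can also be passed to cross-validation utilities without leaking test-fold statistics into the scaler.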

#**Model** **Training**

In [None]:
# Model training
gnb = GaussianNB()
gnb.fit(X_train_scaled, y_train)
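
`GaussianNB` is trained here with its default settings. Its main tunable parameter is `var_smoothing`, and one way to optimize it is a cross-validated grid search. The sketch below assumes synthetic data from `make_classification` and an arbitrary search grid; the scoring choice (`average_precision`) is also an assumption, picked because it is less sensitive to class imbalance than accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Hypothetical imbalanced data (about 5% positives).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Search var_smoothing on a log scale with 5-fold stratified cross-validation.
search = GridSearchCV(
    GaussianNB(),
    param_grid={"var_smoothing": np.logspace(-12, -3, 10)},
    scoring="average_precision",
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```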

#**Model** **Evaluation**

In [None]:
# Model evaluation
y_pred = gnb.predict(X_test_scaled)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.9778273234788104
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.98      0.99     56864
           1       0.06      0.82      0.11        98

    accuracy                           0.98     56962
   macro avg       0.53      0.90      0.55     56962
weighted avg       1.00      0.98      0.99     56962

Confusion Matrix:
[[55619  1245]
 [   18    80]]


##**Results and Interpretation**

Based on the classification report and confusion matrix above, the results can be interpreted as follows:

- Precision:
  - For class 0 (the majority class), the precision is 1.00 after rounding: of the 55637 transactions predicted as non-fraudulent, 55619 really are non-fraudulent and 18 are missed fraud cases.
  - For class 1 (the minority class), the precision is 0.06: only about 6% of the transactions flagged as fraudulent are actually fraud (80 of 1325), because the 1245 false positives far outnumber the 80 true positives.

- Recall:
  - The recall for class 0 is 0.98, meaning the model correctly identifies 98% of the non-fraudulent transactions.
  - The recall for class 1 is 0.82, meaning the model catches 80 of the 98 fraudulent transactions in the test set and misses 18.

- F1-score:
  - The F1-score is the harmonic mean of precision and recall. For class 0 it is 0.99, since both values are high. For class 1 it is only 0.11: the harmonic mean is dominated by the smaller value, so the very low precision pulls the score down despite the good recall.

- Support:
  - Support is the number of true instances of each class in the test set: 56864 non-fraudulent and 98 fraudulent transactions.

The confusion matrix complements the classification report by giving the raw counts behind these metrics. The model correctly classified 55619 of the 56864 non-fraudulent transactions and wrongly flagged the other 1245 as fraud (false positives). It correctly identified 80 of the 98 fraudulent transactions and missed 18 (false negatives).
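
The class 1 figures in the report can be reproduced directly from the four confusion matrix counts. The short calculation below uses only the numbers printed above; the `baseline_accuracy` line is an added illustration of what a model that never predicts fraud would score on this test set.

```python
import numpy as np

# Confusion matrix from the evaluation output: rows = true class, columns = predicted class.
cm = np.array([[55619, 1245],
               [18, 80]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)          # share of fraud alerts that are real fraud
recall = tp / (tp + fn)             # share of real fraud that is caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

# Accuracy of a trivial model that labels every transaction as non-fraudulent.
baseline_accuracy = cm[0].sum() / cm.sum()

print(f"{precision:.2f} {recall:.2f} {f1:.2f} {accuracy:.4f}")  # 0.06 0.82 0.11 0.9778
print(f"{baseline_accuracy:.4f}")  # 0.9983
```

The trivial baseline scores higher on accuracy than the trained model, which is why accuracy alone is a poor guide on data this imbalanced.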

Overall, the model detects most fraud (recall 0.82) but at the cost of many false alarms: precision for class 1 is only 0.06, which drives its F1-score down to 0.11. The 0.98 accuracy is dominated by the majority class, since fraud makes up only 98 of the 56962 test transactions (about 0.17%), so it says little about the quality of fraud detection. Further refinement should therefore focus on reducing false positives, for example by addressing the class imbalance or adjusting the decision threshold.
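
One way to explore the trade-off between false positives and false negatives is to threshold the predicted fraud probability at values other than the default 0.5. The sketch below runs on hypothetical synthetic data, so its numbers say nothing about the credit card dataset; the thresholds are arbitrary examples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical imbalanced data (about 3% positives).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = GaussianNB().fit(X_tr, y_tr)
fraud_prob = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

recalls = []
for threshold in (0.5, 0.9, 0.99):
    y_hat = (fraud_prob >= threshold).astype(int)
    p = precision_score(y_te, y_hat, zero_division=0)
    r = recall_score(y_te, y_hat)
    recalls.append(r)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")

# Threshold-independent summary of the precision-recall trade-off.
ap = average_precision_score(y_te, fraud_prob)
print(f"average precision={ap:.2f}")
```

Raising the threshold can only keep or reduce recall, so the acceptable operating point depends on the relative cost of a missed fraud case versus a false alarm.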

In [None]:
# Feature importance
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train_scaled, y_train)
feature_importances = rf.feature_importances_
print('Feature Importances:')
print(feature_importances)

Feature Importances:
[0.0134378  0.01526844 0.01236817 0.01787922 0.03010317 0.01323203
 0.0130918  0.0269126  0.01223409 0.05530981 0.09203659 0.10056935
 0.08591615 0.00985921 0.12084374 0.01096314 0.06189566 0.15280664
 0.02419824 0.01204678 0.01371886 0.01509335 0.00955658 0.00889693
 0.00950858 0.00949304 0.01945112 0.01161559 0.01052851 0.01116482]
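
The printed array lists importances in the column order of the training data, which makes it hard to read on its own. A possible refinement is to attach column names and sort, and then use `SelectFromModel` (imported at the top but unused) to keep only the stronger features. The sketch below uses hypothetical synthetic data with made-up column names `f0` to `f7`; the `"median"` threshold is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Hypothetical data: 8 features, 3 of them informative.
X_arr, y = make_classification(
    n_samples=1000, n_features=8, n_informative=3, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Pair each importance with its column name and sort from most to least important.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Keep the features whose importance is at least the median importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median",
).fit(X, y)
selected = list(X.columns[selector.get_support()])
print(selected)
```

Setting `random_state` makes the importances reproducible between runs, which the notebook cell above does not guarantee.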
