<a href="https://colab.research.google.com/github/narwhalhorned/creditcard/blob/main/pycreditfraud.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Setup

In [None]:
pip install pycaret

In [23]:
import joblib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from google.colab import drive
from sklearn.metrics import confusion_matrix

# The star import exposes setup, compare_models, evaluate_model, predict_model and save_model
from pycaret.classification import *

In [3]:
drive.mount('/content/drive')
file_path = '/content/drive/My Drive/data/creditcard.csv'
df = pd.read_csv(file_path)

Mounted at /content/drive


In [4]:
df.head()

|   | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|------|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|--------|-------|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


* This dataset contains only numerical input variables, most of which result from a PCA transformation. For confidentiality reasons, the original features and further background information about the data are not provided.

* Features V1 to V28 represent the principal components obtained through PCA. The only features not transformed with PCA are 'Time' and 'Amount'.

* 'Time': Represents the seconds elapsed between each transaction and the first transaction in the dataset.

* 'Amount': Represents the transaction amount.

* 'Class': Indicates whether a transaction is fraudulent (1) or not (0).
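
Since 'Class' is the label to be predicted, it helps to check how many rows fall into each class before modeling. The snippet below is a minimal sketch of that check; `class_balance` and the toy frame are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import pandas as pd


def class_balance(frame: pd.DataFrame, target: str = "Class") -> pd.DataFrame:
    """Return the count and share of each label in the target column."""
    counts = frame[target].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Tiny made-up frame standing in for the real data: 2 fraud rows out of 10.
toy = pd.DataFrame({"Amount": [10.0] * 10, "Class": [0] * 8 + [1] * 2})
balance = class_balance(toy)
print(balance)
```

In the notebook itself, the same helper could be called as `class_balance(df)` once the CSV is loaded.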

In [6]:
setup(df, target='Class', session_id=122)  # pass use_gpu=True to train on a GPU

|   | Description | Value |
|---|-------------|-------|
| 0 | Session id | 122 |
| 1 | Target | Class |
| 2 | Target type | Binary |
| 3 | Original data shape | (284807, 31) |
| 4 | Transformed data shape | (284807, 31) |
| 5 | Transformed train set shape | (199364, 31) |
| 6 | Transformed test set shape | (85443, 31) |
| 7 | Numeric features | 30 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |


<pycaret.classification.oop.ClassificationExperiment at 0x7955413be320>

#### *Additional Info*

* Session id: The random seed (here 122) used for every randomized step, which makes the experiment reproducible.

* Target: The column to be predicted (the dependent variable), here 'Class'.

* Target type: The kind of problem inferred from the target column, here binary classification.

* Original data shape: Size of data (284,807 rows and 31 columns).

* Transformed data shape: The shape of the dataset after preprocessing.

* Transformed train set shape: The size of training dataset (199,364 rows and 31 columns).

* Transformed test set shape: The size of test dataset used for model evaluation (85,443 rows and 31 columns).

* Numeric features: The number of numeric features in the dataset.

* Preprocess: Indicates whether preprocessing steps have been applied to the data.

* Imputation type: Specifies the type of imputation method used for handling missing values.

* Numeric imputation: The specific imputation method used for numeric features. It's set to "mean," which means that missing values in numeric features have been imputed with the mean of the non-missing values.

* Categorical imputation: The specific imputation method used for categorical features. It's set to "mode," which means that missing values in categorical features have been imputed with the mode (most frequent category) of the non-missing values.

* Fold Generator: The strategy used to generate folds during cross-validation. "StratifiedKFold" ensures that each fold has a similar distribution of the target variable.

* Fold Number: The number of folds used for cross-validation.

* CPU Jobs: The number of CPU cores used for parallel processing.

* Use GPU: Indicates whether a GPU is used for computation.

* Log Experiment: Indicates whether the details of the experiment are logged.

* Experiment Name: The name for the current experiment.

* USI: A short, automatically generated identifier for the session, which can be used to track or reference a specific experiment run.

These values provide an overview of the session and its settings, including data preprocessing, model configuration, and experiment details within PyCaret.
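
To make the "Fold Generator" entry concrete, here is a minimal sketch of what stratification does, using scikit-learn's `StratifiedKFold` directly rather than PyCaret. The labels are made up (2% positives), and the choice of 10 folds is only illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 1000 rows with 2% positives, standing in for the fraud labels.
y = np.array([1] * 20 + [0] * 980)
X = np.zeros((len(y), 1))  # feature values do not matter for showing the split

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=122)

# Count how many positive rows land in each validation fold.
positives_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(positives_per_fold)
```

Each validation fold receives the same number of positive rows, so a rare class is never absent from a fold by chance.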

### Model Training, Comparison & Evaluation Metrics (1)

In [9]:
best_model = compare_models()


|   | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|-------|----------|-----|--------|-------|----|-------|-----|----------|
| rf | Random Forest Classifier | 0.9996 | 0.9524 | 0.8115 | 0.9593 | 0.8781 | 0.8779 | 0.8816 | 194.082 |
| et | Extra Trees Classifier | 0.9996 | 0.9508 | 0.8085 | 0.9622 | 0.8776 | 0.8774 | 0.8813 | 20.873 |
| xgboost | Extreme Gradient Boosting | 0.9996 | 0.9815 | 0.8027 | 0.9653 | 0.8757 | 0.8755 | 0.8796 | 3.946 |
| lda | Linear Discriminant Analysis | 0.9994 | 0.9114 | 0.7855 | 0.8588 | 0.818 | 0.8177 | 0.8198 | 1.599 |
| dt | Decision Tree Classifier | 0.9993 | 0.8867 | 0.7737 | 0.792 | 0.7818 | 0.7815 | 0.782 | 16.274 |
| ada | Ada Boost Classifier | 0.9993 | 0.9829 | 0.7184 | 0.8388 | 0.7726 | 0.7722 | 0.7752 | 56.348 |
| lr | Logistic Regression | 0.9992 | 0.9522 | 0.6317 | 0.8386 | 0.7177 | 0.7172 | 0.7259 | 11.929 |
| gbc | Gradient Boosting Classifier | 0.999 | 0.6653 | 0.546 | 0.8299 | 0.6267 | 0.6263 | 0.6542 | 301.773 |
| ridge | Ridge Classifier | 0.9988 | 0.0 | 0.4226 | 0.8156 | 0.5494 | 0.5489 | 0.5822 | 0.264 |
| knn | K Neighbors Classifier | 0.9984 | 0.5989 | 0.0467 | 0.8 | 0.088 | 0.0879 | 0.1915 | 52.798 |



### Evaluation metrics (2)

The plots below are used to inspect the prediction results, in particular the confusion matrix.

The other plots serve as a guide for fine-tuning the model.

In [10]:
evaluate_model(best_model)


Here is what the values in the **confusion matrix** mean:

- True Negatives (TN): The number of correctly predicted non-fraudulent cases. In our case, it's **85286**. These are the transactions that the model **correctly identified** as **non-fraudulent**.

- False Positives (FP): The number of actual non-fraudulent cases that were **incorrectly predicted as fraudulent**. In our case, it's **9**. These are the transactions that were not fraudulent but were incorrectly flagged by the model.

- False Negatives (FN): The number of actual fraudulent cases that were **incorrectly predicted as non-fraudulent**. In our case, it's **44**. These are the transactions that were actually fraudulent but were missed by the model.

- True Positives (TP): The number of correctly predicted fraudulent cases. In our case, it's **104**. These are the transactions that the model **correctly identified** as **fraudulent**.


Analyzing these values shows how well the model identifies fraudulent transactions (TP) and how many it misses (FN). Here it caught 104 of the 148 fraudulent transactions in the test set and missed 44. False positives (FP) should also stay low to avoid unnecessary fraud alerts; only 9 legitimate transactions were flagged. The high number of true negatives (TN) shows that the model correctly identifies almost all non-fraudulent transactions, which make up the vast majority of the data.
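
The four counts above are enough to recompute the headline metrics by hand. The sketch below uses the standard textbook formulas, not PyCaret's internal code.

```python
# Confusion-matrix counts reported above for the test set.
tn, fp, fn, tp = 85286, 9, 44, 104

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of flagged transactions that are really fraud
recall = tp / (tp + fn)     # share of fraud cases that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.9994 precision=0.9204 recall=0.7027 f1=0.7969
```

The counts add up to 85,443, the size of the test set reported by `setup`.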

### **Overall, high TN and TP counts, along with low FN and FP counts, suggest strong model performance in fraud detection.**

#### *Additional info*

Among the other plots, I look at the hyperparameters, the learning curve and the prediction error to assess performance and guide fine-tuning.

### Predict on Test set

In [11]:
pred_holdout = predict_model(best_model)

|   | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|-------|----------|-----|--------|-------|----|-------|-----|
| 0 | Random Forest Classifier | 0.9994 | 0.9271 | 0.7027 | 0.9204 | 0.7969 | 0.7966 | 0.8039 |


| Model                    | Metric           | Value  | Description                           | Reasoning                                |
|--------------------------|------------------|--------|---------------------------------------|------------------------------------------|
| Random Forest Classifier | Accuracy         | 0.9994 | Proportion of correctly classified samples | Because fraud cases are so rare, accuracy is dominated by the non-fraud class and says little on its own. |
|                          | AUC              | 0.9271 | Area under the ROC curve               | AUC measures the model's ability to distinguish between classes. A value close to 1 indicates strong discrimination. |
|                          | Recall (Sensitivity) | 0.7027 | True positive rate: TP / (TP + FN) | About 70% of the actual fraud cases are caught; the remaining 30% are missed. |
|                          | Precision (Positive Predictive Value) | 0.9204 | Share of predicted positives that are truly positive: TP / (TP + FP) | About 92% of the flagged transactions are actually fraudulent, so false alerts are rare. |
|                          | F1 Score         | 0.7969 | Harmonic mean of precision and recall | F1 balances precision and recall. Here it is pulled down by the lower recall. |
|                          | Kappa            | 0.7966 | Cohen's Kappa statistic               | Kappa measures agreement between predicted and actual labels, corrected for the agreement expected by chance. |
|                          | MCC (Matthews Correlation Coefficient) | 0.8039 | Correlation between true and predicted binary classifications | MCC is a balanced metric that accounts for true positives, true negatives, false positives, and false negatives, which makes it suitable for imbalanced classes. |
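
Kappa and MCC can also be recomputed from the confusion-matrix counts reported earlier (TN = 85286, FP = 9, FN = 44, TP = 104). This is a minimal sketch using the standard definitions, not PyCaret's own implementation.

```python
from math import sqrt

# Confusion-matrix counts for the test set, as reported in the evaluation section.
tn, fp, fn, tp = 85286, 9, 44, 104
n = tn + fp + fn + tp

# Cohen's kappa: observed agreement corrected for agreement expected by chance.
p_observed = (tp + tn) / n
p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
kappa = (p_observed - p_expected) / (1 - p_expected)

# Matthews correlation coefficient: uses all four cells of the matrix.
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"kappa={kappa:.4f} mcc={mcc:.4f}")
# kappa=0.7966 mcc=0.8039
```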


### Fine Tuning

In [24]:
### Future list:
### Hyperparameter tuning
### Apply resampling (Might not use it for this dataset)
### Currently showing a quick approach of model comparison and not in-depth (not fine-tuned as of now)
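
As an illustration of the hyperparameter tuning listed above, here is a minimal sketch that uses scikit-learn's `GridSearchCV` on synthetic data instead of the fraud dataset. The parameter grid, the data and the choice of recall as the scoring metric are assumptions made for the example, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in data (roughly 10% positives); not the fraud dataset.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=122
)

# Illustrative grid only; real tuning would search a wider range.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=122),
    param_grid,
    scoring="recall",  # optimise for catching positives rather than for accuracy
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=122),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Within PyCaret, `tune_model(best_model, optimize='Recall')` plays a similar role and would be the more natural choice in this notebook.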

### Save the model for future use (saved as a pickle file)

In [13]:
save_model(best_model, 'final_model')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['Time', 'V1', 'V2', 'V3', 'V4',
                                              'V5', 'V6', 'V7', 'V8', 'V9',
                                              'V10', 'V11', 'V12', 'V13', 'V14',
                                              'V15', 'V16', 'V17', 'V18', 'V19',
                                              'V20', 'V21', 'V22', 'V23', 'V24',
                                              'V25', 'V26', 'V27', 'V28',
                                              'Amount'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=...
                  RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                  

### Predict on new data

In [12]:
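# Load the saved pipeline; PyCaret appends the .pkl extension itself
loaded_model = load_model('final_model')
# new_data.csv must contain the same feature columns as the training data
new_data = pd.read_csv('new_data.csv')
# Returns new_data with the predicted label and score appended as columns
new_predictions = predict_model(loaded_model, data=new_data)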

The code above applies the saved model to a new dataset, which must have the same feature columns as the training data.
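
Before predicting, it can be useful to confirm that the new file really has the expected columns. The helper below is a hypothetical sketch (`missing_and_extra_columns` is not part of PyCaret); the expected feature list follows the column names shown in the saved pipeline.

```python
import pandas as pd


def missing_and_extra_columns(
    new_frame: pd.DataFrame, expected: list[str]
) -> tuple[list[str], list[str]]:
    """Return (missing, extra) column names relative to the expected features."""
    present = set(new_frame.columns)
    wanted = set(expected)
    missing = [c for c in expected if c not in present]
    extra = [c for c in new_frame.columns if c not in wanted]
    return missing, extra


# Feature names as listed in the saved pipeline: Time, V1..V28, Amount.
expected_features = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]

# Made-up incoming frame that lacks 'Amount' and carries a stray 'id' column.
toy_new = pd.DataFrame([[0.0] * 30], columns=["id"] + expected_features[:-1])
missing, extra = missing_and_extra_columns(toy_new, expected_features)
print(missing, extra)  # ['Amount'] ['id']
```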