# Project Name : Credit Card Fraud Detection using PyCaret

## In this notebook we will perform the following tasks:
- Data Analysis
- Feature Engineering
- Model Building and Prediction using ML Techniques
- Model Building and Prediction using PyCaret (AutoML)

### Importing Libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import scipy
from sklearn.metrics import classification_report,accuracy_score

### Importing Libraries for Outlier Detection

In [None]:
from sklearn.ensemble import IsolationForest

from sklearn.svm import OneClassSVM

### Reading our Dataset

In [8]:
from google.colab import drive
drive.mount('/content/drive')

In [9]:
df= pd.read_csv("/content/drive/MyDrive/creditcard.csv")

In [10]:
df.head()

### Data Analysis

In [11]:
df.shape

#### Checking Null Values

In [12]:
df.isnull().sum()

### Checking the distribution of Normal and Fraud cases in our Data Set

In [14]:
fraud_check = df['Class'].value_counts(sort=True)
fraud_check.plot(kind='bar', rot=0, color='r')
plt.title("Normal and Fraud Distribution")
plt.xlabel("Class")
plt.ylabel("Frequency")
## Defining labels to replace our 0 and 1 values
labels = ['Normal', 'Fraud']
## Mapping those labels
plt.xticks(range(2), labels)
plt.show()


#### Let us look at the shape of the Normal and Fraud subsets

In [15]:
fraud_people = df[df['Class']==1]
normal_people = df[df['Class']==0]

In [16]:
fraud_people.shape

In [17]:
normal_people.shape

#### Summary statistics of the transaction amount in both subsets

In [18]:
fraud_people['Amount'].describe()

In [19]:
normal_people['Amount'].describe()

#### Let us analyse it visually

In [21]:
graph, (plot1, plot2) = plt.subplots(2,1,sharex= True)
graph.suptitle('Average amount per class')
bins = 70

plot1.hist(fraud_people['Amount'] , bins = bins)
plot1.set_title('Fraud Amount')

plot2.hist(normal_people['Amount'] , bins = bins)
plot2.set_title('Normal Amount')

# plt.* calls act on the current axes, which here is the bottom subplot (plot2)
plt.xlabel('Amount ($)')
plt.ylabel('Number of Transactions')
plt.yscale('log')
plt.show()

#### Plotting a correlation heatmap

In [22]:
# Compute the correlation matrix once and reuse it for the heatmap
corr_matrix = df.corr()
plt.figure(figsize=(30,30))
g = sns.heatmap(corr_matrix, annot=True)

### Creating our Dependent and Independent Features

In [23]:
columns = df.columns.tolist()
# Making our Independent Features
columns = [var for var in columns if var not in ["Class"]]
# Making our Dependent Variable
target = "Class"
x= df[columns]
y= df[target]

In [24]:
x.shape

In [25]:
y.shape

In [26]:
x.head() ## Independent Variable

In [27]:
y.head() ## Dependent Variable

## Model building

### Splitting the data

In [28]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)

### We will be using the following models for our anomaly detection:
- Isolation Forest
- OneClassSVM

## Isolation Forest

Isolation Forest is an unsupervised machine learning algorithm used for anomaly detection. The algorithm is based on the concept of isolating anomalies rather than trying to model normal data points.

*   Here's how the Isolation Forest algorithm works:

1.  Isolation: The algorithm randomly selects a feature and a random split value within the range of that feature. It partitions the data points based on this split value, creating isolation. Anomalies are expected to require fewer partitions to be isolated compared to normal data points.

2.  Recursive Partitioning: The process of recursively partitioning continues until all data points are isolated or a predefined maximum depth is reached. Each partitioning creates a binary tree-like structure called an isolation tree.

3.  Anomaly Scoring: Anomaly scores are assigned to the data points based on the average path length required to isolate them. Anomalies, being isolated earlier, will have shorter average path lengths and thus higher anomaly scores.

4.  Thresholding: Based on the anomaly scores, a threshold can be set to determine the cutoff point for identifying anomalies. Data points with anomaly scores above the threshold are considered anomalies.

*   Isolation Forest has some advantages for anomaly detection:

1.  It is effective in high-dimensional datasets, unlike some other anomaly detection algorithms that struggle with the curse of dimensionality.

2.  It can handle both global and local anomalies, as it doesn't rely on assumptions about the data distribution.

3.  The algorithm is computationally efficient, especially for large datasets, as each tree is built from random splits on a single feature at a time and can be trained on a random subsample of the data.

4.  Isolation Forest can be applied to various domains, such as fraud detection, network intrusion detection, and outlier detection in data analysis.
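
The scoring and thresholding steps above can be sketched on synthetic data. This is a minimal illustration, not the notebook's model: the toy points, the variable names, and the 1st-percentile cutoff are all assumptions chosen for the example. Note that scikit-learn's `score_samples` returns the opposite of the anomaly score described above, so lower values mean more anomalous.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Synthetic data (assumption): a tight cluster plus a few far-away points
normal_points = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outlier_points = rng.uniform(low=8.0, high=10.0, size=(5, 2))
toy_data = np.vstack([normal_points, outlier_points])

toy_forest = IsolationForest(n_estimators=100, random_state=0)
toy_forest.fit(toy_data)

# score_samples: the lower the value, the shorter the average path length
toy_scores = toy_forest.score_samples(toy_data)
mean_normal_score = toy_scores[:500].mean()
mean_outlier_score = toy_scores[500:].mean()
print("mean score, cluster points:", mean_normal_score)
print("mean score, far-away points:", mean_outlier_score)

# Thresholding: flag points whose score falls at or below a chosen cutoff
# (the 1st percentile is an arbitrary choice for this sketch)
threshold = np.percentile(toy_scores, 1)
flagged = toy_scores <= threshold
print("points flagged as anomalies:", flagged.sum())
```

In the notebook itself the cutoff is not set by hand; `IsolationForest.predict` applies the threshold implied by the `contamination` parameter.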


In [29]:
iso_forest= IsolationForest(n_estimators=100, max_samples=len(x_train),random_state=0, verbose=0)                        

In [30]:
# IsolationForest is unsupervised: fit() accepts y only for API consistency and ignores it
iso_forest.fit(x_train, y_train)

In [31]:
ypred= iso_forest.predict(x_test)

In [32]:
ypred

#### Mapping the predictions to 0 (Normal) and 1 (Fraud)

In [33]:
ypred[ypred == 1] = 0
ypred[ypred == -1] = 1
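
scikit-learn's outlier detectors return `1` for inliers and `-1` for outliers from `predict`, which is why the remapping is needed before comparing with the `Class` column. As a minimal sketch on a made-up array (`raw_pred` and `mapped_pred` are hypothetical names, not part of the notebook), the same mapping can be written without modifying the array in place:

```python
import numpy as np

# Hypothetical predictions in scikit-learn's outlier convention: 1 = inlier, -1 = outlier
raw_pred = np.array([1, 1, -1, 1, -1])

# Map outliers to 1 (Fraud) and inliers to 0 (Normal), leaving raw_pred untouched
mapped_pred = np.where(raw_pred == -1, 1, 0)
print(mapped_pred)
```

Keeping the raw predictions avoids any dependence on the order of the two in-place assignments.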


### Accuracy score and Matrix
1.  Precision: Precision measures how many of the positively predicted instances are actually true positives. In other words, it focuses on the accuracy of positive predictions. It is calculated by dividing the number of true positives by the sum of true positives and false positives. A higher precision indicates fewer false positives.

2.  Recall: Recall, also known as sensitivity or true positive rate, measures how many of the actual positive instances are correctly identified by the model. It focuses on the ability to find all positive instances. It is calculated by dividing the number of true positives by the sum of true positives and false negatives. A higher recall indicates fewer false negatives.

3.  F1-score: The F1-score is the harmonic mean of precision and recall. It provides a single metric that combines both precision and recall into one value. The F1-score is useful when you want to consider both precision and recall simultaneously. It ranges from 0 to 1, with 1 being the best score.

4.  Support: Support refers to the number of instances in each class (positive or negative) in the dataset. It provides information about the distribution of classes and can help interpret the performance of the model.

In summary, precision tells us how many positive predictions are correct, recall tells us how many positive instances are correctly identified, and the F1-score provides a balanced measure by considering both precision and recall. The support value represents the number of instances in each class. These metrics are useful in assessing the performance of a classification model and understanding its strengths and weaknesses in identifying positive instances.
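
To make the formulas concrete, here is a small worked example that computes the three metrics by hand and checks them against scikit-learn. The labels are hypothetical and chosen only for illustration; they are not taken from the credit card data.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels for illustration only: 1 = Fraud (positive), 0 = Normal
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud correctly flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal flagged as fraud
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud that was missed

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("precision:", precision_manual, precision_score(y_true_toy, y_pred_toy))
print("recall:   ", recall_manual, recall_score(y_true_toy, y_pred_toy))
print("f1-score: ", f1_manual, f1_score(y_true_toy, y_pred_toy))
```

Here two of the four fraud cases are caught (recall 0.5) and two of the three fraud predictions are correct (precision about 0.67), which gives an F1-score of about 0.57.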

In [34]:
print(accuracy_score(y_test,ypred))

In [35]:
print(classification_report(y_test,ypred))

In [36]:
from sklearn.metrics import confusion_matrix



*   A confusion matrix provides a summary of the model's predictions by comparing them to the actual labels in the dataset. It helps us understand how well the model is performing in terms of correctly and incorrectly classifying instances.
*   The confusion matrix consists of four main elements:


1.  True Positives (TP): These are the instances that are correctly predicted as positive by the model. In other words, the model correctly identified positive instances.

2.  True Negatives (TN): These are the instances that are correctly predicted as negative by the model. The model correctly identified instances that are not positive.

3.  False Positives (FP): These are the instances that are incorrectly predicted as positive by the model. The model mistakenly identified instances as positive when they are actually negative.

4.  False Negatives (FN): These are the instances that are incorrectly predicted as negative by the model. The model failed to identify instances that are actually positive.

The confusion matrix is typically presented in tabular form, with the actual labels forming the rows and the predicted labels forming the columns. It helps us visualize and analyze the performance of the model across different classes.

By examining the values in the confusion matrix, we can derive several evaluation metrics such as accuracy, precision, recall, and F1-score, which provide insights into the model's performance for each class.

In summary, a confusion matrix is a useful tool that helps us understand how well a classification model is performing by comparing its predictions with the actual labels. It allows us to analyze the model's accuracy, as well as the rate of false positives and false negatives, providing valuable information to evaluate and improve the model's performance.
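
As a minimal sketch of how the four elements map onto scikit-learn's output, the example below builds a confusion matrix from hypothetical labels (not the notebook's data) and unpacks it. For binary labels 0 and 1, `confusion_matrix` returns `[[TN, FP], [FN, TP]]`, with actual labels as rows and predicted labels as columns.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only: 1 = Fraud, 0 = Normal
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

toy_matrix = confusion_matrix(y_true_toy, y_pred_toy)
print(toy_matrix)

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = toy_matrix.ravel()
accuracy_from_matrix = (tp + tn) / (tp + tn + fp + fn)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
print("accuracy:", accuracy_from_matrix)
```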

In [37]:
confusion_matrix(y_test, ypred)

### We can also print how many errors our model makes

In [38]:
n_errors = (ypred != y_test).sum()
print("Isolation Forest has {} errors.".format(n_errors))

## OneClassSVM

**OneClassSVM**
OneClassSVM (Support Vector Machine) is a machine learning algorithm used for anomaly detection and novelty detection tasks. It is a type of Support Vector Machine that is trained in an unsupervised manner, meaning it doesn't require labeled data with anomalies or novelties during training.



*   Here's how the OneClassSVM algorithm works:



1.  Training: The OneClassSVM algorithm aims to build a decision boundary that encapsulates the majority of the data points in a high-dimensional space. It learns to distinguish between normal data points and potential outliers.

2.  Support Vectors: OneClassSVM identifies support vectors, which are the data points closest to the decision boundary. These support vectors play a crucial role in defining the decision boundary and separating normal data points from anomalies.

3.  Anomaly Scoring: Once the model is trained, it can assign anomaly scores to new, unseen data points. Data points that lie outside the decision boundary or have a large distance from the support vectors are considered potential anomalies and receive higher anomaly scores.

4.  Thresholding: A threshold can be set on the anomaly scores to determine the cutoff point for classifying anomalies. Data points with anomaly scores above the threshold are considered anomalies.



*   OneClassSVM has a few key characteristics and use cases:



1.  Unsupervised Anomaly Detection: OneClassSVM is primarily used for unsupervised anomaly detection, where anomalies are detected in a dataset without explicitly labeled anomalies.

2.  Non-linear Decision Boundaries: OneClassSVM is capable of learning non-linear decision boundaries through the use of kernel functions, such as the radial basis function (RBF) kernel.

3.  Handling Imbalanced Data: OneClassSVM can handle imbalanced datasets where the majority of the data points are normal, and anomalies are rare.

4.  Novelty Detection: OneClassSVM can also be used for novelty detection, where the goal is to identify data points that differ significantly from the training data, even if they are not necessarily anomalies.
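
The training, scoring, and thresholding steps can be sketched on synthetic data. This is an illustration under stated assumptions: the toy points and variable names are made up, and the `gamma` and `nu` values simply mirror the ones used in this notebook. In scikit-learn, `decision_function` is positive inside the learned boundary and negative outside it, and `predict` thresholds it at zero.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# Synthetic training data (assumption): points drawn around the origin
train_points = rng.normal(loc=0.0, scale=1.0, size=(300, 2))

# New, unseen points: two far from the training cloud, two near its centre
far_points = np.array([[6.0, 6.0], [-7.0, 5.0]])
near_points = np.array([[0.0, 0.0], [0.2, -0.1]])

toy_svm = OneClassSVM(kernel='rbf', gamma=0.1, nu=0.05)
toy_svm.fit(train_points)

# Signed distance to the boundary: negative means outside (anomalous)
far_scores = toy_svm.decision_function(far_points)
near_scores = toy_svm.decision_function(near_points)

# predict() returns -1 for outliers and 1 for inliers
far_pred = toy_svm.predict(far_points)
near_pred = toy_svm.predict(near_points)
print("far points: ", far_scores, far_pred)
print("near points:", near_scores, near_pred)
```

The `nu` parameter is an upper bound on the fraction of training points allowed to fall outside the boundary, so it plays a role similar to an expected anomaly rate.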



In [40]:
svm = OneClassSVM(kernel='rbf', degree=3, gamma=0.1, nu=0.05,
                  max_iter=-1)

In [41]:
# OneClassSVM is unsupervised: fit() accepts y only for API consistency and ignores it
svm.fit(x_train, y_train)

In [42]:
ypred1= svm.predict(x_test)

#### As above, we map the predictions to 0 and 1

In [43]:
ypred1[ypred1 == 1] = 0
ypred1[ypred1 == -1] = 1

### Accuracy score and Matrix

In [44]:
# Evaluate the OneClassSVM predictions (ypred1), not the Isolation Forest ones (ypred)
print(accuracy_score(y_test, ypred1))

In [45]:
print(classification_report(y_test, ypred1))

In [46]:
from sklearn.metrics import confusion_matrix

In [47]:
confusion_matrix(y_test, ypred1)

In [49]:
n_errors = (ypred1 != y_test).sum()
print("SVM has {} errors.".format(n_errors))

## Solving the Problem Statement using the PyCaret Library (AutoML)

### PyCaret
PyCaret is a Python library that facilitates the end-to-end machine learning workflow. It provides a simplified interface for automating various steps in the machine learning process, including data preprocessing, feature selection, model training, hyperparameter tuning, model evaluation, and deployment.

  Here are some key features and benefits of PyCaret:

1.  Simplified API: PyCaret offers a high-level API that abstracts away the complexities of machine learning tasks, making it easier and faster to build and deploy models.

2.  Automated Preprocessing: PyCaret automates common data preprocessing tasks such as handling missing values, encoding categorical variables, feature scaling, and more.

3.  Automatic Feature Selection: PyCaret can automatically select relevant features from your dataset based on various techniques such as statistical methods, importance ranking, and domain-specific feature selection algorithms.

4.  Model Training and Tuning: PyCaret supports a wide range of classification, regression, clustering, and anomaly detection algorithms. It provides an easy-to-use interface for training models on your data and automatically tuning hyperparameters to optimize model performance.

5.  Model Comparison and Evaluation: PyCaret allows you to compare multiple models using various evaluation metrics and visualizations, enabling you to make informed decisions about which model to choose.

6.  Model Deployment: PyCaret provides functionality to save trained pipelines and deploy them, allowing you to integrate them into production systems or build APIs for making predictions.

7.  Experiment Logging and Reproducibility: PyCaret keeps track of all the steps in your machine learning workflow, including data preprocessing, model training, and hyperparameter tuning. This makes it easy to reproduce experiments and share them with others.

Overall, PyCaret aims to simplify and streamline the machine learning process, enabling users to quickly experiment with different algorithms, preprocess data efficiently, and deploy models with ease. It can be a valuable tool for data scientists, machine learning engineers, and researchers looking to accelerate their workflow and build robust machine learning models.

### Installing PyCaret

In [50]:
!pip install pycaret

In [51]:
df= pd.read_csv("creditcard.csv")

In [52]:
df.head()

In [53]:
from pycaret.classification import *

In [54]:
model= setup(data= df, target= 'Class')

In [55]:
compare_models()

In [56]:
random_forest= create_model('rf')

### As we can see, the model has a very good Kappa score, a metric that is more informative than accuracy on an imbalanced dataset

In [57]:
random_forest

### We can also tune the hyperparameters of our model

In [58]:
# tune_model expects the trained model object, not its name as a string
tuned_model = tune_model(random_forest)

## Predictions

In [59]:
pred_holdout = predict_model(random_forest,data= x_test)

In [60]:
pred_holdout