# Credit Card Fraud Detection With Classification Algorithms In Python

Fraud transactions or fraudulent activities are significant issues in many industries like banking, insurance, etc. Especially for the banking industry, credit card fraud detection is a pressing issue to resolve.

These industries suffer too much due to fraudulent activities towards revenue growth and lose customer’s trust. So these companies need to find fraud transactions before it becomes a big problem for them.  

Unlike the other machine learning problems, in credit card fraud detection the target class distribution is not equally distributed. It is popularly known as the class imbalance problem or unbalanced data issue.

This makes this problem even more challenging to solve.

So In this article, we will explain to you how to build credit card fraud detection using different machine learning classification algorithms.

Such as,

1. Decision trees algorithm

2. Random forest algorithm

You will also get an idea about the impact of unbalanced data on the model’s performance.

Let us give you the list of contents that we will discuss in the next few minutes. Just to give you a glimpse about the topics that you are going to learn from this article.

Let’s begin the discussion by understanding why we need to find fraudulent transactions/activities in any industry.

1. Why do we need to find fraud transactions?


   Fraud Detection Approaches
   
    
2. What is Credit Card Fraud Detection?


3. Understanding of Credit Card Dataset 



   Data Explorations
   
   
   
4. Credit Card Data Preprocessing


   Removing irrelevant columns/features



   Checking null or nan values



   Data Transformation



   Splitting dataset 



5. Building Credit Card Fraud Detection using Machine Learning algorithms


   Decision Tree Algorithm Overview


   Random Forest Algorithm Overview


6. Credit Card Fraud Detection with Decision Tree Algorithm


   Decision tree algorithm Implementation using python sklearn library


7. Credit Card Fraud Detection with Random Forest Algorithm


   Random forest algorithm Implementation using sklearn library


8. Why Accuracy not suitable for Data Imbalance Problems?


9. Suitable evaluation metrics for imbalanced data


   Decision Tree Classification model results


   Random Forest Classification model results


   AUC and ROC Curves


10. Model Improvement Using Sampling Techniques


   Applying Sampling Techniques 


   Decision tree classification after applying sampling techniques


   Random Forest Tree Classifier after applying the sampling techniques


11. Conclusion


# Why do we need to find fraud transactions?

For many companies, fraud detection is a big problem because they find these fraudulent activities after they experience high loss. 

Fraud activities happen in all  industries. We can't say only particular companies/industries suffer from these fraudulent activities or transactions. 

But when it comes to financial-related companies, this fraud transaction becomes more of an issue/problem.  So these companies want to detect fraud transactions before the fraud activities turn into significant damage to their company.

In the current generation, with high-end technology, still, on every 100 credit card transactions, 13% are falling into the fraudulent activities reported by the creditcards website.

A survey paper mentioned that in the year 1997, 63% of companies experienced one fraud in the past two years, and in another year 1999, 57% of companies experienced at least one fraud in the last one year. 

Here the point is not only fraud activities increase, but the way of doing scams also increases badly. 

Companies suffer from detecting fraud, and due to these fraudulent activities, many companies worldwide have lost billions of dollars yearly.

And one more thing, for any company, customer's trust is more important to achieve or reach some position in the business marketplace. If a company cannot find these fraudulent activities, companies lose customer's trust; then, they will suffer from customer churn.

Fraud Detection Approaches
So companies start to detect these fraud activities automatically by using smart technologies. 

First, companies hire few people only for the detection of these kinds of activities or transactions. But here they must and should be experts in this field or domain, and also the team should have knowledge of how frauds occur in particular domains. This requires more resources, such as people's effort and time.

Second, companies changed manual processes to rule-based solutions. But this one also fails most of the time to detect frauds. 

Because in the real world, the way of doing frauds is changing drastically day by day. These rule-based systems follow some rules and conditions. If a new fraud process is different from others, then these systems fail. It requires adding that new rule to code and execute. 

Now companies are trying to adopt Artificial Intelligence or machine learning algorithms to detect frauds. Machine learning algorithms performed very well for this type of problem.  

The payment gateway Stripe, for example — which can be integrated with the recurring payment provider Chargebee — uses an adaptive machine learning algorithm that evaluates risk in real-time and predicts whether a payment is likely to be fraudulent. 

# Fraud Detection Approaches

So companies start to detect these fraud activities automatically by using smart technologies. 

First, companies hire few people only for the detection of these kinds of activities or transactions. But here they must and should be experts in this field or domain, and also the team should have knowledge of how frauds occur in particular domains. This requires more resources, such as people's effort and time.

Second, companies changed manual processes to rule-based solutions. But this one also fails most of the time to detect frauds. 

Because in the real world, the way of doing frauds is changing drastically day by day. These rule-based systems follow some rules and conditions. If a new fraud process is different from others, then these systems fail. It requires adding that new rule to code and execute. 

Now companies are trying to adopt Artificial Intelligence or machine learning algorithms to detect frauds. Machine learning algorithms performed very well for this type of problem.  

The payment gateway Stripe, for example — which can be integrated with the recurring payment provider Chargebee — uses an adaptive machine learning algorithm that evaluates risk in real-time and predicts whether a payment is likely to be fraudulent. 

# What is Credit Card Fraud Detection?

In the above section, we discussed the need for identifying fraudulent activities. The credit card fraud classification problem is used to find fraud transactions or fraudulent activities before they become a major problem to credit card companies. 

It uses the combination of fraud and non-fraud transactions from the historical data with different people's credit card transaction data to estimate fraud or non-fraud on credit card transactions.

In this article, we are using the popular credit card dataset. Let’s understand the data before we start building the fraud detection models.

# Understanding of Credit Card Dataset

Before going to the model development part, we should have some knowledge about our dataset.
Such as 

1.What is the size of the dataset?

2.How many features does the dataset have?

3.What are the target values?

4.How many samples under each target value? , etc.

If we know some information about the dataset, then we can decide what we have to do?. 

What are the questions we discussed above, all  we can explore by using the python pandas library. 

Let's jump to the data exploration part to find answers to all questions we have.

Data Explorations
First, we need to load the dataset. After downloading the dataset, extract the data and keep the file in the dataset under the project folder. 

We can quickly load it using pandas.

In [1]:
import pandas as pd

# load dataset
fraud_df = pd.read_csv("creditcard.csv")

In [2]:
fraud_df

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


Dataset has 284807 rows and 31 features. The result of the shape variable is a tuple that has the number of rows, number of columns of the dataset.

We can see how the dataset looks like. The below command showcases  only five rows, head() by default, gives 5 samples. 

If you want to see more samples from the top, pass the number representing the number of samples you want to see like fraud_df.head(10). 

You can also see bottom samples by using the tail() function. Both are working in the same way.

We can get all the list of feature names.

In [6]:
print(f"Columns or Feature names :- \n {fraud_df.columns}")

Columns or Feature names :- 
 Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')


From this, we know Class is the target variable, and the remaining all are features of our dataset.

Let's see what are the unique values we are having for the target variable.

In [7]:
print(f"Unique values of target variable :- \n {fraud_df['Class'].unique()}")

Unique values of target variable :- 
 [0 1]


The target variable Class has 0 and 1 values. Here

1) 0 for non-fraudulent transactions

2) 1 for fraudulent transactions

Because we aim to find fraudulent transactions, the dataset's target value has a positive value for that. 

Still, What is pending in data exploration questions? 

yeah, we have to check how many samples each target class is having.

In [8]:
print(f"Number of samples under each target value :- \n {fraud_df['Class'].value_counts()}")

Number of samples under each target value :- 
 0    284315
1       492
Name: Class, dtype: int64


Yeah, we have 284315 non-fraudulent transaction samples & 492 fraudulent transaction samples.

We will discuss more about the data in the later sections of this article. 

You are going to know the variation of this number of samples and how much impact on the model's performance, how we can evaluate model performance for this data, etc.

Still, now you only know about the dataset, such

1. Dataset size

2. Number of samples(rows) and features(columns)

3. Names of the features

4. About target variables, etc.

Now we will discuss different data preprocessing techniques for our dataset. 

The data preprocessing techniques will be completely different from the text preprocessing techniques we discussed in the natural language processing data preprocessing techniques article 

# Credit Card Data Preprocessing

Preprocessing is the process of cleaning the dataset. In this step, we will apply different methods to clean the raw data to feed more meaningful data for the modeling phase. This method includes

1. Remove duplicates or irrelevant samples

2. Update missing values with the most relevant values 

3. Convert one data type to another example, categorical to integers, etc.

Okay, now we will spend a couple of minutes checking the dataset and applying corresponding techniques to clean data. 

This step aims to improve the quality of the data.

# Removing irrelevant columns/features

In our dataset, only one irrelevant or not useful feature id Time. So we can drop that feature from the dataset.

In [9]:
# make sure which features are useful & which are not
# we can remove irrelevant features
fraud_df = fraud_df.drop(['Time'], axis=1)
print(f"list of feature names after removing Time column :- \n{fraud_df.columns}")

list of feature names after removing Time column :- 
Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class'],
      dtype='object')


If you want to drop more features from data, call drop() method with a list of feature names. 

We can observe no feature name Time in the list of feature names after dropping the Time feature/column.

# Checking null or nan values

We can check the datatypes of all features and, at the same time, the number of non-null values of all features by using info() of pandas. 

Null or nan values are nothing, but there is no value for that particular feature or attribute.

For example, these nan or null values are coming if the customer or user does not fill all information in the forms. Blank values are treated as null or nan values. 

It's okay; we can know all this information just by using info() from pandas.

In [10]:
print(f"Dataset info :- \n {fraud_df.info()}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 30 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   V1      284807 non-null  float64
 1   V2      284807 non-null  float64
 2   V3      284807 non-null  float64
 3   V4      284807 non-null  float64
 4   V5      284807 non-null  float64
 5   V6      284807 non-null  float64
 6   V7      284807 non-null  float64
 7   V8      284807 non-null  float64
 8   V9      284807 non-null  float64
 9   V10     284807 non-null  float64
 10  V11     284807 non-null  float64
 11  V12     284807 non-null  float64
 12  V13     284807 non-null  float64
 13  V14     284807 non-null  float64
 14  V15     284807 non-null  float64
 15  V16     284807 non-null  float64
 16  V17     284807 non-null  float64
 17  V18     284807 non-null  float64
 18  V19     284807 non-null  float64
 19  V20     284807 non-null  float64
 20  V21     284807 non-null  float64
 21  V22     28

See the result of dataset info(); 

it provides all information about our dataset, such as 

1.Total number of samples or rows

2.Column names

3.Number of non-null values

4.The data type of each column

Our dataset doesn’t have any null values because the total number features are 284807 that ranges from 0-284806; all features have the same number of samples/rows.

# Data Transformation

Except for the Amount column, all column’s values are within some range of values. So let's change the Amount columns values to a smaller range of numbers. 

We can simply do this process by using StandardScaler from the sklearn library.

In [11]:
print(f"few values of Amount column :- \n {fraud_df['Amount'][0:4]}")

few values of Amount column :- 
 0    149.62
1      2.69
2    378.66
3    123.50
Name: Amount, dtype: float64


See the values of the Amount feature values are in high range compared to other feature values. 

We will change values within a smaller range.

In [12]:
# data preprocessing
from sklearn.preprocessing import StandardScaler
fraud_df['norm_amount'] = StandardScaler().fit_transform(
fraud_df['Amount'].values.reshape(-1,1))
fraud_df = fraud_df.drop(['Amount'], axis=1)
print(f"few values of Amount column after applying StandardScaler:- \n {fraud_df['norm_amount'][0:4]}")

few values of Amount column after applying StandardScaler:- 
 0    0.244964
1   -0.342475
2    1.160686
3    0.140534
Name: norm_amount, dtype: float64


The scalar result is added as a new column with norm_amount name to the data frame after we drop the Amount column because there is no use with it.

# Splitting dataset

Now we will take all independent columns (target column is dependent and the remaining all are independent columns to each other), as X and the target variable as y.

In [13]:
## Features and target creations
X = fraud_df.drop(['Class'], axis=1)
y = fraud_df[['Class']]

Now we need to split the whole dataset into train and test dataset. Training data is used at the time of building the model and a test dataset is used to evaluate trained models. 

By using the train_test_split method from the sklearn library we can do this process of splitting the dataset to train and test sets.

In [14]:
# splitting dataset to train & test dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(199364, 29)
(85443, 29)
(199364, 1)
(85443, 1)


Now our dataset is ready for building models. Let's jump to the development of  the model using machine learning algorithms such as decision tree and random forest classification algorithms from the sklearn module.

# Building Credit Card Fraud Detection using Machine Learning algorithms

Now we can build models using different machine learning algorithms. Before creating a model, we need to find the type of problem statement, which means is supervised or unsupervised algorithms. 

Our problem statement falls under the supervised learning problem means the dataset has a target value for each row or sample in the dataset. 

Supervised machine learning algorithms are two types 

1.Classification Algorithms

2.Regression Algorithms

Our problem statement belongs to what type of algorithms? 

Yeah, exactly.

Credit card fraud detection is a classification problem. Target variable values of Classification problems have integer(0,1) or categorical values(fraud, non-fraud). The target variable of our dataset ‘Class’ has only two labels - 0 (non-fraudulent) and 1 (fraudulent).

Before going further let us give an introduction for both decision tree classification and random forest classification. As in this article, we are going to use these two algorithms to build the credit card fraudulent activities identification model.

1.Decision Tree Classification Algorithm

2.Random Forest Classification Algorithm


# Decision Tree Algorithm Overview

The decision tree is the simplest and most popular classification algorithm. For building the model the decision tree algorithm considers all the provided features of the data and comes up with the important features.

Because of this advantage, the decision tree algorithms also used in identifying the importance of the feature metrics. Which used in handpicking the features. 

Once the important features identified then the model trains with the training data to come up with a set of rules. These rules used in predicting future cases or for the test dataset. 

This is a quick overview of the decision tree algorithm. If you want to learn more about the algorithm and implement in python, have a look at the below articles written by our team.

1.How the decision tree learns from the training data

2.Decision tree algorithm implementation in python

3.Implementing decision tree in R

4.How to visuvalizing the decsion tree

Now let’s see a quick overview of the random forest algorithm.

# Random Forest Algorithm Overview

The random forest algorithm falls under the ensemble learning algorithm category. In the random forest algorithm, we build N decision tree models.  

All the models predict the target value. Using the majority voting approach the final target value will be predicted.

For building the individual decision tree, the random forest algorithm randomly creates the sample dataset. These sample datasets are called as the bootstrap samples.

Suppose we want to build the N decision trees to create the forest, the algorithm first creates N bootstrap samples. Later for each bootstrap sample, one decision tree model will build.

This is a quick overview of the random forest algorithm, If you want to learn more, please have a look at the below articles.

1. How bootstrap samples created in Ensembleme learning methods.

2. End to end the working nature of the Random forest algorithm.

3. Implementing the Random forest algorithm in python.

Now let’s go to the implementation part, the crazy one 🙂

# Credit Card Fraud Detection with Decision Tree Algorithm

We will use the DecisionTreeClassifier class from the sklearn library to train and evaluate models. We use X_train and y_train data for training purposes. X_train is a training dataset with features, and y_train is the target label.

In [15]:
## Model with randomforest

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

def random_forest_classifier(X_train, y_train, X_test, y_test):
     # initialize object for DecisionTreeClassifier class
     rf_classifier = RandomForestClassifier(n_estimators=50)
     # train model by using fit method
     print("Model training starts........")
     rf_classifier.fit(X_train, y_train.values.ravel())
     acc_score = rf_classifier.score(X_test, y_test)
     print(f'Accuracy of model on test dataset :- {acc_score}')
     # predict result using test dataset
     y_pred = rf_classifier.predict(X_test)
     # confusion matrix
     print(f"Confusion Matrix :- \n {confusion_matrix(y_test, y_pred)}")
     # classification report for f1-score
     print(f"Classification Report :- \n {classification_report(y_test, y_pred)}")


# calling random_forest_classifier
random_forest_classifier(X_train, y_train, X_test, y_test)

Model training starts........
Accuracy of model on test dataset :- 0.9995435553526912
Confusion Matrix :- 
 [[85292     4]
 [   35   112]]
Classification Report :- 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     85296
           1       0.97      0.76      0.85       147

    accuracy                           1.00     85443
   macro avg       0.98      0.88      0.93     85443
weighted avg       1.00      1.00      1.00     85443



Wow, our decision tree classification gives 99% accuracy on test data. 

But why f1-score on label 1 too less ?. 

Remember this point; we will discuss these metrics performances in the coming section of this article where we address the question

Why the accuracy evaluation metric is not suitable for this problem?

# Credit Card Fraud Detection with Random Forest Algorithm

Same as the above decision tree implementation, we use X_train and y_train dataset for training purposes and X_test for evaluation. Here we train the ensemble technique model of RandomForestClassifier from the sklearn. We can see the variations in the evaluation results.

# Random forest algorithm Implementation using sklearn library

In [16]:
## Model with randomforest

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

def random_forest_classifier(X_train, y_train, X_test, y_test):
     # initialize object for DecisionTreeClassifier class
     rf_classifier = RandomForestClassifier(n_estimators=50)
     # train model by using fit method
     print("Model training starts........")
     rf_classifier.fit(X_train, y_train.values.ravel())
     acc_score = rf_classifier.score(X_test, y_test)
     print(f'Accuracy of model on test dataset :- {acc_score}')
     # predict result using test dataset
     y_pred = rf_classifier.predict(X_test)
     # confusion matrix
     print(f"Confusion Matrix :- \n {confusion_matrix(y_test, y_pred)}")
     # classification report for f1-score
     print(f"Classification Report :- \n {classification_report(y_test, y_pred)}")


# calling random_forest_classifier
random_forest_classifier(X_train, y_train, X_test, y_test)

Model training starts........
Accuracy of model on test dataset :- 0.9994967405170698
Confusion Matrix :- 
 [[85289     7]
 [   36   111]]
Classification Report :- 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     85296
           1       0.94      0.76      0.84       147

    accuracy                           1.00     85443
   macro avg       0.97      0.88      0.92     85443
weighted avg       1.00      1.00      1.00     85443



Wow, this model's accuracy is also 99% great, but what about remaining evaluation metrics such as precision, recall, F1-score. 

Let's discuss these variations why it happens, all these in the coming section.

# Why Accuracy not suitable for Data Imbalance Problems?

What was the reason for not applying or not considering accuracy as a performance metric for this specific problem?

Just take some time, think about it.

Model training is completed; we got accuracy on the test set as 99%. 

But why this section? 

We are having various classification evaluation metrics to quantify the performance of the build model, accuracy is one method in that. What other methods we can apply?

Now we will discuss our dataset and what are the best evaluation metrics for these kinds of problems.

For this discussion, we have to remember two things that are previously discussed.

1. The number of samples for each Class (target variable) value.

2. Evaluation metrics at both the decision tree and random forest classification models.

Do you remember the number of samples/rows for each target value? 

No? okay, let us check that number.

In [None]:
print(f"Number of samples under each target value :- \n {fraud_df['Class'].value_counts()}")

See the number of samples for Class-1 (fraudulent) less than the samples for class-0 (non-fraudulent). 

This kind of dataset is called unbalanced data. Which means one class label samples are  higher and dominating the other class label. 

For a balanced dataset, accuracy is suitable because we take the divided value of the correctly predicted samples count with the total number of samples for accuracy.

# Accuracy = number of correctly predicted samples / total number of samples

For example. 

If our dataset has 20 samples, out of that 2 for Class 0 & 18 for Class 1. Our trained model correctly predicted 17 samples out of 18 Class-1 samples and 0 samples out of 2 Class-0 samples. 

What is the accuracy value for this? 85%.

But this is not correct, right? Because the model doesn’t even predict one sample correctly for Class-0 samples, but we got 85% accuracy. 

For an unbalanced dataset, a list of evaluation metrics are available. In the next section, we will discuss this.

# Suitable evaluation metrics for imbalanced data

So which all metrics are suitable for unbalanced data?

We can use any of the below-mentioned metrics for unbalanced or skewed datasets.

1.Recall

2.Precision

3.F1-score

4.Area Under ROC curve.

We can see the huge difference among different evaluation metrics for both classifications (decision tree & random forest) models. 

Do you remember we mentioned at model development stage, accuracy, classification report, etc. ? 


Here we have to discuss a few terms and formulae related to confusion matrix, precision, recall & F1-score.

In [None]:
 #                  PREDICTED   PREDICTED
 #                  Positive    Negative
              
 #ACTUAL Positive   TP          FN
 #ACTUAL Negative   FP          TN  
 

1. True Positive (TP):-  
The number of positive labels correctly predicted by trained models.  This means the number of Class-1 samples correctly predicted as Class-1.


2. True Negative (TN):-
The number of negative labels correctly predicted by trained models.  This means the number of Class-0 samples correctly predicted as Class-0.


3. False Positive (FP):-  
The number of positive labels incorrectly predicted by trained models. This means the number of Class-1 samples incorrectly predicted as Class-0.


4. False Negative (FN):-  
The number of negative labels incorrectly predicted by trained models.  This means the number of Class-0 samples incorrectly predicted as Class-1.

Formulae

1.Recall = TP / (TP + FN)

2.Precision = TP / (TP + FP)

3.F1-Score = 2*P*R / (P + R) here P for Precision, R for Recall

Both classification models got accuracy scores as 99%. 

But when we observe the result of the classification report of both classifiers, f1-score for Class-0 got 100%, but for Class-1, F1-scores are significantly less. 

All these variations occur due to the unbalanced or skewed dataset. 

Why f1-score for class-0 100%? 

Because of the number of samples for class-0 (2 lakhs). The number of samples for Class-0 is very high than the Class-1 samples.

So what we need to do here is handle an unbalanced dataset. If you want to learn more about it, check the Best ways to handle unbalanced data in the machine learning article which explained various ways to handle the imbalanced data.

One more thing is left for discussion in this section, which is about areas under the ROC curve.

# AUC and ROC Curves

Area Under ROC curve is another evaluation metric for classification problems. This is mostly suitable for skewed datasets. It tells us about model performance, such as the model's capability to distinguish between target classes. 

The effective model has a higher Area Under the ROC curve value. Here we measure the ability of class separability of a model by using the Area Under ROC curve.

Good models have AUC value near to 1, and the worst models have AUC value near 0.

All the model performance methods help in the measuring the performance of the model based on the problem, but how to build the best models when we face with the data imbalance issue?

For that, we need to apply different sampling methods to the data before building the models.

Let’s see how sampling methods improve model performance, and how much AUC score for that model in the coming section.

# Model Improvement Using Sampling Techniques


Data sampling is the statistical method for selecting data points (here, the data point is a single row) from the whole dataset. In machine learning problems, there are many sampling techniques available.

Here we take undersampling and oversampling strategies for handling imbalanced data.  

What is this undersampling and oversampling?
Let us take an example of a dataset that has nine samples. 

1. Six samples belong to class-0,

2. Three samples belong to class-1

Oversampling = 6 class-0 samples x  2 times of class-1 samples of 3

Undersampling = 3 Class-1 samples x 3 samples from Class-0

Here what we are trying to do is the number of samples of both target classes to be equal. 

In the oversampling technique, samples are repeated, and the dataset size is larger than the original dataset.

In the undersampling technique, samples are not repeated, and the dataset size is less than the original dataset.

# Applying Sampling Techniques 

For undersampling techniques, we are checking the number of samples of both classes and selecting the smaller number and taking random samples from other class samples to create a new dataset.  

The new dataset has an equal number of samples for both target classes.

This is a whole process of undersampling, and now we are going to implement this entire process using python.

In [18]:
## Target class distribution
class_val = fraud_df['Class'].value_counts()
print(f"Number of samples for each class :- \n {class_val}")
non_fraud = class_val[0]
fraud = class_val[1]
print(f"Non Fraudulent Numbers :- {non_fraud}")
print(f"Fraudulent Numbers :- {fraud}")


Number of samples for each class :- 
 0    284315
1       492
Name: Class, dtype: int64
Non Fraudulent Numbers :- 284315
Fraudulent Numbers :- 492


The above is the target class distributions, now let's see how we can change this.



In [20]:
import numpy as np
## Equal both the target samples to the same level
# take indexes of non fraudulent
nonfraud_indexies = fraud_df[fraud_df.Class == 0].index
fraud_indices = np.array(fraud_df[fraud_df['Class'] == 1].index)
# take random samples from non fraudulent that are equal to fraudulent samples
random_normal_indexies = np.random.choice(nonfraud_indexies, fraud, replace=False)
random_normal_indexies = np.array(random_normal_indexies)

Here first, we take indexes of both classes and randomly choose Class-0 samples indexes that are equal to the number of Class-1 samples. 

In the below code snippet, Combine both classes indexes. Then we extract all features of gathered indexes.

In [21]:
## Equal both the target samples to the same level
# take indexes of non fraudulent
nonfraud_indexies = fraud_df[fraud_df.Class == 0].index
fraud_indices = np.array(fraud_df[fraud_df['Class'] == 1].index)
# take random samples from non fraudulent that are equal to fraudulent samples
random_normal_indexies = np.random.choice(nonfraud_indexies, fraud, replace=False)
random_normal_indexies = np.array(random_normal_indexies)


## Undersampling techniques

# concatenate both indices of fraud and non fraud
under_sample_indices = np.concatenate([fraud_indices, random_normal_indexies])

#extract all features from whole data for under sample indices only
under_sample_data = fraud_df.iloc[under_sample_indices, :]

# now we have to divide under sampling data to all features & target
x_undersample_data = under_sample_data.drop(['Class'], axis=1)
y_undersample_data = under_sample_data[['Class']]
# now split dataset to train and test datasets as before
X_train_sample, X_test_sample, y_train_sample, y_test_sample = train_test_split(
x_undersample_data, y_undersample_data, test_size=0.2, random_state=0)

The above code first divides features and targets as x_undersample_data and y_undersample_data and then splits new undersample data into train and test dataset.

Okay, now we will call both classifiers with these new under sampling train and test datasets.

# Decision tree classification after applying sampling techniques

In [22]:
## DecisionTreeClassifier after applying undersampling technique

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

def decision_tree_classification(X_train, y_train, X_test, y_test):
 # initialize object for DecisionTreeClassifier class
 dt_classifier = DecisionTreeClassifier()
 # train model by using fit method
 print("Model training start........")
 dt_classifier.fit(X_train, y_train.values.ravel())
 print("Model training completed")
 acc_score = dt_classifier.score(X_test, y_test)
 print(f'Accuracy of model on test dataset :- {acc_score}')
 # predict result using test dataset
 y_pred = dt_classifier.predict(X_test)
 # confusion matrix
 print(f"Confusion Matrix :- \n {confusion_matrix(y_test, y_pred)}")
 # classification report for f1-score
 print(f"Classification Report :- \n {classification_report(y_test, y_pred)}")
 print(f"AROC score :- \n {roc_auc_score(y_test, y_pred)}")

# calling decision tree classifier function 
decision_tree_classification(X_train_sample, y_train_sample, 
X_test_sample, y_test_sample)

Model training start........
Model training completed
Accuracy of model on test dataset :- 0.9187817258883249
Confusion Matrix :- 
 [[97  9]
 [ 7 84]]
Classification Report :- 
               precision    recall  f1-score   support

           0       0.93      0.92      0.92       106
           1       0.90      0.92      0.91        91

    accuracy                           0.92       197
   macro avg       0.92      0.92      0.92       197
weighted avg       0.92      0.92      0.92       197

AROC score :- 
 0.9190856313497823


# Random Forest Tree Classifier after applying the sampling techniques

In [24]:
## RandomForestClassifier after apply the undersampling techniques

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

def random_forest_classifier(X_train, y_train, X_test, y_test):
 # initialize object for DecisionTreeClassifier class
 rf_classifier = RandomForestClassifier(n_estimators=50)
 # train model by using fit method
 print("Model training start........")
 rf_classifier.fit(X_train, y_train.values.ravel())
 acc_score = rf_classifier.score(X_test, y_test)
 print(f'Accuracy of model on test dataset :- {acc_score}')
 # predict result using test dataset
 y_pred = rf_classifier.predict(X_test)
 # confusion matrix
 print(f"Confusion Matrix :- \n {confusion_matrix(y_test, y_pred)}")
 # classification report for f1-score
 print(f"Classification Report :- \n {classification_report(y_test, y_pred)}")
 # area under roc curve
 print(f"AROC score :- \n {roc_auc_score(y_test, y_pred)}")

random_forest_classifier(X_train_sample, y_train_sample, X_test_sample, y_test_sample)


Model training start........
Accuracy of model on test dataset :- 0.9644670050761421
Confusion Matrix :- 
 [[104   2]
 [  5  86]]
Classification Report :- 
               precision    recall  f1-score   support

           0       0.95      0.98      0.97       106
           1       0.98      0.95      0.96        91

    accuracy                           0.96       197
   macro avg       0.97      0.96      0.96       197
weighted avg       0.96      0.96      0.96       197

AROC score :- 
 0.9630935102633216


See, the results of the F1-score for both target values are 95%, and the Area Under ROC curve is near to 1. 

For the best models, we have the AUROC value near to 1. Here we implemented the undersampling technique; you can apply oversampling also like an undersampling process.

# Conclusion

Finally, our model gives 94% of the Area Under the ROC curve value. We can improve model results by adding more trees or applying additional data preprocessing techniques, etc. 

Not only decision trees or random forest classifiers suitable for this problem. You can try with other machine learning classification algorithms such as Support Vector Machines (SVM), k-nearest neighbors, etc.  to check how different algorithms are performed on classifying fraudulent activities