# FRAUD DETECTION 

A classic problem in financial environments. We will look at classifying transactions as fraud or not fraud using the scikit-learn package.

This notebook uses a synthetic dataset found on Kaggle for educational purposes, because we all need practice :-).
I will break this process down into the following steps:

1. Data Exploration 
2. Data preparation and pre-processing 
3. Modelling 
4. Evaluation and testing 

We have the following fields within our dataset: 

transaction_id - identifier for each transaction  
user_id - identifier for each user  
transaction_amount - amount of each transaction  
transaction_type - how funds were exchanged, e.g. "payment" or "bank transfer"  
payment_mode - wallet, card, UPI etc.  
device_type - device the transaction was made from, e.g. iOS  
device_location - location of the device used to make the transaction  
account_age_days - age of the account in days  
transaction_hour - hour of the transaction in 24-hour notation  
previous_failed_attempts - if there were previous attempts to make fraudulent transactions  
avg_transaction_amount - average amount the account usually transacts  
is_international - whether the transaction is international  
ip_risk_score - a numerical value that quantifies the likelihood an IP address is involved in malicious activity, such as fraud, spam, or cyberattacks  
login_attempts_last_24h - number of login attempts to the account in the last 24 hours  
fraud_label - whether the transaction is fraud or not


## Data Exploration 

In [None]:
# First we import all of our dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Now we load our data
fraud = pd.read_csv('/Users/leta/Desktop/Data Science Career /Python/Python Projects/Fraud Detection /dataset/Digital_Payment_Fraud_Detection_Dataset.csv')


In [None]:
fraud.shape

In [None]:
#Let us take a look into the data we have 
fraud.head()


In [None]:
# In this dataset we have mostly numerical variables with a few categorical variables.
# Let us look at what fields we have
fraud.keys()


In [None]:
# How many unique values does each field have?
fraud.nunique()
# The fields with the most unique values are the ID fields, transaction amounts, account ages, average transaction amounts and IP risk scores.
# ID fields are just unique identifiers with no bearing on prediction, so we will drop them, especially since they have so many unique values.
# We will explore some of the other fields further to see their bearing on the predictive power of our models, using exploratory data analysis and data mining.

In [None]:
# Finally, let us see whether the data is balanced or not
fraud["fraud_label"].value_counts(normalize=True)
# Fraud - 6.52% of observations
# Non fraud - 93.48% of observations
# Our data is heavily skewed towards non-fraud transactions.
# This will inform our choice of performance metrics (probably balanced accuracy, recall and precision)
# as well as how we stratify the data when partitioning into train and test sets.


Our categorical variables (transaction_type, payment_mode, device_type and device_location) do not have many unique values.  
This informs which methods we can use to encode them for classification models that only accept numerical data.  
- A possible solution would be to use dummy variables. 

##  DATA PREPARATION AND PRE-PROCESSING 


In [None]:
# As we have seen above, the ID variables have no bearing on the prediction, so we will remove them.
fraud = fraud.drop(columns=["transaction_id", "user_id"])

In [None]:
# FEATURE ENGINEERING 
# Since we want to use scikit-learn binary classification methods, the simplest approach is to use dummy variables for the non-numeric variables
df1 = pd.get_dummies(data=fraud,  # the data we want to create dummy variables from
                     columns=["transaction_type", "payment_mode", "device_type", "device_location"],  # the non-numeric columns we will convert into dummies
                     dtype=int)  # turning dummies from True/False into binary 1/0
df1.head()

In [None]:
# Now we will look at the dimensions of the dataframe to see how many more predictors we have added
df1.shape # we have added 11 new columns, not too many 

## MODELLING 

In [None]:
# Now we can start modelling our data using sklearn.
# First we import our dependencies
from sklearn.linear_model import LogisticRegression  # this model acts as our baseline model for binary classification

In [None]:
# Now we will split the data into independent variables (x) and the dependent variable (y).
x =df1[[ 'transaction_amount', 'account_age_days',
       'transaction_hour', 'previous_failed_attempts',
       'avg_transaction_amount', 'is_international', 'ip_risk_score',
       'login_attempts_last_24h', 'transaction_type_Payment',
       'transaction_type_Transfer', 'transaction_type_Withdrawal',
       'payment_mode_Card', 'payment_mode_NetBanking', 'payment_mode_UPI',
       'payment_mode_Wallet', 'device_type_Android', 'device_type_Web',
       'device_type_iOS', 'device_location_Bangalore',
       'device_location_Chennai', 'device_location_Delhi',
       'device_location_Hyderabad', 'device_location_Mumbai']]
y = df1[['fraud_label']]

In [None]:
# Partitioning our data
# The data will be partitioned into train and test splits at a 70/30 proportion

from sklearn.model_selection import train_test_split  # function to split our data

x_train, x_test, y_train, y_test = train_test_split(x, y,  # our independent and dependent variables
                                                    stratify=y,  # keeping the proportion of fraud and not fraud equal in the train and test sets
                                                    random_state=123,  # setting a seed for reproducibility
                                                    test_size=0.3)  # 30% test size
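
To check that stratification does what we expect, the sketch below splits a small hypothetical label vector with the same 70/30 proportion and compares the fraud rate in each split. The names `toy_x`, `toy_y` and the 6% fraud rate are assumptions made for illustration only, not values taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60 fraud rows out of 1000 (6%), one dummy feature
toy_y = pd.Series([1] * 60 + [0] * 940, name="fraud_label")
toy_x = pd.DataFrame({"feature": np.arange(1000)})

tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, stratify=toy_y, test_size=0.3, random_state=123
)

# With stratify set, the fraud rate in both splits should match the full data
overall_rate = toy_y.mean()
train_rate = ty_train.mean()
test_rate = ty_test.mean()
print(f"overall={overall_rate:.4f} train={train_rate:.4f} test={test_rate:.4f}")
```

Running the same comparison on `y`, `y_train` and `y_test` is a quick sanity check before fitting any model.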


In [None]:
# Now we can fit the model
logreg = LogisticRegression()  # assigning our model to a variable we will call later
logreg.fit(x_train, y_train)  # fit the model to our training data

## EVALUATION AND TESTING 
We have trained our model on the train set.   
We can now use the metrics mentioned earlier (balanced accuracy, precision and recall) to judge our models.

**Precision** - of all the positives identified, how many were truly positive?  
**Recall** - of all possible positive instances, how many did the model catch? 

For fraud detection, we value recall as our main metric.   
This is because catching every fraudulent transaction does the least damage to our client base, whereas flagging a genuine transaction as fraud can be reversed with no harm to the user. 
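
As a small worked example of these two definitions, the sketch below computes precision and recall by hand from a confusion matrix and checks them against scikit-learn. The ten labels in `toy_true` and `toy_pred` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for ten transactions (1 = fraud, 0 = not fraud)
toy_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

precision_manual = tp / (tp + fp)  # 2 true frauds out of 4 flagged
recall_manual = tp / (tp + fn)     # 2 frauds caught out of 3 actual frauds

print(precision_manual, precision_score(toy_true, toy_pred))
print(recall_manual, recall_score(toy_true, toy_pred))
```

Here the model flags four transactions but only two are fraud, so precision is 2/4, and it catches two of the three real frauds, so recall is 2/3.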

In [None]:
y_pred = logreg.predict(x_test)  # making predictions on the test data


In [None]:
# Importing our evaluation metrics
from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_recall_curve, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay

In [None]:
accuracy_score(y_test, y_pred) #We have a very high accuracy score of 0.935


In [None]:
balanced_accuracy_score(y_test, y_pred)  # our balanced accuracy score is very low at 0.5

In [None]:
precision_score(y_test,y_pred) # precision is 0

In [None]:
recall_score(y_test,y_pred) #recall is 0

This shows the effect of the class imbalance in the dataset, which persists even after stratifying the split.     
Our model learned to predict non fraud (class 0) every time, which gives a high accuracy but a low balanced accuracy, precision and recall.  
Therefore we need to find another way to ensure we can identify the positive cases.  
We can apply a greater penalty to mistakes on the minority class so that the model is no longer rewarded for such a one-sided prediction.
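
To see why accuracy is misleading here, the sketch below scores a "model" that always predicts not fraud on a hypothetical label vector. The 65/935 split in `toy_true` is an assumption that roughly mirrors the imbalance in our data; it is not the actual test set.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_score, recall_score)

# Hypothetical labels: 65 fraud rows and 935 non-fraud rows
toy_true = np.array([1] * 65 + [0] * 935)
always_negative = np.zeros_like(toy_true)  # predict "not fraud" for everything

acc = accuracy_score(toy_true, always_negative)
bal_acc = balanced_accuracy_score(toy_true, always_negative)
rec = recall_score(toy_true, always_negative)
# No positives are predicted, so precision is undefined; zero_division=0 reports it as 0
prec = precision_score(toy_true, always_negative, zero_division=0)

print(acc, bal_acc, rec, prec)
```

Accuracy simply reproduces the share of the majority class, while balanced accuracy averages the recall of each class (1 for class 0, 0 for class 1) and lands at 0.5.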

In [None]:
logreg2 = LogisticRegression(class_weight="balanced")  # adding a penalty so the model cannot do well by predicting only one class accurately


In [None]:
logreg2.fit(x_train, y_train)

In [None]:
y_pred2 = logreg2.predict(x_test) 

In [None]:
balanced_accuracy_score(y_test, y_pred2)  # our model still does not have the best balanced accuracy at 0.525

In [None]:
precision_score(y_test, y_pred2)  # precision is also very low

In [None]:
recall_score(y_test, y_pred2)  # however, our recall is much higher, so we have caught more of our fraud cases

In [None]:
confusion_matrix_log = ConfusionMatrixDisplay.from_predictions(
    y_test,
    y_pred2
)
confusion_matrix_log.ax_.set_title("Confusion Matrix")  # set the title on the display object, not on the imported confusion_matrix function
plt.show()


In conclusion, even after adding an extra penalty to push the model to predict each class as well as possible, balanced accuracy and precision remain low.  
We will now try another classification model. 

In [None]:
# Now we will try to use a Random Forest Model 
from sklearn.ensemble import RandomForestClassifier

In [None]:
random_for = RandomForestClassifier(class_weight= "balanced")

In [None]:
random_for.fit(x_train, y_train)

In [None]:
y_pred_rando = random_for.predict(x_test)

In [None]:
balanced_accuracy_score(y_test, y_pred_rando)

In [None]:
precision_score(y_test, y_pred_rando)

In [None]:
recall_score(y_test, y_pred_rando)

In [None]:
confusion_matrix_rando = ConfusionMatrixDisplay.from_predictions(
    y_test,
    y_pred_rando
)
confusion_matrix_rando.ax_.set_title("Confusion Matrix")  # set the title on the display object, not on the imported confusion_matrix function
plt.show()

Both our models minimise error by predicting not fraud for almost all transactions, since the data is so skewed towards not fraud (approximately 94% / 6%).  
Because the data is this unbalanced, we have to apply a much larger penalty to make our models predict both classes more accurately.   
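
To make the size of the penalty concrete, the sketch below shows the weights that `class_weight="balanced"` corresponds to, computed with scikit-learn's `compute_class_weight` and by hand as `n_samples / (n_classes * class_count)`. The label vector `toy_y` and the explicit dictionary `{0: 1, 1: 50}` are hypothetical values used for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 65 fraud rows and 935 non-fraud rows
toy_y = np.array([1] * 65 + [0] * 935)

# Weights implied by class_weight="balanced"
balanced_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=toy_y
)

# The same weights by hand: n_samples / (n_classes * count of each class)
manual_weights = len(toy_y) / (2 * np.bincount(toy_y))
print(balanced_weights, manual_weights)

# A heavier, hand-picked penalty can be passed as a dictionary instead.
# The value 50 is an arbitrary illustration, not a tuned setting.
heavier_penalty_model = LogisticRegression(class_weight={0: 1, 1: 50})
```

With these toy counts the fraud class is weighted about 14 times more than the non-fraud class under `"balanced"`; a dictionary lets us go beyond that if recall on fraud is still too low.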

## MODEL WITH GREATER PENALTY 

In [180]:
# We now refit both models with balanced class weights and lower the decision threshold for flagging fraud.
Log = LogisticRegression(class_weight="balanced")  # "balanced" weights each class inversely to its frequency, so a mistake on the fraud class (class 1) costs more than a mistake on class 0
Random = RandomForestClassifier(class_weight="balanced")

In [183]:
algos = [Log, Random]
b_accuracy =[]
recall_metric = []
precision_metric = []
threshold = 0.1

for a in algos: 
    a.fit(x_train, y_train)
    predictions = (a.predict_proba(x_test)[:,1] >= threshold).astype(int)
    acc = balanced_accuracy_score(y_test, predictions)
    prec = precision_score(y_test, predictions)
    rec = recall_score(y_test, predictions)
    
    b_accuracy.append(acc)
    recall_metric.append(rec)
    precision_metric.append(prec)


  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  return fit_method(estimator, *args, **kwargs)
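
The convergence warning above suggests scaling the features and raising `max_iter`. A minimal sketch of one way to do this is shown below, using a pipeline on synthetic data from `make_classification`. The data, the `max_iter=1000` value and the name `scaled_logreg` are assumptions for illustration; the same pipeline could be fitted on `x_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for our data: roughly 6% positives
toy_x, toy_y = make_classification(
    n_samples=2000, n_features=10, weights=[0.94, 0.06], random_state=123
)
toy_x[:, 0] *= 10000  # mimic a large-valued column such as a transaction amount

# Standardise features first, then fit logistic regression with more iterations
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
scaled_logreg.fit(toy_x, toy_y)

# n_iter_ reports how many iterations the solver actually used
n_iter = scaled_logreg[-1].n_iter_[0]
print(n_iter)
```

Putting the scaler inside the pipeline means it is fitted on the training data only and then reused unchanged at prediction time.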


In [184]:
main_metrics =pd.DataFrame()
main_metrics["Description"] = [ "Balanced accuracy", "Recall", "Precision"]
main_metrics["Logistic Regression"] = [b_accuracy[0], recall_metric[0], precision_metric[0]]
main_metrics["Random Forest"] = [b_accuracy[1], recall_metric[1], precision_metric[1]]


main_metrics

| | Description | Logistic Regression | Random Forest |
|---|---|---|---|
| 0 | Balanced accuracy | 0.5 | 0.51952 |
| 1 | Recall | 1.0 | 0.22449 |
| 2 | Precision | 0.065333 | 0.078014 |


Even after using a lower threshold for classification, we still have a very low balanced accuracy for both models at around 50%, as well as a low recall score for Random Forest (22.4%) and a high recall for Logistic Regression (100%). 

Although the high recall for Logistic Regression may seem good, the model achieves it by over-predicting fraud, so we trade away balanced accuracy and precision. Too many transactions are being flagged as fraud. 
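
Rather than fixing a single threshold such as 0.1, one option is to scan all thresholds with `precision_recall_curve` and pick the one with the best precision among those that still reach a target recall. The sketch below does this on synthetic data. The data, the `target_recall` of 0.8 and the use of a separate validation split are all assumptions for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for our data: roughly 6% positives
toy_x, toy_y = make_classification(
    n_samples=5000, n_features=10, weights=[0.94, 0.06], random_state=123
)
tx_train, tx_val, ty_train, ty_val = train_test_split(
    toy_x, toy_y, stratify=toy_y, test_size=0.3, random_state=123
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(tx_train, ty_train)
scores = model.predict_proba(tx_val)[:, 1]  # predicted probability of fraud

precision, recall, thresholds = precision_recall_curve(ty_val, scores)

# precision and recall have one more entry than thresholds; drop the last point
target_recall = 0.8  # hypothetical business requirement
meets_target = recall[:-1] >= target_recall
best_idx = np.argmax(np.where(meets_target, precision[:-1], -1.0))
chosen_threshold = thresholds[best_idx]

print(chosen_threshold, precision[best_idx], recall[best_idx])
```

Choosing the threshold on a validation split and only then scoring the test set keeps the final test metrics honest.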