<h3>Financial Fraud Detection Using Machine Learning</h3>
<h4>Project by: team okcomputer</h4>
<h4>Author: Hari Govind V</h4>
<br><br><br>



<p>This notebook builds a machine learning model that detects fraudulent transactions. The model is trained on the <a href = 'https://www.kaggle.com/ntnu-testimon/paysim1'>PaySim dataset</a> and is built in three stages: exploratory data analysis, feature engineering, and training and evaluating a classifier.</p>
    



<h6>Importing The Libraries For Data Reading</h6>

In [2]:
import pandas as pd
import numpy as np
import pickle

In [214]:
df = pd.read_csv('hackathon.csv')

<p>The original <a href='https://www.kaggle.com/ntnu-testimon/paysim1'>PaySim dataset</a> is highly imbalanced, so I drew a random sample from it with pandas and created a new, balanced dataset (hackathon.csv), on which this model is trained.</p>
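<p>A minimal sketch of how a balanced sample could be drawn with pandas is shown below. It is not necessarily the exact procedure used to create hackathon.csv: the small 'full' dataframe is a hypothetical stand-in for the PaySim table, and the 'random_state' values are arbitrary.</p>

```python
import pandas as pd

# Hypothetical stand-in for the full PaySim table: 1000 rows, 2% fraud.
full = pd.DataFrame({
    "amount": range(1000),
    "isFraud": [1] * 20 + [0] * 980,
})

# Keep every fraud row.
fraud = full[full["isFraud"] == 1]

# Draw as many legitimate rows as there are fraud rows (undersampling).
legit = full[full["isFraud"] == 0].sample(n=len(fraud), random_state=42)

# Concatenate, shuffle, and rebuild a clean 0..n-1 index.
balanced = (
    pd.concat([fraud, legit])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["isFraud"].value_counts())
```

<p>When such a sample is written with 'to_csv', passing 'index=False' keeps the old index from being saved as an extra 'Unnamed: 0' column.</p>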

<h6>Basic EDA</h6>

<p>Checking the columns of the dataset</p>

In [215]:
df.head()

Unnamed: 0.1,Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
1,3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
2,251,1,TRANSFER,2806.0,C1420196421,2806.0,0.0,C972765878,0.0,0.0,1,0
3,252,1,CASH_OUT,2806.0,C2101527076,2806.0,0.0,C1007251739,26202.0,0.0,1,0
4,680,1,TRANSFER,20128.0,C137533655,20128.0,0.0,C1848415041,0.0,0.0,1,0


<p>Viewing the dataframe with a fresh 0-based index. The result of 'reset_index' is not assigned back, so 'df' itself is left unchanged.</p>

In [216]:
df.reset_index(drop=True)

Unnamed: 0.1,Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0
1,3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0
2,251,1,TRANSFER,2806.00,C1420196421,2806.00,0.00,C972765878,0.00,0.00,1,0
3,252,1,CASH_OUT,2806.00,C2101527076,2806.00,0.00,C1007251739,26202.00,0.00,1,0
4,680,1,TRANSFER,20128.00,C137533655,20128.00,0.00,C1848415041,0.00,0.00,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
16412,2596607,207,PAYMENT,18588.97,C264877418,393507.58,374918.61,M1037637931,0.00,0.00,0,0
16413,1740940,161,TRANSFER,94957.41,C1966589833,0.00,0.00,C37728871,827932.92,922890.33,0,0
16414,1107941,130,PAYMENT,4646.91,C821923409,336051.82,331404.91,M589073853,0.00,0.00,0,0
16415,5107980,355,DEBIT,2804.53,C1742662400,895.00,0.00,C1661446958,248929.59,251734.12,0,0


<p>Dropping the 'Unnamed: 0' column from the dataset.</p>

In [217]:
df.drop('Unnamed: 0', axis=1, inplace=True)

<h6>Data PreProcessing and Feature Engineering</h6>

In [218]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


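# Columns that are not used as model features
droplist = ['step', 'isFlaggedFraud', 'type', 'nameDest', 'nameOrig']


def data_preprocessing(data):
    # Work on a copy so the caller's dataframe is not modified
    data = data.copy()

    # Binary indicator columns for the two transaction types kept as features
    data['TRANSFER'] = (data['type'] == 'TRANSFER').astype(int)
    data['CASH_OUT'] = (data['type'] == 'CASH_OUT').astype(int)

    # Balance discrepancies on the sender's and the receiver's side
    data['errorBalanceOrig'] = data['newbalanceOrig'] + data['amount'] - data['oldbalanceOrg']
    data['errorBalanceDest'] = data['oldbalanceDest'] + data['amount'] - data['newbalanceDest']

    return data.drop(labels=droplist, axis=1)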

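<p>Here, the categorical column 'type' is partially one-hot encoded into two new binary columns, 'TRANSFER' and 'CASH_OUT'. All other transaction types are represented by both columns being 0.</p>
<p>Two further columns, 'errorBalanceOrig' and 'errorBalanceDest', keep track of the account balance error of the sender and the receiver, that is, the gap between the transaction amount and the actual change in each balance.</p>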
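<p>To make the two error features concrete, the sketch below applies the same formulas to two rows copied from the dataframe preview above: the first fraudulent TRANSFER (amount 181.00) and a legitimate TRANSFER (amount 94957.41). The dataframe name 'rows' is only used for this illustration.</p>

```python
import pandas as pd

# Two rows taken from the dataframe preview: one fraudulent, one legitimate.
rows = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER"],
    "amount": [181.00, 94957.41],
    "oldbalanceOrg": [181.00, 0.00],
    "newbalanceOrig": [0.00, 0.00],
    "oldbalanceDest": [0.00, 827932.92],
    "newbalanceDest": [0.00, 922890.33],
    "isFraud": [1, 0],
})

# Same formulas as in data_preprocessing.
rows["errorBalanceOrig"] = rows["newbalanceOrig"] + rows["amount"] - rows["oldbalanceOrg"]
rows["errorBalanceDest"] = rows["oldbalanceDest"] + rows["amount"] - rows["newbalanceDest"]

print(rows[["isFraud", "errorBalanceOrig", "errorBalanceDest"]].round(2))
```

<p>In the fraudulent row the sender's balance drops by exactly the amount, but the receiver's balance never increases, so the receiver-side error equals the full 181.00. In the legitimate row the receiver's balance grows by the amount, so the receiver-side error is zero up to floating-point rounding.</p>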

<h6>Model Evaluation Function</h6> 
<p>This function prints the model's evaluation details on the test data: F1-score, confusion matrix, accuracy score, and classification report.</p> 

In [219]:
def model_result(model, x_test, y_test):
    # Relies on the sklearn.metrics functions imported in the next cell
    y_pred = model.predict(x_test)
    print('F1-score :', f1_score(y_test, y_pred))
    print('Confusion_matrix : ')
    print(confusion_matrix(y_test, y_pred))
    print('accuracy_score : ', end='')
    print(accuracy_score(y_test, y_pred))
    print('classification_report')
    print(classification_report(y_test, y_pred))

<h6>Train Test Split and Model Training</h6> 
<p>Importing the functions for the train-test split and for the evaluation metrics.</p>

In [220]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, average_precision_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score
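
<p>'cross_val_score' is imported above but not called anywhere in this notebook. As a rough sketch of how it could give an estimate that depends less on one particular split, the example below runs 5-fold cross-validation on synthetic data. The data from 'make_classification', the names 'X_demo' and 'y_demo', and the hyperparameters are assumptions made for the illustration.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for the real features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# One F1-score per fold; each fold is held out once while the rest is used for training.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_demo,
    y_demo,
    cv=5,
    scoring="f1",
)

print("F1 per fold:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```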

<p>Sending the data for preprocessing - </p>

In [221]:
df = data_preprocessing(df)

<p>Checking the degree of balance in the dataset - </p>

In [222]:
df['isFraud'].value_counts()

1    8213
0    8204
Name: isFraud, dtype: int64

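<p>Separating the target column 'isFraud' from the features, standardizing the features, making a 70:30 train-test split, and training a random forest classifier. The scaler is fitted on all rows before the split, so the test rows also contribute to the scaling statistics - </p>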

In [223]:

X = df.drop('isFraud',axis=1)
Y = df['isFraud']
X = scaler.fit_transform(X) 

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, shuffle=True)
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier()
rf_model.fit(X_train,y_train)
model_result(rf_model,X_test,y_test)

F1-score : 0.9982046678635548
Confusion_matrix : 
[[2415    1]
 [   8 2502]]
accuracy_score : 0.9981729598051157
classification_report
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2416
           1       1.00      1.00      1.00      2510

    accuracy                           1.00      4926
   macro avg       1.00      1.00      1.00      4926
weighted avg       1.00      1.00      1.00      4926



<p>On the 30% test split, the model reaches an F1-score and an accuracy of about 0.998. The confusion matrix shows 1 false positive and 8 false negatives out of 4926 transactions, which is why precision and recall both round to 1.00 in the classification report.</p>
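
<p>The reported scores can be checked by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's layout, in which rows are the actual classes and columns the predicted classes, both ordered 0 then 1.</p>

```python
# Confusion matrix from the run above: [[TN, FP], [FN, TP]].
tn, fp = 2415, 1
fn, tp = 8, 2502

precision = tp / (tp + fp)   # share of predicted frauds that are real frauds
recall = tp / (tp + fn)      # share of real frauds that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

<p>The F1-score and accuracy computed this way agree with the values printed by the evaluation function.</p>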

<p>Pickling the model into a save file - </p>

In [224]:
filename = 'final_model.sav'
# The with-block closes the file once the model has been written
with open(filename, 'wb') as f:
    pickle.dump(rf_model, f)
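
<p>Because the features were standardized before training, the saved model expects scaled inputs, yet only the classifier is pickled here. One possible way to keep both steps together, shown as a sketch rather than as what this notebook does, is to wrap the scaler and the classifier in a scikit-learn 'Pipeline' and pickle that single object. The synthetic data and all variable names below are assumptions made for the illustration.</p>

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the preprocessed transaction features.
X_demo, y_demo = make_classification(n_samples=400, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# fit() learns the scaling statistics from the training rows only.
pipe.fit(X_tr, y_tr)

# Serialize to bytes for the demo; pickle.dump to a file opened with "wb" works the same way.
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)

# The restored pipeline scales raw inputs itself before predicting.
same = (restored.predict(X_te) == pipe.predict(X_te)).all()
print("restored pipeline reproduces predictions:", same)
```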

<h6>Model Testing</h6>
<p>Giving the model a custom transaction for prediction.</p>

In [24]:
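# Values in the column order used for the dataframe below:
# step, amount, nameOrig, oldbalanceOrg, newbalanceOrig, nameDest,
# oldbalanceDest, newbalanceDest, type
lst = [1,
       2806.00,
       'C1305486145',
       2806.00,
       0.00,
       'C553264065',
       26202.00,
       0.00,
       'CASH_OUT']

# Fraudulent rows from the dataset, kept for reference:
# 1  CASH_OUT  2806.00  C2101527076  2806.00  0.00  C1007251739  26202.00  0.00  1
# 1  CASH_OUT  181.00   C840083671   181.00   0.00  C38997010    21182.00  0.00

arr = np.array(lst).reshape(1, -1)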

def data_preprocessing(data):

    droplist = ['step', 'type', 'nameDest', 'nameOrig']

    # Disabled: flag nameOrig and nameDest values that start with 'C'
    # data['OrigC']=data['nameOrig'].apply(lambda x: 1 if str(x).find('C')==0 else 0)
    # data['DestC']=data['nameDest'].apply(lambda x: 1 if str(x).find('C')==0 else 0)

    # Create the new features for TRANSFER and CASH_OUT
    data['TRANSFER'] = data['type'].apply(lambda x: 1 if x == 'TRANSFER' else 0)
    data['CASH_OUT'] = data['type'].apply(lambda x: 1 if x == 'CASH_OUT' else 0)

    # Compute the errors in the account balances
    data['errorBalanceOrig'] = data.newbalanceOrig + data.amount - data.oldbalanceOrg
    data['errorBalanceDest'] = data.oldbalanceDest + data.amount - data.newbalanceDest

    data = data.drop(labels=droplist, axis=1)

    return data



df = pd.DataFrame(data = arr,
                  columns = ['step', 'amount', 'nameOrig','oldbalanceOrg', 'newbalanceOrig', 'nameDest', 
                  'oldbalanceDest', 'newbalanceDest', 'type'])

df['step'] = df['step'].astype(int)
df['amount'] = df['amount'].astype(float)
df['oldbalanceOrg'] = df['oldbalanceOrg'].astype(float)
df['newbalanceOrig'] = df['newbalanceOrig'].astype(float)
df['oldbalanceDest'] = df['oldbalanceDest'].astype(float)
df['newbalanceDest'] = df['newbalanceDest'].astype(float)    


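df = data_preprocessing(df)
# NOTE: scaling is skipped here, although the model was trained on scaled features
#df = scaler.transform(df)


# The with-block closes the file once the model has been loaded
with open('final_model.sav', 'rb') as f:
    model = pickle.load(f)

prediction = model.predict(df)
if prediction[0] == 1:
    print('fraud')
elif prediction[0] == 0:
    print('safe')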

fraud


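<p>For this transaction, whose amounts and balances match a CASH_OUT row labelled as fraud in the dataset, the model predicts 'fraud'. The input is passed to the model unscaled, whereas the model was trained on standardized features, so the scaler fitted during training should be applied before prediction for the result to be reliable.</p>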
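
<p>The test cell builds its input through 'np.array(lst)', which turns every value into a string because the list mixes numbers and text, and then casts the numeric columns back one by one. A sketch of an alternative, assuming the same column names and values, is to build the single-row dataframe from a dictionary so that each column keeps its own type. The name 'sample' is only used for this illustration.</p>

```python
import pandas as pd

# One transaction as a dictionary: column name -> value.
sample = pd.DataFrame([{
    "step": 1,
    "amount": 2806.00,
    "nameOrig": "C1305486145",
    "oldbalanceOrg": 2806.00,
    "newbalanceOrig": 0.00,
    "nameDest": "C553264065",
    "oldbalanceDest": 26202.00,
    "newbalanceDest": 0.00,
    "type": "CASH_OUT",
}])

# Numeric columns are already int/float, so no astype() calls are needed.
print(sample.dtypes)
```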

<p>A notebook by Hari Govind V</p>