# Random Forest Implementation using scikit-learn

Here, we'll implement a random forest model to predict online payment fraud using the scikit-learn library.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler

## Load Data
We'll now load the online payment fraud detection dataset.

In [2]:
data = pd.read_csv('onlinefraud.csv')

data.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


## Preprocessing and Scaling

Here we'll preprocess the data as follows:
 - Encode the string column type using one-hot encoding so the model can consume it.
 - Remove nameOrig and nameDest since they are irrelevant for our current implementation, and remove isFlaggedFraud since it is not a predictive feature.
 - Scale the numerical features. Random forests do not require scaling, but it is good practice and keeps the features usable with other models.

In [3]:
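# One-hot encode the transaction type; drop_first removes one redundant dummy column
data = pd.get_dummies(data, columns=['type'], drop_first=True)

# Drop the account identifiers and the isFlaggedFraud column
data = data.drop(columns=['nameOrig', 'nameDest', 'isFlaggedFraud'])

# Separate the features from the target
X = data.drop(columns=['isFraud'])
y = data['isFraud']

# Standardize the numerical columns to zero mean and unit variance
numerical_cols = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
scaler = StandardScaler()
X[numerical_cols] = scaler.fit_transform(X[numerical_cols])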

X.head()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,-1.703042,-0.28156,-0.22981,-0.237622,-0.323814,-0.333411,False,False,True,False
1,-1.703042,-0.294767,-0.281359,-0.285812,-0.323814,-0.333411,False,False,True,False
2,-1.703042,-0.297555,-0.288654,-0.292442,-0.323814,-0.333411,False,False,False,True
3,-1.703042,-0.297555,-0.288654,-0.292442,-0.317582,-0.333411,True,False,False,False
4,-1.703042,-0.278532,-0.274329,-0.282221,-0.323814,-0.333411,False,False,True,False
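
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical five-row frame (`toy` and its values are made up for illustration). It assumes the dataset's fifth transaction type is `CASH_IN`, which is the category that would be dropped because it sorts first alphabetically.

```python
import pandas as pd

# Hypothetical toy frame; the CASH_IN category is an assumption about the dataset.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "CASH_IN", "DEBIT"],
    "amount": [100.0, 250.0, 75.0, 300.0, 20.0],
})

# get_dummies sorts the categories and, with drop_first=True, drops the first one.
# A row with every dummy set to False therefore represents the dropped category.
encoded = pd.get_dummies(toy, columns=["type"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Dropping one dummy avoids a perfectly redundant column; tree-based models would also work with all five columns kept.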


## Train Test Split

We'll split the dataset into training and testing sets, holding out 20% of the rows for testing.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
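
With so few fraud cases, a possible refinement is to stratify the split so both sets keep the same fraud rate, fix `random_state` for repeatability, and fit the scaler on the training rows only so that test-set statistics do not influence preprocessing. The sketch below illustrates this on hypothetical toy data; every `toy_*` name and value is an assumption, not part of the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the fraud dataset: 1,000 rows, 2% positives.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "amount": rng.exponential(scale=1000.0, size=1000),
    "oldbalanceOrg": rng.exponential(scale=5000.0, size=1000),
})
toy_y = pd.Series([1] * 20 + [0] * 980, name="isFraud")

# stratify keeps the fraud rate equal in both splits; random_state makes it repeatable.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# Fit the scaler on the training rows only, then reuse its statistics on the test rows.
toy_scaler = StandardScaler()
toy_X_train_scaled = toy_scaler.fit_transform(toy_X_train)
toy_X_test_scaled = toy_scaler.transform(toy_X_test)

print("Train fraud rate:", toy_y_train.mean())
print("Test fraud rate: ", toy_y_test.mean())
```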

## Training the random forest model

We'll now train our random forest model. Because the dataset contains very few fraud cases, we set class_weight to 'balanced' so that the model gives more weight to the minority (fraud) class.


In [6]:
rf_model = RandomForestClassifier(class_weight='balanced', n_jobs=-1)

rf_model.fit(X_train, y_train)

Parameter,Value
n_estimators,100
criterion,'gini'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True
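
To see what `class_weight='balanced'` does, the sketch below computes the weights for a hypothetical label vector (`toy_labels`, 990 legitimate and 10 fraudulent rows, made up for illustration). scikit-learn derives each class weight as `n_samples / (n_classes * class_count)`, so the rarer class receives the larger weight.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label vector: 990 legitimate (0) and 10 fraudulent (1) transactions.
toy_labels = np.array([0] * 990 + [1] * 10)
classes = np.array([0, 1])

# Weights as scikit-learn computes them for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=toy_labels)

# The same values computed by hand from the formula.
manual = len(toy_labels) / (len(classes) * np.bincount(toy_labels))

print(dict(zip(classes.tolist(), weights.tolist())))
```

In this toy case a fraud row counts about a hundred times as much as a legitimate row when the trees evaluate splits.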


## Evaluating the model

We'll now evaluate the model using accuracy, the confusion matrix, precision, recall, and F1-score.

In [7]:
y_pred = rf_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)

print("Confusion Matrix")
cm = confusion_matrix(y_test, y_pred)
print(cm)
print()

print("Classification Report: ")
print(classification_report(y_test, y_pred, target_names=['Not Fraud (0)', 'Fraud (1)']))

Accuracy:  0.9997013808776888
Confusion Matrix
[[1270815      17]
 [    363    1329]]

Classification Report: 
               precision    recall  f1-score   support

Not Fraud (0)       1.00      1.00      1.00   1270832
    Fraud (1)       0.99      0.79      0.87      1692

     accuracy                           1.00   1272524
    macro avg       0.99      0.89      0.94   1272524
 weighted avg       1.00      1.00      1.00   1272524
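
The fraud-class figures in the report can be reproduced by hand from the confusion matrix above, where rows are true classes and columns are predicted classes. The short sketch below does that arithmetic and also computes the accuracy of a baseline that labels every transaction as not fraud, which helps put the headline accuracy in context on data this imbalanced.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions.
tn, fp = 1270815, 17
fn, tp = 363, 1329
total = tn + fp + fn + tp

precision = tp / (tp + fp)   # share of flagged transactions that are really fraud
recall = tp / (tp + fn)      # share of fraud cases that the model catches
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

# Accuracy of a trivial baseline that predicts "not fraud" for every row.
baseline_accuracy = (tn + fp) / total

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"accuracy={accuracy:.6f} baseline_accuracy={baseline_accuracy:.6f}")
```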



From the output, we can conclude that:
 - The model almost never produces a false positive (flagging a legitimate transaction as fraudulent), as shown by its 99% precision on the fraud class.
 - However, its recall is still low at 79%, meaning that it misses roughly 1 in 5 fraudulent transactions.
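
One common way to trade some precision for higher recall is to lower the decision threshold applied to `predict_proba` instead of relying on the default cutoff. The sketch below illustrates the idea on synthetic data from `make_classification`; the data, the `demo_*` names, and the 0.3 threshold are all assumptions for illustration and say nothing about how the model above would respond.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data (about 5% positives) standing in for the fraud dataset.
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

demo_model = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
)
demo_model.fit(X_tr, y_tr)

# Estimated probability of the positive class for each test row.
fraud_proba = demo_model.predict_proba(X_te)[:, 1]

# Compare a 0.5 cutoff with a lower, illustrative cutoff of 0.3.
results = {}
for threshold in (0.5, 0.3):
    y_hat = (fraud_proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, y_hat, zero_division=0),
        recall_score(y_te, y_hat),
    )
    print(f"threshold={threshold}: precision={results[threshold][0]:.3f}, "
          f"recall={results[threshold][1]:.3f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases, while precision typically falls. The right cutoff depends on the relative cost of a missed fraud versus a false alarm.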