# Overview of Fraud Detection Project

With the prevalence of online payments, e-commerce platforms are plagued by payment fraud. Payment fraud is expensive and time-consuming for both customers and business owners.
Any company that processes credit card payments should therefore be aware of online fraud and invest in fraud prevention.


This project aims to develop a model that predicts the probability that a transaction on an e-commerce platform is fraudulent, based on anonymized transaction data from an e-commerce platform.
Main insights:
* The main challenge of this fraud detection task is that the dataset is highly imbalanced.
* interval_after_signup and the other time-derived features are highly predictive of fraudulent activity.

# Data Exploration

In [1]:
# !pip install imblearn
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score, roc_auc_score, roc_curve, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
import pandas_profiling
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 500)

In [2]:
fraud_data = pd.read_csv('data/imbalancedFraudDF.csv')

In [3]:
# Distribution of the label column
fraud_data['class'].value_counts()
# The dataset is highly imbalanced: fraud accounts for only about 1% of the transactions.

0    136961
1      1415
Name: class, dtype: int64
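
To put a number on the imbalance, the fraud share can be computed directly from the counts printed above. This is a minimal sketch with the two counts hard-coded from that output; in the notebook itself, `fraud_data['class'].value_counts(normalize=True)` should give the same ratio.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 136961, 1: 1415}

total = sum(class_counts.values())
fraud_rate = class_counts[1] / total

print(f"total transactions: {total}")    # 138376
print(f"fraud rate: {fraud_rate:.2%}")   # 1.02%
```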

In [4]:
fraud_data.head()

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class
0,22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0
1,333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0
2,150084,2015-04-28 21:13:25,2015-05-04 13:54:50,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0
3,221365,2015-07-21 07:09:52,2015-09-09 18:40:53,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0
4,159135,2015-05-21 06:03:03,2015-07-09 08:05:14,42,ALEYXFXINSXLZ,Ads,Chrome,M,18,2809315000.0,0


In [5]:
# the more an IP address is shared, the more suspicious
fraud_data['n_ip_shared'] = fraud_data.ip_address.map(fraud_data.ip_address.value_counts(dropna=False))

In [6]:
#Inline summary report about each feature
# pandas_profiling.ProfileReport(fraud_data)

In [7]:
print(fraud_data.user_id.nunique())
print(len(fraud_data.index))

# Every user_id appears exactly once (one transaction per user), so user-level aggregation is not possible.

138376
138376


# Feature Engineering

### Feature Creation: country
Create a new feature, country, from the ip_address column in fraud_data and the IP range boundaries in IpAddress_to_Country.csv.

In [8]:
ipToCountry = pd.read_csv('data/IpAddress_to_Country.csv')
ipToCountry.head()

Unnamed: 0,lower_bound_ip_address,upper_bound_ip_address,country
0,16777216.0,16777471,Australia
1,16777472.0,16777727,China
2,16777728.0,16778239,China
3,16778240.0,16779263,Australia
4,16779264.0,16781311,China


In [9]:
def find_country_by_ip(ip):
    """Return the country whose IP range contains ip, or 'NA' if there is no unique match."""
    met_cond = (ipToCountry['lower_bound_ip_address'] <= ip) & (ip <= ipToCountry['upper_bound_ip_address'])
    c = ipToCountry.loc[met_cond, 'country']
    return c.values[0] if len(c) == 1 else 'NA'

In [10]:
fraud_data['country'] = fraud_data['ip_address'].map(find_country_by_ip)
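
The row-by-row lookup above scans the whole range table once per transaction, which gets slow on large data. As a hedged alternative (not what this notebook ran), the sketch below does the same lookup with a binary search. It assumes the IP ranges do not overlap, and the helper name `lookup_countries` and the sample IPs are made up for illustration; the small table reuses the five rows shown in the `head()` output.

```python
import numpy as np
import pandas as pd

# The five ranges shown in ipToCountry.head() above.
ip_table = pd.DataFrame({
    "lower_bound_ip_address": [16777216.0, 16777472.0, 16777728.0, 16778240.0, 16779264.0],
    "upper_bound_ip_address": [16777471, 16777727, 16778239, 16779263, 16781311],
    "country": ["Australia", "China", "China", "Australia", "China"],
})

def lookup_countries(ips, table):
    """Vectorized range lookup; assumes the ranges in `table` do not overlap."""
    table = table.sort_values("lower_bound_ip_address").reset_index(drop=True)
    lower = table["lower_bound_ip_address"].to_numpy()
    upper = table["upper_bound_ip_address"].to_numpy()
    countries = table["country"].to_numpy()

    ips = np.asarray(ips, dtype=float)
    # Index of the last range whose lower bound is <= ip (-1 if there is none).
    idx = np.searchsorted(lower, ips, side="right") - 1
    safe_idx = np.clip(idx, 0, len(table) - 1)
    # The candidate range only matches if the ip is also below its upper bound.
    in_range = (idx >= 0) & (ips <= upper[safe_idx])
    return np.where(in_range, countries[safe_idx], "NA")

# Hypothetical IPs: two inside a range, one below all ranges, one above.
sample_ips = [16777300.0, 16778000.0, 5.0, 99999999.0]
result = lookup_countries(sample_ips, ip_table)
print(list(result))
```

Like `find_country_by_ip`, the sketch returns the string 'NA' for addresses that fall outside every range.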

### Time-related features transformation

In [11]:
fraud_data['interval_after_signup'] = (pd.to_datetime(fraud_data['purchase_time']) - pd.to_datetime(
        fraud_data['signup_time'])).dt.total_seconds()

fraud_data['signup_days_of_year'] = pd.DatetimeIndex(fraud_data['signup_time']).dayofyear

fraud_data['signup_seconds_of_day'] = pd.DatetimeIndex(fraud_data['signup_time']).second + 60 * pd.DatetimeIndex(
    fraud_data['signup_time']).minute + 3600 * pd.DatetimeIndex(fraud_data['signup_time']).hour

fraud_data['purchase_days_of_year'] = pd.DatetimeIndex(fraud_data['purchase_time']).dayofyear
fraud_data['purchase_seconds_of_day'] = pd.DatetimeIndex(fraud_data['purchase_time']).second + 60 * pd.DatetimeIndex(
    fraud_data['purchase_time']).minute + 3600 * pd.DatetimeIndex(fraud_data['purchase_time']).hour

fraud_data = fraud_data.drop(['user_id','signup_time','purchase_time'], axis=1)
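
The same time features can be written a little more compactly by parsing each timestamp column once and using the `.dt` accessor. The sketch below is an equivalent formulation rather than the notebook's code, checked against the first row of the raw data; the helper `seconds_of_day` is an illustrative name.

```python
import pandas as pd

# First row of the raw data shown in fraud_data.head() above.
df = pd.DataFrame({
    "signup_time": ["2015-02-24 22:55:49"],
    "purchase_time": ["2015-04-18 02:47:11"],
})

# Parse each timestamp column once and reuse the result.
signup = pd.to_datetime(df["signup_time"])
purchase = pd.to_datetime(df["purchase_time"])

def seconds_of_day(ts):
    """Seconds elapsed since midnight for a datetime Series."""
    return ts.dt.hour * 3600 + ts.dt.minute * 60 + ts.dt.second

features = pd.DataFrame({
    "interval_after_signup": (purchase - signup).dt.total_seconds(),
    "signup_days_of_year": signup.dt.dayofyear,
    "signup_seconds_of_day": seconds_of_day(signup),
    "purchase_days_of_year": purchase.dt.dayofyear,
    "purchase_seconds_of_day": seconds_of_day(purchase),
})
print(features.iloc[0].tolist())
```

For this row the values agree with the first row of the engineered table in the next cell (4506682.0, 55, 82549, 108, 10031).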

In [12]:
# check the new table after feature engineering
fraud_data.head()

Unnamed: 0,purchase_value,device_id,source,browser,sex,age,ip_address,class,n_ip_shared,country,interval_after_signup,signup_days_of_year,signup_seconds_of_day,purchase_days_of_year,purchase_seconds_of_day
0,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0,1,Japan,4506682.0,55,82549,108,10031
1,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0,1,United States,17944.0,158,74390,159,5934
2,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0,1,,492085.0,118,76405,124,50090
3,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0,1,United States,4361461.0,202,25792,252,67253
4,42,ALEYXFXINSXLZ,Ads,Chrome,M,18,2809315000.0,0,1,Canada,4240931.0,141,21783,190,29114


### Train and test data split


In [13]:
y = fraud_data['class']
X = fraud_data.drop(['class'], axis=1)

#split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
print("X_train.shape:", X_train.shape)
print("y_train.shape:", y_train.shape)

X_train.shape: (110700, 14)
y_train.shape: (110700,)


### Convert categorical features to numeric values

In [14]:
# training data conversion

X_train = pd.get_dummies(X_train, columns=['source', 'browser'])
X_train['sex'] = (X_train.sex == 'M').astype(int)

# frequency encoding
# the more a device is shared, the more suspicious
X_train['n_dev_shared'] = X_train.device_id.map(X_train.device_id.value_counts(dropna=False))

# the more an IP address is shared, the more suspicious
X_train['n_ip_shared'] = X_train.ip_address.map(X_train.ip_address.value_counts(dropna=False))

# the fewer visits from a country, the more suspicious
X_train['n_country_shared'] = X_train.country.map(X_train.country.value_counts(dropna=False))
X_train = X_train.drop(['device_id','ip_address','country'], axis=1)


# testing data conversion
X_test = pd.get_dummies(X_test, columns=['source', 'browser'])
X_test['sex'] = (X_test.sex == 'M').astype(int)

# the more a device is shared, the more suspicious
X_test['n_dev_shared'] = X_test.device_id.map(X_test.device_id.value_counts(dropna=False))

# the more an IP address is shared, the more suspicious
X_test['n_ip_shared'] = X_test.ip_address.map(X_test.ip_address.value_counts(dropna=False))

# the fewer visits from a country, the more suspicious
X_test['n_country_shared'] = X_test.country.map(X_test.country.value_counts(dropna=False))

X_test = X_test.drop(['device_id','ip_address','country'], axis=1)
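
The cell above encodes the test set from its own value counts and its own dummy columns. A common alternative, sketched here on toy data and not used in this notebook, is to learn the counts on the training set only and then align the test columns to the training columns. The frames `train` and `test` and their values are made up for illustration.

```python
import pandas as pd

# Toy frames standing in for X_train / X_test (values are made up).
train = pd.DataFrame({"browser": ["Chrome", "Safari", "Chrome"],
                      "device_id": ["A", "A", "B"]})
test = pd.DataFrame({"browser": ["Chrome", "Opera"],
                     "device_id": ["A", "C"]})

# Learn the frequency table on the training data only.
train_counts = train["device_id"].value_counts(dropna=False)

train_enc = pd.get_dummies(train, columns=["browser"])
train_enc["n_dev_shared"] = train["device_id"].map(train_counts)
train_enc = train_enc.drop(columns=["device_id"])

test_enc = pd.get_dummies(test, columns=["browser"])
# Devices never seen in training get a count of 0.
test_enc["n_dev_shared"] = test["device_id"].map(train_counts).fillna(0).astype(int)
# Align to the training columns: add missing dummies as 0, drop unseen ones.
test_enc = (test_enc.drop(columns=["device_id"])
                    .reindex(columns=train_enc.columns, fill_value=0))

print(test_enc)
```

Aligning the columns guards against the case where a category appears in only one of the two splits, which would otherwise leave the model with mismatched feature sets.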

In [15]:
X_train.head(20)

Unnamed: 0,purchase_value,sex,age,n_ip_shared,interval_after_signup,signup_days_of_year,signup_seconds_of_day,purchase_days_of_year,purchase_seconds_of_day,source_Ads,source_Direct,source_SEO,browser_Chrome,browser_FireFox,browser_IE,browser_Opera,browser_Safari,n_dev_shared,n_country_shared
8660,10,1,33,1,598553.0,19,14353,26,8106,0,0,1,1,0,0,0,0,1,8797
31055,86,0,33,1,4656631.0,35,24402,89,15433,1,0,0,0,0,0,1,0,1,3029
1788,35,1,37,1,3509728.0,77,43391,118,10719,1,0,0,0,0,1,0,0,1,11
128970,34,0,29,1,1102702.0,185,53345,198,32847,1,0,0,0,0,1,0,0,1,2134
3928,17,1,38,1,2241832.0,142,73917,168,69349,0,0,1,0,0,1,0,0,1,3029
21231,38,1,41,1,6921952.0,74,8949,154,18901,1,0,0,0,1,0,0,0,1,42547
17894,63,0,36,1,6321925.0,149,42305,222,57030,0,1,0,0,0,0,0,1,1,42547
66066,15,0,41,1,803410.0,165,64713,175,4123,0,0,1,0,1,0,0,0,1,42547
137967,33,1,25,2,1.0,8,81781,8,81782,1,0,0,0,0,0,0,1,2,42547
94479,36,0,39,1,7533319.0,180,12824,267,29343,1,0,0,1,0,0,0,0,1,42547


### Scale the data

In [16]:
#Compute the train minimum and maximum to be used for later scaling:
scaler = preprocessing.MinMaxScaler().fit(X_train[['n_dev_shared', 'n_ip_shared', 'n_country_shared']]) 
#print(scaler.data_max_)

#transform the training data and use them for the model training
X_train[['n_dev_shared', 'n_ip_shared', 'n_country_shared']] = scaler.transform(X_train[['n_dev_shared', 'n_ip_shared', 'n_country_shared']])

#apply the same scaler obtained from above on X_test
X_test[['n_dev_shared', 'n_ip_shared', 'n_country_shared']] = scaler.transform(X_test[['n_dev_shared', 'n_ip_shared', 'n_country_shared']])
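
Min-max scaling maps each value x to (x - min) / (max - min), with min and max taken from the training data. The sketch below works through that arithmetic by hand, assuming the raw n_dev_shared counts in the training set run from 1 to 6. That range is an assumption, though it is consistent with the evenly spaced values 0.0, 0.2, ..., 1.0 that n_dev_shared takes in the crosstab later in the notebook.

```python
import numpy as np

# Assumed raw training counts (1 to 6 devices shared); illustrative only.
train_counts = np.array([1, 2, 3, 4, 5, 6], dtype=float)

# "Fit": remember the training minimum and maximum.
lo, hi = train_counts.min(), train_counts.max()

def min_max(x):
    """Scale with the training min/max, as scaler.transform does after fit."""
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)

scaled_train = min_max(train_counts)
# A test value above the training maximum is scaled beyond 1.
scaled_test = min_max([1, 8])

print(scaled_train)
print(scaled_test)
```

Because the test set reuses the training minimum and maximum, scaled test values are not guaranteed to stay inside [0, 1].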


# Model Training

### Simple LogisticRegression

In [17]:
# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X_train,y_train)

# predict on test
y_pred=logreg.predict(X_test)

In [18]:
cm = confusion_matrix(y_test, y_pred)
cmDF = pd.DataFrame(cm, columns=['pred_0', 'pred_1'], index=['true_0', 'true_1'])
print(cmDF)

# Logistic regression with default parameters is not effective here: it does not flag a single transaction as fraud.

        pred_0  pred_1
true_0   27393       0
true_1     283       0


### Simple Random Forest

In [19]:
classifier_RF = RandomForestClassifier(random_state=0)

classifier_RF.fit(X_train, y_train)

# predict class labels 0/1 for the test set
predicted = classifier_RF.predict(X_test)

# generate class probabilities
probs = classifier_RF.predict_proba(X_test)

# generate evaluation metrics
print("%s: %r" % ("accuracy_score is: ", accuracy_score(y_test, predicted)))
print("%s: %r" % ("roc_auc_score is: ", roc_auc_score(y_test, probs[:, 1])))
print("%s: %r" % ("f1_score is: ", f1_score(y_test, predicted )))#string to int

print ("confusion_matrix is: ")
cm = confusion_matrix(y_test, predicted)
cmDF = pd.DataFrame(cm, columns=['pred_0', 'pred_1'], index=['true_0', 'true_1'])
print(cmDF)
print('recall =', float(cm[1,1])/(cm[1,0]+cm[1,1]))
print('precision =', float(cm[1,1])/(cm[1,1] + cm[0,1]))


# The random forest performs better, but it still misses almost half of the fraudulent transactions (recall of about 0.52). It raises no false alarms.

accuracy_score is: : 0.9951221274750687
roc_auc_score is: : 0.7650125080315714
f1_score is: : 0.6867749419953596
confusion_matrix is: 
        pred_0  pred_1
true_0   27393       0
true_1     135     148
recall = 0.5229681978798587
precision = 1.0
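
The reported metrics follow directly from the four cells of the confusion matrix. The sketch below recomputes them by hand from the counts printed above, which also shows why accuracy is a poor guide on data this imbalanced.

```python
# Cells of the confusion matrix printed above.
tn, fp = 27393, 0
fn, tp = 135, 148

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)          # share of actual frauds that were caught
precision = tp / (tp + fp)       # share of fraud predictions that were right
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.4f}")   # 0.9951
print(f"recall    = {recall:.4f}")     # 0.5230
print(f"precision = {precision:.4f}")  # 1.0000
print(f"f1        = {f1:.4f}")         # 0.6868
```

Accuracy stays above 99% even though 135 of the 283 frauds in the test set are missed, because the legitimate class dominates the total.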


## SMOTE sampling
Try to improve model performance by increasing the share of the minority class (fraud) with synthetic fraud samples. SMOTE is applied to the training set only; the test set is left untouched.

In [20]:
np.unique(y_train, return_counts=True)

(array([0, 1]), array([109568,   1132]))

In [21]:
smote = SMOTE(random_state=12)
x_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

unique, counts = np.unique(y_train_sm, return_counts=True)

print(np.asarray((unique, counts)).T)

[[     0 109568]
 [     1 109568]]
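
SMOTE creates each synthetic minority sample by picking a minority point, choosing one of its nearest minority-class neighbours, and interpolating at a random position between the two. The sketch below illustrates that idea for a single point on toy data. It is a simplified illustration, not imblearn's implementation, and the function name `synth_point` and the points are made up.

```python
import numpy as np

rng = np.random.default_rng(12)

# Toy minority-class points in two dimensions (made up for illustration).
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])

def synth_point(points, i, k, rng):
    """Interpolate between points[i] and one of its k nearest neighbours."""
    dists = np.linalg.norm(points - points[i], axis=1)
    # Skip position 0 of the sorted order: that is the point itself.
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    lam = rng.random()  # random position along the segment, in [0, 1)
    return points[i] + lam * (points[j] - points[i])

new_point = synth_point(minority, i=0, k=2, rng=rng)
print(new_point)
```

With k=2, the neighbours of the origin are (1, 0) and (0, 1), so the synthetic point lies on one of the two axes between 0 and 1.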


In [22]:
# RF on smoted training data
classifier_RF_sm = RandomForestClassifier(random_state=0)

classifier_RF_sm.fit(x_train_sm, y_train_sm)

# predict class labels for the test set
predicted_sm = classifier_RF_sm.predict(X_test)

# generate class probabilities
probs_sm = classifier_RF_sm.predict_proba(X_test)


# generate evaluation metrics
print("%s: %r" % ("accuracy_score_sm is: ", accuracy_score(y_test, predicted_sm)))
print("%s: %r" % ("roc_auc_score_sm is: ", roc_auc_score(y_test, probs_sm[:, 1])))
print("%s: %r" % ("f1_score_sm is: ", f1_score(y_test, predicted_sm )))#string to int

print ("confusion_matrix_sm is: ")
cm_sm = confusion_matrix(y_test, predicted_sm)
cmDF = pd.DataFrame(cm_sm, columns=['pred_0', 'pred_1'], index=['true_0', 'true_1'])
print(cmDF)
print ('recall or sens_sm =',float(cm_sm[1,1])/(cm_sm[1,0]+cm_sm[1,1]))
print ('precision_sm =', float(cm_sm[1,1])/(cm_sm[1,1] + cm_sm[0,1]))

# Compared with the simple RF above, training on SMOTE-resampled data does not help:
# the number of true positives is unchanged (148), while false alarms rise from 0 to 3.

accuracy_score_sm is: : 0.995013730307848
roc_auc_score_sm is: : 0.7584508383986572
f1_score_sm is: : 0.6820276497695853
confusion_matrix_sm is: 
        pred_0  pred_1
true_0   27390       3
true_1     135     148
recall or sens_sm = 0.5229681978798587
precision_sm = 0.9801324503311258


## Parameter tuning by GridSearchCV

In [23]:
# Eval metrics to be calculated for each combination of parameters and cv
scorers = {
    'precision_score': make_scorer(precision_score, zero_division=0),
    'recall_score': make_scorer(recall_score, zero_division=0),
    'f1_score': make_scorer(f1_score, pos_label=1, zero_division=0)
}

In [24]:
def grid_search_wrapper(model, parameters, refit_score='f1_score'):
    """
    fits a GridSearchCV classifier using refit_score for optimization(refit on the best model according to refit_score)
    prints classifier performance metrics
    """
#     skf = StratifiedKFold(n_splits=10)
#     grid_search = GridSearchCV(clf, param_grid, scoring=scorers, refit=refit_score,
#                            cv=skf, return_train_score=True, n_jobs=-1)
    grid_search = GridSearchCV(model, parameters, scoring=scorers, refit=refit_score,
                           cv=StratifiedKFold(5), return_train_score=True, n_jobs=-1)
    grid_search.fit(X_train, y_train)

    # make the predictions
    y_pred = grid_search.predict(X_test)
    y_prob = grid_search.predict_proba(X_test)[:, 1]
    
    print('Best params for {}'.format(refit_score))
    print(grid_search.best_params_)

    # confusion matrix on the test data.
    print('\nConfusion matrix optimized for {} on the test data:'.format(refit_score))
    cm = confusion_matrix(y_test, y_pred)
    cmDF = pd.DataFrame(cm, columns=['pred_0', 'pred_1'], index=['true_0', 'true_1'])
    print(cmDF)
    
    print("\t%s: %r" % ("roc_auc_score is: ", roc_auc_score(y_test, y_prob)))
    print("\t%s: %r" % ("f1_score is: ", f1_score(y_test, y_pred)))#string to int

    print ('recall = ', recall_score(y_test, y_pred))
    print ('precision = ', precision_score(y_test, y_pred))

    return grid_search


## Optimizing on f1_score on LR

In [25]:
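# C: inverse of regularization strength; smaller values specify stronger regularization
LRGrid = {"C" : np.logspace(-2,2,5), 
"penalty":["l1","l2"]}# l1: lasso, l2: ridge
# Note: the default lbfgs solver does not support the l1 penalty, so the l1 candidates fail to fit
# and cannot be selected; a solver such as 'liblinear' or 'saga' is needed to evaluate them.
#param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}
logRegModel = LogisticRegression(random_state=0)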

grid_search_LR_f1 = grid_search_wrapper(logRegModel, LRGrid, refit_score='f1_score')

Best params for f1_score
{'C': 0.01, 'penalty': 'l2'}

Confusion matrix optimized for f1_score on the test data:
        pred_0  pred_1
true_0   27393       0
true_1     283       0
	roc_auc_score is: : 0.7577997989994865
	f1_score is: : 0.0
recall =  0.0
precision =  0.0


## Optimizing on f1_score on RF

In [26]:
parameters = {        
'max_depth': [None, 5, 15],
'n_estimators' :  [10,150],
'class_weight' : [{0: 1, 1: w} for w in [0.2, 1, 100]]
}

clf = RandomForestClassifier(random_state=0)
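
A grid search fits one model per parameter combination and per cross-validation fold, so it helps to know the size of the grid before running it. The sketch below enumerates the combinations of the grid above with `itertools.product`; the fold count of 5 is taken from `StratifiedKFold(5)` in `grid_search_wrapper`.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "max_depth": [None, 5, 15],
    "n_estimators": [10, 150],
    "class_weight": [{0: 1, 1: w} for w in [0.2, 1, 100]],
}
n_folds = 5  # StratifiedKFold(5) in grid_search_wrapper

# Cartesian product of all parameter values.
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(combos) * n_folds

print(len(combos), "combinations,", total_fits, "cross-validation fits")
```

That comes to 18 combinations and 90 cross-validation fits, plus one final refit of the best combination on the full training set.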

In [27]:
grid_search_rf_f1 = grid_search_wrapper(clf, parameters, refit_score='f1_score')

Best params for f1_score
{'class_weight': {0: 1, 1: 1}, 'max_depth': None, 'n_estimators': 10}

Confusion matrix optimized for f1_score on the test data:
        pred_0  pred_1
true_0   27393       0
true_1     135     148
	roc_auc_score is: : 0.767921610573695
	f1_score is: : 0.6867749419953596
recall =  0.5229681978798587
precision =  1.0


In [28]:
best_rf_model_f1 = grid_search_rf_f1.best_estimator_
best_rf_model_f1

In [29]:
results_f1 = pd.DataFrame(grid_search_rf_f1.cv_results_)
results_sortf1 = results_f1.sort_values(by='mean_test_f1_score', ascending=False)
results_sortf1[['mean_test_precision_score', 'mean_test_recall_score', 'mean_test_f1_score', 'mean_train_precision_score', 'mean_train_recall_score', 'mean_train_f1_score','param_max_depth', 'param_class_weight', 'param_n_estimators']].round(3).head()


Unnamed: 0,mean_test_precision_score,mean_test_recall_score,mean_test_f1_score,mean_train_precision_score,mean_train_recall_score,mean_train_f1_score,param_max_depth,param_class_weight,param_n_estimators
13,0.998,0.522,0.685,1.0,0.999,1.0,,"{0: 1, 1: 100}",150
6,0.998,0.522,0.685,1.0,0.886,0.939,,"{0: 1, 1: 1}",10
9,0.998,0.522,0.685,1.0,0.522,0.686,5.0,"{0: 1, 1: 1}",150
3,0.998,0.522,0.685,1.0,0.522,0.686,5.0,"{0: 1, 1: 0.2}",150
5,0.998,0.522,0.685,1.0,0.549,0.709,15.0,"{0: 1, 1: 0.2}",150


## Insights Generation

In [30]:
# predictive factors

pd.DataFrame(best_rf_model_f1.feature_importances_, index = X_train.columns, columns=['importance']).sort_values('importance', ascending=False)

# interval_after_signup is by far the most predictive feature, followed by the time-derived purchase and signup features and n_ip_shared

Unnamed: 0,importance
interval_after_signup,0.323367
purchase_days_of_year,0.142157
n_ip_shared,0.098898
signup_seconds_of_day,0.078117
purchase_seconds_of_day,0.074009
signup_days_of_year,0.073908
n_dev_shared,0.048776
age,0.042781
purchase_value,0.042685
n_country_shared,0.030178


In [31]:
trainDF = pd.concat([X_train, y_train], axis=1)
pd.crosstab(trainDF["n_dev_shared"],trainDF["class"])

# insight 1: the larger n_dev_shared (the more transactions share a device), the higher the fraud rate

class,0,1
n_dev_shared,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,104853,457
0.2,4509,369
0.4,171,189
0.6,29,87
0.8,6,24
1.0,0,6
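
The trend is easier to see as a fraud rate per level. The sketch below recomputes the rates by hand from the crosstab counts printed above.

```python
# (scaled n_dev_shared, legit count, fraud count) rows from the crosstab above.
crosstab_rows = [
    (0.0, 104853, 457),
    (0.2, 4509, 369),
    (0.4, 171, 189),
    (0.6, 29, 87),
    (0.8, 6, 24),
    (1.0, 0, 6),
]

# Fraud rate per level = fraud / (legit + fraud).
fraud_rates = {level: fraud / (legit + fraud) for level, legit, fraud in crosstab_rows}

for level, rate in fraud_rates.items():
    print(f"n_dev_shared = {level:.1f}: fraud rate = {rate:.1%}")
```

The rate climbs from about 0.4% at the lowest level to 52.5% at 0.4 and 100% at the highest level, although the top levels contain only a handful of transactions.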


In [32]:
fraud_data.groupby("class")[['interval_after_signup']].mean()
# action velocity (time between consecutive user actions)

# insight 2: the mean interval_after_signup of fraudulent transactions is about half that of legitimate ones

Unnamed: 0_level_0,interval_after_signup
class,Unnamed: 1_level_1
0,5191179.0
1,2570226.0


In [33]:
fraud_data.groupby("class")[['interval_after_signup']].median()
# insight 3: the median for fraud is 1 second, so at least half of the fraudulent purchases happened within 1s of signing up

Unnamed: 0_level_0,interval_after_signup
class,Unnamed: 1_level_1
0,5194911.0
1,1.0


# Conclusion

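After trying a default logistic regression, a default random forest, a random forest trained on SMOTE-resampled data, and a random forest tuned with GridSearchCV to optimize the F1 score, I found that the tuned random forest performed best, although only marginally. On the test set it produced the same confusion matrix as the default random forest, with a slightly higher ROC AUC (0.768 versus 0.765): it identified about 52% of the fraudulent transactions with zero false alarms, for an F1 score of 0.69.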

##### Insights gained:
interval_after_signup and the other time-derived features are highly predictive of fraudulent activity.
1. The more transactions share the same device, the higher the chance of fraud.
2. The interval between signup and purchase is much shorter for fraudulent transactions than for legitimate ones.
3. At least half of the fraudulent purchases happen within 1 second of signing up.