Nov. 2018
Lu Zhang
The goal of this challenge is to build a machine learning model that predicts the probability that the first transaction of a new user is fraudulent.
- Data Preprocessing
- Exploratory Data Analysis & Feature Engineering
- Model Training and Selection
- Hyperparameter Tuning
- Findings & Suggestion
For this analysis, I will use `pandas` for data manipulation, `matplotlib` for plotting, and `sklearn` for machine learning.
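A minimal setup sketch for the libraries named above:

```python
# Core libraries for this analysis; specific scikit-learn modules are
# imported in the sections where they are used.
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
```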
1.1 Data loading and browsing
1.2 Data preparation: Join dataframes
1.3 Data cleansing
1.4 Data transformation: Transform time-series variable
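A hedged sketch of steps 1.1-1.4. The file names (`Fraud_Data.csv`, `IpAddress_to_Country.csv`) and column names are my assumptions about the challenge's usual layout, not confirmed by this write-up:

```python
import pandas as pd

# 1.1 Load the raw data (file names assumed).
transactions = pd.read_csv('Fraud_Data.csv')
ip_to_country = pd.read_csv('IpAddress_to_Country.csv')

# 1.2 Join: map each transaction's ip_address to a country via the IP-range
# table (assumed columns: lower_bound_ip_address, upper_bound_ip_address,
# country).
def lookup_country(ip):
    match = ip_to_country[
        (ip_to_country['lower_bound_ip_address'] <= ip)
        & (ip_to_country['upper_bound_ip_address'] >= ip)
    ]
    return match['country'].iloc[0] if len(match) else 'Unknown'

transactions['country'] = transactions['ip_address'].apply(lookup_country)

# 1.3 / 1.4 Cleanse and transform the time-series columns (names assumed).
for col in ['signup_time', 'purchase_time']:
    transactions[col] = pd.to_datetime(transactions[col])
```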
- I performed EDA against the target, checked the relationship between each variable and the response, and generated features as needed.
2.1 Transaction Attributes:
- `age`, `country`: Categorical features with too many levels. I binned them to reduce model complexity.
- `purchase_value`: Fraudsters tend to place many orders with identical `purchase_value`. The feature `order_cnt` was generated to count how many orders with the same value a customer places (see the sketch below).
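A hedged sketch of the binning and the `order_cnt` feature; the bin edges and the top-20 cutoff are illustrative assumptions:

```python
# Bin age into coarse groups (bin edges are illustrative).
transactions['age'] = pd.cut(transactions['age'], bins=[0, 25, 35, 50, 100])

# Bin the long tail of country levels into a single 'Other' bucket
# (cutoff of 20 is illustrative).
top_countries = transactions['country'].value_counts().nlargest(20).index
transactions['country'] = transactions['country'].where(
    transactions['country'].isin(top_countries), 'Other'
)

# order_cnt: for each transaction, how many orders in the data share the
# same purchase_value (fraud rings tend to repeat identical amounts).
transactions['order_cnt'] = (
    transactions.groupby('purchase_value')['purchase_value'].transform('count')
)
```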
2.2 Transaction Digital Footprints:
- `source`: Fraudulent transactions happened more often via direct visits.
- `browser`: Fraudsters use Chrome more often than other browsers.
2.3 Times:
- To take advantage of the time-series variables, we can create `day_of_week` and `hour_of_day` features for signup and purchase respectively.
- Fraudsters are more likely to place an order immediately after signing up, without browsing the website's content, so we can generate the feature `signup_purchase_delta` to capture this information (sketched below).
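A sketch of the time features, assuming the datetime columns from section 1:

```python
# day_of_week / hour_of_day for both signup and purchase times.
for prefix in ['signup', 'purchase']:
    ts = transactions[f'{prefix}_time']
    transactions[f'{prefix}_day_of_week'] = ts.dt.dayofweek
    transactions[f'{prefix}_hour_of_day'] = ts.dt.hour

# Seconds between signup and purchase; very small deltas suggest bots.
transactions['signup_purchase_delta'] = (
    transactions['purchase_time'] - transactions['signup_time']
).dt.total_seconds()
```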
2.4 Transaction Frequencies:
- Fraudsters may use the same `device_id` or the same `ip_address` for multiple transactions. More than half of the users who signed up with the same device more than once are fraudulent, so we can generate the features `signup_anomaly` and `ip_anomaly` to monitor users' signup and purchase behavior (sketched below).
- We can also generate `{count}_signup_last_{time_window}_by_{device,ip}` features to measure the frequency of user transactions.
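A minimal sketch of the anomaly counts; the rolling time-window counts would follow the same groupby pattern, but their exact window logic is not recorded here:

```python
# How many accounts share each device / IP; counts above 1 are suspicious.
transactions['signup_anomaly'] = (
    transactions.groupby('device_id')['device_id'].transform('count')
)
transactions['ip_anomaly'] = (
    transactions.groupby('ip_address')['ip_address'].transform('count')
)
```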
First, we'll split the variables into features (X) and target (y), and split the data into a training (70%) and test (30%) set.
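A sketch of the split, assuming the target column is named `class` and the categorical features have already been encoded:

```python
from sklearn.model_selection import train_test_split

X = transactions.drop(columns=['class'])  # 'class' is the assumed target name
y = transactions['class']

# Stratify so both sets keep the (rare) fraud rate of the full data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```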
3.1 Metrics for Model Comparison
- Precision, recall, F1-score: since our data is highly imbalanced, we should consider both precision and recall rather than accuracy alone.
- ROC plot and AUC
- Profit curve
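A hedged evaluation helper covering the first two metric groups (a profit curve would additionally require per-transaction cost/benefit assumptions, which are not stated here):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, roc_auc_score, roc_curve

def evaluate(clf, X_test, y_test):
    """Print precision/recall/F1 and AUC, and draw the ROC curve."""
    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]
    print(classification_report(y_test, y_pred))
    print('AUC:', roc_auc_score(y_test, y_prob))
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    plt.plot(fpr, tpr)
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.show()
```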
3.2 Model building preparation
3.3 Logistic regression (with regularization)
3.4 Random forest
3.5 Gradient boosting tree
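A minimal sketch of fitting the three candidate models from 3.3-3.5; the hyperparameters here are illustrative defaults, not the ones actually used:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    'Logistic Regression': LogisticRegression(penalty='l2', C=1.0,
                                              max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, n_jobs=-1,
                                            random_state=42),
    'Gradient Boosting Tree': GradientBoostingClassifier(random_state=42),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name)
    evaluate(clf, X_test, y_test)  # helper from section 3.1
```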
- Model comparison:
| Model | Accuracy | Precision | Recall | F1-score | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.9570 | 0.9988 | 0.5252 | 0.6884 | 0.8260 |
| Random Forest | 0.9570 | 0.9988 | 0.5252 | 0.6884 | 0.8283 |
| Gradient Boosting Tree | 0.9569 | 0.9942 | 0.5264 | 0.6887 | 0.8293 |
- Without any tuning, the gradient boosting tree performs best on these metrics: it has the highest recall and the highest AUC on the ROC plot.
3.6 Down-sampling:
- All three models show very high precision but poor recall, because the dataset is imbalanced.
- Models built on imbalanced data tend to be biased toward the majority class, treating the minority class as noise.
- Evaluation metrics suffer as well: accuracy can look very high while the minority class is simply ignored, making such classifiers unreliable.
- Methods for dealing with imbalanced data include down-sampling and over-sampling. Here I applied down-sampling (sketched after the table) and got the model performance below.
| Model | Accuracy | Precision | Recall | F1-score | AUC |
|---|---|---|---|---|---|
| Random Forest | 0.8179 | 0.9312 | 0.6829 | 0.7879 | 0.8349 |
| Gradient Boosting Tree | 0.8177 | 0.9333 | 0.6808 | 0.7873 | 0.8311 |
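A minimal down-sampling sketch, assuming label 1 marks fraud; it shrinks the majority class to match the minority class before refitting:

```python
import pandas as pd

train = X_train.assign(label=y_train)
fraud = train[train['label'] == 1]
legit = train[train['label'] == 0].sample(n=len(fraud), random_state=42)

# Recombine and shuffle the balanced training set.
balanced = pd.concat([fraud, legit]).sample(frac=1, random_state=42)
X_bal, y_bal = balanced.drop(columns=['label']), balanced['label']
```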
Down-sampling costs some precision and accuracy, but substantially boosts recall on this imbalanced dataset.
I adopted `GridSearchCV` from scikit-learn for hyperparameter tuning. It runs through each combination of the search parameters and compares them using the `roc_auc` scoring method (the search itself is sketched after the table). Models with the best performance:
| Model | Accuracy | Precision | Recall | F1-score | AUC |
|---|---|---|---|---|---|
| Random Forest | 0.8185 | 0.9331 | 0.6826 | 0.7886 | 0.8359 |
| Gradient Boosting Tree | 0.8188 | 0.9374 | 0.6796 | 0.7880 | 0.8353 |
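For reference, a sketch of the grid search over the random forest; the parameter grid is illustrative (the actual search space isn't recorded here), and `scoring='roc_auc'` is the string scorer corresponding to `roc_auc_score`:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_leaf': [1, 2],
}
search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_grid,
    scoring='roc_auc',
    cv=5,
)
search.fit(X_bal, y_bal)  # down-sampled training set from 3.6
print(search.best_params_, search.best_score_)
```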
The random forest classifier performs slightly better after tuning.
Save the random forest classifier with the parameters below as the final model to disk:
`RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=20, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=2, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=-1, oob_score=False, random_state=None, verbose=0, warm_start=False)`
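A sketch of persisting this configuration with `joblib` (older scikit-learn versions exposed it as `sklearn.externals.joblib`); the file name is illustrative:

```python
import joblib
from sklearn.ensemble import RandomForestClassifier

# Refit the tuned configuration (non-default parameters from above) on the
# balanced training data, then save it.
final_model = RandomForestClassifier(
    n_estimators=200, max_depth=20, min_samples_leaf=2, n_jobs=-1
).fit(X_bal, y_bal)
joblib.dump(final_model, 'fraud_model.pkl')
```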
5.1 Findings - Fraud characteristics
- Fraudsters sign up on the same device multiple times.
- Fraudsters are careful about their IP address (they may use a VPN); signing up multiple times from the same IP address can simply happen to normal users.
- Fraudsters make a purchase immediately after signing up.
- Fraudsters place many orders with identical purchase values.
- Fraudsters are more likely to visit the website directly rather than arriving via search or ads.
- Fraudsters are more likely to use Chrome than other browsers.
5.2 Suggestions
- Limit the purchase amount per product per day.
- Ask customers to fill in a CAPTCHA when they purchase for the first time.
- Limit the number of signups per device, in case fraudsters sign up multiple times within a short period.
- Increase the chance of showing a CAPTCHA for user validation if the user visits the website directly or uses Chrome.
Evaluate the model on unseen data by feeding in test_data.csv. Run the following steps; we can get the predictions' accuracy at the end.
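A hedged sketch of this evaluation step; it assumes test_data.csv already contains the engineered features and the true label column `class`:

```python
import joblib
import pandas as pd

model = joblib.load('fraud_model.pkl')  # final model saved above
test_data = pd.read_csv('test_data.csv')

X_new = test_data.drop(columns=['class'])
y_new = test_data['class']
print('Accuracy:', model.score(X_new, y_new))  # fraction predicted correctly
```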