# Overview of Fraud Detection Project

With the prevalence of online payments, e-commerce platforms are plagued by payment fraud. Payment fraud is expensive and time-consuming for both customers and business owners.
Any company that processes credit card payments should therefore be aware of online fraud and invest in fraud prevention.


This project aims to develop a model that predicts the probability that a transaction on an e-commerce platform is fraudulent, based on anonymized transaction data from an e-commerce platform.
Main insights:
* The main challenge of this fraud detection task is that the dataset is highly imbalanced.
* interval_after_signup and the time-related aggregate features are highly predictive of fraudulent activity.

# Data Exploration

In [1]:
# !pip install imblearn
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score, roc_auc_score, roc_curve, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 500)




In [2]:
# Import and store dataset
fraud_data = pd.read_csv(r'C:\Users\Adela\Dropbox\OA\Fraud Decttion\imbalancedFraudDF.csv')
ipToCountry = pd.read_csv(r'C:\Users\Adela\Dropbox\OA\Fraud Decttion\IpAddress_to_Country.csv')

In [3]:
# Distribution of the label column
fraud_data['class'].value_counts()
# The dataset is highly imbalanced: fraud accounts for only about 1% of the transactions.

0    136961
1      1415
Name: class, dtype: int64
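
The counts above imply a fraud rate of roughly 1%, which also means that a trivial classifier labelling every transaction as legitimate would already reach about 99% accuracy. A minimal sketch of that arithmetic, using only the two counts printed above (the variable names are illustrative and not part of the original notebook; Python 3 division is assumed):

```python
# Class counts copied from the value_counts() output above
n_legit = 136961
n_fraud = 1415
n_total = n_legit + n_fraud

fraud_rate = n_fraud / n_total          # share of fraudulent transactions
baseline_accuracy = n_legit / n_total   # accuracy of always predicting class 0

print("fraud rate: {:.4f}".format(fraud_rate))
print("always-legit accuracy: {:.4f}".format(baseline_accuracy))
```

This is why accuracy alone is a poor yardstick for the models trained later, and why recall, precision and F1 are reported alongside it.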

In [4]:
fraud_data.head()

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class
0,22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0
1,333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0
2,150084,2015-04-28 21:13:25,2015-05-04 13:54:50,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0
3,221365,2015-07-21 07:09:52,2015-09-09 18:40:53,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0
4,159135,2015-05-21 06:03:03,2015-07-09 08:05:14,42,ALEYXFXINSXLZ,Ads,Chrome,M,18,2809315000.0,0


In [5]:
# the more an IP address is shared, the more suspicious
fraud_data['n_ip_shared'] = fraud_data.ip_address.map(fraud_data.ip_address.value_counts(dropna=False))

In [6]:
import pandas_profiling

#Inline summary report about each feature
pandas_profiling.ProfileReport(fraud_data)

Overview
Statistic,Value
Number of variables,12
Number of observations,138376
Total Missing (%),0.0%
Total size in memory,12.7 MiB
Average record size in memory,96.0 B

Variable types
Type,Count
Numeric,5
Categorical,5
Boolean,1
Date,0
Text (Unique),1
Rejected,0
Unsupported,0

Numeric variables
Variable,Distinct count,Missing (n),Minimum,5-th percentile,Q1,Median,Q3,95-th percentile,Maximum,Mean,Standard deviation,Skewness,Kurtosis
age,58,0,18,20,27,33,39,48,76,33.126,8.6236,0.42729,-0.17773
ip_address,137653,0,52093,209350000,1085100000,2156500000,3249200000,4088000000,4294900000,2154400000,1250600000,-0.007369,-1.2148
n_ip_shared,7,0,1,1,1,1,1,1,7,1.016,0.1952,14.798,253.87
purchase_value,122,0,9,12,22,35,49,71,154,36.939,18.321,0.67599,0.16021
user_id,138376,0,2,20234,100890,200000,299750,379900,400000,200150,115230,0.00041272,-1.1945

Categorical and boolean variables
Variable,Distinct count,Missing (n),Most frequent values (count)
browser,5,0,"Chrome (55993), IE (33836), Safari (22670), FireFox (22500), Opera (3377)"
class,2,0,"0 (136961), 1 (1415); mean 0.010226"
device_id,134121,0,"QTXDJHIIXYVQN (7), BWSMVSLCJXMCM (6), HCYSLYNRFLAXU (5)"
purchase_time,137985,0,"2015-06-08 09:42:04 (3), 2015-09-10 09:04:53 (3), 2015-05-12 19:25:04 (2)"
sex,2,0,"M (80693), F (57683)"
source,3,0,"SEO (55766), Ads (54913), Direct (27697)"

Text (unique) variable
Variable,Earliest value,Latest value
signup_time,2015-01-01 00:00:42,2015-08-18 04:40:29

Value counts of n_ip_shared
Value,Count
1,137207
2,506
3,381
4,204
5,65
6,6
7,7

Sample

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class,n_ip_shared
0,22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0,1
1,333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0,1
2,150084,2015-04-28 21:13:25,2015-05-04 13:54:50,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0,1
3,221365,2015-07-21 07:09:52,2015-09-09 18:40:53,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0,1
4,159135,2015-05-21 06:03:03,2015-07-09 08:05:14,42,ALEYXFXINSXLZ,Ads,Chrome,M,18,2809315000.0,0,1


In [7]:
print(fraud_data.user_id.nunique())  # 138376
print(len(fraud_data.index))  # 138376

# Every user_id appears exactly once, so each user has only a single (first) transaction and user-level aggregation is not possible.

138376
138376


# Feature Engineering

### Feature Creation: country
Create a new feature, country, by matching the ip_address feature in fraud_data against the IP range boundaries in IpAddress_to_Country.csv.

In [8]:
ipToCountry = pd.read_csv('IpAddress_to_Country.csv')
ipToCountry.head()

Unnamed: 0,lower_bound_ip_address,upper_bound_ip_address,country
0,16777216.0,16777471,Australia
1,16777472.0,16777727,China
2,16777728.0,16778239,China
3,16778240.0,16779263,Australia
4,16779264.0,16781311,China


In [9]:
countries = []
for i in range(len(fraud_data)):
    ip_address = fraud_data.loc[i, 'ip_address']
    tmp = ipToCountry[(ipToCountry['lower_bound_ip_address'] <= ip_address) &
                    (ipToCountry['upper_bound_ip_address'] >= ip_address)]
    if len(tmp) == 1:#found match
        countries.append(tmp['country'].values[0])
    else:#no match
        countries.append('NA')
        
fraud_data['country'] = countries
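
The row-by-row loop above scans the whole range table once per transaction. If the IP ranges do not overlap (an assumption, not something the notebook verifies), the same lookup can be done with a binary search over the sorted lower bounds. The sketch below uses only the five sample ranges printed above; `ip_ranges` and `ip_to_country` are illustrative names.

```python
import bisect

# First rows of IpAddress_to_Country.csv as shown above: (lower, upper, country)
ip_ranges = [
    (16777216, 16777471, "Australia"),
    (16777472, 16777727, "China"),
    (16777728, 16778239, "China"),
    (16778240, 16779263, "Australia"),
    (16779264, 16781311, "China"),
]
ip_ranges.sort()  # binary search needs the ranges ordered by lower bound
lower_bounds = [lower for lower, _, _ in ip_ranges]


def ip_to_country(ip_address):
    """Return the country whose [lower, upper] range contains ip_address, else 'NA'."""
    # Index of the last range whose lower bound is <= ip_address
    idx = bisect.bisect_right(lower_bounds, ip_address) - 1
    if idx >= 0 and ip_address <= ip_ranges[idx][1]:
        return ip_ranges[idx][2]
    return "NA"  # same placeholder the loop above uses for unmatched addresses


print(ip_to_country(16777300.0))   # falls in the first sample range
print(ip_to_country(732758400.0))  # outside the five sample ranges
```

Each lookup then costs a logarithmic number of comparisons rather than a full scan of the range table.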

### Time-related features transformation

In [10]:
fraud_data['interval_after_signup'] = (pd.to_datetime(fraud_data['purchase_time']) - pd.to_datetime(
        fraud_data['signup_time'])).dt.total_seconds()

fraud_data['signup_days_of_year'] = pd.DatetimeIndex(fraud_data['signup_time']).dayofyear

fraud_data['signup_seconds_of_day'] = pd.DatetimeIndex(fraud_data['signup_time']).second + 60 * pd.DatetimeIndex(
    fraud_data['signup_time']).minute + 3600 * pd.DatetimeIndex(fraud_data['signup_time']).hour

fraud_data['purchase_days_of_year'] = pd.DatetimeIndex(fraud_data['purchase_time']).dayofyear
fraud_data['purchase_seconds_of_day'] = pd.DatetimeIndex(fraud_data['purchase_time']).second + 60 * pd.DatetimeIndex(
    fraud_data['purchase_time']).minute + 3600 * pd.DatetimeIndex(fraud_data['purchase_time']).hour

fraud_data = fraud_data.drop(['user_id','signup_time','purchase_time'], axis=1)
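
To make the derived time features concrete, the following sketch recomputes them with the standard library for the first row of fraud_data shown earlier (signup 2015-02-24 22:55:49, purchase 2015-04-18 02:47:11). The helper `seconds_of_day` is an illustrative name, not part of the notebook.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
signup = datetime.strptime("2015-02-24 22:55:49", FMT)
purchase = datetime.strptime("2015-04-18 02:47:11", FMT)


def seconds_of_day(ts):
    """Seconds elapsed since midnight of the same day."""
    return ts.hour * 3600 + ts.minute * 60 + ts.second


interval_after_signup = (purchase - signup).total_seconds()
signup_days_of_year = signup.timetuple().tm_yday
signup_seconds_of_day = seconds_of_day(signup)
purchase_days_of_year = purchase.timetuple().tm_yday
purchase_seconds_of_day = seconds_of_day(purchase)

print(interval_after_signup, signup_days_of_year, signup_seconds_of_day,
      purchase_days_of_year, purchase_seconds_of_day)
```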

In [11]:
# check the new table after feature engineering
fraud_data.head()

Unnamed: 0,purchase_value,device_id,source,browser,sex,age,ip_address,class,n_ip_shared,country,interval_after_signup,signup_days_of_year,signup_seconds_of_day,purchase_days_of_year,purchase_seconds_of_day
0,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0,1,Japan,4506682.0,55,82549,108,10031
1,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0,1,United States,17944.0,158,74390,159,5934
2,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0,1,,492085.0,118,76405,124,50090
3,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0,1,United States,4361461.0,202,25792,252,67253
4,42,ALEYXFXINSXLZ,Ads,Chrome,M,18,2809315000.0,0,1,Canada,4240931.0,141,21783,190,29114


In [12]:
print(fraud_data.source.value_counts())

SEO       55766
Ads       54913
Direct    27697
Name: source, dtype: int64


### Train and test data split


In [13]:
y = fraud_data['class']
X = fraud_data.drop(['class'], axis=1)

#split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print("X_train.shape:", X_train.shape)
print("y_train.shape:", y_train.shape)

('X_train.shape:', (110700, 14))
('y_train.shape:', (110700L,))


### Convert categorical features to numericals

In [14]:
# Training data conversion

# get_dummies replaces the original 'source' and 'browser' columns with their indicator columns
X_train = pd.get_dummies(X_train, columns=['source', 'browser'])
X_train['sex'] = (X_train.sex == 'M').astype(int)

# the more a device is shared, the more suspicious
X_train['n_dev_shared'] = X_train.device_id.map(X_train.device_id.value_counts(dropna=False))

# the more an IP address is shared, the more suspicious
X_train['n_ip_shared'] = X_train.ip_address.map(X_train.ip_address.value_counts(dropna=False))

# the fewer visits from a country, the more suspicious
# dropna=False keeps missing countries in the counts; without it they would map to NaN in this column
X_train['n_country_shared'] = X_train.country.map(X_train.country.value_counts(dropna=False))

X_train = X_train.drop(['device_id','ip_address','country'], axis=1)


# Test data conversion
X_test = pd.get_dummies(X_test, columns=['source', 'browser'])
X_test['sex'] = (X_test.sex == 'M').astype(int)

# the more a device is shared, the more suspicious
X_test['n_dev_shared'] = X_test.device_id.map(X_test.device_id.value_counts(dropna=False))

# the more an IP address is shared, the more suspicious
X_test['n_ip_shared'] = X_test.ip_address.map(X_test.ip_address.value_counts(dropna=False))

# the fewer visits from a country, the more suspicious
X_test['n_country_shared'] = X_test.country.map(X_test.country.value_counts(dropna=False))

X_test = X_test.drop(['device_id','ip_address','country'], axis=1)
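
Encoding the train and test sets separately yields matching indicator columns only if every category occurs in both splits, which happens to hold here for source and browser. A pandas-free sketch of the safer pattern is to learn the category list from the training data and encode both splits against it. The function names and the tiny test split below are hypothetical.

```python
def fit_categories(values):
    """Learn the sorted list of categories seen in the training data."""
    return sorted(set(values))


def one_hot(values, categories, prefix):
    """Encode values against a fixed category list; unseen values become all zeros."""
    return [
        {"{}_{}".format(prefix, c): int(v == c) for c in categories}
        for v in values
    ]


train_browser = ["Chrome", "IE", "Safari", "FireFox", "Opera"]
test_browser = ["Chrome", "Safari"]  # hypothetical split missing three browsers

browser_categories = fit_categories(train_browser)
train_encoded = one_hot(train_browser, browser_categories, "browser")
test_encoded = one_hot(test_browser, browser_categories, "browser")

# Both splits expose the same five indicator columns
print(sorted(test_encoded[0].keys()))
```

In pandas, a comparable effect can be obtained by reindexing the encoded test frame to the training columns, for example `X_test.reindex(columns=X_train.columns, fill_value=0)`.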

In [15]:
X_train.head(20)

Unnamed: 0,purchase_value,sex,age,n_ip_shared,interval_after_signup,signup_days_of_year,signup_seconds_of_day,purchase_days_of_year,purchase_seconds_of_day,source_Ads,source_Direct,source_SEO,browser_Chrome,browser_FireFox,browser_IE,browser_Opera,browser_Safari,n_dev_shared,n_country_shared
29343,12,1,42,1,3499664.0,183,67384,224,24648,1,0,0,1,0,0,0,0,1,3075
12190,10,1,29,1,6766039.0,5,78146,84,18585,1,0,0,0,0,0,1,0,1,42348
19388,34,1,53,1,5870515.0,197,81354,265,76669,0,1,0,1,0,0,0,0,1,16275
89104,48,1,29,1,2145618.0,160,30920,185,16538,1,0,0,1,0,0,0,0,1,2322
82082,44,1,24,1,7079059.0,111,71897,193,66156,1,0,0,0,1,0,0,0,1,8876
76812,56,1,25,1,7872819.0,102,78778,194,2797,1,0,0,1,0,0,0,0,1,42348
111006,67,1,43,1,7662881.0,143,68977,232,42258,1,0,0,0,1,0,0,0,1,16275
37929,29,0,25,1,1293152.0,69,70051,84,67203,0,0,1,0,0,0,0,1,1,42348
88089,20,1,18,1,7551233.0,225,22512,312,56945,0,0,1,1,0,0,0,0,1,42348
50851,14,1,28,1,6830027.0,188,26963,267,31390,1,0,0,0,1,0,0,0,1,42348


### Scale the data

In [16]:
#Compute the train minimum and maximum to be used for later scaling:
scaler = preprocessing.MinMaxScaler().fit(X_train[['n_dev_shared', 'n_ip_shared', 'n_country_shared']]) 
#print(scaler.data_max_)

#transform the training data and use them for the model training
X_train[['n_dev_shared', 'n_ip_shared', 'n_country_shared']] = scaler.transform(X_train[['n_dev_shared', 'n_ip_shared', 'n_country_shared']])

#before predicting on the test data, apply the scaler fitted on the training data to X_test instead of fitting a new scaler on the test data
X_test[['n_dev_shared', 'n_ip_shared', 'n_country_shared']] = scaler.transform(X_test[['n_dev_shared', 'n_ip_shared', 'n_country_shared']])


In [17]:
X_train.n_dev_shared.value_counts(dropna=False)

0.0    105427
0.2      4774
0.4       324
0.6       124
0.8        45
1.0         6
Name: n_dev_shared, dtype: int64

In [18]:
X_test.n_dev_shared.value_counts(dropna=False)

0.0    27330
0.2      334
0.4       12
Name: n_dev_shared, dtype: int64
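
The scaled values above follow from min-max scaling with the training minimum and maximum, x' = (x - min) / (max - min). The sketch below assumes the raw training counts of n_dev_shared range from 1 to 6, which is inferred from the six evenly spaced levels in the training output rather than stated in the notebook. The function names are illustrative.

```python
def fit_min_max(values):
    """Learn the scaling bounds from the training data only."""
    return min(values), max(values)


def min_max_transform(values, lo, hi):
    """Apply x' = (x - lo) / (hi - lo) with bounds learned elsewhere."""
    return [(v - lo) / (hi - lo) for v in values]


train_counts = [1, 2, 3, 4, 5, 6]  # assumed raw n_dev_shared levels in the training set
test_counts = [1, 2, 3]            # assumed raw levels in the test set

lo, hi = fit_min_max(train_counts)
train_scaled = min_max_transform(train_counts, lo, hi)
test_scaled = min_max_transform(test_counts, lo, hi)

print(train_scaled)
print(test_scaled)
```

Because the bounds come from the training set, the test values top out at 0.4 instead of being stretched to 1.0, which matches the test output above.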

# Model Training

### Simple LogisticRegression

In [19]:
# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X_train,y_train)

# predict on test
y_pred=logreg.predict(X_test)

In [20]:
cm = metrics.confusion_matrix(y_test, y_pred)
cmDF = pd.DataFrame(cm, columns=['pred_0', 'pred_1'], index=['true_0', 'true_1'])
print(cmDF)

# Logistic regression with default parameters is not effective in this context: it does not identify any fraud.

        pred_0  pred_1
true_0   27389       0
true_1     287       0


### Simple Random Forest

In [21]:
classifier_RF = RandomForestClassifier(random_state=0)

classifier_RF.fit(X_train, y_train)

# predict class labels 0/1 for the test set
predicted = classifier_RF.predict(X_test)

# generate class probabilities
probs = classifier_RF.predict_proba(X_test)

# generate evaluation metrics
print("%s: %r" % ("accuracy_score is: ", accuracy_score(y_test, predicted)))
print("%s: %r" % ("roc_auc_score is: ", roc_auc_score(y_test, probs[:, 1])))
print("%s: %r" % ("f1_score is: ", f1_score(y_test, predicted )))#string to int

print("confusion_matrix is: ")
cm = confusion_matrix(y_test, predicted)
cmDF = pd.DataFrame(cm, columns=['pred_0', 'pred_1'], index=['true_0', 'true_1'])
print(cmDF)
print('recall =', float(cm[1,1])/(cm[1,0]+cm[1,1]))
print('precision =', float(cm[1,1])/(cm[1,1] + cm[0,1]))  # 1.0


# The random forest classifier performs better but fails to identify about half of the fraudulent transactions; there are no false alarms.

accuracy_score is: : 0.9948692007515537
roc_auc_score is: : 0.7628089076173539
f1_score is: : 0.6712962962962962
confusion_matrix is: 
        pred_0  pred_1
true_0   27389       0
true_1     142     145
recall = 0.5052264808362369
precision = 1.0
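
The reported scores can be reproduced directly from the four cells of the confusion matrix above. A small sketch, with the counts copied from that output:

```python
# Cells of the confusion matrix printed above
tn, fp = 27389, 0
fn, tp = 142, 145

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted frauds that are real frauds
recall = tp / (tp + fn)      # share of real frauds that were caught
f1 = 2 * precision * recall / (precision + recall)

print("accuracy  = {:.4f}".format(accuracy))
print("precision = {:.4f}".format(precision))
print("recall    = {:.4f}".format(recall))
print("f1        = {:.4f}".format(f1))
```

Accuracy stays above 0.99 even though roughly half of the frauds are missed, which illustrates why F1 and recall are the more informative metrics on this dataset.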


## SMOTE sampling
SMOTE increases the share of the minority class (fraud) in the training data by synthesizing additional fraud samples, with the aim of improving model performance.

In [22]:
smote = SMOTE(random_state=12)
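# fit_resample replaces fit_sample, which has been removed from recent imbalanced-learn releases
x_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

unique, counts = np.unique(y_train_sm, return_counts=True)

print(np.asarray((unique, counts)).T)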

[[     0 109572]
 [     1 109572]]
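
Conceptually, SMOTE creates each synthetic minority sample by picking a minority point, choosing one of its nearest minority neighbours and interpolating at a random position between the two. The following is a simplified standard-library sketch of that idea, not the imbalanced-learn implementation. The function name, the toy points and the choice k=2 are all assumptions made for illustration.

```python
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base (excluding base itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


minority_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.0)]
new_points = smote_like_samples(minority_points, n_new=5)
print(new_points)
```

Every synthetic point lies on a segment between two existing minority points, so oversampling adds no information beyond what the original fraud rows already contain.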


In [23]:
# Random forest trained on the SMOTE-resampled training data
classifier_RF_sm = RandomForestClassifier(random_state=0)

classifier_RF_sm.fit(x_train_sm, y_train_sm)

# predict class labels for the test set
predicted_sm = classifier_RF_sm.predict(X_test)

# generate class probabilities
probs_sm = classifier_RF_sm.predict_proba(X_test)


# generate evaluation metrics
print("%s: %r" % ("accuracy_score_sm is: ", accuracy_score(y_test, predicted_sm)))
print("%s: %r" % ("roc_auc_score_sm is: ", roc_auc_score(y_test, probs_sm[:, 1])))
print("%s: %r" % ("f1_score_sm is: ", f1_score(y_test, predicted_sm )))#string to int

print("confusion_matrix_sm is: ")
cm_sm = confusion_matrix(y_test, predicted_sm)
cmDF = pd.DataFrame(cm_sm, columns=['pred_0', 'pred_1'], index=['true_0', 'true_1'])
print(cmDF)
print('recall or sens_sm =', float(cm_sm[1,1])/(cm_sm[1,0]+cm_sm[1,1]))
print('precision_sm =', float(cm_sm[1,1])/(cm_sm[1,1] + cm_sm[0,1]))

# Compared with the simple random forest above, this model is less effective: the number of true positives drops (136 vs. 145) and one false alarm appears.

accuracy_score_sm is: : 0.994507876860818
roc_auc_score_sm is: : 0.741489850130581
f1_score_sm is: : 0.6415094339622641
confusion_matrix_sm is: 
        pred_0  pred_1
true_0   27388       1
true_1     151     136
recall or sens_sm = 0.4738675958188153
precision_sm = 0.9927007299270073


## Parameter tuning by GridSearchCV

In [24]:
# Eval metrics to be calculated for each combination of parameters and cv
scorers = {
    'precision_score': make_scorer(precision_score),
    'recall_score': make_scorer(recall_score),
    'f1_score': make_scorer(f1_score, pos_label=1)
}

In [25]:
def grid_search_wrapper(model, parameters, refit_score='f1_score'):
    """
    fits a GridSearchCV classifier using refit_score for optimization(refit on the best model according to refit_score)
    prints classifier performance metrics
    """
#     skf = StratifiedKFold(n_splits=10)
#     grid_search = GridSearchCV(clf, param_grid, scoring=scorers, refit=refit_score,
#                            cv=skf, return_train_score=True, n_jobs=-1)
    grid_search = GridSearchCV(model, parameters, scoring=scorers, refit=refit_score,
                           cv=3, return_train_score=True, n_jobs=-1)
    grid_search.fit(X_train, y_train)

    # make the predictions
    y_pred = grid_search.predict(X_test)
    y_prob = grid_search.predict_proba(X_test)[:, 1]
    
    print('Best params for {}'.format(refit_score))
    print(grid_search.best_params_)

    # confusion matrix on the test data.
    print('\nConfusion matrix of Random Forest optimized for {} on the test data:'.format(refit_score))
    cm = confusion_matrix(y_test, y_pred)
    cmDF = pd.DataFrame(cm, columns=['pred_0', 'pred_1'], index=['true_0', 'true_1'])
    print(cmDF)
    
    print("\t%s: %r" % ("roc_auc_score is: ", roc_auc_score(y_test, y_prob)))
    print("\t%s: %r" % ("f1_score is: ", f1_score(y_test, y_pred)))#string to int

    print('recall = ', float(cm[1,1]) / (cm[1,0] + cm[1,1]))
    print('precision = ', float(cm[1,1]) / (cm[1, 1] + cm[0,1]))

    return grid_search


## Optimizing on f1_score on LR

In [27]:
# C: inverse of regularization strength, smaller values specify stronger regularization
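LRGrid = {"C" : np.logspace(-2,2,5), "penalty":["l1","l2"]}  # l1: lasso, l2: ridge
#param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}
# liblinear supports both the l1 and l2 penalties (the default solver in recent scikit-learn versions does not support l1)
logRegModel = LogisticRegression(random_state=0, solver='liblinear')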

grid_search_LR_f1 = grid_search_wrapper(logRegModel, LRGrid, refit_score='f1_score')

Best params for f1_score
{'penalty': 'l1', 'C': 0.1}

Confusion matrix of Random Forest optimized for f1_score on the test data:
        pred_0  pred_1
true_0   27386       3
true_1     278       9
	roc_auc_score is: : 0.7597126596386581
	f1_score is: : 0.06020066889632108
recall =  0.0313588850174216
precision =  0.75


## Optimizing on f1_score on RF

In [28]:
parameters = {        
'max_depth': [None, 5, 15],
'n_estimators' :  [10,150],
'class_weight' : [{0: 1, 1: w} for w in [0.2, 1, 100]]
}
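
To gauge the cost of this search, the sketch below enumerates the grid with the standard library. It copies the grid above and assumes the 3-fold cross-validation used in grid_search_wrapper. `cv_folds`, `combinations` and `n_fits` are illustrative names.

```python
from itertools import product

parameters = {
    'max_depth': [None, 5, 15],
    'n_estimators': [10, 150],
    'class_weight': [{0: 1, 1: w} for w in [0.2, 1, 100]],
}
cv_folds = 3  # matches cv=3 in grid_search_wrapper

# One candidate per element of the Cartesian product of the value lists
combinations = list(product(*parameters.values()))
n_fits = len(combinations) * cv_folds

print(len(combinations), "candidates,", n_fits, "cross-validation fits")
```

Note that a fraud weight of 0.2 down-weights the minority class, while 100 strongly up-weights it, so the grid covers both directions around the unweighted case.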

clf = RandomForestClassifier(random_state=0)

In [29]:
grid_search_rf_f1 = grid_search_wrapper(clf, parameters, refit_score='f1_score')

Best params for f1_score
{'n_estimators': 150, 'max_depth': None, 'class_weight': {0: 1, 1: 0.2}}

Confusion matrix of Random Forest optimized for f1_score on the test data:
        pred_0  pred_1
true_0   27389       0
true_1     142     145
	roc_auc_score is: : 0.7858983673473022
	f1_score is: : 0.6712962962962962
recall =  0.5052264808362369
precision =  1.0


In [30]:
best_rf_model_f1 = grid_search_rf_f1.best_estimator_
best_rf_model_f1

RandomForestClassifier(bootstrap=True, class_weight={0: 1, 1: 0.2},
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=150, n_jobs=None, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

In [31]:
results_f1 = pd.DataFrame(grid_search_rf_f1.cv_results_)
results_sortf1 = results_f1.sort_values(by='mean_test_f1_score', ascending=False)
results_sortf1[['mean_test_precision_score', 'mean_test_recall_score', 'mean_test_f1_score', 'mean_train_precision_score', 'mean_train_recall_score', 'mean_train_f1_score','param_max_depth', 'param_class_weight', 'param_n_estimators']].round(3).head()


Unnamed: 0,mean_test_precision_score,mean_test_recall_score,mean_test_f1_score,mean_train_precision_score,mean_train_recall_score,mean_train_f1_score,param_max_depth,param_class_weight,param_n_estimators
9,1.0,0.527,0.69,1.0,0.527,0.69,5.0,"{0: 1, 1: 1}",150
1,1.0,0.527,0.69,1.0,1.0,1.0,,"{0: 1, 1: 0.2}",150
13,1.0,0.527,0.69,1.0,1.0,1.0,,"{0: 1, 1: 100}",150
3,1.0,0.527,0.69,1.0,0.527,0.69,5.0,"{0: 1, 1: 0.2}",150
11,1.0,0.527,0.69,1.0,0.586,0.739,15.0,"{0: 1, 1: 1}",150


## Insights Generation

In [32]:
# predictive factors

pd.DataFrame(best_rf_model_f1.feature_importances_, index = X_train.columns, columns=['importance']).sort_values('importance', ascending=False)

# interval_after_signup, the time-related purchase and signup features, n_ip_shared and purchase_value are the most predictive factors of fraud

Unnamed: 0,importance
interval_after_signup,0.378626
purchase_days_of_year,0.159541
purchase_seconds_of_day,0.078872
signup_seconds_of_day,0.077677
signup_days_of_year,0.05808
n_ip_shared,0.057116
purchase_value,0.042805
age,0.039436
n_dev_shared,0.033011
n_country_shared,0.027644


In [33]:
trainDF = pd.concat([X_train, y_train], axis=1)
pd.crosstab(trainDF["n_dev_shared"],trainDF["class"])

# insight 1: the larger n_dev_shared is, the higher the fraud rate

class,0,1
n_dev_shared,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,104966,461
0.2,4403,371
0.4,152,172
0.6,37,87
0.8,13,32
1.0,1,5
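
The trend is easier to see as a fraud rate per level. The sketch below recomputes it from the crosstab counts above. `crosstab_rows` and `rates` are illustrative names.

```python
# (scaled n_dev_shared, legitimate count, fraud count) copied from the crosstab above
crosstab_rows = [
    (0.0, 104966, 461),
    (0.2, 4403, 371),
    (0.4, 152, 172),
    (0.6, 37, 87),
    (0.8, 13, 32),
    (1.0, 1, 5),
]

rates = []
for level, n_legit, n_fraud in crosstab_rows:
    rate = n_fraud / (n_legit + n_fraud)  # fraud share within this level
    rates.append(rate)
    print("n_dev_shared = {:.1f}: fraud rate = {:.1%}".format(level, rate))
```

The fraud rate rises monotonically with device sharing, from under 1% for unshared devices to more than half once the scaled value reaches 0.4, although the highest levels contain very few transactions.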


In [34]:
fraud_data.groupby("class")[['interval_after_signup']].mean()#action velocity(consecutive operations/actions of user)

# insight 2: interval_after_signup is, on average, significantly lower for fraudulent transactions than for legitimate ones

Unnamed: 0_level_0,interval_after_signup
class,Unnamed: 1_level_1
0,5191179.0
1,2570226.0


In [35]:
fraud_data.groupby("class")[['interval_after_signup']].median()
# insight 3: the median interval for fraud is 1 second, i.e. at least half of the fraudulent purchases happened within 1s of signing up

Unnamed: 0_level_0,interval_after_signup
class,Unnamed: 1_level_1
0,5194911.0
1,1.0


# Conclusion

After trying simple logistic regression, a simple random forest, a random forest with SMOTE sampling and a random forest tuned for F1 score, I found that the random forest tuned for F1 score performs best: it identifies about half of the fraudulent transactions in the test set (recall 0.51) with zero false alarms (precision 1.0), giving an F1 score of 0.67 and a ROC AUC of 0.79.

##### Insights gained:
The features of interval_after_signup and time-related aggregate features are highly predictive of fraudulent activities.
1. the more transactions share the same device, the higher the chance of fraud
2. the interval between signup and purchase is significantly shorter for fraudulent transactions than for legitimate ones
3. at least half of the fraudulent purchases happen within 1s of signing up