# Goal

E-commerce websites often transact huge amounts of money. Whenever a huge amount of
money is moved, there is a high risk of users performing fraudulent activities, e.g. using stolen
credit cards, laundering money, etc.

Machine learning excels at identifying fraudulent activities. Many websites where you enter
your credit card information have a risk team in charge of preventing fraud via machine learning.

The goal of this challenge is to build a machine learning model that predicts the probability that
the first transaction of a new user is fraudulent.


# Challenge Description

Company XYZ is an e-commerce site that sells hand-made clothes.

You have to build a model that predicts whether a user has a high probability of using the site to
perform some illegal activity or not. This is a super common task for data scientists.

You only have information about the user's first transaction on the site, and based on that you
have to make your classification ("fraud/no fraud").

These are the tasks you are asked to do:

* For each user, determine her country based on the numeric IP address.
* Build a model to predict whether an activity is fraudulent or not. Explain how different
assumptions about the cost of false positives vs false negatives would impact the model.
* Your boss is a bit worried about using a model she doesn't understand for something as
important as fraud detection. How would you explain to her how the model is making the
predictions? Not from a mathematical perspective (she couldn't care less about that), but
from a user perspective. What kinds of users are more likely to be classified as at risk?
What are their characteristics?
* Let's say you now have this model which can be used live to predict in real time if an
activity is fraudulent or not. From a product perspective, how would you use it? That is,
what kind of different user experiences would you build based on the model output?
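
One way to reason about the false positive vs false negative trade-off in the second task is to pick the decision threshold that minimizes total cost. The sketch below is only an illustration: the helper `best_threshold`, the costs `cost_fp` and `cost_fn`, and the toy labels and scores are all hypothetical and do not come from the dataset.

```python
import numpy as np

def best_threshold(y_true, y_score, cost_fp, cost_fn):
    """Return the threshold (from a fixed grid) with the lowest total cost."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    thresholds = np.linspace(0.0, 1.0, 101)
    costs = []
    for t in thresholds:
        y_pred = y_score >= t                  # flag as fraud at or above t
        fp = np.sum(y_pred & (y_true == 0))    # legitimate users flagged
        fn = np.sum(~y_pred & (y_true == 1))   # fraud that slipped through
        costs.append(fp * cost_fp + fn * cost_fn)
    return thresholds[int(np.argmin(costs))]

# Made-up labels and predicted fraud probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
y_score = [0.05, 0.20, 0.40, 0.60, 0.55, 0.90]

t_fn_expensive = best_threshold(y_true, y_score, cost_fp=1, cost_fn=10)
t_fp_expensive = best_threshold(y_true, y_score, cost_fp=10, cost_fn=1)
print(t_fn_expensive, t_fp_expensive)
```

When missed fraud is assumed to be the expensive error, the chosen threshold is lower and more users get flagged; when blocking a legitimate user is the expensive error, the threshold moves up.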

# Data

"Fraud_Data" - information about each user's first transaction

## Columns:

* user_id : Id of the user. Unique by user
* signup_time : the time when the user created her account (GMT time)
* purchase_time : the time when the user bought the item (GMT time)
* purchase_value : the cost of the item purchased (USD)
* device_id : the device id. You can assume that it is unique by device, i.e., transactions
with the same device ID mean that the same physical device was used to buy
* source : user marketing channel: Ads, SEO, Direct (i.e. came to the site by directly typing
the site address in the browser).
* browser : the browser used by the user.
* sex : user sex: Male/Female
* age : user age
* ip_address : user numeric ip address
* class : this is what we are trying to predict: whether the activity was fraudulent (1) or not
(0).


"IpAddress_to_Country" - mapping each numeric ip address to its country.
For each country, it gives a range. If the numeric ip address falls within
the range, then the ip address belongs to the corresponding country.

## Columns:

* lower_bound_ip_address : the lower bound of the numeric ip address for that country
* upper_bound_ip_address : the upper bound of the numeric ip address for that country
* country : the corresponding country. If a user has an ip address whose value is between
the lower and upper bound, then she is based in this country.
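
The range lookup described above can be sketched with a sorted search: sort the ranges by lower bound, find the last range whose lower bound is at or below the address, then confirm the address is also at or below that range's upper bound. The helper name `lookup_country` and the toy ranges are hypothetical, and the sketch assumes the bounds are inclusive and the ranges do not overlap.

```python
import numpy as np
import pandas as pd

def lookup_country(ips, ranges):
    """Map numeric IPs to countries; unmatched addresses become None."""
    ranges = ranges.sort_values('lower_bound_ip_address').reset_index(drop=True)
    lower = ranges['lower_bound_ip_address'].to_numpy()
    upper = ranges['upper_bound_ip_address'].to_numpy()
    country = ranges['country'].to_numpy(dtype=object)

    ips = np.asarray(ips, dtype=float)
    # Index of the last range whose lower bound is <= ip (-1 if there is none).
    idx = np.searchsorted(lower, ips, side='right') - 1
    safe_idx = np.clip(idx, 0, None)
    matched = (idx >= 0) & (ips <= upper[safe_idx])
    return pd.Series(np.where(matched, country[safe_idx], None), dtype=object)

# Toy ranges (assumption: inclusive bounds, as in the column descriptions).
toy_ranges = pd.DataFrame({
    'lower_bound_ip_address': [100.0, 300.0],
    'upper_bound_ip_address': [199, 399],
    'country': ['A', 'B'],
})
toy_result = lookup_country([50.0, 150.0, 250.0, 399.0], toy_ranges)
print(toy_result.tolist())
```

Addresses below the first range or in a gap between ranges stay unmatched, which is why a left join on the real data can produce null countries.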

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
fraud_data = pd.read_csv('Fraud_Data.csv')
ip_address = pd.read_csv('IpAddress_to_Country.csv')

In [5]:
# look at the data
# fraud_data has no null values
fraud_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151112 entries, 0 to 151111
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   user_id         151112 non-null  int64  
 1   signup_time     151112 non-null  object 
 2   purchase_time   151112 non-null  object 
 3   purchase_value  151112 non-null  int64  
 4   device_id       151112 non-null  object 
 5   source          151112 non-null  object 
 6   browser         151112 non-null  object 
 7   sex             151112 non-null  object 
 8   age             151112 non-null  int64  
 9   ip_address      151112 non-null  float64
 10  class           151112 non-null  int64  
dtypes: float64(1), int64(4), object(6)
memory usage: 12.7+ MB


In [7]:
# the ip_address table has no null values either
ip_address.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138846 entries, 0 to 138845
Data columns (total 3 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   lower_bound_ip_address  138846 non-null  float64
 1   upper_bound_ip_address  138846 non-null  int64  
 2   country                 138846 non-null  object 
dtypes: float64(1), int64(1), object(1)
memory usage: 3.2+ MB


In [9]:
# the smallest ip_address in fraud_data is about 5*10^4, but the smallest lower bound in the country table is about 1.7*10^7,
# so some addresses cannot match any range
fraud_data.describe()

Unnamed: 0,user_id,purchase_value,age,ip_address,class
count,151112.0,151112.0,151112.0,151112.0,151112.0
mean,200171.04097,36.935372,33.140704,2152145000.0,0.093646
std,115369.285024,18.322762,8.617733,1248497000.0,0.291336
min,2.0,9.0,18.0,52093.5,0.0
25%,100642.5,22.0,27.0,1085934000.0,0.0
50%,199958.0,35.0,33.0,2154770000.0,0.0
75%,300054.0,49.0,39.0,3243258000.0,0.0
max,400000.0,154.0,76.0,4294850000.0,1.0


In [11]:
ip_address.describe()


Unnamed: 0,lower_bound_ip_address,upper_bound_ip_address
count,138846.0,138846.0
mean,2724532000.0,2724557000.0
std,897521500.0,897497900.0
min,16777220.0,16777470.0
25%,1919930000.0,1920008000.0
50%,3230887000.0,3230888000.0
75%,3350465000.0,3350466000.0
max,3758096000.0,3758096000.0


In [13]:
ip_address['lower_bound_ip_address'].min()

16777216.0
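
For context, the numeric form is the 32-bit integer behind the usual dotted IPv4 notation. A quick check with Python's standard `ipaddress` module shows which dotted address the smallest lower bound corresponds to. This is an aside and not part of the original analysis.

```python
import ipaddress

min_lower_bound = 16777216  # the minimum lower bound shown above (2**24)
dotted = str(ipaddress.ip_address(min_lower_bound))
print(dotted)  # 1.0.0.0
```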

In [14]:
# count the rows whose address is at or below the smallest lower bound (printing a single number instead of one count per column)
print('# of ip_address at or below the smallest lower bound:', fraud_data[fraud_data['ip_address']<=16777216]['user_id'].count())


# of ip_address at or below the smallest lower bound: 634


In [15]:
# number of distinct countries: the ip_address table has 138846 rows but only 235 countries, so each country owns many ranges
unique_country = ip_address['country'].unique()
unique_country.size

235

In [16]:
ip_address.groupby('country').count()

Unnamed: 0_level_0,lower_bound_ip_address,upper_bound_ip_address
country,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,46,46
Albania,56,56
Algeria,30,30
American Samoa,1,1
Andorra,5,5
...,...,...
Virgin Islands (U.S.),14,14
Wallis and Futuna Islands,2,2
Yemen,12,12
Zambia,26,26


In [17]:
# look at one country: its ranges are scattered across the address space rather than forming one contiguous block
ip_address[ip_address['country']=='Andorra']

Unnamed: 0,lower_bound_ip_address,upper_bound_ip_address,country
17184,1432265000.0,1432272895,Andorra
19219,1538998000.0,1539006463,Andorra
33427,1836016000.0,1836023807,Andorra
53970,3104060000.0,3104061439,Andorra
86442,3265151000.0,3265159167,Andorra


In [18]:
unique_low_bound = ip_address['lower_bound_ip_address'].unique()
print('the number of distinct lower bounds:', unique_low_bound.size)
print('the number of rows in the ip_address table:', ip_address['country'].count())


the number of distinct lower bounds: 138846
the number of rows in the ip_address table: 138846
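
Unique lower bounds alone do not guarantee that an address can match at most one range. A small sketch of a stricter check, assuming the column names above: sort by lower bound and verify that each range starts after the previous one ends. The function name `ranges_overlap` and the toy frames are hypothetical.

```python
import pandas as pd

def ranges_overlap(ranges):
    """Return True if any two [lower, upper] ranges share an address."""
    s = ranges.sort_values('lower_bound_ip_address')
    prev_upper = s['upper_bound_ip_address'].shift(1)
    # The first row has no predecessor; comparisons against NaN are False.
    return bool((s['lower_bound_ip_address'] <= prev_upper).any())

# Toy frames for illustration only.
disjoint = pd.DataFrame({'lower_bound_ip_address': [0.0, 10.0],
                         'upper_bound_ip_address': [9, 19]})
overlapping = pd.DataFrame({'lower_bound_ip_address': [0.0, 5.0],
                            'upper_bound_ip_address': [9, 19]})
print(ranges_overlap(disjoint), ranges_overlap(overlapping))
```

Comparing only neighbouring rows is enough after sorting: if any two ranges overlap, the earlier one also overlaps the range that immediately follows it.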


In [19]:
fraud_data.reset_index(inplace = True)

In [20]:
# left-join each transaction to the ip range that contains its address
from pandasql import sqldf
pysql = lambda q: sqldf(q, globals())
join = '''
select f.user_id, f.signup_time, f.purchase_time, f.purchase_value, f.device_id,
       f.source, f.browser, f.sex, f.age, f.ip_address, f.class,
       i.lower_bound_ip_address, i.upper_bound_ip_address, i.country
from fraud_data f
left join ip_address i
  on f.ip_address >= i.lower_bound_ip_address
 and f.ip_address <= i.upper_bound_ip_address
'''
fraud_join_data = pysql(join)


In [28]:
# look at the merged data: the number of rows equals the number of rows in fraud_data, so no address matched more than one range (the ranges do not overlap)
# the country column has null values for addresses that matched no range
fraud_join_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151112 entries, 0 to 151111
Data columns (total 14 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   user_id                 151112 non-null  int64  
 1   signup_time             151112 non-null  object 
 2   purchase_time           151112 non-null  object 
 3   purchase_value          151112 non-null  int64  
 4   device_id               151112 non-null  object 
 5   source                  151112 non-null  object 
 6   browser                 151112 non-null  object 
 7   sex                     151112 non-null  object 
 8   age                     151112 non-null  int64  
 9   ip_address              151112 non-null  float64
 10  class                   151112 non-null  int64  
 11  lower_bound_ip_address  129146 non-null  float64
 12  upper_bound_ip_address  129146 non-null  float64
 13  country                 129146 non-null  object 
dtypes: float64(3), int64

In [26]:
fraud_join_data.describe()

Unnamed: 0,user_id,purchase_value,age,ip_address,class,lower_bound_ip_address,upper_bound_ip_address
count,151112.0,151112.0,151112.0,151112.0,151112.0,129146.0,129146.0
mean,200171.04097,36.935372,33.140704,2152145000.0,0.093646,1890950000.0,1894646000.0
std,115369.285024,18.322762,8.617733,1248497000.0,0.291336,1086802000.0,1083635000.0
min,2.0,9.0,18.0,52093.5,0.0,16778240.0,16779260.0
25%,100642.5,22.0,27.0,1085934000.0,0.0,939524100.0,956301300.0
50%,199958.0,35.0,33.0,2154770000.0,0.0,1899708000.0,1899733000.0
75%,300054.0,49.0,39.0,3243258000.0,0.0,2832073000.0,2832138000.0
max,400000.0,154.0,76.0,4294850000.0,1.0,3758031000.0,3758064000.0


In [27]:
fraud_join_data.head()

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class,lower_bound_ip_address,upper_bound_ip_address,country
0,22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0,729808900.0,734003200.0,Japan
1,333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0,335544300.0,352321500.0,United States
2,1359,2015-01-01 18:52:44,2015-01-01 18:52:45,15,YSSKYOSJHPPLJ,SEO,Opera,M,53,2621474000.0,1,2621440000.0,2621506000.0,United States
3,150084,2015-04-28 21:13:25,2015-05-04 13:54:50,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0,,,
4,221365,2015-07-21 07:09:52,2015-09-09 18:40:53,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0,415498200.0,415629300.0,United States


In [29]:
# convert the time columns to datetime, selecting them by name rather than by position
fraud_join_data['signup_time'] = pd.to_datetime(fraud_join_data['signup_time'])
fraud_join_data['purchase_time'] = pd.to_datetime(fraud_join_data['purchase_time'])

In [30]:
fraud_join_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151112 entries, 0 to 151111
Data columns (total 14 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   user_id                 151112 non-null  int64         
 1   signup_time             151112 non-null  datetime64[ns]
 2   purchase_time           151112 non-null  datetime64[ns]
 3   purchase_value          151112 non-null  int64         
 4   device_id               151112 non-null  object        
 5   source                  151112 non-null  object        
 6   browser                 151112 non-null  object        
 7   sex                     151112 non-null  object        
 8   age                     151112 non-null  int64         
 9   ip_address              151112 non-null  float64       
 10  class                   151112 non-null  int64         
 11  lower_bound_ip_address  129146 non-null  float64       
 12  upper_bound_ip_address  129146

In [45]:
fraud_join_data_country = fraud_join_data.groupby('country').count()
fraud_join_data_country = fraud_join_data_country['user_id']
fraud_join_data_country.sort_values(ascending = True)

country
Nauru                                 1
Dominica                              1
Guadeloupe                            1
Cape Verde                            1
British Indian Ocean Territory        1
                                  ...  
Korea Republic of                  4162
United Kingdom                     4490
Japan                              7306
China                             12038
United States                     58049
Name: user_id, Length: 181, dtype: int64

Before jumping into building a model, think about whether you can create new powerful variables. This is
called feature engineering, and it is often one of the most important steps in machine learning. However, feature
engineering is quite time consuming.

A few obvious variables that can be created here could be:

* Time difference between sign-up time and purchase time
* Whether the device id is unique or several users share the same device (many different user ids using
the same device could be an indicator of fake accounts)
* Same for the ip address. Many different users having the same ip address could be an indicator of fake accounts
* The usual week of the year and day of the week, derived from the time variables
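
The features listed above can be sketched on a toy frame as follows. Column names follow the dataset; the helper `add_features` and the new feature names (`seconds_to_purchase`, `device_user_count`, `ip_user_count`, `purchase_week`, `purchase_dayofweek`) are hypothetical choices, not necessarily the names used in this notebook.

```python
import pandas as pd

def add_features(df):
    """Return a copy of df with a few simple engineered features."""
    out = df.copy()
    out['signup_time'] = pd.to_datetime(out['signup_time'])
    out['purchase_time'] = pd.to_datetime(out['purchase_time'])
    # Seconds between sign-up and first purchase.
    out['seconds_to_purchase'] = (
        out['purchase_time'] - out['signup_time']
    ).dt.total_seconds()
    # Number of distinct users seen on the same device / ip address.
    out['device_user_count'] = out.groupby('device_id')['user_id'].transform('nunique')
    out['ip_user_count'] = out.groupby('ip_address')['user_id'].transform('nunique')
    # Calendar features from the purchase time (Monday is day 0).
    out['purchase_week'] = out['purchase_time'].dt.isocalendar().week.astype(int)
    out['purchase_dayofweek'] = out['purchase_time'].dt.dayofweek
    return out

# Toy rows for illustration only.
toy = pd.DataFrame({
    'user_id': [1, 2, 3],
    'signup_time': ['2015-01-01 18:52:44', '2015-01-01 18:52:45', '2015-02-24 22:55:49'],
    'purchase_time': ['2015-01-01 18:52:45', '2015-01-01 18:52:46', '2015-04-18 02:47:11'],
    'device_id': ['D1', 'D1', 'D2'],
    'ip_address': [100.0, 100.0, 200.0],
})
toy_features = add_features(toy)
print(toy_features[['seconds_to_purchase', 'device_user_count', 'ip_user_count']])
```

Using `transform` keeps one row per transaction, so the counts can be used directly as model inputs without a separate merge.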

In [51]:
#calculate the time difference between the purchase and signup
fraud_join_data['time_difference'] = fraud_join_data['purchase_time'] - fraud_join_data['signup_time']

In [60]:
# convert the timedelta to seconds, selecting the column by name rather than by position
fraud_join_data['time_difference'] = fraud_join_data['time_difference'].dt.total_seconds()

In [63]:
# count the rows (one per user) for each device id and keep devices shared by more than 2 users
device_count = fraud_join_data.groupby('device_id').count()
device_count[device_count['user_id']>2]

Unnamed: 0_level_0,user_id,signup_time,purchase_time,purchase_value,source,browser,sex,age,ip_address,class,lower_bound_ip_address,upper_bound_ip_address,country,time_difference
device_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
AAAXXOZJRZRAO,11,11,11,11,11,11,11,11,11,11,11,11,11,11
AANYBGQSWHRTK,8,8,8,8,8,8,8,8,8,8,8,8,8,8
ADEDUDCYQMYTI,14,14,14,14,14,14,14,14,14,14,14,14,14,14
AENUQLGTUHYMS,7,7,7,7,7,7,7,7,7,7,0,0,0,7
AIGPGDVRDKOKT,12,12,12,12,12,12,12,12,12,12,12,12,12,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYHVSPGHWACPO,6,6,6,6,6,6,6,6,6,6,6,6,6,6
ZYZQZXBXADPST,16,16,16,16,16,16,16,16,16,16,16,16,16,16
ZZCAWCKYVMWNH,9,9,9,9,9,9,9,9,9,9,9,9,9,9
ZZFFPOVMCQVCG,7,7,7,7,7,7,7,7,7,7,7,7,7,7


In [64]:
# count the rows (one per user) for each ip address and keep addresses shared by more than 2 users
ip_account = fraud_join_data.groupby('ip_address').count()
ip_account[ip_account['user_id']>2]

Unnamed: 0_level_0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,class,lower_bound_ip_address,upper_bound_ip_address,country,time_difference
ip_address,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
2.278096e+06,16,16,16,16,16,16,16,16,16,16,0,0,0,16
6.150367e+06,9,9,9,9,9,9,9,9,9,9,0,0,0,9
1.666923e+07,11,11,11,11,11,11,11,11,11,11,0,0,0,11
1.819146e+07,6,6,6,6,6,6,6,6,6,6,6,6,6,6
2.584822e+07,7,7,7,7,7,7,7,7,7,7,7,7,7,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4.262695e+09,13,13,13,13,13,13,13,13,13,13,0,0,0,13
4.275223e+09,11,11,11,11,11,11,11,11,11,11,0,0,0,11
4.279796e+09,12,12,12,12,12,12,12,12,12,12,0,0,0,12
4.281743e+09,7,7,7,7,7,7,7,7,7,7,0,0,0,7
