# Goal

E-commerce websites often transact huge amounts of money. Whenever a huge amount of
money is moved, there is a high risk of users performing fraudulent activities, e.g. using stolen
credit cards, laundering money, etc.

Machine learning excels at identifying fraudulent activities. Any website where you enter
your credit card information has a risk team in charge of preventing fraud via machine learning.

The goal of this challenge is to build a machine learning model that predicts the probability that
the first transaction of a new user is fraudulent.


# Challenge Description

Company XYZ is an e-commerce site that sells hand-made clothes.

You have to build a model that predicts whether a user has a high probability of using the site to
perform some illegal activity or not. This is a super common task for data scientists.

You only have information about the user's first transaction on the site, and based on that you
have to make your classification ("fraud/no fraud").

These are the tasks you are asked to do:

* For each user, determine her country based on the numeric IP address.
* Build a model to predict whether an activity is fraudulent or not. Explain how different
assumptions about the cost of false positives vs false negatives would impact the model.
* Your boss is a bit worried about using a model she doesn't understand for something as
important as fraud detection. How would you explain to her how the model is making the
predictions? Not from a mathematical perspective (she couldn't care less about that), but
from a user perspective. What kinds of users are more likely to be classified as at risk?
What are their characteristics?
* Let's say you now have this model which can be used live to predict in real time if an
activity is fraudulent or not. From a product perspective, how would you use it? That is,
what kind of different user experiences would you build based on the model output?
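
One way to reason about the false positive vs false negative trade-off in the second task is to attach a cost to each kind of error and pick the score cut-off that minimises the total. The sketch below is only an illustration: the labels, scores and costs are made up, and `total_cost` and `best_threshold` are hypothetical helper names, not part of the challenge.

```python
import numpy as np

def total_cost(y_true, y_prob, threshold, cost_fp, cost_fn):
    """Total cost of flagging every transaction with score >= threshold."""
    y_true = np.asarray(y_true)
    flagged = np.asarray(y_prob) >= threshold
    false_pos = np.sum(flagged & (y_true == 0))   # legitimate users flagged
    false_neg = np.sum(~flagged & (y_true == 1))  # frauds let through
    return cost_fp * false_pos + cost_fn * false_neg

def best_threshold(y_true, y_prob, cost_fp, cost_fn):
    """Scan a grid of cut-offs and return the cheapest one."""
    grid = np.linspace(0.0, 1.0, 101)
    costs = [total_cost(y_true, y_prob, t, cost_fp, cost_fn) for t in grid]
    return float(grid[int(np.argmin(costs))])

# Toy labels and scores, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]

# When a missed fraud is expensive, the chosen cut-off is lower (more users flagged).
t_fn_heavy = best_threshold(y_true, y_prob, cost_fp=1, cost_fn=10)
# When blocking a good user is expensive, the chosen cut-off is higher.
t_fp_heavy = best_threshold(y_true, y_prob, cost_fp=10, cost_fn=1)
print(t_fn_heavy, t_fp_heavy)
```

The same scan could be run on the held-out predictions of any classifier that outputs probabilities; only the two cost values need to come from the business.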

# Data

"Fraud_Data" - information about each user's first transaction
## Columns:

* user_id : Id of the user. Unique by user
* signup_time : the time when the user created her account (GMT time)
* purchase_time : the time when the user bought the item (GMT time)
* purchase_value : the cost of the item purchased (USD)
* device_id : the device id. You can assume that it is unique by device. I.e., transactions
with the same device ID mean that the same physical device was used to buy
* source : user marketing channel: ads, SEO, Direct (i.e. came to the site by typing
the site address directly in the browser).
* browser : the browser used by the user.
* sex : user sex: Male/Female
* age : user age
* ip_address : user numeric ip address
* class : this is what we are trying to predict: whether the activity was fraudulent (1) or not
(0).


"IpAddress_to_Country" - mapping each numeric ip address to its country.
For each country, it gives a range. If the numeric ip address falls within
the range, then the ip address belongs to the corresponding country.

## Columns:

* lower_bound_ip_address : the lower bound of the numeric ip address for that country
* upper_bound_ip_address : the upper bound of the numeric ip address for that country
* country : the corresponding country. If a user has an ip address whose value is between
the lower and upper bound, then she is based in this country.
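
The range rule above can be sketched for a single address with a binary search over the sorted lower bounds. The ranges below are hypothetical toy values (the real ones come from "IpAddress_to_Country"), and `lookup_country` is an illustrative name; the sketch assumes the ranges do not overlap.

```python
from bisect import bisect_right

# Hypothetical ranges, sorted by lower bound.
RANGES = [
    (100, 199, "Country A"),
    (200, 299, "Country B"),
    (500, 599, "Country A"),   # a country can own several ranges
]
LOWER_BOUNDS = [low for low, _, _ in RANGES]

def lookup_country(ip):
    """Return the country whose [lower, upper] range contains ip, else None."""
    pos = bisect_right(LOWER_BOUNDS, ip) - 1   # last range starting at or below ip
    if pos < 0:
        return None                            # ip is below every range
    low, high, country = RANGES[pos]
    return country if ip <= high else None     # ip may fall in a gap between ranges

print(lookup_country(150), lookup_country(350))
```

An address can therefore end up with no country at all, either because it lies outside every range or because it sits in a gap between two of them.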

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
fraud_data = pd.read_csv('Fraud_Data.csv')
ip_address = pd.read_csv('IpAddress_to_Country.csv')

In [2]:
# look at the data
# fraud_data has no null values in any column
fraud_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151112 entries, 0 to 151111
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   user_id         151112 non-null  int64  
 1   signup_time     151112 non-null  object 
 2   purchase_time   151112 non-null  object 
 3   purchase_value  151112 non-null  int64  
 4   device_id       151112 non-null  object 
 5   source          151112 non-null  object 
 6   browser         151112 non-null  object 
 7   sex             151112 non-null  object 
 8   age             151112 non-null  int64  
 9   ip_address      151112 non-null  float64
 10  class           151112 non-null  int64  
dtypes: float64(1), int64(4), object(6)
memory usage: 12.7+ MB


In [3]:
# the ip_address table has no null values either
ip_address.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138846 entries, 0 to 138845
Data columns (total 3 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   lower_bound_ip_address  138846 non-null  float64
 1   upper_bound_ip_address  138846 non-null  int64  
 2   country                 138846 non-null  object 
dtypes: float64(1), int64(1), object(1)
memory usage: 3.2+ MB


In [5]:
# ip_address in fraud_data goes as low as ~5*10^4, but the smallest lower bound in the country table is ~1.7*10^7,
# so some users have an ip address that no country range covers
fraud_data.describe()

Unnamed: 0,user_id,purchase_value,age,ip_address,class
count,151112.0,151112.0,151112.0,151112.0,151112.0
mean,200171.04097,36.935372,33.140704,2152145000.0,0.093646
std,115369.285024,18.322762,8.617733,1248497000.0,0.291336
min,2.0,9.0,18.0,52093.5,0.0
25%,100642.5,22.0,27.0,1085934000.0,0.0
50%,199958.0,35.0,33.0,2154770000.0,0.0
75%,300054.0,49.0,39.0,3243258000.0,0.0
max,400000.0,154.0,76.0,4294850000.0,1.0


In [6]:
ip_address.describe()


Unnamed: 0,lower_bound_ip_address,upper_bound_ip_address
count,138846.0,138846.0
mean,2724532000.0,2724557000.0
std,897521500.0,897497900.0
min,16777220.0,16777470.0
25%,1919930000.0,1920008000.0
50%,3230887000.0,3230888000.0
75%,3350465000.0,3350466000.0
max,3758096000.0,3758096000.0


In [17]:
ip_address['lower_bound_ip_address'].min()

16777216.0

In [20]:
# count the rows whose ip_address is at or below the smallest lower bound (16777216)
n_below = (fraud_data['ip_address'] <= 16777216).sum()
print('# of ip_address at or below the smallest lower bound:', n_below)


# of ip_address at or below the smallest lower bound: 634


In [16]:
# number of distinct countries: only 235, although the ip_address table has 138846 rows
unique_country = ip_address['country'].unique()
unique_country.size

235

In [21]:
ip_address.groupby('country').count()

Unnamed: 0_level_0,lower_bound_ip_address,upper_bound_ip_address
country,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,46,46
Albania,56,56
Algeria,30,30
American Samoa,1,1
Andorra,5,5
...,...,...
Virgin Islands (U.S.),14,14
Wallis and Futuna Islands,2,2
Yemen,12,12
Zambia,26,26


In [22]:
# look at one country: it owns several separate ranges that are far apart from each other
ip_address[ip_address['country']=='Andorra']

Unnamed: 0,lower_bound_ip_address,upper_bound_ip_address,country
17184,1432265000.0,1432272895,Andorra
19219,1538998000.0,1539006463,Andorra
33427,1836016000.0,1836023807,Andorra
53970,3104060000.0,3104061439,Andorra
86442,3265151000.0,3265159167,Andorra


In [27]:
# every row has its own lower bound, so no two ranges start at the same address
unique_low_bound = ip_address['lower_bound_ip_address'].unique()
print('the number of distinct lower bounds of ip address:', unique_low_bound.size)
print('the number of rows in the ip_address table:', ip_address['country'].count())


the number of distinct lower bounds of ip address: 138846
the number of rows in the ip_address table: 138846
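
Distinct lower bounds do not by themselves rule out overlapping ranges. A minimal check, assuming the column names from the table above, compares each lower bound with the largest upper bound seen so far; `ranges_overlap` is a hypothetical helper name and the frames below are toy data.

```python
import pandas as pd

def ranges_overlap(ip_ranges):
    """True if any range starts at or before an earlier range ends."""
    ordered = ip_ranges.sort_values("lower_bound_ip_address")
    # largest upper bound among all earlier rows (NaN for the first row)
    prev_upper = ordered["upper_bound_ip_address"].cummax().shift(1)
    return bool((ordered["lower_bound_ip_address"] <= prev_upper).any())

# Toy tables, invented for illustration only.
clean = pd.DataFrame({
    "lower_bound_ip_address": [200.0, 100.0],
    "upper_bound_ip_address": [299, 199],
})
overlapping = pd.DataFrame({
    "lower_bound_ip_address": [100.0, 150.0],
    "upper_bound_ip_address": [199, 299],
})
print(ranges_overlap(clean), ranges_overlap(overlapping))
```

If `ranges_overlap(ip_address)` comes back `False` on the real table, each address can match at most one range, which is what a sorted lookup relies on.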


In [37]:
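# drop=True discards the old index instead of inserting it as a new column,
# so the cell can be re-run without a "cannot insert level_0, already exists" error
fraud_data.reset_index(drop=True, inplace=True)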

In [None]:
# merge the ip_address table into the fraud table with a range join
# (needs the third-party pandasql package; users with no matching range get a null country)
from pandasql import sqldf

def pysql(query):
    return sqldf(query, globals())

join = '''
select f.user_id, f.signup_time, f.purchase_time, f.purchase_value, f.device_id,
       f.source, f.browser, f.sex, f.age, f.ip_address, f.class,
       i.lower_bound_ip_address, i.upper_bound_ip_address, i.country
from fraud_data f
left join ip_address i
  on f.ip_address between i.lower_bound_ip_address and i.upper_bound_ip_address
'''
fraud_join_data = pysql(join)
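
A range join in SQL compares every user with every range, which can be slow at this size. As a possible alternative, the sketch below uses `pd.merge_asof` to match each address to the nearest lower bound at or below it and then blanks out matches that exceed the upper bound. It assumes the ranges do not overlap; `add_country` is a hypothetical helper name and the frames in the demo are toy data.

```python
import pandas as pd

def add_country(fraud, ip_ranges):
    """Attach range columns and country to fraud rows via a sorted lookup."""
    ranges = ip_ranges.sort_values("lower_bound_ip_address").copy()
    # merge_asof requires both join keys to share a dtype; ip_address is float64
    ranges["lower_bound_ip_address"] = ranges["lower_bound_ip_address"].astype("float64")

    left = fraud.sort_values("ip_address")      # merge_asof needs sorted keys
    merged = pd.merge_asof(
        left, ranges,
        left_on="ip_address", right_on="lower_bound_ip_address",
        direction="backward",                   # last lower bound <= ip_address
    )
    # the matched range only counts if the address is also within its upper bound
    outside = merged["ip_address"] > merged["upper_bound_ip_address"]
    merged["country"] = merged["country"].where(~outside)
    return merged

# Toy data, invented for illustration only.
toy_fraud = pd.DataFrame({"user_id": [1, 2, 3, 4],
                          "ip_address": [150.5, 50.0, 350.0, 250.0]})
toy_ranges = pd.DataFrame({"lower_bound_ip_address": [100.0, 200.0],
                           "upper_bound_ip_address": [199, 299],
                           "country": ["A", "B"]})
toy_result = add_country(toy_fraud, toy_ranges).set_index("user_id")["country"]
print(toy_result)
```

The result comes back ordered by `ip_address` rather than in the original row order, so it would need to be re-sorted or re-joined on `user_id` if order matters.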
