Company XYZ is an e-commerce site that sells hand-made clothes.

You have to build a model that predicts whether a user is likely to use the site for illegal activity. This is a very common task for data scientists.

You only have information about the user's first transaction on the site, and based on that you have to make your classification ("fraud/no fraud").

In [1]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

In [2]:
fraud_data = pd.read_csv("Fraud_Data.csv")
ip2country = pd.read_csv("IpAddress_to_Country.csv")

In [3]:
fraud_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151112 entries, 0 to 151111
Data columns (total 11 columns):
user_id           151112 non-null int64
signup_time       151112 non-null object
purchase_time     151112 non-null object
purchase_value    151112 non-null int64
device_id         151112 non-null object
source            151112 non-null object
browser           151112 non-null object
sex               151112 non-null object
age               151112 non-null int64
ip_address        151112 non-null float64
class             151112 non-null int64
dtypes: float64(1), int64(4), object(6)
memory usage: 12.7+ MB


In [4]:
ip2country.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138846 entries, 0 to 138845
Data columns (total 3 columns):
lower_bound_ip_address    138846 non-null float64
upper_bound_ip_address    138846 non-null int64
country                   138846 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 3.2+ MB


# Question 1
For each user, determine her country based on the numeric IP address.

In [5]:
# use binary search to map each numeric IP address to a country
import bisect
class ipLookUpTable:
    def __init__(self, ip_df):
        '''
        input: 
            ip_df: pd.DataFrame with the lower bound, upper bound and country name of each IP range
        '''
        self._nrows = ip_df.shape[0]
        # pad both ends with an 'Unknown' entry for out-of-range IPs
        self._ip_lowbds = [0 for _ in range(self._nrows+2)]
        self._countries = ['Unknown' for _ in range(self._nrows+2)]
        
        for r in range(1, self._nrows + 1):
            self._ip_lowbds[r] = ip_df.iloc[r-1, 0]
            self._countries[r] = ip_df.iloc[r-1, 2]
            # assume the file is sorted in ascending order of the lower bounds
            assert self._ip_lowbds[r] > self._ip_lowbds[r - 1]
    
    def fmap_ip_country(self, ip):
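        # find the entry just left of ip's insertion point; upper bounds are never
        # checked, so an IP in a gap between two ranges gets the preceding range's country
        index = bisect.bisect(self._ip_lowbds, ip) - 1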
        assert ip >= self._ip_lowbds[index] and \
        (index == self._nrows + 1 or ip < self._ip_lowbds[index + 1])
        return self._countries[index]        

In [6]:
cmap_ip_country = ipLookUpTable(ip_df=ip2country)
fraud_data['country'] = fraud_data.ip_address.map(cmap_ip_country.fmap_ip_country)

In [7]:
fraud_data.country.value_counts().head(5)

United States     59222
Unknown           20017
China             12038
Japan              7918
United Kingdom     4492
Name: country, dtype: int64

In [8]:
fraud_data.country.value_counts().tail(5)

Cape Verde                        1
Burundi                           1
British Indian Ocean Territory    1
Bonaire; Sint Eustatius; Saba     1
South Sudan                       1
Name: country, dtype: int64
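
The lookup in cell In [5] compares only the lower bounds of each range. Below is a minimal sketch of a variant that also checks the upper bound, assuming the bounds are inclusive and the ranges do not overlap. The helper names `build_lookup` and `ip_to_country` and the toy ranges are hypothetical, not part of the notebook.

```python
import bisect

def build_lookup(ranges):
    """ranges: iterable of (lower, upper, country) tuples, assumed non-overlapping."""
    ordered = sorted(ranges)
    lowers = [lower for lower, _, _ in ordered]
    return lowers, ordered

def ip_to_country(ip, lowers, ordered, default="Unknown"):
    # Rightmost range whose lower bound is <= ip.
    i = bisect.bisect_right(lowers, ip) - 1
    if i < 0:
        return default  # ip is below every range
    _, upper, country = ordered[i]
    # Assumption: upper bounds are inclusive.
    return country if ip <= upper else default

# Toy ranges with a gap between 199 and 300 (made-up numbers).
toy_ranges = [(100, 199, "A"), (300, 399, "B")]
lowers, ordered = build_lookup(toy_ranges)
results = [ip_to_country(ip, lowers, ordered) for ip in (50, 100, 199, 250, 399, 400)]
print(results)
```

With the real tables, the same idea would apply to the three columns of `ip2country`. IPs that fall in a gap between ranges would then be labelled `Unknown`, so the counts could differ from the ones shown above.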

# Question 2
Build a model to predict whether an activity is fraudulent or not. Explain how different assumptions about the cost of false positives vs false negatives would impact the model.
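
One way to reason about the cost question is to treat the decision threshold on the predicted fraud score as the thing that changes. The sketch below picks the threshold with the lowest total cost on a handful of made-up labels and scores. The functions `total_cost` and `best_threshold`, the cost values and the candidate thresholds are all hypothetical.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    # A transaction is flagged as fraud when its score reaches the threshold.
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    return cost_fp * fp + cost_fn * fn

def best_threshold(y_true, scores, cost_fp, cost_fn, candidates):
    # Candidate threshold with the lowest total misclassification cost.
    return min(candidates,
               key=lambda t: total_cost(y_true, scores, t, cost_fp, cost_fn))

# Made-up labels (1 = fraud) and model scores, for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.4, 0.6, 0.5, 0.9]
candidates = [0.3, 0.55, 0.8]

# Missing a fraud is 10x worse than a false alarm: prefer a low threshold.
t_fn_costly = best_threshold(y_true, scores, cost_fp=1, cost_fn=10, candidates=candidates)
# A false alarm is 10x worse than a missed fraud: prefer a high threshold.
t_fp_costly = best_threshold(y_true, scores, cost_fp=10, cost_fn=1, candidates=candidates)
print(t_fn_costly, t_fp_costly)
```

On this toy data, expensive false negatives push the threshold down (more users flagged), while expensive false positives push it up (fewer users flagged).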

## Data validation and cleaning

In [9]:
fraud_data.isna().sum()

user_id           0
signup_time       0
purchase_time     0
purchase_value    0
device_id         0
source            0
browser           0
sex               0
age               0
ip_address        0
class             0
country           0
dtype: int64

In [10]:
np.any(fraud_data.age > 100)

False

## Feature engineering

In [11]:
fraud_data.head(5)

|   | user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22058 | 2015-02-24 22:55:49 | 2015-04-18 02:47:11 | 34 | QVPSPJUOCKZAR | SEO | Chrome | M | 39 | 732758400.0 | 0 | Japan |
| 1 | 333320 | 2015-06-07 20:39:50 | 2015-06-08 01:38:54 | 16 | EOGFQPIZPYXFZ | Ads | Chrome | F | 53 | 350311400.0 | 0 | United States |
| 2 | 1359 | 2015-01-01 18:52:44 | 2015-01-01 18:52:45 | 15 | YSSKYOSJHPPLJ | SEO | Opera | M | 53 | 2621474000.0 | 1 | United States |
| 3 | 150084 | 2015-04-28 21:13:25 | 2015-05-04 13:54:50 | 44 | ATGTXKYKUDUQN | SEO | Safari | M | 41 | 3840542000.0 | 0 | Unknown |
| 4 | 221365 | 2015-07-21 07:09:52 | 2015-09-09 18:40:53 | 39 | NAUITBZFJKHWW | Ads | Safari | M | 45 | 415583100.0 | 0 | United States |


After looking at the data itself, I decided to engineer the following features:
- whether the user signs up and then purchases immediately
- how many times each unique device_id is used
- how many times each unique IP address is used
- whether the user comes from a very rare country or an unknown IP (derived from the IP address), encoded as a binary variable
- once all the features are built, reduce the table size
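
The rare-or-unknown country flag from the list above could be sketched like this. It is only an illustration on toy data: the cutoff `min_users` and the column name `rare_or_unknown` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Toy stand-in for the `country` column produced in Question 1.
toy = pd.DataFrame({
    "country": ["United States", "United States", "United States",
                "Japan", "Japan", "Burundi", "Unknown"]
})

min_users = 2  # hypothetical cutoff: fewer users than this counts as "rare"

# Number of users sharing each row's country.
users_per_country = toy["country"].map(toy["country"].value_counts())

# 1 if the country is rare or the IP could not be mapped, else 0.
toy["rare_or_unknown"] = (
    (users_per_country < min_users) | (toy["country"] == "Unknown")
).astype(int)
print(toy)
```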

All right, this field is really boring to me, and I believe this project is a waste of my time: AML is a very well-developed field, and I have no interest in learning more about it. A lot more could be done here than just playing with this well-formalized "data", so I will jump to the next task.

Everything that follows is copied from this nice repo: [link](https://github.com/stasi009/TakeHomeDataChallenges/blob/master/04.FraudActivity/fraud_activity.ipynb)

In [12]:
datas = fraud_data

In [13]:
datas['signup_time'] = pd.to_datetime(datas.signup_time)
datas['purchase_time'] = pd.to_datetime(datas.purchase_time)

# a purchase made immediately after signup is very suspicious
datas['interval_after_signup'] = (datas.purchase_time - datas.signup_time).dt.total_seconds()
datas.drop(["signup_time", "purchase_time"], axis=1, inplace=True)

# how many times each device is shared
n_dev_shared = datas.device_id.value_counts()
# each row is a user's first transaction, so the more a device is shared,
# the more suspicious it is
datas['n_dev_shared'] = datas.device_id.map(n_dev_shared)
del datas['device_id']

# how many times each IP address is shared
n_ip_shared = datas.ip_address.value_counts()
# each row is a user's first transaction, so the more an IP is shared,
# the more suspicious it is
datas['n_ip_shared'] = datas.ip_address.map(n_ip_shared)
del datas['ip_address']

# how many users come from the same country
n_country_shared = datas.country.value_counts()
# the fewer visits from a country, the more suspicious
datas['n_country_shared'] = datas.country.map(n_country_shared)
del datas['country']

datas.head()  # quick glance at the result

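|   | user_id | purchase_value | source | browser | sex | age | class | interval_after_signup | n_dev_shared | n_ip_shared | n_country_shared |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22058 | 34 | SEO | Chrome | M | 39 | 0 | 4506682.0 | 1 | 1 | 7918 |
| 1 | 333320 | 16 | Ads | Chrome | F | 53 | 0 | 17944.0 | 1 | 1 | 59222 |
| 2 | 1359 | 15 | SEO | Opera | M | 53 | 1 | 1.0 | 12 | 12 | 59222 |
| 3 | 150084 | 44 | SEO | Safari | M | 41 | 0 | 492085.0 | 1 | 1 | 20017 |
| 4 | 221365 | 39 | Ads | Safari | M | 45 | 0 | 4361461.0 | 1 | 1 | 59222 |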


In [14]:
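datas['is_male'] = (datas.sex == 'M').astype(int)
del datas['sex']
# one-hot encode the categorical columns, then drop one level of each as the baseline
datas = pd.get_dummies(datas,columns=['source','browser'])
del datas['source_Direct']
del datas['browser_Opera']
# 'class' is a reserved keyword in Python, so rename the label column
datas.rename(columns={'class':'is_fraud'},inplace=True)
# user_id is already a regular column, so write the file without the positional index
# (index_label="user_id" would produce a second column with the same name)
datas.to_csv("fraud_cleaned.csv",index=False)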

## Train the model

let's just ignore this part
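
For readers who still want a starting point, here is a minimal sketch of what this step might look like. It runs on synthetic data rather than on `fraud_cleaned.csv`: the toy distributions, the feature subset and the hyperparameters are all assumptions chosen for illustration, and nothing here is a result from the actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic labels (about 10% fraud) and features that loosely mimic
# the engineered columns above. None of this is real data.
is_fraud = rng.binomial(1, 0.1, size=n)
toy = pd.DataFrame({
    "interval_after_signup": np.where(
        is_fraud == 1, rng.exponential(60.0, n), rng.exponential(3e6, n)),
    "n_dev_shared": np.where(
        is_fraud == 1, rng.integers(1, 15, n), rng.integers(1, 3, n)),
    "purchase_value": rng.integers(10, 100, n),
    "is_fraud": is_fraud,
})

X = toy.drop(columns="is_fraud")
y = toy["is_fraud"]

# Stratify so both splits keep the same fraud rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# class_weight="balanced" reweights the rare fraud class during training.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0)
model.fit(X_train, y_train)

# Predicted probability of the fraud class for each held-out row.
fraud_proba = model.predict_proba(X_test)[:, 1]
```

The `fraud_proba` scores are what a cost-based threshold would be applied to: a lower cutoff flags more users, a higher cutoff flags fewer.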