# TalkingData AdTracking Fraud Detection

This dataset was taken from the __[“TalkingData AdTracking Fraud Detection Challenge”](https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection)__ on
Kaggle. The host company, TalkingData, is China’s largest independent big data service
platform. It covers over 70% of active mobile devices nationwide and handles 3 billion clicks
per day, most of which are potentially fraudulent.<br> 
For this challenge, our goal was to build an
algorithm that predicts whether a user will download an application after clicking a mobile
advertisement.
<br><br>
We'll start by reading in the data (data files can be downloaded __[here](https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data)__).

In [None]:
# Import libraries for the project
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

from sklearn.utils import resample
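from sklearn import metrics
# sklearn.cross_validation was removed in scikit-learn 0.20;
# sklearn.model_selection provides the same helpers, so we alias it here
from sklearn import model_selection as cross_validation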
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb

In [None]:
# Path to the data folder
DATA_PATH = r"C:\Users\reio\Documents\GitHub\data-science-portfolio\TalkingData\data"

# Function to load in data
def load_data(data_path=DATA_PATH):
    # PATHS TO FILE
    train_path = os.path.join(data_path, "train.csv")
    test_path = os.path.join(data_path, "test.csv")

    return pd.read_csv(train_path), pd.read_csv(test_path)

# Load in train and test sets
train, test = load_data()

## Data Exploration
The dataset provided by TalkingData covers approximately 200 million clicks. The whole dataset, ordered by timestamp, was split into a train set, whose first click was on 2017-11-06, and a test set, whose first click was on 2017-11-10. The train set covers all the clicks during a 3-day period, and the test set covers all the clicks on the next day.<br>

### Data Description

In [None]:
### Take a look at the train set ###
print(train.head())
# Describe train
print(train.dtypes)
print(train.max())

### Take a look at the test set ###
print(test.head())
# Describe test
print(test.dtypes)
print(test.max())

The train set contains 8 columns.<br>
The test set contains 7 columns: it omits **is_attributed**, which is the
binary response variable that needs to be predicted, and **attributed_time**, and it
adds an ID column (**click_id**) to identify each observation.

In [None]:
# Check for missing values
print(train.isnull().sum())
# Extract data where is_attributed == 1
train_att = train[train['is_attributed']==1]
# Check NAs
print(train_att.isnull().sum())

Inspecting the **attributed_time** feature, we found most of the observations to be missing.
However, this was due to the fact that most of the clicks did not convert into downloads (i.e.
**is_attributed** equals 0). This feature is only available when a click was converted into a
download, which is also why it was missing in the test data set.

In [None]:
# Percentage of is_attributed == 1
p = len(train_att)/len(train)
print(len(train_att))
print('The percentage of converted clicks is {num:.10%}'.format(num=p))

This dataset is highly imbalanced, with 99.75% of the observations belonging to the negative class and only 0.25% of the observations in the positive class.

In [None]:
# Plot the proportion of clicks that converted into a download or not
plt.figure(figsize=(6,6))
#sns.set(font_scale=1.2)
mean = (train.is_attributed.values == 1).mean()
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
ax = sns.barplot(x=['Converted (1)', 'Not Converted (0)'], y=[mean, 1-mean])
ax.set(ylabel='Proportion', title='Proportion of clicks converted into app downloads')
for p, uniq in zip(ax.patches, [mean, 1-mean]):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height+0.01,
            '{}%'.format(round(uniq * 100, 2)),
            ha="center")

### Feature Observation

In [None]:
# For IP 6
ip_6 = train[train['ip'] == 6]
print(ip_6)

The feature **ip** contains different IP addresses. However, it may not be appropriate
to treat each IP as a unique user. IP addresses are easily generated, and
we would not be able to tell a real IP from a fake one. <br>
We can also see that a single IP can be associated with different devices on different operating systems,
which suggests that the IP may be a shared network address rather than a single user. This
was a motivation to treat each **ip_device_os** combination as one single user.
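
To make the idea concrete, here is a minimal sketch on a made-up click log. The column names follow the train set, but the values and the helper names `users_per_ip` and `n_device_os` are assumptions for illustration only. It counts the distinct device/OS pairs behind each IP and labels every **ip_device_os** combination with one integer id.

```python
import pandas as pd

# Toy click log; the values are made up for illustration.
clicks = pd.DataFrame({
    "ip":     [6, 6, 6, 6, 7, 7],
    "device": [1, 1, 2, 1, 1, 1],
    "os":     [13, 19, 13, 13, 13, 13],
})

# Number of distinct (device, os) pairs seen behind each IP.
users_per_ip = (
    clicks.drop_duplicates(["ip", "device", "os"])
          .groupby("ip")
          .size()
          .rename("n_device_os")
)

# Give each (ip, device, os) combination one integer id, a stand-in for "one user".
clicks["ip_device_os"] = clicks.groupby(["ip", "device", "os"]).ngroup()

print(users_per_ip)
print(clicks)
```

An IP with several device/OS pairs, like IP 6 in this toy example, is more plausibly a shared network address than a single person.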

In [None]:
# Inspecting behaviors of click_time and attributed_time


**click_time** and **attributed_time** have the same patterns.
Along with the fact that most of **attributed_time** is N/A, it seems reasonable to drop the **attributed_time** feature and only extract information from the **click_time** feature. <br><br>
As the behaviors of the converted and non-converted clicks are possibly different, let's take a look at the conversion rate, which is the proportion of download counts over the total click counts.
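
As a rough sketch of that definition, the conversion rate per hour can be computed as downloads divided by clicks. The toy timestamps below are made up, and the variable names (`hourly`, `conversion_rate`) are our own; this is not necessarily the code behind the original plots.

```python
import pandas as pd

# Toy click log with made-up timestamps and labels.
clicks = pd.DataFrame({
    "click_time": pd.to_datetime([
        "2017-11-07 00:05:00", "2017-11-07 00:40:00",
        "2017-11-07 01:10:00", "2017-11-07 01:15:00",
        "2017-11-07 01:30:00", "2017-11-07 01:55:00",
    ]),
    "is_attributed": [0, 1, 0, 0, 0, 1],
})

# Group by the hour of the click and count clicks and downloads.
hour = clicks["click_time"].dt.hour.rename("hour")
hourly = clicks.groupby(hour)["is_attributed"].agg(clicks="count", downloads="sum")

# Conversion rate = downloads / total clicks in that hour.
hourly["conversion_rate"] = hourly["downloads"] / hourly["clicks"]
print(hourly)
```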

In [None]:
# Behavior of click_time


We can see a daily seasonal pattern in **click_time**, so it's reasonable for
us to use **hour** and **wday** (day of the week) as new features in data analysis. Since our data
only covers a 4-day span, the year, month, and week are the same for all observations,
so we don't need them for prediction, and the **click_time** feature is dropped once the
necessary information has been extracted.

### Data Preprocessing

In [None]:
# Extract features from click_time
def ppClicktime(df):
    df['click_time'] = pd.to_datetime(df['click_time'])
    df['wday'] = df['click_time'].dt.dayofweek
    df['hour'] = df['click_time'].dt.hour
    return df

# Pre-processed training and testing sets
train_pp = ppClicktime(train)
test_pp = ppClicktime(test)

# Drop click_time
train_pp.drop('click_time', axis = 1, inplace = True)
test_pp.drop('click_time', axis = 1, inplace = True)

# Drop attributed_time
train_pp.drop('attributed_time', axis = 1, inplace = True)

# Write to csv
train_pp.to_csv(os.path.join(DATA_PATH,"train_pp.csv"),index=None)
test_pp.to_csv(os.path.join(DATA_PATH,"test_pp.csv"),index=None)

Our data set is highly imbalanced, as described in the previous section. In classification
problems, most classifiers, like logistic regression and decision trees, work best when the class
distribution of the response variable is balanced. However, in real-world
problems like fraud detection, datasets are often highly imbalanced, and it is
difficult to derive a meaningful predictive model due to the lack of information
about the minority class. <br><br>
There are two popular solutions to deal with imbalanced data:
oversampling and undersampling. Considering the nature of this dataset (a high number of
observations, which would require a lot of computing power), we'll go with
undersampling, which randomly discards majority samples (negative observations in
**is_attributed**) to achieve an equal distribution with the minority class (positive observations
in **is_attributed**). <br><br>
However, this means that we're potentially losing useful information in our original train
dataset. Therefore, we'll also perform our modeling methods on two other subsets, including
the first 10 or 50 million observations in the given train set. Even if these subsets are
still imbalanced, we're hoping to extract some useful information, while still keeping the
computation capabilities in check.

In [None]:
### Create balanced train set ###

# Separate the 2 classes
t0 = train_pp[train_pp['is_attributed'] == 0]
t1 = train_pp[train_pp['is_attributed'] == 1]

# Undersample class 0 (without replacement)
t0_udsp = resample(t0, replace=False, n_samples=len(t1), random_state=142) 

# Combine minority class with downsampled majority class
train_1m = pd.concat([t0_udsp, t1])
 
# Display new class counts
print(train_1m.is_attributed.value_counts())

In [None]:
# Function to load in the first ssize rows of pre-processed train data
def load_train_pp(ssize, data_path=DATA_PATH):
    train_pp_path = os.path.join(data_path, "train_pp.csv")
    return pd.read_csv(train_pp_path,nrows=ssize)

In [None]:
### Load in pre-processed train subsets with the first 10 and 50 million observations ###
train_10m = load_train_pp(ssize=10000000)
train_50m = load_train_pp(ssize=50000000)

To obtain the latent information of the conversion rate, we'll aggregate the training data
on different groups of features. I tried several combinations of two-way and three-way
interactions. For example, the feature **ip_os_count** means that we grouped the data by **ip** and
**os** and counted the number of clicks in each group. I paid close attention to the features aggregated on
hour, as this might be an important feature, and those that were aggregated on
hour proved to be significant in our models. After trying these combinations, our final
predictive model includes 20 features.
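
As a small illustration of what one of these count features looks like, here is a sketch on a toy frame. The helper `add_count_feature` is hypothetical and uses `groupby(...).transform("count")` as a shortcut; the notebook's own function builds the same kind of count with a group-by followed by a merge.

```python
import pandas as pd

def add_count_feature(df, group_cols, name):
    """Return a copy of df with a column holding the number of clicks per group."""
    out = df.copy()
    # Every row receives the size of the group it belongs to.
    out[name] = out.groupby(group_cols)["channel"].transform("count")
    return out

# Toy data; the values are made up for illustration.
clicks = pd.DataFrame({
    "ip":      [6, 6, 6, 7],
    "os":      [13, 13, 19, 13],
    "channel": [101, 102, 101, 101],
})

clicks = add_count_feature(clicks, ["ip", "os"], "ip_os_count")
print(clicks)
```

Here the two clicks from IP 6 on OS 13 both get `ip_os_count = 2`, while the other rows get 1.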

In [None]:
# Function to add new aggregated features
def aggregate_features(df):
    # (new feature name, columns to group by); each feature is the click count per group
    count_features = [
        ('n_ip', ['ip']),
        ('ip_app_count', ['ip', 'app']),
        ('ip_device_count', ['ip', 'device']),
        ('ip_os_count', ['ip', 'os']),
        ('ip_wday_hour', ['ip', 'wday', 'hour']),
        ('ip_app_hour', ['ip', 'app', 'hour']),
        ('ip_device_hour', ['ip', 'device', 'hour']),
        ('ip_os_hour', ['ip', 'os', 'hour']),
        ('ip_os_device_hour', ['ip', 'os', 'device', 'hour']),
        ('ip_app_device_hour', ['ip', 'app', 'device', 'hour']),
        ('ip_os_device', ['ip', 'os', 'device']),
        ('ip_app_device', ['ip', 'app', 'device']),
    ]
    for name, group_cols in count_features:
        # Count clicks per group, then attach the count to every row of that group
        counts = (df[group_cols + ['channel']]
                  .groupby(by=group_cols)[['channel']].count()
                  .reset_index()
                  .rename(columns={'channel': name}))
        df = df.merge(counts, on=group_cols, how='left')
    return df

Let's investigate the correlation matrix of these features.
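
A minimal sketch of such a check, assuming a small made-up feature table: `DataFrame.corr()` gives the pairwise Pearson correlations, and we list the pairs above a threshold. The 0.9 cutoff is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy feature table; ip_os_count is exactly half of n_ip, so the two are perfectly correlated.
features = pd.DataFrame({
    "n_ip":        [10, 20, 30, 40, 50],
    "ip_os_count": [5, 10, 15, 20, 25],
    "hour":        [3, 1, 4, 1, 5],
})

# Pairwise Pearson correlation between all columns.
corr = features.corr()

# Collect the feature pairs whose absolute correlation exceeds the threshold.
threshold = 0.9
cols = list(corr.columns)
high = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(2))
print(high)
```

On the real aggregated data, the same matrix could be passed to `sns.heatmap(corr)` for a visual overview.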

In [None]:
# Correlation matrix

In [None]:
## Aggregate data (full dataset)
#train_ag = aggregate_features(train_pp)
test_ag = aggregate_features(test_pp)
## Write to csv
#train_ag.to_csv("train_ag.csv",index=None)
#test_ag.to_csv("test_ag.csv",index=None)

# 1 million subset
train_1m_ag = aggregate_features(train_1m)
# 10 million subset
train_10m_ag = aggregate_features(train_10m)
# 50 million subset
train_50m_ag = aggregate_features(train_50m)

Let's drop **ip** since we are not going to use this feature in prediction.

In [None]:
train_1m_ag = train_1m_ag.drop(['ip'],axis=1)
train_10m_ag = train_10m_ag.drop(['ip'],axis=1)
train_50m_ag = train_50m_ag.drop(['ip'],axis=1)

## Modeling Methods
### Logistic Regression
Let's first perform Logistic Regression (with 5-fold cross-validation) on our balanced train data. Since our data is sparse, we'll apply LASSO (L1) regularization.

In [None]:
# Separate response variables from predictors
y = list(train_1m_ag.is_attributed)
X = train_1m_ag.drop(['is_attributed'],axis=1)

In [None]:
# Logistic Regression model
logreg = LogisticRegression(penalty='l1', solver='liblinear')
y_pred = cross_validation.cross_val_predict(logreg, X, y, cv=5)
print(metrics.accuracy_score(y, y_pred))
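
Accuracy is only one view of the model. Since the LightGBM parameters later in this notebook track AUC, a comparable score for logistic regression can be obtained from cross-validated probabilities. The sketch below uses synthetic data from `make_classification` as a stand-in for the balanced train set, so the names `X_demo`, `y_demo` and the resulting score are illustrative assumptions, not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic, roughly balanced binary problem (stand-in for the real features).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4,
    n_clusters_per_class=1, random_state=0,
)

# Same model family as above: L1-penalized logistic regression.
logreg_demo = LogisticRegression(penalty="l1", solver="liblinear")

# Out-of-fold probability of the positive class from 5-fold cross-validation.
proba = cross_val_predict(
    logreg_demo, X_demo, y_demo, cv=5, method="predict_proba"
)[:, 1]

auc = roc_auc_score(y_demo, proba)
print("Cross-validated AUC: {:.3f}".format(auc))
```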

### Random Forest
Logistic Regression did not give very good results. Let's try Random Forest.

In [None]:
# Random Forest model
Ntree = 500
rfc = RandomForestClassifier(n_estimators=Ntree)
y_pred = cross_validation.cross_val_predict(rfc, X, y, cv=5)
print(metrics.accuracy_score(y, y_pred))

### LightGBM
Neither Logistic Regression nor Random Forest gave good results, so we will try LightGBM, a Gradient Boosting Decision Tree (GBDT) algorithm that has been widely used in recent years, especially in Kaggle competitions. GBDT methods, especially LightGBM, have proved efficient and accurate compared with common ensemble techniques like random forest.

In [None]:
# Define the LightGBM model parameters
target = 'is_attributed'
predictors = ['device', 'app', 'os', 'channel', 'wday', 'hour',
              'n_ip', 'ip_app_count', 'ip_device_count', 'ip_os_count',
              'ip_wday_hour', 'ip_app_hour', 'ip_device_hour', 
              'ip_os_hour', 'ip_os_device_hour']
categorical = ['app', 'device', 'os', 'channel', 'wday', 'hour']
params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'auc',
        'learning_rate': 0.1,
        'num_leaves': 255,  
        'max_depth': 8,  
        'min_child_samples': 100,  
        'max_bin': 100,  
        'subsample': 0.7,  
        'subsample_freq': 1,  
        'colsample_bytree': 0.7,  
        'min_child_weight': 0,  
        'subsample_for_bin': 200000,  
        'min_split_gain': 0,  
        'reg_alpha': 0,  
        'reg_lambda': 0,  
        # 'nthread': 8,
        'verbose': 0,
    }

In [None]:
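# Set up the lightGBM model
def lgb_train(X_train, X_val, y_train, y_val):
    # Datasets for training and validation
    dtrain = lgb.Dataset(X_train[predictors].values, label=y_train,
                      feature_name=predictors,
                      categorical_feature=categorical
                      )
    dvalid = lgb.Dataset(X_val[predictors].values, label=y_val,
                      feature_name=predictors,
                      categorical_feature=categorical
                      )
    # Per-iteration metric values are recorded in this dictionary
    evals_results = {}
    # Callbacks replace the early_stopping_rounds, verbose_eval and evals_result
    # keyword arguments, which lgb.train no longer accepts as of LightGBM 4.0
    lgb_model = lgb.train(params,
                 dtrain,
                 valid_sets=[dtrain, dvalid],
                 valid_names=['train','valid'],
                 num_boost_round=350,
                 callbacks=[lgb.early_stopping(stopping_rounds=30),
                            lgb.log_evaluation(period=1),
                            lgb.record_evaluation(evals_results)],
                 feval=None)
    return lgb_model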

Let's first try it on the 10 million dataset.

In [None]:
# Separate response variables from predictors
y = list(train_10m_ag.is_attributed)
X = train_10m_ag.drop(['is_attributed'],axis=1)
# Split the training data into training and validation sets (80/20 hold-out split)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2)

In [None]:
# Train LightGBM model
lgb_model = lgb_train(X_train, X_val, y_train, y_val)
# Predict on test dataset and write out submission file
test2 = test_ag.drop(['click_id','ip'],axis=1)
y_submit = lgb_model.predict(test2[predictors],num_iteration=lgb_model.best_iteration)
test_ag['is_attributed'] = y_submit
ans = test_ag[['click_id', 'is_attributed']]
ans.to_csv('submission.csv', index=None)

Now let's try it on the 50 million dataset.

In [None]:
# Separate response variables from predictors
y = list(train_50m_ag.is_attributed)
X = train_50m_ag.drop(['is_attributed'],axis=1)
# Split the training data into training and validation sets (80/20 hold-out split)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2)

In [None]:
# Train LightGBM model
lgb_model = lgb_train(X_train, X_val, y_train, y_val)
# Predict on test dataset and write out submission file
test2 = test_ag.drop(['click_id','ip'],axis=1)
y_submit = lgb_model.predict(test2[predictors],num_iteration=lgb_model.best_iteration)
test_ag['is_attributed'] = y_submit
ans = test_ag[['click_id', 'is_attributed']]
ans.to_csv('submission.csv', index=None)

## Results and Discussion
We've seen that the newly created features were correlated with one another, and some of the original features were correlated as well (device and os). This explains why logistic regression was not a good model choice, even with LASSO regularization. Tree-based models cope with this high multicollinearity by constructing trees in a greedy manner and using information from all the newly created features. LightGBM proved to be the preferable method in terms of time, memory efficiency, and accuracy. <br><br>
The AUC score increased as we used more observations in training. I tried to deal with the imbalanced data problem in our training dataset by creating a balanced, undersampled train dataset. However, this left us with very little information, as most of the majority samples were discarded. Using more samples and adjusting the weight of the minority class in the GBDT methods resulted in better accuracy. This was another advantage of GBDT that was not available to us with methods like logistic regression.
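
One common way to adjust the minority-class weight is the ratio of negative to positive samples. The sketch below is an assumption about how such a weight could be set, not a record of the exact setting used here: the helper `minority_weight` is hypothetical, and `scale_pos_weight` is the LightGBM parameter it would feed. The toy labels mirror the 0.25% positive rate observed in the train set.

```python
import numpy as np

def minority_weight(labels):
    """Ratio of negative to positive samples, usable as a positive-class weight."""
    labels = np.asarray(labels)
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    if n_pos == 0:
        raise ValueError("labels contain no positive samples")
    return n_neg / n_pos

# Toy labels: 1 positive in 400 observations, i.e. 0.25% positives.
y_demo = [1] + [0] * 399
weight = minority_weight(y_demo)

# Hypothetical parameter dictionary showing where the weight would go.
params_weighted = {
    "objective": "binary",
    "metric": "auc",
    "scale_pos_weight": weight,
}
print(params_weighted)
```

With 0.25% positives the ratio is 399, so each positive click counts roughly as much as 399 negative ones in the loss.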