# Introduction
Several aspects of this competition make it attractive to me:
- The data set is very large: 240 million rows
- The classes are imbalanced
- ROC-AUC is the evaluation metric
- It is a real-life problem
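
As a small illustration of why ROC-AUC suits imbalanced data better than accuracy, here is a minimal sketch on made-up labels. The helper `roc_auc` and the toy arrays are hypothetical and not part of the competition code; the helper uses the rank-sum formulation of AUC with average ranks for ties.

```python
import numpy as np
import pandas as pd

def roc_auc(y_true, y_score):
    """ROC-AUC via the rank-sum (Mann-Whitney U) formulation."""
    y_true = np.asarray(y_true).astype(bool)
    # Series.rank() assigns 1-based ranks and averages ties by default
    ranks = pd.Series(y_score).rank().to_numpy()
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    u_stat = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)

# Toy data (hypothetical): 1000 clicks, only 2 of them attributed
y = np.zeros(1000, dtype=int)
y[:2] = 1

# A "model" that always predicts 0 looks excellent by accuracy...
always_negative = np.zeros(1000)
accuracy = (always_negative == y).mean()

# ...but ROC-AUC shows it cannot rank positives above negatives
auc_constant = roc_auc(y, always_negative)

# A score that ranks both positives first reaches the maximum AUC
auc_perfect = roc_auc(y, y.astype(float))

print(accuracy, auc_constant, auc_perfect)
```

Accuracy rewards the constant prediction because almost every label is 0, while ROC-AUC only rewards a model that orders attributed clicks ahead of the rest.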

# Data EDA
Refer to the following discussion threads:
https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/56283

https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/56328

https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/56268

https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/56406

https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/56325

https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/56481

https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/56429

https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/56422

https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/56475#latest-436248

https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/56262#latest-517655

# Load Data

In [100]:
import gc
import os
import numpy as np
import pandas as pd
import subprocess
import matplotlib.pyplot as plt
import xgboost as xgb

In [99]:
# check file size
print('Check File Sizes')
for f in os.listdir('./data'):
    if 'zip' not in f:
        print(f.ljust(30) + str(round(os.path.getsize('./data/' + f) / 1000000, 2)) + 'MB')

Check File Sizes
test.csv                      863.27MB
train.csv                     7537.65MB
train_sample.csv              4.08MB
test_supplement.csv           2665.54MB
.ipynb_checkpoints            0.0MB


In [101]:
# check total number of lines of each file
print('Check Line Count:')
for file in ['train.csv', 'test.csv', 'train_sample.csv']:
    lines = subprocess.run(['wc', '-l', './data/{}'.format(file)], stdout=subprocess.PIPE).stdout.decode('utf-8')
    print(lines, end='', flush=True)

Check Line Count:
184903891 ./data/train.csv
18790470 ./data/test.csv
100001 ./data/train_sample.csv
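
The `wc -l` call above depends on a Unix-like shell. A portable alternative, sketched here under the assumption that counting newline bytes is enough, reads the file in binary chunks. The function name `count_lines` and the throwaway demo file are illustrative only.

```python
import os
import tempfile

def count_lines(path, chunk_size=1 << 20):
    """Count newline characters by reading the file in binary chunks."""
    count = 0
    with open(path, 'rb') as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b'\n')
    return count

# Demo on a throwaway file: a header plus two data rows
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('ip,app\n1,2\n3,4\n')
    demo_path = tmp.name

n_lines = count_lines(demo_path)
os.remove(demo_path)
print(n_lines)
```

Like `wc -l`, this counts newline characters, so the result includes the header row. Reading in chunks keeps memory use flat even for a multi-gigabyte file.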


The total number of rows is huge, so load only the first 1 million rows of each file for analysis.

In [104]:
# Load sample training data
df_train = pd.read_csv('./data/train.csv', nrows=1000000, parse_dates=['click_time'])
df_test = pd.read_csv('./data/test.csv', nrows=1000000, parse_dates=['click_time'])

# Show head
print(df_train.head())

# show shape
print(df_test.shape)

       ip  app  device  os  channel          click_time attributed_time  \
0   83230    3       1  13      379 2017-11-06 14:32:21             NaN   
1   17357    3       1  19      379 2017-11-06 14:33:34             NaN   
2   35810    3       1  13      379 2017-11-06 14:34:12             NaN   
3   45745   14       1  13      478 2017-11-06 14:34:52             NaN   
4  161007    3       1  13      379 2017-11-06 14:35:08             NaN   

   is_attributed  
0              0  
1              0  
2              0  
3              0  
4              0  
(1000000, 7)
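
Besides limiting the number of rows, memory can also be reduced by passing explicit dtypes to `pd.read_csv`. The sketch below uses a two-row in-memory CSV so that it runs on its own. The dtype choices are assumptions based on the value ranges visible in the sample above and should be checked against the full data.

```python
import io
import pandas as pd

# Assumed dtypes (not verified against the full data set)
dtypes = {
    'ip': 'uint32',
    'app': 'uint16',
    'device': 'uint16',
    'os': 'uint16',
    'channel': 'uint16',
    'is_attributed': 'uint8',
}

# Stand-in for './data/train.csv'; the rows are copied from the head() output
csv_text = (
    "ip,app,device,os,channel,click_time,attributed_time,is_attributed\n"
    "83230,3,1,13,379,2017-11-06 14:32:21,,0\n"
    "17357,3,1,19,379,2017-11-06 14:33:34,,0\n"
)

# Same call as in the notebook, with and without the dtype mapping
compact = pd.read_csv(io.StringIO(csv_text), dtype=dtypes, parse_dates=['click_time'])
default = pd.read_csv(io.StringIO(csv_text), parse_dates=['click_time'])

compact_bytes = compact.memory_usage(deep=True).sum()
default_bytes = default.memory_usage(deep=True).sum()
print(compact.dtypes)
print(compact_bytes, default_bytes)
```

With the real files, the same `dtype` argument can be combined with `nrows=1000000` as in the cell above.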


# Time Feature

First, extract the day, hour, minute, and second from click_time.

In [7]:
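# NOTE: X_train is not defined in the cells shown. Judging by the output below
# (100,000 rows), it is assumed here to be train_sample.csv.
X_train = pd.read_csv('./data/train_sample.csv', parse_dates=['click_time'])

# Extract time components; uint8 is enough for values in the range 0-59
X_train['day'] = X_train['click_time'].dt.day.astype('uint8')
X_train['hour'] = X_train['click_time'].dt.hour.astype('uint8')
X_train['minute'] = X_train['click_time'].dt.minute.astype('uint8')
X_train['second'] = X_train['click_time'].dt.second.astype('uint8')
X_train.head()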

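|   | ip | app | device | os | channel | click_time | attributed_time | is_attributed | day | hour | minute | second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 87540 | 12 | 1 | 13 | 497 | 2017-11-07 09:30:38 |  | 0 | 7 | 9 | 30 | 38 |
| 1 | 105560 | 25 | 1 | 17 | 259 | 2017-11-07 13:40:27 |  | 0 | 7 | 13 | 40 | 27 |
| 2 | 101424 | 12 | 1 | 19 | 212 | 2017-11-07 18:05:24 |  | 0 | 7 | 18 | 5 | 24 |
| 3 | 94584 | 13 | 1 | 13 | 477 | 2017-11-07 04:58:08 |  | 0 | 7 | 4 | 58 | 8 |
| 4 | 68413 | 12 | 1 | 1 | 178 | 2017-11-09 09:00:09 |  | 0 | 9 | 9 | 0 | 9 |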
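
The same extraction will likely be needed for the test data as well, so it may help to wrap it in a function. This is a minimal sketch; the name `add_time_features` and the two-row demo frame are hypothetical.

```python
import pandas as pd

def add_time_features(df, time_col='click_time'):
    """Return a copy of df with day/hour/minute/second columns as uint8."""
    out = df.copy()
    accessor = out[time_col].dt
    for part in ('day', 'hour', 'minute', 'second'):
        out[part] = getattr(accessor, part).astype('uint8')
    return out

# Demo using two timestamps taken from the output above
demo = pd.DataFrame({
    'click_time': pd.to_datetime(['2017-11-07 09:30:38', '2017-11-09 09:00:09'])
})
demo = add_time_features(demo)
print(demo)
```

Returning a copy leaves the input frame untouched, which makes it safe to call the function on both the training and the test frame.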


# Confidence Feature

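Some IPs, apps, devices, etc. may have a higher rate of is_attributed than others. The feature below multiplies each category value's attribution rate by a confidence weight that grows with the log of its click count, so that rates based on only a few clicks are discounted.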

In [69]:
cats = ['ip', 'app', 'device', 'os', 'channel']

# Build a confidence-weighted attribution rate for each categorical column
for col in cats:

    # New feature name
    new_feat = col + '_confRate'

    # Group rows by the category value
    group_object = X_train.groupby(col)

    # Number of attributed clicks per category value
    cat_attributed = group_object['is_attributed'].sum()

    # Number of clicks (views) per category value
    cat_counts = group_object['is_attributed'].count()

    # Attribution rate per category value
    cat_rates = cat_attributed / cat_counts

    # Attribution confidence: log-scaled click count, capped at 1
    # (1000 views -> 60% confidence, 100 views -> 40% confidence)
    cat_confs = np.minimum(1, np.log(cat_counts + 1) / np.log(100000))

    # Final score: attribution rate weighted by confidence
    cat_score = (cat_rates * cat_confs).rename(new_feat).reset_index()

    # Merge the new feature back onto the training data
    X_train = X_train.merge(cat_score, on=col, how='left')

In [70]:
X_train.head()

|   | ip | app | device | os | channel | click_time | attributed_time | is_attributed | day | hour | minute | second | ip_confRate | app_confRate | device_confRate | os_confRate | channel_confRate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 87540 | 12 | 1 | 13 | 497 | 2017-11-07 09:30:38 |  | 0 | 7 | 9 | 30 | 38 | 0.0 | 6.2e-05 | 0.00154 | 0.001019 | 0.0 |
| 1 | 105560 | 25 | 1 | 17 | 259 | 2017-11-07 13:40:27 |  | 0 | 7 | 13 | 40 | 27 | 0.0 | 0.0 | 0.00154 | 0.000853 | 0.0 |
| 2 | 101424 | 12 | 1 | 19 | 212 | 2017-11-07 18:05:24 |  | 0 | 7 | 18 | 5 | 24 | 0.0 | 6.2e-05 | 0.00154 | 0.001504 | 0.0 |
| 3 | 94584 | 13 | 1 | 13 | 477 | 2017-11-07 04:58:08 |  | 0 | 7 | 4 | 58 | 8 | 0.0 | 0.0 | 0.00154 | 0.001019 | 0.0 |
| 4 | 68413 | 12 | 1 | 1 | 178 | 2017-11-09 09:00:09 |  | 0 | 9 | 9 | 0 | 9 | 0.0 | 6.2e-05 | 0.00154 | 0.000519 | 0.0 |


In [71]:
X_train.shape

(100000, 17)
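
To see what the confidence weighting does, the following sketch applies the same formula, log(count + 1) / log(100000) capped at 1, to a toy frame. The helper name `confidence` and the toy data are hypothetical: app 1 has 100 clicks with 50 attributed, and app 2 has a single attributed click.

```python
import numpy as np
import pandas as pd

def confidence(counts, full_confidence_at=100000):
    """Log-scaled confidence in [0, 1]; reaches 1 near full_confidence_at views."""
    counts = np.asarray(counts, dtype=float)
    return np.minimum(1.0, np.log(counts + 1) / np.log(full_confidence_at))

# Toy data (hypothetical): app 1 -> 100 clicks, 50 attributed; app 2 -> 1 click, attributed
toy = pd.DataFrame({
    'app': [1] * 100 + [2],
    'is_attributed': [1] * 50 + [0] * 50 + [1],
})

stats = toy.groupby('app')['is_attributed'].agg(['sum', 'count'])
stats['rate'] = stats['sum'] / stats['count']
stats['conf'] = confidence(stats['count'])
stats['app_confRate'] = stats['rate'] * stats['conf']
print(stats)

# Confidence at a few reference view counts
reference_conf = confidence([100, 1000, 10**6])
print(reference_conf)
```

By raw rate, app 2 ranks above app 1 (1.0 versus 0.5), but after weighting the order flips, because a single click carries very little confidence. This is the effect the confRate features are meant to capture.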