This repository has been archived by the owner on Oct 8, 2019. It is now read-only.

KDDCup 2012 track 2 CTR prediction (regression)

Makoto YUI edited this page Oct 4, 2013 · 19 revisions

The task is to predict the click-through rate (CTR) of advertisements, that is, the probability that each ad is clicked.
http://www.kddcup2012.org/c/kddcup2012-track2

Caution: This example shows only a baseline result. Use token tables and the amplifier to get a better AUC score.

UDF preparation

use kdd12track2;

delete jar /home/myui/tmp/hivemall.jar;
add jar /home/myui/tmp/hivemall.jar;
source /home/myui/tmp/define-all.hive;

Logistic Regression

Training

select count(1) from training_rcfile;

235582879

235582879 rows / 56 mappers ≈ 4206837 rows per mapper
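The division above gives roughly the number of training rows each mapper processes, and the total_steps value of 5000000 set in the next statement is a round number above that figure. A minimal sketch of the arithmetic, assuming rows are split evenly across mappers (the variable names are illustrative and not part of Hivemall):

```python
# Row count and mapper count are taken from the text above.
total_rows = 235582879
num_mappers = 56

# Assumption: rows are split evenly across mappers, so integer
# division gives the approximate per-mapper row count.
rows_per_mapper = total_rows // num_mappers
print(rows_per_mapper)  # 4206837
```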

set hivevar:total_steps=5000000;

drop table lr_model;
create table lr_model 
as
select 
 feature,
 cast(avg(weight) as float) as weight
from 
 (select 
     logress(features,label, "-total_steps ${total_steps}") as (feature,weight)
  from 
     training_rcfile
 ) t 
group by feature;
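The outer query above averages the weights emitted for each feature. A minimal Python sketch of that aggregation step, assuming the inner query emits one (feature, weight) row per feature from each mapper; the function name and sample values are hypothetical:

```python
from collections import defaultdict


def average_weights(emitted):
    """Average weights per feature, like GROUP BY feature with avg(weight)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for feature, weight in emitted:
        sums[feature] += weight
        counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Made-up rows, as if two mappers both emitted a weight for "f1".
emitted = [("f1", 0.5), ("f2", -1.0), ("f1", 1.5)]
model = average_weights(emitted)
print(model)  # {'f1': 1.0, 'f2': -1.0}
```

Averaging keeps the final model to one weight per feature regardless of how many mappers contributed.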

Prediction

drop table lr_predict;
create table lr_predict
  ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY "\t"
    LINES TERMINATED BY "\n"
  STORED AS TEXTFILE
as
select
  t.rowid, 
  sigmoid(sum(m.weight)) as prob
from 
  testing_exploded  t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
group by 
  t.rowid
order by 
  rowid ASC;
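The prediction query sums the model weights of the features in each test row and passes the sum through a sigmoid. The sketch below mirrors that logic for a single row, assuming the standard logistic function 1 / (1 + exp(-x)); the predict helper and the sample model are hypothetical:

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(features, model):
    """Sum the weights of a row's features and squash the sum into (0, 1)."""
    # Features missing from the model add nothing, mirroring how sum()
    # skips the NULL weights produced by the LEFT OUTER JOIN.
    # Difference from the query: a row with no known feature yields 0.5
    # here, whereas sum() over only NULLs is NULL in Hive.
    total = sum(model[f] for f in features if f in model)
    return sigmoid(total)


model = {"f1": 1.0, "f2": -1.0}  # made-up weights
print(predict(["f1", "f2"], model))  # 0.5
```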

Evaluation

hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/lr_predict lr_predict.tbl

gawk -F "\t" '{print $2;}' lr_predict.tbl > lr_predict.submit
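The gawk command above keeps only the second tab-separated column (the predicted value) of each row. For readers without gawk, a rough Python equivalent is sketched below; the function name and sample rows are made up for illustration:

```python
def second_column(lines):
    """Return the second tab-separated field of each line."""
    # Unlike gawk, this raises IndexError on a line with fewer than two fields.
    return [line.rstrip("\n").split("\t")[1] for line in lines]


# Made-up rows in rowid<TAB>prob form.
sample = ["1\t0.031\n", "2\t0.007\n"]
print("\n".join(second_column(sample)))
```

To apply it to the real file, pass an open file handle for lr_predict.tbl instead of the sample list and write the result to lr_predict.submit.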

pypy scoreKDD.py KDD_Track2_solution.csv  lr_predict.submit

Note: You can use python instead of pypy.

Measure Score
AUC 0.741111
NWMAE 0.045493
WRMSE 0.142395
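To give an intuition for the AUC figure, the sketch below computes AUC as the fraction of (clicked, not clicked) pairs that the scores rank correctly. It is a simplification that assumes one binary label per row and is not the implementation used by scoreKDD.py; the labels and scores are made up:

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


labels = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.4, 0.6]
print(auc(labels, scores))  # 0.75
```

The pairwise loop is quadratic, so it only suits small samples; it is meant to show the definition, not to score the full test set.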

Passive Aggressive

Training

drop table pa_model;
create table pa_model 
as
select 
 feature,
 cast(avg(weight) as float) as weight
from 
 (select 
     pa2a_regress(features,label) as (feature,weight)
  from 
     training_rcfile
 ) t 
group by feature;

Prediction

drop table pa_predict;
create table pa_predict
  ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY "\t"
    LINES TERMINATED BY "\n"
  STORED AS TEXTFILE
as
select
  t.rowid, 
  sum(m.weight) as prob
from 
  testing_exploded  t LEFT OUTER JOIN
  pa_model m ON (t.feature = m.feature)
group by 
  t.rowid
order by 
  rowid ASC;

The "prob" value of PA is a raw score that can be used only for ranking, and it can be negative. A higher value means the ad is more likely to be clicked. Note that AUC is essentially a measure of ranking accuracy.
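Because AUC depends only on the order of the scores, a raw sum of weights works as well as a probability. The sketch below illustrates this with made-up PA scores: applying a sigmoid (a strictly increasing function) leaves the ranking unchanged. The helper names are hypothetical:

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def ranking(scores):
    """Row indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


raw = [-2.3, 0.7, -0.1, 1.9]  # made-up PA scores; negatives are allowed
squashed = [sigmoid(s) for s in raw]

print(ranking(raw))  # [3, 1, 2, 0]
print(ranking(raw) == ranking(squashed))  # True
```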

Evaluation

scoreKDD.py

hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/pa_predict pa_predict.tbl

gawk -F "\t" '{print $2;}' pa_predict.tbl > pa_predict.submit

pypy scoreKDD.py KDD_Track2_solution.csv  pa_predict.submit

Measure Score
AUC 0.739722
NWMAE 0.049582
WRMSE 0.143698