This repository has been archived by the owner on Oct 8, 2019. It is now read-only.
KDDCup 2012 track 2 CTR prediction (regression)
Makoto YUI edited this page Oct 5, 2013
The task is to predict the click-through rate (CTR) of advertisements, that is, the probability that each ad is clicked.
http://www.kddcup2012.org/c/kddcup2012-track2
Caution: This example shows only a baseline result. Use the token tables and the amplifier to get a better AUC score.
use kdd12track2;
delete jar /home/myui/tmp/hivemall.jar;
add jar /home/myui/tmp/hivemall.jar;
source /home/myui/tmp/define-all.hive;
select count(1) from training_rcfile;
235582879
235582879 / 56 (mappers) = 4206837 (training rows per mapper)
set hivevar:total_steps=5000000;
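The arithmetic behind this setting can be checked with a short script. This is a minimal sketch: the variable names are illustrative, and rounding the per-mapper row count up to the next million is only an assumption about how the value 5000000 was chosen.

```python
import math

total_rows = 235582879   # result of count(1) on training_rcfile
num_mappers = 56         # number of map tasks quoted above

# Each mapper sees roughly this many training rows.
rows_per_mapper = total_rows // num_mappers

# Assumption: total_steps is the per-mapper count rounded up to the next million.
total_steps = math.ceil(rows_per_mapper / 1_000_000) * 1_000_000

print(rows_per_mapper, total_steps)
```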
drop table lr_model;
create table lr_model
as
select
feature,
cast(avg(weight) as float) as weight
from
(select
logress(features,label, "-total_steps ${total_steps}") as (feature,weight)
from
training_rcfile
) t
group by feature;
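The outer query averages, per feature, the weights emitted by the individual logress tasks. The sketch below mimics that group-by in plain Python to show the semantics. The per-task weights are made-up illustration values, not Hivemall output, and `average_models` is a hypothetical helper.

```python
from collections import defaultdict

def average_models(partial_models):
    """Average weights per feature, like avg(weight) ... group by feature."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for partial in partial_models:            # one dict per training task
        for feature, weight in partial.items():
            sums[feature] += weight
            counts[feature] += 1
    # A feature is averaged only over the tasks that emitted a weight for it.
    return {feature: sums[feature] / counts[feature] for feature in sums}

# Made-up weights from two tasks.
task_a = {"1": 0.5, "2": -1.0}
task_b = {"1": 1.5, "3": 0.25}

averaged = average_models([task_a, task_b])
print(averaged)
```

Note that a feature seen by only one task keeps that task's weight unchanged, because the average runs over the rows that exist rather than over all tasks.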
drop table lr_predict;
create table lr_predict
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE
as
select
t.rowid,
sigmoid(sum(m.weight)) as prob
from
testing_exploded t LEFT OUTER JOIN
lr_model m ON (t.feature = m.feature)
group by
t.rowid
order by
rowid ASC;
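The prediction query sums the model weights of each row's features and applies the logistic function. A minimal Python sketch of that computation follows. The model weights are made up, `predict` is a hypothetical helper, and skipping features that are absent from the model mirrors how sum() ignores the NULL weights produced by the LEFT OUTER JOIN.

```python
import math

def sigmoid(x):
    """Logistic function, mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict(features, model):
    # Features missing from the model contribute nothing to the sum.
    # Simplification: a row with no known feature gets sigmoid(0) = 0.5 here,
    # whereas in Hive the sum over only NULLs would itself be NULL.
    total = sum(model[f] for f in features if f in model)
    return sigmoid(total)

# Made-up model weights keyed by feature.
model = {"1": 1.0, "2": -1.0, "3": 0.25}

prob = predict(["1", "3", "99"], model)   # feature "99" is not in the model
print(prob)
```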
hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/lr_predict lr_predict.tbl
gawk -F "\t" '{print $2;}' lr_predict.tbl > lr_predict.submit
pypy scoreKDD.py KDD_Track2_solution.csv lr_predict.submit
Note: You can use python instead of pypy.
Measure | Score |
---|---|
AUC | 0.741111 |
NWMAE | 0.045493 |
WRMSE | 0.142395 |
drop table pa_model;
create table pa_model
as
select
feature,
cast(avg(weight) as float) as weight
from
(select
pa1a_regress(features,label) as (feature,weight)
from
training_rcfile
) t
group by feature;
PA1a is recommended when using PA for regression.
drop table pa_predict;
create table pa_predict
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE
as
select
t.rowid,
sum(m.weight) as prob
from
testing_exploded t LEFT OUTER JOIN
pa_model m ON (t.feature = m.feature)
group by
t.rowid
order by
rowid ASC;
The "prob" value produced by PA is not a probability: it can be negative and is useful only for ranking. A higher value means the ad is more likely to be clicked. Note that AUC is a measure of ranking accuracy.
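Because AUC depends only on the ordering of the scores, raw PA outputs can be evaluated directly even when some are negative. The sketch below illustrates this with a plain pairwise AUC on made-up labels and scores. It is not the click- and impression-weighted computation of scoreKDD.py, which may differ, and `auc` is a hypothetical helper.

```python
import math

def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Made-up labels and raw scores, some of them negative.
labels = [1, 0, 1, 0, 0]
raw_scores = [0.8, -1.2, -0.3, 0.1, -2.0]

# Squashing through the logistic function keeps the ordering intact.
squashed = [1.0 / (1.0 + math.exp(-s)) for s in raw_scores]

print(auc(labels, raw_scores), auc(labels, squashed))
```

Both calls return the same value, since a strictly increasing transformation such as the logistic function does not change which pairs are ranked correctly. NWMAE and WRMSE, by contrast, depend on the actual score values.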
hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/pa_predict pa_predict.tbl
gawk -F "\t" '{print $2;}' pa_predict.tbl > pa_predict.submit
pypy scoreKDD.py KDD_Track2_solution.csv pa_predict.submit
Measure | Score |
---|---|
AUC | 0.739722 |
NWMAE | 0.049582 |
WRMSE | 0.143698 |