Predict the Click-Through Rate of ads given the query and user information using Apache Spark

mw866/pCTR

pCTR

Predict the click-through rate (CTR) of ads given the query and user information. The data sets are from KDD Cup 2012, Track 2.

Usage

Data

Overview

data.png

Training

| Column Position | Column Name | Description | Format |
| --- | --- | --- | --- |
| 0 | Click | The number of times, among the impressions counted in Impression, that the user (UserID) clicked the ad (AdID). | Integer {0, 1} |
| 1 | Impression | The number of search sessions in which the ad (AdID) was impressed by the user (UserID) who issued the query (Query). | Integer {1, 2, 3, 4 …} |
| 2 | DisplayURL | A property of the ad. The URL is shown together with the title and description of an ad. It is usually, but not always, the shortened landing page URL of the ad. In the data file, this URL is hashed for anonymity. | String (fixed length) |
| 3 | AdID | ID of the ad. | String |
| 4 | AdvertiserID | ID of the advertiser. | String |
| 5 | Depth | A property of the session: the number of ads impressed in a session. | Integer {1, 2, 3} |
| 6 | Position | A property of an ad in a session: the order of the ad in the impression list. | Integer {1, 2, 3} |
| 7 | QueryID | ID of the query. This ID is a zero-based integer value. It is the key of the data file 'queryid_tokensid.txt'. | String |
| 8 | KeywordID | ID of the keyword. This is the key of 'purchasedkeyword_tokensid.txt'. | String |
| 9 | TitleID | A property of ads. This is the key of 'titleid_tokensid.txt'. | String |
| 10 | DescriptionID | A property of ads. This is the key of 'descriptionid_tokensid.txt'. | String |
| 11 | UserID | This is the key of 'userid_profile.txt'. When the user cannot be identified, this field has the special value 0. | String |
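
The column layout above can be made concrete with a small parser. This is a minimal sketch in plain Python, not code from this repository: it assumes each training record is one tab-separated line with the 12 columns in the listed order, and the names `TrainingRow` and `parse_training_line` as well as the sample values are hypothetical.

```python
from typing import NamedTuple


class TrainingRow(NamedTuple):
    """One training record, fields in column-position order (0 to 11)."""
    click: int
    impression: int
    display_url: str
    ad_id: str
    advertiser_id: str
    depth: int
    position: int
    query_id: str
    keyword_id: str
    title_id: str
    description_id: str
    user_id: str


def parse_training_line(line: str) -> TrainingRow:
    """Parse one line, assuming tab-separated fields (an assumption)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    (click, impression, display_url, ad_id, advertiser_id, depth,
     position, query_id, keyword_id, title_id, description_id,
     user_id) = fields
    # Counts and session properties are integers; IDs stay as strings.
    return TrainingRow(int(click), int(impression), display_url, ad_id,
                       advertiser_id, int(depth), int(position), query_id,
                       keyword_id, title_id, description_id, user_id)


# Made-up sample line for illustration only.
sample = "0\t1\t4298118681424644510\t7686695\t385\t3\t3\t1601\t5521\t7709\t576\t490234"
row = parse_training_line(sample)
print(row.click, row.impression, row.position)
```

Keeping the ID columns as strings follows the Format column of the table; only Click, Impression, Depth, and Position are converted to integers.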

Test

The test dataset has the same format as the training dataset, except that it omits the counts of ad impressions (Impression) and ad clicks (Click), which are the values needed to compute the empirical CTR.
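
As an illustration of the empirical CTR mentioned above, the sketch below divides total clicks by total impressions. It is a plain-Python example under stated assumptions, not this repository's method: the function names `empirical_ctr` and `ctr_by_ad`, the choice to aggregate per AdID, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def empirical_ctr(clicks: int, impressions: int) -> float:
    """Empirical CTR: clicks divided by impressions."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions


def ctr_by_ad(rows):
    """Sum Click and Impression per AdID, then compute each ad's CTR.

    `rows` is an iterable of (ad_id, click, impression) tuples.
    """
    totals = defaultdict(lambda: [0, 0])  # ad_id -> [clicks, impressions]
    for ad_id, click, impression in rows:
        totals[ad_id][0] += click
        totals[ad_id][1] += impression
    return {ad_id: empirical_ctr(c, i) for ad_id, (c, i) in totals.items()}


# Made-up rows for illustration only.
rows = [("ad_a", 0, 1), ("ad_a", 1, 3), ("ad_b", 1, 2)]
print(ctr_by_ad(rows))  # {'ad_a': 0.25, 'ad_b': 0.5}
```

Summing the counts before dividing weights each row by its number of impressions, which differs from averaging per-row ratios.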

Reference

Data

Zeppelin

AWS EMR

Spark

Troubleshooting

Zeppelin times out after 5 minutes

Increase spark.sql.broadcastTimeout (in seconds; the default is 300) when building the session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pCTR").config("spark.sql.broadcastTimeout", "600").getOrCreate()

http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
