pCTR

Predict the click-through rate (CTR) of ads given the query and user information. The data sets are from KDD Cup 2012, Track 2.

Usage

Data

Overview

data.png

Training

| Column Position | Column Name | Description | Format |
| --- | --- | --- | --- |
| 0 | Click | The number of times, among the impressions counted in Impression, that the user (UserID) clicked the ad (AdID). | Integer {0, 1} |
| 1 | Impression | The number of search sessions in which the ad (AdID) was impressed by the user (UserID) who issued the query (Query). | Integer {1, 2, 3, 4 …} |
| 2 | DisplayURL | A property of the ad. The URL is shown together with the title and description of an ad. It is usually, but not always, the shortened landing page URL of the ad. In the data file, this URL is hashed for anonymity. | String (fixed length) |
| 3 | AdID | ID of the ad. | String |
| 4 | AdvertiserID | ID of the advertiser. | String |
| 5 | Depth | A property of the session. The number of ads impressed in a session. | Integer {1, 2, 3} |
| 6 | Position | A property of an ad in a session. The order of an ad in the impression list. | Integer {1, 2, 3} |
| 7 | QueryID | ID of the query. This ID is a zero-based integer value. It is the key of the data file 'queryid_tokensid.txt'. | String |
| 8 | KeywordID | ID of the keyword. This is the key of 'purchasedkeyword_tokensid.txt'. | String |
| 9 | TitleID | A property of ads. This is the key of 'titleid_tokensid.txt'. | String |
| 10 | DescriptionID | A property of ads. This is the key of 'descriptionid_tokensid.txt'. | String |
| 11 | UserID | This is the key of 'userid_profile.txt'. When the user cannot be identified, this field has the special value 0. | String |
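
The column layout above can be turned into a small typed record. The following is a minimal sketch, not the project's own loader: it assumes the training file is tab-separated with one instance per line (the separator is not stated in this README), and the names `Instance` and `parse_training_line` are hypothetical.

```python
from typing import NamedTuple


class Instance(NamedTuple):
    """One training row, fields in column-position order (0..11)."""
    click: int
    impression: int
    display_url: str
    ad_id: str
    advertiser_id: str
    depth: int
    position: int
    query_id: str
    keyword_id: str
    title_id: str
    description_id: str
    user_id: str


def parse_training_line(line: str, sep: str = "\t") -> Instance:
    """Split one line into the 12 documented columns.

    Assumption: fields are separated by `sep` (tab by default).
    """
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    return Instance(
        click=int(fields[0]),
        impression=int(fields[1]),
        display_url=fields[2],
        ad_id=fields[3],
        advertiser_id=fields[4],
        depth=int(fields[5]),
        position=int(fields[6]),
        query_id=fields[7],
        keyword_id=fields[8],
        title_id=fields[9],
        description_id=fields[10],
        user_id=fields[11],
    )
```

The ID columns are kept as strings, matching the Format column, so that the special UserID value 0 and hashed values are not reinterpreted as numbers.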

Test

The testing dataset has the same format as the training dataset, except that it omits the ad impression counts (Impression) and ad click counts (Click), which are needed to compute the empirical CTR.
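
The empirical CTR is the ratio of clicks to impressions. As a rough illustration, the sketch below aggregates both counts per ad and divides them. The function name `empirical_ctr` and the `(ad_id, clicks, impressions)` input shape are assumptions made for this example, not part of the project.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def empirical_ctr(rows: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Return clicks / impressions per ad.

    `rows` yields (ad_id, clicks, impressions) tuples; several rows may
    refer to the same ad, so the counts are summed before dividing.
    """
    clicks = defaultdict(int)
    impressions = defaultdict(int)
    for ad_id, n_clicks, n_impressions in rows:
        clicks[ad_id] += n_clicks
        impressions[ad_id] += n_impressions
    # Skip ads with zero impressions to avoid dividing by zero.
    return {
        ad_id: clicks[ad_id] / total
        for ad_id, total in impressions.items()
        if total > 0
    }
```

Summing before dividing weights each row by its impression count, which differs from averaging the per-row ratios.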

Reference

Data

Zeppelin

AWS EMR

Spark

Troubleshooting

Zeppelin timeout after 5 minutes

Increase spark.sql.broadcastTimeout (in seconds; the default is 300, i.e. 5 minutes) in the config:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("pCTR").config("spark.sql.broadcastTimeout", "600").getOrCreate()

http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options