# Project Design Writeup

### Project Problem and Hypothesis

In online advertising, such as paid search and display, many advertisers have switched from a cost-per-impression (CPM) model to performance-dependent pricing models, such as cost-per-click (CPC) and cost-per-conversion (CPA). Performance-dependent models allow advertisers to reduce risk by pre-defining which actions they are willing to pay for.

From an ad-serving perspective, accurately estimating the probability that an impression will lead to a pre-defined action is crucial for maximizing revenue and efficiency.

A lot of machine learning work has focused on predicting online conversions (see [here](https://research.google.com/pubs/pub41159.html) for details). However, little attention has been paid to predicting offline conversions, such as store visits.

Vistar Media, the company I work for, built the first programmatic platform for digital out-of-home media and houses over 90 percent of digital out-of-home inventory in the United States. Given that out-of-home is a medium that reaches consumers in the physical world, store visit rate is an important metric for evaluating the ROI of out-of-home campaigns. 

The objective of this project is to build a response prediction model for digital out-of-home media that can facilitate CPA-based RTB (real-time bidding), using store visits as a proxy for conversions.

Based on the available literature on predicting online conversions, I think that user features, such as demographics and income level when available, and context features, such as where and when a user sees an ad, will have the most impact on predicting offline conversions, or store visits (see [here](http://people.csail.mit.edu/romer/papers/TISTRespPredAds.pdf) for reference).

### Datasets

DOOH Ad Logs

Variable | Description | Type of Variable
---| ---| ---
advertiser_id | unique identifier of the advertiser | categorical
timestamp | time of ad play; needs to be converted into time of day and day of week | categorical
lease_creative_id | unique identifier of the creative served | categorical 
lease_dma | the DMA/market where the ad was served | categorical
lease_network_id | the network / contextual environment where the ad was served, such as gyms, malls, transit | categorical
lease_latitude | latitude of the DOOH screen; could be turned into a continuous variable by calculating the distance between the DOOH screen and the nearest store location |  categorical
lease_longitude | longitude of the DOOH screen; could be turned into a continuous value by calculating the distance between the DOOH screen and the nearest store location |  categorical
lease_zip | zip code of the DOOH screen | categorical
lease_campaign_id | unique identifier of the ad campaign | categorical
lease_venue_id | unique identifier of the DOOH screen | categorical
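
The two derived features mentioned in the table (time of day and day of week from `timestamp`, and distance from the screen to the nearest store) could be sketched roughly as follows. This is a minimal illustration, not the production pipeline: it assumes `timestamp` is a Unix timestamp in seconds interpreted as UTC, and the function names and the `stores` list of `(latitude, longitude)` pairs are hypothetical.

```python
import math
from datetime import datetime, timezone

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def time_features(ts_seconds):
    """Split a Unix timestamp (assumed: seconds, UTC) into two categorical features."""
    dt = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
    return {"hour_of_day": dt.hour, "day_of_week": DAY_NAMES[dt.weekday()]}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_nearest_store_km(screen_lat, screen_lon, stores):
    """Continuous feature: distance from a screen to the closest store.

    `stores` is a hypothetical non-empty list of (latitude, longitude) pairs.
    """
    return min(haversine_km(screen_lat, screen_lon, lat, lon)
               for lat, lon in stores)
```

For example, `distance_to_nearest_store_km(lease_latitude, lease_longitude, stores)` would replace the two raw coordinate columns with a single continuous value per ad play. If the logs store local time or a different timestamp format, the parsing step would need to change accordingly.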

Exposure to Conversion Table

Variable | Description | Type of Variable
---| ---| ---
user_id | unique identifier of the user | categorical
timestamp | time of ad exposure; needs to be converted into time of day and day of week | categorical
duration | length of stay in front of a screen | continuous
venue_latitude | latitude of the DOOH screen | categorical
venue_longitude | longitude of the DOOH screen | categorical
user_zipcode | zip code of user's home location | categorical
user_city | city of user's home location | categorical
user_state | state of user's home location | categorical
conversion | whether the user visits a store post exposure or not | categorical

### Domain knowledge

I have been conducting attribution studies at Vistar Media for nearly two years now, and I'm also familiar with the space of location-based media in general. 

We have not built any in-house prediction models, but similar work has been done for online media in recent years. The two papers I linked in the first section of this workbook have a lot of good information on the topic. 

Here is a sample output/benchmark table on feature importance from [this paper](http://people.csail.mit.edu/romer/papers/TISTRespPredAds.pdf): 


Single feature | SMI (bits)
---| ---
event guid | 0.59742
query string | 0.59479
xcookie | 0.49983
user identifier | 0.49842
user segments | 0.43032

Single feature | RMI (bits)
---| ---
section id | 0.20747
creative id | 0.20645
site | 0.19835
campaign id | 0.19142
rm ad grp id | 0.19094

Conjunction feature | RMI (bits)
---| ---
section id x advertiser id | 0.24691
section id x creative id | 0.24317
section id x IO id | 0.24307
creative id x publisher id | 0.24250
creative id x site | 0.24246
site x advertiser id | 0.24234
section id x pixeloffers | 0.24172
site x IO id | 0.23953
publisher id x advertiser id | 0.23903

### Project Concerns

My major concern is the repeatability of the project. Every ad campaign has attributes that differ from those of other campaigns, such as advertiser, user_id, and creative_id, so our model is missing feedback features that might play a large part in accurately predicting the dependent variable.

We also don't have that many user features, such as demographic information, which could potentially generate relatively large correlation coefficients. 

In addition, this model might not be representative of the U.S. population, given that the data comes primarily from consumers who opt in to have their location tracked passively via GPS.

In terms of the accuracy of source data, validating whether a user actually visited a store after seeing an ad on a digital out-of-home screen is also a challenge, especially in crowded urban neighborhoods. 

However, no matter how accurate the model's output is, this project will serve as a good starting point for driving efficiency in programmatic out-of-home transactions. We'll keep tuning and improving the model to bring it as close to reality as possible.

### Outcomes

Since I plan to use a logistic regression model for the project, I expect the outcomes to include the following metrics: recall, precision, AUC, and accuracy for model evaluation; coefficients and feature importances for feature evaluation and selection. L1 and L2 regularization will also be used to avoid over-fitting. I don't expect my model to be very complicated given that the model is expected to be used in a production setting later. 
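
For reference, one common way to write a logistic regression objective with both penalties is shown below. The symbols are assumptions introduced here for illustration: $x_i$ is the feature vector of exposure $i$, $y_i \in \{0, 1\}$ is the conversion label, $w$ and $b$ are the coefficients and intercept, and $\lambda_1, \lambda_2 \ge 0$ are the L1 and L2 penalty strengths.

```latex
\min_{w,\,b}\; -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big]
  + \lambda_1 \lVert w \rVert_1 + \frac{\lambda_2}{2} \lVert w \rVert_2^2,
\qquad
p_i = \frac{1}{1 + e^{-(w^\top x_i + b)}}
```

Setting $\lambda_2 = 0$ gives a pure L1 penalty, which tends to drive some coefficients to exactly zero and so supports feature selection; setting $\lambda_1 = 0$ gives a pure L2 penalty, which shrinks coefficients without zeroing them.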

To consider the project a success, I think my AUC score should stay above 0.75 and the most important features, ideally fewer than 20 features, should be able to provide a good prediction of the dependent variable.
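
The evaluation metrics listed above can be computed directly from true labels and predicted probabilities. The following is a small dependency-free sketch for checking the AUC target; the function names, the 0.5 decision threshold, and the toy data are illustrative assumptions. The pairwise AUC below is quadratic in the number of examples, so a library implementation would be a better fit for full-size logs.

```python
AUC_TARGET = 0.75  # success threshold stated in this writeup


def precision_recall_accuracy(y_true, scores, threshold=0.5):
    """Precision, recall, and accuracy after thresholding predicted probabilities."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive example is scored
    above a random negative one, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only, not real campaign results.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
meets_target = auc >= AUC_TARGET
```

Because store visits are likely to be rare relative to exposures, accuracy alone can look high even for a model that never predicts a visit, which is why precision, recall, and AUC are reported alongside it.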