# Predicting Online Credit Card Applications (Digital Advertising)
*Final Project #2: Project Design Writeup and Approval Template*
<br>*Jan 12, 2017*

## Project Problem and Hypothesis
The client we will be researching is a Canadian credit card company looking to learn how they can adjust their digital advertising to increase online credit card applications. We would like to determine which features of their digital ads may influence users to apply for the client's credit card. Some example features may include whether we showed the user a video ad, on what sites we showed the ads, or the type of ad campaign we showed the user (e.g. "Upper-Funnel" or "Lower-Funnel"). To determine which features have the most influence on the likelihood to apply, we will be developing a model to predict a categorical/binary outcome:
* 1: Applied for credit card
* 0: Did not apply for credit card

The model will be based on the client's past digital ads campaigns (Oct 1-Dec 31, 2016) that ran through our company's media buying platform DoubleClick Bid Manager (DBM).

Currently, the client informs their campaign strategy based on exploratory analyses using a standard “last-touch” attribution model (i.e. the last ad shown to a user before they apply for the credit card receives 100% credit for influencing the user to apply).  Due to this, we believe the client over-values certain strategies such as "Retargeting" and under-values other strategies.  We hope that by using machine learning methods, we will be able to uncover new insights that will help the client evolve their digital ad strategy.

Our hypothesis is that users will be more likely to apply for a credit card when they are shown more "Upper-Funnel" ads, more video ads, a higher frequency of ads, "viewable" ads, and ads on certain key sites or categories of sites.

## Datasets
* **Time frame:** (Oct 1-Dec 31, 2016)
* **Scope:**
    * Canada
    * All of the client's campaigns and strategies 
    * Only ads served via DoubleClick Bid Manager (DBM)
* **Observations at User-level:** The original datasets are at the event-level, e.g. each ad impression.  We will use the **user_id** dimension to collapse the datasets so we have a row for each user.  Therefore, we will be creating a dataset which summarizes the ad exposure for each user.  
* **Description of Datasets:** We will use 3 datasets which share a similar data structure as seen in the table below.  The table calls out which features are available in each dataset.
    * **View** - Dataset with every ad impression
    * **Click** - Dataset with every ad click
    * **Conversion** - Dataset with every credit card application

Field|Type|View|Click|Conversion|Description
---|---|---|---|---|---
event_type|string|Yes|Yes|Yes|Details the type of the event for this row: "view", "click", or "conversion".
event_sub_type|string|Yes|Yes|Yes|Contains further details related to the event - these are "view" and "click" for view and click events, but may be "postview", "postclick" or blank for conversion events.
event_time|integer|Yes|Yes|Yes|A Unix timestamp in microseconds (1/1,000,000 second) for when the event occurred, for example "1330403779608570" represents Tuesday February 28th 2012 04:36:19.608570.
advertiser_id|integer|Yes|Yes|Yes|The DoubleClick Bid Manager numerical ID for the advertiser related to the event, for example "164332".
insertion_order_id|integer|Yes|Yes|Yes|The DoubleClick Bid Manager numerical ID for the insertion order related to the event, for example "1079941".
line_item_id|integer|Yes|Yes|Yes|The DoubleClick Bid Manager numerical ID for the line item related to the event, for example "1155785".
creative_id|integer|Yes|Yes|Yes|The DoubleClick Bid Manager numerical ID for the creative related to the event, for example "367487".
floodlight_id|integer|No|No|Yes|The ID of the floodlight tag related to the conversion event, for example "802886".
bid_price_advertiser_currency_nanos|integer|Yes|Yes|Yes|The bid price sent to the exchange, in advertiser currency nanos, for example "61200000" nanos is 0.06. Note that although the bid is sent to exchanges as a CPM value, this is represented as a CPI value for consistency with the other metrics.
winning_price_advertiser_currency_nanos|integer|Yes|Yes|Yes|The amount paid for the impression in advertiser currency nanos, for example "61200000" nanos is 0.06. This value may be zero.
partner_revenue_advertiser_currency_nanos|integer|Yes|Yes|Yes|The total amount in advertiser currency nanos made by the partner account for the view event. This value may be zero.
total_media_cost_advertiser_currency_nanos|integer|Yes|Yes|Yes|The total media cost in advertiser currency nanos for the view event. This value may be zero.
data_cost_advertiser_currency_nanos|integer|Yes|Yes|Yes|The cost of any data that was used to target this impression, in advertiser currency nanos.
billable_cost_advertiser_currency_nanos|integer|Yes|Yes|Yes|The total amount of money billed to the partner, including the media cost and partner costs, in advertiser currency nanos.
url|string|Yes|Yes|Yes|The raw URL taken from the bid request received from the exchange, for example "http://www.example.com". As some exchanges mask the URL in their bid requests this value may be "source_url_hidden".
universal_site_id|integer|Yes|Yes|Yes|The DoubleClick Bid Manager numerical ID for the most specific universal site that matches the url.
language|string|Yes|Yes|Yes|The ISO-639-1 code or "zh_CN" (Chinese (simplified)), "zh_TW" (Chinese (traditional)) or "other" representing the language related to the view event.
adx_page_categories|string|Yes|Yes|Yes|[Contains the DoubleClick Ad Exchange page category IDs](https://developers.google.com/adwords/api/docs/appendix/verticals) separated by a space, for example "65 189". The categories may not be mutually exclusive.
matching_targeted_keywords|string|Yes|Yes|Yes|A comma separated string containing a list of targeted keywords matching the page related to the view event, for example "apple,orange,banana". Although the page related to the view event may match many keywords, only those which were targeted will be included; if the list of keywords is large we may impose a limit to the number of keywords returned.
exchange|integer|Yes|Yes|Yes|The DoubleClick Bid Manager numerical ID for the exchange that requested the ad.
attributed_inventory_source_is_public|integer|Yes|Yes|Yes|"True" if inventory source is available to all buyers. "False" if inventory source is restricted only to certain buyers.
ad_position|integer|Yes|Yes|Yes|Specifies the position of the ad on the page if known (self-declared by source):<br>"1" represents above the fold<br>"2" represents below the fold
country|string|Yes|Yes|Yes|2-letter ISO 3166-1 country code identifying the best-guess country of the impression, for example "US"
postal_code|string|Yes|Yes|Yes|The postal code identifying the best-guess postal area of the impression if known, for example "98033". Do not assume uniqueness across different countries.
geo_region_id|integer|Yes|Yes|Yes|An integer matching the region integer available in reporting and targeting.
city_id|integer|Yes|Yes|Yes|The DoubleClick Bid Manager numerical ID identifying the best-guess city of the impression.
os_id|integer|Yes|Yes|Yes|The DoubleClick Bid Manager numerical ID identifying the operating system related to this event.
browser_id|integer|Yes|Yes|Yes|The DoubleClick Bid Manager numerical ID identifying the browser related to this event.
net_speed|integer|Yes|Yes|Yes|The DoubleClick Bid Manager numerical ID representing the network speed related to the view event:<br>"1" represents dial-up<br>"2" represents EDGE/2G<br>"3" represents UMTS/3G<br>"4" represents Basic DSL<br>"5" represents HSDPA/3.5G<br>"6" represents Broadband/4G<br>"7" represents Unknown
user_id|string|Yes|Yes|Yes|The encrypted user ID related to this event, for example "ABCDEFGH_abcdefgh-0123456789". Do not assume any ordering, structure or meaning to the user_id value - this value is not the user's cookie ID.
matching_targeted_segments|string|Yes|Yes|Yes|The names of targeted user lists that match the visitor separated by a space, for example "-4 456". This includes 1st and 3rd party segments. If the visitor is in a user list that is not targeted by the ad associated with this event it will not be included here.
isp_id|integer|Yes|Yes|Yes|The DoubleClick Bid Manager numerical ID for the best-guess Internet Service Provider of the impression. This value may be missing.
device_type|integer|Yes|Yes|Yes|Populated with the numerical value of the identified device type. <br>"0" represents COMPUTER<br>"1" represents OTHER<br>"2" represents SMARTPHONE<br>"3" represents TABLET<br>"4" represents SMART TV
mobile_make_id|integer|Yes|Yes|Yes|The numerical ID for the mobile make. This value may be missing.
mobile_model_id|integer|Yes|Yes|Yes|The numerical ID for the mobile model. This value may be missing.

More information available at Google's developer documentation for [DBM's Data Transfer File Format](https://developers.google.com/bid-manager/guides/data-transfer/format-v6).
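
Two of the encodings in the table are easy to misread: `event_time` is in microseconds rather than seconds, and the `*_nanos` fields are in billionths of a currency unit. The following is a minimal sketch of the conversions, using the example values from the table. The helper names `event_time_to_utc` and `nanos_to_currency` are illustrative and not part of DBM; the timestamp is assumed to be UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def event_time_to_utc(event_time_us: int) -> datetime:
    """Convert a microsecond Unix timestamp to a timezone-aware UTC datetime."""
    # timedelta keeps integer microsecond precision (no float rounding).
    return EPOCH + timedelta(microseconds=event_time_us)


def nanos_to_currency(nanos: int) -> float:
    """Convert advertiser-currency nanos to currency units (1 unit = 1e9 nanos)."""
    return nanos / 1_000_000_000


# Example values taken from the field descriptions above.
print(event_time_to_utc(1330403779608570).isoformat())  # 2012-02-28T04:36:19.608570+00:00
print(nanos_to_currency(61200000))                      # 0.0612
```

Note that "61200000" nanos converts to 0.0612, which the field description rounds to 0.06.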

#### Additional Features:
* **Clicks/Conversions**: We will focus on the "View" (impressions) dataset, but will use the click and conversion datasets to create 2 additional features:
    * Click: Whether the user clicked on any ads
    * Conversion: Whether the user applied for a credit card
* **Campaign Strategy**: The insertion_order_id and line_item_id features will be used to determine the campaign strategy feature with the following possible values:
    * “Upper-funnel” / "Awareness"
    * “Mid-funnel” / "Evaluation"
    * “Lower-funnel” / "Decision"
* **Creative Details**: The creative_id feature can be mapped to another dataset to determine the following features:
    * Creative Type (e.g. Display, Video, etc.)
    * Creative Size
    * Video Length
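
The collapse from event-level rows to one row per user, together with the click and conversion flags, could look roughly like the sketch below. It uses pandas on tiny made-up samples. The `creative_type` and `strategy` columns are assumptions: they stand in for values that would first have to be mapped from `creative_id`, `insertion_order_id`, and `line_item_id`.

```python
import pandas as pd

# Tiny made-up samples that mimic the View, Click, and Conversion files.
views = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "creative_type": ["Video", "Display", "Display", "Video", "Video", "Display"],
    "strategy": ["Upper-funnel", "Lower-funnel", "Lower-funnel",
                 "Upper-funnel", "Mid-funnel", "Upper-funnel"],
})
clicks = pd.DataFrame({"user_id": ["u1"]})
conversions = pd.DataFrame({"user_id": ["u1", "u3"]})

# Flag the impressions we want to count per user.
views = views.assign(
    is_video=views["creative_type"].eq("Video"),
    is_upper=views["strategy"].eq("Upper-funnel"),
)

# One row per user: total impressions plus counts by creative type and strategy.
users = views.groupby("user_id").agg(
    impressions=("creative_type", "size"),
    video_impressions=("is_video", "sum"),
    upper_funnel_impressions=("is_upper", "sum"),
).reset_index()

# Binary flags derived from the Click and Conversion datasets.
users["clicked"] = users["user_id"].isin(clicks["user_id"]).astype(int)
users["converted"] = users["user_id"].isin(conversions["user_id"]).astype(int)

print(users)
```

Here `converted` is the binary target (1: applied, 0: did not apply) and the remaining columns are candidate features.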

## Domain knowledge
I currently work as an "Analytical Lead" at DoubleClick, where I work with clients to develop insights from reports and data from our products.  I am already familiar with the datasets described above and our team has historically used the datasets to perform some exploratory analyses.

Due to over-reliance on "last-touch" conversion attribution, we believe some of our clients over-value the importance of "Lower-Funnel" campaigns and we hope to show that "Upper-Funnel" strategies can influence the likelihood of a user to apply.

![Marketing Purchase Funnel](files/purchase_funnel.jpg)

Historically, we have relied on two methods to make the case for other strategies that are more "Upper-Funnel":
* Using test metrics such as "Brand Lift" (positive change in attitude toward the brand) and "Ad Recall" (user's ability to remember a recent ad campaign).  While this can be effective, we find that some clients need to be able to tie these metrics to their ultimate goal, e.g. credit card applications.
* Relying on data-driven attribution methods that measure the impact of an ad, regardless of whether it was the last ad shown or appeared earlier in the "ad path".  These methods are very powerful and useful, but we have found that some clients would rather not rely on "black-box" methods or do not want to pay external partners for this service.

**Benchmark:** While we don't currently have a benchmark model to compare to, we can use a performance metric from the campaign as a benchmark for what we want to try to improve.  The client's campaign achieved:
* **Conversion rate (conversion/impression):** 0.001039%
* **Cost per conversion:** 391 CAD

## Project Concerns

**General Concerns:**
* I am not certain which model evaluation method to use. Based on the confusion matrix, it seems that "Recall" may be a good choice, since our focus is to capture all the True Positives for applications and False Positives are relatively inexpensive compared to the benefit of a True Positive. I could use a cost-benefit analysis to illustrate this. However, should I use the cost-benefit analysis itself as the evaluation method, or use it to show that "Recall" is a valid evaluation choice?
* There is a risk that our client will expect the model to predict future campaign performance, which is not the goal of this project. Communication will be key to convey that the goal is finding which features have the greatest influence on likelihood to apply for a credit card.
* There is a risk that we identify factors that matter during a certain period of time (end of year holidays) but may not necessarily matter during other times of the year.
* There is a risk that our results will become obsolete if aspects of the creative design and messaging change significantly in the future.
* The risk of being wrong is that our recommendations to the client actually worsen their performance, which could lead to a strained relationship or lost business.  On the other hand, the potential upside would be to improve performance for the client and to win additional business.

**Dataset Concerns:**
* Our dataset only includes ads bought via DoubleClick Bid Manager (DBM).  My understanding is that most of the client's digital ads during this period were bought via DBM, but we don't have any insight into TV, radio, or other ad channels that may add noise to our data.
* The **user_id** is based on "cookies".  We are assuming that a cookie represents a unique person.  However, the reality is that people use multiple devices, which will have different unique cookies.  Therefore we may see some noise in the data due to users being exposed to an ad on one device (e.g. their phone), but applying for the credit card on another (e.g. their laptop).
* Due to privacy issues, about 20% of the data usually has **user_id** "zeroed out" by DoubleClick, which will add noise to our data.
* One feature I was hoping to use was what "Audience Segment" a user belonged to (e.g. "Foodie" vs "Outdoor Enthusiast").  However, this is only available in cases where the client actively targets these groups, so this data will not be useful.

## Outcomes
**Modeling Plan:**
* As part of the exploratory process, use a Random Forest model to identify important features
* Test various forms of a Logistic Regression with L1 or L2 regularization
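
A rough sketch of this two-step plan with scikit-learn is shown below. Everything in it is an assumption made for illustration: the data is synthetic, the feature names are hypothetical, and the hyperparameters (`n_estimators`, `C`, `class_weight="balanced"` for the rare positive class) are starting points rather than tuned choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ["impressions", "video_impressions", "upper_funnel_impressions"]

# Synthetic user-level data for illustration only; not the client's data.
X = rng.poisson(lam=[5.0, 1.0, 2.0], size=(2000, 3)).astype(float)
# Made-up relationship so that the toy target is learnable.
logit = -4.0 + 0.15 * X[:, 0] + 0.6 * X[:, 1]
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Exploratory step: rank features by Random Forest importance.
forest = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
forest.fit(X, y)
importances = dict(zip(feature_names, forest.feature_importances_))

# Interpretable step: compare L1 and L2 penalties on standardized features.
X_scaled = StandardScaler().fit_transform(X)
coefficients = {}
for penalty in ("l1", "l2"):
    model = LogisticRegression(
        penalty=penalty, C=1.0, solver="liblinear", class_weight="balanced"
    )
    model.fit(X_scaled, y)
    coefficients[penalty] = dict(zip(feature_names, model.coef_[0]))

print(importances)
print(coefficients)
```

Standardizing the features before fitting makes the penalized coefficients comparable across features, which matters if they are later presented as a ranking.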

**Modeling Outcome:**

We will be using **Recall** to evaluate our model's ability to predict if a user applies for a credit card.  For the best model, we will populate a confusion matrix and work to develop a cost-benefit analysis from it.
![Confusion Matrix](files/confusion.jpg)
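
As a sketch of how Recall and a simple cost-benefit figure could be derived from the same confusion matrix, consider the following. All counts and monetary values here are hypothetical placeholders, not campaign results; the real benefit of a True Positive and cost of a False Positive would need to come from the client.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual applicants that the model correctly identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def expected_value(tp: int, fp: int, fn: int, tn: int,
                   benefit_tp: float, cost_fp: float,
                   cost_fn: float = 0.0, benefit_tn: float = 0.0) -> float:
    """Net value of the model's predictions given a value for each cell."""
    return tp * benefit_tp - fp * cost_fp - fn * cost_fn + tn * benefit_tn


# Hypothetical confusion-matrix counts.
tp, fp, fn, tn = 80, 400, 20, 9500

# Hypothetical values: a True Positive is worth far more than a False Positive costs.
print(recall(tp, fn))                                               # 0.8
print(expected_value(tp, fp, fn, tn, benefit_tp=100.0, cost_fp=1.0))  # 7600.0
```

When a False Positive is cheap relative to the benefit of a True Positive, the expected value is driven mostly by the True Positive count, which is the argument for using Recall.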

**Outcome for Client:**

The ultimate outcome is to identify the features that influence credit card applications and their relative importance.  To begin, it will be important to present high-level insights from our exploratory analysis to build up to the outcome of the model.  For the model result, we will want an easy way to explain the meaning of the results to our client.  Therefore, we will have a preference for models that can be intuitively understood.  For example, we may want to present the coefficients for each feature from a Logistic Regression.  We can then translate the coefficients into how much of an effect each feature has on the likelihood to apply for a credit card.  Ideally we will be able to identify combinations of features that together improve the likelihood to apply by roughly 50% or more above the average conversion rate of 0.001039%.  Insights like these will be useful to recommend new strategies for the client to test.
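
One way to translate a Logistic Regression coefficient into an effect on likelihood to apply is through the odds ratio, sketched below. The coefficient used here is hypothetical, and treating the benchmark conversion rate as the baseline probability is an assumption made only for illustration.

```python
import math

# Average conversion rate from the benchmark, expressed as a probability.
BASELINE_RATE = 0.001039 / 100


def odds_ratio(coef: float, delta: float = 1.0) -> float:
    """Multiplicative change in odds for a `delta` increase in the feature."""
    return math.exp(coef * delta)


def adjusted_rate(baseline: float, coef: float, delta: float = 1.0) -> float:
    """Probability after applying the odds ratio to the baseline odds."""
    odds = baseline / (1.0 - baseline)
    new_odds = odds * odds_ratio(coef, delta)
    return new_odds / (1.0 + new_odds)


def relative_lift(baseline: float, coef: float, delta: float = 1.0) -> float:
    """Relative change in probability versus the baseline (0.5 means +50%)."""
    return adjusted_rate(baseline, coef, delta) / baseline - 1.0


# Hypothetical coefficient: ln(1.5), i.e. an odds ratio of 1.5.
coef = math.log(1.5)
print(odds_ratio(coef))
print(relative_lift(BASELINE_RATE, coef))
```

Because applications are so rare, the odds ratio and the relative lift in probability are nearly identical, so a coefficient of about 0.405 corresponds to roughly a 50% lift over the baseline.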

**Alternative Result:**

If we determine that none of the features have significant predictive power, that is a finding in itself.  This could help redefine the client’s approach to media planning, knowing that many of the features we used don't necessarily affect their performance.  It could also lead to a second iteration of this project that uses a different dataset focused on other features (e.g. creative design and messaging).