### Problem Statement

Prompt: Fluent is a performance-based customer acquisition platform that drives quality and volume to clients. While the company excels at customer acquisition, it must improve at selling merchandise. Users who are most likely to buy merchandise (Cost Per Action, or CPA, Wall buyers) are valuable because the upside is theoretically limitless. The way our merchandise program works now, a broker pays us per sale that originated from our website, and we can continue to scale as long as the broker's return on investment remains positive. In addition, users who buy one item on the CPA Wall tend to be our most profitable users.

* Statement: Using July - September 2016 user conversion data, determine the most relevant features associated with CPA Wall buyers, and predict whether a user visiting a Fluent site will buy an item on the CPA Wall. 

* Hypothesis: The user most likely to buy on the CPA Wall will be a woman over 45 years old. Age will have the most impact on whether a user is a CPA Wall buyer, since an older person typically has more disposable income and so is better able to spend. Female users subscribe to offers more often than male users across most of Fluent's sites, so there could be a similar association on the CPA Wall between being female and a propensity to spend. At this moment I think a random forest model will work best with the data, but as always, "There is no free lunch."
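
To make the hypothesis testable, one option is to fit a random forest and inspect its feature importances. The following is only a minimal sketch under stated assumptions: the data is synthetic (not the Fluent data set), the feature names `is_female` and `owns_home` are hypothetical stand-ins, and the label is artificially tied to age so that the model has something to learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_users = 2000

# Synthetic stand-in features; NOT the Fluent data.
age = rng.integers(13, 117, size=n_users)  # ages 13-116, as in the data dictionary
is_female = rng.integers(0, 2, size=n_users).astype(bool)
owns_home = rng.integers(0, 2, size=n_users).astype(bool)
X = np.column_stack([age, is_female, owns_home])
feature_names = ["age", "is_female", "owns_home"]  # hypothetical names

# Synthetic label loosely tied to age (an assumption made for illustration only).
buy_probability = np.where(age > 45, 0.05, 0.01)
y = rng.random(n_users) < buy_probability

forest = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced",  # buyers are rare, so reweight the classes
    random_state=0,
)
forest.fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = dict(zip(feature_names, forest.feature_importances_))
```

On the real data, the same `importances` mapping would indicate whether age actually dominates, which is the claim the hypothesis makes.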

### Domain Knowledge

* The CPA Wall is not the first thing a user sees after registering on our site; it is somewhere between the 5th and 9th, depending on the user.
* Given user attrition throughout our site, a purchase rate of 1.5% across all registered users (based on the data set) is not unreasonable. E-commerce/retail sites generally have a ~3% conversion rate (based on the 2012 MarketSherpa website optimization benchmark survey).
* U.S. e-commerce site visitors convert at 2.5%, while U.S. e-commerce site visitors on smartphones convert at ~1%.

http://www.smartinsights.com/ecommerce/ecommerce-analytics/ecommerce-conversion-rates/attachment/2016-11-21_06-48-53/


### Datasets

Variable | Description | Type of Variable
---| ---| ---
age | integer 13-116; 13 lowest and 116 highest | continuous
gender| male, female| categorical
state | denotes state of home address | categorical
esp | email service provider | categorical
vag | “64% of Americans own their homes. Do you own or rent yours?”; own = TRUE, rent = FALSE | categorical
val | "Do you have a car?"; yes = TRUE, no = FALSE| categorical
vas | “Do you or a loved one have Arthritis?”; yes = TRUE, no = FALSE, arthritis = TRUE| categorical
vaq | “Do you or a loved one have Diabetes?”; yes = TRUE, no = FALSE, diabetes = TRUE | categorical
v2m | “The 2016 Presidential Election is over. How do you identify politically?”; democrat, republican, or independent | categorical
CPAWallBuyer | yes or no | categorical
count(0) | number of users who share the values in the other columns of the same row; integer 1-15 | continuous

* I need to create dummy variables for three categorical features (state, esp, v2m). I will also convert all binary variables to booleans (TRUE or FALSE).
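
The encoding step described above could look roughly like the following sketch, which uses pandas. The sample rows are made up for illustration, the `is_female` column name is a hypothetical choice, and only a subset of the columns from the table is shown.

```python
import pandas as pd

# Made-up sample rows; column names follow the table above.
df = pd.DataFrame({
    "age": [52, 34, 61],
    "gender": ["female", "male", "female"],
    "state": ["NY", "CA", "NY"],
    "esp": ["gmail", "yahoo", "aol"],
    "vag": ["own", "rent", "own"],
    "val": ["yes", "no", "yes"],
    "vas": ["arthritis", "no", "yes"],
    "v2m": ["democrat", "independent", "republican"],
    "CPAWallBuyer": ["yes", "no", "no"],
})

# Map each binary column to booleans; any unmapped value becomes NaN for review.
binary_maps = {
    "vag": {"own": True, "rent": False},
    "val": {"yes": True, "no": False},
    "vas": {"yes": True, "no": False, "arthritis": True},
    "CPAWallBuyer": {"yes": True, "no": False},
}
for col, mapping in binary_maps.items():
    df[col] = df[col].map(mapping)

# Gender is binary in this data set; "is_female" is a hypothetical column name.
df["is_female"] = df["gender"].map({"female": True, "male": False})
df = df.drop(columns=["gender"])

# One dummy column per category level for the multi-level features.
encoded = pd.get_dummies(df, columns=["state", "esp", "v2m"])
```

With 93 raw values in `state`, cleaning that column before calling `get_dummies` would keep the number of dummy columns manageable.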

### Project Concerns

* General concerns:
> * Making the dataset machine-readable will take a good amount of time. Specifically, I am concerned about cleaning the "State" categorical feature, since it has 93 unique values. Some values in this column are incorrect, and some are not abbreviated properly.
> * The data spans only three months of site traffic; I am concerned about how well insights drawn from three months of credit card purchases will generalize.

* Assumptions:
> * Generally, we pay affiliate networks to drive traffic to our sites. An interesting piece of missing data is the specific promotion a user clicked on to land on our site.
> * It is not known whether all the users landed on the CPA Wall. It is known whether the user registered, and bought or did not buy at least one item.
> * The dataset does not note the item(s) purchased, or even the category that the purchased item belongs to (e.g., magazines, electronics).
> * The dataset does not include the price of the purchased item.

* Risk:
> * Cost of being incorrect: lost potential revenue from scaling this part of the business.
> * The ability to classify a user as a CPA Wall buyer would also let us expand our audience monetization strategies; without it, our audience monetization stays less diverse.
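
The "State" cleanup mentioned under general concerns could be approached with a small normalization function. This is a sketch only: the lookup table is deliberately partial, the messy sample values are invented, and `normalize_state` is a hypothetical helper name.

```python
# Partial lookup for illustration; a real cleanup would cover every state and territory.
STATE_ABBREVIATIONS = {
    "new york": "NY",
    "california": "CA",
    "texas": "TX",
}
VALID_CODES = set(STATE_ABBREVIATIONS.values())


def normalize_state(raw):
    """Return a two-letter code, or None when the value cannot be resolved."""
    if raw is None:
        return None
    cleaned = raw.strip().rstrip(".")
    if cleaned.upper() in VALID_CODES:
        return cleaned.upper()
    # Fall back to a full-name lookup; unknown values come back as None.
    return STATE_ABBREVIATIONS.get(cleaned.lower())


# Invented examples of the kinds of values the column might contain.
samples = ["NY", " ny ", "New York", "Calif", "tx.", "XX"]
cleaned = [normalize_state(s) for s in samples]
```

Values that come back as `None` are the ones to review by hand rather than guess at, which keeps incorrect entries from silently becoming their own category.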

### Outcomes

* I expect the output to highlight the features that are most important for predicting a buyer, and to classify a CPA Wall buyer correctly with ~70% accuracy and acceptable precision and recall. Since there is nothing in place at the moment, instituting a model with ~70% accuracy will ultimately lead to a change in audience monetization (perhaps making the CPA Wall one of the top 4 things a user sees after registering). Being able to classify a user in real time can give us a real-time expected value for that user, allowing us to better maximize revenue per registration.

* Age will probably be my most important feature, but on its own it will not add much to this part of the business.

* If the project results in an unusable model, the next step would be to explore user segmentation using unsupervised models, with the end goal of redefining (or better defining) established user segments. Classifying a user as a CPA Wall buyer is an exercise in segmentation, and I would continue to attempt it in order to optimize audience monetization.
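
Because the purchase rate in the data set is about 1.5%, accuracy is worth reading alongside precision and recall: a rule that never predicts a buyer would be correct for roughly 98.5% of users while finding no buyers at all. The sketch below illustrates this with made-up labels (not the Fluent data); the hypothetical model's counts are chosen only to land near the ~70% accuracy target.

```python
def precision_recall_accuracy(y_true, y_pred):
    """Compute precision, recall, and accuracy for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)
    fp = sum((not t) and p for t, p in pairs)
    fn = sum(t and (not p) for t, p in pairs)
    tn = sum((not t) and (not p) for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy


# Made-up population of 1,000 users with a 1.5% purchase rate (15 buyers).
y_true = [True] * 15 + [False] * 985

# Baseline that never predicts a purchase.
never_buyer = [False] * 1000

# Hypothetical model: flags 10 of the 15 buyers plus 290 non-buyers.
model = [True] * 10 + [False] * 5 + [True] * 290 + [False] * 695

baseline_scores = precision_recall_accuracy(y_true, never_buyer)
model_scores = precision_recall_accuracy(y_true, model)
```

In this illustration the baseline has higher accuracy than the hypothetical model but zero recall, so the precision and recall thresholds are what determine whether a model is usable for monetization decisions.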