# Problem
Can we predict return on ad spend for a brand advertising on Amazon's online display ad platform?

# Hypothesis
Product success and ad dollars spent can predict a campaign's performance. 

# How does advertising with Amazon work?

* Advertisers buy display ads to drive consumers to purchase their products
* Consumers are targeted with Amazon ads all over the web on any device 
* When a consumer who was shown the ad buys the product, that impression (and its campaign) is credited with the purchase
* Return on ad spend (ROAS) = sales driven / ad spend (e.g. a brand ran a 20,000 dollar ad campaign which drove 50,000 dollars in product sales, equalling a 2.50 ROAS)
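
The ROAS arithmetic in the last bullet can be checked with a few lines of Python. This is only a minimal sketch: the function name `roas` is hypothetical and is not part of any Amazon tooling.

```python
def roas(sales_driven, ad_spend):
    """Return on ad spend: attributed product sales divided by ad spend."""
    if ad_spend <= 0:
        raise ValueError("ad_spend must be positive")
    # float() keeps the division exact under both Python 2 and Python 3
    return float(sales_driven) / ad_spend

# Example from the text: a 20,000 dollar campaign that drove 50,000 dollars in sales
example = roas(50000, 20000)
print(example)  # 2.5
```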

## Data

I first pulled 2017 campaign data for all US advertisers from our data warehouse. The data exported cleanly to a CSV and needed no cleanup. This source gave me 13,800 rows. An example of the dataset is below.

In [6]:
import pandas as pd
campaign = pd.read_csv('GA_NA2017.csv')
campaign.advertiser_name = 'x' #advertiser name hidden for privacy concerns 
campaign[['advertiser_name','campaign_start','campaign_end','impressions','product_sales_$_total']].head(5)

| | advertiser_name | campaign_start | campaign_end | impressions | product_sales_$_total |
|---|---|---|---|---|---|
| 0 | x | 4/27/2017 7:00 | 5/11/2017 7:00 | 207444 | 0.0 |
| 1 | x | 10/23/2017 4:00 | 12/4/2017 4:59 | 10839272 | 17181.39 |
| 2 | x | 10/24/2017 7:00 | 11/18/2017 8:00 | 836952 | 0.0 |
| 3 | x | 7/1/2017 4:00 | 10/1/2017 3:59 | 14090108 | 114623.58 |
| 4 | x | 6/28/2017 7:00 | 7/12/2017 7:00 | 295718 | 0.0 |


However, I had trouble getting ASIN (Amazon Standard Identification Number) data. Our data warehouse stores one ASIN per row per campaign, so joining it in would duplicate each campaign row as many as 100 times. I tried to collapse those rows into a single delimited column using Redshift's LISTAGG function, but some campaigns had too many ASINs and the concatenated result exceeded the function's size limit. I then decided to limit the number of ASINs per campaign to five, which gave me a sample of each campaign's products without the technical difficulties of pulling the full list. I used the row_number window function to filter out the excess ASINs. This is a known limitation of the model if ASINs turn out to play a large role in predicting ROAS.
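
The five-ASIN cap was applied in SQL with row_number, and that query is not shown here. As an illustration of the same idea only, the sketch below does the equivalent in pandas on a made-up extract. The frame `campaign_asins` and the column names `campaign_id` and `asin` are assumptions, not the warehouse schema, and which five ASINs survive depends on row order, which the original query may have defined differently.

```python
import pandas as pd

# Hypothetical one-ASIN-per-row extract; names and values are invented
campaign_asins = pd.DataFrame({
    'campaign_id': ['c1'] * 7 + ['c2'] * 3,
    'asin': ['A%02d' % i for i in range(10)],
})

MAX_ASINS = 5

# cumcount() numbers the rows within each campaign starting at 0,
# so it plays the role of row_number() - 1 in SQL
rank = campaign_asins.groupby('campaign_id').cumcount()
capped = campaign_asins[rank < MAX_ASINS]

# Campaign c1 is cut from 7 ASINs to 5; c2 keeps all 3
print(capped.groupby('campaign_id').size())
```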

With the additional ASIN data I increased my dataset to 38,929 rows.

Even though I now have the associated ASINs for each campaign, I do not have each ASIN's attributes such as price, rating, and number of reviews. I cannot access this data through work because I am in a different organization. I considered using Amazon's Product API, but that required signing up as an associate and having a website. I therefore built a web scraper to pull the attributes from Amazon's website by ASIN. Code below.

In [None]:
from selenium import webdriver
import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

asins = pd.read_csv('asins.csv')
asindetails = pd.DataFrame(columns=['asin','price'])
asinsfailed = []

driver = webdriver.Chrome()
driver.implicitly_wait(10)

baseurl = 'https://www.amazon.com/exec/obidos/ASIN/'
price_xpath = "//*[@id='priceblock_ourprice']"

for index, row in asins.iterrows():
    # Only scrape the first batch (index labels 0 through 100)
    if index > 100:
        break
    asin = row.iloc[0]
    driver.get(baseurl + str(asin))
    try:
        # Wait up to 10 seconds for the price element to appear
        price_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, price_xpath))
        )
    except Exception:
        print("Could not find ASIN details. Skipping:", asin)
        asinsfailed.append(asin)  # record the failure in the list, not the DataFrame
        continue
    temp = pd.DataFrame({
        'asin': [asin],
        'price': [price_element.text.replace('$', '')],
        #'reviews': [driver.find_element(By.XPATH, "//*[@id='acrCustomerReviewText']").text.replace(' customer reviews','').replace(' customer review','')],
        #'rating': [driver.find_element(By.XPATH, "//*[@id='reviewSummary']/div[2]/span/a/span").text.replace(' out of 5 stars','')],
        #'category': [driver.find_element(By.XPATH, "//*[@id='wayfinding-breadcrumbs_feature_div']/ul/li[1]/span/a").text]
    })
    asindetails = pd.concat([asindetails, temp], ignore_index=True)
    print("Completed ASIN:", asin)
    # Rewrite the CSV after every ASIN so partial results are kept on disk
    asindetails.to_csv("asinprice_1-100.csv")

driver.quit()
print("Job Complete")
print(len(asinsfailed), "Failed:", asinsfailed)

By scraping the detail pages, I was able to extract price, category, and number of reviews.
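
Scraped values arrive as display strings, so they need converting to numbers before modeling. The helpers below are a sketch of one way to do that. The function names are hypothetical, and the sample strings are invented to match the formats the scraper's `replace` calls imply. They are not real scraped output.

```python
import re

def parse_price(text):
    """Convert a price string such as '$1,299.99' to a float, or None if no number is found."""
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    return float(match.group().replace(',', '')) if match else None

def parse_review_count(text):
    """Convert a string such as '1,234 customer reviews' to an int, or None if no number is found."""
    match = re.search(r'\d[\d,]*', text)
    return int(match.group().replace(',', '')) if match else None

print(parse_price('$1,299.99'))
print(parse_review_count('1,234 customer reviews'))
print(parse_price('Currently unavailable'))
```

Returning `None` for unparseable text keeps missing prices distinguishable from a genuine price of zero.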