# Task

The company Waves produces plugins for use in audio production. To remain competitive in an increasingly saturated market, it has moved to a pricing model where discounts are offered year-round. However, the discount applied to each product varies considerably throughout the year. It would be useful to see how a current sale price compares with historical prices, in order to judge whether it is a good time to buy.

The objective is to write a scraper that periodically retrieves both the regular price and the sale price of each item, similar to how camelcamelcamel tracks prices on Amazon.

# Exploring the Target Page

The main products page is found at https://www.waves.com/plugins. However, after visiting this URL, the site automatically adds a URL fragment of the following form:

https://www.waves.com/plugins#sort:path~type~order=.hidden-price~number~asc|views:view=grid-view|paging:currentPage=0|paging:number=20

By default 20 items are displayed, and further items require cycling through multiple pages. However, changing the number "20" to "all" in the URL fragment displays all available products on one page. This removes the need for any functionality to drive the website.

Also, scanning the list of products shows that certain items at the very bottom have neither a regular price nor a sale price and are only available through the purchase of particular bundles.
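
The fragment change described above can be sketched as a small helper. This is only an illustration: the function name `show_all_products` is hypothetical, and it assumes the fragment keeps the `|`-separated `key=value` layout shown earlier.

```python
def show_all_products(url: str) -> str:
    """Set the paging:number setting in the URL fragment to 'all'."""
    # Split the URL into the part before '#' and the fragment after it
    base, _, fragment = url.partition("#")
    # Fragment settings are separated by '|'; replace only the page-size one
    settings = [
        "paging:number=all" if s.startswith("paging:number=") else s
        for s in fragment.split("|")
    ]
    return base + "#" + "|".join(settings)


default_url = (
    "https://www.waves.com/plugins"
    "#sort:path~type~order=.hidden-price~number~asc"
    "|views:view=grid-view|paging:currentPage=0|paging:number=20"
)
print(show_all_products(default_url))
```

Note that a URL fragment is handled by the browser and is not sent to the server as part of the HTTP request, so it is worth checking the number of products in the downloaded HTML rather than relying on the fragment alone.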

In [86]:
import csv
import requests
import os
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup

# Download the HTML page using requests
url = 'https://www.waves.com/plugins#sort:path~type~order=.hidden-price~number~asc|views:view=grid-view|paging:currentPage=0|paging:number=all'
r = requests.get(url, allow_redirects=True)

with open('waves_plugins.html', 'wb') as webpage:
    webpage.write(r.content)  

In [100]:
# Open the downloaded webpage and create a new BeautifulSoup object
with open('waves_plugins.html', encoding='utf8') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# Remove the HTML file
#os.remove("waves_plugins.html")

After inspecting the HTML file manually, it looks like data for each product is contained within `<article>` tags.
Using `find_all` with the article tag and calling len() on this object shows that there are 203 items with the `<article>` tag, which is the same number of products listed.

In [88]:
print(len(soup.find_all('article')))

204


In [95]:
# Write the header row only when the history file does not exist yet,
# so that repeated runs do not append duplicate headers.
if not os.path.exists('waves_price_history.csv'):
    with open('waves_price_history.csv', 'w', newline='', encoding='utf-8') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        spamwriter.writerow(["date", "product_id", "product_title", "regular_price", "sale_price", "coupon_price"])

        
def formatter(price_html):
    """Convert a price tag into a 'dollars.cents' string, or 'NaN' if the tag is missing."""
    if price_html is None:
        return "NaN"

    # Keep only the markup that follows the dollar sign
    price_html = str(price_html).split("$")[1]

    if "<sup>" in price_html:
        # Cents are held in a <sup> tag: digits before it are dollars, digits after are cents
        dollar, cents = price_html.split("<sup>")[:2]
        dollar = "".join(s for s in dollar if s.isnumeric())
        cents = "".join(s for s in cents if s.isnumeric())
        return dollar + "." + cents

    # Whole-dollar prices have no <sup> tag, so append the cents
    return "".join(s for s in price_html if s.isnumeric()) + ".00"
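
As an alternative to the string splitting in `formatter`, the same extraction could be sketched with a regular expression. This is not the notebook's method, only a possible variant: the name `parse_price` is hypothetical, and the sample markup in the usage lines is an assumption about how the site renders prices, not copied from the page.

```python
import re


def parse_price(price_html):
    """Return a 'dollars.cents' string, or 'NaN' when no price is found."""
    if price_html is None:
        return "NaN"
    # Replace tags with spaces so the dollar and cent digits stay separated
    text = re.sub(r"<[^>]+>", " ", str(price_html))
    # Dollars follow the '$'; an optional two-digit cents group may follow
    match = re.search(r"\$\s*([\d,]+)(?:[\s.]+(\d{2})\b)?", text)
    if match is None:
        return "NaN"
    dollars = match.group(1).replace(",", "")
    cents = match.group(2) or "00"
    return dollars + "." + cents


# Hypothetical markup, assumed for illustration only
print(parse_price('<div class="on-sale-price align-center">$79<sup>99</sup></div>'))
print(parse_price('<div class="regular-price align-center">$199</div>'))
print(parse_price(None))
```

Unlike `formatter`, this variant returns `'NaN'` instead of raising an error when the tag contains no dollar sign.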

In [107]:
print("{:<40} {:<20} {:>13} {:>13} {:>13}".format("Plugin", "ID", "Reg. Price", "Sale. Price", "Coupon Price"))
print()


new_prices_dict = {}

for product in soup.find_all('article'):
    
    product_id = product['id']
    
    product_title = product.find('p', class_='title').get_text(strip=True)
    
    regular_price = formatter(product.find('div', class_='regular-price align-center'))
    
    sale_price = formatter(product.find('div', class_='on-sale-price align-center'))

    coupon_price = formatter(product.find('div', class_='with-coupon align-center'))

    print("{:<40} {:<20} {:>13} {:>13} {:>13}".format(product_title, 
                                                      product_id, 
                                                      regular_price, 
                                                      sale_price, 
                                                      coupon_price))
    
    entry = [datetime.now(), product_id, product_title, regular_price, sale_price, coupon_price]
    
    new_prices_dict[product_id] = entry

    
    
old_prices_dict = {}

with open('waves_price_history.csv', 'r', newline='', encoding='utf-8') as csvfile:
    csvfile = csv.reader(csvfile, delimiter=' ', quotechar='|')

    for row in csvfile:
        old_prices_dict[row[1]] = row 

        
changed_prices = {}

total = 0

for key in new_prices_dict:
    total += 1

    if key not in old_prices_dict:
        # Product not seen before: always record it
        changed_prices[key] = new_prices_dict[key]
    elif new_prices_dict[key][1:] != old_prices_dict[key][1:]:
        # Compare every field except the timestamp in column 0
        print(key, "Not Same")
        print(new_prices_dict[key][1:])
        print(old_prices_dict[key][1:])

        changed_prices[key] = new_prices_dict[key]
        

    

print("Changed prices")

# Open the history file once and append every changed row
with open('waves_price_history.csv', 'a', newline='', encoding='utf-8') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_ALL)

    for key in changed_prices:
        print(key, changed_prices[key])
        spamwriter.writerow(changed_prices[key])
            


print(total)



#price_history = pd.read_csv("waves_price_history.csv", encoding = "ISO-8859-1", sep=" ", quotechar='|')        
#print(price_history)


Plugin                                   ID                      Reg. Price   Sale. Price  Coupon Price

Waves Tune Real-Time                     node-TNELV                  199.00         79.99         39.99
Vocal Rider                              node-VCLRIDTDM              249.00         69.99         34.99
Abbey Road Studio 3                      node-ABRDNXST               199.00        129.00         64.50
SSL G-Master Buss Compressor             node-SSLGMBTDM              249.00         69.99         34.99
CLA-76 Compressor / Limiter              node-CLA76TDM               249.00         59.99         29.99
Abbey Road TG Mastering Chain            node-ABRDTGMC               199.00         59.99         29.99
SSL E-Channel                            node-SSLECHTDM              249.00         69.99         34.99
Waves Tune                               node-TNEBND                 249.00         59.99         29.99
J37 Tape                                 node-J37SG            

B360 Ambisonics Encoder                  node-B360AE                    NaN        299.00        149.50
Z-Noise                                  node-USW379-1362-605         99.00         59.99         29.99
Electric 88 Piano                        node-EL88PIAN                69.00         59.99         29.99
The King's Microphones                   node-ABRDKMICSG              99.00         59.99         29.99
Renaissance Equalizer                    node-V5-RQD40                79.00         59.99         29.99
Linear Phase Multiband Compressor        node-LMBTDM                 149.00         59.99         29.99
Renaissance Reverb                       node-V5-RRD40                99.00         59.99         29.99
Manny Marroquin Triple D                 node-MMDE3SG                 99.00         59.99         29.99
Trans-X                                  node-TRXTDM                  99.00         49.99         24.99
CLA Bass                                 node-CLBASSG           

node-V5-STD40 Not Same
['node-V5-STD40', 'SuperTap', '79.00', '49.99', '24.99']
['node-V5-STD40', 'SuperTap', '72.00', '49.99', '24.99']
Changed prices
node-V5-STD40 [datetime.datetime(2019, 11, 30, 17, 0, 30, 649044), 'node-V5-STD40', 'SuperTap', '79.00', '49.99', '24.99']
203


1. Get the prices.

2. On the next run, the scraper has to check whether each price differs from the last recorded one.

3. If the price is different, write the new row; otherwise, don't write.
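
The three steps above can be sketched independently of the CSV file. The helper name `changed_rows` is hypothetical; rows are assumed to follow the column order used in this notebook (timestamp first, then product ID), and the dates in the sample rows are illustrative.

```python
def changed_rows(old_rows, new_rows):
    """Return the new rows that are unseen or differ from the last stored row.

    Column 0 is the timestamp and is ignored in the comparison.
    """
    latest = {}
    for row in old_rows:
        latest[row[1]] = row  # later rows overwrite earlier ones

    changed = []
    for row in new_rows:
        previous = latest.get(row[1])
        if previous is None or list(row[1:]) != list(previous[1:]):
            changed.append(row)
    return changed


# Sample rows; the dates are illustrative
old = [
    ["2019-11-29", "node-V5-STD40", "SuperTap", "72.00", "49.99", "24.99"],
    ["2019-11-29", "node-TRXTDM", "Trans-X", "99.00", "49.99", "24.99"],
]
new = [
    ["2019-11-30", "node-V5-STD40", "SuperTap", "79.00", "49.99", "24.99"],
    ["2019-11-30", "node-TRXTDM", "Trans-X", "99.00", "49.99", "24.99"],
]
print(changed_rows(old, new))
```

Only the SuperTap row is returned here, since its regular price is the only field that differs from the stored row.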

# Current Issues and Bugs

1. Products that cost more than 99.99 have whole-number dollar amounts, which breaks the price formatting.
2. When updating the CSV file, it would be more efficient to add new data only when there has been a price change.
3. There seem to be some Unicode issues with the trademark sign used in the "Neural"-named products. A temporary workaround is to use `encoding = "ISO-8859-1"` when opening the file with pandas.
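
For the third issue, one thing worth checking is whether the file is read with the same encoding it was written with. The sketch below uses only the standard `csv` module and a temporary file; the product title and ID are hypothetical, and it assumes the history file is written as UTF-8 throughout, as in the cells above. It does not confirm the actual cause of the problem.

```python
import csv
import os
import tempfile

# Hypothetical title containing the trademark sign (U+2122)
title = "Example Neural\u2122 Plugin"

path = os.path.join(tempfile.mkdtemp(), "history.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter=" ", quotechar="|").writerow(["node-EXAMPLE", title])

# Reading with the encoding used for writing recovers the title
with open(path, newline="", encoding="utf-8") as f:
    utf8_row = next(csv.reader(f, delimiter=" ", quotechar="|"))

# Reading the same bytes as ISO-8859-1 does not fail, but garbles the sign
with open(path, newline="", encoding="ISO-8859-1") as f:
    latin1_row = next(csv.reader(f, delimiter=" ", quotechar="|"))

print(utf8_row[1] == title)
print(latin1_row[1] == title)
```

If the UTF-8 read fails on the real file, some rows may have been written earlier with a different encoding, in which case regenerating the file may be simpler than changing the reader.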