# Task

The company Waves produces plugins for use in audio production. To remain competitive in an increasingly saturated market, it has transitioned to a pricing model where discounts are offered year-round. However, the discount applied to each product varies considerably throughout the year. It would be useful to see how a current sale price compares with past prices in order to know whether it is an optimum time to buy.

The task is to write a scraper that periodically retrieves both the regular price and the sale price of each item, similar to how camelcamelcamel tracks prices on Amazon. This allows a decision to be made about the best time to buy a product.

# Exploring the Target Page

The main products page is found at https://www.waves.com/plugins. However, after visiting this URL, a URL fragment is added and it takes the following form:

https://www.waves.com/plugins#sort:path~type~order=.hidden-price~number~asc|views:view=grid-view|paging:currentPage=0|paging:number=20

By default, 20 items are displayed, and seeing further items requires cycling through multiple pages. However, changing the number "20" to "all" in the URL fragment displays all available products on one page. This removes the need for any functionality to drive the website, such as clicking through the pages.
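The fragment edit described above can also be done programmatically. The following is a minimal sketch using only the standard library; the helper name `show_all_products` is hypothetical and not part of the original notebook, and it assumes the fragment stays a `|`-separated list of `key=value` settings as shown above.

```python
from urllib.parse import urldefrag

def show_all_products(url: str) -> str:
    """Return `url` with the `paging:number` setting in its fragment set to `all`."""
    base, fragment = urldefrag(url)
    # The fragment is a '|'-separated list of 'key=value' settings
    settings = [
        'paging:number=all' if s.startswith('paging:number=') else s
        for s in fragment.split('|')
    ]
    return base + '#' + '|'.join(settings)

default_url = ('https://www.waves.com/plugins'
               '#sort:path~type~order=.hidden-price~number~asc'
               '|views:view=grid-view|paging:currentPage=0|paging:number=20')
all_url = show_all_products(default_url)
print(all_url)
```

Only the `paging:number` setting is touched, so the sort order and view settings are carried over unchanged.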

Also, scanning the list of products shows that, at the very bottom, certain items have neither a regular price nor a sale price and are only available through the purchase of particular bundles.

In [130]:
import csv
import requests
import os
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup

# Download the HTML page using requests.
# Note: the fragment (everything after '#') is interpreted by the browser and is
# not sent to the server, so it does not change what requests receives.
url = 'https://www.waves.com/plugins#sort:path~type~order=.hidden-price~number~asc|views:view=grid-view|paging:currentPage=0|paging:number=all'
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop early if the download failed

# Save the response to a temporary HTML file
with open('Waves_plugins.html', 'wb') as webpage:
    webpage.write(r.content)

# Open the downloaded file and create a new BeautifulSoup object
with open('Waves_plugins.html', encoding='utf8') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# Remove the temporary HTML file
os.remove("Waves_plugins.html")

After inspecting the HTML file manually, it looks like data for each product is contained within `<article>` tags.
Using `find_all` with the article tag and calling len() on this object shows that there are 203 items with the `<article>` tag, which is the same number of products listed.

In [127]:
print(len(soup.find_all('article')))

203


In [155]:
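# Write the header row only when the CSV file does not exist yet,
# so that repeated runs do not append duplicate header rows
if not os.path.exists('waves_price_history.csv'):
    with open('waves_price_history.csv', 'w', newline='', encoding='utf-8') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        spamwriter.writerow(["date", "product_id", "product_title", "regular_price", "sale_price"])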

# Loop over every product and append one row per product to the CSV file
with open('waves_price_history.csv', 'a', newline='', encoding='utf-8') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_ALL)

    for product in soup.find_all('article'):

        product_id = product['id']
        product_title = product.find('p', class_='title').get_text(strip=True)

        # Regular price: keep only the digits; stays None for bundle-only products
        regular_price = product.find('div', class_='regular-price align-center')

        if regular_price is not None:
            regular_price = regular_price.get_text(strip=True)
            regular_price = "".join(s for s in regular_price if s.isnumeric())

        # Sale price: keep only the digits, then replace the last two with ".99"
        sale_price = product.find('div', class_='on-sale-price align-center')

        if sale_price is not None:
            sale_price = sale_price.get_text(strip=True)
            sale_price = "".join(s for s in sale_price if s.isnumeric())
            sale_price = sale_price[:-2] + ".99"

        # print(product_id, product_title, regular_price, sale_price)

        spamwriter.writerow([datetime.now(), product_id, product_title, regular_price, sale_price])

# Read the accumulated price history back into a DataFrame
price_history = pd.read_csv("waves_price_history.csv", encoding="ISO-8859-1", sep=" ", quotechar='|')
print(price_history)

                            date       product_id  \
0     2019-11-12 23:05:02.863031    node-ABRDTGMC   
1     2019-11-12 23:05:02.865025  node-API2500TDM   
2     2019-11-12 23:05:02.868008     node-BSSLAPP   
3     2019-11-12 23:05:02.870007     node-BRERMOT   
4     2019-11-12 23:05:02.872018      node-BSSDPR   
...                          ...              ...   
1218  2019-11-12 23:13:21.847148     node-S360TDM   
1219  2019-11-12 23:13:21.849162      node-DTSNDM   
1220  2019-11-12 23:13:21.851155      node-DTSNUM   
1221  2019-11-12 23:13:21.853152     node-DTSNMST   
1222  2019-11-12 23:13:21.855150      node-DDSPCH   

                       product_title regular_price sale_price  
0      Abbey Road TG Mastering Chain           199      29.99  
1                           API 2500           299      29.99  
2                       Bass Slapper            69      29.99  
3                      Brauer Motion            99      29.99  
4                        BSS DPR-402       

# Next Steps

1. Move the code into a separate .py file and set it up as a cron job that runs periodically.
2. Create an interface, or some other way, to query the price history meaningfully, for example by comparing a product's current sale price with its past prices.
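One possible starting point for the second step is sketched below: for each product, it reports the lowest sale price recorded so far alongside the most recent one. The function name `lowest_and_latest` is hypothetical, the sample rows are made up for illustration, and the sketch assumes the CSV layout used in this notebook (space delimiter, `|` quote character) with rows appended in chronological order.

```python
import csv
import io

def lowest_and_latest(csvfile):
    """Map each product_id to (lowest sale price seen, most recent sale price)."""
    reader = csv.DictReader(csvfile, delimiter=' ', quotechar='|')
    summary = {}
    for row in reader:
        try:
            price = float(row['sale_price'])
        except (TypeError, ValueError):
            # Skip rows without a usable sale price, such as bundle-only
            # products or a header row repeated by an earlier run
            continue
        lowest, _ = summary.get(row['product_id'], (price, price))
        summary[row['product_id']] = (min(lowest, price), price)
    return summary

# Made-up rows for illustration only, in the same layout as waves_price_history.csv
sample = io.StringIO(
    'date product_id product_title regular_price sale_price\n'
    '|2019-11-01 09:00:00| |node-EXAMPLE| |Example Plugin| |99| |39.99|\n'
    '|2019-11-08 09:00:00| |node-EXAMPLE| |Example Plugin| |99| |29.99|\n'
    '|2019-11-15 09:00:00| |node-EXAMPLE| |Example Plugin| |99| |34.99|\n'
    '|2019-11-15 09:00:00| |node-BUNDLED| |Bundle Only Plugin| || ||\n'
)

for product_id, (lowest, latest) in lowest_and_latest(sample).items():
    verdict = 'at' if latest <= lowest else 'above'
    print(f'{product_id}: latest {latest:.2f} is {verdict} the lowest recorded price {lowest:.2f}')
```

To run it against the real data, the `io.StringIO` object would be replaced by the CSV file opened with `newline=''` and `encoding='utf-8'`.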

# Current Issues and Bugs

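1. Products that cost more than 99.99 have whole-number dollar amounts. This breaks the sale price formatting, which assumes that the last two digits are cents.
2. When updating the CSV file, it would be more efficient to only add new data when there is a price change.
3. There seem to be some Unicode issues, possibly because the CSV file is written as UTF-8 but read back as ISO-8859-1.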
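The first two issues could be approached along the following lines. This is only a sketch under stated assumptions: `parse_price` and `price_changed` are hypothetical helpers, and the price text is assumed to look like `$29.99` or `$149`, possibly with commas as thousands separators, which should be checked against the real page.

```python
import re
from decimal import Decimal

def parse_price(text):
    """Extract a price such as '$29.99' or '$149' from text; return None if absent."""
    if text is None:
        return None
    # Assumption: an optional decimal part of one or two digits follows the dollars
    match = re.search(r'\d+(?:\.\d{1,2})?', text.replace(',', ''))
    return Decimal(match.group()) if match else None

def price_changed(last_seen, product_id, regular_price, sale_price):
    """Return True if the prices differ from the last recorded ones, updating last_seen."""
    current = (regular_price, sale_price)
    if last_seen.get(product_id) == current:
        return False
    last_seen[product_id] = current
    return True

last_seen = {}  # product_id -> (regular_price, sale_price) from the previous write
regular, sale = parse_price('$149'), parse_price('$29.99')
print(regular, sale)
print(price_changed(last_seen, 'node-EXAMPLE', regular, sale))  # first sighting
print(price_changed(last_seen, 'node-EXAMPLE', regular, sale))  # unchanged prices
```

Because a cron job starts a fresh process on every run, `last_seen` would need to be rebuilt from the last row of each product in the CSV file before scraping.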

# Resources Used

https://www.youtube.com/watch?v=ng2o98k983k
    
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#
    
https://stackoverflow.com/questions/30750843/python-3-unicodedecodeerror-charmap-codec-cant-decode-byte-0x9d

https://pythonprogramming.net/introduction-scraping-parsing-beautiful-soup-tutorial/
