# <center> Acorns Challenge - List All Items on Macy's Website </center>

   ### <center >Jennifer Bryson &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; December 10, 2017 </center> 

Preface: My first approach to this problem was to use the Macy's API. I applied for an API key, but after waiting more than a day for approval I decided not to wait any longer, given the 3-day time constraint.  I'm glad I didn't keep waiting, as I still don't have access.

First we load in the required packages.

In [1]:
import requests
from lxml import html
from pprint import pprint
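try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2 (used for the run shown here)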
import json

Next, we load in all the required headers to be able to access the HTML code on https://www.macys.com/

In [2]:
headers = {
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8,de;q=0.6',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}

Now we're able to get all the HTML code from https://www.macys.com/, and we will use XPath to query it.

In [3]:
url_base = 'https://www.macys.com/'
main_page = requests.get(url_base, headers=headers)
if main_page.status_code != 200:
    print('Error: page status code is %s' % main_page.status_code)
tree = html.fromstring(main_page.content)

Great, we have all the HTML code that makes up the main webpage stored as "tree".  We can now use "tree" to pull out the specific information that we want.  First, we're going to want the urls to all of the categories containing products, e.g. the url to "Women's Activewear", the url to "Women's Basics", etc.  The following line of code finds us all these urls and stores them in a list.

In [4]:
categories = tree.xpath('//*[@id="mainNavigationFlyouts"]/div[*]/div[*]/ul/*/a/@href')
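
To make the XPath above more concrete, here is a minimal sketch on a toy fragment. The markup is hypothetical (it only imitates the nesting the expression expects, not Macy's real page), and it uses the standard library's `xml.etree.ElementTree` in place of lxml so that it runs without extra packages. ElementTree supports only a subset of XPath, so the final `@href` step is done in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the navigation markup.
snippet = """
<html><body>
<div id="mainNavigationFlyouts">
  <div><div><ul>
    <li><a href="/shop/womens-clothing/activewear?id=1">Activewear</a></li>
    <li><a href="/shop/womens-clothing/basics?id=2">Basics</a></li>
  </ul></div></div>
</div>
<div id="footer"><ul><li><a href="/help">Help</a></li></ul></div>
</body></html>
""".strip()

root = ET.fromstring(snippet)

# Same idea as the lxml query: start at the element with the given id,
# step down div/div/ul, accept any child element, then take its <a>.
anchors = root.findall(".//*[@id='mainNavigationFlyouts']/div/div/ul/*/a")
hrefs = [a.get("href") for a in anchors]
print(hrefs)
```

Only the two links under the navigation `div` are returned; the footer link is skipped because it does not sit under the element with that `id`.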

Just for fun, we can see how many different categories Macy's has for their products.  This will be the number of urls that we must search to get all the items.

In [5]:
print(len(categories))

783


Now we create an empty dictionary in which to store our product names.  We use a dictionary rather than a list because dictionary keys are unique and can be looked up in constant time on average, so duplicates are avoided faster than they would be with a list.  The values are never read, so we store the boolean `True` as a small placeholder.

In [11]:
product_dict = {}
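
As a small illustration of why the dictionary prevents duplicates, here is a sketch with made-up product names (they are placeholders, not scraped data). It also shows a `set`, which stores keys only and would be an alternative to the dictionary of `True` values.

```python
# Hypothetical product names, with one repeat.
scraped = ["Striped Cotton Tee", "Running Shorts", "Striped Cotton Tee"]

product_dict = {}
for name in scraped:
    product_dict[name] = True   # writing the same key again changes nothing

# Alternative: a set keeps unique names without placeholder values.
product_set = set(scraped)

print(len(scraped), len(product_dict), len(product_set))
```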

Now we're ready to get the products!  

We will loop over each category (we have the urls for all the categories from above), and for each category we will do two things: 

1. Take all the product item names off the page (Macy's has 60 products per page) and add them to the dictionary.  
1. Loop over all possible pages for that category (e.g. "Women's Activewear" has 40 pages worth of products).

Remark 1: This will not give us duplicates of product names because of the way dictionaries store data.

Remark 2: This step is the most computationally costly, and it will take approximately 2 hours (on my old slow laptop) to run.

In [12]:
for c in categories:
    # first we get the HTML code from a specific category's url (e.g. Women's Activewear) and parse it:
    url_c = urlparse(c)
    url_c = url_c._replace(netloc="www.macys.com")
    url_c = url_c._replace(scheme="https")
    product_page = requests.get(url_c.geturl(), headers=headers)
    if product_page.status_code != 200:
        print('Error: page status code is'), ; print(product_page.status_code) ; print(c) ; print(url_c.geturl())
    tree = html.fromstring(product_page.content)
    
    # once we have the HTML code for that url, we can take the products off of the page (there will be 60/pg)
    products_raw = tree.xpath('//a [@class="productDescLink"]/text()')  
    products_no_ws = [x.strip() for x in products_raw]  #remove whitespaces
    for i in list(filter(None, products_no_ws)):        #remove empty list elements
        product_dict.update({i: True})                  #add it to the dictionary - note we will not get duplicates
    
    # we must loop over all the pages for the category (e.g. Women's Activewear has 40 pages worth of products)
    while tree.xpath('//li [@class="nextPage "]/a/@href') != []:
        nextpage = tree.xpath('//li [@class="nextPage "]/a/@href')  
        next_url = 'https://www.macys.com' + nextpage[0]

        next_page = requests.get(next_url, headers=headers)
        if next_page.status_code != 200:
            print('Error: page status code is'), ; print(next_page.status_code) ; print(nextpage) 

        tree = html.fromstring(next_page.content)
        products_raw = tree.xpath('//a [@class="productDescLink"]/text()')  
        products_no_ws = [x.strip() for x in products_raw]
        for i in list(filter(None, products_no_ws)):
            product_dict.update({i: True})

Error: page status code is 500
['/shop/womens-clothing/dresses/Pageindex/26?id=5449', '/shop/womens-clothing/dresses/Pageindex/26?id=5449']
Error: page status code is 500
['/shop/womens-clothing/maternity-clothes/Pageindex/2?id=66718', '/shop/womens-clothing/maternity-clothes/Pageindex/2?id=66718']
Error: page status code is 500
['/shop/womens-clothing/pajamas-and-robes/Pageindex/6?id=59737', '/shop/womens-clothing/pajamas-and-robes/Pageindex/6?id=59737']
Error: page status code is 500
['/shop/womens-clothing/womens-pants/Pageindex/4?id=157', '/shop/womens-clothing/womens-pants/Pageindex/4?id=157']
Error: page status code is 500
['/shop/womens-clothing/womens-resort-vacation-wear/Pageindex/2?id=53427', '/shop/womens-clothing/womens-resort-vacation-wear/Pageindex/2?id=53427']
Error: page status code is 500
['/shop/womens-clothing/womens-skirts/Pageindex/6?id=131', '/shop/womens-clothing/womens-skirts/Pageindex/6?id=131']
Error: page status code is 500
/shop/womens-clothing/womens-suits?

We're done!  The names of all the products on Macy's website are now stored as: product_dict.keys()

I owe the reader an explanation of all the error messages above, though.  Running this multiple times, I saw status_code=500 errors for different pages each time.  My guess is that the page is temporarily unavailable, either because of heavy traffic or because Macy's is updating it.  If I had more time to work on this project, I would make the code wait a few seconds and retry the URL.  In any case, these error messages tell us which products may not have made our list.  It'd be a good idea to check whether the products on these pages are in our list.  If they aren't, we should add them by rerunning the code on just these pages, which should work since my various trials suggest that the same pages don't fail every time.

There are, however, a few pages that come up as 404 errors every time.  This is a separate issue: there are 3 URLs of the form www.social.macys.com and 2 URLs of the form www.customerservice-macys.com (as opposed to www.macys.com), so our code, which overwrites the host with www.macys.com, has trouble with these 5 specific pages.  Again, given more time, this is fixable.  The only other 404 error is caused by an unwanted space.  A few unusual URLs like these may simply need to be handled individually; a robust program that handles every such scenario would take me longer than 3 days.
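
The wait-and-retry idea mentioned above could look roughly like the sketch below. It was not part of the run shown here. The names `get_with_retry`, `FakeResponse` and `make_flaky_fetch` are hypothetical, and the retry count and wait times are arbitrary choices. The function takes the fetching function as an argument so the logic can be tried without touching the network.

```python
import time

def get_with_retry(fetch, url, retries=3, wait_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or the retries run out."""
    response = None
    for attempt in range(retries + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if response.status_code < 500:
            break  # e.g. a 404 is unlikely to fix itself, so don't retry
        if attempt < retries:
            sleep(wait_seconds * (attempt + 1))  # wait a little longer each time
    return response  # last response, so the caller can still log the status

# Stand-ins for a flaky server (assumptions, for demonstration only).
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

def make_flaky_fetch(statuses):
    remaining = iter(statuses)
    return lambda url: FakeResponse(next(remaining))

waits = []  # record the waits instead of actually sleeping
resp = get_with_retry(make_flaky_fetch([500, 500, 200]),
                      "https://www.macys.com/shop/example",
                      sleep=waits.append)
print(resp.status_code, waits)
```

In the notebook, `fetch` could be `lambda u: requests.get(u, headers=headers)`, with the default `sleep` left in place.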

Below we see how many products we found Macy's to have, and we store their names in a text file.

In [13]:
print(len(product_dict.keys()))

124634


In [14]:
%store product_dict.keys() >> acorns124634.txt

Writing 'product_dict.keys()' (list) to file 'acorns124634.txt'.


## How can this be generalized to other merchants?

The overall process outline can be repeated for other merchants.  Here are the changes that would need to be made:

1. Replace this set of headers with headers that the other website accepts.
1. Change the url_base, the parsed url netloc, and the parsed url scheme to those of the new merchant.
1. Rewrite the XPath expressions so that they point to where the category links, product names, and next-page links are on the new site.
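
One way to organize these changes is to gather the merchant-specific pieces in one place. The sketch below is only an illustration: `MerchantConfig` and `absolute_url` are hypothetical names, and the XPath strings are copied from the Macy's code above. Unlike that code, `absolute_url` fills in the scheme and host only when they are missing, which, assuming the off-site links carry their own host, would leave URLs such as the www.social.macys.com ones intact.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class MerchantConfig:
    scheme: str
    netloc: str
    category_xpath: str
    product_xpath: str
    next_page_xpath: str

# Values taken from the notebook cells above.
MACYS = MerchantConfig(
    scheme="https",
    netloc="www.macys.com",
    category_xpath='//*[@id="mainNavigationFlyouts"]/div[*]/div[*]/ul/*/a/@href',
    product_xpath='//a [@class="productDescLink"]/text()',
    next_page_xpath='//li [@class="nextPage "]/a/@href',
)

def absolute_url(config, href):
    """Turn a scraped href into a full URL for this merchant."""
    parts = urlparse(href.strip())          # drop stray leading/trailing spaces
    if not parts.scheme:
        parts = parts._replace(scheme=config.scheme)
    if not parts.netloc:
        parts = parts._replace(netloc=config.netloc)
    return parts.geturl()

print(absolute_url(MACYS, "/shop/womens-clothing/womens-suits?id=1"))
print(absolute_url(MACYS, "https://www.customerservice-macys.com/help"))
```

A new merchant would then need its own `MerchantConfig` instance and its own headers, while the scraping loop stays the same.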

## How can we assess accuracy?

To test our product list, we can go to https://www.macys.com/, find a **random** product name, and check whether it is in our dictionary.  Repeat this many times (sampling with replacement) and record how often the product name was on the list vs. not on the list.  By the law of large numbers, the fraction of sampled names that were on the list converges to the fraction of all product names that we have found.  For example, if we pick 1,000 random product names and find that 900 of them are on our list, then we would believe we've found roughly 90% of the product names.  The more random products we pick, the more accurate this estimate will be.  Of course, if you pick more random products than the total number of products that Macy's has, then you might as well have checked each item individually, which is a bad idea in terms of speed, so keep the number of products in mind when choosing how many times to repeat the test.
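
The arithmetic in the 900-out-of-1,000 example can be written out as below. This is a sketch, not something that was run for the challenge: `coverage_estimate` is a hypothetical helper, and the margin uses the standard normal approximation for a proportion (about 95% confidence for `z=1.96`), which assumes the products are sampled independently and uniformly at random.

```python
import math

def coverage_estimate(found, sampled, z=1.96):
    """Return the estimated fraction found and its approximate margin of error."""
    p = found / sampled
    margin = z * math.sqrt(p * (1 - p) / sampled)
    return p, margin

p, margin = coverage_estimate(900, 1000)
print("estimated coverage: %.3f +/- %.3f" % (p, margin))
# prints: estimated coverage: 0.900 +/- 0.019
```

Because the margin shrinks with the square root of the sample size, quadrupling the number of sampled products roughly halves the uncertainty.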

Alternatively, since our code works by first selecting a category (like Women's Activewear) and then getting all the products from that category, we could search our list for a product on the last page of each category, on the assumption that if our code found that product, then it's likely to have found the other products in that category.  This idea is less mathematically rigorous, though.  Also, as mentioned above, one should certainly test for products on the webpages that had the 500 and 404 errors.
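
The last-page check could be scripted along these lines. All data here is made up for illustration, and `categories_to_recheck` is a hypothetical helper; in practice `last_page_items` would have to be collected by hand or by a separate request per category.

```python
def categories_to_recheck(last_page_items, product_dict):
    """Return categories whose last-page product is missing from our results."""
    return sorted(category
                  for category, name in last_page_items.items()
                  if name not in product_dict)

# Hypothetical inputs: one known product from the last page of each category.
last_page_items = {
    "Women's Activewear": "Striped Cotton Tee",
    "Women's Basics": "Ribbed Tank Top",
}
product_dict = {"Striped Cotton Tee": True, "Running Shorts": True}

print(categories_to_recheck(last_page_items, product_dict))
```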

## References:

&nbsp; 1\. http://docs.python-guide.org/en/latest/scenarios/scrape/
<br>
&nbsp; 2\. https://stackoverflow.com/questions/2084670/how-to-extract-links-from-a-webpage-using-lxml-xpath-and-python
<br>
&nbsp; 3\. https://www.digitalocean.com/community/tutorials/how-to-use-web-apis-in-python-3

This last reference I didn't end up using, since I couldn't get an API key to access Macy's API, but I still learned a lot from it.


## Thank you!

I learned a ton during the process of this coding challenge and found it really fun.  Thank you for this opportunity and interesting challenge.

Best,

Jennifer Bryson
<br>
jabryson@uci.edu