## Using Splinter and Beautiful Soup to scrape product information from the Home Depot website

### Beautiful Soup

[Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

### Splinter
[Splinter](https://splinter.readthedocs.io/en/latest/) is an open source tool for testing web applications using Python. It lets you automate browser actions, such as visiting URLs and interacting with their items.

### Webdriver manager

The [Webdriver manager](https://pypi.org/project/webdriver-manager/) library automatically manages drivers for different browsers. To install it, run `pip install webdriver-manager` after activating your conda environment.

In [1]:
from bs4 import BeautifulSoup as bs
from splinter import Browser
from webdriver_manager.chrome import ChromeDriverManager
import time

In [2]:
executable_path = {'executable_path': ChromeDriverManager().install()}



Current google-chrome version is 99.0.4844
Get LATEST chromedriver version for 99.0.4844 google-chrome
Trying to download new driver from https://chromedriver.storage.googleapis.com/99.0.4844.51/chromedriver_mac64.zip
Driver has been saved in cache [/Users/pgarias/.wdm/drivers/chromedriver/mac64/99.0.4844.51]


In [3]:
browser = Browser('chrome', **executable_path, headless=False)


### Home Depot Search 

The goal is to build a search URL by appending a search term to a base URL. Visiting that URL loads the results in the browser window opened by the `Browser` call above.


In [4]:
base_url = "https://www.homedepot.com/s/"
search_term = "mosaic tile"

url = f"{base_url}{search_term}"
print(url)

https://www.homedepot.com/s/mosaic tile
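
The URL printed above contains a literal space, which the browser has to encode before sending the request. A minimal sketch of encoding the search term explicitly with the standard library follows; it assumes the site treats `%20` in the path the same way as the space the browser would encode on its own, and `encoded_url` is a name introduced here for illustration.

```python
from urllib.parse import quote

base_url = "https://www.homedepot.com/s/"
search_term = "mosaic tile"

# quote() percent-encodes characters that are not safe in a URL path,
# so the space in the search term becomes %20.
encoded_url = f"{base_url}{quote(search_term)}"
print(encoded_url)  # https://www.homedepot.com/s/mosaic%20tile
```

Encoding the term yourself makes the request URL predictable when the search term contains characters such as `&` or `#`, which would otherwise change how the URL is interpreted.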


In [5]:
# Pause after visiting the URL and after scrolling so the page has time to load

browser.visit(url)
time.sleep(2)

In [6]:
browser.execute_script("window.scrollTo(0, document.body.scrollHeight/1.5);")
time.sleep(2)
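
A single jump may not be enough if product pods render only as they scroll into view (an assumption about the page, not something verified here). One possible refinement is to scroll in equal steps; the sketch below only builds the JavaScript snippets, and `scroll_scripts` and `steps` are hypothetical names introduced for this example.

```python
def scroll_scripts(steps=5):
    """Build JavaScript snippets that scroll to 1/steps, 2/steps, ... of the page height."""
    return [
        f"window.scrollTo(0, document.body.scrollHeight * {i} / {steps});"
        for i in range(1, steps + 1)
    ]

for script in scroll_scripts(5):
    # With a live Splinter browser you would run:
    #     browser.execute_script(script); time.sleep(2)
    print(script)
```

Keeping the scripts in a list makes it easy to change the number of steps or the pause between them without repeating the `execute_script` call.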

### Parse the current page to Beautiful Soup

We can now pass the HTML of the rendered page to the [Beautiful Soup object](https://beautiful-soup-4.readthedocs.io/en/latest/#quick-start) and [look for all](https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#find-all) the products displayed. By default, each page requests 24 product pods.

In [7]:
html = browser.html
soup = bs(html,'html.parser')


In [8]:
product_details = soup.find_all("h2", class_="product-pod__title product-pod__title__product")

In [9]:
len(product_details)

24
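
Passing a string with two class names to `class_` matches the exact value of the `class` attribute, so the search depends on the order in which the site writes the classes. The sketch below compares that with a CSS selector, which matches both classes in any order. The sample HTML and the brand and product names in it are invented for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup as bs

# Invented sample markup: the second <h2> lists the same classes in reverse order.
sample_html = """
<div>
  <h2 class="product-pod__title product-pod__title__product">
    <span>Merola Tile</span> <span>Sample Penny Round</span>
  </h2>
  <h2 class="product-pod__title__product product-pod__title">
    <span>Example Brand</span> <span>Sample Hexagon</span>
  </h2>
</div>
"""
sample_soup = bs(sample_html, "html.parser")

# Exact match on the class attribute string: only the first <h2> is found.
exact = sample_soup.find_all("h2", class_="product-pod__title product-pod__title__product")

# CSS selector: both <h2> elements carry both classes, so both are found.
by_css = sample_soup.select("h2.product-pod__title.product-pod__title__product")

brands = [h2.find("span").get_text(strip=True) for h2 in by_css]
print(len(exact), len(by_css), brands)
```

If the site ever reorders or adds classes on the title element, the selector form keeps working while the exact-string form returns an empty list.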

Count how many product pods have the brand name "merola tile" in their first span.

In [16]:
count_merola = 0

for prod in product_details:
    # The first span in each product title holds the brand name
    brand = prod.find_all("span")[0].text
    if brand.lower().strip() == "merola tile":
        print(brand)
        count_merola += 1

Merola Tile
Merola Tile
Merola Tile
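
The same normalization (strip whitespace, lowercase) can be used to tally every brand at once rather than a single one. This is a minimal sketch on a hand-written list; `count_brands` is a hypothetical helper and the sample names are made up, so in the notebook the input would instead be the first-span text of each product pod.

```python
from collections import Counter

def count_brands(brand_names):
    """Tally brand names after stripping whitespace and lowercasing."""
    return Counter(name.strip().lower() for name in brand_names)

# Made-up sample input standing in for the scraped first-span texts.
sample_brands = ["Merola Tile", "merola tile ", "Example Brand", "Merola Tile"]
counts = count_brands(sample_brands)
print(counts["merola tile"])  # 3
```

A `Counter` returns 0 for brands it has never seen, so looking up a brand that is absent from the page does not raise an error.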


Use the `browser` object to [find an anchor tag with the attribute](https://stackoverflow.com/a/40161280) `aria-label=Next`. Clicking it loads the next page of results, where the count can be repeated.

In [17]:
button_c = browser.find_by_css('a[aria-label=Next]')
button_c.click()

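We also need to detect when the final page has been reached. The pagination counter holds the item range of the current page and the total number of products; when the last item on the page equals the total, there are no more pages.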

In [19]:
pagination_results = soup.find_all("span",class_="results-pagination__counts--number")

In [20]:
page_count_product = int(pagination_results[0].text.split("-")[1].strip())

In [21]:
total_products = int(pagination_results[1].text.strip())

In [22]:
if page_count_product==total_products:
    print("Last page")
else:
    print("Keep going")

Keep going
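
The last-page check above can be isolated into a small function that works on the two counter strings alone, which makes it testable without a browser. This is a sketch under assumptions: `parse_pagination` and `is_last_page` are hypothetical names, the sample strings are invented, and the handling of thousands separators is a guess about how larger totals might be displayed.

```python
import re

def parse_pagination(range_text, total_text):
    """Return (last_item_on_page, total_items) from texts such as '1-24' and '696'."""
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", range_text)]
    if not numbers:
        raise ValueError(f"no numbers found in {range_text!r}")
    # Drop everything except digits, e.g. whitespace or a thousands separator.
    total = int(re.sub(r"[^\d]", "", total_text))
    return numbers[-1], total

def is_last_page(range_text, total_text):
    """True when the last item shown on the page reaches the total."""
    last_on_page, total = parse_pagination(range_text, total_text)
    return last_on_page >= total

print(is_last_page("1-24", "696"))
print(is_last_page("673-696", " 696 "))
```

Comparing with `>=` instead of `==` means the check still stops if the page counter ever overshoots the reported total.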


In [27]:
browser.quit()

### Composing a solution 

In [104]:
# from selenium.webdriver.support import expected_conditions
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait

In [28]:
executable_path = {'executable_path': ChromeDriverManager().install()}



Current google-chrome version is 99.0.4844
Get LATEST chromedriver version for 99.0.4844 google-chrome
Driver [/Users/pgarias/.wdm/drivers/chromedriver/mac64/99.0.4844.51/chromedriver] found in cache


In [29]:
browser = Browser('chrome', **executable_path, headless=False)
browser.driver.maximize_window()

In [30]:
base_url = "https://www.homedepot.com/s/"
search_term = "mosaic tile"

url = f"{base_url}{search_term}"

# Pause after visiting the URL and after each scroll so the page has time to load

browser.visit(url)
time.sleep(2)

count_merola = 0

cont = True

while cont:
    # Scroll down the page in fifths, then back up to just past the middle
    for fraction in (1, 2, 3, 4, 5, 2.7):
        browser.execute_script(f"window.scrollTo(0, document.body.scrollHeight/5*{fraction});")
        time.sleep(2)
    #WebDriverWait(browser, 10).until(expected_conditions.visibility_of_element_located((By.CSS_SELECTOR,)))
    html = browser.html
    soup = bs(html,'html.parser')
    product_details = soup.find_all("h2", class_="product-pod__title product-pod__title__product")
    
    for prod in product_details:
        # The first span in each product title holds the brand name
        brand = prod.find_all("span")[0].text
        if brand.lower().strip() == "merola tile":
            print(brand)
            count_merola += 1
    pagination_results = soup.find_all("span",class_="results-pagination__counts--number")
    page_count_product = int(pagination_results[0].text.split("-")[1].strip())
    total_products = int(pagination_results[1].text.strip())
    if page_count_product!=total_products:
        print("Keep going")
        #browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
        button_c = browser.find_by_css('a[aria-label=Next]')
        #button_c.scroll_to()
        time.sleep(5)
        button_c.first.click()
    else:
        print("Done")
        cont=False

Merola Tile
Merola Tile
Merola Tile
Keep going
Merola Tile
Keep going
Merola Tile
Merola Tile
Keep going
Merola Tile
Keep going
Merola Tile
Merola Tile
Merola Tile
Keep going
Merola Tile
Keep going
Merola Tile
Merola Tile
Merola Tile
Merola Tile
Merola Tile
Keep going
Merola Tile
Merola Tile
Merola Tile
Keep going


KeyboardInterrupt: 

In [31]:
count_merola

19

In [192]:
total_products

696