
# Webscraping using Selenium webdriver

Selenium was built to automate browser testing of web applications by driving a real web browser. As a result, it can be used for webscraping even when other common approaches fail, specifically when:
* Websites load content dynamically with JavaScript
* The user needs to interact with the page (click on things, enter input, navigate back and forth)

Selenium is particularly useful when the 'manual approach' with the requests library becomes too tedious or hard to reproduce.

Note: Many websites have measures in place to make it harder for scrapers to work, and workarounds may be needed. For a useful OW-created guide, see https://owlabs.atlassian.net/wiki/spaces/DE/pages/341213187/Web+scraping+in+Python

**DISCLAIMER**: The legality of web scraping needs to be considered on a site-by-site basis. The site in question may display a "Terms of Use", which should be reviewed by Oliver Wyman Legal in advance.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<ul class="toc-item"><li><span><a href="#Webscraping-using-Selenium-webdriver" data-toc-modified-id="Webscraping-using-Selenium-webdriver-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Webscraping using Selenium webdriver</a></span></li><li><span><a href="#Quickstart-to-Selenium-for-webscraping" data-toc-modified-id="Quickstart-to-Selenium-for-webscraping-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Quickstart to Selenium for webscraping</a></span><ul class="toc-item"><li><span><a href="#Fetching-and-navigating-HTML-from-site" data-toc-modified-id="Fetching-and-navigating-HTML-from-site-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Fetching and navigating HTML from site</a></span></li><li><span><a href="#Using-Selenium-to-navigate-within-website" data-toc-modified-id="Using-Selenium-to-navigate-within-website-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Using Selenium to navigate within website</a></span></li></ul></li><li><span><a href="#Re-usable-wrapper-functions-for-Selenium" data-toc-modified-id="Re-usable-wrapper-functions-for-Selenium-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Re-usable wrapper functions for Selenium</a></span><ul class="toc-item"><li><span><a href="#Additional-examples" data-toc-modified-id="Additional-examples-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Additional examples</a></span></li></ul></li><li><span><a href="#Other-tips-and-tricks-with-Selenium" data-toc-modified-id="Other-tips-and-tricks-with-Selenium-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Other tips and tricks with Selenium</a></span><ul class="toc-item"><li><span><a href="#Some-webpages-may-redirect-you,-and-you-can-get-the-redirected-website-URL" data-toc-modified-id="Some-webpages-may-redirect-you,-and-you-can-get-the-redirected-website-URL-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Some webpages may redirect you, and you can get the redirected website URL</a></span></li><li><span><a href="#Save-screenshots" 
data-toc-modified-id="Save-screenshots-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Save screenshots</a></span></li><li><span><a href="#Waiting-for-website-to-load-before-fetching-contents" data-toc-modified-id="Waiting-for-website-to-load-before-fetching-contents-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Waiting for website to load before fetching contents</a></span></li></ul></li></ul>


**We start by importing the Python libraries that will be used**

In [11]:
import pandas as pd
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

# Quickstart to Selenium for webscraping

This section provides a simple example in using Selenium to scrape websites. There are many tutorials online for this, e.g.:
* https://www.scrapingbee.com/blog/selenium-python/
* https://www.browserstack.com/guide/web-scraping-using-selenium-python
* https://towardsdatascience.com/how-to-use-selenium-to-web-scrape-with-example-80f9b23a843a


**Selenium will open up your web browser, and you need to download the webdriver associated with your browser**

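For Chrome, download it from https://chromedriver.chromium.org/downloads and pick the release that matches your browser version.

An alternative is to let the webdriver_manager library download a matching driver for you, as in the cell below. Selenium 4.6 and later also ship with Selenium Manager, which can locate or download a matching driver automatically when no driver path is given.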

In [14]:
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
# options.add_argument('headless')  # Uncomment if you want browser to run in background

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

**If you run the cell above without ```options.add_argument('headless')```, Selenium opens a visible browser window instead of running in the background. This is useful for debugging; do not close that window while the notebook is still using the driver.**

## Fetching and navigating HTML from site

You can ask the webdriver to load a website, and then grab the HTML source from it after the page has loaded. 

In [43]:
driver.get("https://www.nintendo.com/")
print(driver.page_source[:1000])  # First 1000 characters

<html lang="en-us" class="js-focus-visible svg cssanimations flexboxlegacy fontface csstransforms supports csstransforms3d csstransitions inlinesvg alps-os-windows" data-js-focus-visible="" style="--app-height: 747px;"><head><meta charset="utf-8"><title>Nintendo Official Site: Consoles, Games, News, and More</title><link rel="canonical" href="https://www.nintendo.com/"><meta http-equiv="Accept-CH" content="DPR, Width"><meta name="description" content="Visit the official Nintendo site to shop for Nintendo Switch™ systems and video games, read the latest news, find fun gear and gifts with a Nintendo twist, and much more."><meta name="twitter:card" content="summary_large_image"><meta name="twitter:site" content="@nintendoamerica"><meta property="og:type" content="website"><meta property="og:title" name="twitter:title" content="Nintendo Official Site: Consoles, Games, News, and More"><meta property="og:description" name="twitter:description" content="Visit the official Nintendo site to sho

**From here you can either parse the HTML with BeautifulSoup (see separate notebook) or use the .find_element() method that the Selenium webdriver provides.**

Read about it here: https://selenium-python.readthedocs.io/locating-elements.html or https://www.scrapingbee.com/blog/selenium-python/#locating-elements
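As a rough illustration of the first option (parsing the page source outside Selenium), the sketch below pulls the page title and link targets out of an HTML string. It is only a stand-in: it uses Python's built-in `html.parser` instead of BeautifulSoup so that it runs without extra packages, and `TitleAndLinkParser`, `summarize_html` and `sample_html` are made-up names. In the notebook you would pass `driver.page_source` instead of the sample string.

```python
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collect the <title> text and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize_html(html):
    """Return (title, links) for an HTML string such as driver.page_source."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    return parser.title.strip(), parser.links


# Made-up HTML standing in for driver.page_source
sample_html = (
    "<html><head><title> Example shop </title></head>"
    "<body><a href='/games'>Games</a><a>no link</a>"
    "<a href='/news'>News</a></body></html>"
)
title, links = summarize_html(sample_html)
print(title)
print(links)
```

For anything beyond simple extraction, BeautifulSoup is usually more convenient because it tolerates messy markup and supports CSS selectors.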

In [28]:
from selenium.webdriver.common.by import By

In [39]:
h1 = driver.find_element(By.TAG_NAME, 'h1')
h1.text

'Nintendo.com home'
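`find_element` returns only the first match and raises `NoSuchElementException` when nothing matches, whereas `find_elements` returns a (possibly empty) list. The helper below is a small sketch built on `find_elements`; the name `collect_texts` is hypothetical, and the locator is passed in so the function works with any `By` strategy.

```python
def collect_texts(driver, by, value):
    """Return the stripped, non-empty text of every element matching the locator."""
    # find_elements returns an empty list when nothing matches
    texts = (element.text.strip() for element in driver.find_elements(by, value))
    return [text for text in texts if text]


# Usage with the driver from this notebook (headings on the current page):
# collect_texts(driver, By.TAG_NAME, "h2")
```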

## Using Selenium to navigate within website

You can use the browser's JavaScript engine to, for example, scroll the webpage: https://www.scrapingbee.com/blog/selenium-python/#executing-javascript

In [45]:
driver.get("https://www.nintendo.com/")
driver.execute_script("window.scrollBy(0, 1000);")
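Pages that load more content as you scroll often need several scrolls before everything is present. One possible approach, sketched below, is to keep scrolling to the bottom until the document height stops growing. The function name `scroll_to_bottom` and the defaults for `pause` and `max_rounds` are assumptions chosen for illustration, not tuned values.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops growing; return the rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number  # nothing new was loaded
        last_height = new_height
    return max_rounds


# Usage with the driver from this notebook:
# scroll_to_bottom(driver, pause=2.0)
```

The `max_rounds` cap keeps the loop from running forever on pages with endless feeds.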

You can enter text into a form field once you know how to locate it, for example by its id or name attribute (both can be found in the page_source). On Google, the search box is the element with name "q".

In [50]:
driver.get("https://www.google.com/")

# Identify the search box by its name attribute
p = driver.find_element(By.NAME, "q")
# Enter text with send_keys(); the form is submitted in the next cell
p.send_keys("Testing the Search")

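You can then submit the form, either by calling .submit() on the element (as below) or by locating the submit button and calling .click() on it. For more details on clicking buttons see https://www.tutorialspoint.com/how-to-click-button-selenium-python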

In [52]:
# Submit is very simple once you've found the element that is to be submitted
p.submit()
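The locate, type and submit steps can be bundled into one helper. This is a minimal sketch; `fill_and_submit` is a hypothetical name, and the extra `clear()` call is an addition so that repeated runs do not append to text left in the field.

```python
def fill_and_submit(driver, by, value, text):
    """Type text into the field found by the locator and submit its form."""
    field = driver.find_element(by, value)
    field.clear()          # remove any text already in the field
    field.send_keys(text)  # type the new text
    field.submit()         # submit the form that contains the field


# Usage with the driver from this notebook:
# driver.get("https://www.google.com/")
# fill_and_submit(driver, By.NAME, "q", "Testing the Search")
```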

And then you can grab the output

In [56]:
print(driver.page_source[:1000])  # First 1000 characters

html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="en"><head><meta charset="UTF-8"><meta content="origin" name="referrer"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><style>@font-face{font-family:'Google Sans';font-style:normal;font-weight:400;font-display:optional;src:url(//fonts.gstatic.com/s/googlesans/v14/4UaGrENHsxJlGDuGo1OIlL3Kwp5MKg.woff2)format('woff2');unicode-range:U+0301,U+0400-045F,U+0490-0491,U+04B0-04B1,U+2116;}@font-face{font-family:'Google Sans';font-style:normal;font-weight:400;font-display:optional;src:url(//fonts.gstatic.com/s/googlesans/v14/4UaGrENHsxJlGDuGo1OIlL3Nwp5MKg.woff2)format('woff2');unicode-range:U+0370-03FF;}@font-face{font-family:'Google Sans';font-style:normal;font-weight:400;font-display:optional;src:url(//fonts.gstatic.com/s/googlesans/v14/4UaGrENHsxJlGDuGo1OIlL3Bwp5MKg.woff2)format('woff2');unicode-range:U+0102-0103,U+0110-0111,U+0128-0129,U+0168-0169,U+01A0-01A1,U+01AF-01B0,U+030


**Close the webdriver**

In [58]:
# driver.quit()  # Uncomment this to run; quit() ends the whole session, close() only closes the current window
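In scripts (as opposed to interactive notebooks) it is easy to leave browser processes running when an error interrupts the scrape. One way to guard against that is a `try`/`finally` wrapper, sketched below. `run_scrape`, `driver_factory` and `task` are hypothetical names; the only Selenium call used is `driver.quit()`, which ends the session and closes all of its windows.

```python
def run_scrape(driver_factory, task):
    """Create a driver, run task(driver), and always quit the driver afterwards."""
    driver = driver_factory()
    try:
        return task(driver)
    finally:
        driver.quit()  # runs even if task raised an exception


# Usage with the objects defined earlier in this notebook:
# def fetch_title(d):
#     d.get("https://www.nintendo.com/")
#     return d.title
#
# title = run_scrape(lambda: webdriver.Chrome(options=options), fetch_title)
```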

# Re-usable wrapper functions for Selenium

The functions are stored in utilities/webscraping/base_selenium_scraper.py and are built for Chrome (they can easily be adjusted for other web browsers).

**Load re-usable functions**

In [75]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.insert(0, "../../utilities")

from webscraping.base_selenium_scraper import BaseScraper 

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


**Start by initializing the BaseScraper class**

In [84]:
scraper = BaseScraper(headless_mode=False)

**You can access the webdriver through**
```scraper._browser```

**Then you can use the built in methods**

- get_page_source: Open a webpage and return its source
- select_value_dropdown_id: Select a value from a dropdown menu
- click_javascript_element: Click on an element
- delete_cookies: Delete cookies from the browser
- randomly_select_dropdown_id: Randomly select an item from a dropdown menu
- clean_string: Remove whitespace (trailing and interior) and double quotes from a string
- wind_down_browser: Close the browser

**Below is an example using Nordstrom's website**

In [92]:
first_page = scraper.get_page_source("https://www.nordstrom.com/browse/women/clothing")

opening: https://www.nordstrom.com/browse/women/clothing


**Navigate to next page**

In [93]:
counter = 0

In [98]:
next_link = scraper._browser.find_element(By.LINK_TEXT, "Next")
next_href = next_link.get_attribute("href")
print(next_href)

if next_href == "" or next_href is None:
    print("No more pages found")
else:
    counter += 1
    print('Getting page {}...'.format(counter))
    next_page_source = scraper.get_page_source(next_href)

https://www.nordstrom.com/browse/women/clothing?page=3
Getting page 3...
opening: https://www.nordstrom.com/browse/women/clothing?page=3
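Instead of re-running the cell for each page, the same logic can be wrapped in a loop. The sketch below is written under a few assumptions: `get_page_source(url)` navigates the browser and returns the HTML (as the cells above suggest), the "Next" link is located by its link text, and `scrape_pages` and `max_pages` are hypothetical names. It uses `find_elements`, which returns an empty list on the last page instead of raising an exception.

```python
def scrape_pages(scraper, start_url, max_pages=5, next_locator=("link text", "Next")):
    """Follow "Next" links from start_url and return a list of page sources."""
    # "link text" is the string value of By.LINK_TEXT
    sources = [scraper.get_page_source(start_url)]
    while len(sources) < max_pages:
        matches = scraper._browser.find_elements(*next_locator)
        next_href = matches[0].get_attribute("href") if matches else None
        if not next_href:
            break  # no "Next" link, so this was the last page
        sources.append(scraper.get_page_source(next_href))
    return sources


# Usage with the scraper from this notebook:
# pages = scrape_pages(scraper, "https://www.nordstrom.com/browse/women/clothing", max_pages=3)
```

Capping the number of pages keeps test runs short and avoids hammering the site while the scraper is still being developed.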


## Additional examples

We have included examples, built in 2018, that extend the BaseScraper class for specific websites. You can use them as inspiration for building robust scrapers.

These files are located in the ```selenium_examples``` folder.

# Other tips and tricks with Selenium

## Some webpages may redirect you, and you can get the redirected website URL

For example, google.co redirects to google.com, and driver.current_url returns the URL you actually ended up on:

In [23]:
driver.get("https://www.google.co/")
driver.current_url

'https://www.google.com/'

## Save screenshots

Use the .save_screenshot() method, which writes a PNG of the current window to the given path and returns True on success.

In [40]:
driver.get("https://www.nintendo.com/")
driver.save_screenshot('nintendo_screenshot.png')

True
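When screenshots are taken repeatedly, for instance to debug a long scrape, a fixed filename gets overwritten each time. A possible workaround is to timestamp the filenames, as in the sketch below. The helper name `save_timestamped_screenshot`, the default folder `screenshots` and the filename pattern are all assumptions for illustration.

```python
from datetime import datetime
from pathlib import Path


def save_timestamped_screenshot(driver, directory="screenshots", prefix="page"):
    """Save a PNG named <prefix>_<YYYYmmdd_HHMMSS>.png and return its path."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{prefix}_{datetime.now():%Y%m%d_%H%M%S}.png"
    # save_screenshot returns False when the file could not be written
    if not driver.save_screenshot(str(path)):
        raise OSError(f"Could not write screenshot to {path}")
    return path


# Usage with the driver from this notebook:
# save_timestamped_screenshot(driver, prefix="nintendo")
```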

## Waiting for website to load before fetching contents

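You can use 'time.sleep()' to add a fixed delay to the code, but a fixed delay either wastes time or turns out too short when the page is slow.

Alternatively, you can use a 'WebDriverWait' object, which polls until a condition is met or a timeout expires. Combined with the expected_conditions module, it can wait until an item becomes visible or clickable. Read more here: https://selenium-python.readthedocs.io/waits.html 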
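With the imports from the top of this notebook, an explicit wait typically looks like `WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, "h1")))`, which returns the element or raises a `TimeoutException` after 10 seconds. To show the idea behind it, the sketch below implements the same polling pattern with only the standard library. `wait_until` is a hypothetical helper, not part of Selenium, and its defaults are arbitrary.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll_interval)  # short pause before checking again


# Usage with the driver from this notebook (wait for at least one heading):
# headings = wait_until(lambda: driver.find_elements(By.TAG_NAME, "h1"))
```

Unlike a single long `time.sleep()`, this returns as soon as the condition holds, so fast pages are not slowed down and slow pages still get the full timeout.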