# Web Scraping: Selenium
_Automate your browser._ <br>
_Collect data from dynamically generated web pages or those requiring user interaction._

### Docs

- [Selenium homepage](https://www.seleniumhq.org/) 
- [Selenium documentation](https://selenium-python.readthedocs.io/) - unofficial, but helpful

### Installation

With conda:
- `conda install -c conda-forge selenium`

With pip:
- `pip install -U selenium`

#### ChromeDriver

You will also need to install a web driver to use Selenium.  ChromeDriver is recommended but others are also available.

1. Check your browser's version _(Chrome > About Google Chrome)_
![Browser Version](images/browser_version.png) 
<br>
2. Navigate to the [ChromeDriver downloads page](https://sites.google.com/a/chromium.org/chromedriver/downloads).
<br><br>
3. Download appropriately based on your browser's version and your OS.
![Download ChromeDriver zip file](images/chromedriver_options.png)

4. Unzip the driver.
<br><br>
5. Move to Applications folder (or wherever your Chrome application is).

## Example 1 - YouTube

### Dynamic Pages

Some pages serve their content dynamically, which means they could look different each time they are loaded into the browser.  HTML that you see by inspecting elements in your browser might be missing from `requests` and `BeautifulSoup` because it is generated at access time.

In [13]:
query = "data science"
youtube_search = "https://www.youtube.com/results?search_query="
youtube_query = youtube_search + query.replace(' ', '+')

In [14]:
page = requests.get(youtube_query).text
soup = BeautifulSoup(page, 'html5lib')

In [15]:
soup.find('div', id='contents')

Uh oh.  The video links should be under the contents div, but it's missing from our request.

> **QUESTION**: Why do you think this happened?

One option is to first load the page with Selenium THEN parse the page's HTML with BeautifulSoup.

First we launch the YouTube search page through our ChromeDrive.  A new browser should pop up.  **To continue using Selenium, keep this window open!**

In [35]:
url = 'https://www.kickstarter.com/discover/advanced?state=successful&category_id=34&woe_id=0&sort=end_date&page=1&seed=2616749'
driver = webdriver.Chrome(chromedriver)
driver.get(url)

We can access the page's HTML through the driver:

In [7]:
driver.page_source[:1000]

'<html style="font-size: 10px;font-family: Roboto, Arial, sans-serif; " lang="en-US"><head><script data-original-src="/yts/jsbin/player_ias-vflhIMmpR/en_US/miniplayer.js" src="/yts/jsbin/player_ias-vflhIMmpR/en_US/miniplayer.js"></script><script data-original-src="/yts/jsbin/player_ias-vflhIMmpR/en_US/remote.js" src="/yts/jsbin/player_ias-vflhIMmpR/en_US/remote.js"></script><meta http-equiv="origin-trial" data-feature="Web Components V0" data-expires="2019-08-15" content="AqJc7xVOCYsCYj0w3o6XqSYYRSBYxaX3IhxUyz+piton3LBVj3pWQ3DhcWh75fza5OybeMuuGUxvm/2tmDAJsAkAAABneyJvcmlnaW4iOiJodHRwczovL3lvdXR1YmUuY29tOjQ0MyIsImZlYXR1cmUiOiJXZWJDb21wb25lbnRzVjAiLCJleHBpcnkiOjE1NzMwODQ2OTQsImlzU3ViZG9tYWluIjp0cnVlfQ=="><script>var ytcfg = {d: function() {return (window.yt && yt.config_) || ytcfg.data_ || (ytcfg.data_ = {});},get: function(k, o) {return (k in ytcfg.d()) ? ytcfg.d()[k] : o;},set: function() {var a = arguments;if (a.length > 1) {ytcfg.d()[a[0]] = a[1];} else {for (var k in a[0]) {ytcfg.d()

Now we parse this with `BeautifulSoup` and the video information appears!

In [8]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [None]:
soup.find('div', id='contents')

In [None]:
contents_div = soup.find('div', id='contents')

for title in contents_div.find_all('a', id='video-title'):
    print(title.text.strip())

> **QUESTION**: We only got about 20 video titles -- surely there are more videos about data science.  What do you think is happening?

### Interacting with Pages

We can also interact with pages using Selenium.  For example, we can 
- click
- type in input cells
- scroll
- drag and drop, etc.

If we want more data science video titles, we need to scroll down to the bottom of the screen for more videos to populate.

In [41]:
print(driver.find_element_by_css_seTlector('div.load_more.mt3'))

AttributeError: 'WebDriver' object has no attribute 'find_element_by_css_seTlector'

In [8]:
from bs4 import BeautifulSoup
import requests
import time, os
import string
import re
import random


In [12]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import ElementNotInteractableException

chromedriver = "/Applications/chromedriver" # path to the chromedriver executable
os.environ["webdriver.chrome.driver"] = chromedriver

In [5]:
data_dir = "../data/selenium/"
category_id = 34
page = 1
url = f'https://www.kickstarter.com/discover/advanced?state=successful&category_id={category_id}&woe_id=0&sort=end_date&page=1&seed=2616749'

In [27]:
def scroll_scraper(data_dir, woe_id_start=4):
    driver = webdriver.Chrome(chromedriver)
    
    for woe_id in range (woe_id_start, 33424977):
        url = f"https://www.kickstarter.com/discover/advanced?state=successful&woe_id={woe_id}&category_id=34&sort=end_date&seed=2616863&page=1"
        driver.get(url)
        time.sleep(2)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        location_span = soup.find('span', id="location_filter")
        location_name_span = location_span.find('span', class_="js-title")
        location_name = location_name_span.text.strip()
        if location_name == 'Earth':
            print("skipping Earth woe_id")
            continue
        count_proj = soup.find('b', class_ ="count ksr-green-500")
        print("woe_id", woe_id, location_name, count_proj.text.strip())
        
        try:
            # does not scroll past 200
            for page in range(1,201):
                #Scroll
                load_more = driver.find_element_by_css_selector('div.load_more.mt3')
                driver.execute_script("arguments[0].scrollIntoView();", load_more)
                time.sleep(random.uniform(0,2))
                driver.execute_script(f"window.scrollBy(0,{-200-random.randint(1,200)} )", load_more)
                time.sleep(random.uniform(0,2))
                driver.execute_script(f"window.scrollBy(0,{-200-random.randint(1,200)} )", load_more)

                load_more.click()
                #Wait for page to load
                time.sleep(random.uniform(2,6))
                soup = BeautifulSoup(driver.page_source, 'html.parser')      
                savefile = data_dir + f"{re.sub('['+string.punctuation+']','_', url)}_{f'{page:05}'}.html"
                print(savefile)
                with open(savefile, "w") as file:
                    file.write(str(soup))
        except ElementNotInteractableException as err:
            print(f"hit end of woe_id {woe_id} list")
        
    

In [28]:
data_dir ="../data/selenium/woe/"
try:
    os.mkdir(data_dir)
except FileExistsError:
    pass
scroll_scraper(
    data_dir,
    4
)


woe_id 4 Advocate Harbour, Canada 0 projects
hit end of woe_id 4 list
woe_id 5 Agincourt, Canada 0 projects
hit end of woe_id 5 list
woe_id 6 Ajax, Canada 0 projects
hit end of woe_id 6 list
woe_id 7 Albanel, Canada 0 projects
hit end of woe_id 7 list
skipping Earth woe_id
woe_id 9 Albertville, Canada 0 projects
hit end of woe_id 9 list
woe_id 10 Aldouane, Canada 0 projects
hit end of woe_id 10 list
woe_id 11 Alexander Bay Station, Canada 0 projects
hit end of woe_id 11 list
woe_id 12 Alida, Canada 0 projects
hit end of woe_id 12 list
woe_id 13 Alma, Canada 0 projects
hit end of woe_id 13 list
woe_id 14 Amazon, Canada 0 projects
hit end of woe_id 14 list
skipping Earth woe_id
skipping Earth woe_id
skipping Earth woe_id
woe_id 18 Apple River, Canada 0 projects
hit end of woe_id 18 list
skipping Earth woe_id
skipping Earth woe_id
woe_id 21 Armagh, Canada 0 projects
hit end of woe_id 21 list
woe_id 22 Armley, Canada 0 projects
hit end of woe_id 22 list
woe_id 23 Armstrong, Canada 0 projec

AttributeError: 'NoneType' object has no attribute 'find'

![v1.png](./v1.png)


In [19]:
data_dir ="../data/selenium/100k_to_1m/"
try:
    os.mkdir(data_dir)
except FileExistsError:
    pass
scroll_scraper(
    "https://www.kickstarter.com/discover/advanced?category_id=34&pledged=3&sort=end_date&seed=2616863&page=1",
    data_dir
)



TypeError: 'str' object cannot be interpreted as an integer

In [27]:
data_dir ="../data/selenium/1m_plus/"
try:
    os.mkdir(data_dir)
except FileExistsError:
    pass
scroll_scraper(
    "https://www.kickstarter.com/discover/advanced?category_id=34&pledged=4&sort=end_date&seed=2616863&page=1",
    data_dir
)

ElementNotInteractableException: Message: element not interactable
  (Session info: chrome=77.0.3865.90)


In [28]:
data_dir ="../data/selenium/10k_to_100k/"
try:
    os.mkdir(data_dir)
except FileExistsError:
    pass
scroll_scraper(
    "https://www.kickstarter.com/discover/advanced?category_id=34&pledged=2&sort=end_date&seed=2616863&page=1",
    data_dir
)



KeyboardInterrupt: 

In [None]:
contents_div = soup.find('div', id='contents')

len(contents_div.find_all('a', id='video-title'))

Awesome!  Now we have several more videos to analyze and we could continue scrolling if we wanted even more.

What if we want to perform a new search for machine learning?

In [None]:
search_box = driver.find_element_by_xpath("//input[@id='search']")

#clear the current search
search_box.clear()

#input new search
search_box.send_keys("machine learning")

#hit enter
search_box.send_keys(Keys.RETURN)  

And can we filter to short videos (< 4 minutes) only?

In [None]:
filter_button = driver.find_element_by_xpath(
    '//a[contains(@class, "ytd-toggle-button")]'
)
filter_button.click()

In [None]:
short_link = driver.find_element_by_xpath(
    '//div[contains(@title, "Search for Short")]'
)
short_link.click()

Now we can either parse the page source with Beautiful Soup like before or pull text directly.  

For example, the title of the first short ML video (that isn't an ad!) can be found with:

In [None]:
first_title = driver.find_element_by_xpath("//a[@id='video-title']")
first_title.text

In [None]:
first_author = driver.find_element_by_xpath(
    "//ytd-video-renderer//ytd-channel-name//a"
)
first_author.text

#### Notes

- Check [here](https://www.w3schools.com/xml/xpath_syntax.asp) for additonal help writing xpath selectors.

- To select multiple elements, just switch to `driver.find_elements_by_xpath(...)`, which will return a list of matching elements.

- You can also access elements by id, name, etc.  Check [the docs](https://selenium-python.readthedocs.io/locating-elements.html) for more options.

Finally, when you are finished with the driver, be sure to close it.

In [9]:
driver.close()

## Example 2 - Open Table  _(Optional)_

Let's try one more example: gathering information from Open Table about restaurants with available reservation slots.

In [None]:
driver = webdriver.Chrome(chromedriver)
driver.get('http://www.opentable.com/')
time.sleep(1)  #pause to be sure page has loaded

Inspecting this page, we see the **name** of the drop down for picking the number of people is `Select_1`. Let's set the reservation for 4 people:

In [None]:
people_dropdown = driver.find_element_by_name('Select_1')
people_dropdown.send_keys("4 people")
time.sleep(1)

Now select the reservation date: 3 days from now.

In [None]:
from datetime import datetime, timedelta

In [None]:
today = datetime.today()
today_truncated = datetime(today.year, today.month, today.day)
res_date = int((today_truncated + timedelta(days=3)).timestamp())
res_date#Open Table uses unix time to label days

In [None]:
#Expand the calendar
date_picker = driver.find_element_by_name('datepicker')
date_picker.click()
time.sleep(1)

In [None]:
#Select the date three days from now
date_element = driver.find_element_by_xpath(f'//div[@data-pick={str(res_date)}000]')
date_element.click()
time.sleep(1)

Set our reservation time for 8 PM.

In [None]:
time_dropdown = driver.find_element_by_name('Select_0')
time_dropdown.send_keys("8:00 PM")
time.sleep(1)

And search!

In [None]:
search_button = driver.find_element_by_xpath('//input[@type="submit"]')
search_button.click()
time.sleep(1)

On this new page we find a long list of restaurants with available reservations for 4 people at roughly our desired day/time.  At this point we could grab the HTML (`driver.page_source`) and parse with BeautifulSoup.  

In [None]:
soup = BeautifulSoup(driver.page_source)

In [None]:
for rest in soup.find_all('div', class_='rest-row-header')[:20]:
    print(rest.find('a').text)

Or we could click into an individual restaurant to learn more.

In [None]:
first_rest = driver.find_element_by_xpath('//div[@class="rest-row-header"]//a')
first_rest.click()

> **QUESTION**:  Why can't we view the restaurant's full menu by clicking "View full menu"? <br>
`driver.find_element_by_xpath('//button[text()='View full menu"]')`

In [None]:
#Switch windows!
driver.switch_to.window(driver.window_handles[1])

In [None]:
full_menu_button = driver.find_element_by_xpath('//button[text()="View full menu"]')
full_menu_button.click()
time.sleep(1)

As usual when working with Selenium, make sure to close your browser.  Since we have two windows up, we use `driver.quit()` to close the entire browser session.

In [10]:
driver.quit()

In [2]:
import re
def save_html(url, dir='./', name=re.sub(r'[^\w\s]','_', url) + '.html' ,attempt=1):
    pass

NameError: name 'url' is not defined

In [None]:
path 