# Web Scraping: Selenium
_Automate your browser._ <br>
_Collect data from dynamically generated web pages or those requiring user interaction._

### Docs

- [Selenium homepage](https://www.seleniumhq.org/) 
- [Selenium documentation](https://selenium-python.readthedocs.io/) - unofficial, but helpful

### Installation

With conda:
- `conda install -c conda-forge selenium`

With pip:
- `pip install -U selenium`

#### ChromeDriver

You will also need to install a web driver to use Selenium.  ChromeDriver is recommended but others are also available.

1. Check your browser's version _(Chrome > About Google Chrome)_
![Browser Version](images/browser_version.png) 
<br>
2. Navigate to the [ChromeDriver downloads page](https://sites.google.com/a/chromium.org/chromedriver/downloads).
<br><br>
3. Download appropriately based on your browser's version and your OS.
![Download ChromeDriver zip file](images/chromedriver_options.png)

4. Unzip the driver.
<br><br>
5. Move the unzipped driver to your Applications folder (or wherever your Chrome application is).

In [1]:
from bs4 import BeautifulSoup
import requests
import time, os

In [2]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

chromedriver = "/Applications/chromedriver" # path to the chromedriver executable
os.environ["webdriver.chrome.driver"] = chromedriver

## Example 1 - YouTube

### Dynamic Pages

Some pages build their content dynamically with JavaScript after the initial response arrives, which means they could look different each time they are loaded into the browser.  HTML that you see by inspecting elements in your browser might be missing from what `requests` and `BeautifulSoup` give you because it is generated at access time.

In [1]:
query = "data science"
youtube_search = "https://www.youtube.com/results?search_query="
youtube_query = youtube_search + query.replace(' ', '+')
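
Replacing spaces by hand works for simple queries, but characters such as `&` or `#` would break the URL.  Here is a minimal sketch using the standard library's `urllib.parse.urlencode`.  It assumes YouTube's results page takes the query in a `search_query` parameter; the helper name `build_search_url` and the constant `YOUTUBE_RESULTS` are made up for illustration.

```python
from urllib.parse import urlencode

# Assumed base URL for YouTube search results (not taken from the notebook)
YOUTUBE_RESULTS = "https://www.youtube.com/results"


def build_search_url(query, base=YOUTUBE_RESULTS):
    """Return a search URL with the query safely percent-encoded."""
    return base + "?" + urlencode({"search_query": query})


print(build_search_url("data science"))
print(build_search_url("R&D in C#"))
```

`urlencode` turns spaces into `+` just like the manual `replace`, and it also escapes reserved characters so they stay part of the query.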

In [3]:
page = requests.get(youtube_query).text
soup = BeautifulSoup(page, 'html5lib')

In [5]:
soup.find_all('div', id='contents')

[]

Uh oh.  The video links should be under the contents div, but that div is missing from the HTML that `requests` returned.

> **QUESTION**: Why do you think this happened?

One option is to first load the page with Selenium THEN parse the page's HTML with BeautifulSoup.

First we launch the YouTube search page through our ChromeDriver.  A new browser window should pop up.  **To continue using Selenium, keep this window open!**

In [6]:
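# Selenium 3 style. Selenium 4 instead expects webdriver.Chrome(service=Service(chromedriver)),
# with Service imported from selenium.webdriver.chrome.service.
driver = webdriver.Chrome(chromedriver)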
driver.get(youtube_query)

We can access the page's HTML through the driver:

In [7]:
driver.page_source[:1000]

'<html style="font-size: 10px;font-family: Roboto, Arial, sans-serif; " lang="en-US"><head><meta http-equiv="origin-trial" data-feature="Web Components V0" data-expires="2020-10-23" content="AhbmRDASY7NuOZD9cFMgQihZ+mQpCwa8WTGdTx82vSar9ddBQbziBfZXZg+ScofvEZDdHQNCEwz4yM7HjBS9RgkAAABneyJvcmlnaW4iOiJodHRwczovL3lvdXR1YmUuY29tOjQ0MyIsImZlYXR1cmUiOiJXZWJDb21wb25lbnRzVjAiLCJleHBpcnkiOjE2MDM0ODY4NTYsImlzU3ViZG9tYWluIjp0cnVlfQ=="><meta http-equiv="origin-trial" data-feature="Web Components V0" data-expires="2020-10-27" content="Av2+1qfUp3MwEfAFcCccykS1qFmvLiCrMZ//pHQKnRZWG9dldVo8HYuJmGj2wZ7nDg+xE4RQMQ+Ku1zKM3PvYAIAAABmeyJvcmlnaW4iOiJodHRwczovL2dvb2dsZS5jb206NDQzIiwiZmVhdHVyZSI6IldlYkNvbXBvbmVudHNWMCIsImV4cGlyeSI6MTYwMzgzNjc3MiwiaXNTdWJkb21haW4iOnRydWV9"><meta http-equiv="origin-trial" data-feature="Web Components V0" data-expires="2021-01-08" content="AixUK+8UEShlt6+JX1wy9eg+XL+eV5PYSEDPH3C90JNVbIkE1Rg1FyVUfu2bZ/y6Pm1xbPLzuwHYHjv4uKPNnA4AAABqeyJvcmlnaW4iOiJodHRwczovL2dvb2dsZXByb2QuY29tOjQ0MyIsI

Now we parse this with `BeautifulSoup` and the video information appears!

In [8]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [9]:
soup.find('div', id='contents')

<div class="style-scope ytd-section-list-renderer" id="contents"><ytd-item-section-renderer class="style-scope ytd-section-list-renderer" use-height-hack="">
<div class="style-scope ytd-item-section-renderer" id="header"></div>
<div class="style-scope ytd-item-section-renderer" id="spinner-container">
<paper-spinner-lite aria-hidden="true" class="style-scope ytd-item-section-renderer"><!--css-build:shady--><div class="style-scope paper-spinner-lite" id="spinnerContainer"><div class="spinner-layer style-scope paper-spinner-lite"><div class="circle-clipper left style-scope paper-spinner-lite"><div class="circle style-scope paper-spinner-lite"></div></div><div class="circle-clipper right style-scope paper-spinner-lite"><div class="circle style-scope paper-spinner-lite"></div></div></div></div></paper-spinner-lite>
</div>
<div class="style-scope ytd-item-section-renderer" id="contents"><ytd-promoted-sparkles-text-search-renderer class="style-scope ytd-item-section-renderer">
<div class="st

In [10]:
contents_div = soup.find('div', id='contents')

for title in contents_div.find_all('a', id='video-title'):
    print(title.text.strip())

What REALLY is Data Science? Told by a Data Scientist
Learn Data Science Tutorial - Full Course for Beginners
Data Science In 5 Minutes | Data Science For Beginners | What Is Data Science? | Simplilearn
Data Scientist vs Data Analyst: What's the difference? ($120,000 vs $70,000 salary)
Demystifying Data Science | Mr.Asitang Mishra | TEDxOakLawn
What Do You Need to Become a Data Scientist in 2020?
Is Data Science Really a Rising Career in 2020 ($100,000+ Salary)
Data Science – Baba Brinkman Music Video
Why I left my Data Science Job at FANG (Facebook Amazon Netflix Google)
A Day in The Life of a Data Scientist 👨🏻‍💻| upGrad
A Day In The Life Of A Data Scientist
Python for Data Science | Data Science with Python | Python for Data Analysis | 11 Hours Full Course
Can You Become a Data Scientist?
Data Science: Reality vs Expectations ($100k+ Starting Salary 2018)
Top 3 Programming Languages to Learn in 2019
Data Science Full Course - Learn Data Science in 10 Hours | Data Science For Beginner

> **QUESTION**: We only got 16 video titles -- surely there are more videos about data science.  What do you think is happening?

### Interacting with Pages

We can also interact with pages using Selenium.  For example, we can 
- click
- type into input fields
- scroll
- drag and drop, etc.

If we want more data science video titles, we need to scroll down to the bottom of the page so that more videos load.

In [16]:
for i in range(5):
    #Scroll
    driver.execute_script(
        "window.scrollTo(0, document.documentElement.scrollHeight);" #Alternatively, document.body.scrollHeight
    )
    
    #Wait for page to load
    time.sleep(1)
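
A fixed number of scrolls is a guess: it may stop too early or keep scrolling after the page has run out of results.  One possible alternative, sketched below, is to keep scrolling until the page height stops growing.  The function name `scroll_until_stable` and its parameters are hypothetical; it only relies on the `execute_script` method already used above, and it assumes that an unchanged height after the pause means nothing more will load.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_scrolls=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    get_height = "return document.documentElement.scrollHeight;"
    last_height = driver.execute_script(get_height)
    for n in range(1, max_scrolls + 1):
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollHeight);"
        )
        time.sleep(pause)  # give the page time to load more results
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            return n  # nothing new loaded, so stop
        last_height = new_height
    return max_scrolls  # safety cap reached


# Usage with a live driver (not run here):
# scroll_until_stable(driver, pause=1.0)
```

The `max_scrolls` cap is there so that an endlessly growing feed cannot keep the loop running forever.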

In [12]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [13]:
contents_div = soup.find('div', id='contents')

len(contents_div.find_all('a', id='video-title'))

98

Awesome!  Now we have several more videos to analyze and we could continue scrolling if we wanted even more.

What if we want to perform a new search for machine learning?

In [14]:
search_box = driver.find_element_by_xpath("//input[@id='search']")

#clear the current search
search_box.clear()

#input new search
search_box.send_keys("machine learning")

#hit enter
search_box.send_keys(Keys.RETURN)  

And can we filter to short videos (< 4 minutes) only?

In [15]:
filter_button = driver.find_element_by_xpath(
    '//a[contains(@class, "ytd-toggle-button")]'
)
filter_button.click()

ElementClickInterceptedException: Message: element click intercepted: Element <a class="yt-simple-endpoint style-scope ytd-toggle-button-renderer" tabindex="-1">...</a> is not clickable at point (133, 18). Other element would receive the click: <div id="logo-icon-container" class="yt-icon-container style-scope ytd-topbar-logo-renderer">...</div>
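  (Session info: chrome=79.0.3945.130)

The error message tells us that the first element matching this XPath sits underneath the page logo, so the logo would receive the click instead.  Let's target the "Short" filter link directly: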


In [None]:
short_link = driver.find_element_by_xpath(
    '//div[contains(@title, "Search for Short")]'
)
short_link.click()

Now we can either parse the page source with Beautiful Soup like before or pull text directly.  

For example, the title of the first short ML video (that isn't an ad!) can be found with:

In [None]:
first_title = driver.find_element_by_xpath("//a[@id='video-title']")
first_title.text

In [None]:
first_author = driver.find_element_by_xpath(
    "//ytd-video-renderer//ytd-channel-name//a"
)
first_author.text

#### Notes

- Check [here](https://www.w3schools.com/xml/xpath_syntax.asp) for additional help writing XPath selectors.

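- To select multiple elements, switch to `driver.find_elements_by_xpath(...)`, which returns a list of matching elements.  Recent Selenium 4 releases removed the `find_element_by_*` helpers; there, use `driver.find_element(By.XPATH, ...)` and `driver.find_elements(By.XPATH, ...)` with `from selenium.webdriver.common.by import By`.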

- You can also access elements by id, name, etc.  Check [the docs](https://selenium-python.readthedocs.io/locating-elements.html) for more options.
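
As a rough illustration of selecting multiple elements, the sketch below collects the visible text of every match.  The helper name `element_texts` is hypothetical.  It calls the generic `driver.find_elements(by, value)` method with the plain string `"xpath"`, which is the value that Selenium's `By.XPATH` constant stands for, so no extra import is needed.

```python
def element_texts(driver, xpath):
    """Return the stripped, non-empty text of every element matching xpath."""
    # "xpath" is the locator-strategy string behind selenium's By.XPATH
    elements = driver.find_elements("xpath", xpath)
    texts = (el.text.strip() for el in elements)
    return [text for text in texts if text]


# Usage with a live driver (not run here):
# titles = element_texts(driver, "//a[@id='video-title']")
```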

Finally, when you are finished with the driver, be sure to close it.

In [None]:
driver.close()

## Example 2 - Open Table  _(Optional)_

Let's try one more example: gathering information from Open Table about restaurants with available reservation slots.

In [None]:
driver = webdriver.Chrome(chromedriver)
driver.get('http://www.opentable.com/')
time.sleep(1)  #pause to be sure page has loaded
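
A fixed `time.sleep(1)` is either too short on a slow connection or wasted time on a fast one.  Selenium ships `WebDriverWait` (in `selenium.webdriver.support.ui`) for this purpose; the sketch below shows the same polling idea in plain Python so the mechanism is visible.  The helper name `wait_until` and its defaults are assumptions made for illustration, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or timeout elapses.

    Returns the truthy value; raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            # Crude on purpose: e.g. the element is not in the page yet
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Usage with a live driver (not run here):
# people_dropdown = wait_until(lambda: driver.find_element("name", "Select_1"))
```

Swallowing every exception keeps the sketch short; in real code you would probably only ignore the "element not found" error.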

Inspecting this page, we see that the **name** of the dropdown for picking the number of people is `Select_1`. Let's set the reservation for 4 people:

In [None]:
people_dropdown = driver.find_element_by_name('Select_1')
people_dropdown.send_keys("4 people")
time.sleep(1)

Now select the reservation date: 3 days from now.

In [None]:
from datetime import datetime, timedelta

In [None]:
today = datetime.today()
today_truncated = datetime(today.year, today.month, today.day)
res_date = int((today_truncated + timedelta(days=3)).timestamp())
res_date  #Open Table uses unix time to label days
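
The same calculation can be wrapped in a small function that is easy to check with a fixed date.  This is only a sketch: the name `data_pick_value` is made up, and it assumes (as the XPath further down suggests) that the calendar labels each day with local midnight as Unix time in milliseconds.

```python
from datetime import datetime, timedelta


def data_pick_value(days_ahead, now=None):
    """Return local midnight `days_ahead` days from `now` as Unix time in ms."""
    now = now or datetime.today()
    midnight = datetime(now.year, now.month, now.day) + timedelta(days=days_ahead)
    return int(midnight.timestamp()) * 1000


value = data_pick_value(3, now=datetime(2020, 1, 15, 13, 45))
print(value)
print(datetime.fromtimestamp(value / 1000))  # local midnight on 2020-01-18
```

Passing `now` explicitly makes the result reproducible; leaving it out uses today's date like the cell above.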

In [None]:
#Expand the calendar
date_picker = driver.find_element_by_name('datepicker')
date_picker.click()
time.sleep(1)

In [None]:
#Select the date three days from now
date_element = driver.find_element_by_xpath(f'//div[@data-pick="{res_date}000"]')
date_element.click()
time.sleep(1)

Set our reservation time for 8 PM.

In [None]:
time_dropdown = driver.find_element_by_name('Select_0')
time_dropdown.send_keys("8:00 PM")
time.sleep(1)

And search!

In [None]:
search_button = driver.find_element_by_xpath('//input[@type="submit"]')
search_button.click()
time.sleep(1)

On this new page we find a long list of restaurants with available reservations for 4 people at roughly our desired day/time.  At this point we could grab the HTML (`driver.page_source`) and parse with BeautifulSoup.  

In [None]:
soup = BeautifulSoup(driver.page_source, 'html.parser')  # name the parser explicitly to avoid a warning

In [None]:
for rest in soup.find_all('div', class_='rest-row-header')[:20]:
    print(rest.find('a').text)

Or we could click into an individual restaurant to learn more.

In [None]:
first_rest = driver.find_element_by_xpath('//div[@class="rest-row-header"]//a')
first_rest.click()

> **QUESTION**:  Why can't we click on a time to start booking a reservation? <br>
`driver.find_element_by_xpath('//div[@data-auto="timeslot"]')`

In [None]:
#Switch windows!
driver.switch_to.window(driver.window_handles[1])
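
Indexing `window_handles[1]` assumes the new window is second in the list, and as far as I know the WebDriver specification does not promise any particular order.  A slightly more defensive sketch is to remember the handles before the click and switch to whichever handle is new.  The helper name `switch_to_new_window` is hypothetical.

```python
def switch_to_new_window(driver, handles_before):
    """Switch to the window that was not in handles_before; return its handle."""
    new_handles = [h for h in driver.window_handles if h not in handles_before]
    if not new_handles:
        raise RuntimeError("no new window was opened")
    driver.switch_to.window(new_handles[0])
    return new_handles[0]


# Usage with a live driver (not run here):
# before = list(driver.window_handles)
# first_rest.click()
# switch_to_new_window(driver, before)
```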

In [None]:
timeslot_button = driver.find_element_by_xpath('//div[@data-auto="timeslot"]')
timeslot_button.click()
time.sleep(1)

As usual when working with Selenium, make sure to close your browser.  Since we have two windows up, we use `driver.quit()` to close the entire browser session.

In [None]:
driver.quit()
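
If a cell raises an error halfway through, the `driver.quit()` call may never run and the browser stays open.  One way to guard against that, sketched here with the standard library's `contextlib`, is to wrap the session in a context manager.  The name `browser_session` and the `make_driver` argument are assumptions for illustration.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Create a driver, hand it to the with-block, and always quit it."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()  # closes every window and ends the session


# Usage with a live driver (not run here):
# with browser_session(lambda: webdriver.Chrome(chromedriver)) as driver:
#     driver.get("http://www.opentable.com/")
```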