# <center>Web Scraping II -- Dynamic Web Page Scraping with Selenium </center>

References:
- http://selenium-python.readthedocs.io/getting-started.html
- https://www.scrapingbee.com/blog/selenium-python/


## 1. Why Selenium
- So far, we have learned how to scrape **static** HTML pages using **Requests + BeautifulSoup**
- However, if a page relies on **JavaScript or AJAX** to build its content, this combination does not work
  - Elements in the page are loaded **asynchronously**
     * `requests.get(url)` returns only the initial HTML
     * you may need to wait a while for the content to load fully
  - You need to **interact with the page** to get some content loaded, e.g.
     * scroll down to load more
     * click a button like "more"
     * fill a form
- Example: https://www.quora.com/topic/Machine-Learning
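
To make the limitation concrete, here is a minimal sketch that uses only the Python standard library. The HTML string, the `questions` id, and the names `SpanCounter` and `count_spans` are made up for illustration. It shows that a parser sees only the markup the server sent, not the elements a script would add once the page runs in a browser.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends an empty container, and a script
# fills it in only when the page runs inside a browser.
INITIAL_HTML = """
<html><body>
  <div id="questions"></div>
  <script>
    const span = document.createElement("span");
    span.textContent = "What is machine learning?";
    document.getElementById("questions").appendChild(span);
  </script>
</body></html>
"""


class SpanCounter(HTMLParser):
    """Count the <span> start tags that are present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.count += 1


def count_spans(html):
    parser = SpanCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# The script is never executed by the parser, so no <span> is found.
print(count_spans(INITIAL_HTML))  # prints 0
```

A browser would run the script and end up with one `<span>` in the page, which is the gap Selenium fills.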

In [6]:
# Exercise 1.1. Scrape a Quora page using Requests + BeautifulSoup

# import requests package
import requests                   

# import BeautifulSoup from package bs4 (i.e. beautifulsoup4)
from bs4 import BeautifulSoup   

page = requests.get("https://www.quora.com/topic/Machine-Learning")    # send a get request to the web page

if page.status_code==200:      

    soup = BeautifulSoup(page.content, 'html.parser')
    
    # get all questions
    questions=soup.select("span.q-box.qu-userSelect--text")
    
    for i, q in enumerate(questions):
        print(i, q.get_text())
        print("\n")
    
# Note: nothing is returned. Do you know why?

## 2. Selenium WebDriver
- Selenium WebDriver is one of the most popular tools for **Web UI Automation**
- It uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. 
- Selenium is really useful when you have to perform actions on a website, such as:
  - clicking on buttons
  - filling forms
  - scrolling
  - taking a screenshot
  - executing JavaScript code
- Installation:
  - Install the Selenium package:
    - `pip install selenium`
  - Download a webdriver that matches your browser: https://www.selenium.dev/documentation/en/getting_started_with_webdriver/third_party_drivers_and_plugins/. **Be sure to download the latest version!**
  - Place the webdriver (unzip it if the download is zipped) in a folder, e.g. a sub-folder called `driver` under the current working folder. When you start Selenium, point it to that file. Selenium 4.10 and later no longer accept `executable_path` directly in `webdriver.Firefox()`; pass it through a `Service` object instead, i.e. `driver = webdriver.Firefox(service=Service(executable_path='driver/geckodriver'))`, where `Service` is imported from `selenium.webdriver.firefox.service`
  - The examples below use **Firefox** or **Safari** (Safari ships with its own driver)

## 3. Use of Selenium WebDriver

### 3.1. **Navigating** (similar to beautifulsoup, but using different syntax)
  * navigate to a link
  * find elements by id, name, xpath, CSS selectors
    * Check this for detailed syntax: https://selenium-python.readthedocs.io/locating-elements.html
    * Note **find_element()** vs. **find_element<font color='red'>s</font>()**
  
|    | requests/BeautifulSoup | Selenium WebDriver |
| -- |:------------------      |:-----------|
| Navigate to a link |   `requests.get(url)`           | `driver.get(url)`    |
| find elements  | `soup.select()` | `driver.find_element(By.ID, 'abc')`<br>`driver.find_element(By.CSS_SELECTOR,'div p#abc')`<br>`driver.find_element(By.XPATH, '//button')`<br> `driver.find_elements(By.CSS_SELECTOR,'div p#abc')`<br>`driver.find_elements(By.XPATH, '//button')`<br>|
| get attributes of <br>element (say `p`) | `p.attrs` <br>    `p["class"]` | `p.get_attribute("class")` |
| get tag name | `p.name` | `p.tag_name` |
| get text | `p.get_text()` | `p.text` |
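
The difference between `find_element()` and `find_elements()` matters most when nothing matches: `find_element()` raises `NoSuchElementException`, while `find_elements()` returns an empty list. A small helper built on that behaviour is sketched below. The name `first_or_none` is hypothetical, and the locator strategy is passed as the plain string `"css selector"`, which is the value of `By.CSS_SELECTOR`, so the snippet itself does not need to import Selenium.

```python
CSS = "css selector"  # same string as selenium's By.CSS_SELECTOR


def first_or_none(driver, css):
    """Return the first element matching `css`, or None when nothing matches."""
    # find_elements() returns an empty list instead of raising on "no match"
    matches = driver.find_elements(CSS, css)
    return matches[0] if matches else None
```

With a live driver this could be called as `first_or_none(driver, "span.q-box.qu-userSelect--text")`, followed by a check for `None` before reading `.text`.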
 

In [7]:
import selenium
selenium.__version__

'4.13.0'

In [13]:
# Exercise 3.1.1 Scrape using Selenium

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions 

# For the Firefox browser, do the following
# (1) set the path where you saved the webdriver (geckodriver)
executable_path = 'driver/geckodriver'
# (2) initialize the driver; Selenium 4.10+ takes the path through a Service object
#driver = webdriver.Firefox(service=Service(executable_path=executable_path))

# For Safari, the driver is built into the browser
# Make sure you enable "Develop -> Allow Remote Automation"
driver = webdriver.Safari()

# send a request
driver.get('https://www.quora.com/topic/Machine-Learning')

# you should see a browser window open
# keep it open: the next cell reuses this driver

In [14]:
# Exercise 3.1.2. Select truncated text using Selenium
driver.get('https://www.quora.com/topic/Machine-Learning')

# get all questions using css selector
questions=driver.find_elements(By.CSS_SELECTOR, "span.q-box.qu-userSelect--text")

for i, q in enumerate(questions):
    print(i, q.text)
    print("\n")

# close the webdriver. The browser window closes
driver.quit()

### 3.2. Simulating user actions in a web browser

  - click a button
    * e.g. `submit_button.click()`
  - fill a form
    * e.g. `text_box.send_keys("enter some text")`
  - scroll page down or up
    * e.g. `body.send_keys(Keys.PAGE_DOWN)`
  - move between windows and frames
    * e.g. `driver.switch_to.frame("frameName")`
  ...
  - For details see https://selenium-python.readthedocs.io/navigating.html
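
Scrolling a fixed number of times is simple, but it either stops too early or wastes time. One possible refinement, sketched below, is to keep scrolling until the page height stops growing. The function name `scroll_to_bottom` and its parameters are assumptions for illustration; the only driver call it uses is `execute_script()`.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_scrolls=20):
    """Scroll down until the page height stops growing or max_scrolls is reached.

    Returns the number of scrolls that were issued.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for n in range(1, max_scrolls + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return n       # nothing new was loaded by this scroll
        last_height = new_height
    return max_scrolls
```

A short `pause` speeds things up but risks stopping before slow content arrives, so the value is a trade-off that depends on the page.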

In [5]:
# 3.2.1 Simulate "click"
# Click "more" link to get full answer

# Selenium 4.10+ takes the driver path through a Service object
from selenium.webdriver.firefox.service import Service
driver = webdriver.Firefox(service=Service(executable_path=executable_path))

#driver = webdriver.Safari()

driver.get('https://www.quora.com/topic/Machine-Learning')

driver.implicitly_wait(10)  # set implicit wait

# locate a "more" link by css selector
more_link=driver.find_element(By.CSS_SELECTOR, "div.q-text.qu-cursor--pointer.qt_read_more")

# click the link element
more_link.click()

# Check the Firefox window to see an expanded answer




In [6]:
driver.quit()

In [15]:
# Scroll down to load more questions
import time

#driver = webdriver.Firefox(executable_path=executable_path)

driver = webdriver.Safari()
driver.get('https://www.quora.com/topic/Machine-Learning')

# scroll down 5 times
for i in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    # wait for the content to be loaded
    time.sleep(2)   

questions=driver.find_elements(By.CSS_SELECTOR,"span.q-box.qu-userSelect--text")
    
for i, q in enumerate(questions):
    print(i, q.text)
    print("\n")


0 Why is it said that if you would want to be a Data scientist, don't start with Machine Learning?


1 What is a neural network in layman’s terms?


2 How can you set yourself apart when everyone is doing machine learning or data science in 2019?


3 How secure will available jobs as a machine learning engineer be in 2026?


4 Google's 'Bard' chatbot will compete with ChatGPT, what is special?


5 How can a student doing machine learning stand out from the crowd, these days when everyone is learning ML?


6 What are the best institutes to study data science in chennai?


7 Are you worried about the possible effects of artificial intelligence?


8 What is something about the field of data science that only a professional would know?


9 What are some embarrassing statistics mistakes that people with fancy credentials often make?


10 Is it possible for a neural network to be too deep?


11 What are the best institutes to study data science in chennai?


12 What are the best institutes t

In [None]:
driver.quit()

### 3.3. Wait
  - Because of AJAX, web elements often load at different times.
  - This makes locating elements difficult.
    - if an element has not been loaded yet, a locating function raises a `NoSuchElementException`.
  - Two types of waits
    - `implicit`: when the WebDriver looks for an element that is not yet available, it does not raise `NoSuchElementException` immediately; it keeps trying for a set amount of time and raises the error only if the element is still unavailable when the time is up.
      * An implicit wait is set at the driver level and applies to every locating call
    - `explicit`: the WebDriver waits for a certain condition to occur before proceeding further with execution
      * An explicit wait is set for each individual locating call
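
Conceptually, an explicit wait is a polling loop: check a condition, sleep briefly, and give up after a timeout. The sketch below is a simplified stand-in written for illustration, not Selenium's implementation, and `wait_until` is a hypothetical name. Selenium's own `WebDriverWait(driver, timeout).until(condition)` follows the same idea, polling every 0.5 seconds by default and raising `TimeoutException` when time runs out.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # wait a little before checking again
```

For example, `wait_until(lambda: driver.find_elements("css selector", "a.more"))` would return the matching elements as soon as at least one appears; the selector `a.more` is only a placeholder.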

Assume you're looking for the question `How should you start a career in Machine Learning?`, but you're not sure whether it has been loaded into the page
- Case 1: If this question is not in the page, you get an error immediately
- Case 2: If it takes time to load the question, use an implicit wait to wait for some time
- Case 3: You can keep scrolling down until the question has been loaded or the maximum number of tries is reached
    - Use a `try ... except` block to handle the exception more elegantly

In [8]:
# If the element is not there, you'll see an error immediately

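#driver = webdriver.Safari()
# Selenium 4.10+ takes the driver path through a Service object
from selenium.webdriver.firefox.service import Service
driver = webdriver.Firefox(service=Service(executable_path=executable_path))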
driver.get('https://www.quora.com/topic/Machine-Learning')

q = driver.find_element(By.CSS_SELECTOR, 'a[href="https://www.quora.com/How-good-are-chess-computers-now-compared-to-Deep-Blue"]')

print(q.text)
driver.quit()



NoSuchElementException: Message: Unable to locate element: a[href="https://www.quora.com/How-good-are-chess-computers-now-compared-to-Deep-Blue"]
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.jsm:12:1
WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:192:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.jsm:404:5
element.find/</<@chrome://remote/content/marionette/element.js:291:16


In [18]:
# If the element is not there, you'll see an error after the implicit wait times out

driver = webdriver.Safari()
driver.get('https://www.quora.com/topic/Machine-Learning')

driver.implicitly_wait(10)

q = driver.find_element(By.CSS_SELECTOR,'a[href="https://www.quora.com/How-good-are-chess-computers-now-compared-to-Deep-Blue"]')

print(q.text)
driver.quit()

NoSuchElementException: Message: ; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception


In [17]:
# If the element is not there yet,
# keep scrolling down until it appears
# or the maximum number of scrolls is reached.
# try/except catches the timeout raised by the explicit wait

from selenium.common.exceptions import TimeoutException

driver = webdriver.Safari()
driver.get('https://www.quora.com/topic/Machine-Learning')

found = False
max_try = 10  # max number of scroll-downs
cnt = 0

while not found and cnt < max_try:
    
    try:
        # Wait up to 5 seconds for the element to appear;
        # a TimeoutException is raised if it does not.
        
        q = WebDriverWait(driver, 5).until(
                expected_conditions.presence_of_element_located(
                    (By.CSS_SELECTOR,
                     'a[href="https://www.quora.com/How-good-are-chess-computers-now-compared-to-Deep-Blue"]')))
        
        found = True
        print(q.text)

    except TimeoutException:     # item not there yet
        
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
     
        cnt += 1
        
driver.quit()

### 3.4. Combine WebDriver with BeautifulSoup

- Use Selenium to retrieve the content
- Use BeautifulSoup to parse the content (although Selenium can do the same!)
- Example:
    - Scroll down the page to load more content
    - Parse the content using BeautifulSoup

In [16]:
from bs4 import BeautifulSoup 

driver = webdriver.Safari()

driver.get('https://www.quora.com/topic/Machine-Learning')

# scroll down 5 times
for i in range(5):
    driver.execute_script("window.scrollTo(0, \
    document.body.scrollHeight);")
    
    # wait for the content to be loaded
    time.sleep(2)   

# Collect html page source
content = driver.page_source

# Parse the content by BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')

# get all questions
questions=soup.select("span.q-box.qu-userSelect--text")
    
for i, q in enumerate(questions):
    print(i, q.get_text())
    print("\n")

driver.quit()

0 Why is it said that if you would want to be a Data scientist, don't start with Machine Learning?


1 What is a neural network in layman’s terms?


2 How can you set yourself apart when everyone is doing machine learning or data science in 2019?


3 How secure will available jobs as a machine learning engineer be in 2026?


4 Google's 'Bard' chatbot will compete with ChatGPT, what is special?


5 How can a student doing machine learning stand out from the crowd, these days when everyone is learning ML?


6 What are the best institutes to study data science in chennai?


7 Are you worried about the possible effects of artificial intelligence?


8 What is something about the field of data science that only a professional would know?


9 What are some embarrassing statistics mistakes that people with fancy credentials often make?


10 Is it possible for a neural network to be too deep?


11 What are the best institutes to study data science in chennai?


12 What are the best institutes t