# Scraping using Selenium

*This Notebook was originally prepared by **Jude Michael Teves**, Faculty, Department of Software Technology, College of Computer Studies, De La Salle University.*

*Updated functions to use latest version of `selenium` (4.4.0) as of Sep 2022.*

---

In this Notebook, we will learn how to scrape dynamic webpages using [Selenium in Python](https://selenium-python.readthedocs.io/).

**Selenium** was originally built to automate browsers and is still widely used for automated testing in the software development lifecycle. With the growing need for web scraping in data science projects, it has also become a useful tool for driving a browser and retrieving the data we need.

In this example, we will scrape https://quotes.toscrape.com/scroll, a page dedicated to scraping practice, similar to https://quotes.toscrape.com/. Other practice sites are available in the [Scraping Sandbox](https://toscrape.com/).

## Reminder

> *"With great power, comes great responsibility"*
    
Remember to perform web scraping with extra caution and to not abuse it. The boundaries are not so clear when it comes to what you can and cannot legally do with scraping. Use your own judgment to determine if what you are about to do is unethical or illegal.
<hr>

## Import libraries

We will be using the `requests` and `BeautifulSoup` libraries in the succeeding cells. Together, they give us the functionality we need to scrape a webpage. If `BeautifulSoup` is not already installed in your environment, you may use either of the following commands in your command line:

```conda install -c anaconda beautifulsoup4``` or
```pip install beautifulsoup4```

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import os

In [2]:
page = requests.get("https://quotes.toscrape.com/scroll")

# feed the response text into BeautifulSoup for parsing
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quotes">
     </div>
    </div>
   </div>
   <div id="loading" style="background-color: #eeeecc">
    <h5>
     Loading...
    </h5>
   </div>
   <script src="/static/jquery.js">
   </script>
   <script>
    $(function(){
        var page = 1, tag = null, hasNextPage = true;
        function appendQuotes(quotes) {
            var $quotes = $('.quotes');
            var html = $.ma

### Inspect HTML code

Inspect the code we retrieved and compare it against the webpage. This is what we should be seeing.

<img src="./images/quotes-to-scrape-console.png">

Why is it different? Why did we not get the contents of the actual webpage? The quotes are generated dynamically by JavaScript after the page loads. `requests` only downloads the initial HTML and does not run any scripts, so BeautifulSoup never sees the quotes. Many webpages work this way.
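The difference can also be checked programmatically. The sketch below uses only the standard library's `html.parser` (the same parser selected by `'html.parser'` in the `BeautifulSoup` call) to count `<div class="quote">` elements. Both HTML strings are shortened, assumed stand-ins: `static_html` mimics what `requests` received, and `rendered_html` mimics the page after JavaScript has run. `QuoteCounter` and `count_quotes` are illustrative names.

```python
from html.parser import HTMLParser


class QuoteCounter(HTMLParser):
    """Count <div> start tags whose class list contains 'quote'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may be missing
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "quote" in classes:
            self.count += 1


def count_quotes(html):
    """Return the number of quote blocks found in an HTML string."""
    parser = QuoteCounter()
    parser.feed(html)
    return parser.count


# Assumed stand-in for the HTML returned by requests: the container is empty.
static_html = '<div class="quotes"></div><div id="loading"><h5>Loading...</h5></div>'

# Assumed stand-in for the page after JavaScript has filled the container.
rendered_html = (
    '<div class="quotes">'
    '<div class="quote"><span class="text">First quote</span></div>'
    '<div class="quote"><span class="text">Second quote</span></div>'
    '</div>'
)

print(count_quotes(static_html))    # 0
print(count_quotes(rendered_html))  # 2
```

The empty `quotes` container in the static HTML is the reason a plain `requests` call is not enough for this page.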

### Selenium to the rescue!

Selenium is an automation library that can be used to deal with dynamic webpages. To install it, you may use the following commands:

```conda install -c conda-forge selenium``` or
```pip install selenium```

You will also be needing a driver for your browser. See this section of the Selenium documentation for more details: https://selenium-python.readthedocs.io/installation.html#drivers

### Setup browser automation

You should see a new browser open after executing the cell below. This is the browser that is under the influence of our code--we are fully controlling it.

In [4]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

ModuleNotFoundError: No module named 'webdriver_manager'

There are two ways to create the driver object for `selenium`. 

The first is to use `webdriver-manager` to install the required driver from the notebook. It downloads the driver on the first run if `webdriver-manager` does not have the requested browser driver yet.

This option requires the `webdriver-manager` package to be installed first.

To install it, run `pip3 install webdriver-manager` or `conda install -c conda-forge webdriver-manager`.

In [None]:
# option 1: using the webdriver-manager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

The second option applies when you have already downloaded your own `chromedriver.exe`: pass the local executable path into the `Service` object from `selenium`.

In [None]:
# option 2: using a local path 
# driver_path = os.getenv('WEBDRIVER') + '/chromedriver.exe'
# driver = webdriver.Chrome(service=Service(driver_path))
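Note that `os.getenv('WEBDRIVER') + '/chromedriver.exe'` raises a `TypeError` when the environment variable is not set, because `os.getenv` then returns `None`. A minimal sketch of a more defensive lookup is shown below. The helper name `resolve_driver_path` and its parameters are hypothetical; only the `WEBDRIVER` variable and the `chromedriver.exe` filename come from the cell above.

```python
import os
from pathlib import Path


def resolve_driver_path(env_var="WEBDRIVER", filename="chromedriver.exe"):
    """Return the driver location as a Path, or None if it cannot be found."""
    folder = os.getenv(env_var)
    if folder is None:
        return None  # the environment variable is not set
    candidate = Path(folder) / filename  # portable path joining
    return candidate if candidate.is_file() else None


driver_path = resolve_driver_path()
if driver_path is None:
    print("No local driver found; consider option 1 instead.")
else:
    print(f"Using local driver at {driver_path}")
```

When a path is found, `str(driver_path)` can be passed to `Service(...)` in the same way as in option 2.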

### Surprise!

Notice that once the driver is initialized, a blank browser window opens on your machine. Do not close that window; otherwise, your web scraping script will no longer be able to find the browser!

In [None]:
url = "https://quotes.toscrape.com/scroll"
driver.get(url)
print(driver.page_source)

### XPath: Getting the elements that we want

> XPath is the language used for locating nodes in an XML document. As HTML can be an implementation of XML (XHTML), Selenium users can leverage this powerful language to target elements in their web applications. XPath supports the simple methods of locating by id or name attributes and extends them by opening up all sorts of new possibilities such as locating the third checkbox on the page.

For the XPath syntax, you may refer to the following link: https://www.w3schools.com/xml/xpath_syntax.asp

In [None]:
quotes = driver.find_elements(by="xpath", value="//div[@class='quote']")
quotes

In [None]:
len(quotes)

In [None]:
quotes[0].text

In [None]:
quotes_text = [quote.text for quote in quotes]
quotes_text

This returns all the texts inside the element. How can we choose specific parts of the element then?

In [None]:
quotes[0].find_element(by="xpath", value="span[@class='text']").text

This one did not return all the other text after the quote. How about the others?

In [None]:
quotes[0].find_element(by="xpath", value=".//small[@class='author']").text

In [None]:
tags = quotes[0].find_elements(by="xpath", value=".//a[@class='tag']")
tags = [tag.text for tag in tags]
tags
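The cells above mix two forms of relative XPath: `span[@class='text']` has no prefix, while `.//small[@class='author']` starts with `.//`. The sketch below illustrates the difference using the standard library's `xml.etree.ElementTree`, which supports a small subset of XPath. The markup is a simplified, hypothetical stand-in for one quote block, not the page's exact HTML.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for one rendered quote block.
snippet = """
<div class="quote">
  <span class="text">An example quote.</span>
  <span>by <small class="author">Example Author</small></span>
  <div class="tags">
    <a class="tag">first</a>
    <a class="tag">second</a>
  </div>
</div>
"""
quote = ET.fromstring(snippet.strip())

# No prefix: only direct children of the context element are considered.
direct_text = [el.text for el in quote.findall("span[@class='text']")]
direct_author = quote.findall("small[@class='author']")

# ".//" searches all descendants of the context element.
nested_author = quote.find(".//small[@class='author']").text
nested_tags = [a.text for a in quote.findall(".//a[@class='tag']")]

print(direct_text)    # ['An example quote.']
print(direct_author)  # [] because <small> is nested inside a <span>
print(nested_author)  # Example Author
print(nested_tags)    # ['first', 'second']
```

In Selenium, also keep in mind that an XPath starting with `//` searches the whole document even when it is called on an element such as `quotes[0]`, which is why the `.//` prefix is used when searching inside a single quote.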

### Handle scrolling

You will notice that we are only getting the first 10 quotes on the page. This is because we have to scroll first so that the page generates the other quotes. The following cell automates that scrolling. The code for handling infinite scrolling is taken from <a href="https://stackoverflow.com/questions/28928068/scroll-down-to-bottom-of-infinite-page-with-phantomjs-in-python/28928684#28928684">the answer to this Stack Overflow question</a>.

In [None]:
import time

pause = 0.5
lastHeight = driver.execute_script("return document.body.scrollHeight")

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(pause)
    newHeight = driver.execute_script("return document.body.scrollHeight")
    if newHeight == lastHeight:
        break
    lastHeight = newHeight

Now let's check the quotes on the page:

In [None]:
quotes = driver.find_elements(by="xpath", value="//div[@class='quote']")
len(quotes)

We now have 100 quotes instead of 10!
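The scrolling loop runs until the page height stops changing, so it would never end on a page that keeps growing. A minimal sketch of the same loop, wrapped in a function with an upper bound on the number of scrolls, is shown below. The name `scroll_to_bottom` and the `max_rounds` parameter are illustrative assumptions and are not part of the original Stack Overflow answer.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_rounds=50):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls that made the page grow.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # jump to the current bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we reached the end
        last_height = new_height
        rounds += 1
    return rounds
```

With the notebook's `driver` already on the page, this could be called as `scroll_to_bottom(driver)`. A larger `pause` may be needed on slow connections, since a short pause can end the loop before new content arrives.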

## Exercise

Scrape the page and save the results into a `pandas` Dataframe with the following format:

| author | tags | quote |
| --- | --- | --- |
| Albert Einstein | ['change', 'deep-thoughts', 'thinking', 'world'] | “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” |

In [None]:
# your code here


### Display the first 10 records of the `DataFrame`.

In [None]:
# your code here


### Display the last 5 records of the `DataFrame`.

In [None]:
# your code here


### Print out the `shape` of the `DataFrame`. 

In [None]:
# your code here


### Save the `DataFrame` into a file called `quotes.csv`.

In [None]:
# your code here


## Closing the browser

At the end of your scraping, make sure that you close the window of the browser by calling `driver.quit()`.

In [None]:
driver.quit()
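In a `.py` script, an exception raised halfway through scraping would skip the final `driver.quit()` call and leave the browser open. One way to guard against this is a `try`/`finally` block, sketched below as a small context manager. The names `managed_driver` and `create_driver` are hypothetical and introduced only for illustration.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Create a driver, hand it to the caller, and always quit it afterwards."""
    driver = create_driver()
    try:
        yield driver
    finally:
        # runs even if the scraping code raises an exception
        driver.quit()
```

With the setup from this notebook, usage could look like `with managed_driver(lambda: webdriver.Chrome(service=Service(ChromeDriverManager().install()))) as driver:` followed by the scraping code in the indented block.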

## Running a "Headless" Browser

In our configuration, we ran the driver (our browser) and opened an actual window. This can be very helpful during development since the browser window will stay open (unless you call `driver.quit()`) and you can inspect the elements from that window. It can also be particularly useful for you to observe what's happening on the page and if your script is working as intended. 

However, once you've finalized your script and are happy with the results, it is generally advisable to switch to **headless** mode in production (and in `.py` scripts).

To do so, simply add some options for your driver.

```python
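from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
# pass the headless flag as a Chrome argument; the `options.headless` attribute is deprecated in newer Selenium 4 releases
options.add_argument("--headless")
options.add_argument("--window-size=1920,1200")

# create the driver the same way as option 1, now with the options attached
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)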
```
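To keep one script usable both during development (with a visible window) and in production (headless), the Chrome arguments can be built from a flag. This is only a sketch: the helper name `build_chrome_arguments` and its parameters are assumptions made for illustration, while `--headless` and `--window-size` are regular Chrome command-line arguments.

```python
def build_chrome_arguments(headless=True, window_size=(1920, 1200)):
    """Return the list of Chrome command-line arguments for the chosen mode."""
    width, height = window_size
    arguments = [f"--window-size={width},{height}"]
    if headless:
        # run Chrome without opening a visible window
        arguments.insert(0, "--headless")
    return arguments


print(build_chrome_arguments())                # ['--headless', '--window-size=1920,1200']
print(build_chrome_arguments(headless=False))  # ['--window-size=1920,1200']
```

Each returned string can then be passed to `options.add_argument(...)` before creating the driver.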

## References

1. https://selenium-python.readthedocs.io/
2. https://www.scrapingbee.com/blog/selenium-python/

## End
This notebook was initially written by **Jude Michael Teves** | for comments, corrections, suggestions, please email: judemichaelteves@gmail.com or jude.teves@dlsu.edu.ph.