# Web Scraping with Selenium


In this notebook, I will show how to use Selenium to search Google and parse the results.


**Table of Contents**
1. [Part 1: Test Selenium](#part1)
2. [Part 2: Get to know Selenium commands](#part2)
3. [Part 3: Searching Google (with an example)](#part3)
4. [Part 4: Do your own search](#part4)

<a id="part1"></a>
## Part 1: Test Selenium

In this part, we will make sure that Selenium is installed and can communicate with the Firefox browser.

In [None]:
import selenium 

If you get an error such as `ModuleNotFoundError: No module named 'selenium'`, you need to install the selenium package:

In [None]:
%pip install selenium

**IMPORTANT STEP:** Check the slides for Day 3 to set up the "path" where Geckodriver is stored. Otherwise, the code below will raise an error.

Now we are creating a Firefox instance and telling it to open a web page:

In [None]:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://cs.wellesley.edu')

If the test was successful, you should see a Firefox window open and display the Wellesley CS homepage.

**Note:** In order to continue with the commands in Part 2, DO NOT close the Firefox window that was opened above. Without the browser, the commands will fail.

<a id="part2"></a>

## Part 2: Get to know Selenium commands

One way to find out which methods and attributes an object offers is to use the Python function `dir`:

In [None]:
print(dir(browser))
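
The output of `dir` is long because it also lists internal names that start with an underscore. As a minimal sketch, the helper below (the name `publicNames` is made up for this notebook, not part of Selenium) keeps only the public names. It is demonstrated on a plain list so that it runs without a browser.

```python
def publicNames(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith('_')]

# Demonstrated on a built-in list so this cell runs on its own.
print(publicNames([]))
```

In this notebook, `publicNames(browser)` would give a shorter list that is easier to scan for methods such as `get` or `find_element`.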

Here are some attributes of the `browser` object that are often useful:

In [None]:
browser.current_url

In [None]:
browser.title

In [None]:
browser.page_source[500:1000] # don't print the entire page, might be too big

Selenium has its own set of commands for accessing the DOM of a page. Here are some examples:

In [None]:
from selenium.webdriver.common.by import By # contains the locator strategies (ID, NAME, CLASS_NAME, ...) for the type of search we want to do

# Find the element with id="navbar"
try:
    print(browser.find_element(By.ID, 'navbar').text)
except Exception as e:
    print(e)

In [None]:
# Find the element with a given class name
try:
    print(browser.find_element(By.CLASS_NAME, 'md-left-sidebar').text)
except Exception as e:
    print(e)

Let's close the browser instance:

In [None]:
browser.quit() # quit() closes every window and ends the session; close() only closes the current window

<a id="part3"></a>

## Part 3: Searching Google

We'll open a new browser instance and visit Google.

In [None]:
browser = webdriver.Firefox()
browser.get('https://google.com')

We want to access the search box. It is named "q", so we can find it by its name:

In [None]:
browser.find_element(By.NAME, "q")

This result shows us that there is an element named "q". We will assign this element to a variable so that we can interact with it:

In [None]:
inputBox = browser.find_element(By.NAME, "q")
inputBox.send_keys("wellesley college")

Notice how the browser copied the phrase into the search box and Google showed the suggested searches. If we want the search to start, we can send an enter event via code:

In [None]:
from selenium.webdriver.common.keys import Keys
inputBox.send_keys(Keys.ENTER)

This sends the query phrase and then the page is loaded.

It's possible to perform other operations with the keyboard. Here is a [list of keyboard keys](https://www.selenium.dev/selenium/docs/api/py/webdriver/selenium.webdriver.common.keys.html) that Selenium recognizes.

Now that we know how to search Google, here are some simple tasks to try.

### Example: Artist Popularity

Let's take a list of **five** famous artists, e.g. Lady Gaga, Rihanna, Taylor Swift, Beyonce, and Britney Spears, and find the number of results (hits) that Google returns for each of them. We will use these numbers to rank the artists. In a sense, one can use the number of hits as a signal of an artist's popularity: more hits suggests that more pages mention the search phrase.

**Good to know**

- The element of the search page that contains the number of results has id="result-stats".
- The result usually looks like this: 'About 6,120,000 results (1.25 seconds) ', so we will extract the number and turn it into an integer before doing the ranking.
- It takes some time for the page to load after a query is sent, so the program should wait between calls. The simplest way to wait is Python's `time.sleep(N)`.
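
A fixed sleep can be too short on a slow connection and wastefully long on a fast one. As a rough sketch of the alternative, the helper below polls a condition until it succeeds or a timeout passes. The name `waitUntil` and its parameters are made up for this notebook; Selenium ships its own explicit-wait tool, `WebDriverWait`, which is built on the same idea.

```python
import time

def waitUntil(condition, timeout=10.0, pollInterval=0.5):
    """Call condition() repeatedly until it returns a truthy value
    or `timeout` seconds have passed.

    Returns the truthy value, or None if the timeout is reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            return None
        time.sleep(pollInterval)  # pause briefly before checking again
```

With a browser open, a call such as `waitUntil(lambda: browser.find_elements(By.ID, "result-stats"))` would return as soon as the element appears, because `find_elements` returns an empty list while nothing matches.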

In [None]:
import time
from selenium.common.exceptions import NoSuchElementException

def getResults(query):
    """Given a query, open a browser instance, search Google for the 
    phrase, then get the result-stats phrase from the page and return it.
    Returns an empty string if the phrase cannot be found.
    """
    browser = webdriver.Firefox()
    try:
        browser.get('https://google.com')
        inputBox = browser.find_element(By.NAME, "q")
        inputBox.clear() # make sure the search box starts empty
        inputBox.send_keys(query)
        inputBox.send_keys(Keys.ENTER)
        time.sleep(1) # wait for the results page to load
        result = browser.find_element(By.ID, "result-stats").text
    except NoSuchElementException:
        # Occasionally, Google Search shows something else at the top of the page.
        print(f"Couldn't find result for {query}")
        result = ""
    finally:
        browser.quit() # always shut the browser down, even after an error
    return result

Next, we define a helper function that extracts the hit number from that phrase.

In [None]:
def processHitNumber(phrase):
    """Assumes that the phrase has the following format:
    About 39,600,000 results (0.75 seconds)
    and extracts the number of results.
    """
    hitNumber = phrase.split()[1]
    return int(hitNumber.replace(',', ''))
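
The helper above depends on the number being the second word of the phrase, so it raises an error if the leading "About" is missing. A more tolerant variant can be sketched with a regular expression. The function name `extractHitCount` is made up for this notebook, and the pattern assumes the English wording "result" or "results" with commas as thousands separators.

```python
import re

def extractHitCount(phrase):
    """Return the first comma-grouped integer that is followed by the
    word 'result' or 'results', or None if there is no such number.
    """
    match = re.search(r'(\d[\d,]*)\s+results?', phrase)
    if match is None:
        return None
    return int(match.group(1).replace(',', ''))

print(extractHitCount('About 39,600,000 results (0.75 seconds)'))  # 39600000
print(extractHitCount('6,120,000 results (1.25 seconds)'))         # 6120000
print(extractHitCount(''))                                         # None
```

Returning `None` instead of raising lets the calling loop skip a query whose result phrase could not be parsed.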

We'll iterate over a list of artists and get the results.

In [None]:
artistsAndHits = [] # to store the pairs of (artistName, numberOfHits)

for name in ['Lady Gaga', 'Rihanna', 'Taylor Swift', 
             'Beyonce', 'Britney Spears']:
    
    results = getResults(name)
    
    print(name, '|', results) 
    if results:
        hitnumber = processHitNumber(results)
        artistsAndHits.append((name, hitnumber))
    
    
sorted(artistsAndHits, key=lambda item: item[1], reverse=True)
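
The sorted list can also be turned into a numbered ranking. The sketch below uses the hypothetical helper `rankByHits` and placeholder pairs whose counts are invented for illustration; they are not real Google results.

```python
# Placeholder pairs standing in for artistsAndHits; the counts are invented.
sampleHits = [('Artist A', 1200), ('Artist B', 45000), ('Artist C', 300)]

def rankByHits(pairs):
    """Return (rank, name, hits) tuples, most hits first, ranks starting at 1."""
    ordered = sorted(pairs, key=lambda item: item[1], reverse=True)
    return [(rank, name, hits)
            for rank, (name, hits) in enumerate(ordered, start=1)]

for rank, name, hits in rankByHits(sampleHits):
    print(f"{rank}. {name:<10} {hits:>12,}")
```

In the notebook, `rankByHits(artistsAndHits)` would produce the same kind of ranking from the collected search results.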

<a id="part4"></a>
## Part 4: Do your own search

Make a list of universities and colleges (ones that you applied to or were interested in before Wellesley), repeat the search as before, and find out which of them is the most "popular" according to Google.

**Optional:** Do the results correlate with other metrics about these institutions (e.g. rankings, endowment, etc.)? How would you go about finding out?