There are various tools for scraping a web page, and the Selenium package for Python is one of them. I chose it because Selenium can scrape information from websites that are rendered by JavaScript. Most websites run JavaScript after the page is loaded into the browser to create additional HTML elements.

As a result, fetching the original HTML would not let me navigate the page and capture information from it. Beautiful Soup, another Python package, only parses the HTML it is given, which is normally the original page source, whereas Selenium works with the rendered HTML. Pressing "Ctrl+Shift+I" in the browser opens the developer tools, which show the HTML code of a web page.

In [1]:
# Importing python libraries
import pandas as pd
from selenium import webdriver
from time import sleep
import random
import pickle

I have previously installed ChromeDriver to use Selenium with the Chrome browser.

In [2]:
# This will start chrome browser and open indeed.com
driver = webdriver.Chrome()
url = "https://www.indeed.com"
driver.get(url)

In the Chrome browser on Debian, "Ctrl+Shift+I" opens the inspector for the HTML elements of the page.

And this is how it looks:

<img src="screenshot1.png" align="left" style="height:300px">


To find the HTML code for any element on the page, I selected the arrow at the top (in a red circle) and pointed at "Company Reviews" in this case. The HTML code was highlighted in blue on the right. I can see that the element can be identified by the tag "a" and the href attribute "/companies". Now I can work with this element. The code below finds the element and clicks on it. Let's scrape a few reviews.

In [3]:
# Click on "company reviews"
driver.find_element_by_xpath('//a[@href="/companies"]').click()

The find_element_by_xpath function searches for a web element based on an XPath (XML Path) expression. In the code above I use the tag name "a" and the attribute "href" with the value "/companies". Here is a good tutorial on XPath: https://www.guru99.com/xpath-selenium.html.
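To try the tag-plus-attribute pattern without opening a browser, here is a minimal sketch that uses Python's built-in xml.etree.ElementTree, which understands a small subset of XPath. The markup is a simplified, hypothetical stand-in for the real page, not what indeed.com actually serves.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is far larger
snippet = """
<nav>
  <a href="/jobs">Find Jobs</a>
  <a href="/companies">Company Reviews</a>
  <a href="/salaries">Find Salaries</a>
</nav>
""".strip()

root = ET.fromstring(snippet)

# Same idea as //a[@href="/companies"]: tag "a" with a given href value
link = root.find(".//a[@href='/companies']")
print(link.text)
```

One caveat if you run the notebook with a recent Selenium 4 release: the find_element_by_* helpers are no longer available there, and the equivalent call is driver.find_element(By.XPATH, ...) with By imported from selenium.webdriver.common.by.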

I am going to collect information from the "popular companies" section of the website: company name and number of reviews. Then I will identify the company with the highest number of reviews and capture those reviews.

In [4]:
# Find "popular companies" area on the page
table = driver.find_element_by_xpath("//div[text() = 'Popular Companies']/following-sibling::div")
# Capture all elements associated with companies from the area above
# (the leading "." keeps the search inside "table"; a bare "//" would search the whole page)
companies = table.find_elements_by_xpath(".//div[@class='cmp-PopularCompaniesWidget-companyName']")

<img src="screenshot2.png" align="left" style="height:500px">

Using a similar method in the code above, I identified the area of the page for the "popular companies" section and then the elements for each item (company) in that area. The XPath expressions above use the tag "div" and two conditions: the element's text and its class attribute. "following-sibling::div" selects the "div" elements that come after the current element at the same level, which is why it is called "sibling"; find_element returns the first of them.
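The "sibling" idea can be illustrated without a browser. ElementTree from the standard library does not support the following-sibling axis, so this sketch emulates it by walking the parent's children. The markup, the class names and the helper following_siblings are all hypothetical and only mimic the structure described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a heading "div" followed by a sibling "div" with the list
snippet = """
<section>
  <div>Popular Companies</div>
  <div class="list">
    <div class="name">Company A</div>
    <div class="name">Company B</div>
  </div>
</section>
""".strip()


def following_siblings(parent, element, tag):
    """Return the children of `parent` that come after `element` and have `tag`."""
    children = list(parent)
    position = children.index(element)
    return [child for child in children[position + 1:] if child.tag == tag]


root = ET.fromstring(snippet)
# Locate the heading by its text, like div[text() = 'Popular Companies']
heading = next(d for d in root.findall("div") if d.text == "Popular Companies")
# The first following sibling "div" plays the role of "table"
table = following_siblings(root, heading, "div")[0]
# Search relative to "table" only, like a path that starts with "./"
names = [d.text for d in table.findall("./div[@class='name']")]
print(names)
```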

The function below captures the company name and number of reviews and saves this information in a dictionary.

In [5]:
def get_companies_info(companies):
    comp_dict = {}
    for i in range(len(companies)):
        # capture company name
        name = companies[i].find_element_by_tag_name("a").text
        # capture number of reviews
        reviews = companies[i].find_element_by_xpath("./following-sibling::a").text
        comp_dict[i] = {"name":name, "reviews":reviews}
    return comp_dict

In [6]:
# Put company name and number of reviews in a dataframe
comp_dict = get_companies_info(companies)
comp_df = pd.DataFrame.from_dict(comp_dict, orient='index')
comp_df.head()

   name             reviews
0  The Home Depot   43,693 reviews
1  Kmart            18,477 reviews
2  Kroger Stores    25,744 reviews
3  Bank of America  26,470 reviews
4  Sears            25,854 reviews


The code below finds the company with the highest number of reviews in the dataframe, identifies the element that corresponds to this company on the web page, and navigates to the company's page to collect its reviews.

In [7]:
# Retrieve numeric values from the text "reviews" field
comp_df["reviews_n"] = comp_df["reviews"].apply(lambda x: int(x.split()[0].replace(',', '')))
# Find the company with the highest number of reviews (idxmax returns an index label, so use .loc)
max_reviews = comp_df.loc[comp_df['reviews_n'].idxmax(), 'reviews']
# Find HTML element on the page with corresponding number of reviews and click on it
table.find_element_by_xpath('.//div[text() = "%s"]' % max_reviews).click()
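The parsing step in the cell above can be checked in isolation. This is a small sketch in plain Python (no pandas, no browser) that applies the same split-and-replace logic to the five rows shown in the earlier output. The helper name parse_review_count is my own and is not part of the notebook.

```python
def parse_review_count(label):
    """Turn a label such as "43,693 reviews" into the integer 43693."""
    return int(label.split()[0].replace(",", ""))


# Rows copied from the dataframe output shown earlier
companies_seen = [
    ("The Home Depot", "43,693 reviews"),
    ("Kmart", "18,477 reviews"),
    ("Kroger Stores", "25,744 reviews"),
    ("Bank of America", "26,470 reviews"),
    ("Sears", "25,854 reviews"),
]

# Pick the row whose parsed review count is the largest
top_name, top_label = max(companies_seen, key=lambda row: parse_review_count(row[1]))
print(top_name, parse_review_count(top_label))
```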

The Home Depot has the highest number of reviews, so let's scrape those. But first I am going to define a few functions.

In [8]:
comp_df.loc[comp_df['reviews'] == max_reviews]

   name            reviews         reviews_n
0  The Home Depot  43,693 reviews  43693


There are multiple pages with reviews for each company. This function clicks on the "next page" button.

In [9]:
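# This function tries to locate the "next page" button. If it can't find the button, this is the last page.
# Instead of raising an exception it returns 0
from selenium.common.exceptions import NoSuchElementException

def next_page():
    try:
        driver.find_element_by_xpath('//a[@data-tn-element = "next-page"]').click()
    # Catch only the "button not found" case so that other errors are not silently hidden
    except NoSuchElementException:
        return 0
    else:
        return 1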

An XPath expression makes it possible to find all elements representing individual text reviews on a page and capture their text and ratings. The following function does exactly that for the current page.

In [10]:
# Capture all text company reviews along with ratings from a particular page and save them in a dictionary
def grab_one_page():
    one_page = driver.find_elements_by_xpath('//div[@id = "cmp-content"]/div[@class = "cmp-review-container"]')
    reviews_dict = {}
    for i in range(len(one_page)):
        review_text = one_page[i].find_element_by_class_name("cmp-review-text").text
        review_rating = one_page[i].find_element_by_class_name("cmp-ratingNumber").text
        reviews_dict[i] = {"review_text":review_text, "review_rating":review_rating}
    return reviews_dict

In [11]:
# To save dataframe with reviews information on a disk
def save_results(reviews_df):
    with open('results.pickle', 'wb') as f:
        pickle.dump(reviews_df, f, protocol=pickle.HIGHEST_PROTOCOL)

The function below checks whether the title of the page has changed. I found that any problem with loading a page changes the title. This way I can identify such an issue and refresh the page before continuing to scrape.

In [12]:
# Copy title of the page
title = driver.title

# Check whether title of the page changed and try to refresh the page a few times
def check_page_load():
    for i in range(3):
        if driver.title == title:
            return 1
        else:
            print("Reloading a page...")
            driver.refresh()
            sleep(10)
    else:
        return 0
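The retry logic can be tried out without a browser. This sketch passes the title lookup and the refresh action in as functions so that a fake object can stand in for the driver. The names wait_for_expected_title and FakeBrowser are hypothetical, and unlike check_page_load the sketch checks the title once more after the last refresh.

```python
import time


def wait_for_expected_title(get_title, refresh, expected_title, attempts=3, pause=0):
    """Refresh up to `attempts` times until the title matches `expected_title`."""
    for _ in range(attempts):
        if get_title() == expected_title:
            return True
        refresh()
        time.sleep(pause)
    # One final look after the last refresh
    return get_title() == expected_title


class FakeBrowser:
    """Hypothetical stand-in for the driver: the title recovers after N refreshes."""

    def __init__(self, good_title, refreshes_needed):
        self.good_title = good_title
        self.refreshes_left = refreshes_needed

    def title(self):
        return self.good_title if self.refreshes_left == 0 else "Error"

    def refresh(self):
        self.refreshes_left = max(0, self.refreshes_left - 1)


browser = FakeBrowser("Company Reviews", refreshes_needed=2)
ok = wait_for_expected_title(browser.title, browser.refresh, "Company Reviews")
print(ok)
```

With the real driver the call would presumably look like wait_for_expected_title(lambda: driver.title, driver.refresh, title, pause=10).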

The following function navigates through the pages with reviews, collecting and saving them. It uses the functions defined above.

While scraping information from indeed.com, I am going to wait between 3 and 6 seconds before moving to the next reviews page. This is to make sure I am not hitting the server too frequently.

In [13]:
# This function will scrape reviews until it reaches the number of pages or the number of reviews requested by the user
def scrape_reviews(max_number_pages, max_reviews, reviews_df):
    for i in range(max_number_pages):
        print("Page %i is being scraped" %(i+1))
        reviews_dict = grab_one_page()
        
        # save reviews in a dataframe
        # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
        reviews_df = pd.concat([reviews_df, pd.DataFrame.from_dict(reviews_dict, orient='index')], ignore_index=True)
        print("%i reviews have been collected so far\n" %(len(reviews_df)))
        
        # wait between 3 and 6 seconds to proceed
        sleep(random.randint(3,6))
        
        # save dataframe on a disk every 10 pages
        if i%10 == 0:
            save_results(reviews_df)
        
        # in case there are no more pages with reviews left, save results and quit
        if next_page() == 0:
            save_results(reviews_df)
            print("No more reviews, %i reviews were successfully captured" %(len(reviews_df)))
            return reviews_df
        
        # if number of reviews requested by user was captured, save results and quit
        if len(reviews_df) > max_reviews:
            save_results(reviews_df)
            print("%i reviews captured is greater than number of reviews requested" %(len(reviews_df)))
            return reviews_df
        
        # if web page failed to load after trying to refresh it, save results and quit
        if check_page_load() == 0:
            save_results(reviews_df)
            print("Page %i failed to load - %i reviews were captured" %(i+2, len(reviews_df)))
            return reviews_df
    
    # in case number of pages requested by user was captured, save results and quit
    else:
        save_results(reviews_df)
        print("%i pages were successfully scraped as requested" %(i+1))
        return reviews_df
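To see which stopping rule fires for a given request, the control flow above can be mirrored on in-memory data. This is only a sketch: scrape_simulated and the list of fake pages are hypothetical, and the waiting, saving and page-load checks are left out.

```python
def scrape_simulated(pages, max_number_pages, max_reviews):
    """Mirror the stopping rules of scrape_reviews on in-memory data.

    `pages` is a hypothetical list of lists standing in for review pages.
    Returns the collected reviews and the reason the loop stopped.
    """
    collected = []
    for i in range(max_number_pages):
        collected.extend(pages[i])
        # A "next page" exists only if another page follows the current one
        if i + 1 >= len(pages):
            return collected, "no more pages"
        # Same strict comparison as in scrape_reviews
        if len(collected) > max_reviews:
            return collected, "review limit exceeded"
    return collected, "page limit reached"


# 60 simulated pages with 20 reviews each
pages = [["review %d-%d" % (p, r) for r in range(20)] for p in range(60)]
reviews, reason = scrape_simulated(pages, 50, 1000)
print(len(reviews), reason)
```

Because the comparison is strict, a request for 50 pages and 1000 reviews at 20 reviews per page ends on the page limit rather than on the review limit.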

Everything is ready to start scraping data. I am going to collect 1000 reviews from 50 pages, since there are 20 reviews per page. This will be enough to show how it works, and I can collect more later.

In [None]:
reviews_df = pd.DataFrame(columns=['review_text', 'review_rating'])
reviews_df = scrape_reviews(50, 1000, reviews_df)

Let's make sure the dataframe saved to disk is the same as the original dataframe.

In [15]:
with open('results.pickle', 'rb') as f:
    reviews_df2 = pickle.load(f)

print(reviews_df.equals(reviews_df2))

True


1000 reviews with ratings can be a good start for a machine learning project, for example training an NLP model to predict a review's rating from its text. This number can be increased by changing the function parameters, or by running the code a few times to avoid unnecessary load on the website.