# CCSS Web Scraping in Python (BeautifulSoup) Fall 2023

First, click 'File' -> 'Save a copy in Drive' to work in your own session.

The first module will introduce the fundamentals of web scraping in Python by highlighting the Beautiful Soup library. The second module will explore how to navigate dynamically generated website content through interactive scraping powered by Selenium.

We'll be using WebScraper.io's [webscraping test site](https://webscraper.io/test-sites/e-commerce/static) structured as an electronics e-commerce site throughout the exercises. This webpage serves as a great testing ground to learn about web scraping without having to worry about site blockages when building our initial scraper. We'll be toggling back and forth between this notebook and the website to use its Inspect feature to learn more about the site's underlying structure throughout the demos.


# Module 1 - Fundamentals with Beautiful Soup

Let's start by importing the libraries we'll be using in the first module:

In [1]:
!pip install bs4
!pip install pandas
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import pandas as pd
import pprint

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1256 sha256=a16eae6342357ec7288c84ccb828eb73d2bd7cef08e4118d2c86cf3a1e9840e3
  Stored in directory: /root/.cache/pip/wheels/25/42/45/b773edc52acb16cd2db4cf1a0b47117e2f69bb4eb300ed0e70
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1



Our initial goal will be to retrieve the name, price, description, and number of reviews of the three electronics featured on the site home page. `urllib.request` is a standard-library module that goes hand-in-hand with Beautiful Soup by establishing the HTTP connection we need to fetch the e-commerce site's HTML for parsing. Let's start by inspecting the full HTML of the site by passing the response to the `BeautifulSoup` constructor with the `html.parser` parser:

In [2]:
html = urlopen('https://webscraper.io/test-sites/e-commerce/static')
bs_html = BeautifulSoup(html, 'html.parser')
print(bs_html)

<!DOCTYPE html>

<html lang="en">
<head>
<!-- Google Tag Manager -->
<script>(function (w, d, s, l, i) {
		w[l] = w[l] || [];
		w[l].push({
			'gtm.start':
				new Date().getTime(), event: 'gtm.js'
		});
		var f = d.getElementsByTagName(s)[0],
			j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : '';
		j.async = true;
		j.src =
			'https://www.googletagmanager.com/gtm.js?id=' + i + dl;
		f.parentNode.insertBefore(j, f);
	})(window, document, 'script', 'dataLayer', 'GTM-NVFPDWB');</script>
<!-- End Google Tag Manager -->
<title>Web Scraper Test Sites</title>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper" name="keywords">
<meta content="The most popular web scraping extension. Start scraping in minutes. Automate your tasks with our Cloud Scraper. No software to download, no coding needed." name="description">
<link href="/favicon.png" rel="icon" size

### HTML Tags & Attributes

- **Tags** are used to represent different component types within an HTML file, such as `<title>` for the page title, `<div>` for a designated section within the document, and `<h1>` for a top-level heading.
- Most HTML tags require a closing tag such as `</title>` or `</div>`, so each element has a clear beginning and end.
- HTML tags will often include additional information within the tag itself, known as **attributes**. You can think of an attribute as an additional identifier for a given tag.
- Attributes are marked by keywords such as `class`, `id`, or `href` (referring to hyperlinks) followed by an equals sign and a value, as in `<div class='container'>`.

Here are two external references for learning more about the available [tags](https://www.w3schools.com/tags/default.asp) and [attributes](https://www.w3schools.com/tags/ref_attributes.asp) within HTML.
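
To make the tag and attribute vocabulary concrete, here is a minimal sketch that parses a short hand-written snippet (an assumption for illustration, not markup fetched from the test site) and inspects one tag's name and attributes:

```python
from bs4 import BeautifulSoup

# Hand-written snippet for illustration only; it is not fetched from the site.
snippet = (
    '<div class="caption">'
    '<a class="title" href="/product/1" title="Example Laptop">Example...</a>'
    '</div>'
)
soup = BeautifulSoup(snippet, 'html.parser')

link = soup.find('a')
print(link.name)       # the tag name
print(link.attrs)      # all attributes of the tag as a dictionary
print(link['href'])    # a single attribute, accessed like a dictionary key
print(link.get('id'))  # .get returns None when the attribute is absent
```

Note that Beautiful Soup returns `class` as a list, since an HTML element can carry several classes at once.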


Our parsed Beautiful Soup object offers a variety of methods that allow us to look at specific tags within the HTML. The following first retrieves the first instance of the `h4` heading tag in the HTML file, then finds all instances of the `h4` heading tag:

In [4]:
# find returns the first occurrence.
name = bs_html.find('h4')
print(name)
print(name.get_text())

# find_all returns every matching tag.
names = bs_html.find_all('h4')
print(names)

<h4 class="pull-right price">$1326.83</h4>
$1326.83
[<h4 class="pull-right price">$1326.83</h4>, <h4>
<a class="title" href="/test-sites/e-commerce/static/product/622" title="Hewlett Packard ProBook 640 G3">Hewlett Packard...</a>
</h4>, <h4 class="pull-right price">$1178.19</h4>, <h4>
<a class="title" href="/test-sites/e-commerce/static/product/602" title="Dell Latitude 5580">Dell Latitude 55...</a>
</h4>, <h4 class="pull-right price">$295.99</h4>, <h4>
<a class="title" href="/test-sites/e-commerce/static/product/545" title="Asus VivoBook X441NA-GA190">Asus VivoBook X4...</a>
</h4>]


Attributes are used to distinguish different subgroups of the same base HTML tag. This makes it easy for a web scraping program to retrieve distinct HTML elements that we would otherwise struggle to differentiate because they share the same tag. For example, the e-commerce test site uses `<div class="caption">` and `<div class="ratings">` to distinguish between the product descriptions and the ratings of the items displayed on the website.

We can use the `find_all` method to identify every `div` tag whose `class` attribute is set to `caption` as follows:

In [5]:
bs_html.find_all('div', {'class':'caption'})

[<div class="caption">
 <h4 class="pull-right price">$1326.83</h4>
 <h4>
 <a class="title" href="/test-sites/e-commerce/static/product/622" title="Hewlett Packard ProBook 640 G3">Hewlett Packard...</a>
 </h4>
 <p class="description">Hewlett Packard ProBook 640 G3, 14" FHD, Core i7-7600U, 8GB, 256GB SSD, Windows 10 Pro</p>
 </div>,
 <div class="caption">
 <h4 class="pull-right price">$1178.19</h4>
 <h4>
 <a class="title" href="/test-sites/e-commerce/static/product/602" title="Dell Latitude 5580">Dell Latitude 55...</a>
 </h4>
 <p class="description">Dell Latitude 5580, 15.6" FHD, Core i5-7300U, 16GB, 256GB SSD, Linux + Windows 10 Home</p>
 </div>,
 <div class="caption">
 <h4 class="pull-right price">$295.99</h4>
 <h4>
 <a class="title" href="/test-sites/e-commerce/static/product/545" title="Asus VivoBook X441NA-GA190">Asus VivoBook X4...</a>
 </h4>
 <p class="description">Asus VivoBook X441NA-GA190 Chocolate Black, 14", Celeron N3450, 4GB, 128GB SSD, Endless OS, ENG kbd</p>
 </div>]
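
As a side note, Beautiful Soup also accepts the `class_` keyword argument as a shorthand for the attribute dictionary. The sketch below assumes a small hand-written fragment rather than the live page, and also shows that `find` returns `None` when nothing matches:

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the site's class names (illustration only).
html_doc = """
<div class="caption"><p class="description">First item</p></div>
<div class="ratings"><p class="pull-right">2 reviews</p></div>
<div class="caption"><p class="description">Second item</p></div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Two equivalent ways to filter on the class attribute.
by_dict = soup.find_all('div', {'class': 'caption'})
by_keyword = soup.find_all('div', class_='caption')
print(len(by_dict), len(by_keyword))

# find returns None when no tag matches, so check before calling get_text.
missing = soup.find('div', class_='missing')
print(missing)
```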

The `get_text` method strips away the HTML tags and returns just the text itself:

In [6]:
captions = bs_html.find_all('div', {'class':'caption'})
for name in captions:
    print(name.get_text())


$1326.83

Hewlett Packard...

Hewlett Packard ProBook 640 G3, 14" FHD, Core i7-7600U, 8GB, 256GB SSD, Windows 10 Pro


$1178.19

Dell Latitude 55...

Dell Latitude 5580, 15.6" FHD, Core i5-7300U, 16GB, 256GB SSD, Linux + Windows 10 Home


$295.99

Asus VivoBook X4...

Asus VivoBook X441NA-GA190 Chocolate Black, 14", Celeron N3450, 4GB, 128GB SSD, Endless OS, ENG kbd
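
The blank lines in the output come from the whitespace between tags in the HTML. One way to tidy this up is with `get_text`'s `separator` and `strip` arguments. This is a sketch on a hand-copied fragment of one caption, with a shortened description, rather than the live page:

```python
from bs4 import BeautifulSoup

# One caption block copied by hand, with a shortened description (illustration only).
html_doc = """
<div class="caption">
<h4 class="pull-right price">$295.99</h4>
<h4>
<a class="title">Asus VivoBook X4...</a>
</h4>
<p class="description">Asus VivoBook X441NA-GA190 Chocolate Black</p>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
caption = soup.find('div', class_='caption')

raw = caption.get_text()  # keeps the newlines that sit between the tags
clean = caption.get_text(separator=' | ', strip=True)  # strips and joins the pieces
print(repr(raw))
print(clean)
```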



The basic workflow behind using Beautiful Soup's `find` and `find_all` methods is to first specify the tag you're interested in within the HTML document, and then the associated attributes that further narrow your search within that tag.

We can build on this to retrieve multiple tag types at once, such as all of the level 1, 2, and 4 headings, by passing a list of tag names:

In [7]:
bs_html.find_all(['h1','h2','h4'])

[<h1>Test Sites</h1>,
 <h1>E-commerce training site</h1>,
 <h2>Top items being scraped right now</h2>,
 <h4 class="pull-right price">$1326.83</h4>,
 <h4>
 <a class="title" href="/test-sites/e-commerce/static/product/622" title="Hewlett Packard ProBook 640 G3">Hewlett Packard...</a>
 </h4>,
 <h4 class="pull-right price">$1178.19</h4>,
 <h4>
 <a class="title" href="/test-sites/e-commerce/static/product/602" title="Dell Latitude 5580">Dell Latitude 55...</a>
 </h4>,
 <h4 class="pull-right price">$295.99</h4>,
 <h4>
 <a class="title" href="/test-sites/e-commerce/static/product/545" title="Asus VivoBook X441NA-GA190">Asus VivoBook X4...</a>
 </h4>]
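
`find_all` also accepts a compiled regular expression in place of a tag name, which is one use for the `re` module we imported earlier. The sketch below runs on a hand-written fragment (an assumption for illustration) and matches every heading level from `h1` to `h6`:

```python
import re

from bs4 import BeautifulSoup

# Hand-written fragment for illustration only.
html_doc = "<h1>Test Sites</h1><h2>Top items</h2><p>text</p><h4>$295.99</h4>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Match any tag whose name is 'h' followed by a single digit from 1 to 6.
headings = soup.find_all(re.compile(r'^h[1-6]$'))
heading_names = [tag.name for tag in headings]
print(heading_names)
```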

We can also match multiple attribute values for the same tag in a single `find_all` call. We achieve this by setting our second parameter to a dictionary with the `class` attribute as the key and a collection of the attribute labels `caption` and `ratings` as the value (here a set). This retrieves the same captions as above, now together with the ratings.

In [8]:
caption_reviews = bs_html.find_all('div', {'class':{'caption', 'ratings'}})
for name in caption_reviews:
    print(name.get_text())


$1326.83

Hewlett Packard...

Hewlett Packard ProBook 640 G3, 14" FHD, Core i7-7600U, 8GB, 256GB SSD, Windows 10 Pro


2 reviews





$1178.19

Dell Latitude 55...

Dell Latitude 5580, 15.6" FHD, Core i5-7300U, 16GB, 256GB SSD, Linux + Windows 10 Home


6 reviews







$295.99

Asus VivoBook X4...

Asus VivoBook X441NA-GA190 Chocolate Black, 14", Celeron N3450, 4GB, 128GB SSD, Endless OS, ENG kbd


14 reviews








If you look closely, you may notice that the values for the electronic goods returned by Beautiful Soup don't align with what you currently see when looking at the site directly via Inspect. The site updates the items shown on the homepage rather than always displaying the default values set within the HTML.

### Saving Scraped Data

After identifying the tags and attributes that match the specific data we're interested in collecting from the website's HTML, we can proceed with the actual data collection within our scraper. The `pandas` library integrates naturally with Beautiful Soup HTML parsing by structuring different tag retrievals into designated data columns.

Let's build up to creating a function that retrieves the product data we're interested in via the e-commerce site's URL and stores it within a pandas DataFrame for us. We can envision creating columns that separate out the title, description, price, and reviews of each of the products on the site's home page.

(Check out [CCSS' workshop on pandas](https://github.com/ccss-rs/python-bootcamp/blob/main/pandas.ipynb) on your own if you haven't worked with Pandas before or need a refresher.)  

A helpful strategy that moves us closer to our goal of producing a final data frame is to match each HTML tag holding the data we're interested in by its distinguishing attributes, and to collect the text of all matches into a list with a list comprehension.

Here's an example for the `a` tag distinguished by the `class` attribute labeled as `title`:
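
A minimal sketch of that pattern, using a hand-copied fragment of the site's markup rather than a live request:

```python
from bs4 import BeautifulSoup

# Two product links copied by hand from the output shown earlier.
html_doc = """
<h4><a class="title" href="/test-sites/e-commerce/static/product/622" title="Hewlett Packard ProBook 640 G3">Hewlett Packard...</a></h4>
<h4><a class="title" href="/test-sites/e-commerce/static/product/602" title="Dell Latitude 5580">Dell Latitude 55...</a></h4>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Visible (truncated) link text of every <a class="title"> tag.
titles = [tag.get_text() for tag in soup.find_all('a', {'class': 'title'})]
# The title attribute holds the full product name.
full_titles = [tag['title'] for tag in soup.find_all('a', {'class': 'title'})]
print(titles)
print(full_titles)
```

The link text is cut off with `...`, so reading the `title` attribute is one option when the full product name is needed.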

As we build up to a function that arranges our scraped data in rows and columns for storage on our local computer, it helps to think through which data types pandas can use to map column names (title, description, price, and reviews) to rows representing each scraped product.

A go-to is dictionaries since the key-value structure naturally translates to a column-row format within pandas. We'll therefore create a dictionary to hold our scraped data from Beautiful Soup as follows:
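
As a small illustration of that dictionary-to-DataFrame mapping, assuming two hand-typed values per column rather than scraped ones:

```python
import pandas as pd

# Each key becomes a column name; each list holds that column's rows.
# The values are typed by hand here for illustration.
scraped = {
    'title': ['Hewlett Packard...', 'Dell Latitude 55...'],
    'price': ['$1326.83', '$1178.19'],
}
example_df = pd.DataFrame(scraped)
print(example_df)
```

All lists must have the same length, otherwise pandas raises a `ValueError` when building the DataFrame.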

In [9]:
caption_reviews = bs_html.find_all('div', {'class':{'caption', 'ratings'}})
print(caption_reviews)

[<div class="caption">
<h4 class="pull-right price">$1326.83</h4>
<h4>
<a class="title" href="/test-sites/e-commerce/static/product/622" title="Hewlett Packard ProBook 640 G3">Hewlett Packard...</a>
</h4>
<p class="description">Hewlett Packard ProBook 640 G3, 14" FHD, Core i7-7600U, 8GB, 256GB SSD, Windows 10 Pro</p>
</div>, <div class="ratings">
<p class="pull-right">2 reviews</p>
<p data-rating="1">
<span class="ws-icon ws-icon-star"></span>
</p>
</div>, <div class="caption">
<h4 class="pull-right price">$1178.19</h4>
<h4>
<a class="title" href="/test-sites/e-commerce/static/product/602" title="Dell Latitude 5580">Dell Latitude 55...</a>
</h4>
<p class="description">Dell Latitude 5580, 15.6" FHD, Core i5-7300U, 16GB, 256GB SSD, Linux + Windows 10 Home</p>
</div>, <div class="ratings">
<p class="pull-right">6 reviews</p>
<p data-rating="3">
<span class="ws-icon ws-icon-star"></span>
<span class="ws-icon ws-icon-star"></span>
<span class="ws-icon ws-icon-star"></span>
</p>
</div>, <div

In [10]:
title = [values.get_text() for values in bs_html.find_all('a', {'class': 'title'})]
links = [values['href'] for values in bs_html.find_all('a', {'class': 'title'})]
desc = [values.get_text() for values in bs_html.find_all('p', {'class': 'description'})]
price = [values.get_text() for values in bs_html.find_all('h4', {'class': 'pull-right price'})]
reviews = [values.get_text() for values in bs_html.find_all('p', {'class': 'pull-right'})]

#Create pandas dataframe.
data = pd.DataFrame({
    'title': title,
    'price': price,
    'url': links,
    'description': desc,
    'reviews': reviews
})
display(data)

| | title | price | url | description | reviews |
|---|---|---|---|---|---|
| 0 | Hewlett Packard... | $1326.83 | /test-sites/e-commerce/static/product/622 | Hewlett Packard ProBook 640 G3, 14" FHD, Core ... | 2 reviews |
| 1 | Dell Latitude 55... | $1178.19 | /test-sites/e-commerce/static/product/602 | Dell Latitude 5580, 15.6" FHD, Core i5-7300U, ... | 6 reviews |
| 2 | Asus VivoBook X4... | $295.99 | /test-sites/e-commerce/static/product/545 | Asus VivoBook X441NA-GA190 Chocolate Black, 14... | 14 reviews |
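
The scraped prices and review counts are still text at this point. A possible follow-up step, sketched here on hand-typed values matching the output above, converts them to numbers so they can be sorted or averaged. The new column names `price_usd` and `review_count` are assumptions for illustration:

```python
import pandas as pd

# Hand-typed copy of two columns from the scraped output.
data = pd.DataFrame({
    'price': ['$1326.83', '$1178.19', '$295.99'],
    'reviews': ['2 reviews', '6 reviews', '14 reviews'],
})

# Drop the literal dollar sign, then convert to float.
data['price_usd'] = data['price'].str.replace('$', '', regex=False).astype(float)
# Pull out the leading digits of the review text, then convert to int.
data['review_count'] = data['reviews'].str.extract(r'(\d+)', expand=False).astype(int)
print(data[['price_usd', 'review_count']])
```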


Now that we've worked through the individual steps for collecting data from the website in a format suited to a pandas data frame, we're ready to combine them into one function that runs the entire process and returns the data frame as its output.

`extract` achieves this for us:

In [11]:
def extract(url):
    # Establish our connection to the website and create a Beautiful Soup object
    html = urlopen(url)
    bs_html = BeautifulSoup(html, 'html.parser')
    #Extract desired contents
    title = [values.get_text() for values in bs_html.find_all('a', {'class': 'title'})]
    links = [values['href'] for values in bs_html.find_all('a', {'class': 'title'})]
    desc = [values.get_text() for values in bs_html.find_all('p', {'class': 'description'})]
    price = [values.get_text() for values in bs_html.find_all('h4', {'class': 'pull-right price'})]
    reviews = [values.get_text() for values in bs_html.find_all('p', {'class': 'pull-right'})]

    # Place into data frame
    data = pd.DataFrame({
        'title': title,
        'price': price,
        'url': links,
        'description': desc,
        'reviews': reviews
    })

    #Output dataframe
    return data

In [12]:
data = extract('https://webscraper.io/test-sites/e-commerce/static')
display(data)

| | title | price | url | description | reviews |
|---|---|---|---|---|---|
| 0 | Hewlett Packard... | $1326.83 | /test-sites/e-commerce/static/product/622 | Hewlett Packard ProBook 640 G3, 14" FHD, Core ... | 2 reviews |
| 1 | Dell Latitude 55... | $1178.19 | /test-sites/e-commerce/static/product/602 | Dell Latitude 5580, 15.6" FHD, Core i5-7300U, ... | 6 reviews |
| 2 | Asus VivoBook X4... | $295.99 | /test-sites/e-commerce/static/product/545 | Asus VivoBook X441NA-GA190 Chocolate Black, 14... | 14 reviews |
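
One caveat with building each column from a separate `find_all` call is that the lists only line up if every product has every field. An alternative sketch loops over one container per product and looks up each field inside it. The wrapper class `product`, the helper `text_or_none`, and the fragment itself are hypothetical and written by hand for illustration; use Inspect on a real site to find its actual container:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written fragment; the second product deliberately has no ratings block.
html_doc = """
<div class="product">
  <div class="caption">
    <h4 class="pull-right price">$1326.83</h4>
    <h4><a class="title" href="/product/622">Hewlett Packard...</a></h4>
  </div>
  <div class="ratings"><p class="pull-right">2 reviews</p></div>
</div>
<div class="product">
  <div class="caption">
    <h4 class="pull-right price">$295.99</h4>
    <h4><a class="title" href="/product/545">Asus VivoBook X4...</a></h4>
  </div>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')


def text_or_none(parent, name, class_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = parent.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag is not None else None


rows = []
for product in soup.find_all('div', class_='product'):
    rows.append({
        'title': text_or_none(product, 'a', 'title'),
        'price': text_or_none(product, 'h4', 'price'),
        'reviews': text_or_none(product, 'p', 'pull-right'),
    })

per_product = pd.DataFrame(rows)
print(per_product)
```

A missing field then shows up as an empty cell in that product's row instead of shifting the values of later products.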


We've successfully gone from a whole jumble of HTML and website text to an ordered DataFrame with the exact information we're interested in via Beautiful Soup. In an actual web scraping data collection workflow you'd likely save the above data frame to your personal computer, for example as a CSV file.
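
A minimal sketch of that last step, using a hand-typed two-row frame in place of the scraped one. The file name `products.csv`, the temporary directory, and the `BASE_URL` constant are assumptions for illustration. Since the scraped links are relative paths, the sketch first joins them onto the site's base address with `urljoin`:

```python
import os
import tempfile
from urllib.parse import urljoin

import pandas as pd

BASE_URL = 'https://webscraper.io'  # assumed base address for the relative links

# Hand-typed stand-in for the scraped data frame.
data = pd.DataFrame({
    'title': ['Hewlett Packard...', 'Dell Latitude 55...'],
    'url': [
        '/test-sites/e-commerce/static/product/622',
        '/test-sites/e-commerce/static/product/602',
    ],
})

# Turn each relative path into a full URL.
data['url'] = data['url'].apply(lambda path: urljoin(BASE_URL, path))

# Write without the row index, then read the file back to confirm its contents.
out_path = os.path.join(tempfile.gettempdir(), 'products.csv')
data.to_csv(out_path, index=False)
round_trip = pd.read_csv(out_path)
print(round_trip)
```

Passing `index=False` keeps the row index out of the file, so no extra unnamed column appears when the CSV is read back in.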

## BS4 Exercise
Goal: Extract the job title, job link, and brief description provided.
Website URL:
https://pythonjobs.github.io/

Fill in the blank code:
https://colab.research.google.com/drive/1Qv-qfO-ZWlI375GFRcOaNXVDq7a3WeTq#scrollTo=TDpQlqUNFNwQ




