# **Web Scraping with Python**

### **AJ Zerouali (21/06/24)**

This notebook follows Lectures 121-122 in Section 15. The idea is to practice web scraping on the sandbox website https://books.toscrape.com/. 

I think my main issue with how this material is presented is that there's no systematic discussion of the data structure that bs4 extracts from an HTML page (at least in how Portilla does it). As such, I am also looking at the documentation for bs4 at: https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

**Comment:** One of the interesting parts of this exercise is that there are several pages to scrape. A priori, one doesn't know how many pages a website has.


Again, the original notebook is at:
https://github.com/Pierian-Data/Complete-Python-3-Bootcamp/blob/master/13-Web-Scraping/00-Guide-to-Web-Scraping.ipynb

In [1]:
import requests
import lxml
import bs4

__________________________

### **Warning: (21/06/25)**

Below I am messing around. It's a sequence of trials and errors to understand the Beautiful Soup data structures.

### 1) Initializations:

We start by scraping the first page of https://books.toscrape.com/ and parsing it to acquire the Beautiful Soup (nested) data structure.

In [2]:
## We'll be scraping "https://books.toscrape.com/"
ini_request = requests.get("https://books.toscrape.com/")
# Could alternatively use bs4.BeautifulSoup(ini_request.text,"html.parser"), the parser built into Python; lxml is generally faster
ini_html = bs4.BeautifulSoup(ini_request.text,"lxml")

To display the resulting HTML code with proper indentation etc., we can use the "prettify()" method of bs4.

In [3]:
print(ini_html.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

In [8]:
# The following extracts all the text from the page:
print(ini_html.get_text())

 


    All products | Books to Scrape - Sandbox

















Books to Scrape We love being scraped!








Home

All products









                            
                                Books
                            
                        



                            
                                Travel
                            
                        



                            
                                Mystery
                            
                        



                            
                                Historical Fiction
                            
                        



                            
                                Sequential Art
                            
                        



                            
                                Classics
                            
                        



                            
                                Philosophy
               

### 2) Extracting list of books on the first page:

Using the "Inspect" function in Chrome, we determine the class of the books, which is "product_pod" here:

<img src = "Fig_Books_ToScrape_product_pod_class.png">


Below, we use the ".select()" method (from bs4) to store all the "product_pod" objects on the first page in a list:

In [4]:
## If we're interested in books that are rated 3 stars, we'll need the code to look for: <p class="star-rating Three">
# Recall the command html_code.select(".toctext")
# fp_html stands for First Page  HTML
# More importantly, here is how one extracts the list of books on the given page:
fp_html_book_lst = ini_html.select(".product_pod")

In [4]:
print(f'len(fp_html_book_lst) = {len(fp_html_book_lst)}')
print(fp_html_book_lst)

len(fp_html_book_lst) = 20
[<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">Â£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>, <article class="product_pod">
<div class="image_container">
<a href="catalogue/tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="th
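
The "Â£" in front of the prices is most likely an encoding artifact rather than part of the page: the page declares UTF-8 in its meta tag, and a UTF-8 pound sign decoded as ISO-8859-1 gives exactly this pair of characters. The sketch below reproduces the effect offline on a made-up one-line snippet (the names `raw_bytes`, `garbled` and `fixed` are only for illustration), then shows that handing the raw bytes to Beautiful Soup avoids it.

```python
import bs4

# Bytes as a server would send them for a UTF-8 page (made-up snippet).
raw_bytes = '<p class="price_color">£51.77</p>'.encode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 reproduces the garbled price.
wrong_text = raw_bytes.decode("iso-8859-1")
garbled = bs4.BeautifulSoup(wrong_text, "html.parser").p.get_text()
print(garbled)

# Passing the bytes themselves lets Beautiful Soup do the decoding.
fixed = bs4.BeautifulSoup(raw_bytes, "html.parser", from_encoding="utf-8").p.get_text()
print(fixed)
```

With the real response, the equivalent would be either setting `ini_request.encoding = "utf-8"` before reading `ini_request.text`, or parsing `ini_request.content` (bytes) instead of `ini_request.text`.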

**Note:** There's an alternative way of doing this, using the "find_all" function of bs4 (see documentation website):

In [5]:
fp_article_prod_lst = ini_html.find_all("article")

In [6]:
fp_article_prod_lst

[<article class="product_pod">
 <div class="image_container">
 <a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">Â£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>,
 <article class="product_pod">
 <div class="image_container">
 <a href="catalogue/tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="th

In [6]:
# There are 20 products on the first page:
len(fp_article_prod_lst)

20
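
Both approaches return 20 items here only because every "article" on this page has the class "product_pod". A minimal sketch on a made-up snippet (the third article and all titles are invented for illustration) shows where the two differ, and how `find_all` can filter on the class as well through its `class_` argument:

```python
import bs4

html = """
<section>
  <article class="product_pod"><h3><a title="Book A">Book A</a></h3></article>
  <article class="product_pod"><h3><a title="Book B">Book B</a></h3></article>
  <article class="other"><h3><a title="Not a book">Not a book</a></h3></article>
</section>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

by_css = soup.select("article.product_pod")               # CSS selector: tag name + class
by_find = soup.find_all("article", class_="product_pod")  # same filter with find_all
all_articles = soup.find_all("article")                   # tag name only

print(len(by_css), len(by_find), len(all_articles))
```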

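Strictly speaking, "article" is a tag name (what the bs4 documentation calls a tag), whereas "product_pod" is the value of its "class" attribute. Each entry of the list above is a bs4 *Tag* object. We'll explore this more below.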
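
To make the data structure a bit more concrete, here is a small offline sketch (the one-line HTML string is made up) that prints the types involved: the parsed document, a tag inside it, and the text inside a tag.

```python
import bs4

html = '<article class="product_pod"><h3><a href="x.html" title="T">T ...</a></h3></article>'
soup = bs4.BeautifulSoup(html, "html.parser")
article = soup.article

print(type(soup).__name__)                 # the whole document
print(type(article).__name__)              # one element of the tree
print(article.name)                        # the tag name
print(article["class"])                    # "class" can hold several values, so it is a list
print(type(article.h3.a.string).__name__)  # the text inside the "a" tag
```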

### 3) Attributes of one product:

Here we'll work with the documentation (not with Portilla's functions).
The fifth product on the first page is Yuval N. Harari's "Sapiens" (as of 21/06/25).

In [7]:
prod_YNHarari = fp_article_prod_lst[4]
print(prod_YNHarari.prettify())

<article class="product_pod">
 <div class="image_container">
  <a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html">
   <img alt="Sapiens: A Brief History of Humankind" class="thumbnail" src="media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg"/>
  </a>
 </div>
 <p class="star-rating Five">
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
 </p>
 <h3>
  <a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">
   Sapiens: A Brief History ...
  </a>
 </h3>
 <div class="product_price">
  <p class="price_color">
   Â£54.23
  </p>
  <p class="instock availability">
   <i class="icon-ok">
   </i>
   In stock
  </p>
  <form>
   <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
    Add to basket
   </button>
  </form>
 </div>
</article>



From the image container we have the following image:
<img src="https://books.toscrape.com/media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg">
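
The "src" and "href" values in the article are relative paths, so they have to be combined with the address of the page they were scraped from. A short sketch using the standard library's `urljoin` (the variable names are only for illustration):

```python
from urllib.parse import urljoin

base_url = "https://books.toscrape.com/"
img_src = "media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg"
book_href = "catalogue/sapiens-a-brief-history-of-humankind_996/index.html"

img_url = urljoin(base_url, img_src)
book_url = urljoin(base_url, book_href)
print(img_url)
print(book_url)
```

The first argument should be the URL of the page that was actually requested, since relative links are resolved against it.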

Returning to the code before the image, it gives us a tree of tags within the "article" tag, which looks like this in text:

**- div (1st division in "article"):** The class here is "image_container". A div(ision) is a generic container, usually styled with CSS (the styles could be declared in the header of the HTML, but this page imports them from an external stylesheet). Here it wraps an "a" tag that links to the book's page and holds the cover image. For the "div" tag see https://www.w3schools.com/tags/tag_div.ASP, for the "a" tag see https://www.w3schools.com/tags/tag_a.asp.

**- p (1st parag):** The class is "star-rating XY", where XY is the rating spelled out as a word (e.g. "Five"). It contains five "i" tags, one per star icon (all five are present whatever the rating). The "p" marks a paragraph element (https://www.w3schools.com/tags/tag_p.asp), while "i" marks text in an alternate voice or mood (typically displayed in italics, see https://www.w3schools.com/tags/tag_i.asp).

**- h3 (3rd level heading):** This tag contains one "a" tag, typically used to define a hyperlink (see https://www.w3schools.com/tags/tag_a.asp). Its attributes hold the full title of the book ("title") and a link to its description page ("href"), and bs4 exposes them as a dictionary. A third level heading is essentially like a sub-subsection, see https://www.techonthenet.com/html/elements/h3_tag.php for h3 tags.

**- div (2nd division in "article"):** Contains the product price and its availability as paragraphs and a "form" (for user input in HTML).
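
The layout described above can be listed programmatically. The sketch below works on a shortened, hand-written copy of the article (not the scraped one) and keeps only the direct children that are tags, since `.children` also yields the whitespace between them.

```python
import bs4

html = """
<article class="product_pod">
 <div class="image_container"><a href="b/index.html"><img alt="B" src="b.jpg"/></a></div>
 <p class="star-rating Five"><i class="icon-star"></i></p>
 <h3><a href="b/index.html" title="B">B ...</a></h3>
 <div class="product_price"><p class="price_color">54.23</p></div>
</article>
"""
article = bs4.BeautifulSoup(html, "html.parser").article

# Keep Tag objects only; the strings between tags are just line breaks and spaces.
child_tags = [child for child in article.children if isinstance(child, bs4.element.Tag)]
for child in child_tags:
    print(child.name, child.get("class"))
```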

This gives the basic layout of the "article" tags on this page. Now to extract specific information, we again look at the html code, and use the find_all("tag") function of bs4. For the paragraph tags "p" for instance, this gives the following:

In [32]:
p_tags_YNHarari = prod_YNHarari.find_all("p")
print(p_tags_YNHarari)

[<p class="star-rating Five">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>, <p class="price_color">Â£54.23</p>, <p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>]


Note that this extracts **all** of the * \<p\> parag \</p\> * blocks from the "article" tag, irrespective of their parent tag.
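
If only the paragraphs sitting directly under "article" are wanted, `find_all` accepts `recursive=False`. A small sketch on a hand-written, shortened article:

```python
import bs4

html = """
<article class="product_pod">
 <p class="star-rating Five"><i class="icon-star"></i></p>
 <div class="product_price">
  <p class="price_color">54.23</p>
  <p class="instock availability">In stock</p>
 </div>
</article>
"""
article = bs4.BeautifulSoup(html, "html.parser").article

all_p = article.find_all("p")                      # every "p" below article, at any depth
direct_p = article.find_all("p", recursive=False)  # only "p" tags that are direct children

print(len(all_p), len(direct_p))
print(direct_p[0]["class"])
```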

The rating of the book is stored in the first paragraph, and we have:

In [34]:
p_tags_YNHarari[0].attrs

{'class': ['star-rating', 'Five']}

In [37]:
# The rating word is the second entry of the "class" list in this dictionary
YNHarari_star_rating = p_tags_YNHarari[0]["class"][1]
print(YNHarari_star_rating)

Five


Alternatively:

In [9]:
print(prod_YNHarari.p["class"][1])

Five


In [13]:
prod_YNHarari.p["class"][1] == "Five"


True
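
Indexing with `[1]` relies on the rating word always being the second class. A slightly more defensive sketch is given below; the helper `star_rating` and the table `RATING_WORDS` are hypothetical names, and the words "One", "Two" and "Four" are an assumption (only "Three" and "Five" appear in this notebook).

```python
import bs4

# Assumed spelling of the five possible ratings.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def star_rating(article):
    """Return the rating of a product article as an int, or None if it has none."""
    rating_tag = article.find("p", class_="star-rating")
    if rating_tag is None:
        return None
    # Look for the rating word anywhere in the class list, not only at index 1.
    for css_class in rating_tag.get("class", []):
        if css_class in RATING_WORDS:
            return RATING_WORDS[css_class]
    return None

rated = bs4.BeautifulSoup('<article><p class="star-rating Five"></p></article>', "html.parser").article
unrated = bs4.BeautifulSoup('<article><p class="price_color">54.23</p></article>', "html.parser").article
print(star_rating(rated), star_rating(unrated))
```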

Now that we know how to extract the rating of a product, we look at how to find its title. We know this is in the hyperlink section "a" of the level 3 heading "h3" in "article". A more elegant way of extracting this info is to use the tree discussed above:

In [51]:
prod_YNHarari.h3.a.attrs

{'href': 'catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'title': 'Sapiens: A Brief History of Humankind'}

In [52]:
print(prod_YNHarari.h3.a["title"])

Sapiens: A Brief History of Humankind


**Useful note:** It seems that when HTML has a block of the form:

\<a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind"\>
   Sapiens: A Brief History ...
  \</a\>
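the resulting "a" tag can be indexed like a dictionary in bs4: its attributes (here "href" and "title") are stored in the dictionary *.attrs*, while the tag itself remains a bs4 *Tag* object.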
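
A quick offline check of this behaviour, using the same "a" block as a string:

```python
import bs4

html = ('<a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" '
        'title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a>')
link = bs4.BeautifulSoup(html, "html.parser").a

print(type(link).__name__)           # the tag itself is a Tag object
print(isinstance(link.attrs, dict))  # its attributes live in a dictionary
print(link["title"])                 # dictionary-style access to one attribute
print(link.get("id"))                # .get() returns None for a missing attribute
print(link.get_text())               # the visible (truncated) text is not an attribute
```

Using `link["id"]` instead of `link.get("id")` would raise a `KeyError`, as with an ordinary dictionary.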

### 4) Getting all books with star-rating 3 from first page:

Now we'll make the instructions that will be executed for each scraped page.

In [3]:
# Initializations
n_page = 1
page_url = f"https://books.toscrape.com/catalogue/page-{n_page}.html"
titles_rated3_lst = []


# Scrape page and parse with lxml
page_request = requests.get(page_url)
page_html = bs4.BeautifulSoup(page_request.text,"lxml")

# Extract all books from page
page_article_lst = page_html.find_all("article")

# For each book in the list
for book in page_article_lst:
    # Check rating and add to titles_rated3_lst if 3 star
    if book.p["class"][1] == "Three":
        titles_rated3_lst.append(book.h3.a["title"])
        
# Print result. Should get three titles for first page of books.toscrape.com
for title in titles_rated3_lst:
    print(title)
len(titles_rated3_lst)

A Light in the Attic
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991


3

In [4]:
page_html.title

<title>
    All products | Books to Scrape - Sandbox
</title>

In [5]:
page_html.select("title")[0].getText()

'\n    All products | Books to Scrape - Sandbox\n'
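
The surrounding line breaks and spaces make this string awkward to compare against. A small sketch of one way to clean it, on a hand-written copy of the title tag; `get_text` is the same method as `getText`, and its `strip=True` argument trims the whitespace:

```python
import bs4

html = "<html><head><title>\n    All products | Books to Scrape - Sandbox\n</title></head><body></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

raw_title = soup.title.get_text()
clean_title = soup.title.get_text(strip=True)

print(repr(raw_title))
print(repr(clean_title))
```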

________________________________________________________________________________

### **Non-Existent pages (21/06/24)**

This will be useful for later, as we'll need to use a while loop for our scraping. We'll need to stop scraping when the title says "404 Not Found". The key is to extract the title properly, which we do as follows:

0) Request the page: *page_req = requests.get("page_address")*. The HTML is then available as the string *page_req.text*.

1) Parse the string into a Beautiful Soup object using lxml: *page_html = bs4.BeautifulSoup(page_req.text,"lxml")*.

2) Extract the "title" tag with *page_html.select("title")* (returns a list), then convert the first entry into text: *page_title = page_html.select("title")[0].getText()*.



In [18]:
## This cell is to test a non-existent page. Later, we'll need to use a while loop for our scraping
# nep_request for non-existent page https://books.toscrape.com/catalogue/page-51.html
nep_request = requests.get("https://books.toscrape.com/catalogue/page-51.html")
nep_html = bs4.BeautifulSoup(nep_request.text,"lxml")
nep_title = nep_html.select("title")[0].getText()
print(nep_title)

404 Not Found


We'll use this as a condition to stop the "while" loop, as we'll loop over "https://books.toscrape.com/catalogue/page-XX.html"
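
Another possible stopping condition is the HTTP status code of the response (`status_code` on the object returned by `requests.get`). The sketch below assumes the site answers a missing page with status 404, which the "404 Not Found" title suggests but this notebook does not verify. To stay runnable offline it uses a stand-in, `fake_get`, in place of `requests.get`; `FakeResponse`, `fake_get` and `count_pages` are hypothetical names.

```python
from collections import namedtuple

# Stand-in for a requests.Response, so the sketch runs without network access.
FakeResponse = namedtuple("FakeResponse", ["status_code", "text"])

def fake_get(url):
    """Hypothetical replacement for requests.get: pages 1-2 exist, the rest return 404."""
    page_number = int(url.rsplit("-", 1)[1].split(".")[0])
    if page_number <= 2:
        return FakeResponse(200, f"<title>page {page_number}</title>")
    return FakeResponse(404, "<title>404 Not Found</title>")

def count_pages(get, url_template):
    """Request page-1, page-2, ... until a status code other than 200 comes back."""
    n_page = 1
    while get(url_template.format(n_page)).status_code == 200:
        n_page += 1
    return n_page - 1

url_template = "https://books.toscrape.com/catalogue/page-{}.html"
n_pages = count_pages(fake_get, url_template)
print(n_pages)
```

With network access, `count_pages(requests.get, url_template)` would run the same loop against the real site.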

________________________________________________________________________________

# **Main exercise** 
### 21/06/25

The goal is to scrape books.toscrape.com and find all titles of books with a 3-star rating.


In [2]:
# Imports

import requests
import lxml
import bs4

In [28]:
# Initializations
n_page = 1
page_exists = True
# Titles list. Each entry is a list [page number, "title"]
titles_rated3_lst = []


# Main while loop
while page_exists:
    page_url = f"https://books.toscrape.com/catalogue/page-{n_page}.html"
    
    # Scrape page and parse with lxml
    page_request = requests.get(page_url)
    page_html = bs4.BeautifulSoup(page_request.text,"lxml")
    
    if page_html.select("title")[0].getText() == "\n    All products | Books to Scrape - Sandbox\n":
        # Extract all books from page
        page_article_lst = page_html.find_all("article")

        # For each book in the list
        for book in page_article_lst:
            # Check rating and add to titles_rated3_lst if 3 star
            if book.p["class"][1] == "Three":
                titles_rated3_lst.append([n_page, book.h3.a["title"]])
        # Increment page number
        n_page += 1 
        
    else:
        page_exists = False
        temp_title = page_html.select("title")[0].getText()
        print(f"Page {n_page} of books.toscrape.com has wrong title: \'{temp_title}\'")
        
    

Page 51 of books.toscrape.com has wrong title: '404 Not Found'
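
An alternative to guessing page numbers until one fails is to follow the pager link from one page to the next. The sketch below assumes that each page carries a `<li class="next">` element wrapping a link to the following page; this markup is not shown anywhere in this notebook and should be checked with "Inspect" first. The helper name `next_page_url` and both HTML snippets are made up for illustration.

```python
import bs4
from urllib.parse import urljoin

def next_page_url(page_html, current_url):
    """Return the absolute URL of the next page, or None if there is no 'next' link."""
    next_link = page_html.select_one("li.next > a")
    if next_link is None:
        return None
    # The href is relative, so resolve it against the page it came from.
    return urljoin(current_url, next_link["href"])

# Made-up pager snippets standing in for a middle page and the last page.
page_with_next = bs4.BeautifulSoup(
    '<ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>', "html.parser")
last_page = bs4.BeautifulSoup(
    '<ul class="pager"><li class="previous"><a href="page-49.html">previous</a></li></ul>', "html.parser")

first_url = "https://books.toscrape.com/catalogue/page-1.html"
print(next_page_url(page_with_next, first_url))
print(next_page_url(last_page, "https://books.toscrape.com/catalogue/page-50.html"))
```

A scraping loop built on this would keep requesting the returned URL until the function gives `None`.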


In [31]:
# Print results: every 3-star title, with the page it was found on
print("Titles with 3-star rating:\n")
for title in titles_rated3_lst:
    print(f"- {title[1]}; on page {title[0]}")
print("\n\n Results:")
print(f'len(titles_rated3_lst)={len(titles_rated3_lst)}')
print(f"n_page={n_page}")


Titles with 3-star rating:

- A Light in the Attic; on page 1
- The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull; on page 1
- Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991; on page 1
- Foolproof Preserving: A Guide to Small Batch Jams, Jellies, Pickles, Condiments, and More: A Foolproof Guide to Making Small Batch Jams, Jellies, Pickles, Condiments, and More; on page 2
- Birdsong: A Story in Pictures; on page 2
- America's Cradle of Quarterbacks: Western Pennsylvania's Football Factory from Johnny Unitas to Joe Montana; on page 2
- Aladdin and His Wonderful Lamp; on page 2
- The Five Love Languages: How to Express Heartfelt Commitment to Your Mate; on page 2
- Penny Maybe; on page 2
- Slow States of Collapse: Poems; on page 3
- Unicorn Tracks; on page 3
- Throwing Rocks at the Google Bus: How Growth Became the Enemy of Prosperity; on page 3
- The Natural History of Us (The Fine Art of Pretending #2); on page 3


___________________________________


### **Silly tests**

In [11]:
n =20
str(n)

'20'

In [None]:
## If we're interested in books that are rated 3 stars, we'll need the code to look for: <p class="star-rating Three">
# Recall the command html_code.select(".toctext")
# fp_html stands for First Page  HTML
# class="star-rating Three" holds two classes separated by a space; in a CSS selector each class gets its own dot
fp_html = ini_html.select(".star-rating.Three")
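
The selector ".star-rating.Three" matches elements that carry both classes at once. A short sketch of how it could be used to go straight from the rating to the title, on a hand-written snippet (three shortened articles with titles and ratings taken from the outputs in this notebook); `find_parent` climbs from the matched "p" back up to its enclosing "article":

```python
import bs4

html = """
<article class="product_pod"><p class="star-rating Three"></p>
  <h3><a title="A Light in the Attic">A Light in the ...</a></h3></article>
<article class="product_pod"><p class="star-rating Five"></p>
  <h3><a title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a></h3></article>
<article class="product_pod"><p class="star-rating Three"></p>
  <h3><a title="Penny Maybe">Penny Maybe</a></h3></article>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# "p" tags having both the "star-rating" and the "Three" class.
three_star_tags = soup.select("p.star-rating.Three")
titles = [p.find_parent("article").h3.a["title"] for p in three_star_tags]
print(titles)
```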

In [20]:
divs = prod_YNHarari.find_all("div")
print(divs)

[<div class="image_container">
<a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html"><img alt="Sapiens: A Brief History of Humankind" class="thumbnail" src="media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg"/></a>
</div>, <div class="product_price">
<p class="price_color">Â£54.23</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>]


In [21]:
h3s = prod_YNHarari.find_all("h3")
print(h3s)

[<h3><a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a></h3>]


In [24]:
h3s[0].a.attrs

{'href': 'catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'title': 'Sapiens: A Brief History of Humankind'}

In [25]:
h3s[0].a["title"]

'Sapiens: A Brief History of Humankind'

In [26]:
ps = prod_YNHarari.find_all("p")
print(ps)

[<p class="star-rating Five">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>, <p class="price_color">Â£54.23</p>, <p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>]


In [28]:
ps[0].attrs

{'class': ['star-rating', 'Five']}

#### **Old stuff using Portilla's approach**

In [26]:
fp_html_book_lst[4]

<article class="product_pod">
<div class="image_container">
<a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html"><img alt="Sapiens: A Brief History of Humankind" class="thumbnail" src="media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg"/></a>
</div>
<p class="star-rating Five">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a></h3>
<div class="product_price">
<p class="price_color">Â£54.23</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

In [5]:
fp_html_book_lst[4].p

<p class="star-rating Five">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>

In [7]:
book_rating = fp_html_book_lst[4].select(".star-rating.Five")
book_rating

[<p class="star-rating Five">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>]

In [15]:
fp_html_book_lst[4].select("a")[1]["title"]

'Sapiens: A Brief History of Humankind'

________________________________________________________________________________