___

<a href='https://www.udemy.com/user/joseportilla/'><img src='../Pierian_Data_Logo.png'/></a>
___
<center><em>Content Copyright by Pierian Data</em></center>

# Guide to Web Scraping

Let's get you started with web scraping and Python. Before we begin, here are some important rules to follow and understand:

1. Always be respectful and try to get permission to scrape. Do not bombard a website with scraping requests, otherwise your IP address may be blocked!
2. Be aware that websites change often, meaning your code could go from working to totally broken from one day to the next.
3. Pretty much every web scraping project of interest is a unique and custom job, so try your best to generalize the skills learned here.

OK, let's get started with the basics!

## Basic components of a Website

### HTML
HTML stands for Hypertext Markup Language, and every website on the internet uses it to display information. Even the Jupyter notebook system uses it to display this information in your browser. If you right-click on a website and select "View Page Source" you can see the raw HTML of a web page. This is the text that Python will be looking at to grab information from. Let's take a look at a simple webpage's HTML:

    <!DOCTYPE html>  
    <html>  
        <head>
            <title>Title on Browser Tab</title>
        </head>
        <body>
            <h1> Website Header </h1>
            <p> Some Paragraph </p>
        </body>
    </html>

Let's break down these components.

Every <tag> indicates a specific block type on the webpage:

    1. <!DOCTYPE html> HTML documents will always start with this type declaration, letting the browser know it's an HTML file.
    2. The component blocks of the HTML document are placed between <html> and </html>.
    3. Metadata and script connections (like a link to a CSS file or a JS file) are often placed in the <head> block.
    4. The <title> tag block defines the title of the webpage (it's what shows up in the tab of a website you're visiting).
    5. Between the <body> and </body> tags are the blocks that will be visible to the site visitor.
    6. Headings are defined by the <h1> through <h6> tags, where the number represents the level of the heading (<h1> is the largest by default).
    7. Paragraphs are defined by the <p> tag; this is essentially just normal text on the website.

    There are many more tags than just these, such as <a> for hyperlinks, <table> for tables, <tr> for table rows, <td> for table cells, and more!
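
To make the tag structure concrete, here is a minimal sketch that parses the sample page with BeautifulSoup (installed later in this guide) and pulls out three of the blocks described above. The variable names are only for illustration, and the built-in `html.parser` is used so the snippet runs without lxml.

```python
import bs4

# The sample page, written with a closing </body> tag
sample_html = """<!DOCTYPE html>
<html>
    <head>
        <title>Title on Browser Tab</title>
    </head>
    <body>
        <h1> Website Header </h1>
        <p> Some Paragraph </p>
    </body>
</html>"""

# 'html.parser' ships with Python; 'lxml' would also work here
sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.select("title")[0].get_text()
header_text = sample_soup.select("h1")[0].get_text(strip=True)
paragraph_text = sample_soup.select("p")[0].get_text(strip=True)

print(page_title)      # Title on Browser Tab
print(header_text)     # Website Header
print(paragraph_text)  # Some Paragraph
```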

### CSS

CSS stands for Cascading Style Sheets. It is what gives "style" to a website, including colors and fonts, and even some animations! CSS uses attributes such as **id** or **class** to connect an HTML element to a CSS feature, such as a particular color. An **id** must be unique within the HTML document, so it is basically a single-use connection. A **class** defines a general style that can then be linked to multiple HTML tags. Basically, if you only want a single HTML tag to be red, you would use an id; if you wanted several HTML tags/blocks to be red, you would create a class in your CSS doc and then link it to each of those blocks.

### Scraping Guidelines

Keep in mind you should always have permission for the website you are scraping! Check a website's terms and conditions for more info. Also keep in mind that a computer can send requests to a website very fast, so a website may block your computer's IP address if you send too many requests too quickly. Lastly, websites change all the time! You will most likely need to update your code often for long-term web-scraping jobs.
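
Besides the terms and conditions, many sites publish a `robots.txt` file describing which paths automated clients may visit. As a rough sketch, Python's standard `urllib.robotparser` module can read such rules. The rules, the `my-scraper` user agent name, and the URLs below are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, used here only for illustration
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# "my-scraper" is a hypothetical user agent name
allowed_public = parser.can_fetch("my-scraper", "https://www.example.com/index.html")
allowed_private = parser.can_fetch("my-scraper", "https://www.example.com/private/data.html")
delay = parser.crawl_delay("my-scraper")

print(allowed_public)   # True
print(allowed_private)  # False
print(delay)            # 2
```

For a live site you would instead call `parser.set_url(...)` with the address of the site's `robots.txt` and then `parser.read()`, which needs network access.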

## Web Scraping with Python

There are a few libraries you will need. You can go to your command line and install them with conda install (if you are using the Anaconda distribution), or pip install for other Python distributions.

    conda install requests
    conda install lxml
    conda install bs4
    
If you are not using the Anaconda installation, you can use **pip install** instead of **conda install**, for example:

    pip install requests
    pip install lxml
    pip install bs4
    
Now let's see what we can do with these libraries.

----

### Example Task 0 - Grabbing the title of a page

Let's start very simple: we will grab the title of a page. Remember that this is the HTML block with the **title** tag. The cells below request **https://www.oskarwang.com**, but any simple page works the same way, for example **www.example.com**, a website specifically made to serve as an example domain. Let's go through the main steps:

In [1]:
!pip install bs4



# REQUESTS

In [2]:
import requests

In [3]:
# Step 1: Use the requests library to grab the page
# Note, this may fail if you have a firewall blocking Python/Jupyter 
# Note sometimes you need to run this twice if it fails the first time

res = requests.get('https://www.oskarwang.com')

This object is a requests.models.Response object and it actually contains the information from the website, for example:

In [4]:
type(res)

requests.models.Response

In [5]:
res.text

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">\n    <meta name="theme-color" content="#0a0a0a">\n    <title>owhy Photography</title>\n    <link rel="icon" href="logos/favicon.svg" type="image/svg+xml">\n    <link href="https://fonts.googleapis.com/css2?family=Urbanist:wght@300;400;600&display=swap" rel="stylesheet">\n    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css">\n    <link rel="stylesheet" href="style.css">\n</head>\n<body>  \n    <!-- Custom cursor removed for better compatibility -->\n    <div class="preloader">\n        <div class="spinner"></div>\n    </div>\n\n    <!-- Remove inline onclick from back button -->\n    <div class="back-btn">← Home</div>\n\n    <!-- Main Page -->\n    <div class="page main-page active-page">\n        <h1 class="logo">Oskar Wang Studios</h1>\n        <p cl

____
Now we use BeautifulSoup to analyze the extracted page. Technically we could use our own custom script to look for items in the string of **res.text**, but the BeautifulSoup library already has lots of built-in tools and methods to grab information from a string of this nature (basically an HTML file). Using BeautifulSoup we can create a "soup" object that contains all the "ingredients" of the webpage. Don't ask me about the weird library names, I didn't choose them! :)

# BEAUTIFULSOUP

In [6]:
import bs4

In [7]:
soup = bs4.BeautifulSoup(res.text, 'lxml')

In [8]:
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport"/>
<meta content="#0a0a0a" name="theme-color"/>
<title>owhy Photography</title>
<link href="logos/favicon.svg" rel="icon" type="image/svg+xml"/>
<link href="https://fonts.googleapis.com/css2?family=Urbanist:wght@300;400;600&amp;display=swap" rel="stylesheet"/>
<link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css" rel="stylesheet"/>
<link href="style.css" rel="stylesheet"/>
</head>
<body>
<!-- Custom cursor removed for better compatibility -->
<div class="preloader">
<div class="spinner"></div>
</div>
<!-- Remove inline onclick from back button -->
<div class="back-btn">← Home</div>
<!-- Main Page -->
<div class="page main-page active-page">
<h1 class="logo">Oskar Wang Studios</h1>
<p class="tagline">Capturing moments you'll feel forever.</p>
<!-- Update nav buttons section -->
<div clas

Now let's use the **.select()** method to grab elements. We are looking for the 'title' tag, so we will pass in 'title'


In [9]:
soup.select('title')

[<title>owhy Photography</title>]

In [10]:
soup.select('h2')

[<h2 class="portfolio-title">Portfolio Categories</h2>,
 <h2 class="portfolio-title album-category-title">Category Albums</h2>,
 <h2 class="portfolio-title album-title">Album Photos</h2>,
 <h2 class="pricing-title">Photography Packages</h2>]

Notice what is returned here: it's actually a list containing all the matching elements (along with their tags). You can use indexing or even looping to grab the elements from the list. Since each item in the list is still a specialized Tag object, we can use method calls to grab just the text.

In [11]:
title_tag = soup.select('title')

In [12]:
title_tag[0]

<title>owhy Photography</title>

In [13]:
type(title_tag[0])

bs4.element.Tag

In [14]:
title_tag[0].getText()

'owhy Photography'
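
When only the first match is needed, BeautifulSoup also offers `.select_one()`, which returns a single Tag (or `None`) instead of a list. The sketch below uses a small inline HTML string rather than the live page, and parses it with the built-in `html.parser` so it runs without lxml.

```python
import bs4

html = "<html><head><title>owhy Photography</title></head><body></body></html>"
demo_soup = bs4.BeautifulSoup(html, "html.parser")

first_title = demo_soup.select_one("title")  # a single Tag, or None if nothing matches
missing = demo_soup.select_one("h2")         # this snippet has no <h2>

# Guard against None before asking for the text
title_text = first_title.get_text() if first_title is not None else ""

print(title_text)  # owhy Photography
print(missing)     # None
```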

### Example Task 1 - Grabbing all elements of a class

Let's try to grab all the section headings of the Wikipedia Article on Grace Hopper from this URL: https://en.wikipedia.org/wiki/Grace_Hopper

In [15]:
# First get the request
res = requests.get('https://en.wikipedia.org/wiki/Grace_Hopper')

In [16]:
# Create a soup from request
soup = bs4.BeautifulSoup(res.text,"lxml")

Now it's time to figure out what we are actually looking for. Inspect the element on the page to see that the section headers have the class "mw-headline". The exact class name can vary, and the cells below use "vector-toc-text". Because this is a class and not a straight tag, we need to adhere to some CSS selector syntax. The most common patterns are:

<table>

<thead >
<tr>
<th>
<p>Syntax to pass to the .select() method</p>
</th>
<th>
<p>Match Results</p>
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<p><code>soup.select('div')</code></p>
</td>
<td>
<p>All elements with the <code>&lt;div&gt;</code> tag</p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('#some_id')</code></p>
</td>
<td>
<p>The HTML element containing the <code>id</code> attribute of <code>some_id</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('.notice')</code></p>
</td>
<td>
<p>All the HTML elements with the CSS <code>class</code> named <code>notice</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('div span')</code></p>
</td>
<td>
<p>Any elements named <code>&lt;span&gt;</code> that are within an element named <code>&lt;div&gt;</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('div &gt; span')</code></p>
</td>
<td>
<p>Any elements named <code class="literal2">&lt;span&gt;</code> that are <span><em >directly</em></span> within an element named <code class="literal2">&lt;div&gt;</code>, with no other element in between</p>
</td>
</tr>
</tbody>
</table>
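
The selector patterns in the table can be tried on a small, made-up HTML snippet. This is only a sketch: the snippet and its contents are invented for illustration, and `html.parser` is used so it runs without lxml.

```python
import bs4

# Invented HTML that exercises each selector from the table
demo_html = """
<div id="some_id" class="notice">
    <span>direct child</span>
    <p><span>nested deeper</span></p>
</div>
<p class="notice">second notice</p>
"""
demo_soup = bs4.BeautifulSoup(demo_html, "html.parser")

div_count = len(demo_soup.select("div"))                   # tag name
id_tag_name = demo_soup.select("#some_id")[0].name         # id
notice_count = len(demo_soup.select(".notice"))            # class
descendant_spans = len(demo_soup.select("div span"))       # any depth inside <div>
direct_child_spans = len(demo_soup.select("div > span"))   # direct children only

print(div_count)           # 1
print(id_tag_name)         # div
print(notice_count)        # 2
print(descendant_spans)    # 2
print(direct_child_spans)  # 1
```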

In [17]:
# note depending on your IP Address, 
# this class may be called something different
soup.select(".vector-toc-text")

[]

In [18]:
for item in soup.select(".vector-toc-text"):
    print(item.text)
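
An empty list like the one above can mean either that the selector matched nothing or that the server never returned the article (a later cell notes that Wikipedia was denying requests). The sketch below shows one way to tell these apart by checking the response status. Sending a descriptive User-Agent is an assumption about what may help here, and the `USER_AGENT` string and `fetch_html` helper are hypothetical names.

```python
import requests

# Hypothetical identifier; replace it with something that describes your project
USER_AGENT = "web-scraping-tutorial/0.1 (learning project)"

def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on a 4xx or 5xx reply."""
    res = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    res.raise_for_status()  # raises requests.HTTPError for error status codes
    return res.text

# Usage (requires network access):
# html = fetch_html("https://en.wikipedia.org/wiki/Grace_Hopper")
```

If `fetch_html` raises, the problem is the request itself. If it returns HTML and `.select()` still gives an empty list, the class name is the more likely culprit.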

### Example Task 2 - Getting an Image from a Website

Let's attempt to grab the image of the Deep Blue computer from this Wikipedia article: https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)

In [19]:
res = requests.get("https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)")

In [20]:
# wikipedia is denying my request so we have to try a different website... for example: 
res = requests.get('https://www.britannica.com/biography/Alan-Turing')

In [21]:
res.text

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n<!doctype html>\n\n<html lang="en" class="topic-desktop ui-unknown0 ui-unknown">\n\n<head prefix="og: https://ogp.me/ns# fb: https://ogp.me/ns/fb#">\n\n    <meta charset="utf-8">\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1.0" />\n\n    \n\n    <link rel="dns-prefetch" href="https://cdn.britannica.com/mendel-resources/3-150">\n    <link rel="preconnect" href="https://cdn.britannica.com/mendel-resources/3-150">\n\n    \n\n    <link rel="preload" as="script" href="https://www.googletagservices.com/tag/js/gpt.js" />\n\n    \n\n    <link rel="icon" href="/favicon.png" />\n\n    \n\n    \n        \n        \n            <meta name="description" content="Alan Turing was a British mathematician and logician, a major contributor to mathema

In [22]:
soup = bs4.BeautifulSoup(res.text, 'lxml')

In [23]:
# .mw-file-element

image_info = soup.select('img')

In [24]:
image_info

[<img alt="Encyclopedia Britannica" class="global-nav-logo global-nav-logo-left" loading="lazy" src="https://cdn.britannica.com/mendel/eb-logo/MendelNewThistleLogo.png"/>,
 <img alt="Encyclopedia Britannica" class="global-nav-center global-nav-logo non-homepage-logo" loading="lazy" src="https://cdn.britannica.com/mendel/eb-logo/MendelNewThistleLogo.png"/>,
 <img alt="Alan Turing" height="50" loading="lazy" src="https://cdn.britannica.com/81/191581-004-95328E05/Alan-Turing.jpg"/>,
 <img alt="Enigma machine explained" class="col-100" loading="lazy" src="https://cdn.britannica.com/79/222279-138-B1AAFBB0/The-Enigma-Machine-Explained.jpg?w=400&amp;h=225&amp;c=crop"/>,
 <img alt="Bombe machine" height="50" loading="lazy" src="https://cdn.britannica.com/95/194795-004-87A9A9D3/Detail-drums-Bombe-machine-code-breaking-others-Alan.jpg"/>,
 <img alt="Enigma" height="50" loading="lazy" src="https://cdn.britannica.com/51/182351-004-9F7E9A75/Germans-Enigma-machine-military-communications-Alan-Turing

In [25]:
len(image_info)

22

In [26]:
alan = image_info[2]

In [27]:
type(alan)

bs4.element.Tag

You can make dictionary-like calls for parts of the Tag. In this case, we are interested in the **src**, or "source" of the image, which should be its own .jpg or .png link:

In [28]:
alan['src']

'https://cdn.britannica.com/81/191581-004-95328E05/Alan-Turing.jpg'

We can actually display it with a markdown cell with the following:

    <img src='https://cdn.britannica.com/81/191581-004-95328E05/Alan-Turing.jpg'>

<img src='https://cdn.britannica.com/81/191581-004-95328E05/Alan-Turing.jpg'>

Now that you have the actual src link, you can grab the image with requests.get() and read its raw bytes from the .content attribute. Note that the link must be a full URL including the https:// prefix (as it is here). If the src you scraped lacks it, add it yourself, otherwise requests will complain (though it gives you a pretty descriptive error message).

In [29]:
image_link = requests.get('https://cdn.britannica.com/81/191581-004-95328E05/Alan-Turing.jpg')
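
The point about the `https://` prefix matters because a scraped `src` is not always a full URL. One way to handle this, sketched below, is `urljoin` from the standard library, which resolves a `src` value against the address of the page it came from. The `images.example.com` host is hypothetical.

```python
from urllib.parse import urljoin

page_url = "https://books.toscrape.com/catalogue/page-1.html"

# Already absolute: returned unchanged
absolute = urljoin(page_url, "https://cdn.britannica.com/81/191581-004-95328E05/Alan-Turing.jpg")

# Protocol-relative (hypothetical host): the scheme is copied from page_url
protocol_relative = urljoin(page_url, "//images.example.com/photo.jpg")

# Relative path, as in the book thumbnails later in this notebook
relative = urljoin(page_url, "../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg")

print(absolute)
print(protocol_relative)  # https://images.example.com/photo.jpg
print(relative)           # https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg
```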

In [30]:
# The raw content (its a binary file, meaning we will need to use binary read/write methods for saving it)
image_link.content

b'\xff\xd8\xff\xdb\x00C\x00\x06\x04\x05\x06\x05\x04\x06\x06\x05\x06\x07\x07\x06\x08\n\x10\n\n\t\t\n\x14\x0e\x0f\x0c\x10\x17\x14\x18\x18\x17\x14\x16\x16\x1a\x1d%\x1f\x1a\x1b#\x1c\x16\x16 , #&\')*)\x19\x1f-0-(0%()(\xff\xdb\x00C\x01\x07\x07\x07\n\x08\n\x13\n\n\x13(\x1a\x16\x1a((((((((((((((((((((((((((((((((((((((((((((((((((\xff\xc0\x00\x11\x08\x01\xc2\x01\\\x03\x01"\x00\x02\x11\x01\x03\x11\x01\xff\xc4\x00\x1c\x00\x00\x01\x05\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x01\x03\x04\x05\x06\x00\x07\x08\xff\xc4\x00=\x10\x00\x01\x03\x03\x03\x02\x05\x01\x07\x02\x04\x06\x03\x01\x01\x00\x01\x02\x03\x11\x00\x04!\x05\x121AQ\x06\x13"aq\x81\x07\x142\x91\xa1\xb1\xc1#\xd1\x15BR\xe1\x08$3b\xf0\xf1\x164r%C\xff\xc4\x00\x14\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xc4\x00\x14\x11\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xda\x00\x0c\x03\x01\x00\x02\x11\x03\x11\x00?\x00\xfa\x89G\'\xf7\xa4\xdd\x1f=i\x14\xac\xfbP\x15fE\x03\x84\x9e\

**Let's write this to a file. Note the 'wb' mode, which denotes binary writing of the file.**

In [31]:
f = open('alanturing_scraped.jpg','wb')

In [32]:
f.write(image_link.content)

24224

In [33]:
f.close()
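
The open/write/close steps above can also be written with a `with` block, which closes the file automatically. The sketch below uses stand-in bytes and a temporary folder so it runs without a download. In the notebook you would pass `image_link.content` and your own file name instead.

```python
import os
import tempfile

# Stand-in bytes; in the notebook this would be image_link.content
fake_image_bytes = b"\xff\xd8\xff\xdb example bytes"

out_path = os.path.join(tempfile.mkdtemp(), "example_scraped.jpg")

# The with-block closes the file automatically, even if write() raises
with open(out_path, "wb") as f:
    bytes_written = f.write(fake_image_bytes)

# Read the file back in binary mode to confirm the round trip
with open(out_path, "rb") as f:
    round_trip = f.read()

print(bytes_written == len(fake_image_bytes))  # True
print(round_trip == fake_image_bytes)          # True
```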

Now we can display this file right here in the notebook as markdown using:

    <img src='my_new_file_name.jpg'>
    
Just write the above line in a new markdown cell and it will display the image we just downloaded!

<img src='alanturing_scraped.jpg'>

___

### Example Project - Working with Multiple Pages and Items

Let's show a more realistic example of scraping a full site. The website: http://books.toscrape.com/index.html is specifically designed for people to scrape it. Let's try to get the title of every book that has a 2 star rating and at the end just have a Python list with all their titles.

We will do the following:

1. Figure out the URL structure to go through every page
2. Scrape every page in the catalogue
3. Figure out what tag/class represents the Star rating
4. Filter by that star rating using an if statement
5. Store the results to a list

In [34]:
"""
initial = 
https://books.toscrape.com/index.html

https://books.toscrape.com/catalogue/category/books/mystery_3/index.html
https://books.toscrape.com/catalogue/category/books/sports-and-games_17/index.html

whole catalogue:
https://books.toscrape.com/catalogue/page-2.html

1. resource --> 
2. beautifulSoup 
"""


In [35]:
import requests

In [36]:
res = requests.get("https://books.toscrape.com/catalogue/page-1.html")

In [37]:
soup = bs4.BeautifulSoup(res.text, 'lxml')

In [38]:
soup

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:30" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="../static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="../static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="

In [39]:
books = soup.select('.product_pod')

In [40]:
for book in books:
    print(book.select('h3'))

[<h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>]
[<h3><a href="tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>]
[<h3><a href="soumission_998/index.html" title="Soumission">Soumission</a></h3>]
[<h3><a href="sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a></h3>]
[<h3><a href="sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a></h3>]
[<h3><a href="the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a></h3>]
[<h3><a href="the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">The Dirty Little Secrets ...</a></h3>]
[<h3><a href="the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html" title="The Coming Woman: A Novel Based on the Life of the Infamous Femi

In [41]:
books = soup.select('h3')

In [42]:
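for book in soup.select('.product_pod'):
    # The star rating is a <p> inside each .product_pod block, not inside the <h3>,
    # so select from the whole block (book.select('p') on an <h3> returns an empty list)
    print(book.select('p')[0]['class'][1])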

In [43]:
books

[<h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>,
 <h3><a href="tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>,
 <h3><a href="soumission_998/index.html" title="Soumission">Soumission</a></h3>,
 <h3><a href="sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a></h3>,
 <h3><a href="sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a></h3>,
 <h3><a href="the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a></h3>,
 <h3><a href="the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">The Dirty Little Secrets ...</a></h3>,
 <h3><a href="the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html" title="The Coming Woman: A Novel Based on the Life of the Infamous Femi

In [None]:
books[0].getText()

'A Light in the ...'

In [44]:
books[1].getText()

'Tipping the Velvet'

In [46]:
books[3].getText()

'Sharp Objects'

In [47]:
books = soup.select('h3')

In [48]:
ratings = soup.select('p')
ratings

[<p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 <p class="price_color">Â£51.77</p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="star-rating One">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 <p class="price_color">Â£53.74</p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="star-rating One">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 <p class="price_color">Â£50.10</p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="star-rating Four">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <

In [49]:
ratings[0]['class'][1]

'Three'
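
The prices in the output above show up as `Â£51.77` rather than `£51.77`. A likely cause (an assumption, since the response headers are not shown here) is that the page is UTF-8 while `res.text` was decoded as ISO-8859-1. The effect can be reproduced without any network access:

```python
# '£' is two bytes in UTF-8; decoding them as ISO-8859-1 yields two characters
raw = "£51.77".encode("utf-8")

wrong = raw.decode("iso-8859-1")
right = raw.decode("utf-8")

print(wrong)  # Â£51.77
print(right)  # £51.77

# With requests, two ways to get the second result (network access needed):
# res.encoding = "utf-8"                         # then read res.text
# soup = bs4.BeautifulSoup(res.content, "lxml")  # pass bytes and let the parser detect the encoding
```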

#### Look through all pages, take each book, take the name, look at ratings, if 2 stars -> add to list

In [50]:
mapping = {
    'One': 1,
    'Two': 2,
    'Three': 3,
    'Four': 4,
    'Five': 5
}
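
The `mapping` dictionary is not used by the cells that follow. One way it could be applied, sketched here on a cut-down copy of a single book block, is to turn the rating word into a number so that ratings can be compared. The snippet is parsed with `html.parser` so it runs without lxml.

```python
import bs4

mapping = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# A cut-down copy of one book block from the page
book_html = """
<article class="product_pod">
<p class="star-rating Three"></p>
<h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
"""
book = bs4.BeautifulSoup(book_html, "html.parser").select(".product_pod")[0]

# The class attribute is a list such as ['star-rating', 'Three']
rating_word = book.select("p.star-rating")[0]["class"][1]
stars = mapping[rating_word]

print(rating_word)  # Three
print(stars)        # 3
print(stars >= 3)   # True
```

With numeric ratings, a filter such as "three stars or more" becomes a simple comparison instead of a separate selector per rating.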

In [53]:
# base_url = 'https://books.toscrape.com/catalogue/page-{}.html'

books_with_rating_2 = []

# Pages run from 1 to 50; range() stops before requesting a non-existent page 51
for page in range(1, 51):
    print(f'page = {page}')
    current_web = requests.get(f'https://books.toscrape.com/catalogue/page-{page}.html')
    current_soup = bs4.BeautifulSoup(current_web.text, 'lxml')
    books_on_page = current_soup.select('.product_pod')
    for book in books_on_page:
        # getText() on the <h3> returns the shortened link text (ending in '...')
        title = book.select('h3')[0].getText()
        rating = book.select('p')[0]['class'][1]
        print(f'This is the book: {title}')
        print(f'This is the rating: {rating}')
        # select() returns a list; an empty list means this book is not rated Two
        if len(book.select('.star-rating.Two')) != 0:
            print(f'\tMATCH for book: {title}, which is rated with Two stars!')
            books_with_rating_2.append(title)

page = 1
This is the book: A Light in the ...
This is the rating: Three
This is the book: Tipping the Velvet
This is the rating: One
This is the book: Soumission
This is the rating: One
This is the book: Sharp Objects
This is the rating: Four
This is the book: Sapiens: A Brief History ...
This is the rating: Five
This is the book: The Requiem Red
This is the rating: One
This is the book: The Dirty Little Secrets ...
This is the rating: Four
This is the book: The Coming Woman: A ...
This is the rating: Three
This is the book: The Boys in the ...
This is the rating: Four
This is the book: The Black Maria
This is the rating: One
This is the book: Starving Hearts (Triangular Trade ...
This is the rating: Two
	MATCH for book: Starving Hearts (Triangular Trade ..., which is rated with Two stars!
This is the book: Shakespeare's Sonnets
This is the rating: Four
This is the book: Set Me Free
This is the rating: Five
This is the book: Scott Pilgrim's Precious Little ...
This is the rating: Five


In [54]:
books_with_rating_2

['Starving Hearts (Triangular Trade ...',
 'Libertarianism for Beginners',
 "It's Only the Himalayas",
 'How Music Works',
 'Maude (1883-1993):She Grew Up ...',
 "You can't bury them ...",
 'Reasons to Stay Alive',
 'Without Borders (Wanderlove #1)',
 'Soul Reader',
 'Security',
 'Saga, Volume 5 (Saga ...',
 'Reskilling America: Learning to ...',
 'Political Suicide: Missteps, Peccadilloes, ...',
 'Obsidian (Lux #1)',
 'My Paris Kitchen: Recipes ...',
 'Masks and Shadows',
 'Lumberjanes, Vol. 2: Friendship ...',
 'Lumberjanes Vol. 3: A ...',
 'Judo: Seven Steps to ...',
 'I Hate Fairyland, Vol. ...',
 'Giant Days, Vol. 2 ...',
 'Everydata: The Misinformation Hidden ...',
 "Don't Be a Jerk: ...",
 'Bossypants',
 'Bitch Planet, Vol. 1: ...',
 'Avatar: The Last Airbender: ...',
 'Tuesday Nights in 1980',
 'The Psychopath Test: A ...',
 'The Power of Now: ...',
 "The Omnivore's Dilemma: A ...",
 'The Love and Lemons ...',
 'The Girl on the ...',
 'The Emerald Mystery',
 'The Argonauts',
 '

In [55]:
len(books_with_rating_2)

196

We can see that the URL structure is the following:

    http://books.toscrape.com/catalogue/page-1.html

In [177]:
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'

We can then fill in the page number with .format()

In [178]:
res = requests.get(base_url.format('1'))

Now let's grab the products (books) from the get request result:

In [179]:
soup = bs4.BeautifulSoup(res.text,"lxml")

In [180]:
soup.select(".product_pod")

[<article class="product_pod">
 <div class="image_container">
 <a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">Â£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>,
 <article class="product_pod">
 <div class="image_container">
 <a href="tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="thumbnail" src="../media/cach

Now we can see that each book has the product_pod class. We can select any tag with this class, and then further reduce it by its rating.

In [181]:
products = soup.select(".product_pod")

In [182]:
example = products[0]

In [183]:
type(example)

bs4.element.Tag

In [184]:
example.attrs

{'class': ['product_pod']}

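Now by inspecting the site we can see that the class we want is class='star-rating Two'. The space means the element has two classes, star-rating and Two. In a CSS selector, each class is written with a leading dot and the two are chained with no space between them (your browser's inspector also displays it this way), so we want to search for ".star-rating.Two".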

In [185]:
list(example.children)

['\n',
 <div class="image_container">
 <a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>,
 '\n',
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 '\n',
 <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>,
 '\n',
 <div class="product_price">
 <p class="price_color">Â£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>,
 '\n']

In [186]:
example.select('.star-rating.Three')

[<p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>]

But we are looking for 2 stars, so it looks like we can just check whether anything was returned:

In [187]:
example.select('.star-rating.Two')

[]

Alternatively, we can just quickly check the text string to see if "star-rating Two" is in it. Either approach is fine (there are also many other alternative approaches!)

Now let's see how we can get the title if we have a 2-star match:

In [188]:
example.select('a')

[<a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>,
 <a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>]

In [189]:
example.select('a')[1]

<a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>

In [190]:
example.select('a')[1]['title']

'A Light in the Attic'

Okay, let's give it a shot by combining all the ideas we've talked about! This should take about 20-60 seconds to complete running. Be aware that a firewall may prevent this script from running. Also, if you are getting a no-response error, try adding a sleep step with time.sleep(1).

In [191]:
two_star_titles = []

for n in range(1,51):

    scrape_url = base_url.format(n)
    res = requests.get(scrape_url)
    
    soup = bs4.BeautifulSoup(res.text,"lxml")
    books = soup.select(".product_pod")
    
    for book in books:
        if len(book.select('.star-rating.Two')) != 0:
            two_star_titles.append(book.select('a')[1]['title'])
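
The same loop can be wrapped in a function that pauses between pages, as suggested by the time.sleep(1) tip. This is a sketch, not the course's own solution: `collect_two_star_titles` and `fake_fetch` are hypothetical names, and the page download is passed in as a function so the example can run offline on made-up HTML.

```python
import time
import bs4

def collect_two_star_titles(fetch_page, pages, delay=1.0):
    """Return titles of two-star books; fetch_page(n) must return the HTML of page n."""
    titles = []
    for n in pages:
        soup = bs4.BeautifulSoup(fetch_page(n), "html.parser")
        for book in soup.select(".product_pod"):
            # An empty list is falsy, so this is True only for two-star books
            if book.select(".star-rating.Two"):
                titles.append(book.select("a")[1]["title"])
        time.sleep(delay)  # pause between requests
    return titles

# Offline stand-in for requests.get(base_url.format(n)).text, with made-up books
def fake_fetch(n):
    return f"""
    <article class="product_pod">
      <a href="x.html"><img alt="cover"/></a>
      <p class="star-rating Two"></p>
      <h3><a href="x.html" title="Two-Star Book {n}">Two-Star ...</a></h3>
    </article>
    <article class="product_pod">
      <a href="y.html"><img alt="cover"/></a>
      <p class="star-rating Five"></p>
      <h3><a href="y.html" title="Five-Star Book {n}">Five-Star ...</a></h3>
    </article>
    """

demo_titles = collect_two_star_titles(fake_fetch, range(1, 3), delay=0)
print(demo_titles)  # ['Two-Star Book 1', 'Two-Star Book 2']
```

For the real site, `fetch_page` could be `lambda n: requests.get(base_url.format(n)).text`, with `pages=range(1, 51)` and the default one-second delay.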

In [192]:
two_star_titles

['Starving Hearts (Triangular Trade Trilogy, #1)',
 'Libertarianism for Beginners',
 "It's Only the Himalayas",
 'How Music Works',
 'Maude (1883-1993):She Grew Up with the country',
 "You can't bury them all: Poems",
 'Reasons to Stay Alive',
 'Without Borders (Wanderlove #1)',
 'Soul Reader',
 'Security',
 'Saga, Volume 5 (Saga (Collected Editions) #5)',
 'Reskilling America: Learning to Labor in the Twenty-First Century',
 'Political Suicide: Missteps, Peccadilloes, Bad Calls, Backroom Hijinx, Sordid Pasts, Rotten Breaks, and Just Plain Dumb Mistakes in the Annals of American Politics',
 'Obsidian (Lux #1)',
 'My Paris Kitchen: Recipes and Stories',
 'Masks and Shadows',
 'Lumberjanes, Vol. 2: Friendship to the Max (Lumberjanes #5-8)',
 'Lumberjanes Vol. 3: A Terrible Plan (Lumberjanes #9-12)',
 'Judo: Seven Steps to Black Belt (an Introductory Guide for Beginners)',
 'I Hate Fairyland, Vol. 1: Madly Ever After (I Hate Fairyland (Compilations) #1-5)',
 'Giant Days, Vol. 2 (Giant Day

**Excellent! You should now have the tools necessary to scrape any websites that interest you! Keep in mind, the more complex the website, the harder it will be to scrape. Always ask for permission!**

In [193]:
len(two_star_titles)

196