In [2]:
import requests

In [3]:
import bs4

Web scraping is a general term for techniques involving automating the gathering of data from a website.

As you can imagine, websites often have information that you want to use for some other project.
However, manually going in and copying and pasting that information yourself would take
too long to be practical.

First, the rules of web scraping.

You should always try to get permission before scraping.
If you make too many scraping attempts or requests, then it's possible that your IP address could get
blocked, which means you won't even be able to visit that website in a normal browser.
Now, if you're dealing with a website that gets millions of visitors, something like Wikipedia, and
you're just going to scrape it a few times to get some information, then really that's no problem because
that website deals with much, much higher traffic.

But keep in mind that some sites automatically block scraping software.
So you should always check the permissions, rules, or guidelines page of whatever website
you're going to be attempting to scrape, and try to get permission before scraping.

You should also check the laws of whatever country you're operating in to see whether web scraping
is legal there.
Typically it's okay, but again, you should always consult the website's rules to make sure.
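One common place where sites publish crawling rules is a robots.txt file. That file is not mentioned above, so treat it as an assumption about where the rules live. A minimal sketch of checking such rules with the standard library, using a made-up robots.txt inlined as a string so it runs offline:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "http://example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "http://example.com/private/data.html"))
```

A robots.txt file is only one part of a site's rules, so the terms of use are still worth reading.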

Second, the limitations of web scraping.

In general, every website is unique, which unfortunately means every web scraping script is
unique.
So you can't just take one web scraping script in Python and easily apply it to any other
website in the world.
Since every website's HTML is specific to that website, you will more than likely
have to adjust your script to fit other websites.

Keep in mind that any slight change or update to a website may completely break your web scraping
script.
Scripts are in general pretty static and can't adjust to changes in the website.
This is a little annoying about web scraping: if you plan a long-term web scraping project,
you will more than likely have to adjust the script over time.
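One way to notice a broken script early is to fail with a clear message when a selector stops matching. The sketch below uses the select method covered later in this notebook. The helper name select_one_or_fail and the inline HTML are hypothetical, and it uses Python's built-in "html.parser" so it runs without lxml installed.

```python
import bs4


def select_one_or_fail(soup, selector):
    """Return the first element matching selector, or raise a descriptive error."""
    matches = soup.select(selector)
    if not matches:
        raise LookupError(
            f"No element matches {selector!r}; the page layout may have changed."
        )
    return matches[0]


# Stand-in page, inlined so the sketch runs offline.
html = "<html><head><title>Example Domain</title></head><body><h1>Example Domain</h1></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

print(select_one_or_fail(soup, "h1").getText())

try:
    select_one_or_fail(soup, ".price")  # this class does not exist on the page
except LookupError as err:
    print(err)
```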

Now, to web scrape with Python we can use the Beautiful Soup and requests libraries. These are external
libraries outside of base Python.

http://example.com/

In [4]:
result = requests.get("http://example.com/")

In [5]:
type(result)

requests.models.Response

In [6]:
result.text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <

So essentially what happened here is that the requests library we installed goes and gets a response
from example.com, and we can then read result.text, which is an attribute.

So if I run this, notice that it's the HTML document. If I open example.com
and choose view page source, I see essentially this same information, here stored as one giant Python string.
This is nice because anything I can do with a Python string, I can now do with this large text.
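As a small illustration of treating the page as an ordinary string, the sketch below runs a few standard string operations on a shortened stand-in for result.text. The variable name html_text and the shortened markup are assumptions made so it runs offline.

```python
# A shortened stand-in for result.text, inlined so no request is needed.
html_text = """<!doctype html>
<html>
<head>
    <title>Example Domain</title>
</head>
<body>
<div>
    <h1>Example Domain</h1>
</div>
</body>
</html>"""

# Membership test and counting work like on any other string.
print("Example Domain" in html_text)
print(html_text.count("Example Domain"))

# Manual slicing between two markers; workable, but fragile on real pages.
start = html_text.find("<title>") + len("<title>")
end = html_text.find("</title>")
print(html_text[start:end])
```

Slicing by hand like this gets brittle quickly, which is the motivation for a real parser.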

However, this is just a string.
To parse it, we need to use Beautiful Soup.
The Beautiful Soup library, which we imported as bs4, lets us easily grab
information from this document by ID, class name, or HTML tag.

So right now we just have this giant string and we're going to convert it into a soup.

In [7]:
soup = bs4.BeautifulSoup(result.text, "lxml")

In [8]:
soup

<!DOCTYPE html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples

We pass in two things here. First, the text string result.text.

Second, a string naming which engine to use to parse this HTML text.
Here it's lxml.
That's why we had to pip install lxml: Beautiful Soup uses it on the back end to
go through the HTML document and figure out which are the CSS classes, which are the CSS IDs,
which are the different HTML elements and tags, and so on.

So if I run this and then call soup, you'll notice that soup has made this much easier to
read, and it now looks essentially the same as the page source.

So essentially we went from kind of this raw string to this soup object, and now the soup object is
smart enough to be able to grab things based off their tags or elements.

In [9]:
soup.select('title')

[<title>Example Domain</title>]

So basically select looks through this document and figures out where the
title tags are.
Notice that it returns a list, because technically there could be more than one matching tag
or element on the page, especially for really complicated pages.
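A short sketch of the list behavior, on a made-up two-paragraph page inlined as a string. It uses Python's built-in "html.parser" instead of lxml so it runs without extra installs.

```python
import bs4

# Hypothetical page with two <p> elements and no <table>.
html = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

paragraphs = soup.select("p")
print(len(paragraphs))         # one list entry per matching element
print(paragraphs[0].getText())

# Collect the text of every match in one pass.
texts = [p.getText() for p in paragraphs]
print(texts)

# A selector that matches nothing gives an empty list rather than an error.
print(soup.select("table"))
```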

In [10]:
soup.select('h1')

[<h1>Example Domain</h1>]

In [11]:
soup.select('title')[0].getText()

'Example Domain'

In [16]:
site_para = soup.select('p')[0].getText()

In [17]:
site_para

'This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.'

In [20]:
type(site_para)

str

In [21]:
soup.select('p')[1].getText()

'More information...'

<table>
    <tr>
        <th>syntax</th>
        <th>match results</th>
    </tr>
    <tr>
        <td>soup.select('div')</td>
        <td>all elements with div tag</td>
    </tr>
    <tr>
        <td>soup.select('#some_id')</td>
        <td>elements whose id is 'some_id'</td>
    </tr>
    <tr>
        <td>soup.select('.some_class')</td>
        <td>elements whose class includes 'some_class'</td>
    </tr>
    <tr>
        <td>soup.select('div span')</td>
        <td>all span elements nested inside a div element</td>
    </tr>
</table>
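The four patterns in the table can be tried on a tiny document. The markup, the id main, and the class names below are made up for illustration, and the built-in "html.parser" is used so the sketch runs without lxml.

```python
import bs4

# Hypothetical markup: one div containing a span and a p, plus a span outside it.
html = """
<div id="main">
  <span class="label">inside div</span>
  <p class="label note">paragraph</p>
</div>
<span>outside div</span>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

divs = soup.select("div")        # by tag name
by_id = soup.select("#main")     # by id
by_class = soup.select(".label") # by class (an element may carry several classes)
nested = soup.select("div span") # span elements somewhere inside a div

print(len(divs), by_id[0].name, len(by_class))
print([s.getText() for s in nested])
```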

In [23]:
res = requests.get('https://en.wikipedia.org/wiki/Taylor_Swift')

In [24]:
soup = bs4.BeautifulSoup(res.text, 'lxml')

In [26]:
# soup

In [32]:
first_item = soup.select('.toclevel-1')[2]

In [33]:
first_item

<li class="toclevel-1 tocsection-15"><a href="#Other_activities"><span class="tocnumber">3</span> <span class="toctext">Other activities</span></a>
<ul>
<li class="toclevel-2 tocsection-16"><a href="#Philanthropy"><span class="tocnumber">3.1</span> <span class="toctext">Philanthropy</span></a></li>
<li class="toclevel-2 tocsection-17"><a href="#Politics_and_activism"><span class="tocnumber">3.2</span> <span class="toctext">Politics and activism</span></a></li>
<li class="toclevel-2 tocsection-18"><a href="#Endorsements"><span class="tocnumber">3.3</span> <span class="toctext">Endorsements</span></a></li>
</ul>
</li>

In [34]:
first_item.text

'3 Other activities\n\n3.1 Philanthropy\n3.2 Politics and activism\n3.3 Endorsements\n\n'

In [35]:
for item in soup.select('.toclevel-1'):
    print(item.text)

1 Life and career

1.1 1989–2003: Early life and education
1.2 2004–2008: Career beginnings and first album
1.3 2008–2010: Fearless and acting debut
1.4 2010–2014: Speak Now and Red
1.5 2014–2018: 1989 and Reputation
1.6 2018–2020: Lover, Folklore and Evermore
1.7 2021–present: Re-recordings and Midnights


2 Artistry

2.1 Influences
2.2 Musical styles
2.3 Voice
2.4 Songwriting
2.5 Video and film


3 Other activities

3.1 Philanthropy
3.2 Politics and activism
3.3 Endorsements


4 Public image
5 Impact
6 Accolades and achievements

6.1 Wealth


7 Discography

7.1 Studio albums
7.2 Re-recordings


8 Filmography
9 Tours
10 See also
11 Footnotes
12 References
13 Cited literature
14 External links


In [36]:
soup.select('.thumbimage')

[<img alt="" class="thumbimage" data-file-height="3456" data-file-width="4608" decoding="async" height="165" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/09/Grandview_Blvd_76%2C_Wyomissing_PA.JPG/220px-Grandview_Blvd_76%2C_Wyomissing_PA.JPG" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/09/Grandview_Blvd_76%2C_Wyomissing_PA.JPG/330px-Grandview_Blvd_76%2C_Wyomissing_PA.JPG 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/09/Grandview_Blvd_76%2C_Wyomissing_PA.JPG/440px-Grandview_Blvd_76%2C_Wyomissing_PA.JPG 2x" width="220"/>,
 <img alt="Taylor Swift singing on a microphone and playing a guitar" class="thumbimage" data-file-height="546" data-file-width="819" decoding="async" height="173" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Swift%2C_Taylor_%282007%29.jpg/260px-Swift%2C_Taylor_%282007%29.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Swift%2C_Taylor_%282007%29.jpg/390px-Swift%2C_Taylor_%282007%29.jpg 1.5x, //upload.wikimedia.o

In [37]:
soup.select('img')

[<img alt="Featured article" data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/>,
 <img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png 2x" width="20"/>,
 <img alt="Medium shot of Swift in a dress with little sequins, in front of AMA backdrop" data-file-hei

In [40]:
taylor = soup.select('.thumbimage')[1]

In [41]:
taylor

<img alt="Taylor Swift singing on a microphone and playing a guitar" class="thumbimage" data-file-height="546" data-file-width="819" decoding="async" height="173" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Swift%2C_Taylor_%282007%29.jpg/260px-Swift%2C_Taylor_%282007%29.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Swift%2C_Taylor_%282007%29.jpg/390px-Swift%2C_Taylor_%282007%29.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Swift%2C_Taylor_%282007%29.jpg/520px-Swift%2C_Taylor_%282007%29.jpg 2x" width="260"/>

In [42]:
type(taylor)

bs4.element.Tag

In [43]:
taylor['src']

'//upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Swift%2C_Taylor_%282007%29.jpg/260px-Swift%2C_Taylor_%282007%29.jpg'

<img src = '//upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Swift%2C_Taylor_%282007%29.jpg/260px-Swift%2C_Taylor_%282007%29.jpg'>
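The src value above starts with // and has no scheme, so it has to be turned into a full URL before requests can fetch it. One way to do that, sketched here with the standard library's urljoin, is to resolve it against the address of the page it came from. The variable names are hypothetical.

```python
from urllib.parse import urljoin

# The page the image tag was scraped from.
page_url = "https://en.wikipedia.org/wiki/Taylor_Swift"

# The scheme-less value read from taylor['src'].
src = "//upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Swift%2C_Taylor_%282007%29.jpg/260px-Swift%2C_Taylor_%282007%29.jpg"

# urljoin borrows the scheme (https) from the page URL.
image_url = urljoin(page_url, src)
print(image_url)
```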

In [44]:
image_link = requests.get('https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Swift%2C_Taylor_%282007%29.jpg/260px-Swift%2C_Taylor_%282007%29.jpg')

In [45]:
image_link.content

b'\xff\xd8\xff\xdb\x00C\x00\x04\x03\x03\x04\x03\x03\x04\x04\x03\x04\x05\x04\x04\x05\x06\n\x07\x06\x06\x06\x06\r\t\n\x08\n\x0f\r\x10\x10\x0f\r\x0f\x0e\x11\x13\x18\x14\x11\x12\x17\x12\x0e\x0f\x15\x1c\x15\x17\x19\x19\x1b\x1b\x1b\x10\x14\x1d\x1f\x1d\x1a\x1f\x18\x1a\x1b\x1a\xff\xdb\x00C\x01\x04\x05\x05\x06\x05\x06\x0c\x07\x07\x0c\x1a\x11\x0f\x11\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\xff\xc0\x00\x11\x08\x00\xad\x01\x04\x03\x01"\x00\x02\x11\x01\x03\x11\x01\xff\xc4\x00\x1d\x00\x00\x01\x05\x00\x03\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x05\x06\x07\x08\x02\x04\t\x03\xff\xc4\x00@\x10\x00\x01\x03\x03\x03\x02\x04\x05\x02\x03\x05\x06\x07\x01\x00\x00\x01\x02\x03\x04\x00\x05\x11\x06\x12!\x071\x13"AQ\x08\x142aq\x15\x81#B\x91\x16$Rr\xa13b\x82\xb1\xc1\xf0\x17\x18%&4Cc\xf1\xff\xc4\x00\x1b\x01\x00\x02\x03\x01\x01\x01\x00\x00\x00\x00\x

This is the raw content of the actual image.
It's binary data, the computer's internal representation of what the image
looks like, so it isn't readable by a human.

However, Python can read and write this data and save the image onto your
computer.

We do this by opening a new file, writing to it, and then closing it.
Recall that image_link.content has the information I need, and now I want to save it onto
my computer in the form of an image.

In [46]:
f = open('taylor-img.jpg','wb')

The second argument to open is the mode, which sets the reading or writing permission.

Here it's 'wb', which means write binary. We need binary writing because this isn't a typical
string or text file.
It's a binary representation of the image.

In [47]:
f.write(image_link.content)

13752

In [48]:
f.close()

If I open this file, notice that the image was downloaded and is now written on our computer.
So that's an example of downloading images.
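The open, write, close sequence can also be written with a with block, which closes the file automatically even if the write fails. This is a sketch rather than what the cells above do: the bytes are a short stand-in for image_link.content, and the file goes to a temporary directory so nothing is overwritten.

```python
import os
import tempfile

# Stand-in bytes; in the notebook this would be image_link.content.
image_bytes = b"\xff\xd8\xff\xdb\x00C\x00\x04"

# Hypothetical output location inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "taylor-img.jpg")

# 'wb' opens the file for binary writing; the file is closed when the block ends.
with open(out_path, "wb") as f:
    written = f.write(image_bytes)

# Read the file back in binary mode to confirm the bytes were saved unchanged.
with open(out_path, "rb") as f:
    round_trip = f.read()

print(written, round_trip == image_bytes)
```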

Basically we've already seen how to grab elements just one at a time, such as a couple of images off
a single Wikipedia article.
But realistically, we want to be able to grab multiple elements and most likely across multiple pages
of a website.
And this is where we can combine our prior Python knowledge with the web scraping libraries to create
really powerful scripts that are heavily customized to whatever task we're trying to achieve.

We're going to be using a website specifically designed to practice web scraping called www.toscrape.com

In [49]:
# GOAL: get the title of every book with a two star rating.

In [50]:
base_url = 'https://web.archive.org/web/20220907003339/https://books.toscrape.com/catalogue/page-{}.html'

In [51]:
base_url.format(20)

'https://web.archive.org/web/20220907003339/https://books.toscrape.com/catalogue/page-20.html'

In [52]:
base_url.format(2)

'https://web.archive.org/web/20220907003339/https://books.toscrape.com/catalogue/page-2.html'

Notice that the only difference is the page number in the URL.

In [53]:
page_num = 12
base_url.format(page_num)

'https://web.archive.org/web/20220907003339/https://books.toscrape.com/catalogue/page-12.html'
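Since only the page number changes, the full list of page URLs can be built up front. A minimal sketch, assuming the catalogue has 50 pages (the range used in the scraping loop later in this notebook); no request is made here.

```python
base_url = 'https://web.archive.org/web/20220907003339/https://books.toscrape.com/catalogue/page-{}.html'

# One URL per catalogue page, numbered 1 through 50.
page_urls = [base_url.format(n) for n in range(1, 51)]

print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```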

In [54]:
res = requests.get(base_url.format(1))

In [59]:
soup = bs4.BeautifulSoup(res.text, 'lxml')

In [60]:
soup.select('.product_pod')

[<article class="product_pod">
 <div class="image_container">
 <a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="/web/20220907003336im_/https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>,
 <article class="product_pod">
 <div class="image_container">
 <a href="tipping-the-velvet_999/index.html"><img alt="Tipping t

In [61]:
len(soup.select('.product_pod'))

20

In [62]:
products = soup.select('.product_pod')

In [63]:
example = products[0]

In [64]:
'star-rating Two' in str(example)

False

In [65]:
'star-rating Three' in str(example)

True

In [66]:
example.select('.star-rating.Three')

[<p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>]

In [67]:
[] == example.select('.star-rating.Two') # empty list, so this book is not rated two stars

True

In [70]:
example.select('a')[1]['title']

'A Light in the Attic'
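Instead of testing one rating at a time, the rating word can be read directly from the class attribute, which Beautiful Soup returns as a list of class names. The sketch below uses a trimmed, inlined copy of the first product so it runs offline, with the built-in "html.parser" in place of lxml. The selector 'h3 a' is an alternative to indexing select('a')[1], not what the cells above use.

```python
import bs4

# Trimmed stand-in for one product_pod from the catalogue page.
html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
"""
soup = bs4.BeautifulSoup(html, "html.parser")
book = soup.select(".product_pod")[0]

# The class attribute comes back as a list of the individual class names.
rating_classes = book.select(".star-rating")[0]["class"]

# Whatever is left after removing 'star-rating' is the rating word.
rating_word = [c for c in rating_classes if c != "star-rating"][0]

title = book.select("h3 a")[0]["title"]
print(rating_classes, rating_word, title)
```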

In [None]:
two_star_titles = []

for n in range(1,51):
    
    scrape_url = base_url.format(n)
    res = requests.get(scrape_url)
    
    soup = bs4.BeautifulSoup(res.text, 'lxml')
    books = soup.select(".product_pod")
    
    for book in books:
        
        if len(book.select('.star-rating.Two')) != 0:
            
            book_title = book.select('a')[1]['title']
            two_star_titles.append(book_title)

In [None]:
two_star_titles

This should display the titles of all the books with two-star ratings.
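The same loop can be wrapped in a function that takes the page-fetching step as a parameter, which makes it possible to try the logic on fake pages and to add a pause between requests, in line with the advice earlier about not making too many requests. This is a sketch under assumptions: the names two_star_titles_from, fetch_html, fake_fetch, and delay_seconds are hypothetical, the fake pages are made up, and "html.parser" is used so it runs without lxml.

```python
import time

import bs4


def two_star_titles_from(fetch_html, page_numbers, delay_seconds=0.0):
    """Collect titles of two-star books from each page returned by fetch_html."""
    titles = []
    for n in page_numbers:
        soup = bs4.BeautifulSoup(fetch_html(n), "html.parser")
        for book in soup.select(".product_pod"):
            # A non-empty result means this book carries the two-star class.
            if book.select(".star-rating.Two"):
                titles.append(book.select("a")[1]["title"])
        time.sleep(delay_seconds)  # pause between pages to limit request rate
    return titles


# Made-up page template with one two-star book and one five-star book.
FAKE_PAGE = """
<article class="product_pod">
  <div class="image_container"><a href="book-a/index.html"><img alt="Book A"/></a></div>
  <p class="star-rating Two"></p>
  <h3><a href="book-a/index.html" title="Book A (page {n})">Book A</a></h3>
</article>
<article class="product_pod">
  <div class="image_container"><a href="book-b/index.html"><img alt="Book B"/></a></div>
  <p class="star-rating Five"></p>
  <h3><a href="book-b/index.html" title="Book B (page {n})">Book B</a></h3>
</article>
"""


def fake_fetch(n):
    """Return the fake HTML for page n instead of making a real request."""
    return FAKE_PAGE.format(n=n)


titles = two_star_titles_from(fake_fetch, range(1, 3))
print(titles)
```

Against the real site, fetch_html could be a small function that returns requests.get(base_url.format(n)).text, with delay_seconds set to a second or so.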