# Building Your First Web Scraper

In [2]:
from urllib.request import urlopen

# URL of the page to fetch
page_url = 'https://pythonscraping.com/pages/page1.html'

# open the page
html = urlopen(page_url)

# Print the byte string
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


> `urllib` is a standard Python library (meaning you don’t have to install anything extra to run this example) and contains functions for requesting data across the web, handling cookies, and even changing metadata such as headers and your user agent.
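
As a minimal sketch of the header handling mentioned above, a `Request` object can carry a custom user agent. The agent string `my-scraper/0.1` and the helper name `build_request` are placeholders chosen for illustration; they do not come from the original example.

```python
from urllib.request import Request

# Hypothetical user agent string; substitute one that identifies your scraper.
USER_AGENT = 'my-scraper/0.1'


def build_request(url: str) -> Request:
    """Return a Request for `url` that sends a custom User-Agent header."""
    return Request(url, headers={'User-Agent': USER_AGENT})


request = build_request('https://pythonscraping.com/pages/page1.html')

# urllib stores header names capitalized, so the key is 'User-agent'.
print(request.get_header('User-agent'))

# To fetch the page, pass the Request object to urllib.request.urlopen
# instead of passing the bare URL string.
```

Building the request is separate from sending it, so the headers can be inspected before any network traffic happens.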

## The BeautifulSoup Library



In [3]:
from bs4 import BeautifulSoup

html = urlopen(page_url)

# Get the html content of the page
bs = BeautifulSoup(html.read(), 'html.parser')  # You can also use lxml or html5lib parser for messy HTML (needs to be installed)

print(bs.h1)

<h1>An Interesting Title</h1>


Note that this returns only the first instance of the h1 tag found on the page. By convention,
only one h1 tag should be used on a single page, but conventions are often
broken on the web, so you should be aware that this will retrieve the first instance of
the tag only, and not necessarily the one that you’re looking for.
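
To illustrate the first-instance behavior without touching the network, here is a small sketch on invented markup with two h1 tags. The names `sample_html` and `bs_sample` are assumptions for this example only, and `find_all` is covered later in this section.

```python
from bs4 import BeautifulSoup

# Made-up markup with two h1 tags, for illustration only.
sample_html = (
    '<html><body>'
    '<h1>First Title</h1><p>Text</p><h1>Second Title</h1>'
    '</body></html>'
)

bs_sample = BeautifulSoup(sample_html, 'html.parser')

# Attribute access returns only the first matching tag.
print(bs_sample.h1)

# find_all returns every match, in document order.
all_titles = [h1.get_text() for h1 in bs_sample.find_all('h1')]
print(all_titles)
```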

## Connecting Reliably and Handling Exceptions


One of the most frustrating experiences in web scraping is to go to sleep
with a scraper running, dreaming of all the data you’ll have in your database the next
day—only to find that the scraper hit an error on some unexpected data format and
stopped execution shortly after you stopped looking at the screen.


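Two kinds of error can occur when `urlopen` requests a page:
- `HTTPError`: the server was reached, but it returned an error status (for example, 404 Page Not Found or 500 Internal Server Error).
- `URLError`: the server could not be reached at all (for example, the site is down or the URL is mistyped).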
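
The two error types can be told apart offline with a small sketch. The helper `describe_error` and the example URL are hypothetical, and the exceptions are constructed by hand rather than raised by a real request.

```python
from urllib.error import HTTPError, URLError


def describe_error(error: Exception) -> str:
    """Classify an exception the way a scraper's except clauses would."""
    try:
        raise error
    except HTTPError as e:
        # The server responded, but with an error status such as 404 or 500.
        return f'HTTP error {e.code}'
    except URLError as e:
        # No usable response: the server could not be reached.
        return f'URL error: {e.reason}'


# Built by hand so the sketch runs without a network connection.
not_found = HTTPError('https://example.com/missing', 404, 'Not Found', None, None)
unreachable = URLError('server not found')

print(describe_error(not_found))
print(describe_error(unreachable))

# HTTPError is a subclass of URLError, so it has to be caught first.
print(issubclass(HTTPError, URLError))
```

Because of the subclass relationship, an `except URLError` clause placed first would also swallow every `HTTPError`.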

In [4]:
from urllib.error import HTTPError
from urllib.error import URLError


def get_title(url: str):
    """Get the title from a URL

    Args:
        url (string): the url string

    Returns:
        returns None or the title
    """
    try:
        html = urlopen(url)

    # Catch the HTTP Error 
    except HTTPError as e:
        return None
    # Catch the URL Error
    except URLError as e:
        return None
    
    else:
        # Program continues
        # Pass bs and reader
        try:
            bs = BeautifulSoup(html.read(), 'html.parser')
            # bs.body is None when the page has no body tag, so .h1 raises AttributeError
            title = bs.body.h1
            
        except AttributeError as e:
            return None
        
        else:
            return title
        
        
        
title = get_title(url=page_url)

if title is None:
    print('Title could not be found')
    
else:
    print(title)

<h1>An Interesting Title</h1>


In [5]:
page_url = 'https://www.pythonscraping.com/pages/warandpeace.html'


html = urlopen(page_url)


bs = BeautifulSoup(html.read(), 'html.parser')

Using this BeautifulSoup object, you can use the **find_all** function to extract a Python list of proper nouns found by selecting only the text within `<span class="green"></span>` tags:

In [6]:
name_list = bs.find_all('span', {'class': 'green'})


for name in name_list:
    # get_text() drops the tags and returns only the text content
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


In [7]:
# Select span tags whose class is either green or red
name_list = bs.find_all('span', {'class': ['green', 'red']})


for name in name_list:
    # get_text() drops the tags and returns only the text content
    print(name.get_text())

Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
If you have nothing better to do, Count [or Prince], and if the
prospect of spending an evening with a poor invalid is not too
terrible, I shall be very charmed to see you tonight between 7 and 10-
Annette Scherer.
Heavens! what a virulent attack!
the prince
Anna Pavlovna
First of all, dear friend, tell me how you are. Set your friend's
mind at rest,
Can one be well while suffering morally? Can one be calm in times
like these if one

BeautifulSoup’s **find()** and **find_all()** are the two functions you will likely use the most. With them, you can easily filter HTML pages to find lists of desired tags, or a single tag, based on their various attributes.

> find_all(tag, attributes, recursive, text, limit, keywords)

> find(tag, attributes, recursive, text, keywords)

In all likelihood, 95% of the time you will need to use only the first two arguments:
tag and attributes.

- With the tag argument you can pass a string name of a tag
or even a Python list of string tag names

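- The attributes argument takes a Python dictionary of attributes and matches tags whose attributes match that dictionary. A list of values for a single attribute, such as `{'class': ['green', 'red']}`, matches tags that carry any one of those values.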
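
The two arguments can be tried out on a small invented document. The markup and the names `sample_html` and `bs_sample` are assumptions for illustration and are not taken from the example site.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
sample_html = (
    '<h1>Title</h1>'
    '<h2 class="sub">Subtitle</h2>'
    '<span class="green" id="a">Anna</span>'
    '<span class="red">Well, Prince</span>'
    '<span class="green">Moscow</span>'
)
bs_sample = BeautifulSoup(sample_html, 'html.parser')

# A list of tag names matches any tag in the list.
headings = bs_sample.find_all(['h1', 'h2'])
print([tag.name for tag in headings])

# A list of values for one attribute matches any of those values.
spans = bs_sample.find_all('span', {'class': ['green', 'red']})
print(len(spans))

# Several keys in the dictionary must all match the same tag.
specific = bs_sample.find_all('span', {'class': 'green', 'id': 'a'})
print([tag.get_text() for tag in specific])
```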

The **text** argument (newer BeautifulSoup releases call it `string` and keep `text` as an older alias) is unusual in that it matches based on the text content of the tags, rather than properties of the tags themselves. For instance, to count how many times “the prince” appears as the complete text of a tag on the example page, you could replace the .find_all() call in the previous example with the following lines:

In [8]:
prince_list = bs.find_all(text='the prince')

print(len(prince_list))

7
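
The same kind of count can be sketched offline with the `string` keyword, the name newer BeautifulSoup releases use for this argument. The markup below is invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
sample_html = (
    '<p><span class="green">the prince</span> greeted '
    '<span class="green">Anna Pavlovna</span>, and '
    '<span class="green">the prince</span> sat down.</p>'
)
bs_sample = BeautifulSoup(sample_html, 'html.parser')

# Only text nodes that are exactly 'the prince' match.
matches = bs_sample.find_all(string='the prince')
print(len(matches))

# Each match is the text node itself; .parent gives the enclosing tag.
print(matches[0].parent.name)
```

The result is a list of text nodes rather than tags, which is why `.parent` is needed to get back to the surrounding span.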


### Navigating Trees

In the **BeautifulSoup library**, as well as many other libraries, there is a distinction
drawn between children and descendants: much like in a human family tree, children
are always exactly one tag below a parent, whereas descendants can be at any level in
the tree below a parent. 

For example, the **tr** tags are children of the table tag,
whereas **tr, th, td, img, and span** are all descendants of the table tag (at least in our
example page). All children are descendants, but not all descendants are children.


In general, BeautifulSoup functions always deal with the descendants of the current
tag selected. 

For instance, **bs.body.h1** selects the first h1 tag that is a descendant of the body tag. It will not find tags located outside the body.


Similarly, **bs.div.find_all('img')** will find the first div tag in the document, and then retrieve a list of all img tags that are descendants of that div tag.
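
The difference between children and descendants can be seen on a small invented table by toggling the `recursive` argument of `find_all`. The markup and the `demo` id are assumptions for this sketch.

```python
from bs4 import BeautifulSoup

# Made-up table with no whitespace between tags, so no stray text nodes appear.
sample_html = (
    '<table id="demo">'
    '<tr><th>Item</th></tr>'
    '<tr><td>Basket <span>New!</span></td></tr>'
    '</table>'
)
table = BeautifulSoup(sample_html, 'html.parser').find('table', {'id': 'demo'})

# recursive=False restricts the search to direct children.
child_tags = [tag.name for tag in table.find_all(True, recursive=False)]
print(child_tags)

# The default (recursive=True) walks every descendant tag.
descendant_tags = [tag.name for tag in table.find_all(True)]
print(descendant_tags)
```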

In [9]:
# Children: exactly one tag below the parent

page_url = 'https://www.pythonscraping.com/pages/page3.html'


html = urlopen(page_url)
bs = BeautifulSoup(html.read(), 'html.parser')


for child in bs.find('table', {'id': 'giftList'}).children:
    print(child)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


> This code prints the list of product rows in the giftList table, including the initial row of column labels.

In [10]:
# Descendants: all tags below the parent

page_url = 'https://www.pythonscraping.com/pages/page3.html'


html = urlopen(page_url)
bs = BeautifulSoup(html.read(), 'html.parser')


for desc in bs.find('table', {'id': 'giftList'}).descendants:
    print(desc)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
<th>
Item Title
</th>

Item Title

<th>
Description
</th>

Description

<th>
Cost
</th>

Cost

<th>
Image
</th>

Image



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
<td>
Vegetable Basket
</td>

Vegetable Basket

<td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td>

This vegetable basket is the perfect gift for your health conscious (or overweight) friends!

<span class="excitingNote">Now with super-colorful bell peppers!</span>
Now with super-colorful bell peppers!


<td>
$15.00
</td>

$15.00

<td>
<img src="../img/gifts/img1.jpg"

## Dealing with Siblings


The BeautifulSoup next_siblings attribute makes it trivial to collect data from
tables, especially ones with title rows:

In [11]:
page_url = 'https://www.pythonscraping.com/pages/page3.html'


html = urlopen(page_url)
bs = BeautifulSoup(html.read(), 'html.parser')


for sibling in bs.find('table', {'id': 'giftList'}).tr.next_siblings:
    print(sibling)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

> This code prints every product row in the table except the first title row.
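
The blank lines in the output come from whitespace text nodes between the rows, which count as siblings too. As a hedged sketch on invented markup, `find_next_siblings('tr')` returns only the tags:

```python
from bs4 import BeautifulSoup

# Made-up table for illustration; the newlines between rows are deliberate.
sample_html = """<table id="demo">
<tr><th>Item</th></tr>
<tr id="row1"><td>Basket</td></tr>
<tr id="row2"><td>Dolls</td></tr>
</table>"""
soup = BeautifulSoup(sample_html, 'html.parser')
header_row = soup.find('table', {'id': 'demo'}).tr

# next_siblings yields text nodes (here, newlines) as well as tags.
all_siblings = list(header_row.next_siblings)
print(len(all_siblings))

# find_next_siblings('tr') keeps only the tr tags after the header row.
data_rows = header_row.find_next_siblings('tr')
print([row['id'] for row in data_rows])
```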

## Dealing with Parents


When scraping pages, you will likely discover that you need to find parents of tags
less frequently than you need to find their children or siblings. Typically, when you
look at HTML pages with the goal of crawling them, you start by looking at the top
layer of tags, and then figure out how to drill your way down into the exact piece of
data that you want. Occasionally, however, you can find yourself in odd situations
that require BeautifulSoup’s parent-finding attributes, .parent and .parents.


In [12]:
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

print(bs.find('img',
             {'src':'../img/gifts/img1.jpg'})
             .parent.previous_sibling.get_text())


$15.00
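
The example above uses `.parent`. The sketch below contrasts it with `.parents` on a simplified, invented version of one table row, so it runs without fetching the page.

```python
from bs4 import BeautifulSoup

# Simplified, made-up row modeled on the gift table; no whitespace between tags.
sample_html = (
    '<table><tr>'
    '<td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"/></td>'
    '</tr></table>'
)
image = BeautifulSoup(sample_html, 'html.parser').find('img')

# .parent is the single enclosing tag.
print(image.parent.name)

# .parents walks upward through every ancestor in turn.
ancestor_names = [ancestor.name for ancestor in image.parents]
print(ancestor_names)

# The cell before the image's cell holds the price.
print(image.parent.previous_sibling.get_text())
```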



In [19]:
import re

# Raw string keeps the backslashes intact; forward slashes need no escaping
img_regex = r'\.\./img/gifts/img.*\.jpg'


images = bs.find_all('img', {'src': re.compile(img_regex)})


for image in images:
    # Access an attribute in the tag
    print(image['src'])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
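
The pattern can be checked on its own before handing it to `find_all`. In this sketch the last two sample paths are made up to show values that do not match; BeautifulSoup's documentation describes compiled patterns as being applied with `search()`, which is what the list comprehension mimics.

```python
import re

# Raw string so the backslashes reach the regex engine unchanged.
img_pattern = re.compile(r'\.\./img/gifts/img.*\.jpg')

# Sample src values; the last two are invented non-matches.
candidates = [
    '../img/gifts/img1.jpg',
    '../img/gifts/img6.jpg',
    '../img/gifts/logo.jpg',
    '../img/banner/img1.jpg',
]

matching = [src for src in candidates if img_pattern.search(src)]
print(matching)
```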
