Introduction to Web Scraping

We use this book: Web Scraping with Python: Collecting Data from the Modern Web, by Ryan Mitchell (O’Reilly, 2015). A 2nd edition is due to be published in 2018 but is not yet available, so we use the 1st edition.

Python3 is used throughout this book.

Note: This document assumes you have already installed Python3, pip, and virtualenv. If not, refer to these instructions.

This document covers our second week in this section of the course. It's our second week with Python, and our first week with scraping.

See also, elsewhere in this repo:

  • mitchell-ch3 — Mitchell chapter 3: More web scraping. This covers our third week's assigned reading.
  • more-from-mitchell — More from Mitchell: Web scraping beyond the basics. This covers our fourth week's assigned reading.

BeautifulSoup documentation:

Setup for BeautifulSoup

BeautifulSoup is a scraping library for Python. We want to run all our scraping projects in a virtual environment. Students have already installed both Python3 and virtualenv.

Create a directory and change into it

The first step is to create a new folder (directory) for all your scraping projects. Mine is:

Documents/python/scraping

Do not use any spaces in your folder names. If you must use punctuation, do not use anything other than an underscore (_). It's easiest if you use only lowercase letters.

Change into that directory. For me, the command would be:

cd Documents/python/scraping

Create a new virtualenv in that directory and activate it

Create a new virtualenv there (this is done only once).

Mac OS/bash

$ virtualenv --python=/usr/local/bin/python3 env

Windows PowerShell

PS> virtualenv --python=C:\Python36\python.exe env

Activate the virtualenv:

Mac OS/bash

$ source env/bin/activate

Windows PowerShell

PS> env\Scripts\activate.ps1

Important: You should now see (env) at the far left side of your prompt. This indicates that the virtualenv is active. Example (Mac OS/bash):

(env) mcadams scraping $

When you are finished working in a virtualenv, you should deactivate it. The command is the same in Mac OS or Windows (DO NOT DO THIS NOW):

deactivate

You'll know it worked because (env) will no longer be at the far left side of your prompt.

Install the BeautifulSoup library

In Mac OS or Windows, at the $ bash prompt (or Windows PS>), type:

pip install beautifulsoup4

This is how you install any Python library that exists in the Python Package Index. Pretty handy. pip is a tool for installing Python packages, which is what you just did.

Note: You installed BeautifulSoup in the Python3 virtualenv that is currently active. When that virtualenv is not active, BeautifulSoup will not be available to you. This is ideal, because you will create different virtual environments for different Python projects, and you won't need to worry about updated libraries in the future breaking your (past) code.
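
If you are ever unsure whether the virtualenv is active, or whether BeautifulSoup is installed in it, a short check like the one below may help. This is only a sketch: the variable names are made up for illustration, and it uses nothing beyond Python's standard library.

```python
import importlib.util
import sys

# In a venv or a recent virtualenv, sys.prefix differs from sys.base_prefix;
# older virtualenv releases set sys.real_prefix instead.
in_virtualenv = (
    sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    or hasattr(sys, "real_prefix")
)

# find_spec() returns None when the package cannot be imported
# from the Python environment that is running this code.
bs4_installed = importlib.util.find_spec("bs4") is not None

print("Virtualenv active:", in_virtualenv)
print("BeautifulSoup available:", bs4_installed)
```

If the second line prints False while (env) is showing in your prompt, run pip install beautifulsoup4 again inside that virtualenv.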

Test BeautifulSoup

Start Python. Because you are in a Python3 virtualenv, you need only type python.

You should now be at the >>> prompt — the Python prompt.

In Mac OS or Windows, type (or copy/paste) one line at a time:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://weimergeeks.com/examples/scraping/example1.html")
bsObj = BeautifulSoup(html, "html.parser")
print(bsObj.h1)

Here is what those five lines did:

  1. You imported urlopen (a function in the urllib.request module) and BeautifulSoup (a class in the bs4 package). Those are the first two lines.
  2. You used urlopen to request the URL given and stored the response, which holds the entire contents of that file, in a new Python variable, html.
  3. You used BeautifulSoup to process the value of that variable (the contents of the file at that URL) through Python's built-in HTML parser. (html.parser is not the only option for this; html5lib is more lenient with messy HTML and can be installed with pip.) The result: All the HTML from the file is now in a BeautifulSoup object with the new Python variable name bsObj.
  4. Using the syntax of the BeautifulSoup library, you printed the first H1 element (including its tags) from that parsed value. Check out the page on the web to see what you scraped.

If it works, you'll see:

<h1>We Are Learning About Web Scraping!</h1>

If you got an error about SSL, quit Python (quit() or Control-D) and enter this at the bash prompt:

/Applications/Python\ 3.6/Install\ Certificates.command

Then return to the Python prompt and retry the five lines above.

The example is based on the one on page 8 of Mitchell's book. Her GitHub repo for the book (chapter1) contains updated code, which will presumably appear in her 2nd edition.

The command bsObj.h1 would work the same way for any HTML tag (if it exists in the file). Instead of printing it, you might stash it in a variable:

heading = bsObj.h1
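
To try this without a network connection, you can parse a string of HTML instead of a URL. The sketch below assumes a small made-up page (sample_html is hypothetical, not the real example1.html). It shows that the dot syntax returns only the first matching element, and that it returns None when the tag does not exist.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
sample_html = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <h1>We Are Learning About Web Scraping!</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

bsObj = BeautifulSoup(sample_html, "html.parser")

heading = bsObj.h1       # the first <h1> element, tags included
first_para = bsObj.p     # only the FIRST <p>, even though there are two
missing = bsObj.h2       # there is no <h2> in this page, so this is None

print(heading)                 # <h1>We Are Learning About Web Scraping!</h1>
print(heading.get_text())      # We Are Learning About Web Scraping!
print(first_para.get_text())   # First paragraph.
print(missing)                 # None
```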

Understanding BeautifulSoup

BeautifulSoup is a Python library that enables us to extract information from web pages and even entire websites.

We use BeautifulSoup commands to create a well-structured data object (more about objects below) from which we can extract, for example, everything with an <li> tag, or everything with class="book-title".

After extracting the desired information, we can use other Python commands (and libraries) to write the data into a database, CSV file, or other usable format.
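
As a rough illustration of that last step, here is a minimal sketch that writes already-extracted text to CSV with Python's built-in csv module. The rows are made-up values, and the sketch writes to an in-memory buffer so that it can run anywhere; the file name in the comment is also hypothetical.

```python
import csv
import io

# Hypothetical values, standing in for text already extracted from a page
rows = [
    ("Gainesville", "Florida"),
    ("Austin", "Texas"),
]

# StringIO keeps the sketch self-contained. With a real file you would use
# open("cities.csv", "w", newline="") in place of io.StringIO().
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["city", "state"])   # header row
writer.writerows(rows)               # one CSV line per tuple

csv_text = buffer.getvalue()
print(csv_text)
```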

What is the BeautifulSoup object?

It's very important to understand that many of the BeautifulSoup commands work on an object, which is not the same as a simple string. Throughout her book, Mitchell uses the variable name bsObj to remind us of that fact.

Many programming languages, including Python and JavaScript, have objects. An object can be more powerful and complex than an array (JavaScript) or a list (Python): it can hold many other data types in a structured format.

When you extract information from an object with a BeautifulSoup command, sometimes you get a single item (one element, or the text inside it as a simple string), and sometimes you get a Python list (which is very similar to an array in JavaScript). The way you treat that extracted information will be different depending on whether it is a single item or a list (usually more than one item).
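
One way to see the difference is to ask Python what type each result is. The sketch below uses a made-up two-item list as its HTML. Strictly speaking, a single result from find() is a Tag object rather than a plain string; get_text() is what produces the string.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, just enough to have more than one <li>
sample_html = "<ul><li>Apples</li><li>Pears</li></ul>"
bsObj = BeautifulSoup(sample_html, "html.parser")

one_item = bsObj.find("li")        # a single Tag object (or None if no match)
all_items = bsObj.find_all("li")   # a ResultSet, which behaves like a list
text = one_item.get_text()         # a plain Python string

print(type(one_item).__name__)
print(type(all_items).__name__, len(all_items))
print(type(text).__name__, text)
```

Because all_items behaves like a list, you can index it (all_items[0]) or loop over it. Because one_item is a single element, you call methods such as get_text() on it directly.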

How BeautifulSoup handles the object

In the previous code, when this line ran:

html = urlopen("https://weimergeeks.com/examples/scraping/example1.html")

... you copied the entire contents of a file into a new Python variable named html. The contents were stored as an HTTPResponse object. We can read the contents of that object like this:

[Image: results of html.read()]

... but that's not going to be very usable, or useful — especially for a file with a lot more content in it.
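
For a sense of what read() gives you, here is a small sketch that runs without a network connection. It assumes io.BytesIO as a stand-in for the HTTPResponse object: both are file-like objects whose read() method returns bytes. The HTML content is made up.

```python
import io

# BytesIO stands in for the HTTPResponse that urlopen() returns
html = io.BytesIO(b"<html><body><h1>Hello</h1></body></html>")

raw = html.read()            # the whole contents, as one bytes value
text = raw.decode("utf-8")   # convert the bytes to an ordinary string
second_read = html.read()    # nothing left: the stream is already at its end

print(raw)
print(text)
print(second_read)
```

Note the empty second read. The same applies to a real response: if you call html.read() yourself and then pass html to BeautifulSoup, there is nothing left to parse, so call urlopen again first.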

When you transform that HTTPResponse object into a BeautifulSoup object — with the following line — you create a well-structured object from which you can extract any HTML element and the text within any HTML element.

bsObj = BeautifulSoup(html, "html.parser")

Let's look at a few examples of what BeautifulSoup can do.

Finding elements that have a particular class

Deciding the best way to extract what you want from a large HTML file requires you to dig around in the source before you write the Python/BeautifulSoup commands. In many cases, you'll see that everything you want has the same CSS class on it. After creating a BeautifulSoup object (here, as before, it is in the variable bsObj), this line will create a Python list (you can think of it as an array) containing all the <td> elements that have the class city.

city_list = bsObj.findAll( "td", {"class":"city"} )

Maybe there were 10 cities in <td> tags in that HTML file. Maybe there were 10,000. No matter how many, they are now in a list (in the variable city_list), and you can search them, print them, write them out to a database or a JSON file — whatever you like. Often you will want to perform the same actions on each item in the list, so you will use a for-loop:

for city in city_list:
    print( city.get_text() )

get_text() is a handy BeautifulSoup method that will extract the text, and only the text, from the item. If instead you wrote just print(city), you'd get the <td> tags, and any other tags inside them, as well.
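
Here is a self-contained sketch of the same idea, assuming a small made-up table (the cities and class names are hypothetical). It uses find_all(), the current BeautifulSoup name for findAll(); both take the same arguments.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the city names and class names are made up
sample_html = """
<table>
  <tr><td class="city">Gainesville</td><td class="state">Florida</td></tr>
  <tr><td class="city">Austin</td><td class="state">Texas</td></tr>
  <tr><td class="city"><a href="/boise">Boise</a></td><td class="state">Idaho</td></tr>
</table>
"""

bsObj = BeautifulSoup(sample_html, "html.parser")

# Every <td> that has class="city", in document order
city_list = bsObj.find_all("td", {"class": "city"})

# get_text() drops the <td> and <a> tags and keeps only the text
city_names = [city.get_text() for city in city_list]

print(city_names)     # ['Gainesville', 'Austin', 'Boise']
print(city_list[2])   # <td class="city"><a href="/boise">Boise</a></td>
```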

Finding all vs. finding one

The BeautifulSoup findAll() method you just saw always produces a list. If you know there will be only one item of the kind you want in a file, you should use the find() method instead.

For example, maybe you are scraping the address and phone number from every page in a large website. There is only one phone number on the page, and it is enclosed in a pair of tags with the attribute id="call". One line of your code gets the phone number from the current page:

phone_number = bsObj.find(id="call")

Naturally, you don't need to loop through that result. The variable phone_number will contain a single item, not a list: the matching element, including its HTML tags. To test what the text alone will look like, just print it using get_text() to strip out the tags.

print( phone_number.get_text() )
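
When you scrape many pages, some of them may lack the element you are looking for. In that case find() returns None, and calling get_text() on None raises an error. The sketch below shows one way to guard against that; the get_phone function, the two sample pages, and the phone number are all made up for illustration.

```python
from bs4 import BeautifulSoup

# Two hypothetical pages: one has the id="call" element, one does not
page_with_phone = '<p>Call us: <span id="call">352-555-0100</span></p>'
page_without_phone = "<p>No phone number here.</p>"


def get_phone(html_text):
    """Return the text of the id="call" element, or None if it is absent."""
    soup = BeautifulSoup(html_text, "html.parser")
    tag = soup.find(id="call")   # a Tag if found, otherwise None
    if tag is None:
        return None
    return tag.get_text()


print(get_phone(page_with_phone))      # 352-555-0100
print(get_phone(page_without_phone))   # None
```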

Notice that you're always using bsObj. Review above if you've forgotten where that came from.

Finding the contents of a particular attribute

One last example: You've made a BeautifulSoup object from a page that has dozens of images on it. You want to capture the path to each image file on that page (perhaps so that you can download all the images). This requires two steps:

image_list = bsObj.findAll('img')
for image in image_list:
    print(image.attrs['src'])

First, you make a Python list containing all the img elements that exist in the object.

Second, you loop through that list and print the contents of the src attribute from each img tag in the list.

We do not need get_text() in this case, because the contents of the src attribute are nothing but text. There are never tags inside the src attribute.
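
Two practical wrinkles are worth a sketch: an img tag may have no src attribute at all, and a src value is often a relative path rather than a full URL. The example below assumes a made-up page address and markup. It uses the tag's get() method, which returns None for a missing attribute instead of raising a KeyError, and urljoin from Python's standard library to build full URLs.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page address and markup
page_url = "https://example.com/gallery/index.html"
sample_html = """
<img src="images/cat.jpg" alt="A cat">
<img src="/static/dog.png" alt="A dog">
<img alt="An image tag with no src attribute">
"""

bsObj = BeautifulSoup(sample_html, "html.parser")

image_urls = []
for image in bsObj.find_all("img"):
    src = image.get("src")   # None, not a KeyError, when src is missing
    if src:
        # Turn a relative path into a full URL based on the page address
        image_urls.append(urljoin(page_url, src))

print(image_urls)
```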

There's a lot more to learn about BeautifulSoup, and we'll be using Mitchell's book for that. You can also read the docs.