Beautiful Soup is one of the most widely used Python libraries for web scraping. It parses HTML into a tree of objects that we can navigate and search.

In [None]:
from bs4 import BeautifulSoup

For simplicity, we will use Brandon's example `index.html`, which must be in the `Example1` folder inside the current directory. We use it because its HTML is short, unlike real pages, whose source is often huge.

In [23]:
webpage = open("./Example1/index.html") # opens index.html from the Example1 folder inside the current directory

In [24]:
bs = BeautifulSoup(webpage, 'html.parser') # parses the page into a BeautifulSoup object named bs
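
If `index.html` is not at hand, the same parsing step can be tried on an inline string. The sketch below is only an illustration: the `html` string is a shortened, made-up stand-in for the example page, and `io.StringIO` plays the role of the opened file.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the example page.
html = "<html><head><title>My Website</title></head><body><h1>My Website</h1></body></html>"

# BeautifulSoup accepts the markup as a plain string...
soup_from_string = BeautifulSoup(html, "html.parser")

# ...or as an open file-like object; the with block closes it once parsing is done.
with io.StringIO(html) as handle:
    soup_from_file = BeautifulSoup(handle, "html.parser")

print(soup_from_string.title.text)
print(soup_from_file.h1.text)
```

With a real file, `with open("./Example1/index.html") as webpage:` would work the same way and would also close the file for us.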

In [54]:
bs # displays the HTML of the parsed document

<html>
<head>
<title>My Website</title>
<link href="style.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<div id="container">
<div id="header">
<h1>My Website</h1>
</div>
<div id="content">
<div id="nav">
<h3>Navigation</h3>
<ul>
<li><a class="selected" href="">Home</a></li>
<li><a href="">About</a></li>
<li><a href="">Contact</a></li>
</ul>
</div>
<div id="main">
<h2>Home Page</h2>
<p> Paragraph 1</p>
<p> Paragraph 2</p>
<p> Paragraph 3</p>
</div>
</div>
<div id="footer">
				Copyright ©2018 Brandon Podojil
			</div>
</div>
</body>
</html>

We can reach elements by chaining tag names as attributes, much like following a path through the document.

In [26]:
bs.body.div.ul

<ul>
<li><a class="selected" href="">Home</a></li>
<li><a href="">About</a></li>
<li><a href="">Contact</a></li>
</ul>

With dotted tag names we only get the first matching element for each tag, so this approach is limited. What if we want all the `div`s? We can use `find_all`. Compare the three commands `bs.div`, `bs.find("div")`, and `bs.find_all("div")` below. The first two are equivalent.

In [57]:
bs.find_all("div")

[<div id="container">
 <div id="header">
 <h1>My Website</h1>
 </div>
 <div id="content">
 <div id="nav">
 <h3>Navigation</h3>
 <ul>
 <li><a class="selected" href="">Home</a></li>
 <li><a href="">About</a></li>
 <li><a href="">Contact</a></li>
 </ul>
 </div>
 <div id="main">
 <h2>Home Page</h2>
 <p> Paragraph 1</p>
 <p> Paragraph 2</p>
 <p> Paragraph 3</p>
 </div>
 </div>
 <div id="footer">
 				Copyright ©2018 Brandon Podojil
 			</div>
 </div>, <div id="header">
 <h1>My Website</h1>
 </div>, <div id="content">
 <div id="nav">
 <h3>Navigation</h3>
 <ul>
 <li><a class="selected" href="">Home</a></li>
 <li><a href="">About</a></li>
 <li><a href="">Contact</a></li>
 </ul>
 </div>
 <div id="main">
 <h2>Home Page</h2>
 <p> Paragraph 1</p>
 <p> Paragraph 2</p>
 <p> Paragraph 3</p>
 </div>
 </div>, <div id="nav">
 <h3>Navigation</h3>
 <ul>
 <li><a class="selected" href="">Home</a></li>
 <li><a href="">About</a></li>
 <li><a href="">Contact</a></li>
 </ul>
 </div>, <div id="main">
 <h2>Home P

In [58]:
bs.div

<div id="container">
<div id="header">
<h1>My Website</h1>
</div>
<div id="content">
<div id="nav">
<h3>Navigation</h3>
<ul>
<li><a class="selected" href="">Home</a></li>
<li><a href="">About</a></li>
<li><a href="">Contact</a></li>
</ul>
</div>
<div id="main">
<h2>Home Page</h2>
<p> Paragraph 1</p>
<p> Paragraph 2</p>
<p> Paragraph 3</p>
</div>
</div>
<div id="footer">
				Copyright ©2018 Brandon Podojil
			</div>
</div>

In [59]:
bs.find("div")

<div id="container">
<div id="header">
<h1>My Website</h1>
</div>
<div id="content">
<div id="nav">
<h3>Navigation</h3>
<ul>
<li><a class="selected" href="">Home</a></li>
<li><a href="">About</a></li>
<li><a href="">Contact</a></li>
</ul>
</div>
<div id="main">
<h2>Home Page</h2>
<p> Paragraph 1</p>
<p> Paragraph 2</p>
<p> Paragraph 3</p>
</div>
</div>
<div id="footer">
				Copyright ©2018 Brandon Podojil
			</div>
</div>

In [60]:
bs.find("div") == bs.div

True

In [61]:
bs.find_all("div") == bs.div

False
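
The comparison is `False` because `find_all` returns a list-like collection of tags, while `bs.div` and `find` return a single tag. Here is a minimal sketch of the difference, using a small made-up HTML string (not the example file) so it runs on its own:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet with three nested divs.
html = """
<div id="container">
  <div id="header"><h1>My Website</h1></div>
  <div id="footer">Copyright</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_div = soup.find("div")     # a single Tag: the first div in document order
all_divs = soup.find_all("div")  # a list-like collection of every div

div_ids = [div["id"] for div in all_divs]
print(type(first_div).__name__)
print(div_ids)
print(all_divs[0] == first_div)  # the first element of find_all is what find returns
```

So if we only need the first match, `find` is enough; to loop over every match, we use `find_all`.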

Suppose we want the element whose id is `nav`. What can we do?

We can use the `find` method.

In [27]:
help(bs.find)

Help on method find in module bs4.element:

find(name=None, attrs={}, recursive=True, text=None, **kwargs) method of bs4.BeautifulSoup instance
    Return only the first child of this Tag matching the given
    criteria.



In [50]:
bs.find(id = "nav")

<div id="nav">
<h3>Navigation</h3>
<ul>
<li><a class="selected" href="">Home</a></li>
<li><a href="">About</a></li>
<li><a href="">Contact</a></li>
</ul>
</div>
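
There are a few ways to ask for an element by id. The sketch below compares three of them on a small made-up HTML string; it assumes a reasonably recent Beautiful Soup install, where `select_one` (which takes a real CSS selector) is available.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, loosely modeled on the example page.
html = (
    '<div id="content">'
    '<div id="nav"><h3>Navigation</h3></div>'
    '<div id="main"><h2>Home Page</h2></div>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find(id="nav")                  # attribute passed as a keyword
by_attrs = soup.find("div", attrs={"id": "nav"})  # tag name plus an attrs dictionary
by_selector = soup.select_one("#nav")             # CSS selector syntax

print(by_keyword.h3.text)
print(by_keyword == by_attrs == by_selector)
```

All three calls point at the same `div`, so which one to use is mostly a matter of taste.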

We now want the selected item. First, we store the `nav` element in a variable.

In [33]:
nav = bs.find(id = 'nav')

In [34]:
nav

<div id="nav">
<h3>Navigation</h3>
<ul>
<li><a class="selected" href="">Home</a></li>
<li><a href="">About</a></li>
<li><a href="">Contact</a></li>
</ul>
</div>

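What if we want the children of this `nav`?
Note what happens: some of the children look empty. These are not parsing errors. They are the newlines between the tags in the HTML source, which Beautiful Soup keeps as text nodes.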

In [43]:
for i, child in enumerate(nav.children, start=1):  # number the children from 1
    print("=== child", i, "===")
    print(child)
    print("=============== \n")

=== child 1 ===



=== child 2 ===
<h3>Navigation</h3>

=== child 3 ===



=== child 4 ===
<ul>
<li><a class="selected" href="">Home</a></li>
<li><a href="">About</a></li>
<li><a href="">Contact</a></li>
</ul>

=== child 5 ===




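If we only care about the tags, we can filter out the whitespace text nodes. The following is a small sketch on a made-up copy of the `nav` markup; the variable names are just for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical copy of the nav block, with newlines between the tags.
html = """<div id="nav">
<h3>Navigation</h3>
<ul>
<li><a class="selected" href="">Home</a></li>
</ul>
</div>"""
nav = BeautifulSoup(html, "html.parser").find(id="nav")

children = list(nav.children)
kinds = [type(child).__name__ for child in children]
print(kinds)  # text nodes (the newlines) alternate with tags

# Keep only the children that are real tags.
tag_children = [child for child in children if isinstance(child, Tag)]
tag_names = [child.name for child in tag_children]
print(tag_names)
```

Calling `nav.find_all(recursive=False)` is another way to get only the direct child tags.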

Suppose that we want the selected link. Note that we write `class_` with an underscore at the end. That's because `class` is a reserved word in Python, so the BeautifulSoup developers added the underscore to tell them apart.

In [51]:
selected = nav.find(class_="selected")

In [52]:
selected

<a class="selected" href="">Home</a>
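
As a quick check of how class matching behaves, here is a sketch on a made-up snippet in which the link has two classes (the extra class `first` is invented for the example). It also shows the `attrs` dictionary as an alternative to `class_`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first link carries two classes.
html = (
    '<ul>'
    '<li><a class="selected first" href="">Home</a></li>'
    '<li><a href="">About</a></li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find(class_="selected")         # class_ avoids the reserved word class
by_attrs = soup.find(attrs={"class": "selected"}) # same search, written as a dictionary

print(by_keyword["class"])  # the class attribute comes back as a list of class names
print(by_keyword == by_attrs)
```

Searching for one class name still finds the link, even though the element has more than one class.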

Finally, let's extract the text.

In [53]:
selected.text

'Home'
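
Besides the text, we often want an attribute such as the link target. The sketch below uses a made-up snippet with invented `href` values (the example page has empty ones) and extra spaces around the text, to show `get_text(strip=True)` as well.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the href values are invented for the example.
html = (
    '<ul>'
    '<li><a class="selected" href="/home"> Home </a></li>'
    '<li><a href="/about">About</a></li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")
selected = soup.find("a", class_="selected")

raw_text = selected.text                   # text exactly as written, spaces included
clean_text = selected.get_text(strip=True) # surrounding whitespace removed
href = selected["href"]                    # attributes are read like dictionary keys

# Collect every link on the page as a {text: href} dictionary.
links = {a.get_text(strip=True): a["href"] for a in soup.find_all("a")}
print(clean_text, href)
print(links)
```

Combining `find_all` with text and attribute access like this is the core of most small scrapers.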

Congrats! Now you've built your first scraper.