In [1]:
%%HTML
<!-- execute this cell before continuing -->
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Lato">
<style>.reveal * { font-family: "Lato" !important; } .reveal .code_cell * { font-family: monospace !important; }</style>

<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98728503-5ab82f80-2378-11eb-9c79-adeb308fc647.png"></img>

<h1 style="color: white; position: absolute; top:30%; left:10%;">
    Web Scraping in Python
</h1>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:43%; left:10%;">
    David Mertz, Ph.D.
</h3>
</div>

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

<img src="https://user-images.githubusercontent.com/7065401/98864025-08deda80-2448-11eb-9600-22aa17884cdf.png" style="height: 100%; max-height: inherit; position: absolute; top: 20%; left: 0px;"></img>
<br>

<h2 style="font-weight: bold;">
    David Mertz
</h2>

<h3 style="color: #ef7d22; margin-top: 0.8em">
    Data Scientist
</h3>
<hr>
<br><br>

<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    mertz@kdm.training
</p>
<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    @mertz_david
</p>
<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    linkedin.com/in/dmertz
</p>

</div>

<br><br><br>

<h2 style="font-weight: bold;">
    Beautiful Soup
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Beautiful Soup is a versatile library for extracting data from, and manipulating, HTML documents.  True to its colorful name, Beautiful Soup will also process the *almost-HTML* documents that are common on the World Wide Web, with markup that is not *quite* grammatical HTML (including fragments that could be part of a valid document).  Beautiful Soup will parse "tag soups" that are only loosely structured (similar to the "quirks mode" of popular web browsers).

Beautiful Soup is **not** a tool for obtaining documents from the web, but only for processing them once obtained.  For many documents, the third-party `requests` library, or even the standard library `urllib`, is sufficient.  For more dynamic websites, see the later lessons in this course.
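
Because parsing is separate from fetching, Beautiful Soup can be tried on any string at all.  As a minimal sketch, the cell below parses a hand-written "tag soup" rather than a downloaded page; the fragment is invented for illustration, and `html.parser` is named explicitly so that nothing beyond the standard library is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: neither <p> nor <b> is ever closed
tag_soup = "<p>Unclosed <b>bold<p>next"
soup = BeautifulSoup(tag_soup, "html.parser")

# Both paragraphs are still found, and all of the text survives
paragraphs = soup.find_all("p")
text = soup.get_text()
print(len(paragraphs))
print(text)
```

How the unclosed tags end up nested depends on the parser chosen, so a script should not rely on the exact tree built from markup this broken.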

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>   

At the start, let us import a few capabilities, as is common in these courses.

</div>

In [2]:
import re
from urllib.parse import quote
import requests
from bs4 import BeautifulSoup

<h2 style="font-weight: bold;">
    Getting a feel
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Let us look at a Wikipedia web page.  Like most modern web pages, this page has a great deal of CSS styling, a little JavaScript, many `class` and `id` attributes within tags, and quite a bit of nesting of elements, especially within `<span>` and `<div>` elements that carry CSS styling.

For a stipulated goal, we would like to pull out all the related pages listed under the "See also" section.  Quite likely, this same code will work for other Wikipedia pages, which are similar in markup and structure.

In [3]:
# Use specific snapshot version for permanence of content
url = "https://en.wikipedia.org/w/index.php?title=Web_scraping&oldid=986505339"
resp = requests.get(url)
resp.status_code

200

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>   

Converting a document's text to a beautiful soup gives us a nested collection of *nodes*, each of which has numerous useful methods and attributes.

</div>

In [4]:
soup = BeautifulSoup(resp.text)
print(soup.title)
print(soup.title.string)

<title>Web scraping - Wikipedia</title>
Web scraping - Wikipedia


<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>   

We know (or suspect) that the "See also" is one of the `<h2>` elements.

</div>

In [5]:
[e.text for e in soup.find_all('h2')]

['Contents',
 'History[edit]',
 'Techniques[edit]',
 'Software[edit]',
 'Legal issues[edit]',
 'Methods to prevent web scraping[edit]',
 'See also[edit]',
 'References[edit]',
 'Navigation menu']

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>   

Let us find the element itself using that knowledge.

</div>

In [6]:
see_also = [e for e in soup.find_all('h2') if e.text.startswith('See also')][0]
print(see_also.prettify())

<h2>
 <span class="mw-headline" id="See_also">
  See also
 </span>
 <span class="mw-editsection">
  <span class="mw-editsection-bracket">
   [
  </span>
  <a href="/w/index.php?title=Web_scraping&amp;action=edit&amp;section=17" title="Edit section: See also">
   edit
  </a>
  <span class="mw-editsection-bracket">
   ]
  </span>
 </span>
</h2>



<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>   

Of course, what we want is the stuff that comes *after* the actual "See also" heading.

</div>

In [7]:
see_also_links = see_also.find_next_sibling()
print(see_also_links.prettify()[:474], "...")

<div class="div-col columns column-width" style="-moz-column-width: 22em; -webkit-column-width: 22em; column-width: 22em;">
 <ul>
  <li>
   <a href="/wiki/Archive.today" title="Archive.today">
    Archive.today
   </a>
  </li>
  <li>
   <a href="/wiki/Comparison_of_feed_aggregators" title="Comparison of feed aggregators">
    Comparison of feed aggregators
   </a>
  </li>
  <li>
   <a href="/wiki/Data_scraping" title="Data scraping">
    Data scraping
   </a>
  </li>
   ...


<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

Finding just the link elements within our current focus:

</div>

In [8]:
see_also_links.find_all('a')[:6]

[<a href="/wiki/Archive.today" title="Archive.today">Archive.today</a>,
 <a href="/wiki/Comparison_of_feed_aggregators" title="Comparison of feed aggregators">Comparison of feed aggregators</a>,
 <a href="/wiki/Data_scraping" title="Data scraping">Data scraping</a>,
 <a href="/wiki/Data_wrangling" title="Data wrangling">Data wrangling</a>,
 <a href="/wiki/Importer_(computing)" title="Importer (computing)">Importer</a>,
 <a href="/wiki/Job_wrapping" title="Job wrapping">Job wrapping</a>]

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

Perhaps we want just the names of the related concepts:

</div>

In [9]:
[e.text for e in see_also_links.find_all('a')][:9]

['Archive.today',
 'Comparison of feed aggregators',
 'Data scraping',
 'Data wrangling',
 'Importer',
 'Job wrapping',
 'Knowledge extraction',
 'OpenSocial',
 'Scraper site']

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

Or only the link URIs (relative links, in this case, within the same domain):

</div>

In [10]:
[e['href'] for e in see_also_links.find_all('a')][:9]

['/wiki/Archive.today',
 '/wiki/Comparison_of_feed_aggregators',
 '/wiki/Data_scraping',
 '/wiki/Data_wrangling',
 '/wiki/Importer_(computing)',
 '/wiki/Job_wrapping',
 '/wiki/Knowledge_extraction',
 '/wiki/OpenSocial',
 '/wiki/Scraper_site']
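
Relative links like these usually need to be resolved against the page they came from before they can be fetched.  One way to do that, sketched here on a hand-copied piece of the list above, is the standard library function `urllib.parse.urljoin`; the names `page_url`, `fragment`, and `absolute` are only illustrative.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# The snapshot URL used earlier, and two items copied from its "See also" list
page_url = "https://en.wikipedia.org/w/index.php?title=Web_scraping&oldid=986505339"
fragment = """<ul>
<li><a href="/wiki/Archive.today" title="Archive.today">Archive.today</a></li>
<li><a href="/wiki/Data_scraping" title="Data scraping">Data scraping</a></li>
</ul>"""

links = BeautifulSoup(fragment, "html.parser").find_all("a")
# An href starting with "/" replaces the path (and query) of the base URL
absolute = [urljoin(page_url, a["href"]) for a in links]
print(absolute)
```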

<h2 style="font-weight: bold;">
    A simple Wikipedia crawler
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Given what we have seen with only a small part of Beautiful Soup's capabilities, we can write a small web scraper to do the following.

* Given a simple name that *might* be a Wikipedia page title as a function argument
* Massage the name to *probably* match the pattern of URLs
* Return None if no such page exists
* If a page exists, return a dictionary with the search term as a key mapped to its "See also" titles, and with each of those linked pages likewise contributing a key mapped to its own "See also" titles.

```python
{'Foobar': ['Foo', 'Bar', 'Baz'],
 'Foo': ['Fliz', 'Flam'],
 'Bar': ['Bim', 'Bop', 'Foobar'],
 'Baz': ['Do', 'Wap', 'Diddy']}
```

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img> 

A small support function can make Wikipedia URLs from article names.
</div>

In [11]:
def urlify(name):
    # Make Wikipedia URLs from article titles
    if name.startswith(('https://', 'http://')):
        return name  # Already a URL
    else:
        base = 'https://en.wikipedia.org/wiki/'
        # Space to underscore in Wikipedia URLs
        url = f"{base}{quote(name.replace(' ', '_'))}"
        return url
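
To see what the helper produces, here is a quick sketch.  The function is repeated in compact form only so that the snippet runs on its own; the sample titles are arbitrary.  Note that `quote()` percent-encodes characters such as parentheses.

```python
from urllib.parse import quote

def urlify(name):
    # Same logic as the cell above, condensed
    if name.startswith(('https://', 'http://')):
        return name  # Already a URL
    return f"https://en.wikipedia.org/wiki/{quote(name.replace(' ', '_'))}"

print(urlify("Web scraping"))
print(urlify("Exploit (computer security)"))
print(urlify("https://example.org/page"))
```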

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

The main crawler:

</div>

In [12]:
def see_also_graph(name, _depth=0, maxdepth=1):
    graph = dict()
    resp = requests.get(urlify(name))
    if resp.status_code != 200:
        print("Unable to find page at", urlify(name))
        return None
    soup = BeautifulSoup(resp.text, 'lxml')
    see_also = [e for e in soup.find_all('h2')
                  if e.text.startswith("See also")]
    if see_also:   # Some outgoing links
        see_also_links = see_also[0].find_next_sibling('ul')
        if see_also_links is None:   # Heading not followed by a list
            return graph
        names = [e.text for e in see_also_links.find_all('a')]
        graph[name] = names
        if _depth < maxdepth:   # Limited recursion depth
            for linked in names:
                # A missing page returns None; contribute nothing for it
                subgraph = see_also_graph(linked, _depth+1, maxdepth)
                graph.update(subgraph or {})
    return graph

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

Crawling from a particular starting article.

</div>

In [13]:
see_also_graph('Percent-encoding')

{'Percent-encoding': ['Internationalized Resource Identifier',
  'Punycode',
  'Binary-to-text encoding',
  'Shellcode'],
 'Internationalized Resource Identifier': ['IDN',
  'Semantic Web',
  'Punycode',
  'XRI'],
 'Punycode': ['Emoji domain', 'UTF-5', 'UTF-6', 'Website spoofing'],
 'Shellcode': ['Alphanumeric code',
  'Computer security',
  'Buffer overflow',
  'Exploit (computer security)',
  'Heap overflow',
  'Metasploit Project',
  'Shell (computing)',
  'Shell shoveling',
  'Stack buffer overflow',
  'Vulnerability (computing)']}
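
The returned dictionary is effectively an adjacency list, so ordinary Python is enough to explore it.  As a small sketch, using the made-up `Foobar` data from the start of this section rather than a live crawl, we can flatten it into edges and list the pages that were mentioned but never crawled themselves.

```python
# Toy data in the same shape that see_also_graph() returns
graph = {'Foobar': ['Foo', 'Bar', 'Baz'],
         'Foo': ['Fliz', 'Flam'],
         'Bar': ['Bim', 'Bop', 'Foobar'],
         'Baz': ['Do', 'Wap', 'Diddy']}

# Flatten into (source, target) pairs
edges = [(src, dst) for src, targets in graph.items() for dst in targets]

# Titles that appear as targets but have no key of their own
mentioned = {dst for _, dst in edges}
uncrawled = sorted(mentioned - set(graph))

print(len(edges))
print(uncrawled)
```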

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

An article topic might not exist.

</div>

In [14]:
see_also_graph('Silly Page')

Unable to find page at https://en.wikipedia.org/wiki/Silly_Page


<h2 style="font-weight: bold;">
    Special features
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

There are some edge-case features we have already subtly used that are worth noting.  Although the Wikipedia pages we looked at are complete HTML pages, Beautiful Soup will also parse fragments of HTML (or near-HTML).

Beautiful Soup supports multiple parsers, all of which are able to deal with not-quite-well-formed HTML.  The parser `html.parser` relies only on the Python standard library.  If installed, the external libraries `lxml` or `html5lib` may be used; when no parser is named, Beautiful Soup picks the best one it finds installed, preferring lxml, then html5lib, then `html.parser`.  I recommend lxml if it is an option (the library is discussed in the Python Serialization INE course as well).

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

The different libraries will parse ill-formed HTML slightly differently.  The differences will rarely matter to most scripts.

</div>

In [15]:
BeautifulSoup("<a></p></a>", "lxml")

<html><body><a></a></body></html>

In [16]:
BeautifulSoup("<a></p></a>", "html5lib")

<html><head></head><body><a><p></p></a></body></html>

In [17]:
BeautifulSoup("<a></p></a>", "html.parser")

<a></a>
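
If a script has to run where `lxml` might not be installed, one possible approach is to try parsers in order of preference.  The helper name `make_soup` and its `preferred` argument are my own invention, not part of the library; the sketch relies only on the `FeatureNotFound` exception that Beautiful Soup raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    # Return (soup, parser_name) using the first parser that is installed
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # Not installed; try the next one
    raise RuntimeError("None of the requested parsers is available")

soup, parser = make_soup("<a></p></a>")
print(parser)
print(soup)
```

Naming the parser explicitly, whichever one it is, keeps results consistent from one machine to the next.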

<h2 style="font-weight: bold;">
    Text versus string
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

In the above examples, I sometimes used the attribute `.string` on nodes, and other times `.text`.  These two attributes are similar, but also slightly different.  Use `.text` to pull out all the plain text from inside a node, even if other tags are nested.  In contrast, `.string` will only pull out the element body if it is *only* text.

We also look at an example of using a child element tag name (`<p>` in this case) as an attribute of a soup to access that element within the fragment.  The attribute style will simply get the first such element.  To find other children or siblings, use attributes like `.next_sibling` and `.previous_sibling`, or `.next_siblings` and `.previous_siblings` for multiple, or methods like `.find()` and `.find_all()`.

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

Parse a fragment of HTML.

</div>

In [18]:
fragment1 = """<div>
<h1>Header</h1>
<p>This text has <i>italics</i> and <b>boldface</b></p>
<p>Another paragraph</p>
</div>"""

soup1 = BeautifulSoup(fragment1)
print(soup1.p)
print(soup1.p.string)
print(soup1.p.text)
print(list(soup1.p.next_siblings))

<p>This text has <i>italics</i> and <b>boldface</b></p>
None
This text has italics and boldface
['\n', <p>Another paragraph</p>, '\n']
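
When a node mixes text and nested tags, two related tools may be handy: the `.stripped_strings` generator and the `.get_text()` method, which accepts a separator and a `strip` flag.  A brief sketch on the same paragraph, with `html.parser` named explicitly so the snippet stands alone:

```python
from bs4 import BeautifulSoup

fragment = "<p>This text has <i>italics</i> and <b>boldface</b></p>"
p = BeautifulSoup(fragment, "html.parser").p

print(p.string)      # None, since the body mixes text and tags
print(p.i.string)    # The <i> element holds only text

# Each piece of text, with surrounding whitespace removed
pieces = list(p.stripped_strings)
print(pieces)

# The same pieces joined with a chosen separator
joined = p.get_text(" ", strip=True)
print(joined)
```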


When there is, indeed, only body text (PCDATA, in XML terms), the two attributes still give you slightly different objects back.

In [19]:
fragment2 = """<div>
<h1>Header</h1>
<p>This text has only plain text in paragraph</p>
<p>Another paragraph</p>
</div>"""

soup2 = BeautifulSoup(fragment2)
print(soup2, '\n')
print(soup2.p.string, type(soup2.p.string))
print(soup2.p.text, type(soup2.p.text))

<html><body><div>
<h1>Header</h1>
<p>This text has only plain text in paragraph</p>
<p>Another paragraph</p>
</div></body></html> 

This text has only plain text in paragraph <class 'bs4.element.NavigableString'>
This text has only plain text in paragraph <class 'str'>


The `NavigableString` that we got from a `.string` attribute will work as a string in other functions.  It is a subclass of string, so you can pass it to your regular text manipulation functions.  But it also adds some characteristic soup methods and attributes.

In [20]:
def title(s, nwords=6):
    words = s.split()[:nwords]
    s = ' '.join(words)
    return s.title()

para_text = soup2.p.string
print(title(para_text))
# Navigating from the string, not from the tag
para_text.find_next('p')

This Text Has Only Plain Text


<p>Another paragraph</p>

<h2 style="font-weight: bold;">
    Character encoding
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

In the best of circumstances, web pages will always be encoded as UTF-8.  Sometimes it doesn't turn out that well.  Beautiful Soup will use *chardet*-style heuristics to try to guess encodings.  This usually works, but you can manually specify if it does not.  Of course, that only helps if you *know* the correct encoding.  

Often Beautiful Soup will even detect "mixed encodings" where special characters from Windows-1252 (usually "smart quotes") are illegally interspersed with Unicode.  The library really is clever.
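
For the mixed-encoding case, the library documents a helper, `UnicodeDammit.detwingle()`, that rewrites embedded Windows-1252 bytes as UTF-8 before parsing.  The sketch below builds such a byte string by hand, so the data is contrived rather than taken from a real page.

```python
from bs4 import UnicodeDammit

# UTF-8 snowmen followed by Windows-1252 "smart quotes"
snowmen = "\N{SNOWMAN}" * 3
quoted = ("\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!"
          "\N{RIGHT DOUBLE QUOTATION MARK}")
mixed = snowmen.encode("utf8") + quoted.encode("windows-1252")

# Convert the stray Windows-1252 bytes so the whole document is UTF-8
fixed = UnicodeDammit.detwingle(mixed)
decoded = fixed.decode("utf8")
print(decoded)
```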

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

<div style="position: relative; text-align: left;">
There are times when a META tag within an HTML document will indicate a character encoding, e.g.

```html
<meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
```

It is **possible** that this tag is telling the truth.  Beautiful Soup is smart enough to look at the advice, but also not fail if it is inaccurate.  For example, here is a document that contains a byte that is illegal under its declared encoding.

</div>
</div>

In [21]:
page = b'''<html>
  <head><meta content="text/html; charset=utf-8"/></head>
  <body><p>Sacr\xe9 bleu!</p></body>
</html>'''

soup = BeautifulSoup(page)
print(soup.original_encoding)
soup.body

ISO-8859-1


<body><p>Sacré bleu!</p></body>

In [22]:
page = b'''
 <html>
  <head><meta content="text/html; charset="iso-8859-1"/></head>
  <body><p>Greek \xc5\xeb\xeb\xe7\xed\xe9\xea\xfc</p></body>
 </html>
'''
soup = BeautifulSoup(page)
print(soup.original_encoding)
soup.body

iso-8859-1


<body><p>Greek Åëëçíéêü</p></body>

In [23]:
page = b'''
 <html>
  <head><meta content="text/html; charset="iso-8859-1" /></head>
  <body><p>Greek \xc5\xeb\xeb\xe7\xed\xe9\xea\xfc</p></body>
 </html>
'''
soup = BeautifulSoup(page, exclude_encodings=["iso-8859-1", "windows-1252"])
print(soup.original_encoding)
soup.body

ISO-8859-7


<body><p>Greek Ελληνικό</p></body>

In [24]:
page = b'''
 <html>
  <body><p>Greek \xc5\xeb\xeb\xe7\xed\xe9\xea\xfc</p></body>
 </html>
'''
soup = BeautifulSoup(page, from_encoding="iso-8859-7")
print(soup.original_encoding)
soup.body

iso-8859-7


<body><p>Greek Ελληνικό</p></body>

<h2 style="font-weight: bold;">
    Searching for content
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

We have already looked at several methods that will search for tags.  In general, the variations including "sibling" will look at the same level of the hierarchy, while `.find()`, `.find_all()`, and `.find_next()` will search across levels.  All of these share almost all the same potential arguments.
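
A small invented fragment makes the difference between same-level and cross-level searches concrete.  This is only a sketch; the markup and variable names are made up for the purpose.

```python
from bs4 import BeautifulSoup

html = ("<div><h2>Title</h2>"
        "<ul><li><a href='/a'>A</a></li></ul>"
        "<p>Tail</p></div>")
soup = BeautifulSoup(html, "html.parser")
h2 = soup.h2

# Sibling searches stay at the level of the <h2>
first_sibling = h2.find_next_sibling()        # the <ul>
para_sibling = h2.find_next_sibling("p")      # the <p>
link_sibling = h2.find_next_sibling("a")      # None: the <a> is nested deeper

# find_next() walks forward through the document, whatever the nesting
link_anywhere = h2.find_next("a")

print(first_sibling.name, para_sibling.text, link_sibling, link_anywhere["href"])
```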

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

The above examples looked only for particular tag names, but other options are available.  Let us look through a book available from Project Gutenberg (PG).  Note that repeated web crawling of PG is not permitted; questions and concerns about this are discussed in lesson 4 of this course.

</div>

In [25]:
url = "https://www.gutenberg.org/files/25007/25007-h/25007-h.htm"
resp = requests.get(url)
soups = resp.text

In [26]:
soup = BeautifulSoup(soups)
print(soup.title.text)


      The Project Gutenberg eBook of Fifty Soups, by Thomas J. Murrey
    


Beyond searching for a certain tag name, we might also search for tags belonging to a certain class. E.g. tags that contain `class="smcap"`.  In this book, and within PG, this indicates a table-of-contents listing.

In [27]:
toc = soup.find_all(class_="smcap")  # `class` is reserved, use trailing underscore
toc[:12]

[<span class="smcap">Copyright, 1884</span>,
 <span class="smcap">By WHITE, STOKES, &amp; ALLEN.</span>,
 <span class="smcap">Remarks on Soups</span>,
 <span class="smcap">Artichoke Soup</span>,
 <span class="smcap">Asparagus Soup</span>,
 <span class="smcap">Barley Soup</span>,
 <span class="smcap">Beans, Puree of</span>,
 <span class="smcap">Beef Stock</span>,
 <span class="smcap">Beef Tea</span>,
 <span class="smcap">Bouille-Abaisse</span>,
 <span class="smcap">Cauliflower Soup</span>,
 <span class="smcap">Celery, Cream of</span>]

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>   

Most attribute names can simply be specified as named parameters to a search method.

</div>

In [28]:
soup.find(id="Page_7")

<a id="Page_7" name="Page_7">7</a>

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>   

The word `name` is reserved for the tag name, but you can specify the attribute `name` with a slight circumlocution (an HTML attribute `attrs` would require this same style, if it occurred in the document).  This is also necessary when attributes (or tag names) contain characters outside those allowed in Python identifiers (most often dashes).

</div>

In [29]:
soup.find_all(attrs={"name": "Page_7"})

[<a id="Page_7" name="Page_7">7</a>]
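
The dashed-attribute case can be sketched with a made-up fragment (the `data-page` attribute here is hypothetical and does not occur in the PG book).

```python
from bs4 import BeautifulSoup

fragment = ('<div data-page="7" name="intro">Soup</div>'
            '<div data-page="8">Stew</div>')
soup = BeautifulSoup(fragment, "html.parser")

# data-page is not a valid Python identifier, so it cannot be a keyword
page7 = soup.find_all(attrs={"data-page": "7"})

# `name` as a keyword would mean the tag name, so it also goes in attrs
intro = soup.find_all(attrs={"name": "intro"})

# A tag name and an attrs dictionary can be combined
page8 = soup.find_all("div", attrs={"data-page": "8"})

print(page7, intro, page8)
```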

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

You might search for text content *within* an element.

</div>

In [30]:
[e.parent for e in soup.find_all(string="Artichoke Soup")]

[<span class="smcap">Artichoke Soup</span>]

A problem here is that a string search only matches *exactly* that element body.  If you want more flexibility, you can use a regular expression pattern to search for substrings or more complex patterns.

In [31]:
[e.parent for e in soup.find_all(string=re.compile('[Aa]rtichoke'))]

[<span class="smcap">Artichoke Soup</span>,
 <b>Artichoke Soup.</b>,
 <p><b>Artichoke Soup.</b>—Melt a piece of butter the size of an egg in a
 saucepan; then fry in it one white turnip sliced, one red onion sliced,
 three pounds of Jerusalem artichokes washed, pared, and sliced, and a
 rasher of bacon. Stir these in the boiling butter for about ten minutes,
 add gradually one pint of stock. Let all boil together until the
 vegetables are thoroughly cooked, then add three pints more of stock;
 stir it well; add pepper and salt to taste, strain and press the
 vegetables through a sieve, and add one pint of boiling milk. Boil for
 five minutes more and serve.</p>]

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>   

Most sophisticated of all the options is to write a custom function to find the elements of relevance to your purpose.  To contrive an example, this PG book has markers indicating where the page breaks occurred in the original text.  

Let us identify those paragraphs that cross a page break and contain eggs in their recipe.

</div>

In [32]:
def egg_near_pagebreak(e):
    # True only for a <p> that holds a page marker and mentions eggs
    return (e.name == 'p'
            and e.find(class_="pagenum") is not None
            and 'egg' in e.text)

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

Using our custom search function.

</div>

In [33]:
for recipe in soup.find_all(egg_near_pagebreak):
    page = recipe.find('a')
    print(f"Page {page.text}: {recipe.text[:130]} ...\n")

Page 9: Milk or cream should be boiled and strained9 and added hot when intended
for soups; when eggs are used beat them thoroughly, and  ...

Page 10: When the vegetables are thoroughly cooked, strain the soup into a large
saucepan, and set10 it on back of range to keep hot, but  ...

Page 15: Bisque of Lobster.—Procure two large live lobsters; chop them up while
raw, shells and all; put them into a mortar with three-fou ...

Page 26: The yolks of the eggs may be worked to a26 paste, and made into round
balls to imitate turtle eggs if this is desired. ...



<h2 style="font-weight: bold;">
    Summary
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)


In this lesson we have seen a good subset of the Beautiful Soup API.  The documentation will detail additional methods on node objects, but working with them is largely similar to those we have seen.

As with XML libraries like `xml.dom` or ElementTree, Beautiful Soup also has methods to add or rearrange elements and write out new HTML documents.  The `.prettify()` method we saw mostly suffices for the latter.  Those capabilities are useful, but not central to the purpose of web *scraping*.
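
For completeness, here is a minimal sketch of modifying a tree and writing it back out.  The tiny list is invented for the example; `new_tag()` and `.append()` are the library calls being illustrated.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo</li></ul>", "html.parser")

# Build a new element and attach it to the existing list
item = soup.new_tag("li")
item.string = "Bar"
soup.ul.append(item)

compact = str(soup)        # Serialization without added whitespace
print(compact)
print(soup.prettify())     # Indented serialization
```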