## HTML Content Management with BeautifulSoup

When working with data, you will often need to scrape web pages, for example to build your own dataset. Since the web is made up of HTML pages, it is worth knowing how to use the BeautifulSoup library.


### Parse content via BeautifulSoup

First of all, to read HTML content, you have to _parse_ your data with the library. This is done as follows:


```python
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html>data</html>", "html.parser")
```
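As a quick check, the parsed object can be queried right away. The sketch below reuses the same one-line markup and reads its text back in two ways; the variable names `inner_text` and `all_text` are only illustrative.

```python
from bs4 import BeautifulSoup

# Parse the same minimal document with Python's built-in parser
soup = BeautifulSoup("<html>data</html>", "html.parser")

# .string returns the single text child of a tag
inner_text = soup.html.string

# get_text() concatenates all the text found in the document
all_text = soup.get_text()

print(inner_text)
print(all_text)
```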


For more complex content, other _parsers_ are available, notably for XML. Note that, apart from Python's built-in html.parser, each _parser_ must be installed via pip. These are the ones you can choose from:


<table>
  <tr>
      <td>Parser</td>
      <td>Typical usage</td>
      <td>Advantages</td>
      <td>Disadvantages</td>
  </tr>
  <tr>
      <td>Python’s html.parser</td>
      <td>BeautifulSoup(markup, "html.parser")</td>
      <td>
         <ul>
            <li>Batteries included</li>
            <li>Decent speed</li>
            <li>Lenient (as of Python 2.7.3 and 3.2.2)</li>
         </ul>
      </td>
      <td>
         <ul>
            <li>Not very lenient (before Python 2.7.3 or 3.2.2)</li>
         </ul>
      </td>
  </tr>
  <tr>
      <td>lxml’s HTML parser</td>
      <td>BeautifulSoup(markup, "lxml")</td>
      <td>
         <ul>
            <li>Very fast</li>
            <li>Lenient</li>
         </ul>
      </td>
      <td>
         <ul>
            <li>External C dependency</li>
         </ul>
      </td>
  </tr>
  <tr>
      <td>lxml’s XML parser</td>
      <td>BeautifulSoup(markup, "lxml-xml")<br>BeautifulSoup(markup, "xml")</td>
      <td>
         <ul>
            <li>Very fast</li>
            <li>The only currently supported XML parser</li>
         </ul>
      </td>
      <td>
         <ul>
            <li>External C dependency</li>
         </ul>
      </td>
   </tr>
   <tr>
      <td>html5lib</td>
      <td>BeautifulSoup(markup, "html5lib")</td>
      <td>
         <ul>
            <li>Extremely lenient</li>
            <li>Parses pages the same way a web browser does</li>
            <li>Creates valid HTML5</li>
         </ul>
      </td>
   <td>
      <ul>
         <li>Very slow</li>
         <li>External Python dependency</li>
      </ul>
   </td>
</tr>
</table>


You can install these _parsers_ with pip:

In [None]:
!pip install lxml
!pip install html5lib
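
Since the external parsers may or may not be installed on a given machine, one option is to pick a parser at runtime. The following is a minimal sketch, not a BeautifulSoup feature: the helper `pick_parser` is a hypothetical name introduced here, and it simply checks whether the `lxml` or `html5lib` packages can be imported before falling back to the built-in parser.

```python
import importlib.util

from bs4 import BeautifulSoup


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser name, else Python's built-in one.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    for name in preferred:
        # find_spec returns None when the package is not installed
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


parser = pick_parser()
soup = BeautifulSoup("<p>Hello</p>", parser)
print(parser, soup.p.string)
```

Keep in mind that different parsers can build slightly different trees from the same markup, so it is usually safer to settle on one parser for a given project.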

### Navigate content via BeautifulSoup

The examples in the rest of this section use the following HTML document:


In [3]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [4]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

### Find HTML content via HTML tag name

You can access an element of an HTML page by the name of its HTML tag, which returns the first tag with that name:

In [5]:
soup.head

<head><title>The Dormouse's story</title></head>

In [8]:
soup.title

<title>The Dormouse's story</title>
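
Once a tag has been reached by name, its text and attributes can be read from it. The sketch below assumes a trimmed copy of `html_doc` so that it runs on its own; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of html_doc, kept short so the example is self-contained
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string      # text inside <title>
first_href = soup.a["href"]         # attributes are read like dict keys
first_p_classes = soup.p["class"]   # class is multi-valued, so it is a list
missing = soup.a.get("target")      # .get() returns None if the attribute is absent

print(title_text)
print(first_href)
print(first_p_classes)
print(missing)
```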

### Find a parent item

You can walk up through the ancestors of an element by looping over its _.parents_ attribute:

In [6]:
link = soup.a
link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [7]:
for parent in link.parents:
    # Each ancestor is a tag or the document object itself, never None
    print(parent.name)

p
body
html
[document]
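
When only one ancestor is needed, a loop is not required. The sketch below, which assumes a small stand-in document, uses _.parent_ for the direct parent and _find_parent()_ for the nearest ancestor with a given tag name.

```python
from bs4 import BeautifulSoup

# Small stand-in document with explicit <html> and <body> tags
html_doc = """
<html><body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.a

direct_parent = link.parent.name            # the tag that directly contains <a>
paragraph = link.find_parent("p")           # nearest enclosing <p>
body = link.find_parent("body")             # nearest enclosing <body>
ancestors = [p.name for p in link.parents]  # all ancestors, innermost first

print(direct_parent)
print(paragraph["class"])
print(ancestors)
```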


### Find a nearby item

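Similarly, you can iterate over the siblings of an element, that is, the nodes sharing the same parent, with _.next_siblings_ and _.previous_siblings_: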

In [9]:
for sibling in soup.a.next_siblings:
    print(repr(sibling))

',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'


In [10]:
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))

' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'
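
As the outputs above show, siblings include the text nodes between tags. To skip straight to sibling tags, _find_next_sibling()_, _find_next_siblings()_ and _find_previous_sibling()_ accept the same filters as _find_all()_. A minimal sketch, assuming a shortened version of the story paragraph:

```python
from bs4 import BeautifulSoup

# Shortened version of the story paragraph
html_doc = (
    '<p class="story">Once '
    '<a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a>; the end.</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
first = soup.find(id="link1")

raw_next = first.next_sibling                     # the text node ', '
next_link = first.find_next_sibling("a")          # first following <a> sibling
later_ids = [a["id"] for a in first.find_next_siblings("a")]
previous_link = soup.find(id="link3").find_previous_sibling("a")

print(repr(raw_next))
print(next_link["id"])
print(later_ids)
print(previous_link["id"])
```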


### Find all items that meet a specific condition

BeautifulSoup provides a very handy method, _find_all()_, which returns a list of all the elements of an HTML page that meet the given criteria. Its counterpart _find()_ returns only the first match. For example:

In [11]:
soup.find_all("title")

[<title>The Dormouse's story</title>]

In [12]:
soup.find_all("p", "title")

[<p class="title"><b>The Dormouse's story</b></p>]

In [13]:
soup.find_all("a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [14]:
soup.find_all(id="link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [15]:
import re
soup.find(string=re.compile("sisters"))

'Once upon a time there were three little sisters; and their names were\n'
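
In scraping work, the results of _find_all()_ are typically turned into plain Python values. The sketch below assumes a trimmed copy of `html_doc` and shows three common variations: collecting an attribute from every match, capping the number of results with `limit`, and filtering an attribute with a regular expression.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of html_doc containing only the three links
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Collect the href attribute of every <a> tag
hrefs = [a["href"] for a in soup.find_all("a")]

# Stop searching after the first two matches
first_two = soup.find_all("a", limit=2)

# Keep only tags whose href matches a regular expression
lacie_links = soup.find_all(href=re.compile("lacie"))

print(hrefs)
print([a["id"] for a in first_two])
print([a.get_text() for a in lacie_links])
```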

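### Find elements via CSS class

Finally, elements can be filtered by their CSS class with the `class_` keyword argument (the trailing underscore avoids a clash with the Python keyword `class`):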

In [16]:
soup.find_all("a", class_="sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
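
BeautifulSoup also accepts CSS selector strings through its _select()_ and _select_one()_ methods. A minimal sketch, assuming a trimmed copy of `html_doc` and BeautifulSoup 4.7 or later, where selectors are handled by the soupsieve package that is installed as a dependency:

```python
from bs4 import BeautifulSoup

# Trimmed copy of html_doc containing only the three links
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Every <a class="sister"> that is a direct child of <p class="story">
sisters = soup.select("p.story > a.sister")

# select_one() returns the first match only (or None if nothing matches)
second = soup.select_one("#link2")

# Attribute selector: href ending with "tillie"
tillie = soup.select('a[href$="tillie"]')

print([a["id"] for a in sisters])
print(second.get_text())
print([a["id"] for a in tillie])
```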

## Resources

BeautifulSoup documentation, "Navigating the tree" - [https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree)
