# Beautiful Soup 101
This notebook introduces the basic concepts of Beautiful Soup and its main classes and methods. It uses a simple sample HTML document stored as a string.

In [2]:
soup_html = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
</head>
<body>
    <h1>This is the title</h1>
    <h2>This is a sub title</h2>
    <div class="outer">
        <div class="inner extraclass">
            <h2>This is a sub heading</h2>
            This is some text inside a string
            <ul>
                <li>list item one</li>
                <li>list item two</li>
                <li>list item three <a href="linkhref">a link in a list</a></li>
                <li>list item four</li>
                <li>list item 5</li>
            </ul>
        </div>
    </div>
    <h2>This is another sub title</h2>
    <p class="pclass">
        This is a paragraph
        <span><b>Bold text in front of</b>plain text</span>
        <br />
        <br />
        <br />
        <br />
        <img src="images/someimgage.jpg" alt="someimgage"></img>
        <a href="somehyperlink">some content</a>
    </p>
    <div class="bottomdiv">
        <h2>This is another sub heading</h2>
        this is at the bottomdi
    </div>
    <footer>
        some footer
    </footer>
</body>
</html>
'''

## Set up the notebook
First we need to install the relevant libraries (I am using pipenv, but you can use `!pip` if you are running in Colab or similar). We need Beautiful Soup and the parser(s).

In [3]:
#!pip install beautifulsoup4

# Install the parsers
#!pip install html5lib
#!pip install lxml

## Create the beautiful soup object

In [4]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(soup_html, 'html.parser')

## Basic navigation

### Navigating by object using the .tag syntax
The simplest way to navigate is when you already know the structure (if only, right?). Let's find the title using dot syntax:

In [5]:
print(soup.h1)

<h1>This is the title</h1>


In [6]:
print(soup.div)

<div class="outer">
<div class="inner extraclass">
<h2>This is a sub heading</h2>
            This is some text inside a string
            <ul>
<li>list item one</li>
<li>list item two</li>
<li>list item three <a href="linkhref">a link in a list</a></li>
<li>list item four</li>
<li>list item 5</li>
</ul>
</div>
</div>


At first glance the output shows two divs. Our HTML has two top-level divs, and the first one contains a nested div. Beautiful Soup has a `prettify()` method; if we format the output with `prettify()` we can see that bs4 has found the first __branch__ of the tree that is a div tag.

In [7]:
print(soup.div.prettify())

<div class="outer">
 <div class="inner extraclass">
  <h2>
   This is a sub heading
  </h2>
  This is some text inside a string
  <ul>
   <li>
    list item one
   </li>
   <li>
    list item two
   </li>
   <li>
    list item three
    <a href="linkhref">
     a link in a list
    </a>
   </li>
   <li>
    list item four
   </li>
   <li>
    list item 5
   </li>
  </ul>
 </div>
</div>



If we look at the type of `soup.div`, it is a `bs4.element.Tag`. It is important to note that __the tag is the whole branch starting at that tag, not just the tag itself__.

In [8]:
print(type(soup.div))

<class 'bs4.element.Tag'>
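
To make the "whole branch" idea concrete, here is a minimal sketch. It uses a small hypothetical document (not the notebook's `soup_html`) and the names `demo` and `outer` are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical mini document, used only for this illustration
html = '<div class="outer"><div class="inner"><h2>Heading</h2><p>Body</p></div></div>'
demo = BeautifulSoup(html, 'html.parser')

outer = demo.div

# The tag knows its own name...
tag_name = outer.name
# ...but it also carries everything nested inside it
branch_text = outer.get_text()
nested_tags = [node.name for node in outer.descendants if node.name]

print(tag_name)      # div
print(branch_text)   # HeadingBody
print(nested_tags)   # ['div', 'h2', 'p']
```

Text nodes have a `name` of `None`, which is why the list comprehension filters on `node.name`.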


For known or single objects we can find them anywhere in the tree by tag. If we look for the nested div using `div.div`, we get the next 'branch' of the tree.

In [9]:
print(soup.div.div.prettify())

<div class="inner extraclass">
 <h2>
  This is a sub heading
 </h2>
 This is some text inside a string
 <ul>
  <li>
   list item one
  </li>
  <li>
   list item two
  </li>
  <li>
   list item three
   <a href="linkhref">
    a link in a list
   </a>
  </li>
  <li>
   list item four
  </li>
  <li>
   list item 5
  </li>
 </ul>
</div>



You can use this approach with any tag that exists.

In [10]:
print(soup.ul.prettify())

<ul>
 <li>
  list item one
 </li>
 <li>
  list item two
 </li>
 <li>
  list item three
  <a href="linkhref">
   a link in a list
  </a>
 </li>
 <li>
  list item four
 </li>
 <li>
  list item 5
 </li>
</ul>



If we want to find the next element but don't know what tag it is, we need to navigate.

## Navigating up and down

If we take the following structure
```
html
    head
    body
        h1
        h2
        div
            div
                h2
                ul
                    li
                    li
                    li
                        a
                    li
                    li
        h2
        p
            span
                b
            br
            br
            br
            br
            img
            a
        div
            h2
        footer
```
Then `soup.div.div.a` would be the hyperlink.

In [11]:
print(soup.div.div.a.prettify())

<a href="linkhref">
 a link in a list
</a>



The document contains three divs, but dot syntax only ever returns the first match: `soup.div` is the outer div and `soup.div.div` is the div nested inside it. Neither expression finds the last div (`bottomdiv`), which is an immediate child of body. Effectively, `.tagname` is a shortcut for the `find()` method of a tag object (explained later).
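
As a rough illustration of the `.tagname` and `find()` relationship, the sketch below uses a small hypothetical document with three divs (the `id` values are made up for the example). It also uses `find_all()` to show how the divs that dot syntax skips can still be reached.

```python
from bs4 import BeautifulSoup

# Hypothetical document: an outer div, a nested div, and a second top-level div
html = '<body><div id="a"><div id="b"></div></div><div id="c"></div></body>'
demo = BeautifulSoup(html, 'html.parser')

first = demo.div           # dot syntax
same = demo.find('div')    # explicit find()
is_same_object = first is same

# find_all() returns every match in document order, including the last div
all_ids = [d['id'] for d in demo.find_all('div')]

print(is_same_object)  # True
print(all_ids)         # ['a', 'b', 'c']
```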

In [12]:
# Find the parent of the first h2, which sits directly under body
print(soup.h2.parent.name)

# This prints None because the nested div has no sibling tags after it
print(soup.div.div.find_next_sibling())

# Print the type of the first div
print(type(soup.div))

body
None
<class 'bs4.element.Tag'>
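
Here is a small sketch of moving up and sideways from a tag. The document is a hypothetical one written without line breaks, so whitespace strings do not get in the way.

```python
from bs4 import BeautifulSoup

# Hypothetical document with no whitespace between tags
html = '<body><h1>Title</h1><div><p>one</p><p>two</p></div><footer>end</footer></body>'
demo = BeautifulSoup(html, 'html.parser')

first_p = demo.p
parent_name = first_p.parent.name                    # up one level
next_text = first_p.find_next_sibling().get_text()   # sideways, forwards
previous = first_p.find_previous_sibling()           # sideways, backwards
after_div = demo.div.find_next_sibling().name
before_div = demo.div.find_previous_sibling().name

print(parent_name)  # div
print(next_text)    # two
print(previous)     # None
print(after_div)    # footer
print(before_div)   # h1
```

`find_next_sibling()` and `find_previous_sibling()` return `None` when there is no sibling tag in that direction, so it is worth checking the result before chaining further calls.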


### Children that are navigable strings
The children of tags can be text. Beautiful Soup uses `NavigableString` objects to represent these. Be aware that the line breaks between tags are also treated as `NavigableString` objects.

In [13]:
# The next sibling of the nested div is actually text
print(type(soup.div.div.next_sibling))

<class 'bs4.element.NavigableString'>


With formatted HTML you can get unexpected results, because the first child or next sibling of a tag can be a string that contains only a line break.

In [14]:
# The next sibling of the div inside the first div is a NavigableString that holds only whitespace.
# Stripping it leaves an empty string.
print('next sibling is an empty string', str(soup.div.div.next_sibling).strip() == '')

next sibling is an empty string True


If we look at the parent, we can see that its next sibling is also a whitespace-only string.

In [15]:
print('next sibling is an empty string', str(soup.div.div.parent.next_sibling).strip() == '')

next sibling is an empty string True


In [16]:
# Notice that this statement returns a newline string, NOT the h2 tag. In other words, the markup looks like this: </div>\n<h2>
soup.div.div.parent.next_sibling

'\n'

You can iterate over siblings:

In [17]:
for sib in soup.div.next_siblings:
    print('---------')
    print(sib.name, 'type:', type(sib))

---------
None type: <class 'bs4.element.NavigableString'>
---------
h2 type: <class 'bs4.element.Tag'>
---------
None type: <class 'bs4.element.NavigableString'>
---------
p type: <class 'bs4.element.Tag'>
---------
None type: <class 'bs4.element.NavigableString'>
---------
div type: <class 'bs4.element.Tag'>
---------
None type: <class 'bs4.element.NavigableString'>
---------
footer type: <class 'bs4.element.Tag'>
---------
None type: <class 'bs4.element.NavigableString'>
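
If only the tags matter, the whitespace strings can be filtered out. The sketch below shows two ways of doing that on a small hypothetical document: an explicit `isinstance` check, and `find_next_siblings()`, which with no arguments returns only tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical formatted document: every tag is followed by a line break
html = """<body>
<div>first</div>
<h2>heading</h2>
<p>paragraph</p>
</body>"""
demo = BeautifulSoup(html, 'html.parser')

all_siblings = list(demo.div.next_siblings)

# Option 1: keep only Tag objects
tag_siblings = [sib.name for sib in all_siblings if isinstance(sib, Tag)]

# Option 2: let Beautiful Soup skip the strings
found_siblings = [sib.name for sib in demo.div.find_next_siblings()]

print(len(all_siblings))  # 5 (three newline strings and two tags)
print(tag_siblings)       # ['h2', 'p']
print(found_siblings)     # ['h2', 'p']
```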


So the siblings of the first div tag look something like this:
`\n<h2>This is another sub title</h2>\n<p class="pclass"></p>\n<div class="bottomdiv"></div>\n<footer></footer>\n`

In [18]:
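# Collect the children of the first div, skipping newline-only strings
# (new_contents is not used again in this notebook)
new_contents = []
for c in soup.div.contents:
    if c != "\n":
        new_contents.append(c)

# unwrap() removes the outer div tag itself but keeps its children in place
soup.div.unwrap()
print(soup.prettify())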

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Document
  </title>
 </head>
 <body>
  <h1>
   This is the title
  </h1>
  <h2>
   This is a sub title
  </h2>
  <div class="inner extraclass">
   <h2>
    This is a sub heading
   </h2>
   This is some text inside a string
   <ul>
    <li>
     list item one
    </li>
    <li>
     list item two
    </li>
    <li>
     list item three
     <a href="linkhref">
      a link in a list
     </a>
    </li>
    <li>
     list item four
    </li>
    <li>
     list item 5
    </li>
   </ul>
  </div>
  <h2>
   This is another sub title
  </h2>
  <p class="pclass">
   This is a paragraph
   <span>
    <b>
     Bold text in front of
    </b>
    plain text
   </span>
   <br/>
   <br/>
   <br/>
   <br/>
   <img alt="someimgage" src="images/someimgage.jpg"/>
   <a href="somehyperlink">
    some content
   </a>
  </p>
  <div class="bottomdiv">
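
The effect of the `unwrap()` call is easier to see on a tiny document. This is a minimal sketch with hypothetical markup; `removed` is just an illustrative name for the value `unwrap()` returns.

```python
from bs4 import BeautifulSoup

# Hypothetical document: an outer div wrapping an inner div
html = '<body><div class="outer"><div class="inner">text</div></div></body>'
demo = BeautifulSoup(html, 'html.parser')

# unwrap() modifies the tree in place and returns the tag it removed
removed = demo.div.unwrap()
removed_class = removed.get('class')
body_after = str(demo.body)
first_div_class = demo.div['class']

print(removed_class)    # ['outer']
print(body_after)       # <body><div class="inner">text</div></body>
print(first_div_class)  # ['inner']
```

Because the tree is changed in place, `demo.div` points at a different tag after the call. The same applies to `soup.div` in this notebook from here on.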
  

## Dealing with recursion

## Dealing with attributes

Each tag stores its attributes in a dictionary called `.attrs`. Most values are strings, but multi-valued attributes such as `class` are stored as a `list`. Because the outer div was unwrapped in the previous cell, `soup.div` now refers to the inner div:

```html
    <div class="inner extraclass">
        <h2>This is a sub heading</h2>
        This is some text inside a string
        <ul>
            <li>list item one</li>
            <li>list item two</li>
            <li>list item three <a href="linkhref">a link in a list</a></li>
            <li>list item four</li>
            <li>list item 5</li>
        </ul>
    </div>
```

In [None]:
print('The divs attrs type is', type(soup.div.attrs))
# <class 'dict'>

In [None]:
print('attr len is', len(soup.div.attrs))
# 1

In [None]:

# You can't use an integer index because attrs is a dictionary
# print(soup.div.attrs[0])

In [None]:

print('class attr type is', type(soup.div.attrs['class']))
# <class 'list'>

In [None]:

print('length of the class attr is', len(soup.find('div').attrs['class']))
# Notice that this gives 2, because bs4 splits the class attribute into separate class names

In [None]:
# This raises a KeyError because the div has no 'foo' attribute
# print(soup.div.attrs['foo'])
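
To avoid the `KeyError`, a tag can be queried in a few safer ways. The sketch below uses a hypothetical one-tag document; the `id` and the `'default'` fallback value are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a multi-valued class and a single-valued id
html = '<div class="inner extraclass" id="main"><a href="linkhref">link</a></div>'
demo = BeautifulSoup(html, 'html.parser')
div = demo.div

missing = div.get('foo')                   # None instead of a KeyError
fallback = div.get('foo', 'default')       # supply your own default
has_class = div.has_attr('class')          # explicit presence test
has_foo = 'foo' in div.attrs               # plain dictionary membership

print(missing, fallback, has_class, has_foo)  # None default True False
print(div['id'])       # main (a string)
print(div['class'])    # ['inner', 'extraclass'] (a list)
print(demo.a['href'])  # linkhref
```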

Other variations

See [this Stack Overflow question](https://stackoverflow.com/questions/5015483/test-if-an-attribute-is-present-in-a-tag-in-beautifulsoup) for many more suggestions.

In [64]:
soup.select("[class=pclass]")

[<p class="pclass">
         This is a paragraph
         <span><b>Bold text in front of</b>plain text</span>
 <br/>
 <br/>
 <br/>
 <br/>
 <img alt="someimgage" src="images/someimgage.jpg"/>
 <a href="somehyperlink">some content</a>
 </p>]

In [67]:
for pea in soup.select("[class]"):
    print('-------')
    print(pea)  # do something useful, like strip the CSS

-------
<div class="inner extraclass">
<h2>This is a sub heading</h2>
            This is some text inside a string
            <ul>
<li>list item one</li>
<li>list item two</li>
<li>list item three <a href="linkhref">a link in a list</a></li>
<li>list item four</li>
<li>list item 5</li>
</ul>
</div>
-------
<p class="pclass">
        This is a paragraph
        <span><b>Bold text in front of</b>plain text</span>
<br/>
<br/>
<br/>
<br/>
<img alt="someimgage" src="images/someimgage.jpg"/>
<a href="somehyperlink">some content</a>
</p>
-------
<div class="bottomdiv">
<h2>This is another sub heading</h2>
        this is at the bottomdi
    </div>
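
`select()` accepts other CSS selectors as well as attribute selectors. Here is a short sketch on a hypothetical cut-down version of the sample document, showing class selectors, `select_one()`, and a descendant selector.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down version of the sample document
html = '''<body>
<div class="inner extraclass"><a href="linkhref">a link</a></div>
<p class="pclass">para <a href="somehyperlink">some content</a></p>
<div class="bottomdiv">bottom</div>
</body>'''
demo = BeautifulSoup(html, 'html.parser')

# .classname matches one class name, even when the tag has several
by_class = [t.name for t in demo.select('.pclass')]
# select_one() returns the first match (or None) instead of a list
inner_classes = demo.select_one('div.inner.extraclass')['class']
# Descendant selector: links with an href inside the paragraph
links = [a['href'] for a in demo.select('p.pclass a[href]')]
# Tag name combined with an attribute-presence test
div_classes = [t['class'] for t in demo.select('div[class]')]

print(by_class)       # ['p']
print(inner_classes)  # ['inner', 'extraclass']
print(links)          # ['somehyperlink']
print(div_classes)    # [['inner', 'extraclass'], ['bottomdiv']]
```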
