# Introduction to `BeautifulSoup`

### BeautifulSoup

- open-source Python library
- extract data from HTML files
- understands HTML structure by working with a [parser](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) (`lxml`, `html5lib`, etc.) 
- [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for reference
<br> <br>

`BeautifulSoup` does not actually gather information from the web.  We will use the `requests` library for that.

# Learn to Scrape with Simple Inline HTML

Let's start with this simple HTML page given below as a string: 

In [1]:
simple_html = """
<html>

<head>
  <style>
    li {font-size: 18px;}
  </style>
</head>

<body>
  <div style="border-style: dotted; padding: 10px">
    <h1>Today's Learning Objectives</h1>
    <ul>
      <li>Decipher basic HTML</li>
      <li>Retrieve information from Internet</li>
      <li>Parse web data</li>
      <li>Gather and prepare data systematically</li>
    </ul>
    <br>
  </div>
</body>

</html>
"""

Now we will tell Python to render this string as HTML.

In [2]:
from IPython.display import display, HTML
display(HTML(simple_html)) 

This simple "page" contains a list of learning objectives for today's workshop. Now we will see how `BeautifulSoup` can extract information from this HTML.

First we need to import `BeautifulSoup` and parse the HTML string.

In [3]:
from bs4 import BeautifulSoup as bs

In [4]:
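soup = bs(simple_html, 'html.parser')

In [5]:
soup


<html>

<head>
  <style>
    li {font-size: 18px;}
  </style>
</head>

<body>
  <div style="border-style: dotted; padding: 10px">
    <h1>Today's Learning Objectives</h1>
    <ul>
      <li>Decipher basic HTML</li>
      <li>Retrieve information from Internet</li>
      <li>Parse web data</li>
      <li>Gather and prepare data systematically</li>
    </ul>
    <br/>
  </div>
</body>

</html>

When we print out `soup`, we see the whole document again. `BeautifulSoup` has parsed the string into a tree of elements that it knows how to navigate.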

In [6]:
type(soup)

bs4.BeautifulSoup
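
The parser can be named explicitly as the second argument when creating the soup. The sketch below is a minimal illustration, assuming Python's built-in `html.parser` is good enough for the page at hand; `snippet` and `demo_soup` are hypothetical names used only here.

```python
from bs4 import BeautifulSoup

# A tiny, hypothetical HTML snippet used only for this illustration.
snippet = "<html><body><h1>Hello</h1><p>World</p></body></html>"

# Naming the parser keeps results consistent from one machine to the next.
# "html.parser" ships with Python; "lxml" and "html5lib" are optional
# third-party parsers that must be installed separately.
demo_soup = BeautifulSoup(snippet, "html.parser")

print(type(demo_soup).__name__)  # BeautifulSoup
print(demo_soup.h1)              # <h1>Hello</h1>
```

Different parsers can build slightly different trees from malformed HTML, so naming one makes a scraper's behavior easier to reproduce.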

### Find by tag

We begin by using the `find()` method to extract the header of our HTML.

In [7]:
soup.find('h1')

<h1>Today's Learning Objectives</h1>

In [8]:
type(soup.find('h1'))

bs4.element.Tag

`find()` returns a tagged element, but we can grab just the inner HTML text instead.

In [9]:
soup.find('h1').text

"Today's Learning Objectives"

In [10]:
type(soup.find('h1').text)

str

We now have a way to extract text from a webpage -- powerful stuff!  

What do you think will be returned if we look for list tags (`li`)?

In [11]:
soup.find('li')

<li>Decipher basic HTML</li>

**Warning**: `BeautifulSoup` returns ONLY the FIRST matching element when we use `find()`.
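
A related detail: when nothing matches, `find()` returns `None`, and calling `.text` on `None` raises an `AttributeError`. Here is a minimal sketch of both behaviors, using a hypothetical `demo_html` string rather than the workshop page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
demo_html = "<ul><li>first</li><li>second</li></ul>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

first_item = demo_soup.find("li")   # only the first <li> is returned
missing = demo_soup.find("table")   # there is no <table>, so this is None

print(first_item.text)  # first
print(missing)          # None

# Check for None before asking for .text to avoid an AttributeError.
table_text = missing.text if missing is not None else ""
```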

### Find all

If we would like `BeautifulSoup` to return ALL matching elements, we can use `find_all()` instead.

In [12]:
soup.find_all('li')

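[<li>Decipher basic HTML</li>,
 <li>Retrieve information from Internet</li>,
 <li>Parse web data</li>,
 <li>Gather and prepare data systematically</li>]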

In [13]:
type(soup.find_all('li'))

bs4.element.ResultSet

Using `find_all()` yields a result set containing all of the list elements on the "page."  You can think of a result set as acting like a list.
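
To make the list comparison concrete, the sketch below applies ordinary list operations (length, indexing, slicing) to a result set. The `demo_html` string and the `items` name are hypothetical and used only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
demo_html = "<ul><li>a</li><li>b</li><li>c</li></ul>"
items = BeautifulSoup(demo_html, "html.parser").find_all("li")

print(len(items))                     # 3
print(items[0].text)                  # a
print([li.text for li in items[1:]])  # ['b', 'c']
```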

**Warning**: `BeautifulSoup` does not allow you to apply `.text` to a result set.  The following code **will fail**.

In [14]:
soup.find_all('li').text

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

Instead, you must apply `.text` to each item in the result set individually.

In [15]:
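for item in soup.find_all('li'):
    print(item.text)

Decipher basic HTML
Retrieve information from Internet
Parse web data
Gather and prepare data systematically

In [16]:
learning_objectives = [item.text for item in soup.find_all('li')]

learning_objectives

['Decipher basic HTML',
 'Retrieve information from Internet',
 'Parse web data',
 'Gather and prepare data systematically']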

**Tip**: The two **most common mistakes** I see in web scraping with `BeautifulSoup` are:
- Using `find()` when you really want `find_all()`
- Attempting to apply `.text` to a result set like the output of `find_all()`

In [17]:
workshop_html = """
<html>

<body>
  <h1>Today's Workshop</h1>
  <div id='agenda' style="background-color: aliceblue">
    <h2>Agenda</h2>
    <p>Today's workshop is comprised of three main sections:</p>
    <ol>
      <li>HTML Basics</li>
      <li>Scraping Basics</li>
      <li>Scraping Pipeline</li>
    </ol>
  </div>
  
  <div id='tools' style='background-color: honeydew'>
    <h2>Tools</h2>
    <p>You will be learning about two primary Python libraries:</p>  
    <ol>
      <li>BeautifulSoup</li>
      <li>requests</li>
    </ol>
  </div>
</body>

</html>
"""

In [18]:
from IPython.display import display, HTML
display(HTML(workshop_html)) 

> Parse `workshop_html` with `BeautifulSoup`.  Find the main header text (`h1`) and save it in a variable.  Verify that you have the text by checking the `type` of your variable.

In [19]:
soup = bs(workshop_html, 'html.parser')

In [20]:
header = soup.find('h1').text
print(header)

Today's Workshop


In [21]:
type(header)

str

Now find all the paragraphs in `workshop_html` and print out the text that you find.

In [22]:
# Print the text of every paragraph in the document
for paragraph in soup.find_all('p'):
    print(paragraph.text)

Today's workshop is comprised of three main sections:
You will be learning about two primary Python libraries:


Create a list of all of the agenda items for today's workshop.  Be sure to store only the TEXT for the AGENDA items!

In [23]:
agenda_items = [li.text for li in soup.find_all('li')[:3]]

print(agenda_items)

['HTML Basics', 'Scraping Basics', 'Scraping Pipeline']


In [24]:
# Later we will learn a better way:
# first look for the div that contains the agenda items

agenda_div = soup.find('div', id='agenda')

agenda_items = [li.text for li in agenda_div.find_all('li')]

print(agenda_items)

['HTML Basics', 'Scraping Basics', 'Scraping Pipeline']
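
As an aside, the same "look inside one container" idea can also be written as a CSS selector with `select()`. This is an alternative technique, not the approach the workshop builds on, and the sketch uses a shortened, hypothetical copy of the page:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the workshop page.
demo_html = """
<div id="agenda"><ol><li>HTML Basics</li><li>Scraping Basics</li></ol></div>
<div id="tools"><ol><li>BeautifulSoup</li><li>requests</li></ol></div>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# "#agenda li" selects every <li> nested inside the element with id="agenda".
agenda_only = [li.text for li in demo_soup.select("#agenda li")]
print(agenda_only)  # ['HTML Basics', 'Scraping Basics']
```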


# Scrape Test Webpage

In the last exercise, we found out that searching by HTML tag alone is often not granular enough.  

Let's work with a more complicated HTML file to see what other options are available.

First, download this file to your computer and save it as `data/bootcamp.html`, the path that the next cell reads from.

In [27]:
with open('data/bootcamp.html') as f:
    bootcamp_html = f.read()

In [28]:
print(bootcamp_html)

<html>
    <head>
        <title>Data Science Bootcamp Info</title>

        <style>
            body {
                background-color: cornsilk;
            }

            h1 {
                font-size: 40px;
                font-family: courier new, arial;
                text-align: center;
                margin-top: 50px;
            }

            a {
                color: #411B2D;
                font-size: 20px;
            }

            p {
                font-size: 20px;
            }

            a:hover{
                color: white;
                background-color: #411B2D;
            }

            #toolbar {
                background-color: #F3B643;
                font-family: courier new, arial;
                font-weight: bold;
                font-size: 16px;
                display: flex;
                justify-content: space-around;
                flex-direction: row;
                border: 1px solid black;
                border-radius: 1px;
         

Since our HTML is a string, we can parse it with `BeautifulSoup` and begin collecting data.  
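
Reading the file into a string first is one option; `BeautifulSoup` also accepts an open file handle directly. The sketch below writes a small stand-in file to a temporary directory so that it runs on its own. The file name, its contents, and the variable names are all hypothetical.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small stand-in HTML file (hypothetical content).
    demo_path = Path(tmp_dir) / "demo.html"
    demo_path.write_text(
        "<html><head><title>Demo Page</title></head></html>",
        encoding="utf-8",
    )

    # Pass the open file handle straight to BeautifulSoup.
    with open(demo_path, encoding="utf-8") as html_file:
        demo_soup = BeautifulSoup(html_file, "html.parser")

print(demo_soup.title.text)  # Demo Page
```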

Let's say we are interested in gathering the titles and links of events happening today.  Links can be found by looking for anchor (`a`) tags.  

In [29]:
soup = bs(bootcamp_html, 'html.parser')

In [30]:
soup.find_all('a')

[<a href="https://us.dsbc.org/about/">WHAT IS DSBC?</a>,
 <a href="https://us.dsbc.org/tutorials/">TUTORIAL SCHEDULE</a>,
 <a href="https://us.dsbc.org/speaking/">SPEAKING AT DSBC</a>,
 <a href="https://us.dsbc.org/schedule/presentation/50/">Foundations of Numerical Computing in Python</a>,
 <a href="https://us.dsbc.org/schedule/presentation/72/">It's Officially Legal so Let's Scrape the Web</a>,
 <a href="https://us.dsbc.org/schedule/presentation/54/">A Beginner's Guide to Befriending Python</a>,
 <a href="https://us.dsbc.org/schedule/presentation/55/">Scalable Computing with Dask</a>,
 <a href="https://us.dsbc.org/schedule/presentation/63/">Creating a Great Python Package</a>,
 <a href="https://us.dsbc.org/schedule/presentation/45/">Minimum Viable Documentation</a>,
 <a href="https://us.dsbc.org/schedule/presentation/74/">Effective Data Visualization</a>]

Whoa -- there are many more links on this page than just today's events!

### Find by attribute

In order to drill down to just the links we are interested in, notice that today's events are contained within a `div` that has `id=today`.  We can first isolate this `div` by searching for it by its `id`.

In [None]:
today_div = soup.find(id='today')

today_div

In [None]:
type(today_div)

Now we will look for all of the anchor tags that are contained within this division.

In [None]:
today_div.find_all('a')
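
The effect of narrowing the search can be seen on a small stand-in page. The markup below is hypothetical: it only mimics the structure described above (a `div` with `id="today"` holding some of the links) and is not the contents of `bootcamp.html`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above.
demo_html = """
<a href="/about/">About</a>
<div id="today"><a href="/talk/1/">Talk One</a><a href="/talk/2/">Talk Two</a></div>
<div id="tomorrow"><a href="/talk/3/">Talk Three</a></div>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

all_links = demo_soup.find_all("a")                  # every link on the page
today_only = demo_soup.find(id="today").find_all("a")  # links inside one div

print(len(all_links))                 # 4
print([a.text for a in today_only])   # ['Talk One', 'Talk Two']
```

Calling `find_all()` on a tag rather than on the whole soup searches only that tag's descendants.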

**Tip**:  You can find elements by pretty much any attribute.  Let's find elements that are members of the `events` class.

In [None]:
soup.find_all(class_='events')

Passing a dictionary of attributes works as well.

In [None]:
soup.find_all(attrs={'class':'events', 'id': 'tomorrow'}) 
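
Here is a small sketch of how these two search styles behave, again on hypothetical markup rather than the real bootcamp page. Note that an element carrying several classes still matches a search for any one of them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
demo_html = """
<div class="events" id="today">Today</div>
<div class="events featured" id="tomorrow">Tomorrow</div>
<div class="sponsors" id="gold">Gold</div>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# class_ has a trailing underscore because `class` is a reserved word in Python.
by_class = demo_soup.find_all(class_="events")

# With attrs, every key/value pair must match.
by_both = demo_soup.find_all(attrs={"class": "events", "id": "tomorrow"})

print([d["id"] for d in by_class])  # ['today', 'tomorrow']
print([d["id"] for d in by_both])   # ['tomorrow']
```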

### Retrieve attributes

If we want to just get the names of today's events, we can simply cycle through today's links and collect the `.text`.

In [None]:
today_text = [link.text for link in today_div.find_all('a')]

today_text

But what would we do if we wanted the **hyperlinks** to each of those events?

`BeautifulSoup` allows you to retrieve element attributes.  You reference them using the same syntax as a dictionary key lookup.

In [None]:
today_div.find('a')

In [None]:
today_div.find('a')['href']

In [None]:
type(today_div.find('a')['href'])

In [None]:
today_links = [link['href'] for link in today_div.find_all('a')]

today_links
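
One caveat with dictionary-style access: it raises a `KeyError` when the attribute is absent. The sketch below, on hypothetical markup, shows the `.get()` and `.has_attr()` alternatives for pages where some anchors may lack an `href`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor has no href attribute.
demo_html = '<a href="/talk/1/">Talk One</a><a>No link yet</a>'
anchors = BeautifulSoup(demo_html, "html.parser").find_all("a")

print(anchors[0]["href"])  # /talk/1/
print(anchors[0].attrs)    # {'href': '/talk/1/'}

# anchors[1]["href"] would raise KeyError; .get() returns None instead.
print(anchors[1].get("href"))  # None

# Keep only the anchors that actually carry a link.
links = [a["href"] for a in anchors if a.has_attr("href")]
print(links)  # ['/talk/1/']
```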

Create a list of tuples, one for each of tomorrow's events.  The first element of each tuple will be the event title and the second will be the event link.

In [None]:
tomorrow_tuples = [(a.text, a['href']) for a in soup.find(id='tomorrow').find_all('a')]

tomorrow_tuples

Using `bootcamp_html`, find the header text for today's and tomorrow's events by referencing the `events` class.

In [None]:
event_headers = [div.find('h2') for div in soup.find_all(class_='events')]

event_header_text = [header.text for header in event_headers]

event_header_text