# Tutorial - Beautiful Soup

### What is Beautiful Soup?

**Beautiful Soup** is a Python package for extracting data from HTML and XML documents. Other packages provide more powerful tools but, since Beautiful Soup is friendlier, most **web scraping** practitioners start there and only move on when their projects become really complex.

This tutorial covers the very basics of Beautiful Soup using a toy example of HTML code. It assumes that version 4 of the package is already installed on your computer. If it is not, you can install it by entering `pip install beautifulsoup4` in the operating system shell (or in the Jupyter console).

### A toy HTML example

An extremely simple example of an **HTML document** follows.

    <html>
      <head>
        <title>Data Viz</title>
      </head>
      <body>
        <div class="course">Data Visualization</div>
        <div class="program">MBA full-time</div>
        <a class="professor" href="https://www.iese.edu/faculty-research/faculty/miguel-angel-canela">Miguel Ángel Canela</a>
      </body>
    </html>

You can use Beautiful Soup with any string that contains HTML code. To use the above example, I create a string variable whose value is the HTML code. Note the backslash (`\`) at the end of each line: it lets the string literal continue on the next line, so the line breaks themselves are not stored in the string.

In [15]:
html_str = '<html> \
  <head> \
  <title>Data Viz</title> \
  </head> \
  <body> \
  <div class="course">Data Visualization</div> \
  <div class="program">MBA full-time</div> \
  <a class="professor" \
  href="https://www.iese.edu/faculty-research/faculty/miguel-angel-canela">Miguel Ángel Canela</a> \
  </body> \
  </html>'

In [16]:
html_str

'<html>   <head>   <title>Data Viz</title>   </head>   <body>   <div class="course">Data Visualization</div>   <div class="program">MBA full-time</div>   <a class="professor"   href="https://www.iese.edu/faculty-research/faculty/miguel-angel-canela">Miguel Ángel Canela</a>   </body>   </html>'
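The difference between the backslash continuation and a string that really contains line breaks can be checked directly. The following is a minimal sketch on a shortened version of the toy HTML; the names `html_backslash` and `html_triple` are illustrative and not part of the original notebook.

```python
# Backslash continuation: the line breaks are NOT stored in the string.
html_backslash = '<html> \
<body> \
<div class="course">Data Visualization</div> \
</body> \
</html>'

# Triple-quoted string: the line breaks ARE stored, as '\n' characters.
html_triple = '''<html>
<body>
<div class="course">Data Visualization</div>
</body>
</html>'''

print('\n' in html_backslash)  # False
print('\n' in html_triple)     # True
```

Either form can be passed to Beautiful Soup; with the triple-quoted form, the line breaks simply appear as extra whitespace between the tags.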

### Parsing HTML code

I import the class `BeautifulSoup` as follows:

In [17]:
from bs4 import BeautifulSoup

Calling `BeautifulSoup` parses the string `html_str` and learns its tree structure. To do this, I enter:

In [18]:
soup = BeautifulSoup(html_str, 'html.parser')

`BeautifulSoup` returns a "soup" object, which stores the contents of `html_str` in such a way that the different pieces of information can be extracted. To do this, it uses a **parser**, a program that breaks the string into pieces based on the tags.

Beautiful Soup does not come with its own parser. If you do not specify one, it uses the one it prefers among those available on your computer. If `'html.parser'` is specified, it uses the parser from the Python Standard Library, so you do not need any additional package. Since this is a rather technical issue, it is better to start this way.
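To make the idea of a parser more concrete, here is a rough sketch that uses the Standard Library module `html.parser` directly, without Beautiful Soup. The class `EventLogger` is a hypothetical helper written for this illustration; it only records the pieces that the parser recognizes and does not show how Beautiful Soup works internally.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records the pieces the parser recognizes."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(('start', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        # Skip pieces of text that contain only whitespace
        if data.strip():
            self.events.append(('text', data.strip()))


logger = EventLogger()
logger.feed('<div class="course">Data Visualization</div>')
logger.close()

for event in logger.events:
    print(event)
# ('start', 'div', {'class': 'course'})
# ('text', 'Data Visualization')
# ('end', 'div')
```

Beautiful Soup takes care of turning pieces like these into a tree, which is why you rarely need to deal with the parser yourself.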

The object `soup` has a special type:

In [19]:
type(soup)

bs4.BeautifulSoup

The contents of `soup` can be displayed:

In [20]:
soup

<html> <head> <title>Data Viz</title> </head> <body> <div class="course">Data Visualization</div> <div class="program">MBA full-time</div> <a class="professor" href="https://www.iese.edu/faculty-research/faculty/miguel-angel-canela">Miguel Ángel Canela</a> </body> </html>

The same is true for the elements contained in `soup`, as we will see below. But first we have to see how to extract these elements from the soup.

### The method find

`BeautifulSoup` objects come with several methods attached. This tutorial focuses on two of them, `find` and `find_all`. A first example of `find`:

In [26]:
soup.find('head')

<head> <title>Data Viz</title> </head>

`find` returns an object of type `Tag`:

In [27]:
type(soup.find('head'))

bs4.element.Tag

Though, formally, `BeautifulSoup` and `Tag` are different types, in practice a tag works as a smaller soup:

In [28]:
soup.find('head').find('title')

<title>Data Viz</title>
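The chained call above can be split into steps, which may be easier to read. The sketch below uses a shortened copy of the toy HTML, and the variable names `head` and `title` are illustrative. It also shows that `find` searches all the descendants of an element, and that Beautiful Soup accepts the tag name as an attribute (`soup.head.title`) as a shortcut.

```python
from bs4 import BeautifulSoup

# Shortened copy of the toy example, so the block runs on its own
html_str = ('<html><head><title>Data Viz</title></head>'
            '<body><div class="course">Data Visualization</div></body></html>')
soup = BeautifulSoup(html_str, 'html.parser')

head = soup.find('head')      # a Tag
title = head.find('title')    # a Tag found inside another Tag
print(title)                  # <title>Data Viz</title>
print(title.name)             # title

# find looks through all descendants, so searching from the top also works
print(soup.find('title') == title)   # True

# Shortcut: using the tag name as an attribute returns the first such tag
print(soup.head.title == title)      # True
```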

So far, I'm using `find` with a single argument, which is the name of the tag I wish to capture. If there is no such tag, `find` returns `None`:

In [29]:
soup.find('head').find('div')

In [30]:
soup.find('head').find('div') is None

True
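Because `find` can return `None`, asking the result for its text without checking would raise an error. A minimal sketch of a defensive pattern follows, on a shortened copy of the toy HTML; the names `tag` and `result` are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the toy example, so the block runs on its own
html_str = ('<html><head><title>Data Viz</title></head>'
            '<body><div class="course">Data Visualization</div></body></html>')
soup = BeautifulSoup(html_str, 'html.parser')

# There is no <div> inside <head>, so find returns None
tag = soup.find('head').find('div')

# Test for None before using the result
if tag is None:
    result = 'no <div> inside <head>'
else:
    result = tag.text
print(result)   # no <div> inside <head>
```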

But there can also be more than one tag with the specified name in that element. Then, `find` returns the first one:

In [31]:
soup.find('div')

<div class="course">Data Visualization</div>

In this case, we can use the attributes to distinguish among tags with the same name:

In [38]:
soup.find('div', {'class': 'course'})

<div class="course">Data Visualization</div>

In general, the attribute values are specified in a dictionary, as shown in the above example. But the attribute `class`, which is the one used in most cases, is an exception, and can be specified in a shorter way:

In [34]:
soup.find('div', 'program')

<div class="program">MBA full-time</div>
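The different ways of specifying attributes can be compared side by side. The sketch below, on a shortened copy of the toy HTML, adds two forms that are not used elsewhere in this tutorial: the keyword argument `class_` (with a trailing underscore, because `class` is a reserved word in Python) and a keyword argument for another attribute, here `href=True`, which matches any tag that has that attribute. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the toy example, so the block runs on its own
html_str = ('<div class="course">Data Visualization</div>'
            '<div class="program">MBA full-time</div>'
            '<a class="professor" '
            'href="https://www.iese.edu/faculty-research/faculty/miguel-angel-canela">'
            'Miguel Ángel Canela</a>')
soup = BeautifulSoup(html_str, 'html.parser')

by_dict = soup.find('div', {'class': 'program'})   # dictionary form
by_position = soup.find('div', 'program')          # short form for class
by_keyword = soup.find('div', class_='program')    # keyword form for class

print(by_dict == by_position == by_keyword)   # True
print(by_keyword.text)                        # MBA full-time

# Any tag that has an href attribute, whatever its value
with_link = soup.find(href=True)
print(with_link.name)                         # a
```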

### The method find_all

The method `find_all` uses the same syntax as `find`, but returns a list with all the tags that satisfy the specification (strictly speaking, a `ResultSet`, which behaves like a list):

In [35]:
soup.find_all('div')

[<div class="course">Data Visualization</div>,
 <div class="program">MBA full-time</div>]

`find_all` *always returns a list*. The list can be empty (`find` would return `None` in that case): 

In [36]:
soup.find('head').find_all('div')

[]

When there is only one tag in the list, that tag is precisely the one returned by `find`:

In [37]:
soup.find_all('div', 'course')

[<div class="course">Data Visualization</div>]
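Since the result of `find_all` behaves like a list, the usual list operations apply. The following sketch, on a shortened copy of the toy HTML, checks this and compares the first element with the result of `find`. The argument `limit`, which caps the number of tags returned, is not discussed elsewhere in this tutorial; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the toy example, so the block runs on its own
html_str = ('<body><div class="course">Data Visualization</div>'
            '<div class="program">MBA full-time</div></body>')
soup = BeautifulSoup(html_str, 'html.parser')

divs = soup.find_all('div')
print(type(divs).__name__)      # ResultSet
print(isinstance(divs, list))   # True
print(len(divs))                # 2

# The first element of the list is the tag that find returns
print(divs[0] == soup.find('div'))   # True

# limit stops the search after the given number of matches
first_only = soup.find_all('div', limit=1)
print(len(first_only))          # 1
```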

### Extracting information from a tag element

The information we wish to extract from a tag element can come as text between the start tag and the end tag, or as the value of an attribute. The attribute `text` returns all the text between the tags (including the text of children tags, if there are any):

In [39]:
soup.find('a').text

'Miguel Ángel Canela'

Note that this attribute cannot be read directly from the list returned by `find_all`:

In [40]:
soup.find_all('div').text

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

The right way to extract the text from all the tags in that list is:

In [41]:
[t.text for t in soup.find_all('div')]

['Data Visualization', 'MBA full-time']
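To see what "including children tags" means, `text` can be applied to an element that contains other tags. In the sketch below, on a shortened copy of the toy HTML, the text of `<body>` is the text of its three children glued together. The method `get_text`, which is not used elsewhere in this tutorial, accepts a separator and a `strip` argument, which may give a cleaner result.

```python
from bs4 import BeautifulSoup

# Shortened copy of the toy example, so the block runs on its own
html_str = ('<body><div class="course">Data Visualization</div>'
            '<div class="program">MBA full-time</div>'
            '<a class="professor" href="https://www.iese.edu/faculty-research/faculty/miguel-angel-canela">'
            'Miguel Ángel Canela</a></body>')
soup = BeautifulSoup(html_str, 'html.parser')

body = soup.find('body')

# text concatenates the text of all the children, with nothing in between
print(body.text)
# Data VisualizationMBA full-timeMiguel Ángel Canela

# get_text can insert a separator and strip surrounding whitespace
print(body.get_text(' ', strip=True))
# Data Visualization MBA full-time Miguel Ángel Canela
```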

Beautiful Soup stores the attributes of a tag as a dictionary. So, to extract the value of an attribute, we use the attribute name as a key:

In [43]:
soup.find('a')['href']

'https://www.iese.edu/faculty-research/faculty/miguel-angel-canela'
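The dictionary itself is available as `attrs`. The sketch below, on a shortened copy of the toy HTML, displays it and shows what happens when an attribute is missing: square brackets raise a `KeyError`, while the method `get` returns `None`, as with ordinary dictionaries. The attribute name `title` is only an illustration of an attribute that the tag does not have.

```python
from bs4 import BeautifulSoup

# Shortened copy of the toy example, so the block runs on its own
html_str = ('<a class="professor" '
            'href="https://www.iese.edu/faculty-research/faculty/miguel-angel-canela">'
            'Miguel Ángel Canela</a>')
soup = BeautifulSoup(html_str, 'html.parser')

link = soup.find('a')

# All the attributes; note that the value of class is a list,
# because an HTML element can have several classes
print(link.attrs)

print(link['href'])            # the URL
print(link.has_attr('title'))  # False
print(link.get('title'))       # None, no error

try:
    link['title']
except KeyError:
    print('no title attribute')
```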

### Homework

IESE Business School displays information about its faculty members on 11 web pages. The URL of the second one is `https://www.iese.edu/search/professors/2`. You can get the source code of the page through the contextual menu that opens when you right-click anywhere on the page. The file `iese.html` contains that code, slightly edited by dropping tabs and line breaks that may trip up Python when copy-pasting the code into the console.

1. Copy the code from the file `iese.html`, enter `html_str = ''` in the Python console and paste the code between the quote marks. Then press the `Return` key. You will then have the source code as a string in Python. This string is much longer than the one in my toy example, and contains other tags, like `<ul>`, `<li>` and `<script>`.

2. Use the tools presented in this tutorial to parse `html_str`, extracting three lists, with the professors' names (e.g. "Miguel Ángel Canela"), the professors' descriptions (e.g. "Associate Professor of Managerial Decision Sciences") and the links to the professors' individual pages (e.g. "https://www.iese.edu/faculty-research/faculty/miguel-angel-canela"), respectively.

3. Use the function `pd.DataFrame` to create a data frame with three columns, `name`, `description` and `link`, containing the data of these lists.

4. Use the method `.to_excel` to export the data collected to an Excel file.