# Introduction to BeautifulSoup


Let's begin with the first code you saw earlier in this lecture:

In [None]:
from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

<p>This command outputs the complete HTML code for <em>page1</em> located at the URL http://pythonscraping.com/pages/page1.html. 

More accurately, this outputs the HTML file <em>page1.html</em>, found in the directory <em>&lt;web root&gt;/pages</em>, on the server located at the domain name <a class="link" href="http://pythonscraping.com">http://pythonscraping.com</a>.</p>
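To see how the URL maps onto the domain, directory, and file name described above, here is a small offline sketch that uses only the standard library. The variable names are illustrative, and no network request is made.

```python
from urllib.parse import urlparse
import posixpath

url = "http://pythonscraping.com/pages/page1.html"

# Split the URL into its components (scheme, network location, path, ...).
parts = urlparse(url)

# Separate the path into the directory and the file name.
directory, filename = posixpath.split(parts.path)

print("domain:   ", parts.netloc)
print("directory:", directory)
print("file:     ", filename)
```

The directory printed here is relative to the server's web root, whose actual location on the server's disk is not visible from the URL.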

___

**Now go to "Module 3  Class Exercise" notebook and complete Exercise 3.**

___

The most commonly used object in the BeautifulSoup library is the `BeautifulSoup` object itself, which you create by calling `BeautifulSoup(...)`. Let’s take a look at it in action by modifying the code above:

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)

**<font color='red'>Note 1:</font>** This returns only the *first instance* of the `h1` tag found on the page. 
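To make Note 1 concrete, here is a minimal offline sketch. The `SAMPLE_HTML` string is a made-up stand-in with two `h1` tags (it is not the content of page1.html). `find_all` is BeautifulSoup's method for retrieving every match rather than only the first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <h1> tags; NOT the content of page1.html.
SAMPLE_HTML = "<html><body><h1>First Title</h1><h1>Second Title</h1></body></html>"

bs = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_h1 = bs.h1             # attribute access returns only the first match
all_h1 = bs.find_all("h1")   # find_all returns every match, in document order

print(first_h1)
print([tag.get_text() for tag in all_h1])
```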

Now, let's see how this code actually works:

<p> When you run the above code, the HTML content is transformed into a <code>BeautifulSoup</code> object, with the following structure:</p>

<ul>
	<li>
	<p><strong>html</strong> → <em>&lt;html&gt;&lt;head&gt;...&lt;/head&gt;&lt;body&gt;...&lt;/body&gt;&lt;/html&gt;</em></p>

	<ul>
		<li>
		<p><strong>head</strong> → <em>&lt;head&gt;&lt;title&gt;A Useful Page&lt;/title&gt;&lt;/head&gt;</em></p>

		<ul>
			<li><strong>title</strong> → <em>&lt;title&gt;A Useful Page&lt;/title&gt;</em></li>
		</ul>
		</li>
		<li>
		<p><strong>body</strong> → <em>&lt;body&gt;&lt;h1&gt;An Int...&lt;/h1&gt;&lt;div&gt;Lorem ip...&lt;/div&gt;&lt;/body&gt;</em></p>

		<ul>
			<li><strong>h1</strong> → <em>&lt;h1&gt;An Interesting Title&lt;/h1&gt;</em></li>
			<li><strong>div</strong> → <em>&lt;div&gt;Lorem Ipsum dolor...&lt;/div&gt;</em></li>
		</ul>
		</li>
	</ul>
	</li>
</ul>
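The nesting shown above can also be inspected programmatically. The sketch below is only an illustration: `SAMPLE_HTML` is a shortened stand-in for the page, and `outline` is a hypothetical helper written for this example, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for page1.html, so the example runs without a network connection.
SAMPLE_HTML = """<html><head><title>A Useful Page</title></head>
<body><h1>An Interesting Title</h1><div>Lorem ipsum dolor...</div></body></html>"""


def outline(tag, depth=0):
    """Return the names of all nested tags, indented by their depth (hypothetical helper)."""
    lines = []
    for child in tag.children:
        # Skip plain text nodes; only Tag objects have a name and children of their own.
        if isinstance(child, Tag):
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines


bs = BeautifulSoup(SAMPLE_HTML, "html.parser")
tree = outline(bs)
print("\n".join(tree))
```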

**<font color='red'>Note 2:</font>** You can use the `print()` function and the `prettify()` method to see the structure:

In [None]:
print(bs.prettify())

**<font color='red'>Note 3:</font>** The `h1` tag that you extract from the page is nested two layers deep into your BeautifulSoup object structure `(html → body → h1)`. However, when you actually fetch it from the object, you call the `h1` tag directly:

`bs.h1`

In fact, any of the following expressions produces the same output:

* <code>bs.html.body.h1</code>

* <code>bs.body.h1</code>

* <code>bs.html.h1</code>
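If you want to check this offline, the sketch below compares the access paths on a small stand-in document. `SAMPLE_HTML` is an assumption made for this example and is shorter than the real page.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for page1.html (an assumption made for this example).
SAMPLE_HTML = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1><div>Lorem ipsum dolor...</div></body></html>"
)
bs = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Each path navigates through different ancestors but ends at the same tag.
paths = {
    "bs.h1": bs.h1,
    "bs.html.body.h1": bs.html.body.h1,
    "bs.body.h1": bs.body.h1,
    "bs.html.h1": bs.html.h1,
}
for label, tag in paths.items():
    print(f"{label:16} -> {tag}")

# True when every path returned the very same Tag object.
same_tag = all(tag is bs.h1 for tag in paths.values())
print(same_tag)
```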

Give them a try in the cell below (we will discuss this in depth later on):

In [1]:
# test different tags here:




___

**Now go to "Module 3  Class Exercise" notebook and complete Exercise 4.**

___

## BeautifulSoup() input arguments:

As you saw earlier, we called `BeautifulSoup` with two arguments: `BeautifulSoup(markup, "html.parser")`. The first is the HTML markup the object is based on, and the second specifies the **parser** that you want BeautifulSoup to use to create that object.

Here are two notes about the inputs:

**<font color='red'>Note 4:</font>** Thus far, we have been calling <code>html.read()</code> to get the HTML content of the page as a <font color='blue'>bytes string</font>. BeautifulSoup can also use the <font color='blue'>file object</font> returned by <code>urlopen</code> directly, without needing to call <code>.read()</code> first:

`bs = BeautifulSoup(html, 'html.parser')`
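A minimal offline sketch of Note 4 follows. Here `io.BytesIO` stands in for the response object returned by `urlopen` (both expose a `.read()` method), and `RAW_HTML` is a made-up stand-in for the page content.

```python
import io
from bs4 import BeautifulSoup

# Made-up page content; urlopen(...).read() would return bytes like these.
RAW_HTML = b"<html><body><h1>An Interesting Title</h1></body></html>"

# io.BytesIO is a file-like object with a .read() method,
# used here in place of the object returned by urlopen.
file_like = io.BytesIO(RAW_HTML)

bs_from_bytes = BeautifulSoup(RAW_HTML, "html.parser")   # markup passed as bytes
bs_from_file = BeautifulSoup(file_like, "html.parser")   # markup passed as a file object

print(bs_from_bytes.h1)
print(bs_from_file.h1)
```

Both objects are built from the same markup, so the two printed tags match.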

**<font color='red'>Note 5:</font>** For the parser, there are four options available:

* **Python’s html.parser:** `BeautifulSoup(markup, "html.parser")`
* **lxml’s HTML parser:** 	`BeautifulSoup(markup, "lxml")`	
* **lxml’s XML parser:**    `BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`
* **html5lib:**             `BeautifulSoup(markup, "html5lib")`


In the majority of cases, especially with well-formed pages, it makes little difference which parser you choose.

`html.parser` is included with Python 3 and requires no extra installation. Except where required, we will use this parser throughout this course.

lxml is included with Anaconda. If it is not installed in your Python environment, you can install it from the command line: `pip install lxml`.

For the pros and cons of each parser, refer to the [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). Generally:

* An advantage of lxml and html5lib over html.parser is that they are more lenient (i.e. better at parsing “messy” or malformed HTML). They are forgiving and fix problems such as unclosed tags, improperly nested tags, and missing head or body tags.

* lxml is also faster than html.parser, whereas html5lib is slower than both. Speed is not necessarily a deciding factor in web scraping, however, given that the **speed of the network itself will almost always be your largest bottleneck**.
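To see how the parsers differ on malformed input, the sketch below feeds the same broken snippet to each HTML parser that happens to be installed. `BROKEN_HTML` is a deliberately invalid fragment chosen for illustration. Parsers that are missing raise `FeatureNotFound`, which is caught and reported as "not installed", so the loop runs even when only `html.parser` is available.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed fragment: a closing </p> with no matching opening tag.
BROKEN_HTML = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(BROKEN_HTML, parser))
    except FeatureNotFound:
        # This parser library is not installed in the current environment.
        results[parser] = None

for parser, output in results.items():
    print(f"{parser:12} -> {output if output is not None else 'not installed'}")
```

Comparing the printed lines shows how each parser repairs the fragment: `html.parser` keeps it as a bare fragment, while the other two (when installed) wrap it in a full document structure.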

___

**Now go to "Module 3  Class Exercise" notebook and complete Exercise 5.**

___