# Introduction to Web Scraping

-----

Previously, we looked at different data formats, including the text-based CSV, JSON, and XML formats. In this notebook, we explore how to actually pull data of interest out of semi-structured and structured text data. To do this, we begin by reviewing the concept of parsing, where we use the structure of a document to extract contextual information. Next, we move on to parsing structured documents, for which we use the parsing tool [BeautifulSoup][bs]. This library provides an elegant and simple way to parse and access XML-formatted data, as well as closely related markup such as HTML and SVG documents. BeautifulSoup was designed to simplify the task of scraping data from websites, but it can be used to parse any XML-formatted data.

-----
[bs]: http://www.crummy.com/software/BeautifulSoup/

## Table of Contents


[Sample Document](#Sample-Document)

[Document Object Model](#Document-Object-Model)

[Parsing Documents](#Parsing-Documents)

[Using Regular Expressions](#Using-Regular-Expressions)

-----

[[Back to TOC]](#Table-of-Contents)

## Sample Document

To demonstrate parsing, we need a sample XML-compliant document. For this purpose, we create an HTML document and assign it to the `html` variable in the following Code cell. This document has a defined _doctype_ and follows standard HTML rules, including the use of a parent _html_ element and both _head_ and _body_ elements, as well as several other HTML elements: a header element, a level-two heading element, paragraph elements, an unordered list element, a table element, and a footer element.

----

In [1]:
# A simple HTML document to demonstrate DOM processing

html = '''
<!DOCTYPE html>
<html>
<head id='hid' class='hclass'>
<title> Test, this is only a test ... </title>
</head>
<body id='bid' class='bclass'>
<header> 
This is text in the header.
</header>

<h2 color='mycolor'>This is a Header Level 2</h2>

<p align='myalign'>Here is some text in a paragraph.</p>

<p> Here is a list </p>
<ul id='ulid'>
<li> List Item #1 </li>
<li> List Item #2 </li>
</ul>

<p type='caption'> Here is a table </p>
<table id='tid'>
<tr>
<th> Column #1 </th>
<th> Column #2 </th>
</tr>
<tr>
<td> A value </td>
<td> Another Value </td>
</tr>
</table>

<p> Some concluding text </p>

<footer>
<hr />
This is a text in the footer.
</footer>

</body>
</html>
'''

-----

In the following Code cell, we display the HTML document inline, showing the different document components mentioned earlier. The Code cells after that write the document to a file to simplify subsequent parsing.

-----

In [2]:
from IPython.display import display_html

display_html(html, raw=True)

Column #1,Column #2
A value,Another Value


-----

Now that we have our sample document, we save it to the local filesystem to simplify subsequent parsing. First, we define our local data directory; then we create a file for the document.

-----

In [3]:
# First we find our HOME directory
home_dir = !echo $HOME

# Define data directory
data_dir = home_dir[0] +'/data/'

In [4]:
# Now save the HTML string
with open(data_dir + 'test.html', 'w') as fout:
    fout.write(html)
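
The `!echo $HOME` shell escape used above only works inside IPython or a Jupyter notebook. As a minimal sketch of the same save step in plain Python, we can use the standard `pathlib` module. The `html_text` string is a hypothetical stand-in for the notebook's `html` variable, and the file is written under a temporary directory (an assumption made here so the sketch does not touch the real home directory).

```python
from pathlib import Path
import tempfile

# Hypothetical stand-in for the notebook's `html` string
html_text = "<!DOCTYPE html>\n<html><body><p>Hello</p></body></html>\n"

# Path.home() plays the role of `!echo $HOME`
home_data_dir = Path.home() / "data"

# For this sketch we write under a temporary directory instead of the real home
with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "data"
    data_path.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    out_file = data_path / "test.html"
    out_file.write_text(html_text, encoding="utf-8")
    round_trip = out_file.read_text(encoding="utf-8")

print(home_data_dir)            # where the notebook would keep its data
print(round_trip == html_text)  # the saved file matches the original string
```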

-----

[[Back to TOC]](#Table-of-Contents)

## Document Object Model

There are at least two techniques used to parse a structured file like an XML document. The first approach is known as [Simple API for XML][sax] (or SAX), which is an event-driven parser that reads and processes each part of an XML document sequentially. The second approach is the [Document Object Model][dom] (or DOM), which reads and parses the entire document into an in-memory tree. While the SAX approach can be fast and uses a smaller memory footprint, the DOM approach can be more easily used to extract all or most of the information contained in an XML document.
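
To make the contrast concrete, here is a minimal sketch that extracts the same values with both approaches, using only the Python standard library (`xml.sax` and `xml.dom.minidom`). The small `xml_text` document and the `IataHandler` class are hypothetical and exist only for this illustration.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical, well-formed XML snippet used only for this sketch
xml_text = (
    "<airports>"
    "<airport name='A'><iata>AAA</iata></airport>"
    "<airport name='B'><iata>BBB</iata></airport>"
    "</airports>"
)

class IataHandler(xml.sax.ContentHandler):
    """Collect the text of every <iata> element as parse events stream past."""

    def __init__(self):
        super().__init__()
        self.in_iata = False
        self.codes = []

    def startElement(self, name, attrs):
        if name == "iata":
            self.in_iata = True
            self.codes.append("")

    def characters(self, content):
        # Text may arrive in several chunks, so we accumulate it
        if self.in_iata:
            self.codes[-1] += content

    def endElement(self, name):
        if name == "iata":
            self.in_iata = False

# SAX: the document is processed sequentially, one event at a time
handler = IataHandler()
xml.sax.parseString(xml_text.encode("utf-8"), handler)
sax_codes = handler.codes

# DOM: the whole document is parsed into a tree, which we then query
dom = minidom.parseString(xml_text)
dom_codes = [node.firstChild.data for node in dom.getElementsByTagName("iata")]

print(sax_codes, dom_codes)
```

The SAX version needs explicit state tracking but never holds the whole document in memory; the DOM version is shorter because the full tree is available to query.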

To demonstrate using a DOM, we can process our newly minted [HTML file](test.html), which is rendered rather simply as shown in the following figure:

![HTML Page view](images/html-view.png)

This HTML document, which is a valid XML document, demonstrates both hierarchical elements, as well as element attributes and values. This can be seen more easily by examining the document object model (or DOM) representation of this document, which is shown in the following figure:

![HTML DOM view](images/html-dom.png) 

This figure is a screenshot from the Safari Web Browser _Developer Source View_; other browsers provide similar functionality (although you may need to install an add-on package). This representation of the DOM clearly illustrates the hierarchical nature of the document. At the highest level we have the `html` element, inside of which are two separate elements: `head` and `body`.

![HTML DOM view](images/dom-tree.png) 

Looking at the document tree more closely, we see that the `head` element has associated `id` and `class` attributes as well as a child element called `title`, which has a value of `Test, this is only a test ...`. The `body` element has a number of child elements, including the `header`, `h2`, `p`, `ul`, `table`, and `footer` elements. Some of these elements have child elements, values, and possibly their own attributes. The relationship between the DOM element and the HTML view can be seen in the following two figures, where the `ul` element is highlighted in the DOM model,

![HTML DOM element](images/dom-element.png) 

and the corresponding element is highlighted in the HTML view.

![HTML html element](images/html-element.png) 
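
The same hierarchy can be printed programmatically. The following is a minimal sketch that uses the standard library `xml.dom.minidom` module on a shortened, hypothetical version of our document (`doc`); the `outline` helper is also hypothetical and simply indents each element by its depth in the tree.

```python
from xml.dom import minidom

# Shortened, hypothetical version of the sample document (well-formed XML)
doc = (
    "<html><head id='hid'><title>Test</title></head>"
    "<body id='bid'><ul id='ulid'><li>One</li><li>Two</li></ul></body></html>"
)
dom = minidom.parseString(doc)

def outline(node, depth=0):
    """Return one indented line per element, showing its name and attributes."""
    attrs = dict(node.attributes.items())
    label = node.tagName + (f" {attrs}" if attrs else "")
    lines = ["  " * depth + label]
    for child in node.childNodes:
        # Skip text nodes; only elements appear in the outline
        if child.nodeType == child.ELEMENT_NODE:
            lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(dom.documentElement)
print("\n".join(tree_lines))
```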

-----
[sax]: https://en.wikipedia.org/wiki/Simple_API_for_XML
[dom]: https://en.wikipedia.org/wiki/Document_Object_Model

-----

[[Back to TOC]](#Table-of-Contents)

## Parsing Documents

To parse an XML document, like our example HTML document, we can use the Python [Beautiful Soup][bs] library. This library uses an XML/HTML parser to build a DOM tree, and Beautiful Soup then provides traversal methods to access and modify the DOM for a specific document. Beautiful Soup has been extremely popular for the ease with which it allows web scraping; for example, you can pull data out of an HTML table. But it is more powerful than this, as it allows you to easily parse and manipulate any XML document.

To use Beautiful Soup, we first need to import the library, and then create a BeautifulSoup object that provides access to the parsed data. Document elements, like `body` or `table`, are directly accessed from the parsed tree, and element attributes or data can be easily extracted, deleted, or replaced. If required, new data can also be added to an existing document, allowing for the dynamic creation of a new document. These capabilities are demonstrated in the following code cells.

-----
[bs]: http://www.crummy.com/software/BeautifulSoup/

In [5]:
# Parse our HTML document

# We use BeautifulSoup version 4
from bs4 import BeautifulSoup

# Load our document and specify the parser;
# the with block closes the file when parsing is done
with open(data_dir + 'test.html') as fin:
    soup = BeautifulSoup(fin, 'lxml')

# Now let's print out the start of the HTML file
print(soup.prettify()[:108])

<!DOCTYPE html>
<html>
 <head class="hclass" id="hid">
  <title>
   Test, this is only a test ...
  </title>


-----

To extract an element, we simply use the element's name as an attribute. Thus, to extract the title of the HTML document, we use `soup.title`. These elements have several special attributes, including `name` to extract the name of the element and `string` to extract the data enclosed between the opening and closing element tags. We can also traverse the DOM tree by using the `parent` attribute of an element to request its parent element.

The following Code cell demonstrates these concepts by extracting the _title_ element, the data within the _title_ element, and the name of the _title_ element's parent element.

-----

In [6]:
# We can access document elements directly
print('title element:= ', soup.title)
print('title value:', soup.title.string)

# We can access parent data
print('title parent element: ', soup.title.parent.name)

title element:=  <title> Test, this is only a test ... </title>
title value:  Test, this is only a test ... 
title parent element:  head


-----

We can also access element attributes by using a dictionary-style access method. For example, the following Code cell extracts the `class` attribute from the _body_ element.

-----

In [7]:
# We can directly access element attributes
print('body class attribute: ', soup.body['class'])

body class attribute:  ['bclass']
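
Note that the `class` attribute comes back as a list: Beautiful Soup treats `class` as a multi-valued attribute, while most other attributes are returned as plain strings. The following is a minimal sketch of a few related attribute operations. The `snippet` markup is hypothetical, and we assume the built-in `html.parser` backend here (rather than `lxml`) so the sketch has no extra dependencies beyond `bs4`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; 'html.parser' is Python's built-in parser backend
snippet = "<body id='bid' class='bclass extra'><p>Hi</p></body>"
demo = BeautifulSoup(snippet, "html.parser")
body = demo.body

classes = body["class"]          # multi-valued attribute: a list of strings
body_id = body["id"]             # single-valued attribute: a plain string
missing = body.get("lang")       # get() returns None instead of raising KeyError
attr_names = sorted(body.attrs)  # attrs is a dictionary of all attributes

# Attributes can also be deleted with dictionary-style syntax
del body["id"]
id_after_delete = body.get("id")

print(classes, body_id, missing, attr_names, id_after_delete)
```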


-----

Accessing an element directly provides the entire content of the element, even child elements. This is demonstrated in the following Code cell, where the unordered list (_ul_) element is accessed, providing the entire list contents.

-----

In [8]:
# We can access an entire element's content
print(soup.ul)

<ul id="ulid">
<li> List Item #1 </li>
<li> List Item #2 </li>
</ul>


-----

We can also search for all occurrences of an element in a document and iterate through the search results using a Python loop. This functionality is provided by the `find_all` function, which takes an element name and returns a result set containing all matches. This set can be iterated over to provide access to each matching element, as shown in the next Code cell, which finds all paragraph elements.


-----

In [9]:
# We can find all occurrences of a particular element

for el in soup.find_all('p'):
    print(el)

<p align="myalign">Here is some text in a paragraph.</p>
<p> Here is a list </p>
<p type="caption"> Here is a table </p>
<p> Some concluding text </p>
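
The `find_all` function also makes the table-scraping use case mentioned earlier straightforward. As a minimal sketch, the following code turns a copy of our sample table into a list of rows. The `table_html` string repeats the table from the sample document, and we assume the built-in `html.parser` backend so that only `bs4` is required.

```python
from bs4 import BeautifulSoup

# Copy of the table from the sample document
table_html = """
<table id='tid'>
<tr><th> Column #1 </th><th> Column #2 </th></tr>
<tr><td> A value </td><td> Another Value </td></tr>
</table>
"""
demo = BeautifulSoup(table_html, "html.parser")

rows = []
for tr in demo.find_all("tr"):
    # find_all accepts a list of names, so header and data cells are both matched
    cells = tr.find_all(["th", "td"])
    rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

Each inner list holds the text of one table row, with surrounding whitespace removed, which is a convenient shape for writing out as CSV.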


-----

By accessing the elements, we can also change their values. For example, we can change the title of the document or the attributes of an element. Changing content simply requires assigning the new value to the element or the attribute. These concepts are demonstrated in the next Code cell, where we modify the content of our HTML document.

-----


In [10]:
# We can also change data in the document
soup.title.string = 'This is a new title!'
print(f'New title = {soup.title}\n')

# Change attribute and display
soup.body['class'] = 'newClass'
print(f'Body class attribute = {soup.body["class"]}\n')

New title = <title>This is a new title!</title>

Body class attribute = newClass



-----

We can remove elements from the document by using the [`extract`][bse] function. Applying this function directly to an element (or tag) removes the tag (and its children) from the XML document. We demonstrate this in the next Code cell where we remove the entire table element from the document.

-----
[bse]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#extract

In [11]:
# We can delete elements, the display is 
# None since the element is gone
myTable = soup.table.extract()
print(soup.table)

None
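
Because `extract` returns the removed tag, the tag can be kept and reused later. Beautiful Soup also provides `decompose`, which removes a tag and destroys it. The following minimal sketch contrasts the two on a small hypothetical document, assuming the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document
demo = BeautifulSoup(
    "<div><p id='a'>keep me</p><p id='b'>drop me</p></div>", "html.parser"
)

# extract() removes the tag from the tree and returns it
kept = demo.find("p", id="a").extract()

# decompose() removes the tag and discards it; nothing is returned
demo.find("p", id="b").decompose()

remaining = len(demo.find_all("p"))  # no paragraphs are left in the tree

# The extracted tag is still usable and can be put back
demo.div.append(kept)
restored = str(demo)

print(remaining, restored)
```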


-----

We can also select elements by using Cascading Style Sheet (CSS) selectors, the same patterns a CSS document uses to apply styling to particular elements. The following example uses the attribute selector `p[type]` to select every paragraph element that has a `type` attribute.

-----

In [12]:
# We can select elements based on CSS Selectors
target = soup.select('p[type]')
print(target)

[<p type="caption"> Here is a table </p>]
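
Other common CSS selector forms work the same way. The following is a minimal sketch on a shortened, hypothetical version of our document, again assuming the built-in `html.parser` backend; `select` returns a list of matches, while `select_one` returns only the first match.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the sample document
doc = """
<body class='bclass'>
<p align='myalign'>First</p>
<p type='caption'>Caption</p>
<ul id='ulid'><li>One</li><li>Two</li></ul>
</body>
"""
demo = BeautifulSoup(doc, "html.parser")

by_id = demo.select_one("#ulid")                             # select by id
items = [li.get_text() for li in demo.select("ul#ulid > li")]  # direct children
captions = demo.select('p[type="caption"]')                  # attribute with a value
in_class = demo.select("body.bclass p")                      # descendants of a class

print(by_id.name, items, len(captions), len(in_class))
```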


-----

We can also insert new content into a document. The simplest approach is to use the `insert_before` or `insert_after` functions, which insert the new element before or after, respectively, the indicated element. In the following example, we insert the table we removed earlier, and place it right after the paragraph element we selected in the previous Code cell.


-----

In [13]:
# We need to pull out the first element in the list to get tag
# Now we can insert our table back into the DOM

target[0].insert_after(myTable)
print(soup.table)

<table id="tid">
<tr>
<th> Column #1 </th>
<th> Column #2 </th>
</tr>
<tr>
<td> A value </td>
<td> Another Value </td>
</tr>
</table>


-----

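We can also create entirely new content. To do this, we create a new tag with `new_tag` and add it to an existing tag, either with `append`, which places it at the end of the existing element, or with the [`insert`][bsi] function, whose first argument specifies the position among the existing element's children. Positions are zero-based, as with Python's `list.insert`, so `0` is the very start of the element; the example below uses `1`, which places the new tag after the element's first child (here, most likely the whitespace that follows the opening `body` tag). These concepts are demonstrated in the next Code cell, where we add two new elements to our HTML document.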

-----

[bsi]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#insert

In [14]:
# We can also insert entirely new elements.

# First we create a new element (tag), with an attribute
tag = soup.new_tag('h3', id='h3id')
tag.string = 'A New Header'

# Now we can append (in this case we append to the end of the body element)
soup.body.append(tag)
print(f'New header element = {soup.h3}\n')

# Now create a new tag
nt = soup.new_tag('body_title')
nt.string = "A body title"

# Insert near the start of the body element (position 1, after its first child)
soup.body.insert(1, nt)
print(f'New title element = {soup.body.body_title}\n')

New header element = <h3 id="h3id">A New Header</h3>

New title element = <body_title>A body title</body_title>
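
To see how the position argument of `insert` behaves, the following minimal sketch inserts a new tag at positions `0` and `1` of a small hypothetical document and lists the resulting children. The `child_names` helper is hypothetical, and we assume the built-in `html.parser` backend. Whitespace between tags is kept as text nodes, which is why position `1` is not the very first slot.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document; note the newline right after the opening body tag
html_text = "<body>\n<p>first</p>\n<p>second</p>\n</body>"

def child_names(position):
    """Insert a new <h3> at the given position and describe body's children."""
    demo = BeautifulSoup(html_text, "html.parser")
    new = demo.new_tag("h3")
    new.string = "new"
    demo.body.insert(position, new)
    return [
        child.name if isinstance(child, Tag) else "text"
        for child in demo.body.contents
    ]

at_zero = child_names(0)  # new tag comes before the leading whitespace
at_one = child_names(1)   # new tag comes after the leading whitespace

print(at_zero)
print(at_one)
```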



-----

[[Back to TOC]](#Table-of-Contents)

## Using Regular Expressions

While Beautiful Soup provides a great deal of power and simplicity in DOM parsing and element retrieval, the full power of parsing a document requires the use of regular expressions. We introduced regular expressions in a previous lesson; for completeness, however, we briefly review their use in Python.

Regular expressions (also called REs or regexes) are expressions that can be used to match one or more occurrences of a particular pattern. Regular expressions are not unique to Python; they are used in many programming languages and in many Unix command line tools like `sed`, `grep`, or `awk`. [Regular expressions][re] are used in Python through the `re` module. Given a regular expression, the first task in Python is typically to compile the RE, which is done by using the `compile` function in the `re` module.
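
As a brief refresher, the following minimal sketch compiles a pattern and applies it to a plain string. The pattern and the sample sentence are hypothetical: the pattern matches any standalone run of exactly three upper-case letters, which is the general shape of an airport code.

```python
import re

# Compile once, then reuse the compiled pattern
pattern = re.compile(r"\b[A-Z]{3}\b")

# Hypothetical sample text
text = "Flights from CMI connect through ORD or DFW."

codes = pattern.findall(text)  # every non-overlapping match, as strings
first = pattern.search(text)   # a match object for the first match (or None)

print(codes)
print(first.group(), first.start())
```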

We demonstrate this approach in the following Code cell, where we now parse a much larger document, the airport XML document created in an earlier lesson. Below, we use a regular expression to find and display the element containing `CMI` to display our local airport.

-----
[re]: https://docs.python.org/3/howto/regex.html

In [15]:
# We need the re module
import re 

# Open and parse our XML document
with open(data_dir + 'data.xml') as fin:
    soup = BeautifulSoup(fin, 'lxml')

# Find elements containing the CMI string
for el in soup.find_all(string=re.compile('CMI')):

    # To get the entire airport element, we need to go 
    # up two levels in the DOM tree.
    print(el.parent.parent)

<airport name="University of Illinois-Willard">
<iata>CMI</iata>
<city>Champaign/Urbana</city>
<state>IL</state>
<country>USA</country>
<latitude>40.03925</latitude>
<longitude>-88.27805556</longitude>
</airport>
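
For readers without the _data.xml_ file at hand, the same idea can be sketched on a small inline document. The `xml_text` string below is hypothetical (its second airport is invented purely for illustration), and we assume the built-in `html.parser` backend. Instead of climbing two levels with `parent.parent`, this sketch uses `find_parent`, which looks up the nearest enclosing element with a given name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical inline stand-in for data.xml; the second airport is invented
xml_text = """
<airports>
<airport name="University of Illinois-Willard"><iata>CMI</iata><city>Champaign/Urbana</city></airport>
<airport name="Hypothetical Field"><iata>XYZ</iata><city>Nowhere</city></airport>
</airports>
"""
demo = BeautifulSoup(xml_text, "html.parser")

# Match text nodes that are exactly 'CMI'
matches = demo.find_all(string=re.compile(r"^CMI$"))

# Walk up to the enclosing airport element, however deep the match is
names = [m.find_parent("airport")["name"] for m in matches]

print(names)
```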


-----

<font color='red' size = '5'> Student Exercise </font>

Earlier in this notebook, we used the BeautifulSoup module, the `lxml` parser, and regular expressions to extract information from web pages. Now that you have run the cells in this notebook, go back to the relevant cells and make these changes. Be sure to understand how your changes impact the file input and output process.

1. Add an ordered HTML list containing at least five elements (i.e., use an `<ol>` element with five child `<li>` elements) to the original HTML document. Use the BeautifulSoup library to extract and display the five items.
2. Change all words in the HTML document to be upper-case.
3. Use a regular expression to find and display all airports in the _data.xml_ document located within the state of Wyoming.

As a challenge problem:

1. Save several webpages (perhaps by using `wget`), and modify the BeautifulSoup code example to parse out and display the page title, any JavaScript code libraries, and any CSS style file references.

-----

## Ancillary Information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

1. [BeautifulSoup][2] tutorial.
2. The [scrapy][3] web parsing tool.
3. An older, but easy to follow, blog article on using [BeautifulSoup][4] to parse a webpage.
4. A tutorial notebook on [web scraping][43] with Python.

-----

[2]: http://programminghistorian.org/lessons/intro-to-beautiful-soup
[3]: http://scrapy.org
[4]: https://programminghistorian.org/lessons/intro-to-beautiful-soup
[43]: http://nbviewer.jupyter.org/url/www.unc.edu/%7Encaren/Lax-1.ipynb.json

**&copy; 2017: Robert J. Brunner at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode