# Basic Webpage Scraping
Webpage scraping consists of two steps: crawling and parsing.  In this tutorial, we focus on parsing HTML.  Beautiful Soup is a powerful library for processing static HTML.  More details can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/

To simplify our learning, we will use a simple example from W3Schools: https://www.w3schools.com/howto/tryit.asp?filename=tryhow_css_example_website

For simplicity, this example is contained in a single HTML file, which has been saved as simple_page.html.


## Parsing a webpage
First, we read the content of the simple_page.html file and store it in the variable 'html'.

In [1]:
with open('simple_page.html') as f:
    html = f.read()

In [2]:
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n<title>Page Title</title>\n<meta charset="UTF-8">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<style>\n* {\n  box-sizing: border-box;\n}\n\n/* Style the body */\nbody {\n  font-family: Arial, Helvetica, sans-serif;\n  margin: 0;\n}\n\n/* Header/logo Title */\n.header {\n  padding: 80px;\n  text-align: center;\n  background: #1abc9c;\n  color: white;\n}\n\n/* Increase the font size of the heading */\n.header h1 {\n  font-size: 40px;\n}\n\n/* Sticky navbar - toggles between relative and fixed, depending on the scroll position. It is positioned relative until a given offset position is met in the viewport - then it "sticks" in place (like position:fixed). The sticky value is not supported in IE or Edge 15 and earlier versions. However, for these versions the navbar will inherit default position */\n.navbar {\n  overflow: hidden;\n  background-color: #333;\n  position: sticky;\n  position: -webkit-sticky;\n  top: 0;\n}\

Import all necessary packages

In [3]:
from bs4 import BeautifulSoup
from bs4.element import Tag

Parse the HTML content in the 'html' variable.  The "lxml" argument selects the lxml parser, which must be installed separately.  Note that prettify is a bs4 method that returns the parsed HTML as a formatted string.

In [4]:
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   Page Title
  </title>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <style>
   * {
  box-sizing: border-box;
}

/* Style the body */
body {
  font-family: Arial, Helvetica, sans-serif;
  margin: 0;
}

/* Header/logo Title */
.header {
  padding: 80px;
  text-align: center;
  background: #1abc9c;
  color: white;
}

/* Increase the font size of the heading */
.header h1 {
  font-size: 40px;
}

/* Sticky navbar - toggles between relative and fixed, depending on the scroll position. It is positioned relative until a given offset position is met in the viewport - then it "sticks" in place (like position:fixed). The sticky value is not supported in IE or Edge 15 and earlier versions. However, for these versions the navbar will inherit default position */
.navbar {
  overflow: hidden;
  background-color: #333;
  position: sticky;
  position: -webkit-sticky;
  top: 0;
}

/* Style the nav

## Beautiful Soup DOM Tree
The structure of Beautiful Soup is based on the DOM (Document Object Model), the same concept used by web browsers.  The DOM is a tree of all elements in the webpage.  Each element node consists of:
- tag
- innerHTML/outerHTML
- id
- attributes
- parent and children
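
As a rough orientation, the sketch below maps each item in the list above to a bs4 call. It uses a hypothetical one-element snippet rather than simple_page.html, and the built-in "html.parser" so that it runs without lxml; both choices are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration, not the tutorial's simple_page.html
html = '<div class="note big" id="box"><p>Hello <b>world</b></p></div>'
soup = BeautifulSoup(html, "html.parser")
node = soup.div

tag_name = node.name                  # tag: 'div'
outer_html = str(node)                # outerHTML: the element including its own tag
inner_html = node.decode_contents()   # innerHTML: the markup between the tags
node_id = node.get("id")              # id: 'box' (None if the attribute is absent)
attributes = node.attrs               # attributes: a dict of all attributes
parent_name = node.parent.name        # parent: here the BeautifulSoup object itself
child_names = [child.name for child in node.children]  # children: ['p']

print(tag_name, node_id, parent_name, child_names)
print(inner_html)
```

Note that the id is read with `get('id')`, like any other attribute; bs4 has no dedicated id property.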

### DOM Element Node
**'title'** is the tag of one of the element nodes in the example;
we can refer to the node by using its tag name.

Note that title is not the root node.  The root node is 'html'.  We can access nodes via the root node, or we can refer to a tag directly.  In the latter case, bs4 returns the first node with the referenced tag.

In [5]:
type(soup.title)

bs4.element.Tag

In [6]:
# we can get tag of a node with 'name'
soup.title.name

'title'

In [7]:
# we can get outerHTML by converting node to string
str(soup.title)

'<title>Page Title</title>'

In [8]:
# direct reference to a node leads to the same result
soup.title

<title>Page Title</title>

In [9]:
# we can get the text inside a node with 'string' (same as innerHTML when the node contains only text)
soup.title.string

'Page Title'

In [10]:
# we can get the id with get('id'); it returns None here because title has no id
# (soup.title.id would instead look for a child tag named 'id')
soup.title.get('id')

In [11]:
# we can get attribute values with 'attrs'
soup.title.attrs

{}

In [12]:
# getting the parent node with 'parent'
soup.title.parent.name

'head'

In [13]:
# referring to children
soup.title.children

<list_iterator at 0x107a0b220>
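
The iterator returned by `children` has to be consumed before its contents are visible. A minimal sketch, assuming a hypothetical two-tag snippet and the built-in "html.parser" instead of the tutorial's file and lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal markup for illustration
html = "<head><title>Page Title</title></head>"
soup = BeautifulSoup(html, "html.parser")

children = list(soup.title.children)   # consume the iterator into a list
contents = soup.title.contents         # .contents holds the same nodes as a list

print(children)
print(type(children[0]).__name__)
```

Here the only child of title is its text, which bs4 represents as a NavigableString rather than a Tag.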

### Traversing simple HTML's DOM Tree

In our example, the structure is as follows:

```
html
+-- head
|   +-- title
|   +-- meta
|   +-- meta
|   +-- style
+-- body
    +-- div
    |   +-- h1
    |   +-- p
    |       +-- b
    +-- div
    |   +-- a
    |   +-- a
    |   +-- a
    |   +-- a
    +-- div
    |   +--div
    |   |   +-- h2
    |   |   +-- h5
    |   |   +-- ...
    |   +--div
    |       +-- h2
    |       +-- h5
    |       +-- ...
    +-- div
        +-- h2
```

Let's refer to the first anchor (tag = a) in this structure.  Notice that even though there are 4 anchors in this HTML, bs4 returns only the first one.

In [14]:
soup.a

<a class="active" href="#">Home</a>

Assign this node to a variable 'a_node'

In [15]:
a_node = soup.a

A node can have multiple (or zero) attributes.  The attrs property returns all of them as a dictionary, and the get method returns the value of a single attribute.

In [16]:
a_node.attrs

{'href': '#', 'class': ['active']}

In [17]:
print(a_node.get('href'))
print(a_node.get('class'))

#
['active']
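
The difference between `get` and bracket access matters when an attribute may be absent. The sketch below assumes a hypothetical single anchor and the built-in "html.parser"; the `title` attribute is chosen only because the anchor does not have one.

```python
from bs4 import BeautifulSoup

# Hypothetical anchor for illustration
html = '<a class="active right" href="#">Home</a>'
soup = BeautifulSoup(html, "html.parser")
a_node = soup.a

href = a_node["href"]                        # bracket access for an attribute that exists
classes = a_node.get("class")                # class is multi-valued, so bs4 returns a list
missing = a_node.get("title")                # get returns None for a missing attribute
fallback = a_node.get("title", "no title")   # or a default, like dict.get
has_title = a_node.has_attr("title")

try:
    a_node["title"]                          # bracket access raises KeyError instead
    raised = False
except KeyError:
    raised = True

print(href, classes, missing, fallback, has_title, raised)
```

Using `get` is the safer choice when scraping pages where an attribute is not guaranteed to be present.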


### Relative Reference
Let's navigate based on a_node using parent and children

In [18]:
a_parent = a_node.parent
a_parent

<div class="navbar">
<a class="active" href="#">Home</a>
<a href="#">Link</a>
<a href="#">Link</a>
<a class="right" href="#">Link</a>
</div>

In [19]:
a_parent.name

'div'

Note that bs4 turns the whitespace between elements (here, the line breaks) into NavigableString nodes.

In [20]:
for node in a_parent.children:
    print('[{}] {}'.format(type(node), node))

[<class 'bs4.element.NavigableString'>] 

[<class 'bs4.element.Tag'>] <a class="active" href="#">Home</a>
[<class 'bs4.element.NavigableString'>] 

[<class 'bs4.element.Tag'>] <a href="#">Link</a>
[<class 'bs4.element.NavigableString'>] 

[<class 'bs4.element.Tag'>] <a href="#">Link</a>
[<class 'bs4.element.NavigableString'>] 

[<class 'bs4.element.Tag'>] <a class="right" href="#">Link</a>
[<class 'bs4.element.NavigableString'>] 



In [21]:
for node in a_node.previous_siblings:
    print('[{}] {}'.format(type(node), node))

[<class 'bs4.element.NavigableString'>] 



In [22]:
for node in a_node.next_siblings:
    print('[{}] {}'.format(type(node), node))

[<class 'bs4.element.NavigableString'>] 

[<class 'bs4.element.Tag'>] <a href="#">Link</a>
[<class 'bs4.element.NavigableString'>] 

[<class 'bs4.element.Tag'>] <a href="#">Link</a>
[<class 'bs4.element.NavigableString'>] 

[<class 'bs4.element.Tag'>] <a class="right" href="#">Link</a>
[<class 'bs4.element.NavigableString'>] 
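
When only element nodes are of interest, the whitespace strings can be filtered out. A minimal sketch, assuming a hypothetical two-link navbar and the built-in "html.parser"; it uses the `Tag` class imported earlier in this tutorial:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical navbar for illustration; the line breaks become NavigableString nodes
html = """<div class="navbar">
<a href="#">Home</a>
<a href="#">Link</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")
navbar = soup.div

all_children = list(navbar.children)
# keep only element nodes, skipping the whitespace strings
tag_children = [node for node in navbar.children if isinstance(node, Tag)]
# find_next_siblings searches siblings by tag name, so strings are skipped too
next_links = soup.a.find_next_siblings("a")

print(len(all_children), len(tag_children), len(next_links))
```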



Let's try the same navigation with the first div node in this HTML.

In [23]:
soup.div

<div class="header">
<h1>My Website</h1>
<p>A <b>responsive</b> website created by me.</p>
</div>

In [24]:
soup.div.parent.name

'body'

Here is a summary of the DOM navigation properties:
- **node.children**: iterator for all direct children of a node
- **node.descendants**: iterator for all of a tag’s children, recursively: its direct children, the children of its direct children, and so on
- **node.parent**: parent of the current node
- **node.parents**: iterator for all of an element’s ancestors, up to the root of the DOM tree
- **node.next_sibling / node.previous_sibling**: navigate between page elements that are on the same level of the DOM tree
- **node.next_element / node.previous_element**: navigate between page elements in the DOM tree, regardless of the level
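
The properties above can be compared side by side on a small tree. This is a sketch on a hypothetical snippet written without whitespace between tags (so no extra NavigableString nodes appear), parsed with the built-in "html.parser":

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet for illustration
html = "<html><body><div><h1>My Website</h1><p>A <b>responsive</b> site.</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

child_tags = [n.name for n in div.children]                                # direct children only
descendant_tags = [n.name for n in div.descendants if isinstance(n, Tag)]  # all levels
ancestor_names = [p.name for p in soup.b.parents]                          # up to the document
next_sibling = soup.h1.next_sibling    # same level: the <p> tag
next_element = soup.h1.next_element    # next node in document order: the text inside <h1>

print(child_tags, descendant_tags, ancestor_names)
print(repr(next_sibling.name), repr(next_element))
```

The last two lines show the difference between siblings and elements: `next_sibling` stays on the same level, while `next_element` steps into the h1 node's own text.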

### Direct Reference
We can refer to any node starting from the root node, html, by chaining tag names with dots.
Refer to our DOM tree structure in the following examples.

In [25]:
soup.head

<head>
<title>Page Title</title>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style>
* {
  box-sizing: border-box;
}

/* Style the body */
body {
  font-family: Arial, Helvetica, sans-serif;
  margin: 0;
}

/* Header/logo Title */
.header {
  padding: 80px;
  text-align: center;
  background: #1abc9c;
  color: white;
}

/* Increase the font size of the heading */
.header h1 {
  font-size: 40px;
}

/* Sticky navbar - toggles between relative and fixed, depending on the scroll position. It is positioned relative until a given offset position is met in the viewport - then it "sticks" in place (like position:fixed). The sticky value is not supported in IE or Edge 15 and earlier versions. However, for these versions the navbar will inherit default position */
.navbar {
  overflow: hidden;
  background-color: #333;
  position: sticky;
  position: -webkit-sticky;
  top: 0;
}

/* Style the navigation bar links */
.navbar a {
  float: left;
  di

We can access the css style at node:

**html -> head -> style**

Note that we do not have to include html in the reference

In [26]:
soup.head.style

<style>
* {
  box-sizing: border-box;
}

/* Style the body */
body {
  font-family: Arial, Helvetica, sans-serif;
  margin: 0;
}

/* Header/logo Title */
.header {
  padding: 80px;
  text-align: center;
  background: #1abc9c;
  color: white;
}

/* Increase the font size of the heading */
.header h1 {
  font-size: 40px;
}

/* Sticky navbar - toggles between relative and fixed, depending on the scroll position. It is positioned relative until a given offset position is met in the viewport - then it "sticks" in place (like position:fixed). The sticky value is not supported in IE or Edge 15 and earlier versions. However, for these versions the navbar will inherit default position */
.navbar {
  overflow: hidden;
  background-color: #333;
  position: sticky;
  position: -webkit-sticky;
  top: 0;
}

/* Style the navigation bar links */
.navbar a {
  float: left;
  display: block;
  color: white;
  text-align: center;
  padding: 14px 20px;
  text-decoration: none;
}


/* Right-aligned link */

We do not have to include everything in the path.  Let's access the first h2 node inside body.

In [27]:
soup.body.h2

<h2>About Me</h2>

In [28]:
soup.body.h2.parent

<div class="side">
<h2>About Me</h2>
<h5>Photo of me:</h5>
<div class="fakeimg" id="my_photo" style="height:200px;">Image</div>
<p>Some text about me in culpa qui officia deserunt mollit anim..</p>
<h3>More Text</h3>
<p>Lorem ipsum dolor sit ame.</p>
<div class="fakeimg" style="height:60px;">Image</div><br/>
<div class="fakeimg" style="height:60px;">Image</div><br/>
<div class="fakeimg" style="height:60px;">Image</div>
</div>

We can jump back and forth between nodes, parents, and children

In [29]:
soup.body.h2.parent.p

<p>Some text about me in culpa qui officia deserunt mollit anim..</p>

## Finding Nodes
We usually have more than one element with the same tag.  We can get all of those nodes using the find_all method.  Note that the find method is also available and returns the first node that matches the criteria.

In [30]:
all_div = soup.find_all('div')

In [31]:
len(all_div)

12

In [32]:
for n, div in enumerate(all_div):
    print('-- {} --'.format(n))
    print(div)

-- 0 --
<div class="header">
<h1>My Website</h1>
<p>A <b>responsive</b> website created by me.</p>
</div>
-- 1 --
<div class="navbar">
<a class="active" href="#">Home</a>
<a href="#">Link</a>
<a href="#">Link</a>
<a class="right" href="#">Link</a>
</div>
-- 2 --
<div class="row">
<div class="side">
<h2>About Me</h2>
<h5>Photo of me:</h5>
<div class="fakeimg" id="my_photo" style="height:200px;">Image</div>
<p>Some text about me in culpa qui officia deserunt mollit anim..</p>
<h3>More Text</h3>
<p>Lorem ipsum dolor sit ame.</p>
<div class="fakeimg" style="height:60px;">Image</div><br/>
<div class="fakeimg" style="height:60px;">Image</div><br/>
<div class="fakeimg" style="height:60px;">Image</div>
</div>
<div class="main" id="div_1">
<h2>TITLE HEADING</h2>
<h5>Title description, Dec 7, 2017</h5>
<div class="fakeimg" style="height:200px;">Image</div>
<p id="more_text">Some text..</p>
<p>Sunt in culpa qui officia deserunt mollit anim id est laborum consectetur adipiscing elit, sed do eiusmo

In [33]:
div8 = all_div[8]
div8

<div class="main" id="div_1">
<h2>TITLE HEADING</h2>
<h5>Title description, Dec 7, 2017</h5>
<div class="fakeimg" style="height:200px;">Image</div>
<p id="more_text">Some text..</p>
<p>Sunt in culpa qui officia deserunt mollit anim id est laborum consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco.</p>
<br/>
<h2>TITLE HEADING</h2>
<h5>Title description, Sep 2, 2017</h5>
<div class="fakeimg" style="height:200px;">Image</div>
<p>Some text..</p>
<p>Sunt in culpa qui officia deserunt mollit anim id est laborum consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco.</p>
</div>

In [None]:
div8.attrs

In [None]:
div8.get('class')

The find_all method can be called on any node.  This effectively limits the search scope to that node's descendants.

In [None]:
div8.find_all('div')

In [None]:
div8.find('div')
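
The effect of scoping can be sketched on a smaller page. The markup below is a hypothetical reduction of the side/main layout, parsed with the built-in "html.parser"; it is not the output of the cells above.

```python
from bs4 import BeautifulSoup

# Hypothetical reduced layout for illustration
html = """
<div class="side"><div class="fakeimg">A</div></div>
<div class="main"><div class="fakeimg">B</div><div class="fakeimg">C</div></div>
"""
soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", attrs={"class": "main"})

in_page = soup.find_all("div")    # searches the whole document
in_main = main.find_all("div")    # searches only the descendants of main
first_in_main = main.find("div")  # first match inside main
no_match = main.find("table")     # find returns None when nothing matches

print(len(in_page), len(in_main), first_in_main.string, no_match)
```

Because `find` returns None on no match, it is worth checking the result before chaining further calls on it.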

### Find with Criteria
The find and find_all methods accept search criteria.  For example, we can find all nodes with a specific id or with specific attribute values.

In [34]:
soup.find_all(id='more_text')

[<p id="more_text">Some text..</p>]

In [35]:
soup.find_all(attrs={'class': 'fakeimg'})

[<div class="fakeimg" id="my_photo" style="height:200px;">Image</div>,
 <div class="fakeimg" style="height:60px;">Image</div>,
 <div class="fakeimg" style="height:60px;">Image</div>,
 <div class="fakeimg" style="height:60px;">Image</div>,
 <div class="fakeimg" style="height:200px;">Image</div>,
 <div class="fakeimg" style="height:200px;">Image</div>]

In [36]:
soup.find_all(attrs={'style': 'height:60px;'})

[<div class="fakeimg" style="height:60px;">Image</div>,
 <div class="fakeimg" style="height:60px;">Image</div>,
 <div class="fakeimg" style="height:60px;">Image</div>]

In [37]:
soup.find_all('div', attrs={'class': 'main'})

[<div class="main" id="div_1">
 <h2>TITLE HEADING</h2>
 <h5>Title description, Dec 7, 2017</h5>
 <div class="fakeimg" style="height:200px;">Image</div>
 <p id="more_text">Some text..</p>
 <p>Sunt in culpa qui officia deserunt mollit anim id est laborum consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco.</p>
 <br/>
 <h2>TITLE HEADING</h2>
 <h5>Title description, Sep 2, 2017</h5>
 <div class="fakeimg" style="height:200px;">Image</div>
 <p>Some text..</p>
 <p>Sunt in culpa qui officia deserunt mollit anim id est laborum consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco.</p>
 </div>]
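
Besides exact attribute values, find_all also accepts the `class_` keyword, compiled regular expressions, and functions as criteria. A minimal sketch on a hypothetical three-node snippet, using the built-in "html.parser":

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet for illustration
html = """
<div class="fakeimg" id="my_photo" style="height:200px;">Image</div>
<div class="fakeimg" style="height:60px;">Image</div>
<p id="more_text">Some text..</p>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) avoids the Python keyword 'class'
by_class = soup.find_all("div", class_="fakeimg")
# a compiled regex matches attribute values that contain the pattern
by_regex = soup.find_all(style=re.compile(r"60px"))
# a function receives each tag and returns True for the ones to keep
by_function = soup.find_all(lambda tag: tag.has_attr("id"))

print(len(by_class), len(by_regex), [t.name for t in by_function])
```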

## CSS Selector
CSS selectors are a powerful way to search for nodes.  We can search by tag name, id, class, or a combination of criteria using the select method, which returns a list of matching nodes.  The CSS selector syntax includes:
- **tag**: select node with the specific *tag* e.g. div for node with tag 'div'
- **.class**: select node with the specific *class*
- **#id**: select node with the specific *id*
- **tag[attr]**: select node with the specific *tag* and *attr*
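
The attribute form in the list above can be tried on a small example. The sketch assumes a hypothetical three-link navbar and the built-in "html.parser"; `select_one` is the single-result counterpart of `select`.

```python
from bs4 import BeautifulSoup

# Hypothetical navbar for illustration
html = """
<div class="navbar">
<a class="active" href="#">Home</a>
<a href="/about">About</a>
<a>No link</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

with_href = soup.select("a[href]")        # anchors that have an href attribute
exact_href = soup.select('a[href="#"]')   # anchors whose href is exactly '#'
first_active = soup.select_one("a.active")  # first match, or None
no_match = soup.select_one("#nope")

print(len(with_href), len(exact_href), first_active.get_text(), no_match)
```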

In [38]:
soup.select('p')

[<p>A <b>responsive</b> website created by me.</p>,
 <p>Some text about me in culpa qui officia deserunt mollit anim..</p>,
 <p>Lorem ipsum dolor sit ame.</p>,
 <p id="more_text">Some text..</p>,
 <p>Sunt in culpa qui officia deserunt mollit anim id est laborum consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco.</p>,
 <p>Some text..</p>,
 <p>Sunt in culpa qui officia deserunt mollit anim id est laborum consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco.</p>]

In [39]:
soup.select('h2')

[<h2>About Me</h2>,
 <h2>TITLE HEADING</h2>,
 <h2>TITLE HEADING</h2>,
 <h2>Footer</h2>]

In [40]:
soup.select('#more_text')

[<p id="more_text">Some text..</p>]

In [41]:
soup.select('.fakeimg')

[<div class="fakeimg" id="my_photo" style="height:200px;">Image</div>,
 <div class="fakeimg" style="height:60px;">Image</div>,
 <div class="fakeimg" style="height:60px;">Image</div>,
 <div class="fakeimg" style="height:60px;">Image</div>,
 <div class="fakeimg" style="height:200px;">Image</div>,
 <div class="fakeimg" style="height:200px;">Image</div>]

In [42]:
soup.select('#my_photo.fakeimg')

[<div class="fakeimg" id="my_photo" style="height:200px;">Image</div>]

In [43]:
node = soup.select('div div h2')
node

[<h2>About Me</h2>, <h2>TITLE HEADING</h2>, <h2>TITLE HEADING</h2>]
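
A space between selectors, as in 'div div h2', matches descendants at any depth, while `>` restricts the match to direct children. A minimal sketch on a hypothetical reduction of the page layout, parsed with the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical reduced layout for illustration
html = """
<div class="row">
  <div class="side"><h2>About Me</h2></div>
  <div class="main"><h2>TITLE HEADING</h2></div>
</div>
<div class="footer"><h2>Footer</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

# descendant combinator: h2 anywhere inside a div that is inside another div
nested = [h.get_text() for h in soup.select("div div h2")]
# child combinator: h2 must be a direct child of div.row (there is none)
direct_in_row = [h.get_text() for h in soup.select("div.row > h2")]
# h2 is a direct child of div.main
direct_in_main = [h.get_text() for h in soup.select("div.main > h2")]

print(nested, direct_in_row, direct_in_main)
```

The footer heading is excluded from the first result because its div is not nested inside another div.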