# Basic Webpage Scraping
Webpage scraping consists of two steps: crawling and parsing. In this tutorial, we focus on parsing HTML data. Beautiful Soup is a powerful tool for processing static HTML. More details can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/

To keep things simple, we will use an example from W3Schools: https://www.w3schools.com/howto/tryit.asp?filename=tryhow_css_example_website

For simplicity, this example is contained in a single HTML file, which has been saved as simple_page.html.


## Parsing a webpage
First, we read the contents of the simple_page.html file and store them in the variable 'html'.

In [1]:
with open('simple_page.html') as f:
    html = f.read()

In [2]:
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n<title>Page Title</title>\n<meta charset="UTF-8">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<style>\n* {\n  box-sizing: border-box;\n}\n\n/* Style the body */\nbody {\n  font-family: Arial, Helvetica, sans-serif;\n  margin: 0;\n}\n\n/* Header/logo Title */\n.header {\n  padding: 80px;\n  text-align: center;\n  background: #1abc9c;\n  color: white;\n}\n\n/* Increase the font size of the heading */\n.header h1 {\n  font-size: 40px;\n}\n\n/* Sticky navbar - toggles between relative and fixed, depending on the scroll position. It is positioned relative until a given offset position is met in the viewport - then it "sticks" in place (like position:fixed). The sticky value is not supported in IE or Edge 15 and earlier versions. However, for these versions the navbar will inherit default position */\n.navbar {\n  overflow: hidden;\n  background-color: #333;\n  position: sticky;\n  position: -webkit-sticky;\n  top: 0;\n}\

Import all necessary packages

In [3]:
from bs4 import BeautifulSoup
from bs4.element import Tag

Parse the HTML content in the 'html' variable using the lxml parser (this requires the lxml package to be installed). Note that prettify is a bs4 method that returns a string with formatted HTML content.

In [4]:
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   Page Title
  </title>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <style>
   * {
  box-sizing: border-box;
}

/* Style the body */
body {
  font-family: Arial, Helvetica, sans-serif;
  margin: 0;
}

/* Header/logo Title */
.header {
  padding: 80px;
  text-align: center;
  background: #1abc9c;
  color: white;
}

/* Increase the font size of the heading */
.header h1 {
  font-size: 40px;
}

/* Sticky navbar - toggles between relative and fixed, depending on the scroll position. It is positioned relative until a given offset position is met in the viewport - then it "sticks" in place (like position:fixed). The sticky value is not supported in IE or Edge 15 and earlier versions. However, for these versions the navbar will inherit default position */
.navbar {
  overflow: hidden;
  background-color: #333;
  position: sticky;
  position: -webkit-sticky;
  top: 0;
}

/* Style the nav
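
The parsing step above depends on the lxml package. As a minimal sketch, assuming lxml may not be installed everywhere, the helper below falls back to Python's built-in "html.parser". The function name `make_soup` and the inline markup are hypothetical and are not part of simple_page.html.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when available, else with the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing
        return BeautifulSoup(markup, "html.parser")


# Hypothetical inline markup standing in for the contents of simple_page.html
soup = make_soup("<html><head><title>Page Title</title></head>"
                 "<body><p>Hi</p></body></html>")
print(soup.title.string)
```

The two parsers can build slightly different trees for malformed HTML, so it is safer to use the same parser throughout one project.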

## Beautiful Soup DOM Tree
The structure of Beautiful Soup is based on the concept of the DOM (Document Object Model), which is used in all web browsers. The DOM is a tree of all elements in the webpage. Each element node consists of:
- tag
- innerHTML/outerHTML
- id
- attributes
- parent and children

### DOM Element Node
**'title'** is the tag of one of the element nodes in the example; we can refer to the node by its tag name.

Note that title is not the root node. The root node is 'html'. We can access nodes via the root node, or we can refer to a tag directly. In the latter case, bs4 returns the first node with the referenced tag.

In [5]:
type(soup.title)

bs4.element.Tag

In [6]:
# we can get tag of a node with 'name'
soup.title.name

'title'

In [7]:
# we can get outerHTML by converting node to string
str(soup.title)

'<title>Page Title</title>'

In [None]:
# direct reference to a node leads to the same result
soup.title

In [None]:
# we can get the text inside the node with 'string'
# (for the full innerHTML of any node, use decode_contents())
soup.title.string

In [None]:
# we can get id with get('id') (it is None in this example);
# note that soup.title.id would instead look for a child tag named <id>
soup.title.get('id')

In [None]:
# we can get attribute values with 'attrs'
soup.title.attrs

In [None]:
# getting the parent node with 'parent'
soup.title.parent.name

In [None]:
# referring to children ('children' is an iterator, so wrap it in list() to see the items)
list(soup.title.children)
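
The node properties listed above can be checked together on a small document. This is a minimal sketch: the inline markup is a hypothetical stand-in for simple_page.html (the id and lang attributes on title are invented for illustration), and the built-in "html.parser" is used so that the block has no lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real title in simple_page.html has no attributes
sample = ('<html><head><title id="t1" lang="en">Page Title</title></head>'
          '<body></body></html>')
soup = BeautifulSoup(sample, "html.parser")
title = soup.title

print(title.name)            # tag name: 'title'
print(title.string)          # text of the single child: 'Page Title'
print(title.get('id'))       # value of the id attribute: 't1'
print(title.attrs)           # all attributes as a dict
print(title.parent.name)     # tag of the parent node: 'head'
print(list(title.children))  # children is an iterator, so materialize it

# Dot access always means "first child tag with this name", so title.id
# searches for an <id> tag rather than reading the id attribute.
print(title.id)              # None
```

Because dot access is reserved for child tags, attribute values are read with get or with dictionary-style indexing.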

### Traversing simple HTML's DOM Tree

In our example, the structure is as follows:

```
html
+-- head
|   +-- title
|   +-- meta
|   +-- meta
|   +-- style
+-- body
    +-- div
    |   +-- h1
    |   +-- p
    |       +-- b
    +-- div
    |   +-- a
    |   +-- a
    |   +-- a
    |   +-- a
    +-- div
    |   +-- div
    |   |   +-- h2
    |   |   +-- h5
    |   |   +-- ...
    |   +-- div
    |       +-- h2
    |       +-- h5
    |       +-- ...
    +-- div
        +-- h2
```

Let's refer to the first anchor (tag = a) in this structure. Notice that even though this HTML has 4 anchors, bs4 returns only the first one.

In [None]:
soup.a

Assign this node to a variable 'a_node'

In [None]:
a_node = soup.a

A node can have multiple (or zero) attributes. The attrs property returns all of them as a dictionary, and the get method returns the value of a single attribute.

In [None]:
a_node.attrs

In [None]:
print(a_node.get('href'))
print(a_node.get('class'))
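
A few details of attribute access are easier to see on a tiny example. The sketch below uses hypothetical markup (the class names "btn" and "right" are invented) to show that get returns None or a default for a missing attribute, and that multi-valued attributes such as class come back as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical navbar with one full anchor and one anchor without attributes
sample = ('<div class="navbar">'
          '<a href="#home" class="btn right">Home</a>'
          '<a>Plain</a>'
          '</div>')
soup = BeautifulSoup(sample, "html.parser")
first, second = soup.find_all('a')

href = first.get('href')            # '#home'
classes = first.get('class')        # ['btn', 'right'], a list, not a string
missing = second.get('href')        # None, because the attribute is absent
fallback = second.get('href', '#')  # explicit default instead of None

# Dictionary-style access also works, but raises KeyError when absent
try:
    second['href']
    raised = False
except KeyError:
    raised = True

print(href, classes, missing, fallback, raised)
```

Using get is usually the safer choice when scraping, because real pages often omit attributes on some nodes.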

### Relative Reference
Let's navigate relative to a_node using parent and children.

In [None]:
a_parent = a_node.parent
a_parent

In [None]:
a_parent.name

Note that bs4 turns any whitespace between elements into NavigableString nodes, so these strings appear among the children and siblings of a node.

In [None]:
for node in a_parent.children:
    print('[{}] {}'.format(type(node), node))

In [None]:
for node in a_node.previous_siblings:
    print('[{}] {}'.format(type(node), node))

In [None]:
for node in a_node.next_siblings:
    print('[{}] {}'.format(type(node), node))
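
Since whitespace shows up as NavigableString nodes, it is often useful to keep only the element nodes. The following is a minimal sketch on hypothetical markup that filters children with isinstance and uses find_next_sibling to skip over whitespace.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical navbar; the line breaks and indentation become text nodes
sample = """<div class="navbar">
  <a href="#">Home</a>
  <a href="#">News</a>
  <a href="#">Contact</a>
</div>"""
soup = BeautifulSoup(sample, "html.parser")
navbar = soup.div

children = list(navbar.children)
tags_only = [c for c in children if isinstance(c, Tag)]
texts = [c for c in children if isinstance(c, NavigableString)]

first = tags_only[0]
raw_next = first.next_sibling             # whitespace string, not a tag
tag_next = first.find_next_sibling('a')   # next sibling that is an <a> tag

print(len(children), len(tags_only), len(texts))
print(repr(raw_next), tag_next)
```

The find_next_sibling and find_previous_sibling methods accept the same criteria as find, which makes them convenient for stepping between elements.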

Let's try with the first div node in this HTML.

In [None]:
soup.div

In [None]:
soup.div.parent.name

Here is a list of the DOM navigation attributes:
- **node.children**: iterator for all direct children of a node
- **node.descendants**: iterator for all of a tag’s children, recursively: its direct children, the children of its direct children, and so on
- **node.parent**: parent of the current node
- **node.parents**: iterator for all of an element’s parents, up to the root of the DOM tree
- **node.next_sibling / node.previous_sibling**: navigate between page elements that are on the same level of the DOM tree
- **node.next_element / node.previous_element**: navigate between page elements in the order they were parsed, regardless of the level
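
The difference between these attributes can be sketched on a small tree. The markup below is hypothetical and contains no whitespace between tags, so that the results are easy to predict.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: div -> (h1, p -> b)
sample = ('<html><body><div><h1>Title</h1>'
          '<p>Some <b>bold</b> text</p></div></body></html>')
soup = BeautifulSoup(sample, "html.parser")
div = soup.div

# children: direct children only
child_names = [c.name for c in div.children]

# descendants: children, grandchildren, ... (text nodes filtered out here)
desc_names = [d.name for d in div.descendants if isinstance(d, Tag)]

# parents: walk upward; the last item is the BeautifulSoup object itself,
# whose name is '[document]'
parent_names = [p.name for p in soup.b.parents]

# next_sibling stays on the same level; next_element follows parse order
sibling = soup.h1.next_sibling   # the <p> tag
element = soup.h1.next_element   # the string 'Title' inside <h1>

print(child_names, desc_names, parent_names)
print(sibling.name, repr(element))
```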

### Direct Reference
We can refer to any node via the root node, html, using dot-separated tag names.
Refer to our DOM tree structure in the following examples.

In [None]:
soup.head

We can access the CSS style at the node:

**html -> head -> style**

Note that we do not have to include html in the reference.

In [None]:
soup.head.style

We do not have to include everything in the path. Let's access the first h2 node inside body.

In [None]:
soup.body.h2

In [None]:
soup.body.h2.parent

We can jump back and forth between nodes, parents, and children.

In [None]:
soup.body.h2.parent.p
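
Dot-separated references behave like chained calls to find, which also explains what happens when a tag is missing. The sketch below illustrates this on hypothetical markup; the headings and text are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical page with a style node and two div sections
sample = ('<html><head><style>p {color: red;}</style></head>'
          '<body><div><h2>About</h2><p>Intro</p></div>'
          '<div><h2>More</h2></div></body></html>')
soup = BeautifulSoup(sample, "html.parser")

# A dotted path returns the same node as the equivalent find() chain
same_node = soup.head.style is soup.find('head').find('style')

first_h2 = soup.body.h2           # skips the div level, first match wins
sibling_p = soup.body.h2.parent.p  # up to the div, then down to its <p>

# A missing tag yields None, so a longer chain through it would fail
missing = soup.body.h3

print(same_node, first_h2.string, sibling_p.string, missing)
```

Because a missing tag returns None, it is worth checking intermediate results before chaining further on pages whose structure may vary.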

## Finding Nodes
We usually have more than one element with the same tag. We can get all of those nodes using the find_all method. Note that the find method is also available and returns the first node that matches the criteria.

In [None]:
all_div = soup.find_all('div')

In [None]:
len(all_div)

In [None]:
for n, div in enumerate(all_div):
    print('-- {} --'.format(n))
    print(div)

In [None]:
div8 = all_div[8]
div8

In [None]:
div8.attrs

In [None]:
div8.get('class')

The find_all method can be used on any node. This effectively limits the search scope to the descendants of that node.

In [None]:
div8.find_all('div')

In [None]:
div8.find('div')
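
The scoping behaviour can be sketched on a smaller tree than the full page. The nested divs below are hypothetical and only loosely follow the layout of the example; the index 3 is specific to this sample.

```python
from bs4 import BeautifulSoup

# Hypothetical layout: row -> (side -> fakeimg A), (main -> fakeimg B, C)
sample = ('<div class="row">'
          '<div class="side"><div class="fakeimg">A</div></div>'
          '<div class="main"><div class="fakeimg">B</div>'
          '<div class="fakeimg">C</div></div>'
          '</div>')
soup = BeautifulSoup(sample, "html.parser")

all_div = soup.find_all('div')      # every div in document order
main = all_div[3]                   # the div with class "main"

inside_main = main.find_all('div')  # only descendants of main, not main itself
first_inside = main.find('div')     # first match within main

no_match_all = main.find_all('span')  # empty result list
no_match_one = main.find('span')      # None

print(len(all_div), len(inside_main), first_inside.string)
```

Note the different "not found" results: find_all returns an empty list, while find returns None.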

### Find with Criteria
The find and find_all methods accept search criteria. For example, we can find all nodes with a specific id or with specific attributes.

In [None]:
soup.find_all(id='more_text')

In [None]:
soup.find_all(attrs={'class': 'fakeimg'})

In [None]:
soup.find_all(attrs={'style': 'height:60px;'})

In [None]:
soup.find_all('div', attrs={'class': 'main'})
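
The criteria above can be tried on a self-contained sample. In this sketch the markup is hypothetical: it reuses the names more_text, fakeimg, and main from the examples, while the extra class "big" is invented. It also shows the class_ keyword, a shorthand that bs4 provides because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing ids and classes from the examples above
sample = """<body>
<div class="main"><p id="more_text">More</p>
<div class="fakeimg" style="height:60px;">Img</div></div>
<div class="fakeimg big" id="my_photo">Photo</div>
</body>"""
soup = BeautifulSoup(sample, "html.parser")

by_id = soup.find_all(id='more_text')

# Both forms match a node if ANY of its classes equals 'fakeimg'
by_attrs = soup.find_all(attrs={'class': 'fakeimg'})
by_class = soup.find_all(class_='fakeimg')

# Other attributes are compared against the whole attribute string
by_style = soup.find_all(attrs={'style': 'height:60px;'})

# A tag name and attribute criteria can be combined
main_divs = soup.find_all('div', class_='main')

print(len(by_id), len(by_attrs), len(by_class), len(by_style), len(main_divs))
```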

## CSS Selector
CSS selectors are very powerful for node searching. With the select method, we can search by tag name, id, class, and combinations of criteria. The basic selectors include:
- **tag**: select nodes with the specific *tag*, e.g. div for nodes with tag 'div'
- **.class**: select nodes with the specific *class*
- **#id**: select the node with the specific *id*
- **tag[attr]**: select nodes with the specific *tag* that have the attribute *attr*

In [None]:
soup.select('p')

In [None]:
soup.select('h2')

In [None]:
soup.select('#more_text')

In [None]:
soup.select('.fakeimg')

In [None]:
soup.select('#my_photo.fakeimg')

In [None]:
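# a space-separated selector matches descendants: h2 nodes inside a div
# that is itself inside another div; select always returns a list
nodes = soup.select('div div h2')
nodes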
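
The selectors above can be compared side by side on a small sample. This is a minimal sketch with hypothetical markup (the footer div and its anchors are invented); it also uses select_one, which returns the first match or None instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a nested main section and a top-level footer
sample = """<div class="row">
  <div class="main"><h2>TITLE</h2><p id="more_text">More</p>
    <div class="fakeimg" id="my_photo">Photo</div>
    <div class="fakeimg">Other</div></div>
</div>
<div class="footer"><h2>Footer</h2><a href="#top">Top</a><a>No link</a></div>"""
soup = BeautifulSoup(sample, "html.parser")

all_h2 = soup.select('h2')                 # by tag
by_id = soup.select('#more_text')          # by id
by_class = soup.select('.fakeimg')         # by class
both = soup.select('#my_photo.fakeimg')    # id AND class on the same node
nested_h2 = soup.select('div div h2')      # h2 nested in at least two divs
with_href = soup.select('a[href]')         # anchors that have an href

first_h2 = soup.select_one('h2')           # first match only
nothing = soup.select_one('table')         # None when nothing matches

print(len(all_h2), len(by_class), len(nested_h2), len(with_href))
```

Here the footer h2 is not matched by 'div div h2' because it has only one div ancestor.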