# Basic Webpage Scraping
Webpage scraping consists of two steps: crawling and parsing. In this tutorial, we focus on parsing HTML data. Beautiful Soup is a Python library for parsing and navigating static HTML. More details can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/

To keep things simple, we will use an example page from W3Schools: https://www.w3schools.com/howto/tryit.asp?filename=tryhow_css_example_website

In [None]:
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    !wget "https://www.dropbox.com/s/w5khgpro1ym3icg/simple_page.html?dl=0" -O simple_page.html

In [None]:
with open('simple_page.html') as f:
    html = f.read()

In [None]:
html

In [None]:
from bs4 import BeautifulSoup
from bs4.element import Tag
from IPython.display import HTML

In [None]:
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())

## Beautiful Soup DOM Tree
Beautiful Soup's structure is based on the DOM (Document Object Model), which is used in all web browsers. The DOM is a tree of all elements in the webpage. Each element node consists of:
- tag
- innerHTML/outerHTML
- id
- attributes
- parent and children
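
As a minimal sketch, the pieces listed above map onto Beautiful Soup attributes roughly as follows. The HTML snippet and the variable names are hypothetical and are not taken from simple_page.html; the standard-library `html.parser` is assumed here only so that the example runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration only (not part of simple_page.html).
snippet = '<div id="intro" class="card wide"><p>Hello <b>world</b></p></div>'

# Assumption: "html.parser" instead of "lxml" to avoid an extra dependency.
demo = BeautifulSoup(snippet, "html.parser")
node = demo.div

tag_name = node.name                  # tag
outer_html = str(node)                # outerHTML: the node including its own tag
inner_html = node.decode_contents()   # innerHTML: only the markup inside the node
node_id = node.get("id")              # id (None when the attribute is absent)
attributes = node.attrs               # all attributes as a dict
parent_name = node.parent.name        # parent
child_names = [child.name for child in node.children]  # children

print(tag_name, node_id, attributes)
print(inner_html)
print(parent_name, child_names)
```

Note that `class` is a multi-valued attribute, so Beautiful Soup stores it as a list of class names rather than a single string.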

### Traversing simple HTML's DOM Tree

In our example, the structure is as follows:

```
html
+-- head
|   +-- title
|   +-- meta
|   +-- meta
|   +-- style
+-- body
    +-- div
    |   +-- h1
    |   +-- p
    |       +-- b
    +-- div
    |   +-- a
    |   +-- a
    |   +-- a
    |   +-- a
    +-- div
    |   +-- div
    |   |   +-- h2
    |   |   +-- h5
    |   |   +-- ...
    |   +-- div
    |       +-- h2
    |       +-- h5
    |       +-- ...
    +-- div
        +-- h2
```

In [None]:
# title is the tag name of one of the element nodes in the example.
# we can refer to a node by using its tag name
type(soup.title)

In [None]:
soup.head.style

In [None]:
# we can get the tag of a node with 'name'
soup.title.name

In [None]:
# we can get outerHTML by converting node to string
str(soup.title)

In [None]:
# 'string' returns the node's text when its only child is a text node,
# which matches innerHTML for a simple node like <title>
soup.title.string

In [None]:
# we can get the id with get('id') (it is None in this example);
# soup.title.id would instead look for a child tag named <id>
soup.title.get('id')

In [None]:
# we can get attribute values with 'attrs'
soup.title.attrs

In [None]:
# getting the parent node with 'parent'
soup.title.parent.name

In [None]:
# referring to children ('children' is an iterator, so wrap it in list() to display it)
list(soup.title.children)

## DOM Structure

In [None]:
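def walk_dom(node, depth=None, indent='', only_tag=True):
    """Print the subtree rooted at node, one line per node.

    depth limits how many levels are printed (None means no limit);
    only_tag=True skips non-Tag nodes such as text strings.
    """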
    if only_tag and (not isinstance(node, Tag)):
        return
    
    print('{}{} : {}'.format(indent, node.name, type(node)))
    if isinstance(node, Tag):
        if len(node.attrs) > 0:
            print(indent, '>>', node.attrs)
        if depth is None or depth > 1:
            indent += '    '
            for c in node.children:
                if depth is None:
                    walk_dom(c, indent=indent, only_tag=only_tag)
                else:
                    walk_dom(c, depth-1, indent=indent, only_tag=only_tag)

In [None]:
walk_dom(soup.html, depth=2, only_tag=False)

In [None]:
walk_dom(soup.html)

In [None]:
walk_dom(soup.head)

In [None]:
body_text = str(soup.body)
body_text[:300]

In [None]:
HTML(body_text)

In [None]:
walk_dom(soup.body)

In [None]:
a = soup.a
a

In [None]:
a.attrs

In [None]:
a.get('href')

In [None]:
soup.div

## Finding Nodes

In [None]:
all_div = soup.find_all('div')

In [None]:
for n, div in enumerate(all_div):
    print('-- {} --'.format(n))
    print(div)

In [None]:
div8 = all_div[8]
HTML(str(div8))

In [None]:
walk_dom(div8, depth=2)

In [None]:
div8.attrs

In [None]:
str(div8)

In [None]:
div8.get('class')

In [None]:
div8.find_all('div')

In [None]:
soup.find(id='more_text')

In [None]:
soup.find_all(attrs={'class': 'fakeimg'})

In [None]:
soup.find_all(attrs={'style': 'height:60px;'})
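
The calls above can be summarized in a small, self-contained sketch. It assumes a hypothetical snippet that reuses the class and id names from this notebook (`fakeimg`, `more_text`) and uses `html.parser` only to avoid depending on lxml. `find` returns the first match or `None`, while `find_all` returns a list of every match; `class_` is the keyword Beautiful Soup provides because `class` is reserved in Python.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration only (not part of simple_page.html).
snippet = """
<div class="row">
  <div class="fakeimg" style="height:60px;">Image</div>
  <p id="more_text">Some text</p>
  <div class="fakeimg big">Image</div>
</div>
"""
demo = BeautifulSoup(snippet, "html.parser")

first_div = demo.find("div")                        # first matching node only
all_divs = demo.find_all("div")                     # every matching node
by_class = demo.find_all("div", class_="fakeimg")   # matches any node whose class list contains 'fakeimg'
by_id = demo.find(id="more_text")                   # lookup by id attribute
missing = demo.find(id="no_such_id")                # None when nothing matches

print(len(all_divs), len(by_class), by_id.name, missing)
```

Because `find` can return `None`, it is worth checking the result before accessing attributes on it.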

## CSS Selector

In [None]:
soup.select('p')

In [None]:
soup.select('#more_text')

In [None]:
soup.select('.fakeimg')

In [None]:
soup.select('#my_photo.fakeimg')

In [None]:
for node in soup.select('h2'):
    print(str(node))
    print('----')

In [None]:
# select() returns a list of all matching nodes, not a single node
nodes = soup.select('div div h2')

In [None]:
str(nodes)

Commonly used CSS selectors include:
- **tag**: select nodes with the specific *tag*, e.g. div for nodes with tag 'div'
- **.class**: select nodes with the specific *class*
- **#id**: select the node with the specific *id*
- **tag[attr]**: select nodes with the specific *tag* that have the attribute *attr*
- **A B**: select nodes matching B that are descendants of a node matching A, e.g. 'div div h2'
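
The selector forms listed above can be tried on a small, self-contained sketch. The snippet, the `main` id, and the example.com link are hypothetical; `html.parser` is assumed only to avoid needing lxml. `select` always returns a list, and `select_one` returns the first match or `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration only (not part of simple_page.html).
snippet = """
<div id="main" class="row">
  <h2>Title</h2>
  <div class="fakeimg" id="my_photo">Photo</div>
  <a href="https://example.com">Link</a>
  <a>No href</a>
</div>
"""
demo = BeautifulSoup(snippet, "html.parser")

by_tag = demo.select("h2")            # tag
by_class = demo.select(".fakeimg")    # .class
by_id = demo.select("#my_photo")      # #id
by_attr = demo.select("a[href]")      # tag[attr]: only <a> nodes that have an href
descendant = demo.select("div div")   # a div anywhere inside another div
child = demo.select("#main > h2")     # an h2 that is a direct child of #main
first_link = demo.select_one("a")     # first match instead of a list

print(len(by_attr), len(descendant), first_link.get("href"))
```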

## Advanced DOM Walk

In [None]:
walk_dom(div8.parent, depth=2)

In [None]:
div8.find_previous_sibling('div')

Here are the main DOM navigation attributes:
- **node.children**: iterator for all children of a node
- **node.descendants**: iterator for all of a tag’s children, recursively: its direct children, the children of its direct children, and so on
- **node.parent**: parent of the current node
- **node.parents**: iterator for all of an element’s parents to the root of the DOM tree
- **node.next_sibling / node.previous_sibling**: navigate between page elements that are on the same level of the DOM tree
- **node.next_element / node.previous_element**: navigate between page elements in the DOM tree, regardless of the level
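
A minimal sketch of these navigation attributes follows. The snippet is hypothetical and deliberately contains no whitespace between tags, so every sibling is a tag; `html.parser` is assumed only to avoid needing lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet for illustration only (not part of simple_page.html).
snippet = "<div><h2>Title</h2><p>One <b>bold</b></p><p>Two</p></div>"
demo = BeautifulSoup(snippet, "html.parser")
first_p = demo.p

# Direct children of the div
children = [child.name for child in demo.div.children]

# All nested tags in document order (text nodes filtered out)
descendants = [d.name for d in demo.div.descendants if isinstance(d, Tag)]

# Upward navigation; the top-level BeautifulSoup object is named '[document]'
parent = first_p.parent.name
parents = [p.name for p in first_p.parents]

# Same-level navigation
next_sibling = first_p.next_sibling.name
previous_sibling = first_p.previous_sibling.name

# Document-order navigation: the next element is the text inside the <p>
next_element = first_p.next_element

print(children, descendants, parents)
```

On a real page the whitespace between tags is parsed as text nodes, so `next_sibling` often returns a whitespace string rather than a tag. Methods such as `find_next_sibling('div')` and `find_previous_sibling('div')` skip over those text nodes.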