# Basic Webpage Scraping
Web scraping consists of two steps: crawling (downloading pages) and parsing (extracting data from them). In this tutorial, we focus on parsing HTML. Beautiful Soup is a powerful library for processing static HTML. More details can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/

To keep things simple, we will use an example page from W3Schools: https://www.w3schools.com/howto/tryit.asp?filename=tryhow_css_example_website

In [1]:
import requests

Since this web page does not exist, we will download a copy of the file from our internal cloud storage instead.

In [2]:
url = 'https://drive.google.com/uc?export=download&id=1vSKo7oQJYDTul4IBPOgK6F2PBW3-8kk9'

In [3]:
page = requests.get(url)
html = page.text
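
A slightly more defensive version of the download step might look like the sketch below. The helper name `fetch_html` and the 10-second default timeout are illustrative assumptions, not part of the original notebook; `timeout` and `raise_for_status()` are standard `requests` features.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    # Without a timeout, requests may wait indefinitely on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of silently parsing an error page.
    response.raise_for_status()
    return response.text
```

With the `url` defined earlier, `html = fetch_html(url)` would stand in for the two lines of the cell above.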

In [4]:
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n<title>Page Title</title>\n<meta charset="UTF-8">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<style>\n* {\n  box-sizing: border-box;\n}\n\n/* Style the body */\nbody {\n  font-family: Arial, Helvetica, sans-serif;\n  margin: 0;\n}\n\n/* Header/logo Title */\n.header {\n  padding: 80px;\n  text-align: center;\n  background: #1abc9c;\n  color: white;\n}\n\n/* Increase the font size of the heading */\n.header h1 {\n  font-size: 40px;\n}\n\n/* Sticky navbar - toggles between relative and fixed, depending on the scroll position. It is positioned relative until a given offset position is met in the viewport - then it "sticks" in place (like position:fixed). The sticky value is not supported in IE or Edge 15 and earlier versions. However, for these versions the navbar will inherit default position */\n.navbar {\n  overflow: hidden;\n  background-color: #333;\n  position: sticky;\n  position: -webkit-sticky;\n  top: 0;\n}\

In [6]:
from bs4 import BeautifulSoup
from bs4.element import Tag
from IPython.display import HTML

In [7]:
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   Page Title
  </title>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <style>
   * {
  box-sizing: border-box;
}

/* Style the body */
body {
  font-family: Arial, Helvetica, sans-serif;
  margin: 0;
}

/* Header/logo Title */
.header {
  padding: 80px;
  text-align: center;
  background: #1abc9c;
  color: white;
}

/* Increase the font size of the heading */
.header h1 {
  font-size: 40px;
}

/* Sticky navbar - toggles between relative and fixed, depending on the scroll position. It is positioned relative until a given offset position is met in the viewport - then it "sticks" in place (like position:fixed). The sticky value is not supported in IE or Edge 15 and earlier versions. However, for these versions the navbar will inherit default position */
.navbar {
  overflow: hidden;
  background-color: #333;
  position: sticky;
  position: -webkit-sticky;
  top: 0;
}

/* Style the nav

## Beautiful Soup DOM Tree
Beautiful Soup's structure is based on the Document Object Model (DOM), which all web browsers use. The DOM is a tree of all the elements in a webpage. Each element node consists of:
- tag
- innerHTML/outerHTML
- id
- attributes
- parent and children
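
As a rough illustration, each of the parts listed above can be read from a Beautiful Soup node. The one-element document below is made up for this sketch, and it uses Python's built-in `html.parser` so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical one-element document, used only for illustration.
snippet = '<div id="intro" class="box"><p>Hello</p></div>'
# "html.parser" ships with Python; the tutorial itself uses "lxml".
demo = BeautifulSoup(snippet, "html.parser")
node = demo.div

print(node.name)               # tag
print(str(node))               # outerHTML
print(node.decode_contents())  # innerHTML
print(node.get("id"))          # id
print(node.attrs)              # all attributes as a dict
print(node.parent.name)        # parent (here the BeautifulSoup object itself)
print([child.name for child in node.children])  # children
```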

### Traversing a Simple HTML DOM Tree

In our example, the structure is as follows:

```
html
+-- head
|   +-- title
|   +-- meta
|   +-- meta
|   +-- style
+-- body
    +-- div
    |   +-- h1
    |   +-- p
    |       +-- b
    +-- div
    |   +-- a
    |   +-- a
    |   +-- a
    |   +-- a
    +-- div
    |   +-- div
    |   |   +-- h2
    |   |   +-- h5
    |   |   +-- ...
    |   +-- div
    |       +-- h2
    |       +-- h5
    |       +-- ...
    +-- div
        +-- h2
```

In [8]:
# 'title' is the tag of one of the element nodes in the example.
# we can refer to a node by using its tag name
type(soup.title)

bs4.element.Tag

In [None]:
soup.head.style

In [None]:
# we can get the tag name of a node with 'name'
soup.title.name

In [None]:
# we can get outerHTML by converting node to string
str(soup.title)

In [None]:
# we can get the text of a node with 'string' (this equals innerHTML only when the node contains nothing but text)
soup.title.string
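
The sketch below shows how `string` behaves once a node holds more than plain text, and two alternatives. The snippet is a made-up example, and `html.parser` is used so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration; "html.parser" is the stdlib parser.
demo = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

print(demo.b.string)             # 'world': <b> has a single text child
print(demo.p.string)             # None: <p> has two children (text and <b>)
print(demo.p.get_text())         # 'Hello world': all text, markup dropped
print(demo.p.decode_contents())  # 'Hello <b>world</b>': the inner markup
```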

In [None]:
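# 'id' is an ordinary attribute, so read it with get() (None in this example,
# because <title> has no id); soup.title.id would look for a child <id> tag instead
soup.title.get('id')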

In [None]:
# we can get attribute values with 'attrs'
soup.title.attrs
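
Dotted access on a node looks up child tags, not attributes, which is easy to trip over. The sketch below contrasts the ways of reading attributes, using a made-up snippet and the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration; "html.parser" is the stdlib parser.
demo = BeautifulSoup('<h1 id="main" class="big bold">Title</h1>', "html.parser")
h1 = demo.h1

print(h1.id)            # None: dotted access looks for a child <id> tag
print(h1.get("id"))     # 'main': get() reads the attribute, None if missing
print(h1["id"])         # 'main': indexing raises KeyError if missing
print(h1.get("class"))  # ['big', 'bold']: class is multi-valued, so a list
print(h1.get("href"))   # None: the attribute is absent
```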

In [None]:
# getting the parent node with 'parent'
soup.title.parent.name

In [None]:
# 'children' is an iterator over the direct children; list() makes them visible
list(soup.title.children)

## DOM Structure

In [None]:
def walk_dom(node, depth=None, indent='', only_tag=True):
    """Print the tree under `node`; `depth` limits the levels shown (None = all)."""
    # skip text, comments, and other non-tag nodes unless asked to show them
    if only_tag and not isinstance(node, Tag):
        return

    print('{}{} : {}'.format(indent, node.name, type(node)))
    if isinstance(node, Tag):
        if node.attrs:
            print(indent, '>>', node.attrs)
        if depth is None or depth > 1:
            # pass None through unchanged; otherwise one level fewer remains
            next_depth = None if depth is None else depth - 1
            for c in node.children:
                walk_dom(c, next_depth, indent + '    ', only_tag)

In [None]:
walk_dom(soup.html, depth=2, only_tag=False)

In [None]:
walk_dom(soup.html)

In [None]:
walk_dom(soup.head)

In [None]:
body_text = str(soup.body)
body_text[:300]

In [None]:
HTML(body_text)

In [None]:
walk_dom(soup.body)

In [None]:
a = soup.a
a

In [None]:
a.attrs

In [None]:
a.get('href')

In [None]:
soup.div

## Finding Nodes

In [None]:
all_div = soup.find_all('div')

In [None]:
for n, div in enumerate(all_div):
    print('-- {} --'.format(n))
    print(div)

In [None]:
div8 = all_div[8]
HTML(str(div8))

In [None]:
walk_dom(div8, depth=2)

In [None]:
div8.attrs

In [None]:
div8.get('class')

In [None]:
div8.find_all('div')

In [None]:
soup.find(id='more_text')

In [None]:
soup.find_all(attrs={'class': 'fakeimg'})

In [None]:
soup.find_all(attrs={'style': 'height:60px;'})
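
The search calls used in this section can be compared side by side on a small document. This is a minimal sketch: the snippet, its ids, and its class names are invented for illustration, and it uses the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration; ids and classes are made up.
snippet = """
<div class="card" id="first"><a href="/a">A</a></div>
<div class="card wide"><a href="/b">B</a></div>
<div class="note" style="height:60px;">C</div>
"""
demo = BeautifulSoup(snippet, "html.parser")

by_tag = demo.find_all("div")                    # every <div>
by_class = demo.find_all("div", class_="card")   # matches any one of a node's classes
by_id = demo.find(id="first")                    # first match, or None
by_attr = demo.find_all(attrs={"style": "height:60px;"})  # exact attribute value
limited = demo.find_all("a", limit=1)            # stop after one result
missing = demo.find(id="nope")                   # None, not an exception

print(len(by_tag), len(by_class), by_id["id"], by_attr[0].get_text(), missing)
```

`find` returns a single node or `None`, while `find_all` always returns a list, so a result from `find` is worth checking before it is used.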

## CSS Selector

In [None]:
soup.select('p')

In [None]:
soup.select('#more_text')

In [None]:
soup.select('.fakeimg')

In [None]:
soup.select('#my_photo.fakeimg')

In [None]:
soup.select('h2')

In [None]:
soup.select('div div h2')

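Commonly used CSS selector patterns include:
- **tag**: select nodes with the specific *tag*, e.g. `div` for nodes with tag 'div'
- **.class**: select nodes with the specific *class*
- **#id**: select the node with the specific *id*
- **tag[attr]**: select nodes with the specific *tag* that have the attribute *attr*
- **A B**: select nodes matching *B* that are descendants of a node matching *A*, e.g. `div div h2`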
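
The selector patterns above can be tried on a small document, as in the sketch below. The snippet is invented for illustration and uses the built-in `html.parser`. `select_one` returns the first match or `None`, while `select` returns a list.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration; ids and classes are made up.
snippet = """
<div id="main" class="row">
  <div class="col"><h2>Inner</h2><a href="/x">link</a><a>no href</a></div>
</div>
<h2>Outer</h2>
"""
demo = BeautifulSoup(snippet, "html.parser")

print([t.get_text() for t in demo.select("h2")])          # by tag
print([t.get("class") for t in demo.select(".col")])      # by class
print(demo.select_one("#main").name)                      # by id
print([t.get_text() for t in demo.select("a[href]")])     # tag that has an attribute
print([t.get_text() for t in demo.select("div div h2")])  # descendant of nested divs
print(demo.select("#main.col"))  # empty: id and class must sit on the same node
```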

## Advanced DOM Walk

In [None]:
walk_dom(div8.parent, depth=2)

In [None]:
div8.find_previous_sibling('div')

Here is a list of DOM navigation attributes:
- **node.children**: iterator for all children of a node
- **node.descendants**: iterator for all of a tag’s children, recursively: its direct children, the children of its direct children, and so on
- **node.parent**: parent of the current node
- **node.parents**: iterator for all of an element’s ancestors, up to the root of the DOM tree
- **node.next_sibling / node.previous_sibling**: navigate between page elements that are on the same level of the DOM tree
- **node.next_element / node.previous_element**: navigate between page elements in the DOM tree, regardless of the level
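
The navigation attributes above are sketched below on a tiny list. The snippet is made up for illustration and uses the built-in `html.parser`. Note that whitespace between tags is itself a text node, so `next_sibling` may return a newline rather than the next tag; `find_next_sibling` skips to the next matching tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet for illustration; the newline between items is deliberate.
snippet = "<ul><li>one</li>\n<li>two</li></ul>"
demo = BeautifulSoup(snippet, "html.parser")
first = demo.li

print([c.name for c in demo.ul.children if isinstance(c, Tag)])     # direct child tags
print([d.name for d in demo.ul.descendants if isinstance(d, Tag)])  # all nested tags
print(first.parent.name)                         # 'ul'
print([p.name for p in first.parents])           # ancestors up to the document root
print(repr(first.next_sibling))                  # the newline text node, not a tag
print(first.find_next_sibling("li").get_text())  # 'two'
print(repr(first.next_element))                  # the text inside the first <li>
```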