# Web Scraping Part 2.0
### Parse HTML with Beautiful Soup

Part 2 builds on Part 1 by handling page content other than HTML tables.

This tutorial uses the following Python packages:

- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/): parses HTML source into a tree of Python objects that can be searched and navigated.

- Requests: fetches the web page with an HTTP GET request.

- re: Python's built-in regular expression module, used here to build search filters.


Pages used in this tutorial: 

[Gapminder](https://www.gapminder.org/data/) 

[Gutenberg: Top 100 EBooks](https://www.gutenberg.org/browse/scores/top#books-last1)

> Get Libraries

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup




> Assign the URL to a variable and fetch the page. `requests.get()` returns a `Response` object.

In [3]:
url = input('Enter URL: ')
html_scraped = requests.get(url)
type(html_scraped)

requests.models.Response

> Create a BeautifulSoup parse tree.

| HTML Parsers |
| ------------ |
| html.parser |
| html5lib |
    

In [4]:
soup = BeautifulSoup(html_scraped.text, 'html.parser')

> View data: `prettify()` returns the parse tree as a formatted string, with one tag per line indented by nesting depth.

In [5]:
preti = soup.prettify()
# preti  # uncomment to display the formatted source (can be very long)
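
Because a real page can produce thousands of lines, it can help to look at only the start of the prettified string. The sketch below assumes a small made-up HTML snippet (`sample_html`) in place of the fetched page; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a fetched page.
sample_html = "<html><body><h3>Data documentation</h3><p>Hello</p></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns a str: one tag or text node per line, indented by depth.
pretty = sample_soup.prettify()
print(type(pretty).__name__)

# Show only the first few lines instead of the whole document.
for line in pretty.splitlines()[:5]:
    print(line)
```

Slicing the result of `splitlines()` keeps the notebook output short while still showing how the nesting is laid out.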

Beautiful Soup builds the parse tree from four kinds of Python objects:

- Tag
- NavigableString
- BeautifulSoup
- Comment
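
To see the four object types side by side, here is a minimal sketch on a made-up snippet (`sample_html` is an assumption, not part of the scraped page). It prints the class name of the soup itself, a tag, a text node, and an HTML comment.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet containing a tag, plain text, and a comment.
sample_html = "<p>Some <b>bold</b> text<!-- a hidden note --></p>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

p_tag = sample_soup.p               # a Tag
first_child = p_tag.contents[0]     # the text "Some " (NavigableString)
last_child = p_tag.contents[-1]     # the HTML comment (Comment)

print(type(sample_soup).__name__)
print(type(p_tag).__name__)
print(type(first_child).__name__)
print(type(last_child).__name__)
```

`Comment` is a special kind of `NavigableString`, which is why comments can show up when iterating over a tag's text nodes.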

> TAGS

Some common methods to navigate the BeautifulSoup parse tree based on tags.

| Approach | Description |
| -------- | ----------- |
| Dot Operator | soup.p |
| String Filter | soup.find_all('p') |
| List Filter | soup.find_all(['p', 'link']) |
| Regular Expressions | soup.find_all(re.compile('^b')); also usable on strings and CSS classes |
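
The regular-expression row covers more than tag names. As a rough sketch, assuming a made-up list snippet (`sample_html`) with invented class names, a compiled pattern can be passed to the `string` and `class_` arguments of `find_all()`:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet; the class names are invented for illustration.
sample_html = """
<ul>
  <li class="doc-link">CO2 documentation</li>
  <li class="doc-note">Gini documentation</li>
  <li class="menu">Home</li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Match text nodes that contain the word "documentation".
text_hits = sample_soup.find_all(string=re.compile("documentation"))

# Match <li> tags whose CSS class starts with "doc-".
class_hits = sample_soup.find_all("li", class_=re.compile("^doc-"))

print([str(t) for t in text_hits])
print([li["class"] for li in class_hits])
```

Note that `class_` has a trailing underscore because `class` is a reserved word in Python.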

Dot operator -> returns the first matching `bs4.element.Tag`

In [6]:
soup.h3

<h3>Data documentation</h3>

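String filter -> returns a list-like `ResultSet` of every match (empty when nothing matches, as here, because the page has no `<b>` tags)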

In [7]:
soup.find_all('b')


[]

List filter -> returns a `ResultSet` of tags matching any name in the list

In [8]:
soup.find_all(['h2', 'b'])

[]

Navigating from matches -> collect the parent tag of each `<b>`

In [9]:
btag = soup.find_all('b')
bpara = [b.parent for b in btag]
bpara

[]
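
The page scraped above has no `<b>` tags, so several of these cells return empty results. To show what the same calls return when there are matches, here is a small sketch on a made-up snippet (`sample_html` and the variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with a heading and two bold words.
sample_html = """
<div>
  <h2>Heading</h2>
  <p>First <b>bold</b> paragraph</p>
  <p>Second <b>strong</b> paragraph</p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_p = sample_soup.p                        # dot operator: first match only
all_b = sample_soup.find_all("b")              # string filter: every <b>
h2_or_b = sample_soup.find_all(["h2", "b"])    # list filter: either tag name
b_parents = [b.parent.name for b in all_b]     # name of each <b>'s parent tag

print(first_p.get_text())
print([b.get_text() for b in all_b])
print([t.name for t in h2_or_b])
print(b_parents)

# A missing tag gives None with the dot operator and an empty result with find_all.
print(sample_soup.table, sample_soup.find_all("table"))
```

The difference in the last line matters in practice: chaining attribute access onto `sample_soup.table` would raise an error, while looping over an empty `find_all()` result simply does nothing.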

## Web Scraping with BeautifulSoup Part 2.1

Scraping and working with links to data:
- working with both the parse tree and its string (text) form.

- reading tag attributes such as `href` with `get()` and filtering on them.

> Search for files on the web page

Find the links that point to pages with downloadable data.

In [10]:
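# Get list of documentation pages

file_type = 'documentation'

for item in soup.find_all('li'):
    for a in item.find_all('a'):
        file_link = a.get('href')  # None if the <a> tag has no href
        # Skip anchors without an href, then keep links containing file_type
        if file_link and file_type in file_link:
            print(file_link)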

https://www.gapminder.org/data/documentation/
https://www.gapminder.org/data/documentation/air-accident-risk-documentation/
https://www.gapminder.org/data/documentation/gd009/
https://www.gapminder.org/data/documentation/gd008/
https://www.gapminder.org/data/documentation/leaded-gas-ban/
https://www.gapminder.org/data/documentation/caries/
https://www.gapminder.org/data/documentation/child-labour/
https://www.gapminder.org/data/documentation/gd005/
https://www.gapminder.org/data/documentation/co2/
https://www.gapminder.org/data/documentation/death-penalty/
https://www.gapminder.org/data/documentation/democracy-index/
https://www.gapminder.org/data/documentation/drownings/
https://www.gapminder.org/data/documentation/epovrate/
https://www.gapminder.org/data/documentation/quint-income-tfr/
https://www.gapminder.org/data/documentation/gd007/
https://www.gapminder.org/data/documentation/gd001/
https://www.gapminder.org/data/documentation/gini/
https://www.gapminder.org/data/documentation/g
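
The same link filter can be tried offline. The sketch below is one possible variant, not the notebook's own method: it assumes a made-up snippet (`sample_html`) that mixes absolute links, a relative link, and an anchor with no `href`, and it uses `urljoin` from the standard library to turn relative paths into full URLs. The `base_url` value is the page used in this tutorial.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.gapminder.org/data/"   # page the links would come from

# Hypothetical snippet; the relative and href-less anchors are invented.
sample_html = """
<ul>
  <li><a href="https://www.gapminder.org/data/documentation/co2/">CO2</a></li>
  <li><a href="documentation/gini/">Gini</a></li>
  <li><a name="no-href">Anchor without a link</a></li>
  <li><a href="/about/">About</a></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

doc_links = []
for a in sample_soup.select("li a"):      # CSS selector: <a> tags inside <li>
    href = a.get("href")                  # None when the attribute is missing
    if href and "documentation" in href:
        doc_links.append(urljoin(base_url, href))   # resolve relative paths

print(doc_links)
```

Checking `href` for `None` before the substring test avoids a `TypeError` on anchors that have no `href` attribute.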

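Find tags whose names contain certain letters. A leading `^` restricts the match to names that begin with the given letters.

> `re.compile()` builds a regular expression; passing it to `find_all()` matches it against each tag name, so `'^b'` finds every tag whose name begins with 'b' (for example `<body>`).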
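
To make the effect of the `^` anchor visible, this small sketch compares an anchored and an unanchored pattern on a made-up snippet (`sample_html` is an assumption for illustration) and prints only the matching tag names:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet with several tag names that contain the letter "b".
sample_html = "<html><body><p>Text <b>bold</b><br/>more</p><table></table></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

starts_with_b = re.compile("^b")   # tag name begins with "b"
contains_b = re.compile("b")       # tag name has a "b" anywhere

print([t.name for t in sample_soup.find_all(starts_with_b)])
print([t.name for t in sample_soup.find_all(contains_b)])
```

Printing `t.name` rather than the tags themselves keeps the output short; on a real page a match on `<body>` returns nearly the whole document.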

In [12]:
import re

In [13]:
soup.find_all(re.compile('^b'))

[<body class="page-template-default page page-id-3137 page-parent">
 <svg focusable="false" height="0" role="none" style="visibility: hidden; position: absolute; left: -9999px; overflow: hidden;" viewbox="0 0 0 0" width="0" xmlns="http://www.w3.org/2000/svg">
 <defs>
 <filter id="wp-duotone-dark-grayscale">
 <fecolormatrix color-interpolation-filters="sRGB" type="matrix" values="
 						.299 .587 .114 0 0
 						.299 .587 .114 0 0
 						.299 .587 .114 0 0
 						.299 .587 .114 0 0
 					"></fecolormatrix>
 <fecomponenttransfer color-interpolation-filters="sRGB">
 <fefuncr tablevalues="0 0.49803921568627" type="table"></fefuncr>
 <fefuncg tablevalues="0 0.49803921568627" type="table"></fefuncg>
 <fefuncb tablevalues="0 0.49803921568627" type="table"></fefuncb>
 <fefunca tablevalues="1 1" type="table"></fefunca>
 </fecomponenttransfer>
 <fecomposite in2="SourceGraphic" operator="in"></fecomposite>
 </filter>
 </defs>
 </svg>
 <svg focusable="false" height="0" role="none" style="visibili

In [14]:
soup.find_all(re.compile('^l'))

[<link href="https://gmpg.org/xfn/11" rel="profile"/>,
 <link href="https://fonts.googleapis.com/css2?family=Rubik:wght@300;400;500;700&amp;display=swap" rel="stylesheet"/>,
 <link href="https://www.gapminder.org/icn/favicon.ico" rel="shortcut icon"/>,
 <link href="https://www.gapminder.org/icn/apple-touch-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>,
 <link href="https://www.gapminder.org/icn/apple-touch-icon-114x114.png" rel="apple-touch-icon" sizes="114x114"/>,
 <link href="https://www.gapminder.org/icn/apple-touch-icon-72x72.png" rel="apple-touch-icon" sizes="72x72"/>,
 <link href="https://www.gapminder.org/icn/apple-touch-icon-144x144.png" rel="apple-touch-icon" sizes="144x144"/>,
 <link href="https://www.gapminder.org/icn/apple-touch-icon-60x60.png" rel="apple-touch-icon" sizes="60x60"/>,
 <link href="https://www.gapminder.org/icn/apple-touch-icon-120x120.png" rel="apple-touch-icon" sizes="120x120"/>,
 <link href="https://www.gapminder.org/icn/apple-touch-icon-76x76.png