# Web Scraping Part 2.0
### Parse HTML with Beautiful Soup

Part 2 builds on Part 1 and extends the approach to data that is not stored in HTML tables.

This tutorial uses the following Python packages:

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/): parses HTML and XML into a tree that can be searched and navigated.

Requests: sends an HTTP GET request to fetch the web page.

re: Python's built-in module for regular expression operations.

Pages used in this tutorial: 
[Gapminder](https://www.gapminder.org/data/) 

> Get Libraries

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup


> Assign the URL to a variable and fetch the page; `requests.get()` returns a `Response` object.

In [3]:
url = input('Enter URL: ')
html_scraped = requests.get(url)
type(html_scraped)

requests.models.Response
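The cell above does not check whether the request succeeded. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP or connection errors."""
    # timeout keeps the call from hanging on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://www.gapminder.org/data/')` would return the same kind of text that `html_scraped.text` holds, and a malformed URL raises a `requests` exception instead of failing later during parsing.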

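> Create a BeautifulSoup parse tree. The second argument selects the parser: `html.parser` ships with Python, while `html5lib` must be installed separately.

| HTML Parsers |
| ------------ |
| html.parser |
| html5lib |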
    

In [4]:
soup = BeautifulSoup(html_scraped.text, 'html.parser')

> View data: `prettify()` returns the parse tree as a formatted string, with each tag on its own line and indented by nesting depth.

In [None]:
preti = soup.prettify()
print(preti)  # print() renders the line breaks instead of showing '\n' escapes
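To see what `prettify()` does without fetching a page, here is a small sketch on an inline snippet. The HTML string and the name `demo` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line snippet, not taken from the Gapminder page
html = "<div><p>Hi <b>there</b></p></div>"
demo = BeautifulSoup(html, "html.parser")

pretty = demo.prettify()   # a plain str, one tag or text node per line
print(pretty)
```

The result is an ordinary string, so it is useful for reading the page structure but not for searching; searches should run against the `BeautifulSoup` object itself.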

The parse tree is made up of four kinds of Python objects that can be searched:

- Tag
- NavigableString
- BeautifulSoup
- Comment
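The four object types listed above can be seen on a small example. This is a sketch on a hypothetical snippet; the variable names are illustrative.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical snippet, not taken from the Gapminder page
html = "<p>Hello <b>world</b><!-- a note --></p>"
demo = BeautifulSoup(html, "html.parser")

# <p> has three children: a text node, a <b> tag, and a comment
text_node, bold_tag, comment_node = demo.p.contents

print(isinstance(demo, BeautifulSoup))          # the whole document
print(isinstance(bold_tag, Tag))                # an element such as <b>
print(isinstance(text_node, NavigableString))   # text inside a tag
print(isinstance(comment_node, Comment))        # an HTML comment
```

`Comment` is a subclass of `NavigableString`, so code that processes text nodes may need to check for comments explicitly.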

> TAGS

Some common methods to navigate the BeautifulSoup parse tree based on tags.

| Approach | Example |
| -------- | ------- |
| Dot Operator | soup.p |
| String Filter | soup.find_all('p') |
| List Filter | soup.find_all(['p', 'link']) |
| Regular Expressions | a compiled pattern matched against tag names, strings, or CSS classes |
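The regular expression row covers more than tag names. A minimal sketch of matching on text and on CSS class follows, using a hypothetical snippet; the class names and link text are invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet, not taken from the Gapminder page
html = """
<ul>
  <li class="data-link"><a href="/data/co2/">CO2 emissions</a></li>
  <li class="data-note">Population notes</li>
  <li class="footer">Contact</li>
</ul>
"""
demo = BeautifulSoup(html, "html.parser")

# string= matches text nodes whose content contains "emissions"
text_hits = demo.find_all(string=re.compile("emissions"))

# class_= matches tags that have a class beginning with "data-"
class_hits = demo.find_all(class_=re.compile("^data-"))

print(text_hits)
print([tag["class"] for tag in class_hits])
```

Searching with `string=` alone returns the matching text nodes rather than the tags that contain them; use `.parent` on a result to reach the enclosing tag.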

dot operator -> `bs4.element.Tag` (the first matching tag only)

In [6]:
soup.h3

<h3>Data documentation</h3>

string filter -> `ResultSet`, a list of every matching tag (empty here because the page has no `<b>` tags)

In [7]:
soup.find_all('b')


[]

list filter -> `ResultSet` of tags matching any name in the list

In [8]:
soup.find_all(['h2', 'b'])

[]

other filters: `.parent` returns the tag that encloses each match

In [9]:
btag = soup.find_all('b')
bpara = [b.parent for b in btag]
bpara

[]
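Because the Gapminder page returns empty lists for these filters, the same calls are easier to follow on a snippet that does contain the tags. This sketch uses made-up HTML and illustrative variable names.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not taken from the Gapminder page
html = "<div><h2>Title</h2><p>Some <b>bold</b> text</p><p>More <b>bold</b></p></div>"
demo = BeautifulSoup(html, "html.parser")

first_p = demo.p                           # dot operator: the first <p> only
all_b = demo.find_all("b")                 # string filter: every <b>
h2_or_b = demo.find_all(["h2", "b"])       # list filter: every <h2> or <b>
parents = [b.parent.name for b in all_b]   # name of the tag enclosing each <b>

print(first_p)
print(len(all_b), [tag.name for tag in h2_or_b], parents)
```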

## Web Scraping with BeautifulSoup Part 2.1

Scraping and working with links to data:
- the parse tree and its string (text) format

- reading attribute values such as `href` with `get()`

> Search for links to files on the web page

In [11]:
import re

In [13]:
soup.find_all(re.compile('^a')) 

[<a class="link" href="/privacy/cookie-policy/">our cookie policy</a>,
 <a href="#"></a>,
 <a href="https://www.gapminder.org"><img class="logo" src="https://www.gapminder.org/wp-content/themes/gapminder2/images/gapminder-logo.svg" width="216"/></a>,
 <a href="#"></a>,
 <a href="https://www.gapminder.org/donations/">Donate</a>,
 <a href="https://www.gapminder.org/resources/">Resources</a>,
 <a href="https://www.gapminder.org/about/">About</a>,
 <a href="https://upgrader.gapminder.org/account/login/">Log in</a>,
 <a href="https://www.gapminder.org/donations/">Donate</a>,
 <a href="https://www.gapminder.org/resources/">Resources</a>,
 <a href="https://www.gapminder.org/about/">About</a>,
 <a href="https://upgrader.gapminder.org/account/login/">Log in</a>,
 <a href="https://www.gapminder.org/data/">Download the data</a>,
 <a href="https://www.gapminder.org/data/doubt/">Doubt</a>,
 <a href="https://www.gapminder.org/data/geo/">Geography</a>,
 <a href="https://www.gapminder.org/data/geo/cha

In [None]:
file_type='documentation'

In [None]:
for link in soup.find_all('li'):
    for a in link.find_all('a'):
        file_link = a.get('href')
        # get() returns None when the tag has no href attribute
        if file_link and file_type in file_link:
            print(file_link)

https://www.gapminder.org/data/documentation/
https://www.gapminder.org/data/documentation/air-accident-risk-documentation/
https://www.gapminder.org/data/documentation/gd009/
https://www.gapminder.org/data/documentation/gd008/
https://www.gapminder.org/data/documentation/leaded-gas-ban/
https://www.gapminder.org/data/documentation/caries/
https://www.gapminder.org/data/documentation/child-labour/
https://www.gapminder.org/data/documentation/gd005/
https://www.gapminder.org/data/documentation/co2/
https://www.gapminder.org/data/documentation/death-penalty/
https://www.gapminder.org/data/documentation/democracy-index/
https://www.gapminder.org/data/documentation/drownings/
https://www.gapminder.org/data/documentation/epovrate/
https://www.gapminder.org/data/documentation/quint-income-tfr/
https://www.gapminder.org/data/documentation/gd007/
https://www.gapminder.org/data/documentation/gd001/
https://www.gapminder.org/data/documentation/gini/
https://www.gapminder.org/data/documentation/g
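The loop above can also be written so that tags without an `href` are skipped by the search itself, and relative links are resolved to absolute URLs. A minimal sketch, assuming a hypothetical snippet and the base URL of the page used in this tutorial:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.gapminder.org/data/"

# Hypothetical snippet that mixes relative, absolute, and missing hrefs
html = """
<ul>
  <li><a href="/data/documentation/gd001/">GD001</a></li>
  <li><a href="https://www.gapminder.org/data/geo/">Geography</a></li>
  <li><a>No link here</a></li>
</ul>
"""
demo = BeautifulSoup(html, "html.parser")

doc_links = []
# href=True keeps only <a> tags that actually have an href attribute
for a in demo.find_all("a", href=True):
    full_url = urljoin(base_url, a["href"])   # resolve relative paths
    if "documentation" in full_url:
        doc_links.append(full_url)

print(doc_links)
```

Collecting the links in a list rather than printing them makes it straightforward to pass them to a later download step.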

A compiled pattern finds tags whose names contain the given letters. Prefix the pattern with `^` to find only tags whose names begin with those letters.

In [None]:
soup.find_all(re.compile('^b'))

[<body class="page-template-default page page-id-3137 page-parent">
 <svg focusable="false" height="0" role="none" style="visibility: hidden; position: absolute; left: -9999px; overflow: hidden;" viewbox="0 0 0 0" width="0" xmlns="http://www.w3.org/2000/svg">
 <defs>
 <filter id="wp-duotone-dark-grayscale">
 <fecolormatrix color-interpolation-filters="sRGB" type="matrix" values="
 						.299 .587 .114 0 0
 						.299 .587 .114 0 0
 						.299 .587 .114 0 0
 						.299 .587 .114 0 0
 					"></fecolormatrix>
 <fecomponenttransfer color-interpolation-filters="sRGB">
 <fefuncr tablevalues="0 0.49803921568627" type="table"></fefuncr>
 <fefuncg tablevalues="0 0.49803921568627" type="table"></fefuncg>
 <fefuncb tablevalues="0 0.49803921568627" type="table"></fefuncb>
 <fefunca tablevalues="1 1" type="table"></fefunca>
 </fecomponenttransfer>
 <fecomposite in2="SourceGraphic" operator="in"></fecomposite>
 </filter>
 </defs>
 </svg>
 <svg focusable="false" height="0" role="none" style="visibili

In [None]:
soup.find_all(re.compile('^img')) 

[<img class="logo" src="https://www.gapminder.org/wp-content/themes/gapminder2/images/gapminder-logo.svg" width="216"/>,
 <img alt="Gapminder logo" src="https://www.gapminder.org/wp-content/themes/gapminder2/images/gapminder-logo.svg" width="350"/>,
 <img src="https://www.gapminder.org/wp-content/themes/gapminder2/images/icons/tw.svg"/>,
 <img src="https://www.gapminder.org/wp-content/themes/gapminder2/images/icons/ig.svg"/>,
 <img src="https://www.gapminder.org/wp-content/themes/gapminder2/images/icons/fb.svg"/>,
 <img src="https://www.gapminder.org/wp-content/themes/gapminder2/images/icons/li.svg"/>,
 <img src="https://www.gapminder.org/wp-content/themes/gapminder2/images/icons/yt.svg"/>]
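The effect of anchoring a tag-name pattern is easiest to see side by side. This sketch uses a hypothetical snippet; the helper `names` is illustrative and not part of Beautiful Soup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet, not taken from the Gapminder page
html = "<body><b>bold</b><table><tr><td>cell</td></tr></table><br/></body>"
demo = BeautifulSoup(html, "html.parser")


def names(pattern):
    """Return the names of all tags whose name matches `pattern`."""
    return [tag.name for tag in demo.find_all(re.compile(pattern))]


contains_b = names("b")    # "b" anywhere in the name
starts_b = names("^b")     # name begins with "b"
exact_b = names("^b$")     # name is exactly "b"

print(contains_b, starts_b, exact_b)
```

This is why `re.compile('^b')` on the Gapminder page returns `<body>`: the pattern matches any tag name that starts with `b`. Passing the plain string `'b'` to `find_all` matches only `<b>` tags.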