# Web Scraping Part 2.0
### Parse HTML with Beautiful Soup

Part 2 builds on Part 1 and handles page content other than HTML tables.

This tutorial uses the following Python packages:

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/): parses HTML into a tree that can be navigated and searched.

Requests: fetches the web page with an HTTP GET request.

re: regular expression operations from the Python standard library, used here to filter tags.

Pages used in this tutorial: 
[Gapminder](https://www.gapminder.org/data/) 

> Get Libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup


> Assign the URL to a variable and fetch the page. `requests.get()` returns a `Response` object.

In [2]:
url = input('Enter URL: ')
html_scraped = requests.get(url)
type(html_scraped)

requests.models.Response
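
The cell above assumes the request succeeded. As a minimal sketch, a status check and a timeout can be added so that a failed request raises an error instead of passing an error page to the parser. The helper name `fetch_html` and the 10-second default are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text


# Example usage (requires network access):
# html = fetch_html('https://www.gapminder.org/data/')
```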

> Create a BeautifulSoup parse tree.

| HTML Parsers |
| ------------ |
| html.parser |
| html5lib |

In [3]:
soup = BeautifulSoup(html_scraped.text, 'html.parser')

> View data: `prettify()` returns the parse tree as an indented string that shows how the tags are nested.

In [4]:
preti = soup.prettify()
# print(preti)  # uncomment to view the formatted HTML

Beautiful Soup represents the parse tree with four kinds of Python objects:

- Tag
- NavigableString
- BeautifulSoup
- Comment
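
To make the four object types concrete, here is a small sketch that parses a made-up HTML snippet (not the Gapminder page) and prints the type of each object. The variable name `demo_soup` is chosen only to avoid clashing with `soup` above.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up HTML snippet, used only for illustration.
html = '<p>Hello<!-- a note --></p>'
demo_soup = BeautifulSoup(html, 'html.parser')

p_tag = demo_soup.p                # Tag
text_node = p_tag.contents[0]      # NavigableString
comment_node = p_tag.contents[1]   # Comment (a subclass of NavigableString)

for obj in (demo_soup, p_tag, text_node, comment_node):
    print(type(obj).__name__)
```

Because `Comment` is a subclass of `NavigableString`, code that collects text nodes may need to check for comments explicitly.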

> TAGS

Some common methods to navigate the BeautifulSoup parse tree based on tags.

| Approach | Description |
| -------- | ----------- |
| Dot Operator | soup.p |
| String Filter | soup.find_all('p') |
| List Filter | soup.find_all(['p', 'link']) |
| Regular Expressions | soup.find_all(re.compile('^b')) matches tag names; a pattern can also filter strings or CSS classes |
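
The approaches in the table can be compared side by side on a small, made-up document. This is a sketch under that assumption; the HTML and the name `demo_soup` are illustrative and unrelated to the scraped page.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML, used only for illustration.
html = """
<html><head><link rel="stylesheet" href="style.css"></head>
<body>
  <h3>Data documentation</h3>
  <p class="intro">First paragraph</p>
  <p>Second paragraph with <b>bold</b> text</p>
</body></html>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

first_p = demo_soup.p                            # dot operator: first <p> only (a Tag)
all_p = demo_soup.find_all('p')                  # string filter: every <p>
p_and_link = demo_soup.find_all(['p', 'link'])   # list filter: every <p> or <link>
h_tags = demo_soup.find_all(re.compile('^h'))    # regex: tag names starting with "h"
intro = demo_soup.find_all(class_=re.compile('^int'))  # regex on a CSS class

print(first_p.get_text())
print([tag.name for tag in p_and_link])
print([tag.name for tag in h_tags])
```

Note that the regular expression `^h` matches `html` and `head` as well as `h3`, since it is tested against every tag name in the tree.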

dot operator -> bs4.element.Tag (first match only)

In [5]:
soup.h3

<h3>Data documentation</h3>

string filter -> list

In [6]:
soup.find_all('b')


[]

list filter -> list

In [7]:
soup.find_all(['h2', 'b'])

[]

navigate from each match to its parent tag -> list

In [8]:
btag = soup.find_all('b')
bpara = [ b.parent for b in btag]
bpara

[]

## Web Scraping with BeautifulSoup Part 2.1

Scraping and working with links to data:
- working with the parse tree and with string (text) values.
- extracting attribute values such as `href` with get().
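
As a brief sketch of how `get()` behaves, the example below uses a made-up list of links (the URLs and the name `demo_soup` are illustrative). `get()` returns the attribute value, or `None` when the tag has no such attribute, which matters when the result is later tested with `in`.

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only for illustration.
html = """
<ul>
  <li><a href="https://example.com/data/documentation/">Docs</a></li>
  <li><a name="anchor-only">No link here</a></li>
</ul>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

for a in demo_soup.find_all('a'):
    # get() returns the attribute value, or None when the attribute is absent.
    print(a.get_text(), '->', a.get('href'))

# Passing href=True restricts the search to <a> tags that have an href attribute.
hrefs = [a['href'] for a in demo_soup.find_all('a', href=True)]
print(hrefs)
```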

> Search for files on web page

In [9]:
import re

In [10]:
#soup.find_all(re.compile('^li'))

In [29]:
file_type='documentation'

In [None]:
for link in soup.find_all('li'):
    # href=True skips <a> tags without an href, so file_link is never None
    for a in link.find_all('a', href=True):
        file_link = a.get('href')
        if file_type in file_link:
            print(file_link)

In [32]:
# Collect the href of every <a> inside an <li>; href=True skips anchors without one
link_list = [a.get('href') for link in soup.find_all('li') for a in link.find_all('a', href=True)]

In [35]:
doc_link_list= [doc for doc in link_list if file_type in doc]
doc_link_list

['https://www.gapminder.org/data/documentation/',
 'https://www.gapminder.org/data/documentation/air-accident-risk-documentation/',
 'https://www.gapminder.org/data/documentation/gd009/',
 'https://www.gapminder.org/data/documentation/gd008/',
 'https://www.gapminder.org/data/documentation/leaded-gas-ban/',
 'https://www.gapminder.org/data/documentation/caries/',
 'https://www.gapminder.org/data/documentation/child-labour/',
 'https://www.gapminder.org/data/documentation/gd005/',
 'https://www.gapminder.org/data/documentation/co2/',
 'https://www.gapminder.org/data/documentation/death-penalty/',
 'https://www.gapminder.org/data/documentation/democracy-index/',
 'https://www.gapminder.org/data/documentation/drownings/',
 'https://www.gapminder.org/data/documentation/epovrate/',
 'https://www.gapminder.org/data/documentation/quint-income-tfr/',
 'https://www.gapminder.org/data/documentation/gd007/',
 'https://www.gapminder.org/data/documentation/gd001/',
 'https://www.gapminder.org/data/

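A compiled regular expression filters on tag names. Without an anchor, the pattern matches any tag whose name contains it; with `^`, it matches only names that begin with it. For example, `^b` matches every tag whose name starts with "b", such as `<b>`, `<body>`, and `<br>`.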

In [None]:
soup.find_all(re.compile('^b'))

In [None]:
soup.find_all(re.compile('^img')) 
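
To see which tag names a pattern actually matches, here is a small sketch on a made-up snippet (the HTML and the name `demo_soup` are illustrative). Adding `$` to the pattern limits the match to the exact tag name.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML, used only for illustration.
html = '<html><body><p>Some <b>bold</b> text<br><img src="a.png"></p></body></html>'
demo_soup = BeautifulSoup(html, 'html.parser')

# '^b' matches any tag name that starts with "b".
starts_with_b = [tag.name for tag in demo_soup.find_all(re.compile('^b'))]
# '^b$' matches only the tag named exactly "b".
only_b = [tag.name for tag in demo_soup.find_all(re.compile('^b$'))]

print(starts_with_b)  # ['body', 'b', 'br']
print(only_b)         # ['b']
```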