# Web Scraping Part 2.0
### Parse HTML with Beautiful Soup

Part 2 builds on Part 1 and shows how to extract data other than HTML tables.

This tutorial uses the following Python packages:

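- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/): parses the HTML source into a tree that can be searched and navigated.

- Requests: fetches the web page with an HTTP GET request.

- re: regular expression operations, used here to filter tags by name.

- pandas: stores the scraped results in a DataFrame.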


Pages used in this tutorial: 

[Gapminder](https://www.gapminder.org/data/) 

[Gutenberg: Top 100 EBooks](https://www.gutenberg.org/browse/scores/top#books-last1)

> Get Libraries

In [18]:
import pandas as pd
pd.set_option('display.max_rows', 100)
import requests
from bs4 import BeautifulSoup
import urllib3

> Assign the URL to a variable and fetch the page. `requests.get()` returns a `Response` object.

In [19]:
url = input('Enter URL: ')
html_scraped = requests.get(url)
type(html_scraped)

requests.models.Response
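
The cell above does not check whether the request succeeded. A minimal sketch of a more defensive fetch follows, assuming a hypothetical helper name `fetch_html` and an arbitrary 10-second timeout; neither comes from the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page and return its body as text.

    Hypothetical helper: the name and the default timeout are
    illustrative choices, not part of the original tutorial.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently parsing an error page.
    response.raise_for_status()
    return response.text
```

The returned string can be passed to `BeautifulSoup` in place of `html_scraped.text`.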

> Create a BeautifulSoup parse tree.

| HTML Parsers |
| ------------ |
| html.parser |
| html5lib |

In [20]:
soup = BeautifulSoup(html_scraped.text, 'html.parser')

> View data: `prettify()` returns the parse tree as an indented string, with each tag on its own line.

In [21]:
preti=soup.prettify()
#preti
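
To try the parsing and pretty-printing steps without fetching a page, the sketch below uses a small inline HTML string. The string and the variable names `sample_html`, `sample_soup`, and `pretty` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for a fetched page
sample_html = "<html><body><h2 id='top'>Top 100</h2><p>Hello</p></body></html>"

# Build the parse tree with Python's built-in HTML parser
sample_soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns a string, not a new tree
pretty = sample_soup.prettify()
print(type(pretty).__name__)
print(pretty)
```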

Beautiful Soup represents the parse tree with four kinds of Python objects:

- Tag
- NavigableString
- BeautifulSoup
- Comment
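
The four object types can be seen in a small example. This is a sketch on a made-up HTML fragment; `sample_html` and the other variable names are illustrative only.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical fragment containing a tag, a text node, and a comment
sample_html = "<p>Plain text<!-- a comment --></p>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

p_tag = sample_soup.p                    # a Tag
text_node, comment_node = p_tag.contents  # the tag's two children

print(type(sample_soup).__name__)   # the whole document
print(type(p_tag).__name__)         # an element
print(type(text_node).__name__)     # text inside the element
print(type(comment_node).__name__)  # an HTML comment
```

`Comment` is a subclass of `NavigableString`, so code that needs to separate comments from ordinary text should check for `Comment` first.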

> TAGS

Some common methods to navigate the BeautifulSoup parse tree based on tags.

| Approach | Example |
| -------- | ------- |
| Dot Operator | `soup.p` |
| String Filter | `soup.find_all('p')` |
| List Filter | `soup.find_all(['p', 'link'])` |
| Regular Expressions | `soup.find_all(re.compile('^l'))`; also works on strings and CSS classes |
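
The approaches in the table can be compared side by side on a small document. This is a sketch on made-up HTML; the markup and variable names are assumptions for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page used only to demonstrate the filters
sample_html = """
<html><head><link href="style.css" rel="stylesheet"/></head>
<body>
  <h2 id="books">Books</h2>
  <p class="intro">First paragraph</p>
  <p>Second paragraph</p>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Dot operator: the first matching Tag, or None if there is no match
first_p = sample_soup.p

# String filter: every tag with exactly this name
all_p = sample_soup.find_all("p")

# List filter: every tag whose name is in the list
p_and_link = sample_soup.find_all(["p", "link"])

# Regular expression: matched against tag names (html, head, h2 here)
starts_with_h = sample_soup.find_all(re.compile("^h"))

# CSS class filter: class_ is used because class is a Python keyword
intro = sample_soup.find_all("p", class_="intro")

print(first_p.get_text())
print(len(all_p), len(p_and_link), len(intro))
print([tag.name for tag in starts_with_h])
```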

dot operator -> `bs4.element.Tag` (first match only, or `None` if nothing matches)

In [22]:
soup.h3

string filter -> list (`bs4.element.ResultSet`)

In [23]:
soup.find_all('b')


[]

list filter -> list

In [24]:
soup.find_all(['h2', 'b'])

[<h2 id="books-last1">Top 100 EBooks yesterday</h2>,
 <h2 id="authors-last1">Top 100 Authors yesterday</h2>,
 <h2 id="books-last7">Top 100 EBooks last 7 days</h2>,
 <h2 id="authors-last7">Top 100 Authors last 7 days</h2>,
 <h2 id="books-last30">Top 100 EBooks last 30 days</h2>,
 <h2 id="authors-last30">Top 100 Authors last 30 days</h2>]

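other filters: navigate from each match to its parent tag

In [25]:
# Collect the parent of every <b> tag (empty here because the page has no <b> tags)
btag = soup.find_all('b')
bpara = [b.parent for b in btag]
bpara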

[]
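
Because the page above contains no `<b>` tags, the result is empty. The sketch below repeats the same parent lookup on a made-up fragment that does contain one; the markup and the `sample_` names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with one <b> tag inside a paragraph
sample_html = "<div><p>Some <b>bold</b> text</p><p>No bold here</p></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_b_tags = sample_soup.find_all("b")

# .parent moves one level up the tree, from <b> to its enclosing <p>
sample_parents = [b.parent for b in sample_b_tags]

for parent in sample_parents:
    print(parent.name, "->", parent.get_text())
```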

## Web Scraping with BeautifulSoup Part 2.1

Scraping and working with links to data:
- the parse tree versus its string (text) format.

- filtering links by reading tag attributes with `get()`.

> Search for files on web page

Finding pages with downloadable data

In [26]:
# Get list of documentation pages

file_type = 'documentation'

for link in soup.find_all('li'):
    for a in link.find_all('a'):
        file_link = a.get('href')
        # get() returns None when the <a> tag has no href attribute
        if file_link and file_type in file_link:
            print(file_link)
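
Links scraped this way are often relative paths. The sketch below collects matching links and resolves them against a base URL with `urllib.parse.urljoin`. The helper name `find_links`, the sample markup, and the `example.org` base URL are all hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and markup, for illustration only
BASE_URL = "https://example.org/browse/"
sample_html = """
<ul>
  <li><a href="/documentation/intro.html">Intro</a></li>
  <li><a href="/data/file.csv">Data</a></li>
  <li><a name="anchor-without-href">Anchor</a></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")


def find_links(soup, keyword, base_url=BASE_URL):
    """Return absolute URLs of <a> tags whose href contains keyword."""
    links = []
    # href=True keeps only <a> tags that actually have an href attribute
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if keyword in href:
            links.append(urljoin(base_url, href))
    return links


doc_links = find_links(sample_soup, "documentation")
print(doc_links)
```

Passing `href=True` to `find_all()` is an alternative to checking for `None` after `get('href')`.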

Find tags whose names match a pattern. The `^` anchor restricts the match to tag names that begin with the given letter(s).

> `re.compile()` builds a regular expression. Passing it to `find_all()` matches it against tag names, so `'^i'` finds every tag whose name begins with `i`.

In [27]:
import re

Begins with 'i'

In [40]:
soup.find_all(re.compile('^i'), limit=5)

[<img alt="Project Gutenberg" draggable="false" src="/gutenberg/pg-logo-129x80.png"/>,
 <input id="tm" type="checkbox"/>,
 <input id="sm0" type="checkbox"/>,
 <input id="sm8" type="checkbox"/>,
 <input id="sm3" type="checkbox"/>]

Begins with 'l'

In [42]:
soup.find_all(re.compile('^l'), limit=5)

[<link href="/gutenberg/style.css?v=1.1" rel="stylesheet"/>,
 <link href="/gutenberg/collapsible.css?1.1" rel="stylesheet"/>,
 <link href="/gutenberg/new_nav.css?v=1.321231" rel="stylesheet"/>,
 <link href="/gutenberg/pg-desktop-one.css" rel="stylesheet"/>,
 <link href="https://www.gnu.org/copyleft/fdl.html" rel="copyright">
 <link href="/gutenberg/favicon.ico?v=1.1" rel="shortcut icon">
 <meta content="Project Gutenberg" property="og:title"/>
 <meta content="website" property="og:type"/>
 <meta content="https://www.gutenberg.org/" property="og:url"/>
 <meta content="Project Gutenberg is a library of free eBooks." property="og:description"/>
 <meta content="615269807" property="fb:admins"/>
 <meta content="115319388529183" property="fb:app_id"/>
 <meta content="Project Gutenberg" property="og:site_name"/>
 <meta content="https://www.gutenberg.org/gutenberg/pg-logo-144x144.png" property="og:image"/>
 </link></link>]
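
A prefix pattern such as `'^a'` matches every tag name that starts with `a`, not only `<a>`. The sketch below contrasts it with an exact-name pattern on a made-up fragment; the markup and variable names are illustrative.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment with several tag names starting with "a"
sample_html = (
    "<article>"
    "<a href='/ebooks/1'>One</a>"
    "<abbr title='example'>ex.</abbr>"
    "<b>bold</b>"
    "</article>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# '^a' matches article, a, and abbr
prefix_names = [tag.name for tag in sample_soup.find_all(re.compile("^a"))]

# '^a$' matches only tags named exactly "a"
exact_names = [tag.name for tag in sample_soup.find_all(re.compile("^a$"))]

print(prefix_names)
print(exact_names)
```

When only one tag name is wanted, the plain string filter `find_all('a')` does the same job as `'^a$'`.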

More on RegEx [HERE](https://docs.python.org/3/library/re.html); see also the PDF 'Python Regex Cheat Sheet'.

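Put the text of every link whose `href` contains 'ebooks' into a DataFrame. The filter also picks up navigation links and every Top 100 list on the page, so the result has more than 100 rows.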

In [30]:
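file_type = 'ebooks'
books = []
# '^a' matches every tag whose name starts with 'a', not only <a>
for link in soup.find_all(re.compile('^a')):
    href = link.get('href')
    # Skip tags without an href attribute (get() returns None for them)
    if href and file_type in href:
        books.append(link.text.strip())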

In [31]:
df=pd.DataFrame(books, columns=['top_ebooks'])

In [32]:
df

| | top_ebooks |
| --- | --- |
| 0 | Search and Browse\n \t ▾ |
| 1 | Book Search |
| 2 | Bookshelves |
| 3 | Offline Catalogs |
| 4 | Pride and Prejudice by Jane Austen (2082) |
| ... | ... |
| 299 | Common Sense by Thomas Paine (4590) |
| 300 | Siddhartha by Hermann Hesse (4555) |
| 301 | The Art of War by active 6th century B.C. Sunz... |
| 302 | Josefine Mutzenbacher by Felix Salten (4382) |
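
The book rows follow a "Title by Author (number)" pattern, while navigation rows such as "Book Search" do not. A regular expression can separate the two and split each entry into fields. This is a sketch only: the helper name `parse_entry` is hypothetical, and reading the trailing number as a count is an assumption, since the notebook does not say what it measures.

```python
import re

# Title, then " by ", then author, then a number in parentheses.
# The greedy title group means the LAST " by " separates title and author.
ENTRY_PATTERN = re.compile(r"^(?P<title>.+) by (?P<author>.+) \((?P<count>\d+)\)$")


def parse_entry(text):
    """Return (title, author, count) or None if text is not a book entry."""
    match = ENTRY_PATTERN.match(text)
    if match is None:
        return None
    return match["title"], match["author"], int(match["count"])


# Sample rows copied from the output above
rows = [
    "Book Search",
    "Pride and Prejudice by Jane Austen (2082)",
    "Common Sense by Thomas Paine (4590)",
]

# Keep only rows that parse as book entries
parsed = [entry for entry in map(parse_entry, rows) if entry is not None]
print(parsed)
```

The resulting tuples could be passed to `pd.DataFrame(parsed, columns=['title', 'author', 'count'])` to get one column per field.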
