# Web Scraping Part 2.0
### Parse HTML with Beautiful Soup

Part 2 expands on Part 1 by handling data other than HTML tables.

This tutorial uses the following Python packages:

- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/): parses HTML and XML source into a searchable parse tree.

- Requests: uses the GET request method to fetch the web page.

- re: Python's built-in module for regular expression operations.

Part 2.0 Vocab

| Term | Description |
| -------- | ----------- |
| GET method | HTTP method used to request data from the server. |
| Parser | program that reads markup such as HTML or XML and builds a parse tree from it. |
| Beautiful Soup | Python library used in web scraping to pull data out of HTML and XML files. It creates a parse tree. |
| Regular Expressions | patterns used to match strings of text, such as particular characters, words, or sequences of characters. |


[Test your regular expressions](https://pythex.org/)

Pages used in this tutorial: 

[Gutenberg: Top 100 EBooks](https://www.gutenberg.org/browse/scores/top#books-last1)

> Get Libraries

In [1]:
import pandas as pd
pd.set_option('display.max_rows', 100)
import requests
from bs4 import BeautifulSoup
import urllib3

> Assign the URL to a variable and fetch it; `requests.get()` returns a `Response` object.

In [3]:
# get url
url = input('Enter URL: ')
html_scraped = requests.get(url)
type(html_scraped)

requests.models.Response
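Before the response is parsed, it can help to confirm that the request succeeded. The sketch below is one possible way to do that, not the tutorial's own code: `fetch_html` and its `timeout` default are hypothetical, while `raise_for_status()` and `Request.prepare()` are part of Requests. The last lines build a GET request without sending it, so the example runs without network access.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# Build, but do not send, a GET request to inspect its method and URL.
prepared = requests.Request(
    'GET', 'https://www.gutenberg.org/browse/scores/top'
).prepare()
print(prepared.method, prepared.url)
```

Calling `fetch_html(url)` in place of `requests.get(url)` would return the page text directly, ready to pass to `BeautifulSoup`.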

> Create a BeautifulSoup parse tree.

| HTML Parsers |
| ------------ |
| html.parser |
| html5lib |

In [4]:
# BeautifulSoup parse tree
soup = BeautifulSoup(html_scraped.text, 'html.parser')
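To see what the parse tree gives back without fetching a page, the same constructor can be tried on a small string. The HTML below is made up for illustration, and `sample_html` and `sample_soup` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for html_scraped.text
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body><h3>Heading</h3><p>First paragraph.</p></body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

print(sample_soup.title.string)   # text inside <title>
print(sample_soup.p.get_text())   # text inside the first <p>
```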

> View Data: `prettify()` returns the parse tree as an indented string, with each tag and each string on its own line.

In [5]:
preti = soup.prettify()
# preti   (uncomment to display the full, indented HTML string)
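Because `prettify()` returns a plain string, it can be sliced like any other text, which is handy when the full page is too long to read. A minimal sketch on made-up HTML (the names `small_soup` and `pretty` are illustrative only):

```python
from bs4 import BeautifulSoup

small_soup = BeautifulSoup(
    '<div><p>Hello <a href="/x">link</a></p></div>', 'html.parser'
)

pretty = small_soup.prettify()   # a str, one tag or text node per line
print(type(pretty))

# Show only the first few lines instead of the whole document
for line in pretty.splitlines()[:4]:
    print(line)
```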

Beautiful Soup represents the parse tree with four kinds of Python objects that can be searched:

- Tag
- NavigableString
- BeautifulSoup
- Comment
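The four object types can be seen together on a one-line document. This is a sketch on made-up HTML; `tiny_soup`, `tag`, `text`, and `comment` are names chosen for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

tiny_soup = BeautifulSoup('<p>Hello<!-- a note --></p>', 'html.parser')

tag = tiny_soup.p                 # an HTML element
text, comment = tag.contents      # the two children of <p>

print(type(tiny_soup).__name__)   # the whole document
print(type(tag).__name__)         # the <p> element
print(type(text).__name__)        # the text inside the tag
print(type(comment).__name__)     # the HTML comment, a special kind of string
```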

> TAGS

Some common methods to navigate the BeautifulSoup parse tree based on tags.

| Approach | Description |
| -------- | ----------- |
| Dot Operator | soup.p |
| String Filter | soup.find_all('p') |
| List Filter | soup.find_all(['p', 'link']) |
| Regular Expressions | soup.find_all(re.compile('^l')); can also match strings and CSS classes |
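
The first three approaches in the table can be compared side by side on a small, made-up document (`demo_soup` is a hypothetical name). This is only a sketch; results on a live page depend on its markup.

```python
from bs4 import BeautifulSoup

demo_soup = BeautifulSoup(
    '<h2>One</h2><p>a</p><b>bold</b><h2>Two</h2><p>b</p>', 'html.parser'
)

first_p = demo_soup.p                      # dot operator: first <p> only
all_p = demo_soup.find_all('p')            # string filter: every <p>
mixed = demo_soup.find_all(['h2', 'b'])    # list filter: every <h2> or <b>

print(first_p)
print(len(all_p))
print([t.name for t in mixed])             # tags come back in document order
```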

dot operator -> `bs4.element.Tag` (the first matching tag)

In [9]:
# use dot operator to get html tag
soup.h3

string filter -> list (a `ResultSet` of every matching tag)

In [10]:
# use find_all to get tag
soup.find_all('p')


[<p>Calculated from the number of times each eBook gets
 downloaded. (Multiple downloads from the same Internet
 address on the same day count as one download. Addresses
 that download more than 100 eBooks in a day are considered
 robots and are not counted.)</p>,
 <p>Visualizations and graphs are available as
 <a href="/about/pretty-pictures.html">pretty pictures</a>.</p>]

list filter -> list (a `ResultSet` of tags matching any name in the list)

In [11]:
# use find_all to get multiple tags
soup.find_all(['h2', 'b'])

[<h2 id="books-last1">Top 100 EBooks yesterday</h2>,
 <h2 id="authors-last1">Top 100 Authors yesterday</h2>,
 <h2 id="books-last7">Top 100 EBooks last 7 days</h2>,
 <h2 id="authors-last7">Top 100 Authors last 7 days</h2>,
 <h2 id="books-last30">Top 100 EBooks last 30 days</h2>,
 <h2 id="authors-last30">Top 100 Authors last 30 days</h2>]

other navigation -> parent of an element

In [12]:
# access the parent of each <b> element
btag = soup.find_all('b')
bpara = [b.parent for b in btag]
bpara

[]
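
The empty list suggests the scraped page had no `<b>` tags, so there were no parents to collect. On made-up HTML that does contain them, `.parent` behaves as sketched below (`bold_soup` is a hypothetical name).

```python
from bs4 import BeautifulSoup

bold_soup = BeautifulSoup(
    '<p>Some <b>bold</b> text</p><li>An <b>item</b></li>', 'html.parser'
)

# .parent is the tag that directly encloses each <b>
parents = [b.parent for b in bold_soup.find_all('b')]
print([p.name for p in parents])
```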

## Web Scraping with BeautifulSoup Part 2.1

Scraping and working with links to data:

- Parse tree and string (text) format.

- Reading tag attributes such as `href` with `get()`.

> Search for files on web page

Finding pages with downloadable data

In [13]:
# Get list of documentation pages

file_type = 'documentation'

for link in soup.find_all('li'):
    for a in link.find_all('a'):
        file_link = a.get('href')
        # a.get() returns None when an anchor has no href, so check it first
        if file_link and file_type in file_link:
            print(file_link)
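
An alternative to checking each `href` inside the loop is to let `find_all` do the filtering: `href=True` keeps only anchors that have the attribute. Relative paths can then be resolved with `urllib.parse.urljoin`. The sketch below runs on made-up HTML; `base_url`, `sample`, and `links` are illustrative names, and the `href` values are invented.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = 'https://www.gutenberg.org/browse/scores/top'

# Made-up list: one matching link, one anchor without href, one non-matching link
sample = (
    '<ul>'
    '<li><a href="/ebooks/1">A book</a></li>'
    '<li><a name="top">No href here</a></li>'
    '<li><a href="/about/">About</a></li>'
    '</ul>'
)

file_type = 'ebooks'
links = [
    urljoin(base_url, a['href'])            # turn the relative path into a full URL
    for a in BeautifulSoup(sample, 'html.parser').find_all('a', href=True)
    if file_type in a['href']
]
print(links)
```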

Find tags whose names contain certain letters. A leading `^` restricts the match to tag names that begin with those letters.

> `re.compile()` builds a regular expression that can be passed to `find_all()`; with `'^'`, the search matches any tag whose name begins with the given letter.

In [14]:
# import package for regex
import re
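
The effect of `^` is easiest to see on plain strings before applying it to tags. In this sketch the list of tag names is typed by hand for illustration.

```python
import re

tag_names = ['link', 'li', 'ul', 'title', 'html']

starts_with_l = re.compile('^l')   # 'l' must be the first character
contains_l = re.compile('l')       # 'l' may appear anywhere

begins = [n for n in tag_names if starts_with_l.search(n)]
contains = [n for n in tag_names if contains_l.search(n)]

print(begins)
print(contains)
```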

Begins with 'i'

In [15]:
# use find_all and regex to search
soup.find_all(re.compile('^i'), limit=5)

[<input id="search-toggle" name="toggle" style="display: none" type="radio"/>,
 <input id="search-close" name="toggle" style="display: none" type="radio"/>,
 <input id="about-toggle" style="display: none" type="checkbox"/>,
 <img alt="Project Gutenberg" draggable="false" src="/gutenberg/pg-logo-129x80.png"/>,
 <input aria-label="Search books" class="search-input" name="query" placeholder="Quick search" type="text"/>]

Begins with 'l'

In [16]:
soup.find_all(re.compile('^l'), limit=5)

[<link href="/gutenberg/style2.css?v=1.7" rel="stylesheet"/>,
 <link href="/gutenberg/collapsible.css?1.3" rel="stylesheet"/>,
 <link href="/gutenberg/new_nav.css?v=1.6" rel="stylesheet"/>,
 <link href="/gutenberg/pg-desktop-one.css?v=1.1" rel="stylesheet"/>,
 <link href="https://www.gnu.org/copyleft/fdl.html" rel="copyright"/>]

More on RegEx [HERE](https://docs.python.org/3/library/re.html) and see the PDF, 'Python Regex Cheat Sheet'
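
Beautiful Soup applies a compiled pattern to each tag name with the pattern's `search()` method, which is why the `^` anchor matters. A small sketch on made-up HTML (`rx_soup` is a hypothetical name):

```python
import re

from bs4 import BeautifulSoup

rx_soup = BeautifulSoup(
    '<title>T</title><link href="s.css"/><ul><li>item</li></ul>', 'html.parser'
)

anchored = rx_soup.find_all(re.compile('^l'))    # names beginning with 'l'
unanchored = rx_soup.find_all(re.compile('l'))   # names containing 'l' anywhere

print([t.name for t in anchored])
print([t.name for t in unanchored])
```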

Put the top ebooks into a dataframe

In [17]:
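# loop through soup, find links whose href contains 'ebooks', and collect their text
file_type = 'ebooks'
books = []
for link in soup.find_all(re.compile('^a')):
    # default to '' so tags without an href do not raise a TypeError
    if file_type in link.get('href', ''):
        books.append(link.text.strip())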
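
Note that `'^a'` matches every tag whose name starts with `a`, not just `<a>`. If only anchors are wanted, a plain string filter or a pattern anchored at both ends is one option. The sketch below uses made-up HTML and the hypothetical name `a_soup`.

```python
import re

from bs4 import BeautifulSoup

a_soup = BeautifulSoup(
    '<a href="/ebooks/1">Book</a>'
    '<abbr title="Project Gutenberg">PG</abbr>'
    '<address>Somewhere</address>',
    'html.parser',
)

starts_with_a = [t.name for t in a_soup.find_all(re.compile('^a'))]
exactly_a = [t.name for t in a_soup.find_all(re.compile('^a$'))]
string_filter = [t.name for t in a_soup.find_all('a')]

print(starts_with_a)   # every tag name beginning with 'a'
print(exactly_a)       # anchored at both ends
print(string_filter)   # plain string filter
```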

In [18]:
# create a dataframe from list
df=pd.DataFrame(books, columns=['top_ebooks'])

In [19]:
# view dataframe
df.head(10)

|  | top_ebooks |
| -------- | ----------- |
| 0 | Offline Catalogs |
| 1 | Main Categories |
| 2 | Reading Lists |
| 3 | Search Options |
| 4 | Main Categories |
| 5 | Moby Dick; Or, The Whale by Herman Melville (4... |
| 6 | Frankenstein; Or, The Modern Prometheus by Mar... |
| 7 | Romeo and Juliet by William Shakespeare (2627) |
| 8 | Pride and Prejudice by Jane Austen (2247) |
| 9 | Alice's Adventures in Wonderland by Lewis Carr... |


Delete the first 4 rows (index labels 0 through 3), which are site navigation links rather than books.

In [20]:
# clean up dataframe: drop the rows labeled 0 through 3
df = df.drop(labels=range(0, 4), axis=0)
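
After `drop`, the remaining rows keep their original index labels, which is why the next view starts at 4. If a fresh 0-based index is preferred, `reset_index(drop=True)` is one option. A sketch on a made-up frame (`demo_df` and its values are illustrative):

```python
import pandas as pd

demo_df = pd.DataFrame(
    ['nav 1', 'nav 2', 'nav 3', 'nav 4', 'Book A', 'Book B'],
    columns=['top_ebooks'],
)

dropped = demo_df.drop(labels=range(0, 4), axis=0)   # remove labels 0-3
print(list(dropped.index))                            # original labels are kept

renumbered = dropped.reset_index(drop=True)           # start again from 0
print(list(renumbered.index))
```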

In [21]:
# view df
df.head(10)

|  | top_ebooks |
| -------- | ----------- |
| 4 | Main Categories |
| 5 | Moby Dick; Or, The Whale by Herman Melville (4... |
| 6 | Frankenstein; Or, The Modern Prometheus by Mar... |
| 7 | Romeo and Juliet by William Shakespeare (2627) |
| 8 | Pride and Prejudice by Jane Austen (2247) |
| 9 | Alice's Adventures in Wonderland by Lewis Carr... |
| 10 | The Complete Works of William Shakespeare by W... |
| 11 | A Room with a View by E. M. Forster (1934) |
| 12 | Middlemarch by George Eliot (1916) |
| 13 | Jane Eyre: An Autobiography by Charlotte Bront... |
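
Each book entry appears to end with a number in parentheses, presumably the download count the page describes. Assuming that layout holds, `Series.str.extract` can split it into its own column. The rows below are copied from the untruncated entries above; `sample_books`, `title_author`, and `downloads` are names chosen for this sketch.

```python
import pandas as pd

sample_books = pd.DataFrame({'top_ebooks': [
    'Romeo and Juliet by William Shakespeare (2627)',
    'Pride and Prejudice by Jane Austen (2247)',
    'A Room with a View by E. M. Forster (1934)',
    'Middlemarch by George Eliot (1916)',
]})

# Named groups become column names; a row that did not match would be NaN
parts = sample_books['top_ebooks'].str.extract(
    r'^(?P<title_author>.*) \((?P<downloads>\d+)\)$'
)
parts['downloads'] = parts['downloads'].astype(int)

print(parts)
```

Rows such as navigation links would not match the pattern, so on the full frame a `dropna()` before the integer conversion would likely be needed.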
