# Scraping the Web

## Why Scrape?
Sometimes the data you require is not available in a structured, downloadable format. Often the most current data is available only on a web site. This notebook demonstrates how to "scrape" data from web sites using several different methods.

> If you know how to scrape the web, any data available on it becomes a database for you!

> **Should I use RegEx (regular expressions) to parse web data?**
> Although extracting patterned data from a text file is squarely in the wheelhouse of RegEx, I recommend *against* using it to parse HTML. Crafting an expression that returns every desired string while excluding every undesired one is error-prone, because HTML pages vary widely and will include or exclude data in ways you cannot anticipate. Instead, use a library such as Beautiful Soup to parse HTML.
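
To illustrate the point, here is a minimal sketch that compares a naive regular expression with Beautiful Soup on the same fragment. The HTML string is invented for this example; real pages vary in many more ways.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for illustration
html = """
<blockquote class="featured"><p>First quote.</p></blockquote>
<BLOCKQUOTE>
  <p>Second quote.</p>
</BLOCKQUOTE>
"""

# A naive pattern that expects a bare, lowercase tag on a single line
regex_hits = re.findall(r"<blockquote>(.*?)</blockquote>", html)

# The parser copes with attributes, uppercase tag names, and line breaks
parsed = BeautifulSoup(html, "html.parser")
parser_hits = [bq.get_text(strip=True) for bq in parsed.find_all("blockquote")]

print(regex_hits)   # the naive pattern misses both quotes
print(parser_hits)  # the parser finds both
```

The pattern could be patched to handle each of these cases, but every patch invites another exception, which is why a parser is usually the more reliable choice.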

## Before Scraping
Scraping a web site is not always the best option. Consider the following questions prior to scraping:
1. Does the web site in question allow scraping? (check the robots.txt)
2. Am I able to adhere to the requests of the web site (again, see robots.txt)?
3. Is scraping the easiest, most efficient, or most reliable method? Copy/paste? API? Download data file?
4. Is it ethical to scrape this information?
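
The first two questions can be checked in code. The sketch below uses `urllib.robotparser` from the Python standard library. The rules and the `example.com` URLs are hypothetical; for a live site you would call `set_url()` with the address of its robots.txt and then `read()` instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse rules that are already in memory

# can_fetch(user_agent, url) reports whether the rules permit the request
allowed_public = parser.can_fetch("*", "https://example.com/quotes/")
allowed_private = parser.can_fetch("*", "https://example.com/private/data")

# crawl_delay() returns the requested pause between requests, in seconds
delay = parser.crawl_delay("*")

print(allowed_public, allowed_private, delay)
```

A robots.txt file only states what the site owner asks of crawlers. Whether scraping is ethical or the most efficient option remains a judgment call.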

## HTML Knowledge
Although it's not necessary to be an HTML expert, having a strong grasp of how HTML works is helpful if you intend to extract information from a web page.

There are many HTML tutorials available, but one of the most concise options is W3Schools (https://www.w3schools.com/html/). 

## Viewing Source HTML
To determine the best method for extracting data from an HTML page, you must view the HTML. To view HTML using Firefox or Chrome, right-click the page and then click **View page source** (to view the entire page as a flat file) or click **Inspect** (to view an HTML inspection tool to navigate the page elements).


## Using Beautiful Soup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. (source: https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

## Installing Beautiful Soup
To install Beautiful Soup, use pip. The package is published on PyPI as `beautifulsoup4` and imported in code as `bs4`.
```
pip install beautifulsoup4
```

## Using the Requests Module
In addition to the HTML parser (bs4), you also need a method to fetch URLs (uniform resource locators). There are several options available to you. This example uses the requests package.

In [3]:
import requests
from bs4 import BeautifulSoup as soup

### Simple Scraping - Quote of the Day
Below is an example of how to use Requests and Beautiful Soup to obtain an HTML page and parse it.

Start by navigating to the page that you want to scrape and copying its URL. For this example, we want to scrape the quote of the day from WisdomQuotes.com. The URL is: http://wisdomquotes.com/quote-of-the-day/



In [4]:
url = "http://wisdomquotes.com/quote-of-the-day/"

## Supply a header to the web server

In [None]:
# Some web servers reject requests that do not include a browser-like User-Agent header
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

# Request the page and send the headers
page_html = requests.get(url, headers=headers)

# Print the raw response body (bytes)
print(page_html.content)
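
Because a server may reject a request, it can help to check the outcome before parsing. The following is a minimal sketch, not part of the original notebook: `fetch_html` is a hypothetical helper, and the 10-second timeout is an arbitrary choice.

```python
import requests

def fetch_html(url, headers=None, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html is a hypothetical helper, not part of the Requests library.
    """
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx responses
    except requests.exceptions.RequestException as err:
        # RequestException is the base class for errors raised by Requests
        print(f"Request failed: {err}")
        return None
    return response.text
```

Called as `fetch_html(url, headers=headers)`, it returns a string that can be passed to Beautiful Soup, or `None` when the request did not succeed.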

In [11]:
# .content returns the response body as bytes; .text returns it as a decoded string.
# Beautiful Soup accepts either one.
page_soup = soup(page_html.content, "html.parser")

page_soup.title # Display the page title to confirm that the page parsed successfully

<title>Quote Of The Day - Wisdom Quotes</title>

Next, inspect the HTML to find a reliable way to identify the information you want to extract.

Each quote is wrapped in a `<blockquote>` tag, which makes it easy to select all the quotes on the page.

Use the `find_all()` method to return a list of quotes.

In [14]:
quotes = page_soup.find_all("blockquote")

The `find_all()` method returns a `ResultSet`, which can be used like a list.
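
As a small offline sketch of the list-like behavior, the example below parses an invented HTML string rather than a live page, so the quotes in it are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for illustration
html = """
<blockquote><p>Quote one.</p></blockquote>
<blockquote><p>Quote two.</p></blockquote>
<blockquote><p>Quote three.</p></blockquote>
"""
page = BeautifulSoup(html, "html.parser")
results = page.find_all("blockquote")

count = len(results)                   # number of matches
first = results[0].text                # indexing
rest = [r.text for r in results[1:]]   # slicing and iteration

# When nothing matches, find_all() returns an empty result set, not None
missing = page.find_all("table")

print(count, first, rest, len(missing))
```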

In [15]:
print(type(quotes))

<class 'bs4.element.ResultSet'>


In [17]:
# Print the 5th quote
print(quotes[4])

<blockquote><p>Somewhere, something incredible is waiting to be known. Sharon Begley</p></blockquote>


In [18]:
print(quotes[2].text)

Men must live and create. Live to the point of tears. Albert Camus


Loop through the list of quotes and print the text attribute of each element.

In [None]:
for quote in quotes:
    print("-----------------------------------")
    print(quote.text)

In [20]:
# Print only the quotes whose text contains "Tracy"
for quote in quotes:
    if "Tracy" in quote.text:
        print("-----------------------------------")
        print(quote.text)

-----------------------------------
Always give without remembering and always receive without forgetting. Brian Tracy
-----------------------------------
Successful people are simply those with successful habits. Brian Tracy
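
The same filter can be written as a reusable function. This is a sketch under stated assumptions: `quotes_mentioning` is a hypothetical helper, the HTML and the names in it are invented, and the match ignores case, which the loop above does not.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for illustration
html = """
<blockquote><p>Quote one. Ada Example</p></blockquote>
<blockquote><p>Quote two. Bob Sample</p></blockquote>
<blockquote><p>Quote three. ada example</p></blockquote>
"""
sample_quotes = BeautifulSoup(html, "html.parser").find_all("blockquote")

def quotes_mentioning(elements, name):
    """Return the text of every element that contains name, ignoring case."""
    needle = name.lower()
    return [el.get_text(strip=True) for el in elements
            if needle in el.get_text().lower()]

matches = quotes_mentioning(sample_quotes, "Ada")
for text in matches:
    print(text)
```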


## Find by tag and class
Often the information that you need cannot be identified by tag alone. For example, to extract quotes from Goodreads, the tag is not enough: all the quotes on the page are inside `div` tags, but they are distinguished from the other `div` elements by their class name.

In [21]:
url = "https://www.goodreads.com/quotes"
page_html = requests.get(url, headers=headers)
page_soup = soup(page_html.text, "html.parser")
goodread_quotes = page_soup.find_all("div", class_="quoteText")

In [22]:
goodread_quotes[3].text.replace("\n","").replace("  ","")

'“So many books, so little time.”―Frank Zappa'
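
The sketch below shows the same tag-and-class lookup on an invented fragment, along with an alternative way to tidy the extracted text. The markup is only loosely modeled on a quote listing and is an assumption; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration
html = """
<div class="quote">
  <div class="quoteText">
    "A sample quote."
    -
    <span class="authorOrTitle">Sample Author</span>
  </div>
  <div class="quoteFooter">12 likes</div>
</div>
"""
page = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because class is a reserved word in Python
by_keyword = page.find_all("div", class_="quoteText")

# select() accepts a CSS selector and finds the same element
by_selector = page.select("div.quoteText")

# Splitting on whitespace and rejoining collapses newlines and repeated spaces
cleaned = " ".join(by_keyword[0].get_text().split())
print(cleaned)
```

Compared with chained `replace()` calls, the split-and-join approach keeps a single space between words regardless of how the source HTML was indented.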