# Scraping the Web
Sometimes the data you require is not available in a structured, downloadable format. Often the most current data is available only on a web site. This notebook demonstrates how to "scrape" data from web sites using several different methods.

> If you know how to scrape the web, any data published on it becomes a database for you!

> **Should I use RegEx (regular expressions) to parse web data?**
> Although extracting patterned data from text is squarely in the wheelhouse of RegEx, I recommend *against* using it to parse HTML. It is hard to craft an expression that returns every desired string while excluding every undesired one, because HTML pages vary widely and will match or miss data in ways you cannot anticipate. Instead, use a library such as Beautiful Soup to parse HTML.
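
As a small sketch of the problem, consider the made-up HTML fragment below (the markup and the pattern are illustrative assumptions, not taken from a real site). The regular expression expects one exact spelling of the tag, so it misses links whose attributes are ordered, quoted, or cased differently, while Beautiful Soup parses all three.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment: three links written three slightly different ways.
html = """
<a href="/quotes/1">First</a>
<a class="nav" href='/quotes/2'>Second</a>
<A HREF="/quotes/3">Third</A>
"""

# A pattern that assumes lowercase tags, href as the first attribute,
# and double quotes around the value.
pattern = re.compile(r'<a href="([^"]+)">')
regex_links = pattern.findall(html)

# The parser normalizes tag case, attribute order, and quoting style.
parsed = BeautifulSoup(html, "html.parser")
parsed_links = [a["href"] for a in parsed.find_all("a")]

print(regex_links)   # only the first link matches the pattern
print(parsed_links)  # all three links are found
```

The pattern could be extended to cover these cases, but each new variation on a real page would require another revision.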

## Viewing Source HTML
To determine the best method for extracting data from an HTML page, you must view the HTML. To view HTML using Chrome, right-click the page and then click **View page source** (to view the entire page as a flat file) or click **Inspect** (to view an HTML inspection tool to navigate the page elements).


## Using Beautiful Soup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. (source: https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
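
The three operations named above (navigating, searching, and modifying) can be sketched on a tiny document. The HTML below is a made-up sample, and the variable names are chosen only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for illustration.
html = """
<html><head><title>Sample Page</title></head>
<body>
  <h1 id="main">Quotes</h1>
  <p class="quote">Stay curious.</p>
  <p class="quote">Keep learning.</p>
</body></html>
"""

doc = BeautifulSoup(html, "html.parser")

# Navigating: walk down the tree by tag name.
page_title = doc.head.title.string

# Searching: find elements by tag name and attribute.
quote_texts = [p.get_text() for p in doc.find_all("p", class_="quote")]

# Modifying: replace the heading text in place.
doc.h1.string = "Favorite Quotes"
new_heading = doc.find(id="main").get_text()

print(page_title)
print(quote_texts)
print(new_heading)
```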

## Installing Beautiful Soup
To install Beautiful Soup, use pip. The package is published on PyPI as `beautifulsoup4` and is imported as `bs4`.
```
pip install beautifulsoup4
```

## Using the Requests Module
In addition to the HTML parser (bs4), you also need a way to fetch URLs (uniform resource locators). Several options are available; this example uses the Requests package, which you can also install with pip (`pip install requests`).

```python
import requests                        # fetches pages over HTTP
from bs4 import BeautifulSoup as soup  # parses HTML
```

In [1]:
import requests                        # fetches pages over HTTP
from bs4 import BeautifulSoup as soup  # parses HTML


### Scraping Quotes
Below is an example of how to use Requests and Beautiful Soup to obtain an HTML page and parse it.


In [2]:
url = "http://wisdomquotes.com/quote-of-the-day/"

# Some web servers reject requests that do not send a browser-like User-Agent header
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

# Request the page, sending the headers
page_html = requests.get(url, headers=headers)

# DEBUG: print(page_html.content)

# Use .text to get a decoded string (.content returns raw bytes)
page_soup = soup(page_html.text, "html.parser")

page_soup.title  # Display the page title to confirm the page was fetched and parsed
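
If you fetch more than one page, it can be convenient to set the headers once and to fail loudly on HTTP errors. The sketch below is one possible arrangement, not part of the original notebook: `make_session`, `fetch_html`, and the placeholder User-Agent string are hypothetical names introduced here, while `requests.Session`, `timeout`, and `raise_for_status()` are standard parts of Requests.

```python
import requests

# Hypothetical User-Agent string; substitute one appropriate for your client.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}


def make_session(headers=None):
    """Return a Session that sends the given headers with every request."""
    session = requests.Session()
    session.headers.update(headers or DEFAULT_HEADERS)
    return session


def fetch_html(session, url, timeout=10):
    """Fetch a page and return its text, raising on HTTP error status codes."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


# Building the session does not contact any server.
session = make_session()
print(session.headers["User-Agent"])
```

With a session in hand, a call such as `fetch_html(session, url)` would return the string that is passed to `soup(...)` in the cell above.
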

Next, inspect the HTML to find a reliable way to identify the information you want to extract.

On this page, each quote is wrapped in a `<blockquote>` element, which makes it easy to select all the quotes.

Use the `find_all()` method to return a list of the matching elements.

In [2]:
quotes = page_soup.find_all("blockquote")
print(quotes[0].text)

The world belongs to those who set out to conquer it armed with self confidence and good humour. Charles Dickens


Loop through the list of quotes and print the text of each element.

In [None]:
for quote in quotes:
    print("-----------------------------------")
    print(quote.text)
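
Printing is fine for a quick look, but you may prefer to collect the quotes into a structure you can filter or save. The sketch below assumes a hypothetical layout in which the quote sits in a `<p>` and an optional author sits in a `<cite>`; the real page's markup may differ, so check it with the inspection tools first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with placeholder text; the real page may be structured differently.
html = """
<blockquote><p>First sample quote.</p><cite>Author One</cite></blockquote>
<blockquote>
  <p>Second sample quote.</p>
</blockquote>
"""

sample = BeautifulSoup(html, "html.parser")

records = []
for block in sample.find_all("blockquote"):
    # find() returns None when the element is missing, so guard each lookup.
    cite = block.find("cite")
    paragraph = block.find("p")
    author = cite.get_text(strip=True) if cite else None
    source = paragraph if paragraph else block
    # strip=True trims surrounding whitespace from each text fragment.
    records.append({"text": source.get_text(" ", strip=True), "author": author})

for record in records:
    print(record)
```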

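## Using XPath
XPath is a query language for selecting nodes in an XML or HTML document. The lxml library exposes it through the `xpath()` method, used here to read a table of NBA standings.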


In [69]:
import lxml.html
import requests

html_response = requests.get("https://www.cbssports.com/nba/standings/")

# fromstring() returns an HtmlElement object, which has the xpath() method
doc = lxml.html.fromstring(html_response.content)

# Use XPath syntax to select every table row
teams = doc.xpath('//tr')

print(f"{len(teams)} <tr> elements found.\n")

# teams[row][cell]: the first index selects a <tr>, the second a cell within that row
# Dangerous assumption: these hard-coded row numbers break if the page layout changes
for i in (2, 3, 4, 5, 6, 19, 20, 21, 22, 23):
    ranking = teams[i][0].text_content().strip()
    team_name = teams[i][1].text_content().strip()
    wl_streak = teams[i][13].text_content().strip()

    print(ranking, wl_streak, team_name)

34 <tr> elements found.

1 W4 Milwaukee
2 L1 Toronto
3 W2 Indiana
4 L1 Philadelphia
5 L2 Boston
1 L1 Golden St.
2 W4 Denver
3 L1 Oklahoma City
4 W3 Portland
5 W1 Houston
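
The hard-coded row and column numbers above are fragile. One way to reduce that risk, sketched here on a made-up table rather than the live page, is to let XPath skip header rows and to look up columns by their header text. The table contents and the names `col` and `rows` are assumptions for illustration only.

```python
import lxml.html

# Hypothetical table; the real standings page has more columns and rows.
html = """
<table>
  <tr><th>Rank</th><th>Team</th><th>Streak</th></tr>
  <tr><td>1</td><td>Team A</td><td>W4</td></tr>
  <tr><td>2</td><td>Team B</td><td>L1</td></tr>
</table>
"""

doc = lxml.html.fromstring(html)

# Map each header's text to its column position instead of hard-coding indexes.
headers = [th.text_content().strip() for th in doc.xpath("//tr/th")]
col = {name: i for i, name in enumerate(headers)}

# "//tr[td]" selects only rows that contain <td> cells, which skips the header row.
rows = []
for tr in doc.xpath("//tr[td]"):
    cells = [td.text_content().strip() for td in tr.xpath("./td")]
    rows.append((cells[col["Rank"]], cells[col["Streak"]], cells[col["Team"]]))

for row in rows:
    print(*row)
```

This still depends on the header labels staying the same, but a renamed column then fails with a clear `KeyError` instead of silently printing the wrong cell.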
