### High level
What is web scraping?
> Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.

What questions can you answer with web scraping?
- What TV shows are airing tonight?
- What are the names and prices of the first 5 results for X on eBay?
- How many words is the wiki page for X?
- Has X been updated recently with this text?
- Is X band playing at Doug Fir any time soon?
- Is that [refurbished Baratza](http://www.baratza.com/product/encore-refurb/) in stock yet?
- Are tickets available for sale yet?


Ethics of web scraping
- https://news.ycombinator.com/item?id=12345693


### Tools
Name | Purpose
-----|--------
[Selector Gadget](http://selectorgadget.com/) | Find CSS selectors visually
[CSS selector cheat-sheet](http://www.cheetyr.com/css-selectors) | CSS selector reference
[BeautifulSoup4](http://beautiful-soup-4.readthedocs.io/en/latest/) | Parse HTML webpages with selectors
[requests](http://docs.python-requests.org/en/master/) | Connect to and download webpages (HTML)

## HTML
**HyperText Markup Language** 

It's the code that forms websites.  We won't be learning HTML today, but we'll learn enough to understand how we can navigate it.
 
#### HTML is made up of elements as its base components

Elements have structure:

![element structure](https://upload.wikimedia.org/wikipedia/commons/thumb/5/55/HTML_element_structure.svg/330px-HTML_element_structure.svg.png)



When nested inside each other, they give the document form.

![html structure](http://www.htmlgoodies.com/img/2007/06/page_container.gif)


This can also be viewed as a tree-like structure.  Here's the same page when we only care about *children* and *ancestors*:
![html tree-like structure](http://www.htmlgoodies.com/img/2007/06/flowChart2.gif)
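
To make the tree idea concrete, here is a minimal sketch that prints the nesting of a small, made-up HTML snippet as an indented outline. It uses the standard library's `html.parser` rather than Beautiful Soup; the `OutlinePrinter` class, the `outline` helper, and the `SAMPLE` markup are illustrative names invented for this example, and the sketch assumes every opening tag has a matching closing tag.

```python
from html.parser import HTMLParser


class OutlinePrinter(HTMLParser):
    """Record each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # A new element starts: record it at the current depth, then go deeper.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # The element closed: step back up to its parent.
        # Assumption: the markup is well formed (no unclosed tags like <br>).
        self.depth -= 1


def outline(html):
    """Return the element tree of `html` as an indented text outline."""
    parser = OutlinePrinter()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


# Made-up sample page, for illustration only.
SAMPLE = (
    "<html><head><title>Hi</title></head>"
    "<body><h1>Hello</h1><p>Some <b>bold</b> text</p></body></html>"
)

print(outline(SAMPLE))
```

Each level of indentation is one step down the tree: `title` is a child of `head`, and `html` is an ancestor of everything.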


In [None]:
import bs4
import requests

## Fetching the HTML

The first step is to actually get the website's HTML.  To do that, we'll be using the 3rd-party *requests*\* module.
This simulates:
1. opening your browser
2. typing in the url you want to visit
3. selecting 'View Source'
4. copying the text
5. pasting it into a variable.

\* we could do this using just the standard library, but requests is popular enough that you'll encounter it often.
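
For comparison, a standard-library-only version of the same steps might look roughly like the sketch below. `fetch_text` is a hypothetical helper name, and falling back to UTF-8 when the server does not declare a charset is an assumption made for this sketch, not something requests or the standard library prescribes.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Download `url` and return its body as a string."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # the body arrives as bytes
        # Use the charset the server declared, if any.
        # Assumption: fall back to UTF-8 when none is declared.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset)
```

Calling `fetch_text(some_url)` plays roughly the same role as `requests.get(some_url).text`, without the conveniences requests adds on top.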

In [None]:
url = 'https://raw.githubusercontent.com/hassanshamim/python_foundations/master/README.md'
response = requests.get(url)

In [None]:
response # If you're not familiar with HTTP codes, this output might be totally useless.

In [None]:
help(response) # let's see what this *response object* can do.

In [None]:
response.ok # Did the website/server respond properly?
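
If the number inside `<Response [200]>` does not mean much yet, the standard library can translate it. The sketch below uses `http.HTTPStatus` to look up the official phrase for a code. `describe_status` is a hypothetical helper, and its "below 400 counts as ok" rule is meant to mirror how `response.ok` behaves.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable summary of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # e.g. 404 -> "Not Found"
    # Codes below 400 are successes or redirects; 400 and up are errors.
    outcome = "ok" if code < 400 else "error"
    return f"{code} {phrase} ({outcome})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```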

The following result is Markdown, not HTML.  Why?
The page we requested was a plain-text Markdown file (`README.md`), not an HTML page.

In [None]:
response.text # the contents.  In this example it's Markdown, not HTML.

So let's try a real web page!

In [None]:
response2 = requests.get('http://www.hackoregon.org/upcoming-courses')
response2.ok

In [None]:
response2.text

YAY! It's working.  But what if the thing we're getting isn't text?  What if it's an image?

Well, that's out of scope for today, but the general process is:
- get response from image url - `requests.get('http://www.website.com/file/image.jpeg')`
- get the binary data out, **not** the text - `response.content`
- save it to a file or render it as an image in Python
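
The "save it to a file" step can be sketched as follows. `save_body` is a hypothetical helper that accepts any object with a `.content` attribute holding bytes, which is what a requests response provides; the file name in the usage note is only a placeholder.

```python
from pathlib import Path


def save_body(response, path):
    """Write the raw bytes of a response to `path`; return the byte count."""
    data = response.content  # bytes, unlike response.text which is a str
    Path(path).write_bytes(data)  # binary write, no text decoding involved
    return len(data)
```

With requests imported, usage would look something like `save_body(requests.get(image_url), 'image.jpeg')`, where `image_url` is whatever image address you want to download.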

## Finding the Data we want
If we want all the dates on a webpage, we can't just search for the word 'dates'.
We either:
- have to know **where the dates occur** consistently in the webpage (structurally)
- have to know **how the dates are marked** (are they all in an element with a certain keyword? like 'arrival-date')
- or we have to **know how dates are formatted**, and look for everything that follows that format (e.g. some numbers, then a slash, then more numbers, then another slash, then more numbers - this is what regular expressions do)

We'll be using a combination of the first two, with some help from Selector Gadget
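
For completeness, the third approach (matching on format) can be sketched with the standard library's `re` module. The pattern below assumes dates written like `1/23/2017` (one or two digits, a slash, one or two digits, a slash, four digits). Real pages use many other formats, so treat `DATE_PATTERN` and `find_dates` as illustrative names rather than a general solution.

```python
import re

# Assumed format: M/D/YYYY or MM/DD/YYYY, e.g. 1/23/2017 or 02/05/2017.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def find_dates(text):
    """Return every substring of `text` that looks like a slash-separated date."""
    return DATE_PATTERN.findall(text)


# Made-up sample text, for illustration only.
sample = "Class starts 1/23/2017 and ends 02/05/2017. About 3/4 of seats are taken."
print(find_dates(sample))  # ['1/23/2017', '02/05/2017']
```

Note that `3/4` is not matched, because the pattern requires two slashes and a four-digit year.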



### Beautiful Soup cheatsheet

**NOTE**: the traversal methods (select, find, .h3) can be used on tags as well as the whole soup

command | what it does
--------|------------
bs4.BeautifulSoup(data, 'html.parser') | creates our soup object that we use to scan the document
soup.find_all | return a *list* of tag objects that match our query
soup.find | returns the *first* tag object that matches our query
soup.select | uses **css selectors** to query our data.  returns a *list* of all matching tags
soup.select_one | same as above, but returns only the *first* match
soup.h3 | returns the first h3 tag matched.  same as `soup.find('h3')`  Works for any tag name
tag.text | returns text inside
tag.get_text() | fetches inner text ignoring any tags
tag.stripped_strings | returns a *generator* of component strings with whitespace removed.  Pass to `list()` to get a list object from the generator

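A minimal sketch of the query methods in the cheatsheet, run against a small, made-up snippet so it works offline. The `HTML` string and its `course` class are assumptions for illustration only.

```python
import bs4

# Made-up markup, for illustration only.
HTML = """
<div class="course">
  <h2>Applied Data Visualization</h2>
  <h3>LEVEL: Advanced</h3>
  <h3>DURATION: 8 WEEKS</h3>
</div>
"""

soup = bs4.BeautifulSoup(HTML, "html.parser")

first_h3 = soup.find("h3")               # first matching tag only
all_h3 = soup.find_all("h3")             # list of every matching tag
first_css = soup.select_one(".course h3")  # CSS selector, first match
all_css = soup.select(".course h3")        # CSS selector, list of matches

print(first_h3.text)
print([tag.text for tag in all_h3])
print(first_css.text)
print(len(all_css))

# The same methods work on a tag, searching only inside that tag.
course = soup.div
print(course.find("h2").text)
```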

### Hack University Example

In [None]:
selector = '.span-6 h2 , .span-8 h2 , .span-7 h2 , .span-7 strong'

soup = bs4.BeautifulSoup(response2.text, 'html.parser')

In [None]:
result = soup.select(selector)

In [None]:
result

In [None]:
for tag in result:
    print(type(tag), tag.name, tag.string, sep=',  ')

In [None]:
t = result[0]

In [None]:
list(t.stripped_strings)

In [None]:
[r.text for r in result]

In [None]:
test = bs4.BeautifulSoup('<div class="sqs-block-content" id="yui_3_17_2_1_1480579287682_383"><h2 id="yui_3_17_2_1_1480579287682_382">Applied Data Visualization</h2><h3>LEVEL: Advanced</h3><h3>START DATE: JAN 23RD, MON + WED, 6-9PM</h3><h3>DURATION: 8 WEEKS</h3><h3>COST $850</h3><h3>REACT OFFICE HOURS: +$250, TUES + THURS, 6-9PM</h3><h3>Instructor: David Daniel</h3></div>', 'html.parser')

In [None]:
test.find_all('h2')

In [None]:
test.find_all('h3')

In [None]:
test.find_all(['h2', 'h3'])

In [None]:
result = soup.select('.span-6 h2 , .span-8 h2 , .span-7 h2 , .span-7 strong')
[tag.text for tag in result if tag.text]

### Yelp Example

In [None]:
yelp_url = 'https://www.yelp.com/search?find_desc=pizza&find_loc=Portland'

In [None]:
yelp_page = requests.get(yelp_url)

In [None]:
yelp_page.status_code

In [None]:
yelp_soup = bs4.BeautifulSoup(yelp_page.text, 'html.parser')

In [None]:
yelp_soup.body.find_all('li', {'class': 'regular-search-result'})

In [None]:
yelp_soup.body.find_all('li', class_='regular-search-result')

In [None]:
result = yelp_soup.body.select('li.regular-search-result')

In [None]:
r = result[0]

In [None]:
r.find('a', class_='biz-name').span.text

In [None]:
r.select_one('a.biz-name span').text

In [None]:
r.select_one('div.i-stars').get('title').split()

In [None]:
list(r.address.stripped_strings)

#### Pagination

After we hit 'next' on the Yelp search page, we get the second page of results.  The URL looks like this:

`https://www.yelp.com/search?find_desc=pizza&find_loc=Portland&start=10`

Same as our original URL, but notice the **&start=10**.  This is called a **query parameter**.  It's a key/value pair (in this case *start* and *10* respectively) that Yelp uses to find and create the page we're looking for.
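
To see the key/value pairs explicitly, the query string can be pulled apart with the standard library's `urllib.parse`. This is a small sketch using the URL above; nothing is downloaded.

```python
from urllib.parse import parse_qs, urlsplit

url = "https://www.yelp.com/search?find_desc=pizza&find_loc=Portland&start=10"

query = urlsplit(url).query  # everything after the '?'
params = parse_qs(query)     # dict mapping each key to a list of values

print(query)
print(params)
```

Each value is a list because a key is allowed to appear more than once in a query string.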

We can manually or programmatically adjust this to get the page we want.  Alternatively, we could find the 'next' button every time and follow that link.
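
Adjusting the parameter programmatically could look like the sketch below. `page_url` is a hypothetical helper, and the default of 10 results per page is an assumption inferred from the second page starting at `start=10`; the site may use a different step.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.yelp.com/search"


def page_url(page, per_page=10, **search):
    """Build the search URL for a 0-based page number.

    Assumption: each page holds `per_page` results, so page N starts
    at result N * per_page.
    """
    params = dict(search, start=page * per_page)
    return f"{BASE_URL}?{urlencode(params)}"


for page in range(3):
    print(page_url(page, find_desc="pizza", find_loc="Portland"))
```

Each generated URL can then be passed to `requests.get` in a loop, ideally with a pause between requests to avoid hammering the server.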

In [None]:
hundreth_page = requests.get('https://www.yelp.com/search?find_desc=pizza&find_loc=Portland&start=1000')

In [None]:
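# Same idea as above, but requests builds the query string from a dict.
# Note the values differ: find_loc is 'Portland OR' and start is 100, not 1000.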
params = {'find_desc': 'pizza', 'find_loc': 'Portland OR', 'start': 100}
hundreth_page = requests.get('https://www.yelp.com/search', params=params)

In [None]:
hundreth_page.url

In [None]:
requests.utils.urlparse('https://www.yelp.com/search?find_desc=pizza&start=100&find_loc=Portland+OR').query

In [None]:
hundreth_page.ok
soup = bs4.BeautifulSoup(hundreth_page.text, 'html.parser')

In [None]:
soup.select('li.regular-search-result')

In [None]:
soup.select('#super-container > div > div > div > div > h3')

### Wikipedia Example

In [None]:
wiki_page_url = 'https://en.wikipedia.org/wiki/ISO_4217'
wiki_html = requests.get(wiki_page_url).text
wsoup = bs4.BeautifulSoup(wiki_html, 'html.parser')

In [None]:
wsoup.find_all('table')

In [None]:
wsoup.select_one('#Active_codes').parent.next_sibling.next_sibling.next_sibling.next_sibling

In [None]:
currency_table = wsoup.select_one('h2 + p + table')

In [None]:
row = currency_table.select('tr')[1]
row

In [None]:
wsoup.select('h2 + p + table tr')[1]

In [None]:
row.find_all('td')
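
A natural next step is to turn every row into a record. The sketch below does that on a small, made-up table so it runs offline. The `HTML` string, the currency rows, and the column names are invented for illustration and are not taken from the Wikipedia page. It also shows why the CSS selector `h2 + p + table` is tidier than chaining `.next_sibling`: the `+` combinator only looks at elements, while `.next_sibling` also steps through the whitespace between them.

```python
import bs4

# Made-up page fragment, for illustration only.
HTML = """
<h2 id="codes">Active codes</h2>
<p>Example rows, invented for this sketch.</p>
<table>
  <tr><th>Code</th><th>Currency</th></tr>
  <tr><td>AAA</td><td>Example dollar</td></tr>
  <tr><td>BBB</td><td>Example franc</td></tr>
</table>
"""

soup = bs4.BeautifulSoup(HTML, "html.parser")

# A table that directly follows a <p> that directly follows an <h2>.
table = soup.select_one("h2 + p + table")
rows = table.select("tr")

# The first row holds the column names; the rest hold the data.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [
    dict(zip(headers, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in rows[1:]
]

print(records)
```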

In [None]:
requests.utils.quote('Portland, OR')

## Practice:
- Write a script to play the Wikipedia game.
- Write a script to download all the comics from xkcd.
- Write a function that pulls the current weather
- Just think of a website you use often and play around.

### Additional References:
- https://automatetheboringstuff.com/chapter11/
- [HTTP Status Codes](http://www.restapitutorial.com/httpstatuscodes.html)