# Introduction

> Each web page is different

> Web pages change

# Follow the law and be careful what you scrape

[Nice article on this](https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/)

# APIs (Application Programming Interface)

APIs return data in JSON, XML, YAML, etc. (lightweight formats for storing and transporting data).
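
As a minimal sketch of what working with such a response can look like, the snippet below parses a JSON string with Python's standard `json` module. The payload is made up for illustration and does not come from any real API.

```python
import json

# Hypothetical JSON payload, similar in shape to what a web API might return.
payload = '{"name": "example", "id": 1, "tags": ["economics", "data"]}'

# json.loads converts JSON text into ordinary Python objects (dict, list, int, str).
data = json.loads(payload)

print(type(data))
print(data["name"])
print(data["tags"])
```

Once parsed, the data can be handled like any other dictionary or list, which is usually much easier than pulling the same information out of HTML.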

# Web Scraping process

1. Examine
2. Scrape
3. Parse

# Examine the Web Page

## Explore

Let's take a look at the [American Economic Association website](https://www.aeaweb.org/).

## Understanding URLs

Let's take a closer look at the [Internet Resources for Economists webpage](https://www.aeaweb.org/rfe/showCat.php?cat_id=91).

It looks like this:
> <font size="6">```https://www.aeaweb.org/rfe/showCat.php?cat_id=91```</font>

which can be decomposed into:

<font size="6">Base URL: ```https://www.aeaweb.org/rfe/showCat.php```</font>

<font size="6">   Query: ```?cat_id=91```</font>

### Deeper Understanding

* Query parameters often come from information entered on the web page.
* `?` marks the start of the query string.
* `cat_id=91` is a query parameter (in `key=value` format):
    * `cat_id` is the key
    * `91` is the value
* A URL may include multiple parameters, separated by `&`.
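
The decomposition above can also be done in code. This is a minimal sketch using the standard library's `urllib.parse`; the `page` parameter in the second half is hypothetical and only there to show how several parameters are joined.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = "https://www.aeaweb.org/rfe/showCat.php?cat_id=91"

# Split the URL into its components.
parts = urlsplit(url)
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

# parse_qs maps each key to a list of values, since a key may repeat.
params = parse_qs(parts.query)

print(base_url)
print(params)

# Build a query string with more than one parameter.
# "page" is a hypothetical key used only for illustration.
new_query = urlencode({"cat_id": 96, "page": 2})
new_url = f"{base_url}?{new_query}"
print(new_url)
```

Building URLs this way, rather than by hand, takes care of the `?` and `&` separators and of escaping special characters in values.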
    
### Exercise 

Spend a few minutes looking at the URLs of other pages on the website.

## Using Developer Tools

* This usually works: right-click on the page and choose Inspect.

* Firefox: Menu -> More Tools -> Web Developer Tools (Ctrl-Shift-I).

### Finding HTML elements
* Open the Inspector (or Elements) tab and click the box-with-arrow icon at the far left of the toolbar.
* Then click any element on the page to see where it is in the source.

# Scraping the Web Page
You may have to `pip install requests` if you don't already have it.

In [None]:
import requests

## Static Websites

In [None]:
url = "https://www.aeaweb.org/rfe/showCat.php?cat_id=96"
response = requests.get(url)
type(response.content)

In [None]:
response.content[:500]

In [None]:
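# Search the decoded text; str(response.content) would search the b'...' repr instead
loc = response.text.find('Tyler')
print(loc)  # -1 means 'Tyler' was not found
response.text[max(loc - 100, 0):loc + 100]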
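
`response.content` is a `bytes` object, while `response.text` is the decoded string. The sketch below uses a small made-up byte string (not real page content) to show why calling `str()` on bytes is not the same as decoding them: `str()` returns the `b'...'` representation, so positions found in it do not line up with the actual text.

```python
# Made-up stand-in for response.content; not real page content.
content = b"<p>Caf\xc3\xa9 economics blog by Tyler</p>"

# str() on bytes gives the repr, including the leading b' and escape sequences.
as_repr = str(content)

# Decoding gives the actual text.
as_text = content.decode("utf-8")

print(as_repr)
print(as_text)

# The same search lands at three different positions.
repr_pos = as_repr.find("Tyler")
text_pos = as_text.find("Tyler")
bytes_pos = content.find(b"Tyler")
print(repr_pos, text_pos, bytes_pos)
```

When searching or slicing, it is safest to stay in one representation: either search the bytes with a bytes pattern, or decode first and work with the string.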

# Parse: Using Beautiful Soup

[webpage](https://www.crummy.com/software/BeautifulSoup/)

[documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [None]:
from bs4 import BeautifulSoup as bsoup
soup = bsoup(response.content, 'html.parser')  # name the parser explicitly to avoid a warning
print(soup.prettify())

In [None]:
soup.head

In [None]:
soup.title

In [None]:
soup.body

In [None]:
soup.body.find_all('ul')

In [None]:
soup.p

In [None]:
result = soup.find_all('a')
result

In [None]:
for link in result:
    print(link.get('href'))
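
Many `href` values are relative paths, and `link.get('href')` returns `None` for anchors with no `href` attribute. A minimal sketch of turning such values into absolute URLs with the standard library's `urljoin` follows; the `hrefs` list is hypothetical and stands in for values collected by the loop above.

```python
from urllib.parse import urljoin

page_url = "https://www.aeaweb.org/rfe/showCat.php?cat_id=96"

# Hypothetical href values, standing in for what link.get('href') might return.
hrefs = ["/journals", "showRes.php?rfe_id=1", "https://example.com/blog", None]

# Skip missing hrefs and resolve the rest against the page URL.
absolute_links = [urljoin(page_url, href) for href in hrefs if href]

for link_url in absolute_links:
    print(link_url)
```

Links that are already absolute pass through unchanged, so the same call can be applied to every link on the page.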

In [None]:
print(soup.get_text())

In [None]:
print(type(soup.get_text()))

In [None]:
'Tyler' in soup.get_text()
