# Introduction

The internet is an essential source of information for empirical research projects in economics. Retrieving information stored on the web can be as simple as downloading a csv file that is hosted online. However, we will be discussing slightly more interesting cases, where the data can be accessed directly within your browser, via an Application Programming Interface (API), or where they are embedded in a website and we want to extract them. The latter is called *web scraping*. 

While copy pasting information from the browser and using an API are fairly standardized, the real difficulty of web scraping is that there is no silver bullet. Any web scraping procedure will work only with the page it was designed for. These notes will cover a few general approaches which can be used, but if you are scraping your own data, you will often need to start from scratch, figure out which method is appropriate, and implement it.

The coding requirements in these notes start easy and gradually become more demanding. We will cover the following web scraping techniques:

1. copy pasting clean tables
2. APIs
3. Scraping static webpages with BeautifulSoup
4. Scraping dynamic webpages with Selenium



# 1. Copy Pasting

No, this isn’t a joke. If the data is in a clean table, and small enough that you can highlight it all, copying and pasting often produces a text document that can be read in easily. For example, Wikipedia tables are easy to approach this way. Consider the list of countries by nominal GDP on [Wikipedia](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)). If you highlight one of the tables, paste it into a plain-text editor such as Atom, and save the file, you can load it with


In [16]:
import pandas as pd
data = pd.read_csv("gdp_copy_paste.txt", sep = "\t", names = ['countries', 'gdp'])
data

Unnamed: 0,countries,gdp
0,United States,18036648
1,European Union,16832631
2,China,11158457
3,Japan,4383076
4,Germany,3363600
5,United Kingdom,2858003
6,France,2418946
7,India,2116239
8,Italy,1821580
9,Brazil,1772591


Note that the format is what’s known as tsv (tab separated values), which is similar to csv (comma separated values). The “\t” is an escape sequence that stands for the *tab* character. 

If the data is not clearly presented (or the webpage is “modern”), or if the data is at all large, this method will not work.
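To make the tab-separated format concrete, here is a minimal sketch that parses a pasted snippet with Python's standard `csv` module rather than pandas. The string `pasted` and the variable names are illustrative assumptions; the figures are the first three rows of the table above.

```python
import csv
import io

# A small tab-separated snippet, as it might look after copy pasting
# (values taken from the first rows of the table above).
pasted = "United States\t18036648\nEuropean Union\t16832631\nChina\t11158457\n"

# csv.reader splits each line on the tab character
reader = csv.reader(io.StringIO(pasted), delimiter="\t")
rows = [(country, int(gdp)) for country, gdp in reader]

for country, gdp in rows:
    print(country, gdp)
```

The same idea carries over to `pd.read_csv` with `sep = "\t"`, which additionally returns a data frame.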

# 2. API

An Application Programming Interface (API) is a set of methods to access data which is not publicly available as a complete data set. For example, if you wanted to get access to Google search results, you’d use the [Google JSON/Atom Custom Search API](https://developers.google.com/custom-search/json-api/v1/overview), or if you wanted to get a list of Yelp businesses, you’d use the [Yelp Fusion API](https://www.yelp.com/developers). APIs are most commonly used by programmers (e.g. if you use any third-party apps to connect to resources such as Facebook or Twitter, the developers of those apps use APIs to access your information), but they are also useful for extracting data.

A lot of APIs are private: either they’re only available to people who are authorized to use them, or they’re used internally by development teams. However, a good number are public, although they may require purchasing access or at least registering to get an API key. There’s a useful list of [public APIs](https://github.com/toddmotto/public-apis) maintained by Todd Motto. A lot of the truly open APIs are run by fans and volunteers rather than released officially by a company.

## Rate Limits
Note that a lot of APIs have rate limits, i.e. a maximum number of requests you can send in a particular window of time (e.g. 100 requests per hour). Before spending a lot of time scraping, make sure you know the rate limit and structure your requests appropriately. You may need to space your scraping out over a couple of days.
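One simple way to respect a rate limit is to pause between requests. The following is a minimal sketch, not a definitive implementation: `fetch_all` and `fake_fetch` are hypothetical helpers, and `fake_fetch` stands in for a real API request so that the example runs without a network connection.

```python
import time

def fetch_all(items, fetch, min_interval):
    """Call fetch(item) for each item, waiting so that consecutive
    calls start at least min_interval seconds apart."""
    results = []
    last_call = None
    for item in items:
        if last_call is not None:
            wait = min_interval - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
        last_call = time.monotonic()
        results.append(fetch(item))
    return results

# Stand-in for a real API request (hypothetical): just echoes its input
def fake_fetch(item):
    return item.upper()

start = time.monotonic()
results = fetch_all(["a", "b", "c"], fake_fetch, min_interval=0.1)
elapsed = time.monotonic() - start
print(results)
```

For a limit of 100 requests per hour, `min_interval` would be 36 seconds (3600 / 100).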

## JSON
A lot of formal APIs return data in a format known as JSON (JavaScript Object Notation). It is similar to dictionaries in Python. An example modified from [Wikipedia](https://en.wikipedia.org/wiki/JSON):

```
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "children": [],
  "spouse": null
}
```

We do not have to parse this ourselves, thanks to the Python `json` module which parses the information and converts JSON objects into dictionaries, JSON arrays into lists, and JSON strings into strings. In this way, it makes it extremely easy to access and manipulate values stored in JSON.

In [17]:
import json
with open('wiki_example.json', 'r') as json_file:
  json_data = json.load(json_file)
json_data

{'address': {'city': 'New York',
  'postalCode': '10021-3100',
  'state': 'NY',
  'streetAddress': '21 2nd Street'},
 'age': 25,
 'children': [],
 'firstName': 'John',
 'isAlive': True,
 'lastName': 'Smith',
 'spouse': None}

In [23]:
type(json_data)

dict

You can access the elements using the familiar Python dictionary syntax,

In [18]:
json_data["lastName"]

'Smith'
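The cells above read JSON from a file. API responses usually arrive as a string instead, in which case `json.loads` (with an `s`) does the parsing. The sketch below uses a shortened, hand-written version of the example record; the variable names are illustrative.

```python
import json

raw = '{"firstName": "John", "isAlive": true, "address": {"city": "New York"}, "children": [], "spouse": null}'

person = json.loads(raw)          # parse a JSON string into Python objects
city = person["address"]["city"]  # nested objects become nested dictionaries
print(city)
print(person["isAlive"], person["spouse"])  # JSON true/null become True/None

# json.dumps goes the other way: Python objects back to a JSON string
round_trip = json.loads(json.dumps(person))
```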

## Example: Currency Rates
The site [fixer.io](fixer.io) provides an API to obtain daily currency conversion rates. The URL depends on the date desired: current rates are at [http://api.fixer.io/latest](http://api.fixer.io/latest), whereas historical data is available at, for example, [http://api.fixer.io/2005-04-20](http://api.fixer.io/2005-04-20). Query parameters include `base=`, which sets the base currency against which all rates are quoted (the default is “EUR”). We will obtain the JSON data from an API query and use the `json` library to parse it.

In [105]:
import json
import requests

url = "http://api.fixer.io/2000-01-03?base=USD"
response = requests.get(url)
data = json.loads(response.text)
data

{'base': 'USD',
 'date': '2000-01-03',
 'rates': {'AUD': 1.5209,
  'CAD': 1.4447,
  'CHF': 1.59,
  'CYP': 0.57156,
  'CZK': 35.741,
  'DKK': 7.374,
  'EEK': 15.507,
  'EUR': 0.99108,
  'GBP': 0.61903,
  'HKD': 7.7923,
  'HUF': 252.26,
  'ISK': 72.379,
  'JPY': 101.83,
  'KRW': 1129.8,
  'LTL': 4.0093,
  'LVL': 0.58632,
  'MTL': 0.4114,
  'NOK': 7.9901,
  'NZD': 1.9159,
  'PLN': 4.1462,
  'ROL': 18110.0,
  'SEK': 8.4757,
  'SGD': 1.6619,
  'SIT': 197.12,
  'SKK': 41.94,
  'TRL': 541260.0,
  'ZAR': 6.146}}

Note that this piece of code uses the important `requests` module and its `get` function. The remainder of these notes uses both extensively. 

Finally, we access the CHF rate, that is, the number of Swiss francs per US dollar,

In [107]:
data["rates"]["CHF"]

1.59
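The URL pattern and the parsed response can be combined into a small worked example. This is a sketch under stated assumptions: `build_url` is a hypothetical helper that follows the URL pattern described above, and `sample` is a shortened copy of the response shown earlier, so no request is actually sent.

```python
import json
from urllib.parse import urlencode

def build_url(date, base):
    """Build a query URL following the pattern described above."""
    return "http://api.fixer.io/" + date + "?" + urlencode({"base": base})

url = build_url("2000-01-03", "USD")
print(url)

# Shortened copy of the response shown above, so no request is sent
sample = '{"base": "USD", "date": "2000-01-03", "rates": {"CHF": 1.59, "GBP": 0.61903}}'
rates = json.loads(sample)["rates"]

# Rates are quoted per unit of the base currency, so converting an
# amount in USD is a single multiplication
usd_amount = 250
chf_amount = usd_amount * rates["CHF"]
print(chf_amount)
```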

## Exercise: Count the Number of Victims of US Drone Strikes

In [121]:
url = "https://api.dronestre.am/data"
response = requests.get(url)
data = json.loads(response.text)
for i in data["strike"]:
    print(i['deaths_min'])

6
6
2
8
5
8
13
81
8
3
20
5
0
12
8
12
12
1
6
13
8
0
4
6
4
5
5
17
10
4
4
3
21
4
4
5
7
15
4
4
11
11
4
4
5
2
3
6
2
3
2
3
3
7
5
26
30
7
14
2
7
4
12
13
4
0
6
9
8
25
3
12
0
6
67
0
13
16
8
35
3
0
5
2
14
17
8
5
10
5
10
5
4
8
3
3
20
4
3
3
4
6
2
12
5
14
3
3
5
5
5
5
4
15
5
6
20
5
9
23
0
9
0
7
4
4
4
5
9
8
7
10
5
5
5
6
4
5
13
4
3
7
6
4
10
13
7
5
10
8
7
13
16
2
6
7
10
16
14
4
4
14
4
20
5
9
0
3
6
2
8
6
10
4
10
6
5
11
4
12
5
4
6
7
16
8
4
4
3
2
3
4
9
8
5
5
3
4
4
5
7
11
4
3
7
4
4
5
6
4
4
4
9
4
5
4
20
3
6
5
3
0
5
2
4
3
7
7
11
32
18
4
5
5
9
4
8
5
4
4
4
3
3
4
2
4
6
7
5
5
5
6
5
4
0
6
3
4
26
6
25
2
15
3
7
3
4
3
4
4
7
8
7
8
3
18
4
1
2
3
0
5

7
3
2
6
14
4
5
12
5
13
8
6
0
4
5
13
4
3
20
4
3
3
0
6
0
3
4
4
6
3
0
0
3
4
4
4
3
0
5
7
15
2
3
3
11
13
4
4
4
2
2
?
1
7
16
7
1
1
5
1
4
10
9
5
6
10
4
8
0
23
24
3
4
6
7
29
4
5
38
0
16
8
10
7
5
12
4
3
3
3
3
4
3
13
8
2
5
2
6
10
6
14
2
2
4
10
3
3
5
5
2
11
2
7
1
10
9
4
?
3
7
1
3
4
6
2
3
17
11
4
3
7
2
5
4
2
5
13
4
3
2
6
8
4
12
5
4
6
2
3
5
2
3
5
16
7
4
1
3
2
1
3
4
3
3
2
3
2
4
3
1
6
3


In [124]:
url = "https://api.dronestre.am/data"
response = requests.get(url)
data = json.loads(response.text)
sum = 0
for i in data["strike"]:
    sum += int(i['deaths_min'])
    print(sum)

6
12
14
22
27
35
48
129
137
140
160
165
165
177
185
197
209
210
216
229
237
237
241
247
251
256
261
278
288
292
296
299
320
324
328
333
340
355
359
363
374
385
389
393
398
400
403
409
411
414
416
419
422
429
434
460
490
497
511
513
520
524
536
549
553
553
559
568
576
601
604
616
616
622
689
689
702
718
726
761
764
764
769
771
785
802
810
815
825
830
840
845
849
857
860
863
883
887
890
893
897
903
905
917
922
936
939
942
947
952
957
962
966
981
986
992
1012
1017
1026
1049
1049
1058
1058
1065
1069
1073
1077
1082
1091
1099
1106
1116
1121
1126
1131
1137
1141
1146
1159
1163
1166
1173
1179
1183
1193
1206
1213
1218
1228
1236
1243
1256
1272
1274
1280
1287
1297
1313
1327
1331
1335
1349
1353
1373
1378
1387
1387
1390
1396
1398
1406
1412
1422
1426
1436
1442
1447
1458
1462
1474
1479
1483
1489
1496
1512
1520
1524
1528
1531
1533
1536
1540
1549
1557
1562
1567
1570
1574
1578
1583
1590
1601
1605
1608
1615
1619
1623
1628
1634
1638
1642
1646
1655
1659
1664
1668
1688
1691
1697
1702
1705
1705
1710
1712
1716

ValueError: invalid literal for int() with base 10: ''
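The `ValueError` arises because not every `deaths_min` entry is a number: the printed list further up contains an empty entry and a few `?` entries. One possible way to handle this, sketched below, is to convert what can be converted and count the rest as missing. `parse_count` is a hypothetical helper, and `strikes` is made-up data that only mimics the structure of `data["strike"]`.

```python
def parse_count(value):
    """Return value as an int, or None if it is not a whole number."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

# Hypothetical records mimicking the structure of data["strike"]
strikes = [
    {"deaths_min": "6"},
    {"deaths_min": ""},
    {"deaths_min": "?"},
    {"deaths_min": "2"},
]

total = 0
skipped = 0
for strike in strikes:
    count = parse_count(strike["deaths_min"])
    if count is None:
        skipped += 1      # keep track of how many entries are missing
    else:
        total += count

print(total, skipped)
```

Reporting the number of skipped entries alongside the total makes it clear that the sum is a lower bound. Naming the accumulator `total` rather than `sum` also avoids shadowing Python's built-in `sum` function.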

# 3. Scraping Static Webpages with BeautifulSoup

If copy pasting the information from tables is not possible and there is no public API to access the data, we are left with grabbing the entire webpage and extracting the pieces of information we want. There are two components: obtaining the text of the webpage and extracting only the relevant information. The first piece is very straightforward, but the second can, in some situations, be extremely time-consuming. Modern web design is much more opaque than classic design, not to mention often very poorly done, such that even if the data is clearly laid out on screen, it may not be as clearly laid out in the HTML.

## Grabbing a Webpage

We already did this above when we were working with an API. You use the `get` function from the `requests` module. Let's grab a very simple website to illustrate this point again,

In [43]:
import requests
html = requests.get("http://pythonscraping.com/pages/page1.html") 
html

<Response [200]>

The Response object contains the complete HTML code for the page *http://pythonscraping.com/pages/page1.html*. The status code `200` means that the page downloaded successfully. We won’t fully dive into status codes here, but a status code starting with a `2` generally indicates success, and a code starting with a `4` or a `5` indicates an error. We can print out the HTML content of the page using the `text` property:

In [44]:
html.text

'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
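The rule of thumb for status codes can be written down in a few lines. The sketch below uses only the standard library; `is_success` is a hypothetical helper, and `HTTPStatus` is used merely to look up the standard name of each code.

```python
from http import HTTPStatus

def is_success(status_code):
    """True for 2xx codes, which generally indicate success."""
    return 200 <= status_code < 300

for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase, is_success(code))
```

With `requests`, the number itself is available as `html.status_code`, and `html.raise_for_status()` raises an exception for codes starting with a `4` or a `5`.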

## Extracting Information: BeautifulSoup

The BeautifulSoup library was named after a poem of the same name in *Alice's Adventures in Wonderland*. Like in the original, BeautifulSoup tries to make sense of the nonsensical; it helps to format and organize the messy web by fixing bad HTML and presenting us with easily accessible Python objects. 

To illustrate what BeautifulSoup does, let's go back to the easy webpage we looked at before,

In [47]:
import requests
from bs4 import BeautifulSoup
html = requests.get("http://pythonscraping.com/pages/page1.html") 
soup = BeautifulSoup(html.text, "lxml")  # name the parser explicitly to avoid a warning
soup






<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

We can also use the `prettify` function of the BeautifulSoup library to get a better overview of the nested nature of HTML.

In [48]:
print(soup.prettify())

<html>
 <head>
  <title>
   A Useful Page
  </title>
 </head>
 <body>
  <h1>
   An Interesting Title
  </h1>
  <div>
   Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
  </div>
 </body>
</html>



### The Components of a Webpage

When you visit a web page, your web browser makes a request to a web server. This request is called a GET request, since we’re getting files from the server. The server then sends back files that tell your browser how to render the page for you. The files fall into a few main types:

* HTML -- contains the main content of the page
* CSS -- adds styling to make the page look nicer
* JS -- JavaScript files add interactivity to web pages
* Images -- image formats, such as JPG and PNG, allow web pages to show pictures


### A Primer in HTML

After your browser receives all the files, it renders the page and displays it to you. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look at the HTML (HyperText Markup Language). Note that HTML is not a programming language (like Python) but a markup language that tells a browser how to lay out content. HTML allows you to do similar things to what you do in a word processor like Microsoft Word: make text bold, create paragraphs, and so on. Because HTML isn’t a programming language, it isn’t nearly as complex as Python.

HTML consists of elements called tags. All HTML pages (at least all well-formatted ones) are surrounded by opening and closing `<html> </html>` tags, with `<head>` and `<body>` tags in between. Other tags populate these page headers and page bodies to form the structure and content of the page.

In the example above, the page title (the text shown in a tab in your browser) is "A Useful Page", and the first header, "An Interesting Title", is contained in the `<h1>` tag. Immediately below that is a `div` ("division") tag, containing what might be a main article or longer piece of text. 

You may notice that the `head` and `body` tags are inside the `html` tag. In HTML, tags are nested, and can go inside other tags. Tags have commonly used names that depend on their position in relation to other tags:

* `child` -- a child is a tag inside another tag. So the `div` tag above is a child of the `body` tag.
* `parent` -- a parent is the tag another tag is inside. Above, the `html` tag is the parent of the `body` tag.
* `sibling` -- a sibling is a tag that is nested inside the same parent as another tag. For example, `head` and `body` are siblings, since they’re both inside `html`. `h1` and `div` are siblings, since they’re both inside `body`.

The tags above are extremely common html tags. Here are a few others:

* `p` -- creates a paragraph
* `a` -- renders a link to another webpage; it often comes with the `href` property that determines where the link goes
* `table` -- creates a table
* `form` -- creates an input form

You can find a full list of HTML tags [here](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).

Before we move into actual web scraping, let’s learn about the `class` and `id` properties. These special properties give HTML elements names, and make them easier to interact with when we’re scraping. One element can have multiple classes, and a class can be shared between elements. Each element can only have one id, and an id can only be used once on a page. Classes and ids are optional, and not all elements will have them. They do not change how the tags are rendered. 
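To see what `class` and `id` look like in practice, consider the sketch below. The HTML snippet and all names in it are made up for illustration, and the built-in `html.parser` is used so that nothing beyond `bs4` needs to be installed. The `find` and `find_all` functions are discussed in more detail later in these notes.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one element with an id, two elements sharing a class
snippet = """
<div id="summary" class="box highlight">First</div>
<p class="highlight">Second</p>
"""

soup_demo = BeautifulSoup(snippet, "html.parser")
div = soup_demo.find("div")

print(div["id"])     # the id is a single string
print(div["class"])  # classes come back as a list, since there can be several

# Both elements carry the class "highlight"
highlighted = soup_demo.find_all(class_="highlight")
print([tag.name for tag in highlighted])
```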

### Navigating the BeautifulSoup Object

Now we look at the functionality of BeautifulSoup. Let's look again at our simple website:

In [49]:
print(soup.prettify())

<html>
 <head>
  <title>
   A Useful Page
  </title>
 </head>
 <body>
  <h1>
   An Interesting Title
  </h1>
  <div>
   Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
  </div>
 </body>
</html>



You can access tags by their names. For example, you can access the title by

In [50]:
soup.title

<title>A Useful Page</title>

and the page body by 

In [52]:
soup.body

<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>

and the first header by

In [53]:
soup.h1

<h1>An Interesting Title</h1>

or by

In [55]:
soup.body.h1

<h1>An Interesting Title</h1>

Note that both commands give the same result. You can access the tag which contains the main text by

In [54]:
soup.div

<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>

You can also drop the HTML tags and access the text within, 

In [57]:
soup.div.text

'\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n'

Note that `.text` strips all tags from the element you are working with and returns a string containing the text only. For example, if you are working with a large block of text that contains many hyperlinks, paragraphs, and other tags, all those will be stripped away and you’ll be left with a tagless block of text.
Keep in mind that it’s much easier to find what you’re looking for in a BeautifulSoup object than in a block of text. Calling `.text` should always be the last thing you do, immediately before you print, store, or manipulate your final data. In general, you should try to preserve the tag structure of a document as long as possible.

In [None]:
soup.body.children

In [None]:
for child in soup.body.children:
    print(child)

In [74]:
soup.body.h1.next_sibling

'\n'

In [75]:
soup.body.h1.next_sibling.next_sibling

<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>

In [76]:
soup.div.parent

<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>

In [78]:
soup.div.parent.previous_sibling.previous_sibling

<head>
<title>A Useful Page</title>
</head>
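The cells above move through the tree via children, siblings, and parents. The sketch below repeats those steps on a small, made-up page of the same shape, again with the built-in `html.parser`. It also shows `find_next_sibling()`, which skips over text such as the `'\n'` between two tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical page with the same shape as the example above
page = "<html><head><title>T</title></head><body><h1>Header</h1>\n<div>Text</div></body></html>"
nav = BeautifulSoup(page, "html.parser")

# .children yields tags and the text between them; keep only the tags
child_tags = [child.name for child in nav.body.children if isinstance(child, Tag)]
print(child_tags)

# next_sibling returns the newline; find_next_sibling() moves on to the next tag
print(repr(nav.h1.next_sibling))
print(nav.h1.find_next_sibling())

# .parent moves one level up in the tree
print(nav.div.parent.name)
```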

### find() and findAll() with BeautifulSoup
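BeautifulSoup’s `find()` and `findAll()` are the two functions you will likely use the most. With them, you can easily filter HTML pages to find a single tag, or a list of desired tags, based on their various attributes. Recent versions of BeautifulSoup spell the second function `find_all()`; `findAll()` is the older name for the same function.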

The two functions are extremely similar, as evidenced by their definitions in the BeautifulSoup documentation:

```
findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)
```

In all likelihood, 95% of the time you will find yourself only needing to use the first two arguments: tag and attributes. 

The tag argument is one that we’ve seen before: you can pass the name of a tag as a string, or even a Python list or set of tag names. To illustrate this point, we will use a slightly more sophisticated website:

In [80]:
import requests
from bs4 import BeautifulSoup
html = requests.get("http://www.pythonscraping.com/pages/warandpeace.html") 
soup = BeautifulSoup(html.text, "lxml")  # name the parser explicitly to avoid a warning
print(soup.prettify())

<html>
 <head>
  <style>
   .green{
	color:#55ff55;
}
.red{
	color:#ff5555;
}
#text{
	width:50%;
}
  </style>
 </head>
 <body>
  <h1>
   War and Peace
  </h1>
  <h2>
   Chapter 1
  </h2>
  <div id="text">
   "
   <span class="red">
    Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.
   </span>
   "
   <p>
   </p>
   It was in July, 1805, and the speaker was the well-known
   <span class="green">
    Anna
Pavlovna Scherer
   </span>
   , maid of honor and favorite of the
   <span class="green">
    Empress Marya
Fedorovna
   </span>
   . With these words she greeted
   <span 





This website uses a few CSS rules to control the font colour and the width of the text block. The following will return a list of all `h1` and `h2` header tags in the document:

In [83]:
soup.findAll({"h1","h2"})

[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]

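The following call looks as if it should return both the green and the red `span` tags in the HTML document, but it returns only the red ones: a Python dictionary cannot hold the same key twice, so the second `"class"` entry overwrites the first. To match either class, pass a list of values instead, as in `{"class": ["green", "red"]}`.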

In [88]:
soup.findAll("span", {"class":"green", "class":"red"})

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="red">If you have nothing better to do, Count [or Prince], and if the
 prospect of spending an evening with a poor invalid is not too
 terrible, I shall be very charmed to see you tonight between 7 and 10-
 Annette Scherer.</span>,
 <span class="red">Heavens! what a virulent attack!</span>,
 <span class="red">First of all, dear friend, tell me how you are. Set your friend's
 mind at rest,</span>,
 <span class="red">Can one be well while suffering morally? Can one be calm in times
 l

In [90]:
soup.findAll("", {"class":"red"})

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="red">If you have nothing better to do, Count [or Prince], and if the
 prospect of spending an evening with a poor invalid is not too
 terrible, I shall be very charmed to see you tonight between 7 and 10-
 Annette Scherer.</span>,
 <span class="red">Heavens! what a virulent attack!</span>,
 <span class="red">First of all, dear friend, tell me how you are. Set your friend's
 mind at rest,</span>,
 <span class="red">Can one be well while suffering morally? Can one be calm in times
 l
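The difference between `find` and `findAll`, and the matching of several classes at once, can be tried out on a small example. This is a sketch on a made-up snippet that borrows the class names of the page above; it uses the `find_all` spelling and the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with the same classes as the page above
snippet = '<div><span class="red">quote</span> said <span class="green">Anna</span> to <span class="blue">nobody</span></div>'
demo = BeautifulSoup(snippet, "html.parser")

# A list of values matches tags carrying any of the listed classes
both = demo.find_all("span", {"class": ["green", "red"]})
print([tag.text for tag in both])

# find returns only the first match (or None if there is none)
first_green = demo.find("span", {"class": "green"})
print(first_green.text)
print(demo.find("span", {"class": "yellow"}))
```

Results come back in document order, so the red span precedes the green one here even though `"green"` is listed first.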

### Example: Salaries of All Employees at Tennessee Public Schools

In [125]:
import requests
from bs4 import BeautifulSoup
html = requests.get("https://www.tbr.edu/hr/salaries?firstname=&lastname=&department=&jobtitle=&institution=&page=0") 
soup = BeautifulSoup(html.text, "lxml")  # name the parser explicitly to avoid a warning
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]> <html class="ie6 ie" lang="en" dir="ltr"> <![endif]-->
<!--[if IE 7]>    <html class="ie7 ie" lang="en" dir="ltr"> <![endif]-->
<!--[if IE 8]>    <html class="ie8 ie" lang="en" dir="ltr"> <![endif]-->
<!--[if IE 9]>    <html class="ie9 ie" lang="en" dir="ltr"> <![endif]-->
<!--[if !IE]> -->
<html dir="ltr" lang="en">
 <!-- <![endif]-->
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <link href="https://www.tbr.edu/profiles/tbr_hosting/themes/tbr_bootstrap/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
  <meta content="Drupal 7 (http://drupal.org)" name="generator"/>
  <link href="https://www.tbr.edu/hr/salaries" rel="canonical"/>
  <link href="https://www.tbr.edu/hr/salaries" rel="shortlink"/>
  <meta content="Tennessee Board of Regents" property="og:site_name"/>
  <meta content="article" property="og:type"/>
  <meta content="https://www.tbr.edu/hr/salaries" property="og:url"/>
  <meta content="Salar





In [159]:
institutions = [i.text for i in soup.findAll("td", {"class":"views-field views-field-institution-1"})]
institutions

['\n            Walters State Comm College          ',
 '\n            Southwest TN Comm College          ',
 '\n            Chattanooga State Comm College          ',
 '\n            Tennessee Board of Regents          ',
 '\n            Nashville State Comm College          ',
 '\n            Chattanooga State Comm College          ',
 '\n            Nashville State Comm College          ',
 '\n            Southwest TN Comm College          ',
 '\n            Jackson State Comm College          ',
 '\n            Cleveland State Comm College          ',
 '\n            Dyersburg State Comm College          ',
 '\n            Volunteer State Comm College          ',
 '\n            TCAT Pulaski          ',
 '\n            Pellissippi State Comm College          ',
 '\n            Chattanooga State Comm College          ',
 '\n            Chattanooga State Comm College          ',
 '\n            Tennessee Board of Regents          ',
 '\n            Northeast State Comm College       

In [160]:
institutions = [i.strip() for i in institutions]
institutions

['Walters State Comm College',
 'Southwest TN Comm College',
 'Chattanooga State Comm College',
 'Tennessee Board of Regents',
 'Nashville State Comm College',
 'Chattanooga State Comm College',
 'Nashville State Comm College',
 'Southwest TN Comm College',
 'Jackson State Comm College',
 'Cleveland State Comm College',
 'Dyersburg State Comm College',
 'Volunteer State Comm College',
 'TCAT Pulaski',
 'Pellissippi State Comm College',
 'Chattanooga State Comm College',
 'Chattanooga State Comm College',
 'Tennessee Board of Regents',
 'Northeast State Comm College',
 'Northeast State Comm College',
 'Walters State Comm College',
 'Southwest TN Comm College',
 'Southwest TN Comm College',
 'Dyersburg State Comm College',
 'TCAT Memphis',
 'Pellissippi State Comm College',
 'Motlow State Comm College',
 'TCAT Jackson',
 'TCAT Ripley',
 'TCAT Nashville',
 'TCAT Pulaski',
 'Northeast State Comm College',
 'Nashville State Comm College',
 'Chattanooga State Comm College',
 'TCAT Jacksboro'

In [164]:
lastname    = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-lastname"})]
firstname   = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-firstname"})]
jobtitle    = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-jobtitle"})]
department  = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-department"})]
salary      = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-php"})]
fte         = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-fte"})]

In [166]:
import pandas as pd
tennessee_salaries_df = pd.DataFrame({
    "institution": institutions, 
    "last_name": lastname, 
    "first_name": firstname,
    "job_title": jobtitle,
    "department": department,
    "salary": salary,
    "fte": fte
    })
tennessee_salaries_df

Unnamed: 0,department,first_name,fte,institution,job_title,last_name,salary
0,Industrial Technology,Andrew,1.0,Walters State Comm College,Associate Professor,Aarons,"$60,178"
1,Business and Legal Studies,Cynthia,1.0,Southwest TN Comm College,Associate Professor,Abadie,"$59,040"
2,Customer Response Center,Joyce,0.8,Chattanooga State Comm College,Admissions Records Clerk Pt,Abbott,"$21,901"
3,TNeCampus,Harun,1.0,Tennessee Board of Regents,System Adm Specialist,Abdulle,"$58,794"
4,Property Management,Edward,1.0,Nashville State Comm College,Stock Clerk 3,Abel,"$30,967"
5,Educational Planning and Advising,Roni,1.0,Chattanooga State Comm College,Academic Specialist,Abraham,"$42,361"
6,Payroll,Rebecca,1.0,Nashville State Comm College,Department Head,Abu-Orf,"$70,236"
7,TECTA Grant,Janura,1.0,Southwest TN Comm College,Coordinator,Acoff,"$38,016"
8,Small Business Development Center,Ronald,1.0,Jackson State Comm College,"Director, Sbdc",Acree,"$56,253"
9,Nursing,Cynthia,1.0,Cleveland State Comm College,Instructor,Acuff,"$22,465"
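Everything scraped this way is text, including the salary and FTE columns. Before any analysis, those need to be converted to numbers. The sketch below shows one way to do this; `parse_salary` is a hypothetical helper, and the values are copied from the first rows of the table above.

```python
def parse_salary(text):
    """Turn a string such as "$60,178" into the integer 60178."""
    return int(text.replace("$", "").replace(",", "").strip())

# Values copied from the first rows of the table above
salary_text = ["$60,178", "$59,040", "$21,901"]
fte_text = ["1.0", "1.0", "0.8"]

salary_numbers = [parse_salary(s) for s in salary_text]
fte_numbers = [float(f) for f in fte_text]
print(salary_numbers, fte_numbers)
```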


### Data Spread Over Multiple Pages

Often data will not be limited to a single page, but spread across many. This could be a separate page per year/date or just a limit on the number of results listed per page.

As an example, let’s go back to the salary data for all employees of Tennessee public schools.

We see the first 50 individuals listed here, with no obvious way to list more individuals on a single page. However, if we go to the second page of results (the pages are listed below the data), we see the URL changes to [https://www.tbr.edu/hr/salaries?firstname=&lastname=&department=&jobtitle=&institution=&page=1](https://www.tbr.edu/hr/salaries?firstname=&lastname=&department=&jobtitle=&institution=&page=1). This is common in webpage design. Each of those arguments (`firstname=`, `lastname=`, etc.) could be non-blank (try it: restrict the results by using `lastname=smith`).

The most interesting argument for our purposes is page=1. Changing that to page=0 gives us the first page of results, and clicking to the the “last” results gives page=119. We can therefore use BeautifulSoup and scrape all these pages.
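Rather than gluing strings together by hand, the query string can be assembled from a dictionary. The following is a minimal sketch using only the standard library; the helper name `salaries_url` is made up for illustration, and the parameter names are the ones visible in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.tbr.edu/hr/salaries"

def salaries_url(page, **filters):
    """Build the results URL for one page; unspecified filters stay blank."""
    params = {
        "firstname": "",
        "lastname": "",
        "department": "",
        "jobtitle": "",
        "institution": "",
    }
    params.update(filters)   # e.g. lastname="smith"
    params["page"] = page    # pages are numbered from 0
    return BASE_URL + "?" + urlencode(params)

print(salaries_url(1))
print(salaries_url(0, lastname="smith"))
```

The first call reproduces the URL of the second results page; the second restricts the results to one last name.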

In [180]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = "https://www.tbr.edu/hr/salaries?firstname=&lastname=&department=&jobtitle=&institution=&page="

# Map each output column to the CSS class of the table cells that hold it
columns = {
    "institution": "views-field views-field-institution-1",
    "last_name":   "views-field views-field-lastname",
    "first_name":  "views-field views-field-firstname",
    "job_title":   "views-field views-field-jobtitle",
    "department":  "views-field views-field-department",
    "salary":      "views-field views-field-php",
    "fte":         "views-field views-field-fte",
}

frames = []
for page in range(0, 120):  # pages are numbered 0 through 119, i.e. 120 pages
    html = requests.get(base_url + str(page))
    # Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning
    soup = BeautifulSoup(html.text, "lxml")
    frames.append(pd.DataFrame({
        name: [cell.text.strip() for cell in soup.find_all("td", {"class": css_class})]
        for name, css_class in columns.items()
    }))

# DataFrame.append was removed in pandas 2.0; concatenate the per-page frames instead
tennessee_salaries_df = pd.concat(frames, ignore_index=True)
tennessee_salaries_df





Unnamed: 0,department,first_name,fte,institution,job_title,last_name,salary
0,Industrial Technology,Andrew,1,Walters State Comm College,Associate Professor,Aarons,"$60,178"
1,Business and Legal Studies,Cynthia,1,Southwest TN Comm College,Associate Professor,Abadie,"$59,040"
2,Customer Response Center,Joyce,0.8,Chattanooga State Comm College,Admissions Records Clerk Pt,Abbott,"$21,901"
3,TNeCampus,Harun,1,Tennessee Board of Regents,System Adm Specialist,Abdulle,"$58,794"
4,Property Management,Edward,1,Nashville State Comm College,Stock Clerk 3,Abel,"$30,967"
5,Educational Planning and Advising,Roni,1,Chattanooga State Comm College,Academic Specialist,Abraham,"$42,361"
6,Payroll,Rebecca,1,Nashville State Comm College,Department Head,Abu-Orf,"$70,236"
7,TECTA Grant,Janura,1,Southwest TN Comm College,Coordinator,Acoff,"$38,016"
8,Small Business Development Center,Ronald,1,Jackson State Comm College,"Director, Sbdc",Acree,"$56,253"
9,Nursing,Cynthia,1,Cleveland State Comm College,Instructor,Acuff,"$22,465"
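
The loop above hard-codes the number of pages, which stops being correct as soon as the site adds or removes records. One alternative is to keep requesting pages until one comes back empty. The sketch below shows only the control flow: `fetch_rows` is a hypothetical function that returns the list of rows found on a given page, and the approach assumes (unverified for this site) that a page past the end returns no rows. A short pause between requests keeps the load on the server low.

```python
import time

def scrape_all_pages(fetch_rows, delay=1.0, max_pages=1000):
    """Collect rows page by page until a page returns no rows.

    fetch_rows: hypothetical callable, page number -> list of rows.
    delay:      seconds to wait between requests.
    max_pages:  safety cap in case the site never returns an empty page.
    """
    all_rows = []
    for page in range(max_pages):
        rows = fetch_rows(page)
        if not rows:          # empty page: assume we are past the last page
            break
        all_rows.extend(rows)
        time.sleep(delay)     # be polite to the server
    return all_rows

# Stand-in for a real scraper: three pages of fake rows, then nothing
fake_site = {0: ["a", "b"], 1: ["c", "d"], 2: ["e"]}
rows = scrape_all_pages(lambda page: fake_site.get(page, []), delay=0)
print(rows)
```

In a real scraper, `fetch_rows` would wrap the `requests.get` and BeautifulSoup calls used in the cell above.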


### In Summary

* Inspect the HTML source code to locate where in the Document Object Model (DOM) the information of interest resides
* View the source code in your browser (cmd+alt+u in Chrome on OSX)
* Use BeautifulSoup to parse HTML and extract information. BeautifulSoup creates an object that represents the original HTML information in a nested structure that you can navigate with BeautifulSoup commands. 
    * tags <...>
    * #id
    * .class
    * text
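
The ways of addressing content listed above can be sketched on a small HTML snippet. The markup below is invented for illustration (only the two names and salaries are taken from the table earlier in these notes), and `html.parser` is Python's built-in parser.

```python
from bs4 import BeautifulSoup

html = """
<table id="salaries">
  <tr><td class="name">Aarons</td><td class="salary">$60,178</td></tr>
  <tr><td class="name">Abadie</td><td class="salary">$59,040</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr")                    # select by tag
table = soup.find(id="salaries")              # select by id
names = soup.find_all("td", class_="name")    # select by class
# CSS selector combining #id and .class, then extract the text of each cell
salaries = [td.text.strip() for td in soup.select("#salaries .salary")]

print(len(rows), table.name, [n.text for n in names], salaries)
```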

### Challenges

* Dealing with bad markup on webpages
* JavaScript driven content
* Webpages needing Login

We will use Selenium to address these problems.

## 4. Scraping Dynamic Webpages with Selenium

JavaScript is, by far, the most common and most well-supported client-side scripting language on the Web today. It can be used to collect information for user tracking, submit forms without reloading the page, embed multimedia, and even power entire online games. Even deceptively simple-looking pages can often contain multiple pieces of JavaScript. You can find it embedded between `<script>` tags in the page’s source code.


### AJAX and Dynamic HTML

If you’ve ever submitted a form or retrieved information from a server without reloading the page, you’ve likely used a website that uses Ajax. Ajax stands for *Asynchronous JavaScript and XML*, and it is a group of technologies used to send information to and receive information from a web server without making a separate page request. Like Ajax, dynamic HTML or DHTML is a collection of technologies used for a common purpose. DHTML is HTML code, CSS language, or both that change due to client-side scripts changing HTML elements on the page. A button might appear only after the user moves the cursor, a background color might change on a click, or an Ajax request might trigger a new block of content to load.

Note that although the word “dynamic” is generally associated with words like “moving,” or “changing,” the presence of interactive HTML components, moving images, or embedded media does not necessarily make a page DHTML, even though it might look dynamic. In addition, some of the most boring, static-looking pages on the Internet can have DHTML processes running behind the scenes that depend on the use of JavaScript to manipulate the HTML and CSS.

If you scrape a large number of different websites, you will soon run into a situation in which the content you are viewing in your browser does not match the content you see in the source code you’re retrieving from the site. You might view the output of your scraper and scratch your head, trying to figure out where everything you’re seeing on the exact same page in your browser has disappeared to. The web page might also have a loading page that appears to redirect you to another page of results, but you’ll notice that the page’s URL never changes when this redirect happens. Both of these are caused by a failure of your scraper to execute the JavaScript that is making the magic happen on the page. Without the JavaScript, the HTML just sort of sits there, and the site might look very different than what it looks like in your web browser, which executes the JavaScript without problem.
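
This mismatch can be reproduced without a browser. In the made-up page below, the visible text is meant to be replaced by a script. BeautifulSoup only parses the markup and never runs the script, so it still sees the placeholder. The HTML is invented for illustration and merely resembles such a page.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <div id="content">Loading...</div>
  <script>
    document.getElementById("content").innerHTML = "The data you want";
  </script>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# The parser does not execute JavaScript, so the div still holds the placeholder
content = soup.find(id="content").get_text()
print(content)

# The script is present in the source, but only as inert text
has_script = soup.find("script") is not None
print(has_script)
```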

    
### Executing JavaScript in Python with Selenium

Selenium is a powerful web scraping tool developed originally for website testing. These days it’s also used when the accurate portrayal of websites — as they appear in a browser — is required. Selenium works by automating browsers to load the website, retrieve the required data, and even take screenshots or assert that certain actions happen on the website.

Selenium does not contain its own web browser; it requires integration with third-party browsers in order to run. If you were to run Selenium with Firefox, for example, you would literally see a Firefox instance open up on your screen, navigate to the website, and perform the actions you had specified in the code. Although this might be neat to watch, I prefer my scripts to run quietly in the background, so I use a tool called PhantomJS in lieu of an actual browser.

PhantomJS is what is known as a “headless” browser. It loads websites into memory and executes JavaScript on the page, but does so without any graphic rendering of the website to the user. By combining Selenium with PhantomJS, you can run an extremely powerful web scraper that handles cookies, JavaScript, headers, and everything else you need with ease.

You can download the Selenium library from its website or use a third-party installer such as pip to install it from the command line. PhantomJS can be downloaded from its website. Because PhantomJS is a full (albeit headless) browser and not a Python library, it requires a separate download and installation and cannot be installed with pip.

Although there are plenty of pages that use Ajax to load data (notably Google), we will run our scrapers against the sample page at http://pythonscraping.com/pages/javascript/ajaxDemo.html. This page contains some sample text, hardcoded into the page’s HTML, that is replaced by Ajax-generated content after a two-second delay. If we were to scrape this page’s data using traditional methods, we’d only get the loading page, without actually getting the data that we want.

The Selenium library is an API called on a *WebDriver*. The WebDriver is a bit like a browser in that it can load websites, but it can also be used like a BeautifulSoup object to find page elements, interact with elements on the page (send text, click, etc.), and do other actions to drive the web scrapers.

The following code retrieves text behind an Ajax “wall” on the test page:

In [191]:
from selenium import webdriver 
import time

driver = webdriver.Chrome('/Applications/chromedriver')
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html") 
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()

Here is some important text you want to retrieve!
A button to click!


In [190]:
from selenium import webdriver 
import time

driver = webdriver.PhantomJS(executable_path='/Applications/phantomjs-2.1.1-macosx/bin/phantomjs')
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html") 
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()

Here is some important text you want to retrieve!
A button to click!


The second cell creates a new Selenium WebDriver backed by PhantomJS (the first does the same with a visible Chrome window), tells the WebDriver to load a page, and then pauses execution for three seconds before reading the (hopefully loaded) content.

Note that although the page itself contains an HTML button, Selenium’s `.text` attribute retrieves the text value of the button in the same way that it retrieves all other content on the page.

If the `time.sleep` pause is changed to one second instead of three, the text returned changes to the original:
```    
This is some content that will appear on the page while it's loading.
You don't care about scraping this.
```

Although this solution works, it is somewhat inefficient and implementing it could cause problems on a large scale. Page load times are inconsistent, depending on the server load at any particular millisecond, and natural variations occur in connection speed. Although this page load should take just over two seconds, we’re giving it an entire three seconds to make sure that it loads completely. A more efficient solution would repeatedly check for the existence of some element on a fully loaded page and return only when that element exists.

The following code uses the presence of the button with id `loadedButton` to decide that the page has been fully loaded:

In [192]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS(executable_path='/Applications/phantomjs-2.1.1-macosx/bin/phantomjs')
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    # Poll for up to 10 seconds until the button with id "loadedButton" is in the DOM
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "loadedButton")))
finally:
    # Runs whether or not the wait succeeded, so the browser is always closed
    print(driver.find_element_by_id("content").text)
    driver.close()

Here is some important text you want to retrieve!
A button to click!
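
Conceptually, an explicit wait is a polling loop: check a condition, sleep briefly, and give up after a timeout. The following plain-Python sketch illustrates that idea only; it is not Selenium's implementation, and `wait_until` is a made-up helper name.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll)

# Stand-in for "the element has appeared": becomes true after 0.2 seconds
start = time.monotonic()
ready = wait_until(lambda: time.monotonic() - start > 0.2, timeout=2, poll=0.05)
print(ready)
```

In the Selenium cell above, `WebDriverWait(driver, 10).until(...)` plays the role of `wait_until`, and `EC.presence_of_element_located` plays the role of the condition.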



* Selenium WebDriver accepts commands, sends them to the browser, and retrieves results
* You can navigate and search the DOM:
    + `find_element_by_id('id')`
    + `find_elements_by_tag_name('table')`
    + `find_elements_by_class_name('myclassname')`
    + `find_elements_by_xpath()`
        - for example, `find_elements_by_xpath("//table[@id='list']")` will find a table with the id = 'list'
    + and other methods
* You can click on links using WebDriver's `click()` method



### Selenium Selectors

In the previous section, we selected page elements using BeautifulSoup selectors, such as `find` and `findAll`. Selenium uses its own set of selectors to find an element in a WebDriver’s DOM, although they have fairly straightforward names.

In the example, we used the selector `find_element_by_id`, although the following selectors would have worked as well:
```
driver.find_element_by_css_selector("#content")
driver.find_element_by_tag_name("div")
```    
If you want to select multiple elements on the page, most of these selectors can return a Python list of elements simply by using `elements` (i.e., making it plural):
```
driver.find_elements_by_css_selector("#content")
driver.find_elements_by_css_selector("div")
```
If you still want to use BeautifulSoup to parse this content, you can use WebDriver’s `page_source` attribute, which returns the page’s source, as seen in the DOM at that moment, as a string:
```
pageSource = driver.page_source
bsObj = BeautifulSoup(pageSource, "html.parser")  # name the parser explicitly
print(bsObj.find(id="content").get_text())
```

### In Summary

* AJAX and Dynamic HTML refer to groups of web technologies used to implement web applications that communicate with the server in the background without interfering with the display of the page or reloading the entire page
* For web scraping, this means that the data you are interested in may not be in the source HTML file you received from the server but is generated on the client side
* You can use browser development tools "inspect elements" (cmd+alt+u and cmd+alt+c in Chrome on OSX) to inspect the AJAX generated HTML that your browser displays 
* Use Selenium to generate the HTML after running JavaScript. Then,
    + either access it as a string and use it as an input to BeautifulSoup for data extraction
    + or navigate the DOM and find elements within Selenium

## Sources
The exposition is inspired by [lecture notes by Josh Errickson](http://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/webscrape.html#fn5) and the book *Web Scraping with Python* by Ryan Mitchell.