# Introduction


There is no silver bullet to getting info from the internet. 

The coding requirements in these notes start easy and gradually become more demanding. We will cover the following web scraping techniques:

1. Copy-pasting clean tables
2. APIs
3. Scraping static webpages with BeautifulSoup
4. Scraping dynamic webpages with Selenium

# 1. Copy Pasting

No, this isn’t a joke. If the data is in a clean table and small enough that you can highlight it all, copying and pasting often produces a text document that can be read in easily. Wikipedia tables, for example, are easy to approach this way. Consider the list of countries by nominal GDP on [Wikipedia](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)). If you highlight one of the tables and paste it into a plain-text editor such as Atom, you can save it as a tab-separated text file (here `gdp_list_imf.txt`) and load it with pandas:

In [13]:
import pandas as pd

In [14]:
data = pd.read_csv("gdp_list_imf.txt", sep = "\t", names = ['no', 'country', 'gdp'])

In [15]:
data[1:15]

| | no | country | gdp |
|---|---|---|---|
| 1 | (US$MM) | | |
| 2 | 190 | Tuvalu | 40.0 |
| 3 | 189 | Kiribati | 186.0 |
| 4 | 188 | Marshall Islands | 199.0 |
| 5 | 187 | Palau | 321.0 |
| 6 | 186 | Federated States of Micronesia | 329.0 |
| 7 | 185 | São Tomé and Príncipe | 372.0 |
| 8 | 184 | Tonga | 437.0 |
| 9 | 183 | Dominica | 608.0 |
| 10 | 182 | Comoros | 659.0 |


In [16]:
data[191:193]

| | no | country | gdp |
|---|---|---|---|
| 191 | 1 | United States | 19390600 |
| 192 | — | European Union[n 1][19] | 17308862 |
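
Pasted Wikipedia tables often carry leftovers such as footnote markers (`European Union[n 1][19]`) or thousands separators. Below is a minimal sketch of cleaning such cells with the standard library. It assumes the markers always sit in square brackets; the helper names `clean_country` and `parse_number` are made up for illustration.

```python
import re

def clean_country(name):
    """Drop bracketed footnote markers such as [n 1] or [19]."""
    return re.sub(r"\[[^\]]*\]", "", name).strip()

def parse_number(text):
    """Turn '19,390,600' into 19390600; return None when the cell is not a whole number."""
    digits = text.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None

print(clean_country("European Union[n 1][19]"))  # European Union
print(parse_number("19,390,600"))                # 19390600
print(parse_number("—"))                         # None
```

Returning `None` for cells like `—` keeps missing ranks visible instead of silently turning them into zeros.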


# 2. API

Application Programming Interface (API): a set of methods for accessing data that is not publicly available as a complete data set.

A lot of APIs are private - either they’re only available to people who are authorized to use them, or they’re used internally by development teams. However, a good number are public, but may require purchasing access or at least registering to get an API key. There’s a useful list of [public APIs](https://github.com/toddmotto/public-apis) maintained by Todd Motto. 

## Rate Limits
Note that many APIs have rate limits: a cap on the number of requests you can send in a particular window of time (e.g. 100 requests per hour). Before spending a lot of time scraping, make sure you know the rate limit and structure your requests accordingly. You may need to spread your scraping over a couple of days.
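
One simple way to respect a limit is to pause between requests. The sketch below is only an illustration: `fetch_all` and `pause_seconds` are made-up names, and the default of 36 seconds assumes the example limit of 100 requests per hour (3600 / 100). The demo uses a stand-in function instead of a real download.

```python
import time

def fetch_all(urls, fetch, pause_seconds=36.0):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(pause_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results

# Demo with a stand-in fetch function and a tiny pause, so nothing is downloaded.
demo = fetch_all(["a", "b", "c"], fetch=str.upper, pause_seconds=0.01)
print(demo)  # ['A', 'B', 'C']
```

In real use you would pass something like `requests.get` as `fetch` and keep the pause at or above what the API documentation asks for.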

## JSON
A lot of formal APIs return data in a format known as JSON (JavaScript Object Notation). It is similar to dictionaries in Python. An example modified from [Wikipedia](https://en.wikipedia.org/wiki/JSON):

``` 
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "children": [],
  "spouse": null
}
```

In [17]:
# Import json library to be able to read the data we want to import
import json

In [18]:
with open('wiki_example.json', 'r') as json_file:
  json_data = json.load(json_file)
json_data

{'firstName': 'John',
 'lastName': 'Smith',
 'isAlive': True,
 'age': 25,
 'address': {'streetAddress': '21 2nd Street',
  'city': 'New York',
  'state': 'NY',
  'postalCode': '10021-3100'},
 'children': [],
 'spouse': None}

In [19]:
# json.load returns a regular Python dictionary, so the data can be accessed with familiar syntax.
type(json_data)

dict

In [20]:
# let's access the dictionary
json_data['firstName']

'John'
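
Nested values are reached by chaining keys, and JSON literals are mapped to their Python counterparts on loading. A short sketch, using a shortened copy of the record above as a string (the variable names are arbitrary):

```python
import json

raw = '{"firstName": "John", "isAlive": true, "address": {"city": "New York"}, "children": [], "spouse": null}'
person = json.loads(raw)  # loads() parses a string; load() reads from a file

print(person["address"]["city"])        # nested dictionary: chain the keys
print(person["isAlive"] is True)        # JSON true -> Python True
print(person["spouse"] is None)         # JSON null -> Python None
print(person.get("middleName", "n/a"))  # .get() avoids a KeyError for missing keys
```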

## Example: Currency Rates
The site [exchangeratesapi.io](https://exchangeratesapi.io/) provides an API for daily currency conversion rates. The URL depends on the desired date; the current rates are at [api.exchangeratesapi.io/latest](https://api.exchangeratesapi.io/latest). Query parameters include `base=` to set the base currency that the rates are quoted against (the default is “EUR”). We will obtain the JSON data from an API query and use the json library to parse it.

In [36]:
import json
import requests
# You will use these snippets very often when scraping:

# Define the address we want to query as url
url = "https://api.exchangeratesapi.io/latest"
# Store the result as "response". requests.get(url) sends a GET request and loads the page.
response = requests.get(url)


In [38]:
# Let's try to get the data. Displaying the response object only shows the status code.
# response.text gives the body as a string, but strings are impractical: you would need to split them to access the info.
#data = response.text
data = response
data
#type(data)


<Response [200]>

In [39]:
# Let's get the actual output like this
data = response.text
data

'{"base":"EUR","date":"2018-08-28","rates":{"AUD":1.5924,"BGN":1.9558,"BRL":4.7909,"CAD":1.5128,"CHF":1.1427,"CNY":7.9641,"CZK":25.713,"DKK":7.4574,"GBP":0.9068,"HKD":9.1922,"HRK":7.4325,"HUF":323.79,"IDR":17122.0,"ILS":4.2431,"INR":82.096,"ISK":124.7,"JPY":130.03,"KRW":1294.7,"MXN":21.95,"MYR":4.797,"NOK":9.7243,"NZD":1.7444,"PHP":62.387,"PLN":4.2723,"RON":4.6464,"RUB":79.073,"SEK":10.663,"SGD":1.5945,"THB":38.098,"TRY":7.3316,"USD":1.171,"ZAR":16.589}}'

In [41]:
# Looks like a dict, except this doesn't work. Why?
data["base"]


TypeError: string indices must be integers

In [42]:
# Because it is just text (a string)
type(data)

str

In [23]:
# Much easier: parse the JSON text to get a dictionary
# (data = response.json() does the same in one step)
data = json.loads(response.text)
data


{'base': 'EUR',
 'date': '2018-08-28',
 'rates': {'AUD': 1.5924,
  'BGN': 1.9558,
  'BRL': 4.7909,
  'CAD': 1.5128,
  'CHF': 1.1427,
  'CNY': 7.9641,
  'CZK': 25.713,
  'DKK': 7.4574,
  'GBP': 0.9068,
  'HKD': 9.1922,
  'HRK': 7.4325,
  'HUF': 323.79,
  'IDR': 17122.0,
  'ILS': 4.2431,
  'INR': 82.096,
  'ISK': 124.7,
  'JPY': 130.03,
  'KRW': 1294.7,
  'MXN': 21.95,
  'MYR': 4.797,
  'NOK': 9.7243,
  'NZD': 1.7444,
  'PHP': 62.387,
  'PLN': 4.2723,
  'RON': 4.6464,
  'RUB': 79.073,
  'SEK': 10.663,
  'SGD': 1.5945,
  'THB': 38.098,
  'TRY': 7.3316,
  'USD': 1.171,
  'ZAR': 16.589}}

In [24]:
# The data is now a dictionary.
# Note that "rates" is itself a dictionary nested inside it.
type(data)

dict

In [25]:
# What is the rate of EUR to CHF?
data["rates"]

{'AUD': 1.5924,
 'BGN': 1.9558,
 'BRL': 4.7909,
 'CAD': 1.5128,
 'CHF': 1.1427,
 'CNY': 7.9641,
 'CZK': 25.713,
 'DKK': 7.4574,
 'GBP': 0.9068,
 'HKD': 9.1922,
 'HRK': 7.4325,
 'HUF': 323.79,
 'IDR': 17122.0,
 'ILS': 4.2431,
 'INR': 82.096,
 'ISK': 124.7,
 'JPY': 130.03,
 'KRW': 1294.7,
 'MXN': 21.95,
 'MYR': 4.797,
 'NOK': 9.7243,
 'NZD': 1.7444,
 'PHP': 62.387,
 'PLN': 4.2723,
 'RON': 4.6464,
 'RUB': 79.073,
 'SEK': 10.663,
 'SGD': 1.5945,
 'THB': 38.098,
 'TRY': 7.3316,
 'USD': 1.171,
 'ZAR': 16.589}

In [26]:
data["rates"]["CHF"]

1.1427
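
The `base=` parameter can be attached to the URL with `urllib.parse.urlencode`, and rates for other currency pairs can be derived from a EUR-based response by division. The sketch below works offline on a shortened copy of the response shown above; `latest_url` is a made-up helper name.

```python
import json
from urllib.parse import urlencode

def latest_url(base=None):
    """Build the 'latest' URL, adding ?base=... only when a base is given."""
    url = "https://api.exchangeratesapi.io/latest"
    if base:
        url += "?" + urlencode({"base": base})
    return url

print(latest_url("USD"))  # https://api.exchangeratesapi.io/latest?base=USD

# Shortened copy of the response shown above (base EUR).
sample = '{"base":"EUR","date":"2018-08-28","rates":{"CHF":1.1427,"USD":1.171}}'
rates = json.loads(sample)["rates"]

# (CHF per EUR) / (USD per EUR) = CHF per USD
usd_to_chf = rates["CHF"] / rates["USD"]
print(round(usd_to_chf, 4))
```

Deriving the cross rate locally saves one request, which matters when the API is rate limited.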

# Exercise 1

Request the USD to JPY rate on 7 March 2018. Hint: check the website to see how to build the URL.



In [27]:
url = "https://api.exchangeratesapi.io/2018-03-07?base=USD"
# Store the result as "response". requests.get(url) loads the page.
response = requests.get(url)
# json.loads converts the JSON string into a dictionary, so we can access the data by key
# (data = response.json() does the same)
data = json.loads(response.text)
data['rates']["JPY"]

105.83

# Exercise 2: Count the Number of Victims of US Drone Strikes
The API [https://dronestre.am/](https://dronestre.am/) provides information on all reported US drone strikes. Write a few lines of code to access the Raw JSON data and parse them. Calculate the lower bound of the number of victims (deaths_min) of US drone strikes. 

In [35]:
url = "http://api.dronestre.am/data"
response = requests.get(url)
data = json.loads(response.text)
data


{'status': 'OK',
 'strike': [{'_id': '55c79e711cbee48856a30886',
   'number': 1,
   'country': 'Yemen',
   'date': '2002-11-03T00:00:00.000Z',
   'narrative': 'In the first known US targeted assassination using a drone, a CIA Predator struck a car, killing 6 people.',
   'town': '',
   'location': 'Marib Province',
   'deaths': '6',
   'deaths_min': '6',
   'deaths_max': '6',
   'civilians': '0',
   'injuries': '',
   'children': '',
   'tweet_id': '278544689483890688',
   'bureau_id': 'YEM001',
   'bij_summary_short': 'In the first known US targeted assassination using a drone, a CIA Predator struck a car killing six al Qaeda suspects.',
   'bij_link': 'http://www.thebureauinvestigates.com/2012/03/29/yemen-reported-us-covert-actions-since-2001/',
   'target': '',
   'lat': '15.47467',
   'lon': '45.322755',
   'articles': [],
   'names': ["Qa'id Salim Sinan al-Harithi, Abu Ahmad al-Hijazi, Salih Hussain Ali al-Nunu, Awsan Ahmad al-Tarihi, Munir Ahmad Abdallah al-Sauda, Adil Nasir al-S

In [32]:
data["strike"]
data["strike"][0]
data["strike"][0]["deaths_min"]
data["strike"][5]["deaths_min"]

'8'

In [33]:
# Let's try a loop
lower_bound = 0

# Each element is one strike, identified by its strike id. Let's see how to access them.
for element in data['strike']:
    print(element)
    print(type(element))

{'_id': '55c79e711cbee48856a30888', 'number': 3, 'country': 'Pakistan', 'date': '2005-05-08T00:00:00.000Z', 'narrative': "2 people killed in a Predator strike which reportedly targeted Haitham al-Yemeni's mobile phone.", 'town': 'Toorikhel', 'location': 'North Waziristan', 'deaths': '2', 'deaths_min': '2', 'deaths_max': '2', 'civilians': '', 'injuries': '', 'children': '', 'tweet_id': '278544812255367168', 'bureau_id': 'B2', 'bij_summary_short': 'Two killed, including Haitham al-Yemeni an al Qaeda explosives expert, near Mir Ali, North Waziristan.', 'bij_link': 'http://www.thebureauinvestigates.com/2011/08/10/the-bush-years-2004-2009/', 'target': 'Haitham al-Yemeni', 'lat': '32.98677989', 'lon': '70.26082993', 'articles': [], 'names': ['Haitham al-Yemeni, Samiullah Khan']}
<class 'dict'>
{'_id': '55c79e721cbee48856a30889', 'number': 4, 'country': 'Pakistan', 'date': '2005-11-05T00:00:00.000Z', 'narrative': "A failed strike destroyed Abu Hamza Rabia's house and killed 8 people, includin

In [18]:
# Let's try a loop
lower_bound = 0

# Now that we know each element is a dictionary (if we didn't before), we know how to extract values.
# For each strike in the list, get the deaths_min entry.
# We want a number, so check whether the value is already an integer.
for element in data['strike']:
    # define what we want to extract and see if it works
    lower_bound = element['deaths_min']
    #print(lower_bound)
    # then check its type, because what we want is a number
    print(type(lower_bound))
    
    # A string is not ideal: we want an integer

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class

In [19]:
# We need to convert the string into an integer.
# Let's google a bit:
# https://www.digitalocean.com/community/tutorials/how-to-convert-data-types-in-python-3
# We have to use int() to convert a string into an integer

In [20]:
lower_bound = 0

for element in data["strike"]:
    lower_bound += int(element['deaths_min'])
    print(lower_bound)
    # also check the type
    print(type(lower_bound))

print("This is my result", lower_bound)
# Still doesn't work: the loop stops with a ValueError. What next? Building error handling into your code helps a lot.

6
<class 'int'>
12
<class 'int'>
14
<class 'int'>
22
<class 'int'>
27
<class 'int'>
35
<class 'int'>
48
<class 'int'>
129
<class 'int'>
137
<class 'int'>
140
<class 'int'>
160
<class 'int'>
165
<class 'int'>
165
<class 'int'>
177
<class 'int'>
185
<class 'int'>
197
<class 'int'>
209
<class 'int'>
210
<class 'int'>
216
<class 'int'>
229
<class 'int'>
237
<class 'int'>
237
<class 'int'>
241
<class 'int'>
247
<class 'int'>
251
<class 'int'>
256
<class 'int'>
261
<class 'int'>
278
<class 'int'>
288
<class 'int'>
292
<class 'int'>
296
<class 'int'>
299
<class 'int'>
320
<class 'int'>
324
<class 'int'>
328
<class 'int'>
333
<class 'int'>
340
<class 'int'>
355
<class 'int'>
359
<class 'int'>
363
<class 'int'>
374
<class 'int'>
385
<class 'int'>
389
<class 'int'>
393
<class 'int'>
398
<class 'int'>
400
<class 'int'>
403
<class 'int'>
409
<class 'int'>
411
<class 'int'>
414
<class 'int'>
416
<class 'int'>
419
<class 'int'>
422
<class 'int'>
429
<class 'int'>
434
<class 'int'>
460
<class 'int'>


ValueError: invalid literal for int() with base 10: ''

In [21]:
lower_bound = 0

for element in data["strike"]:
    try:
        lower_bound += int(element['deaths_min'])
    except ValueError:  # int('') fails for empty strings
        print("error here:", element)
       

print("This is my result:", lower_bound)
   

error here: {'_id': '55c79e721cbee48856a309a0', 'number': 283, 'country': 'Yemen', 'date': '2011-06-18T00:00:00.000Z', 'narrative': 'At least 6 Yemeni civilians wounded by a US drone.', 'town': 'Jaar', 'location': 'Abyan Province', 'deaths': '0', 'deaths_min': '', 'deaths_max': '', 'civilians': '', 'injuries': '6', 'children': '', 'tweet_id': '299605330252398593', 'bureau_id': 'YEM015', 'bij_summary_short': "Six civilians were wounded in an apparent drone strike targeting 'senior jihadists'. However, no AQAP militants were reported to be hit in the attack.", 'bij_link': 'http://www.thebureauinvestigates.com/2012/03/29/yemen-reported-us-covert-actions-since-2001/', 'target': '', 'lat': '13.218055', 'lon': '45.307832', 'articles': [], 'names': ['']}
error here: {'_id': '55c79e721cbee48856a309d4', 'number': 335, 'country': 'Yemen', 'date': '2011-11-08T00:00:00.000Z', 'narrative': "Five US drone strikes' killed an unknown number of people in Rumeila.", 'town': 'Rumeila', 'location': 'Abyan

In [22]:
# These are missing values; I can simply treat them as zero.

lower_bound = 0

for element in data["strike"]:
    try:
        lower_bound += int(element['deaths_min'])
    except ValueError:
        lower_bound += 0

print(lower_bound)
   

4012
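
The same total can be written more compactly by moving the conversion into a small helper and summing over a generator. This is a sketch on three made-up records in the shape of the API's `strike` entries; `to_int` is a hypothetical helper name.

```python
def to_int(value, default=0):
    """Convert a string such as '6' to int; fall back to default for '' or other non-numbers."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default

# Three made-up records in the same shape as the "strike" entries.
strikes = [{"deaths_min": "6"}, {"deaths_min": ""}, {"deaths_min": "8"}]

lower_bound = sum(to_int(s["deaths_min"]) for s in strikes)
missing = sum(1 for s in strikes if s["deaths_min"] == "")
print(lower_bound, missing)  # 14 1
```

Counting the missing entries alongside the sum makes it clear how much of the lower bound rests on treating blanks as zero.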


# 3. Scraping Static Webpages with BeautifulSoup

- If there is no API, we need to extract content by "brute force".
- Scraping consists of two main components:

1) Reproducing the content of a webpage: the easy part  
2) Extracting the part that we are interested in: the hard part

Let's try with a simple practice example:
http://pythonscraping.com/pages/page1.html


In [23]:
import requests
url = "http://pythonscraping.com/pages/page1.html"
html = requests.get(url) 
print(html)
# 200 is a status code which means the page downloaded successfully. Codes starting with 4 or 5 usually mean there was an error.

<Response [200]>
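
The first digit of a status code tells you its class. The sketch below labels a few codes with the standard library's `http.HTTPStatus`; `describe_status` is a made-up helper, and it assumes the code is one that `HTTPStatus` knows (unknown codes raise a `ValueError`). With `requests`, the number itself is available as `response.status_code`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '404 Not Found (client error)'."""
    kinds = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}
    kind = kinds.get(code // 100, "other")
    return f"{code} {HTTPStatus(code).phrase} ({kind})"

for code in (200, 404, 503):
    print(describe_status(code))
```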


In [24]:
html.text

'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'

In [None]:
# This doesn't look very pretty.
# If you display a whole webpage like this and then have to find your info in it, it gets annoying.

## Extracting Information: BeautifulSoup



In [25]:
import requests
from bs4 import BeautifulSoup as bs
html = requests.get("http://pythonscraping.com/pages/page1.html") 
soup = bs(html.text, "lxml")
# lxml is the parser that bs uses here.
soup

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

In [26]:
# Even nicer with prettify: the indentation helps, especially for a long page.
print(soup.prettify())

<html>
 <head>
  <title>
   A Useful Page
  </title>
 </head>
 <body>
  <h1>
   An Interesting Title
  </h1>
  <div>
   Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
  </div>
 </body>
</html>



### The Components of a Webpage

When you visit a web page, your web browser makes a request to a web server. This request is called a GET request, since we’re getting files from the server. The server then sends back files that tell our browser how to render the page for you. The files fall into a few main types:

* HTML -- contains the main content of the page
* CSS -- adds styling to make the page look nicer
* JS -- JavaScript files add interactivity to web pages
* Images -- image formats, such as JPG and PNG, allow web pages to show pictures


### A Primer in HTML

After your browser receives all the files, it renders the page and displays it to you. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look at the HTML (HyperText Markup Language). Note that HTML is not a programming language (like Python) but a markup language that tells a browser how to lay out content. HTML allows you to do things similar to what you do in a word processor like Microsoft Word: make text bold, create paragraphs, and so on. Because HTML isn’t a programming language, it isn’t nearly as complex as Python.

HTML consists of elements called tags. All HTML pages (at least all well-formatted ones) are surrounded by opening and closing `<html> </html>` tags, with `<head>` and `<body>` tags in between. Other tags populate these page headers and page bodies to form the structure and content of the page.

In the example above, the page title (the text shown in your browser tab) is "A Useful Page", and the first header, "An Interesting Title", is contained in the `<h1>` tag. Immediately below that is a `<div>` ("division") tag, a generic container that here holds what might be a main article or longer piece of text.

You may notice that the `head` and `body` tags are inside the `html` tag. In HTML, tags are nested, and can go inside other tags. Tags have commonly used names that depend on their position in relation to other tags:

* `child` -- a child is a tag inside another tag. So the `div` tag above is a child of the `body` tag.
* `parent` -- a parent is the tag that another tag is inside. Above, the `html` tag is the parent of the `body` tag.
* `sibling` -- a sibling is a tag that is nested inside the same parent as another tag. For example, `head` and `body` are siblings, since they’re both inside `html`. `h1` and `div` are siblings, since they’re both inside `body`.
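
These family relations can be followed directly in BeautifulSoup. A minimal sketch on a condensed copy of the example page, using the built-in `html.parser` so that lxml is not required:

```python
from bs4 import BeautifulSoup

page = ("<html><head><title>A Useful Page</title></head>"
        "<body><h1>An Interesting Title</h1><div>Lorem ipsum</div></body></html>")
soup = BeautifulSoup(page, "html.parser")

h1 = soup.h1
print(h1.parent.name)                                # body
print(h1.find_next_sibling().name)                   # div
print([child.name for child in soup.html.children])  # ['head', 'body']
```

On a real page the children of a tag usually include whitespace strings between the tags, so expect `None` entries among the names unless you filter for tags.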

The tags above are extremely common html tags. Here are a few others:

* p -- creates a paragraph
* a -- renders a link to another webpage; it often comes with the `href` property that determines where the link goes
* table -- creates a table
* form -- creates an input form

You can find a full list of HTML tags [here](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).

Before we move into actual web scraping, let’s learn about the `class` and `id` properties. These special properties give HTML elements names, and make them easier to interact with when we’re scraping. One element can have multiple classes, and a class can be shared between elements. Each element can only have one id, and an id can only be used once on a page. Classes and ids are optional, and not all elements will have them. They do not change how the tags are rendered. 
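
In BeautifulSoup, ids and classes are the usual handles for picking out elements. The sketch below uses a small made-up snippet (not a real page) to show `find(id=...)` for a unique element and `find_all(..., class_=...)` for all elements sharing a class:

```python
from bs4 import BeautifulSoup

# Made-up snippet: one id, and a class shared by two of the three spans.
snippet = """
<div id="text">
  <span class="green name">Anna Pavlovna</span>
  <span class="red">Well, Prince</span>
  <span class="green">St. Petersburg</span>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

box = soup.find(id="text")                     # ids are unique, so find() is enough
greens = box.find_all("span", class_="green")  # any span whose classes include "green"
print([g.text for g in greens])  # ['Anna Pavlovna', 'St. Petersburg']
```

Note that the first span carries two classes and is still matched, because a single class name is compared against each of an element's classes.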

In [27]:
# How do we extract information from the above? Use the tags
print(soup.prettify())

<html>
 <head>
  <title>
   A Useful Page
  </title>
 </head>
 <body>
  <h1>
   An Interesting Title
  </h1>
  <div>
   Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
  </div>
 </body>
</html>



In [28]:
soup.title

<title>A Useful Page</title>

In [29]:
soup.div

<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>

In [30]:
# If you just want the text: Output is a string. 
soup.div.text


'\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n'

In [31]:
# You can also access the tags by their family structure
# skip this

In [32]:
# Using the tags for scraping is very useful


## Exercise: open the following webpage and display with bs
https://www.pythonscraping.com/pages/warandpeace.html


In [33]:
import requests
from bs4 import BeautifulSoup as bs
url = "http://www.pythonscraping.com/pages/warandpeace.html"
html = requests.get(url) 
soup = bs(html.text, "lxml")
print(soup.prettify())

<html>
 <head>
  <style>
   .green{
	color:#55ff55;
}
.red{
	color:#ff5555;
}
#text{
	width:50%;
}
  </style>
 </head>
 <body>
  <h1>
   War and Peace
  </h1>
  <h2>
   Chapter 1
  </h2>
  <div id="text">
   "
   <span class="red">
    Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.
   </span>
   "
   <p>
   </p>
   It was in July, 1805, and the speaker was the well-known
   <span class="green">
    Anna
Pavlovna Scherer
   </span>
   , maid of honor and favorite of the
   <span class="green">
    Empress Marya
Fedorovna
   </span>
   . With these words she greeted
   <span 

This website has a few CSS rules that control the font colour and the width of the text. The following returns a list of all `h1` and `h2` header tags in the document:

In [34]:
soup.find_all({"h1", "h2"})

[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]

In [35]:
soup.find_all("span")

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>,
 <span class="green">Prince Vasili Kuragin</span>,
 <span class="green">Anna Pavlovna</span>,
 <span class="green">St. Petersburg</span>,
 <span class="red">If you have nothing better to do, Count [or Prince], and if the
 prospect of spending an evening with a poor invalid is not too
 terrible, I shall be very charmed to see you tonight between 7 and 10-
 Annette Scherer.</span>,
 <span clas

In [36]:
greens = soup.find_all("span", {"class":"green"})

In [37]:
greens[0]

<span class="green">Anna
Pavlovna Scherer</span>

### Example: Salaries of All Employees at Tennessee Public Schools

In [38]:
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.tbr.edu/hr/salaries"
html = requests.get(url) 
soup = bs(html.text, "lxml")


In [39]:
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]> <html class="ie6 ie" lang="en" dir="ltr"> <![endif]-->
<!--[if IE 7]>    <html class="ie7 ie" lang="en" dir="ltr"> <![endif]-->
<!--[if IE 8]>    <html class="ie8 ie" lang="en" dir="ltr"> <![endif]-->
<!--[if IE 9]>    <html class="ie9 ie" lang="en" dir="ltr"> <![endif]-->
<!--[if !IE]> -->
<html dir="ltr" lang="en">
 <!-- <![endif]-->
 <head>
  <meta charset="utf-8"/>
  <link href="https://www.tbr.edu/profiles/tbr_hosting/themes/tbr_bootstrap_new/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
  <meta content="Drupal 7 (http://drupal.org)" name="generator"/>
  <link href="https://www.tbr.edu/hr/salaries" rel="canonical"/>
  <link href="https://www.tbr.edu/hr/salaries" rel="shortlink"/>
  <meta content="Tennessee Board of Regents" property="og:site_name"/>
  <meta content="article" property="og:type"/>
  <meta content="https://www.tbr.edu/hr/salaries" property="og:url"/>
  <meta content="Salary Database" property="og:title"/>
  <!-- 

In [40]:
# Let's try to get the first institution on the page (Walters State)
inst = soup.find("td", {"class":"views-field views-field-institution-1"})
inst

<td class="views-field views-field-institution-1">
            Walters State Comm College          </td>

In [41]:
# Let's try to get all the institutions. find_all returns a list, so we can access single elements
inst = soup.find_all("td", {"class":"views-field views-field-institution-1"})
inst
inst[0]

<td class="views-field views-field-institution-1">
            Walters State Comm College          </td>

In [42]:
# Let's use a list comprehension to get the same result as text, which makes the data easier to work with
inst_all = [i.text for i in soup.find_all("td", {"class":"views-field views-field-institution-1"})]
inst_all

['\n            Walters State Comm College          ',
 '\n            Southwest TN Comm College          ',
 '\n            Chattanooga State Comm College          ',
 '\n            Tennessee Board of Regents          ',
 '\n            Nashville State Comm College          ',
 '\n            Tennessee Board of Regents          ',
 '\n            Chattanooga State Comm College          ',
 '\n            Nashville State Comm College          ',
 '\n            Motlow State Comm College          ',
 '\n            Southwest TN Comm College          ',
 '\n            Jackson State Comm College          ',
 '\n            TCAT Pulaski          ',
 '\n            Volunteer State Comm College          ',
 '\n            Dyersburg State Comm College          ',
 '\n            Pellissippi State Comm Coll          ',
 '\n            Dyersburg State Comm College          ',
 '\n            Tennessee Board of Regents          ',
 '\n            Northeast State Comm College          ',
 '\n  

In [43]:
inst_text = [i.strip() for i in inst_all]
inst_text

['Walters State Comm College',
 'Southwest TN Comm College',
 'Chattanooga State Comm College',
 'Tennessee Board of Regents',
 'Nashville State Comm College',
 'Tennessee Board of Regents',
 'Chattanooga State Comm College',
 'Nashville State Comm College',
 'Motlow State Comm College',
 'Southwest TN Comm College',
 'Jackson State Comm College',
 'TCAT Pulaski',
 'Volunteer State Comm College',
 'Dyersburg State Comm College',
 'Pellissippi State Comm Coll',
 'Dyersburg State Comm College',
 'Tennessee Board of Regents',
 'Northeast State Comm College',
 'Northeast State Comm College',
 'Southwest TN Comm College',
 'Southwest TN Comm College',
 'Walters State Comm College',
 'TCAT Memphis',
 'Chattanooga State Comm College',
 'Chattanooga State Comm College',
 'Motlow State Comm College',
 'Pellissippi State Comm Coll',
 'Nashville State Comm College',
 'TCAT Jackson',
 'TCAT Shelbyville',
 'Dyersburg State Comm College',
 'TCAT Ripley',
 'TCAT Nashville',
 'TCAT Pulaski',
 'Chattan

In [44]:
# Let's get some more info from that page
lastname    = [i.text.strip() for i in soup.find_all("td", {"class":"views-field views-field-lastname"})]
firstname   = [i.text.strip() for i in soup.find_all("td", {"class":"views-field views-field-firstname"})]
jobtitle    = [i.text.strip() for i in soup.find_all("td", {"class":"views-field views-field-jobtitle"})]
department  = [i.text.strip() for i in soup.find_all("td", {"class":"views-field views-field-department"})]
salary      = [i.text.strip() for i in soup.find_all("td", {"class":"views-field views-field-php"})]
fte         = [i.text.strip() for i in soup.find_all("td", {"class":"views-field views-field-fte"})]

In [45]:
import pandas as pd
tennessee_salaries_df = pd.DataFrame({
    "institution": inst_text, 
    "last_name": lastname, 
    "first_name": firstname,
    "job_title": jobtitle,
    "department": department,
    "salary": salary,
    "fte": fte
    })
tennessee_salaries_df

| | institution | last_name | first_name | job_title | department | salary | fte |
|---|---|---|---|---|---|---|---|
| 0 | Walters State Comm College | Aarons | Andrew | Associate Professor | Industrial Technology | $61,983 | 1.0 |
| 1 | Southwest TN Comm College | Abadie | Cynthia | Associate Professor | Business and Legal Studies | $60,636 | 1.0 |
| 2 | Chattanooga State Comm College | Abbott | Joyce | Admissions Records Clerk Pt | Customer Response Center | $22,558 | 0.8 |
| 3 | Tennessee Board of Regents | Abdulle | Harun | System Adm Specialist | TNeCampus | $60,558 | 1.0 |
| 4 | Nashville State Comm College | Abel | Edward | Stock Clerk 3 | Property Management | $32,853 | 1.0 |
| 5 | Tennessee Board of Regents | Able | Mary | Payroll Associate | Shared Services Initiative | $46,985 | 1.0 |
| 6 | Chattanooga State Comm College | Abraham | Roni | Academic Specialist | Academic Completion Specialists | $43,632 | 1.0 |
| 7 | Nashville State Comm College | Abu-Orf | Rebecca | Department Head | Payroll | $73,936 | 1.0 |
| 8 | Motlow State Comm College | Abunaemeh | Malek | Instructor | Mechatronics | $55,000 | 1.0 |
| 9 | Southwest TN Comm College | Acoff | Janura | Coordinator | TECTA Grant | $39,168 | 1.0 |
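
The scraped salaries are strings such as `$61,983`, so they need converting before any arithmetic. A minimal sketch with the standard library, assuming whole-dollar amounts as in the table above (a value with cents would need different handling); `salary_to_int` is a made-up helper name.

```python
import re

def salary_to_int(text):
    """'$61,983' -> 61983. Returns None if the cell holds no digits."""
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None

salaries = ["$61,983", "$60,636", "$22,558"]  # first three values from the table above
numbers = [salary_to_int(s) for s in salaries]
print(numbers)                      # [61983, 60636, 22558]
print(sum(numbers) / len(numbers))  # mean of the three salaries
```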


In [46]:
# But what if I want all the pages?
# The last page is https://www.tbr.edu/hr/salaries?page=121
# Page numbering starts at 0, so there are 122 pages: compare the last and first page

In [47]:
import requests
from bs4 import BeautifulSoup as bs 

In [48]:
# Create an empty DataFrame to store the information in.

tennessee_salaries_df = pd.DataFrame({
    "institution": [], 
    "last_name": [], 
    "first_name": [],
    "job_title": [],
    "department": [],
    "salary": [],
    "fte": []
    })

In [49]:
pages = list(range(1,4)) # there are 122 pages. for now only do a few so to not break it down. 

for page in pages:
    
    # Build the URL for this page. The first page (page 0) is handled later.
    url  = "https://www.tbr.edu/hr/salaries?firstname=&lastname=&department=&jobtitle=&institution=&page=" + str(page)
    html = requests.get(url) 
    soup = bs(html.text, "lxml")
    
    # retrieve the elements on each page as before
    institutions = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-institution-1"})]
    lastname     = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-lastname"})]
    firstname    = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-firstname"})]
    jobtitle     = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-jobtitle"})]
    department   = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-department"})]
    salary       = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-php"})]
    fte          = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-fte"})]
    
    # Collect them in a DataFrame for this page
    salaries = pd.DataFrame({
        "institution": institutions, 
        "last_name": lastname, 
        "first_name": firstname,
        "job_title": jobtitle,
        "department": department,
        "salary": salary,
        "fte": fte
        })
    
    # Add this page to the rows collected so far
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    tennessee_salaries_df = pd.concat([tennessee_salaries_df, salaries], ignore_index=True)
    
tennessee_salaries_df[0:20]

Unnamed: 0,institution,last_name,first_name,job_title,department,salary,fte
0,TCAT Murfreesboro,Akers,Jessica,Senior Instructor (Ttcu),Business Systems Technology,"$53,581",1.0
1,Roane State Comm College,Akers,Mariella,Technical Clerk,Oak Ridge Instruction,"$31,968",1.0
2,TCAT McMinnville,Akers,Debra,Master Instructor,Computer Oper Technology,"$68,337",1.0
3,Nashville State Comm College,Akers,Kathleen,Director,Clarksville Administration,"$75,000",1.0
4,Southwest TN Comm College,Akin,Tiffany,Instructor,Languages and Literature,"$38,292",1.0
5,TCAT Newbern,Akins,John,Assoc Instructor Ttc Newbern,Injection Welding,"$54,000",1.0
6,TCAT Jackson,Akins,Lori,Master Instructor,Practical Nursing Lexington,"$60,554",1.0
7,Nashville State Comm College,Akther,Jesmin,Instructor,Chemistry,"$42,918",1.0
8,TCAT Murfreesboro,Albers,Jenny,Acad & Student Sup Assoc 4,Other General,"$31,591",1.0
9,Northeast State Comm College,Albright,Kathy,Secretary 2 - Testing Services,Counseling and Testing Services,"$23,250",1.0
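
Before extending the loop to all 122 pages, it is worth pausing between requests so the site is not flooded. The following is a minimal sketch of that idea, not a measured rate limit: the function name `fetch_pages`, the `fetch` argument, and the one-second default delay are all assumptions chosen for illustration.

```python
import time

def fetch_pages(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results

# Stand-in fetcher so the sketch runs offline. In the notebook one might pass
# something like `lambda url: requests.get(url).text` instead.
def fake_fetch(url):
    return "<html>" + url + "</html>"

pages_html = fetch_pages(["page-1", "page-2", "page-3"], fake_fetch, delay=0.1)
print(len(pages_html))
```

Passing the fetcher in as an argument keeps the pacing logic separate from the network call, which also makes it easy to try out without touching the site.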


In [50]:
# Now, how to deal with the first page? 

# Create an empty DataFrame to collect the results from every page.

tennessee_salaries_df = pd.DataFrame({
    "institution": [], 
    "last_name": [], 
    "first_name": [],
    "job_title": [],
    "department": [],
    "salary": [],
    "fte": []
    })

# The first page uses the plain URL, so it needs special handling; one option is a counter
pages = list(range(0,4))
counter = 0 

for page in pages:
    
    if counter == 0:
        # First page: the plain URL
        url  = "https://www.tbr.edu/hr/salaries" 
    else:
        # Later pages: the page number goes in the query string
        url  = "https://www.tbr.edu/hr/salaries?firstname=&lastname=&department=&jobtitle=&institution=&page=" + str(page)
    
    html = requests.get(url) 
    soup = bs(html.text, "lxml")

    # retrieve the elements on each page as before
    institutions = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-institution-1"})]
    lastname     = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-lastname"})]
    firstname    = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-firstname"})]
    jobtitle     = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-jobtitle"})]
    department   = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-department"})]
    salary       = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-php"})]
    fte          = [i.text.strip() for i in soup.findAll("td", {"class":"views-field views-field-fte"})]

    # Collect them in a DataFrame for this page
    salaries = pd.DataFrame({
        "institution": institutions, 
        "last_name": lastname, 
        "first_name": firstname,
        "job_title": jobtitle,
        "department": department,
        "salary": salary,
        "fte": fte
        })

    # Add this page to the rows collected so far
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    tennessee_salaries_df = pd.concat([tennessee_salaries_df, salaries], ignore_index=True)

    counter += 1
    
tennessee_salaries_df[0:20]  


Unnamed: 0,institution,last_name,first_name,job_title,department,salary,fte
0,Walters State Comm College,Aarons,Andrew,Associate Professor,Industrial Technology,"$61,983",1.0
1,Southwest TN Comm College,Abadie,Cynthia,Associate Professor,Business and Legal Studies,"$60,636",1.0
2,Chattanooga State Comm College,Abbott,Joyce,Admissions Records Clerk Pt,Customer Response Center,"$22,558",0.8
3,Tennessee Board of Regents,Abdulle,Harun,System Adm Specialist,TNeCampus,"$60,558",1.0
4,Nashville State Comm College,Abel,Edward,Stock Clerk 3,Property Management,"$32,853",1.0
5,Tennessee Board of Regents,Able,Mary,Payroll Associate,Shared Services Initiative,"$46,985",1.0
6,Chattanooga State Comm College,Abraham,Roni,Academic Specialist,Academic Completion Specialists,"$43,632",1.0
7,Nashville State Comm College,Abu-Orf,Rebecca,Department Head,Payroll,"$73,936",1.0
8,Motlow State Comm College,Abunaemeh,Malek,Instructor,Mechatronics,"$55,000",1.0
9,Southwest TN Comm College,Acoff,Janura,Coordinator,TECTA Grant,"$39,168",1.0
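
Note that the scraped `salary` column holds strings such as `"$61,983"`, so it cannot be summed or averaged as is. A minimal sketch of one way to convert these values follows, assuming every salary is a whole-dollar amount with a leading `$` and comma separators, as in the rows shown above. The helper name `parse_salary` is hypothetical.

```python
def parse_salary(text):
    """Convert a string such as "$61,983" to an integer number of dollars."""
    # Assumption: whole dollars only; a value with cents would raise ValueError.
    return int(text.replace("$", "").replace(",", "").strip())

# A few values copied from the table above.
sample = ["$61,983", "$60,636", "$22,558"]
print([parse_salary(s) for s in sample])
```

On the DataFrame, the same function could be applied with `tennessee_salaries_df["salary"].map(parse_salary)`.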
