# 11.3 Dynamic Scraping


We have now seen how to scrape static pages using the `requests` and `BeautifulSoup` packages. However, some pages cannot be scraped with these tools because they are dynamic. In this tutorial we are going to explore the most general webscraping tool: browser simulation. 

## 11.3.1 Javascript

The reason why some pages cannot be scraped with the `requests` library is that part of their content is generated by JavaScript, which runs in the browser after the page has been downloaded. Let's have a look at one example: http://www.webscrapingfordatascience.com/simplejavascript/. This page returns a list of three random quotes. The page is dynamic since the quotes shown are random and change whenever the page is loaded.

Let's see what happens if we try to scrape the page with the `requests` library.

In [11]:
import requests

# Scrape javascript page
url = 'http://www.webscrapingfordatascience.com/simplejavascript/'
response = requests.get(url)
print(response.text)

<html>

<head>
	<script src="https://code.jquery.com/jquery-3.2.1.min.js"></script>
	<script>
	$(function() {
	document.cookie = "jsenabled=1";
	$.getJSON("quotes.php", function(data) {
		var items = [];
		$.each(data, function(key, val) {
			items.push("<li id='" + key + "'>" + val + "</li>");
		});
		$("<ul/>", {
			html: items.join("")
			}).appendTo("body");
		});
	});
	</script>
</head>

<body>

<h1>Here are some quotes</h1>

</body>

</html>



Weird. Our response does not contain the quotes on the page, even though they are clearly visible when we open it in our browser.
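One way to confirm this, as a minimal sketch, is to count the `<li>` tags in the HTML that `requests` returned. The snippet below uses the standard library's `html.parser` on a shortened copy of the response, so it runs without a network connection. The `static_html` string and the `TagCounter` class are illustrative names, not part of the original tutorial.

```python
from html.parser import HTMLParser

# Shortened copy of the HTML returned by requests (script body abridged).
static_html = """
<html>
<head>
<script src="https://code.jquery.com/jquery-3.2.1.min.js"></script>
<script>/* fetches quotes.php and appends the list items */</script>
</head>
<body>
<h1>Here are some quotes</h1>
</body>
</html>
"""


class TagCounter(HTMLParser):
    """Count how many times each start tag occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(static_html)
print(counter.counts.get("h1", 0))  # the heading is in the static HTML
print(counter.counts.get("li", 0))  # the quotes are not
```

The heading and the `<script>` tags are there, but no `<li>` is: the list items only come into existence when a browser runs the script.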

## 11.3.2 Selenium

Selenium is a browser automation tool with a Python library that drives a real browser, so we see pages exactly as we would with a normal browser. This is the most user-friendly way to do web scraping, but it has a huge cost: speed. It is by far the slowest way to do web scraping.

After installing `selenium`, we need a driver that lets it control a browser. We will use ChromeDriver, the driver for Google Chrome (Chrome itself must also be installed). You can download it from here: https://sites.google.com/a/chromium.org/chromedriver/. Make sure to select "**latest stable release**" and not "latest beta release".

Move the downloaded `chromedriver` into the current directory ("*/11-python-webscraping*" for me). We will now try to open the url above with selenium and see if we can scrape the quotes in it.

In [12]:
# Set your chromedriver name
chromedriver_name = '/chromedriver_mac'

In [13]:
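import os
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Build the path to chromedriver and open a Chrome window
path = os.getcwd()
print(path)
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(path + chromedriver_name))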

/Users/macbook/Dropbox/UZH/TA_pp4rs/python-webscraping


Awesome! Now, if everything went smoothly, you should have a new Chrome window with a banner that says "*Chrome is being controlled by automated test software*". We can now open the web page and check that the list appears.

In [5]:
# Open url
url = 'http://www.webscrapingfordatascience.com/simplejavascript/'
driver.get(url)

Again, if everything went well, we are now able to see our page with all the quotes in it. How do we scrape them?

If we inspect the elements of the list with the right-click `inspect` option, we should see something like:
```
<html><head>
	<script src="https://code.jquery.com/jquery-3.2.1.min.js"></script>
	<script>
	$(function() {
	document.cookie = "jsenabled=1";
	$.getJSON("quotes.php", function(data) {
		var items = [];
		$.each(data, function(key, val) {
			items.push("<li id='" + key + "'>" + val + "</li>");
		});
		$("<ul/>", {
			html: items.join("")
			}).appendTo("body");
		});
	});
	</script>
</head>

<body>

<h1>Here are some quotes</h1>




<ul><li id="0">Every strike brings me closer to the next home run. –Babe Ruth</li><li id="1">The two most important days in your life are the day you are born and the day you find out why. –Mark Twain</li><li id="2">Whatever you can do, or dream you can, begin it.  Boldness has genius, power and magic in it. –Johann Wolfgang von Goethe</li></ul></body></html>
```

Now we can see the content! Can we actually retrieve it? Let's try.

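The most common `selenium` functions for getting elements of a page have a very intuitive syntax. The Selenium 3 names are listed below; Selenium 4 removed them (from version 4.3) in favor of a single call, `find_element(By.<STRATEGY>, value)`, for example `find_element(By.ID, value)`.
- find_element_by_id
- find_element_by_name
- find_element_by_xpath
- find_element_by_link_text
- find_element_by_partial_link_text
- find_element_by_tag_name
- find_element_by_class_name
- find_element_by_css_selector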
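For readers on Selenium 4, where lookups are written `find_element(By.<STRATEGY>, value)`, the following sketch translates a legacy method name into the newer spelling. It only builds the source text of the call and does not import Selenium; `LEGACY_TO_BY` and `modern_call` are hypothetical helper names introduced here for illustration.

```python
# Legacy Selenium 3 helper -> name of the By constant used in Selenium 4.
LEGACY_TO_BY = {
    "find_element_by_id": "ID",
    "find_element_by_name": "NAME",
    "find_element_by_xpath": "XPATH",
    "find_element_by_link_text": "LINK_TEXT",
    "find_element_by_partial_link_text": "PARTIAL_LINK_TEXT",
    "find_element_by_tag_name": "TAG_NAME",
    "find_element_by_class_name": "CLASS_NAME",
    "find_element_by_css_selector": "CSS_SELECTOR",
}


def modern_call(legacy_name, value):
    """Return the Selenium 4 spelling of a legacy lookup as source text."""
    # The plural variants (find_elements_by_...) map to find_elements(...).
    plural = legacy_name.startswith("find_elements_")
    key = legacy_name.replace("find_elements_", "find_element_", 1)
    method = "find_elements" if plural else "find_element"
    return f"driver.{method}(By.{LEGACY_TO_BY[key]}, {value!r})"


print(modern_call("find_elements_by_tag_name", "li"))
print(modern_call("find_element_by_id", "PDim_POL"))
```

The printed lines are the calls one would write after `from selenium.webdriver.common.by import By`.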

We will now try to recover all elements with tag `<li>` (elements of the list `<ul>`).

In [6]:
from selenium.webdriver.common.by import By

# Scrape content: collect the text of every <li> element
quotes = [li.text for li in driver.find_elements(By.TAG_NAME, 'li')]
quotes

['Whatever the mind of man can conceive and believe, it can achieve. –Napoleon Hill',
 'Every strike brings me closer to the next home run. –Babe Ruth',
 'I attribute my success to this: I never gave or took any excuse. –Florence Nightingale']

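Yes! It worked! But why? Unlike `requests`, the browser runs the page's JavaScript, so the `<li>` elements exist by the time we look for them.

Let's now repeat the exercise with the `--headless` option, which runs Chrome without opening a window.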

In [8]:
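from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Headless option: run Chrome without opening a window
headless_option = webdriver.ChromeOptions()
headless_option.add_argument('--headless')

# Scraping
driver = webdriver.Chrome(service=Service(path + chromedriver_name), options=headless_option)
driver.get(url)
quotes = [li.text for li in driver.find_elements(By.TAG_NAME, 'li')]
quotes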

[]

Mmm, it (probably) didn't work. Why?

The problem is that we are trying to retrieve the content of the page too fast: the script has not added the quotes to the page yet. This is a common issue with `selenium`. There are two ways to solve it:
- waiting a fixed amount of time
- waiting for the element to load

The second way is the best one, but we will first try the first and simpler one: we will just ask the browser to wait for 1 second before searching for `<li>` tags.

In [10]:
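import time
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Scraping
driver = webdriver.Chrome(service=Service(path + chromedriver_name), options=headless_option)
driver.get(url)
time.sleep(1)  # give the script time to add the quotes
quotes = [li.text for li in driver.find_elements(By.TAG_NAME, 'li')]
quotes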

['Your time is limited, so don’t waste it living someone else’s life. –Steve Jobs',
 'The most difficult thing is the decision to act, the rest is merely tenacity. –Amelia Earhart',
 'Either you run the day, or the day runs you. –Jim Rohn']

Nice! Now you should have obtained the list that we could not scrape with `requests`. If it didn't work, just increase the waiting time and it should work.

We can now have a look at the "better" way, which uses a series of built-in tools:
- `WebDriverWait`: the waiting class. We will call its `until` method
- `expected_conditions`: the module of ready-made conditions. We will use `visibility_of_all_elements_located`
- `By`: the locator strategies. Some of the options are:
    - By.ID
    - By.XPATH
    - By.NAME
    - By.TAG_NAME
    - By.CLASS_NAME
    - By.CSS_SELECTOR
    - By.LINK_TEXT
    - By.PARTIAL_LINK_TEXT

In [18]:
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Scraping
driver = webdriver.Chrome(service=Service(path + chromedriver_name), options=headless_option)
driver.get(url)
quotes = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.TAG_NAME, 'li')))
quotes = [quote.text for quote in quotes]
quotes

["I've missed more than 9000 shots in my career. I've lost almost 300 games. 26 times I've been trusted to take the game winning shot and missed. I've failed over and over and over again in my life. And that is why I succeed. –Michael Jordan",
 'Life is what happens to you while you’re busy making other plans. –John Lennon',
 'Life is about making an impact, not making an income.']

In this case, we have told the browser to wait until all elements with tag `<li>` are visible, for at most 10 seconds. As soon as the condition is met, `until` returns whatever the condition returned, here the list of located elements; if the 10 seconds pass first, it raises a `TimeoutException` instead. There are many different conditions we can use. A list can be found here: https://selenium-python.readthedocs.io/waits.html.
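To make the mechanics concrete, here is a browser-free sketch of what an explicit wait does: call a condition repeatedly until it returns something truthy, and give up after a timeout. This is not Selenium's implementation; `wait_until` and `visible_items` are hypothetical names, and the built-in `TimeoutError` stands in for Selenium's own `TimeoutException`.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in for a page whose list items appear 0.2 s after "loading".
loaded_at = time.monotonic() + 0.2


def visible_items():
    """Return the items once they are 'loaded', else an empty list."""
    if time.monotonic() >= loaded_at:
        return ["quote 1", "quote 2", "quote 3"]
    return []


items = wait_until(visible_items, timeout=2.0, poll=0.05)
print(items)
```

Compared with `time.sleep(1)`, this returns as soon as the items are available and fails loudly when they never show up.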

We can easily generalize the function above as follows.

In [19]:
# Find elements function
def find_elements(driver, function, identifier):
    """Wait up to 10 seconds until all matching elements are visible, then return them."""
    elements = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((function, identifier)))
    return elements

quotes = [quote.text for quote in find_elements(driver, By.TAG_NAME, 'li')]
quotes

["I've missed more than 9000 shots in my career. I've lost almost 300 games. 26 times I've been trusted to take the game winning shot and missed. I've failed over and over and over again in my life. And that is why I succeed. –Michael Jordan",
 'Life is what happens to you while you’re busy making other plans. –John Lennon',
 'Life is about making an impact, not making an income.']

## 11.3.3 Example: OECD data on pollution

So far we have seen a lot of theory and some simple examples. But when does one actually need to use `selenium` for web scraping? In this section, we will explore a simple example: OECD data on air pollution. 

Let's open the link: https://stats.oecd.org/.

Suppose we want to scrape all greenhouse gas emissions. We have to select from the left menu:
- Environment
- Air and Climate
- Greenhouse gas emission by source

A table with the total greenhouse gas emissions should have appeared. Let's now select the first pollutant from the menu on top of the table: *Carbon dioxide*. We should have a nice table with all carbon dioxide emissions from OECD countries from 1990 until today.

What is the problem? The url is still the same: https://stats.oecd.org/. If we tried to scrape the url with the `requests` package, we would get the initial page but there would be no way to recover this specific table. We have to do it with `selenium`.

I am now going to repeat the same exact steps listed above with `selenium`.

In [23]:
from selenium.webdriver.chrome.service import Service

# Open browser
driver = webdriver.Chrome(service=Service(path + chromedriver_name))

# Open url
url = 'https://stats.oecd.org/'
driver.get(url)

We have identified the topic list but there is no easy identifier to get to the environment category. One easy shortcut is to use Chrome's built-in *inspect* functionality. We can then use the top-left arrow to browse elements. Once we have found the element of interest (the *Environment* option in our case), we can right-click on the element inside the *inspect* window and select *Copy $\to$ Copy XPath*.

In [24]:
from selenium.webdriver.common.by import By

# Click on the Environment option
environment = driver.find_element(By.XPATH, '//*[@id="browsethemes"]/ul/li[7]')
environment.click()

Note that the structure of the XPath is relatively simple. It is telling us exactly how to proceed in the page tree:
1. select the element with id="browsethemes"
2. select the (1st) `<ul>` tag
3. select the 7th `<li>` tag
    
We could have done the same by hand but this is a simple shortcut.
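The same step-by-step reading of an XPath can be tried offline. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports a subset of XPath, on a toy, well-formed stand-in for the menu. The `menu` string and its entries other than *Environment* and *Air and Climate* are made up for illustration, and *Environment* sits in position 2 here rather than 7. ElementTree wants a relative path, so the browser's `//*[...]` becomes `.//*[...]`.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for the theme menu; the real page has many more entries.
menu = """
<div>
  <div id="browsethemes">
    <ul>
      <li>Agriculture</li>
      <li>Environment
        <ul>
          <li>Air and Climate</li>
          <li>Water</li>
        </ul>
      </li>
    </ul>
  </div>
</div>
"""
root = ET.fromstring(menu)

# Element with id="browsethemes" -> its <ul> -> the 2nd <li>
second = root.find(".//*[@id='browsethemes']/ul/li[2]")
print(second.text.strip())

# Going one level deeper: the 1st <li> of the nested <ul>
nested = root.find(".//*[@id='browsethemes']/ul/li[2]/ul/li[1]")
print(nested.text.strip())
```

Each `/` moves one level down the tree and each `[n]` picks the n-th sibling with that tag, counting from 1.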
    
We can now proceed and select the other options until the table appears.

In [25]:
from selenium.webdriver.common.by import By

# Make table appear
air_and_climate = driver.find_element(By.XPATH, '//*[@id="browsethemes"]/ul/li[7]/ul/li[1]')
air_and_climate.click()
greenhouse_gas = driver.find_element(By.XPATH, '//*[@id="browsethemes"]/ul/li[7]/ul/li[1]/ul/li[1]/a[2]')
greenhouse_gas.click()

We now need to select the *Carbon dioxide* option from the *Pollutant* drop-down menu on top of the table.

In [35]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select

# Select the Pollutant drop-down menu
pollutants = driver.find_element(By.ID, 'PDim_POL')
Select(pollutants).select_by_index(1)

We should now have obtained the table we wanted from the start. Note that we have used the `select_by_index` method for the drop-down menu. There are, however, other options:
- select_by_index
- select_by_value
- select_by_visible_text

In our case, if we inspect the elements, we see that they are all nicely tagged by value: "*0~GHG*", "*1~CO2*", "*2~CH4*", etc. While selecting by value would have been much more accurate if we wanted a specific pollutant, selecting by index is better if we want them all, as we can easily loop over the index.
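As a small sketch of how the two approaches relate, the option values quoted above can be split into an index and a pollutant code. The list below contains only the three values mentioned in the text; `option_values` and `lookup` are illustrative names, and the assumption that the numeric prefix matches the option's position is inferred from those three examples only.

```python
# Option values quoted above; the real menu has more entries.
option_values = ["0~GHG", "1~CO2", "2~CH4"]

# Map each pollutant code to (index, full option value).
lookup = {}
for value in option_values:
    index, code = value.split("~", 1)
    lookup[code] = (int(index), value)

print(lookup["CO2"])

# With a Selenium Select object, here called `menu` (hypothetical name):
#   menu.select_by_value(lookup["CO2"][1])   # one specific pollutant
#   for i in range(len(menu.options)):       # or all of them in turn
#       menu.select_by_index(i)
```

Selecting by value keeps working if the order of the menu changes, while the index loop needs no knowledge of the codes.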

We can now retrieve the table.

In [37]:
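from io import StringIO
import pandas as pd

# Get table (the HTML is wrapped in StringIO because recent pandas versions
# deprecate passing a raw HTML string to read_html)
html_source = driver.page_source
df = pd.read_html(StringIO(html_source))[0]
df.head()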

[Output of `df.head()` omitted: the table has a multi-level column header (Pollutant, Variable, Unit, Year, Country) with one column per year from 1990 to 2017; the unit is "Tonnes of CO2 equivalent, Thousands".]
0,Australia,Australia,,278 424.38,279 872.36,284 912.12,289 234.86,294 014.61,305 409.80,312 361.21,...,405 048.54,408 344.94,406 425.94,404 263.70,406 986.99,398 051.59,393 288.53,402 537.57,413 157.39,417 041.28
1,Austria,Austria,,62 322.64,65 931.47,60 441.60,60 806.13,61 199.35,64 268.22,67 693.64,...,73 727.00,67 723.78,72 228.29,70 142.56,67 577.13,68 161.18,64 467.37,66 732.86,67 314.88,69 978.85
2,Belgium,Belgium,,120 481.84,123 543.22,122 655.91,121 555.27,124 938.75,126 080.41,129 579.15,...,120 352.13,107 699.34,113 814.07,104 319.18,101 514.57,101 578.97,95 906.48,99 753.30,98 421.54,97 563.57
3,Canada,Canada,,462 502.18,452 534.62,466 707.56,466 577.68,481 647.07,494 241.05,509 693.64,...,575 453.99,543 067.90,556 420.25,566 674.48,570 157.63,577 346.38,577 359.55,576 756.85,564 068.42,571 138.88
4,Chile,Chile,,33 490.13,32 360.89,33 890.99,36 359.52,38 724.83,41 558.19,47 608.09,...,71 969.71,68 853.76,70 137.10,78 389.53,82 186.19,81 007.68,78 032.24,84 565.54,87 889.34,..


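The dataframe is definitely not clean. The main problem is the columns: the header spans several levels, and only one of them (position 3 of each column label) contains the years.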

In [28]:
# Clean columns: keep only the header level that contains the years
cols = [col[3] for col in df.columns]
cols

['Year',
 'Year',
 'Year',
 '1990',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017']
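
To see why indexing each column label works, here is a minimal sketch on a toy table. The two-level header and the values are hypothetical and only mimic the structure of the scraped table. Each label of a multi-level header is a tuple, so `col[1]` picks one level; `get_level_values` is the equivalent built-in accessor.

```python
import pandas as pd

# Toy table with a two-level column header (hypothetical structure and values)
columns = pd.MultiIndex.from_tuples([
    ("Unit", "Country"),
    ("Tonnes", "1990"),
    ("Tonnes", "1991"),
])
toy = pd.DataFrame(
    [["Australia", 1.0, 2.0], ["Austria", 3.0, 4.0]],
    columns=columns,
)

# Each column label is a tuple, so indexing it selects one header level
labels_from_tuples = [col[1] for col in toy.columns]

# Equivalent accessor provided by pandas
labels_from_level = list(toy.columns.get_level_values(1))

print(labels_from_tuples)
```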

In [39]:
# Drop the first column (a duplicate of the country names) and the third (empty)
df_clean = df.drop(df.columns[[0,2]], axis=1)
col_clean = ['Country'] + cols[3:]
df_clean.columns = col_clean
df_clean.head()

Unnamed: 0,Country,1990,1991,1992,1993,1994,1995,1996,1997,1998,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Australia,278 424.38,279 872.36,284 912.12,289 234.86,294 014.61,305 409.80,312 361.21,320 794.70,334 684.11,...,405 048.54,408 344.94,406 425.94,404 263.70,406 986.99,398 051.59,393 288.53,402 537.57,413 157.39,417 041.28
1,Austria,62 322.64,65 931.47,60 441.60,60 806.13,61 199.35,64 268.22,67 693.64,67 463.63,67 066.65,...,73 727.00,67 723.78,72 228.29,70 142.56,67 577.13,68 161.18,64 467.37,66 732.86,67 314.88,69 978.85
2,Belgium,120 481.84,123 543.22,122 655.91,121 555.27,124 938.75,126 080.41,129 579.15,124 042.52,130 281.50,...,120 352.13,107 699.34,113 814.07,104 319.18,101 514.57,101 578.97,95 906.48,99 753.30,98 421.54,97 563.57
3,Canada,462 502.18,452 534.62,466 707.56,466 577.68,481 647.07,494 241.05,509 693.64,524 433.39,533 101.03,...,575 453.99,543 067.90,556 420.25,566 674.48,570 157.63,577 346.38,577 359.55,576 756.85,564 068.42,571 138.88
4,Chile,33 490.13,32 360.89,33 890.99,36 359.52,38 724.83,41 558.19,47 608.09,54 408.84,55 349.86,...,71 969.71,68 853.76,70 137.10,78 389.53,82 186.19,81 007.68,78 032.24,84 565.54,87 889.34,..


We have successfully scraped the table content. Now we just have to rearrange it into long format, with one row per country and year.

In [40]:
# Rearrange as long: one row per (Country, Year) pair
df_long = pd.wide_to_long(df_clean, "", i="Country", j="Year")
# Read the pollutant name from the second option of the drop-down menu
polluant_name = driver.find_element(By.XPATH, '//*[@id="PDim_POL"]/option[2]').text.strip()
df_long.rename(columns={'':polluant_name})

Country,Year,Carbon dioxide
Australia,1990,278 424.38
Austria,1990,62 322.64
Belgium,1990,120 481.84
Canada,1990,462 502.18
Chile,1990,33 490.13
...,...,...
Costa Rica,2017,..
India,2017,..
Indonesia,2017,..
Russia,2017,1 647 041.08
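
As a point of comparison, the same wide-to-long reshaping can be sketched with `DataFrame.melt`, which does not need a stub name. This is an alternative illustration, not the method used above; the toy table reuses a few values from the output, and the name `long_df` is hypothetical.

```python
import pandas as pd

# Toy wide table with one column per year (values copied from the output above)
wide = pd.DataFrame({
    "Country": ["Australia", "Austria"],
    "1990": ["278 424.38", "62 322.64"],
    "1991": ["279 872.36", "65 931.47"],
})

# One row per (Country, Year) pair; the value column name is chosen freely
long_df = (
    wide.melt(id_vars="Country", var_name="Year", value_name="Carbon dioxide")
        .astype({"Year": int})
        .set_index(["Country", "Year"])
)

print(long_df)
```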


Now we are ready to scrape data on all pollutants by looping over the options of the pollutant drop-down menu.

In [41]:
# Open browser
driver = webdriver.Chrome(path+'/chromedriver', options=headless_option)

# Generate empty dataframe
df_polluants = pd.DataFrame()

# Open url
driver.get(url)

# Open table (find_element(By.XPATH, ...) replaces the deprecated find_element_by_xpath)
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="browsethemes"]/ul/li[7]'))).click()
driver.find_element(By.XPATH, '//*[@id="browsethemes"]/ul/li[7]/ul/li[1]').click()
driver.find_element(By.XPATH, '//*[@id="browsethemes"]/ul/li[7]/ul/li[1]/ul/li[1]/a[2]').click()

# Collect list of pollutants from the drop-down menu
polluants = WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.ID, 'PDim_POL')))
names = [x.text.strip() for x in polluants.find_elements(By.TAG_NAME, 'option')]

# Loop over pollutants
for i in range(len(names)):
    
    # Load page
    print(names[i])
    polluants = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, 'PDim_POL')))
    Select(polluants).select_by_index(i)
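    # Caveat: this condition is met as soon as any <tr> exists, including rows of the
    # previous table, so it does not by itself guarantee that the new table has loaded
    WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.TAG_NAME, 'tr')))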
    
    # Get table
    html_source = driver.page_source
    df = pd.read_html(html_source)[0]
    
    # Clean columns
    cols = [col[3].strip() for col in df.columns]
    df_clean = df.drop(df.columns[[0,2]], axis=1)
    col_clean = ['Country'] + cols[3:]
    df_clean.columns = col_clean
    
    # Rearrange as long
    df_clean = df_clean.drop_duplicates(subset ="Country") 
    df_long = pd.wide_to_long(df_clean, "", i="Country", j="Year")
    df_long = df_long.rename(columns={'':names[i]})
    df_long = df_long[~df_long.index.duplicated()]
    
    # Merge
    df_polluants = pd.concat([df_polluants, df_long], axis=1, join='outer')

# Close driver
print("\nDone!")
driver.quit()
    
# Reset index so that Country and Year become regular columns
df_polluants = df_polluants.reset_index()

Greenhouse gases
Carbon dioxide
Methane
Nitrous oxide
Unspecified mix of HFCs and PFCs
Hydrofluorocarbons
Perfluorocarbons
Sulphur hexafluoride
Nitrogen trifluoride

Done!
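
The merge step in the loop relies on `pd.concat` with `axis=1` and `join='outer'`, which aligns the tables on their (Country, Year) index and fills missing combinations with `NaN`. A minimal sketch on toy data follows; the values are hypothetical, and collecting the frames in a list before a single `concat` is just one possible variant of the loop above.

```python
import pandas as pd

# Shared (Country, Year) index, as produced by the long reshaping
index = pd.MultiIndex.from_tuples(
    [("Australia", 1990), ("Austria", 1990)],
    names=["Country", "Year"],
)

# Two toy long tables (hypothetical values); the second lacks Austria
co2 = pd.DataFrame({"Carbon dioxide": ["10.0", "20.0"]}, index=index)
ch4 = pd.DataFrame({"Methane": ["1.5"]}, index=index[:1])

# Column-wise outer join: rows are aligned on the index, gaps become NaN
frames = [co2, ch4]
merged = pd.concat(frames, axis=1, join="outer")

print(merged)
```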


In [48]:
# What does the data look like?
df_polluants.loc[df_polluants['Country']=="United States", "Carbon dioxide"]

1344    6 371 000.54
1345    6 315 615.19
1346    6 424 934.36
1347    6 532 069.60
1348    6 624 835.79
1349    6 710 067.30
1350    6 907 699.05
1351    6 968 461.97
1352    7 032 526.01
1353    7 071 461.46
1354    7 232 010.77
1355    7 116 810.08
1356    7 156 652.87
1357    7 199 262.66
1358    7 333 052.05
1359    7 339 039.87
1360    7 270 326.95
1361    7 369 967.71
1362    7 160 600.81
1363    6 709 369.15
1364    6 938 591.68
1365    6 787 419.03
1366    6 545 969.33
1367    6 710 218.18
1368    6 759 995.63
1369    6 623 775.48
1370    6 492 267.43
1371    6 456 718.19
Name: Carbon dioxide, dtype: object
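
The `dtype: object` in the output shows that the values are still strings, with spaces as thousands separators. A minimal sketch of a numeric conversion is given below. It assumes that the separators are whitespace characters and that `..` marks a missing value, as in the tables above; the names `raw` and `cleaned` are hypothetical.

```python
import pandas as pd

# Toy column in the same format as the scraped values
raw = pd.Series(["6 371 000.54", "62 322.64", ".."])

# Remove every whitespace character (assumed to be the thousands separator),
# then convert to numbers; anything unparseable, such as "..", becomes NaN
cleaned = pd.to_numeric(raw.str.replace(r"\s", "", regex=True), errors="coerce")

print(cleaned)
```

The same transformation could be applied to every pollutant column of `df_polluants` before any analysis.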

## Bibliography

- Mitchell, R. (2018). *Web scraping with Python: Collecting more data from the modern web*. O'Reilly Media, Inc.
- Vanden Broucke, S., & Baesens, B. (2018). *Practical Web scraping for data science*. New York, NY: Apress.