### GreenDS

# Fundamentals of Agro-Environmental Data Science

## Example APIs and Web scraping

### Introduction

The purpose of this Jupyter Notebook exercise is to demonstrate the methods available to obtain data from online services. Two examples are explored:
- web data services based on REST APIs
- web scraping from online web pages

Sometimes web pages use APIs to expose information and services without providing any documentation. We will learn how to identify the existence of these services and use them for more efficient data collection.

## Web scraping Air Quality data

QUALAR (https://qualar.apambiente.pt/) is a web platform of APA, the Portuguese Environment Agency, that displays online air quality data sampled by air monitoring stations in Portugal. Unfortunately, the platform neither exposes to the end user nor documents the API service that was implemented for downloads from the website. Downloads are generated as XLSX files.

However, by inspecting the source code of the web page it is possible to identify that an API has been implemented, and that it can be used to make data downloads more efficient. This exercise demonstrates that with the following steps:
- check how downloads are generated from the website
- identify and collect the parameters that define the data download
- use the API to download data
- as a bonus, visualise a time series of an air quality variable.

## 1. Data download for human users

Visit the data download page of QUALAR, at https://qualar.apambiente.pt/downloads. It displays a table with a list of Air Monitoring Stations, with the following columns:
- Region (Região)
- Municipality (Concelho)
- Station (Estação)
- Station type (Tipo de Estação), with categories traffic, industrial and background
- Area type (Tipo de Área), with categories urban, suburban and rural
- columns for the following pollutants: O3, NO2, CO, SO2, PM10, PM2.5, C6H6, other

At the top, the page has two fields to define the time range for the data download, and on the left several buttons to activate filters on the type of station and type of area.

To make a download, users can click on the arrows that are available for each station and each pollutant or, if they want to download all pollutants for a station, they can click directly on the station name. After clicking, the download file is generated for the requested options as an Excel file (xlsx).

*Try to make a download in this way, and check the file downloaded.*

## 2. Verify how downloads are generated by inspecting the webpage source code

It is possible to inspect the source code of the table, and the behaviour of the page when a download is requested. By checking this we can try to identify which methods are used to provide data to users. If we manage to verify that the web page is served by an API, and we can identify which parameters define a request, then it is possible to write a script to speed up downloads.

**1. Open the browser's inspection tool for the web page.**

*Open your web browser and navigate to https://qualar.apambiente.pt/downloads. Afterwards, in the browser menu, find the option **Web Developer Tools** or **Developer Tools** (in Firefox or Chrome, you will find it under **More tools**). This opens a new panel in the browser.*

**2. Check the method to generate downloads**

As mentioned before, clicking on the name of a station will generate a download with all data for that station. This means that a request is made over the network through the HTTP protocol. Checking which request was made (which URL was sent) is a good way of verifying what information was sent to the web server.

*On the Developer Panel, click on the tab **Network**. After that, click on the name of a station to make a download request. This will generate a new row on the panel, with the information about a **GET** request.*

One of the parameters in that row is the name or file field, which shows the URL sent to the server, e.g.:

```https://qualar.apambiente.pt/api/download.php?poluente_id=0&estacao_id=3082&data_inicio=2021-01-01&data_fim=2021-12-31&influencias=1,2,3&ambientes=1,2,4```

We can identify the following sections in the URL:

Endpoint URL (host and path): ```https://qualar.apambiente.pt/api/download.php```

Parameters: ```poluente_id=0&estacao_id=3082&data_inicio=2021-01-01&data_fim=2021-12-31&influencias=1,2,3&ambientes=1,2,4```

The meaning of the parameters is more or less obvious:

`poluente_id` - the ID of the pollutant. The value zero should mean all pollutants

`estacao_id` - the ID of the station

`data_inicio` - starting date

`data_fim` - ending date

`influencias` - station type

`ambientes` - area type
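
The split into endpoint and parameters can also be done programmatically. The following is a minimal sketch using only the Python standard library; the variable names (`example_url`, `endpoint`, `params`) are illustrative and not part of the QUALAR service.

```python
from urllib.parse import urlsplit, parse_qs

# URL observed in the Network tab of the browser
example_url = ("https://qualar.apambiente.pt/api/download.php"
               "?poluente_id=0&estacao_id=3082&data_inicio=2021-01-01"
               "&data_fim=2021-12-31&influencias=1,2,3&ambientes=1,2,4")

parts = urlsplit(example_url)

# everything before the "?" identifies the service being called
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"

# parse_qs returns a list of values per key; each key appears once here,
# so we keep only the first value
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(endpoint)
for key, value in params.items():
    print(f"{key:12s} = {value}")
```

Note that the comma-separated values of `influencias` and `ambientes` are kept as single strings, exactly as they appear in the URL.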

**3. Verify that the method for download works**

We have just identified an API service for downloading data. We can check that it works by testing different parameters and seeing whether the results match what is expected:

*To download data only for **year 2020**, try the following modified URL:*

```https://qualar.apambiente.pt/api/download.php?poluente_id=0&estacao_id=3082&data_inicio=2020-01-01&data_fim=2020-12-31&influencias=1,2,3&ambientes=1,2,4```

The challenge now is to discover the values of the IDs of the air monitoring stations (the parameter `estacao_id`). If we find these, we can write a script that makes automatic requests to download the data files.
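
As a first step towards such a script, the request URL can be assembled from its parameters. This is a sketch under stated assumptions: the function name `build_download_url` and its defaults are hypothetical, the default filter values are simply copied from the URL observed in the browser, and no request is actually sent, so the behaviour of the server for other parameter combinations remains untested here.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://qualar.apambiente.pt/api/download.php"


def build_download_url(station_id, start_date, end_date, pollutant_id=0,
                       influencias="1,2,3", ambientes="1,2,4"):
    """Return the download URL for one station and date range (dates as YYYY-MM-DD)."""
    params = {
        "poluente_id": pollutant_id,   # 0 appears to mean "all pollutants"
        "estacao_id": station_id,
        "data_inicio": start_date,
        "data_fim": end_date,
        "influencias": influencias,    # station type filter
        "ambientes": ambientes,        # area type filter
    }
    # safe="," keeps the commas literal, as in the URL observed in the browser
    return API_ENDPOINT + "?" + urlencode(params, safe=",")


# one URL per year for the station used in the example URL
for year in (2019, 2020, 2021):
    print(build_download_url(3082, f"{year}-01-01", f"{year}-12-31"))
```

Each printed URL could then be passed to `requests.get` to fetch the corresponding XLSX file, ideally with a short pause between requests to avoid overloading the server.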

**4. Inspect the HTML source code** 

On the top left bar of the **Developer tools** panel, there is an arrow cursor option. Select it, and then place the mouse pointer on the name of one air monitoring station in the table. You will see that, for each section of the web page over which you hover the mouse, the corresponding HTML source code is highlighted in the developer panel.

**5. Select the section of the HTML code with the cell of the station name**

Place the mouse so that the complete cell with the name of a station in the table is highlighted, and then click. In the source code, a line starting with the tag **td** should be selected.

At the beginning of that line, a triangle indicates that the inner HTML code can be expanded. Remember that HTML is a hierarchical language, where tags placed inside other tags are "children" or "dependents" of these.

The HTML looks, for example, like the following:

```html
<td style="background-color: #EBF7FF; text-align: center; vertical-align: middle; ">
    <label title="Dados de todos os poluentes para uma estação num dado ano" 
           style="color:#0000ff; cursor:pointer" onclick="tableDataManager.openExcel(3082)">
        <u><b>Alfragide/Amadora</b></u>
    </label>
</td>
```

What is interesting about this HTML is that the **onclick** event on the **label** tag triggers a method that opens an Excel file, **with the ID 3082**. This is the ID of the station Alfragide/Amadora. If we try another station, the ID will be different. We have, therefore, found a way of identifying the IDs of all air monitoring stations.
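
To make the idea concrete, the sketch below extracts the station name and ID from the HTML fragment shown above. It deliberately uses only the standard library (`html.parser` and `re`) so that it runs without extra installs; the class name `StationLabelParser` is hypothetical, and the code assumes that every station label carries an `onclick` attribute of the form `tableDataManager.openExcel(<id>)`, which has only been verified for this one fragment.

```python
import re
from html.parser import HTMLParser

# HTML fragment copied from the developer panel
SAMPLE_HTML = """
<td style="background-color: #EBF7FF; text-align: center; vertical-align: middle; ">
    <label title="Dados de todos os poluentes para uma estação num dado ano"
           style="color:#0000ff; cursor:pointer" onclick="tableDataManager.openExcel(3082)">
        <u><b>Alfragide/Amadora</b></u>
    </label>
</td>
"""


class StationLabelParser(HTMLParser):
    """Collect {station name: station ID} from <label onclick="...openExcel(ID)">."""

    def __init__(self):
        super().__init__()
        self.stations = {}
        self._current_id = None   # ID of the label currently being read, if any
        self._text = []           # text pieces found inside that label

    def handle_starttag(self, tag, attrs):
        if tag == "label":
            onclick = dict(attrs).get("onclick") or ""
            match = re.search(r"openExcel\((\d+)\)", onclick)
            if match:
                self._current_id = int(match.group(1))
                self._text = []

    def handle_data(self, data):
        # text of child tags (<u>, <b>) also arrives here
        if self._current_id is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "label" and self._current_id is not None:
            name = "".join(self._text).strip()
            self.stations[name] = self._current_id
            self._current_id = None


parser = StationLabelParser()
parser.feed(SAMPLE_HTML)
print(parser.stations)
```

The same logic applies to the full page, provided that the station labels are present in the HTML returned by the server rather than added later by JavaScript.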

## 3. Scrape the HTML source code to obtain IDs of the stations

We will scrape the HTML of the table in https://qualar.apambiente.pt/downloads to obtain the IDs of the air monitoring stations. For that, we will use the Python module **BeautifulSoup**.

In [None]:
# If you don't have the BeautifulSoup library installed, you can install it from the shell
# with the following command:
#
# $ pip3 install beautifulsoup4

In [1]:
# import requests to make python request URLs through HTTP
import requests
# import BeautifulSoup library
from bs4 import BeautifulSoup

import datetime
import pandas as pd
from time import sleep
import urllib

In [6]:
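# function to get the html source code containing the table with the list of air monitoring stations
def scrape_table(base_url, header):
    # request the page, sending the given HTTP headers (e.g. a browser-like User-Agent)
    table_soup = requests.get(base_url, headers=header)
    return table_soup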



In [2]:
# Downloading contents of the web page
url = "https://qualar.apambiente.pt/downloads"
data = requests.get(url).text

In [5]:
# Creating BeautifulSoup object
soup = BeautifulSoup(data, 'html.parser')
print(soup)

<!DOCTYPE html>

<!--[if (gt IE 9)|!(IE)]><!--> <html class="no-js" lang="pt"> <!--<![endif]-->
<head>
<meta charset="utf-8"/>
<!-- www.phpied.com/conditional-comments-block-downloads/ -->
<!-- Always force latest IE rendering engine
         (even in intranet) & Chrome Frame
         Remove this if you use the .htaccess -->
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<!--  Mobile Viewport Fix
          j.mp/mobileviewport & davidbcalhoun.com/2010/viewport-metatag
          device-width: Occupy full width of the screen in its current orientation
          initial-scale = 1.0 retains dimensions instead of zooming out if page height > device height
          user-scalable = yes allows the user to zoom in -->
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>QualAR - Qualidade do AR</title>
<!-- http://dev.w3.org/html5/markup/meta.name.html -->
<meta content="www" name="application-name"/>
<!-- Speaking of Google, don't forget to set your s

In [4]:
# Verifying tables and their classes
print('Classes of each table:')
for table in soup.find_all('table'):
    print(table.get('class'))

Classes of each table:
None


In [None]:
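import re

# function to extract from the html source the IDs of the stations
def get_ids(table_soup):
    # assumes each station label has onclick="tableDataManager.openExcel(<id>)",
    # as seen when inspecting the page; returns the unique IDs as sorted integers
    matches = re.findall(r'openExcel\((\d+)\)', str(table_soup))
    return sorted({int(m) for m in matches})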

In [7]:
url = 'https://qualar.apambiente.pt/downloads'

html = scrape_table(url,
             header={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'})

print(html.text)

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]><!--> <html class="no-js" lang="pt"> <!--<![endif]-->
  <head>
    <meta charset="utf-8">
    <!-- www.phpied.com/conditional-comments-block-downloads/ -->
    <!-- Always force latest IE rendering engine
         (even in intranet) & Chrome Frame
         Remove this if you use the .htaccess -->
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <!--  Mobile Viewport Fix
          j.mp/mobileviewport & davidbcalhoun.com/2010/viewport-metatag
          device-width: Occupy full width of the screen in its current orientation
          initial-scale = 1.0 retains dimensions instead of zooming out if page height > device height
          user-scalable = yes allows the user to zoom in -->
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>QualAR - Qualidade do AR</title>
    <!-- http://dev.w3.org/html5/markup/meta.name.html -->
    <meta name="application-name" content="www">
    <!-- Speaking 

In [None]:
# function to request the HTML source of a range of numbered pages
# and collect the links with class "review-listing" found in them

def scrape_table(base_url, min_page_number, max_page_number, proxies, header):
    wine_pages_to_mine = []
    # reuse one session so that proxies and headers apply to every request
    session = requests.Session()
    session.proxies = proxies
    session.headers = header
    for page_number in range(min_page_number, max_page_number):
        url_to_mine = base_url + str(page_number)
        try:
            response = session.get(url_to_mine)
            soup = BeautifulSoup(response.content, 'html.parser')

            all_wine_links = soup.find_all("a", class_="review-listing")
            wine_pages_to_mine.extend(a.get('href') for a in all_wine_links)
        except requests.RequestException:
            # skip pages that fail to download
            continue

    # the 'data' folder must already exist
    series_wine_pages = pd.Series(wine_pages_to_mine)
    series_wine_pages.to_csv('data/wine_pages_to_mine.csv')
    return wine_pages_to_mine