### GreenDS

# Fundamentals of Agro-Environmental Data Science

## Example APIs and Web scraping

### Introduction

The purpose of this Jupyter Notebook exercise is to demonstrate the methods available to obtain data from online services. Two examples are explored:
- web data services based on REST APIs
- web scraping from online web pages

Sometimes web pages use APIs to expose information and services without providing any documentation. We will learn how to identify these undocumented services and use them to collect data more efficiently.

## Web scraping Air Quality data

QUALAR (https://qualar.apambiente.pt/) is a web platform of APA, the Portuguese Environment Agency, that publishes air quality data sampled by air monitoring stations in Portugal. Unfortunately, the platform neither exposes to end users nor documents the API service that was implemented for web downloads. Downloads are generated as XLSX files.

However, by inspecting the source code and the network traffic of the web page, it is possible to identify that an API is implemented and that it can be used to facilitate efficient data downloads. This exercise demonstrates that with the following steps:
- check how downloads are generated from the website
- identify and collect the parameters that define the data download
- use the API to download data
- as a bonus, visualise a time series of an air quality variable.

## 1. Data download for human users

Visit the data download page of QUALAR, at https://qualar.apambiente.pt/downloads. It displays a table with a list of Air Monitoring Stations, with the following columns:
- Region (Região)
- Municipality (Concelho)
- Station (Estação)
- Station type (Tipo de Estação)
- Area type (Tipo de Área)
- columns for the following pollutants: O3, NO2, CO, SO2, PM10, PM2.5, C6H6, other

At the top, the page has two fields that define the time range for the data download, and on the left several buttons that activate filters on the station type and the area type.

To download data, users can click on the arrows available for each station and each pollutant or, to download all pollutants for a station, click directly on the station name. After clicking, the download file is generated for the requested options as an Excel file (XLSX).

*Try to make a download in this way, and check the file downloaded.*

## 2. Verify how downloads are generated by inspecting the webpage source code

It is possible to inspect the source code of the table and the behaviour of the page when a download is requested. By doing so, we can try to identify which methods are used to provide data to users. If we can verify that the web page is served by an API and identify which parameters define a request, then we can write a script to speed up downloads.

**1. Activate the Inspect Tool of the source code of the webpage.**

*In your web browser, navigate to https://qualar.apambiente.pt/downloads. Afterwards, in the menu of your browser, find the option **Web Developer Tools** or **Developer Tools** (in Firefox or Chrome, you will find it under **More tools**). This will open a new panel in the browser.*

**2. Check the method to generate downloads**

As mentioned before, clicking on the name of a station will generate a download with all data for that station. This means that a request is made over the network using the HTTP protocol. Checking which request was made (which URL was sent) is a good way of verifying what information was sent to the web server.

*On the Developer Panel, click on the **Network** tab. After that, click on the name of a station to make a download request. This will generate a new row on the panel, with the information about a **GET** request.*

One of the parameters in that row is the name or file field, which shows the URL sent to the server, e.g.:

```https://qualar.apambiente.pt/api/download.php?poluente_id=0&estacao_id=1041&data_inicio=2021-01-01&data_fim=2021-12-31&influencias=1,2,3&ambientes=1,2,4```

We can identify the following sections in the URL:

Host URL: ```https://qualar.apambiente.pt/api/download.php```

Parameters:
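
The parameters can be read directly from the query string of the example URL above. The sketch below uses only the Python standard library to split the URL and then rebuild it. The English meanings given in the comments are inferred from the Portuguese parameter names and are assumptions, since the API is undocumented; `build_download_url` is a hypothetical helper introduced here for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# URL captured in the Network tab (copied from the example above)
example_url = (
    "https://qualar.apambiente.pt/api/download.php"
    "?poluente_id=0&estacao_id=1041"
    "&data_inicio=2021-01-01&data_fim=2021-12-31"
    "&influencias=1,2,3&ambientes=1,2,4"
)

# Split the URL into the endpoint and the query parameters
parts = urlsplit(example_url)
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
params = dict(parse_qsl(parts.query))

for name, value in params.items():
    print(f"{name} = {value}")


def build_download_url(station_id, start, end, pollutant_id=0):
    """Rebuild a download URL (hypothetical helper, not part of QUALAR)."""
    query = {
        "poluente_id": pollutant_id,  # assumed: pollutant id (0 in the captured request)
        "estacao_id": station_id,     # assumed: station id
        "data_inicio": start,         # assumed: start date, YYYY-MM-DD
        "data_fim": end,              # assumed: end date, YYYY-MM-DD
        # the two filter lists are copied unchanged from the captured request
        "influencias": params["influencias"],
        "ambientes": params["ambientes"],
    }
    # safe="," keeps the comma-separated lists unescaped, as in the captured URL
    return f"{endpoint}?{urlencode(query, safe=',')}"


rebuilt_url = build_download_url(1041, "2021-01-01", "2021-12-31")
print(rebuilt_url == example_url)
```

A rebuilt URL could then be passed to a download routine such as `urllib.request.urlretrieve` to save the XLSX file; that step is left out here because it requires network access.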


On the top left bar of this panel, there is an arrow cursor option. Select it, and then place the mouse pointer on the name of one air monitoring station. You will see that, for each section of the web page over which you hover the mouse, the corresponding HTML source code is highlighted in the developer panel.

**2. Select a section of the HTML code**

Place the mouse so that the complete cell with the name of a station in the table is highlighted, and click. In the source code, a line starting with the tag **td** should be selected.   
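
The same **td** cells can also be read programmatically. The sketch below uses the standard library `html.parser` module to collect the text of every table cell; the HTML fragment and the cell contents are made up for illustration and are not copied from the QUALAR page, and `CellCollector` is a hypothetical class name.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text content of every <td> element."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            self.cells[-1] = self.cells[-1].strip()
            self.in_cell = False

    def handle_data(self, data):
        # text inside nested tags (e.g. a link) is still part of the cell
        if self.in_cell:
            self.cells[-1] += data


# Made-up table row, for illustration only
sample_html = (
    "<table><tr>"
    "<td>Region A</td><td>Municipality B</td>"
    "<td><a href='#'>Station C</a></td>"
    "</tr></table>"
)

collector = CellCollector()
collector.feed(sample_html)
cells = collector.cells
print(cells)
```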

In [None]:
# If you don't have the required libraries installed, you can install them
# from the shell terminal with the following commands:
#
# $ pip3 install pandas
# $ pip3 install scikit-learn
# $ pip3 install seaborn
# $ pip3 install matplotlib

In [None]:
# import pandas library
import pandas as pd

## 2. Download the data file from Kaggle
Go to http://www.kaggle.com/uciml/breast-cancer-wisconsin-data and download the `data.csv` file. Place the file in the `raw-data` directory.

## 3. Read and preview data
Read the data file, and print the shape and a preview of the table:
- number of rows
- number of properties (columns)
- show the first two rows of data

In [None]:
cancer_data = pd.read_csv('./raw-data/data.csv')
# show every column when previewing the table
pd.options.display.max_columns = cancer_data.shape[1]
print(f'Number of entries: {cancer_data.shape[0]:,}\n'
      f'Number of features: {cancer_data.shape[1]:,}\n\n'
      f'Number of missing values: {cancer_data.isnull().sum().sum()}\n\n'
      f'{cancer_data.head(2)}')

The table is now loaded into a `pandas` DataFrame, which can show a preview of the data with `head()`:

In [None]:
cancer_data.head()

## 4. Clean and explore data

You can scroll to the last column of the table above and verify that it contains only missing values (NaN). We need to remove this column:

In [None]:
cancer_data = cancer_data.drop('Unnamed: 32', axis=1)

It is possible to calculate descriptive statistics for the attributes of the data set:

In [None]:
cancer_data.describe()

Next, let's calculate how many women have a confirmed cancer (a malignant breast tumor):

In [None]:
cancer_data['diagnosis'].value_counts()

We can calculate these values as percentages:

In [None]:
round(cancer_data['diagnosis'].value_counts()*100/len(cancer_data)).convert_dtypes()
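
The arithmetic behind the two cells above can be sketched without `pandas`: count each label, then divide by the total number of rows. The labels below are made up for illustration and are not taken from the data set.

```python
from collections import Counter

# Made-up diagnosis labels (B = benign, M = malignant)
labels = ["B", "B", "M", "B", "M", "B", "B", "M", "B", "M"]

counts = Counter(labels)
percentages = {
    label: round(100 * n / len(labels)) for label, n in counts.items()
}
print(counts)
print(percentages)
```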

## 5. Visualize data

We can get better insight into the data if we compare values for benign and malignant cases. Seaborn is a powerful library for visualizing data.

In [None]:
import seaborn as sns

sns.set(style="ticks", color_codes=True)

In [None]:
radius = cancer_data[['radius_mean','radius_se','radius_worst','diagnosis']]
sns.pairplot(radius, hue='diagnosis',palette="husl", markers=["o", "s"],height=4)

We can do another visualization, adding linear regression lines.

In [None]:
texture = cancer_data[['texture_mean','texture_se','texture_worst','diagnosis']]
sns.pairplot(texture, hue='diagnosis', palette="husl",height=4, kind="reg")

Another visualization, the violin plot, shows the distribution of the values for each category. We will do this in groups of ten variables.

In [None]:
# y includes our labels and x includes our features
y = cancer_data.diagnosis # M or B 
list_drp = ['id','diagnosis']
x = cancer_data.drop(list_drp, axis = 1 )

In [None]:
import matplotlib.pyplot as plt

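# standardize each feature to zero mean and unit standard deviation
data_n_2 = (x - x.mean()) / x.std()
# first group of ten features (the mean values);
# use 10:20 or 20:30 to plot the other groups
data = pd.concat([y, data_n_2.iloc[:, 0:10]], axis=1)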
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10,6))
sns.violinplot(x="features", y="value", hue="diagnosis", data=data,split=True, inner="quart",palette ={"B": "g", "M": "r"})
plt.xticks(rotation=90)
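
The standardization step in the cell above rescales every feature to zero mean and unit standard deviation, so that features measured in different units can share one axis. A minimal pure-Python sketch of the same formula follows; the values are made up, and `statistics.stdev` is used because, like the `pandas` `std()` default, it computes the sample standard deviation.

```python
from statistics import mean, stdev


def standardize(values):
    """Return (v - mean) / std for every value (z-scores)."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(v - mu) / sigma for v in values]


# Made-up feature values, for illustration only
feature = [10.0, 12.0, 14.0, 16.0, 18.0]
z = standardize(feature)
print([round(v, 3) for v in z])
```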

We can use box plots to represent the worst values of the features.

In [None]:
# box-plots
data = pd.concat([y,data_n_2.iloc[:,20:30]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10,10))
sns.boxplot(x='features', y='value', hue='diagnosis', data=data, palette ={"B": "g", "M": "r"})
plt.xticks(rotation=45)

To explore correlations between independent variables, we can calculate the correlation matrix and represent it with a heatmap:

In [None]:
# correlation map
f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap(x.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)
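
Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr()`. As a sketch of what is computed for a single pair of columns, the coefficient can be written out in plain Python; the two lists are made up (one is an exact linear function of the other, so the result should be very close to 1).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values: the second list is a linear function of the first
radius = [1.0, 2.0, 3.0, 4.0]
perimeter = [2 * 3.14159 * r for r in radius]
r_value = pearson(radius, perimeter)
print(round(r_value, 3))
```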

## 6. Build the model

We will calculate two models: one based on the [K-nearest neighbors (KNN)](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) algorithm, and the other based on [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression).

First, we will define the X (independent) and Y (dependent) variables:

In [None]:
X = cancer_data.iloc[:, 2:32].values
y = cancer_data.iloc[:, 1].values

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

It is important to divide the dataset into two subsets, one for training (creating) the model and the other for testing. This allows us to check that the model is not overfitted and can be applied to other data.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
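
The idea behind the cell above can be sketched in plain Python: shuffle the rows, hold out a fraction for testing, and compute the scaling statistics on the training part only, so that no information from the test set leaks into the model. This is an illustration of the principle, not a reproduction of how `train_test_split` shuffles; `holdout_split` is a hypothetical helper and the values are made up. `statistics.pstdev` is used because `StandardScaler` divides by the population standard deviation.

```python
import random
from statistics import mean, pstdev


def holdout_split(rows, test_size=0.2, seed=1):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up single-feature data set
values = [float(v) for v in range(1, 11)]
train, test = holdout_split(values)

# Fit the scaling parameters on the training data only ...
mu, sigma = mean(train), pstdev(train)
train_scaled = [(v - mu) / sigma for v in train]
# ... and reuse the same parameters for the test data
test_scaled = [(v - mu) / sigma for v in test]
print(len(train), len(test))
```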

In this example, we will try two models:
- K-Nearest Neighbor (KNN)
- Logistic Regression

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# KNN
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_predictions = knn.predict(X_test)

# Logistic regression
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_predictions = lr.predict(X_test)
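
To make the KNN idea concrete, here is a minimal pure-Python sketch: a new sample receives the majority label of its `k` closest training samples. It is an illustration under simplifying assumptions (Euclidean distance, no tie handling), not the scikit-learn implementation; the points are made up, and `k=3` is used because the toy set is tiny (`KNeighborsClassifier` defaults to 5 neighbors).

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=5):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up two-feature training set with two well separated classes
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_labels = [0, 0, 0, 1, 1, 1]

pred_a = knn_predict(train_points, train_labels, (1.1, 1.0), k=3)
pred_b = knn_predict(train_points, train_labels, (3.0, 3.0), k=3)
print(pred_a, pred_b)
```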

## 7. Evaluate the model

We can calculate the accuracy of the models. This value is the fraction of correctly classified samples in the test subset.

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

print(f'Accuracy scores:\n'
      f'KNN model:\t\t   {accuracy_score(y_test, knn_predictions):.3f}\n'
      f'Logistic regression model: {accuracy_score(y_test, lr_predictions):.3f}')
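
As a sketch of what the accuracy score measures, the same fraction can be computed by hand. The label lists below are made up for illustration and `accuracy` is a hypothetical helper, not the scikit-learn function.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up labels: 6 of the 8 predictions are correct
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
score = accuracy(toy_true, toy_pred)
print(score)
```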

Another way of evaluating the model is to calculate the confusion matrix *C*, in which *C<sub>i,j</sub>* is the number of observations whose true value is *i* and that were predicted to be *j*.
It gives the values of true negatives (*C<sub>0,0</sub>*), false negatives (*C<sub>1,0</sub>*), true positives (*C<sub>1,1</sub>*) and false positives (*C<sub>0,1</sub>*).
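
The definition above can be sketched in a few lines of plain Python, with the true label selecting the row and the predicted label selecting the column. The labels are made up for illustration and `confusion` is a hypothetical helper.

```python
def confusion(true_labels, predicted_labels, n_classes=2):
    """Build C where C[i][j] counts samples with true label i predicted as j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


# Made-up binary labels (0 = negative, 1 = positive)
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

C = confusion(toy_true, toy_pred)
true_neg, false_pos = C[0]
false_neg, true_pos = C[1]
print(C)
```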

In [None]:
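# confusion_matrix expects the true labels first, then the predictions
# (replace knn_predictions with lr_predictions to evaluate the other model)
matrix = confusion_matrix(y_test, knn_predictions)
sns.heatmap(matrix, cbar=False, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix - KNN model')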