# Data from the Web

#### What do you find in this Notebook?

The purpose of this tutorial session is to offer a **quick** overview of how to scrape a Web page. In detail, we illustrate the two main libraries used for this purpose: [Requests](https://requests.kennethreitz.org/en/master/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). Afterwards, we show how to retrieve data from the Web.

---

*Credits to: [Tiziano Piccardi](https://github.com/tizianopiccardi)*

*Updated by: [Ekaterina Svikhnushina](https://github.com/Sea94) and [Pablo Cañas](https://github.com/pcanas)*

## What is web scraping

The web is a massive knowledge base. Most of the data available is not structured and it is difficult to generate a dataset from which you can extract valuable insights. 

Web scraping is the process of using bots to extract content and data from a website. With web scraping, you can extract underlying HTML code and data stored in a database. As a result, you can generate your own datasets for your data science projects.

## HTTP and HTML

Websites are accessed via the [HTTP](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol) protocol. A client sends a request for the content of a page (a GET request) to a server, and the server sends back the HTML code of the page, which the browser renders on our screen. We can also send information to the server via POST requests, for example when submitting a form.

In the following extract, we show the basic HTML structure of a website. The `head` section contains the title and metadata, while the `body` section contains the content of the website.
```
<!DOCTYPE html>
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>This is a Heading</h1>
        <p>Hello world!</p>
    </body>
</html>
```

## Remark: JSON

[JSON](https://en.wikipedia.org/wiki/JSON) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values). It is a very common data format, with a diverse range of applications, one example being web applications that communicate with a server.

Example of a basic JSON format:

```
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [],
  "spouse": null
}
```
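To make the mapping between JSON and Python concrete, here is a minimal sketch using the standard-library `json` module on a shortened version of the example above. JSON objects become dictionaries, arrays become lists, and `true`/`null` become `True`/`None`. The variable names `raw`, `person` and `as_text` are only illustrative.

```python
import json

# A shortened version of the JSON example above, as a plain string
raw = '''
{
  "firstName": "John",
  "isAlive": true,
  "age": 27,
  "address": {"city": "New York", "state": "NY"},
  "phoneNumbers": [{"type": "home", "number": "212 555-1234"}],
  "children": [],
  "spouse": null
}
'''

# Parse the JSON text into Python objects (dict, list, str, int, bool, None)
person = json.loads(raw)

print(type(person))                         # the top-level object is a dict
print(person["address"]["city"])            # nested objects are nested dicts
print(person["phoneNumbers"][0]["number"])  # arrays are lists
print(person["spouse"])                     # null becomes None

# Serialize the Python object back into JSON text
as_text = json.dumps(person, indent=2)
```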

## Web scraping libraries

To fetch data from the Web with Python, you need to get used to two essential libraries:

 * [`Requests (HTTP)`](https://requests.kennethreitz.org/en/master/): get the `html` page to parse.

 * [`Beautiful Soup (HTML Parsing)`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/): parse the `html` and extract data.

In [None]:
# Import libraries
import requests
from bs4 import BeautifulSoup

### Make a `get` request

The [GET method](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods) retrieves information from the server.

We start by scraping this website: https://httpbin.org/ - HTTP Request & Response Service. The website offers some useful endpoints [1] to check the content of our request. Some of them provide an 'echo service' that replies with the request it received.

[1] An endpoint is a web address (URL) at which clients of a specific service can gain access to it. By referencing that URL, clients can get to operations provided by that service.

#### Example 1: Get request basics
Here we show an example of how to use a GET request. In particular, you can see that the response gives us several pieces of information:

* The status code [2] which tells us whether everything is fine and if the request worked
* The headers
* Body of the response (typically HTML for webpages or JSON/XML for web services)

[2] A list of HTTP status codes is available [here](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). Some typical codes are: **200 OK** (standard response for successful HTTP requests) and **404 Not Found** (the requested resource could not be found but may be available in the future).

**NOTE:** this is an echo service: what you see is what we sent to the server.

In [None]:
# Make the request
r = requests.get('https://httpbin.org/ip') # /ip: Returns the requester's IP Address.

print('Response status code: {0}\n'.format(r.status_code))
print('Response headers: {0}\n'.format(r.headers))
print('Response body: {0}'.format(r.text))
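In a scraping script it is usually worth checking the status code before using the body. The following is a minimal sketch of one way to do that with `raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx codes. The helper names `body_if_ok` and `fetch_text` and the 10-second timeout are illustrative choices, not part of the Requests API.

```python
import requests


def body_if_ok(response):
    """Return the body of a response, raising requests.HTTPError on 4xx/5xx."""
    response.raise_for_status()
    return response.text


def fetch_text(url, timeout=10):
    """Send a GET request and return the body only if the request succeeded.

    The timeout (in seconds) is an arbitrary value chosen for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    return body_if_ok(response)
```

For example, `fetch_text('https://httpbin.org/ip')` should return the same body as printed above, while a URL that answers with 404 would raise an exception instead of silently returning an error page.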

#### Example 2: Parsing JSON string from the response

If the body of the response is a JSON string, Requests offers a convenient way to parse the text and get a Python dictionary.

Let's try to get the current time from here: http://worldtimeapi.org/api/timezone/Europe/Madrid – a simple web service that returns the local-time for a given timezone as either JSON (by default) or plain-text.

In [None]:
r = requests.get('http://worldtimeapi.org/api/timezone/Europe/Madrid')

print('Response body (parsed json):')
r.json()

#### Example 3: Including parameters into get request

This time, the `url` has been slightly changed to include a parameter (key1).

Remember that with the GET method the parameters are part of the URL.

In [None]:
r = requests.get('https://httpbin.org/get?key1=value1')
r.json()
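Instead of writing the query string by hand, you can pass a dictionary through the `params` argument and let Requests build the URL. The sketch below prepares the request without sending it, so that we can inspect the resulting URL; `key2` is an extra parameter added only for illustration.

```python
import requests

# Build (but do not send) a GET request with two query parameters
request = requests.Request(
    "GET",
    "https://httpbin.org/get",
    params={"key1": "value1", "key2": "value2"},
)
prepared = request.prepare()

# The parameters are encoded into the URL after the '?'
print(prepared.url)
```

Calling `requests.get('https://httpbin.org/get', params={...})` with the same dictionary builds the same URL and sends the request in one step.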

### Make a `post` request
The [POST method](https://en.wikipedia.org/wiki/POST_(HTTP)) requests that a web server accepts the data enclosed in the body of the request message, most likely for storing it.

A POST request can carry its parameters in the body. Let's see how to do this with the Requests library:

In [None]:
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('https://httpbin.org/post', data=payload)
r.json()
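To see where the parameters actually end up, the following sketch prepares two POST requests without sending them. With `data=` the dictionary is form-encoded into the body, whereas with `json=` it is serialized as JSON; in both cases the URL carries no query string.

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Form-encoded body, like an HTML form submission
form = requests.Request('POST', 'https://httpbin.org/post', data=payload).prepare()
print(form.headers['Content-Type'])
print(form.body)

# JSON body, common when talking to web services
as_json = requests.Request('POST', 'https://httpbin.org/post', json=payload).prepare()
print(as_json.headers['Content-Type'])
print(as_json.body)
```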

### Make a request and extract the Page Title!

* Send the request and get the `html`

In [None]:
# Send the request
r = requests.get('https://httpbin.org/html')
r.text[:300]

* Then we use our beloved `BeautifulSoup` to parse the HTML and get the main heading (the `h1` tag)

In [None]:
# Extract the main heading (the first <h1> tag)
soup = BeautifulSoup(r.text, 'html.parser')
soup.h1
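The same attribute-style access works for other tags. As a small offline sketch, here we parse the basic HTML page shown in the "HTTP and HTML" section and read both the page title (from `head`) and the heading (from `body`). Accessing a tag that does not exist returns `None`.

```python
from bs4 import BeautifulSoup

# The basic HTML page from the "HTTP and HTML" section
html = """<!DOCTYPE html>
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>This is a Heading</h1>
        <p>Hello world!</p>
    </body>
</html>"""

soup = BeautifulSoup(html, 'html.parser')

title_tag = soup.title                       # the whole <title> tag
title_text = soup.title.string               # only the text inside it
heading_text = soup.h1.get_text(strip=True)  # text of the first <h1>
missing = soup.h2                            # no <h2> in the page, so None

print(title_tag)
print(title_text)
print(heading_text)
print(missing)
```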

## Let's get interesting data - DBLP

*DBLP is a computer science bibliography website. Starting in 1993 at the University of Trier, Germany, it grew from a small collection of HTML files and became an organization hosting a database and logic programming bibliography site. DBLP listed more than 3.66 million journal articles, conference papers, and other publications on computer science in July 2016, up from about 14,000 in 1995.*

<div align="right">https://en.wikipedia.org/wiki/DBLP</div> 

We want to check the distribution of publications by year for our professor, Vicenç Soler.

First of all, let's check the page with the data we need:

In [None]:
URL = 'https://dblp.org/pid/20/4669.html'

The page is public and accessible with a browser using a simple GET:

In [None]:
r = requests.get(URL)
page_body = r.text

Now the page content is downloaded and we can inspect the body of the response:

In [None]:
page_body

It is pure HTML, and we need BeautifulSoup to parse the content. Let's check the documentation of this library! https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [None]:
soup = BeautifulSoup(page_body, 'html.parser')

Now the page is parsed and we can read the data we need!

For example, let's get the title! Are we on the right page?

Yes! And we can get the clean text without HTML tags:

A more complex query now! Let's find all the links in the page. 

Hint: in HTML, a link is defined using the &lt;a&gt; tag

Let's print the first link:

Let's try to get all the links pointing to an external website (not DBLP)
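One possible approach to the link queries above, sketched on a small made-up HTML snippet rather than the real DBLP page: `find_all('a')` returns every link, and the host name of each `href` tells us whether it points outside DBLP. The helper `is_external` and the sample links are hypothetical; on the real page you would apply the same calls to the `soup` object created earlier.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Made-up snippet for illustration; the real page contains many more links
html = """
<div>
  <a href="https://dblp.org/pid/20/4669.html">Author page</a>
  <a href="/faq/index.html">FAQ</a>
  <a href="https://doi.org/10.0000/example">Electronic edition</a>
  <a href="https://www.wikidata.org/">Wikidata</a>
  <a>Anchor without href</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')


def is_external(href, own_domain='dblp.org'):
    """Return True if href points to a host outside own_domain."""
    host = urlparse(href).hostname
    if host is None:  # relative link, so it stays on the same site
        return False
    return host != own_domain and not host.endswith('.' + own_domain)


# All <a> tags that actually carry an href attribute
all_links = soup.find_all('a', href=True)
first_link = all_links[0]

# Keep only the links that leave the site
external = [a['href'] for a in all_links if is_external(a['href'])]

print(len(all_links))
print(first_link.get('href'))
print(external)
```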

Let's move on. Now we want to extract the sections that contain the publication details.

**The easiest way is to inspect the DOM of the web page with a browser.** Check with your browser how to isolate the portions of the page that represent publications. --- Use Google Chrome Inspector ---

Ok, each row is a &lt;li&gt; tag with a class called 'entry'. Can we get these publications?

Let's check the number of rows:

Print the titles of the publications!
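A minimal sketch of these steps, again on a made-up snippet. It assumes that each publication title sits in a `span` with class `title` inside the entry; check this with the browser inspector on the real page before relying on it.

```python
from bs4 import BeautifulSoup

# Made-up snippet mimicking the assumed structure of the publication list
html = """
<ul class="publ-list">
  <li class="year">2020</li>
  <li class="entry article"><cite><span class="title">First made-up paper.</span></cite></li>
  <li class="entry inproceedings"><cite><span class="title">Second made-up paper.</span></cite></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# class_='entry' matches any <li> whose class list contains 'entry'
entries = soup.find_all('li', class_='entry')
print(len(entries))

# Collect the title of each entry, skipping entries without a title span
titles = []
for entry in entries:
    title_tag = entry.find('span', class_='title')
    if title_tag is not None:
        titles.append(title_tag.get_text(strip=True))

for title in titles:
    print(title)
```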

In [None]:
import pandas as pd
%matplotlib inline

Create a script that gathers all Vicenç's publications and stores them in a DataFrame with the columns `title` (string), `authors` (list) and `year` (integer)
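One way such a script could look is sketched below. It rests on assumptions about the page markup that you should verify with the inspector: titles in `span.title`, each author in a `span` with `itemprop="author"`, and the year in an element with `itemprop="datePublished"` (either as text or in a `content` attribute). The function name `entries_to_dataframe` and the sample HTML are hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup


def entries_to_dataframe(html):
    """Parse publication entries into a DataFrame (title, authors, year)."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for entry in soup.find_all('li', class_='entry'):
        title_tag = entry.find('span', class_='title')
        year_tag = entry.find(attrs={'itemprop': 'datePublished'})
        if title_tag is None or year_tag is None:
            continue  # skip entries that do not match the assumed structure
        authors = [a.get_text(strip=True)
                   for a in entry.find_all('span', attrs={'itemprop': 'author'})]
        # The year may be stored as text or in a 'content' attribute
        year_text = year_tag.get('content') or year_tag.get_text(strip=True)
        rows.append({'title': title_tag.get_text(strip=True),
                     'authors': authors,
                     'year': int(year_text)})
    return pd.DataFrame(rows, columns=['title', 'authors', 'year'])


# Made-up sample with the assumed structure
sample = """
<li class="entry article"><cite>
  <span itemprop="author"><span itemprop="name">Author One</span></span>
  <span itemprop="author"><span itemprop="name">Author Two</span></span>
  <span class="title">First made-up paper.</span>
  <span itemprop="datePublished">2020</span>
</cite></li>
<li class="entry inproceedings"><cite>
  <span itemprop="author"><span itemprop="name">Author One</span></span>
  <span class="title">Second made-up paper.</span>
  <meta itemprop="datePublished" content="2018">
</cite></li>
"""
df = entries_to_dataframe(sample)
print(df)
```

On the real data, the same function could be called on the downloaded body, for example `entries_to_dataframe(page_body)`.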

Let's create a bar plot that shows the number of publications that Vicenç has published each year.
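A possible sketch of the plot, using a small DataFrame with made-up rows in place of the scraped one: `value_counts()` counts the publications per year, `sort_index()` puts the years in chronological order, and `plot.bar()` draws the bars with Matplotlib.

```python
import pandas as pd

# Made-up rows standing in for the scraped publications
df = pd.DataFrame({
    'title': ['Paper A', 'Paper B', 'Paper C', 'Paper D'],
    'year': [2018, 2020, 2020, 2021],
})

# Number of publications per year, ordered chronologically
counts = df['year'].value_counts().sort_index()
print(counts)

# Bar plot: one bar per year
ax = counts.plot.bar()
ax.set_xlabel('Year')
ax.set_ylabel('Number of publications')
```

Note that years without any publication do not appear on the axis with this approach.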