# Automatic data collection on the Web

Before we tackle some real web pages (HTML), we will discover the XML language, which is a good introduction to scraping data on the web.

### XML

The following lists a few properties of the XML language.

- XML was created to facilitate data exchange between machines and software.

- XML is a language that is written using tags.

- XML is a W3C recommendation, so it is a technology with strict rules to follow.

- XML is intended to be understandable by everyone: people and machines alike.

- XML allows us to create our own vocabulary using a set of customizable rules and tags.

- XML is also compatible with the web so that data exchanges can be easily carried out over the Internet.

- XML is therefore standardized, simple, but above all extensible and configurable so that any type of data can be described.
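
To illustrate the "create our own vocabulary" property, here is a minimal sketch that builds a small XML document with the standard library's `xml.etree.ElementTree` module. The `library`, `book`, `title` and `year` tags and the `isbn` attribute are hypothetical names chosen for this example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: a <library> that contains <book> elements
library = ET.Element("library")
book = ET.SubElement(library, "book", attrib={"isbn": "000-0"})
ET.SubElement(book, "title").text = "An example title"
ET.SubElement(book, "year").text = "2020"

# Serialize the tree to a string of tags
xml_text = ET.tostring(library, encoding="unicode")
print(xml_text)

# Any XML parser can read the document back, whatever the tag names are
parsed = ET.fromstring(xml_text)
title = parsed.find("book/title").text
isbn = parsed.find("book").get("isbn")
print(title, isbn)
```

The parser does not need to know the vocabulary in advance: it only relies on the tag syntax being well-formed.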

Here is an example of an XML document, which we have saved as `data.xml` in the `assets/` directory.

Display its content!

In [1]:
filename = "./assets/data.xml"
# The `with` statement closes the file automatically, even if an error occurs
with open(filename, "r", encoding="utf-8") as file:
    print(file.read())

<?xml version="1.0" encoding="UTF-8"?>
<users>
    <user data-id="101">
        <name>Zorro</name>
        <job>Dancer</job>
    </user>
    <user data-id="102">
        <name>Hulk</name>
        <job>Football player</job>
    </user>
    <user data-id="103">
        <name>Zidane</name>
        <job>Star</job>
    </user>
    <user data-id="104">
        <name>Beans</name>
        <job>Grocer</job>
    </user>
    <user data-id="105">
        <name>Batman</name>
        <job>Veterinary</job>
    </user>
    <user data-id="106">
        <name>Spiderman</name>
        <job>Veterinary</job>
    </user>
</users>



The first line is the XML declaration: it indicates the XML version and the encoding, and we always stay in UTF-8. Then we notice that the `users` tag contains several `user` tags, which in turn contain their own tags (`name` and `job`). The data is organized hierarchically as a tree, and each node provides information.
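
The tree structure can be made visible with a short recursive walk. This is only a sketch: it uses the standard library's `xml.etree.ElementTree` rather than `lxml`, a shortened inline copy of the document, and a hypothetical helper named `describe`.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the document, so the example is self-contained
xml_text = """<users>
    <user data-id="101">
        <name>Zorro</name>
        <job>Dancer</job>
    </user>
    <user data-id="102">
        <name>Hulk</name>
        <job>Football player</job>
    </user>
</users>"""

root = ET.fromstring(xml_text)


def describe(element, depth=0):
    """Return one indented line per node of the tree rooted at `element`."""
    line = "    " * depth + element.tag
    if element.attrib:
        line += f" {dict(element.attrib)}"
    text = (element.text or "").strip()
    if text:
        line += f": {text}"
    lines = [line]
    # Children are visited in document order, one level deeper
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


lines = describe(root)
print("\n".join(lines))
```

Each level of indentation in the output corresponds to one level of nesting in the document.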

Here is a small script that displays all the usernames.

You will first have to install the `lxml` package, for example with `pip install lxml`. It is a Python binding for the C libraries `libxml2` and `libxslt`, and the prebuilt packages from a standard package manager normally include everything needed. If you fail to import `lxml`, reinstalling it or changing its version is the first thing to troubleshoot.

In [2]:
from lxml import etree

# Parse the source document into a tree
tree = etree.parse(filename)
# Look at the document and identify the path that leads to the "user" information

# The user name we are looking for is in its own tag, `name`, which is itself
# inside a `user` tag, and `user` is in turn contained in the `users` tag.
# So tree.xpath("/users/user/name") returns the tags matching our search
for user in tree.xpath("/users/user/name"):
    # I only want to display the content (.text) of the `/users/user/name` tags
    print(user.text)

Zorro
Hulk
Zidane
Beans
Batman
Spiderman


In [5]:
tree.xpath("/users/user/name")[0].text

'Zorro'

In [6]:
# You can display the attributes of the tags that store this information
tree = etree.parse(filename)
for user in tree.xpath("/users/user"):
    print(user.get("data-id"))

101
102
103
104
105
106


You can refine the search to display only the users whose job is Veterinary.

In [7]:
tree = etree.parse(filename)
# The predicate [job='Veterinary'] keeps only the `user` tags whose `job` is Veterinary
for user in tree.xpath("/users/user[job='Veterinary']/name"):
    print(user.text)

Batman
Spiderman
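
The same kind of filtering can be sketched with the standard library's `xml.etree.ElementTree`, which supports a subset of XPath (`lxml`'s `xpath()` supports the full XPath 1.0 language). The inline document below is a shortened copy of `data.xml`, so the example runs without the file.

```python
import xml.etree.ElementTree as ET

xml_text = """<users>
    <user data-id="104"><name>Beans</name><job>Grocer</job></user>
    <user data-id="105"><name>Batman</name><job>Veterinary</job></user>
    <user data-id="106"><name>Spiderman</name><job>Veterinary</job></user>
</users>"""

# `root` is the <users> element, so the paths below are relative to it
root = ET.fromstring(xml_text)

# Predicate on the text of a child tag
vets = [name.text for name in root.findall("./user[job='Veterinary']/name")]

# Predicate on an attribute value
name_of_104 = root.find("./user[@data-id='104']/name").text

# No predicate: every <user>, from which we read the attribute
ids = [user.get("data-id") for user in root.findall("./user")]

print(vets, name_of_104, ids)
```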


# Web scraping

For this section, you will have to install the `beautifulsoup4` package using

`pip install beautifulsoup4`

or the `conda` package manager.

We saw earlier how to parse XML. It is also possible to **parse HTML**, and the tool that does the job best in my opinion is the `beautifulsoup4` library (imported as `bs4`).
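
As a first taste, here is a minimal sketch that parses a small, made-up HTML string instead of a saved page. It assumes `beautifulsoup4` is installed and uses Python's built-in `"html.parser"` backend, so `lxml` is not required for this example.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet, used only for this illustration
html_snippet = """
<html>
  <body>
    <h1>BeCode.org</h1>
    <h2>Our story</h2>
    <h2>The team</h2>
    <p>
        Our mission :   Enabling tomorrow's digital talents
        to blossom.
    </p>
  </body>
</html>
"""

snippet_soup = BeautifulSoup(html_snippet, "html.parser")

# find() returns the first matching tag, find_all() returns all of them
main_title = snippet_soup.find("h1").get_text()
subtitles = [tag.get_text() for tag in snippet_soup.find_all("h2")]

# Text inside tags often carries indentation and line breaks from the page
# source; splitting and re-joining collapses every run of whitespace
paragraphs = [" ".join(tag.get_text().split()) for tag in snippet_soup.find_all("p")]

print(main_title)
print(subtitles)
print(paragraphs)
```

Collapsing whitespace this way is a design choice for display purposes. Keep the raw text if the original layout matters.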

Save a web page that you like (for example `www.becode.org`) in the `./assets` directory, and display the content of the saved `.html` file.

Put the content of this page in a variable, for example `html_doc`.


In [8]:
becode_filename = "./assets/becode.html"
# The `with` statement closes the file automatically once the content is read
with open(becode_filename, "r", encoding="utf-8") as file:
    html_doc = file.read()
html_doc

'<!DOCTYPE HTML>\n<html lang="en">\n\n<head>\n\t<title>BeCode.org for Friends</title>\n\t<meta charset="utf-8" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1" />\n\t<!--[if lte IE 8]><script src="assets/js/ie/html5shiv.js"></script><![endif]-->\n\t<link rel="stylesheet" href="assets/css/main.css" />\n\t<!--[if lte IE 8]><link rel="stylesheet" href="assets/css/ie8.css" /><![endif]-->\n\t<!--[if lte IE 9]><link rel="stylesheet" href="assets/css/ie9.css" /><![endif]-->\n\t<meta property="og:type" content="business.business">\n\t<meta property="og:title" content="BeCode | Free Coding School">\n\t<meta property="og:description" content="Offering web development trainings for individuals.">\n\t<meta property="og:url" content="https://www.becode.org/partners/">\n\t<meta property="og:image" content="https://www.becode.org/partners/images/IngloriousBasterdz.jpg">\n\t<meta property="business:contact_data:country_name" content="Belgium">\n\t<link rel="canonical" href="ht

In [9]:
from bs4 import BeautifulSoup

# By looking at the HTML source of my file (becode.org),
# we can see that the main title is placed in the `h1` tag
soup = BeautifulSoup(html_doc, "lxml")

for tag in soup.find_all("h1"):
    # We only retrieve the content ==> .text
    print(tag.text)

BeCode.org


Do the same with `h2` tags.

In [17]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "lxml")

for tag in soup.find_all("h2"):
    print(tag.text)

A school to address the skills gap in an inclusive way
A Belgian coding school, powered by a methodology with a proven track record
Realtime Stats
Our story
The team
Our partners
Board of directors
Here's how you can make a difference.
Goodies and Ready-To-Share Content


And now, do the same with the `p` tags.

In [15]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "lxml")

for tag in soup.find_all("p"):
    print(tag.text)

                        


						Our mission : Enabling tomorrow's digital talents to blossom. 
						We believe that education makes anything possible. 
						Since 2017, BeCode has been offering free training courses for jobseekers to become web developers in partnership
						with Simplon. 

Here’s how you can help!
Opportunities and talents currently remain untapped due to a skills gap: 
					the gap between what employers need and what job seekers are offering today.
of Belgian youth are unemployed.
vacancies in Belgium by 2020 due to shortfall of e-skilled workers.
of employers blame inadequate training for the shortfall in skilled workers.

					Whereas a large amount of digital job vacancies remain unfilled - and that number will increase dramatically in
					the years to come -
					most of these professions don't require an engineering degree at all. 
					With a logical mindset, a lot of motivation and some basic training,
					anyone can learn to create mobile applications and websites and turn that

### Scraping via HTTP requests

HTTP is a protocol, a kind of language that allows a client (you, through your browser for example) to communicate with a server connected to the network (the HTTP server software running on a site's machine, for example Apache).

Messages always go in pairs: the request (from the client) and the response (from the server).
If no response comes back, it is because a problem has occurred at some point in the network.

The syntax of the first line of a request (the client's message) is always the same:

Request line (`command`, `URL`, `Protocol version`)
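
To make the three parts concrete, the sketch below splits a hand-written request into its components. The request text, the host `www.example.com` and the path `/index.html` are made up for illustration; nothing is sent over the network.

```python
# A hypothetical raw HTTP/1.1 request, as a client would send it to a server
raw_request = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "Accept: text/html\r\n"
    "\r\n"
)

# The first line is the request line, the following lines are headers
request_line, *header_lines = raw_request.split("\r\n")

# The request line holds the three parts, separated by single spaces
command, url, protocol_version = request_line.split(" ")

# Each non-empty header line has the form "Name: value"
headers = dict(line.split(": ", 1) for line in header_lines if line)

print(command, url, protocol_version)
print(headers)
```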



`command` is the method to use. It specifies the type of request and can take the following values:


- `"GET"`
This is the most common way to request a resource. A GET request should have no effect on the resource: it must be possible to repeat the request without side effects.


- `"HEAD"`
This method only asks for information about the resource (the header), without asking for the resource itself.


- `"POST"`
This method sends data to the server and must be used when a request modifies the resource, for example when submitting a form.


- `"OPTIONS"`
This method allows you to obtain the communication options of a resource or the server in general.


- `"CONNECT"`
This method allows you to use a proxy as a communication tunnel.


- `"TRACE"`
This method asks the server to return what it has received, in order to test and diagnose the connection.


- `"PUT"`
This method allows you to add a resource to the server, or to replace an existing one.


- `"DELETE"`
This method allows you to delete a resource from the server.

I will only discuss the most common ones here: HEAD, GET and POST.
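
The difference between these three methods can be observed without touching a real website. The sketch below is an assumption-laden demo, not a production setup: it starts a tiny local server with the standard library's `http.server` (the `DemoHandler` class and the `send` helper are hypothetical names) and queries it with `http.client` instead of `requests`.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<h1>Hello</h1>"


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: serves one tiny page and echoes POSTed data."""

    def _reply(self, body, include_body=True):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_HEAD(self):
        # Same headers as GET, but no body
        self._reply(PAGE, include_body=False)

    def do_GET(self):
        self._reply(PAGE)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._reply(b"received: " + self.rfile.read(length))

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system pick a free port
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()


def send(method, body=None):
    """Send one request and return (status, Content-Length header, body)."""
    connection = HTTPConnection("127.0.0.1", server.server_port, timeout=5)
    connection.request(method, "/", body=body)
    response = connection.getresponse()
    result = (response.status, response.getheader("Content-Length"), response.read())
    connection.close()
    return result


head_result = send("HEAD")
get_result = send("GET")
post_result = send("POST", body=b"name=Zorro")
for name, result in [("HEAD", head_result), ("GET", get_result), ("POST", post_result)]:
    print(name, result)

server.shutdown()
server.server_close()
```

HEAD returns the same headers as GET but an empty body, which makes it a cheap way to check that a resource exists before downloading it. With the `requests` package, the equivalent calls are `requests.head(url)`, `requests.get(url)` and `requests.post(url, data=...)`.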

### Putting it into practice

In [21]:
import requests

# URL of the website
url = "http://www.allocine.fr/"
# Send an HTTP "GET" request to the server identified by the URL
r = requests.get(url)
# Display the requested URL and the status code returned by the server
print(url, r.status_code)
# Ask BeautifulSoup to parse the HTML content of the response and store it in the `soup` variable
soup = BeautifulSoup(r.content, "lxml")
soup

('http://www.allocine.fr/', 200)




We have thus retrieved the information from the website without physically saving it in a file, only in a variable!