# Automatic data collection on the Web

Before tackling real web pages (HTML), we will look at the XML language, which is a good introduction to web scraping.

### XML

The following lists a few properties of the XML language.

- XML was created to facilitate data exchange between machines and software.

- XML is a language that is written using tags.

- XML is a W3C recommendation, so it is a technology with strict rules to follow.

- XML is intended to be understandable by everyone: people and machines alike.

- XML allows us to create our own vocabulary using a set of customizable rules and tags.

- XML is also compatible with the web so that data exchanges can be easily carried out over the Internet.

- XML is therefore standardized, simple, but above all extensible and configurable so that any type of data can be described.

Here is an example of an XML document, which we have saved as `data.xml` in the `assets/` directory.

Display its content!

In [4]:
filename = "./assets/data.xml"
# The with statement closes the file automatically, even if an error occurs
with open(filename, "r", encoding="utf-8") as file:
    print(file.read())

<?xml version="1.0" encoding="UTF-8"?>
<users>
    <user data-id="101">
        <nom>Zorro</nom>
        <metier>Danseur</metier>
    </user>
    <user data-id="102">
        <nom>Hulk</nom>
        <metier>Footballeur</metier>
    </user>
    <user data-id="103">
        <nom>Zidane</nom>
        <metier>Star</metier>
    </user>
    <user data-id="104">
        <nom>Beans</nom>
        <metier>Epicier</metier>
    </user>
    <user data-id="105">
        <nom>Batman</nom>
        <metier>Veterinaire</metier>
    </user>
    <user data-id="106">
        <nom>Spiderman</nom>
        <metier>Veterinaire</metier>
    </user>
</users>



The first line is the XML declaration: it gives the XML version and the encoding, which here (and throughout this section) is UTF-8. The `users` tag contains several `user` tags, each of which has its own child tags. The data is organized as a tree, and each node carries a piece of information.
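
To make the tree structure concrete, here is a minimal sketch that walks a document node by node and prints an indented outline. It assumes a shortened inline copy of `data.xml` and uses the standard library's `xml.etree.ElementTree` so that it runs without any extra install; the `describe` helper is a hypothetical name introduced for this illustration.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the document above, inlined so the sketch is self-contained
xml_text = """<users>
    <user data-id="101">
        <nom>Zorro</nom>
        <metier>Danseur</metier>
    </user>
    <user data-id="102">
        <nom>Hulk</nom>
        <metier>Footballeur</metier>
    </user>
</users>"""


def describe(node, depth=0):
    """Return one line per element, indented according to its depth in the tree."""
    line = "  " * depth + node.tag
    if node.attrib:
        line += " " + str(dict(node.attrib))
    text = (node.text or "").strip()
    if text:
        line += ": " + text
    lines = [line]
    # Recurse into the children: each level of nesting adds one indentation step
    for child in node:
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(xml_text)
outline = describe(root)
print("\n".join(outline))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the XML document.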

Here is a small script that displays all the usernames.

You will first have to install the `lxml` package, for example with `pip install lxml`. It is a binding for the C libraries libxml2 and libxslt, which the prebuilt packages normally bundle; it does not depend on `numpy`. If you fail to import `lxml`, reinstalling it or trying another version of `lxml` is the first thing you can troubleshoot.

In [5]:
from lxml import etree
# Parse the source document into a tree
tree = etree.parse(filename)
# Looking at the document, the information we want is in a "nom" tag,
# which sits inside a "user" tag, itself inside the "users" tag
# located at the root of the document.
# So the XPath "/users/user/nom" selects every tag we are looking for
for user in tree.xpath("/users/user/nom"):
    # Display only the content (.text) of each /users/user/nom tag
    print(user.text)

Zorro
Hulk
Zidane
Beans
Batman
Spiderman


In [6]:
tree.xpath("/users/user/nom")[0].text

'Zorro'

In [7]:
# You can display the attributes of the tags that store this information
tree = etree.parse(filename)
for user in tree.xpath("/users/user"):
    print(user.get("data-id"))

101
102
103
104
105
106


You can refine the display by showing only the users whose job (`metier`) is `Veterinaire`.

In [8]:
tree = etree.parse(filename)
# The predicate in square brackets keeps only the users whose metier is 'Veterinaire'
for user in tree.xpath("/users/user[metier='Veterinaire']/nom"):
    print(user.text)

Batman
Spiderman
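
The attribute lookup and the filter shown above can be combined to build one record per matching user. The following is a sketch under stated assumptions: it uses an inline three-user extract of `data.xml` and the standard library's `xml.etree.ElementTree`, whose limited XPath support includes the `[tag='text']` predicate; the `vets` name is introduced here for illustration only.

```python
import xml.etree.ElementTree as ET

# Inline extract of data.xml so the sketch runs on its own
xml_text = """<users>
    <user data-id="104"><nom>Beans</nom><metier>Epicier</metier></user>
    <user data-id="105"><nom>Batman</nom><metier>Veterinaire</metier></user>
    <user data-id="106"><nom>Spiderman</nom><metier>Veterinaire</metier></user>
</users>"""

root = ET.fromstring(xml_text)

# ElementTree paths are relative to the root element, so "user" here
# plays the role of "/users/user" in the lxml examples.
vets = [
    {"id": user.get("data-id"), "nom": user.findtext("nom")}
    for user in root.findall("user[metier='Veterinaire']")
]
print(vets)
```

With `lxml`, the same predicate can be passed to `tree.xpath(...)` as in the cell above.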


# Web data scraping

For this section, you will have to install `beautifulsoup4` using

`pip install beautifulsoup4`

We saw earlier how to parse XML. It is also possible to **parse HTML**, and in my opinion the tool that does the job best is the `beautifulsoup` library.

Save a web page that you like (for example `www.becode.org`) in the `./assets` directory, and display the content of the resulting `.html` file.

Put the content of this page in a variable, for example `html_doc`.


In [10]:
becode_filename = "./assets/becode.html"
# Read the page as bytes; BeautifulSoup works out the encoding itself
with open(becode_filename, "rb") as file:
    html_doc = file.read()
html_doc

b'<!DOCTYPE HTML>\n<html lang="en">\n\n<head>\n\t<title>BeCode.org for Friends</title>\n\t<meta charset="utf-8" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1" />\n\t<!--[if lte IE 8]><script src="assets/js/ie/html5shiv.js"></script><![endif]-->\n\t<link rel="stylesheet" href="assets/css/main.css" />\n\t<!--[if lte IE 8]><link rel="stylesheet" href="assets/css/ie8.css" /><![endif]-->\n\t<!--[if lte IE 9]><link rel="stylesheet" href="assets/css/ie9.css" /><![endif]-->\n\t<meta property="og:type" content="business.business">\n\t<meta property="og:title" content="BeCode | Free Coding School">\n\t<meta property="og:description" content="Offering web development trainings for individuals.">\n\t<meta property="og:url" content="https://www.becode.org/partners/">\n\t<meta property="og:image" content="https://www.becode.org/partners/images/IngloriousBasterdz.jpg">\n\t<meta property="business:contact_data:country_name" content="Belgium">\n\t<link rel="canonical" href="h

In [11]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "lxml")
# Looking at the HTML of the saved page (becode.org), we can see that the main title is in an h1 tag

for p in soup.find_all('h1'):
    # We only retrieve the content ==> .text
    print(p.text)

BeCode.org


Do the same with H2 tags

In [12]:
for p in soup.find_all('h2'):
    print(p.text)

A school to address the skills gap in an inclusive way
A Belgian coding school, powered by a methodology with a proven track record
Realtime Stats
Our story
The team
Our partners
Board of directors
Here's how you can make a difference.
Goodies and Ready-To-Share Content


And now, do the same with the "p" tags

In [13]:
for p in soup.find_all('p'):
    print(p.text)


						Our mission : Enabling tomorrow's digital talents to blossom. 
						We believe that education makes anything possible. 
						Since 2017, BeCode has been offering free training courses for jobseekers to become web developers in partnership
						with Simplon. 

Here’s how you can help!
Opportunities and talents currently remain untapped due to a skills gap: 
					the gap between what employers need and what job seekers are offering today.
of Belgian youth are unemployed.
vacancies in Belgium by 2020 due to shortfall of e-skilled workers.
of employers blame inadequate training for the shortfall in skilled workers.

					Whereas a large amount of digital job vacancies remain unfilled - and that number will increase dramatically in
					the years to come -
					most of these professions don't require an engineering degree at all. 
					With a logical mindset, a lot of motivation and some basic training,
					anyone can learn to create mobile applications and websites and turn that
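
The paragraphs printed above keep the tabs and line breaks of the original HTML source. One possible way to tidy them is sketched below, assuming a small hypothetical HTML fragment that imitates the indentation of the saved page; it uses Python's built-in `html.parser` backend so that it does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the indentation found in the saved page
html_fragment = """
<div>
    <p>
            Our mission : Enabling tomorrow's digital talents to blossom.
            We believe that education makes anything possible.
    </p>
    <p>Here's how you can help!</p>
</div>
"""

# "html.parser" ships with Python; "lxml" would work the same way here
soup_fragment = BeautifulSoup(html_fragment, "html.parser")

# str.split() with no argument splits on any run of whitespace, so joining
# the pieces with single spaces collapses tabs and line breaks
paragraphs = [" ".join(p.get_text().split()) for p in soup_fragment.find_all("p")]
for text in paragraphs:
    print(text)
```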

### Scraping via HTTP requests

HTTP is a protocol that allows a client (you, through your browser for example) to communicate with a server connected to the network (the HTTP server software running on a site's machine, for example Apache).

Messages always go in pairs: the request (from the client) and the response (from the server).
If a request gets no response, a problem has occurred at some point in the network.
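
The request/response pair can be observed without touching the Internet. The sketch below is an illustration under stated assumptions, not part of the original notebook: it starts a tiny throwaway server on the local machine using only the standard library (the `HelloHandler` class and its fixed HTML body are hypothetical), sends it one GET request, and reads the matching response.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a fixed HTML body."""

    def do_GET(self):
        body = b"<h1>Hello</h1>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client side: one request goes out, one response comes back
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("GET", "/")
response = conn.getresponse()
status, body = response.status, response.read()
conn.close()

server.shutdown()
server.server_close()
print(status, body)
```

The `timeout=5` argument covers the failure case described above: if no response arrives, the client raises an error instead of waiting forever.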

The syntax of the client's request is always the same. It starts with a request line made of three parts:

Request line (`command`, `URL`, `protocol version`), for example `GET /about/ HTTP/1.1`



`command` is the method to use. It specifies the type of request and can take the following values:


- `"GET"`
This is the most common way to request a resource. A GET request should have no effect on the resource: it must be possible to repeat the request without side effects.


- `"HEAD"`
This method only asks for information about the resource (the header), without asking for the resource itself.


- `"POST"`
This method sends data to the server (a submitted form, for example) and must be used when a request modifies the resource.


- `"OPTIONS"`
This method allows you to obtain the communication options of a resource or the server in general.


- `"CONNECT"`
This method allows you to use a proxy as a communication tunnel.


- `"TRACE"`
This method asks the server to return what it has received, in order to test and diagnose the connection.


- `"PUT"`
This method allows you to add a resource to the server, or to replace the one already stored at the given URL.


- `"DELETE"`
This method allows you to delete a resource from the server.

I will only discuss the most common ones here: HEAD, GET and POST.
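
To see how the method fits into a request, here is a small sketch that formats a request line and builds HEAD, GET and POST requests without sending anything. It uses the standard library's `urllib.request` rather than `requests`, so it can run offline; `request_line` is a hypothetical helper and the form data `name=Zorro` is made up for the example.

```python
from urllib.request import Request


def request_line(method, path, version="HTTP/1.1"):
    """Format the first line of an HTTP request (hypothetical helper)."""
    return f"{method} {path} {version}"


print(request_line("GET", "/about/"))

# Building a Request object does not contact the server: nothing is sent
# until the object is passed to urllib.request.urlopen().
url = "https://www.becode.org/about/"
head_request = Request(url, method="HEAD")       # method chosen explicitly
get_request = Request(url)                       # no data: defaults to GET
post_request = Request(url, data=b"name=Zorro")  # data present: defaults to POST

methods = [r.get_method() for r in (head_request, get_request, post_request)]
print(methods)
```

With the `requests` library used in the rest of this section, the corresponding calls are `requests.head(url)`, `requests.get(url)` and `requests.post(url, data=...)`.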

### Putting it into practice

In [17]:
import requests
# URL of the website
url = 'https://www.becode.org/about/'
# Send an HTTP "GET" request to the server identified by the URL
r = requests.get(url)
# Display the requested URL and the status code returned by the server
print(url, r.status_code)
# Ask BeautifulSoup to parse the HTML returned by the server and keep the result in the soup variable
soup = BeautifulSoup(r.content, 'lxml')
soup

https://www.becode.org/about/ 200


<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link href="https://becode.org/wp/xmlrpc.php" rel="pingback"/>
<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
<script>var et_site_url='https://becode.org/wp';var et_post_id='52';function et_core_page_resource_fallback(a,b){"undefined"===typeof b&&(b=a.sheet.cssRules&&0===a.sheet.cssRules.length);b&&(a.onerror=null,a.onload=null,a.href?a.href=et_site_url+"/?et_core_page_resource="+a.id+et_post_id:a.src&&(a.src=et_site_url+"/?et_core_page_resource="+a.id+et_post_id))}
</script><link href="https://www.google-analytics.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://www.google-analytics.com/" rel="preconnect"/>
<title>About · BeCode</title>
<meta content="max-image-preview:large" name="robots"/>
<link href="https://becode.org/about/" hreflang="en" rel="alternate"/>
<link href="https://becode.org/nl/over-ons

We have thus retrieved the information from the site without physically saving it in a file, only in a variable!

Display the main title, the subtitles and the paragraphs again to convince yourself.

In [16]:
for p in soup.find_all('p'):
    print(p.text)

At BeCode, we are dreamers. We believe we can change the world, make it a better place. A more equal place, where everyone has access to a proper education, whatever their background. Therefore we provide qualitative, competitive and inclusive coding bootcamps, accessible to all.
 
First of all, we learn by doing, by applying our skills on concrete projects, by working in a team. We emphasise a lot the meta-learning: learning how to learn in a technical context as well as helping yourself by helping others! Although we play hard, strict rules have been elaborated in order to ease and protect the learning process for the group and help everyone develop the right soft skills: being a reliable team player, eager to learn and with a great solution mindset.

Our mission is to grow today’s untapped talents into tomorrow’s best developers. With the current shortage in the market, employers are more motivated than ever to opt for a diversified recruitment strategy, focusing on skills rather th