# Slide 1
# Automatic data collection on the Web

# Slide 2
Before we tackle real web pages (HTML), we will look at the XML language, which is a good introduction to web scraping.

### XML

XML was created to facilitate data exchange between machines and software.

XML is a language that is written using tags.

XML is a W3C recommendation, so it is a technology with strict rules to follow.

XML is intended to be understandable by everyone: people and machines alike.

XML allows us to create our own vocabulary using a set of customizable rules and tags.

XML is also compatible with the web so that data exchanges can be easily carried out over the Internet.

XML is therefore standardized, simple, but above all extensible and configurable so that any type of data can be described.

Here is an example of an XML document, saved as data.xml in the data directory.

Display its content.

In [1]:
# Open the file in a "with" block so that it is closed automatically
with open("../../04.File-handling/data/data.xml", "r") as file:
    print(file.read())

<?xml version="1.0" encoding="UTF-8"?>
<users>
    <user data-id="101">
        <nom>Zorro</nom>
        <metier>Danseur</metier>
    </user>
    <user data-id="102">
        <nom>Hulk</nom>
        <metier>Footballeur</metier>
    </user>
    <user data-id="103">
        <nom>Zidane</nom>
        <metier>Star</metier>
    </user>
    <user data-id="104">
        <nom>Beans</nom>
        <metier>Epicier</metier>
    </user>
    <user data-id="105">
        <nom>Batman</nom>
        <metier>Veterinaire</metier>
    </user>
    <user data-id="106">
        <nom>Spiderman</nom>
        <metier>Veterinaire</metier>
    </user>
</users>



The first line declares the XML version and the encoding; we always use UTF-8. Then we notice that the "users" tag contains several "user" tags, which in turn contain their own tags ("nom" and "metier"). The data is organized as a tree, and each node provides information.
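
To make the tree idea concrete, here is a minimal sketch that walks a shortened copy of the document and lists each node with its depth. It uses the standard library module `xml.etree.ElementTree` rather than lxml so that it runs without anything installed, and the `describe` helper is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# A shortened copy of data.xml, inlined so the sketch runs on its own.
xml_text = """<users>
    <user data-id="101">
        <nom>Zorro</nom>
        <metier>Danseur</metier>
    </user>
    <user data-id="102">
        <nom>Hulk</nom>
        <metier>Footballeur</metier>
    </user>
</users>"""


def describe(node, depth=0):
    """Return one indented line per node: tag, attributes, and text if any."""
    text = (node.text or "").strip()
    lines = ["  " * depth + f"{node.tag} {dict(node.attrib)} {text!r}"]
    # Iterating over an element yields its direct children.
    for child in node:
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(xml_text)
tree_lines = describe(root)
print("\n".join(tree_lines))
```

Each extra level of indentation in the printed result corresponds to one level of nesting in the XML file.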

Here is a small script that displays all the user names.

In [2]:
from lxml import etree
# Define the source document
tree = etree.parse("../../04.File-handling/data/data.xml")
# Identify the tag path that leads to the information we want.
# Each name is in a "nom" tag, itself nested in a "user" tag,
# which is nested in the "users" tag at the root of the document,
# so the XPath expression for our search is "/users/user/nom"
for user in tree.xpath("/users/user/nom"):
    # Display only the content (.text) of each /users/user/nom tag
    print(user.text)

Zorro
Hulk
Zidane
Beans
Batman
Spiderman


In [3]:
tree.xpath("/users/user/nom")[0].text

'Zorro'

In [4]:
# You can also read the attributes of a tag with .get()
tree = etree.parse("../../04.File-handling/data/data.xml")
for user in tree.xpath("/users/user"):
    print(user.get("data-id"))

101
102
103
104
105
106


You can refine the search to display only the users whose job (metier) is "Veterinaire".

In [5]:
tree = etree.parse("../../04.File-handling/data/data.xml")
# The predicate [metier='Veterinaire'] keeps only the users with that job
for user in tree.xpath("/users/user[metier='Veterinaire']/nom"):
    print(user.text)

Batman
Spiderman
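
The attribute lookup and the filter shown above can be combined. The following is a small sketch under a few assumptions: it uses the standard library `xml.etree.ElementTree` (which supports only a subset of XPath) instead of lxml, and it works on an inlined extract of data.xml rather than the file itself.

```python
import xml.etree.ElementTree as ET

# Three users taken from data.xml, inlined so the sketch is self-contained.
xml_text = """<users>
    <user data-id="104"><nom>Beans</nom><metier>Epicier</metier></user>
    <user data-id="105"><nom>Batman</nom><metier>Veterinaire</metier></user>
    <user data-id="106"><nom>Spiderman</nom><metier>Veterinaire</metier></user>
</users>"""

root = ET.fromstring(xml_text)

# ElementTree paths are relative to the element they are called on,
# so "./user[...]" plays the role of "/users/user[...]" in lxml.
vets = [
    {"id": user.get("data-id"), "nom": user.findtext("nom")}
    for user in root.findall("./user[metier='Veterinaire']")
]
print(vets)

# A predicate can also test an attribute instead of a child tag.
beans = root.find("./user[@data-id='104']/nom").text
print(beans)
```

Collecting the results in a list of dictionaries keeps the id and the name together, which is convenient if the data is later written to a file or loaded into a table.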


# Web scraping

We saw earlier how to parse XML. It is also possible to parse HTML, and in my opinion the tool that does the job best is the BeautifulSoup library.

Save a web page that you like (for example becode.org) in the data directory, and display its content (the xxx.html file).

Put the content of this page in a variable, for example html_doc.


In [6]:
with open("../../04.File-handling/data/becode.html", "r", encoding="utf8") as file:
    # Read and print only the first 10 lines of the page
    for _ in range(10):
        html_doc = file.readline()
        print(html_doc)


<!DOCTYPE HTML>

<html lang="en">



<head>

	<title>BeCode.org for Friends</title>

	<meta charset="utf-8" />

	<meta name="viewport" content="width=device-width, initial-scale=1" />

	<!--[if lte IE 8]><script src="assets/js/ie/html5shiv.js"></script><![endif]-->

	<link rel="stylesheet" href="assets/css/main.css" />

	<!--[if lte IE 8]><link rel="stylesheet" href="assets/css/ie8.css" /><![endif]-->



In [19]:
from bs4 import BeautifulSoup

# Read the whole saved page into html_doc; the "with" block closes the file
with open("../../04.File-handling/data/becode.html", "r", encoding="utf8") as file:
    html_doc = file.read()

soup = BeautifulSoup(html_doc, "lxml")
# Looking at the HTML of the saved becode.org page, we can see that the main title is in the h1 tag

for p in soup.find_all('h1'):
    # We only retrieve the content ==> .text
    print(p.text)

BeCode.org


Do the same with the "h2" tags.

In [20]:
for p in soup.find_all('h2'):
    # We only retrieve the content ==> .text
    print (p.text)

A school to address the skills gap in an inclusive way
A Belgian coding school, powered by a methodology with a proven track record
Realtime Stats
Our story
The team
Our partners
Board of directors
Here's how you can make a difference.
Goodies and Ready-To-Share Content


And now, do the same with the "p" tags

In [23]:
for n, p in enumerate(soup.find_all('p')[:3]):
    # We only retrieve the content ==> .text
    print (n, p.text.strip())

0 Our mission : Enabling tomorrow's digital talents to blossom. 
						We believe that education makes anything possible. 
						Since 2017, BeCode has been offering free training courses for jobseekers to become web developers in partnership
						with Simplon.
1 Here’s how you can help!
2 Opportunities and talents currently remain untapped due to a skills gap: 
					the gap between what employers need and what job seekers are offering today.
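
Beyond selecting by tag name, BeautifulSoup can also select by class and read attributes. Here is a minimal sketch on a made-up page (the HTML string below is invented for the example, not taken from becode.org), using the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# A tiny hypothetical page, inlined so the sketch runs without any file.
html_doc = """
<html><head><title>Demo page</title></head>
<body>
  <h1>Main title</h1>
  <p class="intro">First paragraph.</p>
  <p>Second paragraph with a <a href="https://www.becode.org/about/">link</a>.</p>
</body></html>
"""

# "html.parser" ships with Python, so this sketch does not need lxml.
soup = BeautifulSoup(html_doc, "html.parser")

# The <title> tag is reachable directly as an attribute of the soup.
title = soup.title.text
# class_ (with an underscore) filters on the HTML class attribute.
intro = soup.find("p", class_="intro").text
# .get() reads an attribute of a tag, here the target of each link.
links = [a.get("href") for a in soup.find_all("a")]
# .text gathers the text of a tag and of the tags nested inside it.
paragraphs = [p.text for p in soup.find_all("p")]

print(title)
print(intro)
print(links)
print(paragraphs)
```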


### Scraping via HTTP requests

HTTP is a protocol, a kind of language that allows the client (you, through your browser for example) to communicate with a server connected to the network (the HTTP server installed on a site's machine, for example Apache).

Exchanges always come in pairs: the request (from the client) and the response (from the server).
If a request gets no response, a problem has occurred somewhere in the network.

The syntax of the request (= client request) is always the same:
- Command line (Command, URL, Protocol version)
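
As a rough illustration of that first line, the sketch below builds and splits a request line by hand. The helper names `build_request_line` and `parse_request_line` are made up for this example, and `HTTP/1.1` is an assumed protocol version; in practice a library such as requests writes this line for you.

```python
def build_request_line(method, path, version="HTTP/1.1"):
    """Assemble the first line of an HTTP request: command, URL, protocol version."""
    return f"{method} {path} {version}"


def parse_request_line(line):
    """Split a request line back into its three space-separated parts."""
    method, target, version = line.split(" ", 2)
    return {"method": method, "target": target, "version": version}


line = build_request_line("GET", "/about/")
print(line)
print(parse_request_line(line))
```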

Command is the method to use. It specifies the type of request and can take the following values:


GET
This is the most common way to request a resource. A GET request should have no effect on the resource: it must be possible to repeat it without side effects.

HEAD
This method only asks for information about the resource, without asking for the resource itself.

POST
This method sends data to the server for processing. It is typically used when a request creates or modifies a resource.

OPTIONS
This method allows you to obtain the communication options of a resource or the server in general.

CONNECT
This method allows you to use a proxy as a communication tunnel.

TRACE
This method asks the server to return what it has received, in order to test and diagnose the connection.

PUT
This method allows you to add a resource to the server at the given URL, or to replace the one already there.

DELETE
This method allows you to delete a resource from the server.

I will only discuss the most common ones here: HEAD, GET and POST.
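
To see the difference between GET and HEAD without touching a real site, here is a minimal sketch that starts a throwaway server on your own machine with the standard library and queries it with `urllib` (used here instead of requests only so that the sketch needs nothing installed). The `HelloHandler` class and the page it serves are hypothetical.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET or HEAD with a tiny HTML page."""

    body = b"<h1>Hello</h1>"

    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(self.body)))
        self.end_headers()

    def do_GET(self):
        # GET: headers followed by the resource itself.
        self._send_headers()
        self.wfile.write(self.body)

    def do_HEAD(self):
        # HEAD: the same headers, but no body.
        self._send_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

# An empty ProxyHandler makes sure the local request bypasses any proxy.
opener = build_opener(ProxyHandler({}))

with opener.open(Request(url, method="GET"), timeout=5) as response:
    get_status, get_body = response.status, response.read()

with opener.open(Request(url, method="HEAD"), timeout=5) as response:
    head_status, head_body = response.status, response.read()

server.shutdown()
server.server_close()

print(get_status, get_body)
print(head_status, head_body)
```

Both requests receive a response with a status code, but only the GET response carries the page content; the HEAD response body is empty.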

### Putting it into practice

In [31]:
import requests
from bs4 import BeautifulSoup
# URL of the website
url = 'https://www.becode.org/about/'
# Send an HTTP GET request to the server identified in the URL
r = requests.get(url)
# Display the requested URL and the status code returned by the server
print(url, r.status_code)
# Ask BeautifulSoup to parse the HTML of the response and keep it in the soup variable
soup = BeautifulSoup(r.content, 'lxml')
# Display the first 10 lines of the parsed page
soup_list = str(soup).split('\n')
for i in soup_list[0:10]:
    print(i)

https://www.becode.org/about/ 200
<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link href="https://becode.org/wp/xmlrpc.php" rel="pingback"/>
<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
<script>var et_site_url='https://becode.org/wp';var et_post_id='52';function et_core_page_resource_fallback(a,b){"undefined"===typeof b&&(b=a.sheet.cssRules&&0===a.sheet.cssRules.length);b&&(a.onerror=null,a.onload=null,a.href?a.href=et_site_url+"/?et_core_page_resource="+a.id+et_post_id:a.src&&(a.src=et_site_url+"/?et_core_page_resource="+a.id+et_post_id))}


We have thus retrieved the information from the site without physically saving it in a file, only in a variable!

To convince yourself, display the main title, the subtitles and the paragraphs again.

In [32]:
for h1 in soup.find_all('h1'):
    print(h1)

<h1>Passion for learning</h1>


In [33]:
for h2 in soup.find_all('h2'):
    print(h2)

<h2>Our mission</h2>
<h2>BeCode Pedagogical Framework</h2>
<h2>Education is in Our Blood</h2>
<h2 class="et_pb_module_header">Meet the team</h2>
<h2>Our campuses</h2>
<h2>A thousand different stories</h2>
<h2>Partners</h2>
<h2>Public partners</h2>
<h2>Private partners</h2>
<h2>Educational partners</h2>


In [34]:
for p in soup.find_all('p')[:5]:
    print(p)

<p><span style="font-weight: 400;">At BeCode, we are dreamers. We believe we can change the world, make it a better place. A more equal place, where everyone has access to a proper education, whatever their background.</span></p>
<p><b><i>Therefore we</i></b> <b><i>provide qualitative, competitive and inclusive coding bootcamps, accessible to all</i></b><span style="font-weight: 400;">.</span></p>
<p>Our mission is to <strong>grow today’s talented – and especially vulnerable – professionals into tomorrow’s best developers</strong>.</p>
<p>With the current shortage in the market, employers are more motivated than ever to opt for a diversified recruitment strategy, focusing on skills rather than diplomas. In addition, these professions offer well-paying, interesting and long-term career opportunities for all those entering the industry today.</p>
<p>We therefore want to help <strong>bridge the gap between motivated job seekers and the employer market</strong>, by using the shortage of di