# Statistical Software for Data Scientists
## Tutorial 4 - Web Scraping in Python
### December 3rd, 2019

Before starting, import the needed libraries. 

In [1]:
import urllib
import bs4 # BeautifulSoup library for web scraping

First, we will scrape data from the Wikipedia page for the 2016-2017 season of the French football league (Ligue 1).

**Step 1:** Connect to the Wikipedia page and obtain its source code by sending a request to the Wikipedia server at the page's URL.

In [2]:
url_ligue_1 = "https://fr.wikipedia.org/wiki/Championnat_de_France_de_football_2016-2017"

from urllib import request
request_text = request.urlopen(url_ligue_1).read()
print(request_text[:1000]) # print a part of the returned response body.

b'<!DOCTYPE html>\n<html class="client-nojs" lang="fr" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Championnat de France de football 2016-2017 \xe2\x80\x94 Wikip\xc3\xa9dia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":[",\\t.","\xc2\xa0\\t,"],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","janvier","f\xc3\xa9vrier","mars","avril","mai","juin","juillet","ao\xc3\xbbt","septembre","octobre","novembre","d\xc3\xa9cembre"],"wgMonthNamesShort":["","janv.","f\xc3\xa9v.","mars","avr.","mai","juin","juill.","ao\xc3\xbbt","sept.","oct.","nov.","d\xc3\xa9c."],"wgRequestId":"XeOJKQpAADoAAHlYY-cAAAAB","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Championnat_de_France_de_football_2016-2017","wgTitle":"Championnat de France de football 2016-2017","wgCurRevisionId":164181571,"wgRevisionId":164181571,"wgArticleId":9734718,"wgIs
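The `b'...'` prefix shows that `read()` returns raw bytes rather than text; sequences such as `\xc3\xa9` are the UTF-8 encoding of accented characters. Below is a minimal offline sketch of the decoding step, assuming the page is UTF-8 encoded as its `<meta charset>` tag declares. The names `raw_bytes` and `decoded_text` are illustrative, and the byte string is a short excerpt copied from the output above, not a live response.

```python
# Short excerpt of the response body shown above, as raw bytes.
raw_bytes = b"<title>Championnat de France de football 2016-2017 \xe2\x80\x94 Wikip\xc3\xa9dia</title>"

# Decode the bytes into text, assuming UTF-8 (the charset declared by the page).
decoded_text = raw_bytes.decode("utf-8")

print(type(raw_bytes).__name__, "->", type(decoded_text).__name__)
print(decoded_text)
```

BeautifulSoup accepts either bytes or text, so this step is optional before parsing; it mainly helps when inspecting the response by eye.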

**Step 2:** Use the BeautifulSoup package to pull data from the returned HTML body. BeautifulSoup parses the body into a tree of HTML tags that we can search.

In [4]:
page = bs4.BeautifulSoup(request_text, "lxml")
print(page)

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="fr">
<head>
<meta charset="utf-8"/>
<title>Championnat de France de football 2016-2017 — Wikipédia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":[",\t."," \t,"],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","janvier","février","mars","avril","mai","juin","juillet","août","septembre","octobre","novembre","décembre"],"wgMonthNamesShort":["","janv.","fév.","mars","avr.","mai","juin","juill.","août","sept.","oct.","nov.","déc."],"wgRequestId":"XeOJKQpAADoAAHlYY-cAAAAB","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Championnat_de_France_de_football_2016-2017","wgTitle":"Championnat de France de football 2016-2017","wgCurRevisionId":164181571,"wgRevisionId":164181571,"wgArticleId":9734718,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGro
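The `"lxml"` parser used above is a separate package that has to be installed alongside BeautifulSoup. As a hedged alternative, the sketch below parses a small hand-written HTML string (an assumption standing in for the real page) with Python's built-in `"html.parser"`, and prints an indented version of the tree instead of the raw document.

```python
import bs4

# Hand-written stand-in for the downloaded page (not the real Wikipedia source).
sample_html = "<html><head><title>Ligue 1</title></head><body><p>Hello</p></body></html>"

# "html.parser" ships with Python, so no extra parser package is required.
sample_page = bs4.BeautifulSoup(sample_html, "html.parser")

# The result is a BeautifulSoup object (a tree of tags), not a plain string.
print(type(sample_page).__name__)
print(sample_page.prettify())
```

The two parsers can build slightly different trees for malformed HTML, so results on a real page may not be identical.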

If we print the object *page* created with BeautifulSoup, we see that it is not a plain character string but an HTML page with tags. We can now search for elements inside those tags. For instance, to get the title of the page, we use the method **.find** and ask for **title**.

In [5]:
print(page.find("title"))

<title>Championnat de France de football 2016-2017 — Wikipédia</title>


**NOTE:** The method **.find** returns only the first occurrence of the element.

In [6]:
print(page.find("table"))

<table><caption style="background:#99cc99;color:#000000;">Généralités</caption><tbody><tr>
<th scope="row" style="width:10.5em;">Sport</th>
<td>
<a href="/wiki/Football" title="Football">Football</a></td>
</tr>
<tr>
<th scope="row" style="width:10.5em;">Organisateur(s)</th>
<td>
<a href="/wiki/Ligue_de_football_professionnel" title="Ligue de football professionnel">LFP</a></td>
</tr>
<tr>
<th scope="row" style="width:10.5em;">Édition</th>
<td>
<abbr class="abbr" title="Quatre-vingt-quatrième (huitante-quatrième / octante-quatrième)">84<sup>e</sup></abbr></td>
</tr>
<tr>
<th scope="row" style="width:10.5em;">Lieu(x)</th>
<td>
<span class="datasortkey" data-sort-value="France"><span class="flagicon"><a class="image" href="/wiki/Fichier:Flag_of_France.svg" title="Drapeau de la France"><img alt="Drapeau de la France" class="noviewer thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="13" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Flag_of_France.

To find all occurrences, we need to use the method **findAll( )**.

In [8]:
print("There are", len(page.findAll("table")), "table elements in this page.")

There are 32 table elements in this page.
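The contrast between the two methods can be sketched on a small hand-written fragment (an assumption, not the real page). The sketch uses `find_all`, which is the spelling used in the current BeautifulSoup documentation; `findAll` is the older alias for the same method.

```python
import bs4

# Hand-written fragment with two tables.
sample_html = """
<table><caption>First</caption></table>
<table><caption>Second</caption></table>
"""
sample_page = bs4.BeautifulSoup(sample_html, "html.parser")

first_table = sample_page.find("table")      # first match only
every_table = sample_page.find_all("table")  # list-like collection of all matches
missing = sample_page.find("video")          # None when nothing matches

print(first_table.caption.get_text())
print(len(every_table))
print(missing)
```

Because `find` returns `None` when there is no match, it is worth checking the result before chaining further calls on it.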


In [9]:
all_tables = page.findAll("table")

print("The 2nd table element of the page is: Hierarchie \n", all_tables[1])
print("--------------------------------------------------------------------")
print("The 3rd table element of the page is: Palmares \n", all_tables[2])

The 2nd table element of the page is: Hierarchie 
 <table><caption style="background:#99cc99;color:#000000;">Hiérarchie</caption><tbody><tr>
<th scope="row" style="width:10.5em;">Hiérarchie</th>
<td>
<abbr class="abbr" title="Premier">1<sup>er</sup></abbr> échelon</td>
</tr>
<tr>
<th scope="row" style="width:10.5em;">Niveau inférieur</th>
<td>
<a href="/wiki/Championnat_de_France_de_football_de_Ligue_2_2016-2017" title="Championnat de France de football de Ligue 2 2016-2017">Ligue 2 2016-2017</a></td>
</tr></tbody></table>
--------------------------------------------------------------------
The 3rd table element of the page is: Palmares 
 <table><caption style="background:#99cc99;color:#000000;">Palmarès</caption>
<tbody><tr>
<th scope="row" style="width:10.5em;">Tenant du titre</th>
<td>
<a href="/wiki/Paris_Saint-Germain_Football_Club" title="Paris Saint-Germain Football Club">Paris Saint-Germain</a> (6)</td>
</tr>
<tr>
<th scope="row" style="width:10.5em;">Promu(s) en début de saiso
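Indexing `all_tables` by position stops working as soon as the page layout changes. One possible alternative, sketched here on a hand-written fragment that only mimics the structure above, is to select a table by the text of its `<caption>`. The helper name `find_table_by_caption` is hypothetical.

```python
import bs4

# Hand-written fragment mimicking the captioned tables of the page.
sample_html = """
<table><caption>Généralités</caption><tr><th>Sport</th><td>Football</td></tr></table>
<table><caption>Hiérarchie</caption><tr><th>Hiérarchie</th><td>1er échelon</td></tr></table>
<table><tr><td>no caption</td></tr></table>
"""
sample_page = bs4.BeautifulSoup(sample_html, "html.parser")


def find_table_by_caption(soup, caption_text):
    """Return the first <table> whose caption equals caption_text, else None."""
    for table in soup.find_all("table"):
        caption = table.find("caption")
        # Tables without a caption are skipped.
        if caption is not None and caption.get_text(strip=True) == caption_text:
            return table
    return None


hierarchy_table = find_table_by_caption(sample_page, "Hiérarchie")
print(hierarchy_table.find("td").get_text())
```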

---

### Exercise: Find the list of teams in Ligue 1

The list of teams is in the table "Participants" in the source code. By inspecting the source, we can see that it is the table with `class = "DebutCarte"`. 

Therefore, to find this table, we have to search for the table with this specific CSS class. 

Furthermore, we notice that the HTML tags around the names of the clubs have the following form: 

`<a href = "club_url" title = "club_name">Name_of_Club</a>`

Thus we will also have to extract from the table all the `<a>` elements that match this pattern.

In [10]:
for item in page.find('table', {'class' : 'DebutCarte'}).findAll({'a'})[0:5] :
    print(item, "\n-------------")

<a class="image" href="/wiki/Fichier:France_location_map-Regions-2016.svg"><img alt="France location map-Regions-2016.svg" data-file-height="1922" data-file-width="2000" decoding="async" height="288" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b1/France_location_map-Regions-2016.svg/300px-France_location_map-Regions-2016.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b1/France_location_map-Regions-2016.svg/450px-France_location_map-Regions-2016.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b1/France_location_map-Regions-2016.svg/600px-France_location_map-Regions-2016.svg.png 2x" width="300"/></a> 
-------------
<a href="/wiki/Paris_Saint-Germain_Football_Club" title="Paris Saint-Germain Football Club">Paris SG</a> 
-------------
<a href="/wiki/Association_sportive_de_Monaco_football_club" title="Association sportive de Monaco football club">AS Monaco FC</a> 
-------------
<a href="/wiki/Olympique_lyonnais" title="Olympique lyonnais">Olympiq
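The dictionary `{'class' : 'DebutCarte'}` filters on the `class` attribute. BeautifulSoup also accepts the `class_` keyword argument (with a trailing underscore, because `class` is a reserved word in Python). Here is a small sketch on a hand-written fragment, not the real page; the file name in the image link is made up.

```python
import bs4

# Hand-written fragment: one unrelated table and one table with the target class.
sample_html = """
<table class="infobox"><tr><td><a href="/wiki/Football" title="Football">Football</a></td></tr></table>
<table class="DebutCarte"><tr><td>
  <a class="image" href="/wiki/Fichier:map.svg"><img alt="map"/></a>
  <a href="/wiki/Olympique_lyonnais" title="Olympique lyonnais">Olympique lyonnais</a>
</td></tr></table>
"""
sample_page = bs4.BeautifulSoup(sample_html, "html.parser")

# Two ways of asking for the table that carries the CSS class "DebutCarte".
by_dict = sample_page.find("table", {"class": "DebutCarte"})
by_keyword = sample_page.find("table", class_="DebutCarte")

print(by_dict is by_keyword)

# Searching from the table restricts the results to links inside it.
links = by_keyword.find_all("a")
print(len(links))
```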

The result above contains one element that corresponds to an image rather than to a club, and we do not want it. This element is the only one without a `title` attribute (an HTML attribute of the tag, not a CSS property). 

Rather than excluding unwanted elements by their position in the list, we can keep only the elements that have the attribute we expect, here `title`.

In [12]:
# only print the items that have a title attribute
for item in page.find('table', {'class' : 'DebutCarte'}).findAll({'a'})[0:5]:
    if item.get('title') : 
        print(item)

<a href="/wiki/Paris_Saint-Germain_Football_Club" title="Paris Saint-Germain Football Club">Paris SG</a>
<a href="/wiki/Association_sportive_de_Monaco_football_club" title="Association sportive de Monaco football club">AS Monaco FC</a>
<a href="/wiki/Olympique_lyonnais" title="Olympique lyonnais">Olympique lyonnais</a>
<a href="/wiki/Stade_rennais_football_club" title="Stade rennais football club">Stade rennais FC</a>
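The same kind of filter can be pushed into the search itself: passing `title=True` to `find_all` keeps only the tags that carry a `title` attribute. The sketch below compares both approaches on a hand-written fragment (an assumption, not the real page), where they give the same result.

```python
import bs4

# Hand-written fragment: one image link without a title, two club links with one.
sample_html = """
<table class="DebutCarte"><tr><td>
  <a class="image" href="/wiki/Fichier:map.svg"><img alt="map"/></a>
  <a href="/wiki/Paris_Saint-Germain_Football_Club" title="Paris Saint-Germain Football Club">Paris SG</a>
  <a href="/wiki/Olympique_lyonnais" title="Olympique lyonnais">Olympique lyonnais</a>
</td></tr></table>
"""
sample_page = bs4.BeautifulSoup(sample_html, "html.parser")
table = sample_page.find("table", class_="DebutCarte")

# Explicit test on each link, as in the cell above.
with_loop = [a for a in table.find_all("a") if a.get("title")]

# Filter expressed as a search argument: keep <a> tags that have a title attribute.
with_filter = table.find_all("a", title=True)

print([a.get_text() for a in with_filter])
```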


Finally, we want to obtain the name and URL of the 20 clubs. For this, we use two methods of each `item` element: 
- `getText()`, which returns the text displayed on the webpage inside the `<a>` tag.
- `get('xxxx')`, which returns the value of the attribute named xxxx. 

In our case, we want the name of the club and its URL, so we use `getText()` and `get("href")`.

In [13]:
for item in page.find('table', {'class' : 'DebutCarte'}).findAll({'a'})[0:5] :
    if item.get("title") :
        print(item.get("href"))
        print(item.getText())

/wiki/Paris_Saint-Germain_Football_Club
Paris SG
/wiki/Association_sportive_de_Monaco_football_club
AS Monaco FC
/wiki/Olympique_lyonnais
Olympique lyonnais
/wiki/Stade_rennais_football_club
Stade rennais FC
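The `href` values returned here are relative paths, so they cannot be requested as they are. A minimal sketch of turning them into absolute URLs with the standard library function `urllib.parse.urljoin`, using the page URL from Step 1 as the base (the list `relative_links` is copied by hand from the output above):

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from.
url_ligue_1 = "https://fr.wikipedia.org/wiki/Championnat_de_France_de_football_2016-2017"

# Relative links, copied from the output above.
relative_links = [
    "/wiki/Paris_Saint-Germain_Football_Club",
    "/wiki/Olympique_lyonnais",
]

# Resolve each relative path against the base URL.
absolute_links = [urljoin(url_ligue_1, href) for href in relative_links]

for link in absolute_links:
    print(link)
```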


To obtain the official name, we read the `title` attribute of each `<a>` tag: 

In [15]:
for item in page.find('table', {'class' : 'DebutCarte'}).findAll({'a'}): 
    if item.get('title') :
        print(item.get('title'))

Paris Saint-Germain Football Club
Association sportive de Monaco football club
Olympique lyonnais
Stade rennais football club
Olympique gymnaste club Nice Côte d'Azur
Association sportive de Saint-Étienne
Dijon Football Côte-d'Or
Angers sporting club de l'Ouest
LOSC Lille
Stade Malherbe Caen Calvados Basse-Normandie
Association sportive Nancy-Lorraine
Football Club de Nantes
Montpellier Hérault Sport Club
Football Club des Girondins de Bordeaux
Sporting Club de Bastia
En Avant de Guingamp
Football Club Lorient-Bretagne Sud
Olympique de Marseille
Football Club de Metz
Toulouse Football Club
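Printing is convenient for exploring, but for later analysis it helps to keep the results in a data structure. The sketch below gathers the short name, official name and absolute URL of each club into a list of dictionaries. It runs on a hand-written fragment rather than the real page, and the function name `extract_clubs` and the dictionary keys are hypothetical choices.

```python
import bs4
from urllib.parse import urljoin

base_url = "https://fr.wikipedia.org/"

# Hand-written fragment mimicking the "DebutCarte" table.
sample_html = """
<table class="DebutCarte"><tr><td>
  <a class="image" href="/wiki/Fichier:map.svg"><img alt="map"/></a>
  <a href="/wiki/Paris_Saint-Germain_Football_Club" title="Paris Saint-Germain Football Club">Paris SG</a>
  <a href="/wiki/Olympique_lyonnais" title="Olympique lyonnais">Olympique lyonnais</a>
</td></tr></table>
"""


def extract_clubs(soup, base):
    """Return one dictionary per titled link found in the DebutCarte table."""
    table = soup.find("table", class_="DebutCarte")
    if table is None:  # the table may be missing if the page layout changes
        return []
    clubs = []
    for link in table.find_all("a"):
        if link.get("title"):  # skip links without a title, such as the image
            clubs.append({
                "short_name": link.get_text(),
                "official_name": link.get("title"),
                "url": urljoin(base, link.get("href")),
            })
    return clubs


clubs = extract_clubs(bs4.BeautifulSoup(sample_html, "html.parser"), base_url)
for club in clubs:
    print(club)
```

On the real page, checking that `len(clubs)` equals 20 would be a quick sanity test of the extraction.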


---