# Web scraping with Python

Lino Galiana  
2025-12-26

<div class="badge-container"><div class="badge-text">If you want to try the examples in this tutorial:</div><a href="https://github.com/linogaliana/python-datascientist-notebooks/blob/main/notebooks/en/manipulation/04a_webscraping_TP.ipynb" target="_blank" rel="noopener"><img src="https://img.shields.io/static/v1?logo=github&label=&message=View%20on%20GitHub&color=181717" alt="View on GitHub"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/vscode-python?autoLaunch=true&name=«04a_webscraping_TP»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-vscode.sh»&init.personalInitArgs=«en/manipulation%2004a_webscraping_TP»" target="_blank" rel="noopener"><img src="https://custom-icon-badges.demolab.com/badge/SSP%20Cloud-Lancer_avec_VSCode-blue?logo=vsc&logoColor=white" alt="Onyxia"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/jupyter-python?autoLaunch=true&name=«04a_webscraping_TP»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-jupyter.sh»&init.personalInitArgs=«en/manipulation%2004a_webscraping_TP»" target="_blank" rel="noopener"><img src="https://img.shields.io/badge/SSP%20Cloud-Lancer_avec_Jupyter-orange?logo=Jupyter&logoColor=orange" alt="Onyxia"></a>
<a href="https://colab.research.google.com/github/linogaliana/python-datascientist-notebooks-colab//en/blob/main//notebooks/en/manipulation/04a_webscraping_TP.ipynb" target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br></div>



<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Skills you will acquire in this chapter
</div>
</div>
<div class="callout-body-container callout-body">

-   Understand the key challenges of web scraping, including legal concerns (e.g. GDPR, grey areas), site stability, and data reliability  
-   Follow best practices when scraping: check the `robots.txt` file, space out your requests, avoid overloading servers, and scrape during off-peak hours when possible  
-   Navigate the HTML structure of a web page (tags, parent-child relationships) to accurately target the elements you want to extract  
-   Use the `requests` library to fetch web page content, and `BeautifulSoup` to parse and explore the HTML using methods like `find` and `find_all`  
-   Practice your scraping skills with a hands-on exercise involving the French Ligue 1 football team list  
-   Explore Selenium for simulating user interactions on JavaScript-driven dynamic pages  
-   Understand the limitations of web scraping and know when it’s better to use more stable and reliable APIs  

</div>
</div>

[*Web scraping*](https://en.wikipedia.org/wiki/Web_scraping) refers to techniques for extracting content from websites.
It is a very useful practice for anyone looking to work with information available online, but not necessarily in the form of an *Excel* table.

This chapter introduces you to how to create and run bots to quickly retrieve useful information for your current or future projects.
It starts with some concrete use cases.
This chapter is heavily inspired by and adapted from the work of [Xavier Dupré](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/notebooks/TD2A_Eco_Web_Scraping.html), who previously taught this course.

# 1. Issues

A number of issues related to *web scraping* will only be briefly mentioned in this chapter.

## 1.1 The Legal Gray Area of *Web Scraping*

First, regarding the legality of retrieving information through *scraping*, there is a gray area. Just because information is available on the internet, either directly or with a little searching, does not mean it can be retrieved and reused.

The excellent [course by Antoine Palazzolo](https://inseefrlab.github.io/formation-webscraping/) discusses several media and legal cases on this issue. In France, the CNIL published new guidelines on *web scraping* in 2020, clarifying that data cannot be reused without the knowledge of the person it concerns. In other words, in principle, data collected by *web scraping* is subject to the GDPR, meaning that reusing it requires the consent of the individuals concerned.

It is therefore recommended to **be cautious with the data retrieved** by *web scraping* to avoid legal issues.

## 1.2 Stability and Reliability of Retrieved Information

Data retrieval through *web scraping* is certainly practical, but it does not necessarily align with the use intended or desired by a data provider. Since data is costly to collect and make available, some sites do not want it to be extracted freely and easily, especially when the data could provide a competitor with commercially useful information (e.g., the price of a competing product).

As a result, companies often implement strategies to block or limit the amount of data scraped. The most common method is detecting and blocking requests made by bots rather than humans. For specialized entities, this detection is quite easy because numerous indicators can identify whether a website visit comes from a human user behind a browser or a bot. To mention just a few clues: browsing speed between pages, speed of data extraction, fingerprinting of the browser used, ability to answer random questions (captcha)…

The best practices mentioned later aim to ensure that a bot behaves civilly by adopting behavior close to that of a human without pretending to be one.

It’s also essential to be cautious about the information received through *web scraping*. Since data is central to some business models, some companies don’t hesitate to send false data to bots rather than blocking them. It’s fair play! Another trap technique is called the *honey pot*. These are pages that a human would never visit (for example, because they don’t appear in the graphical interface) but where a bot, automatically searching for content, might get stuck.

Even when a site does not try to block *web scraping*, other reasons can explain why a data retrieval that worked in the past no longer works. The most frequent reason is a change in the structure of a website. *Web scraping* has the disadvantage of retrieving information from a very hierarchical structure, so a change in this structure can make a bot incapable of retrieving content. Moreover, to remain attractive, websites change frequently, which can easily render a bot inoperative.

In general, one of the key takeaways from this chapter is that **web scraping is a last resort solution for occasional data retrieval without any guarantee of future functionality**. It is preferable to **favor APIs when they are available**. The latter resemble a contract (formal or not) between a data provider and a user, where needs (the data) and access conditions (number of requests, volume, authentication…) are defined, whereas *web scraping* is more akin to behavior in the *Wild West*.

## 1.3 Best Practices

The ability to retrieve data through a bot does not mean one can afford to be uncivilized. Indeed, when uncontrolled, *web scraping* can resemble a classic cyberattack aimed at taking down a website: a denial of service. The [course by Antoine Palazzolo](https://inseefrlab.github.io/formation-webscraping/) reviews some best practices that have emerged in the scraping community. It is recommended to read this resource to learn more about this topic. Several conventions are discussed, including:

-   Navigate from the site’s root to the `robots.txt` file to check the guidelines provided by the website’s developers to regulate the behavior of bots;
-   Space out requests by several seconds, as a human would, to avoid overloading the website and causing it to crash due to a denial of service;
-   Make requests during the website’s off-peak hours if it is not an internationally accessed site. For example, for a French-language site, running the bot during the night in metropolitan France is a good practice. To run a `Python` bot at a scheduled time, you can use `cron` jobs.
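The first two conventions can be sketched with the standard library alone. This is a minimal illustration, not a complete crawler: the `robots.txt` content, the `example.org` URLs and the helper name `allowed_urls` are assumptions made up for the demonstration, and the rules are parsed from an inline string so that nothing is sent over the network.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "Python for data science tutorial"

# Hypothetical robots.txt content, inlined so that the example runs offline.
# On a real site, you would call parser.set_url(".../robots.txt") then parser.read().
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def allowed_urls(urls, user_agent=USER_AGENT):
    """Keep only the URLs that robots.txt allows this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical URLs used only for the demonstration
candidate_urls = [
    "https://example.org/public/page.html",
    "https://example.org/private/data.html",
]

# Delay requested by the site, with a fallback of a few seconds if none is given
delay = parser.crawl_delay(USER_AGENT) or 3

for url in allowed_urls(candidate_urls):
    print(f"Would fetch {url}, then wait {delay} seconds")
    # In a real bot: fetch the page here, then call time.sleep(delay)
```

Only the public page passes the filter, and the delay comes from the `Crawl-delay` directive when the site provides one.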

# 2. A Detour to the Web: How Does a Website Work?

Even though this lab doesn’t aim to provide a web course, you still need some basics on how a website works to understand how information is structured on a page.

A website is a collection of pages coded in *HTML* that describe both the content and the layout of a *Web* page.

To see this, open any web page and right-click on it.

-   On `Chrome` <i class="fab fa-chrome"></i>: click on *“View page source”* (<kbd>CTRL</kbd>+<kbd>U</kbd>);
-   On `Firefox` <i class="fab fa-firefox"></i>: *“View Page Source”* (<kbd>CTRL</kbd>+<kbd>U</kbd>);
-   On `Edge` <i class="fab fa-edge"></i>: *“View page source”* (<kbd>CTRL</kbd>+<kbd>U</kbd>);
-   On `Safari` <i class="fab fa-safari"></i>: see how to do it [here](https://www.wikihow.com/View-Source-Code).

If you know which element interests you, you can also open the browser’s inspector (right-click on the element + “Inspect”) to display the tags surrounding your element more ergonomically, like a zoom.

## 2.1 Tags

On a web page, you will always find elements like `<head>`, `<title>`, etc. These are the codes that allow you to structure the content of an *HTML* page and are called **tags**.
For example, tags include `<p>`, `<h1>`, `<h2>`, `<h3>`, `<strong>`, or `<em>`.
The symbol `< >` is a tag: it indicates the beginning of a section. The symbol `</ >` indicates the end of that section. Most tags come in pairs, with an *opening tag* and a *closing tag* (e.g., `<p>` and `</p>`).

For example, the main tags defining the structure of a table are as follows:

| Tag         | Description          |
|-------------|----------------------|
| `<table>`   | Table                |
| `<caption>` | Table title          |
| `<tr>`      | Table row            |
| `<th>`      | Header cell          |
| `<td>`      | Cell                 |
| `<thead>`   | Table header section |
| `<tbody>`   | Table body section   |
| `<tfoot>`   | Table footer section |

### 2.1.1 Application: A Table in HTML

The following `HTML` code:

``` html
<table>
    <caption> My table title </caption>
    <tr>
        <th>Name</th>
        <th>Occupation</th>
    </tr>
    <tr>
        <td>Astérix</td>
        <td></td>
    </tr>
    <tr>
        <td>Obélix</td>
        <td>Menhir carver</td>
    </tr>
</table>
```

is rendered in the browser as:


| Name    | Occupation    |
|---------|---------------|
| Astérix |               |
| Obélix  | Menhir carver |

My table title


### 2.1.2 Parent and Child

In HTML, the terms *parent* and *child* refer to elements nested within each other. In the following construction, for example:

``` html
<div>
    <p>
       bla,bla
    </p>
</div>
```

On the web page, it will appear as follows:


<div>
    <p>
       bla,bla
    </p>
</div>


One would say that the `<div>` element is the parent of the `<p>` element, while the `<p>` element is the child of the `<div>` element.

> *But why learn this for “scraping”?*

Because to effectively retrieve information from a website, you need to understand its structure and, therefore, its HTML code. The `Python` functions used for *scraping* are primarily designed to help you navigate between tags.
With `Python`, you will essentially replicate your manual search behavior to automate it.
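
To make the parent/child idea concrete, here is a minimal sketch using only the standard library's `html.parser` module (not `BeautifulSoup`, which is introduced in the next section). The class name `ParentTracker` is a hypothetical helper written for this illustration: it keeps a stack of the tags currently open and records, for each new tag, which tag encloses it.

```python
from html.parser import HTMLParser


class ParentTracker(HTMLParser):
    """Record (child, parent) pairs while reading an HTML snippet."""

    def __init__(self):
        super().__init__()
        self.stack = []      # tags currently open, outermost first
        self.relations = []  # (child tag, parent tag or None)

    def handle_starttag(self, tag, attrs):
        # The parent of a new tag is the innermost tag still open
        parent = self.stack[-1] if self.stack else None
        self.relations.append((tag, parent))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Closing a tag removes it from the stack of open tags
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


tracker = ParentTracker()
tracker.feed("<div><p>bla,bla</p></div>")
print(tracker.relations)
# [('div', None), ('p', 'div')]
```

Scraping libraries build this kind of tree for you, which is why their methods are mostly about moving between parents and children.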

# 3. Scraping with `Python`: The `BeautifulSoup` Package

## 3.1 Available Packages

In the first part of this chapter,
we will primarily use the [`BeautifulSoup4`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) package,
in conjunction with [`requests`](https://requests.readthedocs.io/en/latest/). The latter package allows you to retrieve the raw text
of a page, which will then be inspected via [`BeautifulSoup4`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

`BeautifulSoup` will suffice when you want to work on static HTML pages. As soon as the information you are looking for is generated via the execution of [JavaScript](https://en.wikipedia.org/wiki/JavaScript) scripts, you will need to use tools like [Selenium](https://selenium-python.readthedocs.io/).

Similarly, if you don’t know the URL, you’ll need to use a framework like [Scrapy](https://scrapy.org/), which easily navigates from one page to another. This technique is called *“web crawling”*. `Scrapy` is more complex to handle than `BeautifulSoup`: if you want more details, visit the [Scrapy tutorial page](https://doc.scrapy.org/en/latest/intro/tutorial.html).

*Web scraping* is an area where reproducibility is difficult to achieve.
A *web* page may change
regularly, and the structure can differ greatly from one web page to another, making code hard to reuse.
Therefore, the best way to write a working program is
to understand the structure of a web page and to distinguish the elements that can be reused
in other use cases from *ad hoc* requests.


``` python
!pip install lxml
!pip install bs4
```


<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">

To be able to use `Selenium`, it is necessary
to make `Python` communicate with a web browser (Firefox or Chromium).
The `webdriver-manager` package allows `Python` to know where
this browser is located if it is already installed in a standard path.
To install it, the code in the cell below can be used.

</div>
</div>

To run `Selenium`, you need to use a package
called `webdriver-manager`. So, we’ll install it, along with `selenium`:

``` python
!pip install selenium
!pip install webdriver-manager
```

## 3.2 Retrieve the Content of an HTML Page

Let’s start slowly. Let’s take a Wikipedia page,
for example, the one for the 2019-2020 Ligue 1 football season: [2019-2020 Championnat de France de football](https://en.wikipedia.org/wiki/2019%E2%80%9320_Ligue_1). We want to retrieve the list of teams, as well as the URLs of the Wikipedia pages for these teams. Note that the code below queries the French-language version of this page.

Step 1️⃣: Connect to the Wikipedia page and obtain the source code.
For this, the simplest way is to use the `requests` package.
This allows `Python` to make the appropriate HTTP request to obtain the content of a page from its URL:


``` python
import requests
url_ligue_1 = "https://fr.wikipedia.org/wiki/Championnat_de_France_de_football_2019-2020"

request_text = requests.get(
    url_ligue_1,
    headers={"User-Agent": "Python for data science tutorial"}
).content
```

``` python
request_text[:150]
```

```
b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature'
```


<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Warning
</div>
</div>
<div class="callout-body-container callout-body">

To limit the volume of *bots* retrieving information from Wikipedia (which is heavily used by LLMs, for example), you must now specify a *user agent* via `requests`. This is a good practice, enabling sites to know who is using their resources.

</div>
</div>
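
The request can also be wrapped in a small helper that sets the user agent, adds a timeout and fails loudly when the server answers with an error code. This is only a sketch under stated assumptions: the name `fetch_html`, the 10-second timeout and the injectable `get` parameter are illustrative choices, not part of the original tutorial. `timeout` and `raise_for_status()` are standard `requests` features.

```python
import requests

HEADERS = {"User-Agent": "Python for data science tutorial"}


def fetch_html(url: str, get=requests.get, timeout: float = 10) -> bytes:
    """Download a page and return its raw bytes, raising on HTTP errors.

    `get` defaults to requests.get; it is a parameter only so that the
    function can be exercised without network access (illustrative choice).
    """
    response = get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx / 5xx answers
    return response.content


# Example call (performs a real HTTP request, hence left commented out):
# request_text = fetch_html("https://fr.wikipedia.org/wiki/Championnat_de_France_de_football_2019-2020")
```

Checking the status code before parsing avoids silently feeding an error page to `BeautifulSoup`.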

Step 2️⃣: search this abundant source code for the tags that contain the information we’re interested in. The main benefit of the `BeautifulSoup` package is that it offers easy-to-use methods for searching a complex document for content based on its HTML or XML tags.


``` python
import bs4
page = bs4.BeautifulSoup(request_text, "lxml")
```

If we *print* the `page` object created with `BeautifulSoup`,
we see that it is no longer a string but an actual HTML page with tags.
We can now search for elements within these tags.

## 3.3 The `find` method

As a first illustration of the power of `BeautifulSoup`, we want to know the title of the page. To do this, we use the `.find` method and pass it *“title”*.


``` python
print(page.find("title"))
```

```
<title>Championnat de France de football 2019-2020 — Wikipédia</title>
```

The `.find` method only returns the first occurrence of the element.

To verify this, you can:

-   copy the snippet of source code obtained when you search for a `table`,
-   paste it into a cell in your notebook,
-   and switch the cell to *“Markdown”*.

If we take the previous code and replace `title` with `table`, we get


``` python
print(page.find("table"))
```

which is the source text that generates the following table:


<table>

<caption style="background-color:#99cc99;color:#000000;">

Généralités

</caption>

<tbody>

<tr>

<th scope="row" style="width:10.5em;">

Sport

</th>

<td>

<a href="/wiki/Football" title="Football">Football</a>

</td>

</tr>

<tr>

<th scope="row" style="width:10.5em;">

Organisateur(s)

</th>

<td>

<a href="/wiki/Ligue_de_football_professionnel_(France)" title="Ligue de football professionnel (France)">LFP</a>

</td>

</tr>

<tr>

<th scope="row" style="width:10.5em;">

Édition

</th>

<td>

<abbr class="abbr" title="Quatre-vingt-deuxième (huitante-deuxième / octante-deuxième)">82<sup>e</sup></abbr>

</td>

</tr>

<tr>

<th scope="row" style="width:10.5em;">

Lieu(x)

</th>

<td>

<span class="datasortkey" data-sort-value="France"><span class="flagicon nowrap"><span class="mw-image-border noviewer" typeof="mw:File"><a class="mw-file-description" href="/wiki/Fichier:Flag_of_France_(1976%E2%80%932020).svg" title="Drapeau de la France"><img alt="Drapeau de la France" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="13" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/d1/Flag_of_France_%281976%E2%80%932020%29.svg/20px-Flag_of_France_%281976%E2%80%932020%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/d1/Flag_of_France_%281976%E2%80%932020%29.svg/40px-Flag_of_France_%281976%E2%80%932020%29.svg.png 1.5x" width="20"/></a></span> </span><a href="/wiki/France" title="France">France</a></span> et <br/><span class="datasortkey" data-sort-value="Monaco"><span class="flagicon nowrap"><span class="mw-image-border noviewer" typeof="mw:File"><a class="mw-file-description" href="/wiki/Fichier:Flag_of_Monaco.svg" title="Drapeau de Monaco"><img alt="Drapeau de Monaco" class="mw-file-element" data-file-height="800" data-file-width="1000" decoding="async" height="16" src="//upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Flag_of_Monaco.svg/20px-Flag_of_Monaco.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Flag_of_Monaco.svg/40px-Flag_of_Monaco.svg.png 1.5x" width="20"/></a></span> </span><a href="/wiki/Monaco" title="Monaco">Monaco</a></span>

</td>

</tr>

<tr>

<th scope="row" style="width:10.5em;">

Date

</th>

<td>

Du <time class="nowrap date-lien" data-sort-value="2019-08-09" datetime="2019-08-09"><a href="/wiki/9_ao%C3%BBt_en_sport" title="9 août en sport">9</a> <a class="mw-redirect" href="/wiki/Ao%C3%BBt_2019_en_sport" title="Août 2019 en sport">août</a> <a href="/wiki/2019_en_football" title="2019 en football">2019</a></time><br/>au <time class="nowrap date-lien" data-sort-value="2020-03-08" datetime="2020-03-08"><a href="/wiki/8_mars_en_sport" title="8 mars en sport">8 mars</a> <a href="/wiki/2020_en_football" title="2020 en football">2020</a></time> <small>(arrêt définitif)</small>

</td>

</tr>

<tr>

<th scope="row" style="width:10.5em;">

Participants

</th>

<td>

20 équipes

</td>

</tr>

<tr>

<th scope="row" style="width:10.5em;">

Matchs joués

</th>

<td>

279 (sur 380 prévus)

</td>

</tr>

<tr>

<th scope="row" style="width:10.5em;">

Site web officiel

</th>

<td>

<cite class="ouvrage" id="site_officiel" style="font-style: normal;"><a class="external text" href="https://www.ligue1.fr/" rel="nofollow">Site officiel</a></cite>

</td>

</tr>

</tbody>

</table>


## 3.4 The `find_all` Method

To find all occurrences, use `.find_all()`.


``` python
print("There are", len(page.find_all("table")), "<table> elements on the page")
```

```
There are 34 <table> elements on the page
```
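
The difference between `find` and `find_all` can be checked on a tiny document. The HTML below is a made-up miniature page (an assumption for illustration, not the Wikipedia source), parsed with the built-in `html.parser` backend so the sketch does not depend on the network or on `lxml`.

```python
import bs4

# Hypothetical miniature page, inlined so the example does not depend on Wikipedia
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <table class="infobox"><tr><td>first</td></tr></table>
    <table class="wikitable sortable"><tr><td>second</td></tr></table>
  </body>
</html>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

first_table = soup.find("table")     # a single Tag: the first match only
all_tables = soup.find_all("table")  # a list-like object with every match
missing = soup.find("video")         # None when nothing matches

# Restricting the search with an attribute filter, as done later in the exercise
sortable = soup.find("table", {"class": "wikitable sortable"})

print(first_table.get("class"))  # ['infobox']
print(len(all_tables))           # 2
print(missing)                   # None
print(sortable.td.text)          # second
```

Because `find` returns `None` when there is no match, it is worth testing its result before chaining further calls on it.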


<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">

`Python` is not the only language that allows you to retrieve elements from a web page. This is one of the main objectives of `Javascript`, which is accessible through any web browser.

For example, to draw a parallel with `page.find('title')` that we used in `Python`, you can open the [previously mentioned page](https://fr.wikipedia.org/wiki/Championnat_de_France_de_football_2019-2020) with your browser. After opening the browser’s developer tools (<kbd>CTRL</kbd>+<kbd>SHIFT</kbd>+<kbd>K</kbd> on `Firefox`), you can type `document.querySelector("title")` in the console to get the content of the HTML node you are looking for:

![](attachment:./04_webscraping/console_log.png)

If you use `Selenium` for web scraping, you will actually encounter these `Javascript` verbs in any method you use.

Understanding the structure of a page and its interaction with the browser is extremely useful when doing *scraping*, even when the site is purely static, meaning it does not have elements reacting to user actions on the web browser.

</div>
</div>
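
`BeautifulSoup` also accepts the same CSS selector syntax as `document.querySelector`, through its `select_one` and `select` methods. The sketch below runs them on a made-up snippet (the club names and links are hypothetical) to show the parallel with the browser console.

```python
import bs4

# Hypothetical snippet imitating a table of club links
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <table class="wikitable">
      <tr><td><a href="/wiki/Club_A" title="Club A">Club A</a></td></tr>
      <tr><td><a href="/wiki/Club_B" title="Club B">Club B</a></td></tr>
    </table>
  </body>
</html>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Equivalent of document.querySelector("title") in the browser console
title = soup.select_one("title")

# Equivalent of document.querySelectorAll("table.wikitable a[href]")
links = soup.select("table.wikitable a[href]")

print(title.text)                  # Demo page
print([a["href"] for a in links])  # ['/wiki/Club_A', '/wiki/Club_B']
```

A selector tested in the browser console can therefore often be reused as is on the `Python` side.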

# 4. Guided Exercise: Get the List of Ligue 1 Teams

In the first paragraph of the *“Participants”* section of the page,
there is a table with information about the clubs taking part in the season.


<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Exercise 1: Retrieve the Participants of Ligue 1
</div>
</div>
<div class="callout-body-container callout-body">

To do this, we will proceed in 6 steps:

1.  Find the table
2.  Retrieve each row from the table
3.  Clean up the outputs by keeping only the text in a row
4.  Generalize for all rows
5.  Retrieve the table headers
6.  Finalize the table

</div>
</div>

1️⃣ Find the table

``` python
# Identify the table: it is the first one with the class "wikitable sortable"
tableau_participants = page.find('table', {'class' : 'wikitable sortable'})
```

``` python
print(tableau_participants)
```

2️⃣ Retrieve each row from the table

Let’s first search for the rows, which are identified by the `tr` tag:

``` python
table_body = tableau_participants.find('tbody')
rows = table_body.find_all('tr')
```

You get a list where each element is one of the rows in the table.
To illustrate this, we will first display the first row.
This corresponds to the column headers:


``` python
print(rows[0])
```

The second row will correspond to the row of the first club listed in the table:


``` python
print(rows[1])
```

3️⃣ Clean the outputs by keeping only the text in a row

We will use the `text` attribute to strip away all the HTML layer we obtained in step 2.

An example on the first club’s row:

-   We start by taking all the cells in that row, using the `td` tag.
-   Then, we loop through each cell and keep only the text from the cell using the `text` attribute.
-   Finally, we apply the `strip()` method to ensure the text is properly formatted (no unnecessary spaces, etc.).


``` python
cols = rows[1].find_all('td')
print(cols[0])
print(cols[0].text.strip())
```

```
<td><a href="/wiki/Paris_Saint-Germain_Football_Club" title="Paris Saint-Germain Football Club">Paris Saint-Germain</a>
</td>
Paris Saint-Germain
```

``` python
for ele in cols:
    print(ele.text.strip())
```

```
Paris Saint-Germain
1974
637
1er
Thomas Tuchel
2018
Parc des Princes
47 929
46
```
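
The scraped values are all strings, and numbers such as the stadium capacity use a non-breaking space as thousands separator. A minimal sketch for converting them, assuming a hypothetical helper named `to_int` (not part of the original tutorial):

```python
def to_int(text: str):
    """Convert a scraped string of digits to an int, or return None otherwise."""
    # str.split() without argument splits on any whitespace,
    # including the non-breaking space (\xa0) used as thousands separator
    compact = "".join(text.split())
    return int(compact) if compact.isdigit() else None


# Values taken from the row printed above (\xa0 is the non-breaking space)
scraped = ["Paris Saint-Germain", "1974", "637", "1er", "47\xa0929", "46"]
print([to_int(value) for value in scraped])
# [None, 1974, 637, None, 47929, 46]
```

Values that are not purely numeric, such as `1er`, are left as `None` here and would need their own cleaning rule.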

4️⃣ Generalize for all rows:

``` python
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    print(cols)
```

We have successfully retrieved the information contained in the participants’ table.
But the first row is strange: it’s an empty list…

These are the headers: they are recognized by the `th` tag, not `td`.

We will put all the content into a dictionary, to later convert it into a pandas DataFrame:


``` python
dico_participants = dict()
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    if len(cols) > 0:
        dico_participants[cols[0]] = cols[1:]

dico_participants
```

``` python
import pandas as pd
data_participants = pd.DataFrame.from_dict(dico_participants, orient='index')
data_participants.head()
```

5️⃣ Retrieve the table headers:

``` python
for row in rows:
    cols = row.find_all('th')
    if len(cols) > 0:
        cols = [ele.get_text(separator=' ').strip().title() for ele in cols]
        columns_participants = cols
```

``` python
columns_participants
```

```
['Club',
 'Dernière Montée',
 'Budget [ 3 ] En M €',
 'Classement 2018-2019',
 'Entraîneur',
 'Depuis',
 'Stade',
 'Capacité En L1 [ 4 ]',
 'Nombre De Saisons En L1']
```
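
Some headers still carry Wikipedia footnote markers such as `[ 3 ]`. One possible way to remove them, sketched here with a regular expression and a hypothetical helper `clean_header` (an optional extra step, not part of the original exercise):

```python
import re

# A footnote marker is a number between square brackets, with optional spaces
FOOTNOTE = re.compile(r"\s*\[\s*\d+\s*\]\s*")


def clean_header(header: str) -> str:
    """Remove footnote markers such as '[ 3 ]' from a column name."""
    return FOOTNOTE.sub(" ", header).strip()


headers = ["Club", "Budget [ 3 ] En M €", "Capacité En L1 [ 4 ]"]
print([clean_header(h) for h in headers])
# ['Club', 'Budget En M €', 'Capacité En L1']
```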

6️⃣ Finalize the table

``` python
data_participants.columns = columns_participants[1:]
```

``` python
data_participants.head()
```
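
The six steps can be gathered into a single function. The sketch below is one possible way to do it, run on a made-up two-club table so that it works offline. The function name `table_to_dataframe` and the sample data are assumptions for illustration.

```python
import bs4
import pandas as pd

# Hypothetical two-club table mimicking the structure of the Wikipedia one
html = """
<table class="wikitable sortable">
  <tbody>
    <tr><th>Club</th><th>Stade</th><th>Depuis</th></tr>
    <tr><td><a href="/wiki/Club_A">Club A</a></td><td>Stade A</td><td>2018</td></tr>
    <tr><td><a href="/wiki/Club_B">Club B</a></td><td>Stade B</td><td>2019</td></tr>
  </tbody>
</table>
"""


def table_to_dataframe(table: bs4.element.Tag) -> pd.DataFrame:
    """Turn an HTML table into a DataFrame: <th> rows give the header, <td> rows the data."""
    header, records = [], []
    for row in table.find_all("tr"):
        th_cells = [cell.get_text(" ", strip=True) for cell in row.find_all("th")]
        td_cells = [cell.get_text(" ", strip=True) for cell in row.find_all("td")]
        if th_cells and not td_cells:
            header = th_cells         # header row: only <th> cells
        elif td_cells:
            records.append(td_cells)  # data row: keep the text of each <td>
    return pd.DataFrame(records, columns=header or None)


soup = bs4.BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})
df = table_to_dataframe(table).set_index("Club")
print(df)
```

For simple, regular tables, `pandas.read_html` can achieve a similar result in one call, but writing the loop by hand shows what happens at each step.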

# 5. Going Further

## 5.1 Retrieving Stadium Locations

Try to understand step by step what is done in the following steps (retrieving additional information by navigating through the pages of the different clubs).


In [None]:
import requests
import bs4
import pandas as pd


def retrieve_page(url: str) -> bs4.BeautifulSoup:
    """
    Retrieves and parses a webpage using BeautifulSoup.

    Args:
        url (str): The URL of the webpage to retrieve.

    Returns:
        bs4.BeautifulSoup: The parsed HTML content of the page.
    """
    r = requests.get(url, headers={"User-Agent": "Python for data science tutorial"})
    page = bs4.BeautifulSoup(r.content, 'html.parser')
    return page


def extract_team_name_url(team: bs4.element.Tag) -> dict:
    """
    Extracts the team name and its corresponding Wikipedia URL.

    Args:
        team (bs4.element.Tag): The BeautifulSoup tag containing the team information.

    Returns:
        dict: A dictionary with the team name as the key and the Wikipedia URL as the value, or None if not found.
    """
    try:
        team_url = team.find('a').get('href')
        equipe = team.find('a').get('title')
        url_get_info = f"http://fr.wikipedia.org{team_url}"
        print(f"Retrieving information for {equipe}")
        return {equipe: url_get_info}
    except AttributeError:
        print(f"No <a> tag for \"{team}\"")
        return None


def explore_team_page(wikipedia_team_url: str) -> bs4.BeautifulSoup:
    """
    Retrieves and parses a team's Wikipedia page.

    Args:
        wikipedia_team_url (str): The URL of the team's Wikipedia page.

    Returns:
        bs4.BeautifulSoup: The parsed HTML content of the team's Wikipedia page.
    """
    r = requests.get(
        wikipedia_team_url, headers={"User-Agent": "Python for data science tutorial"}
    )
    page = bs4.BeautifulSoup(r.content, 'html.parser')
    return page


def extract_stadium_info(search_team: bs4.BeautifulSoup) -> tuple:
    """
    Extracts stadium information from a team's Wikipedia page.

    Args:
        search_team (bs4.BeautifulSoup): The parsed HTML content of the team's Wikipedia page.

    Returns:
        tuple: A tuple containing the stadium name, latitude, and longitude, or (None, None, None) if not found.
    """
    for stadium in search_team.find_all('tr'):
        try:
            header = stadium.find('th', {'scope': 'row'})
            if header and header.contents[0].string == "Stade":
                name_stadium, url_get_stade = extract_stadium_name_url(stadium)
                if name_stadium and url_get_stade:
                    latitude, longitude = extract_stadium_coordinates(url_get_stade)
                    return name_stadium, latitude, longitude
        except (AttributeError, IndexError) as e:
            print(f"Error processing stadium information: {e}")
    return None, None, None


def extract_stadium_name_url(stadium: bs4.element.Tag) -> tuple:
    """
    Extracts the stadium name and URL from a stadium element.

    Args:
        stadium (bs4.element.Tag): The BeautifulSoup tag containing the stadium information.

    Returns:
        tuple: A tuple containing the stadium name and its Wikipedia URL, or (None, None) if not found.
    """
    try:
        url_stade = stadium.find_all('a')[1].get('href')
        name_stadium = stadium.find_all('a')[1].get('title')
        url_get_stade = f"http://fr.wikipedia.org{url_stade}"
        return name_stadium, url_get_stade
    except (AttributeError, IndexError) as e:
        print(f"Error extracting stadium name and URL: {e}")
        return None, None


def extract_stadium_coordinates(url_get_stade: str) -> tuple:
    """
    Extracts the coordinates of a stadium from its Wikipedia page.

    Args:
        url_get_stade (str): The URL of the stadium's Wikipedia page.

    Returns:
        tuple: A tuple containing the latitude and longitude of the stadium, or (None, None) if not found.
    """
    try:
        soup_stade = retrieve_page(url_get_stade)
        kartographer = soup_stade.find('a', {'class': "mw-kartographer-maplink"})
        if kartographer:
            coordinates = kartographer.get('data-lat') + "," + kartographer.get('data-lon')
            latitude, longitude = coordinates.split(",")
            return latitude.strip(), longitude.strip()
        else:
            return None, None
    except Exception as e:
        print(f"Error extracting stadium coordinates: {e}")
        return None, None


def extract_team_info(url_team_tag: bs4.element.Tag, division: str) -> dict:
    """
    Extracts information about a team, including its stadium and coordinates.

    Args:
        url_team_tag (bs4.element.Tag): The BeautifulSoup tag containing the team information.
        division (str): Team league

    Returns:
        dict: A dictionary with details about the team, including its division, name, stadium, latitude, and longitude.
    """

    team_info = extract_team_name_url(url_team_tag)
    url_team_wikipedia = next(iter(team_info.values()))
    name_team = next(iter(team_info.keys()))
    search_team = explore_team_page(url_team_wikipedia)
    name_stadium, latitude, longitude = extract_stadium_info(search_team)
    dict_stadium_team = {
        'division': division,
        'equipe': name_team,
        'stade': name_stadium,
        'latitude': latitude,
        'longitude': longitude
    }
    return dict_stadium_team


def retrieve_all_stadium_from_league(url_list: dict, division: str = "L1") -> pd.DataFrame:
    """
    Retrieves information about all stadiums in a league.

    Args:
        url_list (dict): A dictionary mapping divisions to their Wikipedia URLs.
        division (str): The division for which to retrieve stadium information.

    Returns:
        pd.DataFrame: A DataFrame containing information about the stadiums in the specified division.
    """
    page = retrieve_page(url_list[division])
    teams = page.find_all('span', {'class': 'toponyme'})
    all_info = []

    for team in teams:
        all_info.append(extract_team_info(team, division))

    stadium_df = pd.DataFrame(all_info)
    return stadium_df


# URLs for different divisions
url_list = {
    "L1": "http://fr.wikipedia.org/wiki/Championnat_de_France_de_football_2019-2020",
    "L2": "http://fr.wikipedia.org/wiki/Championnat_de_France_de_football_de_Ligue_2_2019-2020"
}

# Retrieve stadium information for Ligue 1 and Ligue 2
stades_ligue1 = retrieve_all_stadium_from_league(url_list, "L1")
stades_ligue2 = retrieve_all_stadium_from_league(url_list, "L2")

stades = pd.concat(
    [stades_ligue1, stades_ligue2]
)

In [None]:
stades.head(5)

At this stage, everything is in place to create a beautiful map. We will
use `folium` for this, which is introduced in the
[visualization](../../content/visualisation/index.qmd) section.

## 5.2 Stadium Map with `folium`


In [None]:
import geopandas as gpd
import folium

# Drop stadiums without coordinates, then convert the scraped strings to floats
stades = stades.dropna(subset=['latitude', 'longitude']).copy()
stades[['latitude', 'longitude']] = (
    stades[['latitude', 'longitude']].astype(float)
)
stadium_locations = gpd.GeoDataFrame(
    stades, geometry = gpd.points_from_xy(stades.longitude, stades.latitude)
)

center = stadium_locations[['latitude', 'longitude']].mean().values.tolist()
sw = stadium_locations[['latitude', 'longitude']].min().values.tolist()
ne = stadium_locations[['latitude', 'longitude']].max().values.tolist()

m = folium.Map(location = center, tiles='openstreetmap')

# Add the markers to the map one by one
for i in range(len(stadium_locations)):
    folium.Marker(
        [stadium_locations.iloc[i]['latitude'], stadium_locations.iloc[i]['longitude']],
        popup=stadium_locations.iloc[i]['stade']
    ).add_to(m)

m.fit_bounds([sw, ne])
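
The bounding box passed to `fit_bounds` is simply the south-west and north-east corners of the set of points. As a minimal, dependency-free sketch of the computation done above with `pandas` (the coordinates are made-up illustrative values and `bounding_box` is a hypothetical helper, not part of `folium`):

``` python
def bounding_box(points):
    """Return (center, south_west, north_east) for a list of (lat, lon) pairs."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    center = [sum(lats) / len(lats), sum(lons) / len(lons)]
    south_west = [min(lats), min(lons)]
    north_east = [max(lats), max(lons)]
    return center, south_west, north_east


# Illustrative coordinates only (not scraped values)
points = [(48.0, 2.0), (44.0, 6.0), (46.0, -1.0)]
center, sw, ne = bounding_box(points)
print(center, sw, ne)
```

Both corners are given as `[latitude, longitude]`, which is the order used for `sw` and `ne` in the cell above.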

The resulting map should look like the following:


# 6. Retrieving Information on Pokémon

The next exercise to practice *web scraping*
involves retrieving information on Pokémon
from the website [pokemondb.net](http://pokemondb.net/pokedex/national).

## 6.1 Unguided Version


<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Important
</div>
</div>
<div class="callout-body-container callout-body">

As with Wikipedia, this site requires a *user-agent* to be specified when calling `requests`. For instance:

``` python
requests.get(... , headers = {'User-Agent': 'Mozilla/5.0'})
```

</div>
</div>
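
To see what this header looks like on a request object without contacting the site, here is a minimal sketch. It uses `urllib.request` from the standard library, an assumption made only to keep the example dependency-free; with `requests`, the same `headers` dictionary is passed as in the callout above. The user-agent string is just an example value.

``` python
from urllib.request import Request

headers = {"User-Agent": "Mozilla/5.0"}

# Building the request does not send anything over the network
request = Request("http://pokemondb.net/pokedex/national", headers=headers)

# urllib stores header names capitalized, hence "User-agent"
print(request.get_header("User-agent"))
```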


<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Exercise 2: Pokémon (Unguided Version)
</div>
</div>
<div class="callout-body-container callout-body">

For this exercise, we ask you to obtain various information about Pokémon:

1.  The personal information of the **893** Pokémon on the website [pokemondb.net](http://pokemondb.net/pokedex/national).
    The information we would like to ultimately obtain in a `DataFrame` is contained in 4 tables:

    -   Pokédex data
    -   Training
    -   Breeding
    -   Base stats

2.  We would also like you to retrieve images of each Pokémon and save them in a folder.

    -   A small hint: use the `requests` and [`shutil`](https://docs.python.org/3/library/shutil.html) modules.
    -   For this question, you will need to research some elements on your own; not everything is covered in the lab.

</div>
</div>

For question 1, the goal is to obtain the source code of a table like
the one below (Pokémon [Nincada](http://pokemondb.net/pokedex/nincada)).


<h2>

Pokédex data

</h2>

<table class="vitals-table">

<tbody>

<tr>

<th>

National №

</th>

<td>

<strong>290</strong>

</td>

</tr>

<tr>

<th>

Type

</th>

<td>

<a class="type-icon type-bug" href="/type/bug">Bug</a> <a class="type-icon type-ground" href="/type/ground">Ground</a>

</td>

</tr>

<tr>

<th>

Species

</th>

<td>

Trainee Pokémon

</td>

</tr>

<tr>

<th>

Height

</th>

<td>

0.5 m (1′08″)

</td>

</tr>

<tr>

<th>

Weight

</th>

<td>

5.5 kg (12.1 lbs)

</td>

</tr>

<tr>

<th>

Abilities

</th>

<td>

<span class="text-muted">1. <a href="/ability/compound-eyes" title="The Pokémon's accuracy is boosted.">Compound Eyes</a></span><br><small class="text-muted"><a href="/ability/run-away" title="Enables a sure getaway from wild Pokémon.">Run Away</a> (hidden ability)</small><br>

</td>

</tr>

<tr>

<th>

Local №

</th>

<td>

042 <small class="text-muted">(Ruby/Sapphire/Emerald)</small><br>111 <small class="text-muted">(X/Y - Central Kalos)</small><br>043 <small class="text-muted">(Omega Ruby/Alpha Sapphire)</small><br>104 <small class="text-muted">(Sword/Shield)</small><br>

</td>

</tr>

</tbody>

</table>

<h2>

Training

</h2>

<table class="vitals-table">

<tbody>

<tr>

<th>

EV yield

</th>

<td class="text">

1 Defense

</td>

</tr>

<tr>

<th>

Catch rate

</th>

<td>

255 <small class="text-muted">(33.3% with PokéBall, full HP)</small>

</td>

</tr>

<tr>

<th>

Base <a href="/glossary#def-friendship">Friendship</a>

</th>

<td>

70 <small class="text-muted">(normal)</small>

</td>

</tr>

<tr>

<th>

Base Exp.

</th>

<td>

53

</td>

</tr>

<tr>

<th>

Growth Rate

</th>

<td>

Erratic

</td>

</tr>

</tbody>

</table>

<h2>

Breeding

</h2>

<table class="vitals-table">

<tbody>

<tr>

<th>

Egg Groups

</th>

<td>

<a href="/egg-group/bug">Bug</a>

</td>

</tr>

<tr>

<th>

Gender

</th>

<td>

<span class="text-blue">50% male</span>, <span class="text-pink">50% female</span>

</td>

</tr>

<tr>

<th>

<a href="/glossary#def-eggcycle">Egg cycles</a>

</th>

<td>

15 <small class="text-muted">(3,599–3,855 steps)</small>

</td>

</tr>

</tbody>

</table>

<h2>

Base stats

</h2>

<table class="vitals-table">

<tbody>

<tr>

<th>

HP

</th>

<td class="cell-num">

31

</td>

<td class="cell-barchart">

</td>

<td class="cell-num">

172

</td>

<td class="cell-num">

266

</td>

</tr>

<tr>

<th>

Attack

</th>

<td class="cell-num">

45

</td>

<td class="cell-barchart">

</td>

<td class="cell-num">

85

</td>

<td class="cell-num">

207

</td>

</tr>

<tr>

<th>

Defense

</th>

<td class="cell-num">

90

</td>

<td class="cell-barchart">

</td>

<td class="cell-num">

166

</td>

<td class="cell-num">

306

</td>

</tr>

<tr>

<th>

Sp. Atk

</th>

<td class="cell-num">

30

</td>

<td class="cell-barchart">

</td>

<td class="cell-num">

58

</td>

<td class="cell-num">

174

</td>

</tr>

<tr>

<th>

Sp. Def

</th>

<td class="cell-num">

30

</td>

<td class="cell-barchart">

</td>

<td class="cell-num">

58

</td>

<td class="cell-num">

174

</td>

</tr>

<tr>

<th>

Speed

</th>

<td class="cell-num">

40

</td>

<td class="cell-barchart">

</td>

<td class="cell-num">

76

</td>

<td class="cell-num">

196

</td>

</tr>

</tbody>

<tfoot>

<tr>

<th>

Total

</th>

<td class="cell-total">

<b>266</b>

</td>

<th class="cell-barchart">

</th>

<th>

Min

</th>

<th>

Max

</th>

</tr>

</tfoot>

</table>


For question 2, the goal is to obtain
images of the Pokémon.

## 6.2 Guided Version

The following sections will help you complete the above exercise
step by step.

First, we want to obtain the
personal information of all
the Pokémon on [pokemondb.net](http://pokemondb.net/pokedex/national).

The information we would like to ultimately obtain for the Pokémon is contained in 4 tables:

-   Pokédex data
-   Training
-   Breeding
-   Base stats

Next, we will retrieve and display the images.

### 6.2.1 Step 1: Create a DataFrame of Characteristics


<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Exercise 2b: Pokémon (Guided Version)
</div>
</div>
<div class="callout-body-container callout-body">

To retrieve the information, the code must be divided into several steps:

1.  Retrieve the site's main page and transform it into an object your code can work with. The following functions will be useful:

    -   `requests.get`
    -   `bs4.BeautifulSoup`

2.  From this code, create a function that retrieves a Pokémon's page content from its name. You can name this function `get_name`.

3.  From the `bulbasaur` page, obtain the 4 tables we're interested in:

    -   look for the following element: `('table', {'class': "vitals-table"})`
    -   then store its elements in a dictionary

4.  Retrieve the list of Pokémon names, which will enable us to loop later. How many Pokémon can you find?

5.  Write a function that retrieves all the information on the first ten Pokémon in the list and integrates it into a `DataFrame`.

</div>
</div>
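
To illustrate how a `vitals-table` maps onto a dictionary, here is a minimal sketch run on an inline HTML fragment rather than the live site. It relies on `html.parser` from the standard library, an assumption made to keep the example self-contained; with `BeautifulSoup`, the same pairing of `<th>` and `<td>` cells can be done by looping over the rows of each table. `VitalsTableParser` is a hypothetical name, and keeping only the first `<td>` after each `<th>` is a simplifying choice.

``` python
from html.parser import HTMLParser


class VitalsTableParser(HTMLParser):
    """Collect {header: value} pairs from the <th>/<td> cells of a table."""

    def __init__(self):
        super().__init__()
        self.rows = {}
        self._cell = None    # tag of the cell being read ("th" or "td")
        self._header = None  # text of the last <th> seen
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._cell = tag
            self._buffer = []

    def handle_data(self, data):
        if self._cell is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._cell:
            return
        # Join the text fragments and normalize whitespace
        text = " ".join("".join(self._buffer).split())
        if tag == "th":
            self._header = text
        elif self._header is not None and self._header not in self.rows:
            # Keep only the first <td> that follows each <th>
            self.rows[self._header] = text
        self._cell = None


html = """
<table class="vitals-table"><tbody>
<tr><th>National №</th><td><strong>290</strong></td></tr>
<tr><th>Species</th><td>Trainee Pokémon</td></tr>
<tr><th>Base <a href="/glossary#def-friendship">Friendship</a></th><td>70 <small>(normal)</small></td></tr>
</tbody></table>
"""

parser = VitalsTableParser()
parser.feed(html)
print(parser.rows)
```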

At the end of question 3,
you should obtain a list of characteristics similar to this one:

The structure here is a dictionary, which is convenient.

Finally, you can integrate the information
of the first ten Pokémon into a
`DataFrame`, which will look like this:


### 6.2.2 Step 2: Retrieve and Display Pokémon Photos

We would also like you to retrieve the images of the first 5 Pokémon
and save them in a folder.


<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Exercise 2b: Pokémon (Guided Version)
</div>
</div>
<div class="callout-body-container callout-body">

-   The URLs of Pokémon images take the form `https://img.pokemondb.net/artwork/{pokemon}.jpg`.
    Use the `requests` and `shutil` modules to download
    and save the images locally.
-   Import these images stored in JPEG format into `Python` using the `imread` function from the `skimage.io` package.

</div>
</div>
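
One possible way to organize the download step is sketched below. It is only a sketch under stated assumptions: it uses `urllib.request` from the standard library in place of `requests` so that it has no third-party dependency, and `image_url`, `save_stream`, `download_image` and the `images` folder are hypothetical names. The part that matters is `shutil.copyfileobj`, which streams a binary file-like object to disk.

``` python
import shutil
from pathlib import Path
from urllib.request import Request, urlopen


def image_url(pokemon: str) -> str:
    """Build the artwork URL from a Pokémon name, following the pattern above."""
    return f"https://img.pokemondb.net/artwork/{pokemon}.jpg"


def save_stream(stream, destination: Path) -> Path:
    """Copy a binary file-like object to disk without loading it all in memory."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    with open(destination, "wb") as out_file:
        shutil.copyfileobj(stream, out_file)
    return destination


def download_image(pokemon: str, folder: str = "images") -> Path:
    """Download one image (performs a network call when invoked)."""
    request = Request(image_url(pokemon), headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        return save_stream(response, Path(folder) / f"{pokemon}.jpg")
```

Calling `download_image("bulbasaur")` would write `images/bulbasaur.jpg`. With `requests`, the equivalent is to call `requests.get(url, stream=True, headers=...)` and pass `response.raw` to `save_stream`.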


In [None]:
!pip install scikit-image

# 7. `Selenium`: Mimicking the Behavior of an Internet User

Until now,
we have assumed that we always know the URL we are interested in.
Additionally, the pages we visit are **"static"**:
they do not depend on any action or search by the user.

We will now see how to fill in fields on a website and retrieve the information we are interested in.
A website's reaction to a user's action usually relies on `JavaScript`.
The [Selenium](https://pypi.python.org/pypi/selenium) package allows
you to automate the behavior of a manual user from within your code.
It enables you to obtain information from a site that is not in the
`HTML` code but only appears after
the execution of `JavaScript` scripts in the background.

`Selenium` behaves like a regular internet user:
it clicks on links, fills out forms, etc.

## 7.1 First Example: Scraping a Search Engine

In this example, we will try to go to the
[Bing News](https://www.bing.com/news) site
and enter a given topic in the search bar.
To test, we will search with the keyword **"Trump"**.

Using `Selenium` requires `Chromium`, the open-source browser
on which Google Chrome is based.
The version of [chromedriver](https://sites.google.com/a/chromium.org/chromedriver/)
must be `>= 2.36` and depends on the version of `Chrome` available in your working environment.
To install `Chromium` on a
`Linux` environment, you can refer to the dedicated section.


<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Installing Selenium
</div>
</div>
<div class="callout-body-container callout-body">

On `Colab`, you can use the following commands:

``` python
!sudo apt-get update
!sudo apt install -y unzip xvfb libxi6 libgconf-2-4 -y
!sudo apt install chromium-chromedriver -y
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
```

<br>

If you are on the `SSP Cloud`, you can
run the following commands:

``` python
!wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb -O /tmp/chrome.deb
!sudo apt-get update
!sudo -E apt-get install -y /tmp/chrome.deb
!pip install chromedriver-autoinstaller selenium

import chromedriver_autoinstaller
path_to_web_driver = chromedriver_autoinstaller.install()
```

<br>

You can then install `Selenium`,
for example by running `!pip install selenium` from a
`Notebook` cell.

</div>
</div>

First, we configure the browser that `Selenium` will drive.
Here we run Chrome without a graphical interface (`--headless`) and with a few options suited to containerized environments:


In [None]:
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
#chrome_options.add_argument('--verbose')

Then we launch the browser:


In [None]:
from selenium.webdriver.chrome.service import Service
service = Service(executable_path=path_to_web_driver)

browser = webdriver.Chrome(
    service=service,
    options=chrome_options
)

We go to the `Bing News` site,
and we specify the keyword we want to search for.
In this case, we're interested in news about Donald Trump.
After inspecting the page using the browser's developer tools,
we see that the search bar is an element whose `name` attribute is `q` (as in *query*).
So we'll ask `selenium` to search for this element:


In [None]:
browser.get('https://www.bing.com/news')

In [None]:
search = browser.find_element("name", "q")
print(search)
print([search.text, search.tag_name, search.id])

# Send to this element the word we would have typed in the search bar
search.send_keys("Trump")

search_button = browser.find_element("xpath", "//input[@id='sb_form_go']")
search_button.click()

`Selenium` allows you to capture the image you would see in the browser
with `get_screenshot_as_png`. This can be useful to check if you
have performed the correct action:
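
A minimal sketch of saving that screenshot to disk is given below. It assumes `browser` is the Selenium driver created earlier; `save_screenshot` is a hypothetical helper written for this example, and it works with any object exposing `get_screenshot_as_png()`.

``` python
from pathlib import Path


def save_screenshot(browser, path: str = "screenshot.png") -> Path:
    """Write the current browser view to a PNG file and return its path."""
    png_bytes = browser.get_screenshot_as_png()  # raw bytes of a PNG image
    target = Path(path)
    target.write_bytes(png_bytes)
    return target
```

In a notebook, the same bytes can also be displayed inline with `IPython.display.Image(png_bytes)`.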


Finally, we can extract the results. Several
methods are available. The most
convenient method, when available,
is to use `XPath`, which is an unambiguous path
to access an element. Indeed,
multiple elements can share the same class or
the same attribute, which can cause such a search
to return multiple matches.
To determine the `XPath` of an object, the developer tools
of your web browser are handy.
For example, in `Firefox`, once you
have found an element in the inspector, you
can right-click \> Copy \> XPath.
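
To make the difference concrete, here is a small sketch on a made-up fragment (not the Bing News markup). It uses the limited XPath subset of `xml.etree.ElementTree` from the standard library as a stand-in for a browser, which is an assumption of this example. With `Selenium`, comparable expressions would be passed to `browser.find_elements("xpath", ...)`.

``` python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment used only for illustration
page = ET.fromstring("""
<html><body>
  <div id="results">
    <a class="title" href="/a">First</a>
    <a class="title" href="/b">Second</a>
  </div>
</body></html>
""")

# Searching by class is ambiguous: both links share it
by_class = page.findall(".//a[@class='title']")
print([a.text for a in by_class])

# A positional path pins down a single element
by_path = page.findall("./body/div/a[2]")
print([a.text for a in by_path])
```

The first query returns every link carrying the class, while the second follows one path through the tree and returns only the second link.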

Finally, to end our session, we ask `Python` to close the browser:


In [None]:
browser.quit()

We get the following results:


Other useful `Selenium` methods:

| Method | Result |
|-------------------------------------------------|-----------------------|
| `find_element(****).click()` | Once you have found a reactive element, such as a button, you can click on it to navigate to a new page. |
| `find_element(****).send_keys("toto")` | Once you have found an element, such as a field to enter credentials, you can send a value, in this case *"toto"*. |

## 7.2 Additional Exercise

To explore another application of web scraping, you can also tackle topic 5 of the 2023 edition of a non-competitive hackathon organized by Insee:

-   On [`GitHub`](https://github.com/InseeFrLab/funathon2023_sujet5)
-   On [`SSPCloud`](https://www.sspcloud.fr/formation?search=funat&path=%5B%22Funathon%202023%22%5D)

The NLP section of the course may be useful for the second part of the topic!
