try/error pipeline for GHA + update web scraping code to avoid deprecation warnings
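
A minimal sketch of the retry pattern that `fetch_page` introduces in this commit, with the network call abstracted away so it can run offline. The names `with_retries` and `flaky_fetch` are hypothetical and are not part of the codebase; the real helper wraps `requests.get` and `bs4.BeautifulSoup` instead.

```python
import time


def with_retries(func, max_retries=3, delay=2.0):
    """Call func() until it succeeds or max_retries attempts have failed.

    Hypothetical helper: it mirrors the structure of fetch_page in the diff,
    but takes any zero-argument callable instead of a URL.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception as exc:
            print(f"Attempt {attempt}/{max_retries} failed: {exc}")
            if attempt == max_retries:
                raise  # out of attempts: let the last error propagate
            time.sleep(delay)


# Stand-in for a flaky network call: fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


result = with_retries(flaky_fetch, max_retries=3, delay=0)
```

Re-raising after the last attempt keeps a persistent failure visible in the CI logs instead of silently rendering a page without data.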

linogaliana · linogaliana · commit 53883c089630 · 2025-09-20T14:01:30.000Z
diff --git a/_quarto.yml b/_quarto.yml
@@ -4,6 +4,7 @@ project:
     - index.qmd
     - 404.qmd
     - content/getting-started/index.qmd
+    - content/manipulation/04a_webscraping_TP.qmd
     - content/modelisation/index.qmd
     - content/visualisation/index.qmd
     - content/visualisation/matplotlib.qmd
diff --git a/content/manipulation/04_webscraping/_exo2_solution.qmd b/content/manipulation/04_webscraping/_exo2_solution.qmd
@@ -8,6 +8,8 @@ import pandas as pd
 ```
 
 ```{python}
+#| eval: false
+
 # 1. We need to use Mozilla user-agent for that site
 import requests
 
@@ -20,6 +22,35 @@ page = bs4.BeautifulSoup(req.content, "lxml")
 ```
 
 
+```{python}
+#| echo: false
+import requests
+import bs4
+import time
+
+url_root = "http://pokemondb.net/pokedex/national"
+
+def fetch_page(url, max_retries=3, delay=2):
+    for attempt in range(1, max_retries + 1):
+        try:
+            resp = requests.get(
+                url,
+                headers={"User-Agent": "Mozilla/5.0"},
+                timeout=10  # avoid hanging indefinitely
+            )
+            resp.raise_for_status()  # raises an exception on a 4xx/5xx HTTP status
+            return bs4.BeautifulSoup(resp.content, "lxml")
+        except Exception as e:
+            print(f"Attempt {attempt}/{max_retries} failed: {e}")
+            if attempt < max_retries:
+                time.sleep(delay)
+            else:
+                raise  # after the last attempt, let the error propagate
+
+# Usage example
+page = fetch_page(url_root)
+```
+
 ```{python}
 #| output: false
 
@@ -48,21 +79,21 @@ page_pokemon = get_page("bulbasaur")
 indice_tableau = 0 #premier tableau : 0
 print("\n tableau", indice_tableau+1, " : deux premières lignes")
 tableau_1 = page_pokemon.findAll('table', { 'class' : "vitals-table"})[indice_tableau] 
-for elements in tableau_1.find('tbody').findChildren(['tr'])[0:2]:  #Afficher les 2 éléments du tableau
-    print(elements.findChild('th'))
-    print(elements.findChild('td'))
+for elements in tableau_1.find('tbody').find_all('tr')[0:2]:  # Show the first 2 rows of the table
+    print(elements.find('th'))
+    print(elements.find('td'))
 print("\n\n\n")
 
 # Generalization
 def get_cara_pokemon(pokemon_name):
     page = get_page(pokemon_name)
     data = {}
-    for table in page.findAll('table', { 'class' : "vitals-table"})[0:4] :
+    for table in page.find_all('table', { 'class' : "vitals-table"})[0:4] :
         table_body = table.find('tbody')
-        for rows in table_body.findChildren(['tr']) :
+        for rows in table_body.find_all('tr') :
             if len(rows) > 1 : # attention aux tr qui ne contiennent rien
-                column = rows.findChild('th').getText()
-                cells = rows.findChild('td').getText()
+                column = rows.find('th').get_text()
+                cells = rows.find('td').get_text()
                 cells = cells.replace('\t','').replace('\n',' ')
                 data[column] = cells
                 data['name'] = pokemon_name
diff --git a/content/manipulation/04_webscraping/_exo2b.qmd b/content/manipulation/04_webscraping/_exo2b.qmd
@@ -5,21 +5,21 @@
 Pour récupérer les informations, le code devra être divisé en plusieurs étapes : 
 
 
-1. Trouvez la page principale du site et la transformer en un objet intelligible pour votre code.
-   Les fonctions suivantes vous seront utiles :
-   - `urllib.request.Request`
-   - `urllib.request.urlopen`
+1. Trouvez la page principale du site et transformez-la en un objet intelligible pour votre code. Les fonctions suivantes vous seront utiles :
+
+   - `requests.get`
    - `bs4.BeautifulSoup`
 
-2. Créez une fonction qui permet de récupérer la page d'un pokémon à partir de son nom.
+2. À partir de ce code, créez une fonction qui permet de récupérer le contenu de la page d'un pokémon à partir de son nom. Vous pouvez nommer cette fonction `get_page`.
 
 3. À partir de la page de `bulbasaur`, obtenez les 4 tableaux qui nous intéressent :
+
    - on va chercher l'élément suivant : `('table', { 'class' : "vitals-table"})`
    - puis stocker ses éléments dans un dictionnaire
 
 4. Récupérez par ailleurs la liste de noms des pokémons qui nous permettra de faire une boucle par la suite. Combien trouvez-vous de pokémons ? 
 
-5. Écrivez une fonction qui récupère l'ensemble des informations sur les dix premiers pokémons de la liste et les intègre dans un `DataFrame`
+5. Écrivez une fonction qui récupère l'ensemble des informations sur les dix premiers pokémons de la liste et les intègre dans un `DataFrame`.
 
 ::::
 :::
@@ -28,24 +28,26 @@ Pour récupérer les informations, le code devra être divisé en plusieurs éta
 :::: {.callout-tip}
 ## Exercise 2b: Pokémon (guided version)
 
-To retrieve the information, the code will need to be divided into several steps:
+To retrieve the information, the code must be divided into several steps: 
 
+1. Find the site's main page and transform it into an intelligible object for your code. The following functions will be useful:
 
-1. Find the main page of the site and transform it into an intelligible object for your code.
-   The following functions will be useful:
-   - `urllib.request.Request`
-   - `urllib.request.urlopen`
+   - `requests.get`
    - `bs4.BeautifulSoup`
 
-2. Create a function that retrieves a Pokémon's page based on its name.
+2. Building on this code, create a function that retrieves the content of a Pokémon's page from its name. You can name this function `get_page`.
 
-3. From the `bulbasaur` page, obtain the 4 tables we are interested in:
-   - We will look for the following element: `('table', { 'class' : "vitals-table"})`
-   - Then store its elements in a dictionary
+3. From the `bulbasaur` page, obtain the 4 tables we're interested in:
+   - look for the following element: `('table', { 'class' : "vitals-table"})`
+   - then store its elements in a dictionary
 
-4. Additionally, retrieve the list of Pokémon names that will allow us to loop through later. How many Pokémon do you find?
+4. Retrieve the list of Pokémon names, which will let us loop over them later. How many Pokémon do you find?
 
-5. Write a function that retrieves all the information on the first ten Pokémon in the list and integrates it into a `DataFrame`.
+5. Write a function that retrieves all the information on the first ten Pokémon in the list and integrates it into a `DataFrame`.
 
 ::::
 :::
+
+```{python}
+# Correction above
+```
diff --git a/content/manipulation/04_webscraping/_exo2b_correction.qmd b/content/manipulation/04_webscraping/_exo2b_correction.qmd
@@ -1,12 +1,34 @@
 ```{python}
 #| echo: true
+#| eval: false
 !pip install scikit-image
 ```
 
+```{python}
+# Question 2
+def get_image_from_name(pokemon_name):
+    """
+    Download a pokemon's artwork, e.g. https://img.pokemondb.net/artwork/bulbasaur.jpg, and return the local file name
+    """
+    url_pokemon = f"https://img.pokemondb.net/artwork/{pokemon_name}.jpg"
+    response = requests.get(
+        url_pokemon,
+        headers={'User-Agent': 'Mozilla/5.0'},
+        stream=True  # keep response.raw unread so it can be copied to disk
+    )
+
+    name = f'{pokemon_name}.jpg'
+    with open(name, 'wb') as out_file:
+        shutil.copyfileobj(response.raw, out_file)
+
+    return name
+```
 
 ```{python}
-#| include: false
-#| echo: false
+#| output: false
+#| message: false
+#| warning: false
+#| label: correction-exo2b-step2
 
 # Correction de l'étape 2
 import shutil
@@ -17,21 +39,37 @@ import skimage.io as imio
 
 nb_pokemons = 5
 fig, ax = plt.subplots(1, nb_pokemons, figsize=(12,4))
+
 for indice_pokemon in range(0,nb_pokemons) :
+
     pokemon = liste_pokemon[indice_pokemon]
-    url = f"https://img.pokemondb.net/artwork/{pokemon}.jpg"
-    response = requests.get(url, stream=True)
-    with open(f'{pokemon}.jpg', 'wb') as out_file:
-        shutil.copyfileobj(response.raw, out_file)
-    name = f'{pokemon}.jpg'
+    name = get_image_from_name(pokemon)
+
     img = imio.imread(name)
     ax[indice_pokemon].imshow(img)  
     ax[indice_pokemon].get_xaxis().set_visible(False)
     ax[indice_pokemon].get_yaxis().set_visible(False)
 ```
 
+
+::: {.content-visible when-profile="fr"}
+
 ```{python}
 #| echo: false
-#plt.savefig('pokemon.png', bbox_inches='tight')
+#| fig-cap: "Les premiers pokémons du Pokédex"
+
 ax[0].get_figure()
 ```
+
+:::
+
+::: {.content-visible when-profile="en"}
+
+```{python}
+#| echo: false
+#| fig-cap: "The first Pokémon in the Pokédex"
+
+ax[0].get_figure()
+```
+
+:::
diff --git a/content/manipulation/04a_webscraping_TP.qmd b/content/manipulation/04a_webscraping_TP.qmd