### Web scraping exercise with BeautifulSoup

For this exercise, we ask you to 1) retrieve the personal information of the 721 Pokémon listed on the website [pokemondb.net](http://pokemondb.net/pokedex/national). The information we ultimately want for each Pokémon is contained in 4 tables:

- Pokédex data
- Training
- Breeding
- Base stats

For example: [Pokemon Database](http://pokemondb.net/pokedex/nincada).

2) We would also like you to retrieve the image of each Pokémon and save the images in a folder (hint: use the `requests` and [shutil](https://docs.python.org/3/library/shutil.html) modules).
_For this question, you will need to look some things up yourself; not everything is covered in this tutorial._

#### Retrieving the information for one Pokémon

In [1]:
# Retrieve the information for one Pokémon: https://pokemondb.net/pokedex/bulbasaur
import bs4
from urllib.request import Request, urlopen

# Step 1: connect to the page and get its source code
# A Request object with a User-Agent header is used to avoid an "HTTP Error 403: Forbidden" error

url_pokemon = "https://pokemondb.net/pokedex/bulbasaur"

req = Request(url_pokemon, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read().decode('utf-8')
print(html[:1000])

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Bulbasaur Pokédex: stats, moves, evolution &amp; locations | Pokémon Database</title>
<link rel="preconnect" href="https://img.pokemondb.net">
<style>@font-face{font-family:"Fira Sans";font-style:normal;font-weight:400;font-display:swap;src:url("/static/fonts/fira-sans-v10-latin-400.woff2") format("woff2");unicode-range:U+0000-00FF,U+0131,U+0152-0153,U+02BB-02BC,U+02C6,U+02DA,U+02DC,U+2000-206F,U+2074,U+20AC,U+2122,U+2191,U+2193,U+2212,U+2215,U+FEFF,U+FFFD}@font-face{font-family:"Fira Sans";font-style:italic;font-weight:400;font-display:swap;src:url("/static/fonts/fira-sans-v10-latin-400i.woff2") format("woff2");unicode-range:U+0000-00FF,U+0131,U+0152-0153,U+02BB-02BC,U+02C6,U+02DA,U+02DC,U+2000-206F,U+2074,U+20AC,U+2122,U+2191,U+2193,U+2212,U+2215,U+FEFF,U+FFFD}@font-face{font-family:"Fira Sans";font-style:normal;font-weight:700;font-display:swap;src:url("/static/fonts/fira-sans-v10-latin-600.woff2") format("woff2")
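The header trick used in step 1 can be isolated in a small helper. The sketch below is only an illustration: `build_request` is a hypothetical name that does not appear in the notebook, and the code builds the request without sending it, so it runs offline.

```python
from urllib.request import Request

# Hypothetical helper (not part of the original notebook): builds the request
# object only, nothing is sent over the network here.
def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

req = build_request("https://pokemondb.net/pokedex/bulbasaur")

# urllib stores header names capitalised, hence "User-agent" when reading it back
print(req.full_url)
print(req.get_header("User-agent"))
```

Passing the resulting object to `urlopen` then behaves like the cell above.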

In [2]:
# Step 2: use the BeautifulSoup package,
# which "understands" the tags contained in the string returned by urlopen
# using the 'html.parser' parser mentioned in the bs4 documentation

page = bs4.BeautifulSoup(html, 'html.parser')

In [3]:
# Step 3: the data we are looking for is in tables of the form <table class="vitals-table">
len(page.findAll('table', {'class' : 'vitals-table'}))

7

Data we are looking for:
- Pokédex data
- Training
- Breeding
- Base stats

In [4]:
# The data we are looking for is stored in a dictionary of features
# `feature` = {feature name: value}
feature = {}

In [5]:
# Display the HTML code of the first table, which contains the "Pokédex data"
table = page.findAll('table', {'class' : 'vitals-table'})[0]
print(table.prettify())

# In each table row <tr> ... </tr>,
# the header <th> ... </th> is used as the key
# and the value is the content of the corresponding <td> ... </td>

<table class="vitals-table">
 <tbody>
  <tr>
   <th>
    National №
   </th>
   <td>
    <strong>
     001
    </strong>
   </td>
  </tr>
  <tr>
   <th>
    Type
   </th>
   <td>
    <a class="type-icon type-grass" href="/type/grass">
     Grass
    </a>
    <a class="type-icon type-poison" href="/type/poison">
     Poison
    </a>
   </td>
  </tr>
  <tr>
   <th>
    Species
   </th>
   <td>
    Seed Pokémon
   </td>
  </tr>
  <tr>
   <th>
    Height
   </th>
   <td>
    0.7 m (2′04″)
   </td>
  </tr>
  <tr>
   <th>
    Weight
   </th>
   <td>
    6.9 kg (15.2 lbs)
   </td>
  </tr>
  <tr>
   <th>
    Abilities
   </th>
   <td>
    <span class="text-muted">
     1.
     <a href="/ability/overgrow" title="Powers up Grass-type moves in a pinch.">
      Overgrow
     </a>
    </span>
    <br/>
    <small class="text-muted">
     <a href="/ability/chlorophyll" title="Boosts the Pokémon's Speed in sunshine.">
      Chlorophyll
     </a>
     (hidden ability)
    </small>
    <br/>
   </td>
 

In [6]:
# In each table row <tr> ... </tr>,
# the header <th> ... </th> is used as the key
# and the value is the content of the corresponding <td> ... </td>
for row in table.findAll('tr'):
    print(row.th.string)
    print(row.td.getText())
    print("---")

National №
001
---
Type

Grass Poison 
---
Species
Seed Pokémon
---
Height
0.7 m (2′04″)
---
Weight
6.9 kg (15.2 lbs)
---
Abilities
1. OvergrowChlorophyll (hidden ability)
---
Local №
001 (Red/Blue/Yellow)226 (Gold/Silver/Crystal)001 (FireRed/LeafGreen)231 (HeartGold/SoulSilver)080 (X/Y — Central Kalos)001 (Let's Go Pikachu/Let's Go Eevee)068 (The Isle of Armor)
---


In [7]:
feature.clear()

In [8]:
# Fill the feature dictionary
for row in table.findAll('tr'):
    key = row.th.string
    value = row.td.getText()
    feature[key] = value

feature

{'National №': '001',
 'Type': '\nGrass Poison ',
 'Species': 'Seed Pokémon',
 'Height': '0.7\xa0m (2′04″)',
 'Weight': '6.9\xa0kg (15.2\xa0lbs)',
 'Abilities': '1. OvergrowChlorophyll (hidden ability)',
 'Local №': "001 (Red/Blue/Yellow)226 (Gold/Silver/Crystal)001 (FireRed/LeafGreen)231 (HeartGold/SoulSilver)080 (X/Y — Central Kalos)001 (Let's Go Pikachu/Let's Go Eevee)068 (The Isle of Armor)"}

In [9]:
# Retrieve the 4 tables we want:
feature.clear()
tables = page.findAll('table', {'class' : 'vitals-table'})
for table in tables[:4]:
    for row in table.findAll('tr'):
        key = row.th.string
        value = row.td.getText()
        feature[key] = value
feature

{'National №': '001',
 'Type': '\nGrass Poison ',
 'Species': 'Seed Pokémon',
 'Height': '0.7\xa0m (2′04″)',
 'Weight': '6.9\xa0kg (15.2\xa0lbs)',
 'Abilities': '1. OvergrowChlorophyll (hidden ability)',
 'Local №': "001 (Red/Blue/Yellow)226 (Gold/Silver/Crystal)001 (FireRed/LeafGreen)231 (HeartGold/SoulSilver)080 (X/Y — Central Kalos)001 (Let's Go Pikachu/Let's Go Eevee)068 (The Isle of Armor)",
 'EV yield': '\n1 Special Attack ',
 'Catch rate': '\n45 (5.9% with PokéBall, full HP)\n',
 None: '\n50 (normal)\n',
 'Base Exp.': '64',
 'Growth Rate': 'Medium Slow',
 'Egg Groups': 'Grass, Monster',
 'Gender': '87.5% male, 12.5% female',
 'Egg cycles': '20 (4,884–5,140 steps)\n',
 'HP': '45',
 'Attack': '49',
 'Defense': '49',
 'Sp. Atk': '65',
 'Sp. Def': '65',
 'Speed': '45',
 'Total': '318'}
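The `None` key in this dictionary is consistent with how BeautifulSoup's `.string` attribute behaves: it returns `None` as soon as a tag has more than one child, which presumably happens for one of the `<th>` headers on the page. The sketch below reproduces the effect on a small hand-written HTML fragment (an assumption, not copied from the site) and shows `get_text` as one possible alternative for building the key.

```python
import bs4

# Hand-written fragment for illustration only; the second header mixes a tag and text
html = """
<table>
  <tr><th>Base Exp.</th><td>64</td></tr>
  <tr><th><a href="/x">Nested</a> header</th><td>50</td></tr>
</table>
"""
table = bs4.BeautifulSoup(html, "html.parser")

features = {}
for row in table.find_all("tr"):
    # .string is None when the <th> has several children
    print(repr(row.th.string))
    # get_text gathers all the text pieces, stripped and joined with a space
    key = row.th.get_text(" ", strip=True)
    features[key] = row.td.get_text(" ", strip=True)

print(features)
```

Using `get_text` for the key would change the keys stored in `feature`, so the outputs shown in this notebook would differ slightly if the cells were re-run with it.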

In [10]:
# Function that retrieves the features of one Pokémon
def get_features(url_pokemon):
    feature = {}

    # Download the page and parse it
    req = Request(url_pokemon, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req).read().decode('utf-8')
    page = bs4.BeautifulSoup(html, 'html.parser')

    # Read the first 4 "vitals" tables row by row
    tables = page.findAll('table', {'class' : 'vitals-table'})
    for table in tables[:4]:
        for row in table.findAll('tr'):
            key = row.th.string
            value = row.td.getText()
            feature[key] = value
    return feature

In [11]:
get_features("https://pokemondb.net/pokedex/bulbasaur")

{'National №': '001',
 'Type': '\nGrass Poison ',
 'Species': 'Seed Pokémon',
 'Height': '0.7\xa0m (2′04″)',
 'Weight': '6.9\xa0kg (15.2\xa0lbs)',
 'Abilities': '1. OvergrowChlorophyll (hidden ability)',
 'Local №': "001 (Red/Blue/Yellow)226 (Gold/Silver/Crystal)001 (FireRed/LeafGreen)231 (HeartGold/SoulSilver)080 (X/Y — Central Kalos)001 (Let's Go Pikachu/Let's Go Eevee)068 (The Isle of Armor)",
 'EV yield': '\n1 Special Attack ',
 'Catch rate': '\n45 (5.9% with PokéBall, full HP)\n',
 None: '\n50 (normal)\n',
 'Base Exp.': '64',
 'Growth Rate': 'Medium Slow',
 'Egg Groups': 'Grass, Monster',
 'Gender': '87.5% male, 12.5% female',
 'Egg cycles': '20 (4,884–5,140 steps)\n',
 'HP': '45',
 'Attack': '49',
 'Defense': '49',
 'Sp. Atk': '65',
 'Sp. Def': '65',
 'Speed': '45',
 'Total': '318'}

#### Retrieving the list of Pokémon

In [12]:
# Retrieve the list of Pokémon and the links to their feature pages
# https://pokemondb.net/pokedex/all

In [13]:
url_pokemons_list = "https://pokemondb.net/pokedex/all"

req = Request(url_pokemons_list, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read().decode('utf-8')
print(html[:1000])
page = bs4.BeautifulSoup(html, 'html.parser')

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Pokémon Pokédex: list of Pokémon with stats | Pokémon Database</title>
<link rel="preconnect" href="https://img.pokemondb.net">
<style>@font-face{font-family:"Fira Sans";font-style:normal;font-weight:400;font-display:swap;src:url("/static/fonts/fira-sans-v10-latin-400.woff2") format("woff2");unicode-range:U+0000-00FF,U+0131,U+0152-0153,U+02BB-02BC,U+02C6,U+02DA,U+02DC,U+2000-206F,U+2074,U+20AC,U+2122,U+2191,U+2193,U+2212,U+2215,U+FEFF,U+FFFD}@font-face{font-family:"Fira Sans";font-style:italic;font-weight:400;font-display:swap;src:url("/static/fonts/fira-sans-v10-latin-400i.woff2") format("woff2");unicode-range:U+0000-00FF,U+0131,U+0152-0153,U+02BB-02BC,U+02C6,U+02DA,U+02DC,U+2000-206F,U+2074,U+20AC,U+2122,U+2191,U+2193,U+2212,U+2215,U+FEFF,U+FFFD}@font-face{font-family:"Fira Sans";font-style:normal;font-weight:700;font-display:swap;src:url("/static/fonts/fira-sans-v10-latin-600.woff2") format("woff2");unicode-range:

In [14]:
# <table id="pokedex" class="data-table block-wide">
table = page.find('table', id = 'pokedex')
for item in table.findAll('a', {'class':'ent-name'})[:10]:
    print(item)

<a class="ent-name" href="/pokedex/bulbasaur" title="View Pokedex for #001 Bulbasaur">Bulbasaur</a>
<a class="ent-name" href="/pokedex/ivysaur" title="View Pokedex for #002 Ivysaur">Ivysaur</a>
<a class="ent-name" href="/pokedex/venusaur" title="View Pokedex for #003 Venusaur">Venusaur</a>
<a class="ent-name" href="/pokedex/venusaur" title="View Pokedex for #003 Venusaur">Venusaur</a>
<a class="ent-name" href="/pokedex/charmander" title="View Pokedex for #004 Charmander">Charmander</a>
<a class="ent-name" href="/pokedex/charmeleon" title="View Pokedex for #005 Charmeleon">Charmeleon</a>
<a class="ent-name" href="/pokedex/charizard" title="View Pokedex for #006 Charizard">Charizard</a>
<a class="ent-name" href="/pokedex/charizard" title="View Pokedex for #006 Charizard">Charizard</a>
<a class="ent-name" href="/pokedex/charizard" title="View Pokedex for #006 Charizard">Charizard</a>
<a class="ent-name" href="/pokedex/squirtle" title="View Pokedex for #007 Squirtle">Squirtle</a>


In [15]:
pokemons = {}  # { name : url}
table = page.find('table', id = 'pokedex')
for item in table.findAll('a', {'class':'ent-name'}):
    name = item.getText()
    url = item.get("href")
    pokemons[name] = url
len(pokemons)

898

In [16]:
#pokemons

In [20]:
#### Retrieving the information for all Pokémon
infos = {}  # { pokemon name : features }


# Reduced dictionary, to limit the number of requests sent to the site
# >>>
keys_to_extract = ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Charmander']
pokemons_light = {key: pokemons[key] for key in keys_to_extract}
pokemons = pokemons_light
# <<<

for name, end_of_url in pokemons.items():
    url_pokemon = "https://pokemondb.net" + end_of_url
    infos[name] = get_features(url_pokemon)
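The loop above builds each page address by string concatenation, which works because every `href` collected from the list starts with `/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library handles relative and already-absolute links alike. `BASE_URL` and `absolute_url` are hypothetical names introduced for this sketch.

```python
from urllib.parse import urljoin

BASE_URL = "https://pokemondb.net"  # assumed site root, as used in the loop above

def absolute_url(href, base=BASE_URL):
    """Turn a link found in the page into an absolute URL."""
    return urljoin(base, href)

print(absolute_url("/pokedex/bulbasaur"))
# An already-absolute link is returned unchanged
print(absolute_url("https://img.pokemondb.net/sprites/sword-shield/icon/bulbasaur.png"))
```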

In [21]:
# Display two items of the dictionary
# { Pokémon name : {feature name: value}
#   }
list(infos.items())[0:2]

[('Bulbasaur',
  {'National №': '001',
   'Type': '\nGrass Poison ',
   'Species': 'Seed Pokémon',
   'Height': '0.7\xa0m (2′04″)',
   'Weight': '6.9\xa0kg (15.2\xa0lbs)',
   'Abilities': '1. OvergrowChlorophyll (hidden ability)',
   'Local №': "001 (Red/Blue/Yellow)226 (Gold/Silver/Crystal)001 (FireRed/LeafGreen)231 (HeartGold/SoulSilver)080 (X/Y — Central Kalos)001 (Let's Go Pikachu/Let's Go Eevee)068 (The Isle of Armor)",
   'EV yield': '\n1 Special Attack ',
   'Catch rate': '\n45 (5.9% with PokéBall, full HP)\n',
   None: '\n50 (normal)\n',
   'Base Exp.': '64',
   'Growth Rate': 'Medium Slow',
   'Egg Groups': 'Grass, Monster',
   'Gender': '87.5% male, 12.5% female',
   'Egg cycles': '20 (4,884–5,140 steps)\n',
   'HP': '45',
   'Attack': '49',
   'Defense': '49',
   'Sp. Atk': '65',
   'Sp. Def': '65',
   'Speed': '45',
   'Total': '318'}),
 ('Ivysaur',
  {'National №': '002',
   'Type': '\nGrass Poison ',
   'Species': 'Seed Pokémon',
   'Height': '1.0\xa0m (3′03″)',
   'Weigh

#### Storing the data in a DataFrame
The scraped values contain quite a few special characters (newlines, non-breaking spaces `\xa0`) that could be cleaned up.
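One possible way to clean these values, given here as a minimal sketch rather than the notebook's own method, is to collapse every run of whitespace into a single space. `clean_value` is a hypothetical helper, and the sample strings are taken from the outputs shown earlier.

```python
def clean_value(text):
    """Collapse whitespace runs (newlines, non-breaking spaces, ...) into one space."""
    # str.split() without arguments splits on any Unicode whitespace, '\xa0' included
    return " ".join(text.split())

raw = {
    "Type": "\nGrass Poison ",
    "Height": "0.7\xa0m (2′04″)",
    "Egg cycles": "20 (4,884–5,140 steps)\n",
}
cleaned = {key: clean_value(value) for key, value in raw.items()}
print(cleaned)
```

The same function could be applied to every value of `infos` before building the DataFrame.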

In [22]:
import pandas as pd
df = pd.DataFrame(infos).T
df.head()

|  | National № | Type | Species | Height | Weight | Abilities | Local № | EV yield | Catch rate | NaN | ... | Egg Groups | Gender | Egg cycles | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bulbasaur | 1 | \nGrass Poison | Seed Pokémon | 0.7 m (2′04″) | 6.9 kg (15.2 lbs) | 1. OvergrowChlorophyll (hidden ability) | 001 (Red/Blue/Yellow)226 (Gold/Silver/Crystal)... | \n1 Special Attack | \n45 (5.9% with PokéBall, full HP)\n | \n50 (normal)\n | ... | Grass, Monster | 87.5% male, 12.5% female | 20 (4,884–5,140 steps)\n | 45 | 49 | 49 | 65 | 65 | 45 | 318 |
| Ivysaur | 2 | \nGrass Poison | Seed Pokémon | 1.0 m (3′03″) | 13.0 kg (28.7 lbs) | 1. OvergrowChlorophyll (hidden ability) | 002 (Red/Blue/Yellow)227 (Gold/Silver/Crystal)... | \n1 Special Attack, 1 Special Defense | \n45 (5.9% with PokéBall, full HP)\n | \n50 (normal)\n | ... | Grass, Monster | 87.5% male, 12.5% female | 20 (4,884–5,140 steps)\n | 60 | 62 | 63 | 80 | 80 | 60 | 405 |
| Venusaur | 3 | \nGrass Poison | Seed Pokémon | 2.0 m (6′07″) | 100.0 kg (220.5 lbs) | 1. OvergrowChlorophyll (hidden ability) | 003 (Red/Blue/Yellow)228 (Gold/Silver/Crystal)... | \n2 Special Attack, 1 Special Defense | \n45 (5.9% with PokéBall, full HP)\n | \n50 (normal)\n | ... | Grass, Monster | 87.5% male, 12.5% female | 20 (4,884–5,140 steps)\n | 80 | 82 | 83 | 100 | 100 | 80 | 525 |
| Charmander | 4 | \nFire | Lizard Pokémon | 0.6 m (2′00″) | 8.5 kg (18.7 lbs) | 1. BlazeSolar Power (hidden ability) | 004 (Red/Blue/Yellow)229 (Gold/Silver/Crystal)... | \n1 Speed | \n45 (5.9% with PokéBall, full HP)\n | \n50 (normal)\n | ... | Dragon, Monster | 87.5% male, 12.5% female | 20 (4,884–5,140 steps)\n | 39 | 52 | 43 | 60 | 50 | 65 | 309 |
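Columns such as Height, Weight or Catch rate still hold text, with the number followed by a unit or a comment. If numeric values are wanted, one option (a sketch under that assumption, with the hypothetical helper `leading_number`) is to extract the first number of each string with a regular expression.

```python
import re

def leading_number(text):
    """Return the first number found in `text` as a float, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

# Sample strings copied from the outputs above
print(leading_number("0.7\xa0m (2′04″)"))          # height in metres
print(leading_number("6.9\xa0kg (15.2\xa0lbs)"))    # weight in kilograms
print(leading_number("\n45 (5.9% with PokéBall, full HP)\n"))
```

With pandas, such a function could be applied to a column with `df["Height"].map(leading_number)`.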


#### Retrieving the image of each Pokémon and saving the images in a folder

In [23]:
# List of the image URLs
url_pokemons_list = "https://pokemondb.net/pokedex/all"
icon_url_list = []  # List of the URLs of the Pokémon images (as icons)

req = Request(url_pokemons_list, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read().decode('utf-8')
page = bs4.BeautifulSoup(html, 'html.parser')

table = page.find('table', id = 'pokedex')
for item in table.findAll('span', {'class':'img-fixed icon-pkmn'}):
    icon_url_list.append(item.get("data-src"))
icon_url_list[:10]

['https://img.pokemondb.net/sprites/sword-shield/icon/bulbasaur.png',
 'https://img.pokemondb.net/sprites/sword-shield/icon/ivysaur.png',
 'https://img.pokemondb.net/sprites/sword-shield/icon/venusaur.png',
 'https://img.pokemondb.net/sprites/sword-shield/icon/venusaur-mega.png',
 'https://img.pokemondb.net/sprites/sword-shield/icon/charmander.png',
 'https://img.pokemondb.net/sprites/sword-shield/icon/charmeleon.png',
 'https://img.pokemondb.net/sprites/sword-shield/icon/charizard.png',
 'https://img.pokemondb.net/sprites/sword-shield/icon/charizard-mega-x.png',
 'https://img.pokemondb.net/sprites/sword-shield/icon/charizard-mega-y.png',
 'https://img.pokemondb.net/sprites/sword-shield/icon/squirtle.png']

How to Download an Image Using Python: https://towardsdatascience.com/how-to-download-an-image-using-python-38a75cfa21c

In [24]:
## Importing Necessary Modules
import requests # to get image from the web
import shutil # to save it locally

## Set up the image URL and filename
image_url = "https://img.pokemondb.net/sprites/sword-shield/icon/bulbasaur.png"
filename = image_url.split("/")[-1]

# Open the url image, set stream to True, this will return the stream content.
r = requests.get(image_url, stream = True)

# Check if the image was retrieved successfully
if r.status_code == 200:
    # Set decode_content value to True, otherwise the downloaded image file's size will be zero.
    r.raw.decode_content = True
    
    # Open a local file with wb ( write binary ) permission.
    with open(filename,'wb') as f:
        shutil.copyfileobj(r.raw, f)
        
    print('Image successfully downloaded: ', filename)
else:
    print('Image couldn\'t be retrieved')

Image successfully downloaded:  bulbasaur.png
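`shutil.copyfileobj` works with any pair of file-like objects, which is why it can copy `r.raw` straight into a local file. The sketch below illustrates this offline by replacing the network stream with an in-memory `io.BytesIO` buffer; `save_stream` is a hypothetical helper and the bytes are placeholder data, not a real image.

```python
import io
import os
import shutil
import tempfile

def save_stream(stream, path):
    """Copy a readable binary stream to `path` and return the number of bytes written."""
    with open(path, "wb") as f:
        shutil.copyfileobj(stream, f)
    return os.path.getsize(path)

# Stand-in for r.raw: placeholder bytes, not an actual PNG
fake_image = io.BytesIO(b"\x89PNG fake bytes")

with tempfile.TemporaryDirectory() as tmp:
    size = save_stream(fake_image, os.path.join(tmp, "bulbasaur.png"))

print(size)
```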


Saving all the images in an `icon` folder in the current directory

In [25]:
# Create the `icon` folder (no error if it already exists)
import os
os.makedirs('icon', exist_ok=True)

# Download the images
for icon_url in icon_url_list:
    filename = icon_url.split("/")[-1]

    # Open the url image, set stream to True, this will return the stream content.
    r = requests.get(icon_url, stream = True)

    # Check if the image was retrieved successfully
    if r.status_code == 200:
        # Set decode_content value to True, otherwise the downloaded image file's size will be zero.
        r.raw.decode_content = True
    
        # Open a local file with wb ( write binary ) permission.
        with open(os.path.join('icon',filename) ,'wb') as f:
            shutil.copyfileobj(r.raw, f)

KeyboardInterrupt: 
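The `KeyboardInterrupt` above shows that the download loop was stopped before it finished. One way to make such a loop resumable, sketched here under the assumption that a file already present in the folder is complete, is to compute the list of icons still missing before downloading anything. `icon_path` and `pending_downloads` are hypothetical helpers, and the example runs offline in a temporary folder.

```python
import os
import tempfile
from urllib.parse import urlsplit

def icon_path(icon_url, folder="icon"):
    """Local path where the icon at `icon_url` would be saved."""
    filename = os.path.basename(urlsplit(icon_url).path)
    return os.path.join(folder, filename)

def pending_downloads(icon_urls, folder="icon"):
    """Return (url, path) pairs that are neither duplicated nor already on disk."""
    seen = set()
    todo = []
    for url in icon_urls:
        path = icon_path(url, folder)
        if path in seen or os.path.exists(path):
            continue
        seen.add(path)
        todo.append((url, path))
    return todo

urls = [
    "https://img.pokemondb.net/sprites/sword-shield/icon/bulbasaur.png",
    "https://img.pokemondb.net/sprites/sword-shield/icon/ivysaur.png",
    "https://img.pokemondb.net/sprites/sword-shield/icon/bulbasaur.png",  # duplicate
]

with tempfile.TemporaryDirectory() as tmp:
    # Pretend ivysaur.png was saved during an earlier, interrupted run
    open(os.path.join(tmp, "ivysaur.png"), "wb").close()
    todo = pending_downloads(urls, folder=tmp)

print([os.path.basename(path) for _, path in todo])
```

The download loop would then iterate over `todo` only, optionally with a short `time.sleep` between requests to avoid overloading the site.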