# BeautifulSoup: Creating the Pokédex with Webscraping
This Jupyter Notebook builds a `pandas` DataFrame from [the Pokémon Database](https://pokemondb.net/pokedex/all) and collects the URLs of the icon images, which are used to train a generative adversarial network in a separate Jupyter Notebook.

To begin, import `requests` and `bs4`. Request the URL and parse the returned HTML into a soup using the `lxml` parser.

In [1]:
import requests
from bs4 import BeautifulSoup

url = 'https://pokemondb.net/pokedex/all'
r = requests.get(url)
html_contents = r.text

html_soup = BeautifulSoup(html_contents, 'lxml')

The DataFrame is going to be built from `td` tags with the following `class` values: `cell-name`, `cell-icon`, `cell-total`, and `cell-num`. A good start would be `cell-name`:

In [2]:
cell_name = html_soup.findAll('td', {'class':'cell-name'})
cell_name[:5]

[<td class="cell-name"><a class="ent-name" href="/pokedex/bulbasaur" title="View pokedex for #001 Bulbasaur">Bulbasaur</a></td>,
 <td class="cell-name"><a class="ent-name" href="/pokedex/ivysaur" title="View pokedex for #002 Ivysaur">Ivysaur</a></td>,
 <td class="cell-name"><a class="ent-name" href="/pokedex/venusaur" title="View pokedex for #003 Venusaur">Venusaur</a></td>,
 <td class="cell-name"><a class="ent-name" href="/pokedex/venusaur" title="View pokedex for #003 Venusaur">Venusaur</a><br/> <small class="text-muted">Mega Venusaur</small></td>,
 <td class="cell-name"><a class="ent-name" href="/pokedex/charmander" title="View pokedex for #004 Charmander">Charmander</a></td>]

A regular expression can strip the HTML tags from the scraped cells. Some Pokémon have alternate forms (such as Mega Venusaur in the list above), and these are marked by `<small class="text-muted">`. Splitting on that marker makes it possible to format every name in the `cell-name` soup correctly.

In [3]:
import re 
from tqdm import tqdm

def strip_html(string):
    return re.compile(r'<[^>]+>').sub('', string)

number = []
pokemon = []

for idx in tqdm(range(len(cell_name))):
    entry = str(html_soup.findAll('td', {'class':'cell-name'})[idx]).split('-muted">')[-1]
    cleaned_name = strip_html(entry)
    pokemon.append(cleaned_name)
    
    num_entry = str(html_soup.findAll('td', {'class':'cell-name'})[idx]).split('#')[-1]
    number.append(num_entry[:3])

100%|██████████| 1034/1034 [05:36<00:00,  3.08it/s]


In [4]:
pokemon[:5]

['Bulbasaur', 'Ivysaur', 'Venusaur', 'Mega Venusaur', 'Charmander']

In [5]:
number[:5]

['001', '002', '003', '003', '004']
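The loop in cell 3 calls `findAll` again on every iteration, which is the likely cause of the multi-minute runtime shown in the progress bar. The sketch below is one possible alternative, not the notebook's method: it walks a cell list that was fetched once and reads the name and number from the tags directly. The helper `parse_name_cells` and the inline `sample_html` are hypothetical, the sample rows are copied from the output above, and `html.parser` is used only so the snippet runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Two rows copied from the cell-name output above (illustrative sample only).
sample_html = """
<table><tr>
<td class="cell-name"><a class="ent-name" href="/pokedex/venusaur" title="View pokedex for #003 Venusaur">Venusaur</a></td>
<td class="cell-name"><a class="ent-name" href="/pokedex/venusaur" title="View pokedex for #003 Venusaur">Venusaur</a><br/> <small class="text-muted">Mega Venusaur</small></td>
</tr></table>
"""


def parse_name_cells(cells):
    """Return (numbers, names) from an already-fetched list of cell-name tags."""
    numbers, names = [], []
    for cell in cells:
        link = cell.find('a')
        small = cell.find('small')
        # Prefer the alternate-form title in <small> when it is present.
        label = small if small is not None else link
        names.append(label.get_text(strip=True))
        # Assumes three-digit numbers after '#', as in the outputs above.
        numbers.append(link['title'].split('#')[-1][:3])
    return numbers, names


sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_cells = sample_soup.find_all('td', class_='cell-name')  # fetched once
sample_numbers, sample_names = parse_name_cells(sample_cells)
print(sample_numbers, sample_names)
```

In the notebook, the same helper could be called as `parse_name_cells(cell_name)`, since `cell_name` already holds the full list of cells.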

Each Pokémon has its type(s) listed in `cell-icon`, which can be cleaned from its HTML tags and placed into a list.

In [6]:
cell_icon = html_soup.findAll('td', {'class':'cell-icon'})
cell_icon[:5]

[<td class="cell-icon"><a class="type-icon type-grass" href="/type/grass">Grass</a><br/> <a class="type-icon type-poison" href="/type/poison">Poison</a></td>,
 <td class="cell-icon"><a class="type-icon type-grass" href="/type/grass">Grass</a><br/> <a class="type-icon type-poison" href="/type/poison">Poison</a></td>,
 <td class="cell-icon"><a class="type-icon type-grass" href="/type/grass">Grass</a><br/> <a class="type-icon type-poison" href="/type/poison">Poison</a></td>,
 <td class="cell-icon"><a class="type-icon type-grass" href="/type/grass">Grass</a><br/> <a class="type-icon type-poison" href="/type/poison">Poison</a></td>,
 <td class="cell-icon"><a class="type-icon type-fire" href="/type/fire">Fire</a><br/> </td>]

If a Pokémon has only a single type, stripping the HTML leaves trailing whitespace, so splitting on `' '` produces an extra `''` element in the list. To fix this, apply the `.strip()` method before the `.split(' ')`:

In [7]:
poke_type = []
for idx in tqdm(range(len(cell_icon))):
    types = strip_html(str(html_soup.findAll('td', {'class':'cell-icon'})[idx])).strip().split(' ')
    poke_type.append(types)

100%|██████████| 1034/1034 [02:49<00:00,  6.09it/s]


In [8]:
poke_type[:5]

[['Grass', 'Poison'],
 ['Grass', 'Poison'],
 ['Grass', 'Poison'],
 ['Grass', 'Poison'],
 ['Fire']]
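To make the whitespace issue concrete, the following minimal sketch runs the same regex on the Charmander cell from the output above, with and without `.strip()`. The variable names are illustrative only. It also shows that calling `.split()` with no argument splits on any run of whitespace, which would be an equivalent fix.

```python
import re


def strip_html(string):
    # Same pattern as the notebook's strip_html helper.
    return re.sub(r'<[^>]+>', '', string)


# Single-type cell copied from the cell-icon output above.
single = '<td class="cell-icon"><a class="type-icon type-fire" href="/type/fire">Fire</a><br/> </td>'

without_strip = strip_html(single).split(' ')        # trailing space yields ''
with_strip = strip_html(single).strip().split(' ')   # the notebook's fix
whitespace_split = strip_html(single).split()        # alternative: no argument

print(without_strip)
print(with_strip)
print(whitespace_split)
```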

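The `cell-num` cells hold two kinds of data. The Pokémon icon URLs are stored in the `data-src` attribute of the first cell in each row, and the six cells that follow hold the stat values for HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed, in that order. Both can be collected into lists.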

In [9]:
cell_num = html_soup.findAll('td', {'class':'cell-num'})
cell_num[:10]

[<td class="cell-num cell-fixed" data-sort-value="1"><span class="infocard-cell-img"><span class="img-fixed icon-pkmn" data-alt="Bulbasaur icon" data-src="https://img.pokemondb.net/sprites/sword-shield/icon/bulbasaur.png"></span></span><span class="infocard-cell-data">001</span></td>,
 <td class="cell-num">45</td>,
 <td class="cell-num">49</td>,
 <td class="cell-num">49</td>,
 <td class="cell-num">65</td>,
 <td class="cell-num">65</td>,
 <td class="cell-num">45</td>,
 <td class="cell-num cell-fixed" data-sort-value="2"><span class="infocard-cell-img"><span class="img-fixed icon-pkmn" data-alt="Ivysaur icon" data-src="https://img.pokemondb.net/sprites/sword-shield/icon/ivysaur.png"></span></span><span class="infocard-cell-data">002</span></td>,
 <td class="cell-num">60</td>,
 <td class="cell-num">62</td>]

The icon URLs occur in every 7th element, starting from the first, and can be extracted by converting the cell to a string and splitting it with `"` as the delimiter.

In [10]:
str(cell_num[0]).split('\"')[-4]

'https://img.pokemondb.net/sprites/sword-shield/icon/bulbasaur.png'

In [11]:
icon_url = []

for idx in range(len(cell_num)):
    if idx % 7 == 0:
        icon_url.append(str(cell_num[idx]).split('\"')[-4])

In [12]:
icon_url[:4]

['https://img.pokemondb.net/sprites/sword-shield/icon/bulbasaur.png',
 'https://img.pokemondb.net/sprites/sword-shield/icon/ivysaur.png',
 'https://img.pokemondb.net/sprites/sword-shield/icon/venusaur.png',
 'https://img.pokemondb.net/sprites/sword-shield/icon/venusaur-mega.png']
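Splitting on `"` depends on the exact order of the attributes in the markup. As a hedged alternative, the sketch below reads the `data-src` attribute through Beautiful Soup instead, selecting every `span` that carries that attribute. The `sample_html` fragment is trimmed from the `cell-num` output above, and `html.parser` is used only so the snippet runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Fragment trimmed from the cell-num output above (illustrative sample only).
sample_html = """
<table><tr>
<td class="cell-num cell-fixed" data-sort-value="1"><span class="infocard-cell-img"><span class="img-fixed icon-pkmn" data-alt="Bulbasaur icon" data-src="https://img.pokemondb.net/sprites/sword-shield/icon/bulbasaur.png"></span></span><span class="infocard-cell-data">001</span></td>
<td class="cell-num">45</td>
<td class="cell-num cell-fixed" data-sort-value="2"><span class="infocard-cell-img"><span class="img-fixed icon-pkmn" data-alt="Ivysaur icon" data-src="https://img.pokemondb.net/sprites/sword-shield/icon/ivysaur.png"></span></span><span class="infocard-cell-data">002</span></td>
</tr></table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# attrs={'data-src': True} matches any <span> that has a data-src attribute.
icon_spans = sample_soup.find_all('span', attrs={'data-src': True})
sample_icon_urls = [span['data-src'] for span in icon_spans]
print(sample_icon_urls)
```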

The other data likewise occur in every 7th element, counted from their respective starting positions: each HP stat is 7 elements after the previous HP stat, each ATK stat is 7 elements after the previous ATK stat, and so on. The following may not be the most elegant or Pythonic approach, but it captures the data in a form that can easily be added to a DataFrame later.

In [32]:
stats = ['HP', 'ATK', 'DEF', 'Sp.ATK', 'Sp.DEF', 'SPD']
stats_dic = {}

for count, stat in enumerate(stats):
    stats_dic[stat] = [strip_html(str(cell_num[idx + (count + 1)])) for idx in range(len(cell_num)) if idx % 7 == 0]

In [35]:
stats_dic['ATK'][:5]

['49', '62', '82', '100', '52']
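The every-7th-element pattern can also be written with extended slicing, `sequence[start::step]`. The sketch below is an illustration on a plain list, not the notebook's code: `flat_cells` is a hypothetical stand-in for the already-stripped `cell-num` values of the first two rows (Bulbasaur and Ivysaur), and `ROW_WIDTH` is an assumed name for the number of `cell-num` cells per row.

```python
STATS = ['HP', 'ATK', 'DEF', 'Sp.ATK', 'Sp.DEF', 'SPD']
ROW_WIDTH = len(STATS) + 1  # one number/icon cell followed by six stat cells

# Stand-in for the stripped cell-num values of the first two rows.
flat_cells = [
    '001', '45', '49', '49', '65', '65', '45',
    '002', '60', '62', '63', '80', '80', '60',
]

# Offset 0 is the number/icon cell, so the stats start at offset 1.
stats_by_name = {
    stat: flat_cells[offset::ROW_WIDTH]
    for offset, stat in enumerate(STATS, start=1)
}
print(stats_by_name['ATK'])
```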

The totals can be obtained in two ways: by summing the six stats above for each entry, or by scraping the `cell-total` cells directly. The second option is used here.

In [41]:
cell_total = html_soup.findAll('td', {'class':'cell-total'})
total = [strip_html(str(cell)) for cell in cell_total]
total[:5]

['318', '405', '525', '625', '309']
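The summation route is also a cheap consistency check on the scraped totals. The following sketch assumes the same dictionary-of-lists layout as `stats_dic`. The names `stats_sample` and `total_sample` are hypothetical, and the values are the Bulbasaur, Ivysaur, and Charmander rows from this notebook.

```python
# Sample rows: Bulbasaur, Ivysaur, Charmander (strings, as scraped).
stats_sample = {
    'HP':     ['45', '60', '39'],
    'ATK':    ['49', '62', '52'],
    'DEF':    ['49', '63', '43'],
    'Sp.ATK': ['65', '80', '60'],
    'Sp.DEF': ['65', '80', '50'],
    'SPD':    ['45', '60', '65'],
}
total_sample = ['318', '405', '309']

# zip(*...) turns the per-stat columns into per-Pokémon rows.
computed_totals = [
    sum(int(value) for value in row)
    for row in zip(*stats_sample.values())
]

# Indices where the summed stats disagree with the scraped total.
mismatches = [
    idx
    for idx, (computed, scraped) in enumerate(zip(computed_totals, total_sample))
    if computed != int(scraped)
]
print(computed_totals, mismatches)
```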

Now that all of the necessary information has been scraped from the Pokémon database, the Pokédex can be created as a `pandas` DataFrame.

In [57]:
import pandas as pd

pokedex = pd.DataFrame()

pokedex['#'] = number
pokedex['Pokémon'] = pokemon
pokedex['Type'] = poke_type

for stat in stats_dic:
    pokedex[stat] = stats_dic[stat]
    
pokedex['Total'] = total
pokedex['URLs'] = icon_url

In [58]:
pokedex.head()

|  | # | Pokémon | Type | HP | ATK | DEF | Sp.ATK | Sp.DEF | SPD | Total | URLs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | [Grass, Poison] | 45 | 49 | 49 | 65 | 65 | 45 | 318 | https://img.pokemondb.net/sprites/sword-shield... |
| 1 | 2 | Ivysaur | [Grass, Poison] | 60 | 62 | 63 | 80 | 80 | 60 | 405 | https://img.pokemondb.net/sprites/sword-shield... |
| 2 | 3 | Venusaur | [Grass, Poison] | 80 | 82 | 83 | 100 | 100 | 80 | 525 | https://img.pokemondb.net/sprites/sword-shield... |
| 3 | 3 | Mega Venusaur | [Grass, Poison] | 80 | 100 | 123 | 122 | 120 | 80 | 625 | https://img.pokemondb.net/sprites/sword-shield... |
| 4 | 4 | Charmander | [Fire] | 39 | 52 | 43 | 60 | 50 | 65 | 309 | https://img.pokemondb.net/sprites/sword-shield... |
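The scraped stats and totals are still strings (note the quotes in the `stats_dic['ATK']` output earlier), so sorting or arithmetic on those columns would behave as text operations. One possible follow-up step, sketched here on a small hypothetical `sample` frame rather than the full `pokedex`, is to convert the numeric columns with `pd.to_numeric` while leaving the zero-padded `#` column as text.

```python
import pandas as pd

# Small stand-in for the pokedex frame, with values as scraped (strings).
sample = pd.DataFrame({
    '#': ['001', '002'],
    'Pokémon': ['Bulbasaur', 'Ivysaur'],
    'HP': ['45', '60'],
    'Total': ['318', '405'],
})

# Convert only the stat columns; '#' stays a string to keep its leading zeros.
numeric_cols = ['HP', 'Total']
sample[numeric_cols] = sample[numeric_cols].apply(pd.to_numeric)

print(sample.dtypes)
print(sample['HP'].sum())
```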


Finally, the `pandas` DataFrame can be exported to a `.csv` file.

In [59]:
pokedex.to_csv('pokedex.csv')
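A few details are worth knowing when the file is read back in the separate notebook. By default `to_csv` writes the row index as an extra unnamed first column, `read_csv` parses a value like `001` as the integer 1, and a list-valued column such as `Type` comes back as a string. The sketch below shows one way to handle all three, using an in-memory buffer and a hypothetical two-row `sample` frame in place of `pokedex.csv`.

```python
import ast
import io

import pandas as pd

sample = pd.DataFrame({
    '#': ['001', '002'],
    'Pokémon': ['Bulbasaur', 'Ivysaur'],
    'Type': [['Grass', 'Poison'], ['Grass', 'Poison']],
})

# index=False leaves the row index out of the file.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(
    buffer,
    dtype={'#': str},                        # keep leading zeros
    converters={'Type': ast.literal_eval},   # "['Grass', 'Poison']" -> list
)
print(restored)
```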

![](https://img.pokemondb.net/sprites/sword-shield/icon/charizard.png)