# Web Scraping 101: Extracting Pokemon Info

The previous notebook demonstrated how to gather data on the Generation I Pokemon using `requests` and `bs4`. This notebook applies the same approach to every generation and saves the result as JSON.

### Import libraries

In [None]:
import requests
from bs4 import BeautifulSoup

# We want to pretty-print the JSON file for readability
import json

### Get the contents of the page to be scraped

In [None]:
# Define the URL from which to gather data

HOST = "https://bulbapedia.bulbagarden.net"
PATH = "/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number"
URL = HOST + PATH
data = requests.get(URL)

### Sanity Check \#1

Print the response HTML. If the request did not succeed, print the status code and the response body instead so the problem is visible.

In [None]:
if data.status_code == 200:
    # data.text is the decoded HTML; data.content is the raw bytes
    print(data.text)
else:
    print('Something went wrong:')
    print('Status code: ' + str(data.status_code))
    print('Response:')
    print(data.text)
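
As an alternative to checking `status_code` by hand, the fetch can be wrapped in a small helper that raises on HTTP errors. This is a minimal sketch, not part of the original notebook: `fetch_html`, its `timeout` default and its `session` parameter are hypothetical names introduced here for illustration.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Fetch a page and return its decoded HTML (hypothetical helper).

    Raises requests.HTTPError for 4xx/5xx responses instead of
    returning the error page.
    """
    # 'session' is an assumption: any object with a requests-style .get()
    client = session or requests
    response = client.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


# Usage (needs network access):
# html = fetch_html(URL)
```

Passing a `timeout` keeps the cell from hanging indefinitely if the server does not answer.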

### Create the parser and parse HTML

In [None]:
soup = BeautifulSoup(data.content, 'html.parser')

### Find the Pokedex tables

In [None]:
content = soup.find(id='mw-content-text')

# Select every table that immediately follows an h3.
# On this page, each table comes right after a header
# indicating which generation its Pokemon belong to.
all_pokemon = content.select('h3 + table')
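
To see what the `h3 + table` selector matches, here is a small sketch on a made-up HTML fragment (the `id` values and the fragment itself are assumptions for illustration, not the real page). The `+` is the CSS adjacent sibling combinator: it matches a table only when the element directly before it is an `h3`.

```python
from bs4 import BeautifulSoup

# Made-up fragment: two tables follow an h3, one follows a paragraph
demo_html = """
<div id="mw-content-text">
  <h3>Generation I</h3>
  <table id="gen1"></table>
  <p>Some other text</p>
  <table id="other"></table>
  <h3>Generation II</h3>
  <table id="gen2"></table>
</div>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")
demo_tables = demo_soup.find(id="mw-content-text").select("h3 + table")

# Only the tables directly preceded by an h3 are selected
print([table["id"] for table in demo_tables])
```

The table after the paragraph is skipped, which is why the selector picks up only the per-generation tables on a page laid out this way.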

### Sanity Check \#2

Confirm that the selector matched eight tables, one for each generation of Pokemon.

In [None]:
len(all_pokemon)

### Cleaning the list

The table contents include entries we do not want. The returned list contains stray entries like `'\n'`, as well as a header row that holds no Pokemon. Strictly speaking, we could skip these entries while reading the data instead of removing them, but cleaning the list first makes the extraction steps that follow simpler.

In [None]:
# Get the second generation Pokemon
gen = all_pokemon[1]
gen.contents

In [None]:
# Clear all '\n' from the list. It would be better to use functions like isspace(),
# but this will do.
gen_cleaned = list(filter(lambda x: x != '\n', gen.contents))
# Remove the first index; this is the header row which should not be included
gen_cleaned = gen_cleaned[1:]
gen_cleaned
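
The comment above mentions `isspace()` as the more robust option. A minimal sketch of that variant follows, run on a made-up table so it works offline; `drop_whitespace` is a hypothetical helper name, not something from the original notebook.

```python
from bs4 import BeautifulSoup, NavigableString


def drop_whitespace(nodes):
    """Return nodes without whitespace-only strings (hypothetical helper)."""
    return [
        node for node in nodes
        if not (isinstance(node, NavigableString) and node.isspace())
    ]


# Made-up table; note the '\n  \n' run that a plain != '\n' test would keep
demo_table_html = "<table>\n<tr><th>Name</th></tr>\n  \n<tr><td>Chikorita</td></tr>\n</table>"
demo_table = BeautifulSoup(demo_table_html, "html.parser").find("table")

# Drop whitespace strings, then skip the header row as before
demo_rows = drop_whitespace(demo_table.contents)[1:]
print([row.get_text(strip=True) for row in demo_rows])
```

Unlike the equality check, this also removes whitespace that is not exactly a single newline.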

### Reading the Pokemon data

Now, we can extract an entry from the table, which contains the *kdex*, *ndex* and name of the Pokemon, as well as its types and a link to its Wiki entry.

In [None]:
sample_pokemon = list(filter(lambda x: x != '\n', gen_cleaned[0].contents))
sample_pokemon

We can extract the required information into variables, which will later be collected into a dictionary for each Pokemon.

In [None]:
sample_kdex = sample_pokemon[0].text.strip()
sample_ndex = sample_pokemon[1].text.strip()
sample_name = sample_pokemon[3].text.strip()
sample_types = [cell.text.strip() for cell in sample_pokemon[4:]]
sample_url = HOST + sample_pokemon[3].find('a')['href']
sample_url
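
The index-based extraction above can be tried offline on a mock row. This is a sketch under assumptions: the row below is hand-written to mimic the column order the indices imply (regional number, national number, an image cell, the linked name, then the types), and it uses `find_all(..., recursive=False)` to get the cells directly instead of filtering out `'\n'`.

```python
from bs4 import BeautifulSoup

HOST = "https://bulbapedia.bulbagarden.net"

# Hand-written mock row; the real page's markup may differ
mock_row_html = """
<tr>
<td>#001</td>
<td>#152</td>
<td><img alt="Chikorita"></td>
<td><a href="/wiki/Chikorita_(Pok%C3%A9mon)">Chikorita</a></td>
<td>Grass</td>
</tr>
"""

mock_row = BeautifulSoup(mock_row_html, "html.parser").find("tr")

# Direct child cells only, so whitespace strings never appear
cells = mock_row.find_all(["td", "th"], recursive=False)

mock_kdex = cells[0].get_text(strip=True)
mock_ndex = cells[1].get_text(strip=True)
mock_name = cells[3].get_text(strip=True)
mock_types = [cell.get_text(strip=True) for cell in cells[4:]]
mock_url = HOST + cells[3].find("a")["href"]

print(mock_kdex, mock_ndex, mock_name, mock_types, mock_url)
```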

### Putting it all together

Now we can put everything into two functions. The first reads a single Pokemon entry, while the second parses a table and returns all the Pokemon in it. In effect, the second function returns all the Pokemon within a generation.

In [None]:
# Reads a single Pokemon from a table row 'entry'
def get_pokemon(entry):
    pokemon = list(filter(lambda x: x != '\n', entry.contents))
    kdex = pokemon[0].text.strip()
    ndex = pokemon[1].text.strip()
    name = pokemon[3].text.strip()
    types = [cell.text.strip() for cell in pokemon[4:]]
    url = HOST + pokemon[3].find('a')['href']
    return {
        'kdex': kdex,
        'ndex': ndex,
        'name': name,
        'types': types,
        'url': url
    }

# Reads all Pokemon from a table 'contents'
def get_pokemon_list(contents):
    contents_cleaned = list(filter(lambda x: x != '\n', contents.contents))
    # Remove the first index; this is the header row which should not be included
    contents_cleaned = contents_cleaned[1:]
    
    return [ get_pokemon(entry) for entry in contents_cleaned ]

Let's try these functions on the first table, which holds Generation I:

In [None]:
sample_list = get_pokemon_list(all_pokemon[0])
sample_list

### Saving to JSON

Now we are ready to write everything into a JSON file. All we need to do is loop over every table (yes, including Gen I) and compile all the Pokemon into a single list. Note that the result contains some duplicate entries. Their cause is not investigated here, and the data is good enough for now.

In [None]:
poke_json = []

for pokemon_table in all_pokemon:
    poke_json += get_pokemon_list(pokemon_table)

In [None]:
len(poke_json)
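
If the duplicate entries noted earlier get in the way, one possible cleanup is to keep only the first occurrence of each entry. This is a sketch, not part of the original notebook: `dedupe_pokemon` is a hypothetical helper, the choice of `ndex`, `name` and `types` as the identity key is an assumption, and the sample rows are hand-written for illustration.

```python
def dedupe_pokemon(entries):
    """Keep the first entry for each (ndex, name, types) combination."""
    seen = set()
    unique = []
    for entry in entries:
        # Lists are not hashable, so the types are converted to a tuple
        key = (entry["ndex"], entry["name"], tuple(entry["types"]))
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


# Hand-written sample: one exact repeat and one variant with other types
sample_entries = [
    {"ndex": "#019", "name": "Rattata", "types": ["Normal"]},
    {"ndex": "#019", "name": "Rattata", "types": ["Normal"]},
    {"ndex": "#019", "name": "Rattata", "types": ["Dark", "Normal"]},
]

unique_entries = dedupe_pokemon(sample_entries)
print(len(unique_entries))
```

Including the types in the key keeps entries that share a number and name but differ in typing, so only exact repeats are dropped.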

In [None]:
with open('pokemon.json', 'w') as f:
    json.dump(poke_json, f, indent=4)
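
By default, `json.dump` escapes non-ASCII characters as `\uXXXX` sequences, which makes names containing symbols harder to read in the file. The sketch below shows one way to keep them readable and to verify the file by loading it back. The sample entry, the temporary path and the variable names are assumptions for illustration.

```python
import json
import os
import tempfile

# Hand-written sample entry with a non-ASCII character in the name
sample_data = [{"name": "Nidoran♀", "types": ["Poison"]}]

# Write to a temporary directory so the real pokemon.json is untouched
sample_path = os.path.join(tempfile.mkdtemp(), "pokemon_sample.json")

with open(sample_path, "w", encoding="utf-8") as f:
    # ensure_ascii=False writes the character itself instead of an escape
    json.dump(sample_data, f, indent=4, ensure_ascii=False)

# Read the file back to confirm it round-trips
with open(sample_path, encoding="utf-8") as f:
    restored_data = json.load(f)

print(restored_data == sample_data)
```

Both forms load back to the same data, so this only affects how the file looks when opened in an editor.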