# `scrape_pokedex_text_simple`

This program scrapes Pokédex text from 
[Bulbapedia](https://bulbapedia.bulbagarden.net/wiki/Main_Page)
and saves it to a text file. The text is cleaned/trimmed but
otherwise unedited.

## Preamble / Setup
- Importing libraries
- Building the list of Pokémon names
    * this depends on [`pokemon.json`](https://github.com/fanzeyi/pokemon.json),
    which I have included in the main directory

In [1]:
# All the stuff to import
import json
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re
import time

In [2]:
# get the list of pokemon names
# (the with-block closes the file once the JSON has been read)
with open('pokemon_json/pokedex.json', encoding='utf-8') as poke_names_json:
    datastore = json.load(poke_names_json)
names = [item['name']['english'] for item in datastore]

In [3]:
# some pokemon names contain spaces or special characters that break the
# link-building function; list them here so we can check for them later
space_names = [names[121], names[438], names[771], names[784], names[785], names[786], names[787]]
nidorans = [names[28], names[31]]
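
The hard-coded indices above stop working if the order of `pokedex.json` ever changes. Below is a minimal sketch of an alternative that selects the names by their content. It assumes the indexed entries are exactly the names that contain a space, plus the two Nidoran forms. `find_special_names` is a hypothetical helper, not part of the original notebook.

```python
def find_special_names(names):
    """Return (names containing a space, names starting with 'Nidoran')."""
    # Assumption: the names that need special handling are those with a space
    # in them and the two Nidoran forms.
    space_names = [name for name in names if " " in name]
    nidorans = [name for name in names if name.startswith("Nidoran")]
    return space_names, nidorans


# Small illustrative sample instead of the full list from pokedex.json
sample_names = ["Pikachu", "Mr. Mime", "Nidoran♀", "Nidoran♂", "Tapu Koko"]
sample_space_names, sample_nidorans = find_special_names(sample_names)
print(sample_space_names)
print(sample_nidorans)
```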

# Defining functions
- `get_bulbapedia_link(pokemon)`
    * takes a pokémon name and returns a link to the pokémon's Bulbapedia page
- `soup_it(url)`
    * takes a URL and returns the page text, as extracted by Beautiful Soup's `get_text`
    * sends a browser-like `User-Agent` header, a trick I found somewhere that
    makes it less likely to get blocked; it still gets blocked on occasion, and
    then I just have to restart the kernel
- `trim_to_dex1(words)` and `trim_to_dex2(words)`
    * these take the result of `soup_it` and trim it down to the Pokédex section.
    * I tried once to merge them into a single function, but it wasn't working
    correctly, so I kept them as two
- `re_trim(text)`
    * does regex-related data cleaning
    * may be removed later if I want to build a more sophisticated data
    structure, e.g., a dictionary that pairs each Pokédex entry with its
    associated game title.

In [17]:
# creates the link for Bulbapedia
def get_bulbapedia_link(pokemon):
    if pokemon == "Nidoran♀": # because Nidoran F has a strange character
        pokelink = "Nidoran%E2%99%80"
    elif pokemon == "Nidoran♂": # because Nidoran M has a strange character
        pokelink = "Nidoran%E2%99%82"
    elif pokemon == "Flabébé": # because Flabébé has é characters that apparently cause problems
        pokelink = "Flab%C3%A9b%C3%A9"
    elif pokemon in space_names:
        pokelink = re.sub(r'\s','_', pokemon)
    else:
        pokelink = pokemon
    thelink = "https://bulbapedia.bulbagarden.net/wiki/{}_(Pok%C3%A9mon)".format(pokelink)
    return thelink
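
The special cases above are all instances of percent-encoding, so the standard library can generate them. Below is a minimal sketch, assuming Bulbapedia page names are the English name with spaces replaced by underscores. `build_bulbapedia_link` is a hypothetical alternative, not a drop-in replacement that has been tested against every page.

```python
from urllib.parse import quote


def build_bulbapedia_link(pokemon):
    """Build a Bulbapedia URL by percent-encoding the Pokémon's name."""
    # Spaces become underscores, as in the space_names branch of the original.
    page_name = pokemon.replace(" ", "_")
    # quote() encodes non-ASCII characters as UTF-8 percent escapes.
    # ':' and "'" are left alone so the result matches what the original
    # function produces for names that contain them.
    encoded = quote(page_name, safe=":'")
    return "https://bulbapedia.bulbagarden.net/wiki/{}_(Pok%C3%A9mon)".format(encoded)


print(build_bulbapedia_link("Nidoran♀"))
print(build_bulbapedia_link("Flabébé"))
```

For `"Nidoran♀"` and `"Flabébé"` this produces the same escapes (`%E2%99%80` and `%C3%A9`) that are hard-coded in `get_bulbapedia_link`.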

In [5]:
# soup_it:
# input url
# return result of Beautiful Soup's get_text on that url
def soup_it(url):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    page = urlopen(req)
    soup = BeautifulSoup(page, 'lxml')
    return soup.get_text()
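
Since requests are occasionally blocked, one option is to retry with a pause instead of restarting the kernel. Below is a minimal sketch, assuming a blocked request surfaces as a `urllib.error.URLError` (of which `HTTPError` is a subclass). `fetch_with_retries` and its parameters are hypothetical names, and `fetch` stands in for any function that takes a URL, such as `soup_it`.

```python
import time
from urllib.error import URLError


def fetch_with_retries(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying on URLError and doubling the wait each time."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except URLError:
            # Give up after the last attempt and let the error propagate.
            if attempt == retries - 1:
                raise
            # Wait delay, then 2*delay, then 4*delay, ... before retrying.
            time.sleep(delay * (2 ** attempt))
```

With the notebook's functions in scope, a call might look like `fetch_with_retries(soup_it, get_bulbapedia_link("Pikachu"))`.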

In [6]:
# trim_to_dex1
# input result of soup_it
# return result trimmed (but not quite enough)
def trim_to_dex1(words):
    words = re.sub(r'\n\s*\n', '\n\n', words)
#     text_ii = text.find("From Bulbapedia")
#     text_oo = text.find("Retrieved from")
#     text = text[text_ii:text_oo]
    words_i = words.find("\nGame data")
    words_o = words.find("\nGame locations")
    return words[words_i:words_o]

# trim_to_dex2
# I tried to combine this with trim_to_dex1, and it wasn't working
# Input: result of trim_to_dex1
# Return: previous text, trimmed to just the essentials
def trim_to_dex2(words):
    words_i = words.find("\nPokédex entries")
    return words[words_i:]
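
One thing to watch for in both functions: `str.find` returns `-1` when a marker is missing, and the slice then quietly returns the wrong text instead of failing. Below is a minimal sketch of a combined version that checks for this. It uses the same three marker strings as above. `trim_to_dex` is a hypothetical name, and returning an empty string for a missing section is an assumption about the desired behavior.

```python
import re


def trim_to_dex(words):
    """Trim page text to the part from 'Pokédex entries' up to 'Game locations'."""
    # Collapse blank-line runs, as trim_to_dex1 does.
    words = re.sub(r'\n\s*\n', '\n\n', words)
    start = words.find("\nGame data")
    if start == -1:
        return ""  # no game data section on this page
    # Look for the end marker only after the start marker.
    end = words.find("\nGame locations", start)
    section = words[start:end] if end != -1 else words[start:]
    dex_start = section.find("\nPokédex entries")
    if dex_start == -1:
        return ""  # no Pokédex entries heading in the section
    return section[dex_start:]


sample_page = "Intro\n\n\nGame data\nPokédex entries\n\n Entry one.\n\nGame locations\nSomewhere"
print(repr(trim_to_dex(sample_page)))
```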

In [7]:
# re_trim(dext)
# input: trimmed text from trim_to_dex2
# output: just Pokédex entries;
#    removes extra spaces, game names, etc.
#    removes placeholder lines such as "This Pokémon has no Pokédex entries"
def re_trim(dext):
    # Removes any line with text that does not end in a period (ie, any line 
    # that is not from the Pokédex itself)
    dext = re.sub(r'[^\.]*\n', '\n', dext)
    # OBSOLETE (due to first re.sub call); removes all pokedex entry numbers
    # dext = re.sub(r'[A-Z]\w+ #\S{1,3}','\n', dext)
    # OBSOLETE (due to first re.sub call); removes generation titles
    # dext = re.sub(r'\nGeneration [VI]{1,4}', '\n', dext)
    # Removes "this pokemon has no pokedex entries"
    dext = re.sub(r'This Pokémon has no Pok.*\.\n', '\n', dext)
    # Removes "this pokemon was unavailable"
    dext = re.sub(r'This Pokémon was unavailable prior.*\.\n', '\n', dext)
    # replaces any whitespace character followed by a space with a line break,
    # which strips the leading space from each entry
    dext = re.sub(r'\s ', '\n', dext)
    # collapses runs of line breaks into a single line break
    dext = re.sub(r'\n{2,}', '\n', dext)
    return dext
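
`re_trim` throws away the game titles. If a richer structure is wanted later, the entries could be paired with their titles first. Below is a minimal sketch, assuming the layout seen in the output of `trim_to_dex2`: each entry ends in a period, and the non-entry line just before it is the game title. `pair_games_with_entries` is a hypothetical helper, and entries that end in other punctuation would be missed.

```python
def pair_games_with_entries(dex_text):
    """Map each game title to the Pokédex entry that follows it."""
    entries = {}
    last_label = None
    for raw_line in dex_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # Skip placeholder sentences such as "This Pokémon has no Pokédex entries ...".
        if line.startswith("This Pokémon"):
            continue
        if line.endswith("."):
            # A sentence: treat it as the entry for the most recent label.
            if last_label is not None:
                entries[last_label] = line
        else:
            # Anything else (game title, generation heading, dex number) is a label.
            last_label = line
    return entries


sample_text = " Generation VI\n\n X \n\n It draws out power.\n\n Y \n\n It floats in the wind.\n"
print(pair_games_with_entries(sample_text))
```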

# Debug function
`dex_entries_debug` runs each step of the `dex_entries` pipeline (which pulls
all the Pokédex entries for a given Pokémon) and prints the intermediate
results. The pipeline works now, but I am keeping the debug function in case
I decide to change things at a later date.

In [8]:
def dex_entries_debug(pokemon):
    # debug get_bulbapedia_link
    bulblink = get_bulbapedia_link(pokemon)
    print(bulblink)
    # debug the BeautifulSoup function
    blsoup = soup_it(bulblink)
    # print(blsoup)
    # debug the function trimming the text to just pokedex entries
    just_dex = trim_to_dex1(blsoup)
    print(just_dex)
    # debug the more trim function
    just_dex2 = trim_to_dex2(just_dex)
    print(just_dex2)
    # debug the function that removes extraneous text (via regex)
    just_dex_clean = re_trim(just_dex2)
    print(just_dex_clean)

In [18]:
print(names[668])
dex_entries_debug(names[668])

Flabébé
https://bulbapedia.bulbagarden.net/wiki/Flab%C3%A9b%C3%A9_(Pok%C3%A9mon)

Game data
Pokédex entries

 This Pokémon was unavailable prior to Generation VI.

 Generation VI

 KalosCentral #068

 Hoenn #—

 X 

 It draws out and controls the hidden power of flowers. The flower Flabébé holds is most likely part of its body.

 Y 

 When it finds a flower it likes, it dwells on that flower its whole life long. It floats in the wind's embrace with an untroubled heart.

 Omega Ruby 

 It draws out and controls the hidden power of flowers. The flower Flabébé holds is most likely part of its body.

 Alpha Sapphire 

 When it finds a flower it likes, it dwells on that flower its whole life long. It floats in the wind's embrace with an untroubled heart.

 Generation VII

 AlolaUSUM: #100

 Kanto #—

 This Pokémon has no Pokédex entries in Sun, Moon, Let's Go, Pikachu! and Let's Go, Eevee!‎.

 Ultra Sun 

 It's not safe without the power of a flower, but it will keep traveling around until 

# `dex_entries(pokemon)`
Given a pokemon, return its Pokédex entries from Bulbapedia!
- relies on `get_bulbapedia_link`, `soup_it`, `trim_to_dex1`, `trim_to_dex2`, `re_trim`

In [9]:
def dex_entries(pokemon):
    words = soup_it(get_bulbapedia_link(pokemon))
    words = trim_to_dex1(words)
    words = trim_to_dex2(words)
    words = re_trim(words)
    return words

# Writes all pokédex text for specified pokémon to "pokedex_gen1all.txt"

This loop does the following things:
- Loop through specified pokémon names
- scrape Bulbapedia for the text from their Pokédex entries
- dump the text into a big text file

You can specify that it include the Pokémon's name before each set of entries.
You might want this because many Pokédex entries (~80%) refer to the Pokémon
only by pronouns, never by name.

In case you want to pull Pokédex entries for a specific generation of Pokémon,
here are the slices of `names` to use:
- I: `[:151]`
- II: `[151:251]`
- III: `[251:386]`
- IV: `[386:493]`
- V: `[493:649]`
- VI: `[649:721]`
- VII: `[721:810]`
- VIII: `[810:890]` **As of 5/26/2020, these are unsupported; they are waiting
to be merged into pokemon.json**
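
The slices above can also be kept in a lookup table so that a generation is selected by its numeral. Below is a minimal sketch using the boundaries listed above. `GENERATION_SLICES` and `names_for_generation` are hypothetical names, and Generation VIII is left out because it is unsupported.

```python
# Slice boundaries copied from the list above (Generation VIII omitted).
GENERATION_SLICES = {
    "I": slice(0, 151),
    "II": slice(151, 251),
    "III": slice(251, 386),
    "IV": slice(386, 493),
    "V": slice(493, 649),
    "VI": slice(649, 721),
    "VII": slice(721, 810),
}


def names_for_generation(names, generation):
    """Return the part of names that belongs to the given generation numeral."""
    return names[GENERATION_SLICES[generation]]


# Stand-in list of 810 placeholder names, in place of the list from pokedex.json
placeholder_names = ["pokemon_{}".format(i) for i in range(810)]
print(len(names_for_generation(placeholder_names, "VI")))
```

With the real list, the loop could then iterate over `names_for_generation(names, "VI")` instead of a hand-typed slice.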

In [11]:
# If you want each set to be preceded by the pokemon name,
# set this to True
include_name = False

In [19]:
# this loop just writes the text directly to a text file
for pokemon in names[:151]:
    with open("pokedex_gen1all.txt", "a") as file:
        if include_name:
            file.write(pokemon)
        file.write(dex_entries(pokemon))
    print("Finished:", pokemon)

Finished: Chespin
Finished: Quilladin
Finished: Chesnaught
Finished: Fennekin
Finished: Braixen
Finished: Delphox
Finished: Froakie
Finished: Frogadier
Finished: Greninja
Finished: Bunnelby
Finished: Diggersby
Finished: Fletchling
Finished: Fletchinder
Finished: Talonflame
Finished: Scatterbug
Finished: Spewpa
Finished: Vivillon
Finished: Litleo
Finished: Pyroar
Finished: Flabébé
Finished: Floette
Finished: Florges
Finished: Skiddo
Finished: Gogoat
Finished: Pancham
Finished: Pangoro
Finished: Furfrou
Finished: Espurr
Finished: Meowstic
Finished: Honedge
Finished: Doublade
Finished: Aegislash
Finished: Spritzee
Finished: Aromatisse
Finished: Swirlix
Finished: Slurpuff
Finished: Inkay
Finished: Malamar
Finished: Binacle
Finished: Barbaracle
Finished: Skrelp
Finished: Dragalge
Finished: Clauncher
Finished: Clawitzer
Finished: Helioptile
Finished: Heliolisk
Finished: Tyrunt
Finished: Tyrantrum
Finished: Amaura
Finished: Aurorus
Finished: Sylveon
Finished: Hawlucha
Finished: Dedenne
Finish