# `scrape_pokedex_text_bolstered`

This program scrapes Pokédex text from
[Bulbapedia](https://bulbapedia.bulbagarden.net/wiki/Main_Page)
and does the following:
* trims/cleans it (i.e., reduces it to just the text needed)
* parses it into a `spaCy` object segmented by entry
* for each entry, checks whether the Pokémon's name is in
    the entry
    * if not, it searches the entry for a few key phrases and
    replaces the first one with the name of the Pokémon
* then writes the entry, now containing the name of the Pokémon, to
    a text file.

## Preamble / Setup
- Importing libraries
- Building the list of Pokémon names (at least)
    * this depends on [`pokemon.json`](https://github.com/fanzeyi/pokemon.json),
    which I have included in the main directory

In [1]:
# All the stuff to import
import json
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re
import time

In [2]:
# get the list of Pokémon names
# (the with-block closes the file once the JSON has been loaded)
with open('pokemon_json/pokedex.json', encoding='utf-8') as poke_names_json:
    datastore = json.load(poke_names_json)
names = [item['name']['english'] for item in datastore]
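
The list comprehension above assumes each record in `pokedex.json` has a nested `name` object with an `english` key. A minimal sketch of that shape, using a made-up two-record sample (the `id` field and the sample string are assumptions for illustration, not copied from the file):

```python
import json

# Hypothetical two-record sample; only the name -> english nesting is
# taken from the notebook, everything else is illustrative.
sample_json = """
[
  {"id": 1, "name": {"english": "Bulbasaur"}},
  {"id": 2, "name": {"english": "Ivysaur"}}
]
"""

sample_datastore = json.loads(sample_json)
sample_names = [item["name"]["english"] for item in sample_datastore]
print(sample_names)
```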

In [3]:
# some Pokémon names contain spaces or special characters that cause problems
# when building URLs; these lists let us check whether we're dealing with one
space_names = [names[121], names[438], names[771], names[784], names[785], names[786], names[787]]
nidorans = [names[28], names[31]]

# Defining functions
- `get_bulbapedia_link(pokemon)`
    * takes a Pokémon name and returns a link to the Pokémon's Bulbapedia page
- `soup_it(url)`
    * takes a URL and returns the result of `get_text`
    * sends a browser-like `User-Agent` header, a trick I found somewhere that makes
    it less likely to get blocked; it does still get blocked on occasion, and
    then I just have to restart the kernel
- `trim_to_dex1(words)` and `trim_to_dex2(words)`
    * these take the result of `soup_it` and trim it down to the essential words.
    * I tried once to make this just one function, but it wasn't working
    correctly, so I kept it as two
- `re_trim(text)`
    * does regex-related data cleaning
    * I may want to remove this later if I build a more sophisticated data
    structure, e.g., a dictionary that pairs each Pokédex entry with its associated
    game title or something like that.

In [4]:
# creates the link for Bulbapedia
def get_bulbapedia_link(pokemon):
    if pokemon == "Nidoran♀": # because Nidoran F has a strange character
        pokelink = "Nidoran%E2%99%80"
    elif pokemon == "Nidoran♂": # because Nidoran M has a strange character
        pokelink = "Nidoran%E2%99%82"
    elif pokemon == "Flabébé": # percent-encode both é characters
        pokelink = "Flab%C3%A9b%C3%A9"
    elif pokemon in space_names:
        pokelink = re.sub(r'\s','_', pokemon)
    else:
        pokelink = pokemon
    thelink = "https://bulbapedia.bulbagarden.net/wiki/{}_(Pok%C3%A9mon)".format(pokelink)
    return thelink
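
The hard-coded cases above are all percent-encodings of UTF-8 bytes, so the standard library's `urllib.parse.quote` could produce them. The following is a sketch of that alternative, not the notebook's own method; `build_bulbapedia_link` is a hypothetical name, and it assumes Bulbapedia page titles use underscores for spaces, as the `space_names` branch does:

```python
from urllib.parse import quote

def build_bulbapedia_link(pokemon):
    # Spaces become underscores, then quote() percent-encodes any
    # non-ASCII characters as UTF-8 bytes.
    pokelink = quote(pokemon.replace(" ", "_"))
    return "https://bulbapedia.bulbagarden.net/wiki/{}_(Pok%C3%A9mon)".format(pokelink)

print(build_bulbapedia_link("Nidoran♀"))
print(build_bulbapedia_link("Flabébé"))
print(build_bulbapedia_link("Mr. Mime"))
```

Note that `quote` also percent-encodes some punctuation (a colon, for example), and I have not checked that every resulting link resolves.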

In [5]:
# soup_it:
# input url
# return result of Beautiful Soup's get_text on that url
def soup_it(url):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    page = urlopen(req)
    soup = BeautifulSoup(page, 'lxml')
    return soup.get_text()
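
Since requests are occasionally blocked, one option is to retry after a pause instead of restarting the kernel. A minimal sketch, assuming a blocked request surfaces as `urllib.error.URLError` (which `HTTPError` subclasses); `fetch_with_retry` and its parameters are hypothetical and not part of the notebook:

```python
import time
from urllib.error import URLError

def fetch_with_retry(fetch, url, attempts=3, delay=5.0):
    """Call fetch(url), retrying after a pause when the request fails."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except URLError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay * attempt)  # wait a little longer each time
```

It could wrap the existing function as `fetch_with_retry(soup_it, get_bulbapedia_link(pokemon))`. Whether a longer pause actually avoids the block is something I have not verified.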

In [6]:
# trim_to_dex1
# input result of soup_it
# return result trimmed (but not quite enough)
def trim_to_dex1(words):
    words = re.sub(r'\n\s*\n', '\n\n', words)
#     text_ii = text.find("From Bulbapedia")
#     text_oo = text.find("Retrieved from")
#     text = text[text_ii:text_oo]
    words_i = words.find("\nGame data")
    words_o = words.find("\nGame locations")
    return words[words_i:words_o]

# trim_to_dex2
# I tried to combine this with trim_to_dex1, and it wasn't working
# Input: result of trim_to_dex1
# Return: previous text, trimmed to just the essentials
def trim_to_dex2(words):
    words_i = words.find("\nPokédex entries")
    return words[words_i:]
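
One thing to watch when combining the two trims: `str.find` returns `-1` when a marker is missing, and `-1` used as a slice index silently means "the last character". Below is a sketch of a single helper that checks for that case. `slice_between` is a hypothetical name, the sample page text is made up, and this is not a tested replacement for the two functions above:

```python
def slice_between(text, start_marker, end_marker=None):
    """Return text from start_marker up to end_marker (exclusive).

    Raises ValueError if start_marker is missing; a missing end_marker
    means "keep everything to the end".
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError("start marker not found: {!r}".format(start_marker))
    if end_marker is None:
        return text[start:]
    end = text.find(end_marker, start)
    return text[start:] if end == -1 else text[start:end]

# Made-up stand-in for the text returned by soup_it
page = "intro\nGame data\nstuff\nPokédex entries\nA seed Pokémon.\nGame locations\nmore"
game_data = slice_between(page, "\nGame data", "\nGame locations")
dex_only = slice_between(game_data, "\nPokédex entries")
print(repr(dex_only))
```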

In [7]:
# re_trim(dext)
# input: text from soup_it, already trimmed by trim_to_dex1 and trim_to_dex2
# output: just Pokédex entries;
#    removes extra spaces, removes game names, etc.
#    removes placeholder entries such as "This Pokémon has no Pokédex entries"
def re_trim(dext):
    # Removes any line with text that does not end in a period (i.e., any line
    # that is not from the Pokédex itself)
    dext = re.sub(r'[^\.]*\n', '\n', dext)
    # OBSOLETE (due to first re.sub call); removes all pokedex entry numbers
    # dext = re.sub(r'[A-Z]\w+ #\S{1,3}','\n', dext)
    # OBSOLETE (due to first re.sub call); removes generation titles
    # dext = re.sub(r'\nGeneration [VI]{1,4}', '\n', dext)
    # Removes "this pokemon has no pokedex entries"
    dext = re.sub(r'This Pokémon has no Pok.*\.\n', '\n', dext)
    # Removes "this pokemon was unavailable"
    dext = re.sub(r'This Pokémon was unavailable prior.*\.\n', '\n', dext)
    # removes leading spaces (replaces any whitespace character followed by a
    # space with a line break)
    dext = re.sub(r'\s ', '\n', dext)
    # reduces all repeated line breaks to a single line break
    dext = re.sub('\n{2,}', '\n', dext)
    return dext
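
The first substitution in `re_trim` is the step that drops game titles and headings. A line-based sketch of the same idea, keeping only lines that end in a period, may be easier to reason about. It is an approximation rather than an exact equivalent of the regex, and `keep_sentence_lines` and the sample text are made up for illustration:

```python
def keep_sentence_lines(text):
    """Keep only the lines that end in a period (after stripping spaces)."""
    kept = [line.strip() for line in text.splitlines()
            if line.strip().endswith(".")]
    return "\n".join(kept)

sample = "Generation I\nRed\nA strange seed was planted on its back at birth.\nBlue\n"
print(keep_sentence_lines(sample))
```

Like the regex, this drops any line that contains no period at all, so an entry ending in "!" would be lost.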

# Debug function
`dex_entries_debug` prints the intermediate output of each helper that goes into
the overarching `dex_entries` function, which pulls all the Pokédex entries for
a given Pokémon. Everything works now, but I am keeping the debug function in case
I decide to change things at a later date.

In [8]:
def dex_entries_debug(pokemon):
    # debug get_bulbapedia_link
    bulblink = get_bulbapedia_link(pokemon)
    print(bulblink)
    # debug the BeautifulSoup function
    blsoup = soup_it(bulblink)
    # print(blsoup)
    # debug the function trimming the text to just pokedex entries
    just_dex = trim_to_dex1(blsoup)
    print(just_dex)
    # debug the more trim function
    just_dex2 = trim_to_dex2(just_dex)
    print(just_dex2)
    # debug the function that removes extraneous text (via regex)
    just_dex_clean = re_trim(just_dex2)
    print(just_dex_clean)

# `dex_entries(pokemon)`
Given a Pokémon, return its Pokédex entries from Bulbapedia!
- relies on `get_bulbapedia_link`, `soup_it`, `trim_to_dex1`, `trim_to_dex2`, `re_trim`

In [9]:
def dex_entries(pokemon):
    words = soup_it(get_bulbapedia_link(pokemon))
    words = trim_to_dex1(words)
    words = trim_to_dex2(words)
    words = re_trim(words)
    return words

# Get `spaCy` up and running
- import the needed parts of spaCy
- `new_line_sentences`: defines sentence boundaries at newline characters,
because some Pokédex entries contain multiple sentences
- `nlp`: `en_core_web_sm` pipeline without `ner`
- `nlp_newline`: `en_core_web_sm` pipeline with `tagger` and my custom
sentence segmenter `new_line_sentences`

In [10]:
# set up spaCy how I need it
import spacy
from spacy.lang.en import English

In [11]:
# defines a parser that will create new "sentences" at the end of every line
# this is important because some Pokédex entries are more than one sentence
# and we want to prioritize putting the name of the pokémon in the first 
# sentence if at all possible.
def new_line_sentences(doc):
    for token in doc[:-1]:
        if token.text == "\n":
            doc[token.i].is_sent_start = True
    return doc

In [12]:
# define an nlp pipeline (nlp) using default sentence segmenter
nlp = spacy.load("en_core_web_sm")

# remove ner from nlp because I don't want it to try to identify Pokémon
nlp.remove_pipe("ner")

('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7fc7d14f51a0>)

In [13]:
# define an nlp pipeline (nlp_newline)
nlp_newline = spacy.load("en_core_web_sm")

# remove the parser and ner from the pipeline for nlp_newline, because
# I want to set my own sentence boundaries and I don't want entities.
nlp_newline.remove_pipe('parser')
nlp_newline.remove_pipe('ner')

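# add my custom sentence segmenter to the pipeline for nlp_newline
# (spaCy 2.x API: add_pipe takes the function itself; spaCy 3.x expects
# the name of a registered component instead)
nlp_newline.add_pipe(new_line_sentences)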

## Cleaning up the Pokédex data
Many Pokédex entries (~80%; 3104 of 3891 for the original 151 Pokémon) do not contain the name of the Pokémon;
they just have pronouns or the noun phrase _this Pokémon_. To create more sentences that
actually use the Pokémon's names (e.g., for more training data), I converted the entries that
don't mention the Pokémon's name into entries that do by replacing certain referring
expressions (pronouns and "this (adjective)* Pokémon").

* `name_check` is a function that checks whether the Pokémon's name
is in the text. It calls `replace_pronoun` if the Pokémon's name is not
found.
* `replace_pronoun` is a function that replaces one referring expression in each line with the
Pokémon name. If there are multiple sentences in the line, it prioritizes
replacing something in the first sentence over later sentences. Within
each sentence, it sets this priority:
    - _this Pokémon_ > _it/its/they/them_ (all equal) > _their_
    - calls `check_for_possessive` as the `repl` argument of `re.sub()`
* `check_for_possessive` is the function that `re.sub()` calls to produce
the replacement text.
    - if the token being replaced is _its_ or _their_, or ends in _'s_ (e.g., _it's_), it gets replaced with the possessive
    form of the Pokémon's name

In [14]:
# set regex patterns 
it_regex = re.compile(r"\b[Ii]t'?s?\b")
they_regex = re.compile(r"\b[Tt]he[ym]\b")
their_regex = re.compile(r"\b[Tt]heir\b")
pn_regex = re.compile(r"\b([Ii]t'?s?|[Tt]he[ym])\b")
thispkmn_regex = re.compile(r"\b[Tt]his (\w+ ){0,2}Pokémon'?s?\b")

# order the regex patterns to prioritize them
# pn_patterns = [thispkmn_regex, it_regex, they_regex, their_regex]
pn_patterns = [thispkmn_regex, pn_regex, their_regex]
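
To see what each pattern picks up, here is a small check on made-up sentences (the sentences are illustrative, not scraped entries). The patterns are copied from the cell above so the block runs on its own:

```python
import re

pn_regex = re.compile(r"\b([Ii]t'?s?|[Tt]he[ym])\b")
their_regex = re.compile(r"\b[Tt]heir\b")
thispkmn_regex = re.compile(r"\b[Tt]his (\w+ ){0,2}Pokémon'?s?\b")

examples = [
    "This Seed Pokémon's bulb grows slowly.",
    "It's said that they sleep in caves.",
    "Their tails glow at night.",
]
for sentence in examples:
    # Try the patterns in priority order and report the first hit
    for pattern in (thispkmn_regex, pn_regex, their_regex):
        match = pattern.search(sentence)
        if match:
            print(repr(match.group(0)), "<-", sentence)
            break
```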

In [15]:
# name_check
# checks to see if the Pokémon's name is in the entry, and if not, it adds it
# 
# pokemon: string
# doc: spaCy Doc, parsed with nlp_newline
# returns: a list comprised of Spans and Docs; that's a problem but as long 
# as I convert everything to text in the end, I might not need to fix it?
def name_check(pokemon,doc):
    newsents = []
    # each "sentence" in doc is one Pokédex entry (one line of scraped text)
    dex_sents = list(doc.sents)
    for line in dex_sents:
        # if the name of the pokémon is in that line, immediately
        # add to the list of new sentences
        # (the name is compared without its final character)
        if pokemon[:-1] in line.text:
            newsents.append(line)
        # otherwise, break into sentences for further processing
        else:
            # process the line using ordinary sentence segmentation
            line_doc = nlp(line.text)
            replaced_line = nlp_newline(replace_pronoun(pokemon,line_doc))
            newsents.append(replaced_line)
    return newsents

In [16]:
# replace_pronoun
# called by `name_check` if the dex entry does not contain a
# pokémon name. Breaks each dex entry into sentences, then iterates
# through sentences and then through anaphora ("this pokémon", then
# "it, its, they, them" (all equal), then "their") to try to find the first suitable replacement.
#
# pokemon: string
# doc: nlp object, tagged and parsed
# return: string
def replace_pronoun(pokemon,doc):
    new_sents = []
    unchanged = True
    # go through each sent in the doc (ie, the pokedex entry)
    for sent in doc.sents:
        # if this line is still unchanged, then look for something to change.
        # Note that it could be an earlier sentence that was changed.
        if unchanged:
            # look for something to switch in order of the list
            for pattern in pn_patterns:
                match = re.search(pattern,sent.text)
                # if you find a match for this pattern, replace the pattern
                # with the pokemon name
                if match:
                    new_sent = re.sub(pattern,check_for_possessive,sent.text,count=1)
                    new_sents.append(new_sent)
                    # and tell the system that you've made a switch
                    unchanged = False
                    # immediately break out of the pattern for-loop
                    break
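            # for-else: this branch runs only if no pattern matched
            # (i.e., the loop finished without hitting break)
            else:
                new_sents.append(sent.text)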
        else:
            new_sents.append(sent.text)
    return " ".join(new_sents)

In [17]:
# check_for_possessive
# this is the function that re.sub() calls to decide which
# form of the pokemon name will be used.
# NOTE: it reads `pokemon` as a global variable, so a global `pokemon`
# (e.g., the loop variable in the corpus-building cell) must be set
# before this is called.
def check_for_possessive(matchobj):
    matched = matchobj.group(0)
    if matched.lower() in ('its', 'their') or matched.endswith("'s"):
        return "{}'s".format(pokemon)
    else:
        return pokemon
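
Because `check_for_possessive` looks up `pokemon` as a global, it depends on the corpus-building loop having set that name. One way to avoid this is a closure that captures the name. The sketch below is a string-only illustration (no spaCy); `make_replacer` and `replace_first_reference` are hypothetical names, and the list of sentence strings stands in for `doc.sents`:

```python
import re

pn_regex = re.compile(r"\b([Ii]t'?s?|[Tt]he[ym])\b")
their_regex = re.compile(r"\b[Tt]heir\b")
thispkmn_regex = re.compile(r"\b[Tt]his (\w+ ){0,2}Pokémon'?s?\b")
pn_patterns = [thispkmn_regex, pn_regex, their_regex]

def make_replacer(pokemon):
    """Return a repl function for re.sub() that remembers the name."""
    def replacer(matchobj):
        matched = matchobj.group(0)
        if matched.lower() in ("its", "their") or matched.endswith("'s"):
            return "{}'s".format(pokemon)
        return pokemon
    return replacer

def replace_first_reference(pokemon, sentences):
    """Replace one referring expression, preferring earlier sentences."""
    replacer = make_replacer(pokemon)
    new_sents = []
    unchanged = True
    for sent in sentences:
        if unchanged:
            for pattern in pn_patterns:
                if pattern.search(sent):
                    sent = pattern.sub(replacer, sent, count=1)
                    unchanged = False
                    break
        new_sents.append(sent)
    return " ".join(new_sents)

entry = [
    "A strange seed was planted on its back at birth.",
    "The plant sprouts and grows with this Pokémon.",
]
print(replace_first_reference("Bulbasaur", entry))
```

In this example the _its_ in the first sentence is replaced even though _this Pokémon_ appears in the second, because sentence order takes priority over pattern order.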

# Code to build a corpus of examples
Run the next block of code to build a corpus of examples. If you want to pull
specific generations, use these slices of `names`:
- I: [:151]
- II: [151:251]
- III: [251:386]
- IV: [386:493]
- V: [493:649]
- VI: [649:721]
- VII: [721:809]
- VIII: [809:890]
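
The slices can also be kept in a lookup table so the loop does not need hand-edited indices. A minimal sketch, assuming a name's list index is its National Pokédex number minus one; `GENERATION_SLICES` and `names_for_generations` are hypothetical names, and the table lists only generations I to VI (later ones can be added the same way):

```python
# Hypothetical lookup table; assumes list index = National Pokédex number - 1.
GENERATION_SLICES = {
    "I": slice(0, 151),
    "II": slice(151, 251),
    "III": slice(251, 386),
    "IV": slice(386, 493),
    "V": slice(493, 649),
    "VI": slice(649, 721),
}

def names_for_generations(names, generations):
    """Return the names belonging to the requested generations, in order."""
    selected = []
    for gen in generations:
        selected.extend(names[GENERATION_SLICES[gen]])
    return selected

# Stand-in for the real `names` list: Pokédex numbers as strings.
fake_names = [str(n) for n in range(1, 722)]
print(names_for_generations(fake_names, ["II"])[:3])
```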

In [18]:
start_time = time.time()
for pokemon in names[:386]:
    words = dex_entries(pokemon)
    doc = nlp_newline(words)
    entries = [entry.text for entry in name_check(pokemon,doc)]
    unique_entries = list(dict.fromkeys(entries))
    # open the output file once per Pokémon and append each unique entry
    with open("pokedex_bolster.txt", "a", encoding="utf-8") as file:
        for entry in unique_entries:
            file.write(entry)
    print("Finished:", pokemon)
print("--- %s seconds ---" % (time.time()-start_time))

Finished: Bulbasaur
Finished: Ivysaur
Finished: Venusaur
Finished: Charmander
Finished: Charmeleon
Finished: Charizard
Finished: Squirtle
Finished: Wartortle
Finished: Blastoise
Finished: Caterpie
Finished: Metapod
Finished: Butterfree
Finished: Weedle
Finished: Kakuna
Finished: Beedrill
Finished: Pidgey
Finished: Pidgeotto
Finished: Pidgeot
Finished: Rattata
Finished: Raticate
Finished: Spearow
Finished: Fearow
Finished: Ekans
Finished: Arbok
Finished: Pikachu
Finished: Raichu
Finished: Sandshrew
Finished: Sandslash
Finished: Nidoran♀
Finished: Nidorina
Finished: Nidoqueen
Finished: Nidoran♂
Finished: Nidorino
Finished: Nidoking
Finished: Clefairy
Finished: Clefable
Finished: Vulpix
Finished: Ninetales
Finished: Jigglypuff
Finished: Wigglytuff
Finished: Zubat
Finished: Golbat
Finished: Oddish
Finished: Gloom
Finished: Vileplume
Finished: Paras
Finished: Parasect
Finished: Venonat
Finished: Venomoth
Finished: Diglett
Finished: Dugtrio
Finished: Meowth
Finished: Persian
Finished: Psyduc