# Pokemon Scraper

The goal of this notebook is to construct the dataset used for **RAG (Retrieval-Augmented Generation)** in PokepedAI.

## Information

The first step in creating the dataset is to establish what information should be collected and where it should come from.

### Where to Collect Info

Since PokepedAI is meant to serve as a chatbot to consult while playing Pokemon games, the data comes from a source players commonly use for this information: [PokemonDB](https://pokemondb.net).

### What Info Should Be Collected

PokemonDB contains a lot of information about Pokemon. The documents will cover the information needed to answer the most common questions people ask:

```
What type is <Pokemon>?
What are the base stats of <Pokemon>?
How does <Pokemon A> compare to <Pokemon B>?

How do I evolve <Pokemon>?
What level does <Pokemon> evolve?
What can <Pokemon> evolve into?

What abilities does <Pokemon> have?
What is <Pokemon>'s hidden ability?
Which Pokemon have <Ability>?
What does <Ability> do?

What moves does <Pokemon> learn by level-up?
What TMs can <Pokemon> learn?
Can <Pokemon> learn <Move>?
Which Pokemon can learn <Move>?
When does <Pokemon> learn <Move>?

What are <Pokemon>'s weaknesses?
What is super effective against <Pokemon>?
Which Pokemon resist <Type>?

Where can I find <Pokemon> in <Game>?
Which Pokemon appear in <Location> in <Game>?
Which Pokemon are version exclusives in <Game>?

What egg group is <Pokemon>?
Which Pokemon share an egg group with <Pokemon>?
Can <Pokemon> breed?

What forms does <Pokemon> have?
What is the type of <Form/Variant>?
What differs between <Form A> and <Form B>?

What changed for <Pokemon> across generations?
Was <Pokemon> obtainable in <Generation>?
Did <Move> change between generations?
```

Thus, information on the site that is superfluous to gameplay-related questions, such as a Pokemon's name in different languages and its flavor text, will not be included.

## Creating RAG Chunks

Much of this dataset, especially the Pokemon-specific information, is organized into tables and charts, which do not translate well into RAG (retrieval works best on natural-language documents). We will therefore convert such information into natural-language text while keeping the raw information as metadata.

In addition, to help with indexing, all information will be split into smaller JSON sections based on the information they hold. Since the data consists mostly of fact-based information (Pokemon move stats, Pokemon stats) that does not rely much on broader context, the RAG chunks are kept small to improve retrieval accuracy.
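
As a rough illustration of the table-to-text conversion described above, the sketch below turns one table row into a chunk with natural-language `text` and the raw values under `metadata`. The helper name `row_to_chunk` and the sample row are assumptions made for this example and are not part of the notebook's code; only the chunk keys mirror the ones used by the builder functions in the Code section.

```python
import json


def row_to_chunk(pokemon: str, section: str, row: dict) -> dict:
    """Turn one table row into a small RAG chunk (hypothetical helper)."""
    # Natural-language rendering of the row, used for retrieval.
    facts = ", ".join(f"{value} {key}" for key, value in row.items())
    text = f"{pokemon} has base stats of {facts}."
    return {
        "id": f"{pokemon.lower()}-{section}",
        "pokemon": pokemon,
        "section": section,
        "text": text,
        # The raw values are kept so they can be filtered or shown exactly.
        "metadata": dict(row),
    }


# Sample row for illustration only.
chunk = row_to_chunk("Bulbasaur", "statistics", {"HP": 45, "Attack": 49})
print(json.dumps(chunk, indent=2))
```

Keeping one small chunk per section means a question about stats retrieves only the stats text rather than the whole Pokemon page.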

## Attributions

*   The initial template for the scraping code comes from this [repository](https://github.com/christian-jaimes/pokemon-data-scraping). This includes the general setup of the web scraper and most of the code for listing basic Pokemon information.
*   The boilerplate code for retrieving the rest of the information, as well as for creating the data chunks, was generated by ChatGPT; I then cleaned and commented it to document the methodology.





---

## Code

### Imports

In [290]:
import os
import re
import json
import time
import requests
import pandas as pd
from pathlib import Path
from bs4 import BeautifulSoup, Tag

### Constants + Helper Methods

In [291]:
project_name = 'pokemon-test-1'             # project name, base text for output files

full_dataset = False                        # if True, scrape the whole database
part_dataset = 250                          # if full_dataset is False, scrape only the first `part_dataset` Pokemon

In [292]:
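def list_to_str(l: list) -> str:
    """
    Given a list of items, convert it to a comma- and "and"-separated string.
    """
    items = [str(x) for x in l]  # tolerate non-string items
    if not items:
        # an empty list would otherwise raise IndexError in the last branch
        return ""
    if len(items) == 1:
        s = items[0]
    elif len(items) == 2:
        s = " and ".join(items)
    else:
        s = ", ".join(items[:-1]) + f", and {items[-1]}"

    return s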


def remove_parentheses(method_raw: str) -> str:
    """Strip wrapping parentheses from things like '(Level 16)'."""
    return method_raw.strip().strip("()")
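
A quick check of how these two helpers behave on typical inputs. The definitions are repeated in compact form so the snippet runs on its own; the empty-list guard is an addition made for this sketch, and the sample values are illustrative.

```python
def list_to_str(items: list) -> str:
    """Compact copy of the helper, with an empty-list guard added for this sketch."""
    items = [str(i) for i in items]
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + f", and {items[-1]}"


def remove_parentheses(method_raw: str) -> str:
    """Strip surrounding whitespace first, then any wrapping parentheses."""
    return method_raw.strip().strip("()")


print(list_to_str(["Overgrow"]))                                  # Overgrow
print(list_to_str(["Grass", "Poison"]))                           # Grass and Poison
print(list_to_str(["Static", "Lightning Rod", "Surge Surfer"]))   # Static, Lightning Rod, and Surge Surfer
print(remove_parentheses(" (Level 16) "))                         # Level 16
```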

### Building RAG Chunks

In [293]:
def build_core_doc(info: dict):
    name = info.get("name")
    natdex = info.get("id")
    desc = info.get("description")
    species = info.get("species")
    types = info.get("types") or []
    generation = info.get("generation")
    height = info.get("height")
    weight = info.get("weight")
    abilities = info.get("abilities") or []
    name_etymology = info.get("name_etymology")

    clean_types = [t for t in types if t]
    if clean_types:
        type_str = "/".join(clean_types)
    else:
        type_str = "unknown"

    gen_str = f"Generation {generation}" if generation else "an unknown generation"

    raw_abilities = info.get("abilities")

    if isinstance(raw_abilities, str):
        abilities_list = [a.strip() for a in raw_abilities.split(",") if a.strip()]
    elif isinstance(raw_abilities, (list, tuple, set)):
        abilities_list = [str(a).strip() for a in raw_abilities if a]
    else:
        abilities_list = []
    if abilities_list:
        abil_str = list_to_str(abilities_list)

    text_parts = []
    text_parts.append(f"{name} is a {type_str}-type Pokémon introduced in {gen_str}.")
    if natdex is not None:
        text_parts.append(f"It is number {natdex} in the National Pokédex.")
    if species:
        text_parts.append(f"{name} is classified as the {species} Pokémon.")
    if height is not None or weight is not None:
        hw_parts = []
        if height is not None:
            hw_parts.append(f"is {height} tall")
        if weight is not None:
            hw_parts.append(f"weighs {weight}")
        if hw_parts:
            text_parts.append(f"{name} " + " and ".join(hw_parts) + ".")
    # abil_str only exists when at least one ability was parsed
    if abilities_list:
        text_parts.append(f"{name} can have the abilities {abil_str}.")
    if name_etymology:
        text_parts.append(f"Its name is derived from {name_etymology}.")

    text = " ".join(text_parts).strip()

    return {
        "id": f"{name}-core" if name else f"{natdex}-core",
        "pokemon": name,
        "section": "core",
        "description": desc,
        "text": text,
        "metadata": {
            "National Dex Number": natdex,
            "Types": clean_types,
            "Generation": generation,
            "Species": species,
            "Height": height,
            "Weight": weight,
            "Abilities": abilities_list,
            "Name Etymology": name_etymology,
        },
    }


In [294]:
def build_training_doc(info: dict):
    name = info["name"]
    ev_yield = info["ev_yield"]
    catch_rate = info["catch_rate"]
    base_friendship = info["base_friendship"]
    base_exp = info["base_exp"]
    growth_rate = info["growth_rate"]

    if isinstance(ev_yield, str):
        ev_list = [a.strip() for a in ev_yield.split(",") if a.strip()]
    elif isinstance(ev_yield, (list, tuple, set)):
        ev_list = [str(a).strip() for a in ev_yield if a]
    else:
        ev_list = []
    if ev_list:
        ev_str = list_to_str(ev_list)

    text_parts = []
    # ev_str only exists when an EV yield was parsed
    if ev_list:
        text_parts.append(f"For training purposes, defeating {name} yields {ev_str} EVs.")
    text_parts.append(f"It grants {base_exp} base experience points and follows a {growth_rate} experience growth rate.")
    text_parts.append(f"{name} has a catch rate of {catch_rate} and a base friendship of {base_friendship}.")

    text = " ".join(text_parts)

    return {
        "id": f"{name}-training",
        "pokemon": name,
        "section": "training",
        "text": text,
        "metadata": {
            "EV Yield": ev_yield,
            "Catch Rate": catch_rate,
            "Base Friendship": base_friendship,
            "Base Experience": base_exp,
            "Growth Rate": growth_rate,
        },
    }


In [295]:
def build_breeding_doc(info: dict):
    name = info["name"]
    egg_groups = info["egg_groups"]
    gender_male = info["gender_male"]
    gender_female = info["gender_female"]
    egg_cycles = info["egg_cycles"]

    if isinstance(egg_groups, str):
        egg_list = [a.strip() for a in egg_groups.split(",") if a.strip()]
    elif isinstance(egg_groups, (list, tuple, set)):
        egg_list = [str(a).strip() for a in egg_groups if a]
    else:
        egg_list = []
    if egg_list:
        egg_str = list_to_str(egg_list)

    # the scraper stores these as strings (e.g. "0" or "87.5"), so compare numerically
    if float(gender_male) == 0 and float(gender_female) == 0:
        gender_text = f"{name} is a genderless species."
    else:
        gender_text = (
            f"The typical gender ratio for {name} is "
            f"{gender_male}% male and {gender_female}% female."
        )

    hatch_text_split = egg_cycles.split(" ")
    egg_cycles_str = hatch_text_split[0]
    steps = " ".join(hatch_text_split[1:])

    text_parts = []
    # egg_str only exists when at least one egg group was parsed
    if egg_list:
        text_parts.append(f"{name} belongs to the {egg_str} Egg Group.")
    text_parts.append(gender_text)
    text_parts.append(f"Eggs take {egg_cycles_str} egg cycles {steps} to hatch.")
    text = " ".join(text_parts)

    return {
        "id": f"{name}-breeding",
        "pokemon": name,
        "section": "breeding",
        "text": text,
        "metadata": {
            "eggGroups": egg_groups,
            "genderMale": gender_male,
            "genderFemale": gender_female,
            "eggCycles": egg_cycles,
        },
    }


In [296]:
def build_statistics_doc(info: dict):
    name = info["name"]

    hp = info["base_hp"]
    atk = info["base_atk"]
    dfc = info["base_def"]
    satk = info["base_satk"]
    sdfc = info["base_sdef"]
    spd = info["base_spd"]
    total = int(hp) + int(atk) + int(dfc) + int(satk) + int(sdfc) + int(spd)

    min_hp = info["min_hp"]
    max_hp = info["max_hp"]
    min_atk = info["min_atk"]
    max_atk = info["max_atk"]
    min_def = info["min_def"]
    max_def = info["max_def"]
    min_satk = info["min_satk"]
    max_satk = info["max_satk"]
    min_sdef = info["min_sdef"]
    max_sdef = info["max_sdef"]
    min_spd = info["min_spd"]
    max_spd = info["max_spd"]

    text = (
        f"{name} has a base stat total of {total}, with base stats of "
        f"{hp} HP, {atk} Attack, {dfc} Defense, {satk} Special Attack, "
        f"{sdfc} Special Defense, and {spd} Speed. "
        f"At level 100, {name}'s HP can range from {min_hp} to {max_hp}, "
        f"Attack from {min_atk} to {max_atk}, Defense from {min_def} to {max_def}, "
        f"Special Attack from {min_satk} to {max_satk}, "
        f"Special Defense from {min_sdef} to {max_sdef}, "
        f"and Speed from {min_spd} to {max_spd}, depending on its nature, IVs, and EVs."
    )

    return {
        "id": f"{name}-statistics",
        "pokemon": name,
        "section": "statistics",
        "text": text,
        "metadata": {
            "baseStats": {
                "hp": hp,
                "attack": atk,
                "defense": dfc,
                "spAttack": satk,
                "spDefense": sdfc,
                "speed": spd,
            },
            "baseStatTotal": total,
            "minStatsLevel100": {
                "hp": min_hp,
                "attack": min_atk,
                "defense": min_def,
                "spAttack": min_satk,
                "spDefense": min_sdef,
                "speed": min_spd,
            },
            "maxStatsLevel100": {
                "hp": max_hp,
                "attack": max_atk,
                "defense": max_def,
                "spAttack": max_satk,
                "spDefense": max_sdef,
                "speed": max_spd,
            },
        },
    }


In [297]:
def build_evolution_doc(pokemon_data: dict, soup):
    pokemon_name = pokemon_data["name"]
    pokemon_id = pokemon_data["id"]

    # retrieve all evolution edges
    all_edges = parse_all_evolution_edges(soup)

    # keep only the edges this pokemon is part of
    incoming = [e for e in all_edges if e["to"] == pokemon_name]
    outgoing = [e for e in all_edges if e["from"] == pokemon_name]

    # if there is no evolution
    if not incoming and not outgoing:
        return {
            "id": f"{pokemon_name.lower()}-evolutions",
            "pokemon": pokemon_name,
            "section": "evolutions",
            "text": f"{pokemon_name} does not evolve.",
            "metadata": {
                "pokemon_id": pokemon_id,
                "pokemon_name": pokemon_name,
                "evolution_edges": [],
                "has_evolutions": False,
            },
        }

    text_parts = []

    # pre-evolution -> this pokemon
    for e in incoming:
        method_clean = remove_parentheses(e["method"])
        text_parts.append(
            f"{e['from']} evolves into {pokemon_name} via {method_clean}."
        )

    # this pokemon -> later evolutions
    if len(outgoing) == 1:
        e = outgoing[0]
        method_clean = remove_parentheses(e["method"])
        text_parts.append(
            f"{pokemon_name} evolves into {e['to']} via {method_clean}."
        )
    elif len(outgoing) > 1:
        branch_parts = []
        for e in outgoing:
            method_clean = remove_parentheses(e["method"])
            branch_parts.append(f"{e['to']} via {method_clean}")

        branches_text = list_to_str(branch_parts)

        text_parts.append(f"{pokemon_name} can evolve into {branches_text}.")

    text = " ".join(text_parts)
    relevant_edges = incoming + outgoing

    return {
        "id": f"{pokemon_name.lower()}-evolutions",
        "pokemon": pokemon_name,
        "section": "evolutions",
        "text": text,
        "metadata": {
            "pokemon_id": pokemon_id,
            "pokemon_name": pokemon_name,
            "evolution_edges": relevant_edges,
            "has_evolutions": True,
        },
    }
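
To make the edge filtering concrete, here is a simplified sketch of the text-building step run on hand-written edges. The `describe_evolutions` helper is hypothetical and the edge list is illustrative data in the same shape as the parsed edges; unlike `build_evolution_doc`, it emits one sentence per branch instead of joining branches into a single sentence.

```python
# Hand-written sample edges in the shape the parser is expected to return.
edges = [
    {"from": "Eevee", "to": "Vaporeon", "method": "(use Water Stone)"},
    {"from": "Eevee", "to": "Jolteon", "method": "(use Thunder Stone)"},
    {"from": "Eevee", "to": "Flareon", "method": "(use Fire Stone)"},
]


def describe_evolutions(name: str, edges: list[dict]) -> str:
    """Hypothetical, simplified version of the text-building step."""
    sentences = []
    for e in edges:
        # Same cleanup as remove_parentheses: trim, then drop wrapping parentheses.
        method = e["method"].strip().strip("()")
        if e["to"] == name:
            sentences.append(f"{e['from']} evolves into {name} via {method}.")
        elif e["from"] == name:
            sentences.append(f"{name} evolves into {e['to']} via {method}.")
    # No matching edge in either direction means there is nothing to describe.
    return " ".join(sentences) or f"{name} does not evolve."


print(describe_evolutions("Jolteon", edges))
print(describe_evolutions("Eevee", edges))
```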

In [298]:
def build_moves_docs_for_generation(pokemon_data: dict, gen, moves_soup):
    pokemon_name = pokemon_data["name"]
    pokemon_id = pokemon_data["id"]

    sections = parse_moves_sections(moves_soup)

    grouped = {}  # key: (game_group, method)

    for sec in sections:
        games_text = sec["games_text"]
        method = sec["method"]
        rows = sec["rows"]

        # get the game group, create the key
        game_group = infer_game_group_id(games_text)
        key = (game_group, method)

        group = grouped.setdefault(
            key,
            {
                "game_group": game_group,
                "method": method,
                "games_text_samples": set(),
                "rows": [],
            },
        )

        # add the game text above the table
        if games_text:
            group["games_text_samples"].add(games_text)
        # add all the moves
        group["rows"].extend(rows)

    method_phrases = {
        "level-up": "By level up",
        "evolution": "On evolution",
        "egg": "By egg moves",
        "pre-evo": "From pre-evolutions",
        "tm": "By TM",
        "hm": "By HM",
        "tr": "By TR",
        "tutor": "From move tutors",
        "transfer": "As transfer-only moves",
    }

    chunks = []

    # for each game_group, method make a chunk
    for (game_group, method), group in grouped.items():
        rows = group["rows"]
        games_texts = list(group["games_text_samples"])

        has_any_moves = bool(rows)
        has_any_text = bool(games_texts)

        if not has_any_moves and not has_any_text:
            continue

        text_parts = []

        if games_texts:
            text_parts.append(" ".join(games_texts))

        if rows:
            if method == "level-up":
                move_list = ", ".join(f"{r.get('move')} (Lv. {r.get('level')})" for r in rows)
            elif method in ("tm", "hm", "tr"):
                move_list = ", ".join(f"{r.get('machine')} {r.get('move')}" for r in rows)
            else:
                move_list = ", ".join(r.get("move") for r in rows)

            # fall back to a generic phrase so an unmapped method never renders as "None"
            prefix = method_phrases.get(method, f"By {method}")
            text_parts.append(f"{prefix} it can learn: {move_list}.")

        text = (
            " ".join(text_parts)
            or f"No moves recorded for {pokemon_name} in generation {gen} ({game_group}, {method})."
        )

        chunk = {
            "id": f"{pokemon_name.lower()}-moves-gen{gen}-{game_group}-{method}",
            "pokemon": pokemon_name,
            "section": "moves",
            "text": text,
            "metadata": {
                "pokemon_id": pokemon_id,
                "pokemon_name": pokemon_name,
                "generation": gen,
                "game_group": game_group,
                "method": method,
                "games_text_samples": games_texts,
            },
        }
        chunks.append(chunk)

    return chunks

In [None]:
def build_locations_doc(pokemon_data: dict, pokemon_soup):
    pokemon_id = pokemon_data["id"]
    pokemon_name = pokemon_data["name"]

    location_entries = parse_locations_section(pokemon_soup)

    if not location_entries:
        text = (
            f"{pokemon_name} does not have location data."
        )
        has_locations = False
    else:
        has_locations = True

        # get the pokemon's normal (in-game) locations
        normal_entries = [
            e for e in location_entries
            if e["availability"] == "normal" and e["location_names"]
        ]

        # map location to all games where pokemon is located there
        by_location: dict[str, list[str]] = {}
        for e in normal_entries:
            loc = e["raw_location"]
            by_location.setdefault(loc, []).append(e["game"])

        # build location strings such as "Pallet Town (Red, Blue)"
        location_bits = []
        for loc, games in by_location.items():
            games_str = ", ".join(sorted(set(games)))
            location_bits.append(f"{loc} ({games_str})")

        text_parts = []

        if location_bits:
            text_parts.append(
                f"{pokemon_name} can be found in the following locations in the core series games: "
                + list_to_str(location_bits)
                + "."
            )

        # get pokemon special locations
        special_entries = [e for e in location_entries if e["availability"] != "normal"]

        if special_entries:
            # get the kinds of special locations for this pokemon
            kinds = sorted(set(e["availability"] for e in special_entries))
            phrases_map = {
                "transfer-only": "only obtainable by trading or transferring from other games",
                "event-only": "only obtainable via special events or distributions",
                "not-available": "not obtainable in-game",
                "unknown": "with locations that have not yet been documented",
            }

            # map the kind to the phrase
            phrases = [phrases_map[k] for k in kinds if k in phrases_map]
            if phrases:
                text_parts.append(
                    f"In some titles, {pokemon_name} is "
                    + list_to_str(phrases)
                    + "."
                )

        text = " ".join(text_parts)

    return {
        "id": f"{pokemon_name.lower()}-locations",
        "pokemon": pokemon_name,
        "section": "locations",
        "text": text,
        "metadata": {
            "pokemon_id": pokemon_id,
            "pokemon_name": pokemon_name,
            "locations": location_entries,
            "has_locations": has_locations,
        },
    }
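
The grouping step in `build_locations_doc` can be sketched on its own. The entries below are illustrative data; their keys follow the ones the function reads (`game`, `raw_location`, `availability`), and the exact shape returned by `parse_locations_section` is an assumption here.

```python
# Illustrative entries, one per game.
entries = [
    {"game": "Red", "raw_location": "Pallet Town", "availability": "normal"},
    {"game": "Blue", "raw_location": "Pallet Town", "availability": "normal"},
    {"game": "Yellow", "raw_location": "Cerulean City", "availability": "normal"},
    {
        "game": "Gold",
        "raw_location": "Trade/migrate from another game",
        "availability": "transfer-only",
    },
]

# Map each location to every game where the Pokemon is found there.
by_location: dict[str, list[str]] = {}
for e in entries:
    if e["availability"] == "normal":
        by_location.setdefault(e["raw_location"], []).append(e["game"])

# Render "Location (Game A, Game B)" with the games de-duplicated and sorted.
location_bits = [
    f"{loc} ({', '.join(sorted(set(games)))})" for loc, games in by_location.items()
]
print(location_bits)  # ['Pallet Town (Blue, Red)', 'Cerulean City (Yellow)']
```

Grouping by location rather than by game keeps the sentence short when several games share the same encounter spot.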

### Helper Methods to Parse Information from HTML


In [None]:
def get_pokemon_data(pokemon_soup):
    ###### Data Parsing
    ### Pokemon Info
    pokemon_id = int(pokemon_soup.find("th", string="National №").find_next("td").text)

    pokemon_name = pokemon_soup.find("h1").text.strip()

    pokemon_desc = pokemon_soup.find('div', class_='tabset-basics').find_all_previous("p")
    pokemon_desc = '|'.join(desc.text.strip() for desc in pokemon_desc).split('|')[::-1]
    pokemon_desc = ' '.join(pokemon_desc)

    species_data = pokemon_soup.find("th", string="Species").find_next("td").text.strip().replace(" Pokémon", "")

    height = pokemon_soup.find("th", string="Height").find_next("td").text.strip()
    weight = pokemon_soup.find("th", string="Weight").find_next("td").text.strip()

    type_elements = pokemon_soup.find("th", string="Type").find_next("td").find_all("a")
    type_info = ', '.join(type_element.text.strip() for type_element in type_elements).split(',')
    type_1 = type_info[0]
    if len(type_info) == 1:
        type_2 = None
    else:
        type_2 = type_info[1].strip()

    generation_title_element = pokemon_soup.find(class_="list-nav-title", string='In other generations')
    if generation_title_element:
        generation_all = generation_title_element.find_next_siblings('li')
        in_generation = ', '.join(generation_select.text.strip() for generation_select in generation_all)
    else:
        in_generation = '9'
    generation = int(in_generation[0])

    # look for the etymology list itself; default to None when the page has none
    name_etymology = None
    etymology_element = pokemon_soup.find("dl", class_="etymology")
    if etymology_element:
        name_etymology_piece = etymology_element.find_all('dt')
        name_etymology_desc = etymology_element.find_all('dd')
        name_etymology = [f"{dt.text.strip()}: {dd.text.strip()}" for dt, dd in zip(name_etymology_piece, name_etymology_desc)]
        name_etymology = " | ".join(name_etymology)

    ability_elements = pokemon_soup.find("th", string="Abilities").find_next("td").find_all("a")
    abilities = ', '.join(ability_element.text.strip() for ability_element in ability_elements)

    ### Training Info
    ev_yield = pokemon_soup.find("th", string="EV yield").find_next("td").text.strip()

    catch_rate = pokemon_soup.find("th", string="Catch rate").find_next("td").text.strip().split()[0]

    base_friendship = pokemon_soup.find("th", string="Base Exp.").find_previous("td").text.strip().split()[0]

    base_exp = pokemon_soup.find("th", string="Base Exp.").find_next("td").text.strip().split()[0]

    growth_rate = pokemon_soup.find("th", string="Growth Rate").find_next("td").text.strip()

    ### Breeding Info
    gender = pokemon_soup.find("th", string="Gender").find_next("td").text.strip().split(', ')
    if len(gender) > 1:
        # e.g. "87.5% male, 12.5% female" -> "87.5" and "12.5"
        gender_male = gender[0].split('%')[0]
        gender_female = gender[1].split('%')[0]
    else:
        # a single entry (no male/female split) is treated as genderless
        gender_male = '0'
        gender_female = '0'

    egg_groups = pokemon_soup.find("th", string="Egg Groups").find_next("td").text.strip().split(', ')

    egg_cycles = pokemon_soup.find("th", string="Egg cycles").find_next("td").text.replace("\t\t\t\t", " ").strip()


    ### Pokemon Stats
    hp_elements = pokemon_soup.find("th", string="HP").find_next_siblings("td", class_="cell-num")
    hp_stats = [hp_element.text.strip() for hp_element in hp_elements]
    base_hp, min_hp, max_hp = hp_stats

    atk_elements = pokemon_soup.find("th", string="Attack").find_next_siblings("td", class_="cell-num")
    atk_stats = [atk_element.text.strip() for atk_element in atk_elements]
    base_atk, min_atk, max_atk = atk_stats

    def_elements = pokemon_soup.find("th", string="Defense").find_next_siblings("td", class_="cell-num")
    def_stats = [def_element.text.strip() for def_element in def_elements]
    base_def, min_def, max_def = def_stats

    satk_elements = pokemon_soup.find("th", string="Sp. Atk").find_next_siblings("td", class_="cell-num")
    satk_stats = [satk_element.text.strip() for satk_element in satk_elements]
    base_satk, min_satk, max_satk = satk_stats

    sdef_elements = pokemon_soup.find("th", string="Sp. Def").find_next_siblings("td", class_="cell-num")
    sdef_stats = [sdef_element.text.strip() for sdef_element in sdef_elements]
    base_sdef, min_sdef, max_sdef = sdef_stats

    spd_elements = pokemon_soup.find("th", string="Speed").find_next_siblings("td", class_="cell-num")
    spd_stats = [spd_element.text.strip() for spd_element in spd_elements]
    base_spd, min_spd, max_spd = spd_stats

    return {
        "description": pokemon_desc,

        "id": pokemon_id,
        "name": pokemon_name,
        "species": species_data,
        "height": height,
        "weight": weight,
        "types": [type_1, type_2],
        "generation": generation,
        "name_etymology": name_etymology,
        "abilities": abilities,

        "ev_yield": ev_yield,
        "catch_rate": catch_rate,
        "base_friendship": base_friendship,
        "base_exp": base_exp,
        "growth_rate": growth_rate,

        "egg_groups": egg_groups,
        "gender_male": gender_male,
        "gender_female": gender_female,
        "egg_cycles": egg_cycles,

        "base_hp": base_hp,
        "min_hp": min_hp,
        "max_hp": max_hp,
        "base_atk": base_atk,
        "min_atk": min_atk,
        "max_atk": max_atk,
        "base_def": base_def,
        "min_def": min_def,
        "max_def": max_def,
        "base_satk": base_satk,
        "min_satk": min_satk,
        "max_satk": max_satk,
        "base_sdef": base_sdef,
        "min_sdef": min_sdef,
        "max_sdef": max_sdef,
        "base_spd": base_spd,
        "min_spd": min_spd,
        "max_spd": max_spd,
    }

In [300]:
def extract_evo_card(card):
    # from a single card element, get the information
    name_el = card.select_one("a.ent-name")
    if not name_el:
        return {}

    name = name_el.get_text(strip=True)
    num_el = card.select_one("span.infocard-lg-data small")
    dex_number = num_el.get_text(strip=True) if num_el else None

    return {
        "name": name,
        "dex_number": dex_number
    }


def parse_all_evolution_edges(soup):
    edges = []

    # split evolutions (one pre-evolution can evolve into several pokemon)
    for evo_split in soup.select("span.infocard-evo-split"):
        # the pre evolution card
        base_card = evo_split.find_previous("div", class_="infocard")
        if not base_card:
            continue
        base_info = extract_evo_card(base_card)
        if not base_info:
            continue

        # each direct child holds one target card that evolves from this base card
        for child in evo_split.children:
            if not getattr(child, "get", None):
                continue
            if "infocard-list-evo" not in (child.get("class") or []):
                continue

            arrow = child.select_one("span.infocard-arrow")
            target_card = child.find("div", class_="infocard")

            if not arrow or not target_card:
                continue

            method_text = arrow.get_text(" ", strip=True)
            target_info = extract_evo_card(target_card)
            if not target_info:
                continue

            edges.append(
                {
                    "from": base_info["name"],
                    "from_dex": base_info["dex_number"],
                    "to": target_info["name"],
                    "to_dex": target_info["dex_number"],
                    "method": method_text,
                }
            )

    # linear evolution (only to one)
    for arrow in soup.select("span.infocard-arrow"):
        # skip arrows that are already handled inside a split
        if arrow.find_parent("span", class_="infocard-evo-split"):
            continue

        # if can't find method, from, or to card, skip
        method_text = arrow.get_text(" ", strip=True)
        if not method_text:
            continue

        from_card = arrow.find_previous("div", class_="infocard")
        to_card = arrow.find_next("div", class_="infocard")
        if not from_card or not to_card:
            continue

        from_info = extract_evo_card(from_card)
        to_info = extract_evo_card(to_card)
        if not from_info or not to_info:
            continue

        edges.append(
            {
                "from": from_info["name"],
                "from_dex": from_info["dex_number"],
                "to": to_info["name"],
                "to_dex": to_info["dex_number"],
                "method": method_text,
            }
        )

    return edges


In [299]:
# game move methods
METHOD_PATTERNS = [
    ("level-up",  ["moves learnt by level up"]),
    ("evolution", ["moves learnt on evolution"]),
    ("egg",       ["egg moves"]),
    ("pre-evo",   ["pre-evolution moves"]),
    ("tm",        ["moves learnt by tm"]),
    ("hm",        ["moves learnt by hm"]),
    ("tr",        ["moves learnt by tr"]),
    ("tutor",     ["move tutor moves", "tutor moves"]),
    ("transfer",  ["transfer-only moves"]),
]

# egg move parents are considered out of scope for the llm
IGNORE_METHOD_PATTERNS = [
    "egg move parents",
]

# game abbreviations to game name mapping
GAME_GROUP_PATTERNS = [
    ("sv",      ["scarlet & violet"]),
    ("lza",     ["legends: z-a"]),

    ("swsh",    ["sword & shield"]),
    ("bdsp",    ["brilliant diamond & shining pearl"]),
    ("la",      ["legends: arceus"]),

    ("usum",    ["ultra sun & ultra moon", "ultra sun", "ultra moon"]),
    ("sm",      ["sun & moon"]),
    ("lgpe",    ["let's go pikachu & let's go eevee",
                 "lets go pikachu & lets go eevee"]),

    ("xy",      ["x & y"]),
    ("oras",    ["omega ruby", "alpha sapphire"]),

    ("b2w2",    ["black 2 & white 2"]),
    ("bw",      ["black & white"]),

    ("hgss",    ["heartgold & soulsilver"]),
    ("pt",      ["platinum"]),
    ("dp",      ["diamond & pearl"]),

    ("rs",      ["ruby & sapphire"]),
    ("frlg",    ["firered & leafgreen", "firered", "leafgreen"]),
    ("emerald", ["emerald"]),

    ("gs",      ["gold & silver"]),
    ("crystal", ["crystal"]),

    ("rb",      ["red & blue"]),
    ("y",       ["yellow"]),
]

def detect_method(title):
    t = title.lower()

    for ignore in IGNORE_METHOD_PATTERNS:
        if ignore in t:
            return None

    for method, patterns in METHOD_PATTERNS:
        if any(p in t for p in patterns):
            if method == "egg" and not t.startswith("egg moves"):
                continue
            return method

    return None

def infer_game_group_id(games_text):
    txt = (games_text or "").lower()
    for code, patterns in GAME_GROUP_PATTERNS:
        if any(p in txt for p in patterns):
            return code
    return "unknown"

def normalize_headers_in_moves_table(text):
    text = text.strip()
    mapping = {
        "Lv.": "level",
        "Move": "move",
        "Type": "type",
        "Cat.": "category",
        "Power": "power",
        "Acc.": "accuracy",
        "TM": "machine",
        "HM": "machine",
        "TR": "machine",
    }
    return mapping.get(text, text.lower())

# for each moves table (so same method), get the moves
def parse_moves_table(table):
    if not isinstance(table, Tag):
        return []

    header_row = table.find("tr")
    if not isinstance(header_row, Tag):
        return []

    headers_raw = [th.get_text(" ", strip=True) for th in header_row.find_all(["th", "td"])]
    headers = [normalize_headers_in_moves_table(h) for h in headers_raw]

    rows = []
    tbody = table.find("tbody") or table
    for tr in tbody.find_all("tr"):
        if not isinstance(tr, Tag):
            continue

        cells = tr.find_all("td")
        if not cells:
            continue
        if len(cells) < len(headers):
            continue

        row = {}
        for idx, key in enumerate(headers):
            if idx >= len(cells):
                continue
            cell = cells[idx]
            if not isinstance(cell, Tag):
                continue
            text = cell.get_text(" ", strip=True)
            row[key] = text
        rows.append(row)

    return rows

# for each pokemon game, get the different moves for each method
def parse_moves_sections(soup):
    sections = []

    for h3 in soup.find_all("h3"):
        if not isinstance(h3, Tag):
            continue

        title = h3.get_text(strip=True)
        method = detect_method(title)
        if method is None:
            continue

        siblings = []
        node = h3.next_sibling
        while node is not None:
            if isinstance(node, Tag) and node.name == "h3":
                break
            siblings.append(node)
            node = node.next_sibling

        games_text = None
        table = None

        for node in siblings:
            if not isinstance(node, Tag):
                continue

            if node.name == "p" and games_text is None:
                games_text = node.get_text(" ", strip=True)

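            if table is None:
                # The table may be the sibling itself or nested inside a wrapper element.
                t = node if node.name == "table" else node.find("table")
                if isinstance(t, Tag):
                    table = t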

            if games_text and table:
                break

        rows = parse_moves_table(table)
        sections.append(
            {
                "method": method,
                "games_text": games_text,
                "rows": rows,
            }
        )

    return sections
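
To make the output shape of `parse_moves_table` concrete, the sketch below reproduces its two steps on plain strings: header normalization and pairing headers with cell text. It is a minimal illustration with no HTML involved. The sample header and cell values are made up, and `normalize_header` and `build_row` are hypothetical stand-ins, not functions from the notebook.

```python
# Standalone sketch: the same header mapping as normalize_headers_in_moves_table,
# applied to plain strings instead of parsed HTML cells.
HEADER_MAP = {
    "Lv.": "level",
    "Move": "move",
    "Type": "type",
    "Cat.": "category",
    "Power": "power",
    "Acc.": "accuracy",
    "TM": "machine",
    "HM": "machine",
    "TR": "machine",
}


def normalize_header(text):
    text = text.strip()
    # Unknown headers fall back to their lowercased form.
    return HEADER_MAP.get(text, text.lower())


def build_row(raw_headers, cell_texts):
    """Hypothetical helper: pair normalized headers with cell text, column by column."""
    headers = [normalize_header(h) for h in raw_headers]
    if len(cell_texts) < len(headers):
        return None  # mirrors the parser, which skips rows with too few cells
    return dict(zip(headers, cell_texts))


# Made-up sample values, for illustration only.
level_up_row = build_row(
    ["Lv.", "Move", "Type", "Cat.", "Power", "Acc."],
    ["1", "Tackle", "Normal", "Physical", "40", "100"],
)
machine_row = build_row(["TM", "Move", "Type"], ["01", "Example Move", "Normal"])
print(level_up_row)
print(machine_row)
```

Because `TM`, `HM`, and `TR` all normalize to `machine`, rows from the different machine tables share one key, so downstream code can read that key regardless of the machine type.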


In [301]:
# Classify how a Pokemon can be obtained in a game, based on the raw location text.
def infer_location_availability(raw_location: str) -> str:
    text = (raw_location or "").strip().lower()

    if not text or text == "—":
        return "unknown"
    if "not available in this game" in text:
        return "not-available"
    if "location data not yet available" in text:
        return "unknown"
    if "trade/migrate from another game" in text:
        return "transfer-only"
    if "event" in text:
        return "event-only"

    return "normal"

# Build one entry per game with its availability, raw location text, and linked location names.
def parse_locations_section(pokemon_soup) -> list[dict]:
    heading = pokemon_soup.find(
        lambda tag: tag.name in ("h2", "h3") and "Where to find" in tag.get_text(strip=True)
    )
    if not heading:
        return []

    table = None
    for sib in heading.next_siblings:
        name = getattr(sib, "name", None)
        if name in ("h1", "h2", "h3"):
            break
        if name == "table":
            table = sib
            break

    if table is None:
        return []

    entries: list[dict] = []

    for row in table.find_all("tr"):
        header_cell = row.find("th")
        data_cell = row.find("td")
        if not header_cell or data_cell is None:
            continue

        games_text = header_cell.get_text(" ", strip=True)
        raw_location = data_cell.get_text(" ", strip=True)

        location_links = [
            a.get_text(" ", strip=True) for a in data_cell.find_all("a")
        ]

        availability = infer_location_availability(raw_location)

        raw_games = re.split(r",|/| & | and ", games_text)
        games = [g.strip() for g in raw_games if g.strip()]

        for game in games:
            entries.append(
                {
                    "game": game,
                    "availability": availability,
                    "raw_location": raw_location,
                    "location_names": location_links,
                }
            )

    return entries
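
The game-name split in `parse_locations_section` relies on a single regular expression. The sketch below applies that same pattern to a few hypothetical header strings (the actual header text on PokemonDB may be formatted differently) to show which separators it recognizes. `split_games` is an illustrative name, not a function in the notebook.

```python
import re

# Same pattern as in parse_locations_section: comma, slash, " & ", or " and ".
GAME_SEPARATORS = re.compile(r",|/| & | and ")


def split_games(games_text):
    """Illustrative helper: split a header cell's text into individual game names."""
    return [g.strip() for g in GAME_SEPARATORS.split(games_text) if g.strip()]


# Hypothetical header strings, not taken from the site.
print(split_games("Red/Blue"))
print(split_games("Gold, Silver and Crystal"))
print(split_games("Sword & Shield"))
print(split_games("Red Blue"))  # no recognized separator, so this stays one entry
```

Names separated only by whitespace are not split, so if a header cell renders that way the entry's `game` field will hold several titles at once.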


### Web Scraping

In [302]:
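# Get the list of all Pokemon currently in the Pokedex.

POKEMON_LIST = "https://pokemondb.net/pokedex/all"

# Fail fast on a stalled connection or an HTTP error instead of parsing an error page.
response = requests.get(POKEMON_LIST, timeout=30)
response.raise_for_status()
pokemon_soup_list = BeautifulSoup(response.text, "html.parser")

# De-duplicate the links while preserving page order.
pokemon_list = list(dict.fromkeys(pokemon_soup_list.find_all('a', class_="ent-name")))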

# Scrape every Pokemon, or only the first `part_dataset` entries for a quicker run.
if full_dataset:
    pokemon_scope = len(pokemon_list)
else:
    pokemon_scope = part_dataset
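
The two steps in this cell, order-preserving de-duplication and choosing how many entries to scrape, can be tried on plain strings. This is a sketch under the assumption that `full_dataset` is a boolean and `part_dataset` is an integer count defined earlier in the notebook; `dedupe_keep_order`, `select_scope`, and the sample hrefs are hypothetical.

```python
def dedupe_keep_order(items):
    # dict keys are unique and remember insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


def select_scope(items, full_dataset, part_dataset):
    """Hypothetical helper: return the slice of items that would be scraped."""
    scope = len(items) if full_dataset else part_dataset
    # Slicing past the end is safe: it simply returns everything available.
    return items[:scope]


# Hypothetical hrefs; the repeated entry stands in for a duplicate link.
hrefs = [
    "/pokedex/bulbasaur",
    "/pokedex/ivysaur",
    "/pokedex/venusaur",
    "/pokedex/venusaur",
    "/pokedex/charmander",
]
unique = dedupe_keep_order(hrefs)
print(unique)
print(select_scope(unique, full_dataset=False, part_dataset=2))
print(select_scope(unique, full_dataset=True, part_dataset=2))
```

A `part_dataset` larger than the list length behaves the same as a full run, since the slice stops at the end of the list.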

In [303]:
output_path = Path(f'data/{project_name}.jsonl')

chunk_groups = {
    "core": [],
    "training": [],
    "breeding": [],
    "statistics": [],
    "evolutions": [],
    "moves": [],
    "locations": [],
}

# Generations whose move pages are requested for each Pokemon.
gens_to_scrape = [1, 2, 3, 4, 5, 6, 7, 8, 9]

for index, pokemon in enumerate(pokemon_list[:pokemon_scope], start=1):
    pokemon_url = "https://pokemondb.net" + pokemon["href"]

    # A timeout keeps one stalled request from hanging the whole crawl.
    response = requests.get(pokemon_url, timeout=30)
    if response.status_code != 200:
        # Skip this Pokemon rather than building documents from an error page.
        print(f"[{index}] Skipping {pokemon_url}: HTTP {response.status_code}")
        continue
    pokemon_soup = BeautifulSoup(response.text, "html.parser")

    pokemon_data = get_pokemon_data(pokemon_soup)

    # One document per Pokemon for each of these groups.
    chunk_groups["core"].append(build_core_doc(pokemon_data))
    chunk_groups["training"].append(build_training_doc(pokemon_data))
    chunk_groups["breeding"].append(build_breeding_doc(pokemon_data))
    chunk_groups["statistics"].append(build_statistics_doc(pokemon_data))
    chunk_groups["evolutions"].append(build_evolution_doc(pokemon_data, pokemon_soup))

    for gen in gens_to_scrape:
        moves_url = f"https://pokemondb.net{pokemon['href']}/moves/{gen}"
        resp = requests.get(moves_url, timeout=30)
        # Skip generations for which no moves page is returned.
        if resp.status_code != 200:
            continue

        moves_soup = BeautifulSoup(resp.text, "html.parser")
        move_docs = build_moves_docs_for_generation(pokemon_data, gen, moves_soup)
        chunk_groups["moves"].extend(move_docs)

    chunk_groups["locations"].append(build_locations_doc(pokemon_data, pokemon_soup))
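
A full run issues one request per Pokemon plus one per generation, so a pause and retry on transient failures can make the crawl more robust. The sketch below is one possible approach, not what the notebook does: `fetch_with_retry` is a hypothetical helper, the retry count and delay are arbitrary assumptions, and the fetch function is passed in so the logic can be exercised without network access.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Hypothetical helper: call fetch(url), retrying on exceptions and 5xx responses.

    fetch must return an object with a .status_code attribute (as requests.get does).
    Returns the last response, or None if the final attempt raised an exception.
    """
    response = None
    for attempt in range(1, retries + 1):
        try:
            response = fetch(url)
            if response.status_code < 500:
                # Success, or a definitive answer such as 404: no point retrying.
                return response
        except Exception:
            # Treat network errors like a failed attempt and try again.
            response = None
        if attempt < retries:
            # Linear backoff: wait longer after each failed attempt.
            sleep(delay * attempt)
    return response


# Offline demonstration with a fake fetcher that fails twice, then succeeds.
class FakeResponse:
    def __init__(self, status_code, text=""):
        self.status_code = status_code
        self.text = text


calls = []
waits = []


def flaky_fetch(url):
    calls.append(url)
    return FakeResponse(200, "<html></html>") if len(calls) >= 3 else FakeResponse(503)


result = fetch_with_retry(flaky_fetch, "https://example.invalid/page", sleep=waits.append)
print(result.status_code, len(calls), waits)
```

In the scraping loop, a call such as `fetch_with_retry(lambda u: requests.get(u, timeout=30), moves_url)` could stand in for the direct `requests.get`, with a `None` result handled the same way as a non-200 status.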


In [304]:
for group_name, group_chunks in chunk_groups.items():
    group_path = output_path.with_name(f"{output_path.stem}_{group_name}{output_path.suffix}")

    group_path.parent.mkdir(parents=True, exist_ok=True)

    with group_path.open("w", encoding="utf-8") as f:
        for c in group_chunks:
            f.write(json.dumps(c, ensure_ascii=False) + "\n")

    print(f"Wrote {len(group_chunks)} documents to {group_path.resolve()}")

Wrote 250 documents to /content/data/pokemon_core.jsonl
Wrote 250 documents to /content/data/pokemon_training.jsonl
Wrote 250 documents to /content/data/pokemon_breeding.jsonl
Wrote 250 documents to /content/data/pokemon_statistics.jsonl
Wrote 250 documents to /content/data/pokemon_evolutions.jsonl
Wrote 23899 documents to /content/data/pokemon_moves.jsonl
Wrote 250 documents to /content/data/pokemon_locations.jsonl
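
A quick way to sanity-check the written files is to read one back and confirm that every line parses as JSON. The sketch below does a round trip on a temporary file with made-up records. `read_jsonl` is a hypothetical helper, and the record fields shown are placeholders rather than the notebook's actual document schema.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Hypothetical helper: load a JSON Lines file into a list of dicts."""
    records = []
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records


# Placeholder records; the real documents carry whatever fields the build_* functions emit.
sample = [
    {"id": "example-1", "text": "Pokémon A"},
    {"id": "example-2", "text": "Pokémon B"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample_core.jsonl"
    # Same write pattern as the cell above: one JSON object per line, UTF-8, no ASCII escaping.
    with path.open("w", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    loaded = read_jsonl(path)

print(len(loaded), loaded == sample)
```

Comparing `len(read_jsonl(group_path))` with `len(group_chunks)` for each group would confirm that the counts printed above match what is on disk.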
