## Web Scraping

### 1. From pokemondb.net, I want the following information:
- Pokedex #
- Pokemon Name
- Type
- Stat Total (sum of HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed)
- HP
- Attack
- Defense
- Sp. Atk
- Sp. Def
- Speed
- Pokemon Moves learned via level up

### 2. From bulbapedia.bulbagarden.net, I want the following information:
- Catch Rates
- Egg Groups

### 3. From serebii.net, I want the following information:
- Pokemon unobtainable in Sword/Shield
- Pokemon unobtainable in Brilliant Diamond and Shining Pearl

In [1]:
# Import relevant libraries
import pandas as pd
import numpy as np
import requests
import re
from bs4 import BeautifulSoup as BS

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Change the number of rows shown in printed dataframes
pd.set_option('display.max_rows', 10)

In [3]:
# Read in pokemondb webpage
URL = 'https://pokemondb.net/pokedex/all'
response = requests.get(URL)
soup = BS(response.text)

# Bring pokedex table into notebook as a dataframe
pokedex = pd.read_html(str(soup.find("table")))[0]

In [4]:
# Read in bulbapedia webpage for catch rates
URL = 'https://web.archive.org/web/20220520075934/https://bulbapedia.bulbagarden.net/wiki/List_of_Pokémon_by_catch_rate'
response = requests.get(URL)
soup = BS(response.text)

# Bring catch rates table into notebook as a dataframe
catch_rate = pd.read_html(str(soup.findAll("table")))[1]

In [5]:
# Read in bulbapedia webpage for egg groups
URL = 'https://web.archive.org/web/20220503001732/https://bulbapedia.bulbagarden.net/wiki/List_of_Pokémon_by_Egg_Group'
response = requests.get(URL)
soup = BS(response.text)

# Bring egg group table into notebook as a dataframe
egg_group = pd.read_html(str(soup.findAll("table")))[2]

In [6]:
# Read in list of Pokemon not available in Sword and Shield
URL = 'https://www.serebii.net/swordshield/unobtainable.shtml'
response = requests.get(URL)
soup = BS(response.text)

# Bring excluded Pokemon table into notebook as a dataframe
pokemon_not_in_sword_shield = pd.read_html(str(soup.findAll("table")))[0]

In [7]:
# Read in list of Pokemon not available in Brilliant Diamond and Shining Pearl
URL = 'https://www.serebii.net/brilliantdiamondshiningpearl/unobtainable.shtml'
response = requests.get(URL)
soup = BS(response.text)

# Bring excluded Pokemon table into notebook as a dataframe
pokemon_not_in_diamond_pearl = pd.read_html(str(soup.findAll("table")))[0]
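
Each scraping cell above picks its table by position (`[0]`, `[1]`, `[2]`), which can silently return the wrong table if a page adds or reorders tables. As a minimal sketch, a table could instead be selected by the column headers it is expected to contain. The helper name `pick_table` and the toy frames below are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(set(table.columns)):
            return table
    raise ValueError("No table contains columns: {}".format(sorted(required)))


# Toy stand-ins for the list of DataFrames that pd.read_html returns
tables = [
    pd.DataFrame({"Navigation": ["Home"]}),
    pd.DataFrame({"#": [1, 4],
                  "Name": ["Bulbasaur", "Charmander"],
                  "Catch rate": [45, 45]}),
]

catch_rate_table = pick_table(tables, ["#", "Catch rate"])
print(catch_rate_table.shape)
```

With the real pages, the required column names would have to be checked against what `pd.read_html` actually returns for each site.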

## Data Cleaning

### Certain Pokemon must be removed from the dataset to prevent inaccurate training or overfitting of the machine learning model.
### Remove the following Pokemon:
- Mega Pokemon
- Partner Pokemon
- Primal Pokemon
- Castform Alternate Forms
- Deoxys Alternate Forms
- Rotom Forms
- Dialga, Palkia, Giratina Origin Formes
- Darmanitan Zen Modes
- Basculin White and Red-Striped Form
- Therian Forms
- Black and White Kyurems
- Keldeo Resolute Form
- Ash-Greninja
- Meowstic Female
- Pumpkaboo and Gourgeist Small, Large, and Super Sizes
- Zygarde 10% and Complete Formes
- Rockruff Own Tempo Rockruff
- Wishiwashi School Form
- Toxtricity Amped Form
- Eiscue Noice Face
- Morpeko Hangry Mode
- Eternatus Eternamax
- Urshifu Rapid Strike Style
- Aegislash Shield Forme
- Minior Core Form

In [8]:
# Clean pokedex column names
pokedex.columns = [x.lower().replace(". ","_") for x in pokedex.columns]

# Change '#' column to 'pokedex_number' 
pokedex = pokedex.rename(columns={'#': 'pokedex_number'})

In [9]:
# List of strings for Pokemon to be removed
removable_string = ['Mega ', 
                    'Partner ', 
                    'Primal ', 
                    'Castform ',
                    'Deoxys A', 
                    'Deoxys D', 
                    'Deoxys S',
                    'Rotom ', 
                    'Origin', 
                    'Zen', 
                    'Basculin R', 
                    'Basculin W', 
                    'Therian', 
                    ' Kyurem', 
                    'Resolute', 
                    'Ash-Greninja', 
                    'Meowstic Female', 
                    'Pumpkaboo L', 
                    'Pumpkaboo S', 
                    'Gourgeist L', 
                    'Gourgeist S', 
                    'Zygarde 1', 
                    'Zygarde C', 
                    'Rockruff ', 
                    'Wishiwashi Sc', 
                    'Toxtricity A',
                    'Noice', 
                    'Hangry', 
                    'Eternamax', 
                    'Urshifu Rapid',
                    'Aegislash Shield Forme', 
                    'Minior Core Form']

# Loop to Remove Pokemon
for x in removable_string:
    # regex=False matches each entry as a literal substring
    pokedex = pokedex[~(pokedex['name'].str.contains(x, regex=False))]
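
The removal loop filters the dataframe once per string. A possible alternative, sketched here on made-up toy data, joins the escaped strings into a single pattern and filters in one pass. The names `names`, `removable`, and `kept` are illustrative only.

```python
import re

import pandas as pd

names = pd.DataFrame({"name": ["Venusaur",
                               "Venusaur Mega Venusaur",
                               "Meganium",
                               "Rotom",
                               "Rotom Heat Rotom"]})
removable = ["Mega ", "Rotom "]

# Escape each entry so it is matched literally, then OR them together
pattern = "|".join(re.escape(s) for s in removable)

# Keep only rows whose name matches none of the removable strings
kept = names[~names["name"].str.contains(pattern, regex=True)].reset_index(drop=True)
print(kept["name"].tolist())
```

The trailing space in `'Mega '` and `'Rotom '` is what keeps base names such as Meganium and Rotom in the data.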

### Features that will need to be calculated/added in:
- Pokemon Generation = Done
- Create separate columns for primary and secondary types = Done
- Pokemon Legendary Status (Legendary or Normal) = Done
- Make column that calculates average of all Pokemon stats = Done
- Egg Group (for predicting types) = Done
- Pokemon Movesets, Maybe Count Number of Move Types It Can Learn (for predicting types)
- Pokemon Abilities (for predicting types) (will do if there is more time, but will continue without it)

In [10]:
# Function to determine Pokemon's generation
def pokemon_gen(pokedex_num, pokemon_name):
    if 'Alolan' in pokemon_name:
        return 7
    elif 'Galarian' in pokemon_name:
        return 8
    elif 'Hisuian' in pokemon_name:
        return 8
    elif pokedex_num < 152:
        return 1
    elif pokedex_num < 252:
        return 2
    elif pokedex_num < 387:
        return 3
    elif pokedex_num < 494:
        return 4
    elif pokedex_num < 650:
        return 5
    elif pokedex_num < 722:
        return 6
    elif pokedex_num < 810:
        return 7
    else:
        return 8
    
# Assign a generation to each Pokemon (built as a list to avoid chained assignment)
pokedex['generation'] = [pokemon_gen(number, name)
                         for number, name in zip(pokedex['pokedex_number'], pokedex['name'])]
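
The numeric part of `pokemon_gen` is a threshold lookup, so it can also be written as a binary search over the generation boundaries. The sketch below uses the same cut-offs as the function above; `GENERATION_STARTS` and `generation_from_number` are hypothetical names, and regional forms (Alolan, Galarian, Hisuian) would still need the name checks.

```python
from bisect import bisect_right

# First National Pokedex number of generations 2 through 8
GENERATION_STARTS = [152, 252, 387, 494, 650, 722, 810]


def generation_from_number(pokedex_num):
    """Return the generation (1 to 8) based on the pokedex number alone."""
    return bisect_right(GENERATION_STARTS, pokedex_num) + 1


for number in (1, 151, 152, 809, 810):
    print(number, generation_from_number(number))
```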

In [11]:
# Calculate each Pokemon's average base stat (stat total divided by the six stats)
pokedex.insert(loc = 4, 
               column = 'average', 
               value = (pokedex['total'] / 6).round(2))

In [12]:
# Split the 'type' column into 'primary_type' and 'secondary_type' columns
types = pokedex['type'].str.split(expand = True)
pokedex['type'] = types[0]
pokedex.insert(loc = 3, 
               column = 'secondary_type', 
               value = types[1])

# Rename 'type' column to 'primary_type'
pokedex = pokedex.rename(columns={'type': 'primary_type'})
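
To illustrate the split on a small scale: `str.split(expand=True)` splits on whitespace and returns one column per token, leaving a missing value where a Pokemon has only one type. The two-row frame `toy` is made up for the demonstration.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["Bulbasaur", "Charmander"],
                    "type": ["Grass Poison", "Fire"]})

# Column 0 holds the first type, column 1 the second (missing for single-type Pokemon)
types = toy["type"].str.split(expand=True)
toy["primary_type"] = types[0]
toy["secondary_type"] = types[1]

print(toy[["name", "primary_type", "secondary_type"]])
```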

In [13]:
# Replace special characters in the names of Nidoran (female and male) and Flabébé
pokedex['name'] = pokedex['name'].replace({"Nidoran♀": 'Nidoran F',
                                           "Nidoran♂": 'Nidoran M',
                                           "Flabébé": 'Flabebe'})

In [14]:
# Create list of pokedex numbers for Legendary and Mythical Pokemon
legendary_pokedex_number = [144, 145, 146, 150, 151, 243, 244, 245, 249, 250, 
                            251, 377, 378, 379, 380, 381, 382, 383, 384, 385, 
                            386, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 
                            490, 491, 492, 493, 494, 638, 639, 640, 641, 642, 
                            643, 644, 645, 646, 647, 648, 649, 716, 717, 718, 
                            719, 720, 721, 772, 773, 785, 786, 787, 788, 789, 
                            790, 791, 792, 793, 794, 795, 796, 797, 798, 799, 
                            800, 801, 802, 803, 804, 805, 806, 807, 808, 809, 
                            888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 
                            898, 905]

# Create legendary column and assign bool value to every Pokemon
pokedex['legendary'] = pokedex['pokedex_number'].isin(legendary_pokedex_number)

In [15]:
# Remove Partner Eevee and Pikachu from catch_rate
catch_rate = catch_rate[~(catch_rate['Name'].str.contains('Partner'))].reset_index(drop=True)

# Drop and rename columns in catch_rate
catch_rate = catch_rate.drop(columns={'Name', 'Unnamed: 1'}).rename(columns={'#': 'pokedex_number'})

# Clean catch_rate column names
catch_rate.columns = [x.lower().replace(" ","_") for x in catch_rate.columns]

# Extract digits only in the catch_rate column and convert them to numbers
catch_rate['catch_rate'] = pd.to_numeric(catch_rate['catch_rate'].str.extract(r'(\d+)', expand=False))
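
As a small illustration of the digit extraction, the sketch below runs the same pattern on made-up cell values. The strings in `raw` are assumptions, not the actual contents of the Bulbapedia table.

```python
import pandas as pd

# Hypothetical raw cells: plain number, footnote marker, trailing space, missing
raw = pd.Series(["45", "3*", "255 ", None])

# expand=False returns a Series holding the first run of digits in each cell
digits = raw.str.extract(r"(\d+)", expand=False)

# Convert the digit strings to numbers; missing cells become NaN
numeric = pd.to_numeric(digits)
print(numeric.tolist())
```

Converting to a numeric type at this point keeps the column consistent when integer rows are appended later.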

In [16]:
# Create catch rate rows for last few Pokedex entries
pokedex_number_list = list(range(899, 906))
catch_rate_list = [135, 115, 75, 135, 135, 135, 3]

for i, j in zip(pokedex_number_list, catch_rate_list):
    catch_rate.loc[len(catch_rate.index)] = [i, j]

In [17]:
# Drop and rename columns in egg_group dataframe
egg_group = egg_group.drop(columns={'Unnamed: 1', 'Pokémon'}).rename(columns={'#': 'pokedex_number'})

# Clean egg_group column names
egg_group.columns = [x.lower().replace(" ","_") for x in egg_group.columns]

# Replace 'No Eggs Discovered' entries with NaN
egg_group['egg_group_1'] = egg_group['egg_group_1'].replace({"No Eggs Discovered": np.nan})

# Remove * from both egg_group columns (regex=False treats "*" as a literal character)
for column in ['egg_group_1', 'egg_group_2']:
    egg_group[column] = egg_group[column].str.replace("*", "", regex=False)

In [18]:
# Drop first two rows of pokemon_not_in_sword_shield dataframe
pokemon_not_in_sword_shield = pokemon_not_in_sword_shield.drop([0,1])

# Change name of pokemon_not_in_sword_shield columns
pokemon_not_in_sword_shield.columns = ['pokedex_number', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

# Extract digits only in the first column
pokemon_not_in_sword_shield['pokedex_number'] = pokemon_not_in_sword_shield['pokedex_number'].str.extract(r'(\d+)').astype('int64')

In [19]:
# Drop first row of pokemon_not_in_diamond_pearl dataframe
pokemon_not_in_diamond_pearl = pokemon_not_in_diamond_pearl.drop(0)

# Change name of pokemon_not_in_diamond_pearl columns
pokemon_not_in_diamond_pearl.columns = ['pokedex_number', '1', 'name', '3', '4']

# Extract digits only in the first column
pokemon_not_in_diamond_pearl['pokedex_number'] = pokemon_not_in_diamond_pearl['pokedex_number'].str.extract(r'(\d+)').astype('int64')

# Remove Alolan and Galarian Pokemon
pokemon_not_in_diamond_pearl = pokemon_not_in_diamond_pearl[~(pokemon_not_in_diamond_pearl['name'].str.contains('Alolan'))]
pokemon_not_in_diamond_pearl = pokemon_not_in_diamond_pearl[~(pokemon_not_in_diamond_pearl['name'].str.contains('Galarian'))]

In [20]:
# Pokedex numbers of Pokemon unobtainable in both Sword/Shield and Brilliant Diamond/Shining Pearl
excluded_pokemon = np.intersect1d(pokemon_not_in_sword_shield['pokedex_number'].unique(), pokemon_not_in_diamond_pearl['pokedex_number'].unique())
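
For reference, `np.intersect1d` returns the sorted, unique values present in both inputs, so duplicates and ordering in either list do not matter. The pokedex numbers below are arbitrary placeholders rather than the real exclusion lists.

```python
import numpy as np

# Placeholder pokedex numbers, not the actual scraped lists
not_in_sword_shield = np.array([10, 11, 12, 13, 13])
not_in_diamond_pearl = np.array([13, 12, 495])

# Numbers that appear in both arrays, sorted and de-duplicated
both = np.intersect1d(not_in_sword_shield, not_in_diamond_pearl)
print(both.tolist())
```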

In [21]:
# Create dataframe to record each Pokemon's number of moves
moves_dataset = pd.DataFrame(data={'pokedex_number': [], 'name': []})

# Create one count column per move type
for type_name in pokedex['primary_type'].unique():
    column_name = 'number_of_{}_moves'.format(type_name.lower())
    moves_dataset[column_name] = ""

In [22]:
# Function defined to clean Pokemon names to format them into the moveset URLs
def pokemon_name_cleaning(name):
    name = name.lower().replace(' ', '-').replace('.','').replace(':','').replace("'",'')
    
    # Regional forms: drop everything from the region label onward
    for region in ('alolan', 'galarian', 'hisuian'):
        if region in name:
            return re.sub('-{}.*'.format(region), '', name)
    
    # Pokemon with several forms: keep only the base name before the first hyphen
    multi_form_keywords = ('incarnate', 'deoxys', 'burmy', 'wormadam', 'giratina',
                           'shaymin', 'basculin', 'darmanitan', 'keldeo', 'meloetta',
                           'aegislash', 'pumpkaboo', 'gourgeist', 'zygarde', 'hoopa',
                           'oricorio', 'lycanroc', 'wishiwashi', 'minior', 'necrozma',
                           'toxtricity', 'eiscue', 'indeedee', 'morpeko', 'zacian',
                           'zamazenta', 'urshifu', 'calyrex', 'basculegion', 'meowstic')
    
    if any(keyword in name for keyword in multi_form_keywords):
        name = re.sub('-.*', '', name)
    
    return name

In [23]:
# Function created to determine the correct moves URL and which table on that page to use
def URL_cleaning(origin_name, name, number):
    URL_template = 'https://pokemondb.net/pokedex/{}/moves/{}'
    
    # Forms whose level-up moves are in the third table of the generation 8 page
    third_table_forms = {'Meowth Galarian Meowth', 'Lycanroc Dusk Form', 'Burmy Trash Cloak',
                         'Wormadam Trash Cloak', 'Calyrex Shadow Rider'}
    
    # Forms whose level-up moves are in the second table of the generation 8 page
    second_table_forms = {'Meowth Alolan Meowth', 'Lycanroc Midnight Form', 'Burmy Sandy Cloak',
                          'Wormadam Sandy Cloak', 'Indeedee Female', 'Calyrex Ice Rider',
                          'Exeggutor Alolan Exeggutor', 'Sandshrew Alolan Sandshrew',
                          'Sandslash Alolan Sandslash', 'Marowak Alolan Marowak',
                          'Vulpix Alolan Vulpix', 'Ninetales Alolan Ninetales',
                          'Raichu Alolan Raichu', 'Diglett Alolan Diglett',
                          'Dugtrio Alolan Dugtrio'}
    
    # Choose the generation page and table index; the order of these checks matters
    hisuian = False
    if origin_name in third_table_forms:
        generation, table_index = 8, 2
    elif origin_name in second_table_forms or 'Galarian' in origin_name:
        generation, table_index = 8, 1
    elif origin_name == 'Hoopa Hoopa Unbound' or 'Alolan' in origin_name:
        generation, table_index = 7, 1
    elif number in excluded_pokemon or origin_name == 'Hoopa Hoopa Confined':
        generation, table_index = 7, 0
    elif 'Hisuian' in origin_name:
        generation, table_index = 8, 1
        hisuian = True
    else:
        generation, table_index = 8, 0
    
    # Gather all tables from generated URL
    URL = URL_template.format(name, generation)
    response = requests.get(URL)
    soup = BS(response.text)
    
    # Hisuian forms: read tables only from the page element with id 'tab-moves-20'
    source = soup.find(id='tab-moves-20') if hisuian else soup
    movesets = pd.read_html(str(source))
    
    return movesets, table_index

In [24]:
# Map each move type to its position in temp_move_list
# (positions 0 and 1 hold the pokedex number and name)
type_dict = {"Grass": 2,
             "Fire": 3,
             "Water": 4,
             "Bug": 5,
             "Normal": 6,
             "Dark": 7,
             "Poison": 8,
             "Electric": 9,
             "Ground": 10,
             "Ice": 11,
             "Fairy": 12,
             "Steel": 13,
             "Fighting": 14,
             "Psychic": 15,
             "Rock": 16,
             "Ghost": 17,
             "Dragon": 18,
             "Flying": 19
}

# Filter through each Pokemon and determine number of moves of each type
for number, name in zip(pokedex['pokedex_number'], pokedex['name']):
    origin_name = name
    name = pokemon_name_cleaning(name)
    movesets, table_index = URL_cleaning(origin_name, name, number)
    
    # Use table list index from earlier function with URL string to get specific table out
    moveset = movesets[table_index]

    # Count up unique moves from extracted table
    moveset['Lv.'] = 1
    moveset = moveset.drop_duplicates()
    move_df = moveset.groupby('Type')['Lv.'].sum()
    
    # Temp list for new row to be added to moves_dataset (18 zeroed move-type counts)
    temp_move_list = [number, origin_name] + [0] * 18
    
    # Loop through move_df dataframe and assign counts of move types to temp_move_list
    for index in move_df.index:
        temp_move_list[type_dict[index]] = move_df.loc[index]
    
    # Create new row in moves_dataset dataframe
    moves_dataset.loc[len(moves_dataset.index)] = temp_move_list
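
The counting step inside the loop can be tried on a toy level-up table. The sketch below de-duplicates by move name and then counts per type, which is an assumed equivalent of overwriting `Lv.` before `drop_duplicates`. `ALL_TYPES` is shortened to five types for the demonstration, and the level values are invented.

```python
import pandas as pd

ALL_TYPES = ["Grass", "Fire", "Water", "Normal", "Poison"]  # shortened for the demo

# Toy level-up table; Vine Whip appears twice at different levels
moveset = pd.DataFrame({
    "Lv.": [1, 1, 3, 7, 9],
    "Move": ["Tackle", "Growl", "Vine Whip", "Leech Seed", "Vine Whip"],
    "Type": ["Normal", "Normal", "Grass", "Grass", "Grass"],
})

# Count each move once, then fill in zero for types the Pokemon never learns
counts = (
    moveset.drop_duplicates(subset="Move")["Type"]
    .value_counts()
    .reindex(ALL_TYPES, fill_value=0)
)

# Build a row keyed by column name instead of by list position
row = {"number_of_{}_moves".format(t.lower()): int(n) for t, n in counts.items()}
print(row)
```

Keying the counts by column name avoids depending on the order in which the types first appear in the pokedex, which is what the hard-coded positions in `type_dict` rely on.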

In [25]:
# Change pokedex_number from float type to int type
moves_dataset['pokedex_number'] = moves_dataset['pokedex_number'].astype('int64')

## Data Merging

### Merge the scraped and cleaned dataframes into one and export it to a CSV file.

In [29]:
# Combine the scraped and cleaned dataframes into one
pokedex_merged = pd.merge(pokedex, catch_rate, on='pokedex_number', how='left')
pokedex_merged = pd.merge(pokedex_merged, egg_group, on='pokedex_number', how='left')
pokedex_merged = pd.merge(pokedex_merged, moves_dataset, on=['pokedex_number', 'name'], how='left')
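
Left merges keep every pokedex row even when the other table has no match, so missing catch rates or egg groups would only show up as NaN later. One way to surface them early, sketched here with toy frames, is to pass `indicator=True` and `validate` to `pd.merge`. The frames and the `unmatched` name are illustrative.

```python
import pandas as pd

pokedex = pd.DataFrame({"pokedex_number": [1, 4, 905],
                        "name": ["Bulbasaur", "Charmander", "Enamorus"]})
catch_rate = pd.DataFrame({"pokedex_number": [1, 4],
                           "catch_rate": [45, 45]})

# validate raises if catch_rate has duplicate keys; indicator adds a '_merge' column
merged = pd.merge(pokedex, catch_rate, on="pokedex_number", how="left",
                  validate="many_to_one", indicator=True)

# Rows that found no partner in catch_rate
unmatched = merged.loc[merged["_merge"] == "left_only", "name"].tolist()
print(unmatched)
```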

In [30]:
# Export pokedex_merged dataframe as a csv
pokedex_merged.to_csv('../data/pokedex_merged.csv', index=False)