I was originally going to put this in Pokemon_Scrape.ipynb, but I realized there was a lot of cleaning to do, so I moved it to this new file to keep things organized.

# API Connection

Pokemon Showdown (the website I play pokemon on) has an API that serves its pokedex and move data. I'm just going to access it here and then save the data myself.

In [68]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests
print('setup complete')

setup complete


In [69]:
pokemon_df = pd.read_csv('Natdex_Data.csv')

# Natdex_Data.csv may already contain isLegend and isFinal from an earlier run of this notebook.
# Drop them if present so they can be rebuilt below; errors='ignore' keeps this working on the
# original file, which did not have these columns yet.
pokemon_df = pokemon_df.drop(columns=['isLegend', 'isFinal'], errors='ignore')

pokemon_df.head()

Unnamed: 0,name,type1,type2,ability1,ability2,hiddenability,hp,atk,physdef,spatk,spdef,speed,bst,alternate
0,Bulbasaur,grass,poison,Overgrow,Chlorophyll,,45,49,49,65,65,45,318,False
1,Ivysaur,grass,poison,Overgrow,Chlorophyll,,60,62,63,80,80,60,405,False
2,Venusaur,grass,poison,Overgrow,Chlorophyll,,80,82,83,100,100,80,525,False
3,Charmander,fire,,Blaze,Solar Power,,39,52,43,60,50,65,309,False
4,Charmeleon,fire,,Blaze,Solar Power,,58,64,58,80,65,80,405,False


In [70]:
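response = requests.get("https://play.pokemonshowdown.com/data/pokedex.json", timeout=30)
response.raise_for_status()  # stop here on an HTTP error instead of trying to parse an error page
pokedex_df = pd.DataFrame.from_dict(response.json(), orient='index')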
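To see what orient='index' does without hitting the network, here is a minimal sketch on a hand-written payload. The two entries in sample_payload are a heavily trimmed, hypothetical stand-in for the real response; I'm assuming it is a dict keyed by a lowercase ID with one nested dict per pokemon.

```python
import pandas as pd

# Hypothetical, trimmed stand-in for the JSON returned by the endpoint:
# one nested dict per pokemon, keyed by a lowercase ID.
sample_payload = {
    "bulbasaur": {"num": 1, "name": "Bulbasaur", "types": ["Grass", "Poison"],
                  "evos": ["Ivysaur"]},
    "venusaur": {"num": 3, "name": "Venusaur", "types": ["Grass", "Poison"],
                 "prevo": "Ivysaur"},
}

# orient='index' turns each top-level key into a row label and each inner key
# into a column. Keys that an entry lacks (evos, prevo) are filled with NaN.
sample_df = pd.DataFrame.from_dict(sample_payload, orient="index")
print(sample_df[["num", "name", "evos", "prevo"]])
```

This is why so many of the columns are mostly empty: a column exists as soon as a single entry uses that key.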

In [71]:
pokedex_df.columns

Index(['num', 'name', 'types', 'genderRatio', 'baseStats', 'abilities',
       'heightm', 'weightkg', 'color', 'evos', 'eggGroups', 'tier',
       'isNonstandard', 'prevo', 'evoLevel', 'otherFormes', 'formeOrder',
       'canGigantamax', 'baseSpecies', 'forme', 'requiredItem', 'changesFrom',
       'evoCondition', 'evoType', 'gender', 'gen', 'evoItem', 'evoRegion',
       'canHatch', 'evoMove', 'tags', 'baseForme', 'cosmeticFormes', 'maxHP',
       'requiredAbility', 'battleOnly', 'requiredMove', 'requiredItems',
       'cannotDynamax'],
      dtype='object')

As it happens, the only useful information I can glean from this data is whether a pokemon is legendary or mythical, and where it sits in its evolutionary line. After I engineer those features, I just need to merge them into my original data. The other information I want (usage, tiering, move descriptions, ability descriptions) will have to come from other data sources. Showdown has more data like this, so I'll get it from there. For now, I just need to engineer the tags.

As I go through each of these columns, I'm going to write down my interpretation of them here. I don't think I'll be able to find much documentation in a reasonable amount of time. If a column is not useful to me, I'm just going to mark it "not useful" along with any other remarks, because there are too many columns to describe all of them when I'm only going to use a couple.

* num: The pokedex number of the pokemon. Any unofficial or not-actual pokemon (CAP, missingno, etc.) have a number that is 0 or below. Alternate forms of a pokemon have their own separate entry but share the same number.
* name: The English name of the pokemon. Letters are capitalized, spaces are kept, and the formatting is good for my purposes. The only formatting issue I can see is that all variants are marked with a dash and a suffix, whereas my regional variants have "Alolan" or "Galarian" or what have you at the beginning of the name. I'll have to clean this up when I execute the merge.
* types: Simple enough. A list of the types, with each type capitalized. The column as a whole comes back as a numpy array, but calling ...['types'].values[0] appears to return a python list of strings, which may be easier to work with.
* genderRatio: not useful for me. It's the percent likelihood of a randomly selected pokemon of that species being a specific gender.
* baseStats: I already have this information in my pre-scraped data. It appears to be a dictionary of each of the stats. Beyond checking the original data for errors, there's not much more I want to do with this.
* abilities: A dictionary of abilities. "H" is a hidden ability. Similar situation to baseStats.
* heightm: Height of the pokemon in meters. Probably not too useful.
* weightkg: Weight of the pokemon in kilograms. Only impactful in very specific scenarios (some moves deal more damage depending on the weight of the target). Nevertheless, I'm not planning on using it.
* color: not useful (visual description of the pokemon)
* evos: A list of what the pokemon evolves into. Useful because all fully evolved pokemon have a Null in this spot, which means I can pick them out with this. This will probably require some feature engineering.
* eggGroups: not useful for me. These are the egg groups, which dictate which pokemon a certain pokemon can breed with in-game.
* tier: The tiering of the pokemon in the current generation (gen 9). Although this is eventually going to be a target feature, this column is unreliable in this dataset. In the current generation, only a little over a hundred pokemon are allowed, compared to the roughly 1200 pokemon that have ever existed, which means all of the other pokemon are given the tiering "illegal" or something similar. I'll probably have to access another file to find the tiering of these pokemon in the last generation they were available. Therefore, this column isn't useful to me right now.
* isNonstandard: A tag for whether the pokemon is available in the current generation (generation 9). Anything with "CAP", "Custom", or "LGPE" will be excluded, because those aren't part of competitive pokemon (which is what the analysis is geared towards).
* prevo: The pre-evolution of the pokemon. Will be used in conjunction with "evos" during feature engineering to get an index for where along the evolution chain this pokemon is.
* evoLevel: not useful
* otherFormes: Tells me whether the pokemon has other forms. Some of these forms may or may not count as actual pokemon for my purposes. I'll have to see how I can use this later.
* formeOrder: A list of all the possible forms of this pokemon, present only on the original form. I don't know how this differs from otherFormes.
* canGigantamax: gigantamax/dynamax is always banned, not useful
* baseSpecies: Will be very useful for indexing. Gives the base species of this pokemon. With it I can distinguish the forms that will be used as their own pokemon from those that won't. For example, all pokemon with a base species of pikachu will not be useful at all, since they're not tiered any differently than regular pikachu.
* forme: Will be very important for querying only the pokemon that I need, since all pokemon of certain specific forms (mega, alola, galar) are useful, while others will not be used (gmax, hisui for now).
* requiredItem: not useful
* changesFrom: Another "baseSpecies" clone, but this one has fewer entries for some reason. I'll have to see how this differs from baseSpecies when I get to cleaning the data.
* evoCondition: not useful
* evoType: not useful
* gender: not useful
* gen: I think this is the gen in which the pokemon was added to pokemon showdown. Unfortunately this doesn't match the generation the pokemon was introduced in, so I'll have to look elsewhere for that information.
* evoItem: not useful
* evoRegion: not useful
* canHatch: not useful
* evoMove: not useful
* tags: Tags whether the pokemon is a restricted legendary, sub-legendary, mythical, or paradox pokemon. Will be useful for tagging for later analysis. Like with types, it's in the form of a numpy array. This could be pretty hard to work with; I would have preferred some format other than an array.
* baseForme: This seems to be the suffix of the base form for certain pokemon. I don't think this will be of much use to me.
* cosmeticFormes: not useful
* maxHP: This column has one single entry, and that is Shedinja. Shedinja is special because its HP is hard capped at 1 HP, even with EV investment and at level 100. This is important, but only for Shedinja, and I think its base HP of 1 will be enough for models to tell that it has terrible hp. Therefore, not useful.
* requiredAbility: not useful
* battleOnly: I think this exists if the pokemon has a form that is only available in battle. This seems like it could be useful, because pokemon with battle-only forms are grouped together with their alternate form counterparts even though they're still listed separately. This could help me avoid adding false or redundant data.
* requiredMove: not useful
* requiredItems: not useful
* cannotDynamax: not useful
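Since types holds lists and baseStats holds dictionaries, here is a minimal sketch of how those nested cells could be flattened into the same layout as my scraped data, which would make the error check mentioned under baseStats straightforward. The two-row nested_df is a hypothetical stand-in, and the stat keys (hp, atk, def, spa, spd, spe) are my assumption about what the dictionaries contain.

```python
import pandas as pd

# Hypothetical two-row frame shaped like the API data described above.
nested_df = pd.DataFrame({
    "name": ["Bulbasaur", "Charmander"],
    "types": [["Grass", "Poison"], ["Fire"]],
    "baseStats": [
        {"hp": 45, "atk": 49, "def": 49, "spa": 65, "spd": 65, "spe": 45},
        {"hp": 39, "atk": 52, "def": 43, "spa": 60, "spd": 50, "spe": 65},
    ],
})

# Each baseStats cell is a dict, so a frame built from the list of dicts
# gets one column per stat.
stats = pd.DataFrame(nested_df["baseStats"].tolist(), index=nested_df.index)

# Each types cell is a list of one or two strings; lowercase them to match the
# scraped data and leave the second slot empty for single-type pokemon.
types = pd.DataFrame(
    [(t[0].lower(), t[1].lower() if len(t) > 1 else None) for t in nested_df["types"]],
    columns=["type1", "type2"],
    index=nested_df.index,
)

flat_df = pd.concat([nested_df[["name"]], types, stats], axis=1)
flat_df["bst"] = stats.sum(axis=1)  # base stat total, for comparison with my bst column
print(flat_df)
```

Comparing flat_df against the scraped frame column by column would surface any stat that was scraped incorrectly.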

Most of the columns I labelled "not useful" have to do with evolution requirements, purely in-game things like breeding and egg hatching, and banned mechanics such as dynamax and gigantamax. Now on to cleaning the data and preparing it to be merged with the original dataframe.

## Cleaning Data and keeping only what I want

In [72]:
# Dropping entries that don't correspond to a pokemon in my scraped data
def process_pokedex(pokedex_df):
    '''
    Preprocess the pokedex data obtained from the Pokemon Showdown API so that it merges correctly with the
    dataframe of the data I scraped myself. In practice this means dropping all the additional entries
    (unofficial pokemon and non-unique alternate forms) that aren't included in it.
    
    Parameters:
        pokedex_df (Pandas DataFrame): The raw pokedex DataFrame built from the API response
    
    Returns:
        Pandas DataFrame: a filtered copy; the input is not modified
    '''
    pokedex_df = pokedex_df.copy()
    
    # Dropping all pokemon with numbers 0 or below (they aren't actual pokemon)
    pokedex_df = pokedex_df.query('num > 0')
    
    # Dropping all pokemon where the name contains a string in drop_list. This is usually alternate forms that don't
    # count as unique pokemon
    drop_list = ['Hisui', 'Pikachu-', 'Gmax', 'Vivillon-', 'Totem', 'Cherrim-', 'Sinistea-', 'Polteageist-',
                '-Neutral', 'Zarude-Dada', '-School', '-Meteor', '-Hangry', 'Genesect-', 'Meloetta-', 'Palafin-',
                'Aegislash-Blade', 'Basculegion-F', 'Basculin-Blue-Striped',
                'Basculin-White-Striped', 'Castform-Rainy', 'Castform-Snowy',
                'Castform-Sunny', 'Cramorant-Gorging', 'Cramorant-Gulping',
                'Darmanitan-Galar-Zen', 'Darmanitan-Zen', 'Dialga-Origin',
                'Dudunsparce-Three-Segment', 'Eevee-Starter', 'Eiscue-Noice',
                'Enamorus-Therian', 'Eternatus-Eternamax', 'Floette-Eternal',
                'Gimmighoul-Roaming', 'Giratina-Origin', 'Hoopa-Unbound',
                'Keldeo-Resolute', 'Magearna-Original', 'Maushold-Four',
                'Mimikyu-Busted', 'Necrozma-Ultra', 'Palkia-Origin',
                'Pichu-Spiky-eared', 'Toxtricity-Low-Key']
    
    mask = pokedex_df['name'].str.contains('|'.join(drop_list))
    pokedex_df = pokedex_df[~mask]
    
    
    return pokedex_df
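One thing to keep in mind about the name filter: str.contains treats the joined string as a regular expression. The drop list above contains only letters and hyphens, so it behaves as intended, but an entry with a character like "." would match more than expected. Here is a minimal sketch of guarding against that with re.escape. The names and sample_drop_list are made up for illustration ("Mrs Mime" is not a real entry).

```python
import re
import pandas as pd

# Made-up names, chosen so that one of them is only caught by an unescaped dot.
names = pd.Series(["Pikachu", "Pikachu-Cosplay", "Venusaur-Gmax", "Mr. Mime", "Mrs Mime"])

# Hypothetical drop list; "Mr. Mime" is here only to show the effect of the dot.
sample_drop_list = ["Pikachu-", "Gmax", "Mr. Mime"]

# Unescaped: "." matches any character, so "Mrs Mime" is caught as well.
unescaped = names.str.contains("|".join(sample_drop_list))

# Escaped: every entry is matched as a literal substring.
escaped = names.str.contains("|".join(re.escape(s) for s in sample_drop_list))

print(pd.DataFrame({"name": names, "unescaped": unescaped, "escaped": escaped}))
```

Note that plain "Pikachu" survives in both cases because the drop entry includes the trailing dash.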

In [73]:
pokedex_df = process_pokedex(pokedex_df)
pokedex_df.shape

(1176, 39)

## Feature Engineering and merging with Natdex_Data.csv

Here's what I plan to take from this pokemon showdown dataset:
* evolutionary place: using evos and/or prevo (stored as isFinal)
* legendary status: using tags (stored as isLegend)

In [74]:
# Column that says if the pokemon is legendary, mythical, paradox, or sub-legendary. All of them count the same for my
# purposes, so any pokemon with a non-null tags entry is flagged.

pokedex_df['isLegend'] = pokedex_df['tags'].notna()
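If I ever want to tell the categories apart instead of lumping them together, a minimal sketch could look like the following. The tagged frame is a hypothetical stand-in, and the exact tag strings are my assumption about what the tags lists contain.

```python
import pandas as pd

# Hypothetical stand-in: tags is either missing or a list of strings.
tagged = pd.DataFrame({
    "name": ["Bulbasaur", "Mewtwo", "Mew"],
    "tags": [float("nan"), ["Restricted Legendary"], ["Mythical"]],
})

# Same rule as above: any non-null tags entry counts as legendary.
tagged["isLegend"] = tagged["tags"].notna()

# Keep the category as a readable label; untagged pokemon get an empty string.
tagged["tagLabel"] = tagged["tags"].apply(
    lambda t: ", ".join(t) if isinstance(t, list) else ""
)
print(tagged[["name", "isLegend", "tagLabel"]])
```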

In [75]:
pokedex_df['isFinal'] = pokedex_df.evos.isnull()
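isFinal only separates fully evolved pokemon from everything else. For an actual index along the evolution chain, a minimal sketch is to follow prevo back to the base form and count the steps. The evo_df frame and the evo_stage helper are hypothetical, and I'm assuming prevo holds the name of the pre-evolution exactly as it appears in the name column.

```python
import pandas as pd

# Hypothetical stand-in for the name, prevo and evos columns.
evo_df = pd.DataFrame({
    "name": ["Bulbasaur", "Ivysaur", "Venusaur", "Tauros"],
    "prevo": [None, "Bulbasaur", "Ivysaur", None],
    "evos": [["Ivysaur"], ["Venusaur"], None, None],
})

# name -> pre-evolution (missing for base forms)
prevo_of = dict(zip(evo_df["name"], evo_df["prevo"]))

def evo_stage(name):
    """Count how many pre-evolutions sit below `name` (0 = base form)."""
    stage = 0
    current = prevo_of.get(name)
    # A missing prevo is None or NaN, neither of which is a str, so the walk stops.
    while isinstance(current, str):
        stage += 1
        current = prevo_of.get(current)
    return stage

evo_df["evoStage"] = evo_df["name"].map(evo_stage)
evo_df["isFinal"] = evo_df["evos"].isna()
print(evo_df[["name", "evoStage", "isFinal"]])
```

A pokemon like Tauros, which never evolves, ends up with evoStage 0 and isFinal True, so the two columns together distinguish it from an unevolved base form.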

In [76]:
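# merge() defaults to an inner join, so any pokemon whose name doesn't match exactly in both frames is dropped
total_df = pokemon_df.merge(pokedex_df, on='name')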
total_df = total_df[['name', 'type1', 'type2', 'ability1', 'ability2', 'hiddenability', 'hp', 'atk', 'physdef',
                    'spatk', 'spdef', 'speed', 'bst', 'alternate', 'isLegend', 'isFinal']]
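Because the merge is an inner join on name, any naming mismatch (like my "Alolan ..." prefixes versus showdown's dash suffixes) silently drops rows. Here is a minimal sketch of how I could check for that and convert the names first. The scraped and showdown frames, the REGION_SUFFIX mapping, and the to_showdown_name helper are all hypothetical, and the mapping only covers the two regions shown.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped data and the showdown data.
scraped = pd.DataFrame({"name": ["Bulbasaur", "Alolan Raichu", "Galarian Meowth"]})
showdown = pd.DataFrame({
    "name": ["Bulbasaur", "Raichu-Alola", "Meowth-Galar"],
    "isLegend": [False, False, False],
})

# Hypothetical prefix -> suffix mapping; extend it for other regions as needed.
REGION_SUFFIX = {"Alolan": "Alola", "Galarian": "Galar"}

def to_showdown_name(name):
    """Turn 'Alolan Raichu' into 'Raichu-Alola'; leave other names untouched."""
    prefix, _, rest = name.partition(" ")
    if prefix in REGION_SUFFIX and rest:
        return f"{rest}-{REGION_SUFFIX[prefix]}"
    return name

# indicator=True adds a _merge column saying where each row was found.
before = scraped.merge(showdown, on="name", how="left", indicator=True)
unmatched_before = before.loc[before["_merge"] == "left_only", "name"].tolist()

scraped["showdown_name"] = scraped["name"].map(to_showdown_name)
after = scraped.merge(showdown, left_on="showdown_name", right_on="name",
                      how="left", suffixes=("", "_showdown"), indicator=True)
unmatched_after = after.loc[after["_merge"] == "left_only", "name"].tolist()

print("unmatched before:", unmatched_before)
print("unmatched after:", unmatched_after)
```

Running the same left_only check on the real frames would list exactly which of my pokemon fail to match before anything gets written back to the csv.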

In [77]:
#total_df.to_csv('Natdex_Data.csv', index=False)

And that's it for this task!

## Saving the showdown pokemon names

I'm currently adding the tiering, and the names pokemon showdown uses there are the same as the index of this dataframe from the showdown API. I will now create a file that acts as a conversion between the pokemon names I currently use and the showdown names.

In [78]:
# Maps each display name (index) to the showdown name used as this dataframe's index (values)
name_conversion = pd.Series(index=pokedex_df['name'], data=pokedex_df.index)
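Here is a minimal sketch of how a lookup like this could be used and read back after saving. The toy frame and its lowercase IDs are hypothetical stand-ins for the real index, and the CSV round trip goes through an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Hypothetical slice of the processed frame: index = showdown name, column = display name.
toy = pd.DataFrame({"name": ["Bulbasaur", "Raichu-Alola", "Mr. Mime"]},
                   index=["bulbasaur", "raichualola", "mrmime"])

# display name -> showdown name
lookup = pd.Series(toy.index.to_numpy(),
                   index=pd.Index(toy["name"], name="name"),
                   name="showdown_id")
print(lookup["Mr. Mime"])

# Round trip through CSV, the way the saved file would be read back later.
buffer = io.StringIO()
lookup.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="name")["showdown_id"]

# Convert a whole column of display names in one go.
converted = pd.Series(["Raichu-Alola", "Bulbasaur"]).map(restored)
print(converted.tolist())
```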

In [79]:
#name_conversion.to_csv('name_conversion.csv')