# Pokémon Showdown challenge

To start off, I needed a way to read the data files containing the usage stats provided by Smogon*. As a base I used a script made by a Kaggle user named 'Fruity Fritz', and I reworked parts of their code to fit my own needs.

In [1]:
#some general imports needed for data cleaning
import pandas as pd
import urllib.request as urllib, os
import ssl

In [2]:
# wrapped in a function so the same kind of file can be imported and formatted again later
def import_showdown_usage_stats(url, filename): # file url and name for local storage
    # location for datafile
    datafolder = 'data/'
    #download file from url to local storage
    gcontext = ssl.SSLContext() # needed to avoid possible certificate validation error
    downloaded = urllib.urlopen(url, context=gcontext)
    buf = downloaded.read()

    # create the data folder if it does not exist yet
    os.makedirs(datafolder, exist_ok=True)

    with open(datafolder + filename, 'wb') as f:
        f.write(buf)
    # import dataframe
    df = pd.read_csv(datafolder + filename, sep="|", skiprows=5, names=["rank", "pokemon", "usage", "count"], usecols=[1, 2, 3, 4])
    # drop any NaNs we find
    df.dropna(inplace=True)
    # changing data type of the columns
    for k, v in {"rank":int,"pokemon":str, "usage":str, "count":int}.items():
        df[k] = df[k].astype(v)
    # fixing leading/trailing white spaces
    df["pokemon"] = df["pokemon"].str.strip()
    df["usage"] = df["usage"].str.strip()
    # fixing format of string to not use '%'
    df["usage"] = df["usage"].str.rstrip('%')
    # convert usage from a percentage string to a fraction between 0 and 1
    df["usage"] = df["usage"].astype(float) / 100
    # get rid of unnecessary columns
    df = df[['pokemon' ,'count' ,'usage']]
    return df
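
To see what the parsing part of this function does without downloading anything, here is a minimal sketch that runs the same `read_csv` settings on an in-memory sample. The sample text is an assumption that only mimics the layout the function expects (five header lines, pipe-separated rows, a closing border line); the helper name `parse_usage` is hypothetical, and the two rows reuse values from the output shown further down.

```python
import io
import pandas as pd

# Assumed layout of a usage file: 5 header lines, data rows, closing border.
sample = "\n".join([
    " Total battles: 100",
    " Avg. weight/team: 1.0",
    " + ---- + ------------ + --------- + ------ +",
    " | Rank | Pokemon      | Usage %   | Raw    |",
    " + ---- + ------------ + --------- + ------ +",
    " | 1    | Flutter Mane | 28.04520% | 795029 |",
    " | 2    | Iron Hands   | 25.55520% | 724440 |",
    " + ---- + ------------ + --------- + ------ +",
])


def parse_usage(buffer):
    """Hypothetical helper: same parsing steps as above, minus the download."""
    df = pd.read_csv(buffer, sep="|", skiprows=5,
                     names=["rank", "pokemon", "usage", "count"],
                     usecols=[1, 2, 3, 4])
    # The closing border line has no '|' separators, so it parses as NaN.
    df = df.dropna()
    df["pokemon"] = df["pokemon"].str.strip()
    # "28.04520%" -> 0.280452
    df["usage"] = df["usage"].str.strip().str.rstrip("%").astype(float) / 100
    df["count"] = df["count"].astype(int)
    return df[["pokemon", "count", "usage"]]


parsed = parse_usage(io.StringIO(sample))
print(parsed)
```

The border line at the end of the table is the reason the `dropna` step is needed before the type conversions.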

Now we use the function above to import and format a usage file from Smogon.

In [3]:
usage_stats = import_showdown_usage_stats(url = "https://www.smogon.com/stats/2023-01/gen9vgc2023series2-0.txt", filename="gen9s2vgc0.txt")
print(usage_stats.shape)
usage_stats.head()

(393, 3)


Unnamed: 0,pokemon,count,usage
0,Flutter Mane,795029,0.280452
1,Iron Hands,724440,0.255552
2,Amoonguss,705360,0.248821
3,Iron Bundle,663073,0.233904
4,Gholdengo,635767,0.224272


Then I searched for a complete dataset containing all Pokémon and their various forms, and found a reasonably good [datafile on Kaggle](https://www.kaggle.com/datasets/brdata/complete-pokemon-dataset-gen-iiv). This file will be merged with the Smogon usage data, and not only with the file I imported above. To validate my findings later, I plan to use usage files from different generations, which is why it is important for this datafile to be complete.

I imported it as a CSV file and found that the encoding was not the standard UTF-8, so I had to specify the encoding in the read function.
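
As an illustration of why the encoding matters, here is a small sketch with a made-up two-line CSV (not the real file). It assumes the problem came from accented characters such as "é", which Windows-1252 stores as a single byte that is not valid UTF-8 on its own.

```python
import io
import pandas as pd

# Hypothetical CSV content, encoded the way the Kaggle file appears to be.
raw = "pokemon,type1\nFlabébé,Fairy\n".encode("windows-1252")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xE9 ("é" in Windows-1252) followed by "b" is not a valid UTF-8 sequence.
    utf8_ok = False

# Telling pandas the actual encoding makes the file readable.
decoded = pd.read_csv(io.BytesIO(raw), encoding="windows-1252")
print(utf8_ok, decoded.loc[0, "pokemon"])
```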

In [4]:
names = ["dex#","pokemon","type1","type2","total","hp","attack","defence","sp.attack","sp.defence","speed","variant"]
stats = pd.read_csv("./data/Pokedex_Cleaned.csv",names=names,encoding='windows-1252')
print(stats.shape)
stats.head(10)

(1173, 12)


Unnamed: 0,dex#,pokemon,type1,type2,total,hp,attack,defence,sp.attack,sp.defence,speed,variant
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,
3,3,Venusaur,Grass,Poison,625,80,100,123,122,120,80,
4,4,Charmander,Fire,,309,39,52,43,60,50,65,
5,5,Charmeleon,Fire,,405,58,64,58,80,65,80,
6,6,Charizard,Fire,Dragon,634,78,130,111,130,85,100,Mega X
7,6,Charizard,Fire,Flying,634,78,104,78,159,115,100,Mega Y
8,6,Charizard,Fire,Flying,534,78,84,78,109,85,100,
9,7,Squirtle,Water,,314,44,48,65,50,64,43,


There were a lot of NaN values, which I needed to fill so that the data is 'complete' and ready to be merged with the usage dataset from Smogon.

In [5]:
stats.type2 = stats.type2.fillna('none') # if the second typing is NaN it means there is no second typing hence the value of none
stats.variant = stats.variant.fillna('') # if the variant is NaN it means there is no variant. replaced with empty string to add variant at the end of the name as done in the usage df
stats.head(10)

Unnamed: 0,dex#,pokemon,type1,type2,total,hp,attack,defence,sp.attack,sp.defence,speed,variant
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,
3,3,Venusaur,Grass,Poison,625,80,100,123,122,120,80,
4,4,Charmander,Fire,none,309,39,52,43,60,50,65,
5,5,Charmeleon,Fire,none,405,58,64,58,80,65,80,
6,6,Charizard,Fire,Dragon,634,78,130,111,130,85,100,Mega X
7,6,Charizard,Fire,Flying,634,78,104,78,159,115,100,Mega Y
8,6,Charizard,Fire,Flying,534,78,84,78,109,85,100,
9,7,Squirtle,Water,none,314,44,48,65,50,64,43,
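
The comment in the cell above mentions appending the variant to the name. A minimal sketch of why an empty string is convenient for that, using two made-up miniature frames: the column name `showdown_name` and the usage numbers are hypothetical, and the hyphen follows the `Name-Variant` notation Smogon uses.

```python
import pandas as pd

# Hypothetical miniature versions of the stats and usage frames.
stats_demo = pd.DataFrame({
    "pokemon": ["Charizard", "Wormadam", "Wormadam"],
    "variant": ["", "", "Sandy"],
    "speed": [100, 36, 36],
})
usage_demo = pd.DataFrame({
    "pokemon": ["Charizard", "Wormadam-Sandy"],
    "usage": [0.05, 0.001],  # made-up values
})

# Keep "" where there is no variant, otherwise use "-<variant>" as suffix.
suffix = stats_demo["variant"].where(stats_demo["variant"] == "",
                                     "-" + stats_demo["variant"])
stats_demo["showdown_name"] = stats_demo["pokemon"] + suffix

# Left join: every usage row picks up the stats of the matching name.
merged = usage_demo.merge(stats_demo, left_on="pokemon",
                          right_on="showdown_name", how="left",
                          suffixes=("", "_base"))
print(merged[["pokemon", "usage", "speed"]])
```

With NaN instead of an empty string, the concatenation would produce NaN names for every Pokémon without a variant.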


After seeing the datafile above I wanted to check the values of both the type columns.

First I wanted to check the type2 column, so I can see if the previous action to fill the NaN values actually worked.

In [6]:
stats.type2.value_counts()

none        540
Flying      118
Psychic      47
Poison       46
Fairy        42
Fighting     42
Ground       41
Steel        38
Dragon       35
Ghost        34
Grass        34
Dark         33
Water        22
Ice          22
Fire         19
Rock         19
Normal       19
Electric     13
Bug           9
Name: type2, dtype: int64

These values are what I expected and also what I want from this column.

When inspecting the datafile in another software tool, I found some weird values in type1 and wanted to check them out.

In [7]:
stats.type1.value_counts()

Water       147
Normal      130
Grass       102
Bug          99
Psychic      76
Electric     73
Fire         73
Rock         67
Dark         56
Dragon       47
Ground       47
Poison       46
Fighting     46
Ghost        46
Steel        42
Fairy        32
Ice          27
Flying       12
Null          2
Male          1
Female        1
Rockruff      1
Name: type1, dtype: int64

After seeing the weird values Rockruff, Null, Male and Female among the types, I looked into them further.

In [8]:
stats.loc[(stats['type1'] == "Rockruff") | (stats['type1'] == "Male") | (stats['type1'] == "Female") | (stats['type1'] == "Null")]

Unnamed: 0,dex#,pokemon,type1,type2,total,hp,attack,defence,sp.attack,sp.defence,speed,variant
801,678,Meowstic,Male,Psychic,466,74,48,76,83,81,104,
802,678,Meowstic,Female,Psychic,466,74,48,76,83,81,104,
884,744,Rockruff,Rockruff,Rock,280,45,65,40,30,40,60,Own Tempo
916,772,Type:,Null,Normal,534,95,95,95,95,95,59,
917,772,Type:,Null,none,534,95,95,95,95,95,59,
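
The same records can also be found without knowing the odd values in advance, by comparing the column against the 18 existing types. This is a sketch on a small made-up frame; `VALID_TYPES` is a name introduced here for illustration.

```python
import pandas as pd

# The 18 Pokémon types; anything else in a type column is suspicious.
VALID_TYPES = {
    "Normal", "Fire", "Water", "Electric", "Grass", "Ice",
    "Fighting", "Poison", "Ground", "Flying", "Psychic", "Bug",
    "Rock", "Ghost", "Dragon", "Dark", "Steel", "Fairy",
}

# Hypothetical excerpt with three faulty rows and one correct row.
demo = pd.DataFrame({
    "pokemon": ["Meowstic", "Rockruff", "Type:", "Squirtle"],
    "type1": ["Male", "Rockruff", "Null", "Water"],
})

# Select every row whose type1 is not a known type.
suspicious = demo[~demo["type1"].isin(VALID_TYPES)]
print(suspicious)
```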


The Null value comes from a Pokémon named "Type: Null", whose name was split across the pokemon and type1 columns; both records are for this same Pokémon. The stats are exactly the same, so I will delete the second record and replace the faulty first record with a correct one.

The values Male and Female are faulty in the sense that they occupy the wrong column: they belong in variant but are in type1. The Rockruff record is also faulty. Its type1 contains the name, and its variant value contains the 'ability', which is another attribute that we do not need in this dataset. I have no idea why it is here.

Because there are only 5 faulty records, I am going to replace them manually.

In [9]:
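# numeric values are written as numbers (not strings) so the column dtypes stay numeric
# sp.defence for Meowstic is 81, as in the original records, so the stats add up to the total of 466
stats.loc[801] = [678, 'Meowstic', 'Psychic', 'none', 466, 74, 48, 76, 83, 81, 104, 'Male']
stats.loc[802] = [678, 'Meowstic', 'Psychic', 'none', 466, 74, 48, 76, 83, 81, 104, 'Female']
stats.loc[884] = [744, 'Rockruff', 'Rock', 'none', 280, 45, 65, 40, 30, 40, 60, '']
stats.loc[916] = [772, 'Type:Null', 'Normal', 'none', 534, 95, 95, 95, 95, 95, 59, '']
# drop the duplicate Type: Null record by its index label
stats.drop(index=917, inplace=True)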

stats.type1.value_counts()

Water       147
Normal      131
Grass       102
Bug          99
Psychic      78
Fire         73
Electric     73
Rock         68
Dark         56
Ground       47
Dragon       47
Fighting     46
Poison       46
Ghost        46
Steel        42
Fairy        32
Ice          27
Flying       12
Name: type1, dtype: int64

As you can see, the weird values are now no longer present in the dataset.
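
Manually typed rows are easy to get wrong, so a quick consistency check can help. The sketch below assumes that `total` is the sum of the six base stats, which holds for the rows shown earlier (for example Bulbasaur: 45+49+49+65+65+45 = 318). The frame is a made-up excerpt with one deliberate typo.

```python
import pandas as pd

STAT_COLUMNS = ["hp", "attack", "defence", "sp.attack", "sp.defence", "speed"]

# Hypothetical excerpt; the second row has a wrong sp.defence on purpose.
demo = pd.DataFrame(
    [
        ["Bulbasaur", 318, 45, 49, 49, 65, 65, 45],
        ["Meowstic", 466, 74, 48, 76, 83, 71, 104],
    ],
    columns=["pokemon", "total"] + STAT_COLUMNS,
)

# Rows where the six stats do not add up to the recorded total.
mismatch = demo[demo[STAT_COLUMNS].sum(axis=1) != demo["total"]]
print(mismatch[["pokemon", "total"]])
```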

Now we are going to change the notation of the variant to the one used by our usage stats dataframe, which is the notation that Smogon uses.
When I found this dataset I knew it would take a lot of work to get all the data cleaned. I still chose this one because it was the most complete I could find. Even though this dataset takes a lot of work to prepare, it only needs cleaning. Other datasets would have to be merged together to create an equivalent one, and I prefer to clean just one dataset as opposed to multiple.

The first thing we are going to do is remove all Mega Evolutions from the list of Pokémon. This is because Mega Evolution is a battle mechanic and not a Pokémon you choose for your team.

In [10]:
stats = stats[stats.variant != 'Mega']
stats = stats[stats.variant != 'Mega X']
stats = stats[stats.variant != 'Mega Y']
stats.variant.value_counts()

                    981
Galarian             16
Alolan               15
Hisuian              13
Therian Form          4
                   ... 
Confined              1
Unbound               1
Baile Style           1
Pa u Style            1
Two-Segment Form      1
Name: variant, Length: 85, dtype: int64
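
The three filters can also be written as a single condition. This is a sketch on a small made-up frame, assuming every Mega variant starts with the word "Mega"; it also strips whitespace first, in case a value has stray spaces around it.

```python
import pandas as pd

# Hypothetical excerpt; Meganium shows that only the variant column is tested.
demo = pd.DataFrame({
    "pokemon": ["Charizard", "Charizard", "Charizard", "Venusaur", "Meganium"],
    "variant": ["Mega X", "Mega Y", "", "Mega", ""],
})

# True for "Mega", "Mega X" and "Mega Y".
is_mega = demo["variant"].str.strip().str.startswith("Mega")
without_megas = demo[~is_mega]
print(without_megas)
```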

After this I needed to clean the remaining variants. To start with I changed the notation of the regions to the one that Smogon uses.

In [11]:
stats['variant'] = stats['variant'].str.strip()
# notation for region specific variants
stats['variant'] = stats['variant'].replace(['Galarian'],'Galar')
stats['variant'] = stats['variant'].replace(['Alolan'],'Alola')
stats['variant'] = stats['variant'].replace(['Hisuian'],'Hisui')
stats['variant'] = stats['variant'].replace(['Paldean'],'Paldea')
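
As a possible alternative, the four calls can be expressed as one dictionary passed to `replace`. A sketch on a made-up Series; `REGION_NAMES` is a name introduced here. Note that this kind of replace only matches whole values, so 'Galarian Zen' is left alone and still needs its own rule.

```python
import pandas as pd

# Mapping from the dataset's notation to the Smogon notation.
REGION_NAMES = {
    "Galarian": "Galar",
    "Alolan": "Alola",
    "Hisuian": "Hisui",
    "Paldean": "Paldea",
}

demo = pd.Series(["Galarian", "Alolan", "", "Galarian Zen"], name="variant")

# Exact-value replacement: only cells equal to a key are changed.
renamed = demo.replace(REGION_NAMES)
print(renamed.tolist())
```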

Then I changed the gender notations.

In [12]:
# gender notation
stats['variant'] = stats['variant'].replace(['Female'],'F')
stats['variant'] = stats['variant'].replace(['Male'],'')

Then I removed descriptive suffixes such as 'Form' and 'Style' from the variant names.

In [13]:
# removed unneeded endings
stats['variant'] = stats['variant'].str.replace(' Form','')
stats['variant'] = stats['variant'].str.replace(' Style','')
stats['variant'] = stats['variant'].str.replace(' Super','')
stats['variant'] = stats['variant'].str.replace(' Size','')
stats['variant'] = stats['variant'].str.replace(' Plumage','')
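
A sketch of how most of these replacements could be combined into one regular expression. It is not identical to the cell above: it only removes the word at the end of the value, and it leaves out the ' Super' rule. The sample values are taken from variants that appear in this dataset.

```python
import pandas as pd

demo = pd.Series(
    ["Therian Form", "Baile Style", "Super Size", "Green Plumage", "Galar"],
    name="variant",
)

# Remove one trailing " Form", " Style", " Size" or " Plumage".
trimmed = demo.str.replace(r" (Form|Style|Size|Plumage)$", "", regex=True)
print(trimmed.tolist())
```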

After that I removed the records of variants that do not change the stats of the Pokémon or are not available in Pokémon Showdown.

In [14]:
# variants that don't matter for the stats or are not available in Pokémon Showdown
stats = stats[stats.variant != 'Blue-Striped']
stats = stats[stats.variant != 'Red-Striped']
stats = stats[stats.variant != 'Partner']
stats = stats[stats.variant != 'Altered']
stats = stats[stats.variant != 'Eternamax']

There were still some Pokémon with multiple variants that did not have a base variant or empty variant. Those still needed fixing.

In [15]:
# base forms that have a variant fixes
stats['variant'] = stats['variant'].replace('White-Striped','')
stats['variant'] = stats['variant'].replace('Hero of Many Battles','')
stats['variant'] = stats['variant'].replace('Incarnate','')
stats['variant'] = stats['variant'].replace('Ordinary','')
stats['variant'] = stats['variant'].replace('Aria','')
stats['variant'] = stats['variant'].replace('Shield','')
stats['variant'] = stats['variant'].replace('Average','')
stats['variant'] = stats['variant'].replace('50%','')
stats['variant'] = stats['variant'].replace('Confined','')
stats['variant'] = stats['variant'].replace('Baile','')
stats['variant'] = stats['variant'].replace('Midday','')
stats['variant'] = stats['variant'].replace('Solo','')
stats['variant'] = stats['variant'].replace('Core','')
stats['variant'] = stats['variant'].replace('Single Strike','')
stats['variant'] = stats['variant'].replace('Full Belly Mode','')
stats['variant'] = stats['variant'].replace('Family of Three','')
stats['variant'] = stats['variant'].replace('Family of Four','')
stats['variant'] = stats['variant'].replace('Amped','')
stats['variant'] = stats['variant'].replace('Two-Segment','')
stats['variant'] = stats['variant'].replace('Green','')

Finally, I manually fixed the remaining variants that each have their own notation.

In [16]:
# unique variant notation fixes
stats['variant'] = stats['variant'].replace('Crowned Sword','Crowned')
stats['variant'] = stats['variant'].replace('Crowned Shield','Crowned')
stats['variant'] = stats['variant'].replace('Ice Rider','Ice')
stats['variant'] = stats['variant'].replace('Shadow Rider','Shadow')
stats['variant'] = stats['variant'].replace('Galarian Zen','Galar-Zen')
stats['variant'] = stats['variant'].replace('Pa u',"Pa'u")
stats['variant'] = stats['variant'].replace('Rapid Strike','Rapid-Strike')
stats['variant'] = stats['variant'].replace('Zero','Rapid-Strike')
stats['variant'] = stats['variant'].replace('Hangry Mode','Hangry')
stats['variant'] = stats['variant'].replace('Low Key','Low-Key')
stats['variant'] = stats['variant'].replace('Galarian Standard','Galar')

stats.variant.value_counts()

                 1009
Galar              17
Alola              15
Hisui              13
Therian             4
Origin              3
F                   3
Large               2
Crowned             2
Rapid-Strike        2
Trash Cloak         2
Sandy Cloak         2
Plant Cloak         2
Primal              2
Small               2
Super               2
White               2
Dawn Wings          1
Midnight            1
School              1
Dusk                1
Meteor              1
Low-Key             1
Dusk Mane           1
Ultra               1
Pom-Pom             1
Hangry              1
Ice                 1
Shadow              1
Blue                1
Yellow              1
Hero                1
Sensu               1
Blade               1
Pa'u                1
Mow                 1
Paldea              1
Rainy               1
Snowy               1
Sunny               1
Fan                 1
Frost               1
Heat                1
Wash                1
Unbound             1
Galar-Zen 

After this there were still 6 records that were not yet fully clean, belonging to the Pokémon Burmy and Wormadam. These two Pokémon share the same 3 variants, but those variants only matter for Wormadam and not for Burmy. This means I need to fix these values manually. The variants in question are Plant Cloak, Sandy Cloak and Trash Cloak. Plant Cloak is the easiest to handle because it is the 'default' form according to the [Smogon Pokédex](https://www.smogon.com/dex/sv/pokemon/), so an empty ('') value is fitting for both Pokémon. The word 'Cloak' also needs to be removed from the other variants to match the Smogon notation.
The only thing left to do was to remove the remaining Burmy records. The easiest way I found was to locate them and remove the unneeded records manually.

In [17]:
stats['variant'] = stats['variant'].replace('Plant Cloak','')
stats['variant'] = stats['variant'].str.replace(' Cloak','')
stats.loc[(stats['pokemon'] == "Burmy") | (stats['pokemon'] == "Wormadam")]

Unnamed: 0,dex#,pokemon,type1,type2,total,hp,attack,defence,sp.attack,sp.defence,speed,variant
493,412,Burmy,Bug,none,224,40,29,45,29,45,36,
494,412,Burmy,Bug,none,224,40,29,45,29,45,36,Sandy
495,412,Burmy,Bug,none,224,40,29,45,29,45,36,Trash
499,413,Wormadam,Bug,Grass,424,60,59,85,79,105,36,
500,413,Wormadam,Bug,Ground,424,60,79,105,59,85,36,Sandy
501,413,Wormadam,Bug,Steel,424,60,69,95,69,95,36,Trash


In [18]:
stats.drop(stats.index[494], inplace=True)
stats.drop(stats.index[495], inplace=True)
stats.loc[(stats['pokemon'] == "Burmy") | (stats['pokemon'] == "Wormadam")]

Unnamed: 0,dex#,pokemon,type1,type2,total,hp,attack,defence,sp.attack,sp.defence,speed,variant
493,412,Burmy,Bug,none,224,40,29,45,29,45,36,
494,412,Burmy,Bug,none,224,40,29,45,29,45,36,Sandy
495,412,Burmy,Bug,none,224,40,29,45,29,45,36,Trash
499,413,Wormadam,Bug,Grass,424,60,59,85,79,105,36,
500,413,Wormadam,Bug,Ground,424,60,79,105,59,85,36,Sandy
501,413,Wormadam,Bug,Steel,424,60,69,95,69,95,36,Trash


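This did not work, so I started thinking again. The rows are most likely still there because `stats.index[494]` returns the label at position 494, and after the earlier row removals that position no longer holds the label 494, so the calls dropped other rows than the ones intended. On reflection, I didn't really need to strip out the Burmy forms, because I want to join this dataset onto the one provided by Smogon. The join only selects the record with the matching variant, and for Burmy that will always be the empty variant because Smogon does not recognize the other Burmy variants. This makes the step above unnecessary.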
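
A small sketch of the difference between dropping by position and dropping by label, on a made-up frame from which one row has already been removed. It is only meant to illustrate the index behaviour, not to redo the step above.

```python
import pandas as pd

demo = pd.DataFrame(
    {"pokemon": ["Bulbasaur", "Ivysaur", "Venusaur", "Charmander", "Charmeleon"]}
)
# Remove one row first, as happened with the Mega rows; labels left: 0, 2, 3, 4.
demo = demo.drop(index=1)

# index[2] is the label at POSITION 2, which is now 3 instead of 2.
label_at_position_2 = demo.index[2]

by_position = demo.drop(demo.index[2])  # removes label 3 (Charmander)
by_label = demo.drop(index=2)           # removes label 2 (Venusaur)

print(label_at_position_2)
print(by_position["pokemon"].tolist())
print(by_label["pokemon"].tolist())
```

Passing the label directly, as in `drop(index=2)`, keeps working no matter how many rows were removed before.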