# Pre-Processing
The purpose of this notebook is to highlight how to import the usage data from Pokemon Showdown and save in a more Pandas-friendly output.

The dataframes created here are also provided as CSV files in the database.

In [1]:
import pandas as pd
import re # helps with separating strings with multiple delimiters

## Importing Raw Showdown Data
The Showdown data is a "cute" format that needs to be coaxed into a proper dataframe. The data columns are more or less split by the pipe (`|`) symbol so we can use that as our delimiter. Then we skip the first few rows, define our own column names, and use specific columns that contain the more interesting data. Again, you can view the raw format [here](https://www.smogon.com/stats/2020-12/gen8vgc2021-0.txt) as an example.

In [2]:
def import_showdown_usage_stats(filepath="../input/gen8-vgc-december-series-7-usage-stats", filename="2021-01"):
    """
    Imports the VGC series usage stats from Pokemon Showdown
    
    Inputs:
    - filepath: string of the path to the file
    - filename: string of the file name WITHOUT the extension - assumes txt
    
    Returns a dataframe with the Pokemon, the raw counts for usage, and the percentage of teams it is used on ordered by most used. 
    """
    # importing data, delimiting by pipe, skipping first few rows, defining our own column names, and only importing the necessary columns
    usage_df = pd.read_csv(f"{filepath}/{filename}.txt", sep="|",skiprows=5,names=["rank","pokemon","count",],usecols=[1,2,4])
    # drop any NaNs we find
    usage_df.dropna(inplace=True)
    # changing data type of the columns
    for k, v in {"rank":int,"pokemon":str,"count":int}.items():
        usage_df[k] = usage_df[k].astype(v)
    # fixing leading/trailing white spaces
    usage_df["pokemon"] = usage_df["pokemon"].str.strip()
    
    # getting the overall number of battles for percentage
    usage_file = open(f"{filepath}/{filename}.txt","r")
    n = re.split("\n| ",usage_file.readline())[3]
    # adding in percentage column
    usage_df["n"] = int(n)
    usage_df["percent"] = usage_df["count"] / int(n)
    
    return usage_df.set_index("pokemon")

With the function defined above to help us import and process the text-formatted files, we can produce four sets of dataframes to work with:
1. November Usage Stats - first month of Series 7
2. December Usage Stats
3. January Usage Stats - last month of Series 7
4. Combined Usage Stats

## Monthly Usage Stats 
For these data, we simply import them, convert them to dataframes, and then save their output.

In [3]:
usage = {}
months = ["2020-11","2020-12","2021-01"]
for file in months:
    usage[file] = import_showdown_usage_stats(filename=file)
    
for key, df in usage.items():
    print(key,"\n", df.head())

2020-11 
                   rank    count        n   percent
pokemon                                           
Tapu Fini            1  1354612  1883573  0.719171
Regieleki            2   924749  1883573  0.490955
Incineroar           3   839666  1883573  0.445784
Landorus-Therian     4   770399  1883573  0.409009
Metagross            5   730940  1883573  0.388060
2020-12 
             rank   count        n   percent
pokemon                                    
Tapu Fini      1  677186  1287914  0.525801
Incineroar     2  639683  1287914  0.496681
Rillaboom      3  589514  1287914  0.457728
Regieleki      4  566288  1287914  0.439694
Urshifu        5  560059  1287914  0.434857
2021-01 
             rank   count        n   percent
pokemon                                    
Incineroar     1  691534  1153370  0.599577
Rillaboom      2  557659  1153370  0.483504
Tapu Fini      3  531015  1153370  0.460403
Regieleki      4  466890  1153370  0.404805
Urshifu        5  410550  1153370  0.3559

## Combined Usage Stats
The combined usage stats might be of use, but take them with a grain of salt: the meta is constantly evolving to counteract new threats that are then counteracted themselves. Be weary of looking at these summary data.

We will need the `numpy` library to help generate a list for us.

In [4]:
import numpy as np

Now we create a function to help us combined data from all the monthly dataframes we generated before.

In [5]:
def get_combined(usage_dict):
    """
    Combines dataframes from the monthly usage statistics housed in the dictionary
    
    Inputs:
    - usage_dict: dictionary of usage dataframes indexed by month
    
    Returns a dataframe with the information from each of the monthly dataframes combined.    
    """
    combined_df = pd.concat([df for df in usage.values()],axis=0,join='outer')
    aggregated_df = combined_df.groupby(combined_df.index).agg(sum)
    # dropping rank and percent because they don't mean anything now
    aggregated_df.drop(["rank","percent"],axis=1,inplace=True)
    # sorting and redefining rank
    aggregated_df.sort_values("count",ascending=False,inplace=True)
    aggregated_df["rank"] = np.arange(1,len(aggregated_df)+1,1)
    # getting percent back
    aggregated_df["n"] = max(aggregated_df["n"]) # overwriting n with total
    aggregated_df["percent"] = (aggregated_df["count"]/aggregated_df["n"]).astype(float)
    
    return aggregated_df

In [6]:
combined_df = get_combined(usage)
combined_df.head()

Unnamed: 0_level_0,count,n,rank,percent
pokemon,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Tapu Fini,2562813,4324857,1,0.592578
Incineroar,2170883,4324857,2,0.501955
Regieleki,1957927,4324857,3,0.452715
Rillaboom,1731032,4324857,4,0.400252
Landorus-Therian,1603189,4324857,5,0.370692


### Note on Combined Usage Data
Showdown allowed restricted Pokemon (for the upcoming series) prior to Series 8 being released for the cartridge game. Therefore, the last month of data and the combined dataframe have restricted Pokemon included.

This pattern is likely to continue for future VGC development so only the first two months of data should be considered.

In [7]:
last_month_df = usage[months[-1]]
restricted_mons = last_month_df[last_month_df.index.isin(["Mewtwo","Lugia","Ho-oh","Groudon","Kyogre","Rayquaza"
                                                         "Dialga","Palkia","Giratina","Reshiram","Zekrom","Kyurem",
                                                         "Xerneas","Yveltal","Zygarde"])]
restricted_mons

Unnamed: 0_level_0,rank,count,n,percent
pokemon,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Kyogre,54,62191,1153370,0.053921
Groudon,82,30503,1153370,0.026447
Yveltal,97,23685,1153370,0.020535
Xerneas,116,17470,1153370,0.015147
Palkia,175,7318,1153370,0.006345
Lugia,197,5779,1153370,0.005011
Zygarde,201,5626,1153370,0.004878
Reshiram,220,4724,1153370,0.004096
Mewtwo,254,3697,1153370,0.003205
Zekrom,293,2674,1153370,0.002318
