# Learning Data Exploration & Analysis Through the Pokedex Dataset

## Data exploration & analysis

Data exploration is a critical skill that often goes underappreciated. It is often treated as more of an art than a science because it takes a certain amount of intuition to turn raw data into quantitative or qualitative results. To hone my data exploration skills and teach myself how to create nice-looking visualizations with Altair, I am using the most complete Pokemon dataset I could find on [Kaggle](https://www.kaggle.com/mariotormo/complete-pokemon-dataset-updated-090420?select=pokedex_%28Update_04.21%29.csv).


## How to be the very best?

The only real question when it comes to being a Pokemon trainer is, "Can I be the very best, like no one ever was?" To answer it, we will need to do some digging:

1. What are the distributions of Pokemon stats? (HP, Attack, Defense, Special Attack, Special Defense, Speed)
2. Which Pokemon are strongest overall using these stats? Which are the weakest?
3. Which Pokemon types are the strongest overall? Which are the weakest?

By the end of this notebook I hope you will have a good idea of which Pokemon can help you become the greatest Pokemon trainer of all time!

## Imports

First we'll import the Python modules used in this notebook.

In [1]:
## IMPORTS ##
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import altair as alt
np.set_printoptions(suppress=True)
import matplotlib.pyplot as plt
%matplotlib inline

## Data Sanitization

This dataset came pretty neatly packaged; however, there were a few things I needed to do: keep one row per Pokedex number, set some strings to lower case, make column names match entry values for cross-referencing, and rename some columns for convenience. All other work is done by creating new dataframes from this one.

In [2]:
df = pd.read_csv('pokedex_(Update_05.20).csv')
df = df.rename(columns={'pokedex_number': 'pnum'})
df = df.drop_duplicates('pnum', keep='first')
df['type_2'] = df['type_2'].fillna(df['type_1'])  # single-type Pokemon: reuse type_1 as type_2 (no chained assignment)
df[['type_1', 'type_2']] = df[['type_1', 'type_2']].apply(lambda x: x.str.lower())
df.head(5)

Unnamed: 0.1,Unnamed: 0,pnum,name,german_name,japanese_name,generation,status,species,type_number,type_1,...,against_ground,against_flying,against_psychic,against_bug,against_rock,against_ghost,against_dragon,against_dark,against_steel,against_fairy
0,0,1,Bulbasaur,Bisasam,フシギダネ (Fushigidane),1,Normal,Seed Pokémon,2,grass,...,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.5
1,1,2,Ivysaur,Bisaknosp,フシギソウ (Fushigisou),1,Normal,Seed Pokémon,2,grass,...,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.5
2,2,3,Venusaur,Bisaflor,フシギバナ (Fushigibana),1,Normal,Seed Pokémon,2,grass,...,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.5
4,4,4,Charmander,Glumanda,ヒトカゲ (Hitokage),1,Normal,Lizard Pokémon,1,fire,...,2.0,1.0,1.0,0.5,2.0,1.0,1.0,1.0,0.5,0.5
5,5,5,Charmeleon,Glutexo,リザード (Lizardo),1,Normal,Flame Pokémon,1,fire,...,2.0,1.0,1.0,0.5,2.0,1.0,1.0,1.0,0.5,0.5


### Make a Pokedex
For thematic purposes and for some conveniences later on we'll also make a Pokedex dataframe for easy reference of the Pokemon attributes that are not quantitative in nature.

In [3]:
pokedex_list = ['pnum', 'name', 'type_1', 'type_2', 'generation', 'status']
pokedex = df[pokedex_list]
pokedex.head()

Unnamed: 0,pnum,name,type_1,type_2,generation,status
0,1,Bulbasaur,grass,poison,1,Normal
1,2,Ivysaur,grass,poison,1,Normal
2,3,Venusaur,grass,poison,1,Normal
4,4,Charmander,fire,fire,1,Normal
5,5,Charmeleon,fire,fire,1,Normal


## 1. What are the distributions of Pokemon stats? (HP, Attack, Defense, Special Attack, Special Defense, Speed)

First, we'd like to explore the whole world of Pokemon to understand how their stats vary, and to do that we need to standardize the dataset. Here we rescale each stat across the whole Pokedex to have zero mean and unit standard deviation, which lets us compare individual Pokemon on an even basis. We then perform a sanity check for our peace of mind.


In [4]:
# Make a stats df
stat_list =  ['hp', 'attack', 'defense', 'sp_attack', 'sp_defense', 'speed']
statistics = pd.merge(
    df,
    pokedex,
    on='pnum'
).loc[:, ['pnum', *stat_list]]
# statistics.head()

# standardize data
std_stats = statistics.set_index('pnum').apply(
    lambda x: (x - x.mean()) / x.std())
std_stats['strength'] = std_stats.sum(axis='columns')/np.sqrt(len(std_stats.columns))

# sanity check
sanity_check = pd.DataFrame({'Means':std_stats.mean(axis='rows'), 'Stds': std_stats.std(axis='rows')})
sanity_check.head(6)

Unnamed: 0,Means,Stds
hp,-4.9897660000000003e-17,1.0
attack,-3.655004e-16,1.0
defense,1.591735e-16,1.0
sp_attack,-1.92106e-16,1.0
sp_defense,7.372380000000001e-17,1.0
speed,-1.891121e-16,1.0


Next we explore the stats and check for correlations. A scatter plot matrix shows how the different stats correlate with each other. Here we can notice a few things:

1. In general, there is a positive correlation across all of the stats. This makes intuitive sense: a Pokemon that is "stronger" than another is very likely higher in every attribute, so all stats go up for stronger Pokemon.
2. The primary counterexample is that speed and defense seem not to be correlated at all. This is interesting, since I would expect slower Pokemon to be bigger and bulkier and therefore high in defense, but clearly the two are more independent than that.
3. Finally, the strongest correlations tend to be between attack and special attack, defense and special defense, and attack and defense. Together, this essentially means that "stronger Pokemon are stronger": Pokemon whose attack is high also have high defense, and so on. This reinforces point 1.

Another interesting note: if we color the scatter matrix by status (Normal, i.e. common, Sub Legendary, Legendary and Mythical), all of the Sub Legendary, Legendary and Mythical Pokemon skew to the upper right of the plots. In the density plots they also tend to sit above the zero mean, while common Pokemon tend to stay below it. Again, it appears that Mythical, Legendary and Sub Legendary Pokemon are, as intuition suggests, stronger than Normal Pokemon on the whole.
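If you want the correlations as numbers rather than reading them off the plots, pandas can compute them directly. The sketch below uses a small synthetic table (the column names and the random data are made up for illustration, not taken from the dataset) and squares the Pearson correlation to get the R² of a simple linear fit between two columns, which is the quantity annotated in the scatter plots.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the standardized stats table; not the real dataset.
rng = np.random.default_rng(0)
overall = rng.normal(size=200)            # shared "overall strength" factor
toy = pd.DataFrame({
    'attack':  overall + rng.normal(size=200),
    'defense': overall + rng.normal(size=200),
    'speed':   rng.normal(size=200),      # unrelated to the shared factor
})

corr = toy.corr()          # pairwise Pearson correlation coefficients
r_squared = corr ** 2      # R² of a simple linear fit between each pair of columns
print(r_squared.round(2))
```

On the real data, the same two calls on `std_stats[stat_list]` should give the full 6x6 table.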

In [121]:
def regression_plot(df,r,c,reg=True):
    if r==c:
        chart = alt.Chart(df).transform_density(
            r,
            as_=[r, 'density'],
            groupby=['status'],
            extent=[np.min(df[r])-.25, np.max(df[r])+.25],
        ).mark_area(opacity=.7).encode(
            x=alt.X(r, type='quantitative'),
            y='density:Q',
            color='status:N'
        ).properties(
            width=150,
            height=150
        )
    else:
        chart = alt.Chart(df).mark_circle().encode(
            x=alt.X(r, type='quantitative'),
            y=alt.Y(c, type='quantitative'),
            color=alt.Color('status:N', title='Status')  # color by rarity status
        ).properties(width=150, height=150
        )
        line = alt.Chart(df).mark_line(
            color='black',
            strokeDash=[5]
        ).encode(
            x=alt.X(r, type='quantitative'),
            y=alt.Y(c, type='quantitative')
        ).transform_regression(
            r, c
        )
        corl = df[[r, c]].corr().iloc[0,1]
        text = alt.Chart({'values':[{}]}).mark_text(
            align='right',
            baseline='top'
        ).encode(
            x=alt.value(145),
            y=alt.value(5),
            text=alt.value(f'R²: {corl**2:.2f}')
        )
        if reg:
            chart = chart + line + text
        
    return chart

In [122]:
stats = ['hp', 'attack', 'defense', 'sp_attack', 'sp_defense', 'speed']

strength_breakdown = pd.merge(
    pokedex,
    std_stats,
    on='pnum'
)

chart_regs = []
for c in stats:  # one row of panels per stat
    temp_list = []
    for r in stats:  # one panel per stat within the row; densities land on the diagonal
        reg_plot = regression_plot(strength_breakdown, r, c)
        temp_list.append(reg_plot)
    chart_regs.append(alt.hconcat(*temp_list))


complete_chart = alt.vconcat(*chart_regs)
complete_chart.properties(title='Standardized Pokemon Attribute Distributions').configure_title(anchor='middle')

## 2. Which Pokemon are strongest overall?

### Z-score Strength Analysis

So now we have a feel for how the attributes are distributed against each other, but what makes a strong Pokemon? One way to measure strength is to sum each Pokemon's standardized stats and divide by the square root of the number of attributes. This is known as [Stouffer's Z-score method](https://www.wikiwand.com/en/Fisher%27s_method#/Relation_to_Stouffer's_Z-score_method). By doing this we are making two major assumptions:

1. All stats should be weighted the same (should they?).
2. The stats are mostly independent (they clearly aren't).

So this isn't a perfect metric for strength; however, it should suffice for the most part, and we can always tweak it later if we want.

With 6 attributes the divisor is √6, and the result is a single z-score for each Pokemon. Using this metric, we can directly look up the most powerful Pokemon in our Pokedex.
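To see why the independence assumption matters, here is a small simulation. It is only a sketch on synthetic z-scores: the sample size, the seed and the pairwise correlation `rho = 0.5` are arbitrary choices of mine, not values measured from the dataset. With independent stats the combined score keeps unit spread; with correlated stats the spread grows, so the "z-score" is no longer on a standard normal scale.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k, rho = 100_000, 6, 0.5   # samples, number of stats, assumed pairwise correlation

# Independent z-scores: Stouffer's Z stays on the unit scale.
z_indep = rng.standard_normal((n, k))
stouffer_indep = z_indep.sum(axis=1) / np.sqrt(k)

# Correlated z-scores built from a shared factor (each column still has unit variance).
shared = rng.standard_normal((n, 1))
z_corr = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.standard_normal((n, k))
stouffer_corr = z_corr.sum(axis=1) / np.sqrt(k)

print(f'independent stats: std = {stouffer_indep.std():.2f}')
print(f'correlated stats:  std = {stouffer_corr.std():.2f}')
```

For equally correlated stats the theoretical standard deviation is sqrt(1 + (k - 1) * rho), so the ranking of Pokemon is unaffected but the magnitude of the score should not be read as a true z-score.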

In [91]:
strength_breakdown.sort_values('strength', ascending=False).head(10)[['name', 'status', 'type_1', 'type_2', 'strength']]

Unnamed: 0,name,status,type_1,type_2,strength
492,Arceus,Mythical,normal,normal,4.320112
888,Zamazenta Crowned Shield,Legendary,fighting,steel,4.300168
887,Zacian Crowned Sword,Legendary,fairy,steel,4.271486
889,Eternatus,Legendary,poison,dragon,3.925149
486,Giratina Altered Forme,Legendary,ghost,dragon,3.80647
248,Lugia,Legendary,psychic,flying,3.778434
791,Lunala,Legendary,psychic,ghost,3.768818
249,Ho-oh,Legendary,fire,flying,3.75971
790,Solgaleo,Legendary,psychic,steel,3.742238
715,Xerneas,Legendary,fairy,fairy,3.734728


No surprise, all of the strongest Pokemon are Mythical or Legendary! We can also look at the median for a given type (or dual type) Pokemon and see if there are some types that stand out more than others.

In [95]:
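medians = (
    strength_breakdown
    .groupby(['type_1', 'type_2'])
    .median(numeric_only=True)  # skip text columns such as name and status
    .unstack()['strength']      # type_1 rows x type_2 columns
    .unstack()                  # back to long form; missing type pairs become NaN
    .reset_index()
    .rename(columns={0: 'strength'})
    .sort_values('strength', ascending=False)
)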
medians['order'] = [f'{x+1}' for x in range(len(medians))]


# print(medians)
marks = alt.Chart(medians).mark_rect().encode(
    x = alt.X('type_2:N', sort=None, title='Secondary Type'),
    y = alt.Y('type_1:N', sort=None, title='Primary Type'),
    color = alt.Color('median(strength):Q', scale=alt.Scale(scheme='redblue', domain=(-4, 4)))
)

nulls = marks.transform_filter(
    '!isValid(datum.strength)'
).mark_rect().encode(
    color=alt.value('lightgrey')
)


N = 10
tops_base = alt.Chart(medians.head(N).reset_index()).encode(
    x = alt.X('type_2:N'),
    y = alt.Y('type_1:N')
)

bots_base = alt.Chart(medians.dropna().tail(N).reset_index()).encode(
    x = 'type_2:N',
    y = 'type_1:N'
)

# marks
tops = tops_base.mark_rect(stroke='black', strokeWidth=3, fill=None)
bots = bots_base.mark_rect(stroke='black', strokeWidth=3, fill=None)
top_text= tops_base.mark_text(color='white').encode(
    text='order:N'
)
bot_text = bots_base.mark_text(color='black').encode(
    text='order:N'
)


chart1 = (marks+nulls+tops+bots+top_text+bot_text).properties(width=600,height=600,title='Median Pokemon Strength by Type Including Mythical, Legendary & Sub Legendary Pokemon')
chart1

Here we box the 10 highest and 10 lowest median strengths by type combination. Both steel and dragon types make a strong showing, while bug types really stand out as being very, very weak. But we should keep in mind that this includes Legendary and Mythical Pokemon. It is a little unreasonable to say that to be the best you must catch Pokemon that are literally legends, so let's filter those out.

Below, with the legendaries filtered out, our top 10 has changed. We are very heavy on the dragon front, though. Let's take a look at the type strengths.

In [96]:
strength_breakdown_no_legs = pd.merge(
    pokedex[pokedex['status']=='Normal'],
    std_stats,
    on='pnum'
)

strength_breakdown_no_legs.sort_values('strength', ascending=False).head(10)[['name', 'status', 'type_1', 'type_2', 'strength']]

Unnamed: 0,name,status,type_1,type_2,strength
277,Slaking,Normal,normal,normal,3.59786
657,Goodra,Normal,dragon,dragon,2.613532
423,Garchomp,Normal,dragon,ground,2.57749
805,Dragapult,Normal,dragon,ghost,2.555992
598,Hydreigon,Normal,dark,dragon,2.553792
239,Tyranitar,Normal,rock,dark,2.551189
145,Dragonite,Normal,dragon,flying,2.547311
361,Salamence,Normal,dragon,flying,2.544308
727,Kommo-o,Normal,dragon,fighting,2.52867
364,Metagross,Normal,steel,psychic,2.505379


In [97]:
# The two unstack calls fill in every missing type pair with NaN so we can make the empty space grey
medians_no_legs = (
    strength_breakdown_no_legs
    .groupby(['type_1', 'type_2'])
    .median(numeric_only=True)  # skip text columns such as name and status
    .unstack()['strength']
    .unstack()
    .reset_index()
    .rename(columns={0: 'strength'})
    .sort_values('strength', ascending=False)
)
medians_no_legs['order'] = [f'{x+1}' for x in range(len(medians_no_legs))]

marks = alt.Chart(medians_no_legs).mark_rect().encode(
    x = alt.X('type_2:N', title='Secondary Type'),
    y = alt.Y('type_1:N', title='Primary Type'),
    color = alt.Color('median(strength):Q', scale=alt.Scale(scheme='redblue', domain=(-4, 4)))
).transform_impute(
    impute='strength',
    key='type_1',
    value=-100,
    groupby=['type_2']
)

nulls = marks.transform_filter(
    '!isValid(datum.strength)'
).mark_rect().encode(
    color=alt.value('lightgrey')
)

N = 10
tops_base = alt.Chart(medians_no_legs.head(N).reset_index()).encode(
    x = alt.X('type_2:N'),
    y = alt.Y('type_1:N')
)

bots_base = alt.Chart(medians_no_legs.dropna().tail(N).reset_index()).encode(
    x = 'type_2:N',
    y = 'type_1:N'
)

# marks
tops = tops_base.mark_rect(stroke='black', strokeWidth=3, fill=None)
bots = bots_base.mark_rect(stroke='black', strokeWidth=3, fill=None)
top_text= tops_base.mark_text(color='white').encode(
    text='order:N'
)
bot_text = bots_base.mark_text(color='black').encode(
    text='order:N'
)

chart2 = (marks+nulls+tops+bots+top_text+bot_text).properties(width=600,height=600,title='Median Pokemon Strength by Type Without Mythical, Legendary & Sub Legendary Pokemon')
chart2

Considering only Normal Pokemon, we see that dragon and steel make strong showings, while bug types are still the weakest (sorry, Butterfree fans). Below are both charts side by side for easy reference. You'll notice many more empty squares in the chart on the right. This is because many legendary Pokemon are the only ones with their type combination.

In [98]:
chart1 | chart2

One thing I wanted to investigate, though, was the likelihood of catching the common Pokemon that fit the best types. We don't have data for encounter rates or frequency of interaction, but as a proxy we can count how many different Pokemon exist of a given type. Below is a heatmap of how many common Pokemon exist for each type pair (or single type), with boxes around the strongest types from above. Here we can see that:

1. There are many more single-type Pokemon (e.g. bug-bug) than Pokemon of most dual types.
2. All of the dual types identified as the "best" previously have only a few Pokemon that fit the bill.

This tells us again that to get Pokemon with the strongest types we may have to work a bit harder. This makes sense, since being the best shouldn't come easy.
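The same counts can also be produced as a plain table, which is handy for checking the heatmap. The sketch below uses only the five rows shown in the Pokedex preview earlier, so the numbers are for illustration; the `toy_pokedex` and `pair_counts` names are mine.

```python
import pandas as pd

# The five rows shown in the Pokedex preview above.
toy_pokedex = pd.DataFrame({
    'name':   ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Charmander', 'Charmeleon'],
    'type_1': ['grass', 'grass', 'grass', 'fire', 'fire'],
    'type_2': ['poison', 'poison', 'poison', 'fire', 'fire'],
})

# Number of Pokemon per (type_1, type_2) pair, most common first.
pair_counts = (
    toy_pokedex
    .groupby(['type_1', 'type_2'])
    .size()
    .sort_values(ascending=False)
    .rename('count')
    .reset_index()
)
print(pair_counts)
```

Running the same chain on the filtered `pokedex` dataframe should reproduce the numbers printed inside the heatmap cells.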

In [103]:
dual_types = pokedex[(pokedex['type_1']!=pokedex['type_2']) & (pokedex['status']=='Normal')]
single_types = pokedex[(pokedex['type_1']==pokedex['type_2']) & (pokedex['status']=='Normal')]


base = alt.Chart(dual_types).encode(
    x = alt.X('type_2:N', title='Secondary Type'),
    y = alt.Y('type_1:N', title='Primary Type'),
    color = alt.Color('count():Q', scale=alt.Scale(domain=(0,65)), title='Number of Pokemon')
)

marks = base.mark_rect()
text = base.mark_text().encode(
    text = 'count():Q',
    color=alt.value('black')
)

single_base = alt.Chart(single_types).transform_joinaggregate(
    counts = 'count(type_1)',
    groupby=['type_2']
).encode(
    x=alt.X('type_2:N', title='Secondary Type'),
    y=alt.Y('type_1:N', title='Primary Type'),
    color=alt.Color('counts:Q')
)

single_text = single_base.mark_text().encode(
    text = 'counts:Q',
    # color=alt.value('black')
    color=alt.condition(
        alt.datum.counts > 40,
        alt.value('white'),
        alt.value('black')
    )
)

(marks+single_base.mark_rect()+text+single_text+tops).properties(width=600,height=600, title='Number of Unique Common Pokemon by Type')

Finally, one thing worth noticing is that our previous list of the strongest normal Pokemon (recreated below) includes types that weren't identified as "best" in terms of median. Slaking and Goodra are both single-type Pokemon, which means their strong scores get watered down by other, weaker Pokemon of the same type. In fact, 6 of the top 10 listed below have dragon as their primary type, while we only identified dragon-flying as a top-10 strength type.

In [104]:
strength_breakdown_no_legs.sort_values('strength', ascending=False).head(10)[['name', 'status', 'type_1', 'type_2', 'strength']]

Unnamed: 0,name,status,type_1,type_2,strength
277,Slaking,Normal,normal,normal,3.59786
657,Goodra,Normal,dragon,dragon,2.613532
423,Garchomp,Normal,dragon,ground,2.57749
805,Dragapult,Normal,dragon,ghost,2.555992
598,Hydreigon,Normal,dark,dragon,2.553792
239,Tyranitar,Normal,rock,dark,2.551189
145,Dragonite,Normal,dragon,flying,2.547311
361,Salamence,Normal,dragon,flying,2.544308
727,Kommo-o,Normal,dragon,fighting,2.52867
364,Metagross,Normal,steel,psychic,2.505379


#### Z-score Strength Analysis Conclusions
Overall, the z-score method of evaluating a Pokemon's strength has helped break down the multiple attributes, but it doesn't tell the whole story and needs refinement. Some ways to do this could be weighting the z-score (e.g. scaling HP by 2), using the norm of the stats instead of the z-score as our strength metric, or using principal component analysis (PCA), as in the next section.
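As a rough sketch of the first two refinements, the weighted form of Stouffer's method divides the weighted sum by the square root of the sum of squared weights, which reduces to the unweighted score when all weights are equal. The two example Pokemon and the weights below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def weighted_stouffer(z, weights):
    """Combine z-scores row-wise as sum(w_i * z_i) / sqrt(sum(w_i ** 2))."""
    z = np.asarray(z, dtype=float)
    w = np.asarray(weights, dtype=float)
    return z @ w / np.sqrt(np.sum(w ** 2))

# Hypothetical standardized stats for two Pokemon, in the order
# hp, attack, defense, sp_attack, sp_defense, speed.
z = np.array([
    [2.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # all of its strength is in HP
    [0.0, 0.0, 0.0, 0.0, 0.0, 2.0],   # all of its strength is in speed
])

equal = weighted_stouffer(z, [1, 1, 1, 1, 1, 1])      # same as the unweighted score
hp_heavy = weighted_stouffer(z, [2, 1, 1, 1, 1, 1])   # HP counts double

norm = np.linalg.norm(z, axis=1)   # the alternative "norm of the stats" metric
```

With equal weights the two Pokemon tie; doubling the HP weight ranks the first one higher. One caveat on the norm: it ignores sign, so on standardized stats a Pokemon far below average would score as high as one far above average.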

### Principal Component Analysis of Pokemon Strength

Principal component analysis (PCA) is used to explore datasets and reduce the dimensionality of observations. Mathematically, PCA finds an orthogonal basis whose directions capture as much of the samples' variance as possible. In plain English, it tells us which combination of stats explains the biggest differences between Pokemon, so we can separate weak Pokemon from strong Pokemon.

With the Z-score method we made two assumptions: that the stats should be weighted uniformly and that they are independent. Both were weak assumptions. PCA handles both for us by giving us a weight for each stat and by producing components that are uncorrelated with each other. Ideally this should be a better metric.

For this we take the standardized stats we used before and perform PCA on them to effectively get weights for each of the stats. The first principal component tells us which combination of stats is most important to the variation in Pokemon strength, the second tells us which combination is second most important, and so on. We can do this up to 6 times (the number of original stats), but the nice part about PCA is that we can usually get a good feel with fewer components than we started with. In our case we'll look at the first 3 principal components.
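For intuition about what PCA does under the hood, here is a minimal NumPy-only sketch that treats it as an eigendecomposition of the covariance matrix. The data is synthetic (three correlated features of my own invention), and this is not how scikit-learn implements it internally (it uses an SVD), though the components should agree up to sign.

```python
import numpy as np

# Hypothetical standardized data: 500 samples of 3 correlated features.
rng = np.random.default_rng(1)
shared = rng.standard_normal((500, 1))
X = shared + 0.5 * rng.standard_normal((500, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA as an eigendecomposition of the covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()     # share of total variance per component
scores = X @ eigvecs                          # coordinates of each sample on the PCs
```

Each column of `eigvecs` plays the role of one row of `pca.components_`, and `explained_ratio` corresponds to `pca.explained_variance_ratio_`.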

In [105]:
N = 3
pca = PCA(n_components=N, svd_solver='full')
pca.fit(std_stats[stat_list])
pcscores = pd.DataFrame(pca.transform(std_stats[stat_list]))
pcs = [f'PC{i+1}' for i in range(len(pcscores.columns))]
pcscores.columns = pcs
components = pd.DataFrame(pca.components_, columns = std_stats[stat_list].columns, index=pcs)

# print(defense_components)

var = pca.explained_variance_ratio_
print(f'The first {N} principal components explain {np.sum(var)*100:0.1f}% Variance in Pokemon stats')
print(components)

The first 3 principal components explain 76.8% Variance in Pokemon stats
           hp    attack   defense  sp_attack  sp_defense     speed
PC1  0.420286  0.434930  0.380220   0.444615    0.444662  0.306956
PC2  0.051041  0.018872  0.606017  -0.325287    0.217055 -0.690549
PC3 -0.243443 -0.666128 -0.045343   0.374966    0.591528 -0.066690


Here we see that the first 3 principal components explain 76.8% of the variance in the Pokemon stats HP, Attack, Defense, Special Attack, Special Defense and Speed. 100% would be ideal since it would fully explain the dataset, but 76.8% isn't too bad (a commonly cited acceptable level is about 80%). The first principal component tells us how to weight each stat: HP gets a loading of about 0.42, Attack about 0.43, and so on. These loadings are the components of a unit-length direction rather than percentages, which is why they do not sum to 1. In the first PC all of the stats are weighted roughly the same. Intuitively, this suggests that Pokemon that are higher in all stats have higher PC1 values, so let's take a closer look.
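A quick check on the printed loadings makes their geometry concrete. The numbers below are copied from the components table above; the variable names are mine. The squared PC1 loadings should add up to roughly 1 (a unit-length direction), and the dot product of PC1 and PC2 should be roughly 0 (orthogonal directions), up to the rounding in the printout.

```python
import numpy as np

# Loadings copied from the printed components table above.
pc1 = np.array([0.420286, 0.434930, 0.380220, 0.444615, 0.444662, 0.306956])
pc2 = np.array([0.051041, 0.018872, 0.606017, -0.325287, 0.217055, -0.690549])

length_sq = np.sum(pc1 ** 2)   # squared length of the PC1 direction
overlap = pc1 @ pc2            # dot product between PC1 and PC2

print(f'sum of squared PC1 loadings: {length_sq:.4f}')
print(f'PC1 . PC2: {overlap:.4f}')
```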

In [123]:
pca_stats = pd.merge(
    pokedex,
    std_stats[stat_list] @ components.T, 
    on='pnum'
)
# pca_stats['strength'] = pca_stats[pcs].sum(axis='columns')

chart_regs = []
for c in pcs:  # one row of panels per principal component
    temp_list = []
    for r in pcs:  # one panel per component within the row
        reg_plot = regression_plot(pca_stats, r, c, False)
        temp_list.append(reg_plot)
    chart_regs.append(alt.hconcat(*temp_list))


complete_chart = alt.vconcat(*chart_regs)
complete_chart.properties(title='Principal Component Distributions of Pokemon').configure_title(anchor='middle')

Each dot here represents one Pokemon's stats expressed in principal components. By coloring the points by legendary/non-legendary status, we see that our Mythical, Legendary and Sub Legendary Pokemon have higher values in the first principal component, PC1. Legendary Pokemon have high values for all stats, so this follows our intuition from above that Pokemon high in all stats will have high PC1 values. There is a clear break in the dataset between legendary and non-legendary Pokemon, so in one sense PC1 is actually a measure of how "legendary" a Pokemon is. We also see in the PC2/PC3 plots that the points are more centralized and Gaussian in nature. The only clear break is that Sub Legendary and Mythical Pokemon tend to skew outward in the radial direction, while true Legendary Pokemon overlap more with Normal Pokemon.

Using this information, we can begin to rank Pokemon by their PC1 values to see which Pokemon are more "legendary" than others and compare to our previous z-score strength metric.

In [15]:
pca_rank = pca_stats.sort_values('PC1', ascending=False).head(10)['name']
z_rank = strength_breakdown.sort_values('strength', ascending=False).head(10)['name']
comp = pd.DataFrame({'PCA Ranking': pca_rank.values, 'Z-Score Rank': z_rank.values})
comp.index += 1
comp

Unnamed: 0,PCA Ranking,Z-Score Rank
1,Arceus,Arceus
2,Zamazenta Crowned Shield,Zamazenta Crowned Shield
3,Zacian Crowned Sword,Zacian Crowned Sword
4,Ho-oh,Eternatus
5,Eternatus,Giratina Altered Forme
6,Lunala,Lugia
7,Giratina Altered Forme,Lunala
8,Reshiram,Ho-oh
9,Yveltal,Solgaleo
10,Xerneas,Xerneas


We see here that the first three Pokemon are the same, but there are some differences further down. Some Pokemon, like Ho-oh, Giratina and Eternatus, shift position under the PCA metric, while Lugia and Solgaleo drop out of the PCA column entirely, replaced by Reshiram and Yveltal. Now let's look at non-legendary Pokemon like before.

In [16]:
pca_rank_no_legs = pca_stats[pca_stats['status']=='Normal'].sort_values('PC1', ascending=False).head(10)['name']
z_rank_no_legs = strength_breakdown[strength_breakdown['status']=='Normal'].sort_values('strength', ascending=False).head(10)['name']
comp_no_legs = pd.DataFrame({'PCA Ranking': pca_rank_no_legs.values, 'Z-Score Rank': z_rank_no_legs.values})
comp_no_legs.index += 1
comp_no_legs

Unnamed: 0,PCA Ranking,Z-Score Rank
1,Slaking,Slaking
2,Goodra,Goodra
3,Tyranitar,Garchomp
4,Dragonite,Dragapult
5,Hydreigon,Hydreigon
6,Metagross,Tyranitar
7,Salamence,Dragonite
8,Kommo-o,Salamence
9,Garchomp,Kommo-o
10,Dragapult,Metagross


Here again we see the same kind of idea: the top Pokemon stay on top and the others shift around. Interestingly, though, the same 10 Pokemon make up both top 10 lists. Maybe our z-score metric was pretty good after all!

#### PCA Conclusions

Here we tried to develop a separate metric for how strong a Pokemon is. Using PCA we determined that the first PC was a good measure of how "legendary" a Pokemon is in terms of base stats. Using this information we ranked the top legendary and non-legendary Pokemon and saw that there was not much difference in the very top echelon. Just because our metrics correlate doesn't mean they're good; however, as a first pass these don't appear to be bad metrics, since almost by definition legendary Pokemon should be stronger than others.

## 3. Which Pokemon types are strongest and weakest?

Another way we can look at Pokemon strength is not based on their individual stats, but on their type matchup bonuses. The equation for dealing damage in Pokemon is [very complicated](https://bulbapedia.bulbagarden.net/wiki/Damage), but ultimately there is a multiplier for when certain types of Pokemon match up against others. For example, when a fire type Pokemon attacks a grass type it deals double damage. We can assess how strong a given type of Pokemon is if it can deal more damage by having a higher multiplier in more matchups.

First we'll make a type matchup matrix that reflects all the different possible combinations.

> NOTE: Earlier, "normal" Pokemon referred to rarity, i.e. whether a Pokemon was common or not. Here "normal" is an actual Pokemon attack type and has nothing to do with how common a Pokemon is.



In [124]:
# make a matchup df
versus_list = list(df.columns[df.columns.str.contains('against')])

versus_df = df[(df['status']=='Normal') & (df['type_number']==1) & (df['ability_1']!='Levitate')].drop_duplicates('type_1')
versus_df = versus_df.set_index('type_1')
versus_df = versus_df[versus_list]
versus_df.columns = versus_df.columns.str.replace('against_','')
versus_df = versus_df.rename(columns={'fight':'fighting'})

col_order = ['normal', 'fighting', 'flying', 'poison', 'ground', 'rock', 'bug', 'ghost', 'steel', 'fire', 'water','grass', 'electric', 'psychic', 'ice', 'dragon', 'dark', 'fairy']
versus_df = versus_df.reindex(col_order)
versus_df = versus_df[col_order]
versus_df = versus_df.transpose()

def altair_conv(df):
    df.index.name = None
    df.columns.name = None
    return df.stack().reset_index().rename(columns={'level_0': 'attack_type', 'level_1': 'defend_type', 0:'effect'})

# altair_conv(versus_df) 

# Plot type match up chart
base = alt.Chart(altair_conv(versus_df)).encode(
    x=alt.X('defend_type:N', title='Defending Pokemon Type', axis=alt.Axis(orient='top')),
    y=alt.Y('attack_type:N', title='Attack Type')
).properties(
    title='Pokemon Type Matchup Chart'
)

text = base.mark_text().encode(
    text='effect',
    color=alt.condition(
        alt.datum.effect <= .5,
        alt.value('black'),
        alt.value('white')
    ),
)

effect_plot = base.mark_rect().encode(
    color=alt.Color(
        'effect', 
        type='nominal',
        title='Attack Multiplier',
        scale=alt.Scale(scheme='redblue')
    )
)
# versus_df = versus_df.pivot(index='attack_type', columns='defend_type', values='effect')
(effect_plot+text).properties(width=600,height=600)


You can read this chart as how well an attack will do against a certain type. First we find the row of the attack we want to perform, for example _Flamethrower_ from a Charizard. Then we find the column of the defending Pokemon to see what the multiplier is; in our example Venusaur is a grass type. So Charizard's attack will deal double damage to Venusaur compared with a neutral matchup.

It is important to know, though, that a Pokemon's defense type is static (a Venusaur will always be grass), while Pokemon can learn attacks of many different types. That is, Charizard can learn normal-type attacks such as _Tackle_, and when used against Venusaur these have a unit multiplier; they do not deal double damage. So for our analysis we aren't necessarily considering literal Pokemon matchups, but rather attack types versus defending Pokemon types.

One natural way of assessing how effective an attack type would be over the long run is to take the means of the attack multipliers and defense multipliers. The idea is that across all battles and all Pokemon, the multiplier would average out to the mean. This makes the large assumption that you're equally likely to run into any Pokemon type, which definitely isn't the case, but it's not a bad way to get a feel for things.

We can take the mean across the columns of each row to get an effective attack multiplier, and the mean across the rows of each column to get an effective defense multiplier. For attack multipliers, higher is better since we want to do more damage, but for defense multipliers lower is better because we want our Pokemon to receive less damage. We can plot these against each other on a scatter plot to see which types have optimal values. We can also take the ratio of attack multiplier to defense multiplier, which gives a single metric where higher is always better.

In [125]:
attack_means = np.mean(versus_df.values, axis=1)
defense_means = np.mean(versus_df.values, axis=0)
chart_df = pd.DataFrame(index=versus_df.index, data={'attack_means': attack_means, 'defense_means': defense_means})
chart_df.index.name = 'type'

base = alt.Chart(chart_df.reset_index()).encode(
    x = alt.X('defense_means:Q', title='Defense Multiplier (lower is better)', scale=alt.Scale(domain=(0.8, 1.2))),
    y = alt.Y('attack_means:Q', title='Attack Multiplier (higher is better)', scale=alt.Scale(domain=(0.8, 1.2))),
    color = alt.Color('type:N', scale=alt.Scale(scheme='category20'), title='Attack/Pokemon Type')
).properties(
    width = 600,
    height=600
).properties(title='Single Pokemon Type Multipliers')

text = base.mark_text(
    align='left',
    baseline='middle',
    dx=5,
    dy=3
).encode(
    text='type:N'
)

line_df = chart_df.loc[['steel', 'fairy', 'fire', 'ground']]
# print(line_df.reset_index())
pfront = alt.Chart(line_df.reset_index()).mark_line(color='black', strokeDash=[5]).encode(
    x='defense_means:Q',
    y='attack_means:Q'
)

chart = base.mark_circle()

bars = alt.Chart(chart_df.reset_index()).transform_calculate(
    ratio = 'datum.attack_means/datum.defense_means'
).mark_bar().encode(
    x=alt.X('type:N', title='Pokemon type', sort='-y'),
    y=alt.Y('ratio:Q'),
    color=alt.Color('type:N', legend=None)
).properties(width=600,height=600,title='Attack/Defense Multiplier Ratio (higher is better)')

((chart + text + pfront)|bars)
# chart_df.reset_index().columns

We see here that the top left corner is optimal, and that steel, fairy, fire and ground form a Pareto front for maximizing damage dealt while minimizing damage received. We also plot the ratio of the two multipliers and see that these types indeed have the highest ratios.
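The front in the chart is drawn from a hand-picked list of types, but it can also be found programmatically. Below is a minimal sketch: the `pareto_front` helper is my own, and the five points are hypothetical values rather than numbers from the matchup table. A point is dropped when some other point is at least as good on both axes and strictly better on one.

```python
import pandas as pd

def pareto_front(df, minimize, maximize):
    """Return the rows not dominated by any other row, sorted along the front."""
    keep = []
    for idx, row in df.iterrows():
        dominated = (
            (df[minimize] <= row[minimize])
            & (df[maximize] >= row[maximize])
            & ((df[minimize] < row[minimize]) | (df[maximize] > row[maximize]))
        ).any()
        if not dominated:
            keep.append(idx)
    return df.loc[keep].sort_values(minimize)

# Hypothetical mean multipliers for five made-up types.
toy = pd.DataFrame(
    {'defense_means': [0.85, 0.95, 1.00, 1.10, 1.15],
     'attack_means':  [1.05, 1.10, 1.00, 1.15, 1.12]},
    index=['type_a', 'type_b', 'type_c', 'type_d', 'type_e'],
)

front = pareto_front(toy, minimize='defense_means', maximize='attack_means')
print(front)
```

Applied to `chart_df`, the same call would give the rows to pass to the dashed line layer instead of the fixed list.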

This only covers single-type matchups, though, and as we saw many Pokemon have dual types. For a dual-type defender the multipliers combine multiplicatively: if I used a fire-type attack on a grass/water Pokemon (which doesn't exist, but bear with me for this example), the 2 multiplier and the .5 multiplier would cancel out to a unit multiplier.
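Here is that arithmetic as a small sketch. The 3x3 table is a hand-typed fire/water/grass slice of the type chart, not the notebook's `versus_df`, and the helper name is mine. It also compares two ways of averaging a dual-type defender's multipliers: multiplying per attack and then averaging, versus multiplying the two single-type means. The two do not agree in general, so the product of means is best read as an approximation.

```python
import numpy as np
import pandas as pd

# Rows are attack types, columns are defending types (a 3-type slice of the chart).
types = ['fire', 'water', 'grass']
chart = pd.DataFrame(
    [[0.5, 0.5, 2.0],    # fire attacking fire, water, grass
     [2.0, 0.5, 0.5],    # water attacking fire, water, grass
     [0.5, 2.0, 0.5]],   # grass attacking fire, water, grass
    index=types, columns=types,
)

def dual_multiplier(attack, defend_1, defend_2):
    """Multiplier for one attack type against a dual-type defender."""
    return chart.loc[attack, defend_1] * chart.loc[attack, defend_2]

fire_vs_grass_water = dual_multiplier('fire', 'grass', 'water')   # 2.0 * 0.5

# Mean defense multiplier of a grass/water defender over all attack types,
# compared with the product of the two single-type means.
exact_mean = (chart['grass'] * chart['water']).mean()
product_of_means = chart['grass'].mean() * chart['water'].mean()
```

In this toy slice the exact mean comes out lower than the product of means, because the attack that grass resists is not the same one that water resists.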

We can plot these dual type values and ratios as heat maps to see if any trends emerge.

In [126]:
dual_attack_means = pd.DataFrame(columns=versus_df.columns, index=versus_df.index, data=np.outer(attack_means, attack_means))
np.fill_diagonal(dual_attack_means.values, attack_means)
dual_defense_means = pd.DataFrame(columns=versus_df.columns, index=versus_df.index, data=np.outer(defense_means, defense_means))
np.fill_diagonal(dual_defense_means.values, defense_means)
ratio_means = dual_attack_means/dual_defense_means

attack = alt.Chart(altair_conv(dual_attack_means)).mark_rect().encode(
    x=alt.X('defend_type:N', title='Defend Type'),
    y=alt.Y('attack_type:N', title='Attack Type'),
    color=alt.Color('effect:Q', scale=alt.Scale(scheme='redblue', domain=(.5, 1.5)))
).properties(title='Dual Type Attack Multipliers (higher is better)')

defend = alt.Chart(altair_conv(dual_defense_means)).mark_rect().encode(
    x=alt.X('defend_type:N', title='Defend Type'),
    y=alt.Y('attack_type:N', title='Attack Type'),
    color=alt.Color('effect:Q', scale=alt.Scale(scheme='redblue'))
).properties(title='Dual Type Defense Multipliers (lower is better)')

ratio = alt.Chart(altair_conv(ratio_means)).mark_rect().encode(
    x=alt.X('defend_type:N', title='Defend Type'),
    y=alt.Y('attack_type:N', title='Attack Type'),
    color=alt.Color('effect:Q', scale=alt.Scale(scheme='redblue'), title='Multiplier Effect')
).properties(title='Attack/Defense Multiplier Ratio (higher is better)')

ratio | attack | defend
# corr

Looking at the ratio heatmap, we see that steel performs very well in terms of multipliers. Grass type Pokemon seem to perform very poorly and, as before, bug types also perform poorly. From these plots it would seem that steel types can expect favorable results over the long term. As with single type matchups, steel, fairy, fire and ground all perform favorably, and flying and fighting types now join them.

#### Type Matchup Strength Analysis Conclusions

Rather than look at individual Pokemon stats, we used an independent portion of the data to see whether any types would stand out further. Single and dual type matchups indicated that steel, ground, fairy and fire matchups are favorable for the attacker, and that flying and fighting types also perform well in dual type matchups.

## Final Conclusions

In an effort to be the very best, like no one ever was, we attempted to assess how to find the best Pokemon. First we standardized the Pokemon stats (HP, Attack, Defense, Special Attack, Special Defense and Speed) and plotted them against each other to see how they correlate. Most of the stats were positively correlated, which led us to the conclusion that "stronger Pokemon are stronger": a higher HP tends to come with a higher Attack, and so on.

We then combined these standardized statistics into a z-score. We used the z-score to rank Pokemon in terms of absolute standing, and also to find the best types of Pokemon to capture. This led us to two conclusions: 1. dragon type Pokemon tend to be more favorable across the board, and 2. it's still useful to find the strongest overall Pokemon (e.g. Slaking), since a type's z-score gets "watered down" by the other Pokemon in its type class.
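
As a reminder of what that combination can look like, here is a minimal sketch on three rows of the dataset. It assumes the combined score is the plain average of the six per-stat z-scores; the earlier cells of this notebook may weight or sum them differently, so treat the `z_score` column here as illustrative.

```python
import pandas as pd

stats = ['hp', 'attack', 'defense', 'sp_attack', 'sp_defense', 'speed']

# Three rows from the dataset, enough to show the mechanics.
toy = pd.DataFrame(
    [[45, 49, 49, 65, 65, 45],
     [80, 82, 83, 100, 100, 80],
     [39, 52, 43, 60, 50, 65]],
    columns=stats,
    index=['Bulbasaur', 'Venusaur', 'Charmander'],
)

# Standardize each stat: subtract the column mean, divide by the column std.
z = (toy[stats] - toy[stats].mean()) / toy[stats].std(ddof=0)

# Assumption: combine the six standardized stats by averaging them.
toy['z_score'] = z.mean(axis=1)
ranking = toy['z_score'].sort_values(ascending=False)
print(ranking)
```

With only three rows the means and standard deviations are not meaningful; on the full dataframe the same lines give every Pokemon a score relative to the whole Pokedex.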

We then used PCA to perform a similar analysis. The first principal component was a good metric for how "legendary" a Pokemon was. There was a strong overlap between the PCA ranking and the z-score ranking, which validates our z-score method and confirms our intuition that the more "legendary" a Pokemon is, the stronger it is.
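
To illustrate why a first principal component can act as an overall strength score, here is a sketch on synthetic data. The notebook uses scikit-learn's `PCA`; this sketch instead does the decomposition by hand with NumPy's SVD so that it has no extra dependencies. The data, the hidden `strength` factor and the noise level are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: one hidden "overall strength" factor drives six stats, plus noise.
strength = rng.normal(size=200)
X = strength[:, None] * np.ones(6) + 0.3 * rng.normal(size=(200, 6))

# Standardize the columns, then take the SVD of the standardized data (PCA by hand).
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)

explained = S**2 / np.sum(S**2)   # fraction of variance captured by each component
pc1 = Xs @ Vt[0]                  # PC1 score for every row
corr = abs(np.corrcoef(pc1, strength)[0, 1])

print(explained.round(3))
print(round(corr, 3))
```

When all stats rise and fall together, PC1 absorbs most of the variance and tracks the hidden factor closely, which is the same pattern the positive stat correlations suggested for the real data. The sign of a principal component is arbitrary, hence the absolute value on the correlation.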

Finally, we used a different part of the dataset to assess how effective attacks of each type are against other types, rather than looking at individual Pokemon. This indicated that steel types, above all others, are preferable in terms of dealing the most damage and receiving the least damage, while dragon, fairy, fire, ground, flying and fighting types are all strong as well.

This analysis wouldn't be complete without determining which Pokemon would make a strong roster, though. To do this we'll build a rough metric by multiplying PC1 by the attack/defense multiplier ratio for the Pokemon's type pair and ranking on the result.
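
Before running it on the real data, here is the metric on a toy example. The names, types, PC1 scores and ratios below are all made up; `toy_stats` and `toy_ratio` stand in for `pca_stats` and `ratio_means`.

```python
import pandas as pd

# Made-up PC1 scores and type-pair ratios, for illustration only.
toy_stats = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'type_1': ['x', 'y', 'x'],
    'type_2': ['y', 'y', 'x'],
    'PC1': [3.0, 2.5, -1.0],
})
toy_ratio = pd.DataFrame(
    [[1.2, 0.9],
     [0.9, 0.8]],
    index=['x', 'y'], columns=['x', 'y'],
)

# Look up the ratio for each (type_1, type_2) pair, then scale PC1 by it.
ratios = [toy_ratio.at[t1, t2] for t1, t2 in zip(toy_stats['type_1'], toy_stats['type_2'])]
toy_stats['final_strength'] = toy_stats['PC1'] * ratios

ranked = toy_stats.sort_values('final_strength', ascending=False)
print(ranked[['name', 'final_strength']])
```

One caveat of a plain product: when PC1 is negative, as for `C` here, a favorable ratio pushes the score further down rather than up. That should not matter for the top of the ranking, where PC1 is expected to be positive, but it is why this is only a rough metric.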

In [127]:
# DataFrame.lookup was removed in pandas 2.0; .at does the same single-cell lookup.
arr = []
for idx, row in pca_stats.iterrows():
    arr.append(row['PC1'] * ratio_means.at[row['type_1'], row['type_2']])

pca_stats['final_strength'] = arr
leg_roster = pca_stats.sort_values('final_strength', ascending=False).head(6)[['name', 'type_1', 'type_2']]
leg_roster.columns.name = 'legendary'
leg_roster.index.name = 'Pokedex Number'
leg_roster = leg_roster.reset_index()
leg_roster.index += 1
norm_roster = pca_stats[pca_stats['status']=='Normal'].sort_values('final_strength', ascending=False).head(6)[['name', 'type_1', 'type_2']]
norm_roster.columns.name = 'normal'
norm_roster.index.name = 'Pokedex Number'
norm_roster = norm_roster.reset_index()
norm_roster.index += 1

print('Legendary Roster\n',leg_roster[['name', 'type_1', 'type_2']])
print('\n\n')
print('Normal Roster\n', norm_roster[['name', 'type_1', 'type_2']])

Legendary Roster
 legendary                      name    type_1  type_2
1              Zacian Crowned Sword     fairy   steel
2          Zamazenta Crowned Shield  fighting   steel
3                             Ho-oh      fire  flying
4                            Dialga     steel  dragon
5                          Solgaleo   psychic   steel
6                           Xerneas     fairy   fairy



Normal Roster
 normal       name  type_1   type_2
1         Slaking  normal   normal
2       Metagross   steel  psychic
3        Garchomp  dragon   ground
4       Tyranitar    rock     dark
5       Dragonite  dragon   flying
6          Goodra  dragon   dragon


So there we have it. Of course the ideal would be to use strictly legendary Pokemon to have the most elite roster, but for us mere mortals there is at least a roster of very strong Pokemon to use as well. In the end, though, there are many factors that go into selecting a Pokemon team, and often it can be tailored to who you are facing. If you know you are going up against something heavy in Fire/Fighting/Psychic types, we can redo the calculations to determine what the ideal roster would look like. But if you want the ultimate team that can go up against just about anybody, we've determined that dragon types are dominant, as are steel and psychic. I hope this helps in your adventure to become the greatest Pokemon trainer the world has ever known!

## References

1. https://stackoverflow.com/questions/53484548/format-text-of-mark-text-in-altair
1. https://stackoverflow.com/questions/60237871/altair-access-rsquared-value-in-a-regression
1. https://inmachineswetrust.com/posts/exploring-pokemon-dataset/
1. https://www.kaggle.com/strakul5/principal-component-analysis-of-pokemon-data
1. https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e
1. https://altair-viz.github.io/gallery/percentage_of_total.html


In [None]:
# alt.Chart(strength_breakdown).transform_density(
#     'hp',
#     groupby=['status'],
#     as_=['hp', 'density'],
#     counts=False
# ).mark_area(
#     opacity=.5
# ).encode(
#     x = 'hp:Q',
#     y = 'density:Q',
#     color=alt.Color('status:N')
# ).repeat(
#     column=stats
# )

# alt.hconcat(*(
#   alt.Chart(strength_breakdown).mark_area(
#       opacity=.5
#   ).encode(
#       alt.X(col, type='quantitative'),
#       alt.Y('density', type='quantitative'),
#       alt.Color('status:N')
#   ).transform_density(
#       col,
#       groupby=['status'],
#       as_=[col, 'density'],
#   ) 
#   for col in stats
# ))

alt.hconcat(*(
  alt.Chart(strength_breakdown).mark_bar(
      opacity=.5
  ).encode(
      alt.X(col, type='quantitative'),
      alt.Y('count()', type='quantitative', stack=None),
      alt.Color('status:N')
  )
  for col in stats
))

## Health Related Plots

In [None]:
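# BMI = weight [kg] / height [m]^2
df['bmi'] = df['weight_kg'] / df['height_m'] ** 2

# Compare case-insensitively, since type strings may have been lower-cased during sanitization.
psychic_bmi_plot = alt.Chart(df[df['type_1'].str.lower() == 'psychic']).mark_circle().encode(
    x=alt.X('name:O', sort='y', title='Pokemon'),
    y=alt.Y('bmi:Q', scale=alt.Scale(type='log'), title='BMI [kg/m^2]')
)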

psychic_bmi_plot

In [67]:
df.columns

Index(['Unnamed: 0', 'pnum', 'name', 'german_name', 'japanese_name',
       'generation', 'status', 'species', 'type_number', 'type_1', 'type_2',
       'height_m', 'weight_kg', 'abilities_number', 'ability_1', 'ability_2',
       'ability_hidden', 'total_points', 'hp', 'attack', 'defense',
       'sp_attack', 'sp_defense', 'speed', 'catch_rate', 'base_friendship',
       'base_experience', 'growth_rate', 'egg_type_number', 'egg_type_1',
       'egg_type_2', 'percentage_male', 'egg_cycles', 'against_normal',
       'against_fire', 'against_water', 'against_electric', 'against_grass',
       'against_ice', 'against_fight', 'against_poison', 'against_ground',
       'against_flying', 'against_psychic', 'against_bug', 'against_rock',
       'against_ghost', 'against_dragon', 'against_dark', 'against_steel',
       'against_fairy'],
      dtype='object')

In [79]:
# base = alt.Chart(df[df['name']!='Cosmoem']).mark_boxplot().encode(
stat_list =  ['height_m', 'weight_kg']
health_stats = pd.merge(
    df,
    pokedex,
    on='pnum'
).loc[:, ['pnum', *stat_list]]

health_stats = pd.merge(
    pokedex,
    health_stats,
    on='pnum'
)
health_stats['bmi'] = health_stats['weight_kg'] / health_stats['height_m'] / health_stats['height_m']

health_stats
alt.Chart(health_stats[health_stats['name']!='Cosmoem']).mark_bar().encode(
    x=alt.X('bmi:Q',scale=alt.Scale(type='linear')),
    y='count():Q'
)

# base = alt.Chart(df).mark_boxplot().encode(
#     y=alt.Y('bmi:Q', scale=alt.Scale(type='log'), title='BMI [kg/m^2]', axis=alt.Axis(tickCount=5)),
#     x=alt.X('type_1:N', title='Pokemon Type'),
#     color=alt.Color('type_1:N', title='Pokemon Type')
# ).properties(
#     width=600,
#     height=600,
#     title='Pokemon BMI by Type'
# )

# base

### Type Analysis

In [None]:
alt.Chart(df).transform_joinaggregate(
    total='count(type_1):Q',
).transform_calculate(
    pct='1 / datum.total'
).mark_bar().encode(
    x=alt.X('type_1:N', title='Pokemon Type', sort='y'),
    y=alt.Y('sum(pct):Q', axis=alt.Axis(format='%'), title='Percent of Pokemon'),
    color=alt.Color('type_1:N', title='Pokemon Type')
).properties(
  title='Pokemon Type Breakdown',
  height=600,
  width=600
)

## Height/Weight Distribution by Type

In [None]:
base = alt.Chart(df).mark_boxplot().encode(
    y=alt.Y('height_m:Q', scale=alt.Scale(type='log')),
    color='type_1:N'
).properties(width=400)


hd = base.encode(
    # x=alt.X('type_1:N', sort=alt.EncodingSortField('height_m:Q', op='median'))
    x=alt.X('type_1:N')
).properties(
    title='By Median height'
)

hd


In [None]:
weight_dist = alt.Chart(df).transform_density(
    'weight_kg',
    as_=['weight_kg','density'],
    groupby=['type_1']
).mark_area(orient='horizontal').encode(
    y=alt.Y('weight_kg:Q', scale=alt.Scale(type='log')),
    color='type_1:N',
    x=alt.X(
        'density:Q',
        stack='center',
        impute=None,
        title=None,
        axis=alt.Axis(labels=False, values=[0], grid=False, ticks=True),
    ),
    column=alt.Column(
        'type_1:N',
        header=alt.Header(
            titleOrient='bottom',
            labelOrient='bottom',
            labelPadding=0,
        ),
    )
).properties(
    width=100
).configure_facet(
    spacing=0
).configure_view(
    stroke=None
)

weight_dist

In [None]:
df.head()

pokedex_number,name,german_name,japanese_name,generation,is_sub_legendary,is_legendary,is_mythical,species,type_number,type_1,type_2,height_m,weight_kg,abilities_number,ability_1,ability_2,ability_hidden,total_points,hp,attack,defense,sp_attack,sp_defense,speed,catch_rate,base_friendship,base_experience,growth_rate,egg_type_number,egg_type_1,egg_type_2,percentage_male,egg_cycles,against_normal,against_fire,against_water,against_electric,against_grass,against_ice,against_fight,against_poison,against_ground,against_flying,against_psychic,against_bug,against_rock,against_ghost,against_dragon,against_dark,against_steel,against_fairy
1,Bulbasaur,Bisasam,フシギダネ (Fushigidane),1,0,0,0,Seed Pokémon,2,Grass,Poison,0.7,6.9,2,Overgrow,,Chlorophyll,318.0,45.0,49.0,49.0,65.0,65.0,45.0,45.0,70.0,64.0,Medium Slow,2,Grass,Monster,87.5,20.0,1.0,2.0,0.5,0.5,0.25,2.0,0.5,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.5
2,Ivysaur,Bisaknosp,フシギソウ (Fushigisou),1,0,0,0,Seed Pokémon,2,Grass,Poison,1.0,13.0,2,Overgrow,,Chlorophyll,405.0,60.0,62.0,63.0,80.0,80.0,60.0,45.0,70.0,142.0,Medium Slow,2,Grass,Monster,87.5,20.0,1.0,2.0,0.5,0.5,0.25,2.0,0.5,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.5
3,Venusaur,Bisaflor,フシギバナ (Fushigibana),1,0,0,0,Seed Pokémon,2,Grass,Poison,2.0,100.0,2,Overgrow,,Chlorophyll,525.0,80.0,82.0,83.0,100.0,100.0,80.0,45.0,70.0,236.0,Medium Slow,2,Grass,Monster,87.5,20.0,1.0,2.0,0.5,0.5,0.25,2.0,0.5,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.5
3,Mega Venusaur,Bisaflor,フシギバナ (Fushigibana),1,0,0,0,Seed Pokémon,2,Grass,Poison,2.4,155.5,1,Thick Fat,,,625.0,80.0,100.0,123.0,122.0,120.0,80.0,45.0,70.0,281.0,Medium Slow,2,Grass,Monster,87.5,20.0,1.0,1.0,0.5,0.5,0.25,1.0,0.5,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.5
4,Charmander,Glumanda,ヒトカゲ (Hitokage),1,0,0,0,Lizard Pokémon,1,Fire,,0.6,8.5,2,Blaze,,Solar Power,309.0,39.0,52.0,43.0,60.0,50.0,65.0,45.0,70.0,62.0,Medium Slow,2,Dragon,Monster,87.5,20.0,1.0,0.5,2.0,1.0,0.5,0.5,1.0,1.0,2.0,1.0,1.0,0.5,2.0,1.0,1.0,1.0,0.5,0.5
