<a href="https://www.kaggle.com/code/luhaowangsg/exploratory-data-analysis-on-pokemon?scriptVersionId=143954817" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #data visualisation
import seaborn as sns #data visualisation

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Table of Contents
1. [Introduction](#1)
2. [Loading and previewing the dataframe](#2)
3. [Data Cleaning and Pre-processing](#3)
4. [Data Exploration](#4)
5. [Conclusion](#5)

<a id="1"></a> <br>
> ## 1: Introduction

Pokemon has always been an important part of my childhood. I still remember the first day my parents bought me a Gameboy Colour and Pokemon Yellow as a birthday present.\
Like a child with a new toy, literally and figuratively, I couldn't wait to put the cartridge in and load up the game.

I can still vividly remember the sound of the Gameboy logo floating across the screen whilst booting up.

![Alt text](https://media.giphy.com/media/f9p9GclvDGqcM/giphy.gif)

And then being greeted by the iconic electric mouse, Pikachu, before taking my first step into Pallet Town and choosing my own starting pokemon.

![Alt text](https://media.giphy.com/media/MKXkDWWDj7D3i/giphy.gif)

Many years down the road, Pokemon has evolved from just a collecting simulator **(*Gotta catch'em all!*)** to having a competitive scene where stats matter.

In this exploratory data analysis, I will attempt to address questions such as:

**1. Who is the strongest pokemon?\
2. Which generation has the strongest pokemon?\
3. Who is the best starter to pick for each generation?**

and other questions or insights that appear along the way.

### Just a disclaimer...
As someone who has dabbled in the competitive scene of Pokemon, I understand that whether a pokemon is considered good or bad goes beyond flat stats; it is also dependent on other hidden values such as Individual Values (IV), Effort Values (EV) and the personality traits of said Pokemon.

That being said, this analysis is purely from the perspective of an everyday player who just wants to form the strongest team of 6 pokemon without going through the long and tedious route of breeding and cultivating Pokemon with perfect traits and hidden values.

With that out of the way, let's begin!

![Alt text](https://media.giphy.com/media/uWLJEGCSWdmvK/giphy.gif)

<a id="2"></a> <br>
> ## 2: Loading and previewing the dataframe

To explore the dataset, I loaded it into a dataframe. Using the .info() method, I can see that the dataset contains a total of 12 columns with no null values. *(Credits to Josh Korngibel for such a clean dataset!)*

1. **Dex No:** The Pokemon's registered number in the National Pokedex, which is like a repository for all Pokemon
2. **Name:** Name of said Pokemon
3. **Base Name:** Base name refers to the basic name of the Pokemon when a Pokemon has alternate forms such as Megas or region specific species.
4. **Type 1:** The primary property of said Pokemon
5. **Type 2:** The secondary property of said Pokemon
6. **BST:** Summation of all base stats of said Pokemon
7. **HP:** HP or Hit Point determines how much damage a Pokemon can take before it faints
8. **Attack:** Attack determines the amount of damage dealt by a Pokemon using a physical move<sup>*</sup> 
9. **Defense:** Defense determines how much damage a Pokemon can resist when hit by a physical move
10. **Sp. Attack:** Sp. Attack or Special Attack determines the amount of damage dealt by a Pokemon using a special move<sup>**</sup>
11. **Sp. Defense:** Sp. Defense or Special Defense determines how much damage a Pokemon can resist when hit by a special move
12. **Speed:** Speed determines which Pokemon will get to act first during battle

<sup>*</sup>Physical moves refer to Pokemon moves that have physical contact with the opposing Pokemon, such as punches, kicks or tackles.\
<sup>**</sup>Special moves refer to Pokemon moves that do not have physical contact with the opposing Pokemon, such as beams or psychic attacks.

While there are no null values, the dataset needs some processing before it can answer some of the questions listed in the introduction, in particular those that compare Pokemon across generations. As such, a column indicating the generation in which each Pokemon was introduced needs to be added. Furthermore, as I will also be exploring the strength of Pokemon, it would not be fair to compare a common Pokemon with a legendary or mythical Pokemon, so I also need another column to indicate whether a Pokemon is a legendary or not.

I have also noticed that for Pokemon such as Charmander, which does not have a Type 2 property, a '-' was used.\
This is not a problem in itself, but a '-' in the legends would not be very informative once I start doing visualisations, so I want to replace it with a clearer indicator such as N/A for Pokemon without a Type 2. I did explore the thought of copying the Type 1 attribute into Type 2 for single-type Pokemon, but from a gameplay perspective it does not make sense because of how weaknesses work.

For example, a WATER pokemon is weak to ELECTRIC moves, so it takes 2x damage from them. Similarly, a FLYING pokemon is also weak to ELECTRIC moves and takes 2x damage from them. Therefore, a dual-type Pokemon with WATER and FLYING as its Type 1 and Type 2 respectively, such as Gyarados, takes 4x damage from ELECTRIC moves. So if I were to make both Charmander's Type 1 and Type 2 FIRE, then technically Charmander would take 4x damage from WATER moves, which is not the case. As such, repeating Type 1 in the Type 2 column does not really make sense, so I opted to go with N/A.
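
The multiplication argument above can be sketched in a few lines of plain Python. This is only an illustration: the `EFFECTIVENESS` dictionary and the `damage_multiplier` helper are hypothetical names that are not part of the dataset, and the chart lists just the three matchups mentioned in the text, with ELECTRIC standing for lightning moves.

```python
# Partial, hand-written effectiveness chart: (attacking type, defending type) -> multiplier.
# Only the pairs needed for this example are listed; anything missing is treated as neutral (1x).
EFFECTIVENESS = {
    ('ELECTRIC', 'WATER'): 2.0,
    ('ELECTRIC', 'FLYING'): 2.0,
    ('WATER', 'FIRE'): 2.0,
}

def damage_multiplier(attack_type, type_1, type_2='N/A'):
    """Multiply the effectiveness against each defending type; 'N/A' means no second type."""
    multiplier = EFFECTIVENESS.get((attack_type, type_1), 1.0)
    if type_2 != 'N/A':
        multiplier *= EFFECTIVENESS.get((attack_type, type_2), 1.0)
    return multiplier

gyarados = damage_multiplier('ELECTRIC', 'WATER', 'FLYING')         # dual type
charmander = damage_multiplier('WATER', 'FIRE')                     # single type, Type 2 is 'N/A'
charmander_duplicated = damage_multiplier('WATER', 'FIRE', 'FIRE')  # what copying Type 1 would imply
print(gyarados, charmander, charmander_duplicated)
```

With Type 2 left as 'N/A', Charmander keeps the correct 2x multiplier, while the duplicated version wrongly doubles it again.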

To sum up, these are the following tasks that will be done in the next step of Data Cleaning and Data Pre-Processing

**Task 1: Create a column called 'Generation' to indicate when the Pokemon was introduced.\
Task 2: Create a column called 'Legendary' to differentiate between common and legendary Pokemon.\
Task 3: To replace the '-' with 'N/A' in the Type 2 column for better clarity.**

In [None]:
#load dataset into dataframe
df = pd.read_csv("/kaggle/input/pokemon-stats-gens-1-9/pokemon.csv")
#backing up the dataframe (.copy() makes an independent copy; plain assignment would only alias df)
bckup_df = df.copy()
#preview dataframe
df.info()

In [None]:
df.head()

In [None]:
df.tail()

<a id="3"></a> <br>
> ## 3: Data Cleaning and Pre-processing

### 3.1 Task 1
#### Create a column called 'Generation' to indicate when the Pokemon was introduced

For Task 1, I started by creating an empty column titled 'Generation'. Based on the dex number ranges obtained from [Bulbapedia](https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number), I used a combination of .loc and .between() to select the Pokemon within each generation's range of dex numbers and label them accordingly, from Gen 1 through Gen 9.

Gen 1 indicates that these Pokemon were the first to be introduced in the series, in Pokemon Red, Blue and Yellow; Gen 2 indicates that these Pokemon were introduced later in Pokemon Gold, Silver and Crystal, and so on.
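
As an aside, the same range-based labelling could also be sketched with `pd.cut`, which bins a numeric column in one call. This is an alternative I am showing on a toy dataframe, not the approach used in the cell that follows; `toy_df`, `upper_bounds` and `labels` are illustrative names, and the upper bounds are the same dex ranges used in this notebook.

```python
import pandas as pd

# Toy frame standing in for the real dataset; only the 'Dex No' column matters here.
toy_df = pd.DataFrame({'Dex No': [1, 151, 152, 386, 387, 906, 1010]})

# Last dex number of each generation, matching the ranges used in this notebook.
upper_bounds = [151, 251, 386, 493, 649, 721, 809, 905, 1010]
labels = [f'Gen {i}' for i in range(1, len(upper_bounds) + 1)]

# Bins are right-inclusive by default: (0, 151], (151, 251], and so on.
toy_df['Generation'] = pd.cut(toy_df['Dex No'], bins=[0] + upper_bounds, labels=labels).astype(str)
print(toy_df)
```

The explicit .loc/.between() version is longer but arguably easier to read line by line, which is a fair reason to keep it.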

In [None]:
#start with an empty object column so that the string labels can be assigned without a dtype clash
df['Generation'] = None
df.head()

In [None]:
df.loc[df['Dex No'].between(1, 151), 'Generation'] = 'Gen 1'
df.loc[df['Dex No'].between(152, 251), 'Generation'] = 'Gen 2'
df.loc[df['Dex No'].between(252, 386), 'Generation'] = 'Gen 3'
df.loc[df['Dex No'].between(387, 493), 'Generation'] = 'Gen 4'
df.loc[df['Dex No'].between(494, 649), 'Generation'] = 'Gen 5'
df.loc[df['Dex No'].between(650, 721), 'Generation'] = 'Gen 6'
df.loc[df['Dex No'].between(722, 809), 'Generation'] = 'Gen 7'
df.loc[df['Dex No'].between(810, 905), 'Generation'] = 'Gen 8'
df.loc[df['Dex No'].between(906, 1010), 'Generation'] = 'Gen 9'
df

### 3.2 Task 2
#### Create a column called 'Legendary' to differentiate between common and legendary Pokemon.

Similar to Task 1, I started by creating a column titled 'Legendary' and filling it with the boolean value False, as legendaries only comprise a small percentage of the 1000+ Pokemon listed in the National Dex.

Using the list of legendaries provided by [Serebii.net](https://www.serebii.net/pokemon/legendary.shtml#mythical), I took their dex numbers and extracted them into a separate dataframe called legend_df for validation.\
In the list provided by Serebii.net, legendary pokemon are further categorised as sub-legendaries, legendaries and mythicals. These categories are used in the competitive scene in Pokemon, which is not the direction for this EDA, so for simplicity's sake, I will keep them all under a single umbrella called 'Legendary'.

The list provided by Serebii.net contains a total of 99 legendaries *(51 sub-legendaries, 26 legendaries and 22 mythicals, going by the dex numbers used in the code below)*, which excludes alternate forms such as those of Articuno, which has a base form from Gen 1 and a Galarian form from Gen 8. As such, .drop_duplicates() was used to keep only the base form of each legendary, and .info() shows a total of 99 entries, which matches the list provided by Serebii.net.

The dex numbers were then passed to the original dataframe, and the value in the 'Legendary' column for the associated dex numbers was changed to the boolean value True.

A quick validation was done by looking up dex number 151 for the Pokemon Mew, which was the iconic legendary introduced in the first Pokemon movie.
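
The flag-and-validate idea described above can be sketched on a toy dataframe as follows. The rows and the shortened `legendary_dex` list are made up for illustration (the real list has 99 dex numbers), and `toy_df` and `flagged` are hypothetical names.

```python
import pandas as pd

# Toy data: Articuno appears twice (base form and a regional form), as in the real dataset.
toy_df = pd.DataFrame({
    'Dex No':    [144, 144, 150, 151, 25],
    'Name':      ['Articuno', 'Galarian Articuno', 'Mewtwo', 'Mew', 'Pikachu'],
    'Base Name': ['Articuno', 'Articuno', 'Mewtwo', 'Mew', 'Pikachu'],
})
legendary_dex = [144, 150, 151]  # shortened, illustrative list

# Flag every row whose dex number is in the list (alternate forms included).
toy_df['Legendary'] = toy_df['Dex No'].isin(legendary_dex)

# Validation: the number of distinct flagged dex numbers should equal the length of the list.
flagged = toy_df.loc[toy_df['Legendary'], 'Dex No'].nunique()
assert flagged == len(set(legendary_dex))
print(toy_df)
```

Counting distinct dex numbers rather than rows is what keeps alternate forms from inflating the total.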

In [None]:
#create the 'Legendary' column and default every Pokemon to False
df['Legendary'] = False
df.head()

In [None]:
legend_df = df.loc[df['Dex No'].isin([144, 145, 146, 243, 244, 245, 377, 378, 379, 380, 381, 480, 481, 482, 485, 486, 488, 638, 639, 640, 641, 642, 645, 772, 773, 785, 786, 787, 788, 793, 794, 795, 796, 797, 798, 799, 803,
                                      804, 805, 806, 891, 892, 894, 895, 896, 897, 905, 1001, 1002, 1003, 1004, 150, 249, 250, 382, 383, 384, 483, 484, 487, 643, 644, 646, 716, 717, 718, 789, 790, 791, 792, 800, 888, 889,
                                      890, 898, 1007, 1008, 151, 251, 385, 386, 489, 490, 491, 492, 493, 494, 647, 648, 649, 719, 720, 721, 801, 802, 807, 808, 809, 893])]
legend_df.head(10)

In [None]:
legend_df = legend_df['Base Name'].drop_duplicates()
legend_df.info()

In [None]:
df.loc[df['Dex No'].isin([144, 145, 146, 243, 244, 245, 377, 378, 379, 380, 381, 480, 481, 482, 485, 486, 488, 638, 639, 640, 641, 642, 645, 772, 773, 785, 786, 787, 788, 793, 794, 795, 796, 797, 798, 799, 803,
                                      804, 805, 806, 891, 892, 894, 895, 896, 897, 905, 1001, 1002, 1003, 1004, 150, 249, 250, 382, 383, 384, 483, 484, 487, 643, 644, 646, 716, 717, 718, 789, 790, 791, 792, 800, 888, 889,
                                      890, 898, 1007, 1008, 151, 251, 385, 386, 489, 490, 491, 492, 493, 494, 647, 648, 649, 719, 720, 721, 801, 802, 807, 808, 809, 893]), 'Legendary'] =  True
df.loc[df['Dex No'] == 151]

### 3.3 Task 3
#### To replace the '-' with 'N/A' in the Type 2 column for better clarity.

For Task 3, the replacement of the value '-' was simply done by looking for all hyphens in the column 'Type 2' and replacing it with 'N/A'.
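
One practical reason to use a literal 'N/A' label instead of a true missing value is that `value_counts()` drops missing values by default, so single-type Pokemon would disappear from the counts used for the charts. A minimal sketch on a made-up Series (`type_2`, `as_label` and `as_missing` are illustrative names):

```python
import numpy as np
import pandas as pd

type_2 = pd.Series(['-', 'FLYING', '-', 'POISON'])

as_label = type_2.replace('-', 'N/A')      # keeps single-type Pokemon as a visible category
as_missing = type_2.replace('-', np.nan)   # alternative: treat them as missing values

print(as_label.value_counts())    # 'N/A' is counted like any other type
print(as_missing.value_counts())  # NaN rows are dropped by default
```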

In [None]:
df.loc[df['Type 2'] == '-', 'Type 2'] = 'N/A'
df

<a id="4"></a> <br>
> ## 4: Data Exploration

### 4.1 Pokemon introduced in each generation

To start off the data exploration, I would like to see the number of Pokemon introduced in each generation.

Pokemon started with the iconic 151 and grew substantially over the years, but was the growth equal across generations or was there a sudden surge somewhere in between?\
To find out, a simple bar graph and line graph were plotted based on the count of Pokemon in each generation.

The bar graph showed that Generation 1 has the highest number of Pokemon compared to the newer generations. Although I know that Pokemon currently has a trend of introducing new evolutions or new forms for everyone's iconic Generation 1 Pokemon, topping the chart across all nine generations was indeed a little surprising.

To really determine the number of new Pokemon introduced in each generation, without accounting for mega evolutions or region-specific forms, I plotted a second set of graphs that only accounted for the base form of each Pokemon. This means that for a Pokemon like Charizard, which has its original form along with two different mega evolutions, only one entry is counted. This gives a much more accurate view of the number of new Pokemon released in each generation.

From the bar graph generated, it is noticeable that there was a huge drop in new Pokemon introduced after Generation 5 (Pokemon Black and White and their sequels, Black 2 and White 2). This was most likely because Generation 6, Pokemon X and Pokemon Y, introduced the new concept of Mega Evolution, which rehashed the designs of Pokemon from older generations. This trend of rehashing older Pokemon continued from Generation 6 all the way to Generation 9 with the introduction of region-specific forms of older Pokemon, such as the Alolan, Galarian, Hisuian and Paldean forms.

This makes me wonder if GameFreak has run out of innovative ideas for new Pokemon designs, or if they have done some market research that favours making new designs of the good old classic generations of Pokemon. Just a little food for thought....
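
The difference between counting every form and counting only base forms can be sketched on a toy dataframe with `groupby` plus `nunique`. This is a slightly different route from the `drop_duplicates` approach used in the next cell, and the rows here are made up for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Name':       ['Charizard', 'Mega Charizard X', 'Mega Charizard Y', 'Pikachu', 'Chikorita'],
    'Base Name':  ['Charizard', 'Charizard', 'Charizard', 'Pikachu', 'Chikorita'],
    'Generation': ['Gen 1', 'Gen 1', 'Gen 1', 'Gen 1', 'Gen 2'],
})

all_forms = toy_df.groupby('Generation')['Name'].count()          # every row, alternate forms included
base_forms = toy_df.groupby('Generation')['Base Name'].nunique()  # one per distinct base name
summary = pd.DataFrame({'All forms': all_forms, 'Base forms': base_forms})
print(summary)
```

In this toy example Gen 1 has four rows but only two distinct base forms, which is the same kind of inflation the first set of graphs suffers from.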

In [None]:
#group pokemon by generations and count
gengrp_df = df.groupby(['Generation'], as_index=False)['Name'].count()

#keep only unique base forms
gengrp_unique_df = df.drop_duplicates(subset='Base Name', keep='first')
gengrp_unique_df = gengrp_unique_df.groupby(['Generation'], as_index=False)['Name'].count()

figures, axes = plt.subplots(2,2, figsize=(15,10))

#plotting for all pokemon
ax1 = sns.barplot(x='Generation', y='Name', data=gengrp_df, ax=axes[0,0])
for i in ax1.containers:
    ax1.bar_label(i,)
ax1.set_title('Pokemon in each generation',size=10, weight="bold") 

ax2 = sns.lineplot(x='Generation', y='Name', data=gengrp_df, marker='o', markersize=8, ax=axes[0,1])
ax2.set_title('Pokemon in each generation',size=10, weight="bold") 

#plotting only for unique base form
ax3 = sns.barplot(x='Generation', y='Name', data=gengrp_unique_df, ax=axes[1,0])
for i in ax3.containers:
    ax3.bar_label(i,)
ax3.set_title('New pokemon introduced in each generation',size=10, weight="bold") 

ax4 = sns.lineplot(x='Generation', y='Name', data=gengrp_unique_df, marker='o', markersize=8, ax=axes[1,1])
ax4.set_title('New pokemon introduced in each generation',size=10, weight="bold") 

ax1.set_ylim(0,220)
ax2.set_ylim(0,220)
ax3.set_ylim(0,220)
ax4.set_ylim(0,220)
plt.show()

### 4.2 Type distribution

Next, I look at the overall type distribution of Pokemon across all generations before delving into each individual generation.

For the Type 1 distribution, WATER, NORMAL and GRASS were the most common amongst all available types. For Type 2, almost half of the Pokemon do not have one, and among those that do, FLYING and PSYCHIC were the most common. According to the cross-tabulation, 122 pokemon have FLYING as their Type 2 and 49 pokemon have PSYCHIC as their Type 2.

When I looked at the Type 1 distribution across the nine generations, most generations have WATER, NORMAL or GRASS within the top 3, with the exception of Generation 6, where the top Type 1 was FAIRY. This was mainly because Generation 6 was the first time FAIRY type pokemon were introduced. The sudden surge of FAIRY type pokemon in Generation 6 might have been a measure to quickly balance the number of FAIRY type pokemon against the other available types.

As for the Type 2 distribution across the generations, it reflected what was shown in the overall pie chart, as N/A, FLYING and PSYCHIC were commonly within the top 3.
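
Reading a "top 3" off nine pie charts by eye is error-prone, so a tabular check can help. The sketch below ranks types within each generation on a made-up dataframe; it keeps the top 2 only because the toy data is tiny, and `counts` and `top_types` are illustrative names.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Generation': ['Gen 1'] * 6 + ['Gen 6'] * 4,
    'Type 1': ['WATER', 'WATER', 'WATER', 'NORMAL', 'NORMAL', 'GRASS',
               'FAIRY', 'FAIRY', 'FAIRY', 'WATER'],
})

# Count each type per generation, then sort so the most common types come first.
counts = (toy_df.groupby(['Generation', 'Type 1'])
                .size()
                .reset_index(name='Type Count')
                .sort_values(['Generation', 'Type Count'], ascending=[True, False]))

# Keep the first rows of each generation, i.e. its most common types.
top_types = counts.groupby('Generation').head(2)
print(top_types)
```

On the real data, the same pattern with `head(3)` should give the top 3 per generation directly.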

In [None]:
data1 = df['Type 1'].value_counts()
labels1 = data1.index

data2 = df['Type 2'].value_counts()
labels2 = data2.index

colors = {'BUG':'lightgreen', 'DRAGON':'indigo', 'ELECTRIC':'yellow', 'FAIRY':'pink', 'FIGHTING':'red', 'FIRE':'orange', 'GHOST':'purple', 'GRASS':'green', 'GROUND':'brown', 'ICE':'lightblue',
           'NORMAL':'lightgrey', 'POISON':'magenta', 'PSYCHIC':'violet','ROCK':'grey', 'WATER':'blue', 'DARK':'darkmagenta', 'STEEL':'silver', 'FLYING':'lightskyblue', 'N/A':'snow'}
wedge_properties = {'linewidth': 1, 'edgecolor': "black"}

fig = plt.figure(figsize=(15,10))

ax1 = fig.add_subplot(121)
ax1.pie(data1, labels=labels1, colors=[colors[key] for key in labels1], startangle=90, wedgeprops=wedge_properties, textprops=dict(color='black'), autopct='%.1f%%')

ax2 = fig.add_subplot(122)
ax2.pie(data2, labels=labels2, colors=[colors[key] for key in labels2], startangle=90, wedgeprops=wedge_properties, textprops=dict(color='black'), autopct='%.1f%%')

plt.show()

In [None]:
typemix_df = pd.crosstab(index=df['Type 1'], columns=df['Type 2'], margins=True)
typemix_df

#### 4.2.1 Type 1 distribution across generations

In [None]:
test_df = df.drop_duplicates(subset='Base Name', keep='first')
typegrp_df = test_df.groupby(['Generation', 'Type 1'])['Name'].count().reset_index(name='Type Count')

generation = ['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6', 'Gen 7', 'Gen 8', 'Gen 9']
colors = {'BUG':'lightgreen', 'DRAGON':'indigo', 'ELECTRIC':'yellow', 'FAIRY':'pink', 'FIGHTING':'red', 'FIRE':'orange', 'GHOST':'purple', 'GRASS':'green', 'GROUND':'brown', 'ICE':'lightblue',
           'NORMAL':'lightgrey', 'POISON':'magenta', 'PSYCHIC':'violet','ROCK':'grey', 'WATER':'blue', 'DARK':'darkmagenta', 'STEEL':'silver', 'FLYING':'lightskyblue'}
wedge_properties = {'linewidth': 1, 'edgecolor': "black"}

fig, axs = plt.subplots(nrows=3, ncols=3, figsize=(18, 18))
for gen_i, ax in zip(generation, axs.flat):
    df1 = typegrp_df[typegrp_df['Generation'] == gen_i]
    labels = df1['Type 1']
    #ax.pie returns (wedges, label texts, percentage texts) when autopct is set
    wedges, texts, autotexts = ax.pie(df1['Type Count'],
                                   labels=labels,
                                   shadow=False,
                                   colors=[colors[key] for key in labels],
                                   startangle=90,
                                   wedgeprops=wedge_properties,
                                   textprops=dict(color='black'),
                                   autopct='%.1f%%')
    plt.setp(autotexts, size=10)
    ax.set_title(gen_i, size=10, weight="bold", y=-0.05)

fig.suptitle("Type 1 distribution across generations", size=10, weight="bold")
plt.tight_layout()
plt.show()

#### 4.2.2 Type 2 distribution across generations

In [None]:
test_df = df.drop_duplicates(subset='Base Name', keep='first')
typegrp_df = test_df.groupby(['Generation', 'Type 2'])['Name'].count().reset_index(name='Type Count')

generation = ['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6', 'Gen 7', 'Gen 8', 'Gen 9']
colors = {'BUG':'lightgreen', 'DRAGON':'indigo', 'ELECTRIC':'yellow', 'FAIRY':'pink', 'FIGHTING':'red', 'FIRE':'orange', 'GHOST':'purple', 'GRASS':'green', 'GROUND':'brown', 'ICE':'lightblue',
           'NORMAL':'lightgrey', 'POISON':'magenta', 'PSYCHIC':'violet','ROCK':'grey', 'WATER':'blue', 'DARK':'darkmagenta', 'STEEL':'silver', 'FLYING':'lightskyblue', 'N/A':'snow'}
wedge_properties = {'linewidth': 1, 'edgecolor': "black"}

fig, axs = plt.subplots(nrows=3, ncols=3, figsize=(18, 18))
for gen_i, ax in zip(generation, axs.flat):
    df1 = typegrp_df[typegrp_df['Generation'] == gen_i]
    labels = df1['Type 2']
    #ax.pie returns (wedges, label texts, percentage texts) when autopct is set
    wedges, texts, autotexts = ax.pie(df1['Type Count'],
                                   labels=labels,
                                   shadow=False,
                                   colors=[colors[key] for key in labels],
                                   startangle=90,
                                   wedgeprops=wedge_properties,
                                   textprops=dict(color='black'),
                                   autopct='%.1f%%')
    plt.setp(autotexts, size=10)
    ax.set_title(gen_i, size=10, weight="bold", y=-0.05)

fig.suptitle("Type 2 distribution across generations", size=10, weight="bold")
plt.tight_layout()
plt.show()

### 4.3 Pokemon strength across generation (Non-Legendary)

To keep the comparison of strength fair, I will look at legendary and non-legendary Pokemon separately because, based on my past experience playing Pokemon, legendary pokemon are far stronger than non-legendaries. I will also not take mega evolutions and region-specific pokemon into account, because these forms tend to cater towards the older generations as seen in section 4.1, which may skew the results.

The comparison of strength is done by averaging the Base Stats Total (BST) across the nine generations. Based on the bar graph plotted, Generation 9 has the strongest pokemon on average while Generation 3 has the weakest. However, when I look at the distribution of BST across the generations, Generation 3 has the highest outlier, the pokemon Slaking, with a BST of 670, whereas the highest BST in the other generations was 600.

Looking at the weakest pokemon, the minimum BST was lowest in Generation 7, from the pokemon Wishiwashi, which has a BST of 175. When looking solely at the lower limits of the BST, there was a large gap between Generation 5 and the other generations. This comes from the pokemon Patrat, which has a BST of 225, a lot higher than the lower limits of the other generations, which were mostly within the 180-200 range.
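
For context on what counts as an outlier in the box plots: seaborn's box plot draws whiskers out to 1.5 times the interquartile range by default and marks points beyond them individually. The sketch below applies that rule by hand to synthetic BST values, which are not taken from the dataset; 670 simply plays the role of a Slaking-like value.

```python
import pandas as pd

# Synthetic BST values for illustration only.
bst = pd.Series([300, 320, 405, 420, 450, 480, 500, 530, 670])

q1, q3 = bst.quantile(0.25), bst.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the 1.5 * IQR fences

outliers = bst[(bst < lower) | (bst > upper)]
print(lower, upper, outliers.tolist())
```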

In [None]:
test_df = df.drop_duplicates(subset='Base Name', keep='first')
test_df = test_df.loc[test_df['Legendary'] == False]

gengrpstr_nl_df = test_df.groupby(['Generation'], as_index=False)['BST'].mean()

fig, axs = plt.subplots(1,3, figsize=(20, 5))

ax1 = sns.barplot(x='Generation', y='BST', data=gengrpstr_nl_df, ax=axs[0])
for i in ax1.containers:
    ax1.bar_label(i, fmt='%.2f')
ax1.title.set_text('Average strength of non-legendaries across generation')

ax2 = sns.boxplot(x='Generation', y='BST', data=test_df, ax=axs[1])
ax2.title.set_text('Strength of non-legendaries across generation (Box Plot)')

ax3 = sns.violinplot(x='Generation', y='BST', data=test_df, ax=axs[2])
ax3.title.set_text('Strength of non-legendaries across generation (Violin Plot)')

In [None]:
test_df = df.drop_duplicates(subset='Base Name', keep='first')
test_df = test_df.loc[test_df['Legendary'] == False]

top10_df = test_df.loc[test_df.groupby('Generation')['BST'].idxmax()]
top10_df

In [None]:
top10_df.sort_values(['BST'], ascending=False)

In [None]:
test_df = df.drop_duplicates(subset='Base Name', keep='first')
test_df = test_df.loc[test_df['Legendary'] == False]

btm10_df = test_df.loc[test_df.groupby('Generation')['BST'].idxmin()]
btm10_df

In [None]:
btm10_df.sort_values(['BST'], ascending=True) 

### 4.4 Pokemon strength across generation (Legendaries)

Now let's look at the legendaries. When just comparing the average BST of the legendaries, the newer generations, 7 to 9, fell a little short compared to the previous generations. Although the legendaries in Generation 6 have the highest average across the generations, when I look at the distribution of BST, the upper limit of Generation 4 was much higher than the rest. This was attributed to none other than the God of Pokemon, Arceus, who has a BST of 720. When compared to the top legendaries of the other generations, most of them were in the higher 600s and only Arceus went past the total of 700. Truly befitting the role of the God of Pokemon.

Now things get interesting when we look at the weaker legendaries. In most generations, the legendaries were roughly on par with their counterparts within the same generation, but in Generation 7 and 8 there was a huge disparity. In Generation 7, the weakest legendary, Cosmog, has a BST of only 200 while the strongest, Solgaleo, has a BST of 680!

This might indicate a slight paradigm shift in the status of legendaries. After doing some research, I found that legendaries in Generation 7 were mostly presented as guardians or symbolic icons for certain parts of the story. Their designs also leaned towards the cuter side, which might be the reason why this generation of legendaries was much weaker than the rest, as they were not designed for Pokemon battles.
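
The disparity described above can be put into numbers by computing the BST range (max minus min) per generation. In this sketch the Cosmog and Solgaleo values come from the text, while the two 'Legendary A/B' rows are placeholders with made-up values, and `spread` is an illustrative name.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Name':       ['Cosmog', 'Solgaleo', 'Legendary A', 'Legendary B'],
    'Generation': ['Gen 7', 'Gen 7', 'Gen 6', 'Gen 6'],
    'BST':        [200, 680, 600, 680],  # the Gen 6 values are placeholders, not real data
})

# Minimum and maximum BST per generation, plus the gap between them.
spread = toy_df.groupby('Generation')['BST'].agg(['min', 'max'])
spread['range'] = spread['max'] - spread['min']
print(spread)
```

Applied to the legendary subset of the real dataframe, a table like this would show at a glance which generations have the widest gap.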

In [None]:
test_df = df.drop_duplicates(subset='Base Name', keep='first')
test_df = test_df.loc[test_df['Legendary'] == True]

gengrpstr_nl_df = test_df.groupby(['Generation'], as_index=False)['BST'].mean()

fig, axs = plt.subplots(1,3, figsize=(20, 5))

ax1 = sns.barplot(x='Generation', y='BST', data=gengrpstr_nl_df, ax=axs[0])
for i in ax1.containers:
    ax1.bar_label(i, fmt='%.2f')
ax1.title.set_text('Average strength of legendaries across generation')

ax2 = sns.boxplot(x='Generation', y='BST', data=test_df, ax=axs[1])
ax2.title.set_text('Strength of legendaries across generation (Box Plot)')

ax3 = sns.violinplot(x='Generation', y='BST', data=test_df, ax=axs[2])
ax3.title.set_text('Strength of legendaries across generation (Violin Plot)')

In [None]:
test_df = df.drop_duplicates(subset='Base Name', keep='first')
test_df = test_df.loc[test_df['Legendary'] == True]

top10_df = test_df.loc[test_df.groupby('Generation')['BST'].idxmax()]
top10_df

In [None]:
top10_df.sort_values(['BST'], ascending=False)

In [None]:
test_df = df.drop_duplicates(subset='Base Name', keep='first')
test_df = test_df.loc[test_df['Legendary'] == True]

btm10_df = test_df.loc[test_df.groupby('Generation')['BST'].idxmin()]
btm10_df

In [None]:
btm10_df.sort_values(['BST'], ascending=True) 

### 4.5 Which starter to choose?

The biggest decision I ever had to make when I was a child. Which starter pokemon to choose?\
Back when the internet was not easily available, I was not able to research the evolutionary path of each starter before deciding which one to pick, but right now, with the availability of data around us, I can make data-driven decisions to choose the best option.

I have plotted two sets of bar graphs, one without mega evolutions and one with mega evolutions. This takes into account that if I were to go back in time and play Pokemon Red, Blue or Yellow again, I would not have access to mega evolutions, so the best pick would most likely be Charizard based on BST alone.

If I wanted to go into more depth, I would most likely consolidate the Type 1 and Type 2 of each starter and map out the potential weaknesses of each starter pokemon and its evolutions to choose the pokemon with the fewest weaknesses, but let's leave that for another day.
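
As a rough idea of what that weakness mapping might look like, here is a minimal sketch for the Charmander line. It is not a full analysis: the `EFFECTIVENESS` chart is hand-written and partial (it covers only five attacking types against FIRE and FLYING), and `weaknesses` is a hypothetical helper.

```python
# Partial chart covering only the pairs used below; missing pairs count as neutral (1x).
EFFECTIVENESS = {
    ('WATER', 'FIRE'): 2.0, ('ROCK', 'FIRE'): 2.0, ('GROUND', 'FIRE'): 2.0, ('ICE', 'FIRE'): 0.5,
    ('ELECTRIC', 'FLYING'): 2.0, ('ROCK', 'FLYING'): 2.0, ('ICE', 'FLYING'): 2.0, ('GROUND', 'FLYING'): 0.0,
}
ATTACK_TYPES = ['WATER', 'ELECTRIC', 'ROCK', 'GROUND', 'ICE']

def weaknesses(type_1, type_2='N/A'):
    """Return {attacking type: multiplier} for every attack that deals more than 1x damage."""
    found = {}
    for attack in ATTACK_TYPES:
        multiplier = EFFECTIVENESS.get((attack, type_1), 1.0)
        if type_2 != 'N/A':
            multiplier *= EFFECTIVENESS.get((attack, type_2), 1.0)
        if multiplier > 1:
            found[attack] = multiplier
    return found

charmander = weaknesses('FIRE')           # single type
charizard = weaknesses('FIRE', 'FLYING')  # gains FLYING on evolution
print(charmander)
print(charizard)
```

Even in this small chart, evolving changes the picture: Charizard loses the GROUND weakness but picks up ELECTRIC and a 4x weakness to ROCK, which BST alone would not reveal.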

I also purposefully left Ash's Greninja in for the second set of graphs with Mega Evolution included just for comparison to see how overpowered his Greninja is compared to the other starters.

#### 4.5.1 Without Mega-Evolutions

In [None]:
#gengrp_unique_df = df.drop_duplicates(subset='Base Name', keep='first')
#starters_df = df.loc[df['Dex No'].isin([1,2,3,4,5,6,7,8,9,152,153,154,155,156,157,158,159,160,252,253,254,255,256,257,258,259,260,387,388,389])]
starters_df = df.loc[df['Dex No'].between(1,9) | df['Dex No'].between(152,160) | df['Dex No'].between(252,260) | df['Dex No'].between(387,395) | df['Dex No'].between(495,503) | df['Dex No'].between(650,658)
                     | df['Dex No'].between(722,730) | df['Dex No'].between(810,818) | df['Dex No'].between(906,914)]
starters_df.head(10)

In [None]:
starters_unique_df = starters_df.drop_duplicates(subset='Base Name', keep='first')
starters_unique_df.head(10)

In [None]:
generation = ['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6', 'Gen 7', 'Gen 8', 'Gen 9']
seaborn_palette = sns.color_palette("Set2")

fig, axs = plt.subplots(nrows=5, ncols=2, figsize=(20, 20))
for gen_i, ax in zip(generation, axs.flat):
    df1 = starters_unique_df[starters_unique_df['Generation'] == gen_i]
    ax.bar(df1['Name'], df1['BST'], color=seaborn_palette)
    ax.set_ylim(0, 600)
    ax.set_title(gen_i + " (No Mega Evolutions)",size=10, weight="bold")
    for i in ax.containers:
        ax.bar_label(i,)

axs[4,1].set_axis_off()
plt.tight_layout()
plt.show()

#### 4.5.2 With Mega Evolutions

In [None]:
generation = ['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6', 'Gen 7', 'Gen 8', 'Gen 9']
seaborn_palette = sns.color_palette("Set2")

fig, axs = plt.subplots(nrows=5, ncols=2, figsize=(35, 30))
for gen_i, ax in zip(generation, axs.flat):
    df1 = starters_df[starters_df['Generation'] == gen_i]
    ax.bar(df1['Name'], df1['BST'], color=seaborn_palette)
    ax.set_ylim(0, 700)
    ax.set_title(gen_i + " (Mega Evolutions Included)",size=10, weight="bold")
    for i in ax.containers:
        ax.bar_label(i,)

axs[4,1].set_axis_off()
plt.show()

<a id="5"></a> <br>
> ## 5: Conclusion

I had lots of fun doing this and combining both of my interests to look at Pokemon from an analytical point of view.

Once again, <font size = '7'>Credits to Josh Korngibel for the amazing dataset!</font> 

I hope to learn more about data analytics and data science as I seek out more opportunities to get my hands dirty.