# Exploratory Data Analysis in Python

Dataset from Kaggle : **"Pokemon with stats"** by *Alberto Barradas*  
Source: https://www.kaggle.com/abcsds/pokemon (requires login)

Inspired by the wonderful EDA on Pokemon Data by [Redwan Huq](http://inmachineswetrust.com/posts/exploring-pokemon-dataset/).

![Gotta Catch 'Em All!](images/PokemonIntro.png)

---

### Essential Libraries

Let us begin by importing the essential Python Libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

In [None]:
"""
Don't do this unless you know what you're doing.
I'm only including this to remove the warning statements that may overflow the notebook
"""
import warnings
warnings.filterwarnings('ignore')

---

### Import the Dataset

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [None]:
# We use pd.read_csv() because the data is in a "Comma Separated Values" (csv) file
# This function takes the text file and converts it into a DataFrame (a table) called 'pkmndata'
pkmndata = pd.read_csv('pokemonData.csv')

# .head() shows us the first 5 rows
# Always run this immediately after loading to ensure the data looks the way you expect
pkmndata.head()

Description of the dataset, as available on Kaggle, is as follows.
Learn more : https://en.wikipedia.org/wiki/List_of_Pok%C3%A9mon

> **\#** : ID for each Pokemon (runs from 1 to 721)  
> **Name** : Name of each Pokemon  
> **Type 1** : Each Pokemon has a basic Type, this determines weakness/resistance to attacks  
> **Type 2** : Some Pokemons are dual type and have a Type 2 value (set to nan otherwise)  
> **Total** : Sum of all stats of a Pokemon, a general guide to how strong a Pokemon is  
> **HP** : Hit Points, defines how much damage a Pokemon can withstand before fainting  
> **Attack** : The base modifier for normal attacks by the Pokemon (e.g., scratch, punch etc.)  
> **Defense** : The base damage resistance of the Pokemon against normal attacks  
> **Sp. Atk** : Special Attack, the base modifier for special attacks (e.g. fire blast, bubble beam)  
> **Sp. Def** : Special Defense, the base damage resistance against special attacks  
> **Speed** : Determines which Pokemon attacks first each round  
> **Generation** : Each Pokemon belongs to a certain Generation  
> **Legendary** : Legendary Pokemons are powerful, rare, and hard to catch

---

Check the vital statistics of the dataset using the built-in `type()` function and the `shape` attribute.

In [None]:
print("Data type : ", type(pkmndata))   # Ensure that it is a DF
print("Data dims : ", pkmndata.shape)   # .shape returns a tuple: (Number of Rows, Number of Columns)

Check the variables (and their types) in the dataset using the `dtypes` attribute.

In [None]:
# Data types of all the columns
print(pkmndata.dtypes)

---

## Explore the Dataset

Exploring any dataset requires a solid understanding of the domain -- it is Pokemon, in our case.    
We understand the following basics regarding Pokemon, primarily from [Wikipedia](https://en.wikipedia.org/wiki/List_of_Pok%C3%A9mon) and [Bulbapedia](https://bulbapedia.bulbagarden.net/wiki/Generation).    

> **Generation** : There are seven generations of Pokemon as of 2018, with 721 till Generation VI (this dataset).   
> **Type** : Every Pokemon has a *primary* type, and some of them also have a *secondary* type -- dual-type ones.    
> **Legendary** : These Pokemons are rare, powerful, and really hard to catch -- there are 38 up to Generation VI.    

Way more trivia about Pokemon is available online -- but let's come back and retrieve more information from the data.

In [None]:
"""
.info() is the single most useful command for a first look.
It tells us:
  - The name of every column
  - The data type (int64 = whole numbers, object = text/string, float = decimals)
  - The "Non-Null Count" (How many rows actually have data vs. being empty)
"""
pkmndata.info()

#### Generations of Pokemon in the Dataset

In [None]:
# .unique() gives us the list of distinct categories (e.g., [1, 2, 3...])
print("Number of Generations :", len(pkmndata["Generation"].unique()))

# .value_counts() counts how many rows exist for each category
# Automatically sorts them from Most Common -> Least Common
print(pkmndata["Generation"].value_counts())

# sb.catplot (Categorical Plot) draws a bar for each category.
# kind="count" tells Seaborn to calculate the frequency for us
sb.catplot(y = "Generation", data = pkmndata, kind = "count")

#### Types of Pokemon in the Dataset
 

In [None]:
# Primary Types in the Dataset
print("Number of Primary Types :", len(pkmndata["Type 1"].unique()))

# Pokemons of each Primary Type
print(pkmndata["Type 1"].value_counts())
sb.catplot(y = "Type 1", data = pkmndata, kind = "count", height = 4)

In [None]:
"""
Note the dropna() call, which drops the missing (NaN) values:
    monotype pkmn have no Type 2 (Type 2 = NaN), and unique() would count NaN as one more type
"""
# Secondary Types in the Dataset
print("Number of Secondary Types :", len(pkmndata["Type 2"].dropna().unique()))

# Pokemons of each Secondary Type
print(pkmndata["Type 2"].dropna().value_counts())
sb.catplot(y = "Type 2", data = pkmndata, kind = "count", height = 4)
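
The effect of missing values on these counts can be checked on a tiny example. The sketch below uses a made-up `Type 2` column (not rows from the dataset) and compares `unique()`, `nunique()`, and `value_counts()` with and without `NaN`.

```python
import numpy as np
import pandas as pd

# Made-up sample: two monotype rows (NaN) and three dual-type rows
type2 = pd.Series(["Poison", np.nan, "Flying", "Poison", np.nan], name="Type 2")

with_nan = len(type2.unique())               # NaN is returned as one of the values
without_nan = len(type2.dropna().unique())   # NaN removed before counting
shortcut = type2.nunique()                   # nunique() skips NaN by default

counted = type2.value_counts()                   # NaN rows are left out by default
counted_all = type2.value_counts(dropna=False)   # NaN rows get their own entry

print(with_nan, without_nan, shortcut)   # 3 2 2
print(counted.sum(), counted_all.sum())  # 3 5
```

Because `value_counts()` already skips `NaN` by default, the `dropna()` matters mostly for the `unique()` count.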

In [None]:
# Pokemons with a Single Type -- I
singletype_data = pkmndata[pkmndata["Type 2"].isnull()] # If no "Type 2" --> Monotype (Single Type)
print("Pokemons with just Type 1 :", len(singletype_data))
singletype_data.head()

In [None]:
# Pokemons with Dual Types -- I and II
dualtype_data = pkmndata[pkmndata["Type 2"].notnull()]  # "Type 2" is present --> Dual Type
print("Pokemons with Types 1 and 2 :", len(dualtype_data))
dualtype_data.head()

#### Types of Pokemon over Generations

In [None]:
# Primary Type over Generations
sb.catplot(y = 'Type 1', data = pkmndata, col = 'Generation', kind = 'count', col_wrap = 2, height = 4)

In [None]:
# Secondary Type over Generations
sb.catplot(y = 'Type 2', data = pkmndata, col = 'Generation', kind = 'count', col_wrap = 2, height = 4)

#### Type distribution of Dual-Type Pokemons

In [None]:
# Pokemons with Dual Types -- I and II
dualtype_data = pkmndata[pkmndata["Type 2"].notnull()]
print("Pokemons with Types 1 and 2 :", len(dualtype_data))

"""
.groupby(['Type 1', 'Type 2']): Group the data by every unique pair (e.g., Fire-Water)

.size(): Count how many Pokemon are in each group

.unstack(): Pivot the data. Turn the list of groups into a Matrix (Grid)
    (Rows = Type 1, Columns = Type 2)

.heatmap()
    uses color intensity to show density
    Darker squares = common combinations
    Empty squares = combination doesn't exist in pkmn yet

We have 18 Type 1s and 18 Type 2s --> 18 x 18 = 324 combinations
Bar chart with 300+ bars would be unreadable
Heatmap (grid) here would be easier to scan for patterns 
"""

# Distribution of the Two Types
f = plt.figure(figsize=(6, 6))
sb.heatmap(dualtype_data.groupby(['Type 1', 'Type 2']).size().unstack(), 
           linewidths = 1, 
           annot = True,            # annot=True writes the actual number in the box
           annot_kws = {"size": 8}, # Font size of the numbers
           cmap = "BuGn")           # Color map: Blue to Green
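
To see what the heatmap receives, here is a minimal sketch of the `groupby` / `size` / `unstack` chain on four made-up rows (the type pairs are illustrative only). It also shows `fill_value=0`, an optional argument of `unstack()` that the notebook itself does not use.

```python
import pandas as pd

# Made-up dual-type rows, only to show the reshaping
toy = pd.DataFrame({
    "Type 1": ["Fire", "Fire", "Grass", "Water"],
    "Type 2": ["Flying", "Flying", "Poison", "Flying"],
})

# Series indexed by (Type 1, Type 2) holding the number of rows per pair
pair_counts = toy.groupby(["Type 1", "Type 2"]).size()

grid = pair_counts.unstack()                    # pairs that never occur become NaN
grid_zero = pair_counts.unstack(fill_value=0)   # same grid, with 0 for absent pairs

print(grid)
print(grid_zero)
```

Keeping `NaN` leaves the absent combinations blank in the heatmap, whereas zeros would be drawn as coloured cells annotated with 0.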

In [None]:
# Distribution of the Two Types over Generations
f, axes = plt.subplots(3, 2, figsize=(12, 18))

# One heatmap per Generation (1 to 6), filling the 3 x 2 grid row by row
for gen, ax in zip(range(1, 7), axes.flatten()):
    dualtype_gen = dualtype_data[dualtype_data["Generation"] == gen]
    sb.heatmap(dualtype_gen.groupby(['Type 1', 'Type 2']).size().unstack(),
               linewidths = 1, annot = True, annot_kws = {"size": 6}, cmap = "BuGn", ax = ax)

#### Legendary Pokemons

![Legendary Pokemons](images/PokemonLegendary.png)

We understand that there are 65 Legendary Pokemons till Generation 6. Rare, powerful, interesting, and hard to catch. Let's explore them in the dataset.

In [None]:
# Legendary Pokemons in the Dataset
legnd_data = pkmndata[pkmndata["Legendary"] == True]
print("Number of Legendary Pokemons :", len(legnd_data))

# Legendary Pokemons in each Generation
print(legnd_data["Generation"].value_counts())
sb.catplot(y = "Generation", data = legnd_data, kind = "count", height = 4)

In [None]:
# Legendary Pokemons in the Dataset
legnd_data = pkmndata[pkmndata["Legendary"] == True]
print("Number of Legendary Pokemons :", len(legnd_data))

# Legendary Pokemons in each Primary Type
print(legnd_data["Type 1"].value_counts())
sb.catplot(y = "Type 1", data = legnd_data, kind = "count", 
           order = legnd_data["Type 1"].value_counts().index, height = 4)

In [None]:
# Legendary Pokemons with two Types -- I and II
dualtype_legnd_data = legnd_data[legnd_data["Type 2"].notnull()]
print("Legendary Pokemons with Types 1 and 2 :", len(dualtype_legnd_data))


# Distribution over the Two Types
f = plt.figure(figsize=(8, 8))
sb.heatmap(dualtype_legnd_data.groupby(['Type 1', 'Type 2']).size().unstack(), 
           linewidths = 1, annot = True, annot_kws = {"size": 9}, cmap = "BuGn")

#### Statistical Summary of Pokemon Points

Numerical Data

To understand how the numeric stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) are distributed, we start with their summary statistics.

In [None]:
# Extract only the numeric data variables
numeric_data = pd.DataFrame(pkmndata[["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]])

# Summary Statistics for all Variables
numeric_data.describe().round(2)

In [None]:
"""
Plot the "Big Three" plots
    When looking at numbers (e.g. HP, Attack, Speed), we care about the Distribution.
    E.g., is the data clustered in the middle, are there extreme values (outliers)?

Boxplot
    Best for spotting outliers (the dots outside the whiskers) and
    checking where the middle 50% of the data sits.
    
Histogram
    Best for seeing the shape (is it a bell curve? is it skewed to the left?)
    
Violin Plot
    Combination of the two. Shows the boxplot's range, but also the
    Histogram's "fatness" density
"""

# Draw the distributions of all variables
f, axes = plt.subplots(6, 3, figsize=(18, 24))

count = 0
for var in numeric_data:
    sb.boxplot(data = numeric_data[var], orient = "h", ax = axes[count,0])
    sb.histplot(data = numeric_data[var], ax = axes[count,1])
    sb.violinplot(data = numeric_data[var], orient = "h", ax = axes[count,2])
    count += 1
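
The dots a boxplot draws outside its whiskers follow a simple rule. The sketch below applies the usual 1.5 x IQR rule (the Seaborn default for `whis`) by hand to a made-up list of HP values; the numbers are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up HP values with one extreme entry
hp = pd.Series([35, 45, 50, 60, 65, 70, 80, 255], name="HP")

q1, q3 = hp.quantile(0.25), hp.quantile(0.75)   # edges of the box
iqr = q3 - q1                                   # width of the middle 50%
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # limits for the whiskers

# Anything beyond the limits would be drawn as an individual dot
outliers = hp[(hp < lower) | (hp > upper)]
print(outliers.tolist())   # [255]
```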

In [None]:
"""
.corr() measures linear relationships between -1 and 1

Remember: Heatmap is a clearer visualization of the correlation matrix (Useful figure)
"""
# Correlation Matrix
print(numeric_data.corr())

# Heatmap of the Correlation Matrix
f = plt.figure(figsize=(6, 6))
sb.heatmap(numeric_data.corr(), vmin = -1, vmax = 1, linewidths = 1,
           annot = True, fmt = ".2f", annot_kws = {"size": 8}, cmap = "RdBu")
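
For intuition about what each cell of the matrix contains, here is a small sketch that computes the Pearson correlation by hand and compares it with `.corr()`. It assumes the default method (`pearson`) and uses made-up Attack and Defense values.

```python
import numpy as np
import pandas as pd

# Made-up stats for five Pokemon
toy = pd.DataFrame({
    "Attack":  [49, 62, 82, 100, 52],
    "Defense": [49, 63, 83, 123, 43],
})

# Pearson r = sum of products of deviations / sqrt(product of sums of squared deviations)
x = toy["Attack"] - toy["Attack"].mean()
y = toy["Defense"] - toy["Defense"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

r_pandas = toy["Attack"].corr(toy["Defense"])   # default method is Pearson
print(round(r_manual, 4), round(r_pandas, 4))
```

Both numbers agree, and the value always lies between -1 and 1.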

In [None]:
# Draw pairs of variables against one another
sb.pairplot(data = numeric_data)

### Cleaning the Dataset (Pre-processing)

#### Unique Names and IDs of Pokemons

Count Unique Names

In [None]:
# .unique() returns an array of every distinct name found in the column.
# len() counts the length of that array.
# If this number matches the total row count (800), every name is unique.
print("Unique Names of Pokemon :", len(pkmndata["Name"].unique()))

Count Unique IDs

In [None]:
# We do the same check for the ID column ("#").
# If this number is LOWER than the total rows, it means some IDs are repeated
print("Unique IDs of Pokemon :", len(pkmndata["#"].unique()))

We know duplicates exist (from the previous step). Now we want to see them.

In [None]:
# .duplicated("#", keep=False):
#   - Scans the "#" column.
#   - keep=False is CRITICAL here. By default (keep="first"), pandas treats the first copy as the
#     original and marks only the later copies as duplicates.
#     By saying keep=False, we tell it: "Mark ALL instances as True."
#     We want to see the original AND the copy side-by-side to compare them.
dupid_data = pkmndata[pkmndata.duplicated("#", keep = False)]

# .sort_values(by="Name"):
#   - We sort alphabetically so rows sharing an ID sit next to each other (e.g., Venusaur and VenusaurMega Venusaur).
# .head(10):
#   - We only look at the first 10 rows to verify our code worked.
dupid_data.sort_values(by = "Name").head(n = 10)
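
The difference between the default `keep="first"` and `keep=False` is easiest to see on a handful of rows. This sketch uses a made-up six-row table in the style of the raw names.

```python
import pandas as pd

# Made-up rows: ID 3 appears twice, ID 6 three times, ID 4 once
toy = pd.DataFrame({
    "#":    [3, 3, 4, 6, 6, 6],
    "Name": ["Venusaur", "VenusaurMega Venusaur", "Charmander",
             "Charizard", "CharizardMega Charizard X", "CharizardMega Charizard Y"],
})

later_copies = toy.duplicated("#")              # default keep="first": first occurrence is False
all_copies = toy.duplicated("#", keep=False)    # every row that shares its ID is True

print(later_copies.sum(), all_copies.sum())     # 3 5
print(toy[all_copies])
```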

Generate a duplicate report. This block prints a clean list of which Pokemon share the same ID.

In [None]:
# Count how many rows are involved in the duplication
print("Pokemons with Duplicate IDs :", len(dupid_data))

# Get a list of the specific ID numbers that are repeated
dupids = dupid_data["#"].unique()
print("Unique Pokemons with DupIDs :", len(dupids))
print()

print("# \t Count \t List of Pokemons with Duplicate IDs")
print()

# This runs once for every problem ID found.
for dupid in dupids:

    # Filter: Get only the rows matching the current Duplicate ID
    # Select: Get only the "Name" column
    # List: Convert the result into a standard Python list for printing
    dupid_list = list(dupid_data[dupid_data["#"] == dupid]["Name"])

    print(dupid, "\t", len(dupid_list), "\t", dupid_list)

---

## Clean the Dataset

Once we are done with the basic exploration of variables, it's time to *clean* and *tidy-up* the dataset.

In [None]:
"""
Standardizing Column Names.

Objective: Remove spaces, punctuation, and capitalization differences.

Why? In Python, `df.Attack` works, but `df.Sp. Atk` is a syntax error because of the space and dot.
"""

# In Python, variables are often "pointers." If we just said df_clean = df, 
# changing df_clean would ALSO destroy our original raw data.
# .copy() forces Python to create a physically separate backup.
pkmndata_clean = pkmndata.copy()    # .copy() is crucial! 

# Rename "#" to "ID" of Pokemon
# inplace=True means "save this change directly into pkmndata_clean", 
# rather than creating a new table.
pkmndata_clean.rename(columns = {'#': 'ID'}, inplace = True)

# Convert all Variable Names to UPPERCASE
# .str is an accessor that lets us treat the column names like text strings.
pkmndata_clean.columns = pkmndata_clean.columns.str.upper()

# Remove all spaces and dots from Variable Names
# We replace dots "." with nothing "" (deleting them).
# We replace spaces " " with underscores "_".
# regex = False treats "." as a literal dot rather than the RegEx wildcard "any character".
# Result: "Sp. Atk" becomes "SP_ATK". Much easier to code with!
pkmndata_clean.columns = pkmndata_clean.columns.str.replace(".", "", regex = False)
pkmndata_clean.columns = pkmndata_clean.columns.str.replace(" ", "_", regex = False)

# Print the Variable Information to check
pkmndata_clean.info()
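
Whether a pattern is read literally or as a regular expression changes the result of `str.replace()`. The sketch below runs the header clean-up on a made-up list of column names and then contrasts `regex=False` with `regex=True` for the hypothetical pattern `"Sp."`, chosen only to show the difference.

```python
import pandas as pd

# Made-up subset of the raw column names
cols = pd.Index(["#", "Type 1", "Sp. Atk", "Speed"])

# Upper-case, delete literal dots, turn spaces into underscores
cleaned = (cols.str.upper()
               .str.replace(".", "", regex=False)
               .str.replace(" ", "_", regex=False))
print(list(cleaned))    # ['#', 'TYPE_1', 'SP_ATK', 'SPEED']

# Same pattern, two readings: as plain text, and as a regex where "." means "any character"
literal = cols.str.replace("Sp.", "SP", regex=False)
as_regex = cols.str.replace("Sp.", "SP", regex=True)
print(list(literal))    # ['#', 'Type 1', 'SP Atk', 'Speed']
print(list(as_regex))   # ['#', 'Type 1', 'SP Atk', 'SPed']
```

Stating `regex=False` explicitly avoids relying on the default, which has differed between pandas versions.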

#### Fix Pokemon Names

We take cue from the Pokedex dataset (https://pokemondb.net/pokedex/all), and perform the following (not in order).   

> Convert `[Name]Mega [Name]` to `[Name]Mega`    
> Convert `[Name]Mega [Name] X` to `[Name]MegaX`    
> Convert `[Name]Mega [Name] Y` to `[Name]MegaY`    
> Convert `[Name][Form] Forme` to `[Name][Form]`    
> Convert `[Name][Cloak] Cloak` to `[Name][Cloak]`    
> Convert `[Name][Rotom] Rotom` to `[Name][Rotom]`    
> Convert `[Name][Size] Size` to `[Name][Size]`    
> Convert `HoopaHoopa [Form]` to `Hoopa[Form]`     

Regular Expression (RegEx) search-and-replace is a lovely tool to accomplish such tasks. We use `re` library in Python.

In [None]:
"""
This is like "Find and Replace" on steroids. 
Concept: re.sub(pattern, replacement, text) looks for a pattern and swaps it out.

This block is complex. You don't need to memorize the Regex syntax per se.
Focus on our goal:
    Telling the script to spot patterns (e.g., words ending in `Forme`) 
    and chop them off
"""
# Fix the weird Names of Pokemons
import re # "re" is the library for Regular Expressions

# LAMBDA EXPLANATION:
# .apply(lambda x: ...): 
# Think of this as a mini-loop. "For every single name 'x' in this column, apply this rule..."

# REGEX EXPLANATION: r'(.+)(Forme)'
# (.+) means "Group 1: Any text" (e.g., "GiratinaAltered ")
# (Forme) means "Group 2: The word Forme"
# r'\1' means "Replace the whole thing with just Group 1".
# Result: "GiratinaAltered Forme" becomes "GiratinaAltered " (the leftover blank is removed at the end).

# Fix names with extra Extensions (removing "Forme", "Cloak", "Rotom", "Size")
pkmndata_clean["NAME"] = pkmndata_clean["NAME"].apply(lambda x: re.sub(r'(.+)(Forme)',r'\1', x))
pkmndata_clean["NAME"] = pkmndata_clean["NAME"].apply(lambda x: re.sub(r'(.+)(Cloak)',r'\1', x))
pkmndata_clean["NAME"] = pkmndata_clean["NAME"].apply(lambda x: re.sub(r'(.+)(Rotom)',r'\1', x))
pkmndata_clean["NAME"] = pkmndata_clean["NAME"].apply(lambda x: re.sub(r'(.+)(Size)',r'\1', x))

# Special case for Hoopa: Keep the SECOND part (Group 2)
# r'(Hoopa)(.+)' -> Keep Group 2 (\2).
pkmndata_clean["NAME"] = pkmndata_clean["NAME"].apply(lambda x: re.sub(r'(Hoopa)(.+)',r'\2', x))

# Fix names with Mega in between
# Example: "VenusaurMega Venusaur" -> Keep "VenusaurMega"
pkmndata_clean["NAME"] = pkmndata_clean["NAME"].apply(lambda x: re.sub(r'(.+Mega)(.+)',r'\1', x))

# Remove Blanks from all the Names
# \s+ matches one or more empty spaces. We replace them with '' (nothing).
pkmndata_clean["NAME"] = pkmndata_clean["NAME"].apply(lambda x: re.sub(r'\s+','', x))
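
The substitutions can be tried on individual strings without pandas. In this sketch, `clean_name` is a hypothetical helper that applies the same patterns in the same order, and the sample names are written in the style described above (assumed, not read from the CSV).

```python
import re

def clean_name(name):
    """Hypothetical helper: apply the notebook's substitutions, in order, to one name."""
    for suffix in ("Forme", "Cloak", "Rotom", "Size"):
        name = re.sub(r'(.+)(' + suffix + ')', r'\1', name)   # drop the trailing word
    name = re.sub(r'(Hoopa)(.+)', r'\2', name)                # drop the first "Hoopa"
    name = re.sub(r'(.+Mega)(.+)', r'\1', name)               # cut everything after "Mega"
    return re.sub(r'\s+', '', name)                           # remove the blanks

samples = ["VenusaurMega Venusaur", "GiratinaAltered Forme", "HoopaHoopa Confined",
           "RotomHeat Rotom", "CharizardMega Charizard X", "CharizardMega Charizard Y",
           "Pikachu"]

for raw in samples:
    print(raw, "->", clean_name(raw))
```

Both Charizard entries come out as `CharizardMega`, because the Mega pattern discards the trailing `X` and `Y` along with the repeated name.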

Verify the cleanup

In [None]:
# We check the duplicates again to see if our name cleaning worked.
dupid_data_clean = pkmndata_clean[pkmndata_clean.duplicated("ID", keep = False)]
print("Pokemons with Duplicate IDs :", len(dupid_data_clean))

dupids_clean = dupid_data_clean["ID"].unique()
print("Unique Pokemons with DupIDs :", len(dupids_clean))
print()

# Print the list to scan for remaining errors
# We expect to see nice clean names now (e.g., "Venusaur" and "VenusaurMega").
print("# \t Count \t List of Pokemons with Duplicate IDs")
print()
for dupid_clean in dupids_clean:
    dupid_list_clean = list(dupid_data_clean[dupid_data_clean["ID"] == dupid_clean]["NAME"])
    print(dupid_clean, "\t", len(dupid_list_clean), "\t", dupid_list_clean)

Manual Edge-Case Fixing

In [None]:
"""
Objective: Some errors are too specific for a pattern (RegEx). 
We must fix them manually. 

Problem: The "Mega" pattern above cuts off everything after "Mega", including the
trailing X / Y, so the two Mega forms of Charizard (and of Mewtwo) end up with the same name.
"""

# Check the current state of Charizard (ID 6) and Mewtwo (ID 150)
print(pkmndata_clean[pkmndata_clean["ID"] == 6]["NAME"])
print(pkmndata_clean[pkmndata_clean["ID"] == 150]["NAME"])

In [None]:
# .loc[row_index, column] = value
# NOTE: The numbers 7, 8, 163, 164 are the DataFrame INDICES (Row Numbers), not the Pokemon IDs.
# We are hard-coding these specific rows to have the correct names.

# Fix the X,Y labels for Charizard and Mewtwo
pkmndata_clean.loc[7,"NAME"] = "CharizardMegaX"
pkmndata_clean.loc[8,"NAME"] = "CharizardMegaY"
pkmndata_clean.loc[163,"NAME"] = "MewtwoMegaX"
pkmndata_clean.loc[164,"NAME"] = "MewtwoMegaY"
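
Hard-coded row labels only work while the rows stay in the same order. As an alternative sketch, not what this notebook does, the Mega pattern could keep an optional trailing `X` or `Y`. The helper `clean_mega` and the sample names below are hypothetical.

```python
import re

def clean_mega(name):
    """Hypothetical helper: keep '<Name>Mega' plus an optional trailing ' X' or ' Y'."""
    # Group 1: everything up to and including "Mega"
    # Group 2 (optional): a final " X" or " Y"; an unmatched group is replaced by ''
    kept = re.sub(r'^(.+Mega) .+?( [XY])?$', r'\1\2', name)
    return kept.replace(" ", "")

for raw in ["CharizardMega Charizard X", "MewtwoMega Mewtwo Y",
            "VenusaurMega Venusaur", "Meganium"]:
    print(raw, "->", clean_mega(raw))
```

Names without a `Mega` part in the middle, such as `Meganium`, do not match the pattern and pass through unchanged.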

Set NAME as the Index of the DataFrame

In [None]:
# Before: Rows were labeled 0, 1, 2, 3...
# After: Rows are labeled "Bulbasaur", "Ivysaur", "Venusaur"...
# This makes searching intuitive: df.loc['Pikachu'] is easier than looking up a row number
pkmndata_clean = pkmndata_clean.set_index('NAME')

# Print the DataFrame to check
# .sample(n=10) grabs 10 random rows. 
# It's sometimes better than .head() because it shows you data from the middle/end too.
pkmndata_clean.sample(n = 10)

In [None]:
# Check the Variable Information
# Confirming everything is the right data type and shape.
pkmndata_clean.info()

#### Tackle the Missing Values

Note that `TYPE_2` has only 414 values, instead of the overall 800. Let's fill-in the missing values with the string `NoType` for clarity about single/dual types.     


Missing values are generally represented as `NaN` in numeric arrays, `None` or `NaN` in object arrays, `NaT` in datetime. In certain cases, the missing values may mean the data is not available or not required (as in here). But it may also be errors from data acquisition or data processing. We should check for that.

In [None]:
# .isnull() returns a True/False table (True = Missing).
# .sum() counts the Trues. 
# We expect TYPE_2 to have many missing values (monotype Pokemon).
pkmndata_clean.isnull().sum()

In [None]:
"""
Handling Missing Data

We have two choices:
    1. Drop the rows (Bad! We lose half our Pokemon).
    2. Fill the empty space with a label.

We choose option 2. We fill NaNs with the string "NoType".
"""

# .fillna(value=...): The function to fill N/A (Not Available) values.
# We assign the result back to the column; calling fillna(inplace=True) on a selected
# column is unreliable in recent pandas versions and may leave the DataFrame unchanged.
pkmndata_clean["TYPE_2"] = pkmndata_clean["TYPE_2"].fillna(value = "NoType")

In [None]:
# Check the Clean Dataset
# We run .info() again to see the "Non-Null Count".
# Before: TYPE_2 might have said "414 non-null".
# After: It should say "800 non-null" (matching the Total entries).
pkmndata_clean.info()
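
The before-and-after check can be reproduced on a tiny frame. This is a minimal sketch with three made-up rows; it fills the column by assigning the result of `fillna()` back, one common way to write this step.

```python
import numpy as np
import pandas as pd

# Made-up rows: one dual-type and two monotype Pokemon
toy = pd.DataFrame({"TYPE_1": ["Grass", "Fire", "Water"],
                    "TYPE_2": ["Poison", np.nan, np.nan]})

missing_before = toy["TYPE_2"].isnull().sum()

# Fill the gaps with a label and store the result back in the column
toy["TYPE_2"] = toy["TYPE_2"].fillna("NoType")

missing_after = toy["TYPE_2"].isnull().sum()
print(missing_before, missing_after)   # 2 0
print(toy["TYPE_2"].value_counts())
```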

In [None]:
# Check the filled column
# .value_counts() counts how many Pokemon belong to each Type 2 category.
# No .dropna() is needed any more: every missing value now carries the "NoType" label.
print(pkmndata_clean["TYPE_2"].value_counts())

---

## Pokemons worth Exploring

![My Favorites](images/PokemonMyFabs.png)

Of course, we all have our favourite Pokemons -- mine are Pikachu, Jigglypuff, Togepi, Bulbasaur and Snorlax -- as you can tell from the image above.    

In [None]:
# My Favorites (entirely based on cuteness index, and not on their power)
pkmndata_clean.loc[["Pikachu", "Jigglypuff", "Togepi", "Bulbasaur", "Snorlax"]]

However, there are some other Pokemons worth exploring -- especially the strongest and the weakest Pokemons, perhaps for each type and generation.

#### Strongest and Weakest Pokemons

In [None]:
# Strongest Pokemons -- the Top 10
# Ascending = False --> Descending
pkmndata_clean.sort_values('TOTAL', ascending=False).head(10)

In [None]:
# Weakest Pokemons -- the Bottom 10
# Ascending = True --> Low to High
pkmndata_clean.sort_values('TOTAL', ascending=True).head(10)
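
For top-N questions, pandas also offers `nlargest()` and `nsmallest()`. The sketch below compares them with the sort-then-head approach on a made-up table; the TOTAL values are illustrative and not read from the dataset.

```python
import pandas as pd

# Made-up table indexed by NAME, with illustrative TOTAL values
toy = pd.DataFrame(
    {"TOTAL": [318, 525, 780, 180, 600]},
    index=pd.Index(["Bulbasaur", "Venusaur", "MewtwoMegaX", "Sunkern", "Dragonite"],
                   name="NAME"),
)

top2_sorted = toy.sort_values("TOTAL", ascending=False).head(2)   # sort everything, keep 2
top2_direct = toy.nlargest(2, "TOTAL")                            # ask for the top 2 directly
bottom2 = toy.nsmallest(2, "TOTAL")                               # and the bottom 2

print(top2_direct)
print(bottom2)
```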

#### Strongest and Weakest Pokemons -- Legendary and Non-Legendary

In [None]:
# Strongest Legendary Pokemons -- the Top 10
pkmndata_clean[pkmndata_clean["LEGENDARY"] == True].sort_values('TOTAL', ascending=False).head(10)

In [None]:
# Weakest Legendary Pokemons -- the Bottom 10
pkmndata_clean[pkmndata_clean["LEGENDARY"] == True].sort_values('TOTAL', ascending=True).head(10)

In [None]:
# Strongest Non-Legendary Pokemons -- the Top 10
pkmndata_clean[pkmndata_clean["LEGENDARY"] == False].sort_values('TOTAL', ascending=False).head(10)

In [None]:
# Weakest Non-Legendary Pokemons -- the Bottom 10
pkmndata_clean[pkmndata_clean["LEGENDARY"] == False].sort_values('TOTAL', ascending=True).head(10)

#### Strongest and Weakest Pokemons -- Across Generations

In [None]:
# Strongest Pokemons in each Generation -- the Top 10
generation = 1
pkmndata_clean[pkmndata_clean["GENERATION"] == generation].sort_values('TOTAL', ascending=False).head(10)

In [None]:
# Weakest Pokemons in each Generation -- the Bottom 10
generation = 1
pkmndata_clean[pkmndata_clean["GENERATION"] == generation].sort_values('TOTAL', ascending=True).head(10)
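
Instead of changing `generation` by hand and re-running the cell, all generations can be handled in one pass. This is a sketch on made-up rows (placeholder names `P1` to `P5`), showing two ways to get the strongest entry per generation.

```python
import pandas as pd

# Made-up rows: placeholder names, three generations
toy = pd.DataFrame(
    {"GENERATION": [1, 1, 2, 2, 3], "TOTAL": [318, 680, 405, 600, 530]},
    index=pd.Index(["P1", "P2", "P3", "P4", "P5"], name="NAME"),
)

# Option 1: idxmax() returns the row label of the largest TOTAL in each group
strongest = toy.loc[toy.groupby("GENERATION")["TOTAL"].idxmax()]

# Option 2: sort once, then keep the first n rows of every group
n = 1
top_n = toy.sort_values("TOTAL", ascending=False).groupby("GENERATION").head(n)

print(strongest)
print(top_n)
```

Raising `n` in the second option gives a top-N list per generation.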

#### Strength of Pokemons over various Types

Which combination of types is the strongest? To answer this, we need to aggregate the data.

We will calculate the **Average Total Stats** for every pair of `TYPE_1` and `TYPE_2`.

In [None]:
"""
.groupby(['TYPE_1', 'TYPE_2']):
   Imagine sorting the Pokemon into physical buckets based on their pair of types
   Bucket 1: "Fire + Flying" (Charizard, Moltres...)
   Bucket 2: "Grass + Poison" (Bulbasaur, Vileplume...)
   
.mean():
   Inside each bucket, calculate the average of ALL numerical columns
   (e.g., The average Attack of all Fire/Flying Pokemon)

.loc[:, 'TOTAL']:
   We only care about the 'TOTAL' strength for this specific analysis
   .loc[ : , 'TOTAL' ] means: "Keep all the Groups (rows), but give me only the 'TOTAL' column"
"""
# Calculate the Mean Total for every Type Pair
total_means = pkmndata_clean.groupby(['TYPE_1', 'TYPE_2']).mean().loc[:, 'TOTAL']

"""
The `total_means` variable is currently a "Series" with a Multi-Index (Type 1 and Type 2 are the labels).
To read it like a normal table, we need to massage the data.

.reset_index():
   Converts the labels (Type 1, Type 2) back into regular columns. 
   Now it looks like a standard DataFrame (Table) again.

.sort_values('TOTAL', ascending=False):
   Sort by the Total Strength. 
   ascending=False means "High to Low" (Descending).

.head(10):
   Show only the first 10 rows.
"""
print("Top 10 Strongest Type Combinations:")
print(total_means.reset_index().sort_values('TOTAL', ascending=False).head(10).round(2))

# Initialize the figure size
f = plt.figure(figsize=(10, 10))
"""
Currently, `total_means` is a "Long List":
  Fire-Water: 500
  Fire-Grass: 450
  ...

We need a "Wide Matrix" where:
  - Rows    = TYPE_1
  - Columns = TYPE_2
  - Values  = Average TOTAL

.unstack() performs this pivot. It takes the inner index (TYPE_2) and rotates it to become the columns.
"""
# Draw the Heatmap
sb.heatmap(
    total_means.unstack(),     # The Data (Converted to a Matrix)
    linewidths = 1,            # Gap between squares
    annot = True,              # Show the actual numbers inside the squares
    fmt = ".0f",               # Format the numbers: ".0f" means 0 decimal places (Whole numbers)
    annot_kws = {"size": 8},   # Size of the text inside the squares
    cmap = "BuGn"              # Color Map: Blue to Green (Darker = Stronger)
)
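
The same grid can be cross-checked with `pivot_table()`, which groups, averages, and pivots in a single call. This is a sketch on made-up rows; selecting `'TOTAL'` before calling `.mean()` is a variation on the notebook's approach that averages only the column needed.

```python
import pandas as pd

# Made-up rows with illustrative TOTAL values
toy = pd.DataFrame({
    "TYPE_1": ["Fire", "Fire", "Grass", "Water"],
    "TYPE_2": ["Flying", "Flying", "Poison", "NoType"],
    "TOTAL":  [500, 600, 400, 300],
})

# Long form: one average per (TYPE_1, TYPE_2) pair
total_means = toy.groupby(["TYPE_1", "TYPE_2"])["TOTAL"].mean()

# Wide form: rows = TYPE_1, columns = TYPE_2, NaN where a pair does not occur
grid = total_means.unstack()

# The same matrix in a single call
pivot = toy.pivot_table(values="TOTAL", index="TYPE_1", columns="TYPE_2", aggfunc="mean")

print(grid)
```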

#### Strength of Legendary Pokemons over various Types

In [None]:
# Compute the Average TOTAL across every pair of TYPEs
total_means = pkmndata_clean[pkmndata_clean["LEGENDARY"] == True].groupby(['TYPE_1', 'TYPE_2']).mean().loc[:, 'TOTAL']

# Strongest Pokemons in each Pair of Types -- the Top 10
print(total_means.reset_index().sort_values('TOTAL', ascending=False).head(10).round(2))

# Heatmap of Average TOTAL across every pair of TYPEs
f = plt.figure(figsize=(10, 10))
sb.heatmap(total_means.unstack(), linewidths = 1,
           annot = True, fmt = ".0f", annot_kws = {"size": 8}, cmap = "BuGn")

## Essential Steps in Exploratory Data Analysis (EDA)

### 1. **Setup & Initial Inspection**

- **Loading Data**: Using `pandas.read_csv()` to convert raw files into a structured DataFrame.

- **Structure Analysis**: Using `.shape` to check dimensions (rows/columns) and `.info()` to inspect column names, data types (int vs object), and non-null counts.

- **Categorical Overview**: Using `.unique()` to count distinct values and `.value_counts()` to identify frequency distributions (e.g., imbalance in Pokemon Generations).

### 2. **Univariate Analysis (One Variable)**

- **Categorical Visualization**: Using `sb.catplot(kind='count')` to plot frequency bars for categorical variables (e.g., 'Type 1' or 'Generation').

- **Numerical Distributions**: Using the 'Big Three' plots to understand stats:

    - **Boxplot** (`sb.boxplot`): To identify quartiles and spot outliers.

    - **Histogram** (`sb.histplot`): To see the frequency distribution shape (Normal vs. Skewed).

    - **Violin Plot** (`sb.violinplot`): To combine boxplot range with density estimation.

- **Statistical Summary**: Using `.describe()` to view key metrics (mean, std, min, max) for all numerical features.

### 3. **Data Cleaning & Pre-processing**

- **Duplicate Management**: Using `.duplicated(keep=False)` to flag and inspect all instances of repeated IDs before deciding how to handle them.

- **Column Standardization**: Using `.rename()`, `.str.upper()`, and `.str.replace()` to convert messy headers into clean, upper-case names with underscores (e.g., changing 'Sp. Atk' to 'SP_ATK').

- **Pattern Matching (Regex)**: Using `re.sub()` inside a `.apply(lambda x: ...)` function to strip unwanted substrings (like "Forme" and "Cloak") from text data.

- **Manual Correction**: Using `.loc[row_index, col_name]` to manually fix specific edge cases that automated cleaning misses (e.g., 'CharizardMegaX' and 'CharizardMegaY').

- **Handling Missing Values**: Using `.isnull().sum()` to detect gaps and `.fillna()` to fill missing data (e.g., filling missing 'TYPE_2' with the 'NoType' string) rather than deleting rows.

### 4. Bivariate & Multi-Variate Analysis

- **Correlation Analysis**: Using `.corr()` to calculate linear relationships and `sb.heatmap()` to visualize the correlation matrix (e.g., identifying if Defense relates to Speed).

- **Filtering & Sorting**: Using Boolean indexing (e.g., `data['Legendary'] == True`) to isolate subsets and `.sort_values(ascending=False)` to rank data (e.g., finding the strongest Pokemon).

- **Aggregation**: Using `.groupby(['Col1', 'Col2']).mean()` to calculate statistics for specific combinations (e.g., Average Stats for every Type pair). 

- **Data Pivoting**: Using `.unstack()` to transform a long list of grouped values into a matrix/grid format, enabling 2-D Heatmap Visualization.