# Pokémon Dataset – Exploratory Data Analysis

Disclaimer: `.csv` file downloaded from https://www.kaggle.com/datasets/rounakbanik/pokemon ([CC0: Public Domain](https://creativecommons.org/publicdomain/zero/1.0/)).

In [None]:
%pip install pandas numpy matplotlib seaborn scikit-learn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("pokemon.csv")

df.head()

## Basic Dataset Structure
> Note: this step can also be done in Data Explorer. The cells below show the same checks in code.

In [None]:
df.info()

In [None]:
df.describe(include='all')

In [None]:
# Let's check missing values (this can also be done with Data Explorer)

df.isna().sum().sort_values(ascending=False)
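Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of that pattern on a small made-up frame (`toy`, `n_missing`, and `pct_missing` are hypothetical names, and the values are not from the dataset); swapping `toy` for `df` should give the same table for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; values are made up for illustration
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "type2": ["poison", None, None, "flying"],
    "height_m": [0.7, np.nan, 1.0, 2.0],
})

# Count and percentage of missing values per column, largest first
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing)
```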

## Distribution of Core Battle Stats

In [None]:
stats = ["hp", "attack", "defense", "sp_attack", "sp_defense", "speed", "base_total"]

df[stats].hist(figsize=(12, 8), bins=20)
plt.tight_layout()
plt.show()


## Type Effectiveness Columns

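These columns encode the damage multiplier a Pokémon takes from attacks of each type:

- 0 = immune
- 0.5 = resistant
- 1 = normal
- 2 = weak

Dual-type Pokémon can also show 0.25 (double resistance) or 4 (double weakness).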

In [None]:
against_cols = [c for c in df.columns if c.startswith("against_")]
df[against_cols].describe().T
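One way to turn the multipliers into something comparable across Pokémon is to count how many attacking types hit harder or softer than normal. This is a rough sketch on made-up rows: `toy`, `n_weak`, and `n_resist` are hypothetical names, and treating every value above 1 as a weakness and every value below 1 as a resistance is an assumption of this example.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the dataset's against_* pattern
toy = pd.DataFrame({
    "name": ["A", "B"],
    "against_fire": [2.0, 0.5],
    "against_water": [0.5, 2.0],
    "against_ground": [0.0, 2.0],
})

against_cols = [c for c in toy.columns if c.startswith("against_")]

# Per Pokémon: number of attacking types above / below the normal multiplier of 1
summary = pd.DataFrame({
    "name": toy["name"],
    "n_weak": (toy[against_cols] > 1).sum(axis=1),
    "n_resist": (toy[against_cols] < 1).sum(axis=1),  # also counts immunities (0)
})

print(summary)
```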

## Type Distributions

### Primary Types

In [None]:
df["type1"].value_counts().plot(kind="bar", figsize=(12, 4), title="Primary Type Distribution")
plt.show()


### Secondary Types

In [None]:
df["type2"].value_counts().head(10).plot(kind="bar", figsize=(12, 4), title="Secondary Types (Top 10)")
plt.show()

## Combat Stats by Type

In [None]:
df.groupby("type1")[stats].mean().sort_values("base_total", ascending=False)

### Legendary vs Non-Legendary Comparison

In [None]:
df.groupby("is_legendary")[stats].mean().T

### Density Plots

In [None]:
import seaborn as sns

sns.kdeplot(data=df, x="base_total", hue="is_legendary", fill=True)
plt.show()

## Body Metrics (Height and Weight)

In [None]:
sns.scatterplot(data=df, x="weight_kg", y="height_m", hue="type1")
plt.show()

### Correlating Body Metrics with Stats
Is there a correlation between `base_total` and `weight_kg`, `height_m`, or the other `stats`?

In [None]:
df[["height_m", "weight_kg"] + stats].corr()["base_total"].sort_values(ascending=False)
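`DataFrame.corr()` defaults to Pearson correlation, which only measures linear association. When a variable such as weight spans several orders of magnitude, a rank-based (Spearman) correlation may be a useful cross-check. The sketch below uses made-up numbers (`toy` is hypothetical) in which the relationship is perfectly monotonic but far from linear.

```python
import pandas as pd

# Hypothetical values: base_total rises steadily while weight grows tenfold each step
toy = pd.DataFrame({
    "weight_kg": [1, 10, 100, 1000, 10000],
    "base_total": [200, 300, 400, 500, 600],
})

# Pearson measures linear association; Spearman works on ranks
pearson = toy.corr(method="pearson").loc["weight_kg", "base_total"]
spearman = toy.corr(method="spearman").loc["weight_kg", "base_total"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

On the real data, the same comparison would be `df[["height_m", "weight_kg"] + stats].corr(method="spearman")["base_total"]`.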

## Capture Difficulty

In [None]:
# Entries that cannot be parsed as numbers are coerced to NaN
df["capture_rate_clean"] = pd.to_numeric(df["capture_rate"], errors="coerce")
df["capture_rate_clean"].hist()
plt.show()
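Because `errors="coerce"` silently turns unparseable entries into `NaN`, it can be worth checking which raw values were dropped. A minimal sketch, using a made-up series (`raw`, `clean`, and `unparsed` are hypothetical names, and the last entry only mimics a free-text value):

```python
import pandas as pd

# Hypothetical raw values; the last one mimics a non-numeric, free-text entry
raw = pd.Series(["45", "255", "3", "30 (Meteorite)255 (Core)"], name="capture_rate")

clean = pd.to_numeric(raw, errors="coerce")

# Values that were present in the raw data but failed to parse
unparsed = raw[clean.isna() & raw.notna()]

print(unparsed.tolist())
```

Applied to the notebook's data, `raw` would be `df["capture_rate"]`.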

## Gender Analysis

### Gender Missingness (Genderless Pokémon)

In [None]:
# Genderless = percentage_male is NaN
genderless = df[df['percentage_male'].isna()]

print(f"Number of genderless Pokémon: {len(genderless)}\n")
genderless[['name', 'type1', 'type2', 'generation']]

### All-Female Pokémon

In [None]:
# All-female = percentage_male is exactly 0
all_female = df[df['percentage_male'] == 0.0]
print(f"There are {len(all_female)} all-female Pokémon (see below)")
print(all_female[['name', 'type1', 'type2']])


### Gender Ratio by Primary Type

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(data=df, x='type1', y='percentage_male')
plt.xticks(rotation=45)
plt.title("Distribution of Gender Ratios by Primary Type")
plt.show()

### Gender Ratio by Generation

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(data=df, x='generation', y='percentage_male')
plt.title("Gender Ratio Distribution Across Generations")
plt.show()

## Outliers in Stats

In [None]:
df[stats + ["height_m", "weight_kg"]].boxplot(figsize=(14, 6))
plt.show()
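The box plot flags outliers visually; to list them by name, one common convention is the 1.5 × IQR rule that box-plot whiskers are usually based on. The helper below is a sketch under that assumption (`iqr_outliers` and `toy` are hypothetical, and the weights are made up).

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Hypothetical weights in kg; the last value is deliberately extreme
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "weight_kg": [6.0, 8.5, 9.0, 13.0, 19.0, 950.0],
})

mask = iqr_outliers(toy["weight_kg"])
print(toy.loc[mask, "name"].tolist())
```

On the real data, something like `df.loc[iqr_outliers(df["weight_kg"]), ["name", "weight_kg"]]` would list the flagged rows; missing values compare as `False` and are therefore never flagged.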

## Questions to Explore with the Assistant

1. Are Flying-type Pokémon actually lighter or smaller than other Pokémon?
2. Do different generations create Pokémon with different average stat totals (`base_total`)?
3. Are male-biased or female-biased species consistently different in height or weight?
4. Are certain Pokémon types systematically lighter or smaller?
5. Do certain types or generations produce Pokémon with unusually high or low height-to-weight ratios?
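As a possible starting point for question 1, the sketch below flags a Pokémon as Flying if either type slot says so and compares median height and weight between the two groups. It runs on made-up rows: `toy` and `group` are hypothetical names, and lowercase type labels are an assumption of this example.

```python
import pandas as pd

# Hypothetical toy rows; replace toy with df to run against the real data
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "type1": ["normal", "rock", "bug", "water"],
    "type2": ["flying", None, "flying", None],
    "height_m": [0.3, 1.0, 1.1, 0.6],
    "weight_kg": [1.8, 105.0, 32.0, 8.5],
})

# A Pokémon counts as Flying if either type slot is "flying" (assumes lowercase labels)
is_flying = toy["type1"].eq("flying") | toy["type2"].eq("flying")
group = is_flying.map({True: "flying", False: "other"}).rename("group")

# Medians are less sensitive than means to a few very heavy species
medians = toy.groupby(group)[["height_m", "weight_kg"]].median()

print(medians)
```

A difference in medians is only descriptive; whether it holds up would still need a look at the full distributions or a statistical test.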