# Phase 1: Dataset Wrangling and Describing

Here, we load the main dataset, join it with additional data, and clean it.

The main dataset, [`FirstGenPokemon.csv`](https://www.kaggle.com/datasets/dizzypanda/gen-1-pokemon), comes from Kaggle. The additional data comes from the Smogon website's [stats directory](https://www.smogon.com/stats/), specifically the Gen 1 OU usage stats for [January 2016](https://www.smogon.com/stats/2016-01/gen1ou-0.txt) and [January 2025](https://www.smogon.com/stats/2025-01/gen1ou-0.txt). Smogon is a community-run project dedicated to competitive Pokemon and is a comprehensive, well-known resource for that scene.


In [1]:
# * Import necessary modules
import pandas as pd

In [2]:

# * Load main data
main_df = pd.read_csv(r"dataset/FirstGenPokemon.csv")

# ! Strip stray spaces around column names (the raw headers carry trailing spaces)
main_df.rename(columns=lambda label: label.strip(' '), inplace=True)

main_df.info()
main_df.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151 entries, 0 to 150
Data columns (total 35 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Number       151 non-null    int64  
 1   Name         151 non-null    object 
 2   Types        151 non-null    int64  
 3   Type1        151 non-null    object 
 4   Type2        62 non-null     object 
 5   Height(m)    151 non-null    float64
 6   Weight(kg)   151 non-null    float64
 7   Male_Pct     151 non-null    float64
 8   Female_Pct   151 non-null    float64
 9   Capt_Rate    151 non-null    int64  
 10  Exp_Points   151 non-null    int64  
 11  Exp_Speed    151 non-null    object 
 12  Base_Total   151 non-null    int64  
 13  HP           151 non-null    int64  
 14  Attack       151 non-null    int64  
 15  Defense      151 non-null    int64  
 16  Special      151 non-null    int64  
 17  Speed        151 non-null    int64  
 18  Normal_Dmg   151 non-null    float64
 19  Fire_Dmg

Unnamed: 0,Number,Name,Types,Type1,Type2,Height(m),Weight(kg),Male_Pct,Female_Pct,Capt_Rate,...,Poison_Dmg,Ground_Dmg,Flying_Dmg,Psychic_Dmg,Bug_Dmg,Rock_Dmg,Ghost_Dmg,Dragon_Dmg,Evolutions,Legendary
0,1,Bulbasaur,2,grass,poison,0.7,6.9,87.5,12.5,45,...,1.0,1.0,2.0,2.0,4.0,1.0,1,1,2,0
1,2,Ivysaur,2,grass,poison,1.0,13.0,87.5,12.5,45,...,1.0,1.0,2.0,2.0,4.0,1.0,1,1,2,0
2,3,Venusaur,2,grass,poison,2.0,100.0,87.5,12.5,45,...,1.0,1.0,2.0,2.0,4.0,1.0,1,1,2,0
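
The header cleanup in the cell above can also be written with the string accessor on the column index. This is a minimal sketch on a toy frame; the frame and its column names are made up for illustration and are not taken from the CSV.

```python
import pandas as pd

# Toy frame with stray spaces around the headers (hypothetical example data)
toy_df = pd.DataFrame({"Name ": ["Bulbasaur"], " HP": [45]})

# Index.str.strip() removes leading and trailing whitespace from every label
toy_df.columns = toy_df.columns.str.strip()

cleaned_columns = list(toy_df.columns)
print(cleaned_columns)
```

Unlike `strip(' ')`, `str.strip()` with no argument also removes tabs and newlines, which may or may not be wanted.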


In [6]:
# * Load additional data
gen1_2016_df = pd.read_csv(r"dataset/2016-01-gen1ou-0.csv")
gen1_2025_df = pd.read_csv(r"dataset/2025-01-gen1ou-0.csv")

# * Rename '%' and '%.1' columns
# ! The source has two '%' headers (after 'Raw' and 'Real'); pandas reads the second one as '%.1'
# ? Do we instead remove these?
gen1_2016_df.rename(columns={'%': 'Raw%', '%.1': 'Real%'}, inplace=True)
gen1_2025_df.rename(columns={'%': 'Raw%', '%.1': 'Real%'}, inplace=True)

# * Tag all but the 'Pokemon' column in our additional data by year
# ! We rename the 'Pokemon' column to 'Name' to sync it with the main data
gen1_2016_df.rename(columns=lambda label: f"{label} (2016)" if label != 'Pokemon' else 'Name', inplace=True)
gen1_2025_df.rename(columns=lambda label: f"{label} (2025)" if label != 'Pokemon' else 'Name', inplace=True)

gen1_2016_df.head(3)

Unnamed: 0,Rank (2016),Name,Usage% (2016),Raw (2016),Raw% (2016),Real (2016),Real% (2016)
0,1,Alakazam,53.49071%,6390,53.491%,5475,57.607%
1,2,Tauros,52.33551%,6252,52.336%,4448,46.801%
2,3,Chansey,50.71154%,6058,50.712%,4886,51.410%
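
The percentage columns in the preview above hold strings such as `53.49071%`, so they would need converting before any numeric work. The following is a minimal sketch of one way to do that, assuming every value in those columns ends in a single `%` sign. The helper name `percent_to_float` and the two-row sample frame are illustrative and not part of the notebook.

```python
import pandas as pd

def percent_to_float(column: pd.Series) -> pd.Series:
    """Convert strings like '53.491%' to floats like 53.491 (assumed format)."""
    # Values that still fail to parse become NaN instead of raising
    return pd.to_numeric(column.str.rstrip('%'), errors='coerce')

# Small sample copied from the preview above
sample_df = pd.DataFrame({
    'Name': ['Alakazam', 'Tauros'],
    'Usage% (2016)': ['53.49071%', '52.33551%'],
})

sample_df['Usage% (2016)'] = percent_to_float(sample_df['Usage% (2016)'])
print(sample_df.dtypes)
```

The same helper could be applied to each `Usage%`, `Raw%` and `Real%` column of both yearly frames; `errors='coerce'` is a design choice that trades silent NaNs for robustness, so it is worth checking for new missing values afterwards.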


In [7]:
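# * Join data by Pokemon name
# ! An outer join keeps Pokemon that appear in only one source, leaving NaN in the other source's columns
joined_df = main_df.merge(gen1_2016_df, on='Name', how='outer').merge(gen1_2025_df, on='Name', how='outer')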
joined_df

Unnamed: 0,Number,Name,Types,Type1,Type2,Height(m),Weight(kg),Male_Pct,Female_Pct,Capt_Rate,...,Raw (2016),Raw% (2016),Real (2016),Real% (2016),Rank (2025),Usage% (2025),Raw (2025),Raw% (2025),Real (2025),Real% (2025)
0,63.0,Abra,1.0,psychic,,0.9,19.5,75.0,25.0,200.0,...,7.0,0.059%,7.0,0.074%,125.0,0.00418%,2.0,0.004%,2.0,0.005%
1,142.0,Aerodactyl,2.0,rock,flying,1.8,59.0,87.5,12.5,45.0,...,180.0,1.507%,134.0,1.410%,31.0,1.53656%,735.0,1.537%,605.0,1.469%
2,65.0,Alakazam,1.0,psychic,,1.5,48.0,75.0,25.0,50.0,...,6390.0,53.491%,5475.0,57.607%,6.0,38.02525%,18189.0,38.025%,16057.0,38.994%
3,24.0,Arbok,1.0,poison,,3.5,65.0,50.0,50.0,90.0,...,56.0,0.469%,43.0,0.452%,65.0,0.38884%,186.0,0.389%,147.0,0.357%
4,59.0,Arcanine,1.0,fire,,1.9,155.0,75.0,25.0,75.0,...,629.0,5.265%,466.0,4.903%,35.0,1.27106%,608.0,1.271%,493.0,1.197%
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149,70.0,Weepinbell,2.0,grass,poison,1.0,6.4,50.0,50.0,120.0,...,,,,,77.0,0.20697%,99.0,0.207%,65.0,0.158%
150,110.0,Weezing,1.0,poison,,1.2,9.5,50.0,50.0,60.0,...,64.0,0.536%,54.0,0.568%,58.0,0.55191%,264.0,0.552%,219.0,0.532%
151,40.0,Wigglytuff,1.0,normal,,1.0,12.0,25.0,75.0,50.0,...,23.0,0.193%,16.0,0.168%,78.0,0.20069%,96.0,0.201%,79.0,0.192%
152,145.0,Zapdos,2.0,electric,flying,1.6,52.6,0.0,0.0,3.0,...,2165.0,18.123%,1678.0,17.656%,9.0,24.54739%,11742.0,24.547%,10278.0,24.960%
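
Because the joins use `how='outer'`, a name spelled differently between sources ends up as an extra row with missing values. The index above runs to 152 even though the main data has 151 rows, which suggests a few such mismatches. Below is a minimal sketch of one way to find them, using the `indicator` option of `merge` on toy frames; the frames, the rank value and the mismatched spelling are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the main data and one of the usage tables
left_df = pd.DataFrame({'Name': ['Abra', 'Mr. Mime'], 'Number': [63, 122]})
right_df = pd.DataFrame({'Name': ['Abra', 'Mr Mime'], 'Rank': [125, 80]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
checked_df = left_df.merge(right_df, on='Name', how='outer', indicator=True)

# Rows that did not find a partner in the other frame
unmatched_df = checked_df[checked_df['_merge'] != 'both']
unmatched_names = sorted(unmatched_df['Name'])
print(unmatched_df[['Name', '_merge']])
```

Running the same check on the real frames would list the names to reconcile before the join, after which the merged frame should have one row per Pokemon.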
