Let's load up our data; fortunately our missing data is semantically zero, so we can fill in NaNs with zeros.

In [1]:
import pandas as pd

DATA = '/kaggle/input/pokemon-go/pokemon.csv'

df = pd.read_csv(filepath_or_buffer=DATA, index_col=['pokemon_id']).fillna(value=0)
df.head()

pokemon_id,pokemon_name,base_attack,base_defense,base_stamina,type,rarity,charged_moves,fast_moves,candy_required,distance,...,base_flee_rate,dodge_probability,max_pokemon_action_frequency,min_pokemon_action_frequency,found_egg,found_evolution,found_wild,found_research,found_raid,found_photobomb
1,Bulbasaur,118,111,128,"['Grass', 'Poison']",Standard,"['Sludge Bomb', 'Seed Bomb', 'Power Whip']","['Vine Whip', 'Tackle']",0.0,3,...,-1.0,0.15,1.6,0.2,True,False,True,True,True,True
2,Ivysaur,151,143,155,"['Grass', 'Poison']",Standard,"['Sludge Bomb', 'Solar Beam', 'Power Whip']","['Razor Leaf', 'Vine Whip']",25.0,3,...,-1.0,0.15,1.6,0.2,False,True,True,True,True,True
3,Venusaur,198,189,190,"['Grass', 'Poison']",Standard,"['Sludge Bomb', 'Petal Blizzard', 'Solar Beam']","['Razor Leaf', 'Vine Whip']",100.0,3,...,-1.0,0.15,1.6,0.2,False,True,True,True,True,True
4,Charmander,116,93,118,['Fire'],Standard,"['Flame Charge', 'Flame Burst', 'Flamethrower']","['Ember', 'Scratch']",0.0,3,...,-1.0,0.15,1.6,0.2,True,False,True,True,True,True
5,Charmeleon,158,126,151,['Fire'],Standard,"['Fire Punch', 'Flame Burst', 'Flamethrower']","['Ember', 'Fire Fang']",25.0,3,...,-1.0,0.15,1.6,0.2,False,True,True,True,True,True
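
Before filling NaNs with zeros it can help to see which columns actually have gaps. The following is a minimal sketch on a made-up three-row frame, not the real dataset: the values are loosely modeled on the rows above, and the NaN in `candy_required` is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the NaN placement is assumed, not taken from the dataset.
toy_df = pd.DataFrame({
    'pokemon_id': [1, 2, 3],
    'candy_required': [np.nan, 25.0, 100.0],
    'base_attack': [118, 151, 198],
}).set_index('pokemon_id')

# Count missing values per column and keep only the columns that have any.
missing_per_column = toy_df.isna().sum()
columns_with_gaps = missing_per_column[missing_per_column > 0]
print(columns_with_gaps)

# Same fill as in the cell above: every NaN becomes 0.
filled = toy_df.fillna(value=0)
print(filled)
```

Running the same two `isna()` lines on `df` before the `fillna` call would show whether zero is a sensible fill for each affected column.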


Ideally we would visualize base attack/defense/stamina in three dimensions, but we don't have a good way to do that in a Python notebook, so let's use dimension reduction to project the three attributes down to two dimensions.

In [2]:
import arrow
from umap import UMAP

time_start = arrow.now()
umap = UMAP(random_state=2024, verbose=False, n_jobs=1, low_memory=False, n_epochs=201)
base_df = df[['pokemon_name', 'base_attack', 'base_defense', 'base_stamina', 'rarity',]].copy()
base_df[['x', 'y']] = umap.fit_transform(X=base_df[['base_attack', 'base_defense', 'base_stamina',]])
print('done with UMAP in {}'.format(arrow.now() - time_start))

done with UMAP in 0:00:11.269131


Now we can visualize rarity in terms of base attributes, sort of.

In [3]:
from plotly import express

express.scatter(data_frame=base_df, x='x', y='y', color='rarity', hover_name='pokemon_name')

Wow. Almost everything is standard rarity. And with a couple of exceptions our base attack/defense/stamina is a pretty good predictor of rarity.
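
To put a number on "almost everything", `value_counts` gives the share of each rarity class. This is a small sketch on a made-up series: the labels other than `Standard` are hypothetical placeholders, and the 80/20 split is not the real distribution.

```python
import pandas as pd

# Hypothetical rarity column; only 'Standard' is a label known from the data above.
toy_rarity = pd.Series(['Standard'] * 8 + ['Rare A', 'Rare B'], name='rarity')

# Absolute counts and relative shares per class.
counts = toy_rarity.value_counts()
shares = toy_rarity.value_counts(normalize=True)
print(counts)
print(shares)
```

On the real data the equivalent call would be `df['rarity'].value_counts(normalize=True)`.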

Let's build a model to predict rarity, and let's use all of the data we sensibly can.

Unfortunately we don't have enough data to distinguish among the different kinds of rare beasts, so we need to clean up our target variable a little.

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Collapse every non-standard rarity into a single 'Not Standard' class.
df['target'] = df['rarity'].apply(func=lambda x: 'Not Standard' if x != 'Standard' else x)
# Drop the name, the list-valued columns, and the label columns from the features.
features = df.drop(columns=['pokemon_name', 'type', 'rarity', 'charged_moves', 'fast_moves', 'target'])
X_train, X_test, y_train, y_test = train_test_split(features, df['target'], test_size=0.2, random_state=2024, stratify=df['target'])

logreg = LogisticRegression(max_iter=100000, tol=1e-12).fit(X_train, y_train)
y_pred = logreg.predict(X=X_test)  # predict once and reuse for every metric
print('model fit in {} iterations'.format(logreg.n_iter_[0]))
print('accuracy: {:5.4f}'.format(accuracy_score(y_true=y_test, y_pred=y_pred)))
print('f1: {:5.4f}'.format(f1_score(average='weighted', y_true=y_test, y_pred=y_pred)))
print(classification_report(y_true=y_test, y_pred=y_pred))

model fit in 1412 iterations
accuracy: 0.9851
f1: 0.9853
              precision    recall  f1-score   support

Not Standard       0.90      0.95      0.92        19
    Standard       0.99      0.99      0.99       183

    accuracy                           0.99       202
   macro avg       0.95      0.97      0.96       202
weighted avg       0.99      0.99      0.99       202
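
Because the classes are so unbalanced, it is worth comparing these scores against a majority-class baseline. The sketch below rebuilds a label vector from the support counts in the report above (19 and 183) and scores a `DummyClassifier` that always predicts the most frequent class. The placeholder feature matrix is an assumption, since the dummy model ignores its inputs, and fitting on the test-sized labels is a shortcut that assumes the stratified training split has the same majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Labels rebuilt from the support column of the classification report.
y_test_like = np.array(['Not Standard'] * 19 + ['Standard'] * 183)
# Placeholder features: DummyClassifier does not look at them.
X_placeholder = np.zeros((len(y_test_like), 1))

baseline = DummyClassifier(strategy='most_frequent').fit(X_placeholder, y_test_like)
baseline_pred = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_true=y_test_like, y_pred=baseline_pred)
baseline_macro_f1 = f1_score(y_true=y_test_like, y_pred=baseline_pred, average='macro', zero_division=0)
print('baseline accuracy: {:5.4f}'.format(baseline_accuracy))
print('baseline macro f1: {:5.4f}'.format(baseline_macro_f1))
```

Always guessing `Standard` already scores 183/202, about 0.906 accuracy, so the per-class recall and the macro average say more about the model than the headline accuracy does.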



Which of our pokemon attributes contribute the most to distinguishing rare from non-rare pokemon?

In [5]:
express.bar(x=X_train.columns, y=logreg.coef_[0])
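
One caveat when reading this chart: the model above is fit on unscaled features, so the size of each raw coefficient depends on the units of its column. A common way to make coefficients more comparable is to standardize the features first. The following is a minimal sketch of that idea on synthetic data, not on the Pokemon frame; the column names `small_scale` and `large_scale` and the toy labels are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features that matter equally but live on very different scales.
rng = np.random.default_rng(2024)
n = 200
toy_X = pd.DataFrame({
    'small_scale': rng.normal(0.0, 1.0, n),
    'large_scale': rng.normal(0.0, 1000.0, n),
})
toy_y = ((toy_X['small_scale'] + toy_X['large_scale'] / 1000.0) > 0).astype(int)

# Standardize, then fit; the coefficients now refer to features with unit variance.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(toy_X, toy_y)
coefficients = pd.Series(scaled_model.named_steps['logisticregression'].coef_[0], index=toy_X.columns)

# Rank features by absolute standardized coefficient.
ranked = coefficients.abs().sort_values(ascending=False)
print(ranked)
```

Applied to the notebook, the same pipeline could be fit on `X_train` and `y_train`, and `ranked` could be passed to the plotting call in place of the raw coefficients.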