In [1]:
'''
My project’s goal is to predict certain traits of Pokémon from their in-game features. 
All Pokémon have 6 base stats: health (HP), attack, defense, special attack, special defense, and speed.
These are stored as integers in game, and each base stat determines how a Pokémon grows 
when it levels up. Besides base stats, each Pokémon has a typing, which largely determines its abilities, 
strengths, and weaknesses. Base stats and typing normally go hand in hand. For example, Rock-type Pokémon 
typically have high defense but low speed. Using these stats and other trends, I’m planning to see if an 
algorithm can correctly predict a Pokémon's type from its stats. 

The data I’m using is a collection of all 898 Pokémon, including their typing, their base stat total 
(the sum of all 6 stats), each of their 6 individual stats, and the “stage” the Pokémon 
is currently at. The types, base stat total, individual stats, height, and weight will act as the features for each Pokémon. 

In this notebook, I'm looking to classify Pokémon as legendary or non-legendary. 


References:

https://medium.com/analytics-vidhya/evaluating-a-random-forest-model-9d165595ad56

https://towardsdatascience.com/introduction-to-machine-learning-with-pokemon-ccb7c9d1351b

https://towardsdatascience.com/identifying-legendary-pok%C3%A9mon-using-the-random-forest-algorithm-ed0904d07d64


'''


import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

pok = pd.read_csv('pokemon.csv')

pok

Unnamed: 0,#,Name,Stage,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Heights (m),Weight (kg),Generation
0,1.0,Bulbasaur,1,Grass,Poison,318,45,49,49,65,65,45,0.7,6.9,1
1,2.0,Ivysaur,1,Grass,Poison,405,60,62,63,80,80,60,1.0,13.0,1
2,3.0,Venusaur,2,Grass,Poison,525,80,82,83,100,100,80,2.0,100.0,1
3,4.0,Charmander,1,Fire,,309,39,52,43,60,50,65,0.6,8.5,1
4,5.0,Charmeleon,1,Fire,,405,58,64,58,80,65,80,1.1,19.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
979,,Glastrier,3,Ice,,580,100,145,130,65,110,30,2.2,800.0,8
980,,Spectrier,3,Ghost,,580,100,65,60,145,80,130,2.0,44.5,8
981,,Calyrex,4,Psychic,Grass,500,100,80,80,80,80,80,1.1,7.7,8
982,,Calyrex,3,Psychic,Ice,680,100,165,150,85,130,50,2.4,809.1,8


In [2]:
type(pok)

pandas.core.frame.DataFrame

In [3]:
poke = pok.to_numpy()
poke

array([[1.0, 'Bulbasaur', 1, ..., 0.7, 6.9, 1],
       [2.0, 'Ivysaur', 1, ..., 1.0, 13.0, 1],
       [3.0, 'Venusaur', 2, ..., 2.0, 100.0, 1],
       ...,
       [nan, 'Calyrex', 4, ..., 1.1, 7.7, 8],
       [nan, 'Calyrex', 3, ..., 2.4, 809.1, 8],
       [nan, 'Calyrex', 3, ..., 2.4, 53.6, 8]], dtype=object)

In [4]:
X = poke[:,3:14]

In [5]:
'''
I'm using one hot encoding to turn the types into features here. The loop below is used to help the machine
differentiate between a Pokémon's primary and secondary typing. "Type 2 Blank" indicates that a Pokémon does not have
a secondary typing, and is kind of a placeholder feature.
'''

import numpy as np
'''
ONLY RUN THIS CELL ONCE
'''
for i in range(len(X)):
    X[i,0] = 'Type 1 ' + X[i,0]
    if isinstance(X[i,1], str):
        X[i,1] = 'Type 2 ' + X[i,1]
    else:
        X[i,1] = 'Type 2 Blank'
  
X[0,:]

array(['Type 1 Grass', 'Type 2 Poison', 318, 45, 49, 49, 65, 65, 45, 0.7,
       6.9], dtype=object)
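
The loop above can only be run once because it rewrites X in place, so a second run would add the prefix again. A minimal sketch of a rerunnable alternative, assuming a small made-up table (`toy`) that reuses the 'Type 1' and 'Type 2' column names: `pd.get_dummies` can add the prefixes itself, and missing secondary types are filled with 'Blank' first.

```python
import pandas as pd

# Toy stand-in for the real table (hypothetical rows, same column names as pokemon.csv).
toy = pd.DataFrame({
    'Type 1': ['Grass', 'Fire', 'Psychic'],
    'Type 2': ['Poison', None, 'Ice'],
})

# Missing secondary types become an explicit 'Blank' category.
type2_filled = toy['Type 2'].fillna('Blank')

# prefix / prefix_sep build names like 'Type 1 Grass' without touching the source data,
# so this cell gives the same result no matter how often it is run.
type1_toy = pd.get_dummies(toy['Type 1'], prefix='Type 1', prefix_sep=' ', dtype=int)
type2_toy = pd.get_dummies(type2_filled, prefix='Type 2', prefix_sep=' ', dtype=int)
typing_toy = pd.concat([type1_toy, type2_toy], axis=1)

print(typing_toy.columns.tolist())
```

The resulting columns follow the same naming scheme as the `type1` and `type2` tables built in the next cells.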

In [6]:
'''
The one-hot encoding is done here. This turns a Pokémon's typing into yes/no features, one per type: 18 for the
primary type and 19 for the secondary type (the 18 types plus "Type 2 Blank"). For example, Bulbasaur, the first
Pokémon, is Grass/Poison. This means it will have a 1 in the "Type 1 Grass" column and the "Type 2 Poison" column.
'''

type1 = pd.get_dummies(X[:,0])
type1

Unnamed: 0,Type 1 Bug,Type 1 Dark,Type 1 Dragon,Type 1 Electric,Type 1 Fairy,Type 1 Fighting,Type 1 Fire,Type 1 Flying,Type 1 Ghost,Type 1 Grass,Type 1 Ground,Type 1 Ice,Type 1 Normal,Type 1 Poison,Type 1 Psychic,Type 1 Rock,Type 1 Steel,Type 1 Water
0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
979,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
980,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
981,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
982,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


In [7]:
type2 = pd.get_dummies(X[:,1])
type2

Unnamed: 0,Type 2 Blank,Type 2 Bug,Type 2 Dark,Type 2 Dragon,Type 2 Electric,Type 2 Fairy,Type 2 Fighting,Type 2 Fire,Type 2 Flying,Type 2 Ghost,Type 2 Grass,Type 2 Ground,Type 2 Ice,Type 2 Normal,Type 2 Poison,Type 2 Psychic,Type 2 Rock,Type 2 Steel,Type 2 Water
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
979,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
980,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
981,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
982,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


In [8]:
typing = pd.concat([type1, type2], axis=1)
typing

Unnamed: 0,Type 1 Bug,Type 1 Dark,Type 1 Dragon,Type 1 Electric,Type 1 Fairy,Type 1 Fighting,Type 1 Fire,Type 1 Flying,Type 1 Ghost,Type 1 Grass,...,Type 2 Ghost,Type 2 Grass,Type 2 Ground,Type 2 Ice,Type 2 Normal,Type 2 Poison,Type 2 Psychic,Type 2 Rock,Type 2 Steel,Type 2 Water
0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
979,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
980,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
981,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
982,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [9]:
stats = pok.loc[:,['Total','HP','Attack','Defense','Sp. Atk', 'Sp. Def', 'Speed','Heights (m)','Weight (kg)']]
stats

Unnamed: 0,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Heights (m),Weight (kg)
0,318,45,49,49,65,65,45,0.7,6.9
1,405,60,62,63,80,80,60,1.0,13.0
2,525,80,82,83,100,100,80,2.0,100.0
3,309,39,52,43,60,50,65,0.6,8.5
4,405,58,64,58,80,65,80,1.1,19.0
...,...,...,...,...,...,...,...,...,...
979,580,100,145,130,65,110,30,2.2,800.0
980,580,100,65,60,145,80,130,2.0,44.5
981,500,100,80,80,80,80,80,1.1,7.7
982,680,100,165,150,85,130,50,2.4,809.1


In [10]:
'''
Here I merge the tables to create one big feature table with all the important information about each Pokémon.
It encodes their typing and holds all their stats and physical features like height and weight.
'''

Xf = pd.concat([typing, stats], axis=1)
Xf

Unnamed: 0,Type 1 Bug,Type 1 Dark,Type 1 Dragon,Type 1 Electric,Type 1 Fairy,Type 1 Fighting,Type 1 Fire,Type 1 Flying,Type 1 Ghost,Type 1 Grass,...,Type 2 Water,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Heights (m),Weight (kg)
0,0,0,0,0,0,0,0,0,0,1,...,0,318,45,49,49,65,65,45,0.7,6.9
1,0,0,0,0,0,0,0,0,0,1,...,0,405,60,62,63,80,80,60,1.0,13.0
2,0,0,0,0,0,0,0,0,0,1,...,0,525,80,82,83,100,100,80,2.0,100.0
3,0,0,0,0,0,0,1,0,0,0,...,0,309,39,52,43,60,50,65,0.6,8.5
4,0,0,0,0,0,0,1,0,0,0,...,0,405,58,64,58,80,65,80,1.1,19.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
979,0,0,0,0,0,0,0,0,0,0,...,0,580,100,145,130,65,110,30,2.2,800.0
980,0,0,0,0,0,0,0,0,1,0,...,0,580,100,65,60,145,80,130,2.0,44.5
981,0,0,0,0,0,0,0,0,0,0,...,0,500,100,80,80,80,80,80,1.1,7.7
982,0,0,0,0,0,0,0,0,0,0,...,0,680,100,165,150,85,130,50,2.4,809.1


In [11]:
type(Xf)

pandas.core.frame.DataFrame

In [12]:
X_final = Xf.to_numpy()
X_final

array([[0.000e+00, 0.000e+00, 0.000e+00, ..., 4.500e+01, 7.000e-01,
        6.900e+00],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 6.000e+01, 1.000e+00,
        1.300e+01],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 8.000e+01, 2.000e+00,
        1.000e+02],
       ...,
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 8.000e+01, 1.100e+00,
        7.700e+00],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 5.000e+01, 2.400e+00,
        8.091e+02],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 1.500e+02, 2.400e+00,
        5.360e+01]])

In [13]:
y = poke[:,2]
y

array([1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2,
       1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1,
       2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1,
       1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 0, 1, 2, 1, 2, 1,
       2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2,
       1, 1, 1, 1, 0, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 0, 0, 1, 2, 0, 0,
       1, 2, 2, 2, 1, 1, 2, 1, 2, 0, 2, 3, 3, 3, 1, 1, 2, 3, 4, 1, 1, 2,
       1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1,
       1, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 1,
       2, 1, 0, 2, 2, 1, 2, 0, 1, 2, 1, 2, 0, 2, 0, 0, 1, 1, 2, 1, 2, 1,
       1, 0, 1, 2, 0, 2, 0, 1, 2, 2, 1, 2, 1, 0, 0, 1, 2, 1, 1, 1, 0, 2,
       3, 3, 3, 1, 1, 2, 3, 3, 4, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2,
       1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2,
       1, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2,

In [14]:
'''
In this cell, I'm making the classifications a bit simpler. I'm saying that any Pokémon that is not classified as 
legendary is in one classification, and legendaries and mythical Pokémon are in a separate category. Don't worry about
the classifications for this notebook, as they are explained in more detail in the next one.
'''
for i in range(len(y)):
    if y[i] < 3:
        y[i] = 0
    else:
        y[i] = 1    

y = y.astype('int')
y
    

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
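
Since `y = poke[:,2]` is a NumPy view, the loop above also overwrites the stage column inside `poke`. A minimal sketch of a vectorized alternative that leaves the source array untouched, assuming a short made-up array of stage codes (`stage`) in the same 0 to 4 scheme:

```python
import numpy as np

# Hypothetical stage codes, stored as objects like the column taken from poke.
stage = np.array([1, 2, 0, 3, 4, 2], dtype=object)

# astype returns a new array, so the comparison never modifies `stage`.
# Stages 3 and 4 map to 1 (legendary/mythical); everything else maps to 0.
is_legendary = (stage.astype(int) >= 3).astype(int)

print(is_legendary)
```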

In [15]:
'''
I'm just splitting the data into training and test sets here.
'''

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test =\
        train_test_split(X_final, y, test_size=0.3, random_state=1, stratify=y)
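
The `stratify=y` argument keeps the share of legendaries roughly equal in both splits. A small sketch to check this, assuming synthetic labels with the same 875 to 109 imbalance and a placeholder feature column (`X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels mirroring the notebook's class counts; the feature is a dummy index.
y_demo = np.array([0] * 875 + [1] * 109)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo)

# With stratification, the legendary fraction should be nearly identical everywhere.
print(len(y_tr), len(y_te))
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```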

In [16]:
'''
I decided to use a Random Forest Classifier because it's good at classification and handles high-dimensional
data well. We're working with 46 features here (18 primary-type columns, 19 secondary-type columns, and 9 stats),
so it is a reasonable choice.
'''

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100,max_depth=7)

model.fit(X_train,y_train)

RandomForestClassifier(max_depth=7)

In [17]:
y_predict = model.predict(X_test)

In [18]:
'''
The results we get here are strong. On the test set, the model correctly predicted whether a Pokémon is legendary
or not about 97.6% of the time!
'''

accuracy_score(y_test, y_predict)

0.9763513513513513
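
To put that accuracy in context, it helps to compare it with a model that always answers "not legendary". A rough sketch of the arithmetic, using the test-set counts that appear in the confusion matrix in In [19] (the variable names here are only illustrative):

```python
# Test-set class counts, read off the rows of the confusion matrix in In [19].
n_regular = 259 + 4      # truly non-legendary
n_legendary = 3 + 30     # truly legendary
n_total = n_regular + n_legendary

# Always guessing "not legendary" is right on every regular Pokémon and wrong on the rest.
baseline_accuracy = n_regular / n_total

# The model's accuracy is the diagonal of the confusion matrix over the total.
model_accuracy = (259 + 30) / n_total

print(round(baseline_accuracy, 4), round(model_accuracy, 4))
```

Under these counts the always-negative baseline already reaches roughly 89%, so the model's gain is smaller than the raw accuracy suggests.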

In [19]:
'''
In a confusion matrix, the goal is to see where the machine made mistakes. The rows indicate the true labels while 
the columns are the predicted labels. So, the first row and first column holds the correct predictions for 
non-legendary Pokémon, while the first row, second column holds the false positives, meaning the machine predicted
a legendary, but it was not. The second row works the same way for the true legendaries.

The confusion matrix here tells us that the random forest classifier made only 7 mistakes on 296 test Pokémon:
4 regular Pokémon were predicted to be legendary, and 3 legendaries were predicted to be regular. That is not bad,
all things considered. There is one important factor to note...
'''

confusion_matrix(y_test, y_predict)

array([[259,   4],
       [  3,  30]])
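
Precision and recall for the legendary class can be read straight off this matrix. A minimal sketch, hard-coding the matrix printed above; in the notebook itself, `classification_report(y_test, y_predict)` (already imported in the first cell) should report the same figures.

```python
import numpy as np

# Confusion matrix copied from the output above: rows are true labels, columns are predictions.
cm = np.array([[259, 4],
               [3, 30]])

# For a 2x2 matrix, ravel() yields true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the predicted legendaries, how many really are
recall = tp / (tp + fn)      # of the real legendaries, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
```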

In [20]:

'''
The classifications I'm using are broken up into 5 categories, but for this exercise, Pokémon with a stage of 0, 1,
or 2 are regular and Pokémon with a stage of 3 or 4 are legendary. If you combine these values, there are 875
non-legendary Pokémon to 109 legendaries, which is quite the imbalance. Because about 89% of the data is
non-legendary, high accuracy is easier to reach, so accuracy alone overstates how well the model finds legendaries.
'''

classification = pok.loc[:,'Stage']
classification.value_counts()

1    427
2    348
0    100
3     81
4     28
Name: Stage, dtype: int64
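
One common way to account for an imbalance like this is to weight the classes inversely to their frequency. The sketch below only computes those weights for the 875 to 109 split, using scikit-learn's `compute_class_weight`; passing `class_weight='balanced'` to `RandomForestClassifier` applies the same rule. This is an optional variation, not what the notebook does, and `y_counts` is a made-up label array.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with the same class counts as the combined stages above.
y_counts = np.array([0] * 875 + [1] * 109)

# 'balanced' weights are n_samples / (n_classes * count_of_class).
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_counts)

print(weights)
```

With these counts a legendary example would weigh about eight times as much as a regular one.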

In [21]:
importances = model.feature_importances_
std = np.std([tree.feature_importances_ for tree in model.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

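# Show only the top 11 features. (X is the raw 11-column slice from In [4], not the
# 46-column matrix the model was trained on, so its width should not drive this loop.)
for f in range(11):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))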

Feature ranking:
1. feature 37 (0.362481)
2. feature 43 (0.099575)
3. feature 41 (0.086623)
4. feature 45 (0.074993)
5. feature 44 (0.073477)
6. feature 38 (0.064956)
7. feature 42 (0.063271)
8. feature 40 (0.054927)
9. feature 39 (0.048310)
10. feature 14 (0.008260)
11. feature 2 (0.007314)


In [24]:
most_important = Xf.iloc[:,[37,43,41,44,38,45,39,40,14,2]]
most_important


Unnamed: 0,Total,Speed,Sp. Atk,Heights (m),HP,Weight (kg),Attack,Defense,Type 1 Psychic,Type 1 Dragon
0,318,45,65,0.7,45,6.9,49,49,0,0
1,405,60,80,1.0,60,13.0,62,63,0,0
2,525,80,100,2.0,80,100.0,82,83,0,0
3,309,65,60,0.6,39,8.5,52,43,0,0
4,405,80,80,1.1,58,19.0,64,58,0,0
...,...,...,...,...,...,...,...,...,...,...
979,580,30,65,2.2,100,800.0,145,130,0,0
980,580,130,145,2.0,100,44.5,65,60,0,0
981,500,80,80,1.1,100,7.7,80,80,1,0
982,680,50,85,2.4,100,809.1,165,150,1,0


In [None]:

'''
Here, we see the most important factors that determine whether a Pokémon is legendary or not. The base stat total is 
by far the most important feature (about 0.36), which makes sense, as a legendary is overall much stronger than the
average Pokémon. Speed and special attack come next, followed by weight and height, which also makes sense, since
legendaries tend to be bigger and heavier. Among the type features, Type 1 Psychic and Type 1 Dragon rank highest,
but their importances (under 0.01 each) are tiny, so this is only a weak hint that legendaries lean toward those types.
'''