# Predicting Legendary Status of a Pokemon Using Logistic Regression

## Overview

The goal of this project is to predict whether a Pokemon is legendary. The dataset is available on Kaggle (https://www.kaggle.com/abcsds/pokemon). It contains basic stats for 721 Pokemon, such as HP, Attack, Defense, Special Attack, Special Defense, and Speed.

#### Data Used
- pokemon_data_science.csv

#### Procedure to Train Logistic Regression Model

1. Prepare and transform the data
2. Split the dataset into train and validation sets (80/20)
3. Use an ensemble of decision trees to find the top features for prediction
4. Tune the C value (inverse of regularization strength) using cross-validation
5. Instantiate the model using the optimal value found in the previous step
6. Fit the model and calculate its accuracy on the train and validation data

#### Results
In the end, I trained a Logistic Regression model with the following features: 
- Egg_Group_1_Undiscovered
- Gender_Genderless
- Total
- Gender_Male
- Sp_Atk
- Height_m
- Attack
- Weight_kg
- Defense
- Egg_Group_1_Mineral

The final train accuracy is 98.96%, with the validation accuracy at 99.31%.

## Model Selection Approach

I want to predict if a Pokemon is legendary by solving a binary classification problem. Below are 3 options with their respective pros and cons.

##### Logistic Regression
Pros:
- Easy and efficient to compute
- Doesn't strictly require input features to be scaled (although scaling can change the fit once regularization is applied)
- Easy to interpret
- Can trade off between false positive and false negative errors by changing the threshold while still using the same model

Cons:
- Requires careful feature engineering
- Relatively simple model so predictions might not be good enough for complex data

##### CART
Pros:
- Very easy to interpret, even for users without a technical background
- Automatically selects the significant variables for us
- Can deliver nonlinear predictions

Cons: 
- If we want to modify the loss matrix, we need to refit the model
- Generally not as accurate as other easy-to-use methods such as logistic regression
- Not robust: slight changes in the training data can lead to a different tree

##### Random Forest
Pros:
- Powerful algorithm that uses the wisdom of crowds principle
- Low variance leads to stable and more accurate predictions

Cons:
- Uses more computational power due to the need to build many trees
- Hard to interpret

In the end, I decided to go with logistic regression because this problem is a relatively simple one that could be solved with great results using a simple model. Moreover, when the proper features are selected, logistic regression can deliver accurate predictions with limited computational requirements. 

## Libraries


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import ExtraTreesClassifier
from sklearn import metrics
from sklearn.linear_model import LogisticRegressionCV


## Data Import

In [2]:
# read the CSV and store data in a dataframe
stats = pd.read_csv('pokemon_data_science.csv')

# quick look at the dataframe
stats.head()

| | Number | Name | Type_1 | Type_2 | Total | HP | Attack | Defense | Sp_Atk | Sp_Def | ... | Color | hasGender | Pr_Male | Egg_Group_1 | Egg_Group_2 | hasMegaEvolution | Height_m | Weight_kg | Catch_Rate | Body_Style |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | ... | Green | True | 0.875 | Monster | Grass | False | 0.71 | 6.9 | 45 | quadruped |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | ... | Green | True | 0.875 | Monster | Grass | False | 0.99 | 13.0 | 45 | quadruped |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | ... | Green | True | 0.875 | Monster | Grass | True | 2.01 | 100.0 | 45 | quadruped |
| 3 | 4 | Charmander | Fire | | 309 | 39 | 52 | 43 | 60 | 50 | ... | Red | True | 0.875 | Monster | Dragon | False | 0.61 | 8.5 | 45 | bipedal_tailed |
| 4 | 5 | Charmeleon | Fire | | 405 | 58 | 64 | 58 | 80 | 65 | ... | Red | True | 0.875 | Monster | Dragon | False | 1.09 | 19.0 | 45 | bipedal_tailed |


## Data Preparation

#### Handling null values

First, let's see if there are any null values in the dataset.

In [6]:
stats.isnull().sum()

Number                0
Name                  0
Type_1                0
Type_2              371
Total                 0
HP                    0
Attack                0
Defense               0
Sp_Atk                0
Sp_Def                0
Speed                 0
Generation            0
isLegendary           0
Color                 0
hasGender             0
Pr_Male              77
Egg_Group_1           0
Egg_Group_2         530
hasMegaEvolution      0
Height_m              0
Weight_kg             0
Catch_Rate            0
Body_Style            0
dtype: int64

After some research on Pokemon, I found out that not all Pokemon have a secondary type or a second egg group. Therefore, null values are expected in Type_2 and Egg_Group_2. I will replace the null values with 'NA' for these two features later on.

Let's dig a little deeper into the Pr_Male column.

In [7]:
stats[stats['Pr_Male'].isnull()]['hasGender'].unique()

array([False])

In [8]:
stats[stats['Pr_Male'].notnull()]['hasGender'].unique()

array([ True])

Pr_Male is null only when a Pokemon is genderless. It is hard to find a value as a replacement for nulls in Pr_Male. Therefore, I decided to create a new (expected) Gender column to replace Pr_Male.

If Pr_Male >= 0.5, the Pokemon is classified as Male. If Pr_Male < 0.5, it is classified as Female. If Pr_Male is null, it is classified as Genderless.

hasGender and Pr_Male will be dropped later to prevent multicollinearity.



In [9]:
# if Pr_Male >= 0.5, then Male
# if Pr_Male < 0.5, then Female
# if null, then Genderless
stats['Gender'] = np.where(stats['Pr_Male']>=0.5, 'Male', 
                           np.where(stats['Pr_Male'].notnull(), 'Female', 'Genderless'))
stats['Gender'].value_counts()

Male          597
Genderless     77
Female         47
Name: Gender, dtype: int64

Now, I can safely replace the remaining null values with 'NA'.

In [10]:
stats.fillna('NA', inplace=True)

# check
stats.isnull().sum()

Number              0
Name                0
Type_1              0
Type_2              0
Total               0
HP                  0
Attack              0
Defense             0
Sp_Atk              0
Sp_Def              0
Speed               0
Generation          0
isLegendary         0
Color               0
hasGender           0
Pr_Male             0
Egg_Group_1         0
Egg_Group_2         0
hasMegaEvolution    0
Height_m            0
Weight_kg           0
Catch_Rate          0
Body_Style          0
Gender              0
dtype: int64
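
The cell above fills every remaining null in the dataframe, including Pr_Male, which is dropped shortly afterwards. As a minimal sketch of a more targeted alternative (not what this notebook does), the fill could be limited to the two categorical columns where a missing value is meaningful. The `toy` dataframe and `cat_cols` list below are hypothetical stand-ins with illustrative values, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Magnemite"],
    "Type_2": ["Poison", np.nan, "Steel"],
    "Egg_Group_2": ["Grass", "Dragon", np.nan],
    "Pr_Male": [0.875, 0.875, np.nan],
})

# Fill only the categorical columns where "missing" is a valid category
cat_cols = ["Type_2", "Egg_Group_2"]
toy[cat_cols] = toy[cat_cols].fillna("NA")

# Pr_Male is left untouched, so it keeps its float dtype and its null
print(toy.isnull().sum())
print(toy["Pr_Male"].dtype)
```

Restricting the fill this way keeps numeric columns numeric, which may matter if they are used before being dropped.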

I will drop the unnecessary columns before further processing the dataset.

In [11]:
# Catch_Rate is dropped because it will not be used per the given instructions;
# Pr_Male and hasGender are dropped because Gender now replaces them
stats.drop(columns=['Catch_Rate', 'Pr_Male', 'hasGender'], inplace=True)

#### Handling categorical variables

The next step is to convert boolean values and categorical variables into numeric values.

In [12]:
# now, let's see what columns we need to make changes to
stats.dtypes

Number                int64
Name                 object
Type_1               object
Type_2               object
Total                 int64
HP                    int64
Attack                int64
Defense               int64
Sp_Atk                int64
Sp_Def                int64
Speed                 int64
Generation            int64
isLegendary            bool
Color                object
Egg_Group_1          object
Egg_Group_2          object
hasMegaEvolution       bool
Height_m            float64
Weight_kg           float64
Body_Style           object
Gender               object
dtype: object

In [13]:
stats['isLegendary'] = stats['isLegendary'].astype('int64')
stats['hasMegaEvolution'] = stats['hasMegaEvolution'].astype('int64')

When using One Hot Encoding to create dummy variables, we will only create k-1 variables for k distinct values to prevent multicollinearity.

In [14]:
# drop_first = True to prevent multicollinearity
col = ['Type_1', 'Type_2', 'Color', 'Egg_Group_1', 'Egg_Group_2', 'Body_Style', 'Gender']
stats = pd.get_dummies(stats, columns=col, drop_first=True)
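
To make the k-1 encoding concrete, here is a small sketch on a toy Gender column (the `toy`, `full`, and `reduced` names are illustrative, not part of the notebook). With `drop_first=True`, the first category in sorted order is dropped and acts as the baseline.

```python
import pandas as pd

# Toy column with the same three categories as the Gender feature
toy = pd.DataFrame({"Gender": ["Male", "Female", "Genderless", "Male"]})

# Full one-hot encoding: one column per category (k = 3)
full = pd.get_dummies(toy, columns=["Gender"])

# k-1 encoding: the first category (Female) becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=["Gender"], drop_first=True)

print(list(full.columns))     # ['Gender_Female', 'Gender_Genderless', 'Gender_Male']
print(list(reduced.columns))  # ['Gender_Genderless', 'Gender_Male']
```

A row with zeros in both remaining columns corresponds to the dropped category, so no information is lost.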

In [15]:
stats.dtypes

Number                           int64
Name                            object
Total                            int64
HP                               int64
Attack                           int64
Defense                          int64
Sp_Atk                           int64
Sp_Def                           int64
Speed                            int64
Generation                       int64
isLegendary                      int64
hasMegaEvolution                 int64
Height_m                       float64
Weight_kg                      float64
Type_1_Dark                      uint8
Type_1_Dragon                    uint8
Type_1_Electric                  uint8
Type_1_Fairy                     uint8
Type_1_Fighting                  uint8
Type_1_Fire                      uint8
Type_1_Flying                    uint8
Type_1_Ghost                     uint8
Type_1_Grass                     uint8
Type_1_Ground                    uint8
Type_1_Ice                       uint8
Type_1_Normal            

## Feature Engineering

Now that all features are in the proper format, we will create a dataframe with just the features to perform feature engineering.

In [16]:
# create training feature df
X = stats.copy()
X.drop(columns=['Number', 'Name', 'isLegendary'], inplace=True)
y = stats['isLegendary']
X_train, X_val, y_train, y_val = train_test_split(X, 
                                                  y, 
                                                  test_size=0.2, 
                                                  random_state=42)

I will use an ensemble of decision trees (extremely randomized trees), fit on the training split only, to find the 10 most important features for prediction.

In [17]:
standard_X = StandardScaler().fit_transform(X_train) 
model = ExtraTreesClassifier(n_estimators=100, random_state=42)

n_best_features=10
model.fit(standard_X, y_train)
important_features = model.feature_importances_
best_features = important_features.argsort()[-n_best_features:][::-1]
print(f"Top {n_best_features} best features: {best_features}")

# map the selected column indices back to feature names
features = [X.columns[i] for i in best_features]
features

Top 10 best features: [65 95  0 96  4  9  2 10  3 63]


['Egg_Group_1_Undiscovered',
 'Gender_Genderless',
 'Total',
 'Gender_Male',
 'Sp_Atk',
 'Height_m',
 'Attack',
 'Weight_kg',
 'Defense',
 'Egg_Group_1_Mineral']
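
The index array printed above is hard to read on its own. As a sketch of one possible alternative, the importances can be wrapped in a pandas Series indexed by column name, so that names and scores stay together. The data below is synthetic (generated with `make_classification`) and the `feat_i` column names are hypothetical; only the pattern carries over to the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for X_train / y_train (not the Pokemon data)
X_arr, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=42
)
X_toy = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

model = ExtraTreesClassifier(n_estimators=100, random_state=42)
model.fit(X_toy, y_toy)

# Pair each importance score with its column name, then keep the top 3
importances = pd.Series(model.feature_importances_, index=X_toy.columns)
top = importances.sort_values(ascending=False).head(3)
print(top)

top_features = top.index.tolist()
```

Here `top_features` plays the same role as the `features` list in the cell above.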

I will now exclude all the unselected features from the dataframe and prepare the final training and validation datasets.

In [18]:
X = stats[features]
y = stats['isLegendary']
X_train, X_val, y_train, y_val = train_test_split(X, 
                                                  y, 
                                                  test_size=0.2, 
                                                  random_state=42)
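
Because legendary Pokemon are rare, a purely random split can leave the two sets with different class ratios. The sketch below shows how the `stratify` argument of `train_test_split` could be used to preserve the ratio. This is an optional variation, not what the notebook does, and the labels are synthetic (50 positives out of 700, chosen only for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 50 positives out of 700 rows (illustrative counts)
rng = np.random.default_rng(42)
y_toy = np.array([1] * 50 + [0] * 650)
X_toy = rng.normal(size=(700, 4))

# stratify=y_toy keeps the positive rate the same in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("validation positive rate:", y_va.mean())
```

Switching the notebook to a stratified split would change which rows land in each set, so the reported accuracies would no longer apply as-is.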

## Model Building

I will use the LogisticRegressionCV class from sklearn, which is an estimator that has built-in cross-validation capabilities to automatically select the best hyper-parameters. 

In this case, the hyper-parameter I will tune is the C value, which describes the inverse of regularization strength. Smaller C values specify stronger regularization.
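
As a rough sketch of what the tuning involves, a fitted `LogisticRegressionCV` exposes the candidate grid as `Cs_` and the selected value as `C_`. The example below uses synthetic data and 5 folds to keep it quick; the `cv_model` name and the `max_iter=1000` setting are my own choices for the illustration, not settings from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic binary classification data (not the Pokemon data)
X_toy, y_toy = make_classification(n_samples=400, n_features=6, random_state=42)

# Cs=10 asks for a grid of 10 candidate C values on a log scale
cv_model = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000, random_state=42)
cv_model.fit(X_toy, y_toy)

print("Candidate C values:", cv_model.Cs_)
print("Selected C:", np.ravel(cv_model.C_)[0])
```

Printing the selected value is a quick way to see whether cross-validation favored strong regularization (small C) or weak regularization (large C).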

By default, the model will predict Y = 1 if p > 0.5. Depending on what the expected reward and loss are for correct and incorrect classifications respectively, we may choose to predict using a different threshold. However, to keep this model simple, we will use 0.5 as the threshold for now. 
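
For reference, a different threshold could be applied by working with predicted probabilities directly rather than calling `predict`. The sketch below uses synthetic data; `predict_with_threshold` is a hypothetical helper written for this illustration, and 0.3 is an arbitrary example value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data (roughly 10% positives), not the Pokemon data
X_toy, y_toy = make_classification(
    n_samples=400, n_features=6, weights=[0.9, 0.1], random_state=42
)
clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Probability of the positive class for every row
proba = clf.predict_proba(X_toy)[:, 1]


def predict_with_threshold(probabilities, threshold):
    """Hypothetical helper: label a row 1 when its probability exceeds the threshold."""
    return (probabilities > threshold).astype(int)


default_pred = predict_with_threshold(proba, 0.5)
lenient_pred = predict_with_threshold(proba, 0.3)

# Lowering the threshold can only keep or increase the number of predicted positives
print(default_pred.sum(), lenient_pred.sum())
```

A lower threshold trades more false positives for fewer false negatives, and the model itself does not need to be refit.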

In [19]:
logreg = LogisticRegressionCV(cv=10, random_state=42)
logreg.fit(X_train, y_train)

acc_train = logreg.score(X_train, y_train)
y_pred = logreg.predict(X_val)
acc_val = logreg.score(X_val, y_val)   

confusion_matrix = metrics.confusion_matrix(y_val, y_pred)

print('Train labeling accuracy:', str(round(acc_train*100,2)),'%')
print('Validation labeling accuracy:', str(round(acc_val*100,2)),'%')
print('\nConfusion matrix:\n', confusion_matrix)

Train labeling accuracy: 98.96 %
Validation labeling accuracy: 99.31 %

Confusion matrix:
 [[136   1]
 [  0   8]]
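
Accuracy alone can hide how the model treats the rare positive class, so it helps to read precision and recall off the confusion matrix as well. The sketch below recomputes them from the matrix printed above, relying on scikit-learn's layout of true labels in rows and predicted labels in columns.

```python
import numpy as np

# Confusion matrix from the validation output above
cm = np.array([[136, 1],
               [0, 8]])

# For binary labels the layout is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# accuracy=0.9931 precision=0.8889 recall=1.0000
```

In other words, all 8 legendary Pokemon in the validation set were found, and 1 non-legendary Pokemon was flagged as legendary.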


## Conclusion
In the end, I trained a Logistic Regression model with the following features: 
- Egg_Group_1_Undiscovered
- Gender_Genderless
- Total
- Gender_Male
- Sp_Atk
- Height_m
- Attack
- Weight_kg
- Defense
- Egg_Group_1_Mineral

10-fold cross-validation is used to select the L2 regularization strength. The training and validation sets are created using an 80/20 split.

The final train accuracy is 98.96%, with the validation accuracy at 99.31%.