## Problem Statement:
1. Pokémon are adorable creatures that peacefully inhabited a planet until humans came along and made them battle each other to earn shiny badges and the title of Pokémon Master.
2. In this universe, there exists a group of rare and often strong Pokémon, known as Legendary Pokémon. Unfortunately, there are no detailed criteria that define these Pokémon.
3. The only way to recognize a Legendary Pokémon is through information from official media, such as the game or anime.
4. This data set covers 721 Pokémon across 800 rows (alternate forms such as Mega Venusaur are listed separately) and includes their number, name, first and second type, and base stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed. Whether a Pokémon is Legendary cannot be inferred from its Attack and Defense alone, so it is worth finding which variables characterize Legendary status. The strategy is to analyze the data and then perform a classification task that predicts whether a Pokémon is Legendary using a decision tree algorithm.

In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [3]:
df = pd.read_csv("Pokemon.csv")
df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   Name        800 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      414 non-null    object
 4   Total       800 non-null    int64 
 5   HP          800 non-null    int64 
 6   Attack      800 non-null    int64 
 7   Defense     800 non-null    int64 
 8   Sp. Atk     800 non-null    int64 
 9   Sp. Def     800 non-null    int64 
 10  Speed       800 non-null    int64 
 11  Generation  800 non-null    int64 
 12  Legendary   800 non-null    bool  
dtypes: bool(1), int64(9), object(3)
memory usage: 75.9+ KB


In [5]:
df.isnull().sum()

#               0
Name            0
Type 1          0
Type 2        386
Total           0
HP              0
Attack          0
Defense         0
Sp. Atk         0
Sp. Def         0
Speed           0
Generation      0
Legendary       0
dtype: int64

In [6]:
df.duplicated().sum()

0

In [7]:
## 1. How many pokemon are from the 5th generation?
df['Generation'].value_counts()

1    166
5    165
3    160
4    121
2    106
6     82
Name: Generation, dtype: int64

In [8]:
## 2. How many Pokémon have the highest defense score?
# Count the rows whose Defense equals the column maximum.
df.loc[df['Defense'] == df['Defense'].max()].shape[0]

3

In [9]:
## 3. How will you handle missing values in this dataset?
df.isnull().sum()

#               0
Name            0
Type 1          0
Type 2        386
Total           0
HP              0
Attack          0
Defense         0
Sp. Atk         0
Sp. Def         0
Speed           0
Generation      0
Legendary       0
dtype: int64
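
The cell above only counts the missing values; `Type 2` is the only column affected. One possible way to handle it, sketched here on a tiny hand-built frame taken from the rows shown in `df.head()`, is to treat a missing second type as its own category instead of dropping the row. The placeholder label `"None"` and the names `sample` and `filled` are assumptions made for this illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame copied from the first rows shown above; Charmander has no second type.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Charmander"],
    "Type 1": ["Grass", "Grass", "Fire"],
    "Type 2": ["Poison", "Poison", np.nan],
})

# Assumption: a missing Type 2 means "single-typed", so give it an explicit label.
filled = sample.copy()
filled["Type 2"] = filled["Type 2"].fillna("None")

print(filled["Type 2"].tolist())
print("Remaining missing values:", filled.isnull().sum().sum())
```

Filling keeps all rows, whereas dropping rows with a missing `Type 2` would discard 386 of the 800 entries.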

In [60]:
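# numeric_only=True skips the text columns (needed on pandas 2.0+, where corr() no longer drops them silently).
df.corr(numeric_only=True)['Generation']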

#             0.982516
Total         0.048384
HP            0.058683
Attack        0.051451
Defense       0.042419
Sp. Atk       0.036437
Sp. Def       0.028486
Speed        -0.023121
Generation    1.000000
Legendary     0.079794
Name: Generation, dtype: float64

- Which of the following models is the best fit for predicting whether a Pokémon is Legendary, given the steps below?
 * Handle the missing values.
 * Split the dataset into a 70:30 ratio with random_state set to 1.

In [61]:
df1 = df.copy()

In [62]:
# Sanity check: Bulbasaur's six base stats should add up to its Total (318).
45+49+49+65+65+45

318
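
The arithmetic above checks a single row by hand. A minimal sketch of the same check applied to every row at once is shown below, using the five rows printed by `df.head()` as stand-in data; the names `stats`, `sample`, and `total_matches` are introduced here for illustration only.

```python
import pandas as pd

stats = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

# The five rows shown by df.head(): Total followed by the six base stats.
sample = pd.DataFrame(
    [
        [318, 45, 49, 49, 65, 65, 45],
        [405, 60, 62, 63, 80, 80, 60],
        [525, 80, 82, 83, 100, 100, 80],
        [625, 80, 100, 123, 122, 120, 80],
        [309, 39, 52, 43, 60, 50, 65],
    ],
    columns=["Total"] + stats,
)

# Row-wise sum of the six stats, compared against the Total column.
total_matches = sample[stats].sum(axis=1).eq(sample["Total"]).all()
print("Total equals the sum of the base stats:", total_matches)
```

On the full data, replacing `sample` with `df` would run the same comparison over all 800 rows.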

In [63]:
df1.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


In [64]:
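# Drop every row with a missing Type 2 (386 of the 800 rows), leaving 414 rows.
df1.dropna(inplace=True)
# '#' is an identifier and 'Total' is derived from the six base stats.
df1.drop(columns=['#', 'Total'], inplace=True)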

In [65]:
df1.drop(columns=['Name'], inplace = True)
df1['Legendary'] = df1['Legendary'].astype(int)
df1.head()

Unnamed: 0,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Grass,Poison,45,49,49,65,65,45,1,0
1,Grass,Poison,60,62,63,80,80,60,1,0
2,Grass,Poison,80,82,83,100,100,80,1,0
3,Grass,Poison,80,100,123,122,120,80,1,0
6,Fire,Flying,78,84,78,109,85,100,1,0


In [66]:
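# Integer-encode each text column. Dropping and re-adding the column
# moves it to the end of the frame, as the output below shows.
cat = df1.select_dtypes(include='object')
le = LabelEncoder()
for col in cat.columns:
    df1.drop(columns=[col], inplace=True)
    df1[col] = le.fit_transform(cat[col])

df1.head()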

Unnamed: 0,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Type 1,Type 2
0,45,49,49,65,65,45,1,0,9,13
1,60,62,63,80,80,60,1,0,9,13
2,80,82,83,100,100,80,1,0,9,13
3,80,100,123,122,120,80,1,0,9,13
6,78,84,78,109,85,100,1,0,6,7
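
`LabelEncoder` assigns integer codes in alphabetical order (Grass becomes 9 and Fire becomes 6 above), and a linear model such as logistic regression will read those codes as an ordering. One alternative, which is not what this notebook does, is one-hot encoding. The sketch below uses `pd.get_dummies` on a small hand-built frame; `sample` and `encoded` are illustrative names.

```python
import pandas as pd

# Three rows taken from the encoded frame above, with the original type labels.
sample = pd.DataFrame({
    "Type 1": ["Grass", "Grass", "Fire"],
    "Type 2": ["Poison", "Poison", "Flying"],
    "HP": [45, 60, 78],
})

# One indicator column per category, so no artificial ordering is introduced.
encoded = pd.get_dummies(sample, columns=["Type 1", "Type 2"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The trade-off is width: with many distinct types per column, one-hot encoding adds one column per type, which tree-based models do not strictly need.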


In [67]:
df1.dtypes


HP            int64
Attack        int64
Defense       int64
Sp. Atk       int64
Sp. Def       int64
Speed         int64
Generation    int64
Legendary     int32
Type 1        int32
Type 2        int32
dtype: object

In [68]:
X = df1.drop(columns=['Legendary'])
y = df1['Legendary']

In [69]:
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                   test_size = 0.3,
                                                   random_state=1)
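
The split above is purely random. Because Legendary Pokémon are a small minority (the classification report further down shows 11 of the 125 test rows), a stratified split is one option for keeping the class ratio the same in both parts. The sketch below demonstrates the `stratify` argument of `train_test_split` on synthetic labels; `X_demo` and `y_demo` are made-up stand-ins, not the Pokémon data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 10 of them in the positive class.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 10 + [0] * 90)

# stratify=y_demo preserves the 10% positive rate in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

print("Positives in train:", y_tr.sum(), "of", len(y_tr))
print("Positives in test: ", y_te.sum(), "of", len(y_te))
```

Applied to the notebook, this would mean passing `stratify=y` in the cell above, which would change the reported accuracies.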

In [70]:
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Logistic Regression accuracy: {accuracy}')

Logistic Regression accuracy: 0.944


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
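
The warning above suggests either raising `max_iter` or scaling the features. A minimal sketch of doing both is shown below, wrapping `StandardScaler` and `LogisticRegression` in a pipeline. It runs on synthetic data from `make_classification`, so the printed accuracy says nothing about the Pokémon data; `X_demo`, `y_demo`, and `scaled_lr` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for X and y (assumption: not the Pokémon data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=9, weights=[0.9, 0.1], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# The scaler is fitted on the training part only, then applied to the test part.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

score = scaled_lr.score(X_te, y_te)
print(f"Accuracy on synthetic data: {score:.3f}")
print("Solver iterations used:", scaled_lr[-1].n_iter_[0])
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training.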


In [72]:
# No random_state is set, so the fitted tree (and its accuracy) can vary between runs.
model1 = DecisionTreeClassifier()
model1.fit(X_train, y_train)
y_pred1 = model1.predict(X_test)
accuracy1 = accuracy_score(y_test, y_pred1)
print(f'Decision Tree Classifier accuracy: {accuracy1}')

Decision Tree Classifier accuracy: 0.92


In [75]:
# No random_state is set, so the forest (and its accuracy) can vary between runs.
model2 = RandomForestClassifier()
model2.fit(X_train, y_train)
y_pred2 = model2.predict(X_test)
accuracy2 = accuracy_score(y_test, y_pred2)
print(f'Random Forest Classifier: {accuracy2}')

Random Forest Classifier: 0.936
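
On a 125-row test set, the three accuracies above differ by at most three rows (118, 115, and 117 correct predictions), so a single split is a thin basis for picking a winner. One way to get a steadier comparison is k-fold cross-validation. The sketch below uses `cross_val_score` on synthetic data; the data, the `candidates` dictionary, and the fixed `random_state` values are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for X and y (assumption: not the Pokémon data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, weights=[0.9, 0.1], random_state=1
)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=1),
}

# Mean accuracy over 5 folds for each candidate model.
cv_means = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

With a rare positive class, `scoring="recall"` or `scoring="f1"` may be more informative than accuracy.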


In [81]:
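# Detailed evaluation of the decision tree (model1).
print(classification_report(y_test, y_pred1))
# labels=[1, 0] puts the positive class (Legendary) first:
# row 0 is [TP, FN] and row 1 is [FP, TN].
cm1 = confusion_matrix(y_test, y_pred1, labels=[1, 0])
print('\nConfusion Matrix: \n', cm1)
# Sensitivity (recall for the Legendary class) = TP / (TP + FN)
sensitivity = cm1[0, 0] / (cm1[0, 0] + cm1[0, 1])
print('\nSensitivity: %.2f' % sensitivity)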

              precision    recall  f1-score   support

           0       0.95      0.96      0.96       114
           1       0.56      0.45      0.50        11

    accuracy                           0.92       125
   macro avg       0.75      0.71      0.73       125
weighted avg       0.91      0.92      0.92       125


Confusion Matrix: 
 [[  5   6]
 [  4 110]]

Sensitivity: 0.45
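
Sensitivity is the recall of the positive class, so it can also be obtained from `recall_score`, and specificity is the recall of the negative class. The sketch below rebuilds label arrays that reproduce the confusion matrix printed above (TP=5, FN=6, FP=4, TN=110) and computes both; `y_true` and `y_hat` are reconstructed for illustration and are not the notebook's `y_test` and `y_pred1`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Labels arranged to match the confusion matrix above: TP=5, FN=6, FP=4, TN=110.
y_true = np.array([1] * 11 + [0] * 114)
y_hat = np.array([1] * 5 + [0] * 6 + [1] * 4 + [0] * 110)

cm = confusion_matrix(y_true, y_hat, labels=[1, 0])

# Recall of class 1 is sensitivity; recall of class 0 is specificity.
sensitivity = recall_score(y_true, y_hat, pos_label=1)
specificity = recall_score(y_true, y_hat, pos_label=0)

print(cm)
print(f"Sensitivity: {sensitivity:.2f}")
print(f"Specificity: {specificity:.2f}")
```

The gap between the two values shows that the 0.92 accuracy is driven mostly by the majority class: the tree misses 6 of the 11 Legendary Pokémon in the test set.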
