# <center>Predicting Pokemon Battle Outcomes</center>
### <center>Michael Bailey, Robert Calkins, Matt Falzon</center>

<p align="center">
    <img src="https://github.com/mgfalzon/Final-Tutorial/blob/main/all_pokemons.png?raw=1#center" width="500" height="300">
</P>

### <b>What are Pokémon?</b> <br>

<p>
Pokémon is a video game franchise centered around fictional creatures called 'Pokémon', which humans catch, train, and battle for sport. In the game, the player is tasked with building a team of strong Pokémon to challenge other 'Pokémon Trainers'.
</p>

<p>
Each Pokémon has a unique look, a typing, and a set of base stats, and can be identified by a unique number. The typing follows a rock-paper-scissors-like mechanic where, for example, a Water-type Pokémon has an advantage over a Fire-type Pokémon. The stats determine how much health, strength, and speed a Pokémon has. Since each Pokémon is different, it can be hard to tell without experience which one would have the edge in a battle.
</p>
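
The rock-paper-scissors idea can be sketched as a small lookup table. This is only an illustration: `EFFECTIVENESS` and `type_multiplier` are hypothetical names, the chart covers just three types, and any pair not listed is assumed to be neutral.

```python
# Toy subset of the type chart (Water > Fire > Grass > Water).
# Only these three types are covered; this is not the full chart.
EFFECTIVENESS = {
    ("Water", "Fire"): 2.0,
    ("Fire", "Grass"): 2.0,
    ("Grass", "Water"): 2.0,
    ("Fire", "Water"): 0.5,
    ("Grass", "Fire"): 0.5,
    ("Water", "Grass"): 0.5,
}

def type_multiplier(attacking_type, defending_type):
    """Damage multiplier for one attacking type against one defending type."""
    # Pairs missing from the toy chart are treated as neutral (1x).
    return EFFECTIVENESS.get((attacking_type, defending_type), 1.0)

print(type_multiplier("Water", "Fire"))    # advantage
print(type_multiplier("Fire", "Water"))    # disadvantage
print(type_multiplier("Water", "Normal"))  # not in the toy chart, so neutral
```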

### <b>Why do we want to predict Pokémon battle outcomes?</b>

<p>
While predicting Pokémon battle outcomes might not have broader implications beyond getting better at playing Pokémon, this problem is not unlike many others we see in everyday life. Sports players and teams are described by sets of stats much like those given to a Charizard. If we can train a model to predict the outcome of a battle between two Pokémon, why would we not be able to extend this to two sports teams or players?
</p>

### <b>How are we going to use Data Science?</b>

<p>
This tutorial seeks to explore the relationship between a Pokemon's characteristics and its win percentage in simulated battles. In order to perform our analysis we'll be looking at 3 characteristics: a Pokemon's type, its base stats, and its legendary status. We want to know if we can use these characteristics to predict the outcome of future battles. The Kaggle dataset 'Pokemon-Weedle's Cave' by user terminus7 contains two files which will be used to perform our analysis. The first file contains the Pokemon characteristics and the second contains information about previous battles.
</p>


### <b>Technology</b>

In this tutorial we'll be using Python and the following libraries. Feel free to follow these links to learn more!

- [Python](https://www.python.org/)
- [pandas](https://pandas.pydata.org/)
- [numpy](https://numpy.org/)
- [matplotlib](https://matplotlib.org/)
- [scikit-learn](https://scikit-learn.org/stable/)
- [seaborn](https://seaborn.pydata.org/)
- [requests](https://requests.readthedocs.io/en/master/)



In [2]:
import numpy as np
import pandas as pd
import requests
import json
import matplotlib.pyplot as plt
import seaborn as sns

## Data Collection

We are using a dataset found on Kaggle. You can find and download the CSV files here:

[Pokemon Weedle's Cave](https://www.kaggle.com/terminus7/pokemon-challenge)

#### Data Size and Content

<p>
This dataset contains two CSV files. One contains entries for all 800 Pokemon listed in the generation 6 Pokedex (a digital encyclopedia of information about Pokemon). Each Pokemon has 6 base stats and 1 or 2 types. The table also records each Pokemon's generation (which set of games a Pokemon first appeared in) and its legendary status.

The other file contains 50,000 simulated battles. The first two columns contain the ids of the combatants and the third column contains the id of the winner. The Pokemon in the first column attacked first.
</p>
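
Given that layout, two quick checks are natural: that every winner is one of the two combatants, and how often the Pokemon that attacked first wins. The sketch below runs on a toy frame (`toy_combats`, a hypothetical name) whose five rows mirror the first rows of `combats.csv`; the same expressions should carry over to the full table, assuming the column names are unchanged.

```python
import pandas as pd

# Toy rows in the same shape as combats.csv
toy_combats = pd.DataFrame({
    "First_pokemon":  [266, 702, 191, 237, 151],
    "Second_pokemon": [298, 701, 668, 683, 231],
    "Winner":         [298, 701, 668, 683, 151],
})

# Every winner should be one of the two combatants in its row
winner_is_combatant = (
    (toy_combats["Winner"] == toy_combats["First_pokemon"])
    | (toy_combats["Winner"] == toy_combats["Second_pokemon"])
)
print(winner_is_combatant.all())

# Share of battles won by the Pokemon that attacked first
first_mover_win_rate = (toy_combats["Winner"] == toy_combats["First_pokemon"]).mean()
print(first_mover_win_rate)
```

Five rows say nothing about the real first-mover effect; the point is only the shape of the check.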

In [3]:
path = "https://raw.githubusercontent.com/mgfalzon/Final-Tutorial/main"
combats = pd.read_csv(f"{path}/combats.csv")
pokemon = pd.read_csv(f"{path}/pokemon.csv")

In [5]:
pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,5,Charmander,Fire,,39,52,43,60,50,65,1,False


In [6]:
combats.head()

Unnamed: 0,First_pokemon,Second_pokemon,Winner
0,266,298,298
1,702,701,701
2,191,668,668
3,237,683,683
4,151,231,151


## Data Processing

### Tidy up Pokemon

<p>
Now that we know what our data represents, let's take a closer look at the Pokemon table to ensure that none of the data is missing. We will have to fill in any gaps that exist.
</p>

In [7]:
pokemon.isna().sum()

#               0
Name            1
Type 1          0
Type 2        386
HP              0
Attack          0
Defense         0
Sp. Atk         0
Sp. Def         0
Speed           0
Generation      0
Legendary       0
dtype: int64

<p>It turns out we have one Pokemon that is missing its name. We will have to find the name and add it back manually. We can do this by locating the row's index and querying a Pokemon database for the information. The 386 missing Type 2 values are expected: they indicate Pokemon that have only one type, so we will leave them as they are for now.</p>
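
Because a missing `Type 2` means "single-typed" rather than "unknown", it can be turned into an explicit feature instead of being imputed. The sketch below uses a toy frame; the `dual_type` and `Type 2 filled` columns and the `"None"` placeholder are assumptions for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Name":   ["Bulbasaur", "Charmander", "Mankey"],
    "Type 1": ["Grass", "Fire", "Fighting"],
    "Type 2": ["Poison", np.nan, np.nan],
})

# Flag Pokemon that actually have a second type
toy["dual_type"] = toy["Type 2"].notna()

# Hypothetical placeholder label, only needed if a later step cannot handle NaN
toy["Type 2 filled"] = toy["Type 2"].fillna("None")

print(toy)
```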

In [8]:
missing = pokemon[pokemon['Name'].isna()]
missing

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
62,63,,Fighting,,65,105,60,60,70,95,1,False


In [10]:
#Take a look at the surrounding Pokemon
miss_id = missing.index[0]
pokemon[miss_id - 2 : miss_id + 2]

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
60,61,Golduck,Water,,80,82,78,95,80,85,1,False
61,62,Mankey,Fighting,,40,80,35,35,45,70,1,False
62,63,,Fighting,,65,105,60,60,70,95,1,False
63,64,Growlithe,Fire,,55,70,45,70,50,60,1,False


<p>
The missing Pokemon follows Mankey in our dataset and has the same type as Mankey. Based on these observations, our missing Pokemon and Mankey might be related.

In order to identify our missing Pokemon, let's make use of the <a href="https://pokeapi.co/">PokeAPI</a>. The PokeAPI is a <a href="https://en.wikipedia.org/wiki/Representational_state_transfer">RESTful API</a> containing data about all the Pokemon games. Note that the dataset's # column does not line up with the ids used by the PokeAPI (Mega forms such as Mega Venusaur take up rows of their own), so we look Mankey up by name and then step to the next id. If we identify the Pokemon after Mankey in the PokeAPI, we should be able to find our missing Pokemon.
</p>

In [11]:
# Fetch a pokemon's data from the PokeAPI by name or id
def pokeAPI(s):
    r = requests.get(f"https://pokeapi.co/api/v2/pokemon/{s}", timeout=10)
    r.raise_for_status()  # fail loudly on an unknown name/id
    return r.json()

In [12]:
# Check the pokemon after Mankey in the pokeAPI
res = pokeAPI('mankey')
res = pokeAPI(res['id'] + 1)
display(res['name'], res['types'], res['stats'])

'primeape'

[{'slot': 1,
  'type': {'name': 'fighting', 'url': 'https://pokeapi.co/api/v2/type/2/'}}]

[{'base_stat': 65,
  'effort': 0,
  'stat': {'name': 'hp', 'url': 'https://pokeapi.co/api/v2/stat/1/'}},
 {'base_stat': 105,
  'effort': 2,
  'stat': {'name': 'attack', 'url': 'https://pokeapi.co/api/v2/stat/2/'}},
 {'base_stat': 60,
  'effort': 0,
  'stat': {'name': 'defense', 'url': 'https://pokeapi.co/api/v2/stat/3/'}},
 {'base_stat': 60,
  'effort': 0,
  'stat': {'name': 'special-attack',
   'url': 'https://pokeapi.co/api/v2/stat/4/'}},
 {'base_stat': 70,
  'effort': 0,
  'stat': {'name': 'special-defense',
   'url': 'https://pokeapi.co/api/v2/stat/5/'}},
 {'base_stat': 95,
  'effort': 0,
  'stat': {'name': 'speed', 'url': 'https://pokeapi.co/api/v2/stat/6/'}}]

In [22]:
# Clean stats for readability
data = {stat['stat']['name'] : stat['base_stat'] for stat in res['stats']}
data['type'] = res['types'][0]['type']['name']
df = pd.DataFrame(data, [res['name']])
display("pokeAPI Data", df)
display("Pokemon Data", missing.iloc[:,1:10])

'pokeAPI Data'

Unnamed: 0,hp,attack,defense,special-attack,special-defense,speed,type
primeape,65,105,60,60,70,95,fighting


'Pokemon Data'

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
62,,Fighting,,65,105,60,60,70,95
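
The two tables agree by eye, and the same comparison can be made programmatically. The sketch below is one way to do it; `STAT_COLUMNS` and `stats_match` are hypothetical helpers, and the values are copied from the two tables above rather than fetched, so it runs without network access.

```python
import pandas as pd

# Mapping from PokeAPI stat names to the dataset's column names
STAT_COLUMNS = {
    "hp": "HP",
    "attack": "Attack",
    "defense": "Defense",
    "special-attack": "Sp. Atk",
    "special-defense": "Sp. Def",
    "speed": "Speed",
}

def stats_match(api_stats, row):
    """True if every PokeAPI base stat equals the matching dataset column."""
    return all(row[col] == api_stats[name] for name, col in STAT_COLUMNS.items())

# Values copied from the two tables above
api_stats = {"hp": 65, "attack": 105, "defense": 60,
             "special-attack": 60, "special-defense": 70, "speed": 95}
row = pd.Series({"HP": 65, "Attack": 105, "Defense": 60,
                 "Sp. Atk": 60, "Sp. Def": 70, "Speed": 95})

print(stats_match(api_stats, row))
```

The type is left out of the check because the two sources differ in capitalization ('fighting' vs 'Fighting'); a case-insensitive comparison would be needed there.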


In [15]:
# Update the data (only the row we identified above)
pokemon.loc[miss_id, 'Name'] = 'Primeape'
pokemon[miss_id - 2 : miss_id + 2]

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
60,61,Golduck,Water,,80,82,78,95,80,85,1,False
61,62,Mankey,Fighting,,40,80,35,35,45,70,1,False
62,63,Primeape,Fighting,,65,105,60,60,70,95,1,False
63,64,Growlithe,Fire,,55,70,45,70,50,60,1,False


### Continuity between Combats and Pokemon

<p>
Our next step is to ensure that no data is missing from the combats dataset. We also need to ensure that every Pokemon in our Pokemon table is present in the combats table. We can check this by comparing the number of unique Pokemon with the number of unique winners and losers.
</p>
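
Comparing counts shows that something is missing, but not which ids. A set difference answers that directly. The sketch below uses toy frames (`toy_pokemon` and `toy_combats` are hypothetical names) with the same column names as the real tables.

```python
import pandas as pd

toy_pokemon = pd.DataFrame({"#": [1, 2, 3, 4]})
toy_combats = pd.DataFrame({
    "First_pokemon":  [1, 2, 1],
    "Second_pokemon": [2, 3, 3],
    "Winner":         [1, 3, 3],
})

# Ids that appear anywhere as a combatant
combatants = set(toy_combats["First_pokemon"]) | set(toy_combats["Second_pokemon"])

# Ids listed in the Pokemon table that never fought
never_fought = sorted(set(toy_pokemon["#"]) - combatants)

# Ids in the combats table with no matching Pokemon row (should be empty)
unknown_ids = sorted(combatants - set(toy_pokemon["#"]))

print(never_fought, unknown_ids)
```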


In [16]:
# Check if we have any missing data (we don't)
combats.isna().sum()

First_pokemon     0
Second_pokemon    0
Winner            0
dtype: int64

In [18]:
# Insert Loser column for simplification
combats['Loser'] = np.where(combats['Winner'] == combats['First_pokemon'], combats['Second_pokemon'], combats['First_pokemon'])

# Get the unique pokemon, winners, and losers
all_pokemon = np.unique(pokemon['#'])
winners = np.unique(combats['Winner'])
losers = np.unique(combats['Loser'])

# Check whether each pokemon has at least one win and one loss
print(f"Total Pokemon: {len(all_pokemon)}")
print(f"Unique Winners: {len(winners)}")
print(f"Unique Losers: {len(losers)}")

Total Pokemon: 800
Unique Winners: 783
Unique Losers: 784


#### A disconnect

<p>
There are 17 Pokemon missing from the winners column and 16 missing from the losers column. It's possible that certain Pokemon have no wins while others have no losses, so let's see whether any of these Pokemon appear as neither winners nor losers. Any that do never competed, so we will drop them from the table, as they provide no insight into our analysis of battle winners and losers.
</p>

In [25]:
# Pokemon that did not win or lose (these pokemon did not compete)
ids = [x for x in all_pokemon if x not in winners and x not in losers]
display(ids, pokemon[pokemon['#'].isin(ids)].head())

[12, 33, 46, 66, 78, 90, 144, 183, 236, 322, 419, 479, 556, 618, 655, 782]

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
11,12,Blastoise,Water,,79,83,100,85,105,78,1,False
32,33,Sandshrew,Ground,,50,75,85,20,30,40,1,False
45,46,Wigglytuff,Normal,Fairy,140,70,45,85,50,45,1,False
65,66,Poliwag,Water,,40,50,40,40,40,90,1,False
77,78,Victreebel,Grass,Poison,80,105,65,100,70,70,1,False


In [26]:
# Let's drop those pokemon, they won't provide insight for our analysis
pokemon = pokemon[~pokemon['#'].isin(ids)]

In [27]:
# Pokemon with no wins
worst_pokemon = [x for x in pokemon['#'] if x in losers and x not in winners][0]
pokemon[pokemon['#'] == worst_pokemon]

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
230,231,Shuckle,Bug,Rock,20,10,230,10,230,5,2,False


In [29]:
# Recompute the unique pokemon, winners, and losers after dropping non-competitors
all_pokemon = np.unique(pokemon['#'])
winners = np.unique(combats['Winner'])
losers = np.unique(combats['Loser'])

# Every remaining pokemon has at least one loss; only Shuckle has no win
print(f"Total Pokemon: {len(all_pokemon)}")
print(f"Unique Winners: {len(winners)}")
print(f"Unique Losers: {len(losers)}")

Total Pokemon: 784
Unique Winners: 783
Unique Losers: 784


## Data Exploration

<p>We now have 784 competitors, exactly one of which had 0 wins (sucks to be a Shuckle). We will now explore the data to try to find any relationships. Is there a clear pattern to which Pokemon will win? Is it a strict game of Rock-Paper-Scissors? Does having a higher speed stat give you an advantage?

We will start by generating the win percentage for each Pokemon.
</p>

In [30]:
# Generate win and loss counts for each pokemon
wins = combats['Winner'].value_counts().sort_index().rename('Wins')
loss = combats['Loser'].value_counts().sort_index().rename('Loss')

# Add 0 to wins for Shuckle
wins[worst_pokemon] = 0
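
Patching in a zero by hand works here because exactly one Pokemon has no wins. A more general alternative, sketched below on toy data, is to reindex the counts against the full id list with `fill_value=0`, which covers any number of zero-win or zero-loss ids. `toy_combats` and `all_ids` are hypothetical names.

```python
import pandas as pd

toy_combats = pd.DataFrame({
    "Winner": [1, 1, 3],
    "Loser":  [2, 3, 2],
})
all_ids = [1, 2, 3]

# value_counts only lists ids that occur, so id 2 (no wins) would be absent;
# reindex adds it back with a count of 0.
wins = toy_combats["Winner"].value_counts().reindex(all_ids, fill_value=0).rename("Wins")
loss = toy_combats["Loser"].value_counts().reindex(all_ids, fill_value=0).rename("Loss")

res = pd.concat([wins, loss], axis=1)
res["win_pct"] = (res["Wins"] / (res["Wins"] + res["Loss"]) * 100).round(1)
print(res)
```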

In [31]:
# Calculate win percentage
res = pd.concat([wins, loss], axis=1)
res['win_loss'] = res['Wins'] / (res['Wins'] + res['Loss'])
res['win_pct']  = (res['win_loss'] * 100).round(1)
res.head()

Unnamed: 0,Wins,Loss,win_loss,win_pct
1,37,96,0.278195,27.8
2,46,75,0.380165,38.0
3,89,43,0.674242,67.4
4,70,55,0.56,56.0
5,55,57,0.491071,49.1


In [32]:
# Join with pokemon table
pokemon = pokemon.join(res, on='#')

In [33]:
# Top 50 Pokemon by win percentage
top50 = pokemon.sort_values(by='win_pct', ascending=False).head(50)
top50[['Name', 'win_pct']].head()

Unnamed: 0,Name,win_pct
154,Mega Aerodactyl,98.4
512,Weavile,97.5
703,Tornadus Therian Forme,96.8
19,Mega Beedrill,96.6
153,Aerodactyl,96.5


In [34]:
# Bottom 50 Pokemon by win percentage
bot50 = pokemon.sort_values(by='win_pct', ascending=True).head(50)
bot50[['Name', 'win_pct']].head()

Unnamed: 0,Name,win_pct
230,Shuckle,0.0
289,Silcoon,2.2
189,Togepi,2.5
638,Solosis,3.1
236,Slugma,3.3


In [35]:
# Type frequency
type_freq = pokemon.groupby(by=['Type 1', 'Type 2'], dropna=False)['Name'].count().rename('freq')
type_freq = type_freq.sort_values(ascending=False)
pd.DataFrame(type_freq).head()

Type 1,Type 2,freq
Normal,,59
Water,,57
Psychic,,38
Grass,,31
Fire,,28


In [36]:
# Top 50 pokemon Type frequency
type_freq = top50.groupby(by=['Type 1', 'Type 2'], dropna=False)['Name'].count().rename('count').reset_index()
type_freq = type_freq.sort_values(by='count', ascending=False).reset_index(drop=True)
type_freq.head()

Unnamed: 0,Type 1,Type 2,count
0,Psychic,,6
1,Normal,,4
2,Electric,,4
3,Dark,,3
4,Rock,Flying,3


In [37]:
# Bottom 50 pokemon Type frequency
type_freq = bot50.groupby(by=['Type 1', 'Type 2'], dropna=False)['Name'].count().rename('count').reset_index()
type_freq = type_freq.sort_values(by='count', ascending=False).reset_index(drop=True)
type_freq.head()

Unnamed: 0,Type 1,Type 2,count
0,Bug,,8
1,Psychic,,6
2,Grass,,4
3,Normal,,4
4,Normal,Fairy,3


### What's going on here?

<p>
We just generated a lot of tables, so let's break them down. We first computed each Pokemon's win rate. It turns out Mega Aerodactyl has the highest win rate at 98.4%.

We now want to see if the top 50 or bottom 50 have any typings in common. The first type table uses the dataset as a whole as a control: the most common typing is pure Normal with 59 Pokemon, followed by pure Water with 57.

The top 50 has pure Psychic as its most frequent typing (6 Pokemon), while the bottom 50 has pure Bug (8 Pokemon). It is hard to conclude much from this, since pure Psychic (6) and pure Normal (4) appear with identical counts in both tables, but there does seem to be a slight difference in typings.
</p>
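
Raw counts are hard to compare because some typings are simply more common. Pure Psychic, for instance, makes up 38 of the 784 Pokemon (about 4.8%) but 6 of 50 (12%) in both the top and the bottom group. One way to account for this is to divide each typing's share in a subset by its share overall. The sketch below does that on made-up toy data (`overall`, `top`, and `lift` are hypothetical names), so the numbers it prints are not results from this dataset.

```python
import pandas as pd

# Toy data: typings for a made-up population and a made-up "top" subset
overall = pd.Series(["Normal"] * 6 + ["Water"] * 6 + ["Psychic"] * 4 + ["Bug"] * 4)
top = pd.Series(["Psychic"] * 2 + ["Normal"] * 1 + ["Water"] * 1)

# Share of each typing in the whole population and in the subset
overall_share = overall.value_counts(normalize=True)
top_share = top.value_counts(normalize=True).reindex(overall_share.index, fill_value=0.0)

# Ratio > 1: the typing is over-represented in the subset relative to the whole
lift = (top_share / overall_share).round(2).sort_values(ascending=False)
print(lift)
```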

<p>
Next we will visualize the data to see how the Pokemon stats relate to each other and to try to solidify any relationships. We will split the data by generation, as it will be useful for our analysis moving forward.
</p>

In [38]:
# Split data by generation
gen = {g:df for g, df in pokemon.groupby(by='Generation')}
gen[1].head()

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Wins,Loss,win_loss,win_pct
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False,37,96,0.278195,27.8
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False,46,75,0.380165,38.0
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False,89,43,0.674242,67.4
3,4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False,70,55,0.56,56.0
4,5,Charmander,Fire,,39,52,43,60,50,65,1,False,55,57,0.491071,49.1


## Data Visualization