<a href="https://colab.research.google.com/github/mblackstock/notebooks/blob/main/notebooks/Pokemon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pokemon Data Analysis

In this project, let's do some data analysis on Pokemon data!

<img src="https://miro.medium.com/max/1400/0*ZLujw1b18CnMFxFa.jpg"
    style="width:400px; float: right; margin: 0 40px 40px 40px;"></img>

We have two data sources. The first is a list of Pokemon and their characteristics. The second is a table of combat results: given two Pokemon that battle, it records which one wins.

Let's pretend there is a company called **Team Rocket** that makes millions of dollars from pokemon battles. As a data consultant, you are given this data set by Team Rocket and asked to come up with some useful insight on how to improve their business. This can be difficult because there is not much direction given for the analysis.

Let's break down the task.

First, we want to understand the data.  Since the company makes money from battles, we want to direct our efforts toward finding the best pokemon.

It's good to start simple, then dive deeper into the data.

Once we've done some analysis, we should relate it back to the business and suggest how the company can improve things.

The first thing we need to do is load the data and understand it better.
We may then need to clean the data: fill in any missing values, delete duplicates, and fix any other problems.

References
* https://www.kaggle.com/rounakbanik/pokemon
* https://www.kaggle.com/mmetter/pokemon-data-analysis-tutorial
* https://www.kaggle.com/rtatman/which-pokemon-win-the-most/notebook
* https://towardsdatascience.com/exploratory-analysis-of-pokemons-using-r-8600229346fb


# Load Pokemon Data

First, let's make sure we have access to our dataset. The following two lines change the working directory and then *clone* the GitHub repository containing our data to our local runtime in the cloud. If the `datasets` directory is already there, you don't need to run this again and can comment out the lines.

In [5]:
%cd ..
!git clone https://github.com/mblackstock/datasets.git

/Users/mike/dev/notebooks
fatal: destination path 'datasets' already exists and is not an empty directory.
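
The "only clone if the directory is missing" rule can also be checked in code, which avoids the `fatal` message above on a re-run. The following is a minimal sketch and not part of the original notebook: the helper name `clone_if_missing` is hypothetical, and it assumes `git` is installed on the runtime.

```python
import os
import subprocess


def clone_if_missing(repo_url, target_dir="datasets"):
    """Clone repo_url into target_dir unless that directory already exists.

    Hypothetical helper for illustration; returns True if a clone was run.
    """
    if os.path.isdir(target_dir):
        return False  # directory already present, so skip the clone
    # check=True raises an error if git exits with a non-zero status
    subprocess.run(["git", "clone", repo_url, target_dir], check=True)
    return True
```

Calling `clone_if_missing("https://github.com/mblackstock/datasets.git")` in place of the `!git clone` line would then be safe to run more than once.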


First, we'll import the **pandas** library into our notebook so we can use it in our code. Pandas is the key library we use for file input and output and data processing. Much of this course will be about becoming familiar with using Python with Pandas.

In [3]:
import pandas as pd

Next, let's load the data containing info about the different pokemon and take a quick look at it.

In [6]:
pokemon = pd.read_csv("datasets/pokemon/pokemon.csv")
pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,5,Charmander,Fire,,39,52,43,60,50,65,1,False


The `read_csv()` function loads a CSV file into what pandas calls a `DataFrame` object.

A DataFrame is a pandas data structure that represents a table. It contains individual entries, each of which corresponds to one row and one column.

The `head()` method gives us the first 5 rows of a `DataFrame` object.  You can specify the number of rows, or use the `tail()` method to get the last rows.
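
As a small illustration of passing a row count to `head()` and `tail()`, here is a sketch on a tiny stand-in table. The `sample` DataFrame is built by hand from the first rows shown above; it is not loaded from the CSV file.

```python
import pandas as pd

# Tiny stand-in table copied from the first rows of the output above
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Mega Venusaur", "Charmander"],
    "HP": [45, 60, 80, 80, 39],
})

first_two = sample.head(2)  # first 2 rows instead of the default 5
last_two = sample.tail(2)   # last 2 rows

print(first_two)
print(last_two)
```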

To get to know DataFrames a bit better, let's get the name of the 3rd pokemon. We specify the column first, and then the row.

In [7]:
pokemon['Name'][2]

'Venusaur'

We can get whole rows using `loc` to specify the row index. In this case the index values are integers, but as we'll show later, they don't need to be.

In [8]:
print(pokemon.loc[2])
print('---------------')
print(pokemon.loc[[2,3]])

#                    3
Name          Venusaur
Type 1           Grass
Type 2          Poison
HP                  80
Attack              82
Defense             83
Sp. Atk            100
Sp. Def            100
Speed               80
Generation           1
Legendary        False
Name: 2, dtype: object
---------------
   #           Name Type 1  Type 2  HP  Attack  Defense  Sp. Atk  Sp. Def  \
2  3       Venusaur  Grass  Poison  80      82       83      100      100   
3  4  Mega Venusaur  Grass  Poison  80     100      123      122      120   

   Speed  Generation  Legendary  
2     80           1      False  
3     80           1      False  
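
To make the point about non-integer indexes concrete, here is a minimal sketch on a hand-built table (the `sample` and `by_name` names are only for illustration). It uses the `Name` column as the index, so `loc` takes a name as the row label, while `iloc` still selects by position.

```python
import pandas as pd

sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur"],
    "HP": [45, 60, 80],
})

# Use the Name column as the row labels instead of 0, 1, 2
by_name = sample.set_index("Name")

hp_by_label = by_name.loc["Venusaur", "HP"]  # look up by row label and column name
hp_by_position = by_name.iloc[2, 0]          # look up by row and column position

print(hp_by_label, hp_by_position)
```

Both lookups reach the same cell; `loc` follows whatever labels the index holds, and `iloc` ignores the labels entirely.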


Let's get some more information about the DataFrame using the `info()` method. 



In [9]:
pokemon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   Name        799 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      414 non-null    object
 4   HP          800 non-null    int64 
 5   Attack      800 non-null    int64 
 6   Defense     800 non-null    int64 
 7   Sp. Atk     800 non-null    int64 
 8   Sp. Def     800 non-null    int64 
 9   Speed       800 non-null    int64 
 10  Generation  800 non-null    int64 
 11  Legendary   800 non-null    bool  
dtypes: bool(1), int64(8), object(3)
memory usage: 69.7+ KB


Using the `info()` method, we can get lots of interesting and important information about our table (DataFrame).

It tells us about the *index* of our DataFrame.  The index is how we identify the rows in the table.  Here it is a range of values from 0 to 799.

It tells us about all of the columns in each row.  This includes the column number, the name of the column, how many non-null values are in the column, and the type of each column.

It's interesting to note that the Name column has only 799 non-null names, meaning there is a pokemon in our table that has a null name. This could be a problem!



# Cleaning Pokemon Data

Now that we have our pokemon data loaded and have an idea of what is in it, let's see if there is any data cleaning we need to do.

First, we'll make sure we like all of the column names we have. Then, we'll look for any missing data fields.

The first column name is `#`, which could be a problem since this is the same character used for Python comments. Let's change it to something that is more consistent with the other column names:

In [10]:
pokemon.rename(columns={"#":"Number"}, inplace=True)
pokemon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Number      800 non-null    int64 
 1   Name        799 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      414 non-null    object
 4   HP          800 non-null    int64 
 5   Attack      800 non-null    int64 
 6   Defense     800 non-null    int64 
 7   Sp. Atk     800 non-null    int64 
 8   Sp. Def     800 non-null    int64 
 9   Speed       800 non-null    int64 
 10  Generation  800 non-null    int64 
 11  Legendary   800 non-null    bool  
dtypes: bool(1), int64(8), object(3)
memory usage: 69.7+ KB


The rename method renames the columns as specified.  If you say `inplace=False` the original DataFrame is not modified, rather a new one is created and returned with the changed columns.  We'll just modify the current one.
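
The difference between the two modes can be seen on a small hand-built table. This is a sketch for illustration only; `df` and `renamed` are made-up names, and `inplace=False` is the default in pandas, so it is left out of the call.

```python
import pandas as pd

df = pd.DataFrame({"#": [1, 2], "Name": ["Bulbasaur", "Ivysaur"]})

# Without inplace=True, rename returns a new DataFrame and leaves df alone
renamed = df.rename(columns={"#": "Number"})

print(list(df.columns))       # the original still has the "#" column
print(list(renamed.columns))  # the returned copy has "Number" instead
```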

Next, let's look at how many null entries there are in our DataFrame. This is done by first using the `isnull()` method, which returns a new DataFrame containing a boolean for every entry indicating whether it is null or not.

We then want to get the sum of all True entries, so we use the `sum()` method as shown.

In [11]:
pokemon.isnull().sum()

Number          0
Name            1
Type 1          0
Type 2        386
HP              0
Attack          0
Defense         0
Sp. Atk         0
Sp. Def         0
Speed           0
Generation      0
Legendary       0
dtype: int64
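
To see the two steps separately, here is a minimal sketch on a hand-built three-row table (the values are illustrative, not read from the file). `isnull()` produces the table of booleans, and `sum()` then counts the `True` values in each column, since `True` counts as 1.

```python
import pandas as pd

sample = pd.DataFrame({
    "Name": ["Bulbasaur", None, "Charmander"],
    "Type 2": ["Poison", None, None],
})

mask = sample.isnull()    # same shape as sample, True where a value is missing
null_counts = mask.sum()  # per-column count of True values

print(mask)
print(null_counts)
```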

We can see there is one null Name value and 386 null Type 2 values. The null Type 2 values are OK since only some Pokemon have a secondary type. For example, a pokemon that can breathe fire and fly would have a Type 1 of `Fire` and a Type 2 of `Flying`. Pokemon without a secondary type simply have a null there.

There is one pokemon without a name though.  We should see if we can fix that!

First, let's find the row that contains the pokemon with the null name. Here's one way to do that:

In [12]:
pokemon.loc[pokemon['Name'].isnull()]

Unnamed: 0,Number,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
62,63,,Fighting,,65,105,60,60,70,95,1,False
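
The selection above works in two steps: `isnull()` on a single column gives one boolean per row, and passing those booleans to `loc` keeps only the rows marked `True`. A minimal sketch on a hand-built table (illustrative values only):

```python
import pandas as pd

sample = pd.DataFrame({
    "Number": [62, 63, 64],
    "Name": ["Mankey", None, "Growlithe"],
})

missing_name = sample["Name"].isnull()  # boolean Series, one entry per row
rows = sample.loc[missing_name]         # keeps only the rows where the mask is True

print(missing_name.tolist())
print(rows)
```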


Cool, now that we have that, let's fix the table by setting the name. Let's assume that the rows follow the order of the official [National Pokedex Number](https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number) table. We'll look at the pokemon before and after the unnamed one, and assume the missing one sits between them.


In [13]:
print("The one before is "+pokemon['Name'][61])
print("The one after is "+pokemon['Name'][63])

The one before is Mankey
The one after is Growlithe


The one between Mankey and Growlithe is Primeape, so that's the one.  We can fix it like this:

In [57]:
pokemon.loc[62, 'Name'] = 'Primeape'  # one .loc call instead of chained indexing, which triggers SettingWithCopyWarning
pokemon.loc[62]


Number              63
Name          Primeape
Type 1        Fighting
Type 2             NaN
HP                  65
Attack             105
Defense             60
Sp. Atk             60
Sp. Def             70
Speed               95
Generation           1
Legendary        False
Name: 62, dtype: object
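
A note on setting single values: chained indexing such as `df['Name'][1] = ...` is the pattern that pandas flags with a `SettingWithCopyWarning`, because the assignment may land on a temporary copy. Selecting the row and column in one `.loc` call avoids that. A minimal sketch on a hand-built table (illustrative values only):

```python
import pandas as pd

sample = pd.DataFrame({
    "Number": [62, 63, 64],
    "Name": ["Mankey", None, "Growlithe"],
})

# Row label and column name in a single .loc call, so the write goes to sample itself
sample.loc[1, "Name"] = "Primeape"

print(sample)
```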

Cool - the data is fixed.  Now we can start to analyze the combat data!

# Combat DataFrame

Next, let's load our combat data and have a look at it in the same way:

In [25]:
combat = pd.read_csv("datasets/pokemon/combats.csv")
combat.head()

Unnamed: 0,First_pokemon,Second_pokemon,Winner
0,266,298,298
1,702,701,701
2,191,668,668
3,237,683,683
4,151,231,151


In [27]:
combat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   First_pokemon   50000 non-null  int64
 1   Second_pokemon  50000 non-null  int64
 2   Winner          50000 non-null  int64
dtypes: int64(3)
memory usage: 1.1 MB


We can see that the combat DataFrame has 3 columns, all integers.

The first column, `First_pokemon` is the first pokemon in the fight, `Second_pokemon` is the second, and `Winner` is the winner of the battle.

There are 50000 fights recorded - cool!  The index into the fights is a range of integers like before.

We can also get a quick look at the *shape* of our two tables like this:

In [28]:
print("Dimensions of Pokemon: " + str(pokemon.shape))
print("Dimensions of Combat: " + str(combat.shape))

Dimensions of Pokemon: (800, 12)
Dimensions of Combat: (50000, 3)


In [5]:
# count how many battles each pokemon won;
# value_counts() returns the counts sorted from most to fewest wins
total_Wins = combat.Winner.value_counts()
total_Wins


163    152
438    136
154    136
428    134
314    133
      ... 
189      5
639      4
237      4
190      3
290      3
Name: Winner, Length: 783, dtype: int64

In [6]:
# get the number of wins for each pokemon
numberOfWins = combat.groupby('Winner').count()
numberOfWins

Winner,First_pokemon,Second_pokemon
1,37,37
2,46,46
3,89,89
4,70,70
5,55,55
...,...,...
796,39,39
797,116,116
798,60,60
799,89,89
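
Counting rows per `Winner` with `groupby` gives the same numbers as `value_counts()`; only the ordering and the shape of the result differ. A minimal sketch on a made-up set of four battles (the IDs are not from the data set):

```python
import pandas as pd

fights = pd.DataFrame({
    "First_pokemon": [1, 2, 3, 1],
    "Second_pokemon": [2, 3, 1, 3],
    "Winner": [2, 3, 3, 1],
})

wins_a = fights["Winner"].value_counts()                    # sorted by count
wins_b = fights.groupby("Winner")["First_pokemon"].count()  # sorted by Winner ID

# After sorting both by ID, the counts line up exactly
same = wins_a.sort_index().tolist() == wins_b.sort_index().tolist()
print(same)
```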


In [7]:
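# count how many battles each pokemon fought in each position
# note: the names are swapped (countByFirst is grouped by Second_pokemon); only the shapes and IDs are used below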
countByFirst = combat.groupby('Second_pokemon').count()
countBySecond = combat.groupby('First_pokemon').count()
countByFirst



Second_pokemon,First_pokemon,Winner
1,63,63
2,66,66
3,64,64
4,63,63
5,62,62
...,...,...
796,56,56
797,67,67
798,59,59
799,69,69
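
Counts like these are the pieces needed for a win rate: battles fought as first plus battles fought as second gives the total, and wins divided by that total gives the rate. The following is a sketch under stated assumptions, using five made-up battles rather than the real data; the variable names are illustrative.

```python
import pandas as pd

# Made-up battles: pokemon 4 fights once and loses
fights = pd.DataFrame({
    "First_pokemon": [1, 2, 3, 1, 4],
    "Second_pokemon": [2, 3, 1, 3, 1],
    "Winner": [2, 3, 3, 1, 1],
})

n_first = fights["First_pokemon"].value_counts()
n_second = fights["Second_pokemon"].value_counts()

# Add the two counts; fill_value=0 covers IDs that appear in only one position
n_fights = n_first.add(n_second, fill_value=0)

# IDs that never won are missing from value_counts, so reindex them in as 0
n_wins = fights["Winner"].value_counts().reindex(n_fights.index, fill_value=0)

win_pct = n_wins / n_fights
print(win_pct)
```

The `reindex` step matters: without it, a pokemon that never won would come out as `NaN` rather than a win rate of 0.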


In [8]:
print("Looking at the dimensions of our dataframes")
print("Count by first winner shape: " + str(countByFirst.shape))
print("Count by second winner shape: " + str(countBySecond.shape))
print("Total Wins shape : " + str(total_Wins.shape))

Looking at the dimensions of our dataframes
Count by first winner shape: (784, 2)
Count by second winner shape: (784, 2)
Total Wins shape : (783,)


The win counts have one fewer row (783) than the per-position battle counts (784), so one of the pokemon that fought never won a battle. Let's find it.

In [9]:
import numpy as np  # needed for setdiff1d; not imported earlier in this notebook
# IDs that fought but never won; subtract 1 because the row index is Number - 1
find_losing_pokemon = np.setdiff1d(countByFirst.index.values, numberOfWins.index.values) - 1
losing_pokemon = pokemon.iloc[find_losing_pokemon[0]]  # look up the row by position
print(losing_pokemon)

Number            231
Name          Shuckle
Type 1            Bug
Type 2           Rock
HP                 20
Attack             10
Defense           230
Sp. Atk            10
Sp. Def           230
Speed               5
Generation          2
Legendary       False
Name: 230, dtype: object
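
The lookup above relies on `np.setdiff1d`, which returns the sorted values that are in its first array but not in its second. A minimal sketch with made-up IDs shows the set difference and the off-by-one shift from `Number` to row position:

```python
import numpy as np

fought = np.array([1, 2, 3, 4])  # IDs that appear in battles (made-up)
won = np.array([1, 2, 3])        # IDs that appear as a winner (made-up)

never_won = np.setdiff1d(fought, won)  # IDs in fought that are missing from won
row_positions = never_won - 1          # Number starts at 1, row positions start at 0

print(never_won, row_positions)
```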
