<img src="https://i.imgur.com/9utPGpa.png">

<center><h1> Determining a Pokemon's Type</h1> <h2>CMSC320 Final Tutorial</h2>
<h4>Group Members: Prathamesh Kotgire & Ayyub Ahmed</h4></center>

<h2>Introduction</h2>
<p>
Pokemon is a very popular game series that has been around since 1996. It is developed by Game Freak, and its main series games are exclusive to Nintendo consoles. It has evolved greatly over the years, and its popularity grew further with the release of the mobile game Pokemon Go. The latest main series Pokemon games, "Pokemon Ultra Sun" and "Pokemon Ultra Moon," sold over 1 million copies in just the first 3 days after their release.
</p>
<p>
The main goal of the Pokemon games is to construct a team of 6 Pokemon with which you fight opponents' teams. A Pokemon is a creature that has many attributes and skills that make it useful for fighting other Pokemon. Some of the more defining characteristics of a Pokemon are its <a href="https://bulbapedia.bulbagarden.net/wiki/Statistic">stats</a>, <a href="http://pokemon.wikia.com/wiki/Types">type</a>, height, weight, <a href="https://bulbapedia.bulbagarden.net/wiki/Catch_rate">catch rate</a>, and gender ratios. A Pokemon's stats determine how strong it will be in a battle. The higher the stat in an area, the stronger the Pokemon is in that area. A Pokemon's type refers to the elemental property it is associated with. A Pokemon's catch rate determines how hard the Pokemon will be to catch in the wild. The lower the value, the harder it is to catch. Although a Pokemon may have many more characteristics, these are the ones that we will be focusing on.
</p>
<p>
Our goal is to see if there is a way to determine what a Pokemon's type is given its other characteristics. Our null hypothesis is that there is no correlation between a Pokemon's type and its other attributes. This guide will be a step-by-step tutorial walking you through the data pipeline. We will be scraping data, tidying it, analyzing it, and training a machine learning algorithm. We hope that people can use this tutorial as a basis for further and more complex analysis of the Pokemon data.
</p>

<h2> Part 1: Data Collection & Data Processing</h2>

<h3> Required Libraries </h3>

You must have the following libraries installed: <br><br>
<a href="https://pandas.pydata.org/about.html">Pandas</a> <br>
<a href="http://www.numpy.org/">Numpy</a> <br>
<a href="https://matplotlib.org/">Matplotlib</a> <br>
<a href="https://seaborn.pydata.org/">Seaborn</a> <br>
<a href="http://docs.python-requests.org/en/master/">Requests</a> <br>
<a href="http://scikit-learn.org/stable/">Sklearn</a> <br>
<a href="http://www.statsmodels.org/stable/index.html">StatsModels</a> <br>
<a href="https://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> <br>

For more information about each library, you can click on its respective link. <br>
Below is the code to import each of these libraries.

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import os.path
import sklearn
import statsmodels.api as sm
from sklearn import *
from sklearn.model_selection import KFold
from bs4 import BeautifulSoup as bs

<h3> Scraping - Part 1</h3>

We will start off by getting the data for each Pokemon's stats. The stat data consists of a Pokemon's total stat value, HP, attack, defense, special attack, special defense, and speed. <a href="https://pokemondb.net/pokedex/all">Pokemon Database</a> conveniently has the stat values for each Pokemon as well as its name and ID.

We will first use Requests to obtain the HTML text from the source: https://pokemondb.net/pokedex/all. Then we will use BeautifulSoup to parse this text more easily. After the text is prettified, we take the information from the HTML tables and insert it into a Pandas dataframe so we can later do our analysis more effectively.

In [2]:
# scrape the data from the website
r = requests.get('https://pokemondb.net/pokedex/all')
soup = bs(r.text, 'html.parser')
prettysoup = soup.prettify()
read = pd.read_html(prettysoup)

# create the dataframe, fixing column names
df= pd.DataFrame(read[0])
df.columns = ["NatID","Name","Type","Total","HP","Attack","Defense","SpAtk","SpDef","Speed"]
df.head()

Unnamed: 0,NatID,Name,Type,Total,HP,Attack,Defense,SpAtk,SpDef,Speed
0,1,Bulbasaur,Grass Poison,318,45,49,49,65,65,45
1,2,Ivysaur,Grass Poison,405,60,62,63,80,80,60
2,3,Venusaur,Grass Poison,525,80,82,83,100,100,80
3,3,Venusaur Mega Venusaur,Grass Poison,625,80,100,123,122,120,80
4,4,Charmander,Fire,309,39,52,43,60,50,65


<h3> Tidying - Part 1</h3>

The data that we have just obtained is poorly structured for analysis. We will fix this by tidying our data. To begin, we will split the Type column into 2 columns named Type1 and Type2. Some Pokemon have multiple forms, which creates several rows with the same National ID. We will not need this extra data, so we will be dropping those rows.

In [3]:
# separate 'Type' into 'Type1' and 'Type2' and readjust
types = pd.DataFrame(df.Type.str.split('  ',1).tolist(), columns = ['Type1','Type2'])
df = pd.concat([df,types],axis = 1)
df = df.drop(['Type'], axis=1)
df = df[["NatID","Name","Type1","Type2","Total","HP","Attack","Defense","SpAtk","SpDef","Speed"]]
df['Type2'] = df['Type2'].fillna('None')

# remove duplicates (other forms of the same pokemon i.e. Mega Evolutions)
is_mega = df['Name'].str.contains("Mega ")
df = df[~is_mega]
df = df.drop_duplicates(subset='NatID', keep="first")

# reset the index after dropping rows
df = df.reset_index(drop=True)

# fixes names of specific cases
idx = 0
for name in df["Name"]:
    if "  " in name:
        df.loc[idx, "Name"] = name.split(" ",1)[0]
    idx+=1

df.head()

Unnamed: 0,NatID,Name,Type1,Type2,Total,HP,Attack,Defense,SpAtk,SpDef,Speed
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80
3,4,Charmander,Fire,,309,39,52,43,60,50,65
4,5,Charmeleon,Fire,,405,58,64,58,80,65,80
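
The type-splitting step above can also be sketched on a small toy frame. This is only an illustration under stated assumptions: the two rows are made up from the output shown above, and the split here is on any run of whitespace rather than on the double space used in our cell.

```python
import pandas as pd

# Toy rows standing in for the scraped table (an assumption, not the real scrape)
toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander"],
    "Type": ["Grass Poison", "Fire"],
})

# expand=True returns one column per piece; n=1 caps the split at two pieces
parts = toy["Type"].str.split(n=1, expand=True)
toy["Type1"] = parts[0]
# Single-type Pokemon have no second piece, so fill the gap with the string "None"
toy["Type2"] = parts[1].fillna("None")
toy = toy.drop(columns="Type")
```

Using `expand=True` avoids building an intermediate list and a second dataframe, which may be easier to read for larger tables.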


There are 7 generations of Pokemon, and each Pokemon is part of one of these generations. Currently, we do not have the generation each Pokemon belongs to, and this information could potentially be useful. We are able to determine a Pokemon's generation based on its National ID, so we will be manually adding that attribute.

In [4]:
# add a column to classify Pokemon by Generation
idx = 0
df['Generation'] = 0
for x in df["NatID"]:   
    if df['NatID'][idx] <= 151:
        df.loc[idx,'Generation'] = 1
    elif df['NatID'][idx] > 151 and df['NatID'][idx] <= 251:
        df.loc[idx,'Generation'] = 2
    elif df['NatID'][idx] > 251 and df['NatID'][idx] <= 386:
        df.loc[idx,'Generation'] = 3
    elif df['NatID'][idx] > 386 and df['NatID'][idx] <= 493:
        df.loc[idx,'Generation'] = 4
    elif df['NatID'][idx] > 493 and df['NatID'][idx] <= 649:
        df.loc[idx,'Generation'] = 5
    elif df['NatID'][idx] > 649 and df['NatID'][idx] <= 721:
        df.loc[idx,'Generation'] = 6
    else:
        df.loc[idx,'Generation'] = 7
    idx+=1
    
df.head()

Unnamed: 0,NatID,Name,Type1,Type2,Total,HP,Attack,Defense,SpAtk,SpDef,Speed,Generation
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1
3,4,Charmander,Fire,,309,39,52,43,60,50,65,1
4,5,Charmeleon,Fire,,405,58,64,58,80,65,80,1
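
The generation lookup above could also be expressed without an explicit loop. The sketch below assumes the same National ID cutoffs as our cell and uses a small made-up list of IDs; `pd.cut` is swapped in here as an alternative, not what the cell above does.

```python
import numpy as np
import pandas as pd

# Upper National ID bound of generations 1-6, copied from the loop above;
# anything above 721 is treated as generation 7
bounds = [0, 151, 251, 386, 493, 649, 721, np.inf]

# Toy IDs chosen to sit on or next to the cutoffs
toy = pd.DataFrame({"NatID": [1, 151, 152, 386, 650, 722]})

# Each ID falls into a half-open bin (lower, upper], labelled with its generation
toy["Generation"] = pd.cut(
    toy["NatID"], bins=bounds, labels=list(range(1, 8))
).astype(int)
```

Keeping the cutoffs in one list means a new generation only needs one extra number rather than another `elif` branch.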


<h3> Scraping - Part 2 </h3>

Our dataframe is missing some useful information such as height, weight, catch rate, and gender ratio. We need to get this missing data by scraping the individual webpage for each Pokemon. The individual page of a Pokemon is http://pokemondb.net/pokedex/name, where name is the name of the Pokemon. We wrote a function "retrieveData(name)" that will get the data for each individual Pokemon based on its name. The code can be seen below.

In [5]:
def retrieveData (name):
    # Modify string for special cases
    if (name == "Nidoran♀"):
        name = "Nidoran-f"
    elif (name == "Nidoran♂"):
        name = "Nidoran-m"
    elif (name == "Farfetch'd"):
        name = "Farfetchd"
    elif (name == "Mr. Mime"):
        name = "Mr-Mime"
    elif (name == "Mime Jr."):
        name = "Mime-Jr"
    elif (name == "Flabébé"):
        name = "Flabebe"
    elif (name == "Type: Null"):
        name = "Type-Null"
    elif (" " in name):
        name = name.replace(" ", "-")
         
    # collect data for each Pokemon from the database
    r = requests.get('https://pokemondb.net/pokedex/' + name)
    soup = bs(r.text, 'html.parser')
    prettysoup = soup.prettify()
    read = pd.read_html(prettysoup)
    
    return read
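
The name handling at the top of retrieveData can be sketched as a lookup table. The helper name `to_slug` and the constant `SPECIAL_SLUGS` are hypothetical and introduced only for illustration; the mappings themselves are the ones from the function above.

```python
# Hypothetical lookup table holding the same special cases as retrieveData
SPECIAL_SLUGS = {
    "Nidoran♀": "Nidoran-f",
    "Nidoran♂": "Nidoran-m",
    "Farfetch'd": "Farfetchd",
    "Mr. Mime": "Mr-Mime",
    "Mime Jr.": "Mime-Jr",
    "Flabébé": "Flabebe",
    "Type: Null": "Type-Null",
}


def to_slug(name):
    """Return the URL fragment used for a Pokemon's individual page."""
    if name in SPECIAL_SLUGS:
        return SPECIAL_SLUGS[name]
    # Any remaining name with spaces just swaps them for hyphens
    return name.replace(" ", "-")
```

With this layout, supporting another irregular name would mean adding one dictionary entry instead of another branch.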

<h3> Tidying - Part 2 </h3>

The new data that we get for each Pokemon will not be in a proper format. For this reason, we will also be tidying this data. We will first create a dataframe for the data from the HTML tables. We then have to fix the header of the table because it is currently a row. Next, we proceed to drop any columns for attributes we do not need. We will reformat the height, weight, and catch rate columns so they are only numbers and not strings. The gender ratio will be split so we get a male percentage and a female percentage. Finally, we will make sure the type of each column is what we want it to be. The code is shown below with comments explaining each section.

In [6]:
# a function to retrieve additional info for each Pokemon from their individual webpages
def getPokeData (name):

    # Call function to retrieve data for pokemon
    read = retrieveData(name)
    
    # Get the different types of data
    df1 = pd.DataFrame(read[0])
    df2 = pd.DataFrame(read[1])
    df3 = pd.DataFrame(read[2])

    # Transpose the data so that its attributes are in columns
    df1 = df1.T
    df2 = df2.T
    df3 = df3.T

    # Fix header for each table
    new_header = df1.iloc[0] 
    df1 = df1[1:] 
    df1.columns = new_header 

    new_header = df2.iloc[0] 
    df2 = df2[1:] 
    df2.columns = new_header 

    new_header = df3.iloc[0] 
    df3 = df3[1:] 
    df3.columns = new_header 

    # Drop Irrelevant columns for table 1
    if "Japanese" in df1.columns:
        df1 = df1.drop(["Japanese"], axis=1)
    df1 = df1.drop(["Type", "Abilities", "Species", "Local №"], axis=1)
    df1.columns = ["NatID", "Height (m)", "Weight (kg)"]

    df2 = df2.drop(["EV yield", "Base EXP", "Growth Rate", "Base Happiness"], axis=1)
    df2.columns = ["Catch Rate"]

    df3 = df3.drop(["Egg cycles", "Egg Groups"], axis=1)

    # Concat the 3 dataframes
    df= pd.concat([df1, df2, df3], axis=1)

    # Get rid of extraneous information
    df["Height (m)"].iloc[0] = df["Height (m)"].iloc[0].split("(")[1].split("m")[0]
    df["Weight (kg)"].iloc[0] = df["Weight (kg)"].iloc[0].split("(")[1].split(" ")[0]
    df["Catch Rate"].iloc[0] = df["Catch Rate"].iloc[0].split(" ")[0]


    # Split Gender Ratio into 2 separate columns for male and female
    if "Genderless" in df['Gender'].iloc[0]:
        df['Gender'].iloc[0] = "0.0% male, 0.0% female"

    df['Male (%)'] = df.Gender.str.split(",",1).tolist()[0][0]
    df['Male (%)'] = df['Male (%)'].str.split("%",1).tolist()[0][0]
    df['Female (%)'] = df.Gender.str.split(",",1).tolist()[0][1]
    df['Female (%)'] = df['Female (%)'].str.split("%",1).tolist()[0][0]
    df.drop(['Gender'], axis=1, inplace=True)

    # Change typing from string to appropriate type for each column
    df['NatID'] = df['NatID'].astype(int) 
    df['Height (m)'] = df['Height (m)'].astype(float) 
    df['Weight (kg)'] = df['Weight (kg)'].astype(float) 
    df['Catch Rate'] = df['Catch Rate'].astype(int) 
    df['Male (%)'] = df['Male (%)'].astype(float) 
    df['Female (%)'] = df['Female (%)'].astype(float) 
    
    return df

Here is an example of using our "getPokeData(name)" function on the pokemon Pikachu.

In [13]:
getPokeData("Pikachu")

Unnamed: 0,NatID,Height (m),Weight (kg),Catch Rate,Male (%),Female (%)
1,25,0.41,6.0,190,50.0,50.0
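
The string clean-up inside getPokeData can be illustrated with regular expressions. This is a sketch under assumptions: the sample strings in the comments are guesses at the page's format inferred from how our function splits them, and the helper names `parse_metric` and `parse_gender` are hypothetical.

```python
import re


def parse_metric(text):
    """Pull the number that follows the opening parenthesis, e.g. '(0.41m)' -> 0.41."""
    match = re.search(r"\(([\d.]+)", text)
    return float(match.group(1))


def parse_gender(text):
    """Return (male %, female %); genderless Pokemon get (0.0, 0.0) as in getPokeData."""
    if "Genderless" in text:
        return 0.0, 0.0
    male, female = re.findall(r"([\d.]+)%", text)
    return float(male), float(female)


# Assumed examples of the raw cell text
height = parse_metric("1′04″ (0.41m)")
weight = parse_metric("13.2 lbs (6.0 kg)")
ratio = parse_gender("87.5% male, 12.5% female")
```

A pattern-based parser fails loudly (with an error on `match.group`) when the page layout changes, which can be easier to debug than a silent wrong split.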


<h3> Creating a CSV </h3>

We now have a function that collects more information for each Pokemon. We need to call it for each pokemon and then add that information to the original dataframe we created in Scraping and Tidying - Part 1. We will first get the extra information for each pokemon and put that into one dataframe. Next, we will join this dataframe with the original dataframe on the NatID column. After doing so, we will be left with one full dataframe. Because it can take up to several minutes to scrape the data for each individual pokemon, we only want to do it once and save the resulting dataframe to a csv file. In the future, we can retrieve the data from our csv file instead of re-scraping it. The code is shown below with comments describing each section.

In [19]:
if not os.path.exists("PokeData_Full.csv"):
    # collect one small dataframe of additional columns per Pokemon
    frames = []

    # scrape the additional data for each Pokemon
    for name in df['Name']:
        try:
            frames.append(getPokeData(name))
        except Exception:
            print("Unable to add " + name)

    # combine the per-Pokemon frames (DataFrame.append was removed in pandas 2.0)
    add_info = pd.concat(frames)

    # merge the new columns of data with the original dataframe
    df= pd.merge(df,add_info, on=['NatID'], how="right")

    # store the dataframe in a '.csv' for ease of access later, rather than scrape data from the website each time
    df.to_csv("PokeData_Full.csv")
else:
    # load the previously stored data
    df = pd.read_csv("PokeData_Full.csv",  encoding='latin-1')
    df.drop("Unnamed: 0", axis = 1, inplace=True)
    
df.head()

Unnamed: 0,NatID,Name,Type1,Type2,Total,HP,Attack,Defense,SpAtk,SpDef,Speed,Generation,Height (m),Weight (kg),Catch Rate,Male (%),Female (%)
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,0.71,6.9,45,87.5,12.5
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,0.99,13.0,45,87.5,12.5
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,2.01,100.0,45,87.5,12.5
3,4,Charmander,Fire,,309,39,52,43,60,50,65,1,0.61,8.5,45,87.5,12.5
4,5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,1.09,19.0,45,87.5,12.5
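
The scrape-once-then-reload pattern above can be sketched in isolation. The helper `load_or_build` and the stand-in `fake_scrape` are hypothetical names, and the data is a two-row toy frame rather than the real scrape.

```python
import os
import tempfile

import pandas as pd


def load_or_build(path, build):
    """Return the cached frame at `path`, calling `build()` only when the file is missing."""
    if os.path.exists(path):
        return pd.read_csv(path)
    frame = build()
    # index=False keeps the row index out of the file,
    # so no "Unnamed: 0" column appears when the file is read back
    frame.to_csv(path, index=False)
    return frame


calls = []


def fake_scrape():
    # Stand-in for the slow scraping loop; records each time it runs
    calls.append(1)
    return pd.DataFrame({"NatID": [1, 2], "Name": ["Bulbasaur", "Ivysaur"]})


cache_path = os.path.join(tempfile.mkdtemp(), "demo_cache.csv")
first = load_or_build(cache_path, fake_scrape)   # runs fake_scrape and writes the file
second = load_or_build(cache_path, fake_scrape)  # reads the file instead
```

Writing with `index=False` is one way to avoid having to drop the "Unnamed: 0" column after loading.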


<h3> Getting ready for Data Analysis </h3>

Currently, we have each of a Pokemon's types in a separate column. If we graphed a Pokemon's attributes against type using only Type1 on the X-axis, secondary types would be left out and the types would not be represented equally. To fix this, we duplicate the data so that each of a Pokemon's types appears in its own row as the primary type. Essentially, we are making it easier to represent both the primary type and the secondary type equally. The code is shown below.

In [20]:
# Make a new dataframe that is a copy of the original dataframe
df2 = df.copy()

# Switch the type columns for df2 so that Type1 is now Type2 and Type2 is now Type1
df2 = df2.rename(columns={'Type1': 'Type2','Type2': 'Type1'})
df2 = df2[["NatID", "Name", "Type1", "Type2", "Total", "HP", "Attack", "Defense", "SpAtk", "SpDef", "Speed", \
           "Generation", "Height (m)", "Weight (kg)", "Catch Rate", "Male (%)", "Female (%)"]]

# Create a new dataframe df3 that is also a copy of the original dataframe
df3 = df.copy()

# Add every row from df2 to df3 as long as Type1 is not "None"
# (pd.concat replaces the row-by-row DataFrame.append, which was removed in pandas 2.0)
df3 = pd.concat([df3, df2[df2["Type1"] != "None"]])
        
# Type 2 is no longer needed, so drop it and then rename Type1 to Type
df3.drop("Type2", axis=1, inplace=True)
df3 = df3.rename(columns={'Type1': 'Type'})

df3.head()

Unnamed: 0,NatID,Name,Type,Total,HP,Attack,Defense,SpAtk,SpDef,Speed,Generation,Height (m),Weight (kg),Catch Rate,Male (%),Female (%)
0,1,Bulbasaur,Grass,318,45,49,49,65,65,45,1,0.71,6.9,45,87.5,12.5
1,2,Ivysaur,Grass,405,60,62,63,80,80,60,1,0.99,13.0,45,87.5,12.5
2,3,Venusaur,Grass,525,80,82,83,100,100,80,1,2.01,100.0,45,87.5,12.5
3,4,Charmander,Fire,309,39,52,43,60,50,65,1,0.61,8.5,45,87.5,12.5
4,5,Charmeleon,Fire,405,58,64,58,80,65,80,1,1.09,19.0,45,87.5,12.5


The final data frame should look like the one shown above. Now that all of the data we will need is collected and formatted, we can proceed to analyze it.
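
The same reshaping, one row per Pokemon per type, can be sketched with `melt` on a toy frame. This is an alternative illustration rather than the approach used in our cell, and the two rows are made-up stand-ins.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander"],
    "Type1": ["Grass", "Fire"],
    "Type2": ["Poison", "None"],
    "HP": [45, 39],
})

# melt stacks Type1 and Type2 into a single "Type" column,
# repeating the identifier columns for each stacked value
long = toy.melt(id_vars=["Name", "HP"], value_vars=["Type1", "Type2"], value_name="Type")

# Drop the placeholder rows of single-type Pokemon and the helper column melt adds
long = long[long["Type"] != "None"].drop(columns="variable")
```

Bulbasaur ends up with two rows (Grass and Poison) while Charmander keeps one (Fire), matching the intent of the duplication step.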

<h2> Part 2: Exploratory Data Analysis </h2>

Let's try to find some patterns in the data. First we'll separate the data into groups based on Type (regardless of whether it's Type 1 or Type 2). Then we'll try to see trends across these groups.

<h2>Part 3: Machine Learning </h2>

The goal of machine learning for our purposes will be to predict the type of a Pokemon given its various other attributes. Our null hypothesis is that there is no significant correlation between a Pokemon's type and its other attributes. We will be training the algorithm with one portion of the data and testing it with the remaining portion. This could be useful because there are a lot of fan-made Pokemon games. Often, developers like to create their own Pokemon when making their own games. These developers could input a set of attributes and then determine what type that Pokemon should be.

We will be using <a href="http://scikit-learn.org/stable/modules/cross_validation.html">K-Fold cross validation</a> and will split the data into 10 groups of roughly equal size. The code is shown below.

In [121]:
# Convert the dataframe to a matrix, taking only the attributes we want
# (.values replaces DataFrame.as_matrix, which was removed in pandas 1.0)
X = df[["Total", "HP", "Attack", "Defense", "SpAtk", "SpDef", "Speed", "Height (m)", "Weight (kg)",\
        "Catch Rate", "Male (%)", "Female (%)"]].values

# Convert the column of types from the original dataframe to another matrix
Y = df[["Type1"]].values

# Create a KFold object with 10 splits
kf = KFold(n_splits=10, shuffle=True)
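
To see what the KFold object produces, here is a small sketch on 25 dummy samples. The array `demo` and the fixed `random_state` are assumptions added only to make the illustration repeatable; they are not part of our pipeline.

```python
import numpy as np
from sklearn.model_selection import KFold

demo = np.arange(25)

# random_state is fixed here only so the demo gives the same folds on every run
kf_demo = KFold(n_splits=10, shuffle=True, random_state=0)

fold_sizes = []
seen = []
for train_idx, test_idx in kf_demo.split(demo):
    # Each iteration yields index arrays: most samples for training, one fold for testing
    fold_sizes.append(len(test_idx))
    seen.extend(test_idx.tolist())
```

With 25 samples and 10 splits, five folds hold 3 test samples and five hold 2, and every sample lands in exactly one test fold.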

<h3> K-Nearest Neighbors Model </h3>

The <a href="http://scikit-learn.org/stable/modules/neighbors.html">K-Nearest Neighbors Model</a> treats each Pokemon as a point in a multi-dimensional space with one axis per attribute. It then predicts a Pokemon's type by seeing which type appears most often among its k nearest neighbors. First we will see which value of k results in the best Zero-One loss value. A <a href="http://scikit-learn.org/stable/modules/generated/sklearn.metrics.zero_one_loss.html">Zero-One Loss function</a> tells us what fraction of the test Pokemon were given the wrong type. A value close to 1 means that the algorithm was not very successful, while a value close to 0 means that it was successful. We want to see what number of neighbors gives us the lowest Zero-One loss value. Below is our code to do this, with comments explaining what each line is doing.

In [174]:
# Start k at 1 and increment it once per fold, so each of the 10 folds tests a different k
k = 1
for train_idx, test_idx in kf.split(X):
    
    # Create the train and test sets for both X and Y
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    
    # Create the CLF object with the appropriate k value, and weight neighbors by distance
    clf = sklearn.neighbors.KNeighborsClassifier(k, weights='distance')
    
    # Fit the model
    clf.fit(X_train, Y_train.ravel())
    
    # Predict the types for the test data
    prediction = clf.predict(X_test)
    
    # Print the Zero-One Loss value for each k-value
    print(str(k) + " neighbors: " + str(sklearn.metrics.zero_one_loss(Y_test.ravel(),prediction)))
    
    # Increment K for next iteration
    k = k + 1

1 neighbors: 0.79012345679
2 neighbors: 0.79012345679
3 neighbors: 0.825
4 neighbors: 0.675
5 neighbors: 0.8
6 neighbors: 0.7375
7 neighbors: 0.7875
8 neighbors: 0.7875
9 neighbors: 0.8
10 neighbors: 0.725
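
Because the loop above pairs each k with a single fold, differences between k values are mixed with differences between folds. One possible refinement, sketched here on synthetic stand-in data (random numbers, not Pokemon data), is to score every k on all 10 folds and compare the averages. The variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 200 points, 4 features, 3 classes
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 4))
y_demo = rng.randint(0, 3, size=200)

# A fixed random_state means every k is scored on the same 10 folds
folds = KFold(n_splits=10, shuffle=True, random_state=0)

mean_loss = {}
for k in range(1, 11):
    clf = KNeighborsClassifier(n_neighbors=k, weights="distance")
    # cross_val_score returns one accuracy per fold; zero-one loss is 1 - accuracy
    accuracy = cross_val_score(clf, X_demo, y_demo, cv=folds)
    mean_loss[k] = 1 - accuracy.mean()

# The k with the lowest average loss across all folds
best_k = min(mean_loss, key=mean_loss.get)
```

Averaging over the same folds for every k should make the comparison less sensitive to which Pokemon happen to fall in a given fold.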


After running the above code repeatedly, we determined that 5 neighbors was consistently giving us the best result. We will now fix k at 5, fit and test the model once for each of the 10 folds, and report the average Zero-One loss value. The code is shown below with comments explaining each line.

In [179]:
# Create a variable to keep track of the total loss so it can be averaged later
total_loss = 0

# For each split, fit the model and test
for train_idx, test_idx in kf.split(X):
    
    # Create the train and test sets for both X and Y
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    
    # Create the CLF object with the appropriate k value, and weight neighbors by distance
    clf = sklearn.neighbors.KNeighborsClassifier(5, weights='distance')
    
    # Fit the model
    clf.fit(X_train, Y_train.ravel())
    
    # Predict the types for the test data
    prediction = clf.predict(X_test)
    
    # Add the Zero-One loss value to the total
    total_loss = total_loss + sklearn.metrics.zero_one_loss(Y_test.ravel(), prediction)
    
# Print the average Zero-One loss value
print("Average Zero-One loss value: " + str(total_loss/10)) 

Average Zero-One loss value: 0.756790123457


We consistently get a value of around 0.75. This means that, based on the attributes we provided, the algorithm assigns the wrong type to roughly three out of every four Pokemon, so it is unable to accurately guess a Pokemon's type.
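
To make the Zero-One loss value concrete, here is a minimal worked example. The helper `fraction_misclassified` is a hypothetical stand-in written with NumPy, and the four labels are made up for illustration.

```python
import numpy as np


def fraction_misclassified(y_true, y_pred):
    """Fraction of predictions that differ from the true labels (the zero-one loss)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))


truth = ["Fire", "Water", "Grass", "Fire"]
guess = ["Fire", "Grass", "Water", "Water"]

loss = fraction_misclassified(truth, guess)  # 3 of the 4 guesses are wrong
accuracy = 1 - loss
```

Here only the first guess is correct, so the loss is 0.75 and the accuracy is 0.25, the same proportion our model reaches on average.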