In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np



In [None]:
! conda install -y -c r r-base='3.3.2' rpy2

In [None]:
%matplotlib inline
%load_ext rpy2.ipython

In [None]:
%%R
library(reshape2)

In [None]:
%%R
library(ggplot2)

# Basic dataframe exploration

The original dataset:

In [None]:
pokemon = pd.read_csv('Pokemon.csv')

df = pd.DataFrame(pokemon)
df.head()

We flag the pokemon that have a double type (a non-empty Type 2).

In [None]:
df['Double type'] = df['Type 2'].notnull()
df.head()

The combination of Type 1 and Type 2 is stored in a separate column. Pokemon without a Type 2 keep their Type 1 as combination type.

In [None]:
df['Combination type'] = df['Type 1'] + '/' + df['Type 2']
df['Combination type'] = df['Combination type'].fillna(df['Type 1'])
df.head()

The top 10 of the combination types that occur most in the dataset, followed by the number of different combination types.

In [None]:
df['Combination type'].value_counts().head(10)

In [None]:
df['Combination type'].value_counts().count()

The number of single types (True) and double types (False)

In [None]:
df['Type 2'].isnull().value_counts()

We grouped the data by Generation. You can see that the first generation has the most pokemon, but most generations have around 160 pokemon.

In [None]:
df_stats = df.groupby('Generation').mean(numeric_only=True).iloc[:,1:8]
df_stats

In [None]:
df_nr = df.groupby('Generation').count()['#']
df_nr

# Correlation check

We searched for correlations between the statistics. No pair of individual statistics is strongly correlated. The total correlates best with everything, which is expected since it is the sum of the other statistics. The other higher correlations are between defense and special defense, and between special attack and attack.

In [None]:
df_corr = df.iloc[:,2:-4]

In [None]:
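# Type 1 and Type 2 are text columns, so only the numeric columns are correlated
correlation = df_corr.select_dtypes(include='number').corr()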
correlation

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))

sns.heatmap(correlation,
            xticklabels=correlation.columns.values,
            yticklabels=correlation.columns.values,
            annot=True, linewidths=.5, ax=ax)
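
The reading of the heatmap can be backed up by listing the variable pairs in order of correlation. The cell below is a minimal sketch on a made-up table (`toy` and its values are assumptions, not taken from the dataset). The same steps should apply to `correlation` above.

```python
import numpy as np
import pandas as pd

# Toy stats table; the values are made up for illustration only.
toy = pd.DataFrame({
    'HP':      [45, 60, 80, 39, 58, 78],
    'Attack':  [49, 62, 82, 52, 64, 84],
    'Defense': [49, 63, 83, 43, 58, 78],
    'Speed':   [45, 60, 80, 65, 80, 100],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs)
```

Each row of `pairs` is indexed by the two column names, so the strongest and weakest relations can be read off without scanning the matrix.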

# Some figures

Below we show a boxplot for each relevant feature.

In [None]:
sns.boxplot(data=df.drop(['#', 'Total', 'Generation', 'Legendary', 'Double type'], axis=1), orient='h');

The violin plot below shows that legendary pokemon have a consistently higher HP than normal pokemon. Additionally, the spread is lower. A plausible explanation is that legendary pokemon are usually of a high level and are therefore less likely to show major differences in feature scores.

In [None]:
type1 = pokemon['Type 1'].value_counts()
type2 = pokemon['Type 2'].value_counts()
types = pd.concat([type1,type2], axis=1)
plt.figure(figsize=(14, 6))
plt.ylabel('HP')
# alternatives: boxplot or swarmplot
sns.violinplot(data=pokemon, x='Type 1', y='HP', hue='Legendary');

The difference between legendary and normal pokemon becomes clearer when looking at the total score. This aggregate feature shows that legendary pokemon have consistently higher aggregate scores than normal pokemon, which is to be expected.

In [None]:
plt.figure(figsize=(14, 6))
# alternatives: boxplot, swarmplot or violinplot
sns.violinplot(data=pokemon, x='Type 1', y='Total', hue='Legendary');

Next a pie chart is given, presenting the distribution of main pokemon types (Type 1). You can see that Water and Normal are the most common main types.

In [None]:
plt.figure(figsize=(8, 8))
# Create a list of colors (from iWantHue)
colors = ["#E13F29", "#D69A80", "#D63B59", "#AE5552", "#CB5C3B", "#EB8076", "#96624E"]
plt.rcParams['font.size'] = 9.0
plt.pie(x=type1,
        labels=type1.index,
        # with no shadows
        shadow=False,
        # with colors (cycled when there are more slices than colors)
        colors=colors,
        # with each slice offset slightly more than the previous one
        explode=np.arange(len(type1.index)) * 0.012,
        # with the start angle at 90 degrees
        startangle=90,
        # with the share of each slice listed as a percentage
        autopct='%1.1f%%');

# Introducing a scoring algorithm

Next we want to construct a measure to score each pokemon. The score should indicate the 'strength' of a pokemon, based on a comparison of this pokemon with all the other pokemon in the dataset. To get this score we constructed a comparison-based algorithm that runs through the whole dataset for each pokemon. Each comparison basically represents a battle between two pokemon. We assume optimal behaviour of the pokemon, in the sense that the dominant pokemon type (that is, the type that can do the most damage to the other pokemon) is chosen, and a score is assigned based on the defense and attack tables below.

The scoring is as follows:<br>
Weak to/Super effective against = 2<br>
Resist/Not very effective against = 0.5<br>
Immune to/Does no damage to = 0<br>
Any other type combination = 1<br>

For instance, if a fire type pokemon fights a rock/grass pokemon, the rock type will be dominant over the grass type since fire is weak against rock, which gives a higher score than when compared to grass.
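
The dominant-type rule can be sketched in a few lines. This is a simplified illustration of the rule described above, not a reproduction of the scoring cells further down. `DEFENSE_CHART`, `multiplier` and `dominant_score` are hypothetical names, and the chart is a hand-written excerpt for Fire only.

```python
# Simplified sketch of the "dominant type" rule from the text.
# DEFENSE_CHART is a hypothetical, hand-written excerpt for Fire only;
# the notebook reads the full chart from 'Pokemon type chart defense.csv'.
DEFENSE_CHART = {
    'Fire': {
        'weak': {'Water', 'Ground', 'Rock'},
        'resist': {'Fire', 'Grass', 'Ice', 'Bug', 'Steel', 'Fairy'},
        'immune': set(),
    },
}


def multiplier(defender, attacker):
    """Score for one attacking type against one defending type."""
    entry = DEFENSE_CHART[defender]
    if attacker in entry['weak']:
        return 2
    if attacker in entry['resist']:
        return 0.5
    if attacker in entry['immune']:
        return 0
    return 1


def dominant_score(defender, opponent_combination):
    """Take the opponent type that scores highest against the defender."""
    return max(multiplier(defender, t) for t in opponent_combination.split('/'))


print(dominant_score('Fire', 'Rock/Grass'))  # 2, Rock dominates Grass
print(dominant_score('Fire', 'Grass'))       # 0.5
```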

In the end, for each pokemon, all the comparison-based values are summed to give an attack index and a defense index. Note that the score is based on the combination type, so pokemon with the same combination type initially have the same score.

To calculate the strength index we subtract the defense index from the attack index, add 200 to make all values positive, multiply that number by the 'Total' feature score from the original dataset, and divide the result by 100.
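
As a worked example of this formula, including the division by 100 used in the formula cell of this notebook, consider two made-up pokemon. The index values and totals in `toy` are assumptions chosen for easy arithmetic, not values from the dataset.

```python
import pandas as pd

# Hypothetical index values and totals, for illustration only.
toy = pd.DataFrame({
    'Name': ['A', 'B'],
    'Attack index': [1000.0, 900.0],
    'Defense index': [900.0, 1000.0],
    'Total': [500, 300],
})

# Same expression as the notebook's strength index formula
toy['Strength index'] = (toy['Attack index'] - toy['Defense index'] + 200) * toy['Total'] / 100
print(toy[['Name', 'Strength index']])
```

Pokemon A scores (1000 - 900 + 200) * 500 / 100 = 1500, while pokemon B scores (900 - 1000 + 200) * 300 / 100 = 300. The offset of 200 keeps the first factor positive even when the defense index exceeds the attack index.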

All the steps are kept in the cells below as string literals or comments instead of executable code, since computing these scores can take a lot of time. The result was exported to an additional csv file named 'Pokemon new dataset', which is used from now on instead of the original dataset.

In [None]:
defensefile = pd.read_csv('Pokemon type chart defense.csv', delimiter=';')
defensetypes = pd.DataFrame(defensefile)
defensetypes = defensetypes.set_index('Type')
defensetypes

In [None]:
attackfile = pd.read_csv('Pokemon type chart attack.csv', delimiter=';')
attacktypes = pd.DataFrame(attackfile)
attacktypes = attacktypes.dropna().set_index('Type')
attacktypes

In [None]:
# Columns to be filled with some kind of measurement

In [None]:
#df['Attack index'] = 0
#df['Defense index'] = 0
#df['Strength index'] = 0

In [None]:
# Calculating the measurements. Finished file is imported in 'Pokemon new dataset.ipynb'

In [None]:
'''
attackcache = {}
for i in range(len(df)):
    if df.iloc[i,14] in attackcache:
        df.iloc[i,16] = attackcache[df.iloc[i,14]]
        continue
    checklist =  len(df)*[0]
    for j in range(2,4):
        if type(df.iloc[i,j]) != float:
            for g in range(len(df)):
                if checklist[g] < 2:
                    count = 0
                    for h,item in enumerate(df.iloc[g,14].split(sep='/')):
                        if type(item) != float:
                            if h == 1:
                                    if count == 1:
                                        if item in attacktypes.loc[df.iloc[i,j],'Super-effective against (2x)']:
                                            count += 1
                                    elif count == 0.5:
                                        if item in attacktypes.loc[df.iloc[i,j],'Super-effective against (2x)']:
                                            count += 1.5
                                        elif item in attacktypes.loc[df.iloc[i,j],'Not very effective against (1/2x)']:
                                            break
                                        elif item not in attacktypes.loc[df.iloc[i,j],'Does no damage to']:
                                            count += 0.5
                                    else:
                                        if item in attacktypes.loc[df.iloc[i,j],'Super-effective against (2x)']:
                                            count += 2
                                        elif item in attacktypes.loc[df.iloc[i,j],'Not very effective against (1/2x)']:
                                            count += 0.5
                                        elif item not in attacktypes.loc[df.iloc[i,j],'Does no damage to']:
                                            count += 1
                            else:
                                    if item in attacktypes.loc[df.iloc[i,j],'Super-effective against (2x)']:
                                        count += 2
                                        break
                                    elif item in attacktypes.loc[df.iloc[i,j],'Not very effective against (1/2x)']:
                                        count += 0.5
                                    elif item not in attacktypes.loc[df.iloc[i,j],'Does no damage to']:
                                        count += 1
                    if count > checklist[g]:
                        checklist[g] = count
            
    df.iloc[i,16] = sum(checklist)
    attackcache[df.iloc[i,14]] = sum(checklist)
'''

In [None]:
'''
defensecache = {}
for i in range(len(df)):
    if df.iloc[i,14] in defensecache:
        df.iloc[i,17] = defensecache[df.iloc[i,14]]
        continue
    checklist =  len(df)*[0]
    for j in range(2,4):
        if type(df.iloc[i,j]) != float:
            for g in range(len(df)):
                if checklist[g] < 2:
                    count = 0
                    for h,item in enumerate(df.iloc[g,14].split(sep='/')):
                        if type(item) != float:
                            if h == 1:
                                    if count == 1:
                                        if item in defensetypes.loc[df.iloc[i,j],'Weak to (2x)']:
                                            count += 1
                                    elif count == 0.5:
                                        if item in defensetypes.loc[df.iloc[i,j],'Weak to (2x)']:
                                            count += 1.5
                                        elif item in defensetypes.loc[df.iloc[i,j],'Resist (1/2x)']:
                                            break
                                        elif item not in defensetypes.loc[df.iloc[i,j],'Immune to']:
                                            count += 0.5
                                    else:
                                        if item in defensetypes.loc[df.iloc[i,j],'Weak to (2x)']:
                                            count += 2
                                        elif item in defensetypes.loc[df.iloc[i,j],'Resist (1/2x)']:
                                            count += 0.5
                                        elif item not in defensetypes.loc[df.iloc[i,j],'Immune to']:
                                            count += 1
                            else:
                                    if item in defensetypes.loc[df.iloc[i,j],'Weak to (2x)']:
                                        count += 2
                                        break
                                    elif item in defensetypes.loc[df.iloc[i,j],'Resist (1/2x)']:
                                        count += 0.5
                                    elif item not in defensetypes.loc[df.iloc[i,j],'Immune to']:
                                        count += 1
                    if count > checklist[g]:
                        checklist[g] = count
            
    df.iloc[i,17] = sum(checklist)
    defensecache[df.iloc[i,14]] = sum(checklist)
'''

In [None]:
# df['Strength index'] = (df['Attack index'] - df['Defense index'] + 200)*df['Total']/100

In [None]:
# df.to_csv('Pokemon new dataset.csv', encoding='utf-8') 

In [None]:
pokemon_new = pd.read_csv('Pokemon new dataset.csv', sep=',')
new_df = pd.DataFrame(pokemon_new)

In [None]:
new_df.drop('Unnamed: 0', axis=1, inplace=True)

In [None]:
new_df.sort_values(by='Strength index', ascending=False)

We read new_df into R separately, since importing it from Python led to complications.

In [None]:
%%R
new_df = read.csv('Pokemon new dataset.csv')
head(new_df)

Here we plot the estimated probability density of the strength index.

In [None]:
f, ax = plt.subplots(figsize=(15,10))
# histplot with kde=True replaces the deprecated distplot
sns.histplot(new_df['Strength index'], kde=True, stat='density', ax=ax)

We group the pokemon by their Type 1 value and sort the groups by mean strength index. Subsequently we plot the strength index density of each group in that order.

In [None]:
new_df[['Type 1', 'Strength index']].groupby('Type 1').mean().sort_values(by='Strength index').index
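
The order computed above can also be attached to the column itself as an ordered categorical, so that it does not have to be copied by hand. The cell below is a minimal sketch on made-up data. `toy` and its values are assumptions. The resulting `order` list plays the same role as the `levels` vector in the R cell that follows.

```python
import pandas as pd

# Made-up strength values, for illustration only.
toy = pd.DataFrame({
    'Type 1': ['Fire', 'Water', 'Grass', 'Fire', 'Water', 'Grass'],
    'Strength index': [900.0, 700.0, 400.0, 1100.0, 500.0, 600.0],
})

# Types ordered by their mean strength index, weakest first
order = (toy.groupby('Type 1')['Strength index']
            .mean()
            .sort_values()
            .index
            .tolist())

# Store the order on the column so that plots and sorts follow it
toy['Type 1'] = pd.Categorical(toy['Type 1'], categories=order, ordered=True)
print(order)  # ['Grass', 'Water', 'Fire']
```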

In [None]:
%%R -w 1000 -h 1000

# Ordering the data, grouped by Type 1, by mean of the strength index
new_df$Type.1 <- factor(new_df$Type.1, levels = c('Grass', 'Normal', 'Rock', 'Psychic', 'Poison', 'Bug', 'Fighting',
       'Dark', 'Steel', 'Water', 'Fire', 'Ghost', 'Ground', 'Dragon', 'Fairy',
       'Ice', 'Flying', 'Electric')) 

# Plotting it in that order
ggplot(new_df, aes(x = Strength.index)) + geom_density(alpha = 0.5) + facet_wrap(~ Type.1)

We split the dataset by generation and, for each type, count the number of times it occurs as a Type 1 and the number of times it occurs as a Type 2. We can perform this operation since the set of pokemon types is the same for Type 1 and Type 2.

In [None]:
%%R -w 1000 -h 1000

dfm <- melt(new_df[,c('Generation','Type.1','Type.2')],id.vars = 1) # Dataframe melting
dfm <- dfm[!apply(dfm, 1, function(x) any(x=="")),] # Removing the empty values
dfm

In [None]:
%%R -w 1000 -h 1500

ggplot(dfm, aes(x=factor(value), fill=factor(variable))) +
    geom_bar(position="dodge") + facet_wrap(~ Generation, ncol=1, scales='free_x') + 
    labs(fill='', x='Types')
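
For readers who prefer to stay in Python, the same reshaping and counting can be sketched with pandas. This is an illustrative equivalent on made-up rows, not the code used for the figure. `toy`, `long_df` and `counts` are hypothetical names, and the pandas column names contain a space ('Type 1') where R uses a dot (`Type.1`).

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    'Generation': [1, 1, 1, 2],
    'Type 1': ['Grass', 'Fire', 'Fire', 'Water'],
    'Type 2': ['Poison', None, 'Flying', None],
})

# Long format: one row per (Generation, Type 1 or Type 2, type name),
# dropping the pokemon that have no Type 2
long_df = (toy.melt(id_vars='Generation', var_name='variable', value_name='value')
              .dropna(subset=['value']))

# Number of occurrences of each type as Type 1 and as Type 2, per generation
counts = (long_df.groupby(['Generation', 'variable', 'value'])
                 .size()
                 .rename('count')
                 .reset_index())
print(counts)
```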

We fit a linear function of Total against the strength index, separately for legendary and non-legendary pokemon, to highlight the differences between the two groups once more. Again, the total score of legendary pokemon is consistently higher than the total score of non-legendary pokemon, which also explains the difference in slope between the two lines. In addition, most of the pokemon appearing in the top ten of the strength index are legendary pokemon.

In [None]:
sns.lmplot(data=new_df, x='Strength index', y='Total', hue='Legendary', height=10)  # 'size' was renamed to 'height' in seaborn 0.9

Here we made a boxplot of the strength index for each Type 1. You can see that some pokemon types are on average stronger than others.

In [None]:
f, ax = plt.subplots(figsize=(15,10))
sns.boxplot(data=new_df, x='Strength index',y='Type 1',ax=ax)

The same was done for Type 2.

In [None]:
f, ax = plt.subplots(figsize=(15,10))
sns.boxplot(data=new_df, x='Strength index',y='Type 2',ax=ax)

Finally, we computed the average strength for all pokemon type combinations. Since the grouping uses both Type 1 and Type 2, pokemon without a Type 2 are left out. You can see that a pokemon with the Ground/Fire combination has the highest average strength score, and that the Rock/Dragon combination gives the lowest score.

In [None]:
grouped_df = new_df[['Type 1','Type 2','Strength index']].groupby(['Type 1','Type 2']).mean().sort_values(by='Strength index',ascending=False)
grouped_df = grouped_df.reset_index()
grouped_df = grouped_df.rename(columns={'Strength index':'Average strength index'})
grouped_df['Average strength index'] = np.around(grouped_df['Average strength index'],2)
grouped_df