In [1]:
import json  # for importing data

## Importing and re-formatting data

First, we need to import the data we have obtained.  
1. match or battle log data: `{p1 pokemon, p2 pokemon, did p1 win?, URL}`  
2. base stats of each pokemon: `{'Id', 'Name', 'Type(s)', 'TotalBS', 'HP', 'Attack', 'Defense', 'Special Attack', 'Special Defense', 'Speed'}`  

<mark>TODO: should p1 pokemon be the pokemon that moves first in the match? It is ambiguous if p1 pokemon is just picked at random by the system.</mark>

In [23]:
import json

# 1. match or battle log data
with open('./data/match_ok.json', 'r') as f:
    matches = json.load(f)
# 2. base stats of each pokemon
with open('./data/baseStats.json', 'r') as f:
    baseStats = json.load(f)

# to see what they look like
print(matches[3])
print(baseStats[3])

{'pokemon1': 'arcanine', 'pokemon2': 'sharpedo', 'pokemon1 wins': 1, 'url': 'https://replay.pokemonshowdown.com/destiny-challengecup1v1-550920.log'}
{'Id': 3, 'Name': 'Mega Venusaur', 'Type(s)': ['GRASS', 'POISON'], 'TotalBS': '625', 'HP': '80', 'Attack': '100', 'Defense': '123', 'Special Attack': '122', 'Special Defense': '120', 'Speed': '80'}



Then we need to construct the features and labels from these two datasets,  
in a format that looks like: `{'p1HP' ,'p1ATK' ,'p1DEF' ,'p1SpATK' ,'p1SpDEF' ,'p1SPD' ,'p2HP' ,'p2ATK' ,'p2DEF' ,'p2SpATK' ,'p2SpDEF' ,'p2SPD' ,'p1wins'}`

In [19]:
data = []
nameNotFound = []
N = len(matches)
for idx, match in enumerate(matches):  # from each match
    # extracting battle info of each match
    p1_pokemon = match['pokemon1']
    p2_pokemon = match['pokemon2']
    p1_wins = match['pokemon1 wins']
    # searching for the base stats of both pokemon by name
    # NOTE: `info` follows the order of baseStats, so info[0] is not guaranteed to be pokemon1
    info = [i for i in baseStats if p1_pokemon == i['Name'].lower() or p2_pokemon == i['Name'].lower()]
    # creating new row format
    if len(info) == 2:  # skip matches where the search does not return exactly two entries
        row = {
            'p1HP': int(info[0]['HP']),
            'p1ATK' : int(info[0]['Attack']),
            'p1DEF' : int(info[0]['Defense']),
            'p1SpATK' : int(info[0]['Special Attack']),
            'p1SpDEF' : int(info[0]['Special Defense']),
            'p1SPD' : int(info[0]['Speed']),
            'p2HP' : int(info[1]['HP']),
            'p2ATK' : int(info[1]['Attack']),
            'p2DEF' : int(info[1]['Defense']),
            'p2SpATK' : int(info[1]['Special Attack']),
            'p2SpDEF' : int(info[1]['Special Defense']),
            'p2SPD' : int(info[1]['Speed']),
            'p1wins': p1_wins
            }   
        data.append(row)  # store in a list
    else: 
        case = {'match index': idx, 'pokemon1': p1_pokemon, 'pokemon2': p2_pokemon, 'URL': match['url']}
        nameNotFound.append(case)
    print(f'{idx+1}/{N} done', end="\r")

print(f'\n{len(data)} rows of data can be processed, \n{len(nameNotFound)} rows of data went missing')

8895/8895 done
8475 rows of data can be processed, 
420 rows of data went missing


The missing rows seem to come from matches where p1_pokemon is the same as p2_pokemon: the name search then returns one entry instead of two.  
<mark>TODO: are repeated pokemons bad data?</mark>
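One possible way to make the lookup independent of the order of `baseStats` is to index the stats by name first. The sketch below is only an illustration, not the code that produced the numbers in this notebook: `build_rows` and `stats_by_name` are hypothetical names, and it assumes every lower-cased `Name` in `baseStats` is unique (a repeated name would silently overwrite the earlier entry). It keeps mirror matches as ordinary rows; whether they should be kept is still the open TODO above.

```python
def build_rows(matches, base_stats):
    """Build one feature row per match, keeping the p1/p2 order of the match log."""
    # Hypothetical index: lower-cased name -> stats entry (assumes unique names).
    stats_by_name = {entry['Name'].lower(): entry for entry in base_stats}
    # (short column suffix, key in the base stats entry)
    stat_keys = [('HP', 'HP'), ('ATK', 'Attack'), ('DEF', 'Defense'),
                 ('SpATK', 'Special Attack'), ('SpDEF', 'Special Defense'),
                 ('SPD', 'Speed')]
    rows, not_found = [], []
    for idx, match in enumerate(matches):
        p1 = stats_by_name.get(match['pokemon1'])
        p2 = stats_by_name.get(match['pokemon2'])
        if p1 is None or p2 is None:
            # At least one name has no base stats entry: log it and move on.
            not_found.append({'match index': idx,
                              'pokemon1': match['pokemon1'],
                              'pokemon2': match['pokemon2'],
                              'URL': match['url']})
            continue
        row = {}
        for prefix, entry in (('p1', p1), ('p2', p2)):
            for short, full in stat_keys:
                row[prefix + short] = int(entry[full])
        row['p1wins'] = match['pokemon1 wins']
        rows.append(row)
    return rows, not_found
```

Calling `build_rows(matches, baseStats)` would return the rows and the list of unmatched cases in the same shape as `data` and `nameNotFound` above, so the `pd.DataFrame(...)` step would not need to change.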

Convert the rows into a pandas DataFrame for better visualisation and summary statistics. 

In [29]:
import pandas as pd

df = pd.DataFrame(data)
df.head()

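| | p1HP | p1ATK | p1DEF | p1SpATK | p1SpDEF | p1SPD | p2HP | p2ATK | p2DEF | p2SpATK | p2SpDEF | p2SPD | p1wins |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 75 | 75 | 65 | 65 | 50 | 100 | 150 | 120 | 120 | 100 | 90 | 1 |
| 1 | 85 | 73 | 70 | 73 | 115 | 67 | 80 | 80 | 90 | 110 | 130 | 110 | 1 |
| 2 | 66 | 41 | 77 | 61 | 87 | 23 | 100 | 120 | 100 | 150 | 120 | 90 | 0 |
| 3 | 90 | 110 | 80 | 100 | 80 | 95 | 70 | 120 | 40 | 95 | 40 | 95 | 1 |
| 4 | 115 | 140 | 130 | 55 | 55 | 40 | 85 | 50 | 95 | 120 | 115 | 80 | 0 |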


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8475 entries, 0 to 8474
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   p1HP     8475 non-null   int64
 1   p1ATK    8475 non-null   int64
 2   p1DEF    8475 non-null   int64
 3   p1SpATK  8475 non-null   int64
 4   p1SpDEF  8475 non-null   int64
 5   p1SPD    8475 non-null   int64
 6   p2HP     8475 non-null   int64
 7   p2ATK    8475 non-null   int64
 8   p2DEF    8475 non-null   int64
 9   p2SpATK  8475 non-null   int64
 10  p2SpDEF  8475 non-null   int64
 11  p2SPD    8475 non-null   int64
 12  p1wins   8475 non-null   int64
dtypes: int64(13)
memory usage: 860.9 KB


In [31]:
df.describe()

| | p1HP | p1ATK | p1DEF | p1SpATK | p1SpDEF | p1SPD | p2HP | p2ATK | p2DEF | p2SpATK | p2SpDEF | p2SPD | p1wins |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8475.0 | 8475.0 | 8475.0 | 8475.0 | 8475.0 | 8475.0 | 8475.0 | 8475.0 | 8475.0 | 8475.0 | 8475.0 | 8475.0 | 8475.0 |
| mean | 82.725074 | 91.734041 | 84.913628 | 86.044956 | 85.775457 | 78.676224 | 83.767788 | 98.15233 | 89.796814 | 89.126136 | 87.443894 | 81.63115 | 0.51528 |
| std | 29.844925 | 29.647326 | 28.294264 | 29.296954 | 22.656316 | 26.470136 | 26.374245 | 30.915781 | 28.455747 | 29.616206 | 26.053244 | 28.020722 | 0.499796 |
| min | 1.0 | 5.0 | 5.0 | 10.0 | 20.0 | 5.0 | 1.0 | 5.0 | 5.0 | 10.0 | 20.0 | 5.0 | 0.0 |
| 25% | 65.0 | 72.0 | 70.0 | 60.0 | 75.0 | 60.0 | 70.0 | 75.0 | 70.0 | 63.0 | 71.0 | 60.0 | 0.0 |
| 50% | 80.0 | 90.0 | 83.0 | 95.0 | 85.0 | 80.0 | 80.0 | 100.0 | 90.0 | 95.0 | 90.0 | 85.0 | 1.0 |
| 75% | 95.0 | 115.0 | 100.0 | 109.0 | 100.0 | 100.0 | 97.0 | 120.0 | 100.0 | 110.0 | 100.0 | 100.0 | 1.0 |
| max | 255.0 | 181.0 | 230.0 | 173.0 | 230.0 | 200.0 | 255.0 | 181.0 | 230.0 | 173.0 | 230.0 | 200.0 | 1.0 |


done
___

## Statistics explanations 
I'm going to skip this part for now and come back to it later.

skipped
___

## Feature engineering
I'm going to skip this part for now and come back to it later.

skipped
___

## $f_m$ Optimisation
This is a measure of how good a particular threshold value is:   
$$ 
\begin{align}
    f_m &= \text{TPR} - m \times \text{FPR} \\
    m &= \frac{1 - \text{prevalence}}{\text{prevalence}} \times \frac{\text{cost of negative}}{\text{cost of positive}} \\
    \text{threshold}_{best} &= \mathrm{argmax}\ f_m
\end{align}
$$
Sometimes the two costs are written in terms of the costs of the four outcomes:  
$$
\begin{align}
    \text{cost of negative} &= \text{cost of False positive} - \text{cost of True negative} \\
    \text{cost of positive} &= \text{cost of False negative} - \text{cost of True positive} \\
    m &= \frac{1 - \text{prevalence}}{\text{prevalence}} \times \frac{C_{FP}-C_{TN}}{C_{FN}-C_{TP}}
\end{align}
$$
$\text{cost of negative}$ and $\text{cost of positive}$ are pre-determined values; usually we have to guess them from the consequences of getting a wrong negative or positive result.  
So, within a set of candidate threshold values, the higher the $f_m$ value, the better the threshold. 
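
To make the formulas above concrete, here is a minimal sketch in plain Python. The function names (`slope_m`, `f_m_curve`, `best_threshold`) are made up for illustration and are not part of the class below. It assumes labels are 0/1, that a sample is predicted positive when its score is greater than or equal to the threshold, and that both classes are present (otherwise TPR or FPR is undefined).

```python
def slope_m(prevalence, cost_fp, cost_fn, cost_tn=0.0, cost_tp=0.0):
    """m = (1 - prevalence) / prevalence * (C_FP - C_TN) / (C_FN - C_TP)."""
    return (1 - prevalence) / prevalence * (cost_fp - cost_tn) / (cost_fn - cost_tp)


def f_m_curve(y_true, scores, thresholds, m):
    """Return f_m = TPR - m * FPR for every candidate threshold."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos  # assumes both classes are present
    curve = []
    for t in thresholds:
        # Assumption: predict positive when score >= threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y != 1)
        curve.append(tp / n_pos - m * fp / n_neg)
    return curve


def best_threshold(thresholds, curve):
    """Pick the threshold with the highest f_m (first one wins on ties)."""
    best_index = max(range(len(curve)), key=curve.__getitem__)
    return thresholds[best_index]


# Toy example with made-up labels and scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.6, 0.8]
thresholds = [0.0, 0.3, 0.5, 0.9]
m = slope_m(prevalence=0.5, cost_fp=1.0, cost_fn=1.0)  # equal costs, balanced classes
curve = f_m_curve(y_true, scores, thresholds, m)
print(curve)                              # [0.0, 0.5, 1.0, 0.0]
print(best_threshold(thresholds, curve))  # 0.5
```

With equal costs and a prevalence of 0.5, $m = 1$ and $f_m$ reduces to $\text{TPR} - \text{FPR}$; a rarer positive class or a more expensive false positive increases $m$ and so penalises false positives more heavily.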

In [None]:
class FmOptimisation:
    """Placeholder for the f_m threshold search (not implemented yet)."""

    def __init__(self):
        pass
        
    
    