
---
# Transformer 1: Attaching hero stats to the available data

**Experiment goal:** Attach each hero's stats to the training dataset. The final dataset should have more than 500 features. The *order of procedures* is given below.

**hero_stats file**
1. Load the hero_stats file
2. Apply the transformation to the hero_stats data

**Match files**
1. Search for all available data in the 'raw_data' folder
2. Concatenate everything into a single dataframe
3. Remove null values
4. Remove duplicated rows
5. Add columns for each hero of each team

**Joining the data**
1. Keep only ranked matches with game_mode = 'game_mode_all_draft'
2. Join the datasets
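
The filtering step in the list above can be sketched as follows. This is a minimal sketch on toy data, not code from the notebook: the helper name `filter_ranked_all_draft` is hypothetical, and the numeric codes (7 for ranked lobbies, 22 for all draft) are an assumption based on OpenDota-style constants that should be checked against the data source.

```python
import pandas as pd

# Assumption: OpenDota-style codes (lobby_type 7 = ranked, game_mode 22 = all draft)
RANKED_LOBBY_TYPE = 7
ALL_DRAFT_GAME_MODE = 22


def filter_ranked_all_draft(df):
    """Hypothetical helper: keep only ranked, all-draft matches."""
    mask = (df['lobby_type'] == RANKED_LOBBY_TYPE) & (df['game_mode'] == ALL_DRAFT_GAME_MODE)
    # reset_index keeps the row labels contiguous after filtering
    return df.loc[mask].reset_index(drop=True)


# Toy data, not taken from the real dataset
toy_matches = pd.DataFrame({
    'match_id': [1, 2, 3, 4],
    'lobby_type': [7, 0, 7, 7],
    'game_mode': [22, 22, 22, 3],
})
filtered = filter_ranked_all_draft(toy_matches)
print(filtered)
```

Resetting the index after the filter keeps later positional operations (such as filling a feature matrix row by row) in step with the row labels.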

**Loading libraries and ``hero_stats`` JSON**

In [1]:
data_path = '../data/raw_data'

import pandas as pd
import numpy as np
from datetime import datetime
import os

import sys
sys.path.append('../utils')
import transformer_utils

## Hero Stats Transformer

In [2]:
hero_cols = [
        'hero_id','is_Melle', 'base_mana_regen', 'base_armor', 'base_attack_min',
       'base_attack_max', 'base_str', 'base_agi', 'base_int', 'str_gain',
       'agi_gain', 'int_gain', 'attack_range', 'projectile_speed',
       'attack_rate', 'move_speed', 'legs', 'turbo_picks',
       'turbo_wins', 'pro_win', 'pro_pick', 'pro_ban', '1_pick', '1_win',
       '2_pick', '2_win', '3_pick', '3_win', '4_pick', '4_win', '5_pick',
       '5_win', '6_pick', '6_win', '7_pick', '7_win', '8_pick', '8_win',
       'null_pick', 'primary_attr_agi', 'primary_attr_int', 'primary_attr_str',
       'Nuker', 'Disabler', 'Initiator', 'Durable', 'Support', 'Jungler',
       'Carry', 'Pusher', 'Escape'
        ]

In [3]:
hero_stats_raw = pd.read_json(data_path+'/hero_stats.json')
hero_stats_df = transformer_utils.hero_stats_tranformer(hero_stats_raw)

# Removing hero name
hero_stats_df = hero_stats_df.loc[:, hero_cols]

hero_stats_df.head()

Unnamed: 0,hero_id,is_Melle,base_mana_regen,base_armor,base_attack_min,base_attack_max,base_str,base_agi,base_int,str_gain,...,primary_attr_str,Nuker,Disabler,Initiator,Durable,Support,Jungler,Carry,Pusher,Escape
0,1,1,0.0,0.0,29,33,23,24,12,1.3,...,0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,2,1,0.0,-1.0,27,31,25,20,18,3.4,...,1,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0
2,3,0,0.0,1.0,35,41,22,22,22,2.6,...,0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
3,4,1,0.0,2.0,35,41,24,22,17,2.7,...,0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
4,5,0,1.0,-1.0,28,34,18,16,16,2.2,...,0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0


---
## Dota 2 Matches - Transformer

**Searching for each .csv file in the 'raw_data' folder**

In [4]:
lst_df = []
for root, dirs, files in os.walk(data_path):
    for filename in files:
        # splitext returns (base name, extension); only the extension is needed
        _, file_extension = os.path.splitext(filename)
        if file_extension == '.csv':
            print('.csv file found:\n')
            print(filename)
            # os.path.join builds a path that works on any operating system
            file_path = os.path.join(root, filename)
            lst_df.append(pd.read_csv(file_path))

.csv file found:

21-04-24 16h14m31s.csv
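
As an alternative to walking the folder by hand, the same search can be written with `pathlib`. The sketch below is self-contained and runs in a temporary folder; the helper name `load_csvs` and the file names are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csvs(folder):
    """Hypothetical helper: read every .csv under `folder` into a list of dataframes."""
    # rglob also descends into sub-folders; sorted makes the order deterministic
    csv_paths = sorted(Path(folder).rglob('*.csv'))
    return [pd.read_csv(path) for path in csv_paths]


# Self-contained demo in a temporary folder (file names are made up)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'match_id': [1, 2]}).to_csv(Path(tmp) / 'a.csv', index=False)
    pd.DataFrame({'match_id': [3]}).to_csv(Path(tmp) / 'b.csv', index=False)
    (Path(tmp) / 'notes.txt').write_text('not a csv')

    frames = load_csvs(tmp)
    # ignore_index avoids repeated row labels when several files are stacked
    combined = pd.concat(frames, ignore_index=True)

print(len(frames), combined.shape)
```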



**Drop NA**

In [5]:
# ignore_index gives the combined dataframe a clean 0..n-1 index
match_df = pd.concat(lst_df, ignore_index=True)

print('Dataframe shape:', match_df.shape)
print('Total nan: \n\n', match_df.isna().sum())

match_df.dropna(inplace=True)
print('\nDataframe shape:', match_df.shape)

Dataframe shape: (46000, 9)
Total nan: 

 Unnamed: 0      0
match_id        0
radiant_win     0
avg_mmr         0
duration        0
lobby_type      0
game_mode       0
radiant_team    0
dire_team       0
dtype: int64

Dataframe shape: (46000, 9)


**Remove duplicated rows**

In [6]:
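# drop_duplicates returns a new dataframe, so the result must be assigned
match_df = match_df.drop_duplicates(subset=['match_id']).reset_index(drop=True)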
print('\nDataframe shape:', match_df.shape)


Dataframe shape: (46000, 9)
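
Note that `DataFrame.drop_duplicates` returns a new dataframe unless `inplace=True` is passed, so the shape only changes when the result is assigned. A minimal sketch on toy data (the values are made up):

```python
import pandas as pd

# Toy data with one repeated match_id
toy = pd.DataFrame({'match_id': [10, 10, 11], 'avg_mmr': [3000, 3000, 3500]})

# Without assignment the original dataframe is left untouched
toy.drop_duplicates(subset=['match_id'])
shape_unassigned = toy.shape

# Assigning the result keeps the deduplicated rows
deduped = toy.drop_duplicates(subset=['match_id']).reset_index(drop=True)

print(shape_unassigned, deduped.shape)
```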


In [7]:
team_id = [121,119,2,13,70]
aux = hero_stats_df.query("hero_id in @team_id") \
                        .loc[:,['hero_id', 'is_Melle', 
                               'base_mana_regen', 'base_armor',
                                'base_attack_min']]
aux

Unnamed: 0,hero_id,is_Melle,base_mana_regen,base_armor,base_attack_min
1,2,1,0.0,-1.0,27
12,13,0,0.0,-3.0,23
68,70,1,0.0,1.0,24
113,119,0,0.0,-1.0,27
115,121,0,0.0,0.0,21


In [8]:
aux.values

array([[  2.,   1.,   0.,  -1.,  27.],
       [ 13.,   0.,   0.,  -3.,  23.],
       [ 70.,   1.,   0.,   1.,  24.],
       [119.,   0.,   0.,  -1.,  27.],
       [121.,   0.,   0.,   0.,  21.]])

In [9]:
aux.values.ravel()

array([  2.,   1.,   0.,  -1.,  27.,  13.,   0.,   0.,  -3.,  23.,  70.,
         1.,   0.,   1.,  24., 119.,   0.,   0.,  -1.,  27., 121.,   0.,
         0.,   0.,  21.])

In [10]:
match_df.head()

Unnamed: 0.1,Unnamed: 0,match_id,radiant_win,avg_mmr,duration,lobby_type,game_mode,radiant_team,dire_team
0,0,5963069008,True,3439,1649,7,22,12111921370,12983806330
1,1,5963069009,True,3774,1848,0,22,1091010812884,12369677371
2,2,5963069012,True,3311,1951,7,22,1298135885,3213537121126
3,3,5963069204,True,3408,995,0,22,835617629,2632516135
4,4,5963069208,False,4621,1818,7,22,1110026442,13542253074


---
## Concatenating each hero's information
1. **``create_features_df``**: creates a dataframe in which the data will be summarized;
2. **``ravel_feature_by_team``**: for each row, looks up the team's heroes in the hero stats dataframe and returns the full information per hero;
3. **``populate_feature_df``**: adds the aggregated result to the final dataframe.
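
The three steps above can be illustrated on a toy example. This is a simplified sketch, not the notebook's implementation: the stat values are made up, teams have two heroes instead of five, and `flatten_team` is a hypothetical name. The last lines show the width arithmetic for the real layout, assuming the 51 entries of `hero_cols` minus `hero_id`.

```python
import numpy as np
import pandas as pd

# Toy hero table with two stat columns (values are made up)
toy_stats = pd.DataFrame({
    'hero_id': [1, 2, 3, 4],
    'base_armor': [0.0, -1.0, 1.0, 2.0],
    'base_str': [23, 25, 22, 24],
})
stat_cols = [c for c in toy_stats.columns if c != 'hero_id']


def flatten_team(team_ids):
    # Select the team's rows, drop hero_id, and flatten hero by hero
    rows = toy_stats[toy_stats['hero_id'].isin(team_ids)]
    return rows[stat_cols].to_numpy(dtype=float).ravel()


radiant = flatten_team([1, 2])  # two heroes per team to keep the demo small
dire = flatten_team([3, 4])

# One match becomes one row: Radiant block followed by Dire block
match_row = np.concatenate([radiant, dire])
print(match_row)

# Width of the real layout: 50 stats per hero, 5 heroes per team, 2 teams
n_features = 50 * 5 * 2
print(n_features)
```

Adding the six match-level columns on top of these per-hero columns is what takes the final dataset past the 500-feature mark stated in the goal.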

In [11]:
def create_features_df(match_df, hero_stats_df):
    # Every hero stat except 'hero_id' (the first column) becomes a feature
    hero_stats_feature_cols = hero_stats_df.columns[1:]

    # All Radiant columns first, then all Dire columns, so the names match
    # the half-and-half layout filled in by populate_feature_df
    feature_cols = []
    for team in ['Radiant', 'Dire']:
        for i in range(5):
            feature_cols += [f'{team}_{i+1}_{col}' for col in hero_stats_feature_cols]

    # One row per match, initialised with zeros
    zero_mtx = np.zeros([match_df.shape[0], len(feature_cols)])
    feature_df = pd.DataFrame(zero_mtx, columns=feature_cols)

    return feature_df

In [12]:
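def ravel_feature_by_team(team_heros_id, hero_stats_df):
    # Select the rows of the team's heroes and drop the 'hero_id' column.
    # Rows come back in hero_stats_df order, not in the order of team_heros_id.
    intermediate_df = hero_stats_df.query("hero_id in @team_heros_id") \
                                .loc[:, hero_stats_df.columns != 'hero_id']

    # Flatten the (heroes x stats) block into a single 1-D row
    ravel_row = intermediate_df.values.ravel()

    return ravel_row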
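
One property of a `query`-based lookup is worth knowing: rows come back in the order they appear in the stats dataframe, not in the order of the ids passed in, and ids missing from the table are silently skipped. If pick order matters, a lookup through the index is one alternative. The sketch below uses a three-hero excerpt of the stats shown earlier; `ravel_by_query` and `ravel_in_pick_order` are hypothetical names, not part of the notebook.

```python
import pandas as pd

toy_stats = pd.DataFrame({
    'hero_id': [2, 13, 70],
    'base_armor': [-1.0, -3.0, 1.0],
    'base_attack_min': [27, 23, 24],
})


def ravel_by_query(team_ids, stats_df):
    # Rows are returned in stats_df order, whatever the order of team_ids
    rows = stats_df.query("hero_id in @team_ids").drop(columns='hero_id')
    return rows.to_numpy(dtype=float).ravel()


def ravel_in_pick_order(team_ids, stats_df):
    # .loc with a list of labels returns rows in the order of the list
    # and raises KeyError if an id is missing from the table
    indexed = stats_df.set_index('hero_id')
    return indexed.loc[team_ids].to_numpy(dtype=float).ravel()


team = [70, 2]
by_query = ravel_by_query(team, toy_stats)
by_pick = ravel_in_pick_order(team, toy_stats)
print(by_query)
print(by_pick)
```

Either ordering can be a valid modelling choice; what matters is that it is applied consistently across all matches.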

In [13]:
def populate_feature_df(match_df, feature_df, hero_stats_df, agg_function='sum'):
    # NOTE: agg_function is not used yet; per-hero stats are stored unaggregated

    # First half of the columns receives Radiant stats, second half Dire stats
    cols_by_team = feature_df.shape[1] // 2

    for i in range(match_df.shape[0]):
        row = match_df.iloc[i, :]

        # Each team is stored as a string of comma-separated hero ids
        radiant_team = [int(h) for h in row['radiant_team'].split(',')]
        dire_team = [int(h) for h in row['dire_team'].split(',')]

        agg_row_radiant = ravel_feature_by_team(radiant_team, hero_stats_df)
        agg_row_dire = ravel_feature_by_team(dire_team, hero_stats_df)

        feature_df.iloc[i, 0:cols_by_team] = agg_row_radiant
        feature_df.iloc[i, cols_by_team:] = agg_row_dire

    return feature_df

In [14]:
feature_df = create_features_df(match_df, hero_stats_df)
feature_df = populate_feature_df(match_df, feature_df, hero_stats_df)
feature_df.head()

Unnamed: 0,Radiant_1_is_Melle,Radiant_1_base_mana_regen,Radiant_1_base_armor,Radiant_1_base_attack_min,Radiant_1_base_attack_max,Radiant_1_base_str,Radiant_1_base_agi,Radiant_1_base_int,Radiant_1_str_gain,Radiant_1_agi_gain,...,Dire_5_primary_attr_str,Dire_5_Nuker,Dire_5_Disabler,Dire_5_Initiator,Dire_5_Durable,Dire_5_Support,Dire_5_Jungler,Dire_5_Carry,Dire_5_Pusher,Dire_5_Escape
0,1.0,0.0,-1.0,27.0,31.0,25.0,20.0,18.0,3.4,2.2,...,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,-2.0,9.0,18.0,22.0,24.0,19.0,3.0,4.3,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,1.0,-1.0,28.0,34.0,18.0,16.0,16.0,2.2,1.6,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.4,-1.0,27.0,32.0,18.0,18.0,22.0,2.2,3.7,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4,1.0,0.0,-1.0,27.0,31.0,25.0,20.0,18.0,3.4,2.2,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


## Including the remaining data


In [15]:
cols_to_include = ['match_id', 'radiant_win', 'avg_mmr',
                   'duration', 'lobby_type', 'game_mode']

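# Column assignment aligns on index labels; resetting the index of match_df
# makes the rows line up by position with feature_df
feature_df[cols_to_include] = match_df[cols_to_include].reset_index(drop=True)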

In [16]:
feature_df.head()

Unnamed: 0,Radiant_1_is_Melle,Radiant_1_base_mana_regen,Radiant_1_base_armor,Radiant_1_base_attack_min,Radiant_1_base_attack_max,Radiant_1_base_str,Radiant_1_base_agi,Radiant_1_base_int,Radiant_1_str_gain,Radiant_1_agi_gain,...,Dire_5_Jungler,Dire_5_Carry,Dire_5_Pusher,Dire_5_Escape,match_id,radiant_win,avg_mmr,duration,lobby_type,game_mode
0,1.0,0.0,-1.0,27.0,31.0,25.0,20.0,18.0,3.4,2.2,...,0.0,1.0,0.0,0.0,5963069008,True,3439,1649,7,22
1,0.0,0.0,-2.0,9.0,18.0,22.0,24.0,19.0,3.0,4.3,...,0.0,0.0,0.0,1.0,5963069009,True,3774,1848,0,22
2,0.0,1.0,-1.0,28.0,34.0,18.0,16.0,16.0,2.2,1.6,...,0.0,1.0,0.0,0.0,5963069012,True,3311,1951,7,22
3,0.0,0.4,-1.0,27.0,32.0,18.0,18.0,22.0,2.2,3.7,...,0.0,1.0,0.0,0.0,5963069204,True,3408,995,0,22
4,1.0,0.0,-1.0,27.0,31.0,25.0,20.0,18.0,3.4,2.2,...,0.0,1.0,0.0,0.0,5963069208,False,4621,1818,7,22
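
Assigning columns from one dataframe to another aligns rows by index label, not by position. The sketch below, on made-up toy data, shows what happens when the labels differ and how resetting the index restores positional alignment.

```python
import pandas as pd

features = pd.DataFrame({'Radiant_1_base_armor': [0.0, -1.0]})  # labels 0 and 1

# Same number of rows, but labels 5 and 6 (for example after rows were dropped)
matches = pd.DataFrame({'match_id': [100, 101]}, index=[5, 6])

misaligned = features.copy()
misaligned['match_id'] = matches['match_id']  # no shared labels, so every value is NaN

aligned = features.copy()
aligned['match_id'] = matches['match_id'].reset_index(drop=True)

print(misaligned)
print(aligned)
```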


**Converting boolean and string values to numeric**

In [17]:
feature_df['radiant_win'] = feature_df['radiant_win'].astype(int)
feature_df = feature_df.apply(pd.to_numeric)
feature_df

Unnamed: 0,Radiant_1_is_Melle,Radiant_1_base_mana_regen,Radiant_1_base_armor,Radiant_1_base_attack_min,Radiant_1_base_attack_max,Radiant_1_base_str,Radiant_1_base_agi,Radiant_1_base_int,Radiant_1_str_gain,Radiant_1_agi_gain,...,Dire_5_Jungler,Dire_5_Carry,Dire_5_Pusher,Dire_5_Escape,match_id,radiant_win,avg_mmr,duration,lobby_type,game_mode
0,1.0,0.0,-1.0,27.0,31.0,25.0,20.0,18.0,3.4,2.2,...,0.0,1.0,0.0,0.0,5963069008,1,3439,1649,7,22
1,0.0,0.0,-2.0,9.0,18.0,22.0,24.0,19.0,3.0,4.3,...,0.0,0.0,0.0,1.0,5963069009,1,3774,1848,0,22
2,0.0,1.0,-1.0,28.0,34.0,18.0,16.0,16.0,2.2,1.6,...,0.0,1.0,0.0,0.0,5963069012,1,3311,1951,7,22
3,0.0,0.4,-1.0,27.0,32.0,18.0,18.0,22.0,2.2,3.7,...,0.0,1.0,0.0,0.0,5963069204,1,3408,995,0,22
4,1.0,0.0,-1.0,27.0,31.0,25.0,20.0,18.0,3.4,2.2,...,0.0,1.0,0.0,0.0,5963069208,0,4621,1818,7,22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45995,0.0,0.0,0.0,29.0,35.0,18.0,18.0,18.0,2.2,1.5,...,0.0,1.0,1.0,0.0,5960541818,0,3414,2507,7,22
45996,0.0,0.0,2.0,33.0,41.0,19.0,11.0,22.0,2.1,1.2,...,0.0,1.0,0.0,0.0,5960541916,1,3349,2156,7,22
45997,0.0,0.0,0.0,29.0,35.0,18.0,18.0,18.0,2.2,1.5,...,0.0,1.0,0.0,0.0,5960542119,0,3448,2534,7,22
45998,0.0,0.3,1.0,19.0,25.0,19.0,20.0,18.0,2.7,3.5,...,0.0,0.0,0.0,0.0,5960542310,1,3702,1650,7,3


## Saving the dataset
**Saving the dataframe to the 'working_data' folder**

In [18]:
working_data_path = '../data/working_data/1_TRA_v2_'
start_file = datetime.now().strftime("%Y-%m-%d")
output_file = working_data_path + start_file + '_working_data.csv'

feature_df.to_csv(output_file, index=False)
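
`to_csv` raises an error when the target folder does not exist, so it can help to create the folder first. The sketch below is one way to do that, assuming the same date-stamped naming scheme; `save_with_date` is a hypothetical helper and the demo writes to a temporary folder rather than the real `working_data` path.

```python
import os
import tempfile
from datetime import datetime

import pandas as pd


def save_with_date(df, folder, prefix='1_TRA_v2_'):
    """Hypothetical helper: write df to <folder>/<prefix><YYYY-MM-DD>_working_data.csv."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it does not exist yet
    date_stamp = datetime.now().strftime('%Y-%m-%d')
    output_file = os.path.join(folder, prefix + date_stamp + '_working_data.csv')
    df.to_csv(output_file, index=False)
    return output_file


# Demo in a temporary folder with toy data
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({'match_id': [1, 2], 'radiant_win': [1, 0]})
    path = save_with_date(toy, os.path.join(tmp, 'working_data'))
    reloaded = pd.read_csv(path)

print(os.path.basename(path))
```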